Skip to content

Protocol-driven test suite rework + cross-check pipeline#512

Open
mmschlk wants to merge 42 commits intomainfrom
claude/plan-testing-suite-rework-FF4ZA
Open

Protocol-driven test suite rework + cross-check pipeline#512
mmschlk wants to merge 42 commits intomainfrom
claude/plan-testing-suite-rework-FF4ZA

Conversation

@mmschlk
Copy link
Copy Markdown
Owner

@mmschlk mmschlk commented Apr 16, 2026

Consolidates and supersedes #511. Contains the full scope of the test-suite rework in three layers.

1. Protocol-driven rewrite (from #511's original commits)

Replaces the 75-file suite (~10k LOC, 321 tests) with 8 protocol-driven files (~1.5k LOC). Design docs at docs/superpowers/plans/2026-04-15-test-rework.md and docs/superpowers/specs/2026-04-15-test-rework-design.md.

  • test_approximators.py — registry + parametrized TestApproximatorProtocol across 20+ approximator/index configs
  • test_explainers.pyTabularExplainer, AgnosticExplainer, ProductKernelExplainer protocols + validation
  • test_tree.pyTreeExplainer across sklearn/xgboost/lightgbm with manual TreeModel tests + segfault regressions
  • test_imputers.py — imputer registry with 4 core imputers
  • test_interaction_values.py — data-structure correctness
  • test_game_theory.pyExactComputer, indices, MoebiusConverter
  • test_plots.py — plot smoke tests
  • test_public_api.py — every concrete public subclass is exported in __all__
  • conftest.py — shared game/model/data fixtures + skip_if_no_* markers
  • pyproject.tomlslow marker + addopts = -m 'not slow' tiering

2. Close coverage gaps (commit fd0282a)

  • Imputers: add GenerativeConditionalImputer to IMPUTER_CONFIGS; slow-gated TestTabPFNImputer
  • Explainers: slow-gated TestTabPFNExplainer
  • Plots: smoke tests for network_plot, stacked_bar_plot, upset_plot, si_graph_plot, sentence_plot, beeswarm_plot, abbreviate_feature_names
  • New test_utils.py — 26 unit tests for shapiq.utils.{sets,modules,datasets,errors}
  • New TestAggregation, TestCore, TestGame in test_game_theory.py
  • New slow-gated test_datasets.py for the three built-in dataset loaders

3. Cross-check pipeline (commit 67cd77f) — correctness layer

Turns protocol contract checks into correctness tests by making independent ground-truth sources agree on the same game. Five test classes in test_cross_checks.py:

Test class Ground-truth edge Indices / methods
TestExactVsSOUM ExactComputer(SOUM)SOUM.exact_values SV, SII, k-SII, STII, FSII, FBII
TestMoebiusConverter ExactComputer("Moebius")MoebiusConverter → target ↔ ExactComputer(target) same set
TestApproximatorAtFullBudget 11 consistent approximators at budget=2**nSOUM.exact_values KernelSHAP, KernelSHAPIQ, InconsistentKernelSHAPIQ, UnbiasedKernelSHAP, RegressionFSII, RegressionFBII, SHAPIQ (SII/k-SII/STII), SVARMIQ, SVARM
TestApproximatorConvergence (slow) error decreases with budget Permutation*, Owen, Stratified
TestTreeExplainerVsExactComputer ExactComputer(TreeSHAPIQXAI)TreeExplainer.explain(x) SV, k-SII on a small decision tree

Tolerance strategy: atol=1e-10 for pairs that should be analytically identical, 1e-8 for larger Moebius-converted games (n=7), 1e-6 for LS / Monte Carlo noise. Sampling-based methods verify monotonic error decrease instead of exactness.

Supporting changes in conftest.py: soum_5 / soum_7 fixtures, GROUND_TRUTH_INDICES constant, and an assert_iv_close helper that aligns InteractionValues by interaction_lookup (skipping empty-interaction asymmetry across pipelines).

Runtime

Suite Before rework After gaps After cross-checks
Default (-m 'not slow') 3–5 min / 321 tests ~23s / 232 tests ~25s / 257 tests
Full (-m '') 3–5 min ~36s / 241 tests ~74s / 277 tests

Test plan

  • uv run pytest tests/shapiq -q — default tier
  • uv run pytest tests/shapiq -m '' -q — full tier
  • uv run pre-commit run --all-files
  • Inject a small bias into one approximator and confirm the matching cross-check fails (correctness tests actually check numerics, not just types)

https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22

mmschlk and others added 17 commits April 15, 2026 12:09
Protocol-driven test suite replacing 75 files with 8, targeting ~1min
default runtime. Covers approximator/explainer/tree/imputer protocols,
tiering strategy, fixture design, and migration approach.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
10 tasks covering: pytest config, conftest, approximator/explainer/tree/
imputer/interaction_values/game_theory/plot/public_api tests, old test
deletion, and final verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Track shared Claude Code settings (settings.json, agents, commands)
while keeping local settings and worktrees gitignored.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces the per-module test files under tests_unit/, tests_integration_tests/,
and tests_deprecation/ with the 8 protocol-driven files added in previous commits.
Shared fixture plugins under tests/shapiq/fixtures/ are preserved because legacy
tests in tests/shapiq_games still consume them.

The new suite runs in ~25s and makes adding new components trivial: append a
config dict to the relevant registry.
Adds a parametrized protocol test for SVR, SVC, and GaussianProcessRegressor
models, checking that explain() returns an InteractionValues object, that
sum(values) matches the regression prediction, and that explain_X handles
batches.

Also adds validation tests for the three documented error paths:
max_order > 1, unsupported model type, and multiclass SVC.

Lifts overall coverage from 60% to 62% and brings explainer/product_kernel/
from 0% to ~85% (game.py remains uncovered as it's a separate Game subclass
not exercised by the explainer path).
Extends the existing 8-file protocol suite with targeted additions — no
redesign. New coverage:

- Imputers: GenerativeConditionalImputer added to IMPUTER_CONFIGS; slow-gated
  TestTabPFNImputer for the Remove-and-Contextualize imputer.
- Explainers: slow-gated TestTabPFNExplainer.
- Plots: smoke tests for network, stacked_bar, upset, si_graph, sentence,
  beeswarm, plus abbreviate_feature_names.
- New test_utils.py: unit tests for powerset / pair_subset_sizes /
  split_subsets_budget / get_explicit_subsets / interaction lookup / coalition
  transforms / count_interactions / safe_isinstance / check_import_module /
  shuffle_data / raise_deprecation_warning.
- game_theory: TestAggregation (aggregate_base_interaction,
  aggregate_to_one_dimension), TestCore (egalitarian_least_core), TestGame
  (Game base-class API: __call__, access_counter, grand/empty coalition
  values, precompute, save_values/load_values, save/load JSON round-trip).
- New slow-gated test_datasets.py: load_california_housing,
  load_bike_sharing, load_adult_census.

Default suite: 232 passed, 12 skipped in ~23s. Full suite (incl. slow):
241 passed, 16 skipped in ~36s.

https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
…ruth

Turns protocol contract checks into real correctness tests by making
independent ground-truth sources agree on the same game. Five test
classes in the new tests/shapiq/test_cross_checks.py:

1. TestExactVsSOUM — ExactComputer(SOUM) == SOUM.exact_values for
   SV/SII/k-SII/STII/FSII/FBII on n=5; n=7 slow-gated.
2. TestMoebiusConverter — round-trip ExactComputer("Moebius") ->
   MoebiusConverter -> target index matches ExactComputer on the target.
3. TestApproximatorAtFullBudget — 11 consistent approximators
   (KernelSHAP / KernelSHAPIQ / InconsistentKernelSHAPIQ /
   UnbiasedKernelSHAP / RegressionFSII / RegressionFBII / SHAPIQ on
   SII,k-SII,STII / SVARMIQ / SVARM) at budget=2**n match SOUM within 1e-6.
4. TestApproximatorConvergence (slow) — sampling-based approximators
   (Permutation*, Owen, Stratified) show monotonically decreasing error
   with more budget on n=7 SOUM.
5. TestTreeExplainerVsExactComputer — TreeExplainer output matches
   ExactComputer run on TreeSHAPIQXAI.value_function for a 5-feature
   decision tree (SV and k-SII).

Supporting changes in tests/shapiq/conftest.py:
- SOUM fixtures (soum_5 default, soum_7 slow).
- GROUND_TRUTH_INDICES constant.
- assert_iv_close helper that aligns InteractionValues by
  interaction_lookup (skips empty-interaction asymmetry across
  pipelines; optional check_baseline flag).

Runtime impact:
  Default suite:   232 -> 257 passed  (22s -> 25s)
  Full suite:      241 -> 277 passed  (36s -> 74s)

https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
@mmschlk mmschlk changed the title Flesh out protocol test suite + add cross-check pipeline Protocol-driven test suite rework + cross-check pipeline Apr 16, 2026
- test_reproducible now aligns InteractionValues by interaction_lookup
  rather than comparing raw values arrays. SPEX's sparse transform
  produces the same interaction values on Windows but stores them in a
  different order in the values array across runs, which broke the old
  np.allclose(r1.values, r2.values) check.
- Remove the stale `from shapiq_games.synthetic import SOUM` import from
  test_cross_checks.py (ruff auto-removal caused Code Quality to fail).

https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

claude added 8 commits April 16, 2026 21:21
The original tests/shapiq/data/test_croc.JPEG was removed in the
test-suite rewrite (commit 76aa3ce), but four tests in
tests/shapiq_games/tests_legacy/test_local_xai.py still depend on it
through the image_and_path fixture. CI fails with FileNotFoundError on
those four tests.

Make image_and_path skip with a clear message when the JPEG isn't on
disk, rather than erroring. Restoring the file (or pointing the fixture
elsewhere) re-enables the tests automatically.

https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
…exact list

Review found that InconsistentKernelSHAPIQ was passing TestApproximatorAtFullBudget
only because the SOUM fixture (max_interaction_size=3, max_order=2) happened to
sit in a trivial k-additive regime. On a genuinely non-k-additive game the
estimator's own docstring says it does not recover the true SII — and indeed
it produces ~1.3e-1 errors once the fixture is strengthened.

Changes:
- soum_5 / soum_7: raise n_basis_games (25 / 40) and set max_interaction_size = n
  with min_interaction_size = 1, so basis games span all orders from 1..n and
  the SOUM is not k-additive for any small k.
- Drop InconsistentKernelSHAPIQ from CONSISTENT_APPROXIMATORS; leave a comment
  explaining why it doesn't belong.
- Add "BV" to GROUND_TRUTH_INDICES (supported by both ExactComputer and
  MoebiusConverter).
- Add TestMoebiusVsSOUM that compares ExactComputer("Moebius", n) against
  soum.moebius_coefficients — two independent ground-truth Möbius transforms.
- Cache ExactComputer per SOUM (module-scoped exact_soum_5 / exact_soum_7)
  to avoid redundant 2^n recomputation across parametrised tests.
- Update tolerances with measured noise floors (1e-8 for the LS solves in
  TestExactVsSOUM / TestMoebiusConverter; 1e-6 for the approximator test
  where Shapley-kernel LS hits ~5e-7 on random non-k-additive games).
…eeExplainer

Previously the tree cross-check relied on TreeSHAPIQXAI from shapiq_games,
which is slated for removal. Replace it with the coalition-valued game that
lives inside shapiq itself — the one actually used under the hood by local
XAI setups — and pair it with the matching InterventionalTreeExplainer.

- shapiq.tree.interventional.InterventionalGame is a Game subclass whose
  value_function computes v(S) = E_ref[f(x_S, z_{not S})] over a reference
  dataset. Running ExactComputer on it brute-forces the Shapley / Banzhaf /
  faithful values from 2^n coalition evaluations.
- shapiq.tree.interventional.InterventionalTreeExplainer computes the same
  quantities via a tree-walking TreeSHAP-IQ variant.

The two are semantically matched (both interventional) — verified
empirically: SV, BV, SII, BII, FSII, FBII all agree to ~4e-9. STII is
omitted because the two implementations disagree (~1e-1 error, separate
bug), and k-SII because InterventionalTreeExplainer does not support it.

Note: the default shapiq.TreeExplainer uses path-dependent TreeSHAP-IQ,
which has different semantics than InterventionalGame. The test now pairs
matching pairs — path-dependent vs interventional explanations were
accidentally being compared before only because TreeSHAPIQXAI itself used
path-dependent averaging via node_sample_weight.
… tree efficiency

Second-round review surfaced five major gaps in the cross-check pipeline.
Acting on all of them:

1. assert_iv_close now takes strict=True. When set, both sides must cover
   the same non-empty interactions (modulo zero-valued keys, which
   MoebiusConverter drops and ExactComputer emits — a pure encoding
   difference, not a bug). Adopted in TestExactVsSOUM and
   TestMoebiusConverter where both pipelines are analytical and should
   agree on support.

2. Added TestKAddSHAPAtFullBudget. kADD-SHAP is user-facing via kADDSHAP
   but had no independent ground truth: SOUM.exact_values and
   MoebiusConverter don't support it. Cross-check against
   ExactComputer("kADD-SHAP") closes the gap (agreement to ~1e-7).

3. Added TestPathDependentTreeEfficiency. The default shapiq.TreeExplainer
   (path-dependent TreeSHAP-IQ) was completely unexercised by the
   interventional cross-check pair. Since no path-dependent Game wrapper
   exists for a full cross-check, we pin the SV efficiency axiom:
   sum(SV) == f(x) - E[f]. Cheap, catches most regressions in the
   polynomial arithmetic or baseline computation.

4. Strengthened TestApproximatorConvergence. errors[-1] < errors[0] was
   tautological — a 16x budget increase for essentially zero error
   reduction would pass. Now averages errors over 3 seeds per budget and
   requires the mean error to halve with 16x budget. Catches silently
   broken sampling estimators.

5. Tightened TestMoebiusVsSOUM tolerance from 1e-10 to 1e-9 — the
   alternating-sign sum over 2^n coalitions was liable to flake on
   Windows/macOS where FMA ordering can eat a few ULPs.

Minor polish:
- Renamed _small_tree_setup → small_tree_setup for consistency with
  other fixtures.
- Narrowed warnings.catch_warnings to category=UserWarning in the
  approximator tests — blanket ignore was swallowing deprecation signal.
- Updated module docstring from 5 to 6 ground-truth sources.
TestExactVsSOUM and TestMoebiusConverter previously exercised a single
SOUM instance (random_state=42). One game can hide real conditioning
edge cases: zero-valued interactions, near-singular LS matrices,
degenerate basis overlaps.

Add soum_5_seeded / soum_7_seeded fixtures parametrised over a fixed
list of seeds (42, 1337, 7, 2024, 31415). Each test now runs once per
seed × per index, giving 5x game-instance diversity while staying fully
deterministic — tolerances remain tight, CI stays reproducible, bisects
still work. Test count goes from ~30 to ~130, still sub-10s total.

Left the approximator tests on the single-seed fixtures: multiplying 10
approximators × 5 seeds adds visible cost without proportional coverage
gain on algorithms that are already exact at full budget.
Closes the last explainer-vs-brute-force gap in the cross-check pipeline.
ProductKernelExplainer computes SV analytically via elementary symmetric
polynomials on kernel vectors. ProductKernelGame wraps the same RBF
kernel and training data as a coalition-valued game. Running
ExactComputer on it must agree with the explainer's closed-form output —
and empirically does so to ~1e-16 on a 5-feature SVR.

Pattern mirrors TestInterventionalTreeCrossCheck. Only SV with
max_order=1 is tested since the explainer hard-rejects anything else.

Single wiring detail: ProductKernelGame takes the validated
ProductKernelModel, not a raw sklearn estimator. The explainer already
does this conversion internally, so reading explainer.converted_model is
both the cleanest and the most user-accurate path.
Closes the coverage gap on shapiq.tree.linear.LinearTreeSHAP. The class
computes first-order path-dependent Shapley values via a Chebyshev
polynomial shortcut; it previously had no numerical regression test.

The test pairs it against a small private helper game,
_PathDependentTreeGame, that brute-forces the same path-dependent value
function over 2^n coalitions (the same logic the now-deprecated
TreeSHAPIQXAI used internally, replicated in ~25 lines of test
scaffolding to avoid depending on shapiq_games). Agreement is asserted
to atol=1e-10; empirically matches to ~1e-16.

Semantic match:
- LinearTreeSHAP: Chebyshev-basis closed-form on validated TreeModel.
- _PathDependentTreeGame: for each absent feature, average both children
  weighted by node_sample_weight. Fallback to uniform weighting only
  when both node weights are zero (degenerate pruning).

Left XGBoost/LightGBM conversion coverage as a separate follow-up to
keep this change focused on LinearTreeSHAP alone.
…ter pin

Closes two real coverage gaps and pins one known unsupported path:

- lgbm_reg (LGBMRegressor) fixture + full protocol (task="regression").
  Efficiency check passes to ~2e-9. Previously missing entirely.
- lgbm_booster (native lightgbm.Booster) fixture + full protocol
  (task="regression"). Exercises the native-Booster code path in
  _lightgbm_model_to_bytes that sklearn-wrapper fixtures never hit.
- TestXGBoostBoosterUnsupported — standalone pin test asserting that
  passing a raw xgboost.Booster raises TypeError("not supported").
  Reverse-alarms when the conversion is implemented.

Scope intentionally narrow:
- No ExtraTreeRegressor/IsolationForest/ExtraTreesClassifier (different
  concern — separate sklearn conversion paths).
- No XGBClassifier/LGBMClassifier efficiency upgrade (task="basic" is
  semantically correct — XGB/LGBM classifiers output raw margins, not
  probabilities, so efficiency in proba space doesn't hold).

Coverage of src/shapiq/tree/conversion: ~60% -> 82%.
claude and others added 16 commits April 18, 2026 14:07
Signed-off-by: Maximilian <maximilian.muschalik@gmail.com>
Signed-off-by: Maximilian <maximilian.muschalik@gmail.com>
The entire tests/shapiq/fixtures/ directory and 13 fixtures in
conftest.py had zero consumers after the test suite rework.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Maximilian <maximilian.muschalik@gmail.com>
Signed-off-by: Maximilian <maximilian.muschalik@gmail.com>
Signed-off-by: Maximilian <maximilian.muschalik@gmail.com>
These tests exercise the full seam — sklearn → imputer → approximator →
InteractionValues → plots/serialisation — mirroring the canonical flows
from README.md and docs/source/introduction/start.rst. They catch
cross-module regressions that pass every per-module unit test.

Coverage (8 test invocations, <2s total):
- test_tabular_explainer_readme_flow (parametrised SV / k-SII / FSII / STII),
  asserts the efficiency axiom holds end-to-end
- test_tree_explainer_efficiency (parametrised SV / k-SII), asserts
  pointwise efficiency for TreeExplainer
- test_agnostic_explainer_on_soum, verifies the Game-based researcher path
  against ExactComputer ground truth
- test_interaction_values_roundtrip_and_plots, covers JSON save/load and
  all five top-level plot functions

https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
- loosen TabularExplainer efficiency tolerance 1e-4 -> 1e-2 for CI
  robustness across approximators with different budget accounting
  (still ~1e7x headroom over observed error, still catches real
  efficiency breaks whose magnitude scales with |pred|)
- drop TreeExplainer SV/1 parametrisation; that invariant is already
  covered by test_cross_checks.TestPathDependentTreeEfficiency. Keep
  only the novel k-SII/2 case as test_tree_explainer_ksii_efficiency
- remove redundant mpl.use("Agg") — conftest.py already sets it

https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
Revert the SV/1 removal from test_tree_explainer_efficiency. The overlap
with test_cross_checks.TestPathDependentTreeEfficiency is a feature of
the integration layer, not a bug: the cross-check exercises the
lower-level invariant with min_order=1, whereas the integration test
asserts the same property through the canonical public API flow
(shapiq.TreeExplainer(...).explain(x)). Distinct entry points into the
same invariant catches different regressions.

https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
Extends tests/shapiq/test_datasets.py beyond return-type smoke checks
with: exact-shape guards against the docstring, target-column-leakage
check, no-NaN postcondition, numpy/pandas path equivalence, and a
binary-label check for adult census.

https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
Replaces the prior 11-test smoke layer in test_plots.py with a four-layer
strategy catching regressions that "did it crash?" tests miss:

- TestPlotUtils: proper coverage of the pure helpers (format_value,
  format_labels, abbreviate_feature_names, get_color) that were nearly
  untested.
- TestPlots / TestPlotsNoAbbreviate / TestPlotsWithWords: each public plot
  is parametrised over (abbreviate, feature_names) variants so a kwarg
  regressing in one branch doesn't pass silently.
- TestPlotStructure: one rich test per plot inspecting the returned
  Axes/Figure — tick labels, title/xlabel/ylabel honored, expected artists
  drawn.
- TestPlotEdgeCases: all-zero IV, feature_names=None, long names with
  abbreviate=True, max_display below n_features.

68 tests, ~3s runtime. No new dependencies.

https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
Both functions previously accepted malformed input silently:
- beeswarm_plot with data.shape[1] != n_players would plot a subset or
  scramble columns without warning.
- sentence_plot with len(words) != n_players would index past the
  InteractionValues or drop entries silently.

Each gets a ValueError guard with a clear message. Re-enables the two
dropped edge-case tests in TestPlotEdgeCases.

https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
Rewrite test_imputers.py as a DRY protocol-driven suite: TestImputerProtocol
holds the shared contract (full coalition == model(x), present features use x,
missing features don't leak x, fit/refit behaviour, random_state reproducibility);
per-imputer classes cover what's unique (Baseline mean-from-background, Marginal
joint-vs-per-feature sampling on dependent data, Gaussian closed-form conditional
mean, GaussianCopula rank round-trip, Generative cluster-aware neighbourhood
sampling, TabPFN remove-and-contextualize); TestCrossImputerAgreement asserts
relationships across imputers (baseline == marginal on constant background,
gaussian ~ marginal on independent data, copula ~ gaussian on standard normal).

https://claude.ai/code/session_01XN6xQdEpvZHekXYnRuJwyT
@mmschlk mmschlk self-assigned this Apr 21, 2026
@mmschlk mmschlk added this to the 1.5.0 milestone Apr 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants