Protocol-driven test suite rework + cross-check pipeline#512
Open
Protocol-driven test suite rework + cross-check pipeline#512
Conversation
Protocol-driven test suite replacing 75 files with 8, targeting ~1min default runtime. Covers approximator/explainer/tree/imputer protocols, tiering strategy, fixture design, and migration approach. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
10 tasks covering: pytest config, conftest, approximator/explainer/tree/ imputer/interaction_values/game_theory/plot/public_api tests, old test deletion, and final verification. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Track shared Claude Code settings (settings.json, agents, commands) while keeping local settings and worktrees gitignored. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replaces the per-module test files under tests_unit/, tests_integration_tests/, and tests_deprecation/ with the 8 protocol-driven files added in previous commits. Shared fixture plugins under tests/shapiq/fixtures/ are preserved because legacy tests in tests/shapiq_games still consume them. The new suite runs in ~25s and makes adding new components trivial: append a config dict to the relevant registry.
Adds a parametrized protocol test for SVR, SVC, and GaussianProcessRegressor models, checking that explain() returns an InteractionValues object, that sum(values) matches the regression prediction, and that explain_X handles batches. Also adds validation tests for the three documented error paths: max_order > 1, unsupported model type, and multiclass SVC. Lifts overall coverage from 60% to 62% and brings explainer/product_kernel/ from 0% to ~85% (game.py remains uncovered as it's a separate Game subclass not exercised by the explainer path).
Extends the existing 8-file protocol suite with targeted additions — no redesign. New coverage: - Imputers: GenerativeConditionalImputer added to IMPUTER_CONFIGS; slow-gated TestTabPFNImputer for the Remove-and-Contextualize imputer. - Explainers: slow-gated TestTabPFNExplainer. - Plots: smoke tests for network, stacked_bar, upset, si_graph, sentence, beeswarm, plus abbreviate_feature_names. - New test_utils.py: unit tests for powerset / pair_subset_sizes / split_subsets_budget / get_explicit_subsets / interaction lookup / coalition transforms / count_interactions / safe_isinstance / check_import_module / shuffle_data / raise_deprecation_warning. - game_theory: TestAggregation (aggregate_base_interaction, aggregate_to_one_dimension), TestCore (egalitarian_least_core), TestGame (Game base-class API: __call__, access_counter, grand/empty coalition values, precompute, save_values/load_values, save/load JSON round-trip). - New slow-gated test_datasets.py: load_california_housing, load_bike_sharing, load_adult_census. Default suite: 232 passed, 12 skipped in ~23s. Full suite (incl. slow): 241 passed, 16 skipped in ~36s. https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
…ruth
Turns protocol contract checks into real correctness tests by making
independent ground-truth sources agree on the same game. Five test
classes in the new tests/shapiq/test_cross_checks.py:
1. TestExactVsSOUM — ExactComputer(SOUM) == SOUM.exact_values for
SV/SII/k-SII/STII/FSII/FBII on n=5; n=7 slow-gated.
2. TestMoebiusConverter — round-trip ExactComputer("Moebius") ->
MoebiusConverter -> target index matches ExactComputer on the target.
3. TestApproximatorAtFullBudget — 11 consistent approximators
(KernelSHAP / KernelSHAPIQ / InconsistentKernelSHAPIQ /
UnbiasedKernelSHAP / RegressionFSII / RegressionFBII / SHAPIQ on
SII,k-SII,STII / SVARMIQ / SVARM) at budget=2**n match SOUM within 1e-6.
4. TestApproximatorConvergence (slow) — sampling-based approximators
(Permutation*, Owen, Stratified) show monotonically decreasing error
with more budget on n=7 SOUM.
5. TestTreeExplainerVsExactComputer — TreeExplainer output matches
ExactComputer run on TreeSHAPIQXAI.value_function for a 5-feature
decision tree (SV and k-SII).
Supporting changes in tests/shapiq/conftest.py:
- SOUM fixtures (soum_5 default, soum_7 slow).
- GROUND_TRUTH_INDICES constant.
- assert_iv_close helper that aligns InteractionValues by
interaction_lookup (skips empty-interaction asymmetry across
pipelines; optional check_baseline flag).
Runtime impact:
Default suite: 232 -> 257 passed (22s -> 25s)
Full suite: 241 -> 277 passed (36s -> 74s)
https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
- test_reproducible now aligns InteractionValues by interaction_lookup rather than comparing raw values arrays. SPEX's sparse transform produces the same interaction values on Windows but stores them in a different order in the values array across runs, which broke the old np.allclose(r1.values, r2.values) check. - Remove the stale `from shapiq_games.synthetic import SOUM` import from test_cross_checks.py (ruff auto-removal caused Code Quality to fail). https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
The original tests/shapiq/data/test_croc.JPEG was removed in the test-suite rewrite (commit 76aa3ce), but four tests in tests/shapiq_games/tests_legacy/test_local_xai.py still depend on it through the image_and_path fixture. CI fails with FileNotFoundError on those four tests. Make image_and_path skip with a clear message when the JPEG isn't on disk, rather than erroring. Restoring the file (or pointing the fixture elsewhere) re-enables the tests automatically. https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
…exact list
Review found that InconsistentKernelSHAPIQ was passing TestApproximatorAtFullBudget
only because the SOUM fixture (max_interaction_size=3, max_order=2) happened to
sit in a trivial k-additive regime. On a genuinely non-k-additive game the
estimator's own docstring says it does not recover the true SII — and indeed
it produces ~1.3e-1 errors once the fixture is strengthened.
Changes:
- soum_5 / soum_7: raise n_basis_games (25 / 40) and set max_interaction_size = n
with min_interaction_size = 1, so basis games span all orders from 1..n and
the SOUM is not k-additive for any small k.
- Drop InconsistentKernelSHAPIQ from CONSISTENT_APPROXIMATORS; leave a comment
explaining why it doesn't belong.
- Add "BV" to GROUND_TRUTH_INDICES (supported by both ExactComputer and
MoebiusConverter).
- Add TestMoebiusVsSOUM that compares ExactComputer("Moebius", n) against
soum.moebius_coefficients — two independent ground-truth Möbius transforms.
- Cache ExactComputer per SOUM (module-scoped exact_soum_5 / exact_soum_7)
to avoid redundant 2^n recomputation across parametrised tests.
- Update tolerances with measured noise floors (1e-8 for the LS solves in
TestExactVsSOUM / TestMoebiusConverter; 1e-6 for the approximator test
where Shapley-kernel LS hits ~5e-7 on random non-k-additive games).
…eeExplainer
Previously the tree cross-check relied on TreeSHAPIQXAI from shapiq_games,
which is slated for removal. Replace it with the coalition-valued game that
lives inside shapiq itself — the one actually used under the hood by local
XAI setups — and pair it with the matching InterventionalTreeExplainer.
- shapiq.tree.interventional.InterventionalGame is a Game subclass whose
value_function computes v(S) = E_ref[f(x_S, z_{not S})] over a reference
dataset. Running ExactComputer on it brute-forces the Shapley / Banzhaf /
faithful values from 2^n coalition evaluations.
- shapiq.tree.interventional.InterventionalTreeExplainer computes the same
quantities via a tree-walking TreeSHAP-IQ variant.
The two are semantically matched (both interventional) — verified
empirically: SV, BV, SII, BII, FSII, FBII all agree to ~4e-9. STII is
omitted because the two implementations disagree (~1e-1 error, separate
bug), and k-SII because InterventionalTreeExplainer does not support it.
Note: the default shapiq.TreeExplainer uses path-dependent TreeSHAP-IQ,
which has different semantics than InterventionalGame. The test now pairs
matching pairs — path-dependent vs interventional explanations were
accidentally being compared before only because TreeSHAPIQXAI itself used
path-dependent averaging via node_sample_weight.
… tree efficiency
Second-round review surfaced five major gaps in the cross-check pipeline.
Acting on all of them:
1. assert_iv_close now takes strict=True. When set, both sides must cover
the same non-empty interactions (modulo zero-valued keys, which
MoebiusConverter drops and ExactComputer emits — a pure encoding
difference, not a bug). Adopted in TestExactVsSOUM and
TestMoebiusConverter where both pipelines are analytical and should
agree on support.
2. Added TestKAddSHAPAtFullBudget. kADD-SHAP is user-facing via kADDSHAP
but had no independent ground truth: SOUM.exact_values and
MoebiusConverter don't support it. Cross-check against
ExactComputer("kADD-SHAP") closes the gap (agreement to ~1e-7).
3. Added TestPathDependentTreeEfficiency. The default shapiq.TreeExplainer
(path-dependent TreeSHAP-IQ) was completely unexercised by the
interventional cross-check pair. Since no path-dependent Game wrapper
exists for a full cross-check, we pin the SV efficiency axiom:
sum(SV) == f(x) - E[f]. Cheap, catches most regressions in the
polynomial arithmetic or baseline computation.
4. Strengthened TestApproximatorConvergence. errors[-1] < errors[0] was
tautological — a 16x budget increase for essentially zero error
reduction would pass. Now averages errors over 3 seeds per budget and
requires the mean error to halve with 16x budget. Catches silently
broken sampling estimators.
5. Tightened TestMoebiusVsSOUM tolerance from 1e-10 to 1e-9 — the
alternating-sign sum over 2^n coalitions was liable to flake on
Windows/macOS where FMA ordering can eat a few ULPs.
Minor polish:
- Renamed _small_tree_setup → small_tree_setup for consistency with
other fixtures.
- Narrowed warnings.catch_warnings to category=UserWarning in the
approximator tests — blanket ignore was swallowing deprecation signal.
- Updated module docstring from 5 to 6 ground-truth sources.
TestExactVsSOUM and TestMoebiusConverter previously exercised a single SOUM instance (random_state=42). One game can hide real conditioning edge cases: zero-valued interactions, near-singular LS matrices, degenerate basis overlaps. Add soum_5_seeded / soum_7_seeded fixtures parametrised over a fixed list of seeds (42, 1337, 7, 2024, 31415). Each test now runs once per seed × per index, giving 5x game-instance diversity while staying fully deterministic — tolerances remain tight, CI stays reproducible, bisects still work. Test count goes from ~30 to ~130, still sub-10s total. Left the approximator tests on the single-seed fixtures: multiplying 10 approximators × 5 seeds adds visible cost without proportional coverage gain on algorithms that are already exact at full budget.
Closes the last explainer-vs-brute-force gap in the cross-check pipeline. ProductKernelExplainer computes SV analytically via elementary symmetric polynomials on kernel vectors. ProductKernelGame wraps the same RBF kernel and training data as a coalition-valued game. Running ExactComputer on it must agree with the explainer's closed-form output — and empirically does so to ~1e-16 on a 5-feature SVR. Pattern mirrors TestInterventionalTreeCrossCheck. Only SV with max_order=1 is tested since the explainer hard-rejects anything else. Single wiring detail: ProductKernelGame takes the validated ProductKernelModel, not a raw sklearn estimator. The explainer already does this conversion internally, so reading explainer.converted_model is both the cleanest and the most user-accurate path.
Closes the coverage gap on shapiq.tree.linear.LinearTreeSHAP. The class computes first-order path-dependent Shapley values via a Chebyshev polynomial shortcut; it previously had no numerical regression test. The test pairs it against a small private helper game, _PathDependentTreeGame, that brute-forces the same path-dependent value function over 2^n coalitions (the same logic the now-deprecated TreeSHAPIQXAI used internally, replicated in ~25 lines of test scaffolding to avoid depending on shapiq_games). Agreement is asserted to atol=1e-10; empirically matches to ~1e-16. Semantic match: - LinearTreeSHAP: Chebyshev-basis closed-form on validated TreeModel. - _PathDependentTreeGame: for each absent feature, average both children weighted by node_sample_weight. Fallback to uniform weighting only when both node weights are zero (degenerate pruning). Left XGBoost/LightGBM conversion coverage as a separate follow-up to keep this change focused on LinearTreeSHAP alone.
…ter pin
Closes two real coverage gaps and pins one known unsupported path:
- lgbm_reg (LGBMRegressor) fixture + full protocol (task="regression").
Efficiency check passes to ~2e-9. Previously missing entirely.
- lgbm_booster (native lightgbm.Booster) fixture + full protocol
(task="regression"). Exercises the native-Booster code path in
_lightgbm_model_to_bytes that sklearn-wrapper fixtures never hit.
- TestXGBoostBoosterUnsupported — standalone pin test asserting that
passing a raw xgboost.Booster raises TypeError("not supported").
Reverse-alarms when the conversion is implemented.
Scope intentionally narrow:
- No ExtraTreeRegressor/IsolationForest/ExtraTreesClassifier (different
concern — separate sklearn conversion paths).
- No XGBClassifier/LGBMClassifier efficiency upgrade (task="basic" is
semantically correct — XGB/LGBM classifiers output raw margins, not
probabilities, so efficiency in proba space doesn't hold).
Coverage of src/shapiq/tree/conversion: ~60% -> 82%.
Signed-off-by: Maximilian <maximilian.muschalik@gmail.com>
Signed-off-by: Maximilian <maximilian.muschalik@gmail.com>
The entire tests/shapiq/fixtures/ directory and 13 fixtures in conftest.py had zero consumers after the test suite rework. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Maximilian <maximilian.muschalik@gmail.com>
Signed-off-by: Maximilian <maximilian.muschalik@gmail.com>
Signed-off-by: Maximilian <maximilian.muschalik@gmail.com>
These tests exercise the full seam — sklearn → imputer → approximator → InteractionValues → plots/serialisation — mirroring the canonical flows from README.md and docs/source/introduction/start.rst. They catch cross-module regressions that pass every per-module unit test. Coverage (8 test invocations, <2s total): - test_tabular_explainer_readme_flow (parametrised SV / k-SII / FSII / STII), asserts the efficiency axiom holds end-to-end - test_tree_explainer_efficiency (parametrised SV / k-SII), asserts pointwise efficiency for TreeExplainer - test_agnostic_explainer_on_soum, verifies the Game-based researcher path against ExactComputer ground truth - test_interaction_values_roundtrip_and_plots, covers JSON save/load and all five top-level plot functions https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
- loosen TabularExplainer efficiency tolerance 1e-4 -> 1e-2 for CI
robustness across approximators with different budget accounting
(still ~1e7x headroom over observed error, still catches real
efficiency breaks whose magnitude scales with |pred|)
- drop TreeExplainer SV/1 parametrisation; that invariant is already
covered by test_cross_checks.TestPathDependentTreeEfficiency. Keep
only the novel k-SII/2 case as test_tree_explainer_ksii_efficiency
- remove redundant mpl.use("Agg") — conftest.py already sets it
https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
Revert the SV/1 removal from test_tree_explainer_efficiency. The overlap with test_cross_checks.TestPathDependentTreeEfficiency is a feature of the integration layer, not a bug: the cross-check exercises the lower-level invariant with min_order=1, whereas the integration test asserts the same property through the canonical public API flow (shapiq.TreeExplainer(...).explain(x)). Distinct entry points into the same invariant catches different regressions. https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
Extends tests/shapiq/test_datasets.py beyond return-type smoke checks with: exact-shape guards against the docstring, target-column-leakage check, no-NaN postcondition, numpy/pandas path equivalence, and a binary-label check for adult census. https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
Replaces the prior 11-test smoke layer in test_plots.py with a four-layer strategy catching regressions that "did it crash?" tests miss: - TestPlotUtils: proper coverage of the pure helpers (format_value, format_labels, abbreviate_feature_names, get_color) that were nearly untested. - TestPlots / TestPlotsNoAbbreviate / TestPlotsWithWords: each public plot is parametrised over (abbreviate, feature_names) variants so a kwarg regressing in one branch doesn't pass silently. - TestPlotStructure: one rich test per plot inspecting the returned Axes/Figure — tick labels, title/xlabel/ylabel honored, expected artists drawn. - TestPlotEdgeCases: all-zero IV, feature_names=None, long names with abbreviate=True, max_display below n_features. 68 tests, ~3s runtime. No new dependencies. https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
Both functions previously accepted malformed input silently: - beeswarm_plot with data.shape[1] != n_players would plot a subset or scramble columns without warning. - sentence_plot with len(words) != n_players would index past the InteractionValues or drop entries silently. Each gets a ValueError guard with a clear message. Re-enables the two dropped edge-case tests in TestPlotEdgeCases. https://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22
Rewrite test_imputers.py as a DRY protocol-driven suite: TestImputerProtocol holds the shared contract (full coalition == model(x), present features use x, missing features don't leak x, fit/refit behaviour, random_state reproducibility); per-imputer classes cover what's unique (Baseline mean-from-background, Marginal joint-vs-per-feature sampling on dependent data, Gaussian closed-form conditional mean, GaussianCopula rank round-trip, Generative cluster-aware neighbourhood sampling, TabPFN remove-and-contextualize); TestCrossImputerAgreement asserts relationships across imputers (baseline == marginal on constant background, gaussian ~ marginal on independent data, copula ~ gaussian on standard normal). https://claude.ai/code/session_01XN6xQdEpvZHekXYnRuJwyT
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Consolidates and supersedes #511. Contains the full scope of the test-suite rework in three layers.
1. Protocol-driven rewrite (from #511's original commits)
Replaces the 75-file suite (~10k LOC, 321 tests) with 8 protocol-driven files (~1.5k LOC). Design docs at
docs/superpowers/plans/2026-04-15-test-rework.mdanddocs/superpowers/specs/2026-04-15-test-rework-design.md.test_approximators.py— registry + parametrizedTestApproximatorProtocolacross 20+ approximator/index configstest_explainers.py—TabularExplainer,AgnosticExplainer,ProductKernelExplainerprotocols + validationtest_tree.py—TreeExplaineracross sklearn/xgboost/lightgbm with manualTreeModeltests + segfault regressionstest_imputers.py— imputer registry with 4 core imputerstest_interaction_values.py— data-structure correctnesstest_game_theory.py—ExactComputer, indices,MoebiusConvertertest_plots.py— plot smoke teststest_public_api.py— every concrete public subclass is exported in__all__conftest.py— shared game/model/data fixtures +skip_if_no_*markerspyproject.toml—slowmarker +addopts = -m 'not slow'tiering2. Close coverage gaps (commit
fd0282a)GenerativeConditionalImputertoIMPUTER_CONFIGS; slow-gatedTestTabPFNImputerTestTabPFNExplainernetwork_plot,stacked_bar_plot,upset_plot,si_graph_plot,sentence_plot,beeswarm_plot,abbreviate_feature_namestest_utils.py— 26 unit tests forshapiq.utils.{sets,modules,datasets,errors}TestAggregation,TestCore,TestGameintest_game_theory.pytest_datasets.pyfor the three built-in dataset loaders3. Cross-check pipeline (commit
67cd77f) — correctness layerTurns protocol contract checks into correctness tests by making independent ground-truth sources agree on the same game. Five test classes in
test_cross_checks.py:TestExactVsSOUMExactComputer(SOUM)↔SOUM.exact_valuesTestMoebiusConverterExactComputer("Moebius")→MoebiusConverter→ target ↔ExactComputer(target)TestApproximatorAtFullBudgetbudget=2**n↔SOUM.exact_valuesTestApproximatorConvergence(slow)TestTreeExplainerVsExactComputerExactComputer(TreeSHAPIQXAI)↔TreeExplainer.explain(x)Tolerance strategy:
atol=1e-10for pairs that should be analytically identical,1e-8for larger Moebius-converted games (n=7),1e-6for LS / Monte Carlo noise. Sampling-based methods verify monotonic error decrease instead of exactness.Supporting changes in
conftest.py:soum_5/soum_7fixtures,GROUND_TRUTH_INDICESconstant, and anassert_iv_closehelper that alignsInteractionValuesbyinteraction_lookup(skipping empty-interaction asymmetry across pipelines).Runtime
-m 'not slow')-m '')Test plan
uv run pytest tests/shapiq -q— default tieruv run pytest tests/shapiq -m '' -q— full tieruv run pre-commit run --all-fileshttps://claude.ai/code/session_01DHsGf4an1Dnnw4qTnmdB22