Summary
The current benchmark suite (benchmarks/) measures throughput and latency — graph ops, retrieval pipeline, agentic memory, serialization. It does not measure the semantic effectiveness of context graphs: whether the context they surface is accurate, temporally valid, causally grounded, and whether it actually improves agent decision quality.
This issue tracks the work to add a dedicated Context Graph Effectiveness benchmark track that covers every major capability of the context, kg, reasoning, provenance, conflicts, and deduplication modules — split by query type, with a clear "good vs not good" definition rooted in decision quality delta.
Why This Matters
Context graphs are not vector retrieval. The difference is structural:
| Dimension | Vector RAG | Semantica Context Graph |
| --- | --- | --- |
| Storage | Chunk embeddings | Typed nodes + directed edges |
| Temporal | None | `valid_from` / `valid_until` + `recorded_at` / `superseded_at` (bi-temporal) |
| Causal | None | Explicit causal edges traversed by `CausalChainAnalyzer` |
| Decision memory | None | Full `Decision` lifecycle — `DecisionRecorder`, `DecisionQuery`, `PolicyEngine` |
| Reasoning | Implicit (LLM) | Explicit: Rete, SPARQL, Datalog, abductive, deductive, Allen intervals |
| Provenance | None | W3C PROV-O end-to-end lineage (doc → chunk → entity → KG → query → response) |
| Conflict handling | None | Multi-strategy resolution (voting, credibility-weighted, temporal, confidence) |
| Deduplication | None | Union-find + hierarchical clustering with provenance-preserving merges |
Flat average precision/recall collapses all of this into a single number and misses structural failures entirely. A context graph scoring 0.85 average recall may still inject stale facts, ignore causal ancestry, or apply an overridden policy — all invisible to a flat metric.
The right evaluation framework splits by capability dimension, measures decision quality delta as the headline signal, and enforces pass/fail thresholds in CI.
Capabilities to Cover
The following capability dimensions are extracted from the current API surface and must each have at least one benchmark class.
1. Core Graph Retrieval — ContextRetriever / TemporalGraphRetriever
The retriever supports hybrid retrieval (hybrid_alpha from 0 = pure vector to 1 = pure graph), intent-guided score boosting, multi-hop BFS traversal, semantic re-ranking (70% original score + 30% query-content similarity), and multi-source boosting (20% boost when a result appears in both vector and graph sources).
Benchmark dimensions:
- Lookup (direct node by label/ID) — hit rate, latency
- Multi-hop traversal (2–3 hops) — path recall, hop precision
- `hybrid_alpha` sensitivity — does increasing graph weight improve structural queries?
- Multi-source boost verification — results in both vector and graph should rank above single-source
- Semantic re-ranking quality — reranked list vs. raw score list precision@5
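The re-ranking and precision@5 checks above can be sketched directly from the documented 70/30 blend. This is a minimal illustration: `rerank` and `precision_at_k` are hypothetical helper names, not the retriever's actual API.

```python
# Sketch of the semantic re-ranking quality check. Results are (id, score)
# pairs; `query_sim` maps result id -> query-content similarity.

def rerank(results, query_sim):
    """Blend per the spec: 70% original retrieval score + 30% query-content similarity."""
    return sorted(
        ((rid, 0.7 * score + 0.3 * query_sim[rid]) for rid, score in results),
        key=lambda x: x[1],
        reverse=True,
    )

def precision_at_k(ranked_ids, relevant, k=5):
    """Fraction of the top-k ranked ids that are in the relevant set."""
    top = ranked_ids[:k]
    return sum(1 for rid in top if rid in relevant) / k

# Toy example: raw scores favour "b", but query similarity promotes "a".
raw = [("b", 0.9), ("a", 0.8), ("c", 0.1)]
sims = {"a": 0.95, "b": 0.2, "c": 0.1}
reranked = rerank(raw, sims)
```

The benchmark then compares `precision_at_k` over the reranked list against the raw-score list on a labeled query set.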
Implementation target: benchmarks/context_graph_effectiveness/test_retrieval.py
2. Temporal Validity — BiTemporalFact, TemporalGraphRetriever, TemporalQueryRewriter
Semantica supports full bi-temporal facts: valid_from / valid_until (domain time) and recorded_at / superseded_at (transaction time). TemporalQueryRewriter extracts temporal references from natural-language queries using regex (no LLM required) or an optional LLM for free-form phrasing. TemporalGraphRetriever uses the extracted parameters to call reconstruct_at_time() on the retrieved subgraph.
Benchmark dimensions:
- Stale-context injection rate — fraction of retrieved nodes/edges where `valid_until < query_time`
- Future-context injection rate — fraction where `valid_from > query_time`
- Temporal precision — valid-at-query-time results / all retrieved results
- Temporal recall — valid-at-query-time results retrieved / all valid-at-query-time results in graph
- Query rewriter accuracy — does `TemporalQueryRewriter` extract `at_time`, `start_time`, `end_time`, and `temporal_intent` correctly across query phrasings (before/after/during/as-of/since)?
- Historical query correctness — querying at `T-90d` returns the graph state that was valid at `T-90d`, not the current state
- Competing validity window disambiguation — two nodes with overlapping `valid_from`/`valid_until` windows; only the correct one should be returned
Test cases should cover:
- `query_time == now`
- `query_time` in the past (historical)
- `query_time` between `valid_from` and `valid_until` of a competing pair
- Open-ended facts (`valid_until == TemporalBound.OPEN`)
- NL temporal phrases: "last week", "before the 2021 merger", "as of Q2 2022", "since the policy change"
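The two injection-rate metrics above reduce to simple counting over the retrieved facts. A sketch, assuming each fact carries `valid_from`/`valid_until` timestamps with `None` as an open bound; the helper name is illustrative.

```python
# Stale- and future-context injection rates for a retrieved fact set.

def injection_rates(retrieved, query_time):
    """Return (stale_rate, future_rate) for facts retrieved at query_time."""
    stale = sum(
        1 for f in retrieved
        if f["valid_until"] is not None and f["valid_until"] < query_time
    )
    future = sum(1 for f in retrieved if f["valid_from"] > query_time)
    n = max(len(retrieved), 1)
    return stale / n, future / n

facts = [
    {"valid_from": 10, "valid_until": 20},    # stale at t=30
    {"valid_from": 25, "valid_until": None},  # open-ended, valid at t=30
    {"valid_from": 40, "valid_until": 50},    # future at t=30
    {"valid_from": 5,  "valid_until": 35},    # valid at t=30
]
stale_rate, future_rate = injection_rates(facts, query_time=30)
```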
Implementation target: benchmarks/context_graph_effectiveness/test_temporal_validity.py
3. Causal Chain Quality — CausalChainAnalyzer
CausalChainAnalyzer traverses explicit causal edges (CAUSES, REQUIRES, INFLUENCES, etc.) to reconstruct decision causality. This is the primary structural differentiator versus vector RAG — a vector retriever cannot follow a causal chain; it can only return chunks that mention causality.
Benchmark dimensions:
- Causal chain recall — fraction of true causal ancestors retrieved for a given effect node
- Causal chain precision — fraction of retrieved nodes that are actual ancestors (no spurious nodes)
- Root cause accuracy — does traversal identify the correct root at depth N?
- Spurious-edge rate — non-causal nodes surfaced in a causal query
- Chain depth accuracy — correct retrieval at depths 1, 2, 3+ hops
Test fixture topologies to cover:
- Linear chain: A → B → C → D
- Branching: A → B, A → C, B → D
- Diamond (convergence): A → B → D, A → C → D
- Cycle detection (should not loop): A → B → C → A
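The fixture topologies above can be scored with a plain transitive-ancestor traversal as the ground-truth oracle. This is an illustrative dict-of-lists sketch, not the `CausalChainAnalyzer` API.

```python
# Causal chain recall/precision on the diamond fixture. `causes[x]` lists the
# direct causes of effect x; traversal collects all transitive ancestors.

def ancestors(causes, node):
    """All transitive causal ancestors of `node`; terminates on cycles."""
    seen, stack = set(), list(causes.get(node, []))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(causes.get(cur, []))
    return seen

# Diamond: A -> B -> D, A -> C -> D (keyed by effect)
causes = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
retrieved = ancestors(causes, "D")
truth = {"A", "B", "C"}
recall = len(retrieved & truth) / len(truth)
precision = len(retrieved & truth) / max(len(retrieved), 1)
```

The `seen` set doubles as the cycle guard, which is exactly what the cycle-detection fixture exercises.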
Implementation target: benchmarks/context_graph_effectiveness/test_causal_chains.py
4. Decision Intelligence — DecisionRecorder, DecisionQuery, PolicyEngine, CausalChainAnalyzer
Decision intelligence is the core enterprise capability. The API surface covers:
- `DecisionRecorder` — records decisions with full context, entities, source documents, and cross-system context; assigns decision IDs
- `DecisionQuery` — queries decisions by category, date range, outcome, maker, with multi-hop graph traversal via `multi_hop_query()`
- `PolicyEngine` — applies policies against decisions, checks compliance, raises exceptions, tracks policy versions via `create_policy_with_versioning()`
- `CausalChainAnalyzer` — traces which prior decisions influenced the current one; returns full causal chain with weights
- `ApprovalChain` — multi-level approval chain data model
- `PolicyException` — exception/waiver tracking against policies
- Convenience functions: `find_exception_precedents()`, `analyze_decision_impact()`, `check_decision_compliance()`, `get_decision_statistics()`, `capture_decision_trace()`
Benchmark dimensions:
- Precedent retrieval accuracy — does `find_precedents()` return the most relevant historical decisions for a given scenario? (hybrid search: semantic + structural + vector)
- Advanced precedent search — does `find_precedents_advanced()` outperform basic precedent search when `use_kg_features=True`?
- Policy compliance hit rate — fraction of compliant decisions correctly identified as compliant; fraction of violations correctly flagged
- Exception precedent retrieval — `find_exception_precedents()` should surface decisions where a policy exception was granted under similar circumstances
- Causal influence score accuracy — does `analyze_decision_influence()` assign higher influence scores to decisions with more downstream effects?
- Decision impact analysis — `analyze_decision_impact()` should quantify propagation of a decision through the causal graph
- Decision statistics correctness — `get_decision_statistics()` should return accurate aggregate counts, approval rates, and category distributions
- Cross-system context capture — decisions with `cross_system_context` should be retrievable by external system identifiers
Implementation target: benchmarks/context_graph_effectiveness/test_decision_intelligence.py
5. Decision Quality Delta — Primary Headline Metric
The real-world signal: does context graph injection improve agent decision accuracy compared to no context?
```
decision_accuracy_delta  = accuracy(agent + context_graph) - accuracy(agent_alone)
hallucination_rate_delta = hallucination_rate(agent_alone) - hallucination_rate(agent + context_graph)
```
Both should be positive for the context graph to be considered beneficial.
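Spelled out over paired run results, the two formulas look like this. The per-scenario correctness flags and hallucination rates here are toy values; real runs would come from the eval dataset described below.

```python
# Headline metrics from paired baseline / with-context runs.

def accuracy(correct_flags):
    """Fraction of scenarios where the agent's decision matched ground truth."""
    return sum(correct_flags) / len(correct_flags)

baseline_correct     = [True, False, False, True, False]   # agent alone
with_context_correct = [True, True,  False, True, True]    # agent + context graph

decision_accuracy_delta = accuracy(with_context_correct) - accuracy(baseline_correct)

# Hallucination rates measured per run (e.g. via the NER approximation below).
baseline_halluc, context_halluc = 0.30, 0.10
hallucination_rate_delta = baseline_halluc - context_halluc
```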
Protocol:
- Define a fixed eval dataset of `(scenario, ground_truth_decision)` pairs. Start with 100 synthetic scenarios across categories: lending, healthcare, legal, e-commerce, HR.
- Run each scenario twice against a deterministic mock LLM (no API cost, no flakiness):
  - Baseline: agent receives only the raw scenario text
  - With context: agent receives scenario text + context injected from `AgentContext`
- Score structured output against ground truth using exact-match + partial-credit rubric.
- Report:
  - `decision_accuracy_delta` — primary metric
  - `hallucination_rate_delta` — secondary metric (fewer invented entities/facts)
  - `citation_groundedness` — fraction of agent claims traceable to a context node
  - `policy_compliance_rate` — fraction of decisions that satisfy applicable policies when context is injected
Hallucination approximation for mock runs:
```python
def hallucination_rate(agent_output: str, graph: ContextGraph) -> float:
    entities = lightweight_ner(agent_output)
    known = {n["id"] for n in graph.find_nodes()}
    return len([e for e in entities if e not in known]) / max(len(entities), 1)
```
Implementation target: benchmarks/context_graph_effectiveness/test_decision_quality.py
6. KG Algorithm Quality — CentralityCalculator, CommunityDetector, NodeEmbedder, PathFinder, LinkPredictor, SimilarityCalculator
ContextGraph integrates the full KG algorithm suite when instantiated with advanced_analytics=True, kg_algorithms=True. These algorithms power influence analysis, precedent search, and relationship prediction.
Benchmark dimensions:
| Algorithm | What to measure |
| --- | --- |
| `CentralityCalculator` (degree, betweenness, closeness, eigenvector) | Correctness on known graphs (star, chain, clique); convergence iterations for eigenvector |
| `CommunityDetector` (Louvain, Leiden, K-clique) | Modularity score on synthetic graphs with planted communities; NMI against ground-truth partition |
| `NodeEmbedder` (Node2Vec) | Embedding similarity between semantically linked nodes vs. unlinked nodes |
| `PathFinder` (BFS shortest path, all-pairs) | Correctness + latency for graphs of N = 100, 1K, 10K nodes |
| `LinkPredictor` (`score_link`) | AUC-ROC for predicting held-out edges |
| `SimilarityCalculator` (`cosine_similarity`) | Correlation between structural similarity scores and semantic similarity scores |
Decision Intelligence integration check: When analyze_decision_influence() is called, verify that decisions with higher betweenness centrality receive higher influence scores.
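For the "correctness on known graphs" row, a hand-computed oracle is enough: on an n-node star, the hub has normalized degree centrality 1.0 and every leaf 1/(n-1). A plain-Python oracle to compare `CentralityCalculator` output against; the helper itself is a sketch.

```python
# Degree-centrality oracle on a known star graph.

def degree_centrality(edges, nodes):
    """Normalized degree centrality: degree / (n - 1) for each node."""
    deg = {n: 0 for n in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    n = len(nodes)
    return {node: d / (n - 1) for node, d in deg.items()}

# 5-node star: hub connected to four leaves.
nodes = ["hub", "a", "b", "c", "d"]
edges = [("hub", leaf) for leaf in nodes[1:]]
cent = degree_centrality(edges, nodes)
```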
Implementation target: benchmarks/context_graph_effectiveness/test_kg_algorithms.py
7. Reasoning Engine Quality — Reasoner, GraphReasoner, TemporalReasoningEngine, ExplanationGenerator
The reasoning module supports:
- Rete engine — forward-chaining rule evaluation on graph facts
- SPARQL reasoner — SPARQL query evaluation
- Datalog reasoner — recursive Datalog rule evaluation
- Abductive / deductive reasoning — via `Reasoner.infer_facts(facts, rules)`
- `TemporalReasoningEngine` — Allen's 13 interval relations (`IntervalRelation`) for temporal fact entailment
- `ExplanationGenerator` — generates `ReasoningPath` and `Justification` for inferred conclusions
Benchmark dimensions:
- Rule inference accuracy — given a known set of facts and rules, does the Rete engine derive all expected conclusions and no spurious ones?
- Datalog recursive accuracy — transitive closure of a relation (e.g., `ancestor/2`) computed correctly
- Allen interval relation coverage — all 13 relations (`before`, `meets`, `overlaps`, `starts`, `during`, `finishes`, and their inverses + `equals`) correctly classified
- Explanation completeness — does `ExplanationGenerator` produce a `ReasoningPath` that covers every inference step from premise to conclusion?
- SPARQL result correctness — SPARQL queries against synthetic RDF-shaped facts return expected result sets
- Reasoning latency — for N = 1K facts and 20 rules, Rete evaluation should complete under threshold
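The Allen-coverage benchmark needs a ground-truth classifier to compare `TemporalReasoningEngine` against. A deliberately partial sketch covering four of the 13 relations, with intervals as `(start, end)` pairs where `start < end`; extending it to all 13 is mechanical.

```python
# Ground-truth oracle for a subset of Allen's interval relations.

def allen_relation(i, j):
    """Classify the relation of interval i to interval j (partial coverage)."""
    (s1, e1), (s2, e2) = i, j
    if e1 < s2:
        return "before"    # i ends strictly before j starts
    if e1 == s2:
        return "meets"     # i ends exactly where j starts
    if (s1, e1) == (s2, e2):
        return "equals"
    if s1 > s2 and e1 < e2:
        return "during"    # i strictly inside j
    return "other"         # remaining 9 relations, omitted in this sketch
```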
Implementation target: benchmarks/context_graph_effectiveness/test_reasoning_quality.py
8. Provenance & Lineage Integrity — ProvenanceManager, ProvenanceTracker
Provenance is the audit trail that makes context graphs trustworthy in high-stakes domains (finance, healthcare, legal). The module is W3C PROV-O compliant and tracks end-to-end lineage: document → chunk → entity → KG node → query → response.
Benchmark dimensions:
- Lineage completeness — given a response entity, `get_lineage()` should trace back to the source document without gaps
- Source citation accuracy — `SourceReference` (DOI + page + quote) correctly round-trips through storage and retrieval
- Checksum integrity — `compute_checksum()` / `verify_checksum()` detect single-byte mutations
- SQLite persistence round-trip — provenance written to `SQLiteStorage` survives process restart and is read back identically
- Provenance overhead — `GraphBuilderWithProvenance` and `AlgorithmTrackerWithProvenance` should add less than 15% overhead vs. non-provenance equivalents
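The checksum-integrity dimension is easy to model with `hashlib`: a single-byte mutation must change the digest. The function names mirror the bullet above, but this standalone sketch uses SHA-256 directly rather than the module's own implementation.

```python
import hashlib

def compute_checksum(data: bytes) -> str:
    """SHA-256 hex digest of the payload."""
    return hashlib.sha256(data).hexdigest()

def verify_checksum(data: bytes, expected: str) -> bool:
    """True iff the payload still matches the recorded checksum."""
    return compute_checksum(data) == expected

original = b'{"entity": "acme", "source": "10-K filing"}'
checksum = compute_checksum(original)
mutated = original.replace(b"acme", b"acmf")  # single-byte mutation
```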
Implementation target: benchmarks/context_graph_effectiveness/test_provenance_integrity.py
9. Conflict Detection & Resolution Quality — ConflictDetector, ConflictResolver
The conflicts module detects value, type, relationship, temporal, and logical inconsistencies across sources and resolves them using voting, credibility-weighted, recency, or confidence-based strategies.
Benchmark dimensions:
- Detection recall by conflict type — for each of value / type / temporal / logical conflicts, fraction of injected conflicts detected
- Detection precision — fraction of flagged conflicts that are true conflicts (no false positives)
- Resolution strategy correctness:
  - `VOTING` — selects the majority value when N sources disagree
  - `CREDIBILITY_WEIGHTED` — selects the value from the highest-credibility source
  - `MOST_RECENT` — selects the value with the latest timestamp
  - `HIGHEST_CONFIDENCE` — selects the value with the highest confidence score
- Severity scoring calibration — high-severity conflicts (affecting many sources, critical properties) should score higher than low-severity ones
- Investigation guide completeness — `InvestigationGuideGenerator` should produce a guide with at least one step per conflict type
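The four strategy-correctness checks reduce to simple selectors over claims of the form `{value, credibility, timestamp, confidence}`. A sketch for building the expected answers, not the `ConflictResolver` API.

```python
from collections import Counter

def resolve(claims, strategy):
    """Expected winning value for each documented resolution strategy."""
    if strategy == "VOTING":
        return Counter(c["value"] for c in claims).most_common(1)[0][0]
    if strategy == "CREDIBILITY_WEIGHTED":
        return max(claims, key=lambda c: c["credibility"])["value"]
    if strategy == "MOST_RECENT":
        return max(claims, key=lambda c: c["timestamp"])["value"]
    if strategy == "HIGHEST_CONFIDENCE":
        return max(claims, key=lambda c: c["confidence"])["value"]
    raise ValueError(f"unknown strategy: {strategy}")

# Three sources disagree on a headquarters location.
claims = [
    {"value": "NYC", "credibility": 0.9, "timestamp": 1, "confidence": 0.6},
    {"value": "SF",  "credibility": 0.4, "timestamp": 3, "confidence": 0.9},
    {"value": "NYC", "credibility": 0.5, "timestamp": 2, "confidence": 0.7},
]
```

Note that the strategies legitimately disagree on this fixture (majority and credibility say NYC; recency and confidence say SF), which is exactly why each needs its own correctness test.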
Implementation target: benchmarks/context_graph_effectiveness/test_conflict_resolution.py
10. Deduplication Quality — DuplicateDetector, EntityMerger, ClusterBuilder
Deduplication keeps the context graph clean. The module supports Levenshtein, Jaro-Winkler, cosine, Jaccard, and multi-factor similarity; union-find and hierarchical clustering; and provenance-preserving merges.
Benchmark dimensions:
- Duplicate detection recall — fraction of injected duplicate pairs detected at `threshold=0.8`
- Duplicate detection precision — fraction of flagged pairs that are true duplicates
- F1 by similarity method — compare Levenshtein vs. Jaro-Winkler vs. cosine vs. multi-factor; multi-factor should dominate
- Cluster quality — NMI of union-find clusters vs. ground-truth entity groups
- Merge strategy correctness: `keep_most_complete` — merged entity should have the union of all non-null properties
- Provenance preservation — merged entity's metadata should reference all source entities
- Incremental detection efficiency — `O(n×m)` new-vs-existing comparison should be faster than `O(n²)` all-pairs for large N
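Precision/recall/F1 over injected duplicate pairs can be computed with set algebra, treating pairs as unordered so `(a, b)` and `(b, a)` count once. The detector is stubbed by a flagged-pairs list; `dedup_f1` is an illustrative helper name.

```python
# Pair-level duplicate-detection metrics against injected ground truth.

def dedup_f1(flagged_pairs, true_pairs):
    """Return (precision, recall, f1) over unordered entity pairs."""
    flagged = {frozenset(p) for p in flagged_pairs}
    truth = {frozenset(p) for p in true_pairs}
    tp = len(flagged & truth)
    precision = tp / max(len(flagged), 1)
    recall = tp / max(len(truth), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

truth = [("acme_corp", "ACME Corp."), ("j_smith", "John Smith")]
flagged = [("acme_corp", "ACME Corp."), ("j_smith", "jane_smith")]  # 1 hit, 1 false positive
precision, recall, f1 = dedup_f1(flagged, truth)
```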
Implementation target: benchmarks/context_graph_effectiveness/test_deduplication_quality.py
11. Embedding Quality — EmbeddingGenerator, GraphEmbeddingManager, NodeEmbedder
Embeddings underpin semantic search, precedent retrieval, and node similarity. The module supports OpenAI, BGE, FastEmbed, and sentence-transformers providers, with five pooling strategies (mean, max, CLS, attention, hierarchical).
Benchmark dimensions:
- Semantic coherence — cosine similarity between embeddings of semantically related entities should be higher than between unrelated entities
- Provider consistency — embeddings from different providers for the same text should produce consistent similarity rankings (Spearman rank correlation > 0.7)
- Pooling strategy impact — for long-form text, hierarchical pooling should outperform mean pooling on retrieval accuracy
- Hash-fallback stability — SHA-256 hash-based fallback embeddings must be deterministic (same input → same vector) and stable across runs
- `GraphEmbeddingManager` correctness — node embeddings computed by `GraphEmbeddingManager` should place structurally similar nodes (same community, same centrality range) closer in embedding space
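The hash-fallback stability check only needs determinism: same input, same vector, across runs. This re-implements the idea generically so the property is concrete; the library's actual fallback may differ in dimensionality and scaling.

```python
import hashlib
import struct

def hash_embedding(text: str, dim: int = 8) -> list[float]:
    """Deterministic pseudo-embedding: one SHA-256 digest per dimension."""
    vec = []
    for i in range(dim):
        digest = hashlib.sha256(f"{i}:{text}".encode()).digest()
        # Map the first 4 digest bytes to a float in [0, 1).
        vec.append(struct.unpack(">I", digest[:4])[0] / 2**32)
    return vec

v1 = hash_embedding("Acme Corp acquires Initech")
v2 = hash_embedding("Acme Corp acquires Initech")  # must be identical
v3 = hash_embedding("unrelated text")
```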
Implementation target: benchmarks/context_graph_effectiveness/test_embedding_quality.py
12. Change Management & Versioning — TemporalVersionManager, OntologyVersionManager
The change management module provides versioned snapshots of the KG and ontology with SQLite persistence, SHA-256 checksums, and enterprise compliance support (HIPAA, SOX, FDA).
Benchmark dimensions:
- Snapshot fidelity — a snapshot taken at time T, when restored, should be graph-isomorphic to the original
- Version diff correctness — diff between V1 and V2 should contain exactly the nodes/edges added, removed, or modified
- Checksum change detection — any mutation to a versioned snapshot should change its checksum
- SQLite persistence — versions written to `SQLiteVersionStorage` are read back identically after process restart
- Version manager overhead — `TemporalVersionManager` adds less than 10% overhead to graph build time
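Version-diff correctness reduces to set algebra over node dicts keyed by id: added = in V2 only, removed = in V1 only, modified = same id with different payload. An illustrative helper, not the `TemporalVersionManager` API.

```python
# Expected diff between two versioned node snapshots.

def diff_versions(v1: dict, v2: dict) -> dict:
    """Exact added/removed/modified node ids between snapshots v1 and v2."""
    return {
        "added": sorted(set(v2) - set(v1)),
        "removed": sorted(set(v1) - set(v2)),
        "modified": sorted(k for k in set(v1) & set(v2) if v1[k] != v2[k]),
    }

v1 = {"n1": {"label": "Acme"}, "n2": {"label": "Initech"}, "n3": {"label": "Globex"}}
v2 = {"n1": {"label": "Acme Corp"}, "n2": {"label": "Initech"}, "n4": {"label": "Umbrella"}}
delta = diff_versions(v1, v2)
```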
Implementation target: benchmarks/context_graph_effectiveness/test_change_management.py
13. Skill Injection Evaluation
Context graphs can encode behavioral scaffolding — structured nodes that, when serialized into an agent prompt, reliably elicit a specific reasoning pattern. This is distinct from factual retrieval: the node's structure (type, properties, relationships) matters as much as its content.
Skill types to benchmark:
| Skill type | Encoding | Assertion |
| --- | --- | --- |
| Temporal awareness | Node with `valid_from`/`valid_until` + edge to decision | Agent qualifies claims with time bounds |
| Causal reasoning | Causal chain with 3+ hops | Agent explains cause before effect; cites chain |
| Policy compliance | `Policy` node with rules + `PolicyException` node | Agent respects constraints; flags exceptions |
| Precedent citation | `Precedent` node linked to decision | Agent references prior similar decision |
| Uncertainty flagging | Query with no matching context node | Agent expresses uncertainty rather than hallucinating |
| Approval escalation | `ApprovalChain` node with multi-level requirements | Agent escalates rather than deciding unilaterally |
Implementation target: benchmarks/context_graph_effectiveness/test_skill_injection.py
Pass/Fail Thresholds
All thresholds should live in benchmarks/context_graph_effectiveness/thresholds.py and be enforced by benchmarks/benchmark_runner.py --strict.
| Metric | Threshold | Rationale |
| --- | --- | --- |
| `decision_accuracy_delta` | > 0 | Context must improve, not degrade |
| `hallucination_rate_delta` | > 0 | Context must reduce invented facts |
| `stale_context_injection_rate` | < 0.05 | < 5% stale facts in retrieved context |
| `causal_chain_recall` | > 0.80 | 80% of true causal ancestors surfaced |
| `causal_chain_precision` | > 0.85 | < 15% spurious nodes in causal results |
| `policy_compliance_hit_rate` | > 0.90 | Violations detected with > 90% recall |
| `temporal_precision` | > 0.90 | < 10% temporally invalid results |
| `provenance_lineage_completeness` | == 1.0 | No gaps in lineage chain |
| `duplicate_detection_f1` | > 0.85 | Clean graph guarantee |
| `skill_activation_rate` | > 0.70 | Injected skills reliably elicit behavior |
| `explanation_completeness` | > 0.90 | Reasoning paths cover all inference steps |
Good vs Not Good — Definition
A context graph configuration is good when all threshold conditions above are met simultaneously. In practice, this means:
- Run the agent with and without context on the eval dataset.
- If
decision_accuracy_delta > 0 and hallucination_rate_delta > 0 — the context is helping.
- If
stale_context_injection_rate >= 0.05 — temporal filtering is broken; fix TemporalGraphRetriever.
- If
causal_chain_recall < 0.80 — causal traversal is incomplete; check edge types in CausalChainAnalyzer.
- If
policy_compliance_hit_rate < 0.90 — PolicyEngine is missing violations; review rule matching logic.
- If
skill_activation_rate < 0.70 — injected skill nodes are not reaching the prompt; check serialization path.
Implementation Plan
Phase 1 — Infrastructure
Phase 2 — Core Retrieval + Temporal
Phase 3 — Causal + Decision Intelligence
Phase 4 — Decision Quality Delta
Phase 5 — KG Algorithms + Reasoning
Phase 6 — Data Quality (Provenance, Conflicts, Dedup, Embeddings, Change Management)
Phase 7 — Skill Injection + CI Integration
Related
Notes
- All effectiveness benchmarks use deterministic mock LLMs — no real API calls in CI.
- Synthetic graph fixtures are seeded and committed as JSON in `benchmarks/context_graph_effectiveness/fixtures/` for reproducibility across machines.
- The `decision_accuracy_delta` metric is the headline number for community communication: "run the agent with and without context — if accuracy goes up and hallucinations drop, it's working."
- Query-type split is non-negotiable: a single aggregate score hides structural failures. A retriever can score 0.9 on lookup while completely failing causal traversal.
- Bi-temporal benchmarks (valid_from/valid_until + recorded_at/superseded_at) must test both temporal dimensions independently — domain time failures and transaction time failures require different fixes.
Summary
The current benchmark suite (
benchmarks/) measures throughput and latency — graph ops, retrieval pipeline, agentic memory, serialization. It does not measure the semantic effectiveness of context graphs: whether the context they surface is accurate, temporally valid, causally grounded, and whether it actually improves agent decision quality.This issue tracks the work to add a dedicated Context Graph Effectiveness benchmark track that covers every major capability of the
context,kg,reasoning,provenance,conflicts, anddeduplicationmodules — split by query type, with a clear "good vs not good" definition rooted in decision quality delta.Why This Matters
Context graphs are not vector retrieval. The difference is structural:
valid_from/valid_until+recorded_at/superseded_at(bi-temporal)CausalChainAnalyzerDecisionlifecycle —DecisionRecorder,DecisionQuery,PolicyEngineFlat average precision/recall collapses all of this into a single number and misses structural failures entirely. A context graph scoring 0.85 average recall may still inject stale facts, ignore causal ancestry, or apply an overridden policy — all invisible to a flat metric.
The right evaluation framework splits by capability dimension, measures decision quality delta as the headline signal, and enforces pass/fail thresholds in CI.
Capabilities to Cover
The following capability dimensions are extracted from the current API surface and must each have at least one benchmark class.
1. Core Graph Retrieval —
ContextRetriever/TemporalGraphRetrieverThe retriever supports hybrid retrieval (
hybrid_alphafrom 0 = pure vector to 1 = pure graph), intent-guided score boosting, multi-hop BFS traversal, semantic re-ranking (70% original score + 30% query-content similarity), and multi-source boosting (20% boost when a result appears in both vector and graph sources).Benchmark dimensions:
hybrid_alphasensitivity — does increasing graph weight improve structural queries?Implementation target:
benchmarks/context_graph_effectiveness/test_retrieval.py2. Temporal Validity —
BiTemporalFact,TemporalGraphRetriever,TemporalQueryRewriterSemantica supports full bi-temporal facts:
valid_from/valid_until(domain time) andrecorded_at/superseded_at(transaction time).TemporalQueryRewriterextracts temporal references from natural-language queries using regex (no LLM required) or an optional LLM for free-form phrasing.TemporalGraphRetrieveruses the extracted parameters to callreconstruct_at_time()on the retrieved subgraph.Benchmark dimensions:
valid_until < query_timevalid_from > query_timeTemporalQueryRewriterextractat_time,start_time,end_time, andtemporal_intentcorrectly across query phrasings (before/after/during/as-of/since)?T-90dreturns the graph state that was valid atT-90d, not the current statevalid_from/valid_untilwindows; only the correct one should be returnedTest cases should cover:
query_time == nowquery_timein the past (historical)query_timebetweenvalid_fromandvalid_untilof a competing pairvalid_until == TemporalBound.OPEN)Implementation target:
benchmarks/context_graph_effectiveness/test_temporal_validity.py3. Causal Chain Quality —
CausalChainAnalyzerCausalChainAnalyzertraverses explicit causal edges (CAUSES,REQUIRES,INFLUENCES, etc.) to reconstruct decision causality. This is the primary structural differentiator versus vector RAG — a vector retriever cannot follow a causal chain; it can only return chunks that mention causality.Benchmark dimensions:
Test fixture topologies to cover:
Implementation target:
benchmarks/context_graph_effectiveness/test_causal_chains.py4. Decision Intelligence —
DecisionRecorder,DecisionQuery,PolicyEngine,CausalChainAnalyzerDecision intelligence is the core enterprise capability. The API surface covers:
DecisionRecorder— records decisions with full context, entities, source documents, and cross-system context; assigns decision IDsDecisionQuery— queries decisions by category, date range, outcome, maker, with multi-hop graph traversal viamulti_hop_query()PolicyEngine— applies policies against decisions, checks compliance, raises exceptions, tracks policy versions viacreate_policy_with_versioning()CausalChainAnalyzer— traces which prior decisions influenced the current one; returns full causal chain with weightsApprovalChain— multi-level approval chain data modelPolicyException— exception/waiver tracking against policiesfind_exception_precedents(),analyze_decision_impact(),check_decision_compliance(),get_decision_statistics(),capture_decision_trace()Benchmark dimensions:
find_precedents()return the most relevant historical decisions for a given scenario? (hybrid search: semantic + structural + vector)find_precedents_advanced()outperform basic precedent search whenuse_kg_features=True?find_exception_precedents()should surface decisions where a policy exception was granted under similar circumstancesanalyze_decision_influence()assign higher influence scores to decisions with more downstream effects?analyze_decision_impact()should quantify propagation of a decision through the causal graphget_decision_statistics()should return accurate aggregate counts, approval rates, and category distributionscross_system_contextshould be retrievable by external system identifiersImplementation target:
benchmarks/context_graph_effectiveness/test_decision_intelligence.py5. Decision Quality Delta — Primary Headline Metric
The real-world signal: does context graph injection improve agent decision accuracy compared to no context?
Both should be positive for the context graph to be considered beneficial.
Protocol:
(scenario, ground_truth_decision)pairs. Start with 100 synthetic scenarios across categories: lending, healthcare, legal, e-commerce, HR.AgentContextdecision_accuracy_delta— primary metrichallucination_rate_delta— secondary metric (fewer invented entities/facts)citation_groundedness— fraction of agent claims traceable to a context nodepolicy_compliance_rate— fraction of decisions that satisfy applicable policies when context is injectedHallucination approximation for mock runs:
Implementation target:
benchmarks/context_graph_effectiveness/test_decision_quality.py6. KG Algorithm Quality —
CentralityCalculator,CommunityDetector,NodeEmbedder,PathFinder,LinkPredictor,SimilarityCalculatorContextGraphintegrates the full KG algorithm suite when instantiated withadvanced_analytics=True,kg_algorithms=True. These algorithms power influence analysis, precedent search, and relationship prediction.Benchmark dimensions:
CentralityCalculator(degree, betweenness, closeness, eigenvector)CommunityDetector(Louvain, Leiden, K-clique)NodeEmbedder(Node2Vec)PathFinder(BFS shortest path, all-pairs)LinkPredictor(score_link)SimilarityCalculator(cosine_similarity)Decision Intelligence integration check: When
analyze_decision_influence()is called, verify that decisions with higher betweenness centrality receive higher influence scores.Implementation target:
benchmarks/context_graph_effectiveness/test_kg_algorithms.py7. Reasoning Engine Quality —
Reasoner,GraphReasoner,TemporalReasoningEngine,ExplanationGeneratorThe reasoning module supports:
Reasoner.infer_facts(facts, rules)TemporalReasoningEngine— Allen's 13 interval relations (IntervalRelation) for temporal fact entailmentExplanationGenerator— generatesReasoningPathandJustificationfor inferred conclusionsBenchmark dimensions:
before,meets,overlaps,starts,during,finishes, and their inverses +equals) correctly classifiedExplanationGeneratorproduce aReasoningPaththat covers every inference step from premise to conclusion?Implementation target:
benchmarks/context_graph_effectiveness/test_reasoning_quality.py8. Provenance & Lineage Integrity —
ProvenanceManager,ProvenanceTrackerProvenance is the audit trail that makes context graphs trustworthy in high-stakes domains (finance, healthcare, legal). The module is W3C PROV-O compliant and tracks end-to-end lineage: document → chunk → entity → KG node → query → response.
Benchmark dimensions:
get_lineage()should trace back to the source document without gapsSourceReference(DOI + page + quote) correctly round-trips through storage and retrievalcompute_checksum()/verify_checksum()detect single-byte mutationsSQLiteStoragesurvives process restart and is read back identicallyGraphBuilderWithProvenanceandAlgorithmTrackerWithProvenanceshould add less than 15% overhead vs. non-provenance equivalentsImplementation target:
benchmarks/context_graph_effectiveness/test_provenance_integrity.py9. Conflict Detection & Resolution Quality —
ConflictDetector,ConflictResolverThe conflicts module detects value, type, relationship, temporal, and logical inconsistencies across sources and resolves them using voting, credibility-weighted, recency, or confidence-based strategies.
Benchmark dimensions:
VOTING— selects the majority value when N sources disagreeCREDIBILITY_WEIGHTED— selects the value from the highest-credibility sourceMOST_RECENT— selects the value with the latest timestampHIGHEST_CONFIDENCE— selects the value with the highest confidence scoreInvestigationGuideGeneratorshould produce a guide with at least one step per conflict typeImplementation target:
benchmarks/context_graph_effectiveness/test_conflict_resolution.py10. Deduplication Quality —
DuplicateDetector,EntityMerger,ClusterBuilderDeduplication keeps the context graph clean. The module supports Levenshtein, Jaro-Winkler, cosine, Jaccard, and multi-factor similarity; union-find and hierarchical clustering; and provenance-preserving merges.
Benchmark dimensions:
threshold=0.8keep_most_complete— merged entity should have the union of all non-null propertiesO(n×m)new-vs-existing comparison should be faster thanO(n²)all-pairs for large NImplementation target:
benchmarks/context_graph_effectiveness/test_deduplication_quality.py11. Embedding Quality —
EmbeddingGenerator,GraphEmbeddingManager,NodeEmbedderEmbeddings underpin semantic search, precedent retrieval, and node similarity. The module supports OpenAI, BGE, FastEmbed, and sentence-transformers providers, with five pooling strategies (mean, max, CLS, attention, hierarchical).
Benchmark dimensions:
GraphEmbeddingManagershould place structurally similar nodes (same community, same centrality range) closer in embedding spaceImplementation target:
`benchmarks/context_graph_effectiveness/test_embedding_quality.py`

### 12. Change Management & Versioning — `TemporalVersionManager`, `OntologyVersionManager`

The change management module provides versioned snapshots of the KG and ontology with SQLite persistence, SHA-256 checksums, and enterprise compliance support (HIPAA, SOX, FDA).
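The checksum round-trip the benchmark needs to assert can be sketched with the standard library. The schema and `compute_checksum` helper here are illustrative stand-ins, not the module's `SQLiteVersionStorage` API:

```python
import hashlib
import json
import sqlite3

def compute_checksum(snapshot: dict) -> str:
    # Canonical JSON (sorted keys) so the same graph always hashes identically.
    payload = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Persist a snapshot with its checksum, then verify after reading it back.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE versions (id INTEGER PRIMARY KEY, snapshot TEXT, checksum TEXT)")
snapshot = {"nodes": ["a", "b"], "edges": [["a", "b"]]}
db.execute(
    "INSERT INTO versions (snapshot, checksum) VALUES (?, ?)",
    (json.dumps(snapshot, sort_keys=True), compute_checksum(snapshot)),
)
raw, stored = db.execute("SELECT snapshot, checksum FROM versions").fetchone()
assert compute_checksum(json.loads(raw)) == stored  # round-trip is identical

# A single-byte mutation must produce a different checksum.
assert compute_checksum({"nodes": ["a", "c"], "edges": [["a", "b"]]}) != stored
```

The benchmark version of this uses an on-disk database and a real process restart between write and verify.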
Benchmark dimensions:
- Snapshots persisted via `SQLiteVersionStorage` are read back identically after process restart
- `TemporalVersionManager` adds less than 10% overhead to graph build time

Implementation target:
`benchmarks/context_graph_effectiveness/test_change_management.py`

### 13. Skill Injection Evaluation
Context graphs can encode behavioral scaffolding — structured nodes that, when serialized into an agent prompt, reliably elicit a specific reasoning pattern. This is distinct from factual retrieval: the node's structure (type, properties, relationships) matters as much as its content.
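A minimal sketch of what the benchmark measures: serializing a structured node into a prompt block and checking that its identifying content survives. The node shape, `serialize_skill` helper, and activation check are all illustrative assumptions, not the project's serialization path:

```python
# A structured "skill" node rendered into a prompt block.
def serialize_skill(node: dict) -> str:
    lines = [f"[{node['type']}] {node['name']}"]
    lines += [f"- {k}: {v}" for k, v in node["properties"].items()]
    return "\n".join(lines)

skill = {
    "type": "Policy",
    "name": "refund_policy",
    "properties": {"max_amount": 500, "requires_approval_above": 100},
}
prompt = "Context:\n" + serialize_skill(skill) + "\n\nUser: refund $250?"

# skill_activation_rate counts prompts in which the injected node survived
# serialization; here we just check for its identifying content.
activated = "refund_policy" in prompt and "max_amount" in prompt
assert activated
```

The full benchmark goes further: it checks not just that the node reached the prompt, but that the agent's output exhibits the reasoning pattern the node encodes.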
Skill types to benchmark:
- `valid_from` / `valid_until` + edge to decision
- `Policy` node with rules + `PolicyException` node
- `Precedent` node linked to decision
- `ApprovalChain` node with multi-level requirements

Implementation target:
`benchmarks/context_graph_effectiveness/test_skill_injection.py`

## Pass/Fail Thresholds
All thresholds should live in `benchmarks/context_graph_effectiveness/thresholds.py` and be enforced by `benchmarks/benchmark_runner.py --strict`:

- `decision_accuracy_delta`
- `hallucination_rate_delta`
- `stale_context_injection_rate`
- `causal_chain_recall`
- `causal_chain_precision`
- `policy_compliance_hit_rate`
- `temporal_precision`
- `provenance_lineage_completeness`
- `duplicate_detection_f1`
- `skill_activation_rate`
- `explanation_completeness`

## Good vs Not Good — Definition
A context graph configuration is good when all threshold conditions above are met simultaneously. In practice, this means:
- `decision_accuracy_delta > 0` and `hallucination_rate_delta > 0` — the context is helping.
- `stale_context_injection_rate >= 0.05` — temporal filtering is broken; fix `TemporalGraphRetriever`.
- `causal_chain_recall < 0.80` — causal traversal is incomplete; check edge types in `CausalChainAnalyzer`.
- `policy_compliance_hit_rate < 0.90` — `PolicyEngine` is missing violations; review rule matching logic.
- `skill_activation_rate < 0.70` — injected skill nodes are not reaching the prompt; check serialization path.

## Implementation Plan
### Phase 1 — Infrastructure
- Create `benchmarks/context_graph_effectiveness/` with `conftest.py`
- Shared QA fixtures (`fixtures/qa_pairs.json`)
- `thresholds.py` with all pass/fail values
- Extend `benchmarks/benchmark_runner.py` to include the new track and report effectiveness metrics alongside throughput metrics

### Phase 2 — Core Retrieval + Temporal
- `test_retrieval.py` — lookup, multi-hop, `hybrid_alpha` sweep, re-ranking quality
- `test_temporal_validity.py` — stale/future rates, NL rewriter accuracy, historical queries

### Phase 3 — Causal + Decision Intelligence
- `test_causal_chains.py` — linear, branching, diamond, cycle topologies
- `test_decision_intelligence.py` — precedent retrieval, policy compliance, influence scoring

### Phase 4 — Decision Quality Delta
- `test_decision_quality.py` — accuracy delta + hallucination delta with mock LLM, 100-scenario eval set
- `fixtures/scenarios/` — committed JSON eval dataset (lending, healthcare, legal, e-commerce, HR)

### Phase 5 — KG Algorithms + Reasoning
- `test_kg_algorithms.py` — centrality, community detection, link prediction, path finding
- `test_reasoning_quality.py` — Rete, Datalog, Allen intervals, explanation completeness

### Phase 6 — Data Quality (Provenance, Conflicts, Dedup, Embeddings, Change Management)
- `test_provenance_integrity.py`
- `test_conflict_resolution.py`
- `test_deduplication_quality.py`
- `test_embedding_quality.py`
- `test_change_management.py`

### Phase 7 — Skill Injection + CI Integration
- `test_skill_injection.py` — all 6 skill types
- `--strict` enforcement in CI
- `benchmarks/benchmark_results.md`
- `docs/benchmarks/skill_injection.md`

## Related
- `ContextRetriever` + `TemporalGraphRetriever`
- `CausalChainAnalyzer`
- `PolicyEngine`
- `Decision`, `Policy`, `PolicyException`, `ApprovalChain`
- `TemporalQueryRewriter`
- `BiTemporalFact`, `TemporalBound`
- (`TemporalGraphRetriever`, `TemporalQueryRewriter`)

## Notes
- All eval fixtures are committed under `benchmarks/context_graph_effectiveness/fixtures/` for reproducibility across machines.
- The `decision_accuracy_delta` metric is the headline number for community communication: "run the agent with and without context — if accuracy goes up and hallucinations drop, it's working."
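The headline computation is simple enough to pin down here. This is a sketch with hand-made illustrative result records, not the real eval harness; only the two delta formulas are the point:

```python
# Run the same scenarios with and without graph context, then compare.
def accuracy(results: list[dict]) -> float:
    return sum(r["answer"] == r["expected"] for r in results) / len(results)

def hallucination_rate(results: list[dict]) -> float:
    return sum(r["hallucinated"] for r in results) / len(results)

without_ctx = [
    {"answer": "approve", "expected": "deny",    "hallucinated": True},
    {"answer": "deny",    "expected": "deny",    "hallucinated": False},
    {"answer": "approve", "expected": "approve", "hallucinated": True},
    {"answer": "deny",    "expected": "approve", "hallucinated": False},
]
with_ctx = [
    {"answer": "deny",    "expected": "deny",    "hallucinated": False},
    {"answer": "deny",    "expected": "deny",    "hallucinated": False},
    {"answer": "approve", "expected": "approve", "hallucinated": False},
    {"answer": "deny",    "expected": "approve", "hallucinated": True},
]

# Positive deltas in both mean the context graph is earning its keep.
decision_accuracy_delta = accuracy(with_ctx) - accuracy(without_ctx)
hallucination_rate_delta = hallucination_rate(without_ctx) - hallucination_rate(with_ctx)
assert decision_accuracy_delta > 0 and hallucination_rate_delta > 0
```

Note the sign convention: both deltas are defined so that positive means "context helped", which keeps the `> 0` threshold readable.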