
[FEATURE] Expand Context Graph & Decision Intelligence Benchmarks #414

@KaifAhmad1

Description


Summary

The current benchmark suite (benchmarks/) measures throughput and latency — graph ops, retrieval pipeline, agentic memory, serialization. It does not measure the semantic effectiveness of context graphs: whether the context they surface is accurate, temporally valid, causally grounded, and whether it actually improves agent decision quality.

This issue tracks the work to add a dedicated Context Graph Effectiveness benchmark track that covers every major capability of the context, kg, reasoning, provenance, conflicts, and deduplication modules — split by query type, with a clear "good vs not good" definition rooted in decision quality delta.


Why This Matters

Context graphs are not vector retrieval. The difference is structural:

| Dimension | Vector RAG | Semantica Context Graph |
|---|---|---|
| Storage | Chunk embeddings | Typed nodes + directed edges |
| Temporal | None | valid_from / valid_until + recorded_at / superseded_at (bi-temporal) |
| Causal | None | Explicit causal edges traversed by CausalChainAnalyzer |
| Decision memory | None | Full Decision lifecycle — DecisionRecorder, DecisionQuery, PolicyEngine |
| Reasoning | Implicit (LLM) | Explicit: Rete, SPARQL, Datalog, abductive, deductive, Allen intervals |
| Provenance | None | W3C PROV-O end-to-end lineage (doc → chunk → entity → KG → query → response) |
| Conflict handling | None | Multi-strategy resolution (voting, credibility-weighted, temporal, confidence) |
| Deduplication | None | Union-find + hierarchical clustering with provenance-preserving merges |

Flat average precision/recall collapses all of this into a single number and misses structural failures entirely. A context graph scoring 0.85 average recall may still inject stale facts, ignore causal ancestry, or apply an overridden policy — all invisible to a flat metric.

The right evaluation framework splits by capability dimension, measures decision quality delta as the headline signal, and enforces pass/fail thresholds in CI.


Capabilities to Cover

The following capability dimensions are extracted from the current API surface and must each have at least one benchmark class.

1. Core Graph Retrieval — ContextRetriever / TemporalGraphRetriever

The retriever supports hybrid retrieval (hybrid_alpha from 0 = pure vector to 1 = pure graph), intent-guided score boosting, multi-hop BFS traversal, semantic re-ranking (70% original score + 30% query-content similarity), and multi-source boosting (20% boost when a result appears in both vector and graph sources).
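The scoring scheme described above can be sketched directly. The helper names below are hypothetical harness code for the benchmark, not ContextRetriever's actual API; only the constants (alpha blend, 70/30 re-rank, 20% boost) come from the description:

```python
def hybrid_score(vector_score: float, graph_score: float, hybrid_alpha: float) -> float:
    """Blend vector and graph scores: alpha=0 is pure vector, alpha=1 is pure graph."""
    return (1 - hybrid_alpha) * vector_score + hybrid_alpha * graph_score

def rerank_score(original: float, query_content_sim: float) -> float:
    """Semantic re-ranking: 70% original score + 30% query-content similarity."""
    return 0.7 * original + 0.3 * query_content_sim

def apply_multi_source_boost(score: float, in_vector: bool, in_graph: bool) -> float:
    """20% boost when a result appears in both vector and graph sources."""
    return score * 1.2 if (in_vector and in_graph) else score
```

The multi-source boost assertion in the dimensions below reduces to: for equal base scores, a dual-source result must outrank a single-source one after boosting.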

Benchmark dimensions:

  • Lookup (direct node by label/ID) — hit rate, latency
  • Multi-hop traversal (2–3 hops) — path recall, hop precision
  • hybrid_alpha sensitivity — does increasing graph weight improve structural queries?
  • Multi-source boost verification — results in both vector and graph should rank above single-source
  • Semantic re-ranking quality — reranked list vs. raw score list precision@5

Implementation target: benchmarks/context_graph_effectiveness/test_retrieval.py


2. Temporal Validity — BiTemporalFact, TemporalGraphRetriever, TemporalQueryRewriter

Semantica supports full bi-temporal facts: valid_from / valid_until (domain time) and recorded_at / superseded_at (transaction time). TemporalQueryRewriter extracts temporal references from natural-language queries using regex (no LLM required) or an optional LLM for free-form phrasing. TemporalGraphRetriever uses the extracted parameters to call reconstruct_at_time() on the retrieved subgraph.
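For the stale/future-injection checks to be well-defined, the benchmark needs a reference notion of bi-temporal validity. A minimal sketch — `Fact` and `valid_at` are illustrative stand-ins for BiTemporalFact, not Semantica's API:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Fact:
    value: str
    valid_from: datetime               # domain time: when the fact became true
    valid_until: Optional[datetime]    # None models an open-ended validity window
    recorded_at: datetime              # transaction time: when the system learned it
    superseded_at: Optional[datetime]  # when a later record replaced this one

def valid_at(fact: Fact, query_time: datetime, as_of: datetime) -> bool:
    """True iff the fact was valid in domain time at query_time AND still the
    known record (recorded, not yet superseded) in transaction time as of as_of."""
    domain_ok = fact.valid_from <= query_time and (
        fact.valid_until is None or query_time < fact.valid_until)
    txn_ok = fact.recorded_at <= as_of and (
        fact.superseded_at is None or as_of < fact.superseded_at)
    return domain_ok and txn_ok
```

A retrieved fact failing the domain clause is a stale/future injection; one failing the transaction clause is a historical-query error — the two dimensions fail independently, which is why the Notes section insists on testing them separately.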

Benchmark dimensions:

  • Stale-context injection rate — fraction of retrieved nodes/edges where valid_until < query_time
  • Future-context injection rate — fraction where valid_from > query_time
  • Temporal precision — valid-at-query-time results / all retrieved results
  • Temporal recall — valid-at-query-time results retrieved / all valid-at-query-time results in graph
  • Query rewriter accuracy — does TemporalQueryRewriter extract at_time, start_time, end_time, and temporal_intent correctly across query phrasings (before/after/during/as-of/since)?
  • Historical query correctness — querying at T-90d returns the graph state that was valid at T-90d, not the current state
  • Competing validity window disambiguation — two nodes with overlapping valid_from/valid_until windows; only the correct one should be returned

Test cases should cover:

  • query_time == now
  • query_time in the past (historical)
  • query_time between valid_from and valid_until of a competing pair
  • Open-ended facts (valid_until == TemporalBound.OPEN)
  • NL temporal phrases: "last week", "before the 2021 merger", "as of Q2 2022", "since the policy change"

Implementation target: benchmarks/context_graph_effectiveness/test_temporal_validity.py


3. Causal Chain Quality — CausalChainAnalyzer

CausalChainAnalyzer traverses explicit causal edges (CAUSES, REQUIRES, INFLUENCES, etc.) to reconstruct decision causality. This is the primary structural differentiator versus vector RAG — a vector retriever cannot follow a causal chain; it can only return chunks that mention causality.

Benchmark dimensions:

  • Causal chain recall — fraction of true causal ancestors retrieved for a given effect node
  • Causal chain precision — fraction of retrieved nodes that are actual ancestors (no spurious nodes)
  • Root cause accuracy — does traversal identify the correct root at depth N?
  • Spurious-edge rate — non-causal nodes surfaced in a causal query
  • Chain depth accuracy — correct retrieval at depths 1, 2, 3+ hops

Test fixture topologies to cover:

  • Linear chain: A → B → C → D
  • Branching: A → B, A → C, B → D
  • Diamond (convergence): A → B → D, A → C → D
  • Cycle detection (should not loop): A → B → C → A
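
A fixture-side sketch of ancestor traversal and chain recall over these topologies — harness code for building ground truth, not CausalChainAnalyzer's implementation:

```python
def causal_ancestors(edges, effect, max_depth=None):
    """BFS over reversed causal edges; the visited set guards against cycles."""
    parents = {}
    for cause, eff in edges:
        parents.setdefault(eff, set()).add(cause)
    seen, frontier, depth = set(), {effect}, 0
    while frontier and (max_depth is None or depth < max_depth):
        frontier = {p for n in frontier for p in parents.get(n, set())} - seen
        seen |= frontier
        depth += 1
    return seen

def chain_recall(retrieved: set, true_ancestors: set) -> float:
    return len(retrieved & true_ancestors) / max(len(true_ancestors), 1)
```

On the diamond topology, the ancestors of D are {A, B, C}; a retriever that follows only one branch scores 2/3 recall, which is exactly the structural failure a flat metric would hide.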

Implementation target: benchmarks/context_graph_effectiveness/test_causal_chains.py


4. Decision Intelligence — DecisionRecorder, DecisionQuery, PolicyEngine, CausalChainAnalyzer

Decision intelligence is the core enterprise capability. The API surface covers:

  • DecisionRecorder — records decisions with full context, entities, source documents, and cross-system context; assigns decision IDs
  • DecisionQuery — queries decisions by category, date range, outcome, maker, with multi-hop graph traversal via multi_hop_query()
  • PolicyEngine — applies policies against decisions, checks compliance, raises exceptions, tracks policy versions via create_policy_with_versioning()
  • CausalChainAnalyzer — traces which prior decisions influenced the current one; returns full causal chain with weights
  • ApprovalChain — multi-level approval chain data model
  • PolicyException — exception/waiver tracking against policies
  • Convenience functions: find_exception_precedents(), analyze_decision_impact(), check_decision_compliance(), get_decision_statistics(), capture_decision_trace()

Benchmark dimensions:

  • Precedent retrieval accuracy — does find_precedents() return the most relevant historical decisions for a given scenario? (hybrid search: semantic + structural + vector)
  • Advanced precedent search — does find_precedents_advanced() outperform basic precedent search when use_kg_features=True?
  • Policy compliance hit rate — fraction of compliant decisions correctly identified as compliant; fraction of violations correctly flagged
  • Exception precedent retrieval — find_exception_precedents() should surface decisions where a policy exception was granted under similar circumstances
  • Causal influence score accuracy — does analyze_decision_influence() assign higher influence scores to decisions with more downstream effects?
  • Decision impact analysis — analyze_decision_impact() should quantify propagation of a decision through the causal graph
  • Decision statistics correctness — get_decision_statistics() should return accurate aggregate counts, approval rates, and category distributions
  • Cross-system context capture — decisions with cross_system_context should be retrievable by external system identifiers

Implementation target: benchmarks/context_graph_effectiveness/test_decision_intelligence.py


5. Decision Quality Delta — Primary Headline Metric

The real-world signal: does context graph injection improve agent decision accuracy compared to no context?

decision_accuracy_delta = accuracy(agent + context_graph) - accuracy(agent_alone)
hallucination_rate_delta = hallucination_rate(agent_alone) - hallucination_rate(agent + context_graph)

Both should be positive for the context graph to be considered beneficial.

Protocol:

  1. Define a fixed eval dataset of (scenario, ground_truth_decision) pairs. Start with 100 synthetic scenarios across categories: lending, healthcare, legal, e-commerce, HR.
  2. Run each scenario twice against a deterministic mock LLM (no API cost, no flakiness):
    • Baseline: agent receives only the raw scenario text
    • With context: agent receives scenario text + context injected from AgentContext
  3. Score structured output against ground truth using exact-match + partial-credit rubric.
  4. Report:
    • decision_accuracy_delta — primary metric
    • hallucination_rate_delta — secondary metric (fewer invented entities/facts)
    • citation_groundedness — fraction of agent claims traceable to a context node
    • policy_compliance_rate — fraction of decisions that satisfy applicable policies when context is injected
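
The two-pass protocol above can be sketched as a harness loop. Here `agent` and `retrieve_context` stand in for the deterministic mock LLM and the AgentContext injection path; the exact-match scoring is simplified (no partial credit):

```python
def evaluate_decision_quality(scenarios, agent, retrieve_context):
    """Run each (scenario, ground_truth) pair with and without injected context."""
    base_hits = ctx_hits = 0
    for scenario, ground_truth in scenarios:
        # Baseline pass: raw scenario text only.
        base_hits += int(agent(scenario) == ground_truth)
        # Context pass: scenario text plus serialized graph context.
        ctx_hits += int(agent(scenario + "\n" + retrieve_context(scenario)) == ground_truth)
    n = max(len(scenarios), 1)
    return {"baseline_accuracy": base_hits / n,
            "context_accuracy": ctx_hits / n,
            "decision_accuracy_delta": (ctx_hits - base_hits) / n}
```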

Hallucination approximation for mock runs:

def hallucination_rate(agent_output: str, graph: ContextGraph) -> float:
    # Extract entity mentions with a cheap NER pass (no LLM call in mock runs).
    entities = lightweight_ner(agent_output)
    # Any mentioned entity absent from the context graph counts as hallucinated.
    known = {n["id"] for n in graph.find_nodes()}
    return len([e for e in entities if e not in known]) / max(len(entities), 1)

Implementation target: benchmarks/context_graph_effectiveness/test_decision_quality.py


6. KG Algorithm Quality — CentralityCalculator, CommunityDetector, NodeEmbedder, PathFinder, LinkPredictor, SimilarityCalculator

ContextGraph integrates the full KG algorithm suite when instantiated with advanced_analytics=True, kg_algorithms=True. These algorithms power influence analysis, precedent search, and relationship prediction.

Benchmark dimensions:

| Algorithm | What to measure |
|---|---|
| CentralityCalculator (degree, betweenness, closeness, eigenvector) | Correctness on known graphs (star, chain, clique); convergence iterations for eigenvector |
| CommunityDetector (Louvain, Leiden, K-clique) | Modularity score on synthetic graphs with planted communities; NMI against ground-truth partition |
| NodeEmbedder (Node2Vec) | Embedding similarity between semantically linked nodes vs. unlinked nodes |
| PathFinder (BFS shortest path, all-pairs) | Correctness + latency for graphs of N = 100, 1K, 10K nodes |
| LinkPredictor (score_link) | AUC-ROC for predicting held-out edges |
| SimilarityCalculator (cosine_similarity) | Correlation between structural similarity scores and semantic similarity scores |
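
Correctness on known graphs means asserting against closed-form values. A sketch for degree centrality on a star graph — harness code with its own toy implementation, not CentralityCalculator's API; the same pattern extends to chains and cliques:

```python
def degree_centrality(adj: dict) -> dict:
    """Degree centrality normalized by (n - 1), the standard definition."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

# Star graph S4: center "c" connected to 4 leaves.
# Closed-form expectation: center = 1.0, every leaf = 0.25.
star = {"c": {"l1", "l2", "l3", "l4"},
        "l1": {"c"}, "l2": {"c"}, "l3": {"c"}, "l4": {"c"}}
```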

Decision Intelligence integration check: When analyze_decision_influence() is called, verify that decisions with higher betweenness centrality receive higher influence scores.

Implementation target: benchmarks/context_graph_effectiveness/test_kg_algorithms.py


7. Reasoning Engine Quality — Reasoner, GraphReasoner, TemporalReasoningEngine, ExplanationGenerator

The reasoning module supports:

  • Rete engine — forward-chaining rule evaluation on graph facts
  • SPARQL reasoner — SPARQL query evaluation
  • Datalog reasoner — recursive Datalog rule evaluation
  • Abductive / deductive reasoning — via Reasoner.infer_facts(facts, rules)
  • TemporalReasoningEngine — Allen's 13 interval relations (IntervalRelation) for temporal fact entailment
  • ExplanationGenerator — generates ReasoningPath and Justification for inferred conclusions

Benchmark dimensions:

  • Rule inference accuracy — given a known set of facts and rules, does the Rete engine derive all expected conclusions and no spurious ones?
  • Datalog recursive accuracy — transitive closure of a relation (e.g., ancestor/2) computed correctly
  • Allen interval relation coverage — all 13 relations (before, meets, overlaps, starts, during, finishes, and their inverses + equals) correctly classified
  • Explanation completeness — does ExplanationGenerator produce a ReasoningPath that covers every inference step from premise to conclusion?
  • SPARQL result correctness — SPARQL queries against synthetic RDF-shaped facts return expected result sets
  • Reasoning latency — for N=1K facts and 20 rules, Rete evaluation should complete under threshold
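
Allen's 13 relations have a direct endpoint-comparison implementation that the benchmark can use as its ground-truth oracle. A sketch, assuming proper intervals (start strictly before end):

```python
def allen_relation(a_start, a_end, b_start, b_end) -> str:
    """Classify the Allen relation of interval A relative to B (13 relations)."""
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met_by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started_by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished_by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    return "overlaps" if a_start < b_start else "overlapped_by"
```

The coverage benchmark then amounts to enumerating endpoint configurations for each of the 13 relations and asserting the engine's IntervalRelation classification agrees with this oracle.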

Implementation target: benchmarks/context_graph_effectiveness/test_reasoning_quality.py


8. Provenance & Lineage Integrity — ProvenanceManager, ProvenanceTracker

Provenance is the audit trail that makes context graphs trustworthy in high-stakes domains (finance, healthcare, legal). The module is W3C PROV-O compliant and tracks end-to-end lineage: document → chunk → entity → KG node → query → response.

Benchmark dimensions:

  • Lineage completeness — given a response entity, get_lineage() should trace back to the source document without gaps
  • Source citation accuracy — SourceReference (DOI + page + quote) correctly round-trips through storage and retrieval
  • Checksum integrity — compute_checksum() / verify_checksum() detect single-byte mutations
  • SQLite persistence round-trip — provenance written to SQLiteStorage survives process restart and is read back identically
  • Provenance overhead — GraphBuilderWithProvenance and AlgorithmTrackerWithProvenance should add less than 15% overhead vs. non-provenance equivalents

Implementation target: benchmarks/context_graph_effectiveness/test_provenance_integrity.py


9. Conflict Detection & Resolution Quality — ConflictDetector, ConflictResolver

The conflicts module detects value, type, relationship, temporal, and logical inconsistencies across sources and resolves them using voting, credibility-weighted, recency, or confidence-based strategies.

Benchmark dimensions:

  • Detection recall by conflict type — for each of value / type / temporal / logical conflicts, fraction of injected conflicts detected
  • Detection precision — fraction of flagged conflicts that are true conflicts (no false positives)
  • Resolution strategy correctness:
    • VOTING — selects the majority value when N sources disagree
    • CREDIBILITY_WEIGHTED — selects the value from the highest-credibility source
    • MOST_RECENT — selects the value with the latest timestamp
    • HIGHEST_CONFIDENCE — selects the value with the highest confidence score
  • Severity scoring calibration — high-severity conflicts (affecting many sources, critical properties) should score higher than low-severity ones
  • Investigation guide completeness — InvestigationGuideGenerator should produce a guide with at least one step per conflict type
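
The four resolution strategies reduce to simple selection rules that ground-truth fixtures can assert against. The claim schema below is illustrative, not ConflictResolver's API:

```python
from collections import Counter

def resolve(claims: list, strategy: str):
    """Each claim: {"value", "source_credibility", "timestamp", "confidence"}."""
    if strategy == "VOTING":
        # Majority value across sources (ties break on first-seen order).
        return Counter(c["value"] for c in claims).most_common(1)[0][0]
    if strategy == "CREDIBILITY_WEIGHTED":
        return max(claims, key=lambda c: c["source_credibility"])["value"]
    if strategy == "MOST_RECENT":
        return max(claims, key=lambda c: c["timestamp"])["value"]
    if strategy == "HIGHEST_CONFIDENCE":
        return max(claims, key=lambda c: c["confidence"])["value"]
    raise ValueError(f"unknown strategy: {strategy}")
```

A good fixture makes the strategies disagree on purpose — e.g. the majority value comes from low-credibility sources — so each strategy's correctness is tested independently.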

Implementation target: benchmarks/context_graph_effectiveness/test_conflict_resolution.py


10. Deduplication Quality — DuplicateDetector, EntityMerger, ClusterBuilder

Deduplication keeps the context graph clean. The module supports Levenshtein, Jaro-Winkler, cosine, Jaccard, and multi-factor similarity; union-find and hierarchical clustering; and provenance-preserving merges.

Benchmark dimensions:

  • Duplicate detection recall — fraction of injected duplicate pairs detected at threshold=0.8
  • Duplicate detection precision — fraction of flagged pairs that are true duplicates
  • F1 by similarity method — compare Levenshtein vs. Jaro-Winkler vs. cosine vs. multi-factor; multi-factor should dominate
  • Cluster quality — NMI of union-find clusters vs. ground-truth entity groups
  • Merge strategy correctness:
    • keep_most_complete — merged entity should have the union of all non-null properties
    • Provenance preservation — merged entity's metadata should reference all source entities
  • Incremental detection efficiency — O(n×m) new-vs-existing comparison should be faster than O(n²) all-pairs for large N
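
Cluster construction from detected duplicate pairs is plain union-find; a sketch the ground-truth NMI comparison could run against — harness code, not ClusterBuilder's implementation:

```python
def union_find_clusters(items, duplicate_pairs):
    """Group items into clusters via union-find over detected duplicate pairs."""
    parent = {i: i for i in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in items:
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())
```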

Implementation target: benchmarks/context_graph_effectiveness/test_deduplication_quality.py


11. Embedding Quality — EmbeddingGenerator, GraphEmbeddingManager, NodeEmbedder

Embeddings underpin semantic search, precedent retrieval, and node similarity. The module supports OpenAI, BGE, FastEmbed, and sentence-transformers providers, with five pooling strategies (mean, max, CLS, attention, hierarchical).

Benchmark dimensions:

  • Semantic coherence — cosine similarity between embeddings of semantically related entities should be higher than between unrelated entities
  • Provider consistency — embeddings from different providers for the same text should produce consistent similarity rankings (Spearman rank correlation > 0.7)
  • Pooling strategy impact — for long-form text, hierarchical pooling should outperform mean pooling on retrieval accuracy
  • Hash-fallback stability — SHA-256 hash-based fallback embeddings must be deterministic (same input → same vector) and stable across runs
  • GraphEmbeddingManager correctness — node embeddings computed by GraphEmbeddingManager should place structurally similar nodes (same community, same centrality range) closer in embedding space
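
The hash-fallback determinism property is easy to pin down in a fixture. One generic construction — the module's actual fallback scheme may differ; this only illustrates the property under test (same input, same vector, across runs and machines):

```python
import hashlib
import struct

def hash_embedding(text: str, dim: int = 16) -> list:
    """Deterministic fallback embedding: expand SHA-256 digests of the input
    into `dim` floats in [0, 1). No model, no randomness, no run-to-run drift."""
    vec, counter = [], 0
    while len(vec) < dim:
        digest = hashlib.sha256(f"{text}:{counter}".encode()).digest()
        for i in range(0, len(digest) - 3, 4):
            if len(vec) == dim:
                break
            (n,) = struct.unpack(">I", digest[i:i + 4])
            vec.append(n / 2**32)
        counter += 1  # extend with a fresh digest when dim > 8
    return vec
```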

Implementation target: benchmarks/context_graph_effectiveness/test_embedding_quality.py


12. Change Management & Versioning — TemporalVersionManager, OntologyVersionManager

The change management module provides versioned snapshots of the KG and ontology with SQLite persistence, SHA-256 checksums, and enterprise compliance support (HIPAA, SOX, FDA).

Benchmark dimensions:

  • Snapshot fidelity — a snapshot taken at time T, when restored, should be graph-isomorphic to the original
  • Version diff correctness — diff between V1 and V2 should contain exactly the nodes/edges added, removed, or modified
  • Checksum change detection — any mutation to a versioned snapshot should change its checksum
  • SQLite persistence — versions written to SQLiteVersionStorage are read back identically after process restart
  • Version manager overhead — TemporalVersionManager should add less than 10% overhead to graph build time
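
Checksumming over a canonical serialization makes the change-detection assertion concrete: reordering must not change the checksum, any content mutation must. A generic sketch, not the module's actual serialization format:

```python
import hashlib
import json

def snapshot_checksum(nodes: list, edges: list) -> str:
    """SHA-256 over a canonical JSON form: key order and list order are
    normalized so the checksum depends only on graph content."""
    canon = lambda xs: sorted(xs, key=lambda x: json.dumps(x, sort_keys=True))
    payload = json.dumps({"nodes": canon(nodes), "edges": canon(edges)},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```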

Implementation target: benchmarks/context_graph_effectiveness/test_change_management.py


13. Skill Injection Evaluation

Context graphs can encode behavioral scaffolding — structured nodes that, when serialized into an agent prompt, reliably elicit a specific reasoning pattern. This is distinct from factual retrieval: the node's structure (type, properties, relationships) matters as much as its content.

Skill types to benchmark:

| Skill type | Encoding | Assertion |
|---|---|---|
| Temporal awareness | Node with valid_from/valid_until + edge to decision | Agent qualifies claims with time bounds |
| Causal reasoning | Causal chain with 3+ hops | Agent explains cause before effect; cites chain |
| Policy compliance | Policy node with rules + PolicyException node | Agent respects constraints; flags exceptions |
| Precedent citation | Precedent node linked to decision | Agent references prior similar decision |
| Uncertainty flagging | Query with no matching context node | Agent expresses uncertainty rather than hallucinating |
| Approval escalation | ApprovalChain node with multi-level requirements | Agent escalates rather than deciding unilaterally |

Implementation target: benchmarks/context_graph_effectiveness/test_skill_injection.py


Pass/Fail Thresholds

All thresholds should live in benchmarks/context_graph_effectiveness/thresholds.py and be enforced by benchmarks/benchmark_runner.py --strict.

| Metric | Threshold | Rationale |
|---|---|---|
| decision_accuracy_delta | > 0 | Context must improve, not degrade |
| hallucination_rate_delta | > 0 | Context must reduce invented facts |
| stale_context_injection_rate | < 0.05 | < 5% stale facts in retrieved context |
| causal_chain_recall | > 0.80 | 80% of true causal ancestors surfaced |
| causal_chain_precision | > 0.85 | < 15% spurious nodes in causal results |
| policy_compliance_hit_rate | > 0.90 | Violations detected with > 90% recall |
| temporal_precision | > 0.90 | < 10% temporally invalid results |
| provenance_lineage_completeness | == 1.0 | No gaps in lineage chain |
| duplicate_detection_f1 | > 0.85 | Clean graph guarantee |
| skill_activation_rate | > 0.70 | Injected skills reliably elicit behavior |
| explanation_completeness | > 0.90 | Reasoning paths cover all inference steps |

Good vs Not Good — Definition

A context graph configuration is good when all threshold conditions above are met simultaneously. In practice, this means:

  1. Run the agent with and without context on the eval dataset.
  2. If decision_accuracy_delta > 0 and hallucination_rate_delta > 0 — the context is helping.
  3. If stale_context_injection_rate >= 0.05 — temporal filtering is broken; fix TemporalGraphRetriever.
  4. If causal_chain_recall < 0.80 — causal traversal is incomplete; check edge types in CausalChainAnalyzer.
  5. If policy_compliance_hit_rate < 0.90 — PolicyEngine is missing violations; review rule matching logic.
  6. If skill_activation_rate < 0.70 — injected skill nodes are not reaching the prompt; check serialization path.

Implementation Plan

Phase 1 — Infrastructure

  • Create benchmarks/context_graph_effectiveness/ with conftest.py
    • Synthetic graph fixture factory (seeded, deterministic, multiple topologies)
    • Deterministic mock LLM stub (no API cost)
    • Ground-truth Q&A dataset loader (fixtures/qa_pairs.json)
    • thresholds.py with all pass/fail values
  • Extend benchmarks/benchmark_runner.py to include the new track and report effectiveness metrics alongside throughput metrics

Phase 2 — Core Retrieval + Temporal

  • test_retrieval.py — lookup, multi-hop, hybrid_alpha sweep, re-ranking quality
  • test_temporal_validity.py — stale/future rates, NL rewriter accuracy, historical queries

Phase 3 — Causal + Decision Intelligence

  • test_causal_chains.py — linear, branching, diamond, cycle topologies
  • test_decision_intelligence.py — precedent retrieval, policy compliance, influence scoring

Phase 4 — Decision Quality Delta

  • test_decision_quality.py — accuracy delta + hallucination delta with mock LLM, 100-scenario eval set
  • fixtures/scenarios/ — committed JSON eval dataset (lending, healthcare, legal, e-commerce, HR)

Phase 5 — KG Algorithms + Reasoning

  • test_kg_algorithms.py — centrality, community detection, link prediction, path finding
  • test_reasoning_quality.py — Rete, Datalog, Allen intervals, explanation completeness

Phase 6 — Data Quality (Provenance, Conflicts, Dedup, Embeddings, Change Management)

  • test_provenance_integrity.py
  • test_conflict_resolution.py
  • test_deduplication_quality.py
  • test_embedding_quality.py
  • test_change_management.py

Phase 7 — Skill Injection + CI Integration

  • test_skill_injection.py — all 6 skill types
  • Add effectiveness track to CI with --strict
  • Add effectiveness section to benchmarks/benchmark_results.md
  • Document skill encoding conventions in docs/benchmarks/skill_injection.md


Notes

  • All effectiveness benchmarks use deterministic mock LLMs — no real API calls in CI.
  • Synthetic graph fixtures are seeded and committed as JSON in benchmarks/context_graph_effectiveness/fixtures/ for reproducibility across machines.
  • The decision_accuracy_delta metric is the headline number for community communication: "run the agent with and without context — if accuracy goes up and hallucinations drop, it's working."
  • Query-type split is non-negotiable: a single aggregate score hides structural failures. A retriever can score 0.9 on lookup while completely failing causal traversal.
  • Bi-temporal benchmarks (valid_from/valid_until + recorded_at/superseded_at) must test both temporal dimensions independently — domain time failures and transaction time failures require different fixes.

Metadata

Labels

enhancement (New feature or request)

Status

In progress