
[FEATURE] Expand Context Graph & Decision Intelligence Benchmarks #414

@KaifAhmad1

Description


Summary

The current benchmark suite (benchmarks/) measures throughput and latency — graph ops, retrieval pipeline, agentic memory, serialization. It does not measure the semantic effectiveness of context graphs: whether the context they surface is accurate, temporally valid, causally grounded, and whether it actually improves agent decision quality.

This issue tracks the work to add a dedicated Context Graph Effectiveness benchmark track that covers every major capability of the context, kg, reasoning, provenance, conflicts, and deduplication modules — split by query type, with a clear "good vs not good" definition rooted in decision quality delta.


Why This Matters

Context graphs are not vector retrieval. The difference is structural:

| Dimension | Vector RAG | Semantica Context Graph |
|---|---|---|
| Storage | Chunk embeddings | Typed nodes + directed edges |
| Temporal | None | valid_from / valid_until + recorded_at / superseded_at (bi-temporal) |
| Causal | None | Explicit causal edges traversed by CausalChainAnalyzer |
| Decision memory | None | Full Decision lifecycle — DecisionRecorder, DecisionQuery, PolicyEngine |
| Reasoning | Implicit (LLM) | Explicit: Rete, SPARQL, Datalog, abductive, deductive, Allen intervals |
| Provenance | None | W3C PROV-O end-to-end lineage (doc → chunk → entity → KG → query → response) |
| Conflict handling | None | Multi-strategy resolution (voting, credibility-weighted, temporal, confidence) |
| Deduplication | None | Union-find + hierarchical clustering with provenance-preserving merges |

Flat average precision/recall collapses all of this into a single number and misses structural failures entirely. A context graph scoring 0.85 average recall may still inject stale facts, ignore causal ancestry, or apply an overridden policy — all invisible to a flat metric.

The right evaluation framework splits by capability dimension, measures decision quality delta as the headline signal, and enforces pass/fail thresholds in CI.


Capabilities to Cover

The following capability dimensions are extracted from the current API surface and must each have at least one benchmark class.

1. Core Graph Retrieval — ContextRetriever / TemporalGraphRetriever

The retriever supports hybrid retrieval (hybrid_alpha from 0 = pure vector to 1 = pure graph), intent-guided score boosting, multi-hop BFS traversal, semantic re-ranking (70% original score + 30% query-content similarity), and multi-source boosting (20% boost when a result appears in both vector and graph sources).
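The scoring scheme described above can be sketched directly. The helper names below are hypothetical harness code for the benchmark, not ContextRetriever's actual API; only the constants (alpha blend, 70/30 re-rank, 20% boost) come from the description:

```python
def hybrid_score(vector_score: float, graph_score: float, hybrid_alpha: float) -> float:
    """Blend vector and graph scores: alpha=0 is pure vector, alpha=1 is pure graph."""
    return (1 - hybrid_alpha) * vector_score + hybrid_alpha * graph_score

def rerank_score(original: float, query_content_sim: float) -> float:
    """Semantic re-ranking: 70% original score + 30% query-content similarity."""
    return 0.7 * original + 0.3 * query_content_sim

def apply_multi_source_boost(score: float, in_vector: bool, in_graph: bool) -> float:
    """20% boost when a result appears in both vector and graph sources."""
    return score * 1.2 if (in_vector and in_graph) else score
```

The multi-source boost assertion in the dimensions below reduces to: for equal base scores, a dual-source result must outrank a single-source one after boosting.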

Benchmark dimensions:

  • Lookup (direct node by label/ID) — hit rate, latency
  • Multi-hop traversal (2–3 hops) — path recall, hop precision
  • hybrid_alpha sensitivity — does increasing graph weight improve structural queries?
  • Multi-source boost verification — results in both vector and graph should rank above single-source
  • Semantic re-ranking quality — reranked list vs. raw score list precision@5

Implementation target: benchmarks/context_graph_effectiveness/test_retrieval.py


2. Temporal Validity — BiTemporalFact, TemporalGraphRetriever, TemporalQueryRewriter

Semantica supports full bi-temporal facts: valid_from / valid_until (domain time) and recorded_at / superseded_at (transaction time). TemporalQueryRewriter extracts temporal references from natural-language queries using regex (no LLM required) or an optional LLM for free-form phrasing. TemporalGraphRetriever uses the extracted parameters to call reconstruct_at_time() on the retrieved subgraph.
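For the stale/future-injection checks to be well-defined, the benchmark needs a reference notion of bi-temporal validity. A minimal sketch — `Fact` and `valid_at` are illustrative stand-ins for BiTemporalFact, not Semantica's API:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Fact:
    value: str
    valid_from: datetime               # domain time: when the fact became true
    valid_until: Optional[datetime]    # None models an open-ended validity window
    recorded_at: datetime              # transaction time: when the system learned it
    superseded_at: Optional[datetime]  # when a later record replaced this one

def valid_at(fact: Fact, query_time: datetime, as_of: datetime) -> bool:
    """True iff the fact was valid in domain time at query_time AND still the
    known record (recorded, not yet superseded) in transaction time as of as_of."""
    domain_ok = fact.valid_from <= query_time and (
        fact.valid_until is None or query_time < fact.valid_until)
    txn_ok = fact.recorded_at <= as_of and (
        fact.superseded_at is None or as_of < fact.superseded_at)
    return domain_ok and txn_ok
```

A retrieved fact failing the domain clause is a stale/future injection; one failing the transaction clause is a historical-query error — the two dimensions fail independently, which is why the Notes section insists on testing them separately.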

Benchmark dimensions:

  • Stale-context injection rate — fraction of retrieved nodes/edges where valid_until < query_time
  • Future-context injection rate — fraction where valid_from > query_time
  • Temporal precision — valid-at-query-time results / all retrieved results
  • Temporal recall — valid-at-query-time results retrieved / all valid-at-query-time results in graph
  • Query rewriter accuracy — does TemporalQueryRewriter extract at_time, start_time, end_time, and temporal_intent correctly across query phrasings (before/after/during/as-of/since)?
  • Historical query correctness — querying at T-90d returns the graph state that was valid at T-90d, not the current state
  • Competing validity window disambiguation — two nodes with overlapping valid_from/valid_until windows; only the correct one should be returned

Test cases should cover:

  • query_time == now
  • query_time in the past (historical)
  • query_time between valid_from and valid_until of a competing pair
  • Open-ended facts (valid_until == TemporalBound.OPEN)
  • NL temporal phrases: "last week", "before the 2021 merger", "as of Q2 2022", "since the policy change"

Implementation target: benchmarks/context_graph_effectiveness/test_temporal_validity.py


3. Causal Chain Quality — CausalChainAnalyzer

CausalChainAnalyzer traverses explicit causal edges (CAUSES, REQUIRES, INFLUENCES, etc.) to reconstruct decision causality. This is the primary structural differentiator versus vector RAG — a vector retriever cannot follow a causal chain; it can only return chunks that mention causality.

Benchmark dimensions:

  • Causal chain recall — fraction of true causal ancestors retrieved for a given effect node
  • Causal chain precision — fraction of retrieved nodes that are actual ancestors (no spurious nodes)
  • Root cause accuracy — does traversal identify the correct root at depth N?
  • Spurious-edge rate — non-causal nodes surfaced in a causal query
  • Chain depth accuracy — correct retrieval at depths 1, 2, 3+ hops

Test fixture topologies to cover:

  • Linear chain: A → B → C → D
  • Branching: A → B, A → C, B → D
  • Diamond (convergence): A → B → D, A → C → D
  • Cycle detection (should not loop): A → B → C → A
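
A fixture-side sketch of ancestor traversal and chain recall over these topologies — harness code for building ground truth, not CausalChainAnalyzer's implementation:

```python
def causal_ancestors(edges, effect, max_depth=None):
    """BFS over reversed causal edges; the visited set guards against cycles."""
    parents = {}
    for cause, eff in edges:
        parents.setdefault(eff, set()).add(cause)
    seen, frontier, depth = set(), {effect}, 0
    while frontier and (max_depth is None or depth < max_depth):
        frontier = {p for n in frontier for p in parents.get(n, set())} - seen
        seen |= frontier
        depth += 1
    return seen

def chain_recall(retrieved: set, true_ancestors: set) -> float:
    return len(retrieved & true_ancestors) / max(len(true_ancestors), 1)
```

On the diamond topology, the ancestors of D are {A, B, C}; a retriever that follows only one branch scores 2/3 recall, which is exactly the structural failure a flat metric would hide.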

Implementation target: benchmarks/context_graph_effectiveness/test_causal_chains.py


4. Decision Intelligence — DecisionRecorder, DecisionQuery, PolicyEngine, CausalChainAnalyzer

Decision intelligence is the core enterprise capability. The API surface covers:

  • DecisionRecorder — records decisions with full context, entities, source documents, and cross-system context; assigns decision IDs
  • DecisionQuery — queries decisions by category, date range, outcome, maker, with multi-hop graph traversal via multi_hop_query()
  • PolicyEngine — applies policies against decisions, checks compliance, raises exceptions, tracks policy versions via create_policy_with_versioning()
  • CausalChainAnalyzer — traces which prior decisions influenced the current one; returns full causal chain with weights
  • ApprovalChain — multi-level approval chain data model
  • PolicyException — exception/waiver tracking against policies
  • Convenience functions: find_exception_precedents(), analyze_decision_impact(), check_decision_compliance(), get_decision_statistics(), capture_decision_trace()

Benchmark dimensions:

  • Precedent retrieval accuracy — does find_precedents() return the most relevant historical decisions for a given scenario? (hybrid search: semantic + structural + vector)
  • Advanced precedent search — does find_precedents_advanced() outperform basic precedent search when use_kg_features=True?
  • Policy compliance hit rate — fraction of compliant decisions correctly identified as compliant; fraction of violations correctly flagged
  • Exception precedent retrieval — find_exception_precedents() should surface decisions where a policy exception was granted under similar circumstances
  • Causal influence score accuracy — does analyze_decision_influence() assign higher influence scores to decisions with more downstream effects?
  • Decision impact analysis — analyze_decision_impact() should quantify propagation of a decision through the causal graph
  • Decision statistics correctness — get_decision_statistics() should return accurate aggregate counts, approval rates, and category distributions
  • Cross-system context capture — decisions with cross_system_context should be retrievable by external system identifiers

Implementation target: benchmarks/context_graph_effectiveness/test_decision_intelligence.py


5. Decision Quality Delta — Primary Headline Metric

The real-world signal: does context graph injection improve agent decision accuracy compared to no context?

decision_accuracy_delta = accuracy(agent + context_graph) - accuracy(agent_alone)
hallucination_rate_delta = hallucination_rate(agent_alone) - hallucination_rate(agent + context_graph)

Both should be positive for the context graph to be considered beneficial.

Protocol:

  1. Define a fixed eval dataset of (scenario, ground_truth_decision) pairs. Start with 100 synthetic scenarios across categories: lending, healthcare, legal, e-commerce, HR.
  2. Run each scenario twice against a deterministic mock LLM (no API cost, no flakiness):
    • Baseline: agent receives only the raw scenario text
    • With context: agent receives scenario text + context injected from AgentContext
  3. Score structured output against ground truth using exact-match + partial-credit rubric.
  4. Report:
    • decision_accuracy_delta — primary metric
    • hallucination_rate_delta — secondary metric (fewer invented entities/facts)
    • citation_groundedness — fraction of agent claims traceable to a context node
    • policy_compliance_rate — fraction of decisions that satisfy applicable policies when context is injected
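
The two-pass protocol above can be sketched as a harness loop. Here `agent` and `retrieve_context` stand in for the deterministic mock LLM and the AgentContext injection path; the exact-match scoring is simplified (no partial credit):

```python
def evaluate_decision_quality(scenarios, agent, retrieve_context):
    """Run each (scenario, ground_truth) pair with and without injected context."""
    base_hits = ctx_hits = 0
    for scenario, ground_truth in scenarios:
        # Baseline pass: raw scenario text only.
        base_hits += int(agent(scenario) == ground_truth)
        # Context pass: scenario text plus serialized graph context.
        ctx_hits += int(agent(scenario + "\n" + retrieve_context(scenario)) == ground_truth)
    n = max(len(scenarios), 1)
    return {"baseline_accuracy": base_hits / n,
            "context_accuracy": ctx_hits / n,
            "decision_accuracy_delta": (ctx_hits - base_hits) / n}
```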

Hallucination approximation for mock runs:

def hallucination_rate(agent_output: str, graph: ContextGraph) -> float:
    # Extract entity mentions with a cheap NER pass (no LLM call in mock runs).
    entities = lightweight_ner(agent_output)
    # Any mentioned entity absent from the context graph counts as hallucinated.
    known = {n["id"] for n in graph.find_nodes()}
    return len([e for e in entities if e not in known]) / max(len(entities), 1)

Implementation target: benchmarks/context_graph_effectiveness/test_decision_quality.py


6. KG Algorithm Quality — CentralityCalculator, CommunityDetector, NodeEmbedder, PathFinder, LinkPredictor, SimilarityCalculator

ContextGraph integrates the full KG algorithm suite when instantiated with advanced_analytics=True, kg_algorithms=True. These algorithms power influence analysis, precedent search, and relationship prediction.

Benchmark dimensions:

| Algorithm | What to measure |
|---|---|
| CentralityCalculator (degree, betweenness, closeness, eigenvector) | Correctness on known graphs (star, chain, clique); convergence iterations for eigenvector |
| CommunityDetector (Louvain, Leiden, K-clique) | Modularity score on synthetic graphs with planted communities; NMI against ground-truth partition |
| NodeEmbedder (Node2Vec) | Embedding similarity between semantically linked nodes vs. unlinked nodes |
| PathFinder (BFS shortest path, all-pairs) | Correctness + latency for graphs of N = 100, 1K, 10K nodes |
| LinkPredictor (score_link) | AUC-ROC for predicting held-out edges |
| SimilarityCalculator (cosine_similarity) | Correlation between structural similarity scores and semantic similarity scores |
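
Correctness on known graphs means asserting against closed-form values. A sketch for degree centrality on a star graph — harness code with its own toy implementation, not CentralityCalculator's API; the same pattern extends to chains and cliques:

```python
def degree_centrality(adj: dict) -> dict:
    """Degree centrality normalized by (n - 1), the standard definition."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

# Star graph S4: center "c" connected to 4 leaves.
# Closed-form expectation: center = 1.0, every leaf = 0.25.
star = {"c": {"l1", "l2", "l3", "l4"},
        "l1": {"c"}, "l2": {"c"}, "l3": {"c"}, "l4": {"c"}}
```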

Decision Intelligence integration check: When analyze_decision_influence() is called, verify that decisions with higher betweenness centrality receive higher influence scores.

Implementation target: benchmarks/context_graph_effectiveness/test_kg_algorithms.py


7. Reasoning Engine Quality — Reasoner, GraphReasoner, TemporalReasoningEngine, ExplanationGenerator

The reasoning module supports:

  • Rete engine — forward-chaining rule evaluation on graph facts
  • SPARQL reasoner — SPARQL query evaluation
  • Datalog reasoner — recursive Datalog rule evaluation
  • Abductive / deductive reasoning — via Reasoner.infer_facts(facts, rules)
  • TemporalReasoningEngine — Allen's 13 interval relations (IntervalRelation) for temporal fact entailment
  • ExplanationGenerator — generates ReasoningPath and Justification for inferred conclusions

Benchmark dimensions:

  • Rule inference accuracy — given a known set of facts and rules, does the Rete engine derive all expected conclusions and no spurious ones?
  • Datalog recursive accuracy — transitive closure of a relation (e.g., ancestor/2) computed correctly
  • Allen interval relation coverage — all 13 relations (before, meets, overlaps, starts, during, finishes, and their inverses + equals) correctly classified
  • Explanation completeness — does ExplanationGenerator produce a ReasoningPath that covers every inference step from premise to conclusion?
  • SPARQL result correctness — SPARQL queries against synthetic RDF-shaped facts return expected result sets
  • Reasoning latency — for N=1K facts and 20 rules, Rete evaluation should complete under threshold
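
Allen's 13 relations have a direct endpoint-comparison implementation that the benchmark can use as its ground-truth oracle. A sketch, assuming proper intervals (start strictly before end):

```python
def allen_relation(a_start, a_end, b_start, b_end) -> str:
    """Classify the Allen relation of interval A relative to B (13 relations)."""
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met_by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started_by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished_by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    return "overlaps" if a_start < b_start else "overlapped_by"
```

The coverage benchmark then amounts to enumerating endpoint configurations for each of the 13 relations and asserting the engine's IntervalRelation classification agrees with this oracle.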

Implementation target: benchmarks/context_graph_effectiveness/test_reasoning_quality.py


8. Provenance & Lineage Integrity — ProvenanceManager, ProvenanceTracker

Provenance is the audit trail that makes context graphs trustworthy in high-stakes domains (finance, healthcare, legal). The module is W3C PROV-O compliant and tracks end-to-end lineage: document → chunk → entity → KG node → query → response.

Benchmark dimensions:

  • Lineage completeness — given a response entity, get_lineage() should trace back to the source document without gaps
  • Source citation accuracy — SourceReference (DOI + page + quote) correctly round-trips through storage and retrieval
  • Checksum integrity — compute_checksum() / verify_checksum() detect single-byte mutations
  • SQLite persistence round-trip — provenance written to SQLiteStorage survives process restart and is read back identically
  • Provenance overhead — GraphBuilderWithProvenance and AlgorithmTrackerWithProvenance should add less than 15% overhead vs. non-provenance equivalents

Implementation target: benchmarks/context_graph_effectiveness/test_provenance_integrity.py


9. Conflict Detection & Resolution Quality — ConflictDetector, ConflictResolver

The conflicts module detects value, type, relationship, temporal, and logical inconsistencies across sources and resolves them using voting, credibility-weighted, recency, or confidence-based strategies.

Benchmark dimensions:

  • Detection recall by conflict type — for each of value / type / temporal / logical conflicts, fraction of injected conflicts detected
  • Detection precision — fraction of flagged conflicts that are true conflicts (no false positives)
  • Resolution strategy correctness:
    • VOTING — selects the majority value when N sources disagree
    • CREDIBILITY_WEIGHTED — selects the value from the highest-credibility source
    • MOST_RECENT — selects the value with the latest timestamp
    • HIGHEST_CONFIDENCE — selects the value with the highest confidence score
  • Severity scoring calibration — high-severity conflicts (affecting many sources, critical properties) should score higher than low-severity ones
  • Investigation guide completeness — InvestigationGuideGenerator should produce a guide with at least one step per conflict type
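
The four resolution strategies reduce to simple selection rules that ground-truth fixtures can assert against. The claim schema below is illustrative, not ConflictResolver's API:

```python
from collections import Counter

def resolve(claims: list, strategy: str):
    """Each claim: {"value", "source_credibility", "timestamp", "confidence"}."""
    if strategy == "VOTING":
        # Majority value across sources (ties break on first-seen order).
        return Counter(c["value"] for c in claims).most_common(1)[0][0]
    if strategy == "CREDIBILITY_WEIGHTED":
        return max(claims, key=lambda c: c["source_credibility"])["value"]
    if strategy == "MOST_RECENT":
        return max(claims, key=lambda c: c["timestamp"])["value"]
    if strategy == "HIGHEST_CONFIDENCE":
        return max(claims, key=lambda c: c["confidence"])["value"]
    raise ValueError(f"unknown strategy: {strategy}")
```

A good fixture makes the strategies disagree on purpose — e.g. the majority value comes from low-credibility sources — so each strategy's correctness is tested independently.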

Implementation target: benchmarks/context_graph_effectiveness/test_conflict_resolution.py


10. Deduplication Quality — DuplicateDetector, EntityMerger, ClusterBuilder

Deduplication keeps the context graph clean. The module supports Levenshtein, Jaro-Winkler, cosine, Jaccard, and multi-factor similarity; union-find and hierarchical clustering; and provenance-preserving merges.

Benchmark dimensions:

  • Duplicate detection recall — fraction of injected duplicate pairs detected at threshold=0.8
  • Duplicate detection precision — fraction of flagged pairs that are true duplicates
  • F1 by similarity method — compare Levenshtein vs. Jaro-Winkler vs. cosine vs. multi-factor; multi-factor should dominate
  • Cluster quality — NMI of union-find clusters vs. ground-truth entity groups
  • Merge strategy correctness:
    • keep_most_complete — merged entity should have the union of all non-null properties
    • Provenance preservation — merged entity's metadata should reference all source entities
  • Incremental detection efficiency — O(n×m) new-vs-existing comparison should be faster than O(n²) all-pairs for large N
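
Cluster construction from detected duplicate pairs is plain union-find; a sketch the ground-truth NMI comparison could run against — harness code, not ClusterBuilder's implementation:

```python
def union_find_clusters(items, duplicate_pairs):
    """Group items into clusters via union-find over detected duplicate pairs."""
    parent = {i: i for i in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in items:
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())
```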

Implementation target: benchmarks/context_graph_effectiveness/test_deduplication_quality.py


11. Embedding Quality — EmbeddingGenerator, GraphEmbeddingManager, NodeEmbedder

Embeddings underpin semantic search, precedent retrieval, and node similarity. The module supports OpenAI, BGE, FastEmbed, and sentence-transformers providers, with five pooling strategies (mean, max, CLS, attention, hierarchical).

Benchmark dimensions:

  • Semantic coherence — cosine similarity between embeddings of semantically related entities should be higher than between unrelated entities
  • Provider consistency — embeddings from different providers for the same text should produce consistent similarity rankings (Spearman rank correlation > 0.7)
  • Pooling strategy impact — for long-form text, hierarchical pooling should outperform mean pooling on retrieval accuracy
  • Hash-fallback stability — SHA-256 hash-based fallback embeddings must be deterministic (same input → same vector) and stable across runs
  • GraphEmbeddingManager correctness — node embeddings computed by GraphEmbeddingManager should place structurally similar nodes (same community, same centrality range) closer in embedding space
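
The hash-fallback determinism property is easy to pin down in a fixture. One generic construction — the module's actual fallback scheme may differ; this only illustrates the property under test (same input, same vector, across runs and machines):

```python
import hashlib
import struct

def hash_embedding(text: str, dim: int = 16) -> list:
    """Deterministic fallback embedding: expand SHA-256 digests of the input
    into `dim` floats in [0, 1). No model, no randomness, no run-to-run drift."""
    vec, counter = [], 0
    while len(vec) < dim:
        digest = hashlib.sha256(f"{text}:{counter}".encode()).digest()
        for i in range(0, len(digest) - 3, 4):
            if len(vec) == dim:
                break
            (n,) = struct.unpack(">I", digest[i:i + 4])
            vec.append(n / 2**32)
        counter += 1  # extend with a fresh digest when dim > 8
    return vec
```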

Implementation target: benchmarks/context_graph_effectiveness/test_embedding_quality.py


12. Change Management & Versioning — TemporalVersionManager, OntologyVersionManager

The change management module provides versioned snapshots of the KG and ontology with SQLite persistence, SHA-256 checksums, and enterprise compliance support (HIPAA, SOX, FDA).

Benchmark dimensions:

  • Snapshot fidelity — a snapshot taken at time T, when restored, should be graph-isomorphic to the original
  • Version diff correctness — diff between V1 and V2 should contain exactly the nodes/edges added, removed, or modified
  • Checksum change detection — any mutation to a versioned snapshot should change its checksum
  • SQLite persistence — versions written to SQLiteVersionStorage are read back identically after process restart
  • Version manager overhead — TemporalVersionManager should add less than 10% overhead to graph build time
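
Checksumming over a canonical serialization makes the change-detection assertion concrete: reordering must not change the checksum, any content mutation must. A generic sketch, not the module's actual serialization format:

```python
import hashlib
import json

def snapshot_checksum(nodes: list, edges: list) -> str:
    """SHA-256 over a canonical JSON form: key order and list order are
    normalized so the checksum depends only on graph content."""
    canon = lambda xs: sorted(xs, key=lambda x: json.dumps(x, sort_keys=True))
    payload = json.dumps({"nodes": canon(nodes), "edges": canon(edges)},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```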

Implementation target: benchmarks/context_graph_effectiveness/test_change_management.py


13. Skill Injection Evaluation

Context graphs can encode behavioral scaffolding — structured nodes that, when serialized into an agent prompt, reliably elicit a specific reasoning pattern. This is distinct from factual retrieval: the node's structure (type, properties, relationships) matters as much as its content.

Skill types to benchmark:

| Skill type | Encoding | Assertion |
|---|---|---|
| Temporal awareness | Node with valid_from/valid_until + edge to decision | Agent qualifies claims with time bounds |
| Causal reasoning | Causal chain with 3+ hops | Agent explains cause before effect; cites chain |
| Policy compliance | Policy node with rules + PolicyException node | Agent respects constraints; flags exceptions |
| Precedent citation | Precedent node linked to decision | Agent references prior similar decision |
| Uncertainty flagging | Query with no matching context node | Agent expresses uncertainty rather than hallucinating |
| Approval escalation | ApprovalChain node with multi-level requirements | Agent escalates rather than deciding unilaterally |

Implementation target: benchmarks/context_graph_effectiveness/test_skill_injection.py


Pass/Fail Thresholds

All thresholds should live in benchmarks/context_graph_effectiveness/thresholds.py and be enforced by benchmarks/benchmark_runner.py --strict.

| Metric | Threshold | Rationale |
|---|---|---|
| decision_accuracy_delta | > 0 | Context must improve, not degrade |
| hallucination_rate_delta | > 0 | Context must reduce invented facts |
| stale_context_injection_rate | < 0.05 | < 5% stale facts in retrieved context |
| causal_chain_recall | > 0.80 | 80% of true causal ancestors surfaced |
| causal_chain_precision | > 0.85 | < 15% spurious nodes in causal results |
| policy_compliance_hit_rate | > 0.90 | Violations detected with > 90% recall |
| temporal_precision | > 0.90 | < 10% temporally invalid results |
| provenance_lineage_completeness | == 1.0 | No gaps in lineage chain |
| duplicate_detection_f1 | > 0.85 | Clean graph guarantee |
| skill_activation_rate | > 0.70 | Injected skills reliably elicit behavior |
| explanation_completeness | > 0.90 | Reasoning paths cover all inference steps |

Good vs Not Good — Definition

A context graph configuration is good when all threshold conditions above are met simultaneously. In practice, this means:

  1. Run the agent with and without context on the eval dataset.
  2. If decision_accuracy_delta > 0 and hallucination_rate_delta > 0 — the context is helping.
  3. If stale_context_injection_rate >= 0.05 — temporal filtering is broken; fix TemporalGraphRetriever.
  4. If causal_chain_recall < 0.80 — causal traversal is incomplete; check edge types in CausalChainAnalyzer.
  5. If policy_compliance_hit_rate < 0.90 — PolicyEngine is missing violations; review rule matching logic.
  6. If skill_activation_rate < 0.70 — injected skill nodes are not reaching the prompt; check serialization path.

Implementation Plan

Phase 1 — Infrastructure

  • Create benchmarks/context_graph_effectiveness/ with conftest.py
    • Synthetic graph fixture factory (seeded, deterministic, multiple topologies)
    • Deterministic mock LLM stub (no API cost)
    • Ground-truth Q&A dataset loader (fixtures/qa_pairs.json)
    • thresholds.py with all pass/fail values
  • Extend benchmarks/benchmark_runner.py to include the new track and report effectiveness metrics alongside throughput metrics

Phase 2 — Core Retrieval + Temporal

  • test_retrieval.py — lookup, multi-hop, hybrid_alpha sweep, re-ranking quality
  • test_temporal_validity.py — stale/future rates, NL rewriter accuracy, historical queries

Phase 3 — Causal + Decision Intelligence

  • test_causal_chains.py — linear, branching, diamond, cycle topologies
  • test_decision_intelligence.py — precedent retrieval, policy compliance, influence scoring

Phase 4 — Decision Quality Delta

  • test_decision_quality.py — accuracy delta + hallucination delta with mock LLM, 100-scenario eval set
  • fixtures/scenarios/ — committed JSON eval dataset (lending, healthcare, legal, e-commerce, HR)

Phase 5 — KG Algorithms + Reasoning

  • test_kg_algorithms.py — centrality, community detection, link prediction, path finding
  • test_reasoning_quality.py — Rete, Datalog, Allen intervals, explanation completeness

Phase 6 — Data Quality (Provenance, Conflicts, Dedup, Embeddings, Change Management)

  • test_provenance_integrity.py
  • test_conflict_resolution.py
  • test_deduplication_quality.py
  • test_embedding_quality.py
  • test_change_management.py

Phase 7 — Skill Injection + CI Integration

  • test_skill_injection.py — all 6 skill types
  • Add effectiveness track to CI with --strict
  • Add effectiveness section to benchmarks/benchmark_results.md
  • Document skill encoding conventions in docs/benchmarks/skill_injection.md


Notes

  • All effectiveness benchmarks use deterministic mock LLMs — no real API calls in CI.
  • Synthetic graph fixtures are seeded and committed as JSON in benchmarks/context_graph_effectiveness/fixtures/ for reproducibility across machines.
  • The decision_accuracy_delta metric is the headline number for community communication: "run the agent with and without context — if accuracy goes up and hallucinations drop, it's working."
  • Query-type split is non-negotiable: a single aggregate score hides structural failures. A retriever can score 0.9 on lookup while completely failing causal traversal.
  • Bi-temporal benchmarks (valid_from/valid_until + recorded_at/superseded_at) must test both temporal dimensions independently — domain time failures and transaction time failures require different fixes.

Metadata

Labels

enhancement (New feature or request)

Status

In progress