Development Plan

Phased implementation roadmap for cognitive-memory-model. Each phase is independently useful and builds on the previous one.

Background & Motivation

LLMs have no real memory. Knowledge is either baked into weights, held in the ephemeral context window, or stored in explicit memory files that require deliberate read/write actions. If the context is cleared without writing something down, it's gone. This doesn't resemble human memory at all — humans recall information automatically when cues trigger associations.

This project builds an autoassociative memory system that passively monitors LLM/agent conversations, compresses them into gist representations, and automatically surfaces relevant memories when similar cues appear. The LLM does all reasoning; the memory system is a substrate that provides information.

Lessons from Prior Work

  • HDC role-filler binding with random expansion to 10,000D: worked but unnecessary if the LLM handles reasoning. Standard embedding vectors suffice for similarity-based retrieval.
  • Subject-relation-object triplets: too coarse to capture meaning.
  • AMR (Abstract Meaning Representation): richer structure, but the model used was very slow. Unclear whether the slowness came from the SDM (since solved with FAISS) or the AMR model itself.
  • Semantic framework doing reasoning: scope creep. The memory system tried to return structured responses to queries. This is the LLM's job.
  • FAISS IVF for O(1) lookup: proven to work in a separate project. See docs/FAISS-SDM.md for reference.

Key Design Decisions

  1. LLM-as-gist-encoder: use an LLM or small model to compress conversation turns into natural-language gist summaries + tags, rather than formal semantic representations (AMR, triplets).
  2. Embedding vectors for retrieval: standard embedding models (sentence-transformers, etc.) provide the similarity vectors. No HDC random expansion needed.
  3. FAISS for O(1) similarity search: use FAISS IVF index for content-addressable lookup at scale.
  4. Cognitive features via scoring modifiers: spreading activation, decay, priming, and importance are all implemented as modifiers on the base FAISS similarity score: final_score = similarity * decay(age) * importance * priming_boost (see the sketch after this list).
  5. Memory system does NOT reason: it stores, retrieves, and surfaces information. The LLM does all reasoning over recalled memories.
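
A minimal sketch of how decision 4 composes, with illustrative names rather than the actual cmm API:

```python
import math

def final_score(similarity: float, age_seconds: float, importance: float,
                priming_boost: float, decay_rate: float = 1e-5) -> float:
    """Compose the base FAISS similarity with the cognitive modifiers.

    Simplified: the real pipeline also folds access_count into decay
    (Phase 2) and computes priming_boost from recent activations (Phase 3).
    """
    decay = math.exp(-decay_rate * age_seconds)
    return similarity * decay * importance * priming_boost
```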

Phase 1: Core Memory Store + Encoding Pipeline — COMPLETE

Goal: Get the "zebra example" working end-to-end. Store conversation turns as gist memories, retrieve them by similarity when relevant cues appear.

Implementation

1.1 Conversation Parser — cmm/core/types.py, cmm/pipeline/conversation.py

  • ConversationTurn dataclass: {role: Role, content: str, timestamp: float}
  • CognitiveMemoryPipeline.ingest(role, content) — main entry point
  • CognitiveMemoryPipeline.process_turn(turn) — encode, store, retrieve in one call
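
A sketch of the turn type and entry points. The fields follow the bullet above; the Role members and the usage line are assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum
import time

class Role(Enum):           # assumed members; the actual enum may differ
    USER = "user"
    ASSISTANT = "assistant"

@dataclass
class ConversationTurn:
    role: Role
    content: str
    timestamp: float = field(default_factory=time.time)

# Hypothetical usage: ingest wraps the raw text in a turn, then
# process_turn encodes, stores, and retrieves in one call.
# pipeline.ingest(Role.USER, "Zebras are native to Africa.")
```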

1.2 Gist Encoder — cmm/encoding/gist_encoder.py, cmm/encoding/ollama_gist_encoder.py

  • GistEncoder ABC with encode(turn, context?) -> Gist interface
  • PassthroughGistEncoder — baseline, passes raw text + keyword tag extraction (for testing)
  • OllamaGistEncoder — uses local Mistral 7B via Ollama to produce compressed 1-2 sentence gists + tags in JSON format; falls back to passthrough on failure
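
A sketch of the encoder interface, assuming a simple Gist shape (summary text plus tags) and a crude keyword heuristic for the passthrough baseline:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Gist:                     # assumed shape: compressed text plus tags
    text: str
    tags: list[str]

class GistEncoder(ABC):
    @abstractmethod
    def encode(self, turn, context: str | None = None) -> Gist: ...

class PassthroughGistEncoder(GistEncoder):
    """Baseline for testing: raw text plus naive keyword tags."""
    def encode(self, turn, context=None) -> Gist:
        words = [w.strip(".,!?").lower() for w in turn.content.split()]
        tags = sorted({w for w in words if len(w) > 4})[:5]   # crude keyword pick
        return Gist(text=turn.content, tags=tags)
```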

1.3 Embedding Layer — cmm/encoding/embedding.py

  • EmbeddingModel wrapping sentence-transformers (all-MiniLM-L6-v2, 384D)
  • Vectors L2-normalized for cosine similarity via inner product
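
Why normalization matters: for unit vectors, inner product equals cosine similarity, so a FAISS inner-product index ranks by cosine directly. A minimal sketch using the sentence-transformers encode API:

```python
import numpy as np

def embed_normalized(model, texts: list[str]) -> np.ndarray:
    """Embed texts and L2-normalize so dot product == cosine similarity."""
    vecs = np.asarray(model.encode(texts), dtype="float32")
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-12, None)
```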

1.4 FAISS Memory Store — cmm/core/memory_store.py

  • Starts with flat index (exact search), auto-trains IVF index once buffer reaches max(nlist * 10, 256) items
  • Metadata stored alongside: Memory dataclass with gist, tags, timestamp, importance, access_count, last_accessed, source_role
  • Thread-safe with locking; optional GPU support
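
A sketch of the flat-to-IVF handover using random stand-in vectors; the d and nlist values here are assumptions:

```python
import faiss
import numpy as np

d, nlist = 768, 100                       # embedding dims and IVF cell count (assumed)
train_threshold = max(nlist * 10, 256)    # buffer size that triggers IVF training

flat = faiss.IndexFlatIP(d)               # exact search while the store is small
vectors = np.random.rand(train_threshold, d).astype("float32")  # stand-ins
faiss.normalize_L2(vectors)
flat.add(vectors)

# Once the buffer is large enough, train an IVF index and migrate the vectors.
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
ivf.train(vectors)
ivf.add(vectors)
scores, ids = ivf.search(vectors[:1], 5)  # probes a few cells instead of scanning all
```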

1.5 Retrieval Trigger — cmm/retrieval/retriever.py

  • On each turn: embed query → FAISS top-k → update access metadata → format for context injection
  • Retriever.format_for_context() produces [Recalled from memory...]...[End recalled memories] blocks with relevance scores, tags, emotional context, and agent attribution — clearly marked as memory (not user input)
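
An illustrative (not verbatim) context block; the exact layout is defined by Retriever.format_for_context():

```
[Recalled from memory (not user input)]
(0.83) [tags: zebra, africa] [valence: neutral] [agent: assistant]
  Zebras are native to Africa and live in herds.
[End recalled memories]
```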

Results

  • All success criteria met: zebra test passes (store zebra facts → 20 unrelated turns → recall "zebra" succeeds)
  • FAISS retrieval is sub-millisecond at test scale
  • Multiple topic discrimination works (Python vs. elephants vs. Eiffel Tower)

Phase 2: Temporal Decay + Recency Weighting + Working Memory — COMPLETE

Goal: Memories fade over time. Recent/frequently accessed memories are prioritized. A working memory buffer keeps just-activated memories warm.

Implementation

2.1 Temporal Decay + Rehearsal — cmm/retrieval/decay.py

  • DecayScorer applies exponential decay: decay = e^(-λ_eff * age)
  • Age measured from last_accessed, not creation time — accessed memories reset their decay clock
  • Rehearsal effect: λ_eff = λ / (1 + rehearsal_weight * ln(1 + access_count)) — frequently accessed memories decay slower, with log dampening
  • Default decay_rate=1e-5 (~50% decay after 19 hours without access)
  • Both decay_rate and rehearsal_weight configurable via CognitiveMemoryPipeline
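
A worked example of the decay and rehearsal formulas above (rehearsal_weight=1.0 is an assumed default):

```python
import math

def effective_lambda(decay_rate: float, access_count: int,
                     rehearsal_weight: float = 1.0) -> float:
    # Frequent access dampens decay logarithmically.
    return decay_rate / (1 + rehearsal_weight * math.log(1 + access_count))

def decay(age_seconds: float, access_count: int = 0,
          decay_rate: float = 1e-5) -> float:
    return math.exp(-effective_lambda(decay_rate, access_count) * age_seconds)

# Half-life check: ln(2) / 1e-5 ≈ 69,315 s ≈ 19.3 h, matching the ~19 h figure.
print(round(decay(69_315), 2))                   # 0.5  (never accessed)
print(round(decay(69_315, access_count=10), 2))  # 0.82 (rehearsal slows decay)
```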

2.2 Working Memory Buffer — cmm/retrieval/working_memory.py

  • WorkingMemory — fixed-capacity buffer (default 10 items) with turn-based TTL (default 5 turns)
  • Retrieved memories automatically enter working memory
  • Reactivation resets TTL and keeps the higher score
  • Lowest-scoring item evicted when at capacity
  • tick() called on each process_turn() — items expire after TTL turns without reactivation
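
A minimal sketch of the buffer mechanics described above (class and field names illustrative):

```python
from dataclasses import dataclass

@dataclass
class WMItem:
    memory_id: int
    score: float
    ttl: int

class WorkingMemory:
    def __init__(self, capacity: int = 10, ttl: int = 5):
        self.capacity, self.ttl = capacity, ttl
        self.items: dict[int, WMItem] = {}

    def activate(self, memory_id: int, score: float) -> None:
        item = self.items.get(memory_id)
        if item:                                  # reactivation: reset TTL, keep best score
            item.ttl = self.ttl
            item.score = max(item.score, score)
            return
        if len(self.items) >= self.capacity:      # evict the lowest-scoring item
            worst = min(self.items.values(), key=lambda i: i.score)
            del self.items[worst.memory_id]
        self.items[memory_id] = WMItem(memory_id, score, self.ttl)

    def tick(self) -> None:                       # called once per processed turn
        for item in list(self.items.values()):
            item.ttl -= 1
            if item.ttl <= 0:
                del self.items[item.memory_id]
```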

2.3 Retriever Integration — cmm/retrieval/retriever.py

  • Retrieval pipeline now: FAISS top-k*3 (overfetch) → apply similarity * decay * importance → merge working memory items (re-scored against current query) → threshold filter → sort → return top-k
  • Working memory items are re-scored against the current query embedding rather than carrying their activation scores from earlier turns (see Design Note below)
  • Pipeline ticks working memory on each turn automatically

Results

  • All success criteria met: old memories score lower, accessed memories resist decay, working memory keeps items warm for TTL turns and clears on topic shift
  • Scoring formula in practice: final_score = raw_similarity * decay(age, access_count) * importance

Design Note

Working memory items are re-scored against the current query rather than using their original activation score. This was necessary because stale working memory items (high score from a previous query context) would otherwise crowd out relevant FAISS results for the current query.


Phase 3: Spreading Activation + Priming — COMPLETE

Goal: Retrieving one memory activates related memories. Recent activations lower thresholds for associated concepts.

Implementation

3.1 Spreading Activation — cmm/retrieval/spreading_activation.py, cmm/retrieval/entity_index.py

  • SpreadingActivation expands retrieval via two paths:
    1. Embedding proximity: FAISS neighbor queries from seed embeddings
    2. Entity links: spaCy NER extracts named entities at storage time; EntityIndex maps entities to memory IDs; spreading traverses entity links to find cross-domain associations
  • Entity linking solves the cross-domain problem: "warehouse inspection on Industrial Way" and "hospital patients near Industrial Way" have only 0.18 embedding similarity but share the "Industrial Way" entity
  • Score decays per hop: spread_score = parent_score * spread_factor * neighbor_similarity (embedding path) or parent_score * spread_factor * entity_boost (entity path)
  • Multiple paths to the same memory are deduplicated, keeping max score
  • Embedding model upgraded from all-MiniLM-L6-v2 (384D) to all-mpnet-base-v2 (768D) for better cross-domain similarity
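
A one-hop sketch of both spreading paths; spread_factor, entity_boost, and the store/index interfaces are assumptions:

```python
import numpy as np

def spread(seeds, store, entity_index, spread_factor=0.5, entity_boost=1.0, k=5):
    """seeds: list of (memory_id, score, embedding) for the initial top-k."""
    activated: dict[int, float] = {}

    def offer(nid: int, score: float) -> None:
        activated[nid] = max(activated.get(nid, 0.0), score)  # dedup: keep max score

    for mem_id, score, emb in seeds:
        # Path 1: embedding proximity via a FAISS neighbor query from the seed.
        sims, ids = store.index.search(emb[None, :].astype("float32"), k)
        for sim, nid in zip(sims[0], ids[0]):
            if nid not in (mem_id, -1):
                offer(int(nid), score * spread_factor * float(sim))
        # Path 2: shared named entities (catches low-similarity cross-domain links).
        for entity in store.memories[mem_id].entities:        # assumed attribute
            for nid in entity_index.get(entity, []):
                if nid != mem_id:
                    offer(nid, score * spread_factor * entity_boost)
    return activated
```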

3.2 Priming State — cmm/retrieval/priming.py

  • PrimingState tracks recently activated memory IDs with turn counters
  • Boost formula: 1 + boost_strength * e^(-decay_rate * turns_since_activation) (default 1.3x at activation)
  • Reactivation resets the boost; auto-cleanup after max_turns (default 10)
  • Applied to both direct FAISS results and spread-activated results
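
A sketch of the priming state; boost_strength=0.3 reproduces the 1.3x figure, while the per-turn decay_rate here is an assumed value:

```python
import math

class PrimingState:
    def __init__(self, boost_strength: float = 0.3, decay_rate: float = 0.3,
                 max_turns: int = 10):
        self.boost_strength = boost_strength
        self.decay_rate = decay_rate            # per-turn boost decay (assumed default)
        self.max_turns = max_turns
        self.turns_since: dict[int, int] = {}   # memory_id -> turns since activation

    def activate(self, memory_id: int) -> None:
        self.turns_since[memory_id] = 0         # (re)activation resets the boost

    def tick(self) -> None:                     # advance one turn; drop stale entries
        self.turns_since = {m: t + 1 for m, t in self.turns_since.items()
                            if t + 1 <= self.max_turns}

    def boost(self, memory_id: int) -> float:
        t = self.turns_since.get(memory_id)
        if t is None:
            return 1.0
        # 1 + 0.3 * e^0 = 1.3x at activation, decaying toward 1.0
        return 1.0 + self.boost_strength * math.exp(-self.decay_rate * t)
```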

3.3 Retriever Pipeline — cmm/retrieval/retriever.py

  • Full 8-step pipeline: FAISS fetch → decay + importance + priming → initial top-k → spreading activation → priming on spread results → working memory merge → final sort → update access/WM/priming
  • Retriever.tick() now advances both working memory and priming in one call

Results

  • All success criteria met: "zebra" activates "Africa/wildlife" memories, animal priming boosts animal retrieval on subsequent turns, spreading stays focused (animal memories rank above programming ones)
  • Full scoring formula: final_score = similarity * decay(age, access_count) * importance * priming_boost

Design Note

The associative index (3.3 in original plan) was deferred — FAISS re-query works well for spreading activation and avoids maintaining a separate co-occurrence graph. Can be revisited if performance becomes a concern at scale.


Phase 4: Episodic → Semantic Consolidation — COMPLETE

Goal: Over time, specific episodic memories consolidate into general semantic knowledge. Like human sleep consolidation.

Implementation

4.1 Consolidation Engine — cmm/consolidation/consolidator.py

  • Consolidator clusters episodic memories by embedding cosine similarity (greedy single-linkage)
  • Configurable cluster_threshold (default 0.6) and min_cluster_size (default 3)
  • ConsolidationSummarizer ABC + SimpleConsolidationSummarizer (concatenation fallback)
  • Consolidated semantic memories stored with higher importance (default 2.0x)
  • Episodic memories in consolidated clusters get demoted (default 0.5x importance)
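
A sketch of the greedy single-linkage clustering, assuming L2-normalized embeddings (so dot product is cosine similarity):

```python
import numpy as np

def cluster_episodic(embeddings: np.ndarray, threshold: float = 0.6,
                     min_size: int = 3) -> list[list[int]]:
    clusters: list[list[int]] = []
    for i, vec in enumerate(embeddings):
        placed = False
        for cluster in clusters:
            # Single linkage: join if similar enough to ANY existing member.
            if any(float(vec @ embeddings[j]) >= threshold for j in cluster):
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])                # start a new cluster
    return [c for c in clusters if len(c) >= min_size]
```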

4.2 LLM-Based Summarization — cmm/consolidation/ollama_summarizer.py

  • OllamaConsolidationSummarizer uses local Mistral 7B to generate consolidated summaries
  • Produces quality results like: "The user frequently discusses Python-related topics such as decorators, async/await, function optimization, and writing unit tests using pytest."
  • Same JSON output pattern as the gist encoder, with graceful fallback

4.3 Session Summaries — cmm/consolidation/session.py

  • SessionSummarizer creates session-level summaries from accumulated turn memories
  • Stored as semantic memories with moderate importance (1.5x), prefixed with "Session summary:"
  • pipeline.end_session() triggers summarization and resets session tracking

4.4 Pipeline Integration — cmm/pipeline/conversation.py

  • pipeline.consolidate() for manual trigger
  • Auto-consolidation fires every consolidation_threshold turns (default 50)
  • Session memory IDs tracked per-session so summaries only cover their own turns

Results

  • All success criteria met: Python debugging cluster produces "The user frequently discusses debugging Python scripts related to CSV parsing"; semantic memories retrievable by broad cues; episodic memories still retrievable by specific cues
  • LLM-based summarizer dramatically outperforms the simple concatenation fallback

Design Notes

  • Embedding similarity between diverse subtopics of the same domain (e.g., Python decorators vs async/await) is only 0.1-0.4. Cluster threshold needs to be set around 0.2-0.4 to group them, or the gist encoder needs to produce more similar phrasings.
  • Hierarchical retrieval (4.4 in original plan) happens naturally — FAISS returns both episodic and semantic memories, and the scoring formula (with importance weighting) gives semantic memories a natural advantage for broad queries.

Phase 5: Importance Weighting + Metamemory + Maintenance — COMPLETE

Goal: Not all memories are equal. The system detects importance signals, knows what it knows, and cleans up after itself.

Phase 5a: Importance Detection — cmm/scoring/importance.py

  • ImportanceScorer ABC + RuleBasedImportanceScorer with regex-based detection
  • Scoring tiers: corrections (2.0x), explicit instructions (2.0x), novel information (1.5x), normal (1.0x), routine/filler (0.5x)
  • Novelty detection: queries FAISS for max similarity to existing memories; below novelty_threshold (default 0.5) = novel
  • Integrated into process_turn() — importance is scored and assigned at storage time automatically
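
A sketch of the tiering logic; the regex patterns here are illustrative stand-ins for the actual rules:

```python
import re

# Illustrative patterns only; the real RuleBasedImportanceScorer rules may differ.
TIERS = [
    (re.compile(r"\b(actually|that's wrong|correction|i meant)\b", re.I), 2.0),  # corrections
    (re.compile(r"\b(always|never|remember to|from now on)\b", re.I), 2.0),      # instructions
    (re.compile(r"^\s*(hi|hello|thanks|ok(ay)?)\b[.!]*\s*$", re.I), 0.5),        # routine/filler
]

def score_importance(text: str, max_existing_similarity: float,
                     novelty_threshold: float = 0.5) -> float:
    for pattern, weight in TIERS:
        if pattern.search(text):
            return weight
    if max_existing_similarity < novelty_threshold:
        return 1.5          # nothing similar in FAISS yet: novel information
    return 1.0              # normal
```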

Results: corrections stored at 2.0x importance, greetings at 0.5x; corrections rank above normal memories for the same query due to importance multiplier in the scoring formula.

Phase 5b: Metamemory Signals — cmm/retrieval/metamemory.py

  • MetamemoryScorer classifies retrieval results into confidence levels: HIGH (≥0.7), MODERATE (≥0.4), LOW (≥0.2), NONE
  • MetamemoryResult wraps results with confidence, partial matches, and convenience properties (has_strong_match, has_tip_of_tongue)
  • Partial matches: candidates that scored above partial_threshold but below the retrieval threshold — "tip of the tongue" signals
  • pipeline.recall_with_metamemory() for metamemory-enriched retrieval
  • MetamemoryScorer.format_for_context() includes confidence level and a separate [Partial matches] section
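
A sketch of the confidence mapping (thresholds from the list above; the partial_threshold default is an assumption):

```python
from enum import Enum

class Confidence(Enum):
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"
    NONE = "none"

def classify(score: float) -> Confidence:
    if score >= 0.7:
        return Confidence.HIGH
    if score >= 0.4:
        return Confidence.MODERATE
    if score >= 0.2:
        return Confidence.LOW
    return Confidence.NONE

def partial_matches(scored, retrieval_threshold: float = 0.2,
                    partial_threshold: float = 0.1) -> list:
    # "Tip of the tongue": above partial_threshold but below the retrieval cutoff.
    return [(m, s) for m, s in scored if partial_threshold <= s < retrieval_threshold]
```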

Results: strong matches get HIGH/MODERATE confidence; completely unrelated queries get NONE; borderline queries surface partial matches as hints.

Phase 5c: Memory Maintenance — cmm/maintenance/maintenance.py

  • MemoryMaintainer.prune() removes memories where decay * importance < prune_threshold AND importance <= prune_min_importance. High-importance memories (corrections, instructions) are protected.
  • MemoryMaintainer.deduplicate() finds pairs with cosine similarity ≥ duplicate_threshold (default 0.95), keeps the one with higher importance/access, transfers access count to survivor.
  • MemoryMaintainer.get_health_metrics() returns HealthMetrics dataclass: total/episodic/semantic counts, avg importance, avg access count, pruned/merged counts.
  • MemoryMaintainer.maintain() runs deduplicate then prune in sequence.
  • pipeline.maintain() and pipeline.health() for easy access.
  • Added store.remove() and store.rebuild_index() to MemoryStore for index reconstruction after deletions.
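
A sketch of the prune predicate; the threshold defaults here are assumptions:

```python
import math

def should_prune(memory, now: float, decay_rate: float = 1e-5,
                 prune_threshold: float = 0.05,
                 prune_min_importance: float = 1.0) -> bool:
    lam = decay_rate / (1 + math.log(1 + memory.access_count))   # rehearsal-dampened
    decayed = math.exp(-lam * (now - memory.last_accessed))
    # Both conditions must hold, so high-importance memories are protected.
    return (decayed * memory.importance < prune_threshold
            and memory.importance <= prune_min_importance)
```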

Results: old low-importance memories are pruned; high-importance and recently-accessed memories survive; near-duplicates merge with access count transfer; FAISS index correctly rebuilt after maintenance.


Post-Phase Features — COMPLETE

Features originally listed as "Future Considerations" that have been implemented:

  • Multi-agent memory sharing — cmm/multi_agent/shared_store.py: SharedMemoryManager with per-agent pipelines, scoped visibility (PRIVATE/SHARED/TEAM), contradiction detection with agent attribution, and nightly consolidation.
  • Emotional valence tagging — cmm/scoring/valence.py: ValenceScorer tags each memory with valence (-1 to +1), arousal (0 to 1), and emotion labels. Emotional context surfaced in retrieval formatting.
  • Entity-linked spreading activation — cmm/retrieval/entity_index.py: spaCy NER extracts named entities at storage time. Spreading activation traverses both embedding neighbors AND entity links for cross-domain association.
  • Embedding model upgrade — Upgraded from all-MiniLM-L6-v2 (384D) to all-mpnet-base-v2 (768D) for better cross-domain similarity.

Integrations — COMPLETE

Multiple integration paths for different environments. See integrations/README.md for full guide.

  • HTTP Memory Server (integrations/claude-code/memory_server.py) — Language-agnostic REST API. Any application that can make HTTP calls gets autoassociative memory.
  • Claude Code Hooks (integrations/claude-code/hooks/) — UserPromptSubmit and Stop hooks for fully automatic two-way monitoring. Zero manual intervention.
  • Python API Middleware (integrations/middleware.py) — Wraps any OpenAI-compatible or Anthropic API call with automatic memory ingest + recall.
  • MCP Server (integrations/mcp/server.py) — Exposes memory as MCP tools for Claude Desktop, Cursor, and any MCP client. Semi-automatic (LLM decides when to call tools).
  • Gist encoder backends — Ollama (local), OpenAI-compatible (any provider), Anthropic (Claude API), Passthrough (no LLM needed).

All integration paths support true autoassociative memory except MCP tools, which require the LLM to decide to call the memory tools.


Persistence and Distributed Memory — COMPLETE

  • Persistence — pipeline.save(directory) / CognitiveMemoryPipeline.load(directory) save and restore the FAISS index, all memory metadata (importance, timestamps, agent_id, valence, etc.), and the entity index. Three files: faiss.index, memories.json, entities.json.
  • Distributed multi-agent server — The HTTP memory server (integrations/claude-code/memory_server.py) supports:
    • agent_id and session_id on every request for multi-agent tagging
    • ThreadingMixIn for concurrent access from multiple agents
    • --data-dir for automatic persistence (saves on shutdown, loads on startup)
    • --auto-save N for periodic background saves
    • --host 0.0.0.0 for network access (remote agents)
    • POST /consolidate for nightly consolidation
    • POST /contradictions for cross-agent contradiction detection
    • GET /stats for per-agent memory counts

The 60-90 agent, 30-person developer team scenario: all agents hit http://memory-server:7832/ingest_and_recall with their agent_id. The server maintains a shared FAISS index. Each agent's memories are tagged and scoped. Nightly consolidation runs via POST /consolidate.
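
A sketch of what an agent-side call might look like; the request and response fields beyond agent_id/session_id are assumptions, not a documented schema:

```python
import requests

resp = requests.post(
    "http://memory-server:7832/ingest_and_recall",
    json={
        "agent_id": "builder-17",       # per-agent tagging and scoping
        "session_id": "2025-06-01-a",   # per-session tracking
        "role": "user",                 # assumed field
        "content": "Deploy failed again on the staging cluster.",  # assumed field
    },
    timeout=10,
)
for memory in resp.json().get("memories", []):   # response shape assumed
    print(memory)
```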


Future Considerations (Not Planned)

Ideas captured for potential future work, not in current scope:

  • Multi-modal memories: storing memories from images, audio, structured data
  • Continual learning integration: using memory patterns to influence model fine-tuning
  • Hardware acceleration: replacing FAISS with TCAM/neuromorphic hardware for true O(1)