Phased implementation roadmap for cognitive-memory-model. Each phase is independently useful and builds on the previous one.
LLMs have no real memory. Knowledge is either baked into weights, held in the ephemeral context window, or stored in explicit memory files that require deliberate read/write actions. If the context is cleared without writing something down, it's gone. This doesn't resemble human memory at all — humans recall information automatically when cues trigger associations.
This project builds an autoassociative memory system that passively monitors LLM/agent conversations, compresses them into gist representations, and automatically surfaces relevant memories when similar cues appear. The LLM does all reasoning; the memory system is a substrate that provides information.
- HDC role-filler binding with random expansion to 10,000D: worked but unnecessary if the LLM handles reasoning. Standard embedding vectors suffice for similarity-based retrieval.
- Subject-relation-object triplets: too coarse to capture meaning.
- AMR (Abstract Meaning Representation): richer structure, but the model used was very slow. Unclear if slowness was the SDM (now solved with FAISS) or the AMR model itself.
- Semantic framework doing reasoning: scope creep. The memory system tried to return structured responses to queries. This is the LLM's job.
- FAISS IVF for O(1) lookup: proven to work in a separate project. See docs/FAISS-SDM.md for reference.
- LLM-as-gist-encoder: use an LLM or small model to compress conversation turns into natural-language gist summaries + tags, rather than formal semantic representations (AMR, triplets).
- Embedding vectors for retrieval: standard embedding models (sentence-transformers, etc.) provide the similarity vectors. No HDC random expansion needed.
- FAISS for O(1) similarity search: use FAISS IVF index for content-addressable lookup at scale.
- Cognitive features via scoring modifiers: spreading activation, decay, priming, and importance are all implemented as modifiers on the base FAISS similarity score: `final_score = similarity * decay(age) * importance * priming_boost`
- Memory system does NOT reason: it stores, retrieves, and surfaces information. The LLM does all reasoning over recalled memories.
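The combined formula can be sketched as a plain function. This is a minimal illustration, not the project's actual API; the function name and sample values are invented, and the decay rate shown is the Phase 2 default.

```python
import math

def final_score(similarity: float, age_seconds: float, importance: float,
                priming_boost: float, decay_rate: float = 1e-5) -> float:
    """Base FAISS similarity scaled by the cognitive modifiers (sketch)."""
    decay = math.exp(-decay_rate * age_seconds)
    return similarity * decay * importance * priming_boost

# A fresh, primed, high-importance memory can outrank an older memory
# with higher raw similarity:
fresh = final_score(0.70, age_seconds=600, importance=2.0, priming_boost=1.3)
stale = final_score(0.80, age_seconds=200_000, importance=1.0, priming_boost=1.0)
assert fresh > stale
```

Because the modifiers are multiplicative, each cognitive feature can be developed and tuned independently without touching the FAISS retrieval path.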
Goal: Get the "zebra example" working end-to-end. Store conversation turns as gist memories, retrieve them by similarity when relevant cues appear.
- `ConversationTurn` dataclass: `{role: Role, content: str, timestamp: float}`
- `CognitiveMemoryPipeline.ingest(role, content)` — main entry point
- `CognitiveMemoryPipeline.process_turn(turn)` — encode, store, retrieve in one call
- `GistEncoder` ABC with `encode(turn, context?) -> Gist` interface
- `PassthroughGistEncoder` — baseline, passes raw text + keyword tag extraction (for testing)
- `OllamaGistEncoder` — uses local Mistral 7B via Ollama to produce compressed 1-2 sentence gists + tags in JSON format; falls back to passthrough on failure
- `EmbeddingModel` wrapping `sentence-transformers` (all-MiniLM-L6-v2, 384D)
- Vectors L2-normalized for cosine similarity via inner product
- Starts with flat index (exact search), auto-trains IVF index once buffer reaches `max(nlist * 10, 256)` items
- Metadata stored alongside: `Memory` dataclass with gist, tags, timestamp, importance, access_count, last_accessed, source_role
- Thread-safe with locking; optional GPU support
- On each turn: embed query → FAISS top-k → update access metadata → format for context injection
- `Retriever.format_for_context()` produces `[Recalled from memory...]...[End recalled memories]` blocks with relevance scores, tags, emotional context, and agent attribution — clearly marked as memory (not user input)
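A minimal sketch of what such a context block might look like. Field names and layout here are assumptions for illustration, not the actual `format_for_context()` implementation.

```python
def format_for_context(memories: list[dict]) -> str:
    """Render recalled memories as a clearly delimited block for prompt
    injection, marked so the LLM knows it is memory, not user input.
    Sketch only -- the dict keys ('score', 'gist', 'tags') are assumed."""
    if not memories:
        return ""
    lines = ["[Recalled from memory -- not user input]"]
    for m in memories:
        tags = ", ".join(m.get("tags", []))
        lines.append(f"- ({m['score']:.2f}) {m['gist']} [tags: {tags}]")
    lines.append("[End recalled memories]")
    return "\n".join(lines)
```

The explicit open/close markers matter: without them, injected memories are indistinguishable from the user's actual message.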
- All success criteria met: zebra test passes (store zebra facts → 20 unrelated turns → recall "zebra" succeeds)
- FAISS retrieval is sub-millisecond at test scale
- Multiple topic discrimination works (Python vs. elephants vs. Eiffel Tower)
Goal: Memories fade over time. Recent/frequently accessed memories are prioritized. A working memory buffer keeps just-activated memories warm.
- `DecayScorer` applies exponential decay: `decay = e^(-λ_eff * age)`
- Age measured from `last_accessed`, not creation time — accessed memories reset their decay clock
- Rehearsal effect: `λ_eff = λ / (1 + rehearsal_weight * ln(1 + access_count))` — frequently accessed memories decay slower, with log dampening
- Default `decay_rate=1e-5` (~50% decay after 19 hours without access)
- Both `decay_rate` and `rehearsal_weight` configurable via `CognitiveMemoryPipeline`
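The decay math above can be checked numerically. A sketch using the stated formulas (function names are illustrative):

```python
import math

def effective_lambda(base_lambda: float, access_count: int,
                     rehearsal_weight: float = 1.0) -> float:
    # Rehearsal effect: frequent access dampens decay, log-scaled.
    return base_lambda / (1 + rehearsal_weight * math.log(1 + access_count))

def decay(age_seconds: float, access_count: int = 0,
          base_lambda: float = 1e-5) -> float:
    return math.exp(-effective_lambda(base_lambda, access_count) * age_seconds)

# With the default rate, an untouched memory halves in ~19.3 hours:
half_life_h = math.log(2) / 1e-5 / 3600   # ~19.25
# Over that same window, a memory accessed 10 times retains ~0.8 of its
# strength instead of 0.5, because its effective lambda is ~3x smaller.
```

This confirms the ~19-hour half-life claim for the default `decay_rate=1e-5` (half-life = ln 2 / λ).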
- `WorkingMemory` — fixed-capacity buffer (default 10 items) with turn-based TTL (default 5 turns)
- Retrieved memories automatically enter working memory
- Reactivation resets TTL and keeps the higher score
- Lowest-scoring item evicted when at capacity
- `tick()` called on each `process_turn()` — items expire after TTL turns without reactivation
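The buffer semantics above can be sketched as follows. This is a simplified illustration; the real `WorkingMemory` class likely differs in detail.

```python
from dataclasses import dataclass

@dataclass
class WMItem:
    memory_id: str
    score: float
    ttl: int

class WorkingMemory:
    """Fixed-capacity, TTL-based buffer sketch (names assumed)."""
    def __init__(self, capacity: int = 10, ttl: int = 5):
        self.capacity, self.ttl = capacity, ttl
        self.items: dict[str, WMItem] = {}

    def activate(self, memory_id: str, score: float) -> None:
        existing = self.items.get(memory_id)
        if existing:  # reactivation: reset TTL, keep the higher score
            existing.ttl = self.ttl
            existing.score = max(existing.score, score)
            return
        if len(self.items) >= self.capacity:  # evict lowest-scoring item
            victim = min(self.items.values(), key=lambda i: i.score)
            del self.items[victim.memory_id]
        self.items[memory_id] = WMItem(memory_id, score, self.ttl)

    def tick(self) -> None:
        """Called once per turn; items expire after TTL turns idle."""
        for item in list(self.items.values()):
            item.ttl -= 1
            if item.ttl <= 0:
                del self.items[item.memory_id]
```
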
- Retrieval pipeline now: FAISS top-k*3 (overfetch) → apply `similarity * decay * importance` → merge working memory items (re-scored against current query) → threshold filter → sort → return top-k
- Working memory items are re-scored using cosine similarity to the current query embedding, preventing stale working memory from crowding out relevant FAISS results
- Pipeline ticks working memory on each turn automatically
- All success criteria met: old memories score lower, accessed memories resist decay, working memory keeps items warm for TTL turns and clears on topic shift
- Scoring formula in practice: `final_score = raw_similarity * decay(age, access_count) * importance`
Working memory items are re-scored against the current query rather than using their original activation score. This was necessary because stale working memory items (high score from a previous query context) would otherwise crowd out relevant FAISS results for the current query.
Goal: Retrieving one memory activates related memories. Recent activations lower thresholds for associated concepts.
- `SpreadingActivation` expands retrieval via two paths:
  - Embedding proximity: FAISS neighbor queries from seed embeddings
  - Entity links: spaCy NER extracts named entities at storage time; `EntityIndex` maps entities to memory IDs; spreading traverses entity links to find cross-domain associations
- Entity linking solves the cross-domain problem: "warehouse inspection on Industrial Way" and "hospital patients near Industrial Way" have only 0.18 embedding similarity but share the "Industrial Way" entity
- Score decays per hop: `spread_score = parent_score * spread_factor * neighbor_similarity` (embedding path) or `parent_score * spread_factor * entity_boost` (entity path)
- Multiple paths to the same memory are deduplicated, keeping max score
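A one-hop sketch of the embedding path with per-hop decay and max-score deduplication. Names and the `spread_factor` value are assumptions, not the actual implementation.

```python
def spread(seeds: dict[str, float], neighbors_fn,
           spread_factor: float = 0.5) -> dict[str, float]:
    """One-hop spreading activation sketch.
    seeds: {memory_id: score} from direct retrieval.
    neighbors_fn(memory_id) -> [(neighbor_id, similarity)], e.g. a FAISS
    re-query from the seed's embedding."""
    spread_scores: dict[str, float] = {}
    for seed_id, parent_score in seeds.items():
        for neighbor_id, sim in neighbors_fn(seed_id):
            if neighbor_id in seeds:
                continue  # already a direct result; don't double-count
            s = parent_score * spread_factor * sim
            # multiple paths to the same memory: keep the max score
            spread_scores[neighbor_id] = max(s, spread_scores.get(neighbor_id, 0.0))
    return spread_scores
```

The `spread_factor < 1` guarantees spread results score below their parents, so associations enrich the context without displacing direct hits.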
- Embedding model upgraded from all-MiniLM-L6-v2 (384D) to all-mpnet-base-v2 (768D) for better cross-domain similarity
- `PrimingState` tracks recently activated memory IDs with turn counters
- Boost formula: `1 + boost_strength * e^(-decay_rate * turns_since_activation)` (default 1.3x at activation)
- Reactivation resets the boost; auto-cleanup after `max_turns` (default 10)
- Applied to both direct FAISS results and spread-activated results
- Full 8-step pipeline: FAISS fetch → decay + importance + priming → initial top-k → spreading activation → priming on spread results → working memory merge → final sort → update access/WM/priming
- `Retriever.tick()` now advances both working memory and priming in one call
- All success criteria met: "zebra" activates "Africa/wildlife" memories, animal priming boosts animal retrieval on subsequent turns, spreading stays focused (animal memories rank above programming ones)
- Full scoring formula: `final_score = similarity * decay(age, access_count) * importance * priming_boost`
The associative index (3.3 in original plan) was deferred — FAISS re-query works well for spreading activation and avoids maintaining a separate co-occurrence graph. Can be revisited if performance becomes a concern at scale.
Goal: Over time, specific episodic memories consolidate into general semantic knowledge. Like human sleep consolidation.
- `Consolidator` clusters episodic memories by embedding cosine similarity (greedy single-linkage)
- Configurable `cluster_threshold` (default 0.6) and `min_cluster_size` (default 3)
- `ConsolidationSummarizer` ABC + `SimpleConsolidationSummarizer` (concatenation fallback)
- Consolidated semantic memories stored with higher importance (default 2.0x)
- Episodic memories in consolidated clusters get demoted (default 0.5x importance)
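Greedy single-linkage clustering over L2-normalized embeddings can be sketched like this. It is an illustration of the named technique, not the actual `Consolidator` code.

```python
import numpy as np

def greedy_single_linkage(embeddings: np.ndarray, threshold: float = 0.6,
                          min_cluster_size: int = 3) -> list[list[int]]:
    """Greedy single-linkage clustering sketch. Embeddings are assumed
    L2-normalized, so the dot product is cosine similarity. A memory joins
    the first cluster where it is within `threshold` of ANY member
    (single linkage); otherwise it starts a new cluster."""
    clusters: list[list[int]] = []
    for i in range(len(embeddings)):
        placed = False
        for cluster in clusters:
            # single linkage: max similarity to any existing member
            if max(float(embeddings[i] @ embeddings[j]) for j in cluster) >= threshold:
                cluster.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])
    # only clusters big enough to be worth summarizing survive
    return [c for c in clusters if len(c) >= min_cluster_size]
```

Greedy single-pass assignment is O(n²) worst case but order-dependent and cheap, which fits an offline "sleep consolidation" job better than exact hierarchical clustering.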
- `OllamaConsolidationSummarizer` uses local Mistral 7B to generate consolidated summaries
- Produces quality results like: "The user frequently discusses Python-related topics such as decorators, async/await, function optimization, and writing unit tests using pytest."
- Same JSON output pattern as the gist encoder, with graceful fallback
- `SessionSummarizer` creates session-level summaries from accumulated turn memories
- Stored as semantic memories with moderate importance (1.5x), prefixed with "Session summary:"
- `pipeline.end_session()` triggers summarization and resets session tracking
- `pipeline.consolidate()` for manual trigger
- Auto-consolidation fires every `consolidation_threshold` turns (default 50)
- Session memory IDs tracked per-session so summaries only cover their own turns
- All success criteria met: Python debugging cluster produces "The user frequently discusses debugging Python scripts related to CSV parsing"; semantic memories retrievable by broad cues; episodic memories still retrievable by specific cues
- LLM-based summarizer dramatically outperforms the simple concatenation fallback
- Embedding similarity between diverse subtopics of the same domain (e.g., Python decorators vs async/await) is only 0.1-0.4. Cluster threshold needs to be set around 0.2-0.4 to group them, or the gist encoder needs to produce more similar phrasings.
- Hierarchical retrieval (4.4 in original plan) happens naturally — FAISS returns both episodic and semantic memories, and the scoring formula (with importance weighting) gives semantic memories a natural advantage for broad queries.
Goal: Not all memories are equal. The system detects importance signals, knows what it knows, and cleans up after itself.
- `ImportanceScorer` ABC + `RuleBasedImportanceScorer` with regex-based detection
- Scoring tiers: corrections (2.0x), explicit instructions (2.0x), novel information (1.5x), normal (1.0x), routine/filler (0.5x)
- Novelty detection: queries FAISS for max similarity to existing memories; below `novelty_threshold` (default 0.5) = novel
- Integrated into `process_turn()` — importance is scored and assigned at storage time automatically
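A toy version of the regex-tier idea. The patterns below are invented examples; the real `RuleBasedImportanceScorer` rules may differ.

```python
import re

# Illustrative (tier, pattern) rules, first match wins.
# These regexes are examples, not the project's actual detection rules.
RULES = [
    (re.compile(r"\b(actually|that's wrong|correction)\b", re.I), 2.0),   # corrections
    (re.compile(r"\b(always|never|remember to|from now on)\b", re.I), 2.0),  # instructions
    (re.compile(r"^\s*(hi|hello|thanks|ok)\b[.!]?\s*$", re.I), 0.5),      # routine/filler
]

def rule_based_importance(text: str) -> float:
    for pattern, weight in RULES:
        if pattern.search(text):
            return weight
    return 1.0  # normal tier; novelty detection would bump this to 1.5
```
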
Results: corrections stored at 2.0x importance, greetings at 0.5x; corrections rank above normal memories for the same query due to importance multiplier in the scoring formula.
- `MetamemoryScorer` classifies retrieval results into confidence levels: `HIGH` (≥0.7), `MODERATE` (≥0.4), `LOW` (≥0.2), `NONE`
- `MetamemoryResult` wraps results with confidence, partial matches, and convenience properties (`has_strong_match`, `has_tip_of_tongue`)
- Partial matches: candidates that scored above `partial_threshold` but below the retrieval threshold — "tip of the tongue" signals
- `pipeline.recall_with_metamemory()` for metamemory-enriched retrieval
- `MetamemoryScorer.format_for_context()` includes confidence level and a separate `[Partial matches]` section
Results: strong matches get HIGH/MODERATE confidence; completely unrelated queries get NONE; borderline queries surface partial matches as hints.
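The confidence mapping is a simple threshold ladder over the top retrieval score; a sketch using the thresholds listed above (the enum and function names are illustrative):

```python
from enum import Enum

class Confidence(Enum):
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"
    NONE = "none"

def classify(best_score: float) -> Confidence:
    """Map the best final_score in a result set to a confidence level,
    using the thresholds stated in the doc (0.7 / 0.4 / 0.2)."""
    if best_score >= 0.7:
        return Confidence.HIGH
    if best_score >= 0.4:
        return Confidence.MODERATE
    if best_score >= 0.2:
        return Confidence.LOW
    return Confidence.NONE
```
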
- `MemoryMaintainer.prune()` removes memories where `decay * importance < prune_threshold` AND `importance <= prune_min_importance`. High-importance memories (corrections, instructions) are protected.
- `MemoryMaintainer.deduplicate()` finds pairs with cosine similarity ≥ `duplicate_threshold` (default 0.95), keeps the one with higher importance/access, and transfers the access count to the survivor.
- `MemoryMaintainer.get_health_metrics()` returns a `HealthMetrics` dataclass: total/episodic/semantic counts, avg importance, avg access count, pruned/merged counts.
- `MemoryMaintainer.maintain()` runs deduplicate then prune in sequence.
- `pipeline.maintain()` and `pipeline.health()` for easy access.
- Added `store.remove()` and `store.rebuild_index()` to `MemoryStore` for index reconstruction after deletions.
Results: old low-importance memories are pruned; high-importance and recently-accessed memories survive; near-duplicates merge with access count transfer; FAISS index correctly rebuilt after maintenance.
Features originally listed as "Future Considerations" that have been implemented:
- Multi-agent memory sharing — `cmm/multi_agent/shared_store.py`: `SharedMemoryManager` with per-agent pipelines, scoped visibility (PRIVATE/SHARED/TEAM), contradiction detection with agent attribution, and nightly consolidation.
- Emotional valence tagging — `cmm/scoring/valence.py`: `ValenceScorer` tags each memory with valence (-1 to +1), arousal (0 to 1), and emotion labels. Emotional context is surfaced in retrieval formatting.
- Entity-linked spreading activation — `cmm/retrieval/entity_index.py`: spaCy NER extracts named entities at storage time. Spreading activation traverses both embedding neighbors AND entity links for cross-domain association.
- Embedding model upgrade — from all-MiniLM-L6-v2 (384D) to all-mpnet-base-v2 (768D) for better cross-domain similarity.
Multiple integration paths for different environments. See integrations/README.md for full guide.
- HTTP Memory Server (`integrations/claude-code/memory_server.py`) — Language-agnostic REST API. Any application that can make HTTP calls gets autoassociative memory.
- Claude Code Hooks (`integrations/claude-code/hooks/`) — `UserPromptSubmit` and `Stop` hooks for fully automatic two-way monitoring. Zero manual intervention.
- Python API Middleware (`integrations/middleware.py`) — Wraps any OpenAI-compatible or Anthropic API call with automatic memory ingest + recall.
- MCP Server (`integrations/mcp/server.py`) — Exposes memory as MCP tools for Claude Desktop, Cursor, and any MCP client. Semi-automatic (the LLM decides when to call tools).
- Gist encoder backends — Ollama (local), OpenAI-compatible (any provider), Anthropic (Claude API), Passthrough (no LLM needed).
All integration paths support true autoassociative memory except MCP tools, which require the LLM to decide to call the memory tools.
- Persistence — `pipeline.save(directory)` / `CognitiveMemoryPipeline.load(directory)` saves and restores the FAISS index, all memory metadata (importance, timestamps, agent_id, valence, etc.), and the entity index. Three files: `faiss.index`, `memories.json`, `entities.json`.
- Distributed multi-agent server — the HTTP memory server (`integrations/claude-code/memory_server.py`) supports:
  - `agent_id` and `session_id` on every request for multi-agent tagging
  - `ThreadingMixIn` for concurrent access from multiple agents
  - `--data-dir` for automatic persistence (saves on shutdown, loads on startup)
  - `--auto-save N` for periodic background saves
  - `--host 0.0.0.0` for network access (remote agents)
  - `POST /consolidate` for nightly consolidation
  - `POST /contradictions` for cross-agent contradiction detection
  - `GET /stats` for per-agent memory counts
The 60-90 agent, 30-person developer team scenario: all agents hit http://memory-server:7832/ingest_and_recall with their agent_id. The server maintains a shared FAISS index. Each agent's memories are tagged and scoped. Nightly consolidation runs via POST /consolidate.
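A client in that scenario needs only a JSON POST. A hedged sketch using the stdlib: the endpoint and `agent_id`/`session_id` fields come from the description above, but the exact payload shape and response format are assumptions.

```python
import json
from urllib import request

def build_payload(agent_id: str, session_id: str, role: str, content: str) -> bytes:
    """JSON body for /ingest_and_recall. Field names beyond agent_id and
    session_id are assumed for illustration."""
    return json.dumps({"agent_id": agent_id, "session_id": session_id,
                       "role": role, "content": content}).encode()

def ingest_and_recall(base_url: str, **fields) -> dict:
    """POST one turn to the shared memory server; the response is assumed
    to carry the stored gist plus any recalled memories."""
    req = request.Request(f"{base_url}/ingest_and_recall",
                          data=build_payload(**fields),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

# ingest_and_recall("http://memory-server:7832", agent_id="agent-7",
#                   session_id="s1", role="user", content="zebra facts")
```
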
Ideas captured for potential future work, not in current scope:
- Multi-modal memories: storing memories from images, audio, structured data
- Continual learning integration: using memory patterns to influence model fine-tuning
- Hardware acceleration: replacing FAISS with TCAM/neuromorphic hardware for true O(1)