Recursive Episodic Memory RAG — a cognitive memory system for AI agents that need to remember users across long conversations, not just retrieve documents from a static index.
Traditional RAG answers questions by searching a fixed corpus of chunks. REM-RAG answers questions by searching what the agent has learned from past interactions: recent turns, structured events, compressed summaries, reusable workflows, and relationships between users, goals, and concepts.
Long-running agents (assistants, copilots, support bots) face three limits of classic RAG:
| Problem | Classic RAG | REM-RAG |
|---|---|---|
| Memory scope | External documents only | Conversation history + learned abstractions |
| Context cost | Stuff full chat logs into the prompt | Reconstruct a compact memory context |
| Memory quality | No notion of importance or decay | Score, compress, consolidate, and forget low-value memories |
REM-RAG is built for long-horizon use: preferences that change over time, recurring workflows, contradictions, and thousands of turns without blowing the context window.
- Working memory — last N turns and task state; TTL eviction (Redis or in-memory).
- Episodic memory — each interaction stored as a structured event (actor, action, object, outcome, timestamp, embedding).
- Semantic memory — compressed summaries distilled from many episodes (clustering + summarization).
- Procedural memory — reusable step-by-step workflows learned from user behavior.
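
To make the working-memory layer concrete, here is a minimal in-memory sketch of last-N-turn storage with TTL eviction. The class name TTLWorkingMemory, its parameters, and the injectable clock are hypothetical illustrations, not the actual API of memory_engine/working.py.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class TTLWorkingMemory:
    """Hypothetical sketch: keep the last N turns per session, expiring old ones."""

    def __init__(self, max_turns: int = 5, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_turns = max_turns
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so eviction can be exercised deterministically
        self._turns: Dict[str, Deque[Tuple[float, str]]] = {}

    def append(self, session_id: str, turn: str) -> None:
        # deque(maxlen=N) silently drops the oldest turn once N is exceeded.
        buf = self._turns.setdefault(session_id, deque(maxlen=self.max_turns))
        buf.append((self.clock(), turn))

    def recent(self, session_id: str) -> List[str]:
        # Evict expired turns lazily on read, then return what is left.
        buf = self._turns.get(session_id)
        if buf is None:
            return []
        cutoff = self.clock() - self.ttl_seconds
        while buf and buf[0][0] < cutoff:
            buf.popleft()
        return [turn for _, turn in buf]
```

A Redis-backed variant would typically delegate expiry to the server instead of evicting on read; this sketch only covers the in-memory case.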
The memory encoding pipeline turns raw messages into storable events:
- Entity and intent extraction
- Event structuring (who did what, to what, with what outcome)
- Semantic abstraction across multiple episodes
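
The event-structuring step can be sketched as follows. This is an illustration under stated assumptions: the StructuredEvent fields follow the episodic-memory description (actor, action, object, outcome, timestamp, embedding), while the structure_event function and its single regex rule are hypothetical stand-ins for real entity and intent extraction.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class StructuredEvent:
    """Hypothetical event record mirroring the fields listed for episodic memory."""
    actor: str
    action: str
    object: str
    outcome: Optional[str]
    timestamp: datetime
    embedding: List[float] = field(default_factory=list)


# Toy intent rule (assumption): a real pipeline would use an NER or intent model.
_PREFERENCE = re.compile(
    r"\bI prefer (?P<obj>.+?)(?: over (?P<alt>.+?))?[.!]?$", re.IGNORECASE
)


def structure_event(actor: str, text: str) -> StructuredEvent:
    """Turn one raw message into a who-did-what-to-what event."""
    text = text.strip()
    now = datetime.now(timezone.utc)
    match = _PREFERENCE.search(text)
    if match:
        alt = match.group("alt")
        outcome = f"preferred over {alt}" if alt else None
        return StructuredEvent(actor, "prefers", match.group("obj"), outcome, now)
    # Fallback: keep the whole message as the object of a generic "said" action.
    return StructuredEvent(actor, "said", text, None, now)
```

The embedding is left empty here; in practice it would be filled by whatever embedding model the deployment uses.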
Memories are ranked with a learned-style importance function that combines recurrence, utility, recency, confidence, and future relevance. Low-importance memories fade through temporal decay rather than persisting indefinitely.
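
One way to picture importance scoring with temporal decay is a weighted sum in which recency follows an exponential half-life. The weights, the 30-day half-life, the forgetting threshold, and all names below are assumptions chosen for illustration; they are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class MemorySignals:
    """Hypothetical per-memory signals; all but age_days are normalized to [0, 1]."""
    recurrence: float
    utility: float
    confidence: float
    future_relevance: float
    age_days: float


def recency(age_days: float, half_life_days: float = 30.0) -> float:
    # Exponential decay: the value halves every `half_life_days`.
    return 0.5 ** (age_days / half_life_days)


def importance(m: MemorySignals, half_life_days: float = 30.0) -> float:
    # Assumed weights (they sum to 1.0); a learned model could replace them.
    return (0.25 * m.recurrence
            + 0.25 * m.utility
            + 0.20 * recency(m.age_days, half_life_days)
            + 0.15 * m.confidence
            + 0.15 * m.future_relevance)


def should_forget(m: MemorySignals, threshold: float = 0.2) -> bool:
    # Memories scoring below the (assumed) threshold become candidates for deletion.
    return importance(m) < threshold
```

With these weights, a memory that is never reinforced loses only its recency share over time, so old but frequently useful memories survive while one-off trivia fades.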
Sleep consolidation is a background “sleep” pass that:
- Merges redundant memories
- Builds higher-level semantic abstractions
- Drops low-utility episodes
- Re-clusters semantic groups
- Resolves contradictions
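
Two of the steps above, merging redundant memories and dropping low-utility episodes, can be sketched with a greedy pass over embeddings. This is a simplified stand-in, not the repository's consolidation code: the Memory tuple layout, the thresholds, and the consolidate function here are hypothetical, and clustering, abstraction, and contradiction handling are left out.

```python
import math
from typing import List, Sequence, Tuple

Memory = Tuple[str, List[float], float]  # (text, embedding, importance), assumed layout


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(memories: List[Memory], merge_at: float = 0.95,
                drop_below: float = 0.1) -> List[Memory]:
    kept: List[Memory] = []
    # Visit the most important memories first so they act as representatives.
    for text, emb, score in sorted(memories, key=lambda m: m[2], reverse=True):
        if score < drop_below:
            continue  # drop low-utility episodes
        if any(cosine(emb, k[1]) >= merge_at for k in kept):
            continue  # redundant with an already-kept memory
        kept.append((text, emb, score))
    return kept
```

The greedy pass is quadratic in the number of kept memories, which is acceptable for a periodic background job over a modest store but would need an index for large ones.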
Queries hit all memory layers at once using:
- Vector similarity (embeddings)
- BM25 keyword scoring
- Graph neighborhood search
- Importance and recency weighting
Retrieved hits are fused, de-duplicated, conflict-resolved, and summarized into a single compact prompt block — not a dump of raw chunks.
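
A simple way to fuse these signals is a weighted sum over min-max normalized scores, keyed by memory ID so duplicates collapse into one entry. The weights and function names below are illustrative assumptions, and weighted-sum fusion is just one option; the repository may combine signals differently.

```python
from typing import Dict, List, Tuple

# Assumed weights for each signal; illustrative, not the repo's values.
WEIGHTS = {"vector": 0.4, "bm25": 0.3, "graph": 0.1, "importance": 0.1, "recency": 0.1}


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one signal to [0, 1] so signals with different ranges are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(signals: Dict[str, Dict[str, float]], top_k: int = 3) -> List[Tuple[str, float]]:
    """Combine per-signal scores (memory_id -> score) into one ranked list."""
    fused: Dict[str, float] = {}
    for name, scores in signals.items():
        weight = WEIGHTS.get(name, 0.0)
        for memory_id, value in min_max(scores).items():
            # A memory missing from a signal simply contributes 0 for it.
            fused[memory_id] = fused.get(memory_id, 0.0) + weight * value
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Normalizing per signal matters because raw BM25 scores are unbounded while cosine similarities are not; without it, the keyword signal would dominate the sum.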
Optional Neo4j (or NetworkX fallback) links users → goals → concepts for relationship-aware retrieval.
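
Graph neighborhood search over user → goal → concept links can be pictured as a bounded breadth-first traversal. The sketch below uses a plain adjacency dict as a dependency-free stand-in for the Neo4j or NetworkX layer; the node names, the EDGES data, and the neighborhood function are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical edges: user -> goals -> concepts.
EDGES: Dict[str, List[str]] = {
    "user:u1": ["goal:ship-dashboard"],
    "goal:ship-dashboard": ["concept:streamlit", "concept:dark-mode"],
    "concept:streamlit": ["concept:python"],
}


def neighborhood(start: str, max_hops: int,
                 edges: Dict[str, List[str]] = EDGES) -> Set[str]:
    """Breadth-first search returning nodes within `max_hops` of `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen - {start}
```

Memories attached to the returned nodes could then be added to the candidate set, which lets a query about a goal surface related concepts even when the wording does not match.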
- Synthetic benchmark generator (long chats, preferences, conflicts, workflows)
- Metrics: recall@k, MRR, compression ratio, preference retention
- Streamlit dashboard: memory growth, decay curves, retrieval traces, killer demo over 1000+ interactions
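
For reference, the retrieval metrics named above can be computed as in the sketch below. recall@k and MRR follow their standard definitions; the compression-ratio formula (raw characters divided by compact characters) is an assumption, and none of these function names are taken from evaluation_suite/metrics.py.

```python
from typing import List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(runs: List[Sequence[str]], relevant: List[Set[str]]) -> float:
    """Average of 1/rank of the first relevant item per query (0 if none found)."""
    total = 0.0
    for retrieved, rel in zip(runs, relevant):
        for rank, memory_id in enumerate(retrieved, start=1):
            if memory_id in rel:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0


def compression_ratio(raw_chars: int, compact_chars: int) -> float:
    # Assumed definition: how many times smaller the compact context is.
    return raw_chars / compact_chars if compact_chars else float("inf")
```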
- Python API: CognitiveMemoryRAG orchestrator (ingest, query, consolidate, stats)
- CLI: main.py for ingest, query, eval, dataset generation
- HTTP API: FastAPI (/ingest, /query, /consolidate, /stats)
REM-RAG is organized as a pipeline around a single orchestrator (rem_rag/orchestrator.py) that coordinates memory engines, retrieval, consolidation, and optional external stores.
flowchart TB
subgraph clients [Clients]
CLI[main.py CLI]
API[FastAPI HTTP]
SDK[Python SDK]
UI[Streamlit Dashboard]
end
subgraph orchestrator [Orchestrator]
CMR[CognitiveMemoryRAG]
end
subgraph write_path [Write Path]
ENC[Memory Encoding Pipeline]
IMP[Importance Scoring Engine]
SLP[Sleep Consolidation Pipeline]
CMP[Memory Compression Engine]
end
subgraph stores [Memory Stores]
WM[Working Memory]
EM[Episodic Memory]
SM[Semantic Memory]
PM[Procedural Memory]
GM[Graph Memory]
end
subgraph read_path [Read Path]
HYB[Hybrid Retrieval Engine]
SCR[Retrieval Scorer]
CTX[Context Reconstruction]
end
subgraph infra [Optional Infrastructure]
Redis[(Redis)]
PG[(PostgreSQL)]
QD[(Qdrant)]
N4J[(Neo4j)]
end
LLM[Your LLM]
CLI --> CMR
API --> CMR
SDK --> CMR
UI --> CMR
CMR --> ENC
ENC --> IMP
IMP --> WM
IMP --> EM
IMP --> GM
ENC --> PM
CMR --> SLP
SLP --> CMP
CMP --> SM
SLP --> EM
CMR --> HYB
HYB --> SCR
HYB --> WM
HYB --> EM
HYB --> SM
HYB --> PM
HYB --> GM
HYB --> CTX
CTX --> LLM
WM -.-> Redis
EM -.-> PG
SM -.-> QD
GM -.-> N4J
Dashed lines apply when USE_EXTERNAL_STORES=true; otherwise all stores run in-process.
Memories move from short-lived, high-fidelity traces toward compressed, long-lived abstractions.
flowchart LR
subgraph fast [Fast / Volatile]
WM2[Working Memory]
end
subgraph structured [Structured / Durable]
EM2[Episodic Memory]
PM2[Procedural Memory]
end
subgraph compressed [Compressed / Abstract]
SM2[Semantic Memory]
end
subgraph relational [Relational]
GM2[Graph Memory]
end
WM2 -->|"encode"| EM2
EM2 -->|"sleep consolidate"| SM2
EM2 -->|"workflow detect"| PM2
EM2 -->|"link entities"| GM2
SM2 -->|"retrieve"| OUT[Reconstructed Context]
EM2 --> OUT
PM2 --> OUT
WM2 --> OUT
GM2 --> OUT
| Layer | Module | Typical content | Backend |
|---|---|---|---|
| Working | memory_engine/working.py | Last turns, task state | Redis or dict + TTL |
| Episodic | memory_engine/episodic.py | Structured events + embeddings | PostgreSQL or list |
| Semantic | memory_engine/semantic.py | Cluster summaries | Qdrant or in-memory |
| Procedural | memory_engine/procedural.py | Multi-step workflows | In-memory |
| Graph | graph_memory/layer.py | User → goal → concept edges | Neo4j or NetworkX |
Every user or assistant message flows through encoding, scoring, and storage. Consolidation runs periodically or on demand.
sequenceDiagram
participant User
participant Orchestrator as CognitiveMemoryRAG
participant WM as WorkingMemory
participant ENC as EncodingPipeline
participant IMP as ImportanceEngine
participant EM as EpisodicStore
participant GM as GraphMemory
participant PM as ProceduralStore
participant SLP as SleepConsolidation
participant SM as SemanticStore
User->>Orchestrator: ingest(message)
Orchestrator->>WM: append(session, turn)
Orchestrator->>ENC: parse, structure event
ENC-->>Orchestrator: StructuredEvent
Orchestrator->>IMP: score(importance)
Orchestrator->>EM: store(event)
Orchestrator->>GM: link(user, intent, concept)
alt workflow intent
Orchestrator->>PM: learn_from_pattern(steps)
end
alt every N interactions
Orchestrator->>SLP: run()
SLP->>EM: decay / delete low utility
SLP->>SLP: cluster + compress
SLP->>SM: store semantic summaries
end
A question triggers multi-store retrieval, scoring, and prompt reconstruction, rather than replaying the full chat log.
sequenceDiagram
participant User
participant Orchestrator as CognitiveMemoryRAG
participant HYB as HybridRetriever
participant Stores as All Memory Stores
participant CTX as ContextReconstruction
participant LLM as Your LLM
User->>Orchestrator: query(question)
Orchestrator->>HYB: retrieve(query, session, user)
HYB->>Stores: vector + BM25 + graph + recency
Stores-->>HYB: ranked candidates
HYB-->>Orchestrator: RetrievalResults
Orchestrator->>CTX: fuse, resolve conflicts, summarize
CTX-->>Orchestrator: compact prompt
Orchestrator-->>User: prompt + evidence
User->>LLM: prompt + question
Compression happens at two stages so agents keep recall without linear token growth.
flowchart TB
subgraph store_time [Store-time compression]
E1[Many episodic events]
CL[Clustering]
SU[Summarization]
DE[Dedup and conflict resolve]
FG[Forget low-importance]
E1 --> CL --> SU --> SM3[Semantic memories]
E1 --> FG
E1 --> DE
end
subgraph query_time [Query-time compression]
R1[Retrieved hits from all layers]
FU[Fusion and dedup]
CR[Conflict resolution]
CAP[Character budget cap]
R1 --> FU --> CR --> CAP --> P1[Final LLM prompt]
end
SM3 --> R1
| Stage | When | Goal |
|---|---|---|
| Store-time | consolidate() / sleep pipeline | Shrink what is persisted |
| Query-time | Every query() | Shrink what enters the LLM context |
Compression is evaluated via the compression ratio metric in evaluation_suite/metrics.py.
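
The query-time stage (fusion and dedup followed by a character budget cap) can be sketched as below. The build_context function, its normalization rule, and the default budget are hypothetical; conflict resolution is omitted to keep the example short.

```python
from typing import Iterable, List


def build_context(hits: Iterable[str], max_chars: int = 200) -> str:
    """Dedup ranked hits and pack them into a character-capped prompt block."""
    seen = set()
    lines: List[str] = []
    used = 0
    for hit in hits:  # hits are assumed to arrive best-first
        key = " ".join(hit.lower().split())  # normalize case and whitespace
        if key in seen:
            continue  # skip near-verbatim duplicates
        seen.add(key)
        line = f"- {hit.strip()}"
        cost = len(line) + (1 if lines else 0)  # +1 for the joining newline
        if used + cost > max_chars:
            break  # budget exhausted: lower-ranked hits are dropped
        lines.append(line)
        used += cost
    return "\n".join(lines)
```

Because hits are consumed best-first, the cap trims the least relevant material, so the prompt size stays bounded no matter how many memories were retrieved.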
flowchart LR
subgraph packages [Python packages]
ME[memory_engine]
RE[retrieval_engine]
CE[consolidation_engine]
GR[graph_memory]
RR[rem_rag]
end
subgraph tooling [Tooling]
BD[benchmark_datasets]
ES[evaluation_suite]
VD[visualization_dashboard]
MN[main.py]
end
RR --> ME
RR --> RE
RR --> CE
RR --> GR
MN --> RR
ES --> RR
VD --> RR
BD --> ES
memory_engine/ Encoding, stores, importance, compression
retrieval_engine/ Hybrid search + context reconstruction
consolidation_engine/ Sleep consolidation pipeline
graph_memory/ Neo4j / NetworkX graph layer
benchmark_datasets/ Synthetic trace generator
evaluation_suite/ Benchmarks and metrics
visualization_dashboard/ Streamlit UI
rem_rag/ Config, embeddings, orchestrator
main.py CLI and API entrypoint
Setup uses the dev conda environment.
conda activate dev
cd REM-RAG
export PYTHONPATH="$(pwd)"
pip install -r requirements.txt

Or:

conda env update -f environment.yml --prune
source scripts/activate_dev.sh

In-memory mode (default): no Docker required; all stores run locally.

Production-style backends — start services and enable external stores:
docker compose up -d
cp .env.example .env
export USE_EXTERNAL_STORES=true

| Service | Used for |
|---|---|
| Redis | Working memory |
| PostgreSQL | Episodic memory |
| Qdrant | Semantic vectors |
| Neo4j | Graph memory |
conda activate dev
export PYTHONPATH="$(pwd)"
python main.py ingest --user u1 --session s1 --text "I prefer dark mode for all UI work"
python main.py query --user u1 --session s1 --text "What theme do I prefer?"
python main.py consolidate
python main.py stats

Generate synthetic data and run evaluation:

python main.py generate
python main.py eval

Start the HTTP API:

python main.py serve --port 8000

Open the dashboard:

streamlit run visualization_dashboard/app.py

Python usage:

from rem_rag import CognitiveMemoryRAG
from rem_rag.types import Interaction
rag = CognitiveMemoryRAG()
rag.ingest(Interaction(user_id="u1", session_id="s1", role="user", content="I prefer Python over JavaScript"))
result = rag.query("u1", "s1", "Which language do I prefer?")
print(result["prompt"])

- Run streamlit run visualization_dashboard/app.py
- Open the Killer Demo tab
- Ingest hundreds or thousands of synthetic interactions
- Inspect episodic vs semantic memory counts, decay plots, and reconstructed retrieval context
This repo explores whether agents can match or beat raw chat-history RAG on:
- Long-horizon QA with fewer tokens
- Preference retention over many sessions
- Handling contradictory information
- Recalling learned workflows without re-explaining them
MIT