REM-RAG

Recursive Episodic Memory RAG — a cognitive memory system for AI agents that need to remember users across long conversations, not just retrieve documents from a static index.

Traditional RAG answers questions by searching a fixed corpus of chunks. REM-RAG answers questions by searching what the agent has learned from past interactions: recent turns, structured events, compressed summaries, reusable workflows, and relationships between users, goals, and concepts.

What problem does this solve?

Long-running agents (assistants, copilots, support bots) face three limits of classic RAG:

| Problem | Classic RAG | REM-RAG |
| --- | --- | --- |
| Memory scope | External documents only | Conversation history + learned abstractions |
| Context cost | Stuff full chat logs into the prompt | Reconstruct a compact memory context |
| Memory quality | No notion of importance or decay | Score, compress, consolidate, and forget low-value memories |

REM-RAG is built for long-horizon use: preferences that change over time, recurring workflows, contradictions, and thousands of turns without blowing the context window.

Features

Hierarchical memory (4 layers)

Working memory — last N turns and task state; TTL eviction (Redis or in-memory).
Episodic memory — each interaction stored as a structured event (actor, action, object, outcome, timestamp, embedding).
Semantic memory — compressed summaries distilled from many episodes (clustering + summarization).
Procedural memory — reusable step-by-step workflows learned from user behavior.
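
As a rough illustration of the working-memory layer, the sketch below keeps the last N turns per session and evicts entries once they exceed a TTL. It is an in-memory stand-in written for this README, not the code in memory_engine/working.py; the class name, method names, and default values are assumptions.

```python
import time
from collections import deque


class WorkingMemory:
    """Keeps the last `max_turns` turns per session and evicts expired ones on read."""

    def __init__(self, max_turns=5, ttl_seconds=1800.0, clock=time.monotonic):
        self.max_turns = max_turns
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so the TTL can be tested without sleeping
        self._sessions = {}  # session_id -> deque of (stored_at, turn)

    def append(self, session_id, turn):
        # deque(maxlen=...) silently discards the oldest turn when full.
        turns = self._sessions.setdefault(session_id, deque(maxlen=self.max_turns))
        turns.append((self.clock(), turn))

    def recent(self, session_id):
        """Return unexpired turns, oldest first."""
        turns = self._sessions.get(session_id)
        if turns is None:
            return []
        cutoff = self.clock() - self.ttl_seconds
        while turns and turns[0][0] < cutoff:
            turns.popleft()  # oldest entries expire first
        return [turn for _, turn in turns]
```

With Redis as the backend, the same behavior would typically come from per-key expiry rather than a scan on read.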

Memory encoding pipeline

Turns raw messages into storable events:

Entity and intent extraction
Event structuring (who did what, to what, with what outcome)
Semantic abstraction across multiple episodes
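
To make the event-structuring step concrete, here is a minimal sketch that turns one raw message into a structured event with the fields listed for episodic memory. The keyword rules, intent labels, and the `encode` function are hypothetical placeholders; the real pipeline may well use a model for extraction.

```python
import re
import time
from dataclasses import dataclass, field


@dataclass
class StructuredEvent:
    actor: str
    action: str      # coarse intent label
    object: str      # what the action is about
    outcome: str
    timestamp: float
    embedding: list = field(default_factory=list)  # filled in by an embedding model elsewhere


# Hypothetical keyword rules, checked in order.
INTENT_PATTERNS = [
    ("state_preference", re.compile(r"\bI (?:prefer|like|love)\s+(?P<obj>.+)", re.I)),
    ("ask_question", re.compile(r"^(?P<obj>.+)\?\s*$")),
]


def encode(actor, text, now=None):
    """Turn a raw message into a StructuredEvent using the rules above."""
    for intent, pattern in INTENT_PATTERNS:
        match = pattern.search(text)
        if match:
            obj = match.group("obj").strip().rstrip(".!?")
            break
    else:
        # No rule matched: store the message as a plain statement.
        intent, obj = "statement", text.strip()
    return StructuredEvent(
        actor=actor,
        action=intent,
        object=obj,
        outcome="recorded",
        timestamp=time.time() if now is None else now,
    )
```

Calling `encode("u1", "I prefer dark mode for all UI work")` yields an event whose action is `state_preference` and whose object is `dark mode for all UI work`.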

Importance & decay

Memories are ranked by an importance score that combines recurrence, utility, recency, confidence, and expected future relevance. Low-importance memories fade through temporal decay instead of persisting indefinitely.
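
One simple way to realize this is a weighted sum of the five signals followed by exponential decay with a half-life. The sketch below assumes that form; the weights, the half-life, and the forgetting threshold are invented for illustration and are not taken from the repository.

```python
# Hypothetical weights; they sum to 1 so the score stays in [0, 1].
WEIGHTS = {
    "recurrence": 0.25,
    "utility": 0.25,
    "recency": 0.20,
    "confidence": 0.15,
    "future_relevance": 0.15,
}


def importance(signals):
    """Weighted sum of signals, each expected in [0, 1]; missing signals count as 0."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def decayed(score, age_days, half_life_days=30.0):
    """Exponential decay: the score halves every `half_life_days`."""
    return score * 0.5 ** (age_days / half_life_days)


def should_forget(score, age_days, threshold=0.1):
    """A memory becomes a forgetting candidate once its decayed score drops below the threshold."""
    return decayed(score, age_days) < threshold
```

Under these assumed defaults, a memory scored 0.8 is kept after 30 days (decayed score 0.4) and becomes a forgetting candidate after 120 days (decayed score 0.05).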

Sleep consolidation

Background “sleep” pass that:

Merges redundant memories
Builds higher-level semantic abstractions
Drops low-utility episodes
Re-clusters semantic groups
Resolves contradictions
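
Two of the steps above, merging redundant memories and dropping low-utility episodes, can be sketched with cosine similarity over embeddings. This is a simplified stand-in for the consolidation pipeline: the dict layout, thresholds, and the `merged_count` field are assumptions, and clustering, summarization, and contradiction handling are left out.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(memories, merge_threshold=0.95, min_importance=0.2):
    """memories: list of dicts with 'text', 'embedding', and 'importance'.

    Returns a new list in which low-importance items are dropped and
    near-duplicates are folded into the most important representative.
    """
    kept = []
    # Visit the most important memories first so they become the representatives.
    for mem in sorted(memories, key=lambda m: m["importance"], reverse=True):
        if mem["importance"] < min_importance:
            continue  # drop low-utility episode
        duplicate = next(
            (k for k in kept if cosine(k["embedding"], mem["embedding"]) >= merge_threshold),
            None,
        )
        if duplicate is None:
            kept.append({**mem, "merged_count": 1})
        else:
            duplicate["merged_count"] += 1  # recurrence evidence for later scoring
    return kept
```

The pairwise comparison is quadratic in the number of kept memories, which is acceptable for a background pass over a few thousand episodes but would need an approximate nearest-neighbor index beyond that.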

Hybrid retrieval

Queries hit all memory layers at once using:

Vector similarity (embeddings)
BM25 keyword scoring
Graph neighborhood search
Importance and recency weighting
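
The sketch below shows one plausible way to combine these signals: compute Okapi BM25 for the keyword component, normalize each signal to [0, 1], and rank candidates by a weighted sum. The fusion weights and the candidate dict layout are assumptions; the repository's scorer may combine signals differently.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for a whitespace-tokenized query."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(tokens) for tokens in tokenized) / max(len(tokenized), 1)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log((len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def normalize(values):
    """Scale by the maximum so different signals are comparable in [0, 1]."""
    top = max(values, default=0.0)
    return [v / top if top > 0 else 0.0 for v in values]


# Hypothetical fusion weights, one per signal.
FUSION = {"vector": 0.4, "bm25": 0.3, "graph": 0.1, "importance": 0.1, "recency": 0.1}


def fuse(candidates):
    """candidates: dicts with 'text' plus one value in [0, 1] per signal in FUSION.

    Returns the candidates sorted best first; missing signals count as 0.
    """
    def total(candidate):
        return sum(weight * candidate.get(name, 0.0) for name, weight in FUSION.items())

    return sorted(candidates, key=total, reverse=True)
```

A weighted sum is easy to tune and inspect in retrieval traces; rank-based fusion is a common alternative when the raw scores are hard to normalize.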

Context reconstruction

Retrieved hits are fused, de-duplicated, conflict-resolved, and summarized into a single compact prompt block — not a dump of raw chunks.
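
A minimal sketch of this step, assuming each hit carries a score, a timestamp, and an optional `key` naming the fact it asserts (a hypothetical field, for example `"ui_theme"`). Duplicates are removed, conflicts on the same key are resolved in favor of the newest hit, and the output is capped by a character budget. Summarization is omitted.

```python
def reconstruct_context(hits, max_chars=400):
    """hits: dicts with 'text', 'score', 'timestamp', and optional 'key'.

    Hits sharing a key are treated as conflicting; only the most recent survives.
    """
    latest_by_key = {}
    unkeyed = []
    seen_text = set()
    for hit in hits:
        normalized = " ".join(hit["text"].lower().split())
        if normalized in seen_text:
            continue  # de-duplicate identical statements
        seen_text.add(normalized)
        key = hit.get("key")
        if key is None:
            unkeyed.append(hit)
        elif key not in latest_by_key or hit["timestamp"] > latest_by_key[key]["timestamp"]:
            latest_by_key[key] = hit  # conflict resolution: newest wins
    ranked = sorted(
        list(latest_by_key.values()) + unkeyed,
        key=lambda h: h["score"],
        reverse=True,
    )

    lines, used = [], 0
    for hit in ranked:
        line = "- " + hit["text"]
        if used + len(line) + 1 > max_chars:
            break  # character budget cap
        lines.append(line)
        used += len(line) + 1  # +1 for the newline separator
    return "\n".join(lines)
```

"Newest wins" is only one conflict policy; weighting by confidence or importance would fit the same structure.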

Graph memory

Optional Neo4j (or NetworkX fallback) links users → goals → concepts for relationship-aware retrieval.
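
To illustrate relationship-aware retrieval without any dependency, the sketch below stores user → goal → concept edges in a plain adjacency map and returns everything within a fixed number of hops. It stands in for the Neo4j or NetworkX layer; the class, method names, and node naming scheme are assumptions.

```python
from collections import defaultdict, deque


class GraphMemory:
    """Directed graph of (relation, target) edges keyed by source node."""

    def __init__(self):
        self._edges = defaultdict(set)  # node -> set of (relation, neighbor)

    def link(self, source, relation, target):
        self._edges[source].add((relation, target))

    def neighborhood(self, start, max_hops=2):
        """Breadth-first search; returns {node: hop_distance}, excluding the start node."""
        distances = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if distances[node] == max_hops:
                continue  # do not expand beyond the hop limit
            for _, neighbor in sorted(self._edges.get(node, ())):
                if neighbor not in distances:
                    distances[neighbor] = distances[node] + 1
                    queue.append(neighbor)
        del distances[start]
        return distances
```

The hop distance can then feed the graph signal in retrieval, for example by scoring closer nodes higher.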

Evaluation & observability

Synthetic benchmark generator (long chats, preferences, conflicts, workflows)
Metrics: recall@k, MRR, compression ratio, preference retention
Streamlit dashboard: memory growth, decay curves, retrieval traces, killer demo over 1000+ interactions
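
For reference, recall@k and MRR follow their standard definitions, sketched below. The function names and argument shapes are illustrative and may differ from those in evaluation_suite/metrics.py.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear among the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query.

    Each query contributes 1/rank of its first relevant hit, or 0 if none is retrieved.
    """
    total = 0.0
    for retrieved, relevant in runs:
        relevant = set(relevant)
        rank = next(
            (i for i, doc_id in enumerate(retrieved, start=1) if doc_id in relevant),
            None,
        )
        total += 1.0 / rank if rank else 0.0
    return total / len(runs) if runs else 0.0
```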

Interfaces

Python API — CognitiveMemoryRAG orchestrator (ingest, query, consolidate, stats)
CLI — main.py for ingest, query, eval, dataset generation
HTTP API — FastAPI (/ingest, /query, /consolidate, /stats)

System architecture

REM-RAG is organized as a pipeline around a single orchestrator (rem_rag/orchestrator.py) that coordinates memory engines, retrieval, consolidation, and optional external stores.

Component overview

flowchart TB
  subgraph clients [Clients]
    CLI[main.py CLI]
    API[FastAPI HTTP]
    SDK[Python SDK]
    UI[Streamlit Dashboard]
  end

  subgraph orchestrator [Orchestrator]
    CMR[CognitiveMemoryRAG]
  end

  subgraph write_path [Write Path]
    ENC[Memory Encoding Pipeline]
    IMP[Importance Scoring Engine]
    SLP[Sleep Consolidation Pipeline]
    CMP[Memory Compression Engine]
  end

  subgraph stores [Memory Stores]
    WM[Working Memory]
    EM[Episodic Memory]
    SM[Semantic Memory]
    PM[Procedural Memory]
    GM[Graph Memory]
  end

  subgraph read_path [Read Path]
    HYB[Hybrid Retrieval Engine]
    SCR[Retrieval Scorer]
    CTX[Context Reconstruction]
  end

  subgraph infra [Optional Infrastructure]
    Redis[(Redis)]
    PG[(PostgreSQL)]
    QD[(Qdrant)]
    N4J[(Neo4j)]
  end

  LLM[Your LLM]

  CLI --> CMR
  API --> CMR
  SDK --> CMR
  UI --> CMR

  CMR --> ENC
  ENC --> IMP
  IMP --> WM
  IMP --> EM
  IMP --> GM
  ENC --> PM
  CMR --> SLP
  SLP --> CMP
  CMP --> SM
  SLP --> EM

  CMR --> HYB
  HYB --> SCR
  HYB --> WM
  HYB --> EM
  HYB --> SM
  HYB --> PM
  HYB --> GM
  HYB --> CTX
  CTX --> LLM

  WM -.-> Redis
  EM -.-> PG
  SM -.-> QD
  GM -.-> N4J

Dashed lines apply when USE_EXTERNAL_STORES=true; otherwise all stores run in-process.
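
A sketch of how such a switch could select backends per layer, using the backend names from this README. The function names and the set of accepted truthy strings are assumptions; only `USE_EXTERNAL_STORES=true` is documented here.

```python
import os


def use_external_stores(env=os.environ):
    """True when USE_EXTERNAL_STORES holds a truthy string such as "true" or "1"."""
    return env.get("USE_EXTERNAL_STORES", "false").strip().lower() in {"1", "true", "yes"}


def select_backends(env=os.environ):
    """Map each memory layer to a backend name, mirroring the dashed lines in the diagram."""
    if use_external_stores(env):
        return {
            "working": "redis",
            "episodic": "postgresql",
            "semantic": "qdrant",
            "graph": "neo4j",
        }
    # Default: everything runs in-process.
    return {
        "working": "dict",
        "episodic": "list",
        "semantic": "in-memory",
        "graph": "networkx",
    }
```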

Memory hierarchy

Memories move from short-lived, high-fidelity traces toward compressed, long-lived abstractions.

flowchart LR
  subgraph fast [Fast / Volatile]
    WM2[Working Memory]
  end

  subgraph structured [Structured / Durable]
    EM2[Episodic Memory]
    PM2[Procedural Memory]
  end

  subgraph compressed [Compressed / Abstract]
    SM2[Semantic Memory]
  end

  subgraph relational [Relational]
    GM2[Graph Memory]
  end

  WM2 -->|"encode"| EM2
  EM2 -->|"sleep consolidate"| SM2
  EM2 -->|"workflow detect"| PM2
  EM2 -->|"link entities"| GM2
  SM2 -->|"retrieve"| OUT[Reconstructed Context]
  EM2 --> OUT
  PM2 --> OUT
  WM2 --> OUT
  GM2 --> OUT

| Layer | Module | Typical content | Backend |
| --- | --- | --- | --- |
| Working | `memory_engine/working.py` | Last turns, task state | Redis or dict + TTL |
| Episodic | `memory_engine/episodic.py` | Structured events + embeddings | PostgreSQL or list |
| Semantic | `memory_engine/semantic.py` | Cluster summaries | Qdrant or in-memory |
| Procedural | `memory_engine/procedural.py` | Multi-step workflows | In-memory |
| Graph | `graph_memory/layer.py` | User → goal → concept edges | Neo4j or NetworkX |

Write path — ingest

Every user or assistant message flows through encoding, scoring, and storage. Consolidation runs periodically or on demand.

sequenceDiagram
  participant User
  participant Orchestrator as CognitiveMemoryRAG
  participant WM as WorkingMemory
  participant ENC as EncodingPipeline
  participant IMP as ImportanceEngine
  participant EM as EpisodicStore
  participant GM as GraphMemory
  participant PM as ProceduralStore
  participant SLP as SleepConsolidation
  participant SM as SemanticStore

  User->>Orchestrator: ingest(message)
  Orchestrator->>WM: append(session, turn)
  Orchestrator->>ENC: parse, structure event
  ENC-->>Orchestrator: StructuredEvent
  Orchestrator->>IMP: score(importance)
  Orchestrator->>EM: store(event)
  Orchestrator->>GM: link(user, intent, concept)
  alt workflow intent
    Orchestrator->>PM: learn_from_pattern(steps)
  end
  alt every N interactions
    Orchestrator->>SLP: run()
    SLP->>EM: decay / delete low utility
    SLP->>SLP: cluster + compress
    SLP->>SM: store semantic summaries
  end
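
The "every N interactions" branch in the diagram can be pictured as a counter around the store call. The sketch below assumes that shape; `IngestLoop`, its callables, and the default interval are hypothetical and do not describe the actual orchestrator.

```python
class IngestLoop:
    """Stores each event and runs consolidation every `consolidate_every` ingests."""

    def __init__(self, store, consolidate, consolidate_every=50):
        self.store = store              # callable(event) -> None
        self.consolidate = consolidate  # callable() -> None
        self.consolidate_every = consolidate_every
        self.ingested = 0

    def ingest(self, event):
        self.store(event)
        self.ingested += 1
        # Trigger the sleep pass on every N-th ingest.
        if self.ingested % self.consolidate_every == 0:
            self.consolidate()
```

Running consolidation inline keeps the sketch simple; a background worker would avoid adding latency to the ingest that happens to trigger it.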

Read path — query

A question triggers multi-store retrieval, scoring, and prompt reconstruction, so the LLM receives a compact memory context rather than the full chat log.

sequenceDiagram
  participant User
  participant Orchestrator as CognitiveMemoryRAG
  participant HYB as HybridRetriever
  participant Stores as All Memory Stores
  participant CTX as ContextReconstruction
  participant LLM as Your LLM

  User->>Orchestrator: query(question)
  Orchestrator->>HYB: retrieve(query, session, user)
  HYB->>Stores: vector + BM25 + graph + recency
  Stores-->>HYB: ranked candidates
  HYB-->>Orchestrator: RetrievalResults
  Orchestrator->>CTX: fuse, resolve conflicts, summarize
  CTX-->>Orchestrator: compact prompt
  Orchestrator-->>User: prompt + evidence
  User->>LLM: prompt + question

Memory compression (explicit design goal)

Compression happens at two stages so agents keep recall without linear token growth.

flowchart TB
  subgraph store_time [Store-time compression]
    E1[Many episodic events]
    CL[Clustering]
    SU[Summarization]
    DE[Dedup and conflict resolve]
    FG[Forget low-importance]
    E1 --> CL --> SU --> SM3[Semantic memories]
    E1 --> FG
    E1 --> DE
  end

  subgraph query_time [Query-time compression]
    R1[Retrieved hits from all layers]
    FU[Fusion and dedup]
    CR[Conflict resolution]
    CAP[Character budget cap]
    R1 --> FU --> CR --> CAP --> P1[Final LLM prompt]
  end

  SM3 --> R1

| Stage | When | Goal |
| --- | --- | --- |
| Store-time | `consolidate()` / sleep pipeline | Shrink what is persisted |
| Query-time | Every `query()` | Shrink what enters the LLM context |

Evaluated via compression ratio in evaluation_suite/metrics.py.
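
One common definition of compression ratio is raw size divided by compact size. The sketch below uses character counts under that assumption; the metric in evaluation_suite/metrics.py may count tokens instead or use the inverse ratio.

```python
def compression_ratio(raw_text, compact_text):
    """Raw size divided by compact size, in characters; higher means more compression."""
    if not compact_text:
        raise ValueError("compact_text must be non-empty")
    return len(raw_text) / len(compact_text)
```

For example, a 100-character chat log reconstructed into a 25-character context gives a ratio of 4.0.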

Repository map

flowchart LR
  subgraph packages [Python packages]
    ME[memory_engine]
    RE[retrieval_engine]
    CE[consolidation_engine]
    GR[graph_memory]
    RR[rem_rag]
  end

  subgraph tooling [Tooling]
    BD[benchmark_datasets]
    ES[evaluation_suite]
    VD[visualization_dashboard]
    MN[main.py]
  end

  RR --> ME
  RR --> RE
  RR --> CE
  RR --> GR
  MN --> RR
  ES --> RR
  VD --> RR
  BD --> ES

Project layout

memory_engine/           Encoding, stores, importance, compression
retrieval_engine/        Hybrid search + context reconstruction
consolidation_engine/    Sleep consolidation pipeline
graph_memory/            Neo4j / NetworkX graph layer
benchmark_datasets/      Synthetic trace generator
evaluation_suite/        Benchmarks and metrics
visualization_dashboard/ Streamlit UI
rem_rag/                 Config, embeddings, orchestrator
main.py                  CLI and API entrypoint

Setup

Setup assumes a conda environment named dev.

conda activate dev
cd REM-RAG
export PYTHONPATH="$(pwd)"
pip install -r requirements.txt

Or:

conda env update -f environment.yml --prune
source scripts/activate_dev.sh

In-memory mode (default) — no Docker required; all stores run locally.

Production-style backends — start services and enable external stores:

docker compose up -d
cp .env.example .env
export USE_EXTERNAL_STORES=true

| Service | Used for |
| --- | --- |
| Redis | Working memory |
| PostgreSQL | Episodic memory |
| Qdrant | Semantic vectors |
| Neo4j | Graph memory |

Quick start

conda activate dev
export PYTHONPATH="$(pwd)"

python main.py ingest --user u1 --session s1 --text "I prefer dark mode for all UI work"
python main.py query --user u1 --session s1 --text "What theme do I prefer?"
python main.py consolidate
python main.py stats

Generate synthetic data and run evaluation:

python main.py generate
python main.py eval

Start the HTTP API:

python main.py serve --port 8000

Open the dashboard:

streamlit run visualization_dashboard/app.py

Python usage:

from rem_rag import CognitiveMemoryRAG
from rem_rag.types import Interaction

rag = CognitiveMemoryRAG()
rag.ingest(Interaction(user_id="u1", session_id="s1", role="user", content="I prefer Python over JavaScript"))
result = rag.query("u1", "s1", "Which language do I prefer?")
print(result["prompt"])

Killer demo

Run streamlit run visualization_dashboard/app.py
Open the Killer Demo tab
Ingest hundreds or thousands of synthetic interactions
Inspect episodic vs semantic memory counts, decay plots, and reconstructed retrieval context

Research goals

This repo explores whether agents can match or beat raw chat-history RAG on:

Long-horizon QA with fewer tokens
Preference retention over many sessions
Handling contradictory information
Recalling learned workflows without re-explaining them

License

MIT
