thisishardik/REM-RAG

REM-RAG

Recursive Episodic Memory RAG — a cognitive memory system for AI agents that need to remember users across long conversations, not just retrieve documents from a static index.

Traditional RAG answers questions by searching a fixed corpus of chunks. REM-RAG answers questions by searching what the agent has learned from past interactions: recent turns, structured events, compressed summaries, reusable workflows, and relationships between users, goals, and concepts.


What problem does this solve?

Long-running agents (assistants, copilots, support bots) face three limits of classic RAG:

| Problem | Classic RAG | REM-RAG |
| --- | --- | --- |
| Memory scope | External documents only | Conversation history + learned abstractions |
| Context cost | Stuff full chat logs into the prompt | Reconstruct a compact memory context |
| Memory quality | No notion of importance or decay | Score, compress, consolidate, and forget low-value memories |

REM-RAG is built for long-horizon use: preferences that change over time, recurring workflows, contradictions, and thousands of turns without blowing the context window.


Features

Hierarchical memory (4 layers)

  • Working memory — last N turns and task state; TTL eviction (Redis or in-memory).
  • Episodic memory — each interaction stored as a structured event (actor, action, object, outcome, timestamp, embedding).
  • Semantic memory — compressed summaries distilled from many episodes (clustering + summarization).
  • Procedural memory — reusable step-by-step workflows learned from user behavior.
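
The working-memory bullet above (last N turns, TTL eviction) can be sketched with an in-process dict. This is a minimal illustration only: the class name `WorkingMemorySketch`, its parameters, and the injectable clock are hypothetical and are not taken from `memory_engine/working.py`.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple


class WorkingMemorySketch:
    """Keeps the last `max_turns` turns per session and drops turns older than `ttl_seconds`."""

    def __init__(self, max_turns: int = 5, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_turns = max_turns
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so eviction can be exercised without sleeping
        self._turns: Dict[str, Deque[Tuple[float, str]]] = {}

    def append(self, session_id: str, text: str) -> None:
        # deque(maxlen=...) discards the oldest turn once the window is full
        turns = self._turns.setdefault(session_id, deque(maxlen=self.max_turns))
        turns.append((self.clock(), text))

    def recent(self, session_id: str) -> List[str]:
        # Expired turns are evicted lazily, at read time
        now = self.clock()
        turns = self._turns.get(session_id, deque())
        while turns and now - turns[0][0] > self.ttl_seconds:
            turns.popleft()
        return [text for _, text in turns]
```

With Redis as the backend, the same behavior would typically come from a capped list plus key expiry rather than from application code; that mapping is an assumption, not something this README states.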

Memory encoding pipeline

Turns raw messages into storable events:

  • Entity and intent extraction
  • Event structuring (who did what, to what, with what outcome)
  • Semantic abstraction across multiple episodes
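
As a rough illustration of event structuring, the sketch below turns a raw message into a record with the fields listed under episodic memory (actor, action, object, outcome, timestamp, embedding). The `StructuredEventSketch` class, the `structure_event` function, and the single "I prefer" rule are hypothetical stand-ins; the repo's actual entity and intent extraction is not described here and is likely richer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class StructuredEventSketch:
    actor: str
    action: str
    object: str
    outcome: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    embedding: List[float] = field(default_factory=list)  # filled in by an embedding model


def structure_event(user_id: str, text: str) -> StructuredEventSketch:
    """Toy rule: 'I prefer X' becomes a stated preference; anything else is a plain statement."""
    cleaned = text.strip()
    marker = "i prefer "
    if cleaned.lower().startswith(marker):
        return StructuredEventSketch(actor=user_id, action="state_preference",
                                     object=cleaned[len(marker):], outcome="recorded")
    return StructuredEventSketch(actor=user_id, action="statement",
                                 object=cleaned, outcome="recorded")
```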

Importance & decay

Memories are ranked with a learned-style importance function (recurrence, utility, recency, confidence, future relevance). Low-importance memories fade over time via temporal decay instead of living forever.
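
One plausible shape for such a function is a weighted sum of the five signals followed by exponential decay. The weights, the 30-day half-life, and the forgetting threshold below are assumptions chosen for illustration, not values from the repo.

```python
import math
from dataclasses import dataclass


@dataclass
class Signals:
    recurrence: float
    utility: float
    recency: float
    confidence: float
    future_relevance: float


# Hypothetical weights; they sum to 1 so the score stays in [0, 1] for inputs in [0, 1]
WEIGHTS = {"recurrence": 0.25, "utility": 0.25, "recency": 0.2,
           "confidence": 0.15, "future_relevance": 0.15}


def importance(s: Signals) -> float:
    return sum(weight * getattr(s, name) for name, weight in WEIGHTS.items())


def decayed(score: float, age_days: float, half_life_days: float = 30.0) -> float:
    # Exponential decay: the score halves every `half_life_days`
    return score * math.exp(-math.log(2) * age_days / half_life_days)


def should_forget(score: float, age_days: float, threshold: float = 0.1) -> bool:
    return decayed(score, age_days) < threshold
```

Under these assumed settings, a memory scored 0.8 drops to 0.4 after 30 days, and a memory scored 0.2 falls below the threshold after 60 days.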

Sleep consolidation

Background “sleep” pass that:

  • Merges redundant memories
  • Builds higher-level semantic abstractions
  • Drops low-utility episodes
  • Re-clusters semantic groups
  • Resolves contradictions
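
Two of these steps, merging redundant memories and dropping low-utility episodes, can be sketched with a greedy cosine-similarity pass. The `Memory` class, the 0.95 merge threshold, and the 0.1 drop threshold are hypothetical; re-clustering and contradiction resolution are left out of this sketch.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Memory:
    text: str
    embedding: List[float]
    importance: float
    count: int = 1  # how many episodes were merged into this memory


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consolidate(memories: List[Memory], merge_at: float = 0.95,
                drop_below: float = 0.1) -> List[Memory]:
    kept: List[Memory] = []
    # Visit high-importance memories first so they become the merge representatives
    for mem in sorted(memories, key=lambda m: m.importance, reverse=True):
        if mem.importance < drop_below:
            continue  # drop low-utility episodes
        match = next((k for k in kept if cosine(k.embedding, mem.embedding) >= merge_at), None)
        if match is not None:
            match.count += mem.count  # fold the redundant memory into its representative
        else:
            kept.append(Memory(mem.text, list(mem.embedding), mem.importance, mem.count))
    return kept
```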

Hybrid retrieval

Queries hit all memory layers at once using:

  • Vector similarity (embeddings)
  • BM25 keyword scoring
  • Graph neighborhood search
  • Importance and recency weighting
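
A simple way to combine these signals is a weighted sum over per-candidate scores. The sketch below assumes each signal has already been computed for every candidate; the `Candidate` fields, the weights, and the max-normalization of BM25 are illustrative assumptions rather than the repo's scorer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    memory_id: str
    vector_sim: float   # cosine similarity, assumed in [0, 1]
    bm25: float         # raw BM25 score, unbounded
    graph_hit: bool     # reached through the graph neighborhood
    importance: float   # assumed in [0, 1]
    recency: float      # assumed in [0, 1], 1 = just seen


# Hypothetical fusion weights
W = {"vector": 0.4, "bm25": 0.25, "graph": 0.1, "importance": 0.15, "recency": 0.1}


def rank(cands: List[Candidate]) -> List[str]:
    # Normalize BM25 by the best score in the batch; fall back to 1.0 to avoid dividing by zero
    max_bm25 = max((c.bm25 for c in cands), default=0.0) or 1.0

    def fused(c: Candidate) -> float:
        return (W["vector"] * c.vector_sim
                + W["bm25"] * (c.bm25 / max_bm25)
                + W["graph"] * (1.0 if c.graph_hit else 0.0)
                + W["importance"] * c.importance
                + W["recency"] * c.recency)

    return [c.memory_id for c in sorted(cands, key=fused, reverse=True)]
```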

Context reconstruction

Retrieved hits are fused, de-duplicated, conflict-resolved, and summarized into a single compact prompt block — not a dump of raw chunks.
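
The sketch below walks through de-duplication, a newest-wins conflict rule, and a character budget cap. The `Hit` class, its `topic` key, and the newest-wins policy are assumptions made for the example; summarization is omitted, and the repo's actual conflict resolution may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Hit:
    text: str
    score: float
    timestamp: float
    topic: Optional[str] = None  # hypothetical key: hits sharing a topic may conflict


def reconstruct(hits: List[Hit], char_budget: int = 200) -> str:
    # 1. De-duplicate on normalized text, keeping the best-scoring copy
    unique: Dict[str, Hit] = {}
    for h in hits:
        key = " ".join(h.text.lower().split())
        if key not in unique or h.score > unique[key].score:
            unique[key] = h
    # 2. Resolve conflicts: within one topic, the newest hit wins
    newest: Dict[str, Hit] = {}
    untagged: List[Hit] = []
    for h in unique.values():
        if h.topic is None:
            untagged.append(h)
        elif h.topic not in newest or h.timestamp > newest[h.topic].timestamp:
            newest[h.topic] = h
    # 3. Emit best-first until the character budget is spent
    lines: List[str] = []
    used = 0
    for h in sorted(list(newest.values()) + untagged, key=lambda x: x.score, reverse=True):
        line = f"- {h.text}"
        if used + len(line) + 1 > char_budget:
            break
        lines.append(line)
        used += len(line) + 1
    return "\n".join(lines)
```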

Graph memory

Optional Neo4j (or NetworkX fallback) links users → goals → concepts for relationship-aware retrieval.
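
To make the idea of relationship-aware retrieval concrete, here is a dependency-free sketch of a user, goal, concept graph with a hop-bounded neighborhood search. `GraphSketch` and its methods are hypothetical and use a plain adjacency dict instead of Neo4j or NetworkX.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


class GraphSketch:
    def __init__(self) -> None:
        self._edges: Dict[str, Set[str]] = defaultdict(set)

    def link(self, user: str, goal: str, concept: str) -> None:
        # Directed chain: user -> goal -> concept
        self._edges[user].add(goal)
        self._edges[goal].add(concept)

    def neighborhood(self, start: str, max_hops: int = 2) -> List[str]:
        # Breadth-first search, bounded by hop count
        seen = {start}
        queue = deque([(start, 0)])
        found: List[str] = []
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue
            for nxt in sorted(self._edges.get(node, ())):
                if nxt not in seen:
                    seen.add(nxt)
                    found.append(nxt)
                    queue.append((nxt, hops + 1))
        return found
```

A retriever could then boost any memory whose entities fall inside the neighborhood of the querying user.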

Evaluation & observability

  • Synthetic benchmark generator (long chats, preferences, conflicts, workflows)
  • Metrics: recall@k, MRR, compression ratio, preference retention
  • Streamlit dashboard: memory growth, decay curves, retrieval traces, killer demo over 1000+ interactions
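
For reference, three of the listed metrics can be written in a few lines. recall@k and MRR follow their standard definitions; the compression ratio here is assumed to be raw history size divided by reconstructed context size, which may not match the definition in `evaluation_suite/metrics.py`. Preference retention is not sketched because the README does not define it.

```python
from typing import List, Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    # Fraction of relevant items that appear in the top k results
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr(rankings: List[Sequence[str]], relevants: List[Set[str]]) -> float:
    # Mean over queries of 1 / rank of the first relevant item (0 if none is found)
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        for position, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / position
                break
    return total / len(rankings) if rankings else 0.0


def compression_ratio(raw_chars: int, compact_chars: int) -> float:
    # Assumed definition: size of the raw history over size of the compact context
    return raw_chars / compact_chars if compact_chars else float("inf")
```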

Interfaces

  • Python API: CognitiveMemoryRAG orchestrator (ingest, query, consolidate, stats)
  • CLI: main.py for ingest, query, eval, dataset generation
  • HTTP API: FastAPI (/ingest, /query, /consolidate, /stats)

System architecture

REM-RAG is organized as a pipeline around a single orchestrator (rem_rag/orchestrator.py) that coordinates memory engines, retrieval, consolidation, and optional external stores.

Component overview

flowchart TB
  subgraph clients [Clients]
    CLI[main.py CLI]
    API[FastAPI HTTP]
    SDK[Python SDK]
    UI[Streamlit Dashboard]
  end

  subgraph orchestrator [Orchestrator]
    CMR[CognitiveMemoryRAG]
  end

  subgraph write_path [Write Path]
    ENC[Memory Encoding Pipeline]
    IMP[Importance Scoring Engine]
    SLP[Sleep Consolidation Pipeline]
    CMP[Memory Compression Engine]
  end

  subgraph stores [Memory Stores]
    WM[Working Memory]
    EM[Episodic Memory]
    SM[Semantic Memory]
    PM[Procedural Memory]
    GM[Graph Memory]
  end

  subgraph read_path [Read Path]
    HYB[Hybrid Retrieval Engine]
    SCR[Retrieval Scorer]
    CTX[Context Reconstruction]
  end

  subgraph infra [Optional Infrastructure]
    Redis[(Redis)]
    PG[(PostgreSQL)]
    QD[(Qdrant)]
    N4J[(Neo4j)]
  end

  LLM[Your LLM]

  CLI --> CMR
  API --> CMR
  SDK --> CMR
  UI --> CMR

  CMR --> ENC
  ENC --> IMP
  IMP --> WM
  IMP --> EM
  IMP --> GM
  ENC --> PM
  CMR --> SLP
  SLP --> CMP
  CMP --> SM
  SLP --> EM

  CMR --> HYB
  HYB --> SCR
  HYB --> WM
  HYB --> EM
  HYB --> SM
  HYB --> PM
  HYB --> GM
  HYB --> CTX
  CTX --> LLM

  WM -.-> Redis
  EM -.-> PG
  SM -.-> QD
  GM -.-> N4J

Dashed lines apply when USE_EXTERNAL_STORES=true; otherwise all stores run in-process.
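
The switch can be pictured as a lookup from layer to backend. The sketch below is an assumption about how such a flag might be read: the function names are hypothetical, only the literal value `true` is treated as enabled, and the backend pairs are the ones this README lists for each layer.

```python
import os
from typing import Dict, Mapping

# (external backend, in-process fallback) per layer, as listed in this README
BACKENDS = {
    "working": ("Redis", "dict + TTL"),
    "episodic": ("PostgreSQL", "list"),
    "semantic": ("Qdrant", "in-memory"),
    "procedural": ("in-memory", "in-memory"),  # no external store is listed for this layer
    "graph": ("Neo4j", "NetworkX"),
}


def external_stores_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # Assumed parsing: case-insensitive match on "true"; anything else means in-process
    return env.get("USE_EXTERNAL_STORES", "false").strip().lower() == "true"


def select_backends(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    index = 0 if external_stores_enabled(env) else 1
    return {layer: pair[index] for layer, pair in BACKENDS.items()}
```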

Memory hierarchy

Memories move from short-lived, high-fidelity traces toward compressed, long-lived abstractions.

flowchart LR
  subgraph fast [Fast / Volatile]
    WM2[Working Memory]
  end

  subgraph structured [Structured / Durable]
    EM2[Episodic Memory]
    PM2[Procedural Memory]
  end

  subgraph compressed [Compressed / Abstract]
    SM2[Semantic Memory]
  end

  subgraph relational [Relational]
    GM2[Graph Memory]
  end

  WM2 -->|"encode"| EM2
  EM2 -->|"sleep consolidate"| SM2
  EM2 -->|"workflow detect"| PM2
  EM2 -->|"link entities"| GM2
  SM2 -->|"retrieve"| OUT[Reconstructed Context]
  EM2 --> OUT
  PM2 --> OUT
  WM2 --> OUT
  GM2 --> OUT

| Layer | Module | Typical content | Backend |
| --- | --- | --- | --- |
| Working | memory_engine/working.py | Last turns, task state | Redis or dict + TTL |
| Episodic | memory_engine/episodic.py | Structured events + embeddings | PostgreSQL or list |
| Semantic | memory_engine/semantic.py | Cluster summaries | Qdrant or in-memory |
| Procedural | memory_engine/procedural.py | Multi-step workflows | In-memory |
| Graph | graph_memory/layer.py | User → goal → concept edges | Neo4j or NetworkX |

Write path — ingest

Every user or assistant message flows through encoding, scoring, and storage. Consolidation runs periodically or on demand.

sequenceDiagram
  participant User
  participant Orchestrator as CognitiveMemoryRAG
  participant WM as WorkingMemory
  participant ENC as EncodingPipeline
  participant IMP as ImportanceEngine
  participant EM as EpisodicStore
  participant GM as GraphMemory
  participant PM as ProceduralStore
  participant SLP as SleepConsolidation
  participant SM as SemanticStore

  User->>Orchestrator: ingest(message)
  Orchestrator->>WM: append(session, turn)
  Orchestrator->>ENC: parse, structure event
  ENC-->>Orchestrator: StructuredEvent
  Orchestrator->>IMP: score(importance)
  Orchestrator->>EM: store(event)
  Orchestrator->>GM: link(user, intent, concept)
  alt workflow intent
    Orchestrator->>PM: learn_from_pattern(steps)
  end
  alt every N interactions
    Orchestrator->>SLP: run()
    SLP->>EM: decay / delete low utility
    SLP->>SLP: cluster + compress
    SLP->>SM: store semantic summaries
  end

Read path — query

A question triggers multi-store retrieval, scoring, and prompt reconstruction. The caller gets back a compact prompt plus supporting evidence to pass to the LLM, not a full chat log.

sequenceDiagram
  participant User
  participant Orchestrator as CognitiveMemoryRAG
  participant HYB as HybridRetriever
  participant Stores as All Memory Stores
  participant CTX as ContextReconstruction
  participant LLM as Your LLM

  User->>Orchestrator: query(question)
  Orchestrator->>HYB: retrieve(query, session, user)
  HYB->>Stores: vector + BM25 + graph + recency
  Stores-->>HYB: ranked candidates
  HYB-->>Orchestrator: RetrievalResults
  Orchestrator->>CTX: fuse, resolve conflicts, summarize
  CTX-->>Orchestrator: compact prompt
  Orchestrator-->>User: prompt + evidence
  User->>LLM: prompt + question

Memory compression (explicit design goal)

Compression happens at two stages so agents keep recall without linear token growth.

flowchart TB
  subgraph store_time [Store-time compression]
    E1[Many episodic events]
    CL[Clustering]
    SU[Summarization]
    DE[Dedup and conflict resolve]
    FG[Forget low-importance]
    E1 --> CL --> SU --> SM3[Semantic memories]
    E1 --> FG
    E1 --> DE
  end

  subgraph query_time [Query-time compression]
    R1[Retrieved hits from all layers]
    FU[Fusion and dedup]
    CR[Conflict resolution]
    CAP[Character budget cap]
    R1 --> FU --> CR --> CAP --> P1[Final LLM prompt]
  end

  SM3 --> R1

| Stage | When | Goal |
| --- | --- | --- |
| Store-time | consolidate() / sleep pipeline | Shrink what is persisted |
| Query-time | Every query() | Shrink what enters the LLM context |

Compression is evaluated with the compression ratio metric in evaluation_suite/metrics.py.

Repository map

flowchart LR
  subgraph packages [Python packages]
    ME[memory_engine]
    RE[retrieval_engine]
    CE[consolidation_engine]
    GR[graph_memory]
    RR[rem_rag]
  end

  subgraph tooling [Tooling]
    BD[benchmark_datasets]
    ES[evaluation_suite]
    VD[visualization_dashboard]
    MN[main.py]
  end

  RR --> ME
  RR --> RE
  RR --> CE
  RR --> GR
  MN --> RR
  ES --> RR
  VD --> RR
  BD --> ES

Project layout

memory_engine/           Encoding, stores, importance, compression
retrieval_engine/        Hybrid search + context reconstruction
consolidation_engine/    Sleep consolidation pipeline
graph_memory/            Neo4j / NetworkX graph layer
benchmark_datasets/      Synthetic trace generator
evaluation_suite/        Benchmarks and metrics
visualization_dashboard/ Streamlit UI
rem_rag/                 Config, embeddings, orchestrator
main.py                  CLI and API entrypoint

Setup

Setup assumes a conda environment named dev.

conda activate dev
cd REM-RAG
export PYTHONPATH="$(pwd)"
pip install -r requirements.txt

Or:

conda env update -f environment.yml --prune
source scripts/activate_dev.sh

In-memory mode (default) — no Docker required; all stores run locally.

Production-style backends — start services and enable external stores:

docker compose up -d
cp .env.example .env
export USE_EXTERNAL_STORES=true

| Service | Used for |
| --- | --- |
| Redis | Working memory |
| PostgreSQL | Episodic memory |
| Qdrant | Semantic vectors |
| Neo4j | Graph memory |

Quick start

conda activate dev
export PYTHONPATH="$(pwd)"

python main.py ingest --user u1 --session s1 --text "I prefer dark mode for all UI work"
python main.py query --user u1 --session s1 --text "What theme do I prefer?"
python main.py consolidate
python main.py stats

Generate synthetic data and run evaluation:

python main.py generate
python main.py eval

Start the HTTP API:

python main.py serve --port 8000

Open the dashboard:

streamlit run visualization_dashboard/app.py

Python usage:

from rem_rag import CognitiveMemoryRAG
from rem_rag.types import Interaction

rag = CognitiveMemoryRAG()
rag.ingest(Interaction(user_id="u1", session_id="s1", role="user", content="I prefer Python over JavaScript"))
result = rag.query("u1", "s1", "Which language do I prefer?")
print(result["prompt"])

Killer demo

  1. Run streamlit run visualization_dashboard/app.py
  2. Open the Killer Demo tab
  3. Ingest hundreds or thousands of synthetic interactions
  4. Inspect episodic vs semantic memory counts, decay plots, and reconstructed retrieval context

Research goals

This repo explores whether agents can match or beat raw chat-history RAG on:

  1. Long-horizon QA with fewer tokens
  2. Preference retention over many sessions
  3. Handling contradictory information
  4. Recalling learned workflows without re-explaining them

License

MIT

About

Hierarchical Memory Compression framework for Persistent RAG agents
