RAG vs GraphRAG: Comprehensive Evaluation Framework

🎯 Project Overview

A comprehensive framework for comparing seven RAG retrieval approaches, evaluated with RAGAS and extended with research-based enhancements.

πŸ” Retrieval Approaches

  1. ChromaDB RAG - Traditional vector similarity search
  2. GraphRAG - Multi-hop graph traversal with entity resolution
  3. Advanced GraphRAG - Community detection and element summarization
  4. Text2Cypher - Natural language to Cypher query translation
  5. Neo4j Vector - Graph database vector search
  6. Hybrid Cypher - Combined vector + graph traversal
  7. DRIFT GraphRAG - Dynamic reasoning with iterative fact-finding

🧠 Ontology & Entity Discovery

  • Research-based corpus sampling with TF-IDF clustering and stratified selection (see the sketch below)
  • Domain-aware entity extraction (financial, medical, legal, technical, academic)
  • Multi-strategy text sampling for optimal entity type discovery
  • Quality metrics and performance analysis
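
To illustrate the corpus-sampling bullet above, here is a minimal sketch of TF-IDF clustering with stratified selection, assuming scikit-learn is installed (the repo lists it as optional). The cluster count, feature cap, and the stratified_sample helper are illustrative, not the project's actual implementation:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def stratified_sample(chunks, n_clusters=5, per_cluster=2):
    """Cluster chunks by TF-IDF similarity, then take a few from each cluster."""
    vectors = TfidfVectorizer(max_features=2000, stop_words="english").fit_transform(chunks)
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(vectors)
    sampled = []
    for cluster in range(n_clusters):
        members = [c for c, label in zip(chunks, labels) if label == cluster]
        sampled.extend(members[:per_cluster])  # stratified pick from each cluster
    return sampled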

🧪 RAGBench Integration

  • Multiple dataset presets from nano (10 docs) to full (60K docs)
  • Domain-specific benchmarks with rich metadata
  • JSONL format for flexible evaluation data (see the example record below)
  • Automated Q&A pair generation for evaluation
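
For illustration, one JSONL record might look like the following; the field names are assumptions inferred from the CSV schema (question, ground_truth) and the metadata listed above, not a verified spec:

import json

record = {
    "question": "What are the vendor's delivery obligations?",  # hypothetical
    "ground_truth": "Delivery within 30 days of contract signing.",
    "domain": "financial",      # metadata field names are assumptions
    "dataset": "ragbench",
    "record_id": "nano-0001",
}
print(json.dumps(record))  # one JSON object per line in the .jsonl file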

All approaches are evaluated using RAGAS framework with automated visualizations and comprehensive performance metrics.

πŸ“ Project Structure

RAGvsGraphRAG/
├── 📂 data_processors/              # Document processing and graph construction
│   ├── process_data.py             # 🎯 Main CLI for data processing
│   ├── build_graph/                # Graph processor
│   │   ├── main_processor.py       # Main orchestrator class
│   │   ├── entity_discovery.py     # Research-based entity discovery
│   │   ├── text_processing.py      # PDF extraction, chunking, embeddings
│   │   ├── graph_operations.py     # Neo4j operations & entity resolution
│   │   └── README.md               # Technical deep-dive documentation
│   ├── chroma_processor.py         # ChromaDB vector processing
│   ├── graph_processor.py          # Legacy processor (use build_graph instead)
│   └── advanced_graph_processor.py # Community detection and summarization
├── 📂 retrievers/                   # RAG retrieval implementations
│   ├── chroma_retriever.py         # ChromaDB vector similarity search
│   ├── graph_rag_retriever.py      # Multi-hop graph traversal
│   ├── advanced_graphrag_retriever.py # Community-enhanced GraphRAG
│   ├── text2cypher_retriever.py    # Natural language to Cypher
│   ├── neo4j_vector_retriever.py   # Neo4j vector search
│   ├── hybrid_cypher_retriever.py  # Combined vector + graph
│   ├── drift_graphrag_retriever.py # Dynamic reasoning approach
│   └── README.md                   # Retriever usage guide
├── 📂 benchmark/                    # Evaluation framework
│   ├── ragas_benchmark.py          # 🎯 Main evaluation CLI
│   ├── visualizations.py           # Automated chart generation
│   ├── benchmark.csv               # Default benchmark dataset
│   ├── ragbench/                   # RAGBench dataset integration
│   │   ├── simple_ingester.py      # Dataset processor
│   │   ├── evaluator.py            # Q&A data preparation
│   │   ├── results_formatter.py    # Human-readable reports
│   │   ├── configs.py              # Preset configurations
│   │   └── README.md               # RAGBench documentation
│   └── README.md                   # Benchmarking guide
├── 📂 benchmark_outputs/           # Generated results and visualizations
├── 📂 tests/                       # Test and validation scripts
├── 📂 PDFs/                        # Source documents for processing
├── 📂 chroma_db/                   # ChromaDB vector store data
└── 📄 requirements.txt             # Python dependencies

🚀 Quick Start

1. Environment Setup

# Install dependencies
pip install -r requirements.txt

# Copy and configure .env file from template
cp .env_example .env

# Configure your settings in .env:
# - Provider selection (openai, ollama, vertexai)
# - Embedding model selection
# - Neo4j connection details
# - API keys as needed

# See .env_example for all available options

⚠️ IMPORTANT: Different embedding models produce different vector dimensions.

  • VertexAI/Ollama: 768 dimensions
  • OpenAI (small/ada-002): 1536 dimensions
  • OpenAI (large): 3072 dimensions

You must use the same embedding model for data ingestion AND querying!

See Embedding Dimensions Guide for detailed information.
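
As a guard against mixing models, a minimal sanity check along these lines can fail fast on a dimension mismatch (the EXPECTED_DIMS table is an assumption; adjust it to whatever your .env selects):

EXPECTED_DIMS = {
    "text-embedding-3-small": 1536,   # OpenAI small
    "text-embedding-ada-002": 1536,   # OpenAI ada-002
    "text-embedding-3-large": 3072,   # OpenAI large
    "nomic-embed-text": 768,          # a common 768-dim Ollama model
}

def check_dimensions(model_name, vector):
    """Fail fast if a vector's length doesn't match the model's expected dimension."""
    expected = EXPECTED_DIMS.get(model_name)
    if expected is not None and len(vector) != expected:
        raise ValueError(
            f"{model_name} produces {expected}-dim vectors but got {len(vector)}; "
            "ingest and query with the same embedding model"
        )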

2. Start Neo4j Database

# Using Docker (recommended)
docker run --name neo4j-rag \
    -p 7474:7474 -p 7687:7687 \
    -e NEO4J_AUTH=neo4j/password \
    neo4j:latest
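
Once the container is up, a quick connectivity check with the official neo4j Python driver can save debugging later; the credentials below match the docker run command above:

from neo4j import GraphDatabase

# Port 7687 and neo4j/password come from the docker run command above
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
driver.verify_connectivity()  # raises an exception if the database is unreachable
print("Neo4j is up")
driver.close()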

3. Process Data (Choose One)

Option A: Process Your PDFs

# Place PDFs in PDFs/ folder, then:
python data_processors/process_data.py --pdfs

Option B: Use RAGBench Dataset

# Quick test with nano preset (10 documents)
python data_processors/process_data.py --ragbench --preset nano

# Or larger dataset with domain hint
python data_processors/process_data.py --ragbench --preset micro --domain financial

# See all available presets
python data_processors/process_data.py --list-presets

# This creates:
# - Neo4j graph with dynamically discovered entities and relationships
# - ChromaDB vector store for similarity search  
# - Entity resolution to merge duplicates using LLM evaluation
# - Corpus-wide entity discovery with CLI approval and caching
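
As an optional spot-check (not part of the project's CLI), counting nodes per label confirms the graph was populated:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    rows = session.run(
        "MATCH (n) RETURN labels(n)[0] AS label, count(*) AS nodes ORDER BY nodes DESC"
    )
    for row in rows:
        print(row["label"], row["nodes"])
driver.close()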

4. Run Evaluation

# Compare all RAG approaches (uses default benchmark.csv with 18 questions)
python benchmark/ragas_benchmark.py --all

# Use RAGBench evaluation data (automatically created during processing)
python benchmark/ragas_benchmark.py --all --jsonl benchmark/ragbench__nano_benchmark.jsonl

# Selective testing
python benchmark/ragas_benchmark.py --chroma --graphrag --text2cypher

πŸ“ Benchmark File Selection Priority:

  1. --jsonl file.jsonl → Uses specified JSONL file (highest priority)
  2. --csv file.csv → Uses specified CSV file
  3. No file specified → Uses default benchmark/benchmark.csv (18 questions)

# Examples:
python benchmark/ragas_benchmark.py --hybrid-cypher                    # Uses default CSV
python benchmark/ragas_benchmark.py --hybrid-cypher --jsonl my.jsonl   # Uses custom JSONL

⚠️ Note: Approach flags (like --hybrid-cypher) determine which retriever to test, not which file to use.
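
That resolution order amounts to logic like the following sketch; the real argument parsing lives in ragas_benchmark.py:

import csv
import json

def resolve_benchmark(jsonl_path=None, csv_path=None):
    """Load evaluation questions with the priority: --jsonl, then --csv, then default."""
    if jsonl_path:                                 # 1. explicit JSONL wins
        with open(jsonl_path) as f:
            return [json.loads(line) for line in f]
    path = csv_path or "benchmark/benchmark.csv"   # 2. explicit CSV, else 3. default
    with open(path) as f:
        return list(csv.DictReader(f))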

5. View Results

  • Neo4j Browser: http://localhost:7474 (explore the knowledge graph)
  • Charts: benchmark_outputs/ folder (performance comparisons)
  • Detailed Reports: HTML reports with individual Q&A analysis

🎯 Key Features

🧠 Research-Based Entity Discovery

  • Multi-strategy corpus sampling with TF-IDF clustering and stratified selection
  • Domain-aware entity extraction with hints for financial, medical, legal, technical domains
  • Quality metrics including diversity scores and compression ratios
  • Interactive CLI approval for discovered entity types

🧪 RAGBench Integration

  • Multiple dataset presets from nano (10 docs) to full (60K docs)
  • Rich metadata with domain, dataset, and record IDs
  • JSONL format for flexible evaluation data
  • Automated Q&A generation for comprehensive evaluation

πŸ” 7 Retrieval Approaches

  • ChromaDB RAG - Fast vector similarity search
  • GraphRAG - Multi-hop graph traversal with entity resolution
  • Advanced GraphRAG - Community detection and element summarization
  • Text2Cypher - Natural language to database queries
  • Neo4j Vector - Graph database vector search
  • Hybrid Cypher - Combined vector + graph approach
  • DRIFT GraphRAG - Dynamic reasoning with iterative refinement

📊 Comprehensive Evaluation

  • RAGAS metrics - Context Recall, Faithfulness, Factual Correctness
  • Automated visualizations - Performance charts and heatmaps
  • Detailed reports - HTML, CSV, and JSON outputs
  • Human-readable analysis - Individual Q&A breakdowns

📚 Component Documentation

  • Data Processors - Data processing and ingestion guide
  • Build Graph - Technical deep-dive on enhanced graph processing
  • Retrievers - Retrieval approaches and usage patterns
  • Benchmark - Evaluation framework and RAGAS integration
  • RAGBench - RAGBench dataset integration details
  • Embedding Dimensions - ⚠️ IMPORTANT: Guide for handling different embedding models and dimensions

πŸ› οΈ Requirements

  • Python 3.8+
  • Neo4j Database (Docker recommended)
  • OpenAI API Key (for embeddings and LLM processing)
  • 8GB+ RAM (for larger datasets)
  • Optional: scikit-learn (for enhanced entity discovery)

🔍 Retrieval Approaches in Detail

1. ChromaDB RAG

Traditional vector similarity search over document chunks stored in ChromaDB; a fast baseline with no graph structure.

2. GraphRAG

Neo4j graph-enhanced vector search with dynamic entity discovery. Automatically discovers entity types from your documents with CLI approval. Includes LLM-based entity resolution to merge duplicates.

3. Advanced GraphRAG

Intelligent routing between global community search and local entity search with element summarization and community detection.

4. DRIFT GraphRAG

Iterative refinement algorithm with dynamic follow-ups and multi-depth exploration using NetworkX action graphs.
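
To make the action-graph idea concrete, here is a toy networkx sketch of how queries, facts, and follow-ups might chain; the node kinds and example texts are illustrative, not DRIFT's actual data model:

import networkx as nx

actions = nx.DiGraph()
actions.add_node("q0", kind="query", text="What does the contract cover?")
actions.add_node("f1", kind="fact", text="It covers hardware deliverables.")
actions.add_node("q1", kind="follow_up", text="Which vendor supplies the hardware?")
actions.add_edge("q0", "f1")  # fact surfaced while answering q0
actions.add_edge("f1", "q1")  # fact triggers a deeper follow-up question

# Walk the graph breadth-first, i.e. one refinement depth at a time
for node in nx.bfs_tree(actions, "q0"):
    print(actions.nodes[node]["kind"], "->", actions.nodes[node]["text"])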

5. Text2Cypher RAG

Natural language to Cypher query translation with direct Neo4j graph database querying and schema-aware prompt engineering.
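
A minimal sketch of what schema-aware prompt engineering means in practice; the schema string and prompt wording are invented for illustration:

# Hypothetical schema summary injected into the prompt so the model only
# emits Cypher that matches the actual graph
SCHEMA = "Nodes: (:Contract), (:Vendor); Rels: (:Vendor)-[:SIGNED]->(:Contract)"

def build_text2cypher_prompt(question):
    return (
        "Translate the question into a single Cypher query for this schema:\n"
        f"{SCHEMA}\n"
        f"Question: {question}\nCypher:"
    )

print(build_text2cypher_prompt("Which vendors signed contracts in 2023?"))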

6. Neo4j Vector RAG

Pure Neo4j vector similarity search using native vector operations, with no graph traversal, for fast retrieval; a useful baseline against vector-only databases such as ChromaDB.

7. Hybrid Cypher RAG

Combined vector similarity search and graph traversal in a single retriever, pairing semantic matching with the graph's structured relationships.

📈 RAGAS Evaluation Framework

Metrics Overview

The benchmark evaluates all approaches using three key RAGAS metrics:

1. Context Recall (0.0-1.0)

How well the retrieval system finds relevant information needed to answer the question.

2. Faithfulness (0.0-1.0)

How faithful the generated answer is to retrieved context without hallucination.

3. Factual Correctness (0.0-1.0)

How factually accurate the response is compared to ground truth reference answers.

Overall Performance Calculation

The Average Score for each approach is calculated as:

Average Score = (Context Recall + Faithfulness + Factual Correctness) / 3
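
For example, with illustrative metric values:

scores = {"context_recall": 0.82, "faithfulness": 0.91, "factual_correctness": 0.74}
average_score = sum(scores.values()) / len(scores)
print(round(average_score, 3))  # 0.823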

🧠 Dynamic Entity Discovery

The graph processor now features intelligent entity discovery that adapts to your document content:

How It Works

  1. Corpus Analysis: Analyzes your entire document collection using hybrid sampling (first/last 500 chars + entity-rich patterns)
  2. LLM Proposal: GPT-4 proposes relevant entity types based on document content
  3. CLI Approval: You review and approve/modify the proposed entities
  4. Schema Caching: Approved entities are cached for reuse across runs
  5. Dynamic Extraction: Extracts only the approved entity types from each document chunk
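
Steps 4 and 5 hinge on caching the approved schema. A minimal sketch of that pattern (the file name and helper are illustrative, not the repo's actual cache):

import json
from pathlib import Path

CACHE_FILE = Path("entity_schema_cache.json")  # hypothetical cache location

def load_or_discover_schema(corpus_id, discover):
    """Reuse an approved entity schema when cached, else discover and persist it."""
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    if corpus_id not in cache:
        cache[corpus_id] = discover()  # LLM proposal + CLI approval happen here
        CACHE_FILE.write_text(json.dumps(cache, indent=2))
    return cache[corpus_id]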

How to Use

Simply run the data processing CLI and it will automatically discover entities from your documents:

python data_processors/process_data.py --pdfs

Benefits

  • Adaptive: Discovers entities relevant to your specific domain (contracts, medical, legal, etc.)
  • Consistent: Single entity schema applied across all documents in a corpus
  • Efficient: Caches approved schemas to avoid re-discovery
  • User-Controlled: You approve all entity types before processing

Example Discovery Output

πŸ” Analyzing corpus with LLM...
πŸ“‹ Proposed entities: Contract, Vendor, Deliverable, Timeline, Budget, Compliance
βœ… Approve these entities? (y/n/edit): y
πŸš€ Processing documents with approved entities...

📚 Customization Guide

Adding New Documents

  1. Add PDFs: Place new PDF files in the PDFs/ directory
  2. Reprocess: Run your chosen processing command again:
    python data_processors/process_data.py --pdfs  # Reprocesses all PDFs
  3. Test: Validate with python tests/test_ragas_setup.py

Custom Benchmark Questions

  1. Edit questions: Modify benchmark/benchmark.csv:
    question,ground_truth
    Your custom question?,Expected answer here
    
  2. Run benchmark: python benchmark/ragas_benchmark.py --all

🔧 Development & Testing

Health Checks

# Verify all systems working
python tests/check_chromaDB.py      # ChromaDB status
python tests/check_schema.py        # Neo4j schema and statistics  
python tests/test_ragas_setup.py    # All approaches validation

🚀 Next Steps

Quick Start Checklist

  • Set up environment variables in .env
  • Run pip install -r requirements.txt
  • Add PDFs to PDFs/ directory
  • Process documents: python data_processors/process_data.py --pdfs
  • Validate setup: python tests/test_ragas_setup.py
  • Run benchmark: python benchmark/ragas_benchmark.py --all

Cheatsheet

python benchmark/ragas_benchmark.py --hybrid-cypher --chroma --neo4j-vector --graphrag --advanced-graphrag --limit 1 --jsonl benchmark/ragbench__nano_benchmark.jsonl
python benchmark/ragas_benchmark.py --hybrid-cypher --chroma --neo4j-vector --graphrag --advanced-graphrag --text2cypher --limit 1
