arpan1221/Foldex

Foldex

Foldex is a local-first multimodal RAG (Retrieval-Augmented Generation) system that transforms Google Drive folders into intelligent conversation interfaces. Ask questions, find files, and get insights from your documents using AI, all running locally on your machine.

🚀 Quick Start

# Clone the repository
git clone <repository-url>
cd Foldex

# Run Docker setup (recommended)
chmod +x setup.sh
./setup.sh

# Or use Docker Compose directly
docker-compose up -d

Access the application:

  • Frontend: http://localhost:3000
  • API docs: http://localhost:8000/docs

✨ Features

  • πŸ” Google Drive Integration: Authenticate and process folders from Google Drive
  • πŸ“„ Multimodal Processing: PDFs, Office documents (Word, Excel, PowerPoint), text files, Markdown, HTML, CSV, audio (Whisper), and images with OCR
  • πŸ”§ Unstructured.io Integration: Advanced document parsing using Unstructured.io for intelligent content extraction with OCR support
  • 🧠 Intelligent RAG: Hybrid retrieval with semantic search, keyword matching, and knowledge graphs
  • πŸ’¬ Conversational Interface: Chat with your documents with precise citations
  • πŸ“Š Knowledge Graph Visualization: Interactive graph showing document relationships
  • πŸ”’ Local-First: All processing and storage happens locally
  • ⚑ Real-time Updates: WebSocket-based progress tracking
  • 🎯 Citation-Driven: Every response includes precise source citations

πŸ—οΈ Architecture

Foldex uses Unstructured.io as the primary document processing engine for PDFs, Office documents (Word, Excel, PowerPoint), text files, HTML, CSV, and images. Unstructured.io provides intelligent content extraction with OCR support for scanned documents and images, title-based chunking for better semantic understanding, and unified processing across multiple document formats. Embeddings are generated using Ollama's nomic-embed-text model for local-first vector generation.
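The 600-token / 100-token-overlap chunking described in the pipeline can be sketched in a few lines. This is an illustrative sketch using naive whitespace tokenization; the real pipeline presumably counts tokens with a proper tokenizer and carries richer metadata per chunk:

```python
from typing import Iterator

def chunk_text(text: str, source: str, chunk_size: int = 600,
               overlap: int = 100) -> Iterator[dict]:
    """Split text into overlapping token windows, keeping source metadata.

    Naive whitespace tokenization; a real pipeline would use a model tokenizer.
    """
    tokens = text.split()
    step = chunk_size - overlap
    for start in range(0, max(len(tokens), 1), step):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        yield {"text": " ".join(window), "source": source,
               "start_token": start}
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
```

With a 1300-token document this produces three chunks starting at tokens 0, 500, and 1000, each overlapping its predecessor by 100 tokens.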

📊 Diagram Navigation

Jump to: Architecture | Data Flow | Ingestion Pipeline | Citation Flow

πŸ—οΈ Architecture Diagram

graph TB
    Start([User Pastes Drive Folder URL]) --> Auth[Google OAuth2<br/>Authentication]
    Auth --> Fetch[Google Drive API<br/>Fetch Files + Metadata]
    
    Fetch --> FileType{File Type?}
    
    FileType -->|PDF/Office/Text/HTML/CSV/Images| UnstructuredProc[Unstructured.io Processor<br/>Advanced parsing + OCR<br/>Title-based chunking]
    FileType -->|Audio| AudioProc[Audio Processor<br/>Whisper transcription]
    
    UnstructuredProc --> Chunker[Smart Chunker<br/>600 tokens, 100 overlap<br/>Metadata preservation]
    AudioProc --> Chunker
    
    Chunker --> Embed[Ollama: nomic-embed-text<br/>Embedding Generation<br/>Batched + Cached]
    
    Embed --> VectorDB[(ChromaDB<br/>Persistent Vector Store)]
    
    Query([User Query]) --> ChatService[Chat Service]
    
    ChatService --> QCache{Query<br/>Cached?}
    
    QCache -->|Yes| CachedResult[Return Cached Result<br/>⚡ Sub-2s response]
    QCache -->|No| RAGFlow
    
    RAGFlow[RAG Service] --> Retrieval[Hybrid Retrieval Strategy]
    
    Retrieval --> Semantic[Semantic Search<br/>MMR for diversity]
    Retrieval --> BM25[BM25 Keyword<br/>Exact term matching]
    
    Semantic --> VectorDB
    BM25 --> VectorDB
    
    Semantic --> Ensemble[Ensemble Retriever<br/>Weights: 0.6 semantic, 0.4 BM25]
    BM25 --> Ensemble
    
    Ensemble --> Rerank[Cross-encoder Re-ranking<br/>ms-marco-MiniLM-L-6-v2]
    
    Rerank --> TopK[Top 5 Chunks<br/>Grouped by source file]
    
    TopK --> OllamaLLM[Ollama: llama3.2:3b<br/>Streaming Enabled]
    
    OllamaLLM --> Stream[Real-time Token Stream<br/>WebSocket to client]
    OllamaLLM --> Citations[Citation Extraction<br/>Parse inline markers]
    
    Citations --> UI[Client UI<br/>Progressive display]
    Stream --> UI
    
    style ChatService fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
    style RAGFlow fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
    style CachedResult fill:#ffffff,stroke:#004085,stroke-width:2px,color:#000000
    style VectorDB fill:#ffffff,stroke:#0c5460,stroke-width:2px,color:#000000
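The ensemble step in the diagram combines the two retrievers with fixed weights (0.6 semantic, 0.4 BM25). A minimal sketch of one way to do this, weighted reciprocal-rank fusion over two ranked lists of chunk IDs (the actual retriever may fuse raw scores instead):

```python
def ensemble_rank(semantic: list[str], bm25: list[str],
                  w_sem: float = 0.6, w_bm25: float = 0.4) -> list[str]:
    """Fuse two ranked lists of chunk IDs using weighted reciprocal-rank scores."""
    scores: dict[str, float] = {}
    for weight, ranking in ((w_sem, semantic), (w_bm25, bm25)):
        for rank, chunk_id in enumerate(ranking):
            # A chunk found by both retrievers accumulates both contributions.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk ranked highly by both retrievers outranks one found by only a single retriever, which is the intended effect of the ensemble.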

📊 Data Flow Diagram

sequenceDiagram
    participant User
    participant Frontend as React Frontend
    participant API as FastAPI Backend
    participant Chat as ChatService
    participant RAG as RAGService
    participant Retriever as Hybrid Retriever
    participant ChromaDB
    participant Ollama as Ollama (llama3.2:3b)
    participant Citations as Citation Utils
    
    User->>Frontend: Paste Drive folder URL
    Frontend->>API: POST /api/v1/folders/process
    API->>API: Authenticate with Google
    API->>API: Download & chunk files
    API->>ChromaDB: Store chunks + embeddings
    ChromaDB-->>API: ✓ Ingested
    API-->>Frontend: ✓ Ready
    
    User->>Frontend: Ask question
    Frontend->>API: WebSocket /ws/query
    API->>Chat: process_query()
    
    Chat->>RAG: query()
    RAG->>Retriever: retrieve(query, k=5)
    
    Retriever->>ChromaDB: Semantic search (MMR)
    ChromaDB-->>Retriever: Top 20 candidates
    
    Retriever->>ChromaDB: BM25 keyword search
    ChromaDB-->>Retriever: Top 20 candidates
    
    Retriever->>Retriever: Ensemble + re-rank
    Retriever-->>RAG: Top 5 chunks
    
    RAG->>Ollama: Generate (streaming=true)
    
    loop For each token
        Ollama-->>RAG: Token
        RAG-->>Chat: Token
        Chat-->>Frontend: Stream token
    end
    
    Ollama-->>RAG: Complete response
    
    RAG->>Citations: extract_citations(response, chunks)
    Citations-->>RAG: Citation list
    
    RAG-->>Chat: Response + citations
    Chat-->>Frontend: Citations
    
    Frontend-->>User: Display response with inline citations
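The semantic leg of retrieval uses MMR (maximal marginal relevance) to balance query relevance against diversity among already-selected chunks. A self-contained sketch over plain float vectors; in Foldex this is presumably delegated to the vector store rather than hand-rolled:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query: list[float], docs: list[list[float]],
               k: int, lam: float = 0.5) -> list[int]:
    """Return indices of k docs maximizing relevance minus redundancy.

    lam=1.0 is pure relevance ranking; lower values favor diversity.
    """
    selected: list[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a diversity-leaning lambda, a chunk nearly identical to an already-selected one is skipped in favor of a less similar candidate.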

🔄 Ingestion Pipeline Diagram

flowchart TD
    Start([User Pastes Drive Folder URL]) --> Auth[Google OAuth2<br/>Authentication Layer]
    Auth --> Fetch[Google Drive API<br/>File Metadata + Content]
    
    Fetch --> FileType{File Type?}
    
    FileType -->|PDF/Office/Text/HTML/CSV/Images| UnstructuredProc[Unstructured.io Processor<br/>Advanced parsing + OCR<br/>Title-based chunking]
    FileType -->|Audio| AudioProc[Audio Processor<br/>Whisper transcription]
    
    UnstructuredProc --> Chunker[Hierarchical Chunker<br/>600 tokens, 100 overlap<br/>Metadata: file, page, section]
    AudioProc --> Chunker
    
    Chunker --> Cache{Embedding<br/>Cached?}
    
    Cache -->|No| Embed[Ollama: nomic-embed-text<br/>Embedding Generation<br/>Batched + Cached]
    Cache -->|Yes| CacheHit[Cache Hit]
    
    Embed --> VectorDB
    CacheHit --> VectorDB
    
    VectorDB[(ChromaDB<br/>Persistent Vector Store<br/>with rich metadata)]
    
    VectorDB --> Complete[✓ Processing Complete<br/>Ready for queries]
    
    style Chunker fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
    style CacheHit fill:#ffffff,stroke:#004085,stroke-width:2px,color:#000000
    style VectorDB fill:#ffffff,stroke:#0c5460,stroke-width:2px,color:#000000

πŸ“ Citation Extraction Flow

flowchart TD
    Start[LLM Response Generated] --> Parse[Parse for citation markers<br/>Pattern: cid:chunk_id]
    
    Parse --> Found{Markers<br/>Found?}
    
    Found -->|No| NoCite[No citations<br/>Return response as-is]
    Found -->|Yes| Extract[Extract chunk IDs<br/>e.g., cid:abc123]
    
    Extract --> Lookup[Lookup chunk metadata<br/>from retrieved chunks]
    
    Lookup --> Meta{Metadata<br/>Found?}
    
    Meta -->|No| Unknown[Replace with<br/>source unknown]
    Meta -->|Yes| Format[Format citation link]
    
    Format --> HTML[Generate HTML:<br/>&lt;sup&gt;&lt;a href=drive_url&gt;<br/>filename, page X&lt;/a&gt;&lt;/sup&gt;]
    
    HTML --> Replace[Replace marker with link]
    Unknown --> Replace
    
    Replace --> More{More<br/>Markers?}
    
    More -->|Yes| Extract
    More -->|No| Dedupe[Deduplicate citations<br/>by file + page]
    
    Dedupe --> Final[Return:<br/>- Formatted response<br/>- Citation list]
    NoCite --> Final
    
    Final --> UI[Display in UI with<br/>inline clickable links]
    
    style HTML fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
    style Dedupe fill:#ffffff,stroke:#004085,stroke-width:2px,color:#000000
    style UI fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
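The extraction loop above can be sketched as a single regex pass over the LLM response. The `[cid:...]` marker syntax and the metadata fields (`file`, `page`, `url`) are assumptions inferred from the diagram, not Foldex's actual format:

```python
import re

# Assumed marker format: [cid:<chunk_id>] embedded inline in the LLM response.
CID_PATTERN = re.compile(r"\[cid:([A-Za-z0-9_-]+)\]")

def extract_citations(response: str,
                      chunks: dict[str, dict]) -> tuple[str, list[dict]]:
    """Replace [cid:...] markers with links; return deduplicated citations."""
    citations: list[dict] = []
    seen: set[tuple] = set()

    def substitute(match: re.Match) -> str:
        meta = chunks.get(match.group(1))
        if meta is None:
            return "(source unknown)"  # marker refers to an unretrieved chunk
        key = (meta["file"], meta.get("page"))
        if key not in seen:  # deduplicate by file + page
            seen.add(key)
            citations.append(meta)
        return (f'<sup><a href="{meta["url"]}">'
                f'{meta["file"]}, p. {meta.get("page", "?")}</a></sup>')

    return CID_PATTERN.sub(substitute, response), citations
```

Repeated markers for the same file and page yield a single citation entry, while unknown chunk IDs degrade gracefully instead of breaking the response.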

πŸ› οΈ Setup

Prerequisites

  • Docker and Docker Compose (recommended)
  • Python 3.10+ (for local development)
  • Node.js 18+ (for local development)
  • ffmpeg (required for audio processing)
  • Google OAuth2 Credentials (for Google Drive access)

Docker Setup (Recommended)

# Run the setup script
chmod +x setup.sh
./setup.sh

The setup script will:

  1. ✅ Check Docker and Docker Compose installation
  2. ✅ Create necessary directories
  3. ✅ Generate .env file with default settings
  4. ✅ Pull required Docker images
  5. ✅ Build backend and frontend containers
  6. ✅ Start all services (ChromaDB, Ollama, Backend, Frontend)
  7. ✅ Pull and warm up the LLM model
  8. ✅ Verify all services are healthy

Manual Setup


1. Install System Dependencies

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get update && sudo apt-get install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html

2. Clone and Setup

git clone <repository-url>
cd Foldex

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install backend dependencies
cd backend
pip install -r requirements/base.txt
cd ..

# Install frontend dependencies
cd frontend
npm install
cd ..

3. Configure Environment

Create a .env file in the project root:

# Google Drive API (Required)
GOOGLE_CLIENT_ID=your-client-id
GOOGLE_CLIENT_SECRET=your-client-secret
GOOGLE_REDIRECT_URI=http://localhost:3000/auth/callback

# Local LLM Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2:3b
OLLAMA_KEEP_ALIVE=-1

# Embedding Model
EMBEDDING_MODEL=nomic-embed-text:latest
EMBEDDING_TYPE=ollama

# Security: generate with `openssl rand -hex 32` and paste the literal value
# (.env files are not shell-expanded, so $(...) syntax will not work here)
SECRET_KEY=your-generated-hex-key
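For illustration, a dependency-free sketch of how a .env file like the one above might be parsed (the app itself presumably loads settings through a settings library):

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env
```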

4. Start Services

Start Ollama:

ollama serve    # runs in the foreground; run the pulls in a second terminal
ollama pull llama3.2:3b
ollama pull nomic-embed-text:latest

Start Backend:

cd backend
source ../venv/bin/activate
uvicorn app.main:app --reload

Start Frontend:

cd frontend
npm run dev

📖 Usage

1. Authenticate with Google Drive

  1. Click "Sign in with Google" on the landing page
  2. Grant permissions to access Google Drive
  3. You'll be redirected back to the application

2. Process a Folder

  1. Paste a Google Drive folder URL
  2. Click "Process Folder"
  3. Monitor progress via WebSocket updates
  4. Wait for processing to complete

3. Chat with Your Documents

  1. Navigate to the chat interface
  2. Ask questions about your documents
  3. View citations and source references
  4. Explore the knowledge graph visualization

πŸ›οΈ Project Structure

Foldex/
├── backend/                 # FastAPI backend
│   ├── app/
│   │   ├── api/             # API routes
│   │   ├── services/        # Business logic
│   │   ├── processors/      # Document processors (Unstructured.io, Audio)
│   │   ├── rag/             # RAG engine
│   │   ├── knowledge_graph/ # Knowledge graph
│   │   └── database/        # Database layer
│   ├── tests/               # Backend tests
│   └── requirements/        # Python dependencies
├── frontend/                # React frontend
│   ├── src/
│   │   ├── components/      # React components
│   │   ├── hooks/           # Custom hooks
│   │   ├── services/        # API clients
│   │   └── utils/           # Utilities
│   └── package.json
├── scripts/                 # Setup and utility scripts
├── data/                    # Local data storage
├── models/                  # ML models
├── docker-compose.yml       # Docker services
└── README.md

βš™οΈ Configuration

Environment Variables

Key environment variables (see .env for full list):

  • GOOGLE_CLIENT_ID: Google OAuth2 client ID
  • GOOGLE_CLIENT_SECRET: Google OAuth2 client secret
  • OLLAMA_MODEL: Local LLM model name (default: llama3.2:3b)
  • EMBEDDING_MODEL: Embedding model (default: nomic-embed-text:latest)
  • SECRET_KEY: JWT signing key (auto-generated)
  • OLLAMA_KEEP_ALIVE: Keep model loaded (-1 = indefinitely)
  • UNSTRUCTURED_STRATEGY: Unstructured.io processing strategy (default: fast, options: fast, hi_res, auto)
  • ENABLE_OCR: Enable OCR for images and scanned documents (default: true)

Database

Foldex uses:

  • SQLite for metadata storage (default: ./data/foldex.db)
  • ChromaDB for vector embeddings (default: ./data/vector_db)

Both are automatically initialized on first run.
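As an illustration of the SQLite side, a minimal metadata table could be initialized as below; the column names here are hypothetical and do not reflect Foldex's actual schema:

```python
import sqlite3

def init_metadata_db(path: str = "./data/foldex.db") -> sqlite3.Connection:
    """Create a hypothetical file-metadata table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS files (
               id TEXT PRIMARY KEY,      -- Google Drive file ID
               name TEXT NOT NULL,
               mime_type TEXT,
               processed_at TEXT         -- ISO timestamp
           )"""
    )
    conn.commit()
    return conn
```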

🧪 Testing

Backend Tests

cd backend
source ../venv/bin/activate
pytest tests/ -v
pytest tests/ --cov=app --cov-report=html

Frontend Tests

cd frontend
npm test
npm test -- --coverage

🔧 Troubleshooting

Common Issues

1. Port Already in Use

# Change ports in .env or docker-compose.yml
BACKEND_PORT=8001
FRONTEND_PORT=3001

2. Ollama Connection Failed

# Ensure Ollama is running
docker-compose ps ollama
# Or locally: ollama serve

# Check if model is available
ollama list

3. Google OAuth Errors

  • Verify GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET in .env
  • Ensure redirect URI matches: http://localhost:3000/auth/callback
  • Check OAuth consent screen configuration in Google Cloud Console

4. Database Errors

# Reset database (WARNING: Deletes all data)
docker-compose down -v
# Restart application to recreate
docker-compose up -d

5. Audio Processing Errors

# Ensure ffmpeg is installed
ffmpeg -version
# If not installed, see Prerequisites section

📚 API Documentation

Interactive API documentation is available at:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
