arpan1221/Foldex

Foldex

Foldex is a local-first multimodal RAG (Retrieval-Augmented Generation) system that transforms Google Drive folders into intelligent conversation interfaces. Ask questions, find files, and get insights from your documents using AI, all running locally on your machine.

🚀 Quick Start

# Clone the repository
git clone <repository-url>
cd Foldex

# Run Docker setup (recommended)
chmod +x setup.sh
./setup.sh

# Or use Docker Compose directly
docker-compose up -d

Access the application:

  • Frontend: http://localhost:3000
  • API docs: http://localhost:8000/docs

✨ Features

  • πŸ” Google Drive Integration: Authenticate and process folders from Google Drive
  • πŸ“„ Multimodal Processing: PDFs, Office documents (Word, Excel, PowerPoint), text files, Markdown, HTML, CSV, audio (Whisper), and images with OCR
  • πŸ”§ Unstructured.io Integration: Advanced document parsing using Unstructured.io for intelligent content extraction with OCR support
  • 🧠 Intelligent RAG: Hybrid retrieval with semantic search, keyword matching, and knowledge graphs
  • πŸ’¬ Conversational Interface: Chat with your documents with precise citations
  • πŸ“Š Knowledge Graph Visualization: Interactive graph showing document relationships
  • πŸ”’ Local-First: All processing and storage happens locally
  • ⚑ Real-time Updates: WebSocket-based progress tracking
  • 🎯 Citation-Driven: Every response includes precise source citations

πŸ—οΈ Architecture

Foldex uses Unstructured.io as the primary document processing engine for PDFs, Office documents (Word, Excel, PowerPoint), text files, HTML, CSV, and images. Unstructured.io provides intelligent content extraction with OCR support for scanned documents and images, title-based chunking for better semantic understanding, and unified processing across multiple document formats. Embeddings are generated using Ollama's nomic-embed-text model for local-first vector generation.
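The 600-token / 100-token-overlap chunking described in the pipeline can be sketched in a few lines. This is an illustrative sketch using naive whitespace tokenization; the real pipeline presumably counts tokens with a proper tokenizer and carries richer metadata per chunk:

```python
from typing import Iterator

def chunk_text(text: str, source: str, chunk_size: int = 600,
               overlap: int = 100) -> Iterator[dict]:
    """Split text into overlapping token windows, keeping source metadata.

    Naive whitespace tokenization; a real pipeline would use a model tokenizer.
    """
    tokens = text.split()
    step = chunk_size - overlap
    for start in range(0, max(len(tokens), 1), step):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        yield {"text": " ".join(window), "source": source,
               "start_token": start}
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
```

With a 1300-token document this produces three chunks starting at tokens 0, 500, and 1000, each overlapping its predecessor by 100 tokens.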

📊 Diagram Navigation

Jump to: Architecture | Data Flow | Ingestion Pipeline | Citation Flow

πŸ—οΈ Architecture Diagram

graph TB
    Start([User Pastes Drive Folder URL]) --> Auth[Google OAuth2<br/>Authentication]
    Auth --> Fetch[Google Drive API<br/>Fetch Files + Metadata]
    
    Fetch --> FileType{File Type?}
    
    FileType -->|PDF/Office/Text/HTML/CSV/Images| UnstructuredProc[Unstructured.io Processor<br/>Advanced parsing + OCR<br/>Title-based chunking]
    FileType -->|Audio| AudioProc[Audio Processor<br/>Whisper transcription]
    
    UnstructuredProc --> Chunker[Smart Chunker<br/>600 tokens, 100 overlap<br/>Metadata preservation]
    AudioProc --> Chunker
    
    Chunker --> Embed[Ollama: nomic-embed-text<br/>Embedding Generation<br/>Batched + Cached]
    
    Embed --> VectorDB[(ChromaDB<br/>Persistent Vector Store)]
    
    Query([User Query]) --> ChatService[Chat Service]
    
    ChatService --> QCache{Query<br/>Cached?}
    
    QCache -->|Yes| CachedResult[Return Cached Result<br/>⚡ Sub-2s response]
    QCache -->|No| RAGFlow
    
    RAGFlow[RAG Service] --> Retrieval[Hybrid Retrieval Strategy]
    
    Retrieval --> Semantic[Semantic Search<br/>MMR for diversity]
    Retrieval --> BM25[BM25 Keyword<br/>Exact term matching]
    
    Semantic --> VectorDB
    BM25 --> VectorDB
    
    Semantic --> Ensemble[Ensemble Retriever<br/>Weights: 0.6 semantic, 0.4 BM25]
    BM25 --> Ensemble
    
    Ensemble --> Rerank[Cross-encoder Re-ranking<br/>ms-marco-MiniLM-L-6-v2]
    
    Rerank --> TopK[Top 5 Chunks<br/>Grouped by source file]
    
    TopK --> OllamaLLM[Ollama: llama3.2:3b<br/>Streaming Enabled]
    
    OllamaLLM --> Stream[Real-time Token Stream<br/>WebSocket to client]
    OllamaLLM --> Citations[Citation Extraction<br/>Parse inline markers]
    
    Citations --> UI[Client UI<br/>Progressive display]
    Stream --> UI
    
    style ChatService fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
    style RAGFlow fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
    style CachedResult fill:#ffffff,stroke:#004085,stroke-width:2px,color:#000000
    style VectorDB fill:#ffffff,stroke:#0c5460,stroke-width:2px,color:#000000
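The ensemble step in the diagram combines the two retrievers with fixed weights (0.6 semantic, 0.4 BM25). A minimal sketch of one way to do this, weighted reciprocal-rank fusion over two ranked lists of chunk IDs (the actual retriever may fuse raw scores instead):

```python
def ensemble_rank(semantic: list[str], bm25: list[str],
                  w_sem: float = 0.6, w_bm25: float = 0.4) -> list[str]:
    """Fuse two ranked lists of chunk IDs using weighted reciprocal-rank scores."""
    scores: dict[str, float] = {}
    for weight, ranking in ((w_sem, semantic), (w_bm25, bm25)):
        for rank, chunk_id in enumerate(ranking):
            # A chunk found by both retrievers accumulates both contributions.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk ranked highly by both retrievers outranks one found by only a single retriever, which is the intended effect of the ensemble.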

📊 Data Flow Diagram

sequenceDiagram
    participant User
    participant Frontend as React Frontend
    participant API as FastAPI Backend
    participant Chat as ChatService
    participant RAG as RAGService
    participant Retriever as Hybrid Retriever
    participant ChromaDB
    participant Ollama as Ollama (llama3.2:3b)
    participant Citations as Citation Utils
    
    User->>Frontend: Paste Drive folder URL
    Frontend->>API: POST /api/v1/folders/process
    API->>API: Authenticate with Google
    API->>API: Download & chunk files
    API->>ChromaDB: Store chunks + embeddings
    ChromaDB-->>API: ✓ Ingested
    API-->>Frontend: ✓ Ready
    
    User->>Frontend: Ask question
    Frontend->>API: WebSocket /ws/query
    API->>Chat: process_query()
    
    Chat->>RAG: query()
    RAG->>Retriever: retrieve(query, k=5)
    
    Retriever->>ChromaDB: Semantic search (MMR)
    ChromaDB-->>Retriever: Top 20 candidates
    
    Retriever->>ChromaDB: BM25 keyword search
    ChromaDB-->>Retriever: Top 20 candidates
    
    Retriever->>Retriever: Ensemble + re-rank
    Retriever-->>RAG: Top 5 chunks
    
    RAG->>Ollama: Generate (streaming=true)
    
    loop For each token
        Ollama-->>RAG: Token
        RAG-->>Chat: Token
        Chat-->>Frontend: Stream token
    end
    
    Ollama-->>RAG: Complete response
    
    RAG->>Citations: extract_citations(response, chunks)
    Citations-->>RAG: Citation list
    
    RAG-->>Chat: Response + citations
    Chat-->>Frontend: Citations
    
    Frontend-->>User: Display response with inline citations
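The semantic leg of retrieval uses MMR (maximal marginal relevance) to balance query relevance against diversity among already-selected chunks. A self-contained sketch over plain float vectors; in Foldex this is presumably delegated to the vector store rather than hand-rolled:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query: list[float], docs: list[list[float]],
               k: int, lam: float = 0.5) -> list[int]:
    """Return indices of k docs maximizing relevance minus redundancy.

    lam=1.0 is pure relevance ranking; lower values favor diversity.
    """
    selected: list[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a diversity-leaning lambda, a chunk nearly identical to an already-selected one is skipped in favor of a less similar candidate.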

🔄 Ingestion Pipeline Diagram

flowchart TD
    Start([User Pastes Drive Folder URL]) --> Auth[Google OAuth2<br/>Authentication Layer]
    Auth --> Fetch[Google Drive API<br/>File Metadata + Content]
    
    Fetch --> FileType{File Type?}
    
    FileType -->|PDF/Office/Text/HTML/CSV/Images| UnstructuredProc[Unstructured.io Processor<br/>Advanced parsing + OCR<br/>Title-based chunking]
    FileType -->|Audio| AudioProc[Audio Processor<br/>Whisper transcription]
    
    UnstructuredProc --> Chunker[Hierarchical Chunker<br/>600 tokens, 100 overlap<br/>Metadata: file, page, section]
    AudioProc --> Chunker
    
    Chunker --> Cache{Embedding<br/>Cached?}
    
    Cache -->|No| Embed[Ollama: nomic-embed-text<br/>Embedding Generation<br/>Batched + Cached]
    Cache -->|Yes| CacheHit[Cache Hit]
    
    Embed --> VectorDB
    CacheHit --> VectorDB
    
    VectorDB[(ChromaDB<br/>Persistent Vector Store<br/>with rich metadata)]
    
    VectorDB --> Complete[✓ Processing Complete<br/>Ready for queries]
    
    style Chunker fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
    style CacheHit fill:#ffffff,stroke:#004085,stroke-width:2px,color:#000000
    style VectorDB fill:#ffffff,stroke:#0c5460,stroke-width:2px,color:#000000

πŸ“ Citation Extraction Flow

flowchart TD
    Start[LLM Response Generated] --> Parse[Parse for citation markers<br/>Pattern: cid:chunk_id]
    
    Parse --> Found{Markers<br/>Found?}
    
    Found -->|No| NoCite[No citations<br/>Return response as-is]
    Found -->|Yes| Extract[Extract chunk IDs<br/>e.g., cid:abc123]
    
    Extract --> Lookup[Lookup chunk metadata<br/>from retrieved chunks]
    
    Lookup --> Meta{Metadata<br/>Found?}
    
    Meta -->|No| Unknown[Replace with<br/>source unknown]
    Meta -->|Yes| Format[Format citation link]
    
    Format --> HTML[Generate HTML:<br/>&lt;sup&gt;&lt;a href=drive_url&gt;<br/>filename, page X&lt;/a&gt;&lt;/sup&gt;]
    
    HTML --> Replace[Replace marker with link]
    Unknown --> Replace
    
    Replace --> More{More<br/>Markers?}
    
    More -->|Yes| Extract
    More -->|No| Dedupe[Deduplicate citations<br/>by file + page]
    
    Dedupe --> Final[Return:<br/>- Formatted response<br/>- Citation list]
    NoCite --> Final
    
    Final --> UI[Display in UI with<br/>inline clickable links]
    
    style HTML fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
    style Dedupe fill:#ffffff,stroke:#004085,stroke-width:2px,color:#000000
    style UI fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
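The extraction loop above can be sketched as a single regex pass over the LLM response. The `[cid:...]` marker syntax and the metadata fields (`file`, `page`, `url`) are assumptions inferred from the diagram, not Foldex's actual format:

```python
import re

# Assumed marker format: [cid:<chunk_id>] embedded inline in the LLM response.
CID_PATTERN = re.compile(r"\[cid:([A-Za-z0-9_-]+)\]")

def extract_citations(response: str,
                      chunks: dict[str, dict]) -> tuple[str, list[dict]]:
    """Replace [cid:...] markers with links; return deduplicated citations."""
    citations: list[dict] = []
    seen: set[tuple] = set()

    def substitute(match: re.Match) -> str:
        meta = chunks.get(match.group(1))
        if meta is None:
            return "(source unknown)"  # marker refers to an unretrieved chunk
        key = (meta["file"], meta.get("page"))
        if key not in seen:  # deduplicate by file + page
            seen.add(key)
            citations.append(meta)
        return (f'<sup><a href="{meta["url"]}">'
                f'{meta["file"]}, p. {meta.get("page", "?")}</a></sup>')

    return CID_PATTERN.sub(substitute, response), citations
```

Repeated markers for the same file and page yield a single citation entry, while unknown chunk IDs degrade gracefully instead of breaking the response.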

πŸ› οΈ Setup

Prerequisites

  • Docker and Docker Compose (recommended)
  • Python 3.10+ (for local development)
  • Node.js 18+ (for local development)
  • ffmpeg (required for audio processing)
  • Google OAuth2 Credentials (for Google Drive access)

Docker Setup (Recommended)

# Run the setup script
chmod +x setup.sh
./setup.sh

The setup script will:

  1. ✅ Check Docker and Docker Compose installation
  2. ✅ Create necessary directories
  3. ✅ Generate .env file with default settings
  4. ✅ Pull required Docker images
  5. ✅ Build backend and frontend containers
  6. ✅ Start all services (ChromaDB, Ollama, Backend, Frontend)
  7. ✅ Pull and warm up the LLM model
  8. ✅ Verify all services are healthy

Manual Setup


1. Install System Dependencies

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt-get update && sudo apt-get install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html

2. Clone and Setup

git clone <repository-url>
cd Foldex

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install backend dependencies
cd backend
pip install -r requirements/base.txt
cd ..

# Install frontend dependencies
cd frontend
npm install
cd ..

3. Configure Environment

Create a .env file in the project root:

# Google Drive API (Required)
GOOGLE_CLIENT_ID=your-client-id
GOOGLE_CLIENT_SECRET=your-client-secret
GOOGLE_REDIRECT_URI=http://localhost:3000/auth/callback

# Local LLM Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2:3b
OLLAMA_KEEP_ALIVE=-1

# Embedding Model
EMBEDDING_MODEL=nomic-embed-text:latest
EMBEDDING_TYPE=ollama

# Security: generate with `openssl rand -hex 32` and paste the literal value
# (.env files are not shell-expanded, so $(...) syntax will not work here)
SECRET_KEY=your-generated-hex-key
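For illustration, a dependency-free sketch of how a .env file like the one above might be parsed (the app itself presumably loads settings through a settings library):

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    env: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env
```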

4. Start Services

Start Ollama:

ollama serve    # runs in the foreground; run the pulls in a second terminal
ollama pull llama3.2:3b
ollama pull nomic-embed-text:latest

Start Backend:

cd backend
source ../venv/bin/activate
uvicorn app.main:app --reload

Start Frontend:

cd frontend
npm run dev

📖 Usage

1. Authenticate with Google Drive

  1. Click "Sign in with Google" on the landing page
  2. Grant permissions to access Google Drive
  3. You'll be redirected back to the application

2. Process a Folder

  1. Paste a Google Drive folder URL
  2. Click "Process Folder"
  3. Monitor progress via WebSocket updates
  4. Wait for processing to complete

3. Chat with Your Documents

  1. Navigate to the chat interface
  2. Ask questions about your documents
  3. View citations and source references
  4. Explore the knowledge graph visualization

πŸ›οΈ Project Structure

Foldex/
├── backend/                 # FastAPI backend
│   ├── app/
│   │   ├── api/             # API routes
│   │   ├── services/        # Business logic
│   │   ├── processors/      # Document processors (Unstructured.io, Audio)
│   │   ├── rag/             # RAG engine
│   │   ├── knowledge_graph/ # Knowledge graph
│   │   └── database/        # Database layer
│   ├── tests/               # Backend tests
│   └── requirements/        # Python dependencies
├── frontend/                # React frontend
│   ├── src/
│   │   ├── components/      # React components
│   │   ├── hooks/           # Custom hooks
│   │   ├── services/        # API clients
│   │   └── utils/           # Utilities
│   └── package.json
├── scripts/                 # Setup and utility scripts
├── data/                    # Local data storage
├── models/                  # ML models
├── docker-compose.yml       # Docker services
└── README.md

βš™οΈ Configuration

Environment Variables

Key environment variables (see .env for full list):

  • GOOGLE_CLIENT_ID: Google OAuth2 client ID
  • GOOGLE_CLIENT_SECRET: Google OAuth2 client secret
  • OLLAMA_MODEL: Local LLM model name (default: llama3.2:3b)
  • EMBEDDING_MODEL: Embedding model (default: nomic-embed-text:latest)
  • SECRET_KEY: JWT signing key (auto-generated)
  • OLLAMA_KEEP_ALIVE: Keep model loaded (-1 = indefinitely)
  • UNSTRUCTURED_STRATEGY: Unstructured.io processing strategy (default: fast, options: fast, hi_res, auto)
  • ENABLE_OCR: Enable OCR for images and scanned documents (default: true)

Database

Foldex uses:

  • SQLite for metadata storage (default: ./data/foldex.db)
  • ChromaDB for vector embeddings (default: ./data/vector_db)

Both are automatically initialized on first run.
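As an illustration of the SQLite side, a minimal metadata table could be initialized as below; the column names here are hypothetical and do not reflect Foldex's actual schema:

```python
import sqlite3

def init_metadata_db(path: str = "./data/foldex.db") -> sqlite3.Connection:
    """Create a hypothetical file-metadata table if it does not exist."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS files (
               id TEXT PRIMARY KEY,      -- Google Drive file ID
               name TEXT NOT NULL,
               mime_type TEXT,
               processed_at TEXT         -- ISO timestamp
           )"""
    )
    conn.commit()
    return conn
```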

🧪 Testing

Backend Tests

cd backend
source ../venv/bin/activate
pytest tests/ -v
pytest tests/ --cov=app --cov-report=html

Frontend Tests

cd frontend
npm test
npm test -- --coverage

🔧 Troubleshooting

Common Issues

1. Port Already in Use

# Change ports in .env or docker-compose.yml
BACKEND_PORT=8001
FRONTEND_PORT=3001

2. Ollama Connection Failed

# Ensure Ollama is running
docker-compose ps ollama
# Or locally: ollama serve

# Check if model is available
ollama list

3. Google OAuth Errors

  • Verify GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET in .env
  • Ensure redirect URI matches: http://localhost:3000/auth/callback
  • Check OAuth consent screen configuration in Google Cloud Console

4. Database Errors

# Reset database (WARNING: Deletes all data)
docker-compose down -v
# Restart application to recreate
docker-compose up -d

5. Audio Processing Errors

# Ensure ffmpeg is installed
ffmpeg -version
# If not installed, see Prerequisites section

📚 API Documentation

Interactive API documentation is available at:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
