Foldex is a local-first multimodal RAG (Retrieval-Augmented Generation) system that transforms Google Drive folders into intelligent conversation interfaces. Ask questions, find files, and get insights from your documents using AI, all running locally on your machine.
# Clone the repository
git clone <repository-url>
cd Foldex
# Run Docker setup (recommended)
chmod +x setup.sh
./setup.sh
# Or use Docker Compose directly
docker-compose up -d

Access the application:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000/api/docs
- Google Drive Integration: Authenticate and process folders from Google Drive
- Multimodal Processing: PDFs, Office documents (Word, Excel, PowerPoint), text files, Markdown, HTML, CSV, audio (Whisper), and images with OCR
- Unstructured.io Integration: Advanced document parsing with intelligent content extraction and OCR support
- Intelligent RAG: Hybrid retrieval combining semantic search, keyword matching, and knowledge graphs
- Conversational Interface: Chat with your documents, with precise citations
- Knowledge Graph Visualization: Interactive graph of document relationships
- Local-First: All processing and storage happens on your machine
- Real-time Updates: WebSocket-based progress tracking
- Citation-Driven: Every response includes precise source citations
Foldex uses Unstructured.io as the primary document processing engine for PDFs, Office documents (Word, Excel, PowerPoint), text files, HTML, CSV, and images. Unstructured.io provides intelligent content extraction with OCR support for scanned documents and images, title-based chunking for better semantic understanding, and unified processing across multiple document formats. Embeddings are generated using Ollama's nomic-embed-text model for local-first vector generation.
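Since embeddings come from a locally running Ollama server, chunk text can be embedded with a plain HTTP call to Ollama's `/api/embeddings` endpoint. A minimal sketch (the helper names are illustrative, not Foldex's actual code; the model name matches the config described below):

```python
# Build and send an embedding request to a local Ollama server.
# Assumes `ollama serve` is running and the model has been pulled.
import json
from urllib.request import Request, urlopen

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def build_embed_request(text: str, model: str = "nomic-embed-text:latest") -> Request:
    """Build the POST request Ollama expects for a single embedding."""
    body = json.dumps({"model": model, "prompt": text}).encode()
    return Request(OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})

def embed(text: str) -> list[float]:
    """Call the local Ollama server; returns the embedding vector."""
    with urlopen(build_embed_request(text)) as resp:
        return json.load(resp)["embedding"]
```

In the real pipeline these calls are batched and cached, so re-processing an unchanged folder does not re-embed every chunk.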
Jump to: Architecture | Data Flow | Ingestion Pipeline | Citation Flow
graph TB
Start([User Pastes Drive Folder URL]) --> Auth[Google OAuth2<br/>Authentication]
Auth --> Fetch[Google Drive API<br/>Fetch Files + Metadata]
Fetch --> FileType{File Type?}
FileType -->|PDF/Office/Text/HTML/CSV/Images| UnstructuredProc[Unstructured.io Processor<br/>Advanced parsing + OCR<br/>Title-based chunking]
FileType -->|Audio| AudioProc[Audio Processor<br/>Whisper transcription]
UnstructuredProc --> Chunker[Smart Chunker<br/>600 tokens, 100 overlap<br/>Metadata preservation]
AudioProc --> Chunker
Chunker --> Embed[Ollama: nomic-embed-text<br/>Embedding Generation<br/>Batched + Cached]
Embed --> VectorDB[(ChromaDB<br/>Persistent Vector Store)]
Query([User Query]) --> ChatService[Chat Service]
ChatService --> QCache{Query<br/>Cached?}
QCache -->|Yes| CachedResult[Return Cached Result<br/>Sub-2s response]
QCache -->|No| RAGFlow
RAGFlow[RAG Service] --> Retrieval[Hybrid Retrieval Strategy]
Retrieval --> Semantic[Semantic Search<br/>MMR for diversity]
Retrieval --> BM25[BM25 Keyword<br/>Exact term matching]
Semantic --> VectorDB
BM25 --> VectorDB
Semantic --> Ensemble[Ensemble Retriever<br/>Weights: 0.6 semantic, 0.4 BM25]
BM25 --> Ensemble
Ensemble --> Rerank[Cross-encoder Re-ranking<br/>ms-marco-MiniLM-L-6-v2]
Rerank --> TopK[Top 5 Chunks<br/>Grouped by source file]
TopK --> OllamaLLM[Ollama: llama3.2:3b<br/>Streaming Enabled]
OllamaLLM --> Stream[Real-time Token Stream<br/>WebSocket to client]
OllamaLLM --> Citations[Citation Extraction<br/>Parse inline markers]
Citations --> UI[Client UI<br/>Progressive display]
Stream --> UI
style ChatService fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
style RAGFlow fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
style CachedResult fill:#ffffff,stroke:#004085,stroke-width:2px,color:#000000
style VectorDB fill:#ffffff,stroke:#0c5460,stroke-width:2px,color:#000000
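The ensemble step in the graph above fuses the two retrievers with fixed weights (0.6 semantic, 0.4 BM25). A self-contained sketch of weighted score fusion; the min-max normalization and the example chunk IDs are assumptions, and Foldex's actual retriever (a LangChain-style ensemble) may differ in detail:

```python
# Weighted fusion of two score dictionaries keyed by chunk ID.
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max normalize so semantic and BM25 scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores equal
    return {cid: (s - lo) / span for cid, s in scores.items()}

def ensemble(semantic: dict[str, float], bm25: dict[str, float],
             w_sem: float = 0.6, w_kw: float = 0.4, k: int = 5) -> list[str]:
    """Return the top-k chunk IDs by weighted combined score."""
    sem, kw = normalize(semantic), normalize(bm25)
    ids = set(sem) | set(kw)
    fused = {cid: w_sem * sem.get(cid, 0.0) + w_kw * kw.get(cid, 0.0) for cid in ids}
    return sorted(fused, key=fused.get, reverse=True)[:k]
```

The fused list is then re-ranked by the cross-encoder before the final top-5 selection.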
sequenceDiagram
participant User
participant Frontend as React Frontend
participant API as FastAPI Backend
participant Chat as ChatService
participant RAG as RAGService
participant Retriever as Hybrid Retriever
participant ChromaDB
participant Ollama as Ollama (llama3.2:3b)
participant Citations as Citation Utils
User->>Frontend: Paste Drive folder URL
Frontend->>API: POST /api/v1/folders/process
API->>API: Authenticate with Google
API->>API: Download & chunk files
API->>ChromaDB: Store chunks + embeddings
ChromaDB-->>API: Ingested
API-->>Frontend: Ready
User->>Frontend: Ask question
Frontend->>API: WebSocket /ws/query
API->>Chat: process_query()
Chat->>RAG: query()
RAG->>Retriever: retrieve(query, k=5)
Retriever->>ChromaDB: Semantic search (MMR)
ChromaDB-->>Retriever: Top 20 candidates
Retriever->>ChromaDB: BM25 keyword search
ChromaDB-->>Retriever: Top 20 candidates
Retriever->>Retriever: Ensemble + re-rank
Retriever-->>RAG: Top 5 chunks
RAG->>Ollama: Generate (streaming=true)
loop For each token
Ollama-->>RAG: Token
RAG-->>Chat: Token
Chat-->>Frontend: Stream token
end
Ollama-->>RAG: Complete response
RAG->>Citations: extract_citations(response, chunks)
Citations-->>RAG: Citation list
RAG-->>Chat: Response + citations
Chat-->>Frontend: Citations
Frontend-->>User: Display response with inline citations
flowchart TD
Start([User Pastes Drive Folder URL]) --> Auth[Google OAuth2<br/>Authentication Layer]
Auth --> Fetch[Google Drive API<br/>File Metadata + Content]
Fetch --> FileType{File Type?}
FileType -->|PDF/Office/Text/HTML/CSV/Images| UnstructuredProc[Unstructured.io Processor<br/>Advanced parsing + OCR<br/>Title-based chunking]
FileType -->|Audio| AudioProc[Audio Processor<br/>Whisper transcription]
UnstructuredProc --> Chunker[Hierarchical Chunker<br/>600 tokens, 100 overlap<br/>Metadata: file, page, section]
AudioProc --> Chunker
Chunker --> Cache{Embedding<br/>Cached?}
Cache -->|No| Embed[Ollama: nomic-embed-text<br/>Embedding Generation<br/>Batched + Cached]
Cache -->|Yes| CacheHit[Cache Hit]
Embed --> VectorDB
CacheHit --> VectorDB
VectorDB[(ChromaDB<br/>Persistent Vector Store<br/>with rich metadata)]
VectorDB --> Complete[Processing Complete<br/>Ready for queries]
style Chunker fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
style CacheHit fill:#ffffff,stroke:#004085,stroke-width:2px,color:#000000
style VectorDB fill:#ffffff,stroke:#0c5460,stroke-width:2px,color:#000000
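The chunking stage above slides a 600-token window with 100 tokens of overlap, so context spanning a chunk boundary appears in both neighbors. A simplified sketch, using a pre-tokenized list as input (the real chunker also attaches file, page, and section metadata):

```python
# Sliding-window chunking: fixed window size with overlap between windows.
def chunk_tokens(tokens: list[str], size: int = 600, overlap: int = 100) -> list[list[str]]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # advance 500 tokens per window by default
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```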
flowchart TD
Start[LLM Response Generated] --> Parse[Parse for citation markers<br/>Pattern: cid:chunk_id]
Parse --> Found{Markers<br/>Found?}
Found -->|No| NoCite[No citations<br/>Return response as-is]
Found -->|Yes| Extract[Extract chunk IDs<br/>e.g., cid:abc123]
Extract --> Lookup[Lookup chunk metadata<br/>from retrieved chunks]
Lookup --> Meta{Metadata<br/>Found?}
Meta -->|No| Unknown[Replace with<br/>source unknown]
Meta -->|Yes| Format[Format citation link]
Format --> HTML[Generate HTML link:<br/>superscript anchor to drive_url<br/>"filename, page X"]
HTML --> Replace[Replace marker with link]
Unknown --> Replace
Replace --> More{More<br/>Markers?}
More -->|Yes| Extract
More -->|No| Dedupe[Deduplicate citations<br/>by file + page]
Dedupe --> Final[Return:<br/>- Formatted response<br/>- Citation list]
NoCite --> Final
Final --> UI[Display in UI with<br/>inline clickable links]
style HTML fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
style Dedupe fill:#ffffff,stroke:#004085,stroke-width:2px,color:#000000
style UI fill:#ffffff,stroke:#28a745,stroke-width:2px,color:#000000
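The citation pass above can be condensed into a single regex substitution: find markers, look up chunk metadata, fall back to "source unknown", and deduplicate by file and page. A minimal sketch; the exact `[cid:...]` marker syntax and HTML shape are assumptions inferred from the diagram:

```python
# Replace inline [cid:...] markers with superscript source links.
import re

MARKER = re.compile(r"\[cid:([A-Za-z0-9_-]+)\]")

def render_citations(response: str, chunks: dict[str, dict]) -> tuple[str, list[dict]]:
    """Return (formatted response, deduplicated citation list)."""
    citations, seen = [], set()

    def repl(match: re.Match) -> str:
        meta = chunks.get(match.group(1))
        if meta is None:
            return "(source unknown)"  # marker with no matching chunk
        key = (meta["file"], meta.get("page"))
        if key not in seen:
            seen.add(key)  # deduplicate citations by file + page
            citations.append(meta)
        return f'<sup><a href="{meta["url"]}">{meta["file"]}, p. {meta.get("page", "?")}</a></sup>'

    return MARKER.sub(repl, response), citations
```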
- Docker and Docker Compose (recommended)
- Python 3.10+ (for local development)
- Node.js 18+ (for local development)
- ffmpeg (required for audio processing)
- Google OAuth2 Credentials (for Google Drive access)
# Run the setup script
chmod +x setup.sh
./setup.sh

The setup script will:
- Check Docker and Docker Compose installation
- Create necessary directories
- Generate a .env file with default settings
- Pull required Docker images
- Build backend and frontend containers
- Start all services (ChromaDB, Ollama, Backend, Frontend)
- Pull and warm up the LLM model
- Verify all services are healthy
Click to expand manual setup instructions
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install ffmpeg
# Windows
# Download from https://ffmpeg.org/download.html

git clone <repository-url>
cd Foldex
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install backend dependencies
cd backend
pip install -r requirements/base.txt
cd ..
# Install frontend dependencies
cd frontend
npm install
cd ..

Create a .env file in the project root:
# Google Drive API (Required)
GOOGLE_CLIENT_ID=your-client-id
GOOGLE_CLIENT_SECRET=your-client-secret
GOOGLE_REDIRECT_URI=http://localhost:3000/auth/callback
# Local LLM Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3.2:3b
OLLAMA_KEEP_ALIVE=-1
# Embedding Model
EMBEDDING_MODEL=nomic-embed-text:latest
EMBEDDING_TYPE=ollama
# Security
SECRET_KEY=$(openssl rand -hex 32)

Start Ollama:
ollama serve
ollama pull llama3.2:3b
ollama pull nomic-embed-text:latest

Start Backend:
cd backend
source ../venv/bin/activate
uvicorn app.main:app --reload

Start Frontend:
cd frontend
npm run dev

- Click "Sign in with Google" on the landing page
- Grant permissions to access Google Drive
- You'll be redirected back to the application
- Paste a Google Drive folder URL
- Click "Process Folder"
- Monitor progress via WebSocket updates
- Wait for processing to complete
- Navigate to the chat interface
- Ask questions about your documents
- View citations and source references
- Explore the knowledge graph visualization
Foldex/
├── backend/                  # FastAPI backend
│   ├── app/
│   │   ├── api/              # API routes
│   │   ├── services/         # Business logic
│   │   ├── processors/       # Document processors (Unstructured.io, Audio)
│   │   ├── rag/              # RAG engine
│   │   ├── knowledge_graph/  # Knowledge graph
│   │   └── database/         # Database layer
│   ├── tests/                # Backend tests
│   └── requirements/         # Python dependencies
├── frontend/                 # React frontend
│   ├── src/
│   │   ├── components/       # React components
│   │   ├── hooks/            # Custom hooks
│   │   ├── services/         # API clients
│   │   └── utils/            # Utilities
│   └── package.json
├── scripts/                  # Setup and utility scripts
├── data/                     # Local data storage
├── models/                   # ML models
├── docker-compose.yml        # Docker services
└── README.md
Key environment variables (see .env for full list):
- GOOGLE_CLIENT_ID: Google OAuth2 client ID
- GOOGLE_CLIENT_SECRET: Google OAuth2 client secret
- OLLAMA_MODEL: Local LLM model name (default: llama3.2:3b)
- EMBEDDING_MODEL: Embedding model (default: nomic-embed-text:latest)
- SECRET_KEY: JWT signing key (auto-generated)
- OLLAMA_KEEP_ALIVE: Keep model loaded (-1 = indefinitely)
- UNSTRUCTURED_STRATEGY: Unstructured.io processing strategy (default: fast; options: fast, hi_res, auto)
- ENABLE_OCR: Enable OCR for images and scanned documents (default: true)
Foldex uses:
- SQLite for metadata storage (default: ./data/foldex.db)
- ChromaDB for vector embeddings (default: ./data/vector_db)
Both are automatically initialized on first run.
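As an illustration of the local-first storage layout, the SQLite side could be initialized with nothing but the standard library; the table and columns here are hypothetical, since Foldex creates its own schema on first run:

```python
# Create (or open) the local metadata database used alongside ChromaDB.
import sqlite3

def init_metadata_db(path: str = "./data/foldex.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS files (
               id TEXT PRIMARY KEY,      -- Google Drive file ID
               name TEXT NOT NULL,
               mime_type TEXT,
               processed_at TEXT         -- ISO timestamp of ingestion
           )"""
    )
    conn.commit()
    return conn
```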
cd backend
source ../venv/bin/activate
pytest tests/ -v
pytest tests/ --cov=app --cov-report=html

cd frontend
npm test
npm test -- --coverage

1. Port Already in Use
# Change ports in .env or docker-compose.yml
BACKEND_PORT=8001
FRONTEND_PORT=3001

2. Ollama Connection Failed
# Ensure Ollama is running
docker-compose ps ollama
# Or locally: ollama serve
# Check if model is available
ollama list

3. Google OAuth Errors
- Verify GOOGLE_CLIENT_ID and GOOGLE_CLIENT_SECRET in .env
- Ensure the redirect URI matches: http://localhost:3000/auth/callback
- Check OAuth consent screen configuration in Google Cloud Console
4. Database Errors
# Reset database (WARNING: Deletes all data)
docker-compose down -v
# Restart application to recreate
docker-compose up -d

5. Audio Processing Errors
# Ensure ffmpeg is installed
ffmpeg -version
# If not installed, see Prerequisites section

Interactive API documentation is available at:
- Swagger UI: http://localhost:8000/api/docs
This project is licensed under the MIT License - see the LICENSE file for details.