AhmadHakami/exp-rag

RAG System - Arabic/English PDF Support

A Retrieval-Augmented Generation (RAG) system that supports accurate parsing and question-answering for Arabic and English PDF documents, powered by Ollama.

Features

Bilingual Support: Works with both Arabic and English PDF documents
Accurate Parsing: Uses PyMuPDF for precise text extraction
Page Number Tracking: Each answer includes the page number(s) where information was found
Follow-up Questions: Maintains conversation history for contextual follow-up questions
CLI Interface: User-friendly command-line interface with rich formatting
Local Ollama Models: Uses Ollama for both LLM and embeddings (fully local, no external APIs)
Persistent Storage: ChromaDB vector store with persistence across sessions

Prerequisites

  1. Python 3.8+
  2. Ollama: Install from ollama.ai
  3. Ollama Models: Pull required models:
    ollama pull llama3.1
    ollama pull nomic-embed-text

Installation

  1. Clone the repository and navigate to the project directory:

    cd /home/ahmad/Documents/agentic-lab
  2. Create a virtual environment (recommended):

    python3 -m venv venv
    source venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure environment (optional):

    • The .env file is already configured with defaults
    • Modify if you need different models or settings

Usage

Starting the CLI

python cli.py

Main Menu Options

  1. Add PDF Document: Upload and index a PDF file
  2. Ask a Question: Enter question mode for interactive Q&A
  3. View Statistics: See system statistics and document counts
  4. Clear Conversation History: Reset the conversation context
  5. Clear All Documents: Remove all indexed documents
  6. Exit: Close the application

Example Workflow

  1. Start the CLI:

    python cli.py
  2. Add a PDF document (Option 1):

    • Enter the full path to your PDF file
    • The system will extract text and create embeddings
    • Wait for indexing to complete
  3. Ask questions (Option 2):

    • Type your question in Arabic or English
    • Receive answers with page numbers
    • Ask follow-up questions for context-aware responses
    • Type 'back' to return to the main menu

Example Questions

English: "What is the main topic discussed in the document?"
Arabic: "ما هو الموضوع الرئيسي المناقش في الوثيقة؟"

Follow-up: "Can you provide more details about that?"
Follow-up: "هل يمكنك تقديم المزيد من التفاصيل؟"

Configuration

Edit .env file to customize:

# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_LLM_MODEL=llama3.1              # Change to your preferred model
OLLAMA_EMBEDDING_MODEL=nomic-embed-text

# ChromaDB Configuration
CHROMA_PERSIST_DIRECTORY=./chroma_db

# PDF Processing
CHUNK_SIZE=1000        # Size of text chunks for processing
CHUNK_OVERLAP=200      # Overlap between chunks for context
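As a minimal sketch, these settings could be read at startup with the standard-library `os.getenv`, using the same defaults as the .env above (this mirrors, but is not necessarily, the project's actual loading code):

```python
import os

# Read each setting from the environment, falling back to the .env defaults.
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
OLLAMA_LLM_MODEL = os.getenv("OLLAMA_LLM_MODEL", "llama3.1")
OLLAMA_EMBEDDING_MODEL = os.getenv("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text")
CHROMA_PERSIST_DIRECTORY = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
```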

Recommended Ollama Models

For English:

  • LLM: llama3.1, mistral, mixtral
  • Embeddings: nomic-embed-text, mxbai-embed-large

For Arabic:

  • LLM: llama3.1, gemma2 (better multilingual support)
  • Embeddings: nomic-embed-text (supports multilingual)

Project Structure

agentic-lab/
├── cli.py                 # CLI interface
├── rag_engine.py         # RAG engine with conversation history
├── vectorstore.py        # Vector store management (ChromaDB)
├── pdf_parser.py         # PDF parsing with page tracking
├── requirements.txt      # Python dependencies
├── .env                  # Configuration file
├── .env.example          # Example configuration
├── README.md            # This file
└── chroma_db/           # Vector store persistence (created automatically)

How It Works

  1. PDF Parsing:

    • PyMuPDF extracts text page by page
    • Each page's content is tracked with its page number
  2. Text Chunking:

    • Text is split into chunks with overlap
    • Each chunk maintains its page number metadata
  3. Embedding & Indexing:

    • Ollama creates embeddings for each chunk
    • ChromaDB stores embeddings with metadata
  4. Question Answering:

    • User question is embedded
    • Similar chunks are retrieved with page numbers
    • LLM generates answer using retrieved context
    • Conversation history is maintained for follow-ups
  5. Answer with Page Numbers:

    • Each answer explicitly mentions source pages
    • Format: "According to page X..." or "(Page X)"
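The chunking step above (splitting with overlap while keeping page metadata) can be sketched as follows; `chunk_page` is a hypothetical helper for illustration, not the actual `pdf_parser.py` implementation:

```python
def chunk_page(text: str, page_number: int,
               chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split one page's text into overlapping chunks, attaching the page
    number as metadata so answers can cite their source page (sketch)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append({"text": text[start:start + chunk_size],
                       "page": page_number})
        if start + chunk_size >= len(text):
            break
        # Step forward by chunk_size minus overlap so adjacent chunks
        # share `overlap` characters of context.
        start += chunk_size - overlap
    return chunks
```

With the defaults, a 2,500-character page yields three chunks, each tagged with its page number and sharing 200 characters with its neighbor.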

Troubleshooting

Ollama Connection Issues

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if not running
ollama serve

Model Not Found

# Pull required models
ollama pull llama3.1
ollama pull nomic-embed-text

PDF Parsing Errors

  • Ensure the PDF is not password-protected
  • Check that the PDF contains extractable text (not just images)
  • For image-based PDFs, you would need OCR (not included in this version)

Memory Issues

  • Reduce CHUNK_SIZE in .env for large documents
  • Process fewer pages at once
  • Use a smaller embedding model

Advanced Usage

Using Different Models

Edit .env to use different Ollama models:

# For better Arabic support
OLLAMA_LLM_MODEL=gemma2

# For faster embeddings
OLLAMA_EMBEDDING_MODEL=all-minilm

Batch Processing Multiple PDFs

You can modify the code or write a script to process multiple PDFs:

from pdf_parser import PDFParser
from vectorstore import VectorStore

parser = PDFParser()
vectorstore = VectorStore(...)  # initialise with your ChromaDB settings

pdf_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
for pdf in pdf_files:
    pages = parser.extract_text_with_pages(pdf)
    vectorstore.add_documents(pages, pdf)

Future Enhancements

  • OCR support for image-based PDFs
  • Multi-document search with source filtering
  • Export conversation history
  • Web interface
  • Streaming responses
  • Citation extraction

License

MIT License - Feel free to use and modify as needed.

Support

For issues or questions, please open an issue on the repository.
