AhmadHakami/exp-rag

RAG System - Arabic/English PDF Support

A Retrieval-Augmented Generation (RAG) system that supports accurate parsing and question-answering for Arabic and English PDF documents, powered by Ollama.

Features

Bilingual Support: Works with both Arabic and English PDF documents
Accurate Parsing: Uses PyMuPDF for precise text extraction
Page Number Tracking: Each answer includes the page number(s) where information was found
Follow-up Questions: Maintains conversation history for contextual follow-up questions
CLI Interface: User-friendly command-line interface with rich formatting
Local Ollama Models: Uses Ollama for both LLM and embeddings (fully local, no external APIs)
Persistent Storage: ChromaDB vector store with persistence across sessions

Prerequisites

  1. Python 3.8+
  2. Ollama: Install from ollama.ai
  3. Ollama Models: Pull required models:
    ollama pull llama3.1
    ollama pull nomic-embed-text

Installation

  1. Clone the repository and navigate to the project directory:

    cd /home/ahmad/Documents/agentic-lab
  2. Create a virtual environment (recommended):

    python3 -m venv venv
    source venv/bin/activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure environment (optional):

    • The .env file is already configured with defaults
    • Modify if you need different models or settings

Usage

Starting the CLI

python cli.py

Main Menu Options

  1. Add PDF Document: Upload and index a PDF file
  2. Ask a Question: Enter question mode for interactive Q&A
  3. View Statistics: See system statistics and document counts
  4. Clear Conversation History: Reset the conversation context
  5. Clear All Documents: Remove all indexed documents
  6. Exit: Close the application

Example Workflow

  1. Start the CLI:

    python cli.py
  2. Add a PDF document (Option 1):

    • Enter the full path to your PDF file
    • The system will extract text and create embeddings
    • Wait for indexing to complete
  3. Ask questions (Option 2):

    • Type your question in Arabic or English
    • Receive answers with page numbers
    • Ask follow-up questions for context-aware responses
    • Type 'back' to return to the main menu

Example Questions

English: "What is the main topic discussed in the document?"
Arabic: "ما هو الموضوع الرئيسي المناقش في الوثيقة؟"

Follow-up: "Can you provide more details about that?"
Follow-up: "هل يمكنك تقديم المزيد من التفاصيل؟"

Configuration

Edit .env file to customize:

# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_LLM_MODEL=llama3.1              # Change to your preferred model
OLLAMA_EMBEDDING_MODEL=nomic-embed-text

# ChromaDB Configuration
CHROMA_PERSIST_DIRECTORY=./chroma_db

# PDF Processing
CHUNK_SIZE=1000        # Size of text chunks for processing
CHUNK_OVERLAP=200      # Overlap between chunks for context
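As a minimal sketch, these settings could be read at startup with the standard-library `os.getenv`, using the same defaults as the .env above (this mirrors, but is not necessarily, the project's actual loading code):

```python
import os

# Read each setting from the environment, falling back to the .env defaults.
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
OLLAMA_LLM_MODEL = os.getenv("OLLAMA_LLM_MODEL", "llama3.1")
OLLAMA_EMBEDDING_MODEL = os.getenv("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text")
CHROMA_PERSIST_DIRECTORY = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))
```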

Recommended Ollama Models

For English:

  • LLM: llama3.1, mistral, mixtral
  • Embeddings: nomic-embed-text, mxbai-embed-large

For Arabic:

  • LLM: llama3.1, gemma2 (better multilingual support)
  • Embeddings: nomic-embed-text (supports multilingual)

Project Structure

agentic-lab/
├── cli.py                 # CLI interface
├── rag_engine.py         # RAG engine with conversation history
├── vectorstore.py        # Vector store management (ChromaDB)
├── pdf_parser.py         # PDF parsing with page tracking
├── requirements.txt      # Python dependencies
├── .env                  # Configuration file
├── .env.example          # Example configuration
├── README.md            # This file
└── chroma_db/           # Vector store persistence (created automatically)

How It Works

  1. PDF Parsing:

    • PyMuPDF extracts text page by page
    • Each page's content is tracked with its page number
  2. Text Chunking:

    • Text is split into chunks with overlap
    • Each chunk maintains its page number metadata
  3. Embedding & Indexing:

    • Ollama creates embeddings for each chunk
    • ChromaDB stores embeddings with metadata
  4. Question Answering:

    • User question is embedded
    • Similar chunks are retrieved with page numbers
    • LLM generates answer using retrieved context
    • Conversation history is maintained for follow-ups
  5. Answer with Page Numbers:

    • Each answer explicitly mentions source pages
    • Format: "According to page X..." or "(Page X)"
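The chunking step above (splitting with overlap while keeping page metadata) can be sketched as follows; `chunk_page` is a hypothetical helper for illustration, not the actual `pdf_parser.py` implementation:

```python
def chunk_page(text: str, page_number: int,
               chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split one page's text into overlapping chunks, attaching the page
    number as metadata so answers can cite their source page (sketch)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append({"text": text[start:start + chunk_size],
                       "page": page_number})
        if start + chunk_size >= len(text):
            break
        # Step forward by chunk_size minus overlap so adjacent chunks
        # share `overlap` characters of context.
        start += chunk_size - overlap
    return chunks
```

With the defaults, a 2,500-character page yields three chunks, each tagged with its page number and sharing 200 characters with its neighbor.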

Troubleshooting

Ollama Connection Issues

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if not running
ollama serve

Model Not Found

# Pull required models
ollama pull llama3.1
ollama pull nomic-embed-text

PDF Parsing Errors

  • Ensure the PDF is not password-protected
  • Check that the PDF contains extractable text (not just images)
  • For image-based PDFs, you would need OCR (not included in this version)

Memory Issues

  • Reduce CHUNK_SIZE in .env for large documents
  • Process fewer pages at once
  • Use a smaller embedding model

Advanced Usage

Using Different Models

Edit .env to use different Ollama models:

# For better Arabic support
OLLAMA_LLM_MODEL=gemma2

# For faster embeddings
OLLAMA_EMBEDDING_MODEL=all-minilm

Batch Processing Multiple PDFs

You can modify the code or write a script to process multiple PDFs:

from pdf_parser import PDFParser
from vectorstore import VectorStore

parser = PDFParser()
vectorstore = VectorStore(...)  # initialise with your ChromaDB settings

pdf_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
for pdf in pdf_files:
    pages = parser.extract_text_with_pages(pdf)
    vectorstore.add_documents(pages, pdf)

Future Enhancements

  • OCR support for image-based PDFs
  • Multi-document search with source filtering
  • Export conversation history
  • Web interface
  • Streaming responses
  • Citation extraction

License

MIT License - Feel free to use and modify as needed.

Support

For issues or questions, please open an issue on the repository.
