A Retrieval-Augmented Generation (RAG) system that supports accurate parsing and question-answering for Arabic and English PDF documents, powered by Ollama.
## Features

✅ Bilingual Support: Works with both Arabic and English PDF documents
✅ Accurate Parsing: Uses PyMuPDF for precise text extraction
✅ Page Number Tracking: Each answer includes the page number(s) where information was found
✅ Follow-up Questions: Maintains conversation history for contextual follow-up questions
✅ CLI Interface: User-friendly command-line interface with rich formatting
✅ Local Ollama Models: Uses Ollama for both LLM and embeddings (fully local, no external APIs)
✅ Persistent Storage: ChromaDB vector store with persistence across sessions
## Prerequisites

- Python 3.8+
- Ollama: Install from [ollama.ai](https://ollama.ai)
- Ollama models: Pull the required models:

  ```bash
  ollama pull llama3.1
  ollama pull nomic-embed-text
  ```
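Before indexing anything, it can help to confirm the required models are actually installed. A minimal sketch (the helper name is ours; it only assumes the shape of Ollama's `/api/tags` response, which lists installed models under a `models` key):

```python
REQUIRED = ["llama3.1", "nomic-embed-text"]

def missing_models(tags_payload: dict, required=REQUIRED) -> list:
    """Return the required models absent from an Ollama /api/tags response."""
    # /api/tags responds with {"models": [{"name": "llama3.1:latest", ...}, ...]};
    # strip the ":tag" suffix before comparing against the required base names.
    installed = {m["name"].split(":")[0] for m in tags_payload.get("models", [])}
    return [m for m in required if m not in installed]

# Example payload, as returned by: curl http://localhost:11434/api/tags
sample = {"models": [{"name": "llama3.1:latest"}]}
print(missing_models(sample))  # → ['nomic-embed-text']
```

Run this against the live endpoint (e.g. with `requests`) and pull whatever comes back missing.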
## Installation

1. Clone or navigate to the project directory:

   ```bash
   cd /home/ahmad/Documents/agentic-lab
   ```

2. Create a virtual environment (recommended):

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
4. Configure the environment (optional):
   - The `.env` file is already configured with defaults
   - Modify it if you need different models or settings

## Usage

Launch the application with `python cli.py`. The main menu offers:

- Add PDF Document: Upload and index a PDF file
- Ask a Question: Enter question mode for interactive Q&A
- View Statistics: See system statistics and document counts
- Clear Conversation History: Reset the conversation context
- Clear All Documents: Remove all indexed documents
- Exit: Close the application
### Typical Workflow

1. Start the CLI:

   ```bash
   python cli.py
   ```

2. Add a PDF document (Option 1):
   - Enter the full path to your PDF file
   - The system will extract the text and create embeddings
   - Wait for indexing to complete

3. Ask questions (Option 2):
   - Type your question in Arabic or English
   - Receive answers with page numbers
   - Ask follow-up questions for context-aware responses
   - Type 'back' to return to the main menu
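Follow-up questions in step 3 work because the engine keeps earlier turns and feeds them back into the prompt. A minimal sketch of that idea (a hypothetical helper for illustration, not the actual `rag_engine.py` code):

```python
class ConversationHistory:
    """Keeps (question, answer) turns so follow-ups can reference earlier ones."""

    def __init__(self, max_turns: int = 5):
        self.max_turns = max_turns  # cap the context so prompts stay small
        self.turns = []

    def add(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))
        self.turns = self.turns[-self.max_turns:]  # drop the oldest turns

    def as_prompt_context(self) -> str:
        # Rendered above the new question so the LLM can resolve "that", "it", etc.
        return "\n".join(f"Q: {q}\nA: {a}" for q, a in self.turns)

history = ConversationHistory(max_turns=2)
history.add("What is the main topic?", "Renewable energy policy (Page 3).")
history.add("Any details?", "It cites solar capacity figures (Page 7).")
print(history.as_prompt_context())
```

"Clear Conversation History" in the main menu corresponds to resetting exactly this kind of state.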
### Example Questions

- English: "What is the main topic discussed in the document?"
- Arabic: "ما هو الموضوع الرئيسي المناقش في الوثيقة؟" ("What is the main topic discussed in the document?")
- Follow-up: "Can you provide more details about that?"
- Follow-up (Arabic): "هل يمكنك تقديم المزيد من التفاصيل؟" ("Can you provide more details?")
## Configuration

Edit the `.env` file to customize the system:

```bash
# Ollama Configuration
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_LLM_MODEL=llama3.1                # Change to your preferred model
OLLAMA_EMBEDDING_MODEL=nomic-embed-text

# ChromaDB Configuration
CHROMA_PERSIST_DIRECTORY=./chroma_db

# PDF Processing
CHUNK_SIZE=1000       # Size of text chunks for processing
CHUNK_OVERLAP=200     # Overlap between chunks for context
```

### Recommended Models

- LLM: `llama3.1`, `mistral`, `mixtral`
- Embeddings: `nomic-embed-text`, `mxbai-embed-large`

For Arabic documents:

- LLM: `llama3.1`, `gemma2` (better multilingual support)
- Embeddings: `nomic-embed-text` (supports multilingual text)
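These settings are plain environment variables, so the application can read them with simple fallbacks matching the defaults above. A sketch of such a loader (the `Settings` class is ours; whether the project uses `python-dotenv` to populate the environment is an assumption):

```python
import os
from dataclasses import dataclass

# from dotenv import load_dotenv   # pip install python-dotenv
# load_dotenv()                    # would populate os.environ from .env

@dataclass
class Settings:
    """Reads the .env variables, falling back to the documented defaults."""
    base_url: str = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    llm_model: str = os.getenv("OLLAMA_LLM_MODEL", "llama3.1")
    embedding_model: str = os.getenv("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text")
    persist_dir: str = os.getenv("CHROMA_PERSIST_DIRECTORY", "./chroma_db")
    chunk_size: int = int(os.getenv("CHUNK_SIZE", "1000"))
    chunk_overlap: int = int(os.getenv("CHUNK_OVERLAP", "200"))

settings = Settings()
print(settings.llm_model, settings.chunk_size)
```

Casting `CHUNK_SIZE` and `CHUNK_OVERLAP` to `int` up front keeps a malformed `.env` from failing deep inside the chunking code.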
## Project Structure

```
agentic-lab/
├── cli.py              # CLI interface
├── rag_engine.py       # RAG engine with conversation history
├── vectorstore.py      # Vector store management (ChromaDB)
├── pdf_parser.py       # PDF parsing with page tracking
├── requirements.txt    # Python dependencies
├── .env                # Configuration file
├── .env.example        # Example configuration
├── README.md           # This file
└── chroma_db/          # Vector store persistence (created automatically)
```
## How It Works

1. PDF Parsing:
   - PyMuPDF extracts text page by page
   - Each page's content is tracked with its page number

2. Text Chunking:
   - Text is split into chunks with overlap
   - Each chunk maintains its page number metadata

3. Embedding & Indexing:
   - Ollama creates embeddings for each chunk
   - ChromaDB stores the embeddings with their metadata

4. Question Answering:
   - The user's question is embedded
   - Similar chunks are retrieved along with their page numbers
   - The LLM generates an answer using the retrieved context
   - Conversation history is maintained for follow-ups

5. Answers with Page Numbers:
   - Each answer explicitly mentions its source pages
   - Format: "According to page X..." or "(Page X)"
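Steps 1-2 can be sketched as follows: a simplified chunker (for illustration, not the actual `pdf_parser.py` logic) that splits each page's text with overlap while carrying the page number along as metadata, which is what later lets answers cite "(Page X)":

```python
def chunk_pages(pages, chunk_size=1000, overlap=200):
    """Split (page_number, text) pairs into overlapping chunks with metadata.

    Each chunk remembers its source page, so retrieved chunks can be
    cited as "(Page X)" in the final answer.
    """
    chunks = []
    step = chunk_size - overlap  # consecutive chunks share `overlap` characters
    for page_number, text in pages:
        for start in range(0, max(len(text), 1), step):
            piece = text[start:start + chunk_size]
            if piece:
                chunks.append({"text": piece, "page": page_number})
    return chunks

# A 1500-char page yields two overlapping chunks; a 500-char page yields one.
pages = [(1, "A" * 1500), (2, "B" * 500)]
for c in chunk_pages(pages):
    print(c["page"], len(c["text"]))
```

Note this sketch never chunks across a page boundary, which keeps the page attribution unambiguous at the cost of slightly smaller chunks near page ends.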
## Troubleshooting

Ollama connection issues:

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Start Ollama if it is not running
ollama serve

# Pull the required models
ollama pull llama3.1
ollama pull nomic-embed-text
```

PDF parsing issues:

- Ensure the PDF is not password-protected
- Check that the PDF contains extractable text (not just images)
- For image-based PDFs, you would need OCR (not included in this version)

Memory issues with large documents:

- Reduce `CHUNK_SIZE` in `.env` for large documents
- Process fewer pages at once
- Use a smaller embedding model
## Advanced Usage

To use different Ollama models, edit `.env`:

```bash
# For better Arabic support
OLLAMA_LLM_MODEL=gemma2

# For faster embeddings
OLLAMA_EMBEDDING_MODEL=all-minilm
```

You can also modify the code or write a script to process multiple PDFs:

```python
from pdf_parser import PDFParser
from vectorstore import VectorStore

parser = PDFParser()
vectorstore = VectorStore(...)

pdf_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
for pdf in pdf_files:
    pages = parser.extract_text_with_pages(pdf)
    vectorstore.add_documents(pages, pdf)
```

## Future Enhancements

- OCR support for image-based PDFs
- Multi-document search with source filtering
- Export conversation history
- Web interface
- Streaming responses
- Citation extraction
## License

MIT License - Feel free to use and modify as needed.
## Support

For issues or questions, please check:
- Ollama documentation: https://ollama.ai
- LangChain documentation: https://python.langchain.com
- ChromaDB documentation: https://docs.trychroma.com