PDF Search Engine

Goal of the Project

This project aims to implement a simple and efficient search engine for PDF documents stored on your local machine. The goal is to provide full-text search functionality using custom indexing methods, allowing users to search through PDF documents based on content.

Note: We did not use Lucene for this project. Instead, we've implemented a lightweight search system from scratch.

Features

Search Method: We use a basic inverted index approach to map terms to the documents they appear in. This allows for quick lookups of documents based on search queries.
Indexing Method: The indexing process extracts text from PDF files using PyMuPDF (imported in Python as fitz), processes it, and stores the resulting terms in an inverted index structure.
Stemming: We use the Porter Stemmer algorithm to reduce words to their root form, improving search results by matching different word variants.
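
As a rough illustration of the inverted index idea described above, here is a minimal sketch. It is not the project's actual indexer.py or searcher.py: the function names, the simple regex tokenizer, and the boolean AND matching are assumptions made for this example, and stemming is left out (a Porter stemmer would be applied to each token inside tokenize).

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return dict(index)


def search(index, query):
    """Return the ids of documents that contain every query term."""
    postings = [set(index.get(term, {})) for term in tokenize(query)]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical sample documents standing in for extracted PDF text.
docs = {
    "a.pdf": "Search engines index documents.",
    "b.pdf": "An inverted index maps terms to documents.",
}
index = build_inverted_index(docs)
print(sorted(search(index, "index documents")))  # ['a.pdf', 'b.pdf']
print(sorted(search(index, "inverted")))         # ['b.pdf']
```

Because each lookup only touches the postings of the query terms, query time depends on how many documents contain those terms rather than on the total size of the collection.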

Repository Structure and Running the App

The repository is structured as follows:

/app
/templates # HTML templates for the Flask web interface
/static # Static files (e.g., CSS, JS)
app.py # Main Flask app file
indexer.py # Indexing script
searcher.py # Search logic
requirements.txt # List of required dependencies
Dockerfile # Dockerfile for containerization
/data # Storing the index files

Summary of Pickle Files

INDEX_PICKLE: inverted_index: term → {doc_id: tf}
DOCS_PICKLE: documents: doc_id → full text
DF_PICKLE: df: term → document frequency
DOC_LENGTHS_PICKLE: doc_id → tf-idf norm length
TERMS_PICKLE: doc_id → list of original words
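
To make these structures more concrete, the sketch below derives df and a tf-idf norm length from a toy inverted index and round-trips the index through pickle. The log(N / df) weighting, the helper names, and the index.pkl file name are assumptions for illustration; the project's exact formula and file names are not stated here.

```python
import math
import pickle
import tempfile
from pathlib import Path

# Toy inverted index in the shape listed above: term -> {doc_id: tf}
inverted_index = {
    "index": {"a.pdf": 1, "b.pdf": 2},
    "search": {"a.pdf": 3},
}
num_docs = 2

# Document frequency: the number of documents each term appears in.
df = {term: len(postings) for term, postings in inverted_index.items()}


def idf(term):
    # Assumed weighting: plain log(N / df). A term found in every document gets 0.
    return math.log(num_docs / df[term])


# TF-IDF norm length per document: square root of the sum of squared weights.
squared = {}
for term, postings in inverted_index.items():
    for doc_id, tf in postings.items():
        squared[doc_id] = squared.get(doc_id, 0.0) + (tf * idf(term)) ** 2
doc_lengths = {doc_id: math.sqrt(total) for doc_id, total in squared.items()}


def save_pickle(obj, path):
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_pickle(path):
    with open(path, "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "index.pkl"  # hypothetical file name
    save_pickle(inverted_index, index_path)
    restored = load_pickle(index_path)

print(df)  # {'index': 2, 'search': 1}
```

Pickle files should only be loaded from a trusted location, since unpickling can execute arbitrary code.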

Running the Application

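python -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt
python app.py

The activation command above is for Windows; on Linux or macOS, use source .venv/bin/activate instead.

The app will be available at http://localhost:5000.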

Docker

The project is Dockerized for easy deployment. To run the project using Docker, follow these steps:

docker build -t pdf-search .
docker run -p 5000:5000 pdf-search

The app will be accessible at http://localhost:5000 on the host machine, because -p 5000:5000 maps port 5000 of the container to port 5000 of the host.

Performance and Limitations

Performance: Search speed is adequate for small to medium-sized document sets. As the number of documents increases, performance may degrade because the index grows with the collection.
Limitations: This search engine currently supports only PDFs. It does not index metadata or offer advanced features like TF-IDF scoring or caching, but these are planned for future improvements.

Future Improvements

TF-IDF Scoring: Implementing Term Frequency-Inverse Document Frequency (TF-IDF) scoring will improve search relevance by weighting terms based on their frequency in the document versus the entire dataset.
Caching: Implement caching to store the results of frequent searches and reduce search time for popular queries.
Search by Metadata: Integrating metadata extraction (e.g., author, title) using Tika would allow for richer search capabilities. Tika could be used to extract additional information from PDFs and other file formats beyond the raw text content.
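
The TF-IDF scoring and caching ideas above could look roughly like the following sketch. Everything here is an assumption made for illustration: the toy INDEX, the log(N / df) weighting, the rank and cached_search names, and the use of functools.lru_cache as the cache. It is one possible shape, not a description of the project's code.

```python
import math
from collections import Counter
from functools import lru_cache

# Toy index: term -> {doc_id: tf}
INDEX = {
    "pdf": {"a.pdf": 2, "b.pdf": 1, "c.pdf": 1},
    "search": {"a.pdf": 1, "b.pdf": 3},
    "engine": {"a.pdf": 1},
}
NUM_DOCS = 3


def idf(term):
    """Rarer terms get a higher weight (assumed formula: log(N / df))."""
    return math.log(NUM_DOCS / len(INDEX[term]))


def rank(query_terms):
    """Order documents by the sum of tf * idf over the query terms."""
    scores = Counter()
    for term in query_terms:
        if term not in INDEX:
            continue
        weight = idf(term)
        for doc_id, tf in INDEX[term].items():
            scores[doc_id] += tf * weight
    return [doc_id for doc_id, _ in scores.most_common()]


@lru_cache(maxsize=256)
def cached_search(query):
    """Cache rankings per query string; a tuple keeps the cached value immutable."""
    return tuple(rank(query.lower().split()))


print(cached_search("search engine"))  # computed: ('a.pdf', 'b.pdf')
print(cached_search("search engine"))  # same result, served from the cache
```

With a cache like this, cached_search.cache_clear() would need to be called whenever the index is rebuilt, otherwise stale rankings would be returned.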

How We Process Documents

We use PyMuPDF (imported as fitz) to extract text from PDFs. We chose PyMuPDF because:

Speed: It is faster than Java-based tools like Tika.
Local Processing: It works entirely locally without the need for external servers or services.
PDF Focused: Since the project focuses on extracting text from PDFs, PyMuPDF provides all the functionality we need without dealing with additional overhead from extracting metadata.
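
For reference, extracting text with PyMuPDF can be sketched as below. The function names and the example path are hypothetical, and the project's real extraction code may differ. The fitz import sits inside extract_text so that the page-joining helper can be used without PyMuPDF installed.

```python
def pages_to_text(pages):
    """Join the plain text of each page; expects objects with a get_text() method."""
    return "\n".join(page.get_text() for page in pages)


def extract_text(pdf_path):
    """Return the full text of one PDF using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so pages_to_text has no dependency

    # Iterating over an opened document yields its pages in order.
    with fitz.open(pdf_path) as doc:
        return pages_to_text(doc)


# Example call (hypothetical path, requires PyMuPDF to be installed):
# text = extract_text("data/example.pdf")
```

The extracted text would then be tokenized, stemmed, and added to the inverted index.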

Why Not Use Tika?

While Tika is great for extracting both text and metadata from various file formats, we chose to use PyMuPDF (fitz) for the following reasons:

Speed: PyMuPDF is much faster than Tika for processing PDFs.
Local: Tika runs on the JVM, so it needs a Java runtime (and, when called from Python, typically a Tika server process), which adds complexity. PyMuPDF works entirely within the Python environment, making it easier to set up and run.
Simplified Use Case: Our use case only requires extracting text from PDF documents, and PyMuPDF is more than sufficient for this purpose. We don't need the extra functionality Tika provides for extracting metadata or processing other file types.

This project serves as a foundational search engine for PDFs. Feel free to extend it and add more advanced features as you see fit! 😊

ruasnv/PDF-Search-Engine
