This project exists to gain hands-on experience with some of the most important tools in modern backend software. It is a fully containerized pipeline for extracting text from PDF files asynchronously.
It uses FastAPI for the API, Celery for background task processing, RabbitMQ as the message broker, Redis for storing results, and Apache Tika for PDF parsing.
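As a rough sketch of how these pieces might be wired together (the module name `worker.py`, the connection URLs, and the service hostnames below are assumptions rather than the exact values used in this repo):

```python
# worker.py (hypothetical module name; URLs and hostnames are assumptions)
from celery import Celery

# Celery app: RabbitMQ is the broker (task queue), Redis is the result backend.
celery_app = Celery(
    "pdf_pipeline",
    broker="amqp://guest:guest@rabbitmq:5672//",
    backend="redis://redis:6379/0",
)

# Apache Tika server endpoint used for text extraction (assumed hostname/port).
TIKA_URL = "http://tika:9998/tika"
```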
Features:
- Upload single or multiple PDF files
- Asynchronous processing (non-blocking API)
- Text extraction via Apache Tika
- Task status and result retrieval
- Scalable worker architecture
Prerequisites:
- Docker installed
- Docker Compose installed
To get started:
- Clone this repository
- Go to the pdf-parsing-pipeline directory
- Run `docker-compose up --build`
That's it! You're good to go!
First, upload your PDF via either the terminal or the API docs; the API saves the file(s) in a shared volume. The API then sends a task to RabbitMQ with the file path, and RabbitMQ holds the task until a worker is ready. A Celery worker reads the file, sends it to Apache Tika for parsing, and finally saves the extracted text and status in Redis. You can check the status of processing tasks and view the results with the commands provided below. The workflow diagram is as follows:
┌───────────┐          HTTP        ┌────────────┐
│  Client   │ ───────────────────▶ │  FastAPI   │
└───────────┘                      │    API     │
      ▲                            └─────┬──────┘
      │                                  │
      │                                  ▼
      │                            ┌────────────┐
      │                            │  Uploads   │
      │                            │   Folder   │
      │                            └─────┬──────┘
      │                                  │
      │                                  ▼
┌─────────────┐     Task Queue     ┌─────────────┐
│  RabbitMQ   │ ─────────────────▶ │   Celery    │
│   Queue     │                    │   Worker    │
└─────────────┘                    └─────┬───────┘
                                         │
                                         ▼
                                   ┌────────────┐
                                   │    Tika    │
                                   │   Server   │
                                   └─────┬──────┘
                                         │
                                         ▼
                                   ┌────────────┐
                                   │   Redis    │
                                   │   Store    │
                                   └────────────┘
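To make the worker step concrete, here is a minimal sketch of what the Celery task could look like, assuming the `worker.py` module from the earlier sketch; the actual task in this repo may be named and structured differently. It reads the uploaded file from the shared volume, sends the raw bytes to Tika, and returns the extracted text, which Celery stores in Redis:

```python
# tasks.py (illustrative): the Celery task that does the extraction.
import requests

from worker import celery_app, TIKA_URL  # the app sketched earlier; names are assumptions


@celery_app.task(name="parse_pdf")
def parse_pdf(file_path: str) -> str:
    """Read a PDF from the shared volume and extract its text via Apache Tika."""
    with open(file_path, "rb") as f:
        pdf_bytes = f.read()

    # Tika Server returns extracted plain text for a PUT of the raw bytes to /tika.
    response = requests.put(TIKA_URL, data=pdf_bytes, headers={"Accept": "text/plain"})
    response.raise_for_status()

    # The returned text is stored in the Redis result backend under this task's id.
    return response.text
```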
Upload a file:
curl -F "file=@/path/to/file.pdf" http://localhost:8000/upload

You get a response like:

{"task_id": "123abc"}
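Behind that call, the upload endpoint might look roughly like this minimal sketch, assuming the `parse_pdf` task sketched above and an `/uploads` shared volume; the actual route handler and paths in this repo may differ:

```python
# api.py (illustrative): the upload endpoint; route, paths, and names are assumptions.
import uuid
from pathlib import Path

from fastapi import FastAPI, UploadFile

from tasks import parse_pdf  # the Celery task sketched above

app = FastAPI()
UPLOAD_DIR = Path("/uploads")  # shared volume mounted into both the API and worker containers


@app.post("/upload")
async def upload(file: UploadFile):
    # Save the PDF into the shared volume so the worker container can read it.
    destination = UPLOAD_DIR / f"{uuid.uuid4()}_{file.filename}"
    destination.write_bytes(await file.read())

    # Enqueue the parsing task in RabbitMQ and return immediately (non-blocking).
    task = parse_pdf.delay(str(destination))
    return {"task_id": task.id}
```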
Check the task status:

curl http://localhost:8000/status/123abc

Once the status is "SUCCESS", get the result:

curl http://localhost:8000/result/123abc
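These two endpoints can be thin wrappers around Celery's `AsyncResult`; again, this is a sketch rather than the repo's exact code:

```python
# Status and result endpoints (illustrative), built on Celery's AsyncResult.
from celery.result import AsyncResult
from fastapi import FastAPI

from worker import celery_app  # the Celery app sketched earlier

app = FastAPI()  # in the real service these routes live on the same app as /upload


@app.get("/status/{task_id}")
def get_status(task_id: str):
    # State is read from the Redis result backend: PENDING, STARTED, SUCCESS, FAILURE, ...
    return {"task_id": task_id, "status": AsyncResult(task_id, app=celery_app).state}


@app.get("/result/{task_id}")
def get_result(task_id: str):
    result = AsyncResult(task_id, app=celery_app)
    if not result.ready():
        return {"task_id": task_id, "status": result.state}
    # For a successful task this is the extracted text returned by the worker.
    return {"task_id": task_id, "text": result.result}
```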
The main goal of this project was for me to learn and get hands-on experience with the different tools and technologies involved in building an asynchronous PDF parsing pipeline. Here's a summary of what I learned.