A Python proof-of-concept (POC) project for extracting required text from PDF and image files, with support for image and PDF pre-processing.
- Extract text from PDF files
- Extract images from PDF files
- Uses OpenCV, PyMuPDF, Pillow, and pytesseract
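
The features above could be sketched roughly as follows. This is a minimal illustration and not the code in main/main.py: the function names `image_name`, `extract_text_and_images`, and `ocr_image`, the output file naming scheme, and the choice of Otsu thresholding as the pre-processing step are all assumptions made for the example. It relies on PyMuPDF (`fitz`), OpenCV (`cv2`), and pytesseract being installed, and on the Tesseract binary being available on the system.

```python
from pathlib import Path


def image_name(page_number: int, image_index: int, ext: str) -> str:
    """Build a predictable file name for an image pulled from a PDF page.

    Hypothetical naming scheme, used only for this example.
    """
    return f"page{page_number:03d}_img{image_index:02d}.{ext}"


def extract_text_and_images(pdf_path: str, out_dir: str) -> str:
    """Return the text of every page and save embedded images to out_dir."""
    import fitz  # PyMuPDF; imported lazily so image_name works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    page_texts = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            # Plain-text extraction of the page's text layer.
            page_texts.append(page.get_text())
            for image_index, img in enumerate(page.get_images(full=True), start=1):
                xref = img[0]  # first tuple item is the image's xref number
                info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
                target = out / image_name(page_number, image_index, info["ext"])
                target.write_bytes(info["image"])
    return "\n".join(page_texts)


def ocr_image(image_path: str) -> str:
    """OCR an image after a simple grayscale + Otsu threshold pre-processing."""
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Otsu picks the threshold automatically; assumed here as one possible step.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)
```

Under these assumptions, `extract_text_and_images("samples/example.pdf", "output")` would return the PDF's text and write any embedded images to output/, where `example.pdf` stands in for one of your own files.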
Create and activate the virtual environment:
python3 -m venv myenv
source myenv/bin/activate
Install dependencies:
pip install -r requirements.txt
Place your PDF and image files in the samples/ directory (4 sample files are included for testing).
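
As an illustration of how a script might pick up the files placed in samples/, here is a small standard-library sketch. The helper name `collect_samples` and the list of image extensions are assumptions for this example; the actual script may discover its inputs differently.

```python
from pathlib import Path

# Assumed extension sets; adjust to whatever formats you actually use.
PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def collect_samples(samples_dir="samples"):
    """Return (pdf_paths, image_paths) for files directly inside samples_dir."""
    root = Path(samples_dir)
    pdfs, images = [], []
    if not root.is_dir():
        # A missing directory simply yields no inputs.
        return pdfs, images
    for path in sorted(root.iterdir()):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()  # compare case-insensitively
        if suffix in PDF_SUFFIXES:
            pdfs.append(path)
        elif suffix in IMAGE_SUFFIXES:
            images.append(path)
    return pdfs, images
```

Files with other extensions are ignored, so notes or scratch files left in the directory would not be passed to the extraction step.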
Run the main script:
python main/main.py

See requirements.txt for all dependencies.