rupesh-ps/pdf_extract

PDF Extract

A Python proof of concept (POC) for extracting required text from PDF and image files, with support for image and PDF pre-processing.

What it does

  • Extract text from PDF files
  • Extract images from PDF files
  • Uses OpenCV, PyMuPDF, Pillow, and pytesseract
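The two extraction features listed above could be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names (`needs_ocr`, `extract_pdf_text`, `extract_pdf_images`), the 20-character threshold, and the 300 DPI render setting are all assumptions. It reads the embedded text layer with PyMuPDF and falls back to OCR via pytesseract only when a page appears to be scanned. Note that pytesseract requires the Tesseract binary to be installed separately.

```python
import io
from pathlib import Path


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Hypothetical heuristic: a page with almost no embedded text is
    treated as a scan that needs OCR. The threshold is an assumption."""
    return len(text.strip()) < min_chars


def extract_pdf_text(pdf_path: str) -> list[str]:
    """Return one text string per page, using OCR only where needed."""
    # Third-party imports are local so needs_ocr() works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # Render the page to a PNG in memory and run OCR on it.
                pix = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return pages


def extract_pdf_images(pdf_path: str, out_dir: str) -> list[Path]:
    """Save every image embedded in the PDF and return the written paths."""
    import fitz  # PyMuPDF

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc, start=1):
            for img_index, img in enumerate(page.get_images(full=True), start=1):
                xref = img[0]  # first tuple entry is the image's xref number
                info = doc.extract_image(xref)
                target = out / f"page{page_index}_img{img_index}.{info['ext']}"
                target.write_bytes(info["image"])
                saved.append(target)
    return saved
```

Checking for a text layer first avoids running OCR on digitally generated PDFs, where the embedded text is usually more accurate than a recognition pass.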

Setup

  1. Create and activate the virtual environment:

    python3 -m venv myenv
    source myenv/bin/activate
  2. Install dependencies:

    pip install -r requirements.txt

Usage

Place your PDF and image files in the samples/ directory (4 sample files are included for testing).

Run the main script:

python main/main.py
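How main/main.py picks up the files is not described here. One way a script could gather inputs from samples/ is sketched below, using only the standard library. The function name `collect_samples` and the set of recognised image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed extension sets; adjust to whatever formats the project handles.
PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def collect_samples(samples_dir: str = "samples") -> tuple[list[Path], list[Path]]:
    """Split the files in samples_dir into PDFs and images by extension."""
    pdfs, images = [], []
    for path in sorted(Path(samples_dir).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories
        suffix = path.suffix.lower()  # match .PDF and .pdf alike
        if suffix in PDF_SUFFIXES:
            pdfs.append(path)
        elif suffix in IMAGE_SUFFIXES:
            images.append(path)
    return pdfs, images
```

Files with any other extension are ignored, so notes or output files left in the directory would not be passed to the extractor.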

Requirements

See requirements.txt for all dependencies.
