PDF Extract

A Python proof-of-concept (POC) project for extracting the text you need from PDF and image files, with support for pre-processing images and PDFs before extraction.

What it does

Extract text from PDF files
Extract images from PDF files
Uses OpenCV, PyMuPDF, Pillow, and pytesseract
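
The two extraction features above could be sketched roughly as follows with PyMuPDF (imported as `fitz`). This is a minimal illustration, not the code in `main/main.py`: the function names `extract_text`, `extract_images`, and `image_filename`, as well as the output naming scheme, are assumptions made for this example.

```python
from pathlib import Path


def image_filename(page_number: int, xref: int, ext: str) -> str:
    """Build a stable file name for an image found on a given page (hypothetical scheme)."""
    return f"page{page_number:03d}_xref{xref}.{ext}"


def extract_text(pdf_path: str) -> str:
    """Return the embedded text of every page, separated by blank lines."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    with fitz.open(pdf_path) as doc:
        return "\n\n".join(page.get_text() for page in doc)


def extract_images(pdf_path: str, out_dir: str) -> list[Path]:
    """Save every embedded image to out_dir and return the written paths."""
    import fitz  # PyMuPDF

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for img in page.get_images(full=True):
                xref = img[0]  # first tuple item is the image's xref number
                info = doc.extract_image(xref)  # dict with "image" bytes and "ext"
                target = out / image_filename(page.number + 1, xref, info["ext"])
                target.write_bytes(info["image"])
                written.append(target)
    return written
```

Note that `page.get_text()` only returns text that is embedded in the PDF. For scanned pages it will typically come back empty, which is presumably where the OpenCV pre-processing and pytesseract OCR come in.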

Setup

Create and activate a virtual environment:

python3 -m venv myenv
source myenv/bin/activate

Install dependencies:
```
pip install -r requirements.txt
```

Usage

Place your PDF and image files in the samples/ directory (four sample files are included for testing).

Run the main script:

python main/main.py
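
As a rough idea of how a script might pick up the files placed in samples/, the sketch below groups them into PDFs and images by file extension. It uses only the standard library. The names `classify` and `collect_samples` and the list of image extensions are assumptions for illustration; the actual script may discover its inputs differently.

```python
from pathlib import Path

PDF_SUFFIXES = {".pdf"}
# Assumed set of image types; adjust to whatever the project actually accepts.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def classify(path: Path) -> str:
    """Return "pdf", "image", or "other" based on the file extension."""
    suffix = path.suffix.lower()
    if suffix in PDF_SUFFIXES:
        return "pdf"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    return "other"


def collect_samples(samples_dir: str = "samples") -> dict[str, list[Path]]:
    """Group the files directly inside samples_dir by kind."""
    groups = {"pdf": [], "image": [], "other": []}
    root = Path(samples_dir)
    if not root.is_dir():
        return groups  # a missing directory simply yields empty groups
    for path in sorted(root.iterdir()):
        if path.is_file():
            groups[classify(path)].append(path)
    return groups
```

Calling `collect_samples()` from the repository root returns a dictionary whose `"pdf"` and `"image"` lists can then be handed to the matching extraction routine.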

Requirements

See requirements.txt for all dependencies.
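
The libraries this project uses are imported under names that differ from their pip package names, which can make a failed install confusing to diagnose. The sketch below checks whether each module can be found in the active environment. The pip names in `IMPORT_NAMES` are assumptions (for example, requirements.txt may pin `opencv-python-headless` instead of `opencv-python`), and `missing_packages` is a hypothetical helper, not part of the repository.

```python
import shutil
from importlib.util import find_spec

# Assumed pip distribution name -> module name used in an import statement.
IMPORT_NAMES = {
    "opencv-python": "cv2",
    "PyMuPDF": "fitz",
    "Pillow": "PIL",
    "pytesseract": "pytesseract",
}


def missing_packages(import_names: dict[str, str] = IMPORT_NAMES) -> list[str]:
    """Return the pip names whose modules cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module in import_names.items()
        if find_spec(module) is None
    ]


def tesseract_on_path() -> bool:
    """pytesseract is a wrapper; it also needs the tesseract executable itself."""
    return shutil.which("tesseract") is not None
```

An empty list from `missing_packages()` together with `True` from `tesseract_on_path()` suggests the environment is ready. The Tesseract OCR engine is installed separately from the Python packages, typically through the operating system's package manager.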
