Free AI-powered toolkit for prompt engineering, PDF text extraction, and image processing. Combines Python's versatility with Go's performance for seamless document analysis and content generation.

Dsouza10082/documentOCRProcessor

Go OCR Text Extractor

A powerful and free OCR (Optical Character Recognition) solution for the Go community that bridges the gap between Go's efficiency and Python's rich AI ecosystem.

🎯 Why This Project Exists

The OCR market is flooded with expensive, proprietary solutions that often come with limitations:

  • Costly licensing fees for commercial OCR APIs
  • Limited language support in many solutions
  • Vendor lock-in with cloud-based services
  • Complex integration requiring specialized knowledge
  • Poor accuracy on varied document types
  • No local processing options for sensitive documents

This project addresses these pain points by providing a completely free, locally-run OCR solution that leverages the power of established Python AI libraries while maintaining Go's performance and simplicity.

🌟 The Python Advantage

Python has already solved many complex problems in the AI and machine learning space with mature, battle-tested libraries. Rather than reinventing the wheel in Go, this project creates a bridge that allows Go developers to harness these powerful Python capabilities:

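  • Tesseract OCR - A widely used open-source OCR engine, originally developed at HP and sponsored by Google for many years
  • Pillow (the maintained fork of PIL, the Python Imaging Library) - Robust image processing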
  • PyPDF2 & pdfplumber - Comprehensive PDF text extraction
  • Extensive language support - Over 100 languages supported by Tesseract

🚀 Features

  • PDF Text Extraction: Extract text from PDF documents using multiple extraction methods
  • Image OCR: Convert images to text with high accuracy
  • Multilingual Support: Supports Portuguese, English, and 100+ other languages
  • Caching System: Built-in memory cache to avoid reprocessing the same files
  • Fallback Mechanisms: Multiple extraction methods ensure maximum compatibility
  • Thread-Safe: Concurrent processing with mutex protection
  • Error Handling: Comprehensive error reporting and recovery
  • Free & Open Source: No licensing fees or API limits
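The caching and thread-safety points above can be illustrated with a minimal sketch of a mutex-protected in-memory cache. The names `ResultCache` and `GetOrExtract` are hypothetical and are not taken from this project's source; the real cache may be keyed or structured differently.

```go
package main

import (
	"fmt"
	"sync"
)

// ResultCache is a hypothetical in-memory cache keyed by file path.
type ResultCache struct {
	mu      sync.RWMutex
	entries map[string]string
}

// NewResultCache returns an empty cache ready for concurrent use.
func NewResultCache() *ResultCache {
	return &ResultCache{entries: make(map[string]string)}
}

// Get returns the cached text for path, if present.
func (c *ResultCache) Get(path string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	text, ok := c.entries[path]
	return text, ok
}

// Set stores text under path, replacing any previous entry.
func (c *ResultCache) Set(path, text string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[path] = text
}

// GetOrExtract returns the cached text, or runs extract and caches its result.
// Two goroutines that miss at the same moment may both run extract; the map
// itself stays consistent because every access holds the lock.
func (c *ResultCache) GetOrExtract(path string, extract func(string) (string, error)) (string, error) {
	if text, ok := c.Get(path); ok {
		return text, nil
	}
	text, err := extract(path)
	if err != nil {
		return "", err
	}
	c.Set(path, text)
	return text, nil
}

func main() {
	cache := NewResultCache()
	calls := 0
	extract := func(path string) (string, error) {
		calls++ // stands in for an expensive OCR run
		return "text of " + path, nil
	}
	cache.GetOrExtract("document.pdf", extract)
	cache.GetOrExtract("document.pdf", extract)
	fmt.Println("extractor calls:", calls)
}
```

Keying by path alone is the simplest choice; if files can change on disk, a key that also includes the modification time or a content hash would avoid stale results.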

📋 Prerequisites

Python Dependencies

The project installs its Python dependencies automatically, but you can also install the packages manually:

pip install PyPDF2 pdfplumber pytesseract Pillow

Docker

macOS (Apple Silicon, arm64)

docker build --build-arg TARGETOS=darwin --build-arg TARGETARCH=arm64 -t ocr-processor:latest .

Windows

docker build --build-arg TARGETOS=windows --build-arg TARGETARCH=amd64 -t ocr-processor:latest .

Linux

docker build --build-arg TARGETOS=linux --build-arg TARGETARCH=amd64 -t ocr-processor:latest .

Linux (arm64)

docker build --build-arg TARGETOS=linux --build-arg TARGETARCH=arm64 -t ocr-processor:latest .

Tesseract OCR Installation

Windows

  1. Download the installer from the Tesseract releases page on GitHub
  2. Run the installer and follow the setup wizard
  3. Add Tesseract to your system PATH:
    • Default installation path: C:\Program Files\Tesseract-OCR
    • Add this path to your Windows PATH environment variable
  4. Restart your command prompt

macOS

Using Homebrew:

brew install tesseract

Using MacPorts:

sudo port install tesseract

Linux (Ubuntu/Debian)

sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-por  # For Portuguese language support

Linux (CentOS/RHEL/Fedora)

sudo yum install tesseract tesseract-langpack-por

or, on newer releases that use dnf:

sudo dnf install tesseract tesseract-langpack-por
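After installing, it can help to confirm from Go that Tesseract is on the PATH and that the language packs you need are present. The sketch below is an illustration only: `hasLanguage` is a hypothetical helper, and it assumes `tesseract --list-langs` prints one language code per line after a header line.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// hasLanguage reports whether lang appears on its own line in the output
// of `tesseract --list-langs`. The header line never equals a language code.
func hasLanguage(listLangsOutput, lang string) bool {
	for _, line := range strings.Split(listLangsOutput, "\n") {
		if strings.TrimSpace(line) == lang {
			return true
		}
	}
	return false
}

func main() {
	if _, err := exec.LookPath("tesseract"); err != nil {
		fmt.Println("tesseract not found on PATH:", err)
		return
	}
	// CombinedOutput is used because some versions write the list to stderr.
	out, err := exec.Command("tesseract", "--list-langs").CombinedOutput()
	if err != nil {
		fmt.Println("could not list languages:", err)
		return
	}
	for _, lang := range []string{"por", "eng"} {
		fmt.Printf("%s installed: %v\n", lang, hasLanguage(string(out), lang))
	}
}
```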

🏗️ Architecture

Core Components

PythonExecutor

The main orchestrator that manages Python script execution and handles:

  • Python environment detection
  • Dependency management
  • Script execution
  • Result parsing
  • Caching management
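As a rough illustration of the environment detection step, the sketch below looks for a Python interpreter on the PATH and builds the command that would run a script. `findPython` and the script name `extract_pdf.py` are hypothetical; the actual PythonExecutor may detect and invoke Python differently.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// findPython returns the resolved path of the first candidate interpreter
// that lookPath can find. lookPath is injected so the logic can be tested
// without a real Python installation.
func findPython(lookPath func(string) (string, error), candidates ...string) (string, error) {
	for _, name := range candidates {
		if path, err := lookPath(name); err == nil {
			return path, nil
		}
	}
	return "", errors.New("no Python interpreter found on PATH")
}

func main() {
	path, err := findPython(exec.LookPath, "python3", "python")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Build (but do not run) the command; the script name is a placeholder.
	cmd := exec.Command(path, "extract_pdf.py", "document.pdf")
	fmt.Println("would run:", cmd.Args)
}
```

Trying `python3` before `python` is a common convention, since on some systems `python` is missing or still points to Python 2.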

PDF Text Extraction

  • Primary Method: pdfplumber (more accurate)
  • Fallback Method: PyPDF2 (broader compatibility)
  • Automatic Selection: Chooses the best method based on document type
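The primary/fallback arrangement can be sketched as an ordered chain of extractors. This is an assumption about the control flow, not the project's actual code: the `extractor` type and `extractWithFallback` are hypothetical, and the stubs in `main` stand in for the pdfplumber and PyPDF2 scripts.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// extractor pairs a method name with a function that extracts text.
type extractor struct {
	name string
	run  func(path string) (string, error)
}

// extractWithFallback tries each extractor in order and returns the first
// non-empty result together with the name of the method that produced it.
func extractWithFallback(path string, chain []extractor) (text, used string, err error) {
	var failures []string
	for _, e := range chain {
		out, runErr := e.run(path)
		if runErr == nil && strings.TrimSpace(out) != "" {
			return out, e.name, nil
		}
		if runErr == nil {
			runErr = errors.New("empty text")
		}
		failures = append(failures, e.name+": "+runErr.Error())
	}
	return "", "", fmt.Errorf("all extractors failed for %s: %s", path, strings.Join(failures, "; "))
}

func main() {
	chain := []extractor{
		{"pdfplumber", func(string) (string, error) { return "", errors.New("simulated failure") }},
		{"PyPDF2", func(string) (string, error) { return "Hello from page 1", nil }},
	}
	text, used, err := extractWithFallback("document.pdf", chain)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("method=%s text=%q\n", used, text)
}
```

Treating empty output as a failure matters for scanned PDFs, where a text extractor can succeed technically yet return nothing useful.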

Image OCR Processing

  • Language Detection: Attempts Portuguese first, falls back to English
  • Image Preprocessing: Automatic RGB conversion
  • Format Support: PNG, JPG, JPEG, TIFF, BMP, GIF
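A minimal sketch of the format check and the Portuguese-then-English language order follows. The helper names are hypothetical, the `ocr` callback stands in for the real pytesseract call, and `por` and `eng` are the Tesseract language codes for Portuguese and English.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// supportedImageExts mirrors the formats listed above.
var supportedImageExts = map[string]bool{
	".png": true, ".jpg": true, ".jpeg": true,
	".tiff": true, ".bmp": true, ".gif": true,
}

// isSupportedImage checks the file extension case-insensitively.
func isSupportedImage(path string) bool {
	return supportedImageExts[strings.ToLower(filepath.Ext(path))]
}

// ocrWithLanguages tries each language code in order and returns the text
// plus the language that succeeded.
func ocrWithLanguages(path string, langs []string, ocr func(path, lang string) (string, error)) (string, string, error) {
	if !isSupportedImage(path) {
		return "", "", fmt.Errorf("unsupported image format: %q", filepath.Ext(path))
	}
	if len(langs) == 0 {
		return "", "", fmt.Errorf("no languages given")
	}
	var lastErr error
	for _, lang := range langs {
		text, err := ocr(path, lang)
		if err == nil {
			return text, lang, nil
		}
		lastErr = err
	}
	return "", "", fmt.Errorf("OCR failed for all languages: %w", lastErr)
}

func main() {
	// Stub OCR: pretend the Portuguese language pack is not installed.
	stub := func(path, lang string) (string, error) {
		if lang == "por" {
			return "", fmt.Errorf("language pack %q not installed", lang)
		}
		return "recognized text", nil
	}
	text, lang, err := ocrWithLanguages("scanned_document.PNG", []string{"por", "eng"}, stub)
	fmt.Println(text, lang, err)
}
```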

Prompt System

  • LangChain Integration: Structured prompt templates for AI processing
  • Flexible Configuration: Customizable prompt parameters
  • JSON Output: Structured response format
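To show the idea of a parameterized prompt that asks for structured JSON, here is a sketch built on the standard library's `text/template` rather than LangChain Go. The template wording, `promptParams`, and `buildPrompt` are all assumptions for illustration; the project's own prompt text and API are not shown in this README.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// ocrPrompt is a hypothetical template; the real prompt wording may differ.
var ocrPrompt = template.Must(template.New("ocr").Parse(
	"You are a document analyst. Summarize the text below in {{.Language}}.\n" +
		"Respond with a JSON object containing the keys \"title\" and \"summary\".\n\n" +
		"Document text:\n{{.Text}}\n"))

// promptParams holds the customizable parts of the prompt.
type promptParams struct {
	Language string
	Text     string
}

// buildPrompt renders the template with the given parameters.
func buildPrompt(p promptParams) (string, error) {
	var sb strings.Builder
	if err := ocrPrompt.Execute(&sb, p); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	prompt, err := buildPrompt(promptParams{Language: "English", Text: "Invoice 123, total due 40.00"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(prompt)
}
```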

Result Structures

type PDFTextResult struct {
    Success  bool   `json:"success"`
    Text     string `json:"text"`
    Error    string `json:"error"`
    Pages    int    `json:"pages"`
    Filename string `json:"filename"`
}

type ImageTextResult struct {
    Success  bool   `json:"success"`
    Text     string `json:"text"`
    Error    string `json:"error"`
    Pages    int    `json:"pages"`
    Filename string `json:"filename"`
}
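The `json` tags suggest that the Python side prints a JSON object which Go then decodes. A minimal sketch of that decoding step is shown below; `parsePDFResult` and the sample payload are hypothetical, and only the struct definition is taken from this README.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PDFTextResult matches the structure documented above.
type PDFTextResult struct {
	Success  bool   `json:"success"`
	Text     string `json:"text"`
	Error    string `json:"error"`
	Pages    int    `json:"pages"`
	Filename string `json:"filename"`
}

// parsePDFResult decodes the JSON a script wrote to stdout. It reports only
// decoding problems; callers still check Success, as in the usage example.
func parsePDFResult(raw []byte) (*PDFTextResult, error) {
	var r PDFTextResult
	if err := json.Unmarshal(raw, &r); err != nil {
		return nil, fmt.Errorf("decoding script output: %w", err)
	}
	return &r, nil
}

func main() {
	// Assumed shape of the script output, for illustration only.
	raw := []byte(`{"success": true, "text": "Hello", "error": "", "pages": 2, "filename": "document.pdf"}`)
	res, err := parsePDFResult(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("success=%v pages=%d chars=%d\n", res.Success, res.Pages, len(res.Text))
}
```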

🔧 Usage Example

package main

import (
    "fmt"
    "log"
    "your-project/model"
)

func main() {
    // Initialize the Python executor
    executor := model.NewPythonExecutor()
    
    // Check and install PDF dependencies
    if err := executor.CheckPythonDependenciesForPDF(); err != nil {
        log.Fatal(err)
    }
    
    // Extract text from PDF
    result, err := executor.ExtractPDFText("document.pdf")
    if err != nil {
        log.Fatal(err)
    }
    
    if result.Success {
        fmt.Printf("Extracted %d characters from %d pages\n", 
                   len(result.Text), result.Pages)
        fmt.Println(result.Text)
    } else {
        fmt.Printf("Error: %s\n", result.Error)
    }
    
    // Extract text from image
    imageResult, err := executor.ExtractImageText("scanned_document.png")
    if err != nil {
        log.Fatal(err)
    }
    
    if imageResult.Success {
        fmt.Printf("OCR Result: %s\n", imageResult.Text)
    }
    
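    // Use with AI prompts
    promptInstance := model.NewPromptOCRInstance()
    prompt := promptInstance.GetPrompt(result.Text)
    // Pass prompt to your preferred LLM. The blank assignment keeps this
    // example compiling until you do, since Go rejects unused variables.
    _ = prompt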
}

🤝 Contributing to the Go AI Community

This project represents a commitment to advancing AI capabilities within the Go ecosystem. By providing free access to powerful OCR functionality, we aim to:

  • Lower barriers to entry for developers interested in document processing
  • Accelerate innovation in Go-based AI applications
  • Foster collaboration between Go and Python communities
  • Enable experimentation without cost constraints
  • Support education and research initiatives

📚 Dependencies & Credits

This project stands on the shoulders of giants. We extend our gratitude to the following projects and their maintainers:

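Python Libraries

  • pytesseract - Python wrapper for the Tesseract OCR engine
  • Pillow - Image processing
  • PyPDF2 - PDF text extraction
  • pdfplumber - PDF text extraction
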
Go Libraries

  • LangChain Go - Go implementation of LangChain for LLM integration

🔍 Error Handling

The system provides comprehensive error handling:

  • Dependency Check: Automatic verification and installation of Python packages
  • File Validation: Format and existence verification
  • Graceful Fallbacks: Multiple extraction methods for maximum compatibility
  • Detailed Logging: Clear error messages and debugging information
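The file validation step could look roughly like the sketch below, which checks that the input exists, is a regular file, and is not empty before any Python process is started. `validateInput` is a hypothetical name and the exact checks the project performs may differ.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// validateInput returns a descriptive error if path cannot be processed.
func validateInput(path string) error {
	info, err := os.Stat(path)
	if errors.Is(err, fs.ErrNotExist) {
		return fmt.Errorf("file not found: %s", path)
	}
	if err != nil {
		return fmt.Errorf("cannot access %s: %w", path, err)
	}
	if info.IsDir() {
		return fmt.Errorf("%s is a directory, not a file", path)
	}
	if info.Size() == 0 {
		return fmt.Errorf("%s is empty", path)
	}
	return nil
}

func main() {
	for _, p := range []string{"document.pdf", "."} {
		if err := validateInput(p); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("ok:", p)
	}
}
```

Failing early in Go keeps error messages specific and avoids paying the cost of launching an interpreter for input that cannot succeed.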

🎯 Future Enhancements

  • Support for more document formats (DOCX, RTF, etc.)
  • Batch processing capabilities
  • Configuration file support
  • Docker containerization
  • REST API wrapper
  • Performance benchmarking tools

📄 License

This project is open-source and free to use. Please refer to the LICENSE file for details.

🤝 Contributing

We welcome contributions from the community! Whether it's bug reports, feature requests, or code contributions, every effort helps make this tool better for everyone.


Built with ❤️ for the Go community. Empowering developers to build amazing AI applications without breaking the bank.
