This project is a solution for the Adobe Hackathon, designed to extract structured information from PDF documents and rank sections based on semantic relevance to a user's query.
The solution is divided into two rounds, with Round 1B building on the output of Round 1A.
The primary goal of Round 1A is to extract a structured outline (Title and Headings) from a PDF document. The approach is based on a set of heuristics, since relying on font size alone is not reliable.
- Text Block Extraction: The process begins by using the `PyMuPDF` (`fitz`) library to extract all text blocks from the PDF. Crucially, this extraction includes metadata for each text span, such as its font size, font name, and page number.
- Body Text Identification: To find headings, we first need a baseline for what constitutes normal body text. The script analyzes all extracted text blocks to find the most common (mode) font size and font name. This combination is assumed to be the standard style for paragraph text.
- Heading Detection Heuristics: Any text block that deviates from the body text style is considered a potential heading. The following rules are applied:
  - Font Size: the text is larger than the body text font size.
  - Font Weight: the font name contains "Bold" (a common indicator).
  - Line Length: the line is short (headings are typically not long sentences).
  - Punctuation: the line does not end with a period.
- Title and Heading Classification: The document's title is identified as the text with the largest font size on the first page. The remaining headings are then classified into levels (H1, H2, H3, and so on) by sorting their unique font sizes in descending order. A sketch of this pipeline appears after this list.
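The following is a minimal sketch of the Round 1A pipeline. The block/line/span structure comes from PyMuPDF's `page.get_text("dict")` API; the function names, the 80-character length cutoff, and the decision to require all four heading rules at once are illustrative assumptions rather than the project's exact code.

```python
import fitz  # PyMuPDF
from collections import Counter

def extract_spans(pdf_path):
    """Collect every text span with its font size, font name, and page number."""
    spans = []
    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if text:
                            spans.append({"text": text,
                                          "size": round(span["size"], 1),
                                          "font": span["font"],
                                          "page": page_num})
    return spans

def extract_outline(pdf_path, max_heading_len=80):  # length cutoff is an assumption
    spans = extract_spans(pdf_path)
    # Body text style = the most common (size, font) combination in the document.
    (body_size, _body_font), _ = Counter(
        (s["size"], s["font"]) for s in spans).most_common(1)[0]
    # Heading heuristics: larger than body text, bold, short, no trailing period.
    headings = [s for s in spans
                if s["size"] > body_size
                and "Bold" in s["font"]
                and len(s["text"]) <= max_heading_len
                and not s["text"].endswith(".")]
    # Title = the largest-font text on the first page.
    title = max((s for s in spans if s["page"] == 1), key=lambda s: s["size"])["text"]
    # Map unique heading sizes, descending, to H1, H2, H3, ...
    levels = {size: f"H{i + 1}" for i, size in enumerate(
        sorted({h["size"] for h in headings}, reverse=True))}
    return {"title": title,
            "outline": [{"level": levels[h["size"]],
                         "text": h["text"],
                         "page": h["page"]} for h in headings]}
```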
Round 1B adds a layer of semantic understanding to rank document sections based on their relevance to a user's query.
- Section Grouping: A "section" is defined as a heading plus all the text content that follows it, up to the next heading. A function groups the raw text blocks from the PDF into these structured sections.
- Semantic Embeddings: The `Sentence-Transformers` library is used with the `all-MiniLM-L6-v2` model. This lightweight (~86 MB) but powerful model converts both the user's query and the text content of each section into numerical vectors (embeddings).
- Cosine Similarity: To measure relevance, the cosine similarity between the user's query vector and each section's vector is calculated. A higher score (closer to 1.0) indicates a stronger semantic match.
- Ranking: The sections are then sorted in descending order by cosine similarity score to produce the final importance ranking. A sketch of the grouping and ranking steps follows the list.
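Below is a minimal sketch of these two steps, reusing the span and heading structures from the Round 1A sketch above. `SentenceTransformer` and `util.cos_sim` are real Sentence-Transformers APIs; the function names and the section dictionary shape are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

def group_sections(spans, headings):
    """Group spans into sections: a heading plus all text up to the next heading."""
    heading_keys = {(h["page"], h["text"]) for h in headings}
    sections, current = [], None
    for span in spans:  # spans are assumed to be in reading order
        if (span["page"], span["text"]) in heading_keys:
            current = {"heading": span["text"], "page": span["page"], "text": ""}
            sections.append(current)
        elif current is not None:
            current["text"] += span["text"] + " "
    return sections

def rank_sections(query, sections):
    """Rank sections by cosine similarity between the query and each section's text."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    query_emb = model.encode(query, convert_to_tensor=True)
    section_embs = model.encode([s["text"] for s in sections], convert_to_tensor=True)
    scores = util.cos_sim(query_emb, section_embs)[0]  # one score per section
    ranked = sorted(zip(sections, scores.tolist()), key=lambda p: p[1], reverse=True)
    return [{"heading": s["heading"], "page": s["page"],
             "score": round(score, 4), "importance_rank": i + 1}
            for i, (s, score) in enumerate(ranked)]
```

Encoding all section texts in a single `model.encode` call lets the library batch them internally, which helps when the input collection spans many PDFs.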
- Programming Language: Python 3.9
- PDF Parsing: `PyMuPDF` (`fitz`)
- NLP / Semantic Model: `Sentence-Transformers` with the `all-MiniLM-L6-v2` model
- Core Libraries: `torch` (required by Sentence-Transformers)
- Docker Desktop installed and running.
- A sample PDF file placed in the `input/` directory.
Navigate to the project's root directory in your terminal and build the image:

```bash
docker build --platform linux/amd64 -t adobe-solution:latest .
```

To generate the JSON outline, run the container with the input and output directories mounted:

```bash
docker run --rm -v $(pwd)/input:/app/input -v $(pwd)/output:/app/output --network none adobe-solution:latest
```

This will process all PDFs in `input/` and create `*_outline.json` files in the `output/` directory.
To run the semantic ranking, pass the persona and job-to-be-done as environment variables (`ADOBE_PERSONA` and `ADOBE_JOB`). The script will process all PDFs in the `input/` directory as a single collection.

```bash
docker run --rm \
  -v $(pwd)/input:/app/input -v $(pwd)/output:/app/output \
  -e ADOBE_PERSONA="Your Persona Here" \
  -e ADOBE_JOB="Your Job-to-be-Done Here" \
  --network none adobe-solution:latest
```

Replace the placeholder text with the actual persona and job. This will create a single `round_1b_analysis.json` file in the `output/` directory.