This project is a solution for the Adobe Hackathon, designed to extract structured information from PDF documents and rank sections based on semantic relevance to a user's query.
The solution is divided into two rounds, with Round 1B building on the output of Round 1A.
The primary goal of Round 1A is to extract a structured outline (Title and Headings) from a PDF document. The approach is based on a set of heuristics, since relying on font size alone is not reliable.
- Text Block Extraction: The process begins by using the `PyMuPDF` (`fitz`) library to extract all text blocks from the PDF. Crucially, this extraction includes metadata for each text span, such as its font size, font name, and page number.
- Body Text Identification: To find headings, we first need a baseline for what constitutes normal body text. The script analyzes all extracted text blocks to find the most common (mode) font size and font name. This combination is assumed to be the standard style for paragraph text.
- Heading Detection Heuristics: Any text block that deviates from the body text style is considered a potential heading. The following rules are applied:
  - Font Size: the text is larger than the body text font size.
  - Font Weight: the font name contains "Bold" (a common indicator).
  - Line Length: the line is short (headings are typically not long sentences).
  - Punctuation: the line does not end with a period.
- Title and Heading Classification: The document's title is identified as the text with the largest font size on the first page. The remaining headings are then classified into levels (H1, H2, H3, and so on) by sorting their unique font sizes in descending order. A sketch of this pipeline appears after this list.
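The following is a minimal sketch of the Round 1A pipeline. The block/line/span structure comes from PyMuPDF's `page.get_text("dict")` API; the function names, the 80-character length cutoff, and the decision to require all four heading rules at once are illustrative assumptions rather than the project's exact code.

```python
import fitz  # PyMuPDF
from collections import Counter

def extract_spans(pdf_path):
    """Collect every text span with its font size, font name, and page number."""
    spans = []
    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if text:
                            spans.append({"text": text,
                                          "size": round(span["size"], 1),
                                          "font": span["font"],
                                          "page": page_num})
    return spans

def extract_outline(pdf_path, max_heading_len=80):  # length cutoff is an assumption
    spans = extract_spans(pdf_path)
    # Body text style = the most common (size, font) combination in the document.
    (body_size, _body_font), _ = Counter(
        (s["size"], s["font"]) for s in spans).most_common(1)[0]
    # Heading heuristics: larger than body text, bold, short, no trailing period.
    headings = [s for s in spans
                if s["size"] > body_size
                and "Bold" in s["font"]
                and len(s["text"]) <= max_heading_len
                and not s["text"].endswith(".")]
    # Title = the largest-font text on the first page.
    title = max((s for s in spans if s["page"] == 1), key=lambda s: s["size"])["text"]
    # Map unique heading sizes, descending, to H1, H2, H3, ...
    levels = {size: f"H{i + 1}" for i, size in enumerate(
        sorted({h["size"] for h in headings}, reverse=True))}
    return {"title": title,
            "outline": [{"level": levels[h["size"]],
                         "text": h["text"],
                         "page": h["page"]} for h in headings]}
```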
Round 1B adds a layer of semantic understanding to rank document sections based on their relevance to a user's query.
- Section Grouping: A "section" is defined as a heading plus all the text content that follows it, up to the next heading. A function groups the raw text blocks from the PDF into these structured sections.
- Semantic Embeddings: The `Sentence-Transformers` library is used with the `all-MiniLM-L6-v2` model. This lightweight (~86 MB) but powerful model converts both the user's query and the text content of each section into numerical vectors (embeddings).
- Cosine Similarity: To measure relevance, the cosine similarity between the user's query vector and each section's vector is calculated. A higher score (closer to 1.0) indicates a stronger semantic match.
- Ranking: The sections are then sorted in descending order by cosine similarity score to produce the final importance ranking. A sketch of the grouping and ranking steps follows the list.
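Below is a minimal sketch of these two steps, reusing the span and heading structures from the Round 1A sketch above. `SentenceTransformer` and `util.cos_sim` are real Sentence-Transformers APIs; the function names and the section dictionary shape are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

def group_sections(spans, headings):
    """Group spans into sections: a heading plus all text up to the next heading."""
    heading_keys = {(h["page"], h["text"]) for h in headings}
    sections, current = [], None
    for span in spans:  # spans are assumed to be in reading order
        if (span["page"], span["text"]) in heading_keys:
            current = {"heading": span["text"], "page": span["page"], "text": ""}
            sections.append(current)
        elif current is not None:
            current["text"] += span["text"] + " "
    return sections

def rank_sections(query, sections):
    """Rank sections by cosine similarity between the query and each section's text."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    query_emb = model.encode(query, convert_to_tensor=True)
    section_embs = model.encode([s["text"] for s in sections], convert_to_tensor=True)
    scores = util.cos_sim(query_emb, section_embs)[0]  # one score per section
    ranked = sorted(zip(sections, scores.tolist()), key=lambda p: p[1], reverse=True)
    return [{"heading": s["heading"], "page": s["page"],
             "score": round(score, 4), "importance_rank": i + 1}
            for i, (s, score) in enumerate(ranked)]
```

Encoding all section texts in a single `model.encode` call lets the library batch them internally, which helps when the input collection spans many PDFs.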
- Programming Language: Python 3.9
- PDF Parsing: `PyMuPDF` (`fitz`)
- NLP / Semantic Model: `Sentence-Transformers` with the `all-MiniLM-L6-v2` model
- Core Libraries: `torch` (required by Sentence-Transformers)
- Docker Desktop installed and running.
- A sample PDF file placed in the `input/` directory.
Navigate to the project's root directory in your terminal and build the image:

```bash
docker build --platform linux/amd64 -t adobe-solution:latest .
```

To generate the JSON outline, run the container with the input and output directories mounted:

```bash
docker run --rm -v $(pwd)/input:/app/input -v $(pwd)/output:/app/output --network none adobe-solution:latest
```

This will process all PDFs in `input/` and create `*_outline.json` files in the `output/` directory.
To run the semantic ranking, pass the persona and job-to-be-done as environment variables (`ADOBE_PERSONA` and `ADOBE_JOB`). The script will process all PDFs in the `input/` directory as a single collection.

```bash
docker run --rm \
  -v $(pwd)/input:/app/input -v $(pwd)/output:/app/output \
  -e ADOBE_PERSONA="Your Persona Here" \
  -e ADOBE_JOB="Your Job-to-be-Done Here" \
  --network none adobe-solution:latest
```

Replace the placeholder text with the actual persona and job. This will create a single `round_1b_analysis.json` file in the `output/` directory.