@hugohonda hugohonda commented Sep 17, 2025

Form elements were disappearing from the SDK output even though they were present in the API responses. The SDK was splitting PDFs with the pypdf library even when no split was needed, and pypdf strips interactive form elements when it manipulates a PDF.

This PR switches from pypdf to pymupdf and fixes the minimum-page split logic.
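
A minimal sketch of the page-range arithmetic behind the split, assuming 0-based inclusive page indices. `page_ranges` is a hypothetical helper written for illustration, not a function from the SDK:

```python
from typing import List, Tuple


def page_ranges(total_pages: int, split_size: int) -> List[Tuple[int, int]]:
    """Return 0-based inclusive (start, end) ranges of at most split_size pages."""
    if split_size < 1:
        raise ValueError("split_size must be at least 1")
    if total_pages < 1:
        return []
    if total_pages <= split_size:
        # Short document: one range covering every page, so no split is needed.
        return [(0, total_pages - 1)]
    return [
        (start, min(start + split_size, total_pages) - 1)
        for start in range(0, total_pages, split_size)
    ]
```

For example, `page_ranges(25, 10)` gives `[(0, 9), (10, 19), (20, 24)]`, while `page_ranges(10, 10)` gives the single range `[(0, 9)]`.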

https://landingai.slack.com/archives/C07KNEGHWKA/p1757949372178239
https://app.asana.com/1/504311096896991/project/1206677697418483/task/1211376367674705?focus=true

@hugohonda hugohonda self-assigned this Sep 17, 2025
@github-actions

❌ Integration tests failed. Please check the logs.

3 similar comments

fix mypy

update test to pymupdf + split size limit scenario

update test to pymupdf

update to use get file path function

update mypy

update mypy

update tests
@github-actions

❌ Integration tests failed. Please check the logs.


@camiloaz camiloaz left a comment

Left some comments after thinking more about it.

Comment on lines 582 to 612
if total_pages <= split_size:
    # Process the PDF directly without splitting
    file_path = Path(file_path)
    return _parse_doc_parts(
        Document(
            file_path=file_path,
            start_page_idx=0,
            end_page_idx=total_pages - 1,
        ),
        include_marginalia=include_marginalia,
        include_metadata_in_markdown=include_metadata_in_markdown,
        extraction_model=extraction_model,
        extraction_schema=extraction_schema,
        config=config,
    )

# Split PDF using the already opened document
with tempfile.TemporaryDirectory() as temp_dir:
    file_path = Path(file_path)
    parts = split_pdf(pdf_doc, temp_dir, split_size, file_stem=file_path.stem)
    part_results = _parse_doc_in_parallel(
        parts,
        doc_name=file_path.name,
        include_marginalia=include_marginalia,
        include_metadata_in_markdown=include_metadata_in_markdown,
        extraction_model=extraction_model,
        extraction_schema=extraction_schema,
        config=config,
    )
    split_type = (
        config.split if config and config.split is not None else SplitType.full
Member

Instead of changing this function, I would change only the split_pdf function: have it check the page count and, when no split is needed, return a list with a single part that points at the unmodified document. That seems less risky and easier to maintain to me, because you are repeating some logic here.
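
The suggestion could look roughly like the sketch below. It assumes a simplified signature: `Document` is a stand-in dataclass with the same three fields used in the excerpt above, and `split_into_files` is a hypothetical callable standing in for the existing pymupdf-based splitting. Neither is the SDK's actual code.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Document:
    # Stand-in for the SDK's Document; only the fields used in the excerpt.
    file_path: Path
    start_page_idx: int
    end_page_idx: int


def split_pdf_sketch(
    file_path: Path,
    total_pages: int,
    split_size: int,
    split_into_files: Callable[[Path, int], List[Document]],
) -> List[Document]:
    """Return the parts to parse, skipping the split when it is not needed."""
    if total_pages <= split_size:
        # Single part that points at the original file, so nothing is rewritten
        # and interactive form elements are left untouched.
        return [
            Document(
                file_path=file_path,
                start_page_idx=0,
                end_page_idx=total_pages - 1,
            )
        ]
    # Hypothetical hook: the real splitting (e.g. with pymupdf) would happen here.
    return split_into_files(file_path, split_size)
```

With this shape the caller would keep a single code path: it always hands the returned parts to _parse_doc_in_parallel, and the early-return branch in the excerpt would no longer be needed.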

Contributor Author

Agreed! Thanks
