fix: pdf form parsing #117

Open

hugohonda wants to merge 6 commits into main from fix/pdf-form-parsing

+137 −136

Contributor

hugohonda commented Sep 17, 2025 (edited)

Form elements were disappearing from SDK output despite being present in API responses. The SDK was unnecessarily splitting PDFs with the pypdf library, which strips interactive form elements during PDF manipulation.

This PR uses pymupdf instead of pypdf and fixes the minimum-pages split logic.

https://landingai.slack.com/archives/C07KNEGHWKA/p1757949372178239
https://app.asana.com/1/504311096896991/project/1206677697418483/task/1211376367674705?focus=true

hugohonda self-assigned this

github-actions bot commented Sep 17, 2025

❌ Integration tests failed. Please check the logs.


          fix: pdf form parsing

dad5441

fix mypy

update test to pymupdf + split size limit scenario

update test to pymupdf

update to use get file path function

update mypy

update mypy

update tests

hugohonda force-pushed the fix/pdf-form-parsing branch from 3a9789c to dad5441 Compare

September 17, 2025 22:06

hugohonda added 2 commits

September 17, 2025 19:33


          remove pypdf

15dc626


          fix mypy

c4cf3b0

hugohonda force-pushed the fix/pdf-form-parsing branch from d3a3f69 to c4cf3b0 Compare

September 18, 2025 15:39

github-actions bot commented Sep 18, 2025

❌ Integration tests failed. Please check the logs.


          revert poetry lock

f31b719

hugohonda force-pushed the fix/pdf-form-parsing branch from 94e100f to f31b719 Compare

September 18, 2025 15:59

github-actions bot commented Sep 18, 2025

❌ Integration tests failed. Please check the logs.

camiloaz approved these changes

camiloaz requested changes

Member

camiloaz left a comment

Left some comments after thinking more about it.

agentic_doc/parse.py Outdated

Comment on lines 582 to 612

    if total_pages <= split_size:
        # Process the PDF directly without splitting
        file_path = Path(file_path)
        return _parse_doc_parts(
            Document(
                file_path=file_path,
                start_page_idx=0,
                end_page_idx=total_pages - 1,
            ),
            include_marginalia=include_marginalia,
            include_metadata_in_markdown=include_metadata_in_markdown,
            extraction_model=extraction_model,
            extraction_schema=extraction_schema,
            config=config,
        )
    # Split PDF using the already opened document
    with tempfile.TemporaryDirectory() as temp_dir:
        file_path = Path(file_path)
        parts = split_pdf(pdf_doc, temp_dir, split_size, file_stem=file_path.stem)
        part_results = _parse_doc_in_parallel(
            parts,
            doc_name=file_path.name,
            include_marginalia=include_marginalia,
            include_metadata_in_markdown=include_metadata_in_markdown,
            extraction_model=extraction_model,
            extraction_schema=extraction_schema,
            config=config,
        )
        split_type = (
            config.split if config and config.split is not None else SplitType.full

Member

camiloaz Oct 7, 2025

Instead of changing this function, I would change only the split_pdf function to check the page count and return a list with a single part containing the unmodified document. That seems less risky and easier to maintain to me, because you are repeating some logic here.
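A rough sketch of that idea, to make the suggestion concrete. This is not the SDK's actual `split_pdf`: `Part`, `split_with_short_circuit`, and the `write_part` callback are hypothetical names used only for illustration, and the real function takes an already opened PDF document rather than a page count.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Part:
    # Hypothetical stand-in for the SDK's Document model.
    file_path: Path
    start_page_idx: int
    end_page_idx: int


def split_with_short_circuit(
    file_path: Path,
    total_pages: int,
    split_size: int,
    write_part: Callable[[int, int], Path],
) -> List[Part]:
    """Return page-range parts, leaving short documents untouched."""
    if split_size < 1:
        raise ValueError("split_size must be at least 1")
    if total_pages <= split_size:
        # No rewrite: the single part points at the original file,
        # so the PDF is never passed through a splitting library.
        return [Part(file_path, 0, total_pages - 1)]
    parts = []
    for start in range(0, total_pages, split_size):
        end = min(start + split_size, total_pages) - 1
        # write_part is a hypothetical callback that saves pages
        # start..end to a new file and returns that file's path.
        parts.append(Part(write_part(start, end), start, end))
    return parts
```

With this shape the caller keeps a single code path: it always receives a list of parts and hands it to the parallel parser, whether or not a split happened.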

Contributor Author

hugohonda Oct 8, 2025

Agreed! Thanks

hugohonda added 2 commits

October 8, 2025 08:19


          skip split if the doc total pages is smaller than split requested

6bd611f


          update tests

0b665e6
