fix: pdf form parsing #117
base: main
❌ Integration tests failed. Please check the logs.
fix mypy
update test to pymupdf + split size limit scenario
update test to pymupdf
update to use get file path function
update mypy
update mypy
update tests
3a9789c to dad5441
d3a3f69 to c4cf3b0
94e100f to f31b719
camiloaz left a comment
Left some comments after thinking more about it.
agentic_doc/parse.py
Outdated
if total_pages <= split_size:
    # Process the PDF directly without splitting
    file_path = Path(file_path)
    return _parse_doc_parts(
        Document(
            file_path=file_path,
            start_page_idx=0,
            end_page_idx=total_pages - 1,
        ),
        include_marginalia=include_marginalia,
        include_metadata_in_markdown=include_metadata_in_markdown,
        extraction_model=extraction_model,
        extraction_schema=extraction_schema,
        config=config,
    )

# Split PDF using the already opened document
with tempfile.TemporaryDirectory() as temp_dir:
    file_path = Path(file_path)
    parts = split_pdf(pdf_doc, temp_dir, split_size, file_stem=file_path.stem)
    part_results = _parse_doc_in_parallel(
        parts,
        doc_name=file_path.name,
        include_marginalia=include_marginalia,
        include_metadata_in_markdown=include_metadata_in_markdown,
        extraction_model=extraction_model,
        extraction_schema=extraction_schema,
        config=config,
    )
    split_type = (
        config.split if config and config.split is not None else SplitType.full
Instead of changing this function, I would change only the split_pdf function so that it checks the page count and, when the document fits within split_size, returns a list with a single part containing the unmodified document. That seems less risky and easier to maintain to me, because you are repeating some logic here.
Agreed! Thanks
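For illustration, the suggestion in this thread could look roughly like the sketch below. It is not the code that landed in the PR: `Part`, `split_pdf_sketch`, the `write_part` callback, and the output file naming are all hypothetical, and the signature differs from the real `split_pdf`. The callback stands in for whatever library call writes a page range to disk. The point is only that the short-circuit lives inside the splitter, so the caller keeps a single code path.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Part:
    """Hypothetical stand-in for the SDK's Document (same three fields as in the diff)."""

    file_path: Path
    start_page_idx: int
    end_page_idx: int  # inclusive, as in the diff (total_pages - 1)


def split_pdf_sketch(
    file_path: Path,
    total_pages: int,
    output_dir: Path,
    split_size: int,
    write_part: Callable[[Path, int, int, Path], None],
) -> List[Part]:
    """Return the parts to parse; never rewrite a PDF that already fits in one part."""
    if total_pages <= 0 or split_size <= 0:
        raise ValueError("total_pages and split_size must be positive")

    if total_pages <= split_size:
        # Short-circuit: hand back the original file untouched, so nothing
        # (for example interactive form fields) can be lost by re-saving it.
        return [Part(file_path, 0, total_pages - 1)]

    parts: List[Part] = []
    for start in range(0, total_pages, split_size):
        end = min(start + split_size, total_pages) - 1
        # Hypothetical naming scheme: <stem>_<first page>_<last page>.pdf (1-based)
        out_path = output_dir / f"{file_path.stem}_{start + 1}_{end + 1}.pdf"
        write_part(file_path, start, end, out_path)  # assumed to write pages start..end
        parts.append(Part(out_path, start, end))
    return parts
```

With this shape, the caller can always pass the returned list to the parallel parsing step, whether it contains one part or many.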
Form elements were disappearing from SDK output despite being present in API responses. The SDK was unnecessarily splitting PDFs using the pypdf library, which strips interactive form elements during PDF manipulation.
Switched from pypdf to pymupdf and fixed the minimum-pages split logic.
https://landingai.slack.com/archives/C07KNEGHWKA/p1757949372178239
https://app.asana.com/1/504311096896991/project/1206677697418483/task/1211376367674705?focus=true