@k00ni
Collaborator

k00ni commented Nov 24, 2025

Type of pull request

  • Bug fix (involves code and configuration changes)

About

This PR fixes xref handling for circular references. A malformed PDF can lead to memory exhaustion because PDFParser enters an endless loop while following xref references that point back to one another. These fixes should prevent that.
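
To illustrate the idea, here is a minimal Python sketch of the guard. It is not the PHP code from this PR. It assumes that each xref section can point to an earlier section by byte offset (the `/Prev` entry of a PDF trailer) and models the file as a plain dict. The names `collect_xref_offsets` and `prev_of` are hypothetical.

```python
from typing import Dict, List, Optional


def collect_xref_offsets(start: int, prev_of: Dict[int, Optional[int]]) -> List[int]:
    """Follow a chain of xref sections, stopping when an offset repeats.

    `prev_of` is a hypothetical stand-in for real parsing: it maps the byte
    offset of an xref section to the offset its /Prev entry points to
    (or None when there is no earlier section).
    """
    visited: List[int] = []
    seen = set()
    offset: Optional[int] = start
    while offset is not None:
        if offset in seen:
            # Circular reference: this section was already processed.
            break
        seen.add(offset)
        visited.append(offset)
        offset = prev_of.get(offset)
    return visited


# A malformed file: the section at 100 points to 40, which points back to 100.
circular = {100: 40, 40: 100}
print(collect_xref_offsets(100, circular))  # [100, 40]
```

Without the `seen` check, the loop above would bounce between offsets 100 and 40 forever, which is the kind of endless loop described here.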

It allows me to work with a bunch of "broken" PDFs, but I might have overlooked some things. It would be great if someone else could have a look. Any feedback is appreciated (maybe @GreyWyvern @j0k3r?) 🚀


Might solve #71

#2)

* Initial plan

* Initial setup: Add crash file and reproducer script

Co-authored-by: k00ni <[email protected]>

* Fix infinite loop caused by circular xref references

Co-authored-by: k00ni <[email protected]>

* Add test case for circular xref reference fix

Co-authored-by: k00ni <[email protected]>

* Address review feedback: Add array<int> type hints and remove test

- Added array<int> type hints to $visitedOffsets parameters in PHPDoc comments
- Removed testCircularXrefReference test as requested by reviewer
- Removed reproducer.php file as requested
- Crash file memory-exhaustion-crash-f5d89649.txt remains permanently in samples/bugs/

Co-authored-by: k00ni <[email protected]>

* Remove crash file and add comprehensive tests for circular reference fix

- Removed memory-exhaustion-crash-f5d89649.txt as requested
- Added helper methods to expose protected functions for testing
- Added testGetXrefDataPreventsCircularReferences: Tests that visitedOffsets prevents infinite loops
- Added testDecodeXrefPassesVisitedOffsets: Tests that decodeXref correctly passes visitedOffsets
- Added testGetXrefDataTracksVisitedOffsets: Tests that offsets are tracked properly
- All tests use minimal PDF structures and don't rely on external crash file
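
As a rough illustration of a test that uses a minimal structure instead of a crash file, the Python sketch below builds two tiny trailer-like sections whose `/Prev` entries point at each other and checks that traversal stops. It is not the PHP test code from this PR: `section` and `get_xref_data` are hypothetical helpers, and the byte layout is simplified rather than a valid PDF.

```python
import re
from typing import List, Optional


def section(prev: int) -> bytes:
    """Build a fixed-width, simplified xref/trailer section (not a valid PDF)."""
    return f"xref\ntrailer\n<< /Prev {prev:04d} >>\n%%EOF\n".encode("ascii")


def get_xref_data(pdf: bytes, offset: int, visited: Optional[List[int]] = None) -> List[int]:
    """Hypothetical stand-in for the parser: returns the offsets it processed."""
    visited = [] if visited is None else visited
    if offset in visited or offset >= len(pdf):
        return visited  # already seen (or out of range): stop recursing
    visited.append(offset)
    # Only inspect this section, i.e. up to the next %%EOF marker.
    end = pdf.find(b"%%EOF", offset)
    chunk = pdf[offset:end if end != -1 else len(pdf)]
    match = re.search(rb"/Prev\s+(\d+)", chunk)
    if match:
        # Pass the same list along so later calls know what was visited.
        return get_xref_data(pdf, int(match.group(1)), visited)
    return visited


def check_circular_prev_terminates() -> None:
    size = len(section(0))
    # First section points to the second, second points back to the first.
    pdf = section(size) + section(0)
    assert get_xref_data(pdf, 0) == [0, size]


check_circular_prev_terminates()
```

The fixed-width offset in `section` keeps every section the same length, so the offsets can be computed without an external sample file.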

Co-authored-by: k00ni <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: k00ni <[email protected]>
@GreyWyvern
Contributor

I haven't checked if it solves #71, but the PR code worked fine, without errors, on my main search engine implementation (464 PDFs).

Then again, I'm pretty sure none of those PDFs were of the malformed type to have caused this error.

@k00ni
Collaborator Author

k00ni commented Nov 25, 2025

> Then again, I'm pretty sure none of those PDFs were of the malformed type to have caused this error.

You would have noticed 😅

Thank you very much for taking the time.
