Handle CDATA with UTF-8 characters when partial parsing #133
Open
tomtaylor wants to merge 1 commit into qcam:master
Don't assume that we're always seeing a full UTF-8 character. In partial mode, UTF-8 encoded characters might be split across multiple chunks.
Author
@qcam any thoughts on this?
qcam reviewed Oct 22, 2024
        element_cdata(rest, more?, original, pos, state, len + 1)

      <<codepoint::utf8>> <> rest ->
        element_cdata(rest, more?, original, pos, state, len + Utils.compute_char_len(codepoint))
Owner
I think we can handle this the same way dangling UTF-8 fragments are handled elsewhere.
For example: https://github.com/qcam/saxy/blob/master/lib/saxy/parser/builder.ex#L540-L541
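For readers unfamiliar with the idea, here is a minimal sketch of what holding back a dangling UTF-8 fragment could look like. It is written independently of the linked builder.ex lines, which are not reproduced here, so it should not be read as Saxy's implementation. The module `DanglingUtf8` and the function `split_tail/1` are hypothetical names introduced only for illustration.

```elixir
defmodule DanglingUtf8 do
  # Hypothetical helper, not taken from Saxy. Splits a chunk into the part
  # that can be consumed now and a trailing incomplete UTF-8 sequence that
  # should be kept until the next chunk arrives.
  def split_tail(chunk) when is_binary(chunk) do
    size = byte_size(chunk)

    # A UTF-8 sequence is at most 4 bytes, so an incomplete one is at most 3.
    Enum.find_value(1..min(3, size)//1, {chunk, <<>>}, fn n ->
      head_size = size - n
      <<head::binary-size(head_size), tail::binary>> = chunk
      if incomplete?(tail), do: {head, tail}
    end)
  end

  # True when the tail starts with a lead byte and has fewer continuation
  # bytes than that lead byte announces.
  defp incomplete?(<<lead, rest::binary>>) do
    needed =
      cond do
        lead in 0xC2..0xDF -> 1
        lead in 0xE0..0xEF -> 2
        lead in 0xF0..0xF4 -> 3
        true -> 0
      end

    byte_size(rest) < needed and continuation?(rest)
  end

  defp continuation?(<<>>), do: true
  defp continuation?(<<b, rest::binary>>) when b in 0x80..0xBF, do: continuation?(rest)
  defp continuation?(_other), do: false
end

# The first byte of "£" arrives at the end of one chunk...
{ready, pending} = DanglingUtf8.split_tail("a" <> <<0xC2>>)
# ready == "a", pending == <<0xC2>>

# ...and is prepended to the next chunk before parsing continues.
next_chunk = pending <> <<0xA3>> <> "b"
# next_chunk == "£b"
```

Under this approach the codepoint-matching clause can stay, because the parser only ever sees whole characters. The cost is a small amount of extra state carried between chunks.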
Author
I don't quite follow, I'm afraid. Would you prefer to take this PR over if it's a quick fix at your end?
Follows on from #122.
In partial mode, UTF-8 encoded characters might be split across multiple chunks. When this happens for a character such as £, which is encoded as <<0xC2, 0xA3>>, the 0xC2 is neither an ASCII character (<= 127), nor does it match the <<codepoint::utf8>> clause, and Saxy throws a parser error.

This fixes that by just parsing all the bytes inside a CDATA element regardless of their code point. It drops the UTF-8 character optimisation, but I suspect that's probably a minor performance improvement for most documents.
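The failure described above can be reproduced outside Saxy with plain binary pattern matching. The sketch below is illustrative only: `SplitCharDemo`, `classify/1`, and `byte_len/2` are hypothetical names, and the clauses imitate the shape of the CDATA clauses described here rather than copying the parser's code.

```elixir
defmodule SplitCharDemo do
  # Hypothetical stand-in for the two clauses described above: one for
  # ASCII bytes, one for a complete UTF-8 codepoint.
  def classify(<<byte, _rest::binary>>) when byte <= 127, do: :ascii
  def classify(<<_codepoint::utf8, _rest::binary>>), do: :utf8
  def classify(_other), do: :no_match

  # Byte-wise alternative: every byte advances the length by one, so a
  # chunk boundary inside a character needs no special case.
  def byte_len(<<_byte, rest::binary>>, len), do: byte_len(rest, len + 1)
  def byte_len(<<>>, len), do: len
end

pound = "£"                 # encoded as <<0xC2, 0xA3>>
<<first, second>> = pound   # split the two bytes as if across chunks

SplitCharDemo.classify(pound)       # => :utf8
SplitCharDemo.classify(<<first>>)   # => :no_match (lead byte alone)
SplitCharDemo.classify(<<second>>)  # => :no_match (continuation byte alone)

# Counting byte by byte gives the same total whether or not the
# character was split across chunks.
SplitCharDemo.byte_len(<<first>>, 0) + SplitCharDemo.byte_len(<<second>>, 0)  # => 2
```

Neither half of the split character matches the ASCII or codepoint clause, which is where the parser error comes from. Counting byte by byte avoids the problem because the total length is the same either way.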
@qcam is this a more prevalent issue than my use case? I can see why matching on UTF-8 codepoint and swallowing the whole character is a nice optimisation, but I wonder if it might cause issues in other places when partial parsing.