Handle CDATA with UTF-8 characters when partial parsing by tomtaylor · Pull Request #133 · qcam/saxy

tomtaylor · 2024-08-05T08:17:32Z

Follows on from #122.

In partial mode, UTF-8 encoded characters might be split across multiple chunks. When this happens for a character such as £, which is encoded as <<0xC2, 0xA3>>, the 0xC2 is neither an ASCII character (<= 127), nor does it match the <<codepoint::utf8>> clause, so Saxy throws a parser error.
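To make the failure concrete, here is a minimal sketch of the two clauses described above. `SplitDemo.classify/1` is a hypothetical stand-in, not Saxy's actual code; it only shows that a lone 0xC2 satisfies neither the ASCII guard nor the `utf8` match.

```elixir
defmodule SplitDemo do
  # Hypothetical classifier mirroring the ASCII clause and the utf8 clause.
  def classify(<<byte, _rest::binary>>) when byte <= 127, do: :ascii
  def classify(<<_codepoint::utf8, _rest::binary>>), do: :utf8
  # Anything else, including a truncated multi-byte sequence, falls through.
  def classify(_other), do: :no_match
end

# "£" is two bytes in UTF-8.
<<0xC2, 0xA3>> = "£"

SplitDemo.classify("a")       # ASCII clause matches
SplitDemo.classify("£")       # utf8 clause matches the complete character
SplitDemo.classify(<<0xC2>>)  # only the lead byte arrived: no clause matches
```

The last call is the situation a chunk boundary produces when it falls in the middle of the character.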

This fixes that by parsing every byte inside a CDATA element one at a time, regardless of its code point. It drops the UTF-8 character optimisation, but I suspect that optimisation was only a minor performance improvement for most documents.
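A rough sketch of the byte-wise idea, assuming a much simpler scanner than Saxy's real `element_cdata/6` (the module `CDataScan` and its return values are made up for illustration): every byte advances the length by one, so a chunk boundary inside a multi-byte character no longer matters.

```elixir
defmodule CDataScan do
  # Hypothetical, simplified scanner: looks for the "]]>" terminator and
  # returns the CDATA payload plus whatever follows it.
  def scan(buffer), do: scan(buffer, buffer, 0)

  # Terminator found: the first `len` bytes of the original are the payload.
  defp scan(<<"]]>", rest::binary>>, original, len) do
    {:ok, binary_part(original, 0, len), rest}
  end

  # Any other byte, ASCII or not, advances by exactly one byte.
  defp scan(<<_byte, rest::binary>>, original, len) do
    scan(rest, original, len + 1)
  end

  # Input ran out before "]]>": the caller should append the next chunk
  # and scan again.
  defp scan(<<>>, _original, _len), do: :need_more
end

# The £ character is split across the two chunks.
chunk1 = "price: " <> <<0xC2>>
chunk2 = <<0xA3>> <> "5]]></a>"

CDataScan.scan(chunk1)            # not enough input yet
CDataScan.scan(chunk1 <> chunk2)  # payload is recovered intact
```

Because the payload is sliced out of the original binary by byte length, the two halves of the character are rejoined without the scanner ever decoding them.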

@qcam is this a more prevalent issue than my use case? I can see why matching on UTF-8 codepoint and swallowing the whole character is a nice optimisation, but I wonder if it might cause issues in other places when partial parsing.

Don't assume that we're always seeing a full UTF-8 character. In partial mode, UTF-8 encoded characters might be split across multiple chunks.

tomtaylor · 2024-08-13T06:42:48Z

@qcam any thoughts on this?

qcam · 2024-10-22T13:38:34Z

            element_cdata(rest, more?, original, pos, state, len + 1)

-          <<codepoint::utf8>> <> rest ->
-            element_cdata(rest, more?, original, pos, state, len + Utils.compute_char_len(codepoint))


I think we can handle this the same way dangling UTF-8 fragments are handled elsewhere.

For example https://github.com/qcam/saxy/blob/master/lib/saxy/parser/builder.ex#L540-L541
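The linked lines are not quoted in this thread, so the following is only one plausible reading of the suggestion, not Saxy's code: keep the `utf8` clause, and when more input is expected and the buffer ends in an incomplete UTF-8 sequence, pause instead of raising. `DanglingUtf8` and its return atoms are hypothetical, and the prefix check is approximate (it does not reject every overlong form).

```elixir
defmodule DanglingUtf8 do
  # Hypothetical helper: decide what to do with the next character.
  # `more?` says whether further chunks may still arrive.
  def next(<<codepoint::utf8, rest::binary>>, _more?), do: {:char, codepoint, rest}
  # Buffer exhausted but more input is expected: wait for the next chunk.
  def next(<<>>, true), do: :halt
  # A truncated multi-byte sequence at the end of the buffer: also wait.
  def next(buffer, true) do
    if incomplete_prefix?(buffer), do: :halt, else: :error
  end
  # No more input will come, so whatever is left is a genuine error.
  def next(_buffer, false), do: :error

  # Lead byte only.
  defp incomplete_prefix?(<<lead>>) when lead in 0xC2..0xF4, do: true
  # Lead byte of a 3- or 4-byte sequence plus one continuation byte.
  defp incomplete_prefix?(<<lead, c1>>)
       when lead in 0xE0..0xF4 and c1 in 0x80..0xBF,
       do: true
  # Lead byte of a 4-byte sequence plus two continuation bytes.
  defp incomplete_prefix?(<<lead, c1, c2>>)
       when lead in 0xF0..0xF4 and c1 in 0x80..0xBF and c2 in 0x80..0xBF,
       do: true
  defp incomplete_prefix?(_other), do: false
end

DanglingUtf8.next("£", true)        # complete character is consumed
DanglingUtf8.next(<<0xC2>>, true)   # dangling lead byte: pause for more input
DanglingUtf8.next(<<0xC2>>, false)  # same bytes at end of document: error
```

Compared with dropping the `utf8` clause, this keeps the whole-character fast path at the cost of an extra clause for the dangling case.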

tomtaylor · 2025-02-23

I don't quite follow, I'm afraid. Would you prefer to take this PR over if it's a quick fix at your end?

Handle CDATA containing partial UTF-8 characters

9475427


tomtaylor changed the title ~~Handle CDATA containing partial UTF-8 characters~~ Handle CDATA with UTF-8 characters when partial parsing Aug 5, 2024

tomtaylor mentioned this pull request Aug 5, 2024

CDATA element fails to parse when element contains £ symbol #122

Open

qcam reviewed Oct 22, 2024

tomtaylor wants to merge 1 commit into qcam:master from tomtaylor:cdata-fix

tomtaylor Feb 23, 2025
