Honor byte-order mark when parsing UTF-16 and UTF-8 input #65

Open
SAY-5 wants to merge 3 commits into JuliaData:main from SAY-5:fix-utf16-bom-parsing
SAY-5 commented Jun 18, 2026

Fixes #62. Documents encoded as UTF-16, or as UTF-8 with a BOM, crashed with a BoundsError. The byte tokenizer was handed the raw bytes without any encoding detection, so the interleaved zero bytes desynced the </> scanning and the tokenizer walked off the end of the buffer.

This adds a small normalization step at the single Vector{UInt8} entry point. A leading BOM (FF FE, FE FF, or EF BB BF) is honored: UTF-16 is transcoded to UTF-8 and a UTF-8 BOM is stripped, so the tokenizer always sees UTF-8. Input without a BOM is returned unchanged. The PR also adds a regression test covering UTF-16 LE/BE and UTF-8 BOM round-trips.
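For readers skimming the thread, the normalization described above could look roughly like the sketch below. It is not the code from this PR: the names `normalize_to_utf8` and `utf16_to_utf8` are hypothetical, and only `transcode` from Julia Base is assumed. It also ignores a dangling odd byte at the end of UTF-16 input, which is a simplification.

```julia
# Hypothetical helper (not the PR's actual code): decode UTF-16 code units
# that follow a 2-byte BOM and re-encode them as UTF-8.
function utf16_to_utf8(bytes::Vector{UInt8}, little::Bool)
    nunits = (length(bytes) - 2) ÷ 2          # skip the BOM, drop any odd trailing byte
    units = Vector{UInt16}(undef, nunits)
    for i in 1:nunits
        b1 = UInt16(bytes[2i + 1])
        b2 = UInt16(bytes[2i + 2])
        units[i] = little ? (b1 | (b2 << 8)) : ((b1 << 8) | b2)
    end
    return transcode(UInt8, units)            # Base: UTF-16 -> UTF-8
end

# Hypothetical entry-point normalizer: the result is always UTF-8 without a BOM.
function normalize_to_utf8(bytes::Vector{UInt8})
    n = length(bytes)
    if n >= 3 && bytes[1] == 0xEF && bytes[2] == 0xBB && bytes[3] == 0xBF
        return bytes[4:end]                   # UTF-8 BOM: strip it
    elseif n >= 2 && bytes[1] == 0xFF && bytes[2] == 0xFE
        return utf16_to_utf8(bytes, true)     # UTF-16 LE
    elseif n >= 2 && bytes[1] == 0xFE && bytes[2] == 0xFF
        return utf16_to_utf8(bytes, false)    # UTF-16 BE
    end
    return bytes                              # no BOM: returned unchanged
end

# "<a/>" encoded as UTF-16 LE with a BOM
le = UInt8[0xFF, 0xFE, 0x3C, 0x00, 0x61, 0x00, 0x2F, 0x00, 0x3E, 0x00]
@assert String(normalize_to_utf8(le)) == "<a/>"
```

Doing this once at the byte entry point keeps the tokenizer itself unaware of encodings, which matches the approach described in this comment.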

SAY-5 added 2 commits June 17, 2026 19:32
Signed-off-by: Sai Asish Y <say.apm35@gmail.com>
Signed-off-by: Sai Asish Y <say.apm35@gmail.com>
Comment thread test/runtests.jl
Signed-off-by: Sai Asish Y <say.apm35@gmail.com>
SAY-5 commented Jun 20, 2026

Author

Good call, added that. The loop now also writes each BOM-prefixed buffer to a temp file and reads it back through read(path, Node), so the mmap path is covered too.
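As an illustration of the kind of fixtures such a loop needs, here is a minimal sketch that builds BOM-prefixed buffers and round-trips them through a temp file. The helper names `utf16_with_bom` and `utf8_with_bom` are hypothetical and not taken from test/runtests.jl. The sketch only checks that the bytes survive the file round trip; the actual test would call read(path, Node) at the marked line.

```julia
# Hypothetical fixture helper: encode `s` as UTF-16 with a leading BOM.
function utf16_with_bom(s::String; little::Bool)
    out = little ? UInt8[0xFF, 0xFE] : UInt8[0xFE, 0xFF]
    for u in transcode(UInt16, s)             # Base: UTF-8 -> UTF-16 code units
        lo = UInt8(u & 0x00FF)
        hi = UInt8(u >> 8)
        little ? push!(out, lo, hi) : push!(out, hi, lo)
    end
    return out
end

# Hypothetical fixture helper: prefix the UTF-8 bytes of `s` with EF BB BF.
utf8_with_bom(s::String) = append!(UInt8[0xEF, 0xBB, 0xBF], codeunits(s))

doc = "<a>é</a>"
buffers = (utf16_with_bom(doc; little=true),
           utf16_with_bom(doc; little=false),
           utf8_with_bom(doc))

for buf in buffers
    mktemp() do path, io
        write(io, buf)
        flush(io)
        # The real regression test would parse the file here, e.g. read(path, Node).
        @assert read(path) == buf
    end
end
```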

@mathieu17g

Collaborator

Thanks! Anything else you want to add, or shall I merge?

Development

Successfully merging this pull request may close these issues.

UTF-16 XML input crashes parser with BoundsError (no encoding detection)

3 participants