Validate BPE prefix merges without unchecked UTF-8 by dfgvaetyj3456356-hash · Pull Request #2082 · huggingface/tokenizers

dfgvaetyj3456356-hash · 2026-06-01T04:37:22Z

Summary

This PR removes an unsafe UTF-8 construction in BPE merge initialization when continuing_subword_prefix is configured.

BpeBuilder::build() previously trimmed the configured prefix from the second merge token by byte length and then rebuilt the merged token with from_utf8_unchecked(). If a serialized tokenizer provides a merge whose second token does not actually start with the configured prefix, that byte slicing can cut a multibyte UTF-8 character in the middle. The unchecked conversion then produces a string that is not valid UTF-8, which is undefined behavior in Rust.
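To illustrate the failure mode described above, here is a minimal standalone sketch. It is not the code from BpeBuilder::build(); the helper name trim_by_byte_len is hypothetical. It trims a token by byte count and then uses the checked std::str::from_utf8() so the invalid cut is reported instead of being passed through.

```rust
/// Hypothetical helper (not part of tokenizers): drop `prefix_len` bytes from
/// the front of `token`, then validate what is left as UTF-8.
/// Returns None if the cut is past the end of the token or lands inside a
/// multibyte character.
pub fn trim_by_byte_len(token: &str, prefix_len: usize) -> Option<&str> {
    // Work on raw bytes, as a byte-length trim does.
    let rest = token.as_bytes().get(prefix_len..)?;
    // Checked conversion: an orphaned continuation byte is rejected here.
    std::str::from_utf8(rest).ok()
}
```

With a two-byte prefix such as "##", the token "##ing" trims cleanly to "ing". A token such as "é" (bytes 0xC3 0xA9) trimmed by one byte leaves a lone continuation byte, which the checked conversion rejects. Note that a token like "abing" still trims to "ing" even though it never carried the prefix, which is why a length-based trim alone is not enough.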

The change uses strip_prefix() so prefix handling stays on UTF-8 string boundaries and returns the existing MergeTokenOutOfVocabulary error when a merge token does not match the configured prefix. It also replaces the unchecked conversion with checked std::str::from_utf8() as a defense-in-depth guard.
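A rough sketch of the strip_prefix() approach, under stated assumptions: the MergeError enum and the merged_token function below are stand-ins written for this comment, not the actual types or signatures in the crate. Only the variant name MergeTokenOutOfVocabulary is taken from the description above.

```rust
/// Stand-in error type for illustration; the real error lives in the crate.
#[derive(Debug, PartialEq)]
pub enum MergeError {
    MergeTokenOutOfVocabulary(String),
}

/// Hypothetical helper: join two merge tokens, removing the configured
/// continuing-subword prefix from the second token when one is set.
pub fn merged_token(a: &str, b: &str, prefix: Option<&str>) -> Result<String, MergeError> {
    let tail = match prefix {
        // strip_prefix() only succeeds when `b` really starts with `p`,
        // so the remainder always begins on a character boundary.
        Some(p) => b
            .strip_prefix(p)
            .ok_or_else(|| MergeError::MergeTokenOutOfVocabulary(b.to_owned()))?,
        None => b,
    };
    Ok(format!("{}{}", a, tail))
}
```

Because strip_prefix() compares string contents rather than lengths, a non-matching token becomes an error instead of a misaligned slice, and no unsafe conversion is needed on this path.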

Tests

Added test_bpe_rejects_merge_without_configured_prefix covering a multibyte merge token with a non-matching configured prefix.

Local validation:

git diff --check

I could not run cargo test locally because this Windows/WSL environment does not have cargo/rustc installed.

dfgvaetyj3456356-hash · 2026-06-04T21:42:32Z

Updated this branch to fix the Rust build failure from the previous CI run and clean up the branch history.

What changed:

Rebased/rebuilt from current upstream main with only the intended patch commit.
Fixed the E0382 partial-move error by capturing continuing_subword_prefix before consuming the merge list.
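For readers unfamiliar with E0382 in this shape, the pattern can be sketched as follows. This is an illustration only: the Config struct, its prefix() method, and second_tokens are hypothetical and do not mirror the real builder. The point is that the prefix is copied into a local before the merge list is moved out of the struct.

```rust
/// Hypothetical stand-in for the builder configuration.
pub struct Config {
    pub continuing_subword_prefix: Option<String>,
    pub merges: Vec<(String, String)>,
}

impl Config {
    /// Borrows the whole struct, which is what conflicts with a partial move.
    pub fn prefix(&self) -> Option<&str> {
        self.continuing_subword_prefix.as_deref()
    }
}

/// Strip the prefix from the second token of every merge, or return the
/// offending token as the error.
pub fn second_tokens(config: Config) -> Result<Vec<String>, String> {
    // Capture the prefix as an owned value while `config` is still intact.
    let prefix: Option<String> = config.prefix().map(str::to_owned);

    config
        .merges
        .into_iter() // moves `config.merges`; `config` is now partially moved
        .map(|(_, b)| match prefix.as_deref() {
            // Calling `config.prefix()` here instead would borrow all of
            // `config` after the move above and fail with E0382.
            Some(p) => {
                let trimmed = b.strip_prefix(p).map(str::to_owned);
                trimmed.ok_or(b)
            }
            None => Ok(b),
        })
        .collect()
}
```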

Local validation:

cargo check
cargo test test_bpe_rejects_merge_without_configured_prefix

I will watch the refreshed CI after GitHub repopulates the checks.

Validate BPE prefix merges without unchecked UTF-8

95539f0

dfgvaetyj3456356-hash force-pushed the security/bpe-prefix-utf8-validation branch from 45d83bb to 95539f0 on June 4, 2026 21:40

dfgvaetyj3456356-hash wants to merge 1 commit into huggingface:main from dfgvaetyj3456356-hash:security/bpe-prefix-utf8-validation
