Skip to content

Validate BPE prefix merges without unchecked UTF-8#2082

Open
dfgvaetyj3456356-hash wants to merge 1 commit into
huggingface:mainfrom
dfgvaetyj3456356-hash:security/bpe-prefix-utf8-validation
Open

Validate BPE prefix merges without unchecked UTF-8#2082
dfgvaetyj3456356-hash wants to merge 1 commit into
huggingface:mainfrom
dfgvaetyj3456356-hash:security/bpe-prefix-utf8-validation

Conversation

@dfgvaetyj3456356-hash

Copy link
Copy Markdown

Summary

This PR removes an unsafe UTF-8 construction in BPE merge initialization when continuing_subword_prefix is configured.

BpeBuilder::build() previously trimmed the configured prefix from the second merge token by byte length and then rebuilt the merged token with from_utf8_unchecked(). If a serialized tokenizer provides a merge whose second token does not actually start with the configured prefix, that byte slicing can cut a multibyte UTF-8 token at an invalid boundary before the unchecked conversion.

The change uses strip_prefix() so prefix handling stays on UTF-8 string boundaries and returns the existing MergeTokenOutOfVocabulary error when a merge token does not match the configured prefix. It also replaces the unchecked conversion with checked std::str::from_utf8() as a defense-in-depth guard.

Tests

Added test_bpe_rejects_merge_without_configured_prefix covering a multibyte merge token with a non-matching configured prefix.

Local validation:

git diff --check

I could not run cargo test locally because this Windows/WSL environment does not have cargo/rustc installed.

@dfgvaetyj3456356-hash dfgvaetyj3456356-hash force-pushed the security/bpe-prefix-utf8-validation branch from 45d83bb to 95539f0 Compare June 4, 2026 21:40
@dfgvaetyj3456356-hash

Copy link
Copy Markdown
Author

Updated this branch to fix the Rust build failure from the previous run and clean the branch history.

What changed:

  • Rebased/rebuilt from current upstream main with only the intended patch commit.
  • Fixed the E0382 partial-move error by capturing continuing_subword_prefix before consuming the merge list.

Local validation:

  • cargo check
  • cargo test test_bpe_rejects_merge_without_configured_prefix

I will watch the refreshed CI after GitHub repopulates the checks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant