Validate BPE prefix merges without unchecked UTF-8 #2082
Open
dfgvaetyj3456356-hash wants to merge 1 commit into
Branch updated: 45d83bb to 95539f0
Author
Updated this branch to fix the Rust build failure from the previous run and clean the branch history. What changed:
Local validation:
I will watch the refreshed CI after GitHub repopulates the checks.
Summary
This PR removes an unsafe UTF-8 construction in BPE merge initialization when `continuing_subword_prefix` is configured. `BpeBuilder::build()` previously trimmed the configured prefix from the second merge token by byte length and then rebuilt the merged token with `from_utf8_unchecked()`. If a serialized tokenizer provides a merge whose second token does not actually start with the configured prefix, that byte slicing can cut a multibyte UTF-8 token at an invalid boundary before the unchecked conversion.

The change uses `strip_prefix()` so prefix handling stays on UTF-8 string boundaries and returns the existing `MergeTokenOutOfVocabulary` error when a merge token does not match the configured prefix. It also replaces the unchecked conversion with checked `std::str::from_utf8()` as a defense-in-depth guard.

Tests
Added `test_bpe_rejects_merge_without_configured_prefix`, covering a multibyte merge token with a non-matching configured prefix.

Local validation:
I could not run `cargo test` locally because this Windows/WSL environment does not have `cargo`/`rustc` installed.
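
For reviewers, the prefix handling described in the Summary can be illustrated with a minimal, standalone sketch. This is not the code in the diff: `merged_token` and `MergeError` are hypothetical names used only for illustration, and the real `BpeBuilder::build()` and `MergeTokenOutOfVocabulary` definitions may differ. It shows why trimming by byte length is hazardous for a multibyte token and how `strip_prefix()` turns the mismatch into an error instead.

```rust
/// Hypothetical stand-in for the library's `MergeTokenOutOfVocabulary`
/// error; the real error type may be shaped differently.
#[derive(Debug, PartialEq)]
enum MergeError {
    MergeTokenOutOfVocabulary(String),
}

/// Hypothetical helper: build the merged token for the pair `(a, b)`.
/// When a continuing-subword prefix is configured, `b` must start with it,
/// and the prefix is dropped before concatenation.
fn merged_token(a: &str, b: &str, prefix: Option<&str>) -> Result<String, MergeError> {
    let tail = match prefix {
        // `strip_prefix` only succeeds on a real textual match, so the
        // remainder always begins on a UTF-8 character boundary.
        Some(p) => b
            .strip_prefix(p)
            .ok_or_else(|| MergeError::MergeTokenOutOfVocabulary(b.to_owned()))?,
        None => b,
    };
    Ok(format!("{}{}", a, tail))
}

fn main() {
    // Matching prefix: the prefix is removed and the halves are joined.
    assert_eq!(
        merged_token("he", "##llo", Some("##")),
        Ok("hello".to_string())
    );

    // "é" is two bytes in UTF-8, so byte offset 1 is not a char boundary.
    // Skipping `prefix.len()` bytes for a one-byte prefix would split it.
    assert!(!"é".is_char_boundary(1));
    assert!(std::str::from_utf8(&"é".as_bytes()[1..]).is_err());

    // With `strip_prefix`, the non-matching token is rejected instead.
    assert_eq!(
        merged_token("he", "é", Some("#")),
        Err(MergeError::MergeTokenOutOfVocabulary("é".to_string()))
    );
}
```

The sketch returns an error as soon as the prefix does not match, so no byte-level slicing or unchecked conversion is needed on that path.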