Return an error instead of panicking on out-of-range BPE merges #2104
Open
NahButch wants to merge 1 commit into
BpeBuilder::build sizes its scratch buffer to the longest vocabulary key, then writes each merge rule's concatenation into it. A merge whose concatenated token is longer than every vocab key overran the buffer, which panicked and aborted the process in FFI embeddings. tokenizer.json files are downloaded artifacts, so this is reachable from untrusted input. A merge right-hand side shorter than continuing_subword_prefix similarly panicked on length underflow.

Both cases now return MergeTokenOutOfVocabulary: a merged token longer than every key cannot resolve to a vocab id, and a right-hand side shorter than the prefix cannot be a valid rule. Regression tests panic at the previously reported lines without the fix.

Fixes huggingface#2094

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
`BpeBuilder::build` sizes its scratch buffer to the longest vocabulary key, then writes each merge rule's concatenation into it. A merge whose concatenated token is longer than every vocab key overruns the buffer, which panics and, in FFI embeddings, aborts the process. `tokenizer.json` files are downloaded artifacts, so this is reachable from untrusted input. A merge right-hand side shorter than `continuing_subword_prefix` similarly panicked on length underflow.

Both cases now return `MergeTokenOutOfVocabulary`: a merged token longer than every key cannot resolve to a vocab id, and a right-hand side shorter than the prefix cannot be a valid rule. Regression tests were verified to panic at the previously reported lines without the fix; the full test suite passes with it.

Fixes #2094
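
For readers who want the shape of the two checks without opening the diff, here is a minimal, self-contained sketch. It is not the code from this PR: `BuildError`, `scratch_for`, and `resolve_merge` are hypothetical names, and only the `MergeTokenOutOfVocabulary` variant name and the prefix-stripping behavior are taken from the description above. It assumes the prefix is removed from the right-hand side by byte length before concatenation.

```rust
use std::collections::HashMap;

/// Hypothetical error type; only the variant name comes from the PR description.
#[derive(Debug, PartialEq)]
enum BuildError {
    MergeTokenOutOfVocabulary(String),
}

/// Allocate a scratch buffer as long as the longest vocabulary key (in bytes).
fn scratch_for(vocab: &HashMap<String, u32>) -> Vec<u8> {
    vec![0; vocab.keys().map(String::len).max().unwrap_or(0)]
}

/// Resolve one merge rule `(left, right)` to the id of its merged token,
/// returning an error where unchecked indexing would have panicked.
fn resolve_merge(
    vocab: &HashMap<String, u32>,
    scratch: &mut [u8],
    prefix: &str,
    left: &str,
    right: &str,
) -> Result<u32, BuildError> {
    let err = || BuildError::MergeTokenOutOfVocabulary(format!("{left} {right}"));

    // Check 1: `get` yields None when `right` is shorter than the prefix
    // (or the cut is not on a char boundary) instead of underflowing.
    let tail = right.get(prefix.len()..).ok_or_else(err)?;

    // Check 2: `get_mut` yields None when the merged token would not fit
    // in the scratch buffer, i.e. it is longer than every vocab key.
    let merged_len = left.len() + tail.len();
    let dst = scratch.get_mut(..merged_len).ok_or_else(err)?;
    dst[..left.len()].copy_from_slice(left.as_bytes());
    dst[left.len()..].copy_from_slice(tail.as_bytes());

    // A token that fits but is still absent from the vocab is also an error.
    let merged = std::str::from_utf8(dst).map_err(|_| err())?;
    vocab.get(merged).copied().ok_or_else(err)
}
```

With a vocabulary of `a`, `##b`, and `ab` and the prefix `##`, the rule `("a", "##b")` resolves to the id of `ab`, while `("ab", "##ab")` (too long for the buffer) and `("a", "#")` (shorter than the prefix) both come back as `Err(MergeTokenOutOfVocabulary(..))` rather than panicking.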
🤖 Generated with Claude Code