Skip to content

Return an error instead of panicking on out-of-range BPE merges#2104

Open
NahButch wants to merge 1 commit into
huggingface:mainfrom
NahButch:bpe-merge-buffer-overrun
Open

Return an error instead of panicking on out-of-range BPE merges#2104
NahButch wants to merge 1 commit into
huggingface:mainfrom
NahButch:bpe-merge-buffer-overrun

Conversation

@NahButch

Copy link
Copy Markdown

BpeBuilder::build sizes its scratch buffer to the longest vocabulary key, then writes each merge rule's concatenation into it. A merge whose concatenated token is longer than every vocab key overruns the buffer, which panics — and aborts the process in FFI embeddings. tokenizer.json files are downloaded artifacts, so this is reachable from untrusted input. A merge right-hand side shorter than continuing_subword_prefix similarly panicked on length underflow.

Both cases now return MergeTokenOutOfVocabulary: a merged token longer than every key cannot resolve to a vocab id, and a right side shorter than the prefix cannot be a valid rule. Regression tests verified to panic at the previously reported lines without the fix; the full test suite passes with it.

Fixes #2094

🤖 Generated with Claude Code

BpeBuilder::build sizes its scratch buffer to the longest vocabulary
key, then writes each merge rule's concatenation into it. A merge whose
concatenated token is longer than every vocab key overran the buffer,
which panics and aborts the process in FFI embeddings. tokenizer.json
files are downloaded artifacts, so this is reachable from untrusted
input. A merge right-hand side shorter than continuing_subword_prefix
similarly panicked on length underflow.

Both cases now return MergeTokenOutOfVocabulary: a merged token longer
than every key cannot resolve to a vocab id, and a right-hand side
shorter than the prefix cannot be a valid rule. Regression tests panic
at the previously reported lines without the fix.

Fixes huggingface#2094

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Security] Load-time process abort when loading a malicious tokenizer.json (BPE merge buffer overrun)

1 participant