perf: do not allocate for alignments in encode_fast #2097
Draft
SBrandeis wants to merge 8 commits into
Guards ids/type_ids/special-mask equality for gpt2+added tokens, llama-3, bert-wiki, albert (Metaspace) — single, pair, and batch. Green on the current branch; safety net for the upcoming no-alignment fast path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make empty-alignments a first-class 'unaligned' mode: slice, transform_range and replace branch to text-only work, set_normalized switches into the mode, and extract_and_normalize_fast builds the pipeline unaligned from the start. encode_fast now allocates no alignment vectors and no original-string clones anywhere: not at construction, not per pre-token slice, not in ByteLevel's byte-to-char transform. original_shift still accumulates slice positions so Metaspace First's 'is this the first piece' predicate (offsets_original().0 == 0) keeps working. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ture Adversarial review findings: the first-piece predicate diverges from encode when a normalizer strips leading content (pre-existing normalize_str-era behavior, now documented at the shift accumulation site); the normalize_str default impl still allocates one transient aligned NormalizedString per piece (doc no longer claims otherwise). New Metaspace(First) equivalence fixture with a prepend-distinguishing vocab locks in the common case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Canary for the known Metaspace(First) + leading-strip divergence: asserts the exact divergent ids on both paths so an upstream fix trips it and the caveat docs get cleaned up alongside
- Unaligned-emptied strings re-enter aligned mode and keep behaving identically (prepend on empty is a no-op on both paths)
- Bounded Range::Original on unaligned strings is a no-op; unbounded means the full string (the transform() delegation case)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The invariant and the Metaspace exception are documented once, on is_unaligned(); the branches just carry an 'encode_fast path' marker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PrependScheme::First documents itself as 'prepended only once, on the first split', but the implementation checked offsets_original().0 == 0. The two disagree when a normalizer removes characters at the start of the input (e.g. Strip): the first split then maps to a nonzero original position and got no prepend. encode_fast cannot reproduce that offset check without alignments; it was the one known encode/encode_fast divergence. Using the split index (already provided by PreTokenizedString::split) matches the documented semantics, is identical in every non-stripping case, and makes encode and encode_fast agree everywhere.
⚠️ Slow-path behavior change: Metaspace(First) + leading-stripping normalizer now prepends to the first split where it previously did not.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
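The PrependScheme::First change described in the commit above can be illustrated with a minimal sketch. The Split type, the helper names, and the simplified prepend logic below are hypothetical stand-ins, not the tokenizers API; the sketch only contrasts an offset-based "first piece" check with an index-based one when leading characters have been stripped.

```rust
/// Hypothetical stand-in for a pre-token split: the byte position where it
/// starts in the ORIGINAL input, plus its (already normalized) text.
struct Split {
    original_start: usize,
    text: String,
}

/// Old-style check (assumed shape): "first" means the split maps back to
/// original offset 0. This needs alignment information.
fn is_first_by_offset(split: &Split) -> bool {
    split.original_start == 0
}

/// New-style check (assumed shape): "first" means the split has index 0.
fn is_first_by_index(index: usize) -> bool {
    index == 0
}

/// Prepend the marker to whichever split the chosen predicate calls "first".
fn prepend_first(splits: &[Split], by_index: bool) -> Vec<String> {
    splits
        .iter()
        .enumerate()
        .map(|(i, s)| {
            let first = if by_index {
                is_first_by_index(i)
            } else {
                is_first_by_offset(s)
            };
            if first {
                format!("▁{}", s.text)
            } else {
                s.text.clone()
            }
        })
        .collect()
}

/// Splits for "hello world" (not stripped) or "  hello world" after a
/// Strip-like normalizer removed the two leading spaces (stripped).
fn demo_splits(stripped: bool) -> Vec<Split> {
    let shift = if stripped { 2 } else { 0 };
    vec![
        Split { original_start: shift, text: "hello".to_string() },
        Split { original_start: shift + 5, text: "▁world".to_string() },
    ]
}

fn main() {
    // Without stripping, both predicates agree.
    println!("{:?}", prepend_first(&demo_splits(false), false));
    println!("{:?}", prepend_first(&demo_splits(false), true));
    // With leading content stripped, only the index-based check prepends.
    println!("{:?}", prepend_first(&demo_splits(true), false));
    println!("{:?}", prepend_first(&demo_splits(true), true));
}
```

In this toy model the two predicates only disagree in the stripped case, which matches the behavior change flagged in the commit message.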
encode_fast
Collaborator comment: /benchmark


Builds on top of #2022
encode_fast only cares about token ids. Offsets, token strings, etc. get dropped at the very end of the pipeline: ref

#2022 improved the perf of encode_fast by not computing alignment offsets during normalization, but the alignment Vecs were still fully allocated at construction (NormalizedString::from), in set_normalized (ref), and at every slice() call during pre-tokenization: ref

This PR adds a "no alignment" fast path on NormalizedString methods that skips allocation (String::new and Vec::new do not allocate) when in the encode_fast path. It relies on a state that cannot occur when alignments are computed through encode: an empty alignments vec together with a non-empty normalized string.

tokenizers/tokenizers/src/tokenizer/normalizer.rs
Lines 139 to 160 in f5aa6c0
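A minimal sketch of the invariant described above, using a hypothetical MiniNormalized type rather than the real NormalizedString (field layout, constructor names, and the simplified slice() are assumptions made for illustration):

```rust
use std::ops::Range;

/// Hypothetical, simplified stand-in for NormalizedString.
struct MiniNormalized {
    normalized: String,
    /// Aligned mode: one (start, end) original byte range per normalized byte.
    /// Unaligned mode: left empty, so it never allocates.
    alignments: Vec<(usize, usize)>,
    /// Accumulated start position of this piece within the original input.
    original_shift: usize,
}

impl MiniNormalized {
    /// Aligned construction: allocates one alignment entry per byte.
    fn aligned(s: &str) -> Self {
        let alignments = s
            .char_indices()
            .flat_map(|(i, c)| {
                let len = c.len_utf8();
                std::iter::repeat((i, i + len)).take(len)
            })
            .collect();
        Self { normalized: s.to_owned(), alignments, original_shift: 0 }
    }

    /// Unaligned construction: Vec::new() performs no heap allocation.
    fn unaligned(s: &str) -> Self {
        Self { normalized: s.to_owned(), alignments: Vec::new(), original_shift: 0 }
    }

    /// The invariant: empty alignments + non-empty text can only come from
    /// the unaligned constructor, so it can serve as the mode flag.
    fn is_unaligned(&self) -> bool {
        self.alignments.is_empty() && !self.normalized.is_empty()
    }

    /// Slice by byte range (must fall on char boundaries). The unaligned
    /// branch does text-only work and still tracks the shift.
    fn slice(&self, range: Range<usize>) -> Self {
        let alignments = if self.is_unaligned() {
            Vec::new()
        } else {
            self.alignments[range.clone()].to_vec()
        };
        Self {
            normalized: self.normalized[range.clone()].to_owned(),
            alignments,
            original_shift: self.original_shift + range.start,
        }
    }
}

fn main() {
    let fast = MiniNormalized::unaligned("hello world").slice(6..11);
    assert!(fast.is_unaligned());
    assert_eq!(fast.alignments.capacity(), 0); // nothing was allocated

    let slow = MiniNormalized::aligned("hello world").slice(6..11);
    assert!(!slow.is_unaligned());
    assert_eq!(slow.alignments.len(), 5);
}
```

Note that in this sketch the text itself is still copied on every slice() call; only the alignment vector is skipped.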
On my machine, this results in a +42% throughput improvement on llama-3 vs the base branch (would love a /benchmark to back it up).
A possible next step would be to stop copying the String around at every slice() call, but that is a much bigger and deeper change.