perf: do not allocate for alignments in encode_fast#2097

Draft
SBrandeis wants to merge 8 commits into fast-extract-normalize from fast_pretok
@SBrandeis SBrandeis commented Jun 11, 2026

Contributor

Builds on top of #2022

encode_fast only cares about token ids. Offsets, token strings, etc. are dropped at the very end of the pipeline: ref

#2022 improved the performance of encode_fast by not computing alignment offsets during normalization, but the alignment Vecs were still fully allocated at construction (NormalizedString::from), in set_normalized (ref), and at every slice() call during pre-tokenization: ref

This PR adds a "no alignment" fast path to the NormalizedString methods that skips those allocations on the encode_fast path (String::new and Vec::new do not allocate). It relies on a state that cannot occur when alignments are computed through encode: an empty alignments Vec together with a non-empty normalized string.

/// Build a `NormalizedString` that doesn't track alignments.
/// See [`Self::is_unaligned`].
pub(crate) fn unaligned(normalized: String) -> Self {
    Self {
        original: String::new(),
        normalized,
        alignments: Vec::new(),
        original_shift: 0,
    }
}

/// Whether this string skips alignment tracking.
///
/// The trick: an aligned string always has one alignment entry per byte
/// of `normalized`, so "alignments empty but normalized not empty" is
/// normally impossible. We use that impossible state to mean "no
/// alignment tracking". This is what `encode_fast` runs on: same text
/// operations, no offsets, no per-byte bookkeeping. (Empty strings count
/// as aligned; both modes behave the same on them.)
pub(crate) fn is_unaligned(&self) -> bool {
    self.alignments.is_empty() && !self.normalized.is_empty()
}
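
To make the invariant concrete, here is a minimal standalone sketch. It is not the crate's code: `Sketch` is a hypothetical stand-in for `NormalizedString` that reuses the field names above, and `aligned`, `alignment_len` and `alignment_capacity` are helpers invented for illustration. It only shows that the aligned constructor pays one entry per byte while the unaligned one leaves the Vec at capacity 0.

```rust
/// Hypothetical stand-in for `NormalizedString` (illustration only).
#[allow(dead_code)]
#[derive(Debug)]
pub struct Sketch {
    original: String,
    normalized: String,
    /// In aligned mode: one `(start, end)` entry per byte of `normalized`.
    alignments: Vec<(usize, usize)>,
    original_shift: usize,
}

impl Sketch {
    /// Aligned construction: clones the input and allocates one entry per byte.
    pub fn aligned(s: &str) -> Self {
        let mut alignments = Vec::with_capacity(s.len());
        for (start, c) in s.char_indices() {
            let len = c.len_utf8();
            // Every byte of a char maps to that char's byte range.
            alignments.extend(std::iter::repeat((start, start + len)).take(len));
        }
        Self {
            original: s.to_owned(),
            normalized: s.to_owned(),
            alignments,
            original_shift: 0,
        }
    }

    /// Unaligned construction: `String::new()` and `Vec::new()` do not allocate.
    pub fn unaligned(normalized: String) -> Self {
        Self {
            original: String::new(),
            normalized,
            alignments: Vec::new(),
            original_shift: 0,
        }
    }

    /// Same predicate as in the PR description.
    pub fn is_unaligned(&self) -> bool {
        self.alignments.is_empty() && !self.normalized.is_empty()
    }

    pub fn alignment_len(&self) -> usize {
        self.alignments.len()
    }

    pub fn alignment_capacity(&self) -> usize {
        self.alignments.capacity()
    }
}
```

An empty input built through `unaligned` reports `is_unaligned() == false`, which matches the "empty strings count as aligned" note in the doc comment.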

On my machine this results in a +42% throughput improvement on llama-3 vs the base branch (would love a /benchmark to back it up)

cargo bench --bench ci_benchmark -- 'llama3'

A possible next step would be to stop copying the String around at every slice() call, but that is a much bigger and deeper change.

@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SBrandeis and others added 4 commits June 11, 2026 17:03
Guards ids/type_ids/special-mask equality for gpt2+added tokens, llama-3,
bert-wiki, albert (Metaspace) — single, pair, and batch. Green on the
current branch; safety net for the upcoming no-alignment fast path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Make empty-alignments a first-class 'unaligned' mode: slice, transform_range
and replace branch to text-only work, set_normalized switches into the mode,
and extract_and_normalize_fast builds the pipeline unaligned from the start.
encode_fast now allocates no alignment vectors and no original-string clones
anywhere: not at construction, not per pre-token slice, not in ByteLevel's
byte-to-char transform.

original_shift still accumulates slice positions so Metaspace First's
'is this the first piece' predicate (offsets_original().0 == 0) keeps working.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
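
A rough sketch of the text-only slice described in this commit, under stated assumptions: `Piece` is a hypothetical type, not the crate's `NormalizedString`, and `starts_at_origin` is an invented helper standing in for the `offsets_original().0 == 0` check. It only illustrates how a shift can keep accumulating across nested slices without any alignment Vec.

```rust
/// Hypothetical text-only piece: no alignments, only a running shift.
#[derive(Debug, Clone, PartialEq)]
pub struct Piece {
    pub normalized: String,
    /// Byte position of this piece relative to the string it was first cut from.
    pub original_shift: usize,
}

impl Piece {
    /// Text-only slice: copies the bytes of `range` and accumulates the shift.
    /// Returns `None` if the range is out of bounds or not on a char boundary.
    pub fn slice(&self, range: std::ops::Range<usize>) -> Option<Piece> {
        let text = self.normalized.get(range.clone())?;
        Some(Piece {
            normalized: text.to_owned(),
            original_shift: self.original_shift + range.start,
        })
    }

    /// Stand-in for the "is this the first piece" predicate.
    pub fn starts_at_origin(&self) -> bool {
        self.original_shift == 0
    }
}
```

Note that the shift here is measured in the normalized text, so it only equals the original position when the normalizer has not removed leading content.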
…ture

Adversarial review findings: the first-piece predicate diverges from encode
when a normalizer strips leading content (pre-existing normalize_str-era
behavior, now documented at the shift accumulation site); the normalize_str
default impl still allocates one transient aligned NormalizedString per piece
(doc no longer claims otherwise). New Metaspace(First) equivalence fixture
with a prepend-distinguishing vocab locks in the common case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Canary for the known Metaspace(First) + leading-strip divergence: asserts
  the exact divergent ids on both paths so an upstream fix trips it and the
  caveat docs get cleaned up alongside
- Unaligned-emptied strings re-enter aligned mode and keep behaving
  identically (prepend on empty is a no-op on both paths)
- Bounded Range::Original on unaligned strings is a no-op; unbounded means
  the full string (the transform() delegation case)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
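
The last bullet can be sketched as a small range resolver. Everything here is an assumption made for illustration: `SketchRange` and `resolve_unaligned` are hypothetical names, and the handling of the `Normalized` variant is a guess at reasonable behavior rather than something the commit message specifies.

```rust
use std::ops::{Bound, RangeBounds};

/// Hypothetical stand-in for a range expressed in original or normalized coordinates.
pub enum SketchRange<R> {
    Original(R),
    Normalized(R),
}

/// Resolve a range against an unaligned string of `len` bytes.
/// `None` means "no-op": the caller leaves the string untouched.
pub fn resolve_unaligned<R: RangeBounds<usize>>(
    range: &SketchRange<R>,
    len: usize,
) -> Option<std::ops::Range<usize>> {
    match range {
        // Without alignments, original coordinates cannot be mapped, so only
        // the fully unbounded case (meaning "the whole string") is honored.
        SketchRange::Original(r) => match (r.start_bound(), r.end_bound()) {
            (Bound::Unbounded, Bound::Unbounded) => Some(0..len),
            _ => None,
        },
        // Normalized coordinates need no mapping; just resolve the bounds.
        SketchRange::Normalized(r) => {
            let start = match r.start_bound() {
                Bound::Included(&s) => s,
                Bound::Excluded(&s) => s + 1,
                Bound::Unbounded => 0,
            };
            let end = match r.end_bound() {
                Bound::Included(&e) => e + 1,
                Bound::Excluded(&e) => e,
                Bound::Unbounded => len,
            };
            if start <= end && end <= len {
                Some(start..end)
            } else {
                None
            }
        }
    }
}
```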
SBrandeis and others added 4 commits June 11, 2026 17:24
The invariant and the Metaspace exception are documented once, on
is_unaligned(); the branches just carry an 'encode_fast path' marker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PrependScheme::First documents itself as 'prepended only once, on the first
split', but the implementation checked offsets_original().0 == 0. The two
disagree when a normalizer removes characters at the start of the input
(e.g. Strip): the first split then maps to a nonzero original position and
got no prepend. encode_fast cannot reproduce that offset check without
alignments — it was the one known encode/encode_fast divergence.

Using the split index (already provided by PreTokenizedString::split)
matches the documented semantics, is identical in every non-stripping case,
and makes encode and encode_fast agree everywhere.

⚠️ Slow-path behavior change: Metaspace(First) + leading-stripping
normalizer now prepends to the first split where it previously did not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
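
The divergence fixed by this commit can be reproduced with a toy model. This is a heavily simplified sketch, not Metaspace itself: `strip_and_split` and `metaspace_first` are hypothetical helpers, the "normalizer" just trims leading whitespace, and the "pre-tokenizer" just splits on spaces. The flag switches between the old offset check and the new split-index check.

```rust
/// Hypothetical stand-in for "Strip normalizer + split": returns each piece
/// together with its byte position in the ORIGINAL (unstripped) input.
pub fn strip_and_split(input: &str) -> Vec<(usize, String)> {
    let stripped = input.trim_start();
    let removed = input.len() - stripped.len();
    let mut pieces = Vec::new();
    let mut cursor = 0;
    for word in stripped.split(' ') {
        if !word.is_empty() {
            pieces.push((removed + cursor, word.to_owned()));
        }
        // Advance past the word and the single space that followed it.
        cursor += word.len() + 1;
    }
    pieces
}

/// Toy `PrependScheme::First`: prepend U+2581 to the first piece only.
/// `use_split_index = false` models the old check (original offset == 0),
/// `use_split_index = true` models the new check (split index == 0).
pub fn metaspace_first(input: &str, use_split_index: bool) -> Vec<String> {
    strip_and_split(input)
        .into_iter()
        .enumerate()
        .map(|(idx, (start, text))| {
            let is_first = if use_split_index { idx == 0 } else { start == 0 };
            if is_first {
                format!("\u{2581}{}", text)
            } else {
                text
            }
        })
        .collect()
}
```

On input without leading whitespace both predicates agree. With leading whitespace, the first piece starts at a nonzero original offset, so the offset check skips the prepend while the index check applies it.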
@SBrandeis SBrandeis changed the title from "feat: do not track alignments in encode_fast" to "perf: do not allocate for alignments in encode_fast" Jun 11, 2026
@ArthurZucker

Collaborator

/benchmark

@github-actions


Python Benchmark Results

Commit: f5aa6c08ff481bdd7ce4cf185a7671b0acbcdf66

Python Benchmarks

@github-actions


Rust Benchmark Results

Commit: f5aa6c08ff481bdd7ce4cf185a7671b0acbcdf66

Rust Benchmarks
