perf: do not allocate for alignments in `encode_fast` by SBrandeis · Pull Request #2097 · huggingface/tokenizers

SBrandeis · 2026-06-11T14:55:24Z

Builds on top of #2022

`encode_fast` only cares about token ids. Offsets, token strings, etc. are dropped at the very end of the pipeline (ref).

#2022 improved the performance of `encode_fast` by not computing alignment offsets during normalization, but the alignment `Vec`s were still fully allocated at construction (`NormalizedString::from`), in `set_normalized` (ref), and at every `slice()` call during pre-tokenization (ref).

This PR adds a "no alignment" fast path to the `NormalizedString` methods that skips those allocations on the `encode_fast` path (`String::new` and `Vec::new` do not allocate). It relies on a state that cannot occur when alignments are computed through `encode`: an empty alignments vec together with a non-empty normalized string.

tokenizers/tokenizers/src/tokenizer/normalizer.rs

Lines 139 to 160 in f5aa6c0

    
    /// Build a `NormalizedString` that doesn't track alignments.
    /// See [`Self::is_unaligned`].
    pub(crate) fn unaligned(normalized: String) -> Self {
        Self {
            original: String::new(),
            normalized,
            alignments: Vec::new(),
            original_shift: 0,
        }
    }

    /// Whether this string skips alignment tracking.
    ///
    /// The trick: an aligned string always has one alignment entry per byte
    /// of `normalized`, so "alignments empty but normalized not empty" is
    /// normally impossible. We use that impossible state to mean "no
    /// alignment tracking". This is what `encode_fast` runs on: same text
    /// operations, no offsets, no per-byte bookkeeping. (Empty strings count
    /// as aligned; both modes behave the same on them.)
    pub(crate) fn is_unaligned(&self) -> bool {
        self.alignments.is_empty() && !self.normalized.is_empty()
    }
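To make the sentinel state concrete, here is a minimal, self-contained sketch. `Sketch` is a hypothetical stand-in for `NormalizedString`: only the four field names come from the snippet above, and the `aligned` constructor is an assumption about what per-byte alignment construction roughly looks like, not the crate's actual code.

```rust
/// Hypothetical stand-in for `NormalizedString`; not the crate's real type.
struct Sketch {
    original: String,
    normalized: String,
    alignments: Vec<(usize, usize)>,
    original_shift: usize,
}

impl Sketch {
    /// Aligned construction (assumed shape): one `(start, end)` entry per
    /// byte of the input, plus a copy of the original text.
    fn aligned(s: &str) -> Self {
        let alignments = s
            .char_indices()
            .flat_map(|(i, c)| std::iter::repeat((i, i + c.len_utf8())).take(c.len_utf8()))
            .collect();
        Self { original: s.to_owned(), normalized: s.to_owned(), alignments, original_shift: 0 }
    }

    /// Unaligned construction: takes ownership of the text and allocates
    /// nothing else, since `String::new` and `Vec::new` do not allocate.
    fn unaligned(normalized: String) -> Self {
        Self { original: String::new(), normalized, alignments: Vec::new(), original_shift: 0 }
    }

    /// Same predicate as in the snippet above.
    fn is_unaligned(&self) -> bool {
        self.alignments.is_empty() && !self.normalized.is_empty()
    }
}

fn main() {
    let a = Sketch::aligned("héllo");
    let u = Sketch::unaligned("héllo".to_owned());

    // Aligned: one entry per byte ("é" takes two bytes, so six entries).
    assert_eq!(a.alignments.len(), a.normalized.len());
    assert!(!a.is_unaligned());

    // Unaligned: no backing storage for alignments or the original text.
    assert!(u.is_unaligned());
    assert_eq!(u.alignments.capacity(), 0);
    assert_eq!(u.original.capacity(), 0);
    assert_eq!(u.original_shift, 0);

    // An empty string always counts as aligned.
    assert!(!Sketch::unaligned(String::new()).is_unaligned());
    println!("aligned entries: {}, original bytes: {}", a.alignments.len(), a.original.len());
}
```

The design point is that no extra flag is needed: the mode is encoded in a combination of field values that the aligned path never produces.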

On my machine this results in a +42% throughput improvement on llama-3 vs the base branch (would love a /benchmark to back it up)

cargo bench --bench ci_benchmark -- 'llama3'

A possible next step would be to stop copying the `String` around at every `slice()` call, but that is a much bigger and deeper change.

HuggingFaceDocBuilderDev · 2026-06-11T14:58:53Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Guards ids/type_ids/special-mask equality for gpt2+added tokens, llama-3, bert-wiki, albert (Metaspace): single, pair, and batch. Green on the current branch; safety net for the upcoming no-alignment fast path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Make empty-alignments a first-class 'unaligned' mode: slice, transform_range and replace branch to text-only work, set_normalized switches into the mode, and extract_and_normalize_fast builds the pipeline unaligned from the start.

encode_fast now allocates no alignment vectors and no original-string clones anywhere: not at construction, not per pre-token slice, not in ByteLevel's byte-to-char transform. original_shift still accumulates slice positions so Metaspace First's 'is this the first piece' predicate (offsets_original().0 == 0) keeps working.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
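The branching `slice` described in this commit can be sketched as follows. This is an illustration only: `Sketch` is a hypothetical stand-in for `NormalizedString`, the aligned branch is a simplified guess at the bookkeeping involved, and accumulating `range.start` into `original_shift` on the unaligned branch is an assumption about what "accumulates slice positions" means.

```rust
use std::ops::Range;

/// Hypothetical stand-in for `NormalizedString`; not the crate's real type.
struct Sketch {
    original: String,
    normalized: String,
    alignments: Vec<(usize, usize)>,
    original_shift: usize,
}

impl Sketch {
    fn aligned(s: &str) -> Self {
        let alignments = s
            .char_indices()
            .flat_map(|(i, c)| std::iter::repeat((i, i + c.len_utf8())).take(c.len_utf8()))
            .collect();
        Self { original: s.to_owned(), normalized: s.to_owned(), alignments, original_shift: 0 }
    }

    fn unaligned(normalized: String) -> Self {
        Self { original: String::new(), normalized, alignments: Vec::new(), original_shift: 0 }
    }

    fn is_unaligned(&self) -> bool {
        self.alignments.is_empty() && !self.normalized.is_empty()
    }

    /// Slice by a byte range of `normalized`; `None` if the range is invalid.
    fn slice(&self, range: Range<usize>) -> Option<Self> {
        let text = self.normalized.get(range.clone())?.to_owned();
        if self.is_unaligned() {
            // encode_fast path: text-only work, no alignment vec, no clone
            // of the original. Only the shift is carried along (assumed).
            return Some(Self {
                original: String::new(),
                normalized: text,
                alignments: Vec::new(),
                original_shift: self.original_shift + range.start,
            });
        }
        // Aligned path: also slice the original text and re-base the
        // per-byte entries onto it.
        let entries = &self.alignments[range];
        let start = entries.first().map_or(0, |a| a.0);
        let end = entries.last().map_or(start, |a| a.1);
        Some(Self {
            original: self.original[start..end].to_owned(),
            normalized: text,
            alignments: entries.iter().map(|&(s, e)| (s - start, e - start)).collect(),
            original_shift: self.original_shift + start,
        })
    }
}

fn main() {
    let slow = Sketch::aligned("Hello world").slice(6..11).unwrap();
    let fast = Sketch::unaligned("Hello world".to_owned()).slice(6..11).unwrap();

    // Both paths agree on the text and on the accumulated shift.
    assert_eq!(slow.normalized, fast.normalized);
    assert_eq!(slow.original_shift, fast.original_shift);

    // Only the aligned path paid for the extra bookkeeping.
    assert_eq!(slow.alignments.len(), 5);
    assert_eq!(slow.original, "world");
    assert_eq!(fast.alignments.capacity(), 0);
    assert!(fast.original.is_empty());
}
```

Without a normalizer that changes byte positions, the two shifts coincide, which is why a first-piece check based on the shift can still work on the unaligned path in the common case.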

…ture

Adversarial review findings: the first-piece predicate diverges from encode when a normalizer strips leading content (pre-existing normalize_str-era behavior, now documented at the shift accumulation site); the normalize_str default impl still allocates one transient aligned NormalizedString per piece (doc no longer claims otherwise). New Metaspace(First) equivalence fixture with a prepend-distinguishing vocab locks in the common case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Canary for the known Metaspace(First) + leading-strip divergence: asserts the exact divergent ids on both paths so an upstream fix trips it and the caveat docs get cleaned up alongside
- Unaligned-emptied strings re-enter aligned mode and keep behaving identically (prepend on empty is a no-op on both paths)
- Bounded Range::Original on unaligned strings is a no-op; unbounded means the full string (the transform() delegation case)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

PrependScheme::First documents itself as 'prepended only once, on the first split', but the implementation checked offsets_original().0 == 0. The two disagree when a normalizer removes characters at the start of the input (e.g. Strip): the first split then maps to a nonzero original position and got no prepend. encode_fast cannot reproduce that offset check without alignments; it was the one known encode/encode_fast divergence.

Using the split index (already provided by PreTokenizedString::split) matches the documented semantics, is identical in every non-stripping case, and makes encode and encode_fast agree everywhere.

⚠️ Slow-path behavior change: Metaspace(First) + leading-stripping normalizer now prepends to the first split where it previously did not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
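The difference between the two predicates can be illustrated with a toy model. Everything here is hypothetical: `Split`, `prepend_by_offset`, and `prepend_by_index` are made-up names, and the real Metaspace pre-tokenizer does more work (for example, replacing spaces and checking whether a piece already starts with the marker).

```rust
/// Hypothetical pre-token: its text plus the byte offset of its first
/// character in the *original* (pre-normalization) input.
struct Split<'a> {
    text: &'a str,
    original_start: usize,
}

const MARK: char = '▁';

/// Old predicate: prepend only to a split whose original offset is 0.
fn prepend_by_offset(splits: &[Split<'_>]) -> Vec<String> {
    splits
        .iter()
        .map(|s| {
            if s.original_start == 0 {
                format!("{}{}", MARK, s.text)
            } else {
                s.text.to_owned()
            }
        })
        .collect()
}

/// New predicate: prepend only to the split at index 0.
fn prepend_by_index(splits: &[Split<'_>]) -> Vec<String> {
    splits
        .iter()
        .enumerate()
        .map(|(i, s)| {
            if i == 0 {
                format!("{}{}", MARK, s.text)
            } else {
                s.text.to_owned()
            }
        })
        .collect()
}

fn main() {
    // "hi there" with no stripping: the first split starts at offset 0,
    // so both predicates agree.
    let plain = [
        Split { text: "hi", original_start: 0 },
        Split { text: "▁there", original_start: 2 },
    ];
    assert_eq!(prepend_by_offset(&plain), prepend_by_index(&plain));

    // "  hi there" after a normalizer stripped the two leading spaces:
    // the first split now maps to original offset 2.
    let stripped = [
        Split { text: "hi", original_start: 2 },
        Split { text: "▁there", original_start: 4 },
    ];
    assert_eq!(prepend_by_offset(&stripped), vec!["hi", "▁there"]);
    assert_eq!(prepend_by_index(&stripped), vec!["▁hi", "▁there"]);
}
```

The index-based check needs no offsets at all, which is what lets a path without alignments produce the same result.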

ArthurZucker · 2026-06-11T16:13:30Z

/benchmark

github-actions · 2026-06-11T16:18:35Z

Python Benchmark Results

Commit: f5aa6c08ff481bdd7ce4cf185a7671b0acbcdf66

github-actions · 2026-06-11T16:22:21Z

Rust Benchmark Results

Commit: f5aa6c08ff481bdd7ce4cf185a7671b0acbcdf66

SBrandeis and others added 4 commits June 11, 2026 17:03

SBrandeis force-pushed the fast_pretok branch from 172dd82 to 6066742 June 11, 2026 15:06

SBrandeis and others added 4 commits June 11, 2026 17:24

docs: single source of truth for the unaligned mode

621d677

The invariant and the Metaspace exception are documented once, on is_unaligned(); the branches just carry an 'encode_fast path' marker. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix: Python tests

1bb74c0

test: encode_fast equivalence sweep across all pre-tokenizers

f5aa6c0

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

SBrandeis changed the title ~~feat: do not track alignments in encode_fast~~ perf: do not allocate for alignments in encode_fast Jun 11, 2026

SBrandeis wants to merge 8 commits into fast-extract-normalize from fast_pretok

3 participants
