SentencePiece

SentencePiece is a fast, lightweight, and unsupervised text tokenizer and detokenizer designed for neural network-based text generation systems (such as Large Language Models) where the vocabulary size is fixed prior to training.

It implements subword units—including Byte-Pair-Encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo.]—with the ability to train directly from raw sentences. By treating input text as a raw sequence of Unicode characters, SentencePiece enables a purely end-to-end, language-independent pipeline that completely eliminates the need for language-specific pre- or post-processing.

This is not an official Google product.


Quick Start (Python)

SentencePiece provides an easy-to-use Python module. Install it via pip:

pip install sentencepiece

Basic Example

Here is how to train a model, encode text into tokens/IDs, and decode them back to the original string:

import sentencepiece as spm

# 1. Train a model directly from a raw text file.
# (No pre-tokenization or language-specific preprocessing required!)
spm.SentencePieceTrainer.train(
    input='data/botchan.txt', 
    model_prefix='m', 
    vocab_size=1000
)

# 2. Load the trained model.
sp = spm.SentencePieceProcessor(model_file='m.model')

# 3. Encode raw text into subword pieces (strings) or vocabulary IDs (integers).
text = "I saw a girl with a telescope."
pieces = sp.encode(text, out_type=str)
ids = sp.encode(text, out_type=int)

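print(f"Pieces: {pieces}")
# Example output (the exact pieces depend on the trained model):
# ['▁I', '▁saw', '▁a', '▁girl', '▁with', '▁a', '▁', 'te', 'le', 's', 'c', 'o', 'pe', '.']

print(f"IDs:    {ids}")
# Example output (the exact IDs depend on the trained model):
# [9, 459, 11, 939, 44, 11, 4, 142, 82, 8, 28, 21, 132, 6]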

# 4. Decode IDs or pieces back into the original text.
# Decoding reproduces the normalized input text, so the round trip is reversible.
print(sp.decode(ids))
# Output: "I saw a girl with a telescope."

print(sp.decode(pieces))
# Output: "I saw a girl with a telescope."

Why SentencePiece?

1. Reversible & Lossless Tokenization (Whitespace as a Basic Symbol)

Traditional tokenizers drop whitespace information (e.g., treating Tokenize("World.") identically to Tokenize("World .")), making detokenization ambiguous and language-dependent.

SentencePiece treats the input text as a raw sequence of Unicode characters, so whitespace is handled as an ordinary symbol. It escapes each whitespace character with the meta-symbol "▁" (U+2581) and keeps it in the tokenized output. As a result, detokenization is a simple string join that does not depend on the language:

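# Lossless detokenization: join the pieces and map the meta-symbol back to a space.
# With the default dummy prefix the first piece starts with "▁", so drop the leading space.
original_text = "".join(pieces).replace("▁", " ").lstrip(" ")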
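As a rough illustration of why escaping whitespace removes the ambiguity, the sketch below escapes spaces with U+2581 and splits the result with a deliberately naive segmenter. The helper names (escape_whitespace, toy_segment, unescape_whitespace) are hypothetical and are not part of the SentencePiece API; a trained model would choose its pieces very differently.

```python
from typing import List

META = "\u2581"  # "▁", the whitespace meta-symbol


def escape_whitespace(text: str) -> str:
    """Replace each space with the meta-symbol and prepend one as a dummy prefix."""
    return META + text.replace(" ", META)


def toy_segment(escaped: str) -> List[str]:
    """Hypothetical segmenter: start a new piece at each meta-symbol and at each '.'."""
    pieces: List[str] = []
    current = ""
    for ch in escaped:
        if ch in (META, "."):
            if current:
                pieces.append(current)
            current = ch
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def unescape_whitespace(pieces: List[str]) -> str:
    """Join the pieces, restore spaces, and drop the space left by the dummy prefix."""
    joined = "".join(pieces).replace(META, " ")
    return joined[1:] if joined.startswith(" ") else joined


for sample in ("Hello World.", "Hello World ."):
    sample_pieces = toy_segment(escape_whitespace(sample))
    # The two inputs produce different piece sequences, and both round-trip exactly.
    print(sample_pieces, "->", repr(unescape_whitespace(sample_pieces)))
```

Because the space before the period survives as its own "▁" piece, the two inputs stay distinguishable after tokenization, which is what makes the join-based detokenization unambiguous.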

2. Purely Data-Driven & Language-Independent

SentencePiece trains tokenization and detokenization models directly from raw sentences. It does not require language-specific pre-tokenizers (such as Moses, MeCab, or KyTea). This makes it highly effective for languages without explicit word boundaries, such as Chinese, Japanese, and Korean.
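To make the "no pre-tokenizer" idea more concrete, here is a minimal toy sketch of BPE-style merge learning that operates directly on raw sentences, with spaces escaped as U+2581 instead of being used to pre-split the text. It is not the library's training algorithm. The function names are hypothetical, and the rule that the meta-symbol may only appear at the start of a piece is an assumption made for this sketch.

```python
from collections import Counter
from typing import List, Tuple

META = "\u2581"  # "▁", the whitespace meta-symbol


def merge_pair(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace every non-overlapping occurrence of `pair` with the merged symbol."""
    merged: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(sentences: List[str], num_merges: int):
    """Greedily learn up to `num_merges` merges from raw, unsegmented sentences."""
    # Start from single characters; spaces become the meta-symbol.
    corpus = [list(META + s.replace(" ", META)) for s in sentences]
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols in corpus:
            for left, right in zip(symbols, symbols[1:]):
                # Sketch assumption: a piece may start with "▁" but not contain it later.
                if right.startswith(META):
                    continue
                pair_counts[(left, right)] += 1
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken lexicographically.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges, corpus


# The same code handles text with and without explicit word boundaries.
demo_merges, demo_corpus = learn_bpe_merges(["東京に行く", "東京に住む", "京都に行く"], 3)
print(demo_merges)
print(demo_corpus)
```

Nothing in the sketch knows which language it is looking at. The merges are driven only by symbol-pair frequencies in the raw text.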

3. Subword Regularization & BPE-Dropout

To improve the robustness and accuracy of translation and language models, SentencePiece supports on-the-fly subword sampling during training. By sampling different segmentations for the same input text (Subword Regularization for Unigram, BPE-Dropout for BPE), it virtually augments your training data and makes the model more resilient to spelling variations and noise.

# Sample different segmentations on-the-fly
for _ in range(3):
    print(sp.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))
# May output:
# ['▁', 'N', 'e', 'w', '▁York']
# ['▁New', '▁York']
# ['▁New', '▁Y', 'o', 'r', 'k']
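The idea behind sampling can be sketched in plain Python without the library. The example assumes a unigram model in which a segmentation's score is the product of its piece probabilities, and it smooths those scores by raising them to the power alpha before normalizing. The vocabulary, its probabilities, and all function names are hypothetical. The library itself does not enumerate segmentations this way, so treat this as a conceptual sketch only.

```python
import math
import random
from typing import Dict, List, Tuple

# Hypothetical unigram vocabulary with made-up probabilities.
TOY_VOCAB: Dict[str, float] = {
    "\u2581New": 0.2, "\u2581York": 0.2, "\u2581": 0.05, "\u2581Y": 0.02,
    "N": 0.02, "e": 0.05, "w": 0.03, "o": 0.05, "r": 0.05, "k": 0.03,
}


def all_segmentations(text: str, vocab: Dict[str, float]) -> List[List[str]]:
    """Enumerate every way to split `text` into pieces that exist in `vocab`."""
    if not text:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in all_segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


def sampling_distribution(
    text: str, vocab: Dict[str, float], alpha: float
) -> List[Tuple[List[str], float]]:
    """Return (segmentation, probability) pairs with scores smoothed by `alpha`."""
    segmentations = all_segmentations(text, vocab)
    # weight = (product of piece probabilities) ** alpha, computed in log space
    weights = [
        math.exp(alpha * sum(math.log(vocab[p]) for p in seg))
        for seg in segmentations
    ]
    total = sum(weights)
    return [(seg, w / total) for seg, w in zip(segmentations, weights)]


def sample_segmentation(
    text: str, vocab: Dict[str, float], alpha: float, rng: random.Random
) -> List[str]:
    """Draw one segmentation from the smoothed distribution."""
    dist = sampling_distribution(text, vocab, alpha)
    segs = [seg for seg, _ in dist]
    probs = [prob for _, prob in dist]
    return rng.choices(segs, weights=probs, k=1)[0]


rng = random.Random(0)
for _ in range(3):
    print(sample_segmentation("\u2581New\u2581York", TOY_VOCAB, alpha=0.1, rng=rng))
```

In this sketch a smaller alpha flattens the distribution, so less likely segmentations are drawn more often. A larger alpha concentrates the samples on the highest-scoring segmentation.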

4. Fast, Lightweight, and Self-Contained

  • Performance: Written in highly optimized C++. Segmentation speed is around 50,000 sentences per second, with a memory footprint of only ~6MB.
  • Self-Contained: The generated .model file contains all of the normalization rules, the vocabulary mapping, and the segmentation model. You are guaranteed to get the exact same tokenization results in any environment (C++, Python, Go, etc.) as long as you use the same model file.

Documentation & Resources

For detailed guides, API references, and advanced usage, please refer to the following resources:


License

SentencePiece is licensed under the Apache 2.0 License.
