SentencePiece

SentencePiece is a fast, lightweight, and unsupervised text tokenizer and detokenizer designed for neural network-based text generation systems (such as Large Language Models) where the vocabulary size is fixed prior to training.

It implements two subword segmentation algorithms, Byte-Pair-Encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo], and can train both directly from raw sentences. By treating input text as a raw sequence of Unicode characters, SentencePiece enables a purely end-to-end, language-independent pipeline that needs no language-specific pre- or post-processing.

This is not an official Google product.

Quick Start (Python)

SentencePiece provides an easy-to-use Python module. Install it via pip:

pip install sentencepiece

Basic Example

Here is how to train a model, encode text into tokens/IDs, and decode them back to the original string:

import sentencepiece as spm

# 1. Train a model directly from a raw text file.
# (No pre-tokenization or language-specific preprocessing required!)
spm.SentencePieceTrainer.train(
    input='data/botchan.txt', 
    model_prefix='m', 
    vocab_size=1000
)

# 2. Load the trained model.
sp = spm.SentencePieceProcessor(model_file='m.model')

# 3. Encode raw text into subword pieces (strings) or vocabulary IDs (integers).
text = "I saw a girl with a telescope."
pieces = sp.encode(text, out_type=str)
ids = sp.encode(text, out_type=int)

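print(f"Pieces: {pieces}")
# Example output (exact pieces depend on the trained model):
# Pieces: ['▁I', '▁saw', '▁a', '▁girl', '▁with', '▁a', '▁', 'te', 'le', 's', 'c', 'o', 'pe', '.']

print(f"IDs:    {ids}")
# Example output (exact IDs depend on the trained model):
# IDs:    [9, 459, 11, 939, 44, 11, 4, 142, 82, 8, 28, 21, 132, 6]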

# 4. Decode IDs or pieces back into text.
# Decoding reproduces the input up to the normalization applied during
# encoding (NFKC-based by default), so already-normalized text round-trips exactly.
print(sp.decode(ids))
# Output: I saw a girl with a telescope.

print(sp.decode(pieces))
# Output: I saw a girl with a telescope.

Why SentencePiece?

1. Reversible & Lossless Tokenization (Whitespace as a Basic Symbol)

Traditional tokenizers drop whitespace information (e.g., treating Tokenize("World.") identically to Tokenize("World .")), making detokenization ambiguous and language-dependent.

SentencePiece treats the input text as a raw sequence of Unicode characters. It escapes whitespace with the meta-symbol ▁ (U+2581) and keeps it as part of the tokens. As a result, detokenization is a simple, lossless string join that does not depend on the language:

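# Lossless detokenization.
# By default a "▁" is prepended to the input, so the joined string starts
# with one extra space; drop it to recover the original text.
original_text = "".join(pieces).replace("▁", " ")
original_text = original_text[1:] if original_text.startswith(" ") else original_text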
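The escaping idea can be illustrated without the library. The sketch below is a toy example only: toy_tokenize and toy_detokenize are hypothetical names, and the splitting rule (start a new piece at every ▁ and at every period) is an assumption made for illustration, not how SentencePiece segments text. It shows that "World." and "World ." produce different pieces, so both can be recovered exactly.

```python
from typing import List

WS = "\u2581"  # the meta-symbol "▁" (U+2581)


def toy_tokenize(text: str) -> List[str]:
    """Escape spaces as WS, then split at every WS marker and every period."""
    escaped = WS + text.replace(" ", WS)  # prepend a marker, like a dummy prefix
    pieces: List[str] = []
    for ch in escaped:
        if ch == WS or ch == "." or not pieces:
            pieces.append(ch)   # start a new piece
        else:
            pieces[-1] += ch    # extend the current piece
    return pieces


def toy_detokenize(pieces: List[str]) -> str:
    """Join the pieces, unescape WS, and drop the one space added by the prefix."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text


for sample in ["Hello World.", "Hello World ."]:
    pieces = toy_tokenize(sample)
    print(pieces, "->", repr(toy_detokenize(pieces)))
```

Because the whitespace marker lives inside the pieces, the detokenizer needs no language-specific rules about where spaces belong.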

2. Purely Data-Driven & Language-Independent

SentencePiece trains tokenization and detokenization models directly from raw sentences. It does not require language-specific pre-tokenizers (such as Moses, MeCab, or KyTea). This makes it highly effective for languages without explicit word boundaries, such as Chinese, Japanese, and Korean.
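To make "data-driven" concrete, here is a toy sketch of BPE merge learning that works directly on raw sentences, with spaces escaped as ▁ so that no pre-tokenizer is involved. The function names (learn_bpe_merges, merge_pair) are hypothetical, and the procedure is a simplified textbook version; the actual SentencePiece trainer differs in many details.

```python
from collections import Counter
from typing import List, Tuple

WS = "\u2581"  # whitespace meta-symbol "▁"


def merge_pair(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(sentences: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Greedily learn up to `num_merges` merges from raw sentences."""
    # Every sentence starts as a list of single characters, spaces escaped.
    corpus = [list(WS + s.replace(" ", WS)) for s in sentences]
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent adjacent pair; ties are broken deterministically.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


# The same code handles text with and without explicit word boundaries.
print(learn_bpe_merges(["low low low", "吾輩は猫である"], num_merges=3))
```

Nothing in the procedure knows what a "word" is: it only counts adjacent symbols, which is why the same code path applies to Japanese and English input alike.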

3. Subword Regularization & BPE-Dropout

To improve the robustness and accuracy of translation and language models, SentencePiece supports on-the-fly subword sampling during training. By sampling different segmentations for the same input text (Subword Regularization for Unigram, BPE-Dropout for BPE), it virtually augments your training data and makes the model more resilient to spelling variations and noise.

# Sample different segmentations on-the-fly
for _ in range(3):
    print(sp.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))
# May output:
# ['▁', 'N', 'e', 'w', '▁York']
# ['▁New', '▁York']
# ['▁New', '▁Y', 'o', 'r', 'k']
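The sampling idea behind subword regularization can also be sketched in pure Python. The example below assumes a tiny hand-written unigram vocabulary with made-up log-probabilities (TOY_VOCAB), enumerates every segmentation of a string, and samples one with probability proportional to P(segmentation)^alpha. All names are hypothetical, and brute-force enumeration is used only for clarity; it is not how the library samples.

```python
import math
import random
from typing import Dict, List

# Made-up piece log-probabilities, for illustration only.
TOY_VOCAB: Dict[str, float] = {
    "▁New": -2.0, "▁": -3.0, "▁N": -6.0,
    "N": -5.0, "e": -4.0, "w": -5.0, "ew": -6.0,
}


def all_segmentations(text: str, vocab: Dict[str, float]) -> List[List[str]]:
    """Return every split of `text` into pieces that all appear in `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in all_segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


def sample_segmentation(text: str, vocab: Dict[str, float],
                        alpha: float, rng: random.Random) -> List[str]:
    """Sample a segmentation with weight proportional to P(segmentation)**alpha."""
    candidates = all_segmentations(text, vocab)
    if not candidates:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Unigram model: log P(segmentation) is the sum of piece log-probabilities.
    scores = [alpha * sum(vocab[p] for p in seg) for seg in candidates]
    top = max(scores)  # subtract the max before exp() for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
for _ in range(3):
    print(sample_segmentation("▁New", TOY_VOCAB, alpha=0.1, rng=rng))
```

A small alpha flattens the distribution so that rare segmentations appear more often, while a large alpha concentrates the samples on the single most probable segmentation.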

4. Fast, Lightweight, and Self-Contained

Performance: Written in C++. Segmentation runs at around 50,000 sentences per second with a memory footprint of about 6 MB.
Self-Contained: The generated .model file contains all normalization rules, the vocabulary mapping, and the segmentation model. The same model file produces the same tokenization in any environment (C++, Python, Go, etc.).

Documentation & Resources

For detailed guides, API references, and advanced usage, please refer to the following resources:

License

SentencePiece is licensed under the Apache 2.0 License.
