SentencePiece Command Line Interface (CLI) & Build Guide

This document describes how to build and install the SentencePiece C++ libraries and command-line tools, and how to use them to train models, encode text, and decode token sequences.

1. Installation & Build

Prerequisites

The following tools and libraries are required to build SentencePiece:

CMake (3.1 or later)
C++11 compiler (e.g., gcc 4.8+ or clang 3.3+)
gperftools library (optional, provides a 10-40% performance improvement)

Building from Source (Linux/macOS)

On Ubuntu/Debian-based systems, install the prerequisites using apt-get:

sudo apt-get install cmake build-essential pkg-config libgoogle-perftools-dev

Then, clone the repository and build the command line tools:

git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v

Note for macOS: Replace the last command (sudo ldconfig -v) with:

sudo update_dyld_shared_cache

Installing via vcpkg

You can download and install SentencePiece using the vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install sentencepiece

Verifying Signed Release Wheels (SLSA3)

Pre-built wheels are available on the GitHub releases page, secured with SLSA Level 3 signatures. To verify and install:

Install the verification tool from slsa-verifier.
Download the provenance file attestation.intoto.jsonl from the releases page.

Run the verifier:

slsa-verifier -artifact-path <the-wheel> -provenance attestation.intoto.jsonl -source github.com/google/sentencepiece -tag <the-tag>

Install the verified wheel:
```
pip install wheel_file.whl
```

2. CLI Usage Instructions

Train SentencePiece Model (`spm_train`)

Use spm_train to train a new tokenization model from a raw text corpus.

spm_train --input=<input> --model_prefix=<model_name> --vocab_size=8000 --character_coverage=1.0 --model_type=<type>

Key Parameters:

--input: Path to a one-sentence-per-line raw corpus file. No pre-tokenization or preprocessing is required. Supports a comma-separated list of files.
--model_prefix: Output model name prefix. Generates <model_name>.model (binary model) and <model_name>.vocab (human-readable vocabulary list).
--vocab_size: Desired vocabulary size (e.g., 8000, 16000, 32000).
--character_coverage: The ratio of characters in the corpus covered by the model. Recommended defaults:
- 0.9995 for languages with rich character sets (e.g., Japanese, Chinese, Korean) to prune rare noise characters/emojis.
- 1.0 for languages with small alphabets (e.g., English, European languages).
--model_type: The tokenization algorithm: unigram (default), bpe, char, or word (requires pre-tokenized input).

For a complete list of training options, see Training Options.

Encode Raw Text (`spm_encode`)

Segment raw text into subword pieces or vocabulary IDs.

Basic Encoding

Encode to subword pieces:

spm_encode --model=<model_file> --output_format=piece < input.txt > output.txt

Encode to vocabulary IDs:

spm_encode --model=<model_file> --output_format=id < input.txt > output.txt

Adding Special Tokens

Use --extra_options to insert BOS/EOS markers:

Add EOS only:

spm_encode --model=<model_file> --extra_options=eos < input.txt

Add both BOS and EOS:

spm_encode --model=<model_file> --extra_options=bos:eos < input.txt

N-best Segmentation and Sampling (Subword Regularization)

Sample pieces (with temperature/alpha):

spm_encode --model=<model_file> --output_format=sample_piece --nbest_size=-1 --alpha=0.5 < input.txt

Generate top 10 segmentations as IDs:

spm_encode --model=<model_file> --output_format=nbest_id --nbest_size=10 < input.txt

Decode Pieces/IDs into Raw Text (`spm_decode`)

Restore raw text from subword pieces or IDs.

Decode from pieces:

spm_decode --model=<model_file> --input_format=piece < input.txt > output.txt

Decode from IDs:

spm_decode --model=<model_file> --input_format=id < input.txt > output.txt

End-to-End Example

Here is a complete workflow using a sample corpus:

# 1. Train a model with a vocabulary size of 1000
spm_train --input=data/botchan.txt --model_prefix=m --vocab_size=1000

# 2. Encode text to subword pieces
echo "I saw a girl with a telescope." | spm_encode --model=m.model
# Output: ▁I ▁saw ▁a ▁girl ▁with ▁a ▁ te le s c o pe .

# 3. Encode text to IDs
echo "I saw a girl with a telescope." | spm_encode --model=m.model --output_format=id
# Output: 9 459 11 939 44 11 4 142 82 8 28 21 132 6

# 4. Decode IDs back to raw text
echo "9 459 11 939 44 11 4 142 82 8 28 21 132 6" | spm_decode --model=m.model --input_format=id
# Output: I saw a girl with a telescope.

Export Vocabulary (`spm_export_vocab`)

Export the vocabulary list along with emission log probabilities:

spm_export_vocab --model=<model_file> --output=<output_file>

The output file stores a list of vocabulary pieces and their scores. The vocabulary ID corresponds to the 0-indexed line number in this file.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SentencePiece Command Line Interface (CLI) & Build Guide

1. Installation & Build

Prerequisites

Building from Source (Linux/macOS)

Installing via vcpkg

Verifying Signed Release Wheels (SLSA3)

2. CLI Usage Instructions

Train SentencePiece Model (`spm_train`)

Key Parameters:

Encode Raw Text (`spm_encode`)

Basic Encoding

Adding Special Tokens

N-best Segmentation and Sampling (Subword Regularization)

Decode Pieces/IDs into Raw Text (`spm_decode`)

End-to-End Example

Export Vocabulary (`spm_export_vocab`)

FilesExpand file tree

cli.md

Latest commit

History

cli.md

File metadata and controls

SentencePiece Command Line Interface (CLI) & Build Guide

1. Installation & Build

Prerequisites

Building from Source (Linux/macOS)

Installing via vcpkg

Verifying Signed Release Wheels (SLSA3)

2. CLI Usage Instructions

Train SentencePiece Model (spm_train)

Key Parameters:

Encode Raw Text (spm_encode)

Basic Encoding

Adding Special Tokens

N-best Segmentation and Sampling (Subword Regularization)

Decode Pieces/IDs into Raw Text (spm_decode)

End-to-End Example

Export Vocabulary (spm_export_vocab)

Train SentencePiece Model (`spm_train`)

Encode Raw Text (`spm_encode`)

Decode Pieces/IDs into Raw Text (`spm_decode`)

Export Vocabulary (`spm_export_vocab`)