GPT-2 Example

This project demonstrates how instruction mix affects performance in a compute-bound workload. It uses a GPT-2 Medium text-generation engine running on the CPU, where throughput is largely determined by how efficiently the hot matrix multiplication kernel uses the processor's scalar and vector execution units.

The baseline binary generates text one token at a time and reports throughput in tokens per second. This repository is designed to be profiled with Arm Performix to explain why different implementations run at different speeds.

Overview

The program loads GPT-2 model weights and generates new text from a prompt, one token at a time. The main throughput metric is tokens per second (tok/s): higher tok/s means faster text generation.
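As a rough illustration of the metric, tok/s is the number of generated tokens divided by the wall-clock time spent generating them. The sketch below assumes that definition; `tokens_per_second` and `measure_tok_per_s` are hypothetical helper names, not functions from this repository, and the real binary may time generation differently.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: throughput in tokens per second.
// Returns 0 when no measurable time has elapsed, to avoid dividing by zero.
double tokens_per_second(int tokens, double elapsed_seconds) {
    assert(tokens >= 0);
    if (elapsed_seconds <= 0.0) return 0.0;
    return static_cast<double>(tokens) / elapsed_seconds;
}

// Hypothetical helper: times any callable that generates `tokens` tokens
// and converts the elapsed wall-clock time into tok/s.
template <typename GenerateFn>
double measure_tok_per_s(int tokens, GenerateFn&& generate) {
    const auto start = std::chrono::steady_clock::now();
    generate();
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = end - start;
    return tokens_per_second(tokens, elapsed.count());
}
```

For instance, 50 tokens generated in 12.5 seconds corresponds to 4.0 tok/s under this definition.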

Internally, each generated token requires repeated matrix multiplications across many model layers. In this workload, the dominant kernel is matmul, so the performance difference between binaries comes primarily from how that kernel is implemented.

Baseline: Scalar matrix multiply (`gpt2`)

The baseline implementation uses a straightforward scalar matmul loop. On Arm systems with vector hardware, this means the hottest arithmetic path can spend most of its time issuing scalar loads, scalar floating-point instructions, and loop-control instructions rather than wider SIMD operations.
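A minimal sketch of what a straightforward scalar matmul loop can look like is shown here. The function name, signature, and row-major layout are assumptions made for illustration; the repository's reference kernel may use a different interface or memory layout.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative naive matmul: C (m x n) = A (m x k) * B (k x n), all row-major.
// The name and signature are hypothetical, not taken from the repository.
void matmul_scalar(const float* a, const float* b, float* c,
                   std::size_t m, std::size_t k, std::size_t n) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;  // single scalar accumulator per output element
            for (std::size_t p = 0; p < k; ++p) {
                // As written, each iteration is two scalar loads and one multiply-add.
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}
```

Each inner iteration handles a single pair of floats, which is the pattern that shows up in a profile as scalar loads, scalar floating-point instructions, and loop control.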

SIMD and library variants (`gpt2_neon`, `gpt2_sve`, `gpt2_kai_sve`)

This repository also includes variants that replace the scalar matmul path with:

NEON vector code (gpt2_neon)
SVE vector code (gpt2_sve)
KleidiAI SVE microkernels (gpt2_kai_sve)

These variants keep the model and generation algorithm the same, but change the instructions executed by the CPU. In the tutorial workflow, you use Performix to confirm that the hot path shifts from scalar floating-point work toward vector instructions, and then verify the effect by measuring higher tok/s.
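To make the shift from scalar to vector instructions concrete, the following sketch vectorizes the inner dot product of a matmul with NEON intrinsics. It is not the code used by gpt2_neon; `dot_f32` is a hypothetical name, and the scalar fallback exists only so the example also compiles on machines without AArch64 NEON.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: dot product of two float arrays of length n.
float dot_f32(const float* x, const float* y, std::size_t n) {
    assert(x != nullptr && y != nullptr);
    std::size_t i = 0;
    float sum = 0.0f;
#if defined(__aarch64__) && defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);      // four partial sums in one register
    for (; i + 4 <= n; i += 4) {
        const float32x4_t vx = vld1q_f32(x + i);  // load 4 floats
        const float32x4_t vy = vld1q_f32(y + i);  // load 4 floats
        acc = vfmaq_f32(acc, vx, vy);             // acc += vx * vy (fused multiply-add)
    }
    sum = vaddvq_f32(acc);                    // horizontal add of the four lanes
#endif
    // Scalar tail (or the whole loop when NEON is unavailable).
    for (; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

On AArch64, one loop iteration now processes four elements with two vector loads and one fused multiply-add, which is the kind of change in instruction mix the profiler should reveal.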

Learner kernel exercise (`gpt2_user`)

You can also build a dedicated learner target where you implement your own matmul kernel in src/kernels/matmul_user.cpp.

gpt2_user starts from the same naive scalar matmul loop as the baseline gpt2.
It is built without extra target-specific optimization flags, so initial throughput should match gpt2 closely.
Implement kernels::matmul_user(...) using NEON or SVE intrinsics.
Keep gpt2, gpt2_neon, gpt2_sve, and gpt2_kai_sve as reference solutions.
Rebuild and compare your gpt2_user throughput against the reference binaries.

Prerequisites

A Linux system, ideally an Arm Neoverse-based bare-metal cloud instance such as AWS c7g.metal
GCC 11+ or Clang 14+
CMake 3.16+
Python 3.8+ with pip
Internet access for the initial model export step
Arm Performix installed and configured

Note: gpt2_sve and gpt2_kai_sve are only built on AArch64 systems with SVE support.

Build

Before building the C++ binaries, export the GPT-2 Medium weights into the binary format used by this repository:

python3 -m venv venv
source venv/bin/activate
pip install -r src/requirements.txt
python3 src/export_gpt2.py --model gpt2-medium

This downloads OpenAI’s GPT-2 Medium model from Hugging Face. The model contains 355 million parameters stored in 32-bit floating-point format. Because it is relatively small and unquantized, it is a convenient example for demonstration and experimentation.
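As a back-of-the-envelope check on the model size, 355 million 32-bit parameters occupy roughly 1.42 GB. The sketch below works through that arithmetic; the helper names are hypothetical, the parameter count is the rounded figure quoted above, and the exported weights file may differ slightly in size if it carries headers or other metadata.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: raw storage needed for `params` values of a given width.
constexpr std::uint64_t weight_bytes(std::uint64_t params,
                                     std::uint64_t bytes_per_param) {
    return params * bytes_per_param;
}

// Hypothetical helper: converts bytes to GiB (2^30 bytes).
constexpr double to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}

// 355 million parameters (rounded) at 4 bytes each (32-bit float).
constexpr std::uint64_t kApproxParams = 355000000ULL;
constexpr std::uint64_t kApproxBytes = weight_bytes(kApproxParams, 4);
static_assert(kApproxBytes == 1420000000ULL, "about 1.42 GB of raw weights");
```

That is about 1.42 GB in decimal units, or a little over 1.3 GiB.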

The model is available under a modified MIT License. The export script writes the model weights and vocabulary data to the following files:

models/gpt2-medium/weights.bin
models/gpt2-medium/vocab.bin

Next, configure and build the project:

cmake -S . -B build
cmake --build build --parallel

To enable the learner-owned kernel target (gpt2_user), use the dedicated CMake option:

cmake -S . -B build -DBUILD_USER_MATMUL=ON
cmake --build build --parallel

The CMake build can produce the following binaries, depending on the host architecture:

build/gpt2
build/gpt2_neon
build/gpt2_sve
build/gpt2_kai_sve
build/gpt2_user (only when -DBUILD_USER_MATMUL=ON)

Run

Run the baseline workload and record the final throughput:

./build/gpt2 --model gpt2-medium "Once upon a time" -n 50

When generation completes, the program prints a summary such as:

[50 tokens, 3.4 tok/s]

You can change the prompt and the number of generated tokens with the positional prompt argument and the -n flag.

To compare all available binaries in one step, use:

./compare_gpt2_variants.sh gpt2-medium "Once upon a time" 20 1 1

The script rebuilds the project, runs gpt2, gpt2_neon, gpt2_sve, gpt2_kai_sve, and gpt2_user, and prints the average tok/s for each one.

Project Structure

README.md
AGENTS.md
CMakeLists.txt
License.md
compare_gpt2_variants.sh
assets/
    gpt_kai_sve_textgen.gif
models/
    gpt2-medium/
        vocab.bin
        weights.bin
src/
    export_gpt2.py
    gpt2.cpp
    gpt2_kai_sve.cpp
    kernels/
        matmul.h
        matmul_kai_sve.cpp
        matmul_neon.cpp
        matmul_ref.cpp
        matmul_sve.cpp
        matmul_user.cpp
    kleidiai/

License

This project is licensed under the Arm Education End User License Agreement for Teaching and Learning Content. It is provided for non-commercial educational purposes only. See License.md for the full terms.

For commercial use enquiries, contact education@arm.com.
