This project demonstrates how instruction mix affects performance in a compute-bound workload. It uses a GPT-2 Medium text-generation engine running on the CPU, where throughput is largely determined by how efficiently the hot matrix multiplication kernel uses the processor's scalar and vector execution units.
The baseline binary generates text one token at a time and reports throughput in tokens per second. This repository is designed to be profiled with Arm Performix to explain why different implementations run at different speeds.
The program loads GPT-2 model weights and generates new text from a prompt, one token at a time. The main throughput metric is tokens per second (tok/s): higher tok/s means faster text generation.
Internally, each generated token requires repeated matrix multiplications across many model layers. In this workload, the dominant kernel is matmul, so the performance difference between binaries comes primarily from how that kernel is implemented.
The baseline implementation uses a straightforward scalar matmul loop. On Arm systems with vector hardware, this means the hottest arithmetic path can spend most of its time issuing scalar loads, scalar floating-point instructions, and loop-control instructions rather than wider SIMD operations.
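To make the instruction mix concrete, here is a minimal sketch of what a naive scalar matmul loop can look like. The function name `matmul_scalar`, its signature, and the row-major weight layout are assumptions for illustration only; the repository's actual kernel may be organized differently.

```cpp
#include <cstddef>

// Hypothetical signature (not taken from the repository):
// out[r] = sum over c of w[r * cols + c] * x[c], with w stored row-major.
void matmul_scalar(float* out, const float* x, const float* w,
                   std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            // Each iteration loads two scalars and performs one scalar
            // multiply-add, plus the loop counter update and branch.
            acc += w[r * cols + c] * x[c];
        }
        out[r] = acc;
    }
}
```

In a loop of this shape, every multiply-add processes a single float, which is the scalar-heavy instruction mix described above.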
This repository also includes variants that replace the scalar matmul path with:
- NEON vector code (gpt2_neon)
- SVE vector code (gpt2_sve)
- KleidiAI SVE microkernels (gpt2_kai_sve)
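As a rough illustration of what moving the inner loop to vector instructions can look like, the sketch below computes a dot product with NEON intrinsics when compiled for AArch64 and falls back to scalar code elsewhere. The helper name `dot_f32` is hypothetical, and this is not the kernel used by gpt2_neon; it only shows the general pattern of processing four floats per instruction.

```cpp
#include <cstddef>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: dot product of two float arrays of length n.
float dot_f32(const float* a, const float* b, std::size_t n) {
    std::size_t i = 0;
    float sum = 0.0f;
#if defined(__aarch64__) && defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);    // four partial sums, one per lane
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);  // load 4 floats from a
        float32x4_t vb = vld1q_f32(b + i);  // load 4 floats from b
        acc = vfmaq_f32(acc, va, vb);       // acc += va * vb, lane-wise
    }
    sum = vaddvq_f32(acc);                  // add the 4 lanes together
#endif
    // Scalar tail for leftover elements (or the whole array without NEON).
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because the vector version accumulates four partial sums and combines them at the end, its result can differ from the scalar loop in the last few bits for general floating-point inputs.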
These variants keep the model and generation algorithm the same, but change the instructions executed by the CPU. In the tutorial workflow, you use Performix to confirm that the hot path shifts from scalar floating-point work toward vector instructions and then verify the effect with higher tok/s.
You can also build a dedicated learner target where you implement your own matmul kernel in src/kernels/matmul_user.cpp.
- gpt2_user starts from the same naive scalar matmul loop as baseline gpt2.
- It is built without extra target-specific optimization flags, so initial throughput should match gpt2 closely.
- Implement kernels::matmul_user(...) using NEON or SVE intrinsics.
- Keep gpt2, gpt2_neon, gpt2_sve, and gpt2_kai_sve as reference solutions.
- Rebuild and compare your gpt2_user throughput against the reference binaries.
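Before comparing throughput, it can help to confirm that a rewritten kernel still produces the same numbers as the scalar loop. The following is a minimal sketch of such a check. The `MatmulFn` signature, the helper names, and the test data pattern are all assumptions for illustration; adapt them to the actual declaration of kernels::matmul_user(...) in the repository.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical kernel signature used only for this sketch.
using MatmulFn = void (*)(float* out, const float* x, const float* w,
                          std::size_t rows, std::size_t cols);

// Scalar reference with the same assumed signature (row-major weights).
void matmul_reference(float* out, const float* x, const float* w,
                      std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            acc += w[r * cols + c] * x[c];
        }
        out[r] = acc;
    }
}

// Returns true if `candidate` agrees with the scalar reference within an
// absolute tolerance `tol` on a small deterministic input.
bool matches_reference(MatmulFn candidate, std::size_t rows,
                       std::size_t cols, float tol) {
    std::vector<float> w(rows * cols), x(cols), expected(rows), actual(rows);
    for (std::size_t i = 0; i < w.size(); ++i) {
        w[i] = 0.01f * static_cast<float>(i % 7) - 0.03f;
    }
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] = 0.5f - 0.1f * static_cast<float>(i % 5);
    }
    matmul_reference(expected.data(), x.data(), w.data(), rows, cols);
    candidate(actual.data(), x.data(), w.data(), rows, cols);
    for (std::size_t r = 0; r < rows; ++r) {
        float diff = std::fabs(expected[r] - actual[r]);
        if (!(diff <= tol)) {  // written this way so NaN also fails
            return false;
        }
    }
    return true;
}
```

Choosing a column count that is not a multiple of the vector width (for example 33) exercises the tail handling of a vectorized kernel. A small nonzero tolerance is needed because vector code usually sums in a different order than the scalar loop.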
- A Linux system, ideally an Arm Neoverse-based cloud metal instance such as AWS c7g.metal
- GCC 11+ or Clang 14+
- CMake 3.16+
- Python 3.8+ with pip
- Internet access for the initial model export step
- Arm Performix installed and configured
Note: gpt2_sve and gpt2_kai_sve are only built on AArch64 systems with SVE support.
Before building the C++ binaries, export the GPT-2 Medium weights into the binary format used by this repository:
python3 -m venv venv
source venv/bin/activate
pip install -r src/requirements.txt
python3 src/export_gpt2.py --model gpt2-medium
This downloads OpenAI’s GPT-2 Medium model from Hugging Face. The model contains 355 million parameters stored in 32-bit floating-point format. Because it is relatively small and unquantized, it is a convenient example for demonstration and experimentation.
This model is available under a modified MIT License. The command above writes model weight and vocab data to the following files:
- models/gpt2-medium/weights.bin
- models/gpt2-medium/vocab.bin
Next, configure and build the project:
cmake -S . -B build
cmake --build build --parallel
To enable the learner-owned kernel target (gpt2_user), use the dedicated CMake option:
cmake -S . -B build -DBUILD_USER_MATMUL=ON
cmake --build build --parallel
The CMake build can produce the following binaries, depending on the host architecture:
- build/gpt2
- build/gpt2_neon
- build/gpt2_sve
- build/gpt2_kai_sve
- build/gpt2_user (only when -DBUILD_USER_MATMUL=ON)
Run the baseline workload and record the final throughput:
./build/gpt2 --model gpt2-medium "Once upon a time" -n 50
When generation completes, the program prints a summary such as:
[50 tokens, 3.4 tok/s]
You can change the prompt and the number of generated tokens with the positional prompt argument and the -n flag.
To compare all available binaries in one step, use:
./compare_gpt2_variants.sh gpt2-medium "Once upon a time" 20 1 1
The script rebuilds the project, runs gpt2, gpt2_neon, gpt2_sve, gpt2_kai_sve, and gpt2_user, and prints the average tok/s for each one.
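Once you have per-run tok/s values, a speedup relative to the baseline is often easier to reason about than the raw numbers. The following sketch shows one way to compute it. It assumes a plain arithmetic mean over runs, which may not be how the script computes its average, and both function names are hypothetical.

```cpp
#include <vector>

// Arithmetic mean of per-run throughput values (tok/s). Returns 0 for no runs.
double mean_tok_per_s(const std::vector<double>& runs) {
    if (runs.empty()) {
        return 0.0;
    }
    double total = 0.0;
    for (double r : runs) {
        total += r;
    }
    return total / static_cast<double>(runs.size());
}

// Speedup of a variant over the baseline. Values above 1.0 mean the variant
// is faster. The caller must pass a nonzero baseline.
double speedup(double variant_tok_per_s, double baseline_tok_per_s) {
    return variant_tok_per_s / baseline_tok_per_s;
}
```

A speedup of 2.0 means the variant generates tokens twice as fast as gpt2 for the same prompt and token count.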
README.md
AGENTS.md
CMakeLists.txt
License.md
compare_gpt2_variants.sh
assets/
    gpt_kai_sve_textgen.gif
models/
    gpt2-medium/
        vocab.bin
        weights.bin
src/
    export_gpt2.py
    gpt2.cpp
    gpt2_kai_sve.cpp
    kernels/
        matmul.h
        matmul_kai_sve.cpp
        matmul_neon.cpp
        matmul_ref.cpp
        matmul_sve.cpp
        matmul_user.cpp
    kleidiai/
This project is licensed under the Arm Education End User License Agreement for Teaching and Learning Content. It is provided for non-commercial educational purposes only. See License.md for the full terms.
For commercial use enquiries, contact education@arm.com.
