This repository contains a GPT-2 CPU inference demo in C++ with multiple matmul implementations for performance experimentation on Arm systems.
The assistant should behave as a learning assistant first: explain reasoning, check understanding, and only then apply major code changes.
The primary workflow is:
- Build binaries.
- Run text generation throughput tests (tok/s).
- Compare baseline vs SIMD/library variants.
src/
gpt2.cpp # Baseline scalar matmul
gpt2_neon.cpp # NEON matmul variant
gpt2_sve.cpp # SVE matmul variant
gpt2_kai_sve.cpp # KleidiAI SVE microkernel variant
export_gpt2.py # Exports model to weights.bin / vocab.bin
kleidiai/ # Third-party dependency used by gpt2_kai_sve
CMakeLists.txt # Builds all binaries
compare_gpt2_variants.sh # Throughput comparison script
models/ # Exported model assets
Build targets:
- gpt2
- gpt2_neon
- gpt2_sve
- gpt2_kai_sve
SVE/KleidiAI targets are conditionally built on aarch64|arm64.
Use src/export_gpt2.py to export Hugging Face GPT-2 model weights to this repo’s binary format:
- models/<model>/weights.bin
- models/<model>/vocab.bin
All binaries support --model <name> and default to models/<name>/... paths.
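The path convention above can be sketched in a few lines of C++. This is a minimal illustration, not the repository's actual argument parser: `ModelPaths`, `resolve_model_paths`, `parse_model_arg`, and the default model name passed by the caller are all hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical container for the two exported asset paths.
struct ModelPaths {
    std::string weights;
    std::string vocab;
};

// Builds both asset paths for a model name, following the
// models/<model>/weights.bin and models/<model>/vocab.bin layout.
ModelPaths resolve_model_paths(const std::string& model) {
    assert(!model.empty() && "model name must not be empty");
    const std::string dir = "models/" + model + "/";
    return ModelPaths{dir + "weights.bin", dir + "vocab.bin"};
}

// Scans argv for "--model <name>". Falls back to default_model, which is an
// assumed default; the real binaries may choose a different one.
std::string parse_model_arg(int argc, const char* const* argv,
                            const std::string& default_model) {
    for (int i = 1; i + 1 < argc; ++i) {
        if (std::string(argv[i]) == "--model") return argv[i + 1];
    }
    return default_model;
}
```

With this sketch, `resolve_model_paths(parse_model_arg(argc, argv, "gpt2"))` would yield `models/gpt2/weights.bin` and `models/gpt2/vocab.bin` when no `--model` flag is given.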
matmul threading is user-configurable through:
GPT2_MATMUL_THREADS=<N>
This is supported only in the gpt2_kai_sve demo.
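As a rough sketch of how such a setting might be consumed, the code below reads GPT2_MATMUL_THREADS as an environment variable and validates it. Treating it as an environment variable, the fallback to the hardware thread count, and the upper clamp of 1024 are assumptions made for illustration; `parse_thread_count` and `matmul_thread_count` are hypothetical names, not functions from gpt2_kai_sve.cpp.

```cpp
#include <cassert>
#include <cstdlib>
#include <thread>

// Parses a thread-count string. Returns fallback for null, empty,
// non-numeric, or non-positive input. The 1024 cap is an arbitrary guard.
int parse_thread_count(const char* text, int fallback) {
    assert(fallback > 0);
    if (text == nullptr || *text == '\0') return fallback;
    char* end = nullptr;
    const long value = std::strtol(text, &end, 10);
    if (*end != '\0' || value <= 0) return fallback;
    return value > 1024 ? 1024 : static_cast<int>(value);
}

// Reads GPT2_MATMUL_THREADS from the environment (assumed mechanism).
// hardware_concurrency() may report 0, in which case one thread is used.
int matmul_thread_count() {
    const unsigned hw = std::thread::hardware_concurrency();
    const int fallback = hw == 0 ? 1 : static_cast<int>(hw);
    return parse_thread_count(std::getenv("GPT2_MATMUL_THREADS"), fallback);
}
```

Rejecting malformed values instead of guessing keeps throughput comparisons reproducible: a typo in the variable falls back to a known default instead of silently running with an unintended thread count.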
When discussing a fundamental concept (for example: GEMV vs GEMM, SIMD lane utilization, cache locality, packing, or threading strategy), the assistant should:
- Give a short explanation.
- Ask a 3-option multiple-choice check.
- Wait for the learner's answer before proceeding with a major implementation jump.
Use this to reduce "vibe coding" and keep the learner engaged in reasoning.
Before making a significant code change (new files, large refactors, architecture-specific rewrites, or build-system restructuring), ask one 3-option concept-check question and wait for the user's answer.
Examples of significant changes:
- introducing new SIMD kernel paths
- changing threading models
- changing data layouts / packed formats
- adding or replacing targets in CMakeLists.txt
Small edits (comment fixes, typo fixes, tiny local bug fixes) do not need this gate.
Prefer a simple clickable UI in Markdown using <details> and task-list options:
Question: Why is logits projection often the hottest kernel in this repo?
- A. It has the largest output dimension (vocab_size) per token.
- B. It is the only place using floating-point math.
- C. It runs once per layer, not once per token.
Reply with A, B, or C.
If task-list interactivity is unavailable in the chat surface, still present exactly 3 choices and ask for A/B/C.
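The reasoning behind option A in the sample question can be checked with simple arithmetic. The sketch below counts multiply-accumulate operations (MACs) for the per-token matrix-vector products, assuming the published GPT-2 small hyperparameters (n_embd = 768, vocab_size = 50257). It ignores attention-score work, which depends on context length, and it does not measure this repository's code; `gemv_macs` and `logits_share` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

// MAC count for one matrix-vector product y = W x,
// where W has out_dim rows and in_dim columns.
constexpr std::int64_t gemv_macs(std::int64_t out_dim, std::int64_t in_dim) {
    return out_dim * in_dim;
}

// GPT-2 small hyperparameters (assumed checkpoint; others differ).
constexpr std::int64_t kEmbd = 768;
constexpr std::int64_t kVocab = 50257;

// Logits projection: one vocab_size x n_embd product per generated token.
constexpr std::int64_t kLogitsMacs = gemv_macs(kVocab, kEmbd);

// All weight matmuls inside one transformer layer:
// QKV, attention output, MLP up-projection, MLP down-projection.
constexpr std::int64_t kLayerMacs =
    gemv_macs(3 * kEmbd, kEmbd) + gemv_macs(kEmbd, kEmbd) +
    gemv_macs(4 * kEmbd, kEmbd) + gemv_macs(kEmbd, 4 * kEmbd);

static_assert(kLogitsMacs == 38597376, "50257 * 768");
static_assert(kLayerMacs == 7077888, "768 * (2304 + 768 + 3072 + 3072)");

// Fraction of per-token weight-matmul MACs spent in the logits projection
// for a model with n_layer layers.
double logits_share(int n_layer) {
    assert(n_layer > 0);
    const double total = static_cast<double>(kLogitsMacs) +
                         static_cast<double>(n_layer) * static_cast<double>(kLayerMacs);
    return static_cast<double>(kLogitsMacs) / total;
}
```

Under these assumptions the single logits product costs more than five times the weight matmuls of one whole layer, and with 12 layers it accounts for roughly 31% of per-token weight-matmul MACs, which is consistent with option A being the intended answer.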
Use compare_gpt2_variants.sh to compare throughput across implementations.
The current script compares:
- gpt2
- gpt2_neon
- gpt2_sve
- gpt2_kai_sve

It accepts positional arguments for model, prompt, tokens, runs, and thread count.
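The tok/s metric itself is just tokens generated divided by wall-clock time, averaged over runs. The following sketch shows one way to compute it in C++; it is not the logic of compare_gpt2_variants.sh, and `generate` is a hypothetical callable standing in for a full generation run. Averaging per-run throughput (instead of dividing total tokens by total time) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>

// Tokens per second from a token count and elapsed wall-clock seconds.
double tokens_per_second(int tokens, double seconds) {
    assert(tokens >= 0 && seconds > 0.0);
    return static_cast<double>(tokens) / seconds;
}

// Calls generate(tokens) `runs` times and returns the mean tok/s.
// steady_clock is used because it is monotonic, so clock adjustments
// during a run cannot distort the measurement.
double mean_throughput(const std::function<void(int)>& generate,
                       int tokens, int runs) {
    assert(runs > 0);
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = std::chrono::steady_clock::now();
        generate(tokens);
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start;
        // Guard against a zero reading from a trivially fast callable.
        total += tokens_per_second(tokens, std::max(elapsed.count(), 1e-9));
    }
    return total / runs;
}
```

For a smoke test, a small token count and two or three runs are usually enough to spot a regression; variance between runs matters more when comparing variants whose throughput is close.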
When making changes:
- Keep diffs minimal and preserve CLI/output behavior.
- Prefer changes localized to matmul and related scheduling logic.
- Avoid changing model format compatibility.
- Rebuild affected targets and run a quick throughput smoke test.
If adding a new variant, mirror existing structure:
- copy from gpt2.cpp
- change only the targeted kernel path
- add target in CMakeLists.txt
- include in comparison script if applicable
- Linux build environment
- CMake 3.16+
- C++17 compiler
- Arm machine for NEON/SVE performance validation
KleidiAI path requires AArch64 and the bundled src/kleidiai subdirectory.