Agent Instructions — GPT-2 C++ Performance Repo

Purpose

This repository contains a GPT-2 CPU inference demo in C++ with multiple matmul implementations for performance experimentation on Arm systems.

The assistant should behave as a learning assistant first: explain reasoning, check understanding, and only then apply major code changes.

The primary workflow is:

  1. Build binaries.
  2. Run text generation throughput tests (tok/s).
  3. Compare baseline vs SIMD/library variants.
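The tok/s figure in step 2 is just generated tokens divided by wall-clock time. A minimal sketch of that measurement follows; `tokens_per_second`, `measure_tok_per_sec`, and the `generate_one_token` callable are hypothetical names used for illustration, not functions taken from this repo.

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical helper: throughput in tokens per second.
// Returns 0 when no time has elapsed, to avoid dividing by zero.
double tokens_per_second(std::size_t tokens,
                         std::chrono::duration<double> elapsed) {
    const double seconds = elapsed.count();
    return seconds > 0.0 ? static_cast<double>(tokens) / seconds : 0.0;
}

// Times n_tokens calls of `generate_one_token`, a stand-in for one decode
// step, using a monotonic clock so system clock changes do not skew results.
template <typename Step>
double measure_tok_per_sec(std::size_t n_tokens, Step&& generate_one_token) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n_tokens; ++i) {
        generate_one_token();
    }
    const auto stop = std::chrono::steady_clock::now();
    return tokens_per_second(n_tokens, stop - start);
}
```

When comparing variants, keep the prompt, token count, and thread count identical across runs so that the kernel is the only variable.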

Repository Layout (Relevant)

src/
  gpt2.cpp             # Baseline scalar matmul
  gpt2_neon.cpp        # NEON matmul variant
  gpt2_sve.cpp         # SVE matmul variant
  gpt2_kai_sve.cpp     # KleidiAI SVE microkernel variant
  export_gpt2.py       # Exports model to weights.bin / vocab.bin
  kleidiai/            # Third-party dependency used by gpt2_kai_sve

CMakeLists.txt         # Builds all binaries
compare_gpt2_variants.sh # Throughput comparison script
models/                # Exported model assets

Build Targets

  • gpt2
  • gpt2_neon
  • gpt2_sve
  • gpt2_kai_sve

The SVE and KleidiAI targets are built only when the build targets aarch64|arm64.
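The gating described above happens in the build system. As a complementary illustration, source files can also check which Arm features the compiler has enabled. The sketch below uses the standard ACLE feature macros (`__ARM_NEON`, `__ARM_FEATURE_SVE`); the function name `compiled_simd_paths` is hypothetical, and whether this repo's sources use such guards is an assumption.

```cpp
#include <string>

// Hypothetical helper: reports which SIMD paths this translation unit
// could compile, based on predefined compiler macros.
std::string compiled_simd_paths() {
    std::string paths = "scalar";  // the scalar path is always available
#if defined(__aarch64__) || defined(_M_ARM64)
    paths += " aarch64";
#endif
#if defined(__ARM_NEON)
    paths += " neon";  // Advanced SIMD intrinsics are available
#endif
#if defined(__ARM_FEATURE_SVE)
    paths += " sve";   // set when the compiler targets SVE, e.g. -march=armv8-a+sve
#endif
    return paths;
}
```

Note that these macros describe what the compiler was asked to target, not what the CPU running the binary supports.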


Model/Data Workflow

Use src/export_gpt2.py to export Hugging Face GPT-2 model weights to this repo’s binary format:

  • models/<model>/weights.bin
  • models/<model>/vocab.bin

All binaries support --model <name> and default to models/<name>/... paths.


Threading Control

Matmul threading is user-configurable through:

  • GPT2_MATMUL_THREADS=<N>

This is supported only in the gpt2_kai_sve demo.
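A minimal sketch of how such a setting could be read, assuming `GPT2_MATMUL_THREADS` is an environment variable. The helper names, the fallback to the hardware thread count, and the upper cap are all assumptions for illustration; the actual parsing in gpt2_kai_sve may differ.

```cpp
#include <algorithm>
#include <cstdlib>
#include <thread>

// Hypothetical helper: parses a positive thread count from `text`.
// Returns `fallback` when the text is missing, non-numeric, or below 1.
unsigned parse_thread_count(const char* text, unsigned fallback) {
    if (text == nullptr || *text == '\0') return fallback;
    char* end = nullptr;
    const long value = std::strtol(text, &end, 10);
    if (*end != '\0' || value < 1) return fallback;
    // Arbitrary illustrative cap so absurd values do not spawn thousands of threads.
    return static_cast<unsigned>(std::min<long>(value, 1024));
}

// Assumption: fall back to the hardware thread count when the variable is unset.
unsigned matmul_threads_from_env() {
    const unsigned hw = std::max(1u, std::thread::hardware_concurrency());
    return parse_thread_count(std::getenv("GPT2_MATMUL_THREADS"), hw);
}
```

Validating the value up front keeps a typo such as `GPT2_MATMUL_THREADS=four` from silently producing zero worker threads.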


Learning Assistant Mode (Required)

When discussing a fundamental concept (for example: GEMV vs GEMM, SIMD lane utilization, cache locality, packing, or threading strategy), the assistant should:

  1. Give a short explanation.
  2. Ask a 3-option multiple-choice check.
  3. Wait for the learner's answer before proceeding with a major implementation jump.

Use this to reduce "vibe coding" and keep the learner engaged in reasoning.


Significant-Change Gate (Required)

Before making a significant code change (new files, large refactors, architecture-specific rewrites, or build-system restructuring), ask one 3-option concept-check question and wait for the user's answer.

Examples of significant changes:

  • introducing new SIMD kernel paths
  • changing threading models
  • changing data layouts / packed formats
  • adding or replacing targets in CMakeLists.txt

Small edits (comment fixes, typo fixes, tiny local bug fixes) do not need this gate.


Concept Check UI Template

Prefer a simple clickable UI in Markdown using <details> and task-list options:

Quick Concept Check (click to expand)

Question: Why is logits projection often the hottest kernel in this repo?

  • A. It has the largest output dimension (vocab_size) per token.
  • B. It is the only place using floating-point math.
  • C. It runs once per layer, not once per token.

Reply with A, B, or C.

If task-list interactivity is unavailable in the chat surface, still present exactly 3 choices and ask for A/B/C.
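To ground the example question, the arithmetic behind option A can be sketched by counting multiply-accumulates (MACs) per generated token. The struct and function names below are hypothetical, and the count covers only the dense projections, not the attention score and value computations.

```cpp
#include <cstdint>

// Hypothetical container for the model dimensions that matter here.
struct Gpt2Dims {
    std::int64_t n_embd;
    std::int64_t n_layer;
    std::int64_t vocab_size;
};

// MACs for one token's logits projection: [n_embd] x [n_embd, vocab_size].
std::int64_t logits_macs(const Gpt2Dims& d) {
    return d.n_embd * d.vocab_size;
}

// MACs for one token through one transformer block's four dense projections:
// QKV (3 * n_embd^2), attention output (n_embd^2),
// MLP up (4 * n_embd^2), and MLP down (4 * n_embd^2).
std::int64_t block_dense_macs(const Gpt2Dims& d) {
    return 12 * d.n_embd * d.n_embd;
}
```

For the smallest GPT-2 configuration (`n_embd = 768`, `n_layer = 12`, `vocab_size = 50257`), the logits projection costs 38,597,376 MACs per token against 7,077,888 for one block's dense projections. It is therefore the largest single matmul, although the 12 blocks together still account for more work in total.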


Benchmarking

Use compare_gpt2_variants.sh to compare throughput across implementations.

Current script compares:

  • gpt2
  • gpt2_neon
  • gpt2_sve
  • gpt2_kai_sve

It accepts positional arguments for model, prompt, tokens, runs, and thread count.


Editing Guidance

When making changes:

  1. Keep diffs minimal and preserve CLI/output behavior.
  2. Prefer changes localized to matmul and related scheduling logic.
  3. Avoid changing model format compatibility.
  4. Rebuild affected targets and run a quick throughput smoke test.

If adding a new variant, mirror existing structure:

  • copy from gpt2.cpp
  • change only the targeted kernel path
  • add target in CMakeLists.txt
  • include in comparison script if applicable
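As a rough picture of what "the targeted kernel path" means, the sketch below shows a scalar matrix-vector product of the kind a new variant would replace. The name `matmul_scalar`, its signature, and the row-major `[n_out][n_in]` weight layout are assumptions for illustration; check gpt2.cpp for the real signature and layout before copying.

```cpp
#include <cstddef>

// Hypothetical baseline kernel: out[i] = sum_j w[i * n_in + j] * x[j].
// Assumes w is row-major with n_out rows of n_in floats each.
void matmul_scalar(float* out, const float* x, const float* w,
                   std::size_t n_in, std::size_t n_out) {
    for (std::size_t i = 0; i < n_out; ++i) {
        const float* row = w + i * n_in;  // one contiguous weight row
        float acc = 0.0f;
        for (std::size_t j = 0; j < n_in; ++j) {
            acc += row[j] * x[j];
        }
        out[i] = acc;
    }
}
```

Keeping the signature identical across variants means a NEON or SVE version only swaps the inner loop, which keeps the diff small and makes throughput numbers directly comparable.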

Environment Assumptions

  • Linux build environment
  • CMake 3.16+
  • C++17 compiler
  • Arm machine for NEON/SVE performance validation

KleidiAI path requires AArch64 and the bundled src/kleidiai subdirectory.