WIP: Persistent kvcache#6776
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dbc4ddb840
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
dbc4ddb to
6f0a47f
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cd3c5d4491
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ncnn_add_perf(sdpa_flash) | ||
| ncnn_add_perf(sdpa_mem) | ||
| ncnn_add_perf(sdpa_mla_kvcache) |
There was a problem hiding this comment.
Declare perf env helpers before enabling targets
When the new sdpa_flash and sdpa_mla_kvcache perf targets are enabled here, they do not compile: tests/perf/perf_sdpa_flash.cpp and tests/perf/perf_sdpa_mla_kvcache.cpp call perf_match_env_int, perf_env_int, and perf_has_env, but tests/perf/perfutil.h only declares PerfMat/perf_layer, and a repo-wide rg shows no definitions for these helper names. Building perf tests with WITH_LAYER_sdpa therefore fails as soon as these targets are compiled; either add/export the helpers or keep the targets out until they are self-contained.
Useful? React with 👍 / 👎.
|
Please enable github action in YOUR FORKED REPO to make code-format workflow work |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #6776 +/- ##
==========================================
- Coverage 95.60% 95.47% -0.13%
==========================================
Files 960 943 -17
Lines 404032 409662 +5630
==========================================
+ Hits 386278 391142 +4864
- Misses 17754 18520 +766 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| for (int q = 0; q < num_group; q++) | ||
| { | ||
| unsigned char* kd = (unsigned char*)past_key.channel(q).data + (size_t)past_seqlen * embed_dim * elemsize; | ||
| memcpy(kd, cur_key.channel(q).data, embed_dim * cur_seqlen * elemsize); |
| unsigned char* kd = (unsigned char*)past_key.channel(q).data + (size_t)past_seqlen * embed_dim * elemsize; | ||
| memcpy(kd, cur_key.channel(q).data, embed_dim * cur_seqlen * elemsize); | ||
| unsigned char* vd = (unsigned char*)past_value.channel(q).data + (size_t)past_seqlen * out_embed_dim * elemsize; | ||
| memcpy(vd, cur_value.channel(q).data, out_embed_dim * cur_seqlen * elemsize); |
| // Convert from source elemsize to fp32 | ||
| if (elemsize == 4) | ||
| { | ||
| memcpy(key.channel(q), past_key.channel(q), embed_dim * dst_seqlen * 4); |
| if (elemsize == 4) | ||
| { | ||
| memcpy(key.channel(q), past_key.channel(q), embed_dim * dst_seqlen * 4); | ||
| memcpy(value.channel(q), past_value.channel(q), out_embed_dim * dst_seqlen * 4); |
|
|
||
| memcpy(value_head.row(0), past_value_head, out_embed_dim * past_seqlen * sizeof(float)); | ||
| memcpy(value_head.row(past_seqlen), cur_value_head, out_embed_dim * cur_seqlen * sizeof(float)); | ||
| memcpy((float*)key.channel(q), past_key.channel(q), embed_dim * past_seqlen * sizeof(float)); |
| { | ||
| if (elemsize == 4) | ||
| { | ||
| memcpy(value.channel(q), past_value.channel(q), out_embed_dim * dst_seqlen * sizeof(float)); |
| unsigned char* pk = (unsigned char*)past_key.channel(q).data; | ||
| unsigned char* pv = (unsigned char*)past_value.channel(q).data; | ||
| memcpy(pk + (size_t)past_seqlen * embed_dim * elemsize, | ||
| cur_key.channel(q).data, embed_dim * cur_seqlen * elemsize); |
| memcpy(pk + (size_t)past_seqlen * embed_dim * elemsize, | ||
| cur_key.channel(q).data, embed_dim * cur_seqlen * elemsize); | ||
| memcpy(pv + (size_t)past_seqlen * out_embed_dim * elemsize, | ||
| cur_value.channel(q).data, out_embed_dim * cur_seqlen * elemsize); |
| #pragma omp parallel for num_threads(opt.num_threads) | ||
| for (int q = 0; q < num_group; q++) | ||
| { | ||
| memcpy(key.channel(q), past_key.channel(q), embed_dim * dst_seqlen * elemsize); |
| for (int q = 0; q < num_group; q++) | ||
| { | ||
| memcpy(key.channel(q), past_key.channel(q), embed_dim * dst_seqlen * elemsize); | ||
| memcpy(value.channel(q), past_value.channel(q), out_embed_dim * dst_seqlen * elemsize); |
There was a problem hiding this comment.
💡 Codex Review
Lines 544 to 546 in 6c518a0
When SDPA kv_cache=2 perf inputs pass a persistent cache view (inputs[3].h = past_seqlen, often 0), uploading that view here allocates the VkMat with only the live length instead of the full capacity preserved in Mat::cstep. SDPA_vulkan::forward derives capacity from past_key.cstep, so the new Vulkan flash/decode perf cases reject the append or benchmark a too-small cache, and run_layer_forward_gpu currently drops the return code, making the timings silently invalid. Please mirror the CPU convert_input_layout_persistent_view handling for these inputs before upload.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Summary
This PR adds
SDPAkv_cache=2, a persistent K/V cache mode for CPU.kv_cache=2takes preallocatedpast_key/past_valuecache buffers asviews. The input view height is the current valid cache length, while
cstepkeeps the preallocated cache capacity. SDPA appends current K/V in place and
returns views of the same cache buffers with the updated valid length.
This avoids changing cache ownership every decode step and provides the base
interface for follow-up optimized kernels such as arm64/Vulkan flash attention.
Existing
kv_cache=0andkv_cache=1behavior is kept unchanged.Vulkan
kv_cache=2is disabled in this PR. The existing Vulkan SDPA path stilluses the old concat-cache layout and cannot interpret the persistent cache view
layout correctly. Vulkan persistent-cache support will be handled in a follow-up
flash-attention PR.
Changes
kv_cache=2validation and input parsing in genericSDPAcur_key/cur_valueinto preallocated cache buffers at the input view heighttop_blobs[1]andtop_blobs[2]test_sdpa_kvcachecoverage for normal use, invalid views, and tail dimensionskv_cache=2until the Vulkan path implements the persistent layoutkv_cache=2int8 SDPA through the generic int8 implementation for correctnessThe new code is only selected for
kv_cache=2. Existingkv_cache!=2casescontinue to use the previous concat / no-cache paths.
kv_cache=2input layout:past_key_view.handpast_value_view.hare the current valid cachelength. The underlying
cstepkeeps the cache capacity.Output layout:
Correctness
Tested on aarch64:
Both tests passed. The
test_sdpa_kvcachecoverage compareskv_cache=2attention output numerically against reference output, including the int8 path,
and checks persistent cache buffer identity / view shape separately.
Performance
Baseline:
Current:
Measured on Qualcomm Oryon aarch64, CPU0-3 governor
performance.Single-thread runs use
taskset -c 0; four-thread runs usetaskset -c 0-3.Shape:
This PR is mainly an interface and cache-lifetime change. It avoids allocating a
new returned K/V cache blob every decode step: the output cache blobs are shallow
views of the caller-owned persistent cache buffers with updated
h.The current CPU fallback still builds compact fp32 K/V temporaries for the
existing Gemm-based SDPA math path, so latency is roughly neutral and temporary
per-call memory usage does not improve much by itself. The persistent view
layout is intended to be used by follow-up optimized kernels that can read
directly from the stable cache buffer.
Causal prefill latency
Decode latency
Decode temporary peak memory
Per-call temporary allocator peak, four threads:
This only counts allocations made during one SDPA call after inputs are already
prepared. It does not include the caller-owned persistent K/V cache buffer.
In the current fallback path, SDPA still copies the valid cache range into
compact K/V blobs for the existing Gemm implementation, so the temporary peak
remains close to
kv_cache=1.For this shape, the persistent K/V cache buffer for
ctx=4096is about 40 MiB:That buffer is allocated and owned outside the SDPA call, then reused across
decode steps. It replaces the growing
kv_cache=1returned-cache ownershipmodel, but it is not shown in the temporary peak table above.