WIP: Persistent kvcache #6776

Open
MollySophia wants to merge 6 commits into Tencent:master from MollySophia:persistent-kvcache
@MollySophia

@MollySophia MollySophia commented Jun 9, 2026

Contributor

Summary

This PR adds SDPA kv_cache=2, a persistent K/V cache mode for CPU.

kv_cache=2 takes preallocated past_key / past_value cache buffers as
views. The input view height is the current valid cache length, while cstep
keeps the preallocated cache capacity. SDPA appends current K/V in place and
returns views of the same cache buffers with the updated valid length.

This avoids changing cache ownership every decode step and provides the base
interface for follow-up optimized kernels such as arm64/Vulkan flash attention.

Existing kv_cache=0 and kv_cache=1 behavior is kept unchanged.

Vulkan kv_cache=2 is disabled in this PR. The existing Vulkan SDPA path still
uses the old concat-cache layout and cannot interpret the persistent cache view
layout correctly. Vulkan persistent-cache support will be handled in a follow-up
flash-attention PR.

Changes

  • Add kv_cache=2 validation and input parsing in generic SDPA
  • Add CPU implementations for generic, arm, x86, mips, and loongarch paths
  • Append cur_key / cur_value into preallocated cache buffers at the input view height
  • Return persistent cache buffers as updated views in top_blobs[1] and top_blobs[2]
  • Add test_sdpa_kvcache coverage for normal use, invalid views, and tail dimensions
  • Add a direct persistent-buffer semantic test for buffer identity, append correctness, and capacity overflow
  • Disable Vulkan SDPA for kv_cache=2 until the Vulkan path implements the persistent layout
  • Route arm kv_cache=2 int8 SDPA through the generic int8 implementation for correctness

The new code is only selected for kv_cache=2. Existing kv_cache!=2 cases
continue to use the previous concat / no-cache paths.

kv_cache=2 input layout:

query, cur_key, cur_value, [attn_mask], past_key_view, past_value_view

past_key_view.h and past_value_view.h are the current valid cache
length. The underlying cstep keeps the cache capacity.
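Because attn_mask is optional, the positions of the two cache views shift by one. A rough sketch of how the indices could be resolved, assuming the caller knows whether a mask is present (`SdpaInputIndices` and `resolve_kv_cache2_inputs` are hypothetical names, not functions from this PR):

```cpp
// Hypothetical index table for the kv_cache=2 input layout:
// query, cur_key, cur_value, [attn_mask], past_key_view, past_value_view
struct SdpaInputIndices
{
    int query;
    int cur_key;
    int cur_value;
    int attn_mask; // -1 when no mask is passed
    int past_key;
    int past_value;
};

// Returns false when the blob count does not match the layout.
inline bool resolve_kv_cache2_inputs(int input_count, bool has_attn_mask, SdpaInputIndices& out)
{
    const int expected = has_attn_mask ? 6 : 5;
    if (input_count != expected)
        return false;

    out.query = 0;
    out.cur_key = 1;
    out.cur_value = 2;
    out.attn_mask = has_attn_mask ? 3 : -1;
    out.past_key = has_attn_mask ? 4 : 3;
    out.past_value = out.past_key + 1;
    return true;
}
```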

Output layout:

out, past_key_buffer, past_value_buffer

Correctness

Tested on aarch64:

cmake --build build-perf-linux --target test_sdpa test_sdpa_kvcache
taskset -c 0 ./build-perf-linux/tests/test_sdpa
taskset -c 0 ./build-perf-linux/tests/test_sdpa_kvcache

Both tests passed. The test_sdpa_kvcache coverage compares kv_cache=2
attention output numerically against reference output, including the int8 path,
and checks persistent cache buffer identity / view shape separately.

Performance

Baseline:

bedf9c65681a0b1e012081d43a566d20aaadf5c6

Current:

cd3c5d44

Measured on Qualcomm Oryon aarch64, CPU0-3 governor performance.
Single-thread runs use taskset -c 0; four-thread runs use taskset -c 0-3.

Shape:

d_k=192 d_v=128 heads=128 groups=16 dtype=fp16 ctx=4096

This PR is mainly an interface and cache-lifetime change. It avoids allocating a
new returned K/V cache blob every decode step: the output cache blobs are shallow
views of the caller-owned persistent cache buffers with updated h.

The current CPU fallback still builds compact fp32 K/V temporaries for the
existing Gemm-based SDPA math path, so latency is roughly neutral and temporary
per-call memory usage does not improve much by itself. The persistent view
layout is intended to be used by follow-up optimized kernels that can read
directly from the stable cache buffer.
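To make the compaction step concrete, here is a simplified sketch of copying only the valid rows out of a capacity-strided buffer into a tightly packed temporary. It assumes fp32 source data and leaves out the fp16 to fp32 conversion; `compact_valid_rows` is a hypothetical helper, not the function used in the PR.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// cache holds `groups` channels, each with room for capacity_rows rows of
// `width` floats. Only the first valid_rows rows of each channel are live.
inline std::vector<float> compact_valid_rows(const std::vector<float>& cache,
                                             int groups, int capacity_rows,
                                             int valid_rows, int width)
{
    std::vector<float> out((std::size_t)groups * valid_rows * width);
    if (out.empty())
        return out; // nothing to copy for an empty cache

    const std::size_t src_stride = (std::size_t)capacity_rows * width;
    const std::size_t dst_stride = (std::size_t)valid_rows * width;
    for (int g = 0; g < groups; g++)
    {
        std::memcpy(out.data() + g * dst_stride,
                    cache.data() + g * src_stride,
                    dst_stride * sizeof(float));
    }
    return out;
}
```

This copy is the per-call temporary that a kernel reading directly from the persistent buffer would not need.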

Causal prefill latency

| len | baseline t=1 (ms) | kv_cache=2 t=1 (ms) | baseline t=4 (ms) | kv_cache=2 t=4 (ms) |
|---|---|---|---|---|
| 128 | 15.3350 | 15.2490 | 3.9160 | 4.0598 |
| 256 | 59.5649 | 59.7761 | 14.7642 | 14.9651 |
| 512 | 239.1162 | 239.2300 | 59.6160 | 60.3730 |
| 1024 | 944.7761 | 936.3608 | 248.9351 | 249.1870 |

Decode latency

| past | baseline t=1 (ms) | kv_cache=2 t=1 (ms) | baseline t=4 (ms) | kv_cache=2 t=4 (ms) |
|---|---|---|---|---|
| 128 | 0.7490 | 0.7329 | 0.1951 | 0.1941 |
| 512 | 2.2429 | 2.3611 | 0.7478 | 0.7119 |
| 1024 | 4.7610 | 4.5950 | 1.5261 | 1.5559 |
| 2048 | 10.1528 | 9.3242 | 3.5171 | 3.1182 |

Decode temporary peak memory

Per-call temporary allocator peak, four threads:

op_peak = workspace peak + blob peak

This only counts allocations made during one SDPA call after inputs are already
prepared. It does not include the caller-owned persistent K/V cache buffer.
In the current fallback path, SDPA still copies the valid cache range into
compact K/V blobs for the existing Gemm implementation, so the temporary peak
remains close to kv_cache=1.

| past | baseline (MiB) | kv_cache=2 (MiB) |
|---|---|---|
| 128 | 1.69 | 1.74 |
| 512 | 6.22 | 6.46 |
| 1024 | 13.24 | 12.21 |
| 2048 | 23.58 | 25.87 |
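For readers unfamiliar with the op_peak metric, the bookkeeping could look roughly like the sketch below. It is an assumption about how such a counter might work, not the measurement code used for these numbers; `PeakCounter` and `op_peak` are hypothetical names.

```cpp
#include <cstddef>

// Hypothetical high-water-mark counter for one allocator.
struct PeakCounter
{
    std::size_t live = 0; // bytes currently allocated
    std::size_t peak = 0; // largest value `live` has reached

    void on_alloc(std::size_t bytes)
    {
        live += bytes;
        if (live > peak)
            peak = live;
    }

    void on_free(std::size_t bytes)
    {
        live -= bytes;
    }
};

// op_peak = workspace peak + blob peak, as defined above.
inline std::size_t op_peak(const PeakCounter& workspace, const PeakCounter& blob)
{
    return workspace.peak + blob.peak;
}
```

Since the two peaks need not occur at the same moment, their sum is an upper bound on the simultaneous footprint of a call.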

For this shape, the persistent K/V cache buffer for ctx=4096 is about 40 MiB:

(d_k + d_v) * groups * sizeof(fp16) * ctx
= (192 + 128) * 16 * 2 * 4096 bytes
= 41,943,040 bytes
= 40 MiB
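The same arithmetic as a compile-time check, in case other shapes need to be sized. `kv_cache_bytes` is a made-up helper name for illustration, and the result ignores any alignment padding an allocator may add.

```cpp
#include <cstddef>

// Bytes needed to hold K and V for `ctx` tokens across all KV groups.
constexpr std::size_t kv_cache_bytes(std::size_t d_k, std::size_t d_v,
                                     std::size_t groups, std::size_t elem_bytes,
                                     std::size_t ctx)
{
    return (d_k + d_v) * groups * elem_bytes * ctx;
}

// Shape from this PR: d_k=192 d_v=128 groups=16 fp16 ctx=4096.
static_assert(kv_cache_bytes(192, 128, 16, 2, 4096) == 40u * 1024u * 1024u,
              "expected exactly 40 MiB before padding");
```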

That buffer is allocated and owned outside the SDPA call, then reused across
decode steps. It replaces the growing kv_cache=1 returned-cache ownership
model, but it is not shown in the temporary peak table above.

@tencent-adm

tencent-adm commented Jun 9, 2026

Member

CLA assistant check
All committers have signed the CLA.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dbc4ddb840

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/layer/sdpa.cpp Outdated
@MollySophia MollySophia force-pushed the persistent-kvcache branch from dbc4ddb to 6f0a47f Compare June 9, 2026 04:50
@github-actions github-actions Bot added the vulkan label Jun 9, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd3c5d4491

Comment thread tests/perf/CMakeLists.txt
Comment on lines +46 to +48
ncnn_add_perf(sdpa_flash)
ncnn_add_perf(sdpa_mem)
ncnn_add_perf(sdpa_mla_kvcache)

P2: Declare perf env helpers before enabling targets

When the new sdpa_flash and sdpa_mla_kvcache perf targets are enabled here, they do not compile: tests/perf/perf_sdpa_flash.cpp and tests/perf/perf_sdpa_mla_kvcache.cpp call perf_match_env_int, perf_env_int, and perf_has_env, but tests/perf/perfutil.h only declares PerfMat/perf_layer, and a repo-wide rg shows no definitions for these helper names. Building perf tests with WITH_LAYER_sdpa therefore fails as soon as these targets are compiled; either add/export the helpers or keep the targets out until they are self-contained.
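One way to make the targets self-contained would be to define the helpers in the perf utility header. The sketch below is purely illustrative: the names `perf_has_env` and `perf_env_int` come from the review comment, but their signatures and semantics here are assumptions, and `perf_match_env_int` is left out because its intended behaviour is not stated anywhere in this thread.

```cpp
#include <cstdlib>

// Assumed semantics: true when the variable is set, even if it is empty.
inline bool perf_has_env(const char* name)
{
    return std::getenv(name) != nullptr;
}

// Assumed semantics: parse the variable as a base-10 integer and fall back
// to default_value when it is unset or does not start with a number.
inline int perf_env_int(const char* name, int default_value)
{
    const char* s = std::getenv(name);
    if (!s)
        return default_value;

    char* end = nullptr;
    const long v = std::strtol(s, &end, 10);
    return end == s ? default_value : (int)v;
}
```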

@github-actions

Please enable GitHub Actions in YOUR FORKED REPO so that the code-format workflow can run

@codecov-commenter

codecov-commenter commented Jun 11, 2026

Codecov Report

❌ Patch coverage is 85.89342% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.47%. Comparing base (bedf9c6) to head (28bc4c3).
⚠️ Report is 23 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/layer/sdpa.cpp | 72.22% | 35 Missing ⚠️ |
| src/layer/loongarch/sdpa_loongarch.cpp | 93.61% | 3 Missing ⚠️ |
| src/layer/mips/sdpa_mips.cpp | 93.61% | 3 Missing ⚠️ |
| src/layer/x86/sdpa_x86.cpp | 93.75% | 3 Missing ⚠️ |
| src/layer/arm/sdpa_arm.cpp | 98.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6776      +/-   ##
==========================================
- Coverage   95.60%   95.47%   -0.13%     
==========================================
  Files         960      943      -17     
  Lines      404032   409662    +5630     
==========================================
+ Hits       386278   391142    +4864     
- Misses      17754    18520     +766     

Comment thread src/layer/sdpa.cpp
for (int q = 0; q < num_group; q++)
{
unsigned char* kd = (unsigned char*)past_key.channel(q).data + (size_t)past_seqlen * embed_dim * elemsize;
memcpy(kd, cur_key.channel(q).data, embed_dim * cur_seqlen * elemsize);
Comment thread src/layer/sdpa.cpp
unsigned char* kd = (unsigned char*)past_key.channel(q).data + (size_t)past_seqlen * embed_dim * elemsize;
memcpy(kd, cur_key.channel(q).data, embed_dim * cur_seqlen * elemsize);
unsigned char* vd = (unsigned char*)past_value.channel(q).data + (size_t)past_seqlen * out_embed_dim * elemsize;
memcpy(vd, cur_value.channel(q).data, out_embed_dim * cur_seqlen * elemsize);
Comment thread src/layer/sdpa.cpp
// Convert from source elemsize to fp32
if (elemsize == 4)
{
memcpy(key.channel(q), past_key.channel(q), embed_dim * dst_seqlen * 4);
Comment thread src/layer/sdpa.cpp
if (elemsize == 4)
{
memcpy(key.channel(q), past_key.channel(q), embed_dim * dst_seqlen * 4);
memcpy(value.channel(q), past_value.channel(q), out_embed_dim * dst_seqlen * 4);
Comment thread src/layer/sdpa.cpp

memcpy(value_head.row(0), past_value_head, out_embed_dim * past_seqlen * sizeof(float));
memcpy(value_head.row(past_seqlen), cur_value_head, out_embed_dim * cur_seqlen * sizeof(float));
memcpy((float*)key.channel(q), past_key.channel(q), embed_dim * past_seqlen * sizeof(float));
Comment thread src/layer/sdpa.cpp
{
if (elemsize == 4)
{
memcpy(value.channel(q), past_value.channel(q), out_embed_dim * dst_seqlen * sizeof(float));
unsigned char* pk = (unsigned char*)past_key.channel(q).data;
unsigned char* pv = (unsigned char*)past_value.channel(q).data;
memcpy(pk + (size_t)past_seqlen * embed_dim * elemsize,
cur_key.channel(q).data, embed_dim * cur_seqlen * elemsize);
memcpy(pk + (size_t)past_seqlen * embed_dim * elemsize,
cur_key.channel(q).data, embed_dim * cur_seqlen * elemsize);
memcpy(pv + (size_t)past_seqlen * out_embed_dim * elemsize,
cur_value.channel(q).data, out_embed_dim * cur_seqlen * elemsize);
#pragma omp parallel for num_threads(opt.num_threads)
for (int q = 0; q < num_group; q++)
{
memcpy(key.channel(q), past_key.channel(q), embed_dim * dst_seqlen * elemsize);
for (int q = 0; q < num_group; q++)
{
memcpy(key.channel(q), past_key.channel(q), embed_dim * dst_seqlen * elemsize);
memcpy(value.channel(q), past_value.channel(q), out_embed_dim * dst_seqlen * elemsize);

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

for (size_t j = 0; j < inputs.size(); j++)
{
cmd.record_upload(inputs[j], vk_inputs[j], opt);

P2: Preserve persistent-cache capacity before GPU upload

When SDPA kv_cache=2 perf inputs pass a persistent cache view (inputs[3].h = past_seqlen, often 0), uploading that view here allocates the VkMat with only the live length instead of the full capacity preserved in Mat::cstep. SDPA_vulkan::forward derives capacity from past_key.cstep, so the new Vulkan flash/decode perf cases reject the append or benchmark a too-small cache, and run_layer_forward_gpu currently drops the return code, making the timings silently invalid. Please mirror the CPU convert_input_layout_persistent_view handling for these inputs before upload.

4 participants