WIP: Persistent kvcache by MollySophia · Pull Request #6776 · Tencent/ncnn

MollySophia · 2026-06-09T03:40:51Z

Summary

This PR adds SDPA kv_cache=2, a persistent K/V cache mode for CPU.

kv_cache=2 takes preallocated past_key / past_value cache buffers as
views. The height (h) of each input view is the current valid cache length,
while its cstep retains the preallocated cache capacity. SDPA appends the
current K/V in place and returns views of the same cache buffers with the
updated valid length.
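
The append-in-place semantics described above can be sketched without ncnn. This is a minimal illustration, not the PR's implementation: CacheView, append_rows, and the capacity field are hypothetical names, with capacity standing in for the role the PR assigns to cstep and rows standing in for the view height.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for one channel of a persistent cache blob.
// rows     : current valid cache length (the role of the view height h)
// capacity : preallocated rows (the role the PR assigns to cstep)
struct CacheView
{
    float* data;
    int width;
    int rows;
    int capacity;
};

// Append cur_rows rows at the current valid length, in place.
// Returns 0 and writes the updated view to `out` on success.
// Returns -1 and leaves the cache and `out` untouched on overflow.
inline int append_rows(const CacheView& past, const float* cur, int cur_rows, CacheView& out)
{
    if (cur_rows < 0 || past.rows + cur_rows > past.capacity)
        return -1;

    float* dst = past.data + (std::size_t)past.rows * past.width;
    std::memcpy(dst, cur, (std::size_t)cur_rows * past.width * sizeof(float));

    out = past; // same buffer, no reallocation
    out.rows = past.rows + cur_rows;
    return 0;
}
```

The returned view shares the data pointer with the input view, which is the buffer-identity property the PR's semantic test is described as checking.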

This avoids changing cache ownership every decode step and provides the base
interface for follow-up optimized kernels such as arm64/Vulkan flash attention.

Existing kv_cache=0 and kv_cache=1 behavior is kept unchanged.

Vulkan kv_cache=2 is disabled in this PR. The existing Vulkan SDPA path still
uses the old concat-cache layout and cannot interpret the persistent cache view
layout correctly. Vulkan persistent-cache support will be handled in a follow-up
flash-attention PR.

Changes

Add kv_cache=2 validation and input parsing in generic SDPA
Add CPU implementations for generic, arm, x86, mips, and loongarch paths
Append cur_key / cur_value into preallocated cache buffers at the input view height
Return persistent cache buffers as updated views in top_blobs[1] and top_blobs[2]
Add test_sdpa_kvcache coverage for normal use, invalid views, and tail dimensions
Add a direct persistent-buffer semantic test for buffer identity, append correctness, and capacity overflow
Disable Vulkan SDPA for kv_cache=2 until the Vulkan path implements the persistent layout
Route arm kv_cache=2 int8 SDPA through the generic int8 implementation for correctness

The new code is only selected for kv_cache=2. Existing kv_cache!=2 cases
continue to use the previous concat / no-cache paths.

kv_cache=2 input layout:

query, cur_key, cur_value, [attn_mask], past_key_view, past_value_view

past_key_view.h and past_value_view.h are the current valid cache
length. The underlying cstep keeps the cache capacity.

Output layout:

out, past_key_buffer, past_value_buffer
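
Because attn_mask is optional, the position of the cache views in the input list shifts by one. A small sketch of that indexing, assuming zero-based positions that follow the layout listed above; CacheInputIndices and cache_input_indices are hypothetical names, not identifiers from the PR:

```cpp
#include <cassert>

// Positions of the cache views in the kv_cache=2 input list:
// query, cur_key, cur_value, [attn_mask], past_key_view, past_value_view
struct CacheInputIndices
{
    int past_key;
    int past_value;
    int input_count;
};

// The optional attn_mask occupies index 3 when present,
// pushing both cache views back by one slot.
inline CacheInputIndices cache_input_indices(bool has_attn_mask)
{
    const int base = has_attn_mask ? 4 : 3;
    return CacheInputIndices{base, base + 1, base + 2};
}
```

On the output side the positions are fixed: the updated cache views are returned in top_blobs[1] and top_blobs[2].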

Correctness

Tested on aarch64:

cmake --build build-perf-linux --target test_sdpa test_sdpa_kvcache
taskset -c 0 ./build-perf-linux/tests/test_sdpa
taskset -c 0 ./build-perf-linux/tests/test_sdpa_kvcache

Both tests passed. The test_sdpa_kvcache coverage compares kv_cache=2
attention output numerically against reference output, including the int8 path,
and checks persistent cache buffer identity / view shape separately.

Performance

Baseline:

bedf9c65681a0b1e012081d43a566d20aaadf5c6

Current:

cd3c5d44

Measured on Qualcomm Oryon aarch64 with the CPU0-3 governor set to performance.
Single-thread runs use taskset -c 0; four-thread runs use taskset -c 0-3.

Shape:

d_k=192 d_v=128 heads=128 groups=16 dtype=fp16 ctx=4096

This PR is mainly an interface and cache-lifetime change. It avoids allocating a
new returned K/V cache blob every decode step: the output cache blobs are shallow
views of the caller-owned persistent cache buffers with updated h.

The current CPU fallback still builds compact fp32 K/V temporaries for the
existing Gemm-based SDPA math path, so latency is roughly neutral and temporary
per-call memory usage does not improve much by itself. The persistent view
layout is intended to be used by follow-up optimized kernels that can read
directly from the stable cache buffer.

Causal prefill latency

len	baseline, 1 thread (ms)	kv_cache=2, 1 thread (ms)	baseline, 4 threads (ms)	kv_cache=2, 4 threads (ms)
128	15.3350	15.2490	3.9160	4.0598
256	59.5649	59.7761	14.7642	14.9651
512	239.1162	239.2300	59.6160	60.3730
1024	944.7761	936.3608	248.9351	249.1870

Decode latency

past	baseline, 1 thread (ms)	kv_cache=2, 1 thread (ms)	baseline, 4 threads (ms)	kv_cache=2, 4 threads (ms)
128	0.7490	0.7329	0.1951	0.1941
512	2.2429	2.3611	0.7478	0.7119
1024	4.7610	4.5950	1.5261	1.5559
2048	10.1528	9.3242	3.5171	3.1182

Decode temporary peak memory

Per-call temporary allocator peak, four threads:

op_peak = workspace peak + blob peak

This only counts allocations made during one SDPA call after inputs are already
prepared. It does not include the caller-owned persistent K/V cache buffer.
In the current fallback path, SDPA still copies the valid cache range into
compact K/V blobs for the existing Gemm implementation, so the temporary peak
remains close to kv_cache=1.
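
How the allocators were instrumented is not shown in the PR. As a rough sketch of the op_peak definition above, assuming each allocator reports its allocations and frees to a counter (PeakCounter and this op_peak function are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Tracks live bytes and the high-water mark for one allocator.
struct PeakCounter
{
    std::size_t live = 0;
    std::size_t peak = 0;

    void on_alloc(std::size_t bytes)
    {
        live += bytes;
        peak = std::max(peak, live);
    }
    void on_free(std::size_t bytes)
    {
        live -= bytes;
    }
};

// op_peak = workspace peak + blob peak
inline std::size_t op_peak(const PeakCounter& workspace, const PeakCounter& blob)
{
    return workspace.peak + blob.peak;
}
```

Summing two independently recorded peaks gives an upper bound on the true simultaneous peak, since the two allocators need not reach their high-water marks at the same moment.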

past	baseline (MiB)	kv_cache=2 (MiB)
128	1.69	1.74
512	6.22	6.46
1024	13.24	12.21
2048	23.58	25.87

For this shape, the persistent K/V cache buffer for ctx=4096 is about 40 MiB:

(d_k + d_v) * groups * sizeof(fp16) * ctx
= (192 + 128) * 16 * 2 * 4096
= 40 MiB

That buffer is allocated and owned outside the SDPA call, then reused across
decode steps. It replaces the growing kv_cache=1 returned-cache ownership
model, but it is not shown in the temporary peak table above.
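
The size estimate above can be re-derived for other shapes or context lengths with a small helper. This sketch assumes K and V are stored densely with no row or channel padding; kv_cache_bytes is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for a persistent K/V cache, assuming dense storage
// (no per-row or per-channel alignment padding).
constexpr std::size_t kv_cache_bytes(std::size_t d_k, std::size_t d_v, std::size_t groups,
                                     std::size_t elem_bytes, std::size_t ctx)
{
    return (d_k + d_v) * groups * elem_bytes * ctx;
}

// The PR's shape: d_k=192, d_v=128, groups=16, fp16, ctx=4096.
static_assert(kv_cache_bytes(192, 128, 16, 2, 4096) == 40u * 1024 * 1024,
              "40 MiB for the shape used in the PR");
```

Under the same assumption, storing the cache as fp32 instead of fp16 would double the figure to 80 MiB.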

tencent-adm · 2026-06-09T03:41:06Z

All committers have signed the CLA.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dbc4ddb840

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cd3c5d4491

chatgpt-codex-connector · 2026-06-10T06:07:45Z

+    ncnn_add_perf(sdpa_flash)
+    ncnn_add_perf(sdpa_mem)
+    ncnn_add_perf(sdpa_mla_kvcache)


Declare perf env helpers before enabling targets

When the new sdpa_flash and sdpa_mla_kvcache perf targets are enabled here, they do not compile: tests/perf/perf_sdpa_flash.cpp and tests/perf/perf_sdpa_mla_kvcache.cpp call perf_match_env_int, perf_env_int, and perf_has_env, but tests/perf/perfutil.h only declares PerfMat/perf_layer, and a repo-wide rg shows no definitions for these helper names. Building perf tests with WITH_LAYER_sdpa therefore fails as soon as these targets are compiled; either add/export the helpers or keep the targets out until they are self-contained.
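
For illustration only, one possible shape for the three helpers named in this comment, built on std::getenv. The signatures and the matching rule are assumptions; the definitions actually added to the PR are not shown on this page and may differ.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical signatures: the PR's actual helpers may differ.

// True if the environment variable is set (even to an empty string).
inline bool perf_has_env(const char* name)
{
    return std::getenv(name) != nullptr;
}

// Integer value of the variable, or default_value if it is unset,
// empty, or not a plain decimal number.
inline int perf_env_int(const char* name, int default_value)
{
    const char* s = std::getenv(name);
    if (!s || !*s)
        return default_value;

    char* end = nullptr;
    const long v = std::strtol(s, &end, 10);
    return (*end == '\0') ? (int)v : default_value;
}

// Assumed filter rule: a case matches when the variable is unset
// (no filter requested) or when it parses to `value`.
// A non-numeric setting falls back to `value` and therefore also matches.
inline bool perf_match_env_int(const char* name, int value)
{
    return !perf_has_env(name) || perf_env_int(name, value) == value;
}
```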

github-actions · 2026-06-11T08:56:47Z

Please enable GitHub Actions in YOUR FORKED REPO so that the code-format workflow can run

codecov-commenter · 2026-06-11T11:28:37Z

Codecov Report

❌ Patch coverage is 85.89342% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.47%. Comparing base (bedf9c6) to head (28bc4c3).
⚠️ Report is 23 commits behind head on master.

Files with missing lines	Patch %	Lines
src/layer/sdpa.cpp	72.22%	35 Missing ⚠️
src/layer/loongarch/sdpa_loongarch.cpp	93.61%	3 Missing ⚠️
src/layer/mips/sdpa_mips.cpp	93.61%	3 Missing ⚠️
src/layer/x86/sdpa_x86.cpp	93.75%	3 Missing ⚠️
src/layer/arm/sdpa_arm.cpp	98.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6776      +/-   ##
==========================================
- Coverage   95.60%   95.47%   -0.13%     
==========================================
  Files         960      943      -17     
  Lines      404032   409662    +5630     
==========================================
+ Hits       386278   391142    +4864     
- Misses      17754    18520     +766

+        for (int q = 0; q < num_group; q++)
+        {
+            unsigned char* kd = (unsigned char*)past_key.channel(q).data + (size_t)past_seqlen * embed_dim * elemsize;
+            memcpy(kd, cur_key.channel(q).data, embed_dim * cur_seqlen * elemsize);
+            unsigned char* vd = (unsigned char*)past_value.channel(q).data + (size_t)past_seqlen * out_embed_dim * elemsize;
+            memcpy(vd, cur_value.channel(q).data, out_embed_dim * cur_seqlen * elemsize);


+            // Convert from source elemsize to fp32
+            if (elemsize == 4)
+            {
+                memcpy(key.channel(q), past_key.channel(q), embed_dim * dst_seqlen * 4);
+                memcpy(value.channel(q), past_value.channel(q), out_embed_dim * dst_seqlen * 4);


-
-            memcpy(value_head.row(0), past_value_head, out_embed_dim * past_seqlen * sizeof(float));
-            memcpy(value_head.row(past_seqlen), cur_value_head, out_embed_dim * cur_seqlen * sizeof(float));
+            memcpy((float*)key.channel(q), past_key.channel(q), embed_dim * past_seqlen * sizeof(float));


+        {
+            if (elemsize == 4)
+            {
+                memcpy(value.channel(q), past_value.channel(q), out_embed_dim * dst_seqlen * sizeof(float));


+            unsigned char* pk = (unsigned char*)past_key.channel(q).data;
+            unsigned char* pv = (unsigned char*)past_value.channel(q).data;
+            memcpy(pk + (size_t)past_seqlen * embed_dim * elemsize,
+                   cur_key.channel(q).data, embed_dim * cur_seqlen * elemsize);
+            memcpy(pv + (size_t)past_seqlen * out_embed_dim * elemsize,
+                   cur_value.channel(q).data, out_embed_dim * cur_seqlen * elemsize);


+        #pragma omp parallel for num_threads(opt.num_threads)
+        for (int q = 0; q < num_group; q++)
+        {
+            memcpy(key.channel(q), past_key.channel(q), embed_dim * dst_seqlen * elemsize);
+            memcpy(value.channel(q), past_value.channel(q), out_embed_dim * dst_seqlen * elemsize);
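
The copy lengths in the fragments above are products of dimension values. The scanner's findings are not reproduced on this page, so the following is only a general C++ observation: if the dimensions are int (an assumption here), a product such as embed_dim * dst_seqlen * 4 is evaluated entirely in int and can overflow for large shapes before it is widened. A sketch of a helper that widens first (row_bytes is a hypothetical name):

```cpp
#include <cassert>
#include <cstddef>

// Byte count for copying `rows` rows of `width` elements.
// Casting the first operand promotes the whole product to size_t,
// so no part of the multiplication is carried out in int.
// Both dimensions are expected to be non-negative.
inline std::size_t row_bytes(int width, int rows, std::size_t elemsize)
{
    return (std::size_t)width * rows * elemsize;
}
```

This mirrors the (size_t) cast the fragments already apply to the destination offsets.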


chatgpt-codex-connector

💡 Codex Review

ncnn/tests/perf/perfutil.cpp

Lines 544 to 546 in 6c518a0

    for (size_t j = 0; j < inputs.size(); j++)
    {
        cmd.record_upload(inputs[j], vk_inputs[j], opt);

Preserve persistent-cache capacity before GPU upload

When SDPA kv_cache=2 perf inputs pass a persistent cache view (inputs[3].h = past_seqlen, often 0), uploading that view here allocates the VkMat with only the live length instead of the full capacity preserved in Mat::cstep. SDPA_vulkan::forward derives capacity from past_key.cstep, so the new Vulkan flash/decode perf cases reject the append or benchmark a too-small cache, and run_layer_forward_gpu currently drops the return code, making the timings silently invalid. Please mirror the CPU convert_input_layout_persistent_view handling for these inputs before upload.

github-actions Bot added test layer arm loongarch mips x86 labels Jun 9, 2026

chatgpt-codex-connector Bot reviewed Jun 9, 2026

Comment thread src/layer/sdpa.cpp Outdated

MollySophia added 2 commits June 9, 2026 12:48

sdpa persistent kv cache

b63e241

test sdpa persistent kv cache semantics

6f0a47f

MollySophia force-pushed the persistent-kvcache branch from dbc4ddb to 6f0a47f Compare June 9, 2026 04:50

disable vulkan sdpa for persistent kv cache

1c63fb0

github-actions Bot added the vulkan label Jun 9, 2026

Use kvcache views for persistent sdpa

cd3c5d4

chatgpt-codex-connector Bot reviewed Jun 10, 2026

MollySophia and others added 2 commits June 11, 2026 16:47

fix perfutil: add missing perf_env_int helpers used by sdpa perf tests

28bc4c3

vulkan sdpa persistent kv cache

6c518a0

github-advanced-security AI found potential problems Jun 11, 2026

chatgpt-codex-connector Bot reviewed Jun 11, 2026

MollySophia wants to merge 6 commits into Tencent:master from MollySophia:persistent-kvcache

4 participants
