rmsnorm vulkan subgroup reduce optimization by futz12 · Pull Request #6755 · Tencent/ncnn

futz12 · 2026-05-29T05:16:48Z

add rmsnorm_reduce_subgroup shader using subgroupAdd arithmetic
compute rms in single dispatch, eliminating ~7 ping-pong reduces
fallback to shared memory tree reduction when subgroup arithmetic unavailable
add perf_rmsnorm benchmark

shape	precision	baseline	optimized	speedup
[4096,1,1]	fp32	16.42 us	6.41 us	2.6x
[4096,1,1]	fp16ps	16.32 us	5.62 us	2.9x
[4096,1,1]	fp16psa	17.40 us	5.76 us	3.0x
[4096,1,1]	bf16ps	17.33 us	5.77 us	3.0x
[4096,1,32]	fp32	28.59 us	12.40 us	2.3x
[4096,1,32]	fp16ps	27.86 us	11.80 us	2.4x
[4096,1,32]	fp16psa	26.66 us	11.80 us	2.3x
[4096,1,32]	bf16ps	25.45 us	11.10 us	2.3x
[16384,1,1]	fp32	21.86 us	10.14 us	2.2x
[16384,1,1]	fp16ps	23.70 us	9.33 us	2.5x
[16384,1,1]	fp16psa	23.22 us	9.32 us	2.5x
[16384,1,1]	bf16ps	24.24 us	9.34 us	2.6x
[5120,1,1]	fp32	22.04 us	7.07 us	3.1x
[5120,1,1]	fp16ps	21.38 us	6.28 us	3.4x
[5120,1,1]	fp16psa	21.58 us	6.46 us	3.3x
[5120,1,1]	bf16ps	21.45 us	6.32 us	3.4x
[4096,512,1]	fp32	94.40 us	43.76 us	2.2x
[4096,512,1]	fp16ps	93.90 us	42.67 us	2.2x
[4096,512,1]	fp16psa	99.70 us	42.86 us	2.3x
[4096,512,1]	bf16ps	92.70 us	43.06 us	2.2x
[1024,1,1]	fp32	20.19 us	5.57 us	3.6x
[1024,1,1]	fp16ps	15.67 us	5.08 us	3.1x
[1024,1,1]	fp16psa	14.63 us	5.09 us	2.9x
[1024,1,1]	bf16ps	15.19 us	5.04 us	3.0x
[768,1,1]	fp32	16.24 us	13.95 us	1.2x
[768,1,1]	fp16ps	14.45 us	6.71 us	2.2x
[768,1,1]	fp16psa	14.75 us	5.32 us	2.8x
[768,1,1]	bf16ps	15.33 us	5.14 us	3.0x

- add rmsnorm_reduce_subgroup shader using subgroupAdd arithmetic
- compute rms in single dispatch, eliminating ~7 ping-pong reduces
- fallback to shared memory tree reduction when subgroup arithmetic unavailable
- add perf_rmsnorm benchmark

perf improvement on RTX 4060 [5120,1,1] fp32: ~3.0ms -> ~0.7ms (4.4x)
perf improvement on RTX 4060 [1024,1,1] fp32: ~14ms -> ~5.5ms (2.5x)
perf improvement on Intel iGPU [5120,1,1] fp32: ~26ms -> ~11ms (2.4x)
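The reduction strategy listed above can be illustrated on the CPU. This is a minimal sketch, not the PR's shader: it emulates one 256-thread workgroup in which each thread accumulates a strided partial sum of squares, then combines the partials either per subgroup (standing in for `subgroupAdd`) or with a shared-memory style tree reduction. The subgroup size of 32 and all function names are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constants: 256 matches the stride in the shader excerpt,
// the subgroup size of 32 is an assumption and varies between devices.
constexpr std::size_t kWorkgroupSize = 256;
constexpr std::size_t kSubgroupSize = 32;

// Step 1: every emulated thread accumulates a strided partial sum of squares.
std::vector<float> strided_partials(const std::vector<float>& x)
{
    std::vector<float> partial(kWorkgroupSize, 0.f);
    for (std::size_t tid = 0; tid < kWorkgroupSize; tid++)
        for (std::size_t t = tid; t < x.size(); t += kWorkgroupSize)
            partial[tid] += x[t] * x[t];
    return partial;
}

// Step 2a: emulate subgroupAdd within each subgroup, then add subgroup totals.
float reduce_with_subgroups(const std::vector<float>& partial)
{
    assert(partial.size() == kWorkgroupSize);
    float total = 0.f;
    for (std::size_t sg = 0; sg < kWorkgroupSize / kSubgroupSize; sg++)
    {
        float sg_sum = 0.f; // the value subgroupAdd would return to each lane
        for (std::size_t lane = 0; lane < kSubgroupSize; lane++)
            sg_sum += partial[sg * kSubgroupSize + lane];
        total += sg_sum;
    }
    return total;
}

// Step 2b: fallback, a shared-memory style tree reduction that halves the
// number of active threads at every step.
float reduce_with_tree(std::vector<float> shared)
{
    assert(shared.size() == kWorkgroupSize);
    for (std::size_t stride = kWorkgroupSize / 2; stride > 0; stride /= 2)
        for (std::size_t tid = 0; tid < stride; tid++)
            shared[tid] += shared[tid + stride];
    return shared[0];
}
```

Both paths produce the same sum of squares for one group; on a GPU the subgroup variant needs fewer barriers and no extra dispatches, which is the saving the description attributes to the new shader.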

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 007817c414

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-29T05:19:34Z

        pipeline_rmsnorm_norm_pack4->create(LayerShaderType::rmsnorm_norm_pack4, opt, spec);
    }
+
+    if (vkdev->info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT)


Honor disabled subgroup ops before creating the subgroup pipeline

When callers set opt.use_subgroup_ops = false on a device that still advertises subgroup arithmetic, this branch still creates and later uses the new subgroup RMSNorm pipeline. The shader is guarded by the device macro ncnn_subgroup_arithmetic, so it will compile/execute subgroup instructions despite the option being disabled (and the pipeline compiler targets the non-subgroup SPIR-V environment when the option is false). Please gate this optimization on both the device capability and opt.use_subgroup_ops, falling back to the existing reduce path when subgroup ops are disabled.
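The gating requested in this comment could look roughly like the following. This is a sketch under stated assumptions, not the fix that was committed: `OptionSketch` and `should_use_subgroup_rmsnorm` are hypothetical names, and the arithmetic bit is redefined locally (with the same value as `VK_SUBGROUP_FEATURE_ARITHMETIC_BIT`) so the example compiles without the Vulkan headers.

```cpp
#include <cassert>
#include <cstdint>

// Same value as VK_SUBGROUP_FEATURE_ARITHMETIC_BIT in the Vulkan headers,
// redefined here so the sketch builds without Vulkan.
constexpr uint32_t kSubgroupFeatureArithmeticBit = 0x00000004;

// Hypothetical stand-in for the option struct; only the field named in the
// review comment is modeled.
struct OptionSketch
{
    bool use_subgroup_ops = true;
};

// Returns true only when the caller allows subgroup ops AND the device
// reports subgroup arithmetic support. Otherwise the existing multi-pass
// reduce path should be used.
bool should_use_subgroup_rmsnorm(const OptionSketch& opt, uint32_t device_subgroup_ops)
{
    const bool device_ok = (device_subgroup_ops & kSubgroupFeatureArithmeticBit) != 0;
    return opt.use_subgroup_ops && device_ok;
}

// Usage example: a capable device with the option turned off must fall back.
inline void gating_example()
{
    OptionSketch opt;
    opt.use_subgroup_ops = false;
    assert(!should_use_subgroup_rmsnorm(opt, kSubgroupFeatureArithmeticBit));
}
```

The same predicate would need to guard both pipeline creation and the later dispatch, so that the two never disagree about which path is active.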


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c9358dd23


chatgpt-codex-connector · 2026-05-29T08:47:33Z

+    afp sqsum = afp(0.f);
+
+    for (int t = tid; t < p.group_size; t += 256)
+    {
+        afp v = buffer_ld1(bottom_top_blob_data, offset + t);
+        sqsum += v * v;
+    }
+
+    afp sg_sqsum = subgroupAdd(sqsum);


Use float accumulation for the subgroup RMS sum

When opt.use_fp16_arithmetic is enabled, afp is defined as float16_t, so this new subgroup path accumulates the whole pack1 RMS sum in half precision. That is a regression from the fallback path, which reduces through the *_fp16_to_fp32 shaders; for common pack1/dims1 cases such as 4096 features, inputs around 8 already make the sum of squares exceed fp16 range and produce inf/incorrect normalization. Keep the subgroup accumulator and subgroupAdd value in float as the pack4 shader does.
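The overflow described here follows from simple arithmetic: 4096 inputs of magnitude 8 give a sum of squares of 4096 × 64 = 262144, well above 65504, the largest finite fp16 value. The sketch below checks that on the CPU with a float accumulator. It is illustrative only; the function names and the `eps` parameter are hypothetical and not taken from the PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest finite IEEE 754 binary16 (fp16) value.
constexpr float kFp16MaxFinite = 65504.f;

// Sum of squares accumulated in float, as the review suggests.
float sum_of_squares_fp32(const std::vector<float>& x)
{
    float sqsum = 0.f;
    for (float v : x)
        sqsum += v * v;
    return sqsum;
}

// True when the sum is larger than any finite fp16 value, so an fp16
// accumulator could not represent it.
bool exceeds_fp16_range(float sqsum)
{
    return sqsum > kFp16MaxFinite;
}

// RMS from the float sum; eps is a hypothetical stabilizer parameter.
float rms_from_sum(float sqsum, std::size_t n, float eps)
{
    assert(n > 0);
    return std::sqrt(sqsum / static_cast<float>(n) + eps);
}
```

With a float accumulator the RMS of 4096 values of 8 comes out as 8, whereas an accumulator limited to fp16 range cannot hold the intermediate sum at all.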


github-actions Bot added vulkan test labels May 29, 2026

chatgpt-codex-connector Bot reviewed May 29, 2026

vulkan: remove dead #else branches in rmsnorm subgroup reduce shaders

7c9358d

chatgpt-codex-connector Bot reviewed May 29, 2026

apply code-format changes

92f33fe

futz12 wants to merge 3 commits into Tencent:master from futz12:rmsnorm-subgroup-reduce
