rmsnorm vulkan subgroup reduce optimization#6755

Open
futz12 wants to merge 3 commits into Tencent:master from futz12:rmsnorm-subgroup-reduce


@futz12 futz12 commented May 29, 2026

  • add rmsnorm_reduce_subgroup shader using subgroupAdd arithmetic
  • compute rms in a single dispatch, eliminating ~7 ping-pong reduces
  • fall back to shared memory tree reduction when subgroup arithmetic is unavailable
  • add perf_rmsnorm benchmark

| shape | precision | baseline | optimized | speedup |
|---|---|---|---|---|
| [4096,1,1] | fp32 | 16.42 us | 6.41 us | 2.6x |
| [4096,1,1] | fp16ps | 16.32 us | 5.62 us | 2.9x |
| [4096,1,1] | fp16psa | 17.40 us | 5.76 us | 3.0x |
| [4096,1,1] | bf16ps | 17.33 us | 5.77 us | 3.0x |
| [4096,1,32] | fp32 | 28.59 us | 12.40 us | 2.3x |
| [4096,1,32] | fp16ps | 27.86 us | 11.80 us | 2.4x |
| [4096,1,32] | fp16psa | 26.66 us | 11.80 us | 2.3x |
| [4096,1,32] | bf16ps | 25.45 us | 11.10 us | 2.3x |
| [16384,1,1] | fp32 | 21.86 us | 10.14 us | 2.2x |
| [16384,1,1] | fp16ps | 23.70 us | 9.33 us | 2.5x |
| [16384,1,1] | fp16psa | 23.22 us | 9.32 us | 2.5x |
| [16384,1,1] | bf16ps | 24.24 us | 9.34 us | 2.6x |
| [5120,1,1] | fp32 | 22.04 us | 7.07 us | 3.1x |
| [5120,1,1] | fp16ps | 21.38 us | 6.28 us | 3.4x |
| [5120,1,1] | fp16psa | 21.58 us | 6.46 us | 3.3x |
| [5120,1,1] | bf16ps | 21.45 us | 6.32 us | 3.4x |
| [4096,512,1] | fp32 | 94.40 us | 43.76 us | 2.2x |
| [4096,512,1] | fp16ps | 93.90 us | 42.67 us | 2.2x |
| [4096,512,1] | fp16psa | 99.70 us | 42.86 us | 2.3x |
| [4096,512,1] | bf16ps | 92.70 us | 43.06 us | 2.2x |
| [1024,1,1] | fp32 | 20.19 us | 5.57 us | 3.6x |
| [1024,1,1] | fp16ps | 15.67 us | 5.08 us | 3.1x |
| [1024,1,1] | fp16psa | 14.63 us | 5.09 us | 2.9x |
| [1024,1,1] | bf16ps | 15.19 us | 5.04 us | 3.0x |
| [768,1,1] | fp32 | 16.24 us | 13.95 us | 1.2x |
| [768,1,1] | fp16ps | 14.45 us | 6.71 us | 2.2x |
| [768,1,1] | fp16psa | 14.75 us | 5.32 us | 2.8x |
| [768,1,1] | bf16ps | 15.33 us | 5.14 us | 3.0x |

perf improvement on RTX 4060 [5120,1,1] fp32: ~3.0ms -> ~0.7ms (4.4x)
perf improvement on RTX 4060 [1024,1,1] fp32: ~14ms -> ~5.5ms (2.5x)
perf improvement on Intel iGPU [5120,1,1] fp32: ~26ms -> ~11ms (2.4x)
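
For readers unfamiliar with the reduction structure the description refers to, here is a minimal CPU-side sketch in C++. It is not the PR's shader. It emulates one workgroup computing a sum of squares in two levels (a stand-in for `subgroupAdd`, then a combine across subgroups) and, separately, a tree reduction like the shared-memory fallback. The workgroup size of 256 matches the stride in the shader excerpt quoted in the review; the subgroup width of 32, the `eps` parameter, and all function names are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// 256 matches the loop stride in the quoted shader; 32 is an assumed subgroup width.
constexpr int kWorkgroupSize = 256;
constexpr int kSubgroupSize = 32;

// Each emulated thread accumulates a strided partial sum of squares.
static std::vector<float> strided_partials(const std::vector<float>& x)
{
    std::vector<float> partial(kWorkgroupSize, 0.0f);
    for (int tid = 0; tid < kWorkgroupSize; ++tid)
        for (std::size_t t = static_cast<std::size_t>(tid); t < x.size(); t += kWorkgroupSize)
            partial[tid] += x[t] * x[t];
    return partial;
}

// Two-level reduction: sum inside each subgroup (stand-in for subgroupAdd),
// then combine the per-subgroup sums.
float sum_of_squares_two_level(const std::vector<float>& x)
{
    const std::vector<float> partial = strided_partials(x);
    std::vector<float> subgroup_sum(kWorkgroupSize / kSubgroupSize, 0.0f);
    for (int tid = 0; tid < kWorkgroupSize; ++tid)
        subgroup_sum[tid / kSubgroupSize] += partial[tid];

    float total = 0.0f;
    for (float s : subgroup_sum)
        total += s;
    return total;
}

// Fallback-style tree reduction: halve the active range each step,
// as a shared-memory reduction would between barriers.
float sum_of_squares_tree(const std::vector<float>& x)
{
    std::vector<float> shared = strided_partials(x);
    for (int stride = kWorkgroupSize / 2; stride > 0; stride /= 2)
        for (int tid = 0; tid < stride; ++tid)
            shared[tid] += shared[tid + stride];
    return shared[0];
}

// RMSNorm scale factor 1 / sqrt(mean(x^2) + eps); eps is a hypothetical parameter.
float rms_scale(float sqsum, std::size_t n, float eps)
{
    return 1.0f / std::sqrt(sqsum / static_cast<float>(n) + eps);
}
```

Both paths produce the whole sum inside one emulated workgroup, which is the property that lets a single dispatch replace a chain of intermediate reduce passes.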

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 007817c414

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

pipeline_rmsnorm_norm_pack4->create(LayerShaderType::rmsnorm_norm_pack4, opt, spec);
}

if (vkdev->info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT)


P2: Honor disabled subgroup ops before creating the subgroup pipeline

When callers set `opt.use_subgroup_ops = false` on a device that still advertises subgroup arithmetic, this branch still creates and later uses the new subgroup RMSNorm pipeline. The shader is guarded by the device macro `ncnn_subgroup_arithmetic`, so it will compile and execute subgroup instructions despite the option being disabled (and the pipeline compiler targets the non-subgroup SPIR-V environment when the option is false). Please gate this optimization on both the device capability and `opt.use_subgroup_ops`, falling back to the existing reduce path when subgroup ops are disabled.
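
A minimal sketch of the suggested gate, written as standalone C++ so it can be compiled in isolation. `DeviceInfoSketch` and `OptionSketch` are hypothetical stand-ins for the real device-info and option types, and `should_use_subgroup_rmsnorm` is an illustrative helper name, not something from the PR. The constant mirrors the value of `VK_SUBGROUP_FEATURE_ARITHMETIC_BIT` in the Vulkan headers.

```cpp
#include <cstdint>

// Same value as VK_SUBGROUP_FEATURE_ARITHMETIC_BIT (0x00000004) in the Vulkan headers.
constexpr std::uint32_t kSubgroupArithmeticBit = 0x00000004u;

// Hypothetical stand-in for the device info queried via support_subgroup_ops().
struct DeviceInfoSketch
{
    std::uint32_t subgroup_ops = 0;
    std::uint32_t support_subgroup_ops() const { return subgroup_ops; }
};

// Hypothetical stand-in for the option struct carrying use_subgroup_ops.
struct OptionSketch
{
    bool use_subgroup_ops = true;
};

// The subgroup pipeline is selected only when the device advertises subgroup
// arithmetic AND the caller has not disabled subgroup ops.
bool should_use_subgroup_rmsnorm(const DeviceInfoSketch& info, const OptionSketch& opt)
{
    const bool device_ok = (info.support_subgroup_ops() & kSubgroupArithmeticBit) != 0;
    return opt.use_subgroup_ops && device_ok;
}
```

With a check of this shape, a caller who disables subgroup ops would get the existing reduce path even on hardware that supports subgroup arithmetic.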


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c9358dd23


Comment on lines +34 to +42
afp sqsum = afp(0.f);

for (int t = tid; t < p.group_size; t += 256)
{
    afp v = buffer_ld1(bottom_top_blob_data, offset + t);
    sqsum += v * v;
}

afp sg_sqsum = subgroupAdd(sqsum);


P2: Use float accumulation for the subgroup RMS sum

When `opt.use_fp16_arithmetic` is enabled, `afp` is defined as `float16_t`, so this new subgroup path accumulates the whole pack1 RMS sum in half precision. That is a regression from the fallback path, which reduces through the `*_fp16_to_fp32` shaders; for common pack1/dims1 cases such as 4096 features, inputs around 8 already make the sum of squares exceed fp16 range and produce inf/incorrect normalization. Keep the subgroup accumulator and `subgroupAdd` value in float as the pack4 shader does.
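
The arithmetic behind this concern can be checked with a small C++ sketch. It accumulates in `float`, as the review recommends, and compares the result with 65504, the largest finite value in IEEE 754 binary16. The helper names are hypothetical and the constant-input case is chosen only to keep the numbers easy to follow: 4096 elements of value 8 give 4096 × 64 = 262144, well past the fp16 limit.

```cpp
#include <cstddef>

// Largest finite value representable in IEEE 754 binary16 (fp16).
constexpr float kFp16Max = 65504.0f;

// Sum of squares of n copies of v, accumulated in float.
float sum_of_squares_const(std::size_t n, float v)
{
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += v * v;
    return acc;
}

// True when the value is too large for a finite fp16 accumulator.
bool overflows_fp16(float value)
{
    return value > kFp16Max;
}
```

For example, `overflows_fp16(sum_of_squares_const(4096, 8.0f))` is true, while the same input magnitude over 768 elements (768 × 64 = 49152) still fits, which suggests the risk grows with the feature dimension.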

