rmsnorm vulkan subgroup reduce optimization #6755
- add rmsnorm_reduce_subgroup shader using subgroupAdd arithmetic
- compute rms in single dispatch, eliminating ~7 ping-pong reduces
- fallback to shared memory tree reduction when subgroup arithmetic unavailable
- add perf_rmsnorm benchmark

perf improvement on RTX 4060 [5120,1,1] fp32: ~3.0ms -> ~0.7ms (4.4x)
perf improvement on RTX 4060 [1024,1,1] fp32: ~14ms -> ~5.5ms (2.5x)
perf improvement on Intel iGPU [5120,1,1] fp32: ~26ms -> ~11ms (2.4x)
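The single-dispatch reduction described above can be modelled on the CPU to check the arithmetic. This is a minimal sketch, not the shader from this PR: the function name, the `eps` parameter, and the default subgroup size of 32 are assumptions chosen for illustration, and the inner subgroup loop only stands in for what `subgroupAdd` would do on the GPU. It uses the standard RMSNorm definition, `sqrt(mean(x^2) + eps)`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of one 256-thread workgroup; names are illustrative,
// not ncnn API. Accumulation is done in float throughout.
float rms_single_dispatch_model(const std::vector<float>& x, float eps,
                                int workgroup_size = 256, int subgroup_size = 32)
{
    // Step 1: each "thread" accumulates a strided partial sum of squares,
    // mirroring a loop of the form `for (t = tid; t < n; t += 256)`.
    std::vector<float> partial(workgroup_size, 0.f);
    for (int tid = 0; tid < workgroup_size; tid++)
        for (std::size_t t = tid; t < x.size(); t += workgroup_size)
            partial[tid] += x[t] * x[t];

    // Step 2: stand-in for subgroupAdd, summing the partials inside each subgroup.
    const int num_subgroups = workgroup_size / subgroup_size;
    std::vector<float> sg_sum(num_subgroups, 0.f);
    for (int tid = 0; tid < workgroup_size; tid++)
        sg_sum[tid / subgroup_size] += partial[tid];

    // Step 3: combine the per-subgroup results (shared memory on a GPU).
    float total = 0.f;
    for (float s : sg_sum)
        total += s;

    // Assumes x is non-empty.
    return std::sqrt(total / static_cast<float>(x.size()) + eps);
}
```

All three steps happen inside one workgroup, which is why no intermediate buffers or extra dispatches are needed in this model.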
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 007817c414
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
            pipeline_rmsnorm_norm_pack4->create(LayerShaderType::rmsnorm_norm_pack4, opt, spec);
        }

        if (vkdev->info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT)
Honor disabled subgroup ops before creating the subgroup pipeline
When callers set opt.use_subgroup_ops = false on a device that still advertises subgroup arithmetic, this branch still creates and later uses the new subgroup RMSNorm pipeline. The shader is guarded by the device macro ncnn_subgroup_arithmetic, so it will compile/execute subgroup instructions despite the option being disabled (and the pipeline compiler targets the non-subgroup SPIR-V environment when the option is false). Please gate this optimization on both the device capability and opt.use_subgroup_ops, falling back to the existing reduce path when subgroup ops are disabled.
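The gating the review asks for amounts to a two-part condition. The sketch below is illustrative only: `OptionSketch` and `should_use_subgroup_rmsnorm` are hypothetical stand-ins rather than ncnn types, and the constant simply mirrors the value of `VK_SUBGROUP_FEATURE_ARITHMETIC_BIT` so the example compiles without the Vulkan headers.

```cpp
#include <cstdint>

// Same value as VK_SUBGROUP_FEATURE_ARITHMETIC_BIT in the Vulkan headers.
constexpr uint32_t kSubgroupArithmeticBit = 0x00000004;

// Hypothetical stand-in for the relevant field of the options struct.
struct OptionSketch
{
    bool use_subgroup_ops = true;
};

// Returns true only when the device advertises subgroup arithmetic AND the
// caller has not disabled subgroup ops; otherwise the existing reduce path
// should be used.
bool should_use_subgroup_rmsnorm(uint32_t supported_subgroup_ops, const OptionSketch& opt)
{
    const bool device_ok = (supported_subgroup_ops & kSubgroupArithmeticBit) != 0;
    return device_ok && opt.use_subgroup_ops;
}
```

The same predicate would need to guard both pipeline creation and the later dispatch so the two stay consistent.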
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c9358dd23
        afp sqsum = afp(0.f);

        for (int t = tid; t < p.group_size; t += 256)
        {
            afp v = buffer_ld1(bottom_top_blob_data, offset + t);
            sqsum += v * v;
        }

        afp sg_sqsum = subgroupAdd(sqsum);
Use float accumulation for the subgroup RMS sum
When opt.use_fp16_arithmetic is enabled, afp is defined as float16_t, so this new subgroup path accumulates the whole pack1 RMS sum in half precision. That is a regression from the fallback path, which reduces through the *_fp16_to_fp32 shaders; for common pack1/dims1 cases such as 4096 features, inputs around 8 already make the sum of squares exceed fp16 range and produce inf/incorrect normalization. Keep the subgroup accumulator and subgroupAdd value in float as the pack4 shader does.
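The overflow claim can be checked with simple arithmetic. The sketch below assumes a worst case where every feature has the same magnitude; the helper name is hypothetical, and the only external fact used is that the largest finite IEEE 754 half-precision value is 65504.

```cpp
// Largest finite IEEE 754 binary16 (fp16) value.
constexpr float kFp16Max = 65504.0f;

// Hypothetical helper: would the sum of squares of `features` values, each of
// the given magnitude, exceed the finite fp16 range?
bool sqsum_exceeds_fp16(int features, float magnitude)
{
    const float sqsum = static_cast<float>(features) * magnitude * magnitude;
    return sqsum > kFp16Max;
}
```

For 4096 features at magnitude 8 the sum of squares is 4096 × 64 = 262144, well past 65504, so `sqsum_exceeds_fp16(4096, 8.0f)` is true. Under this uniform-magnitude model the limit is already crossed at magnitude 4 (4096 × 16 = 65536), which is why a float accumulator is the safer choice.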