GroupNorm Vulkan subgroup reduce optimization #6756
Add single-dispatch subgroup arithmetic fast-path for GroupNorm Vulkan.

- groupnorm_reduce_subgroup.comp: pack1, two-pass (sum->mean, sqsum->var)
- groupnorm_reduce_subgroup_pack4.comp: pack4 with component-boundary checks
- groupnorm_vulkan.cpp/h: conditional subgroup path with fallback
- tests/perf/perf_groupnorm.cpp: benchmark for representative shapes

Performance (NVIDIA RTX 4060 Laptop, gpu-1, median ms):

| shape | precision | baseline | optimized | speedup |
|---------------|-----------|----------|-----------|---------|
| [64,64,128] | fp32 | 37.94 | 15.12 | 2.5x |
| [64,64,128] | fp16 | 36.18 | 11.85 | 3.1x |
| [64,64,128] | fp16psa | 36.11 | 11.67 | 3.1x |
| [64,64,128] | bf16 | 36.62 | 12.28 | 3.0x |
| [32,32,256] | fp32 | 30.37 | 5.56 | 5.5x |
| [32,32,256] | fp16 | 29.66 | 0.92 | 32.2x |
| [32,32,256] | fp16psa | 29.71 | 0.92 | 32.3x |
| [32,32,256] | bf16 | 29.80 | 0.95 | 31.4x |
| [16,16,512] | fp32 | 29.66 | 4.46 | 6.7x |
| [16,16,512] | fp16 | 29.11 | 2.89 | 10.1x |
| [16,16,512] | fp16psa | 29.20 | 2.58 | 11.3x |
| [16,16,512] | bf16 | 29.40 | 2.37 | 12.4x |
| [8,8,512] | fp32 | 24.54 | 6.38 | 3.8x |
| [8,8,512] | fp16 | 24.38 | 5.67 | 4.3x |
| [8,8,512] | fp16psa | 24.34 | 5.55 | 4.4x |
| [8,8,512] | bf16 | 24.39 | 5.68 | 4.3x |
| [224,224,3] | fp32 | 38.49 | 35.63 | 1.1x |
| [224,224,3] | fp16 | 38.30 | 31.06 | 1.2x |
| [224,224,3] | fp16psa | 38.35 | 31.78 | 1.2x |
| [224,224,3] | bf16 | 38.48 | 30.66 | 1.3x |
| [224,224,64] | fp32 | 13.22 | 9.95 | 1.3x |
| [224,224,64] | fp16 | 8.53 | 8.93 | 1.0x |
| [224,224,64] | fp16psa | 8.55 | 8.96 | 1.0x |
| [224,224,64] | bf16 | 8.88 | 9.24 | 1.0x |
| [4096,1,1] | fp32 | 25.18 | 8.53 | 3.0x |
| [4096,1,1] | fp16 | 25.07 | 7.58 | 3.3x |
| [4096,1,1] | fp16psa | 25.12 | 7.49 | 3.4x |
| [4096,1,1] | bf16 | 25.13 | 7.63 | 3.3x |
| [512,1,1] | fp32 | 6.96 | 6.25 | 1.1x |
| [512,1,1] | fp16 | 4.40 | 5.34 | 0.8x |
| [512,1,1] | fp16psa | 2.87 | 5.45 | 0.5x |
| [512,1,1] | bf16 | 2.29 | 5.41 | 0.4x |

SD-like shapes (32x32x256, 16x16x512) benefit most, because the reduction drops from roughly 10-15 dispatches to 3 (subgroup reduce -> coeffs -> norm). Small shapes (512x1x1) regress because the fixed cost of the subgroup shader exceeds the savings for tiny group sizes.

test_groupnorm passes on NVIDIA RTX 4060.
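For readers who want the math behind "two-pass (sum->mean, sqsum->var)", here is a minimal CPU sketch of the statistics one group needs. It is not the shader from this PR: `GroupStats`, `group_stats_two_pass`, `group_normalize`, and the `eps` parameter are hypothetical names. It also assumes the second pass sums squared deviations from the mean, which the description does not spell out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for one normalization group (not the PR's shader).
struct GroupStats
{
    float mean;
    float var;
};

// Two-pass statistics. Precondition: x is non-empty.
GroupStats group_stats_two_pass(const std::vector<float>& x)
{
    const float n = static_cast<float>(x.size());

    // Pass 1: sum -> mean
    float sum = 0.f;
    for (float v : x) sum += v;
    const float mean = sum / n;

    // Pass 2: sum of squared deviations -> variance (assumed formulation)
    float sqsum = 0.f;
    for (float v : x) sqsum += (v - mean) * (v - mean);

    return GroupStats{mean, sqsum / n};
}

// Normalize one group in place; eps guards against division by zero.
void group_normalize(std::vector<float>& x, float eps)
{
    const GroupStats s = group_stats_two_pass(x);
    const float inv_std = 1.f / std::sqrt(s.var + eps);
    for (float& v : x) v = (v - s.mean) * inv_std;
}
```

In the GPU version, each loop would be a per-invocation partial sum followed by a subgroup add. The sketch only pins down what the two passes compute.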
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 621533af32
    }
    }

    if (vkdev->info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT)
When a caller sets opt.use_subgroup_ops = false, this branch still creates the subgroup GroupNorm pipeline on any device advertising arithmetic support, and forward_inplace later selects any non-null subgroup pipeline instead of the fallback path. This re-enables subgroup shaders for users/drivers that explicitly disabled them, so the condition should also gate on opt.use_subgroup_ops.
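A minimal sketch of the suggested gating, using stand-in types so it compiles on its own. `MockOption`, `MockDeviceInfo`, `kSubgroupArithmeticBit`, and `should_create_subgroup_pipeline` are hypothetical. Only `use_subgroup_ops`, `support_subgroup_ops()`, and the arithmetic bit come from the code under review, and the real condition may need to be structured differently.

```cpp
#include <cassert>
#include <cstdint>

// Same numeric value as VK_SUBGROUP_FEATURE_ARITHMETIC_BIT in the Vulkan headers.
constexpr std::uint32_t kSubgroupArithmeticBit = 0x00000004u;

// Stand-in for the option struct; only the field discussed in the review.
struct MockOption
{
    bool use_subgroup_ops = true;
};

// Stand-in for the device info object; returns the supported-operations bitmask.
struct MockDeviceInfo
{
    std::uint32_t subgroup_ops = 0;
    std::uint32_t support_subgroup_ops() const { return subgroup_ops; }
};

// Create the subgroup pipeline only when the caller allows subgroup ops
// AND the device advertises subgroup arithmetic.
bool should_create_subgroup_pipeline(const MockDeviceInfo& info, const MockOption& opt)
{
    return opt.use_subgroup_ops
           && (info.support_subgroup_ops() & kSubgroupArithmeticBit) != 0;
}
```

With this check at pipeline creation, the subgroup pipeline stays null when the option is off, so a dispatch-time test for a non-null pipeline would take the fallback path.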
    const int group_size = p.channels_g * p.group_area;

    // Phase 1: compute sum -> mean
    afp sum = afp(0.f);
Accumulate pack1 reductions in float
For pack1 tensors with fp16 arithmetic enabled, afp is float16_t, so the new subgroup path accumulates the mean reduction in half precision (and the variance reduction does the same below) even though the existing fallback uses the groupnorm_reduce_sum4_fp16_to_fp32 path. On large groups such as a single group over a 224x224x3 tensor, the half-precision running sum/sqsum can overflow or lose enough precision to produce incorrect normalization; use float accumulators for these reductions like the pack4 shader does.
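The precision concern can be reproduced on the CPU with a rough emulation of half-precision accumulation. This is a sketch under stated assumptions: `round_to_half`, `sum_in_half`, and `sum_in_float` are hypothetical helpers. The rounding ignores fp16 subnormals, relies on the default round-to-nearest-even mode, and models a serial running sum, not the shader's per-invocation partials plus subgroup add.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

// Approximate rounding of a float to fp16 precision (11 significant bits,
// max finite value 65504). Subnormals are not modeled.
float round_to_half(float x)
{
    if (!std::isfinite(x) || x == 0.f) return x;
    int e = 0;
    const float m = std::frexp(x, &e);                    // x = m * 2^e, 0.5 <= |m| < 1
    const float q = std::nearbyint(m * 2048.f) / 2048.f;  // keep 11 significant bits
    const float r = std::ldexp(q, e);
    if (std::fabs(r) > 65504.f)
        return std::copysign(std::numeric_limits<float>::infinity(), r);
    return r;
}

// Running sum where every partial result is rounded to half precision.
float sum_in_half(float value, std::size_t count)
{
    float acc = 0.f;
    for (std::size_t i = 0; i < count; i++) acc = round_to_half(acc + value);
    return acc;
}

// Same running sum with a float accumulator.
float sum_in_float(float value, std::size_t count)
{
    float acc = 0.f;
    for (std::size_t i = 0; i < count; i++) acc += value;
    return acc;
}
```

A single group over a 224x224x3 tensor has 150528 elements. Summing that many ones with `sum_in_half` stalls at 2048, because 2048 + 1 rounds back to 2048, while `sum_in_float` reaches 150528. Summing larger values, such as 1000, passes 65504 and becomes infinity.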