GroupNorm Vulkan subgroup reduce optimization by futz12 · Pull Request #6756 · Tencent/ncnn

futz12 · 2026-05-29T05:50:46Z

Add single-dispatch subgroup arithmetic fast-path for GroupNorm Vulkan.

- groupnorm_reduce_subgroup.comp: pack1, two-pass (sum->mean, sqsum->var)
- groupnorm_reduce_subgroup_pack4.comp: pack4 with component-boundary checks
- groupnorm_vulkan.cpp/h: conditional subgroup path with fallback
- tests/perf/perf_groupnorm.cpp: benchmark for representative shapes
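
For readers unfamiliar with the two-pass scheme named above, here is a minimal CPU reference sketch of what the reduce stage computes for one group. It is not the shader code from this PR: `GroupStats` and `group_stats_two_pass` are hypothetical names, the loops stand in for the subgroup reductions, and it assumes the second pass sums squared deviations from the mean (biased variance).

```cpp
#include <cstddef>
#include <vector>

// Mean and variance of one normalization group.
struct GroupStats
{
    float mean;
    float var;
};

// CPU reference for a two-pass reduction over one group.
// On the GPU each pass would be a subgroup reduction; here both are plain loops.
// Assumption: pass 2 sums squared deviations from the mean (biased variance).
GroupStats group_stats_two_pass(const std::vector<float>& x, std::size_t offset, std::size_t group_size)
{
    // Pass 1: sum -> mean
    float sum = 0.f;
    for (std::size_t i = 0; i < group_size; i++)
        sum += x[offset + i];
    const float mean = sum / static_cast<float>(group_size);

    // Pass 2: sum of squared deviations -> variance
    float sqsum = 0.f;
    for (std::size_t i = 0; i < group_size; i++)
    {
        const float d = x[offset + i] - mean;
        sqsum += d * d;
    }
    const float var = sqsum / static_cast<float>(group_size);

    return {mean, var};
}
```

The second pass re-reads the data, but it avoids the cancellation that a single-pass E[x^2] - E[x]^2 formula can suffer when the mean is large relative to the spread.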

Performance (NVIDIA RTX 4060 Laptop, gpu-1, median ms):

shape	precision	baseline	optimized	speedup
[64,64,128]	fp32	37.94	15.12	2.5x
[64,64,128]	fp16	36.18	11.85	3.1x
[64,64,128]	fp16psa	36.11	11.67	3.1x
[64,64,128]	bf16	36.62	12.28	3.0x
[32,32,256]	fp32	30.37	5.56	5.5x
[32,32,256]	fp16	29.66	0.92	32.2x
[32,32,256]	fp16psa	29.71	0.92	32.3x
[32,32,256]	bf16	29.80	0.95	31.4x
[16,16,512]	fp32	29.66	4.46	6.7x
[16,16,512]	fp16	29.11	2.89	10.1x
[16,16,512]	fp16psa	29.20	2.58	11.3x
[16,16,512]	bf16	29.40	2.37	12.4x
[8,8,512]	fp32	24.54	6.38	3.8x
[8,8,512]	fp16	24.38	5.67	4.3x
[8,8,512]	fp16psa	24.34	5.55	4.4x
[8,8,512]	bf16	24.39	5.68	4.3x
[224,224,3]	fp32	38.49	35.63	1.1x
[224,224,3]	fp16	38.30	31.06	1.2x
[224,224,3]	fp16psa	38.35	31.78	1.2x
[224,224,3]	bf16	38.48	30.66	1.3x
[224,224,64]	fp32	13.22	9.95	1.3x
[224,224,64]	fp16	8.53	8.93	1.0x
[224,224,64]	fp16psa	8.55	8.96	1.0x
[224,224,64]	bf16	8.88	9.24	1.0x
[4096,1,1]	fp32	25.18	8.53	3.0x
[4096,1,1]	fp16	25.07	7.58	3.3x
[4096,1,1]	fp16psa	25.12	7.49	3.4x
[4096,1,1]	bf16	25.13	7.63	3.3x
[512,1,1]	fp32	6.96	6.25	1.1x
[512,1,1]	fp16	4.40	5.34	0.8x
[512,1,1]	fp16psa	2.87	5.45	0.5x
[512,1,1]	bf16	2.29	5.41	0.4x

SD-like shapes (32x32x256, 16x16x512) benefit most because the baseline's ~10-15 dispatches collapse to 3 (subgroup reduce -> coeffs -> norm). Small shapes (512x1x1) regress in the fp16, fp16psa, and bf16 runs (0.4x-0.8x) because the subgroup shader's fixed cost exceeds the savings for tiny group sizes.
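
To make the last two stages of the reduce -> coeffs -> norm chain concrete, the following CPU sketch folds the statistics and the affine parameters into one multiply-add per element. It assumes the usual formulation a = gamma / sqrt(var + eps), b = beta - mean * a; the names `Coeffs`, `make_coeffs`, and `apply_norm` are hypothetical and do not come from the PR.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-channel coefficients so that normalization becomes y = a * x + b.
struct Coeffs
{
    float a;
    float b;
};

// "coeffs" stage: fold mean, variance, gamma and beta into (a, b).
// Assumption: a = gamma / sqrt(var + eps), b = beta - mean * a.
Coeffs make_coeffs(float mean, float var, float gamma, float beta, float eps)
{
    const float a = gamma / std::sqrt(var + eps);
    return {a, beta - mean * a};
}

// "norm" stage: apply y = a * x + b in place to `count` elements.
void apply_norm(std::vector<float>& x, std::size_t offset, std::size_t count, Coeffs c)
{
    for (std::size_t i = 0; i < count; i++)
        x[offset + i] = c.a * x[offset + i] + c.b;
}
```

Precomputing (a, b) keeps the square root and division out of the per-element pass, so the final stage is a single multiply-add per value.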

test_groupnorm passes on NVIDIA RTX 4060


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 621533af32


chatgpt-codex-connector · 2026-05-29T05:54:04Z

        }
    }

+    if (vkdev->info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT)


Respect disabled subgroup ops

When a caller sets opt.use_subgroup_ops = false, this branch still creates the subgroup GroupNorm pipeline on any device advertising arithmetic support, and forward_inplace later selects any non-null subgroup pipeline instead of the fallback path. This re-enables subgroup shaders for users/drivers that explicitly disabled them, so the condition should also gate on opt.use_subgroup_ops.
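
The suggested gate can be sketched as a standalone predicate. This is an illustration under stated assumptions, not the patch itself: `should_create_subgroup_pipeline` is a hypothetical helper, its two parameters stand in for `opt.use_subgroup_ops` and the result of `vkdev->info.support_subgroup_ops()`, and the bit constant is redefined locally so the snippet compiles without the Vulkan headers.

```cpp
#include <cstdint>

// Same value as VK_SUBGROUP_FEATURE_ARITHMETIC_BIT in the Vulkan headers,
// redefined here so the sketch compiles without them.
constexpr std::uint32_t kSubgroupArithmeticBit = 0x00000004u;

// Hypothetical helper (not part of ncnn): the subgroup pipeline is created only
// when the caller has not disabled subgroup ops AND the device reports
// arithmetic subgroup support.
bool should_create_subgroup_pipeline(bool use_subgroup_ops, std::uint32_t supported_subgroup_ops)
{
    return use_subgroup_ops && (supported_subgroup_ops & kSubgroupArithmeticBit) != 0;
}
```

With a gate like this, a null subgroup pipeline lets the later selection in forward_inplace fall through to the existing multi-dispatch path whenever the option is off.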


chatgpt-codex-connector · 2026-05-29T05:54:04Z

+    const int group_size = p.channels_g * p.group_area;
+
+    // Phase 1: compute sum -> mean
+    afp sum = afp(0.f);


Accumulate pack1 reductions in float

For pack1 tensors with fp16 arithmetic enabled, afp is float16_t, so the new subgroup path accumulates the mean reduction in half precision (and the variance reduction does the same below) even though the existing fallback uses the groupnorm_reduce_sum4_fp16_to_fp32 path. On large groups such as a single group over a 224x224x3 tensor, the half-precision running sum/sqsum can overflow or lose enough precision to produce incorrect normalization; use float accumulators for these reductions like the pack4 shader does.
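
The precision concern can be reproduced on the CPU with a rough emulation of fp16 rounding. The sketch below is an approximation for illustration only: `round_to_half` is a hypothetical helper that keeps an 11-bit significand and saturates to infinity past 65504, it does not model subnormals, and it is not how the shader or ncnn converts types.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>

// Round a float to the nearest value with an 11-bit significand, which is the
// precision of IEEE fp16 normal numbers. Subnormals are not modelled.
// Results beyond the fp16 maximum (65504) become infinity.
float round_to_half(float x)
{
    if (x == 0.f || !std::isfinite(x))
        return x;
    int e = 0;
    const float m = std::frexp(x, &e);                  // x = m * 2^e, 0.5 <= |m| < 1
    const float r = std::nearbyint(std::ldexp(m, 11));  // keep 11 significant bits
    const float y = std::ldexp(r, e - 11);
    if (std::fabs(y) > 65504.f)
        return std::copysign(std::numeric_limits<float>::infinity(), x);
    return y;
}

// Running sum of n ones, with the accumulator rounded to fp16 after every add.
float sum_ones_half(std::size_t n)
{
    float s = 0.f;
    for (std::size_t i = 0; i < n; i++)
        s = round_to_half(s + 1.f);
    return s;
}

// The same sum with a float accumulator.
float sum_ones_float(std::size_t n)
{
    float s = 0.f;
    for (std::size_t i = 0; i < n; i++)
        s += 1.f;
    return s;
}
```

Under this emulation, summing the 224 * 224 * 3 = 150528 elements of an all-ones tensor stalls at 2048 with the half-precision accumulator, because 2049 is not representable and rounds back to 2048, while the float accumulator reaches 150528 exactly.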


github-actions Bot added vulkan test labels May 29, 2026

chatgpt-codex-connector Bot reviewed May 29, 2026

futz12 and others added 2 commits May 29, 2026 16:44

vulkan: remove dead #else branches in groupnorm subgroup reduce shaders

ef69876

apply code-format changes

c2972d2

futz12 wants to merge 3 commits into Tencent:master from futz12:groupnorm-subgroup-reduce
