InstanceNorm Vulkan subgroup reduce optimization by futz12 · Pull Request #6758 · Tencent/ncnn

futz12 · 2026-05-29T06:45:24Z

Add subgroup reduction fast-path for InstanceNorm Vulkan backend. Replaces multi-pass reduce chain (sum -> mean -> sub -> square -> reduce -> var) with a single dispatch per channel using subgroupAdd.

Changes:

- New shader: instancenorm_reduce_subgroup.comp (pack1)
- New shader: instancenorm_reduce_subgroup_pack4.comp (pack4)
- C++ dispatch uses w=1, h=c, c=1 to avoid dispatching over spatial dims
- Falls back to existing reduce chain when subgroup ops unavailable
- Added perf benchmark: tests/perf/perf_instancenorm.cpp
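To make the single-dispatch idea concrete, here is a minimal CPU model in C++ of one 256-thread workgroup reducing one channel. It is a sketch, not the PR's shader: `reduce_channel`, `ChannelStats`, and the `subgroup_size` parameter are hypothetical names, and it assumes the variance is derived from a sum and a sum of squares gathered in the same pass, which the description does not spell out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ChannelStats
{
    float mean;
    float var;
};

// CPU model of one 256-thread workgroup reducing a single channel.
// subgroup_size is a hypothetical stand-in for gl_SubgroupSize.
ChannelStats reduce_channel(const std::vector<float>& x, int subgroup_size)
{
    const int kThreads = 256;
    assert(!x.empty());
    assert(subgroup_size > 0 && kThreads % subgroup_size == 0);

    // Step 1: each thread strides over the channel (t = tid; t < size; t += 256).
    std::vector<float> sum(kThreads, 0.f), sqsum(kThreads, 0.f);
    for (int tid = 0; tid < kThreads; tid++)
    {
        for (std::size_t t = tid; t < x.size(); t += kThreads)
        {
            sum[tid] += x[t];
            sqsum[tid] += x[t] * x[t];
        }
    }

    // Step 2: models subgroupAdd; one partial per subgroup lands in "shared memory".
    const int num_subgroups = kThreads / subgroup_size;
    std::vector<float> s_sum(num_subgroups, 0.f), s_sqsum(num_subgroups, 0.f);
    for (int tid = 0; tid < kThreads; tid++)
    {
        s_sum[tid / subgroup_size] += sum[tid];
        s_sqsum[tid / subgroup_size] += sqsum[tid];
    }

    // Step 3: combine the per-subgroup partials into channel totals.
    float total = 0.f, total_sq = 0.f;
    for (int i = 0; i < num_subgroups; i++)
    {
        total += s_sum[i];
        total_sq += s_sqsum[i];
    }

    // Assumption: var = E[x^2] - mean^2, computed from the two totals.
    const float n = static_cast<float>(x.size());
    const float mean = total / n;
    return {mean, total_sq / n - mean * mean};
}
```

Calling `reduce_channel({1.f, 2.f, 3.f, 4.f}, 32)` walks the same three stages regardless of channel size; only the number of strided iterations per thread changes.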

NVIDIA RTX 4060 Laptop (gpu-1) per-op speedup:

shape	precision	baseline (us)	optimized (us)	speedup
[64,64,128]	fp32	35.64	12.08	3.0x
[64,64,128]	fp16ps	34.79	8.77	4.0x
[64,64,128]	fp16psa	34.71	8.74	4.0x
[64,64,128]	bf16ps	35.24	9.50	3.7x
[32,32,256]	fp32	29.80	59.70	0.5x
[32,32,256]	fp16ps	26.00	8.70	3.0x
[32,32,256]	fp16psa	26.70	8.50	3.1x
[32,32,256]	bf16ps	26.50	9.00	2.9x
[16,16,512]	fp32	19.64	62.40	0.3x
[16,16,512]	fp16ps	19.12	9.10	2.1x
[16,16,512]	fp16psa	19.13	9.10	2.1x
[16,16,512]	bf16ps	19.34	9.70	2.0x
[8,8,512]	fp32	18.10	8.40	2.2x
[8,8,512]	fp16ps	17.70	9.00	2.0x
[8,8,512]	fp16psa	15.30	9.20	1.7x
[8,8,512]	bf16ps	16.20	8.30	2.0x
[224,224,64]	fp32	130.00	52.90	2.5x
[224,224,64]	fp16ps	83.50	36.30	2.3x
[224,224,64]	fp16psa	83.90	36.50	2.3x
[224,224,64]	bf16ps	91.50	40.50	2.3x
[224,224,3]	fp32	38.27	22.70	1.7x
[224,224,3]	fp16ps	37.99	18.00	2.1x
[224,224,3]	fp16psa	37.80	15.90	2.4x
[224,224,3]	bf16ps	37.83	16.20	2.3x
[4096,1,1]	fp32	24.87	6.76	3.7x
[4096,1,1]	fp16ps	24.76	6.07	4.1x
[4096,1,1]	fp16psa	24.76	6.03	4.1x
[4096,1,1]	bf16ps	24.83	6.22	4.0x
[512,1,1]	fp32	74.40	5.91	12.6x
[512,1,1]	fp16ps	35.10	5.20	6.8x
[512,1,1]	fp16psa	27.70	5.22	5.3x
[512,1,1]	bf16ps	23.30	5.21	4.5x

Note: the fp32 pack1 path regresses on some small spatial sizes ([32,32,256] at 0.5x and [16,16,512] at 0.3x in the table above), attributed to under-utilized 256-thread workgroups. The fp16/fp16a/bf16 paths use pack4 and show consistent speedups across all tested shapes.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b51bcfc308

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-05-29T06:48:27Z

+    afp sum = afp(0.f);
+    for (int t = tid; t < p.size; t += 256)
+    {
+        int v_offset = base_offset + t;
+        afp v = buffer_ld1(bottom_top_blob_data, v_offset);
+        sum += v;


Accumulate pack1 InstanceNorm statistics in float

When opt.use_fp16_arithmetic is enabled, afp is float16_t, so this new pack1 subgroup path accumulates the channel sum in half precision before converting to float for mean_data. The old pack1 path uses the reduce_sum4_fp16_to_fp32 shader and accumulates in float; with elempack=1 inputs such as non-4-channel tensors (for example 224x224x3 fp16a), these half-precision sums/squared sums can round heavily or overflow, producing incorrect InstanceNorm mean/variance. The accumulators and subgroup reductions in this shader should stay float like the previous path.
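The rounding concern can be illustrated with a small C++ model of a half-precision accumulator. This is a sketch under stated assumptions: `round_to_half`, `sum_half_model`, and `sum_float` are hypothetical helpers, the rounding model keeps an 11-bit significand and overflows above 65504 but ignores fp16 subnormals, and the sequential loop is a simplification rather than the shader's strided-plus-subgroup reduction.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Round a float to the nearest value with an 11-bit significand (fp16 precision).
// Simplified model: overflows to infinity above 65504, ignores fp16 subnormals.
float round_to_half(float v)
{
    if (v == 0.f || !std::isfinite(v))
        return v;
    int e = 0;
    float m = std::frexp(v, &e);             // v = m * 2^e with 0.5 <= |m| < 1
    m = std::nearbyint(m * 2048.f) / 2048.f; // keep 11 significant bits
    const float r = std::ldexp(m, e);
    if (std::fabs(r) > 65504.f)
        return std::copysign(std::numeric_limits<float>::infinity(), r);
    return r;
}

// Accumulate `count` copies of `value`, rounding to half after every add.
float sum_half_model(float value, int count)
{
    assert(count >= 0);
    float acc = 0.f;
    for (int i = 0; i < count; i++)
        acc = round_to_half(acc + round_to_half(value));
    return acc;
}

// Same accumulation with a float accumulator, as the review suggests keeping.
float sum_float(float value, int count)
{
    assert(count >= 0);
    float acc = 0.f;
    for (int i = 0; i < count; i++)
        acc += value;
    return acc;
}
```

In this model, adding 1.0 repeatedly stalls once the half accumulator reaches 2048, because the spacing between representable values there is 2, while the float accumulator keeps counting. A squared value such as 300 * 300 already exceeds the half range.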


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1cad3e0e1a


chatgpt-codex-connector · 2026-05-29T08:51:11Z

        }
    }

+    if (vkdev->info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT)


Honor disabled subgroup ops before creating shaders

When a caller explicitly sets opt.use_subgroup_ops = false on a device that still advertises VK_SUBGROUP_FEATURE_ARITHMETIC_BIT, this path still creates the new subgroup InstanceNorm pipelines. compile_spirv_module only switches to the Vulkan 1.1 / SPIR-V 1.3 environment when opt.use_subgroup_ops is true, so these subgroup shaders can fail to compile or leave a non-null Pipeline object that forward_inplace then selects instead of the existing fallback. Please include opt.use_subgroup_ops in this gate so disabling subgroup ops continues to work.
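The requested gate can be sketched in isolation as follows. `OptionModel` and `should_create_subgroup_pipelines` are hypothetical names that only mirror the `opt.use_subgroup_ops` field and the `support_subgroup_ops()` bitmask mentioned above, and `kSubgroupArithmeticBit` is a local stand-in for `VK_SUBGROUP_FEATURE_ARITHMETIC_BIT` (0x4 in the Vulkan headers) so the example compiles without Vulkan.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for VK_SUBGROUP_FEATURE_ARITHMETIC_BIT (value 0x4 in the Vulkan headers).
const std::uint32_t kSubgroupArithmeticBit = 0x00000004u;

// Hypothetical mirror of the one ncnn::Option field that matters here.
struct OptionModel
{
    bool use_subgroup_ops = true;
};

// Create the subgroup pipelines only when the caller allows subgroup ops
// AND the device advertises arithmetic subgroup support.
bool should_create_subgroup_pipelines(const OptionModel& opt, std::uint32_t supported_subgroup_ops)
{
    return opt.use_subgroup_ops && (supported_subgroup_ops & kSubgroupArithmeticBit) != 0;
}
```

With both conditions in one predicate, a disabled option leaves the subgroup pipelines uncreated, so the existing reduce chain remains the only path that forward_inplace can select.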


chatgpt-codex-connector · 2026-05-29T08:51:11Z

+    int size;
+} p;
+
+shared vec4 sdata_v4[64];


Size pack4 shared storage for all subgroups

On devices with a valid subgroup size below 4, the 256-thread workgroup has more than 64 subgroups, but the pack4 shader only reserves sdata_v4[64] and then writes entries by gl_SubgroupID during the elected-lane stores. NCNN only filters out non-power-of-two subgroup sizes, so elempack=4 InstanceNorm on such devices can write past shared memory and corrupt the reduction; either allocate for all 256 possible subgroups or avoid this shader for subgroup sizes under 4.
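The arithmetic behind this comment is small enough to check directly. The helpers below are hypothetical and only model the slot count: a 256-thread workgroup has 256 / subgroup_size subgroups, and each elected lane writes one shared slot indexed by gl_SubgroupID.

```cpp
#include <cassert>

// Number of subgroups in one workgroup; each one needs its own shared-memory slot.
int subgroups_per_workgroup(int workgroup_size, int subgroup_size)
{
    assert(subgroup_size > 0 && workgroup_size % subgroup_size == 0);
    return workgroup_size / subgroup_size;
}

// True when an array of `shared_slots` entries can hold one partial per subgroup.
bool shared_slots_suffice(int workgroup_size, int subgroup_size, int shared_slots)
{
    return subgroups_per_workgroup(workgroup_size, subgroup_size) <= shared_slots;
}
```

For a 256-thread workgroup and 64 slots, a subgroup size of 4 is the smallest that fits (64 subgroups); sizes 2 and 1 would need 128 and 256 slots respectively.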


github-actions Bot added vulkan test labels May 29, 2026

chatgpt-codex-connector Bot reviewed May 29, 2026


futz12 and others added 2 commits May 29, 2026 16:44

vulkan: remove dead #else branches in instancenorm subgroup reduce shaders

1358f99

apply code-format changes

1cad3e0

chatgpt-codex-connector Bot reviewed May 29, 2026


futz12 wants to merge 3 commits into Tencent:master from futz12:instancenorm-subgroup-reduce
