InstanceNorm Vulkan subgroup reduce optimization #6758
Add subgroup reduction fast-path for InstanceNorm Vulkan backend. Replaces multi-pass reduce chain (sum -> mean -> sub -> square -> reduce -> var) with a single dispatch per channel using subgroupAdd.

Changes:
- New shader: instancenorm_reduce_subgroup.comp (pack1)
- New shader: instancenorm_reduce_subgroup_pack4.comp (pack4)
- C++ dispatch uses w=1, h=c, c=1 to avoid dispatching over spatial dims
- Falls back to existing reduce chain when subgroup ops unavailable
- Added perf benchmark: tests/perf/perf_instancenorm.cpp

NVIDIA RTX 4060 Laptop (gpu-1) per-op speedup:

| shape        | precision | baseline (us) | optimized (us) | speedup |
|--------------|-----------|---------------|----------------|---------|
| [64,64,128]  | fp32      | 35.64         | 12.08          | 3.0x    |
| [64,64,128]  | fp16ps    | 34.79         | 8.77           | 4.0x    |
| [64,64,128]  | fp16psa   | 34.71         | 8.74           | 4.0x    |
| [64,64,128]  | bf16ps    | 35.24         | 9.50           | 3.7x    |
| [32,32,256]  | fp32      | 29.80         | 59.70          | 0.5x    |
| [32,32,256]  | fp16ps    | 26.00         | 8.70           | 3.0x    |
| [32,32,256]  | fp16psa   | 26.70         | 8.50           | 3.1x    |
| [32,32,256]  | bf16ps    | 26.50         | 9.00           | 2.9x    |
| [16,16,512]  | fp32      | 19.64         | 62.40          | 0.3x    |
| [16,16,512]  | fp16ps    | 19.12         | 9.10           | 2.1x    |
| [16,16,512]  | fp16psa   | 19.13         | 9.10           | 2.1x    |
| [16,16,512]  | bf16ps    | 19.34         | 9.70           | 2.0x    |
| [8,8,512]    | fp32      | 18.10         | 8.40           | 2.2x    |
| [8,8,512]    | fp16ps    | 17.70         | 9.00           | 2.0x    |
| [8,8,512]    | fp16psa   | 15.30         | 9.20           | 1.7x    |
| [8,8,512]    | bf16ps    | 16.20         | 8.30           | 2.0x    |
| [224,224,64] | fp32      | 130.00        | 52.90          | 2.5x    |
| [224,224,64] | fp16ps    | 83.50         | 36.30          | 2.3x    |
| [224,224,64] | fp16psa   | 83.90         | 36.50          | 2.3x    |
| [224,224,64] | bf16ps    | 91.50         | 40.50          | 2.3x    |
| [224,224,3]  | fp32      | 38.27         | 22.70          | 1.7x    |
| [224,224,3]  | fp16ps    | 37.99         | 18.00          | 2.1x    |
| [224,224,3]  | fp16psa   | 37.80         | 15.90          | 2.4x    |
| [224,224,3]  | bf16ps    | 37.83         | 16.20          | 2.3x    |
| [4096,1,1]   | fp32      | 24.87         | 6.76           | 3.7x    |
| [4096,1,1]   | fp16ps    | 24.76         | 6.07           | 4.1x    |
| [4096,1,1]   | fp16psa   | 24.76         | 6.03           | 4.1x    |
| [4096,1,1]   | bf16ps    | 24.83         | 6.22           | 4.0x    |
| [512,1,1]    | fp32      | 74.40         | 5.91           | 12.6x   |
| [512,1,1]    | fp16ps    | 35.10         | 5.20           | 6.8x    |
| [512,1,1]    | fp16psa   | 27.70         | 5.22           | 5.3x    |
| [512,1,1]    | bf16ps    | 23.30         | 5.21           | 4.5x    |

Note: fp32 pack1 path may regress on small spatial sizes (e.g. [32,32,256], [16,16,512]) due to under-utilized 256-thread workgroups. fp16/fp16a/bf16 paths use pack4 and show consistent speedups across all tested shapes.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b51bcfc308
    afp sum = afp(0.f);
    for (int t = tid; t < p.size; t += 256)
    {
        int v_offset = base_offset + t;
        afp v = buffer_ld1(bottom_top_blob_data, v_offset);
        sum += v;
Accumulate pack1 InstanceNorm statistics in float
When opt.use_fp16_arithmetic is enabled, afp is float16_t, so this new pack1 subgroup path accumulates the channel sum in half precision before converting to float for mean_data. The old pack1 path uses the reduce_sum4_fp16_to_fp32 shader and accumulates in float; with elempack=1 inputs such as non-4-channel tensors (for example 224x224x3 fp16a), these half-precision sums/squared sums can round heavily or overflow, producing incorrect InstanceNorm mean/variance. The accumulators and subgroup reductions in this shader should stay float like the previous path.
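The rounding and overflow risk can be reproduced on the CPU with a small simulation. This sketch is only an illustration: `round_to_half` is a hypothetical helper that rounds a float to the nearest fp16-representable value (ties to even), and `sum_as_half` models an accumulator that is rounded to fp16 after every add. Neither is NCNN code.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: round x to the nearest value representable in IEEE
// binary16 (round-to-nearest-even), returned as a float.
float round_to_half(float x)
{
    if (x == 0.f || std::isnan(x) || std::isinf(x)) return x;
    int e = 0;
    std::frexp(x, &e);        // |x| = m * 2^e with 0.5 <= m < 1
    int q = e - 11;           // fp16 keeps 11 significant bits
    if (q < -24) q = -24;     // subnormal spacing is 2^-24
    const float r = std::ldexp(std::nearbyint(std::ldexp(x, -q)), q);
    if (std::fabs(r) > 65504.f) return std::copysign(INFINITY, x); // fp16 max
    return r;
}

// Accumulator rounded to fp16 after each add, as with an fp16 `afp` sum.
float sum_as_half(const std::vector<float>& x)
{
    float sum = 0.f;
    for (float v : x) sum = round_to_half(sum + round_to_half(v));
    return sum;
}

// Reference accumulator kept in float.
float sum_as_float(const std::vector<float>& x)
{
    float sum = 0.f;
    for (float v : x) sum += v;
    return sum;
}
```

With 4096 inputs of 1.0, the simulated fp16 accumulator stalls at 2048 because 2049 is not representable and rounds back down, while the float accumulator reaches 4096. With inputs of 64.0, the fp16 accumulator overflows to infinity once the running sum passes 65504.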
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1cad3e0e1a
        }
    }

    if (vkdev->info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT)
Honor disabled subgroup ops before creating shaders
When a caller explicitly sets opt.use_subgroup_ops = false on a device that still advertises VK_SUBGROUP_FEATURE_ARITHMETIC_BIT, this path still creates the new subgroup InstanceNorm pipelines. compile_spirv_module only switches to the Vulkan 1.1 / SPIR-V 1.3 environment when opt.use_subgroup_ops is true, so these subgroup shaders can fail to compile or leave a non-null Pipeline object that forward_inplace then selects instead of the existing fallback. Please include opt.use_subgroup_ops in this gate so disabling subgroup ops continues to work.
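A minimal sketch of the combined gate, written as a standalone C++ predicate so it compiles without the NCNN or Vulkan headers. `OptionSketch` is a hypothetical stand-in for the real option struct, `use_subgroup_instancenorm` is a hypothetical name, and `kSubgroupArithmeticBit` is a local copy of the value of `VK_SUBGROUP_FEATURE_ARITHMETIC_BIT`. How the PR should wire this into pipeline creation is left open.

```cpp
#include <cstdint>

// Same numeric value as VK_SUBGROUP_FEATURE_ARITHMETIC_BIT in the Vulkan
// headers, redefined locally so the sketch is self-contained.
constexpr std::uint32_t kSubgroupArithmeticBit = 0x00000004;

// Hypothetical stand-in for the option struct; only the relevant flag.
struct OptionSketch
{
    bool use_subgroup_ops;
};

// The subgroup shaders should be created only when the caller has left
// subgroup ops enabled AND the device reports arithmetic subgroup support.
bool use_subgroup_instancenorm(const OptionSketch& opt, std::uint32_t supported_subgroup_ops)
{
    return opt.use_subgroup_ops && (supported_subgroup_ops & kSubgroupArithmeticBit) != 0;
}
```

If the same predicate decides both pipeline creation and the selection in `forward_inplace`, a disabled option cannot leave a subgroup pipeline behind for the forward pass to pick up.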
        int size;
    } p;

    shared vec4 sdata_v4[64];
Size pack4 shared storage for all subgroups
On devices with a valid subgroup size below 4, the 256-thread workgroup has more than 64 subgroups, but the pack4 shader only reserves sdata_v4[64] and then writes entries by gl_SubgroupID during the elected-lane stores. NCNN only filters out non-power-of-two subgroup sizes, so elempack=4 InstanceNorm on such devices can write past shared memory and corrupt the reduction; either allocate for all 256 possible subgroups or avoid this shader for subgroup sizes under 4.
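The arithmetic behind this bound is easy to check. The sketch below assumes a 256-thread workgroup and one shared slot per subgroup indexed by `gl_SubgroupID`, as described in the comment above; the constant and function names are hypothetical.

```cpp
// Values taken from the review comment: a 256-thread workgroup and the
// 64-entry shared array sdata_v4[64].
constexpr int kWorkgroupSize = 256;
constexpr int kSharedSlots = 64;

// Number of subgroups in one workgroup for a given subgroup size,
// rounded up in case the workgroup does not divide evenly.
constexpr int subgroups_per_workgroup(int subgroup_size)
{
    return (kWorkgroupSize + subgroup_size - 1) / subgroup_size;
}

// True when every subgroup has its own slot in the shared array.
constexpr bool shared_slots_suffice(int subgroup_size)
{
    return subgroups_per_workgroup(subgroup_size) <= kSharedSlots;
}

// Compile-time checks of the boundary described in the review comment.
static_assert(shared_slots_suffice(32), "subgroup size 32 needs 8 slots");
static_assert(shared_slots_suffice(4), "subgroup size 4 needs exactly 64 slots");
static_assert(!shared_slots_suffice(2), "subgroup size 2 needs 128 slots");
static_assert(!shared_slots_suffice(1), "subgroup size 1 needs 256 slots");
```

Subgroup size 4 is the smallest size that fits in 64 slots, which matches the two remedies suggested: size the array for 256 subgroups, or skip this shader when the subgroup size is below 4.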