InstanceNorm Vulkan subgroup reduce optimization #6758

Open
futz12 wants to merge 3 commits into Tencent:master from futz12:instancenorm-subgroup-reduce

futz12 (Contributor) commented May 29, 2026

Add a subgroup-reduction fast path to the InstanceNorm Vulkan backend. It replaces the multi-pass reduce chain (sum -> mean -> sub -> square -> reduce -> var) with a single dispatch per channel that uses subgroupAdd.

Changes:

  • New shader: instancenorm_reduce_subgroup.comp (pack1)
  • New shader: instancenorm_reduce_subgroup_pack4.comp (pack4)
  • C++ dispatch uses w=1, h=c, c=1 to avoid dispatching over spatial dims
  • Falls back to existing reduce chain when subgroup ops unavailable
  • Added perf benchmark: tests/perf/perf_instancenorm.cpp
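
The single-pass idea can be modelled on the CPU. The sketch below is not the PR's shader: it assumes the reduction keeps a running sum and sum of squares per thread, combines them per subgroup (standing in for subgroupAdd), and derives the variance as E[x^2] - mean^2. The names `ChannelStats` and `reduce_channel` and the default subgroup size of 32 are illustrative assumptions.

```cpp
#include <cassert>
#include <vector>

struct ChannelStats
{
    float mean;
    float var;
};

// CPU model of one workgroup reducing one channel (hypothetical helper).
// sg_size must divide wg_size; both defaults are illustrative.
ChannelStats reduce_channel(const std::vector<float>& x, int wg_size = 256, int sg_size = 32)
{
    assert(!x.empty() && sg_size > 0 && wg_size % sg_size == 0);
    const int size = static_cast<int>(x.size());

    // Step 1: each "thread" strides over the channel and keeps private partials.
    std::vector<float> t_sum(wg_size, 0.f), t_sqsum(wg_size, 0.f);
    for (int tid = 0; tid < wg_size; tid++)
    {
        for (int t = tid; t < size; t += wg_size)
        {
            t_sum[tid] += x[t];
            t_sqsum[tid] += x[t] * x[t];
        }
    }

    // Step 2: stand-in for subgroupAdd, producing one partial per subgroup.
    const int num_sg = wg_size / sg_size;
    std::vector<float> sg_sum(num_sg, 0.f), sg_sqsum(num_sg, 0.f);
    for (int tid = 0; tid < wg_size; tid++)
    {
        sg_sum[tid / sg_size] += t_sum[tid];
        sg_sqsum[tid / sg_size] += t_sqsum[tid];
    }

    // Step 3: combine the subgroup partials into the channel totals.
    float sum = 0.f, sqsum = 0.f;
    for (int i = 0; i < num_sg; i++)
    {
        sum += sg_sum[i];
        sqsum += sg_sqsum[i];
    }

    const float mean = sum / size;
    return {mean, sqsum / size - mean * mean};
}
```

Calling `reduce_channel({1.f, 2.f, 3.f, 4.f})` yields a mean of 2.5 and a variance of 1.25. On a GPU, steps 2 and 3 would typically go through shared memory and a barrier rather than plain loops.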

NVIDIA RTX 4060 Laptop (gpu-1) per-op speedup:

| shape          | precision | baseline (us) | optimized (us) | speedup |
|----------------|-----------|---------------|----------------|---------|
| [64,64,128]    | fp32      |         35.64 |          12.08 |    3.0x |
| [64,64,128]    | fp16ps    |         34.79 |           8.77 |    4.0x |
| [64,64,128]    | fp16psa   |         34.71 |           8.74 |    4.0x |
| [64,64,128]    | bf16ps    |         35.24 |           9.50 |    3.7x |
| [32,32,256]    | fp32      |         29.80 |          59.70 |    0.5x |
| [32,32,256]    | fp16ps    |         26.00 |           8.70 |    3.0x |
| [32,32,256]    | fp16psa   |         26.70 |           8.50 |    3.1x |
| [32,32,256]    | bf16ps    |         26.50 |           9.00 |    2.9x |
| [16,16,512]    | fp32      |         19.64 |          62.40 |    0.3x |
| [16,16,512]    | fp16ps    |         19.12 |           9.10 |    2.1x |
| [16,16,512]    | fp16psa   |         19.13 |           9.10 |    2.1x |
| [16,16,512]    | bf16ps    |         19.34 |           9.70 |    2.0x |
| [8,8,512]      | fp32      |         18.10 |           8.40 |    2.2x |
| [8,8,512]      | fp16ps    |         17.70 |           9.00 |    2.0x |
| [8,8,512]      | fp16psa   |         15.30 |           9.20 |    1.7x |
| [8,8,512]      | bf16ps    |         16.20 |           8.30 |    2.0x |
| [224,224,64]   | fp32      |        130.00 |          52.90 |    2.5x |
| [224,224,64]   | fp16ps    |         83.50 |          36.30 |    2.3x |
| [224,224,64]   | fp16psa   |         83.90 |          36.50 |    2.3x |
| [224,224,64]   | bf16ps    |         91.50 |          40.50 |    2.3x |
| [224,224,3]    | fp32      |         38.27 |          22.70 |    1.7x |
| [224,224,3]    | fp16ps    |         37.99 |          18.00 |    2.1x |
| [224,224,3]    | fp16psa   |         37.80 |          15.90 |    2.4x |
| [224,224,3]    | bf16ps    |         37.83 |          16.20 |    2.3x |
| [4096,1,1]     | fp32      |         24.87 |           6.76 |    3.7x |
| [4096,1,1]     | fp16ps    |         24.76 |           6.07 |    4.1x |
| [4096,1,1]     | fp16psa   |         24.76 |           6.03 |    4.1x |
| [4096,1,1]     | bf16ps    |         24.83 |           6.22 |    4.0x |
| [512,1,1]      | fp32      |         74.40 |           5.91 |   12.6x |
| [512,1,1]      | fp16ps    |         35.10 |           5.20 |    6.8x |
| [512,1,1]      | fp16psa   |         27.70 |           5.22 |    5.3x |
| [512,1,1]      | bf16ps    |         23.30 |           5.21 |    4.5x |

Note: the fp32 pack1 path can regress on some small spatial sizes ([32,32,256] at 0.5x and [16,16,512] at 0.3x) due to under-utilized 256-thread workgroups. The fp16ps/fp16psa/bf16ps paths use pack4 and show consistent speedups across all tested shapes.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b51bcfc308

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +35 to +40
    afp sum = afp(0.f);
    for (int t = tid; t < p.size; t += 256)
    {
        int v_offset = base_offset + t;
        afp v = buffer_ld1(bottom_top_blob_data, v_offset);
        sum += v;


P1: Accumulate pack1 InstanceNorm statistics in float

When opt.use_fp16_arithmetic is enabled, afp is float16_t, so this new pack1 subgroup path accumulates the channel sum in half precision before converting to float for mean_data. The old pack1 path uses the reduce_sum4_fp16_to_fp32 shader and accumulates in float; with elempack=1 inputs such as non-4-channel tensors (for example 224x224x3 fp16a), these half-precision sums/squared sums can round heavily or overflow, producing incorrect InstanceNorm mean/variance. The accumulators and subgroup reductions in this shader should stay float like the previous path.
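
The rounding and overflow concern can be illustrated on the CPU. The following is a rough sketch, not code from the PR: `round_to_half` is a hypothetical helper that emulates IEEE binary16 rounding (11 significant bits, ties to even, largest finite value 65504) and ignores subnormals, and the accumulation is serial rather than strided across 256 threads.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Round a float to the nearest value representable in IEEE binary16.
// Hypothetical emulation: subnormal halves are not modelled.
float round_to_half(float x)
{
    if (x == 0.f || !std::isfinite(x))
        return x;
    int e = 0;
    const float m = std::frexp(x, &e); // x = m * 2^e with 0.5 <= |m| < 1
    // Keep 11 significant bits; nearbyint uses round-to-nearest-even by default.
    const float r = std::ldexp(std::nearbyint(std::ldexp(m, 11)), e - 11);
    if (std::fabs(r) > 65504.f) // past the largest finite half value
        return std::copysign(std::numeric_limits<float>::infinity(), x);
    return r;
}

// Accumulate `count` copies of `value`, rounding to half after every add.
float sum_as_half(float value, int count)
{
    float acc = 0.f;
    for (int i = 0; i < count; i++)
        acc = round_to_half(acc + round_to_half(value));
    return acc;
}

// Same accumulation kept in float throughout.
float sum_as_float(float value, int count)
{
    float acc = 0.f;
    for (int i = 0; i < count; i++)
        acc += value;
    return acc;
}
```

Under this emulation, summing 4096 ones stalls at 2048 in half precision because 2049 is not representable, while the float sum is exact. Summing 2000 values of 64 overflows to infinity in half precision, whereas a float accumulator returns 128000. A 224x224 channel has 50176 elements, so either effect is within reach of realistic inputs.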

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1cad3e0e1a

}
}

if (vkdev->info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT)


P2: Honor disabled subgroup ops before creating shaders

When a caller explicitly sets opt.use_subgroup_ops = false on a device that still advertises VK_SUBGROUP_FEATURE_ARITHMETIC_BIT, this path still creates the new subgroup InstanceNorm pipelines. compile_spirv_module only switches to the Vulkan 1.1 / SPIR-V 1.3 environment when opt.use_subgroup_ops is true, so these subgroup shaders can fail to compile or leave a non-null Pipeline object that forward_inplace then selects instead of the existing fallback. Please include opt.use_subgroup_ops in this gate so disabling subgroup ops continues to work.
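
A combined gate along the lines requested might look like the sketch below. It is a standalone illustration rather than ncnn code: `OptionSketch` and `should_create_subgroup_pipeline` are hypothetical names, and the constant mirrors the value of VK_SUBGROUP_FEATURE_ARITHMETIC_BIT so the example compiles without the Vulkan headers.

```cpp
#include <cassert>
#include <cstdint>

// Same value as VK_SUBGROUP_FEATURE_ARITHMETIC_BIT in the Vulkan headers.
constexpr std::uint32_t kSubgroupFeatureArithmeticBit = 0x00000004;

// Hypothetical stand-in for the one option field this gate needs.
struct OptionSketch
{
    bool use_subgroup_ops = true;
};

// Create the subgroup pipelines only when the caller allows subgroup ops
// AND the device reports arithmetic subgroup support.
bool should_create_subgroup_pipeline(const OptionSketch& opt, std::uint32_t supported_subgroup_ops)
{
    return opt.use_subgroup_ops
           && (supported_subgroup_ops & kSubgroupFeatureArithmeticBit) != 0;
}
```

With this shape, a caller who disables subgroup ops never gets a subgroup pipeline object, so the later selection logic has nothing to pick other than the existing fallback.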


int size;
} p;

shared vec4 sdata_v4[64];


P2: Size pack4 shared storage for all subgroups

On devices with a valid subgroup size below 4, the 256-thread workgroup has more than 64 subgroups, but the pack4 shader only reserves sdata_v4[64] and then writes entries by gl_SubgroupID during the elected-lane stores. NCNN only filters out non-power-of-two subgroup sizes, so elempack=4 InstanceNorm on such devices can write past shared memory and corrupt the reduction; either allocate for all 256 possible subgroups or avoid this shader for subgroup sizes under 4.
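
The arithmetic behind this comment is easy to check. The sketch below takes the 256-thread workgroup and the 64-entry shared array from the comment above; the function names are illustrative and not part of the PR.

```cpp
#include <cassert>

constexpr int kWorkgroupSize = 256; // threads per workgroup, per the review comment
constexpr int kSharedSlots = 64;    // length of sdata_v4 in the pack4 shader

// Number of subgroups in one workgroup for a given subgroup size.
int subgroups_per_workgroup(int subgroup_size)
{
    assert(subgroup_size > 0 && kWorkgroupSize % subgroup_size == 0);
    return kWorkgroupSize / subgroup_size;
}

// True when one shared-memory slot per subgroup fits in the array.
bool shared_storage_fits(int subgroup_size)
{
    return subgroups_per_workgroup(subgroup_size) <= kSharedSlots;
}
```

A subgroup size of 32 gives 8 subgroups and a size of 4 gives exactly 64, both of which fit. Sizes 2 and 1 give 128 and 256 subgroups, which would index past the end of a 64-entry array.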

