GroupNorm Vulkan subgroup reduce optimization #6756

Open
futz12 wants to merge 3 commits into Tencent:master from futz12:groupnorm-subgroup-reduce

futz12 (Contributor) commented May 29, 2026

Add single-dispatch subgroup arithmetic fast-path for GroupNorm Vulkan.

  • groupnorm_reduce_subgroup.comp: pack1, two-pass (sum->mean, sqsum->var)
  • groupnorm_reduce_subgroup_pack4.comp: pack4 with component-boundary checks
  • groupnorm_vulkan.cpp/h: conditional subgroup path with fallback
  • tests/perf/perf_groupnorm.cpp: benchmark for representative shapes

Performance (NVIDIA RTX 4060 Laptop, gpu-1, median ms):

| shape         | precision | baseline (ms) | optimized (ms) | speedup |
|---------------|-----------|---------------|----------------|---------|
| [64,64,128]   | fp32      | 37.94         | 15.12          | 2.5x    |
| [64,64,128]   | fp16      | 36.18         | 11.85          | 3.1x    |
| [64,64,128]   | fp16psa   | 36.11         | 11.67          | 3.1x    |
| [64,64,128]   | bf16      | 36.62         | 12.28          | 3.0x    |
| [32,32,256]   | fp32      | 30.37         | 5.56           | 5.5x    |
| [32,32,256]   | fp16      | 29.66         | 0.92           | 32.2x   |
| [32,32,256]   | fp16psa   | 29.71         | 0.92           | 32.3x   |
| [32,32,256]   | bf16      | 29.80         | 0.95           | 31.4x   |
| [16,16,512]   | fp32      | 29.66         | 4.46           | 6.7x    |
| [16,16,512]   | fp16      | 29.11         | 2.89           | 10.1x   |
| [16,16,512]   | fp16psa   | 29.20         | 2.58           | 11.3x   |
| [16,16,512]   | bf16      | 29.40         | 2.37           | 12.4x   |
| [8,8,512]     | fp32      | 24.54         | 6.38           | 3.8x    |
| [8,8,512]     | fp16      | 24.38         | 5.67           | 4.3x    |
| [8,8,512]     | fp16psa   | 24.34         | 5.55           | 4.4x    |
| [8,8,512]     | bf16      | 24.39         | 5.68           | 4.3x    |
| [224,224,3]   | fp32      | 38.49         | 35.63          | 1.1x    |
| [224,224,3]   | fp16      | 38.30         | 31.06          | 1.2x    |
| [224,224,3]   | fp16psa   | 38.35         | 31.78          | 1.2x    |
| [224,224,3]   | bf16      | 38.48         | 30.66          | 1.3x    |
| [224,224,64]  | fp32      | 13.22         | 9.95           | 1.3x    |
| [224,224,64]  | fp16      | 8.53          | 8.93           | 1.0x    |
| [224,224,64]  | fp16psa   | 8.55          | 8.96           | 1.0x    |
| [224,224,64]  | bf16      | 8.88          | 9.24           | 1.0x    |
| [4096,1,1]    | fp32      | 25.18         | 8.53           | 3.0x    |
| [4096,1,1]    | fp16      | 25.07         | 7.58           | 3.3x    |
| [4096,1,1]    | fp16psa   | 25.12         | 7.49           | 3.4x    |
| [4096,1,1]    | bf16      | 25.13         | 7.63           | 3.3x    |
| [512,1,1]     | fp32      | 6.96          | 6.25           | 1.1x    |
| [512,1,1]     | fp16      | 4.40          | 5.34           | 0.8x    |
| [512,1,1]     | fp16psa   | 2.87          | 5.45           | 0.5x    |
| [512,1,1]     | bf16      | 2.29          | 5.41           | 0.4x    |

SD-like shapes (32x32x256, 16x16x512) benefit most because the operator drops from ~10-15 dispatches to 3 (subgroup reduce -> coeffs -> norm). Small shapes (512x1x1) regress because the subgroup shader's fixed overhead exceeds the savings for tiny group sizes.
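
For readers who want to check the math behind the three stages (reduce -> coeffs -> norm), here is a minimal CPU reference sketch in C++. It is not the shader code from this PR. The function name, the dense channel-major layout, and the reading of "sqsum" as the sum of squared deviations from the mean are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative CPU reference only; names and layout are assumptions.
// Layout assumption: dense channel-major storage, data[c * area + i],
// so each group occupies one contiguous run of channels_g * area values.
std::vector<float> group_norm_reference(const std::vector<float>& data,
                                        int channels, int area, int groups,
                                        const std::vector<float>& gamma,
                                        const std::vector<float>& beta,
                                        float eps)
{
    assert(groups > 0 && channels % groups == 0);
    assert(data.size() == static_cast<std::size_t>(channels) * area);
    assert(gamma.size() == static_cast<std::size_t>(channels));
    assert(beta.size() == gamma.size());

    const int channels_g = channels / groups;
    const int group_size = channels_g * area;

    // Stage 1 (reduce): two passes per group, sum -> mean, then sqsum -> var.
    std::vector<float> mean(groups), var(groups);
    for (int g = 0; g < groups; g++)
    {
        const float* p = data.data() + static_cast<std::size_t>(g) * group_size;
        float sum = 0.f;
        for (int i = 0; i < group_size; i++) sum += p[i];
        mean[g] = sum / group_size;

        float sqsum = 0.f;
        for (int i = 0; i < group_size; i++)
        {
            const float d = p[i] - mean[g];
            sqsum += d * d;
        }
        var[g] = sqsum / group_size;
    }

    // Stage 2 (coeffs): fold mean, var, gamma and beta into y = a * x + b.
    std::vector<float> a(channels), b(channels);
    for (int c = 0; c < channels; c++)
    {
        const int g = c / channels_g;
        a[c] = gamma[c] / std::sqrt(var[g] + eps);
        b[c] = beta[c] - mean[g] * a[c];
    }

    // Stage 3 (norm): apply the per-channel affine transform.
    std::vector<float> out(data.size());
    for (int c = 0; c < channels; c++)
        for (int i = 0; i < area; i++)
        {
            const std::size_t idx = static_cast<std::size_t>(c) * area + i;
            out[idx] = data[idx] * a[c] + b[c];
        }
    return out;
}
```

In this sketch only stage 1 depends on the group size, which is consistent with the reduce step being the one the subgroup path targets.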

test_groupnorm passes on NVIDIA RTX 4060

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 621533af32

}
}

if (vkdev->info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT)

P2: Respect disabled subgroup ops

When a caller sets opt.use_subgroup_ops = false, this branch still creates the subgroup GroupNorm pipeline on any device advertising arithmetic support, and forward_inplace later selects any non-null subgroup pipeline instead of the fallback path. This re-enables subgroup shaders for users/drivers that explicitly disabled them, so the condition should also gate on opt.use_subgroup_ops.
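
A minimal sketch of the suggested gating, written as standalone C++ so it can be compiled without the ncnn or Vulkan headers. `OptionSketch`, `kSubgroupArithmeticBit` and `should_create_subgroup_pipeline` are hypothetical stand-ins: the first mirrors only the `use_subgroup_ops` field mentioned above, and the second carries the value that `VK_SUBGROUP_FEATURE_ARITHMETIC_BIT` has in the Vulkan headers.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for VK_SUBGROUP_FEATURE_ARITHMETIC_BIT (0x00000004 in Vulkan).
constexpr std::uint32_t kSubgroupArithmeticBit = 0x00000004u;

// Hypothetical stand-in for the option struct; only the field under discussion.
struct OptionSketch
{
    bool use_subgroup_ops = true;
};

// Hypothetical helper: the subgroup pipeline is created only when the caller
// has not disabled subgroup ops AND the device reports arithmetic support.
// supported_subgroup_ops plays the role of vkdev->info.support_subgroup_ops().
inline bool should_create_subgroup_pipeline(const OptionSketch& opt,
                                            std::uint32_t supported_subgroup_ops)
{
    const bool device_ok = (supported_subgroup_ops & kSubgroupArithmeticBit) != 0;
    return opt.use_subgroup_ops && device_ok;
}
```

With a check of this shape, the subgroup pipeline stays null when the option is off, so a dispatch that picks any non-null subgroup pipeline would take the fallback path.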

const int group_size = p.channels_g * p.group_area;

// Phase 1: compute sum -> mean
afp sum = afp(0.f);

P2: Accumulate pack1 reductions in float

For pack1 tensors with fp16 arithmetic enabled, afp is float16_t, so the new subgroup path accumulates the mean reduction in half precision (and the variance reduction does the same below) even though the existing fallback uses the groupnorm_reduce_sum4_fp16_to_fp32 path. On large groups such as a single group over a 224x224x3 tensor, the half-precision running sum/sqsum can overflow or lose enough precision to produce incorrect normalization; use float accumulators for these reductions like the pack4 shader does.
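
The precision concern can be reproduced on the CPU with a small C++ sketch. It is not the shader code: `round_to_half_sketch` is a hypothetical, simplified model of fp16 rounding (11-bit significand, round-to-nearest-even under the default rounding mode, overflow to infinity above 65504, subnormals ignored), and the helper names are illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Simplified model of rounding a float to fp16 precision (assumption:
// subnormals are not modeled, default round-to-nearest-even mode).
inline float round_to_half_sketch(float x)
{
    if (x == 0.f || !std::isfinite(x)) return x;
    int e = 0;
    const float m = std::frexp(x, &e);                    // x = m * 2^e, 0.5 <= |m| < 1
    const float q = std::nearbyint(m * 2048.f) / 2048.f;  // keep 11 significant bits
    const float r = std::ldexp(q, e);
    if (std::fabs(r) > 65504.f)                           // beyond the largest finite fp16
        return std::copysign(std::numeric_limits<float>::infinity(), x);
    return r;
}

// Running sum where the accumulator is rounded to fp16 after every add.
inline float sum_half_accumulator(const std::vector<float>& data)
{
    float sum = 0.f;
    for (float v : data)
        sum = round_to_half_sketch(sum + round_to_half_sketch(v));
    return sum;
}

// Running sum with a float accumulator, as the review suggests.
inline float sum_float_accumulator(const std::vector<float>& data)
{
    float sum = 0.f;
    for (float v : data)
        sum += v;
    return sum;
}
```

For a single group over a 224x224x3 tensor filled with 1.0 (150528 elements), the half-precision accumulator in this model stalls at 2048, because 2049 is not representable and the tie rounds back down, while the float accumulator reaches 150528 exactly. Larger inputs hit the other failure mode: any partial sum that rounds above 65504 becomes infinity.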
