Skip to content

vulkan int8 kernels#6751

Open
nihui wants to merge 87 commits into
Tencent:masterfrom
nihui:vulkan-int-2
Open

vulkan int8 kernels#6751
nihui wants to merge 87 commits into
Tencent:masterfrom
nihui:vulkan-int-2

Conversation

@nihui

@nihui nihui commented May 28, 2026

Copy link
Copy Markdown
Member

No description provided.

Implement Vulkan int8 paths for convolution, convolutiondepthwise, innerproduct, and gemm with scalar fallback and integer dot product shader branches. Add goal.md and enable existing int8 unit tests for Vulkan coverage.
@tencent-adm

Copy link
Copy Markdown
Member

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@nihui nihui marked this pull request as draft May 28, 2026 02:35
@codecov-commenter

codecov-commenter commented May 28, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 77.63514% with 662 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.80%. Comparing base (b16501a) to head (a6d87e8).

Files with missing lines Patch % Lines
src/layer/vulkan/convolution_vulkan.cpp 66.69% 389 Missing ⚠️
src/layer/vulkan/gemm_vulkan.cpp 62.29% 161 Missing ⚠️
src/layer/vulkan/padding_vulkan.cpp 86.06% 51 Missing ⚠️
src/layer/vulkan/convolutiondepthwise_vulkan.cpp 96.57% 16 Missing ⚠️
src/gpu.cpp 88.23% 10 Missing ⚠️
src/layer/vulkan/innerproduct_vulkan.cpp 96.96% 10 Missing ⚠️
src/layer/convolution.cpp 40.00% 6 Missing ⚠️
src/layer/x86/convolution_x86.cpp 50.00% 5 Missing ⚠️
src/layer/arm/convolution_arm.cpp 50.00% 3 Missing ⚠️
src/layer/loongarch/convolution_loongarch.cpp 62.50% 3 Missing ⚠️
... and 3 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6751      +/-   ##
==========================================
- Coverage   95.78%   94.80%   -0.99%     
==========================================
  Files         946      946              
  Lines      410528   413891    +3363     
==========================================
- Hits       393237   392402     -835     
- Misses      17291    21489    +4198     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions Bot added the doc label May 28, 2026
@nihui nihui mentioned this pull request May 29, 2026
6 tasks
@nihui nihui closed this Jun 1, 2026
@nihui nihui reopened this Jun 1, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 034f555987

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +46 to +48
for (int i = 0; i < psc(w); i += 4)
{
const int v4 = i8buffer_sm4(bottom_blob_int8_data, i / 4);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pad scalar int8 inputs before vector loads

When the flattened InnerProduct input length is not a multiple of 4 and the non-GEMM int8 path is used, the flatten/quantize shapes remain elempack1, but this shader still loads each group as a 32-bit sint8vec4. For example a 1-D int8 input with 6 elements reaches the iteration i == 4 and reads bytes 4..7 even though only bytes 4..5 belong to the VkMat; the padded weight lanes being zero does not make the out-of-bounds buffer load valid. Please pack/zero-pad the input to num_input_packed or load the tail lanes with bounded i8buffer_ld1 accesses.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3e709465b3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


for (int i = start; i < end; i++)
{
const int v4 = i8buffer_sm4(bottom_blob_int8_data, i);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard tail lanes in sum8 int8 input loads

When the flattened int8 input is large enough to use the sum8 path (num_input_packed / 4 >= 32) but its true length is not divisible by 4, this loop rounds psc(w) up and reads the final input group as a full sint8vec4. The flattened VkMat remains scalar/elempack1 with only the real elements initialized, so a length like 130 makes the last iteration consume bytes 128..131 even though only 128..129 are tensor data; fixing only the similar innerproduct_int8.comp tail case would still leave large non-GEMM InnerProduct inputs reading padding/invalid lanes here. Please either zero-pad/pack the flattened input to num_input_packed or load the last lanes with bounded scalar reads.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2109b5aa92

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

{
bvec4 mask = bvec4(uvec4(x >= 0 && x < psc(w) && y >= 0 && y < psc(h)) & (uvec4(greaterThanEqual(z4, ivec4(0))) & uvec4(lessThan(z4, ivec4(psc(c))))));

ivec4 pad_value4 = per_channel_pad == 1 ? i8buffer_ld4(per_channel_pad_blob_data, gz) : pad_value;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Splat scalar pad value before ternary

When an int8 Padding layer needs the pack1-to-pack4 shader, this line makes shader compilation fail because the ternary operands have different GLSL types: i8buffer_ld4(...) expands to an ivec4, while pad_value is an int. This is checked when the pipeline is compiled even when per_channel_pad is specialized to 0, so any int8 pack1-to-pack4 padding case (for example channel padding that makes the output channel count divisible by 4) cannot run; use an ivec4(pad_value) fallback.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ae1231abec

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


const int elempack = opt.use_packing_layout && num_input % 4 == 0 ? 4 : 1;

if (!bottom_is_int8)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Repack existing int8 convolution inputs

When the input is already an 8-bit VkMat, this path skips the quantize helper that would create the num_input % 4 == 0 ? 4 : 1 layout, and it never converts bottom.elempack to the elempack specialization used by the int8 convolution shaders. For an int8 pack1 producer feeding a convolution whose input channel count is divisible by 4, the channel-count check still passes, but convolution_packed_int8.comp takes its elempack == 4 branch and reads the pack1 buffer through the bottom_blob_4 binding. Mirror the depthwise path by converting already-int8 inputs to the expected pack before dispatch.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9f11fe82ca

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1460 to +1461
int out_elempack = 1;
if (use_cooperative_matrix)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor output_elempack without cooperative matrix

When int8 Gemm runs the non-cooperative path (for example on devices without int8 cooperative matrix support, or when cooperative matrices are disabled) and the model sets output_elempack to 4, this leaves out_elempack at 1 and allocates a scalar output even though the caller requested pack4. The regular Vulkan Gemm applies output_elempack before allocation, while gemm_int8.comp only writes a scalar sfp buffer, so downstream layers/shapes that expect the requested pack4 layout receive the wrong layout.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 12dcacfc4e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


sumfp32 = activation_afpvec4(sumfp32, activation_type, activation_param_0, activation_param_1);

buffer_st4(top_blob_data, gx / 4, sumfp32);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard tail output lanes for scalar int8 InnerProduct

When the non-GEMM int8 InnerProduct path has num_output not divisible by 4 and the sum8 path is not used, forward_int8 allocates top_blob with out_elempack == 1, but this shader still dispatches ceil(num_output/4) groups and writes a full sfpvec4 for the last group. For example num_output == 5 allocates 5 scalar outputs while the second invocation writes lanes 4..7, overrunning or corrupting data beyond the VkMat; the tail needs scalar/bounded stores or the output must be allocated as padded pack4.

Useful? React with 👍 / 👎.


sumfp32 = activation_afpvec4(sumfp32, activation_type, activation_param_0, activation_param_1);

buffer_st4(top_blob_data, gx, sumfp32);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard tail output lanes in sum8 reduction

When the large-input int8 InnerProduct uses the sum8 path and num_output is not divisible by 4, top_blob is still allocated as scalar (out_elempack == 1), but the reducer writes one full sfpvec4 for each rounded output group. A case like num_output == 5 makes the final reducer invocation store four lanes even though only one output remains, so the shader writes past the scalar output buffer; mirror the tail handling here or allocate a padded pack4 output.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants