x86: optimize permute with SIMD #6762

Open
crafcat7 wants to merge 2 commits into Tencent:master from crafcat7:feat/x86-permute

@crafcat7

Contributor

Summary

Add an x86 Permute override that covers all dims2, dims3, and dims4 cases inside the x86 implementation. Transpose-friendly fp32 pack1 layouts use SIMD tiled kernels while the remaining layouts stay in x86-local generic loops instead of falling back to the base implementation.

Changes

  1. Add Permute_x86 layer declaration and implementation files under src/layer/x86
  2. Implement SSE, AVX, and AVX-512 tiled transpose helpers for high-frequency permute order types
  3. Keep full dims2, dims3, and dims4 coverage inside Permute_x86 with x86-local generic paths for non-SIMD-friendly layouts
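
For readers unfamiliar with the approach, a tiled transpose of a pack1 fp32 matrix can be sketched as below. This is a minimal illustration and not the code in this PR: the function name `transpose_tiled` is hypothetical, only the SSE 4x4 tile is shown (the AVX and AVX-512 variants would presumably use wider tiles), and a scalar loop handles the edges and any build without SSE2.

```cpp
#include <vector>
#if defined(__SSE2__)
#include <xmmintrin.h> // _mm_loadu_ps, _mm_storeu_ps, _MM_TRANSPOSE4_PS
#endif

// Hypothetical helper (not from the PR): transpose a row-major h x w fp32
// matrix in src into a row-major w x h matrix in dst.
static void transpose_tiled(const float* src, float* dst, int h, int w)
{
    int i = 0;
#if defined(__SSE2__)
    // Process bands of 4 rows, one 4x4 tile at a time.
    for (; i + 3 < h; i += 4)
    {
        int j = 0;
        for (; j + 3 < w; j += 4)
        {
            __m128 r0 = _mm_loadu_ps(src + (i + 0) * w + j);
            __m128 r1 = _mm_loadu_ps(src + (i + 1) * w + j);
            __m128 r2 = _mm_loadu_ps(src + (i + 2) * w + j);
            __m128 r3 = _mm_loadu_ps(src + (i + 3) * w + j);
            _MM_TRANSPOSE4_PS(r0, r1, r2, r3); // rows become columns in registers
            _mm_storeu_ps(dst + (j + 0) * h + i, r0);
            _mm_storeu_ps(dst + (j + 1) * h + i, r1);
            _mm_storeu_ps(dst + (j + 2) * h + i, r2);
            _mm_storeu_ps(dst + (j + 3) * h + i, r3);
        }
        // Leftover columns of this 4-row band.
        for (; j < w; j++)
            for (int k = 0; k < 4; k++)
                dst[j * h + i + k] = src[(i + k) * w + j];
    }
#endif
    // Leftover rows, or the whole matrix when SSE2 is unavailable.
    for (; i < h; i++)
        for (int j = 0; j < w; j++)
            dst[j * h + i] = src[i * w + j];
}
```

Calling `transpose_tiled(in.data(), out.data(), h, w)` on two `std::vector<float>` buffers of `h * w` elements fills `out` so that `out[j * h + i] == in[i * w + j]`. Unaligned loads and stores are used so the sketch makes no assumption about row alignment.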

Benchmark

| Case | Threads | Baseline (ms/run) | Optimized (ms/run) | Speedup |
|---|---|---|---|---|
| [1024,1024] order=1 | 1 | 2.1433 | 1.1950 | 1.79x |
| [256,256,32] order=1 | 1 | 1.8568 | 0.9800 | 1.89x |
| [80,1600,32] order=2 | 1 | 2.2060 | 1.8091 | 1.22x |
| [80,1600,32] order=3 | 1 | 7.4803 | 3.1840 | 2.35x |
| [19,19,24,16] order=3 | 1 | 0.0591 | 0.0179 | 3.30x |
| [19,19,24,16] order=7 | 1 | 0.0551 | 0.0301 | 1.83x |
| [19,19,24,16] order=13 | 1 | 0.0493 | 0.0191 | 2.58x |
| [19,19,24,16] order=15 | 1 | 0.1403 | 0.0219 | 6.40x |
| [1024,1024] order=1 | 8 | 2.1371 | 1.2652 | 1.69x |
| [256,256,32] order=1 | 8 | 0.6857 | 0.8719 | 0.79x |
| [80,1600,32] order=2 | 8 | 1.7951 | 1.6545 | 1.08x |
| [80,1600,32] order=3 | 8 | 2.2130 | 1.1377 | 1.95x |
| [19,19,24,16] order=3 | 8 | 0.0126 | 0.0058 | 2.18x |
| [19,19,24,16] order=7 | 8 | 0.0139 | 0.0099 | 1.41x |
| [19,19,24,16] order=13 | 8 | 0.0133 | 0.0086 | 1.55x |
| [19,19,24,16] order=15 | 8 | 0.0377 | 0.0057 | 6.62x |

@github-actions bot added the x86 label on May 31, 2026

@codecov-commenter

codecov-commenter commented Jun 1, 2026

Codecov Report

❌ Patch coverage is 74.95798% with 149 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.76%. Comparing base (882f319) to head (d9dcb3b).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/layer/x86/permute_x86.cpp | 74.95% | 149 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6762      +/-   ##
==========================================
- Coverage   95.77%   95.76%   -0.01%     
==========================================
  Files         946      944       -2     
  Lines      410747   411385     +638     
==========================================
+ Hits       393380   393969     +589     
- Misses      17367    17416      +49     

@crafcat7

crafcat7 commented Jun 1, 2026

Contributor Author

Todo:
Add support for packed inputs with any elempack

Summary:
  Enable support_packing in Permute_x86 so the layer can accept packed fp32 inputs directly. Packed inputs are unpacked to pack1, permuted with the existing SIMD transpose kernels, and repacked to the original elempack.

Changes:
  1. Enable support_packing = true in Permute_x86 constructor
  2. Add unpack_permute_repack helper for packed fp32 input handling
  3. Route all packed inputs through unpack → pack1 permute → repack path
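
The unpack → pack1 permute → repack route described above can be sketched on flat buffers as follows. This is an illustration under stated assumptions, not the helper added in this commit: all function names and signatures are hypothetical, the blob is modelled as a `std::vector<float>` instead of an ncnn Mat, the packed layout is assumed to interleave `elempack` consecutive channels per spatial position, the permute shown is a plain swap of the channel axis with the flattened spatial axis, and the new channel count is assumed to be divisible by `elempack` so the result can be repacked.

```cpp
#include <cstddef>
#include <vector>

// Assumed packed layout: packed[(q * size + i) * elempack + k] holds
// channel q * elempack + k at spatial index i.
static std::vector<float> unpack_to_pack1(const std::vector<float>& packed, int channels, int size, int elempack)
{
    std::vector<float> plain(packed.size());
    for (int c = 0; c < channels; c++)
        for (int i = 0; i < size; i++)
            plain[(size_t)c * size + i] = packed[((size_t)(c / elempack) * size + i) * elempack + c % elempack];
    return plain;
}

// Inverse of unpack_to_pack1: channels must be a multiple of elempack.
static std::vector<float> repack_from_pack1(const std::vector<float>& plain, int channels, int size, int elempack)
{
    std::vector<float> packed(plain.size());
    for (int c = 0; c < channels; c++)
        for (int i = 0; i < size; i++)
            packed[((size_t)(c / elempack) * size + i) * elempack + c % elempack] = plain[(size_t)c * size + i];
    return packed;
}

// Illustrative pack1 permute: (channels, size) -> (size, channels).
static std::vector<float> swap_axes_pack1(const std::vector<float>& plain, int channels, int size)
{
    std::vector<float> out(plain.size());
    for (int c = 0; c < channels; c++)
        for (int i = 0; i < size; i++)
            out[(size_t)i * channels + c] = plain[(size_t)c * size + i];
    return out;
}

// Hypothetical composition of the three steps. The output has `size`
// channels of `channels` elements each, so size % elempack == 0 is assumed.
static std::vector<float> unpack_permute_repack_sketch(const std::vector<float>& packed, int channels, int size, int elempack)
{
    std::vector<float> plain = unpack_to_pack1(packed, channels, size, elempack);
    std::vector<float> permuted = swap_axes_pack1(plain, channels, size);
    return repack_from_pack1(permuted, size, channels, elempack);
}
```

The extra copies in the unpack and repack steps are the price of reusing the pack1 kernels unchanged; a direct packed kernel would avoid them but needs one implementation per elempack.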

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9dcb3b098

Permute_x86::Permute_x86()
{
#if __SSE2__
support_packing = true;

P2: Unpack non-fp32 packed inputs before permuting

With support_packing enabled, the runtime can pass packed int8 blobs to this layer as well as fp32 blobs, but the new unpack/repack path only runs when bottom_blob.elembits() == 32. Packed int8 inputs therefore fall through to the generic loops with elemsize equal to a whole channel pack, so order types that move or interleave the channel axis (for example dims3 order_type 3 or 5) permute groups of channels as indivisible elements and produce the wrong channel order in quantized models. Either unpack all packed element types before permuting or do not advertise packing support for element types this implementation cannot handle.
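
The failure mode can be reproduced on flat buffers with a small sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the blob is a `std::vector<signed char>` instead of an ncnn Mat, the packed layout is assumed to interleave `ep` consecutive channels per spatial position, and the permute is a plain swap of the channel axis with the flattened spatial axis. `swap_axes_opaque` mimics a generic loop that copies each whole pack as one element, while `swap_axes_unpacked` unpacks first.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

typedef signed char i8;

// Assumed packed layout: packed[(q * size + i) * ep + k] holds channel q * ep + k at spatial i.
static std::vector<i8> pack(const std::vector<i8>& plain, int channels, int size, int ep)
{
    std::vector<i8> out(plain.size());
    for (int c = 0; c < channels; c++)
        for (int i = 0; i < size; i++)
            out[((size_t)(c / ep) * size + i) * ep + c % ep] = plain[(size_t)c * size + i];
    return out;
}

static std::vector<i8> unpack(const std::vector<i8>& packed, int channels, int size, int ep)
{
    std::vector<i8> out(packed.size());
    for (int c = 0; c < channels; c++)
        for (int i = 0; i < size; i++)
            out[(size_t)c * size + i] = packed[((size_t)(c / ep) * size + i) * ep + c % ep];
    return out;
}

// Faulty path: each ep-wide pack is moved as one opaque element, so the
// channels inside a pack never get separated: (c / ep, size) -> (size, c / ep).
static std::vector<i8> swap_axes_opaque(const std::vector<i8>& packed, int channels, int size, int ep)
{
    const int groups = channels / ep;
    std::vector<i8> out(packed.size());
    for (int q = 0; q < groups; q++)
        for (int i = 0; i < size; i++)
            std::memcpy(&out[((size_t)i * groups + q) * ep], &packed[((size_t)q * size + i) * ep], ep);
    return out;
}

// Reference path: unpack, swap (channels, size) -> (size, channels) per scalar, repack.
// Assumes size % ep == 0 so the new channel count can be repacked.
static std::vector<i8> swap_axes_unpacked(const std::vector<i8>& packed, int channels, int size, int ep)
{
    std::vector<i8> plain = unpack(packed, channels, size, ep);
    std::vector<i8> t(plain.size());
    for (int c = 0; c < channels; c++)
        for (int i = 0; i < size; i++)
            t[(size_t)i * channels + c] = plain[(size_t)c * size + i];
    return pack(t, size, channels, ep);
}
```

For an 8-channel, 8-element input with `ep = 8` and distinct values, the two functions return different buffers: the opaque path leaves the bytes exactly where they were, because there is only one pack per spatial position to move.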
