x86: optimize permute with SIMD #6762
Summary: Add an x86 Permute override that covers all dims2, dims3, and dims4 cases inside the x86 implementation. Transpose-friendly fp32 pack1 layouts use SIMD tiled kernels, while the remaining layouts stay in x86-local generic loops instead of falling back to the base implementation.

Changes:
1. Add Permute_x86 layer declaration and implementation files under src/layer/x86
2. Implement SSE, AVX, and AVX-512 tiled transpose helpers for high-frequency permute order types
3. Keep full dims2, dims3, and dims4 coverage inside Permute_x86 with x86-local generic paths for non-SIMD-friendly layouts
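To make the "SIMD tiled kernel" idea concrete, here is a minimal sketch of a 4x4-tiled 2-D transpose for fp32 pack1 data. It is not the code from this PR: the function name `transpose_tiled` and the row-major `h x w` layout are assumptions for illustration, only the SSE tile size is shown, and a scalar fallback covers the tails and builds without SSE2.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__SSE2__)
#include <xmmintrin.h>
#endif

// Hypothetical helper (not the PR's implementation): transpose a row-major
// h x w fp32 matrix into a row-major w x h matrix, working in 4x4 tiles.
static void transpose_tiled(const float* src, float* dst, int h, int w)
{
    assert(src != dst); // out-of-place only

    int i = 0;
    for (; i + 3 < h; i += 4)
    {
        int j = 0;
#if defined(__SSE2__)
        for (; j + 3 < w; j += 4)
        {
            // Load a 4x4 tile: four rows, four consecutive columns each.
            __m128 r0 = _mm_loadu_ps(src + (size_t)(i + 0) * w + j);
            __m128 r1 = _mm_loadu_ps(src + (size_t)(i + 1) * w + j);
            __m128 r2 = _mm_loadu_ps(src + (size_t)(i + 2) * w + j);
            __m128 r3 = _mm_loadu_ps(src + (size_t)(i + 3) * w + j);

            // Transpose in registers, then each register holds one source column.
            _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

            _mm_storeu_ps(dst + (size_t)(j + 0) * h + i, r0);
            _mm_storeu_ps(dst + (size_t)(j + 1) * h + i, r1);
            _mm_storeu_ps(dst + (size_t)(j + 2) * h + i, r2);
            _mm_storeu_ps(dst + (size_t)(j + 3) * h + i, r3);
        }
#endif
        // Column tail of this row band (the whole band when SSE2 is unavailable).
        for (; j < w; j++)
            for (int k = 0; k < 4; k++)
                dst[(size_t)j * h + i + k] = src[(size_t)(i + k) * w + j];
    }

    // Row tail: fewer than four rows left.
    for (; i < h; i++)
        for (int j = 0; j < w; j++)
            dst[(size_t)j * h + i] = src[(size_t)i * w + j];
}
```

The AVX and AVX-512 helpers mentioned above would presumably follow the same pattern with wider tiles; that is an inference from the change list, not something this sketch demonstrates.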
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## master #6762 +/- ##
==========================================
- Coverage 95.77% 95.76% -0.01%
==========================================
Files 946 944 -2
Lines 410747 411385 +638
==========================================
+ Hits 393380 393969 +589
- Misses 17367 17416 +49
Summary: Enable support_packing in Permute_x86 so the layer can accept packed fp32 inputs directly. Packed inputs are unpacked to pack1, permuted with the existing SIMD transpose kernels, and repacked to the original elempack.

Changes:
1. Enable support_packing = true in Permute_x86 constructor
2. Add unpack_permute_repack helper for packed fp32 input handling
3. Route all packed inputs through unpack → pack1 permute → repack path
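The unpack → pack1 permute → repack flow described above can be sketched on plain arrays. Everything in this block is an assumption made for illustration: the packed layout `[c/p][n][p]` (channel groups, then spatial positions, then interleaved lanes), the helper names, and the choice of a channel/spatial swap as the middle permute. It does not reproduce the PR's `unpack_permute_repack` helper, and it simply asserts that the output channel count is divisible by the pack size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed packed layout: c/p channel groups, each with n spatial positions,
// each position holding p interleaved channel lanes.
static std::vector<float> unpack_to_pack1(const std::vector<float>& packed, int c, int n, int p)
{
    assert(c % p == 0);
    std::vector<float> out((size_t)c * n);
    for (int q = 0; q < c; q++)
        for (int i = 0; i < n; i++)
            out[(size_t)q * n + i] = packed[((size_t)(q / p) * n + i) * p + q % p];
    return out;
}

// Inverse of unpack_to_pack1: [c][n] -> [c/p][n][p].
static std::vector<float> repack_from_pack1(const std::vector<float>& flat, int c, int n, int p)
{
    assert(c % p == 0);
    std::vector<float> out((size_t)c * n);
    for (int q = 0; q < c; q++)
        for (int i = 0; i < n; i++)
            out[((size_t)(q / p) * n + i) * p + q % p] = flat[(size_t)q * n + i];
    return out;
}

// Stand-in for the pack1 permute step: swap the channel and spatial axes,
// [c][n] -> [n][c]. A real layer would dispatch on order_type here.
static std::vector<float> permute_pack1_cn_to_nc(const std::vector<float>& src, int c, int n)
{
    std::vector<float> out((size_t)c * n);
    for (int q = 0; q < c; q++)
        for (int i = 0; i < n; i++)
            out[(size_t)i * c + q] = src[(size_t)q * n + i];
    return out;
}

// Hypothetical end-to-end path: the output has n channels of c elements each,
// repacked with the original pack size p.
static std::vector<float> unpack_permute_repack_sketch(const std::vector<float>& packed, int c, int n, int p)
{
    std::vector<float> flat = unpack_to_pack1(packed, c, n, p);
    std::vector<float> permuted = permute_pack1_cn_to_nc(flat, c, n);
    return repack_from_pack1(permuted, n, c, p);
}
```

The round trip costs two extra passes over the data, which is the trade-off for reusing the pack1 kernels instead of writing a packed kernel per order type.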
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9dcb3b098
    Permute_x86::Permute_x86()
    {
    #if __SSE2__
        support_packing = true;
Unpack non-fp32 packed inputs before permuting
With support_packing enabled, the runtime can pass packed int8 blobs to this layer as well as fp32 blobs, but the new unpack/repack path only runs when bottom_blob.elembits() == 32. Packed int8 inputs therefore fall through to the generic loops with elemsize equal to a whole channel pack, so order types that move or interleave the channel axis (for example dims3 order_type 3 or 5) permute groups of channels as indivisible elements and produce the wrong channel order in quantized models. Either unpack all packed element types before permuting or do not advertise packing support for element types this implementation cannot handle.
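The first option in the comment above (unpack every packed element type before permuting) could look roughly like the following byte-level sketch. It assumes, as the comment implies, that `elemsize` is the byte size of one whole pack, so the scalar width is `elemsize / elempack`; the `[c/p][n][p]` layout and the name `unpack_any` are hypothetical and not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: unpack [c/elempack][n][elempack] -> [c][n] for any
// scalar width, so int8 (1 byte) and fp32 (4 bytes) take the same path.
static std::vector<unsigned char> unpack_any(const std::vector<unsigned char>& packed,
                                             int c, int n, int elempack, size_t elemsize)
{
    assert(elempack > 0 && elemsize % (size_t)elempack == 0);
    assert(c % elempack == 0);

    // Byte width of a single scalar inside the pack.
    const size_t scalar = elemsize / (size_t)elempack;
    assert(packed.size() == (size_t)c * n * scalar);

    std::vector<unsigned char> out(packed.size());
    for (int q = 0; q < c; q++)
    {
        for (int i = 0; i < n; i++)
        {
            const size_t src = (((size_t)(q / elempack) * n + i) * elempack + q % elempack) * scalar;
            const size_t dst = ((size_t)q * n + i) * scalar;
            std::memcpy(&out[dst], &packed[src], scalar);
        }
    }
    return out;
}
```

With the data in pack1 form, channel-moving order types operate on individual channels again rather than on whole packs. The alternative the comment names, not advertising packing support for unhandled element types, avoids the extra copy at the cost of a conversion elsewhere in the runtime.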
Benchmark
Cases measured (the timing columns did not survive in this copy):
- [1024,1024] order=1
- [256,256,32] order=1
- [80,1600,32] order=2
- [80,1600,32] order=3
- [19,19,24,16] order=3
- [19,19,24,16] order=7
- [19,19,24,16] order=13
- [19,19,24,16] order=15