x86: optimize permute with SIMD by crafcat7 · Pull Request #6762 · Tencent/ncnn

crafcat7 · 2026-05-31T14:51:58Z

Summary

Add an x86 Permute override that covers all dims2, dims3, and dims4 cases inside the x86 implementation. Transpose-friendly fp32 pack1 layouts use SIMD tiled kernels while the remaining layouts stay in x86-local generic loops instead of falling back to the base implementation.

Changes

Add Permute_x86 layer declaration and implementation files under src/layer/x86
Implement SSE, AVX, and AVX-512 tiled transpose helpers for high-frequency permute order types
Keep full dims2, dims3, and dims4 coverage inside Permute_x86 with x86-local generic paths for non-SIMD-friendly layouts

Benchmark

Case	Threads	Baseline (ms/run)	Optimized (ms/run)	Speedup
`[1024,1024]` order=1	1	2.1433	1.1950	1.79x
`[256,256,32]` order=1	1	1.8568	0.9800	1.89x
`[80,1600,32]` order=2	1	2.2060	1.8091	1.22x
`[80,1600,32]` order=3	1	7.4803	3.1840	2.35x
`[19,19,24,16]` order=3	1	0.0591	0.0179	3.30x
`[19,19,24,16]` order=7	1	0.0551	0.0301	1.83x
`[19,19,24,16]` order=13	1	0.0493	0.0191	2.58x
`[19,19,24,16]` order=15	1	0.1403	0.0219	6.40x
`[1024,1024]` order=1	8	2.1371	1.2652	1.69x
`[256,256,32]` order=1	8	0.6857	0.8719	0.79x
`[80,1600,32]` order=2	8	1.7951	1.6545	1.08x
`[80,1600,32]` order=3	8	2.2130	1.1377	1.95x
`[19,19,24,16]` order=3	8	0.0126	0.0058	2.18x
`[19,19,24,16]` order=7	8	0.0139	0.0099	1.41x
`[19,19,24,16]` order=13	8	0.0133	0.0086	1.55x
`[19,19,24,16]` order=15	8	0.0377	0.0057	6.62x


codecov-commenter · 2026-06-01T02:02:49Z

Codecov Report

❌ Patch coverage is 74.95798% with 149 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.76%. Comparing base (882f319) to head (d9dcb3b).
⚠️ Report is 2 commits behind head on master.

Files with missing lines	Patch %	Lines
src/layer/x86/permute_x86.cpp	74.95%	149 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6762      +/-   ##
==========================================
- Coverage   95.77%   95.76%   -0.01%     
==========================================
  Files         946      944       -2     
  Lines      410747   411385     +638     
==========================================
+ Hits       393380   393969     +589     
- Misses      17367    17416      +49


crafcat7 · 2026-06-01T02:18:49Z

Todo:
Add support for inputs with any elempack (packed layouts)

Summary

Enable support_packing in Permute_x86 so the layer can accept packed fp32 inputs directly. Packed inputs are unpacked to pack1, permuted with the existing SIMD transpose kernels, and repacked to the original elempack.

Changes

Enable support_packing = true in Permute_x86 constructor
Add unpack_permute_repack helper for packed fp32 input handling
Route all packed inputs through unpack → pack1 permute → repack path
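
The unpack → pack1 permute → repack route can be sketched on plain buffers as follows. This is only an illustration under stated assumptions, not the PR's `unpack_permute_repack` helper: all function names are hypothetical, the packed layout is modelled as `elempack` consecutive channels interleaved per position, the permute is a simple channel/position swap, and both extents are assumed to be divisible by `elempack`.

```cpp
#include <cstddef>
#include <vector>

// Assumed layouts (for illustration only):
//   pack1 : data[c * size + i]
//   packed: data[((c / elempack) * size + i) * elempack + c % elempack]

static std::vector<float> unpack_to_pack1(const std::vector<float>& packed, int channels, int size, int elempack)
{
    std::vector<float> out((size_t)channels * size);
    for (int c = 0; c < channels; c++)
        for (int i = 0; i < size; i++)
            out[(size_t)c * size + i] = packed[((size_t)(c / elempack) * size + i) * elempack + c % elempack];
    return out;
}

static std::vector<float> repack_from_pack1(const std::vector<float>& pack1, int channels, int size, int elempack)
{
    std::vector<float> out((size_t)channels * size);
    for (int c = 0; c < channels; c++)
        for (int i = 0; i < size; i++)
            out[((size_t)(c / elempack) * size + i) * elempack + c % elempack] = pack1[(size_t)c * size + i];
    return out;
}

// Stand-in for the pack1 permute kernels: swap the channel and position axes.
static std::vector<float> swap_axes_pack1(const std::vector<float>& in, int channels, int size)
{
    std::vector<float> out(in.size());
    for (int c = 0; c < channels; c++)
        for (int i = 0; i < size; i++)
            out[(size_t)i * channels + c] = in[(size_t)c * size + i];
    return out;
}

// Hypothetical end-to-end route: the output has `size` channels of
// `channels` values each, stored with the same elempack as the input.
static std::vector<float> unpack_permute_repack_sketch(const std::vector<float>& packed, int channels, int size, int elempack)
{
    std::vector<float> pack1 = unpack_to_pack1(packed, channels, size, elempack);
    std::vector<float> permuted = swap_axes_pack1(pack1, channels, size);
    return repack_from_pack1(permuted, size, channels, elempack);
}
```

The two extra passes over the data are the price of reusing the pack1 kernels unchanged; whether that cost is acceptable depends on the shapes involved.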

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9dcb3b098


chatgpt-codex-connector · 2026-06-04T15:20:44Z

+Permute_x86::Permute_x86()
+{
+#if __SSE2__
+    support_packing = true;


Unpack non-fp32 packed inputs before permuting

With support_packing enabled, the runtime can pass packed int8 blobs to this layer as well as fp32 blobs, but the new unpack/repack path only runs when bottom_blob.elembits() == 32. Packed int8 inputs therefore fall through to the generic loops with elemsize equal to a whole channel pack, so order types that move or interleave the channel axis (for example dims3 order_type 3 or 5) permute groups of channels as indivisible elements and produce the wrong channel order in quantized models. Either unpack all packed element types before permuting or do not advertise packing support for element types this implementation cannot handle.
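
The failure mode described in this review comment can be reproduced on a toy buffer. The sketch below is not ncnn code: the function names, the int8 packed layout (`ep` consecutive channels interleaved per position), and the simple channel/position swap are all assumptions chosen to show why moving a whole channel pack as one element differs from permuting individual values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed toy layout: the value of channel c at position i is stored at
// ((c / ep) * size + i) * ep + c % ep.
static size_t packed_index(int c, int i, int size, int ep)
{
    return ((size_t)(c / ep) * size + i) * ep + c % ep;
}

// Naive path: swap the (channel group, position) axes while copying each
// ep-wide group as one opaque element, as a generic loop would do when
// elemsize covers the whole pack.
static std::vector<int8_t> swap_axes_groups_as_elements(const std::vector<int8_t>& in, int channels, int size, int ep)
{
    const int groups = channels / ep;
    std::vector<int8_t> out(in.size());
    for (int q = 0; q < groups; q++)
        for (int i = 0; i < size; i++)
            for (int k = 0; k < ep; k++)
                out[((size_t)i * groups + q) * ep + k] = in[((size_t)q * size + i) * ep + k];
    return out;
}

// Element-wise path: address every value individually, so the result is a
// properly packed blob whose channel axis is the old position axis.
static std::vector<int8_t> swap_axes_elementwise(const std::vector<int8_t>& in, int channels, int size, int ep)
{
    std::vector<int8_t> out(in.size());
    for (int c = 0; c < channels; c++)
        for (int i = 0; i < size; i++)
            out[packed_index(i, c, channels, ep)] = in[packed_index(c, i, size, ep)];
    return out;
}
```

With 4 channels, 4 positions, and `ep = 2`, the two functions place the same values at different offsets, so a consumer that reads the naive output as a packed blob sees channels in the wrong order.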


github-actions Bot added the x86 label May 31, 2026

chatgpt-codex-connector Bot reviewed Jun 4, 2026

crafcat7 wants to merge 2 commits into Tencent:master from crafcat7:feat/x86-permute
