[CPU:Perf] Adapt CommonOptFunction for RVV architecture by jxgxxx · Pull Request #4426 · alibaba/MNN

jxgxxx · 2026-05-06T13:51:39Z

Description

This PR implements the RISC-V Vector (RVV) adaptation for core operators in CommonOptFunction.

Accuracy Validation

Verified that the outputs of all RVV-adapted functions strictly match the original C++ implementations.

Performance Metrics

The performance was evaluated on a remote RISC-V server. Profiling was conducted using perf for each individual function.

| Function / Operator | Baseline (C++) | RVV Optimized | Speedup |
| --- | --- | --- | --- |
| MNNPackedMatMulFP32 | 2373.37 ms | 323.82 ms | 7.33x |
| generalIm2col | 114.56 ms | 12.27 ms | 9.33x |
| MNNDynamicUpdateConvBiasScale | 510.58 ms | 400.24 ms | 1.23x |
| MNNPackedMatMulFP32 | 1355.11 ms | 165.08 ms | 8.20x |
| MNNPackedMatMulRemainFP32 | 613.35 ms | 105.72 ms | 5.80x |
| MNNPackC4ForMatMul_A | 24.82 ms | 8.79 ms | 2.83x |
| MNNPackForMatMul_B | 3390.81 ms | 714.80 ms | 4.74x |

(Note: Data represents average execution time per X iterations / runs.)

Module

CPU / RVV

Type

Checklist

Commit message follows [Module:Type] Description format
Code compiles without errors
Tested on relevant platform(s)
No unrelated format or style changes included

Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>

CLAassistant · 2026-05-06T13:52:12Z

All committers have signed the CLA.

wangzhaode

Thanks for the great work on RVV optimization — the performance numbers look impressive! Before we can merge this, there are a few issues that need to be addressed:

1. Duplicate symbol: MNNPackC4ForMatMul_A (build blocker)
The new file MNNPackC4ForMatMul_A_RVV.cpp defines MNNPackC4ForMatMul_A, which is already defined in the existing MNNPackC4ForMatMul_A.cpp. Since CMakeLists.txt uses FILE(GLOB ...) to compile all .cpp files in the directory, this will cause a linker error due to duplicate symbols. Please either replace the old file or rename the new function.

2. Missing framework integration
The new functions are not registered in CommonOptFunction.cpp (e.g., gCoreFunction->MNNPackedMatMul = ...), so they won't actually be called at runtime. Please add the necessary registration code.
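To illustrate the registration step requested above, here is a minimal sketch of a function-pointer dispatch table. `CoreFunctions`, `Backend`, `makeCoreFunctions`, and the two `getPackMode*` functions are hypothetical stand-ins written for this example, not MNN's actual types or signatures; only the `hP = 4` value comes from this thread, and the other tile sizes are illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tag so the selected backend can be inspected.
enum class Backend { Scalar, Rvv };

// Hypothetical, heavily simplified stand-in for a core function table.
struct CoreFunctions {
    Backend backend = Backend::Scalar;
    void (*MNNGetMatMulPackMode)(int* eP, int* lP, int* hP) = nullptr;
};

// Illustrative tile sizes; hP = 4 is the value quoted in this thread.
void getPackModeScalar(int* eP, int* lP, int* hP) { *eP = 16; *lP = 1; *hP = 4; }
void getPackModeRvv(int* eP, int* lP, int* hP)    { *eP = 16; *lP = 1; *hP = 4; }

// Fill the table with scalar defaults, then override entries when RVV is available.
// Without the override, an RVV kernel that merely exists in the build is never called.
CoreFunctions makeCoreFunctions(bool hasRvv) {
    CoreFunctions f;
    f.MNNGetMatMulPackMode = getPackModeScalar;
    if (hasRvv) {
        f.backend = Backend::Rvv;
        f.MNNGetMatMulPackMode = getPackModeRvv;
    }
    assert(f.MNNGetMatMulPackMode != nullptr);
    return f;
}
```

Callers would then go through the table (for example `table.MNNGetMatMulPackMode(&eP, &lP, &hP)`) rather than naming a kernel directly, which is what lets the backend choose the RVV version at runtime.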

3. Inconsistent naming convention
Some functions use the _RVV suffix (MNNPackForMatMul_B_RVV, MNNDynamicUpdateConvBiasScale_RVV, MNNQuantScaleFP32_RVV) while others don't (MNNPackedMatMulFP32, generalIm2col). Please unify the naming to be consistent with the framework's integration pattern.

4. ARM SME2 macro names in RVV code
MNNPackForMatMul_B.cpp uses SME2_MATMUL_LP and SME2_MATMUL_HP, which are ARM-specific names. Please rename them to something architecture-neutral or RVV-specific.

5. Code duplication
MNNPackedMatMulFP32.cpp and MNNPackedMatMulRemainFP32.cpp are nearly identical. Consider having the Packed version call the Remain version (similar to the SME2 approach):

void MNNPackedMatMulFP32(...) {
    MNNPackedMatMulRemainFP32(C, A, B, 16, parameter, ...);
}
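The delegation idea in the snippet above can be sketched in a self-contained form. The signatures and memory layouts below are simplified assumptions made for this example (A stored as `[l][16]`, B as `[l][h]`, C as `[e][h]`), and the `*Sketch` names are hypothetical; the real MNN kernels take a parameter block and packed layouts.

```cpp
#include <cassert>
#include <cstddef>

// Full tile width along e, matching the literal 16 in the reviewer's snippet.
constexpr std::size_t kTileE = 16;

// Handles any eSize <= kTileE. Assumed layouts: A[l][kTileE], B[l][h], C[eSize][h].
void packedMatMulRemainSketch(float* C, const float* A, const float* B,
                              std::size_t eSize, std::size_t l, std::size_t h) {
    assert(eSize <= kTileE);
    for (std::size_t x = 0; x < eSize; ++x) {
        for (std::size_t y = 0; y < h; ++y) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < l; ++k) {
                acc += A[k * kTileE + x] * B[k * h + y];
            }
            C[x * h + y] = acc;
        }
    }
}

// The full-tile entry point is just the remainder kernel with eSize fixed to kTileE,
// so only one loop nest has to be maintained.
void packedMatMulSketch(float* C, const float* A, const float* B,
                        std::size_t l, std::size_t h) {
    packedMatMulRemainSketch(C, A, B, kTileE, l, h);
}
```

Whether a shared kernel costs performance on the full-tile path depends on how well the compiler or hand-written RVV code specializes for `eSize == 16`, so this trade-off would need measuring.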

6. Missing trailing newlines
All new files are missing the POSIX-required trailing newline at the end of the file.

Looking forward to the updated version. Thanks again for the contribution!

wangzhaode · 2026-05-18T04:46:49Z

Critical Bug: layout incompatible with

MNNGetMatMulPackMode_RVV returns hP=4, which means the framework expects B matrix packed in [h/4][l][4] layout. However, MNNPackForMatMul_B_RVV uses RVV_MATMUL_HP = 64, producing a [h/64][l][64] layout.

MNNPackedMatMulRemainFP32_RVV reads B as:

size_t bStride = bExtraStride + l * 4;   // stride per h-block = l*4
const float* w_ptr = b_base + z * 4;     // 4 weights per l-step

This assumes [h/4][l][4] layout, which does NOT match what PackForMatMul_B produces.
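The index arithmetic behind this mismatch can be made concrete with a small sketch. It is not the reviewer's test: `packB` and `countMismatches` are hypothetical helpers, and the source matrix is assumed to be row-major `[l][h]` (non-transposed). The packed layout is `[ceil(h/hp)][l][hp]` with zero padding, so element `(k, y)` lives at offset `(y / hp) * l * hp + k * hp + (y % hp)`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pack a row-major [l][h] matrix into [ceil(h/hp)][l][hp], zero-padding the last block.
std::vector<float> packB(const std::vector<float>& src, std::size_t h, std::size_t l,
                         std::size_t hp) {
    assert(src.size() == l * h);
    const std::size_t blocks = (h + hp - 1) / hp;
    std::vector<float> dst(blocks * l * hp, 0.0f);
    for (std::size_t k = 0; k < l; ++k) {
        for (std::size_t y = 0; y < h; ++y) {
            dst[(y / hp) * l * hp + k * hp + (y % hp)] = src[k * h + y];
        }
    }
    return dst;
}

// Pack with packHp, then read every element back as if the layout used readHp.
// Returns how many elements come back wrong (or fall outside the buffer).
std::size_t countMismatches(std::size_t h, std::size_t l, std::size_t packHp,
                            std::size_t readHp) {
    std::vector<float> src(l * h);
    for (std::size_t i = 0; i < src.size(); ++i) src[i] = static_cast<float>(i + 1);
    const std::vector<float> packed = packB(src, h, l, packHp);
    std::size_t bad = 0;
    for (std::size_t k = 0; k < l; ++k) {
        for (std::size_t y = 0; y < h; ++y) {
            const std::size_t idx = (y / readHp) * l * readHp + k * readHp + (y % readHp);
            if (idx >= packed.size() || packed[idx] != src[k * h + y]) ++bad;
        }
    }
    return bad;
}
```

With matching values (`countMismatches(8, 16, 4, 4)`) every element round-trips; with `packHp = 64` and `readHp = 4` the reader's per-step stride of 4 lands on the wrong elements of the 64-wide rows, which is the kind of disagreement described above.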

Verification

I wrote a standalone C++ test that simulates PackB(HP=64) + MatMul(hP=4) and compares against a scalar reference. All 8 test cases fail:

=== PR#4426: PackB(HP=64) vs MatMul(hP=4) Layout Test ===

Test0: h=4   l=8   tr=0  FAIL (16 mismatches, maxErr=41.28)
Test1: h=8   l=16  tr=0  FAIL (32 mismatches, maxErr=77.77)
Test2: h=32  l=32  tr=0  FAIL (128 mismatches, maxErr=170.79)
Test3: h=64  l=16  tr=0  FAIL (256 mismatches, maxErr=124.90)
Test4: h=128 l=16  tr=0  FAIL (512 mismatches, maxErr=151.80)
Test5: h=256 l=24  tr=0  FAIL (1024 mismatches, maxErr=207.31)
Test6: h=8   l=16  tr=1  FAIL (32 mismatches, maxErr=58.22)
Test7: h=128 l=32  tr=1  FAIL (512 mismatches, maxErr=169.48)

=== 0 PASSED, 8 FAILED ===

Suggested Fix

Either:

Change RVV_MATMUL_HP in MNNPackForMatMul_B_RVV from 64 to 4 to match hP=4, or
If HP=64 tiling is intentional for performance, update MNNGetMatMulPackMode_RVV to return hP=64 and adjust MNNPackedMatMulRemainFP32_RVV to read B with bStride = bExtraStride + l * 64 and process 64 h-values per block.

Also a minor note: the diff contains many unrelated whitespace/formatting changes (alignment adjustments in NEON code, for-loop spacing, etc.) that make review harder. Consider separating those into a dedicated commit.

Co-authored-by: jxgxxx <1955992348@qq.com> Co-authored-by: typer-J <2236066784@qq.com> Co-authored-by: Sherlockzhangjinge <zjgzhangjinge@outlook.com> Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>

jxgxxx · 2026-05-20T07:09:31Z

Hi @wangzhaode,

Thank you for the detailed review and the helpful test script! I have addressed all the issues mentioned:

Fixed layout mismatch: Changed RVV_MATMUL_HP to 4 in MNNPackForMatMul_B_RVV to perfectly align with the hP=4 compute kernel.
Cleaned up formatting: Reverted all unintentional formatting/whitespace changes (including the NEON code and CommonOptFunction.cpp) to keep the diff clean.

The commits have been squashed and updated. Please let me know if anything else is needed. Thanks again for your guidance!

ihb2032 · 2026-06-09T01:27:14Z

I have some concerns about the MNNPackC4ForMatMul_A_RVV implementation in this PR.

First, I agree that registering RVV implementations into CommonOptFunction and fixing the symbol naming / dispatch path are reasonable. The RVV functions should be properly selected through the backend dispatch mechanism.

However, for MNNPackC4ForMatMul_A_RVV, this PR appears to replace or reimplement the already merged RVV kernel from #3813 with a different algorithm. I do not think this replacement is justified by the current benchmark data.

#3813 already provided benchmark data with explicit benchmark entry, shapes, timing results, and test environment. It included:

benchmark entry: test_pack_c4_for_mat_mul_a
dimensions: eReal and l
multiple benchmark shapes, including large and asymmetric cases
scalar time, RVV time, and speedup for each case
test environment: Banana Pi BPI-F3, EulixOS 3.0
also a negative case where RVV is slower, such as eReal = 1, which helps clarify the applicable workload range

For example, #3813 included benchmark cases such as:

eReal = 1024, l = 128
eReal = 1024, l = 1024
eReal = 1024, l = 4096
eReal = 1024, l = 8192
eReal = 1024, l = 16384
eReal = 1024, l = 32768
eReal = 65536, l = 128
eReal = 1000000, l = 64
eReal = 16, l = 1000000
eReal = 1, l = 65536

The reported speedups for the large eReal = 1024 cases were around 33x to 63x over the scalar implementation.

In contrast, this PR only reports a single number for MNNPackC4ForMatMul_A:

24.82 ms -> 8.79 ms, 2.83x

This only shows that the new implementation is faster than the scalar C++ baseline. Since #3813 is already merged, the correct comparison target should be the existing RVV implementation from #3813, not only the scalar fallback.

The current benchmark information is also incomplete compared with #3813. For this packing kernel, performance depends heavily on eReal and l, but this PR does not provide the shape corresponding to the reported timing number. Without the benchmark shape and environment, the single timing number is not enough to evaluate whether the new implementation is actually better.

More importantly, the two implementations use different vectorization strategies.

The implementation from #3813 intentionally vectorizes along the e dimension. Although it uses strided loads from the source C4 layout, it uses large vl / m8 and stores to the destination contiguously. This allows the kernel to use the available RVV vector length effectively, especially when e is large.

The implementation in this PR uses contiguous vle32 loads from source, but the effective vector length is only 4, and then it writes to destination with strided stores. This changes the vectorization dimension from the long e dimension to the small C4 dimension. Even though the source load is contiguous, this may underutilize RVV for large e cases.
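The difference between the two strategies can be sketched with a scalar emulation that counts how many vector load/store pairs each one would issue. This is a rough model under stated assumptions, not either PR's code: the source is taken to be a C4 layout `[l/4][e][4]`, the destination `[l][e]`, and `vlmax` is a hypothetical maximum element count per vector operation (for example 64 for 32-bit elements at VLEN = 256 bits with LMUL = 8). The instruction count is only a crude proxy; real cost also depends on how the hardware handles strided accesses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct PackResult {
    std::vector<float> dst;
    std::size_t vectorOps;  // number of emulated (load, store) pairs
};

// Vectorize along e: strided load (stride 4 floats) of up to vlmax elements,
// then a unit-stride store into the destination row.
PackResult packAlongE(const std::vector<float>& src, std::size_t e, std::size_t l,
                      std::size_t vlmax) {
    assert(l % 4 == 0 && src.size() == e * l && vlmax > 0);
    PackResult r{std::vector<float>(e * l), 0};
    for (std::size_t c = 0; c < l; ++c) {
        const float* s = src.data() + (c / 4) * e * 4 + (c % 4);
        float* d = r.dst.data() + c * e;
        for (std::size_t x = 0; x < e;) {
            const std::size_t vl = std::min(vlmax, e - x);
            for (std::size_t i = 0; i < vl; ++i) d[x + i] = s[(x + i) * 4];
            x += vl;
            ++r.vectorOps;
        }
    }
    return r;
}

// Vectorize along the C4 dimension: unit-stride load of 4 elements,
// then a strided store (stride e floats) into four destination rows.
PackResult packAlongC4(const std::vector<float>& src, std::size_t e, std::size_t l) {
    assert(l % 4 == 0 && src.size() == e * l);
    PackResult r{std::vector<float>(e * l), 0};
    for (std::size_t cb = 0; cb < l / 4; ++cb) {
        for (std::size_t x = 0; x < e; ++x) {
            const float* s = src.data() + (cb * e + x) * 4;
            for (std::size_t i = 0; i < 4; ++i) r.dst[(cb * 4 + i) * e + x] = s[i];
            ++r.vectorOps;
        }
    }
    return r;
}
```

Both functions produce the same packed output; under this model, for `e = 1024`, `l = 128`, `vlmax = 64`, the e-direction version issues 128 × 16 = 2048 pairs and the C4-direction version issues 32 × 1024 = 32768, each moving only 4 elements.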

So I do not think this kernel should be evaluated only by whether the load is contiguous or whether the code looks simpler. The key question is whether the implementation uses the RVV vector length effectively for the real packing workload.

Please provide an apples-to-apples benchmark using the same benchmark entry and shapes from #3813, including:

scalar C++ baseline
the current RVV implementation from #3813 (ENH: Optimize MNNPackC4ForMatMul_A with RVV implementation)
this PR's RVV implementation

Please include at least the same level of benchmark information as #3813:

benchmark entry
eReal and l
scalar time
RVV time
speedup
test environment
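A comparison of this kind could be collected with a small timing harness along the following lines. This is a sketch only: `BenchRow`, `averageMs`, and `compareKernels` are hypothetical names, the kernels are passed in as opaque callables, and it is not the `test_pack_c4_for_mat_mul_a` entry from #3813.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// One result row: shape, both timings, and baseline / candidate speedup.
struct BenchRow {
    std::size_t eReal;
    std::size_t l;
    double baseMs;
    double candMs;
    double speedup;
};

// Average wall-clock time per call in milliseconds, after one warm-up call.
double averageMs(const std::function<void()>& fn, int iters) {
    assert(iters > 0);
    fn();  // warm-up so first-touch and cache effects are not timed
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) fn();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}

// Time two kernels on the same (eReal, l) shape so the numbers are directly comparable.
BenchRow compareKernels(std::size_t eReal, std::size_t l,
                        const std::function<void(std::size_t, std::size_t)>& base,
                        const std::function<void(std::size_t, std::size_t)>& cand,
                        int iters) {
    const double b = averageMs([&] { base(eReal, l); }, iters);
    const double c = averageMs([&] { cand(eReal, l); }, iters);
    return BenchRow{eReal, l, b, c, c > 0.0 ? b / c : 0.0};
}
```

Running `compareKernels` once per shape, with the scalar kernel, the #3813 kernel, and this PR's kernel as the callables, would give one row per case with the shape recorded next to each timing.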

Before such data is provided, I suggest keeping the #3813 implementation for MNNPackC4ForMatMul_A_RVV, and only adapting the function name / registration if needed for CommonOptFunction.

In short, integrating RVV functions into the dispatch path is reasonable. But replacing an existing performance-critical RVV kernel should require direct benchmark data against the existing RVV kernel, not only against the scalar implementation.

Optimize: Adapt CommonOptFunction for RVV architecture

32729f9

Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>

wangzhaode self-assigned this May 7, 2026

wangzhaode requested changes May 9, 2026


wangzhaode mentioned this pull request May 9, 2026: Feature/rvv support cpu #4425 (open, 12 tasks)

jxgxxx force-pushed the rvv-CommonOptFunction branch from 88728c4 to ee8a361 Compare May 11, 2026 13:14

jxgxxx requested a review from wangzhaode May 11, 2026 13:44

jxgxxx force-pushed the rvv-CommonOptFunction branch 4 times, most recently from 7aacba5 to 2fc23d2 Compare May 20, 2026 06:54

The issues raised in the comments have been addressed.

761e012

Co-authored-by: jxgxxx <1955992348@qq.com> Co-authored-by: typer-J <2236066784@qq.com> Co-authored-by: Sherlockzhangjinge <zjgzhangjinge@outlook.com> Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>

jxgxxx force-pushed the rvv-CommonOptFunction branch from 2fc23d2 to 761e012 Compare May 20, 2026 07:03

Merge branch 'master' into rvv-CommonOptFunction

4f3fa0b

jxgxxx wants to merge 3 commits into alibaba:master from jxgxxx:rvv-CommonOptFunction
