[CPU:Perf] Adapt CommonOptFunction for RVV architecture #4426

Open
jxgxxx wants to merge 3 commits into alibaba:master from jxgxxx:rvv-CommonOptFunction

@jxgxxx

@jxgxxx jxgxxx commented May 6, 2026


Description

This PR implements the RISC-V Vector (RVV) adaptation for core operators in CommonOptFunction.

Accuracy Validation

  • Verified that the outputs of all RVV-adapted functions strictly match the original C++ implementations.

Performance Metrics

The performance was evaluated on a remote RISC-V server. Profiling was conducted using perf for each individual function.

| Function / Operator | Baseline (C++) | RVV Optimized | Speedup |
| --- | --- | --- | --- |
| MNNPackedMatMulFP32 | 2373.37 ms | 323.82 ms | 7.33x |
| generalIm2col | 114.56 ms | 12.27 ms | 9.33x |
| MNNDynamicUpdateConvBiasScale | 510.58 ms | 400.24 ms | 1.23x |
| MNNPackedMatMulFP32 | 1355.11 ms | 165.08 ms | 8.20x |
| MNNPackedMatMulRemainFP32 | 613.35 ms | 105.72 ms | 5.80x |
| MNNPackC4ForMatMul_A | 24.82 ms | 8.79 ms | 2.83x |
| MNNPackForMatMul_B | 3390.81 ms | 714.80 ms | 4.74x |

(Note: Data represents average execution time per X iterations / runs.)

Module

CPU / RVV

Type

  • Feature
  • Bugfix
  • Perf
  • Refact
  • Style
  • Doc
  • Test
  • Chore

Checklist

  • Commit message follows [Module:Type] Description format
  • Code compiles without errors
  • Tested on relevant platform(s)
  • No unrelated format or style changes included

Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>
@CLAassistant

CLAassistant commented May 6, 2026


CLA assistant check
All committers have signed the CLA.

@wangzhaode wangzhaode self-assigned this May 7, 2026

@wangzhaode wangzhaode left a comment

Collaborator

Thanks for the great work on RVV optimization — the performance numbers look impressive! Before we can merge this, there are a few issues that need to be addressed:

1. Duplicate symbol: MNNPackC4ForMatMul_A (build blocker)
The new file MNNPackC4ForMatMul_A_RVV.cpp defines MNNPackC4ForMatMul_A, which is already defined in the existing MNNPackC4ForMatMul_A.cpp. Since CMakeLists.txt uses FILE(GLOB ...) to compile all .cpp files in the directory, this will cause a linker error due to duplicate symbols. Please either replace the old file or rename the new function.

2. Missing framework integration
The new functions are not registered in CommonOptFunction.cpp (e.g., gCoreFunction->MNNPackedMatMul = ...), so they won't actually be called at runtime. Please add the necessary registration code.
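
A minimal sketch of the kind of registration meant here, assuming a function-pointer table that is filled with scalar defaults and then overridden per architecture. `CoreFunctionsSketch`, `packA`, `makeCoreFunctions`, and both placeholder kernels are hypothetical names for illustration and do not reflect MNN's actual `CoreFunctions` definition or signatures:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, heavily simplified stand-in for the framework's function table.
struct CoreFunctionsSketch {
    void (*packA)(float* dst, const float* src, size_t count);
    const char* backendName;
};

// Scalar fallback (placeholder body: a plain copy).
static void packAScalar(float* dst, const float* src, size_t count) {
    for (size_t i = 0; i < count; ++i) dst[i] = src[i];
}

// Placeholder for an RVV kernel; a real one would use RVV intrinsics.
static void packARvvSketch(float* dst, const float* src, size_t count) {
    for (size_t i = 0; i < count; ++i) dst[i] = src[i];
}

// Fill the table with scalar defaults, then override when RVV is available.
inline CoreFunctionsSketch makeCoreFunctions(bool rvvAvailable) {
    CoreFunctionsSketch core{packAScalar, "scalar"};
    if (rvvAvailable) {
        core.packA = packARvvSketch;
        core.backendName = "rvv";
    }
    return core;
}
```

Without the override step, callers that go through the table keep hitting the scalar entry even though the RVV kernel is compiled in.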

3. Inconsistent naming convention
Some functions use the _RVV suffix (MNNPackForMatMul_B_RVV, MNNDynamicUpdateConvBiasScale_RVV, MNNQuantScaleFP32_RVV) while others don't (MNNPackedMatMulFP32, generalIm2col). Please unify the naming to be consistent with the framework's integration pattern.

4. ARM SME2 macro names in RVV code
MNNPackForMatMul_B.cpp uses SME2_MATMUL_LP and SME2_MATMUL_HP, which are ARM-specific names. Please rename them to something architecture-neutral or RVV-specific.

5. Code duplication
MNNPackedMatMulFP32.cpp and MNNPackedMatMulRemainFP32.cpp are nearly identical. Consider having the Packed version call the Remain version (similar to the SME2 approach):

void MNNPackedMatMulFP32(...) {
    MNNPackedMatMulRemainFP32(C, A, B, 16, parameter, ...);
}
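
To make the delegation idea concrete, here is a small scalar sketch under simplifying assumptions: plain row-major layouts instead of MNN's packed layouts, and a reduced parameter list. `matMulRemainSketch`, `matMulPackedSketch`, and `kTileE` are hypothetical names, not the PR's functions:

```cpp
#include <cassert>
#include <cstddef>

constexpr size_t kTileE = 16;  // full tile width, mirroring the 16 in the snippet above

// Simplified "remain" kernel: C[y][x] = sum_k A[k][x] * B[y][k] for x < eSize.
// A is [l][kTileE], B is [h][l], C is [h][kTileE], all row-major.
void matMulRemainSketch(float* C, const float* A, const float* B,
                        size_t eSize, size_t l, size_t h) {
    for (size_t y = 0; y < h; ++y) {
        for (size_t x = 0; x < eSize; ++x) {
            float acc = 0.0f;
            for (size_t k = 0; k < l; ++k) {
                acc += A[k * kTileE + x] * B[y * l + k];
            }
            C[y * kTileE + x] = acc;
        }
    }
}

// The full-tile version is just the remain version with eSize fixed to the tile width.
void matMulPackedSketch(float* C, const float* A, const float* B, size_t l, size_t h) {
    matMulRemainSketch(C, A, B, kTileE, l, h);
}
```

With this shape only one kernel body has to be maintained; whether a dedicated full-tile kernel is still worth keeping for speed would need to be measured.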

6. Missing trailing newlines
All new files are missing the POSIX-required trailing newline at the end of the file.

Looking forward to the updated version. Thanks again for the contribution!

@wangzhaode wangzhaode mentioned this pull request May 9, 2026
12 tasks
@jxgxxx jxgxxx force-pushed the rvv-CommonOptFunction branch from 88728c4 to ee8a361 Compare May 11, 2026 13:14
@jxgxxx jxgxxx requested a review from wangzhaode May 11, 2026 13:44
@wangzhaode

Collaborator

Critical Bug: layout incompatible with

MNNGetMatMulPackMode_RVV returns hP=4, which means the framework expects the B matrix to be packed in a [h/4][l][4] layout. However, MNNPackForMatMul_B_RVV uses RVV_MATMUL_HP = 64, producing a [h/64][l][64] layout.

MNNPackedMatMulRemainFP32_RVV reads B as:

size_t bStride = bExtraStride + l * 4;   // stride per h-block = l*4
const float* w_ptr = b_base + z * 4;     // 4 weights per l-step

This assumes a [h/4][l][4] layout, which does NOT match what MNNPackForMatMul_B_RVV produces.

Verification

I wrote a standalone C++ test that simulates PackB(HP=64) + MatMul(hP=4) and compares against a scalar reference. All 8 test cases fail:

=== PR#4426: PackB(HP=64) vs MatMul(hP=4) Layout Test ===

Test0: h=4   l=8   tr=0  FAIL (16 mismatches, maxErr=41.28)
Test1: h=8   l=16  tr=0  FAIL (32 mismatches, maxErr=77.77)
Test2: h=32  l=32  tr=0  FAIL (128 mismatches, maxErr=170.79)
Test3: h=64  l=16  tr=0  FAIL (256 mismatches, maxErr=124.90)
Test4: h=128 l=16  tr=0  FAIL (512 mismatches, maxErr=151.80)
Test5: h=256 l=24  tr=0  FAIL (1024 mismatches, maxErr=207.31)
Test6: h=8   l=16  tr=1  FAIL (32 mismatches, maxErr=58.22)
Test7: h=128 l=32  tr=1  FAIL (512 mismatches, maxErr=169.48)

=== 0 PASSED, 8 FAILED ===
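
The test source itself is not included in the comment. The following is an independent, simplified sketch of the same kind of check, not the reviewer's script: it packs a row-major B with one block size, indexes it with another, and counts outputs that differ from a scalar reference. It uses a single column of A for brevity, and `packB` and `countMismatches` are hypothetical helper names:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pack row-major B[h][l] into [ceil(h/HP)][l][HP], zero-padding the last block.
std::vector<float> packB(const std::vector<float>& B, size_t h, size_t l, size_t HP) {
    const size_t blocks = (h + HP - 1) / HP;
    std::vector<float> packed(blocks * l * HP, 0.0f);
    for (size_t y = 0; y < h; ++y) {
        for (size_t k = 0; k < l; ++k) {
            packed[(y / HP) * l * HP + k * HP + (y % HP)] = B[y * l + k];
        }
    }
    return packed;
}

// Count outputs that differ from the scalar reference when B is packed with
// packHP but the kernel indexes it as [h/kernelHP][l][kernelHP].
size_t countMismatches(size_t h, size_t l, size_t packHP, size_t kernelHP) {
    std::vector<float> B(h * l), a(l);
    for (size_t i = 0; i < B.size(); ++i) B[i] = static_cast<float>(i % 7) - 3.0f;
    for (size_t k = 0; k < l; ++k) a[k] = static_cast<float>(k % 5) + 1.0f;

    std::vector<float> packed = packB(B, h, l, packHP);
    // Keep kernel reads in bounds even if the two block sizes disagree.
    const size_t kernelSize = ((h + kernelHP - 1) / kernelHP) * l * kernelHP;
    packed.resize(std::max(packed.size(), kernelSize), 0.0f);

    size_t mismatches = 0;
    for (size_t y = 0; y < h; ++y) {
        float ref = 0.0f, got = 0.0f;
        const float* block = packed.data() + (y / kernelHP) * l * kernelHP;
        for (size_t k = 0; k < l; ++k) {
            ref += a[k] * B[y * l + k];                          // scalar reference
            got += a[k] * block[k * kernelHP + (y % kernelHP)];  // kernel's view of B
        }
        if (std::fabs(ref - got) > 1e-4f) ++mismatches;
    }
    return mismatches;
}
```

Calling `countMismatches(h, l, 4, 4)` should report zero mismatches, while `countMismatches(8, 16, 64, 4)` reports a nonzero count, which is the symptom described above.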

Suggested Fix

Either:

  1. Change RVV_MATMUL_HP in MNNPackForMatMul_B_RVV from 64 to 4 to match hP=4, or
  2. If HP=64 tiling is intentional for performance, update MNNGetMatMulPackMode_RVV to return hP=64 and adjust MNNPackedMatMulRemainFP32_RVV to read B with bStride = bExtraStride + l * 64 and process 64 h-values per block.

Also a minor note: the diff contains many unrelated whitespace/formatting changes (alignment adjustments in NEON code, for-loop spacing, etc.) that make review harder. Consider separating those into a dedicated commit.

@jxgxxx jxgxxx force-pushed the rvv-CommonOptFunction branch 4 times, most recently from 7aacba5 to 2fc23d2 Compare May 20, 2026 06:54
Co-authored-by: jxgxxx <1955992348@qq.com>

Co-authored-by: typer-J <2236066784@qq.com>

Co-authored-by: Sherlockzhangjinge <zjgzhangjinge@outlook.com>

Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>
@jxgxxx jxgxxx force-pushed the rvv-CommonOptFunction branch from 2fc23d2 to 761e012 Compare May 20, 2026 07:03
@jxgxxx

jxgxxx commented May 20, 2026

Author

Hi @wangzhaode ,

Thank you for the detailed review and the helpful test script! I have addressed all the issues mentioned:

  1. Fixed layout mismatch: Changed RVV_MATMUL_HP to 4 in MNNPackForMatMul_B_RVV to perfectly align with the hP=4 compute kernel.
  2. Cleaned up formatting: Reverted all unintentional formatting/whitespace changes (including the NEON code and CommonOptFunction.cpp) to keep the diff clean.

The commits have been squashed and updated. Please let me know if anything else is needed. Thanks again for your guidance!

@ihb2032

ihb2032 commented Jun 9, 2026

Contributor

I have some concerns about the MNNPackC4ForMatMul_A_RVV implementation in this PR.

First, I agree that registering RVV implementations into CommonOptFunction and fixing the symbol naming / dispatch path are reasonable. The RVV functions should be properly selected through the backend dispatch mechanism.

However, for MNNPackC4ForMatMul_A_RVV, this PR appears to replace or reimplement the already merged RVV kernel from #3813 with a different algorithm. I do not think this replacement is justified by the current benchmark data.

#3813 already provided benchmark data with explicit benchmark entry, shapes, timing results, and test environment. It included:

  • benchmark entry: test_pack_c4_for_mat_mul_a
  • dimensions: eReal and l
  • multiple benchmark shapes, including large and asymmetric cases
  • scalar time, RVV time, and speedup for each case
  • test environment: Banana Pi BPI-F3, EulixOS 3.0
  • also a negative case where RVV is slower, such as eReal = 1, which helps clarify the applicable workload range

For example, #3813 included benchmark cases such as:

  • eReal = 1024, l = 128
  • eReal = 1024, l = 1024
  • eReal = 1024, l = 4096
  • eReal = 1024, l = 8192
  • eReal = 1024, l = 16384
  • eReal = 1024, l = 32768
  • eReal = 65536, l = 128
  • eReal = 1000000, l = 64
  • eReal = 16, l = 1000000
  • eReal = 1, l = 65536

The reported speedups for the large eReal = 1024 cases were around 33x to 63x over the scalar implementation.

In contrast, this PR only reports a single number for MNNPackC4ForMatMul_A:

24.82 ms -> 8.79 ms, 2.83x

This only shows that the new implementation is faster than the scalar C++ baseline. Since #3813 is already merged, the correct comparison target should be the existing RVV implementation from #3813, not only the scalar fallback.

The current benchmark information is also incomplete compared with #3813. For this packing kernel, performance depends heavily on eReal and l, but this PR does not provide the shape corresponding to the reported timing number. Without the benchmark shape and environment, the single timing number is not enough to evaluate whether the new implementation is actually better.

More importantly, the two implementations use different vectorization strategies.

The implementation from #3813 intentionally vectorizes along the e dimension. Although it uses strided loads from the source C4 layout, it uses large vl / m8 and stores to the destination contiguously. This allows the kernel to use the available RVV vector length effectively, especially when e is large.

The implementation in this PR uses contiguous vle32 loads from source, but the effective vector length is only 4, and then it writes to destination with strided stores. This changes the vectorization dimension from the long e dimension to the small C4 dimension. Even though the source load is contiguous, this may underutilize RVV for large e cases.
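
The difference between the two strategies can be modeled with plain scalar loops. This is only an illustrative sketch, assuming a source in C4 layout [l/4][e][4], a destination in [l][e], and l being a multiple of 4; it ignores the tiling and offset parameters of the real kernel and uses no RVV intrinsics. `packAlongE` and `packAlongC4` are hypothetical names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Strategy 1: the inner loop runs along e (e iterations per row):
// strided reads (stride of 4 floats), contiguous writes.
std::vector<float> packAlongE(const std::vector<float>& src, size_t e, size_t l) {
    std::vector<float> dst(l * e);
    for (size_t y = 0; y < l; ++y) {
        const float* s = src.data() + (y / 4) * e * 4 + (y % 4);
        float* d = dst.data() + y * e;
        for (size_t x = 0; x < e; ++x) d[x] = s[x * 4];
    }
    return dst;
}

// Strategy 2: the inner loop runs along the C4 dimension (always 4 iterations):
// contiguous reads, strided writes (stride of e floats).
std::vector<float> packAlongC4(const std::vector<float>& src, size_t e, size_t l) {
    std::vector<float> dst(l * e);
    for (size_t yb = 0; yb < l / 4; ++yb) {
        for (size_t x = 0; x < e; ++x) {
            const float* s = src.data() + (yb * e + x) * 4;
            float* d = dst.data() + yb * 4 * e + x;
            for (size_t c = 0; c < 4; ++c) d[c * e] = s[c];
        }
    }
    return dst;
}
```

Both functions produce the same output. What differs is the trip count of the innermost loop, which is the part a vector unit would cover: e in the first case, a fixed 4 in the second.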

So I do not think this kernel should be evaluated only by whether the load is contiguous or whether the code looks simpler. The key question is whether the implementation uses the RVV vector length effectively for the real packing workload.

Please provide an apples-to-apples benchmark using the same benchmark entry and shapes from #3813, including:

  1. scalar C++ baseline
  2. the current RVV implementation from #3813 (ENH: Optimize MNNPackC4ForMatMul_A with RVV implementation)
  3. this PR's RVV implementation

Please include at least the same level of benchmark information as #3813:

  • benchmark entry
  • eReal and l
  • scalar time
  • RVV time
  • speedup
  • test environment
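
As a rough illustration of how such rows could be collected, here is a small timing helper built on `std::chrono`. It is a sketch only: `BenchRow`, `averageMs`, and `printRow` are hypothetical names, and it is not the benchmark entry used in #3813:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>

// One row of the requested report: shape plus both timings.
struct BenchRow {
    size_t eReal;
    size_t l;
    double scalarMs;
    double rvvMs;
    double speedup() const { return scalarMs / rvvMs; }
};

// Average wall-clock time of fn over the given number of iterations, in milliseconds.
template <typename Fn>
double averageMs(Fn&& fn, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) fn();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / iterations;
}

// Print one row, including the benchmark entry name and the shape.
void printRow(const char* entry, const BenchRow& row) {
    std::printf("%s eReal=%zu l=%zu scalar=%.3f ms rvv=%.3f ms speedup=%.2fx\n",
                entry, row.eReal, row.l, row.scalarMs, row.rvvMs, row.speedup());
}
```

Running the same helper over the scalar kernel, the #3813 kernel, and this PR's kernel for each shape would give directly comparable rows; the test environment still has to be recorded separately.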

Before such data is provided, I suggest keeping the #3813 implementation for MNNPackC4ForMatMul_A_RVV, and only adapting the function name / registration if needed for CommonOptFunction.

In short, integrating RVV functions into the dispatch path is reasonable. But replacing an existing performance-critical RVV kernel should require direct benchmark data against the existing RVV kernel, not only against the scalar implementation.
