[CPU:Perf] Adapt CommonOptFunction for RVV architecture#4426
Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>
wangzhaode
left a comment
Thanks for the great work on RVV optimization — the performance numbers look impressive! Before we can merge this, there are a few issues that need to be addressed:
1. Duplicate symbol: MNNPackC4ForMatMul_A (build blocker)
The new file MNNPackC4ForMatMul_A_RVV.cpp defines MNNPackC4ForMatMul_A, which is already defined in the existing MNNPackC4ForMatMul_A.cpp. Since CMakeLists.txt uses FILE(GLOB ...) to compile all .cpp files in the directory, this will cause a linker error due to duplicate symbols. Please either replace the old file or rename the new function.
2. Missing framework integration
The new functions are not registered in CommonOptFunction.cpp (e.g., gCoreFunction->MNNPackedMatMul = ...), so they won't actually be called at runtime. Please add the necessary registration code.
3. Inconsistent naming convention
Some functions use the _RVV suffix (MNNPackForMatMul_B_RVV, MNNDynamicUpdateConvBiasScale_RVV, MNNQuantScaleFP32_RVV) while others don't (MNNPackedMatMulFP32, generalIm2col). Please unify the naming to be consistent with the framework's integration pattern.
4. ARM SME2 macro names in RVV code
MNNPackForMatMul_B.cpp uses SME2_MATMUL_LP and SME2_MATMUL_HP, which are ARM-specific names. Please rename them to something architecture-neutral or RVV-specific.
5. Code duplication
MNNPackedMatMulFP32.cpp and MNNPackedMatMulRemainFP32.cpp are nearly identical. Consider having the Packed version call the Remain version (similar to the SME2 approach):
void MNNPackedMatMulFP32(...) {
    MNNPackedMatMulRemainFP32(C, A, B, 16, parameter, ...);
}
6. Missing trailing newlines
All new files are missing the POSIX-required trailing newline at the end of the file.
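Items 2 and 5 can be pictured together with a minimal sketch. Everything here is an assumption for illustration: `CoreFunctionsSketch`, `registerSketch`, the two kernel names, and the row-major layout are hypothetical stand-ins, not the framework's actual struct, signatures, or packed layout. The point is only the shape of the fix: the full-tile kernel delegates to the remain kernel with a fixed tile of 16, and both get assigned into the function table so that the dispatch path can reach them.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, heavily simplified function table. The real framework
// struct has many more entries and different signatures.
struct CoreFunctionsSketch {
    void (*packedMatMul)(float* C, const float* A, const float* B,
                         std::size_t l, std::size_t h) = nullptr;
    void (*packedMatMulRemain)(float* C, const float* A, const float* B,
                               std::size_t eSize, std::size_t l, std::size_t h) = nullptr;
};

// Full tile size, mirroring the literal 16 in the review snippet.
constexpr std::size_t kFullTile = 16;

// Assumed layout for this sketch only: row-major A (eSize x l),
// row-major B (l x h), row-major C (eSize x h).
void matMulRemainSketch(float* C, const float* A, const float* B,
                        std::size_t eSize, std::size_t l, std::size_t h) {
    assert(C != nullptr && A != nullptr && B != nullptr);
    for (std::size_t e = 0; e < eSize; ++e) {
        for (std::size_t y = 0; y < h; ++y) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < l; ++k) {
                acc += A[e * l + k] * B[k * h + y];
            }
            C[e * h + y] = acc;
        }
    }
}

// The full-tile entry point holds no arithmetic of its own; it forwards
// to the remain kernel, so there is a single body to maintain.
void matMulFullTileSketch(float* C, const float* A, const float* B,
                          std::size_t l, std::size_t h) {
    matMulRemainSketch(C, A, B, kFullTile, l, h);
}

// Registration step: without these assignments the kernels compile
// but are never reached at runtime.
void registerSketch(CoreFunctionsSketch& table) {
    table.packedMatMul = matMulFullTileSketch;
    table.packedMatMulRemain = matMulRemainSketch;
}
```

With this shape, a layout fix in the remain kernel automatically applies to the full-tile path as well.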
Looking forward to the updated version. Thanks again for the contribution!
88728c4 to
ee8a361
Critical Bug: layout incompatible with
size_t bStride = bExtraStride + l * 4; // stride per h-block = l*4
const float* w_ptr = b_base + z * 4; // 4 weights per l-step
This assumes
Verification
I wrote a standalone C++ test that simulates
Suggested Fix
Either:
Also a minor note: the diff contains many unrelated whitespace/formatting changes (alignment adjustments in NEON code, for-loop spacing, etc.) that make review harder. Consider separating those into a dedicated commit.
7aacba5 to
2fc23d2
Co-authored-by: jxgxxx <1955992348@qq.com>
Co-authored-by: typer-J <2236066784@qq.com>
Co-authored-by: Sherlockzhangjinge <zjgzhangjinge@outlook.com>
Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>
2fc23d2 to
761e012
Hi @wangzhaode, thank you for the detailed review and the helpful test script! I have addressed all the issues mentioned:
The commits have been squashed and updated. Please let me know if anything else is needed. Thanks again for your guidance!
I have some concerns about the
First, I agree that registering RVV implementations into
However, for #3813 already provided benchmark data with explicit benchmark entry, shapes, timing results, and test environment. It included:
For example, #3813 included benchmark cases such as:
The reported speedups for the large
In contrast, this PR only reports a single number for
This only shows that the new implementation is faster than the scalar C++ baseline. Since #3813 is already merged, the correct comparison target should be the existing RVV implementation from #3813, not only the scalar fallback.
The current benchmark information is also incomplete compared with #3813. For this packing kernel, performance depends heavily on
More importantly, the two implementations use different vectorization strategies. The implementation from #3813 intentionally vectorizes along the
The implementation in this PR uses contiguous
So I do not think this kernel should be evaluated only by whether the load is contiguous or whether the code looks simpler. The key question is whether the implementation uses the RVV vector length effectively for the real packing workload.
Please provide an apples-to-apples benchmark using the same benchmark entry and shapes from #3813, including:
Please include at least the same level of benchmark information as #3813:
Before such data is provided, I suggest keeping the #3813 implementation for
In short, integrating RVV functions into the dispatch path is reasonable. But replacing an existing performance-critical RVV kernel should require direct benchmark data against the existing RVV kernel, not only against the scalar implementation.
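The kind of apples-to-apples comparison requested above could be sketched roughly as follows. This is only an illustration under stated assumptions: `PackFn`, `packLoopOverH`, `packLoopOverL`, `sameOutput`, and `averageMillis` are hypothetical names, and the two candidates are plain scalar transposes standing in for the two RVV kernels. The real comparison should use the benchmark entry and shapes from #3813. The sketch first checks that both candidates produce identical output, then times each one on the same shape with the same iteration count.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical signature shared by both candidates so that they can be
// driven by the same harness on the same shapes.
using PackFn = void (*)(float* dst, const float* src, std::size_t h, std::size_t l);

// Candidate 1 (stand-in): outer loop over h, inner loop over l.
void packLoopOverH(float* dst, const float* src, std::size_t h, std::size_t l) {
    for (std::size_t y = 0; y < h; ++y) {
        for (std::size_t x = 0; x < l; ++x) {
            dst[x * h + y] = src[y * l + x];
        }
    }
}

// Candidate 2 (stand-in): outer loop over l, inner loop over h.
void packLoopOverL(float* dst, const float* src, std::size_t h, std::size_t l) {
    for (std::size_t x = 0; x < l; ++x) {
        for (std::size_t y = 0; y < h; ++y) {
            dst[x * h + y] = src[y * l + x];
        }
    }
}

// Correctness gate: timing numbers are only comparable if both kernels
// write exactly the same packed buffer.
bool sameOutput(PackFn a, PackFn b, std::size_t h, std::size_t l) {
    std::vector<float> src(h * l);
    for (std::size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<float>(i);
    }
    std::vector<float> outA(h * l, -1.0f);
    std::vector<float> outB(h * l, -2.0f);
    a(outA.data(), src.data(), h, l);
    b(outB.data(), src.data(), h, l);
    return outA == outB;
}

// Average wall-clock time per call in milliseconds, after one warm-up call.
double averageMillis(PackFn fn, std::size_t h, std::size_t l, int iterations) {
    assert(iterations > 0);
    std::vector<float> src(h * l, 1.0f);
    std::vector<float> dst(h * l, 0.0f);
    fn(dst.data(), src.data(), h, l);  // warm-up, not timed
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn(dst.data(), src.data(), h, l);
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count() / iterations;
}
```

Reporting the result of `averageMillis` for both kernels per shape, together with the hardware and vector length, would give the same level of detail as the earlier benchmark.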
Description
This PR implements the RISC-V Vector (RVV) adaptation for core operators in CommonOptFunction.
Accuracy Validation
Performance Metrics
The performance was evaluated on a remote RISC-V server. Profiling was conducted using perf for each individual function.

| Function | Baseline | RVV | Speedup |
| --- | --- | --- | --- |
| MNNPackedMatMulFP32 | 2373.37 ms | 323.82 ms | 7.33x |
| generalIm2col | 114.56 ms | 12.27 ms | 9.33x |
| MNNDynamicUpdateConvBiasScale | 510.58 ms | 400.24 ms | 1.23x |
| MNNPackedMatMulFP32 | 1355.11 ms | 165.08 ms | 8.20x |
| MNNPackedMatMulRemainFP32 | 613.35 ms | 105.72 ms | 5.80x |
| MNNPackC4ForMatMul_A | 24.82 ms | 8.79 ms | 2.83x |
| MNNPackForMatMul_B | 3390.81 ms | 714.80 ms | 4.74x |

(Note: Data represents average execution time per X iterations / runs.)
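For reference, the last column appears to be the ratio of the two timings. A minimal sketch of that arithmetic, assuming the first timing is the baseline and the second is the RVV version (`speedup` is a hypothetical helper, not part of the PR):

```cpp
#include <cassert>

// Speedup as baseline time divided by optimized time.
// Both arguments must use the same unit (milliseconds here).
double speedup(double baselineMs, double optimizedMs) {
    assert(optimizedMs > 0.0);
    return baselineMs / optimizedMs;
}
```

For the first row, `speedup(2373.37, 323.82)` evaluates to roughly 7.33.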
Module
CPU / RVV
Type
Checklist
[Module:Type] Description format