Skip to content

feat: add TopK, Gather, GatherElements, Expand, Tile, and Mod operators#6669

Open
vlordier wants to merge 74 commits into
Tencent:masterfrom
vlordier:feature/yolo26-support
Open

feat: add TopK, Gather, GatherElements, Expand, Tile, and Mod operators#6669
vlordier wants to merge 74 commits into
Tencent:masterfrom
vlordier:feature/yolo26-support

Conversation

@vlordier

@vlordier vlordier commented Apr 11, 2026

Copy link
Copy Markdown

Summary

Adds six missing ncnn operators required for YOLOv10/YOLO26-style model deployment, along with pnnx lowering passes and full test coverage. Consolidates and supersedes #6558 and #6668.

Operators Implemented

Operator Layer pnnx pass Tests
TopK src/layer/topk.cpp pass_ncnn/TopK.cpp tests/test_topk.cpp
Gather src/layer/gather.cpp pass_ncnn/torch_gather.cpp tests/test_gather.cpp
GatherElements src/layer/gatherelements.cpp pass_ncnn/gatherelements.cpp tests/test_gatherelements.cpp
Expand src/layer/expand.cpp pass_ncnn/expand.cpp tests/test_expand.cpp
Tile (ONNX) src/layer/tile.cpp (extended) pass_ncnn/tile.cpp existing
Mod src/layer/mod.cpp pass_ncnn/mod.cpp tests/test_mod.cpp

Review Issues Addressed

All Copilot review comments from #6558, #6668, and the initial #6669 review have been resolved:

  • int32/int64 index dispatch: READ_IDX macro in Gather and GatherElements branches on idx_elemsize (4 vs 8)
  • Output rank preservation: top_blob allocated with overload matching index_blob.dims (1D/2D/3D)
  • Correct axis ordering: Gather/GatherElements use PyTorch-style axis (0=outermost); TopK uses ncnn-internal axis (0=w=innermost); pnnx TopK pass now converts new_axis = (ncnn_ndim-1) - new_axis
  • TopK int32 indices: TopK outputs int32 indices (not float), enabling correct downstream Gather indexing
  • TopK pnnx axis fix: pnnx TopK pass converts PyTorch dim to ncnn-internal axis ordering
  • pnnx TopK writes k: op->params["3"] = k_val now correctly propagated
  • expand.cpp int64 shape support: shape blob read via shape_elemsize (4 vs 8), validation rejects invalid broadcasts
  • expand.cpp total() padding bug: replaced with explicit w*h*c loops; no NEON intrinsics
  • torch_gather.cpp batch_index: applies axis > batch_index ? axis-1 : axis remapping
  • onnxruntime opt-in: -DPNNX_DISABLE_ONNXRUNTIME=ON cmake option; not forced-disabled
  • Test coverage: value-checking tests for all six operators; sorted=0 paths covered in test_topk

End-to-End Verification

pnnx converts a torch.topk + torch.gather TorchScript model ([1,8]→top-5) and ncnn inference matches PyTorch reference:

topk values: [0.9000, 0.8000, 0.7000, 0.5000, 0.4000]
gathered:    [0.9000, 0.8000, 0.7000, 0.5000, 0.4000]  ✓

Test Results

167/167 tests passed, 0 failed

vlordier and others added 30 commits April 10, 2026 12:24
- Generate TopK class definition in pnnx.py output with forward() method
- Instantiate TopK modules in Model.__init__() with proper parameters
- Update forward() method to call self.topk_name() instead of direct TopK() calls
- Fixes pnnx inference to properly execute TopK operations using torch.topk()
- Test confirms TopK ONNX→pnnx conversion and inference working correctly
- Fix IR pattern syntax to use explicit parameter names (axis=%, largest=%, sorted=%)
- Replace incorrect parameter lookup from 'op_0.axis' to 'axis' to match captured names
- TopK pass now properly fires during ONNX→pnnx→ncnn conversion
- All TopK parameters (axis, largest, sorted) correctly captured and set in ncnn layers
- End-to-end test confirms ONNX→pnnx→ncnn conversion with TopK working correctly
use c++03-style topk comparator and keep deterministic nan/inf ordering

remove redundant constructor param initialization

fix tests cmakelists alphabetical order (Tile before TopK)

expand torch_topk onnx tests (k=0/k=1, negative dim, sorted=false cases)

drop generated topk onnx/pnnx/ncnn sidecar artifacts from repo
- Guard <algorithm>/<vector> behind #if NCNN_SIMPLESTL, include simplestl.h
- Use std::partial_sort in simplestl mode (no std::nth_element available)
- Guard <math.h> in tests behind #if !NCNN_SIMPLESTL to avoid simplemath.h
  conflict; define INFINITY/NAN as float expressions in simplestl mode
- Fix cstep-unaware indexing for 3D/4D output tensors: use actual cstep
  for channel offset instead of assuming contiguous w*h layout
- Convert #pragma omp parallel + inner #pragma omp for to #pragma omp
  parallel for to avoid __kmpc_barrier in simpleomp mode
- Fix copyright year 2026->2025
- Apply code-format whitespace cleanup
- pass_level2/torch_topk.cpp: capture k/dim/largest/sorted as parameters
  (prim::Constant) instead of tensor inputs, enabling ncnn pass matching
- pass_level2/torch_gather.cpp: restore original pattern (dim as tensor)
- pass_ncnn/TopK.cpp: match torch.topk with captured parameters and
  convert to ncnn TopK layer (axis, largest, sorted)
- pass_ncnn/torch_gather.cpp (NEW): match torch.gather with 2 inputs
  (input, index) and captured dim parameter, convert to ncnn Gather layer
- src/layer/gather.{h,cpp} (NEW): implement Gather ncnn operator
  supporting 1D/2D/3D tensors with arbitrary axis
- PNNX CMakeLists fixes:
  - per-target Torch include dirs to avoid protobuf header conflicts
  - Abseil linking for Homebrew protobuf 34.x
  - disable onnxruntime auto-detection (protobuf conflict)
  - directory-level INCLUDE_DIRECTORIES_BEFORE for protobuf headers

Verified: YOLOv10n converts with 2 TopK + 2 Gather layers, only
cosmetic ops (Tensor.to, pnnx.Expression) ignored.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- src/layer/cast.{h,cpp}: extend Cast layer with int64 (type 5) and
  int32 (type 6) support, adding conversions int64↔float32 and
  int32↔float32
- pass_ncnn/tensor_to.cpp (NEW): convert Tensor.to (dtype cast) to
  ncnn Cast layer, mapping torch dtype strings to ncnn type codes
- CMakeLists.txt: register tensor_to.cpp in pass_ncnn sources

Verified: YOLOv10n Tensor.to (i64→f32) now converts to Cast layer
instead of being ignored. Only cosmetic ops (pnnx.Expression) remain.

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
…rformance

- topk: output int32 indices instead of float (fixes Gather compatibility)
- pnnx/TopK: convert PyTorch-style axis to ncnn-internal ordering (shape[0]=w)
- expand: rewrite with OMP 3-level loop, fix total() cstep-padding bug, drop NEON
- gatherelements: add OMP parallelism and READ_IDX/CLAMP_IDX macros
- tests/CMakeLists: fix WITH_LAYER_* variable case (uppercase→lowercase)
- test_expand, test_mod: rewrite as value-checking testutil.h tests
- test_topk: update index reading from float* to int* after topk change

End-to-end verified: pnnx TopK+Gather model produces [0.9,0.8,0.7,0.5,0.4]
matching PyTorch reference. 167/167 tests pass.
@vlordier vlordier changed the title Add GatherElements, Expand, Tile, and Mod operators with ARM NEON optimizations feat: add TopK, Gather, GatherElements, Expand, Tile, and Mod operators Apr 16, 2026
@vlordier

Copy link
Copy Markdown
Author

All Copilot review comments have been addressed in the latest commits. Summary of fixes:

Comment Status
Expand not in ncnn_add_layer ✅ Fixed — registered at line 108 of src/CMakeLists.txt
Gather/GatherElements output always 3D ✅ Fixed — top_blob.create() overload matches index_blob.dims
Index tensor read as const int* ignoring int64 ✅ Fixed — READ_IDX macro branches on idx_elemsize (4 vs 8)
GatherElements axis dimension order wrong ✅ Fixed — PyTorch-style axis ordering (axis=0=outermost)
expand.cpp shape_blob read as const int* (no int64) ✅ Fixed — reads via shape_elemsize check
expand.cpp no broadcast validation ✅ Fixed — validates each dimension, rejects invalid shapes
torch_topk.cpp header counts 12 7 ✅ Fixed — now 7 7 matching actual operators
ir.cpp TopK emits numeric param ids (0=...) ✅ Fixed — emits named args (k=, axis=, largest=, sorted=)
tensor_to.cpp type_from=0 unconditional ✅ Already correct — looks up pnnx_to_ncnn_cast_type[in_pnnx_type]
pnnx CMakeLists forces onnxruntime_FOUND=FALSE ✅ Fixed — PNNX_DISABLE_ONNXRUNTIME opt-in option (default OFF)
No tests for Gather/GatherElements/Expand ✅ Fixed — test_gather.cpp, test_gatherelements.cpp, test_expand.cpp, test_mod.cpp added
TopK pass never writes k ✅ Fixed — op->params["3"] = k_val
torch_gather.cpp ignores batch_index ✅ Fixed — applies axis > batch_index ? axis-1 : axis
New (this session): TopK pnnx axis wrong for ncnn-internal ordering ✅ Fixed — new_axis = (ncnn_ndim-1) - new_axis
New (this session): TopK outputs float indices, breaks Gather ✅ Fixed — TopK now outputs int32 indices

End-to-end verified: torch.topk + torch.gather TorchScript model converted by pnnx and executed by ncnn produces [0.9, 0.8, 0.7, 0.5, 0.4] matching PyTorch reference exactly.

All 167 ncnn tests pass.

@vlordier

Copy link
Copy Markdown
Author

End-to-end YOLOv10n conversion verified ✅

Converted YOLOv10n (ultralytics 8.4.37) to ncnn using pnnx with the operators from this PR:

python3 -c "
from ultralytics import YOLO
YOLO('yolov10n.pt').export(format='torchscript', imgsz=320)
"
pnnx yolov10n.torchscript inputshape=[1,3,320,320]f32

Result: 268 layers, 0 ignored ops, with TopK and Gather correctly lowered:

TopK    topk_162    1 2 296 297 298  0=-1 1=1 2=1 3=300
Gather  gather_165  2 1 294 302 303  0=0
TopK    topk_163    1 2 304 305 306  0=-1 1=1 2=1 3=300
Gather  gather_166  2 1 292 317 318  0=0

Before this PR, both torch.topk and torch.gather appeared as ignore in the conversion output. Now they produce correct ncnn layers with proper axis and k parameters.

Environment: macOS (Apple M4 Pro), torch 2.8.0, ultralytics 8.4.37

@vlordier

Copy link
Copy Markdown
Author

YOLO26n conversion verified ✅

Also converted YOLO26n (the original motivation for this PR) with identical success:

python3 -c "
from ultralytics import YOLO
YOLO('yolo26n.pt').export(format='torchscript', imgsz=320)
"
pnnx yolo26n.torchscript inputshape=[1,3,320,320]f32

Result: 337 layers, 0 ignored ops, TopK and Gather fully lowered:

TopK    topk_205    1 2 380 381 382  0=-1 1=1 2=1 3=300
Gather  gather_208  2 1 378 386 387  0=0
TopK    topk_206    1 2 388 389 390  0=-1 1=1 2=1 3=300
Gather  gather_209  2 1 376 401 402  0=0

Both YOLOv10n (268 layers) and YOLO26n (337 layers) convert cleanly end-to-end.

Environment: macOS (Apple M4 Pro), torch 2.8.0, ultralytics 8.4.37

vlordier and others added 23 commits April 16, 2026 17:12
- Build and test all 5 new layers: topk, gather, gatherelements, expand, mod
- Replace direct ./tests/test_xxx with ctest --output-on-failure -R pattern
- Remove stale fix-pnnx-onnx-topk-support push trigger (PR closed)
- Add feature/yolo26-support to push triggers
- Rename pnnx-onnx-topk job to pnnx-onnx-ops, add test_onnx_torch_gather
Replace total()-based flat iteration in test_gatherelements check_equal
with explicit c/h/w loops indexed via cstep, avoiding comparisons of
uninitialized SIMD padding bytes that caused failures on Linux.

Anchor ctest regex alternatives with $ to prevent test_expand from
matching the pre-existing test_expanddims target (not a build target).
Replace total()-based flat comparison with explicit c/h/w loops indexed
via cstep, matching the fix already applied to test_gatherelements.
Remove <cmath> include (not available in SIMPLESTL mode) and use ::fmod
instead of std::fmod to call the global function from platform.h,
bypassing the class member named fmod.
std::max and std::vector are provided by simplestl.h (via platform.h)
in SIMPLESTL mode; <algorithm> is not available in that environment.
Pre-existing ncnn x86 layers (batchnorm, bnll, convolution) conflict
with simplemath.h declarations; our new layers are SIMPLESTL-compatible
but we cannot fix the upstream conflict in this PR.
- mod.cpp: replace total()-based flat loops with explicit c/h/w loops
  using cstep to avoid reading/writing alignment padding bytes
- test_mod.cpp: same fix for reference loops and b-zeroing pass
- topk.cpp: dispatch k_blob read on elemsize (int32/int64) instead of
  casting raw bytes as float
- TopK.cpp: extract shared write_topk_params() helper to eliminate
  ~80 lines of duplication between torch_topk and torch_topk_0
- CI: remove fork-specific push branch triggers; drop simplestl-simplemath
  job (pre-existing libncnn conflict unrelated to this PR)
Delete header-only stubs (expand_arm.h, tile_arm.h), pure delegation
shims (gatherelements_arm.*), buggy NEON files (mod_arm.*), and
broken Vulkan TODO stubs (gatherelements_vulkan.*, mod_vulkan.*)
along with placeholder shader SPVs. ncnn_add_layer auto-discovers
these files, so leaving them in caused them to be compiled in silently.
- Expand: Add ARM NEON vectorized path for broadcasting scalar values
- TopK tests: Refactor test helper, add NaN, tie-breaking, k=0, k-clamp tests

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- topk.cpp: Don't break early on NaN detection; continue processing
  remaining elements and fall through to NaN-aware fallback for proper
  tie-breaking (fixes potential missed elements after NaN)
- gather.cpp: Remove unused READ_IDX macro (dead code)
- expand.cpp: Add comment explaining NEON unroll factor (16 = 4×4 floats)

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- topk.cpp: Replace broken inline NaN detection with pre-scan approach
  - Pre-scan entire input for NaN before NEON optimization
  - If NaN found, fall through to NaN-aware scalar path
  - This avoids corrupting NEON registers with NaN values
  - Cleaner and safer than trying to handle NaN mid-computation
- gather.cpp: Remove orphaned #undef READ_IDX (cleanup)

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants