
Runtime TXN generation#3002

Draft
jgmelber wants to merge 69 commits into
mainfrom
dynamic-runtime-sequences

Conversation

@jgmelber

@jgmelber jgmelber commented Mar 26, 2026

Collaborator

Summary

Add dynamic runtime TXN generation: compile a single XCLBIN once, then generate NPU instruction streams at runtime for arbitrary problem sizes. The compiler emits a standalone C++ function that builds TXN binaries parameterized by runtime values (e.g., matrix dimensions M, K, N).

Core infrastructure

  • TxnEncoding.h — Header-only runtime library for encoding NPU TXN instructions. Zero MLIR/LLVM dependencies; shared between compiler and generated host code.

  • ConvertAIEXToEmitC pass — Lowers AIEX runtime sequence ops (npu.write32, npu.sync, npu.address_patch, npu.blockwrite) to EmitC dialect, producing compilable C++ via translateToCpp. Includes blockwrite fusion: dynamic BD word overrides (write32s tagged with bd_group) are folded into a single txn_append_blockwrite call, producing identical TXN binary to the static path.

  • --aie-generate-txn-cpp flag — New aiecc option that generates a C++ header alongside the XCLBIN containing generate_txn_sequence(...) parameterized by runtime values.

  • BdLowering.{h,cpp} — Shared utility for emitting hardware BD register encodings as SSA arith chains. Used by both AIEDmaToNpu and AIEDMATasksToNPU dynamic paths. Includes d0_stride underflow guard and compile-time warning for 10-bit d0_size overflow.
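
To make the idea of a dependency-free encoder concrete, here is a minimal sketch of appending one write32 to an instruction vector. The 6-word size, opcode 0, the address in word 2, and the byte size in word 5 follow the layout described in the `extract_constants` review snippet later in this thread. The value slot in word 4, the zero padding words, and the name `sketch::append_write32` are assumptions for illustration, not the actual `TxnEncoding.h` API.

```cpp
#include <cstdint>
#include <vector>

namespace sketch {

// Opcode 0 = write32, as stated in the extract_constants comment in this PR.
constexpr uint32_t kOpcWrite32 = 0;
constexpr uint32_t kWrite32Words = 6;

// Appends one write32 op to the instruction vector.
// Assumed layout: [opcode, 0, address, 0, value, size_in_bytes].
inline void append_write32(std::vector<uint32_t> &txn, uint32_t address,
                           uint32_t value) {
  txn.push_back(kOpcWrite32);
  txn.push_back(0);       // word 1: padding (assumption)
  txn.push_back(address); // word 2: register address
  txn.push_back(0);       // word 3: padding (assumption)
  txn.push_back(value);   // word 4: value to write (assumption)
  // word 5: size of this op in bytes
  txn.push_back(static_cast<uint32_t>(kWrite32Words * sizeof(uint32_t)));
}

} // namespace sketch
```

Because the function touches only `std::vector<uint32_t>`, the same code could be compiled into both the compiler and the generated host code, which is the property the bullet above relies on.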

Dynamic DMA support

  • npu.dma_memcpy_nd — Extended to support SSA-parameterized sizes, strides, and offsets via emitDynamicHwBdEncoding().

  • aie.dma_bd — New dyn_sizes, dyn_strides, dyn_offset, dyn_len operands for runtime-parameterized buffer descriptors within aiex.dma_configure_task.

  • npu.write_rtp — Supports dynamic SSA values via dyn_value operand. Single implementation in aie.py, delegated from aiex.py.

  • RuntimeSequenceOp — Removed IsolatedFromAbove to allow referencing parent DeviceOp values. SCF-to-CF conversion scoped to aie.core ops only.

IRON Python support

  • RuntimeScalar type for scalar runtime parameters
  • write_rtp() for runtime-tunable parameters
  • BD ID allocator for dynamic DMA tasks
  • Direct npu_dma_memcpy_nd emission path

Test coverage

  • FileCheck tests for EmitC conversion (basic, dynamic values, unsupported ops)
  • Dynamic BD encoding tests with numeric arith chain verification
  • MemTile rejection negative test
  • Python round-trip and lowering tests (dma_tasks_dynamic.py)
  • End-to-end aiecc pipeline test (cpp_dynamic_txn.mlir)
  • Static-vs-dynamic TXN binary equivalence test

Test plan

  • test/python/ — 74 pass, 3 unsupported (expected)
  • test/dialect/AIEX/ — 21 pass
  • test/dialect/AIE/ — 110 pass, 1 xfail (expected)
  • test/Conversion/DmaToNpu/ — 16 pass
  • test/Conversion/AIEXToEmitC/ — 3 pass
  • Hardware: 5 shapes x 2 variants = 10/10 pass on Ryzen AI NPU
  • CI: Full build matrix + Ryzen AI hardware tests

jgmelber and others added 15 commits March 26, 2026 08:11
Introduce new operations that accept SSA values instead of static
attributes, enabling runtime parameterization of NPU sequences:

- aiex.npu.dyn_write32: Dynamic write with SSA address and value
- aiex.npu.dyn_maskwrite32: Dynamic masked write with SSA operands
- aiex.npu.dyn_dma_memcpy_nd: Fully dynamic N-D DMA with SSA sizes/strides
- aiex.npu.dyn_sync: Dynamic synchronization with SSA tile/channel

These operations can be lowered to templated C++ code for runtime
transaction generation, allowing a single compiled artifact to support
multiple problem sizes determined at runtime.

Added verification to ensure SSA operands have correct types (index or
signless integers). Maximum 4 dimensions enforced for DMA operations
to match hardware constraints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements the dynamic runtime sequences infrastructure:

Phase 1: Extract TXN instruction encoding from AIETargetNPU.cpp into
include/aie/Runtime/TxnEncoding.h - a header-only C++ library with
zero MLIR/LLVM dependencies. Refactor AIETargetNPU.cpp to use it.

Phase 2: Add ConvertAIEXToEmitC pass that lowers AIEX runtime sequence
ops (both static npu.write32/blockwrite/sync/address_patch and dynamic
npu.dyn_write32/dyn_maskwrite32/dyn_sync) plus SCF/arith ops into
EmitC dialect. The EmitC IR calls aie_runtime::txn_append_* functions.

Phase 3: Wire aie-translate --aie-generate-txn-cpp translation that
runs the NPU lowering pipeline then the EmitC pass, producing
compilable C++ that generates TXN binaries at runtime.

Phase 4: Add test_dynamic.cpp example that uses the generated C++
instead of loading insts.bin. Verified bit-for-bit identical TXN
output and PASS on NPU hardware.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extract the NPU lowering pipeline (7 passes) into a shared
populateNpuLoweringPipeline() function used by both aiecc and
aie-translate, eliminating the duplicated pass list.

Unify the host test executable: test_dynamic.cpp is deleted and
test.cpp gains a USE_DYNAMIC_TXN compile flag that sets a
generate_instr callback on the args struct in xrt_test_wrapper.h.
Both static (insts.bin) and dynamic (generated TXN) paths now
share the same XRT setup, buffer management, verification, and
timing infrastructure.

Add --aie-generate-txn-cpp and --txn-cpp-name flags to aiecc so
C++ TXN generation is accessible through the compiler driver
alongside --aie-generate-npu-insts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add test_dynamic_size.mlir using aiex.npu.dyn_write32 with BLOCKWRITE
for BD configuration (required for address_patch compatibility) and
dynamic buffer_length as a runtime parameter.

Add --dynamic-size flag to test executable. When set, the host passes
the transfer size to generate_txn_sequence() at runtime instead of
using a pre-compiled instruction binary. The core loops over
fixed-size ObjectFIFO tiles, so any multiple of the tile size works.

Demonstrated: XCLBIN compiled once at 1024-byte tile size, single
executable runs correctly at 1024, 2048, 3072, and 4096 bytes with
zero recompilation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend NpuWrite32Op, NpuMaskWrite32Op, and NpuSyncOp with optional
SSA operands (dyn_address, dyn_value, etc.) so a single op handles
both compile-time constant and runtime-parameterized forms. Delete
the 4 separate Dyn ops (NpuDynWrite32Op, NpuDynMaskWrite32Op,
NpuDynSyncOp, NpuDynDmaMemcpyNdOp) that previously duplicated them.

- Add AttrSizedOperandSegments trait and custom parse/print/verify
- Add custom builders preserving existing call-site signatures
- Merge EmitC conversion handlers (static vs dynamic dispatch)
- Add error guards in NPU binary translation for dynamic operands
- Add Python wrappers: npu_write32_dynamic, npu_maskwrite32_dynamic,
  npu_sync_dynamic
- Replace hand-written test_dynamic_size.mlir with Python design file
  (passthrough_kernel_dynamic.py)

100% backward compatible: all 294 existing LIT tests pass unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…amic Worker

Extend IRON APIs to support dynamic (RTP-based) loop bounds:
- Worker: add dynamic_objfifo_lowering parameter
- Runtime: extend sequence() for mixed array/scalar types, add write_rtp()
- RuntimeScalar: new class for scalar runtime sequence parameters
- RtpWriteTask: new task class wrapping npu_rtp_write
- single_core_iron_dynamic.py: IRON-level dynamic GEMM example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… and LIT tests

Extract the 2 dynamic designs (low-level, IRON) and their test harness from
single_core/ into a new single_core_dynamic/ directory. Add a new placed
dynamic variant using shim_dma_single_bd_task. Create LIT tests for all 3
variants and a passthrough_kernel dynamic LIT test.

Fix dynamic_gemm_txn.h: add missing MASKWRITE before S2MM queue push
(required for XRT completion token) and restructure C output BDs to use
the batched pingpong pattern matching the static compiler.

Verified on NPU2 hardware: all 3 dynamic sizes (32x32x32, 64x64x64,
128x128x128) PASS with a single XCLBIN for both low-level and placed
variants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dynamic TXN generator now parses the static instruction stream
(generated alongside the XCLBIN) to discover the RTP buffer address
and S2MM control register values. This makes the dynamic test harness
work with all 3 design variants (low-level, placed, IRON) since each
may place the RTP buffer at a different address.

Previously the IRON variant failed because its buffer allocator placed
the RTP at 0x204d00 while the code hardcoded 0x200600.

All 3 variants now PASS on NPU2 at 32x32x32, 64x64x64, 128x128x128.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Enable auto-generated C++ TXN code from MLIR runtime sequences with
SSA parameters, allowing a single XCLBIN to run matrix multiplications
at any M/K/N (multiples of 32) determined at runtime.

Key changes:
- Add IsolatedFromAbove trait to RuntimeSequenceOp, preventing SCF-to-CF
  from entering runtime sequences while still lowering core bodies.
  Also prevents constant hoisting across the isolation boundary.
- Extend DmaToNpuPattern with dynamic code path: when sizes/strides are
  SSA values, compute BD words via arith ops and emit npu.write32.
  Fixes bf16 d0_size (multiply-first-then-divide), stride underflow
  guards (size>1 check), and repeat_count off-by-one.
- Extend EmitC conversion for scf.for with iter_args (VariableOp +
  LoadOp + AssignOp pattern), scf.if with results, and new arith ops
  (TruncI, ExtUI, ExtSI, MinSI, MaxSI). Add pre-scan for values
  hoisted outside runtime_sequence and cross-reference fixup pass.
- Add dyn_arg_plus to NpuAddressPatchOp and dyn_value to NpuWriteRTPOp
  for runtime-parameterized buffer offsets and RTP writes.
- Scope SCF-to-CF in aiecc via markOpRecursivelyLegal on
  RuntimeSequenceOp, and disable cross-region constant CSE in
  AIEVectorTransferLoweringPass.
- Add unified aiecc compilation (--aie-generate-xclbin +
  --aie-generate-txn-cpp) producing both XCLBIN and C++ TXN from the
  same MLIR with identical buffer addresses.

Verified on NPU Strix Halo: 32x32x32, 64x64x64, 64x32x64, 96x96x96,
128x64x128, 128x128x128 all PASS against reference matmul.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adapt to upstream API changes and fix IsolatedFromAbove interaction
with MLIR's dialect conversion infrastructure:

- Move convert-vector-to-aievec from resource allocation pipeline to
  per-core LLVM lowering, preventing vectorization of scalar arith ops
  (e.g. arith.minsi → aievec.min) inside runtime_sequence
- Walk RuntimeSequenceOps explicitly in DmaToNpu, DMATasksToNPU,
  LowerSetLock, SubstituteShimDMA, since applyPartialConversion no
  longer descends into IsolatedFromAbove regions in newer LLVM
- Skip materialize pass in AIETranslateToCppTxn (runtime_sequence is
  already in final form)
- Add type casts in EmitC yield handler for mixed i32/opaque types
- Disable constant CSE in AIEMaterializeBDChains
- Update Python API: link_with on external_func, TraceShimRouting enum
- Add hasVerifier to RunOp, dyn_arg_plus to AIEInsertTraceFlows

All 6 GEMM sizes verified on NPU Strix Halo after rebase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix getAsValue to zero-extend narrow values (was truncate-only)
- Fix yieldTargets ArrayRef invalidation by using stack directly
- Fail on unsupported ops in EmitC instead of silently emitting comments
- Fix 0x80000000u token bit cast to int32_t
- Remove dead preSCFModule global variable
- Add IsolatedFromAbove negative test for RuntimeSequenceOp
- Add bf16 d0_stride hardware constraint comment
- Remove dead vectorized variable, name ROWS_PER_BLOCK constant
- Deduplicate trace event list into module-level constant
- Fix npu_time_min initialization to numeric_limits::max

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Stale submodule pointer from before rebase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AIETargetCppTxn.cpp and AIENpuLowering.cpp link AIEXTransforms, which
uses BdIdGenerator from AIETransforms. Without this transitive
dependency, static Release builds fail with undefined references to
BdIdGenerator::nextBdId etc. in AIEAssignRuntimeSequenceBDIDs.cpp.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jgmelber jgmelber force-pushed the dynamic-runtime-sequences branch from 2c839d4 to 93acc43 Compare March 26, 2026 19:36
IsolatedFromAbove broke 62 existing tests that reference device-scope
values (tiles, locks) from inside runtime_sequence. Instead, protect
against constant hoisting by stripping runtime_sequences from LLVM
lowering clones (where convert-vector-to-aievec's canonicalizer was
the source of the hoisting). The markOpRecursivelyLegal SCF→CF scoping
and enableConstantCSE(false) in AIEVectorTransferLowering remain as
the primary guards.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jgmelber jgmelber force-pushed the dynamic-runtime-sequences branch from 47bb7b4 to c97ce07 Compare March 26, 2026 22:06
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jgmelber jgmelber force-pushed the dynamic-runtime-sequences branch from 3e32ea1 to 5286ec4 Compare March 26, 2026 22:24
@github-actions

github-actions Bot commented Mar 26, 2026

Contributor

Coverage Report

Created: 2026-06-05 21:44


Filename | Function Coverage | Line Coverage | Region Coverage | Branch Coverage
Conversion/AIEToConfiguration/AIEToConfiguration.cpp | 91.30% | 67.58% | 61.27% | 45.19%
Conversion/AIEXToEmitC/AIEXToEmitC.cpp | 68.00% | 48.05% | 46.27% | 36.32%
Dialect/AIE/IR/AIEDialect.cpp | 91.14% | 86.86% | 87.99% | 79.44%
Dialect/AIE/Transforms/AIEInsertTraceFlows.cpp | 78.95% | 85.62% | 82.87% | 74.84%
Dialect/AIE/Transforms/AIEVectorTransferLowering.cpp | 83.33% | 79.17% | 72.73% | 50.00%
Dialect/AIEX/IR/AIEXDialect.cpp | 94.52% | 78.08% | 78.46% | 64.34%
Dialect/AIEX/Transforms/AIEDMATasksToNPU.cpp | 97.22% | 85.55% | 86.24% | 73.67%
Dialect/AIEX/Transforms/AIEDmaToNpu.cpp | 100.00% | 80.89% | 73.73% | 50.44%
Dialect/AIEX/Transforms/AIELowerSetLock.cpp | 100.00% | 82.35% | 80.00% | 50.00%
Dialect/AIEX/Transforms/AIEMaterializeBDChains.cpp | 100.00% | 84.71% | 80.00% | 57.14%
Dialect/AIEX/Utils/BdLowering.cpp | 100.00% | 96.13% | 94.44% | 84.62%
Totals | 91.35% | 80.35% | 79.60% | 68.28%
Generated by llvm-cov -- llvm version 18.1.3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jgmelber jgmelber force-pushed the dynamic-runtime-sequences branch from 53af3d9 to 934d90b Compare March 26, 2026 22:29
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@andrej andrej left a comment

Collaborator

Nice to see this starting to take shape. The biggest question is if we want to deprecate the attributes; I'd be in favor of it, although it would mean touching potentially a lot of tests (but the actual code would be smaller).

The GEMM test unfortunately doesn't seem to use the added infrastructure. I think we are already on the same page, but just to make sure, what I'm envisioning looks more like this:

User writes a single runtime sequence in MLIR/Python (pseudocode):

aiex.runtime_sequence @my_sequence(%A: memref, %B: memref, %C: memref, %param_M: int, %param_K: int, %param_N: int) {
  ...
  aie.dma_memcpy_nd(...)
  ...
}

User calls compiler roughly like so:

aie-opt --aie-to-cpp aie.mlir -o my_runtime_sequence.h

which produces something like this using emitC (important -- this is compiler generated from the above MLIR, not manually written like in the GEMM test):

#include <txn_encoding.h>
std::vector<uint32_t> my_sequence(void *A, void *B, void *C, int param_M, int param_K, int param_N) {
    std::vector<uint32_t> txn;
    aie_runtime::txn_append_write32(txn, param_M, ...);
    ...
    return txn;
}

and then can use that generated file in their test.cpp like so:

#include "my_runtime_sequence.h"

int main(){
   // setup XRT
   xrt::kernel my_kernel = // get out of xclbin
   std::vector<uint32_t> insts = my_sequence( my params ... );
   my_kernel(insts, a, b, c);
}

So ideally the GEMM test's test.cpp at the end of this would not look significantly more complicated than the existing ones do.

Again, cool to see this taking shape!

// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// (c) Copyright 2025 Advanced Micro Devices, Inc.

Collaborator

2026

Comment on lines +751 to 764
OptionalAttr<I32Attr>:$value,
Optional<I32>:$dyn_value
);
let results = (outs );
let assemblyFormat = [{ `(` $buffer `,` $index `,` $value `)` attr-dict
}];
let hasCustomAssemblyFormat = 1;
let hasVerifier = 1;
let description = [{
rtp write operator
rtp write operator.
When `dyn_value` is provided, it supplies the RTP value at runtime
instead of the static `value` attribute.
}];
let extraClassDeclaration = [{
bool hasDynamicValue() { return getDynValue() != nullptr; }
}];

Collaborator

I'm a little worried about code bloat with having every parameter for these ops duplicated, once as an attribute and once as an SSA value, along with the added custom verifier and assembly format for each op.

Could we consider removing the attributes altogether and instead use SSA values, with arith.constant for the static case? All existing lowerings can get the value from arith.constant and throw an error if it's not a constant, this emitC pass can use the actual SSA values. This approach would of course touch a lot of code (all examples etc. that use these ops with attributes would have to be rewritten to use arith.constant), but I think AI could handle it. I think it would be cleaner and might remove the need for customAssemblyFormat and hasVerifier for every op (haven't gotten to those yet but assume they're there because of this).

Collaborator Author

This is fair, and I agree, though I first want to get this working without impacting the current flow. For now, it is entirely optional. Later I'd like to debate and decide whether it should remain a separate path or transition to SSA values with constants in the static case.

Comment thread include/aie/Dialect/AIEX/IR/AIEX.td Outdated
Optionally, SSA values can be provided for 'dyn_address', 'dyn_value', and
'dyn_mask' to enable runtime-parameterized sequences.

Static syntax (unchanged): `aiex.npu.maskwrite32 {address = 123 : ui32, ...}`

Collaborator

Suggest removing "(unchanged)" from these comments

Comment thread include/aie/Runtime/TxnEncoding.h Outdated
Comment on lines +1 to +2
//===- TxnEncoding.h - Standalone TXN instruction encoding -------*- C++
//-*-===//

Collaborator

suggest formatting onto a single line

Comment on lines +109 to 120
// Use encoding library for the core format, then fix up col/row field.
aie_runtime::txn_append_blockwrite(instructions, *address, payload.data(),
payload.size());

// XAIE_IO_BLOCKWRITE
words[0] = XAIE_IO_BLOCKWRITE;
words[2] = op.getAddress();
// The encoding library leaves word[1] as 0. If col/row are present, set it.
auto col = op.getColumn();
auto row = op.getRow();
if (col && row) {
words[1] = (*col & 0xff) | ((*row & 0xff) << 8);
// word[1] is at position (current_size - headerSize - count + 1)
size_t headerPos = instructions.size() - 4 - payload.size();
instructions[headerPos + 1] = (*col & 0xff) | ((*row & 0xff) << 8);
}
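
For readers following the fix-up logic in this snippet, the sketch below isolates the two steps: packing column and row into one header word, and locating word 1 of the most recently appended blockwrite. The bit layout and the 4-word header are taken from the diff above. The function names and the explicit size check are assumptions added for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs column into bits 0-7 and row into bits 8-15, mirroring the
// expression used in the diff above.
inline uint32_t pack_col_row(uint32_t col, uint32_t row) {
  return (col & 0xffu) | ((row & 0xffu) << 8);
}

// Patches word 1 of the blockwrite that was appended last.
// Assumes a 4-word header followed by payload_words data words.
// Returns false when the vector is too short to contain such an op.
inline bool patch_last_blockwrite(std::vector<uint32_t> &instructions,
                                  std::size_t payload_words, uint32_t col,
                                  uint32_t row) {
  constexpr std::size_t kHeaderWords = 4;
  if (instructions.size() < kHeaderWords + payload_words)
    return false; // avoids unsigned underflow in the subtraction below
  const std::size_t header_pos =
      instructions.size() - kHeaderWords - payload_words;
  instructions[header_pos + 1] = pack_col_row(col, row);
  return true;
}
```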

Collaborator

Why doesn't the encoding library aie_runtime::txn_append_blockwrite take row/col?

Collaborator

I think this file could benefit from some code deduplication and cleanup. If NpuWriteBd op were changed to also accept SSA values, maybe there would be less of a need for separate code paths here. I'd like to avoid having to make every change in two places (dynamic and static path) for future changes to these ops.

GreedyRewriteConfig rewriter_config = GreedyRewriteConfig();
rewriter_config.setRegionSimplificationLevel(
GreedySimplifyRegionLevel::Disabled);
rewriter_config.enableConstantCSE(false);

Collaborator

Why is this needed?

Comment thread lib/Targets/AIETargetCppTxn.cpp Outdated
Comment on lines +1 to +2
//===- AIETargetCppTxn.cpp - EmitC-based C++ TXN translation ------*- C++
//-*-===//

Collaborator

format

Comment on lines +68 to +127
/// Extract design-specific constants from the static instruction stream.
///
/// The static instructions always begin with:
/// [4-word TXN header]
/// [6-word write32: RTP write 0 — rtp_addr = words[header+2]]
/// [6-word write32: RTP write 1]
/// ... then DMA configuration including a maskwrite before S2MM push ...
///
/// We scan for the first write32 (opcode 0) to get the RTP address, and
/// for the first maskwrite32 (opcode 3) to get the S2MM control register
/// address and its value/mask.
inline DesignConstants extract_constants(const std::vector<uint32_t> &insts) {
DesignConstants c{};
constexpr uint32_t HEADER_SIZE = 4;

bool found_rtp = false, found_s2mm = false;
size_t i = HEADER_SIZE;
while (i < insts.size() && (!found_rtp || !found_s2mm)) {
uint32_t opcode = insts[i];

if (opcode == aie_runtime::TXN_OPC_WRITE && !found_rtp) {
// First write32: RTP address is at word [i+2]
c.rtp_addr = insts[i + 2];
found_rtp = true;
i += 6;
} else if (opcode == aie_runtime::TXN_OPC_MASKWRITE && !found_s2mm) {
// First maskwrite: S2MM control register
c.s2mm_ctrl = insts[i + 2];
c.s2mm_ctrl_val = insts[i + 4];
c.s2mm_ctrl_mask = insts[i + 5];
found_s2mm = true;
i += 7;
} else {
// Skip this op by reading its size field
uint32_t op_size_bytes = 0;
if (opcode == aie_runtime::TXN_OPC_WRITE)
op_size_bytes = insts[i + 5];
else if (opcode == aie_runtime::TXN_OPC_MASKWRITE)
op_size_bytes = insts[i + 6];
else if (opcode == aie_runtime::TXN_OPC_BLOCKWRITE)
op_size_bytes = insts[i + 3];
else if (opcode == aie_runtime::TXN_OPC_TCT)
op_size_bytes = insts[i + 1];
else if (opcode == aie_runtime::TXN_OPC_DDR_PATCH)
op_size_bytes = insts[i + 1];
else
break; // unknown opcode

i += op_size_bytes / sizeof(uint32_t);
}
}

if (!found_rtp)
throw std::runtime_error("Could not find RTP write in static instructions");
if (!found_s2mm)
throw std::runtime_error(
"Could not find S2MM maskwrite in static instructions");

return c;
}
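
The scan above walks the stream by reading each op's size field. A bounds-checked variant of that walk might look like the sketch below. It is deliberately limited to the two opcodes whose values are stated in the comment above (write32 = 0, maskwrite32 = 3) and uses the same size-field offsets as the snippet. The function names, the `kNotFound` sentinel, and the decision to stop on a zero or oversized size field are assumptions, not code from this PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kOpcWrite = 0;     // from the comment above
constexpr uint32_t kOpcMaskWrite = 3; // from the comment above
constexpr std::size_t kHeaderWords = 4;
constexpr std::size_t kNotFound = static_cast<std::size_t>(-1);

// Word offset of the size-in-bytes field for a known opcode, 0 if unknown.
inline std::size_t size_field_offset(uint32_t opcode) {
  if (opcode == kOpcWrite)
    return 5;
  if (opcode == kOpcMaskWrite)
    return 6;
  return 0;
}

// Returns the word index of the first op with opcode `wanted`, or kNotFound
// if the stream ends, is truncated, or contains an op this sketch cannot skip.
inline std::size_t find_first_op(const std::vector<uint32_t> &insts,
                                 uint32_t wanted) {
  std::size_t i = kHeaderWords;
  while (i < insts.size()) {
    const uint32_t opcode = insts[i];
    const std::size_t off = size_field_offset(opcode);
    if (off == 0 || i + off >= insts.size())
      return kNotFound; // unknown opcode or size field out of range
    const std::size_t words = insts[i + off] / sizeof(uint32_t);
    if (words == 0 || i + words > insts.size())
      return kNotFound; // a zero size would never advance; oversized is truncated
    if (opcode == wanted)
      return i;
    i += words;
  }
  return kNotFound;
}
```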

Collaborator

Might it make sense to add a compiler option to export some of these "magic values" at compile time, into say a JSON that could be ingested at runtime? Some of them, like the controller ID, probably should also just get baked into the code generated by emitC.

Collaborator

This doesn't seem to use the code generated from the emitC, but instead reconstructs the instruction sequence manually using the encoding library.

jgmelber and others added 5 commits May 5, 2026 15:04
Remove dynamic_gemm_txn.h and the #ifdef USE_GENERATED_TXN paths from
test_dynamic.cpp and Makefile — the auto-generated C++ TXN path is now
the only path. Add compare_txn.cpp to tracking, remove AI agent
artifacts (AGENTS.md, .codex/), clean build artifacts, and stage all
previously unstaged working-copy changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- C1: Add null check for blockwrite data in non-fused EmitC path
- C2: Propagate errors from NPU binary translator (void → LogicalResult)
- C3: Fix syntax error in test_dynamic.cpp option chain
- C4: Add set-once guard to RuntimeScalar.op setter
- C5: Fix BD ID aliasing for unplaced tiles (id(tile) → stable key)
- M2: Guard repeatCount underflow when sizes[3] == 0 in dynamic DMA
- M5: Restrict dynamic operand verifiers to 32-bit integers only
- M9: Fix CMake variable syntax in AIEXToEmitC CMakeLists.txt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- M1: Replace O(n) txn_prepend_header insert with txn_init + in-place overwrite
- M4: Add explicit error diagnostics for unhandled ops in EmitC pass
- M8: Document emit_free() no-op for direct NPU DMA tasks
- M10: Raise TypeError on unsupported sequence argument types
- m1: Remove redundant static keyword in anonymous namespace
- m2: Add comments explaining blockwrite fusion op consumption
- m3: Move raw_string_ostream outside loop in EmitC blockwrite emission
- m4: Make EmitC pass a no-op when no runtime sequences exist
- m5: Document all address_patch word fields in TxnEncoding.h
- m6: Fix arg_plus type to uint32_t for consistency
- m10: Replace std::distance with early-exit loops in DMATasksToNPU
- m14: Move __task_group_index to instance variable
- m15: Use deque for O(1) BD ID allocation
- m19: Fix "prohibitted" typo

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
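
As an illustration of the M1 item above, the sketch below contrasts with a front insert by reserving the header words first and overwriting them once the body is complete, so no existing words are shifted. The 4-word header size matches the layout described elsewhere in this PR. The `_sketch` function names and the choice of which header words hold the op count and byte size are assumptions, not the real `txn_init` signature.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kHeaderWords = 4;

// Reserves the header up front so later appends never move existing words.
inline std::vector<uint32_t> txn_init_sketch() {
  return std::vector<uint32_t>(kHeaderWords, 0u);
}

// Overwrites the reserved header in place once the op count is known.
// Placing the op count in word 2 and the total byte size in word 3 is an
// assumption made for this sketch.
inline bool txn_finalize_sketch(std::vector<uint32_t> &txn,
                                uint32_t op_count) {
  if (txn.size() < kHeaderWords)
    return false; // vector was not created by txn_init_sketch
  txn[2] = op_count;
  txn[3] = static_cast<uint32_t>(txn.size() * sizeof(uint32_t));
  return true;
}
```

Inserting at the front of a `std::vector` moves every element already stored, which is the O(n) cost the commit removes. Writing into reserved slots is constant time.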
- M6: Document RuntimeSequenceOp non-IsolatedFromAbove contract
- m7: Remove redundant module clone in generateCppTxnCode
- m8: Use emitc::FuncOp with inline specifier to prevent ODR violations
- m9: Fix formatString to replace all occurrences of {0}/{1}
- m12: Add tellg() and gcount() error checks in compare_txn.cpp
- m13: Validate M/K/N are positive before uint32_t cast
- m16: Raise RuntimeError on MLIR verification failure
- m17: Document DMATask offset/sizes/strides parameters
- m18: Remove unused _orig_npu_rtp_write saved reference
- m20: Replace std::optional<std::string> with std::string for cachedPeanoDir

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
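
The m13 item above can be sketched as a small host-side check that runs before any dimension is narrowed to `uint32_t`. The multiple-of-32 rule comes from the earlier commit message in this PR. The helper name `checked_dim`, the exception type, and the messages are hypothetical.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Validates one GEMM dimension and returns it as uint32_t.
// Rejects non-positive values, values that are not a multiple of `tile`,
// and values that do not fit in 32 bits.
inline uint32_t checked_dim(const char *name, long long value,
                            long long tile = 32) {
  if (value <= 0)
    throw std::invalid_argument(std::string(name) + " must be positive");
  if (value % tile != 0)
    throw std::invalid_argument(std::string(name) +
                                " must be a multiple of " +
                                std::to_string(tile));
  if (value >
      static_cast<long long>(std::numeric_limits<uint32_t>::max()))
    throw std::invalid_argument(std::string(name) + " does not fit in 32 bits");
  return static_cast<uint32_t>(value);
}
```

Checking the sign first matters because casting a negative `int` to `uint32_t` silently yields a very large value rather than an error.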
jgmelber and others added 22 commits May 7, 2026 17:23
Convert single_core_dynamic.py from npu_dma_memcpy_nd to
shim_dma_single_bd_task + dma_start_task, making it a minimal delta
from the static placed version. The Python stays clean using range_,
if_, arith.minsi — no raw InsertionPoint or builder patterns.

Compiler changes to support dma_task ops with dynamic SSA operands
nested inside scf.for/scf.if:

- Remove HasParent<RuntimeSequenceOp> from 5 dma_task ops in AIEX.td
  so they can be nested inside scf control flow within runtime_sequence

- AIEDMATasksToNPU: allow arith ops in BD blocks (needed to compute
  dyn_offset/dyn_len/dyn_sizes/dyn_strides), hoist them before task
  erasure to avoid dangling SSA refs

- AIEDMATasksToNPU: rewrite rewriteSingleBDDynamic to emit NpuWriteBdOp
  (blockwrite template) + selective NpuWrite32Op overrides for dynamic
  words only, matching the npu_dma_memcpy_nd lowering path exactly

- AIEDMATasksToNPU: DMAStartTaskOpPattern emits dynamic push_queue
  NpuWrite32Op when BD has dyn_sizes, computing repeat_count from
  outermost dimension at runtime (fixes 64x64x64+ hangs)

Verified on NPU Strix Halo: 32x32x32, 64x64x64, 96x96x96,
128x128x128 all PASS against reference matmul.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move link_with from @core positional arg to external_func kwargs
(matching updated API in single_core_placed.py). Add test.cpp include
shim so make run works for use_placed=1 and use_iron=1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace deprecated configure_packet_tracing_flow/configure_packet_tracing_aie2/
gen_trace_done_aie2 with configure_trace/start_trace matching single_core_placed.py.
Remove unused make_port_event helper.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The placed variant (single_core_dynamic_placed.py) now accepts M, K, N
as SSA i32 runtime_sequence inputs, matching the non-placed dynamic
variant.  This makes the compiled XCLBIN shape-agnostic and enables
--aie-generate-txn-cpp to produce a parameterizable C++ function.

Key changes:
- Replace TensorTiler2D (static-only) with explicit SSA sizes/strides
- Replace Python range() with range_() (scf.for) for tile_row_block loop
- Switch from dma_await_task/dma_free_task to npu_sync (compatible with
  dynamic loop bounds)
- rows_per_block changed from 2 to 4 to match non-placed variant

Verified on hardware: both placed and non-placed variants pass for
32x32x32, 64x64x64, 128x128x128, 64x128x64, 128x64x128 using a
single XCLBIN each.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
single_core_dynamic.py (unplaced):
  - npu_dma_memcpy_nd with explicit bd_id + dma_wait
  - Matches single_core.py and passthrough_dmas.py conventions

single_core_dynamic_placed.py (placed):
  - shim_dma_single_bd_task + dma_start_task + npu_sync
  - Auto-assigned BD IDs, matches passthrough_dmas_placed.py convention
  - Clean up core body: replace scf.while boilerplate with range_()

Both accept SSA M/K/N inputs for runtime parameterization and
produce C++ TXN via --aie-generate-txn-cpp. Verified on hardware
for 32x32x32, 64x64x64, 128x128x128, 64x128x64, 128x64x128.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Use T.memref(2, T.i32()) in placed variant to match unplaced (only 2
slots are used). Remove hardcoded address=0x600.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
CRITICAL-1: Guard d0_stride against underflow when stride0==0 and
elemWidth==addrGran. Previously SubIOp(0, 1) would wrap to 0xFFFFFFFF.
Now uses select(stride > 0, stride - 1, 0).

QUALITY-1: Replace `elemWidth < addrGran || elemWidth > addrGran` with
`elemWidth != addrGran` for clarity.

WARN-2: Fix self-referential TODO in _cast_to_i64 docstring and correct
contradicting Variadic<I32> comment in NpuDmaMemcpyNd.__init__.

W3: Simplify `enable_tracing = True if trace_size > 0 else False` to
`enable_tracing = trace_size > 0`.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
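
The CRITICAL-1 fix above is easiest to see in scalar form. The sketch below is a plain C++ analogue of the arith chain, not the compiler code itself: the unguarded version wraps on a zero stride, and the guarded version mirrors `select(stride > 0, stride - 1, 0)`. Both function names are hypothetical.

```cpp
#include <cstdint>

// Unguarded form: unsigned subtraction wraps to 0xFFFFFFFF when stride == 0.
inline uint32_t stride_field_unguarded(uint32_t stride) {
  return stride - 1u;
}

// Guarded form, the scalar analogue of select(stride > 0, stride - 1, 0).
inline uint32_t stride_field_guarded(uint32_t stride) {
  return stride > 0u ? stride - 1u : 0u;
}
```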
TEST-2: Add negative test verifying dynamic operands on MemTile are
rejected with a diagnostic during AIEDMATasksToNPU lowering.

QUALITY-2: Document why hw.d2Size is intentionally not extracted from
HwBdEncoding in AIEDmaToNpu.cpp (ShimNOC always uses bufLen instead).

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
WARN-3: Consolidate npu_rtp_write (aiex.py) to delegate to
npu_write_rtp (aie.py), eliminating duplicate dynamic-value dispatch.

QUALITY-3: Document 10-bit d0_size hardware limit for dynamic transfers
in BdLowering.cpp and AIEDmaToNpu.cpp.

WARN-1: Document verification bypass for dynamic operands in
NpuDmaMemcpyNdOp::verify() — constraints are deferred to runtime.

TEST-1: Enhanced FileCheck patterns in dma_task_dynamic.mlir to verify
arith computation chains (subi/cmpi/select guards, muli for bufLen,
andi for iteration active flag).

TEST-2: MemTile negative test verifying dynamic operands are rejected.

TEST-3: Added a lowering RUN line to dma_tasks_dynamic.py that verifies
npu.blockwrite and npu.write32 ops appear after the pass pipeline.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Includes TableGen updates for dma_bd dynamic operands, AIEDialect
verifier changes, BdLowering CMakeLists, and aie.py npu_write_rtp class.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
…nces

# Conflicts:
#	.pre-commit-config.yaml
#	include/aie/Dialect/AIE/IR/AIEOps.td
#	lib/Dialect/AIEX/IR/AIEXDialect.cpp
#	python/iron/worker.py
The local XRT install does not define these new opcodes yet. Define them
as fallbacks matching the values from the AIEX.td comments (opcodes
10 and 12 respectively).

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
These files were created on the branch but never staged for commit.
Without BdLowering.h the build fails; without the test files the
dma_task_dynamic lit tests are missing.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
When a statically-known d0_size would exceed 1023 (10-bit hardware
field width), emit a warning during lowering. The dynamic path cannot
apply the linear-mode optimization that the static path uses to avoid
this limit.
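The range check can be sketched as below. This is an illustration only: checkD0Size and the message text are hypothetical, and the real pass reports the warning through MLIR diagnostics rather than a returned string.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// d0_size occupies a 10-bit field, so the largest encodable value is 1023.
constexpr uint32_t kD0SizeBits = 10;
constexpr uint32_t kD0SizeMax = (1u << kD0SizeBits) - 1u;

// Returns a warning message when a statically known d0_size does not fit
// in the field, and std::nullopt when it does.
inline std::optional<std::string> checkD0Size(uint64_t d0Size) {
  if (d0Size <= kD0SizeMax)
    return std::nullopt;
  return "d0_size " + std::to_string(d0Size) +
         " exceeds the 10-bit field maximum of " + std::to_string(kD0SizeMax);
}
```

Under these assumptions, checkD0Size(1023) yields no warning and checkD0Size(1024) yields one.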

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
The EmitC pass previously relied on fragile forward-scan pattern
matching (consecutive ops, constant addresses, sentinel address_patch)
to fuse dynamic write32 overrides into their parent blockwrite. If the
IR layout changed, fusion would silently fall back to per-op emission.

Now the lowering passes (AIEDmaToNpu, AIEDMATasksToNPU) tag each BD
word override write32 with a bd_group attribute containing the parent
BD's base address. The EmitC pass uses this attribute to collect
overrides — no order or adjacency requirement, robust against op
reordering.

Queue push write32s (separate hardware register) are intentionally
not tagged.
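The attribute-based collection can be illustrated with a simplified model. The Write32 struct, collectOverrides, and the addresses in the usage note are hypothetical stand-ins for the MLIR ops and attributes, not the pass's actual code.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// Simplified stand-in for an npu.write32 op. bdGroup models the bd_group
// attribute: the parent BD's base address, or unset for untagged writes
// such as queue pushes.
struct Write32 {
  uint32_t address;
  uint32_t value;
  std::optional<uint32_t> bdGroup;
};

// Collects tagged overrides per BD base address. Membership depends only
// on the tag, so the result does not change if untagged ops are
// interleaved or the groups appear out of order.
inline std::map<uint32_t, std::vector<Write32>>
collectOverrides(const std::vector<Write32> &ops) {
  std::map<uint32_t, std::vector<Write32>> groups;
  for (const Write32 &op : ops)
    if (op.bdGroup)
      groups[*op.bdGroup].push_back(op);
  return groups;
}
```

For example, two writes tagged with base 0x1D000 that are separated by an untagged queue push still land in the same group, while the queue push appears in no group.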

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
The IRON Runtime API doesn't yet support dynamic iteration (range_/if_
inside rt.sequence), so this file had a static runtime sequence with
raw MLIR bindings leaking through the core body. The fully dynamic
IRON GEMM will be developed on the iron-dynamic-gemm branch once IRON
Runtime gains dynamic iteration support.

The two non-IRON dynamic variants (placed and unplaced) remain and
provide full dynamic M/K/N capability.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Q6: Document getConstOr0 lambda in AIEDmaToNpu.cpp — the 0 return is
an intentional placeholder for dynamic fields, overridden by write32.

Q9: Document CSE disable in AIEVectorTransferLowering.cpp — prevents
hoisting constants out of runtime_sequence BD configuration blocks.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
CI uses clang-format-17, which line-wraps string concatenation
differently from clang-format-18.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Replace TileOp::getOrCreate() with a read-only tile lookup in the
dynamic DMA start task pattern. getOrCreate() can produce duplicate
TileOps when called multiple times during ConversionPatternRewriter
execution (rewriter buffering hides recently-inserted tiles from the
search), which then triggers an assertion in AIEPathFinder's
DynamicTileAnalysis::runAnalysis.

The pattern only needs to read the controller_id attribute from an
existing tile — it never needs to create one.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
@jgmelber jgmelber force-pushed the dynamic-runtime-sequences branch from 5b0328e to c927363 Compare June 5, 2026 16:12
@jgmelber jgmelber force-pushed the dynamic-runtime-sequences branch from c927363 to 5cc5919 Compare June 5, 2026 16:25
jgmelber and others added 5 commits June 5, 2026 09:25
Remove changes not directly related to dynamic runtime TXN generation:
- .gitignore: AGENTS.md and .codex/ entries
- AIEAssignCoreLinkFiles.cpp: link_with refactor (separate PR)
- AIETargetBCF.cpp: link_files consumer (separate PR)
- AIETargetLdScript.cpp: link_files consumer (separate PR)
- 4 files: cosmetic blank line removals

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Brand new files should have single-year copyright (2026), not a range.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
aiecc now prints per-pipeline completion messages instead of a single
"Compilation completed successfully". Update the test to match.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>