AIR to AMDGCN: E2E matmul with device-side padding#543

Open
erwei-xilinx wants to merge 16 commits into iree-org:main from
erwei-xilinx:air-to-amdgcn

Conversation

@erwei-xilinx

  1. AIR-to-AMDGCN lowering pipeline — three new passes that lower AIR dialect ops to AMDGCN-compatible IR:
  • air-to-amdgcn — flattens the air.launch/air.herd hierarchy to gpu.block_id/gpu.thread_id
  • convert-memspace-to-amdgcn — converts integer memory spaces to #amdgcn.addr_space
  • convert-to-amdgcn-library-calls — replaces air.dma_memcpy_nd, linalg.fill, and linalg.generic (matmul) with calls to kittens library functions (copy_f16_16x64, mfma_matmul_f16_16x64, etc.)
  2. Device-side padding for non-tile-aligned matmul — a 40×40 matmul (tile size 16) with NO host-side padding of the A/B inputs:
  • The kernel uses air.launch with a 3×3 grid, dynamic-sized DMA copies (arith.minui for boundary tiles), and LDS zero-fill before partial copies.
  • The upstream air-split-launch-for-padding pass (with a new split-mode=single-launch mode, contributed via PR #1496) produces a single launch with scf.if on block indices — boundary DMAs get a pad_after attribute.
  • ConvertToAMDGCNLibraryCalls detects pad_after and emits fill_f16_16x64 + copy_f16_16x64_padded (clamped global loads via arith.minui, no per-thread scf.if).
  3. Two passing E2E GPU tests:
  • test_matmul_64x64 — tile-aligned, no padding needed
  • test_matmul_padded_40x40 — non-tile-aligned, device-side padding via split-launch
  4. Upstream mlir-air contributions (landed):
  • PR #1492 — DMA support in air-split-launch-for-padding
  • PR #1496 — Extract the pass from AIE-gated code; add single-launch mode
  • PR #1499 — Fix getDependentDialects for AIRWrapFuncWithParallelPass
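The boundary-tile clamping in item 2 can be sketched as follows. This is a hypothetical illustration (not code from the PR): for a 40×40 matmul tiled 16×16 over a 3×3 grid, each block's copy extent is the arith.minui-style clamp of the tile size against the remaining dimension, so the last block in each dimension copies only 8 elements.

```python
# Hypothetical sketch of the per-block DMA extents: full interior tiles
# copy 16 elements; the boundary tile is clamped to 40 - 2*16 = 8.
TILE, DIM = 16, 40

def tile_extent(block_id: int) -> int:
    # Mirrors arith.minui(tile, dim - block_id * tile).
    return min(TILE, DIM - block_id * TILE)

extents = [tile_extent(b) for b in range(3)]  # one entry per grid index
```

The pad_after attribute marks exactly those DMAs whose extent comes out smaller than the tile.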

@erwei-xilinx
Author

Cc: @nicolasvasilache

@erwei-xilinx force-pushed the air-to-amdgcn branch 2 times, most recently from f2a828f to a256466 on April 8, 2026 at 23:19
erwei-xilinx and others added 16 commits April 8, 2026 23:21
Full AIR pipeline from tensor linalg to AMDGCN assembly.

Pipeline: transform (tile/pad/promote/bufferize) → air-par-to-herd →
one-shot-bufferize → air-par-to-launch → air-copy-to-dma →
air-dma-to-channel → air-to-amdgcn → convert-memspace-to-amdgcn →
convert-linalg-to-amdgcn → preload + inline → assembly.

Design:
- air.herd = wavefront (thread_id/64), air.launch = workgroup (block_id)
- Base ptr stays sgpr, tile offsets passed separately (kittens pattern)
- Per-wavefront LDS: alloc numWavefronts * size, offset by wave_id * size
- scf.parallel from channel hoisting inlined with wavefront IDs
- Global memref.copy/fill on non-LDS buffers eliminated
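The first two design points can be sketched together. This is a hypothetical illustration of the mapping described above (not the pass's actual code): each wavefront is thread_id/64, and a single LDS allocation of numWavefronts * size bytes is striped so each wavefront gets a disjoint slice at wave_id * size.

```python
# Hypothetical sketch: herd = wavefront, with per-wavefront LDS striping.
WAVE_SIZE = 64  # threads per wavefront on gfx942

def lds_offset(thread_id: int, tile_size_bytes: int) -> int:
    wave_id = thread_id // WAVE_SIZE          # air.herd index
    return wave_id * tile_size_bytes          # disjoint LDS slice per wave
```

All 64 threads of a wavefront compute the same offset, so the base pointer can stay in an SGPR while only the tile offset varies.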

The assembly generates correctly, but the GPU E2E test has a numerical
error. Root cause: an LDS cache key mismatch — air-to-amdgcn clones the
herd body ops (creating new SSA values), but the channel ops still
reference the original (pre-clone) alloc values. The ldsCache maps the
original allocs while the matmul uses the cloned allocs, producing
separate LDS allocations that don't share data with the channel copies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
…ont LDS

- Remove air-dma-to-channel from the pipeline: channels are not mappable
  to GPU fabric when herds map to wavefronts (no hardware FIFO backpressure).
  DMAs stay inside the herd body so copy and compute remain colocated.
- Add DMA→library-call lowering in ConvertLinalgToAMDGCN: detects num-wavefronts
  from IR (wavefrontId affine.apply coefficient), allocates nWf*tileSizeBytes of
  LDS and strides per-wavefront to prevent LDS collision.
- Fix K-accumulation: tile_using_for tile_sizes [16,16,0] (no K tiling) so the
  library's zero_C is called exactly once per output tile.
- Add copy_f16_16x64 and mfma_matmul_f16_16x64 library functions to handle
  K=64 in a single call (4 MFMA panels from two 1024-byte LDS blocks).
- Delete ConvertAirChannelToAMDGCN pass (unused, no pipeline invokes it).

E2E test (64x64 f16 matmul on gfx942) passes: rtol/atol 1e-2.
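The K-accumulation fix above can be sketched as follows. This is a hypothetical illustration (not the transform script itself): with tile_sizes [16, 16, 0] the K dimension is not tiled, so each (i, j) output tile maps to exactly one library call, and that call's zero_C initialization runs once per tile rather than once per K-step.

```python
# Hypothetical sketch: no K tiling means one library call (and one
# zero-init of C) per 16x16 output tile; K is consumed inside the call.
def tiled_matmul_calls(M, N, K, tm, tn):
    calls = []
    for i in range(0, M, tm):
        for j in range(0, N, tn):
            calls.append((i, j, K))  # zero_C + full-K accumulate here
    return calls
```

Tiling K as well would re-run zero_C on every K iteration and clobber the partial sums, which is the bug this commit fixes.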

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Remove manually duplicated AIR/MLIR link deps from mlir-air-opt and
aster-shlib. MlirAirLib declares them as LINK_LIBS PUBLIC, so CMake
propagates them automatically. The aster-shlib --whole-archive hack
bypassed this; fixed by also linking MlirAirLib normally for dep
resolution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
- Split transform sequence into separate file (air-to-amdgcn-matmul-transform.mlir)
  and load via --transform-preload-library, matching upstream MLIR convention.
  No transform ops in the payload or output; no cleanup code needed in passes.
- Register transform::registerPreloadLibraryPass() in Init.cpp.
- Remove emit_dealloc from bufferize_to_allocation (deallocs not needed).
- Remove manual DCE for alloca/view/dealloc from ConvertLinalgToAMDGCN.
- Remove global->global memref.copy forwarding hack (fill writes to %C directly
  via bufferization.to_tensor, so no temp alloc/copy pattern).
- Remove transform dialect cleanup from ConvertLinalgToAMDGCN.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
The pass converts linalg ops, air.dma_memcpy_nd, and air.channel.put/get
to AMDGCN library calls — not just linalg. Rename to reflect its scope:
  convert-linalg-to-amdgcn → convert-to-amdgcn-library-calls

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
air-dma-to-channel is not in the pipeline, so ChannelGetOp/ChannelPutOp/
ChannelOp are never present. Remove ~210 lines of dead channel pre-allocation,
conversion, and cleanup code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Replace imperative IR walks with OpRewritePattern + applyPatternsGreedily:
- DmaToLibraryCall: air.dma_memcpy_nd → copy library call
- LinalgToLibraryCall<T>: templated for fill/copy/matmul ops
- GenericMatmulToLibraryCall: linalg.generic with matmul semantics
- EraseGlobalFill: erase fill on global buffers (library handles zero-init)

Merge LDS pre-allocation into emitLDSOffset with wavefront striping and
function-entry insertion, eliminating the separate preallocateDmaLDS walk.
Pattern ordering no longer matters — whichever pattern fires first creates
the correct LDS allocation in the shared cache.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Demonstrates non-tile-aligned matmul: actual dimensions 40x40x64, host
pads inputs/output to tile-aligned 48x48x64. Kernel operates on full
tiles; padding zeros produce zero output in the padded region. Host
extracts valid C[0:40, 0:40] for verification.
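The host-side padding arithmetic above can be sketched as follows. This is a hypothetical illustration (not the test's code): each dimension is rounded up to the next multiple of the tile size, 40 → 48 for tile 16, and the host later slices out the valid region.

```python
# Hypothetical sketch: round a dimension up to the tile-aligned size.
def pad_to_tile(dim: int, tile: int) -> int:
    return ((dim + tile - 1) // tile) * tile

padded_m = pad_to_tile(40, 16)  # 40x40 problem padded to 48x48
```

Because the pad region of A and B is zero, the corresponding rows/columns of C are zero too, so extracting C[0:40, 0:40] recovers the exact answer.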

- New payload: air-to-amdgcn-matmul-padded.mlir (48x48x64 kernel)
- New transform: tile_using_forall [16,0,0] (3 wavefronts)
- New test: test_matmul_padded_40x40 in test_air_matmul_e2e.py
- Register tensor ValueBounds and SubsetOp interfaces in Init.cpp
  (needed for tensor.extract_slice in padding transforms)
- Refactor _air_preprocess into _air_preprocess_with_files for reuse

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Infrastructure for non-tile-aligned matmul (40x40x64) with compiler-level
padding. Transform pads all three operands (A, B, C) to tile size 16.
C accumulates in LDS, then copies back to global (only valid region).

New library functions:
- store_lds_C_mfma_f32_16x16x16_f16: AGPR→LDS via ds_write_b32
- mfma_matmul_lds_c_f16_16x64: matmul with all-LDS operands
- fill_f32_16x16, copy_f32_16x16: LDS C init and global→LDS copy
- store_global_f32_16x16: LDS→global C writeback

Pass changes:
- DmaToLibraryCall: handle both global→LDS and LDS→global directions
- GenericMatmulToLibraryCall: _lds_c suffix when output is in LDS
- Register air-override-memref-memory-space pass
- Register tensor ValueBounds/SubsetOp interfaces

BLOCKING: amdgcn-preload-library hangs on the padded test's IR when
library functions with bodies (store_global_f32_16x16 etc.) interact
with preloaded external library functions. The 64x64 test still passes.
Needs investigation of the preload pass's internal inlining behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Fix a bug in PreloadLibrary where declarationsToReplace would replace a
declaration with another declaration from the library map, reporting a
change on every iteration and never converging. Guard: only replace a
declaration when the library entry has a body.

WIP: compiler-level padded matmul (40x40x64) with LDS C accumulation.
Infrastructure works through convert-to-amdgcn-library-calls, but the
aster backend rejects non-thread-uniform loops from non-divisible tiling.
New: fill_lds_16x64_b in lds_16x64_b.mlir for LDS zero-fill.

BLOCKING: "only thread-uniform loops are supported" — the scf.for loop
from tile_using_for with affine.min bounds varies per wavefront.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Replace per-tile padding approach with pad_tiling_interface which pads
the entire iteration domain from 40→48 BEFORE tiling. This eliminates:
- affine.min bounds (all tiles are full, 48 % 16 == 0)
- Non-uniform loops rejected by amdgcn-convert-scf-control-flow
- Dynamic allocs without memory_space
- C-in-LDS complexity (C stays in global, same as 64x64 test)

The padded allocs (48x64, 48x48) are at function level, outside the herd.
Library functions are identical to the 64x64 test.

BLOCKING: lsir.reg_cast normal form violation in aster backend from
padded alloc pointer decomposition. 64x64 test still passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Add E2E test for non-tile-aligned matmul (M=40, N=40, K=64) with
device-side padding. The kernel writes air.launch directly with a 3x3
grid and dynamic-sized DMA copies (arith.minui for boundary tiles).

The upstream air-split-launch-for-padding pass (split-mode=single-launch,
pad-location=source) produces a single launch with scf.if on block
indices. Boundary DMAs get pad_after attributes, which
ConvertToAMDGCNLibraryCalls converts to fill_f16_16x64 (zero LDS) +
copy_f16_16x64_padded (clamped global loads via arith.minui).
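The fill + padded-copy sequence above can be sketched as follows. This is a hypothetical illustration (the function names fill_f16_16x64 and copy_f16_16x64_padded are the PR's library calls, but this Python stand-in is not their implementation): zero-fill the LDS tile first, then copy only the clamped valid region, so the pad region stays zero without any per-thread scf.if.

```python
# Hypothetical sketch of a boundary-tile load: fill then partial copy.
def load_boundary_tile(src, row0, col0, tile=16):
    lds = [[0.0] * tile for _ in range(tile)]  # fill (zero LDS) step
    rows = min(tile, len(src) - row0)          # arith.minui clamps
    cols = min(tile, len(src[0]) - col0)
    for r in range(rows):                      # padded-copy step
        for c in range(cols):
            lds[r][c] = src[row0 + r][col0 + c]
    return lds
```

Every thread executes the same straight-line code; only the clamped loop bounds differ, which is what lets the lowering avoid per-thread control flow.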

Key changes:
- ConvertToAMDGCNLibraryCalls: detect pad_after on DMAs, emit fill +
  padded copy for boundary tiles
- PromoteAllocsToFuncArgs: new pass to promote nested memref.alloc
  (including inside scf.if) to kernel function arguments
- Init.cpp: register upstream air-split-launch-for-padding (extracted
  from AIE-gated code in mlir-air #1496), scf::SCFDialect
- indexing_ptr.mlir: add make_raw_buffer_rsrc_bounded
- Update mlir-air submodule to 39ee8fc6 (includes #1496 and #1499)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Remove store_lds_C_mfma_f32_16x16x16_f16 (AGPR→LDS C tile store) from
compute_16x16_f16.mlir — unused since the padded matmul writes directly
to global C. Keep fill_lds_16x64_b in lds_16x64_b.mlir as it is needed
by the device-side padding path (zero-fill LDS before partial copy).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
No longer needed: the upstream single-launch mode in
air-split-launch-for-padding shares LDS allocs across scf.if branches
instead of duplicating them, and the padded matmul writes directly to
global C without temp allocs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
