AIR to AMDGCN: E2E matmul with device-side padding #543
Open
erwei-xilinx wants to merge 16 commits into iree-org:main from
Conversation
erwei-xilinx commented on Apr 7, 2026
- AIR-to-AMDGCN lowering pipeline — three new passes that lower AIR dialect ops to AMDGCN-compatible IR:
  - air-to-amdgcn — flattens the air.launch/air.herd hierarchy to gpu.block_id/gpu.thread_id
  - convert-memspace-to-amdgcn — converts integer memory spaces to #amdgcn.addr_space
  - convert-to-amdgcn-library-calls — replaces air.dma_memcpy_nd, linalg.fill, and linalg.generic (matmul) with calls to kittens library functions (copy_f16_16x64, mfma_matmul_f16_16x64, etc.)
- Device-side padding for non-tile-aligned matmul — a 40×40 matmul (tile size 16) with no host-side padding of the A/B inputs:
  - The kernel uses air.launch with a 3×3 grid, dynamic-sized DMA copies (arith.minui for boundary tiles), and LDS zero-fill before partial copies
  - The upstream air-split-launch-for-padding pass (with the new split-mode=single-launch mode, contributed via PR #1496) produces a single launch with scf.if on block indices — boundary DMAs get a pad_after attribute
  - ConvertToAMDGCNLibraryCalls detects pad_after and emits fill_f16_16x64 + copy_f16_16x64_padded (clamped global loads via arith.minui, no per-thread scf.if)
- Two passing E2E GPU tests:
  - test_matmul_64x64 — tile-aligned, no padding needed
  - test_matmul_padded_40x40 — non-tile-aligned, device-side padding via split-launch
- Upstream mlir-air contributions (landed):
  - PR #1492 — DMA support in air-split-launch-for-padding
  - PR #1496 — Extract the pass from AIE-gated code; add single-launch mode
  - PR #1499 — Fix getDependentDialects for AIRWrapFuncWithParallelPass
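The boundary-tile lowering described above might look roughly like the following MLIR sketch. This is a hypothetical illustration, not code from the patch: the operand shapes, SSA names, address-space annotation, and exact call signatures are assumptions; only the function names (fill_f16_16x64, copy_f16_16x64_padded) and the arith.minui clamping idea come from the PR description.

```mlir
// Hypothetical sketch of the pad_after lowering for one boundary tile.
// %rows_left is how many valid rows remain in the last row of tiles.
%c16 = arith.constant 16 : index
// Clamp the copy extent so global loads never read past the 40-row input.
%valid = arith.minui %rows_left, %c16 : index
// Zero-fill the LDS tile first so the padded region reads back as 0 ...
func.call @fill_f16_16x64(%lds) : (memref<16x64xf16, 3>) -> ()
// ... then copy only the valid rows with clamped global loads.
func.call @copy_f16_16x64_padded(%a_tile, %lds, %valid)
    : (memref<?x?xf16>, memref<16x64xf16, 3>, index) -> ()
```

The point of this shape is the last bullet above: the clamp lives in the copy routine's addressing, so no per-thread scf.if is needed on the padded path.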
erwei-xilinx force-pushed from f2a828f to a256466
Full AIR pipeline from tensor linalg to AMDGCN assembly.

Pipeline: transform (tile/pad/promote/bufferize) → air-par-to-herd → one-shot-bufferize → air-par-to-launch → air-copy-to-dma → air-dma-to-channel → air-to-amdgcn → convert-memspace-to-amdgcn → convert-linalg-to-amdgcn → preload + inline → assembly.

Design:
- air.herd = wavefront (thread_id/64), air.launch = workgroup (block_id)
- Base ptr stays sgpr; tile offsets passed separately (kittens pattern)
- Per-wavefront LDS: alloc numWavefronts * size, offset by wave_id * size
- scf.parallel from channel hoisting inlined with wavefront IDs
- Global memref.copy/fill on non-LDS buffers eliminated

Assembly generates correctly. The GPU E2E test has a numerical error; the root cause is an LDS cache key mismatch: air-to-amdgcn clones herd body ops (creating new SSA values), but channel gets reference the original (pre-clone) alloc values. The ldsCache maps the original allocs while the matmul uses the cloned allocs, producing separate LDS allocations that do not share data with the channel copies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
…ont LDS

- Remove air-dma-to-channel from the pipeline: channels are not mappable to GPU fabric when herds map to wavefronts (no hardware FIFO backpressure). DMAs stay inside the herd body so copy and compute remain colocated.
- Add DMA→library-call lowering in ConvertLinalgToAMDGCN: detects num-wavefronts from the IR (wavefrontId affine.apply coefficient), allocates nWf*tileSizeBytes of LDS, and strides per-wavefront to prevent LDS collisions.
- Fix K-accumulation: tile_using_for tile_sizes [16,16,0] (no K tiling) so the library's zero_C is called exactly once per output tile.
- Add copy_f16_16x64 and mfma_matmul_f16_16x64 library functions to handle K=64 in a single call (4 MFMA panels from two 1024-byte LDS blocks).
- Delete the ConvertAirChannelToAMDGCN pass (unused; no pipeline invokes it).

E2E test (64x64 f16 matmul on gfx942) passes: rtol/atol 1e-2.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
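The per-wavefront LDS striping described above (one allocation of nWf * tileSizeBytes, each wavefront offset by its own stripe) could be sketched as below. This is a hypothetical illustration with assumed values: 4 wavefronts, a 16x64 f16 tile, address space 3 for LDS, and %tid as the flat thread id; none of these specifics are taken from the patch itself.

```mlir
// Hypothetical sketch: one LDS allocation sized for all wavefronts,
// each wavefront indexing its own stripe to avoid collisions.
%lds_all = memref.alloc() : memref<4x16x64xf16, 3>
// wave_id = thread_id / 64 (one wavefront = 64 lanes)
%wave_id = affine.apply affine_map<(d0) -> (d0 floordiv 64)>(%tid)
// Carve out this wavefront's private 16x64 tile.
%lds = memref.subview %lds_all[%wave_id, 0, 0] [1, 16, 64] [1, 1, 1]
    : memref<4x16x64xf16, 3>
    to memref<16x64xf16, strided<[64, 1], offset: ?>, 3>
```

This matches the "offset by wave_id * size" scheme: each wavefront's copy and compute see a disjoint LDS window, so no inter-wavefront synchronization is needed for the staging buffers.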
Remove manually duplicated AIR/MLIR link deps from mlir-air-opt and aster-shlib. MlirAirLib declares them as LINK_LIBS PUBLIC, so CMake propagates them automatically. The aster-shlib --whole-archive hack bypassed this; fixed by also linking MlirAirLib normally for dep resolution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erwei Wang <erwei.wang@amd.com>
- Split the transform sequence into a separate file (air-to-amdgcn-matmul-transform.mlir) and load it via --transform-preload-library, matching upstream MLIR convention. No transform ops in the payload or output; no cleanup code needed in passes.
- Register transform::registerPreloadLibraryPass() in Init.cpp.
- Remove emit_dealloc from bufferize_to_allocation (deallocs not needed).
- Remove manual DCE for alloca/view/dealloc from ConvertLinalgToAMDGCN.
- Remove the global->global memref.copy forwarding hack (fill writes to %C directly via bufferization.to_tensor, so there is no temp alloc/copy pattern).
- Remove transform dialect cleanup from ConvertLinalgToAMDGCN.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
The pass converts linalg ops, air.dma_memcpy_nd, and air.channel.put/get to AMDGCN library calls — not just linalg. Rename to reflect its scope: convert-linalg-to-amdgcn → convert-to-amdgcn-library-calls Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erwei Wang <erwei.wang@amd.com>
air-dma-to-channel is not in the pipeline, so ChannelGetOp/ChannelPutOp/ ChannelOp are never present. Remove ~210 lines of dead channel pre-allocation, conversion, and cleanup code. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Replace imperative IR walks with OpRewritePattern + applyPatternsGreedily:

- DmaToLibraryCall: air.dma_memcpy_nd → copy library call
- LinalgToLibraryCall<T>: templated for fill/copy/matmul ops
- GenericMatmulToLibraryCall: linalg.generic with matmul semantics
- EraseGlobalFill: erase fill on global buffers (library handles zero-init)

Merge LDS pre-allocation into emitLDSOffset with wavefront striping and function-entry insertion, eliminating the separate preallocateDmaLDS walk. Pattern ordering no longer matters — whichever pattern fires first creates the correct LDS allocation in the shared cache.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Demonstrates non-tile-aligned matmul: actual dimensions are 40x40x64; the host pads inputs/output to a tile-aligned 48x48x64. The kernel operates on full tiles; padding zeros produce zero output in the padded region. The host extracts the valid C[0:40, 0:40] for verification.

- New payload: air-to-amdgcn-matmul-padded.mlir (48x48x64 kernel)
- New transform: tile_using_forall [16,0,0] (3 wavefronts)
- New test: test_matmul_padded_40x40 in test_air_matmul_e2e.py
- Register tensor ValueBounds and SubsetOp interfaces in Init.cpp (needed for tensor.extract_slice in padding transforms)
- Refactor _air_preprocess into _air_preprocess_with_files for reuse

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Infrastructure for non-tile-aligned matmul (40x40x64) with compiler-level padding. The transform pads all three operands (A, B, C) to tile size 16. C accumulates in LDS, then copies back to global (only the valid region).

New library functions:
- store_lds_C_mfma_f32_16x16x16_f16: AGPR→LDS via ds_write_b32
- mfma_matmul_lds_c_f16_16x64: matmul with all-LDS operands
- fill_f32_16x16, copy_f32_16x16: LDS C init and global→LDS copy
- store_global_f32_16x16: LDS→global C writeback

Pass changes:
- DmaToLibraryCall: handle both global→LDS and LDS→global directions
- GenericMatmulToLibraryCall: _lds_c suffix when the output is in LDS
- Register the air-override-memref-memory-space pass
- Register tensor ValueBounds/SubsetOp interfaces

BLOCKING: amdgcn-preload-library hangs on the padded test's IR when library functions with bodies (store_global_f32_16x16 etc.) interact with preloaded external library functions. The 64x64 test still passes. Needs investigation of the preload pass's internal inlining behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Fix a bug in PreloadLibrary where declarationsToReplace replaces a declaration with another declaration from the library map, returning true forever. Guard: only replace if the library entry has a body.

WIP: compiler-level padded matmul (40x40x64) with LDS C accumulation. The infrastructure works through convert-to-amdgcn-library-calls, but the aster backend rejects the non-thread-uniform loops produced by non-divisible tiling.

New: fill_lds_16x64_b in lds_16x64_b.mlir for LDS zero-fill.

BLOCKING: "only thread-uniform loops are supported" — the scf.for loop from tile_using_for with affine.min bounds varies per wavefront.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Replace the per-tile padding approach with pad_tiling_interface, which pads the entire iteration domain from 40→48 BEFORE tiling. This eliminates:

- affine.min bounds (all tiles are full, 48 % 16 == 0)
- Non-uniform loops rejected by amdgcn-convert-scf-control-flow
- Dynamic allocs without a memory_space
- C-in-LDS complexity (C stays in global, same as the 64x64 test)

The padded allocs (48x64, 48x48) are at function level, outside the herd. Library functions are identical to the 64x64 test.

BLOCKING: lsir.reg_cast normal form violation in the aster backend from padded alloc pointer decomposition. The 64x64 test still passes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
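Why padding before tiling removes the affine.min bounds: 48 is the smallest multiple of 16 covering 40 (ceil(40/16) * 16 = 48), so every tile of the padded domain is full and the tiled loop needs no per-iteration size clamp. A hypothetical sketch of the resulting loop shape (names and shapes are illustrative, not from the patch):

```mlir
// Hypothetical sketch: after padding the iteration domain 40 -> 48,
// tiling by 16 yields three full tiles and a static subview size --
// no affine.min bound, so the loop is thread-uniform.
scf.for %i = %c0 to %c48 step %c16 {
  %tileA = memref.subview %A_pad[%i, 0] [16, 64] [1, 1]
      : memref<48x64xf16>
      to memref<16x64xf16, strided<[64, 1], offset: ?>>
  // ... copy %tileA to LDS and run the 16x64 library matmul ...
}
```

With per-tile padding instead, the last iteration would carry an affine.min-bounded size (min(16, 40 - %i)), which is exactly the non-uniform shape the backend rejected in the previous commit.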
Add an E2E test for non-tile-aligned matmul (M=40, N=40, K=64) with device-side padding. The kernel writes air.launch directly with a 3x3 grid and dynamic-sized DMA copies (arith.minui for boundary tiles). The upstream air-split-launch-for-padding pass (split-mode=single-launch, pad-location=source) produces a single launch with scf.if on block indices. Boundary DMAs get pad_after attributes, which ConvertToAMDGCNLibraryCalls converts to fill_f16_16x64 (zero LDS) + copy_f16_16x64_padded (clamped global loads via arith.minui).

Key changes:
- ConvertToAMDGCNLibraryCalls: detect pad_after on DMAs; emit fill + padded copy for boundary tiles
- PromoteAllocsToFuncArgs: new pass to promote nested memref.alloc (including inside scf.if) to kernel function arguments
- Init.cpp: register the upstream air-split-launch-for-padding pass (extracted from AIE-gated code in mlir-air #1496) and scf::SCFDialect
- indexing_ptr.mlir: add make_raw_buffer_rsrc_bounded
- Update the mlir-air submodule to 39ee8fc6 (includes #1496 and #1499)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erwei Wang <erwei.wang@amd.com>
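The single-launch structure this commit relies on might look roughly like the sketch below. It is a hypothetical illustration only: the air.launch argument list, attribute payload, and DMA operand syntax are assumptions; what is taken from the commit is the 3x3 grid, the scf.if on block indices, and the pad_after marker on boundary DMAs.

```mlir
// Hypothetical sketch of the single-launch form: a 3x3 grid where
// blocks in the last row/column take the boundary path, and the
// pad_after attribute marks their DMA for the fill + clamped-copy
// lowering in ConvertToAMDGCNLibraryCalls.
air.launch (%bx, %by) in (%gx = %c3, %gy = %c3) {
  %is_last = arith.cmpi eq, %bx, %c2 : index
  scf.if %is_last {
    // boundary tile: partial copy, zero-filled remainder
    air.dma_memcpy_nd (%lds[] [] [], %A[%off] [%valid, %c64] [%c64, %c1])
        {pad_after = [8, 0]} : (memref<16x64xf16, 3>, memref<40x64xf16>)
  } else {
    // interior tile: full 16x64 copy
    air.dma_memcpy_nd (%lds[] [] [], %A[%off] [%c16, %c64] [%c64, %c1])
        : (memref<16x64xf16, 3>, memref<40x64xf16>)
  }
}
```

Keeping both branches in one launch (rather than splitting into multiple launches) is what lets the scf.if branches share the same LDS alloc, as the later cleanup commit notes.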
Remove store_lds_C_mfma_f32_16x16x16_f16 (AGPR→LDS C tile store) from compute_16x16_f16.mlir — unused since the padded matmul writes directly to global C. Keep fill_lds_16x64_b in lds_16x64_b.mlir as it is needed by the device-side padding path (zero-fill LDS before partial copy). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erwei Wang <erwei.wang@amd.com>
No longer needed: the upstream single-launch mode in air-split-launch-for-padding shares LDS allocs across scf.if branches instead of duplicating them, and the padded matmul writes directly to global C without temp allocs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erwei Wang <erwei.wang@amd.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erwei Wang <erwei.wang@amd.com>
erwei-xilinx force-pushed from a256466 to e8c6943