MoonMath AI Open-Sources HIP Attention Kernel for AMD MI300X Beating AITER v3 in All Orientation and Rotation Mode

The MoonMath AI team released bf16 for AMD’s MI300X GPU. It is written in HIP, not a manual assembly. The code is open source under the MIT license. The MoonMath.ai team reports that it beats AITER v3, AMD’s optimized kernel, in every configuration tested. Bare-metal access came from HotAisle, AMD’s cloud provider.
Attention is combined softmax(QKᵀ/√d)·V working inside all transformers. The MI300X is AMD’s CDNA3 data center, with an ISA core (gfx942). This kernel works only on that hardware.
The TL;DR
- MoonMath.ai open source bf16 focus for AMD MI300X, written by HIP, not organization (MIT).
- Beats AMD’s AITER v3 in all orientations and rotation modes — geomean 1.18×/1.15×/1.08×, up to 1.26×.
- The core trick: single-instruction asm wrappers let you choose an opcode while the compiler allocates registers.
- Most of the speedup is memory placement – K in LDS, hot V in L1, Q and accumulators in registers.
- The real SGlang PR used it to speed up Wan2.1 video streaming by 1.23×, without quality degradation.
Understanding the Kernel
A kernel is a small program that runs directly on multiple GPU cores to perform one specific computation—here, computational attention—as fast as the hardware allows. The kernel calculates attention forward to bf16 on MI300X only. Input is required in BSHD or BHSD format, without switching. The size of the head is 128. It supports any sequence length, including reverse focus.
There are real limitations. No causal mask, no GQA, and no varlen batching. The output is bf16, and works on gfx942 hardware exclusively.
Numbers are strictly controlled. All three zoom modes match the AITER rotation rule for each mode. All finite outputs stay within 1 bf16 ULP of AITER. The handling of NaN and Inf is somewhat similar, and the results are deterministic.
Core Trick: One-Instruction asm Wrappers
The core approach avoids the general problem. Compiler intrinsics keep code tidy but allow the compiler to rearrange or rename operands. Raw inline integration provides control but forces manual register and address management.
MoonMath directly wraps one instruction in a __device__ __forceinline__ work. Extended asm parameters define the operands. The research team chooses the opcode. The compiler still allocates registers and tracks data flow.
// in/out tied to the SAME VGPR → no accumulator rename, no v_mov copy.
__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {
asm volatile("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"
: "+v"(c) : "v"(a), "v"(b));
}I "+v"(c) limit binds the accumulator input and output to the same VGPR. No copy command was issued. This keeps the kernel close to the normal HIP. It still directs the machine one instruction at a time.
Architecture: Eight Waves, Two Groups, Two Barriers
The CDNA3 computer unit has four SIMD units. The textbook block is four waves. MoonMath instead uses eight waves per block, in two groups of four.
The two teams run the same Q*Ksoftmax, O += P*V succession. They are victimized by class. While one group fills the core of the matrix, the other uses softmax and removes loads. Then they alternate, so the matrix core doesn’t work.
There are two s_barriers by repetition. One lives in the handoff area. One lives on the border of repetition. Each counter wait handles all synchronization.
This is similar to FlashAttention-3’s matmul and softmax alternation. It does not copy the FA3 manufacturer split and consumer warp. In CDNA3, all memory movements are already asynchronous, so a dedicated generator wave is not needed.
Where the Data Lives, and Why 16×16×16
Most of the speedup comes from memory placement. K broadcast from HBM to LDS, double buffered, shared by all eight waves. V always hot on L1, readable on all PV matmul. Q and accumulators reside in registers.
The research team chose 16×16×16 MFMA over 32×32×8. Both cases have the same output. A small tile stacks up to 4 fp32 features per lane, versus 16. Low accumulator pressure leaves room for deep preload and third part. Q tile.
| The decision | Choice | The reason |
|---|---|---|
| Waves per block | 8 (two teams of 4) | Arrange the pipe directly; share one K copy |
| MFMA format | 16×16×16 bf16 | Same performance, lower VGPR pressure, better power efficiency |
| Placement of K | LDS, double buffered, 32 KB | It is distributed in all 8 waves, modified by repetition |
| Placement of V | L1, resident, preloaded | Reread all PV, deliberately kept hot |
| Q + accumulators | VGPRs | Read all iterations, never reloaded |
Two victories later closed the gap. The third time Q tile (3Q) suggests reuse of data on each upload K again V tile. KV-style Flash-Decoding tail division saves a fractional cycle tied to all of the MI300X’s 304 CUs. This is a cascade win. It’s going V in L1 released LDS that the third Q tile and fill.
Benchmark
Test run on MI300X at bf16, 128 head size. Each shape was measured in three rotational modes. RTNE rotates in near equilibrium. RTNA rotates close, bounding away from zero. RTZ decreases to zero.
| Shape (B, H, S, D) | The cycle | Ours (ms) | AITER v3 (ms) | competes with AITER | competes with MAX |
|---|---|---|---|---|---|
| (2, 24, 8192, 128) | RTNE | 3.083 | 3.792 | 1.23× | 1.37× |
| (2, 24, 16384, 128) | RTNE | 11.670 | 14.691 | 1.26× | 1.54× |
| (4, 16, 16384, 128) | RTZ | 15.055 | 16.183 | 1.07× | 1.47× |
| (2, 24, 32768, 128) | RTNA | 44.440 | 52.363 | 1.18× | 1.57× |
| (1, 16, 131072, 128) | RTNE | 232.517 | 269.278 | 1.16× | 1.46× |
Geomeans beyond sweeping love MoonMath. Compared to AITER, it scores 1.18× (RTNE), 1.15× (RTNA), and 1.08× (RTZ). Against the Modular MAX, the geomeans run from 1.44× to 1.49×, and the acceleration of each shape reaches 1.59×.
RTZ is AITER’s fastest mode and intense racing. RTZ shape (4, 16, 16384) went from 0.95× to 1.07×. KV’s tail split is what closed that final gap.
Interactive Descriptor



