MoonMath AI Open-Sources HIP Attention Kernel for AMD MI300X Beating AITER v3 in All Orientation and Rotation Mode

admin 3 hours ago

0 0 5 minutes read

MoonMath AI Open-Sources HIP Attention Kernel for AMD MI300X Beating AITER v3 in All Orientation and Rotation Mode

The MoonMath AI team released bf16 for AMD’s MI300X GPU. It is written in HIP, not a manual assembly. The code is open source under the MIT license. The MoonMath.ai team reports that it beats AITER v3, AMD’s optimized kernel, in every configuration tested. Bare-metal access came from HotAisle, AMD’s cloud provider.

Attention is combined softmax(QKᵀ/√d)·V working inside all transformers. The MI300X is AMD’s CDNA3 data center, with an ISA core (gfx942). This kernel works only on that hardware.

The TL;DR

MoonMath.ai open source bf16 focus for AMD MI300X, written by HIP, not organization (MIT).
Beats AMD’s AITER v3 in all orientations and rotation modes — geomean 1.18×/1.15×/1.08×, up to 1.26×.
The core trick: single-instruction asm wrappers let you choose an opcode while the compiler allocates registers.
Most of the speedup is memory placement – K in LDS, hot V in L1, Q and accumulators in registers.
The real SGlang PR used it to speed up Wan2.1 video streaming by 1.23×, without quality degradation.

Understanding the Kernel

A kernel is a small program that runs directly on multiple GPU cores to perform one specific computation—here, computational attention—as fast as the hardware allows. The kernel calculates attention forward to bf16 on MI300X only. Input is required in BSHD or BHSD format, without switching. The size of the head is 128. It supports any sequence length, including reverse focus.

There are real limitations. No causal mask, no GQA, and no varlen batching. The output is bf16, and works on gfx942 hardware exclusively.

Numbers are strictly controlled. All three zoom modes match the AITER rotation rule for each mode. All finite outputs stay within 1 bf16 ULP of AITER. The handling of NaN and Inf is somewhat similar, and the results are deterministic.

Core Trick: One-Instruction asm Wrappers

The core approach avoids the general problem. Compiler intrinsics keep code tidy but allow the compiler to rearrange or rename operands. Raw inline integration provides control but forces manual register and address management.

MoonMath directly wraps one instruction in a __device__ __forceinline__ work. Extended asm parameters define the operands. The research team chooses the opcode. The compiler still allocates registers and tracks data flow.

// in/out tied to the SAME VGPR → no accumulator rename, no v_mov copy.
__device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) {
    asm volatile("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0"
                 : "+v"(c) : "v"(a), "v"(b));
}

I "+v"(c) limit binds the accumulator input and output to the same VGPR. No copy command was issued. This keeps the kernel close to the normal HIP. It still directs the machine one instruction at a time.

Architecture: Eight Waves, Two Groups, Two Barriers

The CDNA3 computer unit has four SIMD units. The textbook block is four waves. MoonMath instead uses eight waves per block, in two groups of four.

The two teams run the same Q*Ksoftmax, O += P*V succession. They are victimized by class. While one group fills the core of the matrix, the other uses softmax and removes loads. Then they alternate, so the matrix core doesn’t work.

There are two s_barriers by repetition. One lives in the handoff area. One lives on the border of repetition. Each counter wait handles all synchronization.

This is similar to FlashAttention-3’s matmul and softmax alternation. It does not copy the FA3 manufacturer split and consumer warp. In CDNA3, all memory movements are already asynchronous, so a dedicated generator wave is not needed.

Where the Data Lives, and Why 16×16×16

Most of the speedup comes from memory placement. K broadcast from HBM to LDS, double buffered, shared by all eight waves. V always hot on L1, readable on all PV matmul. Q and accumulators reside in registers.

The research team chose 16×16×16 MFMA over 32×32×8. Both cases have the same output. A small tile stacks up to 4 fp32 features per lane, versus 16. Low accumulator pressure leaves room for deep preload and third part. Q tile.

The decision	Choice	The reason
Waves per block	8 (two teams of 4)	Arrange the pipe directly; share one K copy
MFMA format	16×16×16 bf16	Same performance, lower VGPR pressure, better power efficiency
Placement of K	LDS, double buffered, 32 KB	It is distributed in all 8 waves, modified by repetition
Placement of V	L1, resident, preloaded	Reread all PV, deliberately kept hot
Q + accumulators	VGPRs	Read all iterations, never reloaded

Two victories later closed the gap. The third time Q tile (3Q) suggests reuse of data on each upload K again V tile. KV-style Flash-Decoding tail division saves a fractional cycle tied to all of the MI300X’s 304 CUs. This is a cascade win. It’s going V in L1 released LDS that the third Q tile and fill.

Benchmark

Test run on MI300X at bf16, 128 head size. Each shape was measured in three rotational modes. RTNE rotates in near equilibrium. RTNA rotates close, bounding away from zero. RTZ decreases to zero.

Shape (B, H, S, D)	The cycle	Ours (ms)	AITER v3 (ms)	competes with AITER	competes with MAX
(2, 24, 8192, 128)	RTNE	3.083	3.792	1.23×	1.37×
(2, 24, 16384, 128)	RTNE	11.670	14.691	1.26×	1.54×
(4, 16, 16384, 128)	RTZ	15.055	16.183	1.07×	1.47×
(2, 24, 32768, 128)	RTNA	44.440	52.363	1.18×	1.57×
(1, 16, 131072, 128)	RTNE	232.517	269.278	1.16×	1.46×

Geomeans beyond sweeping love MoonMath. Compared to AITER, it scores 1.18× (RTNE), 1.15× (RTNA), and 1.08× (RTZ). Against the Modular MAX, the geomeans run from 1.44× to 1.49×, and the acceleration of each shape reaches 1.59×.

RTZ is AITER’s fastest mode and intense racing. RTZ shape (4, 16, 16384) went from 0.95× to 1.07×. KV’s tail split is what closed that final gap.

Interactive Descriptor

B crossing the columns of two sections. var phase=0, iter=”N”, play=false, timer=null, spd=950; // [phase1 cell, phase2 cell] for each row – content is fixed var CELLS={ a:[{c:’mc’,t:’PV · QK’,tag:’matrix core’},{c:’mem’,t:’softmax · V→L1′,tag:’memory’}]b:[{c:’mem’,t:’K→LDS · softmax’,tag:’memory’},{c:’mc’,t:’PV · QK’,tag:’matrix core’}]t:[{t:’K: HBM→LDS’},{t:’V prefetch→L1′}] }; function setSlot(id,cell){var el=$(‘#’+id);el.className=”slot slot “+cell.c;el.innerHTML=’‘+cell.t+’‘+cell.tag+’‘;} function paintStatic(){ setSlot(‘a0’,CELLS.a.)[0]);setSlot(‘a1’,CELLS.a[1]); setSlot(‘b0’,CELLS.b[0]);setSlot(‘b1’,CELLS.b[1]); $(‘#t0’).innerHTML=CELLS.t[0].t;$(‘#t1’).innerHTML=CELLS.t[1].t; } function paintPipe(){ // section 0 -> col0 is active (a0,b0,t0); section 1 -> col1 act (a1,b1,t1) var act=section, dim=section^1; [[‘a0′,’a1’],[‘b0′,’b1’]].forEach(function(pair){ $(‘#’+pair[act]).classList.remove(‘dim’);$(‘#’+pair[act]).classList.add(‘active’); $(‘#’+pair[dim]).classList.add(‘dim’);$(‘#’+pair[dim]).classList.remove(‘active’); }); [‘t0′,’t1’].forEach(function(id,i){$(‘#’+id).className=”seg2″+(i===act? flow’:’ dim’);}); $(‘#ph1′).className=”ph”+(phase===0?’ cur’:”); $(‘#ph2′).className=”ph”+(phase===1?’ cur’:”); $(‘#mm-iter’).textContent=”iteration “+iter+’ · stage ‘+(stage+1)+(stage===0?’ — Group A in matrix context’:’ — Group B in matrix context’); } function adv(){ section++; if(section>1){section=0;iter=(iter===’N’)?’N+1′:(iter===’N+1′?’N+2′:’N’);} paintPipe(); } $(‘#mm-step’).addEventListener(‘click’,function(){stop();adv();}); function play(){playing=true;$(‘#mm-play’).innerHTML=’❚❚ Pause’;$(‘#mm-play’).classList.remove(‘primary’); timer=setInterval(adv,spd);} function stop(){playing=false;$(‘#mm-play’).innerHTML=’► Play’;$(‘#mm-play’).classList.add(‘primary’); if(timer){clearInterval(timer);timer=null;}} $(‘#mm-play’).addEventListener(‘click’,function(){playing?stop():play();}); $(‘#mm-spd’).addEventListener(‘input’,function(){spd=1950-(+le.value);if(play){stop();play();}}); paintStatic(); paintPipe(); /* —- memory map —- */ var MEM={ q:{t:’Q tile — VGPRs, persistent’,b:’The Q tile is read iteratively and is never reloaded, so it always resides in the vector register file. Two of the three Q tiles per wave remain in the register and are hot.’}, acc:{t:’scores · O — fp32 accumulators in VGPRs’,b:’Matrix-core outputs (matrix of scores and active outputs) never leave the registers until the last store. The 16×16×16 MFMA stacks into 4 fp32 elements per lane, keeping the accumulator pressure low.’}, k:{t:’K tile – LDS, double-buffered, 32 KiB’,b:’One copy of K is shared by all eight waves and replaced by double-buffered iteration. K streams from HBM directly to LDS via direct DMA, bypassing VGPR. XOR swizzle eliminates bank collisions with zero padding.’}, q3:{t:’3rd Q tile — LDS, 32 KiB, streamed’,b:’Moving V to L1 freed 32 KB of LDS. The kernel discards it in the third Q tile (48 q rows per wave). Parked in LDS and broadcast in ping-pong buffer during QK matmul, suggests K/V reuse.’}, v:{t:’V_t tile — L1, Resident’,b:’Pre-converted V tile is kept hot in L1 and re-read in all PV matmuls. L1 is non-modifiable, so persistence is established by first fetching the next iterative lines into the dump register – the data remains in L1 as a side effect.’}, src:{t:’K / V source — HBM, played through L2′,b:’Head-first chipset swizzle Qbaps2 in each head) (19) XCD, so its K and V reside in that XCDu2019s piece of L2 instead of piercing through all eight.’} }; function showMem(k){ $all(‘.chip’).forEach(function(c){c.setAttribute(‘aria-pressed’, c.getAttribute(‘data-k’)===k ? ‘true’:’false’); }); $(‘#mm-detail .dt’).textContent=MEM[k].t; $(‘#mm-detail .db’).textContent=MEM[k].b; reportHeight(); } $all(‘.chip’).forEach(function(c){c.addEventListener(‘click’,function(){showMem(c.getAttribute(‘data-k’));});}); /* —- init —- */ drawBench(); showMem(‘q’); /* —–auto resize WordPress embeds —- */ function reportHeight(){ var h=root.offsetHeight+40; if(window.parent){ window.parent.postMessage({type:’mm-cdna3-height’,height:h},’*’);} } window.addEventListener(‘load’,reportHeight); window.addEventListener(‘resize’,reportHeight); setTimeout(reportHeight, 300); setTimeout(reportHeight, 900); })();

admin 3 hours ago

0 0 5 minutes read