
Microsoft Unveils Maia 200, FP4 and FP8 Optimized AI Inference Accelerator for Azure Datacenters

Maia 200 is Microsoft’s new AI accelerator designed for deployment in Azure datacenters. It targets the cost of generating tokens for large language models and other compute-heavy workloads by combining narrow-precision compute, a software-managed on-chip memory hierarchy and a scalable Ethernet-based fabric.

Why did Microsoft build a dedicated inference chip?

Training and inference stress hardware in different ways. Training places heavy demands on communication and sustained compute across the whole cluster. Inference cares about tokens per second, latency and tokens per dollar. Microsoft positions the Maia 200 as its most efficient inference system, delivering 30 percent better performance per dollar than the latest hardware in its fleet.
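
As a rough illustration of the tokens-per-dollar metric mentioned above, the sketch below converts a throughput figure and an hourly accelerator price into tokens per dollar. Both input numbers are hypothetical placeholders, not published Maia 200 or Azure figures.

```python
# Minimal sketch: tokens per dollar from throughput and instance price.
# The throughput and hourly price below are hypothetical placeholders,
# not published Maia 200 or Azure figures.

def tokens_per_dollar(tokens_per_second: float, cost_per_hour_usd: float) -> float:
    """Tokens generated per dollar of accelerator time."""
    return tokens_per_second * 3600 / cost_per_hour_usd

# Made-up example: 5,000 tok/s at $10 per hour.
print(f"{tokens_per_dollar(5_000, 10.0):,.0f} tokens per dollar")
# A 30 percent perf-per-dollar improvement raises this figure by the same ratio.
```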

Maia 200 is part of a heterogeneous Azure stack. It will serve multiple models, including the latest GPT 5.2 models from OpenAI, and will power workloads in Microsoft Foundry and Microsoft 365 Copilot. The Microsoft Superintelligence team will use the chip for AI workloads and reinforcement learning to improve its in-house models.

Primary silicon and numerical specifications

Each Maia 200 die is made with TSMC’s 3 nanometer process. The chip includes more than 140 billion transistors.

The compute pipeline is built around native FP8 and FP4 tensor cores. A single chip delivers over 10 petaFLOPS in FP4 and over 5 petaFLOPS in FP8, within a 750W SoC TDP envelope.
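
A quick back-of-envelope check of the efficiency these figures imply; since the throughput numbers are quoted as "over" values, treat the results as rough lower bounds.

```python
# Efficiency implied by the quoted specs: >10 PFLOPS FP4, >5 PFLOPS FP8, 750 W SoC TDP.
FP4_PFLOPS = 10.0
FP8_PFLOPS = 5.0
TDP_WATTS = 750.0

fp4_tflops_per_watt = FP4_PFLOPS * 1000 / TDP_WATTS   # ~13.3 TFLOPS per watt
fp8_tflops_per_watt = FP8_PFLOPS * 1000 / TDP_WATTS   # ~6.7 TFLOPS per watt
print(f"FP4: {fp4_tflops_per_watt:.1f} TFLOPS/W, FP8: {fp8_tflops_per_watt:.1f} TFLOPS/W")
```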

The memory is split between stacked HBM and on-die SRAM. The Maia 200 offers 216 GB of HBM3e with approximately 7 TB per second of bandwidth and 272 MB of on-die SRAM. The SRAM is organized into tile-level SRAM and cluster-level SRAM and is fully managed by software. Compilers and runtimes can explicitly place operand sets so that attention and GEMM kernels keep their data close to the compute units.
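
To see why software-managed SRAM tiers matter, here is a minimal roofline-style sketch using the quoted HBM bandwidth and FP4 peak. It only illustrates the bandwidth-versus-compute trade-off; it is not a performance model of the actual chip.

```python
# Roofline sketch from the quoted figures: ~7 TB/s HBM3e, >10 PFLOPS FP4.
HBM_BANDWIDTH = 7e12        # bytes per second
PEAK_FP4 = 10e15            # FLOP per second

def attainable_flops(arithmetic_intensity: float) -> float:
    """Roofline: min(peak compute, bandwidth * FLOPs-per-byte)."""
    return min(PEAK_FP4, HBM_BANDWIDTH * arithmetic_intensity)

# Machine balance: kernels below this FLOPs/byte ratio are bandwidth-bound.
balance = PEAK_FP4 / HBM_BANDWIDTH
print(f"machine balance ≈ {balance:.0f} FLOPs per byte from HBM")

# A single-token decode GEMV reads each weight once (~2 FLOPs per FP4 weight,
# i.e. ~4 FLOPs per byte), far below the balance point, so it is bandwidth-bound
# unless its data is served from on-die SRAM instead of HBM.
print(f"decode GEMV upper bound ≈ {attainable_flops(4) / 1e12:.0f} TFLOP/s")
```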

Tile-based microarchitecture and memory layout

The Maia 200 microarchitecture is hierarchical. The base unit is the tile, a small self-contained unit of compute and memory on the die. Each tile includes a Tile Tensor Unit for high matrix throughput and a Tile Vector Processor as a programmable SIMD engine. The tile SRAM feeds both units, and the tile’s DMA engines move data in and out of that SRAM without stalling the compute units. The Tile Control Processor sequences the tensor and DMA operations.
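
The sketch below illustrates the double-buffering pattern that per-tile DMA engines make possible: fetching the next operand tile into SRAM while the tensor unit works on the current one. It uses plain Python with NumPy stand-ins for the DMA and tensor unit, not a real Maia programming interface, and it only shows the buffer-rotation logic; on real hardware the transfer and the compute would overlap in time.

```python
# Conceptual double-buffering sketch. dma_copy and tensor_matmul are plain-Python
# stand-ins for the tile DMA engines and the Tile Tensor Unit, not real APIs.
import numpy as np

def dma_copy(src):                       # stand-in for a tile-DMA transfer into SRAM
    return np.array(src, copy=True)

def tensor_matmul(weights, activation):  # stand-in for the Tile Tensor Unit
    return weights @ activation

def run_tiled_matmul(weight_tiles, activation):
    sram = [dma_copy(weight_tiles[0]), None]            # two SRAM buffers
    current, prefetch = 0, 1
    outputs = []
    for i in range(len(weight_tiles)):
        if i + 1 < len(weight_tiles):                   # prefetch tile i+1
            sram[prefetch] = dma_copy(weight_tiles[i + 1])
        outputs.append(tensor_matmul(sram[current], activation))
        current, prefetch = prefetch, current           # rotate buffers
    return outputs

tiles = [np.random.rand(4, 8) for _ in range(3)]
x = np.random.rand(8)
print([out.shape for out in run_tiled_matmul(tiles, x)])
```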

Multiple tiles form a cluster. Each cluster exposes a large multi-bank Cluster SRAM that is shared across all tiles in that cluster. Cluster DMA engines move data between the Cluster SRAM and the HBM stacks. The cluster control processor coordinates work across the tiles and uses data-reuse schemes in SRAM to improve utilization while keeping the programming model the same.

This hierarchy allows the software stack to pin different parts of the model to different tiers. For example, attention kernels can keep the Q, K and V tensors in tile SRAM, while larger kernels can stage their payloads in cluster SRAM to reduce HBM pressure. The design goal is sustained high efficiency as models grow in size and sequence length.
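
As a concrete illustration of tier pinning, the sketch below checks whether one attention head’s Q, K and V activations fit a given SRAM budget and falls back to the next tier otherwise. The per-tile and per-cluster capacities are hypothetical; the article only gives the 272 MB total on-die SRAM figure, not the per-tier split.

```python
# Working-set check for one attention head. SRAM budgets below are hypothetical.

def qkv_bytes(seq_len: int, head_dim: int, bytes_per_elem: float = 1.0) -> int:
    """Q, K, V for one attention head at the given precision (FP8 = 1 byte)."""
    return int(3 * seq_len * head_dim * bytes_per_elem)

TILE_SRAM_BUDGET = 2 * 1024 * 1024       # hypothetical 2 MB per tile
CLUSTER_SRAM_BUDGET = 32 * 1024 * 1024   # hypothetical 32 MB per cluster

for seq_len in (2_048, 16_384, 131_072):
    need = qkv_bytes(seq_len, head_dim=128)
    tier = ("tile SRAM" if need <= TILE_SRAM_BUDGET
            else "cluster SRAM" if need <= CLUSTER_SRAM_BUDGET
            else "HBM")
    print(f"seq {seq_len:>7}: {need / 1e6:6.1f} MB -> {tier}")
```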

On-chip data movement and Ethernet fabric scale-up

Inference is often limited by data movement, not raw compute. The Maia 200 uses a custom on-chip network and an array of DMA engines. The on-chip network connects tiles, clusters, memory controllers and I/O units. It has separate planes for large tensor traffic and small control messages, which keeps synchronization messages from being blocked behind large transfers.

Beyond the chip boundary, the Maia 200 integrates its own NIC and a scale-up Ethernet network using the AI Transport Layer protocol. The on-die NIC exposes approximately 1.4 TB per second per direction, or 2.8 TB per second of total bandwidth, and scales to 6,144 accelerators in a two-tier domain.
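
For a back-of-envelope sense of what 1.4 TB per second per direction means for moving activations or weights between accelerators, the sketch below divides a payload size by the link bandwidth. The payload size is made up, and the estimate ignores protocol overhead, congestion and collective-algorithm costs.

```python
# Transfer-time estimate over the on-die NIC, per the quoted ~1.4 TB/s per direction.
NIC_BW_PER_DIRECTION = 1.4e12   # bytes per second

def transfer_time_us(num_bytes: float) -> float:
    """Idealized one-way transfer time in microseconds."""
    return num_bytes / NIC_BW_PER_DIRECTION * 1e6

# Example with a made-up 216 MB payload of FP8 activations.
payload = 216e6
print(f"{transfer_time_us(payload):.0f} microseconds for {payload / 1e6:.0f} MB")
```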

Inside each tray, four Maia accelerators form a Fully Connected Quad. These four devices have direct, unswitched links to each other. Most parallel traffic stays within this group, while only a small amount of traffic goes out to the switches. This improves latency and reduces the number of switch ports required for common inference parallelism patterns.
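
A toy routing rule makes the Fully Connected Quad idea concrete: peers inside a quad of four talk over direct links, while traffic to any other accelerator goes through the Ethernet switches. The device numbering is purely illustrative.

```python
# Toy routing rule for Fully Connected Quads: devices are numbered so that each
# consecutive group of four forms one quad. Illustrative only, not Azure's scheme.

def route(src: int, dst: int) -> str:
    same_quad = (src // 4) == (dst // 4)
    return "direct quad link" if same_quad else "via Ethernet switch"

print(route(0, 3))   # same quad  -> direct quad link
print(route(0, 5))   # other quad -> via Ethernet switch
```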

Azure system integration and cooling

At the system level, the Maia 200 follows the same rack, power and hardware standards as Azure GPU servers. It supports air cooling and liquid cooling and uses a second-generation closed-loop liquid-cooling Heat Exchanger Unit at the rack level. This allows mixed deployments of GPUs and Maia accelerators in the same datacenter footprint.

The accelerator also plugs into the Azure control plane. Firmware management, health monitoring and telemetry use the same workflows as other Azure compute services. This enables fleet-wide deployment and maintenance without disrupting running AI workloads.

Key Takeaways

Here are the key technical takeaways:

  • Inference-first design: Maia 200 is Microsoft’s first silicon and system platform built exclusively for AI inference, optimized for high-volume token generation in modern reasoning models and large language models.
  • Numerics and memory organization: The chip is built on TSMC’s 3 nm process, includes about 140 billion transistors and delivers more than 10 PFLOPS of FP4 and more than 5 PFLOPS of FP8, with 216 GB of HBM3e at about 7 TB per second and 272 MB of on-chip SRAM divided into tile SRAM and cluster SRAM and managed by software.
  • Performance compared to other cloud accelerators: Microsoft reports 30 percent better performance per dollar than the newest hardware in its Azure fleet and claims 3 times the FP4 performance of the third-generation Amazon Trainium and higher FP8 performance than Google TPU v7 at the accelerator level.
  • Tile-based architecture with an Ethernet fabric: The Maia 200 organizes compute into tiles and clusters with local SRAM, DMA engines and a network on chip, and exposes an integrated NIC with approximately 1.4 TB per second per direction of Ethernet bandwidth, scaling to 6,144 accelerators and using Fully Connected Quad groups as the local parallelism domain.


Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.
