Tilde Research Introduces Aurora: A Leverage-Aware Optimizer That Fixes a Hidden Neuron-Death Problem in Muon

Tilde Research has released Aurora, a new optimizer for training neural networks that addresses a structural problem in the widely used Muon optimizer: a flaw that silently kills a significant proportion of MLP neurons during training and keeps them dead for good. Aurora ships with a 1.1B-parameter pretraining run, a new state-of-the-art result on the modded-nanoGPT speedrun benchmark, and open-source code.
What Is Muon?
To understand Aurora, it helps to understand Muon first. The Muon optimizer attracted attention in the ML community after beating AdamW on wall-clock time in the nanoGPT speedrun competition, a public benchmark that measures how fast you can train a GPT-style model to a target validation loss. Since then, Muon has been adopted by several research groups for training frontier-scale models.
Muon's key algorithmic step is computing the polar factor of the gradient matrix. Given a gradient matrix G with singular value decomposition (SVD) G = UΣVᵀ, Muon computes polar(G) = UVᵀ, the closest semi-orthogonal matrix to G in the Frobenius norm. This orthogonalized gradient is then used to update the weights: W ← W − η UVᵀ, where η is the learning rate. Computing the polar factor with matmul-only iterative algorithms is what makes Muon practical at scale.
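To make this concrete, here is a minimal PyTorch sketch (illustrative only, not Tilde's or the Muon authors' released code): a reference SVD-based polar factor, a textbook cubic Newton-Schulz approximation of the kind Muon builds on, and the resulting weight update. The matrix sizes and learning rate are arbitrary.

```python
import torch

def polar_factor_svd(G: torch.Tensor) -> torch.Tensor:
    # Reference computation: if G = U diag(S) V^T, then polar(G) = U V^T,
    # the closest semi-orthogonal matrix to G in the Frobenius norm.
    U, S, Vh = torch.linalg.svd(G, full_matrices=False)
    return U @ Vh

def polar_factor_newton_schulz(G: torch.Tensor, steps: int = 10) -> torch.Tensor:
    # Matmul-only approximation via the textbook cubic Newton-Schulz iteration.
    # Muon uses a tuned variant of this idea; this version is only illustrative.
    X = G / G.norm()                      # bring all singular values into (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

# Muon-style update for one weight matrix (momentum omitted for brevity):
W = torch.randn(2048, 512)                # a tall matrix: many neurons, fewer inputs
G = torch.randn_like(W)                   # stand-in for its gradient
eta = 0.02                                # illustrative learning rate
W = W - eta * polar_factor_svd(G)         # W <- W - eta * U V^T
```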
The NorMuon Puzzle: Row Normalization Helps, But Why?
Before Aurora, NorMuon led the modded-nanoGPT speedrun. It adds a row-normalization step, similar to Adam's per-parameter scaling, that rescales each row of the polar factor by its inverse RMS norm. Even though this pulls the update away from a strictly orthogonal one, NorMuon still produces impressive results. The Tilde team set out to understand exactly which gap in Muon this row normalization was papering over.
Key Problem: Row-Norm Anisotropy and Neuron Death in Tall Matrices
The research team found that the Muon optimizer "kills" a large proportion of neurons in tall weight matrices, such as those in SwiGLU-based MLP layers. Because the exact polar factor of a tall gradient matrix generally does not have equal row norms, the optimizer ends up handing large updates to some neurons while barely touching others. The result is a "cycle of death" in which under-served neurons receive less and less signal over time and eventually become permanently inactive.
The team's experiments show that within 500 training steps, more than one in four neurons has effectively died. This is not just a local issue: the inactive neurons starve the following layers of signal, spreading the dysfunction through the model. Aurora fixes this with a new mathematical formulation that gives every neuron an equally sized update without sacrificing the benefits of orthogonalization.
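A toy diagnostic (an illustrative sketch, not the paper's measurement code) makes the anisotropy easy to see: take the polar factor of a random tall "gradient" and look at how unevenly its row norms are spread.

```python
import torch

torch.manual_seed(0)
m, n = 2048, 512                          # tall: rows = neurons, columns = inputs
G = torch.randn(m, n)                     # stand-in for an up/gate-projection gradient

U, S, Vh = torch.linalg.svd(G, full_matrices=False)
O = U @ Vh                                # Muon's polar-factor update

row_norms = O.norm(dim=1)
ideal = (n / m) ** 0.5                    # row norm if the update were perfectly even
print("mean row norm :", row_norms.mean().item(), "ideal ~", ideal)
print("min/max ratio :", (row_norms.min() / row_norms.max()).item())
# Rows whose norm sits far below the mean receive proportionally tiny updates,
# step after step -- the feedback loop the article calls the "cycle of death".
```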
Before arriving at Aurora, the study introduces an intermediate fix. The key observation is that NorMuon normalizes each row to unit norm, but that is the wrong target for a tall matrix: for a tall column-orthogonal matrix, the statistically correct average row norm is √(n/m), not 1. The corrected variant therefore normalizes the rows to norm √(n/m) instead of 1.
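Here is a small sketch of the two row targets as the article describes them (again illustrative, not the NorMuon or Tilde implementations):

```python
import torch

def rescale_rows(O: torch.Tensor, target: float, eps: float = 1e-8) -> torch.Tensor:
    # Force every row of the update to the same norm `target`.
    return O * (target / (O.norm(dim=1, keepdim=True) + eps))

m, n = 2048, 512
U, S, Vh = torch.linalg.svd(torch.randn(m, n), full_matrices=False)
O = U @ Vh                                      # Muon's polar factor

O_unit = rescale_rows(O, target=1.0)            # NorMuon-style target, as described above
O_sqrt = rescale_rows(O, target=(n / m) ** 0.5) # corrected sqrt(n/m) target

# The sqrt(n/m) target preserves the overall scale of a column-orthogonal update
# (Frobenius norm ~ sqrt(n)); the unit-norm target inflates it to ~ sqrt(m).
print(O.norm().item(), O_sqrt.norm().item(), O_unit.norm().item())
```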
In tests at the 340M scale, this corrected variant outperforms both Muon and plain NorMuon and completely eliminates neuron death: the row leverage scores stay nearly isotropic throughout training. Interestingly, the benefit also propagates to layers the fix does not touch directly: keeping the up/gate rows alive yields isotropic gradient flow into the down-projection, stabilizing its column norms without any direct intervention.
However, this approach still has a flaw: it rescales the rows of the polar factor after the fact, sacrificing the exactness of the orthogonalization, which is both theoretically undesirable and empirically costly in the Muon framework (the paper shows that Muon reaches lower loss when the orthogonalization is exact). This is the motivation for Aurora.
Aurora: Steepest Descent Under Two Joint Constraints
Aurora reframes the update-selection problem from scratch. Instead of orthogonalizing and then patching the result with row normalization, Aurora asks: what is the steepest-descent update under the joint constraints of left semi-orthogonality and equal row norms?
Formally, for a tall m × n gradient matrix G (m > n), Aurora solves: maximize ⟨G, O⟩ over updates O, subject to OᵀO = I (left semi-orthogonality) and every row of O having norm √(n/m) (equal row norms).
The paper shows that these two constraints together force all singular values of the update to be exactly 1. In other words, the joint constraint still yields a valid left semi-orthogonal update, not a corrupted one. This is the key insight that separates Aurora from NorMuon and the √(n/m) variant: it achieves equal row norms and orthogonality simultaneously instead of trading one off against the other.
The study provides two algorithmic implementations of the Aurora solution. Riemannian Aurora uses a projected-gradient method on the joint Stiefel/equal-row-leverage manifold, while vanilla Aurora is a simpler, practical implementation. Both are open source. For non-tall (wide and square) matrices, the row norms are already determined by orthogonality, so Aurora leaves those parameters unchanged.
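The released solvers are not reproduced here. Purely as an illustration that the two constraints can be satisfied at once, the toy below alternates between re-orthogonalizing and equalizing row norms (a generic alternating-projections heuristic, explicitly not the Riemannian or vanilla Aurora algorithms) and then checks both properties.

```python
import torch

def polar(X: torch.Tensor) -> torch.Tensor:
    U, S, Vh = torch.linalg.svd(X, full_matrices=False)
    return U @ Vh                              # project onto semi-orthogonal matrices

def equalize_rows(X: torch.Tensor, target: float) -> torch.Tensor:
    return X * (target / (X.norm(dim=1, keepdim=True) + 1e-8))

torch.manual_seed(0)
m, n = 2048, 512
G = torch.randn(m, n)
target = (n / m) ** 0.5

# Generic alternating-projections heuristic (NOT the paper's solvers):
O = polar(G)
for _ in range(50):
    O = polar(equalize_rows(O, target))

sv = torch.linalg.svdvals(O)
rn = O.norm(dim=1)
print("singular values :", sv.min().item(), "to", sv.max().item())  # ~1, semi-orthogonal by construction
print("row norms       :", rn.min().item(), "to", rn.max().item())  # close to sqrt(n/m)
```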
Results
Aurora was used to train a 1.1B-parameter model that achieves 100x data efficiency on open-source internet data and outperforms larger models on common evals such as HellaSwag. At the 1B scale, Aurora shows significant gains over both Muon and NorMuon. In the modded-nanoGPT speedrun, Aurora's submission surpasses the previous state of the art (held by NorMuon). Even untuned, Aurora carries only about 6% more compute overhead than standard Muon and is designed as a drop-in replacement.
The research team also found that Aurora's gains grow with MLP width, suggesting it works best in networks with large MLP expansion factors. This is consistent with the neuron-death hypothesis: wider MLPs mean taller matrices and more opportunity for row-norm anisotropy to accumulate.
Key Takeaways
- Muon’s polar-factor update inherits row-leverage anisotropy in tall matrices, causing more than 25% of MLP neurons to die permanently within 500 training steps.
- Aurora fixes this by finding the steepest-descent update under the joint constraints of left semi-orthogonality and equal row norms, achieving both at once rather than trading one off against the other.
- At the 1.1B scale, Aurora achieves 100x data efficiency on open-source internet data, outperforms larger models on HellaSwag, and sets a new SoTA in the modded-nanoGPT speedrun.
- Aurora is a drop-in replacement for Muon with only about 6% compute overhead, and its gains grow with MLP width.
Check out the Paper and the GitHub Repo for more details.



