NVIDIA AI Releases Gated DeltaNet-2: A Separate Attention Layer That Decouples and Writes on the Delta Law

Linear attention replaces the infinite KV cache of softmax attention with an iterative form of fixed size. This reduces sequence mixing to linear time and recording in non-volatile memory. The hard part is not to forget. It is a way of organizing repressed memory without criticizing existing associations.
NVIDIA has been released Gated DeltaNet-2a specific attention layer that targets that problem. The model divides the working memory arrangement into two intelligent gates per channel. It is trained on 1.3B parameters on 100B FineWeb-Edu tokens. It outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 in the entire research benchmark suite.
The scalar gate problem in delta-rule models
The iterative attention layer maintains the matrix state St and read it along with the question. DeltaNet adds active sorting by extracting the value currently associated with the current key. It uses a scalar step size βt control how much you have to write on it. Mamba-2 adds data-dependent scaling decomposition αt by forgetting the whole world. Gated DeltaNet combines both operations, but both gates remain scalar per head.
Kimi Delta Attention (KDA) refines the decay side. It replaces the scale αt with channel wise vector. KDA still maintains one scale βt active planning. That scale controls two different things at once. Determines how much old content to delete on the key side. It also determines how much new content to create on the value side. These two decisions apply to different sectors of the government. Coherence is a limitation of models, not a property of the delta law.

Gated Delta Rule-2: two gates instead of one
Gated DeltaNet-2 separates the two decisions with Gated Delta Rule-2. Introduces a clearing gate that uses a channel bt ∈ [0,1]dk on the key axis. It also introduces a write gateway that uses the channel wt ∈ [0,1]dv on the value axis. Both gates are generated by sigmoid projections of the token representation. The update works to decay before the active editing.
Written together, the repetition is:
St = (I − kt (bt ⊙ kt)⊤) Dt St−1 + kt (wt ⊙ vt)⊤
Here Dt = Diag(αt) channel wise decomposition which is transferred to KDA. The leftmost element of the deletion matrix remains ktmaintains the direction of writing the delta rule. The right factor becomes bt ⊙ ktwhich makes the read direction channel selectable. Writing term kt zt⊤ uses zt = wt ⊙ vtwhich makes the value update channel select.
When both gates fall on the same scale βtthe update restores KDA directly. When decay αt and falls to a scalar, obtaining a Gated DeltaNet. Both previous models are kept as captive sub-bases for the new update.
For fast weighting, Gated Delta Rule-2 online one-step gradient in local regression loss. The decay state remains close to memory, while residual programming uses learned and gated targets.
Chunkwise training and back gate awareness
Replication adopts a small WY form similar to the structure used by KDA. The cumulative decomposition of the channel frame is centered on two deletions for each rank. The update of each chunk is a product of asymmetric matrices of the form I − k̄r ēr⊤. Usage uses the chunk size C = 64 with integrated Triton kernels.
In retrospect, the scalar interrupt used by KDA is no longer valid. The write side contains a separate diagonal gate over the value channels. The eraser side contains a separate diagonal gate over the key channels. Therefore the gate properties must appear within the dot products that accumulate the gradients. The paper derives this vector-Jacobian product known from the gate clearly. On Hopper GPUs, the combined WY of the backward kernel is limited to two and four warps to avoid the assertion of the Triton WGMMA structure.
Block design and hybrid model
Gated DeltaNet-2 is used as a common token combiner in a common Transformer style block. Key questions and methods use linear regression, short causal variables, SiLU, and L2 normalization. Value routing uses linear approximation, short transform, and SiLU. Decay αtclear the gate btand write the gateway wt they appear in different linear branches. The recurrent output is RMS-normalized, multiplied by the SiLU output gate, and detrended.
A mixed variant includes Sliding-Window Attention (SWA) after the continuous mixer. The replicated cell contains Gated DeltaNet-2, MLP, SWA, and another MLP. SWA handles direct local interactions, while the iterative mixture suppresses long histories. The hybrid maintains a sequential scale with a bounded attention cache.
Results in 1.3B parameters
All 1.3B parameter models were trained on 100B FineWeb-Edu tokens. Parameter estimates and common condition sizes were matched across models. The persistent state holds 262,144 floats per layer per heap element. The training length of the tokens is 4K, and hybrid models use a 2K SWA window. The base Mamba-3 MIMO uses standard R = 4.
In language processing and logical reasoning, Gated DeltaNet-2 scores best in both settings. The continuous model is between 53.11 for all LAMBADA and the thinking suite. That sits above Mamba-3 MIMO at 52.39 and KDA at 52.28. In the hybrid configuration, Gated DeltaNet-2 averaged 53.97 compared to Mamba-3 MIMO at 52.72. Since the standard state size is not specified, the benefit points to the update rule, not the additional memory.
The most obvious benefits come from RULER long content retrieval. In the continuous setting, the S-NIAH-2 in 4K increases from 89.0 (KDA) to 93.0. NIAH-3 at 2K jumped from 63.2 (KDA) to 89.8. MK-NIAH-1 in 4K increases from 28.0 (KDA) to 37.8.
In real-world returns (SWDE, SQuAD, FDA, TriviaQA, NQ, DROP), Gated DeltaNet-2 also leads both settings. The normal average is 29.88 and the hybrid average is 42.28.
Marktechpost Visual Explainer
Company MARKTECHPOST · Hub for AI research, dev tools, and model launch
Key Takeaways
- Gated DeltaNet-2 divides the scale βt enter the smart wipe gate
bt(key axis) and the write gate that uses the channelwt(value axis). - The update finds KDA when both gates fall on the same scale, and Gated DeltaNet when the decomposition breaks down.
- Training remains consistent in chunkwise WY form, with channel-wise decay focused on uneven wipes and a reverse gate integrated into the Triton.
- For 1.3B parameters in 100B FineWeb-Edu with simulated region size, it has a better rate than Mamba-2, Gated DeltaNet, KDA, and Mamba-3 in both iterative and hybrid settings.
- The biggest gains come with RULER longer content retrieval — S-NIAH-3 in 2K increases by 63.2 → 89.8 and MK-NIAH-1 in 4K increases by 28.0 → 37.8 over KDA (normal).
Check it out Paper again Repo. Also, feel free to follow us Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to Our newspaper. Wait! are you on telegram? now you can join us on telegram too.
Need to work with us on developing your GitHub Repo OR Hug Face Page OR Product Release OR Webinar etc.? contact us



