NVIDIA AI Releases Gated DeltaNet-2: A Linear Attention Layer That Decouples Erase and Write in the Delta Rule

admin May 24, 2026

Linear attention replaces the unbounded KV cache of softmax attention with a fixed-size recurrent state. This reduces sequence mixing to linear time and decoding to constant memory. The hard part is not forgetting. It is editing the compressed memory without corrupting existing associations.

NVIDIA has released Gated DeltaNet-2, a linear attention layer that targets that problem. The model splits the active memory edit into two channel-wise gates. It is trained at 1.3B parameters on 100B FineWeb-Edu tokens. It outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 across the reported benchmark suite.

The scalar gate problem in delta-rule models

A recurrent attention layer maintains a matrix state S_t and reads it with the query. DeltaNet adds active editing by removing the value currently associated with the current key. It uses a scalar step size β_t to control how much to overwrite. Mamba-2 adds a data-dependent scalar decay α_t for global forgetting. Gated DeltaNet combines both operations, but both gates remain scalar per head.

Kimi Delta Attention (KDA) refines the decay side. It replaces the scalar α_t with a channel-wise vector. KDA still keeps a single scalar β_t for the active edit. That scalar controls two different things at once. It determines how much old content to erase on the key side. It also determines how much new content to write on the value side. These two decisions act on different axes of the state. The coupling is a limitation of the models, not a property of the delta rule.

Gated Delta Rule-2: two gates instead of one

Gated DeltaNet-2 separates the two decisions with Gated Delta Rule-2. It introduces a channel-wise erase gate b_t ∈ [0,1]^{d_k} on the key axis. It also introduces a channel-wise write gate w_t ∈ [0,1]^{d_v} on the value axis. Both gates are produced by sigmoid projections of the token representation. The update applies the decay before the active edit.

Written together, the recurrence is:

S_t = (I − k_t (b_t ⊙ k_t)^⊤) D_t S_{t−1} + k_t (w_t ⊙ v_t)^⊤

Here D_t = Diag(α_t) is the channel-wise decay inherited from KDA. The left factor of the erase matrix remains k_t, which preserves the write direction of the delta rule. The right factor becomes b_t ⊙ k_t, which makes the read direction channel-selective. The write term k_t z_t^⊤ uses z_t = w_t ⊙ v_t, which makes the value update channel-selective.
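As a minimal sketch of a single step of this recurrence, the NumPy code below applies the decay, the gated erase, and the gated write in that order. It assumes a state of shape (d_k, d_v) and a read-out of the form S_t^⊤ q_t; the projection matrices W_b and W_w, the toy dimensions, and the function names are hypothetical and are not taken from the official implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_delta_rule2_step(S_prev, k, v, alpha, b, w):
    """One step of S_t = (I - k (b*k)^T) Diag(alpha) S_prev + k (w*v)^T.

    S_prev: (d_k, d_v) state, k: (d_k,), v: (d_v,)
    alpha: (d_k,) channel-wise decay, b: (d_k,) erase gate, w: (d_v,) write gate
    """
    S_decayed = alpha[:, None] * S_prev        # D_t S_{t-1}
    read = (b * k) @ S_decayed                 # (b ⊙ k)^T D_t S_{t-1}, shape (d_v,)
    S_erased = S_decayed - np.outer(k, read)   # gated erase along write direction k
    return S_erased + np.outer(k, w * v)       # gated write of the new value

def read_state(S, q):
    # Assumed read-out convention: o_t = S_t^T q_t
    return S.T @ q

rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 3                    # toy sizes, for illustration only
x = rng.normal(size=d_model)                   # token representation
W_b = rng.normal(size=(d_model, d_k))          # hypothetical erase-gate projection
W_w = rng.normal(size=(d_model, d_v))          # hypothetical write-gate projection
b = sigmoid(x @ W_b)                           # erase gate in [0, 1]^{d_k}
w = sigmoid(x @ W_w)                           # write gate in [0, 1]^{d_v}

k = rng.normal(size=d_k)
k /= np.linalg.norm(k)                         # L2-normalized key
v = rng.normal(size=d_v)
alpha = rng.uniform(0.9, 1.0, size=d_k)
S0 = rng.normal(size=(d_k, d_v))
S1 = gated_delta_rule2_step(S0, k, v, alpha, b, w)
```

With all gates and the decay set to one and a unit-norm key, this step reduces to the plain delta rule, and reading the new state back with k returns v exactly.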

When both gates collapse to the same scalar β_t, the update recovers KDA exactly. When the decay α_t also collapses to a scalar, it recovers Gated DeltaNet. Both earlier models are retained as special cases of the new update.
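The two reductions can be checked by direct substitution into the recurrence above. The forms below are obtained only from that substitution; the all-ones vector \mathbf{1} is notation introduced here, and in the second line α_t is reused as a scalar.

```latex
% Case 1: b_t = \beta_t \mathbf{1}, \; w_t = \beta_t \mathbf{1} (single scalar edit gate)
S_t = \left(I - \beta_t\, k_t k_t^{\top}\right) \mathrm{Diag}(\alpha_t)\, S_{t-1}
      + \beta_t\, k_t v_t^{\top}

% Case 2: additionally \mathrm{Diag}(\alpha_t) = \alpha_t I (scalar decay)
S_t = \alpha_t \left(I - \beta_t\, k_t k_t^{\top}\right) S_{t-1}
      + \beta_t\, k_t v_t^{\top}
```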

From a fast-weight perspective, Gated Delta Rule-2 is a one-step online gradient update on a local regression loss. The decayed state stays close to the memory, while the residual edit uses learned, gated targets.
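One way to make the residual structure visible is to expand the recurrence algebraically. The symbol \tilde{S}_t for the decayed state is introduced here for illustration; the exact loss used in the paper is not reproduced.

```latex
\tilde{S}_t = D_t\, S_{t-1}

S_t = \tilde{S}_t
      - k_t \left( \tilde{S}_t^{\top} (b_t \odot k_t) - w_t \odot v_t \right)^{\top}
```

Read this way, the state is first decayed, then corrected along the direction k_t by the gap between what the gated key reads from the decayed state and the gated target w_t ⊙ v_t.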

Chunkwise training and gate-aware backward pass

The recurrence admits a compact WY form similar to the structure used by KDA. The cumulative channel-wise decay is absorbed into the two erase factors of each rank-one term. Each chunk's update is a product of asymmetric matrices of the form I − k̄_r ē_r^⊤. The implementation uses chunk size C = 64 with fused Triton kernels.
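The idea behind a compact WY form can be illustrated on a toy chunk: a product of asymmetric rank-one factors collapses to the identity minus a single low-rank term, found by one triangular solve. The sketch below assumes the decay has already been absorbed into the rows of K and E, uses a general solver instead of a dedicated triangular one, and is not the paper's Triton kernel; function names are hypothetical.

```python
import numpy as np

def wy_factor(K, E):
    """Return W such that prod_r (I - k_r e_r^T) = I - K^T W.

    K, E: (C, d) arrays whose rows are k_r and e_r for one chunk.
    Later tokens sit on the left of the product, matching S_t = A_t S_{t-1}.
    """
    C = K.shape[0]
    L = np.tril(E @ K.T, k=-1)                 # L[r, i] = e_r . k_i for i < r
    return np.linalg.solve(np.eye(C) + L, E)   # (I + L) W = E, unit lower triangular

def sequential_product(K, E):
    """Reference: multiply the rank-one factors one token at a time."""
    d = K.shape[1]
    P = np.eye(d)
    for k_r, e_r in zip(K, E):
        P = (np.eye(d) - np.outer(k_r, e_r)) @ P
    return P

rng = np.random.default_rng(0)
C, d = 8, 4                                    # toy chunk size and key dimension
K = rng.normal(size=(C, d)) / np.sqrt(d)
E = rng.normal(size=(C, d)) / np.sqrt(d)
P_seq = sequential_product(K, E)
P_wy = np.eye(d) - K.T @ wy_factor(K, E)
```

The sequential loop needs C dependent matrix products, while the WY route replaces them with matrix multiplications and one small triangular system, which is what makes chunkwise training hardware-friendly.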

In the backward pass, the scalar shortcut used by KDA is no longer valid. The write side contains a separate diagonal gate over the value channels. The erase side contains a separate diagonal gate over the key channels. The gate terms must therefore appear inside the dot products that accumulate the gradients. The paper derives this gate-aware vector-Jacobian product explicitly. On Hopper GPUs, the fused WY backward kernel is limited to two and four warps to avoid a Triton WGMMA layout assertion.

Block design and hybrid model

Gated DeltaNet-2 is used as a drop-in token mixer in a standard Transformer-style block. The query and key paths use a linear projection, a short causal convolution, SiLU, and L2 normalization. The value path uses a linear projection, a short convolution, and SiLU. The decay α_t, the erase gate b_t, and the write gate w_t come from separate linear branches. The recurrent output is RMS-normalized, multiplied by a SiLU output gate, and projected.
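A rough NumPy sketch of the query/key path described above follows. The depthwise form of the convolution, the kernel width of 4, the epsilon, and all names are assumptions made for illustration; the source only says the convolution is short and causal.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def causal_depthwise_conv(x, kernel):
    """x: (T, d) sequence, kernel: (K, d) per-channel filter taps.

    The output at step t mixes only x[t-K+1 .. t]; the sequence is
    zero-padded on the left so no future token leaks in.
    """
    K = kernel.shape[0]
    T = x.shape[0]
    padded = np.vstack([np.zeros((K - 1, x.shape[1])), x])
    return sum(kernel[j] * padded[j:j + T] for j in range(K))

def qk_path(x, W_proj, kernel, eps=1e-6):
    """Linear projection -> short causal conv -> SiLU -> L2 normalization."""
    h = silu(causal_depthwise_conv(x @ W_proj, kernel))
    return h / (np.linalg.norm(h, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
T, d_model, d_k = 6, 8, 4                      # toy sizes
x = rng.normal(size=(T, d_model))
W_k = rng.normal(size=(d_model, d_k))          # hypothetical key projection
conv_k = rng.normal(size=(4, d_k))             # assumed kernel width of 4
keys = qk_path(x, W_k, conv_k)                 # (T, d_k), rows have unit L2 norm
```

The value path would follow the same pattern without the final normalization.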

A hybrid variant adds Sliding-Window Attention (SWA) after the recurrent mixer. The repeated cell contains Gated DeltaNet-2, an MLP, SWA, and another MLP. SWA handles direct local interactions, while the recurrent mixer compresses long histories. The hybrid retains linear sequence scaling with a bounded attention cache.

Results at 1.3B parameters

All 1.3B-parameter models were trained on 100B FineWeb-Edu tokens. Parameter counts and recurrent state sizes were matched across models. The recurrent state holds 262,144 floats per layer per batch element. The training sequence length is 4K tokens, and hybrid models use a 2K SWA window. The Mamba-3 MIMO baseline uses rank R = 4.

In language modeling and commonsense reasoning, Gated DeltaNet-2 scores best in both settings. The recurrent model averages 53.11 across LAMBADA and the reasoning suite. That sits above Mamba-3 MIMO at 52.39 and KDA at 52.28. In the hybrid configuration, Gated DeltaNet-2 averages 53.97 compared to Mamba-3 MIMO at 52.72. Since the recurrent state size is matched, the gain points to the update rule, not to additional memory.

The clearest gains come from RULER long-context retrieval. In the recurrent setting, S-NIAH-2 at 4K rises from 89.0 (KDA) to 93.0. S-NIAH-3 at 2K jumps from 63.2 (KDA) to 89.8. MK-NIAH-1 at 4K rises from 28.0 (KDA) to 37.8.

In real-world recall (SWDE, SQuAD, FDA, TriviaQA, NQ, DROP), Gated DeltaNet-2 also leads in both settings. The recurrent average is 29.88 and the hybrid average is 42.28.

Marktechpost Visual Explainer

NVIDIA · 2026

Gated DeltaNet-2

Decoupling Erase and Write in Linear Attention. A delta-rule recurrent attention layer with channel-wise erase and write gates.

PyTorch
Triton kernels
1.3B parameters
100B FineWeb-Edu tokens

Step 01 · Vision

Two gates instead of one scalar

Linear attention compresses the unbounded KV cache into a fixed-size recurrent state. Editing this memory without corrupting existing associations is the hard part.

The problem

Previous delta-rule models (Gated DeltaNet, KDA) tie erasing old content and writing new content to a single scalar gate β_t.

The fix

Split it: a channel-wise erase gate b_t on the key axis, and a channel-wise write gate w_t on the value axis.

Erase gate: selects which key-side coordinates of the decayed state are read and erased.
Write gate: selects which value-side coordinates of the new content are written.
Channel-wise decay: inherited from KDA, it handles global forgetting.

Step 02 · Update Rule

Gated Delta Rule-2

With the erase gate b_t ∈ [0,1]^{d_k}, the write gate w_t ∈ [0,1]^{d_v}, and the channel-wise decay D_t = Diag(α_t), the recurrent state updates as:

S_t = (I − k_t (b_t ⊙ k_t)^⊤) D_t S_{t−1} + k_t (w_t ⊙ v_t)^⊤

Recovers KDA exactly when both gates collapse to the same scalar.
Recovers Gated DeltaNet when the decay also collapses to a scalar.
Trains efficiently with a compact WY form, with channel-wise decay absorbed into asymmetric erase factors.

Step 03 · Get the Code

Clone the repo and build the environment

The official PyTorch implementation ships with a Dockerfile, training scripts, and lit_gpt model definitions.

git clone 
cd GatedDeltaNet-2

# build the environment from the provided Dockerfile
docker build -t gdn2 .
docker run --gpus all -it --ipc=host -v $PWD:/workspace gdn2

Repo structure

lit_gpt/ model code · scripts/ launchers · pretrain.py training entry · data.py, cache.py data & KV cache · paper/ arXiv PDF

Step 04 · Launch Training

Run `pretrain.py`

A simplified command from the official README. Replace the placeholders with your own dataset paths and config name.

python ../pretrain.py \
  --train_data_dir ${TRAIN_DATA} \
  --val_data_dir ${VALIDATION_DATA} \
  --output_root ${SAVE_DIR} \
  --exp_name ${NAME} \
  --model_name ${MODEL} \
  --train_config ${CONFIG} \
  --eval_iters ${EVAL_ITERS} \
  --learning_rate ${LR} \
  --micro_batch_size ${MICRO_BATCH_SIZE}

Pro tip

Add --interactive_job --debug for interactive debugging.

Step 05 · Training recipe

1.3B / 100B FineWeb-Edu setup

Compared against Mamba-2, Gated DeltaNet, KDA, and the Mamba-3 baseline under the same optimizer settings and matched state size.

The Optimizer

AdamW · peak LR 4e-4 · weight decay 0.1 · gradient clip 1.0 · cosine schedule · 1B-token warmup.

Batch and sequence

Global batch of 0.5M tokens · sequence length 4K · hybrid models use a 2K sliding-window attention size.
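To make the recipe concrete, the sketch below turns the optimizer and batch settings into step counts and a learning-rate curve. It reads "0.5M tokens" as exactly 500,000 tokens per step, and it assumes a linear warmup and a cosine decay to zero; the warmup shape and the final learning rate are not stated in the source.

```python
import math

PEAK_LR = 4e-4
TOKENS_PER_STEP = 500_000                # "0.5M tokens" read as 5e5 (assumption)
WARMUP_TOKENS = 1_000_000_000            # 1B-token warmup
TOTAL_TOKENS = 100_000_000_000           # 100B training tokens
MIN_LR = 0.0                             # assumed final learning rate

WARMUP_STEPS = WARMUP_TOKENS // TOKENS_PER_STEP
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR (both assumed shapes)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the warmup covers 2,000 optimizer steps out of 200,000 in total.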

Model shape

16 heads · d_k = d_v = 128 · each layer has a recurrent state of 262,144 floats, matched to Mamba-2/3.
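The state size follows directly from the head count and head dimensions, as the short check below shows. The byte figure assumes 32-bit floats, which the source does not specify.

```python
num_heads = 16
d_k = 128
d_v = 128

# One (d_k x d_v) matrix state per head
state_floats_per_layer = num_heads * d_k * d_v
print(state_floats_per_layer)            # 262144

# Assumption: fp32 storage, 4 bytes per float
state_bytes_fp32 = state_floats_per_layer * 4
print(state_bytes_fp32 / 2**20, "MiB per layer per batch element")
```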

Hybrid Block

Repeated cell: Gated DeltaNet-2 → MLP → SWA → MLP. The recurrent mixer compresses long histories; SWA handles local interactions.

Step 06 · Results

Key numbers from the comparison

Best average on both language modeling and commonsense reasoning, with the largest gains on long-context retrieval.

Setup · Metric	KDA	Mamba-3 MIMO	GDN-2
Recurrent avg. (LMB + reasoning)	52.28	52.39	53.11
Hybrid avg. (LMB + reasoning)	52.68	52.72	53.97
S-NIAH-3 @2K (recurrent)	63.2	72.4	89.8
MK-NIAH-1 @4K (recurrent)	28.0	18.0	37.8
Real-world recall, recurrent avg.	28.67	28.35	29.88
Real-world recall, hybrid avg.	40.14	40.11	42.28
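For readers who want the margins rather than the raw scores, the snippet below computes the GDN-2 gain over KDA and over Mamba-3 MIMO for each row of the table. The numbers are copied from the table; the row keys are shortened labels chosen here.

```python
# (KDA, Mamba-3 MIMO, GDN-2) scores copied from the table above
rows = {
    "avg LMB+reasoning (recurrent)": (52.28, 52.39, 53.11),
    "avg LMB+reasoning (hybrid)": (52.68, 52.72, 53.97),
    "S-NIAH-3 @2K": (63.2, 72.4, 89.8),
    "MK-NIAH-1 @4K": (28.0, 18.0, 37.8),
    "real-world recall (recurrent)": (28.67, 28.35, 29.88),
    "real-world recall (hybrid)": (40.14, 40.11, 42.28),
}

gains = {}
for name, (kda, mamba3, gdn2) in rows.items():
    gains[name] = (round(gdn2 - kda, 2), round(gdn2 - mamba3, 2))
    print(f"{name}: +{gains[name][0]} vs KDA, +{gains[name][1]} vs Mamba-3 MIMO")
```

The margins on the averaged benchmarks are around one to two points, while the two retrieval rows show gains of roughly ten to twenty-seven points over KDA.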

Step 07 · Resources

Paper, code, and citation

Everything you need to read, implement, and cite Gated DeltaNet-2 in one place.

@article{hatamizadeh2026gdn2,
  title   = {Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention},
  author  = {Hatamizadeh, Ali and Choi, Yejin and Kautz, Jan},
  journal = {arXiv preprint},
  year    = {2026}
}


Key Takeaways

Gated DeltaNet-2 splits the scalar β_t into a channel-wise erase gate b_t (key axis) and a channel-wise write gate w_t (value axis).
The update recovers KDA when both gates collapse to the same scalar, and Gated DeltaNet when the decay also collapses to a scalar.
Training remains efficient in chunkwise WY form, with channel-wise decay absorbed into asymmetric erase factors and a gate-aware backward pass fused in Triton.
At 1.3B parameters on 100B FineWeb-Edu tokens with matched state size, it achieves a better average than Mamba-2, Gated DeltaNet, KDA, and Mamba-3 in both recurrent and hybrid settings.
The largest gains come on RULER long-context retrieval: S-NIAH-3 at 2K rises from 63.2 to 89.8 and MK-NIAH-1 at 4K rises from 28.0 to 37.8 over KDA (recurrent).
