Technology & AI

NVIDIA AI Releases Gated DeltaNet-2: A Separate Attention Layer That Decouples and Writes on the Delta Law

Linear attention replaces the infinite KV cache of softmax attention with an iterative form of fixed size. This reduces sequence mixing to linear time and recording in non-volatile memory. The hard part is not to forget. It is a way of organizing repressed memory without criticizing existing associations.

NVIDIA has been released Gated DeltaNet-2a specific attention layer that targets that problem. The model divides the working memory arrangement into two intelligent gates per channel. It is trained on 1.3B parameters on 100B FineWeb-Edu tokens. It outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 in the entire research benchmark suite.

The scalar gate problem in delta-rule models

The iterative attention layer maintains the matrix state St and read it along with the question. DeltaNet adds active sorting by extracting the value currently associated with the current key. It uses a scalar step size βt control how much you have to write on it. Mamba-2 adds data-dependent scaling decomposition αt by forgetting the whole world. Gated DeltaNet combines both operations, but both gates remain scalar per head.

Kimi Delta Attention (KDA) refines the decay side. It replaces the scale αt with channel wise vector. KDA still maintains one scale βt active planning. That scale controls two different things at once. Determines how much old content to delete on the key side. It also determines how much new content to create on the value side. These two decisions apply to different sectors of the government. Coherence is a limitation of models, not a property of the delta law.

Gated Delta Rule-2: two gates instead of one

Gated DeltaNet-2 separates the two decisions with Gated Delta Rule-2. Introduces a clearing gate that uses a channel bt ∈ [0,1]dk on the key axis. It also introduces a write gateway that uses the channel wt ∈ [0,1]dv on the value axis. Both gates are generated by sigmoid projections of the token representation. The update works to decay before the active editing.

Written together, the repetition is:

St = (I − kt (bt ⊙ kt)) Dt St−1 + kt (wt ⊙ vt)

Here Dt = Diag(αt) channel wise decomposition which is transferred to KDA. The leftmost element of the deletion matrix remains ktmaintains the direction of writing the delta rule. The right factor becomes bt ⊙ ktwhich makes the read direction channel selectable. Writing term kt zt uses zt = wt ⊙ vtwhich makes the value update channel select.

When both gates fall on the same scale βtthe update restores KDA directly. When decay αt and falls to a scalar, obtaining a Gated DeltaNet. Both previous models are kept as captive sub-bases for the new update.

For fast weighting, Gated Delta Rule-2 online one-step gradient in local regression loss. The decay state remains close to memory, while residual programming uses learned and gated targets.

Chunkwise training and back gate awareness

Replication adopts a small WY form similar to the structure used by KDA. The cumulative decomposition of the channel frame is centered on two deletions for each rank. The update of each chunk is a product of asymmetric matrices of the form I − k̄r ēr. Usage uses the chunk size C = 64 with integrated Triton kernels.

In retrospect, the scalar interrupt used by KDA is no longer valid. The write side contains a separate diagonal gate over the value channels. The eraser side contains a separate diagonal gate over the key channels. Therefore the gate properties must appear within the dot products that accumulate the gradients. The paper derives this vector-Jacobian product known from the gate clearly. On Hopper GPUs, the combined WY of the backward kernel is limited to two and four warps to avoid the assertion of the Triton WGMMA structure.

Block design and hybrid model

Gated DeltaNet-2 is used as a common token combiner in a common Transformer style block. Key questions and methods use linear regression, short causal variables, SiLU, and L2 normalization. Value routing uses linear approximation, short transform, and SiLU. Decay αtclear the gate btand write the gateway wt they appear in different linear branches. The recurrent output is RMS-normalized, multiplied by the SiLU output gate, and detrended.

A mixed variant includes Sliding-Window Attention (SWA) after the continuous mixer. The replicated cell contains Gated DeltaNet-2, MLP, SWA, and another MLP. SWA handles direct local interactions, while the iterative mixture suppresses long histories. The hybrid maintains a sequential scale with a bounded attention cache.

Results in 1.3B parameters

All 1.3B parameter models were trained on 100B FineWeb-Edu tokens. Parameter estimates and common condition sizes were matched across models. The persistent state holds 262,144 floats per layer per heap element. The training length of the tokens is 4K, and hybrid models use a 2K SWA window. The base Mamba-3 MIMO uses standard R = 4.

In language processing and logical reasoning, Gated DeltaNet-2 scores best in both settings. The continuous model is between 53.11 for all LAMBADA and the thinking suite. That sits above Mamba-3 MIMO at 52.39 and KDA at 52.28. In the hybrid configuration, Gated DeltaNet-2 averaged 53.97 compared to Mamba-3 MIMO at 52.72. Since the standard state size is not specified, the benefit points to the update rule, not the additional memory.

The most obvious benefits come from RULER long content retrieval. In the continuous setting, the S-NIAH-2 in 4K increases from 89.0 (KDA) to 93.0. NIAH-3 at 2K jumped from 63.2 (KDA) to 89.8. MK-NIAH-1 in 4K increases from 28.0 (KDA) to 37.8.

In real-world returns (SWDE, SQuAD, FDA, TriviaQA, NQ, DROP), Gated DeltaNet-2 also leads both settings. The normal average is 29.88 and the hybrid average is 42.28.

Marktechpost Visual Explainer

NVIDIA · 2026

Gated DeltaNet-2

Decoupling Erase and Write with Line Attention. Delta-law recursive attention layer with channel-wise erase and write gates.

PyTorch
Triton letters
1.3B parameters
100B FineWeb-Edu tokens

Step 01 · Vision

Two gates instead of one scale

Linear attention compresses the infinite KV cache into an iterative form of constant size. Organizing this memory without criticizing existing organizations is the hard part.

The problem

Previous delta-rule models (Gated DeltaNet, KDA) are binding to delete old content again writing new content in one scalar gate β_t.

Repair

Split it: a clearing gate following the station b_t on the key axis, and the write gate using the channel w_t on the value axis.

  • Clear the gate selects which links to the decomposed state key are read and deleted.
  • Write the gateway it chooses which value-side links for new content are made.
  • Decomposition following the channel bequeathed to KDA with good oblivion around the world.

Step 02 · Law of Renewal

Gated Delta Rule-2

Through the clearing gate b_t ∈ [0,1]^{d_k}write the gate w_t ∈ [0,1]^{d_v}and channel-specific decay D_t = Diag(α_t)the continuous state changes to:

S_t = (I − k_t (b_t ⊙ k_t)) D_t S_{t−1} + k_t (w_t ⊙ v_t)

  • It is recovering KDA exactly when both gates fall on the same scale.
  • It is recovering Gated DeltaNet where the decay also folds to a scalar.
  • Trains efficiently with a little WY form with channel-wise decay absorbed in asymmetric clearing characteristics.

Step 03 · Get ​​the Code

Compile the repo and build environment

The official PyTorch implementation ships with a Dockerfile, training documentation, and lit_gpt model definitions.

git clone 
cd GatedDeltaNet-2

# build the environment from the provided Dockerfile
docker build -t gdn2 .
docker run --gpus all -it —ipc=host -v $PWD:/workspace gdn2
Repo structure

lit_gpt/ model code · scripts/ launchers · pretrain.py training entry · data.py, cache.py KV data & archive · paper/ arXiv PDF

Step 04 · Introduce Training

Run pretrain.py

A simplified command from the official README. Replace the placeholders with your own dataset methods and configuration name.

python ../pretrain.py 
  --train_data_dir ${TRAIN_DATA} 
  --val_data_dir ${VALIDATION_DATA} 
  --output_root ${SAVE_DIR} 
  --exp_name ${NAME} 
  --model_name ${MODEL} 
  --train_config ${CONFIG} 
  --eval_iters ${EVAL_ITERS} 
  --learning_rate ${LR} 
  --micro_batch_size ${MICRO_BATCH_SIZE}
Pro tip

Add --interactive_job --debug for debugging interaction time.

Step 05 · Automatic recipe

1.3B / 100B FineWeb-Edu setup

It is compared to Mamba-2, Gated DeltaNet, KDA, and basic Mamba-3 under the same optimizer settings and constant state size.

The Optimizer

AdamW · top LR 4e-4 · weight loss 0.1 · gradient clip 1.0 · cosine system · 1B– token warmup.

Collection and sequence

A global collection 0.5M tokens · sequence length 4K · Hybrid models use a 2K attention size of the sliding window.

Shape of the Model

16 heads · d_k = d_v = 128 · each layer is a replication state 262,144 floating, similar to Mamba-2/3.

Hybrid Block

Repeated cell: Igated DeltaNet-2 → MLP → SWA → MLP. A repetitive mixer suppresses long histories; The SWA handles local interactions.

Step 06 · Results

Appropriate numbers to be attached to the comparison

It is the best measure of all language modeling and logical reasoning, with the greatest advantages in the retrieval of long content.

Setup · MetricKDAMamba-3 MIMOGDN-2
A common measure. (LMB + thinking)52.2852.3953.11
Hybrid is average. (LMB + thinking)52.6852.7253.97
S-NIAH-3 @2K (typical)63.272.489.8
MK-NIAH-1 @4K (standard)28.018.037.8
Real-world recall, continuous measurement.28.6728.3529.88
Real-world recall, hybrid avg.40.1440.1142.28

Step 07 · Resources

Paper, code, and quote

Everything you need to learn, implement, and quote Gated DeltaNet-2 in one place.

@article{hatamizadeh2026gdn2,
  title   = {Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention},
  author  = {Hatamizadeh, Ali and Choi, Yejin and Kautz, Jan},
  journal = {arXiv preprint},
  year    = {2026}
}

Company MARKTECHPOST · Hub for AI research, dev tools, and model launch

Key Takeaways

  • Gated DeltaNet-2 divides the scale βt enter the smart wipe gate bt (key axis) and the write gate that uses the channel wt (value axis).
  • The update finds KDA when both gates fall on the same scale, and Gated DeltaNet when the decomposition breaks down.
  • Training remains consistent in chunkwise WY form, with channel-wise decay focused on uneven wipes and a reverse gate integrated into the Triton.
  • For 1.3B parameters in 100B FineWeb-Edu with simulated region size, it has a better rate than Mamba-2, Gated DeltaNet, KDA, and Mamba-3 in both iterative and hybrid settings.
  • The biggest gains come with RULER longer content retrieval — S-NIAH-3 in 2K increases by 63.2 → 89.8 and MK-NIAH-1 in 4K increases by 28.0 → 37.8 over KDA (normal).

Check it out Paper again Repo. Also, feel free to follow us Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to Our newspaper. Wait! are you on telegram? now you can join us on telegram too.

Need to work with us on developing your GitHub Repo OR Hug Face Page OR Product Release OR Webinar etc.? contact us


Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button