Technology & AI

Parallax: Parameterized Local Linear Attention That Preserves Softmax and Adds a Learned Covariance Correction Branch

The Transformer’s focus has not changed since 2017. Most efficient work has tried to replace the softmax focus directly. The new paper takes a different route. Maintains softmax attention and bolts on the repair branch.

A team of researchers from Northwestern University, Tilde Research, and the University of Washington present a Local Linear Attention called ‘Parallax’ that reaches LLM pre-training and Muon coding.

Parallax doesn’t rush for efficiency by cutting the computer. It intentionally adds a computer, then makes that computer cheaper to use modern GPUs.

What is Parallax

Parallax builds on Local Linear Attention (LLA). LLA appears in the regression framework of the evaluation period. That framework learns attention as a regression solver over key-value pairs.

In this view, the keys are the training data points. Label values. The question is a test point. Softmax attention is a non-parametric estimator called Nadaraya-Watson. It is equivalent to a constant local operation for each query.

LLA improves that local approximation constant to the local linear approximation. The research team proves that this results in the least integrated mean square error. The advantage is a better bias-variance trade-off for associative memory.

But LLA has a problem with scale. Its straight forward approach requires solving a linear system of every question. That uses a parallel conjugate gradient (CG) solver. The CG solver creates three problems: intensive I/O, a strong formalization-expressiveness tradeoff, and inconsistency and low accuracy.

Parallax removes the solver. Instead, it reads an additional projection matrix. The research team writes this as ρi = WRxi. Here WR readable matrix that interrogates the covariance of KV directly from the layer input.

Parallax thus preserves the linear spatial principle. It simply replaces the solution of each question with a read projector, which is the same as the question. That makes it simple, efficient, and easy to use.

How the Process Works

Parallax recalibrates LLA as a softmax focus and additive correction. The output is the same as the softmax derivative minus the assumed covariance term. In the text of the research paper, that term is the KV covariance multiplied by the studied probe ρi.

The research team also reduced one piece of LLA called the boundary amplification factor, which is set to zero. This is necessary for stability. Once the investigation is parametric, the true geometric interpretation breaks down. Omitting the feature can cause the scale to be different or reversed.

Parallax sits within a family of attention-grabbing techniques. The research team organizes them in the paper along three axes: bandwidth, probe design, and affine structure. In some extreme cases, Parallax decreases exactly to the softmax focus when the probe trend goes to zero.

Setting up WR = 0 makes the Parallax layer behave the same way in softmax focus. So the test point of the pre-trained Transformer can be changed by adding WR and planning well.

Hardware conflict

Parallax inherits the FlashAttention streaming property. Add a single covariance branch that reuses the same key-value stream.

The research team expands forward into two parallel branches of the score. Both branches share online multiplayer, a rescaling feature, and K and V tiles. So Parallax doesn’t need extra I/O for iteration.

A key feature is high arithmetic intensity (AI). AI is a measure of floating-point performance in high-bandwidth memory traffic. In the realm where the KV function dominates, Parallax almost doubles the power of arithmetic. Adds a computer while reusing the same memory.

This turns attention to the program that accompanies the computer. That’s exactly the realm where kernelization helps in modern hardware.

The research team implemented a decode kernel in CuTeDSL on NVIDIA Hopper GPUs. Hopper’s tensor core matmul instructions work on tiles of at least 64 lines. The decoding step provides only one query line. Therefore the products of QK and RK can be calculated together, within the instructions the general attention has become problematic.

They profiled against FlashAttention 2 and 3 on H200 GPUs with BF16 accuracy. They span cluster sizes from 1 to 2,048 and context lengths from 128 to 32,768. The prototype kernel matches or exceeds FlashAttention in all configurations. The figure below describes a speedup of 1.54× in the computer-matched setting and 1.14× in the I/O-matched setting.

What Tests Show

The research team verified Parallax in synthetic tasks and in LLM pre-training at 0.6B and 1.7B scales. The models used the Qwen-3 architecture in place of torchtitan. They trained on the Ultra-FineWeb dataset with 4096 context lengths. Foundations included softmax attention (Transformer), Mamba, Gated DeltaNet, MesaNet, and Kimi DeltaAttention.

In the MAD-Benchmark, Parallax achieved the highest overall accuracy with a score of 0.716. It constantly improves memory-oriented tasks such as In-Context-Recall and Selective-Copying. It was constantly competing in pressure and head-to-head activities.

In language modeling, Parallax with Muon achieved the best confusion at both scales. It also posted the highest average accuracy of the river. In 1.7B, Parallax scored 62.45 compared to Transformer’s 61.43.

Two controls check where the profit comes from. A parametric transformer closes a small part of the gap. Computer-simulated Parallax still hits both first lines. The paper argues against this by pointing to the method itself, not additional parameters or calculations.

The Optimizer Twist

The main discovery is the optimizer-architecture interaction. Parallax shows a significant gain under Muon. Under AdamW, profits are significantly reduced or disappear.

Muon is a recent optimizer of matrix parameters for hidden layers. It uses the polar factor of the pressure buffer, so the updates are exactly numbered. Previous work shows that this produces better weighted matrices.

The research team in the paper traces the gap in the maintenance branch. They define the correction-to-output ratio (COR). Below the Muon, the COR exceeds 8 in the deeper layers. Under AdamW, it stays under 4.

IWR speculation is equally affected. Its steady state falls below AdamW but remains high below Muon. A gate test confirms the pattern. Under AdamW, the model learns to push the maintenance branch rather than use it.

The research team calls this the first empirical demonstration of an architecture-optimizer code for attention mechanisms. They are not saying that Muon with WSD is the ideal recipe. Appendix ablation shows benefit diminishes during the stage of decay.

How Scores Differ

Parallax also generates different point distributions from the softmax focus. Its individual token weights can take negative values ​​and exceed one in magnitude. Standard softmax weights cannot do this.

The research team reports three results. Parallax can remove value components from non-essential tokens. It greatly reduces the initial token attention. Its base softmax entropy is always high, giving different attention weights.

Strengths and weaknesses and open questions

Power

  • It keeps the softmax focus constant, so that the pre-trained Transformer transforms by adding WR and fine-tuning.
  • It does not incur additional I/O by simply reusing the FlashAttention key value stream.
  • It doubles the arithmetic intensity, with a prototype kernel that matches or beats FlashAttention 2/3 in decoding.
  • It shows consistent confusion and downstream benefits under parameter-matched and computer-simulated controls.

Weaknesses and open questions

  • The benefits are very muon dependent; under AdamW the benefit largely disappears.
  • The exact cause of the optimizer dependency remains an open question.
  • The results stand at 1.7B scale, with no MoE, long core, or large runs.
  • Gain erodes during the decay phase of the WSD, which is only partially corrected by weight decay.

Key Takeaways

  • Parallax retains the focus of softmax and adds a learned covariance correction branch, replacing the LLA for each query of the conjugate gradient solver.
  • It doubles the arithmetic intensity while reusing the same KV stream, with a decode kernel that matches or beats FlashAttention 2/3.
  • Fixed confounds and downstream gains at 0.6B and 1.7B, held under parameter-simulated and computer-simulated controls.
  • The benefits are very muon dependent; under AdamW the profit decreases significantly or disappears.
  • Setting up WR = 0 gets exactly softmax attention, so pre-trained Transformers can transform by adding WR and fine-tuning.

Check it out Paper again Repo. Also, feel free to follow us Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to Our newspaper. Wait! are you on telegram? now you can join us on telegram too.

Need to work with us on developing your GitHub Repo OR Hug Face Page OR Product Release OR Webinar etc.? Connect with us


Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button