Meet EAGLE 3.1: A Speculative Decoding Algorithm That Corrects Attention Drift in LLM Inference

Speculative decoding is a way to speed up language model inference. A small, fast draft model proposes several tokens. A large target model verifies them in parallel. Tokens that are accepted cost far less than generating them one at a time with the target model. When a proposal is rejected, the system falls back to normal decoding from that point.
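The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration, not any library's implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for the two models, and acceptance here is a simple greedy match against the target's own choice.

```python
def target_next(seq):
    """Toy stand-in for the large target model: deterministic next token."""
    return (sum(seq) + 1) % 5


def draft_next(seq):
    """Toy stand-in for the draft model: agrees with the target except
    when the sequence length is 3 mod 4."""
    guess = target_next(seq)
    return guess if len(seq) % 4 != 3 else (guess + 1) % 5


def speculative_step(prefix, draft, target, num_draft_tokens):
    """Run one draft-then-verify round and return the tokens it commits."""
    # Draft phase: the small model proposes tokens one after another.
    proposed = []
    for _ in range(num_draft_tokens):
        proposed.append(draft(prefix + proposed))

    # Verify phase: a real system scores all positions in one batched
    # target forward pass; this loop checks them one by one for clarity.
    committed = []
    for token in proposed:
        expected = target(prefix + committed)
        if token != expected:
            # First mismatch: keep the target's token and stop.
            committed.append(expected)
            return committed
        committed.append(token)

    # Every proposal matched, so the target's pass yields one extra token.
    committed.append(target(prefix + committed))
    return committed


tokens = speculative_step([1], draft_next, target_next, num_draft_tokens=3)
print(tokens)
```

Because every committed token is either confirmed or supplied by the target, the output under greedy matching is the same sequence the target would have produced alone. The draft model only changes how many tokens each target pass yields.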
The EAGLE series, which includes EAGLE 1, EAGLE 2, and EAGLE 3, has become one of the most widely adopted and widely deployed families of speculative decoding algorithms across research and production. Today, the EAGLE Team, the vLLM Team, and the TorchSpec Team give that family a targeted reliability upgrade with the introduction of EAGLE 3.1.
What Was Going Wrong
Although speculative decoding works well in controlled settings, performance often degrades under different chat templates, long-context inputs, or out-of-distribution system prompts.
The EAGLE Team traced this weakness to a phenomenon they call attention drift: as speculation depth increases, the drafter gradually shifts attention away from the sink tokens and toward its own generated tokens.
In simple terms: the drafter is the small model that predicts future tokens. As speculation deepens, it starts attending to its own previous outputs instead of the original context. This reduces acceptance length and output stability.
Two root causes were identified. First, the fused input becomes increasingly imbalanced as higher-layer hidden states dominate the drafter's input. Second, hidden-state magnitude grows across speculation steps because of the unnormalized residual path. Together, these effects make the drafter unstable at deeper speculation depths.
Two Architectural Fixes in EAGLE 3.1
To address attention drift, EAGLE 3.1 introduces two architectural changes: FC normalization, applied to each target hidden state before the FC layer, and feeding post-normalization hidden states into the next decoding step.
FC normalization stabilizes the hidden states that the drafter receives from the target model. Without it, hidden-state magnitude grows across steps, which makes the drafter increasingly unreliable. Applying normalization at each step keeps the input scale in check.
The post-normalization design makes the method behave like a recurrence of the drafter across decoding steps, rather than simply stacking more layers onto the target model.
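A toy numerical sketch can make the two fixes concrete. Everything in it is an assumption made for illustration: the article does not say which normalization is used (RMS normalization without a learned gain is assumed here), `fuse_target_states`, `draft_layer`, and `residual_gain` are hypothetical names, and the draft layer is deliberately contrived so that magnitude growth through the residual path is easy to see. It is not the EAGLE 3.1 or vLLM implementation.

```python
import math


def rms(vec):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))


def rms_norm(vec, eps=1e-6):
    """Rescale a vector to roughly unit RMS (learned gain omitted)."""
    scale = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / scale for x in vec]


def fuse_target_states(states, fc_weight):
    """Normalize each target hidden state, concatenate, then apply the FC layer."""
    fused = [x for state in states for x in rms_norm(state)]
    return [sum(w * x for w, x in zip(row, fused)) for row in fc_weight]


def draft_layer(hidden, residual_gain=0.5):
    """Toy draft layer: the branch output is a fixed multiple of its input,
    added back through a residual connection."""
    return [h + residual_gain * h for h in hidden]


def magnitudes_over_steps(hidden, depth, post_norm):
    """Track hidden-state RMS across speculation steps."""
    history = []
    for _ in range(depth):
        hidden = draft_layer(hidden)
        if post_norm:
            # Feed the normalized state, not the raw one, to the next step.
            hidden = rms_norm(hidden)
        history.append(rms(hidden))
    return history


# Three target states with very different scales; after per-state
# normalization each one contributes equally to the fused input.
states = [[3.0, 4.0], [30.0, 40.0], [0.3, 0.4]]
fc_weight = [[1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
             [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]]
start = fuse_target_states(states, fc_weight)

raw = magnitudes_over_steps(start, depth=5, post_norm=False)
normed = magnitudes_over_steps(start, depth=5, post_norm=True)
print([round(m, 2) for m in raw])
print([round(m, 2) for m in normed])
```

In this toy setup the unnormalized magnitudes grow at every step, while the post-normalized ones stay close to 1. That is the qualitative behavior the two fixes are meant to produce.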

What These Changes Bring
Compared to EAGLE 3, EAGLE 3.1 shows: better extrapolation from training time to test time, stronger long-context robustness, higher robustness to diverse chat templates and system prompts, and more stable acceptance length across different serving environments.
For long-context workloads, EAGLE 3.1 achieves up to 2× longer acceptance length compared to EAGLE 3.
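The metric behind that comparison is commonly called acceptance length in the speculative decoding literature. A minimal sketch of how it can be computed follows, assuming it means the average number of tokens committed per target verification step (accepted draft tokens plus the one token the target always contributes). The per-step counts are made-up illustrative values, not measurements from EAGLE 3 or EAGLE 3.1.

```python
def mean_acceptance_length(accepted_per_step):
    """Average tokens committed per verification step.

    Each entry counts the draft tokens accepted in one step; the target
    model always contributes one more token, hence the +1.
    """
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    return sum(n + 1 for n in accepted_per_step) / len(accepted_per_step)


# Hypothetical per-step counts with 3 draft tokens proposed per step.
run_a = [3, 2, 3, 1]
run_b = [1, 0, 1, 0]

print(mean_acceptance_length(run_a))
print(mean_acceptance_length(run_b))
```

Under this definition, a drafter that keeps most of its proposals accepted moves the average toward the number of draft tokens plus one, while a drifting drafter pulls it back toward 1.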
Training Infrastructure: TorchSpec
TorchSpec now supports training for EAGLE 3.1 and for future speculative decoding algorithms. By reducing training overhead and streamlining experimentation workflows, TorchSpec helps accelerate iteration on and evaluation of next-generation speculative decoding research and applications.
Based on TorchSpec and vLLM, the research team retrained and open-sourced an EAGLE 3.1 draft model for Kimi K2.6, available on Hugging Face. The model serves as an example of using EAGLE 3.1 with TorchSpec for training and vLLM for serving in a real-world deployment.
vLLM Integration: Config-Driven and Backward-Compatible
EAGLE 3.1 lands in vLLM as a configuration-driven extension of the existing EAGLE 3 implementation. The integration includes FC normalization support, post-normalization hidden state feedback, and removal of hard-coded assumptions about target hidden states.
Backward compatibility with existing EAGLE 3 checkpoints is fully preserved. EAGLE 3.1 draft models can be plugged in directly through the speculative config path.
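The speculative settings are passed to `vllm serve` as a JSON string through `--speculative-config`, as in the serve command in this article. As a small convenience sketch, the string can be built and shell-quoted in Python to avoid quoting mistakes. The keys and values are copied from that command, and nothing here calls vLLM itself.

```python
import json
import shlex

# Values taken from the serve command in this article.
speculative_config = {
    "model": "lightseekorg/kimi-k2.6-eagle3.1-mla",
    "method": "eagle3",
    "num_speculative_tokens": 3,
}

# Compact JSON, then quote it so the shell passes it as one argument.
config_json = json.dumps(speculative_config, separators=(",", ":"))
flag = "--speculative-config " + shlex.quote(config_json)
print(flag)
```

Note that the method stays `eagle3` even for the EAGLE 3.1 draft model, which is consistent with the integration being described as a configuration-driven extension of the existing EAGLE 3 path.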
vllm serve nvidia/Kimi-K2.6-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --tool-call-parser kimi_k2 \
  --enable-auto-tool-choice \
  --reasoning-parser kimi_k2 \
  --attention-backend tokenspeed_mla \
  --speculative-config '{"model":"lightseekorg/kimi-k2.6-eagle3.1-mla","method":"eagle3","num_speculative_tokens":3}' \
  --language-model-only
Benchmark Results on Kimi K2.6
The research team evaluated the Kimi K2.6 EAGLE 3.1 draft model on Kimi-K2.6-NVFP4 with vLLM (TP=4, GB200, non-disagg) on the SPEED-Bench coding dataset. EAGLE 3.1 delivers 2.03× higher per-user throughput at concurrency 1. The speedup remains meaningful as concurrency scales: 1.71× at C=4 and 1.66× at C=16.
Key Takeaways
- EAGLE 3.1 fixes attention drift, a newly identified instability in which the drafter loses focus on the sink tokens at deeper speculation depths.
- Two architectural changes, FC normalization and post-norm hidden state feedback, stabilize the drafter at every speculation step.
- For long-context workloads, EAGLE 3.1 delivers up to 2× longer acceptance length compared to EAGLE 3.
- Benchmarks on Kimi-K2.6-NVFP4 show 2.03× per-user throughput at concurrency 1, tapering to 1.66× at C=16.
- EAGLE 3.1 is backward-compatible with EAGLE 3 checkpoints and is already integrated into vLLM main, shipping with v0.22.0.

Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.



