
NVIDIA AI Introduces PivotRL: A New AI Framework That Achieves Higher Agent Accuracy with 4x Fewer Rollouts and Faster Training

Training Large Language Models (LLMs) for long-horizon agentic tasks, such as software engineering, web browsing, and complex tool use, presents a persistent trade-off between computational efficiency and generalization. Supervised Fine-Tuning (SFT) is computationally cheap, but it often suffers from out-of-domain (OOD) performance degradation and struggles to generalize beyond its training distribution. In contrast, end-to-end reinforcement learning (E2E RL) generally preserves OOD capabilities and achieves high in-domain accuracy, but incurs high computational costs because it requires many policy rollouts for every parameter update.

NVIDIA researchers introduced PivotRL, a framework designed to bridge this gap. By operating on existing SFT trajectories, PivotRL aims to deliver the generalization benefits typical of E2E RL while retaining the data efficiency associated with SFT.

Pivot Architecture

The core of PivotRL is a shift from full-trajectory optimization to targeted, pivot-level training. The framework rests on two main components: pivot selection and functional rewards.

1. Pivot Selection

In agentic training, each assistant turn (a completion at the model-call boundary) is treated as an action. PivotRL starts by extracting all assistant turns from the SFT dataset into a ‘pivot candidate’ pool.

The system then profiles these candidates offline using a frozen reference policy, π0. To make the most of the training budget, PivotRL restricts itself to pivots: specific states where local, on-policy rollouts show high variance in outcomes. Pivots are selected by two criteria:

  • Nonzero empirical reward variance: σ̂²(s) > 0.
  • Low mean reward: μ̂(s) < λ_diff.

This criterion addresses the vanishing-signal problem in group-normalized RL. In Group Relative Policy Optimization (GRPO) specifically, a state where sampled actions uniformly succeed or uniformly fail yields zero advantage for every sample, which provides no meaningful gradient update. By focusing on states where the reference policy produces mixed outcomes, PivotRL concentrates training on the states that carry the strongest learning signal.
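The selection rule above can be sketched in a few lines. This is an illustrative, simplified sketch (the function and variable names are assumptions, not the paper's API); it assumes each candidate state has already been profiled offline with a group of binary rollout rewards from the frozen reference policy:

```python
import statistics

def select_pivots(profiled, lambda_diff=0.7):
    """profiled: dict mapping state id -> list of binary rollout rewards
    sampled from the frozen reference policy. Keep only 'pivots': states
    with mixed outcomes (nonzero variance) and mean reward below the
    difficulty threshold lambda_diff."""
    pivots = []
    for state, rewards in profiled.items():
        mean = statistics.mean(rewards)
        var = statistics.pvariance(rewards)
        if var > 0 and mean < lambda_diff:
            pivots.append(state)
    return pivots

profiled = {
    "always_succeeds": [1, 1, 1, 1],  # zero variance -> no GRPO signal
    "always_fails":    [0, 0, 0, 0],  # zero variance -> no GRPO signal
    "mixed_hard":      [1, 0, 0, 1],  # mixed and difficult -> kept as pivot
    "mixed_easy":      [1, 1, 1, 0],  # mixed, but mean 0.75 >= lambda_diff
}
print(select_pivots(profiled))  # -> ['mixed_hard']
```

Note how both the uniformly-succeeding and uniformly-failing states are discarded: under group normalization neither would produce a gradient, so rolling them out would waste compute.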

2. Functional Rewards

Standard SFT-to-RL practice often relies on exact string matching against demonstration data to assign rewards. However, in realistic action spaces (e.g., shell commands or search queries), many functionally equivalent actions differ from the exact strings in the training data.

PivotRL replaces strict matching with functional rewards, r_func(s, a) = 1[a ∈ M(s)], where M(s) is the set of acceptable actions in state s as determined by a domain-specific validator. These validators range from standard schema and string-matching checks to lightweight LLM-as-a-judge evaluations.
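As a toy illustration of the idea (the validator and its regex are hypothetical examples, not from the paper), a shell-domain validator might accept any command that achieves the demonstrated intent, rather than only the exact demonstrated string:

```python
import re

def shell_validator(state, action):
    """Toy validator for a 'list python files' intent: accept any ls/find
    command that targets *.py, regardless of exact flags or ordering."""
    return bool(re.search(r"\b(ls|find)\b.*\.py\b", action))

def r_func(state, action, validator):
    """Functional reward: 1 if the action is in the acceptable set M(s)
    defined by the validator, else 0."""
    return 1.0 if validator(state, action) else 0.0

state = {"intent": "list python files"}
print(r_func(state, "ls -la *.py", shell_validator))          # 1.0
print(r_func(state, "find . -name '*.py'", shell_validator))  # 1.0
print(r_func(state, "cat README.md", shell_validator))        # 0.0
```

Both ls and find variants earn the reward even though neither may match the demonstration verbatim, which is exactly the flexibility strict string matching lacks.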

Theoretical Foundations: Gradient Signal and OOD Preservation

The effectiveness of these design choices is supported by two main theoretical results:

  • Theorem 3.2 (Reward Variance and GRPO Signal): The research team shows that the magnitude of the GRPO gradient signal at a state scales with that state's reward standard deviation, so states with mixed outcomes contribute the strongest updates. This validates the strategy of filtering for pivots with mixed results to maximize the in-domain learning signal.
  • Theorem 3.3 (Minimal KL Shift): This result shows that functional-reward RL shifts probability mass toward acceptable actions while preserving the reference policy's relative ordering over actions unrelated to the training task. Because the relative ranking of task-irrelevant actions stays fixed, PivotRL largely avoids the catastrophic forgetting and OOD degradation common with SFT.
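A small numeric demo makes the variance result concrete. Assuming the common group-normalized advantage formulation A_i = (r_i - mean(r)) / std(r) (a standard GRPO-style form, not necessarily the paper's exact notation), uniform outcomes yield all-zero advantages and hence no gradient, while mixed outcomes yield nonzero advantages:

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: subtract the group mean reward and
    divide by the group standard deviation (eps avoids division by zero)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

print(grpo_advantages([1, 1, 1, 1]))  # all zeros -> gradient vanishes
print(grpo_advantages([1, 0, 0, 1]))  # roughly [+1, -1, -1, +1] -> signal
```

This is precisely why PivotRL's pivot filter discards uniformly-succeeding and uniformly-failing states: only mixed-outcome groups produce a usable update.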

Performance and Success

The research team evaluated PivotRL using Qwen3-30B-A3B-Thinking-2507 as the base model across four agentic domains: tool-use chat (τ²-Bench), software engineering (SWE-Bench Verified), terminal control (Terminal-Bench), and web browsing (BrowseComp).

In-Domain Accuracy Gains

Compared to SFT on the same data, PivotRL achieved the following in-domain results:

  • Average Gain: +14.11 points over the base model, versus +9.94 points for SFT.
  • Per-Domain Results: PivotRL outperformed SFT on τ²-Bench (+5.37), Terminal-Bench (+6.25), and BrowseComp (+9.80).

Out-of-Domain Preservation

The most significant benefit was observed in OOD stability. While SFT caused an average regression of -9.83 across eight OOD benchmarks (including math and science QA), PivotRL held a near-zero average change of +0.21. Notably, PivotRL achieved +10.04% higher OOD accuracy on non-agentic tasks compared to SFT.

Compute Efficiency on SWE-Bench

On SWE-Bench Verified, a rigorous benchmark for long-horizon agents, PivotRL showed a substantial reduction in training cost:

  • Rollout Efficiency: PivotRL matched the accuracy of E2E RL implementations while generating 4x fewer rollout trajectories.
  • Wall-Clock Time: Training was ~5.5x faster in wall-clock time than E2E RL when using the same number of compute nodes.

Key Takeaways

  • Hybrid Efficiency: PivotRL combines the computational efficiency of Supervised Fine-Tuning (SFT) with the out-of-domain (OOD) generalization of end-to-end RL.
  • Pivot Selection: The framework identifies ‘pivots’, critical states in SFT trajectories where sampled actions show high variance in success/failure, providing the strongest learning signals.
  • Functional Validators: Instead of requiring exact string matches, PivotRL uses domain-specific validators to reward any functionally correct action.
  • OOD Stability: Unlike SFT, PivotRL preserves the model’s ability on unrelated tasks (e.g., math) by preserving the reference policy’s probability ordering over task-irrelevant actions.
  • Training Speed: It achieves accuracy comparable to E2E RL with 4x fewer rollouts and ~5.5x faster training time, as demonstrated with NVIDIA’s Nemotron-3-Super.

Check out the Paper.

