
NVIDIA AI Unveils ProRL Agent: A Decoupled 'Rollout-as-a-Service' Infrastructure for Reinforcement Learning with Multi-Turn LLM Agents at Scale

NVIDIA researchers present ProRL Agent, a scalable infrastructure designed for reinforcement learning (RL) training of multi-turn LLM agents. By adopting a 'Rollout-as-a-Service' philosophy, the system separates the orchestration of agent rollouts from the training loop. This architectural change addresses the inherent resource conflict between I/O-intensive environment interaction and GPU-intensive policy updates that currently hampers agent development.

The Core Problem: Tight Coupling

Agentic tasks involve interacting with external environments, such as code repositories or operating systems, through iterative tool calls. Many existing frameworks (including SkyRL, verl-tool, Agent Lightning, rLLM, and GEM) embed the rollout controller directly within the training process.

This tight coupling leads to two main limitations:

  • Conflicting System Requirements: Rollout is I/O-bound, requiring sandbox creation, long-lived tool sessions, and asynchronous communication. Training is GPU-intensive, centered on forward/backward passes and gradient synchronization. Running both in one process causes interference and reduces hardware efficiency.
  • Maintenance Barriers: Embedding rollout logic in the trainer makes it difficult to migrate to a different training backend or support new runtime environments without rewriting the pipeline.

System Design: Rollout-as-a-Service

ProRL Agent runs as an independent HTTP service that manages the complete rollout lifecycle. The RL trainer interacts with the server only through an API and remains agnostic to the rollout infrastructure.
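To make the service boundary concrete, here is a minimal sketch of what the trainer-facing contract implied by a "Rollout-as-a-Service" design could look like. The field names, endpoint semantics, and URLs are illustrative assumptions, not ProRL Agent's actual API: the point is only that the trainer submits a task and later receives a finished trajectory, while sandboxing, tool calls, and grading stay behind the HTTP boundary.

```python
# Hypothetical sketch of a "Rollout-as-a-Service" trainer-side contract.
# All names here (RolloutRequest, Trajectory, fields) are illustrative.
import json
from dataclasses import dataclass, asdict

@dataclass
class RolloutRequest:
    task_id: str                 # e.g. a SWE-Bench instance id
    policy_endpoint: str         # inference backend serving the current policy
    max_turns: int = 32          # cap on agent/tool interaction rounds

@dataclass
class Trajectory:
    task_id: str
    token_ids: list[int]         # token-in/token-out: raw ids, never re-tokenized
    reward: float                # produced by the rollout service, not the trainer

def submit(req: RolloutRequest) -> str:
    """Trainer side: a POST to the rollout service would carry this payload."""
    return json.dumps(asdict(req))

payload = submit(RolloutRequest("swe-001", "http://vllm-0:8000"))
print(json.loads(payload)["task_id"])  # -> swe-001
```

The key property of this shape is that the trainer never sees a container, a shell, or a test harness, only serialized trajectories and rewards.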

Asynchronous pipeline with three stages

To maximize throughput, the server schedules rollouts through a three-stage asynchronous 'assembly line':

  1. INIT: Startup workers spin up sandbox containers and prepare tools.
  2. RUN: Rollout workers drive the multi-turn agent loop and collect trajectories.
  3. EVAL: Evaluation workers grade results against ground truth to generate reward signals.

By giving each stage a private pool of workers, ProRL Agent allows the phases to overlap across different tasks, preventing slow evaluations (such as full test suite runs) from stalling the rollout process.
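The assembly line above can be sketched with standard asyncio queues. This is a toy model, not NVIDIA's implementation: pool sizes, stage latencies, and the task payload are all made up, but it shows how private worker pools per stage let a slow EVAL for one task overlap with INIT/RUN for others.

```python
# Toy three-stage INIT -> RUN -> EVAL pipeline with a private worker
# pool per stage, so stages overlap across tasks. All parameters here
# (pool sizes, sleep times) are illustrative.
import asyncio

async def stage(n_workers, inbox, outbox, work):
    async def worker():
        while True:
            task = await inbox.get()
            await work(task)               # stage-specific work (simulated)
            if outbox is not None:
                await outbox.put(task)     # hand off to the next stage
            inbox.task_done()
    return [asyncio.create_task(worker()) for _ in range(n_workers)]

async def main(n_tasks=8):
    q_init, q_run, q_eval = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    done = []
    await stage(4, q_init, q_run, lambda t: asyncio.sleep(0.01))   # INIT
    await stage(4, q_run, q_eval, lambda t: asyncio.sleep(0.03))   # RUN
    async def evaluate(t):                 # EVAL records finished tasks
        await asyncio.sleep(0.02)
        done.append(t)
    await stage(2, q_eval, None, evaluate)
    for i in range(n_tasks):
        q_init.put_nowait(i)
    # Wait until every task has drained through all three stages.
    await q_init.join(); await q_run.join(); await q_eval.join()
    return sorted(done)

print(asyncio.run(main()))  # -> [0, 1, 2, 3, 4, 5, 6, 7]
```

Because each stage only blocks on its own queue, a long-running evaluation never prevents new sandboxes from being initialized.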

HPC-Compatible Sandboxing and Optimized Tools

ProRL Agent uses Singularity as its sandbox infrastructure. Unlike Docker-based platforms, Singularity allows rootless execution, which is necessary for running on shared HPC clusters managed by Slurm.

The system includes several optimizations to reduce tool-call latency, which often dominates overall rollout time:

  • Persistent Bash: Replaces tmux-based multiplexing with a ptyprocess-based pseudo-terminal, reducing shell command latency from 0.78s to 0.42s.
  • Direct IPython API: Connects to persistent kernels via an in-process API instead of network gateways, removing network overhead.
  • Unix Domain Sockets (UDS): Replaces TCP loopback for communication between the agent and the tool server inside the container, removing additional latency.
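The UDS optimization in the last bullet is easy to illustrate with the standard library. The sketch below (Linux-only, since it uses `AF_UNIX`) stands in for an agent calling an in-container tool server over a Unix domain socket; the socket path, message format, and "tool" are all hypothetical.

```python
# Toy agent <-> tool-server exchange over a Unix domain socket instead
# of TCP loopback, skipping the TCP stack entirely. Linux-only sketch;
# the path and the b"run_tests" "tool call" are illustrative.
import os, socket, tempfile, threading

path = os.path.join(tempfile.mkdtemp(), "tool.sock")
ready = threading.Event()

def tool_server():
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(path)
        srv.listen(1)
        ready.set()                        # safe for the client to connect now
        conn, _ = srv.accept()
        with conn:
            cmd = conn.recv(1024)          # e.g. a serialized tool invocation
            conn.sendall(b"ok:" + cmd)     # result goes straight back over UDS

threading.Thread(target=tool_server, daemon=True).start()
ready.wait()

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as cli:
    cli.connect(path)
    cli.sendall(b"run_tests")
    reply = cli.recv(1024).decode()

print(reply)  # -> ok:run_tests
```

Swapping `AF_INET` on 127.0.0.1 for `AF_UNIX` is a one-line change on each side, which is why it is an attractive latency optimization for chatty agent-tool loops.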

Advanced Features for Scalable RL

The infrastructure introduces ways to improve training stability and hardware usage:

Load Balancing with Prefix Cache Reuse

The server load-balances across a cluster of LLM inference backends (e.g., vLLM) using a min-heap keyed by assignment counts. Once a task is assigned to a backend, all subsequent calls within that task are routed to the same backend. This sticky routing increases prefix cache reuse, reducing inference time in multi-turn rollouts.
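A minimal sketch of this strategy, assuming nothing about ProRL Agent's actual code: a min-heap of `(assigned_count, tie_breaker, backend)` entries picks the least-loaded backend for a new task, and a task-to-backend map keeps later turns on the same backend so its KV prefix cache stays warm.

```python
# Sketch of min-heap load balancing with sticky per-task routing.
# Backend names and the StickyBalancer class are illustrative.
import heapq

class StickyBalancer:
    def __init__(self, backends):
        # heap entries: (assigned_count, tie_breaker, backend_url)
        self.heap = [(0, i, b) for i, b in enumerate(backends)]
        heapq.heapify(self.heap)
        self.assignment = {}                 # task_id -> backend_url

    def route(self, task_id):
        if task_id in self.assignment:       # later turns: same backend,
            return self.assignment[task_id]  # so the prefix cache is warm
        count, tie, backend = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (count + 1, tie, backend))
        self.assignment[task_id] = backend
        return backend

lb = StickyBalancer(["vllm-0", "vllm-1"])
print(lb.route("taskA"))   # -> vllm-0 (least loaded)
print(lb.route("taskB"))   # -> vllm-1
print(lb.route("taskA"))   # -> vllm-0 (sticky across turns)
```

Sticky routing trades a little balance for cache hits: in multi-turn rollouts, each new turn re-sends a prompt that is mostly a prefix of the previous one, so staying on one backend avoids recomputing that prefix.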

Token-in/Token-out Communication

To eliminate re-tokenization drift (the mismatch that arises when the token sequence generated during rollout differs from the one used during training), ProRL Agent uses token IDs as the canonical representation throughout the pipeline. Logprobs and token IDs are propagated unchanged from the inference backend to the trainer.
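The failure mode being avoided is easy to demonstrate with a toy tokenizer (this is not a real tokenizer, just a three-token vocabulary invented for illustration): decoding generated IDs to text and re-encoding can merge adjacent pieces differently, so the trainer would compute losses on a token sequence the policy never actually produced.

```python
# Toy demonstration of re-tokenization drift. VOCAB and the greedy
# longest-match encoder are illustrative, not any real tokenizer.
VOCAB = {"a": 0, "b": 1, "ab": 2}
INV = {v: k for k, v in VOCAB.items()}

def decode(ids):
    return "".join(INV[i] for i in ids)

def encode(text):
    # Greedy longest-match: always prefers the merged "ab" token.
    ids, i = [], 0
    while i < len(text):
        if text[i:i + 2] in VOCAB:
            ids.append(VOCAB[text[i:i + 2]]); i += 2
        else:
            ids.append(VOCAB[text[i]]); i += 1
    return ids

generated = [0, 1]                  # the policy emitted "a", "b" separately
assert decode(generated) == "ab"
print(encode(decode(generated)))    # -> [2]  (drifted: "ab" got merged)
```

Passing the raw IDs end-to-end, as described above, makes the round trip through text unnecessary and removes this class of bug entirely.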

Optimized DAPO Implementation

The system supports Dynamic Sampling Policy Optimization (DAPO), which filters out 'zero-signal' prompts whose rollouts all produce identical rewards. ProRL Agent uses an oversampled, parallel-processing scheme to keep worker occupancy high, terminating outstanding rollouts as soon as the target number of informative prompts is reached.
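The filtering rule itself is simple, and a small sketch makes the logic concrete. This is a simplified, sequential stand-in for the parallel scheme described above; the group structure and reward values are invented for illustration.

```python
# Sketch of DAPO-style dynamic sampling: groups of rollouts whose
# rewards are all identical carry no advantage signal and are dropped;
# sampling stops once a target count of informative groups is reached.
def informative(group):
    """A group is informative iff its rewards are not all identical."""
    rewards = [r for _, r in group]
    return len(set(rewards)) > 1

def dynamic_sample(group_stream, target):
    kept = []
    for group in group_stream:          # oversampled in parallel in practice
        if informative(group):
            kept.append(group)
        if len(kept) == target:
            break                       # cancel outstanding rollouts here
    return kept

groups = [
    [("t1", 0.0), ("t1", 0.0)],         # all-fail: filtered out
    [("t2", 1.0), ("t2", 0.0)],         # mixed rewards: kept
    [("t3", 1.0), ("t3", 1.0)],         # all-pass: filtered out
    [("t4", 0.0), ("t4", 1.0)],         # mixed rewards: kept
]
print(len(dynamic_sample(iter(groups), target=2)))  # -> 2
```

The early `break` is where the real system's cancellation matters: once enough informative prompts exist, still-running rollouts are pure waste, so terminating them frees workers for the next batch.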

Benchmark Results on SWE-Bench Verified

The system was validated with Qwen3 models at multiple scales. ProRL Agent consistently improved performance over the reproduced baselines.

Model Scale    Reproduced Baseline (%)    ProRL Agent RL (%)
Qwen3-4B       14.8                       21.2
Qwen3-8B       9.6                        18.0
Qwen3-14B      15.4                       23.6

Note: The previously reported score for SkyRL-Agent-14B-v0 was 21.6.

Beyond software engineering, the system generalized to STEM, math, and code domains, showing stable reward growth during RL training. A scalability test confirmed that throughput increases near-linearly as compute nodes are added.

Key Takeaways

  • Architectural Decoupling: ProRL Agent handles the complete lifecycle of an agent's execution, including environment startup, tool execution, and reward computation, as an independent HTTP service, separating I/O-intensive operations from GPU-focused policy training.
  • Key Performance Gains: This infrastructure enabled the Qwen3-8B model to nearly double its performance on the SWE-Bench Verified benchmark (from 9.6% to 18.0%), while the Qwen3-14B model improved from 15.4% to 23.6%.
  • Reduced System Latency: Targeted optimizations, such as replacing tmux with a ptyprocess-based pseudo-terminal for shell execution, reduced action latency from 0.78s to 0.42s, contributing to near-linear throughput scaling across compute nodes.
  • Elimination of Tokenization Drift: The framework uses a token-in/token-out communication pipeline, ensuring that the original token IDs generated during rollout reach the trainer without the risk of re-tokenization drift.
  • HPC-Native Deployment: Using Singularity instead of Docker, ProRL Agent supports rootless execution and native Slurm integration, enabling large-scale agent training on shared high-performance computing clusters.

Check out the Paper and the Repo.

