Running reinforcement learning on a trillion-parameter model sounds like a recipe for GPU-melting latency and training instability. Prime Intellect just shipped prime-rl v0.6.0 to prove otherwise. The new release enables RL post-training of massive Mixture-of-Experts (MoE) models (think 744B+ parameter behemoths) on long-horizon agentic tasks, achieving sub-5-minute step times on only 28 H200 nodes. That's a meaningful systems milestone for anyone trying to post-train frontier-scale open models without a hyperscaler's budget.

Why RL at this scale is hard

In RL, inference is the critical bottleneck of the training lifecycle: it is where the model interacts with its environment, producing rollouts that are evaluated and assigned a reward. At trillion-parameter scale the problem compounds, adding train-inference misalignment, inefficient rollout processing, and bottlenecks elsewhere in the RL system.

Agentic tasks make this even messier. A coding agent solving a SWE task might run for hundreds of turns, generating tool calls, reading file outputs, and reasoning over long contexts. These workloads often have long-tail outliers: rollouts that can take up to a few hours, especially on long-horizon coding tasks. Delaying the policy update until those rollouts finish would under-utilize GPUs and hurt performance.
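To make the straggler problem concrete, here is a minimal toy simulation, not prime-rl code. It compares a synchronous loop, which waits for the slowest rollout in every batch, with a loop that fires a training step as soon as any `batch_size` rollouts have finished. The function names, the rollout durations (in minutes), and the fixed pool of worker slots are all made up for illustration, and the sketch ignores how stale the finished rollouts are relative to the current policy.

```python
import heapq

def sync_step_finish_times(durations, batch_size):
    """Launch batch_size rollouts per step and wait for the slowest one."""
    t, finish_times = 0.0, []
    for i in range(0, len(durations), batch_size):
        t += max(durations[i:i + batch_size])
        finish_times.append(t)
    return finish_times

def async_step_finish_times(durations, batch_size, workers):
    """Workers start the next rollout as soon as they are free; a step
    fires whenever batch_size more rollouts have completed."""
    free_at = [0.0] * workers          # min-heap of times at which a slot frees up
    heapq.heapify(free_at)
    completions = []
    for d in durations:
        start = heapq.heappop(free_at)
        end = start + d
        heapq.heappush(free_at, end)
        completions.append(end)
    completions.sort()
    return [completions[i + batch_size - 1]
            for i in range(0, len(completions) - batch_size + 1, batch_size)]

# Hypothetical rollout durations in minutes, with one 2-hour straggler.
durations = [4, 5, 3, 120, 4, 5, 3, 6, 4, 5, 3, 6]
sync_times = sync_step_finish_times(durations, batch_size=4)
async_times = async_step_finish_times(durations, batch_size=4, workers=4)
print("sync :", sync_times)
print("async:", async_times)
```

In this toy setup the synchronous loop cannot take its first step until the straggler finishes, while the asynchronous loop completes two steps in the meantime and only the batch that contains the straggler is delayed.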

The benchmark Prime Intellect set for themselves: train GLM-5 on SWE tasks at up to 131k sequence length, with sub-5-minute step times and a batch size of 256 rollouts, on only 28 H200 nodes. GLM-5.1 is Zhipu AI's flagship foundation model designed for autonomous agentic workflows, built on a 744-billion-parameter MoE architecture with 40 billion parameters active during inference.
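A quick back-of-the-envelope calculation shows what those targets imply. It is a rough sketch under two assumptions that the article does not state: that "131k" means 131,072 tokens, and that every rollout reaches the maximum length, which makes the result a worst-case upper bound rather than a measured figure.

```python
# Figures quoted in the article
ROLLOUTS_PER_STEP = 256
STEP_SECONDS = 5 * 60          # the sub-5-minute target, taken at its upper limit
TOTAL_PARAMS_B = 744           # total parameters, in billions
ACTIVE_PARAMS_B = 40           # parameters active during inference, in billions

# Assumption: "131k" is read as 2**17 tokens
MAX_SEQ_LEN = 131_072

# Worst case: every rollout in the batch hits the maximum sequence length
max_tokens_per_step = ROLLOUTS_PER_STEP * MAX_SEQ_LEN
tokens_per_second = max_tokens_per_step / STEP_SECONDS

# Share of the MoE weights that are active for any one token
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"max tokens per step: {max_tokens_per_step:,}")
print(f"tokens/s to sustain a 5-minute step: {tokens_per_second:,.0f}")
print(f"active parameter fraction: {active_fraction:.1%}")
```

Under these assumptions a full batch holds about 33.5 million tokens, and only around 5% of the weights are active for any given token, which is the sparsity that makes a model of this size tractable to serve at all.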

Bar chart comparing trainer step time (3.7 min median) versus rollout step time (5.0 min median) for RL at 1T scale

Async-first, from the ground up

prime-rl is a framework for large-scale, asynchronous reinforcement learning of large language models, designed to be easy to use and hackable, yet capable of training 1T+-parameter MoE models on 1000+ GPU clusters. The key architectural decision is full asynchrony between the trainer and inference engine: they run on separate hardware and can be optimized independently.
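The decoupling can be pictured as a producer and a consumer joined by a queue. The following is a toy `asyncio` sketch of that pattern and does not use prime-rl's API: `Rollout`, `PolicyState`, `inference_worker`, `trainer`, and the `max_staleness` cap are all hypothetical names introduced here. The staleness cap in particular is an assumption, a common safeguard in asynchronous RL rather than something the article describes.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Rollout:
    policy_version: int   # weights version that generated this rollout
    reward: float

class PolicyState:
    """Stand-in for the shared model weights; only tracks a version counter."""
    def __init__(self):
        self.version = 0

async def inference_worker(queue, policy, n_rollouts):
    # Generates rollouts continuously, never waiting for the trainer to step.
    for i in range(n_rollouts):
        await asyncio.sleep(0)   # placeholder for generation; yields control
        await queue.put(Rollout(policy.version, reward=float(i % 2)))

async def trainer(queue, policy, batch_size, n_steps, max_staleness):
    history = []
    for _ in range(n_steps):
        batch = []
        while len(batch) < batch_size:
            rollout = await queue.get()
            # Assumed safeguard: discard rollouts from weights that are too old.
            if policy.version - rollout.policy_version <= max_staleness:
                batch.append(rollout)
        policy.version += 1      # placeholder for optimizer step + weight sync
        history.append([r.policy_version for r in batch])
    return history

async def main():
    queue = asyncio.Queue(maxsize=8)   # bounded buffer between the two sides
    policy = PolicyState()
    producer = asyncio.create_task(inference_worker(queue, policy, 64))
    history = await trainer(queue, policy, batch_size=4, n_steps=3, max_staleness=2)
    producer.cancel()
    return policy.version, history

final_version, history = asyncio.run(main())
print(final_version, history)
```

Each side only touches the queue and the version counter, which is what allows the two halves to be scaled and tuned separately. The price is that a batch may mix rollouts generated by slightly older weights.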
