Reinforcement learning post-training has become one of the most powerful tools in modern AI, but there's a catch: almost every RL framework is built for exactly one type of model. You need one stack for your LLM, another for your diffusion model, and yet another if you're working with a unified architecture that combines both. UniRL, just open-sourced by Tencent Hunyuan, is a direct attack on that fragmentation.

One loop to rule them all

UniRL applies one RL post-training loop (generate samples, score them, compute advantages, update the policy, and sync weights back to rollout workers) across multimodal model families. That sounds simple, but the engineering challenge is significant: diffusion models, autoregressive LLMs, and vision-language models all have fundamentally different generation mechanics, probability representations, and training dynamics.
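To make the five steps concrete, here is a toy sketch of such a loop in plain Python. It is not UniRL code: the "policy" is a single scalar (the mean of a unit-variance Gaussian), the reward is a made-up distance to a target, the advantage is a simple batch-mean baseline, and every name (ToyPolicy, score, compute_advantages, update, sync, run_loop) is a hypothetical stand-in chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for a model: samples from N(mean, 1)."""
    mean: float = 0.0

    def generate(self, n, rng):
        # Step 1: generate samples from the current policy.
        return [rng.gauss(self.mean, 1.0) for _ in range(n)]


def score(sample, target=3.0):
    # Step 2: made-up reward, higher when the sample is closer to the target.
    return -(sample - target) ** 2


def compute_advantages(rewards):
    # Step 3: subtract the batch-mean reward as a baseline (an assumption,
    # not necessarily the estimator any real framework uses).
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def update(policy, samples, advantages, lr=0.05):
    # Step 4: REINFORCE-style step. For N(mean, 1), the gradient of the
    # log-probability with respect to the mean is (sample - mean).
    grad = sum(a * (s - policy.mean) for s, a in zip(samples, advantages))
    policy.mean += lr * grad / len(samples)


def sync(trainer_policy, rollout_policy):
    # Step 5: copy the updated weights back to the rollout copy.
    rollout_policy.mean = trainer_policy.mean


def run_loop(steps=200, batch=16, seed=0):
    rng = random.Random(seed)
    trainer, rollout = ToyPolicy(), ToyPolicy()
    for _ in range(steps):
        samples = rollout.generate(batch, rng)
        rewards = [score(s) for s in samples]
        advantages = compute_advantages(rewards)
        update(trainer, samples, advantages)
        sync(trainer, rollout)
    return trainer, rollout


if __name__ == "__main__":
    trainer, rollout = run_loop()
    print(f"learned mean: {trainer.mean:.2f}")
```

The point of the sketch is the shape, not the math: the loop body never changes, and only the five functions would differ between a diffusion model and an LLM.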

The key design insight is that model family and algorithm are treated as two independent axes. Any supported algorithm can run on any supported model in its domain, so the total coverage is the product of the two dimensions rather than a fixed list of hand-matched pairs. The loop is always the same: generate, score, advantage, update, sync.
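The "product of two dimensions" idea can be illustrated with a small registry sketch. The domain keys, model names, and algorithm names below are placeholders invented for this example, not UniRL's actual model zoo or algorithm list; the only thing taken from the article is that pairing happens within a domain.

```python
from itertools import product

# Hypothetical registries, grouped by domain. All names are placeholders.
MODELS = {
    "diffusion": ["diffusion_model_a", "diffusion_model_b"],
    "ar": ["llm_a", "llm_b", "llm_c"],
}
ALGORITHMS = {
    "diffusion": ["algo_x", "algo_y"],
    "ar": ["algo_x", "algo_z"],
}


def supported_pairs(models, algorithms):
    """Return every (domain, model, algorithm) combination within each domain."""
    pairs = []
    for domain, names in models.items():
        for model, algo in product(names, algorithms.get(domain, [])):
            pairs.append((domain, model, algo))
    return pairs


if __name__ == "__main__":
    pairs = supported_pairs(MODELS, ALGORITHMS)
    print(len(pairs))  # 2*2 + 3*2 = 10
```

Under this layout, registering one new algorithm in a domain makes it available to every model in that domain, with no per-pair glue code.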

The framework is layered and composable. Each entrypoint (train_diffusion, train_ar, train_pe, train_unified_model) loads a Hydra config covering model, algorithm, rollout, reward, placement, and sync, then creates the matching domain trainer. The trainer coordinates the RL loop across pluggable rollout engines, algorithms, model bundles, reward services, and the shared distributed runtime: Ray DevicePool, FSDP2, Transfer Queue, and LoRA/full-weight sync.
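As a rough illustration of that layering, the sketch below mimics the entrypoint-to-trainer step using a plain dictionary in place of a Hydra config. The four entrypoint names and six config groups come from the description above; everything else (DomainTrainer, build_trainer, the example values) is hypothetical and does not reflect UniRL's real classes or config schema.

```python
from dataclasses import dataclass

# Config groups and entrypoint names as listed in the article.
REQUIRED_GROUPS = ("model", "algorithm", "rollout", "reward", "placement", "sync")
ENTRYPOINTS = ("train_diffusion", "train_ar", "train_pe", "train_unified_model")


@dataclass
class DomainTrainer:
    """Hypothetical placeholder for a domain trainer."""
    entrypoint: str
    config: dict


def build_trainer(entrypoint, config):
    """Validate the config groups, then create the trainer for the entrypoint."""
    if entrypoint not in ENTRYPOINTS:
        raise ValueError(f"unknown entrypoint: {entrypoint}")
    missing = [group for group in REQUIRED_GROUPS if group not in config]
    if missing:
        raise ValueError(f"missing config groups: {missing}")
    return DomainTrainer(entrypoint=entrypoint, config=config)


if __name__ == "__main__":
    # Placeholder values only; a real config would carry structured settings.
    example = {group: f"example_{group}" for group in REQUIRED_GROUPS}
    trainer = build_trainer("train_ar", example)
    print(trainer.entrypoint)  # train_ar
```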

The model zoo is unusually broad

The supported model table is where UniRL's scope becomes clear. It covers:
