Reinforcement learning post-training has become the go-to technique for squeezing alignment and task performance out of large language models. But the moment you step outside the world of text tokens and into diffusion models, video generators, and omni-modal systems that handle audio, images, and video simultaneously, the existing tooling largely falls apart. That gap is exactly what VeRL-Omni is designed to close.

The VeRL-Omni Team has announced the pre-release of VeRL-Omni, a general RL post-training framework for multimodal generative models. It is built on top of verl and vLLM-Omni, and the vLLM project is cheering it on as a key ecosystem extension. The framework is open-source under Apache 2.0 and available now on GitHub.

Why text-only RL tooling breaks for multimodal

While the LLM RL stack has evolved rapidly over the past year, multimodal generative RL, covering diffusion and omni-modality models for image, video, and audio understanding and generation, faces critical unmet needs. The core problem is structural: rollouts in diffusion models are denoising trajectories in a continuous latent space rather than token sequences, and a single rollout may invoke multiple heterogeneous model components and multi-stage pipelines, such as a text encoder feeding into a DiT feeding into a VAE. You cannot just plug a standard GRPO trainer into that pipeline.

On top of that, reward functions for multimodal tasks are themselves multimodal models. An OCR reward for a text-to-image model, for example, requires running a vision-language model to read the generated image and score it. Scheduling that reward computation efficiently, without stalling the training loop, is a non-trivial systems problem.

Alpha Signal

Don't miss what's next in AI

Join 300,000+ engineers and researchers who get the signal, not the noise.

  • Full access to in-depth AI research breakdowns
  • Be the first to know what's trending before it hits mainstream
  • Daily curated papers, repos, and industry moves