HiDream.ai's HiDream-O1-Image-1.5 Beats Google Gemini in Blind Image Votes

EDITORIAL LEADERBOARD

Artificial Analysis

12H AGO

2 min read

The image generation leaderboard has a serious new contender. HiDream-O1-Image-1.5, the latest closed-source model from HiDream.ai, has debuted at #3 on the Artificial Analysis Text-to-Image Arena, ahead of Google's Nano Banana 2 (Gemini 3.1 Flash Image Preview) and just behind OpenAI's GPT Image models. That is a notable result for a company that open-sourced its first O1-series model only a month ago.

A quick recap: what is the O1 Image family?

HiDream-O1-Image is a natively unified image generation foundation model built on a Pixel-level Unified Transformer (UiT). It uses no external VAE and no disjoint text encoder. Instead, it encodes raw pixels, text, and task-specific conditions in a single shared token space. The model supports text-to-image generation, image editing, and subject-driven personalization at resolutions up to 2,048 × 2,048. The original open-source 8B model became the highest-ranked open-weight model on the same leaderboard immediately after launch. The 1.5 release is a closed-source, more capable follow-up in the same family.

The defining architectural choice is the elimination of the variational autoencoder (VAE). Mainstream diffusion models such as FLUX.2, Stable Diffusion 3.5, and DALL-E 3 compress images through a VAE: an encoder maps pixels into a latent space before diffusion, and a decoder maps the result back to pixels afterward. HiDream skips that step completely. Its Unified Transformer works directly on raw image pixels, alongside text tokens and task-specific conditions in the same token space, removing the VAE bottleneck used by virtually every competing model.
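To make the "VAE bottleneck" concrete, here is a toy sketch of the encode-then-decode round trip. It is not a real VAE and not HiDream's or any competitor's code: average pooling stands in for a learned encoder, pixel repetition stands in for a learned decoder, and the 8x downsampling factor is an assumption chosen for illustration. The function names `toy_encode` and `toy_decode` are hypothetical.

```python
import numpy as np

def toy_encode(image, factor=8):
    """Stand-in for a VAE encoder: average each factor x factor block.

    Assumption: real VAEs use learned convolutions and usually more
    latent channels; pooling only mimics the spatial compression.
    """
    h, w, c = image.shape
    blocks = image.reshape(h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(1, 3))

def toy_decode(latent, factor=8):
    """Stand-in for a VAE decoder: repeat each latent value over its block."""
    return latent.repeat(factor, axis=0).repeat(factor, axis=1)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))           # random "image" full of fine detail

latent = toy_encode(image)                # the space a latent model would work in
recon = toy_decode(latent)                # back to pixel resolution
error = float(np.abs(image - recon).mean())

print("latent shape:", latent.shape)
print("reconstruction shape:", recon.shape)
print("mean absolute round-trip error:", error)
```

The round trip cannot restore detail finer than the block size, which is the kind of loss a pixel-level model avoids by never leaving pixel space. A learned VAE loses far less than this toy, so the size of the error here says nothing about real systems.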

Why dropping the VAE matters

Most image generation pipelines chain separate components: a text encoder converts your prompt into embeddings, a diffusion model generates an image in a compressed "latent space," and a VAE decoder maps that latent back to pixels. Each handoff introduces a translation bottleneck. Latent DiTs depend on latent-space VAE compression, and pixel-space DiTs typically still rely on disjoint text encoders. The Unified Transformer in HiDream-O1-Image instead natively encodes raw image pixels, text, and task-specific conditions within a shared token space, and thus generalizes to broader and more complex generative tasks.
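The idea of a shared token space can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not HiDream's implementation: the patch size, embedding width, vocabulary size, number of task types, and the random projection matrices are all hypothetical placeholders for learned components, and `patchify` and `build_shared_sequence` are names invented here.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)        # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def build_shared_sequence(text_ids, image, task_id,
                          dim=64, patch=16, vocab=1000, n_tasks=4, seed=0):
    """Embed task, text, and raw pixels into one token sequence.

    Assumption: random matrices stand in for learned embedding tables
    and for the learned pixel projection.
    """
    rng = np.random.default_rng(seed)
    text_table = rng.normal(size=(vocab, dim))
    task_table = rng.normal(size=(n_tasks, dim))

    patches = patchify(image, patch)
    pixel_proj = rng.normal(size=(patches.shape[1], dim))

    task_token = task_table[task_id][None, :]       # (1, dim)
    text_tokens = text_table[np.asarray(text_ids)]  # (T, dim)
    image_tokens = patches @ pixel_proj             # (N, dim), no VAE involved

    # One sequence that a single transformer could attend over.
    return np.concatenate([task_token, text_tokens, image_tokens], axis=0)

image = np.zeros((64, 64, 3), dtype=np.float32)
seq = build_shared_sequence([5, 17, 42], image, task_id=0)
print(seq.shape)  # (20, 64): 1 task token + 3 text tokens + 16 image patches
```

Because every input ends up in the same sequence, a different task such as image editing would only change which tokens are present, not which model components are involved. That is the property the paragraph above attributes to the unified design.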
