Every major image generation model you've used (Stable Diffusion, FLUX, DALL-E 3) shares a hidden bottleneck: before any diffusion happens, a separate neural network called a VAE (Variational Autoencoder) compresses the image into a compact "latent" representation. It's a practical shortcut that makes training feasible, but it comes with a cost. That compression is lossy, and the errors it introduces ripple through the entire generation pipeline. NVIDIA Research's PixelDiT asks a simple but radical question: what if you just skipped the VAE entirely?

The VAE Problem Nobody Talks About

Latent Diffusion Models (LDMs) have dominated image generation for years. The standard recipe is: train a VAE to compress images into a smaller latent space, then train a diffusion model to operate in that latent space. It's efficient, but it creates three structural problems that are easy to overlook.

  • Lossy reconstruction: Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization.
  • Objective mismatch: The VAE is trained to reconstruct images well, not to make the diffusion model's job easier. These are different goals, and the gap between them introduces distribution shift that the diffusion model has to compensate for.
  • Fine detail destruction: High-frequency content like text in a scene, fine textures, and sharp edges are exactly what VAE compression tends to smear. Even a perfect diffusion model can't recover what the VAE already destroyed.
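To make the fine-detail point concrete, here is a toy sketch in pure Python. It is not a real VAE: block averaging followed by upsampling is only a crude stand-in for a lossy encode/decode round trip, and the image size, the downsampling factor of 8, and all function names are illustrative assumptions. It compares how much error the round trip introduces on a smooth gradient versus a one-pixel checkerboard.

```python
def block_average_roundtrip(img, f):
    """Toy stand-in for a lossy autoencoder (assumption, not a real VAE):
    average each f x f block, then upsample by repeating the block mean."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, f):
        for bx in range(0, w, f):
            block = [img[y][x] for y in range(by, by + f) for x in range(bx, bx + f)]
            mean = sum(block) / len(block)
            for y in range(by, by + f):
                for x in range(bx, bx + f):
                    out[y][x] = mean
    return out


def mse(a, b):
    """Mean squared error between two images stored as lists of rows."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)) / n


size, f = 64, 8  # illustrative image size and downsampling factor

# Low-frequency content: a smooth horizontal gradient from 0 to 1.
smooth = [[x / (size - 1) for x in range(size)] for _ in range(size)]
# High-frequency content: a checkerboard that flips every pixel.
fine = [[float((x + y) % 2) for x in range(size)] for y in range(size)]

smooth_err = mse(smooth, block_average_roundtrip(smooth, f))
fine_err = mse(fine, block_average_roundtrip(fine, f))

print(f"smooth gradient MSE: {smooth_err:.5f}")
print(f"checkerboard MSE:    {fine_err:.5f}")
```

The gradient survives almost untouched, while the checkerboard collapses to flat gray. A learned autoencoder is far better than block averaging, but the asymmetry is the same in kind: whatever detail the bottleneck discards is not available to any model that works downstream of it.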

The practical consequence shows up most painfully in image editing. If you use a flow-based editing method like FlowEdit on FLUX or Stable Diffusion 3 and ask it to change a bicycle to a motorcycle in a scene with text on a wall, the VAE will have already garbled that wall text before the diffusion model even sees it. The edit is correct, but the background is corrupted.

Going Back to Pixels, But Smarter

PixelDiT is a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in pixel space. Pixel-space diffusion isn't a new idea: ADM (OpenAI's model from "Diffusion Models Beat GANs on Image Synthesis") operated in pixels. But the approach was largely abandoned because the computational cost is brutal. Self-attention cost grows quadratically with the number of tokens, so treating every pixel as a token makes megapixel-scale training essentially impossible with naive approaches.
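The quadratic scaling is easy to check with back-of-the-envelope arithmetic. The sketch below counts tokens and pairwise attention scores for a 1024x1024 image under two tokenizations. The patch sizes are illustrative assumptions chosen for the comparison, not PixelDiT's actual configuration, and the function name is hypothetical.

```python
def attention_cost(height, width, patch):
    """Return (token count, pairwise attention scores per head per layer)
    when each token covers a patch x patch region of the image."""
    tokens = (height // patch) * (width // patch)
    return tokens, tokens * tokens


# Naive pixel space: every pixel is its own token.
pixel_tokens, pixel_pairs = attention_cost(1024, 1024, patch=1)

# Coarser tokenization: each token covers 16x16 pixels (illustrative assumption).
coarse_tokens, coarse_pairs = attention_cost(1024, 1024, patch=16)

print(f"per-pixel tokens: {pixel_tokens:,} -> attention scores: {pixel_pairs:,}")
print(f"16x16 tokens:     {coarse_tokens:,} -> attention scores: {coarse_pairs:,}")
print(f"cost ratio:       {pixel_pairs // coarse_pairs:,}x")
```

Shrinking the token grid by 16 in each dimension cuts the token count by 256 and the attention cost by 256 squared, which is why any practical pixel-space design has to avoid running global attention over individual pixels.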
