
Autoregressive language models have a fundamental speed problem: they generate one token at a time, sequentially, no matter how much compute you throw at them. NVIDIA Research just released Nemotron-Labs-TwoTower, a new architecture that sidesteps this bottleneck by splitting a single pretrained model into two specialized copies that work in tandem, achieving 2.42x faster generation while retaining 98.7% of the original model's benchmark quality.
The core problem with diffusion language models
Diffusion models for text are a promising alternative to autoregressive generation. Instead of writing tokens left-to-right one at a time, they start with a fully masked sequence and iteratively "denoise" it, predicting multiple tokens in parallel. The catch is that existing approaches use a single network for both context representation and iterative denoising, forcing one model to serve both roles and limiting its capacity for either.
Think of it like asking one person to both hold the conversation history in their head and simultaneously write the next paragraph. Both tasks compete for the same mental bandwidth. NVIDIA's insight is to just use two people.
Two towers, one pretrained checkpoint
TwoTower decouples these roles into two towers: a frozen AR context tower that causally processes clean tokens, and a trainable diffusion denoiser tower with bidirectional block attention that refines noisy blocks via cross-attention to the context. Both towers are initialized from the same pretrained Nemotron-3-Nano-30B-A3B checkpoint, a 30B hybrid model that interleaves Mamba-2 state-space layers, standard attention, and mixture-of-experts (MoE) layers.
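To make the two attention patterns concrete, here is a minimal sketch of the masks each tower would use. The function names are hypothetical, and it assumes that "bidirectional block attention" means every position in a noisy block may attend to every other position in the same block and to nothing outside it. The cross-attention to the context tower is not modeled here.

```python
def causal_mask(n):
    """Context-tower style mask: position i may attend to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(n, block_size):
    """Denoiser-style mask (assumed): full attention inside each block,
    no attention across blocks."""
    return [[i // block_size == j // block_size for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Print both patterns for a 4-token sequence split into blocks of 2.
    for name, mask in [("causal", causal_mask(4)),
                       ("block", block_bidirectional_mask(4, 2))]:
        print(name)
        for row in mask:
            print("".join("x" if allowed else "." for allowed in row))
```

In the causal mask the allowed region is a lower triangle. In the block mask it is a set of filled squares along the diagonal, which is what lets the denoiser see tokens to the right of the position it is predicting.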
The key engineering decision: only the denoiser tower is trained. The context tower stays completely frozen, preserving all the knowledge baked in during the original 25T-token pretraining. The denoiser is then adapted via a masked diffusion objective on roughly 2.1T tokens, about 8% of the original training budget.
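As a rough illustration of what a masked diffusion objective looks like, the sketch below corrupts a clean sequence by masking tokens at random and scores a denoiser only on the positions that were masked. This is the generic form of the objective, not NVIDIA's training code. The helper names, the independent per-token masking, and the unweighted mean loss are all assumptions.

```python
import math
import random

MASK = "[MASK]"


def corrupt(tokens, mask_prob, rng):
    """Replace each token with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def masked_loss(tokens, noisy, predicted_probs):
    """Mean negative log-likelihood of the true token, over masked positions only.

    predicted_probs[i] maps token -> probability and stands in for the
    output of a (hypothetical) denoiser at position i.
    """
    masked = [i for i, tok in enumerate(noisy) if tok == MASK]
    if not masked:
        return 0.0
    total = sum(math.log(predicted_probs[i][tokens[i]]) for i in masked)
    return -total / len(masked)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = ["the", "cat", "sat", "down"]
    noisy = corrupt(clean, 0.5, rng)
    # A denoiser that guesses uniformly over the four known tokens.
    uniform = [{tok: 0.25 for tok in clean} for _ in clean]
    print(noisy, masked_loss(clean, noisy, uniform))
```

Unmasked positions contribute nothing to the loss, so the model is only asked to recover what was hidden.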
Here is how generation actually works at inference time:
- The context tower encodes the prompt causally, producing per-layer KV caches and Mamba state vectors.
- The denoiser receives a block of 16 masked tokens ([MASK], [MASK], ...).
- Over multiple denoising steps, it predicts all masked positions in parallel, using bidirectional attention within the block and cross-attending to the context tower layer-by-layer.
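The steps above can be sketched as a small decoding loop. Everything here is a toy stand-in: `encode_context` and `toy_denoiser` are hypothetical placeholders for the two towers, and the rule for which positions to commit at each step (most confident first, in equal shares per step) is a common masked-diffusion heuristic assumed for illustration, not something the article specifies. Appending each finished block to the context is likewise an assumption.

```python
import random

MASK = "[MASK]"
BLOCK_SIZE = 16  # block length quoted in the article


def encode_context(tokens):
    """Stand-in for the frozen context tower: returns an opaque 'cache'."""
    return {"tokens": list(tokens)}


def toy_denoiser(block, context, rng):
    """Stand-in for the denoiser tower: one (token, confidence) guess per position."""
    vocab = ["alpha", "beta", "gamma"]
    return [(rng.choice(vocab), rng.random()) for _ in block]


def denoise_block(context, num_steps, rng, block_size=BLOCK_SIZE):
    """Start from an all-MASK block and commit tokens over num_steps passes."""
    block = [MASK] * block_size
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        preds = toy_denoiser(block, context, rng)  # all positions predicted in parallel
        steps_left = num_steps - step
        keep = -(-len(masked) // steps_left)  # ceiling division: equal share per step
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:keep]:
            block[i] = preds[i][0]
    return block


def generate(prompt, num_blocks, num_steps, seed=0):
    """Generate num_blocks blocks, extending the context after each one (assumed)."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(num_blocks):
        context = encode_context(tokens)
        tokens.extend(denoise_block(context, num_steps, rng))
    return tokens


if __name__ == "__main__":
    print(generate(["hello"], num_blocks=2, num_steps=4))
```

With 4 denoising steps per 16-token block, the loop makes 4 denoiser passes where a purely sequential decoder would make 16. That gap between steps and tokens is where the speedup of this style of generation comes from.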

