
Training a large neural network has long meant holding the whole model in memory at once: every layer, every activation, every gradient, all resident during the backward pass. DiffusionBlocks, a new framework from Sakana AI and the University of Tokyo presented at ICLR 2026, breaks that assumption. It trains a network one block at a time, so training memory shrinks roughly in proportion to the number of blocks.
Standard neural network training optimizes all parameters jointly, so the memory it requires grows with model size. Block-wise training instead trains each block of the network independently, using memory for just one block at a time. The catch has always been the training signal: how does each block learn something meaningful without the rest of the network? That is the problem DiffusionBlocks sets out to solve.
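To make the joint versus block-wise contrast concrete, here is a rough back-of-envelope sketch. It is not taken from the DiffusionBlocks paper: the model size, the even split into four blocks, and the fp32 Adam setup (weights, gradients, and two moment buffers per parameter) are all illustrative assumptions, and activations are ignored entirely.

```python
# Back-of-envelope estimate of training-state memory.
# Assumptions (not from the paper): fp32 everywhere, Adam optimizer,
# parameters split evenly across blocks, activations not counted.
BYTES_PER_PARAM_FP32 = 4


def training_state_bytes(num_params: int) -> int:
    """Weights + gradients + two Adam moment buffers, all in fp32."""
    weights = num_params * BYTES_PER_PARAM_FP32
    grads = num_params * BYTES_PER_PARAM_FP32
    adam_moments = 2 * num_params * BYTES_PER_PARAM_FP32
    return weights + grads + adam_moments


def blockwise_state_bytes(num_params: int, num_blocks: int) -> int:
    """Training state for a single block when only that block is resident."""
    params_per_block = num_params // num_blocks
    return training_state_bytes(params_per_block)


total_params = 1_200_000_000  # hypothetical 1.2B-parameter model
joint = training_state_bytes(total_params)
one_block = blockwise_state_bytes(total_params, num_blocks=4)

print(f"joint training:  {joint / 1e9:.1f} GB")
print(f"one of 4 blocks: {one_block / 1e9:.1f} GB")
```

Under these assumptions the per-block figure is simply the joint figure divided by the number of blocks. Real savings depend on how evenly the blocks divide the model and on what else has to stay in memory.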
Why memory is the real wall
End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. As Transformers are scaled primarily by adding more layers, depth directly drives the growing memory cost. Today's frontier models typically have hundreds of billions of parameters or more and require thousands of GPUs to train, and only a small number of organizations have the resources to develop them.
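The link between depth and activation memory can be sketched with simple arithmetic. The function below is a deliberately simplified model, not a measurement: it assumes one stored hidden-state tensor per layer in 16-bit precision, whereas a real Transformer layer keeps several intermediate tensors for the backward pass. The batch size, sequence length, and hidden width are hypothetical.

```python
def activation_bytes(
    num_layers: int,
    batch_size: int,
    seq_len: int,
    hidden_dim: int,
    bytes_per_value: int = 2,  # assumes 16-bit activations
) -> int:
    """Bytes needed to keep one hidden-state tensor per layer for backprop.

    Simplification: real layers store more than one tensor each, so this
    is a lower bound on the trend, not an accurate total.
    """
    per_layer = batch_size * seq_len * hidden_dim * bytes_per_value
    return num_layers * per_layer


# Hypothetical training configuration, chosen only for illustration.
config = dict(batch_size=8, seq_len=2048, hidden_dim=4096)

for depth in (24, 48, 96):
    gb = activation_bytes(depth, **config) / 1e9
    print(f"{depth:>3} layers: {gb:.2f} GB of stored activations")
```

Doubling the number of layers doubles the stored activations in this model, which is why adding depth translates so directly into memory pressure during end-to-end training.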

