Together AI's ParallelKernelBench Reveals Frontier Models Struggle Across Multi-GPU Clusters

Together AI

22H AGO

3 min read

LLMs have gotten surprisingly good at writing GPU kernels. Benchmarks like KernelBench have tracked steady progress on single-GPU CUDA generation, and the research community has taken notice. But there is a catch: almost all of the benchmarks tracking that progress target a single GPU. In production, models span dozens of GPUs, and compute is no longer the bottleneck.

Communication overhead can account for over 20% of inference latency, and that gap keeps widening as compute scales faster than interconnect bandwidth. To measure whether LLMs can actually handle that reality, researchers at Together AI built ParallelKernelBench (PKB) -- a benchmark of 87 real-world multi-GPU kernel problems. The verdict: frontier models are not there yet, but a few surprising wins hint at what is coming.

Why multi-GPU is a fundamentally different problem

Writing a fast single-GPU kernel is hard. Writing a fast multi-GPU kernel is a different category of hard. The design space expands combinatorially as practitioners compose tensor, expert, data, context, and sequence parallelism. The performance model changes -- a single-GPU roofline is built around compute and memory bandwidth, but in multi-GPU code, the bottleneck is often the interconnect. And there is a critical new design choice: how to move data between GPUs -- through the copy engine, TMA, SM load/store, or NVLS -- and whether to fuse that movement with compute.
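To make the shift in the performance model concrete, here is a minimal sketch of a roofline-style lower bound with a third term for interconnect traffic. It assumes compute, memory traffic, and communication overlap perfectly, which real kernels rarely achieve. The `Hardware` class, the function name, and every number below are illustrative placeholders, not figures from PKB or from any specific GPU.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    """Illustrative per-GPU limits; all numbers are placeholders, not measurements."""
    peak_flops: float      # FLOP/s
    mem_bandwidth: float   # bytes/s to local memory
    link_bandwidth: float  # bytes/s over the GPU-to-GPU interconnect


def kernel_time_lower_bound(flops, mem_bytes, comm_bytes, hw):
    """Return (seconds, limiting_resource), assuming all three resources overlap perfectly."""
    times = {
        "compute": flops / hw.peak_flops,
        "memory": mem_bytes / hw.mem_bandwidth,
        "interconnect": comm_bytes / hw.link_bandwidth,
    }
    limiter = max(times, key=times.get)
    return times[limiter], limiter


# Hypothetical hardware: 1 PFLOP/s, 3 TB/s local memory, 400 GB/s interconnect
hw = Hardware(peak_flops=1e15, mem_bandwidth=3e12, link_bandwidth=4e11)

# Same kernel, first with no cross-GPU traffic, then with 1 GB exchanged
single_gpu = kernel_time_lower_bound(1e12, 2e9, 0.0, hw)
multi_gpu = kernel_time_lower_bound(1e12, 2e9, 1e9, hw)
print(single_gpu[1])  # compute
print(multi_gpu[1])   # interconnect
```

With these placeholder numbers the same workload flips from compute-bound to interconnect-bound once it has to exchange data with its peers, which is the situation the paragraph above describes.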

NCCL (NVIDIA Collective Communications Library) is the standard tool for this today. It handles collectives like all-reduce and all-gather across GPUs, but it operates at a high level of abstraction. That matters most in the decode phase, where messages are small and data dependencies can prevent compute-communication overlap: collective algorithms optimized for high throughput on large messages do not scale down well to small payloads. Custom kernels can close that gap by fusing or interleaving communication chunks with compute. That is exactly what PKB asks models to do.
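The small-message problem can be illustrated with the textbook latency-bandwidth (alpha-beta) cost model for a ring all-reduce. This is a rough sketch only: it is a common simplification, not a description of how NCCL actually selects or implements its algorithms, and the per-step latency and link bandwidth below are assumed placeholder values.

```python
def ring_all_reduce_time(message_bytes, num_gpus, latency_s, bandwidth_bytes_per_s):
    """Alpha-beta estimate for a ring all-reduce.

    A ring all-reduce takes 2 * (p - 1) steps (reduce-scatter, then all-gather),
    and each step moves message_bytes / p bytes per GPU.
    Returns (latency_term_seconds, bandwidth_term_seconds).
    """
    steps = 2 * (num_gpus - 1)
    latency_term = steps * latency_s
    bandwidth_term = steps * (message_bytes / num_gpus) / bandwidth_bytes_per_s
    return latency_term, bandwidth_term


# Placeholder values: 8 GPUs, 5 microseconds per step, 400 GB/s per link
NUM_GPUS, LATENCY_S, BANDWIDTH = 8, 5e-6, 4e11

small = ring_all_reduce_time(16 * 1024, NUM_GPUS, LATENCY_S, BANDWIDTH)  # decode-sized payload
large = ring_all_reduce_time(1024 ** 3, NUM_GPUS, LATENCY_S, BANDWIDTH)  # 1 GiB payload

for name, (lat, bw) in (("16 KiB", small), ("1 GiB", large)):
    share = lat / (lat + bw)
    print(f"{name}: latency share of total time = {share:.1%}")
```

Under these assumptions, nearly all of the time for the 16 KiB message is fixed per-step latency, while the 1 GiB message is dominated by the bandwidth term. An algorithm tuned for the second case does little for the first, which is where fused or custom kernels have room to help.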

The benchmark: 87 problems from real codebases

PKB offers a benchmark and evaluation framework for multi-GPU kernel generation. Each problem starts from a standard PyTorch + NCCL implementation and a description of the hardware topology. The model then has to replace that reference with a CUDA kernel that communicates directly across GPUs using symmetric memory. Symmetric memory here means memory that is simultaneously addressable by multiple GPUs over NVLink -- it lets a kernel on GPU 0 directly read or write GPU 1's memory without going through the OS or a separate copy engine.
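As a purely conceptual picture of what symmetric memory enables, the toy below models each GPU's buffer as a Python list that every "rank" can index directly, and has each rank compute an all-reduce by reading its peers' buffers itself. It is a single-process illustration with hypothetical names (`SymmetricBuffers`, `one_shot_all_reduce`), not real CUDA and not the code PKB expects. An actual kernel would also need cross-GPU synchronization before reading peer memory.

```python
class SymmetricBuffers:
    """Toy stand-in for symmetric memory: every rank can index every rank's buffer."""

    def __init__(self, world_size, length):
        self.world_size = world_size
        self.buffers = [[0.0] * length for _ in range(world_size)]


def one_shot_all_reduce(sym, rank):
    """What one rank computes: read each peer's buffer directly and sum element-wise.

    A real kernel would first wait on a barrier so that all peers have finished writing.
    """
    length = len(sym.buffers[rank])
    return [
        sum(sym.buffers[peer][i] for peer in range(sym.world_size))
        for i in range(length)
    ]


sym = SymmetricBuffers(world_size=4, length=3)
for r in range(sym.world_size):
    sym.buffers[r] = [float(r + 1)] * 3  # rank r contributes r + 1 in every slot

# Every rank performs the same reads and arrives at the same reduced vector
results = [one_shot_all_reduce(sym, r) for r in range(sym.world_size)]
print(results[0])  # [10.0, 10.0, 10.0]
```

The point of the sketch is the access pattern: no rank sends anything, and each one simply loads from its peers' memory, which is what lets communication be fused into the same kernel as the compute.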

To make sure the 87 problems cover the real space of production parallelism types, the team built them from a taxonomy of distributed workloads, identifying the major ways models get sharded -- tensor, context, data, expert, sequence, and FSDP/ZeRO -- along with the communication patterns each one creates. The problems were pulled from:
