Together AI's ParallelKernelBench Reveals Top Models Fail 70% of Multi-GPU Tasks

Together AI

11H AGO

2 min read

Benchmarks for LLM-written GPU code have mostly tested single-GPU CUDA kernels. But production AI infrastructure doesn't run on one GPU. It runs on clusters, where the real bottleneck is how fast data moves between GPUs, not how fast a single chip computes. ParallelKernelBench (PKB) is a new open-source benchmark from Together AI's Frontier Performance team that tests exactly this, and the headline result is a wake-up call: top models fail 70% of the tasks.

The gap nobody was measuring

In production, communication overhead can account for over 20% of inference latency, and the gap keeps widening as compute scales faster than interconnect bandwidth. LLMs have made progress on GPU kernel generation, but that progress has mostly been measured on a single GPU. PKB is the first benchmark to directly address this mismatch.

ParallelKernelBench is a benchmark and evaluation framework for multi-GPU kernel generation. It includes 87 problems drawn from real codebases, each asking the model to replace PyTorch + NCCL code with a CUDA kernel that moves data directly over NVLink. NCCL (NVIDIA Collective Communications Library) is the standard library for GPU-to-GPU communication; PKB asks models to bypass it entirely and write lower-level kernels that talk directly over NVLink, the high-bandwidth interconnect between GPUs in a server node.
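To make concrete what a collective such as the ones NCCL provides has to accomplish, here is a minimal pure-Python simulation of a sum all-reduce over a ring of ranks. It is only a sketch: the function name `ring_all_reduce` is hypothetical, Python lists stand in for GPU buffers, and ring all-reduce is used here as one well-known algorithm for this collective, not as a description of any PKB problem or reference solution.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce across ranks arranged in a ring.

    `buffers` holds one equal-length list per simulated rank (a stand-in
    for per-GPU memory). Returns new per-rank lists, each containing the
    element-wise sum of all inputs. Illustrative only: a real kernel would
    move these chunks over the interconnect instead of copying lists.
    """
    n = len(buffers)
    length = len(buffers[0])
    assert all(len(b) == length for b in buffers), "buffers must match in size"
    assert length % n == 0, "sketch assumes length divisible by rank count"
    chunk = length // n
    data = [list(b) for b in buffers]  # do not mutate the caller's lists

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    def exchange(first_chunk_offset, combine):
        # n - 1 steps; in each step every rank sends one chunk to its
        # right-hand neighbour. Payloads are snapshotted first so that all
        # sends in a step behave as if they happened simultaneously.
        for step in range(n - 1):
            sends = []
            for r in range(n):
                c = (r + first_chunk_offset - step) % n
                sends.append((c, [data[r][i] for i in span(c)]))
            for r, (c, payload) in enumerate(sends):
                dst = (r + 1) % n
                for i, v in zip(span(c), payload):
                    data[dst][i] = combine(data[dst][i], v)

    # Phase 1 (reduce-scatter): afterwards rank r owns the full sum of
    # chunk (r + 1) % n.
    exchange(0, lambda mine, incoming: mine + incoming)
    # Phase 2 (all-gather): circulate the finished chunks to every rank.
    exchange(1, lambda mine, incoming: incoming)
    return data


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result[0])  # [111, 222, 333]
```

Each rank sends and receives 2 × (n − 1) chunks in total, which is why the cost of the operation is governed by link bandwidth rather than by arithmetic.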

What makes multi-GPU so much harder

It's not just about knowing CUDA syntax. The challenge is fundamentally different from single-GPU work in three ways:

The design space explodes. Practitioners compose tensor, expert, data, context, and sequence parallelism to fit the hardware, and each combination creates a different communication pattern.
The performance model changes. On a single GPU, you optimize for compute throughput and memory bandwidth. On multiple GPUs, the bottleneck is the interconnect, and the roofline model (a framework for estimating peak achievable performance) looks completely different.
New low-level choices appear.
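The shift in the performance model can be sketched as a roofline estimate with a third ceiling for the interconnect. This is a rough illustration, not PKB's methodology: the function `roofline_time` is hypothetical, the hardware figures are round placeholder numbers rather than measurements of any real GPU, and the model assumes compute, memory traffic, and communication overlap perfectly.

```python
# Placeholder hardware ceilings (assumed round numbers, not vendor specs).
PEAK_FLOPS = 1e15   # floating-point operations per second
HBM_BW = 3e12       # bytes per second to and from on-device memory
LINK_BW = 4.5e11    # bytes per second over the GPU-to-GPU interconnect


def roofline_time(flops, hbm_bytes, link_bytes,
                  peak_flops=PEAK_FLOPS, hbm_bw=HBM_BW, link_bw=LINK_BW):
    """Return (lower-bound seconds, name of the limiting resource).

    Assumes perfect overlap, so the slowest of the three resources sets
    the floor on runtime.
    """
    terms = {
        "compute": flops / peak_flops,
        "memory": hbm_bytes / hbm_bw,
        "interconnect": link_bytes / link_bw,
    }
    bound = max(terms, key=terms.get)
    return terms[bound], bound


# Same hypothetical workload, first with no cross-GPU traffic,
# then with 4.5 GB exchanged between GPUs.
_, single_gpu_bound = roofline_time(flops=2e12, hbm_bytes=3e9, link_bytes=0)
_, multi_gpu_bound = roofline_time(flops=2e12, hbm_bytes=3e9, link_bytes=4.5e9)
print(single_gpu_bound)  # compute
print(multi_gpu_bound)   # interconnect
```

Under these assumed numbers, a workload that is compute-bound on one GPU becomes interconnect-bound once it has to exchange data, so the optimization target moves from arithmetic to bytes sent over the link.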
