Moonshot AI's Kimi K2.7 Code HighSpeed Reaches 260 Tokens per Second, Up to 6x Faster

Kimi.ai

2H AGO

2 min read

Moonshot AI has shipped Kimi K2.7 Code HighSpeed, a high-throughput inference mode for its latest open-source coding model that generates output up to 6x faster than the standard version. The headline numbers: around 180 tokens per second on typical coding tasks, and up to 260 tok/s on shorter-context inputs. For a trillion-parameter model, that is fast.
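
To put those rates in wall-clock terms, here is a minimal Python sketch of the arithmetic. The 180 and 260 tok/s figures come from the announcement; the 30 tok/s "standard" rate is only what you get by treating "up to 6x" as exactly 6x, and the 3,600-token response size is a made-up example. Prefill time and network latency are ignored.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate (no prefill, no network)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


HIGHSPEED_TYPICAL_TPS = 180.0  # from the article: typical coding tasks
HIGHSPEED_PEAK_TPS = 260.0     # from the article: shorter-context inputs
ASSUMED_SPEEDUP = 6.0          # assumption: "up to 6x" treated as exactly 6x
IMPLIED_STANDARD_TPS = HIGHSPEED_TYPICAL_TPS / ASSUMED_SPEEDUP  # 30.0 under that assumption

RESPONSE_TOKENS = 3_600        # hypothetical response size, thinking tokens included

if __name__ == "__main__":
    for label, tps in [
        ("standard (implied)", IMPLIED_STANDARD_TPS),
        ("HighSpeed typical", HIGHSPEED_TYPICAL_TPS),
        ("HighSpeed peak", HIGHSPEED_PEAK_TPS),
    ]:
        print(f"{label:>20}: {generation_seconds(RESPONSE_TOKENS, tps):6.1f} s")
```

Under these assumptions the same response drops from two minutes to 20 seconds, which is the difference an agent loop feels on every step.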

The model underneath

To understand why the speed jump matters, you need to know what K2.7 Code is: a 1-trillion-parameter Mixture-of-Experts (MoE) model with 32B active parameters per token, 384 experts, and a 256K-token context window, released under a Modified MIT license. In an MoE model, only a fraction of the parameters activate for any given token (here 32B of roughly 1T, or about 3.2%), which is how a trillion-parameter model stays economically feasible to serve.
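
For readers new to MoE, the routing idea can be sketched in a few lines of Python. This is a generic top-k router, not Moonshot's implementation: the article gives the expert count (384) and the active-parameter figure (32B of 1T) but not how many experts fire per token, so `TOP_K = 8` below is a placeholder, and the random logits stand in for a learned router.

```python
import math
import random


def top_k_route(router_logits, k):
    """Select the k highest-scoring experts and softmax-normalize their weights."""
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = router_logits[chosen[0]]  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


NUM_EXPERTS = 384      # from the article
TOP_K = 8              # hypothetical: the article does not state experts per token
TOTAL_PARAMS = 1e12    # "1-trillion-parameter"
ACTIVE_PARAMS = 32e9   # "32B active parameters per token"

if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
    routes = top_k_route(logits, TOP_K)
    print("experts used for this token:", [i for i, _ in routes])
    print(f"active parameter share: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Only the selected experts run their feed-forward computation for that token, so per-token compute tracks the 32B active parameters rather than the full trillion.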

Kimi K2.7 Code is a coding-focused agentic model built on Kimi K2.6. It improves substantially on real-world, long-horizon coding tasks while using approximately 30% fewer thinking tokens than K2.6. The architecture also includes a MoonViT vision encoder, native INT4 quantization, and Multi-head Latent Attention (MLA).

One behavioral quirk worth knowing: K2.7 Code forces "thinking" and preserve_thinking on, and you cannot turn them off. The model always reasons before answering, and it keeps its full reasoning chain across multi-turn conversations. Moonshot says this "preserve thinking" mode is what boosts performance in coding-agent scenarios where context builds up over many steps.
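
One practical consequence of always-on preserved thinking is that reasoning tokens keep occupying the context window on later turns. The Python sketch below estimates how many agent turns fit before the window fills. All per-turn token counts are hypothetical, "256K" is read as 256,000 tokens, and the `preserve_thinking=False` branch exists only for comparison, since the article says K2.7 Code does not allow it.

```python
CONTEXT_WINDOW = 256_000  # assumption: "256K" read as 256,000 tokens


def turns_before_overflow(prompt, thinking, answer, tool_output,
                          preserve_thinking=True, window=CONTEXT_WINDOW):
    """Count agent turns that fit before the running context exceeds the window.

    Every turn must fit its own thinking while it runs; preserve_thinking only
    decides whether that thinking stays in the context for later turns.
    """
    if min(thinking, answer, tool_output) < 0 or answer + tool_output <= 0:
        raise ValueError("token counts must be non-negative and the context must grow")
    used, turns = prompt, 0
    while used + thinking + answer + tool_output <= window:
        used += answer + tool_output + (thinking if preserve_thinking else 0)
        turns += 1
    return turns


if __name__ == "__main__":
    # Hypothetical workload: 6k-token prompt, then per turn 2k thinking,
    # 500 answer tokens, and 1.5k tokens of tool output fed back in.
    kept = turns_before_overflow(6_000, 2_000, 500, 1_500, preserve_thinking=True)
    dropped = turns_before_overflow(6_000, 2_000, 500, 1_500, preserve_thinking=False)
    print(f"turns with thinking preserved: {kept}")
    print(f"turns if thinking were dropped (not possible on K2.7 Code): {dropped}")
```

With these made-up numbers the preserved chain roughly halves the number of turns that fit, which is the price of the carried-over reasoning context Moonshot credits for better agent performance.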

Better, cheaper, and now faster

The base K2.7 Code model already came with a meaningful efficiency story: it cuts thinking-token usage by approximately 30% on average compared with K2.6 while scoring higher on Kimi Code Bench v2, Program Bench, and MLS Bench Lite. Because thinking tokens are billed as output tokens, that 30% reduction is a real cost cut on every agentic task, not just a capability claim.
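
A quick Python sketch shows how the thinking-token cut flows through to the bill. Only the 30% figure comes from the article; the price per million output tokens and the token counts are invented for illustration and are not Moonshot's pricing.

```python
def output_cost_usd(thinking_tokens, answer_tokens, price_per_million_output):
    """Output-side cost when thinking tokens are billed at the output rate."""
    return (thinking_tokens + answer_tokens) * price_per_million_output / 1_000_000


PRICE_PER_MILLION = 2.50     # hypothetical price, not Moonshot's actual rate
K26_THINKING_TOKENS = 10_000 # hypothetical thinking tokens for one task on K2.6
ANSWER_TOKENS = 2_000        # hypothetical visible answer tokens, same on both models
THINKING_REDUCTION = 0.30    # from the article: roughly 30% fewer thinking tokens

k27_thinking_tokens = round(K26_THINKING_TOKENS * (1 - THINKING_REDUCTION))
cost_k26 = output_cost_usd(K26_THINKING_TOKENS, ANSWER_TOKENS, PRICE_PER_MILLION)
cost_k27 = output_cost_usd(k27_thinking_tokens, ANSWER_TOKENS, PRICE_PER_MILLION)
saving_share = 1 - cost_k27 / cost_k26

if __name__ == "__main__":
    print(f"K2.6 output cost per task: ${cost_k26:.4f}")
    print(f"K2.7 output cost per task: ${cost_k27:.4f}")
    print(f"output bill reduction:     {saving_share:.0%}")
```

The saving on the total output bill is smaller than 30% whenever the answer itself takes up tokens: in this example thinking is five sixths of the output, so the bill falls by about 25%.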
