Cohere's Command A+ Runs a 218B Open Model on Just Two GPUs

Cohere

May 20, 2026

2 min read

LLMS

long_context vision_language

OPEN_SOURCE

Cohere just dropped what may be the most practically deployable open-weight frontier model yet. Command A+ is a 218-billion-parameter sparse Mixture-of-Experts (MoE) model released under the Apache 2.0 license, meaning full commercial use with zero revenue caps or usage restrictions. The headline number that matters most: it runs on as few as two NVIDIA H100 GPUs at 4-bit quantization, with virtually no quality loss.

One model to replace four

Born from a year of deploying North with enterprise customers, Command A+ surpasses every previous generation in the Command series and unifies their capabilities into a single scalable model. That consolidation is a bigger deal than it sounds. Previously, Cohere maintained separate specialized variants:

Command A: base model, tool use, 23 languages
Command A Reasoning: extended thinking, chain-of-thought
Command A Vision: image understanding
Command A Translate: multilingual translation

Command A+ folds all four into a single set of weights. For teams managing multiple fine-tuned or specialized deployments, that is a significant operational simplification: one model to host, monitor, and update instead of several.

The architecture: sparse by design

Command A+ uses a sparse Mixture-of-Experts decoder with 218 billion total parameters organized across 128 experts. Each token activates 8 routed experts plus 1 shared expert, so about 25 billion parameters are active per token, not 218 billion. This is the core trick behind its hardware efficiency: the model is large in total capacity but cheap to run token by token. All 218 billion weights still have to sit in GPU memory, though, which is why quantization matters so much for deployment.
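To make the routing idea concrete, here is a minimal toy sketch of top-k expert selection with an always-on shared expert. It is not Cohere's implementation: the expert counts come from the article, but the softmax-over-selected-experts weighting, the function names `route_token` and `moe_forward`, and the scalar "experts" are all assumptions chosen for illustration.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts (figure from the article)
TOP_K = 8          # routed experts activated per token (figure from the article)

def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Assumption: weights are a softmax over the selected logits only.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, router_logits, experts, shared_expert):
    """Add the shared expert's output to the weighted outputs of the routed experts."""
    out = shared_expert(x)  # the shared expert runs for every token
    for idx, weight in route_token(router_logits):
        out += weight * experts[idx](x)  # only top_k of the experts ever execute
    return out

# Toy usage: each "expert" is a scalar multiplier standing in for a feed-forward block.
rng = random.Random(0)
experts = [(lambda x, s=rng.uniform(-1, 1): s * x) for _ in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.5 * x
logits = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]  # stand-in for router output

chosen = route_token(logits)
y = moe_forward(1.0, logits, experts, shared_expert)
print(f"{len(chosen)} of {NUM_EXPERTS} routed experts ran, plus the shared expert")
```

The other 120 experts are never called for this token, which is where the per-token compute saving comes from.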

As a result, Command A+ requires far less compute per token than proprietary giants like OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.7, which are estimated by third-party observers to be in the trillions of parameters. The MoE approach routes each token only to the expert subnetworks best suited to handle it, leaving the rest dormant.

The quantization breakthrough

Getting a 218B model to fit on two H100s without degrading quality is the real engineering story here. The key piece is Quantization-Aware Distillation (QAD): rather than quantizing the whole model uniformly, QAD keeps the attention-pathway weights at full precision and quantizes only the MoE expert layers. The result is near-lossless W4A4 quantization (4-bit weights and 4-bit activations), so the compressed model retains quality while running on significantly less hardware.
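A quick back-of-the-envelope calculation shows why 4-bit weights are what make the two-GPU claim plausible. The sketch below assumes 80 GB of memory per H100 and treats every weight as 4-bit, so it is a lower bound: the attention weights kept at higher precision, the KV cache, and activations all need room on top of it. The helper name `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 218e9  # total parameter count from the article
H100_GB = 80          # assumption: 80 GB H100 variant
NUM_GPUS = 2

fp16_gb = weight_memory_gb(TOTAL_PARAMS, 16)  # 436 GB
w4_gb = weight_memory_gb(TOTAL_PARAMS, 4)     # 109 GB
budget_gb = H100_GB * NUM_GPUS                # 160 GB

print(f"16-bit weights: {fp16_gb:.0f} GB, fits in {budget_gb} GB: {fp16_gb <= budget_gb}")
print(f" 4-bit weights: {w4_gb:.0f} GB, fits in {budget_gb} GB: {w4_gb <= budget_gb}")
print(f"Headroom at 4-bit: {budget_gb - w4_gb:.0f} GB")
```

At 16 bits the weights alone would need six such GPUs; at 4 bits they leave roughly 51 GB of headroom across two.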
