
Cohere just dropped what may be the most practically deployable open-weight frontier model yet. Command A+ is a 218-billion-parameter sparse Mixture-of-Experts (MoE) model released under the Apache 2.0 license, meaning full commercial use with zero revenue caps or usage restrictions. The headline number that matters most: it runs on as few as two NVIDIA H100 GPUs at 4-bit quantization, with virtually no quality loss.
One model to replace five
Born from a year of deploying North with enterprise customers, Command A+ surpasses every previous generation in the Command series and unifies their capabilities into a single scalable model. That consolidation is a bigger deal than it sounds. Previously, Cohere maintained separate specialized variants:
- Command A: base model, tool use, 23 languages
- Command A Reasoning: extended thinking, chain-of-thought
- Command A Vision: image understanding
- Command A Translate: multilingual translation
Command A+ folds all four into a single set of weights. For teams managing multiple fine-tuned or specialized deployments, that is a significant operational simplification: one model to serve, monitor, and upgrade instead of several.
The architecture: sparse by design
Command A+ uses a sparse Mixture-of-Experts decoder with 218 billion total parameters organized across 128 experts. Each token activates 8 routed experts plus 1 shared expert, so only about 25 billion parameters are active per token, not all 218 billion. This is the core trick behind its hardware efficiency: the model is large in total capacity but cheap to run token by token.
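To make the routing pattern concrete, here is a minimal sketch of top-k expert selection with an always-on shared expert. The expert counts come from the figures above; everything else is an assumption for illustration, including the softmax gating over the selected experts, the toy scalar "experts", and the names `route_token` and `moe_layer`. It is not Cohere's implementation.

```python
import math
import random

NUM_EXPERTS = 128   # routed experts (figure from the article)
TOP_K = 8           # routed experts activated per token (figure from the article)

def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and softmax-normalise their scores.

    Assumption: gating weights are a softmax over the selected logits only.
    """
    if len(router_logits) < top_k:
        raise ValueError("need at least top_k experts")
    # Indices of the top_k largest router logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}

def moe_layer(x, router_logits, experts, shared_expert):
    """Add the shared expert's output to the weighted outputs of routed experts."""
    weights = route_token(router_logits)
    out = shared_expert(x)               # the shared expert runs for every token
    for idx, w in weights.items():       # only the selected experts are evaluated
        out += w * experts[idx](x)
    return out, weights

# Toy usage: scalar "experts" stand in for feed-forward sub-networks.
rng = random.Random(0)
logits = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]
experts = [(lambda x, scale=i: x * scale * 0.01) for i in range(NUM_EXPERTS)]
y, weights = moe_layer(1.0, logits, experts, shared_expert=lambda x: x)
print(len(weights), "of", NUM_EXPERTS, "routed experts ran for this token")
```

The other 120 experts are never called for this token, which is where the per-token savings come from.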
Command A+ therefore requires far less compute per token than proprietary giants like OpenAI's GPT-5.5 and Anthropic's Claude Opus 4.7, which third-party observers estimate to be in the trillions of parameters. The MoE router sends each token only to the expert subnetworks best suited to handle it, leaving the rest dormant.
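A quick back-of-the-envelope calculation shows how much the sparsity buys. The parameter counts are the article's; the "2 FLOPs per parameter per token" rule for a forward pass is a common approximation, not a measured figure for this model, and it ignores attention cost over the context.

```python
TOTAL_PARAMS = 218e9    # total parameters (figure from the article)
ACTIVE_PARAMS = 25e9    # parameters active per token (figure from the article)

def forward_flops_per_token(params):
    """Rule of thumb: one multiply and one add per parameter per token."""
    return 2 * params

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
sparse_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)

print(f"active fraction:   {active_fraction:.1%}")
print(f"sparse forward:    {sparse_flops / 1e9:.0f} GFLOPs per token")
print(f"dense equivalent:  {dense_flops / 1e9:.0f} GFLOPs per token")
print(f"compute reduction: {dense_flops / sparse_flops:.1f}x")
```

Under that approximation, roughly 11.5% of the weights do the work for any given token, for about an 8.7x reduction in per-token compute relative to a dense model of the same total size.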
The quantization breakthrough
Getting a 218B model to fit on two H100s without degrading quality is the real engineering story here. The genuinely innovative piece is Quantization-Aware Distillation (QAD): rather than applying uniform quantization across the model, QAD keeps the attention pathway weights at full precision and quantizes only the MoE expert layers. The result is near-lossless W4A4 quantization (4-bit weights and 4-bit activations), so the compressed model retains its quality while running on significantly less hardware.
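The two-GPU claim can be sanity-checked with simple weight-memory arithmetic. This is a rough sketch under stated assumptions: the 80 GB H100 variant, a hypothetical 95% share of parameters living in the expert layers (the article does not give the split), 16-bit storage for the non-expert weights, and weights only, with no allowance for KV cache, activations, or quantization scale factors.

```python
import math

TOTAL_PARAMS = 218e9      # total parameters (figure from the article)
H100_MEMORY_GB = 80       # assumption: 80 GB H100 variant
EXPERT_FRACTION = 0.95    # hypothetical share of parameters in expert layers

def weight_memory_gb(params, bits):
    """Memory needed to store `params` weights at `bits` bits each, in GB."""
    return params * bits / 8 / 1e9

def mixed_precision_gb(total_params, expert_fraction, expert_bits=4, other_bits=16):
    """Experts stored at low precision, everything else at higher precision."""
    expert = weight_memory_gb(total_params * expert_fraction, expert_bits)
    other = weight_memory_gb(total_params * (1 - expert_fraction), other_bits)
    return expert + other

def gpus_needed(memory_gb, per_gpu_gb=H100_MEMORY_GB):
    """Smallest GPU count whose combined memory holds the weights."""
    return math.ceil(memory_gb / per_gpu_gb)

uniform_16bit = weight_memory_gb(TOTAL_PARAMS, 16)
uniform_4bit = weight_memory_gb(TOTAL_PARAMS, 4)
mixed = mixed_precision_gb(TOTAL_PARAMS, EXPERT_FRACTION)

for label, gb in [("uniform 16-bit", uniform_16bit),
                  ("uniform 4-bit", uniform_4bit),
                  ("4-bit experts, 16-bit rest", mixed)]:
    print(f"{label:28s} {gb:7.2f} GB -> {gpus_needed(gb)} GPU(s)")
```

At 16 bits the weights alone come to 436 GB, which needs six 80 GB cards. At 4 bits they drop to 109 GB, and even with the assumed 5% of non-expert weights kept at 16-bit the total stays near 125 GB, inside the 160 GB that two H100s provide.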
