StepFun's Step 3.7 Flash Hits 416 Tokens per Second While Staying Open-Source

Artificial Analysis

1D AGO

2 min read

LLMS

mixture_of_experts small_models

OPEN_SOURCE

StepFun has pushed a new open-weights model onto the Pareto frontier of speed versus intelligence. Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts vision-language model that activates roughly 11B parameters per token, scores 43 on the Artificial Analysis Intelligence Index, and serves at over 400 output tokens per second on StepFun's first-party API. It ships under Apache 2.0, with weights in BF16, FP8, NVFP4, and GGUF formats, putting frontier-adjacent capability inside the memory budget of a single high-end workstation.
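To make the memory claim concrete, here is a rough back-of-the-envelope sketch of weight-only footprints for the listed formats. The bytes-per-parameter values are nominal assumptions (2 for BF16, 1 for FP8, 0.5 for NVFP4); real checkpoints carry extra scale factors and may keep some layers at higher precision, and GGUF is left out because its size depends on the quantization level chosen.

```python
# Rough weight-only memory estimate for a 198B-parameter MoE model.
# Assumption: nominal bytes per parameter; quantization scales, KV cache,
# and runtime overhead are all ignored.
TOTAL_PARAMS = 198e9   # total parameters, from the article
ACTIVE_PARAMS = 11e9   # approximate parameters activated per token

BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gigabytes(total_params: float, bytes_per_param: float) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


footprints = {
    fmt: weight_gigabytes(TOTAL_PARAMS, nbytes)
    for fmt, nbytes in BYTES_PER_PARAM.items()
}

# Share of the parameters used for any single token. All experts still have
# to be resident in memory, so this affects compute, not the footprint.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

for fmt, gb in footprints.items():
    print(f"{fmt}: ~{gb:.0f} GB of weights")
print(f"Active per token: ~{active_fraction:.1%} of parameters")
```

Under these nominal figures the NVFP4 weights come to roughly 99 GB, before KV cache and runtime overhead are added.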

The speed number is the headline

According to Artificial Analysis, Step 3.7 Flash generates output at 415.9 tokens per second on StepFun's API, roughly seven times the 57.4 t/s median for open-weight models of similar size. Time to first token sits at 0.96 seconds, against a median of 2.36 seconds for comparable models. That output speed ranks it #1 of 88 models tracked.
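A quick way to read these two figures together is to estimate end-to-end response time as time to first token plus output length divided by throughput. The sketch below does that for a 1,000-token reply. The reply length is an arbitrary illustration, and pairing the two medians into one "median model" is a simplifying assumption, since the medians need not belong to the same model.

```python
def response_seconds(ttft_s: float, tokens_per_s: float, n_output_tokens: int) -> float:
    """Estimate total response time: first-token latency plus streaming time."""
    return ttft_s + n_output_tokens / tokens_per_s


N_TOKENS = 1000  # hypothetical reply length, chosen only for illustration

# Figures reported by Artificial Analysis for StepFun's API.
step_total = response_seconds(ttft_s=0.96, tokens_per_s=415.9, n_output_tokens=N_TOKENS)

# Assumption: a notional model sitting at both medians simultaneously.
median_total = response_seconds(ttft_s=2.36, tokens_per_s=57.4, n_output_tokens=N_TOKENS)

throughput_ratio = 415.9 / 57.4

print(f"Step 3.7 Flash: {step_total:.1f} s for {N_TOKENS} tokens")
print(f"Median model:   {median_total:.1f} s for {N_TOKENS} tokens")
print(f"Throughput ratio: {throughput_ratio:.1f}x")
```

With these inputs the estimate works out to about 3.4 seconds versus about 19.8 seconds, so the gap a user feels grows with the length of the reply.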

Two architectural choices drive that throughput. First, only about 11B of the 198B parameters fire per token thanks to MoE routing. Second, the model ships with three trained Multi-Token Prediction (MTP) heads that predict several future tokens in a single forward pass. Combined with speculative decoding, this lets the model verify multiple draft tokens at once instead of generating them one at a time. The reference SGLang serving config exposes this directly with flags like --speculative-algorithm EAGLE, --speculative-num-steps 3, and --enable-multi-layer-eagle.
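The draft-then-verify idea can be illustrated with a toy sketch. This is not StepFun's or SGLang's implementation: it uses simple greedy verification rather than EAGLE's acceptance scheme, the "models" are stand-in functions over integers, and the names speculative_step, draft_next, and target_next are hypothetical. The default of three drafts is chosen only to echo the three MTP heads mentioned above.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(
    context: List[Token],
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    num_draft: int = 3,
) -> List[Token]:
    """Run one greedy draft-and-verify step and return the accepted tokens."""
    # 1) Draft: a cheap proposer guesses num_draft tokens in sequence.
    drafts: List[Token] = []
    for _ in range(num_draft):
        drafts.append(draft_next(context + drafts))

    # 2) Verify: the target checks each drafted position. A real server does
    #    this in one batched forward pass; the loop here is for clarity.
    accepted: List[Token] = []
    for tok in drafts:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft matched, so the target's pass also yields one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


# Toy target: counts upward modulo 10.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


# Toy drafter: agrees with the target except right after a 4.
def draft_next(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


all_accepted = speculative_step([0], draft_next, target_next)
early_reject = speculative_step([3], draft_next, target_next)
print(all_accepted)  # [1, 2, 3, 4]
print(early_reject)  # [4, 5]
```

When the drafter agrees with the target, one verification step yields four tokens; when it disagrees, the output still matches what the target alone would have produced, just with fewer tokens gained in that step.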

What the intelligence numbers actually say

On the Artificial Analysis Intelligence Index v4.0, which blends ten evaluations including GDPval-AA, Terminal-Bench Hard, SciCode, AA-LCR, Humanity's Last Exam, and GPQA Diamond, Step 3.7 Flash lands at 43. That's a gain of about 4.5 points over Step 3.5 Flash 2603 (38.5), slightly ahead of Qwen3.5 122B A10B (41.6) but trailing MiniMax-M2.7 (49.6) and DeepSeek V4 Flash (46.5). The biggest movements come from agentic tasks:
