
NVIDIA has switched on Day 0 support for StepFun's newest open multimodal model, Step 3.7 Flash, meaning the model can be prototyped, deployed, and fine-tuned across NVIDIA's developer stack the moment its weights hit Hugging Face. The rollout spans GPU-accelerated endpoints on build.nvidia.com, packaged NIM inference microservices, and ready-made NeMo fine-tuning recipes, removing most of the integration work that typically follows an open model release.
A vision-language model wired into NVIDIA's stack on arrival
Step 3.7 Flash is StepFun's latest vision-language model: a 198B-parameter Mixture-of-Experts design that activates approximately 11B parameters per forward pass. It is optimized for agentic workflows that combine perception, search, and multi-step reasoning, and it ships with native image and video input, three configurable reasoning levels, and a 256k-token context window.
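To put the Mixture-of-Experts numbers in perspective, here is a rough back-of-the-envelope calculation using only the two figures quoted above. The active count is given as "approximately 11B", so the results are estimates; the constant and function names are illustrative.

```python
# Parameter counts as quoted in the article. The active count is
# approximate, so treat every derived number as a rough estimate.
TOTAL_PARAMS = 198e9   # all experts combined
ACTIVE_PARAMS = 11e9   # parameters used in a single forward pass


def active_fraction(total: float, active: float) -> float:
    """Share of the model's weights that participate in one forward pass."""
    return active / total


frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
ratio = TOTAL_PARAMS / ACTIVE_PARAMS

print(f"Active share per forward pass: {frac:.1%}")   # 5.6%
print(f"Total-to-active ratio: {ratio:.0f}x")         # 18x
```

In other words, each forward pass touches only about one eighteenth of the weights, although all 198B parameters still have to be stored and kept available for serving.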
What NVIDIA is announcing is not the model itself but the fact that its infrastructure is ready for it on day one. Developers can pull StepFun's NVFP4-quantized checkpoint from Hugging Face, whose 4-bit weights cut memory bandwidth and storage requirements and so speed up inference, and run it through the open-source serving stacks that NVIDIA maintains kernels for.
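As a rough illustration of why a 4-bit checkpoint matters at this scale, the sketch below estimates weight storage for the quoted 198B parameters at two precisions. It rests on simplifying assumptions that are not from the announcement: a 16-bit baseline, every weight stored at the stated width, and no allowance for quantization scale factors or layers kept in higher precision. Real checkpoint sizes will differ.

```python
# Rough weight-storage estimate. Assumptions (not from the announcement):
# all weights use the stated bit width, and overhead such as quantization
# scale factors or higher-precision layers is ignored.
TOTAL_PARAMS = 198e9  # parameter count quoted in the article


def weight_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


baseline_16bit = weight_gb(TOTAL_PARAMS, 16)
quantized_4bit = weight_gb(TOTAL_PARAMS, 4)

print(f"16-bit weights: ~{baseline_16bit:.0f} GB")  # ~396 GB
print(f"4-bit weights:  ~{quantized_4bit:.0f} GB")  # ~99 GB
print(f"Reduction: {baseline_16bit / quantized_4bit:.0f}x")  # 4x
```

Under these assumptions the weights shrink by roughly a factor of four, which also reduces the bytes that must be read from memory for each generated token.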
