NVIDIA has switched on Day 0 support for StepFun's newest open multimodal model, Step 3.7 Flash, meaning the model can be prototyped, deployed, and fine-tuned across NVIDIA's developer stack the moment its weights hit Hugging Face. The rollout spans GPU-accelerated endpoints on build.nvidia.com, packaged NIM inference microservices, and ready-made NeMo fine-tuning recipes, removing most of the integration work that typically follows an open model release.

A vision-language model wired into NVIDIA's stack on arrival

Step 3.7 Flash itself is StepFun's latest vision-language model. It is a 198B-parameter Mixture-of-Experts model that activates approximately 11B parameters per forward pass and is optimized for agentic workflows that combine perception, search, and multi-step reasoning. It ships with native image and video input, three configurable reasoning levels, and a 256k context window.
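To put the sparsity figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the two parameter counts quoted above. The "2 FLOPs per active parameter per token" rule of thumb is a common approximation assumed here for illustration, not a figure published by StepFun or NVIDIA, and the function names are hypothetical.

```python
# Back-of-the-envelope Mixture-of-Experts arithmetic.
# The two constants come from the article; everything else is an assumption.
TOTAL_PARAMS = 198e9    # total parameters
ACTIVE_PARAMS = 11e9    # approximate parameters activated per forward pass


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that participate in one forward pass."""
    return active / total


def approx_flops_per_token(active: float) -> float:
    """Rough forward-pass cost per token.

    Assumption: about 2 FLOPs (one multiply, one add) per active parameter.
    This ignores attention over the context and other overheads.
    """
    return 2.0 * active


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    flops = approx_flops_per_token(ACTIVE_PARAMS)
    print(f"Active fraction per forward pass: {frac:.1%}")
    print(f"Approximate forward FLOPs per token: {flops:.2e}")
```

Under these assumptions, only about one parameter in eighteen is exercised per token, which is why a sparse model of this size can have per-token compute closer to that of a much smaller dense model.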

What NVIDIA is announcing is not the model itself, but the fact that NVIDIA's infrastructure is ready for it on day one. Developers can pull StepFun's NVFP4-quantized checkpoint from Hugging Face, which speeds up inference by reducing memory-bandwidth and storage requirements, and run it through the open-source serving stacks for which NVIDIA maintains kernels.
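The storage and bandwidth savings from a 4-bit checkpoint can be estimated with simple arithmetic. The sketch below compares weight-only footprints for BF16 and NVFP4. It rests on several assumptions: every one of the 198B weights is quantized (real checkpoints often keep some layers at higher precision), NVFP4 stores 4-bit values with one 8-bit scale per 16-value block (roughly 4.5 bits per weight), and the KV cache and activations are ignored. The helper name is hypothetical.

```python
# Weight-only memory estimate for a 198B-parameter model.
# All bit widths below are assumptions for illustration, not measured numbers.
TOTAL_PARAMS = 198e9  # total parameters, from the article

BF16_BITS = 16.0
# Assumption: 4-bit values plus one 8-bit scale shared by each 16-value block.
NVFP4_BITS = 4.0 + 8.0 / 16.0  # = 4.5 effective bits per weight


def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8.0 / 1e9


if __name__ == "__main__":
    bf16_gb = weight_gigabytes(TOTAL_PARAMS, BF16_BITS)
    nvfp4_gb = weight_gigabytes(TOTAL_PARAMS, NVFP4_BITS)
    print(f"BF16 weights:  ~{bf16_gb:.0f} GB")
    print(f"NVFP4 weights: ~{nvfp4_gb:.0f} GB")
    print(f"Reduction:     ~{bf16_gb / nvfp4_gb:.1f}x")
```

Because the weights must be read from GPU memory during decoding, a smaller weight footprint reduces both the storage needed to hold the model and the memory traffic required to serve it.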
