Running a serious LLM benchmark used to mean wrangling harnesses, spinning up sandboxes, babysitting hours of parallel compute, and debugging infrastructure failures before you even got to the results. Prime Intellect's Hosted Evaluations collapses that entire stack into a model selection and a single CLI flag.

Prime Intellect is the team behind INTELLECT-3, a 106B-parameter open MoE model trained with large-scale RL, and the Environments Hub, a community repository for RL environments and evaluations. Since the Environments Hub launched, 250+ contributors have created over 1,000 unique environments, with more than 100k total environment downloads. Hosted Evaluations is the natural next step: take that library and make it runnable by anyone, without any local setup.

Why evals became an infra nightmare

Evaluations are the cornerstone of LLM research and deployment. They tell you what a model can and, more importantly, cannot do. Without them, you are stuck with vibes, single-prompt tests, and cherry-picked examples from model vendors.

The problem is that the nature of evaluations has fundamentally changed. Early benchmarks like MMLU-Pro were simple: feed a multiple-choice question, score with a regex, done. But modern evals look nothing like that.
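To make "score with a regex" concrete, here is a minimal sketch of what such a scorer might look like. The pattern and the helper names (`extract_choice`, `accuracy`) are illustrative assumptions, not the scoring code of any particular benchmark.

```python
import re
from typing import Optional, Sequence

# Assumed answer format: "... answer is (C)" or "Answer: C", with options A-J.
ANSWER_PATTERN = re.compile(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-J])\b")


def extract_choice(response: str) -> Optional[str]:
    """Return the last answer letter found in the response, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)
```

The whole evaluation is a pure function of strings: no sandbox, no tool calls, no long-running state. That is the property agentic evals give up.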

The shift from knowledge-based single-prompt tasks to multi-turn agentic coding tasks massively increased infrastructure requirements. Models inside harnesses now run for hours on a single task, editing and running code in sandboxes. To evaluate efficiently, hundreds of sandboxes need to be spun up in parallel. These requirements are often overlooked, yet they become a huge bottleneck in research.
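As a rough illustration of the orchestration this implies, the sketch below runs many tasks concurrently while capping how many are in flight at once. Everything here is hypothetical: `run_task` stands in for "start a sandbox, let the agent work, score the result", and a real harness would also need timeouts, retries, and cleanup.

```python
import asyncio
from typing import Iterable


async def run_task(task_id: int) -> dict:
    # Placeholder for a real sandboxed rollout, which could take hours.
    await asyncio.sleep(0.01)
    return {"task": task_id, "passed": task_id % 2 == 0}


async def run_all(task_ids: Iterable[int], max_parallel: int = 100) -> list:
    # The semaphore caps the number of sandboxes alive at any moment.
    sem = asyncio.Semaphore(max_parallel)

    async def bounded(task_id: int) -> dict:
        async with sem:
            return await run_task(task_id)

    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*(bounded(t) for t in task_ids))
```

Calling `asyncio.run(run_all(range(500), max_parallel=100))` would work through 500 tasks with at most 100 running at a time. Even this toy version hints at why the bookkeeping grows quickly once each task owns a real sandbox.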
