
Kaggle just made writing AI evaluations feel a lot more like writing normal Python. The team shipped local development for Kaggle Benchmarks, which means you can now author, validate, push, and run benchmark tasks from VSCode, Cursor, Antigravity, or any agent shell instead of being trapped in Kaggle's hosted notebook editor.
Alongside the CLI, they released write-kaggle-benchmarks, an agent skill: essentially a structured instruction set you drop into Claude Code or a similar coding agent so it knows how to scaffold a task, run it against a model, and ship the results. The pitch is simple: describe the evaluation you want in natural language, let the agent generate the task file, and iterate locally.
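To make "an evaluation as normal Python" concrete, here is a minimal sketch of the general shape such a task takes: a set of prompts, a scoring rule, and a loop that runs a model over them. This is not the Kaggle Benchmarks API. Every name here (EvalCase, exact_match, run_task, stub_model) is hypothetical and chosen for illustration, and the model is a canned stub so the file runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One prompt paired with the answer we expect back (hypothetical type)."""
    prompt: str
    expected: str


def exact_match(response: str, expected: str) -> bool:
    """Score by case-insensitive equality after trimming whitespace."""
    return response.strip().lower() == expected.strip().lower()


def run_task(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases the model answers correctly."""
    if not cases:
        return 0.0
    passed = sum(exact_match(model(c.prompt), c.expected) for c in cases)
    return passed / len(cases)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; it knows exactly one answer."""
    canned = {"What is 2 + 2?": "4"}
    return canned.get(prompt, "I don't know")


CASES = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("What is the capital of France?", "Paris"),
]

if __name__ == "__main__":
    # The stub answers one of the two cases, so this prints "accuracy: 0.50".
    print(f"accuracy: {run_task(stub_model, CASES):.2f}")
```

Swapping stub_model for a function that calls a real model is the only change needed to turn the sketch into a live evaluation. The actual task format and commands are defined by Kaggle's tooling and may look quite different.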
Why this matters
Until now, creating evaluation tasks meant working exclusively in Kaggle's web-based notebook editor rather than in the tools developers already use. The update lets developers create, validate, push, run, and download tasks directly from local environments such as Antigravity, VSCode, and Cursor, or through coding agents.
That matters because benchmarks have become the bottleneck for evaluating frontier models. As AI models evolve from simple chatbots into reasoning agents that write code, use tools, and solve complex problems, traditional benchmarks are no longer enough. The community needs dynamic, rigorous evaluations built by the people who use these models in the real world. The global AI community has already created more than 10,000 evaluation tasks on the platform, and pulling that workflow into local tooling lowers the friction of adding more.
