Google Research's GeminiSQL-2 Tops the Hardest SQL Benchmark Without Agentic Tricks

Google Research

5H AGO

2 min read

Converting plain English into working SQL has been a stubborn problem for AI. Queries that look right often fail when you actually run them, especially against real-world databases with messy schemas, ambiguous column names, and domain-specific business logic baked in. Google Research just announced GeminiSQL-2, a new text-to-SQL capability powered by Gemini 3.1 Pro that claims state-of-the-art results on the hardest standard benchmark in the field.

The benchmark that actually matters

The gold standard for evaluating text-to-SQL systems is BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation), which spans over 12,500 unique question-SQL pairs across 95 databases with a total size of 33 GB. What makes BIRD particularly hard is that it does not just check whether a query is syntactically valid. It moves beyond simple, single-table queries to cover real-world challenges such as reasoning over very large schemas, dealing with ambiguous values, and incorporating external business knowledge. Crucially, BIRD measures execution-verified accuracy: the query must actually run and return the right result.
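To make "execution-verified" concrete, here is a minimal sketch of how such a check could work. It is not BIRD's official evaluation script: the function names, the toy `orders` table, and the choice to compare results as unordered sets of rows are all assumptions made for illustration.

```python
import sqlite3


def run_query(conn, sql):
    """Execute sql and return all rows, or None if execution fails."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries run and return the same set of rows."""
    predicted_rows = run_query(conn, predicted_sql)
    gold_rows = run_query(conn, gold_sql)
    if predicted_rows is None or gold_rows is None:
        return False
    # Assumption: row order and duplicate rows are ignored.
    return set(predicted_rows) == set(gold_rows)


def execution_accuracy(conn, pairs):
    """Percentage of (predicted_sql, gold_sql) pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, p, g) for p, g in pairs)
    return 100.0 * hits / len(pairs)


# Hypothetical toy database, not part of BIRD.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "US", 80.0), (3, "EU", 40.0)],
)

pairs = [
    # Different SQL text, same rows: counts as correct.
    ("SELECT id FROM orders WHERE region = 'EU' ORDER BY id DESC",
     "SELECT id FROM orders WHERE region = 'EU'"),
    # Runs cleanly but returns the wrong rows.
    ("SELECT id FROM orders WHERE total > 100",
     "SELECT id FROM orders WHERE region = 'EU'"),
    # Looks plausible but references a column that does not exist.
    ("SELECT id FROM orders WHERE area = 'EU'",
     "SELECT id FROM orders WHERE region = 'EU'"),
]

print(execution_accuracy(conn, pairs))
```

Only the first pair counts as a hit in this toy run. The second query is valid SQL that returns the wrong rows, and the third fails at execution time, which is the kind of error a purely syntactic or string-matching check could miss.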

On BIRD, human experts achieve an execution accuracy of 92.96%, whereas even the top-performing methods lag considerably, at around 75% on the test set. That gap is what GeminiSQL-2 is trying to close.

The score and what it means

Google Cloud set a new state-of-the-art result on the BIRD benchmark's Single Trained Model Track with a score of 76.13, ahead of any other single-model solution. The Single Trained Model Track is the most meaningful category to win. It is designed to measure the raw, intrinsic capability of the model itself, restricting the use of complex preprocessing, retrieval, or agentic frameworks often used to boost model accuracy. In other words, no ensembles, no retrieval-augmented generation pipelines, no re-rankers, just the model.
