
It is well understood that giving a language model space to "think" -- generating a chain-of-thought before answering -- boosts performance on hard problems like math or multi-step reasoning. But what about a simple factual question, one where the model either knows the answer or it doesn't? A new study from Google Research shows that reasoning helps there too, and the explanation has little to do with step-by-step logic.
The puzzle nobody had a clean answer for
While reasoning in LLMs plays a natural role in math, code generation, and multi-hop factual questions, its effect on simple, single-hop factual questions has been unclear. Such questions do not require step-by-step logical decomposition, so it is counterintuitive that reasoning would help at all. If you ask a model "What year was Marie Curie awarded her second Nobel Prize?", no amount of step-by-step logic will conjure up a fact that isn't stored in its weights. So why does enabling the thinking mode on models like Gemini or Qwen help?
The answer, it turns out, is that reasoning serves a dual purpose that has nothing to do with logical deduction. The team identified two key driving mechanisms: a computational buffer effect, where the model uses generated reasoning tokens to perform latent computation independent of their semantic content; and factual priming, where generating topically related facts acts as a semantic bridge that facilitates correct answer retrieval.
Measuring what a model actually knows
To study this rigorously, the researchers needed a way to separate "what the model knows" from "what the model outputs on the first try." They used the pass@k metric -- instead of checking a single answer, you sample k independent outputs and ask whether the correct answer appears anywhere in that set. This reveals the model's capability boundary: the full range of facts it can potentially recall, not just what surfaces at the top of its output distribution.
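To make the metric concrete, here is a minimal Python sketch of how pass@k can be computed. It rests on an assumption: the article does not say which estimator the study used, so this uses the unbiased estimator that is common in code-generation evaluation, where you draw n samples per question, count the c correct ones, and compute the probability that a random subset of k contains at least one correct answer. The names `pass_at_k` and `mean_pass_at_k` are hypothetical, chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one question.

    n: total samples drawn for the question
    c: how many of those samples were correct
    k: size of the subset we imagine drawing (1 <= k <= n)

    Assumption: this is the standard unbiased estimator,
    1 - C(n - c, k) / C(n, k); the study may compute it differently.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every subset of size k has a hit.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a benchmark, given per-question correct counts."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

Under this estimator, a question the model answers correctly in only 1 of 10 samples has a pass@1 of 0.1 but a pass@10 of 1.0. That gap is what the capability boundary is meant to capture: the fact is recoverable, it just rarely surfaces on the first try.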
Enabling reasoning substantially expands the capability boundary of the model's parametric knowledge recall, unlocking correct answers that are otherwise effectively unreachable. The team tested this on Gemini-2.5 Flash, Gemini-2.5 Pro, and Qwen3-32B, using two challenging closed-book QA benchmarks: SimpleQA-Verified and EntityQuestions -- both composed of predominantly simple, single-hop questions.

