
Frontier LLMs are surprisingly bad at one of the most routine tasks in finance: deciding which documents are actually worth reading. Bridgewater's AI and Automation Labs, in collaboration with Thinking Machines Lab, just published a detailed case study showing how they built a custom model that outperforms every frontier model they tested (GPT, Claude, and Gemini) on six financial information-filtering tasks, while costing 13.8x less per task to run.
The problem no one talks about
Every investor is buried in documents: news articles, central bank releases, research reports, internal memos. The real work isn't the reading itself; it's the constant low-level triage of deciding what's worth reading in the first place. That triage consists of small, repeated judgments: filtering, interpreting, segmenting, and identifying where the useful signal lies. These micro-decisions are embedded throughout an investor's day and consume enormous time.
The team wanted to automate this triage. But when they tested frontier models on six concrete tasks drawn from real investor workflows, the results were sobering. Variants of Gemini, Claude, and GPT averaged a mere ~50% accuracy when given a prompt that simply stated each of the six tasks, which is essentially a coin flip.
The six tasks they evaluated were:
- Financial Article Relevancy: is this news article relevant to a C-suite macro investor?
- Central Bank Document Relevancy: does this central bank release signal future rate changes?
- Generic Document Relevancy: does this research doc answer a specific investor question?
- Ad Hoc Content Labeling: is this document recurring boilerplate or does it contain one-off analysis?
- Document Truncation: where does the boilerplate begin in a document?
- Email Truncation: where does the boilerplate begin in an email?
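To make the shape of these tasks concrete, here is a minimal Python sketch of how the six could be written down and scored. It is an illustration, not the team's evaluation harness: the `Task` class, the `kind` field, and the `accuracy` helper are hypothetical names, and treating the two truncation tasks as "predict a boundary index, score by exact match" is an assumption the case study summary does not confirm.

```python
from dataclasses import dataclass
from typing import Literal, Sequence, Union

# bool for two-way decisions, int for a predicted boundary position (assumed encoding)
Label = Union[bool, int]


@dataclass(frozen=True)
class Task:
    name: str
    question: str
    kind: Literal["binary", "truncation"]  # hypothetical grouping of the six tasks


TASKS = [
    Task("Financial Article Relevancy",
         "Is this news article relevant to a C-suite macro investor?", "binary"),
    Task("Central Bank Document Relevancy",
         "Does this central bank release signal future rate changes?", "binary"),
    Task("Generic Document Relevancy",
         "Does this research doc answer a specific investor question?", "binary"),
    Task("Ad Hoc Content Labeling",
         "Is this document recurring boilerplate or one-off analysis?", "binary"),
    Task("Document Truncation",
         "Where does the boilerplate begin in a document?", "truncation"),
    Task("Email Truncation",
         "Where does the boilerplate begin in an email?", "truncation"),
]


def accuracy(predicted: Sequence[Label], expected: Sequence[Label]) -> float:
    """Fraction of predictions that exactly match the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot score an empty evaluation set")
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)
```

On a balanced two-way task, a model that gets half of its answers right scores 0.5 here, which is why ~50% reads as a coin flip.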
Why prompting alone hits a wall
The team didn't give up on frontier models immediately. Their experts wrote detailed task descriptions and reframed certain problems, for example by splitting the "relevant" label into three buckets: relevant and interesting, relevant but uninteresting, and irrelevant. These changes boosted accuracy from a coin flip to the mid-70s. Automatic prompt optimization methods added nothing on top of that.
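The three-bucket reframing can be sketched in a few lines of Python. This is an illustrative sketch only: the `Relevance` enum, `parse_label`, and `should_read` are hypothetical names, and the rule that only "relevant and interesting" documents get surfaced is an assumption, since the source does not say how the three buckets map back to a read-or-skip decision.

```python
from enum import Enum


class Relevance(Enum):
    # The three buckets described above
    RELEVANT_INTERESTING = "relevant and interesting"
    RELEVANT_UNINTERESTING = "relevant but uninteresting"
    IRRELEVANT = "irrelevant"


def parse_label(model_output: str) -> Relevance:
    """Map a model's raw text answer onto one of the three buckets."""
    text = model_output.strip().lower()
    for label in Relevance:
        if text == label.value:
            return label
    raise ValueError(f"unrecognized label: {model_output!r}")


def should_read(label: Relevance) -> bool:
    # Assumption: only the first bucket is worth an investor's time.
    return label is Relevance.RELEVANT_INTERESTING
```

Giving the model a separate bucket for "relevant but uninteresting" lets it acknowledge that a document is on topic without having to recommend it, which a plain relevant/irrelevant label cannot express.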

