Perplexity Cuts RAG Context Tokens by 70% Without Sacrificing Accuracy

Perplexity

May 20, 2026

2 min read

RETRIEVAL

document_understanding rag

INFRA

inference_optimization

May 20, 2026

RETRIEVAL

document_understanding rag

INFRA

inference_optimization

2 min read

Every RAG system faces the same dirty secret: most of what gets retrieved is junk. Nav bars, cookie banners, ads, boilerplate footers , all of it gets shoved into the context window alongside the one sentence that actually answers the question. Perplexity just shipped a production fix for this, and the numbers are striking.

The company deployed a new, state-of-the-art snippet generator within its applications and API Platform. At its core is a query-aware context compression model that decides, for each query and candidate result, the subset of spans that should be preserved. The result: context tokens cut by up to 70% while improving answer quality , a combination that most teams assume is a tradeoff, not a win-win.

The "Context Rot" Problem

Long streams of irrelevant information waste model capacity, resulting in "context rot" that impairs a model's ability to address the actual user request. This isn't just an accuracy problem. Passing distractors to a downstream agent creates three compounding problems: it hurts accuracy, regresses latency, and raises cost. Every unnecessary token you feed the model makes it slower and more expensive , and you get worse answers for the privilege.

The standard industry response has been to throw more context at the problem. Bigger context windows, more retrieved chunks, higher token budgets. Perplexity's bet is the opposite: be more surgical about what you pass downstream in the first place.

Extractive, Not Generative

The key design decision here is what kind of compression to use. There are two broad approaches: generation-based (ask a model to write a summary of the page) and selection-based (pick the relevant spans and drop the rest). Perplexity ruled out generation, since creating a new summary for every page hurts performance across accuracy, latency, and cost. Summarization can paraphrase the source, making citation alignment harder, and can introduce wording not present in the page , risky when the snippet is used as evidence.

Instead, they define the task as query-aware extractive compression: separating the helpful parts of the retrieved context from the distracting parts. The model accepts a query and candidate result, decides which spans to preserve, and keeps source text rather than rewriting it. This makes the result easier to cite, easier to verify, and cheaper to pass downstream.

Don't miss what's next in AI

Join 300,000+ engineers and researchers who get the signal, not the noise.

Full access to in-depth AI research breakdowns
Be the first to know what's trending before it hits mainstream
Daily curated papers, repos, and industry moves

Perplexity Cuts RAG Context Tokens by 70% Without Sacrificing Accuracy

Takeaways

The "Context Rot" Problem

Extractive, Not Generative

Don't miss what's next in AI