Baidu's Unlimited-OCR Parses 200-Page PDFs With Flat Memory Usage


5H AGO

2 min read



OCR models have a dirty secret: the longer the document, the slower and more memory-hungry they get. Every token the decoder generates adds to the KV cache (the running memory of keys and values that attention uses to look back at prior context), so parsing a 40-page PDF is not just a single page repeated 40 times: memory use and per-token latency keep climbing as the output grows. Baidu just shipped a model designed to remove that scaling.

A new model in the DeepSeek-OCR lineage

Unlimited-OCR is Baidu's open-source document-parsing model, built on top of DeepSeek-OCR and designed specifically to eliminate the memory wall that makes long-document OCR impractical. It is now officially supported in vLLM, with a dedicated recipe and Docker image. The model is free, MIT-licensed, and the weights are on Hugging Face.

Unlimited-OCR is a 3B-parameter Mixture-of-Experts model, with only 500M parameters active for any given token. It was not trained from scratch: the research team continue-trained the DeepSeek-OCR checkpoint for 4,000 steps, keeping the DeepEncoder frozen and updating only the decoder on about 2M document samples using 8x16 A800 GPUs.

The core problem: attention that never forgets

As the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation -- in stark contrast to humans, who exhibit no such decline in efficiency during long-horizon copying tasks.

Standard Multi-Head Attention stores a key and a value for every token ever generated. As the output length T grows, the cache grows linearly with it, so memory and latency climb without bound. For a model trying to transcribe a 200-page book, this becomes a hard wall.
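To make the growth concrete, here is a back-of-the-envelope sketch in Python. The layer count, head count, head dimension, and tokens-per-page figure are hypothetical placeholders chosen for illustration, not Unlimited-OCR's actual configuration; only the shape of the formula (cache size proportional to T) reflects the point above.

```python
def kv_cache_bytes(num_tokens, num_layers=12, num_kv_heads=10,
                   head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values under full attention.

    All default sizes are hypothetical, not taken from the model card.
    """
    # Factor 2: one key vector and one value vector per head, per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * num_tokens


TOKENS_PER_PAGE = 1000  # hypothetical average number of output tokens per page

for pages in (1, 40, 200):
    mib = kv_cache_bytes(pages * TOKENS_PER_PAGE) / 2**20
    print(f"{pages:>3} pages: {mib:,.1f} MiB of KV cache")
```

Whatever the real constants are, the cache for a 200-page document is 200 times the cache for a single page, and each new token has to attend over all of it.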

Reference Sliding Window Attention: the fix

The key innovation is Reference Sliding Window Attention (R-SWA) -- a new attention design that keeps the KV cache at a fixed size regardless of how long the output gets. Think of it like how a human copyist works: you glance at the source document, write a few words, and only need to remember the last sentence you wrote -- not every word since you started.
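The copyist analogy can be sketched as a cache with two parts: a fixed block of entries for the source document and a bounded window over the most recent output. This is a toy illustration of the general idea only; the class name `BoundedKVCache`, its methods, and the window size are hypothetical, and the sketch is not Baidu's R-SWA implementation or the vLLM kernel.

```python
from collections import deque


class BoundedKVCache:
    """Toy cache: fixed reference entries plus a sliding window of recent ones."""

    def __init__(self, reference_entries, window):
        # Entries for the source document; written once, never evicted.
        self.reference = list(reference_entries)
        # Entries for generated tokens; a deque with maxlen drops the oldest
        # entry automatically once the window is full.
        self.recent = deque(maxlen=window)

    def append(self, entry):
        self.recent.append(entry)

    def visible(self):
        # What the next decoding step would be allowed to attend to.
        return self.reference + list(self.recent)

    def __len__(self):
        return len(self.reference) + len(self.recent)


cache = BoundedKVCache(reference_entries=["src0", "src1", "src2"], window=4)
for step in range(100):
    cache.append(step)  # stand-in for one generated token's key/value pair

print(len(cache), cache.visible())
```

After 100 generated tokens the cache still holds only the three reference entries and the four most recent outputs, so its size depends on the window length rather than on how much has been written.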
