Serverless OCR with DeepSeek and Modal: From Scanned PDFs to Searchable Knowledge




Most “knowledge base” projects fail at a boring step that nobody wants to talk about: getting text out of documents that are not really text. If your PDFs are scans, your search is fake, your RAG is brittle, and your agent is blind. The frustrating part is that the tooling is either expensive at scale or operationally heavy if you want to run it yourself.

A recent write-up shows a clean middle path: use an open OCR model (DeepSeek OCR) but run it on pay-per-second GPU infrastructure via Modal. The result is a small, readable Python script that turns scanned textbooks into markdown you can grep, index, or feed into downstream systems.

The Core Insight

The key idea is to treat OCR as an API that you own, not as a batch job you outsource. The “serverless” framing changes how you build the system:

  • You pay for compute only while requests run.
  • You can keep the model warm inside a container to amortize load time.
  • You can tune batching and resolution like a throughput engineer, not a spreadsheet buyer.

This matters because OCR is one of those workloads where the naive approach is painfully slow and surprisingly expensive. GPU time is cheap only if you keep the GPU busy.

Why an open OCR model is the unlock

General-purpose OCR engines often struggle with mathematical notation, tables, and the subtle typography of textbooks. DeepSeek OCR is positioned as strong on math, which is exactly where many older pipelines degrade into garbage.

Using an open model also gives you a better security posture. You are not uploading pages of proprietary material to a third-party OCR SaaS. You still run code on a cloud provider, but the data flow is under your control and can be wrapped with your own authentication and retention policies.

Modal as an execution layer, not a product dependency

Modal’s decorator-based model is doing real work here. You define an image with CUDA, install pinned versions of torch and transformers, and then mark a function as GPU-backed. Modal handles:

  • Building and caching the container
  • Provisioning a GPU (for example an A100)
  • Routing HTTP requests to your app

The practical benefit is that you can focus on the interface and the throughput knobs rather than on Kubernetes YAML.
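
Here is a rough sketch of that scaffolding. It is not the original 40 lines: the package list, the function names like `ocr_pages`, and the web decorator are placeholders (the decorator name varies across Modal releases, `web_endpoint` vs `fastapi_endpoint`).

```python
import base64

import modal

# Container image: CUDA-capable PyTorch plus pinned Python deps.
# The packages below are illustrative, not the original post's exact pins.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("torch", "transformers", "pymupdf", "fastapi[standard]")
)

app = modal.App("serverless-ocr", image=image)


@app.function(gpu="A100", timeout=600)
def ocr_pages(page_pngs: list[bytes]) -> list[str]:
    # Model loading and batched inference live here; a fuller sketch of that
    # part follows in the throughput-knobs section below.
    raise NotImplementedError


@app.function()
@modal.web_endpoint(method="POST")
def ocr(payload: dict) -> dict:
    # Thin HTTP wrapper: pages arrive base64-encoded over JSON, and the
    # GPU-backed function does the actual work.
    pages = [base64.b64decode(p) for p in payload["pages"]]
    return {"markdown": ocr_pages.remote(pages)}
```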

The throughput knobs that actually matter

The original implementation highlights three details that are worth copying (a combined sketch follows the list):

1) Load the model once per container. Subsequent requests reuse it, which is essential because model initialization can dominate runtime.

2) Batch pages. OCR is embarrassingly parallel, and batching improves GPU utilization. The sweet spot depends on your memory budget and the model’s input format, but even small batches can reduce per-page overhead.

3) Render PDFs at higher resolution before inference. Using a 2x zoom when rasterizing pages improves small text and subscripts. This is a classic “garbage in, garbage out” constraint: a better image can be cheaper than a better model.
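
Here is a hedged sketch of how those three knobs can fit together on Modal, assuming the model is loaded through transformers inside a class-based container. The `deepseek-ai/DeepSeek-OCR` repo id is an assumption, and the actual inference call is left as a stub because its method name and arguments come from the model’s own remote code.

```python
import modal

app = modal.App("serverless-ocr")  # reuse the image definition from the earlier sketch


@app.cls(gpu="A100")
class OcrWorker:
    @modal.enter()
    def load_model(self):
        # Knob 1: load the model once per container; every request served by this
        # container reuses it, so initialization cost is amortized.
        from transformers import AutoModel, AutoTokenizer

        model_id = "deepseek-ai/DeepSeek-OCR"  # assumed repo id; check the model card
        self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
        self.model = AutoModel.from_pretrained(model_id, trust_remote_code=True).cuda().eval()

    @modal.method()
    def ocr_pdf(self, pdf_bytes: bytes, batch_size: int = 4) -> list[str]:
        import fitz  # PyMuPDF

        # Knob 3: rasterize at 2x zoom so small text and subscripts survive.
        doc = fitz.open(stream=pdf_bytes, filetype="pdf")
        page_pngs = [
            page.get_pixmap(matrix=fitz.Matrix(2, 2)).tobytes("png") for page in doc
        ]

        # Knob 2: run inference in small batches to keep the GPU busy; the right
        # batch size depends on GPU memory and the model's input format.
        results: list[str] = []
        for i in range(0, len(page_pngs), batch_size):
            results.extend(self._run_ocr(page_pngs[i : i + batch_size]))
        return results

    def _run_ocr(self, page_pngs: list[bytes]) -> list[str]:
        # Placeholder: the real call goes through the model's trust_remote_code
        # interface, whose signature varies between releases; see its model card.
        raise NotImplementedError
```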

Why This Matters

If you are building agents that cite sources, explain formulas, or answer questions about internal documents, OCR quality is not optional. A system that returns a plausible answer is dangerous if it is backed by corrupted input text.

There is also a hidden productivity angle: once your textbooks and PDFs are searchable, you stop treating them like static artifacts. You can:

  • Diff different editions by text
  • Build chapter-level embeddings with better chunking
  • Create citation links back to page markers
  • Use simple tools (grep, ripgrep) before you even touch vector search

In other words, OCR is not just ingestion. It is a capability multiplier.

Key Takeaways

  • The best OCR pipeline is the one that is easy enough to run again. Treat it like an API, not a one-off migration.
  • Open OCR models can be safer and more accurate for specialized content, especially math.
  • Serverless GPUs are a pragmatic compromise: you avoid owning hardware, but you also avoid vendor-locked OCR pricing.
  • Batching and high-resolution rendering often matter more than fancy post-processing.
  • Plan for cleanup. OCR output may include tags, coordinates, or artifacts. Strip what you do not need, but keep enough structure to support citations.
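
For that last cleanup point, a minimal sketch of the pass. The tag patterns are hypothetical (inspect what your model actually emits and adjust), but the shape is the same: strip layout noise, collapse leftover whitespace, and keep anything you need for citations.

```python
import re


def clean_ocr_markdown(raw: str) -> str:
    """Strip model-specific layout tags from OCR output while keeping the text.

    The regexes below are illustrative; they assume coordinate blobs like
    [[120, 44, 380, 90]] and control tags like <|ref|>...<|/ref|>, which may
    not match your model's actual output.
    """
    text = re.sub(r"\[\[\d+(?:,\s*\d+)*\]\]", "", raw)   # bounding-box style coordinates
    text = re.sub(r"<\|[^|>]*\|>", "", text)             # angle-bracket control tags
    text = re.sub(r"[ \t]+\n", "\n", text)               # trailing spaces before newlines
    return re.sub(r"\n{3,}", "\n\n", text).strip()       # collapse runs of blank lines
```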

Looking Ahead

The next step after “OCR to markdown” is “OCR to trustworthy knowledge.” A few practical upgrades can turn this from a hobby script into a durable ingestion service:

  • Add request authentication and rate limiting to protect your endpoint.
  • Emit page-level metadata (book id, page number, checksum) so downstream indices stay consistent (see the sketch after this list).
  • Store both the raw OCR output and a cleaned version, so you can re-process as your rules evolve.
  • Add lightweight evaluation: sample pages, measure character error rate, and track regression when you change model versions.
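
A small sketch of what such a page record can look like. The field names and path layout are assumptions; the point is a stable identity plus a checksum of the source image, so a changed page triggers re-processing and downstream indices stay consistent.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PageRecord:
    """Per-page metadata stored alongside both raw and cleaned OCR output."""
    book_id: str
    page_number: int
    image_sha256: str         # checksum of the rasterized page image
    raw_markdown_path: str    # untouched OCR output, kept for re-processing
    clean_markdown_path: str  # cleaned version used by indices and agents


def make_record(book_id: str, page_number: int, page_png: bytes) -> PageRecord:
    digest = hashlib.sha256(page_png).hexdigest()
    return PageRecord(
        book_id=book_id,
        page_number=page_number,
        image_sha256=digest,
        raw_markdown_path=f"ocr/raw/{book_id}/{page_number:05d}.md",
        clean_markdown_path=f"ocr/clean/{book_id}/{page_number:05d}.md",
    )
```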

Finally, watch the supply chain boundary. The script uses trust_remote_code=True when loading from Hugging Face. That can be necessary, but it is also a real risk. Pin exact commit hashes when possible, mirror the repository, and treat model code like any other dependency you execute.
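
One concrete mitigation, assuming the model is loaded through transformers: pass an explicit `revision` so the remote code you execute is pinned to a commit you have actually reviewed. The repo id and hash below are placeholders.

```python
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-OCR"  # assumed repo id; check the model card
REVISION = "<exact-commit-hash>"       # pin to a commit you have reviewed

# trust_remote_code executes Python shipped with the model repo, so treat the
# pinned revision like any other dependency you vendor and audit.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=REVISION, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, revision=REVISION, trust_remote_code=True)
```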

Sources

  • Rolling your own serverless OCR in 40 lines of code
    https://christopherkrapu.com/blog/2026/ocr-textbooks-modal-deepseek/

