Pictures replace a thousand words

DeepSeek may not have the GPU capacity to repeat the impact of their R1 launch earlier in the year, but they are still close to the forefront of AI research. Released this week, their new OCR model doesn’t just read text from images; it stores text as images, achieving 10x compression with 97% accuracy! The idea sounds counterintuitive. Why convert perfectly good text into pixels? Yet the results suggest this approach could reshape how AI systems handle information.

DeepSeek uses a two-stage encoder: first, an 80-million-parameter SAM model captures fine details, then a CNN compresses the data 16-fold before a CLIP model builds global understanding. A document that would need 6,000 text tokens compresses to under 800 vision tokens, whilst still performing better than traditional approaches.
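To make the arithmetic concrete, here is a minimal sketch of how the token counts fall out. The 16-fold compression and the 6,000 and 800 token figures come from the text above; the 16-pixel patch size, the 1024x1024 input and the function names are illustrative assumptions, not details confirmed here.

```python
def vision_tokens(width: int, height: int,
                  patch_size: int = 16, compression: int = 16) -> int:
    """Estimate the vision tokens left after patching and compressing an image.

    patch_size is an assumed value; compression is the 16-fold figure above.
    """
    # Split the image into square patches, one token per patch.
    patches = (width // patch_size) * (height // patch_size)
    # The convolutional stage then shrinks the token count 16-fold.
    return patches // compression


def compression_ratio(text_tokens: int, vision_tokens_used: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_tokens / vision_tokens_used


if __name__ == "__main__":
    # Assumed 1024x1024 page image: 4096 patches before compression.
    print(vision_tokens(1024, 1024))
    # The 6,000-token document held in 800 vision tokens.
    print(compression_ratio(6000, 800))
```

On these assumptions a single page image costs a few hundred tokens, and the 6,000-to-800 example works out at a 7.5x saving, in the same range as the headline 10x figure.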

This raises the question: will pixels become the universal input for AI systems? Images preserve formatting and layout naturally, enable bidirectional attention without complex tokenizers, and reduce injection vulnerabilities in text processing. Most intriguingly, this may mirror human memory. It often feels like we recall pages and diagrams visually rather than as abstract text strings.

The new model can process 200 pages per minute on standard hardware, with infrastructure costs reduced by an order of magnitude. Whilst DeepSeek OCR will provide a useful new document ingestion option, its compression innovation may be truly game-changing.

Takeaways: If DeepSeek’s approach proves scalable, we may be witnessing the beginning of AI systems that think in pictures rather than words, making vast knowledge bases accessible through simple visual compression.
