
DeepSeek AI Releases DeepSeek-OCR 2 with Causal Visual Flow Encoder for Layout Aware Document Understanding

DeepSeek AI released DeepSeek-OCR 2, an open-source document OCR and recognition system that reworks its vision encoder to read pages in a logical order closer to how people scan complex documents. The key component is DeepEncoder V2, a language-model-style encoder that converts a 2D page into a 1D sequence of visual tokens that already follows the reading flow before text decoding begins.

From raster order to causal visual flow

Most multimodal models still flatten images into a fixed raster sequence, left to right and top to bottom, and encode them with a standard bidirectional vision transformer. This is a poor fit for documents with multi-column layouts, nested tables, and mixed-language regions. Human readers instead follow a semantic order that jumps between regions of the page.
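
As a toy illustration (not from the paper), here is how a 4 × 4 patch grid from a two-column page flattens under raster order versus a column-aware reading order:

```python
# Toy illustration: raster flattening vs a column-aware reading order
# for a two-column page divided into a 4 x 4 grid of patches.
import numpy as np

patches = np.arange(16).reshape(4, 4)   # patch indices laid out on the page

# Raster flattening: left to right, top to bottom, ignoring the column split.
raster_order = patches.flatten()

# A column-aware reading order: finish the left column, then the right one.
reading_order = np.concatenate([patches[:, :2].flatten(), patches[:, 2:].flatten()])

print(raster_order)   # [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
print(reading_order)  # [ 0  1  4  5  8  9 12 13  2  3  6  7 10 11 14 15]
```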

DeepSeek-OCR 2 retains the encoder and decoder architecture of DeepSeek-OCR, but replaces the original CLIP ViT-based optical encoder with DeepEncoder V2. The decoder remains DeepSeek-3B-A500M, a MoE language model with about 3B parameters in total and about 500M active parameters per token. The goal is to let the encoder reason over the visual tokens and hand the decoder a sequence that is already aligned with a plausible reading order.
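
The overall flow can be summarized in a minimal sketch; all function and module names below are hypothetical stand-ins, not the released API:

```python
# Minimal sketch of the described pipeline; names are hypothetical stand-ins.
def deepseek_ocr2_forward(page_image, prompt_ids, tokenizer, encoder, decoder):
    # 1. SAM-based vision tokenizer: 2D page -> grid of visual tokens.
    visual_tokens = tokenizer(page_image)          # [num_visual, 896]

    # 2. DeepEncoder-V2 (Qwen2-0.5B style): reorders content into causal flow
    #    tokens that follow an inferred reading order.
    flow_tokens = encoder(visual_tokens)           # [num_visual, hidden]

    # 3. DeepSeek-3B-A500M MoE decoder: autoregressive text generation
    #    conditioned on the reordered visual sequence.
    return decoder.generate(prefix=flow_tokens, prompt=prompt_ids)
```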

Vision tokenizer and token budget

The vision tokenizer is inherited from DeepSeek-OCR. It uses an 80M parameter SAM-base backbone followed by 2 convolution layers. This stage downsamples the image so that the visual token count is reduced by a factor of 16 and compresses the features to an embedding size of 896.
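
A minimal sketch of such a compression head, assuming PyTorch and illustrative layer sizes (the released weights may differ): two stride-2 convolutions cut each spatial dimension in half twice, which is where the 16× token reduction comes from.

```python
# Sketch of a 16x token-reduction head on top of a SAM-style feature map.
# Layer sizes are illustrative, not the released architecture.
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    def __init__(self, in_channels=256, out_channels=896):
        super().__init__()
        # Two stride-2 convolutions halve each spatial dimension twice,
        # so the token count (H/4 * W/4) drops by a factor of 16.
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=2, padding=1)

    def forward(self, feats):                      # feats: [B, C, H, W] backbone features
        x = self.conv2(torch.relu(self.conv1(feats)))
        return x.flatten(2).transpose(1, 2)        # [B, (H/4)*(W/4), 896] visual tokens

tokens = TokenCompressor()(torch.randn(1, 256, 64, 64))
print(tokens.shape)                                # torch.Size([1, 256, 896])
```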

DeepSeek-OCR 2 uses a global and local multi-crop strategy to cover dense pages without letting the token count explode. A global view at 1024 × 1024 resolution produces 256 tokens. Up to 6 local crops at 768 × 768 resolution add 144 tokens each. As a result, the visual token count ranges from 256 to 1120 per page. This budget is smaller than the 1156 tokens used in DeepSeek-OCR’s Gundam mode, and is comparable to the budget used by Gemini-3 Pro on OmniDocBench.
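
The token budget arithmetic follows directly from these numbers; a small sketch (how the released code counts tokens may differ):

```python
# Token budget arithmetic implied by the multi-crop scheme described above.
GLOBAL_TOKENS = 256      # one 1024 x 1024 global view
LOCAL_TOKENS = 144       # each 768 x 768 local crop
MAX_LOCAL_CROPS = 6

def visual_token_budget(num_local_crops: int) -> int:
    assert 0 <= num_local_crops <= MAX_LOCAL_CROPS
    return GLOBAL_TOKENS + num_local_crops * LOCAL_TOKENS

print(visual_token_budget(0))   # 256  (simple pages)
print(visual_token_budget(6))   # 1120 (dense pages), vs 1156 in DeepSeek-OCR's Gundam mode
```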

DeepEncoder-V2, a language model as a vision encoder

DeepEncoder-V2 is built by repurposing a Qwen2-0.5B style language model as a vision encoder. The input sequence is constructed as follows. First, all visual tokens from the tokenizer form a prefix. Then a set of learnable query tokens, called causal flow tokens, is appended as a suffix. The number of causal flow tokens is equal to the number of visual tokens.
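
A minimal sketch of this input layout, assuming PyTorch and illustrative shapes:

```python
# Sketch of the DeepEncoder-V2 input layout; shapes are illustrative.
import torch

num_visual, hidden = 256, 896
visual_tokens = torch.randn(1, num_visual, hidden)   # prefix from the vision tokenizer

# One causal-flow query per visual token forms the suffix
# (learnable embeddings in practice, zeros here for the sketch).
flow_queries = torch.zeros(1, num_visual, hidden)

encoder_input = torch.cat([visual_tokens, flow_queries], dim=1)
print(encoder_input.shape)                            # torch.Size([1, 512, 896])
```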

The attention pattern is asymmetric. Visual tokens use bidirectional attention and can attend to all other visual tokens. Causal flow tokens use causal attention and can see all visual tokens but only previous causal flow tokens. Only the output at the causal flow positions is passed to the decoder. In effect, the encoder learns a mapping from a 2D grid of visual tokens to a 1D causal sequence of flow tokens that carries the proposed reading order along with spatial context.
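
A sketch of that asymmetric mask, assuming a boolean convention where True means a position may attend to another (names and layout are illustrative):

```python
# Sketch of the asymmetric attention mask: bidirectional among visual tokens,
# causal for flow tokens, which also see every visual token.
import torch

def flow_attention_mask(num_visual: int) -> torch.Tensor:
    n = 2 * num_visual                       # visual prefix + causal flow suffix
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Visual tokens attend bidirectionally to all visual tokens.
    mask[:num_visual, :num_visual] = True
    # Flow tokens attend to all visual tokens ...
    mask[num_visual:, :num_visual] = True
    # ... and causally to themselves and earlier flow tokens.
    mask[num_visual:, num_visual:] = torch.tril(
        torch.ones(num_visual, num_visual, dtype=torch.bool)
    )
    return mask

print(flow_attention_mask(4).int())
```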

This design splits the problem into 2 phases. DeepEncoder-V2 performs causal reasoning over the visual and layout structure. DeepSeek-3B-A500M then performs causal decoding of the text conditioned on this reordered visual input.

Training pipeline

The training data pipeline follows DeepSeek-OCR and focuses on OCR-heavy content. OCR data makes up 80 percent of the mix. The research team resamples across text, formulas, and tables using a 3:1:1 ratio so that the model sees enough structured, difficult examples.
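
Assuming the 3:1:1 ratio applies within the OCR portion of the mix, the implied sampling weights look roughly like this (a sketch, not the released configuration):

```python
# Sketch of sampling weights implied by the stated mix: 80 percent OCR data overall,
# assuming the 3:1:1 ratio over text, formulas, and tables applies inside that portion.
OCR_SHARE = 0.80
ratio = {"text": 3, "formulas": 1, "tables": 1}

total = sum(ratio.values())
weights = {k: OCR_SHARE * v / total for k, v in ratio.items()}
weights["other"] = 1.0 - OCR_SHARE

print(weights)  # roughly {'text': 0.48, 'formulas': 0.16, 'tables': 0.16, 'other': 0.2}
```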

The training goes through 3 stages:

In stage 1, encoder pretraining couples DeepEncoder-V2 to a small decoder and uses a standard language modeling objective. The model is trained at 768 × 768 and 1024 × 1024 resolutions with multi-scale sampling. The vision tokenizer is initialized from the original DeepEncoder, and the LLM-style encoder is initialized from the Qwen2-0.5B base. The optimizer is AdamW with a cosine learning rate that decays from 1e-4 to 1e-6 over 40k iterations. Training uses about 160 A100 GPUs, 8k sequence length with packing, and a large mixture of document image-text samples.

In stage 2, joint training attaches DeepEncoder-V2 to the DeepSeek-3B-A500M decoder and introduces the multi-crop views. The vision tokenizer is frozen. The encoder and decoder are jointly trained with 4-stage pipeline parallelism and 40 data-parallel replicas. The global batch size is 1280 and the schedule runs for 15k iterations with a learning rate decay from 5e-5 to 1e-6.

In stage 3, all encoder parameters are frozen. Only the DeepSeek decoder is trained so it adapts better to the reordered visual tokens. This stage uses the same batch size but a shorter schedule and a lower learning rate that decays from 1e-6 to 5e-8 over 20k iterations. Freezing the encoder more than doubles the training throughput at this stage.
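
The learning rate schedules across the three stages are plain cosine decays; a minimal sketch with the per-stage peaks, floors, and iteration counts taken from the description (the exact cosine form and any warmup are assumptions):

```python
# Cosine decay sketch for the three stages; peak/floor learning rates and iteration
# counts come from the description above, the cosine form itself is an assumption.
import math

def cosine_lr(step: int, total_steps: int, peak: float, floor: float) -> float:
    progress = min(step / total_steps, 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

STAGES = {
    "stage1_encoder_pretraining": dict(total_steps=40_000, peak=1e-4, floor=1e-6),
    "stage2_joint_training":      dict(total_steps=15_000, peak=5e-5, floor=1e-6),
    "stage3_decoder_only":        dict(total_steps=20_000, peak=1e-6, floor=5e-8),
}

for name, cfg in STAGES.items():
    print(name, cosine_lr(0, **cfg), cosine_lr(cfg["total_steps"], **cfg))
```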

Benchmark results on OmniDocBench

The main evaluation uses OmniDocBench v1.5. This benchmark has 1355 pages across 9 document categories in Chinese and English, including books, academic papers, forms, presentations, and newspapers. Each page is annotated with structural elements such as text, formulas, tables, and figures.

DeepSeek-OCR 2 achieves an overall OmniDocBench score of 91.09 with a maximum of 1120 visual tokens. The original DeepSeek-OCR baseline gets 87.36 with 1156 visual tokens. DeepSeek-OCR 2 therefore gains 3.73 points while using a slightly smaller visual token budget.

The reading order edit distance, which measures the difference between the predicted sequence and the ground-truth reading order, drops from 0.085 to 0.057. Text edit distance drops from 0.073 to 0.048. Formula and table edit distances also decrease, indicating better parsing of mathematical and structured regions.

Treated as a document parser, DeepSeek-OCR 2 achieves an overall element-level edit distance of 0.100. The original DeepSeek-OCR achieves 0.129 and Gemini-3 Pro achieves 0.115 under comparable visual token budgets. This suggests that the causal visual flow encoder improves structural reliability without increasing the token budget.

Category-wise, DeepSeek-OCR 2 improves text edit distance on most document types, such as academic papers and books. Performance is weaker on dense newspapers, where the text edit distance stays above 0.13. The research team attributes this to limited newspaper training data and the extreme text density under high compression. Reading order metrics, however, improve across all categories.

Key Takeaways

  • DeepSeek-OCR 2 replaces the CLIP ViT style encoder with DeepEncoder-V2, a language model based on Qwen2-0.5B that transforms a 2D document page into a 1D sequence of causal flow tokens aligned with the reading order.
  • The vision tokenizer uses an 80M parameter SAM backbone with global and local crop views, and keeps the visual token budget between 256 and 1120 tokens per page, slightly below the original DeepSeek-OCR Gundam mode while remaining comparable to Gemini-3 Pro.
  • Training follows a 3-stage pipeline, encoder pretraining, joint training with DeepSeek-3B-A500M, and decoder-only adaptation with a frozen encoder, using a mix of 80 percent OCR data and a 3:1:1 sampling ratio over text, formulas, and tables.
  • On OmniDocBench v1.5 with 1355 pages and 9 document categories, DeepSeek-OCR 2 achieves an overall score of 91.09 compared to 87.36 for DeepSeek-OCR, reduces the reading order edit distance from 0.085 to 0.057, and achieves an element-level edit distance of 0.100 compared to 0.129 for DeepSeek-OCR and 0.115 for Gemini-3 Pro under comparable visual token budgets.

Check out the Paper, Repo, and Model weights.


Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.
