Structured PDF to JSON: A Guide to Open-Source Extraction Models in 2026

admin 2 hours ago



Most business data still sits inside PDFs, scans, and slide decks. Large language models and agents cannot use that data until it is structured JSON. Open-source document extraction has become a common way to do that conversion on your own hardware.

Two different problems hide under the name ‘PDF to JSON.’ The first is schema-driven extraction: you define the fields, and the model fills them with values. The second is document parsing: the model reconstructs the page as structured JSON or Markdown. Most teams need one, sometimes both. Choosing the wrong category costs real time.

Open weights matter here for cost and privacy. Proprietary APIs can cost thousands of dollars per million pages, and they require sending documents off-premises. Local models remove both obstacles. Below are the models and tools worth checking out, grouped by what they actually do.

Two categories, one phrase

Schema-driven extraction takes a document and a JSON schema, and returns values for your fields. Use it for invoices, forms, contracts, and receipts, where you know the fields in advance.

Document parsing reconstructs the document itself. It captures layout, reading order, tables, formulas, and code, and emits them as JSON or Markdown. Use it to prepare clean corpora for retrieval-augmented generation (RAG) and agents.
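For the schema-driven case, the schema is the whole interface, so it helps to see one. The sketch below builds a small invoice schema in Python and serializes it to JSON. The field names (`invoice_number`, `issue_date`, `total_amount`, `line_items`) are illustrative assumptions, not taken from any of the tools covered here.

```python
import json

# Hypothetical invoice schema; every field name here is illustrative.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "description": "ISO 8601 date"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
                "required": ["description", "amount"],
            },
        },
    },
    "required": ["invoice_number", "total_amount"],
}

# Serialize to the JSON text you would save as a schema file.
schema_json = json.dumps(invoice_schema, indent=2)
print(schema_json)
```

Short descriptions on ambiguous fields, such as the date format above, are a cheap way to tell the model what shape of value you expect.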

Category 1: Schema-driven extraction to JSON

Datalab lift

lift is a 9B vision model from Datalab, the team behind Marker and Surya. You pass a JSON schema, and lift returns matching JSON. Schema-constrained decoding ensures that the output is valid JSON. The model is built on Qwen 3.5 and runs locally via Hugging Face or remotely via a vLLM server.

It handles multi-page documents in a single pass, including values that span multiple pages. It ships with a CLI, a Python API, and a Streamlit ‘Schema Studio’ app for building and testing schemas.

pip install lift-pdf

# Start the vLLM server, then extract to your schema
lift_vllm
lift_extract input.pdf ./output --schema schema.json


from lift import extract

result = extract("document.pdf", "schema.json")
if result.extraction is not None:
    data = result.extraction  # dict matching your schema

On Datalab’s 225-document benchmark, lift reaches 90.2% field accuracy at 9.5 s median latency. It leads NuExtract 3 (81.5%) and Qwen3.5-9B (76.3%) on field accuracy, and trails Gemini Flash 3.5 (91.3%) and Datalab’s hosted API (95.9%). Note that full-document accuracy remains low for all local models, with lift at 20.9%. Getting every field right in a single document is still hard.
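The gap between field accuracy and full-document accuracy is easier to see with a toy calculation. The sketch below assumes exact-match scoring per field, which may differ from how Datalab's benchmark actually scores; the two sample documents are invented for illustration.

```python
def field_accuracy(predictions, references):
    """Share of individual fields that match, pooled over all documents."""
    correct = 0
    total = 0
    for pred, ref in zip(predictions, references):
        for key, expected in ref.items():
            total += 1
            if pred.get(key) == expected:
                correct += 1
    return correct / total


def full_document_accuracy(predictions, references):
    """Share of documents in which every single field matches."""
    perfect = 0
    for pred, ref in zip(predictions, references):
        if all(pred.get(key) == expected for key, expected in ref.items()):
            perfect += 1
    return perfect / len(references)


# Invented toy data: two documents with four fields each.
references = [
    {"number": "A-1", "date": "2026-01-05", "total": 120.0, "currency": "EUR"},
    {"number": "B-7", "date": "2026-02-11", "total": 89.5, "currency": "USD"},
]
predictions = [
    {"number": "A-1", "date": "2026-01-05", "total": 120.0, "currency": "EUR"},
    {"number": "B-7", "date": "2026-02-11", "total": 98.5, "currency": "USD"},  # one wrong field
]

field_acc = field_accuracy(predictions, references)        # 7 of 8 fields -> 0.875
doc_acc = full_document_accuracy(predictions, references)  # 1 of 2 documents -> 0.5
print(field_acc, doc_acc)
```

One wrong value out of eight barely moves field accuracy, but it halves full-document accuracy, which is why the two numbers diverge so sharply on documents with many fields.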

The code is Apache-2.0. The weights use a modified OpenRAIL-M license, free for research, personal use, and startups with less than $5M in funding or revenue. Commercial self-hosting requires a license, and the weights cannot be used to compete with the Datalab API.

NuMind NuExtract 3

NuExtract 3 is a 4B vision-language model from NuMind. It combines two functions in one model: structured extraction (document to JSON) and content extraction (OCR to Markdown). You provide the input and a JSON template that defines the fields you need. The model is trained with reinforcement learning to add extraction-specific reasoning, which you can turn on or off per request.

NuExtract 3 is multimodal, multilingual, and built on a Qwen backbone. It runs on vLLM with an OpenAI-compatible API, and a Python SDK is available with pip install numind. NuMind positions it as an open reference model for both structured and content extraction at its size. Check the model card for specific license terms before commercial use.

Category 2: Document parsing to structured JSON and Markdown

IBM Docling

Docling started at IBM Research and is now hosted by the LF AI & Data Foundation. It parses PDF, DOCX, PPTX, XLSX, HTML, images, and more. Output formats include Markdown, HTML, lossless JSON, and DocTags. At its core is the DoclingDocument representation, which preserves layout, reading order, tables, and formulas as LaTeX.

Docling runs locally, including in air-gapped environments. It integrates with LangChain, LlamaIndex, Crew AI, and Haystack, and ships with an MCP server and a Docling Serve mode. The project is released under the permissive MIT license. IBM also offers a managed version through watsonx.
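A minimal sketch of a Docling conversion, assuming the `docling` package is installed and that its `DocumentConverter` quickstart API is unchanged; check the current Docling documentation before relying on it. The wrapper function name and the file path in the usage comment are hypothetical.

```python
def parse_to_markdown_and_dict(source):
    """Convert one document with Docling and return (markdown, dict).

    `source` is a local path or URL. The import is kept inside the
    function so this module loads even where docling is not installed.
    """
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    document = result.document  # the DoclingDocument representation

    markdown = document.export_to_markdown()  # text for RAG chunking
    as_dict = document.export_to_dict()       # JSON-serializable structure
    return markdown, as_dict


# Usage (hypothetical path):
# markdown, as_dict = parse_to_markdown_and_dict("report.pdf")
```

Keeping both outputs is a reasonable default: Markdown is convenient for retrieval, while the dictionary form retains the structure that Markdown flattens.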

IBM Granite-Docling-258M

Granite-Docling-258M is a 258M-parameter vision-language model from IBM. It performs one-shot document conversion inside Docling pipelines. Despite its size, it handles OCR, layout, tables, code, and equations, and outputs DocTags. On an A100 GPU, it reaches about 0.35 seconds per page.

The model builds on the Idefics3 architecture, with a SigLIP2 vision encoder and a Granite 165M language backbone. It is released under Apache 2.0. IBM says it is designed for document conversion, not general image understanding.

OpenDataLab MinerU

MinerU, from OpenDataLab and Shanghai AI Laboratory, converts PDF, image, DOCX, PPTX, and XLSX input to Markdown and JSON. It pairs a processing pipeline with a vision-language model. The current model, MinerU2.5-Pro, targets high-resolution parsing of complex layouts, including cross-page tables and charts.

MinerU recently changed its license. It moved from AGPL-3.0 to the “MinerU Open Source License,” a custom license based on Apache 2.0 with additional terms. That change reduces the friction of commercial deployment.

Datalab Marker

Marker is Datalab’s pipeline for converting documents into Markdown, JSON, chunks, and HTML. It supports PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB. It formats tables, forms, equations, inline math, links, and code. The optional --use_llm flag adds a language model to improve tables and forms.

On the third-party olmOCR-Bench, Marker scores about 76.1. Its code is GPL-3.0, and its model weights use a modified AI Pubs OpenRAIL-M license. That weight license is free for research, personal use, and startups with less than $2M in funding or revenue. Datalab’s hosted platform now uses a newer OCR model, Chandra, which is Apache-2.0 and outputs HTML, Markdown, and JSON.

Ai2 olmOCR 2

olmOCR 2 is a specialized 7B OCR vision-language model from the Allen Institute for AI (Ai2). It converts PDFs to plain text and Markdown while preserving reading order. It handles tables, equations, and handwriting across complex multi-column layouts. The model was trained with reinforcement learning from verifiable rewards, using synthetic unit tests as the reward signal.

olmOCR 2 scores 82.4 on its own olmOCR-Bench, among the highest published results on that suite. Ai2 estimates a cost of about $178 per million pages on your own GPUs. The toolkit and the allenai/olmOCR-2-7B-1025 weights are Apache-2.0. The current model focuses on English.
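The per-million figure scales linearly, so a budget estimate is a one-line calculation. The sketch below uses Ai2's quoted $178 estimate as its default; the 250,000-page corpus is an invented example, and real costs depend on your GPUs and utilization.

```python
COST_PER_MILLION_PAGES_USD = 178.0  # Ai2's estimate quoted above


def estimated_cost_usd(pages, cost_per_million=COST_PER_MILLION_PAGES_USD):
    """Linear cost estimate for processing `pages` pages."""
    return pages * cost_per_million / 1_000_000


# Invented example: a 250,000-page archive.
archive_cost = estimated_cost_usd(250_000)
print(f"${archive_cost:.2f}")  # $44.50
```

Passing a different `cost_per_million` lets you compare the same corpus against a hosted API quote.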

DeepSeek DeepSeek-OCR

DeepSeek-OCR is an open-source OCR model from DeepSeek, released in October 2025. It introduces “contexts optical compression,” which represents text-rich pages as compact vision tokens, then decodes them back into text. This lets it process long documents with far fewer tokens than conventional vision-language models.

It pairs a DeepEncoder with a 3B Mixture-of-Experts decoder that activates about 570M parameters per token. Depending on the prompt, it outputs plain text, Markdown, HTML tables, or structured JSON, and it supports 100+ languages. The code is released under the MIT license. The follow-up, DeepSeek-OCR2, arrived in January 2026.

General purpose option: Qwen3-VL

Qwen3-VL from Alibaba is not a document-specific model. It is a general multimodal series that many extraction models use as a base. You can prompt it to return Markdown, JSON, or code from a page. Most sizes ship under Apache 2.0. It is a flexible fallback when a specialized model doesn’t fit, although it requires more prompt engineering and offers fewer guarantees for extraction.
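Because a general-purpose model gives no guarantee about output shape, the reply usually needs defensive parsing. The sketch below is one possible approach using only the standard library; the function name and the sample reply are hypothetical, and it does not call any model.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built here to keep this example readable


def parse_json_reply(reply, required_keys=()):
    """Pull a JSON object out of a free-form model reply and check its keys."""
    # Prefer the contents of a fenced block if the model wrapped its answer.
    pattern = FENCE + r"(?:json)?\s*(.*?)" + FENCE
    fenced = re.search(pattern, reply, re.DOTALL)
    candidate = fenced.group(1) if fenced else reply

    # Take the outermost braces and let json.loads reject anything malformed.
    start = candidate.find("{")
    end = candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(candidate[start:end + 1])

    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


# Hypothetical reply with chatter around a fenced JSON answer.
sample_reply = "Sure, here it is:\n" + FENCE + 'json\n{"total": 42.5}\n' + FENCE
parsed = parse_json_reply(sample_reply, required_keys=["total"])
print(parsed)
```

A check like this only confirms the shape of the reply. It says nothing about whether the values are right, which is the guarantee gap that schema-constrained models narrow.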

How the options compare

Model	Org	Size	What it does	Primary output	License
lift	Datalab	9B	Schema-driven extraction	JSON matching your schema	Apache-2.0 code / OpenRAIL-M weights
NuExtract 3	NuMind	4B	Schema extraction + OCR	JSON + Markdown	Open weights (see card)
Docling	IBM / LF AI & Data	Pipeline	Document parsing	Markdown, JSON, DocTags	MIT
Granite-Docling	IBM	258M	One-shot conversion	DocTags, Markdown	Apache-2.0
MinerU	OpenDataLab	~1.2B VLM	Document parsing	Markdown, JSON	MinerU Open Source License
Marker	Datalab	Pipeline	Document parsing	Markdown, JSON, HTML	GPL-3.0 code / OpenRAIL-M weights
olmOCR 2	Ai2	7B	OCR to text	Plain text, Markdown	Apache-2.0
DeepSeek-OCR	DeepSeek	3B MoE (~570M active)	OCR with token compression	Text, Markdown, JSON	MIT (code)
Qwen3-VL	Alibaba	2B–235B	General-purpose VLM	Markdown, JSON, code	Apache-2.0 (most sizes)

A note on benchmarks: these numbers come from different suites and are not directly comparable. lift’s 90.2% field accuracy comes from Datalab’s schema-extraction benchmark. The olmOCR-Bench scores for olmOCR 2 (82.4) and Marker (76.1) measure content extraction with unit-test-based scoring. Run your own documents through each candidate before making a decision.
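Sorting candidates by task and license is mechanical once the comparison is held as data. The sketch below encodes the license column from the comparison above; the single `task` label per tool is a simplifying assumption (NuExtract 3, for example, also does OCR), and the `shortlist` helper is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    task: str      # simplified label: "extraction", "parsing", or "general"
    license: str   # as listed in the comparison above


TOOLS = [
    Tool("lift", "extraction", "Apache-2.0 code / OpenRAIL-M weights"),
    Tool("NuExtract 3", "extraction", "Open weights (see card)"),
    Tool("Docling", "parsing", "MIT"),
    Tool("Granite-Docling", "parsing", "Apache-2.0"),
    Tool("MinerU", "parsing", "MinerU Open Source License"),
    Tool("Marker", "parsing", "GPL-3.0 code / OpenRAIL-M weights"),
    Tool("olmOCR 2", "parsing", "Apache-2.0"),
    Tool("DeepSeek-OCR", "parsing", "MIT (code)"),
    Tool("Qwen3-VL", "general", "Apache-2.0 (most sizes)"),
]


def shortlist(task, license_prefix=None):
    """Names of tools for a task, optionally filtered by license prefix."""
    return [
        tool.name
        for tool in TOOLS
        if tool.task == task
        and (license_prefix is None or tool.license.startswith(license_prefix))
    ]


apache_parsers = shortlist("parsing", "Apache-2.0")
print(apache_parsers)  # ['Granite-Docling', 'olmOCR 2']
```

A prefix match is deliberately crude: split code and weight licenses, such as lift's, still need a manual read of the actual terms.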


Key Takeaways

Schema-driven extraction (fields to values) and document parsing (structure to JSON) are separate tasks.
lift and NuExtract 3 handle schema-driven extraction to JSON; the rest target document parsing.
Docling, MinerU, Marker, olmOCR 2, and DeepSeek-OCR parse documents into structured Markdown or JSON.
Licenses vary widely; MinerU moved off AGPL-3.0 in 2026, and lift and Marker license their code and model weights separately.
Published benchmarks come from different suites, so treat scores across models as indicative, not as directly comparable.

Michal Sutter is a data science expert with a Master of Science in Data Science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at turning complex data sets into actionable insights.
