GLM-OCR Explained: 0.9B Model That Beats Gemini 3 Pro at OCR

Last updated: April 2026

GLM-OCR is a 0.9-billion-parameter multimodal OCR model by Zhipu AI that scored 94.62 on OmniDocBench V1.5 — the highest of any model, open or closed. It outperforms Gemini 3 Pro (90.33), GPT-5.2 (85.4), and Qwen3-VL-235B (89.15) on document parsing despite being 260× smaller than the largest competitor. The model combines a CogViT vision encoder with a GLM language decoder and uses Multi-Token Prediction to process 1.86 PDF pages per second at just $0.03 per million tokens.


In early 2026, a model small enough to fit on a consumer GPU quietly topped the most respected document understanding benchmark in the industry. GLM-OCR, released by Zhipu AI (Z.ai), achieved a score of 94.62 on OmniDocBench V1.5 with just 0.9 billion parameters — surpassing not only every open-source competitor but also closed-source giants like Gemini 3 Pro and GPT-5.2. Within its first month, the model was downloaded over 3 million times from Hugging Face.

This article breaks down the architecture behind GLM-OCR, its benchmark results with full comparisons across 7 competing models, real-world performance data, and a practical deployment guide. If you work with document parsing pipelines — whether invoices, contracts, scientific papers, or code documentation — this is the model to evaluate.

What Is GLM-OCR and Why Does It Matter?

GLM-OCR is a multimodal vision-language model built specifically for document understanding. Unlike general-purpose LLMs that treat OCR as one of dozens of capabilities, every architectural decision in GLM-OCR — from the vision encoder to the decoding strategy — is optimized exclusively for extracting and structuring text from documents.

The model was developed by Zhipu AI, the Beijing-based lab behind the GLM family of transformer models. It was published on March 11, 2026, as an open-source project under the MIT License (with the layout analysis component under Apache 2.0), accompanied by a technical report on arXiv (2603.10910).

The practical significance is clear: most high-accuracy document parsers before GLM-OCR required either expensive closed-source APIs (Gemini, GPT) or heavyweight models demanding serious GPU infrastructure. GLM-OCR delivers superior accuracy on a model you can self-host on a single GPU with 4 GB VRAM, at an API price of $0.03 per million tokens.

How Does the GLM-OCR Architecture Work?

GLM-OCR follows a vision-language encoder-decoder design, but with three key innovations that set it apart from general-purpose VLMs.

Core Components

The model consists of two main blocks. The CogViT visual encoder (400M parameters) extracts high-level visual representations from document images. It was pre-trained on tens of billions of image-text pairs using a dual MIM + CLIP objective, with additional knowledge distillation from a larger in-house ViT. The GLM language decoder (500M parameters) generates structured textual outputs — Markdown, JSON, LaTeX — conditioned on the visual embeddings passed through a lightweight cross-modal connector with efficient token downsampling.
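The connector's role is easiest to see in miniature. The sketch below is a toy illustration of token downsampling, where the 4:1 pooling ratio, vector shapes, and function name are assumptions for illustration rather than GLM-OCR's actual configuration: merging groups of visual tokens shrinks the sequence the 500M decoder must attend over.

```python
# Toy sketch of a cross-modal connector with token downsampling.
# The 4:1 pooling ratio and shapes are illustrative assumptions,
# not the real GLM-OCR configuration.

def downsample_tokens(visual_tokens, ratio=4):
    """Merge each group of `ratio` visual tokens into one by averaging,
    shrinking the sequence the language decoder must attend over."""
    pooled = []
    for i in range(0, len(visual_tokens), ratio):
        group = visual_tokens[i:i + ratio]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled

# 16 visual tokens of dimension 2 -> 4 tokens after 4:1 pooling
tokens = [[float(i), float(i) * 2] for i in range(16)]
compressed = downsample_tokens(tokens)
print(len(tokens), "->", len(compressed))  # 16 -> 4
```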

Multi-Token Prediction (MTP)

Standard autoregressive decoding generates one token at a time. For OCR tasks — which are inherently deterministic with strong local dependencies — this is wasteful. GLM-OCR introduces Multi-Token Prediction: the model is trained to predict the next k = 10 tokens per step using shared-parameter auxiliary heads. At inference time, it generates an average of 5.2 tokens per decoding step, yielding approximately 50% throughput improvement. MTP also improves structural coherence, producing fewer broken HTML/Markdown tags in table and formula outputs.

Two-Stage Pipeline

Rather than feeding entire complex documents to the model at once (which causes hallucinations in small models), GLM-OCR uses a two-stage approach. PP-DocLayout-V3 first performs layout analysis, decomposing pages into semantically coherent regions (paragraphs, tables, formulas, headers). Each region is then independently processed by the GLM-OCR core in parallel. A merge-and-post-process module restores reading order and produces the final structured output. This modular design significantly reduces hallucination risk and enables parallel processing.
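In pseudocode form, the pipeline looks roughly like this. `detect_regions` and `recognize_region` are hypothetical stubs standing in for PP-DocLayout-V3 and the GLM-OCR core, not real APIs:

```python
# Sketch of the two-stage pipeline: layout analysis, parallel per-region
# recognition, then a merge that restores reading order. Both worker
# functions are stubs for illustration only.

from concurrent.futures import ThreadPoolExecutor

def detect_regions(page):
    """Stage 1 stub: decompose a page into (reading_order, kind, crop)."""
    return [
        (0, "header", "ACME Corp Invoice"),
        (1, "paragraph", "Bill to: Jane Doe"),
        (2, "table", "item,qty,price"),
    ]

def recognize_region(region):
    """Stage 2 stub: run the OCR core on one region crop."""
    order, kind, crop = region
    text = f"| {crop} |" if kind == "table" else crop
    return order, text

def parse_page(page):
    regions = detect_regions(page)
    with ThreadPoolExecutor() as pool:  # regions are independent
        results = list(pool.map(recognize_region, regions))
    results.sort(key=lambda r: r[0])    # restore reading order
    return "\n".join(text for _, text in results)

print(parse_page("page_1.png"))
```

Keeping each model call scoped to one region is what bounds hallucination: the decoder never has to invent structure for content it cannot see.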

Figure: GLM-OCR architecture (0.9B total). A document image passes through Stage 1 (PP-DocLayout-V3 layout analysis and region decomposition); each region then flows in parallel through the CogViT vision encoder (400M params, MIM + CLIP pretrained), a cross-modal connector with token downsampling, and the GLM language decoder (500M params) with Multi-Token Prediction (k = 10, avg 5.2 tokens/step, ~50% faster inference), producing Markdown, JSON, or LaTeX output.

4-Stage Training Recipe

GLM-OCR uses a progressive training pipeline. Stage 1 trains the CogViT encoder on billions of image-text pairs with MIM + CLIP + distillation. Stage 2 performs vision-language pretraining by attaching the GLM decoder and jointly training on document parsing, grounding, and VQA data, then introduces MTP. Stage 3 is supervised fine-tuning on curated OCR datasets (text, formula, table, KIE). Stage 4 applies reinforcement learning across all tasks to improve accuracy and structural consistency. This is one of the first OCR models to use RL at scale — a technique borrowed from the latest generation of reasoning-focused LLMs.

How Does GLM-OCR Perform on Benchmarks?

The headline number is 94.62 on OmniDocBench V1.5 — the most widely cited benchmark for document parsing in 2026. But the details below the headline are where it gets interesting.

Public Benchmark Comparison (7 Models)

| Benchmark | GLM-OCR (0.9B) | PaddleOCR-VL-1.5 (0.9B) | DeepSeek-OCR2 (3B) | MinerU 2.5 (1.2B) | dots.ocr (3B) | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|---|---|
| OmniDocBench V1.5 | 94.62 | 94.50 | 91.1 | 90.7 | 88.4 | 90.3 | 85.4 |
| OCRBench (Text) | 94.0 | 75.3 | 34.7 | 75.3 | 92.1 | 91.9 | 83.7 |
| UniMERNet (Formula) | 96.5 | 96.1 | 85.8 | 96.4 | 90.0 | 96.4 | 90.5 |
| PubTabNet (Table) | 85.2 | 84.6 | — | 88.4 | 71.0 | 91.4 | 84.4 |
| TEDS_TEST (Table) | 86.0 | 83.3 | — | 85.4 | 62.4 | 81.8 | 67.6 |
| Nanonets-KIE | 93.7 | — | — | — | — | 95.2 | 87.5 |
| Handwritten-KIE | 86.1 | — | — | — | — | 94.5 | 78.2 |

(— = score not reported)

Gemini 3 Pro and GPT-5.2 results are listed for reference only (closed-source, excluded from the official ranking). Source: arXiv 2603.10910, Tables 3–4.

Key takeaway

GLM-OCR leads open-source models on 5 out of 7 public benchmarks. The only area where it trails is PubTabNet (complex scientific tables), where MinerU 2.5 scores 88.4 vs. 85.2. On text recognition (OCRBench), it beats PaddleOCR-VL-1.5 by nearly 19 points.

OmniDocBench V1.5 — Detailed Breakdown

The overall score is a composite of four sub-metrics. Here is where GLM-OCR wins and where it doesn’t:

| Sub-metric | GLM-OCR | PaddleOCR-VL-1.5 | Gemini 3 Pro | Qwen3-VL-235B | Winner |
|---|---|---|---|---|---|
| Overall | 94.62 | 94.50 | 90.33 | 89.15 | GLM-OCR |
| TextEdit (↓ better) | 0.040 | 0.035 | — | — | PaddleOCR |
| FormulaCDM (↑) | 93.90 | 94.21 | — | — | PaddleOCR |
| TableTEDS (↑) | 93.96 | — | — | — | GLM-OCR |
| TableTEDS-S (↑) | 96.39 | — | — | — | GLM-OCR |

(— = score not reported)

PaddleOCR-VL-1.5 has a slight edge in raw text accuracy (TextEdit) and formula recognition (FormulaCDM). But GLM-OCR dominates table parsing — the single hardest component in document understanding — with the best TableTEDS and TableTEDS-S scores among all evaluated models. This is what pushes its overall score to #1.

Chart: OmniDocBench V1.5 overall scores (higher is better): GLM-OCR (0.9B) 94.62, PaddleOCR-VL-1.5 (0.9B) 94.50, DeepSeek-OCR2 (3B) 91.10, MinerU 2.5 (1.2B) 90.67, Gemini 3 Pro 90.33, Qwen3-VL-235B 89.15, dots.ocr (3B) 88.41, GPT-5.2 85.40. Closed-source models shown for reference only.

How Does GLM-OCR Perform in Real-World Scenarios?

Benchmarks are one thing. Production documents — with stamps, handwriting, receipts, and code — are another. Zhipu AI’s in-house evaluation tested 6 scenarios that reflect real deployment conditions:

| Scenario | GLM-OCR | PaddleOCR-VL-1.5 | dots.ocr | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|
| Code Documents | 84.7 | — | — | — | — |
| Real-world Tables | 91.5 | — | — | — | — |
| Handwritten Text | 87.0 | 87.4 | — | — | — |
| Multilingual | 69.3 | 54.8 | — | — | — |
| Seal Recognition | 90.5 | — | 63.0 | 91.3 | — |
| Receipt KIE | 94.5 | — | — | — | 83.5 |

(— = score not reported)

The standout result is seal recognition: GLM-OCR scores 90.5, nearly matching Gemini 3 Pro (91.3) and crushing the next open-source model (dots.ocr at 63.0). For receipt processing — a common production workload — it scores 94.5, easily beating GPT-5.2’s 83.5. The only scenario where PaddleOCR holds a marginal lead is handwritten text (87.4 vs. 87.0).

How Fast Is GLM-OCR Compared to Other Models?

Under identical hardware conditions (single replica, single concurrency), GLM-OCR processes PDF documents at 1.86 pages per second and individual images at 0.67 images per second. For comparison, the cloud API pricing is $0.03 per million tokens for both input and output — substantially cheaper than any comparable commercial offering.
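At that rate, batch sizing is simple arithmetic. A quick estimate, assuming throughput scales linearly with replicas (which real deployments only approximate):

```python
# Back-of-envelope throughput check using the figures quoted above
# (1.86 PDF pages/s on a single replica at concurrency 1).

PAGES_PER_SEC = 1.86

def batch_time_minutes(pages, replicas=1):
    """Wall-clock estimate; assumes linear scaling across replicas."""
    return pages / (PAGES_PER_SEC * replicas) / 60

print(f"{batch_time_minutes(10_000):.0f} min for 10k pages on one replica")
print(f"{batch_time_minutes(10_000, replicas=4):.0f} min on four replicas")
```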

Efficiency math

A 0.9B model needs roughly 2–4 GB VRAM at FP16. That means you can run GLM-OCR on a consumer RTX 3060 or even on edge devices. The 50% throughput boost from Multi-Token Prediction makes it viable for batch processing pipelines that handle thousands of pages per day without dedicated GPU clusters.
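The 2–4 GB figure checks out with back-of-envelope math. The 2× headroom multiplier for activations and KV cache below is an assumption, not a measured number:

```python
# Rough FP16 memory estimate behind the "2-4 GB VRAM" claim.
# Weights alone cost params * 2 bytes; activations, KV cache, and
# framework overhead add headroom on top (assumed ~2x here).

PARAMS = 0.9e9
BYTES_FP16 = 2

weights_gb = PARAMS * BYTES_FP16 / 1024**3
print(f"weights: {weights_gb:.2f} GB")             # ~1.68 GB
print(f"with ~2x headroom: {weights_gb * 2:.1f} GB")
```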

How to Deploy GLM-OCR

GLM-OCR supports three deployment paths: self-hosted with vLLM/SGLang/Ollama, the official SDK pipeline, or the Z.ai cloud API.

Quick Start with Ollama

```bash
ollama run glm-ocr
```

Self-Hosted with vLLM

```bash
pip install git+https://github.com/huggingface/transformers.git
vllm serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080
```

Python SDK (Recommended for Document Parsing)

```bash
# Install SDK — choose your deployment mode
pip install glmocr                # Cloud API
pip install "glmocr[selfhosted]"  # Self-hosted with layout detection
pip install "glmocr[server]"      # Flask service support
```
```python
from zai import ZaiClient

client = ZaiClient(api_key="your-api-key")

response = client.layout_parsing.create(
    model="glm-ocr",
    file="https://example.com/invoice.pdf"
)
print(response)  # Structured Markdown + JSON output
```

Direct Model Inference with Transformers

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "zai-org/GLM-OCR", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "document.png"},
        {"type": "text", "text": "Text Recognition:"}
    ]
}]

inputs = processor.apply_chat_template(
    messages, tokenize=True,
    add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
inputs.pop("token_type_ids", None)  # not consumed by the model

output = model.generate(**inputs, max_new_tokens=8192)
print(processor.decode(
    output[0][inputs["input_ids"].shape[1]:],  # strip the prompt tokens
    skip_special_tokens=False
))
```

The SDK is recommended over raw model inference for document parsing because it integrates PP-DocLayout-V3 automatically — without it, you lose the two-stage pipeline that prevents hallucinations on complex layouts.

What Are GLM-OCR’s Limitations?

No model is perfect, and GLM-OCR has specific gaps worth understanding before you commit to it in production.

Handwritten document KIE: GLM-OCR scores 86.1 vs. Gemini 3 Pro’s 94.5 on Handwritten-KIE. If handwritten forms are your primary workload, Gemini still has a significant edge.

Complex scientific tables: On PubTabNet — which tests dense, formatting-heavy tables from academic papers — GLM-OCR (85.2) trails MinerU 2.5 (88.4) and Gemini 3 Pro (91.4).

Language coverage: The model officially supports 8 languages (Chinese, English, French, Spanish, Russian, German, Japanese, Korean), with strongest performance on Chinese and English. Performance is uneven on non-Latin scripts like Arabic or Hindi.

No reasoning capability: Unlike general VLMs, GLM-OCR cannot answer questions about document content, reason across multiple pages, or do anything beyond text extraction and structured output. It is a specialist, not a generalist.

Two-stage error propagation: If PP-DocLayout-V3 misdetects layout regions, downstream recognition quality degrades. Complex layouts with irregular multi-column structures or cross-page dependencies can still cause issues.

Benchmark saturation: As LlamaIndex has noted, OmniDocBench V1.5 covers only 1,355 pages across 9 document types. Scoring #1 here is meaningful but may not fully represent performance on the long tail of real-world document edge cases.

GLM-OCR vs Gemini 3 Pro: What Actually Wins?

The comparison is not straightforward because these models solve fundamentally different problems. GLM-OCR is a 0.9B specialist — it does exactly one thing and does it better than anything else at its price point. Gemini 3 Pro is a massive general-purpose VLM that also happens to do OCR quite well.

GLM-OCR wins on overall document parsing accuracy (94.62 vs. 90.33), text recognition, table parsing, and price ($0.03/M tokens vs. Gemini’s significantly higher API costs). Gemini wins on handwritten KIE (94.5 vs. 86.1), scientific table parsing (PubTabNet 91.4 vs. 85.2), and general document reasoning capabilities that GLM-OCR simply doesn’t have.

The practical decision: if your pipeline processes structured documents (invoices, contracts, receipts, code docs) at scale, GLM-OCR gives you better accuracy at a fraction of the cost. If you need to understand and reason about document content — answering questions, cross-referencing, summarizing — you need a general VLM like Gemini or a capable LLM downstream.

What Does GLM-OCR Mean for the OCR Industry?

GLM-OCR represents a broader trend in AI: purpose-built small models outperforming general-purpose giants on specific tasks. The traditional OCR pipeline (detect → recognize → post-process) is being replaced by end-to-end VLMs that see the entire document at once. And the cost advantage is staggering — self-hosted GLM-OCR processes documents at roughly $0.09 per 1,000 pages compared to $15+ for GPT-4o.
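The per-page economics are easy to sanity-check. At the quoted $0.03 per million tokens, the $0.09-per-1,000-pages figure implies roughly 3,000 tokens per page; that tokens-per-page number is an inference from the two quoted figures, not a published spec:

```python
# Sanity-checking the "$0.09 per 1,000 pages" figure against the quoted
# API price. TOKENS_PER_PAGE is an assumed average that makes the two
# published numbers consistent.

PRICE_PER_M_TOKENS = 0.03
TOKENS_PER_PAGE = 3_000   # assumed average for dense documents

def cost_per_1k_pages(tokens_per_page=TOKENS_PER_PAGE):
    return 1_000 * tokens_per_page / 1e6 * PRICE_PER_M_TOKENS

print(f"${cost_per_1k_pages():.2f} per 1,000 pages")  # $0.09
```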

Under the EU AI Act framework taking effect in 2026, document processing systems handling personal data (IDs, medical records, financial documents) face specific transparency requirements. An open-source model like GLM-OCR — where you control the data pipeline, can audit the model weights, and avoid sending documents to third-party APIs — has compliance advantages that closed-source alternatives cannot match.

The OCR space is moving fast. PaddleOCR-VL-1.5 is nearly tied with GLM-OCR on overall accuracy. DeepSeek-OCR2 focuses on token efficiency (processing 200K pages per day on a single A100). The next generation of benchmarks — as OmniDocBench V1.5 approaches saturation — will likely test more complex, domain-specific documents where current models still fail.

FAQ

What is GLM-OCR?

GLM-OCR is a 0.9-billion-parameter multimodal OCR model developed by Zhipu AI (Z.ai) for document understanding. It combines a CogViT vision encoder with a GLM language decoder to extract structured text (Markdown, JSON, LaTeX) from document images, tables, formulas, and handwritten text. It is open-source under the MIT License.

How does GLM-OCR compare to Gemini 3 Pro on OCR benchmarks?

On OmniDocBench V1.5, GLM-OCR scores 94.62 versus Gemini 3 Pro’s 90.33 — a 4.29-point lead in overall document parsing accuracy. GLM-OCR also beats Gemini on text recognition (94.0 vs. 91.9 on OCRBench) and table parsing (86.0 vs. 81.8 on TEDS_TEST). However, Gemini 3 Pro outperforms GLM-OCR on handwritten KIE (94.5 vs. 86.1) and PubTabNet (91.4 vs. 85.2).

How many parameters does GLM-OCR have?

GLM-OCR has 0.9 billion parameters total: a 400-million-parameter CogViT visual encoder and a 500-million-parameter GLM language decoder. This makes it 260× smaller than Qwen3-VL-235B, which it outperforms on OmniDocBench V1.5.

What is Multi-Token Prediction in GLM-OCR?

Multi-Token Prediction (MTP) is a decoding mechanism where the model predicts multiple tokens per step instead of the standard one-at-a-time approach. GLM-OCR is trained to predict 10 tokens per step and averages 5.2 at inference, delivering approximately 50% throughput improvement. MTP also reduces structural errors in generated HTML and Markdown tags.

How much does the GLM-OCR API cost?

The Z.ai cloud API for GLM-OCR costs $0.03 per million tokens for both input and output. For self-hosted deployment, the 0.9B model runs on consumer GPUs (2–4 GB VRAM at FP16) with vLLM, SGLang, or Ollama, effectively making it free after hardware costs.

What languages does GLM-OCR support?

GLM-OCR officially supports 8 languages: Chinese, English, French, Spanish, Russian, German, Japanese, and Korean. Performance is strongest on Chinese and English. For non-Latin scripts like Arabic or Hindi, you should benchmark specifically on your document types before deploying.

Can GLM-OCR be fine-tuned for custom document types?

Yes. Zhipu AI published a fine-tuning guide based on LLaMA-Factory in March 2026. You can adapt GLM-OCR to domain-specific documents — medical records, legal contracts, industry-specific forms — without training from scratch.

Bibliography

Z.ai. (2026). GLM-OCR [arXiv Technical Report]. arXiv. https://arxiv.org/abs/2603.10910

Z.ai. (2026). GLM-OCR Model Page. Hugging Face. https://huggingface.co/zai-org/GLM-OCR

Z.ai. (2026). GLM-OCR [Open source repository]. GitHub. https://github.com/zai-org/GLM-OCR

Z.ai. (2026). GLM-OCR Developer Guide. Z.ai Documentation. https://docs.z.ai/guides/vlm/glm-ocr

Ollama. (2026). GLM-OCR Library. Ollama. https://ollama.com/library/glm-ocr

OpenDataLab. (2025). OmniDocBench Benchmark. GitHub. https://github.com/opendatalab/OmniDocBench

LlamaIndex. (2026). OmniDocBench Saturation Analysis: What’s Next for OCR Benchmarks. LlamaIndex Blog. https://www.llamaindex.ai/blog/omnidocbench-is-saturated-what-s-next-for-ocr-benchmarks

European Commission. (2026). Regulatory Framework for Artificial Intelligence (EU AI Act). Europa.eu. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai
