GLM-OCR is a 0.9-billion-parameter multimodal OCR model by Zhipu AI that scored 94.62 on OmniDocBench V1.5 — the highest of any model, open or closed. It outperforms Gemini 3 Pro (90.33), GPT-5.2 (85.4), and Qwen3-VL-235B (89.15) on document parsing despite being 260× smaller than the largest competitor. The model combines a CogViT vision encoder with a GLM language decoder and uses Multi-Token Prediction to process 1.86 PDF pages per second at just $0.03 per million tokens.
In early 2026, a model small enough to fit on a consumer GPU quietly topped the most respected document understanding benchmark in the industry. GLM-OCR, released by Zhipu AI (Z.ai), achieved a score of 94.62 on OmniDocBench V1.5 with just 0.9 billion parameters — surpassing not only every open-source competitor but also closed-source giants like Gemini 3 Pro and GPT-5.2. Within its first month, the model was downloaded over 3 million times from Hugging Face.
This article breaks down the architecture behind GLM-OCR, its benchmark results with full comparisons across 7 competing models, real-world performance data, and a practical deployment guide. If you work with document parsing pipelines — whether invoices, contracts, scientific papers, or code documentation — this is the model to evaluate.
What Is GLM-OCR and Why Does It Matter?
GLM-OCR is a multimodal vision-language model built specifically for document understanding. Unlike general-purpose LLMs that treat OCR as one of dozens of capabilities, every architectural decision in GLM-OCR — from the vision encoder to the decoding strategy — is optimized exclusively for extracting and structuring text from documents.
The model was developed by Zhipu AI, the Beijing-based lab behind the GLM family of transformer models. It was published on March 11, 2026, as an open-source project under the MIT License (with the layout analysis component under Apache 2.0), accompanied by a technical report on arXiv (2603.10910).
The practical significance is clear: most high-accuracy document parsers before GLM-OCR required either expensive closed-source APIs (Gemini, GPT) or heavyweight models demanding serious GPU infrastructure. GLM-OCR delivers superior accuracy on a model you can self-host on a single GPU with 4 GB VRAM, at an API price of $0.03 per million tokens.
How Does the GLM-OCR Architecture Work?
GLM-OCR follows a vision-language encoder-decoder design, but with three key innovations that set it apart from general-purpose VLMs.
Core Components
The model consists of two main blocks. The CogViT visual encoder (400M parameters) extracts high-level visual representations from document images. It was pre-trained on tens of billions of image-text pairs using a dual MIM + CLIP objective, with additional knowledge distillation from a larger in-house ViT. The GLM language decoder (500M parameters) generates structured textual outputs — Markdown, JSON, LaTeX — conditioned on the visual embeddings passed through a lightweight cross-modal connector with efficient token downsampling.
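At a shape level, the flow is: patch embeddings from the encoder, token downsampling in the connector, then concatenation with the text prompt for the decoder. The sketch below uses made-up dimensions (patch grid, embedding width, downsampling factor); none of these values are published.

```python
# Shape-level sketch of the CogViT -> connector -> GLM decoder flow.
# Every dimension here (patch grid, embedding width, downsampling factor)
# is made up for illustration; the real sizes are not published.

def vision_encoder(patches):
    """CogViT stand-in: one 1024-d embedding per image patch."""
    return [[0.0] * 1024 for _ in patches]

def connector(vision_tokens, downsample=4):
    """Cross-modal connector stand-in: token downsampling before the decoder."""
    return vision_tokens[::downsample]  # keep every 4th token

def decoder_sequence_length(vision_tokens, prompt_tokens):
    """The decoder conditions on visual tokens prepended to the text prompt."""
    return len(vision_tokens) + len(prompt_tokens)

patch_grid = range(32 * 32)                     # e.g. a 32x32 patch grid
visual = connector(vision_encoder(patch_grid))  # 1024 patches -> 256 tokens
print(decoder_sequence_length(visual, prompt_tokens=[1, 2, 3]))  # 259
```

The point of the downsampling step is that a full patch grid would dominate the decoder's context; shrinking the visual token count is what keeps a 500M decoder viable.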
Multi-Token Prediction (MTP)
Standard autoregressive decoding generates one token at a time. For OCR tasks — which are inherently deterministic with strong local dependencies — this is wasteful. GLM-OCR introduces Multi-Token Prediction: the model is trained to predict k = 10 tokens per step using shared-parameter auxiliary heads. At inference time, it generates an average of 5.2 tokens per decoding step, yielding approximately 50% throughput improvement. MTP also improves structural coherence, producing fewer broken HTML/Markdown tags in table and formula outputs.
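The decoding loop can be illustrated with a toy draft-and-verify sketch. This is a generic speculative-decoding scheme in the spirit of MTP, not the model's actual implementation; `draft_k_tokens`, `mtp_decode`, and the token table are all illustrative.

```python
# Toy draft-and-verify loop in the spirit of multi-token prediction:
# cheap draft heads propose a block of tokens, the main head keeps the
# longest verified prefix. The deterministic toy "model" is for
# illustration only.

def draft_k_tokens(predict, context, k):
    """Stand-in for the auxiliary heads: greedily propose k tokens."""
    ctx, proposed = list(context), []
    for _ in range(k):
        tok = predict(ctx)
        proposed.append(tok)
        ctx.append(tok)
    return proposed

def mtp_decode(predict, verify, prompt, k=10, max_tokens=40):
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        draft = draft_k_tokens(predict, out, k)
        accepted = []
        for tok in draft:                   # keep the verified prefix
            if verify(out + accepted) == tok:
                accepted.append(tok)
            else:
                break
        if not accepted:                    # always commit at least 1 token
            accepted = [verify(out)]
        out.extend(accepted)
    return out[len(prompt):]

# On predictable text the draft heads agree with the verifier, so a whole
# block is committed per step instead of a single token.
table = {"a": "b", "b": "c", "c": "a"}
next_tok = lambda ctx: table[ctx[-1]]
print(mtp_decode(next_tok, next_tok, ["a"], k=5, max_tokens=5))  # → ['b', 'c', 'a', 'b', 'c']
```

OCR output is exactly this kind of predictable text, which is why the reported average of 5.2 committed tokens per step is plausible for a 10-token draft.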
Two-Stage Pipeline
Rather than feeding entire complex documents to the model at once (which causes hallucinations in small models), GLM-OCR uses a two-stage approach. PP-DocLayout-V3 first performs layout analysis, decomposing pages into semantically coherent regions (paragraphs, tables, formulas, headers). Each region is then independently processed by the GLM-OCR core in parallel. A merge-and-post-process module restores reading order and produces the final structured output. This modular design significantly reduces hallucination risk and enables parallel processing.
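The three steps can be sketched as follows. Everything here is a stand-in: `detect_regions` and `recognize_region` are hypothetical placeholders for PP-DocLayout-V3 and the GLM-OCR core, not real API calls.

```python
# Illustrative sketch of the two-stage pipeline: layout analysis, parallel
# per-region recognition, then a merge that restores reading order.
from concurrent.futures import ThreadPoolExecutor

def detect_regions(page):
    """Stage 1 stand-in: decompose a page into typed, ordered regions."""
    return [
        {"order": 0, "type": "header", "crop": f"{page}:header"},
        {"order": 1, "type": "paragraph", "crop": f"{page}:para"},
        {"order": 2, "type": "table", "crop": f"{page}:table"},
    ]

def recognize_region(region):
    """Stage 2 stand-in: each region is recognized independently."""
    return {**region, "text": f"<{region['type']}>{region['crop']}</{region['type']}>"}

def parse_page(page):
    regions = detect_regions(page)
    with ThreadPoolExecutor() as pool:        # regions processed in parallel
        results = list(pool.map(recognize_region, regions))
    results.sort(key=lambda r: r["order"])    # merge: restore reading order
    return "\n".join(r["text"] for r in results)

print(parse_page("page-1"))
```

Because each region is a small, self-contained crop, the core model never sees a page-sized context, which is the design choice that limits hallucination in a 0.9B model.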
4-Stage Training Recipe
GLM-OCR uses a progressive training pipeline. Stage 1 trains the CogViT encoder on billions of image-text pairs with MIM + CLIP + distillation. Stage 2 performs vision-language pretraining by attaching the GLM decoder and jointly training on document parsing, grounding, and VQA data, then introduces MTP. Stage 3 is supervised fine-tuning on curated OCR datasets (text, formula, table, KIE). Stage 4 applies reinforcement learning across all tasks to improve accuracy and structural consistency. This is one of the first OCR models to use RL at scale — a technique borrowed from the latest generation of reasoning-focused LLMs.
How Does GLM-OCR Perform on Benchmarks?
The headline number is 94.62 on OmniDocBench V1.5 — the most widely cited benchmark for document parsing in 2026. But the details below the headline are where it gets interesting.
Public Benchmark Comparison (7 Models)
| Benchmark | GLM-OCR (0.9B) | PaddleOCR-VL-1.5 (0.9B) | DeepSeek-OCR2 (3B) | MinerU 2.5 (1.2B) | dots.ocr (3B) | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|---|---|
| OmniDocBench V1.5 | 94.62 | 94.50 | 91.1 | 90.7 | 88.4 | 90.33 | 85.4 |
| OCRBench (Text) | 94.0 | 75.3 | 34.7 | 75.3 | 92.1 | 91.9 | 83.7 |
| UniMERNet (Formula) | 96.5 | 96.1 | 85.8 | 96.4 | 90.0 | 96.4 | 90.5 |
| PubTabNet (Table) | 85.2 | 84.6 | — | 88.4 | 71.0 | 91.4 | 84.4 |
| TEDS_TEST (Table) | 86.0 | 83.3 | — | 85.4 | 62.4 | 81.8 | 67.6 |
| Nanonets-KIE | 93.7 | — | — | — | — | 95.2 | 87.5 |
| Handwritten-KIE | 86.1 | — | — | — | — | 94.5 | 78.2 |
Gemini 3 Pro and GPT-5.2 results are listed for reference only (closed-source, excluded from the official open-source ranking). Source: arXiv 2603.10910, Tables 3–4.
GLM-OCR leads open-source models on every public benchmark except one: PubTabNet (complex scientific tables), where MinerU 2.5 scores 88.4 vs. 85.2. On text recognition (OCRBench), it beats PaddleOCR-VL-1.5 by nearly 19 points.
OmniDocBench V1.5 — Detailed Breakdown
The overall score is a composite of four sub-metrics. Here is where GLM-OCR wins and where it doesn’t:
| Sub-metric | GLM-OCR | PaddleOCR-VL-1.5 | Gemini 3 Pro | Qwen3-VL-235B | Winner |
|---|---|---|---|---|---|
| Overall | 94.62 | 94.50 | 90.33 | 89.15 | GLM-OCR |
| TextEdit (↓ better) | 0.040 | 0.035 | — | — | PaddleOCR |
| FormulaCDM (↑) | 93.90 | 94.21 | — | — | PaddleOCR |
| TableTEDS (↑) | 93.96 | — | — | — | GLM-OCR |
| TableTEDS-S (↑) | 96.39 | — | — | — | GLM-OCR |
PaddleOCR-VL-1.5 has a slight edge in raw text accuracy (TextEdit) and formula recognition (FormulaCDM). But GLM-OCR dominates table parsing — the single hardest component in document understanding — with the best TableTEDS and TableTEDS-S scores among all evaluated models. This is what pushes its overall score to #1.
How Does GLM-OCR Perform in Real-World Scenarios?
Benchmarks are one thing. Production documents — with stamps, handwriting, receipts, and code — are another. Zhipu AI’s in-house evaluation tested 6 scenarios that reflect real deployment conditions:
| Scenario | GLM-OCR | PaddleOCR-VL-1.5 | dots.ocr | Gemini 3 Pro | GPT-5.2 |
|---|---|---|---|---|---|
| Code Documents | 84.7 | — | — | — | — |
| Real-world Tables | 91.5 | — | — | — | — |
| Handwritten Text | 87.0 | 87.4 | — | — | — |
| Multilingual | 69.3 | 54.8 | — | — | — |
| Seal Recognition | 90.5 | — | 63.0 | 91.3 | — |
| Receipt KIE | 94.5 | — | — | — | 83.5 |
The standout result is seal recognition: GLM-OCR scores 90.5, nearly matching Gemini 3 Pro (91.3) and crushing the next open-source model (dots.ocr at 63.0). For receipt processing — a common production workload — it scores 94.5, easily beating GPT-5.2’s 83.5. The only scenario where PaddleOCR holds a marginal lead is handwritten text (87.4 vs. 87.0).
How Fast Is GLM-OCR Compared to Other Models?
Under identical hardware conditions (single replica, single concurrency), GLM-OCR processes PDF documents at 1.86 pages per second and individual images at 0.67 images per second. On the cost side, the cloud API is priced at $0.03 per million tokens for both input and output — substantially cheaper than any comparable commercial offering.
A 0.9B model needs roughly 2–4 GB VRAM at FP16. That means you can run GLM-OCR on a consumer RTX 3060 or even on edge devices. The 50% throughput boost from Multi-Token Prediction makes it viable for batch processing pipelines that handle thousands of pages per day without dedicated GPU clusters.
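The VRAM claim follows from back-of-envelope arithmetic: 0.9B parameters at 2 bytes each in FP16 is about 1.7 GiB of weights, with activations and KV cache pushing the total into the quoted 2–4 GB range.

```python
# Back-of-envelope check of the VRAM and throughput figures.
GIB = 1024 ** 3
params = 0.9e9
weights_gib = params * 2 / GIB          # 2 bytes per parameter at FP16
print(f"FP16 weights: {weights_gib:.2f} GiB")   # ~1.68 GiB before overhead

pages_per_day = 1.86 * 86_400           # reported PDF throughput, sustained
print(f"Pages per day: {pages_per_day:,.0f}")   # ~160,704
```

Even a fraction of that theoretical daily throughput covers most single-tenant document pipelines on one consumer GPU.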
How to Deploy GLM-OCR
GLM-OCR supports three deployment paths: self-hosted with vLLM/SGLang/Ollama, the official SDK pipeline, or the Z.ai cloud API.
Quick Start with Ollama
```shell
ollama run glm-ocr
```
Self-Hosted with vLLM
```shell
pip install git+https://github.com/huggingface/transformers.git
vllm serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080
```
Python SDK (Recommended for Document Parsing)
```shell
# Install SDK — choose your deployment mode
pip install glmocr                  # Cloud API
pip install "glmocr[selfhosted]"    # Self-hosted with layout detection
pip install "glmocr[server]"        # Flask service support
```

```python
from zai import ZaiClient

client = ZaiClient(api_key="your-api-key")
response = client.layout_parsing.create(
    model="glm-ocr",
    file="https://example.com/invoice.pdf",
)
print(response)  # Structured Markdown + JSON output
```
Direct Model Inference with Transformers
```python
from transformers import AutoProcessor, AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "zai-org/GLM-OCR", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("zai-org/GLM-OCR")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "document.png"},
        {"type": "text", "text": "Text Recognition:"},
    ],
}]

inputs = processor.apply_chat_template(
    messages, tokenize=True,
    add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
inputs.pop("token_type_ids", None)

output = model.generate(**inputs, max_new_tokens=8192)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=False))
```
The SDK is recommended over raw model inference for document parsing because it integrates PP-DocLayout-V3 automatically — without it, you lose the two-stage pipeline that prevents hallucinations on complex layouts.
What Are GLM-OCR’s Limitations?
No model is perfect, and GLM-OCR has specific gaps worth understanding before you commit to it in production.
Handwritten document KIE: GLM-OCR scores 86.1 vs. Gemini 3 Pro’s 94.5 on Handwritten-KIE. If handwritten forms are your primary workload, Gemini still has a significant edge.
Complex scientific tables: On PubTabNet — which tests dense, formatting-heavy tables from academic papers — GLM-OCR (85.2) trails MinerU 2.5 (88.4) and Gemini 3 Pro (91.4).
Language coverage: The model officially supports 8 languages (Chinese, English, French, Spanish, Russian, German, Japanese, Korean), with strongest performance on Chinese and English. Performance is uneven on non-Latin scripts like Arabic or Hindi.
No reasoning capability: Unlike general VLMs, GLM-OCR cannot answer questions about document content, reason across multiple pages, or do anything beyond text extraction and structured output. It is a specialist, not a generalist.
Two-stage error propagation: If PP-DocLayout-V3 misdetects layout regions, downstream recognition quality degrades. Complex layouts with irregular multi-column structures or cross-page dependencies can still cause issues.
Benchmark saturation: As LlamaIndex has noted, OmniDocBench V1.5 covers only 1,355 pages across 9 document types. Scoring #1 here is meaningful but may not fully represent performance on the long tail of real-world document edge cases.
GLM-OCR vs Gemini 3 Pro: What Actually Wins?
The comparison is not straightforward because these models solve fundamentally different problems. GLM-OCR is a 0.9B specialist — it does exactly one thing and does it better than anything else at its price point. Gemini 3 Pro is a massive general-purpose VLM that also happens to do OCR quite well.
GLM-OCR wins on overall document parsing accuracy (94.62 vs. 90.33), text recognition, table parsing, and price ($0.03/M tokens vs. Gemini’s significantly higher API costs). Gemini wins on handwritten KIE (94.5 vs. 86.1), scientific table parsing (PubTabNet 91.4 vs. 85.2), and general document reasoning capabilities that GLM-OCR simply doesn’t have.
The practical decision: if your pipeline processes structured documents (invoices, contracts, receipts, code docs) at scale, GLM-OCR gives you better accuracy at a fraction of the cost. If you need to understand and reason about document content — answering questions, cross-referencing, summarizing — you need a general VLM like Gemini or a capable LLM downstream.
What Does GLM-OCR Mean for the OCR Industry?
GLM-OCR represents a broader trend in AI: purpose-built small models outperforming general-purpose giants on specific tasks. The traditional OCR pipeline (detect → recognize → post-process) is being replaced by end-to-end VLMs that see the entire document at once. And the cost advantage is staggering — self-hosted GLM-OCR processes documents at roughly $0.09 per 1,000 pages compared to $15+ for GPT-4o.
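The $0.09 figure is consistent with the token pricing if you assume roughly 1,500 tokens per page on each of the input and output sides (an illustrative assumption, not a number from the report):

```python
# Rough cost model behind the per-page figure. The tokens-per-page numbers
# are assumptions for illustration, not values from the technical report.
price_per_million = 0.03       # $ per 1M tokens, input and output alike
tokens_in_per_page = 1_500     # assumed image/prompt tokens
tokens_out_per_page = 1_500    # assumed structured-output tokens

tokens_per_1k_pages = (tokens_in_per_page + tokens_out_per_page) * 1_000
cost = tokens_per_1k_pages * price_per_million / 1_000_000
print(f"${cost:.2f} per 1,000 pages")  # $0.09
```

Real per-page token counts vary with page density and output format, so treat this as an order-of-magnitude check rather than a quote.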
Under the EU AI Act framework taking effect in 2026, document processing systems handling personal data (IDs, medical records, financial documents) face specific transparency requirements. An open-source model like GLM-OCR — where you control the data pipeline, can audit the model weights, and avoid sending documents to third-party APIs — has compliance advantages that closed-source alternatives cannot match.
The OCR space is moving fast. PaddleOCR-VL-1.5 is nearly tied with GLM-OCR on overall accuracy. DeepSeek-OCR2 focuses on token efficiency (processing 200K pages per day on a single A100). The next generation of benchmarks — as OmniDocBench V1.5 approaches saturation — will likely test more complex, domain-specific documents where current models still fail.
FAQ
What is GLM-OCR?
GLM-OCR is a 0.9-billion-parameter multimodal OCR model developed by Zhipu AI (Z.ai) for document understanding. It combines a CogViT vision encoder with a GLM language decoder to extract structured text (Markdown, JSON, LaTeX) from document images, tables, formulas, and handwritten text. It is open-source under the MIT License.
How does GLM-OCR compare to Gemini 3 Pro on OCR benchmarks?
On OmniDocBench V1.5, GLM-OCR scores 94.62 versus Gemini 3 Pro’s 90.33 — a 4.29-point lead in overall document parsing accuracy. GLM-OCR also beats Gemini on text recognition (94.0 vs. 91.9 on OCRBench) and table parsing (86.0 vs. 81.8 on TEDS_TEST). However, Gemini 3 Pro outperforms GLM-OCR on handwritten KIE (94.5 vs. 86.1) and PubTabNet (91.4 vs. 85.2).
How many parameters does GLM-OCR have?
GLM-OCR has 0.9 billion parameters total: a 400-million-parameter CogViT visual encoder and a 500-million-parameter GLM language decoder. This makes it 260× smaller than Qwen3-VL-235B, which it outperforms on OmniDocBench V1.5.
What is Multi-Token Prediction in GLM-OCR?
Multi-Token Prediction (MTP) is a decoding mechanism where the model predicts multiple tokens per step instead of the standard one-at-a-time approach. GLM-OCR is trained to predict 10 tokens per step and averages 5.2 at inference, delivering approximately 50% throughput improvement. MTP also reduces structural errors in generated HTML and Markdown tags.
How much does the GLM-OCR API cost?
The Z.ai cloud API for GLM-OCR costs $0.03 per million tokens for both input and output. For self-hosted deployment, the 0.9B model runs on consumer GPUs (2–4 GB VRAM at FP16) with vLLM, SGLang, or Ollama, effectively making it free after hardware costs.
What languages does GLM-OCR support?
GLM-OCR officially supports 8 languages: Chinese, English, French, Spanish, Russian, German, Japanese, and Korean. Performance is strongest on Chinese and English. For non-Latin scripts like Arabic or Hindi, you should benchmark specifically on your document types before deploying.
Can GLM-OCR be fine-tuned for custom document types?
Yes. Zhipu AI published a fine-tuning guide based on LLaMA-Factory in March 2026. You can adapt GLM-OCR to domain-specific documents — medical records, legal contracts, industry-specific forms — without training from scratch.
Bibliography
Z.ai. (2026). GLM-OCR [arXiv Technical Report]. arXiv. https://arxiv.org/abs/2603.10910
Z.ai. (2026). GLM-OCR Model Page. Hugging Face. https://huggingface.co/zai-org/GLM-OCR
Z.ai. (2026). GLM-OCR [Open source repository]. GitHub. https://github.com/zai-org/GLM-OCR
Z.ai. (2026). GLM-OCR Developer Guide. Z.ai Documentation. https://docs.z.ai/guides/vlm/glm-ocr
Ollama. (2026). GLM-OCR Library. Ollama. https://ollama.com/library/glm-ocr
OpenDataLab. (2025). OmniDocBench Benchmark. GitHub. https://github.com/opendatalab/OmniDocBench
LlamaIndex. (2026). OmniDocBench Saturation Analysis: What’s Next for OCR Benchmarks. LlamaIndex Blog. https://www.llamaindex.ai/blog/omnidocbench-is-saturated-what-s-next-for-ocr-benchmarks
European Commission. (2026). Regulatory Framework for Artificial Intelligence (EU AI Act). Europa.eu. https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai