Model · Baidu (ERNIE) · published Mar 18, 2026 · seen 5d

baidu/Qianfan-OCR

published Mar 18, 2026 · seen 5d · captured 9h · http 200 · method plain · task image-text-to-text · license apache-2.0 · library transformers · params 4.7B · downloads 175k · likes 1.2k

Introduction

Qianfan-OCR is a 4B-parameter end-to-end document intelligence model developed by the Baidu Qianfan Team. It unifies document parsing, layout analysis, and document understanding within a single vision-language architecture.

Unlike traditional multi-stage OCR pipelines that chain separate layout detection, text recognition, and language comprehension modules, Qianfan-OCR performs direct image-to-Markdown conversion and supports a broad range of prompt-driven tasks — from structured document parsing and table extraction to chart understanding, document question answering, and key information extraction — all within one model.

Key Highlights

  • 🏆 #1 End-to-End Model on OmniDocBench v1.5 — Achieves 93.12 overall score, surpassing DeepSeek-OCR-v2 (91.09), Gemini-3 Pro (90.33), and all other end-to-end models
  • 🏆 #1 End-to-End Model on OlmOCR Bench — Scores 79.8
  • 🏆 #1 on Key Information Extraction — Overall mean score of 87.9 across five public KIE benchmarks, surpassing Gemini-3.1-Pro, Gemini-3-Pro, Seed-2.0, and Qwen3-VL-235B-A22B
  • 🧠 Layout-as-Thought — An innovative optional thinking phase that recovers explicit layout analysis within the end-to-end paradigm via ⟨think⟩ tokens
  • 🌍 192 Languages — Multilingual OCR support across diverse scripts
  • Efficient Deployment — Achieves 1.024 PPS (pages per second) with W8A8 quantization on a single A100 GPU

Architecture

Qianfan-OCR adopts the multimodal bridging architecture from Qianfan-VL, consisting of three core components:

| Component | Details |
|---|---|
| Vision Encoder | Qianfan-ViT, 24 Transformer layers, AnyResolution design (up to 4K), 256 visual tokens per 448×448 tile, max 4,096 tokens per image |
| Language Model | Qwen3-4B (3.6B non-embedding), 36 layers, 2560 hidden dim, GQA (32 query / 8 KV heads), 32K context (extendable to 131K) |
| Cross-Modal Adapter | 2-layer MLP with GELU activation, projecting from 1024-dim to 2560-dim |
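The vision-encoder figures imply a simple token budget: at 256 tokens per 448×448 tile and a cap of 4,096 tokens, an image can contribute at most 16 tiles. The sketch below estimates the visual token count for a given image size. It assumes a plain ceiling-based grid with a hard cap; the actual AnyResolution tiling policy (resizing, aspect-ratio selection, any thumbnail tile) is not described here, so `estimate_visual_tokens` is a hypothetical helper, not the model's preprocessor.

```python
import math

TILE_SIZE = 448          # tile edge in pixels (from the architecture table)
TOKENS_PER_TILE = 256    # visual tokens per tile (from the architecture table)
MAX_TOKENS = 4096        # per-image cap (from the architecture table)
MAX_TILES = MAX_TOKENS // TOKENS_PER_TILE  # 16 tiles at most


def estimate_visual_tokens(width: int, height: int) -> int:
    """Rough estimate of visual tokens for an image of the given pixel size.

    Assumption: the image is covered by a ceil-based grid of 448x448 tiles
    and the tile count is clamped to the per-image cap.
    """
    cols = math.ceil(width / TILE_SIZE)
    rows = math.ceil(height / TILE_SIZE)
    tiles = min(cols * rows, MAX_TILES)
    return tiles * TOKENS_PER_TILE


if __name__ == "__main__":
    for size in [(448, 448), (896, 1344), (4000, 4000)]:
        print(size, estimate_visual_tokens(*size))
```

Under these assumptions, any image needing more than 16 tiles saturates at 4,096 tokens, which is a small share of the 32K context listed for the language model.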

Layout-as-Thought

A key innovation is Layout-as-Thought: an optional thinking phase triggered by ⟨think⟩ tokens, where the model generates structured layout representations (bounding boxes, element types, reading order) before producing final outputs.

This mechanism serves two purposes:

1. Functional: Recovers layout analysis capability within the end-to-end paradigm, so users obtain structured layout results directly.
2. Enhancement: Provides targeted accuracy improvements on documents with complex layouts, cluttered elements, or non-standard reading orders.

> When to use: Enable thinking for heterogeneous pages with mixed element types (exam papers, technical reports, newspapers). Disable for homogeneous documents (single-column text, simple forms) for better results and lower latency.
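When thinking is enabled, downstream code usually needs to separate the layout reasoning from the final output. The sketch below shows one way to do that. It assumes the thinking phase is delimited by literal `<think>` and `</think>` strings in the decoded text; the actual token strings and the layout format are not specified here, so check the tokenizer before relying on this. `split_thinking` and `ParsedOutput` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class ParsedOutput(NamedTuple):
    layout: Optional[str]  # text emitted inside the thinking phase, if any
    answer: str            # final output outside the thinking phase


# Assumed delimiters; the real special-token strings may differ.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> ParsedOutput:
    """Split decoded model text into (layout reasoning, final answer)."""
    match = THINK_RE.search(raw)
    if match is None:
        # Thinking disabled or no delimiters found: everything is the answer.
        return ParsedOutput(layout=None, answer=raw.strip())
    layout = match.group(1).strip()
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return ParsedOutput(layout=layout, answer=answer)
```

Returning `layout=None` when no delimiters are present lets the same code path handle both thinking and non-thinking requests.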

Benchmark Results

OmniDocBench v1.5 (Document Parsing)

| Model | Type | Overall ↑ | Text Edit ↓ | Formula CDM ↑ | Table TEDS ↑ | Table TEDS-S ↑ | R-order Edit ↓ |
|---|---|---|---|---|---|---|---|
| Qianfan-OCR (Ours) | End-to-end | 93.12 | 0.041 | 92.43 | 91.02 | 93.85 | 0.049 |
| DeepSeek-OCR-v2 | End-to-end | 91.09 | 0.048 | 90.31 | 87.75 | 92.06 | 0.057 |
| Gemini-3 Pro | End-to-end | 90.33 | 0.065 | 89.18 | 88.28 | 90.29 | 0.071 |
| Qwen3-VL-235B | End-to-end | 89.15 | 0.069 | 88.14 | 86.21 | 90.55 | 0.068 |
| dots.ocr | End-to-end | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| PaddleOCR-VL 1.5 | Pipeline | 94.50 | 0.035 | 94.21 | 92.76 | 95.79 | 0.042 |
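The Overall column is numerically consistent with the mean of three terms: (1 − Text Edit) × 100, Formula CDM, and Table TEDS. That formula is inferred from the numbers in the table rather than stated in this card, so treat it as an assumption. The sketch below recomputes Overall for each row; small differences of about 0.01 are expected because the inputs are already rounded.

```python
# (text_edit, formula_cdm, table_teds, reported_overall), copied from the table
ROWS = {
    "Qianfan-OCR": (0.041, 92.43, 91.02, 93.12),
    "DeepSeek-OCR-v2": (0.048, 90.31, 87.75, 91.09),
    "Gemini-3 Pro": (0.065, 89.18, 88.28, 90.33),
    "Qwen3-VL-235B": (0.069, 88.14, 86.21, 89.15),
    "dots.ocr": (0.048, 83.22, 86.78, 88.41),
    "PaddleOCR-VL 1.5": (0.035, 94.21, 92.76, 94.50),
}


def overall(text_edit: float, formula_cdm: float, table_teds: float) -> float:
    """Assumed aggregate: mean of the text, formula, and table scores on a 0-100 scale."""
    return ((1.0 - text_edit) * 100.0 + formula_cdm + table_teds) / 3.0


if __name__ == "__main__":
    for name, (te, cdm, teds, reported) in ROWS.items():
        print(f"{name}: recomputed {overall(te, cdm, teds):.2f}, reported {reported:.2f}")
```

Note that Table TEDS-S and the reading-order edit distance do not enter this aggregate under the assumed formula.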

General OCR Benchmarks

| Model | OCRBench | OCRBenchv2 (en/zh) | CCOCR-multilan | CCOCR-overall |
|---|---|---|---|---|
| Qianfan-OCR (Ours) | 880 | 56.0 / 60.77 | 76.7 | 79.3 |
| Qwen3-VL-4B | 873 | 60.68 / 59.13 | 74.2 | 76.5 |
| MonkeyOCR | 655 | 21.78 / 38.91 | 43.8 | 35.2 |
| DeepSeek-OCR | 459 | 15.98 / 38.31 | 32.5 | 27.6 |

Document Understanding

| Benchmark | Qianfan-OCR | Qwen3-VL-4B | Qwen3-VL-2B |
|---|---|---|---|
| DocVQA | 92.8 | 94.9 | 92.7 |
| CharXiv_DQ | 94.0 | 81.8 | 69.7 |
| CharXiv_RQ | 85.2 | 48.5 | 41.3 |
| ChartQA | 88.1 | 83.3 | 78.3 |
| ChartQAPro | 42.9 | 36.2 | 24.5 |
| ChartBench | 85.9 | 74.9 | 73.2 |
| TextVQA | 80.0 | 81.8 | 79.9 |
| OCRVQA | 66.8 | 64.7 | 59.3 |

> 💡 Two-stage OCR+LLM systems score 0.0 on CharXiv (both DQ and RQ), demonstrating that chart structures discarded during text extraction are essential for reasoning.

Key Information Extraction (KIE)

| Model | Overall | OCRBench KIE | OCRBenchv2 KIE (en) | OCRBenchv2 KIE (zh) | CCOCR KIE | Nanonets KIE (F1) |
|---|---|---|---|---|---|---|
| Qianfan-OCR (Ours) | 87.9 | 95.0 | 82.8 | 82.3 | 92.8 | 86.5 |
| Qwen3-VL-235B-A22B | 84.2 | 94.0 | 85.6 | 62.9 | 95.1 | 83.8 |
| Qwen3-4B-VL | 83.5 | 89.0 | 82.1 | 71.3 | 91.6 | 83.3 |
| Gemini-3.1-Pro | 79.2 | 96.0 | 87.8 | 63.4 | 72.5 | 76.1 |
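The KIE Overall column is described as a mean across the five benchmarks. A quick check, assuming an unweighted arithmetic mean of the five per-benchmark columns, reproduces the reported values to within 0.1:

```python
from statistics import mean

# Per-benchmark scores in table order, then the reported Overall value.
KIE = {
    "Qianfan-OCR": ([95.0, 82.8, 82.3, 92.8, 86.5], 87.9),
    "Qwen3-VL-235B-A22B": ([94.0, 85.6, 62.9, 95.1, 83.8], 84.2),
    "Qwen3-4B-VL": ([89.0, 82.1, 71.3, 91.6, 83.3], 83.5),
    "Gemini-3.1-Pro": ([96.0, 87.8, 63.4, 72.5, 76.1], 79.2),
}

if __name__ == "__main__":
    for name, (scores, reported) in KIE.items():
        print(f"{name}: mean {mean(scores):.2f}, reported {reported:.1f}")
```

Because the mean is unweighted, the Chinese OCRBenchv2 KIE column accounts for much of the gap: Qianfan-OCR leads that column by 11 points or more while trailing on three of the other four.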

Inference Throughput

| Model | PPS (pages/sec) |
|---|---|
| Qianfan-OCR (W8A8) | 1.024 |
| Qianfan-OCR (W16A16) | 0.503 |
| MinerU 2.5 | 1.057 |
| MonkeyOCR-pro-1.2B | 0.673 |
| Dots OCR | 0.352 |

*All throughput figures were measured on a single NVIDIA A100 GPU with vLLM 0.10.2.*
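To put the pages-per-second figures in practical terms, the sketch below converts them into wall-clock time for a batch of pages and computes the W8A8 speedup over W16A16. It assumes throughput stays constant at the reported rate on one GPU, which ignores warm-up, page complexity, and batching effects; `seconds_for` is a hypothetical helper.

```python
# Reported single-GPU throughput in pages per second (from the table).
PPS = {
    "Qianfan-OCR (W8A8)": 1.024,
    "Qianfan-OCR (W16A16)": 0.503,
    "MinerU 2.5": 1.057,
    "MonkeyOCR-pro-1.2B": 0.673,
    "Dots OCR": 0.352,
}


def seconds_for(pages: int, pps: float) -> float:
    """Estimated wall-clock seconds to process `pages` at a constant rate."""
    return pages / pps


# Ratio of quantized to full-precision throughput for Qianfan-OCR.
w8a8_speedup = PPS["Qianfan-OCR (W8A8)"] / PPS["Qianfan-OCR (W16A16)"]

if __name__ == "__main__":
    print(f"W8A8 speedup over W16A16: {w8a8_speedup:.2f}x")
    for name, pps in PPS.items():
        print(f"{name}: {seconds_for(1000, pps) / 60:.1f} min per 1,000 pages")
```

On these numbers, W8A8 quantization roughly doubles throughput, bringing Qianfan-OCR close to MinerU 2.5.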

Supported Tasks

Qianfan-OCR supports a comprehensive set of document intelligence tasks through prompt-driven control:

| Task Category | Specific Tasks |
|---|---|
| Document Parsing | Image-to-Markdown conversion, multi-page parsing, structured output (JSON/HTML) |
| Layout Analysis | Bounding box detection, element type classification (25 categories), reading order |
| Table Recognition | Complex table extraction (merged cells, rotated tables), HTML output |
| Formula Recognition | Inline and display math formulas, LaTeX output |
| Chart Understanding | Chart QA, trend analysis, data extraction from various chart types |
| Key Information Extraction | Receipts, invoices, certificates, medical records, ID cards |
| Handwriting Recognition | Chinese and English handwritten text |
| Scene Text Recognition | Street signs, product labels, natural scene text |
| Multilingual OCR | 192 languages including Latin, Cyrillic, Arabic, South/Southeast Asian, CJK scripts |

Quick Start

Basic Usage

from transformers import AutoModelForImageTextToText, AutoProcessor
import torch
from PIL import Image

MODEL_PATH = "baidu/Qianfan-OCR"
model =…

Excerpt shown — open the source for the full document.

Notability

notability 8.0/10

High HF downloads, notable OCR model from Baidu.