Tencent-Hunyuan/HunyuanOCR


Language: Python

License: NOASSERTION

Stars: 1649

Forks: 130

Open issues: 75

Created: 2025-11-18T04:06:24Z

Pushed: 2026-06-02T03:48:17Z

Default branch: main

Fork: no

Archived: no

README:

🎯 Demo | 📥 Model Download | 📄 Technical Report

🤝 Join Our Community

🔥 News

  • [2026/06/02] 🎉 We have released two new benchmarks. Chronicles-OCR (arXiv), an open-source ancient-text perception benchmark covering the evolutionary trajectory of the "Seven Chinese Scripts", is jointly built by the SSV Digital Culture Lab and the SSV Technical Architecture Department, together with the Palace Museum and Anyang Normal University. We have also released ChartArena (arXiv), a new chart-parsing benchmark supporting diverse chart types. Welcome to evaluate and provide your valuable feedback!
  • [2026/05/11] 🎉 We have officially open-sourced two benchmarks on document parsing and text-image machine translation: Wild-OmniDocBench and MMTIT-Bench. Welcome to evaluate and provide your valuable feedback!
  • [2026/04/08] 🎉 Our works on document parsing and text-image machine translation have been accepted to the CVPR 2026 Main Conference! Check out the papers: Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training and MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation.
  • [2026/01/13] ⭐ We have released a stable official online demo, feel free to try it out!
  • [2025/11/28] 🛠️ We fixed vLLM inference bugs and configuration issues such as the system prompt. We recommend using the latest vLLM installation steps and inference script for performance testing. An accuracy gap between the Transformers and vLLM frameworks remains (we are working on fixing this).
  • [2025/11/25] 📝 Inference code and model weights are publicly available.

📖 Introduction

HunyuanOCR is a leading end-to-end OCR expert VLM built on Hunyuan's native multimodal architecture. With a lightweight 1B-parameter design, it achieves state-of-the-art results on multiple benchmarks. The model handles complex multilingual document parsing and performs well in practical applications including text spotting, open-field information extraction, video subtitle extraction, and photo translation.

✨ Key Features

  • 💪 Efficient Lightweight Architecture: Built on Hunyuan's native multimodal architecture and training strategy, achieving SOTA performance with only 1B parameters, significantly reducing deployment costs.
  • 📑 Comprehensive OCR Capabilities: A single model covering classic OCR tasks including text detection and recognition, complex document parsing, open-field information extraction and video subtitle extraction, while supporting end-to-end photo translation and document QA.
  • 🚀 Ultimate Usability: Follows the "end-to-end" philosophy of large models, achieving SOTA results with a single instruction and a single inference pass, which is more efficient and convenient than industry cascade solutions.
  • 🌏 Extensive Language Support: Robust support for over 100 languages, excelling in both single-language and mixed-language scenarios across various document types.

🛠️ Dependencies and Installation

System Requirements

  • 🖥️ Operating System: Linux
  • 🐍 Python: 3.12+ (recommended and tested)
  • ⚡ CUDA: 12.9
  • 🔥 PyTorch: 2.7.1
  • 🎮 GPU: NVIDIA GPU with CUDA support
  • 🧠 GPU Memory: 20GB (for vLLM)
  • 💾 Disk Space: 6GB
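
The requirements above can be checked before installation with a small preflight script. This is a minimal sketch, not part of the repository: the helper names `python_ok`, `disk_ok`, and `cuda_summary` are hypothetical, and the GPU check assumes PyTorch is already installed (it returns `None` otherwise).

```python
import shutil
import sys

MIN_PYTHON = (3, 12)  # Python 3.12+ from the requirements list
MIN_DISK_GB = 6       # disk space from the requirements list


def python_ok(version=sys.version_info):
    """True if the (major, minor) version meets the minimum."""
    return tuple(version[:2]) >= MIN_PYTHON


def disk_ok(path=".", min_gb=MIN_DISK_GB):
    """True if the filesystem holding `path` has at least `min_gb` GiB free."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= min_gb


def cuda_summary():
    """Return (cuda_version, gpu_memory_gib) for GPU 0, or None if unavailable."""
    try:
        import torch  # optional: only present after PyTorch is installed
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(0)
    return torch.version.cuda, props.total_memory / 1024**3


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print("Disk OK:", disk_ok())
    print("CUDA / GPU memory (GiB):", cuda_summary())
```

Compare the reported CUDA version and GPU memory against the values listed above (CUDA 12.9, 20GB for vLLM) by hand; the script only reports them.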

🚀 Quick Start with vLLM (⭐ Recommended)

  • [HunyuanOCR Usage Guide](https://docs.vllm.ai/projects/recipes/en/latest/Tencent-Hunyuan/HunyuanOCR.html)

Installation

pip install "vllm>=0.12.0"
pip install -r requirements.txt

Note: We suggest installing cuda-compat-12-9:

sudo dpkg -i cuda-compat-12-9_575.57.08-0ubuntu1_amd64.deb
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.9/compat:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# verify cuda-compat-12-9
ls /usr/local/cuda-12.9/compat

Model Deploy

vllm serve tencent/HunyuanOCR \
--no-enable-prefix-caching \
--mm-processor-cache-gb 0 \
--gpu-memory-utilization 0.2
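
Once the server is running, it can be queried over vLLM's OpenAI-compatible HTTP API. The following is a minimal sketch using only the standard library. It assumes the server listens on vLLM's default `http://localhost:8000/v1`; the helper names `build_ocr_request` and `send_ocr_request` are hypothetical, and the empty system message mirrors the offline inference example rather than any documented requirement.

```python
import base64
import json
import urllib.request


def build_ocr_request(image_bytes, prompt, model="tencent/HunyuanOCR", mime="image/jpeg"):
    """Build a chat-completions payload with the image inlined as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "model": model,
        "temperature": 0,
        "messages": [
            {"role": "system", "content": ""},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": prompt},
            ]},
        ],
    }


def send_ocr_request(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to the server (assumed default address) and return the text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

A typical call reads the image from disk, for example `send_ocr_request(build_ocr_request(open("/path/to/image.jpg", "rb").read(), prompt))`, where `prompt` is one of the task instructions used with the model.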

Model Inference

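from vllm import LLM, SamplingParams
from PIL import Image
from transformers import AutoProcessor

def clean_repeated_substrings(text):
    """Clean repeated substrings in text"""
    n = len(text)
    # NOTE: the lines between `if n` and the `while` condition were lost in
    # extraction. The 8000-character threshold and the loop setup below are a
    # reconstruction; verify them against the upstream README.
    if n < 8000:
        return text
    for length in range(2, n // 10 + 1):
        # Candidate is the last `length` characters of the text.
        candidate = text[-length:]
        count = 0
        i = n - length

        # Walk backwards, counting consecutive copies of the candidate.
        while i >= 0 and text[i:i + length] == candidate:
            count += 1
            i -= length

        # A run of 10 or more copies is treated as degenerate output; keep one copy.
        if count >= 10:
            return text[:n - length * (count - 1)]

    return text

model_path = "tencent/HunyuanOCR"
llm = LLM(model=model_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_path)
sampling_params = SamplingParams(temperature=0, max_tokens=16384)

img_path = "/path/to/image.jpg"
img = Image.open(img_path)
# Prompt in English: "Detect and recognize the text in the image, and output
# the text coordinates in a formatted way."
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": [
        {"type": "image", "image": img_path},
        {"type": "text", "text": "检测并识别图片中的文字,将文本坐标格式化输出。"}
    ]}
]
# Render the chat template to a prompt string, then pass the image separately.
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = {"prompt": prompt, "multi_modal_data": {"image": [img]}}
output = llm.generate([inputs], sampling_params)[0]
print(clean_repeated_substrings(output.outputs[0].text))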
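
To make the repetition cleanup in the script above easier to follow, here is a standalone sketch of the same idea: find a substring repeated many times at the very end of the output and keep only one copy. The function name `trim_trailing_repeats` and the parameters `min_len` and `min_repeats` are hypothetical, added so the behavior can be tried on short strings; they are not part of the repository.

```python
def trim_trailing_repeats(text, min_len=8000, min_repeats=10):
    """Collapse a run of identical substrings at the end of `text` to one copy.

    `min_len` and `min_repeats` are illustrative parameters (assumptions),
    exposed so the logic can be exercised on small inputs.
    """
    n = len(text)
    if n < min_len:
        return text  # short outputs are returned unchanged
    for length in range(2, n // min_repeats + 1):
        candidate = text[-length:]
        count = 0
        i = n - length
        # Count consecutive copies of the candidate, scanning backwards.
        while i >= 0 and text[i:i + length] == candidate:
            count += 1
            i -= length
        if count >= min_repeats:
            # Drop all but one copy of the repeated substring.
            return text[:n - length * (count - 1)]
    return text


sample = "Invoice total: 42." + "ab" * 20
print(trim_trailing_repeats(sample, min_len=0))  # prints "Invoice total: 42.ab"
```

With the default `min_len`, the same short sample is returned unchanged, since only long outputs are scanned for degenerate repetition.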

Alternatively, you can use the provided demo script as follows:

cd Hunyuan-OCR-master/Hunyuan-OCR-vllm && python run_hy_ocr.py

🚀 Quick Start with Transformers

Installation

pip install git+https://github.com/huggingface/transformers@82a06db03535c49aa987719ed0746a76093b1ec4

> Note: Currently, Transformers has a certain performance degradation compared to the vLLM framework (we are working hard to fix it), and we…

Notability

notability 7.0/10

New OCR model from Tencent, solid traction.