Model · Baidu (ERNIE) · published Nov 7, 2025 · seen 5d

baidu/ERNIE-4.5-VL-28B-A3B-Thinking


Captured source

captured 11h · http 200 · method plain · task image-text-to-text · license apache-2.0 · library transformers · params 30B · downloads 613 · likes 540

🚀 Introducing ERNIE-4.5-VL-28B-A3B-Thinking: A Breakthrough in Multimodal AI

🔥 Demo

Model Highlights

Built upon the ERNIE-4.5-VL-28B-A3B architecture, the upgraded ERNIE-4.5-VL-28B-A3B-Thinking makes a substantial leap in multimodal reasoning. 🧠✨ During an extensive mid-training phase, the model was trained on a large and highly diverse corpus of high-quality visual-language reasoning data. This large-scale training strengthened the model's representation power and deepened the semantic alignment between the visual and language modalities, enabling more nuanced visual-textual reasoning. 📊

The model is further trained with multimodal reinforcement learning on verifiable tasks. It integrates the GSPO and IcePop strategies to stabilize MoE training and combines them with dynamic difficulty sampling to improve learning efficiency. ⚡ In response to strong community demand, we have significantly strengthened the model's grounding performance and its instruction following, making visual grounding functions easier to trigger. 🎯 Additionally, our "Thinking with Images" feature, when paired with tools such as image zooming and image search, improves the model's ability to process fine-grained details and handle long-tail visual knowledge. 🔍🖼️
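For readers unfamiliar with GSPO, the sketch below illustrates the sequence-level importance ratio and clipped objective as defined in the published GSPO method. It is not taken from this model card, which does not describe its training code; the function names and the `eps` default are hypothetical, and IcePop and dynamic difficulty sampling are not shown.

```python
import math


def gspo_sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence importance ratio.

    Computes exp(mean_t(log pi_new(y_t) - log pi_old(y_t))), i.e. the
    sequence likelihood ratio raised to the power 1/|y|.
    """
    if len(new_logprobs) != len(old_logprobs) or not new_logprobs:
        raise ValueError("need two non-empty, equal-length log-prob lists")
    total = sum(n - o for n, o in zip(new_logprobs, old_logprobs))
    return math.exp(total / len(new_logprobs))


def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate applied to one whole response.

    eps=0.2 is an illustrative default, not a value from the model card.
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```

Because the ratio is computed once per response rather than once per token, every token in a response shares the same clipping decision, which is the property usually credited with stabilizing MoE training.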

Together, these enhancements form a critical foundation for developing sophisticated multimodal agents, empowering developers and researchers to create next-generation AI applications that push the boundaries of what's possible in visual-language understanding. 🤖🌟

![benchmark](./benchmark.jpg)

Key Capabilities

As a lightweight model that activates only 3B parameters ⚡, ERNIE-4.5-VL-28B-A3B-Thinking closely matches the performance of the industry's top flagship models across various benchmarks. 🚀

  • Visual Reasoning 🧠👁️: Bolstered by large-scale reinforcement learning, the model demonstrates exceptional multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks! 📊✨
  • STEM Reasoning 🔬📐: Leveraging its powerful visual abilities, the model achieves a leap in performance on STEM tasks like solving problems from photos, easily handling even complex questions! 🎯💡
  • Visual Grounding 📍🎨: Offers more precise grounding and more flexible instruction execution, so grounding functions can be triggered easily in complex industrial scenarios for a significant efficiency boost! ⚙️💪
  • Thinking with Images 🤔🔍: Like a human, the model can freely zoom in and out of an image to capture fine details and extract the information it needs. 🖼️✨
  • Tool Utilization 🛠️⚡: Empowered by robust tool-calling capabilities, the model can instantly use functions like image search to easily identify long-tail knowledge and achieve comprehensive information retrieval! 🔎📚
  • Video Understanding 🎬🎥: The model possesses outstanding temporal awareness and event localization abilities, accurately identifying content changes across different time segments in a video, making video analysis smarter and more efficient! ⏱️🌟

Quickstart

Hugging Face 🤗 app

Using transformers Library

Requirement: transformers <= 4.57.6

Here is an example of how to use the transformers library for inference:

import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'

# Load the model in bfloat16 and let transformers place it on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True
)

# The processor handles both text and vision inputs; register it with the model.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model.add_image_preprocess(processor)

# One user turn containing a text question and an image URL.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What color clothes is the girl in the picture wearing?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"
                }
            },
        ]
    },
]

# Render the chat template to a prompt string, then extract the vision inputs.
text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move the inputs to the device that holds the model's first parameters.
device = next(model.parameters()).device
inputs = inputs.to(device)

generated_ids = model.generate(
    inputs=inputs['input_ids'].to(device),
    **inputs,
    max_new_tokens=1024,
    use_cache=False
)

# Decode only the newly generated tokens, skipping the prompt.
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)
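Since this is a thinking model, the decoded text may contain the reasoning trace followed by the final answer. The sketch below separates the two, assuming the reasoning ends with a `</think>` marker. That marker is an assumption, not something this model card states, so check a real output and adjust `end_marker` before relying on it. `split_thinking` is a hypothetical helper name.

```python
def split_thinking(decoded_text, end_marker="</think>"):
    """Split decoded output into (reasoning, answer).

    ASSUMPTION: the reasoning trace ends with `end_marker`. If the marker
    is absent, the whole text is returned as the answer.
    """
    head, sep, tail = decoded_text.partition(end_marker)
    if not sep:
        return "", decoded_text.strip()
    # Drop an opening tag if the model emitted one (also an assumption).
    reasoning = head.replace("<think>", "").strip()
    return reasoning, tail.strip()


# Example use with the variable from the snippet above:
# reasoning, answer = split_thinking(output_text)
```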

vLLM Inference

Install vLLM

pip install decord
pip install uv
uv pip install vllm==0.11.2 --torch-backend=auto

Run vLLM

# Requires one 80 GB GPU. If an error occurs, add --gpu-memory-utilization 0.95 and try again.
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code
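Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The following is a minimal client sketch using only the standard library. It assumes the server listens on vLLM's default address `http://localhost:8000`; the helper names `build_chat_payload` and `post_chat` are hypothetical.

```python
import json
import urllib.request

MODEL = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"


def build_chat_payload(question, image_url, max_tokens=1024):
    """Build a chat completions request body with one text part and one image part."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def post_chat(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to the server and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example (requires a running server):
# payload = build_chat_payload(
#     "What color clothes is the girl in the picture wearing?",
#     "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg",
# )
# print(post_chat(payload)["choices"][0]["message"]["content"])
```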

Run vLLM using reasoning-parser and tool-call-parser

# Requires one 80 GB GPU. If an error occurs, add --gpu-memory-utilization 0.95 and try again.
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
--reasoning-parser ernie45 \
--tool-call-parser ernie45 \
--enable-auto-tool-choice
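With the reasoning and tool-call parsers enabled, the server returns the reasoning trace and any tool calls as separate fields of the assistant message instead of one text blob. The sketch below shows a tool definition in the OpenAI function-calling format and a helper that unpacks a response message. The `image_search` tool and its schema are hypothetical, made up for illustration. vLLM versions differ on whether the parsed reasoning is returned under `reasoning_content` or `reasoning`, so the helper checks both.

```python
import json

# HYPOTHETICAL tool schema; pass it as `"tools": [IMAGE_SEARCH_TOOL]` in the request body.
IMAGE_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "image_search",
        "description": "Search the web for images matching a text query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}


def summarize_message(message):
    """Unpack reasoning, final answer, and tool calls from an assistant message dict."""
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Tool-call arguments arrive as a JSON-encoded string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return {
        "reasoning": reasoning,
        "answer": message.get("content") or "",
        "tool_calls": calls,
    }
```

Given a parsed response `resp`, the message to pass in is `resp["choices"][0]["message"]`. When `tool_calls` is non-empty, the caller is expected to run the tool and send its result back in a follow-up `tool` role message.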

Run vLLM for video understanding (ensure your vLLM version includes PR#31274 for accurate timestamp rendering)

vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
--reasoning-parser ernie45 \
--media-io-kwargs '{"video": {"num_frames": 180, "fps": 2}}'
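A video request uses the same chat format with a `video_url` content part. The sketch below builds such a request body and estimates how much of a video the sampling settings above can cover. It assumes `num_frames` caps the number of sampled frames and `fps` is the sampling rate, which is a reading of the flag names rather than something stated in this model card. Both helper names are hypothetical.

```python
MODEL = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"


def build_video_payload(question, video_url, max_tokens=1024):
    """Build a chat completions request body with one text part and one video part."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "video_url", "video_url": {"url": video_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def max_covered_seconds(num_frames, fps):
    """Seconds of video covered when sampling `fps` frames per second up to `num_frames`.

    ASSUMPTION: num_frames is a cap and fps is the sampling rate.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# With the flags shown above: 180 frames at 2 fps is 90 seconds at the full rate.
print(max_covered_seconds(180, 2))  # 90.0
```

Under that assumption, videos longer than 90 seconds would need either sparser sampling or a larger `num_frames` to be covered end to end.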

FastDeploy Inference

Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the FastDeploy GitHub Repository.

Note: For single-card deployment, at least 48GB of GPU memory is required.

fastdeploy serve --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
--max-model-len 131072 \
--max-num-seqs 32 \
--port 8180 \
--quantization wint8 \
--reasoning-parser ernie-45-vl-thinking \
--tool-call-parser ernie-45-vl-thinking \
--mm-processor-kwargs '{"image_max_pixels": 12845056 }'
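The `image_max_pixels` value above, 12845056, equals 3584 × 3584. The sketch below shows what such a budget implies for input resolution, assuming it is a cap on width × height and that larger images are scaled down with their aspect ratio preserved. The real preprocessor may also round dimensions to patch multiples, which is not modeled here. `fit_to_pixel_budget` is a hypothetical helper, not a FastDeploy function.

```python
import math


def fit_to_pixel_budget(width, height, max_pixels=12845056):
    """Return (width, height) scaled down, if needed, so width * height <= max_pixels.

    ASSUMPTION: the budget caps total pixels and aspect ratio is preserved.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / pixels)
    # Floor so the result never exceeds the budget.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


print(fit_to_pixel_budget(3584, 3584))  # (3584, 3584): exactly at the budget
print(fit_to_pixel_budget(7168, 7168))  # (3584, 3584): halved on each side
```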

Finetuning with ERNIEKit

ERNIEKit is a training toolkit based on PaddlePaddle, specifically designed for the ERNIE series of…

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

Low downloads but from major lab