Model | Tencent Hunyuan | published Mar 5, 2026 | seen 5d

tencent/Penguin-VL-8B

Captured source

published Mar 5, 2026 | seen 5d | captured 9h | http 200 | method plain | task text-generation | license apache-2.0 | library transformers | params 8.7B | downloads 216 | likes 75

Penguin-VL

Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Project Page: penguin-vl.github.io | GitHub: tencent-ailab/Penguin-VL | arXiv: 2603.06569

---

📰 News

  • 2026.03 — PenguinVL-Encoder now available for general use.
  • 2026.03 — Released PenguinVL-2B, PenguinVL-8B.

---

🌟 Model Overview

PenguinVL is a compact Vision-Language Model designed to explore the efficiency limits of small-scale VLMs. Rather than being only an instruction-tuned model, PenguinVL is built from the ground up through LLM-based vision encoder construction, multimodal pretraining, and subsequent instruction tuning.

Unlike most existing VLMs that rely on contrastive-pretrained vision encoders (e.g., CLIP/SigLIP), PenguinVL initializes its vision encoder directly from a text-only LLM. This design avoids the objective mismatch between contrastive learning and autoregressive language modeling, enabling tighter alignment between visual representations and the language backbone.

Key Characteristics

  • 🧠 LLM-based Vision Encoder

The vision encoder is adapted from a pretrained text LLM (Qwen3-0.6B), modified with bidirectional attention and 2D-RoPE for spatial modeling. This provides strong semantic priors and native compatibility with the downstream LLM.

  • 🎥 Efficient Video Understanding

A Temporal Redundancy-Aware (TRA) token compression strategy dynamically allocates token budgets across frames, enabling long-video reasoning within a limited context window.

  • 🏗 Unified Architecture

The model consists of:

  1. LLM-initialized vision encoder
  2. Lightweight MLP projector
  3. Qwen3 language backbone

  • 📊 Compact but Strong

At 8B scale, Penguin-VL achieves competitive performance across image, document, OCR, math, and video benchmarks while remaining deployment-friendly.
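The token-budget idea from the video bullet above can be illustrated with a toy allocator. This is not the TRA algorithm from the paper, whose details are not given in this card. It is a minimal sketch that assumes each frame is a small feature vector and that "redundancy" can be approximated by the mean absolute difference to the previous frame. The function name `allocate_frame_tokens`, the novelty score, and the `min_tokens` floor are all hypothetical choices made for illustration.

```python
def allocate_frame_tokens(frames, total_budget, min_tokens=1):
    """Split total_budget tokens across frames by how much each frame changed.

    Illustrative only (not the paper's TRA method).
    frames: list of equal-length, non-empty lists of floats (hypothetical features).
    Returns one integer token budget per frame; the sum never exceeds total_budget.
    """
    if not frames:
        return []
    n = len(frames)
    if total_budget < n * min_tokens:
        raise ValueError("total_budget is too small for min_tokens per frame")

    # Novelty of frame t: mean absolute difference to frame t-1.
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]
    # The first frame has no predecessor, so treat it like the most novel frame.
    # Fall back to 1.0 when there is a single frame or the video is fully static.
    first = max(diffs, default=0.0) or 1.0
    novelty = [first] + diffs

    # Each frame keeps a floor; the spare tokens are shared in proportion to novelty.
    spare = total_budget - n * min_tokens
    total_novelty = sum(novelty)
    return [min_tokens + int(spare * s / total_novelty) for s in novelty]


# Frames 0-1 are identical, frame 2 changes, frame 3 repeats frame 2.
frames = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
print(allocate_frame_tokens(frames, total_budget=20, min_tokens=2))  # [8, 2, 8, 2]
```

In this toy setup, repeated frames fall back to the floor while frames that introduce new content receive most of the budget, which is the general behavior the bullet describes.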

---

🧪 Quick Start — Transformers Inference

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "tencent/Penguin-VL-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Example: Image + Text
inputs = processor(
    conversation=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": {"image_path": "assets/example.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        },
    ],
    return_tensors="pt",
)

inputs = {k: v.cuda() if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.decode(output_ids[0], skip_special_tokens=True)

print(response)
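When the same prompt structure is reused for many images, the conversation list from the snippet above can be produced by a small helper. The sketch below only mirrors the message layout shown in the Quick Start. The function `build_image_conversation` is a hypothetical convenience wrapper and is not part of the Penguin-VL repository.

```python
def build_image_conversation(image_path, prompt,
                             system_prompt="You are a helpful assistant."):
    """Return a conversation list in the layout used by the Quick Start example.

    Hypothetical helper: it only assembles plain Python dicts and does not
    load the image or call the processor.
    """
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": {"image_path": image_path}},
                {"type": "text", "text": prompt},
            ],
        },
    ]


conversation = build_image_conversation("assets/example.jpg", "Describe this image.")
print(conversation[1]["content"][1]["text"])  # Describe this image.
```

The result would then be passed as `processor(conversation=conversation, return_tensors="pt")`, exactly as in the Quick Start.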

🌎 Model Zoo

| Model | Base Model | HF Link |
| -------------------- | ------------ | ------------------------------------------------------------ |
| PenguinVL-8B | Qwen3-8B | tencent/Penguin-VL-8B |
| PenguinVL-2B | Qwen3-1.7B | tencent/Penguin-VL-2B |
| PenguinVL-Encoder | Qwen3-0.6B | tencent/Penguin-Encoder |

🚀 Main Results

Chart / OCR / Document Understanding

| Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
|---|---:|---:|---:|---:|
| InfoVQA | **86.8** | 83.1 | 79.1 | 49.2 |
| ChartQA | **90.5** | 89.6 | 86.7 | 48.6 |
| DocVQA | **96.2** | 96.1 | 92.3 | 78.3 |
| CharXiv (DQ / RQ) | 75.7 / 40.0 | **83.0 / 46.4** | 72.2 / 44.4 | 64.4 / 31.7 |
| OCRBench | 852 | **896** | 840 | 701 |

General Knowledge / Multi-Image / Math Reasoning

| Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
|---|---:|---:|---:|---:|
| AI2D | **86.1** | 85.7 | 84.0 | 65.7 |
| RealWorldQA | **75.8** | 71.5 | 67.5 | 60.7 |
| V-star | **90.2** | 90.1 | 70.7 | 63.4 |
| MMMU-Pro | 40.2 | **55.9** | 39.7 | 36.5 |
| BLINK | 58.2 | **69.1** | 59.5 | 42.2 |
| MathVista | **77.4** | 77.2 | 74.2 | 40.9 |
| MathVerse | 50.8 | **62.1** | 55.8 | 27.0 |
| LogicVista | 53.8 | 55.3 | **57.3** | 40.5 |

Video Understanding

| Benchmark | Penguin-VL 8B | Qwen3-VL 8B | InternVL3.5 8B | OpenAI GPT-5 nano |
|---|---:|---:|---:|---:|
| MVBench | 71.7 | 68.7 | **72.1** | 52.9 |
| LongVideoBench | **67.0** | 62.6 | 62.1 | 38.1 |
| VideoMME | 66.2 | **71.4** | 66.0 | 49.4 |
| EgoSchema | 67.0 | **70.2** | 61.0 | 34.8 |
| MMVU | 53.9 | **58.7** | 51.5 | 51.0 |
| CharadesSTA | **61.4** | 56.0 | 32.8 | 5.0 |
| NextQA | **85.4** | 82.3 | 81.3 | 59.3 |
| ActivityNetQA | **65.2** | 63.7 | 60.1 | – |
| Perception Test | **78.0** | 72.7 | 72.7 | – |

> Bold indicates the best result among compared models.
> See our paper for more details.
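To put the "competitive" claim in numbers, the scores from the three tables above can be tallied directly. The sketch below copies the reported values as-is (CharXiv is split into its DQ and RQ parts) and counts the benchmarks on which Penguin-VL 8B has the highest reported score. The helper name `penguin_wins` is illustrative, and the tally says nothing about evaluation settings, which are described in the paper.

```python
# Column order: (Penguin-VL 8B, Qwen3-VL 8B, InternVL3.5 8B, OpenAI GPT-5 nano).
# None marks an entry that is missing in the source tables.
SCORES = {
    "InfoVQA": (86.8, 83.1, 79.1, 49.2),
    "ChartQA": (90.5, 89.6, 86.7, 48.6),
    "DocVQA": (96.2, 96.1, 92.3, 78.3),
    "CharXiv DQ": (75.7, 83.0, 72.2, 64.4),
    "CharXiv RQ": (40.0, 46.4, 44.4, 31.7),
    "OCRBench": (852, 896, 840, 701),
    "AI2D": (86.1, 85.7, 84.0, 65.7),
    "RealWorldQA": (75.8, 71.5, 67.5, 60.7),
    "V-star": (90.2, 90.1, 70.7, 63.4),
    "MMMU-Pro": (40.2, 55.9, 39.7, 36.5),
    "BLINK": (58.2, 69.1, 59.5, 42.2),
    "MathVista": (77.4, 77.2, 74.2, 40.9),
    "MathVerse": (50.8, 62.1, 55.8, 27.0),
    "LogicVista": (53.8, 55.3, 57.3, 40.5),
    "MVBench": (71.7, 68.7, 72.1, 52.9),
    "LongVideoBench": (67.0, 62.6, 62.1, 38.1),
    "VideoMME": (66.2, 71.4, 66.0, 49.4),
    "EgoSchema": (67.0, 70.2, 61.0, 34.8),
    "MMVU": (53.9, 58.7, 51.5, 51.0),
    "CharadesSTA": (61.4, 56.0, 32.8, 5.0),
    "NextQA": (85.4, 82.3, 81.3, 59.3),
    "ActivityNetQA": (65.2, 63.7, 60.1, None),
    "Perception Test": (78.0, 72.7, 72.7, None),
}


def penguin_wins(scores):
    """Return the benchmarks where the first column holds the highest reported score."""
    wins = []
    for name, row in scores.items():
        available = [s for s in row if s is not None]
        if row[0] == max(available):
            wins.append(name)
    return wins


wins = penguin_wins(SCORES)
print(f"{len(wins)} of {len(SCORES)}")  # 12 of 23
```

By this count Penguin-VL 8B leads on 12 of the 23 reported scores, with Qwen3-VL 8B ahead on most of the remainder.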

Citation

If you find Penguin-VL useful for your research and applications, please cite using this BibTeX:

@article{Penguin-VL,
  title={Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders},
  author={Boqiang Zhang and Lei Ke and Ruihan Yang and Qi Gao and Tianyuan Qu and Rossell Chen and Dong Yu and Leoweiliang},
  journal={arXiv preprint arXiv:2603.06569},
  year={2026}
}

Notability

notability 5.0/10

New model release but low traction