Repo: Meituan (LongCat), published Mar 25, 2026

meituan-longcat/LongCat-Next


License: MIT

Stars: 438

Forks: 23

Open issues: 9

Created: 2026-03-25T14:59:18Z

Pushed: 2026-05-09T10:21:43Z

Default branch: main

Fork: no

Archived: no

README:

LongCat-Next

Tech Report 📄

Model Introduction

![evaluation](./assets/overview.jpg)

We develop LongCat-Next, a native multimodal model that processes text, vision, and audio under a single autoregressive objective with minimal inductive bias beyond the language paradigm. As an industrial-strength foundation model with A3B model size, it excels at seeing, creating, and talking, achieving strong performance across a wide range of multimodal benchmarks. In particular, leveraging semantically complete discrete representations, it surpasses the long-standing performance ceiling of discrete vision modeling on understanding tasks, and provides a unified solution for visual understanding and generation. This success demonstrates that discrete tokens can universally represent multimodal signals and be deeply internalized within a single discrete embedding space. We further provide extensive experiments to analyze this unified discrete training paradigm and uncover several interesting findings.

As a meaningful attempt toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to foster further research and development in the community.

Key Features

This work addresses the fundamental barrier to native multimodality through a design philosophy that prioritizes simplicity, treating vision and audio as intrinsic extensions of language. As a step toward this goal, we present LongCat-Next, a discrete native multimodal model that achieves industrial-strength performance within discrete frameworks while remaining highly competitive across a wide range of specialized domains. Built upon the LongCat-Flash-Lite MoE backbone (A3B) as a _multi-task_ learner, the model unifies language, vision, and audio within a single discrete framework. We make the following principal contributions:

🌟 Discrete Native Autoregression Paradigm (DiNA).

We introduce DiNA, a unified paradigm that extends next-token prediction from language to native multimodality, which internalizes diverse modalities into a shared token space. It simplifies multimodal modeling by creating modality-aware tokenizer-detokenizer pairs and leveraging the established training infrastructure of large language models.
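A minimal sketch of what a shared token space can look like: each modality's tokenizer emits local IDs, and a per-modality offset shifts them into one vocabulary so that a single next-token predictor can consume interleaved sequences. The vocabulary sizes, the offset layout, and every name below are hypothetical and are not taken from LongCat-Next.

```python
from dataclasses import dataclass

# Hypothetical vocabulary sizes; the real LongCat-Next sizes are not stated here.
VOCAB_SIZES = {"text": 1000, "vision": 512, "audio": 256}


@dataclass(frozen=True)
class SharedVocab:
    offsets: dict
    total: int

    def encode(self, modality: str, local_id: int) -> int:
        """Map a modality-local token ID to an ID in the shared space."""
        size = VOCAB_SIZES[modality]
        if not 0 <= local_id < size:
            raise ValueError(f"{local_id} out of range for {modality}")
        return self.offsets[modality] + local_id

    def decode(self, shared_id: int) -> tuple:
        """Recover (modality, local_id) from a shared ID."""
        for modality, offset in self.offsets.items():
            if offset <= shared_id < offset + VOCAB_SIZES[modality]:
                return modality, shared_id - offset
        raise ValueError(f"{shared_id} is outside the shared vocabulary")


def build_shared_vocab() -> SharedVocab:
    """Lay the modality vocabularies out back to back."""
    offsets, running = {}, 0
    for modality, size in VOCAB_SIZES.items():
        offsets[modality] = running
        running += size
    return SharedVocab(offsets=offsets, total=running)


vocab = build_shared_vocab()
# An interleaved sequence: two text tokens, one vision token, one audio token.
sequence = [
    vocab.encode("text", 5),
    vocab.encode("text", 42),
    vocab.encode("vision", 7),
    vocab.encode("audio", 3),
]
```

Because every token lives in one ID range, the same embedding table and output head can serve all modalities, and the detokenizer for each modality only needs the IDs that fall inside its own range.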

🌟 Semantic Completeness for Discrete Visual Representation.

We improve discrete visual modeling by combining Semantic-and-Aligned Encoders (SAE) with Residual Vector Quantization (RVQ). This integration creates hierarchical discrete tokens that preserve both semantic abstraction and fine-grained visual details, surpassing traditional representation limitations.
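To illustrate how Residual Vector Quantization yields hierarchical tokens, here is a generic RVQ sketch in plain Python: each level quantizes whatever the previous levels failed to capture, so early codes carry the coarse content and later codes refine it. The toy codebooks and function names are assumptions for illustration; this is not the LongCat-Next tokenizer, and it omits the encoder and codebook training entirely.

```python
def nearest(codebook, vector):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize level by level; each level encodes the previous residual."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks: a coarse level followed by a finer one.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

codes = rvq_encode([1.2, 0.9], CODEBOOKS)
approx = rvq_decode(codes, CODEBOOKS)
coarse_only = rvq_decode(codes[:1], CODEBOOKS[:1])
```

Decoding with only the first code gives a coarser reconstruction than decoding with both, which is the sense in which the token hierarchy trades abstraction against fine detail.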

🌟 Discrete Native-Resolution Vision Transformer (dNaViT).

Analogous to linguistic tokenizers, we propose dNaViT as a highly flexible, unified discrete interface for vision that extracts semantic features as "visual words", constructing a hierarchical representation space supporting dynamic tokenization and detokenization. dNaViT integrates seamlessly with large language models, ensuring high performance without degradation.
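One consequence of tokenizing at native resolution is that the number of visual tokens follows the image size instead of being fixed. The sketch below computes that count under a loudly stated assumption: a hypothetical `stride` of 28 pixels per token along each side. This excerpt does not specify how dNaViT maps pixels to tokens, so treat the value and the rounding rule as placeholders.

```python
import math


def token_grid(height: int, width: int, stride: int = 28) -> tuple:
    """Rows and columns of visual tokens for an image kept at native resolution.

    `stride` (pixels per token along each side) is a hypothetical value.
    """
    if height <= 0 or width <= 0 or stride <= 0:
        raise ValueError("height, width and stride must be positive")
    return math.ceil(height / stride), math.ceil(width / stride)


def token_count(height: int, width: int, stride: int = 28) -> int:
    """Total number of visual tokens for one image."""
    rows, cols = token_grid(height, width, stride)
    return rows * cols


small = token_count(224, 224)    # 8 x 8 grid
wide = token_count(448, 1344)    # 16 x 48 grid, aspect ratio preserved
```

A wide image keeps its aspect ratio in the token grid, so the sequence length grows with the image area rather than being padded or cropped to a square.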

🌟 Excelling in Seeing, Creating, and Talking in a Unified Model.

Within the framework of DiNA, visual understanding and generation are reformulated as two manifestations of the same predictive process without performance compromise. This formulation bridges the long-standing architectural divide while introducing minimal interference between these traditionally competing objectives and preserving core language capabilities. LongCat-Next achieves performance competitive with specialized understanding models while maintaining strong generative quality, particularly in text rendering, even under a 28× compression ratio. It also excels in advanced speech comprehension, low-latency voice conversation, and customizable voice cloning.

Please refer to our [technical report](./tech_report.pdf) for details!

Evaluation Results

![evaluation](./assets/evaluation.png)

Quick Start

Running LongCat-Next with transformers requires at least 3 GPUs (80GB VRAM each, e.g., H100/A100 80GB). We recommend the following environment:

  • python >= 3.10
  • torch >= 2.6
  • transformers >= 4.57.6
  • accelerate >= 1.10.0
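Before loading the model, it can help to verify these minimums programmatically. The following is a sketch using only the standard library; the helper names are ours, and the version parser is deliberately simple (it ignores local suffixes such as `+cu124` and treats pre-releases like the corresponding release).

```python
import sys
from importlib import metadata

# Minimum versions listed in the README.
MINIMUMS = {
    "torch": "2.6",
    "transformers": "4.57.6",
    "accelerate": "1.10.0",
}


def parse_version(text: str) -> tuple:
    """Turn '2.6.0+cu124' into (2, 6, 0); non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, so that 1.9.2 < 1.10.0."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict:
    """Map each requirement to True/False, or None if the package is missing."""
    report = {"python": sys.version_info >= (3, 10)}
    for name, minimum in MINIMUMS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

Comparing versions as integer tuples avoids the string-comparison pitfall where `"1.9.2"` sorts after `"1.10.0"`.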
# (Install python=3.10, ffmpeg…
…./assets/book.png"}
]

# Apply the chat template
text_input = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(f"{text_input=}")

# Preprocessing: move every available modality to the model's device
text_inputs, visual_inputs, audio_inputs = processor(text=text_input, return_tensors="pt")
text_inputs = text_inputs.to(model.device)
if visual_inputs is not None:
    visual_inputs = visual_inputs.to(model.device)
if audio_inputs is not None:
    audio_inputs = audio_inputs.to(model.device)

# Autoregressive generation
with torch.no_grad():
    outputs = model.generate(
        input_ids=text_inputs["input_ids"],
        visual_inputs=visual_inputs,
        audio_inputs=audio_inputs,
        return_dict_in_generate=True,
    )

# Text decoding: skip the prompt tokens and decode only the continuation
output_input_ids = outputs.sequences
text_output = tokenizer.decode(output_input_ids[0][len(text_inputs["input_ids"][0]):], skip_special_tokens=True)
print(f"{text_output=}")

# Image decoding
output_visual_ids = outputs.visual_ids
if output_visual_ids.size(0) > 0:
    image_path_list = model.model.decode_visual_ids_and_save(
        output_visual_ids,
        save_prefix="./output_image",
        **model.generation_config.visual_generation_config["custom_params"],
    )
    print(f"{image_path_list=}")

# Audio decoding
output_audio_text_ids = outputs.audio_text_ids
output_audio_ids = outputs.audio_ids
if output_audio_text_ids.size(-1) > 0:
    audio_text = tokenizer.decode(output_audio_text_ids[0], skip_special_tokens=True)
    print(f"{audio_text=}")
if output_audio_ids.size(0) > 0:
    audio_path_list = model.model.decode_audio_ids_and_save(
        output_audio_ids,
        save_prefix="./output_audio",
        **model.generation_config.audio_generation_config["custom_params"],
    )
    print(f"{audio_path_list=}")
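The text-decoding step in the example slices the returned sequence by the prompt length, because the generated sequence begins with the prompt tokens. A tiny sketch of that idiom with plain lists follows; the token IDs and the helper name are made up for illustration.

```python
def strip_prompt(sequence, prompt_length):
    """Return only the tokens produced after the prompt."""
    if prompt_length > len(sequence):
        raise ValueError("prompt is longer than the returned sequence")
    return sequence[prompt_length:]


prompt_ids = [101, 7, 8, 9]            # hypothetical prompt token IDs
returned = prompt_ids + [55, 56, 2]    # prompt followed by newly generated tokens
new_tokens = strip_prompt(returned, len(prompt_ids))
```

Decoding `new_tokens` instead of `returned` keeps the prompt from being echoed back in the printed answer.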

Text - Tool Calling Example

from parse_model_response import parse_model_response

tools = [
    {
        "type": "function",
        "function": {
            "name": "func_add",
            "description": "Calculate the sum of two numbers",
            "parameters": {
                "type": "object",
                "properties": {
                    "x1": {"type": "number", "description": "The first addend"},
                    "x2": {"type": "number", "description": "The second addend"}
                },…

Excerpt shown — open the source for the full document.
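Once the model emits a tool call, the caller still has to execute it. The sketch below shows one way to dispatch a call to the `func_add` tool described in the schema. It assumes a parsed call shaped like `{"name": ..., "arguments": ...}`; the actual return format of `parse_model_response` is not shown in this excerpt and may differ, and `TOOL_REGISTRY` and `dispatch_tool_call` are hypothetical names.

```python
import json


def func_add(x1: float, x2: float) -> float:
    """Local implementation matching the `func_add` tool description."""
    return x1 + x2


# Registry mapping tool names to Python callables.
TOOL_REGISTRY = {"func_add": func_add}


def dispatch_tool_call(call: dict):
    """Run one tool call of the assumed form {"name": str, "arguments": dict or JSON str}."""
    name = call["name"]
    if name not in TOOL_REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    arguments = call["arguments"]
    if isinstance(arguments, str):
        # Some chat formats serialize arguments as a JSON string.
        arguments = json.loads(arguments)
    return TOOL_REGISTRY[name](**arguments)


# Hypothetical parsed call; the real output of parse_model_response may differ.
example_call = {"name": "func_add", "arguments": '{"x1": 1.5, "x2": 2}'}
result = dispatch_tool_call(example_call)
```

Looking tools up in an explicit registry means the model can only trigger functions that were deliberately exposed, and an unknown name fails loudly instead of being silently ignored.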
