Model · InclusionAI (Ant Group) · published Jun 2, 2026

inclusionAI/Ling-2.6-flash-base

task: text-generation | license: mit | library: transformers | params: 108B | downloads: 157 | likes: 12

🤗 Hugging Face | 🤖 ModelScope | Tech Report

Ling-2.6-flash-base

Ling-2.6-flash-base is the base checkpoint behind the Ling-2.6-flash model. It is a flash-scale Mixture-of-Experts language model retrofitted from the Ling-2.0 base checkpoint with a hybrid linear attention design, continued pre-training, and long-context mid-training.

This release is intended for research, continued pre-training, distillation, and supervised or preference-based fine-tuning. It is not a chat-aligned assistant model. If you want an out-of-the-box instruction model, use the corresponding post-trained Ling-2.6-flash checkpoint instead.

1. Model Overview

Ling-2.6-flash-base is designed for efficient instant-response modeling with stronger long-context efficiency than the previous GQA-based Ling-2.0 generation. The core upgrade is a hybrid attention retrofit that combines Lightning Attention with MLA in a 7:1 ratio, together with a smooth migration pipeline from the original architecture.
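The 7:1 ratio can be pictured as a per-layer layout. The sketch below is illustrative only: it takes the layer count (32) and layer group size (8) from the Model Summary table and assumes that the ratio is applied within each group of 8 layers and that the MLA layer sits last in its group. The model card states the ratio but not the position, and the function name `layer_types` is hypothetical.

```python
from collections import Counter

NUM_LAYERS = 32   # "Transformer layers" in the Model Summary table
GROUP_SIZE = 8    # "Layer group size" in the Model Summary table


def layer_types(num_layers: int = NUM_LAYERS, group_size: int = GROUP_SIZE) -> list[str]:
    """Return one attention type per layer.

    Assumption: the MLA layer is the last layer of each group. The model
    card only gives the 7:1 ratio, not where the MLA layer is placed.
    """
    return [
        "mla" if (i + 1) % group_size == 0 else "lightning"
        for i in range(num_layers)
    ]


layout = layer_types()
counts = Counter(layout)
print(counts["lightning"], counts["mla"])  # 28 Lightning Attention layers, 4 MLA layers
```

Under these assumptions, 28 of the 32 layers use Lightning Attention and 4 use MLA.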

Ling-2.6 base models are trained on approximately 9.6T tokens across migration pre-training, continued pre-training, and mid-training, with staged context extension from 4K to 256K. Ling-2.6-flash-base serves as the base checkpoint for the post-trained Ling-2.6-flash instant model.

2. Key Features

  • Hybrid linear attention architecture combining Lightning Attention and MLA in a 7:1 ratio
  • Flash-scale MoE backbone optimized for efficient serving and high token efficiency
  • Long-context training pipeline extended to 256K context during mid-training
  • Continued pre-training mixture covering agentic data, long-context data, knowledge-rich web data, math, code, and multilingual corpora
  • Strong base-model quality across knowledge, math, code, reasoning, and long-context understanding benchmarks

3. Model Summary

| Item | Value |
| --- | --- |
| Architecture | Fine-grained MoE with hybrid linear attention |
| Parameter Scale | Total ~104B, Activated ~7.4B |
| Transformer layers | 32 |
| Routed experts per MoE layer | 256 |
| Shared experts per MoE layer | 1 |
| Active routed experts per token | 8 |
| Attention heads | 32 |
| Dense FFN layers | 1 |
| Hidden size | 4096 |
| Dense intermediate size | 9216 |
| Expert intermediate size | 1024 |
| KV LoRA rank | 512 |
| Q LoRA rank | 1536 |
| Layer group size | 8 |
| Positional encoding | Partial RoPE |
| Attention design | Lightning Attention + MLA, 7:1 ratio |
| Training recipe | Migration pre-training + continued pre-training + mid-training |
| Total training tokens | ~9.6T |
| Context training schedule | 4K -> 32K -> 256K |
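As a rough sanity check on the total and activated parameter counts, the feed-forward share can be estimated from the table values. This is a back-of-the-envelope sketch, not the authors' accounting. It assumes that each expert and the dense FFN are gated FFNs with three bias-free projection matrices, and that "Dense FFN layers: 1" means one of the 32 layers uses a dense FFN in place of an MoE block. Attention, embeddings, routers, and norms are not counted.

```python
HIDDEN = 4096          # hidden size
EXPERT_INTER = 1024    # expert intermediate size
DENSE_INTER = 9216     # dense intermediate size
NUM_LAYERS = 32
DENSE_LAYERS = 1
ROUTED = 256           # routed experts per MoE layer
SHARED = 1             # shared experts per MoE layer
ACTIVE_ROUTED = 8      # routed experts used per token


def gated_ffn_params(hidden: int, inter: int) -> int:
    # Assumption: gate, up, and down projections without biases.
    return 3 * hidden * inter


moe_layers = NUM_LAYERS - DENSE_LAYERS
per_expert = gated_ffn_params(HIDDEN, EXPERT_INTER)
dense_ffn = DENSE_LAYERS * gated_ffn_params(HIDDEN, DENSE_INTER)

# All experts count toward the total; only 8 routed + 1 shared run per token.
total_ffn = moe_layers * (ROUTED + SHARED) * per_expert + dense_ffn
active_ffn = moe_layers * (ACTIVE_ROUTED + SHARED) * per_expert + dense_ffn

print(f"total FFN params  ~ {total_ffn / 1e9:.1f}B")
print(f"active FFN params ~ {active_ffn / 1e9:.1f}B")
```

Under these assumptions the FFN weights alone come to roughly 100B in total and 3.6B per token. Both figures sit below the reported ~104B and ~7.4B, which is expected because the sketch leaves out every non-FFN component.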

4. Training Highlights

Architecture Migration

The model is converted from the Ling-2.0 generation into the Ling-2.6-flash architecture through a multi-stage migration pipeline that includes:

1. Lightning Attention conversion
2. Linear warmup
3. MLA conversion
4. MLA warmup
5. Full continued pre-training

This retrofit is designed to preserve pre-trained capability while reducing long-context compute cost, KV-cache pressure, and decode latency.
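To give a feel for the KV-cache point, the sketch below estimates how much cache grows with sequence length if only the MLA layers keep a per-token cache. Everything here is an assumption layered on the table values: it supposes 4 MLA layers (32 layers at 7:1), that each MLA layer caches a compressed latent of size equal to the KV LoRA rank (512) plus a small positional key, and that Lightning Attention layers hold a fixed-size state that is not counted. The value `ROPE_KEY_DIM = 64` is hypothetical and does not appear in the model card.

```python
NUM_LAYERS = 32
GROUP_SIZE = 8        # assumed: one MLA layer per group of 8 layers
KV_LORA_RANK = 512    # from the Model Summary table
ROPE_KEY_DIM = 64     # HYPOTHETICAL value, not stated in the model card
BYTES_PER_ELEM = 2    # bf16


def mla_cache_bytes(seq_len: int) -> int:
    """Estimated per-sequence cache size that grows with sequence length."""
    mla_layers = NUM_LAYERS // GROUP_SIZE
    per_token_elems = mla_layers * (KV_LORA_RANK + ROPE_KEY_DIM)
    return seq_len * per_token_elems * BYTES_PER_ELEM


# Context lengths from the training schedule: 4K -> 32K -> 256K.
for ctx in (4 * 1024, 32 * 1024, 256 * 1024):
    print(f"{ctx:>7} tokens -> {mla_cache_bytes(ctx) / 2**20:.1f} MiB per sequence")
```

The actual footprint depends on the serving stack and the released implementation, so these numbers are only an order-of-magnitude illustration.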

Data Mixture

The continued pre-training and mid-training stages include:

  • Agentic corpus built from tool-use and coding environments
  • Long-context corpus covering mathematics, web parsing, summarization, retrieval, and multi-hop reasoning
  • General web knowledge data with targeted STEM and factual augmentation
  • Math and code corpora
  • Multilingual data spanning 21 languages

5. Base Model Evaluation

The following numbers are selected from the technical report and reflect base-model evaluation rather than chat-aligned or instruction-tuned performance.

| Benchmark | Ling-2.0-flash-base | Ling-2.6-flash-base |
| --- | ---: | ---: |
| MMLU | 82.98 | 84.13 |
| MMLU-Pro | 60.73 | 61.36 |
| GPQA | 35.35 | 37.88 |
| SimpleQA | 10.01 | 18.33 |
| C-SimpleQA | 49.43 | 63.53 |
| MMMLU | 62.76 | 64.76 |
| GSM8K | 90.60 | 91.89 |
| OmniMath | 28.30 | 29.90 |
| HumanEval-Plus | 83.54 | 81.10 |
| LiveCodeBench | 30.40 | 33.48 |
| BIRD-SQL | 38.69 | 38.40 |
| BBH | 84.82 | 85.06 |
| AutoLogic | 61.10 | 62.82 |
| LEval | 73.41 | 77.86 |
| LongBenchv2 | 33.40 | 34.19 |

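Ling-2.6-flash-base improves on Ling-2.0-flash-base on 13 of the 15 benchmarks above. The largest gains are on knowledge-oriented and long-context evaluations (C-SimpleQA +14.10, SimpleQA +8.32, LEval +4.45), with smaller gains on reasoning benchmarks. HumanEval-Plus (-2.44) and BIRD-SQL (-0.29) regress slightly.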
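The per-benchmark differences can be recomputed directly from the table. The snippet below is a small convenience sketch; the scores are copied from the table, and the variable names are illustrative.

```python
# (Ling-2.0-flash-base, Ling-2.6-flash-base) scores copied from the table above.
SCORES = {
    "MMLU": (82.98, 84.13),
    "MMLU-Pro": (60.73, 61.36),
    "GPQA": (35.35, 37.88),
    "SimpleQA": (10.01, 18.33),
    "C-SimpleQA": (49.43, 63.53),
    "MMMLU": (62.76, 64.76),
    "GSM8K": (90.60, 91.89),
    "OmniMath": (28.30, 29.90),
    "HumanEval-Plus": (83.54, 81.10),
    "LiveCodeBench": (30.40, 33.48),
    "BIRD-SQL": (38.69, 38.40),
    "BBH": (84.82, 85.06),
    "AutoLogic": (61.10, 62.82),
    "LEval": (73.41, 77.86),
    "LongBenchv2": (33.40, 34.19),
}

# Positive delta means the 2.6 base checkpoint scores higher.
deltas = {name: round(new - old, 2) for name, (old, new) in SCORES.items()}
regressions = sorted(name for name, d in deltas.items() if d < 0)
largest = max(deltas, key=deltas.get)

for name, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15} {d:+.2f}")
print("regressions:", regressions)
```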

6. Intended Use

Recommended use cases:

  • Continued pre-training
  • Supervised fine-tuning for domain adaptation
  • Preference optimization and RL post-training
  • Distillation research
  • Long-context and MoE systems research

Not recommended as-is for:

  • Direct end-user chat deployment
  • Safety-critical applications without additional alignment and evaluation
  • Production use without post-training and task-specific validation

7. Limitations

  • This is a base model and is not instruction-aligned.
  • Outputs may be inaccurate, biased, incomplete, or unsafe without additional post-training.
  • Long-context quality depends on the serving stack, positional scaling configuration, and prompt format used at inference time.
  • The training mixture includes web-scale and synthetic data, so the model may reproduce factual errors or undesirable artifacts.
  • Benchmark results in the technical report are collected under controlled internal evaluation settings and should not be treated as a guarantee of downstream production behavior.

8. Relationship to Other Releases

  • Ling-2.6-flash: instruction and instant-response optimized model derived from this base.

If your goal is interactive assistant use rather than research on base checkpoints, the post-trained Ling-2.6-flash model is usually the better starting point.

9. Usage

This is a base checkpoint. The example below illustrates the loading pattern only.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ling-2.6-flash-base"

# trust_remote_code=True runs the modeling code shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",  # place weights across the available devices
)

prompt = "Summarize the benefits of hybrid linear attention."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; the base model continues the text rather than following instructions.
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

For production inference, prefer serving stacks that support the released architecture and remote code path....
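Because this checkpoint is not instruction-aligned, completion-style prompts with a few worked examples tend to be a better fit than bare instructions. The helper below is a minimal sketch of that pattern. The function name `build_few_shot_prompt` and the `Question:`/`Answer:` layout are illustrative choices, not a format prescribed by the model card.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Format (question, answer) pairs, ending with an open answer for the model to complete."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)


examples = [
    ("What is 2 + 3?", "5"),
    ("What is 7 * 6?", "42"),
]
prompt = build_few_shot_prompt(examples, "What is 9 - 4?")
print(prompt)
```

The resulting string can be passed to the tokenizer in place of a bare instruction. A small `max_new_tokens` value helps keep the completion from running on into further invented questions.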
