What does this repo signal mean?

InclusionAI (Ant Group) published inclusionAI/linghe (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo inclusionAI/linghe · language Python · New repo with moderate stars. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

InclusionAI (Ant Group) Repo: inclusionAI/linghe

Captured source

GitHub · github.com/inclusionAI/linghe

inclusionAI/linghe repository metadata

published Oct 14, 2025 · seen Jun 5 · captured Jun 11 · http 200 · method plain

inclusionAI/linghe

Description: A high-performance kernel library for LLM training

Language: Python

License: MIT

Stars: 80

Forks: 10

Open issues: 1

Created: 2025-10-14T09:50:16Z

Pushed: 2026-04-28T09:54:12Z

Default branch: main

Fork: no

Archived: no

README: linghe

A library of high-performance kernels for LLM training.

Roadmap

---

Support more shapes and more GPU architectures.
Release our FP8 training kernels beyond blockwise quantization.

News or Update 🔥

---

[2025/07] We implement multiple kernels for FP8 training with Megatron-LM blockwise quantization.

Introduction

---

Our repo, linghe, is designed for LLM training, especially MoE training with FP8 quantization. It provides three main categories of kernels:

Fused quantization kernels: fuse quantization with the preceding layer, e.g., RMS norm and SiLU.
Memory-efficiency kernels: fuse multiple IO-intensive operations, e.g., RoPE with qk-norm.
Implementation-optimized kernels: use efficient Triton implementations, e.g., routing map padding instead of activation padding.
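To make the first category concrete, here is a minimal, unfused reference in plain Python of what an "RMS norm + blockwise quantization" step computes. It is a sketch under stated assumptions, not linghe's API: the function names, the block size of 128, and the choice of the FP8 E4M3 range (maximum finite value 448) are illustrative and are not taken from the repository. A fused kernel would produce the same values in a single pass over memory instead of two.

```python
import math

# Assumption: FP8 E4M3 target format, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def rms_norm(x, weight, eps=1e-6):
    """Reference RMS norm over one row: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def blockwise_scales(row, block_size):
    """One scale per contiguous block, chosen so the block's max maps to FP8_E4M3_MAX."""
    scales = []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        amax = max(abs(v) for v in block)
        # Guard against all-zero blocks to avoid dividing by zero later.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def rms_norm_then_quantize(x, weight, block_size=128):
    """Hypothetical unfused reference: normalize, then scale each block into FP8 range.

    Returns (scaled_values, scales). Rounding to real FP8 bit patterns is omitted,
    so scaled_values are still Python floats.
    """
    y = rms_norm(x, weight)
    scales = blockwise_scales(y, block_size)
    scaled = [v / scales[i // block_size] for i, v in enumerate(y)]
    return scaled, scales


if __name__ == "__main__":
    row = [1.0, -2.0, 3.0, -4.0]
    scaled, scales = rms_norm_then_quantize(row, [1.0] * 4, block_size=2)
    print(len(scales), [round(abs(v), 3) for v in scaled])
```

Multiplying each scaled value by its block's scale recovers the normalized row, which is the usual way to check a quantization kernel against a reference.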

Benchmark

---

We benchmark on H800 with batch size 8192, hidden size 2048, 256 experts, and 8 activated experts.

| kernel | baseline (us) | linghe (us) | speedup |
|--------|---------------|-------------|---------|
| RMSNorm+Quantization(forward) | 159.3 | 72.4 | 2.2 |
| Split+qk-norm+rope+transpose(forward) | 472 | 59.1 | 7.99 |
| Split+qk-norm+rope+transpose(backward) | 645 | 107.5 | 6.0 |
| Fp32 router gemm(forward) | 242.3 | 61.6 | 3.931 |
| Fp32 router gemm(backward) | 232.7 | 78.1 | 2.979 |
| Permute with padded indices | 388 | 229.4 | 1.69 |
| Unpermute with padding indices | 988.6 | 806.9 | 1.23 |
| Batch Silu+quantization(forward) | 6241.7 | 1181.7 | 5.28 |
| Batch Silu+quantization(backward) | 7147.7 | 2317.9 | 3.08 |
| Silu+quantization(forward) | 144.9 | 58.2 | 2.48 |
| Silu+quantization(backward) | 163.4 | 74.2 | 2.2 |
| fused linear gate(forward) | 160.4 | 46.9 | 3.42 |
| fused linear gate(backward) | 572.9 | 81.1 | 7.06 |
| Cross entropy(forward) | 2780.8 | 818.2 | 3.4 |
| Cross entropy(backward) | 7086.3 | 1781.0 | 3.98 |
| batch grad norm | 1733.7 | 1413.7 | 1.23 |
| Batch count zero | 4997.9 | 746.8 | 6.69 |
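The speedup column in the table above is the baseline time divided by the linghe time. The short sketch below recomputes it for three rows copied from the table; the `speedup` helper is a hypothetical name introduced here for illustration and is not part of the repository.

```python
def speedup(baseline_us, optimized_us):
    """Speedup factor: how many times faster the optimized kernel runs."""
    return baseline_us / optimized_us


# (kernel, baseline in us, linghe in us), copied from the benchmark table.
rows = [
    ("RMSNorm+Quantization(forward)", 159.3, 72.4),
    ("Split+qk-norm+rope+transpose(forward)", 472.0, 59.1),
    ("Batch count zero", 4997.9, 746.8),
]

for name, baseline_us, linghe_us in rows:
    print(f"{name}: {speedup(baseline_us, linghe_us):.2f}x")
# RMSNorm+Quantization(forward): 2.20x
# Split+qk-norm+rope+transpose(forward): 7.99x
# Batch count zero: 6.69x
```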

Other benchmark results can be obtained by running scripts in tests and benchmark folders.

Examples

---

Examples can be found in tests.

API Reference

---

Please refer to API

Citations

[TBD]

@misc{zhao2025linghe,
title={Linghe: Enabling Efficient Trillion-Scale LLM Training via Optimized Kernels},
author={Yao Zhao and Chen Liang and Jingyu Hu and Zixuan Cheng and Longfei Li}
}

Notability

notability 5.0/10

New repo with moderate stars