Repo · InclusionAI (Ant Group) · published Oct 14, 2025

inclusionAI/linghe

Python


Description: A high-performance kernel library for LLM training

Language: Python

License: MIT

Stars: 80

Forks: 10

Open issues: 1

Created: 2025-10-14T09:50:16Z

Pushed: 2026-04-28T09:54:12Z

Default branch: main

Fork: no

Archived: no

README: linghe

A library of high-performance kernels for LLM training.

Roadmap

---

  • Support more shapes and more GPU architectures.
  • Release our FP8 training kernels beyond blockwise quantization.

*News or Update* 🔥

---

  • [2025/07] We implemented multiple kernels for FP8 training with Megatron-LM blockwise quantization.

Introduction

---

Our repo, linghe, is designed for LLM training, especially MoE training with FP8 quantization. It provides three main categories of kernels:

  • Fused quantization kernels: fuse quantization with the preceding layer, e.g., RMSNorm and SiLU.
  • Memory-efficient kernels: fuse multiple IO-intensive operations, e.g., RoPE with qk-norm.
  • Implementation-optimized kernels: use efficient Triton implementations, e.g., padding the routing map instead of padding activations.
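To make the first category concrete, here is a minimal NumPy sketch of what an RMSNorm + blockwise quantization step computes. It is an unfused reference, not linghe's implementation or API: the function names, the 1x128 block size, and the use of the FP8 E4M3 maximum (448) as the scaling target are assumptions for illustration. A fused kernel would produce the same values and scales in a single pass instead of writing the normalized tensor to memory and reading it back.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3
BLOCK = 128           # assumed 1 x 128 quantization block along the hidden dim


def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm over the last dimension."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight


def blockwise_quantize(y, block=BLOCK):
    """Scale each 1 x `block` tile so its max magnitude maps to FP8_E4M3_MAX.

    Values stay in float32 here; a real kernel would cast them to an FP8
    dtype. Returns the scaled values and one scale per tile.
    """
    rows, cols = y.shape
    tiles = y.reshape(rows, cols // block, block)
    amax = np.abs(tiles).max(axis=-1, keepdims=True)
    scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX
    q = np.clip(tiles / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.reshape(rows, cols), scale.squeeze(-1)


def rms_norm_quant_reference(x, weight):
    """Unfused reference: normalize, then quantize (two passes over memory)."""
    return blockwise_quantize(rms_norm(x, weight))


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256)).astype(np.float32)
w = np.ones(256, dtype=np.float32)

q, scale = rms_norm_quant_reference(x, w)
# Dequantize to check that values * scales recover the normalized activations.
dequant = (q.reshape(4, 2, BLOCK) * scale[..., None]).reshape(4, 256)
```

A reference like this is mainly useful as a correctness oracle when testing a fused kernel against its unfused equivalent.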

Benchmark

---

We benchmark on an H800 GPU with batch size 8192, hidden size 2048, 256 experts, and 8 activated experts.

| kernel | baseline (us) | linghe (us) | speedup |
|--------|---------------|-------------|---------|
| RMSNorm+Quantization(forward) | 159.3 | 72.4 | 2.2 |
| Split+qk-norm+rope+transpose(forward) | 472 | 59.1 | 7.99 |
| Split+qk-norm+rope+transpose(backward) | 645 | 107.5 | 6.0 |
| Fp32 router gemm(forward) | 242.3 | 61.6 | 3.931 |
| Fp32 router gemm(backward) | 232.7 | 78.1 | 2.979 |
| Permute with padded indices | 388 | 229.4 | 1.69 |
| Unpermute with padding indices | 988.6 | 806.9 | 1.23 |
| Batch Silu+quantization(forward) | 6241.7 | 1181.7 | 5.28 |
| Batch Silu+quantization(backward) | 7147.7 | 2317.9 | 3.08 |
| Silu+quantization(forward) | 144.9 | 58.2 | 2.48 |
| Silu+quantization(backward) | 163.4 | 74.2 | 2.2 |
| fused linear gate(forward) | 160.4 | 46.9 | 3.42 |
| fused linear gate(backward) | 572.9 | 81.1 | 7.06 |
| Cross entropy(forward) | 2780.8 | 818.2 | 3.4 |
| Cross entropy(backward) | 7086.3 | 1781.0 | 3.98 |
| batch grad norm | 1733.7 | 1413.7 | 1.23 |
| Batch count zero | 4997.9 | 746.8 | 6.69 |
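The speedup column appears to be the baseline time divided by the linghe time. The short sketch below recomputes it for three rows copied from the table; the `speedup` helper is a hypothetical name used only for illustration. Some table entries seem to be truncated or derived from unrounded measurements, so recomputed values may differ in the last digit.

```python
# (kernel, baseline_us, linghe_us) copied from the benchmark table
rows = [
    ("RMSNorm+Quantization(forward)", 159.3, 72.4),
    ("Split+qk-norm+rope+transpose(backward)", 645.0, 107.5),
    ("Batch count zero", 4997.9, 746.8),
]


def speedup(baseline_us, linghe_us):
    """Ratio of baseline latency to linghe latency (higher is better)."""
    return baseline_us / linghe_us


for name, baseline_us, linghe_us in rows:
    print(f"{name}: {speedup(baseline_us, linghe_us):.2f}x")
```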

Other benchmark results can be obtained by running the scripts in the tests and benchmark folders.

Examples

---

Examples can be found in the tests folder.

API Reference

---

Please refer to the API documentation.

Citations

[TBD]

@misc{zhao2025linghe,
title={Linghe: Enabling Efficient Trillion-Scale LLM Training via Optimized Kernels},
author={Yao Zhao and Chen Liang and Jingyu Hu and Zixuan Cheng and Longfei Li}
}
