Fork · Nous Research · published Oct 15, 2025

NousResearch/Liger-Kernel

forked from linkedin/Liger-Kernel


NousResearch/Liger-Kernel

Description: Efficient Triton Kernels for LLM Training

License: BSD-2-Clause

Stars: 5

Forks: 2

Open issues: 0

Created: 2025-10-15T15:54:49Z

Pushed: 2025-10-16T02:57:26Z

Default branch: main

Fork: yes

Parent repository: linkedin/Liger-Kernel

Archived: no

README:

Liger Kernel: Efficient Triton Kernels for LLM Training


[Installation](#installation) | [Getting Started](#getting-started) | [Examples](#examples) | [High-level APIs](#high-level-apis) | [Low-level APIs](#low-level-apis) | [Cite our work](#cite-this-work)


Liger Kernel is a collection of Triton kernels designed specifically for LLM training. It can increase multi-GPU training throughput by 20% and reduce memory usage by 60%. We have implemented Hugging Face-compatible RMSNorm, RoPE, SwiGLU, CrossEntropy, and FusedLinearCrossEntropy, with more to come. The kernels work out of the box with Flash Attention, PyTorch FSDP, and Microsoft DeepSpeed. We welcome contributions from the community to gather the best kernels for LLM training.

We've also added optimized post-training kernels that deliver up to 80% memory savings for alignment and distillation tasks. We support losses such as DPO, CPO, ORPO, SimPO, KTO, JSD, and many more. Check out how we optimize the memory.

You can view the documentation site for additional installation, usage examples, and API references: https://linkedin.github.io/Liger-Kernel/

Supercharge Your Model with Liger Kernel


With one line of code, Liger Kernel can increase throughput by more than 20% and reduce memory usage by 60%, thereby enabling longer context lengths, larger batch sizes, and massive vocabularies.

| Speed Up | Memory Reduction |
|--------------------------|-------------------------|
| [Figure: Speed up] | [Figure: Memory] |

> Note:
> - Benchmark conditions: LLaMA 3-8B, Batch Size = 8, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 8 A100s.
> - Hugging Face models start to OOM at a 4K context length, whereas Hugging Face + Liger Kernel scales up to 16K.
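A rough back-of-the-envelope sketch helps show why long contexts and large vocabularies strain memory. It assumes a vocabulary of 128,256 tokens (the Llama 3 tokenizer size, which is not stated above), 2 bytes per bf16 element, and that the batch size of 8 applies to a single forward pass. `logits_bytes` is a hypothetical helper for illustration, not part of Liger Kernel.

```python
def logits_bytes(batch_size: int, seq_len: int, vocab_size: int,
                 bytes_per_elem: int = 2) -> int:
    """Bytes needed to materialize a full (batch, seq, vocab) logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_elem


VOCAB = 128_256  # assumption: Llama 3 tokenizer vocabulary size

for seq_len in (4096, 16384):
    gib = logits_bytes(8, seq_len, VOCAB) / 2**30
    print(f"context {seq_len:>5}: {gib:.2f} GiB for the logits tensor alone")
```

Under these assumptions the logits tensor grows linearly with context length, so a 16K context needs four times the logits memory of a 4K context before any gradient or upcast copies are counted. This is one plausible reason why avoiding a fully materialized logits tensor matters at these scales.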

Optimize Post Training with Liger Kernel

We provide optimized post-training kernels for losses such as DPO, ORPO, and SimPO, which can reduce memory usage by up to 80%. You can use them directly as Python modules.

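from liger_kernel.chunked_loss import LigerFusedLinearORPOLoss

# lm_head, x, and target must already exist in the caller's scope:
# the output projection layer, its input, and the labels.
orpo_loss = LigerFusedLinearORPOLoss()
y = orpo_loss(lm_head.weight, x, target)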
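For readers unfamiliar with ORPO, the objective can be sketched on scalars in plain Python. This is a reference illustration of the published ORPO formula (supervised loss plus a weighted odds-ratio penalty), not Liger Kernel's fused implementation. The function names and the `beta` argument are hypothetical, and the inputs are assumed to be length-averaged log-probabilities of the chosen and rejected responses.

```python
import math


def log_odds(logp: float) -> float:
    """log(p / (1 - p)) computed from log p, valid for 0 < p < 1."""
    return logp - math.log1p(-math.exp(logp))


def orpo_loss_reference(logp_chosen: float, logp_rejected: float,
                        nll_chosen: float, beta: float = 0.1) -> float:
    """Scalar ORPO objective: SFT loss plus a weighted odds-ratio penalty.

    `beta` is an illustrative name for the penalty weight (an assumption).
    """
    log_odds_ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log(sigmoid(z)) is equal to log(1 + exp(-z))
    ratio_penalty = math.log1p(math.exp(-log_odds_ratio))
    return nll_chosen + beta * ratio_penalty


# The penalty shrinks as the chosen response becomes more likely than the rejected one.
print(orpo_loss_reference(-0.2, -1.0, nll_chosen=0.2))
print(orpo_loss_reference(-1.0, -0.2, nll_chosen=1.0))
```

A fused, chunked kernel would compute the same quantity from hidden states and the LM head weight without holding every logit in memory at once, which is where memory savings of the kind claimed above would come from.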

Examples

| Use Case | Description |
|------------------------------------------------|---------------------------------------------------------------------------------------------------|
| **Hugging Face Trainer** | Train LLaMA 3-8B ~20% faster with over 40% memory reduction on the Alpaca dataset using 4 A100s with FSDP |
| **Lightning Trainer** | Increase throughput by 15% and reduce memory usage by 40% with LLaMA 3-8B on the MMLU dataset using 8 A100s with DeepSpeed ZeRO3 |
| **Medusa Multi-head LLM (Retraining Phase)** | Reduce memory usage by 80% with 5 LM heads and improve throughput by 40% using 8 A100s with FSDP |
| **Vision-Language Model SFT** | Finetune Qwen2-VL on image-text data using 4 A100s with FSDP |
| **Liger ORPO Trainer** | Align Llama 3.2 using Liger ORPO Trainer with FSDP with 50% memory reduction |

Key Features

  • Ease of use: Simply patch your Hugging Face model with one line of code, or compose your own model using our Liger Kernel modules.
  • Time and memory efficient: In the same spirit as Flash-Attn, but for layers like RMSNorm, RoPE, SwiGLU, and CrossEntropy! Increases multi-GPU training throughput by 20% and reduces memory usage by 60% with kernel fusion, in-place replacement, and chunking techniques.
  • Exact: Computation is exact—no approximations! Both forward and backward passes are implemented with rigorous unit tests and undergo convergence testing against training runs without Liger Kernel to ensure accuracy.
  • Lightweight: Liger Kernel has minimal dependencies, requiring only Torch and Triton—no extra libraries needed! Say goodbye to dependency…
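The chunking idea and the "exact" claim in the list above can be illustrated with a toy, pure-Python cross-entropy. This is a conceptual sketch only: the function names and data are hypothetical, and it does not reflect how the Triton kernels are written. It shows that computing the loss one chunk of rows at a time gives the same mean loss as computing it over all rows at once, while only one chunk of logits exists at any moment.

```python
import math


def chunk_loss_sum(hidden_chunk, weight, target_chunk):
    """Sum of per-token cross-entropy losses for one chunk of rows."""
    # Materialize logits only for this chunk: shape (chunk, vocab).
    logits = [[sum(h_i * w_i for h_i, w_i in zip(h, w_row)) for w_row in weight]
              for h in hidden_chunk]
    total = 0.0
    for row, t in zip(logits, target_chunk):
        m = max(row)  # subtract the max for a stable log-sum-exp
        lse = m + math.log(sum(math.exp(z - m) for z in row))
        total += lse - row[t]
    return total


def chunked_cross_entropy(hidden, weight, targets, chunk_size):
    """Mean cross-entropy, accumulated chunk by chunk."""
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        end = start + chunk_size
        total += chunk_loss_sum(hidden[start:end], weight, targets[start:end])
    return total / len(hidden)


# Toy data: 5 tokens, hidden size 2, vocabulary of 3.
hidden = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.0, 0.7], [0.9, -0.1]]
weight = [[0.3, -0.1], [-0.4, 0.8], [0.2, 0.5]]
targets = [0, 2, 1, 1, 0]

full = chunk_loss_sum(hidden, weight, targets) / len(hidden)
chunked = chunked_cross_entropy(hidden, weight, targets, chunk_size=2)
print(math.isclose(full, chunked, rel_tol=1e-12))
```

The two results agree up to floating-point summation order, since chunking only changes when each row's logits are computed, not the arithmetic applied to them.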


Notability

notability 1.0/10

Trivial fork with minimal traction