microsoft/dion
Description: Dion optimizer algorithm
Language: Python
License: MIT
Stars: 484
Forks: 55
Open issues: 5
Created: 2025-05-29T20:49:18Z
Pushed: 2026-06-08T16:49:02Z
Default branch: main
Fork: no
Archived: no
README:
Welcome to the Microsoft/Dion Codebase
This repository provides efficient implementations of orthonormal optimizers for distributed ML training, including Dion2, Muon, NorMuon, and Dion.
Table of Contents
1. [Requirements](#-requirements)
1. [Quick Start](#-quick-start)
1. [Introduction](#introduction)
1. [Optimizers](#optimizers)
1. [Building Parameter Groups](#building-parameter-groups)
- [Example Code](#example-code)
- [Per-Head Newton-Schulz for Attention Projections](#per-head-newton-schulz-for-attention-projections)
1. [Distributed Training Configuration](#distributed-training-configuration)
- [Flattened Meshes](#flattened-meshes)
- [Device Mesh for Muon](#device-mesh-for-muon)
- [Usage with DDP ProcessGroup](#usage-with-ddp-processgroup)
1. [Compressed Data-Parallel Gradient Sync](#compressed-data-parallel-gradient-sync)
- [Usage with HSDP](#usage-with-hsdp)
- [Example Code](#example-code-1)
- [Usage with DDP](#usage-with-ddp)
- [Checkpointing](#checkpointing)
1. [Best Practices](#best-practices)
1. [Experimental Features](#experimental-features)
- [Mixed Precision Dion](#mixed-precision-dion)
- [Accelerating Optimization Step for Lower Ranks](#accelerating-optimization-step-for-lower-ranks)
- [Triton Kernels for Muon Newton-Schulz](#triton-kernels-for-muon-newton-schulz)
1. [Citation](#citation)
Requirements
This code is written for modern PyTorch (version 2.7 or newer) using DTensor-based parallelism. This includes FSDP2 with fully_shard and tensor parallelism (TP) with parallelize_module. Support for other distributed training APIs is not implemented.
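The version requirement above can be checked before importing the optimizers. The following is a minimal sketch, not part of this repository: `meets_min_version` and `MIN_TORCH` are hypothetical names, and the check compares only the major and minor components of a version string such as the one in `torch.__version__`.

```python
import re

# Minimum (major, minor) PyTorch version stated in the requirements above.
MIN_TORCH = (2, 7)


def meets_min_version(version, minimum=MIN_TORCH):
    """Return True if a version string like '2.7.1+cu126' is at least `minimum`.

    Hypothetical helper for illustration; it parses only 'major.minor'.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


print(meets_min_version("2.7.1+cu126"))  # True
print(meets_min_version("2.6.0"))        # False
print(meets_min_version("2.10.0"))       # True (a plain string comparison would get this wrong)
```

In a real script the argument would be `torch.__version__`, for example `meets_min_version(torch.__version__)`.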
Quick Start
Our implementations are available as a pip package! Install to use in your project:
pip install git+https://github.com/microsoft/dion.git
Then in your code, you can use:
from dion import Dion2, Muon, NorMuon, Dion
Please read this README carefully for detailed instructions on using our optimizers. They differ in major ways from PyTorch built-in optimizers such as Adam/AdamW.
Running Our Sample Training Script
First clone this repo, then install dependencies for both Dion and training code:
git clone https://github.com/microsoft/dion.git
cd dion
pip install -e .[train]
Download pretokenized FineWeb dataset:
python data/cached_fineweb10B.py 30
Distributed Data Parallel (DDP) Training
To train a GPT-small model using Dion2 with 4 GPUs (adjust as needed for your setup):
torchrun --standalone --nproc_per_node=4 train.py --config configs/dion2_160m.yaml
This will launch Distributed Data Parallel (DDP) training.
Distributed Training: FSDP / TP / Hybrid Sharding
Fully Sharded Data Parallel (FSDP)
To enable FSDP, specify the FSDP group size using --fs_size:
torchrun --standalone --nproc_per_node=4 train.py \
    --config configs/dion2_160m.yaml \
    --fs_size 4
This configuration trains a GPT-small model using Dion2 with FSDP sharding across all 4 GPUs (a single FSDP group of size 4).
Hybrid Sharded Data Parallel (HSDP)
To use Hybrid Sharded Data Parallel, where multiple FSDP groups are replicated using Data Parallel (DP), set --fs_size smaller than the total number of GPUs and specify the data parallel dimension via --dp_size:
torchrun --standalone --nproc_per_node=4 train.py \
    --config configs/dion2_160m.yaml \
    --fs_size 2 \
    --dp_size 2
This configuration creates:
- 2 FSDP groups, each spanning 2 GPUs
- 2-way data parallelism across the FSDP groups
- Total: 4 GPUs with 2-way FSDP × 2-way DP
The product dp_size × fs_size must equal world_size. Any unspecified dimension defaults to 1.
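To make the 2 × 2 layout above concrete, here is a minimal sketch of which ranks could end up in each group. It assumes ranks fill a `(dp_size, fs_size)` grid in row-major order, so every FSDP group is a run of consecutive ranks; the actual ordering used by `train.py` is not shown in this excerpt, and `hsdp_groups` is a hypothetical helper, not part of the repository.

```python
def hsdp_groups(world_size, dp_size, fs_size):
    """Return (fsdp_groups, dp_groups) as lists of rank lists.

    Assumption: ranks are laid out row-major on a (dp_size, fs_size) grid,
    with the FSDP dimension innermost.
    """
    if dp_size * fs_size != world_size:
        raise ValueError(
            f"dp_size * fs_size = {dp_size * fs_size}, expected {world_size}"
        )
    # Row d of the grid holds the ranks of the d-th FSDP group.
    grid = [[d * fs_size + f for f in range(fs_size)] for d in range(dp_size)]
    fsdp_groups = grid
    # Each column holds the ranks that own the same shard in different replicas.
    dp_groups = [list(column) for column in zip(*grid)]
    return fsdp_groups, dp_groups


fsdp_groups, dp_groups = hsdp_groups(world_size=4, dp_size=2, fs_size=2)
print(fsdp_groups)  # [[0, 1], [2, 3]]
print(dp_groups)    # [[0, 2], [1, 3]]
```

Under this assumed layout, parameters are sharded within `[0, 1]` and within `[2, 3]`, while ranks 0 and 2 (and likewise 1 and 3) hold replicas of the same shard.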
Tensor Parallelism (TP)
Note: Currently, only Dion (our legacy implementation) supports Tensor Parallelism.
You can combine all three parallelism strategies (DP × FSDP × TP). For example, a 2 × 2 × 2 configuration across 8 GPUs:
torchrun --standalone --nproc_per_node=8 train.py \
    --config configs/dion_160m.yaml \
    --dp_size 2 \
    --fs_size 2 \
    --tp_size 2
This configuration creates:
- 2-way data parallelism (outer replication)
- 2-way FSDP
- 2-way tensor parallelism
- Total: 8 GPUs with 2-way DP × 2-way FSDP × 2-way TP
The product dp_size × fs_size × tp_size must equal world_size. Any unspecified dimension defaults to 1.
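The sizing rule above can be written down as a small check. This is a sketch under stated assumptions rather than the script's own logic: `resolve_parallel_sizes` is a hypothetical helper, and it models only launches that pass at least one size flag (how the sample script handles a launch with no size flags, as in the DDP example, is not covered here).

```python
import math


def resolve_parallel_sizes(world_size, dp_size=None, fs_size=None, tp_size=None):
    """Fill unspecified sizes with 1 and check the product against world_size.

    Hypothetical helper illustrating the stated rule; not part of the repository.
    """
    requested = {"dp_size": dp_size, "fs_size": fs_size, "tp_size": tp_size}
    # Any unspecified dimension defaults to 1.
    resolved = {name: 1 if value is None else value for name, value in requested.items()}
    product = math.prod(resolved.values())
    if product != world_size:
        raise ValueError(
            f"dp_size * fs_size * tp_size = {product}, but world_size = {world_size}"
        )
    return resolved


print(resolve_parallel_sizes(8, dp_size=2, fs_size=2, tp_size=2))
# {'dp_size': 2, 'fs_size': 2, 'tp_size': 2}
print(resolve_parallel_sizes(4, fs_size=4))
# {'dp_size': 1, 'fs_size': 4, 'tp_size': 1}
```

Passing `fs_size=2` alone with `world_size=4` raises a `ValueError`, since the product would be 2 rather than 4.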
Introduction
Optimization algorithms are essential to training neural networks, converting gradients into model weight updates to minimize loss. For many years, the method of choice has been Adam/AdamW. However, recent work has shown that orthonormal optimizers can significantly accelerate model convergence. Check out blog posts by Jeremy Bernstein and Laker Newhouse for more details.
The practical effectiveness of orthonormal optimizers was first demonstrated by Muon in the NanoGPT speedrun, and has since been validated at scale by models such as Kimi K2 and GLM-4.5. Muon implements orthonormalization via *Newton-Schulz iterations*, which rely on repeated matrix-matrix multiplications. Large-scale training, however, relies on model sharding, where weight matrices and optimizer states are distributed across multiple processes. As discussed by Essential AI, orthonormalizing a sharded matrix with Newton-Schulz iterations requires reconstructing the full matrices from their individual shards, which is communication-intensive.
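As a rough illustration of orthonormalization by repeated matrix multiplication, here is a minimal sketch of the classical cubic Newton-Schulz iteration, X ← 1.5·X − 0.5·(X Xᵀ)X. This is an assumption made for illustration: it is the textbook form, not necessarily the polynomial or coefficients used by Muon or by this repository. It is written in dependency-free Python with hypothetical helper names; a real implementation would run batched matmuls on the GPU.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthonormalize(g, steps=30):
    """Approximate the orthonormal polar factor of a full-rank matrix g.

    Expects rows <= columns. Every step needs the whole matrix, which is why
    a sharded matrix must first be gathered from its shards.
    """
    # Scale by the Frobenius norm so all singular values lie in (0, 1].
    frobenius = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / frobenius for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # Cubic update: pushes every singular value toward 1.
        x = [[1.5 * p - 0.5 * q for p, q in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x


grad = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
q = newton_schulz_orthonormalize(grad)
gram = matmul(q, transpose(q))  # close to the 2 x 2 identity if rows are orthonormal
print(gram)
```

The loop body is only two matrix products and an elementwise combination, which is what makes the method GPU-friendly on a single device and communication-heavy once `x` is split across processes.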
Dion/Dion2 are our methods for building a scalable, communication-efficient optimizer. Like Muon, they compute matrix weight updates based on matrix orthonormalization and share similar practical…
Excerpt shown — open the source for the full document.
Notability: 6.0/10 (solid new repo with moderate traction)