Repo: Microsoft, published May 29, 2025

microsoft/dion

Python


Description: Dion optimizer algorithm

Language: Python

License: MIT

Stars: 484

Forks: 55

Open issues: 5

Created: 2025-05-29T20:49:18Z

Pushed: 2026-06-08T16:49:02Z

Default branch: main

Fork: no

Archived: no

README:

Welcome to the Microsoft/Dion Codebase

This repository provides efficient implementations of orthonormal optimizers for distributed ML training. The package exports the following optimizers: Dion2, Muon, NorMuon, and Dion.

Table of Contents

1. [Requirements](#-requirements)
2. [Quick Start](#-quick-start)
3. [Introduction](#introduction)
4. [Optimizers](#optimizers)
5. [Building Parameter Groups](#building-parameter-groups)
  • [Example Code](#example-code)
  • [Per-Head Newton-Schulz for Attention Projections](#per-head-newton-schulz-for-attention-projections)
6. [Distributed Training Configuration](#distributed-training-configuration)
  • [Flattened Meshes](#flattened-meshes)
  • [Device Mesh for Muon](#device-mesh-for-muon)
  • [Usage with DDP ProcessGroup](#usage-with-ddp-processgroup)
7. [Compressed Data-Parallel Gradient Sync](#compressed-data-parallel-gradient-sync)
  • [Usage with HSDP](#usage-with-hsdp)
  • [Example Code](#example-code-1)
  • [Usage with DDP](#usage-with-ddp)
  • [Checkpointing](#checkpointing)
8. [Best Practices](#best-practices)
9. [Experimental Features](#experimental-features)
  • [Mixed Precision Dion](#mixed-precision-dion)
  • [Accelerating Optimization Step for Lower Ranks](#accelerating-optimization-step-for-lower-ranks)
  • [Triton Kernels for Muon Newton-Schulz](#triton-kernels-for-muon-newton-schulz)
10. [Citation](#citation)

Requirements

This code is written for modern PyTorch (version 2.7 or newer) using DTensor-based parallelism. This includes FSDP2 with fully_shard and tensor parallelism (TP) with parallelize_module. Support for other distributed training APIs is not implemented.
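If you want to fail fast on an older PyTorch install, a small guard like the one below can help. This is only a sketch: the helper names `parse_major_minor` and `torch_is_new_enough` are hypothetical and not part of this repository. It compares versions as integer tuples so that a release such as 2.10 is not mistaken for something older than 2.7.

```python
import re

MIN_TORCH = (2, 7)  # minimum version stated in the Requirements section


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.7.1+cu126'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def torch_is_new_enough(version: str, minimum: tuple[int, int] = MIN_TORCH) -> bool:
    """Return True if `version` is at least `minimum` (hypothetical helper)."""
    # Tuple comparison is numeric, so (2, 10) >= (2, 7) holds as expected.
    return parse_major_minor(version) >= minimum
```

In a training script you could call `torch_is_new_enough(torch.__version__)` right after `import torch` and raise an error if it returns False.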

Quick Start

Our implementations are available as a pip package! Install to use in your project:

pip install git+https://github.com/microsoft/dion.git

Then in your code, you can use:

from dion import Dion2, Muon, NorMuon, Dion

Please read this README carefully for detailed instructions on using our optimizers. They differ in major ways from PyTorch built-in optimizers such as Adam/AdamW.

Running Our Sample Training Script

First clone this repo, then install dependencies for both Dion and training code:

git clone https://github.com/microsoft/dion.git
cd dion
pip install -e .[train]

Download pretokenized FineWeb dataset:

python data/cached_fineweb10B.py 30

Distributed Data Parallel (DDP) Training

To train a GPT-small model using Dion2 with 4 GPUs (adjust as needed for your setup):

torchrun --standalone --nproc_per_node=4 train.py --config configs/dion2_160m.yaml

This will launch Distributed Data Parallel (DDP) training.

Distributed Training: FSDP / TP / Hybrid Sharding

Fully Sharded Data Parallel (FSDP)

To enable FSDP, specify the FSDP group size using --fs_size:

torchrun --standalone --nproc_per_node=4 train.py \
    --config configs/dion2_160m.yaml \
    --fs_size 4

This configuration trains a GPT-small model using Dion2 with FSDP sharding across all 4 GPUs (a single FSDP group of size 4).

Hybrid Sharded Data Parallel (HSDP)

To use Hybrid Sharded Data Parallel, where multiple FSDP groups are replicated using Data Parallel (DP), set --fs_size smaller than the total number of GPUs and specify the data parallel dimension via --dp_size:

torchrun --standalone --nproc_per_node=4 train.py \
    --config configs/dion2_160m.yaml \
    --fs_size 2 \
    --dp_size 2

This configuration creates:

  • 2 FSDP groups, each spanning 2 GPUs
  • 2-way data parallelism across the FSDP groups
  • Total: 4 GPUs with 2-way FSDP × 2-way DP

The product dp_size × fs_size must equal world_size. Any unspecified dimension defaults to 1.
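To make the 2 × 2 layout concrete, the sketch below lists which ranks would share an FSDP group and which would share a data-parallel group. It assumes a row-major layout with DP as the outer dimension and FSDP as the inner one (rank = dp_index × fs_size + fs_index). That layout is an assumption for illustration; the rank assignment actually used by `train.py` is not shown in this excerpt, and `hsdp_groups` is a hypothetical helper.

```python
def hsdp_groups(world_size: int, dp_size: int, fs_size: int):
    """Return (fsdp_groups, dp_groups) as lists of rank lists.

    Assumed layout (not confirmed by the source): DP is the outer mesh
    dimension and FSDP the inner one, so rank = dp_index * fs_size + fs_index.
    """
    if dp_size * fs_size != world_size:
        raise ValueError(
            f"dp_size * fs_size = {dp_size * fs_size}, expected {world_size}"
        )
    # Ranks that shard one model replica among themselves.
    fsdp_groups = [
        [d * fs_size + f for f in range(fs_size)] for d in range(dp_size)
    ]
    # Ranks that hold the same shard in different replicas.
    dp_groups = [
        [d * fs_size + f for d in range(dp_size)] for f in range(fs_size)
    ]
    return fsdp_groups, dp_groups


fsdp, dp = hsdp_groups(world_size=4, dp_size=2, fs_size=2)
print("FSDP groups:", fsdp)
print("DP groups:  ", dp)
```

Under this assumed layout, parameters are sharded within each FSDP group and replicated across the ranks of each DP group.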

Tensor Parallelism (TP)

Note: Currently, only Dion (our legacy implementation) supports Tensor Parallelism.

You can combine all three parallelism strategies (DP × FSDP × TP). For example, a 2 × 2 × 2 configuration across 8 GPUs:

torchrun --standalone --nproc_per_node=8 train.py \
    --config configs/dion_160m.yaml \
    --dp_size 2 \
    --fs_size 2 \
    --tp_size 2

This configuration creates:

  • 2-way data parallelism (outer replication)
  • 2-way FSDP
  • 2-way tensor parallelism
  • Total: 8 GPUs with 2-way DP × 2-way FSDP × 2-way TP

The product dp_size × fs_size × tp_size must equal world_size. Any unspecified dimension defaults to 1.
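The sizing rule can be written as a short check. The sketch below applies the rule as stated, literally: unspecified dimensions become 1 and the product must match the world size. `resolve_mesh` is a hypothetical helper, not a function from this repository, and the sample script may treat some cases differently (for example, the plain DDP launch shown earlier passes no size flags at all).

```python
from math import prod
from typing import Optional


def resolve_mesh(
    world_size: int,
    dp_size: Optional[int] = None,
    fs_size: Optional[int] = None,
    tp_size: Optional[int] = None,
) -> dict[str, int]:
    """Fill unspecified dimensions with 1 and validate the product."""
    requested = {"dp": dp_size, "fs": fs_size, "tp": tp_size}
    # Any dimension left as None defaults to 1, per the stated rule.
    sizes = {name: 1 if size is None else size for name, size in requested.items()}
    for name, size in sizes.items():
        if size < 1:
            raise ValueError(f"{name}_size must be a positive integer, got {size}")
    if prod(sizes.values()) != world_size:
        raise ValueError(
            f"dp x fs x tp = {prod(sizes.values())}, but world_size = {world_size}"
        )
    return sizes


print(resolve_mesh(8, dp_size=2, fs_size=2, tp_size=2))
print(resolve_mesh(4, fs_size=4))
```

Running a check of this kind before building the device mesh turns a mismatched launch command into an immediate, readable error.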

Introduction

Optimization algorithms are essential to training neural networks, converting gradients into model weight updates to minimize loss. For many years, the method of choice has been Adam/AdamW. However, recent work has shown that orthonormal optimizers can significantly accelerate model convergence. Check out blog posts by Jeremy Bernstein and Laker Newhouse for more details.

The practical effectiveness of orthonormal optimizers was first demonstrated by Muon in the NanoGPT speedrun, and has since been validated at scale by models such as Kimi K2 and GLM-4.5. Muon implements orthonormalization via *Newton-Schulz iterations*, which rely on repeated matrix-matrix multiplications. However, large-scale training relies on model sharding, where weight matrices and optimizer states are distributed across multiple processes. As discussed by Essential AI, orthonormalizing a sharded matrix with Newton-Schulz iterations involves the communication-intensive procedure of reconstructing the full matrices from their individual shards.
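To illustrate why Newton-Schulz needs the full matrix, here is a minimal NumPy sketch of the classical cubic Newton-Schulz iteration. It is not Muon's implementation, whose tuned variant and coefficients are not shown in this excerpt; the textbook cubic form is swapped in for clarity. The function name and the step count are assumptions for illustration.

```python
import numpy as np


def newton_schulz_orthonormalize(G: np.ndarray, steps: int = 40) -> np.ndarray:
    """Approximate the orthogonal polar factor U @ V.T of G = U @ S @ V.T.

    Classical cubic iteration (illustrative, not Muon's tuned variant):
        X <- 1.5 * X - 0.5 * (X @ X.T) @ X
    """
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which lies inside the iteration's region of convergence.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        # Each step uses only matrix-matrix products on the full matrix.
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X


rng = np.random.default_rng(0)
G = rng.standard_normal((4, 6))  # stand-in for a gradient or momentum matrix
Q = newton_schulz_orthonormalize(G)
print(np.round(Q @ Q.T, 4))  # close to the 4 x 4 identity
```

Every step multiplies whole matrices together, so when `G` is sharded across processes the shards have to be gathered (or the products communicated) before the iteration can run. That is the cost the paragraph above describes.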

Dion/Dion2 are our methods for building a scalable, communication-efficient optimizer. Like Muon, they compute matrix weight updates based on matrix orthonormalization and share similar practical…

