Repo · Amazon (Nova) · published Apr 15, 2026

amazon-science/expert-upcycling

Python



Language: Python

License: NOASSERTION

Stars: 14

Forks: 2

Open issues: 0

Created: 2026-04-15T23:52:22Z

Pushed: 2026-04-15T23:58:50Z

Default branch: main

Fork: no

Archived: yes

README:

Expert Upcycling

Capacity expansion for Mixture-of-Experts models during continued pre-training.

> Dwivedi et al., *"Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts"* (preprint).

Scaling laws show that MoE quality improves predictably with total expert count at fixed active computation, but training large MoEs from scratch is expensive — memory, gradients, and all-to-all communication all scale with total parameters. Expert upcycling sidesteps this by starting training with a smaller E-expert model and expanding to mE experts mid-training via the upcycling operator:

1. Expert replication — each expert is duplicated (high-utility experts receive more copies via gradient-based importance scores).
2. Router extension — router weights are copied to new slots with small bias perturbations to seed routing diversity.
3. Continued pre-training (CPT) — stochastic gradient diversity and loss-free load balancing break symmetry among duplicates, driving specialization.
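Steps 1 and 2 can be sketched on toy data in plain Python. This is a minimal illustration, not the package's implementation: the function name `upcycle`, the list-based weight layout, the uniform replication factor `m`, and the choice to perturb only the new slots' biases are all assumptions made for the example.

```python
import random

def upcycle(experts, router_w, router_b, m=2, bias_noise_scale=0.01, seed=0):
    """Toy upcycling operator: expand E experts to m*E (hypothetical helper)."""
    rng = random.Random(seed)
    num_original = len(experts)
    new_experts, new_w, new_b = [], [], []
    for _ in range(m):  # lay the copies out block by block
        for expert, w_row, bias in zip(experts, router_w, router_b):
            new_experts.append(list(expert))  # independent copy of expert weights
            new_w.append(list(w_row))         # router weights copied unchanged
            new_b.append(bias)
    # Assumption: only the newly added slots get a small bias perturbation,
    # so each duplicate produces a slightly different routing logit.
    for slot in range(num_original, m * num_original):
        new_b[slot] += rng.gauss(0.0, bias_noise_scale)
    return new_experts, new_w, new_b

experts = [[0.1, 0.2], [0.3, 0.4]]    # two toy experts
router_w = [[1.0, 0.0], [0.0, 1.0]]   # one router row per expert
router_b = [0.0, 0.0]

new_experts, new_w, new_b = upcycle(experts, router_w, router_b)
print(len(new_experts), "experts after upcycling")
print("router biases:", new_b)
```

Step 3 (continued pre-training) is what actually drives the duplicates apart; the sketch only shows the state of the model at the moment of expansion.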

Top-K routing is held fixed throughout, so per-token inference cost is unchanged.
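A small back-of-the-envelope calculation shows why a fixed Top-K keeps the active parameter count constant while the total grows. The sizes below are made-up placeholders, not the paper's model dimensions, and the sketch ignores the router's own (small) projection, which does grow with the expert count.

```python
def moe_param_counts(num_experts, top_k, params_per_expert, shared_params):
    """Return (total, active-per-token) parameter counts for one toy MoE."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert  # only K experts run per token
    return total, active

# Hypothetical sizes chosen only for illustration.
before = moe_param_counts(num_experts=32, top_k=2,
                          params_per_expert=10_000_000, shared_params=50_000_000)
after = moe_param_counts(num_experts=64, top_k=2,
                         params_per_expert=10_000_000, shared_params=50_000_000)

print("total params: ", before[0], "->", after[0])
print("active params:", before[1], "->", after[1])
```

Doubling the experts raises the total from 370M to 690M in this toy setup, while the active count stays at 70M.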

![Expert Upcycling](assets/figure_optE_hires.png) *Figure 1: Overview of the expert upcycling procedure.*

Key results on a 7B→13B total parameter (1B active) interleaved MoE, pre-trained on 380B tokens:

  • The upcycled model (32→64 experts) matches the fixed-size 64-expert baseline across 11 downstream benchmarks (56.4 vs. 56.7 avg accuracy) and validation loss (1.263 vs. 1.267).
  • GPU hours drop by ~32% (27,888 vs. 41,328). When a pre-trained checkpoint already exists (e.g., from a prior training run or a public release), the pre-training cost is already paid and only the CPT phase is needed, which raises the savings to ~67%.
  • Results generalize to full MoE architectures (256→512 experts, TopK=8) with 93–95% gap closure across scales from 154M to 1B total parameters.
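The ~32% figure follows directly from the two GPU-hour totals quoted above. The variable names are chosen for this example only.

```python
baseline_hours = 41_328   # fixed-size 64-expert baseline
upcycled_hours = 27_888   # 32 -> 64 expert upcycled run

hours_saved = baseline_hours - upcycled_hours
savings = hours_saved / baseline_hours

print(f"saved {hours_saved:,} GPU hours ({savings:.1%})")
```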

![Results](assets/barplot.JPG) *Figure 2: GPU hours, validation loss, and downstream accuracy for the 7B→13B upcycled model vs. baselines.*

Installation

Recommended: NeMo 2.x container

Start from the official NeMo container — PyTorch, Megatron-LM, Transformer Engine, NeMo, Lightning, and omegaconf are all pre-installed.

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-v /path/to/expert-upcycling:/workspace/expert-upcycling \
-it nvcr.io/nvidia/nemo:24.09 bash

# Inside the container:
cd /workspace/expert-upcycling
pip install -e .
pip install dacite

> Do not use pip install -e ".[nemo]" inside the container — it would conflict with the container's pre-installed NeMo.

From scratch (no NeMo container)

Install dependencies manually, then install the package with the relevant extras:

# Core only (torch + numpy):
pip install -e .
pip install dacite

# With Megatron-LM integration:
pip install -e ".[megatron]"

# Full NeMo entrypoint (installs NeMo, Lightning, omegaconf):
pip install -e ".[nemo]"
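Before picking an install path, it can help to check which dependencies the current environment already provides. The sketch below uses only the standard library; the import names in the list (for example `megatron` and `lightning`) are assumptions about how those packages are exposed and may differ by version.

```python
import importlib.util

def available(modules):
    """Map each top-level module name to whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

# Assumed top-level import names; adjust to match your environment.
candidates = ["torch", "numpy", "dacite", "megatron", "nemo", "lightning", "omegaconf"]

for name, found in available(candidates).items():
    print(f"{name:10s} {'found' if found else 'missing'}")
```

If NeMo and Megatron-LM already show as found (as they should inside the NeMo container), the plain `pip install -e .` path is the one to use.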

Quick Start

Option A: NeMo entrypoint (recommended)

Edit configs/upcycle.yaml to set your model dimensions, then run from the repo root:

# Single GPU
cd /workspace/expert-upcycling
python -m expert_upcycling.entrypoint \
--config-path=configs --config-name=upcycle \
resume.restore_config.path=/path/to/base/checkpoint

# Multi-GPU (e.g. 8 GPUs with tensor parallelism)
torchrun --nproc_per_node=8 -m expert_upcycling.entrypoint \
--config-path=configs --config-name=upcycle \
resume.restore_config.path=/path/to/base/checkpoint \
strategy.tensor_model_parallel_size=8

The callback fires on the first optimizer step, doubles the expert count, saves the upcycled checkpoint, and exits. The output path defaults to -upcycled.
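The `key=value` arguments in the commands above look like dotted config overrides. As a rough illustration of how such a key addresses a nested field, here is a toy resolver in plain Python. It is not the config system the entrypoint uses, and real config libraries also parse value types, which this sketch does not.

```python
def apply_override(cfg, override):
    """Set a nested dict entry from a 'a.b.c=value' string (toy helper)."""
    key, _, value = override.partition("=")
    *parents, leaf = key.split(".")
    node = cfg
    for part in parents:
        node = node.setdefault(part, {})  # create intermediate levels if missing
    node[leaf] = value                    # value stays a string in this sketch
    return cfg

cfg = {"resume": {"restore_config": {"path": None}}, "strategy": {}}
apply_override(cfg, "resume.restore_config.path=/path/to/base/checkpoint")
apply_override(cfg, "strategy.tensor_model_parallel_size=8")
print(cfg)
```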

Option B: Patch existing training script

import expert_upcycling
expert_upcycling.apply_patches()

# Now TEGroupedMLP has .upcycle_experts() and TopKRouter has .upcycle_router()
# Call them during training at the desired transition point.
# Note: model is typically wrapped — unwrap to reach the decoder:
inner = model
for attr in ("module", "module"):  # unwrap up to two wrapper levels
    if hasattr(inner, attr):
        inner = getattr(inner, attr)

# optimizer, expert_cfg and router_cfg come from your training script.
for i, layer in enumerate(inner.decoder.layers):
    if hasattr(layer.mlp, 'experts'):
        selected = layer.mlp.experts.upcycle_experts(optimizer, i, expert_cfg)
    if hasattr(layer.mlp, 'router'):
        layer.mlp.router.upcycle_router(router_cfg, selected)
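The unwrapping step can be generalized to any number of wrapper levels. The helper below is a sketch with hypothetical names (`unwrap`, `Wrapper`, `Core`), and it assumes every wrapper exposes the wrapped model as `.module`, as in the snippet above.

```python
def unwrap(model, attr="module", max_depth=8):
    """Follow `.module` links until the innermost model is reached."""
    inner = model
    for _ in range(max_depth):  # bounded loop guards against cyclic references
        if not hasattr(inner, attr):
            break
        inner = getattr(inner, attr)
    return inner

class Core:
    """Stand-in for the real model that owns the decoder."""
    decoder = "decoder-stack"

class Wrapper:
    """Stand-in for a training wrapper that stores the model as .module."""
    def __init__(self, module):
        self.module = module

wrapped = Wrapper(Wrapper(Wrapper(Core())))
print(unwrap(wrapped).decoder)
```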

Option C: Use the model-level API

from expert_upcycling import perform_expert_upcycling

perform_expert_upcycling(
model, optimizer,
expert_cfg={"usefulness_metric": "gradient_norm", "selection_strategy": "greedy"},
router_cfg={"method": "bias_only", "bias_noise_scale": 0.01},
)

Upcycling Strategies

Expert duplication

| Strategy | Description |
|---|---|
| Utility-based (recommended) | Duplicate high-importance experts using gradient-based scores (weight norm, saliency, gradient squared, approx Fisher) |
| copy | Exact duplication (baseline) |
| copy_noise | Duplication + Gaussian noise |
| drop_upcycle | Re-initialize a fraction of columns |
| svd_perturb | SVD decomposition + perturbation |
| + 6 more | See expert_upcycling.config.UpcycleMethod |
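One way to picture utility-based duplication is as a greedy allocation of copies. The sketch below is an assumption about how such a scheme could work, not the package's algorithm: every expert keeps one slot, and each extra slot goes to the expert with the highest utility score per existing copy. The name `allocate_copies` and the scores are hypothetical.

```python
import heapq

def allocate_copies(scores, target_total):
    """Greedily assign target_total slots across experts by utility score."""
    num_experts = len(scores)
    if target_total < num_experts:
        raise ValueError("target_total must be at least the number of experts")
    copies = [1] * num_experts  # every expert keeps its original slot
    # Max-heap (via negated keys) on score per existing copy.
    heap = [(-score, idx) for idx, score in enumerate(scores)]
    heapq.heapify(heap)
    for _ in range(target_total - num_experts):
        _, idx = heapq.heappop(heap)
        copies[idx] += 1
        heapq.heappush(heap, (-scores[idx] / copies[idx], idx))
    return copies

# Hypothetical per-expert utility scores (e.g. gradient norms).
print(allocate_copies([4.0, 1.0, 1.0, 2.0], target_total=8))
```

With these scores the result is `[4, 1, 1, 2]`: the copy counts end up proportional to utility, and uniform scores reduce to plain doubling.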

Router expansion

| Strategy | Description |
|---|---|
| bias_only (recommended) | Keep weights identical, add noise to bias |
| copy | Exact duplication |
| copy_noise | Duplication + noise |
| + 7 more | See expert_upcycling.config.RouterUpcycleMethod |

Architecture

This package treats Megatron-LM and NeMo as third-party dependencies — no fork required. Upcycling methods are injected at runtime via monkey-patching:

expert-upcycling/                # pip install -e .
├── expert_upcycling/
│   ├── config.py                # All enums + dataclasses (no deps)
│   ├── expert_upcycler.py       # Heuristic strategies (torch only)
│   ├── expert_selector.py       # Utility-based selection (torch + numpy)
│   ├── router_upcycler.py       # Router strategies (torch only)
│   ├── optimizer_utils.py       # Optimizer state handling (torch only)
│   ├── patch.py                 # Monkey-patches onto Megatron-LM classes
│   ├── upcycle_model.py         # Model traversal
│   └── entrypoint.py            # NeMo launch script
├── configs/
│   └── upcycle.yaml             # Example config
└── scripts/
    └── run_upcycle.sh           #…
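Runtime injection via monkey-patching can be illustrated without any Megatron-LM dependency. In the sketch below, `GroupedMLP` is a stand-in class and `_upcycle_experts` a placeholder method; neither reflects the real signatures in `patch.py`.

```python
class GroupedMLP:
    """Stand-in for a third-party class that cannot be edited directly."""
    def __init__(self, num_experts):
        self.num_experts = num_experts

def _upcycle_experts(self, factor=2):
    """Placeholder method: only tracks the new expert count."""
    self.num_experts *= factor
    return self.num_experts

def apply_patches():
    """Attach the method to the class at runtime (idempotent)."""
    if not hasattr(GroupedMLP, "upcycle_experts"):
        GroupedMLP.upcycle_experts = _upcycle_experts

mlp = GroupedMLP(num_experts=32)  # instance created before patching
apply_patches()
print(mlp.upcycle_experts())      # existing instances see the new method
```

Because the method is attached to the class rather than to an instance, objects built by the upstream library pick it up without any change to that library's source.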

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

Low-stars research repo from Amazon