Release: NVIDIA, published Apr 16, 2026

NVIDIA/Megatron-LM core_v0.17.0


NVIDIA Megatron Core 0.17.0

Repository: NVIDIA/Megatron-LM

Tag: core_v0.17.0

Published: 2026-04-16T19:59:42Z

Prerelease: no

Release notes: Changelog Details

  • Fix two minor bugs in MTP implementation for hybrid models by @deepakn94 :: PR: #3194
  • Update README.md by @mvirts :: PR: #2111
  • mRoPE for MTP by @BestJuly :: PR: #3114
  • Fix bug in SFTDataset by @duncanriach :: PR: #3185
  • Fix several syntax error by @HollowMan6 :: PR: #3004
  • Fix for RL Test by @wdykas :: PR: #3148
  • Fix latent moe flops and backward_dw by @buptzyb :: PR: #2977
  • Use global user buffer when the bucket size does not fit FixedPoolAllocator by @shengf-nv :: PR: #2857
  • ci: Checkpoint retention by @ko3n1g :: PR: #3205
  • Add unit test for LatentMoE by @venmugil :: PR: #2892
  • ci: Enable unit tests on merge-queue by @ko3n1g :: PR: #3186
  • Fix seq pack flag in get_logprobs by @mathemakitten :: PR: #3206
  • ci(fix): Parse unit tests in merge-queue by @ko3n1g :: PR: #3224
  • Fix TE 2.12 AllGather CI failure by @BestJuly :: PR: #3101
  • ci(hotfix): Pin uv by @ko3n1g :: PR: #3233
  • Add a unit test to check that RL get_logprobs will reuse training cudagraphed forward pass by @mathemakitten :: PR: #3209
  • Do not offload grad buffers when training graphs are enabled by @mathemakitten :: PR: #3231
  • Fix missing PackedSeqParams import by @parthmannan :: PR: #3214
  • Synchronize the request counts for EP inference with strict matching by @santhnm2 :: PR: #3033
  • Fix coordinator address collision check in flask by @tdene :: PR: #3208
  • Do not let requests fail silently inside inference engine by @tdene :: PR: #3228
  • torch saver inference model offload by @wdykas :: PR: #3170
  • enable cuda graph ut by @Autumn1998 :: PR: #3197
  • Support EP with HSDP by @wplf :: PR: #2840
  • [Main] Add the missing part to support 1F1B overlap for Qwen3-Next by @BestJuly :: PR: #2997
  • Missing import fix by @parthmannan :: PR: #3241
  • Miscellaneous inference cleanup (Replay of !2955) by @santhnm2 :: PR: #3232
  • Add DistributedInitConfig by @maanug-nv :: PR: #3173
  • Fix checkpoint converter missing parallel group initialization by @yashaswikarnati :: PR: #3217
  • Skip empty sequences and chunks in MTP tensor roll by @BestJuly :: PR: #3035
  • Implement get_parameters for ChainedOptimizer by @nschank :: PR: #3201
  • ci(fix): Create main/dev image tags by @ko3n1g :: PR: #3252
  • Reapply "Add MTP support for hybrid models (#2363)" by @sancha :: PR: #3207
  • Fix uv install for GH actions by @Phlip79 :: PR: #3259
  • Update the project structure in README by @janEbert :: PR: #3251
  • Cherry-pick: Fix mtp_num_layers and clip_qk issues (#2581, #2776) by @BestJuly :: PR: #3075
  • RL: training cudagraphs functional test by @mathemakitten :: PR: #3235
  • [Main] fix cg missing wgrad hook by @Wohox :: PR: #3074
  • Avoid .cuda call on meta device in LanguageModel by @nschank :: PR: #3202
  • fix checkpointing error message by @dimapihtar :: PR: #3203
  • Nano QAT/D fix with sft tokenizer and datasets by @ChenhanYu :: PR: #3254
  • Revert "fix checkpointing error message (#3203)" by @ko3n1g :: PR: #3283
  • Reapply "fix checkpointing error message (#3203)" (#3283) by @ko3n1g :: PR: #3285
  • docs: Add changelog for 0.15.3 by @ko3n1g :: PR: #3286
  • ci: Set throughput tests as flaky by @chtruong814 :: PR: #3301
  • chore: Move GB200 tests to nightly by @ko3n1g :: PR: #3302
  • Ensure type-checker understands use of Submodules in bert_model by @nschank :: PR: #3256
  • Override extra_repr instead of __repr__ by @nschank :: PR: #3200
  • Replace ModuleSpec with Protocols for LayerNorm submodules by @nschank :: PR: #3090
  • Non colocated refit by @wdykas :: PR: #3213
  • Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training by @xiaoxi-wangfj :: PR: #2763
  • Add check to prevent MFSDP from numeric issue in gradient accumulate fusion by @shjwudp :: PR: #2904
  • update get_embedding_ranks and get_position_embedding_ranks docstrings by @c1lovez1 :: PR: #3223
  • Param offset in _ParamAndGradBucket should be aligned by @skydoorkai :: PR: #3007
  • ci: Add secrets detector by @chtruong814 :: PR: #3180
  • Ensure type-checker understands use of Submodules in llava_model by @nschank :: PR: #3257
  • updates to support modelopt EAGLE training with CP by @yeyu-nvidia :: PR: #3147
  • fully remove legacy tokenizer system by @dimapihtar :: PR: #2946
  • M-FSDP: Remove redundant stream waits in HSDP to prevent CG fail by @shjwudp :: PR: #2941
  • General README and pyproject fixes by @ahmadki :: PR: #2907
  • chore: More aggressive checkpointing by @ko3n1g :: PR: #3315
  • ci: Pin down setuptools to lt 82 by @ko3n1g :: PR: #3313
  • fix: numpy overflow by @ko3n1g :: PR: #3306
  • fix: T5 dataset by @ko3n1g :: PR: #3307
  • ci: Revert "ci: Add secrets detector (#3180)" by @chtruong814 :: PR: #3330
  • ci: Add more tests, run on merge-queue by @ko3n1g :: PR: #3317
  • ci: Remove merge-gate environment check by @chtruong814 :: PR: #3331
  • Use FP4 context for mamba by @kwyss-nvidia :: PR: #2604
  • ci: Ensure we run all functional tests in merge group by @chtruong814 :: PR: #3332
  • Replace ModuleSpec with Protocols for inputs to MLP by @nschank :: PR: #3084
  • ci: Fix merge queue functional tests by @chtruong814 :: PR: #3337
  • ci: skip queue in merge-gate by @ko3n1g :: PR: #3343
  • ci: Timeout for functional tests by @ko3n1g :: PR: #3346
  • update checkpointing documentation by @dimapihtar :: PR: #3347
  • Update golden values to reflect improvements by @tdene :: PR: #3350
  • BUGFIX: gpt vs hybrid model mtp naming mismatch by @sancha :: PR: #3334
  • Disable flaky test by @tdene :: PR: #3354
  • re-enable gpt grpo tests by @jon-barker :: PR: #3348
  • Fix SFT Pipeline when TP>1 by @asolergi-nv :: PR: #3268
  • Fixes for KD mode by @AAnoosheh :: PR: #3342
  • chore: Update codeowners file by @ko3n1g :: PR: #3365
  • Siddharth/fix inference functional tests by @sidsingh-nvidia :: PR: #3357
  • Switch oncall by @janEbert :: PR: #3360
  • Add missing RMSNorm to llama train script by @AAnoosheh :: PR: #3314
  • Fix inference for MTP models by @tdene :: PR: #3297
  • Add a logprobs test with real gpt model. by @yobibyte :: PR: #2870
  • Add simple GRPO functional test by @tdene :: PR: #3323
  • ci: Concurrency control for merge-queue by @ko3n1g :: PR: #3353
  • ci: Update golden value download script to work with Github by @chtruong814 :: PR: #3335
  • fix: correct typos 'seperated' and 'recieved' by @thecaptain789 :: PR: #3305
  • Improved PyTorch profiler and added PyTorch execution trace by @shengf-nv :: PR: #3273
  • Removing etc from main index page, shifted…

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Significant library update from NVIDIA