Release · NVIDIA · published Jun 23, 2026

NVIDIA/Megatron-LM core_v0.18.0

NVIDIA Megatron Core 0.18.0

Repository: NVIDIA/Megatron-LM

Tag: core_v0.18.0

Published: 2026-06-23T00:16:28Z

Prerelease: no

Release notes: Changelog Details

  • fix(ci): replace actions/setup-python with apt-get to avoid 429 rate limits by @ko3n1g :: PR: #4072
  • ci: Fix package name for code-freeze workflow by @ko3n1g :: PR: #4077
  • chore: bump _code_freeze workflow to v0.86.0 by @ko3n1g :: PR: #4078
  • Fix checkpoint inspector by @janEbert :: PR: #4079
  • Update docs to conform to NVIDIA style guides by @megnvidia :: PR: #4068
  • Miscellaneous inference fixes by @santhnm2 :: PR: #4030
  • fix fine_grained_callables with fused rmsnorm residual by @CarlosGomes98 :: PR: #4026
  • [Main][feat] Support overlapping A2A Combine backprop with wgrad GEMM by @Wohox :: PR: #3795
  • Modify mfsdp default data-parallel-sharding-strategy by @wplf :: PR: #3691
  • Fix fsdp_dtensor conversion for pretrained-only checkpoints by @DAISY-gh :: PR: #3912
  • Guard NVshmem issues by @wdykas :: PR: #4093
  • m-fsdp: wire use_precision_aware_optimizer from ddp_config to ParamAn… by @rapatel :: PR: #4024
  • Megatron-FSDP: Add MXFP8 transpose helper buffer for Hybrid FSDP by @shjwudp :: PR: #3918
  • feat(fsdp): use TE general_gemm for mixed-precision wgrad in FSDP path by @Victarry :: PR: #3822
  • Megatron-FSDP: Fix insufficient double buffers during gradient reduce by @shjwudp :: PR: #4054
  • Fix M-FSDP MXFP8 related bugs by @shjwudp :: PR: #3991
  • Megatron-FSDP: Make _pre_forward_param_unshard and _register_post_backward_hook formal by @shjwudp :: PR: #4029
  • FIX: Use decoupled gradients for precision-aware M-FSDP grad norm by @XueSongTap :: PR: #3746
  • Align chat completions endpoint with vLLM by @santhnm2 :: PR: #4063
  • [Megatron-FSDP] Fix compatibility with frozen parameters and add unit tests by @shjwudp :: PR: #3287
  • [M-FSDP] Refactor uneven dtensor to full tensor and add UT by @shjwudp :: PR: #3190
  • Add agent instruction files by @Phlip79 :: PR: #4102
  • Bump eopt version by @skyw :: PR: #4100
  • Refactor emerging optimizer integration by @skyw :: PR: #4113
  • Fix over provisioning of Mamba state memory when max_requests is set by @santhnm2 :: PR: #4114
  • base strategy simplification by @dimapihtar :: PR: #4001
  • add support for DCP and FSDP async save by @dimapihtar :: PR: #4027
  • Add more emerging optimizers (#3907) by @skyw :: PR: #4119
  • Fix FSDP checkpoint conversion and loading for Qwen3.5-VL by @DAISY-gh :: PR: #3936
  • docs: update mcore optimizer docstrings to google style by @Akshat8510 :: PR: #2799
  • Set tensor-parallel attributes irrespective of perform_initialization by @ilml :: PR: #4084
  • docs: add developer-guide skill with CI/CD and failure navigation guidance by @ko3n1g :: PR: #4035
  • chore: Move skills by @ko3n1g :: PR: #4136
  • ci: Let Claude react to comment by @ko3n1g :: PR: #4135
  • Nemotron3 Super GB200 release config by @maanug-nv :: PR: #4118
  • Enable CUDA graph for ADAM optimizer by @vasunvidia :: PR: #3429
  • Claude review should recommend testing by @Phlip79 :: PR: #4137
  • cleanup: remove unused scatter_gather_tensors_in_pipeline argument by @Phlip79 :: PR: #4140
  • fix: Remove fail-fast (-x) and guard distributed teardown against deadlock by @ko3n1g :: PR: #4139
  • Claude: add respond-to-issue skill by @Phlip79 :: PR: #4141
  • Fix muon getter backward compatibility by @skyw :: PR: #4157
  • Audit of user guide by @megnvidia :: PR: #4098
  • Fix RerunStateMachine crash (TypeError: 'NoneType' object is not subscriptable) by not saving a checkpoint after a transient NaN / Inf by @yezhengmao1 :: PR: #3981
  • Preserve type of decorated methods/classes by @nschank :: PR: #4062
  • update muon test case to use new interface by @skyw :: PR: #4163
  • [M-FSDP] Fix Tensor Parallel mode detection by @shjwudp :: PR: #3191
  • fix: remove weights_only=False for multimodal example by @faradawn :: PR: #4104
  • Cudagraphs: Fix sequence packing segfault more generally by @mathemakitten :: PR: #4162
  • Make MTP work with materialize_only_last_token_logits by @santhnm2 :: PR: #4166
  • Add unit test for Mamba EP inference (eager fallback with mixed CUDA graphs) by @santhnm2 :: PR: #4085
  • update docs with respect to async changes by @dimapihtar :: PR: #4177
  • update checkpointing docs with respect to async changes by @dimapihtar :: PR: #4208
  • chore: improve build-and-test skill with trigger rules and dependency workflow by @ko3n1g :: PR: #4199
  • Fix layerwise optimizer with expt_dp_size=1 and contention with element-wise distributed optimizer by @skyw :: PR: #4138
  • ci: add --cluster-a100/h100/gb200 args to trigger_internal_ci.py by @ko3n1g :: PR: #4195
  • ci: Update golden values for nightly tests by @chtruong814 :: PR: #4215
  • rename async_allgather to overlap_param_gather by @skyw :: PR: #4217
  • Fix Slack sync for users with GitHub email privacy enabled by @Phlip79 :: PR: #4220
  • Miscellaneous MTP inference fixes by @santhnm2 :: PR: #4191
  • Move inference guards out of arguments.py by @mathemakitten :: PR: #4210
  • Fix: enable fine-grained activation offloading for Mamba model. by @fanshiqing :: PR: #4173
  • bump NVRx by @dimapihtar :: PR: #4178
  • Update tokenizer args for Nemotron3 release config by @maanug-nv :: PR: #4239
  • build: add dynamic git-versioning and drop rc0 pre-release tag by @ko3n1g :: PR: #4212
  • Fix unnecessary permute padding for non-quantized MoE dispatch by @xiaoxi-wangfj :: PR: #4038
  • Fix split state dict main by @kunlunl :: PR: #3676
  • Add /split-pr Claude Code command for splitting PRs by CODEOWNERS by @Phlip79 :: PR: #4160
  • Enable FP8 DPA for MXFP8 recipe by @vasunvidia :: PR: #4066
  • Enable AG/RS overlap with explicit process group passing by @jeffnvidia :: PR: #3249
  • Enable cpu_offloading with Full iteration CUDA graph by @vasunvidia :: PR: #3969
  • Fix TransformerConfig validation for mixed dense/MoE upcycling by @rkteddy :: PR: #3647
  • Remove cross-rank synchronization during checkpoint load & deprecate torch.distributed.checkpoint.state_dict_loader.load_state_dict by @asolergi-nv :: PR: #2864
  • Fix incorrectly set decoupled_grad and DistOpt mechanics for MFSDP. by @cspades :: PR: #4133
  • Refit Miscellaneous by @wdykas :: PR: #3973
  • Add conditions_embeddings argument to TransformerBlock, TransformerLayer for DiT (diffusion transformer) by @huvunvidia :: PR: #4134
  • Fix build_sequences_per_dataset output path arg usage by @DhineshPonnarasan :: PR: #4144
  • ci: Flush pending CUDA work before the barrier in destroy_model_parallel by @chtruong814 :: PR: #4259
  • Update oncall schedule...
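
Each complete entry in the list above follows the pattern "• <title> by @<author> :: PR: #<number>". For readers who want to filter or group the changelog, here is a minimal sketch of a parser for that pattern. The names `Entry`, `ENTRY_RE`, and `parse_entry` are hypothetical helpers written for this illustration and are not part of Megatron-LM or any NVIDIA tooling. The sketch assumes the author handle is the text after the last " by @" on the line, and it returns `None` for lines that do not match, such as the truncated final entry.

```python
import re
from collections import Counter
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    """One changelog line (hypothetical container for this sketch)."""
    title: str
    author: str
    pr: int


# Matches "• <title> by @<author> :: PR: #<number>".
# The greedy title group means the LAST " by @" on the line marks the author,
# so titles that themselves contain the word "by" are kept whole.
ENTRY_RE = re.compile(
    r"^\s*•\s*(?P<title>.+)\s+by\s+@(?P<author>[\w-]+)\s+::\s+PR:\s+#(?P<pr>\d+)\s*$"
)


def parse_entry(line: str) -> Optional[Entry]:
    """Return an Entry for a well-formed changelog line, else None."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    return Entry(match["title"].strip(), match["author"], int(match["pr"]))


# A few lines copied from the changelog above, plus the truncated last entry.
sample_lines = [
    "  • Fix checkpoint inspector by @janEbert :: PR: #4079",
    "  • Bump eopt version by @skyw :: PR: #4100",
    "  • Refactor emerging optimizer integration by @skyw :: PR: #4113",
    "  • Add /split-pr Claude Code command for splitting PRs by CODEOWNERS by @Phlip79 :: PR: #4160",
    "  • Update oncall schedule...",
]

entries = [e for e in map(parse_entry, sample_lines) if e is not None]

# Count entries per author among the parsed sample lines.
prs_per_author = Counter(e.author for e in entries)
```

Because unmatched lines yield `None` rather than raising, the truncated entry is skipped instead of being guessed at.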

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine version release, no major breakthrough indicated.