NVIDIA/Megatron-LM core_v0.18.0
NVIDIA Megatron Core 0.18.0
Repository: NVIDIA/Megatron-LM
Tag: core_v0.18.0
Published: 2026-06-23T00:16:28Z
Prerelease: no
Release notes: Changelog Details
- fix(ci): replace actions/setup-python with apt-get to avoid 429 rate limits by @ko3n1g :: PR: #4072
- ci: Fix package name for code-freeze workflow by @ko3n1g :: PR: #4077
- chore: bump _code_freeze workflow to v0.86.0 by @ko3n1g :: PR: #4078
- Fix checkpoint inspector by @janEbert :: PR: #4079
- Update docs to conform to NVIDIA style guides by @megnvidia :: PR: #4068
- Miscellaneous inference fixes by @santhnm2 :: PR: #4030
- fix fine_grained_callables with fused rmsnorm residual by @CarlosGomes98 :: PR: #4026
- [Main][feat] Support overlapping A2A Combine backprop with wgrad GEMM by @Wohox :: PR: #3795
- Modify mfsdp default data-parallel-sharding-strategy by @wplf :: PR: #3691
- Fix fsdp_dtensor conversion for pretrained-only checkpoints by @DAISY-gh :: PR: #3912
- Guard NVshmem issues by @wdykas :: PR: #4093
- m-fsdp: wire use_precision_aware_optimizer from ddp_config to ParamAn… by @rapatel :: PR: #4024
- Megatron-FSDP: Add MXFP8 transpose helper buffer for Hybrid FSDP by @shjwudp :: PR: #3918
- feat(fsdp): use TE general_gemm for mixed-precision wgrad in FSDP path by @Victarry :: PR: #3822
- Megatron-FSDP: Fix insufficient double buffers during gradient reduce by @shjwudp :: PR: #4054
- Fix M-FSDP MXFP8 related BUGs by @shjwudp :: PR: #3991
- Megatron-FSDP: Make _pre_forward_param_unshard and _register_post_backward_hook formal by @shjwudp :: PR: #4029
- FIX: Use decoupled gradients for precision-aware M-FSDP grad norm by @XueSongTap :: PR: #3746
- Align chat completions endpoint with vLLM by @santhnm2 :: PR: #4063
- [Megatron-FSDP] Fix compatibility with frozen parameters and add unit tests by @shjwudp :: PR: #3287
- [M-FSDP] Refactor uneven dtensor to full tensor and add UT by @shjwudp :: PR: #3190
- Add agent instruction files by @Phlip79 :: PR: #4102
- Bump eopt version by @skyw :: PR: #4100
- Refactor emerging optimizer integration by @skyw :: PR: #4113
- Fix over provisioning of Mamba state memory when max_requests is set by @santhnm2 :: PR: #4114
- base strategy simplification by @dimapihtar :: PR: #4001
- add support for DCP and FSDP async save by @dimapihtar :: PR: #4027
- Add more emerging optimizers (#3907) by @skyw :: PR: #4119
- Fix FSDP checkpoint conversion and loading for Qwen3.5-VL by @DAISY-gh :: PR: #3936
- docs: update mcore optimizer docstrings to google style by @Akshat8510 :: PR: #2799
- Set tensor-parallel attributes irrespective of perform_initialization by @ilml :: PR: #4084
- docs: add developer-guide skill with CI/CD and failure navigation guidance by @ko3n1g :: PR: #4035
- chore: Move skills by @ko3n1g :: PR: #4136
- ci: Let Claude react to comment by @ko3n1g :: PR: #4135
- Nemotron3 Super GB200 release config by @maanug-nv :: PR: #4118
- Enable CUDA graph for ADAM optimizer by @vasunvidia :: PR: #3429
- Claude review should recommend testing by @Phlip79 :: PR: #4137
- cleanup: remove unused scatter_gather_tensors_in_pipeline argument by @Phlip79 :: PR: #4140
- fix: Remove fail-fast (-x) and guard distributed teardown against deadlock by @ko3n1g :: PR: #4139
- Claude: add respond-to-issue skill by @Phlip79 :: PR: #4141
- Fix muon getter backward compatibility by @skyw :: PR: #4157
- Audit of user guide by @megnvidia :: PR: #4098
- Fix RerunStateMachine crash (TypeError: 'NoneType' object is not subscriptable) by not saving a checkpoint after a transient NaN / Inf by @yezhengmao1 :: PR: #3981
- Preserve type of decorated methods/classes by @nschank :: PR: #4062
- update muon test case to use new interface by @skyw :: PR: #4163
- [M-FSDP] Fix Tensor Parallel mode detection by @shjwudp :: PR: #3191
- fix: remove weights_only=False for multimodal example by @faradawn :: PR: #4104
- Cudagraphs: Fix sequence packing segfault more generally by @mathemakitten :: PR: #4162
- Make MTP work with materialize_only_last_token_logits by @santhnm2 :: PR: #4166
- Add unit test for Mamba EP inference (eager fallback with mixed CUDA graphs) by @santhnm2 :: PR: #4085
- update docs in respect to async changes by @dimapihtar :: PR: #4177
- update checkpointing docs in respect to async changes by @dimapihtar :: PR: #4208
- chore: improve build-and-test skill with trigger rules and dependency workflow by @ko3n1g :: PR: #4199
- Fix layerwise optimizer with expt_dp_size=1 and contention with element-wise distributed optimizer by @skyw :: PR: #4138
- ci: add --cluster-a100/h100/gb200 args to trigger_internal_ci.py by @ko3n1g :: PR: #4195
- ci: Update golden values for nightly tests by @chtruong814 :: PR: #4215
- rename async_allgather to overlap_param_gather by @skyw :: PR: #4217
- Fix Slack sync for users with GitHub email privacy enabled by @Phlip79 :: PR: #4220
- Miscellaneous MTP inference fixes by @santhnm2 :: PR: #4191
- Move inference guards out of arguments.py by @mathemakitten :: PR: #4210
- Fix: enable fine-grained activation offloading for Mamba model. by @fanshiqing :: PR: #4173
- bump NVRx by @dimapihtar :: PR: #4178
- Update tokenizer args for Nemotron3 release config by @maanug-nv :: PR: #4239
- build: add dynamic git-versioning and drop rc0 pre-release tag by @ko3n1g :: PR: #4212
- Fix unnecessary permute padding for non-quantized MoE dispatch by @xiaoxi-wangfj :: PR: #4038
- Fix split state dict main by @kunlunl :: PR: #3676
- Add /split-pr Claude Code command for splitting PRs by CODEOWNERS by @Phlip79 :: PR: #4160
- Enable FP8 DPA for MXFP8 recipe by @vasunvidia :: PR: #4066
- Enable AG/RS overlap with explicit process group passing by @jeffnvidia :: PR: #3249
- Enable cpu_offloading with Full iteration CUDA graph by @vasunvidia :: PR: #3969
- Fix TransformerConfig validation for mixed dense/MoE upcycling by @rkteddy :: PR: #3647
- Remove cross-rank synchronization during checkpoint load & deprecate torch.distributed.checkpoint.state_dict_loader.load_state_dict by @asolergi-nv :: PR: #2864
- Fix incorrectly set decoupled_grad and DistOpt mechanics for MFSDP. by @cspades :: PR: #4133
- Refit Miscellaneous by @wdykas :: PR: #3973
- Add conditions_embeddings argument to TransformerBlock, TransformerLayer for DiT (diffusion transformer) by @huvunvidia :: PR: #4134
- Fix build_sequences_per_dataset output path arg usage by @DhineshPonnarasan :: PR: #4144
- ci: Flush pending CUDA work before the barrier in destroy_model_parallel by @chtruong814 :: PR: #4259
- Update oncall schedule...
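
Each complete entry above follows the shape `- <title> by @<author> :: PR: #<number>`. For readers who want to filter or group this changelog, here is a minimal sketch that parses that shape using only the Python standard library. The names `ENTRY_RE` and `parse_entry` are hypothetical helpers written for illustration and are not part of Megatron-LM; the pattern assumes one entry per line and skips truncated entries.

```python
import re
from collections import Counter

# Assumed entry shape: "- <title> by @<author> :: PR: #<number>".
# The greedy title group backtracks to the LAST " by @", so titles that
# themselves contain the word "by" are still parsed correctly.
ENTRY_RE = re.compile(
    r"^- (?P<title>.+) by @(?P<author>[\w-]+) :: PR: #(?P<pr>\d+)$"
)


def parse_entry(line):
    """Return (title, author, pr_number), or None if the line does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    return match["title"], match["author"], int(match["pr"])


# A few entries copied from the list above, plus the truncated final one.
sample = [
    "- Fix checkpoint inspector by @janEbert :: PR: #4079",
    "- Align chat completions endpoint with vLLM by @santhnm2 :: PR: #4063",
    "- Miscellaneous inference fixes by @santhnm2 :: PR: #4030",
    "- Update oncall schedule...",  # truncated entry: no author or PR, skipped
]

entries = [e for e in map(parse_entry, sample) if e is not None]

# Count how many parsed entries each author contributed.
authors = Counter(author for _, author, _ in entries)
```

Because incomplete lines return `None` rather than raising, the truncated last entry is dropped instead of being guessed at.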