What does this release signal mean?

NVIDIA published NVIDIA/TransformerEngine v2.16 (NVIDIA/TransformerEngine). This release signal is evidence of what shipped, changed, or was packaged for users. High-signal details: NVIDIA's library for efficient Transformer training with mixed precision. · Release: NVIDIA/TransformerEngine v2.16 - Repository: NVIDIA/TransformerEngine | A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and.... onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/TransformerEngine v2.16

Captured source

source ↗

GitHub/github.com/NVIDIA/TransformerEngine

NVIDIA/TransformerEngine v2.16

Source ↗

published Jun 9, 2026seen Jun 9captured Jun 10http 200method exa

Release: NVIDIA/TransformerEngine v2.16

Repository: NVIDIA/TransformerEngine | A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference. | 3K stars | Python
Author: @ksivaman
Created: 2026-05-29T23:46:45Z
Published: 2026-06-09T01:15:56Z

Transformer Engine v2.16 Release Notes

Key Features and Enhancements

[Common] Improved the performance of the split-overlap reduce-scatter GEMMs. (#2056)
[Common] Improved the fused MoE auxiliary loss kernel performance for models with a large number of experts. (#2758)
[Common] Optimized MXFP8 and NVFP4 dequantize kernels for improved performance. (#2865)
[Common] Improved performance of the MXFP8 quantization kernels. (#2958)
[PyTorch] Added pad_between_seqs support for non-CP and CP (A2A and P2P) with FA3 + THD (varlen) attention. (#2596)
[PyTorch] Added role-based custom quantization control, enabling recipes to target specific modules and tensor types. (#2620)
[PyTorch] Added end-to-end Mixtral MoE examples showing TE GroupedLinear integration with HuggingFace models for BF16 and FP8 training. (#2642)
[PyTorch] Increased performance of the CPU activation offloading path in some cases (#2793)
[PyTorch] Reduced the CPU overhead in the GroupedLinear module and operation (#2900) (#2957) (#2666)
[PyTorch] Added CUDA Graph capture support for GroupedLinear and grouped MoE operations on supported configurations. (#2923)
[PyTorch] Added FlashAttention 4 support for attention head dimension 256. (#2932)
[JAX] Improved MoE permutation kernel performance. (#2975)
[JAX] Improved JAX tutorial documentation with updated examples and guidance. (#2976)
[Common, PyTorch] Added bias and dbias support for GroupedLinear layers. (#2885)
[Common, PyTorch] Added variable grouped swizzle support for flexible grouped tensor memory layouts. (#2914)
[Common, PyTorch] Implemented a row-scaled NVFP4 forward propagation recipe. (#2931)
[Common, PyTorch] Expanded grouped GEMM support with NVFP4 on Blackwell and FP8 block scaling on Hopper. (#2971)
[Common, JAX] Added a top-k operation for faster MoE routing. (#2890)
[Common, JAX] Enabled the cuDNN fused attention backend for no-mask bidirectional sliding-window attention. (#2961)

Fixed Issues

[PyTorch] Fixed variable-length attention cache reuse across devices and inference/training modes. (#2728)
[PyTorch] Fixed FSDP2 memory leaks for FP8 weight workspaces and transpose caches. (#2805)
[PyTorch] Fixed TE fuser behavior in torch.no_grad() paths by avoiding invalid gradient-flag updates on non-leaf tensors. (#2919)
[PyTorch] Fixed distributed checkpoint loading for FSDP2 for models initialized with QuantizedModelInit. (#2974)
[Common, PyTorch] Fixed cuBLAS grouped GEMM when weight dimensions are not divisible by 128. (#2954)
[Common, PyTorch] Fixed int32 overflow and -1 sentinel value handling in moe_permute. (#2907)
[Common, PyTorch] Fixed context-parallel FlashAttention output handling when FA3 is installed without FA2.(#2825)
[Common, PyTorch] Disabled RHT quantization fusion on unsupported GPU architectures to avoid launch failures. (#2968)
[PyTorch] Fixed a crash coming from GroupedLinear weight-gradient allocation. (#3049)

Breaking Changes in This Release

[Common, PyTorch] The original FP8 delayed-scaling fused attention path has been removed. FP8 attention now uses the current cuDNN-backed implementation. (#2959)
[Common, PyTorch, JAX] Removed the legacy f16_max512 fused-attention backend. BF16/FP16 attention is routed through the maintained arbitrary-sequence backend, but explicit selections of the old backend must be updated. (#2949)

Deprecated Features

There are no deprecated features in this release.

---

Assets

| File | Size | Downloads | | --- | --- | --- | | transformer_engine_torch-2.16.0+cu12torch2.8.0+cu129cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 799 KB | 2 downloads | | transformer_engine_torch-2.16.0+cu13torch26.02cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 949 KB | 1 downloads | | transformer_engine_torch-2.16.0+cu13torch26.03cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 947 KB | 1 downloads | | transformer_engine_torch-2.16.0+cu13torch26.04cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 944 KB | 1 downloads | | transformer_engine_torch-2.16.0+cu13torch26.05cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 960 KB | 1 downloads |

Notability

notability 3.0/10

Routine library release, minor update.