ReleaseNVIDIANVIDIApublished Jun 9, 2026seen 2d

NVIDIA/TransformerEngine v2.16

NVIDIA/TransformerEngine

Open original ↗

Captured source

source ↗
published Jun 9, 2026seen 2dcaptured 1dhttp 200method exa

Release: NVIDIA/TransformerEngine v2.16

  • Repository: NVIDIA/TransformerEngine | A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory utilization in both training and inference. | 3K stars | Python
  • Author: @ksivaman
  • Created: 2026-05-29T23:46:45Z
  • Published: 2026-06-09T01:15:56Z

Transformer Engine v2.16 Release Notes

Key Features and Enhancements

  • [Common] Improved the performance of the split-overlap reduce-scatter GEMMs. (#2056)
  • [Common] Improved the fused MoE auxiliary loss kernel performance for models with a large number of experts. (#2758)
  • [Common] Optimized MXFP8 and NVFP4 dequantize kernels for improved performance. (#2865)
  • [Common] Improved performance of the MXFP8 quantization kernels. (#2958)
  • [PyTorch] Added pad_between_seqs support for non-CP and CP (A2A and P2P) with FA3 + THD (varlen) attention. (#2596)
  • [PyTorch] Added role-based custom quantization control, enabling recipes to target specific modules and tensor types. (#2620)
  • [PyTorch] Added end-to-end Mixtral MoE examples showing TE GroupedLinear integration with HuggingFace models for BF16 and FP8 training. (#2642)
  • [PyTorch] Increased performance of the CPU activation offloading path in some cases (#2793)
  • [PyTorch] Reduced the CPU overhead in the GroupedLinear module and operation (#2900) (#2957) (#2666)
  • [PyTorch] Added CUDA Graph capture support for GroupedLinear and grouped MoE operations on supported configurations. (#2923)
  • [PyTorch] Added FlashAttention 4 support for attention head dimension 256. (#2932)
  • [JAX] Improved MoE permutation kernel performance. (#2975)
  • [JAX] Improved JAX tutorial documentation with updated examples and guidance. (#2976)
  • [Common, PyTorch] Added bias and dbias support for GroupedLinear layers. (#2885)
  • [Common, PyTorch] Added variable grouped swizzle support for flexible grouped tensor memory layouts. (#2914)
  • [Common, PyTorch] Implemented a row-scaled NVFP4 forward propagation recipe. (#2931)
  • [Common, PyTorch] Expanded grouped GEMM support with NVFP4 on Blackwell and FP8 block scaling on Hopper. (#2971)
  • [Common, JAX] Added a top-k operation for faster MoE routing. (#2890)
  • [Common, JAX] Enabled the cuDNN fused attention backend for no-mask bidirectional sliding-window attention. (#2961)

Fixed Issues

  • [PyTorch] Fixed variable-length attention cache reuse across devices and inference/training modes. (#2728)
  • [PyTorch] Fixed FSDP2 memory leaks for FP8 weight workspaces and transpose caches. (#2805)
  • [PyTorch] Fixed TE fuser behavior in torch.no_grad() paths by avoiding invalid gradient-flag updates on non-leaf tensors. (#2919)
  • [PyTorch] Fixed distributed checkpoint loading for FSDP2 for models initialized with QuantizedModelInit. (#2974)
  • [Common, PyTorch] Fixed cuBLAS grouped GEMM when weight dimensions are not divisible by 128. (#2954)
  • [Common, PyTorch] Fixed int32 overflow and -1 sentinel value handling in moe_permute. (#2907)
  • [Common, PyTorch] Fixed context-parallel FlashAttention output handling when FA3 is installed without FA2.(#2825)
  • [Common, PyTorch] Disabled RHT quantization fusion on unsupported GPU architectures to avoid launch failures. (#2968)
  • [PyTorch] Fixed a crash coming from GroupedLinear weight-gradient allocation. (#3049)

Breaking Changes in This Release

  • [Common, PyTorch] The original FP8 delayed-scaling fused attention path has been removed. FP8 attention now uses the current cuDNN-backed implementation. (#2959)
  • [Common, PyTorch, JAX] Removed the legacy f16_max512 fused-attention backend. BF16/FP16 attention is routed through the maintained arbitrary-sequence backend, but explicit selections of the old backend must be updated. (#2949)

Deprecated Features

There are no deprecated features in this release.

---

Assets

| File | Size | Downloads | | --- | --- | --- | | transformer_engine_torch-2.16.0+cu12torch2.8.0+cu129cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 799 KB | 2 downloads | | transformer_engine_torch-2.16.0+cu13torch26.02cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 949 KB | 1 downloads | | transformer_engine_torch-2.16.0+cu13torch26.03cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 947 KB | 1 downloads | | transformer_engine_torch-2.16.0+cu13torch26.04cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 944 KB | 1 downloads | | transformer_engine_torch-2.16.0+cu13torch26.05cxx11abiTRUE-cp312-cp312-linux_x86_64.whl | 960 KB | 1 downloads |