NVIDIA/TransformerEngine v2.12

Repository: NVIDIA/TransformerEngine

Tag: v2.12

Published: 2026-02-24T00:03:58Z

Prerelease: no

Release notes:

Transformer Engine v2.12 Release Notes

Key Features and Enhancements

  • Made miscellaneous improvements and fixes to the documentation.
  • [C] Improved performance of NVFP4 quantization kernels. (#2412)
  • [C] Documented environment variables. (#2552)
  • [PyTorch] Added fused permute+pad and unpermute+unpad operations for FP8 optimization. (#1921)
  • [PyTorch] Improved performance in CPU-limited scenarios.
  • [PyTorch] Added support for Sliding Window Attention (left, right) with fused attention. (#2477)
  • [PyTorch] Improved the performance of MXFP8 and NVFP4 by fusing swizzling into quantization. (#2486)
  • [PyTorch] Added CUDA graph support for activation recomputation. (#2518)
  • [JAX] Added a tutorial for integrating TE/JAX quantization into existing frameworks. (#2423)
  • [JAX] Added custom partitioning for permutation primitives. (#2591)
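
As an illustration of the Sliding Window Attention item above, the following is a minimal sketch of what a (left, right) window means as an attention mask. It assumes the common convention that query position i may attend to key positions j with i - left <= j <= i + right, and that a negative value means no limit on that side. The function name `sliding_window_mask` is hypothetical; this is plain Python for explanation only, not Transformer Engine's API or its fused kernel.

```python
def sliding_window_mask(seq_len, left, right):
    """Build a seq_len x seq_len boolean matrix.

    mask[i][j] is True when query position i may attend to key position j.
    Assumed convention: i - left <= j <= i + right, and a negative
    left/right means that side of the window is unbounded.
    """
    mask = []
    for i in range(seq_len):
        # Lowest and highest key index visible to query i, clamped to the sequence.
        lo = 0 if left < 0 else max(0, i - left)
        hi = seq_len - 1 if right < 0 else min(seq_len - 1, i + right)
        mask.append([lo <= j <= hi for j in range(seq_len)])
    return mask


if __name__ == "__main__":
    # Causal window that sees the current token and the two before it.
    for row in sliding_window_mask(5, left=2, right=0):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this convention, `left=2, right=0` gives a causal window of three tokens, while `left=-1, right=-1` reduces to full attention.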

Fixed Issues

  • [C] Fixed SM120 compilation with CUDA 12. (#2482)
  • [C] Fixed overflow in padding and unpadding kernels. (#2548)
  • [C] Fixed a numerical issue in `sort_chunks_by_index`. (#2566)
  • [C] Fixed a numerical issue in swizzling blockwise E8 scales. (#2589)
  • [PyTorch] Fixed an `AttributeError` raised when checkpointing a model with MXFP8 parameters. (#2427)
  • [PyTorch] Fixed cross-entropy loss calculation when some tokens are ignored. (#2476)
  • [PyTorch] Fixed `Float8Tensor.contiguous` autograd support. (#2533)
  • [PyTorch] Fixed multiple CPU offloading issues. (#2535)
  • [PyTorch] Fixed uninitialized `permuted_scale` values. (#2547)
  • [PyTorch] Fixed FP8 quantization for the second MLP in `LayerNormMLP`. (#2577)
  • [PyTorch] Fixed ONNX tests and added FP8 attention export support. (#2598)
  • [JAX] Removed unused TE DPA dtype handling to improve cuDNN backend dtype detection. (#2485)
  • [JAX] Fixed segment-position calculation from segment IDs in the `SequenceDescriptor` class. (#2523)
  • [JAX] Fixed bugs in permutation custom partitioning. (#2617)
  • [JAX] Fixed an issue in the encoder and MNIST examples caused by a moved dataset path. (#2625)
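
For readers unfamiliar with the segment-position item above, here is a minimal sketch of the general idea of deriving positions from segment IDs in a packed sequence. It assumes that each run of equal, adjacent IDs forms one segment and that positions restart at 0 at every segment boundary. The helper `segment_positions` is hypothetical and written in plain Python; it does not reproduce the `SequenceDescriptor` implementation, and it treats any padding ID like an ordinary segment.

```python
def segment_positions(segment_ids):
    """Return the position of each token within its own segment.

    Assumption: a segment is a run of equal, adjacent IDs, and positions
    count up from 0 inside each run.
    """
    positions = []
    previous = None
    position = 0
    for segment in segment_ids:
        # Continue counting inside a run; restart at 0 when the ID changes.
        position = position + 1 if segment == previous else 0
        positions.append(position)
        previous = segment
    return positions


if __name__ == "__main__":
    # Three packed sequences of lengths 3, 2, and 1.
    print(segment_positions([1, 1, 1, 2, 2, 3]))
```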

Breaking Changes in This Release

No breaking changes in this release.

Deprecated Features

No features deprecated in this release.
