
NVIDIA/TorchFort v0.3.0

Multi Tensor, Multi Environment Support, Modernization of Dependencies

Repository: NVIDIA/TorchFort

Tag: v0.3.0

Published: 2025-06-03T14:09:33Z

Prerelease: no

Release notes:

Summary Release Notes

Major Features and Enhancements

1. Multi-Argument Model and Loss Support
• Added full support for models and loss functions that require multiple input, label, and output tensors, as well as custom loss arguments. This is enabled via the new torchfort_train_multiarg and torchfort_inference_multiarg APIs, with corresponding Fortran and C documentation and usage examples.
• Introduced torchfort_tensor_list types and management functions (create, destroy, add_tensor) to facilitate passing multiple tensors to models and losses.
• Expanded documentation and provided a comprehensive Fortran example (examples/fortran/graph) demonstrating online training on unstructured meshes with a MeshGraphNet-like model and a custom PyTorch loss function exported via TorchScript.

2. TorchScript Loss Functions
• Added support for loading custom loss functions from exported TorchScript modules via a new torchscript loss type. This allows users to implement arbitrary loss logic in Python and integrate it into TorchFort workflows.
• Updated configuration and documentation to describe usage and options for TorchScript-based losses.

3. Expanded Documentation and Examples
• Significantly updated API and usage documentation to cover the new multi-argument interfaces, tensor list management, and custom loss workflows.
• Added a detailed, reproducible example (examples/fortran/graph) including all necessary mesh data, configuration, model/loss generation scripts, and visualization tools.

Core and API Changes

4. Loss Function API Refactor
• Refactored the internal loss interface: loss functions now accept an additional extra_args argument, supporting more flexible and extensible loss computations.
• Implemented a new TorchscriptLoss class for TorchScript integration, and updated the loss registry accordingly.

5. Distributed and RL Improvements
• Reinforcement learning (RL) off-policy and on-policy buffers now support local multi-environment updates, with new APIs and documentation for batch buffer operations.
• Improved distributed communication routines to enforce tensor contiguity, with clear error messages for unsupported non-contiguous tensors.

6. Grad Accumulation and Training Control
• Added support for gradient accumulation steps, configurable via the optimizer general block in the YAML config. This enables larger effective batch sizes and more control over optimization steps.
• RL algorithms and model training logic now respect the new gradient accumulation setting, stepping the optimizer only after the configured number of accumulation steps.

Build and Environment Updates

7. Updated Build and Dependency Stack
• Dockerfiles and build scripts updated to use CUDA 12.8, NVIDIA HPC SDK 25.3, latest OpenMPI/HPC-X, and PyTorch 2.7.0 for improved performance and compatibility.
• Default C++ ABI flag switched to -D_GLIBCXX_USE_CXX11_ABI=1 for all builds, to accommodate the updated PyTorch version.
• Requirements updated to match the new PyTorch and torchvision/torchaudio versions.

8. Improved Compiler and MPI Compatibility
• CMake logic now detects and blocks unsupported compilers (e.g., nvc++ for C++ code), with clear error messages.
• Fortran MPI compatibility is now tested at build time, and the build system automatically sets the MPICH flag if required.

Other Notable Improvements
• Various bugfixes and enhancements to distributed communication, RL API, and internal error handling.
• Expanded and clarified documentation throughout the API and example codebases.

Upgrade Notes:
• Users should update their Docker images or environments to the new CUDA, HPC SDK, and PyTorch versions.
• When using custom loss functions or multi-input models, refer to the new documentation and examples for correct API usage.
• YAML configuration files may require updates to the optimizer and loss sections to leverage new features.
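
The gradient accumulation behavior described under item 6 can be illustrated with a small, library-free Python sketch. This is not TorchFort code: the class AccumulatingTrainer, the field name grad_accumulation_steps, and the choice to average micro-step gradients (scaling each by 1/N) are assumptions made for illustration only. The sketch shows just the general pattern of stepping the optimizer once every N training calls, which makes the effective batch size N times the per-call batch size.

```python
from dataclasses import dataclass


@dataclass
class AccumulatingTrainer:
    """Toy one-parameter trainer (hypothetical; not part of TorchFort)."""

    weight: float = 0.0
    learning_rate: float = 0.1
    grad_accumulation_steps: int = 1  # assumed name for the configured step count
    optimizer_steps: int = 0
    _grad_sum: float = 0.0
    _calls: int = 0

    def train_step(self, x: float, y: float) -> float:
        """Accumulate the gradient of (weight * x - y)**2 and return the loss.

        The weight is updated only on every N-th call, where
        N = grad_accumulation_steps.
        """
        error = self.weight * x - y
        loss = error * error
        # Assumed convention: scale each contribution by 1/N so the
        # accumulated value is the average gradient over the N calls.
        self._grad_sum += (2.0 * error * x) / self.grad_accumulation_steps
        self._calls += 1
        if self._calls % self.grad_accumulation_steps == 0:
            # Plain SGD update, then reset the accumulator.
            self.weight -= self.learning_rate * self._grad_sum
            self._grad_sum = 0.0
            self.optimizer_steps += 1
        return loss


trainer = AccumulatingTrainer(grad_accumulation_steps=4)
for _ in range(8):
    trainer.train_step(x=1.0, y=2.0)
print(trainer.optimizer_steps, trainer.weight)
```

With eight training calls and four accumulation steps, the weight is updated twice; the six calls in between only add to the accumulator. How TorchFort scales or averages accumulated gradients internally is not stated in these notes, so consult the project documentation for the actual semantics and configuration key.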