NVIDIA/TileGym v1.0.0
Repository: NVIDIA/TileGym
Tag: v1.0.0
Published: 2026-03-11T00:21:54Z
Prerelease: yes
Release notes:
What's Changed
- [Bug fix] use padding_mode inside the kernel to process elements out of boundary for softmax by @xjmxyt in https://github.com/NVIDIA/TileGym/pull/1
- [Bug fix] use ct.gather ct.store for softmax's no-tma op by @yifeis-nv in https://github.com/NVIDIA/TileGym/pull/2
- Add PR bot to repository by @arjkesh in https://github.com/NVIDIA/TileGym/pull/3
- Update README.md by @xjmxyt in https://github.com/NVIDIA/TileGym/pull/5
- remove dead code in silu_and_mul kernel - creates output offsets (for 1D), expect n_elements param... but no need... by @lessw2020 in https://github.com/NVIDIA/TileGym/pull/6
- Initialize TileGym CI by @arjkesh in https://github.com/NVIDIA/TileGym/pull/4
- Use ruff formatter, introduce helper dev script by @arjkesh in https://github.com/NVIDIA/TileGym/pull/11
- Introduce job timeouts, speed up builds by @camille-004 in https://github.com/NVIDIA/TileGym/pull/9
- [FEA] add gelu & relu by @xjmxyt in https://github.com/NVIDIA/TileGym/pull/13
- Update dockerfile to use cuda 13.1 base image by @arjkesh in https://github.com/NVIDIA/TileGym/pull/12
- [Fix] Refactor nightly skip logic by @arjkesh in https://github.com/NVIDIA/TileGym/pull/8
- Add automatic header checks and formatting by @arjkesh in https://github.com/NVIDIA/TileGym/pull/14
- Standardize softmax.py to avoid numpy dependency by @lessw2020 in https://github.com/NVIDIA/TileGym/pull/16
- [Update] update kernels and reformat codes by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/18
- [FEA] Add dropout by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/19
- Split-K reduction kernel cleanup by @lessw2020 in https://github.com/NVIDIA/TileGym/pull/21
- Fix: moe_align_block_size() supports non-power-of-2 num_experts by @huanghua1994 in https://github.com/NVIDIA/TileGym/pull/24
- Update autotuner: use experimental autotuner in cutile-python by @xjmxyt in https://github.com/NVIDIA/TileGym/pull/25
- feat: chunked softmax implementation for large column size by @aghilann in https://github.com/NVIDIA/TileGym/pull/17
- [Update] Add benchmark and autotune for group_gemm by @xjmxyt in https://github.com/NVIDIA/TileGym/pull/26
- Fix benchmark failure cases by @arjkesh in https://github.com/NVIDIA/TileGym/pull/27
- Format benchmark files as json, add perf thresholds by @arjkesh in https://github.com/NVIDIA/TileGym/pull/15
- feat: RMSNorm backward pass kernels by @aghilann in https://github.com/NVIDIA/TileGym/pull/29
- Split-K reduction: remove un-needed scaling via INV_LOG_2 by @lessw2020 in https://github.com/NVIDIA/TileGym/pull/22
- [fix] Update benchmark sparse checkout by @arjkesh in https://github.com/NVIDIA/TileGym/pull/30
- [FEA] Add bmm by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/31
- Temporarily avoid job failures due to inconsistent benchmarks by @arjkesh in https://github.com/NVIDIA/TileGym/pull/32
- [Update] Fix bmm issue by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/34
- [FEA] Add Qwen2-7B module by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/36
- Update for ragged_bmm moe by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/37
- Add env "DISABLE_FALLBACK" & fix type hint error & other updates by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/39
- Add reusable retry workflow for runner availability timeouts by @arjkesh in https://github.com/NVIDIA/TileGym/pull/35
- Add mHC fused kernels and tests by @Edward-lyz in https://github.com/NVIDIA/TileGym/pull/38
- Update some comments by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/42
- Add tilegym wheel building by @arjkesh in https://github.com/NVIDIA/TileGym/pull/41
- fix matmul illegal address error on DGX Spark by @xjmxyt in https://github.com/NVIDIA/TileGym/pull/44
- fix qwen2 fp16 bug by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/43
- [Fix] fix num_kv_split becomes 0 by @xjmxyt in https://github.com/NVIDIA/TileGym/pull/45
- Avoid OOM for large GEMM 32k & modify layernorm cutile by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/50
- Add option to ignore specific wheel validations by @arjkesh in https://github.com/NVIDIA/TileGym/pull/51
- Add road map by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/52
- [FEA] Add SwiGLU backward pass implementation, test cases and benchmark by @Weili-0234 in https://github.com/NVIDIA/TileGym/pull/46
- Enable experimental_kernel marker by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/53
- [FEA] Add FlashAttention backward pass implementation, test cases and benchmark by @Weili-0234 in https://github.com/NVIDIA/TileGym/pull/49
- Update README.md by @xjmxyt in https://github.com/NVIDIA/TileGym/pull/54
- Add version for tilegym wheels, update reusable workflow by @arjkesh in https://github.com/NVIDIA/TileGym/pull/55
- Fix import error for experimental marker & support gemma 3 & other updates by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/57
- Add tilegym homepage to setup.py by @arjkesh in https://github.com/NVIDIA/TileGym/pull/58
- Update MoE by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/59
- fix torch dependency by @xjmxyt in https://github.com/NVIDIA/TileGym/pull/61
- feat: replace RMSNorm backward with persistent CuTile kernel by @aghilann in https://github.com/NVIDIA/TileGym/pull/60
- Scan for CVEs in wheels, fix python versions by @arjkesh in https://github.com/NVIDIA/TileGym/pull/64
- feat: add CuTile RoPE backward with tests and backward benchmark by @aghilann in https://github.com/NVIDIA/TileGym/pull/62
- A fix for silu_and_mul & Update codes & other updates by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/67
- Add workflow to prepare release tag and artifacts by @arjkesh in https://github.com/NVIDIA/TileGym/pull/66
- Update moe type hint & Update gitignore & other updates by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/68
- add cutile kernel skill and Move install_requires dependencies to requirements.txt by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/69
- Add SECURITY.md with vulnerability reporting instructions & Add SPDX license header to SECURITY.md & other updates by @hannahli-nv in https://github.com/NVIDIA/TileGym/pull/71
- feat: swiglu forward optimizations by @aghilann in https://github.com/NVIDIA/TileGym/pull/63
- feat: chunked fused linear cross-entropy kernel forward by @aghilann in…
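
Two of the softmax entries above (padding out-of-boundary elements inside the kernel, and a chunked softmax for large column sizes) name techniques that are easy to illustrate outside of any GPU framework. The following is a minimal pure-Python sketch of the general idea, not TileGym's kernel code: the function name `chunked_softmax` and the `chunk_size` parameter are hypothetical, and it assumes the row contains at least one finite value. It reads a row in fixed-width chunks, pads the final partial chunk with negative infinity so the padded slots contribute zero, and keeps a running maximum and running sum that are rescaled whenever the maximum grows.

```python
import math


def chunked_softmax(row, chunk_size):
    """Softmax over one row, reading it in fixed-width chunks.

    Illustrative reference only (hypothetical helper, not a TileGym API).
    Assumes `row` holds at least one finite value.
    """
    n = len(row)
    running_max = -math.inf
    running_sum = 0.0

    # Pass 1: accumulate the row maximum and the sum of exponentials
    # one chunk at a time, so the whole row never has to be held at once.
    for start in range(0, n, chunk_size):
        chunk = list(row[start:start + chunk_size])
        # Pad the last chunk to full width with -inf: exp(-inf) == 0.0,
        # so out-of-boundary slots do not change the sum or the maximum.
        chunk += [-math.inf] * (chunk_size - len(chunk))

        new_max = max(running_max, max(chunk))
        # Rescale the previous partial sum to the new maximum, then add
        # this chunk's contribution relative to the same maximum.
        running_sum = running_sum * math.exp(running_max - new_max) + sum(
            math.exp(v - new_max) for v in chunk
        )
        running_max = new_max

    # Pass 2: normalize every element with the final maximum and sum.
    return [math.exp(v - running_max) / running_sum for v in row]


if __name__ == "__main__":
    # Row length 5 with chunk width 2 forces a padded final chunk.
    print(chunked_softmax([1.0, 2.0, 3.0, 4.0, 5.0], chunk_size=2))
```

Subtracting the running maximum before exponentiating keeps every exponent at or below zero, which is why the sketch does not overflow on large inputs such as `[1000.0, 1001.0]`.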