NVIDIA/tilus v0.2.0
Repository: NVIDIA/tilus
Tag: v0.2.0
Published: 2026-04-17T17:56:00Z
Prerelease: no
Release notes:
Tilus v0.2.0 Release Notes
Tilus v0.2.0 brings Blackwell GPU support, initial Hopper support, and a comprehensive set of tutorials, documentation, and optimizations. This release transforms Tilus from an Ampere-focused DSL into one that targets NVIDIA's latest GPU architectures, with fine-grained control over Tensor Memory, TMA, and Cluster Launch Control.
Highlights
- Blackwell (SM 10.0) support — Write kernels using 5th-gen Tensor Cores (tcgen05), Tensor Memory (TMEM), TMA bulk copies, and 2-CTA Tensor Core mode.
- Hopper (SM 9.0) support — Initial support for WGMMA instructions and warp-specialized GEMM.
- Step-by-step matmul tutorials — Seven tutorials (V0–V6) walk through building a high-performance Blackwell matmul from scratch, covering TMA, software pipelining, warp specialization, tile rasterization, CLC persistent kernels, and 2-CTA mode.
- Codegen optimizations — Faster generated code via fast divmod, predicated instruction emission, warp-uniform code generation, and named barriers.
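As a rough illustration of the tile rasterization mentioned in the tutorial list, the sketch below maps a linear tile index to a 2D output tile using a grouped ordering, so that consecutively scheduled tiles share rows of the output and tend to reuse operands from L2. This is a generic scheme written in plain Python; the function name and the `group_m` parameter are hypothetical, and the Tilus tutorials may use a different mapping.

```python
def rasterize_tile(pid: int, num_m: int, num_n: int, group_m: int) -> tuple[int, int]:
    """Map a linear tile id to (tile_m, tile_n) using grouped ordering."""
    tiles_per_group = group_m * num_n
    group_id = pid // tiles_per_group
    first_m = group_id * group_m
    # The last group may be shorter when num_m is not a multiple of group_m.
    rows_in_group = min(num_m - first_m, group_m)
    local = pid % tiles_per_group
    tile_m = first_m + local % rows_in_group
    tile_n = local // rows_in_group
    return tile_m, tile_n


if __name__ == "__main__":
    order = [rasterize_tile(p, num_m=4, num_n=4, group_m=2) for p in range(16)]
    print(order[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

With `group_m=2`, the first eight tiles cover only two rows of the output grid before the schedule moves on to the next pair of rows.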
Blackwell Architecture
Instruction Support
- Tensor Memory (tcgen05): `alloc`, `dealloc`, `relinquish_alloc_permit`, `load`, `store`, `wait`, `copy`, `commit`, and `mma`, including 2-CTA mode ([#47], [#48], [#50], [#55], [#89])
- TMA (Tensor Memory Accelerator): Asynchronous bulk copy instructions for global↔shared memory transfers ([#46])
- Cluster Launch Control (CLC): Cross-CTA scheduling and synchronization primitives ([#68])
- mbarrier: Full set of memory barrier instructions with explicit `arrive_and_expect_tx` ([#38], [#88])
- cp.async.bulk: Bulk async copy with `.read` modifier for `wait_group` ([#40], [#106])
- Fence instructions: Refactored fence support for proxy async and memory ordering ([#110])
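To make the role of `arrive_and_expect_tx` more concrete, here is a toy Python model of a transaction-counting barrier: a phase completes only when all expected arrivals have happened and all expected bytes have been reported as transferred. It is a simplified sketch of the hardware semantics as commonly described, not Tilus code; the class `ToyMbarrier` and its method names other than `arrive_and_expect_tx` are hypothetical.

```python
class ToyMbarrier:
    """Toy model of a barrier that tracks arrivals and pending bytes."""

    def __init__(self, expected_arrivals: int) -> None:
        self.expected = expected_arrivals
        self.pending = expected_arrivals
        self.tx_count = 0  # bytes still expected from async copies
        self.phase = 0     # parity bit that flips when a phase completes

    def _maybe_complete(self) -> None:
        if self.pending == 0 and self.tx_count == 0:
            self.phase ^= 1
            self.pending = self.expected

    def arrive_and_expect_tx(self, nbytes: int) -> None:
        # Register the bytes an async copy will deliver, then arrive.
        self.tx_count += nbytes
        self.pending -= 1
        self._maybe_complete()

    def complete_tx(self, nbytes: int) -> None:
        # Called (conceptually by the copy engine) when bytes have landed.
        self.tx_count -= nbytes
        self._maybe_complete()

    def try_wait(self, phase: int) -> bool:
        # True once the phase the caller observed has completed.
        return self.phase != phase


if __name__ == "__main__":
    bar = ToyMbarrier(expected_arrivals=1)
    bar.arrive_and_expect_tx(4096)
    print(bar.try_wait(0))  # False: the bytes have not arrived yet
    bar.complete_tx(4096)
    print(bar.try_wait(0))  # True: arrivals and bytes are both accounted for
```

The point of the model is that a consumer waiting on the barrier is released by data arrival, not merely by the producer thread having issued the copy.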
Layout System
- Tensor Memory layout for TMEM tensors ([#80])
- Refactored shared memory layout system with byte-level swizzle support ([#85], [#109])
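For readers unfamiliar with byte-level swizzling, the following Python sketch shows one widely used pattern: within each 1024-byte block, the index of a 16-byte chunk inside a 128-byte row is XORed with the low three bits of the row index. Whether Tilus uses exactly this pattern is an assumption; the function name `swizzle_128b` is hypothetical and chosen only for illustration.

```python
def swizzle_128b(offset: int) -> int:
    """XOR bits [7, 10) of a byte offset into bits [4, 7)."""
    row_bits = (offset >> 7) & 0x7  # row index within an 8-row block
    return offset ^ (row_bits << 4)  # permute the 16-byte chunk index


if __name__ == "__main__":
    # The first 16-byte chunk of 8 consecutive 128-byte rows lands in
    # 8 different chunk positions, which spreads accesses across banks.
    chunks = [(swizzle_128b(row * 128) >> 4) & 0x7 for row in range(8)]
    print(chunks)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because XOR with a value derived from untouched bits is its own inverse, the mapping is a permutation of byte offsets, so no two source bytes collide after swizzling.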
GEMM Examples (V0–V8)
A progressive series of Blackwell matmul examples demonstrating increasing optimization levels ([#58], [#59], [#64], [#66], [#75], [#81], [#90], [#91], [#95]), reorganized for clarity in [#128].
Hopper Architecture
- WGMMA instructions for Hopper Tensor Cores ([#83])
- Hopper GEMM examples: pipelined matmul and warp-specialized GEMM ([#84], [#86])
Language & IR
- Thread groups: Support for `thread_group`, `single_thread`, `single_warp`, and `warp_group` in Tilus Script ([#41])
- State construct: Persistent state across kernel invocations ([#71])
- Cluster dimensions: Specify cluster layout via `cluster_blocks` ([#33])
- Tensor indexing/slicing: Support for shared and global tensor indexing ([#37])
- `.item()` / `.item_ptr()`: Access scalar values and pointers from tensors ([#60])
- Target suffixes: Added `a` (architecture-specific) and `f` (family-portable) target variants ([#32])
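As a small illustration of the target suffixes, the sketch below parses target names written in the NVCC-style `sm_XY[a|f]` form, where the last digit is the minor version. That Tilus spells its targets this way is an assumption, and `Target` and `parse_target` are hypothetical helpers, not part of the Tilus API.

```python
import re
from typing import NamedTuple


class Target(NamedTuple):
    major: int
    minor: int
    suffix: str  # "" (baseline), "a" (architecture-specific), "f" (family-portable)


# Assumed naming scheme: "sm_" + major digits + one minor digit + optional suffix.
_TARGET_RE = re.compile(r"sm_(\d+)(\d)([af]?)")


def parse_target(name: str) -> Target:
    match = _TARGET_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized target name: {name!r}")
    return Target(int(match.group(1)), int(match.group(2)), match.group(3))


if __name__ == "__main__":
    print(parse_target("sm_100a"))  # Target(major=10, minor=0, suffix='a')
    print(parse_target("sm_90"))    # Target(major=9, minor=0, suffix='')
```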
Codegen & Optimizations
- Fast divmod: Hardware-accelerated integer division ([#117])
- Predicated instruction emission: Reduced warp divergence in generated code ([#114])
- Warp-uniform code generation: Uniform execution for `ThreadGroupStmt` ([#96])
- Named barriers: Use named barriers for warpgroup sync instead of mbarrier ([#113])
- Dead code elimination: New Tilus IR pass ([#92])
- Barrier register spill avoidance: Prevent local memory spill for barrier tensors ([#93])
- TVM-FFI runtime: Generated libraries now use the TVM-FFI ABI ([#53])
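The fast divmod item in the list above refers to a family of techniques that replace integer division by a divisor known before the kernel runs with a multiply and a shift. The Python sketch below shows one common formulation for 32-bit unsigned dividends; it is not taken from the Tilus source, and the helper name `make_fast_divmod` is hypothetical.

```python
def make_fast_divmod(divisor: int, bits: int = 32):
    """Precompute a magic multiplier so n // divisor becomes multiply + shift."""
    if not 0 < divisor < (1 << bits):
        raise ValueError("divisor must satisfy 0 < divisor < 2**bits")
    shift = (divisor - 1).bit_length()                 # ceil(log2(divisor))
    magic = -(-(1 << (bits + shift)) // divisor)       # ceil(2**(bits+shift) / divisor)

    def fast_divmod(n: int) -> tuple[int, int]:
        # Exact for 0 <= n < 2**bits; no division instruction is needed here.
        quotient = (n * magic) >> (bits + shift)
        return quotient, n - quotient * divisor

    return fast_divmod


if __name__ == "__main__":
    divmod_by_7 = make_fast_divmod(7)
    print(divmod_by_7(100))  # (14, 2)
```

The precomputation runs once on the host, so the per-element cost inside a kernel reduces to a wide multiply, a shift, and a multiply-subtract for the remainder.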
Documentation & Tutorials
- Blackwell matmul tutorial series (V0–V6): Step-by-step guides covering the full optimization journey ([#123]–[#131])
- Instruction documentation: Comprehensive API docs for all instruction groups ([#122])
- Programming guides: Revised guides for thread groups, autotuning, caching, and targets ([#122], [#134])
- Interactive register layout demo ([#98])
- Multi-version docs with Sphinx ([#119])
Infrastructure
- Pre-commit hooks for lint and formatting ([#62])
- Docstring lint enforcement ([#74])
- Separate CI workflows for tests and docs ([#121])
- Python version compatibility tests ([#121])
- Nsight Compute report analysis skill ([#105])
Bug Fixes
- Fix low-precision pointer assignment transformation ([#35])
- Fix tcgen05.cp codegen ([#51])
- Fix volatile specifier for MMA instruction ([#44])
- Add `cuda_bf16.h` include header ([#77])
- Fix semaphore codegen ([#112])
New Contributors
- @soodoshll — Hopper WGMMA and GEMM examples ([#83], [#84], [#86])
- @qiching — Fused softmax example ([#99])
- @splint-disk-8i — CI improvements and README edits ([#97], [#102])
- @WilliamZhang20 — Vector addition example ([#111])
Full Changelog: https://github.com/NVIDIA/tilus/compare/v0.1.1...v0.2.0
[#32]: https://github.com/NVIDIA/tilus/pull/32
[#33]: https://github.com/NVIDIA/tilus/pull/33
[#35]: https://github.com/NVIDIA/tilus/pull/35
[#37]: https://github.com/NVIDIA/tilus/pull/37
[#38]: https://github.com/NVIDIA/tilus/pull/38
[#40]: https://github.com/NVIDIA/tilus/pull/40
[#41]: https://github.com/NVIDIA/tilus/pull/41
[#44]: https://github.com/NVIDIA/tilus/pull/44
[#46]: https://github.com/NVIDIA/tilus/pull/46
[#47]: https://github.com/NVIDIA/tilus/pull/47
[#48]: https://github.com/NVIDIA/tilus/pull/48
[#50]: https://github.com/NVIDIA/tilus/pull/50
[#51]: https://github.com/NVIDIA/tilus/pull/51
[#53]: https://github.com/NVIDIA/tilus/pull/53
[#55]: https://github.com/NVIDIA/tilus/pull/55
[#58]: https://github.com/NVIDIA/tilus/pull/58
[#59]: https://github.com/NVIDIA/tilus/pull/59
[#60]: https://github.com/NVIDIA/tilus/pull/60
[#62]: https://github.com/NVIDIA/tilus/pull/62
[#64]: https://github.com/NVIDIA/tilus/pull/64
[#66]: https://github.com/NVIDIA/tilus/pull/66
[#68]: https://github.com/NVIDIA/tilus/pull/68
[#71]: https://github.com/NVIDIA/tilus/pull/71
[#74]: https://github.com/NVIDIA/tilus/pull/74
[#75]: https://github.com/NVIDIA/tilus/pull/75
[#77]: https://github.com/NVIDIA/tilus/pull/77
[#80]: https://github.com/NVIDIA/tilus/pull/80
[#81]: https://github.com/NVIDIA/tilus/pull/81
[#83]: https://github.com/NVIDIA/tilus/pull/83
[#84]: https://github.com/NVIDIA/tilus/pull/84
[#85]: https://github.com/NVIDIA/tilus/pull/85
[#86]: https://github.com/NVIDIA/tilus/pull/86
[#88]: https://github.com/NVIDIA/tilus/pull/88
[#89]: https://github.com/NVIDIA/tilus/pull/89
[#90]:…