Release · NVIDIA · published Apr 17, 2026

NVIDIA/tilus v0.2.0

v0.2.0

Repository: NVIDIA/tilus

Tag: v0.2.0

Published: 2026-04-17T17:56:00Z

Prerelease: no

Release notes:

Tilus v0.2.0 Release Notes

Tilus v0.2.0 brings Blackwell GPU support, initial Hopper support, and a comprehensive set of tutorials, documentation, and optimizations. This release transforms Tilus from an Ampere-focused DSL into one that targets NVIDIA's latest GPU architectures, with fine-grained control over Tensor Memory, TMA, and Cluster Launch Control.

Highlights

  • Blackwell (SM 10.0) support — Write kernels using 5th-gen Tensor Cores (tcgen05), Tensor Memory (TMEM), TMA bulk copies, and 2-CTA Tensor Core mode.
  • Hopper (SM 9.0) support — Initial support for WGMMA instructions and warp-specialized GEMM.
  • Step-by-step matmul tutorials — Seven tutorials (V0–V6) walk through building a high-performance Blackwell matmul from scratch, covering TMA, software pipelining, warp specialization, tile rasterization, CLC persistent kernels, and 2-CTA mode.
  • Codegen optimizations — Faster generated code via fast divmod, predicated instruction emission, warp-uniform code generation, and named barriers.
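Tile rasterization, one of the tutorial topics listed above, changes the order in which output tiles are assigned to thread blocks so that blocks running at the same time reuse the same input tiles from L2 cache. The sketch below shows one common grouped ordering in plain Python. It is an illustration only: the function name and the `group_m` parameter are hypothetical, and the release notes do not say which ordering the Tilus tutorials use.

```python
def grouped_tile_index(pid, num_tiles_m, num_tiles_n, group_m):
    """Map a linear tile id to (tile_m, tile_n) using a grouped ordering.

    Tiles are visited in bands of `group_m` rows. Inside a band the row
    index varies fastest, so consecutive ids share the same column of B
    and only a few rows of A.
    """
    tiles_per_group = group_m * num_tiles_n
    group_id = pid // tiles_per_group
    first_m = group_id * group_m
    # The last band is shorter when num_tiles_m is not a multiple of group_m.
    rows_in_group = min(num_tiles_m - first_m, group_m)
    local = pid % tiles_per_group
    tile_m = first_m + local % rows_in_group
    tile_n = local // rows_in_group
    return tile_m, tile_n


if __name__ == "__main__":
    # 5 x 4 tiles, bands of 2 rows (all values are hypothetical).
    for pid in range(8):
        print(pid, grouped_tile_index(pid, 5, 4, 2))
```

With a row-major ordering, the first eight ids would span two full rows of tiles; with the grouped ordering they stay inside a 2-row band, which keeps the working set smaller.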

Blackwell Architecture

Instruction Support

  • Tensor Memory (tcgen05): alloc, dealloc, relinquish_alloc_permit, load, store, wait, copy, commit, and mma — including 2-CTA mode ([#47], [#48], [#50], [#55], [#89])
  • TMA (Tensor Memory Accelerator): Asynchronous bulk copy instructions for global↔shared memory transfers ([#46])
  • Cluster Launch Control (CLC): Cross-CTA scheduling and synchronization primitives ([#68])
  • mbarrier: Full set of memory barrier instructions with explicit arrive_and_expect_tx ([#38], [#88])
  • cp.async.bulk: Bulk async copy with .read modifier for wait_group ([#40], [#106])
  • Fence instructions: Refactored fence support for proxy async and memory ordering ([#110])
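To make the role of `arrive_and_expect_tx` concrete, here is a small Python model of the bookkeeping an mbarrier performs when it tracks asynchronous copies. It is a conceptual sketch, not Tilus code: the class and method names are hypothetical, and it models only the pending-arrival count, the transaction byte count, and the phase bit.

```python
class MBarrierModel:
    """Toy model of an mbarrier with transaction-count tracking."""

    def __init__(self, expected_arrivals):
        self.expected_arrivals = expected_arrivals
        self.pending_arrivals = expected_arrivals
        self.tx_count = 0  # bytes still expected from async copies
        self.phase = 0     # flips each time the barrier completes

    def _maybe_complete(self):
        # A phase completes only when every arrival has happened AND
        # every expected byte has been delivered.
        if self.pending_arrivals == 0 and self.tx_count == 0:
            self.phase ^= 1
            self.pending_arrivals = self.expected_arrivals

    def arrive_and_expect_tx(self, tx_bytes):
        # Register the bytes an async copy will deliver, then arrive.
        self.tx_count += tx_bytes
        self.pending_arrivals -= 1
        self._maybe_complete()

    def complete_tx(self, num_bytes):
        # Called (conceptually by the copy engine) as data lands.
        self.tx_count -= num_bytes
        self._maybe_complete()


if __name__ == "__main__":
    bar = MBarrierModel(expected_arrivals=1)
    bar.arrive_and_expect_tx(4096)  # producer issues a 4 KiB copy
    print(bar.phase)                # still 0: data has not landed
    bar.complete_tx(2048)
    bar.complete_tx(2048)
    print(bar.phase)                # 1: consumers waiting on phase 0 may go
```

The point of the model is that the producer's arrival alone does not release waiting threads; the barrier also waits for the byte count to drain to zero.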

Layout System

  • Tensor Memory layout for TMEM tensors ([#80])
  • Refactored shared memory layout system with byte-level swizzle support ([#85], [#109])
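As background on byte-level swizzling, the sketch below applies a widely used 128-byte XOR pattern to shared memory byte offsets: the 16-byte chunk index within a 128-byte row is XORed with the low three bits of the row index. The release notes do not list which patterns Tilus supports, so treat the function name and the choice of pattern as assumptions made for illustration.

```python
def swizzle_128b(byte_offset):
    """XOR bits [7, 10) of the offset (row index mod 8) into
    bits [4, 7) (16-byte chunk index inside a 128-byte row)."""
    row = (byte_offset >> 7) & 0x7
    return byte_offset ^ (row << 4)


if __name__ == "__main__":
    # Read chunk 0 of eight consecutive 128-byte rows (a "column" access).
    for r in range(8):
        offset = r * 128
        chunk = (swizzle_128b(offset) >> 4) & 0x7
        print(f"row {r}: chunk 0 is stored at chunk slot {chunk}")
```

Without the swizzle, all eight accesses would land in the same 16-byte slot of each row and therefore in the same banks; with it, each row uses a different slot, so the column access spreads across all banks.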

GEMM Examples (V0–V8)

A progressive series of Blackwell matmul examples demonstrating increasing optimization levels ([#58], [#59], [#64], [#66], [#75], [#81], [#90], [#91], [#95]), reorganized for clarity in [#128].

Hopper Architecture

  • WGMMA instructions for Hopper Tensor Cores ([#83])
  • Hopper GEMM examples: pipelined matmul and warp-specialized GEMM ([#84], [#86])
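For readers new to pipelined matmul, the following Python sketch prints the order of load and compute steps in a multi-stage software pipeline over the K dimension. It only models the schedule and is not taken from the Tilus examples; `pipelined_schedule`, `num_tiles`, and `num_stages` are hypothetical names.

```python
def pipelined_schedule(num_tiles, num_stages):
    """Return (op, tile, buffer_slot) events for a software pipeline.

    `num_stages` shared-memory buffers are used as a ring. The prologue
    prefetches num_stages - 1 tiles; each loop iteration then issues the
    load for a future tile before computing on the current one.
    """
    events = []
    for k in range(min(num_stages - 1, num_tiles)):
        events.append(("load", k, k % num_stages))
    for k in range(num_tiles):
        nxt = k + num_stages - 1
        if nxt < num_tiles:
            # This slot held tile k - 1, which was consumed last iteration.
            events.append(("load", nxt, nxt % num_stages))
        events.append(("compute", k, k % num_stages))
    return events


if __name__ == "__main__":
    for event in pipelined_schedule(num_tiles=5, num_stages=3):
        print(event)
```

In a real kernel the loads are asynchronous, so the load for tile `k + num_stages - 1` overlaps with the compute on tile `k`.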

Language & IR

  • Thread groups: Support for thread_group, single_thread, single_warp, and warp_group in Tilus Script ([#41])
  • State construct: Persistent state across kernel invocations ([#71])
  • Cluster dimensions: Specify cluster layout via cluster_blocks ([#33])
  • Tensor indexing/slicing: Support for shared and global tensor indexing ([#37])
  • `.item()` / `.item_ptr()`: Access scalar values and pointers from tensors ([#60])
  • Target suffixes: Added `a` (architecture-specific) and `f` (family-portable) target variants ([#32])

Codegen & Optimizations

  • Fast divmod: Hardware-accelerated integer division ([#117])
  • Predicated instruction emission: Reduced warp divergence in generated code ([#114])
  • Warp-uniform code generation: Uniform execution for ThreadGroupStmt ([#96])
  • Named barriers: Use named barriers for warpgroup sync instead of mbarrier ([#113])
  • Dead code elimination: New Tilus IR pass ([#92])
  • Barrier register spill avoidance: Prevent local memory spill for barrier tensors ([#93])
  • TVM-FFI runtime: Generated libraries now use the TVM-FFI ABI ([#53])
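Integer division and modulo show up constantly in index arithmetic, and GPUs have no fast native integer divide, so compilers often replace division by a known divisor with a multiply and a shift. The sketch below shows that general technique for unsigned 32-bit operands. It is an assumption-laden illustration: the release notes do not describe how the fast divmod in [#117] is implemented, and `make_fast_divmod` is a hypothetical name.

```python
def make_fast_divmod(d):
    """Precompute a magic multiplier so n // d becomes multiply + shift.

    Valid for 1 <= d < 2**31 and 0 <= n < 2**32.
    """
    if not 1 <= d < 2**31:
        raise ValueError("divisor out of supported range")
    # shift = 32 + ceil(log2(d)); magic = ceil(2**shift / d)
    shift = 32 + (d - 1).bit_length()
    magic = -(-(1 << shift) // d)

    def fast_divmod(n):
        q = (n * magic) >> shift  # on a GPU: a wide multiply-high and a shift
        return q, n - q * d       # remainder from one multiply-subtract

    return fast_divmod


if __name__ == "__main__":
    divmod_by_7 = make_fast_divmod(7)
    print(divmod_by_7(100))
```

The magic number depends only on the divisor, so it can be computed once on the host (or at compile time) and reused for every thread.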

Documentation & Tutorials

  • Blackwell matmul tutorial series (V0–V6): Step-by-step guides covering the full optimization journey ([#123]–[#131])
  • Instruction documentation: Comprehensive API docs for all instruction groups ([#122])
  • Programming guides: Revised guides for thread groups, autotuning, caching, and targets ([#122], [#134])
  • Interactive register layout demo ([#98])
  • Multi-version docs with Sphinx ([#119])

Infrastructure

  • Pre-commit hooks for lint and formatting ([#62])
  • Docstring lint enforcement ([#74])
  • Separate CI workflows for tests and docs ([#121])
  • Python version compatibility tests ([#121])
  • Nsight Compute report analysis skill ([#105])

Bug Fixes

  • Fix low-precision pointer assignment transformation ([#35])
  • Fix tcgen05.cp codegen ([#51])
  • Fix volatile specifier for MMA instruction ([#44])
  • Add cuda_bf16.h include header ([#77])
  • Fix semaphore codegen ([#112])

New Contributors

  • @soodoshll — Hopper WGMMA and GEMM examples ([#83], [#84], [#86])
  • @qiching — Fused softmax example ([#99])
  • @splint-disk-8i — CI improvements and README edits ([#97], [#102])
  • @WilliamZhang20 — Vector addition example ([#111])

Full Changelog: https://github.com/NVIDIA/tilus/compare/v0.1.1...v0.2.0

[#32]: https://github.com/NVIDIA/tilus/pull/32
[#33]: https://github.com/NVIDIA/tilus/pull/33
[#35]: https://github.com/NVIDIA/tilus/pull/35
[#37]: https://github.com/NVIDIA/tilus/pull/37
[#38]: https://github.com/NVIDIA/tilus/pull/38
[#40]: https://github.com/NVIDIA/tilus/pull/40
[#41]: https://github.com/NVIDIA/tilus/pull/41
[#44]: https://github.com/NVIDIA/tilus/pull/44
[#46]: https://github.com/NVIDIA/tilus/pull/46
[#47]: https://github.com/NVIDIA/tilus/pull/47
[#48]: https://github.com/NVIDIA/tilus/pull/48
[#50]: https://github.com/NVIDIA/tilus/pull/50
[#51]: https://github.com/NVIDIA/tilus/pull/51
[#53]: https://github.com/NVIDIA/tilus/pull/53
[#55]: https://github.com/NVIDIA/tilus/pull/55
[#58]: https://github.com/NVIDIA/tilus/pull/58
[#59]: https://github.com/NVIDIA/tilus/pull/59
[#60]: https://github.com/NVIDIA/tilus/pull/60
[#62]: https://github.com/NVIDIA/tilus/pull/62
[#64]: https://github.com/NVIDIA/tilus/pull/64
[#66]: https://github.com/NVIDIA/tilus/pull/66
[#68]: https://github.com/NVIDIA/tilus/pull/68
[#71]: https://github.com/NVIDIA/tilus/pull/71
[#74]: https://github.com/NVIDIA/tilus/pull/74
[#75]: https://github.com/NVIDIA/tilus/pull/75
[#77]: https://github.com/NVIDIA/tilus/pull/77
[#80]: https://github.com/NVIDIA/tilus/pull/80
[#81]: https://github.com/NVIDIA/tilus/pull/81
[#83]: https://github.com/NVIDIA/tilus/pull/83
[#84]: https://github.com/NVIDIA/tilus/pull/84
[#85]: https://github.com/NVIDIA/tilus/pull/85
[#86]: https://github.com/NVIDIA/tilus/pull/86
[#88]: https://github.com/NVIDIA/tilus/pull/88
[#89]: https://github.com/NVIDIA/tilus/pull/89
[#90]:…