What does this release signal mean?

NVIDIA published v0.2.0 of NVIDIA/tilus, an open-source toolkit, on 2026-04-17T17:56:00Z as a full release (not a prerelease). This release signal is evidence of what shipped, changed, or was packaged for users. According to the release notes, Tilus v0.2.0 brings Blackwell GPU support, initial Hopper support, and a comprehensive set of tutorials, documentation, and optimizations. onlylabs links this event to 1 captured evidence page and 6 related release signals.

NVIDIA Release: NVIDIA/tilus v0.2.0

Captured source

GitHub · github.com/NVIDIA/tilus

NVIDIA/tilus v0.2.0

published Apr 17, 2026 · seen Jun 9 · captured Jun 11 · http 200 · method plain

v0.2.0

Repository: NVIDIA/tilus

Tag: v0.2.0

Published: 2026-04-17T17:56:00Z

Prerelease: no

Release notes:

Tilus v0.2.0 Release Notes

Tilus v0.2.0 brings Blackwell GPU support, initial Hopper support, and a comprehensive set of tutorials, documentation, and optimizations. This release transforms Tilus from an Ampere-focused DSL into one that targets NVIDIA's latest GPU architectures, with fine-grained control over Tensor Memory, TMA, and Cluster Launch Control.

Highlights

Blackwell (SM 10.0) support — Write kernels using 5th-gen Tensor Cores (tcgen05), Tensor Memory (TMEM), TMA bulk copies, and 2-CTA Tensor Core mode.
Hopper (SM 9.0) support — Initial support for WGMMA instructions and warp-specialized GEMM.
Step-by-step matmul tutorials — Seven tutorials (V0–V6) walk through building a high-performance Blackwell matmul from scratch, covering TMA, software pipelining, warp specialization, tile rasterization, CLC persistent kernels, and 2-CTA mode.
Codegen optimizations — Faster generated code via fast divmod, predicated instruction emission, warp-uniform code generation, and named barriers.
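
Tile rasterization, one of the tutorial topics listed above, can be illustrated without any GPU code. The sketch below is a generic grouped ordering of output tiles, written in plain Python; it is not taken from the Tilus tutorials, and the function name `rasterize` and the `group_m` parameter are assumptions chosen for illustration. The idea is that consecutive block ids cover a narrow band of rows before moving on, so blocks that run at the same time tend to reuse the same input tiles.

```python
def rasterize(pid, num_m, num_n, group_m):
    """Map a linear block id to a (tile_m, tile_n) output tile, group by group.

    Hypothetical helper: tiles are visited in bands of `group_m` rows, and
    within a band the row index changes fastest.
    """
    tiles_per_group = group_m * num_n
    group_id = pid // tiles_per_group
    first_m = group_id * group_m
    # The last band may hold fewer than group_m rows.
    rows = min(num_m - first_m, group_m)
    local = pid % tiles_per_group
    return first_m + local % rows, local // rows


if __name__ == "__main__":
    # Visit order for a 4 x 3 grid of tiles with bands of 2 rows.
    print([rasterize(p, num_m=4, num_n=3, group_m=2) for p in range(12)])
```

Every tile is still visited exactly once; only the order changes, which is why this kind of remapping can be applied without touching the rest of a kernel.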

Blackwell Architecture

Instruction Support

Tensor Memory (tcgen05): alloc, dealloc, relinquish_alloc_permit, load, store, wait, copy, commit, and mma — including 2-CTA mode ([#47], [#48], [#50], [#55], [#89])
TMA (Tensor Memory Accelerator): Asynchronous bulk copy instructions for global↔shared memory transfers ([#46])
Cluster Launch Control (CLC): Cross-CTA scheduling and synchronization primitives ([#68])
mbarrier: Full set of memory barrier instructions with explicit arrive_and_expect_tx ([#38], [#88])
cp.async.bulk: Bulk async copy with .read modifier for wait_group ([#40], [#106])
Fence instructions: Refactored fence support for proxy async and memory ordering ([#110])
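
To make the role of arrive_and_expect_tx more concrete, here is a toy host-side model of a transaction-counting barrier. It is a sketch under stated assumptions, not the Tilus or PTX API: the class `ToyMBarrier` and its method names are hypothetical, and the model assumes that a barrier phase completes once both the pending-arrival count and the outstanding byte count reach zero.

```python
class ToyMBarrier:
    """Toy model of a barrier that tracks arrivals and expected bytes.

    Hypothetical class for illustration only; real mbarrier objects live in
    shared memory and are driven by GPU instructions.
    """

    def __init__(self, expected_arrivals):
        self.expected_arrivals = expected_arrivals
        self.pending_arrivals = expected_arrivals
        self.tx_count = 0  # bytes still owed by asynchronous copies
        self.phase = 0

    def arrive_and_expect_tx(self, num_bytes):
        # Register the bytes an async copy will deliver, then count the arrival.
        self.tx_count += num_bytes
        self.pending_arrivals -= 1
        self._maybe_complete()

    def complete_tx(self, num_bytes):
        # Stands in for an async copy reporting that num_bytes have landed.
        self.tx_count -= num_bytes
        self._maybe_complete()

    def _maybe_complete(self):
        # The phase flips only when all arrivals and all bytes are accounted for.
        if self.pending_arrivals == 0 and self.tx_count == 0:
            self.phase ^= 1
            self.pending_arrivals = self.expected_arrivals


if __name__ == "__main__":
    bar = ToyMBarrier(expected_arrivals=1)
    bar.arrive_and_expect_tx(256)  # producer promises 256 bytes
    bar.complete_tx(128)           # first half of the copy lands
    bar.complete_tx(128)           # second half lands, phase flips
    print(bar.phase)
```

In this model a consumer would wait for the phase bit to change, which happens only after the promised bytes have all arrived, not merely after the producer thread has signalled.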

Layout System

Tensor Memory layout for TMEM tensors ([#80])
Refactored shared memory layout system with byte-level swizzle support ([#85], [#109])
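
As background for the byte-level swizzle mentioned above, the following is a minimal sketch of a generic XOR swizzle on byte offsets. It is not the Tilus layout API: the function name `swizzle_128b` and the parameter values (3 bits, 16-byte chunks, 128-byte rows) are assumptions that mirror a commonly used 128-byte pattern. Bits 7 to 9 of the offset (the row within an 8-row block) are XORed into bits 4 to 6 (the 16-byte chunk within the row).

```python
def swizzle_128b(offset, bits=3, base=4, shift=3):
    """XOR-swizzle a byte offset (hypothetical helper, defaults are assumptions).

    `bits` offset bits starting at position `base + shift` are XORed into the
    `bits` bits starting at position `base`.
    """
    mask = ((1 << bits) - 1) << (base + shift)
    return offset ^ ((offset & mask) >> shift)


if __name__ == "__main__":
    # The first 16-byte chunk of each of 8 consecutive 128-byte rows
    # lands in a different chunk slot after swizzling.
    for row in range(8):
        print(row, (swizzle_128b(row * 128) % 128) // 16)
```

Because only lower bits are modified, and only by a function of higher bits that stay unchanged, the mapping is its own inverse, so no two offsets collide.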

GEMM Examples (V0–V8)

A progressive series of Blackwell matmul examples demonstrating increasing optimization levels ([#58], [#59], [#64], [#66], [#75], [#81], [#90], [#91], [#95]), reorganized for clarity in [#128].

Hopper Architecture

WGMMA instructions for Hopper Tensor Cores ([#83])
Hopper GEMM examples: pipelined matmul and warp-specialized GEMM ([#84], [#86])

Language & IR

Thread groups: Support for thread_group, single_thread, single_warp, and warp_group in Tilus Script ([#41])
State construct: Persistent state across kernel invocations ([#71])
Cluster dimensions: Specify cluster layout via cluster_blocks ([#33])
Tensor indexing/slicing: Support for shared and global tensor indexing ([#37])
`.item()` / `.item_ptr()`: Access scalar values and pointers from tensors ([#60])
Target suffixes: Added `a` (architecture-specific) and `f` (family-portable) target variants ([#32])

Codegen & Optimizations

Fast divmod: Hardware-accelerated integer division ([#117])
Predicated instruction emission: Reduced warp divergence in generated code ([#114])
Warp-uniform code generation: Uniform execution for ThreadGroupStmt ([#96])
Named barriers: Use named barriers for warpgroup sync instead of mbarrier ([#113])
Dead code elimination: New Tilus IR pass ([#92])
Barrier register spill avoidance: Prevent local memory spill for barrier tensors ([#93])
TVM-FFI runtime: Generated libraries now use the TVM-FFI ABI ([#53])
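
"Fast divmod" commonly refers to replacing division by a divisor that is fixed for the lifetime of a kernel with a multiply and a shift. Whether [#117] uses exactly this scheme is an assumption; the sketch below shows the general technique in plain Python, with hypothetical function names `make_magic` and `fast_divmod`, for unsigned 32-bit dividends.

```python
def make_magic(divisor, width=32):
    """Precompute (multiplier, total_shift) for dividends in [0, 2**width).

    Hypothetical helper: multiplier = ceil(2**total_shift / divisor), where
    total_shift = width + ceil(log2(divisor)).
    """
    if divisor < 1:
        raise ValueError("divisor must be positive")
    shift = (divisor - 1).bit_length()  # ceil(log2(divisor)) for divisor >= 1
    total_shift = width + shift
    multiplier = -(-(1 << total_shift) // divisor)  # ceiling division
    return multiplier, total_shift


def fast_divmod(n, divisor, multiplier, total_shift):
    """Return (n // divisor, n % divisor) using one multiply and one shift."""
    # Python integers are unbounded; fixed-width hardware needs extra care
    # because the multiplier can be one bit wider than the dividend.
    quotient = (n * multiplier) >> total_shift
    return quotient, n - quotient * divisor


if __name__ == "__main__":
    multiplier, total_shift = make_magic(7)
    print(fast_divmod(100, 7, multiplier, total_shift))
```

The precomputation runs once per divisor, typically on the host, so the per-element cost in generated code is a multiply, a shift, and a multiply-subtract for the remainder.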

Documentation & Tutorials

Blackwell matmul tutorial series (V0–V6): Step-by-step guides covering the full optimization journey ([#123]–[#131])
Instruction documentation: Comprehensive API docs for all instruction groups ([#122])
Programming guides: Revised guides for thread groups, autotuning, caching, and targets ([#122], [#134])
Interactive register layout demo ([#98])
Multi-version docs with Sphinx ([#119])

Infrastructure

Pre-commit hooks for lint and formatting ([#62])
Docstring lint enforcement ([#74])
Separate CI workflows for tests and docs ([#121])
Python version compatibility tests ([#121])
Nsight Compute report analysis skill ([#105])

Bug Fixes

Fix low-precision pointer assignment transformation ([#35])
Fix tcgen05.cp codegen ([#51])
Fix volatile specifier for MMA instruction ([#44])
Add cuda_bf16.h include header ([#77])
Fix semaphore codegen ([#112])

New Contributors

@soodoshll — Hopper WGMMA and GEMM examples ([#83], [#84], [#86])
@qiching — Fused softmax example ([#99])
@splint-disk-8i — CI improvements and README edits ([#97], [#102])
@WilliamZhang20 — Vector addition example ([#111])

Full Changelog: https://github.com/NVIDIA/tilus/compare/v0.1.1...v0.2.0

[#32]: https://github.com/NVIDIA/tilus/pull/32
[#33]: https://github.com/NVIDIA/tilus/pull/33
[#35]: https://github.com/NVIDIA/tilus/pull/35
[#37]: https://github.com/NVIDIA/tilus/pull/37
[#38]: https://github.com/NVIDIA/tilus/pull/38
[#40]: https://github.com/NVIDIA/tilus/pull/40
[#41]: https://github.com/NVIDIA/tilus/pull/41
[#44]: https://github.com/NVIDIA/tilus/pull/44
[#46]: https://github.com/NVIDIA/tilus/pull/46
[#47]: https://github.com/NVIDIA/tilus/pull/47
[#48]: https://github.com/NVIDIA/tilus/pull/48
[#50]: https://github.com/NVIDIA/tilus/pull/50
[#51]: https://github.com/NVIDIA/tilus/pull/51
[#53]: https://github.com/NVIDIA/tilus/pull/53
[#55]: https://github.com/NVIDIA/tilus/pull/55
[#58]: https://github.com/NVIDIA/tilus/pull/58
[#59]: https://github.com/NVIDIA/tilus/pull/59
[#60]: https://github.com/NVIDIA/tilus/pull/60
[#62]: https://github.com/NVIDIA/tilus/pull/62
[#64]: https://github.com/NVIDIA/tilus/pull/64
[#66]: https://github.com/NVIDIA/tilus/pull/66
[#68]: https://github.com/NVIDIA/tilus/pull/68
[#71]: https://github.com/NVIDIA/tilus/pull/71
[#74]: https://github.com/NVIDIA/tilus/pull/74
[#75]: https://github.com/NVIDIA/tilus/pull/75
[#77]: https://github.com/NVIDIA/tilus/pull/77
[#80]: https://github.com/NVIDIA/tilus/pull/80
[#81]: https://github.com/NVIDIA/tilus/pull/81
[#83]: https://github.com/NVIDIA/tilus/pull/83
[#84]: https://github.com/NVIDIA/tilus/pull/84
[#85]: https://github.com/NVIDIA/tilus/pull/85
[#86]: https://github.com/NVIDIA/tilus/pull/86
[#88]: https://github.com/NVIDIA/tilus/pull/88
[#89]: https://github.com/NVIDIA/tilus/pull/89
[#90]:...

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

Routine minor version release of a niche tool.