Repo | NVIDIA | published Nov 13, 2025 | seen 5d

NVIDIA/TileGym

Description: Helpful kernel tutorials, examples and SKILLs for tile-based GPU programming

Language: Python

License: NOASSERTION

Stars: 749

Forks: 72

Open issues: 7

Created: 2025-11-13T07:21:50Z

Pushed: 2026-06-10T13:24:27Z

Default branch: main

Fork: no

Archived: no

README:

English | [简体中文](README_chs.md) | [繁體中文](README_cht.md) | [日本語](README_ja.md) | [Français](README_fr.md)

TileGym

TileGym is a CUDA Tile kernel library that provides a rich collection of kernel tutorials and examples for tile-based GPU programming.

[Overview](#overview) | [Features](#features) | [Installation](#installation) | [Quick Start](#quick-start) | [Contributing](#contributing) | [License](#license-and-third-party-notices)

Overview

This repository aims to provide helpful kernel tutorials and examples for tile-based GPU programming. TileGym is a playground for experimenting with CUDA Tile, where you can learn how to build efficient GPU kernels and explore their integration into real-world large language models such as Llama 3.1 and DeepSeek V2. Whether you're learning tile-based GPU programming or looking to optimize your LLM implementations, TileGym offers practical examples and comprehensive guidance.

Features

  • Rich collection of CUDA Tile kernel examples
  • Practical kernel implementations for common deep learning operators
  • Performance benchmarking to evaluate kernel efficiency
  • End-to-end integration examples with popular LLMs (Llama 3.1, DeepSeek V2)

Installation

Prerequisites

> GPU Support: TileGym requires a Blackwell GPU (e.g., B200, RTX 5080, RTX 5090) with CUDA 13.1+, or an NVIDIA Ampere GPU (e.g., A100) with CUDA 13.2+. All released cuTile kernels are validated on both architectures. Download CUDA from [NVIDIA CUDA Downloads](https://developer.nvidia.com/cuda-downloads).

  • PyTorch (version 2.9.1 or compatible)
  • [CUDA 13.1+](https://developer.nvidia.com/cuda-downloads) (Required - TileGym is built and tested exclusively on CUDA 13.1+)
  • Triton (included with PyTorch installation)
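The GPU requirement above can be checked before installing anything else. The following is a minimal sketch, not part of TileGym: the helper names `required_cuda` and `local_capability` are hypothetical, and the mapping from compute capability to architecture (8.x except 8.9 for Ampere, 10.x and 12.x for Blackwell) is an assumption based on NVIDIA's published compute-capability numbering rather than anything stated in this README.

```python
from typing import Optional, Tuple


def required_cuda(capability: Tuple[int, int]) -> Optional[str]:
    """Return the minimum CUDA version the README lists for this GPU, or None.

    Assumption: Blackwell reports compute capability 10.x or 12.x, and
    Ampere reports 8.x (8.9 is Ada, so it is excluded).
    """
    major, minor = capability
    if major in (10, 12):
        return "13.1"  # Blackwell: CUDA 13.1+ per the prerequisites
    if major == 8 and minor != 9:
        return "13.2"  # Ampere: CUDA 13.2+ per the prerequisites
    return None  # architecture not listed as supported


def local_capability() -> Optional[Tuple[int, int]]:
    """Query the first visible GPU through PyTorch, if PyTorch and a GPU exist."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = local_capability()
    if cap is None:
        print("No CUDA device visible from PyTorch")
    else:
        print(f"compute capability {cap[0]}.{cap[1]}: minimum CUDA = {required_cuda(cap)}")
```

A result of `None` from `required_cuda` only means the architecture is not named in the prerequisites; it is not an authoritative statement about what the compiler accepts.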

Setup Steps

1. Prepare torch and triton environment

If you already have torch and triton, skip this step.

pip install --pre torch --index-url https://download.pytorch.org/whl/cu130

We have verified that `torch==2.9.1` works. The Triton package is installed together with torch.
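To confirm what the step above actually installed, the versions can be read from package metadata without importing the libraries. This is a small sketch using only the standard library; the helper name `installed_version` is hypothetical, and checking `pytorch-triton` in addition to `triton` is an assumption about how pre-release PyTorch wheels name their Triton dependency.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "pytorch-triton" is an assumed alternative distribution name for nightly wheels.
    for name in ("torch", "triton", "pytorch-triton"):
        version = installed_version(name)
        print(f"{name}: {version if version else 'not installed'}")
```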

2. Install TileGym

TileGym uses `cuda-tile` (≥ 1.3.0) for GPU kernel programming, which depends on the tileiras compiler at runtime.

##### Install from PyPI (recommended)

pip install tilegym[tileiras]

This installs TileGym and all runtime dependencies, including `cuda-tile[tileiras]`, which bundles the tileiras compiler directly into your Python environment.

If you already have tileiras available on your system (e.g., from CUDA Toolkit 13.1+), you can omit the extra:

pip install tilegym
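The choice between the two PyPI commands can be sketched as a lookup on `PATH`. This is an illustration only: it assumes the system compiler is exposed as an executable named `tileiras`, which this README does not state explicitly, and the helper name `pip_command` is hypothetical.

```python
import shutil
from typing import Callable, Optional


def pip_command(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Suggest an install command depending on whether a system tileiras is found.

    Assumption: a system-wide compiler appears on PATH as `tileiras`.
    """
    if which("tileiras"):
        return "pip install tilegym"
    # No system compiler found: use the extra that bundles it.
    return "pip install tilegym[tileiras]"


if __name__ == "__main__":
    print(pip_command())
```

The `which` parameter exists only so the lookup can be replaced in tests; by default it uses `shutil.which`.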

##### Install from source

git clone https://github.com/NVIDIA/TileGym.git
cd TileGym
pip install .[tileiras] # or: pip install . (if you have system tileiras)

For editable (development) mode, use `pip install -e .` or `pip install -e .[tileiras]`.

All runtime dependencies are declared in [requirements.txt](requirements.txt) and are installed automatically by both `pip install tilegym` and `pip install .`.

We also provide a Dockerfile; see [modeling/transformers/README.md](modeling/transformers/README.md) for details.

Quick Start

There are three main ways to use TileGym, plus an optional set of Julia kernels:

1. Explore Kernel Examples

All kernel implementations are located in the `src/tilegym/ops/` directory. You can test individual operations with minimal scripts; function-level usage and those scripts are documented in [tests/ops/README.md](tests/ops/README.md).
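As a starting point for exploring, the kernel sources can be enumerated from a local checkout. The sketch below assumes only the directory path given above; the helper name `list_op_modules` is hypothetical, and nothing is assumed about the files it finds.

```python
from pathlib import Path
from typing import List


def list_op_modules(repo_root: str = ".") -> List[str]:
    """List Python files under src/tilegym/ops/, relative to that directory."""
    ops_dir = Path(repo_root) / "src" / "tilegym" / "ops"
    if not ops_dir.is_dir():
        return []  # not run from a TileGym checkout
    return sorted(
        path.relative_to(ops_dir).as_posix()
        for path in ops_dir.rglob("*.py")
        if path.name != "__init__.py"  # skip package markers
    )


if __name__ == "__main__":
    for module in list_op_modules():
        print(module)
```

Run from the repository root, this prints one path per kernel source file; outside a checkout it prints nothing.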

2. Run Benchmarks

Evaluate kernel performance with micro-benchmarks:

cd tests/benchmark
bash run_all.sh

A complete benchmark guide is available in [tests/benchmark/README.md](tests/benchmark/README.md).

3. Run LLM Transformer Examples

Use TileGym kernels in end-to-end inference scenarios. We provide runnable scripts and instructions for transformer language models (e.g., Llama 3.1-8B) accelerated using TileGym kernels.

First, install the additional dependency:

pip install accelerate==1.13.0 --no-deps

Containerized Setup (Docker):

docker build -t tilegym-transformers -f modeling/transformers/Dockerfile .
docker run --gpus all -it tilegym-transformers bash

More details are available in [modeling/transformers/README.md](modeling/transformers/README.md).

4. Julia (cuTile.jl) Kernels (Optional)

TileGym also includes experimental cuTile.jl kernel implementations in Julia. These are self-contained in the julia/ directory and do not require the Python TileGym package.

Prerequisites: Julia 1.12+, CUDA 13.1, Blackwell GPU

# Install Julia (if not already installed)
curl -fsSL https://install.julialang.org | sh

# Install dependencies
julia --project=julia/ -e 'using Pkg; Pkg.instantiate()'

# Run tests
julia --project=julia/ julia/test/runtests.jl

See julia/Project.toml for the full dependency list.

Contributing

We welcome contributions of all kinds. Please read our [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines, including the Contributor License Agreement (CLA) process.

License and third-party notices

  • Project license: MIT ([LICENSE](LICENSE))
  • Third-party attributions and license texts: [LICENSES/ATTRIBUTIONS.md](LICENSES/ATTRIBUTIONS.md)

Notability

notability 6.0/10

New NVIDIA repo with 748 stars, solid traction.