Introducing Triton: Open-source GPU programming for neural networks
Captured source
source ↗Introducing Triton: Open-source GPU programming for neural networks | OpenAI
July 28, 2021
Introducing Triton: Open-source GPU programming for neural networks
View code Read documentation
Loading…
Share
We’re releasing Triton 1.0, an open-source Python-like programming language which enables researchers with no CUDA experience to write highly efficient GPU code—most of the time on par with what an expert would be able to produce.
Why it matters
Triton makes it possible to reach peak hardware performance with relatively little effort; for example, it can be used to write FP16 matrix multiplication kernels that match the performance of cuBLAS—something that many GPU programmers can’t do—in under 25 lines of code. Our researchers have already used it to produce kernels that are up to 2x more efficient than equivalent Torch implementations, and we’re excited to work with the community to make GPU programming more accessible to everyone.
Novel research ideas in the field of Deep Learning are generally implemented using a combination of native framework operators. While convenient, this approach often requires the creation (and/or movement) of many temporary tensors, which can hurt the performance of neural networks at scale. These issues can be mitigated by writing specialized GPU kernels, but doing so can be surprisingly difficult due to the many intricacies of GPU programming.1, 2, 3 And, although a variety of systems have recently emerged4, 5 to make this process easier, we have found them to be either too verbose, lack flexibility or generate code noticeably slower than our hand-tuned baselines. This has led us to extend and improve Triton6, a recent language and compiler whose original creator now works at OpenAI.
The challenges of GPU programming
The architecture of modern GPUs can be roughly divided into three major components—DRAM, SRAM and ALUs—each of which must be considered when optimizing CUDA code:
- Memory transfers from DRAM must be coalesced into large transactions to leverage the large bus width of modern memory interfaces.
- Data must be manually stashed to SRAM prior to being re-used, and managed so as to minimize shared memory bank conflicts upon retrieval.
- Computations must be partitioned and scheduled carefully, both across and within Streaming Multiprocessors (SMs), so as to promote instruction/thread-level parallelism and leverage special-purpose ALUs (e.g., tensor cores).
Basic architecture of a GPU.
Reasoning about all these factors can be challenging, even for seasoned CUDA programmers with many years of experience. The purpose of Triton is to fully automate these optimizations, so that developers can better focus on the high-level logic of their parallel code. Triton aims to be broadly applicable, and therefore does not automatically schedule work across SMs -- leaving some important algorithmic considerations (e.g. tiling, inter-SM synchronization) to the discretion of developers.
CUDA
TRITON
Memory Coalescing
Manual
Automatic
Shared Memory Management
Manual
Automatic
Scheduling (Within SMs)
Manual
Automatic
Scheduling (Across SMs)
Manual
Manual
Compiler optimizations in CUDA vs Triton.
Programming model
Out of all the Domain Specific Languages and JIT-compilers available, Triton is perhaps most similar to Numba: kernels are defined as decorated Python functions, and launched concurrently with differentprogram_id’s on a grid of so-called instances. However, as shown in the code snippet below, the resemblance stops there: Triton exposes intra-instance parallelism via operations on blocks—small arrays whose dimensions are powers of two—rather than a Single Instruction, Multiple Thread (SIMT)7 execution model. In doing so, Triton effectively abstracts away all the issues related to concurrency within CUDA thread blocks (e.g., memory coalescing, shared memory synchronization/conflicts, tensor core scheduling).
Loading...
While this may not be particularly helpful for embarrassingly parallel (i.e., element-wise) computations, it can greatly simplify the development of more complex GPU programs.
Consider for example the case of a fused softmax kernel (below) in which each instance normalizes a different row of the given input tensor X_∈R_M_×_N. Standard CUDA implementations of this parallelization strategy can be challenging to write, requiring explicit synchronization between threads as they concurrently reduce the same row of X. Most of this complexity goes away with Triton, where each kernel instance loads the row of interest and normalizes it sequentially using NumPy-like primitives.
Python
1import triton2import triton.language as tl34@triton.jit5def softmax(Y, stride_ym, stride_yn, X, stride_xm, stride_xn, M, N):6 # row index7 m = tl.program_id(0)8 # col indices9 # this specific kernel only works for matrices that 10 # have less than BLOCK_SIZE columns11 BLOCK_SIZE = 102412 n = tl.arange(0, BLOCK_SIZE)13 # the memory address of all the elements14 # that we want to load can be computed as follows15 X = X + m * stride_xm + n * stride_xn16 # load input data; pad out-of-bounds elements with 0 17 x = tl.load(X, mask=n = 0, acc, alpha * acc)38 # write back result39 C = C + (rm[:, None] * stride_cm + rn[None, :] * stride_cn)40 mask = (rm[:, None] < M) & (rn[None, :] < N)41 tl.store(C, acc, mask=mask)
Matrix multiplication in Triton.
One important advantage of handwritten matrix multiplication kernels is that they can be customized as desired to accommodate fused transformations of their inputs (e.g., slicing) and outputs (e.g., Leaky ReLU). Without a system like Triton, non-trivial modifications of matrix multiplication kernels would be out-of-reach for developers without exceptional GPU programming expertise.
Loading...
High-level system architecture
The good performance of Triton comes from a modular system architecture centered around Triton-IR, an LLVM-based intermediate representation in which multi-dimensional blocks of values are first-class citizens.
Loading...
The@triton.jit decorator works by walking the Abstract Syntax Tree (AST) of the provided Python function so as to generate Triton-IR on-the-fly using a common SSA construction algorithm.8 The resulting IR code is then simplified, optimized and automatically parallelized by our compiler backend, before being converted into high-quality LLVM-IR—and eventually PTX—for execution on recent NVIDIA GPUs. CPUs and AMD GPUs are…
Excerpt shown — open the source for the full document.