wafer-ai/gpu-perf-engineering-resources
Description: A curriculum for learning about gpu performance engineering, from scratch to what the frontier AI labs do
Stars: 819
Forks: 98
Open issues: 0
Created: 2026-01-12T00:47:24Z
Pushed: 2026-04-27T21:27:59Z
Default branch: main
Fork: no
Archived: no
README:
Learning Guide: Performance Engineering for AI Infra
Purpose
The purpose of this guide is to help engineers learn GPU kernel programming and optimization, with a focus on high-performance AI systems. It covers the full journey from fundamentals to production deployment, balancing foundational concepts with cutting-edge techniques.
If you're interested in GPU performance engineering, we're hiring at Wafer.
How to read
Recommended reading order:
1. Read "Tier 1" for all topics
2. Read "Tier 2" for all topics
3. Etc.
Table of contents
- [Fundamentals](#fundamentals)
  - [Introduction to GPU programming](#introduction-to-gpu-programming)
  - [Architecture deep dives](#architecture-deep-dives)
  - [Low-level details](#low-level-details)
- [Matrix Multiplication](#matrix-multiplication)
  - [Essential tutorials](#essential-tutorials)
  - [Advanced implementations](#advanced-implementations)
  - [cuBLAS internals](#cublas-internals)
- [Tensor Cores & Mixed Precision](#tensor-cores--mixed-precision)
  - [Tensor core fundamentals](#tensor-core-fundamentals)
  - [Precision formats](#precision-formats)
  - [Blackwell-specific](#blackwell-specific)
- [Attention & Memory-Bound Kernels](#attention--memory-bound-kernels)
  - [FlashAttention](#flashattention)
  - [PagedAttention & serving](#pagedattention--serving)
  - [KV cache optimization](#kv-cache-optimization)
- [Compiler & DSL Approaches](#compiler--dsl-approaches)
  - [Triton](#triton)
  - [CUTLASS & CuTe](#cutlass--cute)
  - [Other DSLs](#other-dsls)
- [Profiling & Optimization](#profiling--optimization)
  - [NVIDIA tools](#nvidia-tools)
  - [Optimization techniques](#optimization-techniques)
  - [Advanced topics](#advanced-topics)
- [AMD & Alternative Hardware](#amd--alternative-hardware)
  - [ROCm fundamentals](#rocm-fundamentals)
  - [CDNA architecture](#cdna-architecture)
  - [TPU & others](#tpu--others)
- [Production Inference Systems](#production-inference-systems)
  - [Core systems](#core-systems)
  - [Continuous batching](#continuous-batching)
  - [Speculative decoding](#speculative-decoding)
- [LLM-Generated Kernels](#llm-generated-kernels)
  - [Benchmarks & models](#benchmarks--models)
  - [Agentic approaches](#agentic-approaches)
  - [Research papers](#research-papers)
- [Distributed & Multi-GPU](#distributed--multi-gpu)
  - [Communication primitives](#communication-primitives)
  - [Parallelism strategies](#parallelism-strategies)
  - [Kernel fusion](#kernel-fusion)
- [The Big Picture](#the-big-picture)
  - [Industry analysis](#industry-analysis)
  - [Practitioner blogs](#practitioner-blogs)
  - [Communities](#communities)
- [Maintainer](#maintainer)
Fundamentals
Introduction to GPU programming
Tier 1
- Programming Massively Parallel Processors (PMPP) - Hwu, Kirk, El Hajj. The canonical textbook; the 5th edition covers Ampere/Hopper/Blackwell
- GPU Mode Lectures - Community-driven lecture series: profiling → kernels → CUTLASS → SASS. Active Discord (23k+ members): discord.gg/gpumode
- NVIDIA CUDA Programming Guide - Official documentation and the essential reference for the programming model
Architecture deep dives
Tier 2
- NVIDIA Hopper Architecture In-Depth - TMA, Thread Block Clusters, Distributed Shared Memory, WGMMA
- Chips and Cheese: Blackwell - Microbenchmarking analysis of GB202, memory latency comparisons
- Dissecting the NVIDIA Hopper GPU Architecture - Academic microbenchmarking of H100
- Dissecting the NVIDIA Blackwell Architecture - Microbenchmarks covering tcgen05, TMEM, 2SM MMA
Low-level details
Tier 3
- PTX ISA Documentation - Official PTX instruction set reference
- Understanding PTX - Introduction to CUDA's virtual assembly language
- DocumentSASS - Unofficial SASS instruction documentation extracted from nvdisasm
- JEB SASS Disassembler - Reverse engineering GPU binaries (Volta → Blackwell)
Matrix Multiplication
Essential tutorials
Tier 1
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance - siboehm. The canonical starting tutorial. Covers tiling, shared memory, vectorized loads
- Inside NVIDIA GPUs: Anatomy of High-Performance Matmul Kernels - Aleksa Gordić. 47 figures. Covers PTX/SASS, wave quantization, ILP, roofline model, warp tiling
- Outperforming cuBLAS on H100: A Worklog - cudaforfun. Real optimization journey using WGMMA and TMA
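Two of the ideas named in the tutorials above, wave quantization and the roofline model, reduce to short arithmetic. The sketch below is illustrative only and is not taken from any of the linked tutorials. It assumes one output tile occupies one SM at a time, that each matrix is read or written exactly once, and the function names and hardware numbers in the usage example are hypothetical.

```python
import math


def wave_efficiency(m, n, tile_m, tile_n, num_sms):
    """Return (tiles, waves, efficiency) for an m x n output split into tiles.

    Assumption: each SM processes one output tile at a time, so tiles are
    scheduled in "waves" of num_sms tiles. A partially filled last wave
    leaves SMs idle, which is the wave quantization effect.
    """
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    waves = math.ceil(tiles / num_sms)
    efficiency = tiles / (waves * num_sms)
    return tiles, waves, efficiency


def matmul_arithmetic_intensity(m, n, k, bytes_per_element):
    """FLOPs per byte for C = A @ B, assuming A, B, and C each move once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def roofline_bound(intensity, peak_flops, peak_bandwidth):
    """Attainable FLOP/s under the roofline model: min(compute, memory) roof."""
    return min(peak_flops, intensity * peak_bandwidth)


if __name__ == "__main__":
    # Hypothetical GPU: 132 SMs, 1e15 FLOP/s peak, 3e12 bytes/s bandwidth.
    tiles, waves, eff = wave_efficiency(4096, 4096, 128, 128, num_sms=132)
    print(f"tiles={tiles} waves={waves} efficiency={eff:.3f}")

    ai = matmul_arithmetic_intensity(4096, 4096, 4096, bytes_per_element=2)
    print(f"arithmetic intensity={ai:.1f} FLOP/byte")
    print(f"roofline bound={roofline_bound(ai, 1e15, 3e12):.3e} FLOP/s")
```

A kernel whose arithmetic intensity sits left of the ridge point (`peak_flops / peak_bandwidth`) is memory-bound under this model. Real kernels re-read tiles from global memory, so their effective intensity is usually lower than this idealized estimate.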
Advanced implementations
Tier 2
- Advanced Matrix Multiplication Optimization - salykova. Detailed optimization techniques following the CUTLASS approach
- CUDA Matrix Multiplication Optimization - Lei Mao. Systematic optimization progression
- Optimizing SGEMV for cuBLAS-like Performance - Maharshi. Matrix-vector multiplication optimization worklog
- DeepGEMM - DeepSeek. Clean FP8 GEMM implementation for Hopper, ~300 lines
cuBLAS internals
Tier 3
- New cuBLAS 12.0 Features - Hopper-specific optimizations and performance
- [cuBLAS 12.9 Floating Point…