Repo · Wafer · published Jan 12, 2026

wafer-ai/gpu-perf-engineering-resources

Description: A curriculum for learning about GPU performance engineering, from scratch to what the frontier AI labs do

Stars: 819

Forks: 98

Open issues: 0

Created: 2026-01-12T00:47:24Z

Pushed: 2026-04-27T21:27:59Z

Default branch: main

Fork: no

Archived: no

README:

Learning Guide: Performance Engineering for AI Infra

Purpose

The purpose of this guide is to help engineers learn GPU kernel programming and optimization, with a focus on high-performance AI systems. It covers the full journey from fundamentals to production deployment, balancing foundational concepts with cutting-edge techniques.

If you're interested in GPU performance engineering, we're hiring at Wafer.

How to read

Recommended reading order:

1. Read "Tier 1" for all topics
2. Read "Tier 2" for all topics
3. Continue in the same way through the remaining tiers

Table of contents

  • [Fundamentals](#fundamentals)
      • [Introduction to GPU programming](#introduction-to-gpu-programming)
      • [Architecture deep dives](#architecture-deep-dives)
      • [Low-level details](#low-level-details)
  • [Matrix Multiplication](#matrix-multiplication)
      • [Essential tutorials](#essential-tutorials)
      • [Advanced implementations](#advanced-implementations)
      • [cuBLAS internals](#cublas-internals)
  • [Tensor Cores & Mixed Precision](#tensor-cores--mixed-precision)
      • [Tensor core fundamentals](#tensor-core-fundamentals)
      • [Precision formats](#precision-formats)
      • [Blackwell-specific](#blackwell-specific)
  • [Attention & Memory-Bound Kernels](#attention--memory-bound-kernels)
      • [FlashAttention](#flashattention)
      • [PagedAttention & serving](#pagedattention--serving)
      • [KV cache optimization](#kv-cache-optimization)
  • [Compiler & DSL Approaches](#compiler--dsl-approaches)
      • [Triton](#triton)
      • [CUTLASS & CuTe](#cutlass--cute)
      • [Other DSLs](#other-dsls)
  • [Profiling & Optimization](#profiling--optimization)
      • [NVIDIA tools](#nvidia-tools)
      • [Optimization techniques](#optimization-techniques)
      • [Advanced topics](#advanced-topics)
  • [AMD & Alternative Hardware](#amd--alternative-hardware)
      • [ROCm fundamentals](#rocm-fundamentals)
      • [CDNA architecture](#cdna-architecture)
      • [TPU & others](#tpu--others)
  • [Production Inference Systems](#production-inference-systems)
      • [Core systems](#core-systems)
      • [Continuous batching](#continuous-batching)
      • [Speculative decoding](#speculative-decoding)
  • [LLM-Generated Kernels](#llm-generated-kernels)
      • [Benchmarks & models](#benchmarks--models)
      • [Agentic approaches](#agentic-approaches)
      • [Research papers](#research-papers)
  • [Distributed & Multi-GPU](#distributed--multi-gpu)
      • [Communication primitives](#communication-primitives)
      • [Parallelism strategies](#parallelism-strategies)
      • [Kernel fusion](#kernel-fusion)
  • [The Big Picture](#the-big-picture)
      • [Industry analysis](#industry-analysis)
      • [Practitioner blogs](#practitioner-blogs)
      • [Communities](#communities)
  • [Maintainer](#maintainer)

Fundamentals

Introduction to GPU programming

Tier 1

Architecture deep dives

Tier 2

Low-level details

Tier 3

Matrix Multiplication

Essential tutorials

Tier 1

Advanced implementations

Tier 2

cuBLAS internals

Tier 3
