What does this repo signal mean?

Wafer published wafer-ai/gpu-perf-engineering-resources. A repository signal like this can expose tooling, eval, infrastructure, or model-adjacent work before it appears in a launch post. High-signal details: repo wafer-ai/gpu-perf-engineering-resources · Curated GPU performance resources with solid stars (819 stars and 98 forks at capture). onlylabs links this event to 1 captured evidence page and 5 related repo signals.

Wafer Repo: wafer-ai/gpu-perf-engineering-resources

Captured source

source ↗

GitHub · github.com/wafer-ai/gpu-perf-engineering-resources

wafer-ai/gpu-perf-engineering-resources repository metadata

published Jan 12, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

wafer-ai/gpu-perf-engineering-resources

Description: A curriculum for learning about gpu performance engineering, from scratch to what the frontier AI labs do

Stars: 819

Forks: 98

Open issues: 0

Created: 2026-01-12T00:47:24Z

Pushed: 2026-04-27T21:27:59Z

Default branch: main

Fork: no

Archived: no

README:

Learning Guide: Performance Engineering for AI Infra

Purpose

The purpose of this guide is to help engineers learn GPU kernel programming and optimization, with a focus on high-performance AI systems. It covers the full journey from fundamentals to production deployment, balancing foundational concepts with cutting-edge techniques.

If you're interested in GPU performance engineering - we're hiring at Wafer.

How to read

Fundamentals

Introduction to GPU programming

Tier 1

Programming Massively Parallel Processors (PMPP) - Hwu, Kirk, El Hajj. The canonical textbook; the 5th edition covers Ampere/Hopper/Blackwell
GPU Mode Lectures - Community-driven lecture series: profiling → kernels → CUTLASS → SASS. Active Discord (23k+ members): discord.gg/gpumode
NVIDIA CUDA Programming Guide - Official documentation, essential reference for programming model

Architecture deep dives

Tier 2

NVIDIA Hopper Architecture In-Depth - TMA, Thread Block Clusters, Distributed Shared Memory, WGMMA
Chips and Cheese: Blackwell - Microbenchmarking analysis of GB202, memory latency comparisons
Dissecting the NVIDIA Hopper GPU Architecture - Academic microbenchmarking of H100
Dissecting the NVIDIA Blackwell Architecture - Microbenchmarks covering tcgen05, TMEM, 2SM MMA

Low-level details

Tier 3

PTX ISA Documentation - Official PTX instruction set reference
Understanding PTX - Introduction to CUDA's virtual assembly language
DocumentSASS - Unofficial SASS instruction documentation extracted from nvdisasm
JEB SASS Disassembler - Reverse engineering GPU binaries (Volta → Blackwell)

Matrix Multiplication

Essential tutorials

Tier 1

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance - siboehm. The canonical starting tutorial. Covers tiling, shared memory, vectorized loads
Inside NVIDIA GPUs: Anatomy of High-Performance Matmul Kernels - Aleksa Gordić. 47 figures. Covers PTX/SASS, wave quantization, ILP, roofline model, warp tiling
Outperforming cuBLAS on H100: A Worklog - cudaforfun. Real optimization journey using WGMMA and TMA
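
The tutorials above all build on tiling: computing the output matrix one block at a time so that each loaded block of the inputs is reused many times. As a rough illustration, here is a minimal CPU-side sketch in pure Python. It is not taken from any of the linked tutorials and is not a GPU kernel; the function names and the `tile` parameter are hypothetical, and the sliced sub-matrices only stand in for what a CUDA kernel would stage in shared memory.

```python
def matmul_naive(a, b):
    """Reference matmul: every output element re-reads a full row and column."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_tiled(a, b, tile=2):
    """Blocked matmul: C is computed tile by tile, marching along K.

    The slices a_tile and b_tile play the role of shared-memory tiles:
    each is "loaded" once and then reused for a whole block of outputs.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # tile row of C
        for j0 in range(0, n, tile):      # tile column of C
            for p0 in range(0, k, tile):  # step along the shared K dimension
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                # Accumulate this tile pair's partial products into C.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for p, a_val in enumerate(a_row):
                            acc += a_val * b_tile[p][j]
                        c[i0 + i][j0 + j] += acc
    return c
```

The slicing handles matrix dimensions that are not multiples of `tile`, which a real kernel would typically handle with bounds checks or padding. The loop order mirrors the usual GPU structure: the two outer loops correspond to the grid of thread blocks, and the inner `p0` loop is the per-block sweep over K.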

Advanced implementations

Tier 2

Advanced Matrix Multiplication Optimization - salykova. Detailed optimization techniques following CUTLASS approach
CUDA Matrix Multiplication Optimization - Lei Mao. Systematic optimization progression
Optimizing SGEMV for cuBLAS-like Performance - Maharshi. Matrix-vector multiplication optimization worklog
DeepGEMM - DeepSeek. Clean FP8 GEMM implementation for Hopper, ~300 lines
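
The SGEMV entry above sits alongside GEMM resources for a reason that the roofline model (mentioned under the essential tutorials) makes concrete: matrix-vector multiplication does far less arithmetic per byte moved than matrix-matrix multiplication. The following is a back-of-the-envelope sketch in Python, not code from any linked resource. It assumes FP32 operands, counts only the minimum memory traffic (each operand read once, the output written once), and uses placeholder hardware numbers that do not describe any real GPU.

```python
def gemm_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], assuming minimal traffic."""
    flops = 2 * m * n * k                      # one multiply + one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def gemv_intensity(m, n, bytes_per_elem=4):
    """FLOPs per byte for y[m] = A[m,n] @ x[n], assuming minimal traffic."""
    flops = 2 * m * n
    bytes_moved = bytes_per_elem * (m * n + n + m)
    return flops / bytes_moved


def roofline(intensity, peak_flops, peak_bandwidth):
    """Attainable FLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak_flops, intensity * peak_bandwidth)


# Placeholder hardware figures (assumptions for illustration only).
PEAK_FLOPS = 60e12       # FLOP/s
PEAK_BANDWIDTH = 3e12    # bytes/s

gemv_bound = roofline(gemv_intensity(4096, 4096), PEAK_FLOPS, PEAK_BANDWIDTH)
gemm_bound = roofline(gemm_intensity(4096, 4096, 4096), PEAK_FLOPS, PEAK_BANDWIDTH)
```

Under these assumptions a 4096 x 4096 SGEMV stays just under 0.5 FLOPs per byte, so its bound is set by memory bandwidth, while a 4096-cubed SGEMM reaches several hundred FLOPs per byte and hits the compute roof. That is why GEMV worklogs tend to focus on memory access patterns and GEMM worklogs on data reuse and tensor-core utilization.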

cuBLAS internals

Tier 3

New cuBLAS 12.0 Features - Hopper-specific optimizations and performance
[cuBLAS 12.9 Floating Point...

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

Curated GPU performance resources with solid stars.
