What does this repo signal mean?

Together AI published togethercomputer/ParallelKernelBench (Python). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo togethercomputer/ParallelKernelBench · language Python · Benchmark for evaluating parallel kernel performance on GPUs. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Together AI Repo: togethercomputer/ParallelKernelBench

Captured source

GitHub/github.com/togethercomputer/ParallelKernelBench

togethercomputer/ParallelKernelBench repository metadata

published Jun 3, 2026 · seen 3d · captured 3d · http 200 · method plain

togethercomputer/ParallelKernelBench

Language: Python

Stars: 0

Forks: 0

Open issues: 6

Created: 2026-06-03T19:58:41Z

Pushed: 2026-06-23T06:09:38Z

Default branch: main

Fork: no

Archived: no

README:

ParallelKernelBench: Can LLMs write fast multi-GPU kernels?

ParallelKernelBench (PKB) is a benchmark with the goal of enabling LLMs to optimize multi-GPU kernels. Specifically, we investigate how well models can turn existing PyTorch + NCCL reference code into fine-grained CUDA (or related DSLs).

The design is heavily inspired by KernelBench.

📄 Paper · 🤗 Hugging Face · 🌐 Project website

---

👋 Overview

PKB asks models to optimize multi-GPU kernels: each problem has a PyTorch + NCCL reference under reference/; candidates go in solutions_/ (CUDA, Triton, ParallelKittens, or run-specific trees from generation).

Correctness: eval mode runs reference and candidate on the same inputs and compares per-rank outputs (rank_*.pt) within --atol / --rtol.

Performance: optional timing reports speedup vs. the reference. We follow ThunderKittens 2 (benchmark rigor): 500 warmup iterations, then 100 timed iterations (see worker / perf utilities).

Roofline (approximate): reference_rooflines_code/ provides utilization estimates; contributions welcome.
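
The correctness and performance checks described above can be sketched in plain Python. This is an illustration only, not PKB's harness: it assumes the tolerance test follows the torch.allclose convention (|candidate - reference| <= atol + rtol * |reference|), which the README does not state, and the names within_tolerance, mean_time, and speedup are hypothetical. A real GPU harness would also synchronize the device around the timed region instead of relying on host wall-clock time alone.

```python
import time
from typing import Callable, Sequence


def within_tolerance(candidate: Sequence[float], reference: Sequence[float],
                     atol: float = 1e-2, rtol: float = 1e-2) -> bool:
    """Elementwise allclose-style check (assumed convention, not confirmed by PKB)."""
    if len(candidate) != len(reference):
        return False
    return all(abs(c - r) <= atol + rtol * abs(r)
               for c, r in zip(candidate, reference))


def mean_time(fn: Callable[[], object], warmup: int = 500, iters: int = 100) -> float:
    """Run `warmup` untimed calls, then return mean wall-clock seconds over `iters` calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


def speedup(reference_seconds: float, candidate_seconds: float) -> float:
    """Speedup of the candidate relative to the reference (> 1 means faster)."""
    return reference_seconds / candidate_seconds
```

The defaults for warmup and iters mirror the 500 / 100 iteration counts quoted above; the atol and rtol defaults are placeholders chosen for the example.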

---

⚙️ Setup

PKB uses [uv](https://docs.astral.sh/uv/) for reproducible Python environments.

Prerequisites

OS: Linux with NVIDIA GPUs (multi-GPU runs need matching torchrun / NCCL).
Driver: Recent enough for CUDA 12.8 wheels (H100 nodes typically satisfy this).
ParallelKittens backend (optional): clone ThunderKittens and set THUNDERKITTENS_ROOT to the repo root (Modal/Together images do this automatically).

Install with uv

# Install uv (skip if already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

cd ParallelKernelBench

cd kernelgen
git clone https://github.com/SWE-agent/mini-swe-agent.git
cd ..

uv sync

# Verify the environment
uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
uv run pytest tests/ -q

This creates .venv/ at the repo root and installs all dependencies from the pinned uv.lock (commit that file when you change pyproject.toml).

API keys (generation / cloud eval)

Google models: GEMINI_API_KEY or GOOGLE_API_KEY.
Together models: TOGETHER_API_KEY (see ALLOWED_MODELS in kernelgen/generate_kernel.py).
Anthropic models: ANTHROPIC_API_KEY.
OpenAI models: OPENAI_API_KEY.
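
Before launching a generation run, it can help to confirm that the key for the chosen provider is set. The sketch below uses the variable names listed above; the provider labels, the find_api_key helper, and the choice to prefer GEMINI_API_KEY over GOOGLE_API_KEY are assumptions made for illustration and are not taken from the repository.

```python
import os
from typing import Mapping, Optional

# Provider -> accepted environment variables, as listed in the README.
PROVIDER_ENV_VARS = {
    "google": ("GEMINI_API_KEY", "GOOGLE_API_KEY"),
    "together": ("TOGETHER_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "openai": ("OPENAI_API_KEY",),
}


def find_api_key(provider: str,
                 env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the first non-empty variable for `provider`, or None."""
    env = os.environ if env is None else env
    for name in PROVIDER_ENV_VARS[provider]:
        if env.get(name):
            return name
    return None
```

Passing an explicit env mapping keeps the helper easy to test; calling find_api_key("together") with no second argument reads the real process environment.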

---

🚀 Usage

Generate a single solution

[kernelgen/generate_kernel.py](kernelgen/generate_kernel.py) assembles a kernel-generation prompt from [kernelgen/prompts.toml](kernelgen/prompts.toml) and optionally calls an LLM. You can (1) print the prompt only (--print-prompt) or (2) generate a solution file for one problem and backend.

# [print-prompt] Inspect the assembled user prompt; no API call, no file written
# --precision: fp32 | fp16 | bf16 (must match an entry in prompts.toml)
# --hardware: h100_8 | b200_72 (optional; omitted → "none" in the output directory name)
# --backend: cuda | triton | parallelkittens (must match [backends] in prompts.toml)
# --model: must be in ALLOWED_MODELS in generate_kernel.py (ignored with --print-prompt)
uv run python kernelgen/generate_kernel.py \
--precision bf16 \
--hardware h100_8 \
--problem 1 \
--model zai-org/GLM-5.1 \
--backend cuda \
--print-prompt

# [generate] Call the LLM and write e.g. solutions_cuda_bf16_h100_8_together_zai-org_GLM-5.1/1_allreduce_cuda.py
uv run python kernelgen/generate_kernel.py \
--precision bf16 \
--hardware h100_8 \
--problem 1 \
--model zai-org/GLM-5.1 \
--backend cuda

# Other backends (same flags; different output dir prefix and filename suffix)
uv run python kernelgen/generate_kernel.py --precision bf16 --hardware h100_8 --problem 1 --model gemini-3-pro-preview --backend triton

uv run python kernelgen/generate_kernel.py --precision bf16 --hardware h100_8 --problem 1 --model gemini-3-pro-preview --backend parallelkittens

# Optional: custom prompt template
uv run python kernelgen/generate_kernel.py --paths-to-prompts-template /path/to/prompts.toml --precision bf16 --problem 1 --backend cuda --print-prompt

Outputs: without --print-prompt, each run writes under a run-specific solutions_*/ directory (for example solutions_cuda_bf16_h100_8_together_zai-org_GLM-5.1/) as {stem}_{backend}.py (for example 1_allreduce_cuda.py). Pass that directory to [run_local.py](run_local.py) via --solutions-dir when evaluating.
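
As a small illustration of the {stem}_{backend}.py naming pattern, the helper below builds the expected path of a solution file. The function name solution_path is hypothetical, and the directory name is copied from the README's example instead of being derived, since the excerpt does not spell out how the directory name is assembled.

```python
from pathlib import Path


def solution_path(solutions_dir: str, stem: str, backend: str) -> Path:
    """Build <solutions_dir>/{stem}_{backend}.py, following the README's naming pattern."""
    return Path(solutions_dir) / f"{stem}_{backend}.py"


# Directory name taken verbatim from the README's example.
example = solution_path(
    "solutions_cuda_bf16_h100_8_together_zai-org_GLM-5.1", "1_allreduce", "cuda"
)
print(example.name)
```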

Generate a single solution (mini-SWE-agent)

We provide a script ([kernelgen/generate_kernel_agent.py](kernelgen/generate_kernel_agent.py)) that uses mini-swe-agent to generate a kernel. Note that we have verified functionality on Google models.

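To use this script, you must install mini-swe-agent separately. Clone it into kernelgen/ first (skip this step if you already cloned it during setup), then install it in editable mode:

cd kernelgen
git clone https://github.com/SWE-agent/mini-swe-agent.git
cd ..
pip install -e kernelgen/mini-swe-agent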

An example command:

python kernelgen/generate_kernel_agent.py \
--problem 1 \
--backend cuda \
--model gemini-3-flash-preview \
--step-limit 3 \
--timeout 600 \
--remote-dryrun-command \
'python run_local.py --num-procs-per-node 4 --mode dryrun --problem {problem_arg} --solution {backend} --measure-perf' \
--remote-eval-command \
'python run_local.py --num-procs-per-node 4 --mode eval --problem {problem_arg} --solution {backend} --measure-perf'
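
The --remote-dryrun-command and --remote-eval-command values above contain {problem_arg} and {backend} placeholders. How the script substitutes them is not shown in this excerpt; the sketch below assumes str.format-style substitution and uses a hypothetical render_command helper purely to show what the filled-in command would look like.

```python
import shlex
from typing import List

# Template copied from the example command above.
EVAL_TEMPLATE = (
    "python run_local.py --num-procs-per-node 4 --mode eval "
    "--problem {problem_arg} --solution {backend} --measure-perf"
)


def render_command(template: str, problem_arg: str, backend: str) -> List[str]:
    """Fill the placeholders (assumed str.format semantics) and split into argv."""
    return shlex.split(template.format(problem_arg=problem_arg, backend=backend))


argv = render_command(EVAL_TEMPLATE, "1", "cuda")
print(" ".join(argv))
```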

Generate multiple or all solutions

The --problem flag also accepts all or a bracketed list of problem numbers, which makes it simple to generate solutions for multiple problems:

# generate every problem under reference/
python kernelgen/generate_kernel.py \
--precision bf16 \
--hardware h100_8 \
--problem all \
--model deepseek-ai/DeepSeek-V4-Pro \
--backend cuda

# generate a specific subset of problems
python kernelgen/generate_kernel.py \
--precision bf16 \
--hardware h100_8 \
--problem '[72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88]' \
--model zai-org/GLM-5.1 \
--backend cuda
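
The three forms of --problem shown in this document (a single number, all, and a bracketed list) could be parsed roughly as follows. This is a sketch under stated assumptions, not the code in generate_kernel.py: parse_problem_arg and its all_problems parameter are hypothetical, and the real script may discover problems under reference/ differently.

```python
import ast
from typing import List, Sequence


def parse_problem_arg(value: str, all_problems: Sequence[int]) -> List[int]:
    """Turn '1', 'all', or '[72, 73]' into a list of problem numbers."""
    value = value.strip()
    if value == "all":
        return list(all_problems)
    try:
        parsed = ast.literal_eval(value)  # evaluates Python literals only
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"unrecognized --problem value: {value!r}") from exc
    if isinstance(parsed, int) and not isinstance(parsed, bool):
        return [parsed]
    if isinstance(parsed, (list, tuple)) and all(
        isinstance(p, int) and not isinstance(p, bool) for p in parsed
    ):
        return list(parsed)
    raise ValueError(f"unrecognized --problem value: {value!r}")
```

Using ast.literal_eval instead of eval keeps the list syntax convenient on the command line without executing arbitrary code.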

Evaluate a single problem (locally)

[run_local.py](run_local.py) is the local multi-GPU harness (via torchrun). Assuming you have an environment with multiple GPUs connected via NVLink, you can (1) run one backend in isolation (dryrun) or (2) compare a candidate kernel against the reference (eval).

#...

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

New benchmark repo for parallel kernels