Repo · ByteDance (Doubao/Seed) · published Aug 1, 2025

ByteDance-Seed/cudaLLM

Python


ByteDance-Seed/cudaLLM

Language: Python

License: Apache-2.0

Stars: 144

Forks: 7

Open issues: 3

Created: 2025-08-01T22:30:11Z

Pushed: 2025-08-18T06:56:11Z

Default branch: main

Fork: no

Archived: no

README:

CudaLLM: Training Language Models to Generate High-Performance CUDA Kernels

This project provides a complete pipeline for training LLMs to automatically generate efficient and correct CUDA kernels. Using a two-stage process of supervised fine-tuning (SFT) followed by reinforcement learning (RL), the framework fine-tunes a base model to write optimized CUDA code.

For demonstration purposes, this guide uses Qwen3-8B as the base model.

How It Works

The training methodology is composed of two main stages:

1. SFT: The base LLM is first fine-tuned on a high-quality dataset of CUDA kernel examples. The data is generated by DeepSeek R1, DeepSeek Coder-7B, and Qwen2-32B.
2. RL: After SFT, the model is further optimized through reinforcement learning. In this stage, the model generates CUDA kernels, which are then compiled and tested. The outcome of compilation and testing is used as the reward signal that trains the model to produce valid kernels.
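The compile-then-test feedback in the RL stage can be pictured as a small reward function. The following is only a sketch under stated assumptions: `kernel_reward`, `compile_fn`, and `test_fn` are hypothetical names, and the three reward levels are illustrative values rather than the reward shaping this repository actually uses.

```python
from typing import Callable


def kernel_reward(
    source: str,
    compile_fn: Callable[[str], bool],
    test_fn: Callable[[str], bool],
) -> float:
    """Map compile/test outcomes for one generated kernel to a scalar reward.

    compile_fn and test_fn are hypothetical callbacks: the first reports
    whether the kernel source compiles, the second whether its output
    matches a reference. The numeric values below are assumptions.
    """
    if not compile_fn(source):
        return 0.0  # kernel does not compile
    if not test_fn(source):
        return 0.1  # compiles, but produces wrong results (assumed partial credit)
    return 1.0  # compiles and passes the correctness test
```

Passing the compiler and the test harness in as callbacks keeps the sketch independent of any particular toolchain; in a real setup they would wrap the actual compile and execution steps.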

Getting Started

To set up and run the training pipeline, follow these steps:

Step 0: Prepare Datasets

First, process the raw datasets for SFT and RL, and download the evaluation dataset. The cuda_dataset.py script handles the necessary preprocessing.

  • SFT Dataset: sft_cuda_llm_r1.parquet
  • RL Dataset: rl_cuda_llm_0424.parquet
  • Evaluation Dataset: KernelBench

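Run the following commands to begin:

# install verl, pinned to git SHA abb87bc147467589d1357dd80a1e3fefa188e11f
git clone https://github.com/volcengine/verl.git
cd verl
git checkout abb87bc147467589d1357dd80a1e3fefa188e11f
pip install --no-deps -e .
cd ..

# preprocess the SFT and RL datasets and download the evaluation dataset
python3 cuda_dataset.py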
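Before moving on to training, it can help to confirm that the two processed parquet files exist. The sketch below is a minimal check, assuming the files land in a directory you pass in; the README does not state where cuda_dataset.py writes its output, so `data_dir` and the helper name `missing_datasets` are hypothetical.

```python
from pathlib import Path

# File names taken from the dataset list above.
EXPECTED_FILES = ("sft_cuda_llm_r1.parquet", "rl_cuda_llm_0424.parquet")


def missing_datasets(data_dir: str) -> list[str]:
    """Return the expected dataset files that are not present in data_dir.

    data_dir is an assumption: point it at wherever the preprocessing
    script actually wrote its output.
    """
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means both files were found; anything else names the files that still need to be generated.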

Step 1: SFT

Next, fine-tune the base model on the prepared SFT dataset. This will adapt the model to the domain of CUDA code generation.

bash scripts/sft.sh

Step 2: Evaluate the SFT Model

After the SFT stage is complete, evaluate the model's code generation accuracy on the KernelBench benchmark. This step provides a baseline measurement of the model's capabilities before reinforcement learning.

bash scripts/eval.sh
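To make the idea of a baseline measurement concrete, the sketch below aggregates per-task outcomes into a compile rate and an accuracy figure. It is an illustration only: the `KernelResult` record and the `summarize` helper are hypothetical and do not reflect the output format of scripts/eval.sh or the metrics KernelBench itself reports.

```python
from dataclasses import dataclass


@dataclass
class KernelResult:
    """Hypothetical per-task record; not the format produced by eval.sh."""
    task_id: str
    compiled: bool
    correct: bool


def summarize(results: list[KernelResult]) -> dict[str, float]:
    """Compute the fraction of tasks that compiled and that were correct."""
    total = len(results)
    if total == 0:
        return {"compile_rate": 0.0, "accuracy": 0.0}
    compiled = sum(r.compiled for r in results)
    # A kernel only counts as correct if it also compiled.
    correct = sum(r.compiled and r.correct for r in results)
    return {"compile_rate": compiled / total, "accuracy": correct / total}
```

Running the same summary on the results from before and after the RL stage gives a like-for-like comparison of the two checkpoints.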

Step 3: RL

Finally, use reinforcement learning to further enhance the SFT model's ability to generate performant code. On each node, this stage allocates hardware as follows:

  • 4x GPUs are dedicated to the RL training loop.
  • 4x GPUs are used to run the generated kernels, providing the reward needed for training.

bash scripts/rl.sh
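The 4 + 4 split on a node can be expressed as two device lists, for example as values for the standard `CUDA_VISIBLE_DEVICES` environment variable. This is a sketch under assumptions: the helper `split_gpus` is hypothetical, and which device IDs scripts/rl.sh actually assigns to each role is not stated in this README.

```python
def split_gpus(num_gpus: int = 8, num_train: int = 4) -> tuple[str, str]:
    """Split GPU IDs on one node into a training group and a kernel-execution group.

    Returns two comma-separated ID strings. Assigning the lowest IDs to
    training is an arbitrary choice made for this illustration.
    """
    if not 0 < num_train < num_gpus:
        raise ValueError("num_train must be between 1 and num_gpus - 1")
    ids = [str(i) for i in range(num_gpus)]
    train_ids = ",".join(ids[:num_train])
    exec_ids = ",".join(ids[num_train:])
    return train_ids, exec_ids
```

Keeping the two groups disjoint means a generated kernel that crashes or hangs a device does not take down a GPU the training loop depends on.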

Upon completion, you will have a model trained specifically to generate CUDA kernels. Re-run the evaluation script (scripts/eval.sh) to measure the improvement from the RL stage over the SFT baseline.

License

This project is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.

Notability

notability 5.0/10

New repo by ByteDance, moderate traction.