ByteDance-Seed/cudaLLM
Python
Language: Python
License: Apache-2.0
Stars: 144
Forks: 7
Open issues: 3
Created: 2025-08-01T22:30:11Z
Pushed: 2025-08-18T06:56:11Z
Default branch: main
Fork: no
Archived: no
README:
CudaLLM: Training Language Models to Generate High-Performance CUDA Kernels
This project provides a complete pipeline for training LLMs to automatically generate efficient and correct CUDA kernels. Using a two-stage process of supervised fine-tuning (SFT) followed by reinforcement learning (RL), the framework fine-tunes a base model to write optimized CUDA code.
For demonstration purposes, this guide uses Qwen3-8B as the base model.
How It Works
The training methodology is composed of two main stages:
1. SFT: The base LLM is first fine-tuned on a high-quality dataset of CUDA kernel examples. The data is generated by DeepSeek R1, DeepSeek Coder-7B, and Qwen2-32B.
2. RL: After SFT, the model is further optimized through reinforcement learning. In this stage, the model generates CUDA kernels, which are then compiled and tested. The compile and test results serve as the reward signal that trains the model to produce valid kernels.
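As a rough illustration of how compile and test feedback could be turned into a reward, here is a minimal sketch. The `KernelResult` schema, the `kernel_reward` function, and the three reward levels are all assumptions made for illustration; the project's actual reward shaping is not described in this README.

```python
from dataclasses import dataclass


@dataclass
class KernelResult:
    """Outcome of building and testing one generated kernel (hypothetical schema)."""
    compiled: bool       # did the kernel compile?
    tests_passed: bool   # did its output match the reference?


def kernel_reward(result: KernelResult) -> float:
    """Map a compile/test outcome to a scalar reward.

    The reward levels below are illustrative assumptions, not the
    values used by this project.
    """
    if not result.compiled:
        return 0.0   # no credit for code that does not build
    if not result.tests_passed:
        return 0.1   # small credit for code that builds but is incorrect
    return 1.0       # full credit for a valid kernel


if __name__ == "__main__":
    outcomes = (
        KernelResult(compiled=False, tests_passed=False),
        KernelResult(compiled=True, tests_passed=False),
        KernelResult(compiled=True, tests_passed=True),
    )
    for outcome in outcomes:
        print(outcome, "->", kernel_reward(outcome))
```

A graded reward like this gives the policy a signal even when a kernel fails the tests, which may help early in training; a strictly binary reward would be an equally plausible choice.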
Getting Started
To set up and run the training pipeline, follow these steps:
Step 0: Prepare Datasets
First, process the raw datasets for SFT and RL, and download the evaluation dataset. The cuda_dataset.py script handles the necessary preprocessing.
- SFT Dataset: sft_cuda_llm_r1.parquet
- RL Dataset: rl_cuda_llm_0424.parquet
- Evaluation Dataset: KernelBench
Run the following command to begin:
# install verl, the git SHA is abb87bc147467589d1357dd80a1e3fefa188e11f
git clone https://github.com/volcengine/verl.git
cd verl
pip install --no-deps -e .
cd ..
python3 cuda_dataset.py
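The commands above mention a verl git SHA but do not check it out. For a reproducible setup, one option is to pin the clone to that commit. The sketch below builds the same command sequence with an added `git checkout` step; that step, and the helper names `verl_install_steps` and `run_steps`, are assumptions for illustration rather than part of the project's scripts. By default it only prints the commands.

```python
import subprocess

VERL_REPO = "https://github.com/volcengine/verl.git"
VERL_SHA = "abb87bc147467589d1357dd80a1e3fefa188e11f"  # SHA quoted in the README


def verl_install_steps(sha: str = VERL_SHA, dest: str = "verl"):
    """Return (argv, cwd) pairs that clone verl, pin it to `sha`, and install it."""
    return [
        (["git", "clone", VERL_REPO, dest], None),
        (["git", "checkout", sha], dest),  # assumption: pin the checkout to the quoted SHA
        (["pip", "install", "--no-deps", "-e", "."], dest),
    ]


def run_steps(steps, dry_run: bool = True):
    """Print each step; execute it only when dry_run is False."""
    for argv, cwd in steps:
        print(f"[{cwd or '.'}] {' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    run_steps(verl_install_steps())  # dry run: prints the commands only
```

Calling `run_steps(verl_install_steps(), dry_run=False)` would execute the steps, after which `python3 cuda_dataset.py` can be run from the repository root as before.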
Step 1: SFT
Next, fine-tune the base model on the prepared SFT dataset. This will adapt the model to the domain of CUDA code generation.
bash scripts/sft.sh
Step 2: Evaluate the SFT Model
After the SFT stage is complete, evaluate the model's code generation accuracy on the KernelBench benchmark. This step provides a baseline measurement of the model's capabilities before reinforcement learning.
bash scripts/eval.sh
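To make the notion of accuracy concrete, the sketch below computes a pass rate over per-problem results, in the spirit of KernelBench's fast_p metric (the fraction of problems that are both correct and faster than a speedup threshold p). The `EvalRecord` schema is hypothetical, and the exact numbers reported by scripts/eval.sh are not specified in this README.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class EvalRecord:
    """Outcome for one benchmark problem (hypothetical schema)."""
    correct: bool    # kernel compiled and matched the reference output
    speedup: float   # reference runtime divided by generated-kernel runtime


def fast_p(records: Iterable[EvalRecord], p: float = 0.0) -> float:
    """Fraction of problems that are correct and have speedup greater than p."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(1 for r in records if r.correct and r.speedup > p)
    return hits / len(records)


if __name__ == "__main__":
    results = [
        EvalRecord(correct=True, speedup=1.5),
        EvalRecord(correct=True, speedup=0.8),
        EvalRecord(correct=False, speedup=0.0),
        EvalRecord(correct=True, speedup=2.0),
    ]
    print("correctness rate (p=0):", fast_p(results, p=0.0))   # 0.75
    print("correct and faster (p=1):", fast_p(results, p=1.0))  # 0.5
```

With p = 0 the metric reduces to plain correctness, which matches the accuracy baseline described for this step; computing the same quantity after RL gives a like-for-like comparison.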
Step 3: RL
Finally, use reinforcement learning to further enhance the SFT model's ability to generate performant code. On each node, this stage allocates the GPUs as follows:
- 4x GPUs are dedicated to the RL training loop.
- 4x GPUs are used to run the generated kernels, providing the reward needed for training.
bash scripts/rl.sh
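The 4 + 4 split described above can be pictured as two disjoint GPU groups on an 8-GPU node, each exposed to its own process through the `CUDA_VISIBLE_DEVICES` environment variable. This is a minimal sketch under that assumption; which GPU ids go to which role, and how scripts/rl.sh actually configures this, are not stated in this README, and the helper names are hypothetical.

```python
import os


def split_gpus(num_gpus: int = 8, num_train: int = 4):
    """Split one node's GPU ids into a training group and a kernel-execution group."""
    if not 0 < num_train < num_gpus:
        raise ValueError("num_train must be between 1 and num_gpus - 1")
    ids = list(range(num_gpus))
    return ids[:num_train], ids[num_train:]


def env_for(gpu_ids):
    """Copy the current environment, restricted to the given GPU ids."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


if __name__ == "__main__":
    train_ids, exec_ids = split_gpus()
    print("training GPUs:", env_for(train_ids)["CUDA_VISIBLE_DEVICES"])
    print("kernel-execution GPUs:", env_for(exec_ids)["CUDA_VISIBLE_DEVICES"])
```

Keeping the two groups disjoint means kernel benchmarking does not compete with the training loop for GPU time, which helps keep the timing-based part of the feedback stable.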
Upon completion, you will have a model specifically trained to generate high-quality CUDA kernels. You can re-run the evaluation script (scripts/eval.sh) to measure the performance uplift from the RL stage.
License
This project is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.
Notability
Notability score: 5.0/10. New repo by ByteDance, moderate traction.