RepoOpenBMB (MiniCPM)OpenBMB (MiniCPM)published Jun 6, 2025seen 5d

OpenBMB/CPM.cu

Cuda

Open original ↗

Captured source

source ↗
published Jun 6, 2025seen 5dcaptured 9hhttp 200method plain

OpenBMB/CPM.cu

Description: CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and quantization.

Language: Cuda

License: Apache-2.0

Stars: 241

Forks: 26

Open issues: 7

Created: 2025-06-06T05:31:19Z

Pushed: 2026-01-14T09:34:08Z

Default branch: main

Fork: no

Archived: no

README:

CPM.cu

[中文版本](./README_ZH.md) | English

CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and quantization.

🔥 Project Updates

  • [2025.06.06] Optimized for MiniCPM4.
  • Support InfLLM-v2 attention kernel
  • Support sliding-window for the MTP layer, optimized for long context
  • Support quantization for the MTP layer
  • [2025.05.29] Support Quantization at SpecMQuant.
  • Support Marlin GPTQ kernel for the LLM
  • Support Speculative Sampling for quantized LLM
  • [2025.03.01] Release the first version at FR-Spec.
  • SOTA Speculative Sampling Implementation
  • Support FR-Spec: Frequency-Ranked Speculative Sampling
  • Support Tree-based verification of Speculative Sampling in Flash-Attention
  • Support Static memory management and memory reuse
  • Support Fused kernels
  • Support Chunked prefill
  • Support CUDA Graph

Demo

https://github.com/user-attachments/assets/ab36fd7a-485b-4707-b72f-b80b5c43d024

Getting Started

  • [Installation](#install)
  • [Docker Usage](#docker)
  • [Model Weights](#modelweights)
  • [Command Line Interface (CLI)](#cli)
  • [OpenAI API Service](#openai-api)

Installation

Install from source

This library's build depends on torch and ninja. Please install both before installing this library.

Supported Python versions: 3.8–3.12.

git clone https://github.com/OpenBMB/CPM.cu.git --recursive
cd CPM.cu
pip install .

If you encounter installation issues, please follow the error messages to resolve them or create a GitHub issue. You can use python setup.py --help-config to view more installation configuration options.

Docker Usage

We provide pre-built Docker images that support out-of-the-box GPU inference environments.

Docker Images List

| Image | Description | url | |-------|-------------|-------| | cpmcu:cuda12.6-release | CUDA 12.6 release image recommended |modelbest-registry.cn-beijing.cr.aliyuncs.com/model-align/cpmcu_cu12.6:v1.0.0| | cpmcu:cuda12.8-release | CUDA 12.8 develop image, add support for RTX 50 series |modelbest-registry.cn-beijing.cr.aliyuncs.com/model-align/cpmcu_cu12.8:v1.0.0| | cpmcu:jetpack6.1| Jetpack 6, add support for Jetson Orin, developing |---------| | cpmcu:cuda11.8-release | CUDA 11.8 release image, developing |---------|

Quick Start

# Pull pre-built image
docker pull modelbest-registry.cn-beijing.cr.aliyuncs.com/model-align/cpmcu_cu12.6:v1.0.0

docker tag modelbest-registry.cn-beijing.cr.aliyuncs.com/model-align/cpmcu_cu12.6:v1.0.0 cpmcu:cuda12.6-release

# Run interactive container
docker run --gpus all -it cpmcu:cuda12.6-release /bin/bash

# Start API server(need to login to huggingface or -v mount model)
docker run --gpus all -p 8000:8000 cpmcu:cuda12.6-release \
python examples/minicpm4/start_server.py --apply-sparse

Offline Usage (Recommended)

# 1. Download model on host
huggingface-cli download openbmb/MiniCPM4-8B-marlin-cpmcu --local-dir model/MiniCPM4-8B-marlin-cpmcu

# Also download draft model & FRSpec for speculative decoding (optional)
huggingface-cli download openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu --local-dir model/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu

# 2. Mount directories and run
docker run --rm --gpus all \
-v /path/to/model:/workspace/model \
cpmcu:cuda12.6-release \
bash -lc 'cd examples && python3 minicpm4/test_generate.py \
--model-path /workspace/model/MiniCPM4-8B-marlin-cpmcu \
--draft-model-path /workspace/model/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu \
--frspec-path /workspace/model/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu \
--prompt-text "Hello" --num-generate 128 --use-stream false'

Detailed Documentation: [Docker User Guide](doc/en/docker_use.md)

Prepare Model

Please follow MiniCPM4's README to download the model weights.

Quick Start

We provide a simple example to show how to use CPM.cu to generate text.

cd examples
python3 minicpm4/test_generate.py --prompt-file

If you don't ​​specify​​ the model path, the scripts will load the model from ​​OpenBMB's Hugging Face repository​​. If you want to use local paths, we recommend keeping all model filenames unchanged and placing them in the same directory. This way, you can run the model by specifying the directory with the -p parameter. Otherwise, we suggest modifying the paths in the code accordingly. You can use --help to learn more ​​about the script's features​​.

We also provide a script, examples/long_prompt_gen.py, to generate ​​long code summarization. This script ​​automatically collects code from this repository​​ and prompts ​​the model to "Summarize the code."​

cd examples
python3 long_prompt_gen.py # generate prompt.txt (for more details, use --help)
python3 minicpm4/test_generate.py --prompt-file ../prompt.txt

The output should be of the following format:

Generated text (streaming output):
--------------------------------------------------
Prefilling: 100.0% (106850/106850 tokens) @ 6565.3 tokens/s - Complete!


==================================================
Stream Generation Summary:
==================================================
Prefill length: 106850
Prefill time: 16.36 s
Prefill tokens/s: 6530.77
Mean accept length: 2.50
Decode length: 118
Decode time: 0.76 s
Decode tokens/s: 154.59

Where:

  • the Prefill and Decode speed are output by (length, time and token/s).
  • the Mean accept length is the average length of the accepted tokens when using Speculative Sampling.

Command Line Interface (CLI)

For users who need more granular control over inference parameters (e.g., temperature, generation length), we recommend using the cpmcu.cli module directly. This is the most flexible way to perform detailed configuration and testing.

You can view all available parameters by running `python…

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

New CUDA-optimized CPM repo, moderate stars