What does this fork signal mean?

SiliconFlow created siliconflow/checkpoint-engine as a fork of MoonshotAI/checkpoint-engine. This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo siliconflow/checkpoint-engine · parent MoonshotAI/checkpoint-engine · A tool for managing machine learning model checkpoints. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

SiliconFlow Fork: siliconflow/checkpoint-engine

Captured source

GitHub: github.com/siliconflow/checkpoint-engine

siliconflow/checkpoint-engine repository metadata

published Apr 1, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

siliconflow/checkpoint-engine

Description: Checkpoint-engine is a simple middleware to update model weights in LLM inference engines

License: MIT

Stars: 0

Forks: 0

Open issues: 0

Created: 2026-04-01T08:58:05Z

Pushed: 2026-04-01T09:58:32Z

Default branch: main

Fork: yes

Parent repository: MoonshotAI/checkpoint-engine

Archived: no

README:

Checkpoint Engine

Checkpoint-engine is a simple middleware to update model weights in LLM inference engines -- a critical step in reinforcement learning. We provide an efficient and lightweight implementation for in-place weight update: updating our Kimi-K2 model (1 trillion parameters) across thousands of GPUs takes about 20s.

Architecture

The core weight update logic is in the ParameterServer class, a service colocated with the inference engines. It provides two implementations of weight update: Broadcast and P2P.

Broadcast: Used when a large number of inference instances need to update weights synchronously. This is the fastest implementation and should be used as the default update method. See _update_per_bucket with ranks == None or [].
P2P: Used when new inference instances are added dynamically (due to restarts or dynamic availability) while the existing instances are already serving requests. In this scenario, to avoid affecting the workloads on existing instances, we use the `mooncake-transfer-engine` to send weights P2P from CPUs in existing instances to GPUs in new instances. See _update_per_bucket with ranks specified.
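The choice between the two paths comes down to whether `ranks` is given. The sketch below only illustrates that rule; `choose_update_mode` is a hypothetical helper and not part of the checkpoint-engine API.

```python
from typing import Optional, Sequence


def choose_update_mode(ranks: Optional[Sequence[int]] = None) -> str:
    """Hypothetical helper: map the `ranks` argument to an update path."""
    # None and an empty sequence both mean "update every instance together".
    if ranks is None or len(ranks) == 0:
        return "broadcast"
    # Any explicit set of ranks selects the P2P path.
    return "p2p"


print(choose_update_mode())              # no ranks given
print(choose_update_mode(range(0, 16)))  # explicit ranks
```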

Optimized Weight Broadcast

In the *Broadcast* implementation, the checkpoint-engine holds references to sharded weights in CPU memory and needs to broadcast them efficiently to a cluster of inference instances, often under a different sharding pattern. We arrange the data transfer into 3 stages:
1. H2D: move weights to GPU memory. These weights may come from disk or from the training engine.
2. broadcast: broadcast among checkpoint-engine workers; the data ends up in a CUDA IPC buffer shared with the inference engine.
3. reload: the inference engine decides which subset of weights to copy from the broadcast data.

Checkpoint-engine orchestrates the entire transfer process. It first gathers the necessary metadata to create a plan, including deciding the proper bucket size for data transfer. It then executes the transfer, controlling the inference engine through a ZeroMQ socket. To maximize performance, it organizes the data transfers into a pipeline that overlaps communication and copy (illustrated by a figure in the full README). The details can be found in the Kimi-K2 Technical Report.
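To make the planning step concrete, here is a minimal sketch of grouping tensors into fixed-size buckets. It assumes a simple greedy, in-order packing. How checkpoint-engine actually sizes and fills its buckets is not described in this excerpt, and `plan_buckets` and the tensor names are hypothetical.

```python
from typing import Dict, List


def plan_buckets(tensor_bytes: Dict[str, int], bucket_bytes: int) -> List[List[str]]:
    """Greedily pack tensors, in order, into buckets of at most bucket_bytes."""
    buckets: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, size in tensor_bytes.items():
        # Close the current bucket when the next tensor would overflow it.
        if current and used + size > bucket_bytes:
            buckets.append(current)
            current, used = [], 0
        # A tensor larger than bucket_bytes ends up alone in its own bucket.
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets


sizes = {"a": 4, "b": 4, "c": 6, "d": 2}  # illustrative sizes in bytes
print(plan_buckets(sizes, bucket_bytes=8))
```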

Pipelining naturally requires more GPU memory. When there is not enough memory, checkpoint-engine falls back to serial execution.
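The trade-off can be sketched with a toy cost model. It assumes that pipelining needs two bucket-sized buffers (double buffering) and that only the H2D and broadcast stages overlap. Both are illustrative assumptions, and neither function below exists in checkpoint-engine.

```python
def choose_schedule(free_gpu_bytes: int, bucket_bytes: int,
                    buffers_for_pipeline: int = 2) -> str:
    """Pick 'pipelined' only if enough buffers fit in free GPU memory."""
    if free_gpu_bytes >= buffers_for_pipeline * bucket_bytes:
        return "pipelined"
    return "serial"


def estimate_seconds(n_buckets: int, h2d_s: float, bcast_s: float,
                     schedule: str) -> float:
    """Toy estimate of total time for uniform per-bucket stage costs."""
    if n_buckets == 0:
        return 0.0
    if schedule == "serial":
        # Each bucket finishes H2D and broadcast before the next one starts.
        return n_buckets * (h2d_s + bcast_s)
    # Two-stage pipeline: fill, then the slower stage sets the pace, then drain.
    return h2d_s + (n_buckets - 1) * max(h2d_s, bcast_s) + bcast_s


mode = choose_schedule(free_gpu_bytes=16, bucket_bytes=8)
print(mode, estimate_seconds(4, 1.0, 2.0, mode))
```

In this model the pipelined schedule hides the faster stage behind the slower one, which is why it costs an extra buffer.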

Optimized P2P Bucket Assignment

In the *P2P* implementation, checkpoint-engine needs to send weights from existing instances to new instances. To minimize the overall transfer time, checkpoint-engine optimizes the bucket assignment for each sender-receiver pair. The optimization goal is to make full use of the available network bandwidth of each sender and receiver. See issue #25.
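The actual assignment algorithm is discussed in the issue referenced above and is not reproduced in this excerpt. As a rough illustration of the goal (keeping every sender equally busy), the sketch below uses a generic greedy heuristic: largest bucket first, always to the least-loaded sender. It is a stand-in technique, and all names in it are hypothetical.

```python
import heapq
from typing import Dict, List


def assign_buckets(bucket_bytes: Dict[int, int], senders: List[str]) -> Dict[str, List[int]]:
    """Greedy balance: give each bucket to the currently least-loaded sender."""
    heap = [(0, s) for s in senders]  # (bytes assigned so far, sender name)
    heapq.heapify(heap)
    assignment: Dict[str, List[int]] = {s: [] for s in senders}
    # Placing the largest buckets first tends to even out the final loads.
    for bucket_id, size in sorted(bucket_bytes.items(), key=lambda kv: -kv[1]):
        load, sender = heapq.heappop(heap)
        assignment[sender].append(bucket_id)
        heapq.heappush(heap, (load + size, sender))
    return assignment


plan = assign_buckets({0: 8, 1: 6, 2: 4, 3: 2}, ["s0", "s1"])
print(plan)
```

A real planner would also need to account for receiver-side bandwidth, which this sketch ignores.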

Benchmark

| Model | Device Info | GatherMetas | Update (Broadcast) | Update (P2P) |
| :----------------------------------- | :----------- | :---------- | :----------------- | :---------------------- |
| GLM-4.5-Air (BF16) | 8xH800 TP8 | 0.12s | 3.47s (3.02GiB) | 4.12s (3.02GiB) |
| Qwen3-235B-A22B-Instruct-2507 (BF16) | 8xH800 TP8 | 0.33s | 6.22s (2.67GiB) | 7.10s (2.68GiB) |
| DeepSeek-V3.1 (FP8) | 16xH20 TP16 | 1.17s | 10.19s (5.39GiB) | 11.80s (5.41GiB) |
| Kimi-K2-Instruct (FP8) | 16xH20 TP16 | 1.33s | 14.36s (5.89GiB) | 17.49s (5.91GiB) |
| DeepSeek-V3.1 (FP8) | 256xH20 TP16 | 0.80s | 11.33s (8.00GiB) | 11.81s (8.00GiB) |
| Kimi-K2-Instruct (FP8) | 256xH20 TP16 | 1.22s | 16.04s (8.00GiB) | 16.75s (8.00GiB) |

All results above were measured with [examples/update.py](./examples/update.py), using vLLM v0.10.2rc1 as the inference engine. Some notes:

FP8 tests need additional vLLM patches; see [FP8 quantization](#fp8-quantization).
Device Info: we tested various combinations of devices and parallelism setups. For example, a 256-GPU TP16 setup means that we deploy 16 vLLM instances, each with 16-way tensor parallelism.
Since update duration is related to IPC bucket size, we provide the bucket size in the table.
The P2P times were measured when updating no more than two nodes (16 GPUs) (ParameterServer.update(ranks=range(0, 16))) out of the entire cluster.
We bind each GPU to its corresponding NUMA node to ensure stable H2D transfer speeds.
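The arithmetic behind the Device Info and P2P notes can be checked in a few lines. `instance_count` is a hypothetical helper, and the figure of 8 GPUs per node is inferred from "two nodes (16 GPUs)".

```python
def instance_count(total_gpus: int, tp_size: int) -> int:
    """Number of inference instances when each one spans tp_size GPUs."""
    if total_gpus % tp_size != 0:
        raise ValueError("total_gpus must be a multiple of tp_size")
    return total_gpus // tp_size


GPUS_PER_NODE = 8                # inferred from "two nodes (16 GPUs)"
p2p_ranks = range(0, 16)         # the ranks used for the P2P measurements

print(instance_count(256, 16))          # instances in the 256-GPU TP16 setup
print(len(p2p_ranks) // GPUS_PER_NODE)  # nodes covered by the P2P update
```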

Installation

Use the fastest broadcast implementation:

pip install checkpoint-engine

Use the flexible P2P implementation. Note that this also installs mooncake-transfer-engine to support RDMA transfer between different ranks:

pip install 'checkpoint-engine[p2p]'

Getting Started

Prepare an H800 or H20 machine with 8 GPUs and vLLM installed. Make sure your vLLM build includes the commit that adds the /collective_rpc API endpoint (available in the main branch), since checkpoint-engine uses this endpoint to update weights. vLLM v0.10.2 is fully tested and recommended.

mkdir -p /opt/vLLM && cd /opt/vLLM
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm==0.10.2

Install checkpoint-engine:

uv pip install 'checkpoint-engine[p2p]'

We use Qwen/Qwen3-235B-A22B-Instruct-2507 (BF16) as the test model:

hf download Qwen/Qwen3-235B-A22B-Instruct-2507 --local-dir /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/

Start vLLM in dev mode and set --load-format dummy. Note that we also set --worker-extension-cls=checkpoint_engine.worker.VllmColocateWorkerExtension:

VLLM_SERVER_DEV_MODE=1 python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 19730 --trust-remote-code \
--tensor-parallel-size=8 --max-model-len 4096 --load-format dummy \
--served-model-name checkpoint-engine-demo --model /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/ \
--worker-extension-cls checkpoint_engine.worker.VllmColocateWorkerExtension
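If you launch the server from a script rather than a shell, the same command can be assembled in Python. This is a minimal sketch: `build_vllm_launch` is a hypothetical helper, and every flag and value is copied from the command above rather than derived from vLLM's documentation.

```python
import shlex
from typing import Dict, List, Tuple


def build_vllm_launch(model_dir: str, tp_size: int = 8,
                      port: int = 19730) -> Tuple[Dict[str, str], List[str]]:
    """Return (extra environment variables, argv) for the vLLM server."""
    env = {"VLLM_SERVER_DEV_MODE": "1"}  # dev mode, as in the command above
    argv = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--host", "0.0.0.0",
        "--port", str(port),
        "--trust-remote-code",
        f"--tensor-parallel-size={tp_size}",
        "--max-model-len", "4096",
        "--load-format", "dummy",  # weights arrive later via checkpoint-engine
        "--served-model-name", "checkpoint-engine-demo",
        "--model", model_dir,
        "--worker-extension-cls",
        "checkpoint_engine.worker.VllmColocateWorkerExtension",
    ]
    return env, argv


env, argv = build_vllm_launch("/opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/")
print(shlex.join(argv))  # shell-quoted form, for logging or copy-paste
```

To start the server, `argv` could be passed to `subprocess.Popen` with `env` merged into a copy of `os.environ`.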

Meanwhile, use this command to update...

Excerpt shown — open the source for the full document.

Notability

notability 1.0/10

Routine internal fork