Repo · OpenBMB (MiniCPM) · published Jun 5, 2025

OpenBMB/infllmv2_cuda_impl

Python


Language: Python

Stars: 102

Forks: 12

Open issues: 0

Created: 2025-06-05T15:30:25Z

Pushed: 2026-02-11T10:59:17Z

Default branch: main

Fork: no

Archived: no

README:

InfLLM V2 CUDA Kernel Implementation

This repository contains the optimized CUDA kernel implementation for InfLLM V2's Two-Stage Sparse Attention Mechanism. Our implementation provides high-performance kernels for both Stage 1 (Top-K Context Selection) and Stage 2 (Sparse Attention Computation), enabling Large Language Models (LLMs) to efficiently process long contexts with trainable sparse patterns.

Overview

InfLLM V2 introduces a novel two-stage approach for efficient long-context processing:

  • Stage 1: Top-K Context Selection: Block scoring and aggregation against semantic kernels (the CUDA kernel computes and aggregates the scores; Top-K selection is performed externally)
  • Stage 2: Sparse Attention Computation: Attention calculation on selected blocks

This CUDA kernel implementation includes both stages, providing:

  • Optimized relevance score computation and aggregation for Stage 1 (Top-K selection performed externally)
  • Efficient sparse attention on selected blocks for Stage 2
  • Significant reduction in computational cost for both the forward and backward passes

Built upon FlashAttention, our kernels leverage efficient memory access patterns and optimized implementations for both stages.

![InfLLM V2 Architecture](assets/infllm-v2.jpg)

Open Source Resources

*Updated: 2025-12-01*

We have released the training data and base model for InfLLM V2:

  • 🔗 Training Data: https://huggingface.co/datasets/openbmb/InfLLM-V2-data-5B
  • 🔗 Initial Model: https://huggingface.co/openbmb/InfLLM-V2-Short-Dense-Base
  • 🔗 Final Model: https://huggingface.co/openbmb/InfLLM-V2-Long-Sparse-Base

The optimized Stage 1 implementation has been integrated into this repository.

Two-Stage Architecture

Stage 1: Top-K Context Selection

The Top-K selection stage involves three sequential steps:

  1. Relevance Score Computation: Computing scores between query tokens and each semantic kernel (compressed representations of key-value blocks), followed by softmax normalization
  2. Score Aggregation: Aggregating relevance scores for each semantic kernel across the query group dimension using dimension reduction (hdim16_reduce)
  3. Block Selection (Post-processing): Selecting the top-K context blocks for each query token based on the aggregated scores

Note: The infllmv2_attn_stage1 kernel handles steps 1 and 2 (score computation and aggregation). Only step 3 (Top-K selection) is performed outside the kernel.
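
The three steps above can be sketched as a plain NumPy reference. This is a minimal illustration and not the CUDA kernel: the function name `stage1_reference`, the tensor layout, the grouped-query head ordering, and the use of a sum for aggregation are assumptions made for the example, and causal masking is omitted for brevity.

```python
import numpy as np

def stage1_reference(q, k_comp, group_size, top_k):
    """Hypothetical NumPy reference for the three Stage 1 steps.

    q:          (n_q_heads, seqlen_q, head_dim) token-level queries
    k_comp:     (n_kv_heads, n_blocks, head_dim) compressed keys (semantic kernels)
    group_size: query heads sharing one KV head (assumed grouped-query layout)
    """
    n_q_heads, seqlen_q, head_dim = q.shape
    n_kv_heads, n_blocks, _ = k_comp.shape
    assert n_q_heads == n_kv_heads * group_size

    # Step 1: scaled dot-product scores against each semantic kernel, then softmax.
    k_rep = np.repeat(k_comp, group_size, axis=0)
    logits = q @ k_rep.transpose(0, 2, 1) / np.sqrt(head_dim)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Step 2: aggregate across the query group dimension (sum is an assumption).
    grouped = probs.reshape(n_kv_heads, group_size, seqlen_q, n_blocks)
    aggregated = grouped.sum(axis=1)  # (n_kv_heads, seqlen_q, n_blocks)

    # Step 3: top-K block indices per query token (done outside the kernel).
    top_k = min(top_k, n_blocks)
    topk_idx = np.argsort(-aggregated, axis=-1, kind="stable")[..., :top_k]
    return aggregated, topk_idx
```

In this sketch steps 1 and 2 correspond to what the note attributes to the kernel, while step 3 is the post-processing left to the caller.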

Stage 2: Sparse Attention Computation

The sparse attention stage performs standard attention computation, but only on the blocks selected in Stage 1:

  • Support for both forward and backward passes
  • Efficient memory access through block-sparse patterns
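
As a rough illustration of attending only to selected blocks, here is a single-head NumPy sketch. The function name, the fixed `block_size`, the convention that negative indices mark unused slots, and the zero output for tokens with no visible keys are all assumptions for the example; the actual kernels are built on FlashAttention and are organized very differently.

```python
import numpy as np

def sparse_attention_reference(q, k, v, topk_idx, block_size):
    """Hypothetical causal, single-head reference for block-sparse attention.

    q, k:     (seqlen, head_dim)
    v:        (seqlen, v_dim)
    topk_idx: (seqlen, top_k) key-block indices chosen for each query token
    """
    seqlen, head_dim = q.shape
    out = np.zeros((seqlen, v.shape[1]))
    for t in range(seqlen):
        # Mark the key positions covered by this token's selected blocks.
        allowed = np.zeros(seqlen, dtype=bool)
        for b in topk_idx[t]:
            if b >= 0:  # negative index = unused slot (assumption)
                allowed[b * block_size:(b + 1) * block_size] = True
        allowed[t + 1:] = False  # causal mask: no attention to future tokens
        if not allowed.any():
            continue  # nothing visible: leave the output row at zero
        logits = (k[allowed] @ q[t]) / np.sqrt(head_dim)
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        out[t] = weights @ v[allowed]
    return out
```

When every block is selected for every token, this reference reduces to dense causal attention, which makes a convenient sanity check.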

Kernel Design Features

  • Token-level Query, Block-level Key-Value: Avoids training-inference inconsistency during decoding
  • Trainable Context Selection: Semantic kernels updated indirectly through token-level key vector optimization
  • Selective Block Attention: Performs attention only on blocks selected in Stage 1
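
To make the "updated indirectly" point concrete, the sketch below builds compressed keys by mean pooling over sliding windows of token-level keys. The README does not state which compression operator is used, so mean pooling, the function name, and the `block_size` and `stride` parameters are assumptions for illustration only.

```python
import numpy as np

def compress_keys_mean_pool(k, block_size, stride):
    """Hypothetical compression: mean over sliding windows of token-level keys.

    k: (seqlen, head_dim) -> (n_windows, head_dim)
    """
    seqlen = k.shape[0]
    # Window start positions; a sequence shorter than one window yields one window.
    starts = range(0, max(seqlen - block_size, 0) + 1, stride)
    return np.stack([k[s:s + block_size].mean(axis=0) for s in starts])
```

Under this assumption the pooling step has no parameters of its own, so the only way training can change a semantic kernel is by changing the key vectors it is pooled from.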

Kernel Implementation Details

Stage 1 Kernels

  • infllmv2_attn_stage1: Calculates similarity scores between query tokens and compressed key representations
  • Performs score aggregation across query group dimension (hdim16_reduce)
  • Returns aggregated attention scores for subsequent Top-K selection (selection performed outside the kernel)
  • Support for causal masking and variable sequence lengths

Stage 2 Kernels

  • infllmv2_sparse_attn_fwd: Forward pass kernel for sparse attention
  • infllmv2_sparse_attn_bwd: Backward pass kernel for training

Installation

Requirements

  • PyTorch 1.12+
  • CUDA 11.6+ (with CUDA development toolkit)
  • Python 3.7+
  • Linux operating system
  • Sufficient GPU memory for kernel compilation
  • Ninja build system (for faster compilation)
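
Before building, a quick preflight script can flag missing prerequisites. The helper below is a hypothetical convenience, not part of this repository; it only checks that the tools are discoverable and does not verify the PyTorch or CUDA version numbers.

```python
import shutil
import sys
from importlib.util import find_spec

def check_build_prerequisites():
    """Return requirement name -> bool for a quick look before building."""
    return {
        "python>=3.7": sys.version_info >= (3, 7),
        "linux": sys.platform.startswith("linux"),
        # Only checks that torch is importable; compare the version yourself.
        "torch_importable": find_spec("torch") is not None,
        # nvcc on PATH suggests the CUDA development toolkit is installed.
        "nvcc_on_path": shutil.which("nvcc") is not None,
        "ninja_on_path": shutil.which("ninja") is not None,
    }

if __name__ == "__main__":
    for name, ok in check_build_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```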

Build from Source

For Training / Inference (main branch)

# Install with CUDA kernel compilation
pip install -e .

Usage

CUDA Kernel API

The InfLLM V2 CUDA kernel provides the following interfaces for the two-stage sparse attention:

Stage 1: Attention Score Computation and Aggregation (feature_infer branch)

from infllm_v2 import infllmv2_attn_stage1

# Stage 1: Compute and aggregate relevance scores between queries and semantic kernels
# This kernel performs:
# 1. LSE approximation using compressed keys
# 2. Full attention score computation
# 3. Score aggregation across query group dimension (hdim16_reduce)
# Top-K selection must be performed separately on the aggregated scores
#
# Inputs:
# - q: Query tensor (batch_size * n_heads, seqlen_q, head_dim)
# - k: Compressed key tensor representing semantic kernels
# - v: Placeholder tensor (not used in score computation)
# - cu_seqlens_q, cu_seqlens_k: Cumulative sequence lengths
# - max_seqlen_q, max_seqlen_k: Maximum sequence lengths

# Returns aggregated attention scores for subsequent Top-K selection
aggregated_scores = infllmv2_attn_stage1(
    q, k, v,
    cu_seqlens_q=cu_seqlens_q,
    cu_seqlens_k=cu_seqlens_k,
    max_seqlen_q=max_seqlen_q,
    max_seqlen_k=max_seqlen_k,
    causal=True,             # Apply causal masking
    return_attn_probs=True   # Return attention scores
)

# Top-K selection should be performed on the returned aggregated scores
# (This step is not part of the kernel)
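
The `cu_seqlens_*` and `max_seqlen_*` arguments can be derived from per-sequence lengths as sketched below. This assumes the FlashAttention varlen convention (offsets starting at 0, one more entry than there are sequences), which this repository is presumed to follow because it is built on FlashAttention; `build_cu_seqlens` is a hypothetical helper.

```python
from itertools import accumulate

def build_cu_seqlens(seqlens):
    """Cumulative offsets: entry i is where sequence i starts in the packed batch."""
    return [0] + list(accumulate(seqlens))

seqlens_q = [5, 3, 7]                      # three sequences packed together
cu_seqlens_q = build_cu_seqlens(seqlens_q)
max_seqlen_q = max(seqlens_q)

# FlashAttention's varlen API takes these offsets as an int32 tensor on the
# GPU, for example torch.tensor(cu_seqlens_q, dtype=torch.int32, device="cuda").
```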

Stage 2: Sparse Attention Computation

from infllm_v2 import infllmv2_attn_varlen_func

# Stage 2: Sparse Attention Computation Kernel
# Inputs:
# - q_unpad: Queries tensor (token-level)
# - k_unpad, v_unpad: Keys and Values tensors (block-level)
# - cu_seqlens_q, cu_seqlens_k: Cumulative sequence lengths
# - topk_idx: Selected block indices from Stage 1
# - max_seqlen_q, max_seqlen_k: Maximum sequence lengths

out_unpad = infllmv2_attn_varlen_func(
    q_unpad, k_unpad, v_unpad,
    cu_seqlens_q, cu_seqlens_k,
    topk_idx,  # Block indices selected in Stage 1
    max_seqlen_q, max_seqlen_k
)
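
The `_unpad` suffix suggests a packed layout in which padding tokens are removed and the sequences are concatenated along the token axis. A minimal NumPy sketch of that packing follows, assuming the FlashAttention-style shape `(total_tokens, n_heads, head_dim)`; the helper name and the padded input layout are hypothetical.

```python
import numpy as np

def unpad_batch(x, seqlens):
    """Drop right-padding and concatenate sequences along the token axis.

    x:       (batch, max_seqlen, n_heads, head_dim), right-padded
    seqlens: true length of each sequence in the batch
    returns: (sum(seqlens), n_heads, head_dim)
    """
    return np.concatenate([x[b, :n] for b, n in enumerate(seqlens)], axis=0)
```

The matching cumulative offsets for such a packed tensor are the running sums of `seqlens`, starting at 0.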

Kernel Parameters

Stage 1 Parameters

  • q: Query tensor with shape (batch_size * n_heads, seqlen_q, head_dim)
  • k: Compressed key tensor representing semantic kernels
  • causal: Whether to apply causal masking
  • return_attn_probs: Whether to return attention scores (required for Top-K selection)

-…
