OpenBMB/infllmv2_cuda_impl
Language: Python
Stars: 102
Forks: 12
Open issues: 0
Created: 2025-06-05T15:30:25Z
Pushed: 2026-02-11T10:59:17Z
Default branch: main
Fork: no
Archived: no
README:
InfLLM V2 CUDA Kernel Implementation
This repository contains the optimized CUDA kernel implementation for InfLLM V2's Two-Stage Sparse Attention Mechanism. Our implementation provides high-performance kernels for both Stage 1 (Top-K Context Selection) and Stage 2 (Sparse Attention Computation), enabling Large Language Models (LLMs) to efficiently process long contexts with trainable sparse patterns.
Overview
InfLLM V2 introduces a novel two-stage approach for efficient long-context processing:
- Stage 1: Top-K Context Selection: Block scoring and aggregation using semantic kernels (the CUDA kernel computes and aggregates the scores; the Top-K selection itself is performed externally)
- Stage 2: Sparse Attention Computation: Attention calculation on selected blocks
This CUDA kernel implementation includes both stages, providing:
- Optimized relevance score computation and aggregation for Stage 1 (Top-K selection performed externally)
- Efficient sparse attention on selected blocks for Stage 2
- Significant reduction in computational costs for both forward and backward passes
Built upon FlashAttention, our kernels leverage efficient memory access patterns and optimized implementations for both stages.

Open Source Resources
*Updated: 2025-12-01*
We have released the training data and base model for InfLLM V2:
- 🔗 Training Data: https://huggingface.co/datasets/openbmb/InfLLM-V2-data-5B
- 🔗 Initial Model: https://huggingface.co/openbmb/InfLLM-V2-Short-Dense-Base
- 🔗 Final Model: https://huggingface.co/openbmb/InfLLM-V2-Long-Sparse-Base
The optimized Stage 1 implementation has been integrated into this repository.
Two-Stage Architecture
Stage 1: Top-K Context Selection
The Top-K selection stage involves three sequential steps:
1. Relevance Score Computation: Computing scores between query tokens and each semantic kernel (compressed representations of key-value blocks), followed by softmax normalization
2. Score Aggregation: Aggregating relevance scores for each semantic kernel across the query group dimension using dimension reduction (hdim16_reduce)
3. Block Selection (Post-processing): Selecting the top-K context blocks for each query token based on the aggregated scores
Note: The infllmv2_attn_stage1 kernel handles steps 1 and 2 (score computation and aggregation). Only step 3 (Top-K selection) is performed outside the kernel.
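The three steps above can be sketched in plain Python for a single query token and a single query group. This is a readability aid, not the kernel's code: the function name `stage1_reference`, the 1/sqrt(d) scaling, and the use of a sum as the group reduction are assumptions (the README only names the reduction `hdim16_reduce`).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def stage1_reference(group_queries, semantic_kernels, top_k):
    """Hypothetical reference for Stage 1, for one token position.

    group_queries:    one query vector per head in the query group
    semantic_kernels: one compressed key vector per context block
    """
    scale = 1.0 / math.sqrt(len(semantic_kernels[0]))  # assumed scaling

    # Step 1: relevance scores per head, softmax-normalized over blocks.
    per_head = [
        softmax([scale * dot(q, k) for k in semantic_kernels])
        for q in group_queries
    ]

    # Step 2: aggregate across the query group (assumed here to be a sum).
    n_blocks = len(semantic_kernels)
    aggregated = [sum(head[b] for head in per_head) for b in range(n_blocks)]

    # Step 3: Top-K selection, which the README says happens outside the kernel.
    order = sorted(range(n_blocks), key=lambda b: aggregated[b], reverse=True)
    return aggregated, sorted(order[:top_k])


scores, selected = stage1_reference(
    group_queries=[[1.0, 0.0], [0.9, 0.1]],
    semantic_kernels=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    top_k=2,
)
```

Because every head's scores are softmax-normalized, the aggregated scores in this sketch sum to the number of heads in the group; only their relative order matters for the selection.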
Stage 2: Sparse Attention Computation
The sparse attention stage performs standard attention computation, but only on the blocks selected in Stage 1:
- Support for both forward and backward passes
- Efficient memory access through block-sparse patterns
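As a rough illustration of what "attention only on the selected blocks" means, here is a minimal pure-Python sketch for one query token. The name `sparse_attention_reference` and the `block_size` parameter are hypothetical, causal masking is omitted, and the sketch says nothing about how the CUDA kernel lays out or schedules memory.

```python
import math


def sparse_attention_reference(q, keys, values, selected_blocks, block_size):
    """Softmax attention for one query over the selected key/value blocks only.

    Assumes selected_blocks is non-empty and indexes blocks of block_size tokens.
    """
    # Token positions covered by the selected blocks; all others are skipped.
    positions = [
        p
        for b in selected_blocks
        for p in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]

    scale = 1.0 / math.sqrt(len(q))  # assumed standard attention scaling
    logits = [scale * sum(x * y for x, y in zip(q, keys[p])) for p in positions]

    # Stable softmax restricted to the gathered positions.
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)

    dim = len(values[0])
    return [
        sum(w * values[p][d] for w, p in zip(weights, positions)) / total
        for d in range(dim)
    ]


# Four tokens, two blocks of two tokens; only block 1 (tokens 2 and 3) is attended.
out = sparse_attention_reference(
    q=[0.0, 0.0],
    keys=[[1.0, 0.0]] * 4,
    values=[[1.0], [2.0], [3.0], [5.0]],
    selected_blocks=[1],
    block_size=2,
)
```

The cost of this computation grows with the number of selected blocks rather than with the full context length, which is the source of the savings the README describes.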
Kernel Design Features
- Token-level Query, Block-level Key-Value: Avoids training-inference inconsistency during decoding
- Trainable Context Selection: Semantic kernels updated indirectly through token-level key vector optimization
- Selective Block Attention: Performs attention only on blocks selected in Stage 1
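To make the "updated indirectly" point concrete: if a semantic kernel is a differentiable function of the token-level keys in its block, then training the keys also trains the kernel, with no separate parameters. The README does not state which compression operator is used, so the mean pooling below, along with the name `mean_pool_blocks` and the `block_size` and `stride` parameters, is purely an illustrative assumption.

```python
def mean_pool_blocks(keys, block_size, stride):
    """Illustrative compression: average token-level keys over sliding windows.

    Each returned vector plays the role of one semantic kernel. The real
    operator used by InfLLM V2 is not specified in this README.
    """
    kernels = []
    for start in range(0, len(keys) - block_size + 1, stride):
        block = keys[start:start + block_size]
        dim = len(block[0])
        kernels.append([sum(k[d] for k in block) / len(block) for d in range(dim)])
    return kernels


keys = [[0.0], [2.0], [4.0], [6.0]]
non_overlapping = mean_pool_blocks(keys, block_size=2, stride=2)
overlapping = mean_pool_blocks(keys, block_size=2, stride=1)
```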
Kernel Implementation Details
Stage 1 Kernels
infllmv2_attn_stage1: Calculates similarity scores between query tokens and compressed key representations
- Performs score aggregation across query group dimension (hdim16_reduce)
- Returns aggregated attention scores for subsequent Top-K selection (selection performed outside the kernel)
- Support for causal masking and variable sequence lengths
Stage 2 Kernels
- infllmv2_sparse_attn_fwd: Forward pass kernel for sparse attention
- infllmv2_sparse_attn_bwd: Backward pass kernel for training
Installation
Requirements
- PyTorch 1.12+
- CUDA 11.6+ (with CUDA development toolkit)
- Python 3.7+
- Linux operating system
- Sufficient GPU memory for kernel compilation
- Ninja build system (for faster compilation)
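Before building, some of the requirements above can be checked with a short standard-library script. This is a convenience sketch, not part of the repository: the function name `check_build_prerequisites` is hypothetical, and it does not verify the PyTorch or CUDA versions, only that the Python version, operating system, and the `nvcc` and `ninja` executables look plausible.

```python
import platform
import shutil
import sys


def check_build_prerequisites():
    """Return a mapping of requirement name to whether it appears satisfied."""
    return {
        "python>=3.7": sys.version_info >= (3, 7),
        "linux": platform.system() == "Linux",
        # Presence on PATH only; version checks are left to the build itself.
        "nvcc on PATH": shutil.which("nvcc") is not None,
        "ninja on PATH": shutil.which("ninja") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_build_prerequisites().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```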
Build from Source
For Training / Inference (main branch)
    # Install with CUDA kernel compilation
    pip install -e .
Usage
CUDA Kernel API
The InfLLM V2 CUDA kernel provides the following interfaces for the two-stage sparse attention:
Stage 1: Attention Score Computation and Aggregation (feature_infer branch)
    from infllm_v2 import infllmv2_attn_stage1

    # Stage 1: Compute and aggregate relevance scores between queries and semantic kernels
    # This kernel performs:
    #   1. LSE approximation using compressed keys
    #   2. Full attention score computation
    #   3. Score aggregation across query group dimension (hdim16_reduce)
    # Top-K selection must be performed separately on the aggregated scores
    #
    # Inputs:
    #   - q: Query tensor (batch_size * n_heads, seqlen_q, head_dim)
    #   - k: Compressed key tensor representing semantic kernels
    #   - v: Placeholder tensor (not used in score computation)
    #   - cu_seqlens_q, cu_seqlens_k: Cumulative sequence lengths
    #   - max_seqlen_q, max_seqlen_k: Maximum sequence lengths
    # Returns aggregated attention scores for subsequent Top-K selection
    aggregated_scores = infllmv2_attn_stage1(
        q, k, v,
        cu_seqlens_q=cu_seqlens_q,
        cu_seqlens_k=cu_seqlens_k,
        max_seqlen_q=max_seqlen_q,
        max_seqlen_k=max_seqlen_k,
        causal=True,             # Apply causal masking
        return_attn_probs=True,  # Return attention scores
    )

    # Top-K selection should be performed on the returned aggregated scores
    # (This step is not part of the kernel)
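The `cu_seqlens_q` / `cu_seqlens_k` and `max_seqlen_q` / `max_seqlen_k` arguments describe a batch of variable-length sequences packed back to back. A minimal sketch of how such values are typically derived from per-sequence lengths follows. It assumes the FlashAttention varlen convention (a leading zero, so the list has batch_size + 1 entries), which this repository is built on but which the README does not spell out; the helper name `build_cu_seqlens` is hypothetical.

```python
from itertools import accumulate


def build_cu_seqlens(seqlens):
    """Return (cumulative sequence lengths with a leading 0, maximum length)."""
    cu_seqlens = [0] + list(accumulate(seqlens))
    return cu_seqlens, max(seqlens)


# Three packed sequences of 3, 5, and 2 tokens.
cu_seqlens_q, max_seqlen_q = build_cu_seqlens([3, 5, 2])
# Sequence i occupies rows cu_seqlens_q[i] : cu_seqlens_q[i + 1] of the packed tensor.
```

In practice these lists would be converted to integer tensors on the GPU before being passed to the kernel; the exact dtype expected by this repository should be confirmed against its source.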
Stage 2: Sparse Attention Computation
    from infllm_v2 import infllmv2_attn_varlen_func

    # Stage 2: Sparse Attention Computation Kernel
    # Inputs:
    #   - q_unpad: Queries tensor (token-level)
    #   - k_unpad, v_unpad: Keys and Values tensors (block-level)
    #   - cu_seqlens_q, cu_seqlens_k: Cumulative sequence lengths
    #   - topk_idx: Selected block indices from Stage 1
    #   - max_seqlen_q, max_seqlen_k: Maximum sequence lengths
    out_unpad = infllmv2_attn_varlen_func(
        q_unpad, k_unpad, v_unpad,
        cu_seqlens_q, cu_seqlens_k,
        topk_idx,  # Block indices selected in Stage 1
        max_seqlen_q, max_seqlen_k,
    )
Kernel Parameters
Stage 1 Parameters
- q: Query tensor with shape (batch_size * n_heads, seqlen_q, head_dim)
- k: Compressed key tensor representing semantic kernels
- causal: Whether to apply causal masking
- return_attn_probs: Whether to return attention scores (required for Top-K selection)
-…
Excerpt shown — open the source for the full document.