meituan-longcat/flashinfer
forked from flashinfer-ai/flashinfer
Description: FlashInfer: Kernel Library for LLM Serving
License: Apache-2.0
Stars: 0
Forks: 0
Open issues: 0
Created: 2026-02-05T08:45:48Z
Pushed: 2026-02-05T08:51:03Z
Default branch: main
Fork: yes
Parent repository: flashinfer-ai/flashinfer
Archived: no
README:
High-Performance GPU Kernels for Inference
FlashInfer is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.
Why FlashInfer?
- State-of-the-art Performance: Optimized kernels for prefill, decode, and mixed batching scenarios
- Multiple Backends: Automatically selects the best backend for your hardware and workload
- Modern Architecture Support: Runs on SM 7.5 (Turing) and later architectures, through Blackwell
- Low-Precision Compute: FP8 and FP4 quantization for attention, GEMM, and MoE operations
- Production-Ready: CUDAGraph and torch.compile compatible for low-latency serving
Core Features
Attention Kernels
- Paged and Ragged KV-Cache: Efficient memory management for dynamic batch serving
- Decode, Prefill, and Append: Optimized kernels for all attention phases
- MLA Attention: Native support for DeepSeek's Multi-head Latent Attention
- Cascade Attention: Memory-efficient hierarchical KV-Cache for shared prefixes
- Sparse Attention: Block-sparse and variable block-sparse patterns
- POD-Attention: Fused prefill+decode for mixed batching
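Cascade attention over shared prefixes depends on one property: attention computed separately over two disjoint KV ranges can be combined exactly, provided each partial result carries its log-sum-exp. The sketch below illustrates that math in plain Python. The function names `attention_state` and `merge_states` are hypothetical, and this is not FlashInfer's kernel code, only a small model of the idea.

```python
import math

def attention_state(q, keys, values):
    """Return (output, lse) for one query vector over one KV chunk."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    lse = m + math.log(denom)  # log-sum-exp of the scaled scores
    return out, lse

def merge_states(state_a, state_b):
    """Combine two partial attention states into the state of their union."""
    (out_a, lse_a), (out_b, lse_b) = state_a, state_b
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    total = wa + wb
    out = [(wa * a + wb * b) / total for a, b in zip(out_a, out_b)]
    return out, m + math.log(total)

# Toy data: a 2-token shared prefix and a 1-token unique suffix
q = [0.5, -1.0]
prefix_k, prefix_v = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
suffix_k, suffix_v = [[1.0, 1.0]], [[5.0, 6.0]]

merged, _ = merge_states(attention_state(q, prefix_k, prefix_v),
                         attention_state(q, suffix_k, suffix_v))
full, _ = attention_state(q, prefix_k + suffix_k, prefix_v + suffix_v)
```

Here `merged` matches `full` up to floating-point error, which is why the prefix state can be computed once and reused across requests that share it.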
GEMM & Linear Operations
- FP8 GEMM: Per-tensor and groupwise scaling
- FP4 GEMM: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs
- Grouped GEMM: Efficient batched matrix operations for LoRA and multi-expert routing
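To make the difference between per-tensor and groupwise scaling concrete, here is a minimal plain-Python sketch. It assumes the FP8 E4M3 format (largest finite value 448) and a scale defined as `amax / 448`. The helper names and the group size are hypothetical, rounding to the FP8 grid is omitted, and FlashInfer's own scale conventions are not described in this README.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_tensor_scale(values):
    """One scale for the whole tensor: maps the largest magnitude to 448."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0

def groupwise_scales(values, group_size):
    """One scale per contiguous group of group_size elements."""
    return [per_tensor_scale(values[i:i + group_size])
            for i in range(0, len(values), group_size)]

def scale_to_fp8_range(values, scales, group_size):
    """Divide each value by its group's scale (FP8 rounding is omitted)."""
    return [v / scales[i // group_size] for i, v in enumerate(values)]

# The first group holds small values, the second holds large ones
weights = [0.01, -0.02, 0.03, 0.04, 10.0, -20.0, 30.0, 40.0]
tensor_scale = per_tensor_scale(weights)
group_scales = groupwise_scales(weights, group_size=4)
scaled = scale_to_fp8_range(weights, group_scales, group_size=4)
```

With a single per-tensor scale, the small values in the first group would occupy only a tiny fraction of the representable range. Groupwise scales let each group use the full range.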
Mixture of Experts (MoE)
- Fused MoE Kernels
- Multiple Routing Methods: DeepSeek-V3, Llama-4, and standard top-k routing
- Quantized MoE: FP8 and FP4 expert weights with block-wise scaling
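As a rough illustration of standard top-k routing, the sketch below computes a softmax over router logits for one token, keeps the `k` most probable experts, and renormalizes their weights. Renormalizing after selection is one common convention and is an assumption here. The DeepSeek-V3 and Llama-4 routing variants differ and are not shown, and `top_k_route` is a hypothetical name, not a FlashInfer API.

```python
import math

def top_k_route(router_logits, k):
    """Return [(expert_id, weight), ...] for one token."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the k most probable experts
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected weights sum to 1
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

routes = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)
```

A fused MoE kernel would then dispatch the token to the selected experts and combine their outputs using these weights.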
Sampling & Decoding
- Sorting-Free Sampling: Efficient Top-K, Top-P, and Min-P without sorting
- Speculative Decoding: Chain speculative sampling support
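One way to sample from the top-p nucleus without sorting is rejection sampling with a rising pivot: draw a candidate, accept it if the probability mass strictly above it is below `top_p`, otherwise raise the pivot to the candidate's probability and redraw. The plain-Python sketch below follows that idea. It is sequential, ignores tie handling, uses hypothetical function names, and should not be read as FlashInfer's actual kernel.

```python
import random

def sample_restricted(probs, pivot, rng):
    """Inverse-transform sample from the tokens whose probability exceeds pivot."""
    mass = sum(p for p in probs if p > pivot)
    u = rng.random() * mass
    acc = 0.0
    last = None
    for i, p in enumerate(probs):
        if p > pivot:
            acc += p
            last = i
            if u < acc:
                return i
    return last  # guards against floating-point round-off

def top_p_sample(probs, top_p, rng):
    """Draw one token from the top-p nucleus without sorting probs."""
    pivot = 0.0
    while True:
        token = sample_restricted(probs, pivot, rng)
        pivot = probs[token]
        # Mass of tokens strictly more likely than the candidate
        mass_above = sum(p for p in probs if p > pivot)
        if mass_above < top_p:
            return token  # candidate lies inside the nucleus
        # Otherwise reject; the raised pivot excludes it from the next round

# Unsorted distribution; with top_p=0.7 the nucleus is tokens 1 and 3
probs = [0.15, 0.5, 0.05, 0.3]
rng = random.Random(0)
draws = [top_p_sample(probs, 0.7, rng) for _ in range(1000)]
```

Each round needs only sums over the probabilities, which map naturally to parallel reductions on a GPU. The loop terminates because the pivot strictly increases after every rejection.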
Communication
- AllReduce: Custom implementations
- Multi-Node NVLink: MNNVL support for multi-node inference
- NVSHMEM Integration: For distributed memory operations
Other Operators
- RoPE: LLaMA-style rotary position embeddings (including LLaMA 3.1)
- Normalization: RMSNorm, LayerNorm, Gemma-style fused operations
- Activations: SiLU, GELU with fused gating
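For reference, RMSNorm divides a vector by its root mean square and applies a learned per-element weight. The plain-Python sketch below shows the formula only. The `rms_norm` name and the `eps` default are assumptions for illustration, not FlashInfer's signature.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]"""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

hidden = [3.0, 4.0]
normed = rms_norm(hidden, weight=[1.0, 1.0], eps=0.0)
```

With unit weights and `eps=0.0`, the output has a mean square of 1, which is a quick sanity check for any implementation.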
GPU Support
| Architecture | Compute Capability | Example GPUs |
|--------------|-------------------|------|
| Turing | SM 7.5 | T4, RTX 20 series |
| Ampere | SM 8.0, 8.6 | A100, A10, RTX 30 series |
| Ada Lovelace | SM 8.9 | L4, L40, RTX 40 series |
| Hopper | SM 9.0 | H100, H200 |
| Blackwell | SM 10.0, 10.3 | B200, B300 |
| Blackwell | SM 12.0, 12.1 | RTX 50 series, DGX Spark, Jetson Thor |
News
Notable updates:
- [2025-10-08] Blackwell support added in v0.4.0
- [2025-03-10] Blog post "Sorting-Free GPU Kernels for LLM Sampling" explains the design of FlashInfer's sampling kernels.
Getting Started
Installation
Quickstart:
pip install flashinfer-python
Package Options:
- flashinfer-python: Core package that compiles/downloads kernels on first use
- flashinfer-cubin: Pre-compiled kernel binaries for all supported GPU architectures
- flashinfer-jit-cache: Pre-built kernel cache for specific CUDA versions
For faster initialization and offline usage, install the optional packages to have most kernels pre-compiled:
pip install flashinfer-python flashinfer-cubin
# JIT cache (replace cu129 with your CUDA version)
pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129
Verify Installation
flashinfer show-config
Basic Usage
import torch
import flashinfer

# Single decode attention
q = torch.randn(32, 128, device="cuda", dtype=torch.float16)  # [num_qo_heads, head_dim]
k = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)  # [kv_len, num_kv_heads, head_dim]
v = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)
output = flashinfer.single_decode_with_kv_cache(q, k, v)
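To clarify what the decode call computes, the sketch below is a CPU-only reference in plain Python: for each head, a softmax over the scaled dot products between the single query vector and all cached keys, followed by a weighted sum of the cached values. It assumes equal numbers of query and KV heads, a softmax scale of `1/sqrt(head_dim)`, and no positional encoding. `reference_single_decode` is a hypothetical helper, not part of FlashInfer.

```python
import math

def reference_single_decode(q, k, v):
    """q: [num_heads][head_dim]; k, v: [kv_len][num_heads][head_dim]."""
    num_heads, head_dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(head_dim)  # assumed default softmax scale
    output = []
    for h in range(num_heads):
        scores = [scale * sum(a * b for a, b in zip(q[h], k[t][h]))
                  for t in range(len(k))]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        denom = sum(weights)
        output.append([
            sum(w * v[t][h][d] for t, w in enumerate(weights)) / denom
            for d in range(head_dim)
        ])
    return output  # [num_heads][head_dim]

# One head, head_dim 2, two cached tokens with identical keys,
# so both tokens receive equal attention weight
q = [[1.0, 0.0]]
k = [[[1.0, 0.0]], [[1.0, 0.0]]]
v = [[[2.0, 0.0]], [[4.0, 2.0]]]
out = reference_single_decode(q, k, v)
```

A reference like this can serve as a correctness oracle on small inputs when comparing against GPU output, within half-precision tolerance.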
See documentation for comprehensive API reference and tutorials.
Install from Source
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
python -m pip install -v .
For development, install in editable mode:
python -m pip install --no-build-isolation -e . -v
Build optional packages:
# flashinfer-cubin
cd flashinfer-cubin
python -m build --no-isolation --wheel
python -m pip install dist/*.whl

# flashinfer-jit-cache (customize for your target GPUs)
export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f"
cd flashinfer-jit-cache
python -m build --no-isolation --wheel
python -m pip install dist/*.whl
For more details, see the Install from Source documentation.
Nightly Builds
pip install -U --pre flashinfer-python --index-url https://flashinfer.ai/whl/nightly/ --no-deps
pip install flashinfer-python  # Install dependencies from PyPI
pip install -U --pre flashinfer-cubin --index-url https://flashinfer.ai/whl/nightly/
# JIT cache (replace cu129 with your CUDA version)
pip install -U --pre flashinfer-jit-cache --index-url https://flashinfer.ai/whl/nightly/cu129
CLI Tools
FlashInfer provides several CLI commands for configuration, module management, and development:
# Verify installation and view configuration
flashinfer show-config
# List and inspect modules
flashinfer list-modules
flashinfer module-status
# Manage artifacts…
Notability
1.0/10. Routine fork by a large company.