Fork · Meituan (LongCat) · published Feb 5, 2026

meituan-longcat/flashinfer

forked from flashinfer-ai/flashinfer

Description: FlashInfer: Kernel Library for LLM Serving

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 0

Created: 2026-02-05T08:45:48Z

Pushed: 2026-02-05T08:51:03Z

Default branch: main

Fork: yes

Parent repository: flashinfer-ai/flashinfer

Archived: no

README:

High-Performance GPU Kernels for Inference

FlashInfer is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.

Why FlashInfer?

  • State-of-the-art Performance: Optimized kernels for prefill, decode, and mixed batching scenarios
  • Multiple Backends: Automatically selects the best backend for your hardware and workload
  • Modern Architecture Support: SM75 (Turing) through Blackwell
  • Low-Precision Compute: FP8 and FP4 quantization for attention, GEMM, and MoE operations
  • Production-Ready: CUDAGraph and torch.compile compatible for low-latency serving

Core Features

Attention Kernels

  • Paged and Ragged KV-Cache: Efficient memory management for dynamic batch serving
  • Decode, Prefill, and Append: Optimized kernels for all attention phases
  • MLA Attention: Native support for DeepSeek's Multi-head Latent Attention (MLA)
  • Cascade Attention: Memory-efficient hierarchical KV-Cache for shared prefixes
  • Sparse Attention: Block-sparse and variable block-sparse patterns
  • POD-Attention: Fused prefill+decode for mixed batching
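
To make the paged KV-cache idea concrete, here is a minimal plain-Python sketch of a CSR-style page table. It is not FlashInfer code. It assumes the `kv_indptr` / `kv_indices` / `kv_last_page_len` naming used by FlashInfer's paged wrappers, and the helpers `kv_length` and `locate_token` are hypothetical names chosen for illustration.

```python
def kv_length(kv_indptr, kv_last_page_len, page_size, req):
    """Number of cached tokens for request `req`."""
    num_pages = kv_indptr[req + 1] - kv_indptr[req]
    if num_pages == 0:
        return 0
    # Every page is full except possibly the last one.
    return (num_pages - 1) * page_size + kv_last_page_len[req]


def locate_token(kv_indptr, kv_indices, page_size, req, pos):
    """Map token position `pos` of request `req` to (physical_page, slot)."""
    entry = kv_indptr[req] + pos // page_size
    if entry >= kv_indptr[req + 1]:
        raise IndexError("position is past the pages owned by this request")
    return kv_indices[entry], pos % page_size


# Two requests sharing one pool of pages, 4 tokens per page (illustrative values).
page_size = 4
kv_indptr = [0, 3, 5]          # request 0 owns 3 pages, request 1 owns 2
kv_indices = [7, 2, 9, 0, 4]   # physical page ids, in logical order
kv_last_page_len = [1, 4]      # valid tokens in each request's last page
```

Because each request only stores a list of page ids, requests can grow or be evicted without moving existing K/V data.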

GEMM & Linear Operations

  • FP8 GEMM: Per-tensor and groupwise scaling
  • FP4 GEMM: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs
  • Grouped GEMM: Efficient batched matrix operations for LoRA and multi-expert routing
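
The difference between per-tensor and groupwise scaling can be sketched in a few lines of plain Python. This only shows how the scales are derived, not the actual FP8 cast or GEMM, and the helper names are hypothetical. The constant 448 is the largest finite value of the E4M3 format.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def per_tensor_scale(values):
    """One scale for the whole tensor: amax / FP8 max."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def groupwise_scales(values, group_size):
    """One scale per contiguous group of `group_size` elements."""
    return [
        per_tensor_scale(values[i:i + group_size])
        for i in range(0, len(values), group_size)
    ]


# Illustrative data: a group of small weights followed by a group with outliers.
weights = [0.02, -0.01, 0.03, 0.015, 12.0, -8.0, 4.0, 6.0]
tensor_scale = per_tensor_scale(weights)
group_scales = groupwise_scales(weights, group_size=4)
```

With a single per-tensor scale, the outlier 12.0 sets the step size for every element. With groupwise scaling, the first group gets its own much finer scale, so its small values keep more precision after quantization.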

Mixture of Experts (MoE)

  • Fused MoE Kernels
  • Multiple Routing Methods: DeepSeek-V3, Llama-4, and standard top-k routing
  • Quantized MoE: FP8 and FP4 expert weights with block-wise scaling
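
As a point of reference for the routing step, the following plain-Python sketch shows one common form of standard top-k routing for a single token: softmax over the router logits, keep the k largest, renormalize. The function name `topk_route` is hypothetical, and the model-specific DeepSeek-V3 and Llama-4 variants are not modeled here.

```python
import math


def topk_route(router_logits, k):
    """Return [(expert_id, weight), ...] for one token.

    Softmax over all experts, keep the k largest, renormalize to sum to 1.
    """
    m = max(router_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Four experts, route the token to the best two (illustrative logits).
routes = topk_route([0.1, 2.0, -1.0, 1.5], k=2)
```

A fused MoE kernel combines this selection with the expert GEMMs so that tokens are gathered, multiplied, and scattered back without separate passes over memory.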

Sampling & Decoding

  • Sorting-Free Sampling: Efficient Top-K, Top-P, and Min-P without sorting
  • Speculative Decoding: Chain speculative sampling support
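
One way to do top-p sampling without sorting is rejection sampling with a rising pivot. The sketch below is a simplified, sequential plain-Python illustration of that idea under the assumption `0 < top_p <= 1`. It is not FlashInfer's GPU kernel, and `top_p_sample` is a hypothetical name.

```python
import random


def top_p_sample(probs, top_p, rng):
    """Sample from the top-p (nucleus) set without sorting `probs`."""
    pivot = 0.0
    while True:
        # Draw from the tokens whose probability exceeds the pivot (inverse CDF).
        mass = sum(p for p in probs if p > pivot)
        u = rng.random() * mass
        acc = 0.0
        choice = None
        for i, p in enumerate(probs):
            if p > pivot:
                acc += p
                choice = i
                if acc > u:
                    break
        # Accept if the tokens strictly more likely than `choice` hold < top_p mass.
        heavier = sum(p for p in probs if p > probs[choice])
        if heavier < top_p:
            return choice
        # Reject: this token and everything less likely lie outside the nucleus.
        pivot = probs[choice]


rng = random.Random(0)
probs = [0.5, 0.3, 0.15, 0.05]
# With top_p = 0.7 the nucleus is {0, 1}, so only those two ids can appear.
samples = [top_p_sample(probs, 0.7, rng) for _ in range(2000)]
```

Each round needs only sums and comparisons over the probability vector, which map naturally onto parallel reductions, whereas a sort-based implementation must order the whole vocabulary first.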

Communication

  • AllReduce: Custom implementations
  • Multi-Node NVLink: MNNVL support for multi-node inference
  • NVSHMEM Integration: For distributed memory operations

Other Operators

  • RoPE: LLaMA-style rotary position embeddings (including LLaMA 3.1)
  • Normalization: RMSNorm, LayerNorm, Gemma-style fused operations
  • Activations: SiLU, GELU with fused gating
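
For readers unfamiliar with the normalization variants, here is a plain-Python reference for RMSNorm and the Gemma-style form, which stores the learned weight as an offset from 1. These are illustrative reference functions with hypothetical names, not the library's fused kernels.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """y_i = x_i / sqrt(mean(x^2) + eps) * weight_i"""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def gemma_rms_norm(x, weight, eps=1e-6):
    """Gemma-style variant: the effective weight is (1 + weight)."""
    return rms_norm(x, [1.0 + w for w in weight], eps)


y = rms_norm([3.0, 4.0], [1.0, 1.0])
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so it needs a single reduction per row.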

GPU Support

| Architecture | Compute Capability | Example GPUs |
|--------------|--------------------|--------------|
| Turing | SM 7.5 | T4, RTX 20 series |
| Ampere | SM 8.0, 8.6 | A100, A10, RTX 30 series |
| Ada Lovelace | SM 8.9 | L4, L40, RTX 40 series |
| Hopper | SM 9.0 | H100, H200 |
| Blackwell | SM 10.0, 10.3 | B200, B300 |
| Blackwell | SM 12.0, 12.1 | RTX 50 series, DGX Spark, Jetson Thor |
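
The table above can be turned into a small lookup for scripts that need to branch on the GPU generation. The mapping below is copied from the table only. `SUPPORTED_ARCHS` and `arch_name` are hypothetical names, and a capability missing from this dictionary is simply not listed in the table, which is not the same as being unsupported.

```python
# (major, minor) compute capability -> architecture name, copied from the table.
SUPPORTED_ARCHS = {
    (7, 5): "Turing",
    (8, 0): "Ampere",
    (8, 6): "Ampere",
    (8, 9): "Ada Lovelace",
    (9, 0): "Hopper",
    (10, 0): "Blackwell",
    (10, 3): "Blackwell",
    (12, 0): "Blackwell",
    (12, 1): "Blackwell",
}


def arch_name(capability):
    """Return the architecture name, or None if the capability is not listed."""
    return SUPPORTED_ARCHS.get(tuple(capability))
```

On a machine with PyTorch and a CUDA device, `arch_name(torch.cuda.get_device_capability())` would give the name for the current GPU.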

News

Notable updates:

  • [2025-10-08] Blackwell support added in v0.4.0
  • [2025-03-10] Blog post "Sorting-Free GPU Kernels for LLM Sampling" explains the design of the sampling kernels in FlashInfer.

Getting Started

Installation

Quickstart:

pip install flashinfer-python

Package Options:

  • flashinfer-python: Core package that compiles/downloads kernels on first use
  • flashinfer-cubin: Pre-compiled kernel binaries for all supported GPU architectures
  • flashinfer-jit-cache: Pre-built kernel cache for specific CUDA versions

For faster initialization and offline usage, install the optional packages to have most kernels pre-compiled:

pip install flashinfer-python flashinfer-cubin
# JIT cache (replace cu129 with your CUDA version)
pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129

Verify Installation

flashinfer show-config
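
As a complement to the CLI check, the installed package versions can also be read from Python using only the standard library. This is a small sketch: the distribution names come from the Package Options list, and `installed_versions` is a hypothetical helper, not part of FlashInfer.

```python
from importlib import metadata

PACKAGES = ("flashinfer-python", "flashinfer-cubin", "flashinfer-jit-cache")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

This only reports what pip has installed. It does not confirm that kernels load on the current GPU, which is what `flashinfer show-config` is for.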

Basic Usage

import torch
import flashinfer

# Single-request decode attention: one query token attends to a cached K/V sequence
q = torch.randn(32, 128, device="cuda", dtype=torch.float16)  # [num_qo_heads, head_dim]
k = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)  # [kv_len, num_kv_heads, head_dim]
v = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)  # [kv_len, num_kv_heads, head_dim]

output = flashinfer.single_decode_with_kv_cache(q, k, v)  # [num_qo_heads, head_dim]

See the documentation for a comprehensive API reference and tutorials.
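
To clarify what the `single_decode_with_kv_cache` call computes, here is a slow plain-Python reference over nested lists. It is a sketch under stated assumptions: no positional encoding, a softmax scale of 1/sqrt(head_dim), and K/V laid out as [kv_len, num_kv_heads, head_dim]. The function name `decode_attention_reference` is hypothetical.

```python
import math


def decode_attention_reference(q, k, v):
    """q: [num_qo_heads][head_dim]; k, v: [kv_len][num_kv_heads][head_dim]."""
    num_qo_heads, head_dim = len(q), len(q[0])
    num_kv_heads = len(k[0])
    group = num_qo_heads // num_kv_heads  # query heads sharing one KV head (GQA)
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for h in range(num_qo_heads):
        kv_h = h // group
        # Scaled dot product of the query with every cached key.
        scores = [scale * sum(a * b for a, b in zip(q[h], k_t[kv_h])) for k_t in k]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Softmax-weighted average of the cached values.
        out.append([
            sum(w * v_t[kv_h][d] for w, v_t in zip(weights, v)) / total
            for d in range(head_dim)
        ])
    return out
```

A reference like this is mainly useful for checking a fast kernel's output on tiny inputs, within a floating-point tolerance.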

Install from Source

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
python -m pip install -v .

For development, install in editable mode:

python -m pip install --no-build-isolation -e . -v

Build optional packages:

# flashinfer-cubin
cd flashinfer-cubin
python -m build --no-isolation --wheel
python -m pip install dist/*.whl
# flashinfer-jit-cache (customize for your target GPUs)
export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f"
cd flashinfer-jit-cache
python -m build --no-isolation --wheel
python -m pip install dist/*.whl

For more details, see the Install from Source documentation.

Nightly Builds

pip install -U --pre flashinfer-python --index-url https://flashinfer.ai/whl/nightly/ --no-deps
pip install flashinfer-python # Install dependencies from PyPI
pip install -U --pre flashinfer-cubin --index-url https://flashinfer.ai/whl/nightly/
# JIT cache (replace cu129 with your CUDA version)
pip install -U --pre flashinfer-jit-cache --index-url https://flashinfer.ai/whl/nightly/cu129

CLI Tools

FlashInfer provides several CLI commands for configuration, module management, and development:

# Verify installation and view configuration
flashinfer show-config

# List and inspect modules
flashinfer list-modules
flashinfer module-status

# Manage artifacts…

Notability

notability 1.0/10

Routine fork by a large company.