What does this repo signal mean?

Tencent Hunyuan published Tencent-Hunyuan/flex-block-attn (Jupyter Notebook). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo Tencent-Hunyuan/flex-block-attn · language Jupyter Notebook · new repo, low stars, minor traction. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

Tencent Hunyuan Repo: Tencent-Hunyuan/flex-block-attn

Captured source

GitHub · github.com/Tencent-Hunyuan/flex-block-attn

Tencent-Hunyuan/flex-block-attn repository metadata

published Nov 18, 2025 · seen Jun 5 · captured Jun 11 · http 200 · method plain

Tencent-Hunyuan/flex-block-attn

Description: flex-block-attn: an efficient block sparse attention computation library

Language: Jupyter Notebook

License: NOASSERTION

Stars: 131

Forks: 14

Open issues: 2

Created: 2025-11-18T11:39:09Z

Pushed: 2025-12-26T02:48:33Z

Default branch: main

Fork: no

Archived: no

README:

Flex Block Attn

Introduction

Flex-Block-Attn is an efficient block sparse attention computation library specifically designed for Hunyuan Video. It supports various sparse attention strategies, including STA, MoBA, and SSTA (selective and sliding tile attention, a hybrid of STA and MoBA), for both training and inference. Built upon ThunderKittens' attention demo implementation, this library delivers arbitrary sparse attention computation capabilities optimized for Hopper architecture GPUs. It features PyTorch-like mask expressions that ensure high usability while enabling efficient sparse mask generation.

![flex block attn](/assets/flex_block_attn.png)

Project Updates

[2025-11-19] We have released the Flex-Block-Attn implementation along with comprehensive benchmark results. We welcome the community to test and provide feedback!

🛠️ Quick start

Requirements

Hopper (SM90) GPUs, or other architectures with SM90 PTX ISA support
Python 3.8 and above
CUDA version 12.8 [CUDA Toolkit](https://developer.nvidia.com/cuda-12-8-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=RHEL&target_version=8)

Installation

git submodule update --init --recursive
python setup.py install

🔑 Usage

Custom kernel

from flex_block_attn import flex_block_attn_func
from benchmark.utils.utils import create_sparse_mask

# query, key, value: attention inputs prepared by the caller (head dim 128, see Notes).
# Example: a (block_size * 2) x (block_size * 2) attention mask.
# block_size can be 64, 128, 192, ...
block_size = 64
# Assumption for this example: q and k/v use the same tile size.
q_block_size = k_block_size = block_size
selected_blocks = [[0, 1], [1, 0]]
# Create the block mask from the selected blocks.
block_mask = create_sparse_mask(query, block_size, selected_blocks)

'''
sparse mask: torch.tensor([[0,1],[1,0]])
For example, if block_size=64, the shape of the torch mask is [128,128].

our sparse mask:
[[0,1],
 [1,0]]

original torch mask:
[[0,0,0...,0],[1,1,1...,1],
[0,0,0...,0],[1,1,1...,1],
... , ... ,
[0,0,0...,0],[1,1,1...,1],
[1,1,1...,1],[0,0,0...,0],
[1,1,1...,1],[0,0,0...,0],
... , ... ,
[1,1,1...,1],[0,0,0...,0],]
'''
# Compute block sparse attention with the block mask.
output = flex_block_attn_func(query, key, value, q_block_size, k_block_size, block_mask)
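
To make the relationship between the block-level mask and the element-level "original torch mask" concrete, here is a minimal, dependency-free sketch. It is only an illustration of the expansion described in the comment above; `expand_block_mask` is a hypothetical helper written for this example and is not part of flex-block-attn, which works on the block-level mask directly.

```python
def expand_block_mask(block_mask, q_block_size, k_block_size):
    """Expand a block-level 0/1 mask into a dense element-level mask.

    Hypothetical helper for illustration only (not a library function).
    """
    dense = []
    for block_row in block_mask:
        # One dense row: each block entry repeated k_block_size times.
        dense_row = [v for v in block_row for _ in range(k_block_size)]
        # Every query row inside this block shares the same pattern.
        dense.extend(list(dense_row) for _ in range(q_block_size))
    return dense


block_mask = [[0, 1], [1, 0]]
dense = expand_block_mask(block_mask, 64, 64)
print(len(dense), len(dense[0]))       # 128 128
print(dense[0][:3], dense[0][64:67])   # [0, 0, 0] [1, 1, 1]
```

With `block_size=64`, the 2x2 block mask corresponds to a 128x128 dense mask, so the block form stores 4 entries instead of 16384.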

SSTA kernel

SSTA is a novel attention mechanism that integrates the sparse attention of both MoBA and STA. It has been utilized in both the training and inference processes of Hunyuan Video. We will be open-sourcing all related code in the near future – stay tuned!
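
Since the SSTA code has not been released, the following is not the SSTA kernel. It is only a 1D toy sketch of the general idea of combining a sliding-window block pattern with a per-query-block top-k selection into a single block mask. All function names, the window size, and the scores are assumptions made for illustration.

```python
def sliding_window_blocks(n_blocks, window):
    """Block mask where query block i attends key blocks within `window` of i."""
    return [[1 if abs(i - j) <= window else 0 for j in range(n_blocks)]
            for i in range(n_blocks)]


def topk_blocks(scores, k):
    """Block mask keeping, per query block, the k highest-scoring key blocks."""
    mask = []
    for row in scores:
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        mask.append([1 if j in top else 0 for j in range(len(row))])
    return mask


def union(a, b):
    """Element-wise OR of two block masks of the same shape."""
    return [[x | y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Made-up block-to-block relevance scores (rows: query blocks, cols: key blocks).
scores = [
    [0.9, 0.1, 0.2, 0.8],
    [0.3, 0.7, 0.1, 0.6],
    [0.5, 0.2, 0.9, 0.1],
    [0.8, 0.1, 0.3, 0.4],
]
hybrid = union(sliding_window_blocks(4, 1), topk_blocks(scores, 1))
for row in hybrid:
    print(row)
```

In this toy case the last query block keeps its local neighbours and additionally picks up key block 0 through the top-k selection.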

❗️Notes

The head dim must be 128.
The q tile_size can be any multiple of 16, k/v tile_size can be any multiple of 64, with *384* recommended (as we have performed additional optimizations for this size).
The sequence length of q and kv must be divisible by their respective tile sizes.
The attention_mask only supports block-level masking. block_mask supports two shapes: [seq_len, seq_len] or [batch, head_num, seq_len, seq_len].
Within selected blocks, full attention computation is performed.
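
The shape constraints in these notes can be checked before calling the kernel. The sketch below is one possible pre-flight check. `check_attention_shapes` is a hypothetical helper written for this example, not a function provided by the library, and it encodes only the rules listed above.

```python
def check_attention_shapes(q_len, kv_len, head_dim, q_tile, kv_tile):
    """Return a list of violated constraints (an empty list means OK)."""
    problems = []
    if head_dim != 128:
        problems.append("head dim must be 128")
    q_tile_ok = q_tile > 0 and q_tile % 16 == 0
    kv_tile_ok = kv_tile > 0 and kv_tile % 64 == 0
    if not q_tile_ok:
        problems.append("q tile size must be a positive multiple of 16")
    if not kv_tile_ok:
        problems.append("k/v tile size must be a positive multiple of 64")
    # Only test divisibility when the tile size itself is valid.
    if q_tile_ok and q_len % q_tile != 0:
        problems.append("q sequence length must be divisible by q tile size")
    if kv_tile_ok and kv_len % kv_tile != 0:
        problems.append("kv sequence length must be divisible by k/v tile size")
    return problems


print(check_attention_shapes(11520, 11520, 128, 384, 384))  # []
print(check_attention_shapes(1000, 1000, 64, 384, 384))
```

The first call uses the recommended tile size of 384 with a sequence length from the benchmark section. The second violates the head-dim rule and both divisibility rules.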

🚀 Performance

We provide performance comparisons in the [benchmark](/benchmark/) folder, including measurements for mask creation time, forward/backward execution time, and GPU memory usage across the following attention types: full attention, sparse static attention, and sparse dynamic attention. We also provide all the results ([full attn](/benchmark/full/results/), [static sparse attn](/benchmark/static/swa/results/), [dynamic sparse attn](/benchmark/dynamic/random/results/)) obtained from testing on the H800 and H20 GPUs.

Sparse dynamic attention

In sparse dynamic attention tasks, the attention mask is generated randomly with a specified sparsity ratio. We report the FlexBlockAttn speedup with these parameters:

Sequence length 11520, 19200, 30720, 38400, 46080, 53760, 61440, 69120
Block_size 384
Sparse rate 0.6
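
As a rough illustration of this setup, the sketch below builds a random block mask for the first listed sequence length. It assumes that "sparse rate" means the fraction of blocks that are skipped, which the text does not state explicitly. `random_block_mask` is a hypothetical helper, not the benchmark's own mask generator.

```python
import random


def random_block_mask(seq_len, block_size, sparse_rate, seed=0):
    """Random block mask in which about `sparse_rate` of the blocks are skipped (0).

    Assumption: sparse_rate is the fraction of blocks NOT computed.
    """
    if seq_len % block_size != 0:
        raise ValueError("seq_len must be divisible by block_size")
    n = seq_len // block_size
    total = n * n
    n_keep = total - round(total * sparse_rate)
    rng = random.Random(seed)
    # Pick which flat block indices stay active.
    kept = set(rng.sample(range(total), n_keep))
    return [[1 if r * n + c in kept else 0 for c in range(n)] for r in range(n)]


mask = random_block_mask(11520, 384, 0.6)
n_blocks = len(mask)
kept_blocks = sum(map(sum, mask))
print(n_blocks, kept_blocks, kept_blocks / n_blocks ** 2)
```

For sequence length 11520 and block size 384 the mask has 30 x 30 blocks, of which 360 (40%) remain active under this interpretation. A purely random mask can leave a query block with no active key blocks, which a real benchmark would need to handle.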

The performance (combined mask creation, forward, and backward) of Flex Block Attention is better than that of mainstream sparse attention libraries.

##### H800 Dynamic Attention Speedup ![FlexBlockAttn speedup on H800](assets/h800_dynamic_time_merge.png)

##### H20 Dynamic Attention Speedup ![FlexBlockAttn speedup on H20](assets/h20_dynamic_time_merge.png)

Full attention

In full attention tasks, Flex Block Attention continues to deliver robust performance.

##### H800 Full Attention Speedup ![FlexBlockAttn full attn speedup on H800](assets/h800_full_time_merge.png)

##### H20 Full Attention Speedup ![FlexBlockAttn full attn speedup on H20](assets/h20_full_time_merge.png)

🙏 Acknowledgments

This project stands on the shoulders of the following amazing projects and resources. We extend our sincere gratitude to:

[ThunderKittens](https://github.com/HazyResearch/ThunderKittens): Our project extends its computational engine, building additional logic layers while leveraging its core calculation capabilities. The underlying computational power is entirely provided by its excellent infrastructure.
[STA(Sliding Tile Attention)](https://github.com/hao-ai-lab/FastVideo), [MoBA](https://github.com/MoonshotAI/MoBA): In our video model training, we have drawn inspiration from the innovative contributions of these projects in sparse attention computation.
[flex attention](https://github.com/meta-pytorch/attention-gym), [flash-attention](https://github.com/Dao-AILab/flash-attention), [MagiAttention](https://github.com/SandAI-org/MagiAttention), [SpargeAttn](https://github.com/thu-ml/SpargeAttn), [Triton](https://github.com/triton-lang/triton): These projects have been pivotal in advancing efficient and flexible attention mechanisms and high-performance GPU programming. Their collective work in long-sequence processing, sparsity optimization, and providing efficient computational backends has been a crucial source of inspiration, performance benchmarking, and validation foundation for our design and implementation.

We are grateful to the entire open-source community for their invaluable contributions.

🔗Citation

If you use this codebase or otherwise find our work valuable, please cite:

@misc{flex_block_attn2025,
title={flex-block-attn: an efficient block sparse attention computation library},
author={Yuanbo Peng*, Penghao Zhao*, Jiangfeng Xiong,...

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

New repo, low stars, minor traction.