What does this fork signal mean?

Meituan (LongCat) created meituan-longcat/fast-hadamard-transform as a fork of Dao-AILab/fast-hadamard-transform. This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo meituan-longcat/fast-hadamard-transform · parent Dao-AILab/fast-hadamard-transform · Routine fork from known entity. onlylabs links this event to 1 captured evidence page and 5 related fork signals.

Meituan (LongCat) Fork: meituan-longcat/fast-hadamard-transform

Captured source


GitHub · github.com/meituan-longcat/fast-hadamard-transform

meituan-longcat/fast-hadamard-transform repository metadata


published Feb 5, 2026 · seen Jun 5 · captured Jun 11 · http 200 · method plain

meituan-longcat/fast-hadamard-transform

Description: Fast Hadamard transform in CUDA, with a PyTorch interface

License: BSD-3-Clause

Stars: 0

Forks: 0

Open issues: 0

Created: 2026-02-05T09:02:14Z

Pushed: 2026-02-05T09:05:37Z

Default branch: master

Fork: yes

Parent repository: Dao-AILab/fast-hadamard-transform

Archived: no

README:

Fast Hadamard Transform in CUDA, with a PyTorch interface

Features:

Support fp32, fp16, bf16, for dimension up to 32768.
Implicitly pad with zeros if dimension is not a power of 2.

Installation

git clone https://github.com/Dao-AILab/fast-hadamard-transform.git fast-hadamard-transform
cd fast-hadamard-transform
pip install -v .

How to use

from fast_hadamard_transform import hadamard_transform

def hadamard_transform(x, scale=1.0):
    """
    Arguments:
        x: (..., dim)
        scale: float. Multiply the output by this number.
    Returns:
        out: (..., dim)

    Multiply each row of x by the Hadamard transform matrix.
    Equivalent to F.linear(x, torch.tensor(scipy.linalg.hadamard(dim))) * scale.
    If dim is not a power of 2, we implicitly pad x with zero so that dim is the next power of 2.
    """
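
To make the documented semantics concrete, the following is a minimal pure-Python sketch of what the docstring describes for a single row: zero-pad to the next power of 2, multiply by the Hadamard matrix, then apply scale. It is not the library's CUDA implementation. The name hadamard_transform_reference is hypothetical, and truncating the result back to dim is an assumption read off the documented output shape (..., dim). The butterfly loop produces the same ordering as the Sylvester construction that scipy.linalg.hadamard uses.

```python
def hadamard_transform_reference(row, scale=1.0):
    """Reference transform for one row (hypothetical helper, not part of the library)."""
    dim = len(row)
    if dim == 0:
        return []

    # Zero-pad up to the next power of 2, as the docstring describes.
    padded_dim = 1 << (dim - 1).bit_length()
    values = [float(v) for v in row] + [0.0] * (padded_dim - dim)

    # In-place butterfly passes: after log2(padded_dim) passes, `values`
    # equals H @ values for the Sylvester-ordered Hadamard matrix H.
    half = 1
    while half < padded_dim:
        for start in range(0, padded_dim, 2 * half):
            for j in range(start, start + half):
                a, b = values[j], values[j + half]
                values[j], values[j + half] = a + b, a - b
        half *= 2

    # Assumption: keep only the first `dim` entries so the output
    # has the same length as the input.
    return [v * scale for v in values[:dim]]


print(hadamard_transform_reference([1, 2, 3, 4]))  # [10.0, -2.0, -4.0, 0.0]
print(hadamard_transform_reference([1, 2, 3]))     # [6.0, 2.0, 0.0]
```

With scale set to 1/sqrt(dim) for a power-of-2 dim, the transform is orthonormal, so applying it twice returns the original row. This sketch costs O(dim log dim) additions per row and is only meant for checking small inputs by hand.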

Speed

Benchmarked on A100, for not too small batch size, compared to memcpy (torch.clone), which is a lower bound for the time taken as we'd need to read inputs from GPU memory and write output to GPU memory anyway.

| Data type | Dimension  | Time taken vs memcpy |
| --------- | ---------- | -------------------- |
| fp16/bf16 | <= 512     | 1.0x                 |
|           | 512 - 8192 | <= 1.2x              |
|           | 16384      | 1.3x                 |
|           | 32768      | 1.8x                 |
| fp32      | <= 8192    | 1.0x                 |
|           | 16384      | 1.1x                 |
|           | 32768      | 1.2x                 |

Notability

notability 3.0/10

Routine fork from known entity