What does this repo signal mean?

ByteDance (Doubao/Seed) published ByteDance-Seed/decoupleQ (Cuda). This repository signal exposes tooling, eval, infrastructure, or model-adjacent work before it may appear in a launch post. High-signal details: repo ByteDance-Seed/decoupleQ · language Cuda. onlylabs links this event to 1 captured evidence page and 6 related repo signals.

ByteDance (Doubao/Seed) Repo: ByteDance-Seed/decoupleQ

Captured source


GitHub · github.com/ByteDance-Seed/decoupleQ

ByteDance-Seed/decoupleQ repository metadata


published Apr 19, 2024 · seen 5d · captured 8h · http 200 · method plain

ByteDance-Seed/decoupleQ

Description: A quantization algorithm for LLM

Language: Cuda

License: Apache-2.0

Stars: 151

Forks: 10

Open issues: 16

Created: 2024-04-19T08:18:27Z

Pushed: 2024-06-21T03:29:25Z

Default branch: main

Fork: no

Archived: no

README:

decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points

This repository contains the code for decoupleQ. The paper is available at https://arxiv.org/abs/2404.12759

The W2 CUDA kernel is available at https://github.com/NVIDIA/TensorRT-LLM/pull/1568

Some of the code in this repo is built on top of OPTQ's repository. We sincerely thank OPTQ for their great contribution.

Please feel free to raise issues or contact chenwei.gavin@bytedance.com or guoyi.0@bytedance.com if you have any questions.

Dependencies

All of our experiments were conducted in the following environment.

datasets==1.17.0
transformers==4.35.0
torch==2.1.0
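
Since the reported numbers come from exactly these versions, it can help to check the local environment before a run. The following is a minimal sketch using only the standard library; the helper `check_pins` and the `PINNED` table are hypothetical names introduced here and are not part of the repository.

```python
from importlib import metadata

# Versions copied from the dependency list above.
PINNED = {"datasets": "1.17.0", "transformers": "4.35.0", "torch": "2.1.0"}


def check_pins(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every mismatch.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Compare only the public version, so a local build tag such as
        # "2.1.0+cu121" still counts as 2.1.0.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means the installed packages match the pinned versions.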

Reproduce

To reproduce the LLaMA results, first download the models from here and place them at `MODEL_PATH`. Then change `MODEL_PATH` in the following command to the directory where the models are placed.

bash run_llama.sh MODEL_PATH # will get result 9.49 for wikiText2
bash run_resnet.sh # will get result 64.134 for ResNet-18

In LLaMA quantization, if the reproduced results (including the runtime) are far from the reported results, consider modifying the flag `torch.backends.cuda.matmul.allow_tf32`. More details can be found here.
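
To illustrate why this flag can shift results: TF32 keeps only 10 of FP32's 23 mantissa bits for matmul inputs, so a matmul run with TF32 enabled can differ slightly from a full-FP32 one. The sketch below imitates that input rounding in pure Python. It uses truncation for simplicity, which is an assumption and not the exact hardware rounding, and the helper names `to_tf32` and `dot` are hypothetical.

```python
import struct


def to_tf32(x: float) -> float:
    """Approximate TF32 by keeping the top 10 of float32's 23 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # zero the 13 low mantissa bits (truncation, for simplicity)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(a, b, reduce_precision=False):
    """Dot product, optionally with inputs reduced to TF32-like precision."""
    if reduce_precision:
        a = [to_tf32(v) for v in a]
        b = [to_tf32(v) for v in b]
    return sum(x * y for x, y in zip(a, b))


a = [1.0 + i * 1e-4 for i in range(64)]
b = [1.0 - i * 1e-4 for i in range(64)]
full = dot(a, b)
reduced = dot(a, b, reduce_precision=True)
print("absolute difference:", abs(full - reduced))

# In PyTorch, the corresponding switch is the flag named above, for example:
#   torch.backends.cuda.matmul.allow_tf32 = False
```

Which value of the flag matches the reported numbers is not stated here, so trying both settings is a reasonable first step.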

To run the inference demo, first modify `build.sh`, changing `DCMAKE_PREFIX_PATH`, `DDECOUPLEQ_TORCH_HOME`, `DDECOUPLEQ_CUDA_HOME` and `DDECOUPLEQ_CUDNN_HOME` to match your system, and then run the following commands:

git submodule update --init
bash build.sh # requires CMake 3.21+
bash run_inference_llama.sh $LLAMA_ORG_MODEL_DIR $LLAMA_TRUE_QUANT_MODEL_PT

Results

Here is a summary of the LLaMA results (runtime for the quantization process is measured in hours):

![decoupleQ](imgs/img.png)

Updates

Here are the results for two of ByteDance's ASR models. The models are quantized to W2A16g64. In decoupleQ+sft, once the whole model has been quantized, we fine-tune the floating-point parts on a labeled dataset while freezing all of the integer parts. Task B has two sub-domains, and we report the WER for both. (Runtime is measured in hours.)

![decoupleQ](imgs/private_exp.png)
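
To make the W2A16g64 setting concrete, here is a minimal sketch of group-wise 2-bit uniform quantization, in which each group of weights is stored as an integer part plus a floating-point scale and zero point. Two assumptions apply. First, W2A16g64 is read in the common way, as 2-bit weights, 16-bit activations, and a group size of 64. Second, the sketch uses plain min/max round-to-nearest, which is not the decoupleQ method itself and only shows the integer/floating-point split that the method builds on. All function names are hypothetical.

```python
def quantize_group(weights, bits=2):
    """Round-to-nearest uniform quantization of one group.

    Returns (ints, scale, zero) such that w ~= scale * q + zero.
    """
    levels = (1 << bits) - 1  # 3 for 2-bit, so integers lie in 0..3
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero = lo
    ints = [min(levels, max(0, round((w - zero) / scale))) for w in weights]
    return ints, scale, zero


def dequantize_group(ints, scale, zero):
    """Rebuild approximate weights from the integer and floating-point parts."""
    return [scale * q + zero for q in ints]


def quantize_row(row, group_size=64, bits=2):
    """Quantize a weight row group by group; each group gets its own scale and zero."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def bits_per_weight(bits=2, group_size=64, param_bits=16):
    """Average storage cost, assuming one scale and one zero point per group."""
    return bits + 2 * param_bits / group_size


row = [i / 127 - 0.5 for i in range(128)]
groups = quantize_row(row)
print(len(groups), "groups,", bits_per_weight(), "bits per weight on average")
```

Under these assumptions, the integer part costs 2 bits per weight and the per-group scale and zero point add 0.5 bits, for 2.5 bits on average. The storage layout of the actual kernel may differ.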

Cite

If you find this work useful, please consider citing:

@article{guo2024decoupleq,
title={decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points},
author={Guo, Yi and Kong, Fanliu and Li, Xiaoyang and Li, Hui and Chen, Wei and Tian, Xiaogang and Cai, Jinping and Zhang, Yang and Liu, Shouda},
journal={arXiv preprint arXiv:2404.12759},
year={2024}
}