ByteDance-Seed/SDP4Bit
Description: official implementation of paper SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training
Language: Python
License: Apache-2.0
Stars: 44
Forks: 8
Open issues: 0
Created: 2024-11-21T06:24:38Z
Pushed: 2024-12-11T04:37:04Z
Default branch: main
Fork: no
Archived: no
README:
SDP4Bit
This repository is the official implementation of the paper [SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training](https://arxiv.org/abs/2410.15526).
Overview
SDP4Bit is a communication quantization strategy designed to reduce the overhead of large-scale distributed training in Sharded Data Parallelism (ShardedDP). By utilizing quantization on weight differences and two-level gradient smooth quantization, SDP4Bit reduces the communication of weights and gradients to nearly 4 bits without compromising accuracy.
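To illustrate why quantizing weight differences can be gentler than quantizing weights directly, here is a minimal NumPy sketch. It is not the repository's kernel: the group-wise min-max quantizer, the group size of 128, and the synthetic weight and update scales are all assumptions chosen for illustration, and a real implementation would also pack two 4-bit codes per byte.

```python
import numpy as np

def quantize_4bit(x, group_size=128):
    """Group-wise uniform (min-max) quantization to codes in [0, 15].

    Assumption: x.size is divisible by group_size.
    """
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    # Guard against a zero range so the division below stays finite.
    scale = np.maximum(hi - lo, 1e-12) / 15.0
    codes = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequantize_4bit(codes, scale, lo, shape):
    """Map 4-bit codes back to floating-point values."""
    return (codes.astype(np.float32) * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
# Synthetic data (assumption): weights at scale 2e-2, one-step update at scale 1e-4.
w_old = rng.normal(0.0, 0.02, size=4096).astype(np.float32)
w_new = w_old + rng.normal(0.0, 1e-4, size=4096).astype(np.float32)

# Option A: quantize the new weights directly.
w_direct = dequantize_4bit(*quantize_4bit(w_new), w_new.shape)

# Option B: quantize only the difference; the receiver adds it to its old copy.
diff = w_new - w_old
w_from_diff = w_old + dequantize_4bit(*quantize_4bit(diff), diff.shape)

err_direct = float(np.abs(w_direct - w_new).mean())
err_diff = float(np.abs(w_from_diff - w_new).mean())
print(f"direct: {err_direct:.2e}  via difference: {err_diff:.2e}")
```

Because the per-step difference spans a much narrower range than the weights themselves, the same 16 quantization levels resolve it far more finely, which is the intuition this sketch is meant to convey.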
Reproducing the Paper Results
Preparing the Data
For data processing, we followed the data preprocessing instructions in the official Megatron-LM repository. We use the **Pile deduplicated dataset** provided by Hugging Face as our training dataset. For the vocabulary and merges files, we used the same ones as the GPT-2 model.
Download
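    import os
    from datasets import load_dataset
    from huggingface_hub import hf_hub_download
    save_path = "."  # placeholder output directory; adjust as needed
    dataset_output_name = "pile.jsonl"  # matches --input in the preprocessing step
    train_data = load_dataset('EleutherAI/the_pile_deduplicated', split='train', num_proc=16)
    train_data.to_json(os.path.join(save_path, dataset_output_name), lines=True)  # one JSON object per line
    hf_hub_download(repo_id="gpt2", filename="merges.txt", local_dir=save_path)
    hf_hub_download(repo_id="gpt2", filename="vocab.json", local_dir=save_path)

Data Processing
We used the preprocessing script in the Megatron-LM repository on the dataset downloaded in the previous step.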
    python preprocess_data.py \
        --input pile.jsonl \
        --split train \
        --columns text \
        --output-prefix pile \
        --vocab-file vocab.json \
        --merge-file merges.txt \
        --dataset-impl mmap \
        --tokenizer-type GPT2BPETokenizer \
        --append-eod \
        --torch-backend mpi
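Before launching a long preprocessing run, it can help to confirm that the JSON Lines file really carries the column passed via `--columns`. The following is a small sanity-check sketch, not part of the repository; the helper name `count_valid_records` and the throwaway sample file are hypothetical.

```python
import json
import os
import tempfile

def count_valid_records(path, column="text"):
    """Return (valid, invalid) line counts for a JSON Lines file.

    A line is valid if it parses as a JSON object whose `column`
    value is a non-empty string. Blank lines are skipped.
    """
    valid = invalid = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                invalid += 1
                continue
            value = record.get(column) if isinstance(record, dict) else None
            if isinstance(value, str) and value:
                valid += 1
            else:
                invalid += 1
    return valid, invalid

# Demo on a tiny temporary file: one good record and two bad ones.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.jsonl")
    with open(sample, "w", encoding="utf-8") as f:
        f.write(json.dumps({"text": "hello world"}) + "\n")
        f.write(json.dumps({"meta": "no text field"}) + "\n")
        f.write("not json\n")
    demo_counts = count_valid_records(sample)

print(demo_counts)
```

On the real dataset, one would call `count_valid_records("pile.jsonl")` and expect the invalid count to be zero.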
Reproducing the Accuracy Test Results
We set all models to run for a total of 80,000 training iterations. The learning rate was configured according to the GPT-2 settings.
Note: Within each experimental group, we used the same training configuration for a given model and changed only the quantization configuration, to ensure a fair comparison.
The model configurations and detailed sample training scripts are provided below.
Model Card
GPT 125M Model

    MODEL_ARGS="
        --num-layers 12 \
        --hidden-size 768 \
        --num-attention-heads 12 \
        --seq-length 2048 \
        --max-position-embeddings 2048 \
    "
    OPTIMIZER_ARGS="
        --lr 0.0006 \
        --lr-decay-iters 70000 \
        --lr-decay-style cosine \
        --min-lr 0.00006 \
        --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --adam-eps 1e-08 \
        --weight-decay .1 \
        --lr-warmup-fraction 0.01 \
        --clip-grad 1.0 \
        --loss-scale 0 \
        --loss-scale-window 1000 \
        --hysteresis 2 \
        --min-loss-scale 1 \
        --bf16 \
        --use-distributed-optimizer \
    "
    TRAINING_ARGS="
        --tensor-model-parallel-size 1 \
        --pipeline-model-parallel-size 1 \
        --micro-batch-size 8 \
        --global-batch-size 256 \
        --train-iters 80000 \
    "
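To make the 125M optimizer settings concrete, the sketch below evaluates a cosine schedule with linear warmup using those values. It is an approximation written for illustration, assuming the warmup length equals `--lr-warmup-fraction` times `--lr-decay-iters` and that the rate stays at `--min-lr` once the decay window ends; the function name `lr_at` is hypothetical and the actual Megatron-LM scheduler may differ in details.

```python
import math

def lr_at(step, max_lr=6e-4, min_lr=6e-5, decay_iters=70000, warmup_fraction=0.01):
    """Approximate learning rate at a given iteration (illustrative only)."""
    warmup_iters = round(warmup_fraction * decay_iters)  # 700 for the values above
    if warmup_iters > 0 and step < warmup_iters:
        # Linear warmup from 0 up to max_lr.
        return max_lr * step / warmup_iters
    if step >= decay_iters:
        # Past the decay window the rate is held at the floor.
        return min_lr
    # Cosine decay from max_lr down to min_lr.
    progress = (step - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)

for step in (0, 350, 700, 35350, 70000, 80000):
    print(step, f"{lr_at(step):.6f}")
```

Under these assumptions the last 10,000 of the 80,000 iterations run at the minimum learning rate, since the decay window is shorter than the training run.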
GPT 350M Model

    MODEL_ARGS="
        --num-layers 24 \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --seq-length 2048 \
        --max-position-embeddings 2048 \
    "
    TRAINING_ARGS="
        --tensor-model-parallel-size 1 \
        --pipeline-model-parallel-size 1 \
        --micro-batch-size 8 \
        --global-batch-size 256 \
        --train-iters 80000 \
    "
    OPTIMIZER_ARGS="
        --lr 0.0003 \
        --lr-decay-iters 70000 \
        --lr-decay-style cosine \
        --min-lr 0.00003 \
        --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --adam-eps 1e-08 \
        --weight-decay .1 \
        --lr-warmup-fraction 0.01 \
        --clip-grad 1.0 \
        --loss-scale 0 \
        --loss-scale-window 1000 \
        --hysteresis 2 \
        --min-loss-scale 1 \
        --bf16 \
        --use-distributed-optimizer \
    "
GPT 1.3B Model

    MODEL_ARGS="
        --num-layers 24 \
        --hidden-size 2048 \
        --num-attention-heads 16 \
        --seq-length 2048 \
        --max-position-embeddings 2048 \
    "
    TRAINING_ARGS="
        --tensor-model-parallel-size 1 \
        --pipeline-model-parallel-size 1 \
        --micro-batch-size 2 \
        --global-batch-size 256 \
        --train-iters 80000 \
    "
    OPTIMIZER_ARGS="
        --lr 0.0002 \
        --lr-decay-iters 70000 \
        --lr-decay-style cosine \
        --min-lr 0.00002 \
        --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --adam-eps 1e-08 \
        --weight-decay .1 \
        --lr-warmup-fraction 0.01 \
        --clip-grad 1.0 \
        --loss-scale 0 \
        --loss-scale-window 1000 \
        --hysteresis 2 \
        --min-loss-scale 1 \
        --bf16 \
        --use-distributed-optimizer \
    "
GPT 6.7B Model

    MODEL_ARGS="
        --num-layers 32 \
        --hidden-size 4096 \
        --num-attention-heads 32 \
        --seq-length 2048 \
        --max-position-embeddings 2048 \
    "
    OPTIMIZER_ARGS="
        --lr 0.00012 \
        --lr-decay-iters 70000 \
        --lr-decay-style cosine \
        --min-lr 0.000012 \
        --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --adam-eps 1e-08 \
        --weight-decay .1 \
        --lr-warmup-fraction 0.01 \
        --clip-grad 1.0 \
        --loss-scale 0 \
        --loss-scale-window 1000 \
        --hysteresis 2 \
        --min-loss-scale 1 \
        --bf16 \
        --use-distributed-optimizer \
    "
    TRAINING_ARGS="
        --tensor-model-parallel-size 1 \
        --pipeline-model-parallel-size 1 \
        --micro-batch-size 2 \
        --global-batch-size 256 \
        --train-iters 80000 \
    "
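As a quick cross-check that the layer counts and hidden sizes above match the model names, the following sketch applies the common rule of thumb of roughly 12 x layers x hidden^2 parameters for the transformer blocks, plus token and position embeddings. The vocabulary size of 50257 (GPT-2 BPE, before any padding), tied output embeddings, and the omission of biases and layer norms are assumptions of this estimate, not values taken from the repository.

```python
def approx_gpt_params(num_layers, hidden_size, vocab_size=50257, max_positions=2048):
    """Rough GPT parameter count; ignores biases and layer-norm weights."""
    # Attention (4 * h^2) plus MLP (8 * h^2) per layer.
    transformer = 12 * num_layers * hidden_size ** 2
    # Token and learned position embeddings (output layer assumed tied).
    embeddings = (vocab_size + max_positions) * hidden_size
    return transformer + embeddings

# (num_layers, hidden_size) pairs from the model cards above.
configs = {
    "125M": (12, 768),
    "350M": (24, 1024),
    "1.3B": (24, 2048),
    "6.7B": (32, 4096),
}
estimates = {name: approx_gpt_params(*cfg) for name, cfg in configs.items()}

for name, count in estimates.items():
    print(f"{name}: ~{count / 1e6:,.0f}M parameters")
```

The estimates land close to the nominal sizes (about 125M, 356M, 1.32B, and 6.66B), which suggests the configurations follow the usual GPT size ladder.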
Sample Training Scripts
| Model | Baseline | qWD | TLq | TLq-HS | SDP4Bit |
|--|--|--|--|--|--|
| 125M | link | link | link | link | link |
Reproducing the Speed Test Results
We provide the detailed speed test scripts on H800 below. Please note that since an H800 node contains 8 GPUs and an A100 node contains 4 GPUs, we adjust the tensor parallel size and pipeline parallel…
Notability
Notability: 3.0/10. New repo, low stars, minor.
ByteDance (Doubao/Seed) has a repo signal matching data demand, infrastructure.