Repo: ByteDance (Doubao/Seed), published Nov 21, 2024

ByteDance-Seed/SDP4Bit

Python

Description: official implementation of paper SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training

Language: Python

License: Apache-2.0

Stars: 44

Forks: 8

Open issues: 0

Created: 2024-11-21T06:24:38Z

Pushed: 2024-12-11T04:37:04Z

Default branch: main

Fork: no

Archived: no

README:

SDP4Bit

This repository is the official implementation of the paper [SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training](https://arxiv.org/abs/2410.15526).

Overview

SDP4Bit is a communication quantization strategy designed to reduce the overhead of large-scale distributed training in Sharded Data Parallelism (ShardedDP). By utilizing quantization on weight differences and two-level gradient smooth quantization, SDP4Bit reduces the communication of weights and gradients to nearly 4 bits without compromising accuracy.
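To make the idea of quantizing weight differences more concrete, here is a minimal sketch in plain Python. It is not the code in this repository: the function names (`quantize_4bit`, `dequantize_4bit`, `communicate_weight_diff`), the symmetric per-group scaling, and the group size are all illustrative assumptions. It only shows why sending a 4-bit code of the change in weights, rather than of the weights themselves, can keep the reconstruction error small when the change is small.

```python
# Illustrative sketch only; names and the quantizer design are assumptions,
# not the SDP4Bit implementation.

def quantize_4bit(values, group_size=4):
    """Symmetric per-group quantization to signed integer codes in [-7, 7]."""
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # One scale per group; fall back to 1.0 for an all-zero group.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in group)
    return codes, scales


def dequantize_4bit(codes, scales, group_size=4):
    """Map integer codes back to floats using the per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def communicate_weight_diff(old_weights, new_weights, group_size=4):
    """Simulate sending quantized (new - old) and rebuilding new weights."""
    diff = [n - o for n, o in zip(new_weights, old_weights)]
    codes, scales = quantize_4bit(diff, group_size)
    restored = dequantize_4bit(codes, scales, group_size)
    # The receiver already holds old_weights and only adds the restored diff.
    return [o + d for o, d in zip(old_weights, restored)]
```

In this sketch each group also carries one scale value, so the average cost per element is slightly above 4 bits. The rounding error per element is bounded by half the group's scale, which shrinks as the weight update shrinks.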

Reproducing Paper Results

Preparing the Data

For data processing, we followed the data preprocessing instructions in the official Megatron-LM repository. We use the **pile deduplicated dataset** provided by Hugging Face as our training baseline. For the vocabulary and merges files, we used the same ones as the GPT-2 model.

Download

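import os
from datasets import load_dataset
from huggingface_hub import hf_hub_download
save_path = "./data"  # placeholder output directory; adjust as needed
dataset_output_name = "pile.jsonl"  # placeholder; matches --input in the preprocessing step
os.makedirs(save_path, exist_ok=True)
train_data = load_dataset('EleutherAI/the_pile_deduplicated', split='train', num_proc=16)
train_data.to_json(os.path.join(save_path, dataset_output_name), lines=True)
hf_hub_download(repo_id="gpt2", filename="merges.txt", local_dir=save_path)
hf_hub_download(repo_id="gpt2", filename="vocab.json", local_dir=save_path)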

Data Process

We used the preprocessing script in the Megatron-LM repository and the dataset downloaded in the previous step.

python preprocess_data.py \
--input pile.jsonl \
--split train \
--columns text \
--output-prefix pile \
--vocab-file vocab.json \
--merge-file merges.txt \
--dataset-impl mmap \
--tokenizer-type GPT2BPETokenizer \
--append-eod \
--torch-backend mpi
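Before launching a long preprocessing run, it can help to sanity-check the exported JSONL file. The helper below is a hypothetical sketch, not part of this repository or Megatron-LM: `check_jsonl` and its parameters are illustrative names, and it assumes each line is a JSON object whose `text` field (the column passed via `--columns text`) holds a string.

```python
import json


def check_jsonl(path, column="text", max_lines=1000):
    """Check the first max_lines records and return how many were checked.

    Raises ValueError on the first record that lacks a string `column`.
    """
    checked = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if line_no > max_lines:
                break
            record = json.loads(line)
            if not isinstance(record, dict) or not isinstance(record.get(column), str):
                raise ValueError(f"line {line_no}: missing string field '{column}'")
            checked += 1
    return checked
```

For example, `check_jsonl("pile.jsonl")` would inspect the first 1000 records of the file passed to `--input` above.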

Reproducing Accuracy Test Results

We set all models to run for a total of 80,000 training iterations. The learning rate was configured according to GPT-2 settings.

Note: For each experimental group, we used the same training configuration for the same model, with only the quantization configuration being changed to ensure a fair comparison.

The model configuration and detailed sample training scripts are provided below.

Model Card

GPT 125M Model
MODEL_ARGS="
--num-layers 12 \
--hidden-size 768 \
--num-attention-heads 12 \
--seq-length 2048 \
--max-position-embeddings 2048 \
"

OPTIMIZER_ARGS="
--lr 0.0006 \
--lr-decay-iters 70000 \
--lr-decay-style cosine \
--min-lr 0.00006 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-08 \
--weight-decay .1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--loss-scale 0 \
--loss-scale-window 1000 \
--hysteresis 2 \
--min-loss-scale 1 \
--bf16 \
--use-distributed-optimizer \
"

TRAINING_ARGS="
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--micro-batch-size 8 \
--global-batch-size 256 \
--train-iters 80000 \
"
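As a rough illustration of what the 125M optimizer flags imply, the sketch below computes a cosine schedule with linear warmup from the values above (`--lr 0.0006`, `--min-lr 0.00006`, `--lr-decay-iters 70000`, `--lr-warmup-fraction 0.01`). The function `learning_rate` is a hypothetical helper, and it assumes that the warmup length is the warmup fraction times the decay iterations and that warmup starts from zero. Consult the Megatron-LM scheduler for the exact behavior.

```python
import math


def learning_rate(step, max_lr=6e-4, min_lr=6e-5,
                  decay_iters=70000, warmup_fraction=0.01):
    """Approximate LR at a given iteration (assumed schedule, see lead-in)."""
    warmup_iters = warmup_fraction * decay_iters
    if warmup_iters > 0 and step <= warmup_iters:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / warmup_iters
    if step >= decay_iters:
        # After the decay window the rate stays at its floor.
        return min_lr
    progress = (step - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)
```

Under these assumptions the rate peaks around iteration 700, decays to the minimum by iteration 70,000, and stays there for the remaining 10,000 of the 80,000 training iterations.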
GPT 350M Model
MODEL_ARGS="
--num-layers 24 \
--hidden-size 1024 \
--num-attention-heads 16 \
--seq-length 2048 \
--max-position-embeddings 2048 \
"

TRAINING_ARGS="
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--micro-batch-size 8 \
--global-batch-size 256 \
--train-iters 80000 \
"

OPTIMIZER_ARGS="
--lr 0.0003 \
--lr-decay-iters 70000 \
--lr-decay-style cosine \
--min-lr 0.00003 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-08 \
--weight-decay .1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--loss-scale 0 \
--loss-scale-window 1000 \
--hysteresis 2 \
--min-loss-scale 1 \
--bf16 \
--use-distributed-optimizer \
"
GPT 1.3B Model
MODEL_ARGS="
--num-layers 24 \
--hidden-size 2048 \
--num-attention-heads 16 \
--seq-length 2048 \
--max-position-embeddings 2048 \
"

TRAINING_ARGS="
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--micro-batch-size 2 \
--global-batch-size 256 \
--train-iters 80000 \
"

OPTIMIZER_ARGS="
--lr 0.0002 \
--lr-decay-iters 70000 \
--lr-decay-style cosine \
--min-lr 0.00002 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-08 \
--weight-decay .1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--loss-scale 0 \
--loss-scale-window 1000 \
--hysteresis 2 \
--min-loss-scale 1 \
--bf16 \
--use-distributed-optimizer \
"
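The 1.3B configuration lowers `--micro-batch-size` to 2 while keeping `--global-batch-size` at 256, which means more micro-batches are accumulated per optimizer step. The sketch below shows that arithmetic. `num_microbatches` is a hypothetical helper, and the data-parallel sizes in the usage note are made-up examples, since the number of GPUs used is not stated in this section.

```python
def num_microbatches(global_batch_size, micro_batch_size, data_parallel_size):
    """Micro-batches each data-parallel rank processes per optimizer step."""
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global batch size must be divisible by "
            "micro batch size * data parallel size"
        )
    return global_batch_size // samples_per_pass
```

With an assumed data-parallel size of 8, `num_microbatches(256, 8, 8)` gives 4 for the smaller models and `num_microbatches(256, 2, 8)` gives 16 for the 1.3B and 6.7B settings.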
GPT 6.7B Model
MODEL_ARGS="
--num-layers 32 \
--hidden-size 4096 \
--num-attention-heads 32 \
--seq-length 2048 \
--max-position-embeddings 2048 \
"

OPTIMIZER_ARGS="
--lr 0.00012 \
--lr-decay-iters 70000 \
--lr-decay-style cosine \
--min-lr 0.000012 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--adam-eps 1e-08 \
--weight-decay .1 \
--lr-warmup-fraction 0.01 \
--clip-grad 1.0 \
--loss-scale 0 \
--loss-scale-window 1000 \
--hysteresis 2 \
--min-loss-scale 1 \
--bf16 \
--use-distributed-optimizer \
"

TRAINING_ARGS="
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--micro-batch-size 2 \
--global-batch-size 256 \
--train-iters 80000 \
"
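The model names can be cross-checked against the `MODEL_ARGS` above with a common back-of-the-envelope parameter estimate. The sketch assumes a standard GPT block with a 4x feed-forward expansion (about 12 × hidden² parameters per layer), the GPT-2 vocabulary of 50,257 tokens, and learned position embeddings. It ignores biases, layer norms, and any vocabulary padding, so the numbers are approximate. `approx_gpt_params` is an illustrative name, not a function from this repository.

```python
def approx_gpt_params(num_layers, hidden_size,
                      vocab_size=50257, max_positions=2048):
    """Rough parameter count: transformer blocks plus embeddings."""
    # Attention (~4 h^2) plus MLP with 4x expansion (~8 h^2) per layer.
    transformer = 12 * num_layers * hidden_size ** 2
    # Token and position embedding tables.
    embeddings = (vocab_size + max_positions) * hidden_size
    return transformer + embeddings


# (num_layers, hidden_size) taken from the MODEL_ARGS in this section.
configs = {"125M": (12, 768), "350M": (24, 1024),
           "1.3B": (24, 2048), "6.7B": (32, 4096)}
for name, (layers, hidden) in configs.items():
    print(f"{name}: ~{approx_gpt_params(layers, hidden) / 1e6:.0f}M parameters")
```

The estimates land close to the nominal sizes, which suggests the listed layer counts and hidden sizes match the model names.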

Sample Training Scripts

| Model | Baseline | qWD | TLq | TLq-HS | SDP4Bit |
|--|--|--|--|--|--|
| 125M | link | link | link | link | link |

Reproducing Speed Test Results

We provide the detailed speed test scripts for H800 below. Please note that since an H800 node contains 8 GPUs and an A100 node contains 4 GPUs, we adjust the tensor parallel size and pipeline parallel…

Excerpt shown — open the source for the full document.

Notability

notability 3.0/10

New repo, low stars, minor.

ByteDance (Doubao/Seed) has a repo signal matching data demand, infrastructure.