Fork · CoreWeave · published Jun 29, 2023

coreweave/coreweave-megatron

forked from NVIDIA/Megatron-LM


Description: (CoreWeave Fork) Ongoing research training transformer models at scale

License: NOASSERTION

Stars: 0

Forks: 0

Open issues: 3

Created: 2023-06-29T17:23:25Z

Pushed: 2024-07-16T04:25:35Z

Default branch: main

Fork: yes

Parent repository: NVIDIA/Megatron-LM

Archived: yes

README: Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel (tensor, sequence, and pipeline) and multi-node pre-training of transformer-based models such as GPT, BERT, and T5 using mixed precision.

Below are some of the projects where we have directly used Megatron:

Megatron is also used in NeMo Megatron, a framework to help enterprises overcome the challenges of building and training sophisticated natural language processing models with billions and trillions of parameters.

Our codebase can efficiently train very large language models (hundreds of billions of parameters) with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We use NVIDIA's Selene supercomputer for the scaling studies, with up to 3072 A100 GPUs for the largest model. Each cluster node has 8 NVIDIA 80GB A100 GPUs. The graph below shows that we scale nearly linearly up to 1 trillion parameter models running on 3072 GPUs. Note that these results are from benchmark runs and these models were not trained to convergence; however, the FLOPs are measured for end-to-end training, i.e., they include all operations, including data loading, optimization, and even logging.

![Scaling Graph](images/Achieved_petaFLOPs.png)
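To make the model-size sweep concrete, the sketch below estimates the parameter count of a GPT-style model from the quantities that are varied above. It uses the common approximation of 12·l·h² weights in the transformer layers (assuming an MLP with a 4·h intermediate size) plus token and position embeddings, and it ignores biases and layer norms. The function names and the example configuration (96 layers, hidden size 12288) are illustrative assumptions, not values taken from this README; only the vocabulary size, sequence length, and GPUs-per-node figures come from the text.

```python
def approx_gpt_params(num_layers: int, hidden_size: int,
                      vocab_size: int = 51_200, seq_len: int = 2048) -> int:
    """Rough parameter count for a GPT-style decoder (biases and layer norms ignored)."""
    # Per layer: 4*h^2 for attention (QKV + output projection) and
    # 8*h^2 for an MLP with an assumed 4*h intermediate size.
    layer_params = 12 * num_layers * hidden_size ** 2
    # Token embeddings (vocab x h) plus learned position embeddings (seq x h).
    embedding_params = (vocab_size + seq_len) * hidden_size
    return layer_params + embedding_params


def nodes_needed(num_gpus: int, gpus_per_node: int = 8) -> int:
    """Number of nodes required for a GPU count, rounding up."""
    return -(-num_gpus // gpus_per_node)


if __name__ == "__main__":
    # Hypothetical configuration chosen to land near 175B parameters.
    print(f"{approx_gpt_params(96, 12288) / 1e9:.1f}B parameters")
    print(nodes_needed(3072), "nodes for 3072 GPUs")
```

With the hypothetical 96-layer, 12288-hidden configuration this estimate comes to roughly 174.6B parameters, and 3072 GPUs at 8 per node correspond to 384 nodes.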

The following table shows both model (MFU) and hardware (HFU) FLOPs utilization for select configurations up to 1T parameters (see our paper for a description of how these are calculated). As the model size increases, we achieve better GPU utilization, and for the one trillion parameter model we reach an MFU and HFU of 56.3% and 57.0%, respectively. Note that these numbers are also measured on benchmark runs and in this case are measured using a data parallel size of one. Data parallelism introduces some overhead due to the gradient all-reduce required between the data parallel groups. However, for large transformer models, this overhead is not large and can be almost entirely eliminated by overlapping the gradient all-reduce with backpropagation.

| Model Size | Model FLOPs Utilization | Hardware FLOPs Utilization |
| :---: | :---: | :---: |
| 22B | 41.5% | 43.7% |
| 175B | 51.4% | 52.8% |
| 530B | 56.0% | 57.0% |
| 1T | 56.3% | 57.0% |
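As a rough illustration of how a utilization figure like those in the table can be derived, the sketch below counts the matrix-multiply FLOPs of one forward and backward pass of a GPT-style model and divides the achieved rate by the aggregate peak of the GPUs. This is a commonly used approximation and may differ in detail from the calculation in the paper. The function names are hypothetical, and the default peak of 312 TFLOP/s assumes A100 dense FP16/BF16 tensor-core throughput. HFU would additionally count any FLOPs spent recomputing activations, which this sketch does not model.

```python
def model_flops_per_iteration(batch_size: int, seq_len: int, num_layers: int,
                              hidden_size: int, vocab_size: int) -> int:
    """Approximate matmul FLOPs for one forward + backward pass (no recomputation)."""
    b, s, h = batch_size, seq_len, hidden_size
    # Forward pass per layer: 24*b*s*h^2 (QKV, projection, MLP) + 4*b*s^2*h (attention).
    # The backward pass costs about twice the forward pass, hence the factor of 3.
    per_layer = 72 * b * s * h ** 2 + 12 * b * s ** 2 * h
    # Output logit layer: 2*b*s*h*V forward, again times 3 for forward + backward.
    logits = 6 * b * s * h * vocab_size
    return num_layers * per_layer + logits


def flops_utilization(flops_per_iteration: float, iteration_seconds: float,
                      num_gpus: int, peak_flops_per_gpu: float = 312e12) -> float:
    """Fraction of aggregate peak FLOP/s achieved during one iteration."""
    achieved_flops_per_second = flops_per_iteration / iteration_seconds
    return achieved_flops_per_second / (num_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Hypothetical run: the batch size, model shape, timing, and GPU count
    # below are made up for illustration and are not measurements.
    flops = model_flops_per_iteration(batch_size=512, seq_len=2048,
                                      num_layers=24, hidden_size=2048,
                                      vocab_size=51_200)
    print(f"MFU: {flops_utilization(flops, iteration_seconds=2.0, num_gpus=64):.1%}")
```

Passing a measured iteration time and the real model shape gives an MFU estimate; the result is only as accurate as the FLOP-count approximation and the assumed peak throughput.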

Contents

  • [Contents](#contents)
  • [Setup](#setup)
  • [Downloading Checkpoints](#downloading-checkpoints)
  • [Usage](#usage)
  • [Training](#training)
  • [Data Preprocessing](#data-preprocessing)
  • [BERT Pretraining](#bert-pretraining)
  • [GPT Pretraining](#gpt-pretraining)
  • [T5 Pretraining](#t5-pretraining)
  • [Distributed Pretraining](#distributed-pretraining)
  • [Activation Checkpointing and Recomputation](#activation-checkpointing-and-recomputation)
  • [Distributed Optimizer](#distributed-optimizer)
  • [FlashAttention](#flashattention)
  • [GPT-3 Example](#gpt-3-example)
  • [Retro](#retro)
  • [Evaluation and Tasks](#evaluation-and-tasks)
  • [GPT Text Generation](#gpt-text-generation)
  • [GPT Evaluation](#gpt-evaluation)
  • [WikiText Perplexity Evaluation](#wikitext-perplexity-evaluation)
  • [LAMBADA Cloze Accuracy](#lambada-cloze-accuracy)
  • [BERT Task Evaluation](#bert-task-evaluation)
  • [RACE Evaluation](#race-evaluation)
  • [MNLI Evaluation](#mnli-evaluation)
  • [Datasets](#datasets)
  • [Collecting Wikipedia Training Data](#collecting-wikipedia-training-data)
  • [Collecting GPT Webtext Data](#collecting-gpt-webtext-data)
  • [Reproducibility](#reproducibility)

Setup

We strongly recommend using the latest release of NGC's PyTorch container with DGX nodes. If you can't use this for some reason, use the latest PyTorch, CUDA, NCCL, and NVIDIA APEX releases. Data preprocessing requires…
