arcee-ai/Megatron-LM
forked from NVIDIA/Megatron-LM
Description: domain adapted MOE training
Language: Python
License: NOASSERTION
Stars: 0
Forks: 0
Open issues: 2
Created: 2024-04-18T23:08:16Z
Pushed: 2024-07-01T18:47:45Z
Default branch: main
Fork: yes
Parent repository: NVIDIA/Megatron-LM
Archived: no
README:
Megatron-LM & Megatron-Core
===========================
GPU optimized techniques for training transformer models at-scale
Latest News
- [2024/1 Announcement] NVIDIA has released the core capabilities in Megatron-LM into **Megatron-Core** in this repository. Megatron-Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs. Explore the [Megatron-Core intro](#megatron-core) for more details.
Table of Contents
- [Megatron Overview](#megatron-overview)
- [Megatron-LM](#megatron-lm)
- [Megatron-Core](#megatron-core)
- [Training Speed and Scalability](#training-speed-and-scalability)
- [Setup](#setup)
- [Downloading Checkpoints](#downloading-checkpoints)
- [Usage](#usage)
- [Training](#training)
- [Data Preprocessing](#data-preprocessing)
- [BERT Pretraining](#bert-pretraining)
- [GPT Pretraining](#gpt-pretraining)
- [T5 Pretraining](#t5-pretraining)
- [Distributed Pretraining](#distributed-pretraining)
- [Activation Checkpointing and Recomputation](#activation-checkpointing-and-recomputation)
- [Distributed Optimizer](#distributed-optimizer)
- [FlashAttention](#flashattention)
- [GPT-3 Example](#gpt-3-example)
- [Retro and InstructRetro](#retro-and-instructretro)
- [Evaluation and Tasks](#evaluation-and-tasks)
- [GPT Text Generation](#gpt-text-generation)
- [GPT Evaluation](#gpt-evaluation)
- [WikiText Perplexity Evaluation](#wikitext-perplexity-evaluation)
- [LAMBADA Cloze Accuracy](#lambada-cloze-accuracy)
- [BERT Task Evaluation](#bert-task-evaluation)
- [RACE Evaluation](#race-evaluation)
- [MNLI Evaluation](#mnli-evaluation)
- [Llama-2 Inference and Finetuning](#llama-2-inference-and-finetuning)
- [Datasets](#datasets)
- [Collecting Wikipedia Training Data](#collecting-wikipedia-training-data)
- [Collecting GPT Webtext Data](#collecting-gpt-webtext-data)
- [Reproducibility](#reproducibility)
- [Projects using Megatron](#projects-using-megatron)
Megatron Overview
This repository comprises two essential components: Megatron-LM and Megatron-Core. Megatron-LM is a research-oriented framework that uses Megatron-Core for large language model (LLM) training. Megatron-Core, on the other hand, is a library of GPU optimized training techniques that comes with formal product support, including versioned APIs and regular releases. You can use Megatron-Core alongside Megatron-LM or NVIDIA NeMo Framework for an end-to-end and cloud-native solution. Alternatively, you can integrate Megatron-Core's building blocks into your preferred training framework.
Megatron-LM
First introduced in 2019, Megatron (1, 2, and 3) sparked a wave of innovation in the AI community, enabling researchers and developers to use the underpinnings of this library to further LLM advancements. Today, many of the most popular LLM developer frameworks have been inspired by or built directly on the open-source Megatron-LM library, which in turn has spurred new foundation models and AI startups. Some of the most popular LLM frameworks built on top of Megatron-LM include Colossal-AI, HuggingFace Accelerate, and NVIDIA NeMo Framework. A list of projects that have directly used Megatron can be found [here](#projects-using-megatron).
Megatron-Core
Megatron-Core is a newly released open-source PyTorch-based library that further expands the collections of GPU optimized techniques inherited from Megatron-LM with more cutting-edge innovations on system-level optimizations. It abstracts them into composable and modular APIs, allowing full flexibility for developers and model researchers to train custom transformers at-scale on NVIDIA accelerated computing infrastructure. This library is compatible with all NVIDIA Tensor Core GPUs, including FP8 acceleration support for NVIDIA Hopper architectures.
Megatron-Core offers core building blocks such as attention mechanisms, transformer blocks and layers, normalization layers, and embedding techniques. Additional functionality such as activation recomputation and distributed checkpointing is also built natively into the library. The building blocks and functionality are all GPU optimized, and can be combined with advanced parallelization strategies for optimal training speed and stability on NVIDIA Accelerated Computing Infrastructure. Another key component of the Megatron-Core library is its set of advanced model parallelism techniques (tensor, sequence, and pipeline). Currently, popular LLM architectures based on Decoder (e.g. GPT, Llama), Encoder (e.g. BERT), Encoder-Decoder (e.g. T5), Retrieval Enhanced Transformers (e.g. RETRO), and Mixture of Experts (MoE) can be built with good performance and efficiency at large compute scales. Developers can also use Megatron-Core's transformer blocks and functional APIs to build their own custom layers.
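To make the idea of tensor model parallelism more concrete, here is a minimal single-process sketch in plain Python. It is not Megatron-Core's API: the function names (`matmul`, `split_columns`, `column_parallel_matmul`) are hypothetical, and the "GPUs" are simulated by list slices. It only illustrates the general principle that a weight matrix can be split column-wise, multiplied shard by shard, and the partial outputs concatenated to recover the full result.

```python
def matmul(x, w):
    """Plain-Python matrix product: x is (m x k), w is (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_columns(w, num_shards):
    """Split w column-wise into num_shards equal blocks (one per simulated GPU)."""
    num_cols = len(w[0])
    if num_cols % num_shards != 0:
        raise ValueError("column count must be divisible by num_shards")
    width = num_cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in w]
            for i in range(num_shards)]


def column_parallel_matmul(x, w, num_shards):
    """Hypothetical illustration of a column-parallel linear layer.

    Each shard computes x @ w_shard independently; in a real system each
    product would run on its own device and the concatenation below would
    be an all-gather across devices.
    """
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    # Concatenate the partial outputs along the column dimension.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    # The sharded result should match the unsharded product.
    print(column_parallel_matmul(x, w, num_shards=2) == matmul(x, w))
```

The sketch covers only the forward pass of one column-split layer; sequence and pipeline parallelism partition the work along different axes and are not shown here.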
Training Speed and Scalability
Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage [NVIDIA's Selene…
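As a rough illustration of how hidden size and layer count determine model size, the sketch below estimates the parameter count of a standard GPT-style transformer. It is an approximation under stated assumptions, not code from this repository: the function name `gpt_param_count` is hypothetical, each layer is assumed to have a 4x-wide MLP with biases and two layer norms, learned position embeddings are assumed, and the final layer norm is ignored. Under those assumptions the number of attention heads does not change the count, since the heads partition the same hidden dimension. The defaults reuse the vocabulary size (51,200) and sequence length (2048) quoted above; the 24-layer, 2048-hidden configuration in the example is made up for illustration.

```python
def gpt_param_count(num_layers, hidden_size, vocab_size=51200, seq_length=2048):
    """Approximate parameter count for a GPT-style transformer (assumed layout).

    Per layer:
      attention QKV + output projection: 4*h^2 + 4*h
      MLP (h -> 4h -> h) with biases:    8*h^2 + 5*h
      two layer norms:                   4*h
    Embeddings:
      token embeddings (V*h) + learned position embeddings (s*h)
    """
    per_layer = 12 * hidden_size ** 2 + 13 * hidden_size
    embeddings = (vocab_size + seq_length) * hidden_size
    return num_layers * per_layer + embeddings


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to land near the 1B scale.
    total = gpt_param_count(num_layers=24, hidden_size=2048)
    print(f"{total / 1e9:.2f}B parameters")
```

Increasing `hidden_size` grows the count quadratically while `num_layers` grows it linearly, which is why both are varied together to reach a target model size.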
Excerpt shown — open the source for the full document.