What does this fork signal mean?

Arcee AI created arcee-ai/Megatron-LM-Llama-70B as a fork of NVIDIA/Megatron-LM. This fork signal points to upstream code the lab may be inspecting, patching, or building on. High-signal details: repo arcee-ai/Megatron-LM-Llama-70B · parent NVIDIA/Megatron-LM · Routine fork with negligible traction. onlylabs links this event to 1 captured evidence page and 6 related fork signals.

Arcee AI Fork: arcee-ai/Megatron-LM-Llama-70B

Captured source

GitHub/github.com/arcee-ai/Megatron-LM-Llama-70B

arcee-ai/Megatron-LM-Llama-70B repository metadata

published May 22, 2024 · seen Jun 5 · captured Jun 11 · http 200 · method plain

arcee-ai/Megatron-LM-Llama-70B

Description: Ongoing research training transformer models at scale

Language: Python

License: NOASSERTION

Stars: 2

Forks: 0

Open issues: 0

Created: 2024-05-22T17:40:11Z

Pushed: 2024-07-19T16:31:15Z

Default branch: main

Fork: yes

Parent repository: NVIDIA/Megatron-LM

Archived: no

README:

Megatron-LM & Megatron-Core

GPU-optimized techniques for training transformer models at scale

Latest News

[2024/1 Announcement] NVIDIA has released the core capabilities in Megatron-LM into **Megatron-Core** in this repository. Megatron-Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs. Explore the [Megatron-Core intro](#megatron-core) for more details.

[Megatron Overview](#megatron-overview)
[Megatron-LM](#megatron-lm)
[Megatron-Core](#megatron-core)
[Training Speed and Scalability](#training-speed-and-scalability)
[Setup](#setup)
[Downloading Checkpoints](#downloading-checkpoints)
[Usage](#usage)
[Training](#training)
[Data Preprocessing](#data-preprocessing)
[BERT Pretraining](#bert-pretraining)
[GPT Pretraining](#gpt-pretraining)
[T5 Pretraining](#t5-pretraining)
[Distributed Pretraining](#distributed-pretraining)
[Activation Checkpointing and Recomputation](#activation-checkpointing-and-recomputation)
[Distributed Optimizer](#distributed-optimizer)
[FlashAttention](#flashattention)
[GPT-3 Example](#gpt-3-example)
[Retro and InstructRetro](#retro-and-instructretro)
[Evaluation and Tasks](#evaluation-and-tasks)
[GPT Text Generation](#gpt-text-generation)
[GPT Evaluation](#gpt-evaluation)
[WikiText Perplexity Evaluation](#wikitext-perplexity-evaluation)
[LAMBADA Cloze Accuracy](#lambada-cloze-accuracy)
[BERT Task Evaluation](#bert-task-evaluation)
[RACE Evaluation](#race-evaluation)
[MNLI Evaluation](#mnli-evaluation)
[Llama-2 Inference and Finetuning](#llama-2-inference-and-finetuning)
[Datasets](#datasets)
[Collecting Wikipedia Training Data](#collecting-wikipedia-training-data)
[Collecting GPT Webtext Data](#collecting-gpt-webtext-data)
[Reproducibility](#reproducibility)
[Projects using Megatron](#projects-using-megatron)

Megatron Overview

This repository comprises two essential components: Megatron-LM and Megatron-Core. Megatron-LM is a research-oriented framework that uses Megatron-Core for large language model (LLM) training. Megatron-Core, on the other hand, is a library of GPU-optimized training techniques that comes with formal product support, including versioned APIs and regular releases. You can use Megatron-Core alongside Megatron-LM or the NVIDIA NeMo Framework for an end-to-end, cloud-native solution. Alternatively, you can integrate Megatron-Core's building blocks into your preferred training framework.

Megatron-LM

First introduced in 2019, Megatron (1, 2, and 3) sparked a wave of innovation in the AI community, enabling researchers and developers to use the underpinnings of this library to further LLM advancements. Today, many of the most popular LLM developer frameworks are inspired by or built directly on the open-source Megatron-LM library, spurring a wave of foundation models and AI startups. Some of the most popular LLM frameworks built on top of Megatron-LM include Colossal-AI, HuggingFace Accelerate, and NVIDIA NeMo Framework. A list of projects that have directly used Megatron can be found [here](#projects-using-megatron).

Megatron-Core

Megatron-Core is an open-source PyTorch-based library that contains GPU-optimized techniques and cutting-edge system-level optimizations. It abstracts them into composable and modular APIs, allowing full flexibility for developers and model researchers to train custom transformers at-scale on NVIDIA accelerated computing infrastructure. This library is compatible with all NVIDIA Tensor Core GPUs, including FP8 acceleration support for NVIDIA Hopper architectures.

Megatron-Core offers core building blocks such as attention mechanisms, transformer blocks and layers, normalization layers, and embedding techniques. Additional functionality, such as activation recomputation and distributed checkpointing, is also built natively into the library. The building blocks and functionality are all GPU-optimized and can be combined with advanced parallelization strategies for optimal training speed and stability on NVIDIA Accelerated Computing Infrastructure. Another key component of the Megatron-Core library is its set of advanced model parallelism techniques (tensor, sequence, pipeline, context, and MoE expert parallelism).
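To make the parallelism dimensions named above concrete, here is a minimal sketch of how a fixed GPU count is commonly divided among them. It assumes the usual convention that the world size is the product of the tensor, pipeline, context, and data parallel sizes; the function name `data_parallel_size` and the example sizes are hypothetical and are not taken from the Megatron-Core API.

```python
def data_parallel_size(world_size: int,
                       tensor_parallel: int,
                       pipeline_parallel: int,
                       context_parallel: int = 1) -> int:
    """Return the data-parallel size left over after model parallelism.

    Assumption: world_size = tensor * pipeline * context * data.
    This is an illustrative helper, not a Megatron-Core function.
    """
    model_parallel = tensor_parallel * pipeline_parallel * context_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by "
            f"model-parallel size {model_parallel}"
        )
    return world_size // model_parallel


# Hypothetical layout: 6144 GPUs with 8-way tensor and 8-way pipeline parallelism.
replicas = data_parallel_size(6144, tensor_parallel=8, pipeline_parallel=8)
print(replicas)
```

Under these assumed sizes, each model replica spans 64 GPUs and the remaining factor is the number of data-parallel replicas.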

Megatron-Core can be used with NVIDIA NeMo, an enterprise-grade AI platform. Alternatively, you can explore Megatron-Core with the native PyTorch training loop here. Visit Megatron-Core documentation to learn more.

Training Speed and Scalability

Our codebase is capable of efficiently training large language models (i.e., models with hundreds of billions of parameters) with both model and data parallelism. To demonstrate how our software scales with multiple GPUs and model sizes, we consider GPT models ranging from 2 billion parameters to 462 billion parameters. All models use a vocabulary size of 131,072 and a sequence length of 4096. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase batch size. Our experiments use up to 6144 H100 GPUs. We perform fine-grained overlapping of data-parallel (`--overlap-grad-reduce...

Excerpt shown — open the source for the full document.
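The scaling paragraph in the excerpt says model size is set by varying hidden size, attention heads, and layer count at a fixed vocabulary of 131,072. The sketch below shows one rough way to estimate the resulting parameter count. It assumes a standard GPT-style decoder with a 4x MLP expansion and tied input/output embeddings, and it ignores biases, layer norms, and position embeddings. The function name `approx_gpt_params` and the example configuration are hypothetical, not values from the README.

```python
def approx_gpt_params(num_layers: int,
                      hidden_size: int,
                      vocab_size: int = 131_072) -> int:
    """Rough parameter count for a GPT-style decoder (illustrative only).

    Assumptions: 4x MLP expansion, tied embeddings, no biases or layer norms.
    """
    attention = 4 * hidden_size * hidden_size  # Q, K, V, and output projections
    mlp = 8 * hidden_size * hidden_size        # h -> 4h and 4h -> h projections
    embeddings = vocab_size * hidden_size      # token embedding table
    return num_layers * (attention + mlp) + embeddings


# Hypothetical configuration: 24 layers with hidden size 2048.
params = approx_gpt_params(num_layers=24, hidden_size=2048)
print(f"{params / 1e9:.2f}B parameters")
```

In this approximation the number of attention heads does not appear, because with standard multi-head attention the heads split the hidden dimension without changing the size of the projection matrices.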

Notability

notability 1.0/10

Routine fork with negligible traction.