Fork · Together AI · published Oct 9, 2024

togethercomputer/accelerate

forked from huggingface/accelerate

Description: 🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support

License: Apache-2.0

Stars: 0

Forks: 0

Open issues: 1

Created: 2024-10-09T18:27:48Z

Pushed: 2024-10-17T20:34:11Z

Default branch: main

Fork: yes

Parent repository: huggingface/accelerate

Archived: no

README:

Run your *raw* PyTorch training script on any kind of device

Easy to integrate

🤗 Accelerate was created for PyTorch users who like to write the training loop of PyTorch models but are reluctant to write and maintain the boilerplate code needed to use multi-GPUs/TPU/fp16.

🤗 Accelerate abstracts exactly and only the boilerplate code related to multi-GPUs/TPU/fp16 and leaves the rest of your code unchanged.

Here is an example:

  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, data = accelerator.prepare(model, optimizer, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
          source = source.to(device)
          targets = targets.to(device)

          optimizer.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()

As this example shows, adding five lines to a standard PyTorch training script lets it run on any single-node or distributed setup (single CPU, single GPU, multiple GPUs, or TPUs), with or without mixed precision (fp8, fp16, bf16).

In particular, the same code runs without modification on your local machine for debugging and in your training environment.

🤗 Accelerate can also handle device placement for you (this requires a few more changes to your code, but is safer in general), which simplifies the training loop further:

  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

- device = 'cpu'
+ accelerator = Accelerator()

- model = torch.nn.Transformer().to(device)
+ model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, data = accelerator.prepare(model, optimizer, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
-         source = source.to(device)
-         targets = targets.to(device)

          optimizer.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()

Want to learn more? Check out the documentation or have a look at our examples.

Launching script

🤗 Accelerate also provides an optional CLI tool that allows you to quickly configure and test your training environment before launching the scripts. No need to remember how to use torch.distributed.run or to write a specific launcher for TPU training! On your machine(s) just run:

accelerate config

and answer the questions asked. This generates a config file that is used automatically to set the default options when you run

accelerate launch my_script.py --args_to_my_script

For instance, here is how you would run the GLUE example on the MRPC task (from the root of the repo):

accelerate launch examples/nlp_example.py

This CLI tool is optional, and you can still use python my_script.py or torchrun my_script.py at your convenience.

If you would rather not run accelerate config, you can pass the arguments you would give to torchrun directly to accelerate launch.

For example, here is how to launch on two GPUs:

accelerate launch --multi_gpu --num_processes 2 examples/nlp_example.py
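When launches are scripted, for example from a job submission wrapper, the command above can be assembled programmatically. The following is a minimal sketch using only the standard library. `build_launch_command` is a hypothetical helper, not part of Accelerate, and it uses only the `--multi_gpu` and `--num_processes` flags shown above.

```python
import shlex


def build_launch_command(script, script_args=(), num_processes=None, multi_gpu=False):
    """Assemble an `accelerate launch` argument list (hypothetical helper)."""
    cmd = ["accelerate", "launch"]
    if multi_gpu:
        cmd.append("--multi_gpu")
    if num_processes is not None:
        cmd += ["--num_processes", str(num_processes)]
    # Launcher flags go before the script path; everything after the
    # script path is passed through to the training script itself.
    cmd.append(script)
    cmd.extend(script_args)
    return cmd


cmd = build_launch_command("examples/nlp_example.py", num_processes=2, multi_gpu=True)
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would start the job; keeping the command as a list avoids shell quoting problems with script arguments.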

To learn more, check the CLI documentation available here.

Or view the configuration zoo here

Launching multi-CPU run using MPI

🤗 Here is another way to launch a multi-CPU run, this time using MPI. You can learn how to install Open MPI on this page. You can use Intel MPI or MVAPICH as well. Once you have MPI set up on your cluster, just run:

accelerate config

Answer the questions that are asked, selecting to run using multi-CPU, and answer "yes" when asked if you want accelerate to launch mpirun. Then, use accelerate launch with your script like:

accelerate launch examples/nlp_example.py

Alternatively, you can call mpirun directly, without going through the CLI:

mpirun -np 2 python examples/nlp_example.py
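When a script is started directly with mpirun, each process has to work out its own rank and the total process count. Below is a minimal sketch, assuming Open MPI, which exports `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE` to every process it starts. `mpi_rank_and_size` is a hypothetical helper for illustration only; it is not presented as how Accelerate detects its environment, and other MPI implementations use different variable names.

```python
import os


def mpi_rank_and_size(env=None):
    """Return (rank, world_size) from Open MPI's environment variables.

    Falls back to (0, 1) when the variables are absent, i.e. when the
    script was started with plain `python` rather than `mpirun`.
    """
    env = os.environ if env is None else env
    rank = int(env.get("OMPI_COMM_WORLD_RANK", 0))
    size = int(env.get("OMPI_COMM_WORLD_SIZE", 1))
    return rank, size


rank, size = mpi_rank_and_size()
if rank == 0:
    # Only one process reports, so the message is not repeated per rank.
    print(f"running with {size} process(es)")
```

Taking the environment as an optional argument keeps the helper easy to check without actually launching an MPI job.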

Launching training using DeepSpeed

🤗 Accelerate supports training on single or multiple GPUs using DeepSpeed. You don't need to change anything in your training code to use it; everything can be set through accelerate config. However, if you want to adjust DeepSpeed-related arguments from your Python script, we provide the DeepSpeedPlugin.

from accelerate import Accelerator, DeepSpeedPlugin

# deepspeed needs to know your gradient accumulation steps beforehand, so don't forget to pass it
# Remember you still need to do gradient accumulation by yourself, just like you would have done without deepspeed
deepspeed_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=2)
accelerator = Accelerator(mixed_precision='fp16', deepspeed_plugin=deepspeed_plugin)

# How to save your 🤗 Transformer?
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(save_dir, save_function=accelerator.save, state_dict=accelerator.get_state_dict(model))

Note: DeepSpeed support is experimental for now. If you run into a problem, please open an issue.
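The comment in the snippet above notes that gradient accumulation remains your responsibility. As a framework-free sketch of what that involves, the loop below fits a one-parameter model `y = w * x` with squared error, sums the scaled gradients of several micro-batches, and updates the weight only once every `accumulation_steps` batches. All names are illustrative assumptions; in a real training loop the same pattern would typically wrap `accelerator.backward(loss)`, `optimizer.step()`, and `optimizer.zero_grad()`.

```python
def grad(w, batch):
    """Mean gradient of 0.5 * (w*x - y)**2 with respect to w over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def train(batches, accumulation_steps, lr=0.1, w=0.0):
    accumulated = 0.0
    for step, batch in enumerate(batches, start=1):
        # Scale each micro-batch gradient so the accumulated sum is an average.
        accumulated += grad(w, batch) / accumulation_steps
        if step % accumulation_steps == 0:
            w -= lr * accumulated  # plays the role of optimizer.step()
            accumulated = 0.0      # plays the role of optimizer.zero_grad()
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]

# Two micro-batches of two samples, one update ...
w_accumulated = train(micro_batches, accumulation_steps=2)
# ... versus a single batch of four samples, one update.
w_full_batch = train([data], accumulation_steps=1)
```

With equally sized micro-batches, the accumulated update matches the update from one large batch, which is why the accumulation step count has to be known in advance.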

Launching your training from a notebook

🤗 Accelerate also provides a notebook_launcher
