What does this writing signal mean?

Scaleway published Distributed ML model inference. This talking signal gives public context for research themes, product direction, policy, or launch framing. High-signal details: Routine cloud provider blog post · [2209.01188] Petals: Collaborative Inference and Fine-tuning of Large Models Petals: Collaborative Inference and Fine-tuning of Large Models Alexander Borzunov HSE.... onlylabs links this event to 2 captured evidence pages and 6 related writing signals.

Scaleway Writing: Distributed ML model inference

published Dec 19, 2024 · seen 5d · captured 3d · http 200 · method plain

Distributed ML model inference Deploy • Valentin Macheret • 19/12/24 • 7 min read

As of 2024, some Large Language Models (LLMs) have hundreds of billions of parameters. To run them you need GPUs, big GPUs. With BLOOM-176B or OPT-175B you will need roughly 3 Nvidia A100s, costing $15K each. A paper published in March 2023 introduces Petals, a framework for collaborative inference (i.e., processing a real user's request). It concludes that the bill can be drastically reduced. Let's see how: we first introduce how training actually works, then inference for a big model, then explain how Petals improves on that. We'll conclude with the system's limitations.

Distributed training

Distributed Machine Learning is required to achieve high performance when training large models on very large datasets (on the order of terabytes of data). Broadly, it means training the model across multiple instances (each hosting one or more GPUs) rather than on a single instance. The data is split across the instances, and each of them trains the model on its portion of the data. The resulting models are then combined to produce a final model. This approach can significantly reduce the time it takes to train large models.

Tools like Hivemind, Horovod and BigDL are a good fit for this purpose. This approach overcomes many single-instance hardware limitations. Concerning privacy, distributed-ML design patterns like Private Aggregation of Teacher Ensembles (PATE) are built to keep data as private as possible during training.
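The data-parallel scheme described above (split the data, train one copy per instance, combine the results) can be sketched in a few lines. This is a minimal illustration, assuming a toy linear model and plain parameter averaging as the combination step; the function names and the synthetic dataset are hypothetical and do not come from Hivemind, Horovod or BigDL.

```python
import random

def make_dataset(n, seed=0):
    """Synthetic data for y = 3x + 1 plus a little noise (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.01)))
    return data

def split(data, n_workers):
    """Give each worker (instance) its own shard of the data."""
    return [data[i::n_workers] for i in range(n_workers)]

def train_local(shard, w, b, lr=0.1, epochs=50):
    """Plain SGD on one shard, starting from the shared parameters."""
    for _ in range(epochs):
        for x, y in shard:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def average(models):
    """Combine the per-worker models by averaging their parameters."""
    n = len(models)
    return sum(m[0] for m in models) / n, sum(m[1] for m in models) / n

def distributed_fit(data, n_workers=4, rounds=3):
    w, b = 0.0, 0.0
    shards = split(data, n_workers)
    for _ in range(rounds):
        # In a real cluster these calls would run in parallel on separate instances.
        local_models = [train_local(shard, w, b) for shard in shards]
        w, b = average(local_models)
    return w, b

w, b = distributed_fit(make_dataset(400))
print(f"w={w:.2f}, b={b:.2f}")  # close to the true values 3 and 1
```

Real frameworks typically average gradients at every step rather than whole models at the end of a round, but the split, train, combine structure is the same.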

Inference time!

Once the model is trained and fine-tuned, it is post-processed to prepare it for inference. This is not an easy task.

The usual pipeline looks like this:

Optional: Distill your model into a new one (a kind of transfer learning) to drastically reduce the number of parameters, at the cost of some model accuracy

Optional: Quantize the model by replacing 4-byte floats with 1-byte (int8) or 2-byte (int16) values. It can be a good move, but it highly depends on your model (typically you won't quantize a trigonometric function...). If done completely, a model can run on a CPU or NPU instead of a GPU

Deploy an API or a Triton server that will receive input data / queries in front of the model

Put the model on an instance and pray that it fits into RAM (or VRAM). To illustrate, the BLOOM LLM "fits" into 352GB of RAM

It doesn't fit? Well, you have to split your model into smaller pieces (layer by layer), offload them to RAM or SSD and load them dynamically when needed. A performance loss is guaranteed

Because your model is in high demand (congrats!), you need to scale your pool of GPU-equipped instances to handle the traffic, which leads to orchestration and load balancing… You know the drill.
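To make the quantization and "does it fit?" steps above more concrete, here is a small sketch. It assumes that only the weights are counted (no activations or caches), that 1 GB is 10^9 bytes, and that the GPU has 80 GB of memory; the symmetric per-tensor int8 scheme and all function names are illustrative choices, not the method of any particular library. Under these assumptions, the 352GB figure quoted for BLOOM is consistent with 176 billion parameters stored on 2 bytes each.

```python
import math

def footprint_gb(n_params, bytes_per_param):
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

def gpus_needed(size_gb, gpu_memory_gb=80):
    """How many GPUs of a given size are needed just to hold the weights."""
    return math.ceil(size_gb / gpu_memory_gb)

def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in q_weights]

BLOOM_PARAMS = 176e9
fp16_gb = footprint_gb(BLOOM_PARAMS, 2)   # 2 bytes per parameter
int8_gb = footprint_gb(BLOOM_PARAMS, 1)   # 1 byte per parameter
print(fp16_gb, gpus_needed(fp16_gb))      # 352.0 5
print(int8_gb, gpus_needed(int8_gb))      # 176.0 3

weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error <= scale / 2 + 1e-12)
```

The rounding error per weight is bounded by half the scale, which is why quantization works well when values are spread fairly evenly and poorly when a few outliers stretch the range.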

Even if you optimize your multi-billion-parameter model (as is the trend with LLMs), each instance has to load and run the entire model, which is quite expensive. Some sources argue that ChatGPT costs about $700K a day to run.

Distributed… inference?

In March 2023, BigScience released a paper on arXiv that reads like a relevant proof of concept.

In short: Petals is a protocol that connects a swarm of heterogeneous GPU instances from multiple origins so that they share the inference of a large language model (the POC uses the BLOOM model, about 176 billion parameters, quite similar to GPT-3). Each instance runs a single layer of the model (in practice, a small group of consecutive layers) for forward and backward passes, instead of the whole model:

When an inference request is received, the instance running the first layer applies the forward pass

The result is sent to the instance hosting the second layer

And so on until the last layer of the model

The final output is the response payload to the input request
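The steps above amount to a pipeline in which activations hop from one instance to the next. The sketch below is a toy illustration under strong assumptions: each "server" is a local Python object, each layer is a simple element-wise affine transform followed by ReLU, and the class and function names are hypothetical. It does not use the actual Petals API, where each hop would be a network call.

```python
class LayerServer:
    """Hypothetical stand-in for one instance hosting one layer of the model."""

    def __init__(self, name, weight, bias):
        self.name = name
        self.weight = weight
        self.bias = bias

    def forward(self, activations):
        # Toy layer: element-wise affine transform followed by ReLU.
        return [max(0.0, self.weight * a + self.bias) for a in activations]

def run_inference(servers, inputs):
    """Pass the activations through every server, in layer order."""
    activations = inputs
    route = []
    for server in servers:
        # In a real swarm this would be a network call to a remote instance.
        activations = server.forward(activations)
        route.append(server.name)
    return activations, route

swarm = [
    LayerServer("server-layer-1", weight=2.0, bias=0.0),
    LayerServer("server-layer-2", weight=1.0, bias=-1.0),
    LayerServer("server-layer-3", weight=0.5, bias=0.5),
]

output, route = run_inference(swarm, [1.0, -2.0, 0.25])
print(route)   # ['server-layer-1', 'server-layer-2', 'server-layer-3']
print(output)  # [1.0, 0.5, 0.5]
```

No single server ever holds the full model: each one only needs enough memory for its own layer plus the activations passing through.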

Its achievements fall into 3 areas: the nature of the instances, the possibility of layer-based load balancing, and memory-efficient fine-tuning.

Heterogeneous fleet

In the paper, the instances themselves are modest and varied: some are "only" equipped with gaming GPUs such as the RTX 2080 or RTX 3060, others with the more powerful A100.

For comparison, if you ran the whole BLOOM model (352GB) on a single A100 (80GB GPU) with offloading, it would take 5.5 seconds to compute one inference.

An interesting benchmark was run on a set of 14 small servers in realistic conditions (firewalls, a heterogeneous network spanning 2 continents). It shows good performance: up to x6 in single-batch (processing one request), x15 in batch-1, and equivalent in batch-64 (i.e., 64 requests at the same time).

Layer-level load balancing

Because the model's layers can be assigned to a wide variety of instances, fine-grained workload scaling becomes possible. In case of high inference demand, it can be useful to increase the number of instances serving compute-intensive layers and decrease the number serving low-compute layers. This can be done by adding new instances or by rebalancing how many instances serve each layer (i.e., keeping the same total number of instances).

In reality, though, the computing power of each instance must also be taken into account, so that each one hosts the layer it suits best.
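One way to picture this kind of balancing is a greedy placement: take the strongest instance first and give it to whichever layer is currently the bottleneck, accounting for how expensive each layer is. The sketch below is only an assumption about how such a policy could look, not the algorithm used by Petals; the speed and cost numbers are made-up units and the instance names are hypothetical.

```python
def assign_servers(layer_costs, server_speeds):
    """Greedy placement: strongest server first, onto the current bottleneck layer.

    layer_costs[i] is the relative compute cost of layer i.
    server_speeds maps a server name to its relative compute speed.
    """
    n_layers = len(layer_costs)
    capacity = [0.0] * n_layers
    placement = {i: [] for i in range(n_layers)}

    def pressure(i):
        # Effective throughput of layer i; ties go to the costlier layer.
        return (capacity[i] / layer_costs[i], -layer_costs[i])

    for name, speed in sorted(server_speeds.items(), key=lambda kv: -kv[1]):
        bottleneck = min(range(n_layers), key=pressure)
        capacity[bottleneck] += speed
        placement[bottleneck].append(name)
    return placement, capacity

def pipeline_throughput(layer_costs, capacity):
    """A pipeline is only as fast as its slowest layer."""
    return min(c / cost for c, cost in zip(capacity, layer_costs))

layer_costs = [1.0, 2.0, 1.0]          # the middle layer is twice as expensive
server_speeds = {
    "a100": 10.0,
    "rtx3060-a": 3.0,
    "rtx3060-b": 3.0,
    "rtx2080-a": 2.0,
    "rtx2080-b": 2.0,
}

placement, capacity = assign_servers(layer_costs, server_speeds)
throughput = pipeline_throughput(layer_costs, capacity)
print(placement)
print(throughput)  # 5.0
```

Here the A100 ends up alone on the expensive layer while the gaming GPUs pair up on the cheaper ones, so all three layers sustain the same effective throughput.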

Scalable fine-tuning

Another interesting property of this model distribution is the ability to fine-tune 1) without loading the whole model and 2) simultaneously.

As a reminder: fine-tuning means specializing a "general-purpose trained model" (also called a foundation model) by training it on an ad-hoc dataset. For instance: fine-tuning an animal-detector model to turn it into a more precise cat-breed-detector model.

In these circumstances we only want to manipulate specific layers, which is what Petals does. 1) Each ML engineer can handle a specific set of layers that fits in their local RAM, compute a forward pass on a new dataset, then ask the other layer instances to run backpropagation (without changing their original pretrained weights). 2) Each fine-tuning backpropagation result is versioned and stored on its respective instance, so many ML engineers can work on their own tasks without interfering with each other. For such a system, storing data through IPFS could be an efficient way to reduce data redundancy.
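The key idea, backpropagating through a shared layer without ever changing its weights, can be shown with a toy example. The sketch below assumes a single frozen "remote" layer that multiplies its input by a fixed weight, and two engineers who each train one small local parameter added to the input. For simplicity each engineer keeps that parameter on their own side, which is an assumption of this sketch rather than something stated above. All class names, targets and learning rates are hypothetical.

```python
class FrozenRemoteLayer:
    """Stand-in for a shared, pretrained layer whose weight is never updated."""

    def __init__(self, weight):
        self.weight = weight

    def forward(self, x):
        return self.weight * x

    def backward(self, grad_output):
        # Returns the gradient with respect to the input only.
        # The pretrained weight itself is left untouched.
        return self.weight * grad_output

class Engineer:
    """Owns one small trainable parameter, added to the input of the frozen layer."""

    def __init__(self, name):
        self.name = name
        self.offset = 0.0  # the only parameter being fine-tuned

    def train_step(self, remote, x, target, lr=0.05):
        out = remote.forward(x + self.offset)     # forward through the shared layer
        grad_out = 2.0 * (out - target)           # d(loss)/d(out) for squared error
        grad_in = remote.backward(grad_out)       # backprop through frozen weights
        self.offset -= lr * grad_in               # update the local parameter only
        return (out - target) ** 2

remote = FrozenRemoteLayer(weight=2.0)
alice, bob = Engineer("alice"), Engineer("bob")

for _ in range(50):
    for x in (0.0, 1.0, -1.0):
        # Two different tasks, trained in an interleaved fashion on the same layer.
        alice.train_step(remote, x, target=2.0 * (x + 0.5))
        bob.train_step(remote, x, target=2.0 * (x - 1.0))

print(round(alice.offset, 4), round(bob.offset, 4), remote.weight)
```

Both engineers converge to their own solution (an offset near 0.5 for one task and near -1.0 for the other) while the shared weight stays at its pretrained value, which is why they do not interfere with each other.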

Limitations & challenges

Petals is an interesting step, but it comes with many challenges:

There is no incentive or reward system for participants who share their hardware for model inference.

As is, the model provided by Petals is already quantized, which halves the memory footprint of the original model, but it still requires instances with at least a…

Additional captured pages

[2209.01188] Petals: Collaborative Inference and Fine-tuning of Large Models. Alexander Borzunov (HSE University, Yandex), Dmitry Baranchuk (Yandex), Tim Dettmers (University of Washington), Max Ryabinin (HSE…

Notability

notability 3.0/10

Routine cloud provider blog post