Infrastructures for LLMs in the cloud


Build • Fabien da Silva • 21/02/24 • 6 min read

Open source makes LLMs (large language models) available to everyone. There are plenty of options available, especially for inference. You’ve probably heard of Hugging Face’s inference library, but there’s also OpenLLM, vLLM, and many others.

The main challenge, especially if you’re a company like Mistral AI building new LLMs, is that the architecture of your LLM has to be supported by all these solutions. They need to be able to talk to Hugging Face, to NVIDIA, to OpenLLM and so on.

The second challenge is the cost, especially that of the infrastructures you’ll need to scale your LLM deployment. For that, you have different solutions:

Choosing the right GPUs (your LLM has to fit in their memory)

Choosing the right techniques:

Quantization, which involves reducing the number of bits used to store each of the model's variables, so you can fit larger models into smaller memory constraints. It’s a trade-off between the two, as quantization can affect both the accuracy of your model and its performance results.

Fine-tuning methods, like parameter-efficient fine-tuning (PEFT). With PEFT methods, you can significantly decrease computational and memory costs by fine-tuning only a small number of (extra) model parameters instead of all the model's parameters. And you can combine PEFT methods with quantization too.

Then you have to decide whether to host it yourself, use a PaaS solution, or use ready-to-use API endpoints, like those OpenAI offers.
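To make the two techniques above more concrete, here is a minimal sketch in plain Python. It is an illustration only, not how any particular library implements them: the quantization scheme shown is simple symmetric int8 rounding, the PEFT method shown is a LoRA-style low-rank adapter (one of several PEFT methods, chosen here as an assumption), and the weight values and the 4096 x 4096 layer size are hypothetical.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantized]


def lora_trainable_params(d_in, d_out, rank):
    """A LoRA-style adapter adds two small matrices (d_in x rank and
    rank x d_out) next to a frozen d_in x d_out weight matrix."""
    return rank * (d_in + d_out)


# Hypothetical weights: each one shrinks from 16 or 32 bits to 8 bits.
weights = [0.82, -1.27, 0.031, 0.5, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", q)
print("largest rounding error:", max_error)

# Hypothetical layer size: only the adapter is trained, not the full matrix.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"Adapter trains {lora} parameters instead of {full} "
      f"({100 * lora / full:.2f}%)")
```

The rounding error is the accuracy cost of quantization: small weights can collapse to zero once they are stored in 8 bits. The adapter count shows why PEFT reduces memory and compute, since only a fraction of a percent of the layer's parameters receive gradients.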

Choosing the right GPU

NVIDIA H100 - L4 - L40S

The above is Scaleway’s offering, but similar offerings are currently being installed with most major cloud providers.

H100 PCIe 5 is the flagship, NVIDIA’s most powerful GPU. It has interesting features like the Transformer Engine, a library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada Lovelace GPUs, to provide better performance with lower memory utilization in both training and inference. Storing variables in 8 bits instead of 16 means you can put twice as many of them in memory, and NVIDIA’s library makes these changes simpler to apply. A large amount of memory and high memory bandwidth are also key: the faster you can load data from memory, the faster your GPU will be.

L4 PCIe 4 can be seen as the modern successor to the NVIDIA T4, intended for inference, but perfectly capable of training smaller LLMs. Like H100, it can manage new data formats like FP8. It has less memory bandwidth than H100, which may create bottlenecks for certain use cases, like handling large batches of images for training computer vision models. In these cases, you may not see a significant performance boost compared with the previous Ampere architecture, for example. And unlike H100, this one has video and 3D rendering capabilities, so if you want to generate a synthetic dataset for computer vision with Blender, you can use this GPU.

L40S PCIe 4 is what NVIDIA considers the new A100. It has twice as much memory as the L4, along with higher memory bandwidth and stronger compute performance. For generative AI, according to NVIDIA, once you optimize your code with FP8 and so on, 8 L40S PCIe 4 without NVLink can perform as well as a DGX with 8x A100 40 GB connected by NVLink, so that’s a powerful and interesting GPU.
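As a rough way to reason about which of these GPUs a model fits on, the sketch below estimates the memory taken by the weights at different precisions. It rests on several assumptions that are not from the article: memory sizes of 24 GB (L4), 48 GB (L40S) and 80 GB (H100), a hypothetical 30-billion-parameter model, and an arbitrary 20% headroom for activations and caches. Real requirements depend on batch size, context length and the serving stack.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

# Assumed memory per GPU in GB (not stated in the article for L4 and L40S).
GPU_MEMORY_GB = {"L4": 24, "L40S": 48, "H100": 80}


def weight_memory_gb(n_params, precision):
    """Memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def gpus_that_fit(n_params, precision, headroom=0.2):
    """GPUs whose memory holds the weights plus an assumed headroom
    fraction for activations, caches and framework overhead."""
    needed = weight_memory_gb(n_params, precision) * (1 + headroom)
    return [name for name, mem in GPU_MEMORY_GB.items() if mem >= needed]


n_params = 30e9  # hypothetical 30B-parameter model
for precision in ("fp16", "fp8"):
    size = weight_memory_gb(n_params, precision)
    print(f"{precision}: {size:.0f} GB of weights, "
          f"fits on {gpus_that_fit(n_params, precision)}")
```

Under these assumptions, halving the precision from 16 to 8 bits is what moves the hypothetical model from needing an H100 to also fitting on an L40S, which is the practical effect of the FP8 support described above.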

Using GPU Instances tip 1: Docker images

NGC Catalog

When using GPUs, use Docker images, and start with those offered by NVIDIA, which are free. This way, the code is portable, so it can run on your laptop, on a workstation, on a GPU Instance (whatever the cloud provider, so without lock-in), or on a powerful cluster (either with SLURM as the orchestrator if you’re in the HPC/AI world, or Kubernetes if you’re more in the AI/MLOps world).

NVIDIA updates these images regularly, so you can benefit from performance improvements and bug/security fixes. A100 performance is significantly better now than it was at launch, and the same will apply to H100, L4 and so on. There are also a lot of time-saving features that let you build POCs more quickly, such as frameworks and tools like NeMo and Riva, which are available through the NGC catalog (above).

This also opens up the possibility of using an AI Enterprise license on supported hardware configurations (something typically only seen in cloud provider offers), which gives you support if you run into bugs or performance issues, and even offers help from NVIDIA data scientists to debug your code and get the best performance out of all of this software. And of course, you can choose your favorite platform, such as PyTorch, TensorFlow, Jupyter Lab and so on.

Using Scaleway GPU Instances

In Scaleway’s GPU OS 12, we’ve already pre-installed Docker, so you can use it right out of the box. I’m often asked why there’s no CUDA or Anaconda preinstalled. The reason is that this software should be executed inside the containers, because not all users have the same requirements. They may not be using the same versions of CUDA, cuDNN or PyTorch, for example. It’s also easier to use a container built by NVIDIA than to install and maintain a Python AI environment. Furthermore, doing so makes it easier to reproduce results across your training runs or experiments.

So basically, you do this:

    ## Connect to a GPU instance like H100-1-80G
    ssh root@

    ## Pull the Nvidia Pytorch docker image (or other image, with the software versions you need)
    docker pull nvcr.io/nvidia/pytorch:24.01-py3
    [...]

    ## Launch the Pytorch container
    docker run --rm -it --runtime=nvidia \
      -p 8888:8888 \
      -p 6006:6006 \
      -v /root/my-data/:/workspace \
      -v /scratch/:/workspace/scratch \
      nvcr.io/nvidia/pytorch:24.01-py3

    ## You can work with Jupyter Lab, Pytorch etc…

It’s much easier than trying to install your environment locally.
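Once inside the container, a quick sanity check is to confirm that the GPU is actually visible. The sketch below is one possible way to do that with only the Python standard library, by calling `nvidia-smi`; the helper name `list_gpus` is hypothetical, and it assumes `nvidia-smi` is on the PATH inside the container, which may not hold for every image.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, total memory' string per visible GPU,
    or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
if gpus:
    for gpu in gpus:
        print("Found GPU:", gpu)
else:
    print("No GPU visible: check that the container was started "
          "with --runtime=nvidia")
```

If the list comes back empty, the most likely cause is that the container was launched without the NVIDIA runtime option shown in the `docker run` command above.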

Using GPU Instances tip 2: MIG

MIG

One distinctive feature of the H100 is MIG, or multi-instance GPU, which allows you to split your GPU into up to seven instances. This is really useful when you want to optimize your workload. If you have workloads that don’t fully saturate a GPU, this is a nice way to run several of them side by side and maximize GPU utilization. It works with standalone VMs, and works really easily in Kubernetes. You…