How to Optimize LLM Performance with NVIDIA H100 GPUs from Scaleway, by Golem.ai
Build • Kevin Baude • 03/11/23 • 8 min read
(Article originally published on Golem.ai's blog, here . Reproduced with permission. Thanks, guys!)
Why did Golem.ai decide to experiment with LLMs? Because we believe that Symbolic and Generative AI approaches are complementary, as explained in our previous blog post.
Why choose LLaMA 2?
Facebook parent company Meta caused a stir in the artificial intelligence (AI) industry last July with the launch of LLaMA 2, an open-source large language model (LLM) designed to challenge the restrictive practices of its major technological competitors.
Unlike the AI systems launched by Google, OpenAI and others (such as Apple with Apple GPT?), which are tightly guarded proprietary models, Meta is releasing LLaMA 2's code and model weights free of charge so that researchers worldwide can build on and improve the technology!
Here are the five key features of Llama 2:
Llama 2 outperforms other open-source LLMs in benchmarks for reasoning, coding proficiency, and knowledge tests.
The model was trained on almost twice the data of version 1, totaling 2 trillion tokens. Additionally, the training included over 1 million new human annotations and fine-tuning for chat completions.
The model comes in three sizes: 7, 13, and 70 billion parameters.
Llama 2 supports longer context lengths, up to 4096 tokens.
Version 2 has a more permissive license than version 1, allowing for commercial use.
First tests in “practicing & learning mode” with Replicate.com
To test Llama 2, we first opted for Replicate.com. It lets you pay as you go, with nothing to install on your own hardware. A perfect first approach for experimenting!
However, for reasons of privacy and economic intelligence, we then moved to a second approach, explained below.
Why Llama-2 on in-house GPUs after Replicate.com?
At Golem.ai, trusted artificial intelligence, data sovereignty, security and control of the entire value chain are what matter most.
For this reason, we decided to carry out our own benchmark using the hardware resources of our French cloud provider, Scaleway.
Although the LLaMA 2 model is free to download and use, self-hosting it requires GPU power for timely processing.
LLaMA 2 is available in three sizes: 7 billion, 13 billion and 70 billion parameters.
For this demonstration, we will use the 70B model to obtain the most relevant answers!
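To give a feel for why model size drives the GPU requirement, here is a rough, weight-only estimate of the memory each LLaMA 2 size needs at a few nominal bit widths. This is a back-of-the-envelope sketch, not a measurement: it ignores the KV cache, activations and runtime overhead, and real quantized formats use slightly more bits per weight than the nominal figure. The 81559 MiB limit is the value nvidia-smi reports for the H100 later in this article.

```python
# Weight-only memory estimate (assumption: memory = parameters * bits / 8).
# KV cache, activations and runtime overhead are deliberately ignored.
GIB = 1024 ** 3

# Memory reported by nvidia-smi for the H100 in this article, converted to GiB.
H100_MEMORY_GIB = 81559 / 1024


def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    for bits in (16, 8, 4):
        need = weight_memory_gib(n_params, bits)
        verdict = "fits" if need < H100_MEMORY_GIB else "does not fit"
        print(f"LLaMA 2 {name} at {bits}-bit: ~{need:.1f} GiB ({verdict} on one H100)")
```

Under these assumptions, the 70B weights at 16 bits exceed a single 80 GB card, which is one reason quantized model files are attractive for a single-GPU setup.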
Setting up the in-house GPUs solution
Let’s get to the heart of the matter 😈
Integration overview
The user provides one input: a prompt (i.e. a question).
An API call submits the prompt to the llama.cpp server, and the response generated by Llama 2 is returned and displayed to the user.
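The flow above can be sketched in a few lines of Python. This is a minimal illustration, not Golem.ai's actual integration code. It assumes the llama.cpp example server is running locally on port 8080 and exposes a `/completion` endpoint that accepts a JSON body with `prompt` and `n_predict` and returns the generated text in a `content` field; check the server's README for the version you build, as these details may change. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the llama.cpp example server listens locally on port 8080.
SERVER_URL = "http://127.0.0.1:8080/completion"


def build_request(prompt: str, n_predict: int = 128,
                  url: str = SERVER_URL) -> urllib.request.Request:
    """Build the POST request that submits the user's prompt."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body: bytes) -> str:
    """Pull the generated text out of the server's JSON response."""
    return json.loads(response_body)["content"]


def ask(prompt: str) -> str:
    """Send the prompt to the server and return the model's answer."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return extract_text(resp.read())
```

Once the server is up, calling `ask("What is symbolic AI?")` would return the generated answer as a string, ready to display to the user.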
We run the Llama 2 70B model using llama.cpp, with NVIDIA CUDA 12.2 on Ubuntu 22.04.
llama.cpp is a C/C++ library for inference of LLaMA/LLaMA 2 models.
For this scenario, we will use the H100-1-80G, the most powerful instance in the GPU range of our French cloud provider, Scaleway.
The implementation steps are described below.
We estimate that setup takes around 30 minutes, provided you meet our OS, software and hardware requirements and don’t encounter any errors 🙂
A. Installation
Two possible paths:
1/ The official way to run LLaMA 2 is via Meta's examples repository and recipes repository.
Benefit: Official method
Disadvantages: Developed in Python (slow to run and excessive RAM consumption); H100 GPU acceleration may not work.
2/ Run LLaMA 2 via the llama.cpp interface
Benefits: This pure C/C++ implementation is faster and more efficient than its official Python counterpart, and supports GPU acceleration via CUDA and Apple's Metal. This considerably speeds up inference on the CPU and makes GPU inference more efficient.
Disadvantages: Community-based method (unofficial)
We've opted to use llama.cpp for this implementation.
B. Available models
Check the model type:
https://www.hardware-corner.net/llm-database/Llama-2/
/!\ llama.cpp no longer supports GGML models:
https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML
⇒ Use GGUF models instead:
https://huggingface.co/TheBloke/Llama-2-70B-chat-GGUF (based on Llama-2-70b-chat-hf)
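Since a GGML file will no longer load, it can be worth checking a downloaded file before pointing llama.cpp at it. The sketch below relies on the fact that GGUF files begin with the four ASCII bytes `GGUF` followed by a 32-bit version number; it assumes a little-endian file, and the function names are hypothetical helpers rather than part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def gguf_version(header: bytes):
    """Return the GGUF version from the first 8 bytes, or None if not GGUF.

    Assumption: the version field is a little-endian unsigned 32-bit integer.
    """
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version


def check_model_file(path: str):
    """Read the header of a model file on disk and return its GGUF version."""
    with open(path, "rb") as f:
        return gguf_version(f.read(8))
```

A return value of `None` suggests the file is in an older format (or is not a model file at all) and should be replaced with a GGUF build.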
C. Installation process
1/ Install the NVIDIA CUDA driver (if not already installed on your GPU machine)
To start, let's install NVIDIA CUDA on Ubuntu 22.04. The steps below follow the CUDA Toolkit download page provided by NVIDIA.
$ wget <cuda-keyring_1.1-1_all.deb URL from NVIDIA's CUDA Toolkit download page>
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
$ sudo apt-get update
$ sudo apt-get -y install cuda-toolkit-12-3
After installing, the system should be restarted. This ensures that the NVIDIA driver kernel modules are properly loaded with dkms. Then, you should be able to see your GPUs by using nvidia-smi.
$ sudo shutdown -r now
llm@h100-ftw:~$ nvidia-smi
Wed Oct  4 08:44:54 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               On  | 00000000:01:00.0 Off |                    0 |
| N/A   42C    P0              51W / 350W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+----------------------…
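If you want to run the same check from a script rather than reading the table by eye, nvidia-smi can emit machine-readable CSV through its `--query-gpu` and `--format` options. The following is a small sketch under that assumption; the function names and the returned dictionary layout are illustrative choices, not part of any NVIDIA tooling.

```python
import subprocess

# nvidia-smi query returning one CSV line per GPU, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line: str) -> dict:
    """Parse one CSV line such as 'NVIDIA H100 PCIe, 81559, 4'."""
    name, total, used = [field.strip() for field in line.split(",")]
    return {
        "name": name,
        "memory_total_mib": int(total),
        "memory_used_mib": int(used),
    }


def list_gpus() -> list:
    """Run nvidia-smi and return one dictionary per visible GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]
```

An empty list, or an exception from `subprocess.run`, would indicate that the driver is not loaded or that no GPU is visible, which is the situation the reboot above is meant to avoid.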