
Ollama: from zero to running an LLM in less than 2 minutes!


Build • Diego Coy • 08/03/24 • 6 min read

The Artificial Intelligence (AI) field has been fueled by open source initiatives from the very beginning, from data sets used in model training, frameworks, libraries and tooling, to the models themselves. These initiatives have been mainly focused on empowering researchers and a subset of experts to facilitate their investigations and further contributions. Fortunately for the rest of us – technologists without deep AI knowledge – there has been a wave of open source initiatives aimed at allowing us to leverage the new opportunities AI brings along.

Data sourcing, model training, the underlying math, and the associated coding are done by a group of dedicated folks who then release models, such as Mixtral or Stable Diffusion. Another group of people then builds wrappers around them, turning their use into a matter of basic configuration and, in some cases nowadays, just executing a command. This allows us to focus on leveraging the models and simply building on top of them. That’s the power of open source!

One such tool that has caught the internet’s attention lately is Ollama, a cross-platform tool that can be installed on a wide variety of hardware, including Scaleway’s H100 PCIe GPU Instances.

A model

Before diving into Ollama and how to use it, it is important to spend a few moments getting a basic understanding of what a machine learning (ML) model is. This is by no means intended to be an extensive explanation of AI concepts, but rather a quick guide that will help you find your way to experiencing the power of AI firsthand.

A model is a representation of the patterns an algorithm has learned from analyzing the data it was fed during its training phase. The goal of a machine learning model is to make predictions or decisions based on new, unseen data.

A model is generally trained by feeding it labeled or unlabeled data, depending on the type of model, and then adjusting the model's parameters to minimize the error between the expected and actual outputs.

By the end of its training phase, a model will be distributed either as a set of multiple files (the patterns it learned plus configuration files) or as a single file containing everything it needs. The number of files will vary depending on the frameworks and tools used to train it, and most tools today can adapt to the different ways a model is distributed.

The size of a machine learning model refers to the number of parameters that make up the model, which in turn determines its file size: from a couple of megabytes to tens of gigabytes. A larger model size typically means more complex patterns can be learned from the training data. However, larger models also require more computational resources, which can negatively affect their practicality.
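As a rough rule of thumb (an assumption added here, not something stated in the article), a model's file size is about its parameter count multiplied by the number of bytes used to store each parameter. The sketch below uses a hypothetical helper, estimate_size_gb, and ignores metadata and runtime memory overhead:

```shell
# Rough estimate: size in GB ~ parameters (in billions) x bits per parameter / 8.
# estimate_size_gb is a hypothetical helper, not part of any tool.
estimate_size_gb() {
    params_billions=$1
    bits_per_param=$2
    awk -v p="$params_billions" -v b="$bits_per_param" \
        'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# Example: a 7B model stored with 16-bit parameters.
estimate_size_gb 7 16    # prints 14.0
```

With the same formula, a 70B model stored at 4 bits per parameter comes to about 35 GB, while the same model at 16 bits per parameter would need about 140 GB.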

Some of the most popular models today have been trained on huge amounts of data, with Llama 2 reaching 70 billion parameters (also known as Llama 2 70B). However, a model’s size doesn’t always correlate with its accuracy. Some models with fewer parameters, such as Mixtral 8x7B, claim they can outperform Llama 2 70B in certain benchmarks.

Choosing the right tool for the job

When a smaller model can easily handle the task at hand, choosing it over a larger one that may require far more hardware resources can be the most effective optimization you can make without having to tweak anything else.

Depending on your needs, using the 7B version of Llama 2 instead of the 70B one can cover your use case and provide faster results. In other cases, you may realize that using a model that has been trained to do a smaller set of specific tasks instead of the more generic ones can be the best call. Making the right choice will require some time trying out different alternatives, but this can yield improved inference times and hardware resource optimization.

Choosing the right tool can also be seen from the hardware angle: should I use a regular x86-64 CPU, an ARM CPU, a gaming GPU, or a Tensor Core GPU…? That is a conversation worth having in a separate blog post. For this scenario, we’ll stick with Scaleway’s H100 PCIe GPU Instances, as they run the fastest hardware of their kind.

Ollama: up and running in less than 2 minutes

Finally, we get to talk about Ollama, an open source tool that will hide away all the technical details and complexity of finding and downloading the right LLM, setting it up, and then deploying it. Ollama was originally developed with the idea of enabling people to run LLMs locally on their own computers, but that doesn’t mean you can’t use it on an H100 PCIe GPU Instance; in fact, its vast amount of resources will supercharge your experience.

After creating your H100 PCIe GPU Instance, getting Ollama up and running is just a matter of running the installation command:

curl -fsSL https://ollama.com/install.sh | sh

Note: It’s always a good idea to take a moment to review installation scripts before execution. Although convenient, running scripts directly from the internet without understanding their content can pose significant security risks.
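One way to follow that advice, sketched here as an assumption about a reasonable workflow rather than an official procedure, is to download the script to a file, read it, and only then execute it. The function name fetch_installer and the file name install.sh are illustrative choices:

```shell
# Save the installer to a local file instead of piping it straight into sh.
# fetch_installer is a hypothetical helper; the URL is the one used above.
fetch_installer() {
    curl -fsSL https://ollama.com/install.sh -o "${1:-install.sh}"
}

# Typical use, one step at a time:
#   fetch_installer install.sh   # download only, nothing is executed
#   less install.sh              # read the script
#   sh install.sh                # run it once you are satisfied
```

Nothing runs until the last step, so the script can be inspected (or compared against a previously reviewed copy) first.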

Once installed, you can run any of the supported models available in their model library, for instance, Mixtral from Mistral AI – a model licensed under Apache 2.0 that is on par with, and sometimes outperforms, GPT-3.5 – by using the run command:

ollama run mixtral

Ollama will begin the download process, which will take just a few seconds thanks to the 10 Gb/s networking capabilities of Scaleway’s H100 PCIe GPU Instances. Once it is done, you will be able to interact with the model through your terminal. You can start a conversation with the model, as you would with ChatGPT or any other AI chatbot; the difference here is that your conversation is kept locally within your H100 PCIe GPU Instance, and only you have access to the prompts you submit and the answers you receive.
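Besides the interactive session, the ollama CLI also accepts a prompt as an extra argument to run, which is handy for scripting. A minimal sketch, assuming Ollama is installed on the Instance; ask_model is a hypothetical wrapper name:

```shell
# Send a single prompt to a model and print the answer, without opening
# the interactive session. ask_model is a hypothetical wrapper.
ask_model() {
    model=$1
    prompt=$2
    ollama run "$model" "$prompt"
}

# Example (the first call downloads the model if it is not present yet):
#   ask_model mixtral "Explain what a GPU Instance is in one sentence."
```

Because the model name is a parameter, the same prompt can be sent to several models in turn to compare their answers.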

The Ollama model library showcases a variety of models you can try out on your own, helping you decide what’s the best tool for the job, be it a compact model, such as TinyLlama, or a big one, like Llama 2. There are multimodal models, like LLaVA, which include a vision encoder that enables both visual and language understanding. There are also models made for specific use cases, such as Code Llama,…
