Replicate Blog

A comprehensive guide to running Llama 2 locally

Posted July 22, 2023 by zeke

We’ve been talking a lot about how to run and fine-tune Llama 2 on Replicate. But you can also run Llama locally on your M1/M2 Mac, on Windows, on Linux, or even on your phone. The cool thing about running Llama 2 locally is that you don’t even need an internet connection.

Here’s an example using a locally-running Llama 2 to whip up a website about why llamas are cool:

It’s only been a couple days since Llama 2 was released, but there are already a handful of techniques for running it locally. In this blog post we’ll cover three open-source tools you can use to run Llama 2 on your own devices:

Llama.cpp (Mac/Windows/Linux)

Ollama (Mac)

MLC LLM (iOS/Android)

Llama.cpp (Mac/Windows/Linux)

Llama.cpp is a port of Llama in C/C++, which makes it possible to run Llama 2 locally using 4-bit integer quantization on Macs. It also supports Linux and Windows.
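To get a feel for why 4-bit quantization matters on a laptop, here's a rough back-of-the-envelope sketch of weight storage. `estimate_gb` is a hypothetical helper, and the 4.5 bits-per-weight figure is an assumption (4-bit values plus a small per-block scale); actual file sizes vary by quantization format.

```shell
#!/bin/bash
# estimate_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Prints an approximate weight size in GB (10^9 bytes).
# Hypothetical helper for illustration; not part of llama.cpp.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

echo "13B at 16 bits per weight:  $(estimate_gb 13 16) GB"
# 4.5 is an assumed average, not a measured value
echo "13B at 4.5 bits per weight: $(estimate_gb 13 4.5) GB"
```

Under these assumptions a 13B model shrinks from roughly 26 GB to roughly 7.3 GB, which is why it can fit in memory on a consumer machine.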

Here’s a one-liner you can use to install it on your M1/M2 Mac:

curl -L "https://replicate.fyi/install-llama-cpp" | bash

Here’s what that one-liner does:

#!/bin/bash

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build it. `LLAMA_METAL=1` allows the computation to be executed on the GPU
LLAMA_METAL=1 make

# Download model
export MODEL=llama-2-13b-chat.ggmlv3.q4_0.bin
if [ ! -f models/${MODEL} ]; then
    curl -L "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/${MODEL}" -o models/${MODEL}
fi

# Set prompt
PROMPT="Hello! How are you?"

# Run in interactive mode
./main -m ./models/llama-2-13b-chat.ggmlv3.q4_0.bin \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  -t 8
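The script hard-codes `-t 8`, which sets the number of CPU threads. If your machine has a different core count, a sketch like the following could pick a value for you. `detect_threads` is a hypothetical helper, and using one thread per logical core is an assumption, not a tuned recommendation.

```shell
#!/bin/bash
# detect_threads: print the number of logical CPUs, falling back to 8
# (the value hard-coded in the install script) if it cannot be determined.
# Hypothetical helper for illustration.
detect_threads() {
  local n
  n="$(getconf _NPROCESSORS_ONLN 2>/dev/null)" \
    || n="$(sysctl -n hw.ncpu 2>/dev/null)" \
    || n=8
  case "$n" in
    ''|0|*[!0-9]*) n=8 ;;   # not a plain positive integer: use the fallback
  esac
  echo "$n"
}

THREADS="$(detect_threads)"
echo "Would pass: -t ${THREADS}"
```

You could then swap `-t 8` for `-t "${THREADS}"` in the run command and compare generation speed against the default.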

Here’s a one-liner for your Intel Mac or Linux machine. It’s the same as above, but we’re not including the LLAMA_METAL=1 flag:

curl -L "https://replicate.fyi/install-llama-cpp-cpu" | bash
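The install script shells out to `git`, `make`, and `curl`, so it can help to confirm they're on your `PATH` before piping anything into `bash`. Here's a minimal sketch; `missing_tools` is a hypothetical helper, and the list of tools is an assumption based on the commands in the script (a C/C++ compiler is also needed for `make` to succeed).

```shell
#!/bin/bash
# missing_tools TOOL...: print each named tool that is not found on PATH.
# Hypothetical helper for illustration.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

missing="$(missing_tools git make curl)"
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line
  echo "Install these before running the one-liner:" $missing
else
  echo "git, make, and curl are all available."
fi
```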

Here’s a one-liner to run on Windows using WSL:

curl -L "https://replicate.fyi/windows-install-llama-cpp" | bash
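The three one-liners differ only in which script they fetch, so the choice can be sketched as a small function. `pick_installer` is a hypothetical helper; the WSL check (looking for "microsoft" in the kernel release string) is an assumption about how WSL reports itself, and the URLs are the ones given in this post.

```shell
#!/bin/bash
# pick_installer OS ARCH KERNEL_RELEASE
# Prints the install-script URL from this post that matches the platform.
# Hypothetical helper for illustration.
pick_installer() {
  local os="$1" arch="$2" release="$3"
  case "$release" in
    *[Mm]icrosoft*)   # assumed WSL marker in the output of `uname -r`
      echo "https://replicate.fyi/windows-install-llama-cpp"
      return ;;
  esac
  if [ "$os" = "Darwin" ] && [ "$arch" = "arm64" ]; then
    echo "https://replicate.fyi/install-llama-cpp"        # M1/M2 Mac (Metal build)
  else
    echo "https://replicate.fyi/install-llama-cpp-cpu"    # Intel Mac or Linux
  fi
}

URL="$(pick_installer "$(uname -s)" "$(uname -m)" "$(uname -r)")"
echo "curl -L \"$URL\" | bash"
```

The sketch only prints the command rather than running it, so you can check the choice before executing anything.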

Ollama (Mac)

Ollama is an open-source macOS app (for Apple Silicon) that lets you run, create, and share large language models with a command-line interface. Ollama already has support for Llama 2.

To use the Ollama CLI, download the macOS app at ollama.ai/download. Once you’ve got it installed, you can download Llama 2 without having to register for an account or join any waiting lists. Run this in your terminal:

# download the 7B model (3.8 GB)
ollama pull llama2

# or the 13B model (7.3 GB)
ollama pull llama2:13b

Then you can run the model and chat with it:

ollama run llama2
>>> hi
Hello! How can I help you today?

Note: Ollama recommends that you have at least 8 GB of RAM to run the 3B models, 16 GB to run the 7B models, and 32 GB to run the 13B models.
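That guidance maps neatly onto a lookup. Here's a minimal sketch that encodes the three thresholds from the note; `recommended_ram_gb` and `enough_ram` are hypothetical helpers, not Ollama commands.

```shell
#!/bin/bash
# recommended_ram_gb SIZE: print the minimum RAM (GB) from the note
# for a model size of 3b, 7b, or 13b. Returns 1 for any other size.
recommended_ram_gb() {
  case "$1" in
    3b)  echo 8 ;;
    7b)  echo 16 ;;
    13b) echo 32 ;;
    *)   return 1 ;;
  esac
}

# enough_ram SIZE RAM_GB: succeed if RAM_GB meets the recommendation.
enough_ram() {
  local need
  need="$(recommended_ram_gb "$1")" || return 1
  [ "$2" -ge "$need" ]
}

if enough_ram 13b 16; then
  echo "16 GB meets the recommendation for 13B"
else
  echo "16 GB is below the recommendation for 13B"
fi
```

On a 16 GB machine, for example, this suggests sticking with `ollama pull llama2` rather than the 13B variant.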

MLC LLM (Llama on your phone)

MLC LLM is an open-source project that makes it possible to run language models locally on a variety of devices and platforms, including iOS and Android.

For iPhone users, there’s an MLC chat app on the App Store. MLC now has support for the 7B, 13B, and 70B versions of Llama 2, but it’s still in beta and not yet in the App Store version, so you’ll need to install TestFlight to try it out. Check out the instructions for installing the beta version here.

Next steps

We’d love to see what you build. Hop in our Discord and share it with our community.

Replicate lets you run machine learning models in the cloud. Run Llama 2 with an API on Replicate.

Fine-tune Llama 2 on Replicate

Happy hacking! 🦙
