Run Llama 2 with an API

Replicate Blog

Posted July 27, 2023 by joehoover

Llama 2 is a language model from Meta AI. It’s the first open source language model of the same caliber as OpenAI’s models.

With Replicate, you can run Llama 2 in the cloud with one line of code.

Contents

Running Llama 2 with JavaScript

Running Llama 2 with Python

Running Llama 2 with cURL

Choosing which model to use

Example chat app

Fine-tune Llama 2

Run Llama 2 locally

Keep up to speed

Running Llama 2 with JavaScript

You can run Llama 2 with our official JavaScript client:

import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

const input = {
  prompt: "Write a poem about open source machine learning in the style of Mary Oliver.",
};

for await (const event of replicate.stream("meta/llama-2-70b-chat", { input })) {
  process.stdout.write(event.toString());
}

Running Llama 2 with Python

You can run Llama 2 with our official Python client:

import replicate

# The meta/llama-2-70b-chat model can stream output as it's running.
for event in replicate.stream(
    "meta/llama-2-70b-chat",
    input={
        "prompt": "Write a poem about open source machine learning in the style of Mary Oliver."
    },
):
    print(str(event), end="")
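If you would rather have the whole completion as one string than print tokens as they arrive, the streamed events can be joined. This is a minimal sketch, not part of the Replicate client: `collect_stream` and `run_llama` are hypothetical helper names, and the only assumption about events is the one the loop above already relies on, that each converts to text with `str()`. As with the example above, the client needs `REPLICATE_API_TOKEN` set in the environment.

```python
from typing import Iterable


def collect_stream(events: Iterable[object]) -> str:
    """Join streamed events into one string (hypothetical helper)."""
    # Each event is converted with str(), as in the streaming loop above.
    return "".join(str(event) for event in events)


def run_llama(prompt: str, model: str = "meta/llama-2-70b-chat") -> str:
    """Stream a completion from Replicate and return the full text."""
    # Imported lazily so collect_stream works without the client installed.
    import replicate

    return collect_stream(replicate.stream(model, input={"prompt": prompt}))
```

Calling `run_llama("Write a poem...")` would block until the stream ends, so this trades the live output of the loop above for a simpler return value.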

Running Llama 2 with cURL

You can call the HTTP API directly with tools like cURL:

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "input": {
      "prompt": "Write a poem..."
    }
  }' \
  https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions

You can also run Llama using other Replicate client libraries for Go, Swift, and others.
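If you want to make the same HTTP call as the cURL command without a client library, it can be sketched with Python's standard library alone. The URL, headers, and body shape are taken from the cURL example; `build_request` and `create_prediction` are hypothetical names, and since this post does not describe the response format, the sketch just returns the decoded JSON as-is.

```python
import json
import os
import urllib.request

API_URL = "https://api.replicate.com/v1/models/meta/llama-2-70b-chat/predictions"


def build_request(prompt: str, token: str) -> urllib.request.Request:
    """Build the same POST request as the cURL command, without sending it."""
    body = json.dumps({"input": {"prompt": prompt}}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Prefer": "wait",
        },
    )


def create_prediction(prompt: str) -> dict:
    """Send the request and return the decoded JSON response."""
    # Reads the token from the same environment variable the cURL example uses.
    request = build_request(prompt, os.environ["REPLICATE_API_TOKEN"])
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before any network call is made.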

Choosing which model to use

There are four Llama 2 model variants on Replicate, each with its own strengths:

meta/llama-2-70b-chat: 70 billion parameter model fine-tuned on chat completions. If you want to build a chat bot with the best accuracy, this is the one to use.

meta/llama-2-70b: 70 billion parameter base model. Use this if you want to do other kinds of language completions, like completing a user’s writing.

meta/llama-2-13b-chat: 13 billion parameter model fine-tuned on chat completions. Use this if you’re building a chat bot and would prefer it to be faster and cheaper at the expense of accuracy.

meta/llama-2-7b-chat: 7 billion parameter model fine-tuned on chat completions. This is an even smaller, faster model.

What’s the difference between these? Learn more in our blog post comparing 7B, 13B, and 70B.
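In application code, the choice among the four variants listed above can be kept in one place. The sketch below is illustrative only: `pick_model` and the `task`/`size` labels are hypothetical, and the table contains just the identifiers named in this post.

```python
# Model identifiers exactly as listed in this post.
MODELS = {
    ("chat", "70b"): "meta/llama-2-70b-chat",
    ("chat", "13b"): "meta/llama-2-13b-chat",
    ("chat", "7b"): "meta/llama-2-7b-chat",
    ("base", "70b"): "meta/llama-2-70b",
}


def pick_model(task: str = "chat", size: str = "70b") -> str:
    """Return the model identifier for a task ("chat" or "base") and size."""
    try:
        return MODELS[(task, size)]
    except KeyError:
        # Only the four combinations above are listed, so anything else is an error.
        raise ValueError(
            f"no Llama 2 variant listed for task={task!r}, size={size!r}"
        ) from None
```

For example, `pick_model("chat", "13b")` returns the faster, cheaper chat model, while `pick_model("base")` returns the 70B base model for plain completions.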

Example chat app

If you want a place to start, we’ve built a demo chat app in Next.js that can be deployed on Vercel.

Take a look at the GitHub README to learn how to customize and deploy it.

Fine-tune Llama 2

Because Llama 2 is open source, you can train it on more data to teach it new things or a particular style.

Replicate makes this easy. Take a look at our guide to fine-tune Llama 2.

Run Llama 2 locally

You can also run Llama 2 without an internet connection. We wrote a comprehensive guide to running Llama on your M1/M2 Mac, on Windows, on Linux, or even your phone.

Keep up to speed

Follow us on X to get the latest from the Llamaverse.

Hop in our Discord to talk Llama.

Happy hacking! 🦙