Streaming output for language models

Replicate Blog

Posted August 14, 2023 by zeke

You know when you’re using ChatGPT or Vercel’s AI playground and it returns an animated response, rendered word by word? That’s not just a dramatic visual effect to make it look like there’s a robot typing on the other side of the conversation. That’s actually the language model generating tokens one at a time, and streaming them back to you while it’s running.

Replicate already provides ways for you to receive incremental updates as your predictions are running, through polling and webhooks. But those aren’t always the most efficient methods to get updates from a running model. When you’re building something like a chat app, what you really need is a live-updating event stream.

Replicate’s API now supports server-sent event streams for language models. This lets you update your app live, as the model is running. In this post we’ll show you how to consume streaming responses from language models on Replicate.

How streaming works

At a high level, consuming an event stream on Replicate works like this:

You create a prediction with the stream option.

Replicate returns a prediction with a URL to receive streaming output.

You connect to the URL in your web browser and receive a stream of updates.

A Node.js example

Let’s walk through an example using Replicate’s Node.js client.

First, create a prediction using llama-2-70b-chat, setting the stream option to true:

import Replicate from "replicate";

const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

const prediction = await replicate.predictions.create({
  version: "2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  input: { prompt: "Tell me a story" },
  stream: true,
});

Note the stream URL in the prediction response:

console.log(prediction.urls.stream);
// https://streaming.api.replicate.com/v1/predictions/fuwwvjtbdmroc4xifxdcwqtdfq
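Not every model supports streaming, so it can be worth checking for the stream URL before using it. The helper below is a minimal sketch: `getStreamUrl` is a hypothetical name, and it assumes that a prediction created for a model without streaming support simply has no `urls.stream` field.

```javascript
// Return the stream URL from a prediction object, or throw a descriptive error.
// Assumption: predictions for models without streaming support lack urls.stream.
function getStreamUrl(prediction) {
  const url = prediction && prediction.urls && prediction.urls.stream;
  if (!url) {
    throw new Error(
      "Prediction has no stream URL; the model may not support streaming"
    );
  }
  return url;
}
```

Calling `getStreamUrl(prediction)` right after creating the prediction surfaces the problem early, rather than as a failed connection later.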

To receive streaming output, construct an EventSource in your browser-side JavaScript code using the stream URL from the prediction:

const source = new EventSource(prediction.urls.stream, {
  withCredentials: true,
});

// Each "output" event carries the next piece of generated text.
source.addEventListener("output", (e) => {
  console.log("output", e.data);
});

// Connection-level errors carry no data, so only parse it when present.
source.addEventListener("error", (e) => {
  console.error("error", e.data ? JSON.parse(e.data) : e);
});

// "done" marks the end of the stream; close it so the browser doesn't reconnect.
source.addEventListener("done", (e) => {
  source.close();
  console.log("done", JSON.parse(e.data));
});
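In a chat app you usually want the full text so far rather than individual tokens. The sketch below shows one way to accumulate "output" events into a single string. The function name `accumulateOutput` and the `onUpdate` and `onDone` callbacks are hypothetical, not part of Replicate's client; the event names are the ones used in the example above.

```javascript
// Accumulate streamed text from "output" events into one string.
// `source` is anything with addEventListener, such as a browser EventSource.
// `onUpdate` and `onDone` are hypothetical callbacks supplied by the caller.
function accumulateOutput(source, onUpdate, onDone) {
  let text = "";

  source.addEventListener("output", (e) => {
    text += e.data; // append the newly generated piece of text
    onUpdate(text); // e.g. re-render the chat message with the text so far
  });

  source.addEventListener("done", () => {
    // Close the connection if the source supports it (EventSource does).
    if (typeof source.close === "function") source.close();
    onDone(text);
  });

  // Return a getter for the text received so far.
  return () => text;
}

// In the browser, usage might look like:
// accumulateOutput(new EventSource(streamUrl), render, save);
```

Because the function only relies on `addEventListener`, it can be exercised with a plain `EventTarget` in tests, without a live stream.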

A command-line example using cURL

The browser’s built-in EventSource API is useful for building web apps, but the responses are standard HTTP event stream responses, so you don’t have to use a browser to consume them. You can also receive streaming output using the programming language of your choice, or use command-line tools like cURL and jq to display the output right in your terminal.

Copy and paste the commands below in your shell to do the following:

Use curl to create a prediction with llama-2-70b-chat

Pipe the prediction response into jq to pluck out the stream URL and print it out

Use curl again to connect to the stream URL and receive a stream of updates

STREAM_URL=$(curl -s -X POST \
  -d '{"version": "58d078176e02c219e11eb4da5a02a7830a283b14cf8f94537af893ccff5ee781", "input": {"prompt": "Tell me a story"}, "stream": true}' \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  "https://api.replicate.com/v1/predictions" | jq -r .urls.stream)

echo "$STREAM_URL"

curl -H 'Accept: text/event-stream' "$STREAM_URL"

cURL will print out a stream of updates from the model until the connection is closed:

event: output
id: 1692041342:0
data: Sure

event: output
id: 1692041342:1
data: !

event: output
id: 1692041342:2
data: Here

event: output
id: 1692041342:3
data: '

event: output
id: 1692041343:0
data: s

event: output
id: 1692041343:1
data: a

event: output
id: 1692041343:2
data: story

event: output
id: 1692041343:3
data: for

event: output
id: 1692041343:4
data: you
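If you want to consume this format from code without EventSource, the event blocks shown above are straightforward to parse. The following is a minimal sketch, not a complete implementation of the event stream format: `createEventStreamParser` and `streamEvents` are hypothetical names, `retry` fields and bare carriage-return line endings are not handled, and the "done" event name is taken from the Node.js example earlier in this post.

```javascript
// Minimal incremental parser for text/event-stream data.
// Returns a `feed` function that accepts text chunks in arrival order.
function createEventStreamParser() {
  let buffer = "";

  // Feed the next chunk of text; returns the events that chunk completed.
  return function feed(chunk) {
    buffer = (buffer + chunk).replace(/\r\n/g, "\n");
    const blocks = buffer.split("\n\n"); // events are separated by blank lines
    buffer = blocks.pop(); // keep the trailing partial event for the next call

    const events = [];
    for (const block of blocks) {
      let type = "message"; // default event name in the event stream format
      let id;
      const dataLines = [];
      for (const line of block.split("\n")) {
        if (line === "" || line.startsWith(":")) continue; // blank or comment
        const colon = line.indexOf(":");
        const field = colon === -1 ? line : line.slice(0, colon);
        let value = colon === -1 ? "" : line.slice(colon + 1);
        if (value.startsWith(" ")) value = value.slice(1); // drop one space
        if (field === "event") type = value;
        else if (field === "id") id = value;
        else if (field === "data") dataLines.push(value);
      }
      if (dataLines.length > 0) {
        events.push({ event: type, id, data: dataLines.join("\n") });
      }
    }
    return events;
  };
}

// Hypothetical helper: read a stream URL with fetch and pass each parsed
// event to `onEvent`, stopping after a "done" event.
async function streamEvents(streamUrl, onEvent) {
  const response = await fetch(streamUrl, {
    headers: { Accept: "text/event-stream" },
  });
  if (!response.ok) {
    throw new Error(`Stream request failed: HTTP ${response.status}`);
  }

  const feed = createEventStreamParser();
  const decoder = new TextDecoder();
  const reader = response.body.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    for (const event of feed(decoder.decode(value, { stream: true }))) {
      onEvent(event);
      if (event.event === "done") {
        await reader.cancel();
        return;
      }
    }
  }
}
```

The parser buffers partial input because network chunks do not necessarily line up with event boundaries. Only a single space after the colon is stripped, so any further leading whitespace in a `data` value is kept as part of the token.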

Which models support streaming output?

Streaming output is already supported by lots of language models on Replicate, including Falcon, Vicuna, StableLM, and of course… Llama 2 🦙. For a full list of models that support streaming output, see the streaming language models collection.

Adding streaming support to your own models

When publishing your own public or private language models to Replicate, you should make sure they support streaming so users of your model will have the best possible experience.

If you’re fine-tuning an existing language model, then you’re already set: Your fine-tuned model will automatically inherit the streaming support from the base model.

If you’re writing your own model using Cog, the key is to yield tokens as they’re generated, instead of returning the final result from a function. Use ConcatenateIterator to hint that the output should be concatenated together into a single string. Here’s an example:

from cog import BasePredictor, Path, ConcatenateIterator

class Predictor(BasePredictor):
    def predict(self) -> ConcatenateIterator[str]:
        tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
        for token in tokens:
            yield token + " "

For more details, check out the Cog documentation on streaming output.

Further reading

Check out llama.replicate.dev to see an example of streaming output in a Next.js app.

Read our streaming guide for more details about how to use streaming output on Replicate.

Read the Replicate Node.js client API docs for usage details for Node.js and browsers.

Compare streaming models using Vercel’s AI playground.

Learn how to use Vercel’s AI SDK to stream models on Replicate in JavaScript apps.

Follow @replicate on X to keep up as we add streaming support to more models.
