Using open-source models for faster and cheaper text embeddings

Replicate Blog

Posted November 10, 2023 by nateraw

Embeddings are a powerful tool for working with text. By “embedding” text into vectors, you encode its meaning into a representation that can more easily be used for tasks like semantic search, clustering, and classification. If you’re new to embeddings, check out this awesome introduction by Simon Willison to get up to speed. These days, embeddings are being used for even more interesting applications like Retrieval Augmented Generation, which uses semantic search over embeddings to improve the quality of responses from language models.

In this guide, we’ll see how to use the BAAI/bge-large-en-v1.5 model on Replicate to generate text embeddings. The “BAAI General Embedding” (BGE) suite of models, released by the Beijing Academy of Artificial Intelligence (BAAI), is open source and available on the Hugging Face Hub.

As of October 2023, the large BGE model we’ll use here is the current state-of-the-art open source model for text embeddings. It is ranked higher than OpenAI embeddings on the MTEB leaderboard, and is 4x cheaper to run on Replicate for large-scale text embedding (more on this later!).

👇 The code in this post is also available as a hosted, interactive Google Colab notebook.

Prerequisites

You’ll need:

An account on Replicate: You’ll use Replicate to run the BGE model. It’s free to get started, and you get a bit of credit when you sign up. After that, you pay per second for your usage. See how billing works for more details.

A Python environment to follow along in (or you can use the Google Colab notebook instead).

👀 See the model in the Replicate UI here, and more ways to run it (Node.js, cURL, Docker, etc.) here.

Install the dependencies

Start by installing the following dependencies:

pip install replicate

# to count tokens:
pip install transformers sentencepiece

# for our example "samsum" dataset:
pip install datasets py7zr scikit-learn

Authenticate with Replicate

Grab a Replicate API token from replicate.com/account/api-tokens and set it as an environment variable:

export REPLICATE_API_TOKEN=...
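
Before running the examples, it can help to confirm that the token is actually visible to Python. The following is a minimal sketch: the helper name `require_replicate_token` is hypothetical and not part of the replicate library, which reads `REPLICATE_API_TOKEN` from the environment on its own.

```python
import os


def require_replicate_token(env=None):
    """Return the API token, or raise a clear error if it is not set.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    This helper is illustrative only and is not part of the replicate client.
    """
    env = os.environ if env is None else env
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "REPLICATE_API_TOKEN is not set; export it before running the examples."
        )
    return token
```

Calling `require_replicate_token()` at the top of a script fails early with a readable message instead of surfacing an authentication error on the first model call.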

Generate embeddings from a list of text

Now you can run the embedding model. We’ll use the replicate library to run the model on Replicate:

import json
import replicate

texts = [
    "the happy cat",
    "the quick brown fox jumps over the lazy dog",
    "lorem ipsum dolor sit amet",
    "this is a test",
]

output = replicate.run(
    "nateraw/bge-large-en-v1.5:9cf9f015a9cb9c61d1a2610659cdac4a4ca222f2d3707a68517b18c198a9add1",
    input={"texts": json.dumps(texts)},
)
print(output)

The output is a list of embeddings, one for each input text.
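
Once you have embeddings, a common next step is comparing them, for example to rank texts by similarity to a query. Here is a rough sketch using cosine similarity in plain Python. It assumes each embedding is a list of floats; the function names and the toy 3-dimensional vectors are made up for illustration and are not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings (hypothetical values).
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
ranking = rank_by_similarity(toy[0], toy)
```

With real embeddings you would pass `output[0]` as the query and `output` as the candidates. For large collections, a vectorized library or a vector index would be a better fit than these pure-Python loops.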

Generate embeddings from a JSONL file

JSONL (or “JSON lines”) is a file format for storing structured data in a text-based, line-delimited format. Each line in the file is a standalone JSON object.

Here’s an example of a JSONL file, dummy_example.jsonl :

{"text": "the happy cat"}
{"text": "the quick brown fox jumps over the lazy dog"}
{"text": "lorem ipsum dolor sit amet"}
{"text": "this is a test"}
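
If your texts live in a Python list rather than a file, a small helper can produce a file in this shape. This is a sketch, not part of the replicate library: `write_jsonl` is a hypothetical name, and the only assumption about the format is the one shown above, a `text` key on every line.

```python
import json


def write_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line and return the line count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({"text": text}) + "\n")
            count += 1
    return count
```

For example, `write_jsonl(["the happy cat", "this is a test"], "dummy_example.jsonl")` writes a two-line file and returns `2`.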

Run the model on this file by specifying the path input.

output = replicate.run(
    "nateraw/bge-large-en-v1.5:9cf9f015a9cb9c61d1a2610659cdac4a4ca222f2d3707a68517b18c198a9add1",
    input={"path": open("dummy_example.jsonl", "rb")},
)
len(output)

Output:

4

Real-world example: Embedding the SAMSum dataset

The SAMSum dataset is a collection of ~14k example dialogues with manually annotated summaries. It is often used for training and evaluating language models.

Here we’ll encode the whole SAMSum dataset. We’ll use the datasets library to load the dataset, convert it to a JSONL file, and then run the BGE model on it to generate text embeddings.

from pathlib import Path

from datasets import load_dataset

dataset_name = "samsum"
text_field = "dialogue"
outfile_name = "samsum_dialogue.jsonl"

ds = load_dataset(dataset_name, split="train")
ds = ds.remove_columns([x for x in ds.column_names if x != text_field])
ds = ds.rename_column(text_field, "text")
texts = ds["text"]
texts[0]

Output:

"Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"

To convert the dataset to a JSONL file, call .to_json on the dataset.

ds.to_json(outfile_name)

If all goes well, the dataset should be written to samsum_dialogue.jsonl . Use the head command to see the first few lines of the file:

head -n 5 samsum_dialogue.jsonl

You should see the following:

{"text":"Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"}
{"text":"Olivia: Who are you voting for in this election? \r\nOliver: Liberals as always.\r\nOlivia: Me too!!\r\nOliver: Great"}
{"text":"Tim: Hi, what's up?\r\nKim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating\r\nTim: What did you plan on doing?\r\nKim: Oh you know, uni stuff and unfucking my room\r\nKim: Maybe tomorrow I'll move my ass and do everything\r\nKim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies\r\nTim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores\r\nTim: It really helps\r\nKim: thanks, maybe I'll do that\r\nTim: I also like using post-its in kaban style"}
{"text":"Edward: Rachel, I think I'm in ove with Bella..\r\nrachel: Dont say anything else..\r\nEdward: What do you mean??\r\nrachel: Open your fu**ing door.. I'm outside"}
{"text":"Sam: hey overheard rick say something\r\nSam: i don't know what to do :-\/\r\nNaomi: what did he say??\r\nSam: he was talking on the phone with someone\r\nSam: i don't know who\r\nSam: and he was telling them that he wasn't very happy here\r\nNaomi: damn!!!\r\nSam: he was saying he doesn't like being my roommate\r\nNaomi: wow, how do you feel about it?\r\nSam: i thought i was a good rommate\r\nSam: and that we have a nice place\r\nNaomi: that's true man!!!\r\nNaomi: i used to love living with you before i moved in with me boyfriend\r\nNaomi: i don't know why he's saying that\r\nSam: what should i do???\r\nNaomi: honestly if it's bothering you that much you should talk to him\r\nNaomi: see what's going on\r\nSam: i don't want to get in any kind of confrontation though\r\nSam: maybe i'll just let it go\r\nSam: and see…
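
If you'd rather check the file from Python than from a shell, the sketch below parses the first few lines and confirms each one carries a `text` field. The helper name `preview_jsonl` is hypothetical; it only assumes the one-object-per-line layout shown above.

```python
import json
from itertools import islice


def preview_jsonl(path, n=5, field="text"):
    """Parse the first n lines of a JSONL file and return the values of `field`."""
    values = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(islice(f, n), start=1):
            record = json.loads(line)  # raises ValueError on malformed JSON
            if field not in record:
                raise KeyError(f"line {line_number} has no {field!r} field")
            values.append(record[field])
    return values
```

Calling `preview_jsonl("samsum_dialogue.jsonl")` should return the first five dialogues as Python strings, with the `\r\n` escapes decoded into real line breaks.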
