How to use retrieval augmented generation with ChromaDB and Mistral
Replicate Blog
Posted October 17, 2023 by jakedahn
Over the last few months, Retrieval Augmented Generation (RAG) has emerged as a popular technique for getting the most out of Large Language Models (LLMs) like Llama-2-70b-chat.
In this post, we’ll explore the creation of an example RAG “app” which helps you generate click-worthy titles for Hacker News submissions. All you need to do is provide a working title, idea, or phrase, and even the most boring of words will be transformed into a title destined for the front page of Hacker News.
Admittedly, this is a basic toy idea. It’s not revolutionary, and it may not land your post on the front page of Hacker News. That’s okay, because that’s not the point: the point is to provide you with a practical hands-on feel for how RAG works, and give you the understanding you need to use this technique in your own projects and systems.
Okay, so what is retrieval augmented generation?
Retrieval augmented generation is a technique for enriching your language model outputs by retrieving contextual information from an external data source and including that information in your language model prompts. The idea is that when you augment a prompt with meaningful external data, the language model can respond with deeper understanding and relevance.
This sort of pattern effectively extends the functional context length of a given language model, because instead of being limited to 4,096 tokens (5-ish pages of text), you can query an entire 1,000 page book for meaningful passages, and pull in just the handful of sentences needed to create high-quality responses.
RAG is very flexible in that you can use the same data sources across multiple language models or upgrade to the latest language models without needing to re-train or fine-tune any specific models.
Other techniques such as fine-tuning can provide great results for specific use-cases like setting the style/tone of a language model, or teaching a language model how to perform a task or skill that is difficult or too large to explain in every prompt. But if your goal is to integrate an LLM with data, RAG is the way.
The basics of retrieval augmented generation
You may have heard that you need to use embeddings and a vector store to build a RAG application, but that is not entirely true. Those components can be helpful, and we will cover them in this post, but they are not strictly required.
At its most basic level, RAG is really just pasting additional context into the text of your language model prompt. It can be as simple as retrieving the current weather for San Francisco from a realtime weather API, and then passing the JSON response forward in the text of the prompt. Nothing fancy.
In the example prompt below, we instruct the mistral-7b-instruct model to read the input JSON data and respond with a weather report, as if it were a human meteorologist:
[INST] Given the following JSON weather data, write a weather report:
{ "location": "San Francisco, CA", "temperature": "65ºF", "forecast_high": "81ºF", "forecast_low": "63ºF", "time": "2023-10-09 10:00" }
You must only report on the data that is provided in the JSON. Do not mention any weather details that are outside the scope of the provided JSON. Do not mention the JSON context data.
All responses must sound like a human meteorologist.
[/INST]
The model responds with:
Good morning! Today in San Francisco, CA, the temperature is currently 65 degrees Fahrenheit. The forecast for the day calls for a high of 81 degrees and a low of 63 degrees. It's a beautiful day to explore the city and enjoy the outdoors. Make sure to stay hydrated and wear sunscreen if you plan on spending time in the sun. Have a great day!
See this prediction on Replicate: https://replicate.com/p/quh3kf3b23ftbag2dvjtebysfy
The ability to pass arbitrary textual information into an LLM prompt is part of what makes them so interesting and exciting. You could include a small CSV table, passages from a book, code, quotes from inspiring philosophers, or even application state. Anything, really, as long as it is smaller than the language model's maximum context length, which is 4,096 tokens for Llama 2.
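The weather example above can be sketched in a few lines of Python. This is a minimal sketch, not the post's own code: the `build_weather_prompt` helper is a hypothetical name, and the `weather` dictionary is hard-coded here as a stand-in for whatever a realtime weather API would return.

```python
import json


def build_weather_prompt(weather: dict) -> str:
    """Paste retrieved JSON data into an [INST] ... [/INST] prompt string."""
    # ensure_ascii=False keeps characters such as the degree sign readable.
    context = json.dumps(weather, ensure_ascii=False)
    return (
        "[INST] Given the following JSON weather data, write a weather report:\n"
        f"{context}\n"
        "You must only report on the data that is provided in the JSON. "
        "Do not mention any weather details that are outside the scope of the "
        "provided JSON. Do not mention the JSON context data.\n"
        "All responses must sound like a human meteorologist.\n"
        "[/INST]"
    )


# Assumption: in a real app this dictionary would come from a weather API call.
weather = {
    "location": "San Francisco, CA",
    "temperature": "65ºF",
    "forecast_high": "81ºF",
    "forecast_low": "63ºF",
    "time": "2023-10-09 10:00",
}

prompt = build_weather_prompt(weather)
print(prompt)
```

The resulting string is what you would send as the prompt input to mistral-7b-instruct. The "retrieval" step is just the lookup that produces `weather`; the "augmentation" step is the string formatting.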
Let’s outline what we’re building
Now that we’re on the same page and understand that RAG will solve all of your problems 🤭, let’s explore how to actually use it with a hands-on example.
As mentioned above, we’re going to build an example that converts titles, ideas, or phrases into Hacker News titles. To make this example a reality, we’ll tap into the wisdom of the crowd by indexing titles of top stories from Hacker News into a vector store, and making them queryable via embedding similarity search.
The idea is that inspiration from titles with a proven track record will help to create a better title suggestion.
The general flow of how this will work together goes something like:
1. Construct a dataset that can be indexed and queried.
2. Load the dataset into ChromaDB (a vector store).
3. Run some test queries against ChromaDB and visualize what is in the database.
4. Bring it all together. Query ChromaDB for 10 related popular titles, then prompt mistral-7b-instruct on Replicate to suggest new titles, inspired by the related popular titles.
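To make the shape of that flow concrete before any real infrastructure is involved, here is a rough, standard-library-only sketch of the retrieve-then-prompt loop. It rests on loud assumptions: a simple word-overlap (Jaccard) ranking stands in for ChromaDB's embedding similarity search, the three titles are made-up placeholders rather than entries from the dataset, the prompt wording is illustrative, and the `retrieve` and `build_prompt` helpers are hypothetical names.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, titles: list[str], n_results: int = 10) -> list[str]:
    """Rank titles by Jaccard word overlap with the query.

    Toy stand-in for a vector store query; it does NOT use embeddings.
    """
    query_words = tokenize(query)

    def score(title: str) -> float:
        title_words = tokenize(title)
        union = query_words | title_words
        return len(query_words & title_words) / len(union) if union else 0.0

    return sorted(titles, key=score, reverse=True)[:n_results]


def build_prompt(working_title: str, related: list[str]) -> str:
    """Augment the prompt with the retrieved titles as context."""
    bullets = "\n".join(f"- {title}" for title in related)
    return (
        "[INST] Here are popular Hacker News titles related to my idea:\n"
        f"{bullets}\n"
        f"Suggest new titles for: {working_title}\n"
        "[/INST]"
    )


# Made-up placeholder titles, not taken from the real dataset.
titles = [
    "Why SQLite is a good fit for small projects",
    "A tour of vector databases",
    "Notes on writing a Rust compiler",
]

working_title = "Using SQLite in a side project"
related = retrieve(working_title, titles, n_results=2)
prompt = build_prompt(working_title, related)
print(prompt)
```

Swapping the toy `retrieve` function for a real vector store query changes how relevance is computed, but not the overall structure: retrieve related items, paste them into the prompt, and send the prompt to the model.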
1. Construct a dataset that can be indexed and queried
The first thing we need to do is create a dataset of Hacker News titles. Creating this sort of dataset from scratch is kind of annoying and surprisingly complicated, so I’ve done the heavy lifting of scraping the Hacker News API and distilling stories down to a dataset of 13,509 top stories posted from January 2023 to early October 2023, each of which received 100+ upvotes.
The dataset consists of id, score, title, url, and time fields.
You can download the data from the git repo for this blog post, here: 13509-hn-topstories-2023.jsonl
I won’t bore you with all of the details, but I have included the two scripts I hacked together to build this nice jsonl file. The first script scrapes the Hacker News API by story ID, and saves all of the stories from a specific ID range to a SQLite database. The second script queries the SQLite database for stories with 100 or more upvotes, and writes them to a .jsonl file.
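For a feel of what the second step involves, here is a minimal sketch of exporting high-scoring stories from SQLite to a .jsonl file. It is not the actual script from the repo: the `stories` table name, its column types, the `export_top_stories` helper, and the three placeholder rows are all assumptions made for illustration. Only the field names (id, score, title, url, time) and the 100-upvote threshold come from the description above.

```python
import json
import os
import sqlite3
import tempfile


def export_top_stories(conn: sqlite3.Connection, out_path: str, min_score: int = 100) -> int:
    """Write stories with score >= min_score to out_path, one JSON object per line."""
    rows = conn.execute(
        "SELECT id, score, title, url, time FROM stories "
        "WHERE score >= ? ORDER BY id",
        (min_score,),
    )
    count = 0
    with open(out_path, "w", encoding="utf-8") as f:
        for story_id, score, title, url, time_ in rows:
            record = {"id": story_id, "score": score, "title": title, "url": url, "time": time_}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


# Assumption: a hypothetical schema, filled with placeholder rows for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stories (id INTEGER PRIMARY KEY, score INTEGER, title TEXT, url TEXT, time INTEGER)"
)
conn.executemany(
    "INSERT INTO stories VALUES (?, ?, ?, ?, ?)",
    [
        (1, 250, "Placeholder story A", "https://example.com/a", 1672531200),
        (2, 12, "Placeholder story B", "https://example.com/b", 1672534800),
        (3, 100, "Placeholder story C", "https://example.com/c", 1672538400),
    ],
)

out_path = os.path.join(tempfile.gettempdir(), "hn-topstories-sample.jsonl")
written = export_top_stories(conn, out_path)

# Read the file back the same way a later indexing step would.
with open(out_path, encoding="utf-8") as f:
    exported_titles = [json.loads(line)["title"] for line in f]

print(written, exported_titles)
```

One JSON object per line keeps the file easy to stream: a later step can read it line by line and pull out just the fields it needs.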
For our purposes, we really only care about the title fields, since we’ll be…