WritingScalewayScalewaypublished Jul 4, 2024seen 5d

Retrieval-Augmented Generation: Building a RAG Pipeline with Scaleway’s Managed Inference

Open original ↗

Captured source

source ↗

Retrieval-Augmented Generation: Building a RAG Pipeline with Scaleway’s Managed Inference Build • Sebastian Tatut • 04/07/24 • 5 min read

Editor's note — Product naming update

Since this article was published, Managed Inference has been renamed Generative APIs - Dedicated Deployments . The product itself remains unchanged.

Retrieval Augmented Generation (RAG) is one of the most sought-after solutions when using AI, and for good reason. It addresses some of the main limitations of Large Language Models (LLMs) such as a static knowledge base, inexact information, and hallucinations.

While there is a plethora of online material discussing RAG systems, most of them use high-level components that mask the building blocks composing a RAG pipeline. In this article, we’ll use a more grassroots approach to analyze the structure of such systems and build one using Scaleway’s elements, notably one of the latest entries in our portfolio: Managed Inference.

The Anatomy of a RAG System

Let’s start by describing a typical use case. You want to build an assistant that can answer questions and provide precise information using your company’s data. You can do this by providing users with a chat application that leverages a foundation model to answer queries. Today, you can choose from a multitude of foundation models and quickly set up such a system. The problem is that none of these models were trained using your data, and even if they were, by the time you put your system into production, the data will already be stale.

This leaves you with two choices: either you create your own foundation model, or you take an existing one and fine-tune it using your company’s data. RAG provides a third way, that allows you to retrieve your own data based on user queries and use the retrieved information to pass an enriched context to a foundation model. The model then uses that context to answer the original query.

Key Components of a RAG System

We now have enough information to identify the main components of our solution:

Data Source: This can be a data lake, internal documents in the form of PDFs, images, sounds, or even web pages.

Embeddings Model: A specialized type of model that generates vector representations of the input data.

Vector Database: A specialized type of database that stores vectors and the associated data, providing mechanisms to compare these vectors based on similarity.

Foundation Model: This can be your typical Large Language Model.

However, we are still missing some components. We need to ingest the raw data from our Data Source, like parse PDFs, scrape web pages, and so on. We need a Scraper/Parser component to achieve that.

Then, the raw data needs to be preprocessed before we can pass it to the Embeddings Model. We need to normalize and tokenize it properly before passing it as input to the embeddings model. The same goes for user queries; they must be normalized and tokenized using the same preprocessor. Thus, we have identified our missing components:

Scraper/Parser: We’ll use BeautifulSoup as our scraper and PyPDF2 as our PDF parser to generate the raw data.

Preprocessor: We’ll use Hugging Face’s AutoTokenizer from the Transformers library and spaCy to tokenize our raw data.

Structure of the RAG Pipeline

Now that we have all our puzzle pieces in place, a pattern emerges in the structure of our RAG pipeline. We can clearly identify two sub-systems:

Ingest Sub-System: Responsible for pulling information from the Data Source and passing that raw data to the Preprocessor, which transforms that data into tokens that can then be used by the Embeddings Model to generate vectors. The vectors and their associated raw data are then stored in the Vector Database.

Query/Retrieval Sub-System: Handles the user query the same way as the Ingest sub-system handles the raw data: it gets normalized and tokenized, then passed to the Embeddings Model to generate its vector representation. The query vector is then used to perform a similarity search using the Vector Database and retrieve the data that is closest to the user query. That data is used to generate an enriched context that is then passed together with the user query to the Foundation Model, which then generates the response.

Building the Ingest Sub-System

With this information, we can design the Ingest sub-system, which includes:

Data Sources

Scraper/Parser: Extracts raw data.

Preprocessor: Normalizes and tokenizes data.

Embeddings Model: Generates vectors.

Vector Database: Stores vectors and associated data.

Fortunately, Scaleway offers most of these components as managed services, simplifying the implementation process.

Scaleway’s newly developed Managed Inference service, now in public beta, can be used to quickly and securely deploy an easy-to-use LLM endpoint based on a select list of open-source models. It can be used to deploy a scalable, ready-to-use Sentence-t5-xxl embedding model in less than 5 minutes. Check the Quickstart guide to learn how to create an embeddings endpoint. At the end of the Quickstart, you’ll end up with an endpoint in the form: https:// /v1/embeddings. All of Scaleway’s Managed Inference endpoints follow OpenAI’s API spec, so if you already have a system using that spec, you can use Managed Inference as a drop-in replacement.

The same goes for the Vector Database. Scaleway provides a PostgreSQL Managed Database with a plethora of available extensions , one of which is the pgvector extension that enables vector support for PostgreSQL. Make sure to check the Quickstart guide to deploy a resilient production-ready vector database in just a few clicks.

This leaves us with the Scrapper/Parser and the Preprocessor. You can find sample implementations for these two components in the dedicated Github repository in the form of two services using a REST API.

Once Scaleway’s managed components and our sample implementations are in place, all we have to do is assemble them to obtain our Ingest pipeline.

A. The Scraper/Parser pulls data from the external Data Sources. In this example, we’ll scrape information from Scaleway’s Github documentation and parse data from PDFs uploaded on Amazon S3-compatible Scaleway’s Object Storage.

B. The raw data is sent to the Preprocessor, which normalizes it and tokenizes it appropriately for the Embeddings Model provided via Scaleway’s Managed Inference.

C. The preprocessed data is sent to the Embeddings Model via a POST request using the…

Excerpt shown — open the source for the full document.