Your RAG-powered AI app in 50 lines of code!

Build • Diego Coy • 09/04/24 • 14 min read

Introduction

This article continues the journey we embarked on a few weeks back with our last practical AI blog post, “Ollama: from zero to running an LLM in less than 2 minutes!”, where we leveraged Ollama to procure and serve an LLM in a virtual machine equipped with a GPU, Scaleway's H100 PCIe GPU Instance. After going through that article you may have been inspired to integrate AI capabilities into your own applications (Did you? Let me know via the Scaleway Community!) and you may have realized that even though thousands of possibilities opened up for you, some scenarios are still missing from the picture, such as the ability to make an LLM interact with your data. This is where RAG, the focus of this article, comes in.

The term RAG stands for retrieval-augmented generation, a technique that augments the usefulness of an LLM by enabling it to generate responses based on an extended set of information you provide. This “extended set of information” may come in the form of basically any type of structured data (your typical database or a spreadsheet) or unstructured data (text documents, or even media files). It needs to be further processed and stored in a specific way so that the right information can be found and retrieved easily. If such information cannot be found, instead of confidently providing a hallucinated answer, the LLM can be instructed to simply say “Hey, good question! I don't know ¯\_(ツ)_/¯” or another response you consider appropriate for your use case.
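To make the “say you don't know” idea a bit more concrete, here is a minimal sketch of how retrieved text and a fallback instruction could be combined into a single prompt. The function name `build_prompt`, the `FALLBACK` wording, and the prompt layout are all assumptions made for illustration; they are not what LlamaIndex does internally.

```python
# Hypothetical fallback text; pick whatever suits your use case.
FALLBACK = "Hey, good question! I don't know."


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a prompt that asks the model to stick to the supplied context."""
    if context_chunks:
        context = "\n---\n".join(context_chunks)
    else:
        context = "(no matching documents were found)"
    return (
        "Answer the question using only the context below.\n"
        f"If the context does not contain the answer, reply exactly: {FALLBACK}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


if __name__ == "__main__":
    chunks = ["Qdrant listens on port 6333 by default."]
    print(build_prompt("Which port does Qdrant use?", chunks))
```

The model still decides what to answer, so an instruction like this reduces hallucinated answers rather than ruling them out.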

The work we did when using Ollama to run an LLM laid the foundations we need for this new blog post where we will use that same hands-on approach to harness the power of AI, and thus we will focus only on the really important concepts and leave the more complex ones for later. This also means we will continue to use Python, and I'll assume you have an Instance running your preferred LLM with Ollama.
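If you want to double-check that your Instance is still serving a model before going further, a quick sanity check could look like the sketch below. It assumes Ollama is listening on its default port (11434) on the same machine and that you pulled `llama2:70b`; swap in whichever model you are actually running. Only the standard library is used, so it runs before any extra dependency is installed.

```python
import json
import urllib.request

# Assumption: Ollama runs locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama2:70b",
                           url: str = OLLAMA_URL) -> urllib.request.Request:
    """Prepare a non-streaming request for Ollama's generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama2:70b", timeout: float = 120.0) -> str:
    """Send the prompt to Ollama and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["response"]
```

Calling `ask("Say hello in five words")` on the Instance should return a short piece of text; a connection error usually means Ollama is not running.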

Hands-on with RAG

The importance of RAG lies in its ability to improve an LLM's accuracy and reliability. LLMs by themselves rely entirely on the knowledge gained through their training phase to generate output, which can sometimes result in inaccurate or outdated responses. RAG addresses this issue by incorporating external sources of information into the response generation pipeline, with the added benefit of not needing to update or “fine-tune” the original model (a process that might require large amounts of compute power), making it a simpler and more efficient approach.

We will build a simple app that will use an LLM (Llama2:70b) to go through Scaleway's public documentation repository and try to find the answer to an input question provided by the user. The base example has 50 lines of code, and we will see how we can improve its functionality by adding a few more here and there.

We will use LlamaIndex, “a simple, flexible data framework for connecting custom data sources to large language models” as its authors describe it, as our main tool to achieve our goal. We will also make use of an 'embedding model' that will transform documents, or chunks of data, into a numerical representation (vectors would be the fancy/proper term) based on their attributes. And finally, we will use a 'Vector Database' that will store the numerical representations of our documents, for easier consumption by the whole pipeline.
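To get a feel for what “turning documents into vectors” buys us, here is a deliberately naive sketch. The `toy_embed` function just counts words, which is nothing like a real embedding model, and the sample documents are made up. The point is only that once texts are vectors, “find the closest document” becomes plain arithmetic (cosine similarity here).

```python
import math
from collections import Counter

# Made-up sample documents, for illustration only.
DOCUMENTS = [
    "how to create a gpu instance",
    "object storage bucket lifecycle rules",
    "managed database backup schedule",
]


def toy_embed(text: str, vocabulary: list[str]) -> list[float]:
    """Toy stand-in for an embedding model: one word count per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: str, documents: list[str]) -> str:
    """Return the document whose toy vector is closest to the query's vector."""
    vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
    query_vector = toy_embed(query, vocabulary)
    scored = [
        (cosine_similarity(query_vector, toy_embed(doc, vocabulary)), doc)
        for doc in documents
    ]
    return max(scored, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    print(most_similar("gpu instance", DOCUMENTS))
```

A real embedding model places texts with similar meaning close together even when they share no words, and a vector database performs this nearest-neighbour lookup efficiently over many vectors.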

Architectural Overview

The system looks something like this:

A diagram showcasing the system's architecture: files are loaded into a 'vector database', this database is queried with the user prompt, and context documents that match are returned to the LLM, which will in turn generate a response to be sent back to the user.
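The flow in the diagram can be sketched as a single function that wires a retriever and a generator together. Everything here is a stand-in chosen for illustration: `keyword_retrieve` matches on shared words instead of querying a vector database, `fake_llm` does not call any model, and the two sample documents are invented.

```python
from typing import Callable

# Invented sample documents standing in for the loaded files.
DOCUMENTS = [
    "Qdrant is a vector database written in Rust",
    "Ollama serves LLMs on a GPU Instance",
]


def keyword_retrieve(question: str) -> list[str]:
    """Stand-in for the vector database query: keep documents sharing a word."""
    words = set(question.lower().split())
    return [doc for doc in DOCUMENTS if words & set(doc.lower().split())]


def fake_llm(prompt: str) -> str:
    """Stand-in for the LLM: reports on the prompt instead of answering it."""
    return f"[model output for a prompt of {len(prompt)} characters]"


def rag_answer(question: str,
               retrieve: Callable[[str], list[str]],
               generate: Callable[[str], str]) -> str:
    """Follow the diagram: query the store, hand the matches to the model, reply."""
    context_docs = retrieve(question)
    prompt = "Context:\n" + "\n".join(context_docs) + f"\n\nQuestion: {question}"
    return generate(prompt)


if __name__ == "__main__":
    print(rag_answer("what is qdrant", keyword_retrieve, fake_llm))
```

Passing the retriever and the generator in as arguments keeps the flow itself independent of which vector database or model sits behind it.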

Setup

All the commands and code are meant to be run inside your GPU Instance. Feel free to check the documentation if you need a refresher.

You can use your preferred text editor. In my case, I still like Visual Studio Code, and its Remote Development feature lets me connect to my Instance by logging in via SSH. It automatically installs a server on my Instance that allows me to edit and run code that lives in the remote Instance just the same way as I'd do it on my local environment. But if you know how to exit Vim, by all means, feel free to use it.

The environment

It's always a good idea to set up a virtual environment for your project, and I like to go simple, so I default to Python's built-in venv module:

    mkdir rag-example
    cd rag-example
    apt update
    apt install python3.10-venv -y
    python3 -m venv .venv
    source .venv/bin/activate

Running the Vector Database

There are many “Vector Databases” to choose from nowadays. Qdrant is an open source one that's written in Rust, has many official client libraries, and can be easily run via Docker:

    docker run -d -p 6333:6333 --name qdrant qdrant/qdrant

And if for some reason you decide to use a different Vector Database, LlamaIndex makes it easy for you to migrate with a few tweaks.
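Before moving on, you may want to confirm that the container is actually answering. The sketch below queries Qdrant's REST endpoint for collections on the port published by the `docker run` command above, using only the standard library. The helper names are made up for this example, and it assumes you run it on the same Instance as the container.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def collections_url(host: str = "localhost", port: int = 6333) -> str:
    """Build the URL of Qdrant's 'list collections' REST endpoint."""
    return f"http://{host}:{port}/collections"


def list_collections(host: str = "localhost", port: int = 6333,
                     timeout: float = 3.0) -> Optional[list[str]]:
    """Return the collection names, or None if Qdrant cannot be reached."""
    try:
        with urllib.request.urlopen(collections_url(host, port), timeout=timeout) as response:
            payload = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        return None
    return [item["name"] for item in payload.get("result", {}).get("collections", [])]
```

On a freshly started container, `list_collections()` should return an empty list, while `None` suggests the container is not running or the port is not published.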

Dependencies

We'll need to install the LlamaIndex package, our open source workhorse:

    pip install llama-index

And while we're at it, why not install all the other dependencies?

    pip install llama-index-llms-ollama llama-index-embeddings-huggingface llama-index-vector-stores-qdrant qdrant-client

llama-index-llms-ollama is the LlamaIndex wrapper that allows us to use a model served by Ollama

llama-index-embeddings-huggingface is the LlamaIndex wrapper for HuggingFace embedding models (more on those later on)

llama-index-vector-stores-qdrant is the LlamaIndex 'Vector Store' integration for Qdrant

qdrant-client is the official Python Qdrant library
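If you'd like to confirm that everything landed in the active virtual environment, a small check along these lines can help. It relies only on the standard library's `importlib.metadata`; the `REQUIRED` list simply mirrors the packages named above, and `installed_versions` is a helper name invented for this sketch.

```python
from importlib import metadata
from typing import Optional

# The packages installed in this section.
REQUIRED = [
    "llama-index",
    "llama-index-llms-ollama",
    "llama-index-embeddings-huggingface",
    "llama-index-vector-stores-qdrant",
    "qdrant-client",
]


def installed_versions(packages: list[str]) -> dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    versions: dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing most likely means the virtual environment was not activated when `pip install` ran.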

Getting the “data source”

As mentioned before, this example will use the Scaleway Documentation as its data source. Scaleway docs are maintained by a dedicated team of professional technical writers, but they're also a collaborative effort that the community can contribute to. That's why the documentation is available as an open source repository on GitHub. For this example, we will only clone the main branch with a depth of 1:

    git clone https://github.com/scaleway/docs-content.git --depth 1

If you explore the repo, you'll find several directories and files linked…
