WritingScalewayScalewaypublished Feb 25, 2025seen 5d

Think small, move fast: My journey building a RAG chatbot at Scaleway

Open original ↗

Captured source

source ↗
published Feb 25, 2025seen 5dcaptured 3dhttp 200method plain

Think small, move fast: My journey building a RAG chatbot at Scaleway Build • Matteo Lhommeau • 25/02/25 • 4 min read

When I arrived at Scaleway a few months ago as the first Data Scientist of the company, my goal was clear: transform the company. Yes, that’s challenging! I had to have a plan and to seek for quick added value to impulse this transformation. So we got started with our first project: build a Scaler Assistant to answer all HR related questions from Scalers. Let’s see what I did to build and deploy a RAG system in one month and the tips I can give you to achieve the same.

Think small, be efficient, fail fast

Think small, be efficient, fail fast. These are the three things I told myself to keep in mind if I wanted this project to succeed.

Think small: My goal is to help Scalers to find accurate HR related information, not every single piece of information contained in the internal documentation of the company. Therefore, my RAG system should and must be tailor-made for this kind of information. No need to set up a general retrieval system that works for every kind of data, no need to go into lots of technical considerations regarding LLMs or RAG architectures.

Be efficient: I have significant experience working with LLMs, creating RAG systems and more generally doing data science, so let’s leverage everything I’ve learned so far! My plan was to go into techniques I don’t know (or even worse: research papers!) only if my personal toolbox failed. To make it clear, I am not saying that I should not try to learn new skills or explore new techniques (I love doing this and I think that it is so important to stay on track and to enjoy my job), I am saying that it takes time. My goal here is to be as efficient as I can, to deliver the project within a month, to start transforming Scaleway as soon as I can. With this in mind I chose to focus on what I know. I chose to trust myself. When this project will be running and used by many Scalers, I will take the time to improve it with newer techniques and to improve my skills at the same time. That being said, regarding the tech stack I chose for the main components of my app:

Llama index to build my RAG system since I know this framework pretty well. It is a framework used to ease the creation of LLM-powered applications and it is really powerful when working with RAG given all the documentation management system and the myriad of retrieval techniques, data reader, preprocessing and postprocessing techniques it offers. If you are curious about it, check llama index github here

FastAPI for API development as I am used to using this framework .

Open WebUI for the user interface as it provides a familiar interface for the user and it comes with a lot of pre-built features, saving me a lot of time.

Tech stack: Llama index, FastAPI, Open WebUI

Final interface using Open WebUI with a general LLM, a RAG chatbot on HR documentation and a code model

Fail fast: This may be the most important tip I could give you. There is nothing worse than working on a project, experimenting, exploring, and at the time you want to deploy it in a production environment it does not work as expected. That’s disappointing and that’s a huge waste of time. So, as soon as I had something running (even an extremely simple system) I put myself in a production-like environment. What does it mean? I containerized everything (even at the project’s early stage), orchestrated containers, made sure my containers could communicate safely, built my API endpoints at the same time I developed the system, … This philosophy is usually referred to as MLOPs cycle.

No overkilling needed, of course, just making sure that your production environment evolves at the same pace as your code. Trust me, it will save you a huge amount of time in the end!

source: https://ml-ops.org/content/mlops-principles

Seek performance to quickly create value

Delivering fast is obviously a good point but to really help Scalers and to achieve my goal, performance was key. There were several areas where performance was needed such as documents retrieval accuracy, LLM response and hallucination reduction, system latency and hardware considerations for instance.

This can be a lot to work on at the same time and become quite intense. Here are some things I did to save me some time and to help me be more focused on the areas quoted previously.

Make good use of what your company offers. Lucky me, Scaleway is a cloud-provider and even provides AI products (no RAG as a service though)! It would be a shame not to use them, right? Indeed, the first thing I did was to understand what my company already provides or not. There is no need to reinvent the wheel, no need to spend time configuring database accesses, provisioning GPUs/CPUs, setting up an inference engine on a GPU if there already is a team dedicated to it … I leveraged my environment and my ecosystem. What does it mean in practice?

I used the Generative APIs product from Scaleway. It provides API access to multiple LLMs like language models, code models, vision models or embedding ones. This came really handy for this project as I needed to use embedding and text models. Within a couple of minutes I was able to query a Llama 3.1 70B and embed my document base by making API requests! If you have ever set up LLMs on a GPU using any inference engine, you know how complicated it can become especially when you want to scale things up, balance the load in input or optimize your latency and throughput.

I worked with internal teams to get CPU infrastructures and vector databases provisioned on our internal network. As you will see later in this article, security concerns are central to the success of the Scaler’s assistant deployment. Therefore, I needed all my infrastructure to be available on Scaleway’s internal network without any access to the public network (except to send requests to the Generative APIs product load balancer of course). At Scaleway we have teams dedicated to the provisioning and maintenance of all kinds of infrastructure you could need, so I specified my needs and they got me a vector database and a VM to start working!

Set up a good and reproducible experimentation lab. Let’s dive into the core of the project (and the driver for performance as said before): the retrieval system. Any RAG project is useless if the document retrieval system is not accurate enough. I needed to be able to experiment and iterate on the…

Excerpt shown — open the source for the full document.

Notability

notability 4.0/10

Substantive blog post, but not a model release or high-impact event.