In our previous post, we explored the transformative potential of RAGLite - a lightweight and efficient framework for Retrieval-Augmented Generation. We discussed how RAGLite addresses the limitations of traditional RAG implementations, offering streamlined workflows, seamless and efficient document processing, and advanced retrieval mechanisms. But understanding its benefits is only the first step.
In this tutorial, we’ll move beyond theory and dive into the practicalities of building a RAG pipeline with RAGLite. From setting up your environment to implementing semantic chunking, integrating with an LLM, and optimizing retrieval performance, this guide will equip you with the tools and insights needed to harness RAGLite’s full potential.
Whether you’re building a scalable enterprise solution or experimenting with a personal project, this hands-on guide will show you how to bring Retrieval-Augmented Generation to life - efficiently and effectively. Let’s get started!
Where should I get started?
The purpose of RAGLite is not only to provide a toolkit for building high-performing RAG-based applications, but also to implement that quickly.
1. Configure RAGLite
The first step is to choose the LLM you want to use and to connect RAGLite to your database.
Configure your LLM provider and your database
Start by configuring your LLM provider thanks to LiteLLM and specify your database connection string. The LLM and your database can be hosted remotely, as in the following example with an OpenAI LLM and a remote PostreSQL database:
But both can also be hosted locally, like for example the following configuration for Llama-3.1-8B used together with SQLite demonstrates:
As we discussed previously, both remote and local LLMs are supported. In both cases, configuring RAGLite is very straightforward and painless.
Configure your reranking model
Now, you can optionally configure any reranker supported by rerankers and again choose between a remote:
or a local reranking model, which is equally straightforward:
Again, we see that RAGLite not only supports remote, API-based, rerankers but also local ones when full privacy is necessary.
2. Inserting documents
Next, insert some documents into the database. RAGLite will take care of the conversion to Markdown, optimal level 4 semantic chunking, and multi-vector embedding with late chunking. Should you have to insert documents in a format different from pdf, install the pandoc extra with pip install raglit[pandoc].
With just a few lines of code, your documents are processed and ready for efficient retrieval, making your knowledge base immediately usable for advanced RAG workflows.
3. Searching and RAG
3.1 Adaptive RAG
Now you can run a simple but powerful adaptive RAG pipeline that consists of retrieving the most relevant chunk spans (each of which is a list of consecutive chunks) with hybrid search and reranking, converting the user prompt to a RAG instruction and appending it to the message history, and finally generating the RAG response:
The LLM will adaptively decide whether to retrieve information based on the complexity of the user prompt. If retrieval is necessary, the LLM generates the search query and RAGLite applies hybrid search and reranking to retrieve the most relevant chunk spans (each of which is a list of consecutive chunks). The retrieval results are sent to the on_retrieval callback and are appended to the message history as a tool output. Finally, the assistant response is streamed and appended to the message history.
3.2 Programmable RAG pipeline
If you need manual control over the RAG pipeline, you can run a basic but powerful pipeline that consists of retrieving the most relevant chunk spans with hybrid search and reranking, converting the user prompt to a RAG instruction and appending it to the message history, and finally generating the RAG response:
As we explained in the first blogpost, reranking can significantly improve the output quality of a RAG application. To add reranking to your application: first search for a larger set of 20 relevant chunks, then rerank them with a rerankers reranker, and finally keep the top 5 chunks.
In addition to the simple RAG pipeline, RAGLite also offers more advanced control over the individual steps of the pipeline. A full pipeline consists of several steps:
1. Searching for relevant chunks with keyword, vector, or hybrid search.
2. Retrieving the chunks from the database.
3. Reranking the chunks and selecting the top 5 results.
4. Extending the chunks with their neighbors and grouping them into chunk spans.
5. Converting the user prompt to a RAG instruction and appending it to the message history.
6. Streaming an LLM response to the message history.
7. Accessing the cited documents from the chunk spans.
This advanced pipeline empowers developers to fine-tune every aspect of the RAG process, from chunk retrieval to reranking and context grouping. By incorporating reranking and neighbor extension, it ensures a richer and more accurate contextual foundation for generating responses, while maintaining flexibility for custom application needs.
4. Computing and using an optimal query adapter
RAGLite can compute and apply an optimal closed-form query adapter to the prompt embedding to improve the output quality of RAG. To benefit from this, first generate a set of evals with insert_evals and then compute and store the optimal query adapter with update_query_adapter:
This feature enables RAGLite to enhance the quality of vector search results by refining the prompt embedding with an optimal query adapter. By leveraging evaluation data, this step ensures a more precise alignment between user queries and the retrieved chunks, thereby improving the overall performance and accuracy of RAG-based applications.
5. Evaluation of retrieval and generation
If you installed the ragas extra, you can use RAGLite to answer the evals and then evaluate the quality of both the retrieval and generation steps of RAG using Ragas:
By answering a set of evaluation queries and analyzing the results, you can assess both the retrieval accuracy and the quality of the generated responses. This process provides valuable insights for optimizing your RAG implementation.
6. Running a Model Context Protocol (MCP) server
RAGLite comes with an MCP server implemented with FastMCP that exposes a search_knowledge_base
tool. To use the server:
Install Claude desktop
Install uv so that Claude desktop can start the server
Configure Claude desktop to use uv to start the MCP server with:
To use an API-based LLM, make sure to include your credentials in a .env file or supply them inline:
Now, when you start Claude desktop you should see a 🔨 icon at the bottom right of your prompt indicating that the Claude has successfully connected with the MCP server.
When relevant, Claude will suggest using the search_knowledge_base
tool that the MCP server provides. You can also explicitly ask Claude to search the knowledge base if you want to be certain that it does.
7. Serving a customizable ChatGPT-like frontend
If you installed the chainlit extra, you can serve a customizable ChatGPT-like frontend with:
The application is also deployable to the web, Slack, and Teams.
You can specify the database URL, LLM, and embedder directly in the Chainlit frontend, or with the CLI as follows:
To use an API-based LLM, make sure to include your credentials in a .env file or supply them inline:
Conclusion: Simplified AI-Powered Retrieval
In this guide, we’ve taken you through the process of building a powerful and efficient RAG pipeline using RAGLite. From configuring your LLM and database to implementing advanced retrieval strategies, we’ve covered everything you need to leverage the full potential of Retrieval-Augmented Generation. Whether you’re working on a personal project or scaling for enterprise use, RAGLite offers the flexibility and performance necessary for building high-quality RAG-based applications.
By incorporating semantic chunking, reranking models, optimal query adapters, and evaluation mechanisms, you can fine-tune your pipeline for maximum retrieval accuracy and generation quality. Additionally, with features like customizable frontends and support for both remote and local models, RAGLite ensures that you have a robust toolkit to build, deploy, and scale your RAG applications efficiently.
We hope this guide empowers you to create your own innovative solutions with RAGLite. Dive into the world of Retrieval-Augmented Generation and unlock new possibilities for data-driven insights and enhanced user experiences!
Ready to transform your AI applications with RAGLite?
Get started today and unlock the full potential of Retrieval-Augmented Generation.
Created by Laurent Sorber, CTO & Founder of Superlinear, an AI consulting company.