Table of contents

Mastering RAGLite: A step-by-step Guide to building your own RAG pipeline

Where should I get started?

1. Configure RAGLite

2. Inserting documents

3. Searching and RAG

4. Computing and using an optimal query adapter

5. Evaluation of retrieval and generation

6. Running a Model Context Protocol (MCP) server

7. Serving a customizable ChatGPT-like frontend

Conclusion: Simplified AI-Powered Retrieval

Mastering RAGLite: A step-by-step Guide to building your own RAG pipeline

Last updated on:

28 Jul 2025

Published on:

18 Dec 2024

This guide walks you through the process of building a powerful RAG pipeline using RAGLite. From configuring your LLM and database to implementing advanced retrieval strategies like semantic chunking and reranking, this guide covers everything you need to optimize and scale your RAG-based applications.

In our previous post, we explored the transformative potential of RAGLite - a lightweight and efficient framework for Retrieval-Augmented Generation. We discussed how RAGLite addresses the limitations of traditional RAG implementations, offering streamlined workflows, seamless and efficient document processing, and advanced retrieval mechanisms. But understanding its benefits is only the first step.

In this tutorial, we’ll move beyond theory and dive into the practicalities of building a RAG pipeline with RAGLite. From setting up your environment to implementing semantic chunking, integrating with an LLM, and optimizing retrieval performance, this guide will equip you with the tools and insights needed to harness RAGLite’s full potential.

Whether you’re building a scalable enterprise solution or experimenting with a personal project, this hands-on guide will show you how to bring Retrieval-Augmented Generation to life - efficiently and effectively. Let’s get started!

Where should I get started?

RAGLite aims not only to provide a toolkit for building high-performing RAG applications, but also to let you build them quickly.

1. Configure RAGLite

The first step is to choose the LLM you want to use and to connect RAGLite to your database.

Configure your LLM provider and your database

Start by choosing your LLM provider, which RAGLite accesses through LiteLLM, and specifying your database connection string. The LLM and the database can be hosted remotely, as in the following example with an OpenAI LLM and a remote PostgreSQL database:

from raglite import RAGLiteConfig

# Example 'remote' config with a PostgreSQL database and an OpenAI LLM:
my_config = RAGLiteConfig(
    db_url="postgresql://my_username:my_password@my_host:5432/my_database",
    llm="gpt-4o-mini",  # Or any LLM supported by LiteLLM.
    embedder="text-embedding-3-large",  # Or any embedder supported by LiteLLM.
)

Both can also be hosted locally, as in the following configuration, which pairs a Qwen3-8B model served by llama.cpp with a DuckDB database:

from raglite import RAGLiteConfig

# Example 'local' config with a DuckDB database and a llama.cpp LLM:
my_config = RAGLiteConfig(
    db_url="duckdb:///raglite.db",
    llm="llama-cpp-python/unsloth/Qwen3-8B-GGUF/*Q4_K_M.gguf@8192",
    embedder="llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@512", # More than 512 tokens degrades bge-m3's performance
)

As we discussed previously, both remote and local LLMs are supported. In both cases, the whole configuration fits in a single RAGLiteConfig object.
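In practice you may prefer not to hardcode database credentials in the connection string. The following is a minimal sketch, not part of RAGLite, that assembles a PostgreSQL URL from environment variables and percent-encodes the username and password. The helper name `build_postgres_url` and the variable names `PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT` and `PGDATABASE` are assumptions made for this example.

```python
import os
from urllib.parse import quote


def build_postgres_url(env=os.environ):
    """Assemble a PostgreSQL connection string from environment variables.

    Hypothetical helper: the variable names below are assumptions for this sketch.
    """
    user = quote(env["PGUSER"], safe="")
    # Percent-encode the password so characters such as '@' or '/' cannot break the URL.
    password = quote(env["PGPASSWORD"], safe="")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env["PGDATABASE"]
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Demonstrate with an explicit mapping instead of the real environment.
example_env = {"PGUSER": "my_username", "PGPASSWORD": "p@ss/word", "PGDATABASE": "my_database"}
db_url = build_postgres_url(example_env)
print(db_url)
```

The resulting string can then be passed as the db_url argument of RAGLiteConfig.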

Configure your reranking model

Optionally, you can configure any reranker supported by the rerankers library. Again, you can choose a remote, API-based reranker:

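import os

from raglite import RAGLiteConfig
from rerankers import Reranker

# Example remote API-based reranker (assumes COHERE_API_KEY is set in the environment):
COHERE_API_KEY = os.environ["COHERE_API_KEY"]
my_config = RAGLiteConfig(
    db_url="postgresql://my_username:my_password@my_host:5432/my_database",
    reranker=Reranker("rerank-v3.5", model_type="cohere", api_key=COHERE_API_KEY, verbose=0),  # Multilingual
)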

Or a local reranking model, which is equally straightforward:

from raglite import RAGLiteConfig
from rerankers import Reranker

# Example local cross-encoder reranker per language (this is the default):
my_config = RAGLiteConfig(
    db_url="duckdb:///raglite.db",
    reranker={
        "en": Reranker("ms-marco-MiniLM-L-12-v2", model_type="flashrank", verbose=0),  # English
        "other": Reranker("ms-marco-MultiBERT-L-12", model_type="flashrank", verbose=0),  # Other languages
    }
)

As with LLMs, RAGLite supports both remote, API-based rerankers and local ones for cases where data must not leave your machine.
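To make the per-language configuration more concrete, here is a small sketch of how such a mapping could be resolved at query time. It assumes that the reranker registered for the detected language is used when one exists and that the "other" entry is the fallback. The function `select_reranker` is hypothetical, and plain strings stand in for Reranker objects. RAGLite's internal logic may differ.

```python
def select_reranker(rerankers, language):
    """Pick the reranker for a language code, falling back to the "other" entry.

    Hypothetical helper for illustration; not part of the RAGLite API.
    """
    if not isinstance(rerankers, dict):
        # A single reranker was configured, so it handles every language.
        return rerankers
    return rerankers.get(language, rerankers["other"])


# Strings stand in for Reranker instances in this sketch.
rerankers = {"en": "english-reranker", "other": "multilingual-reranker"}
print(select_reranker(rerankers, "en"))
print(select_reranker(rerankers, "nl"))
```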

2. Inserting documents

Next, insert some documents into the database. RAGLite will take care of the conversion to Markdown, optimal level 4 semantic chunking, and multi-vector embedding with late chunking. To insert documents in a format other than PDF, install the pandoc extra with pip install raglite[pandoc].

# Insert documents:
from pathlib import Path
from raglite import insert_document

insert_document(Path("On the Measure of Intelligence.pdf"), config=my_config)
insert_document(Path("Special Relativity.pdf"), config=my_config)

With just a few lines of code, your documents are processed and ready for efficient retrieval, making your knowledge base immediately usable for advanced RAG workflows.
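When a knowledge base consists of many files, you will probably want to collect the paths first and then insert them one by one. The sketch below uses only the standard library. The helper `find_documents` is hypothetical and not part of RAGLite, and the demonstration runs on a temporary directory rather than real documents.

```python
import tempfile
from pathlib import Path


def find_documents(root, suffixes=(".pdf",)):
    """Return all files under `root` with a matching suffix, sorted for reproducibility."""
    wanted = {suffix.lower() for suffix in suffixes}
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )


# Demonstrate on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "papers").mkdir()
    (Path(tmp) / "papers" / "a.pdf").touch()
    (Path(tmp) / "b.PDF").touch()
    (Path(tmp) / "notes.txt").touch()
    found = [path.name for path in find_documents(tmp)]

print(found)
```

Each returned path could then be passed to insert_document(path, config=my_config) in a loop.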

3. Searching and RAG

3.1 Adaptive RAG

Now you can run a simple but powerful adaptive RAG pipeline: add the user prompt to the message history and stream the response. When retrieval is needed, RAGLite fetches the most relevant chunk spans with hybrid search and reranking before generating the answer:

from raglite import rag

# Create a user message:
messages = []  # Or start with an existing message history.
messages.append({
    "role": "user",
    "content": "How is intelligence measured?"
})

# Adaptively decide whether to retrieve and stream the response:
chunk_spans = []
stream = rag(messages, on_retrieval=lambda x: chunk_spans.extend(x), config=my_config)
for update in stream:
    print(update, end="")

# Access the documents referenced in the RAG context:
documents = [chunk_span.document for chunk_span in chunk_spans]

The LLM will adaptively decide whether to retrieve information based on the complexity of the user prompt. If retrieval is necessary, the LLM generates the search query and RAGLite applies hybrid search and reranking to retrieve the most relevant chunk spans (each of which is a list of consecutive chunks). The retrieval results are sent to the on_retrieval callback and are appended to the message history as a tool output. Finally, the assistant response is streamed and appended to the message history.
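To illustrate what "a list of consecutive chunks" means, the sketch below groups chunk positions within a single document into runs of consecutive indices. It is a toy illustration under that single-document assumption. The function `group_into_spans` is hypothetical, and RAGLite's own grouping (which also extends chunks with their neighbors) may work differently.

```python
def group_into_spans(chunk_indices):
    """Group chunk indices into runs of consecutive indices.

    Hypothetical helper: assumes all indices belong to the same document.
    """
    spans = []
    for index in sorted(set(chunk_indices)):
        if spans and index == spans[-1][-1] + 1:
            # The chunk directly follows the previous one, so extend the current span.
            spans[-1].append(index)
        else:
            # There is a gap, so start a new span.
            spans.append([index])
    return spans


spans = group_into_spans([7, 3, 4, 12, 5, 8])
print(spans)  # [[3, 4, 5], [7, 8], [12]]
```

Presenting adjacent chunks together gives the LLM continuous passages instead of isolated fragments.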

3.2 Programmable RAG pipeline

If you need manual control over the RAG pipeline, you can run a basic but powerful pipeline that consists of retrieving the most relevant chunk spans with hybrid search and reranking, converting the user prompt to a RAG instruction and appending it to the message history, and finally generating the RAG response:

from raglite import create_rag_instruction, rag, retrieve_rag_context

# Retrieve relevant chunk spans with hybrid search and reranking:
user_prompt = "How is intelligence measured?"
chunk_spans = retrieve_rag_context(query=user_prompt, num_chunks=5, config=my_config)

# Append a RAG instruction based on the user prompt and context to the message history:
messages = []  # Or start with an existing message history.
messages.append(create_rag_instruction(user_prompt=user_prompt, context=chunk_spans))

# Stream the RAG response and append it to the message history:
stream = rag(messages, config=my_config)
for update in stream:
    print(update, end="")

# Access the documents referenced in the RAG context:
documents = [chunk_span.document for chunk_span in chunk_spans]

As we explained in the first article, reranking can significantly improve the output quality of a RAG application. To add reranking to your application: first search for a larger set of 20 relevant chunks, then rerank them with a rerankers reranker, and finally keep the top 5 chunks.
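The retrieve-then-rerank pattern can be sketched without any model at all. In the toy example below, a simple term-overlap score stands in for a real cross-encoder reranker. The functions `overlap_score` and `rerank` are hypothetical and only illustrate the "score a larger candidate set, keep the best few" step.

```python
def overlap_score(query, text):
    """Toy relevance score: the fraction of query terms that appear in the text."""
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    text_terms = set(text.lower().split())
    return len(query_terms & text_terms) / len(query_terms)


def rerank(query, candidates, top_k=5):
    """Sort candidates by descending score and keep the top_k best ones."""
    ranked = sorted(candidates, key=lambda text: overlap_score(query, text), reverse=True)
    return ranked[:top_k]


candidates = [
    "special relativity links space and time",
    "intelligence is measured by skill acquisition efficiency",
    "how fast is light",
]
top = rerank("how is intelligence measured", candidates, top_k=2)
print(top)
```

A real reranker replaces overlap_score with a model that reads the query and the chunk together, which is what makes it more accurate than the initial search.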

In addition to the simple RAG pipeline, RAGLite also offers more advanced control over the individual steps of the pipeline.

A full pipeline consists of several steps:

1. Searching for relevant chunks with keyword, vector, or hybrid search.
2. Retrieving the chunks from the database.
3. Reranking the chunks and selecting the top 5 results.
4. Extending the chunks with their neighbors and grouping them into chunk spans.
5. Converting the user prompt to a RAG instruction and appending it to the message history.
6. Streaming an LLM response to the message history.
7. Accessing the cited documents from the chunk spans.

# Search for chunks:
from raglite import hybrid_search, keyword_search, vector_search

user_prompt = "How is intelligence measured?"
chunk_ids_vector, _ = vector_search(user_prompt, num_results=20, config=my_config)
chunk_ids_keyword, _ = keyword_search(user_prompt, num_results=20, config=my_config)
chunk_ids_hybrid, _ = hybrid_search(user_prompt, num_results=20, config=my_config)

# Retrieve chunks
from raglite import retrieve_chunks

chunks_hybrid = retrieve_chunks(chunk_ids_hybrid, config=my_config)

# Rerank chunks and keep the top 5 (optional, but recommended)
from raglite import rerank_chunks

chunks_reranked = rerank_chunks(user_prompt, chunks_hybrid, config=my_config)
chunks_reranked = chunks_reranked[:5]

# Extend chunks with their neighbors and group them into chunk spans
from raglite import retrieve_chunk_spans

chunk_spans = retrieve_chunk_spans(chunks_reranked, config=my_config)

# Append a RAG instruction based on the user prompt and context to the message history
from raglite import add_context

messages = []  # Or start with an existing message history
messages.append(add_context(user_prompt=user_prompt, context=chunk_spans))

# Stream the RAG response and append it to the message history
from raglite import rag

stream = rag(messages, config=my_config)
for update in stream:
    print(update, end="")

# Access the documents referenced in the RAG context
documents = [chunk_span.document for chunk_span in chunk_spans]

This advanced pipeline empowers developers to fine-tune every aspect of the RAG process, from chunk retrieval to reranking and context grouping. By incorporating reranking and neighbor extension, it ensures a richer and more accurate contextual foundation for generating responses, while maintaining flexibility for custom application needs.
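Hybrid search has to merge a keyword ranking and a vector ranking into one list. Reciprocal rank fusion (RRF) is one common way to do this, and the sketch below shows the idea. Whether hybrid_search uses exactly this formula and the constant k=60 is an assumption of this example, and the chunk identifiers are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each item scores the sum of 1 / (k + rank) over all rankings."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Items that rank well in several lists end up first.
    return sorted(scores, key=scores.get, reverse=True)


vector_ranking = ["c3", "c1", "c7"]   # Hypothetical vector search result.
keyword_ranking = ["c1", "c9", "c3"]  # Hypothetical keyword search result.
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Because RRF only looks at ranks, it needs no calibration between keyword scores and vector similarities.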

4. Computing and using an optimal query adapter

RAGLite can compute and apply an optimal closed-form query adapter to the prompt embedding to improve the output quality of RAG. To benefit from this, first generate a set of evals with insert_evals and then compute and store the optimal query adapter with update_query_adapter:

# Improve RAG with an optimal query adapter:
from raglite import insert_evals, update_query_adapter

insert_evals(num_evals=100, config=my_config)
update_query_adapter(config=my_config)  # From here, every vector search will use the query adapter.

This feature enables RAGLite to enhance the quality of vector search results by refining the prompt embedding with an optimal query adapter. By leveraging evaluation data, this step ensures a more precise alignment between user queries and the retrieved chunks, thereby improving the overall performance and accuracy of RAG-based applications.
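To give some intuition for what "closed-form" means here, the sketch below solves a heavily simplified version of the problem: finding the rotation that best aligns 2D query vectors with their target vectors in the least-squares sense. In two dimensions this has a one-line solution. The example is only an analogy with made-up vectors. RAGLite's actual adapter operates on high-dimensional embeddings and its formulation may differ.

```python
import math


def fit_rotation(queries, targets):
    """Return the angle that minimizes the sum of ||R(angle) q - t||^2 over all pairs."""
    dot = sum(qx * tx + qy * ty for (qx, qy), (tx, ty) in zip(queries, targets))
    cross = sum(qx * ty - qy * tx for (qx, qy), (tx, ty) in zip(queries, targets))
    # The objective reduces to maximizing dot * cos(a) + cross * sin(a).
    return math.atan2(cross, dot)


def rotate(vector, angle):
    """Rotate a 2D vector counterclockwise by `angle` radians."""
    x, y = vector
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


# Toy data: the targets are the queries rotated by a known angle.
true_angle = 0.3
queries = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
targets = [rotate(query, true_angle) for query in queries]

angle = fit_rotation(queries, targets)
adapted = rotate(queries[0], angle)  # Apply the adapter to a query vector.
print(round(angle, 6))
```

No iterative training is involved: the adapter is computed directly from the example pairs, which is why a modest set of evals is enough.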

5. Evaluation of retrieval and generation

If you installed the ragas extra, you can use RAGLite to answer the evals and then evaluate the quality of both the retrieval and generation steps of RAG using Ragas:

# Evaluate retrieval and generation:
from raglite import answer_evals, evaluate, insert_evals

insert_evals(num_evals=100, config=my_config)
answered_evals_df = answer_evals(num_evals=10, config=my_config)
evaluation_df = evaluate(answered_evals_df, config=my_config)

By answering a set of evaluation queries and analyzing the results, you can assess both the retrieval accuracy and the quality of the generated responses. This process provides valuable insights for optimizing your RAG implementation.
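Alongside the Ragas metrics, it can help to keep a quick sanity check for retrieval on hand. The sketch below computes two simple, widely used measures: hit rate at k and mean reciprocal rank. These are not the metrics Ragas reports, and the chunk identifiers and function names are made up for this example.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries whose relevant chunk appears in the top k results."""
    hits = sum(1 for ids, target in zip(retrieved, relevant) if target in ids[:k])
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved, relevant):
    """Average of 1 / rank of the relevant chunk, counting 0 when it is missing."""
    total = 0.0
    for ids, target in zip(retrieved, relevant):
        if target in ids:
            total += 1.0 / (ids.index(target) + 1)
    return total / len(retrieved)


# Hypothetical retrieval results for three queries and their relevant chunks.
retrieved = [["c1", "c4", "c2"], ["c9", "c3", "c5"], ["c7", "c8", "c6"]]
relevant = ["c1", "c5", "c0"]

hit_rate = hit_rate_at_k(retrieved, relevant, k=2)
mrr = mean_reciprocal_rank(retrieved, relevant)
print(hit_rate, mrr)
```

Tracking such numbers before and after a change, for example enabling a reranker, shows whether retrieval actually improved.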

6. Running a Model Context Protocol (MCP) server

RAGLite comes with an MCP server implemented with FastMCP that exposes a search_knowledge_base tool. To use the server:

1. Install Claude desktop.
2. Install uv so that Claude desktop can start the server.
3. Configure Claude desktop to use uv to start the MCP server with:

raglite \
    --db-url duckdb:///raglite.db \
    --llm llama-cpp-python/unsloth/Qwen3-4B-GGUF/*Q4_K_M.gguf@8192 \
    --embedder llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@512 \
    mcp install

To use an API-based LLM, make sure to include your credentials in a .env file or supply them inline:

export OPENAI_API_KEY=sk-...
raglite \
    --llm gpt-4o-mini \
    --embedder text-embedding-3-large \
    mcp install

Now, when you start Claude desktop, you should see a 🔨 icon at the bottom right of your prompt, indicating that Claude has successfully connected to the MCP server.

When relevant, Claude will suggest using the search_knowledge_base tool that the MCP server provides. You can also explicitly ask Claude to search the knowledge base if you want to be certain that it does.

7. Serving a customizable ChatGPT-like frontend

If you installed the chainlit extra, you can serve a customizable ChatGPT-like frontend with:

raglite chainlit

The application is also deployable to the web, Slack, and Teams.

You can specify the database URL, LLM, and embedder directly in the Chainlit frontend, or with the CLI as follows:

raglite \
    --db-url duckdb:///raglite.db \
    --llm llama-cpp-python/unsloth/Qwen3-4B-GGUF/*Q4_K_M.gguf@8192 \
    --embedder llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@512 \
    chainlit

To use an API-based LLM, make sure to include your credentials in a .env file or supply them inline:

OPENAI_API_KEY=sk-... raglite --llm gpt-4o-mini --embedder text-embedding-3-large chainlit

Conclusion: Simplified AI-Powered Retrieval

In this guide, we’ve taken you through the process of building a powerful and efficient RAG pipeline using RAGLite. From configuring your LLM and database to implementing advanced retrieval strategies, we’ve covered everything you need to leverage the full potential of Retrieval-Augmented Generation. Whether you’re working on a personal project or scaling for enterprise use, RAGLite offers the flexibility and performance necessary for building high-quality RAG-based applications.

By incorporating semantic chunking, reranking models, optimal query adapters, and evaluation mechanisms, you can fine-tune your pipeline for maximum retrieval accuracy and generation quality. Additionally, with features like customizable frontends and support for both remote and local models, RAGLite ensures that you have a robust toolkit to build, deploy, and scale your RAG applications efficiently.

We hope this guide empowers you to create your own innovative solutions with RAGLite. Dive into the world of Retrieval-Augmented Generation and unlock new possibilities for data-driven insights and enhanced user experiences!

Ready to transform your AI applications with RAGLite?
Get started today and unlock the full potential of Retrieval-Augmented Generation.

Created by Laurent Sorber, CTO & Founder of Superlinear, an AI consulting company.

Author(s):

Renaud Chrétien

Machine Learning Engineer
