RAGLite tutorial

Table of contents

Mastering RAGLite: A Step-by-Step Guide to Building Your Own RAG Pipeline
Where should I get started?
1. Configure RAGLite 
2. Inserting documents
3. Searching and RAG
4. Computing and using an optimal query adapter
5. Evaluation of retrieval and generation
6. Running a Model Context Protocol (MCP) server
7. Serving a customizable ChatGPT-like frontend
Conclusion: Simplified AI-Powered Retrieval


Mastering RAGLite: A Step-by-Step Guide to Building Your Own RAG Pipeline

18 Dec 2024

This guide walks you through the process of building a powerful RAG pipeline using RAGLite. From configuring your LLM and database to implementing advanced retrieval strategies like semantic chunking and reranking, this guide covers everything you need to optimize and scale your RAG-based applications.

In our previous post, we explored the transformative potential of RAGLite - a lightweight and efficient framework for Retrieval-Augmented Generation. We discussed how RAGLite addresses the limitations of traditional RAG implementations, offering streamlined workflows, seamless and efficient document processing, and advanced retrieval mechanisms. But understanding its benefits is only the first step.

In this tutorial, we’ll move beyond theory and dive into the practicalities of building a RAG pipeline with RAGLite. From setting up your environment to implementing semantic chunking, integrating with an LLM, and optimizing retrieval performance, this guide will equip you with the tools and insights needed to harness RAGLite’s full potential.

Whether you’re building a scalable enterprise solution or experimenting with a personal project, this hands-on guide will show you how to bring Retrieval-Augmented Generation to life - efficiently and effectively. Let’s get started!

Where should I get started?

RAGLite aims not only to provide a toolkit for building high-performing RAG-based applications, but also to let you build them quickly.

1. Configure RAGLite 

The first step is to choose the LLM you want to use and to connect RAGLite to your database.

Configure your LLM provider and your database

Start by choosing an LLM provider supported by LiteLLM and specifying your database connection string. Both the LLM and the database can be hosted remotely, as in the following example with an OpenAI LLM and a remote PostgreSQL database:

from raglite import RAGLiteConfig

# Example 'remote' config with a PostgreSQL database and an OpenAI LLM:
my_config = RAGLiteConfig(
    db_url="postgresql://my_username:my_password@my_host:5432/my_database",
    llm="gpt-4o-mini",  # Or any LLM supported by LiteLLM.
    embedder="text-embedding-3-large",  # Or any embedder supported by LiteLLM.
)

But both can also be hosted locally, as the following configuration with Llama-3.1-8B and a SQLite database demonstrates:

from raglite import RAGLiteConfig

# Example 'local' config with a SQLite database and a llama.cpp LLM:
my_config = RAGLiteConfig(
    db_url="sqlite:///raglite.sqlite",
    llm="llama-cpp-python/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/*Q4_K_M.gguf@8192",
    embedder="llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@1024",  # A context size of 1024 tokens is the sweet spot for bge-m3.
)

As we discussed previously, both remote and local LLMs are supported. In both cases, configuring RAGLite is very straightforward and painless.
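Because switching between the two setups only changes a few arguments, one option is to pick them from the environment. The minimal sketch below assumes a hypothetical helper, choose_config_kwargs, that returns keyword arguments for RAGLiteConfig: the remote OpenAI/PostgreSQL values from the first example when OPENAI_API_KEY is set, and the local llama.cpp/SQLite values otherwise. It only builds a plain dict, so it runs without RAGLite installed.

```python
import os


def choose_config_kwargs(env=None):
    """Return RAGLiteConfig keyword arguments (hypothetical helper, not part of RAGLite).

    Uses the remote setup when OPENAI_API_KEY is set and non-empty,
    otherwise falls back to the fully local setup.
    """
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        # Remote: OpenAI models via LiteLLM and a PostgreSQL database.
        return {
            "db_url": "postgresql://my_username:my_password@my_host:5432/my_database",
            "llm": "gpt-4o-mini",
            "embedder": "text-embedding-3-large",
        }
    # Local: llama.cpp models and a SQLite file on disk.
    return {
        "db_url": "sqlite:///raglite.sqlite",
        "llm": "llama-cpp-python/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/*Q4_K_M.gguf@8192",
        "embedder": "llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@1024",
    }


print(choose_config_kwargs({})["db_url"])  # sqlite:///raglite.sqlite
```

With RAGLite installed, you would then create the configuration with RAGLiteConfig(**choose_config_kwargs()).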

Configure your reranking model

Next, you can optionally configure any reranker supported by the rerankers library, again choosing between a remote, API-based reranker:

import os

from raglite import RAGLiteConfig
from rerankers import Reranker

# Example remote API-based reranker (the Cohere API key is read from the environment):
my_config = RAGLiteConfig(
    db_url="postgresql://my_username:my_password@my_host:5432/my_database",
    reranker=Reranker("cohere", lang="en", api_key=os.environ["COHERE_API_KEY"]),
)

or a local reranking model, which is equally straightforward:

from raglite import RAGLiteConfig
from rerankers import Reranker

# Example local cross-encoder reranker per language (this is the default):
my_config = RAGLiteConfig(
    db_url="sqlite:///raglite.sqlite",
    reranker=(
        ("en", Reranker("ms-marco-MiniLM-L-12-v2", model_type="flashrank")),  # English
        ("other", Reranker("ms-marco-MultiBERT-L-12", model_type="flashrank")),  # Other languages
    )
)

Again, RAGLite supports not only remote, API-based rerankers but also local ones for when full privacy is necessary.
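The per-language tuple in the local example pairs language codes with rerankers, with "other" acting as a fallback. The sketch below shows one plausible way such a lookup resolves, assuming an exact language match wins and "other" is used otherwise. pick_reranker is a hypothetical helper, plain strings stand in for Reranker objects so it runs without the rerankers library, and how RAGLite determines the language of a query is not covered here.

```python
def pick_reranker(rerankers, lang):
    """Select a reranker from (language, reranker) pairs (hypothetical helper).

    Assumption: an exact language match wins; otherwise the "other" entry is used.
    Returns None if neither exists.
    """
    lookup = dict(rerankers)
    return lookup.get(lang, lookup.get("other"))


# Strings stand in for rerankers.Reranker objects.
per_language = (
    ("en", "ms-marco-MiniLM-L-12-v2"),  # English
    ("other", "ms-marco-MultiBERT-L-12"),  # Other languages
)
print(pick_reranker(per_language, "en"))  # ms-marco-MiniLM-L-12-v2
print(pick_reranker(per_language, "fr"))  # ms-marco-MultiBERT-L-12
```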

2. Inserting documents

Next, insert some documents into the database. RAGLite will take care of the conversion to Markdown, optimal level 4 semantic chunking, and multi-vector embedding with late chunking. To insert documents in formats other than PDF, install the pandoc extra with pip install raglite[pandoc].

# Insert documents:
from pathlib import Path
from raglite import insert_document

insert_document(Path("On the Measure of Intelligence.pdf"), config=my_config)
insert_document(Path("Special Relativity.pdf"), config=my_config)

With just a few lines of code, your documents are processed and ready for efficient retrieval, making your knowledge base immediately usable for advanced RAG workflows.
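The idea behind semantic chunking is to place chunk boundaries where the topic shifts rather than after a fixed number of characters. RAGLite's optimal level 4 semantic chunking is considerably more sophisticated; the minimal sketch below is only a simplified greedy heuristic to convey the intuition. It assumes bag-of-words counts as a toy stand-in for a real embedding model and an arbitrary similarity threshold of 0.2, and it starts a new chunk whenever two consecutive sentences are dissimilar.

```python
import math
import re
from collections import Counter


def toy_embedding(text):
    """Bag-of-words counts: a toy stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def greedy_semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk wherever two consecutive sentences are dissimilar."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(toy_embedding(prev), toy_embedding(curr)) < threshold:
            chunks.append([curr])  # topic shift: open a new chunk
        else:
            chunks[-1].append(curr)  # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "Intelligence is measured with tests.",
    "Tests of intelligence measure skill acquisition.",
    "Special relativity concerns the speed of light.",
    "The speed of light is constant in special relativity.",
]
chunks = greedy_semantic_chunks(sentences)
print(len(chunks))  # 2
```

A greedy rule like this only looks at neighboring sentences, which is exactly the kind of local decision that a globally optimized chunker avoids.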

3. Searching and RAG

3.1 Adaptive RAG

Now you can run a simple but powerful adaptive RAG pipeline, in which the LLM decides whether to retrieve the most relevant chunk spans (each of which is a list of consecutive chunks) with hybrid search and reranking before generating the RAG response:

from raglite import rag

# Create a user message:
messages = []  # Or start with an existing message history.
messages.append({
    "role": "user",
    "content": "How is intelligence measured?"
})

# Adaptively decide whether to retrieve and stream the response:
chunk_spans = []
stream = rag(messages, on_retrieval=lambda x: chunk_spans.extend(x), config=my_config)
for update in stream:
    print(update, end="")

# Access the documents referenced in the RAG context:
documents = [chunk_span.document for chunk_span in chunk_spans]

The LLM will adaptively decide whether to retrieve information based on the complexity of the user prompt. If retrieval is necessary, the LLM generates the search query and RAGLite applies hybrid search and reranking to retrieve the most relevant chunk spans (each of which is a list of consecutive chunks). The retrieval results are sent to the on_retrieval callback and are appended to the message history as a tool output. Finally, the assistant response is streamed and appended to the message history.
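Hybrid search has to merge two rankings, one from keyword search and one from vector search. A common way to do this is reciprocal rank fusion (RRF), where each chunk scores the sum of 1 / (k + rank) over the rankings it appears in. The minimal sketch below assumes RRF with the frequently used constant k = 60; this may differ from what RAGLite does internally, and the chunk ids are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk ids with reciprocal rank fusion (ranks start at 1)."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


vector_ids = ["c3", "c1", "c7"]  # hypothetical vector search ranking
keyword_ids = ["c1", "c9", "c3"]  # hypothetical keyword search ranking
print(reciprocal_rank_fusion([vector_ids, keyword_ids]))  # ['c1', 'c3', 'c9', 'c7']
```

Because RRF only uses ranks, it sidesteps the problem that keyword scores and vector similarities live on incomparable scales.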

3.2 Programmable RAG pipeline

If you need manual control over the RAG pipeline, you can run a basic but powerful pipeline that consists of retrieving the most relevant chunk spans with hybrid search and reranking, converting the user prompt to a RAG instruction and appending it to the message history, and finally generating the RAG response:

from raglite import create_rag_instruction, rag, retrieve_rag_context

# Retrieve relevant chunk spans with hybrid search and reranking:
user_prompt = "How is intelligence measured?"
chunk_spans = retrieve_rag_context(query=user_prompt, num_chunks=5, config=my_config)

# Append a RAG instruction based on the user prompt and context to the message history:
messages = []  # Or start with an existing message history.
messages.append(create_rag_instruction(user_prompt=user_prompt, context=chunk_spans))

# Stream the RAG response and append it to the message history:
stream = rag(messages, config=my_config)
for update in stream:
    print(update, end="")

# Access the documents referenced in the RAG context:
documents = [chunk_span.document for chunk_span in chunk_spans]
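To make the "RAG instruction" step concrete, here is a minimal sketch of the kind of message such a step produces: the retrieved context and the user prompt combined into a single user message in the OpenAI-style chat format that LiteLLM accepts. The template wording, the document tags, and the helper name make_rag_instruction are hypothetical; RAGLite's create_rag_instruction uses its own template and takes chunk spans rather than plain strings.

```python
def make_rag_instruction(user_prompt, context_blocks):
    """Wrap retrieved context and the question into one user message (hypothetical template)."""
    documents = "\n\n".join(
        f'<document index="{i}">\n{block}\n</document>'
        for i, block in enumerate(context_blocks, start=1)
    )
    content = (
        "Answer the question below using the information in these documents.\n\n"
        f"{documents}\n\n"
        f"Question: {user_prompt}"
    )
    return {"role": "user", "content": content}


messages = []  # Or start with an existing message history.
messages.append(make_rag_instruction(
    "How is intelligence measured?",
    ["Intelligence is measured as skill-acquisition efficiency."],
))
print(messages[-1]["content"])
```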

As we explained in our first blog post, reranking can significantly improve the output quality of a RAG application. To add reranking to your application, first search for a larger set of 20 relevant chunks, then rerank them with a reranker from the rerankers library, and finally keep the top 5 chunks.
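The over-retrieve-then-rerank pattern is easy to see in isolation. In the sketch below, a toy term-overlap score stands in for a real cross-encoder or API-based reranker, and rerank_top_k is a hypothetical helper; in RAGLite you would call rerank_chunks and slice the result instead.

```python
import re


def overlap_score(query, text):
    """Toy relevance score: fraction of query terms found in the text (stand-in for a reranker)."""
    query_terms = set(re.findall(r"[a-z]+", query.lower()))
    text_terms = set(re.findall(r"[a-z]+", text.lower()))
    return len(query_terms & text_terms) / len(query_terms) if query_terms else 0.0


def rerank_top_k(query, candidates, k=5):
    """Rescore a larger candidate set (e.g. 20 search hits) and keep the k best."""
    return sorted(candidates, key=lambda text: overlap_score(query, text), reverse=True)[:k]


candidates = [
    "Light travels fast.",
    "Intelligence is measured by skill-acquisition efficiency.",
    "How tests are measured matters.",
]
print(rerank_top_k("How is intelligence measured?", candidates, k=2))
```

The first-stage search is cheap and recall-oriented, while the reranker is more expensive and precision-oriented, which is why it is applied only to the smaller candidate set.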

In addition to the simple RAG pipeline, RAGLite also offers more advanced control over the individual steps of the pipeline. A full pipeline consists of several steps:

1. Searching for relevant chunks with keyword, vector, or hybrid search.
2. Retrieving the chunks from the database.
3. Reranking the chunks and selecting the top 5 results.
4. Extending the chunks with their neighbors and grouping them into chunk spans.
5. Converting the user prompt to a RAG instruction and appending it to the message history.
6. Streaming an LLM response to the message history.
7. Accessing the cited documents from the chunk spans.

# Search for chunks:
from raglite import hybrid_search, keyword_search, vector_search

user_prompt = "How is intelligence measured?"
chunk_ids_vector, _ = vector_search(user_prompt, num_results=20, config=my_config)
chunk_ids_keyword, _ = keyword_search(user_prompt, num_results=20, config=my_config)
chunk_ids_hybrid, _ = hybrid_search(user_prompt, num_results=20, config=my_config)

# Retrieve chunks:
from raglite import retrieve_chunks

chunks_hybrid = retrieve_chunks(chunk_ids_hybrid, config=my_config)

# Rerank chunks and keep the top 5 (optional, but recommended):
from raglite import rerank_chunks

chunks_reranked = rerank_chunks(user_prompt, chunks_hybrid, config=my_config)
chunks_reranked = chunks_reranked[:5]

# Extend chunks with their neighbors and group them into chunk spans:
from raglite import retrieve_chunk_spans

chunk_spans = retrieve_chunk_spans(chunks_reranked, config=my_config)

# Append a RAG instruction based on the user prompt and context to the message history:
from raglite import create_rag_instruction

messages = []  # Or start with an existing message history.
messages.append(create_rag_instruction(user_prompt=user_prompt, context=chunk_spans))

# Stream the RAG response and append it to the message history:
from raglite import rag

stream = rag(messages, config=my_config)
for update in stream:
    print(update, end="")

# Access the documents referenced in the RAG context:
documents = [chunk_span.document for chunk_span in chunk_spans]

This advanced pipeline empowers developers to fine-tune every aspect of the RAG process, from chunk retrieval to reranking and context grouping. By incorporating reranking and neighbor extension, it ensures a richer and more accurate contextual foundation for generating responses, while maintaining flexibility for custom application needs.
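Neighbor extension and grouping into chunk spans can be illustrated with chunk positions alone. The minimal sketch below assumes that each retrieved chunk is extended by one neighbor on each side and that overlapping or adjacent extensions are merged into a single span of consecutive chunks. The helper name and the neighbors parameter are hypothetical; retrieve_chunk_spans works on chunk objects and may apply different rules.

```python
def group_into_spans(chunk_indices, num_chunks, neighbors=1):
    """Extend chunk positions with their neighbors and merge them into consecutive spans.

    chunk_indices: positions of retrieved chunks within one document.
    num_chunks: total number of chunks in that document (used to clip at the edges).
    """
    expanded = set()
    for i in chunk_indices:
        first = max(0, i - neighbors)
        last = min(num_chunks - 1, i + neighbors)
        expanded.update(range(first, last + 1))
    spans = []
    for i in sorted(expanded):
        if spans and i == spans[-1][-1] + 1:
            spans[-1].append(i)  # continues the current span
        else:
            spans.append([i])  # gap: start a new span
    return spans


print(group_into_spans([2, 4, 9], num_chunks=10))  # [[1, 2, 3, 4, 5], [8, 9]]
```

Here the hits at positions 2 and 4 share the neighbor 3, so they end up in one span, which gives the LLM a contiguous passage instead of fragments.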

4. Computing and using an optimal query adapter

RAGLite can compute and apply an optimal closed-form query adapter to the prompt embedding to improve the output quality of RAG. To benefit from this, first generate a set of evals with insert_evals and then compute and store the optimal query adapter with update_query_adapter:

# Improve RAG with an optimal query adapter:
from raglite import insert_evals, update_query_adapter

insert_evals(num_evals=100, config=my_config)
update_query_adapter(config=my_config)  # From here, every vector search will use the query adapter.

This feature enables RAGLite to enhance the quality of vector search results by refining the prompt embedding with an optimal query adapter. By leveraging evaluation data, this step ensures a more precise alignment between user queries and the retrieved chunks, thereby improving the overall performance and accuracy of RAG-based applications.
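RAGLite computes the adapter in closed form from the evals; that derivation is not reproduced here. The minimal sketch below only shows how a linear adapter, once computed, changes vector search, assuming the adapter is a square matrix that multiplies the query embedding before similarities are computed. The 2-dimensional embeddings and the "swap" adapter are toy values chosen to make the effect visible.

```python
import math


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def unit(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def adapted_scores(query_embedding, adapter, chunk_embeddings):
    """Apply a linear query adapter, then score chunks by dot product with the adapted query."""
    query = unit(matvec(adapter, query_embedding))
    return [sum(q * c for q, c in zip(query, chunk)) for chunk in chunk_embeddings]


chunks = [[1.0, 0.0], [0.0, 1.0]]  # toy unit-length chunk embeddings
identity = [[1.0, 0.0], [0.0, 1.0]]  # no adaptation
swap = [[0.0, 1.0], [1.0, 0.0]]  # toy adapter that maps one direction onto the other
print(adapted_scores([1.0, 0.0], identity, chunks))  # [1.0, 0.0]
print(adapted_scores([1.0, 0.0], swap, chunks))  # [0.0, 1.0]
```

With the identity matrix the scores are plain cosine similarities; a learned adapter reshapes the query so that questions land closer to the chunks that answer them.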

5. Evaluation of retrieval and generation

If you installed the ragas extra, you can use RAGLite to answer the evals and then evaluate the quality of both the retrieval and generation steps of RAG using Ragas:

# Evaluate retrieval and generation:
from raglite import answer_evals, evaluate, insert_evals

insert_evals(num_evals=100, config=my_config)
answered_evals_df = answer_evals(num_evals=10, config=my_config)
evaluation_df = evaluate(answered_evals_df, config=my_config)

By answering a set of evaluation queries and analyzing the results, you can assess both the retrieval accuracy and the quality of the generated responses. This process provides valuable insights for optimizing your RAG implementation.
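Ragas computes its own, richer set of retrieval and generation metrics, so the sketch below is not what evaluate returns. To build intuition for what "retrieval accuracy" measures, it computes two standard retrieval metrics, hit rate and mean reciprocal rank (MRR), over hypothetical eval results. The (relevant ids, retrieved ids) data layout and the helper name are assumptions.

```python
def hit_rate_and_mrr(results):
    """Return (hit rate, mean reciprocal rank) for (relevant_ids, retrieved_ids) pairs.

    retrieved_ids must be in rank order; only the first relevant hit per query counts.
    """
    if not results:
        return 0.0, 0.0
    hits, reciprocal_rank_sum = 0, 0.0
    for relevant_ids, retrieved_ids in results:
        for rank, chunk_id in enumerate(retrieved_ids, start=1):
            if chunk_id in relevant_ids:
                hits += 1
                reciprocal_rank_sum += 1.0 / rank
                break
    return hits / len(results), reciprocal_rank_sum / len(results)


# Hypothetical eval results: the chunk that answers each question vs. what retrieval returned.
results = [
    ({"c1"}, ["c1", "c2"]),  # relevant chunk ranked first
    ({"c5"}, ["c2", "c5"]),  # relevant chunk ranked second
    ({"c9"}, ["c1", "c2"]),  # relevant chunk not retrieved
]
hit_rate, mrr = hit_rate_and_mrr(results)
print(f"hit rate: {hit_rate:.2f}, MRR: {mrr:.2f}")  # hit rate: 0.67, MRR: 0.50
```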

6. Running a Model Context Protocol (MCP) server

RAGLite comes with an MCP server implemented with FastMCP that exposes a search_knowledge_base tool. To use the server:

  • Install Claude desktop

  • Install uv so that Claude desktop can start the server

  • Configure Claude desktop to use uv to start the MCP server with:

raglite \
    --db-url sqlite:///raglite.db \
    --llm llama-cpp-python/bartowski/Llama-3.2-3B-Instruct-GGUF/*Q4_K_M.gguf@4096 \
    --embedder llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@1024 \
    mcp install

To use an API-based LLM, make sure to include your credentials in a .env file or supply them inline:

OPENAI_API_KEY=sk-... raglite --llm gpt-4o-mini --embedder text-embedding-3-large mcp install

Now, when you start Claude desktop, you should see a 🔨 icon at the bottom right of your prompt, indicating that Claude has successfully connected to the MCP server.

When relevant, Claude will suggest using the search_knowledge_base tool that the MCP server provides. You can also explicitly ask Claude to search the knowledge base if you want to be certain that it does.

7. Serving a customizable ChatGPT-like frontend

If you installed the chainlit extra, you can serve a customizable ChatGPT-like frontend with:

raglite chainlit

The application is also deployable to the web, Slack, and Teams.

You can specify the database URL, LLM, and embedder directly in the Chainlit frontend, or with the CLI as follows:

raglite chainlit \
    --db_url sqlite:///raglite.sqlite \
    --llm llama-cpp-python/bartowski/Llama-3.2-3B-Instruct-GGUF/*Q4_K_M.gguf@4096 \
    --embedder llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@1024

To use an API-based LLM, make sure to include your credentials in a .env file or supply them inline:

OPENAI_API_KEY=sk-... raglite chainlit --llm gpt-4o-mini --embedder text-embedding-3-large

Conclusion: Simplified AI-Powered Retrieval

In this guide, we’ve taken you through the process of building a powerful and efficient RAG pipeline using RAGLite. From configuring your LLM and database to implementing advanced retrieval strategies, we’ve covered everything you need to leverage the full potential of Retrieval-Augmented Generation. Whether you’re working on a personal project or scaling for enterprise use, RAGLite offers the flexibility and performance necessary for building high-quality RAG-based applications.

By incorporating semantic chunking, reranking models, optimal query adapters, and evaluation mechanisms, you can fine-tune your pipeline for maximum retrieval accuracy and generation quality. Additionally, with features like customizable frontends and support for both remote and local models, RAGLite ensures that you have a robust toolkit to build, deploy, and scale your RAG applications efficiently.

We hope this guide empowers you to create your own innovative solutions with RAGLite. Dive into the world of Retrieval-Augmented Generation and unlock new possibilities for data-driven insights and enhanced user experiences!

Ready to transform your AI applications with RAGLite?
Get started today and unlock the full potential of Retrieval-Augmented Generation.

Created by Laurent Sorber, CTO & Founder of Superlinear, an AI consulting company.

Author(s):

Renaud Chrétien

Machine Learning Engineer


Locations

Brussels HQ

Central Gate

Cantersteen 47



1000 Brussels

Ghent

Planet Group Arena

Ottergemsesteenweg-Zuid 808 b300

9000 Gent

© 2024 Superlinear. All rights reserved.
