Retrieval-Augmented Generation (RAG) has emerged as one of the most commercially successful applications of generative AI because it revolutionizes how organizations interact with their private data. Its magic lies in connecting an organization’s knowledge base to a large language model, enabling users to query complex information using natural language.
The journey from this vision to practical implementation is fraught with challenges. Traditional RAG approaches often suffer from critical limitations: semantic information loss during document processing, ineffective retrieval mechanisms, computational overhead, and complex integration requirements.
AI engineers have found themselves wrestling with frameworks that are either too rigid or require extensive customization.
This is where RAGLite enters the scene: a lightweight, high-performance implementation of the RAG framework designed to address these fundamental challenges. A RAG pipeline typically comprises several subproblems, each of which affects the quality of the application as a whole, and RAGLite aims to offer the best possible solution to each of them. More than just another tool, RAGLite represents a thoughtful approach to democratizing advanced information retrieval and generation capabilities. In this first article, we present what RAGLite is and the challenges it addresses.
What is RAGLite?
Retrieval-Augmented Generation (RAG) has become a popular technique for improving AI outputs by grounding responses in external knowledge bases. However, traditional RAG implementations can be resource-intensive. This is where RAGLite shines: it offers a lightweight, efficient alternative.
RAGLite is a lightweight and highly configurable Python toolkit designed for RAG workflows. It enables seamless integration of retrieval systems with language models to produce contextually relevant, augmented text generation. This approach combines a database search component with a generative AI model, allowing you to retrieve relevant information from a dataset or knowledge base and use it to enhance the quality of text generation.
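To make this concrete, here is a minimal configuration sketch based on RAGLite’s documented API; the database URL and model names are illustrative placeholders, not prescriptions:

```python
from raglite import RAGLiteConfig

# A minimal RAGLite configuration (values are illustrative).
my_config = RAGLiteConfig(
    db_url="sqlite:///raglite.db",      # where documents and embeddings are stored
    llm="gpt-4o-mini",                  # any LLM supported by LiteLLM
    embedder="text-embedding-3-large",  # any embedder supported by LiteLLM
)
```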
Why would you want to use it?
To understand the benefits of RAGLite, it is essential to grasp how RAG works and the shortcomings of a naive implementation.
In simple terms, RAG connects a powerful technology, a large language model (LLM), with a private company knowledge base, allowing for bespoke applications that enable searching private (structured or unstructured) data using natural language. A basic implementation of a RAG application is outlined below and operates as follows:
1. User Query: The user asks a question, which is embedded and compared with the embeddings of document chunks in a pre-populated vector database.
2. Retrieval: The system performs a lookup and returns the chunks most relevant to answering the question.
3. Augmented Context: These retrieved chunks are provided as augmented context along with the initial question to the LLM.
4. Response Generation: The LLM generates a sourced answer based on the provided context.
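To make these four steps concrete, here is a schematic sketch of such a naive pipeline in Python. Everything in it (embed, vector_db, llm) is a hypothetical stand-in for illustration, not part of any particular library:

```python
# Schematic naive RAG loop; embed(), vector_db, and llm are hypothetical stand-ins.
def naive_rag(question: str, embed, vector_db, llm, k: int = 5) -> str:
    query_embedding = embed(question)                      # 1. Embed the user query
    chunks = vector_db.search(query_embedding, top_k=k)    # 2. Retrieve the k nearest chunks
    context = "\n\n".join(chunk.text for chunk in chunks)  # 3. Build the augmented context
    prompt = f"Using only this context:\n{context}\n\nAnswer the question: {question}"
    return llm.complete(prompt)                            # 4. Generate a sourced answer
```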
While straightforward and seemingly effective at first glance, this naive approach often underperforms upon closer inspection. The quality of a RAG-based pipeline largely depends on the retrieval engine’s effectiveness. LLMs can generate high-quality answers when supplied with relevant information, but if the retrieval mechanism fails to fetch pertinent data, the LLM cannot reliably answer the question.
RAGLite addresses this issue by providing a state-of-the-art retriever that ensures relevant information is consistently retrieved.
Each step in the RAG pipeline above is a subproblem to be solved, and each one affects the quality of the RAG application as a whole. While most RAG applications today use components provided by LangChain or LlamaIndex, RAGLite’s goal is to offer the best possible solution to each of these subproblems.
The benefits of RAGLite
In the following sections, we explore how RAGLite tackles each of these subproblems, thereby boosting the performance of the pipeline as a whole.
Indexing documents in the database
RAGLite offers robust document conversion features, including PDF to Markdown conversion using pdftext and pypdfium2, which efficiently extract text from PDF files and transform it into clean, readable Markdown. Additionally, RAGLite can convert any input document (such as Word, HTML, or LaTeX) into Markdown using the versatile Pandoc, ensuring seamless compatibility with various formats for easy editing and use in Markdown-supported environments.
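For example, indexing a document is a single call, reusing the my_config defined earlier; the file path is illustrative:

```python
from pathlib import Path

from raglite import insert_document

# Convert the document to Markdown, chunk it, embed it, and index it.
insert_document(Path("annual_report.pdf"), config=my_config)
```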
Chunking strategy
One common challenge in RAG applications is chunking documents to pre-populate the vector store. Various chunking techniques include:
Level 1: fixed-size chunking. Divides documents into segments of the same size (e.g., a fixed number of characters, words, sentences, or tokens). While simple to implement, it disrupts semantic coherence by ignoring document structure.
Level 2: recursive chunking. Splits text into smaller chunks using separators, recursively adjusting until chunks reach the desired size. This better captures document structure but can split cohesive ideas or combine unrelated content, disrupting context.
Level 3: document-based chunking. Treats entire documents as single, coherent processing units without splitting them. While simple to implement and effective for highly structured documents, it offers limited specificity and can run into scalability issues.
Level 4: semantic chunking. Ensures each chunk retains contextual relevance and meaning by breaking documents into segments that capture self-contained ideas or concepts, optimizing chunks for AI interpretation.
RAGLite implements Level 4 semantic chunking out of the box. This feature is unique to RAGLite: it formulates chunking as a binary integer programming problem, whose solution is a globally optimal partitioning of the document into semantic chunks that maximizes the chunks’ semantic cohesiveness while ensuring that no chunk exceeds a given number of tokens.
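The maximum chunk size can be bounded via the configuration; the chunk_max_size parameter below is taken from RAGLite’s README, so treat it as an assumption to verify against your installed version:

```python
from raglite import RAGLiteConfig

# Semantic chunking runs out of the box; chunk_max_size bounds how large
# a chunk may grow (parameter name per RAGLite's README).
my_config = RAGLiteConfig(
    db_url="sqlite:///raglite.db",
    chunk_max_size=1440,  # upper bound on the size of each chunk
)
```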
Additionally, RAGLite supports multi-vector chunk embedding together with two recently developed RAG techniques, late chunking and contextual chunk headings:
Multi-vector chunk embedding: Represents a text chunk with multiple vectors, capturing various semantic dimensions for a richer, more nuanced understanding.
Late chunking: Postpones splitting text until after initial processing, ensuring chunks are informed by the full context and structure.
Contextual chunk headings: Generates descriptive summaries for each chunk, providing quick insights into their content.
These combined techniques enhance semantic coherence, improve usability, and are particularly useful for applications like search, summarization, and content navigation. They enable precise text processing and meaningful representation of complex information.
Supported databases
RAGLite offers the choice between PostgreSQL and SQLite.
PostgreSQL: Provides proven reliability, scalability, and strong support features like backups, making it a safe, long-term choice. Its established reputation ensures stability for the foreseeable future.
SQLite: Ideal for smaller applications, local setups, or quick prototyping, offering excellent performance and simplicity.
While vector databases excel in specific use cases like machine-learning-driven search, they are newer and may lack the maturity and comprehensive support of PostgreSQL. If you prioritize high availability, stability, scalability, and long-term support, PostgreSQL is likely the better option. For lightweight applications, SQLite remains an excellent choice and is fully supported by RAGLite.
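Switching between the two is a one-line change in the configuration; the connection URLs below are illustrative:

```python
from raglite import RAGLiteConfig

# PostgreSQL for production-grade deployments (URL is illustrative)...
pg_config = RAGLiteConfig(db_url="postgresql://user:password@localhost:5432/raglite")

# ...or SQLite for local prototyping.
sqlite_config = RAGLiteConfig(db_url="sqlite:///raglite.db")
```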
Semantic query adaptation: enhancing the search process
Another subproblem that RAGLite aims to solve optimally is query adaptation: transforming or modifying a query to better align with a target representation or vector space, enhancing the retrieval of relevant information. This technique is crucial in RAG, where precise query-document alignment is essential for generating accurate responses. Another of RAGLite’s unique contributions is an optimal closed-form linear query adapter, obtained as the solution to an orthogonal Procrustes problem. It transforms the query embedding within the latent space used for document retrieval, aligning it closely with the representations of relevant documents.
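For reference, this is the standard orthogonal Procrustes formulation, where the columns of Q are query embeddings and the columns of D are the embeddings of their relevant documents (textbook notation, not RAGLite’s own):

```latex
% Orthogonal Procrustes: find the orthogonal matrix W that best maps the
% query embeddings Q onto their relevant document embeddings D.
W^{\star} \;=\; \operatorname*{arg\,min}_{W^{\top} W = I} \,\lVert W Q - D \rVert_F,
\qquad
W^{\star} = U V^{\top}
\quad\text{where}\quad
U \Sigma V^{\top} = \operatorname{SVD}\!\bigl(D Q^{\top}\bigr).
```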
By solving this problem, the system ensures that the query is mapped efficiently and accurately, improving both retrieval quality and the relevance of the generated output. The orthogonal Procrustes method aligns the query vector with the document embedding space by finding the optimal orthogonal transformation (a rotation or reflection), ensuring semantic consistency and minimizing mismatches between the query’s meaning and the document space. This approach is efficient and robust, and it enhances the relevance of retrieved results. At the same time, RAGLite takes care to adapt the query embedding only minimally, so that incorrectly ranked search results are corrected without disturbing those that were already ranked correctly.
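In practice, RAGLite’s README describes fitting this adapter on a set of generated evals; a sketch, with function names taken from that README:

```python
from raglite import insert_evals, update_query_adapter

# Generate question-answer evals from the indexed corpus, then fit and
# store the closed-form query adapter (function names per RAGLite's README).
insert_evals(num_evals=100, config=my_config)
update_query_adapter(config=my_config)
```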
Hybrid search: when precision meets relevance
Traditional lookup methods like keyword search efficiently find specific strings or substrings, such as names or dates. In contrast, vector search excels at finding documents that are semantically relevant to the query but use different wording. Hybrid search, natively supported by RAGLite, combines traditional keyword-based search with semantic search to leverage the strengths of both approaches. It matches exact terms using keyword search while also retrieving contextually relevant results through semantic understanding of the query. This ensures precise results for exact matches and broader relevance for nuanced queries, making it ideal for comprehensive and flexible information retrieval.
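A hedged sketch of running a hybrid search with RAGLite’s documented API; the query string is illustrative:

```python
from raglite import hybrid_search, retrieve_chunks

# Combine keyword and vector search, then fetch the matching chunks.
chunk_ids, scores = hybrid_search("How is revenue recognized?", num_results=10, config=my_config)
chunks = retrieve_chunks(chunk_ids, config=my_config)
```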
Efficient reranking with rerankers
While hybrid search combines the strengths of both keyword and vector search, it does not guarantee that the most relevant chunks appear in the top-k results, meaning they can be overlooked and never provided to the LLM. A reranker addresses this by using a model such as a cross-encoder to re-evaluate and reorder the candidate chunks found by hybrid search based on their relevance to the user query.
RAGLite supports any reranking model through its integration with the rerankers library, with the multilingual FlashRank as the default reranker. Using a reranker significantly enhances the quality of the retrieved chunks, ensuring that the most relevant information is prioritized and provided to the LLM for generating accurate and reliable responses.
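A sketch of configuring and applying a reranker, based on RAGLite’s README and the rerankers library, and reusing the chunks from the hybrid search sketch above; the model name is illustrative:

```python
from raglite import RAGLiteConfig, rerank_chunks
from rerankers import Reranker

# Use a FlashRank cross-encoder from the rerankers library (model name illustrative).
my_config = RAGLiteConfig(
    db_url="sqlite:///raglite.db",
    reranker=Reranker("ms-marco-MiniLM-L-12-v2", model_type="flashrank"),
)

# Re-evaluate and reorder the candidate chunks found by hybrid search.
chunks_reranked = rerank_chunks("How is revenue recognized?", chunks, config=my_config)
```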
Adaptive retrieval: dynamic assessment of the query complexity
Adaptive retrieval is a highly effective technique that enables retrieval-augmented LLMs to intelligently choose the retrieval strategy, from basic to advanced, best suited to the complexity and specific requirements of the query. This dynamic approach lets the model adjust its level of sophistication to deliver the most relevant and accurate responses based on the nature of the input. Thanks to a small language model acting as a classifier, the query’s complexity is dynamically assessed, and the most appropriate retrieval strategy, from no retrieval at all to a multi-step retrieval approach, is adopted. Adapting the retrieval strategy to the query’s complexity strengthens overall performance and avoids wasting computational resources on straightforward queries.
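RAGLite’s README illustrates this adaptive behavior through its rag function, roughly as follows; the exact signature may vary between versions, so treat this as a sketch:

```python
from raglite import rag

# RAGLite adaptively decides whether (and how much) to retrieve for this query
# (pattern per RAGLite's README; details may vary by version).
messages = [{"role": "user", "content": "Summarize our 2023 revenue drivers."}]
for update in rag(messages, config=my_config):
    print(update, end="")
```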
LLM provider
RAGLite natively supports any LLM provider thanks to LiteLLM, including local, lightweight models served via llama-cpp-python. This is particularly valuable for environments with limited resources or when total privacy is desired. By using local models, you maintain full control over data and performance, making the system not only more efficient but also highly customizable. This flexibility allows for the integration of cutting-edge query adaptation techniques while maintaining local, cost-effective execution for real-time applications.
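A fully local configuration might look as follows; the identifier format ("llama-cpp-python/&lt;repo&gt;/&lt;filename&gt;@&lt;context_size&gt;") and the specific models are taken from RAGLite’s README and should be treated as assumptions to verify:

```python
from raglite import RAGLiteConfig

# A fully local, private setup via llama-cpp-python (identifiers per RAGLite's
# README; verify against your installed version).
local_config = RAGLiteConfig(
    db_url="sqlite:///raglite.db",
    llm="llama-cpp-python/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/*Q4_K_M.gguf@8192",
    embedder="llama-cpp-python/lm-kit/bge-m3-gguf/*F16.gguf@1024",
)
```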
Evaluation of retriever and generator performance
Evaluating retrieval and generation performance with Ragas is crucial to ensure the system delivers accurate and relevant results. By assessing both the retrieval process (how well relevant data is fetched) and the generation process (how effectively responses are created), developers can identify areas for improvement, optimize pipelines, and enhance overall system efficiency. This evaluation ensures that the solution provides high-quality, reliable outputs, ultimately improving user satisfaction and the effectiveness of the application.
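A sketch of the evaluation loop, with function names taken from RAGLite’s README:

```python
from raglite import answer_evals, evaluate, insert_evals

# Generate evals from the corpus, answer them with the RAG pipeline,
# and score retrieval and generation quality with Ragas.
insert_evals(num_evals=100, config=my_config)
answered_evals_df = answer_evals(num_evals=10, config=my_config)
evaluation_df = evaluate(answered_evals_df, config=my_config)
```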
Boosting performance: fast and scalable
Performance is a core focus for RAGLite. It’s designed to be fast and efficient, with optimizations that allow it to scale from small prototypes to production-level systems without sacrificing quality. By using only lightweight and permissive open-source dependencies (avoiding heavy frameworks like PyTorch or LangChain), RAGLite ensures efficiency and ease of use. Additionally, it supports hardware acceleration with Metal on macOS and CUDA on Linux and Windows, boosting computational speed and performance.
Integration with ChatGPT-Like Frontends
RAGLite includes optional integration with frontend frameworks like Chainlit, enabling you to quickly deploy a ChatGPT-like interface. This allows users to interact with the AI model in a conversational manner, seamlessly combining document retrieval with AI-driven responses. Whether you’re building a chatbot, a virtual assistant, or any application that requires conversational AI, RAGLite offers the tools to create a responsive and intelligent user experience.
Built-in Model Context Protocol (MCP) server
RAGLite offers powerful support for Model Context Protocol (MCP) servers, a key feature that enhances the way applications connect to LLMs. Think of MCP as a standardized “connector” that allows LLMs to seamlessly access and interact with both local and remote data sources. This integration, implemented with FastMCP, enables RAGLite users to build more sophisticated agents and workflows by effortlessly connecting their LLMs to a wide range of tools and data, without worrying about vendor lock-in. It integrates natively with MCP clients such as Claude Desktop. With RAGLite’s MCP support, you get access to a growing list of pre-built integrations, enhanced flexibility to switch between LLM providers, and strong security practices for managing your data. By using RAGLite’s MCP integration, you can unlock the full potential of your LLMs, making them smarter and more responsive to the data that powers your workflows.
Improved Cost and Latency with a Prompt Caching-Aware Message Array Structure
RAGLite introduces a prompt caching-aware message array structure to optimize cost and latency. This innovative approach leverages caching to store and reuse prompts and intermediate results across multiple queries, reducing the need for redundant processing. By efficiently managing these cached elements, RAGLite significantly lowers compute usage and API token consumption. This not only reduces operational costs but also accelerates query response times, providing a smoother and faster user experience.
Improved Output Quality with Anthropic’s Long-Context Prompt Format
With support for Anthropic’s long-context prompt format, RAGLite ensures high-quality responses even for extended or complex queries. This format enables the system to utilize a larger context window effectively, preserving critical details and maintaining semantic coherence across lengthy inputs. By structuring and managing long-context prompts intelligently, RAGLite enhances the accuracy and richness of outputs, making it ideal for scenarios requiring detailed analysis or comprehensive text generation.
Conclusion: Simplified AI-Powered Retrieval
RAGLite is the ideal tool for developers who want to leverage the power of Retrieval-Augmented Generation without the complexity of larger, more resource-demanding frameworks. By addressing key subproblems in RAG architectures with carefully optimized solutions, RAGLite is tailored to deliver exceptional performance. With its efficient chunking strategy, hybrid search, semantic query adaptation, and fast reranking, RAGLite provides an accessible, high-performance solution for a wide range of AI applications.
By utilizing RAGLite, you can take full advantage of RAG’s capabilities while enjoying faster retrieval times, more relevant results, and a simpler, more streamlined implementation. Whether you’re working on a local application or a scalable production system, RAGLite offers the tools you need to bring your AI-driven projects to life.
Created by Laurent Sorber