
Table of contents

Benchmarking Retrieval-Augmented Generation (RAG): Who wins in document retrieval?
What was tested?
1. Answer retrieval 
2. Document retrieval
Evaluation metric
Tools evaluated in the benchmark
Benchmark results 
Benchmark 1 - Answer Retrieval: LegalBench-RAG (CUAD)
Benchmark 2 - Document Retrieval: HotpotQA & MS MARCO
Executive-level takeaways
Final thoughts


Benchmarking Retrieval-Augmented Generation (RAG): Who wins in document retrieval?

Published on: 05 Sept 2025

We benchmarked leading Retrieval-Augmented Generation (RAG) tools, including RAGLite, OpenAI, Azure AI Search, and LlamaIndex, across industry datasets. Discover which systems consistently deliver accurate results, where others fall short, and why retrieval choice is a strategic decision.

As Retrieval-Augmented Generation (RAG) becomes the backbone of intelligent enterprise search and AI assistants, choosing the right retrieval system can make or break your product’s performance. Whether you're an executive evaluating ROI or a technical leader optimizing your RAG stack, this benchmark study provides actionable insights on which tools deliver the best performance. 

What was tested?

We evaluated four different RAG tools that, based on an input query, retrieve relevant documents for AI systems such as legal AI assistants, customer support bots, and enterprise search.

The study focused on two types of RAG benchmarks: answer and document retrieval. In both, queries are passed through the retrieval tool, but they differ in the level of granularity used to evaluate the outputs.

1. Answer retrieval 

How well does the system pull out a specific answer from documents?

  • Labels: Every input query is accompanied by a reference to the precise text segments within the documents that contain the answer.

  • Goal: Extract precise answers from documents, especially critical in legal, finance, and compliance use cases.

  • Benchmark used: LegalBench-RAG, based on the Contract Understanding Atticus Dataset (CUAD).

2. Document retrieval

How well does the system find the right documents?

  • Labels: Every input query is accompanied by a set of relevant documents. The answer can be found in those documents, but its exact location is not labeled.

  • Goal: Identify the most relevant documents for a given query, useful in broader search applications.

  • Benchmarks used:

    • HotpotQA: A dataset containing multi-hop questions that require finding multiple documents to construct the answer.

    • MS MARCO (Microsoft Machine Reading Comprehension): Microsoft's large-scale, publicly available question-answering dataset containing real Bing search queries and human-generated answers.

Evaluation metric

We used Mean Average Precision over the top 10 retrieved documents/chunks (MAP@10) as the evaluation metric. MAP@10 ranges between 0 and 1 and increases when more relevant documents appear in the top 10 and are ranked closer to the top.
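
To make the metric concrete, below is a minimal Python sketch of how MAP@10 can be computed. This is a generic, illustrative implementation rather than the evaluation code used in this study, and it assumes the common convention of normalizing each query's average precision by min(number of relevant items, 10).

```python
def average_precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Average precision over the top-k retrieved items for a single query."""
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank  # precision at this rank, counted only on hits
    # Normalize by the best achievable number of hits within the top k.
    return score / min(len(relevant), k)


def map_at_k(queries: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """Mean Average Precision at k over (retrieved, relevant) pairs, one per query."""
    if not queries:
        return 0.0
    return sum(average_precision_at_k(r, rel, k) for r, rel in queries) / len(queries)


# Toy example: two queries with their retrieved rankings and ground-truth labels.
runs = [
    (["d1", "d7", "d3"], {"d1", "d3"}),  # relevant docs retrieved at ranks 1 and 3
    (["d9", "d2"], {"d2"}),              # relevant doc retrieved at rank 2
]
print(round(map_at_k(runs, k=10), 3))  # ≈ 0.667
```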

Tools evaluated in the benchmark

The retrieval systems and configurations tested are:

  • RAGLite (by Superlinear)

    • Vector search

    • Vector search + Query adapter

    • Vector search + Cohere reranking

  • OpenAI Vector Store

    • Default configuration (includes reranking)

  • LlamaIndex

    • Vector search (using SentenceSplitter + FaissVectorStore)

  • Azure AI Search

    • Vector search (using SplitSkill + VectorSearch)  

    • Vector search + Reranking (using SplitSkill + VectorSearch + SemanticSearch)


All systems used OpenAI’s text-embedding-3-large for embeddings and a similar chunk size (≈2,048 tokens) to keep results comparable.
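
To make the setup concrete, here is an illustrative sketch of one of the configurations above: LlamaIndex vector search with a SentenceSplitter and a FaissVectorStore, embedding with OpenAI's text-embedding-3-large and ~2,048-token chunks. It assumes recent llama-index and faiss packages; the corpus directory and query are placeholders, and it is a sketch of the kind of configuration described here, not the exact benchmark code.

```python
import faiss
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore

# text-embedding-3-large produces 3072-dimensional vectors; OpenAI embeddings are
# unit-normalized, so inner-product search behaves like cosine similarity.
faiss_index = faiss.IndexFlatIP(3072)
storage_context = StorageContext.from_defaults(
    vector_store=FaissVectorStore(faiss_index=faiss_index)
)

documents = SimpleDirectoryReader("corpus/").load_data()  # placeholder corpus folder
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=OpenAIEmbedding(model="text-embedding-3-large"),
    transformations=[SentenceSplitter(chunk_size=2048, chunk_overlap=0)],
)

# Retrieve the top 10 chunks for a query (MAP@10 is computed over such rankings).
retriever = index.as_retriever(similarity_top_k=10)
nodes = retriever.retrieve("Which party can terminate the agreement, and under what conditions?")
for node in nodes:
    print(round(node.score, 3), node.text[:80])
```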

Benchmark results 

Benchmark 1 - Answer Retrieval: LegalBench-RAG (CUAD)

[Figure: accuracy (MAP@10) for each system on the Contract Understanding Atticus Dataset (CUAD).]

Systems compared (MAP@10):

  • RAGLite with Cohere reranking: 🥇 71.9% (highest performance)

  • RAGLite (no reranking) + query adapter: 65.9%

  • OpenAI Vector Store with OpenAI reranking: 65.8%

  • RAGLite (no reranking): 65.2%

  • Azure AI Search with semantic search: 43.7%

  • LlamaIndex: 39%

  • Azure AI Search without semantic search: 27.2%

Observations 

RAGLite delivers the strongest performance overall. 

Importantly, RAGLite achieves a score of 65.2% (65.9% with the query adapter) without any reranking, while OpenAI's Vector Store only reaches a similar level with its own built-in reranker (which we don't have visibility into). This shows that RAGLite's base retrieval is inherently strong: even without reranking it matches OpenAI's setup, and with the query adapter it edges ahead.

LlamaIndex and Azure AI Search perform poorly in this benchmark.

Key takeaways 

  1. If you're extracting exact answers (e.g., legal clauses), RAGLite with reranking is the most accurate choice, outperforming even commercial solutions.

  2. Even without reranking, RAGLite performs on par with OpenAI Vector Store with reranking, showing the importance of base retriever quality.

Benchmark 2 - Document Retrieval: HotpotQA & MS MARCO

[Figure: accuracy (MAP@10) for document retrieval on the HotpotQA and MS MARCO datasets.]

HotpotQA (MAP@10):

  • RAGLite with Cohere reranking: 🥇 90.96%

  • RAGLite without reranking: 81.09%

  • LlamaIndex: 77.27%

  • OpenAI Vector Store: 70.78%

MS MARCO (MAP@10):

  • RAGLite with reranking: 🥇 66.29%

  • RAGLite without reranking: 54.82%

  • LlamaIndex: 53.67%

  • OpenAI Vector Store: 47.60%

Observations

  • Again, RAGLite with Cohere reranking leads across benchmarks.

  • OpenAI Vector Store underperforms in document retrieval, even with reranking.

  • LlamaIndex is stronger for document retrieval than answer retrieval.

Key takeaways

  1. OpenAI's vector store lags behind on document-heavy tasks. RAGLite again leads, with roughly 91% accuracy (MAP@10) on complex multi-hop queries.

  2. Reranking plays a significant role. In MS MARCO, Cohere reranking boosts RAGLite by over 11 percentage points (see the sketch below). LlamaIndex performs solidly, though it trails RAGLite.
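
To show what that reranking step looks like in practice, here is a minimal sketch of the two-stage pattern: a first-stage retriever (for example, vector search) returns candidate chunks, and a reranker re-scores them against the query before the top 10 are kept. It uses Cohere's rerank endpoint in the spirit of the RAGLite + Cohere configuration; the API key, query, and candidate chunks are placeholders, and this is not the benchmark code itself.

```python
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")  # placeholder API key

query = "Under what conditions can either party terminate the agreement?"

# Candidate chunks as returned by a first-stage retriever (placeholders here).
candidate_chunks = [
    "Either party may terminate this Agreement upon thirty (30) days' written notice.",
    "The Supplier shall deliver the goods to the address stated in the purchase order.",
    "This Agreement shall be governed by the laws of the State of Delaware.",
]

# The reranker scores each candidate against the query and returns them re-ordered.
response = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=candidate_chunks,
    top_n=10,  # keep at most the top 10, matching the MAP@10 cutoff
)
for result in response.results:
    print(round(result.relevance_score, 3), candidate_chunks[result.index][:60])
```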

Executive-level takeaways


1. RAGLite consistently outperforms competitors
Across all tests, RAGLite (especially with Cohere reranking) delivers best-in-class accuracy, meaning it retrieves better answers and documents. Think of it as being more precise and relevant: fewer missed answers, less noise.

2. OpenAI's built-in vector store underperforms
While OpenAI is powerful for generating answers, its document retrieval system lags behind, especially on complex document tasks.

3. Semantic search boosts quality, but costs more
Tools like Azure AI Search improve accuracy when “semantic search” is enabled, but the feature is more expensive to run.

4. Re-ranking is a game-changer
Tools that re-evaluate search results (reranking) see a 15–25% performance boost, making reranking an important optimization lever.

Final thoughts

Choosing the right retrieval system in a RAG pipeline isn't just a technical decision; it's a strategic one. This benchmark shows that RAGLite consistently delivers top-tier performance, especially when paired with modern reranking techniques. Meanwhile, popular solutions like OpenAI Vector Store offer convenience but may underdeliver in precision-critical or document-heavy applications.

Need help building or optimizing your RAG stack?
Discover our RAGLite tutorial to implement it, or try it out directly on GitHub.

Author(s):

Stijn Goossens

Solution Architect

Thomas Delsart

Machine Learning Engineer

Justine Demarque

Marketing Executive
