As Retrieval-Augmented Generation (RAG) becomes the backbone of intelligent enterprise search and AI assistants, choosing the right retrieval system can make or break your product’s performance. Whether you're an executive evaluating ROI or a technical leader optimizing your RAG stack, this benchmark study provides actionable insights on which tools deliver the best performance.
What was tested?
We evaluated four different RAG tools that, based on an input query, retrieve relevant documents for AI systems such as legal AI assistants, customer support bots, and enterprise search.
The study focused on two types of RAG benchmarks: answer and document retrieval. In both, queries are passed through the retrieval tool, but they differ in the level of granularity used to evaluate the outputs.
1. Answer retrieval
How well does the system pull out a specific answer from documents?
Labels: Every input query is accompanied by a reference to the precise text segments within the documents that contain the answer.
Goal: Extract precise answers from documents, especially critical in legal, finance, and compliance use cases.
Benchmark used: LegalBench-RAG, based on the Contract Understanding Atticus Dataset (CUAD).
2. Document retrieval
How well does the system find the right documents?
Labels: Every input query is accompanied by a set of relevant documents. The answer can be found in those documents, but its exact location is not labeled (a schematic example of both label formats follows this list).
Goal: Identify the most relevant documents for a given query, useful in broader search applications.
Benchmarks used:
HotpotQA: A dataset containing multi-hop questions that require finding multiple documents to construct the answer.
MS MARCO (Microsoft Machine Reading Comprehension): Microsoft's large-scale, publicly available question-answering dataset containing real Bing search queries and human-generated answers.
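To make the difference in label granularity concrete, here is a schematic sketch of what one labeled example could look like in each benchmark type. The field names and IDs are illustrative assumptions, not the actual schemas of CUAD, HotpotQA, or MS MARCO.

```python
# Illustrative only: field names are invented to show the difference in label
# granularity; they are not the real schemas of CUAD, HotpotQA, or MS MARCO.

answer_retrieval_example = {
    "query": "What is the termination notice period?",
    "labels": [
        {
            "document_id": "contract_042.pdf",
            "span": {"start_char": 10_512, "end_char": 10_689},  # exact answer location
        }
    ],
}

document_retrieval_example = {
    "query": "Which company acquired the studio that made the film?",
    "labels": [
        {"document_id": "doc_881"},  # relevant document; answer location not labeled
        {"document_id": "doc_903"},
    ],
}
```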
Evaluation metric
We used Mean Average Precision, limited to the top 10 retrieved documents/chunks (MAP@10), as the evaluation metric. It ranges from 0 to 1 and is higher when more relevant documents are retrieved and ranked near the top.
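For readers who want to reproduce the metric, the sketch below shows one common way to compute MAP@10, assuming each query comes with a ranked list of retrieved IDs and a set of relevant IDs. It is a minimal reference implementation of the standard formula, not the evaluation code used in the study.

```python
def average_precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Average precision over the top-k retrieved items for a single query."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Normalize by the number of relevant items that could appear in the top k.
    return precision_sum / min(len(relevant), k)

def mean_average_precision_at_k(results: list[tuple[list[str], set[str]]], k: int = 10) -> float:
    """MAP@k: mean of per-query average precision."""
    return sum(average_precision_at_k(r, rel, k) for r, rel in results) / len(results)

# Example: one query where the first and third retrieved chunks are relevant.
print(average_precision_at_k(["c1", "c7", "c3"], {"c1", "c3", "c9"}))  # (1/1 + 2/3) / 3 ≈ 0.56
```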
Tools evaluated in the benchmark
The retrieval systems and configurations tested are:
RAGLite (by Superlinear)
- Vector search
- Vector search + Query adapter
- Vector search + Cohere reranking
OpenAI Vector Store
- Default configuration (includes reranking)
LlamaIndex
- Vector search (using SentenceSplitter + FaissVectorStore)
Azure AI Search
- Vector search (using SplitSkill + VectorSearch)
- Vector search + Reranking (using SplitSkill + VectorSearch + SemanticSearch)
All systems used OpenAI’s text-embedding-3-large for embeddings and a similar chunk size (≈2,048 tokens) to keep results comparable.
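As a rough illustration of this shared setup, the sketch below chunks a document into ≈2,048-token pieces with tiktoken and embeds them with text-embedding-3-large via the OpenAI Python SDK. It is a simplified stand-in for what each tool does internally (the file name is made up); it is not the indexing code of any of the systems tested.

```python
import tiktoken
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by OpenAI's embedding models

def chunk_text(text: str, max_tokens: int = 2048) -> list[str]:
    """Split a document into consecutive chunks of at most max_tokens tokens."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i : i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed each chunk with the same model used across all benchmarked systems."""
    response = client.embeddings.create(model="text-embedding-3-large", input=chunks)
    return [item.embedding for item in response.data]

# "contract_042.txt" is a placeholder document name for illustration.
chunks = chunk_text(open("contract_042.txt").read())
vectors = embed_chunks(chunks)  # ready to be stored in whichever vector index a tool uses
```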
Benchmark results
Benchmark 1 - Answer Retrieval: LegalBench-RAG (CUAD)

This graph shows the accuracy (MAP@10) of the different systems on the Contract Understanding Atticus Dataset (CUAD).
Systems compared:
| System | MAP@10 |
| --- | --- |
| RAGLite with Cohere reranking | 🥇 71.9% (highest performance) |
| RAGLite (no reranking) + query adapter | 65.9% |
| OpenAI Vector Store with OpenAI reranking | 65.8% |
| RAGLite (no reranking) | 65.2% |
| Azure AI Search with semantic search | 43.7% |
| LlamaIndex | 39% |
| Azure AI Search without semantic search | 27.2% |
Observations
RAGLite delivers the strongest performance overall.
Importantly, RAGLite achieves a score of 65.2% (65.9% with the query adapter) without any reranking, while OpenAI’s Vector Store only reaches a similar level with its own built-in reranker, whose internals we have no visibility into. This shows that RAGLite’s base retriever is inherently strong: even without reranking it is on par with OpenAI’s reranked setup, and with the query adapter it edges ahead.
LlamaIndex and Azure AI Search perform poorly in this benchmark.
Key takeaways
If you're extracting exact answers (e.g., legal clauses), RAGLite with reranking is the most accurate choice, outperforming even commercial solutions.
Even without reranking, RAGLite performs on par with OpenAI Vector Store with reranking, showing the importance of base retriever quality.
Benchmark 2 - Document Retrieval: HotpotQA & MS MARCO

This graph compares accuracy for document retrieval tasks on the HotpotQA and MS MARCO datasets.
HotpotQA:

| System | MAP@10 |
| --- | --- |
| RAGLite with Cohere reranking | 🥇 90.96% |
| RAGLite without reranking | 81.09% |
| LlamaIndex | 77.27% |
| OpenAI Vector Store | 70.78% |

MS MARCO:

| System | MAP@10 |
| --- | --- |
| RAGLite with reranking | 🥇 66.29% |
| RAGLite without reranking | 54.82% |
| LlamaIndex | 53.67% |
| OpenAI Vector Store | 47.60% |
Observations
Again, RAGLite with Cohere reranking leads across benchmarks.
OpenAI Vector Store underperforms in document retrieval, even with reranking.
LlamaIndex is stronger for document retrieval than answer retrieval.
Key takeaways
OpenAI's vector store lags behind on document-heavy tasks. RAGLite again leads, reaching roughly 91% accuracy on complex multi-hop queries.
Reranking plays a significant role: on MS MARCO, Cohere reranking boosts RAGLite by more than 11 percentage points (a sketch of this second-pass step follows these takeaways).
LlamaIndex performs solidly, though it trails RAGLite.
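To show what this second-pass reranking looks like in practice, here is a hedged sketch that re-scores vector-search candidates with Cohere's rerank endpoint. The model name and the surrounding helper are illustrative assumptions; this is not RAGLite's internal implementation.

```python
import os

import cohere  # pip install cohere

co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])  # expects your key in this env var

def rerank_candidates(query: str, candidate_chunks: list[str], top_k: int = 10) -> list[str]:
    """Reorder first-pass vector-search candidates by query-specific relevance.

    `candidate_chunks` is assumed to come from an earlier vector search (not shown);
    the reranker scores each (query, chunk) pair and keeps the best top_k chunks.
    """
    response = co.rerank(
        model="rerank-english-v3.0",  # assumed model name; check Cohere's docs for current options
        query=query,
        documents=candidate_chunks,
        top_n=top_k,
    )
    return [candidate_chunks[result.index] for result in response.results]
```

The key point is that reranking is a drop-in second stage: the first-pass retriever stays unchanged, and only its candidate list is re-scored before the top 10 results are returned.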
Executive-level takeaways
1. RAGLite consistently outperforms competitors
Across all tests, RAGLite (especially with Cohere reranking) delivers best-in-class accuracy, meaning it retrieves better answers and documents. Think of it as being more precise and relevant: fewer missed answers, less noise.
2. OpenAI’s built-in vector store underperforms
While OpenAI is powerful for generating answers, its document retrieval system lags behind, especially on complex document tasks.
3. Semantic search boosts quality, but costs more
Tools like Azure AI Search improve accuracy when “semantic search” is enabled, but this comes at a higher running cost.
4. Re-ranking is a game-changer
Tools that re-evaluate search results (reranking) see a 15–25% performance boost, an important optimization lever.
Final thoughts
Choosing the right retrieval system in a RAG pipeline isn’t just a technical decision; it’s a strategic one. This benchmark shows that RAGLite consistently delivers top-tier performance, especially when paired with modern reranking techniques. Meanwhile, popular solutions like OpenAI Vector Store offer convenience but may underdeliver in precision-critical or document-heavy applications.
Need help building or optimizing your RAG stack?
Discover our RAGLite tutorial to implement it, or try it out directly on GitHub.