
Searching for Best Practices in Retrieval-Augmented Generation

Xiaohua Wang,  Zhenghua Wang,  Xuan Gao,  Feiran Zhang,

Yixin Wu,  Zhibo Xu,  Tianyuan Shi,  Zhengyuan Wang,  Shizheng Li,

Qi Qian,  Ruicheng Yin,  Changze Lv,  Xiaoqing Zheng,  Xuanjing Huang

School of Computer Science, Fudan University, Shanghai, China
Shanghai Key Laboratory of Intelligent Information Processing
{xiaohuawang22,zhenghuawang23}@m.fudan.edu.cn
{zhengxq,xjhuang}@fudan.edu.cn
Corresponding Author.

Abstract

Retrieval-augmented generation (RAG) techniques have proven to be effective in integrating up-to-date information, mitigating hallucinations, and enhancing response quality, particularly in specialized domains. While many RAG approaches have been proposed to enhance large language models through query-dependent retrievals, these approaches still suffer from their complex implementation and prolonged response times. Typically, a RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments, we suggest several strategies for deploying RAG that balance both performance and efficiency. Moreover, we demonstrate that multimodal retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal content using a “retrieval as generation” strategy. Resources are available at https://github.com/FudanDNN-NLP/RAG.

Figure 1: Retrieval-augmented generation workflow. This study investigates the contribution of each component and provides insights into optimal RAG practices through extensive experimentation. The optional methods considered for each component are indicated in bold fonts, while the methods underlined indicate the default choice for individual modules. The methods indicated in blue font denote the best-performing selections identified empirically.

1 Introduction

Generative large language models are prone to producing outdated information or fabricating facts, although they were aligned with human preferences by reinforcement learning [1] or lightweight alternatives [2, 3, 4, 5]. Retrieval-augmented generation (RAG) techniques address these issues by combining the strengths of pretraining and retrieval-based models, thereby providing a robust framework for enhancing model performance [6]. Furthermore, RAG enables rapid deployment of applications for specific organizations and domains without necessitating updates to the model parameters, as long as query-related documents are provided.

Many RAG approaches have been proposed to enhance large language models (LLMs) through query-dependent retrievals [7, 8, 6]. A typical RAG workflow usually contains multiple intervening processing steps: query classification (determining whether retrieval is necessary for a given input query), retrieval (efficiently obtaining relevant documents for the query), reranking (refining the order of retrieved documents based on their relevance to the query), repacking (organizing the retrieved documents into a structured form for better generation), and summarization (extracting key information for response generation from the repacked document and eliminating redundancies). Implementing RAG also requires decisions on the ways to properly split documents into chunks, the types of embeddings to use for semantically representing these chunks, the choice of vector databases to efficiently store feature representations, and the methods for effectively fine-tuning LLMs (see Figure 1).

What adds complexity and challenge is the variability in implementing each processing step. For example, in retrieving relevant documents for an input query, various methods can be employed. One approach involves rewriting the query first and using the rewritten queries for retrieval [9]. Alternatively, pseudo-responses to the query can be generated first, and the similarity between these pseudo-responses and the backend documents can be compared for retrieval [10]. Another option is to directly employ embedding models, typically trained in a contrastive manner using positive and negative query-response pairs [11, 12]. The techniques chosen for each step and their combinations significantly impact both the effectiveness and efficiency of RAG systems. To the best of our knowledge, there has been no systematic effort to pursue the optimal implementation of RAG, particularly for the entire RAG workflow.

In this study, we aim to identify the best practices for RAG through extensive experimentation. Given the infeasibility of testing all possible combinations of these methods, we adopt a three-step approach to identify optimal RAG practices. First, we compare representative methods for each RAG step (or module) and select up to three of the best-performing methods. Next, we evaluate the impact of each method on the overall RAG performance by testing one method at a time for an individual step, while keeping the other RAG modules unchanged. This allows us to determine the most effective method for each step based on its contribution and interaction with other modules during response generation. Once the best method is chosen for a module, it is used in subsequent experiments. Finally, we empirically explore a few promising combinations suitable for different application scenarios where efficiency might be prioritized over performance, or vice versa. Based on these findings, we suggest several strategies for deploying RAG that balance both performance and efficiency.

The contributions of this study are three-fold:

  • Through extensive experimentation, we thoroughly investigated existing RAG approaches and their combinations to identify and recommend optimal RAG practices.

  • We introduce a comprehensive framework of evaluation metrics and corresponding datasets to comprehensively assess the performance of retrieval-augmented generation models, covering general, specialized (or domain-specific), and RAG-related capabilities.

  • We demonstrate that the integration of multimodal retrieval techniques can substantially improve question-answering capabilities on visual inputs and speed up the generation of multimodal content through a strategy of “retrieval as generation”.

2 Related Work

Ensuring the accuracy of responses generated by Large Language Models (LLMs) such as ChatGPT [13] and LLaMA [14] is essential. However, simply enlarging model size does not fundamentally address the issue of hallucinations [15, 16], especially in knowledge-intensive tasks and specialized domains. Retrieval-augmented generation (RAG) addresses these challenges by retrieving relevant documents from external knowledge bases, providing accurate, real-time, domain-specific context to LLMs [6]. Previous works have optimized the RAG pipeline through query and retrieval transformations, enhancing retriever performance, and fine-tuning both the retriever and generator. These optimizations improve the interaction between input queries, retrieval mechanisms, and generation processes, ensuring the accuracy and relevance of responses.

2.1 Query and Retrieval Transformation

Effective retrieval requires queries that are accurate, clear, and detailed. Even when queries are converted into embeddings, semantic differences between queries and relevant documents can persist. Previous works have explored methods to enhance query information through query transformation, thereby improving retrieval performance. For instance, Query2Doc [17] and HyDE [10] generate pseudo-documents from original queries to enhance retrieval, while TOC [18] decomposes queries into subqueries, aggregating the retrieved content for final results.

Other studies have focused on transforming retrieval source documents. LlamaIndex [19] provides an interface to generate pseudo-queries for retrieval documents, improving matching with real queries. Some works employ contrastive learning to bring query and document embeddings closer in semantic space [20, 12, 21]. Post-processing retrieved documents is another method to enhance generator output, with techniques like hierarchical prompt summarization [22] and using abstractive and extractive compressors [23] to reduce context length and remove redundancy [24].

2.2 Retriever Enhancement Strategy

Document chunking and embedding methods significantly impact retrieval performance. Common chunking strategies divide documents into chunks, but determining the optimal chunk length can be challenging. Small chunks may fragment sentences, while large chunks might include irrelevant context. LlamaIndex [19] provides optimized chunking methods such as small-to-big and sliding window. Retrieved chunks can be irrelevant and their number can be large, so reranking is necessary to filter out irrelevant documents. A common reranking approach employs deep language models such as BERT [25], T5 [26], or LLaMA [27], which requires slow inference steps during reranking but grants better performance. TILDE [28, 29] achieves efficiency by precomputing and storing the likelihood of query terms, ranking documents based on their sum.

2.3 Retriever and Generator Fine-tuning

Fine-tuning within the RAG framework is crucial for optimizing both retrievers and generators. Some research focuses on fine-tuning the generator to better utilize retriever context [30, 31, 32], ensuring faithful and robust generated content. Others fine-tune the retriever to learn to retrieve beneficial passages for the generator [33, 34, 35]. Holistic approaches treat RAG as an integrated system, fine-tuning both retriever and generator together to enhance overall performance [36, 37, 38], despite increased complexity and integration challenges.

Several surveys have extensively discussed current RAG systems, covering aspects like text generation [7, 8], integration with LLMs [6, 39], multimodal [40], and AI-generated content [41]. While these surveys provide comprehensive overviews of existing RAG methodologies, selecting the appropriate algorithm for practical implementation remains challenging. In this paper, we focus on best practices for applying RAG methods, advancing the understanding and application of RAG in LLMs.

Figure 2: Classification of retrieval requirements for different tasks. In cases where information is not provided, we differentiate tasks based on the functions of the model.

3 RAG Workflow

In this section, we detail the components of the RAG workflow. For each module, we review commonly used approaches and select the default and alternative methods for our final pipeline. Section 4 will discuss best practices. Figure 1 presents the workflow and methods for each module. Detailed experimental setups, including datasets, hyperparameters, and results are provided in Appendix A.

3.1 Query Classification

Not all queries require retrieval augmentation, owing to the inherent capabilities of LLMs. While RAG can enhance information accuracy and reduce hallucinations, frequent retrieval can increase response time. Therefore, we begin by classifying queries to determine the necessity of retrieval. Queries requiring retrieval proceed through the RAG modules; others are handled directly by LLMs.

Retrieval is generally recommended when knowledge beyond the model’s parameters is needed. However, the necessity of retrieval varies by task. For instance, an LLM trained up to 2023 can handle a translation request for “Sora was developed by OpenAI” without retrieval. Conversely, an introduction request for the same topic would require retrieval to provide relevant information.

Therefore, we propose classifying tasks by type to determine if a query needs retrieval.

Model | Acc | Prec | Rec | F1
BERT-base-multilingual | 0.95 | 0.96 | 0.94 | 0.95
Table 1: Results of the Query Classifier.

We categorize 15 tasks based on whether they provide sufficient information, with specific tasks and examples illustrated in Figure 2. For tasks entirely based on user-given information, we denote them as “sufficient”, which need no retrieval; otherwise, we denote them as “insufficient”, and retrieval may be necessary. We train a classifier to automate this decision-making process. Experimental details are presented in Appendix A.1. Section 4 explores the impact of query classification on the workflow, comparing scenarios with and without classification.
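For illustration, training such a binary retrieve-or-not classifier is straightforward. The following is a minimal sketch assuming the Hugging Face transformers and datasets libraries and a hypothetical queries.csv file with a "query" text column and a binary "label" column (1 = retrieval needed); it is not the exact training script behind Table 1.

```python
# Minimal sketch: fine-tune a binary "retrieve / no-retrieve" query classifier.
# Assumes a hypothetical queries.csv with columns "query" and "label".
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("csv", data_files={"train": "queries.csv"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["query"], truncation=True,
                         padding="max_length", max_length=64),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="query_classifier",
                           num_train_epochs=3, per_device_train_batch_size=32),
    train_dataset=dataset,
)
trainer.train()

def needs_retrieval(query: str) -> bool:
    """Route a query: only queries predicted as label 1 go through the RAG modules."""
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    return bool(model(**inputs).logits.argmax(dim=-1).item() == 1)
```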

3.2 Chunking

Chunking documents into smaller segments is crucial for enhancing retrieval precision and avoiding length issues in LLMs. This process can be applied at various levels of granularity, such as token, sentence, and semantic levels.

  • Token-level Chunking is straightforward but may split sentences, affecting retrieval quality.

  • Semantic-level Chunking uses LLMs to determine breakpoints; it preserves context but is time-consuming.

  • Sentence-level Chunking balances preserving text semantics with simplicity and efficiency.

In this study, we use sentence-level chunking, balancing simplicity and semantic preservation. We examine chunking from four dimensions.
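A minimal sketch of the sentence-level strategy (NLTK is used only for sentence splitting; the whitespace token count and the 256-token budget are illustrative rather than the exact settings of our pipeline):

```python
# Minimal sketch: sentence-level chunking with a token budget.
# Sentences are kept intact; a chunk is emitted once adding the next
# sentence would exceed max_tokens.
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

def sentence_chunks(text: str, max_tokens: int = 256) -> list[str]:
    chunks, current, current_len = [], [], 0
    for sent in sent_tokenize(text):
        n_tokens = len(sent.split())          # crude whitespace token count
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```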

Embedding Model | MRR@1 | MRR@10 | MRR@100 | R@1 | R@10 | R@100 (namespace-Pt/msmarco)
BAAI/LLM-Embedder [20] | 24.79 | 37.58 | 38.62 | 24.07 | 66.45 | 90.75
BAAI/bge-base-en-v1.5 [12] | 23.34 | 35.80 | 36.94 | 22.63 | 64.12 | 90.13
BAAI/bge-small-en-v1.5 [12] | 23.27 | 35.78 | 36.89 | 22.65 | 63.92 | 89.80
BAAI/bge-large-en-v1.5 [12] | 24.63 | 37.48 | 38.59 | 23.91 | 65.57 | 90.60
BAAI/bge-large-en [12] | 24.84 | 37.66 | 38.73 | 24.13 | 66.09 | 90.64
BAAI/bge-small-en [12] | 23.28 | 35.79 | 36.91 | 22.62 | 63.96 | 89.67
BAAI/bge-base-en [12] | 23.47 | 35.94 | 37.07 | 22.73 | 64.17 | 90.14
Alibaba-NLP/gte-large-en-v1.5 [21] | 8.93 | 15.60 | 16.71 | 8.67 | 32.28 | 60.36
thenlper/gte-base [21] | 7.42 | 13.23 | 14.30 | 7.21 | 28.27 | 56.20
thenlper/gte-small [21] | 7.97 | 14.81 | 15.95 | 7.71 | 32.07 | 61.08
jinaai/jina-embeddings-v2-small-en [42] | 8.07 | 15.02 | 16.12 | 7.87 | 32.55 | 60.36
intfloat/e5-small-v2 [11] | 10.04 | 18.23 | 19.41 | 9.74 | 38.92 | 68.42
intfloat/e5-large-v2 [11] | 9.58 | 17.94 | 19.03 | 9.35 | 39.00 | 66.11
sentence-transformers/all-mpnet-base-v2 | 5.80 | 11.26 | 12.26 | 5.66 | 25.57 | 50.94
Table 2: Results for different embedding models on namespace-Pt/msmarco.

3.2.1 Chunk Size

Chunk size significantly impacts performance. Larger chunks provide more context, enhancing comprehension but increasing process time. Smaller chunks improve retrieval recall and reduce time but may lack sufficient context.

Finding the optimal chunk size involves balancing metrics such as faithfulness and relevancy. Faithfulness measures whether the response is hallucinated or matches the retrieved texts.

Chunk Size | Average Faithfulness | Average Relevancy (lyft_2021)
2048 | 80.37 | 91.11
1024 | 94.26 | 95.56
512 | 97.59 | 97.41
256 | 97.22 | 97.78
128 | 95.74 | 97.22
Table 3: Comparison of different chunk sizes.

Relevancy measures whether the retrieved texts and responses match queries. We use the evaluation module of LlamaIndex [43] to calculate the metrics above. For embedding, we use the text-embedding-ada-002 model (https://platform.openai.com/docs/guides/embeddings/embedding-models), which supports long input lengths. We choose zephyr-7b-alpha (https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha) and gpt-3.5-turbo (https://www.openai.com/) as the generation model and evaluation model, respectively. The chunk overlap is 20 tokens. The first sixty pages of the document lyft_2021 (https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf) are used as the corpus, and LLMs are then prompted to generate about one hundred and seventy queries based on the chosen corpus. The impact of different chunk sizes is shown in Table 3.

3.2.2 Chunking Techniques

Advanced techniques such as small-to-big and sliding window improve retrieval quality by organizing chunk block relationships. Small-sized blocks are used to match queries, and larger blocks that include the small ones along with contextual information are returned.

To demonstrate the effectiveness of advanced chunking techniques, we use the LLM-Embedder [20] model as an embedding model. The smaller chunk size is 175 tokens, the larger chunk size is 512 tokens, and the chunk overlap is 20 tokens. Techniques like small-to-big and sliding window improve retrieval quality by maintaining context and ensuring relevant information is retrieved. Detailed results are shown in Table 4.
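The two techniques can be sketched as follows (illustrative only; the sizes mirror the 175/512-token setting above, and a pre-tokenized document stands in for the real tokenizer):

```python
# Minimal sketch of the two chunk-organization techniques.
# small-to-big: retrieve against small chunks, return the enclosing large chunk.
# sliding window: overlapping windows so context around a match is preserved.

def sliding_window_chunks(tokens: list[str], size: int = 512, overlap: int = 20):
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

def small_to_big(tokens: list[str], small: int = 175, big: int = 512):
    """Map each small (retrieval) chunk to the index of the large (context)
    chunk containing its starting token."""
    small_chunks = [tokens[i:i + small] for i in range(0, len(tokens), small)]
    big_chunks = [tokens[i:i + big] for i in range(0, len(tokens), big)]
    mapping = {j: (j * small) // big for j in range(len(small_chunks))}
    return small_chunks, big_chunks, mapping

# At query time: embed and match the small chunks, then feed the mapped big
# chunk (or the surrounding window) to the generator.
```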

3.2.3 Embedding Model Selection

Choosing the right embedding model is crucial for effective semantic matching of queries and chunk blocks. We use the evaluation module of FlagEmbedding (https://github.com/FlagOpen/FlagEmbedding) to choose an appropriate open-source embedding model.

Chunk Skill | Average Faithfulness | Average Relevancy (lyft_2021)
Original | 95.74 | 95.37
small2big | 96.67 | 95.37
sliding window | 97.41 | 96.85
Table 4: Comparison of different chunk skills.

It uses the dataset namespace-Pt/msmarco (https://huggingface.co/datasets/namespace-Pt/msmarco) as queries and the dataset namespace-Pt/msmarco-corpus (https://huggingface.co/datasets/namespace-Pt/msmarco-corpus) as the corpus. As shown in Table 2, LLM-Embedder [20] achieves comparable results with BAAI/bge-large-en [12]; however, the size of the former is three times smaller than that of the latter. Thus, we select the LLM-Embedder [20] for its balance of performance and size.
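As a usage sketch (assuming the FlagEmbedding package; the query instruction string below is the one commonly used with the BGE English models and should be treated as an assumption), dense retrieval with any of the evaluated models reduces to encoding queries and chunks and ranking by inner product. The selected LLM-Embedder is available through the same package.

```python
# Usage sketch (assumes the FlagEmbedding package): encode queries and chunks,
# then rank chunks by inner-product similarity of the embeddings.
import numpy as np
from FlagEmbedding import FlagModel

model = FlagModel(
    "BAAI/bge-large-en-v1.5",
    # Query instruction commonly used with BGE English models (assumption).
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
)

chunks = ["Lyft reported revenue growth in 2021.",
          "The 10-K lists key risk factors."]
chunk_emb = model.encode(chunks)                                  # (num_chunks, dim)
query_emb = model.encode_queries(["How did Lyft's revenue change in 2021?"])

scores = query_emb @ chunk_emb.T                                  # similarity scores
ranked = [chunks[i] for i in np.argsort(-scores[0])]
print(ranked)
```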

3.2.4 Metadata Addition

Enhancing chunk blocks with metadata like titles, keywords, and hypothetical questions can improve retrieval, provide more ways to post-process retrieved texts, and help LLMs better understand retrieved information. A detailed study on metadata inclusion will be addressed in future work.

3.3 Vector Databases

Vector databases store embedding vectors with their metadata, enabling efficient retrieval of documents relevant to queries through various indexing and approximate nearest neighbor (ANN) methods.

To select an appropriate vector database for our research, we evaluated several options based on four key criteria: multiple index types, billion-scale vector support, hybrid search, and cloud-native capabilities.

Database | Multiple Index Type | Billion-Scale | Hybrid Search | Cloud-Native
Databases compared: Weaviate, Faiss, Chroma, Qdrant, Milvus (Milvus meets all four criteria)
Table 5: Comparison of Various Vector Databases

These criteria were chosen for their impact on flexibility, scalability, and ease of deployment in modern, cloud-based infrastructures. Multiple index types provide the flexibility to optimize searches based on different data characteristics and use cases. Billion-scale vector support is crucial for handling large datasets in LLM applications. Hybrid search combines vector search with traditional keyword search, enhancing retrieval accuracy. Finally, cloud-native capabilities ensure seamless integration, scalability, and management in cloud environments. Table 5 presents a detailed comparison of five open-source vector databases: Weaviate, Faiss, Chroma, Qdrant, and Milvus.

Our evaluation indicates that Milvus stands out as the most comprehensive solution among the databases evaluated, meeting all the essential criteria and outperforming other open-source options.
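A minimal sketch of how such a store backs the retrieval step (assuming pymilvus with Milvus Lite; the collection name, vector dimension, and toy vectors are illustrative):

```python
# Minimal sketch (assumes pymilvus >= 2.4 with Milvus Lite; names are illustrative):
# store chunk embeddings with their text as metadata, then run a vector search.
import numpy as np
from pymilvus import MilvusClient

client = MilvusClient("rag_demo.db")                     # local Milvus Lite file
client.create_collection(collection_name="chunks", dimension=8)

chunk_texts = ["Sora was developed by OpenAI.", "Lyft filed its 10-K for 2021."]
chunk_vecs = np.random.rand(len(chunk_texts), 8).tolist()  # stand-in for real embeddings

client.insert(
    collection_name="chunks",
    data=[{"id": i, "vector": v, "text": t}
          for i, (v, t) in enumerate(zip(chunk_vecs, chunk_texts))],
)

query_vec = np.random.rand(8).tolist()                   # stand-in for the query embedding
hits = client.search(collection_name="chunks", data=[query_vec],
                     limit=2, output_fields=["text"])
print([hit["entity"]["text"] for hit in hits[0]])
```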

Method | TREC DL19: mAP / nDCG@10 / R@50 / R@1k / Latency | TREC DL20: mAP / nDCG@10 / R@50 / R@1k / Latency
unsupervised
BM25 | 30.13 / 50.58 / 38.32 / 75.01 / 0.07 | 28.56 / 47.96 / 46.18 / 78.63 / 0.29
Contriever | 23.99 / 44.54 / 37.54 / 74.59 / 3.06 | 23.98 / 42.13 / 43.81 / 75.39 / 0.98
supervised
LLM-Embedder | 44.66 / 70.20 / 49.06 / 84.48 / 2.61 | 45.60 / 68.76 / 61.36 / 84.41 / 0.71
+ Query Rewriting | 44.56 / 67.89 / 51.45 / 85.35 / 7.80 | 45.16 / 65.62 / 59.63 / 83.45 / 2.06
+ Query Decomposition | 41.93 / 66.10 / 48.66 / 82.62 / 14.98 | 43.30 / 64.95 / 57.74 / 84.18 / 2.01
+ HyDE | 50.87 / 75.44 / 54.93 / 88.76 / 7.21 | 50.94 / 73.94 / 63.80 / 88.03 / 2.14
+ Hybrid Search | 47.14 / 72.50 / 51.13 / 89.08 / 3.20 | 47.72 / 69.80 / 64.32 / 88.04 / 0.77
+ HyDE + Hybrid Search | 52.13 / 73.34 / 55.38 / 90.42 / 11.16 | 53.13 / 72.72 / 66.14 / 90.67 / 2.95
Table 6: Results for different retrieval methods on TREC DL19/20. The best result for each method is made bold and the second is underlined.

3.4 Retrieval Methods

Given a user query, the retrieval module selects the top-k relevant documents from a pre-built corpus based on the similarity between the query and the documents. The generation model then uses these documents to formulate an appropriate response to the query. However, original queries often underperform due to poor expression and lack of semantic information [6], negatively impacting the retrieval process. To address these issues, we evaluated three query transformation methods using the LLM-Embedder recommended in Section 3.2 as the query and document encoder:

  • Query Rewriting: Query rewriting refines queries to better match relevant documents. Inspired by the Rewrite-Retrieve-Read framework [9], we prompt an LLM to rewrite queries to enhance performance.

  • Query Decomposition: This approach involves retrieving documents based on sub-questions derived from the original query, which is more complex to comprehend and handle.

  • Pseudo-documents Generation: This approach generates a hypothetical document based on the user query and uses the embedding of hypothetical answers to retrieve similar documents. One notable implementation is HyDE [10].

Recent studies, such as [44], indicate that combining lexical-based search with vector search significantly enhances performance. In this study, we use BM25 for sparse retrieval and Contriever [45], an unsupervised contrastive encoder, for dense retrieval, serving as two robust baselines based on Thakur et al. [46].
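A minimal sketch of the pseudo-document (HyDE-style) retrieval path, corresponding to the "1 pseudo-doc + query" setting evaluated below; the OpenAI chat call and prompt are illustrative stand-ins for any instruction-following generator, and `embed` is assumed to wrap the chosen query/document encoder and return a normalized vector:

```python
# Sketch of HyDE-style retrieval ("1 pseudo-doc + query" setting).
# `embed` is assumed to wrap the chosen encoder (e.g., LLM-Embedder) and return a
# normalized vector; the OpenAI call is an illustrative stand-in for the generator.
import numpy as np
from openai import OpenAI

def hyde_retrieve(query: str, chunks: list[str], chunk_embs: np.ndarray,
                  embed, k: int = 10) -> list[str]:
    client = OpenAI()
    # 1) Generate a hypothetical passage that answers the query.
    pseudo_doc = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {query}"}],
    ).choices[0].message.content

    # 2) Average the embeddings of the pseudo-document and the original query.
    vec = (embed(pseudo_doc) + embed(query)) / 2.0

    # 3) Dense retrieval by inner product over pre-computed chunk embeddings.
    scores = chunk_embs @ vec
    return [chunks[i] for i in np.argsort(-scores)[:k]]
```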

3.4.1 Results for different retrieval methods

We evaluated the performance of different search methods on the TREC DL 2019 and 2020 passage ranking datasets. The results presented in Table 6 show that supervised methods significantly outperformed unsupervised methods. Combining with HyDE and hybrid search, LLM-Embedder achieves the highest scores. However, query rewriting and query decomposition did not enhance retrieval performance as effectively. Considering the best performance and tolerated latency, we recommend Hybrid Search with HyDE as the default retrieval method. Taking efficiency into consideration, Hybrid Search combines sparse retrieval (BM25) and dense retrieval (Original embedding) and achieves notable performance with relatively low latency.

Configuration | TREC DL19: mAP / nDCG@10 / R@50 / R@1k / Latency | TREC DL20: mAP / nDCG@10 / R@50 / R@1k / Latency
HyDE
w/ 1 pseudo-doc | 48.77 / 72.49 / 53.20 / 87.73 / 8.08 | 51.31 / 70.37 / 63.28 / 87.81 / 2.09
w/ 1 pseudo-doc + query | 50.87 / 75.44 / 54.93 / 88.76 / 7.21 | 50.94 / 73.94 / 63.80 / 88.03 / 2.14
w/ 8 pseudo-docs + query | 51.64 / 75.12 / 54.51 / 89.17 / 14.15 | 53.14 / 73.65 / 65.79 / 88.67 / 3.44
Table 7: HyDE with different concatenation of hypothetical documents and queries.
Hyperparameter | TREC DL19: mAP / nDCG@10 / R@50 / R@1k / Latency | TREC DL20: mAP / nDCG@10 / R@50 / R@1k / Latency
Hybrid Search
α = 0.1 | 46.00 / 70.87 / 49.24 / 88.89 / 2.98 | 46.54 / 69.05 / 63.36 / 87.32 / 0.90
α = 0.3 | 47.14 / 72.50 / 51.13 / 89.08 / 3.20 | 47.72 / 69.80 / 64.32 / 88.04 / 0.77
α = 0.5 | 47.36 / 72.24 / 52.71 / 88.09 / 3.02 | 47.19 / 68.12 / 64.90 / 87.86 / 0.87
α = 0.7 | 47.21 / 71.89 / 52.40 / 88.01 / 3.15 | 45.82 / 67.30 / 64.23 / 87.92 / 1.02
α = 0.9 | 46.35 / 70.67 / 52.64 / 88.22 / 2.74 | 44.02 / 65.55 / 63.22 / 87.76 / 1.20
Table 8: Results of hybrid search with different alpha values.

3.4.2 HyDE with Different Concatenation of Documents and Query

Table 7 shows the impact of different concatenation strategies for hypothetical documents and queries using HyDE. Concatenating multiple pseudo-documents with the original query can significantly enhance retrieval performance, though at the cost of increased latency, suggesting a trade-off between retrieval effectiveness and efficiency. However, indiscriminately increasing the number of hypothetical documents does not yield significant benefits and substantially raises latency, indicating that using a single hypothetical document is sufficient.

3.4.3 Hybrid Search with Different Weight on Sparse Retrieval

Table 8 presents the impact of different α values in hybrid search, where α controls the weighting between the sparse retrieval and dense retrieval components. The relevance score is calculated as follows:

S_h = α · S_s + S_d    (1)

where S_s and S_d are the normalized relevance scores from sparse retrieval and dense retrieval respectively, and S_h is the total retrieval score.
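A minimal sketch of this fusion; min-max normalization is used here to produce the normalized scores, which is an assumption since the normalization scheme is not prescribed above:

```python
# Minimal sketch of Eq. (1): fuse normalized sparse (BM25) and dense scores.
# Min-max normalization is an assumption; any consistent normalization works.
import numpy as np

def hybrid_scores(sparse: np.ndarray, dense: np.ndarray, alpha: float = 0.3):
    def minmax(x):
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)
    return alpha * minmax(sparse) + minmax(dense)   # S_h = alpha * S_s + S_d

# Example: rank candidate documents by the fused score with alpha = 0.3.
bm25_scores = np.array([12.1, 7.3, 9.8])
dense_scores = np.array([0.62, 0.71, 0.55])
order = np.argsort(-hybrid_scores(bm25_scores, dense_scores))
```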

We evaluated five different α values to determine their influence on performance. The results indicate that an α value of 0.3 yields the best performance, demonstrating that appropriate adjustment of α can enhance retrieval effectiveness to a certain extent. Therefore, we selected α = 0.3 for our retrieval and main experiments. Additional implementation details are presented in Appendix A.2.

Method | Base Model | # Params | MRR@1 | MRR@10 | MRR@1k | Hit Rate@10 | Latency (MS MARCO Passage ranking)
w/o Reranking
Random Ordering | - | - | 0.011 | 0.027 | 0.068 | 0.092 | -
BM25 | - | - | 6.52 | 11.65 | 12.59 | 24.63 | -
DLM Reranking
monoT5 | T5-base | 220M | 21.62 | 31.78 | 32.40 | 54.07 | 4.5
monoBERT | BERT-large | 340M | 21.65 | 31.69 | 32.35 | 53.38 | 15.8
RankLLaMA | Llama-2-7b | 7B | 22.08 | 32.35 | 32.97 | 54.53 | 82.4
TILDE Reranking
TILDEv2 | BERT-base | 110M | 18.57 | 27.83 | 28.60 | 49.07 | 0.02
Table 9: Results of different reranking methods on the dev set of the MS MARCO Passage ranking dataset. For each query, the top-1000 candidate passages retrieved by BM25 are reranked. Latency is measured in seconds per query.

3.5 Reranking Methods

After the initial retrieval, a reranking phase is employed to enhance the relevance of the retrieved documents, ensuring that the most pertinent information appears at the top of the list. This phase uses more precise and time-intensive methods to reorder documents effectively, increasing the similarity between the query and the top-ranked documents.

We consider two approaches in our reranking module: DLM Reranking, which utilizes classification, and TILDE Reranking, which focuses on query likelihoods. These approaches prioritize performance and efficiency, respectively.

  • DLM Reranking: This method leverages deep language models (DLMs) [25, 26, 27] for reranking. These models are fine-tuned to classify document relevancy to a query as “true” or “false”. During fine-tuning, the model is trained with concatenated query and document inputs, labeled by relevancy. At inference, documents are ranked based on the probability of the “true” token.

  • TILDE Reranking: TILDE [28, 29] calculates the likelihood of each query term independently by predicting token probabilities across the model’s vocabulary. Documents are scored by summing the pre-calculated log probabilities of query tokens, allowing for rapid reranking at inference. TILDEv2 improves this by indexing only document-present tokens, using NCE loss, and expanding documents, thus enhancing efficiency and reducing index size.

Our experiments were conducted on the MS MARCO Passage ranking dataset [47], a large-scale dataset for machine reading comprehension. We follow and make modifications to the implementation provided by PyGaggle [26] and TILDE [28], using the models monoT5, monoBERT, RankLLaMA and TILDEv2. Reranking results are shown in Table 9. We recommend monoT5 as a comprehensive method balancing performance and efficiency. RankLLaMA is suitable for achieving the best performance, while TILDEv2 is ideal for the quickest experience on a fixed collection. Details on the experimental setup and results are presented in Appendix A.3.
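As an illustration of the DLM approach, the monoT5 scoring loop can be sketched as follows (assuming the public castorini/monot5-base-msmarco checkpoint and its "Query: … Document: … Relevant:" input format; the PyGaggle implementation referenced above is the authoritative version):

```python
# Sketch of monoT5 reranking: score each document by P("true") vs P("false")
# for the first generated token, given "Query: ... Document: ... Relevant:".
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "castorini/monot5-base-msmarco"     # assumed public checkpoint
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name).eval()
true_id = tokenizer.encode("true")[0]
false_id = tokenizer.encode("false")[0]

@torch.no_grad()
def rerank(query: str, docs: list[str]) -> list[str]:
    scores = []
    start = torch.tensor([[model.config.decoder_start_token_id]])
    for doc in docs:
        enc = tokenizer(f"Query: {query} Document: {doc} Relevant:",
                        return_tensors="pt", truncation=True, max_length=512)
        logits = model(**enc, decoder_input_ids=start).logits[0, -1, [false_id, true_id]]
        scores.append(torch.softmax(logits, dim=-1)[1].item())   # P("true")
    return [d for _, d in sorted(zip(scores, docs), reverse=True)]
```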

3.6 Document Repacking

The performance of subsequent processes, such as LLM response generation, may be affected by the order in which documents are provided. To address this issue, we incorporate a compact repacking module into the workflow after reranking, featuring three repacking methods: “forward”, “reverse” and “sides”. The “forward” method repacks documents by descending relevancy scores from the reranking phase, whereas the “reverse” arranges them in ascending order. Inspired by Liu et al. [48], who conclude that optimal performance is achieved when relevant information is placed at the head or tail of the input, we also include a “sides” option.

Since the repacking method primarily affects subsequent modules, we select the best repacking method in Section 4 by testing it in combination with other modules. In this section, we choose the “sides” method as the default repacking method.
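The three options amount to a simple reordering of the reranked list. The sketch below assumes documents arrive sorted by descending reranker score; the interleaving used for "sides" is one plausible reading of placing the most relevant documents at the head and tail:

```python
# Minimal sketch of the three repacking orders. `docs` is the reranked list,
# already sorted by descending relevance score.
def repack(docs: list[str], mode: str = "sides") -> list[str]:
    if mode == "forward":        # most relevant first
        return list(docs)
    if mode == "reverse":        # most relevant last, i.e. closest to the query
        return list(reversed(docs))
    if mode == "sides":          # most relevant alternately at the head and tail
        head, tail = [], []
        for i, doc in enumerate(docs):
            (head if i % 2 == 0 else tail).append(doc)
        return head + list(reversed(tail))
    raise ValueError(f"unknown repacking mode: {mode}")
```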

Method | NQ: F1 / #token | TQA: F1 / #token | HotPotQA: F1 / #token | Avg. F1 | Avg. #token
w/o Summarization
Origin Prompt | 27.07 / 124 | 33.61 / 152 | 33.92 / 141 | 31.53 | 139
Extractive Method
BM25 | 27.97 / 40 | 32.44 / 59 | 28.00 / 63 | 29.47 | 54
Contriever | 23.62 / 42 | 33.79 / 65 | 23.64 / 60 | 27.02 | 56
Recomp (extractive) | 27.84 / 34 | 35.32 / 60 | 29.46 / 58 | 30.87 | 51
Abstractive Method
SelectiveContext | 25.05 / 65 | 34.25 / 70 | 34.43 / 66 | 31.24 | 67
LongLLMLingua | 21.32 / 51 | 32.81 / 56 | 30.79 / 57 | 28.29 | 55
Recomp (abstractive) | 33.68 / 59 | 35.87 / 61 | 29.01 / 57 | 32.85 | 59
Table 10: Comparison between different summarization methods.

3.7 Summarization

Retrieval results may contain redundant or unnecessary information, potentially preventing LLMs from generating accurate responses. Additionally, long prompts can slow down the inference process. Therefore, efficient methods to summarize retrieved documents are crucial in the RAG pipeline.

Summarization tasks can be extractive or abstractive. Extractive methods segment text into sentences, then score and rank them based on importance. Abstractive compressors synthesize information from multiple documents to rephrase and generate a cohesive summary. These tasks can be query-based or non-query-based. In this paper, as RAG retrieves information relevant to queries, we focus exclusively on query-based methods.

  • Recomp:  Recomp [23] has extractive and abstractive compressors. The extractive compressor selects useful sentences, while the abstractive compressor synthesizes information from multiple documents.

  • LongLLMLingua:  LongLLMLingua [49] improves LLMLingua by focusing on key information related to the query.

  • Selective Context: Selective Context enhances LLM efficiency by identifying and removing redundant information in the input context. It evaluates the informativeness of lexical units using self-information computed by a base causal language model. This method is non-query-based, allowing a comparison between query-based and non-query-based approaches.

We evaluate these methods on three benchmark datasets: NQ, TriviaQA, and HotpotQA. Comparative results of different summarization methods are shown in Table 10. We recommend Recomp for its outstanding performance. LongLLMLingua does not perform well but demonstrates better generalization capabilities as it was not trained on these experimental datasets. Therefore, we consider it as an alternative method. Additional implementation details and discussions on non-query-based methods are provided in Appendix A.4.
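To make the query-based extractive setting concrete, the following is a generic sketch that scores sentences by embedding similarity to the query and keeps the top ones. It is a stand-in for an extractive compressor, not the Recomp implementation; `embed` is assumed to return normalized vectors.

```python
# Minimal sketch of query-based extractive compression: keep the sentences of
# the retrieved documents that are most similar to the query.
import nltk
nltk.download("punkt", quiet=True)
import numpy as np
from nltk.tokenize import sent_tokenize

def compress(query: str, documents: list[str], embed, keep: int = 5) -> str:
    sentences = [s for doc in documents for s in sent_tokenize(doc)]
    q = embed(query)
    scores = np.array([float(embed(s) @ q) for s in sentences])
    top = sorted(np.argsort(-scores)[:keep])      # keep the original sentence order
    return " ".join(sentences[i] for i in top)
```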

3.8 Generator Fine-tuning

In this section, we focus on fine-tuning the generator while leaving retriever fine-tuning for future exploration. We aim to investigate the impact of fine-tuning, particularly the influence of relevant or irrelevant contexts on the generator’s performance.

Formally, we denote x as the query fed into the RAG system, and D as the contexts for this input. The fine-tuning loss of the generator is the negative log-likelihood of the ground-truth output y.

To explore the impact of fine-tuning, especially of relevant and irrelevant contexts, we define d_gold as a context relevant to the query, and d_random as a randomly retrieved context. We train the model by varying the composition of D as follows:

  • D_g: The augmented context consists of query-relevant documents, denoted as D_g = {d_gold}.

  • D_r: The context contains one randomly sampled document, denoted as D_r = {d_random}.

  • D_gr: The augmented context comprises a relevant document and a randomly-selected one, denoted as D_gr = {d_gold, d_random}.

  • D_gg: The augmented context consists of two copies of a query-relevant document, denoted as D_gg = {d_gold, d_gold}.

We denote the base LM generator not fine-tuned as M_b, and the model fine-tuned under the corresponding D as M_g, M_r, M_gr, and M_gg. We fine-tuned our model on several QA and reading comprehension datasets.

Figure 3: Results of generator fine-tuning.

Ground-truth coverage is used as our evaluation metric since QA task answers are relatively short. We select Llama-2-7B [50] as the base model. Similar to training, we evaluate all trained models on validation sets with D_g, D_r, D_gr, and D_∅, where D_∅ indicates inference without retrieval. Figure 3 presents our main results. Models trained with a mix of relevant and random documents (M_gr) perform best when provided with either gold or mixed contexts. This suggests that mixing relevant and random contexts during training can enhance the generator’s robustness to irrelevant information while ensuring effective utilization of relevant contexts. Therefore, we identify the practice of augmenting with a few relevant and randomly-selected documents during training as the best approach. Detailed dataset information, hyperparameters and experimental results can be found in Appendix A.5.
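A sketch of how the D_gr training contexts are assembled and how the loss is restricted to the ground-truth answer (the prompt template and field names are illustrative; the actual formats are given in Appendix A.5):

```python
# Sketch: build D_gr-style training examples (one relevant + one random context)
# and compute the negative log-likelihood of the ground-truth answer only.
# The prompt template and field names are illustrative.
import random
import torch

def build_example(sample, corpus, tokenizer, max_len=1024):
    d_gold = sample["gold_context"]
    d_random = random.choice(corpus)                 # D_gr = {d_gold, d_random}
    prompt = (f"Context: {d_gold}\nContext: {d_random}\n"
              f"Question: {sample['question']}\nAnswer: ")
    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    answer_ids = tokenizer(sample["answer"], add_special_tokens=False).input_ids
    input_ids = (prompt_ids + answer_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_len]   # loss on answer only
    return {"input_ids": torch.tensor([input_ids]),
            "labels": torch.tensor([labels])}

# loss = model(**build_example(sample, corpus, tokenizer)).loss  # NLL of y given x, D
```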

4 Searching for Best RAG Practices

In the following section, we investigate the optimal practices for implementing RAG. To begin with, we used the default practice identified in Section 3 for each module. Following the workflow depicted in Figure 1, we sequentially optimized individual modules and selected the most effective option among alternatives. This iterative process continued until we determined the best method for implementing the final summarization module. Based on Section 3.8, we used as the generator a Llama2-7B-Chat model fine-tuned with each query augmented by a few randomly-selected and relevant documents. We used Milvus to build a vector database that includes 10 million texts from English Wikipedia and 4 million texts of medical data. We also investigated the impact of removing the Query Classification, Reranking, and Summarization modules to assess their contributions.

4.1 Comprehensive Evaluation

We conducted extensive experiments across various NLP tasks and datasets to assess the performance of RAG systems. Specifically: (I) Commonsense Reasoning; (II) Fact Checking; (III) Open-Domain QA; (IV) MultiHop QA; (V) Medical QA. For further details on the tasks and their corresponding datasets, please refer to Appendix A.6. Furthermore, we evaluated the RAG capabilities on subsets extracted from these datasets, employing the metrics recommended in RAGAs [51], including Faithfulness, Context Relevancy, Answer Relevancy, and Answer Correctness. Additionally, we measured Retrieval Similarity by computing the cosine similarity between retrieved documents and gold documents.

We used accuracy as the evaluation metric for the tasks of Commonsense Reasoning, Fact Checking, and Medical QA. For Open-Domain QA and Multihop QA, we employed token-level F1 score and Exact Match (EM) score. The final RAG score was calculated by averaging the aforementioned five RAG capabilities. We followed Trivedi et al. [52] and sub-sampled up to 500 examples from each dataset.
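For reference, the token-level F1 and EM scores follow the standard SQuAD-style definition; the sketch below uses common answer normalization, which may differ in minor details from our evaluation scripts:

```python
# Token-level F1 and Exact Match, SQuAD-style (sketch; normalization details may
# differ slightly from the scripts used for the reported numbers).
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```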

Method | Commonsense Acc | Fact Check Acc | ODQA EM | ODQA F1 | Multihop EM | Multihop F1 | Medical Acc | RAG Score | Avg. Score | Avg. F1 | Avg. Latency

[classification module], Hybrid with HyDE, monoT5, sides, Recomp
w/o classification | 0.719 | 0.505 | 0.391 | 0.450 | 0.212 | 0.255 | 0.528 | 0.540 | 0.465 | 0.353 | 16.58
+ classification | 0.727 | 0.595 | 0.393 | 0.450 | 0.207 | 0.257 | 0.460 | 0.580 | 0.478 | 0.353 | 11.71

with classification, [retrieval module], monoT5, sides, Recomp
+ HyDE | 0.718 | 0.595 | 0.320 | 0.373 | 0.170 | 0.213 | 0.400 | 0.545 | 0.443 | 0.293 | 11.58
+ Original | 0.721 | 0.585 | 0.300 | 0.350 | 0.153 | 0.197 | 0.390 | 0.486 | 0.428 | 0.273 | 1.44
+ Hybrid | 0.718 | 0.595 | 0.347 | 0.397 | 0.190 | 0.240 | 0.750 | 0.498 | 0.477 | 0.318 | 1.45
+ Hybrid with HyDE | 0.727 | 0.595 | 0.393 | 0.450 | 0.207 | 0.257 | 0.460 | 0.580 | 0.478 | 0.353 | 11.71

with classification, Hybrid with HyDE, [reranking module], sides, Recomp
w/o reranking | 0.720 | 0.591 | 0.365 | 0.429 | 0.211 | 0.260 | 0.512 | 0.530 | 0.470 | 0.334 | 10.31
+ monoT5 | 0.727 | 0.595 | 0.393 | 0.450 | 0.207 | 0.257 | 0.460 | 0.580 | 0.478 | 0.353 | 11.71
+ monoBERT | 0.723 | 0.593 | 0.383 | 0.443 | 0.217 | 0.259 | 0.482 | 0.551 | 0.475 | 0.351 | 11.65
+ RankLLaMA | 0.723 | 0.597 | 0.382 | 0.443 | 0.197 | 0.240 | 0.454 | 0.558 | 0.470 | 0.342 | 13.51
+ TILDEv2 | 0.725 | 0.588 | 0.394 | 0.456 | 0.209 | 0.255 | 0.486 | 0.536 | 0.476 | 0.355 | 11.26

with classification, Hybrid with HyDE, monoT5, [repacking module], Recomp
+ sides | 0.727 | 0.595 | 0.393 | 0.450 | 0.207 | 0.257 | 0.460 | 0.580 | 0.478 | 0.353 | 11.71
+ forward | 0.722 | 0.599 | 0.379 | 0.437 | 0.215 | 0.260 | 0.472 | 0.542 | 0.474 | 0.349 | 11.68
+ reverse | 0.728 | 0.592 | 0.387 | 0.445 | 0.219 | 0.263 | 0.532 | 0.560 | 0.483 | 0.354 | 11.70

with classification, Hybrid with HyDE, monoT5, reverse, [summarization module]
w/o summarization | 0.729 | 0.591 | 0.402 | 0.457 | 0.205 | 0.252 | 0.528 | 0.533 | 0.480 | 0.355 | 10.97
+ Recomp | 0.728 | 0.592 | 0.387 | 0.445 | 0.219 | 0.263 | 0.532 | 0.560 | 0.483 | 0.354 | 11.70
+ LongLLMLingua | 0.713 | 0.581 | 0.362 | 0.423 | 0.199 | 0.245 | 0.530 | 0.539 | 0.466 | 0.334 | 16.17

Table 11: Results of the search for optimal RAG practices. The module enclosed in square brackets in each group header is under investigation to determine the best method. The underlined method represents the selected implementation. The “Avg” (average score) is calculated based on the Acc, EM, and RAG scores for all tasks, while the average latency is measured in seconds per query. The best scores are highlighted in bold.

4.2 Results and Analysis

Based on the experimental results presented in Table 11, the following key insights emerge:

  • Query Classification Module: Including this module contributes to both effectiveness and efficiency, improving the average overall score from 0.428 to 0.443 and reducing latency from 16.41 to 11.58 seconds per query.

  • Retrieval Module: While the “Hybrid with HyDE” method attained the highest RAG score of 0.58, it did so at a considerable computational cost of 11.71 seconds per query. Consequently, the “Hybrid” or “Original” methods are recommended, as they reduce latency while maintaining comparable performance.

  • Reranking Module: The absence of a reranking module led to a noticeable drop in performance, highlighting its necessity. MonoT5 achieved the highest average score, affirming its efficacy in augmenting the relevance of retrieved documents. This indicates the critical role of reranking in enhancing the quality of generated responses.

  • Repacking Module: The Reverse configuration exhibited superior performance, achieving an RAG score of 0.560. This indicates that positioning more relevant context closer to the query leads to optimal outcomes.

  • Summarization Module: Recomp demonstrated superior performance, although achieving comparable results with lower latency was possible by removing the summarization module. Nevertheless, Recomp remains the preferred choice due to its capability to address the generator’s maximum length constraints. In time-sensitive applications, removing summarization could effectively reduce response time.

The experimental results demonstrate that each module contributes uniquely to the overall performance of the RAG system. The query classification module enhances accuracy and reduces latency, while the retrieval and reranking modules significantly improve the system’s ability to handle diverse queries. The repacking and summarization modules further refine the system’s output, ensuring high-quality responses across different tasks.

5 Discussion

5.1 Best Practices for Implementing RAG

According to our experimental findings, we suggest two distinct recipes or practices for implementing RAG systems, each customized to address specific requirements: one focusing on maximizing performance, and the other on striking a balance between efficiency and efficacy.

Best Performance Practice: To achieve the highest performance, it is recommended to incorporate the query classification module, use the “Hybrid with HyDE” method for retrieval, employ monoT5 for reranking, opt for Reverse for repacking, and leverage Recomp for summarization. This configuration yielded the highest average score of 0.483, albeit with a computationally intensive process.

Balanced Efficiency Practice: In order to achieve a balance between performance and efficiency, it is recommended to incorporate the query classification module, implement the Hybrid method for retrieval, use TILDEv2 for reranking, opt for Reverse for repacking, and employ Recomp for summarization. Given that the retrieval module accounts for the majority of processing time in the system, transitioning to the Hybrid method while keeping other modules unchanged can substantially reduce latency while preserving a comparable performance.

Figure 4: Workflow of multimodal retrieval. The upper section illustrates the text-to-image retrieval process. Initially, a text query is used to find images in the database with the highest similarity. If a high similarity is found, the image is returned directly. If not, an image generation model is employed to create and return an appropriate image. The lower section demonstrates the image-to-text retrieval process. Here, a user-provided image is matched with images in the database to find the highest similarity. If a high similarity is identified, the pre-stored caption of the matching image is returned. Otherwise, an image captioning model generates and returns a new caption.

5.2 Multimodal Extension

We have extended RAG to multimodal applications. Specifically, we have incorporated text2image and image2text retrieval capabilities into the system with a substantial collection of paired image and textual descriptions as a retrieval source. As depicted in Figure 4, the text2image capability speeds up the image generation process when a user query aligns well with the textual descriptions of stored images (i.e., “retrieval as generation” strategy), while the image2text functionality comes into play when a user provides an image and engages in conversation about the input image. These multimodal RAG capabilities offer the following advantages:

  • Groundedness: Retrieval methods provide information from verified multimodal materials, thereby ensuring authenticity and specificity. In contrast, on-the-fly generation relies on models to generate new content, which can occasionally result in factual errors or inaccuracies.

  • Efficiency: Retrieval methods are typically more efficient, especially when the answer already exists in stored materials. Conversely, generation methods may require more computational resources to produce new content, particularly for images or lengthy texts.

  • Maintainability: Generation models often necessitate careful fine-tuning to tailor them for new applications. In contrast, retrieval-based methods can be improved to address new demands by simply enlarging the size and enhancing the quality of retrieval sources.
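As a concrete illustration of the text2image path in Figure 4, the sketch below routes a query to retrieval when a stored image is similar enough and falls back to generation otherwise. It is a minimal sketch under assumptions: the CLIP checkpoint, the 0.8 threshold, and the `generate_image` fallback are illustrative choices, not the paper’s exact configuration.

```python
# Minimal sketch of the "retrieval as generation" routing (text2image path in Figure 4).
# Assumptions: the CLIP checkpoint, the 0.8 cosine-similarity threshold, and the
# generate_image fallback are illustrative, not the paper's exact settings.
import numpy as np
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")  # encodes both text and images

def text_to_image(query, images, image_embs, generate_image, threshold=0.8):
    """Return a stored image whose embedding matches the query closely enough,
    otherwise fall back to a (slower) image-generation model."""
    q = clip.encode([query], normalize_embeddings=True)[0]
    sims = image_embs @ q                 # cosine similarities (image embeddings pre-normalized)
    best = int(np.argmax(sims))
    if sims[best] >= threshold:
        return images[best]               # retrieval as generation
    return generate_image(query)          # generative fallback for out-of-database queries
```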

We plan to broaden the application of this strategy to include other modalities, such as video and speech, while also exploring efficient and effective cross-modal retrieval techniques.

6 Conclusion

In this study, we aim to identify optimal practices for implementing retrieval-augmented generation in order to improve the quality and reliability of content produced by large language models. We systematically assessed a range of potential solutions for each module within the RAG framework and recommended the most effective approach for each module. Furthermore, we introduced a comprehensive evaluation benchmark for RAG systems and conducted extensive experiments to determine the best practices among various alternatives. Our findings not only contribute to a deeper understanding of retrieval-augmented generation systems but also establish a foundation for future research.

Limitations

We have evaluated the impact of various methods for fine-tuning LLM generators. Previous studies have demonstrated the feasibility of training the retriever and generator jointly, and we would like to explore this possibility in the future. In this study, we embraced the principle of modular design to simplify the search for optimal RAG implementations, thereby reducing complexity. Due to the daunting costs associated with constructing vector databases and conducting experiments, our evaluation was limited to investigating the effectiveness and influence of representative chunking techniques within the chunking module. It would be intriguing to further explore the impact of different chunking techniques on the entire RAG system. While we have discussed the application of RAG in the domain of NLP and extended its scope to image generation, an enticing avenue for future exploration would involve expanding this research to other modalities such as speech and video.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments. This work was supported by the National Natural Science Foundation of China (No. 62076068).

References

  • Ouyang et al. [2022] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS 2022), 2022.
  • Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  • Zhao et al. [2023a] Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. SLIC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023a.
  • Yuan et al. [2023] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. RRHF: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023.
  • Liu et al. [2023] Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, and Xuanjing Huang. Aligning large language models with human preferences through representation engineering. arXiv preprint arXiv:2312.15997, 2023.
  • Gao et al. [2023] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
  • Li et al. [2022] Huayang Li, Yixuan Su, Deng Cai, Yan Wang, and Lemao Liu. A survey on retrieval-augmented text generation. arXiv preprint arXiv:2202.01110, 2022.
  • Cai et al. [2022] Deng Cai, Yan Wang, Lemao Liu, and Shuming Shi. Recent advances in retrieval-augmented text generation. In Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval, pages 3417–3419, 2022.
  • Ma et al. [2023a] Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting for retrieval-augmented large language models. arXiv preprint arXiv:2305.14283, 2023a.
  • Gao et al. [2022] Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. Precise zero-shot dense retrieval without relevance labels. arXiv preprint arXiv:2212.10496, 2022.
  • Wang et al. [2022] Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022.
  • Xiao et al. [2023] Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding, 2023.
  • OpenAI [2023] OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/ARXIV.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
  • Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Zhang et al. [2023a] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023a.
  • Wang et al. [2023a] Xiaohua Wang, Yuliang Yan, Longtao Huang, Xiaoqing Zheng, and Xuan-Jing Huang. Hallucination detection for generative large language models by bayesian sequential estimation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15361–15371, 2023a.
  • Wang et al. [2023b] Liang Wang, Nan Yang, and Furu Wei. Query2doc: Query expansion with large language models. arXiv preprint arXiv:2303.07678, 2023b.
  • Kim et al. [2023] Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, and Jaewoo Kang. Tree of clarifications: Answering ambiguous questions with retrieval-augmented large language models. arXiv preprint arXiv:2310.14696, 2023.
  • Liu [2022] Jerry Liu. LlamaIndex, 11 2022. URL https://github.com/jerryjliu/llama_index.
  • Zhang et al. [2023b] Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, and Jian-Yun Nie. Retrieve anything to augment large language models. arXiv preprint arXiv:2310.07554, 2023b.
  • Li et al. [2023] Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281, 2023.
  • Jiang et al. [2023a] Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736, 2023a.
  • Xu et al. [2023] Fangyuan Xu, Weijia Shi, and Eunsol Choi. Recomp: Improving retrieval-augmented lms with compression and selective augmentation. arXiv preprint arXiv:2310.04408, 2023.
  • Wang et al. [2023c] Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377, 2023c.
  • Nogueira et al. [2019] Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, and Jimmy Lin. Multi-stage document ranking with bert. arXiv preprint arXiv:1910.14424, 2019.
  • Nogueira et al. [2020] Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. Document ranking with a pretrained sequence-to-sequence model. arXiv preprint arXiv:2003.06713, 2020.
  • Ma et al. [2023b] Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. Fine-tuning llama for multi-stage text retrieval. arXiv preprint arXiv:2310.08319, 2023b.
  • Zhuang and Zuccon [2021a] Shengyao Zhuang and Guido Zuccon. Tilde: Term independent likelihood model for passage re-ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1483–1492, 2021a.
  • Zhuang and Zuccon [2021b] Shengyao Zhuang and Guido Zuccon. Fast passage re-ranking with contextualized exact term matching and efficient passage expansion. arXiv preprint arXiv:2108.08513, 2021b.
  • Luo et al. [2023] Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox, Helen M. Meng, and James R. Glass. Sail: Search-augmented instruction learning. In Conference on Empirical Methods in Natural Language Processing, 2023. URL https://api.semanticscholar.org/CorpusID:258865283.
  • Zhang et al. [2024a] Tianjun Zhang, Shishir G. Patil, Naman Jain, Sheng Shen, Matei A. Zaharia, Ion Stoica, and Joseph E. Gonzalez. Raft: Adapting language model to domain specific rag. ArXiv, abs/2403.10131, 2024a.
  • Liu et al. [2024a] Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, and Bryan Catanzaro. Chatqa: Surpassing gpt-4 on conversational qa and rag. 2024a. URL https://api.semanticscholar.org/CorpusID:267035133.
  • Izacard et al. [2022] Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane A. Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. ArXiv, abs/2208.03299, 2022.
  • Zhang et al. [2024b] Lingxi Zhang, Yue Yu, Kuan Wang, and Chao Zhang. Arl2: Aligning retrievers for black-box large language models via self-guided adaptive relevance labeling. ArXiv, abs/2402.13542, 2024b.
  • Shi et al. [2023] Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. Replug: Retrieval-augmented black-box language models. arXiv preprint arXiv:2301.12652, 2023.
  • Guu et al. [2020] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. ArXiv, abs/2002.08909, 2020.
  • Lin et al. [2023] Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, and Scott Yih. Ra-dit: Retrieval-augmented dual instruction tuning. ArXiv, abs/2310.01352, 2023.
  • Zamani and Bendersky [2024] Hamed Zamani and Michael Bendersky. Stochastic rag: End-to-end retrieval-augmented generation through expected utility maximization. 2024. URL https://api.semanticscholar.org/CorpusID:269605438.
  • Huang and Huang [2024] Yizheng Huang and Jimmy Huang. A survey on retrieval-augmented text generation for large language models. arXiv preprint arXiv:2404.10981, 2024.
  • Zhao et al. [2023b] Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Xiaobao Guo, Minzhi Li, Xingxuan Li, et al. Retrieving multimodal information for augmented generation: A survey. arXiv preprint arXiv:2303.10868, 2023b.
  • Zhao et al. [2024] Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey. arXiv preprint arXiv:2402.19473, 2024.
  • Günther et al. [2023] Michael Günther, Jackmin Ong, Isabelle Mohr, Alaeddine Abdessalem, Tanguy Abel, Mohammad Kalim Akram, Susana Guzman, Georgios Mastrapas, Saba Sturua, Bo Wang, et al. Jina embeddings 2: 8192-token general-purpose text embeddings for long documents. arXiv preprint arXiv:2310.19923, 2023.
  • [43] LlamaIndex. Llamaindex website. https://www.llamaindex.com. Accessed: 2024-06-08.
  • Sawarkar et al. [2024] Kunal Sawarkar, Abhilasha Mangal, and Shivam Raj Solanki. Blended rag: Improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers. arXiv preprint arXiv:2404.07220, 2024.
  • Izacard et al. [2021] Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118, 2021.
  • Thakur et al. [2021] Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663, 2021.
  • Bajaj et al. [2016] Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. Ms marco: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268, 2016.
  • Liu et al. [2024b] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024b.
  • Jiang et al. [2023b] Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. arXiv preprint arXiv:2310.06839, 2023b.
  • Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony S. Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel M. Kloumann, A. V. Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, R. Subramanian, Xia Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zhengxu Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023b.
  • Shahul et al. [2023] ES Shahul, Jithin James, Luis Espinosa Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. In Conference of the European Chapter of the Association for Computational Linguistics, 2023. URL https://api.semanticscholar.org/CorpusID:263152733.
  • Trivedi et al. [2022] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, page 539–554, May 2022. doi: 10.1162/tacl_a_00475. URL http://dx.doi.org/10.1162/tacl_a_00475.
  • Conover et al. [2023] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
  • Craswell et al. [2020] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Fernando Campos, and Ellen M. Voorhees. Overview of the trec 2019 deep learning track. ArXiv, abs/2003.07820, 2020. URL https://api.semanticscholar.org/CorpusID:253234683.
  • Craswell et al. [2021] Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Fernando Campos, and Ellen M. Voorhees. Overview of the trec 2020 deep learning track. ArXiv, abs/2102.07662, 2021. URL https://api.semanticscholar.org/CorpusID:212737158.
  • Lin et al. [2021a] Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. Pyserini: A python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2356–2362, 2021a.
  • Kwiatkowski et al. [2019] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc V. Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
  • Joshi et al. [2017] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. ArXiv, abs/1705.03551, 2017.
  • Yang et al. [2018] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.
  • Stelmakh et al. [2022] Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. Asqa: Factoid questions meet long-form answers. ArXiv, abs/2204.06092, 2022.
  • Kočiskỳ et al. [2018] Tomáš Kočiskỳ, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018.
  • Rajpurkar et al. [2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
  • Lin et al. [2021b] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021b.
  • Hu et al. [2021] J. Edward Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. ArXiv, abs/2106.09685, 2021.
  • Hendrycks et al. [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Clark et al. [2018] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018. URL https://api.semanticscholar.org/CorpusID:3922816.
  • Mihaylov et al. [2018] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Jan 2018. doi: 10.18653/v1/d18-1260. URL http://dx.doi.org/10.18653/v1/d18-1260.
  • Thorne et al. [2018] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. Fever: a large-scale dataset for fact extraction and verification. ArXiv, abs/1803.05355, 2018. URL https://api.semanticscholar.org/CorpusID:4711425.
  • Zhang et al. [2023c] Tianhua Zhang, Hongyin Luo, Yung-Sung Chuang, Wei Fang, Luc Gaitskell, Thomas Hartvigsen, Xixin Wu, Danny Fox, Helen M. Meng, and James R. Glass. Interpretable unified language checking. ArXiv, abs/2304.03728, 2023c. URL https://api.semanticscholar.org/CorpusID:258041307.
  • Berant et al. [2013] Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013.
  • Ho et al. [2020] Xanh Ho, A. Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. ArXiv, abs/2011.01060, 2020. URL https://api.semanticscholar.org/CorpusID:226236740.
  • Press et al. [2022] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022.
  • Jin et al. [2019] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. In Conference on Empirical Methods in Natural Language Processing, 2019. URL https://api.semanticscholar.org/CorpusID:202572622.
  • Asai et al. [2023] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.

Appendix A Experimental Details

In this section, we provide detailed experimental settings for each module, covering dataset specifics, training parameters, and any additional experimental results.

A.1 Query Classification

Datasets  We utilized a subset of the Databricks-Dolly-15K dataset [53] and generated additional data using GPT-4. The prompt template for generating questions is shown in Table 14.

Implementation Details  We choose BERT-base-multilingual-cased as our classifier, with a batch size of 16 and a learning rate of 1e-5. The evaluation results are reported in Table 1.
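A minimal fine-tuning sketch for the classifier described above is given below; it follows the stated hyperparameters (batch size 16, learning rate 1e-5), while the dataset construction, label set, and epoch count are assumptions made for illustration.

```python
# Sketch of the query classifier: BERT-base-multilingual-cased, batch size 16, lr 1e-5.
# The label set (retrieval vs. no_retrieval), epoch count, and dataset loading are
# illustrative assumptions.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["query"], truncation=True, padding="max_length", max_length=128)

args = TrainingArguments(
    output_dir="query_classifier",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=3,  # assumption: the epoch count is not reported above
)
# With a tokenized dataset of (query, label) pairs, training reduces to:
# Trainer(model=model, args=args, train_dataset=train_ds.map(tokenize, batched=True)).train()
```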

A.2 Experimental Details of Retrieval Methods

Implementation details of the comparative experiments on different retrieval methods are as follows:

Datasets  We use the TREC DL 2019 [54] and 2020 [55] passage ranking datasets to evaluate the performance of different retrieval methods.

Metrics  Widely used evaluation metrics for retrieval include mAP, nDCG@10, R@50, and R@1k. Both mAP and nDCG@10 are order-aware metrics that take the ranking of search results into account, whereas R@k is order-unaware. We also report the average latency incurred by each method per query.

Implementation Details  For sparse retrieval, we use the BM25 algorithm, which builds on TF-IDF term weighting. For dense retrieval, we employ Contriever as our unsupervised contrastive text encoder. Based on our evaluation of embedding models, we implement supervised dense retrieval using LLM-Embedder. We use the default implementations of BM25 and Contriever from Pyserini [56]. The BM25 index is constructed with Lucene on the MS MARCO collection, while the dense vector index is built with Faiss using a flat configuration on the same dataset. For query rewriting, we prompt Zephyr-7b-alpha (https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha), a model trained to act as a helpful assistant, to rewrite the original query. For query decomposition, we employ GPT-3.5-turbo-0125 to break down the original query into multiple sub-queries. We closely follow the implementation of HyDE [10], using the more capable instruction-following model GPT-3.5-turbo-instruct to generate hypothetical answers; the model samples with a default temperature of 0.7 up to a maximum of 512 tokens. Retrieval experiments and evaluation are conducted with the Pyserini toolkit.
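To make the retrieval setup concrete, the sketch below fuses BM25 and dense scores over a HyDE-style hypothetical document. It is a schematic sketch only: the fusion weight, the score handling, and the `generate_hypothetical_doc` helper are assumptions, and the searcher objects stand in for Pyserini-style sparse and dense searchers.

```python
# Schematic sketch of "Hybrid with HyDE": generate a hypothetical document for the query,
# retrieve with sparse (BM25) and dense search, then fuse the scores.
# The alpha weight and generate_hypothetical_doc are illustrative assumptions.
from collections import defaultdict

def hybrid_with_hyde(query, sparse_searcher, dense_searcher, generate_hypothetical_doc,
                     k=50, alpha=0.3):
    """Weighted score fusion: final = alpha * sparse_score + (1 - alpha) * dense_score."""
    pseudo_doc = generate_hypothetical_doc(query)        # e.g. a GPT-3.5 hypothetical answer
    scores = defaultdict(float)
    for hit in sparse_searcher.search(query, k=k):       # BM25 on the raw query
        scores[hit.docid] += alpha * hit.score
    for hit in dense_searcher.search(pseudo_doc, k=k):   # dense search on the HyDE document
        scores[hit.docid] += (1 - alpha) * hit.score
    # In practice the two score distributions may need normalization before fusion.
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]
```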

A.3 Experimental Details of Reranking Methods

Datasets  Our experiments utilize the MS MARCO Passage ranking dataset, a substantial corpus designed for machine reading comprehension tasks. This dataset comprises over 8.8 million passages and 1 million queries. The training set contains approximately 398M tuples of queries paired with corresponding positive and negative passages, while the development set comprises 6,980 queries, paired with their BM25 retrieval results, and preserves the top-1000 ranked candidate passages for each query. We evaluate the effectiveness of the methods on the development set, as the test set is not publicly available.

Metrics  The evaluation metrics MRR@1, MRR@10, MRR@1k and Hit Rate@10 are used. MRR@10 is the official metric proposed by MS MARCO.

Implementation Details  We follow and make modifications to the implementation provided by PyGaggle [26] and TILDE [28]. For DLM-based reranking, we use monoT5 [26] based on T5-base, monoBERT [25] based on BERT-large and RankLLaMA [27] based on Llama-2-7b. For TILDE reranking, we use TILDEv2 [29] based on BERT-base.

Typically, 50 documents are retrieved as input for the reranking module. The documents remaining after the reranking and repacking phases can be further pruned by applying a top-k cutoff or a relevance score threshold.
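For reference, a reranking call with monoT5 via PyGaggle looks roughly like the sketch below; the default checkpoint and the explicit sort are assumptions about the library’s behavior rather than the paper’s exact code.

```python
# Rough sketch of monoT5 reranking with PyGaggle; checkpoint choice and sorting behavior
# are assumptions, not the paper's exact implementation.
from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5

reranker = MonoT5()  # loads a T5-base reranker fine-tuned on MS MARCO by default

def rerank_passages(query, passages, top_k=10):
    """Score the ~50 retrieved passages with monoT5 and keep the top_k."""
    texts = [Text(p) for p in passages]
    scored = reranker.rerank(Query(query), texts)
    scored.sort(key=lambda t: t.score, reverse=True)
    return [t.text for t in scored[:top_k]]
```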

Result Analysis  Reranking results are shown in Table 9. We compare our results with a randomly shuffled ordering and the BM25 retrieval baseline. All reranking methods demonstrate a notable performance increase across all metrics. monoT5 and monoBERT achieve approximately equal performance, while RankLLaMA performs best, with latency increasing in that order. TILDEv2 is the fastest, taking roughly 10 to 20 milliseconds per query at some cost in performance. Additionally, TILDEv2 requires that the reranked passages be identical to those in the previously indexed collection; for new, unseen passages, preprocessing must be redone at inference time, negating its efficiency advantage.

A.4 Experimental Details of Summarization Methods

Selective Context  Selective Context enhances LLM efficiency by identifying and removing redundant information in the input context. It evaluates the informativeness of lexical units using self-information computed by a base causal language model. This method is non-query-based, allowing a comparison between query-based and non-query-based approaches.
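The self-information signal can be computed directly from a causal LM’s token log-probabilities; the sketch below uses GPT-2 purely as an illustrative base model, not the one used in the experiments.

```python
# Sketch of the self-information score behind Selective Context: tokens with low
# -log p(token | prefix) under a base causal LM carry little information and can be pruned.
# GPT-2 is an illustrative choice of base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def token_self_information(text):
    """Return (token, self-information) pairs for every token after the first."""
    ids = tok(text, return_tensors="pt").input_ids
    log_probs = torch.log_softmax(lm(ids).logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    return list(zip(tok.convert_ids_to_tokens(targets[0]), nll.tolist()))
```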

Datasets  We evaluated these methods on three datasets: Natural Questions (NQ) [57], TriviaQA [58], and HotpotQA [59].

Metrics  Evaluation metrics include the F1 score and the number of tokens changed after summarization to measure conciseness.

Implementation Details  For all methods, we use Llama3-8B-Instruct as the generator model and set a summarization ratio of 0.4. For extractive methods, importance scores determine the sentences retained. For abstractive methods, we control the maximum generation length using the summarization ratio to align with extractive methods. Experiments are conducted on the NQ test set, TriviaQA test set, and HotpotQA development set.

Context   Model   NQ      TriviaQA   HotpotQA   ASQA     Avg.
D_∅       M_b     29.78   60.44      23.73      37.89    37.96
          M_g     26.23   58.26      26.67      32.30    35.87
          M_r     31.10   61.37      28.40      39.96    40.21
          M_gr    25.92   57.62      26.43      32.99    35.70
          M_gg    26.69   58.07      27.04      33.75    36.39
D_g       M_b     44.78   79.90      56.72      71.64    63.26
          M_g     85.72   88.16      79.82      85.51    84.80
          M_r     60.98   80.20      65.73      67.49    68.60
          M_gr    87.60   87.94      *81.07     87.58    *86.05
          M_gg    86.72   *88.35     79.59      83.44    84.53
D_r       M_b     16.49   50.03      21.57      28.79    29.22
          M_g     22.15   46.98      24.36      29.40    30.72
          M_r     36.92   58.42      29.64      39.54    41.13
          M_gr    23.63   45.01      24.17      27.95    30.19
          M_gg    21.08   43.83      23.23      27.33    28.87
D_gr      M_b     34.65   81.27      52.75      65.42    58.52
          M_g     85.00   87.33      78.18      83.02    83.38
          M_r     60.28   79.32      63.82      67.29    67.68
          M_gr    *87.63  87.14      79.95      *87.78   85.63
          M_gg    86.31   86.90      78.10      83.85    83.79
Table 12: Results of the model augmented with different contexts on various QA datasets. Values marked with an asterisk (*) are the best in each column (shown in bold in the original).

A.5 Experimental Details of Generator Fine-tuning

Datasets  We fine-tune our model on several question answering (QA) and reading comprehension datasets, including ASQA [60], HotpotQA [59], NarrativeQA [61], NQ [57], SQuAD [62], TriviaQA [58], and TruthfulQA [63]. We use their train splits (for those containing significantly more entries than the others, we draw a random sample). For evaluation, ASQA [60], HotpotQA [59], NQ [57], and TriviaQA [58] are used. We evaluate our model on their validation splits, or manually split a subset from the training set to avoid overlap. The exact number of entries in each train and test set is detailed in Table 13.

Dataset       #Train   #Eval
ASQA          2,090    483
HotpotQA      15,000   7,405
TriviaQA      9,000    6,368
NQ            15,000   8,006
NarrativeQA   7,000    -
SQuAD         67,000   -
TruthfulQA    817      -
Table 13: Number of examples in each dataset used in the fine-tuning experiments.

We use the dataset-provided documents as d_gold for each data entry. To obtain d_random, we sample the contexts of other entries within the same dataset, so that the distributions of d_random and d_gold are roughly similar.

Metrics  We use ground-truth coverage as our evaluation metric, since the answers in QA tasks are relatively short while the model’s generation length is sometimes hard to constrain.

Implementation Details  We select Llama-2-7b [50] as the base model. For efficiency, we use LoRA [64] and int8 quantization during training. The prompt templates used for fine-tuning and evaluation mainly follow Lin et al. [37]. We train our generator for 3 epochs and constrain the maximum sequence length to 1,600, using a batch size of 4 and a learning rate of 5e-5. During testing, we use a zero-shot setting.
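A minimal sketch of this training setup with Hugging Face `peft` is shown below; the stated hyperparameters (3 epochs, batch size 4, learning rate 5e-5) are taken from the text, while the LoRA rank, target modules, and quantization flag are illustrative assumptions.

```python
# Sketch of the generator fine-tuning setup: Llama-2-7b with LoRA and int8 quantization.
# Epochs, batch size, and learning rate follow the text; LoRA rank/targets are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", load_in_8bit=True, device_map="auto"
)
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)

args = TrainingArguments(
    output_dir="rag_generator",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=5e-5,
)
# A Trainer over prompt/answer sequences truncated to 1,600 tokens would be attached here;
# dataset construction is omitted for brevity.
```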

Detailed Results  Table 12 shows our evaluation results on each dataset.

[Instruction] Please generate ten descriptions for the continuation task.
[Context] For example: 1. “French.Washington played a crucial role in the American Revolutionary War, leading the Continental Army against the British.” Please continue writing the above paragraph. 2. “The discovery of the double helix structure of DNA by James Watson and Francis Crick revolutionized the field of genetics, laying the foundation for modern molecular biology and biotechnology.” Please continue by discussing recent developments in genetic research, such as CRISPR gene editing, and their potential ethical implications.
Table 14: Template for generating task classification data.

A.6 Experimental Details of Comprehensive Evaluation

Tasks and Datasets  We conducted extensive experiments across various NLP tasks and datasets to assess the performance of RAG systems. Specifically: (1) Commonsense Reasoning: We evaluated on the MMLU [65], ARC-Challenge [66], and OpenbookQA [67] datasets. (2) Fact Checking: Our evaluation encompassed the FEVER [68] and PubHealth [69] datasets. (3) Open-Domain QA: We assessed on the NQ [57], TriviaQA [58], and WebQuestions [70] datasets. (4) MultiHop QA: Our evaluation included the HotPotQA [59], 2WikiMultiHopQA [71], and MuSiQue [52] datasets. For MuSiQue, we followed the approach outlined in [72] and focused solely on answerable 2-hop questions. (5) Medical QA: We also assessed on the PubMedQA [73] dataset. For each dataset, we randomly sub-sample 500 entries from the test set for our experiments. For datasets without a test set, we use the development set instead.

To assess RAG capabilities, we evenly collect a total of 500 entries from NQ, TriviaQA, HotPotQA, 2WikiMultiHopQA and MuSiQue. Each entry is a “question, gold document, gold answer” triple.

Metrics  We use token-level F1 score and EM score for the Open-Domain QA and MultiHop QA tasks, and accuracy for the others. We adopt a more lenient EM score, which counts a generation as correct if it includes a gold answer, rather than requiring a strict exact match [74].
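A lenient EM check of this kind can be implemented as a simple substring test over normalized strings; the normalization steps below (lowercasing, punctuation and whitespace stripping) are common choices and should be read as assumptions rather than the paper’s exact procedure.

```python
# Sketch of the lenient EM metric: a prediction is correct if any gold answer appears
# as a substring of the normalized model output. Normalization details are assumptions.
import re
import string

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def lenient_em(prediction, gold_answers):
    pred = normalize(prediction)
    return any(normalize(g) in pred for g in gold_answers)
```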

For the evaluation of RAG capabilities, we adopt four metrics from RAGAs: Faithfulness, Context Relevancy, Answer Relevancy, and Answer Correctness. Faithfulness measures how factually consistent the generated answer is with the retrieved context; an answer is considered faithful if all of its claims can be directly inferred from the provided context. Context Relevancy evaluates how relevant the retrieved context is to the original query. Answer Relevancy assesses the pertinence of the generated answer to the original query. Answer Correctness measures the accuracy of the generated answer against the ground truth. For example, Context Relevancy is calculated as the proportion of sentences within the retrieved context that are relevant for answering the given question:

\[
\text{context relevancy} = \frac{|S|}{|\mathrm{Total}|} \tag{2}
\]

where |S| denotes the number of relevant sentences and |Total| denotes the total number of retrieved sentences. All these metrics are evaluated using the RAGAs framework, with GPT-4 serving as the judge.

Additionally, we compute the cosine similarity between the retrieved document and the gold document as Retrieval Similarity: the retrieved document and the gold document are fed into an embedding model, and the resulting embeddings are used to compute the cosine similarity.
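Concretely, Retrieval Similarity can be computed as below; the embedding checkpoint is an illustrative assumption, since the text does not name the model used for this metric.

```python
# Sketch of the Retrieval Similarity metric: embed both documents and take the cosine
# similarity. The embedding checkpoint is an illustrative assumption.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieval_similarity(retrieved_doc, gold_doc):
    embs = embedder.encode([retrieved_doc, gold_doc], normalize_embeddings=True)
    return float(util.cos_sim(embs[0], embs[1]))
```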

Implementation Details  For Open-Domain QA and MultiHop QA datasets, we set the generation model’s maximum new token number to 100 tokens. For other datasets, we set it to 50 tokens. To deal with excessively long retrieved documents, we truncated the documents to 2048 words when evaluating RankLLaMA and LongLLMLingua.

For all datasets, we use greedy decoding during generation. To better compare the capabilities of different RAG modules, we adopt the 0-shot evaluation setting, i.e., no in-context examples are offered. In the multiple choice and fact checking tasks, answers generated by the model may take a variety of forms (e.g., “the answer is A” instead of “A”). Therefore, we preprocess the responses generated by the model, applying regular expression templates to match them with gold labels.