

Rerankers and Two-Stage Retrieval


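Retrieval Augmented Generation (RAG) is an overloaded term. It promises the world, but after developing a RAG pipeline, many of us are left wondering why it doesn't work as well as we had expected.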


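As with most tools, RAG is easy to use but hard to master. The truth is that there is more to RAG than putting documents into a vector DB and adding an LLM on top. That can work, but it won't always.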


This ebook aims to tell you what to do when out-of-the-box RAG doesn't work. In this chapter, we'll look at what is often the easiest and fastest fix for a suboptimal RAG pipeline: rerankers.




Recall vs. Context Windows


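Before jumping into the solution, let's talk about the problem. With RAG, we are performing a semantic search across many text documents. These could number anywhere from tens of thousands to tens of billions.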


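To ensure fast search times at scale, we typically use vector search. That is, we transform our text into vectors, place them all into a vector space, and compare their proximity to a query vector using a similarity metric such as cosine similarity.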
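As a rough illustration of the idea, the sketch below runs a brute-force cosine-similarity search over three made-up 3-dimensional vectors. The vectors, document ids, and the `vector_search` helper are assumptions for illustration only; a production vector DB would use real embeddings and an approximate nearest-neighbor index rather than scoring every document.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def vector_search(query_vec: list[float], doc_vecs: dict[str, list[float]], top_k: int) -> list[str]:
    """Return the ids of the top_k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:top_k]]

# Toy "embeddings" (hypothetical values, not produced by a real model)
doc_vecs = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.7, 0.7, 0.0],
    "doc-c": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]

results = vector_search(query_vec, doc_vecs, top_k=2)
print(results)
```

Here `doc-a` and `doc-b` point in roughly the same direction as the query, while `doc-c` is orthogonal to it and falls below the `top_k` cutoff.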


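For vector search to work, we need vectors. These vectors are essentially compressions of the "meaning" behind some text into (typically) 768- or 1536-dimensional vectors. Because we are compressing this information into a single vector, some information is lost.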


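Because of this information loss, we often see that the top three (for example) vector search documents miss relevant information. Unfortunately, the retrieval step may rank relevant information below our top_k cutoff.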


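What do we do if relevant information at a lower position would help our LLM formulate a better response? The easiest approach is to increase the number of documents we return (increase top_k) and pass them all to the LLM.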


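The metric we would measure here is recall, meaning "how many of the relevant documents are we retrieving". Recall does not take the total number of retrieved documents into account, so we can hack the metric and get perfect recall by returning everything.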

$$\text{recall@K} = \frac{\#\ \text{of relevant docs returned}}{\#\ \text{of relevant docs in dataset}}$$
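The recall@K definition above can be written as a few lines of code. This is a minimal sketch; the ranking and the set of relevant document ids are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k results."""
    top_k_ids = set(retrieved[:k])
    return len(top_k_ids & relevant) / len(relevant)

# Hypothetical ranking returned by a retriever, best match first
ranking = ["d7", "d2", "d9", "d4", "d1", "d5"]
# Hypothetical ground truth: the documents that are actually relevant
relevant = {"d2", "d4", "d5"}

recall_top3 = recall_at_k(ranking, relevant, k=3)            # only d2 is in the top 3
recall_all = recall_at_k(ranking, relevant, k=len(ranking))  # returning everything

print(recall_top3, recall_all)
```

Returning every document gives a recall of 1.0, which is exactly the "hack" described above: the metric says nothing about how much irrelevant text came along with the relevant documents.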


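Unfortunately, we cannot return everything. LLMs have limits on how much text we can pass to them; we call this limit the context window. Some LLMs have huge context windows, such as Anthropic's Claude, with a context window of 100K tokens [1]. With that, we could fit many pages of text, so could we return many documents (though not all of them) and "stuff" the context window to improve recall?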


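Again, no. We cannot use context stuffing because it reduces the LLM's recall performance. Note that this is LLM recall, which is different from the retrieval recall we have discussed so far.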

When storing information in the middle of a context window, an LLM's ability to recall that information becomes worse than had it not been provided in the first place [2].


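LLM recall refers to the ability of an LLM to find information within its context window. Research shows that LLM recall degrades as we put more tokens into the context window [2]. LLMs are also less likely to follow instructions as we stuff the context window, so context stuffing is a bad idea.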


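We can increase the number of documents returned by our vector DB to improve retrieval recall, but we cannot pass all of them to our LLM without damaging LLM recall.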


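The solution is to maximize retrieval recall by retrieving plenty of documents, and then maximize LLM recall by minimizing the number of documents that reach the LLM. To do that, we reorder the retrieved documents and keep only the most relevant ones for our LLM. This step is called reranking.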

 Power of Rerankers


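A reranking model, also known as a cross-encoder, is a type of model that, given a query and document pair, outputs a similarity score. We use this score to reorder the documents by relevance to our query.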

A two-stage retrieval system. The vector DB step will typically include a bi-encoder or sparse embedding model.



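Search engineers have used rerankers in two-stage retrieval systems for a long time. In these two-stage systems, a first-stage model (an embedding model / retriever) retrieves a set of relevant documents from a larger dataset. Then, a second-stage model (the reranker) reorders the documents retrieved by the first-stage model.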


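We use two stages because retrieving a small set of documents from a large dataset is much faster than reranking a large set of documents. We will discuss why this is the case soon, but the short version is that rerankers are slow and retrievers are fast.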
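The shape of such a two-stage system can be sketched in a few lines. Everything here is a stand-in: `word_overlap` plays the role of the fast first-stage retriever and `overlap_density` plays the role of the slow reranker. Neither is a real embedding model or cross-encoder, and the corpus is invented for illustration.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]

def two_stage_search(query: str, corpus: List[str], first_stage_score: Scorer,
                     rerank_score: Scorer, top_k: int, top_n: int) -> List[str]:
    # Stage 1 (fast): score every document cheaply and keep the top_k candidates.
    candidates = sorted(corpus, key=lambda d: first_stage_score(query, d), reverse=True)[:top_k]
    # Stage 2 (slow): run the expensive scorer on those top_k candidates only.
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:top_n]

def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of distinct query words found in the document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def overlap_density(query: str, doc: str) -> float:
    """Toy second-stage scorer: word overlap normalized by document length."""
    return word_overlap(query, doc) / len(doc.split())

corpus = [
    "rlhf aligns models with human preferences",
    "rlhf training details plus many other unrelated words in one long chunk",
    "vector databases store embeddings",
    "why use rlhf it makes models more helpful",
]

top_docs = two_stage_search("why use rlhf", corpus, word_overlap, overlap_density, top_k=3, top_n=2)
print(top_docs)
```

The important property is structural: the expensive scorer only ever sees `top_k` documents, no matter how large the corpus is.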

 Why Rerankers?


If a reranker is so much slower, why bother using one? The answer is that rerankers are much more accurate than embedding models.


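The intuition behind a bi-encoder's inferior accuracy is that bi-encoders must compress all of the possible meanings of a document into a single vector, meaning we lose information. Additionally, bi-encoders have no context on the query because we don't know the query until we receive it (we create embeddings before user query time).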


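A reranker, on the other hand, can feed the raw information directly into the large transformer computation, meaning less information loss. Because we run rerankers at user query time, we have the added benefit of analyzing our document's meaning specifically in relation to the user query, rather than trying to produce a generic, averaged meaning.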


Rerankers avoid the information loss of bi-encoders, but they come with a different penalty: time.

A bi-encoder model compresses the document or query meaning into a single vector. Note that the bi-encoder processes our query in the same way as it does documents, but at user query time.



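When using bi-encoder models with vector search, we frontload all of the heavy transformer computation to the point where we create the initial vectors. That means that when a user queries our system, the document vectors already exist, so all we need to do is:


  1. Run a single transformer computation to create the query vector.

  2. Compare the query vector with the document vectors using cosine similarity (or another lightweight metric).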


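With rerankers, we do not pre-compute anything. Instead, we feed our query and a single other document into the transformer, run a whole transformer inference step, and output a single similarity score. We repeat this for every document we want to score.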
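To make the cost difference concrete, the sketch below counts transformer forward passes at query time for both approaches. `CountingModel` is a hypothetical stand-in that only increments a counter; it does not run a real model, and the placeholder return values carry no meaning.

```python
class CountingModel:
    """Stand-in for a transformer: counts forward passes instead of running them."""

    def __init__(self) -> None:
        self.forward_passes = 0

    def encode(self, text: str) -> list[float]:
        # Bi-encoder style: one forward pass per text, producing one vector.
        self.forward_passes += 1
        return [float(len(text))]  # placeholder vector

    def score_pair(self, query: str, doc: str) -> float:
        # Cross-encoder style: one forward pass per (query, document) pair.
        self.forward_passes += 1
        return 0.0  # placeholder score

docs = [f"document {i}" for i in range(1000)]
query = "why would we want to do rlhf?"

# Bi-encoder: documents are encoded ahead of time, before any query arrives.
bi_encoder = CountingModel()
doc_vectors = [bi_encoder.encode(d) for d in docs]
bi_encoder.forward_passes = 0  # reset so we only count query-time work
query_vector = bi_encoder.encode(query)
bi_query_passes = bi_encoder.forward_passes

# Cross-encoder: nothing can be pre-computed, every pair is scored at query time.
cross_encoder = CountingModel()
scores = [cross_encoder.score_pair(query, d) for d in docs]
cross_query_passes = cross_encoder.forward_passes

print(bi_query_passes, cross_query_passes)
```

The bi-encoder needs one forward pass per query regardless of corpus size, while the cross-encoder needs one per document, which is why it is only practical on a small candidate set.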

A reranker considers query and document to produce a single similarity score over a full transformer inference step. Note that document A here is equivalent to our query.



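Given 40M records, if we use a small reranking model like BERT on a V100 GPU, we would be waiting more than 50 hours to return a single query result [3]. We can do the same in <100ms with encoder models and vector search.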
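A quick back-of-the-envelope calculation shows why the same reranker becomes usable once it only sees a handful of candidates. The sketch assumes the 50-hour figure scales linearly with the number of documents, which ignores batching and other hardware effects, and uses top_k = 25 as an example candidate count.

```python
records = 40_000_000  # documents in the example above
total_hours = 50      # lower bound quoted for reranking all of them

# Implied cost of scoring a single (query, document) pair
seconds_per_pair = total_hours * 3600 / records

# Cost of reranking only the candidates returned by the first stage
top_k = 25
rerank_seconds = top_k * seconds_per_pair

print(f"{seconds_per_pair * 1000:.1f} ms per pair")
print(f"{rerank_seconds * 1000:.1f} ms to rerank {top_k} documents")
```

Under these assumptions each pair costs roughly 4.5 ms, so reranking 25 candidates adds on the order of a tenth of a second per query instead of days.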


Implementing Two-Stage Retrieval with Reranking


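Now that we understand the idea and the reasoning behind two-stage retrieval with rerankers, let's see how to implement it (you can follow along with this notebook). To begin, we set up our prerequisite libraries: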

!pip install -qU \
  datasets==2.14.5 \
  openai==0.28.1 \
  pinecone-client==2.2.4 \
  cohere==4.27

 Data Preparation


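Before setting up the retrieval pipeline, we need data to retrieve! We will use the jamescalam/ai-arxiv-chunked dataset from Hugging Face Datasets. This dataset contains more than 400 ArXiv papers on ML, NLP, and LLMs, including the Llama 2, GPTQ, and GPT-4 papers.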

from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data
Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})


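The dataset contains 41.5K pre-chunked records. Each record is 1-2 paragraphs long and includes additional metadata about the paper it comes from. Here is an example: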

data[0]
{'doi': '1910.01108',
 'chunk-id': '0',
 'chunk': 'DistilBERT, a distilled version of BERT: smaller,\nfaster, cheaper and lighter\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\nHugging Face\n{victor,lysandre,julien,thomas}@huggingface.co\nAbstract\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be finetuned with good performances on a wide range of tasks like its larger counterparts.\nWhile most prior work investigated the use of distillation for building task-specific\nmodels, we leverage knowledge distillation during the pre-training phase and show\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\nof its language understanding capabilities and being 60% faster. To leverage the\ninductive biases learned by larger models during pre-training, we introduce a triple\nloss combining language modeling, distillation and cosine-distance losses. Our\nsmaller, faster and lighter model is cheaper to pre-train and we demonstrate its',
 'id': '1910.01108',
 'title': 'DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter',
 'summary': 'As Transfer Learning from large-scale pre-trained models becomes more\nprevalent in Natural Language Processing (NLP), operating these large models in\non-the-edge and/or under constrained computational training or inference\nbudgets remains challenging. In this work, we propose a method to pre-train a\nsmaller general-purpose language representation model, called DistilBERT, which\ncan then be fine-tuned with good performances on a wide range of tasks like its\nlarger counterparts. While most prior work investigated the use of distillation\nfor building task-specific models, we leverage knowledge distillation during\nthe pre-training phase and show that it is possible to reduce the size of a\nBERT model by 40%, while retaining 97% of its language understanding\ncapabilities and being 60% faster. To leverage the inductive biases learned by\nlarger models during pre-training, we introduce a triple loss combining\nlanguage modeling, distillation and cosine-distance losses. Our smaller, faster\nand lighter model is cheaper to pre-train and we demonstrate its capabilities\nfor on-device computations in a proof-of-concept experiment and a comparative\non-device study.',
 'source': 'http://arxiv.org/pdf/1910.01108',
 'authors': ['Victor Sanh',
  'Lysandre Debut',
  'Julien Chaumond',
  'Thomas Wolf'],
 'categories': ['cs.CL'],
 'comment': 'February 2020 - Revision: fix bug in evaluation metrics, updated\n  metrics, argumentation unchanged. 5 pages, 1 figure, 4 tables. Accepted at\n  the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing\n  - NeurIPS 2019',
 'journal_ref': None,
 'primary_category': 'cs.CL',
 'published': '20191002',
 'updated': '20200301',
 'references': [{'id': '1910.01108'}]}


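We will be feeding this data into Pinecone, so let's reformat the dataset to be more Pinecone-friendly for the later embedding and indexing process. The format will contain id, text (which we will embed), and metadata. We won't use the metadata in this example, but it is helpful to include if we want to do metadata filtering in the future.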

data = data.map(lambda x: {
    "id": f'{x["id"]}-{x["chunk-id"]}',
    "text": x["chunk"],
    "metadata": {
        "title": x["title"],
        "url": x["source"],
        "primary_category": x["primary_category"],
        "published": x["published"],
        "updated": x["updated"],
        "text": x["chunk"],
    }
})
# drop unneeded columns
data = data.remove_columns([
    "title", "summary", "source",
    "authors", "categories", "comment",
    "journal_ref", "primary_category",
    "published", "updated", "references",
    "doi", "chunk-id",
    "chunk"
])
data
Dataset({
    features: ['id', 'text', 'metadata'],
    num_rows: 41584
})

 Embed and Index


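To store everything in the vector DB, we need to encode everything with an embedding / bi-encoder model. For simplicity, we will use OpenAI's text-embedding-ada-002. We do need an OpenAI API key to authenticate via the OpenAI client: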

import openai

# platform.openai.com
# get API key from top-right dropdown on OpenAI website
openai.api_key = "YOUR_OPENAI_API_KEY"

embed_model = "text-embedding-ada-002"


Now we create our vector DB to store our vectors. For this, we need a free Pinecone API key. You can find the API key and the environment in the "API Keys" section of the left navbar.

import pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = "YOUR_PINECONE_API_KEY"
# find your environment next to the api key in pinecone console
env = "YOUR_PINECONE_ENV"

pinecone.init(api_key=api_key, environment=env)


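After authentication, we create our index. We set the dimension to the dimensionality of Ada-002 (1536) and use a metric that is compatible with Ada-002, which can be either cosine or dotproduct.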

import time

index_name = "rerankers"

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct'
    )
    # wait for index to be initialized
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pinecone.Index(index_name)
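On the metric choice above: cosine and dot product produce the same scores whenever the vectors have unit length, which OpenAI documents as being the case for its embeddings. The sketch below checks this identity on two made-up 2-dimensional vectors; the values are illustrative and are not real embeddings.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical vectors, normalized to length 1
a = normalize([3.0, 4.0])
b = normalize([1.0, 2.0])

dot_score = dot(a, b)
cosine_score = cosine(a, b)
print(dot_score, cosine_score)  # equal up to floating point error
```

For unit-length vectors the denominator in the cosine formula is 1, so the dot product gives the same ranking with slightly less computation.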


We are now ready to begin populating the index using OpenAI's embedding model, like so:

from tqdm.auto import tqdm

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(data), batch_size)):
    passed = False
    # find end of batch
    i_end = min(len(data), i+batch_size)
    # create batch
    batch = data[i:i_end]
    # create embeddings (exponential backoff to avoid RateLimitError)
    for j in range(5):  # max 5 retries
        try:
            res = openai.Embedding.create(input=batch["text"], engine=embed_model)
            passed = True
            break  # stop retrying once the request succeeds
        except openai.error.RateLimitError:
            time.sleep(2**j)  # wait 2^j seconds before retrying
            print("Retrying...")
    if not passed:
        raise RuntimeError("Failed to create embeddings.")
    # get embeddings
    embeds = [record['embedding'] for record in res['data']]
    to_upsert = list(zip(batch["id"], embeds, batch["metadata"]))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)


Our index is now populated and ready for us to query!


Retrieval Without Reranking


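Before reranking, let's see what our results look like without it. We will define a function called get_docs that returns documents using only the first stage of retrieval: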

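def embed(texts: list):
    # helper not shown in the text above; assumed here to wrap the same
    # OpenAI embedding call used when populating the index
    res = openai.Embedding.create(input=texts, engine=embed_model)
    return [record["embedding"] for record in res["data"]]

def get_docs(query: str, top_k: int):
    # encode query
    xq = embed([query])[0]
    # search pinecone index
    res = index.query(xq, top_k=top_k, include_metadata=True)
    # map each doc text to its position in the first-stage ranking
    docs = {x["metadata"]['text']: i for i, x in enumerate(res["matches"])}
    return docs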


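Let's ask about Reinforcement Learning with Human Feedback (RLHF), a popular fine-tuning method behind the sudden performance gains that ChatGPT demonstrated when it was released.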

query = "can you explain why we would want to do rlhf?"
docs = get_docs(query, top_k=25)
print("\n---\n".join(list(docs.keys())[:3]))  # print the first 3 docs
whichmodels areprompted toexplain theirreasoningwhen givena complexproblem, inorder toincrease
the likelihood that their final answer is correct.
RLHF has emerged as a powerful strategy for fine-tuning Large Language Models, enabling significant
improvements in their performance (Christiano et al., 2017). The method, first showcased by Stiennon et al.
(2020) in the context of text-summarization tasks, has since been extended to a range of other applications.
In this paradigm, models are fine-tuned based on feedback from human users, thus iteratively aligning the
models’ responses more closely with human expectations and preferences.
Ouyang et al. (2022) demonstrates that a combination of instruction fine-tuning and RLHF can help fix
issues with factuality, toxicity, and helpfulness that cannot be remedied by simply scaling up LLMs. Bai
et al. (2022b) partially automates this fine-tuning-plus-RLHF approach by replacing the human-labeled
fine-tuningdatawiththemodel’sownself-critiquesandrevisions,andbyreplacinghumanraterswitha
---
We examine the influence of the amount of RLHF training for two reasons. First, RLHF [13, 57] is an
increasingly popular technique for reducing harmful behaviors in large language models [3, 21, 52]. Some of
these models are already deployed [52], so we believe the impact of RLHF deserves further scrutiny. Second,
previous work shows that the amount of RLHF training can significantly change metrics on a wide range of
personality, political preference, and harm evaluations for a given model size [41]. As a result, it is important
to control for the amount of RLHF training in the analysis of our experiments.
3.2 Experiments
3.2.1 Overview
We test the effect of natural language instructions on two related but distinct moral phenomena: stereotyping
and discrimination. Stereotyping involves the use of generalizations about groups in ways that are often
harmful or undesirable.4To measure stereotyping, we use two well-known stereotyping benchmarks, BBQ
[40] (§3.2.2) and Windogender [49] (§3.2.3). For discrimination, we focus on whether models make disparate
decisions about individuals based on protected characteristics that should have no relevance to the outcome.5
To measure discrimination, we construct a new benchmark to test for the impact of race in a law school course
---
model to estimate the eventual performance of a larger RL policy. The slopes of these lines also
explain how RLHF training can produce such large effective gains in model size, and for example it
explains why the RLHF and context-distilled lines in Figure 1 are roughly parallel.
• One can ask a subtle, perhaps ill-defined question about RLHF training – is it teaching the model
new skills or simply focusing the model on generating a sub-distribution of existing behaviors . We
might attempt to make this distinction sharp by associating the latter class of behaviors with the
region where RL reward remains linear inp
KL.
• To make some bolder guesses – perhaps the linear relation actually provides an upper bound on RL
reward, as a function of the KL. One might also attempt to extend the relation further by replacingp
KLwith a geodesic length in the Fisher geometry.
By making RL learning more predictable and by identifying new quantitative categories of behavior, we
might hope to detect unexpected behaviors emerging during RL training.
4.4 Tension Between Helpfulness and Harmlessness in RLHF Training
Here we discuss a problem we encountered during RLHF training. At an earlier stage of this project, we
found that many RLHF policies were very frequently reproducing the same exaggerated responses to all
remotely sensitive questions (e.g. recommending users seek therapy and professional help whenever they
...


We get reasonable performance here, notably these relevant chunks of text:

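Document | Chunk
0 | "enabling significant improvements in their performance"
0 | "iteratively aligning the models' responses more closely with human expectations and preferences"
0 | "instruction fine-tuning and RLHF can help fix issues with factuality, toxicity, and helpfulness"
1 | "increasingly popular technique for reducing harmful behaviors in large language models"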


The remaining documents and text cover RLHF but don't answer our specific question: "why would we want to do RLHF?"

 Reranking Responses


We will use Cohere's rerank endpoint for reranking. You will need a Cohere API key to use it. With our API key, we authenticate like so:

import cohere

# init client
co = cohere.Client("YOUR_COHERE_API_KEY")


Now we can rerank our results with co.rerank. Let's take the top_k=25 results returned by the first-stage retrieval step and rerank all of them (setting top_n=25) to see what reordering we get.

rerank_docs = co.rerank(
    query=query, documents=docs.keys(), top_n=25, model="rerank-english-v2.0"
)

[docs[doc.document["text"]] for doc in rerank_docs]

The reordered results look like so:

[0,
 23,
 14,
 3,
 12,
 6,
 9,
 8,
 1,
 17,
 7,
 21,
 2,
 16,
 10,
 20,
 18,
 22,
 24,
 13,
 19,
 4,
 15,
 11,
 5]


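We still have record 0 at the top, which is great because it contains plenty of information relevant to our query. However, the less relevant documents 1 and 2 have been replaced by documents 23 and 14 respectively.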
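The same observation can be read off the reordered list programmatically. This sketch copies the ranking printed above and assumes a cutoff of three documents; the variable names are illustrative.

```python
# Original first-stage positions, listed in their new reranked order (copied from above)
reranked_order = [0, 23, 14, 3, 12, 6, 9, 8, 1, 17, 7, 21, 2,
                  16, 10, 20, 18, 22, 24, 13, 19, 4, 15, 11, 5]
top_n = 3

# Documents the LLM would see after reranking
new_top = reranked_order[:top_n]

# Documents that the first stage ranked below the cutoff but the reranker promoted
promoted = [i for i in new_top if i >= top_n]

# Documents that were in the original top_n but dropped out, with their new position
demoted = {i: reranked_order.index(i) for i in range(top_n) if i not in new_top}

print("new top:", new_top)
print("promoted:", promoted)
print("demoted (original -> new position):", demoted)
```

Documents 23 and 14 would never have reached the LLM with a first-stage top_k of 3, which is the practical argument for retrieving more candidates and letting the reranker choose.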


Let's create a function that allows us to quickly compare the original results with the reranked results.

def compare(query: str, top_k: int, top_n: int):
    # first get vec search results
    docs = get_docs(query, top_k=top_k)
    i2doc = {docs[doc]: doc for doc in docs.keys()}
    # rerank
    rerank_docs = co.rerank(
        query=query, documents=docs.keys(), top_n=top_n, model="rerank-english-v2.0"
    )
    original_docs = []
    reranked_docs = []
    # compare order change
    for i, doc in enumerate(rerank_docs):
        rerank_i = docs[doc.document["text"]]
        print(str(i)+"\t->\t"+str(rerank_i))
        if i != rerank_i:
            reranked_docs.append(f"[{rerank_i}]\n"+doc.document["text"])
            original_docs.append(f"[{i}]\n"+i2doc[i])
    for orig, rerank in zip(original_docs, reranked_docs):
        print("ORIGINAL:\n"+orig+"\n\nRERANKED:\n"+rerank+"\n\n---\n")


We start with our RLHF query. This time, we run a more standard retrieve-then-rerank process, retrieving 25 documents (top_k=25) and reranking down to the top three (top_n=3).

compare(query, 25, 3)
0	->	0
1	->	23
2	->	14
ORIGINAL:
[1]
We examine the influence of the amount of RLHF training for two reasons. First, RLHF [13, 57] is an
increasingly popular technique for reducing harmful behaviors in large language models [3, 21, 52]. Some of
these models are already deployed [52], so we believe the impact of RLHF deserves further scrutiny. Second,
previous work shows that the amount of RLHF training can significantly change metrics on a wide range of
personality, political preference, and harm evaluations for a given model size [41]. As a result, it is important
to control for the amount of RLHF training in the analysis of our experiments.
3.2 Experiments
3.2.1 Overview
We test the effect of natural language instructions on two related but distinct moral phenomena: stereotyping
and discrimination. Stereotyping involves the use of generalizations about groups in ways that are often
harmful or undesirable.4To measure stereotyping, we use two well-known stereotyping benchmarks, BBQ
[40] (§3.2.2) and Windogender [49] (§3.2.3). For discrimination, we focus on whether models make disparate
decisions about individuals based on protected characteristics that should have no relevance to the outcome.5
To measure discrimination, we construct a new benchmark to test for the impact of race in a law school course

RERANKED:
[23]
We have shown that it’s possible to use reinforcement learning from human feedback to train language models
that act as helpful and harmless assistants. Our RLHF training also improves honesty, though we expect
other techniques can do better still. As in other recent works associated with aligning large language models
[Stiennon et al., 2020, Thoppilan et al., 2022, Ouyang et al., 2022, Nakano et al., 2021, Menick et al., 2022],
RLHF improves helpfulness and harmlessness by a huge margin when compared to simply scaling models
up.
Our alignment interventions actually enhance the capabilities of large models, and can easily be combined
with training for specialized skills (such as coding or summarization) without any degradation in alignment
or performance. Models with less than about 10B parameters behave differently, paying an ‘alignment tax’ on
their capabilities. This provides an example where models near the state-of-the-art may have been necessary
to derive the right lessons from alignment research.
The overall picture we seem to find – that large models can learn a wide variety of skills, including alignment, in a mutually compatible way – does not seem very surprising. Behaving in an aligned fashion is just
another capability, and many works have shown that larger models are more capable [Kaplan et al., 2020,

---

ORIGINAL:
[2]
model to estimate the eventual performance of a larger RL policy. The slopes of these lines also
explain how RLHF training can produce such large effective gains in model size, and for example it
explains why the RLHF and context-distilled lines in Figure 1 are roughly parallel.
• One can ask a subtle, perhaps ill-defined question about RLHF training – is it teaching the model
new skills or simply focusing the model on generating a sub-distribution of existing behaviors . We
might attempt to make this distinction sharp by associating the latter class of behaviors with the
region where RL reward remains linear inp
KL.
• To make some bolder guesses – perhaps the linear relation actually provides an upper bound on RL
reward, as a function of the KL. One might also attempt to extend the relation further by replacingp
KLwith a geodesic length in the Fisher geometry.
By making RL learning more predictable and by identifying new quantitative categories of behavior, we
might hope to detect unexpected behaviors emerging during RL training.
4.4 Tension Between Helpfulness and Harmlessness in RLHF Training
Here we discuss a problem we encountered during RLHF training. At an earlier stage of this project, we
found that many RLHF policies were very frequently reproducing the same exaggerated responses to all
remotely sensitive questions (e.g. recommending users seek therapy and professional help whenever they

RERANKED:
[14]
the model outputs safe responses, they are often more detailed than what the average annotator writes.
Therefore, after gathering only a few thousand supervised demonstrations, we switched entirely to RLHF to
teachthemodelhowtowritemorenuancedresponses. ComprehensivetuningwithRLHFhastheadded
benefit that it may make the model more robust to jailbreak attempts (Bai et al., 2022a).
WeconductRLHFbyfirstcollectinghumanpreferencedataforsafetysimilartoSection3.2.2: annotators
writeapromptthattheybelievecanelicitunsafebehavior,andthencomparemultiplemodelresponsesto
theprompts,selectingtheresponsethatissafestaccordingtoasetofguidelines. Wethenusethehuman
preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to
sample from the model during the RLHF stage.
BetterLong-TailSafetyRobustnesswithoutHurtingHelpfulness Safetyisinherentlyalong-tailproblem,
wherethe challengecomesfrom asmallnumber ofveryspecific cases. Weinvestigatetheimpact ofSafety

---


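Looking at these, we have dropped the one relevant chunk of text from document 1 and no relevant chunks from document 2. The following relevant pieces of information now replace them: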

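Original Position | Rerank Position | Chunk
23 | 1 | "train language models that act as helpful and harmless assistants"
23 | 1 | "RLHF training also improves honesty"
23 | 1 | "RLHF improves helpfulness and harmlessness by a huge margin"
23 | 1 | "enhance the capabilities of large models"
14 | 2 | "the model outputs safe responses"
14 | 2 | "often more detailed than what the average annotator writes"
14 | 2 | "RLHF to teach the model how to write more nuanced responses"
14 | 2 | "make the model more robust to jailbreak attempts"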


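After reranking, we have far more relevant information. Naturally, this can result in significantly better performance for RAG. It means we maximize relevant information while minimizing the noise fed into our LLM.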



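Reranking is one of the simplest methods for dramatically improving recall performance in Retrieval Augmented Generation (RAG) or any other retrieval-based pipeline.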


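We have explored why rerankers can provide better performance than their embedding model counterparts, and how a two-stage retrieval system gives us the best of both: search at scale while maintaining high-quality results.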


 References


[1] Introducing 100K Context Windows (2023), Anthropic


[2] N. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the Middle: How Language Models Use Long Contexts (2023)


[3] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019), UKP-TUDA
