
A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More

Zhichao Wang*, Bin Bi*, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri,
Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng
Salesforce
{zhichaowang, bin.bi, shivakumar.pentyala, k.ramnath, sougata.chaudhuri, shubham.mehrotra, james.zhu, xmao, sasur, claire.cheng}@salesforce.com

Abstract

With advancements in self-supervised learning, the availability of trillions of tokens in pre-training corpora, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectations. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.

Keywords: Large Language Model (LLM) · Alignment · Reward Model · Human / AI Feedback · Reinforcement Learning · RLHF · DPO

1 Introduction

Over the past decades, the pretraining of LLMs through self-supervised learning [1] has seen significant advancements. These improvements have been driven by the development of larger decoder-only Transformers, the utilization of trillions of tokens, and the parallelization of computations across multiple GPUs. Following the pretraining phase, instruction tuning was employed to guide LLMs in responding to human queries. Despite these advancements, a critical issue remains unresolved: LLMs can generate undesired responses, such as providing instructions on how to commit illegal activities. To mitigate this risk, it is essential to align LLMs with human values.
Reinforcement Learning from Human Feedback (RLHF) [2, 3] has emerged as a groundbreaking technique for aligning LLMs. This approach has led to the development of powerful models such as GPT-4 [4], Claude [5], and Gemini [6]. Following the introduction of RLHF, numerous studies have explored various approaches to further align LLMs. However, there has not yet been a comprehensive review of methods for aligning LLMs with human preferences. This paper aims to fill that gap by categorically reviewing existing literature and providing detailed analyses of individual papers.
In this paper, we have structured our review into four main topics: 1. Reward Model; 2. Feedback; 3. Reinforcement Learning (RL); and 4. Optimization. Each topic was further divided into subtopics, as shown in Figure 1. For the Reward Model, the subtopics were: 1. Explicit Reward Model vs. Implicit Reward Model; 2. Pointwise Reward Model vs. Preference Model; 3. Response-Level Reward vs. Token-Level Reward; and 4. Negative Preference Optimization. Regarding Feedback, the subtopics included: 1. Preference Feedback vs. Binary Feedback; 2. Pairwise Feedback vs. Listwise Feedback; and 3. Human Feedback vs. AI Feedback. In the RL section, the subtopics were: 1. Reference-Based RL vs. Reference-Free RL; 2. Length-Control RL; 3. Different Divergences in RL; and 4. On-Policy RL vs. Off-Policy RL. For Optimization, the subtopics were: 1. Online/Iterative Preference Optimization vs. Offline/Non-iterative Preference Optimization; and 2. Separating SFT and Alignment vs. Merging SFT and Alignment. Table 1 provided a detailed analysis of all the reviewed papers using these 13 evaluation metrics.
\footnotetext{These authors contributed equally to this work.}
Figure 1: The 13 categorical directions for xPO to align an LLM with human preference

2 Categorical Outline

This section provided a concise introduction to the key elements of LLM alignment, enabling readers to grasp the essential terms and the various existing research directions. It primarily covered four directions: 1. reward model, 2. feedback, 3. RL policy and 4. optimization.

2.1 Reward Model

The reward model was a fine-tuned LLM that assigned scores based on the prompt and the generated response. In this subsection, we would discuss: 1. utilizing explicit or implicit reward models; 2. employing pointwise reward or preference models; 3. using response-level or token-level reward models; and 4. training the reward model with solely negative preferences. A plot of these different reward models could be found in Figure 2.

2.1.1 Explicit Reward Model vs. Implicit Reward Model

In RLHF, researchers collected a large dataset composed of triplets, each including a prompt $x$, a desired response $y_w$, and an undesired response $y_l$. Based on this collected preference dataset, explicit reward models, represented as $r_\phi(x, y)$, were derived by fine-tuning pretrained LLMs to assign rewards for each prompt and response. This reward model was then used in an RL setting to align the LLM policy. Conversely, implicit reward models, represented as $r_\theta(x, y)$, bypassed the process of training an explicit reward model. For example, in DPO, a mapping was established between the optimal reward model and the optimal policy in RL, allowing the LLM to be aligned without directly deriving the reward model.
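As a concrete illustration of this mapping, the implicit reward in DPO can be expressed through the policy itself. A brief sketch using the standard DPO notation, where $\beta$ is the KL weight, $\pi_{\mathrm{ref}}$ the reference policy, and $Z(x)$ a prompt-dependent partition function that cancels whenever two responses to the same prompt are compared:

$$ r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) $$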
Figure 2: The four subtopics of reward model

2.1.2 Pointwise Reward Model vs. Preference Model

The original work in RLHF derived a pointwise reward model, which returned a reward score $r(x, y)$ given the prompt $x$ and response $y$. Given two pointwise reward scores $r(x, y_w)$ and $r(x, y_l)$ for a desired and an undesired response to the same prompt, the probability of the desired response being preferred over the undesired response, $p(y_w \succ y_l \mid x)$, could be obtained based on the Bradley-Terry (BT) model [38]. However, this methodology was inferior as it could not directly obtain pairwise preferences and could not accommodate inconsistencies in human labeling. To address this issue, Nash learning was proposed to directly model the pairwise preference $p(y_w \succ y_l \mid x)$.
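Concretely, under the BT model the preference probability takes the standard logistic form, where $\sigma$ denotes the sigmoid function:

$$ p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) = \frac{\exp\big(r(x, y_w)\big)}{\exp\big(r(x, y_w)\big) + \exp\big(r(x, y_l)\big)} $$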

2.1.3 Response-Level Reward vs. Token-Level Reward

In the original dataset collected in triplets, i.e., $(x, y_w, y_l)$, the reward was given per response. Thus, in RLHF and DPO, the rewards were built at the response level. However, in the Markov decision process [39], rewards were given after each action, resulting in a change of state. To achieve alignment after each action, token-level reward models were introduced.
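One way to relate the two granularities, given here only as an illustrative sketch rather than the formulation of any specific paper, is to view the response-level reward as a sum of per-token rewards along the generation trajectory, where the state at step $t$ consists of the prompt and the tokens generated so far:

$$ r(x, y) = \sum_{t=1}^{|y|} r_t(s_t, a_t), \qquad s_t = (x, y_{<t}), \quad a_t = y_t $$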

2.1.4 Negative Preference Optimization

In the RLHF dataset, humans labeled both desired and undesired responses. Recently, with advancements in LLM capabilities, some researchers have posited that LLMs could generate desired responses of even higher quality than those produced by human labelers. Consequently, they opted to use only the prompts and undesired responses from the collected dataset, generating the desired responses using LLMs.

2.2 Feedback

Feedback encompassed both preferences and binary responses from humans or AI, collected either in pairs or in lists. In this subsection, we would discuss three key distinctions: 1. preference feedback vs. binary feedback; 2. pairwise feedback vs. listwise feedback; and 3. human feedback vs. AI feedback. A plot of these feedback types could be found in Figure 3.

2.2.1 Preference Feedback vs. Binary Feedback

In the RLHF paper, preference feedback, i.e., $(x, y_w, y_l)$, was collected. However, subsequent works such as KTO and DRO suggested that preference feedback was more challenging to gather, and it would be advantageous to collect binary feedback instead.
Figure 3: The four subtopics of feedback
Binary feedback referred to simple "thumbs up" (positive) or "thumbs down" (negative) responses.

2.2.2 Pairwise Feedback vs. Listwise Feedback

In RLHF, listwise feedback was collected. This approach involved gathering $K$ different responses for a given prompt $x$ to expedite the labeling process. These listwise responses were then treated as $\binom{K}{2}$ pairwise responses. However, subsequent work, such as LiPO, proposed that it was more advantageous to treat listwise preferences as a ranking problem instead of viewing them as multiple pairwise preferences.
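The reduction from a ranked list to pairwise training data is straightforward. The following minimal Python sketch, a generic illustration rather than code from any of the surveyed papers, converts a list of $K$ responses ordered from best to worst into the $K(K-1)/2$ preference triplets consumed by pairwise methods:

```python
from itertools import combinations

def ranked_list_to_pairs(prompt, ranked_responses):
    """Convert responses ranked from best to worst into
    (prompt, preferred, dispreferred) triplets."""
    pairs = []
    for (i, better), (j, worse) in combinations(enumerate(ranked_responses), 2):
        # combinations preserves order, so i < j and `better` outranks `worse`
        pairs.append((prompt, better, worse))
    return pairs

# A list of K = 3 ranked responses yields 3 pairwise comparisons:
assert len(ranked_list_to_pairs("prompt", ["best", "ok", "worst"])) == 3
```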

2.2.3 Human Feedback vs. AI Feedback

In RLHF, feedback was collected from humans who were asked to provide preferences given multiple responses to the same prompt. However, this process has proven to be tedious and expensive. With the latest developments in LLMs, it has become possible to collect AI feedback to align LLMs.

2.3 Reinforcement Learning (RL)

The objective of RL was formulated as $\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big)$. This objective encompassed two primary goals: 1) maximizing the rewards of responses, and 2) minimizing the deviation of the aligned policy model $\pi_\theta$ from the initial reference (SFT) model $\pi_{\mathrm{ref}}$. The discussion on RL was divided into four subtopics: 1) Reference-Based RL vs. Reference-Free RL, 2) Length-Control RL, 3) Different Divergences in RL and 4) On-policy RL vs. Off-policy RL.
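In PPO-based implementations, this objective is commonly realized by shaping the reward signal with a per-token KL penalty against the frozen reference model. The snippet below is a minimal sketch of that idea; the function name and tensor layout are illustrative assumptions rather than any specific library's API:

```python
import torch

def kl_shaped_rewards(reward_score, logprobs_policy, logprobs_ref, beta=0.02):
    """Combine a response-level reward with a per-token KL penalty.

    reward_score:    scalar score from the reward model for the full response
    logprobs_policy: per-token log-probs under the current policy, shape (T,)
    logprobs_ref:    per-token log-probs under the reference (SFT) model, shape (T,)
    beta:            weight of the KL penalty
    """
    # Approximate per-token KL contribution: log pi_theta - log pi_ref
    rewards = -beta * (logprobs_policy - logprobs_ref)
    # Credit the reward model's score to the final token of the response.
    rewards[-1] = rewards[-1] + reward_score
    return rewards

# Example with dummy values for a two-token response:
r = kl_shaped_rewards(1.3, torch.tensor([-2.1, -0.9]), torch.tensor([-2.0, -1.1]))
```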

2.3.1 Reference-Based RL vs. Reference-Free RL

A key aim of the RL objective in RLHF was to minimize the distance between the current policy $\pi_\theta$ and the reference policy $\pi_{\mathrm{ref}}$. Consequently, most methodologies have focused on reference-based approaches. However, incorporating a reference policy introduced a significant memory burden. To address this issue, various methods have been proposed to avoid the reference policy. For instance, SimPO proposed a different objective that avoided the need for a reference policy altogether.
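As a point of reference, SimPO replaces the reference-model log-ratio with a length-normalized log-likelihood and a target margin $\gamma$. The loss below is written from the published SimPO formulation, so the exact symbols may differ slightly from the original paper:

$$ \mathcal{L}_{\mathrm{SimPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma\right)\right] $$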

2.3.2 Length-Control RL

When using LLMs as evaluators, it has been observed that they tended to favor verbose responses, even when no additional information was provided [40]. This bias could affect the alignment of the LLM. In addition, the verbosity of LLM responses might increase the time required for humans to read and understand. The original RL objective did not account for this issue, but subsequent works such as R-DPO and SimPO incorporated considerations for length control, where $|y|$ represented the length of the output response.

2.3.3 Different Divergences in RL

In RLHF, the reverse Kullback-Leibler (KL) divergence, i.e., $\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big)$, was commonly used to measure the distance between the current policy $\pi_\theta$ and the reference policy $\pi_{\mathrm{ref}}$. However, KL divergence has been found to reduce the diversity of responses. To address this, research has been conducted to explore the effects of different divergence measures, i.e., $f$-divergences $\mathbb{D}_f$. More details could be found in Section 3.12.

2.3.4 On-policy or Off-policy Learning

In RL, responses could be generated during training using a method called on-policy learning. The main advantage of on-policy learning was that it sampled responses from the latest version of the policy. In contrast, off-policy methods relied on responses generated earlier. Although off-policy methods could save time by avoiding the need to generate new responses during training, they had the drawback of using responses that might not align with the current policy.

2.4 Optimization

The alignment process of LLMs involved optimization. This section would discuss two key subtopics: 1. Iterative/Online Preference Optimization vs. Non-Iterative/Offline Preference Optimization; and 2. Separating SFT and Alignment vs. Merging SFT and Alignment. A plot of these two subtopics on optimization could be found in Figure 4.
Figure 4: The two subtopics of optimization

2.4.1 Iterative/Online Preference Optimization vs. Non-Iterative/Offline Preference Optimization

When only a previously collected dataset was utilized for alignment, the process was referred to as non-iterative/offline preference optimization. In contrast, iterative/online preference optimization became feasible when 1. humans labeled new data, or 2. LLMs assumed dual roles, both generating responses and evaluating them.
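A schematic of the online loop is given below purely as a hedged sketch: policy.generate, judge.prefer, and preference_update are hypothetical placeholders for response sampling, preference labeling (by humans or an LLM), and one round of DPO/RLHF-style training, respectively.

```python
def iterative_preference_optimization(policy, judge, prompts, n_rounds=3):
    """Sketch of online/iterative preference optimization with hypothetical helpers."""
    for _ in range(n_rounds):
        new_pairs = []
        for x in prompts:
            y_a, y_b = policy.generate(x), policy.generate(x)  # two fresh samples
            y_w, y_l = judge.prefer(x, y_a, y_b)               # label winner / loser
            new_pairs.append((x, y_w, y_l))
        policy = preference_update(policy, new_pairs)          # e.g., one DPO epoch
    return policy
```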

2.4.2 Separating SFT and Alignment vs. Merging SFT and Alignment

In RLHF, SFT and alignment were traditionally applied in a sequential, separate manner, which could be tedious and prone to catastrophic forgetting. To address this issue, some research, such as ORPO, has proposed integrating SFT and alignment into a single process to streamline fine-tuning. Additionally, PAFT suggested fine-tuning LLMs on SFT and alignment simultaneously, and then merging the results.
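For reference, ORPO augments the SFT loss with an odds-ratio preference term so that both objectives are optimized in a single pass. The loss below is written from the published ORPO formulation, so the notation may differ slightly from the original paper:

$$ \mathcal{L}_{\mathrm{ORPO}} = \mathbb{E}_{(x, y_w, y_l)}\left[\mathcal{L}_{\mathrm{SFT}} - \lambda \log \sigma\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right)\right], \qquad \mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)} $$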

3 Individual Paper Reviews in Detail

In this section, we review each paper individually, offering readers a summary of the major innovations presented, so they may not need to read the papers themselves.

3.1 RLHF/PPO

LLMs were pretrained on extensive corpora sourced from various origins, which inherently could not ensure the quality of the datasets. Furthermore, the primary objective of LLMs was to predict the next token, a goal that diverged from the aim of "following the user's instructions helpfully and safely" [2]. Consequently, LLMs could produce outputs that were untruthful, toxic, or otherwise unhelpful to users. In essence, these models were not aligned with the users' intents. The principal aim of RLHF/PPO was to align language models with user intent across a broad spectrum of tasks by fine-tuning them using human feedback. Various studies have been conducted on this subject.

3.1.1 InstructGPT

The authors from OpenAI introduced InstructGPT, which served as a foundation for training models like ChatGPT and GPT-4 [4]. The inclusion of human preferences addressed the challenge of evaluating responses generated by LLMs. Traditional evaluation metrics such as BLEU [41], ROUGE [42], and BERTScore [43] were often utilized to evaluate LLMs, but they could not guarantee consistency with human preferences. To tackle this issue, researchers directly incorporated human preferences into LLMs to enhance their performance. This process typically involved two main steps: reward model learning and RL policy training.
In the reward model learning phase, an explicit pointwise reward function was trained using prompts and pairwise responses, specifically one desired response $y_w$ and one undesired response $y_l$ labeled by humans, through the BT model [38], as illustrated in Eq. 1.
During the collection of reward model preference datasets, the authors presented labelers with a range of $K = 4$ to $K = 9$ responses to rank. This method produced $\binom{K}{2}$ comparisons for each prompt shown to a labeler. The notation $(x, y_w, y_l) \sim \mathcal{D}$ was used to denote the sampling of the prompt, the desired response, and the undesired response from the collected dataset. The explicit pointwise reward model was denoted as $r_\phi(x, y)$.
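The reward-model loss referred to as Eq. 1 can be reconstructed from the InstructGPT formulation as the BT negative log-likelihood averaged over the $\binom{K}{2}$ comparisons per prompt:

$$ \mathcal{L}(r_\phi) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big] $$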
Subsequently, the RL policy training phase commenced, wherein the LLM and the pretrained reward model functioned as the agent and the environment, respectively, within the RL framework. The objective function for RL policy training was detailed in Eq. 2, where $\pi_\theta^{\mathrm{RL}}$ referred to the optimal policy.
The objective function in RL served three primary goals: 1. maximizing rewards, represented by $r_\phi(x, y)$; 2. minimizing the divergence between the current RL policy and the initial reference (SFT) policy, quantified by $\beta \log\big(\pi_\theta^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\big)$; and 3. avoiding the "alignment tax" in the RLHF process, expressed as $\gamma\, \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{pretrain}}}\big[\log \pi_\theta^{\mathrm{RL}}(x)\big]$ on pretraining datasets. The term "alignment tax" referred to the degradation in the performance of the LLM on downstream tasks following alignment. The parameter $\beta$ was used to control the weight of the KL divergence, with the authors suggesting that the optimal value lay between 0.01 and 0.02. When $\gamma = 0$, the loss function corresponded to the standard RLHF loss. When $\gamma \neq 0$, the modified loss function was termed PPO-ptx, which addressed performance degradation on public NLP datasets.
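For completeness, the PPO-ptx objective referred to as Eq. 2 can be reconstructed from the InstructGPT paper as:

$$ \mathrm{objective}(\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}_{\pi_\theta^{\mathrm{RL}}}}\left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right] + \gamma\, \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{pretrain}}}\big[ \log \pi_\theta^{\mathrm{RL}}(x) \big] $$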