
A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More

Zhichao Wang*, Bin Bi*, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri,
Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng
Salesforce
{zhichaowang, bin.bi, shivakumar.pentyala, k.ramnath, sougata.chaudhuri, shubham.mehrotra, james.zhu, xmao, sasur, claire.cheng}@salesforce.com

Abstract

With advancements in self-supervised learning, the availability of trillions of tokens in a pre-training corpus, instruction fine-tuning, and the development of large Transformers with billions of parameters, large language models (LLMs) are now capable of generating factual and coherent responses to human queries. However, the mixed quality of training data can lead to the generation of undesired responses, presenting a significant challenge. Over the past two years, various methods have been proposed from different perspectives to enhance LLMs, particularly in aligning them with human expectations. Despite these efforts, there has not been a comprehensive survey paper that categorizes and details these approaches. In this work, we aim to address this gap by categorizing these papers into distinct topics and providing detailed explanations of each alignment method, thereby helping readers gain a thorough understanding of the current state of the field.

Keywords: Large Language Model (LLM) · Alignment · Reward Model · Human/AI Feedback · Reinforcement Learning · RLHF · DPO

1 Introduction

Over the past decades, the pretraining of LLMs through self-supervised learning [1] has seen significant advancements. These improvements have been driven by the development of larger decoder-only Transformers, the utilization of trillions of tokens, and the parallelization of computations across multiple GPUs. Following the pretraining phase, instruction tuning was employed to guide LLMs in responding to human queries. Despite these advancements, a critical issue remains unresolved: LLMs can generate undesired responses, such as providing instructions on how to commit illegal activities. To mitigate this risk, it is essential to align LLMs with human values.
Reinforcement Learning from Human Feedback (RLHF) [2, 3] has emerged as a groundbreaking technique for aligning LLMs. This approach has led to the development of powerful models such as GPT-4 [4], Claude [5], and Gemini [6]. Following the introduction of RLHF, numerous studies have explored various approaches to further align LLMs. However, there has not yet been a comprehensive review of methods for aligning LLMs with human preferences. This paper aims to fill that gap by categorically reviewing existing literature and providing detailed analyses of individual papers.
In this paper, we have structured our review into four main topics: 1. Reward Model; 2. Feedback; 3. Reinforcement Learning (RL); and 4. Optimization. Each topic was further divided into subtopics, as shown in Figure 1. For the Reward Model, the subtopics were: 1. Explicit Reward Model vs. Implicit Reward Model; 2. Pointwise Reward Model vs. Preference Model; 3. Response-Level Reward vs. Token-Level Reward; and 4. Negative Preference Optimization. Regarding Feedback, the subtopics included: 1. Preference Feedback vs. Binary Feedback; 2. Pairwise Feedback vs. Listwise Feedback; and 3. Human Feedback vs. AI Feedback. In the RL section, the subtopics were: 1. Reference-Based RL vs. Reference-Free RL; 2. Length-Control RL; 3. Different Divergences in RL; and 4. On-
\footnotetext{These authors contributed equally to this work.}
Figure 1: The 13 categorical directions for xPO to align an LLM with human preference
Policy RL vs. Off-Policy RL. For Optimization, the subtopics were: 1. Online/Iterative Preference Optimization vs. Offline/Non-iterative Preference Optimization; and 2. Separating SFT and Alignment vs. Merging SFT and Alignment. Table 1 provided an analysis of all the papers reviewed in detail using these 13 evaluation metrics.

2 Categorical Outline

This section provided a concise introduction to the key elements of LLM alignment, enabling readers to grasp the essential terms and the various existing research directions. It primarily included four directions: 1. reward model, 2. feedback, 3. RL policy, and 4. optimization.

2.1 Reward Model

The reward model was a fine-tuned LLM that assigned scores based on the prompt and the generated response. In this subsection, we would discuss: 1. utilizing explicit or implicit reward models; 2. employing pointwise reward or preference models; 3. using token-level or response-level reward models; and 4. training a reward model with solely negative preferences. A plot of these different reward models could be found in Figure 2.

2.1.1 Explicit Reward Model vs. Implicit Reward Model

In RLHF, researchers collected a large dataset composed of triplets, including a prompt $x$, a desired response $y_w$, and an undesired response $y_l$. Based on this collected preference dataset, explicit reward models, represented as $r(x, y)$, were derived by fine-tuning pretrained LLMs to assign rewards for each prompt and response. This reward model was then used in a RL setting to align the LLM policy. Conversely, implicit reward models, such as the DPO reward $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$, bypassed the process of training an explicit reward model. For example, in DPO, a mapping was established between the optimal reward model and the optimal policy in RL, allowing the LLM to be aligned without directly deriving the reward model.
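To make the distinction concrete, below is a minimal sketch of the two reward formulations; the function names and interfaces are illustrative assumptions rather than code from the surveyed papers.

```python
def explicit_reward(reward_model, prompt_response_tokens):
    # Explicit reward model: a separately fine-tuned LLM with a scalar head that
    # maps the concatenated (prompt, response) tokens to a score r(x, y).
    return reward_model(prompt_response_tokens)

def implicit_dpo_reward(policy_logp, ref_logp, beta=0.1):
    # Implicit (DPO-style) reward: beta * log(pi_theta(y|x) / pi_ref(y|x)),
    # computed from the response log-probabilities under the trainable policy and
    # the frozen reference model -- no separate reward model is trained.
    return beta * (policy_logp - ref_logp)
```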
Figure 2: The four subtopics of reward model

2.1.2 Pointwise Reward Model vs. Preference Model

The original work in RLHF derived a pointwise reward model, which returned a scalar reward score $r(x, y)$ given the prompt $x$ and response $y$. Given the two pointwise reward scores of a desired response and an undesired response, $r(x, y_w)$ and $r(x, y_l)$, the probability $p(y_w \succ y_l \mid x)$ of the desired response being preferred over the undesired one could be obtained based on the Bradley-Terry (BT) model [38]. However, this methodology was inferior as it could not directly obtain pairwise preferences and could not accommodate inconsistencies in human labeling. To address this issue, Nash learning was proposed to directly model the preference probability $p(y_w \succ y_l \mid x)$.
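For reference, the BT model ties the pairwise preference probability to the two pointwise rewards through a logistic link:

$$
p\left(y_w \succ y_l \mid x\right) = \sigma\big(r(x, y_w) - r(x, y_l)\big) = \frac{\exp\big(r(x, y_w)\big)}{\exp\big(r(x, y_w)\big) + \exp\big(r(x, y_l)\big)}
$$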

2.1.3 Response-Level Reward vs. Token-Level Reward

In the original dataset collected as triplets $(x, y_w, y_l)$, the reward was given per response. Thus, in RLHF and DPO, the rewards were built at the response level. However, in the Markov decision process [39], rewards were given after each action, resulting in a change of state. To achieve alignment after each action, token-level reward models were introduced.

2.1.4 Negative Preference Optimization

In the RLHF dataset, humans labeled both desired and undesired responses. Recently, with advancements in LLM capabilities, some researchers have posited that LLMs could generate desired responses of even higher quality than those produced by human labelers. Consequently, they opted to use only the prompts and undesired responses from the collected dataset, generating the desired responses with LLMs.

2.2 Feedback

Feedback encompassed both preferences and binary responses from humans or AI, either in pairs or lists. In this subsection, we would discuss three key distinctions: 1. preference feedback vs. binary feedback; 2. pairwise feedback vs. listwise feedback; and 3. human feedback vs. AI feedback. A plot of these feedback types could be found in Figure 3.

2.2.1 Preference Feedback vs. Binary Feedback

In the RLHF paper, preference feedback, i.e., $(x, y_w, y_l)$, was collected. However, subsequent works such as KTO and DRO suggested that preference feedback was more challenging to gather, and that it would be advantageous to collect binary
Figure 3: The four subtopics of feedback
feedback instead. Binary feedback referred to simple "thumbs up" (positive) or "thumbs down" (negative) responses.

2.2.2 Pairwise Feedback vs. Listwise Feedback

In RLHF, listwise feedback was collected. This approach involved gathering $K$ different responses for a given prompt $x$ to expedite the labeling process, and these listwise responses were then treated as $\binom{K}{2}$ pairwise responses. However, subsequent work, such as LiPO, proposed that it was more advantageous to treat listwise preferences as a ranking problem instead of viewing them as multiple pairwise preferences.
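A minimal sketch of the conventional conversion from a listwise ranking into pairwise preferences (variable names are illustrative; LiPO's actual listwise objective is not shown):

```python
from itertools import combinations

def listwise_to_pairwise(prompt, ranked_responses):
    """ranked_responses is ordered best-to-worst for the given prompt.
    Returns the K*(K-1)/2 (prompt, preferred, dispreferred) triplets that
    pairwise methods such as RLHF reward modeling or DPO train on."""
    return [(prompt, better, worse)
            for better, worse in combinations(ranked_responses, 2)]

# Example: a ranking of 3 responses yields 3 pairwise comparisons.
print(len(listwise_to_pairwise("x", ["best", "ok", "worst"])))  # 3
```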

2.2.3 Human Feedback vs. AI Feedback

In RLHF, feedback was collected from humans who were asked to provide preferences given multiple responses to the same prompt. However, this process has proven to be tedious and expensive. With the latest developments in LLMs, it has become possible to collect AI feedback to align LLMs.

2.3 Reinforcement Learning (RL)

The objective of RL was formulated as $\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\left[r(x, y)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\left[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\right]$. This objective encompassed two primary goals: 1) maximizing the rewards of responses, and 2) minimizing the deviation of the aligned policy model $\pi_\theta$ from the initial reference (SFT) model $\pi_{\mathrm{ref}}$. The discussion on RL was divided into four subtopics: 1) Reference-Based RL vs. Reference-Free RL, 2) Length-Control RL, 3) Different Divergences in RL, and 4) On-policy RL vs. Off-policy RL.

2.3.1 Reference-Based RL vs. Reference-Free RL

A key aim of the RL objective in RLHF was to minimize the distance between the current policy $\pi_\theta$ and the reference policy $\pi_{\mathrm{ref}}$. Consequently, most methodologies have focused on reference-based approaches. However, incorporating a reference policy introduced a significant memory burden. To address this issue, various methods have been proposed to avoid the reference policy. For instance, SimPO proposed a different objective that avoided the need for a reference policy altogether.
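As a hedged illustration of what such a reference-free objective looks like (our recollection of SimPO's loss, not an equation reproduced from this survey), the implicit reward is a length-averaged policy log-likelihood with a target margin $\gamma$, so no reference model needs to be held in memory:

$$
\mathcal{L}_{\mathrm{SimPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma\right)\right]
$$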

2.3.2 Length-Control RL

When using LLMs as evaluators, it has been observed that they tended to favor verbose responses, even when no additional information was provided [40]. This bias could affect the alignment of the LLM. In addition, the verbosity of LLM responses might increase the time required for humans to read and understand them. The original RL objective did not account for this issue, but subsequent works such as R-DPO and SimPO incorporated considerations for length control, where $|y|$ represented the length of the output response.
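A minimal sketch of the underlying idea, written as a generic length-regularized reward rather than the exact R-DPO or SimPO formulation; $\alpha$ is an assumed trade-off coefficient:

$$
r_{\mathrm{LC}}(x, y) = r(x, y) - \alpha\, |y|
$$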

2.3.3 Different Divergences in RL

In RLHF, the reverse Kullback-Leibler (KL) divergence, i.e., $\mathbb{D}_{\mathrm{KL}}\left[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\right]$, was commonly used to measure the distance between the current policy $\pi_\theta$ and the reference policy $\pi_{\mathrm{ref}}$. However, KL divergence has been found to reduce the diversity of responses. To address this, research has been conducted to explore the effects of different divergence measures, i.e., general $f$-divergences $\mathbb{D}_f\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]$. More details could be found in Section 3.12.

2.3.4 On-policy or Off-policy Learning

In RL, responses could be generated during training using a method called on-policy learning. The main advantage of on-policy learning was that it sampled responses from the latest version of the policy. In contrast, off-policy methods relied on responses generated earlier. Although off-policy methods could save time by avoiding the need to generate new responses during training, they had the drawback of using responses that might not align with the current policy.

2.4 Optimization

The alignment process of LLMs involved optimization. This section would discuss two key subtopics: 1. Iterative/Online Preference Optimization vs. Non-Iterative/Offline Preference Optimization; 2. Separating SFT and Alignment vs. Merging SFT and Alignment. The plot of these two subtopics on optimization could be found in Figure 4
Figure 4: The two subtopics of optimization: a. Iterative/Online vs. Non-Iterative/Offline Preference Optimization; b. Separate or Merge SFT with Alignment

2.4.1 Iterative/Online Preference Optimization vs. Non-Iterative/Offline Preference Optimization

When only utilizing a previously collected dataset for alignment, the process was referred to as non-iterative/offline preference optimization. In contrast, iterative/online preference optimization became feasible when 1. humans labeled new data, or 2. LLMs assumed dual roles, both generating responses and evaluating them.

2.4.2 Separating SFT and Alignment vs. Merging SFT and Alignment

In RLHF, SFT and alignment were traditionally applied in a sequential, separated manner, which could be tedious and prone to catastrophic forgetting. To address this issue, some research, such as ORPO, has proposed integrating SFT with alignment into a single process to streamline fine-tuning. Additionally, PAFT suggested fine-tuning LLMs on SFT and alignment simultaneously and then merging the results.

3 Individual Paper Reviews in Detail

In this section, we review each paper individually, offering readers a summary of the major innovations presented, so they may not need to read the papers themselves.

3.1 RLHF/PPO

LLMs were pretrained on extensive corpora sourced from various origins, which inherently could not ensure the quality of the datasets. Furthermore, the primary objective of LLMs was to predict the next token, a goal that diverged from the aim of "following the user's instructions helpfully and safely" [2]. Consequently, LLMs could produce outputs that were untruthful, toxic, or otherwise unhelpful to users. In essence, these models were not aligned with the users' intents. The principal aim of RLHF/PPO was to align language models with user intent across a broad spectrum of tasks by fine-tuning them using human feedback. Various studies have been conducted on this subject.

3.1.1 InstructGPT

The authors from OpenAI introduced InstructGPT, which served as a foundation for training models like ChatGPT and GPT-4 [4]. The inclusion of human preferences addressed the challenge of evaluating responses generated by LLMs. Traditional evaluation metrics such as BLEU [41], ROUGE [42], and BERTScore [43] were often utilized to evaluate LLMs, but they could not guarantee consistency with human preferences. To tackle this issue, researchers directly incorporated human preferences into LLMs to enhance their performance. This process typically involved two main steps: reward model learning and RL policy training.
In the reward model learning phase, an explicit pointwise reward function $r_\theta(x, y)$ was trained using prompts and pairwise responses, specifically one desired response $y_w$ and one undesired response $y_l$ labeled by humans, through the BT model [38], as illustrated in Eq. 1:
$$\mathcal{L}(\theta) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right] \tag{1}$$
During the collection of reward model preference datasets, the authors presented labelers with a range of $K = 4$ to $K = 9$ responses to rank. This method produced $\binom{K}{2}$ comparisons for each prompt shown to a labeler. The notation $(x, y_w, y_l) \sim \mathcal{D}$ was used to denote the sampling of the prompt, the desired response, and the undesired response from the collected dataset. The explicit pointwise reward model was denoted as $r_\theta(x, y)$.
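A minimal PyTorch-style sketch of this pairwise reward-model loss (tensor names are illustrative; batching all $\binom{K}{2}$ comparisons per prompt, discussed below, is omitted):

```python
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    """r_chosen / r_rejected: reward-model scores r(x, y_w) and r(x, y_l) for the
    same prompts, shape (batch,). Minimizing this pushes the desired response's
    score above the undesired one's, per the BT model."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```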
Subsequently, the RL policy training phase commenced, wherein the LLM and the pretrained reward model functioned as the agent and the environment, respectively, within the RL framework. The objective function for RL policy training was detailed in Eq. 2, where $\pi^{\mathrm{RL}}_\phi$ referred to the policy being optimized:
$$\operatorname{objective}(\phi) = \mathbb{E}_{(x, y) \sim \mathcal{D}_{\pi^{\mathrm{RL}}_\phi}}\left[r_\theta(x, y) - \beta \log \frac{\pi^{\mathrm{RL}}_\phi(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}\right] + \gamma\, \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{pretrain}}}\left[\log \pi^{\mathrm{RL}}_\phi(x)\right] \tag{2}$$
The objective function in RL served three primary goals: 1. maximizing rewards, represented by $r_\theta(x, y)$; 2. minimizing the divergence between the current RL policy and the initial reference (SFT) policy, quantified by $\beta \log \left(\pi^{\mathrm{RL}}_\phi(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)$; and 3. avoiding the "alignment tax" in the RLHF process, expressed as the term $\gamma\, \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{pretrain}}}\left[\log \pi^{\mathrm{RL}}_\phi(x)\right]$ on pretraining datasets. The term "alignment tax" referred to the degradation in the performance of the LLM on downstream tasks following alignment. The parameter $\beta$ was used to control the weight of the KL divergence, with the authors suggesting that the optimal value lay between 0.01 and 0.02. When $\gamma = 0$, the loss function corresponded to the standard RLHF loss. When $\gamma \neq 0$, the modified loss function was termed PPO-ptx, which addressed performance degradation on public NLP datasets.
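A compact sketch of how the reward signal fed to PPO can be assembled under this objective (a simplified per-sequence form; real implementations apply the KL penalty per token and handle the pretraining-gradient term separately):

```python
def rlhf_ppo_reward(rm_score, policy_logp, sft_logp, beta=0.02):
    """rm_score: r(x, y) from the trained reward model.
    policy_logp / sft_logp: log-probabilities of the sampled response under the
    current RL policy and the frozen SFT policy.
    beta: KL-penalty weight (the paper suggests 0.01-0.02)."""
    kl_penalty = beta * (policy_logp - sft_logp)
    return rm_score - kl_penalty
```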
For training InstructGPT, three datasets were utilized: 1. SFT dataset: contained labeler demonstrations used to train the SFT models. 2. RM dataset: comprised labeler rankings of model outputs, which were used to train RMs. 3. PPO dataset: composed of prompts used as input for RLHF fine-tuning. Despite the complexity of the task, the inter-annotator agreement rates were notably high: training labelers agreed with each other most of the time, and held-out labelers showed a comparably high agreement rate.
The authors trained a single 6B reward model to be utilized for training RLHF/PPO policy models of varying sizes. They also experimented with larger 175B reward models [44]. Although larger RMs exhibited lower validation loss, their training processes were unstable and significantly increased the computational requirements for RLHF/PPO. The

authors claimed that since the same input prompt generated $K$ outputs, these outputs were correlated. A straightforward method to address this was to shuffle them and train on them in random order. However, this approach led to overfitting. To mitigate this issue, the authors trained all $\binom{K}{2}$ comparisons from the same prompt as a single batch, which alleviated the overfitting problem. One limitation of this method was that it did not account for the relative scores between responses; that is, response pairs with similar scores and those with very large score differences were treated the same. Subsequent works have considered this problem [28].
The trained InstructGPT was evaluated from three perspectives: Helpful, Honest, and Harms. "Helpful" meant that the model should follow instructions and infer intention from a few-shot prompt or another interpretable pattern, and it was evaluated by human labelers. "Honest" referred to two metrics: (1) evaluating the model's tendency to fabricate information on closed-domain tasks and (2) performance on the TruthfulQA benchmark [45]. "Harms" involved labelers evaluating whether an output was inappropriate in the context of a customer assistant. From human evaluation, the authors claimed that "outputs from the 1.3B parameter InstructGPT model were preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters." Notably, InstructGPT showed improvements in truthfulness and toxicity tasks over GPT-3, which was crucial for alignment. PPO-ptx has also demonstrated reduced performance decrement on various NLP benchmarks.

3.1.2 RLHF - Anthropic

Anthropic has conducted research on the same topic [3]. To facilitate a clear comparison, we would emphasize the distinctions between the two studies. To start, OpenAI selected labelers by filtering workers based on agreement rates or other direct measures of label quality, achieving a relatively high inter-labeler agreement rate. In contrast, Anthropic hypothesized that crowdworkers who demonstrated strong writing skills and engaged the AI in more stimulating discussions would likely possess better judgment regarding which AI responses were most "helpful" and "harmless". However, they observed a low average agreement rate (around 63%) between Anthropic researchers and their crowdworkers. This comparison underscored the importance of implementing filtering tasks to identify high-quality labelers.
Furthermore, the data collection methodology varied significantly. The authors focused on two primary metrics: "harmless" and "helpful", with "helpful" encompassing "honest". These metrics guided the creation of two distinct datasets. For the "helpful" dataset, crowdworkers employed LLMs to assist in generating responses. Conversely, the "harmless" dataset involved a different approach. Here, crowdworkers engaged in adversarial probing or "red-teaming" of the language models to elicit harmful responses, such as inducing the AI to use toxic language. The metrics "helpful" and "harmless" often stood in opposition to each other. The authors found that integrating these datasets for preference modeling enhanced performance on both metrics, particularly when the preference models were sufficiently large. Consistent with OpenAI's approach, preference-strength information was disregarded, and all preference pairs were treated equally.
OpenAI has discovered that RLHF helped with alignment but could degrade performance on certain NLP benchmarks, a phenomenon referred to as the "alignment tax". Its InstructGPT model had a size of 1.3B parameters. In contrast, researchers at Anthropic evaluated seven different models with sizes ranging from 13M to 52B, following a geometric progression with increments of approximately 4×. They concluded that alignment imposed a tax on smaller models, whereas it provided a benefit for larger models, particularly those with 13B and 52B parameters. Given this alignment advantage, the authors also experimented with incorporating coding-technique datasets to enhance the capabilities of LLMs. In OpenAI's RLHF approach, they introduced both PPO and PPO-ptx, with PPO-ptx designed to mitigate the alignment tax on NLP benchmarks. Anthropic's RLHF findings indicated that PPO alone could achieve an alignment bonus for larger models on NLP downstream tasks. They also identified an optimal value for the KL-divergence weight in RL policy training.
In the process of training the reward model, the authors identified a near log-linear relationship between reward model accuracy and the sizes of both the model and the dataset. Larger reward models demonstrated greater robustness than smaller ones during RL policy training. Then, the authors divided the preference data into two halves: a training half and a testing half. They trained separate reward models on each half, referred to as the train RM and the test RM, respectively. The RLHF policies were trained using the train RM and evaluated with the test RM. During evaluation, it was observed that "the two scores by train and test RMs are in close agreement during early stages of training, but eventually diverge, with the test RM providing a lower score." This resulted in the conclusion that the reward model had overfitted to the training data: "the reward model is less robust and more easily exploited at higher rewards". However, when larger RMs were utilized, this overfitting issue was not significantly transferred to the RLHF policy. Additionally, during RL policy training, an approximately linear trend was discovered between the reward and $\sqrt{\mathbb{D}_{\mathrm{KL}}\left[\pi \,\|\, \pi_{\mathrm{ref}}\right]}$. The authors also employed out-of-distribution (OOD) techniques to detect and reject poor requests. Finally, they explored an online training mode in which both the reward model and the RL policy could be updated weekly with new human

preference data obtained through interactions with crowdworkers. These findings were not reported in the OpenAI InstructGPT paper.

3.1.3 Online / Iterative RLHF

RLHF techniques for aligning LLMs with human preferences have traditionally been offline methods. In this paradigm of RLHF, a static preference dataset of the form $\mathcal{D} = \{(x, y_w, y_l)\}$, where $y_w$ was a response preferred over $y_l$ given a prompt $x$, was prepared to train a reward function $r(x, y)$ based on the BT model [38]. The LLM policy was then optimized by leveraging the RLHF/PPO algorithm on a KL-constrained version of the reward objective, which controlled both the cumulative rewards and the divergence of the policy $\pi_\theta$ from the initial SFT policy $\pi_{\mathrm{SFT}}$. In alternative direct preference optimization methods like DPO, which skipped reward function modeling, the static dataset was leveraged to directly approximate the optimal policy $\pi^*$; more details could be found in Section 3.3.3.
The drawback of training RLHF/PPO using an offline dataset lay in the fact that the responses $(y_w, y_l)$ in the static dataset came from other LLM policies, and the preferences came from some oracle such as a human or another AI agent. Critically, the training process of the reward model could not further query the preference oracle, as the preference data had been fixed. Moreover, the finite dataset might lead to over-optimization of reward on in-distribution data, since the finite dataset was a small sample of the universe of prompt-response pairs. The resulting policy model often performed poorly when faced with out-of-distribution data [8].
To deal with out-of-distribution data, the policy needed to be continuously fine-tuned by generating pairs of responses for new prompts from the intermediate policy, getting preference feedback from the oracle, and feeding it back to the policy, i.e., iterative/online learning. In practice, the iterative learning was divided into two parts [7]:
  1. Preference oracle training: Since it was difficult or infeasible to continuously obtain expert human preference feedback on new data, a preference model, i.e., a different LLM, was trained on a large and diverse set of offline preference data. Given a new (prompt, pair of responses), this model could score each (prompt, response) pair, with the preferred response receiving a higher score.
  2. Iterative policy optimization: First, a base policy model was instruction fine-tuned from a pre-trained LLM. Then, the policy was continuously fine-tuned using an exploitation-and-exploration framework. In the exploration phase, the current main policy produced a response for each prompt, and an enhancer policy produced another response for the same prompt, with the preference label for the two responses obtained from the preference oracle trained in the previous step. The job of the enhancer policy was to probe the space where there was higher uncertainty in responses relative to the main policy. In the exploitation phase, the current main policy was updated using RLHF/PPO or DPO techniques on the new preference data. Lastly, the process was repeated to further improve the quality of the main LLM policy (see the sketch after this list). The enhancer policy, in practice, could be obtained through heuristics. Popular heuristics were adjusting the sampling temperature of the main policy to create the enhancer policy, or rejection sampling, where the main policy produced multiple responses that were ranked by the preference oracle, and the best and worst responses were treated as coming from the main and enhancer policies, respectively.
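A high-level sketch of one such explore-exploit round; generate, rank_pair, and update_fn are placeholders for the components described above (sampling from the policies, querying the preference oracle, and an RLHF/PPO or DPO update), not APIs from the cited work.

```python
def online_iteration(main_policy, enhancer_policy, rank_pair, update_fn, prompts):
    """One round of online/iterative preference optimization.
    rank_pair(x, a, b) -> (y_w, y_l) plays the role of the preference oracle;
    update_fn(policy, data) applies an RLHF/PPO or DPO update."""
    new_preferences = []
    for x in prompts:
        y_main = main_policy.generate(x)      # exploitation: current main policy
        y_enh = enhancer_policy.generate(x)   # exploration: probe uncertain regions
        y_w, y_l = rank_pair(x, y_main, y_enh)
        new_preferences.append((x, y_w, y_l))
    return update_fn(main_policy, new_preferences)
```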
Significant empirical evaluation in [7] indicated that policies trained through online RL improved over those trained with offline RL.

3.2 RLAIF

The Reinforcement Learning from AI Feedback (RLAIF) framework was developed to mitigate the substantial expenses involved in acquiring human preference datasets. Additionally, as the capabilities of LLMs continued to advance, this approach allowed for the collection of more accurate AI preference datasets, thereby enhancing the alignment of LLMs.

3.2.1 RLAIF-Anthropic

Building on the foundational work of RLHF, a novel approach termed RLAIF was introduced [9]. This methodology encompassed two primary stages: 1. Supervised learning through Critiques and Revisions guided by a "constitution" and 2. RLAIF.
In the initial stage, the authors employed the chain of thought (CoT) framework [46] to identify potential harms in harmless data using specific principle-based instructions, which they referred to as "Constitutional AI (CAI)." For CAI, a LLM served as a critic, providing revisions. The findings indicated that self-supervised critiques and revisions

could surpass human performance. During this process, the authors noted a decrease in helpfulness scores, while the combined scores for harmlessness and helpfulness improved. Additionally, increasing the number of revisions proved advantageous, as it led to the identification and correction of more harmful responses. Importantly, the critique process was found to be crucial, with the critique-revision approach outperforming the revision process alone. Following the critique and revision phase, SFT was applied to the LLM using the revised responses from critique-revision stage.
In the second stage, the authors substituted RLHF with RLAIF. During the initial stage, human annotators labeled the helpfulness data, whereas AI systems labeled the harmlessness data, as previously mentioned. Furthermore, distinct principles for constitution and CoT reasoning were employed to align the LLM, aiming to minimize harm while preserving helpfulness.
This study demonstrated the feasibility of self supervised AI alignment by utilizing AI to collect preference data. However, it was limited to harmlessness rather than helpfulness, given that the task of ensuring harmlessness was considerably simpler compared to that of ensuring helpfulness.

3.2.2 RLAIF-Google

Building on the work of RLAIF by Anthropic, the authors contended that prior research had not directly compared the effectiveness of human versus AI feedback, warranting further investigation [10]. During the AI feedback collection process, a structured prompt was created, consisting of: 1. Preamble, 2. Few-shot exemplars (optional), 3. Sample to annotate, and 4. Ending. A two-step evaluation was performed to generate AI feedback: initially, all four components of the instruction, combined with CoT, were used to generate responses from the LLM. In the subsequent stage, the LLM's response, appended with an ending like "preferred summary=", was sent back to the LLM to generate preference probabilities for "summary 1" and "summary 2". To mitigate positional bias, the order of the two responses was alternated, and the average scores were calculated.
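A simplified sketch of the position-debiasing step; score_preference stands in for the structured-prompt LLM call and is an assumed interface, not the paper's implementation:

```python
def debiased_preference(score_preference, prompt, summary_1, summary_2):
    """score_preference(prompt, first, second) returns the probability that the
    FIRST candidate is preferred. Querying both orders and averaging cancels
    the LLM labeler's positional bias."""
    p_first_order = score_preference(prompt, summary_1, summary_2)
    p_second_order = 1.0 - score_preference(prompt, summary_2, summary_1)
    return 0.5 * (p_first_order + p_second_order)  # P(summary_1 preferred)
```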
In the RLAIF process, two strategies were employed: 1. "Distilled RLAIF", which adhered to the traditional RLHF approach by using preference to train a Reward Model, which was then used to train the LLM policy, and 2. "Direct RLAIF", which leveraged LLM feedback by prompting it to output evaluation scores directly as signals for policy training in RL.
Lastly, during the evaluation process, three key metrics were employed: 1. AI-labeler alignment: the degree of agreement between AI and human labelers; 2. win rate: the likelihood of a response being selected by human labelers when compared between two candidates; and 3. harmless rate: the percentage of responses deemed harmless by human evaluators.
Experiments were conducted on three datasets: 1. Reddit TL;DR (summary) [47], 2. OpenAI's Human Preferences (helpful) [47], and 3. Anthropic Helpful and Harmless (HH; harmless) Human Preferences [3]. PaLM 2 was utilized as the LLM for alignment [48].
The authors made a couple of observations on the summarization task. They observed that the RLHF policy sometimes hallucinated when the RLAIF policy did not, and RLAIF sometimes produced less coherent summaries than RLHF. They mentioned that more systematic analysis was required to understand whether these patterns existed at scale.
Three main conclusions were drawn. Firstly, RLAIF achieved comparable performance to RLHF in summarization and helpful dialogue generation tasks, but outperformed RLHF in the harmless task. Secondly, RLAIF demonstrated the ability to enhance a SFT policy even when the LLM labeler was of the same size as the policy. Lastly, "Direct RLAIF" surpassed "Distilled RLAIF" in terms of alignment.

3.3 Direct Human Preference Optimization

Traditional RLHF methods typically involved optimizing a reward function derived from human preferences. While effective, this approach could introduce challenges such as increased computational complexity and a bias-variance trade-off in estimating and optimizing rewards [49]. Recent research has explored alternative methods that aimed to optimize LLM policies directly based on human preferences, without necessarily relying on a scalar reward signal.
These approaches sought to simplify the alignment process, reduce computational overhead, and potentially achieve more robust optimization by working more directly with preference data. By framing the problem as one of preference optimization rather than reward estimation and maximization, these methods offered a different perspective on aligning language models with human judgments.

3.3.1 SLiC-HF

This study introduced Sequence Likelihood Calibration with Human Feedback (SLiC-HF) to align LLMs with human preferences by employing a max-margin ranking loss with regularization, as shown in Eq. 3 [11].
Here, $\delta$ served as a margin to distinguish desired responses from undesired responses, and the regularization term would encourage the trained model to stay close to the initial SFT policy.
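A hedged PyTorch-style sketch of such a calibration loss, under the assumption that the regularization target is the SFT reference sequence; this is our rendering of a hinge ranking loss with a likelihood regularizer, not code from the SLiC-HF paper:

```python
import torch

def slic_style_loss(logp_w, logp_l, logp_ref, delta=1.0, lam=0.5):
    """logp_w / logp_l: sequence log-likelihoods of the desired / undesired
    responses under the model; logp_ref: log-likelihood of a reference (SFT)
    target. delta is the ranking margin, lam the regularization weight."""
    rank_loss = torch.clamp(delta - logp_w + logp_l, min=0.0)  # max-margin ranking
    reg_loss = -logp_ref                                       # stay close to SFT behavior
    return (rank_loss + lam * reg_loss).mean()
```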
The authors proposed two main variants: SLiC-HF-direct and SLiC-HF-sample-rank. SLiC-HF-direct used human preference feedback data directly to define the desired response $y_w$ and the undesired response $y_l$. In contrast, SLiC-HF-sample-rank generated multiple responses from the SFT model and then used a separate ranking or reward model to determine $y_w$ and $y_l$ from these generated responses. This sample-rank variant ensured that the training examples were drawn from the model's current output distribution, potentially leading to more stable and effective learning compared to using off-policy human preference data. The authors found that SLiC-HF-sample-rank converged more robustly.
The study demonstrated that SLiC-HF could achieve comparable or superior performance to RLHF/PPO methods while using significantly less computational resources, i.e., about 0.25× the memory footprint of the PPO training paradigm. On the Reddit TL;DR summarization task [3], a T5-Large (770M parameters) [50] model trained with SLiC-HF outperformed a 6B-parameter model trained with RLHF/PPO. This result suggested that SLiC-HF represented a promising direction for aligning LLMs with human preferences, offering a balance between performance, computational efficiency, and implementation simplicity.

3.3.2 RSO

Rejection Sampling Optimization (RSO) [51] addressed limitations of offline preference optimization methods like SLiC and DPO by using statistical rejection sampling to correct the distribution mismatch between the training data and the data expected from the optimal policy.
The rejection sampling methodology was detailed as follows:
  1. Generate $y \sim \pi_{\mathrm{sft}}(y \mid x)$ and $u \sim U[0, 1]$.
  2. Calculate the acceptance probability $P_{\mathrm{accept}}(y)$, as given in Eq. 4.
  3. Accept $y$ if $u < P_{\mathrm{accept}}(y)$; otherwise, reject it. This ensures that only responses close to the optimal policy are selected.
In comparison, $\pi_{\mathrm{sft}}(y \mid x)$ was simple to sample from, while the optimal policy $\pi^*(y \mid x)$ was hard to obtain directly. To solve this problem, RSO used a trained reward model $r_\psi(x, y)$ to guide the sampling process. The algorithm generated candidates from the SFT policy $\pi_{\mathrm{sft}}$ and accepted them with the probability shown in Eq. 4, $P_{\mathrm{accept}}(y) = \exp\!\left(\frac{1}{\beta}\big(r_\psi(x, y) - r_{\max}\big)\right)$, where $r_{\max}$ referred to the maximum reward among the current set of candidate samples.
Here, $\beta$ controlled the selectiveness of the sampling. As $\beta \to \infty$, every response was accepted (i.e., $P_{\mathrm{accept}}(y) \to 1$ for all $y$), and as $\beta \to 0$, only the highest-reward response was accepted.
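A minimal sketch of this reward-guided rejection sampling step under the acceptance rule above; candidate generation and the reward model are represented by plain Python values, and the running update of $r_{\max}$ is simplified:

```python
import math
import random

def rso_rejection_sample(candidates, rewards, beta=0.5):
    """candidates: responses sampled from the SFT policy for one prompt.
    rewards: reward-model scores r(x, y) for each candidate.
    Returns the subset accepted with probability exp((r - r_max) / beta)."""
    r_max = max(rewards)
    accepted = []
    for y, r in zip(candidates, rewards):
        if random.random() < math.exp((r - r_max) / beta):
            accepted.append(y)
    return accepted
```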
The authors conducted experiments on the TL;DR summarization [47] and Anthropic HH dialogue datasets [3]. The T5-large model (770M) was initialized as SFT, while the T5-XXL (11B) served as the reward model [52]. Evaluation results demonstrated that RSO surpassed previous methods including SLiC and DPO across multiple metrics, including human evaluation. RSO also showed better scalability to larger models and improved cross-task generalization.
RSO offered a more principled approach to generating training data that approximated on-policy RL. Its unified framework and intelligent sampling strategy could serve as catalyst to other off-policy training methods as well.

3.3.3 DPO

RLHF/PPO necessitated an initial phase of training a reward model using a preference dataset, followed by training a RL policy with the pretrained reward model serving as the environment. This bifurcated training process demanded

meticulous oversight, including significant computational resources to hold multiple models in memory (reward, value, policy, reference); data collection for training both the reward model and the RL policy; and monitoring for overfitting. To address these challenges, Direct Preference Optimization (DPO) was introduced [12]. The objective function of PPO-based RL, shown in Eq. 5, was used to derive the optimal policy $\pi^*$:
$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big] \tag{5}$$
Based on the RL objective, given a reward model $r(x, y)$, the optimal policy $\pi^*(y \mid x)$ could be expressed as Eq. 6:
$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right) \tag{6}$$
Here, $Z(x)$ represented a term dependent solely on the input, used to normalize $\pi^*(y \mid x)$. The initial policy before DPO was indicated by $\pi_{\mathrm{ref}}$. The hyperparameter $\beta$ controlled the divergence between the reference policy and the final aligned policy post-DPO.
By rewriting Eq. 6, the reward model could be expressed in terms of the RL policy, as illustrated in Eq. 7:
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \tag{7}$$
By expressing the reward function $r(x, y)$ in terms of the optimal policy $\pi^*$, we could optimize them simultaneously in the reward model training process. Lastly, the formulation of the partition function was given by $Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)$. It was evident that $Z(x)$ depended only on $x$, but it involved a summation over all possible $y$, which was computationally intractable. Due to this intractability, DPO suggested eliminating this term by subtraction, as demonstrated in Eq. 8:
$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \tag{8}$$
Lastly, by employing the BT model as illustrated in Eq. 9, the pairwise preference $p(y_w \succ y_l \mid x)$ was articulated in terms of the pointwise reward $r(x, y)$, which was defined through the optimal policy $\pi^*$:
$$p\left(y_w \succ y_l \mid x\right) = \sigma\big(r(x, y_w) - r(x, y_l)\big) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right) \tag{9}$$
Substituting this preference probability into the cross-entropy loss over the labeled pairs $y_w$ and $y_l$, the final loss function of DPO was derived, as shown in Eq. 10:
$$\mathcal{L}_{\mathrm{DPO}}\left(\pi_\theta; \pi_{\mathrm{ref}}\right) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right] \tag{10}$$
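A compact PyTorch-style sketch of Eq. 10, computed from summed per-token log-probabilities; variable names are illustrative:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Each argument is the summed log-probability of the desired (w) or
    undesired (l) response under the trainable policy or the frozen reference.
    The implicit rewards are the beta-scaled policy/reference log-ratios."""
    chosen_reward = beta * (policy_logp_w - ref_logp_w)
    rejected_reward = beta * (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```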
The authors have also derived the gradient of DPO, as illustrated in Eq. 11:
$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\Big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\Big)\right], \quad \hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \tag{11}$$
This gradient maximized the likelihood of $y_w$ while minimizing the likelihood of $y_l$. Concurrently, a weighting term $\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$ was introduced, which imposed a higher penalty when the difference $\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)$ approached negative infinity. As this difference increased and approached positive infinity, the penalty gradually decreased. This penalization was logical, as a higher penalty should be applied when the implicit reward of $y_l$ was similar to or greater than that of $y_w$. Conversely, if the reward of $y_w$ significantly exceeded that of $y_l$, minimal modification was necessary, and it was reasonable for the gradient to be smaller.
The authors proposed that "two reward functions $r(x, y)$ and $r'(x, y)$ were considered equivalent if and only if $r(x, y) - r'(x, y) = f(x)$ for some function $f$." This established that two reward functions differing only by a function of the input $x$ derived the same optimal policy.
Upon training with DPO, the optimal policy could be directly obtained without the need for generating an intermediate reward function, thereby simplifying the training process of RLHF. In summary, DPO facilitated the extraction of the corresponding optimal policy in a closed form, deriving the resolution of the standard RLHF problem using only a straightforward classification loss.
Furthermore, the authors have extended the DPO loss function to handle noisy preference labels, as demonstrated in Eq. 12, by assuming that a label is flipped with a small probability $\epsilon$, i.e., substituting $p(y_w \succ y_l \mid x) = 1 - \epsilon$ and $p(y_l \succ y_w \mid x) = \epsilon$ into the loss.
However, there were certain limitations associated with DPO. In the RLHF approach used by OpenAI, the reward model remained unchanged, facilitating human alignment. For further training on related tasks, only new prompts were required, with responses generated by the LLM and rewards obtained from the existing reward model. This approach offers significant advantages in flexibility and reusability. For instance, consider an user who has built an English summarization model with a corresponding reward model. To extend this to Spanish texts, they could potentially reuse the English reward model as a naive initialization for Spanish rewards. In contrast, DPO required new preference data for further optimization, which can be challenging to obtain as it necessitated meticulous human labeling. Using the same example, DPO would require collecting an entirely new set of preference data for Spanish summaries, involving multiple Spanish summaries for each text and bilingual human annotators to compare and rank them. This process is significantly more resource-intensive than generating new prompts in Spanish for the RLHF approach.
Furthermore, the loss function of DPO focused solely on maximizing the difference between desired and undesired responses. Based on this loss function, it was possible to inadvertently reduce the rewards for desired responses or increase the rewards for undesired responses. Although the authors have claimed that two reward functions were equivalent if their differences depended only on input prompts, we might still prefer the rewards for $y_w$ to increase and the rewards for $y_l$ to decrease. Suppose a model generated a response to a prompt, and the corresponding implicit reward was relatively low. In this scenario, it became challenging to determine the quality of the response: it might turn out that the output was of high quality even though the implicit reward score was low. Under these conditions, we had to generate multiple outputs, calculate their reward scores, and select the best solution.
Recent studies have also shown that DPO is particularly sensitive to distribution shifts between the base model outputs and the preference data [53]. This sensitivity can lead to poor performance when there's a mismatch between the training data of the base model and the preference dataset. To address this issue, iterative DPO has been proposed, where new responses are generated with the latest policy model and a critique (can be either separate reward model or same policy network in a self-rewarding setting) are used for preference labeling in each iteration. This approach can help mitigate the distribution shift problem and potentially improve DPO's performance.
Lastly, the tests in the DPO paper were primarily conducted on simple cases, including the IMDB dataset [54] for controlled sentiment generation and Reddit dataset [47] for summarization. More complex downstream NLP tasks should be evaluated to assess the effectiveness of DPO, especially in light of the distribution shift sensitivity and the potential benefits of iterative DPO.

3.3.4 DPOP: Smaug

The DPO loss function aimed to maximize the disparity between desired and undesired responses. However, this approach could be problematic: it might lead to simultaneous increases or decreases in the rewards for both desired and undesired responses, as long as the difference between them grew. The authors theoretically demonstrated that the rewards for both types of responses could decrease concurrently [13]. This phenomenon was particularly pronounced in data with small edit (Hamming) distances; for instance, two responses differing in a single token have an edit distance of 1. To address the limitations of DPO in scenarios with small edit distances, the authors created three datasets: modified ARC [55], Hellaswag [56], and MetaMath [57], which included more examples with small edit distances. They also introduced DPO-positive (DPOP), as defined in Eq. 13.
By incorporating the penalty term $\max\!\left(0,\; \log \frac{\pi_{\mathrm{ref}}(y_w \mid x)}{\pi_\theta(y_w \mid x)}\right)$, we could effectively prevent the reduction in rewards for desired responses. This is because the log-likelihood of the preferred completion is incentivized to improve over the reference model in addition to the standard DPO loss, which avoids the undesirable situation described above. Utilizing this revised loss function, the authors trained and evaluated Smaug-7B, 34B, and 72B models on the HuggingFace LLM leaderboard and MT-Bench [58]. Notably, the 70B-scale model achieved state-of-the-art performance on the HuggingFace LLM leaderboard when the paper was published.

3.3.5 $\beta$-DPO

While DPO has shown promise in aligning LLMs with human preferences, its performance is sensitive to the tuning of its trade-off parameter $\beta$ with respect to the quality of the preference data. This sensitivity can be attributed to two factors: 1. the optimal value of $\beta$ changes with the quality of preference data, requiring a dynamic approach, and 2. real-world datasets often contain outliers that can distort the optimization process. To avoid this overhead, DPO with Dynamic $\beta$ ($\beta$-DPO) [14] introduced a framework that dynamically calibrates $\beta$ at the batch level, informed by the underlying preference data.
To address these challenges, β-DPO introduced two main components:
  1. Dynamic β Calibration at the Batch Level: This approach adjusts β for each batch based on the quality of the pairwise data; a minimal sketch is given after this list. The batch-level β is calculated as $\beta_{\mathrm{batch}} = \beta_0\left[1 + \alpha\left(\mathbb{E}_i[M_i] - M_0\right)\right]$, where $M_i$ is the individual reward discrepancy, $M_0$ is a threshold, and $\alpha$ is a scaling factor.

2. β-Guided Data Filtering: This mechanism mitigates the impact of outliers by filtering them out based on a probabilistic model of reward discrepancies.
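The sketch below illustrates the batch-level calibration and a simplified stand-in for the filtering step; the symbol names mirror the description above, and the trimming rule is a simplification of the paper's probabilistic filtering rather than its exact procedure.

```python
def batch_level_beta(reward_gaps, beta0=0.1, alpha=0.5, m0=0.0):
    """Batch-level beta calibration: scale the base beta0 by how far the mean
    reward discrepancy of the batch sits from the threshold m0."""
    mean_gap = sum(reward_gaps) / len(reward_gaps)
    return beta0 * (1.0 + alpha * (mean_gap - m0))

def filter_outliers(pairs, reward_gaps, keep_fraction=0.8):
    """Keep the pairs whose reward discrepancy is closest to the batch mean,
    discarding the most extreme ones as likely outliers."""
    mean_gap = sum(reward_gaps) / len(reward_gaps)
    ranked = sorted(zip(pairs, reward_gaps), key=lambda t: abs(t[1] - mean_gap))
    keep = max(1, int(keep_fraction * len(ranked)))
    return [pair for pair, _ in ranked[:keep]]
```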
Empirical evaluations on the Anthropic HH [3] and Reddit TL;DR summarization [47] tasks demonstrated that β-DPO consistently outperforms standard DPO across different model sizes and sampling temperatures. For instance, on the Anthropic HH dataset, β-DPO achieved notable win-rate improvements on models of various sizes, including Pythia-410M, 1.4B, and 2.8B [59].
A critical aspect of this approach is the consideration of pairwise data quality, categorized as "low gap" or "high gap". Low gap denotes cases where the chosen and rejected responses are closely similar, typically indicating high-quality, informative pairs. In contrast, high gap refers to pairs with larger differences, implying lower-quality data.
Experiments with Pythia-1.4B on the Anthropic HH dataset revealed a distinct trend: for low-gap data, a higher β reduces the win rate, whereas for high-gap data, an increased β improves performance. This observation highlights the necessity of tailoring the β value to the data quality, especially in the presence of outliers.
However, limitations and areas for future work include exploring β-DPO in self-play scenarios, developing more sophisticated evaluation metrics, investigating scalability to ultra-large models, and pursuing automated parameter tuning.

3.3.6 IPO

Azar et al. identified that RLHF and DPO were susceptible to overfitting, and introduced Identity Preference Optimization (IPO) as a solution to this issue [15]. The authors highlighted two key assumptions underlying RLHF: 1. "pairwise preferences can be substituted with pointwise rewards," and 2. "a reward model trained on these pointwise rewards can generalize from collected data to out-of-distribution data sampled by the policy". They argued that in DPO the second assumption could be circumvented by learning the policy directly from data without the need for an intermediate reward function, leaving the first assumption intact. Specifically, challenges might arise when substituting pairwise preferences with a pointwise reward model via the BT model. This assumption became problematic when preferences were deterministic or nearly deterministic, i.e., $p(y_w \succ y_l) \approx 1$. Under deterministic preferences, the implied reward difference $\log\frac{p}{1-p}$ grows unboundedly; as this value approached positive infinity, the effectiveness of the KL divergence constraint imposed by the regularization coefficient τ diminished. Consequently, the objective function shifted towards maximizing accumulated rewards, potentially leading to overfitting.
To address this issue, the authors introduced a general objective for RLHF that avoided converting preferences into pointwise rewards through the BT model and instead focused on optimizing a nonlinear function Ψ of the preference probabilities, as detailed in Eq. 15.
Two policies, π and μ, were employed, with the primary focus on maximizing the first policy, π, during the RL policy training process. Equation 15 was equivalent to DPO when Ψ was the logit function, $\Psi(q) = \log\frac{q}{1-q}$. The authors attributed the overfitting observed in RLHF and DPO to this nonlinear transformation of preference probabilities, noting that small increases in probabilities already close to 1 are incentivized just as much as large increases in preference probabilities near the midpoint, which may be undesirable. To address this issue, the authors proposed setting Ψ to the identity function, $\Psi(q) = q$, thereby removing the nonlinear transformation in the objective of RL policy training, as shown in Eq. 16.
Based on this objective function, the authors formulated a novel loss function, as illustrated in Eq. 17, which avoided relying on the BT model to map pointwise rewards to preference probabilities.
This newly derived loss function could be directly optimized to obtain an optimal policy, effectively mitigating the issue of overfitting. The experiments were conducted on a simple illustrative example, demonstrating that when the regularization coefficient τ was sufficiently large, IPO successfully avoided overfitting, whereas DPO tended to overfit. However, a modified DPO that adds noise to the preference labels was also expected to address this issue adequately. Lastly, further use cases on downstream NLP tasks are necessary to validate the advantages of the IPO method.
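For concreteness, a minimal sketch of the IPO objective for a single pair is given below; the variable names are ours, and the per-pair squared-error form follows the description of Eq. 17 above.

```python
def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO loss for one preference pair: the difference of policy-vs-reference
    log-ratios between chosen and rejected responses is regressed onto the
    constant 1/(2*tau) with a squared loss, instead of being pushed through a
    log-sigmoid as in DPO."""
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (h - 1.0 / (2.0 * tau)) ** 2
```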

3.3.7 sDPO

In the context of DPO, the reference model was essential for preserving the performance of SFT and downstream tasks. The authors posited that the reference model acted as the lower bound for DPO, suggesting that an improved reference model could provide a superior lower bound for DPO training [16]. Building on this premise, stepwise DPO (sDPO) was introduced, which segmented the preference datasets and employed them incrementally. At each stage, DPO was applied, and the resulting partially aligned model became the new reference model.
Initially, SOLAR 10.7B [60] was used as the reference model. Subsequently, two datasets, OpenOrca (around 12K samples) [61] and Ultrafeedback Cleaned (around 60K samples) [62], were employed in the sDPO process, with OpenOrca used in the first step and Ultrafeedback in the second. Four tasks, i.e., ARC [55], HellaSWAG [56], MMLU [63], and TruthfulQA [45], were utilized, and their scores surpassed those of DPO. In contrast, Winogrande [64] and GSM8K [65] were excluded because they are generation tasks, differing from the multiple-choice tasks previously considered. However, in our view, this was not a compelling reason to omit these tasks. It raised the question: could sDPO negatively impact generation tasks? Further experiments are necessary to explore this issue.
The authors demonstrated that this lower bound, as measured through the reference model, increased as the number of sDPO steps increased. Furthermore, they showed that initializing the target model with the updated reference model was advantageous, as it resulted in a lower initial loss compared to using the original reference model.
Several questions arose that could further enhance this research. The current study utilized two datasets, applying stepwise alignment to each individually. Supposing only one dataset were available, would segmenting it and applying DPO sequentially to each segment yield similar benefits? Additionally, even with two datasets, would it be advantageous to use an initial portion of each dataset for the first alignment step and the remainder for the subsequent stage? Finally, catastrophic forgetting is a well-known issue: would it be beneficial to mix a portion of the data from previous steps with the new data to mitigate this problem?

3.3.8 GPO

The authors proposed generalized preference optimization (GPO), as shown in Eq. 18 [66].
The authors then applied a Taylor expansion around 0, as shown in Eq. 19, assuming the argument of the loss (the scaled log-ratio margin) is small.
The first-order term was interpreted as preference optimization: it focused on maximizing the difference between desired and undesired responses, playing a role similar to a reward. The second-order term was interpreted as offline regularization: it aimed to minimize the difference between the current policy and the reference policy, similar to a KL divergence.
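The sketch below shows the general form suggested by Eq. 18, in which a convex function of the scaled log-ratio margin is minimized; the two instantiations listed (logistic and hinge) are our examples of how different choices of the convex function recover DPO-style and SLiC-style losses, respectively.

```python
import math

def gpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, f, beta=0.1):
    """Generalized preference optimization for one pair: apply a convex
    function f to the scaled margin of policy-vs-reference log-ratios."""
    rho = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return f(beta * rho)

logistic = lambda x: math.log1p(math.exp(-x))   # -log(sigmoid(x)), DPO-style
hinge = lambda x: max(0.0, 1.0 - x)             # SLiC-style
```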

3.4 Token-level DPO

In DPO, rewards were assigned to the prompt and response collectively. Conversely, in an MDP, rewards were allocated for each individual action. The subsequent two papers delved into elucidating DPO at the token level and expanding its application to token-level analysis.

3.4.1 DPO: from $r$ to $Q^{*}$

DPO was conceptualized as a bandit problem rather than a token-level MDP [39], with the entire response treated as a single arm that receives a reward. In [17], the authors demonstrated that DPO was capable of performing token-level credit assignment. The token-level MDP was defined as the tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, f, r, \rho_0)$, where $\mathcal{S}$ represented the state space, $\mathcal{A}$ denoted the action space, $f$ described the state transition given an action, $r$ signified the reward function, and $\rho_0$ indicated the initial state distribution. The token-level MDP was formulated within the maximum entropy setting of RL, as illustrated in Eq. 20.
In the context of maximum entropy RL, the relationship between the optimal Q-function $Q^{*}$ and the optimal value function $V^{*}$ was elucidated in Eq. 21, and the Bellman equation was shown in Eq. 22.
Plugging Eq. 22 into Eq. 21, we could express the per-token reward in terms of the optimal policy and value functions. Furthermore, by summing both sides over the tokens of a response, the cumulative reward could be re-expressed as indicated in Eq. 23.
The initial-state value term $V^{*}(s_0)$ could be cancelled out when plugging Eq. 23 into the BT model, as shown in Eq. 24, with $N$ tokens in $y_w$ and $M$ tokens in $y_l$.
Eventually, the bandit problem, which traditionally considered the entire response as a single entity, was redefined as a token-level MDP with a reward assigned to each token generation, specifically the implicit reward $\beta \log\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)}$.
Extensive experiments have demonstrated the efficacy of DPO in token-level MDPs. Initially, the authors successfully utilized token-level rewards to identify erroneous modifications in LLM responses to a given prompt. Then, by employing beam search with token-level rewards, the authors generated higher quality responses, with results indicating that increasing the beam size significantly enhanced response quality. Lastly, the authors proved that during maximum entropy RL, the implicit rewards for both desired and undesired responses diminished when a model fine-tuned with SFT was used as the reference model.

3.4.2 TDPO

The authors discovered that during the DPO process the generative diversity of the LLM deteriorated and the KL divergence grew faster for less preferred responses than for preferred responses; they proposed token-level DPO (TDPO) to solve these problems [18]. In the original DPO, a reverse KL divergence was applied, whereas TDPO applied a sequential forward KL divergence.
For the token-level DPO problem, the reward discount was set to one, i.e., no reward decay, and the token-wise reward was defined as the reward received at the $t$-th token under the policy, depending on the state and action at the $t$-th step. In addition, the Q-value, the value function, and the advantage function were defined, and the total reward was defined as the sum of the token-wise rewards over the response. Based on the obtained advantage function, the objective function for TDPO could be expressed as in Eq. 25.
Based on this objective function, the relationship between the Q-value and the optimal policy could be derived, as shown in Eq. 26, where a state-dependent normalization term appears. However, unlike the prompt-level partition function in DPO, these state-dependent terms for the two responses could not be cancelled out. To solve this problem, the authors proposed the sequential KL divergence shown in Eq. 27.
Based on the defined sequential KL divergence, the remaining state-dependent terms can be cancelled out when applying the BT model, as shown in Eq. 28.
Here, a forward KL divergence was applied at each token position. The resulting preference probability was then plugged into the cross-entropy loss for model training. Lastly, the authors proposed stopping the gradient propagation through the sequential KL term to further boost the performance of TDPO.
In the experiments, the authors utilized GPT-2 Large [67] as the base model and evaluated it on the IMDB [54], Anthropic HH [3], and MT-Bench [58] datasets. Their experiments showed that TDPO, especially with the stop-gradient, outperformed DPO.

3.5 Iterative/Online DPO

In DPO, all available preference datasets were employed to align LLMs. To achieve continuous improvement of LLMs, iterative/online DPO should be implemented, raising the intriguing question of how to efficiently collect new preference datasets. The following two papers delved into this subject.

3.5.1 Iterative/Online DPO: Self-Rewarding Language Models

A significant challenge with DPO was the difficulty in acquiring new human preference data, which was very expensive. The concept of iterative/online DPO leveraged LLMs for both generating responses based on prompts and evaluating these responses in a manner akin to human labelers [19].
The authors asserted "To achieve superhuman agents, future models require superhuman feedback to provide an adequate training signal". In line with this assertion, they proposed using LLMs as judges for evaluating responses to prompts. Furthermore, they aimed to "develop an agent that processes all desired abilities during training, rather than separating them into distinct models". Consequently, the same LLM was employed for both "Instruction following: given a prompt that describes a user request, the ability to generate a high-quality, helpful (and harmless) response" and "Self-Instruction creation: the ability to generate and evaluate new instruction-following examples to add to its own training set".
In the "Self-Instruction Creation" phase, multiple candidate responses were generated, and the LLM acted as a judge to evaluate them. The evaluation was based on five criteria, relevance, coverage, usefulness, clarity, and expertise, with a total score of up to 5. The response with the highest score was selected as the preferred response, while the one with the lowest score was deemed dispreferred. During the "Instruction Following" training, DPO was used to train the LLM to align with the generated preference dataset.
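To make the loop concrete, here is a minimal sketch of a single self-rewarding iteration; the helper callables `generate_fn`, `judge_fn`, and `dpo_update_fn` are hypothetical placeholders for sampling, LLM-as-a-judge scoring, and DPO training, not interfaces from the paper.

```python
def self_rewarding_iteration(model, prompts, generate_fn, judge_fn, dpo_update_fn,
                             num_candidates=4):
    """One iteration: create preference pairs with the model's own judgments
    (self-instruction creation), then run DPO on them (instruction following)."""
    preference_pairs = []
    for prompt in prompts:
        candidates = [generate_fn(model, prompt) for _ in range(num_candidates)]
        scores = [judge_fn(model, prompt, c) for c in candidates]  # scores up to 5
        chosen = candidates[scores.index(max(scores))]
        rejected = candidates[scores.index(min(scores))]
        if chosen != rejected:
            preference_pairs.append((prompt, chosen, rejected))
    return dpo_update_fn(model, preference_pairs)
```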
Numerous experiments were conducted, utilizing Llama 2 70B as the pretrained LLM [68]. The authors performed three iterations of self-reward training. A primary limitation of this study was the lack of a method to determine the optimal termination point for the iterations: it did not explain why three iterations should be deemed sufficient, nor why additional iterations might not yield further benefits. Models M1, M2, and M3 were derived after DPO training for one, two, and three iterations, respectively. In head-to-head evaluation, the later-iteration models won substantially more often against the earlier ones than vice versa. Similar trends were observed in AlpacaEval, demonstrating the benefits of iterative/online training. In AlpacaEval [40], various subtasks, including Health and Professional, were conducted. Generally, the LLM's performance on different tasks improved with more iterations, particularly in terms of stability: M1 and M2 exhibited more variability across tasks, whereas M3 showed greater robustness. Results on MT-Bench [58] improved, while performance on NLP benchmarks declined. The authors suggested that this decline was due to the training data being based on Open Assistant prompts, which might not be relevant to the tasks in the NLP benchmarks. However, we questioned whether this discrepancy indicated overfitting rather than an out-of-distribution dataset, especially given the LLM's extensive pretraining on large text corpora. Notably, performance on NLP benchmarks decreased with more iterations, raising concerns about whether improvements in certain tasks came at the expense of abilities in others. Lastly, the authors evaluated the reward models, finding that most metrics improved with more iterations, except for the "5-best" metric, which initially increased and then decreased, though it remained higher than the initial value. This further emphasized the critical importance of determining the optimal point at which to terminate iterative/online DPO.

3.5.2 Iterative/Online DPO: CRINGE

Based on binary feedback, a promising approach was the ContRastive Iterative Negative GEneration (CRINGE) loss [69]. The CRINGE loss was designed to handle positive and negative responses separately. Positive examples were processed with the standard likelihood objective, similarly to SFT. For negative examples, the CRINGE loss contrasted each negative token in the sequence against a positive token. Let $s_i$ represent the model output score (the input to the final Softmax) corresponding to token $i$. First, the top-k scores were selected from the vocabulary. Next, a contrast token was sampled from the categorical distribution constructed through a Softmax over these top-k scores, excluding the negative token. For instance, with $k=4$, the top-k tokens might be 'discharge', 'charge', 'absorb', and 'reflect'; if the negative token was 'discharge', we then selected from the remaining three candidates ('charge', 'absorb', and 'reflect') based on their scores, applying the Softmax function and sampling accordingly. The binary CRINGE loss function was then derived as shown in Eq. 29.
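A small sketch of this contrast-token sampling step, using a toy score dictionary; the token names and score values here are illustrative only.

```python
import math
import random

def sample_contrast_token(scores, negative_token, k=4):
    """Take the top-k vocabulary scores, drop the negative token if present,
    and sample a contrast token from a softmax over what remains."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    candidates = [tok for tok in top_k if tok != negative_token]
    weights = [math.exp(scores[tok]) for tok in candidates]
    total = sum(weights)
    return random.choices(candidates, weights=[w / total for w in weights])[0]

scores = {"discharge": 3.1, "charge": 2.7, "absorb": 2.2, "reflect": 1.9, "emit": 0.4}
print(sample_contrast_token(scores, "discharge"))  # one of charge/absorb/reflect
```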
Given that the most effective alignments were achieved through preference alignment, extending CRINGE from binary feedback to preference feedback was an intriguing prospect [20]. The updated pairwise CRINGE loss function was detailed in Eq. 30
In the pairwise version, the positive and negative examples of the binary loss were replaced by the preferred and less preferred responses. A gating function controlled the binary CRINGE loss: if the preferred response was already scored much better than the less preferred one, the gate approached zero, rendering the loss nearly zero; conversely, if the preferred response was scored much worse, the gate approached one, resulting in a large loss. One parameter controlled the margin between desired and undesired responses, while another regulated the smoothness of the gate, akin to the temperature used during LLM inference. Finally, the authors combined the proposed pairwise CRINGE loss with iterative/online training to further enhance quality: four generations were produced per prompt, and the best and worst ones, as evaluated by the reward function, were used as a pair for improving the LLM in the next iteration.
In their experiments, the authors tested the approach on GPT-2 [67] using the AlpacaFarm [70] datasets. The results demonstrated that the pairwise CRINGE loss reduced repetition during inference and improved generation quality. Pairwise CRINGE outperformed binary CRINGE, PPO, and DPO, with iterative/online pairwise CRINGE yielding even greater improvements.

3.6 Binary Feedback

Collecting preference feedback proved more challenging than gathering binary feedback, such as "thumbs up" and "thumbs down", so binary feedback facilitates scaling up the alignment process. The subsequent studies, KTO and DRO, concentrated on utilizing binary feedback to align LLMs.

3.6.1 KTO

Both RLHF and DPO relied on preference feedback, which was challenging to derive. In contrast, binary feedback, categorized simply as 'good' or 'bad', was more readily obtainable. Thus, enhancing alignment on binary data could significantly accelerate the overall alignment task.
The authors were inspired by Kahneman and Tversky's prospect theory [71]. This theory elucidated why human decisions under uncertainty do not maximize expected value: humans are loss averse. The value function of Kahneman and Tversky's prospect theory was presented in Eq. 31.
Here, the value function mapped a realized outcome, measured relative to a reference point, to a perceived value, asserting that humans perceive losses more strongly than gains. It was characterized by two parameters: one governed the curvature of the function, and the other controlled the steepness; the latter, typically greater than 1, reflected loss aversion. Loss functions that encapsulate this notion of human loss aversion were termed human-aware losses (HALOs). Techniques such as SLiC [11], along with PPO [72], DPO [12], and KTO [21], fell under the category of HALOs. The authors asserted that HALOs generally outperformed non-HALOs.
When applying Kahneman and Tversky's prospect theory to LLMs, the value function was slightly modified, as shown in Eq. 34, with the implicit reward defined by the policy-to-reference log-ratio and an updated reference point, as indicated in Eq. 33, estimated using the average rewards over all prompts and their corresponding responses (normalized by the total number of prompt-response pairs). This reference point simplified to the KL divergence between the optimal policy and the reference policy $\pi_{\mathrm{ref}}$. From the modified value function, the loss function for KTO could be derived, as presented in Eq. 32, where the weights $\lambda_D$ and $\lambda_U$ were applied to desired and undesired responses, respectively.
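A minimal per-example sketch of the KTO loss as described above; the symbol names (`z_ref`, `lam_d`, `lam_u`) follow our reading of Eqs. 31-34, and the reference point is passed in as a precomputed estimate rather than derived here.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(logp, ref_logp, desirable, z_ref, beta=0.1, lam_d=1.0, lam_u=1.0):
    """KTO loss for a single (prompt, response, binary label) example.
    logp / ref_logp: summed log-probs under the policy and reference model;
    z_ref: reference point, an estimate of the policy-to-reference KL."""
    reward = logp - ref_logp
    if desirable:
        value = lam_d * sigmoid(beta * (reward - z_ref))
        return lam_d - value   # small when the reward clears the reference point
    value = lam_u * sigmoid(beta * (z_ref - reward))
    return lam_u - value       # small when the reward falls below the reference point
```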
To evaluate the performance of KTO, the authors tested two families of models, Pythia 1.4B, 2.8B, 6.9B, and 12B [59] and Llama 7B, 13B, and 30B [68], using GPT-4-0613 [4] for assessment. The binary data were derived from the preference data in UltraFeedback [62], with desired responses labeled +1 and undesired responses labeled -1. Notably, the authors did not test on naturally collected binary feedback, despite its ease of acquisition, due to its subjective nature and potential noisiness; filtering out unreasonable data under such conditions remains an intriguing challenge.
The authors found that setting the desired and undesired weights equal achieved optimal performance in downstream tasks such as MMLU [63], GSM8k [65], HumanEval [73], and BBH [74]. This indicated no significant aversion to either gains or losses, which calls into question the necessity of Kahneman and Tversky's prospect theory in this setting. The results demonstrated a notable enhancement on GSM8K, while improvements on the other tasks were minor; further insight into this phenomenon would be beneficial. The authors also conducted experiments on data imbalance, demonstrating that keeping the ratio $\frac{\lambda_D n_D}{\lambda_U n_U}$ within a narrow range slightly above 1 handled data imbalance optimally, where $n_D$ and $n_U$ represented the quantities of desired and undesired samples, respectively.

3.6.2 DRO

Direct Reward Optimization (DRO) [22] was designed to align LLMs using single-trajectory feedback data, such as binary feedback (e.g., thumbs up or thumbs down). This method aimed to leverage more readily available data compared to the scarce pairwise preference data used in traditional alignment techniques like DPO.
DRO built upon the standard KL-regularized policy optimization framework used in RLHF, as shown in Eq. 5. Based on this objective, the optimal policy could be formulated as shown in Eq. 35, where $V^{*}(x)$ was the optimal value function and $r(x, y)$ referred to the binary feedback, i.e., +1 for positive feedback and -1 for negative feedback. By reformulating the relationship between the policy and the reward, we could derive Eq. 36.
Eventually, the loss function for DRO could be derived using the mean square error, as illustrated in Eq. 37
This formulation had several advantages. It directly optimized the policy without needing to learn a separate reward model. In addition, it worked with single-trajectory data, which is more abundant than pairwise preference data. Lastly, it had a unique global optimum in the policy and value function, which could be optimized independently.
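A per-example sketch of the squared-error objective in Eq. 37, under our reading of the formulation above; in practice the value baseline is supplied by a separate network (DRO-V).

```python
def dro_loss(logp, ref_logp, reward, value, beta=1.0):
    """DRO loss for a single (prompt, response, reward) triple: the KL-scaled
    policy-to-reference log-ratio should match the reward minus the value
    baseline for the prompt."""
    residual = reward - value - beta * (logp - ref_logp)
    return 0.5 * residual ** 2
```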
However, estimating the optimal value function proved challenging, so it was approximated using a neural network. DRO-V, a practical implementation of DRO, jointly optimized a policy network and a value network; it combined offline policy learning with value function learning, hence the suffix -V. The gradient updates for the policy and value networks were given by:
These update rules have interesting connections to standard reinforcement learning algorithms:
  • The policy gradient resembles a standard policy gradient with a value baseline, but includes an additional regularization term.
  • The value function update is similar to temporal difference learning, but includes a term related to the KL divergence between the current and reference policies.
Key implementation details that contribute to DRO-V's performance include:
  • Using separate networks for policy and value functions
  • Rescaling the policy gradient by a constant factor
  • Employing multiple value outputs per batch, rather than a single value
Empirical results demonstrate that DRO-V outperforms previous methods such as KTO [21] on the UltraFeedback dataset. The performance of DRO-V is robust to learning-rate changes within an order of magnitude, and a single default value of the regularization parameter works well.
DRO offers a promising alternative to preference-based methods for aligning language models, leveraging more abundant single-trajectory feedback data while maintaining a simple, theoretically principled approach without strong assumptions.

3.7 Merge SFT and Alignment

Previous research primarily concentrated on sequentially applying SFT and alignment, a procedure that proved laborious and led to catastrophic forgetting. The subsequent studies either integrated these two processes into a single step or performed the fine-tuning stages in parallel and merged the two models at the end.

3.7.1 ORPO

Odds Ratio Preference Optimization (ORPO) removed the need for a reference model and integrated SFT and alignment into a single step [23]. Initially, the authors demonstrated that even when SFT was applied only to the desirable data from the Anthropic HH dataset [2], the probability of the undesirable data also increased. This phenomenon is logical, since the undesirable data are grammatically correct and may differ only slightly from the desired data. Previous approaches, such as PPO, DPO, and KTO, addressed the probability increase of undesirable data through a separate alignment stage; however, these multi-stage pipelines involving SFT and alignment were cumbersome. The authors proposed combining these processes, resulting in ORPO.
The authors defined the odds of a response and the odds ratio between two responses in Eq. 40 and Eq. 41, respectively. The odds $\mathrm{odds}_{\theta}(y \mid x) = \frac{P_{\theta}(y \mid x)}{1 - P_{\theta}(y \mid x)}$ represented the ratio of the probability of generating $y$ given $x$ to the probability of not generating $y$ given $x$. The odds ratio $\mathrm{OR}_{\theta}(y_w, y_l) = \frac{\mathrm{odds}_{\theta}(y_w \mid x)}{\mathrm{odds}_{\theta}(y_l \mid x)}$ quantified the relative likelihood of the model producing $y_w$ over $y_l$ for a given input $x$. Utilizing this odds ratio, the loss function for ORPO was derived and presented in Eq. 42.
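A minimal sketch of an ORPO-style objective built from these definitions: an SFT term on the chosen response plus a log-sigmoid of the log odds ratio. The use of length-averaged log-probabilities for the sequence probability and the weight name `lam` are our assumptions.

```python
import math

def orpo_loss(logp_w, logp_l, len_w, len_l, lam=0.1):
    """ORPO-style loss for one pair. Inputs are summed log-probs and token
    lengths of the chosen (w) and rejected (l) responses."""
    def log_odds(logp, length):
        avg_logp = logp / length              # length-averaged log-probability
        return avg_logp - math.log(1.0 - math.exp(avg_logp))
    log_odds_ratio = log_odds(logp_w, len_w) - log_odds(logp_l, len_l)
    sft_term = -logp_w / len_w                           # NLL on the chosen response
    ratio_term = math.log1p(math.exp(-log_odds_ratio))   # -log sigmoid(log OR)
    return sft_term + lam * ratio_term
```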
The models employed for fine-tuning included Phi-2 (2.7B) [75], Llama-2 (7B) [68], and Mistral (7B) [76]. The UltraFeedback dataset [62], a preference dataset, supplied the desired responses for the SFT part of the process. The resulting models achieved competitive scores on AlpacaEval 2.0 [40] and IFEval [77], and 7.32 on MT-Bench [58].
Several issues can be identified with the ORPO method. Firstly, this approach is ineffective for SFT datasets where only a desired response is present. For alignment datasets, where the chosen and rejected responses represent relatively desired and undesired outcomes, respectively, greater caution is required when maximizing the probability of one and minimizing the probability of the other. Lastly, some experiments on Mistral and Llama-3 indicated that the performance of ORPO is inferior to that of DPO [24].

3.7.2 PAFT

The sequential training of SFT and alignment often led to catastrophic forgetting, where the capabilities acquired during pretraining and SFT were lost. To address this issue, the authors proposed a novel PArallel training paradigm for effective LLM Fine-Tuning (PAFT) [24]. PAFT performed SFT and DPO in parallel on the pretrained model, and the fine-tuned model from SFT and the aligned model from DPO were then merged to retain the capabilities of both. Both models were obtained through low-rank adaptation (LoRA) [78]. During the merging process, model sparsity played a crucial role: DPO produced sparse adapter weights during alignment, whereas SFT did not. To address this, an L1-norm penalty was applied to increase the sparsity of the SFT adapter. Finally, the merging procedure was applied to derive the final model, as shown in Eq. 43.
Mistral-7B [76] and Llama3-8B [68] were used as base models, and UltraFeedback [62] was employed as the preference dataset. The resulting PAFT models achieved state-of-the-art performance among 7B models on the Huggingface Leaderboard, outperforming sequential SFT+DPO, SFT alone, DPO alone, and ORPO.

3.8 Length Control DPO and Reference Free DPO

Previous studies have demonstrated that LLMs often produce excessively verbose outputs. To address this, R-DPO and SimPO have concentrated on generating length-controlled responses without compromising the performance of LLMs. Additionally, DPO necessitated a reference policy to ensure that the aligned model did not deviate significantly from the reference model. In contrast, SimPO and RLOO have proposed methods to eliminate the need for a reference model while still maintaining the efficacy of LLMs.

3.8.1 R-DPO

The authors described how standard DPO can exploit biases in the preference data, such as verbosity. To address this preference for longer outputs, they introduced regularized DPO (R-DPO), which mitigates the verbosity of DPO outputs [25] by incorporating the length of the output directly into the RL objective, as illustrated in Eq. 44.
To penalize the response length $|y|$, the term $-\alpha |y|$ was added to the objective, where α was a hyperparameter that determined its significance. Utilizing this revised RL objective, the new reward function could be formulated based on Eq. 45,
and the only modification was the additional length-penalty term. Based on this updated reward function, the revised loss function could be derived as shown in Eq. 46; the new R-DPO loss effectively restricted the length of the output.
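A per-pair sketch of the resulting loss under our reading of Eqs. 44-46: the usual DPO margin is shifted by the difference in response lengths, weighted by the penalty coefficient alpha.

```python
import math

def rdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, len_w, len_l,
              beta=0.1, alpha=0.01):
    """R-DPO loss for one preference pair with an explicit length penalty."""
    dpo_margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    length_penalty = alpha * (len_w - len_l)
    return math.log1p(math.exp(-(dpo_margin - length_penalty)))  # -log sigmoid(.)
```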
The authors utilized Pythia 2.8B [59] on the Anthropic RLHF HH [3] and Reddit TL;DR [47] datasets. The findings indicated that while DPO tended to produce verbose responses, R-DPO significantly mitigated this issue. On the Anthropic RLHF HH dataset, the regularization improved win rates, whereas on the TL;DR dataset it led to a decrease in win rates. Furthermore, the authors demonstrated a weak correlation between output length and KL divergence, suggesting that tuning the KL-regularization strength has minimal impact on verbosity. They also observed that DPO converged more rapidly than R-DPO, attributing this to DPO's exploitation of the reward model's bias, which failed to capture the more intricate features of the preferences.

3.8.2 SimPO

The integration of reference models in DPO and RLHF, despite preventing significant deviations of the LLM policy, has been acknowledged as complex and challenging [26]. Simple Preference Optimization (SimPO) introduced a method for preference optimization that eliminated the need for the reference model. This approach was straightforward, demonstrated strong performance, and did not substantially increase response length. The loss function for SimPO was presented in Eq. 47.
In this context, $|y_w|$ and $|y_l|$ denoted the lengths of the desired and undesired responses, respectively. The constant β regulated the scaling of the reward difference, while the margin γ ensured that the reward for the desired response exceeded that of the undesired response by at least γ. The success of SimPO was largely due to its length-normalized reward, expressed as $\frac{\beta}{|y|}\log \pi_{\theta}(y \mid x)$, together with the reward margin between desired and undesired responses. This approach directly aligned with response generation metrics, namely maximizing the length-normalized likelihood of the generated sequence while achieving the target reward margin.
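A per-pair sketch of Eq. 47 with the symbols above; no reference model appears, and the implicit reward is the length-normalized log-probability.

```python
import math

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO loss for one preference pair."""
    reward_w = beta * logp_w / len_w
    reward_l = beta * logp_l / len_l
    return math.log1p(math.exp(-(reward_w - reward_l - gamma)))  # -log sigmoid(.)
```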
To demonstrate the benefits, the authors employed Llama3-8B [68] and Mistral-7B [76] in two configurations, Base and Instruct, evaluated on AlpacaEval 2 [40], MT-Bench [58], and the challenging Arena-Hard [79] benchmark. SimPO outperformed DPO and all its variants. Furthermore, the ablation study highlighted the significance of both the length normalization and the reward margin γ. The authors also conducted a thorough comparison between SimPO and DPO, revealing that SimPO better separated likelihood from length, thereby enhancing reward accuracy.
Several questions arose when reviewing the paper. The new reward focuses on length-normalized next-token prediction, yet the authors did not specify the corresponding objective function within the RL framework. This emphasis also raises concerns about potential deviation from the original alignment target, which aimed to avoid toxic generations and to produce responses aligned with human values.

3.8.3 RLOO

PPO was introduced to address the high variance often encountered in traditional RL settings, particularly when the model initialization is sub-optimal. However, this issue might not be as pronounced for LLMs, since extensive pretraining and SFT should already mitigate most of the variance; consequently, PPO might be an excessive tool for aligning LLMs. To address this, the authors proposed using REINFORCE Leave-One-Out (RLOO) for alignment [80, 81, 27]. The RLOO algorithm required multiple on-policy samples, denoted as $y_1, \ldots, y_k$, for the same input prompt $x$ in order to estimate the baseline. The objective function of RLOO was presented in Eq. 48.
For each response $y_i$, the remaining $k-1$ responses were used to estimate the baseline as the average of their rewards, $b_i = \frac{1}{k-1}\sum_{j \neq i} R(y_j, x)$. This approach simplified the PPO pipeline while maintaining comparable performance.
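A small numeric sketch of the leave-one-out baseline; the resulting advantages would weight the log-probability gradients in a plain REINFORCE update.

```python
def rloo_baselines(rewards):
    """Each sample's baseline is the mean reward of the other k-1 samples
    drawn for the same prompt."""
    k, total = len(rewards), sum(rewards)
    return [(total - r) / (k - 1) for r in rewards]

rewards = [0.2, 0.9, 0.4, 0.7]
advantages = [r - b for r, b in zip(rewards, rloo_baselines(rewards))]
print(advantages)
```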
The authors evaluated RLOO using the Llama [68] and Pythia [59] models on the Anthropic HH [3] and TL;DR [47] datasets. The results demonstrated that RLOO outperformed both PPO and DPO, showing greater robustness to noise, particularly when more on-policy samples could be generated simultaneously.

3.9 Listwise Preference Optimization

Previous studies on PPO or DPO primarily concentrated on pairwise preferences, while research on RLHF collected listwise preferences to speed up the data collection process and subsequently converted them into pairwise preferences. Nevertheless, it is feasible to perform preference optimization directly on listwise datasets to improve the performance of LLMs. The following three papers specifically address this approach.

3.9.1 LiPO

Previous studies have primarily concentrated on pairwise preferences, often neglecting listwise preferences [12]. Notably, in the context of RLHF, datasets containing listwise preferences were collected but subsequently treated as pairwise preferences [2]. The authors noted that "human feedback often comes in a format of a ranked list over multiple responses to amortize the cost of reading prompt". Despite this, research on optimizing listwise preferences has been limited. This gap formed the central research question of Listwise Preference Optimization (LiPO) [28], which drew inspiration from Learning-to-Rank (LTR) methodologies [82]. The authors highlighted two key advantages of listwise preference datasets: (1) considering all candidates under the same prompt in a systematic manner could enhance policy learning, and (2) the relative label values between responses might benefit the alignment process.
The loss function for LiPO was presented in Eq. 49, where the Lambda weight combined a gain function, based on the human-labelled scores of the responses, with a rank discount function, based on each response's position in the ranking permutation induced by the model's scores; thus it was a listwise method even though the formula could be written in terms of pairs. The model's score of each response was given by its scaled log-likelihood under the policy. In addition, the authors also borrowed other ranking loss functions from the LTR literature.
In the experiments, T5-large (770M) [83] was utilized as the policy and T5-XXL (11B) [11] was applied as the reward-ranking model, tested on the Reddit TL;DR [47], Anthropic HH [3], and OpenAssistant [84] tasks. The results favored the listwise LiPO-λ loss over the pairwise alternatives.
This study presented intriguing findings. The method's impact would be more compelling if it demonstrated improvements on larger LLMs relative to pairwise preference datasets. Additionally, the conversion from listwise preferences to Lambda weights should be validated with examples to ensure that it effectively utilizes the score information. Finally, acquiring high-quality pairwise datasets is challenging, as different labelers may have varying opinions and some responses may be of similar quality, leading to noisy datasets. Investigating methods to filter out noisy data from listwise datasets is a promising area for future research.

3.9.2 RRHF

The training of RLHF necessitated a policy model, a value model (or value head), a reward model, and a reference model, which could be demanding on memory resources. To address this issue, the authors introduced Rank Responses to align Human Feedback (RRHF), a method designed to streamline the alignment process by integrating it into SFT while maintaining comparable performance [29]. The core idea of RRHF was to sample multiple responses from various models and score them with the policy LLM's length-normalized conditional log-probabilities; these model-induced rankings were then trained to match the rankings provided by previously trained reward models or human annotators, as illustrated in Eq. 50.
Here, the length-normalized conditional log-probability of the $i$-th response given the input served as the model's score; the ranking loss aligned the LLM's scores with the reward-based or human-annotated ranking, and the best response among the candidates was additionally trained with the cross-entropy loss used in SFT for instruction following. Compared to PPO, RRHF did not require a reference model or a value model, and it could dispense with the reward model entirely if the rankings were labelled by humans.
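A minimal sketch of Eq. 50 under this description: a pairwise hinge on the length-normalized log-probabilities whenever they disagree with the reward ranking, plus an SFT term on the top-ranked response.

```python
def rrhf_loss(norm_logps, reward_scores, best_logp):
    """RRHF objective for one prompt. norm_logps[i] is the policy's
    length-normalized log-prob of candidate i, reward_scores[i] its ranking
    score, and best_logp the log-prob of the top-ranked response."""
    rank_loss = 0.0
    n = len(norm_logps)
    for i in range(n):
        for j in range(n):
            if reward_scores[i] < reward_scores[j]:
                rank_loss += max(0.0, norm_logps[i] - norm_logps[j])
    return rank_loss - best_logp   # ranking loss + SFT (NLL) term
```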
The authors evaluated their approach using the Alpaca model [85] with Anthropic's HH dataset [3], resulting in a new model named Wombat. This model demonstrated performance on par with RLHF/PPO, while significantly simplifying the alignment process.

3.9.3 PRO

Previous works performed SFT and alignment in two stages using pairwise datasets. To simplify this process, the authors proposed preference ranking optimization (PRO) with listwise preference datasets, which realizes alignment directly within the SFT process [30]. Instead of pairwise preferences, a preference ranking of any length can be utilized for alignment. Suppose there were one prompt $x$ and $K$ responses $y_1, \ldots, y_K$, ranked by preference score so that $y_1 \succ y_2 \succ \cdots \succ y_K$. The ranking could then be broken down into a sequence of tasks: the first task took $y_1$ as the positive sample, with $y_2, \ldots, y_K$ as negative samples; in the second task, $y_1$ was dropped, $y_2$ was regarded as the positive sample, and $y_3, \ldots, y_K$ were negatives. This process continued through the ranking. Based on these tasks and the InfoNCE loss, the initial loss function was formulated as shown in Eq. 51.
However, the initial loss function did not consider the score distance between responses. To take this into consideration, the loss function was slightly modified as shown in Eq. 52
Here, the difference in preference scores measured the distance between two responses, and, for the $k$-th task, the minimum distance between the positive response and all of its negative responses was used to modulate the loss. Lastly, the SFT loss was modified by merging in the alignment loss term.
The authors experimented with LlaMA 7B [68] on the Anthropic HH [3] dataset and observed that PRO outperformed RLHF in terms of reward model scores and BLEU. Two further conclusions were drawn: the more responses per prompt and the more diverse those responses, the better the combination of SFT and alignment performed. However, the method was not applied to downstream tasks to examine its performance there, which deserves further investigation.

3.10 Negative Preference Optimization

These studies converged on a common premise: the current generation of LLMs has surpassed human performance in tasks such as translation and summarization. Consequently, it is advantageous to treat the output of LLMs as the desired response rather than relying on human-labeled data as the preferred response. Conversely, undesired responses can still be leveraged for aligning LLMs through a process known as negative preference optimization (NPO).

3.10.1 Negating Negatives

The objective of Negating Negatives was to maintain helpfulness while minimizing harmfulness [31]. The preferred responses in a dataset might not be flawless, owing to the inherent ambiguity and diversity of desired outcomes, leading to noisy preference labels [86]. Consequently, the authors contended that the LLM also learns to maximize such noisy preferred responses, inadvertently compromising harmlessness, and suggested discarding the preferred responses and exclusively using the dispreferred ones to eliminate harmful behavior.
In line with the aim of removing positive responses and solely employing negative responses, the loss function of Negating Negatives (NN) was presented in Eq. 53
In this loss, the authors considered two reference distributions: a general reference model carrying more beneficial information, such as the model from the previous alignment epoch, and a more harmful policy akin to the original unaligned LLM. During training, multiple distinct responses were generated by the LLM, and the loss was designed to maximize the divergence between the generated responses and the dispreferred ones, thereby effectively suppressing harmful information.
Experiments were conducted using the PKU-SafeRLHF dataset [87] and Alpaca-7B [85] as the backbone. The results demonstrated an improvement in helpfulness, a reduction in harmfulness, and a smoother learning curve.

3.10.2 Negative Preference Optimization

To reduce the likelihood of undesired responses, gradient ascent (GA) was used. However, this approach could be detrimental, as it might degrade the model's overall performance on other tasks [88]. To address this issue, negative preference optimization (NPO) was introduced, which adapts the loss function of DPO by retaining only its negative component [32]. NPO demonstrated a significantly slower rate of catastrophic collapse compared to GA and was the first method to achieve effective unlearning results, successfully forgetting a large fraction of the undesired training data, whereas existing methods struggled to forget even a small portion of it.

3.10.3 Contrastive Preference Optimization

To enhance the performance of machine translation (MT) in moderately-sized LLMs, contrastive preference optimization (CPO) has been proposed [33]. Previous approaches have improved downstream tasks like MT through SFT. However, SFT was constrained by the limitations of available data and the imperfections inherent in human-generated data, particularly due to the absence of mechanisms to reject errors.
To ensure the generation of high-quality data, the authors proposed utilizing three sources, the gold reference translation, GPT-4 [4], and ALMA-13B-LoRA [89], to produce candidate translations. These translations were subsequently assessed using reference-free evaluation models. The translations receiving the highest and lowest scores were classified as the desired and undesired outputs, respectively. Leveraging this curated dataset, a model could be trained to identify and reject errors by employing the CPO loss function detailed in Eq. 54, where the preferred and dispreferred translations came from the sources above.
The loss function could eliminate the reference model, i.e., the $\pi_{\mathrm{ref}}$ terms, by assuming a uniform reference prior, since the reference terms for the preferred and dispreferred translations then cancel each other out. Consequently, the derived ALMA-R model demonstrated performance comparable to GPT-4 on machine translation.
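A per-pair sketch of a CPO-style objective consistent with this description: a reference-free DPO-style preference term plus a negative log-likelihood term on the preferred translation; the exact weighting in Eq. 54 may differ.

```python
import math

def cpo_loss(logp_w, logp_l, beta=0.1):
    """CPO-style loss for one translation pair (log-probs under the policy only)."""
    pref_term = math.log1p(math.exp(-beta * (logp_w - logp_l)))  # -log sigmoid(.)
    nll_term = -logp_w                                           # SFT on preferred
    return pref_term + nll_term
```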

3.11 Nash Learning

Previous research derived pairwise preferences using pointwise rewards and the BT model. However, this approach cannot fully substitute for direct pairwise preference modeling and fails to address inconsistencies within pairwise preferences. To overcome these limitations, several studies have introduced Nash learning methodologies.

3.11.1 Nash Learning from Human Feedback

In RLHF, a pointwise reward approach was often employed to learn from pairwise human feedback using the BT model. However, this method presented two significant issues. The first was non-transitivity: one response could be preferred to a second and the second to a third, yet the third could still be preferred to the first, and this problem could be exacerbated in a group of labelers with differing opinions. The second issue was that BT model scores could fail to accurately capture preference ordering, leading to a diversity problem in the trained policy [90]. To address these challenges, the authors proposed using a Nash equilibrium to derive preference models instead of relying on the pointwise model [34].
Given two policies, the probability that a response sampled from one was preferred to a response sampled from the other was shown in Eq. 55.
Through this formulation, the preference probability between two policies could be directly represented, bypassing the need for the BT model and pointwise rewards. The optimal policy, or Nash equilibrium, could be determined by Eq. 56.
However, when applying this preference model to LLMs, a constraint was typically introduced to ensure that the distance from the aligned model to the initial model remained limited. Equation 55 should therefore be generalized to incorporate the reference model, as illustrated in Equation 57.
Building on the refined preference model, the authors introduced the Nash-MD (mirror descent) algorithm, which combined a regularized policy, as described in Eq. 58, with a policy update mechanism, as detailed in Eq. 59, where a step-dependent learning rate governed each update. In the update rule, the regularized policy obtained after the previous optimization step was treated as a constant rather than a variable.
The algorithm was proven to converge, with the last-iterate policy approaching the Nash equilibrium. Experiments conducted on PaLM 2 Large [48] for text summarization demonstrated that it outperformed RLHF. However, the drawback of this method lay in the multiple iterations required to reach convergence, which would be much slower compared with DPO.
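As an illustration of the quantities referred to as Eq. 55 and Eq. 56, the policy-level preference and the corresponding Nash objective can be sketched as follows (assumed notation, not a verbatim reproduction):

$$\mathcal{P}(\pi \succ \pi') = \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}\big[\mathcal{P}(y \succ y' \mid x)\big], \qquad \pi^{*} = \arg\max_{\pi} \min_{\pi'} \mathcal{P}(\pi \succ \pi').$$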

3.11.2 SPPO

Self-play preference optimization (SPO) reinterpreted RLHF as a two-player zero-sum game [35]. This approach eliminated the need for a reward model, making the process robust against noisy, intransitive, and non-Markovian preferences. By exploiting the game's symmetry, a single agent could sample multiple trajectories, which were then evaluated by humans or evaluation models, using the win rate as the reward. This method avoided adversarial training, thereby mitigating the instability associated with such training.
The concept of SPO, specifically leveraging the symmetry of the preference function, was subsequently applied to align LLMs [91]. In line with [92], the iterative/online policy update was detailed in Eq. 60.
A partition function was utilized to normalize the numerator. By reformulating Eq. 60, the authors derived an expression for the log-ratio between consecutive policies. Lastly, an L2 (squared-error) objective between this log-ratio and its target was applied to update the policy, as shown in Eq. 61.
The required preference probabilities and the partition function were estimated through sampling and averaging. The authors sampled multiple responses for each prompt, formed the corresponding empirical distribution, and substituted the population quantities in Eq. 61 with these empirical estimates.
The authors assessed SPPO using 60k prompts (excluding responses) from the UltraFeedback [62] dataset, without any prompt augmentation. By leveraging a pre-trained preference model, PairRM [93], with a modest 0.4 billion parameters, SPPO successfully fine-tuned Mistral-7B-Instruct-v0.2, achieving a state-of-the-art length-controlled win rate against GPT-4-Turbo on AlpacaEval 2.0. Additionally, they demonstrated that SPPO surpassed the iterative/online DPO and IPO on both MT-Bench and the Open LLM Leaderboard. However, because multiple responses must be sampled for each prompt, the speed of this method might be further reduced.
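A minimal sketch of the resulting squared-error update is given below. The form and names are assumptions; the win rates would come from a preference model such as PairRM, and eta is a step-size hyperparameter.

```python
import torch

def sppo_loss(policy_logps, prev_logps, win_rates, eta=1.0):
    # policy_logps: log pi_theta(y|x) of sampled responses, summed over tokens (tensor [B])
    # prev_logps:   log pi_t(y|x) under the previous-iteration policy (tensor [B])
    # win_rates:    empirical estimates of P(y beats pi_t | x), averaged over the
    #               other responses sampled for the same prompt (tensor [B])
    # The log-ratio is regressed toward eta * (win_rate - 1/2): responses that beat
    # the current policy more than half the time are pushed up, the rest down.
    log_ratio = policy_logps - prev_logps
    target = eta * (win_rates - 0.5)
    return torch.mean((log_ratio - target) ** 2)
```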

3.11.3 DNO

Previous Nash learning algorithms primarily aimed to approach the Nash equilibrium through a purely on-policy method, which could be unstable and might require two-timescale updates (such as the coupled policy updates in [34]). To address this issue, the authors proposed Direct Nash Optimization (DNO), which employed a batched on-policy algorithm with single-timescale updates, potentially enhancing sampling efficiency [36]. Batched on-policy learning referred to a hybrid approach combining on-policy and off-policy learning. Previous methods sought the Nash equilibrium via explicit policy-update rules, whereas the authors aimed to simplify the problem to a regression: at each iteration, an internal reward function was regressed toward the preference of a response over the current policy.
The original DNO algorithm was difficult to scale up, so the authors modified it for practical use at scale. First, a dataset for the current iteration was constructed from prompts together with gold responses, for example from human labelers. Then, on-policy sampling was conducted: multiple outputs were sampled per prompt using the current policy. Next, the responses were ranked: for each prompt, the corresponding responses were ranked by their pairwise win rates, obtained by sampling from the general preference function. Preference pairs were then filtered so that only large-margin pairs (based on the win-rate rank) among the responses from the previous step were retained, yielding the filtered dataset. Lastly, contrastive learning was performed on this dataset to obtain the next policy, as shown in Eq. 62.
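The batched on-policy round can be sketched as follows; the helper names, the gold-response source, and the margin rule are assumptions meant to mirror the steps above, not the authors' code.

```python
import itertools

def dno_prct_round(prompts, sample_fn, gold_fn, prefer_fn, k=5, margin=0.3):
    # sample_fn(x, k):     draw k responses from the current policy pi_t
    # gold_fn(x):          a gold response (e.g., from human labelers or a stronger model)
    # prefer_fn(x, a, b):  probability that response a beats response b
    # Returns large-margin (prompt, preferred, dispreferred) pairs for the
    # contrastive update referenced as Eq. 62.
    pairs = []
    for x in prompts:
        cands = sample_fn(x, k) + [gold_fn(x)]
        win = {c: sum(prefer_fn(x, c, o) for o in cands if o is not c) / (len(cands) - 1)
               for c in cands}
        for a, b in itertools.permutations(cands, 2):
            if win[a] - win[b] >= margin:      # keep only large-margin pairs
                pairs.append((x, a, b))
    return pairs
```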
The developed algorithm closely resembled the iterative/online DPO approach. Consequently, the authors asserted that "a meticulously designed iterative/online DPO algorithm could approach the Nash equilibrium of any given general preferences".
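In that spirit, the contrastive step can be sketched as a DPO-style objective computed against the previous-iteration policy (assumed notation, not a verbatim copy of Eq. 62):

$$\pi_{t+1} = \arg\max_{\pi}\; \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_t}\Big[\log \sigma\Big(\beta \log \frac{\pi(y_w \mid x)}{\pi_t(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_t(y_l \mid x)}\Big)\Big].$$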
The authors employed Ultrafeedback [62], comprising 60k prompts to fine-tune the LLM. The model trained using DNO, specifically Orca 2.5 (7B) [94], achieved a 33% score on AlpacaEval 2.0 [40], marking a 26% improvement. Additionally, it demonstrated the capability to perform on par with Mistral Large [76], Self-Rewarding LM (70B parameters) [19], and earlier versions of GPT-4.

3.12 Beyond Reverse KL Divergence

Previous studies employed the KL divergence to minimize the discrepancy between the policy and the pretrained model. However, it was noted that during the alignment process, the reward of the LLM increased while the diversity of its responses diminished [95]. The authors attributed this reduction in diversity to the reverse KL divergence term used in alignment and proposed the use of alternative divergence terms, demonstrating their effects [37]. The general f-divergence was presented in Eq. 63.
In this context, f represented various divergence-generating functions. In the traditional RL framework, the reverse KL divergence was typically employed. The authors also tested the α-divergence, the forward KL divergence, and the Jensen-Shannon (JS) divergence. These divergences were considered within the framework of the constrained objective function, as illustrated in Eq. 64.
Using the Lagrange method, the authors could fold the constraint into the objective function, and, using the Karush-Kuhn-Tucker (KKT) conditions, the inequality constraint could be transformed into an equality. The derived transformed RL objective was shown in Eq. 65.
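Schematically, the constrained problem and its relaxed objective take the following form (assumed notation; a sketch of the step from Eq. 64 to Eq. 65):

$$\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \quad \text{s.t.}\quad D_{f}\big(\pi \,\|\, \pi_{\text{ref}}\big) \le \epsilon \;\;\Longrightarrow\;\; \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] - \beta\, D_{f}\big(\pi \,\|\, \pi_{\text{ref}}\big).$$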
Based on the new objective function, the optimal policy could be expressed in Eq. 66
Under further restrictions, including that the derivative of f was invertible, the reward function for a specific divergence could be reformulated as in Eq. 67.
Integrating this reward model into the BT model enabled the derivation of the probability of desired responses over undesired ones, which could subsequently be incorporated into the loss function.
The authors conducted experiments on the IMDB-sentiment [54], Anthropic HH [3], and MT-Bench [58] datasets using GPT-2 [67] as the base model. They observed a trade-off between reward and diversity. Specifically, RKL and JSD demonstrated high rewards, whereas FKL and the α-divergence exhibited better entropy with lower rewards. Notably, JSD achieved rewards comparable to RKL but with higher diversity. This suggested that further investigation into JSD for alignment purposes could be beneficial in future research.
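To make the divergence choices concrete, the toy sketch below evaluates the four options on a pair of small discrete distributions; the names and the α-divergence convention are assumptions, and the paper applies these constraints at the sequence level rather than to toy vectors.

```python
import numpy as np

def divergences(p, q, alpha=0.5, eps=1e-12):
    # p plays the role of the policy distribution, q the reference distribution.
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    rkl = np.sum(p * np.log(p / q))                     # reverse KL(p || q)
    fkl = np.sum(q * np.log(q / p))                     # forward KL(q || p)
    m = 0.5 * (p + q)
    jsd = 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))
    # alpha-divergence under one common convention; definitions vary by paper
    a_div = (1.0 - np.sum(p**alpha * q**(1.0 - alpha))) / (alpha * (1.0 - alpha))
    return {"reverse_kl": rkl, "forward_kl": fkl, "jsd": jsd, "alpha_div": a_div}

print(divergences([0.7, 0.2, 0.1], [0.4, 0.4, 0.2]))
```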

3.13 Comparison of Different Methods

Several studies have concentrated on comparing these various methods. This synthesis could elucidate the respective advantages and disadvantages of each approach.

3.13.1 Evaluate DPO and its variants

The authors conducted a comprehensive evaluation of implicit reward model (i.e., RL-free) algorithms, including DPO, KTO, IPO, and CPO, across a range of tasks such as reasoning, mathematical problem-solving, truthfulness, question answering, and multi-task understanding. These evaluations were performed in three distinct scenarios: 1) fine-tuning a supervised fine-tuning (SFT) model, 2) fine-tuning a pre-trained model, and 3) fine-tuning an instruction model [96]. The findings indicated that KTO consistently outperformed other alignment methods on most benchmarks. Furthermore, the study revealed that alignment did not significantly enhance performance on reasoning and question-answering tasks but did lead to substantial improvements in mathematical problem-solving. The authors also noted that the volume of data played a crucial role, with alignment methods showing optimal performance on smaller data subsets. Additionally, it was discovered that KTO and CPO could effectively bypass the SFT stage and proceed directly to the alignment stage without compromising performance. In contrast, DPO and IPO exhibited significant performance degradation when the alignment stage was applied directly, bypassing the SFT stage.

3.13.2 Is DPO Superior to PPO for LLM Alignment?

The authors demonstrated that DPO might have inherent limitations, potentially generating biased solutions and experiencing performance degradation due to distribution shifts [97]. They found that DPO tended to train policies that favored unseen responses, specifically out-of-distribution samples. Instead, iterative/online DPO, by extensively exploring the response space and continuously updating the reference model, could alleviate this issue. In contrast, RLHF/PPO addressed these challenges through advantage normalization, large batch sizes, and the use of an exponential moving average for the reference model. Ultimately, the findings suggested that PPO outperformed iterative/online DPO, which in turn was superior to standard DPO.

4 Future Directions

Based on the analysis of the reviewed papers, several research problems have been identified for further exploration.

4.1 General Tasks for Alignment Evaluation

Across the reviewed papers, different tasks were used to evaluate the performance of these methods. However, some tasks, such as GSM8K [65], focus more on reasoning and might not be suitable for assessing alignment performance. In contrast, tasks like TruthfulQA [45], or those addressing toxicity, should be prioritized for evaluating the toxicity of fine-tuned LLMs. There should be an effort to combine these tasks and create a unified leaderboard for alignment evaluation.

4.2 Apply Implicit Reward Models, Listwise Preference and Nash Learning to Larger Scale LMs

Currently, implicit reward model methods have been applied only to models with up to 70B parameters. Extending these methods to even larger models, such as those the size of GPT-4 and Claude-3, can provide insights into their effectiveness compared to RLHF/PPO. Similarly, the listwise preference model warrants further investigation. In RLHF, preference datasets were collected using listwise preference but were subsequently transformed into multiple pairs of pairwise preferences. The potential issues associated with applying listwise preference models at larger scales remain to be addressed. Lastly, Nash learning can address the inconsistency among human labelers. Incorporating a Nash learning model into larger-scale LLMs can demonstrate its ability to capture the complexity of human nature.

4.3 Experiments on Binary Feedback

Both KTO and DRO utilized binary feedback mechanisms, such as "thumbs up" and "thumbs down", instead of pairwise preferences. These binary feedbacks were derived from preference datasets, where desired responses were marked as positive and undesired responses as negative. Further research is needed on realistic binary datasets. Additionally, binary datasets are easier to collect compared to pairwise preference data, making it feasible to use larger-scale binary feedback datasets for alignment. However, the noise in binary feedback may be more pronounced than in preference datasets, raising the intriguing question of how to effectively filter out noisy data.

4.4 Experiments on Helpful AI Feedback

Current AI feedback primarily includes harmless feedback in RLAIF and feedback ranking in iterative DPO. However, in RLAIF, helpful feedback is still provided by human labelers. This approach is reasonable, as generating helpful responses is significantly more challenging than identifying harmful ones. An intriguing future direction involves using LLMs to generate helpful feedback, thereby enabling LLMs to self-improve.

4.5 Speeding up Nash Learning

The proposed Nash learning method effectively modeled pairwise preferences and addressed inconsistencies arising from human labeling. However, it necessitated multiple iterations to converge to the optimal policy. Although the authors did not specify the time required for alignment, it was presumed to be significantly slower compared to implicit reward models such as DPO. This area warrants further research attention to speed up the Nash learning process.

4.6 Termination of Iterative/Online Learning

When applying iterative or online training, determining when to terminate the iteration is crucial. Previous research has noted that iterative learning can sometimes degrade the performance of LLMs on specific tasks, which can be a sign of overfitting. However, identifying a reasonable epoch for stopping the iteration remains an unexplored area.

4.7 Simplify SFT + Alignment

Current methodologies typically implemented SFT and alignment in a consecutive manner. However, this approach often resulted in catastrophic forgetting and rendered the training process laborious. The PAFT method mitigated catastrophic forgetting by fine-tuning SFT and alignment separately before merging them, albeit at the cost of increased complexity. Conversely, the ORPO technique integrated both processes simultaneously, but this led to a decline in performance. Thus, the challenge of effectively combining SFT and alignment to achieve high performance while maintaining efficiency remains unresolved.

List of Symbols

the loss function for optimization
number of candidate responses
combination to select all pairs from a list
prompt to the LLM
the distribution of prompts for the LLM
the desired response in the paired responses
the undesired response in the paired responses
the reference response provided in the golden dataset
the policy in RL, i.e., the LLM to be aligned
the total number of tokens in the responses
the reward model given a prompt and a response; separate notations denote explicit and implicit reward functions
the pre-collected dataset
supervised fine-tuning
KL divergence
the divergence used to replace the KL divergence
the parameter limiting the distance between the trained and the initial policies
weight hyper-parameters
the probability distributions
the weights for desired and undesired responses
the numbers of desired and undesired responses
the probability of generating a response over not generating it
the odds of generating a response over not generating it
the gain function
the rank position function based on the LLM score function
the scores based on human labelling
the rank discount function
the length of a response
the value function in RL
the Q (action-value) function in RL
the advantage function in RL
the state function in RL; also used for the score values in ranking
the action function based on the policy in RL, which is the LLM in our case
the entropy function in entropy-based RL

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
[2] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.
[3] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022.
[4] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. 
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024.
[5] AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 1, 2024.
[6] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[7] Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. Rlhf workflow: From reward modeling to online rlhf, 2024.
[8] Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint, 2024.
[9] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller,
Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional ai: Harmlessness from ai feedback, 2022.
[10] Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. Rlaif: Scaling reinforcement learning from human feedback with ai feedback, 2023.
[11] Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. Slic-hf: Sequence likelihood calibration with human feedback, 2023.
[12] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model, 2023.
[13] Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. Smaug: Fixing failure modes of preference optimisation with dpo-positive, 2024.
[14] Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. β-dpo: Direct preference optimization with dynamic β.
[15] Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences, 2023.
[16] Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, and Chanjun Park. sdpo: Don't use your data all at once, 2024.
[17] Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From r to Q*: Your language model is secretly a q-function, 2024.
[18] Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token-level direct preference optimization, 2024.
[19] Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models, 2024.
[20] Jing Xu, Andrew Lee, Sainbayar Sukhbaatar, and Jason Weston. Some things are more cringe than others: Iterative preference optimization with the pairwise cringe loss, 2024.
[21] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization, 2024.
[22] Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, and Bilal Piot. Offline regularised reinforcement learning for large language models alignment, 2024.
[23] Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Monolithic preference optimization without reference model, 2024.
[24] Shiva Kumar Pentyala, Zhichao Wang, Bin Bi, Kiran Ramnath, Xiang-Bo Mao, Regunathan Radhakrishnan, Sitaram Asur, Na, and Cheng. Paft: A parallel training paradigm for effective llm fine-tuning, 2024.
[25] Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization, 2024.
[26] Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward, 2024.
[27] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms, 2024.
[28] Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, and Xuanhui Wang. Lipo: Listwise preference optimization through learning-to-rank, 2024.
[29] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback without tears, 2023.
[30] Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. Preference ranking optimization for human alignment, 2024.
[31] Shitong Duan, Xiaoyuan Yi, Peng Zhang, Tun Lu, Xing Xie, and Ning Gu. Negating negatives: Alignment without human positive samples via distributional dispreference optimization, 2024.
[32] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning, 2024.
[33] Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation, 2024.
[34] Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, and Bilal Piot. Nash learning from human feedback, 2024.
[35] Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, and Alekh Agarwal. A minimaximalist approach to reinforcement learning from human feedback, 2024.
[36] Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, and Tengyang Xie. Direct nash optimization: Teaching language models to self-improve with general preferences, 2024.
[37] Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, and Yuxin Chen. Beyond reverse kl: Generalizing direct preference optimization with diverse divergence constraints, 2023.
[38] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39:324, 1952.
[39] Richard Bellman. A markovian decision process. Journal of Mathematics and Mechanics, 6(5):679-684, 1957.
[40] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github. com/tatsu-lab/alpaca_eval, 2023.
[41] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311-318, 2002.
[42] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, 2004.
[43] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert, 2020.
[44] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[45] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022.
[46] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023.
[47] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2022.
[48] Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John
Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. Palm 2 technical report, 2023.
[49] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[50] Colin Raffel, Noam M. Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1-140:67, 2019.
[51] Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, and Jialu Liu. Statistical rejection sampling improves preference optimization, 2024.
[52] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.
[53] Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study. arXiv preprint arXiv:2404.10719, 2024.
[54] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Dekang Lin, Yuji Matsumoto, and Rada Mihalcea, editors, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142-150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[55] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018
[56] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[57] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models, 2024.
[58] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
[59] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Affah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397 -2430. PMLR, 2023.
[60] Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, Changbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim, Mikyoung Cha, Hwalsuk Lee, and Sunghun Kim. Solar 10.7b: Scaling large language models with simple yet effective depth up-scaling, 2024.
[61] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4, 2023.
[62] Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback, 2023.
[63] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[64] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019.
[65] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[66] Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, and Bilal Piot. Generalized preference optimization: A unified approach to offline alignment, 2024.
[67] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.
[68] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.
[69] Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, and Jason Weston. The cringe loss: Learning what language not to model, 2022.
[70] Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback, 2024.
[71] Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5:297-323, 1992.
[72] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
[73] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
[74] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, , and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
[75] Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, et al. Phi-2: The surprising power of small language models. Microsoft Research Blog, 2023.
[76] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023.
[77] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023.
[78] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
[79] Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline, 2024.
[80] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229-256, 1992.
[81] Wouter Kool, Herke van Hoof, and Max Welling. Buy 4 REINFORCE samples, get a baseline for free!, 2019.
[82] Tie-Yan Liu et al. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225-331, 2009.
[83] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.
[84] Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi-Rui Tam, Keith Stevens, Abdullah Barhoum, Nguyen Minh Duc, Oliver Stanley, Richárd Nagyfi, Shahul ES, Sameer Suri, David Glushkov, Arnav Dantuluri, Andrew Maguire, Christoph Schuhmann, Huu Nguyen, and Alexander Mattick. Openassistant conversations - democratizing large language model alignment, 2023.
[85] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
[86] Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. Secrets of rlhf in large language models part ii: Reward modeling, 2024.
[87] Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024.
[88] Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, and J. Zico Kolter. Tofu: A task of fictitious unlearning for llms, 2024.
[89] Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. A paradigm shift in machine translation: Boosting translation performance of large language models, 2024.
[90] Quentin Bertrand, Wojciech Marian Czarnecki, and Gauthier Gidel. On the limitations of the elo, real-world games are transitive, not additive. In Francisco Ruiz, Jennifer Dy, and Jan-Willem van de Meent, editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 2905-2921. PMLR, 25-27 Apr 2023.
[91] Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment, 2024.
[92] Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79-103, 1999.
[93] Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion, 2023.
[94] Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, Hamid Palangi, Guoqing Zheng, Corby Rosset, Hamed Khanpour, and Ahmed Awadallah. Orca 2: Teaching small language models how to reason, 2023.
[95] Gian Wiher, Clara Meister, and Ryan Cotterell. On decoding strategies for neural text generators, 2022.
[96] Amir Saeidi, Shivanshu Verma, and Chitta Baral. Insights into alignment: Evaluating dpo and its variants across multiple tasks, 2024.
[97] Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is dpo superior to ppo for llm alignment? a comprehensive study, 2024.