
(Perhaps) Beyond Human Translation: Harnessing Multi-Agent Collaboration for Translating Ultra-Long Literary Texts

Minghao Wu, Yulin Yuan, Gholamreza Haffari, Longyue Wang
Monash University, University of Macau, Tencent AI Lab

Abstract

Recent advancements in machine translation (MT) have significantly enhanced translation quality across various domains. However, the translation of literary texts remains a formidable challenge due to their complex language, figurative expressions, and cultural nuances. In this work, we introduce a novel multi-agent framework based on large language models (LLMs) for literary translation, implemented as a virtual company called TRANSAGENTS, which mirrors the traditional translation publication process by leveraging the collective capabilities of multiple agents to address the intricate demands of translating literary works. To evaluate the effectiveness of our system, we propose two innovative evaluation strategies: Monolingual Human Preference (MHP) and Bilingual LLM Preference (BLP). MHP assesses translations from the perspective of monolingual readers of the target language, while BLP uses advanced LLMs to compare translations directly with the original texts. Empirical findings indicate that despite lower d-BLEU scores, translations from TRANSAGENTS are preferred by both human evaluators and LLMs over human-written references, particularly in genres requiring domain-specific knowledge. We also highlight the strengths and limitations of TRANSAGENTS through case studies and suggest directions for future research.

Figure 1: An illustration of our method. Traditional machine translation (MT) systems often underperform compared to human translators. In this study, we demonstrate that the translations produced by our TRANSAGENTS are preferred by humans over those from conventional MT systems.

1 INTRODUCTION

Machine translation (MT) has achieved remarkable advancements in recent years, driven by breakthroughs in deep learning and neural networks (Cho et al., 2014; Sutskever et al., 2014; Vaswani et al., 2017; Gu et al., 2019b; Liu et al., 2020; Fan et al., 2021). Despite these technological strides, literary translation remains an unresolved challenge for MT systems. Literary texts, characterized by their complex language, figurative expressions, cultural nuances, and unique stylistic elements, pose significant hurdles that are hard for machines to overcome (Voigt & Jurafsky, 2012). This complexity makes literary translation one of the most challenging areas within machine translation, often referred to as "the last frontier of machine translation" (Klemin, 2024).
In response to complex challenges across various domains, recent research in multi-agent systems, particularly those powered by large language models (LLMs), has shown significant promise (Yao et al., 2023; Wang et al., 2023e; Dong et al., 2023). These systems leverage the collective intelligence of multiple agents, enabling superior problem-solving capabilities compared to individual model approaches. Multi-agent systems excel in dynamic environments where intricate problem-solving and collaborative efforts are required.
Given the nature of literary translation, we harness the superior capabilities of multi-agent systems and establish a novel multi-agent translation company for literary translation, called TRANSAGENTS. At TRANSAGENTS, the translation process is organized into two main stages, each consisting of several sub-stages. The process begins with our pre-defined CEO agent selecting a Senior Editor based on the specific requirements of each client. The selected Senior Editor then assembles a team from our roster, which includes roles such as Junior Editor, Translator, Localization Specialist, and Proofreader. The team members collaborate through multiple sub-stages, employing strategies like Addition-by-Subtraction Collaboration and Trilateral Collaboration to refine and enhance the translation output.
Furthermore, evaluating the accuracy and quality of literary translations presents a particularly challenging task due to the subjective nature of literature and the potential imperfections in reference translations (Thai et al., 2022; Freitag et al., 2023). To effectively address these challenges, we propose two innovative evaluation strategies: Monolingual Human Preference (MHP) and Bilingual LLM Preference (BLP). Both strategies involve comparing a pair of translations from two different translation systems to determine which one is superior. The Monolingual Human Preference strategy simulates the realistic scenario of reading a translated work. It engages human evaluators from the target audience who assess translations without the influence of the original text. This approach focuses on how well the translation resonates with the readers in terms of fluidity, readability, and cultural appropriateness, mirroring the real-world consumption of literature. Conversely, the Bilingual LLM Preference leverages the capabilities of advanced LLMs, specifically GPT-4-0125-PREVIEW. In this strategy, the LLMs are provided with the original texts to facilitate a direct comparison. This method aims to harness the superior translation capabilities of advanced LLMs, mitigating the impact of imperfect reference translations.
Our empirical findings reveal that TRANSAGENTS consistently delivers the poorest performance in terms of d-BLEU scores. However, it is preferred over both human-written references and GPT-4 translations by human evaluators and an LLM evaluator. In-depth analysis shows that TRANSAGENTS excels over human-written references in genres that demand domain-specific knowledge, such as historical contexts and cultural nuances, but it falls short in contemporary genres. Additionally, we observe that TRANSAGENTS is capable of generating translations with more diverse and vivid descriptions. Our cost analysis indicates that using TRANSAGENTS for literary text translation can reduce costs by roughly 80-fold compared to employing professional human translators. Nonetheless, we also identify significant limitations in LLM-based translation systems, including both GPT-4 and TRANSAGENTS, particularly with issues related to significant content omission.
In this work, our contributions can be summarized as follows:
  • We introduce TRANSAGENTS, a novel multi-agent system for literary translation that mirrors the traditional translation publication process. By employing a multi-agent approach, the system addresses the complex nuances of literary works.
  • We propose two novel evaluation strategies, Monolingual Human Preference (MHP) and Bilingual LLM Preference (BLP), to assess the quality of translations. MHP focuses on the translation's impact on target-audience readers, emphasizing fluidity and cultural appropriateness, while BLP uses advanced LLMs to compare translations directly with the original texts.
  • Despite lower d-BLEU scores, our empirical findings highlight that translations from TRANSAGENTS are preferred by both human evaluators and language models over human-written references. We also present in-depth analyses of the strengths and weaknesses of TRANSAGENTS.
2 RELATED WORK

Large Language Models Large language models (LLMs) have revolutionized the field of artificial intelligence (AI). These models are typically pretrained on a vast corpus of text data, learning to predict the next word in a sentence (Brown et al., 2020; Chowdhery et al., 2022; Scao et al., 2022; Anil et al., 2023b; Touvron et al., 2023a;b; Bai et al., 2023a; Anil et al., 2023a). After pretraining, the models are fine-tuned with instructions. This process, known as supervised fine-tuning (SFT) or instruction tuning (IT), allows the model to adapt its general language understanding to follow and implement instructions from humans (Sanh et al., 2022; Wei et al., 2022; Chung et al., 2022; Wang et al., 2022; Tay et al., 2023; Longpre et al., 2023; Shen et al., 2023). Thanks to the superior capabilities of large language models, recent works demonstrate that synthetic datasets generated by these models can also be used in this step (Wang et al., 2023c; Wu et al., 2023b; Li et al., 2023a; Luo et al., 2023; Lyu et al., 2023; Yue et al., 2023; Wang et al., 2023d). Furthermore, reinforcement learning from human feedback (RLHF) is used to further improve the performance of these models. In this approach, the model is fine-tuned based on feedback from humans or other large language models, who rate the quality of the model's outputs (Ouyang et al., 2022; Rafailov et al., 2023; Hejna et al., 2023; Ethayarajh et al., 2024; Hong et al., 2024). Moreover, evaluating these large language models is a complex task, often involving both automated metrics and human judgment (Hendrycks et al., 2021; Liang et al., 2022; Wu & Aji, 2023; Jiang et al., 2023; Lyu et al., 2024). Additionally, these models pose challenges in terms of efficient training (Hu et al., 2022; Dettmers et al., 2023; Liu et al., 2024), fairness (Li et al., 2023c), hallucination (Zhang et al., 2023c), and other issues, which are also active areas of research. In this work, we leverage a state-of-the-art LLM as the backbone of our multi-agent system for translating literary texts.
Multi-Agent Systems Intelligent agents are designed to understand their environments, make informed decisions, and respond with appropriate actions (Wooldridge & Jennings, 1995). The capabilities of large language models (LLMs) align well with these expectations. The emergence of LLMs has significantly advanced research on multi-agent systems across various contexts. Multi-agent systems, compared to single-agent setups, are generally expected to either leverage collaboration among multiple agents to tackle complex problems or use diverse agents to effectively simulate complex real-world environments (Guo et al., 2024). Recent studies have shown promising outcomes in complex problem-solving areas such as software development (Qian et al., 2023; Hong et al., 2023), multi-robot collaboration (Mandi et al., 2023; Zhang et al., 2023a), evaluation (Chan et al., 2023), and fact-checking (Du et al., 2023a). Additionally, there is extensive research on using multiple agents to simulate real-world environments, including societal, economic, and gaming simulations (Park et al., 2022; 2023; Xu et al., 2023b; Li et al., 2023b; Mukobi et al., 2023). Liang et al. (2023) propose leveraging multi-agent debate for machine translation. However, their approach is limited to the sentence level. In this work, we focus on the first category, specifically on the translation of literary texts. Literary translation is considered one of the most complex and challenging translation tasks, and we aim to address this challenge using a multi-agent system powered by LLMs.
Machine Translation Machine translation (MT) has achieved significant advancements in recent years, with developments spanning general-purpose MT (Cho et al., 2014; Sutskever et al., 2014; Vaswani et al., 2017; Gehring et al., 2017; Shen et al., 2019), low-resource MT (Zoph et al., 2016; Gu et al., 2018; Haddow et al., 2022), multilingual MT (Liu et al., 2020; Fan et al., 2021; Wu et al., 2021; Li et al., 2022; Costa-jussà et al., 2022; Communication et al., 2023), and non-autoregressive MT (Gu et al., 2017; 2019a; Ghazvininejad et al., 2019), among others. However, these advancements are predominantly focused at the sentence level. Recently, efforts have been made to enhance translation quality by integrating contextual information into the translation process (Wang et al., 2017; Ding et al., 2020; Sun et al., 2022; Feng et al., 2022; Wu et al., 2023a; Herold & Ney, 2023; Wu et al., 2024b), aiming to achieve more accurate and coherent translations that extend beyond individual sentences. More recently, large language models (LLMs) have demonstrated superior capabilities in various applications, including MT (Lu et al., 2023; Zhang et al., 2023b; Xu et al., 2023a; Robinson et al., 2023; Wang et al., 2023a; Wu et al., 2024a). Given the remarkable progress in MT, its performance appears to be saturating in the general domain. There is growing interest in literary translation, which is considered one of the more challenging translation tasks because it requires not only accuracy in meaning but also the conveyance of vivid expressions and cultural nuances (Thai et al., 2022; Wang et al., 2023b). Additionally, evaluating MT accurately remains a critical aspect of research in this field. While traditional metrics like BLEU are commonly used (Papineni et al., 2002), newer approaches utilize pretrained language models to assess translation quality more effectively (Rei et al., 2020; Sellam et al., 2020; Juraska et al., 2023; Guerreiro et al., 2023). Kocmi & Federmann (2023) employ the state-of-the-art LLM, GPT-4, to estimate translation quality and achieve state-of-the-art quality estimation performance at WMT 2023 (Freitag et al., 2023). In this work, we establish a novel multi-agent virtual company, TRANSAGENTS, for translating literary texts. We also propose two evaluation strategies for assessing the quality of the translated literary texts.

3 TRANSAGENTS: A MULTI-AGENT VIRTUAL COMPANY FOR LITERARY TRANSLATION

Figure 2: TRANSAGENTS, a multi-agent virtual company for literary translation.
We establish a virtual multi-agent translation company, TRANSAGENTS, featuring a diverse range of employees, including a CEO, senior editors, junior editors, translators, localization specialists, and proofreaders. When a human client assigns a book translation task, a team of selected agents from TRANSAGENTS collaborates to translate the book. This paradigm simulates the entire book translation process, where agents with different roles work together to ensure that the translation maintains high quality and consistency throughout. In this section, we describe the company overview of TRANSAGENTS in Section 3.1, the core collaboration strategies of TRANSAGENTS in Section 3.2, and the translation workflow in Section 3.3.

3.1 COMPANY OVERVIEW

To simulate the entire book translation process, in addition to the designated CEO, we have a diverse array of roles, including senior editors, junior editors, translators, localization specialists, and proofreaders in our company TRANSAGENTS. Each of these roles carries its own set of responsibilities:
  • Senior Editors: Senior editors are responsible for overseeing the content production process. Their primary duties encompass setting editorial standards, guiding junior editors, and ensuring that the content aligns with the company's objectives.
  • Junior Editors: Junior editors work closely under the guidance of senior editors. Their responsibilities typically include managing the day-to-day editorial workflow, editing content, and assisting in content planning. They also handle communications with various other roles within the organization.
  • Translators: Translators are tasked with converting written material from one language to another while preserving the tone, style, and context of the original text. Translators must possess a profound understanding of both the source and target languages, as well as a familiarity with the subject matter they are translating.
  • Localization Specialists: Localization specialists go beyond simple translation; they adapt content for specific regions or markets. This role involves not only translating language but also adjusting cultural references, idioms, and images to resonate with local audiences.
  • Proofreaders: Proofreaders perform final checks for grammar, spelling, punctuation, and formatting errors. Their role is crucial in ensuring that content is polished and adheres to high-quality standards before publication.
To enhance the realism and efficacy of our simulation of the translation process, we strategically utilize GPT-4-TURBO to generate a diverse set of 30 virtual agent profiles for each distinct role. As illustrated in Figure 3, these profiles are comprehensively designed to include a wide array of attributes that extend well beyond language skills. Key characteristics such as gender, nationality, rate per word, educational background, years of experience, and areas of specialization are thoughtfully incorporated. This detailed and personalized approach not only enriches the authenticity of the translation process simulation but also mirrors the complexity and diversity found in real-world translation settings. The inclusion of such rich, detailed metadata about the agents not only enhances current simulation strategies but is also designed to support and inspire future research.
Name: Sofia Chang
Languages: English, Mandarin, Spanish, French
Nationality: Canadian
Gender: Female
Age: 47
Education: Ph.D. in Comparative Literature
Personality: meticulous, introverted, perfectionist, critical, thoughtful
Hobbies: gardening, chess, watercolor painting
Rate per word: 0.12
Years of working: 22
Profession: Senior Editor
Role prompt: You are Sofia Chang, a highly esteemed Senior Editor [TRUNCATED]

Figure 3: An example profile of a Senior Editor.
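To make the profile structure concrete, the sketch below shows one way such a profile could be represented in code and rendered into a role-assignment prompt. The AgentProfile class and the role_prompt rendering are our illustration rather than the paper's implementation; the attribute values are copied from Figure 3.

from dataclasses import dataclass

@dataclass
class AgentProfile:
    # Attributes mirroring the profile fields shown in Figure 3.
    name: str
    profession: str
    languages: list[str]
    nationality: str
    gender: str
    age: int
    education: str
    personality: list[str]
    hobbies: list[str]
    rate_per_word: float
    years_of_working: int

    def role_prompt(self) -> str:
        # Render the profile into a system prompt assigning the role;
        # the wording of the actual prompt is truncated in Figure 3.
        return (
            f"You are {self.name}, a highly esteemed {self.profession} with "
            f"{self.years_of_working} years of experience. You speak "
            f"{', '.join(self.languages)} and are known for being "
            f"{', '.join(self.personality)}."
        )

sofia = AgentProfile(
    name="Sofia Chang", profession="Senior Editor",
    languages=["English", "Mandarin", "Spanish", "French"],
    nationality="Canadian", gender="Female", age=47,
    education="Ph.D. in Comparative Literature",
    personality=["meticulous", "introverted", "perfectionist", "critical", "thoughtful"],
    hobbies=["gardening", "chess", "watercolor painting"],
    rate_per_word=0.12, years_of_working=22,
)
print(sofia.role_prompt())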

3.2 AGENT COLLABORATION STRATEGIES

In this section, we introduce the two collaboration strategies used in this work: Addition-by-Subtraction Collaboration (Algorithm 1) and Trilateral Collaboration (Algorithm 2).
Addition-by-Subtraction Collaboration In our framework, we propose the Addition-by-Subtraction Collaboration between two agents. Unlike debate-style strategies (Liang et al., 2023; Du et al., 2023a; Chan et al., 2023), where multiple agents propose their own answers and a third-party agent concludes the discussion, our strategy involves only two agents. One acts as an Addition agent, responsible for extracting as much relevant information as possible, while the other serves as a Subtraction agent, tasked with reviewing the extracted information, eliminating redundant details, and providing feedback to the Addition agent. We present the details of this collaboration strategy in Algorithm 1. The Addition agent A first generates the initial response, aiming to include as much informative content as possible. Subsequently, the Subtraction agent S reviews the response and removes any redundant information. The conversation iterates until no further revisions are needed.
Algorithm 1: Addition-by-Subtraction Collaboration
Input: Context C; Instruction I; Maximum number of iterations M; Addition agent A; Subtraction agent S
Output: The final response R that both agents agree upon
H ← [C; I]    ▷ Initialize the conversation history
R ← ∅         ▷ Initialize the response
m ← 0         ▷ Current round
while m ≤ M do
    m ← m + 1
    R′ ← A(H)          ▷ Generate detailed response
    F ← S(H, R′)       ▷ Review and remove redundant information
    H ← H + [R′; F]    ▷ Append R′ and F to the conversation history H
    if R = R′ then
        Break          ▷ Stop iterating as no further revisions are needed
    R ← R′
Return the final response R
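The loop above translates directly into code. In the minimal sketch below, addition_agent and subtraction_agent are assumed to be callables wrapping LLM chat requests (a hypothetical interface); the termination check mirrors the R = R′ test in Algorithm 1.

def addition_by_subtraction(context, instruction, addition_agent,
                            subtraction_agent, max_iterations=5):
    history = [context, instruction]  # H <- [C; I]
    response = None                   # R <- empty
    for _ in range(max_iterations):
        new_response = addition_agent(history)               # R' <- A(H)
        feedback = subtraction_agent(history, new_response)  # F <- S(H, R')
        history += [new_response, feedback]                  # H <- H + [R'; F]
        if response == new_response:
            break  # no further revisions are needed
        response = new_response
    return response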
Algorithm 2: Trilateral Collaboration
Input: Context C; Instruction I; Maximum number of iterations M; Action agent P; Critique agent Q; Judgment agent J
Output: The final response R that is approved by the Judgment agent J
H ← [C; I]    ▷ Initialize the conversation history
m ← 0         ▷ Current round
while m ≤ M do
    m ← m + 1
    R ← P(H)           ▷ Generate response
    F ← Q(H, R)        ▷ Generate critiques
    H ← H + [R; F]     ▷ Append R and F to the conversation history H
    if m > 1 then
        D ← J(C, I, R)     ▷ The Judgment agent J evaluates the response quality
        if D = TRUE then
            Break          ▷ Stop iterating if the Judgment agent J deems the response high quality
Return the final response R
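As with Algorithm 1, the control flow can be sketched in a few lines of Python. Here action_agent and critique_agent consume the conversation history, while judgment_agent sees only the context, instruction, and current response, as discussed later in Section 3.3.2; all three are hypothetical LLM-backed callables.

def trilateral_collaboration(context, instruction, action_agent,
                             critique_agent, judgment_agent, max_iterations=5):
    history = [context, instruction]  # H <- [C; I]
    response = None
    for round_idx in range(1, max_iterations + 1):
        response = action_agent(history)              # R <- P(H)
        feedback = critique_agent(history, response)  # F <- Q(H, R)
        history += [response, feedback]               # H <- H + [R; F]
        # From the second round on, the Judgment agent decides whether the
        # response is good enough to stop, without the full history.
        if round_idx > 1 and judgment_agent(context, instruction, response):
            break
    return response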
Trilateral Collaboration We divide the collaboration into three branches in TRANSAGENTS, referred to as Trilateral Collaboration:
  • Action: The power to follow the instruction and implement the required actions.
  • Critique: The power to review the generated response and provide constructive feedback to the Action branch.
  • Judgment: The power to make the final decision on whether the response is satisfactory or requires further revision.
We assign one agent to each branch and present the details of the collaboration among these agents in Algorithm 2. The Action agent P generates a response R given the context C and instruction I. The Critique agent Q then writes critiques F against the response R. The Action agent P has the option to either accept the critiques and update the response or maintain the original response. At the end of each iteration, the Judgment agent J evaluates the response R to determine whether the discussion can be concluded or further deliberation is required.

3.3 TRANSLATION WORKFLOW

In this section, we introduce the book translation workflow at our company TRANSAGENTS, comprising two main stages: preparation (Section 3.3.1) and execution (Section 3.3.2).

3.3.1 PREPARATION

Project Members Selection System prompts or messages are used to assign roles to individual agents during the role-playing process. In our company's setup, we create 30 agent profiles, each accompanied by a unique role assignment prompt, as illustrated in Figure 3. These prompts are essential for assigning specific roles to the agents before the dialogues begin. Within our framework, the initial step involves the CEO selecting a Senior Editor for the book translation project. This selection process takes into account both the client's requirements and the qualifications of potential Senior Editors. Once the Senior Editor is chosen, they work closely with the CEO to assemble the rest of the project team, carefully considering the skill sets and backgrounds of the candidates. Furthermore, we introduce a self-reflection strategy (Yao et al., 2023; Shinn et al., 2023; Qian et al., 2023). This strategy incorporates a "ghost agent" whose task is to prompt the CEO to reconsider their decision, as we observe that the CEO sometimes struggles to select a Senior Editor with the desired language skills.
Translation Guideline Documentation To maintain consistency throughout the entire translation workflow, which involves multiple agents, we need to have a translation guideline. In TRANSAGENTS, there are five components: the glossary, the book summary, the tone, the style, and the target audience. We have designed different strategies to process them:
  • Glossary: The primary purpose of a glossary in book translation is to compile essential terms from the source language and provide their corresponding translations in the target language. This ensures consistency and accuracy in the usage of these terms throughout the book, especially since some terms may have multiple acceptable translations. In our process, we leverage the Addition-by-Subtraction Collaboration, as described in Algorithm 1, to collect the key terms. For each chapter, the Junior Editor, serving as the Addition agent A, makes an exhaustive attempt to identify all potential key terms initially. Subsequently, the Senior Editor, serving as the Subtraction agent S, reviews the identified key terms and removes any that are generic. The conversation continues until the list of collected key terms needs no further revision. Next, the collected key terms are translated by the Senior Editor, with consideration of their context.
  • Book Summary: Generating a book summary is crucial to provide a comprehensive overview of the narrative. This task is facilitated by the collaboration between the Junior Editor (Addition agent A) and the Senior Editor (Subtraction agent S), employing the Addition-by-Subtraction Collaboration as depicted in Algorithm 1. In this process, the Junior Editor aims to retain as much detail as possible in the chapter summaries, while the Senior Editor focuses on removing superfluous information. Following the compilation of chapter summaries, the Senior Editor then crafts the book summary, mirroring the process of gathering a glossary.
  • Tone, Style, and Target Audience: The translation of a book is more than just a word-for-word conversion; it is a delicate process of adapting tone, style, and content to resonate with the target audience while staying true to the original text's essence. In TRANSAGENTS, the Senior Editor defines the tone, the style, and the target audience of the translated book based on a randomly selected chapter.
Overall, the glossary, book summary, tone, style, and target audience collectively constitute the comprehensive translation guidelines. These guidelines serve as an essential part of the prompts for all roles involved in the book translation process, ensuring consistency and coherence throughout the entire work.
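As a concrete illustration, the five guideline components might be serialized into a shared prompt prefix along the following lines; the function and field names are ours, not the paper's.

def build_translation_guideline(glossary, book_summary, tone, style, audience):
    # Serialize the five guideline components into a prompt prefix that
    # every role (translator, localization specialist, proofreader, ...)
    # receives alongside its task-specific instructions.
    glossary_lines = "\n".join(f"- {src} -> {tgt}" for src, tgt in glossary.items())
    return (
        "Translation guidelines:\n"
        f"Glossary:\n{glossary_lines}\n"
        f"Book summary: {book_summary}\n"
        f"Tone: {tone}\n"
        f"Style: {style}\n"
        f"Target audience: {audience}\n"
    )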

3.3.2 EXECUTION

In the execution phase, the process is divided into four distinct sub-stages: translation, cultural adaptation, proofreading, and final review. During the first three sub-stages, our approach utilizes the collaborative strategy illustrated in Algorithm 2. Within this framework, the role of the Action agent P is assigned to the Translator, the Localization Specialist, and the Proofreader, in that order. Meanwhile, the responsibilities of the Critique agent Q and the Judgment agent J are fulfilled by the Junior Editor and the Senior Editor, respectively. Finally, the Senior Editor performs the final checks before publication.
Translation, Localization, and Proofreading The translation stage involves three key roles: the Translator, the Junior Editor, and the Senior Editor. These roles collaborate to translate the book from the source language to the target language on a chapter-by-chapter basis. The translation process begins with the Translator (the Action agent P) translating the chapter content from the source language to the target language. Next, the Junior Editor (the Critique agent Q) undertakes a thorough review of the translation, ensuring it adheres to the guidelines while also identifying any potential errors or areas for improvement. Lastly, the Senior Editor (the Judgment agent J) evaluates the translation and determines whether further revision is needed. Following the translation, the cultural adaptation process begins. The Localization Specialist tailors the translated content to fit the cultural context of the target audience, ensuring that it resonates well and maintains the intended meaning. Next, the Proofreader performs checks for language errors. Throughout the cultural adaptation and proofreading stages, both the Junior Editor and the Senior Editor continue to offer critiques and evaluations to refine the content further.
Final Review The final review is the concluding step in the editorial process. At this point, the Senior Editor evaluates the translation quality of each chapter and also examines how pairs of adjacent chapters flow into each other. The Senior Editor not only verifies that each chapter is internally coherent and meets quality standards on its own but also ensures that the transitions between chapters are smooth, thereby maintaining narrative consistency.
On the Importance of the Judgment Agent We introduce the Judgment agent in Algorithm 2, which is responsible for evaluating the quality of the response and determining whether further revision is needed, without requiring the conversation history. Owing to the nature of web novels, each turn of dialogue is likely to contain a few thousand words. Although recent advances in large language models (LLMs) suggest that they are capable of processing extremely long sequences of up to millions of tokens, we still observe that our agents are not able to effectively leverage the information in the context as the conversation expands. Additionally, we observe that the meaning of translations tends to deviate from the original text after several iterations of revision. Therefore, it is critical to have the Judgment agent within the Trilateral Collaboration to ensure the overall quality of the response.

4 EXPERIMENTAL SETUP

In this work, our experimental setup primarily follows the WMT2023 shared task on discourse-level literary translation (DLLT) (Wang et al., 2023b). The following sections introduce the baselines (Section 4.1), datasets (Section 4.2), and evaluation approaches (Section 4.3) used in our study.

4.1 BASELINES

We leverage the state-of-the-art LLM GPT-4-TURBO as the backbone of our agents and compare our approach with the unconstrained systems in the WMT2023 shared task on DLLT:

- LLAMA-MT: Du et al. (2023b) fine-tune LLAMA-7B for literary translation. The fine-tuned LLAMA-MT model translates 2,048 consecutive tokens at a time.

- GPT-4: While recent versions of GPT-4 models claim to support a context size of up to 128K tokens, they are restricted to generating a maximum of 4,096 tokens per response (OpenAI, 2023). Therefore, we employ the GPT-4-0613 and GPT-4-1106-PREVIEW models to translate the documents on a chapter-by-chapter basis.

- Google: We employ the Google Translate system to translate the documents on a sentence-by-sentence basis.

- DUT: Zhao et al. (2023) explore several techniques to enhance the performance of large language models (LLMs) in discourse-level translation tasks.

- HW-TSC: Xie et al. (2023) initially train a sentence-level Transformer to establish a baseline, subsequently enhancing its discourse-level capabilities through domain adaptation and discourse modeling, employing a variety of techniques.

4.2 DATASETS

In this work, we do not need to train new models; all agents are instances of GPT-4-TURBO with various roles. Hence, we only leverage the official test set of the WMT2023 shared task on DLLT. The official test set is collected from 12 web novels, each consisting of 20 consecutive chapters, totaling 240 chapters. The test set contains two references: REFERENCE 1 is translated by human translators, and REFERENCE 2 is built by manually aligning bilingual text from web pages.

4.3 EVALUATION

Translating literary works differs significantly from translating standard machine translation (MT) corpora, such as news articles or parliamentary proceedings. Thai et al. (2022) present a comprehensive list of techniques employed by literary translators, which largely differ from those used in common MT domains. Furthermore, literary translators have the freedom and the burden of both semantic and critical interpretation, resulting in the absence of a single, unique best translation for literary texts. In this work, we employ two evaluation approaches:
  • Standard Evaluation: Following Wang et al. (2023b), we use d-BLEU (Papineni et al., 2002; Post, 2018; Liu et al., 2020) to evaluate translation quality, as the translations may not strictly align with the source text on a sentence-by-sentence basis. To compute the d-BLEU score, we concatenate all the chapter translations into a single document for evaluation (a minimal sketch of this computation follows this list). We present the results in Section 5.
  • Preference Evaluation: Acknowledging the concern that there is no single, universally preferred translation for literary texts, we ask human raters or LLMs to select their preferred translation without giving them a reference translation. Further details regarding this novel evaluation approach are discussed in Section 6.
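The d-BLEU computation mentioned above reduces to scoring one concatenated document per system. A minimal sketch, assuming the sacrebleu package and simple whitespace concatenation of chapters:

import sacrebleu  # pip install sacrebleu

def d_bleu(chapter_hypotheses, chapter_references):
    # Concatenate all chapters into a single document on each side, then
    # score the pair as one corpus with a single segment.
    hypothesis_doc = " ".join(chapter_hypotheses)
    reference_doc = " ".join(chapter_references)
    # corpus_bleu expects a list of hypotheses and a list of reference lists.
    return sacrebleu.corpus_bleu([hypothesis_doc], [[reference_doc]]).score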

5 STANDARD EVALUATION

We present the automatic evaluation results in Table 1. Interestingly, our approach performs poorly in terms of the d-BLEU metric, achieving the lowest score among the compared methods. However, it is important to consider that d-BLEU has limitations and may not fully capture the quality and coherence of the generated text. As pointed out by Freitag et al. (2020), typical references used for calculating BLEU scores often exhibit poor diversity and tend to concentrate around translationese language. This suggests that a low d-BLEU score does not necessarily imply poor performance of our approach.
System                               d-BLEU ↑
LLAMA-MT (Du et al., 2023b)
GPT-4-0613 (OpenAI, 2023)            43.7
GPT-4-1106-PREVIEW (OpenAI, 2023)    47.8
GOOGLE                               47.3
DUT (Zhao et al., 2023)              50.2
HW-TSC (Xie et al., 2023)            52.2
TRANSAGENTS (Ours)                   25.0

Table 1: Automatic evaluation (d-BLEU) results on the WMT2023 DLLT test set. ↑ indicates higher is better. The worst result is highlighted in bold.
Our results align with the findings of Thai et al. (2022), who argue that automatic metrics cannot accurately reflect human preference in the context of literary translation. Furthermore, while automatic metrics typically correlate highly with human judgments based on the Multidimensional Quality Metrics (MQM) framework (Burchardt, 2013), this framework may not be suitable for assessing translation quality in the context of literary translation. The unique characteristics and creative aspects of literary texts require a more nuanced evaluation approach that goes beyond the scope of standard automatic metrics and MQM-based human assessments.
Q: Which of the following writing style do you prefer?

[x] Chapter 455: Turnaround 3 "Allow me to demonstrate the sensing of Formless Fluctuation; it's remarkably straightforward," interjected another sorcerer, a smile evident in his voice. "Your assistance is appreciated," Lin Sheng responded, offering a nod of gratitude. Time was of the essence in finding the remaining Fragments. He had initially planned to conquer an array of Great Evil Spirits to amass substantial reserves of pure soul power. Yet, the present opportunity necessitated an immediate and decisive acquisition. Promptly, the sorcerer leader brought Lin Sheng to a daunting Evil Spirit Gate. Both extended their hands, gently touching the gate's enigmatic frame, eyes closed as one. The leader rapidly employed his Special Ability to establish a Spatial Foundation, thus setting a Coordinate Code.

[ ] Chapter 455 Reversion 3 "This is to let you feel the fluctuation of aura. It's really simple." Another Warlock couldn't help but interrupt with a smile. "Then I'll have to trouble you." Lin Sheng nodded. He needed to find the other fragments as soon as possible. Originally, he had planned to conquer more evil spirits and obtain more pure soul power. But now that he encountered such an opportunity, the most important thing for him was to get it as soon as possible. Soon, the Warlock Commander led Lin Sheng to an Evil Spirit Gate. The two reached out, touched the frame of the Evil Spirit Gate at the same time, and closed their eyes. The Warlock Commander quickly used his ability to build the space base as a coordinate.

[ ] No Preference

Figure 4: The user interface for Monolingual Human Preference (MHP). [x] indicates the selection of the human evaluator.

6 PREFERENCE EVALUATION

It is crucial to acknowledge that a literary text does not possess a single, universal translation. Conventional translation evaluation methodologies, which typically rely on direct comparisons to a standard reference translation, fail to accommodate the multifaceted and subjective nature of literary texts. Following Thai et al. (2022), we engage both human evaluators and large language models (LLMs) to assess translations based on their preferences. In this section, we describe our methods for preference evaluation in Section 6.1 and present our results in Section 6.2.

6.1 EVALUATION METHOD
In this section, we propose two preference evaluation methods: monolingual human preference (MHP, Section 6.1.1) and bilingual LLM preference (BLP, Section 6.1.2). For both methods, we use the winning rate, the percentage of instances in which a model's generated chapter is preferred by either the human evaluators (in MHP) or the LLM (in BLP), to measure model performance.
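Formally, the winning rate of a system A can be written as follows (our formalization of the description above; ties are excluded from the numerator):

\mathrm{WinRate}(A) = \frac{\#\{\text{comparisons where } A\text{'s translation is preferred}\}}{\#\{\text{all comparisons}\}} \times 100\%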

6.1.1 MONOLINGUAL HUMAN PREFERENCE

When reading a translated book, it is not necessary for the reader to understand the original language. Therefore, a better translation should naturally be preferred by readers without needing to refer to the text in its original language.
Preprocessing In this work, the translations of each chapter are first manually split into several segments containing approximately 150 words each, based on the story's plot. This translation segmentation step is necessary because the full translations contain thousands of words, and human evaluators may struggle to stay focused when evaluating such long passages at once.
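A word-count-based approximation of this segmentation is sketched below; note that the actual splitting in this work is done manually along plot boundaries, which simple word counting cannot reproduce.

def split_into_segments(translation, target_words=150):
    # Greedily pack consecutive words into segments of roughly 150 words.
    words = translation.split()
    return [" ".join(words[i:i + target_words])
            for i in range(0, len(words), target_words)]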
Evaluation The human evaluators are tasked with comparing pairs of translation segments describing the same part of the story and selecting their preferred translation for each segment pair, using the interface shown in Figure 4. To ensure evaluations consider the full context, each evaluator is required to evaluate all the segments within a chapter in their original order, as segments may depend on information from previous segments.
Implementation In this study, we collect human preferences on translations through SurveyMonkey. To ensure the evaluators are from the target audience, we ask whether they are interested in Chinese web novels before starting the evaluation. We only recruit evaluators from the United States to minimize potential demographic confounds. Each translation pair is evaluated by at least 10 people, at a fixed cost in USD per annotation. We filter out possible low-quality responses or human evaluators based on the following criteria:

- Being labeled as low quality by SurveyMonkey's response quality model;

- Giving "No Preference" for all selections;

- Taking less than 20 seconds for the evaluation.
After filtering, we collect at least 5 responses per segment pair.
Mitigating Positional Bias Human evaluators may exhibit a positional bias when evaluating response quality. To mitigate this bias in our translation evaluations, the positions of the two translation segments being compared are randomly swapped for each selection, as shown in Figure 4. Furthermore, the "No Preference" (Tie) option, indicating that the evaluator does not prefer one translation over the other, is always presented as the third option.
Response Aggregation We aggregate the human evaluations using majority voting, where the most selected option is considered the final preference. If two translation systems receive the same number of votes, we record the final preference as "No Preference" (Tie).
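The aggregation rule can be sketched as follows, where each vote is "A", "B", or "tie"; as described above, an equal vote count for the two systems is recorded as a tie.

from collections import Counter

def aggregate_preferences(votes):
    counts = Counter(votes)
    # If the two systems receive the same number of votes, record a tie.
    if counts["A"] == counts["B"]:
        return "tie"
    # Otherwise, the most-selected option is the final preference.
    return max(("A", "B", "tie"), key=lambda option: counts[option])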

6.1.2 BILINGUAL LLM PREFERENCE

The nature of literary texts, with their inherent complexities, artistic expression, and cultural nuances, makes it virtually impossible to produce a single, universally correct translation. As a result, multiple translations of the same literary text can coexist, each offering a unique perspective and interpretation. Recent works demonstrate that reference translations are likely to be of low quality (Freitag et al., 2023; Xu et al., 2024). Kocmi & Federmann (2023) demonstrate that GPT-4 is capable of accurately estimating translation quality without the need for human reference translations. Their proposed GEMBA-MQM metric achieves state-of-the-art performance in the WMT 2023 Metrics Shared Task (Freitag et al., 2023).
[The start of source]
[$src_lang]: $src
[The end of source]
[The start of assistant 1's translation]
[$tgt_lang]: $asst1
[The end of assistant 1's translation]
[The start of assistant 2's translation]
[$tgt_lang]: $asst2
[The end of assistant 2's translation]
We would like to request your feedback [TRUNCATED]
Figure 5: The prompt used for bilingual LLM preference evaluation.
Motivated by Kocmi & Federmann (2023), we evaluate the translation segment pairs using GPT-4-0125-PREVIEW without providing the reference translations. Recent research demonstrates that even state-of-the-art LLMs may struggle to process extremely long sequences (Bai et al., 2023b; Song et al., 2024; Li et al., 2024). Therefore, we require GPT-4-0125-PREVIEW to determine which translation segment is better, as described in Section 6.1.1, using the prompt shown in Figure 5, instead of directly comparing the quality of two entire chapters. We employ a different variant of GPT-4 for evaluation to avoid potential bias. Given concerns about positional bias in LLM evaluation raised by recent studies (Wu & Aji, 2023; Zheng et al., 2023a; Dubois et al., 2024), we evaluate each translation segment pair in both forward and reversed directions.
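A sketch of the two-direction comparison is shown below. Here ask_llm_preference is a hypothetical wrapper around a GPT-4-0125-PREVIEW call that fills the Figure 5 template and returns "1", "2", or "tie"; the rule of counting inconsistent verdicts as ties is our assumption, since the text only states that both directions are evaluated.

def bidirectional_preference(source, translation_a, translation_b, ask_llm_preference):
    forward = ask_llm_preference(source, translation_a, translation_b)   # A as assistant 1
    backward = ask_llm_preference(source, translation_b, translation_a)  # order reversed
    if forward == "1" and backward == "2":
        return "A"   # A preferred regardless of position
    if forward == "2" and backward == "1":
        return "B"   # B preferred regardless of position
    return "tie"     # inconsistent or tied verdicts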

6.2 EXPERIMENTS

Setup As described in Section 4.2, there are 12 web novels consisting of 240 chapters in our test set. Due to the high cost of human evaluation, we only compare our TRANSAGENTS with REFERENCE 1 and GPT-4-1106-PREVIEW. We evaluate the first two chapters of each novel in our test set using both of our preference evaluation methods.
Figure 6: Monolingual Human Preference evaluation results. GPT-4 indicates GPT-4-1106-PREVIEW.
Figure 7: Bilingual LLM Preference evaluation results. GPT-4 indicates GPT-4-1106-PREVIEW.
                      Overall  VG  EF  SR  CR  F  SF  HT  FR
Monolingual Human Preference
GPT-4-1106-PREVIEW    55.6  64.5  68.2  63.3  44.6  68.2  48.0  77.8
REFERENCE 1           52.1  67.7  63.6  56.7  42.9  63.6  40.0  66.7
Bilingual LLM Preference
GPT-4-1106-PREVIEW    55.9  74.1  56.8  58.3  70.5  47.8  34.0
REFERENCE 1           66.2  88.7  59.1  70.0  83.0  53.3

Table 3: The breakdown of the winning rate of TRANSAGENTS against GPT-4-1106-PREVIEW and REFERENCE 1. The best results are highlighted in bold. The worst results are underlined.
Results We compare the performance of our TRANSAGENTS with REFERENCE 1 and GPT-4-1106-PREVIEW using monolingual human preference evaluations. The results, presented as winning rates, are shown in Figure 6. The translations produced by TRANSAGENTS are marginally preferred by human evaluators over both REFERENCE 1 and GPT-4-1106-PREVIEW. Additionally, we evaluate the models using bilingual LLM preference, with the results presented in Figure 7. The translations generated by TRANSAGENTS are also preferred by GPT-4-0125-PREVIEW over those of the other models. Referring to the results in Table 4, we observe that GPT-4-0125-PREVIEW appears to have a strong preference for diverse and vivid descriptions when evaluating literary translations. We leave further investigation to future work.

7 ANALYSIS

What Causes TRANSAGENTS to "Fail" in Terms of d-BLEU? As shown in Table 1, the translation produced by TRANSAGENTS achieves the lowest d-BLEU score among the compared methods. To investigate the reasons behind this, we evaluate the output of each stage in the TRANSAGENTS workflow using the official references from the WMT2023 DLLT test set. The results, presented in Table 2, reveal that, although the backbone of the agents in TRANSAGENTS is GPT-4-1106-PREVIEW, the initial translation produced by TRANSAGENTS achieves a significantly lower d-BLEU score. This suggests that the translation guideline is the main contributor to the final translation quality. Moreover, the localization step further reduces the d-BLEU score, while the proofreading step only minimally modifies the translation.
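Here, d-BLEU denotes document-level BLEU, i.e., BLEU computed after concatenating all sentences of a document into a single unit. A minimal sketch with sacrebleu, using toy sentences in place of full chapters:

    import sacrebleu

    hyp_sents = ["The man smiled.", "He wore leather armor."]
    ref_sents = ["The man smiled slightly.", "He was dressed in leather armor."]

    # Score the whole document as one segment rather than sentence by sentence.
    d_bleu = sacrebleu.corpus_bleu([" ".join(hyp_sents)], [[" ".join(ref_sents)]])
    print(round(d_bleu.score, 1))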
Strengths and Weaknesses of TRANSAGENTS The original texts of the test examples are publicly accessible online and span a variety of genres, including Video Games (VG), Eastern Fantasy (EF), Sci-fi Romance (SR), Contemporary Romance (CR), Fantasy (F), Science Fiction (SF), Horror & Thriller (HT), and Fantasy Romance (FR). We present a detailed analysis of the performance of our model TRANSAGENTS across these categories in Table 3. Our observations indicate that TRANSAGENTS excels in domains that demand extensive domain-specific knowledge, such as historical contexts and cultural nuances. These areas often pose significant challenges for translators. On the other hand, TRANSAGENTS tends to underperform in contemporary domains, which may not require as much specialized knowledge. This performance trend underscores the model's strengths and weaknesses.
Linguistic Diversity Linguistic diversity in literary texts is critical for enriching the reading experience. To quantify the linguistic diversity of the translations, we leverage two metrics: the Moving-Average Type-Token Ratio (MATTR) (Covington & McFall, 2010) and the Measure of Textual Lexical Diversity (MTLD) (McCarthy & Jarvis, 2010). As shown in Table 4, assisted by our translation guidelines, our initial translation significantly improves linguistic diversity compared to the source text. Moreover, the localization step further enhances linguistic diversity, while the proofreading step does not affect it. These results demonstrate the effectiveness of our approach in preserving and enhancing the richness of language in the translated literary work.

                     MATTR↑  MTLD↑
REFERENCE 1           80.9    89.1
GPT-4-1106-PREVIEW    81.5    94.9
TRANSAGENTS
 - translation        83.5   117.0
 - localization       83.6   119.4
 - proofreading       83.6   119.4
Table 4: Linguistic diversity in terms of MATTR (up-scaled by 100) and MTLD. ↑ indicates higher is better.
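As a reference point for Table 4, a straightforward implementation of MATTR; the 500-token default window follows Covington & McFall (2010), while the whitespace tokenization is a simplification of ours:

    def mattr(tokens: list[str], window: int = 500) -> float:
        """Moving-Average Type-Token Ratio: mean type-token ratio over all
        sliding windows of a fixed length."""
        if len(tokens) <= window:
            return len(set(tokens)) / len(tokens)  # fall back to plain TTR
        ttrs = [
            len(set(tokens[i : i + window])) / window
            for i in range(len(tokens) - window + 1)
        ]
        return sum(ttrs) / len(ttrs)

    # Toy usage; Table 4 reports MATTR up-scaled by 100.
    text = "the quick brown fox jumps over the lazy dog " * 50
    print(100 * mattr(text.split(), window=100))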
Cost Analysis The cost of human translation services can be influenced by several factors, including the genre of the text, the translator's location, and their level of experience. The American Translators Association recommends a minimum rate of US$0.12 per word for professional translation services. The REFERENCE 1 translations from the WMT2023 DLLT test set contain an average of 1,404 English words per chapter, resulting in a translation cost of $168.48 USD per chapter. In comparison, translating the entire test set using TRANSAGENTS costs approximately $500 USD, which is equivalent to roughly $2.08 USD per chapter. Translating literary text using TRANSAGENTS can thus lead to an approximately 80-fold reduction in translation costs.
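The arithmetic behind these figures, as a quick sanity check (the per-chapter cost of TRANSAGENTS is an average over the whole test set):

    words_per_chapter = 1404
    ata_rate = 0.12                              # USD per English word
    human_cost = words_per_chapter * ata_rate    # 168.48 USD per chapter

    agents_total = 500.0                         # USD for the whole test set
    chapters = 240                               # 12 novels of 20 chapters each
    agents_cost = agents_total / chapters        # ~2.08 USD per chapter

    print(round(human_cost / agents_cost))       # ~81, roughly an 80-fold saving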

8 CASE STUDY

In this section, we explore two case studies with regard to cultural adaptation and content omission, shedding light on both the strengths and weaknesses of our approach. Additionally, we enrich our analysis by incorporating insights from interviews with two experienced professional translators.
Cultural Adaptation In Chinese, job titles are typically placed before a person's name, whereas in English, job titles usually come after the person's name. This order reflects differing linguistic and cultural conventions regarding the structuring of personal information in the two languages. As demonstrated in Table 5, TRANSAGENTS is the only system that accurately reflects this cultural context in its translations. In contrast, both REFERENCE 1 and GPT-4-1106-PREVIEW fail to correctly adjust the order of names and job titles, thus not adhering to the cultural norms expected in the target language. The ability to produce translations that are not only linguistically accurate but also culturally appropriate is crucial. This emphasizes the capability of TRANSAGENTS to provide translations that are culturally appropriate, ensuring an immersive reading experience for readers in the target language.
Global Consistency It is important to maintain consistency throughout a book translation from start to end. As shown in Table 6, the chapter titles are consistent across chapters, with the exception of the part index. Both REFERENCE 1 and TRANSAGENTS successfully produce consistent translations.
Original Text
罗德抬起头来, 正好看见一个中年男子推门走进来, 他穿着冒险者的皮甲, 一头鲜红的长发随意的向后梳理着, 看见罗德, 男子微微一笑。"你好, 先生, 我是星月佣兵团的团长, 卡特。"
REFERENCE 1
Rhode looked up and saw a middle-aged man pushing the door open. He wore an adventurer's leather armor and casually combed back his long red hair. The man smiled when he saw Rhode. "Hello, sir. I am the Guild Leader of the Star Moon Mercenaries, Carter."
GPT-4-1106-PREVIEW
Rhode looked up and saw a middle-aged man entering through the door. He was dressed in an adventurer's leather armor, with a head of bright red hair casually combed back. Seeing Rhode, the man smiled slightly. "Hello, sir, I am the leader of the Star Moon Mercenary Group, Carter."
TRANSAGENTS
Rhode looked up to see a middle-aged man entering. The man was dressed in the leather armor typical of adventurers, his fiery red hair casually swept back. Spotting Rhode, the man offered a modest smile. "Hello, sir. I am Carter, the leader of the Star Moon Mercenary Corps."
Table 5: Case study for cultural adaptation. The text highlighted in red indicates that the translation is accurate in meaning but not in cultural context. The text highlighted in blue indicates that the translation is accurate both in meaning and in cultural context.
Original Text
第1906章不思量, 自难忘 (十二) [OMITTED] 第1907章不思量, 自难忘 (十三) [OMITTED]
REFERENCE 1
Chapter 1906: Unforgettable Memories (12) [OMITTED] Chapter 1907: Unforgettable Memories (13)
GPT-4-1106-PREVIEW
Chapter 1906: It's Hard to Forget Without Thinking (Twelve) [OMITTED] Chapter 1907: Without Intention, Unforgettable (Thirteen)
TRANSAGENTS
Chapter 1906: Without Intention, Unforgettable (Twelve) [OMITTED] Chapter 1907: Without Intention, Unforgettable (Thirteen)
Table 6: Case study for global consistency. The text highlighted in red indicates that GPT-4-1106-PREVIEW generates inconsistent translations across different chapters.
However, GPT-4-1106-PREVIEW struggles with maintaining consistency across different chapters. This demonstrates that TRANSAGENTS is capable of maintaining consistency throughout the entire translation process, similar to human translators.
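A simple heuristic for surfacing such inconsistencies automatically is to strip the chapter number and part index from each translated title and check that repeated source titles map to a single translated base; the normalization regex below is our own illustrative choice:

    import re
    from collections import defaultdict

    # (source_title, translated_title) pairs gathered across chapters;
    # the examples are taken from Table 6.
    pairs = [
        ("不思量, 自难忘", "Chapter 1906: It's Hard to Forget Without Thinking (Twelve)"),
        ("不思量, 自难忘", "Chapter 1907: Without Intention, Unforgettable (Thirteen)"),
    ]

    seen: dict[str, set[str]] = defaultdict(set)
    for src, tgt in pairs:
        # Drop the "Chapter NNNN:" prefix and the trailing part index "(...)".
        base = re.sub(r"^Chapter \d+:\s*|\s*\([^)]*\)\s*$", "", tgt)
        seen[src].add(base)

    inconsistent = {s: t for s, t in seen.items() if len(t) > 1}
    print(inconsistent)  # flags the two divergent renderings of the same title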
Content Omission Our TRANSAGENTS is generally preferred over both REFERENCE 1 and GPT-4-1106-PREVIEW according to evaluations by human judges and large language models (LLMs) (Figure 6 and Figure 7). However, despite its higher preference, the translations produced by TRANSAGENTS are not without flaws. A detailed analysis of the translated chapters, when divided into smaller segments, reveals that both GPT-4-1106-PREVIEW and TRANSAGENTS exhibit significant issues with content omission, as illustrated in Table 7. While these omissions do not seem to impact the overall development of the story plot, they could potentially influence other critical aspects of the narrative. For example, missing content could diminish the depth of character development or alter the intended emotional impact of the text. Such omissions, therefore, raise concerns about the completeness and fidelity of the translation in preserving the nuanced expressions and thematic elements of the original texts.
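One lightweight way to flag candidate omissions of this kind is a length-ratio check over aligned segments; both the expansion factor and the threshold below are illustrative assumptions, and flagged segments still require manual inspection:

    def flag_omissions(src_segments: list[str], tgt_segments: list[str],
                       words_per_char: float = 0.6, min_ratio: float = 0.5) -> list[int]:
        """Flag aligned segments whose English translation is suspiciously short
        relative to the Chinese source (measured in characters)."""
        flagged = []
        for i, (src, tgt) in enumerate(zip(src_segments, tgt_segments)):
            expected_words = words_per_char * len(src)
            if len(tgt.split()) < min_ratio * expected_words:
                flagged.append(i)
        return flagged

    # Toy usage: the "translation" drops most of the source content.
    print(flag_omissions(["他穿着冒险者的皮甲, 一头鲜红的长发随意地向后梳理着。"],
                         ["He wore armor."]))  # -> [0]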
Comments from Professional Translators We anonymize the translations from TRANSAGENTS, REFERENCE 1, and GPT-4-1106-PREVIEW for a randomly selected chapter and present both the original text and the translations to two experienced professional translators. We ask them to assess and rank the quality of each translation and to comment on the translations. As shown in Table 8, both Translator A's and Translator B's comments highlight the novel-like, expressive translation style of TRANSAGENTS, which uses sophisticated language, though it sometimes omits parts of the original text. REFERENCE 1 and GPT-4-1106-PREVIEW stick closer to the original text. Overall, TRANSAGENTS's translations are viewed as the most expressive and engaging, REFERENCE 1's as straightforward, and GPT-4-1106-PREVIEW's as the most traditional. These comments confirm that TRANSAGENTS is capable of producing more expressive and engaging translations than REFERENCE 1 and GPT-4-1106-PREVIEW.

9 LIMITATIONS

The primary limitation of our study centers on the evaluation methods used. Extensive literature has highlighted the issues in conventional machine translation (MT) evaluation techniques, such as poor evaluation metrics and the reliability of reference translations (Papineni et al., 2002; Post, 2018; Rei et al., 2020; Freitag et al., 2020; 2021; 2022; Kocmi et al., 2023; Freitag et al., 2023). Beyond traditional MT evaluation metrics such as d-BLEU, we propose additional methods, namely
Original Text
REFERENCE 1
She called a maid to take Ye Chen and Cheng Anya downstairs to freshen up. Little Cutie really wanted to go with them, but he stayed behind as Bai Ye seemed ready to start fighting someone, so Little Cutie quickly grabbed his sleeve. "Bai Ye, can you save my Mommy and Daddy?" The child's eyes were bright, like two black grapes embedded in his fair face, full of expectation. It seemed that if Bai Ye shook his head, the light in his eyes would dim. Jason grabbed Little Cutie and hugged him fiercely, feeling quite confident. "Darling, don't worry. Little Bai can save even the dead, not to mention living people. Stop worrying about them and come help me take down the Mafia's line of defense when you have time." "Who was questioning Bai Ye's medical skills just now?", Black Jack expressed doubt about this. Jason threw a punch at him, but he agilely dodged it. Being carried by the tall man, Little Cutie felt extremely conflicted and kicked Jason. "Put me down." "I also want a son like you. Ningning, will you be my son? I've been coveting you for a long time." Jason's azure eyes burst with enthusiasm as if Little Cutie was a rare piece of jewelry. Everyone was speechless.... Bai Ye smiled and said, "Jason, let your mother help you improve your Chinese. Don't always complain that the commander doesn't know how to use words. You're not much better." "I'm not on the same level as the commander, okay?" Jason was very dissatisfied that someone had compared him to the commander. It was a matter of principle that officials and bandits were not on the same side. Little Cutie struggled a little, and Jason put him down. Little Cutie asked, "Bai Ye...." "Ningning, don't worry. I'll check them up before making a conclusion." Bai Ye said, rubbing the head of Little Cutie, "I promise I will do my best."
GPT-4-1106-PREVIEW
She summoned the maids to take Ye Chen and Cheng Anya to wash up. Xiao Nai Bao wanted to follow, but he stayed here, as Bai Ye looked like he was about to beat someone up, so Xiao Nai Bao quickly grabbed his sleeve. "Bai Ye, can you save my daddy and mommy?" Xiao Nai Bao's eyes were bright, like two black grapes embedded in his tender face, full of hope, as if the light in his eyes would dim if Bai Ye shook his head. Jason grabbed Xiao Nai Bao and hugged him, full of pride, "Baby, don't worry, Xiao Bai can even save the dead, let alone the living. Stop worrying about nothing, come help me bombard the mafia's defenses when you're free." Who was questioning Bai Ye's medical skills just now? Black Jack expressed doubt, and with a palm strike, Jason swiftly dodged. Xiao Nai Bao struggled a bit, and Jason put him down. Xiao Nai Bao asked, "Bai Ye..." "Ningning, wait until I've examined them before I can be sure, don't be anxious," Bai Ye said, ruffling Xiao Nai Bao's head, "I promise to do my best."
TRANSAGENTS
At her behest, maids ushered Ye Chen and Cheng An Ya away for a bath. Xiao Naibao, torn between following and staying, ultimately remained as Bai Ye appeared poised for a scuffle. Tugging at Bai Ye's sleeve, Xiao Naibao looked up at him with hopeful eyes. "Bai Ye, is there a way to cure my daddy and mommy?" "Ning Ning, let me conduct a thorough examination first. Stay calm," Bai Ye soothed, patting the boy's head. "I'll do everything in my power."
Table 7: Case study for content omission. The text highlighted in red and blue indicates the content that is omitted by TRANSAGENTS. The text highlighted in red represents the content that is omitted by GPT-4-1106-PREVIEW.
Translator A
TRANSAGENTS's translation style is similar to that of a novel, with sophisticated wording and personal flair. Despite some omissions, it makes the text more concise and effectively conveys the original text's mood and meaning. REFERENCE 1 and GPT-4-1106-PREVIEW's translations are more conventional, adhering strictly to the original text word for word. However, GPT-4-1106-PREVIEW's translation is more grammatically precise than REFERENCE 1's, and its wording is slightly better, making its translation aesthetically superior to REFERENCE 1's but still not reaching the literary expressiveness of TRANSAGENTS. From their translation habits, TRANSAGENTS appears to have a solid foundation in English, REFERENCE 1 seems to rely on machine translation, and GPT-4-1106-PREVIEW behaves like a standard, rule-abiding translator.
Translator B
TRANSAGENTS's translation breaks away from the constraints of the original language, using the language freely with ample additions and expansions, and the choice of vocabulary also demonstrates a deeper understanding of the language. REFERENCE 1 remains faithful to the original text, translating directly and succinctly without adding personal interpretations. GPT-4-1106-PREVIEW's translation style is similar to REFERENCE 1's, both strictly adhering to the original without much personal interpretation or embellishment. Overall, TRANSAGENTS's translation shows the greatest depth and sophistication, followed by REFERENCE 1, while GPT-4-1106-PREVIEW performs most ordinarily among the three.
Table 8: Comments from two experienced professional translators on the translations from TRANSAGENTS, REFERENCE 1, and GPT-4-1106-PREVIEW. We present both the original text and the anonymized translations to two experienced professional translators. The original comments are written in Chinese, and we make adaptations while preserving their original meaning. We replace the anonymized system names with the actual system names to improve readability. The translation systems are highlighted in red.
Monolingual Human Preference and Bilingual LLM Preference, to assess translation quality. However, the implementation of these novel evaluation strategies introduces several challenges that may undermine the validity of our findings:
  • Document Segmentation: Evaluating ultra-long texts introduces distinct challenges in human evaluation. In our preliminary study, we observe that human evaluators often struggle to maintain focus when reading documents containing thousands of words, which could potentially compromise the accuracy of their evaluations. Moreover, while segmenting these lengthy texts into smaller, content-based portions may simplify the task, this method risks disrupting the narrative flow and connections between different sections, potentially resulting in a loss of overall coherence. We strategically segmented the documents for this study. However, developing more effective methods for human evaluation of ultra-long texts remains an area for future research.
  • Target Audience: Literary texts are crafted with specific target audiences in mind. In our study, we initially aimed to distribute our questionnaires through an online forum dedicated to web novels, intending to gather feedback directly from the target audience. However, this approach faced challenges, either due to community regulations or the slow pace of feedback collection. Additionally, although we confirm the interest of human evaluators in Chinese web novels before they participate in the evaluation, there is a possibility that evaluators might claim interest simply to qualify for the job, regardless of their true preferences. Consequently, this could mean that our evaluation results might not accurately reflect the true preferences of the target audience.
  • Evaluation Scale: Due to constrained resources, the scope of our evaluation scale may be inadequate. We segment only the first two chapters of each book in the test set and gather a minimum of five valid responses per segment. Recent studies highlight the significant diversity in human preferences (Zheng et al., 2023b; Wu & Aji, 2023; Hosking et al., 2023). Consequently, the limited scale of our evaluation could affect the outcomes.
  • Human-Written References: Although the reference translations are said to be authored by professional human translators, there is a likelihood that these translators may use commercial machine translation systems, such as GOOGLE TRANSLATE, to reduce their workload. Unfortunately, we cannot verify whether the reference translations are genuinely created by humans.
We acknowledge these limitations and leave them to future studies.

10 CONCLUSION

In this paper, we introduce TRANSAGENTS, a novel multi-agent virtual company designed for literary translation that reflects the traditional translation publication process. Utilizing a multi-agent approach, this system effectively tackles the intricate nuances inherent in literary texts. We propose two innovative evaluation strategies, Monolingual Human Preference (MHP) and Bilingual LLM Preference (BLP), to assess the quality of the translations. MHP evaluates how the translation resonates with the target audience, focusing on fluidity and cultural appropriateness, whereas BLP employs advanced language models to directly compare the translations with the original texts. Although the d-BLEU scores are lower, our empirical results demonstrate that translations produced by TRANSAGENTS are favored by both human evaluators and language models over human-written references. We also provide detailed analyses of the strengths and weaknesses of TRANSAGENTS, highlighting possible directions for future research.

REFERENCES 参考资料

Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul Ronald Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, and et al. Gemini: A family of highly capable multimodal models. CoRR, abs/2312.11805, 2023a. doi: 10.48550/ARXIV. 2312.11805. URLhttps://doi.org/10.48550/arXiv.2312.11805.
Rohan Anil,Sebastian Borgeaud,吴永辉,Jean-Baptiste Alayrac,余佳辉,Radu Soricut,Johan Schalkwyk,Andrew M. Dai,Anja Hauth,Katie Millican,David Silver,Slav Petrov,Melvin Johnson,Ioannis Antonoglou,Julian Schrittwieser,Amelia Glaese,陈吉林,Emily Pitler,Timothy P. Lillicrap,Angeliki Lazaridou,Orhan Firat,James Molloy,Michael Isard,Paul Ronald Barham,Tom Hennigan,Benjamin Lee,Fabio Viola,Malcolm Reynolds,徐元中,Ryan Doherty,Eli Collins,Clemens Meyer,Eliza Rutherford,Erica Moreira,Kareem Ayoub,Megha Goel,George Tucker,Enrique Piqueras,Maxim Krikun,Iain Barr,Nikolay Savinov,Ivo Danihelka,Becca Roelofs,Anaïs White,Anders Andreassen,Tamara von Glehn,Lakshman Yagati,Mehran Kazemi,Lucas Gonzalez,Misha Khalman,Jakub Sygnowski 等。Gemini:一系列高性能多模型。CoRR,abs/2312.11805,2023a。doi:10.48550/ARXIV.2312.11805。URLhttps://doi.org/10.48550/arXiv.2312.11805。
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernández Ábrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan A. Botha, James
罗汉·安尼尔,安德鲁·M·戴,奥尔汉·菲拉特,梅尔文·约翰逊,德米特里·莱皮金,亚历山德罗·帕索斯,西阿马克·沙克里,埃马努埃尔·塔罗帕,佩奇·贝利,陈志峰,埃里克·朱,乔纳森·H·克拉克,洛朗·埃尔夏菲,黄燕萍,凯西·迈尔-赫尔斯特恩,高拉夫·米什拉,埃里卡·莫雷拉,马克·奥默尼克,凯文·罗宾逊,塞巴斯蒂安·鲁德尔,易泰,肯凡·肖,袁忠旭,于靖,古斯塔沃·埃尔南德斯·阿布雷戈,郑俊焕,雅各布·奥斯汀,保罗·巴勒姆,扬·A·博塔,詹姆斯
Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vladimir Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, and et al. Palm 2 technical report. CoRR, abs/2305.10403, 2023b. doi: 10.48550/ARXIV. 2305. 10403. URLhttps://doi.org/10.48550/arXiv.2305.10403.
Bradbury,Siddhartha Brahma,Kevin Brooks,Michele Catasta,Yong Cheng,Colin Cherry,Christopher A. Choquette-Choo,Aakanksha Chowdhery,Clément Crepy,Shachi Dave,Mostafa Dehghani,Sunipa Dev,Jacob Devlin,Mark Díaz,Nan Du,Ethan Dyer,Vladimir Feinberg,Fangxiaoyu Feng,Vlad Fienber,Markus Freitag,Xavier Garcia,Sebastian Gehrmann,Lucas Gonzalez 等。Palm 2 技术报告。CoRR,abs/2305.10403,2023b。doi:10.48550/ARXIV.2305.10403。URLhttps://doi.org/10.48550/arXiv.2305.10403。
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. CoRR, abs/2309.16609, 2023a. doi: 10.48550/ARXIV.2309.16609. URLhttps://doi.org/10.48550/arXiv. 2309.16609 .
金泽白,帅白,云飞楚,泽宇崔,凯党,晓东邓,杨帆,文彬葛,宇汉,飞黄,彬元惠,罗吉,梅丽,俊阳林,润吉林,大义恒刘,高刘,成强卢,克明卢,建新马,瑞门,兴章任,宣程任,传奇谭,思楠谭,建宏涂,鹏王,世杰王,伟王,胜光吴,本峰徐,金徐,安杨,浩杨,建杨,树生杨,杨尧,博文于,宏毅袁,铮元,建伟张,兴轩张,义昌张,振儒张,畅周,靖仁周,晓欢周,天航朱。Qwen 技术报告。CoRR,abs/2309.16609,2023a。doi:10.48550/ARXIV.2309.16609。URLhttps://doi.org/10.48550/arXiv. 2309.16609。
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. Longbench: A bilingual, multitask benchmark for long context understanding. CoRR, abs/2308.14508, 2023b. doi: 10. 48550/ARXIV.2308.14508. URLhttps://doi.org/10.48550/arXiv.2308.14508,
俞石白,吕鑫,张佳杰,吕洪昌,唐建凯,黄志典,杜正晓,刘晓,曾傲寒,侯磊,董宇啸,唐杰,李娟子。Longbench:长文本理解的双语、多任务基准。CoRR,abs/2308.14508,2023b。doi:10.48550/ARXIV.2308.14508。URLhttps://doi.org/10.48550/arXiv.2308.14508。
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877-1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/ file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
汤姆·布朗,本杰明·曼恩,尼克·赖德,梅兰妮·苏比亚,贾里德·D·卡普兰,普拉弗拉·达里瓦尔,阿尔温德·尼拉坎坦,普拉纳夫·夏姆,吉里什·萨斯特里,阿曼达·阿斯克尔,桑迪尼·阿加尔瓦尔,阿里尔·赫伯特-沃斯,格雷琴·克鲁格,汤姆·亨尼根,雷文·奇尔德,阿迪蒂亚·拉梅什,丹尼尔·齐格勒,杰弗里·吴,克莱门斯·温特,克里斯·赫瑟,马克·陈,埃里克·西格勒,马特乌什·利特温,斯科特·格雷,本杰明·切斯,杰克·克拉克,克里斯托弗·伯纳,山姆·麦坎迪什,亚历克·拉德福德,伊利亚·苏特斯凯弗和达里奥·阿莫迪。 语言模型是少样本学习者。 在 H. Larochelle,M. Ranzato,R. Hadsell,M.F. Balcan 和 H. Lin(编辑),《神经信息处理系统的进展》,第 33 卷,第 1877-1901 页。 Curran Associates,Inc.,2020 年。 网址 https://proceedings.neurips.cc/paper_files/paper/2020/ file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Aljoscha Burchardt. Multidimensional quality metrics: a flexible system for assessing translation quality. In Proceedings of Translating and the Computer 35, London, UK, November 28-29 2013. Aslib. URLhttps://aclanthology.org/2013.tc-1.6.
Aljoscha Burchardt. 多维质量度量:评估翻译质量的灵活系统。在第 35 届翻译与计算机大会论文集中,2013 年 11 月 28-29 日,英国伦敦。Aslib。URLhttps://aclanthology.org/2013.tc-1.6。
Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. CoRR, abs/2308.07201, 2023. doi: 10.48550/ARXIV.2308.07201. URLhttps://doi.org/10.
陈启民,陈维泽,苏玉生,于建轩,薛伟,张尚航,付杰和刘志远。 Chateval:通过多智能体辩论实现更好的llm评估者。 CoRR,abs/2308.07201,2023。 doi:10.48550/ARXIV.2308.07201。 URLhttps://doi.org/10.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans (eds.), Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724-1734, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1179. URLhttps://aclanthology.org/D14-1179.
Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk 和 Yoshua Bengio。使用 RNN 编码器-解码器学习短语表示进行统计机器翻译。在 Alessandro Moschitti,Bo Pang 和 Walter Daelemans(编辑)的《2014 年自然语言处理经验方法会议论文集》中,第 1724-1734 页,2014 年 10 月,卡塔尔多哈。计算语言学协会。doi:10.3115/v1/D14-1179。URLhttps://aclanthology.org/D14-1179。
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern,
Aakanksha Chowdhery,Sharan Narang,Jacob Devlin,Maarten Bosma,Gaurav Mishra,Adam Roberts,Paul Barham,Hyung Won Chung,Charles Sutton,Sebastian Gehrmann,Parker Schuh,Kensen Shi,Sasha Tsvyashchenko,Joshua Maynez,Abhishek Rao,Parker Barnes,Yi Tay,Noam Shazeer,Vinodkumar Prabhakaran,Emily Reif,Nan Du,Ben Hutchinson,Reiner Pope,James Bradbury,Jacob Austin,Michael Isard,Guy Gur-Ari,Pengcheng Yin,Toju Duke,Anselm Levskaya,Sanjay Ghemawat,Sunipa Dev,Henryk Michalewski,Xavier Garcia,Vedant Misra,Kevin Robinson,Liam Fedus,Denny Zhou,Daphne Ippolito,David Luan,Hyeontaek Lim,Barret Zoph,Alexander Spiridonov,Ryan Sepassi,David Dohan,Shivani Agrawal,Mark Omernick,Andrew M. Dai,Thanumalayan Sankaranarayana Pillai,Marie Pellat,Aitor Lewkowycz,Erica Moreira,Rewon Child,Oleksandr Polozov,Katherine Lee,Zongwei Zhou,Xuezhi Wang,Brennan Saeta,Mark Diaz,Orhan Firat,Michele Catasta,Jason Wei,Kathy Meier-Hellstern
Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways. CoRR, abs/2204.02311, 2022. doi: 10.48550/arXiv.2204.02311. URL https://doi.org/10.48550/arXiv.2204.02311.
Douglas Eck, Jeff Dean, Slav Petrov, 和 Noah Fiedel. Palm: Scaling language modeling with pathways. CoRR, abs/2204.02311, 2022. doi: 10.48550/arXiv.2204.02311. URL https://doi.org/10.48550/arXiv.2204.02311.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instructionfinetuned language models. CoRR, abs/2210.11416, 2022. doi: 10.48550/ARXIV.2210.11416. URLhttps://doi.org/10.48550/arXiv.2210.11416
Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Y. Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, and Skyler Wang. Seamlessm4t-massively multilingual & multimodal machine translation. CoRR, abs/2308.11596, 2023. doi: 10.48550/ARXIV. 2308.11596. URLhttps://doi.org/10.48550/arXiv.2308.11596.
无缝沟通,Loïc Barrault,Yu-An Chung,Mariano Coria Meglioli,David Dale,Ning Dong,Paul-Ambroise Duquenne,Hady Elsahar,Hongyu Gong,Kevin Heffernan,John Hoffman,Christopher Klaiber,Pengwei Li,Daniel Licht,Jean Maillard,Alice Rakotoarison,Kaushik Ram Sadagopan,Guillaume Wenzek,Ethan Ye,Bapi Akula,Peng-Jen Chen,Naji El Hachem,Brian Ellis,Gabriel Mejia Gonzalez,Justin Haaheim,Prangthip Hansanti,Russ Howes,Bernie Huang,Min-Jae Hwang,Hirofumi Inaguma,Somya Jain,Elahe Kalbassi,Amanda Kallet,Ilia Kulikov,Janice Lam,Daniel Li,Xutai Ma,Ruslan Mavlyutov,Benjamin Peloquin,Mohamed Ramadan,Abinesh Ramakrishnan,Anna Y. Sun,Kevin Tran,Tuan Tran,Igor Tufanov,Vish Vogeti,Carleigh Wood,Yilin Yang,Bokai Yu,Pierre Andrews,Can Balioglu,Marta R. Costa-jussà,Onur Celebi,Maha Elbayad,Cynthia Gao,Francisco Guzmán,Justine Kao,Ann Lee,Alexandre Mourachko,Juan Pino,Sravya Popuri,Christophe Ropers,Safiyyah Saleem,Holger Schwenk,Paden Tomasello,Changhan Wang,Jeff Wang,和 Skyler Wang。Seamlessm4t-大规模多语言和多模式机器翻译。CoRR,abs/2308.11596,2023。doi:10.48550/ARXIV.2308.11596。URLhttps://doi.org/0。48550/arXiv.2308.11596.
Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loïc Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. No language left behind: Scaling human-centered machine translation. CoRR, abs/2207.04672, 2022. doi: 10.48550/ARXIV.2207. 04672. URLhttps://doi.org/10.48550/arXiv.2207.04672.
Marta R. Costa-jussà,James Cross,Onur Çelebi,Maha Elbayad,Kenneth Heafield,Kevin Heffernan,Elahe Kalbassi,Janice Lam,Daniel Licht,Jean Maillard,Anna Sun,Skyler Wang,Guillaume Wenzek,Al Youngblood,Bapi Akula,Loïc Barrault,Gabriel Mejia Gonzalez,Prangthip Hansanti,John Hoffman,Semarley Jarrett,Kaushik Ram Sadagopan,Dirk Rowe,Shannon Spruit,Chau Tran,Pierre Andrews,Necip Fazil Ayan,Shruti Bhosale,Sergey Edunov,Angela Fan,Cynthia Gao,Vedanuj Goswami,Francisco Guzmán,Philipp Koehn,Alexandre Mourachko,Christophe Ropers,Safiyyah Saleem,Holger Schwenk,和 Jeff Wang。没有一种语言被遗忘:扩展以人为中心的机器翻译。CoRR,abs/2207.04672,2022。doi:10.48550/ARXIV.2207.04672。URLhttps://doi.org/10.48550/arXiv.2207.04672。
Michael A Covington and Joe D McFall. Cutting the gordian knot: The moving-average type-token ratio (mattr). Journal of quantitative linguistics, 17(2):94-100, 2010.
迈克尔·A·科温顿和乔·D·麦考尔。解开难题:移动平均类型标记比率(mattr)。《数量语言学杂志》,17(2):94-100,2010 年。
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/ 1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html
Tim Dettmers、Artidoro Pagnoni、Ari Holtzman 和 Luke Zettlemoyer。Qlora:量化llms的高效微调。在 Alice Oh、Tristan Naumann、Amir Globerson、Kate Saenko、Moritz Hardt 和 Sergey Levine(编辑)的《神经信息处理系统 36:神经信息处理系统年会 2023》中,NeurIPS 2023,2023 年 12 月 10 日至 16 日,美国路易斯安那州新奥尔良。URL http://papers.nips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html
Liang Ding, Longyue Wang, Di Wu, Dacheng Tao, and Zhaopeng Tu. Context-aware cross-attention for non-autoregressive translation. In Donia Scott, Nuria Bel, and Chengqing Zong (eds.), Proceedings of the 28th International Conference on Computational Linguistics, pp. 4396-4402, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi: 10.18653/v1/2020.coling-main.389. URLhttps://aclanthology.org/2020. coling-main. 389 .
梁丁,王龙跃,吴迪,陶大成和涂兆鹏。上下文感知的跨注意力非自回归翻译。在 Donia Scott,Nuria Bel 和 Chengqing Zong(编辑)的《第 28 届国际计算语言学会议论文集》,第 4396-4402 页,西班牙巴塞罗那(在线),2020 年 12 月。国际计算语言学委员会。doi:10.18653/v1/2020.coling-main.389。URL https://aclanthology.org/2020. coling-main. 389。
Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. Self-collaboration code generation via chatgpt. CoRR, abs/2304.07590, 2023. doi: 10.48550/ARXIV.2304.07590. URLhttps://doi.org/10.
董一红,姜雪,金智,李歌。通过 chatgpt 进行自协作代码生成。CoRR,abs/2304.07590,2023。doi:10.48550/ARXIV.2304.07590。URLhttps://doi.org/10.
Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. CoRR, abs/2305.14325,
Yilun Du,Shuang Li,Antonio Torralba,Joshua B. Tenenbaum 和 Igor Mordatch。通过多智能体辩论提高语言模型的事实性和推理能力。CoRR,abs/2305.14325。

2023a. doi: 10.48550/ARXIV.2305.14325. URLhttps://doi.org/10.48550/arXiv. 2305.14325 .
2023a。doi:10.48550/ARXIV.2305.14325。URLhttps://doi.org/10.48550/arXiv. 2305.14325。
Zefeng Du, Wenxiang Jiao, Longyue Wang, Chenyang Lyu, Jianhui Pang, Leyang Cui, Kaiqiang Song, Derek F Wong, Shuming Shi, and Zhaopeng Tu. On extrapolation of long-text translation with large language models. 2023b.
杜泽峰,焦文祥,王龙跃,吕晨阳,庞建辉,崔乐阳,宋凯强,黄锐德,石树明,屠兆鹏。关于利用大型语言模型进行长文本翻译的外推。2023b。
Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
Yann Dubois,Balázs Galambosi,Percy Liang 和 Tatsunori B Hashimoto。长度控制的 alpacaeval:去偏自动评估器的简单方法。arXiv 预印本 arXiv:2404.04475,2024。
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: model alignment as prospect theoretic optimization. CoRR, abs/2402.01306, 2024. doi: 10.48550/ ARXIV.2402.01306. URL/https://doi.org/10.48550/arXiv.2402.01306.
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: 模型对齐作为前景理论优化。CoRR,abs/2402.01306,2024 年。doi:10.48550/ARXIV.2402.01306。URL/https://doi.org/10.48550/arXiv.2402.01306。
Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Michael Auli, and Armand Joulin. Beyond englishcentric multilingual machine translation. J. Mach. Learn. Res., 22:107:1-107:48, 2021. URL http://jmlr.org/papers/v22/20-1307.html.
安吉拉·范(Angela Fan),Shruti Bhosale,Holger Schwenk,马志毅,Ahmed El-Kishky,Siddharth Goyal,Mandeep Baines,Onur Celebi,Guillaume Wenzek,Vishrav Chaudhary,Naman Goyal,汤姆·伯奇(Tom Birch),Vitaliy Liptchinsky,Sergey Edunov,Michael Auli 和 Armand Joulin。超越以英语为中心的多语言机器翻译。J. Mach. Learn. Res.,22:107:1-107:48,2021。URL http://jmlr.org/papers/v22/20-1307.html。
Yukun Feng, Feng Li, Ziang Song, Boyuan Zheng, and Philipp Koehn. Learn to remember: Transformer with recurrent memory for document-level machine translation. In Marine Carpuat, MarieCatherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Findings of the Association for Computational Linguistics: NAACL 2022, pp. 1409-1420, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-naacl.105. URL https://aclanthology.org/2022.findings-naacl.105.
冯宇坤,李峰,宋子昂,郑博远和 Philipp Koehn。学会记忆:具有递归记忆的 Transformer 用于文档级机器翻译。在 Marine Carpuat,Marie-Catherine de Marneffe 和 Ivan Vladimir Meza Ruiz(编辑)的《计算语言学协会发现:NAACL 2022》中,第 1409-1420 页,美国西雅图,2022 年 7 月。计算语言学协会。doi:10.18653/v1/2022.findings-naacl.105。URL https://aclanthology.org/2022.findings-naacl.105。
Markus Freitag, David Grangier, and Isaac Caswell. BLEU might be guilty but references are not innocent. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 61-71, Online, November 2020. Association for Computational Linguistics. doi: . emnlp-main.5. URLhttps://aclanthology.org/2020.emnlp-main.5.
马库斯·弗莱塔格(Markus Freitag)、大卫·格朗吉尔(David Grangier)和艾萨克·卡斯韦尔(Isaac Caswell)。BLEU 可能有罪,但参考文献并不无辜。在 Bonnie Webber、Trevor Cohn、Yulan He 和 Yang Liu(编辑)的《2020 年自然语言处理经验方法会议论文集》(EMNLP)中,第 61-71 页,2020 年 11 月在线出版。计算语言学协会。doi: . emnlp-main.5. URLhttps://aclanthology.org/2020.emnlp-main.5.
Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460-1474, 2021. doi: 10.1162/tacl_a_00437. URLhttps://aclanthology.org/2021.tacl-1.87
马库斯·弗莱塔格,乔治·福斯特,大卫·格兰吉尔,维雷什·拉特纳卡尔,齐军·谭和沃尔夫冈·马赫雷。专家、错误和上下文:机器翻译人类评估的大规模研究。计算语言学协会交易,9:1460-1474,2021 年。doi:10.1162/tacl_a_00437。URLhttps://aclanthology.org/2021.tacl-1.87
Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, George Foster, Alon Lavie, and André F. T. Martins. Results of WMT22 metrics shared task: Stop using BLEU - neural metrics are better and more robust. In Philipp Koehn, Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Tom Kocmi, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Marco Turchi, and Marcos Zampieri (eds.), Proceedings of the Seventh Conference on Machine Translation (WMT), pp. 46-68, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. URLhttps://aclanthology.org/2022.wmt-1.2.
马库斯·弗莱塔格,里卡多·雷伊,尼蒂卡·马图尔,卢志九,克雷格·斯图尔特,埃莱夫泰里奥斯·阿夫拉米迪斯,汤姆·科克米,乔治·福斯特,阿隆·拉维,安德烈·F·T·马丁斯。WMT22 指标共享任务结果:停止使用 BLEU - 神经指标更好更稳健。在菲利普·科恩,洛伊克·巴罗,奥德烈·博亚尔,费蒂·布加雷斯,拉杰恩·查特吉,玛尔塔·R·科斯塔-胡萨,克里斯蒂安·费德曼,马克·菲舍尔,亚历山大·弗雷泽,马库斯·弗莱塔格,伊维特·格雷厄姆,罗曼·格伦基维奇,帕科·古兹曼,巴里·哈多,马蒂亚斯·胡克,安东尼奥·希梅诺·耶佩斯,汤姆·科克米,安德烈·马丁斯,森本真,克里斯托夫·蒙兹,永田正明,中泽敏明,马泰奥·内格里,奥雷利·内维奥尔,玛丽安娜·内维斯,马丁·波佩尔,马可·图尔奇,和马科斯·扎皮耶里(编辑),第七届机器翻译会议论文集(WMT),第 46-68 页,阿布扎比,阿拉伯联合酋长国(混合),2022 年 12 月。计算语言学协会。URLhttps://aclanthology.org/2022.wmt-1.2。
Markus Freitag, Nitika Mathur, Chi-kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Tom Kocmi, Frederic Blain, Daniel Deutsch, Craig Stewart, Chrysoula Zerva, Sheila Castilho, Alon Lavie, and George Foster. Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz (eds.), Proceedings of the Eighth Conference on Machine Translation, pp. 578-628, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.51. URLhttps://aclanthology.org/2023.wmt-1.51
马库斯·弗莱塔格,尼蒂卡·马图尔,罗志九,埃莱夫泰里奥斯·阿夫拉米迪斯,里卡多·雷伊,布莱恩·汤普森,汤姆·科克米,弗雷德里克·布兰,丹尼尔·德国,克雷格·斯图尔特,克里斯乌拉·泽尔瓦,谢拉·卡斯蒂略,阿隆·拉维,乔治·福斯特。WMT23 指标共享任务结果:指标可能有罪,但参考文献并不无辜。在菲利普·科恩,巴里·哈多,汤姆·科克米和克里斯托夫·蒙兹(主编)的《第八届机器翻译会议论文集》中,第 578-628 页,2023 年 12 月,新加坡。计算语言学协会。doi:10.18653/v1/2023.wmt-1.51。URLhttps://aclanthology.org/2023.wmt-1.51
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp. 1243-1252. PMLR, 2017. URL http://proceedings.mlr.press/v70/gehring17a.html
乔纳斯·格林、迈克尔·奥利、大卫·格朗吉尔、丹尼斯·亚拉茨和扬·N·多芬。卷积序列到序列学习。在 Doina Precup 和 Yee Whye Teh(编辑),第 34 届国际机器学习会议 ICML 2017 论文集,悉尼,新南威尔士州,澳大利亚,2017 年 8 月 6-11 日,机器学习研究论文集第 70 卷,页码 1243-1252。PMLR,2017。URL http://proceedings.mlr.press/v70/gehring17a.html
Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. Mask-predict: Parallel decoding of conditional masked language models. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6112-6121, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1633. URLhttps://aclanthology.org/ D19-1633
Marjan Ghazvininejad,Omer Levy,Yinhan Liu 和 Luke Zettlemoyer。Mask-predict:条件掩码语言模型的并行解码。在 Kentaro Inui,Jing Jiang,Vincent Ng 和 Xiaojun Wan(编辑)的《2019 年自然语言处理经验方法会议和第 9 届国际自然语言处理联合会议论文集》中,第 6112-6121 页,中国香港,2019 年 11 月。计算语言学协会。doi:10.18653/v1/D19-1633。URLhttps://aclanthology.org/ D19-1633
Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. Nonautoregressive neural machine translation. CoRR, abs/1711.02281, 2017. URL http:// arxiv.org/abs/1711.02281
顾佳涛,詹姆斯·布拉德伯里,熊才明,李维克多,理查德·索切尔。非自回归神经机器翻译。CoRR,abs/1711.02281,2017 年。URL http:// arxiv.org/abs/1711.02281
Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. Meta-learning for lowresource neural machine translation. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3622-3631, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1398. URLhttps://aclanthology. org/D18-1398
顾佳涛,王勇,陈云,李维克多和卓庆炫。元学习用于低资源神经机器翻译。在 Ellen Riloff,David Chiang,Julia Hockenmaier 和 Jun'ichi Tsujii(编辑)的《2018 年自然语言处理经验方法会议论文集》中,第 3622-3631 页,2018 年 10 月至 11 月,比利时布鲁塞尔。计算语言学协会。doi:10.18653/v1/D18-1398。URLhttps://aclanthology.org/D18-1398
Jiatao Gu, Changhan Wang, and Junbo Zhao. Levenshtein transformer. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 11179-11189, 2019a. URLhttps://proceedings.neurips.cc/paper/ 2019/hash/675f9820626f5bc0afb47b57890b466e-Abstract.html.
顾佳涛,王长翰和赵俊波。Levenshtein 变换器。在 Hanna M. Wallach,Hugo Larochelle,Alina Beygelzimer,Florence d'Alché-Buc,Emily B. Fox 和 Roman Garnett(编辑),神经信息处理系统 32 的进展:神经信息处理系统 2019 年年会,NeurIPS 2019,2019 年 12 月 8-14 日,加拿大卑诗省温哥华,第 11179-11189 页,2019a。URLhttps://proceedings.neurips.cc/paper/2019/hash/675f9820626f5bc0afb47b57890b466e-Abstract.html.
Jiatao Gu, Changhan Wang, and Junbo Zhao. Levenshtein transformer. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 11179-11189, 2019b. URLhttps://proceedings.neurips.cc/paper/ 2019/hash/675f9820626f5bc0afb47b57890b466e-Abstract.html.
顾佳涛,王长翰,赵俊波。Levenshtein 变换器。在 Hanna M. Wallach,Hugo Larochelle,Alina Beygelzimer,Florence d'Alché-Buc,Emily B. Fox 和 Roman Garnett(编辑),神经信息处理系统 32 的进展:神经信息处理系统 2019 年年会,NeurIPS 2019,2019 年 12 月 8-14 日,加拿大卑诗省温哥华,第 11179-11189 页,2019b。URLhttps://proceedings.neurips.cc/paper/2019/hash/675f9820626f5bc0afb47b57890b466e-Abstract.html.
Nuno Miguel Guerreiro, Ricardo Rei, Daan van Stigt, Luísa Coheur, Pierre Colombo, and André F. T. Martins. xcomet: Transparent machine translation evaluation through fine-grained error detection. CoRR, abs/2310.10482, 2023. doi: 10.48550/ARXIV.2310.10482. URL https: //doi.org/10.48550/arXiv.2310.10482,
Nuno Miguel Guerreiro, Ricardo Rei, Daan van Stigt, Luísa Coheur, Pierre Colombo, 和 André F. T. Martins. xcomet: 通过细粒度错误检测实现透明的机器翻译评估。CoRR, abs/2310.10482, 2023. doi: 10.48550/ARXIV.2310.10482. URL https://doi.org/10.48550/arXiv.2310.10482,
Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. CoRR, abs/2402.01680, 2024. doi: 10.48550/ARXIV.2402.01680. URL https: //doi.org/10.48550/arXiv.2402.01680.
太成郭,秀英陈,雅琦王,瑞迪常,世超裴,尼特什·V.查瓦拉,奥拉夫·维斯特和向亮张。基于大型语言模型的多智能体:进展和挑战综述。CoRR,abs/2402.01680,2024 年。doi:10.48550/ARXIV.2402.01680。URL https://doi.org/10.48550/arXiv.2402.01680。
Barry Haddow, Rachel Bawden, Antonio Valerio Miceli Barone, Jindrich Helcl, and Alexandra Birch. Survey of low-resource machine translation. Comput. Linguistics, 48(3):673-732, 2022. doi: 10.1162/COLI\A_00446. URL https://doi.org/10.1162/coli_a_00446
Barry Haddow,Rachel Bawden,Antonio Valerio Miceli Barone,Jindrich Helcl 和 Alexandra Birch。低资源机器翻译调查。计算机语言学,48(3):673-732,2022 年。doi:10.1162/COLI\A_00446。URL https://doi.org/10.1162/coli_a_00446
Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, and Dorsa Sadigh. Contrastive preference learning: Learning from human feedback without RL. CoRR, abs/2310.13639, 2023. doi: 10.48550/ARXIV.2310.13639. URLhttps://doi.org/
Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, and Dorsa Sadigh. 对比偏好学习:从人类反馈中学习而不需要强化学习。CoRR,abs/2310.13639,2023。doi:10.48550/ARXIV.2310.13639。URLhttps://doi.org/
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URLhttps://openreview.net/forum?id=d7KBjmI3GmQ
丹·亨德里克斯(Dan Hendrycks)、柯林·伯恩斯(Collin Burns)、史蒂文·巴萨特(Steven Basart)、安迪·邹(Andy Zou)、曼塔斯·马泽卡(Mantas Mazeika)、宋黎明(Dawn Song)和雅各布·斯坦哈特(Jacob Steinhardt)。《衡量大规模多任务语言理解》。在第 9 届国际学习表示会议 ICLR 2021 上,2021 年 5 月 3-7 日,奥地利虚拟活动。OpenReview.net,2021。URLhttps://openreview.net/forum?id=d7KBjmI3GmQ
Christian Herold and Hermann Ney. Improving long context document-level machine translation. In Michael Strube, Chloe Braud, Christian Hardmeier, Junyi Jessy Li, Sharid Loaiciga, and Amir Zeldes (eds.), Proceedings of the 4th Workshop on Computational Approaches to Discourse (CODI 2023), pp. 112-125, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.codi-1.15. URLhttps://aclanthology.org/2023. codi-1.15
Christian Herold 和 Hermann Ney。 改进长篇文档级机器翻译。 在 Michael Strube,Chloe Braud,Christian Hardmeier,Junyi Jessy Li,Sharid Loaiciga 和 Amir Zeldes(eds.)的《第 4 届计算话语方法研讨会论文集(CODI 2023)》中,第 112-125 页,加拿大多伦多,2023 年 7 月。 计算语言学协会。 doi:10.18653/v1/2023.codi-1.15。 URLhttps://aclanthology.org/2023.codi-1.15
Jiwoo Hong, Noah Lee, and James Thorne. ORPO: monolithic preference optimization without reference model. CoRR, abs/2403.07691, 2024. doi: 10.48550/ARXIV.2403.07691. URL https://doi.org/10.48550/arXiv.2403.07691.
Jiwoo Hong,Noah Lee 和 James Thorne。ORPO:无参考模型的单片偏好优化。CoRR,abs/2403.07691,2024 年。doi:10.48550/ARXIV.2403.07691。URL https://doi.org/10.48550/arXiv.2403.07691。
Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, and Chenglin Wu. Metagpt: Meta programming for multi-agent collaborative framework. CoRR, abs/2308.00352, 2023. doi: 10.48550/ARXIV.2308.00352. URLhttps://doi.org/10.48550/arXiv. 2308.00352 .
Tom Hosking, Phil Blunsom, and Max Bartolo. Human feedback is not gold standard. CoRR, abs/2309.16349, 2023. doi: 10.48550/ARXIV.2309.16349. URLhttps://doi.org/10.
汤姆·霍斯金(Tom Hosking)、菲尔·布伦索姆(Phil Blunsom)和马克斯·巴托洛(Max Bartolo)。 人类反馈不是金标准。 CoRR,abs/2309.16349,2023 年。 doi:10.48550/ARXIV.2309.16349。 URLhttps://doi.org/10.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URLhttps://openreview.net/forum?id=nZeVKeeFYf9.
Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, and Wei Wang. Followbench: A multi-level fine-grained constraints following benchmark for large language models. CoRR, abs/2310.20410, 2023. doi: 10.48550/ARXIV. 2310.20410. URL/https://doi.org/10.48550/arXiv.2310.20410.
Yuxin Jiang, Yufei Wang, Xingshan Zeng, Wanjun Zhong, Liangyou Li, Fei Mi, Lifeng Shang, Xin Jiang, Qun Liu, 和 Wei Wang。Followbench: 一个用于大型语言模型的多级细粒度约束跟踪基准。CoRR, abs/2310.20410, 2023. doi: 10.48550/ARXIV. 2310.20410。URL/ https://doi.org/10.48550/arXiv.2310.20410。
Juraj Juraska, Mara Finkelstein, Daniel Deutsch, Aditya Siddhant, Mehdi Mirzazadeh, and Markus Freitag. MetricX-23: The Google submission to the WMT 2023 metrics shared task. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz (eds.), Proceedings of the Eighth Conference on Machine Translation, pp. 756-767, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.63. URLhttps://aclanthology. org/2023.wmt-1.63.
Jeremy Klemin. The last frontier of machine translation. The Atlantic, 2024. URL https://www.theatlantic.com/technology/archive/2024/01/literary-translation-artificial-intelligence/677038/.
Tom Kocmi and Christian Federmann. GEMBA-MQM: Detecting translation quality error spans with GPT-4. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz (eds.), Proceedings of the Eighth Conference on Machine Translation, pp. 768-775, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.64. URL https://aclanthology.org/2023.wmt-1.64.
Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Haddow, Philipp Koehn, Benjamin Marie, Christof Monz, Makoto Morishita, Kenton Murray, Masaaki Nagata, Toshiaki Nakazawa, Martin Popel, Maja Popović, and Mariya Shmatova. Findings of the 2023 conference on machine translation (WMT23): LLMs are here but not quite there yet. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz (eds.), Proceedings of the Eighth Conference on Machine Translation, pp. 1-42, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.1. URL https://aclanthology.org/2023.wmt-1.1.
Haonan Li, Fajri Koto, Minghao Wu, Alham Fikri Aji, and Timothy Baldwin. Bactrian-x: A multilingual replicable instruction-following model with low-rank adaptation. CoRR, abs/2305.15011, 2023a. doi: 10.48550/ARXIV.2305.15011. URL https://doi.org/10.48550/arXiv.2305.15011.
Nian Li, Chen Gao, Yong Li, and Qingmin Liao. Large language model-empowered agents for simulating macroeconomic activities. CoRR, abs/2310.10436, 2023b. doi: 10.48550/ARXIV.2310.10436. URL https://doi.org/10.48550/arXiv.2310.10436.
Pengfei Li, Liangyou Li, Meng Zhang, Minghao Wu, and Qun Liu. Universal conditional masked language pre-training for neural machine translation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6379-6391, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.442. URL https://aclanthology.org/2022.acl-long.442.
Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, and Wenhu Chen. Long-context llms struggle with long in-context learning. arXiv preprint arXiv:2404.02060, 2024.
Yingji Li, Mengnan Du, Rui Song, Xin Wang, and Ying Wang. A survey on fairness in large language models. CoRR, abs/2308.10149, 2023c. doi: 10.48550/ARXIV.2308.10149. URL https://doi.org/10.48550/arXiv.2308.10149.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel J. Orr, Lucia Zheng, Mert Yüksekgönül, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri S. Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models. CoRR, abs/2211.09110, 2022. doi: 10.48550/ARXIV.2211.09110. URL https://doi.org/10.48550/arXiv.2211.09110.
Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-agent debate. CoRR, abs/2305.19118, 2023. doi: 10.48550/ARXIV.2305.19118. URL https://doi.org/10.48550/arXiv.2305.19118.
Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. CoRR, abs/2402.09353, 2024. doi: 10.48550/ARXIV.2402.09353. URL https://doi.org/10.48550/arXiv.2402.09353.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726-742, 2020. doi: 10.1162/tacl_a_00343. URL https://aclanthology.org/2020.tacl-1.47.
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and methods for effective instruction tuning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 22631-22648. PMLR, 2023. URL https://proceedings.mlr.press/v202/longpre23a.html
Hongyuan Lu, Haoyang Huang, Dongdong Zhang, Haoran Yang, Wai Lam, and Furu Wei. Chain-of-dictionary prompting elicits translation in large language models. CoRR, abs/2305.06575, 2023. doi: 10.48550/ARXIV.2305.06575. URL https://doi.org/10.48550/arXiv.2305.06575.
Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. CoRR, abs/2306.08568, 2023. doi: 10.48550/ARXIV.2306.08568. URL https://doi.org/10.48550/arXiv.2306.08568.
Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, and Zhaopeng Tu. Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration. CoRR, abs/2306.09093, 2023. doi: 10.48550/ARXIV.2306.09093. URL https://doi.org/10.48550/arXiv.2306.09093.
Chenyang Lyu, Minghao Wu, and Alham Fikri Aji. Beyond probabilities: Unveiling the misalignment in evaluating large language models. CoRR, abs/2402.13887, 2024. doi: 10.48550/ARXIV.2402.13887. URL https://doi.org/10.48550/arXiv.2402.13887.
Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models. CoRR, abs/2307.04738, 2023. doi: 10.48550/ARXIV.2307.04738. URL https://doi.org/10.48550/arXiv.2307.04738.
Philip M McCarthy and Scott Jarvis. Mtld, vocd-d, and hd-d: A validation study of sophisticated approaches to lexical diversity assessment. Behavior research methods, 42(2):381-392, 2010.
Gabriel Mukobi, Hannah Erlebach, Niklas Lauffer, Lewis Hammond, Alan Chan, and Jesse Clifton. Welfare diplomacy: Benchmarking language model cooperation. CoRR, abs/2310.08901, 2023. doi: 10.48550/ARXIV.2310.08901. URL https://doi.org/10.48550/arXiv.2310.08901.
OpenAI. GPT-4 technical report. CoRR, abs/2303.08774, 2023. doi: 10.48550/arXiv.2303.08774. URL https://doi.org/10.48550/arXiv.2303.08774.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin (eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics. doi: 10.3115/1073083.1073135. URL https://aclanthology.org/P02-1040.
Joon Sung Park, Lindsay Popowski, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Social simulacra: Creating populated prototypes for social computing systems. In Maneesh Agrawala, Jacob O. Wobbrock, Eytan Adar, and Vidya Setlur (eds.), The 35th Annual ACM Symposium on User Interface Software and Technology, UIST 2022, Bend, OR, USA, 29 October 2022 - 2 November 2022, pp. 74:1-74:18. ACM, 2022. doi: 10.1145/3526113.3545616. URL https://doi.org/10.1145/3526113.3545616.
Joon Sung Park, Joseph C. O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Sean Follmer, Jeff Han, Jürgen Steimle, and Nathalie Henry Riche (eds.), Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST 2023, San Francisco, CA, USA, 29 October 2023 - 1 November 2023, pp. 2:1-2:22. ACM, 2023. doi: 10.1145/3586183.3606763. URL https://doi.org/10.1145/3586183.3606763.
Matt Post. A call for clarity in reporting BLEU scores. In Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, and Karin Verspoor (eds.), Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186-191, Brussels, Belgium, October 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-6319. URL https://aclanthology.org/W18-6319
Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. CoRR, abs/2307.07924, 2023. doi: 10.48550/ARXIV.2307.07924. URL https://doi.org/10.48550/arXiv.2307.07924.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html.
Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. COMET: A neural framework for MT evaluation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2685-2702, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.213. URL https://aclanthology.org/2020.emnlp-main.213.
Nathaniel Robinson, Perez Ogayo, David R. Mortensen, and Graham Neubig. ChatGPT MT: Competitive for high- (but not low-) resource languages. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz (eds.), Proceedings of the Eighth Conference on Machine Translation, pp. 392-418, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.40. URL https://aclanthology.org/2023.wmt-1.40.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Févry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M. Rush. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=9Vrb9D0WI4.
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilic, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, and et al. BLOOM: A 176B-parameter open-access multilingual language model. CoRR, abs/2211.05100, 2022. doi: 10.48550/ARXIV.2211.05100. URL https://doi.org/10.48550/arXiv.2211.05100.
Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7881-7892, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.704. URL https://aclanthology.org/2020.acl-main.704.
Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, and Denny Zhou. Flan-moe: Scaling instruction-finetuned language models with sparse mixture of experts. CoRR, abs/2305.14705, 2023. doi: 10.48550/ARXIV.2305.14705. URL https://doi.org/10.48550/arXiv.2305.14705.
Tianxiao Shen, Myle Ott, Michael Auli, and Marc'Aurelio Ranzato. Mixture models for diverse machine translation: Tricks of the trade. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 5719-5728. PMLR, 2019. URL http://proceedings.mlr.press/v97/shen19c.html.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=vAElhFcKW6.
Mingyang Song, Mao Zheng, and Xuan Luo. Counting-stars: A simple, efficient, and reasonable strategy for evaluating long-context large language models. CoRR, abs/2403.11802, 2024. doi: 10.48550/ARXIV.2403.11802. URL https://doi.org/10.48550/arXiv.2403.11802.
Zewei Sun, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Shujian Huang, Jiajun Chen, and Lei Li. Rethinking document-level neural machine translation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp. 3537-3548, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.279. URL https://aclanthology.org/2022.findings-acl.279.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 3104-3112, 2014. URL https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html.
Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, and Donald Metzler. UL2: unifying language learning paradigms. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=6ruVLB727MC.
Katherine Thai, Marzena Karpinska, Kalpesh Krishna, Bill Ray, Moira Inghilleri, John Wieting, and Mohit Iyyer. Exploring document-level literary machine translation with parallel paragraphs from world literature. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 9882-9902, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.672. URL https://aclanthology.org/2022.emnlp-main.672.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a. doi: 10.48550/ARXIV.2302.13971. URL https://doi.org/10.48550/arXiv.2302.13971.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and finetuned chat models. CoRR, abs/2307.09288, 2023b. doi: 10.48550/ARXIV.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998-6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
Rob Voigt and Dan Jurafsky. Towards a literary machine translation: The role of referential cohesion. In David Elson, Anna Kazantseva, Rada Mihalcea, and Stan Szpakowicz (eds.), Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature, pp. 18-25, Montréal, Canada, June 2012. Association for Computational Linguistics. URL https://aclanthology.org/W12-2503.
Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. Exploiting cross-sentence context for neural machine translation. In Martha Palmer, Rebecca Hwa, and Sebastian Riedel (eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2826-2831, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1301. URL https://aclanthology.org/D17-1301.
Longyue Wang, Chenyang Lyu, Tianbo Ji, Zhirui Zhang, Dian Yu, Shuming Shi, and Zhaopeng Tu. Document-level machine translation with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 16646-16661, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.1036. URL https://aclanthology.org/2023.emnlp-main.1036.
Longyue Wang, Zhaopeng Tu, Yan Gu, Siyou Liu, Dian Yu, Qingsong Ma, Chenyang Lyu, Liting Zhou, Chao-Hong Liu, Yufeng Ma, Weiyu Chen, Yvette Graham, Bonnie Webber, Philipp Koehn, Andy Way, Yulin Yuan, and Shuming Shi. Findings of the WMT 2023 shared task on discourse-level literary translation: A fresh orb in the cosmos of LLMs. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz (eds.), Proceedings of the Eighth Conference on Machine Translation, pp. 55-67, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.3. URL https://aclanthology.org/2023.wmt-1.3.
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 5085-5109, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.340. URL https://aclanthology.org/2022.emnlp-main.340.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484-13508, Toronto, Canada, July 2023c. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. URL https://aclanthology.org/2023.acl-long.754.
Zhanyu Wang, Longyue Wang, Zhen Zhao, Minghao Wu, Chenyang Lyu, Huayang Li, Deng Cai, Luping Zhou, Shuming Shi, and Zhaopeng Tu. Gpt4video: A unified multimodal large language model for instruction-followed understanding and safety-aware generation. CoRR, abs/2311.16511, 2023d. doi: 10.48550/ARXIV.2311.16511. URL https://doi.org/10.48550/arXiv.2311.16511.
Zihao Wang, Shaofei Cai, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. CoRR, abs/2302.01560, 2023e. doi: 10.48550/ARXIV.2302.01560. URL https://doi.org/10.48550/arXiv.2302.01560.
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=gEZrGCozdqR.
Michael J. Wooldridge and Nicholas R. Jennings. Intelligent agents: theory and practice. Knowl. Eng. Rev., 10(2):115-152, 1995. doi: 10.1017/S0269888900008122. URL https://doi.org/10.1017/S0269888900008122.
Minghao Wu and Alham Fikri Aji. Style over substance: Evaluation biases for large language models. CoRR, abs/2307.03025, 2023. doi: 10.48550/ARXIV.2307.03025. URL https://doi.org/10.48550/arXiv.2307.03025.
Minghao Wu, Yitong Li, Meng Zhang, Liangyou Li, Gholamreza Haffari, and Qun Liu. Uncertainty-aware balancing for multilingual and multi-domain neural machine translation training. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7291-7305, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.580. URL https://aclanthology.org/2021.emnlp-main.580.
Minghao Wu, George Foster, Lizhen Qu, and Gholamreza Haffari. Document flattening: Beyond concatenating context for document-level neural machine translation. In Andreas Vlachos and Isabelle Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 448-462, Dubrovnik, Croatia, May 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.33. URL https://aclanthology.org/2023.eacl-main.33.
Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. Lamini-lm: A diverse herd of distilled models from large-scale instructions. CoRR, abs/2304.14402, 2023b. doi: 10.48550/ARXIV.2304.14402. URL https://doi.org/10.48550/arXiv.2304.14402.
Minghao Wu, Thuy-Trang Vu, Lizhen Qu, George F. Foster, and Gholamreza Haffari. Adapting large language models for document-level machine translation. CoRR, abs/2401.06468, 2024a. doi: 10.48550/ARXIV.2401.06468. URL https://doi.org/10.48550/arXiv.2401.06468.
Minghao Wu, Yufei Wang, George Foster, Lizhen Qu, and Gholamreza Haffari. Importance-aware data augmentation for document-level neural machine translation. In Yvette Graham and Matthew Purver (eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 740-752, St. Julian's, Malta, March 2024b. Association for Computational Linguistics. URL https://aclanthology.org/2024.eacl-long.44.
Yuhao Xie, Zongyao Li, Zhanglin Wu, Daimeng Wei, Xiaoyu Chen, Zhiqiang Rao, Shaojun Li, Hengchao Shang, Jiaxin Guo, Lizhi Lei, Hao Yang, and Yanfei Jiang. HW-TSC's submissions to the WMT23 discourse-level literary translation shared task. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz (eds.), Proceedings of the Eighth Conference on Machine Translation, pp. 302-306, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.32. URL https://aclanthology.org/2023.wmt-1.32.
Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. A paradigm shift in machine translation: Boosting translation performance of large language models. CoRR, abs/2309.11674, 2023a. doi: 10.48550/ARXIV.2309.11674. URL https://doi.org/10.48550/arXiv.2309.11674.
Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation. CoRR, abs/2401.08417, 2024. doi: 10.48550/ARXIV.2401.08417. URL https://doi.org/10.48550/arXiv.2401.08417.
Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, and Yang Liu. Exploring large language models for communication games: An empirical study on werewolf. CoRR, abs/2309.04658, 2023b. doi: 10.48550/ARXIV.2309.04658. URL https://doi.org/10.48550/arXiv.2309.04658.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=WE_vluYUL-X.
Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mammoth: Building math generalist models through hybrid instruction tuning. CoRR, abs/2309.05653, 2023. doi: 10.48550/ARXIV.2309.05653. URL https://doi.org/10.48550/arXiv.2309.05653.
Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models. CoRR, abs/2307.02485, 2023a. doi: 10.48550/ARXIV.2307.02485. URL https://doi.org/10.48550/arXiv.2307.02485.
Shaolei Zhang, Qingkai Fang, Zhuocheng Zhang, Zhengrui Ma, Yan Zhou, Langlin Huang, Mengyu Bu, Shangtong Gui, Yunji Chen, Xilin Chen, and Yang Feng. Bayling: Bridging cross-lingual alignment and instruction following through interactive translation for large language models. CoRR, abs/2306.10968, 2023b. doi: 10.48550/ARXIV.2306.10968. URL https://doi.org/10.48550/arXiv.2306.10968.
Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren's song in the AI ocean: A survey on hallucination in large language models. CoRR, abs/2309.01219, 2023c. doi: 10.48550/ARXIV.2309.01219. URL https://doi.org/10.48550/arXiv.2309.01219.
Anqi Zhao, Kaiyu Huang, Hao Yu, and Degen Huang. DUTNLP system for the WMT2023 discourse-level literary translation. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz (eds.), Proceedings of the Eighth Conference on Machine Translation, pp. 296-301, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.31. URL https://aclanthology.org/2023.wmt-1.31.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023, 2023a. URL http://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html.
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. CoRR, abs/2306.05685, 2023b. doi: 10.48550/arXiv.2306.05685. URL https://doi.org/10.48550/arXiv.2306.05685.
Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. Transfer learning for low-resource neural machine translation. In Jian Su, Kevin Duh, and Xavier Carreras (eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1568-1575, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1163. URL https://aclanthology.org/D16-1163.

  1. Longyue Wang is the corresponding author: vinnylywang@tencent.com.
  2. Model signature: gpt-4-1106-preview
  3. Model signature: nrefs:2|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1
    In our preliminary study, we conduct a small-scale MQM-based human evaluation and also observe that our approach, TRANSAGENTS, receives a low MQM score.
  4. We initially attempt to collect responses directly from web novel forums, such as the WebNovels subreddit on Reddit. However, this approach proves to be too slow and sometimes violates the community rules of these platforms.
  5. We could not find the direct source of this information from the American Translators Association. Our source of information is available at https://tinyurl.com/bdze92xr. We assume that the recommended rate of USD per word is based on the number of words in the English language text.