License: arXiv.org perpetual non-exclusive license
arXiv:2404.17780v1 [cs.MA] 27 Apr 2024

Verco: Learning Coordinated Verbal Communication for Multi-agent Reinforcement Learning

Dapeng Li1,2, Hang Dong3, Lu Wang3, Bo Qiao3, Si Qin3, Qingwei Lin3
   Dongmei Zhang3, Qi Zhang3, Zhiwei Xu1,2, Bin Zhang1,2, Guoliang Fan1,2

1Institute of Automation, Chinese Academy of Sciences

2School of Artificial Intelligence, University of Chinese Academy of Sciences

3Microsoft
lidapeng2020@ia.ac.cn
Abstract

In recent years, multi-agent reinforcement learning algorithms have made significant advances in diverse gaming environments, leading to increased interest in the broader application of such techniques. To address the prevalent challenge of partial observability, communication-based algorithms have improved cooperative performance through the sharing of numerical embeddings between agents. However, our understanding of how collaborative mechanisms form remains very limited, making the design of a human-understandable communication mechanism a valuable problem to address. In this paper, we propose a novel multi-agent reinforcement learning algorithm that embeds large language models into agents, endowing them with the ability to generate human-understandable verbal communication. The framework consists of a message module and an action module. The message module is responsible for generating and sending verbal messages to other agents, effectively enhancing information sharing among agents. To further strengthen the message module, we employ a teacher model to generate message labels from the global view and update the student model through Supervised Fine-Tuning (SFT). The action module receives messages from other agents and selects actions based on the current local observation and the received messages. Experiments conducted on the Overcooked game demonstrate that our method significantly improves learning efficiency and performance over existing methods, while also providing an interpretable tool for humans to understand the process of multi-agent cooperation.

1 Introduction

Cooperative multi-agent reinforcement learning (MARL) has received widespread research attention due to its extensive practical applications in fields such as traffic control [1], multi-robot control [2], and sensor networks [3]. However, key issues such as sample efficiency [4, 5], non-stationarity [6], and the interpretability of cooperation [7, 8] remain obstacles to further advancing these practical applications. To address these challenges, significant progress has been made in cooperative MARL. Among existing approaches, the centralized training and decentralized execution paradigm [9, 10, 11, 12, 13, 14] alleviates the non-stationarity problem by introducing additional information during training. However, due to partial observability, the strategies learned by agents may be fragile, since uncertainty about other agents during execution can lead to catastrophic incoordination and sub-optimality [15, 16]. Inspired by human cooperation, a series of works achieve communication between individuals by exchanging their observations or hidden embeddings [17, 18, 19], thereby stabilizing the learning process and promoting more efficient cooperation among agents. These methods usually treat messages as black boxes, sending numerical messages and assuming that the policy network can adaptively extract content that is helpful for learning during training. As a result, the communication content is usually presented in a form that humans cannot understand, making the communication mechanism hard to interpret.

Figure 1: Incorrect messages can easily lead to conflicts, while coordinated messages can promote efficient cooperation among agents.

One natural and interpretable way to communicate is to directly generate verbal language as communication messages, which also means the policy network needs the ability to understand verbal text. Recent work has shown that using large language models to understand knowledge texts can effectively improve sample efficiency in complex decision-making tasks [20, 21]. By aligning the prior knowledge of large language models (LLMs) with the functional requirements of the environment using only a small amount of environmental interaction data, the LLM can achieve good performance. Meanwhile, since LLMs take verbal text as model input, it is natural to use LLMs for verbal communication. However, there are key difficulties in directly generating verbal messages: i) the candidate space of generated messages is too large to explore; ii) the text output lacks gradients and cannot be optimized end-to-end with environmental rewards; iii) messages generated by agents from local observations are prone to conflicts (as shown in Figure 1).

To address the aforementioned challenges, this paper proposes a novel multi-agent algorithm for learning coordinated VERbal COmmunication (Verco) that is understandable to humans. We first use a powerful LLM (e.g., GPT-4) as the teacher model and perform Supervised Fine-Tuning (SFT) on a student model equipped with low-rank adapters (LoRA) [22] based on the output of the teacher model. The policy learning module is then fine-tuned online through interaction with the environment, aligning the model's prior knowledge with the environment. We train separate LoRA parameters for the communication module and the policy learning module, avoiding mutual interference between the two modules as well as the cost of managing and training multiple large models. Overall, the communication module generates and sends text messages based on the local observations of the agents to promote cooperation between them, while the action module receives local observations and messages sent by other agents and selects an action from the candidate action set. We conduct experiments in different scenarios of the popular multi-agent decision-making environment Overcooked and compare Verco with existing baseline methods. We find that introducing verbal communication significantly improves the performance of agents as well as the interpretability of their cooperation.

2 Related Work

In this section, we introduce work related to this paper.

2.1 LLMs for Decision Making

LLMs trained on large datasets have shown remarkable abilities in various downstream tasks. Recent works have gradually applied LLMs to complex tasks such as robot control [23, 24], planning generation [25], and embodied agents [26, 27, 28, 29]. Among them, SayCan [26] and LLM+P [25] use the LLM as a high-level planner to generate long-term plans for agents based on task goals, but often do not interact with the environment directly. Voyager [29] uses GPT-4 to accomplish highly complex tasks in the Minecraft game and continuously generates new skills during the learning process. ReAct [30] combines reasoning and action to enhance LLMs' reasoning ability toward high-level goals, while improving the interpretability and credibility of LLMs' decisions. As a supplement, [31] adds an additional reporter to provide useful information to the planner. However, these methods cannot improve based on environmental feedback, making it difficult for them to adapt to different environments. In recent works, GLAM [21] and TWOSOME [20] both use reinforcement learning (RL) to enable LLMs to interact and learn in a single-agent environment, and the rich prior experience of the LLM itself can significantly reduce the learning cost of agents.

2.2 Finetuning LLMs

In recent works, finetuning has been employed to improve the performance of LLMs in specific tasks across different domains. Among these approaches, the use of RL to enhance the consistency between the LLM and human preferences is a common practice [32]. Many reinforcement learning with human feedback (RLHF) methods utilize PPO [33] to learn a reward function from human datasets and fine-tune LLMs according to the learned reward function. A significant challenge in fine-tuning LLMs is how to reduce training costs. Parameter-efficient finetuning (PEFT) methods (https://github.com/huggingface/peft) [34, 35] can significantly reduce the number of trainable LLM parameters while avoiding excessive performance loss. As a recent PEFT technique, low-rank adapters (LoRA) indirectly train the dense layers of neural networks by learning low-rank matrices. TWOSOME [20] fine-tunes a LoRA adapter as the actor model, thus enabling efficient fine-tuning while avoiding mutual interference between the actor and critic.

2.3 MARL with Communication

Existing multi-agent cooperation algorithms have made significant progress in many complex scenarios, but the realistic setting of partial observability greatly limits the degree of cooperation among agents. To address this issue, researchers have proposed a series of communication-based cooperation paradigms [15, 18, 19, 36, 17] that facilitate agents' understanding of the environment and their teammates by exchanging local observations or intentions. However, most communication messages generated by existing communication algorithms are difficult for humans to understand; they are therefore not interpretable and cannot be improved explicitly. To make communication messages understandable and interpretable, using verbal text as the communication message naturally becomes a solution. The FAMA [37] algorithm extends GLAM to multi-agent settings and introduces a verbal communication module. However, the communication module in FAMA cannot be learned through interactions with the environment and can only select from a pre-defined set of message candidates, which significantly reduces the freedom of message generation and relies on the quality of the pre-specified candidates.

3 Preliminaries

3.1 Problem Formulation

We assume a textual MARL setting with language vocabulary $\mathcal{V}$. The multi-agent problem can be described as a partially observable Markov game (POMG), defined by the tuple $G=\left<S,U,\mathcal{N},\Omega,\mathcal{T},O,r,\gamma,\mathcal{V}\right>$, with $S$ the state space and $U\subset\mathcal{V}$ the action space. At each discrete time step $t$, each agent $i\in\mathcal{N}:=\{1,\dots,n\}$ selects an action $u_{i}\in U_{i}\subset\mathcal{V}$. $\mathcal{T}(s'|s,\boldsymbol{u}):S\times U\times S\rightarrow P(S)$ is the state transition function, where $\boldsymbol{u}=\{u_{1},\dots,u_{n}\}\in\boldsymbol{U}\equiv U^{n}$ is the joint action. Each agent $i$ obtains its textual partial observation $o_{i}\in\Omega$ through the observation function $O(s,i):S\times\mathcal{N}\rightarrow\Omega$. $r_{i}(s,u_{i}):S\times U_{i}\rightarrow\mathbb{R}$ is the reward function for each agent $i$, and $\gamma$ is the discount factor. The goal of each agent is to maximize its expected return. To meet the demands of communication and decision-making, we design a dual-LLM structure for agents, enabling them to send text messages based on their local observations while making decisions based on the messages received from other agents. By considering both local observations and teammates' messages, the agents can select coordinated actions. Our framework consists of three main components: the teacher policy $\pi^{tch}$, the message policy $\pi_{\eta}$, and the action policy $\pi_{\theta}$.
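
For concreteness, the sketch below outlines the textual POMG interface described above as a minimal Python stub. The class and method names (TextualPOMG, observe, step) are illustrative and are not part of the paper.

```python
from dataclasses import dataclass
from typing import Dict

# A minimal sketch of the textual POMG interface; names are illustrative.
@dataclass
class Transition:
    observations: Dict[int, str]   # o_i: textual partial observation of each agent
    rewards: Dict[int, float]      # r_i(s, u_i)
    done: bool

class TextualPOMG:
    def __init__(self, n_agents: int, gamma: float = 0.99):
        self.n_agents = n_agents    # |N|
        self.gamma = gamma          # discount factor

    def observe(self, agent_id: int) -> str:
        """O(s, i): render agent i's local view of the hidden state as text."""
        raise NotImplementedError

    def step(self, joint_action: Dict[int, str]) -> Transition:
        """Apply the joint action u = {u_1, ..., u_n}; the transition
        T(s'|s, u) is internal to the environment."""
        raise NotImplementedError
```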

3.2 LLM and Finetuning

We consider using LLMs as interactive agents in embodied environments. These LLMs are trained on massive amounts of text data and can generate human-like responses to questions or prompts. To improve the performance of LLMs in specific tasks, fine-tuning is commonly employed. Among these methods, LoRA (Low-Rank Adaptation) [22] is a popular fine-tuning approach that effectively reduces computational costs while maintaining satisfactory performance. Specifically, for a dense layer with weight matrix $W_{0}\in\mathbb{R}^{d\times k}$, LoRA injects trainable rank-decomposition matrices into the Transformer via $W_{0}+\Delta W=W_{0}+BA$, where $B\in\mathbb{R}^{d\times r}$ and $A\in\mathbb{R}^{r\times k}$. Since the rank $r$ is much smaller than $d$ and $k$, this significantly reduces the number of trainable parameters.
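
As an illustration of the LoRA update above, the following minimal PyTorch sketch wraps a frozen dense layer $W_{0}$ with a trainable low-rank correction $BA$. The rank, scaling, and initialization values are placeholders, not the settings used in the paper.

```python
import torch
import torch.nn as nn

# Minimal LoRA-augmented dense layer: y = (W0 + BA) x with W0 frozen.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # W0 stays frozen
            p.requires_grad_(False)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))          # B in R^{d x r}
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank update applied without materializing the full d x k matrix.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # trainable params
```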

4 Method

We introduce an LLM capable of text generation into each agent as a communication module and an action module to enhance the collaborative capabilities and interpretability of multi-agent systems. To accelerate training and improve the quality of communication messages, we first generate communication labels for supervised training of the communication module. Subsequently, the fine-tuned communication module weights are loaded into the agents, allowing them to generate highly flexible text messages that facilitate coordination among agents. The entire flow is shown in Figure 2. The action modules then interact with and improve within the environment using RL. In this section, we provide a detailed introduction to each part of the algorithm.

Figure 2: Verco framework: We first finetune the LoRA weight of the communication module with the global message label. Then we load the LoRA weight so the agent can directly generate verbal messages with its local observation. Meanwhile, the action policy takes the current local observation and text messages from other agents as input and outputs the decision. The action policy fine-tunes the weights using PPO based on the rewards returned by the environment.

4.1 Cooperation with verbal communication

In the context of human collaborative problems, the diversity of the information received by individuals and their distinct modes of thinking often render simple independent actions insufficient for reaching coordinated consensus. Consequently, natural language has become a prevalent and significant means of facilitating coordination and cooperation among humans. From another perspective, large language models are precisely designed for, and adept at, communicating and interacting in a human-like manner. It is therefore a natural progression to integrate large language models into agents as communication modules. These agents can employ the communication module to convey the information or intentions they observe to others, thereby enabling more effective collaboration.

Figure 3: Message module SFT stage: We employ a large model (GPT-4) as the teacher model to generate message samples based on global observations, and distill the learning into a smaller language model (LLaMA-7B) that serves as the communication model $\pi_{\eta}$.

4.2 Coordination Message Policy Pre-training

Relying solely on environment interaction to train a communication model that automatically generates reasonable messages from scratch is highly inefficient, while utilizing a large-scale model (e.g., GPT-4, LLaMA-65B) for communication is both costly and difficult to fine-tune. To facilitate the generation of coordinated and consistent messages in the communication module, we employ GPT-4 as a coordinated message generator, taking the local observations of all agents as input and generating customized message labels for each agent, $\{m_{i}\}^{n}_{i=1}\sim\pi^{tch}(p^{tch})$. The corresponding prompt $p^{tch}=\rho_{tch}(\{o_{i}\}^{n}_{i=1},n)$ is obtained through the teacher LLM's prompt function $\rho_{tch}$:

Teacher LLM prompt: "You are a message design assistant who needs to design messages for <$n$> agents. <Brief environment description>. The observation of Agent $1$: <$o_{1}$>, ..., The observation of Agent $n$: <$o_{n}$>. To complete <Task Goal>, design concise messages (within 10 words) for each agent to their teammate: <LLM response here>"
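
A minimal sketch of how the prompt function $\rho_{tch}$ could assemble this template is given below. The function name, argument names, and the (commented-out) GPT-4 client call are illustrative assumptions, not the paper's implementation.

```python
from typing import List

def rho_tch(observations: List[str], n: int, env_desc: str, task_goal: str) -> str:
    """Assemble the teacher prompt p_tch from all agents' local observations."""
    obs_lines = "\n".join(
        f"The observation of Agent {i + 1}: {o}" for i, o in enumerate(observations)
    )
    return (
        f"You are a message design assistant who needs to design messages for "
        f"{n} agents. {env_desc}\n{obs_lines}\n"
        f"To complete {task_goal}, design concise messages (within 10 words) "
        f"for each agent to their teammate:"
    )

# Querying the teacher model (illustrative; assumes the OpenAI Python SDK):
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": rho_tch(obs_list, 2, env_desc, task_goal)}],
# ).choices[0].message.content
```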

Supervised fine-tuning is conducted on the message samples using LLaMA-7B, which is updated via LoRA. The message prediction of agent $i$ is generated by the message policy $\hat{m}_{i}\sim\pi_{\eta}(\cdot|\rho_{m}(o_{i}))$. The SFT loss can be expressed as follows:

$$\mathcal{L}_{SFT}=\sum^{n}_{i=1}\mathrm{CROSS\_ENTROPY}(\hat{m}_{i},m_{i}), \qquad (1)$$

where $\mathrm{CROSS\_ENTROPY}$ denotes the cross-entropy loss function. The entire workflow of SFT is shown in Figure 3(a). To control the training scale, we let all agents share the action policy and the message module. Besides, we use the initialized agents to interact with the environment for several rounds to collect trajectory data. Empirically, with only a small amount of data and a few training steps, the communication policy $\pi_{\eta}$ can already generate reasonably coherent verbal messages.
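
The following sketch illustrates the SFT objective in Eq. (1) for a Hugging Face-style causal LM (e.g., LLaMA-7B with a LoRA adapter): only the tokens of the teacher message label contribute to the cross-entropy. Function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompts, message_labels, device="cuda"):
    """Cross-entropy of the teacher labels m_i under the student message policy
    pi_eta, conditioned on the prompt rho_m(o_i) (Eq. 1)."""
    losses = []
    for prompt, label in zip(prompts, message_labels):
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        label_ids = tokenizer(label, add_special_tokens=False,
                              return_tensors="pt").input_ids.to(device)
        input_ids = torch.cat([prompt_ids, label_ids], dim=1)
        logits = model(input_ids).logits                  # (1, T, vocab)
        # Label tokens are predicted from the positions just before them.
        pred = logits[:, prompt_ids.shape[1] - 1:-1, :]
        losses.append(F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                                      label_ids.reshape(-1)))
    return torch.stack(losses).mean()
```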

4.3 Action Policy Alignment

The action module also employs an LLM, which brings two distinct advantages: i) LLMs can directly and effectively comprehend verbal messages from teammates without further training; ii) LLMs have been shown to have learned a substantial amount of the physical rules present in the real world [38], and this knowledge can significantly reduce the number of interactions required between the policy and the environment, thereby improving sample efficiency. To align a general-purpose LLM with the specific environment, feedback from the environment is used to adapt the policy and yield better performance.

Inspired by previous works [21, 20], we do not directly prompt the LLM to generate actions; instead, we query the LLM for the probabilities of all available actions. Assume the $k$-th action of the $i$-th agent, $u_{i,k}\in U_{i}$, is a sequence of tokens $u_{i,k}=\{w^{1}_{i,k},\dots,w^{N_{i,k}}_{i,k}\}$, where $N_{i,k}$ is the number of tokens of the $k$-th action. The token-level probability of $u_{i,k}$ can then be calculated as:

$$P_{token}(u_{i,k}|\rho_{u}(o_{i}),\hat{m}_{-i})=\prod^{N_{i,k}}_{j=0}P(w^{j}_{i,k}|\rho_{u}(o_{i}),\hat{m}_{-i},w^{<j}_{i,k}), \qquad (2)$$

where $\rho_{u}$ is the action prompt function. Therefore, the action policy $\pi$ can be obtained by a softmax over the token-level probabilities of the actions:

$$P(u_{i,k}|\rho_{u}(o_{i}),\hat{m}_{-i})=\frac{\exp\big(\log P_{token}(u_{i,k}|\rho_{u}(o_{i}),\hat{m}_{-i})\big)}{\sum_{u_{i}\in U_{i}}\exp\big(\log P_{token}(u_{i}|\rho_{u}(o_{i}),\hat{m}_{-i})\big)}. \qquad (3)$$
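
The computation in Eqs. (2) and (3) can be sketched as follows for a Hugging Face-style causal LM: each candidate action's tokens are scored under the model given the action prompt (which includes the received messages), and a softmax over the summed token log-probabilities yields the action distribution. Names and tokenization details are assumptions.

```python
import torch
import torch.nn.functional as F

def action_distribution(model, tokenizer, action_prompt, candidate_actions, device="cuda"):
    """Eqs. (2)-(3): token-level log-probability of each candidate action given
    rho_u(o_i) and the received messages, normalized by a softmax over actions."""
    logps = []
    prompt_ids = tokenizer(action_prompt, return_tensors="pt").input_ids.to(device)
    for action in candidate_actions:
        action_ids = tokenizer(" " + action, add_special_tokens=False,
                               return_tensors="pt").input_ids.to(device)
        input_ids = torch.cat([prompt_ids, action_ids], dim=1)
        logits = model(input_ids).logits
        token_logp = F.log_softmax(logits[:, prompt_ids.shape[1] - 1:-1, :], dim=-1)
        # Sum of log P(w_j | prompt, w_<j) over the action's tokens (Eq. 2).
        logps.append(token_logp.gather(-1, action_ids.unsqueeze(-1)).sum())
    return F.softmax(torch.stack(logps), dim=0)          # Eq. (3)
```

At rollout time this can be wrapped in torch.no_grad(); during the PPO update the same quantity is computed with gradients enabled.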

In our work, we employ PPO [39] to optimize the action policy. PPO is a state-of-the-art actor-critic RL method, where each agent learns an actor parameterized by $\theta$ and a critic parameterized by $\phi$.

For each agent $i$, the policy loss can be formulated as follows:

$$\mathcal{L}_{i}(\theta)=\mathbb{E}_{o^{t}_{i},u^{t}_{i}}\Big[\min\Big(\frac{\pi_{\theta}(u^{t}_{i}|\rho_{u}(o^{t}_{i}),\hat{m}^{t}_{-i})}{\pi_{\theta_{old}}(u^{t}_{i}|\rho_{u}(o^{t}_{i}),\hat{m}^{t}_{-i})}A^{t}_{i},\ \mathrm{clip}\Big(\frac{\pi_{\theta}(u^{t}_{i}|\rho_{u}(o^{t}_{i}),\hat{m}^{t}_{-i})}{\pi_{\theta_{old}}(u^{t}_{i}|\rho_{u}(o^{t}_{i}),\hat{m}^{t}_{-i})},1-\epsilon,1+\epsilon\Big)A^{t}_{i}\Big)\Big], \qquad (4)$$

where $A^{t}_{i}$ is the Generalized Advantage Estimation (GAE) [40]. The critic network is composed of an additional MLP added to the last transformer block of the LLaMA model.

$$\mathcal{L}_{i}(\phi)=\mathbb{E}_{o^{t}_{i}}\Big[\min\Big\{\big(V_{\phi}(\rho_{u}(o^{t}_{i}))-\hat{V}^{t}_{i}\big)^{2},\ \big(V_{\phi_{old}}(\rho_{u}(o^{t}_{i}))+\mathrm{clip}\big(V_{\phi}(\rho_{u}(o^{t}_{i}))-V_{\phi_{old}}(\rho_{u}(o^{t}_{i})),-\epsilon,+\epsilon\big)-\hat{V}^{t}_{i}\big)^{2}\Big\}\Big], \qquad (5)$$

where $\hat{V}^{t}_{i}=A^{t}_{i}+V_{\phi_{old}}(\rho_{u}(o^{t}_{i}))$.

The overall learning loss is:

$$\mathcal{L}_{RL}=\sum^{n}_{i=1}\mathcal{L}_{i}(\theta)+\lambda_{critic}\mathcal{L}_{i}(\phi)+\lambda_{entropy}\mathcal{H}(\pi_{i,\theta}), \qquad (6)$$

where $\mathcal{H}(\pi_{i,\theta})$ denotes the entropy of agent $i$'s action policy $\pi_{i,\theta}$.
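
The sketch below shows one way to assemble the PPO objective used for action-policy alignment. Signs are arranged for gradient descent, so the surrogate of Eq. (4) is negated and the entropy term of Eq. (6) is applied as a bonus; the value clipping uses the common pessimistic (max) form rather than the min written in Eq. (5). Coefficient values are placeholders.

```python
import torch

def ppo_losses(logp, logp_old, adv, value, value_old, v_target, entropy,
               eps=0.2, lam_critic=0.5, lam_entropy=0.01):
    """Clipped policy loss (negated surrogate of Eq. 4), clipped value loss
    (pessimistic variant of Eq. 5), and the combined objective (Eq. 6)
    for one agent's batch of transitions."""
    ratio = torch.exp(logp - logp_old)                       # pi_theta / pi_theta_old
    policy_loss = -torch.min(ratio * adv,
                             torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
    v_clipped = value_old + torch.clamp(value - value_old, -eps, eps)
    value_loss = torch.max((value - v_target) ** 2,
                           (v_clipped - v_target) ** 2).mean()
    # Entropy is used as an exploration bonus, hence the negative sign here.
    return policy_loss + lam_critic * value_loss - lam_entropy * entropy.mean()
```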

Given that the goals of the action policy and the message policy are distinct, we load two independent LoRA weights to ensure that they do not interfere with each other. During the entire action policy alignment stage, the message policy is frozen to stabilize the training of the action policy. The actor, critic, and message networks share the same LLaMA model and their gradients do not affect each other, thereby greatly reducing training costs and memory requirements. The entire procedure is described in Algorithm 1.
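
Because the message LoRA must stay frozen during action-policy alignment, one simple way to enforce this in PyTorch is to disable gradients for all parameters registered under the message adapter, as in the sketch below. Matching parameters by adapter name is an assumption about how the two LoRA weights are registered on the shared LLaMA model.

```python
import torch.nn as nn

def freeze_adapter(model: nn.Module, adapter_name: str) -> None:
    """Freeze all parameters belonging to one LoRA adapter (e.g., the message
    adapter) so that RL gradients only update the action adapter."""
    for name, param in model.named_parameters():
        if adapter_name in name:      # name-based matching is an assumption
            param.requires_grad_(False)

# During the RL stage: keep the message policy fixed, train only the action LoRA.
# freeze_adapter(shared_llama, adapter_name="message_lora")
```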

Input : initialise action policy parameters θ𝜃\thetaitalic_θ, value function parameters ϕitalic-ϕ\phiitalic_ϕ, and message policy parameters η𝜂\etaitalic_η. Set the data buffer for SFT DSFT=subscript𝐷𝑆𝐹𝑇D_{SFT}=\emptysetitalic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT = ∅ and for RL training DRL=subscript𝐷𝑅𝐿D_{RL}=\emptysetitalic_D start_POSTSUBSCRIPT italic_R italic_L end_POSTSUBSCRIPT = ∅
初始化行动策略参数 θ𝜃\thetaitalic_θ 、价值函数参数 ϕitalic-ϕ\phiitalic_ϕ 及消息策略参数 η𝜂\etaitalic_η 。设置 SFT 的数据缓冲区 DSFT=subscript𝐷𝑆𝐹𝑇D_{SFT}=\emptysetitalic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT = ∅ 以及强化学习训练的数据缓冲区 DRL=subscript𝐷𝑅𝐿D_{RL}=\emptysetitalic_D start_POSTSUBSCRIPT italic_R italic_L end_POSTSUBSCRIPT = ∅
Output : θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, ϕsuperscriptitalic-ϕ\phi^{*}italic_ϕ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, and ηsuperscript𝜂\eta^{*}italic_η start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT
输出: θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPTϕsuperscriptitalic-ϕ\phi^{*}italic_ϕ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPTηsuperscript𝜂\eta^{*}italic_η start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT
1:  for k = 1,2,…,K𝐾Kitalic_K do
1:对于 k = 1, 2, ..., K𝐾Kitalic_K 执行
2:     #Collect trajectory with initial action policy πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT.
2: #按照初始行动策略 πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT 收集轨迹。
3:     for each time step t𝑡titalic_t do
对于每个时间步 t𝑡titalic_t 执行
4:        Generate message label given global state: {mit}i=1nπtch(ptch)similar-tosubscriptsuperscriptsubscriptsuperscript𝑚𝑡𝑖𝑛𝑖1superscript𝜋𝑡𝑐superscript𝑝𝑡𝑐\{m^{t}_{i}\}^{n}_{i=1}\sim\pi^{tch}(p^{tch}){ italic_m start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT ∼ italic_π start_POSTSUPERSCRIPT italic_t italic_c italic_h end_POSTSUPERSCRIPT ( italic_p start_POSTSUPERSCRIPT italic_t italic_c italic_h end_POSTSUPERSCRIPT ).
4: 根据全局状态生成消息标签: {mit}i=1nπtch(ptch)similar-tosubscriptsuperscriptsubscriptsuperscript𝑚𝑡𝑖𝑛𝑖1superscript𝜋𝑡𝑐superscript𝑝𝑡𝑐\{m^{t}_{i}\}^{n}_{i=1}\sim\pi^{tch}(p^{tch}){ italic_m start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT ∼ italic_π start_POSTSUPERSCRIPT italic_t italic_c italic_h end_POSTSUPERSCRIPT ( italic_p start_POSTSUPERSCRIPT italic_t italic_c italic_h end_POSTSUPERSCRIPT )
15:        for i = 1,2,…,n𝑛nitalic_n do
15:for i = 1,2,..., n𝑛nitalic_n do
6:           Generate action from initial action policy uitπθ(ρu(oit))similar-tosubscriptsuperscript𝑢𝑡𝑖subscript𝜋𝜃subscript𝜌𝑢subscriptsuperscript𝑜𝑡𝑖u^{t}_{i}\sim\pi_{\theta}(\rho_{u}(o^{t}_{i}))italic_u start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_ρ start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ( italic_o start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ).
从初始行动策略 uitπθ(ρu(oit))similar-tosubscriptsuperscript𝑢𝑡𝑖subscript𝜋𝜃subscript𝜌𝑢subscriptsuperscript𝑜𝑡𝑖u^{t}_{i}\sim\pi_{\theta}(\rho_{u}(o^{t}_{i}))italic_u start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_ρ start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ( italic_o start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) 生成行动。
7:           Add message label and observation to buffer:DSFT=DSFT(mit,oit)subscript𝐷𝑆𝐹𝑇subscript𝐷𝑆𝐹𝑇subscriptsuperscript𝑚𝑡𝑖subscriptsuperscript𝑜𝑡𝑖D_{SFT}=D_{SFT}~{}\cup~{}(m^{t}_{i},o^{t}_{i})italic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT = italic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT ∪ ( italic_m start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
7: 向缓冲区添加消息标签和观察结果: DSFT=DSFT(mit,oit)subscript𝐷𝑆𝐹𝑇subscript𝐷𝑆𝐹𝑇subscriptsuperscript𝑚𝑡𝑖subscriptsuperscript𝑜𝑡𝑖D_{SFT}=D_{SFT}~{}\cup~{}(m^{t}_{i},o^{t}_{i})italic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT = italic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT ∪ ( italic_m start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
8:        end for 8:结束循环
9:        Take joint actions 𝒖tsuperscript𝒖𝑡\boldsymbol{u}^{t}bold_italic_u start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and obtain the next observation 𝒐𝒕+𝟏superscript𝒐𝒕1\boldsymbol{o^{t+1}}bold_italic_o start_POSTSUPERSCRIPT bold_italic_t bold_+ bold_1 end_POSTSUPERSCRIPT.
9: 采取联合行动 𝒖tsuperscript𝒖𝑡\boldsymbol{u}^{t}bold_italic_u start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT 并获取下一个观测结果 𝒐𝒕+𝟏superscript𝒐𝒕1\boldsymbol{o^{t+1}}bold_italic_o start_POSTSUPERSCRIPT bold_italic_t bold_+ bold_1 end_POSTSUPERSCRIPT
10:     end for 10:结束循环
211:  end for 211:结束循环
12:  for each batch numbers do
12:对每个批次号执行
13:     Sample a batch of data from data buffer DSFTsubscript𝐷𝑆𝐹𝑇D_{SFT}italic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT
13: 从数据缓冲区 DSFTsubscript𝐷𝑆𝐹𝑇D_{SFT}italic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT 中抽取一批数据样本
14:     Update the message policy πηsubscript𝜋𝜂\pi_{\eta}italic_π start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT following Eq. 1.
14: 根据等式更新消息策略 πηsubscript𝜋𝜂\pi_{\eta}italic_π start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT
315:  end for 315:结束循环
16:  for each episode do
16:对每一集进行处理
17:     #RL training stage with frozen message policy πηsubscript𝜋𝜂\pi_{\eta}italic_π start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT.
17: #RL 训练阶段,消息策略 πηsubscript𝜋𝜂\pi_{\eta}italic_π start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT 冻结。
18:     for each t𝑡titalic_t do
18:对于每个 t𝑡titalic_t 执行
19:        for i = 1,2,…,n do
19:对于 i = 1,2,...,n 执行
20:           Generate a message for each agent from the message policy $\hat{m}^{t}_{i}\sim\pi_{\eta}(\cdot\,|\,\rho_{m}(o^{t}_{i}))$.
21:        end for
22:        for i = 1,2,…,n do
23:           Generate an action for each agent from the action policy conditioned on the received messages, $u^{t}_{i}\sim\pi_{\theta}(\cdot\,|\,\rho_{u}(o^{t}_{i}),\,m^{t}_{-i})$.
24:        end for
25:        Take the joint action $\boldsymbol{u}^{t}$ and obtain the next observation $\boldsymbol{o}^{t+1}$.
26:        Store the transition in the buffer: $D_{RL}=D_{RL}\cup(\boldsymbol{o}^{t},\boldsymbol{u}^{t},\boldsymbol{r}^{t},\boldsymbol{o}^{t+1},\boldsymbol{m}^{t})$.
27:     end for
28:     Update the policy parameters $\theta$ and the value function parameters $\phi$ by minimizing the overall learning loss in Eq. (6).
29:  end for
Algorithm 1 Training Procedure for Verco
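To make the sampled rollout concrete, the following Python sketch mirrors steps 19–27 of Algorithm 1. The objects env, message_policy, action_policy, rho_m, rho_u, and buffer are hypothetical placeholders for the environment, the two LoRA-equipped LLM policies, the prompt constructors, and the RL buffer; this is an illustrative sketch, not the released implementation.

# Minimal sketch of one environment step of Algorithm 1 (steps 19-27).
# All names below are hypothetical placeholders.
def rollout_step(env, obs, message_policy, action_policy, rho_m, rho_u, buffer):
    n = len(obs)

    # Steps 19-21: each agent generates a verbal message from its local observation.
    messages = [message_policy(rho_m(obs[i])) for i in range(n)]

    # Steps 22-24: each agent selects an action conditioned on its observation
    # and the messages received from the other agents.
    actions = [
        action_policy(rho_u(obs[i]), [m for j, m in enumerate(messages) if j != i])
        for i in range(n)
    ]

    # Step 25: execute the joint action and observe the next observations.
    next_obs, rewards, done, _ = env.step(actions)

    # Step 26: store the transition, including the messages, in the RL buffer.
    buffer.append((obs, actions, rewards, next_obs, messages))
    return next_obs, done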

5 Experiments

In this section, we first describe the experimental scenario and then introduce all baselines in detail. We report the performance of the baseline algorithms in different experimental scenarios and visualize the communication content of the Verco algorithm.

5.1 Environment description

The Overcooked environment is a popular, complex environment for decision-making problems. The goal of the agents is to make different types of salads with the provided raw materials and tools in a 7×7 grid kitchen. Our work extends the single-agent textual version of Overcooked in Tan et al. [20] to the multi-agent setting, in which two agents need to collaborate to complete tasks in the environment. As shown in the figure, there are two types of maps, Map A and Map B. In Map A, the two agents share the same space, so collisions may occur. In Map B, the two agents are separated and need to complete the task by passing items across the table. The environment is partially observable, and each agent can only observe the objects within a 5×5 square centered on itself. As for the reward, chopping a correct item yields +0.2, delivering a correct dish yields +1, delivering any wrong item yields -0.1, each collision between agents yields -0.01, and each time step yields -0.001. 
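As a concrete reading of this reward scheme, the minimal sketch below encodes the per-step reward terms; the event-flag names are hypothetical, while the constants follow the values given in the text.

# Hedged sketch of the per-step Overcooked reward described above.
# The event flags are hypothetical names, not the environment's actual API.
def step_reward(chopped_correct_item: bool,
                delivered_correct_dish: bool,
                delivered_wrong_item: bool,
                num_collisions: int) -> float:
    reward = -0.001                      # time penalty applied every step
    if chopped_correct_item:
        reward += 0.2                    # chopping a correct item
    if delivered_correct_dish:
        reward += 1.0                    # delivering a correct dish
    if delivered_wrong_item:
        reward -= 0.1                    # delivering any wrong item
    reward -= 0.01 * num_collisions      # each collision between agents
    return reward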

5.2 Baselines

We compare our method with three trainable baselines and a training-free baseline that makes decisions directly with GPT-4. The baselines are described as follows: 

TWOSOME [20]. TWOSOME proposes an efficient fine-tuning framework for single-agent LLM decision-making, and it balances the joint probabilities of candidate actions with several regularization methods. 
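For intuition, the sketch below shows one simple form of length-normalized candidate-action scoring in the spirit of TWOSOME; it is our own illustration with an assumed Hugging Face causal LM and tokenizer, not the authors' implementation, and it assumes the prompt's tokenization is a prefix of the full sequence.

import torch
import torch.nn.functional as F

# Simplified sketch: score each candidate action by the average token
# log-probability of its tokens given the prompt, then softmax over candidates.
@torch.no_grad()
def action_distribution(model, tokenizer, prompt: str, candidate_actions: list[str]):
    scores = []
    for action in candidate_actions:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + " " + action, return_tensors="pt").input_ids
        logits = model(full_ids).logits                       # (1, seq_len, vocab)
        log_probs = F.log_softmax(logits[:, :-1], dim=-1)     # predictions for tokens 1..L-1
        targets = full_ids[:, 1:]
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        action_lp = token_lp[:, prompt_ids.shape[1] - 1:]     # log-probs of the action tokens
        scores.append(action_lp.sum() / action_lp.shape[1])   # normalize by token count
    return torch.softmax(torch.stack(scores), dim=0)          # distribution over candidates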

Symbolic PPO [33, 39]. Symbolic PPO takes raw numerical observations as input and uses MLPs as its backbone network. 

CommNet [19]. CommNet, a commonly used multi-agent communication algorithm, averages the numerical observations of all agents and sends the result as a message to each agent. 

GPT-4 [41]. GPT-4 has shown great potential for decision-making problems, but a significant drawback is that it is difficult for GPT-4 to improve from the environment's feedback. We directly feed the textual context and the candidate actions into GPT-4 and let it select the most appropriate action from the candidates. 
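The GPT-4 baseline can be queried roughly as follows; the prompt wording, helper name, and answer parsing are our own assumptions built on the OpenAI chat-completions API, not the paper's exact setup.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative sketch: put the textual context and candidate actions in the
# prompt and ask GPT-4 to reply with the index of the chosen action.
def gpt4_select_action(context: str, candidate_actions: list[str]) -> str:
    options = "\n".join(f"{i}: {a}" for i, a in enumerate(candidate_actions))
    prompt = (
        f"{context}\n\nCandidate actions:\n{options}\n\n"
        "Reply with only the index of the most appropriate action."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    index = int(response.choices[0].message.content.strip())
    return candidate_actions[index]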

Verco. Our method uses two sets of LoRA weights, one for the communication policy and one for the action policy. The communication policy is obtained by SFT on message labels generated by GPT-4, and the action policy is learned by RL through interaction with the environment. 
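As an illustration of how two sets of LoRA weights can share a single base LLM, the sketch below uses the Hugging Face peft library; the base model name and LoRA hyperparameters are assumptions for illustration, not the paper's reported configuration.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Attach two separate LoRA adapters ("message" and "action") to one base LLM.
# Model name and hyperparameters below are assumptions.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg, adapter_name="message")  # SFT-trained message policy
model.add_adapter("action", lora_cfg)                           # RL-trained action policy

model.set_adapter("message")   # activate the message adapter when generating messages
# ... generate the verbal message ...
model.set_adapter("action")    # switch to the action adapter when selecting actions
# ... generate the action ...

Keeping both adapters on one frozen base model means only the small LoRA matrices are switched at inference time, which keeps the memory footprint close to that of a single policy.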

All curves are presented with the 25–75% percentile range over four random seeds, with the solid lines representing the median. Due to our efficient design, our algorithm and all experiments can be run on a single NVIDIA Tesla V100 32GB GPU. 

5.3 Performance on Overcooked

The results on Overcooked are shown in Figure LABEL:fig:overcooked. Among the methods based on raw symbolic input, CommNet performs better than Symbolic PPO; however, it is difficult to observe the communication patterns between agents in a way that humans can understand. The results also indicate that the LLM-based methods have significantly higher sample efficiency, and our Verco further achieves higher episode returns. We believe this improvement comes from the communication information, which effectively coordinates the actions among agents; in addition, verbal messages promote our understanding of the cooperation mechanisms between agents. The results further show that Verco has significantly shorter episode lengths and lower policy entropy than the other algorithms, indicating that the introduction of verbal communication encourages agents to complete tasks more efficiently and reduces the uncertainty of the action policy. Although GPT-4 has rich prior knowledge, it still exhibits biases when making complex decisions in the environment, which can easily lead to task failure.

5.4 Verbal communication visualization on Overcooked

To further analyze the differences brought about by the introduction of verbal communication, we visualize replays of Verco and the non-communication algorithm (TWOSOME) in detail. In the Single Room scenario, shown in Figure 6(a,d), communication is crucial for coordinating and performing different tasks, since the agents' actions may conflict. Specifically, Alice suggests that Bob pick up the tomato, and Bob suggests that Alice pick up the bowl; after receiving each other's messages, the two agents choose different actions and thus avoid the conflict. Without communication, both agents go directly for the tomato, which causes a collision and wastes time. In the Separate Rooms scenario, the agents are separated by a table and can only transfer items through the middle table, so communication is also important for their cooperation. As shown in Figure 6(b), after Alice reminds Bob, Bob puts the tomatoes on the middle table for Alice to chop. Without communication, Bob may not be aware of the need to pass the tomato to Alice, as shown in Figure 6(e). Overall, by directly sending verbal messages, we gain a more intuitive understanding of the cooperative motivations between agents.

Figure 6: Communication display in different scenarios.

6 Closing Remarks

In this paper, we propose a novel multi-agent communication algorithm called Verco. Verco endows agents with the ability to send human-understandable verbal messages and to make decisions based on teammates' messages by incorporating multiple sets of LoRA parameters. To generate coordinated and consistent messages for the message module, we employ a teacher LLM with global observation to produce message labels and train the message module with local observations as input. After SFT, the agents load the well-trained communication module weights, and the action policy is trained through reinforcement learning by continuously interacting with the environment. Evaluations conducted in the Overcooked environment demonstrate that our algorithm outperforms existing baselines in terms of performance and exhibits stronger interpretability, contributing to a deeper understanding of how cooperation among agents forms.

References

  • Arel et al. [2010] Itamar Arel, Cong Liu, Tom Urbanik, and Airton G Kohls. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems, 4(2):128–135, 2010.
  • Matignon et al. [2012] Laëtitia Matignon, Laurent Jeanpierre, and Abdel-Illah Mouaddib. Coordinated multi-robot exploration under communication constraints using decentralized markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pages 2017–2023, 2012.
  • Zhang and Lesser [2011] Chongjie Zhang and Victor Lesser. Coordinated multi-agent reinforcement learning in networked distributed pomdps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, pages 764–770, 2011.
  • Yarats et al. [2021] Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, and Rob Fergus. Improving sample efficiency in model-free reinforcement learning from images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10674–10681, 2021.
  • Li et al. [2024] Dapeng Li, Na Lou, Bin Zhang, Zhiwei Xu, and Guoliang Fan. Adaptive parameter sharing for multi-agent reinforcement learning. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6035–6039, 2024. doi: 10.1109/ICASSP48485.2024.10447262.
  • Papoudakis et al. [2019] Georgios Papoudakis, Filippos Christianos, Arrasy Rahman, and Stefano V Albrecht. Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv preprint arXiv:1906.04737, 2019.
  • Milani et al. [2022] Stephanie Milani, Zhicheng Zhang, Nicholay Topin, Zheyuan Ryan Shi, Charles Kamhoua, Evangelos E Papalexakis, and Fei Fang. Maviper: Learning decision tree policies for interpretable multi-agent reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 251–266. Springer, 2022.
  • Glanois et al. [2021] Claire Glanois, Paul Weng, Matthieu Zimmer, Dong Li, Tianpei Yang, Jianye Hao, and Wulong Liu. A survey on interpretable reinforcement learning. CoRR, abs/2112.13112, 2021. URL https://arxiv.org/abs/2112.13112.
  • Sunehag et al. [2017] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinícius Flores Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning. CoRR, abs/1706.05296, 2017.
  • Rashid et al. [2018] Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4292–4301. PMLR, 2018.
  • Foerster et al. [2018] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 2974–2982. AAAI Press, 2018.
  • Yu et al. [2021] Chao Yu, Akash Velu, Eugene Vinitsky, Yu Wang, Alexandre M. Bayen, and Yi Wu. The surprising effectiveness of MAPPO in cooperative, multi-agent games. CoRR, abs/2103.01955, 2021.
  • Lowe et al. [2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6379–6390, 2017.
  • Li et al. [2023a] Dapeng Li, Zhiwei Xu, Bin Zhang, and Guoliang Fan. Sea: A spatially explicit architecture for multi-agent reinforcement learning. In 2023 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2023a. doi: 10.1109/IJCNN54540.2023.10191819.
  • Miller et al. [2002] John H Miller, Carter T Butts, and David Rode. Communication and cooperation. Journal of Economic Behavior & Organization, 47(2):179–195, 2002.
  • Yang and Wang [2020] Yaodong Yang and Jun Wang. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv:2011.00583, 2020.
  • Li et al. [2023b] Dapeng Li, Zhiwei Xu, Bin Zhang, and Guoliang Fan. From explicit communication to tacit cooperation: A novel paradigm for cooperative marl. arXiv preprint arXiv:2304.14656, 2023b.
  • Peng et al. [2017] Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. CoRR, abs/1703.10069, 2017. URL http://arxiv.org/abs/1703.10069.
  • Sukhbaatar et al. [2016] Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2244–2252, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/55b1927fdafef39c48e5b73b5d61ea60-Abstract.html.
  • Tan et al. [2024] Weihao Tan, Wentao Zhang, Shanqi Liu, Longtao Zheng, Xinrun Wang, and Bo An. True knowledge comes from practice: Aligning large language models with embodied environments via reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=hILVmJ4Uvu.
  • Carta et al. [2023] Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer. Grounding large language models in interactive environments with online reinforcement learning. arXiv preprint arXiv:2302.02662, 2023.
  • Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Mandi et al. [2023] Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models. arXiv preprint arXiv:2307.04738, 2023.
  • Jiang et al. [2023] Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts, 2023.
  • Liu et al. [2023] Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. Llm+ p: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023.
  • Ahn et al. [2022] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. Do as i can, not as i say: Grounding language in robotic affordances, 2022.
  • Zhang et al. [2023a] Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yihang Sun, Cheng Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, et al. Proagent: Building proactive cooperative ai with large language models. arXiv preprint arXiv:2308.11339, 2023a.
  • Zhang et al. [2023b] Bin Zhang, Hangyu Mao, Jingqing Ruan, Ying Wen, Yang Li, Shao Zhang, Zhiwei Xu, Dapeng Li, Ziyue Li, Rui Zhao, et al. Controlling large language model-based agents for large-scale decision-making: An actor-critic approach. arXiv preprint arXiv:2311.13884, 2023b.
  • Wang et al. [2023] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
  • Yao et al. [2022] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
  • Dasgupta et al. [2023] Ishita Dasgupta, Christine Kaeser-Chen, Kenneth Marino, Arun Ahuja, Sheila Babayan, Felix Hill, and Rob Fergus. Collaborating with language models for embodied reasoning. arXiv preprint arXiv:2302.00763, 2023.
  • Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Mangrulkar et al. [2022] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
  • Ding et al. [2023] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
  • Yuan et al. [2022] Lei Yuan, Jianhao Wang, Fuxiang Zhang, Chenghe Wang, Zongzhang Zhang, Yang Yu, and Chongjie Zhang. Multi-agent incentive communication via decentralized teammate modeling. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 9466–9474. AAAI Press, 2022. URL https://ojs.aaai.org/index.php/AAAI/article/view/21179.
  • Slumbers et al. [2023] Oliver Slumbers, David Henry Mguni, Kun Shao, and Jun Wang. Leveraging large language models for optimised coordination in textual multi-agent reinforcement learning. 2023.
  • Patel and Pavlick [2021] Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. In International conference on learning representations, 2021.
  • De Witt et al. [2020] Christian Schroeder De Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the starcraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020.
  • Schulman et al. [2015] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  • OpenAI et al. [2024] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, et al. GPT-4 technical report, 2024.