License: arXiv.org perpetual non-exclusive license
arXiv:2404.17780v1 [cs.MA] 27 Apr 2024

Verco: Learning Coordinated Verbal Communication for Multi-agent Reinforcement Learning

Dapeng Li1,2, Hang Dong3, Lu Wang3, Bo Qiao3, Si Qin3, Qingwei Lin3
   Dongmei Zhang3, Qi Zhang3, Zhiwei Xu1,2, Bin Zhang1,2, Guoliang Fan1,2

1Institute of Automation, Chinese Academy of Sciences

2School of Artificial Intelligence, University of Chinese Academy of Sciences

3Microsoft
lidapeng2020@ia.ac.cn
Abstract

In recent years, multi-agent reinforcement learning algorithms have made significant advances in diverse gaming environments, leading to increased interest in the broader application of such techniques. To address the prevalent challenge of partial observability, communication-based algorithms have improved cooperative performance through the sharing of numerical embeddings between agents. However, our understanding of how collaborative mechanisms form remains very limited, making the design of a human-understandable communication mechanism a valuable problem to address. In this paper, we propose a novel multi-agent reinforcement learning algorithm that embeds large language models into agents, endowing them with the ability to generate human-understandable verbal communication. The framework consists of a message module and an action module. The message module is responsible for generating and sending verbal messages to other agents, effectively enhancing information sharing among agents. To further strengthen the message module, we employ a teacher model to generate message labels from the global view and update the student model through Supervised Fine-Tuning (SFT). The action module receives messages from other agents and selects actions based on the current local observation and the received messages. Experiments conducted on the Overcooked game demonstrate that our method significantly improves learning efficiency and performance over existing methods, while also providing an interpretable tool for humans to understand the process of multi-agent cooperation.

1 Introduction

Cooperative multi-agent reinforcement learning (MARL) has received widespread research attention due to its extensive practical applications in fields such as traffic control [1], multi-robot control [2], and sensor networks [3]. However, key issues such as sample efficiency [4, 5], non-stationarity [6], and the interpretability of cooperation [7, 8] remain obstacles to further advancing these practical applications. To address these challenges, significant progress has been made in cooperative MARL. Among existing approaches, the centralized training and decentralized execution paradigm [9, 10, 11, 12, 13, 14] alleviates the non-stationarity problem by introducing additional information during training. However, due to partial observability, the strategies learned by agents may be fragile, since uncertainty about other agents during execution can lead to catastrophic incoordination and sub-optimality [15, 16]. Inspired by human cooperation, a series of works achieve communication between individuals by exchanging their observations or hidden embeddings [17, 18, 19], thereby stabilizing the learning process and promoting more efficient cooperation among agents. These methods usually treat messages as black boxes, sending numerical messages and assuming that the policy network can adaptively extract content that is helpful for learning during training. As a result, the communication content is usually presented in a form that humans cannot understand, making the communication mechanism hard to interpret.

Figure 1: Incorrect messages can easily lead to conflicts, while coordinated messages can promote efficient cooperation among agents.

One natural and interpretable way to communicate is to directly generate verbal language as communication messages, which also means the policy network needs the ability to understand verbal text. Recent work has shown that using large language models to understand knowledge texts can effectively improve sample efficiency in complex decision-making tasks [20, 21]. By aligning the prior knowledge of large language models (LLMs) with the functional requirements of the environment using only a small amount of environmental interaction data, the LLM can achieve good performance. Meanwhile, since LLMs take verbal text as model input, it is natural to use LLMs for verbal communication. However, there are key difficulties in directly generating verbal messages: i) the candidate space of generated messages is too large to explore; ii) the text output lacks gradients and cannot be optimized end-to-end with environmental rewards; iii) messages generated by agents from local observations are prone to conflicts (as shown in Figure 1).

To address the aforementioned challenges, this paper proposes a novel multi-agent algorithm for learning coordinated VERbal COmmunication (Verco) that is understandable to humans. We first use a powerful LLM (e.g., GPT-4) as the teacher model and perform Supervised Fine-Tuning (SFT) on a student model equipped with low-rank adapters (LoRA) [22] based on the output of the teacher model. The policy learning module is then fine-tuned online through interaction with the environment, aligning the model's prior knowledge with the environment. We train separate LoRA parameters for the communication module and the policy learning module, avoiding mutual interference between the two modules as well as the cost of managing and training multiple large models. Overall, the communication module generates and sends text messages based on the local observations of the agents to promote cooperation between them, while the action module receives local observations and messages sent by other agents and selects an action from the candidate action set. We conduct experiments in different scenarios of the popular multi-agent decision-making environment Overcooked and compare Verco with existing baseline methods. We find that introducing verbal communication significantly improves the performance of agents as well as the interpretability of their cooperation.

2 Related Work

In this section, we introduce work related to this paper.

2.1 LLMs for Decision Making

LLMs trained on large datasets have shown remarkable abilities in various downstream tasks. Recent works have gradually applied LLMs to complex tasks such as robot control [23, 24], planning generation [25], and embodied agents [26, 27, 28, 29]. Among them, SayCan [26] and LLM+P [25] use the LLM as a high-level planner to generate long-term plans for agents based on task goals, but often do not interact with the environment directly. Voyager [29] uses GPT-4 to accomplish highly complex tasks in the Minecraft game and continuously generates new skills during the learning process. ReAct [30] combines reasoning and action to enhance LLMs' reasoning ability toward high-level goals, while improving the interpretability and credibility of LLMs' decisions. As a supplement, [31] adds an additional reporter to provide useful information to the planner. However, these methods cannot improve based on environmental feedback, making it difficult for them to adapt to different environments. In recent works, GLAM [21] and TWOSOME [20] both use reinforcement learning (RL) to enable LLMs to interact and learn in a single-agent environment, and the rich prior experience of the LLM itself can significantly reduce the learning cost of agents.

2.2 Finetuning LLMs

In recent works, finetuning has been employed to improve the performance of LLMs in specific tasks across different domains. Among these approaches, the use of RL to enhance the consistency between the LLM and human preferences is a common practice [32]. Many reinforcement learning with human feedback (RLHF) methods utilize PPO [33] to learn a reward function from human datasets and fine-tune LLMs according to the learned reward function. A significant challenge in fine-tuning LLMs is how to reduce training costs. Parameter-efficient finetuning (PEFT) methods (https://github.com/huggingface/peft) [34, 35] can significantly reduce the number of trainable LLM parameters while avoiding excessive performance loss. As a recent PEFT technique, low-rank adapters (LoRA) indirectly train the dense layers of neural networks by learning low-rank matrices. TWOSOME [20] fine-tunes a LoRA adapter as the actor model, thus enabling efficient fine-tuning while avoiding mutual interference between the actor and critic.

2.3 MARL with Communication

Existing multi-agent cooperation algorithms have made significant progress in many complex scenarios, but the realistic setting of partial observability greatly limits the degree of cooperation among agents. To address this issue, researchers have proposed a series of communication-based cooperation paradigms [15, 18, 19, 36, 17] that facilitate agents' understanding of the environment and their teammates by exchanging local observations or intentions. However, most communication messages generated by existing communication algorithms are difficult for humans to understand; they are therefore not interpretable and cannot be improved explicitly. To make communication messages understandable and interpretable, using verbal text as the communication message naturally becomes a solution. The FAMA [37] algorithm extends GLAM to multi-agent settings and introduces a verbal communication module. However, the communication module in FAMA cannot be learned through interactions with the environment and can only select from a pre-defined set of message candidates, which significantly reduces the freedom of message generation and relies on the quality of the pre-specified candidates.

3 Preliminaries

3.1 Problem Formulation

We assume a textual MARL setting with language vocabulary $\mathcal{V}$. The multi-agent problem can be described as a partially observable Markov game (POMG), defined by the tuple $G=\left<S,U,\mathcal{N},\Omega,\mathcal{T},O,r,\gamma,\mathcal{V}\right>$, with $S$ the state space and $U\subset\mathcal{V}$ the action space. At each discrete time step $t$, each agent $i\in\mathcal{N}:=\{1,\dots,n\}$ selects an action $u_{i}\in U_{i}\subset\mathcal{V}$. $\mathcal{T}(s'|s,\boldsymbol{u}):S\times U\times S\rightarrow P(S)$ is the state transition function, where $\boldsymbol{u}=\{u_{1},\dots,u_{n}\}\in\boldsymbol{U}\equiv U^{n}$ is the joint action. Each agent $i$ obtains its textual partial observation $o_{i}\in\Omega$ through the observation function $O(s,i):S\times\mathcal{N}\rightarrow\Omega$. $r_{i}(s,u_{i}):S\times U_{i}\rightarrow\mathbb{R}$ is the reward function for each agent $i$, and $\gamma$ is the discount factor. The goal of each agent is to maximize its expected return. To meet the demands of communication and decision-making, we design a dual-LLM structure for agents, enabling them to send text messages based on their local observations while making decisions based on the messages received from other agents. By considering both local observations and teammates' messages, the agents can select coordinated actions. Our framework consists of three main components: the teacher policy $\pi^{tch}$, the message policy $\pi_{\eta}$, and the action policy $\pi_{\theta}$.
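
For concreteness, the sketch below outlines the textual POMG interface described above as a minimal Python stub. The class and method names (TextualPOMG, observe, step) are illustrative and are not part of the paper.

```python
from dataclasses import dataclass
from typing import Dict

# A minimal sketch of the textual POMG interface; names are illustrative.
@dataclass
class Transition:
    observations: Dict[int, str]   # o_i: textual partial observation of each agent
    rewards: Dict[int, float]      # r_i(s, u_i)
    done: bool

class TextualPOMG:
    def __init__(self, n_agents: int, gamma: float = 0.99):
        self.n_agents = n_agents    # |N|
        self.gamma = gamma          # discount factor

    def observe(self, agent_id: int) -> str:
        """O(s, i): render agent i's local view of the hidden state as text."""
        raise NotImplementedError

    def step(self, joint_action: Dict[int, str]) -> Transition:
        """Apply the joint action u = {u_1, ..., u_n}; the transition
        T(s'|s, u) is internal to the environment."""
        raise NotImplementedError
```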

3.2 LLM and Finetuning

We consider using LLMs as interactive agents in embodied environments. These LLMs are trained on massive amounts of text data and can generate human-like responses to questions or prompts. To improve the performance of LLMs in specific tasks, fine-tuning is commonly employed. Among these methods, LoRA (Low-Rank Adaptation) [22] is a popular fine-tuning approach that effectively reduces computational costs while maintaining satisfactory performance. Specifically, for a dense layer with weight matrix $W_{0}\in\mathbb{R}^{d\times k}$, LoRA injects trainable rank-decomposition matrices into the Transformer via $W_{0}+\Delta W=W_{0}+BA$, where $B\in\mathbb{R}^{d\times r}$ and $A\in\mathbb{R}^{r\times k}$. Since the rank $r$ is much smaller than $d$ and $k$, this significantly reduces the number of trainable parameters.
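
As an illustration of the LoRA update above, the following minimal PyTorch sketch wraps a frozen dense layer $W_{0}$ with a trainable low-rank correction $BA$. The rank, scaling, and initialization values are placeholders, not the settings used in the paper.

```python
import torch
import torch.nn as nn

# Minimal LoRA-augmented dense layer: y = (W0 + BA) x with W0 frozen.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # W0 stays frozen
            p.requires_grad_(False)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))          # B in R^{d x r}
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Low-rank update applied without materializing the full d x k matrix.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # trainable params
```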

4 Method

We introduce an LLM capable of text generation into each agent as a communication module and an action module to enhance the collaborative capabilities and interpretability of multi-agent systems. To accelerate training and improve the quality of communication messages, we first generate communication labels for supervised training of the communication module. Subsequently, the fine-tuned communication module weights are loaded into the agents, allowing them to generate highly flexible text messages that facilitate coordination among agents. The entire flow is shown in Figure 2. The action modules then interact with and improve within the environment using RL. In this section, we provide a detailed introduction to each part of the algorithm.

Figure 2: Verco framework: We first finetune the LoRA weight of the communication module with the global message label. Then we load the LoRA weight so the agent can directly generate verbal messages with its local observation. Meanwhile, the action policy takes the current local observation and text messages from other agents as input and outputs the decision. The action policy fine-tunes the weights using PPO based on the rewards returned by the environment.

4.1 Cooperation with verbal communication

In the context of human collaborative problems, the diversity of the information received by individuals and their distinct modes of thinking often render simple independent actions insufficient for reaching coordinated consensus. Consequently, natural language has become a prevalent and significant means of facilitating coordination and cooperation among humans. From another perspective, large language models are precisely designed for, and adept at, communicating and interacting in a human-like manner. It is therefore a natural progression to integrate large language models into agents as communication modules. These agents can employ the communication module to convey the information or intentions they observe to others, thereby enabling more effective collaboration.

Figure 3: Message module SFT stage: We employ a large model (GPT-4) as the teacher model to generate message samples based on global observations, and distill the learning into a smaller language model (LLaMA-7B) that serves as the communication model $\pi_{\eta}$.

4.2 Coordination Message Policy Pre-training

Relying solely on environment interaction to train a communication model that automatically generates reasonable messages from scratch is highly inefficient, while utilizing a large-scale model (e.g., GPT-4, LLaMA-65B) for communication is both costly and difficult to fine-tune. To facilitate the generation of coordinated and consistent messages in the communication module, we employ GPT-4 as a coordinated message generator, taking the local observations of all agents as input and generating customized message labels for each agent, $\{m_{i}\}^{n}_{i=1}\sim\pi^{tch}(p^{tch})$. The corresponding prompt $p^{tch}=\rho_{tch}(\{o_{i}\}^{n}_{i=1},n)$ is obtained through the teacher LLM's prompt function $\rho_{tch}$:

Teacher LLM prompt: "You are a message design assistant who needs to design messages for <$n$> agents. <Brief environment description>. The observation of Agent $1$: <$o_{1}$>, ..., The observation of Agent $n$: <$o_{n}$>. To complete <Task Goal>, design concise messages (within 10 words) for each agent to their teammate: <LLM response here>"
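
A minimal sketch of how the prompt function $\rho_{tch}$ could assemble this template is given below. The function name, argument names, and the (commented-out) GPT-4 client call are illustrative assumptions, not the paper's implementation.

```python
from typing import List

def rho_tch(observations: List[str], n: int, env_desc: str, task_goal: str) -> str:
    """Assemble the teacher prompt p_tch from all agents' local observations."""
    obs_lines = "\n".join(
        f"The observation of Agent {i + 1}: {o}" for i, o in enumerate(observations)
    )
    return (
        f"You are a message design assistant who needs to design messages for "
        f"{n} agents. {env_desc}\n{obs_lines}\n"
        f"To complete {task_goal}, design concise messages (within 10 words) "
        f"for each agent to their teammate:"
    )

# Querying the teacher model (illustrative; assumes the OpenAI Python SDK):
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(
#     model="gpt-4",
#     messages=[{"role": "user", "content": rho_tch(obs_list, 2, env_desc, task_goal)}],
# ).choices[0].message.content
```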

Supervised fine-tuning is conducted on the message samples using LLaMA-7B, which is updated via LoRA. The message prediction of agent $i$ is generated by the message policy $\hat{m}_{i}\sim\pi_{\eta}(\cdot|\rho_{m}(o_{i}))$. The SFT loss can be expressed as follows:

$$\mathcal{L}_{SFT}=\sum^{n}_{i=1}\mathrm{CROSS\_ENTROPY}(\hat{m}_{i},m_{i}), \qquad (1)$$

where $\mathrm{CROSS\_ENTROPY}$ denotes the cross-entropy loss function. The entire workflow of SFT is shown in Figure 3(a). To control the training scale, we let all agents share the action policy and the message module. Besides, we use the initialized agents to interact with the environment for several rounds to collect trajectory data. Empirically, with only a small amount of data and a few training steps, the communication policy $\pi_{\eta}$ can already generate reasonably coherent verbal messages.
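
The following sketch illustrates the SFT objective in Eq. (1) for a Hugging Face-style causal LM (e.g., LLaMA-7B with a LoRA adapter): only the tokens of the teacher message label contribute to the cross-entropy. Function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompts, message_labels, device="cuda"):
    """Cross-entropy of the teacher labels m_i under the student message policy
    pi_eta, conditioned on the prompt rho_m(o_i) (Eq. 1)."""
    losses = []
    for prompt, label in zip(prompts, message_labels):
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        label_ids = tokenizer(label, add_special_tokens=False,
                              return_tensors="pt").input_ids.to(device)
        input_ids = torch.cat([prompt_ids, label_ids], dim=1)
        logits = model(input_ids).logits                  # (1, T, vocab)
        # Label tokens are predicted from the positions just before them.
        pred = logits[:, prompt_ids.shape[1] - 1:-1, :]
        losses.append(F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                                      label_ids.reshape(-1)))
    return torch.stack(losses).mean()
```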

4.3 Action Policy Alignment

The action module also employs an LLM, which brings two distinct advantages: i) LLMs can directly and effectively comprehend verbal messages from teammates without further training; ii) LLMs have been shown to have learned a substantial amount of the physical rules present in the real world [38], and this knowledge can significantly reduce the number of interactions required between the policy and the environment, thereby improving sample efficiency. To align a general-purpose LLM with the specific environment, feedback from the environment is used to adapt the policy and yield better performance.

Inspired by previous works [21, 20], we do not directly prompt the LLM to generate actions; instead, we query the LLM for the probabilities of all available actions. Assume the $k$-th action of the $i$-th agent, $u_{i,k}\in U_{i}$, is a sequence of tokens $u_{i,k}=\{w^{1}_{i,k},\dots,w^{N_{i,k}}_{i,k}\}$, where $N_{i,k}$ is the number of tokens of the $k$-th action. The token-level probability of $u_{i,k}$ can then be calculated as:

$$P_{token}(u_{i,k}|\rho_{u}(o_{i}),\hat{m}_{-i})=\prod^{N_{i,k}}_{j=0}P(w^{j}_{i,k}|\rho_{u}(o_{i}),\hat{m}_{-i},w^{<j}_{i,k}), \qquad (2)$$

where $\rho_{u}$ is the action prompt function. Therefore, the action policy $\pi$ can be obtained by a softmax over the token-level probabilities of the actions:

$$P(u_{i,k}|\rho_{u}(o_{i}),\hat{m}_{-i})=\frac{\exp\big(\log P_{token}(u_{i,k}|\rho_{u}(o_{i}),\hat{m}_{-i})\big)}{\sum_{u_{i}\in U_{i}}\exp\big(\log P_{token}(u_{i}|\rho_{u}(o_{i}),\hat{m}_{-i})\big)}. \qquad (3)$$
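
The computation in Eqs. (2) and (3) can be sketched as follows for a Hugging Face-style causal LM: each candidate action's tokens are scored under the model given the action prompt (which includes the received messages), and a softmax over the summed token log-probabilities yields the action distribution. Names and tokenization details are assumptions.

```python
import torch
import torch.nn.functional as F

def action_distribution(model, tokenizer, action_prompt, candidate_actions, device="cuda"):
    """Eqs. (2)-(3): token-level log-probability of each candidate action given
    rho_u(o_i) and the received messages, normalized by a softmax over actions."""
    logps = []
    prompt_ids = tokenizer(action_prompt, return_tensors="pt").input_ids.to(device)
    for action in candidate_actions:
        action_ids = tokenizer(" " + action, add_special_tokens=False,
                               return_tensors="pt").input_ids.to(device)
        input_ids = torch.cat([prompt_ids, action_ids], dim=1)
        logits = model(input_ids).logits
        token_logp = F.log_softmax(logits[:, prompt_ids.shape[1] - 1:-1, :], dim=-1)
        # Sum of log P(w_j | prompt, w_<j) over the action's tokens (Eq. 2).
        logps.append(token_logp.gather(-1, action_ids.unsqueeze(-1)).sum())
    return F.softmax(torch.stack(logps), dim=0)          # Eq. (3)
```

At rollout time this can be wrapped in torch.no_grad(); during the PPO update the same quantity is computed with gradients enabled.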

In our work, we employ PPO [39] to optimize the action policy. PPO is a state-of-the-art actor-critic RL method, where each agent learns an actor parameterized by $\theta$ and a critic parameterized by $\phi$.

For each agent $i$, the policy loss can be formulated as follows:

$$\mathcal{L}_{i}(\theta)=\mathbb{E}_{o^{t}_{i},u^{t}_{i}}\Big[\min\Big(\frac{\pi_{\theta}(u^{t}_{i}|\rho_{u}(o^{t}_{i}),\hat{m}^{t}_{-i})}{\pi_{\theta_{old}}(u^{t}_{i}|\rho_{u}(o^{t}_{i}),\hat{m}^{t}_{-i})}A^{t}_{i},\ \mathrm{clip}\Big(\frac{\pi_{\theta}(u^{t}_{i}|\rho_{u}(o^{t}_{i}),\hat{m}^{t}_{-i})}{\pi_{\theta_{old}}(u^{t}_{i}|\rho_{u}(o^{t}_{i}),\hat{m}^{t}_{-i})},1-\epsilon,1+\epsilon\Big)A^{t}_{i}\Big)\Big], \qquad (4)$$

where $A^{t}_{i}$ is the Generalized Advantage Estimation (GAE) [40]. The critic network is composed of an additional MLP added to the last transformer block of the LLaMA model.

$$\mathcal{L}_{i}(\phi)=\mathbb{E}_{o^{t}_{i}}\Big[\min\Big\{\big(V_{\phi}(\rho_{u}(o^{t}_{i}))-\hat{V}^{t}_{i}\big)^{2},\ \big(V_{\phi_{old}}(\rho_{u}(o^{t}_{i}))+\mathrm{clip}\big(V_{\phi}(\rho_{u}(o^{t}_{i}))-V_{\phi_{old}}(\rho_{u}(o^{t}_{i})),-\epsilon,+\epsilon\big)-\hat{V}^{t}_{i}\big)^{2}\Big\}\Big], \qquad (5)$$

where $\hat{V}^{t}_{i}=A^{t}_{i}+V_{\phi_{old}}(\rho_{u}(o^{t}_{i}))$.

The overall learning loss is:

$$\mathcal{L}_{RL}=\sum^{n}_{i=1}\mathcal{L}_{i}(\theta)+\lambda_{critic}\mathcal{L}_{i}(\phi)+\lambda_{entropy}\mathcal{H}(\pi_{i,\theta}), \qquad (6)$$

where $\mathcal{H}(\pi_{i,\theta})$ denotes the entropy of agent $i$'s action policy $\pi_{i,\theta}$.
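
The sketch below shows one way to assemble the PPO objective used for action-policy alignment. Signs are arranged for gradient descent, so the surrogate of Eq. (4) is negated and the entropy term of Eq. (6) is applied as a bonus; the value clipping uses the common pessimistic (max) form rather than the min written in Eq. (5). Coefficient values are placeholders.

```python
import torch

def ppo_losses(logp, logp_old, adv, value, value_old, v_target, entropy,
               eps=0.2, lam_critic=0.5, lam_entropy=0.01):
    """Clipped policy loss (negated surrogate of Eq. 4), clipped value loss
    (pessimistic variant of Eq. 5), and the combined objective (Eq. 6)
    for one agent's batch of transitions."""
    ratio = torch.exp(logp - logp_old)                       # pi_theta / pi_theta_old
    policy_loss = -torch.min(ratio * adv,
                             torch.clamp(ratio, 1 - eps, 1 + eps) * adv).mean()
    v_clipped = value_old + torch.clamp(value - value_old, -eps, eps)
    value_loss = torch.max((value - v_target) ** 2,
                           (v_clipped - v_target) ** 2).mean()
    # Entropy is used as an exploration bonus, hence the negative sign here.
    return policy_loss + lam_critic * value_loss - lam_entropy * entropy.mean()
```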

Given that the goals of the action policy and the message policy are distinct, we load two independent LoRA weights to ensure that they do not interfere with each other. During the entire action policy alignment stage, the message policy is frozen to stabilize the training of the action policy. The actor, critic, and message networks share the same LLaMA model and their gradients do not affect each other, thereby greatly reducing training costs and memory requirements. The entire procedure is described in Algorithm 1.
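
Because the message LoRA must stay frozen during action-policy alignment, one simple way to enforce this in PyTorch is to disable gradients for all parameters registered under the message adapter, as in the sketch below. Matching parameters by adapter name is an assumption about how the two LoRA weights are registered on the shared LLaMA model.

```python
import torch.nn as nn

def freeze_adapter(model: nn.Module, adapter_name: str) -> None:
    """Freeze all parameters belonging to one LoRA adapter (e.g., the message
    adapter) so that RL gradients only update the action adapter."""
    for name, param in model.named_parameters():
        if adapter_name in name:      # name-based matching is an assumption
            param.requires_grad_(False)

# During the RL stage: keep the message policy fixed, train only the action LoRA.
# freeze_adapter(shared_llama, adapter_name="message_lora")
```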

Input : initialise action policy parameters θ𝜃\thetaitalic_θ, value function parameters ϕitalic-ϕ\phiitalic_ϕ, and message policy parameters η𝜂\etaitalic_η. Set the data buffer for SFT DSFT=subscript𝐷𝑆𝐹𝑇D_{SFT}=\emptysetitalic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT = ∅ and for RL training DRL=subscript𝐷𝑅𝐿D_{RL}=\emptysetitalic_D start_POSTSUBSCRIPT italic_R italic_L end_POSTSUBSCRIPT = ∅
初始化行动策略参数 θ𝜃\thetaitalic_θ 、价值函数参数 ϕitalic-ϕ\phiitalic_ϕ 及消息策略参数 η𝜂\etaitalic_η 。设置 SFT 的数据缓冲区 DSFT=subscript𝐷𝑆𝐹𝑇D_{SFT}=\emptysetitalic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT = ∅ 以及强化学习训练的数据缓冲区 DRL=subscript𝐷𝑅𝐿D_{RL}=\emptysetitalic_D start_POSTSUBSCRIPT italic_R italic_L end_POSTSUBSCRIPT = ∅
Output : θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, ϕsuperscriptitalic-ϕ\phi^{*}italic_ϕ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, and ηsuperscript𝜂\eta^{*}italic_η start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT
输出: θsuperscript𝜃\theta^{*}italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPTϕsuperscriptitalic-ϕ\phi^{*}italic_ϕ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPTηsuperscript𝜂\eta^{*}italic_η start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT
1:  for k = 1,2,…,K𝐾Kitalic_K do
1:对于 k = 1, 2, ..., K𝐾Kitalic_K 执行
2:     #Collect trajectory with initial action policy πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT.
2: #按照初始行动策略 πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT 收集轨迹。
3:     for each time step t𝑡titalic_t do
对于每个时间步 t𝑡titalic_t 执行
4:        Generate message label given global state: {mit}i=1nπtch(ptch)similar-tosubscriptsuperscriptsubscriptsuperscript𝑚𝑡𝑖𝑛𝑖1superscript𝜋𝑡𝑐superscript𝑝𝑡𝑐\{m^{t}_{i}\}^{n}_{i=1}\sim\pi^{tch}(p^{tch}){ italic_m start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT ∼ italic_π start_POSTSUPERSCRIPT italic_t italic_c italic_h end_POSTSUPERSCRIPT ( italic_p start_POSTSUPERSCRIPT italic_t italic_c italic_h end_POSTSUPERSCRIPT ).
4: 根据全局状态生成消息标签: {mit}i=1nπtch(ptch)similar-tosubscriptsuperscriptsubscriptsuperscript𝑚𝑡𝑖𝑛𝑖1superscript𝜋𝑡𝑐superscript𝑝𝑡𝑐\{m^{t}_{i}\}^{n}_{i=1}\sim\pi^{tch}(p^{tch}){ italic_m start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT ∼ italic_π start_POSTSUPERSCRIPT italic_t italic_c italic_h end_POSTSUPERSCRIPT ( italic_p start_POSTSUPERSCRIPT italic_t italic_c italic_h end_POSTSUPERSCRIPT )
15:        for i = 1,2,…,n𝑛nitalic_n do
15:for i = 1,2,..., n𝑛nitalic_n do
6:           Generate action from initial action policy uitπθ(ρu(oit))similar-tosubscriptsuperscript𝑢𝑡𝑖subscript𝜋𝜃subscript𝜌𝑢subscriptsuperscript𝑜𝑡𝑖u^{t}_{i}\sim\pi_{\theta}(\rho_{u}(o^{t}_{i}))italic_u start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_ρ start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ( italic_o start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ).
从初始行动策略 uitπθ(ρu(oit))similar-tosubscriptsuperscript𝑢𝑡𝑖subscript𝜋𝜃subscript𝜌𝑢subscriptsuperscript𝑜𝑡𝑖u^{t}_{i}\sim\pi_{\theta}(\rho_{u}(o^{t}_{i}))italic_u start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_ρ start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT ( italic_o start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ) 生成行动。
7:           Add message label and observation to buffer:DSFT=DSFT(mit,oit)subscript𝐷𝑆𝐹𝑇subscript𝐷𝑆𝐹𝑇subscriptsuperscript𝑚𝑡𝑖subscriptsuperscript𝑜𝑡𝑖D_{SFT}=D_{SFT}~{}\cup~{}(m^{t}_{i},o^{t}_{i})italic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT = italic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT ∪ ( italic_m start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
7: 向缓冲区添加消息标签和观察结果: DSFT=DSFT(mit,oit)subscript𝐷𝑆𝐹𝑇subscript𝐷𝑆𝐹𝑇subscriptsuperscript𝑚𝑡𝑖subscriptsuperscript𝑜𝑡𝑖D_{SFT}=D_{SFT}~{}\cup~{}(m^{t}_{i},o^{t}_{i})italic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT = italic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT ∪ ( italic_m start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_o start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
8:        end for 8:结束循环
9:        Take joint actions 𝒖tsuperscript𝒖𝑡\boldsymbol{u}^{t}bold_italic_u start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT and obtain the next observation 𝒐𝒕+𝟏superscript𝒐𝒕1\boldsymbol{o^{t+1}}bold_italic_o start_POSTSUPERSCRIPT bold_italic_t bold_+ bold_1 end_POSTSUPERSCRIPT.
9: 采取联合行动 𝒖tsuperscript𝒖𝑡\boldsymbol{u}^{t}bold_italic_u start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT 并获取下一个观测结果 𝒐𝒕+𝟏superscript𝒐𝒕1\boldsymbol{o^{t+1}}bold_italic_o start_POSTSUPERSCRIPT bold_italic_t bold_+ bold_1 end_POSTSUPERSCRIPT
10:     end for 10:结束循环
211:  end for 211:结束循环
12:  for each batch numbers do
12:对每个批次号执行
13:     Sample a batch of data from data buffer DSFTsubscript𝐷𝑆𝐹𝑇D_{SFT}italic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT
13: 从数据缓冲区 DSFTsubscript𝐷𝑆𝐹𝑇D_{SFT}italic_D start_POSTSUBSCRIPT italic_S italic_F italic_T end_POSTSUBSCRIPT 中抽取一批数据样本
14:     Update the message policy πηsubscript𝜋𝜂\pi_{\eta}italic_π start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT following Eq. 1.
14: 根据等式更新消息策略 πηsubscript𝜋𝜂\pi_{\eta}italic_π start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT
315:  end for 315:结束循环
16:  for each episode do
16:对每一集进行处理
17:     #RL training stage with frozen message policy πηsubscript𝜋𝜂\pi_{\eta}italic_π start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT.
17: #RL 训练阶段,消息策略 πηsubscript𝜋𝜂\pi_{\eta}italic_π start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT 冻结。
18:     for each t𝑡titalic_t do
18:对于每个 t𝑡titalic_t 执行
19:        for i = 1,2,…,n do
19:对于 i = 1,2,...,n 执行
20:           Generate a message for each agent from the message policy $\hat{m}^{t}_{i}\sim\pi_{\eta}(\cdot\,|\,\rho_{m}(o^{t}_{i}))$.
21:        end for
22:        for i = 1,2,…,n do
23:           Generate an action for each agent from the action policy conditioned on the received messages, $u^{t}_{i}\sim\pi_{\theta}(\cdot\,|\,\rho_{u}(o^{t}_{i}),\,m^{t}_{-i})$.
24:        end for
25:        Take the joint action $\boldsymbol{u}^{t}$ and obtain the next observation $\boldsymbol{o}^{t+1}$.
26:        Store the transition in the buffer: $D_{RL}=D_{RL}\cup(\boldsymbol{o}^{t},\boldsymbol{u}^{t},\boldsymbol{r}^{t},\boldsymbol{o}^{t+1},\boldsymbol{m}^{t})$.
27:     end for
28:     Update the policy parameters $\theta$ and the value function parameters $\phi$ by minimizing the overall learning loss in Eq. (6).
29:  end for
Algorithm 1 Training Procedure for Verco
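To make the sampled rollout concrete, the following Python sketch mirrors steps 19–27 of Algorithm 1. The objects env, message_policy, action_policy, rho_m, rho_u, and buffer are hypothetical placeholders for the environment, the two LoRA-equipped LLM policies, the prompt constructors, and the RL buffer; this is an illustrative sketch, not the released implementation.

# Minimal sketch of one environment step of Algorithm 1 (steps 19-27).
# All names below are hypothetical placeholders.
def rollout_step(env, obs, message_policy, action_policy, rho_m, rho_u, buffer):
    n = len(obs)

    # Steps 19-21: each agent generates a verbal message from its local observation.
    messages = [message_policy(rho_m(obs[i])) for i in range(n)]

    # Steps 22-24: each agent selects an action conditioned on its observation
    # and the messages received from the other agents.
    actions = [
        action_policy(rho_u(obs[i]), [m for j, m in enumerate(messages) if j != i])
        for i in range(n)
    ]

    # Step 25: execute the joint action and observe the next observations.
    next_obs, rewards, done, _ = env.step(actions)

    # Step 26: store the transition, including the messages, in the RL buffer.
    buffer.append((obs, actions, rewards, next_obs, messages))
    return next_obs, done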

5 Experiments

In this section, we first describe the experimental scenario and then introduce all baselines in detail. We report the performance of the baseline algorithms in different experimental scenarios and visualize the communication content of the Verco algorithm.

5.1 Environment description

The Overcooked environment is a popular, complex environment for decision-making problems. The goal of the agents is to make different types of salads with the provided raw materials and tools in a 7×7 grid kitchen. Our work extends the single-agent textual version of Overcooked in Tan et al. [20] to the multi-agent setting, in which two agents need to collaborate to complete tasks in the environment. As shown in the figure, there are two types of maps, Map A and Map B. In Map A, the two agents share the same space, so collisions may occur. In Map B, the two agents are separated and need to complete the task by passing items across the table. The environment is partially observable, and each agent can only observe the objects within a 5×5 square centered on itself. As for the reward, chopping a correct item yields +0.2, delivering a correct dish yields +1, delivering any wrong item yields -0.1, each collision between agents yields -0.01, and each time step yields -0.001. 
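As a concrete reading of this reward scheme, the minimal sketch below encodes the per-step reward terms; the event-flag names are hypothetical, while the constants follow the values given in the text.

# Hedged sketch of the per-step Overcooked reward described above.
# The event flags are hypothetical names, not the environment's actual API.
def step_reward(chopped_correct_item: bool,
                delivered_correct_dish: bool,
                delivered_wrong_item: bool,
                num_collisions: int) -> float:
    reward = -0.001                      # time penalty applied every step
    if chopped_correct_item:
        reward += 0.2                    # chopping a correct item
    if delivered_correct_dish:
        reward += 1.0                    # delivering a correct dish
    if delivered_wrong_item:
        reward -= 0.1                    # delivering any wrong item
    reward -= 0.01 * num_collisions      # each collision between agents
    return reward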

5.2 Baselines

We compare our method with three trainable baselines and a training-free baseline that makes decisions directly with GPT-4. The baselines are described as follows: 

TWOSOME [20]. TWOSOME proposes an efficient fine-tuning framework for single-agent LLM decision-making, and it balances the joint probabilities of candidate actions with several regularization methods. 
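For intuition, the sketch below shows one simple form of length-normalized candidate-action scoring in the spirit of TWOSOME; it is our own illustration with an assumed Hugging Face causal LM and tokenizer, not the authors' implementation, and it assumes the prompt's tokenization is a prefix of the full sequence.

import torch
import torch.nn.functional as F

# Simplified sketch: score each candidate action by the average token
# log-probability of its tokens given the prompt, then softmax over candidates.
@torch.no_grad()
def action_distribution(model, tokenizer, prompt: str, candidate_actions: list[str]):
    scores = []
    for action in candidate_actions:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + " " + action, return_tensors="pt").input_ids
        logits = model(full_ids).logits                       # (1, seq_len, vocab)
        log_probs = F.log_softmax(logits[:, :-1], dim=-1)     # predictions for tokens 1..L-1
        targets = full_ids[:, 1:]
        token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        action_lp = token_lp[:, prompt_ids.shape[1] - 1:]     # log-probs of the action tokens
        scores.append(action_lp.sum() / action_lp.shape[1])   # normalize by token count
    return torch.softmax(torch.stack(scores), dim=0)          # distribution over candidates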

Symbolic PPO [33, 39]. Symbolic PPO takes raw numerical observations as input and uses MLPs as its backbone network. 

CommNet [19]. CommNet, a commonly used multi-agent communication algorithm, averages the numerical observations of all agents and sends the result as a message to each agent. 

GPT-4 [41]. GPT-4 has shown great potential for decision-making problems, but a significant drawback is that it is difficult for GPT-4 to improve from the environment's feedback. We directly feed the textual context and the candidate actions into GPT-4 and let it select the most appropriate action from the candidates. 
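The GPT-4 baseline can be queried roughly as follows; the prompt wording, helper name, and answer parsing are our own assumptions built on the OpenAI chat-completions API, not the paper's exact setup.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative sketch: put the textual context and candidate actions in the
# prompt and ask GPT-4 to reply with the index of the chosen action.
def gpt4_select_action(context: str, candidate_actions: list[str]) -> str:
    options = "\n".join(f"{i}: {a}" for i, a in enumerate(candidate_actions))
    prompt = (
        f"{context}\n\nCandidate actions:\n{options}\n\n"
        "Reply with only the index of the most appropriate action."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    index = int(response.choices[0].message.content.strip())
    return candidate_actions[index]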

Verco. Our method uses two sets of LoRA weights, one for the communication policy and one for the action policy. The communication policy is obtained by SFT on message labels generated by GPT-4, and the action policy is learned by RL through interaction with the environment. 
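As an illustration of how two sets of LoRA weights can share a single base LLM, the sketch below uses the Hugging Face peft library; the base model name and LoRA hyperparameters are assumptions for illustration, not the paper's reported configuration.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Attach two separate LoRA adapters ("message" and "action") to one base LLM.
# Model name and hyperparameters below are assumptions.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg, adapter_name="message")  # SFT-trained message policy
model.add_adapter("action", lora_cfg)                           # RL-trained action policy

model.set_adapter("message")   # activate the message adapter when generating messages
# ... generate the verbal message ...
model.set_adapter("action")    # switch to the action adapter when selecting actions
# ... generate the action ...

Keeping both adapters on one frozen base model means only the small LoRA matrices are switched at inference time, which keeps the memory footprint close to that of a single policy.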

All curves are presented with the 25–75% percentile range over four random seeds, with the solid lines representing the median. Due to our efficient design, our algorithm and all experiments can be run on a single NVIDIA Tesla V100 32GB GPU. 

5.3 Performance on Overcooked

The results on Overcooked are shown in Figure LABEL:fig:overcooked. Among the methods based on raw symbolic input, CommNet performs better than Symbolic PPO; however, it is difficult to observe the communication patterns between agents in a way that humans can understand. The results also indicate that the LLM-based methods have significantly higher sample efficiency, and our Verco further achieves higher episode returns. We believe this improvement comes from the communication information, which effectively coordinates the actions among agents; in addition, verbal messages promote our understanding of the cooperation mechanisms between agents. The results further show that Verco has significantly shorter episode lengths and lower policy entropy than the other algorithms, indicating that the introduction of verbal communication encourages agents to complete tasks more efficiently and reduces the uncertainty of the action policy. Although GPT-4 has rich prior knowledge, it still exhibits biases when making complex decisions in the environment, which can easily lead to task failure.

5.4 Verbal communication visualization on Overcooked

To further analyze the differences brought about by the introduction of verbal communication, we visualize replays of Verco and the non-communication algorithm (TWOSOME) in detail. In the Single Room scenario, shown in Figure 6(a,d), communication is crucial for coordinating and performing different tasks, since the agents' actions may conflict. Specifically, Alice suggests that Bob pick up the tomato, and Bob suggests that Alice pick up the bowl; after receiving each other's messages, the two agents choose different actions and thus avoid the conflict. Without communication, both agents go directly for the tomato, which causes a collision and wastes time. In the Separate Rooms scenario, the agents are separated by a table and can only transfer items through the middle table, so communication is also important for their cooperation. As shown in Figure 6(b), after Alice reminds Bob, Bob puts the tomatoes on the middle table for Alice to chop. Without communication, Bob may not be aware of the need to pass the tomato to Alice, as shown in Figure 6(e). Overall, by directly sending verbal messages, we gain a more intuitive understanding of the cooperative motivations between agents.

Figure 6: Communication display in different scenarios.

6 Closing Remarks

In this paper, we propose a novel multi-agent communication algorithm called Verco. Verco endows agents with the ability to send human-understandable verbal messages and to make decisions based on teammates' messages by incorporating multiple sets of LoRA parameters. To generate coordinated and consistent messages for the message module, we employ a teacher LLM with global observation to produce message labels and train the message module with local observations as input. After SFT, the agents load the well-trained communication module weights, and the action policy is trained through reinforcement learning by continuously interacting with the environment. Evaluations conducted in the Overcooked environment demonstrate that our algorithm outperforms existing baselines in terms of performance and exhibits stronger interpretability, contributing to a deeper understanding of how cooperation among agents forms.

References

  • Arel et al. [2010] Itamar Arel, Cong Liu, Tom Urbanik, and Airton G Kohls. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems, 4(2):128–135, 2010.
  • Matignon et al. [2012] Laëtitia Matignon, Laurent Jeanpierre, and Abdel-Illah Mouaddib. Coordinated multi-robot exploration under communication constraints using decentralized markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, pages 2017–2023, 2012.
  • Zhang and Lesser [2011] Chongjie Zhang and Victor Lesser. Coordinated multi-agent reinforcement learning in networked distributed pomdps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, pages 764–770, 2011.
  • Yarats et al. [2021] Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, and Rob Fergus. Improving sample efficiency in model-free reinforcement learning from images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10674–10681, 2021.
  • Li et al. [2024] Dapeng Li, Na Lou, Bin Zhang, Zhiwei Xu, and Guoliang Fan. Adaptive parameter sharing for multi-agent reinforcement learning. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6035–6039, 2024. doi: 10.1109/ICASSP48485.2024.10447262.
  • Papoudakis et al. [2019] Georgios Papoudakis, Filippos Christianos, Arrasy Rahman, and Stefano V Albrecht. Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv preprint arXiv:1906.04737, 2019.
  • Milani et al. [2022] Stephanie Milani, Zhicheng Zhang, Nicholay Topin, Zheyuan Ryan Shi, Charles Kamhoua, Evangelos E Papalexakis, and Fei Fang. Maviper: Learning decision tree policies for interpretable multi-agent reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 251–266. Springer, 2022.
  • Glanois et al. [2021] Claire Glanois, Paul Weng, Matthieu Zimmer, Dong Li, Tianpei Yang, Jianye Hao, and Wulong Liu. A survey on interpretable reinforcement learning. CoRR, abs/2112.13112, 2021. URL https://arxiv.org/abs/2112.13112.
  • Sunehag et al. [2017] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinícius Flores Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z. Leibo, Karl Tuyls, and Thore Graepel. Value-decomposition networks for cooperative multi-agent learning. CoRR, abs/1706.05296, 2017.
  • Rashid et al. [2018] Tabish Rashid, Mikayel Samvelyan, Christian Schröder de Witt, Gregory Farquhar, Jakob N. Foerster, and Shimon Whiteson. QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In Jennifer G. Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 4292–4301. PMLR, 2018.
  • Foerster et al. [2018] Jakob N. Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 2974–2982. AAAI Press, 2018.
  • Yu et al. [2021] Chao Yu, Akash Velu, Eugene Vinitsky, Yu Wang, Alexandre M. Bayen, and Yi Wu. The surprising effectiveness of MAPPO in cooperative, multi-agent games. CoRR, abs/2103.01955, 2021.
  • Lowe et al. [2017] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6379–6390, 2017.
  • Li et al. [2023a] Dapeng Li, Zhiwei Xu, Bin Zhang, and Guoliang Fan. Sea: A spatially explicit architecture for multi-agent reinforcement learning. In 2023 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2023a. doi: 10.1109/IJCNN54540.2023.10191819.
  • Miller et al. [2002] John H Miller, Carter T Butts, and David Rode. Communication and cooperation. Journal of Economic Behavior & Organization, 47(2):179–195, 2002.
  • Yang and Wang [2020] Yaodong Yang and Jun Wang. An overview of multi-agent reinforcement learning from game theoretical perspective. arXiv preprint arXiv:2011.00583, 2020.
  • Li et al. [2023b] Dapeng Li, Zhiwei Xu, Bin Zhang, and Guoliang Fan. From explicit communication to tacit cooperation: A novel paradigm for cooperative marl. arXiv preprint arXiv:2304.14656, 2023b.
  • Peng et al. [2017] Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang. Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. CoRR, abs/1703.10069, 2017. URL http://arxiv.org/abs/1703.10069.
  • Sukhbaatar et al. [2016] Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Learning multiagent communication with backpropagation. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2244–2252, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/55b1927fdafef39c48e5b73b5d61ea60-Abstract.html.
  • Tan et al. [2024] Weihao Tan, Wentao Zhang, Shanqi Liu, Longtao Zheng, Xinrun Wang, and Bo An. True knowledge comes from practice: Aligning large language models with embodied environments via reinforcement learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=hILVmJ4Uvu.
  • Carta et al. [2023] Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, and Pierre-Yves Oudeyer. Grounding large language models in interactive environments with online reinforcement learning. arXiv preprint arXiv:2302.02662, 2023.
  • Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Mandi et al. [2023] Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models. arXiv preprint arXiv:2307.04738, 2023.
  • Jiang et al. [2023] Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts, 2023.
  • Liu et al. [2023] Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. Llm+ p: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023.
  • Ahn et al. [2022] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. Do as i can, not as i say: Grounding language in robotic affordances, 2022.
  • Zhang et al. [2023a] Ceyao Zhang, Kaijie Yang, Siyi Hu, Zihao Wang, Guanghe Li, Yihang Sun, Cheng Zhang, Zhaowei Zhang, Anji Liu, Song-Chun Zhu, et al. Proagent: Building proactive cooperative ai with large language models. arXiv preprint arXiv:2308.11339, 2023a.
  • Zhang et al. [2023b] Bin Zhang, Hangyu Mao, Jingqing Ruan, Ying Wen, Yang Li, Shao Zhang, Zhiwei Xu, Dapeng Li, Ziyue Li, Rui Zhao, et al. Controlling large language model-based agents for large-scale decision-making: An actor-critic approach. arXiv preprint arXiv:2311.13884, 2023b.
  • Wang et al. [2023] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
  • Yao et al. [2022] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
  • Dasgupta et al. [2023] Ishita Dasgupta, Christine Kaeser-Chen, Kenneth Marino, Arun Ahuja, Sheila Babayan, Felix Hill, and Rob Fergus. Collaborating with language models for embodied reasoning. arXiv preprint arXiv:2302.00763, 2023.
  • Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Mangrulkar et al. [2022] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
  • Ding et al. [2023] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence, 5(3):220–235, 2023.
  • Yuan et al. [2022] Lei Yuan, Jianhao Wang, Fuxiang Zhang, Chenghe Wang, Zongzhang Zhang, Yang Yu, and Chongjie Zhang. Multi-agent incentive communication via decentralized teammate modeling. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 9466–9474. AAAI Press, 2022. URL https://ojs.aaai.org/index.php/AAAI/article/view/21179.
  • Slumbers et al. [2023] Oliver Slumbers, David Henry Mguni, Kun Shao, and Jun Wang. Leveraging large language models for optimised coordination in textual multi-agent reinforcement learning. 2023.
  • Patel and Pavlick [2021] Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. In International conference on learning representations, 2021.
  • De Witt et al. [2020] Christian Schroeder De Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the starcraft multi-agent challenge? arXiv preprint arXiv:2011.09533, 2020.
  • Schulman et al. [2015] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
  • OpenAI et al. [2024] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, et al. GPT-4 technical report, 2024.