Agent AI:
Surveying the Horizons of Multimodal Interaction
Zane Durante¹†*, Qiuyuan Huang²‡*, Naoki Wake²*, Ran Gong³†, Jae Sung Park⁴†, Bidipta Sarkar¹†, Rohan Taori¹†, Yusuke Noda⁵, Demetri Terzopoulos³, Yejin Choi⁴, Katsushi Ikeuchi², Hoi Vo⁵, Li Fei-Fei¹, Jianfeng Gao²
¹Stanford University; ²Microsoft Research, Redmond; ³University of California, Los Angeles; ⁴University of Washington; ⁵Microsoft Gaming
Figure 1: Overview of an Agent AI system that can perceive and act in different domains and applications. Agent AI is emerging as a promising avenue toward Artificial General Intelligence (AGI). Agent AI training has demonstrated the capacity for multi-modal understanding in the physical world. It provides a framework for reality-agnostic training by leveraging generative AI alongside multiple independent data sources. Large foundation models trained for agent and action-related tasks can be applied to physical and virtual worlds when trained on cross-reality data. We present the general overview of an Agent AI system that can perceive and act in many different domains and applications, possibly serving as a route towards AGI using an agent paradigm.
Abstract
Multimodal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments helps models process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define “Agent AI” as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
Contents
1 Introduction
1.1 Motivation
1.2 Background
1.3 Overview
2 Agent AI Integration
2.1 Infinite AI agent
2.2 Agent AI with Large Foundation Models
2.2.1 Hallucinations
2.2.2 Biases and Inclusivity
2.2.3 Data Privacy and Usage
2.2.4 Interpretability and Explainability
2.2.5 Inference Augmentation
2.2.6 Regulation
2.3 Agent AI for Emergent Abilities
3 Agent AI Paradigm
3.1 LLMs and VLMs
3.2 Agent Transformer Definition
3.3 Agent Transformer Creation
4 Agent AI Learning
4.1 Strategy and Mechanism
4.1.1 Reinforcement Learning (RL)
4.1.2 Imitation Learning (IL)
4.1.3 Traditional RGB
4.1.4 In-context Learning
4.1.5 Optimization in the Agent System
4.2 Agent Systems (zero-shot and few-shot level)
4.2.1 Agent Modules
4.2.2 Agent Infrastructure
4.3 Agentic Foundation Models (pretraining and finetune level)
5 Agent AI Categorization
5.1 Generalist Agent Areas
5.2 Embodied Agents
5.2.1 Action Agents
5.2.2 Interactive Agents
5.3 Simulation and Environments Agents
5.4 Generative Agents
5.4.1 AR/VR/mixed-reality Agents
5.5 Knowledge and Logical Inference Agents
5.5.1 Knowledge Agent
5.5.2 Logic Agents
5.5.3 Agents for Emotional Reasoning
5.5.4 Neuro-Symbolic Agents
5.6 LLMs and VLMs Agent
6 Agent AI Application Tasks
6.1 Agents for Gaming
6.1.1 NPC Behavior
6.1.2 Human-NPC Interaction
6.1.3 Agent-based Analysis of Gaming
6.1.4 Scene Synthesis for Gaming
6.1.5 Experiments and Results
6.2 Robotics
6.2.1 LLM/VLM Agent for Robotics
6.2.2 Experiments and Results
6.3 Healthcare
6.3.1 Current Healthcare Capabilities
6.4 Multimodal Agents
6.4.1 Image-Language Understanding and Generation
6.4.2 Video and Language Understanding and Generation
6.4.3 Experiments and Results
6.5 Video-language Experiments
6.6 Agent for NLP
6.6.1 LLM agent
6.6.2 General LLM agent
6.6.3 Instruction-following LLM agents
6.6.4 Experiments and Results
7 Agent AI Across Modalities, Domains, and Realities
7.1 Agents for Cross-modal Understanding
7.2 Agents for Cross-domain Understanding
7.3 Interactive agent for cross-modality and cross-reality
7.4 Sim to Real Transfer
8 Continuous and Self-improvement for Agent AI
8.1 Human-based Interaction Data
8.2 Foundation Model Generated Data
9 Agent Dataset and Leaderboard
9.1 “CuisineWorld” Dataset for Multi-agent Gaming
9.1.1 Benchmark
9.1.2 Task
9.1.3 Metrics and Judging
9.1.4 Evaluation
9.2 Audio-Video-Language Pre-training Dataset
10 Broader Impact Statement
11 Ethical Considerations
*Equal Contribution. ‡Project Lead. †Work done while interning at Microsoft Research, Redmond.