Mobile device operation tasks are increasingly becoming a popular multi-modal AI application scenario. Current Multi-modal Large Language Models (MLLMs), constrained by their training data, lack the capability to function effectively as operation assistants. Instead, MLLM-based agents, which enhance capabilities through tool invocation, are gradually being applied to this scenario. However, the two major navigation challenges in mobile device operation tasks, task progress navigation and focus content navigation, are difficult to solve effectively under the single-agent architecture of existing work, because overly long token sequences and the interleaved text-image data format limit performance. To address these navigation challenges effectively, we propose Mobile-Agent-v2, a multi-agent architecture for mobile device operation assistance. The architecture comprises three agents: a planning agent, a decision agent, and a reflection agent. The planning agent condenses the lengthy, interleaved history of operations and screen summaries into a pure-text task progress, which is then passed to the decision agent. This reduction in context length makes it easier for the decision agent to navigate the task progress. To retain focus content, we design a memory unit that the decision agent updates as the task progresses. Additionally, to correct erroneous operations, the reflection agent observes the outcome of each operation and handles any mistakes accordingly. Experimental results indicate that Mobile-Agent-v2 achieves over a 30% improvement in task completion compared to the single-agent architecture of Mobile-Agent. The code is open-sourced at https://github.com/X-PLUG/MobileAgent.
1 Introduction
Multi-modal Large Language Models (MLLMs), represented by GPT-4v OpenAI (2023), have demonstrated outstanding capabilities in various domains Bai et al. (2023); Liu et al. (2023c,b); Dai et al. (2023); Zhu et al. (2023); Chen et al. (2023); Ye et al. (2023a,b); Wang et al. (2023c); Hu et al. (2023, 2024); Zhang et al. (2024b). With the rapid development of agents based on Large Language Models (LLMs) Zhao et al. (2024); Liu et al. (2023f); Talebirad and Nadiri (2023); Zhang et al. (2023b); Wu et al. (2023); Shen et al. (2024); Li et al. (2023a), MLLM-based agents, which can overcome the limitations of MLLMs in specific application scenarios through various visual perception tools, have become a focal point of research attention Liu et al. (2023d).
Figure 1: Mobile device operation tasks require navigating focus content and task progress from history operation sequences, where the focus content comes from previous screens. As the number of operations increases, the length of the input sequences grows, making it extremely challenging for a single-agent architecture to manage these two types of navigation effectively.
Automated operations on mobile devices, as a practical multi-modal application scenario, are emerging as a major technological revolution in AI smartphone development Yao et al. (2022); Deng et al. (2023); Gur et al. (2024); Zheng et al. (2024); Zhang et al. (2023a); Wang et al. (2024); Chen and Li (2024a,b,c); Zhang et al. (2024a); Wen et al. (2023); Cheng et al. (2024). However, due to limited screen recognition, operation, and localization capabilities, existing MLLMs face challenges in this scenario. To address this, existing work leverages MLLM-based agent architectures to endow MLLMs with various capabilities for perceiving and operating mobile device UIs. AppAgent Zhang et al. (2023a) tackles the localization limitation of MLLMs by extracting clickable positions from device XML files. However, the reliance on UI files limits the applicability of this method to other platforms and devices. To eliminate the dependency on underlying UI files, Mobile-Agent Wang et al. (2024) proposes a solution for localization through visual perception tools. It perceives the screen through an MLLM and generates operations, locating their target positions with visual perception tools.
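To make this tool-based localization concrete, below is a minimal sketch, not the authors' implementation: the MLLM first names the target element in text, and separate perception tools (OCR for text elements, a grounding model for icons) map that description to screen coordinates. The tool interfaces `ocr_tool` and `icon_grounding_tool` are hypothetical placeholders.

```python
# A minimal sketch (not Mobile-Agent's actual code) of localization via
# visual perception tools: the MLLM describes what to tap, and perception
# tools resolve that description to coordinates on the screenshot.

def box_center(box):
    """Center point of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) // 2, (y1 + y2) // 2)

def locate_target(screenshot, target_description, ocr_tool, icon_grounding_tool):
    """Return (x, y) coordinates for the element the MLLM asked to operate on."""
    # 1. Try text matching: the OCR tool yields (text, bounding_box) pairs.
    for text, box in ocr_tool(screenshot):
        if target_description.lower() in text.lower():
            return box_center(box)
    # 2. Fall back to icon grounding for non-text elements (e.g., buttons).
    box = icon_grounding_tool(screenshot, target_description)
    return box_center(box) if box else None
```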
Mobile device operation tasks involve multi-step sequential processing. The operator needs to perform a series of continuous operations on the device, starting from the initial screen, until the instruction is fully executed. There are two main challenges in this process. First, to plan the operation intent, the operator needs to navigate the current task progress from the history operations. Second, some operations may require task-relevant information from history screens; for example, writing sports news in Figure 1 requires using the match results queried earlier. We refer to this important information as the focus content. The focus content also needs to be navigated out from the history screens. However, as the task progresses, the lengthy history of interleaved image and text operations and screens can significantly reduce the effectiveness of navigation in a single-agent architecture, as shown in Figure 1.
In this paper, we propose Mobile-Agent-v2, a mobile device operation assistant with effective navigation via multi-agent collaboration. Mobile-Agent-v2 has three specialized agent roles: a planning agent, a decision agent, and a reflection agent. The planning agent generates the task progress based on the history operations. To preserve focus content from history screens, we design a memory unit that records task-related focus content. The decision agent observes this unit when generating an operation, simultaneously checking whether the current screen contains focus content and updating the memory accordingly. Since the decision agent cannot observe previous screens to reflect on its actions, we design the reflection agent to observe the changes in the screen before and after the decision agent's operation and determine whether the operation meets expectations. If the operation does not meet expectations, it takes appropriate measures to re-execute the operation. The entire process is illustrated in Figure 3. The three agent roles work in the progress, decision, and reflection stages respectively, collaborating to alleviate the difficulty of navigation.
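To illustrate how the three roles could fit together, the following is a minimal sketch of one possible control loop, assuming simple string-based agent interfaces and a `device` object with `screenshot()` and `execute()` methods; all names are illustrative and not the paper's actual API.

```python
# A minimal sketch, under the assumptions above, of the three-stage
# collaboration: decision -> reflection -> planning, repeated per step.

def run_task(instruction, device, planner, decider, reflector, max_steps=20):
    progress = ""   # pure-text task progress (planning agent's output)
    memory = ""     # focus content recorded from previous screens

    for _ in range(max_steps):
        screen_before = device.screenshot()

        # Decision stage: choose an operation from the instruction, the
        # text-only progress, the memory unit, and the current screen;
        # the decision agent may also write new focus content to memory.
        operation, memory = decider(instruction, progress, memory, screen_before)
        if operation == "STOP":
            break
        device.execute(operation)

        # Reflection stage: compare screens before and after the operation;
        # if the result is unexpected, execute a remedial operation.
        screen_after = device.screenshot()
        verdict = reflector(instruction, operation, screen_before, screen_after)
        if verdict != "ok":
            device.execute(verdict)  # e.g., undo the wrong tap and retry
            continue

        # Planning stage: fold the completed operation into the pure-text
        # progress so the long image-text history need not be re-read.
        progress = planner(progress, operation)
```

The key design point this sketch reflects is that only `progress` and `memory`, both plain text, carry history across steps, so the decision agent's context stays short regardless of how many operations have been performed.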
Our contributions are summarized as follows:
We propose a multi-agent architecture, Mobile-Agent-v2, to alleviate the various navigation difficulties inherent in the single-agent framework for mobile device operation tasks. We design a planning agent to generate the task progress based on the history operations, ensuring effective operation generation by the decision agent.
To avoid losing focus content navigation and reflection capability, we design both a memory unit and a reflection agent. The memory unit is updated by the decision agent with focus content. The reflection agent assesses whether the decision agent's operation meets expectations and generates appropriate remedial measures if it does not.
We conducted dynamic evaluations of Mobile-Agent-v2 across various operating systems, language environments, and applications. Experimental results demonstrate that Mobile-Agent-v2 achieves significant performance improvements. Furthermore, we empirically validated that the performance of Mobile-Agent-v2 can be further enhanced by injecting manual operation knowledge.
2 Related Work
2.1 Multi-agent Application
The powerful comprehension and reasoning capabilities of Large Language Models (LLMs) enable LLM-based agents to independently execute tasks Brown et al. (2020); Achiam et al. (2023); Touvron et al. (2023a,b); Bai et al. (2023). Inspired by human team collaboration, the multi-agent framework has been proposed. Park et al. (2023) constructs Smallville, consisting of 25 agents in a sandbox environment. Li et al. (2023b) proposes a role-playing-based multi-agent collaborative framework that enables two agents playing different roles to collaborate autonomously. Chen et al. (2024) proposes an effective multi-agent framework for coordinating the collaboration of multiple expert agents. Hong et al. (2024) presents a groundbreaking meta-programming multi-agent collaboration framework. Wu et al. (2024) proposes a generic multi-agent framework that allows users to configure the number of agents, interaction modes, and toolsets. Chan et al. (2024); Subramaniam et al. (2024); Tao et al. (2024) investigate multi-agent debating frameworks aimed at evaluating the quality of different texts or generated content. Abdelnabi et al. (2024); Xu et al. (2024); Mukobi et al. (2024) integrate multi-agent interaction with game-theoretic strategies, aiming to enhance both cooperative and decision-making abilities.
2.2 UI Agent
Webpages, as a classic application scenario for UI agents, have attracted widespread attention to research on web agents. Yao et al. (2022) and Deng et al. (2023) aim to enhance the performance of agents on real-world webpage tasks by constructing high-quality website task datasets. Gur et al. (2024) utilizes pre-trained LLMs and self-experience learning to automate task processing on real-world websites. Zheng et al. (2024) leverages GPT-4V for visual understanding and webpage manipulation. Simultaneously, research on LLM-based UI agents for mobile platforms has also drawn significant attention. Wen et al. (2023) converts Graphical User Interface (GUI) information into HTML representations and then leverages LLMs in conjunction with application-specific domain knowledge. Yan et al. (2023) proposes a multi-modal intelligent mobile agent based on GPT-4V, exploring the direct use of GPT-4V to perceive screenshots with annotations. Unlike the former approach, which operates on screens with digital labels, Zhang et al. (2023a) combines the application's XML files for localization operations, mimicking human spatial autonomy in operating mobile applications. Wang et al. (2024) eliminates the dependency on the application's XML files and leverages visual module tools for localization operations. Additionally, Hong et al. (2023) designs a GUI agent based on pre-trained vision-language models. Chen and Li (2024a,b,c) propose small-scale client-side models for deployment on actual devices. Zhang et al. (2024a) proposes a UI multi-agent framework tailored for the Windows operating system. Despite the significant performance improvements achieved by multi-agent architectures in many tasks, no existing work employs multi-agent architectures in mobile device operation tasks. To address the