Introducing Apple’s On-Device and Server Foundation Models
At the 2024 Worldwide Developers Conference, we introduced Apple Intelligence, a personal intelligence system integrated deeply into iOS 18, iPadOS 18, and macOS Sequoia.
Apple Intelligence comprises multiple highly capable generative models that are specialized for our users’ everyday tasks, and can adapt on the fly to their current activity. The foundation models built into Apple Intelligence have been fine-tuned for user experiences such as writing and refining text, prioritizing and summarizing notifications, creating playful images for conversations with family and friends, and taking in-app actions to simplify interactions across apps.
In the following overview, we will detail how two of these models — a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute and running on Apple silicon servers — have been built and adapted to perform specialized tasks efficiently, accurately, and responsibly. These two foundation models are part of a larger family of generative models created by Apple to support users and developers; this includes a coding model to build intelligence into Xcode, as well as a diffusion model to help users express themselves visually, for example, in the Messages app. We look forward to sharing more information soon on this broader set of models.
Update - July 29, 2024: The figures in this article have been updated to reflect the model versions and evaluations used in the technical report released today. For more detail, please see the paper: Apple Intelligence Foundation Language Models.
Our Focus on Responsible AI Development
Apple Intelligence is designed with our core values at every step and built on a foundation of groundbreaking privacy innovations.
Additionally, we have created a set of Responsible AI principles to guide how we develop AI tools, as well as the models that underpin them:
- Empower users with intelligent tools: We identify areas where AI can be used responsibly to create tools for addressing specific user needs. We respect how our users choose to use these tools to accomplish their goals.
- Represent our users: We build deeply personal products with the goal of representing users around the globe authentically. We work continuously to avoid perpetuating stereotypes and systemic biases across our AI tools and models.
- Design with care: We take precautions at every stage of our process, including design, model training, feature development, and quality evaluation to identify how our AI tools may be misused or lead to potential harm. We will continuously and proactively improve our AI tools with the help of user feedback.
- Protect privacy: We protect our users' privacy with powerful on-device processing and groundbreaking infrastructure like Private Cloud Compute. We do not use our users' private personal data or user interactions when training our foundation models.
These principles are reflected throughout the architecture that enables Apple Intelligence, connects features and tools with specialized models, and scans inputs and outputs to provide each feature with the information needed to function responsibly.
In the remainder of this overview, we provide details on decisions such as: how we develop models that are highly capable, fast, and power-efficient; how we approach training these models; how our adapters are fine-tuned for specific user needs; and how we evaluate model performance for both helpfulness and unintended harm.
Pre-Training
Our foundation models are trained on Apple's AXLearn framework, an open-source project we released in 2023. It builds on top of JAX and XLA, and allows us to train the models with high efficiency and scalability on various training hardware and cloud platforms, including TPUs and both cloud and on-premise GPUs. We used a combination of data parallelism, tensor parallelism, sequence parallelism, and Fully Sharded Data Parallel (FSDP) to scale training along multiple dimensions such as data, model, and sequence length.
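The FSDP idea above can be illustrated with a toy, single-host simulation: each worker stores only a shard of the parameters, and the full set is gathered just-in-time for compute. The helper names and worker counts below are hypothetical stand-ins for real AXLearn/JAX collectives, not production code:

```python
# Toy simulation of Fully Sharded Data Parallelism (FSDP) on one host.
# In real training, shards live on separate accelerators and all_gather
# is a hardware collective; here plain lists stand in for both.

def shard(params, n_workers):
    """Split the flat parameter list so each worker stores 1/n of it."""
    k = len(params) // n_workers
    return [params[i * k:(i + 1) * k] for i in range(n_workers)]

def all_gather(shards):
    """Reassemble the full parameter list just-in-time for a forward pass."""
    return [p for s in shards for p in s]

params = list(range(8))       # stand-in for model weights
shards = shard(params, 4)     # each of 4 workers holds 2 parameters
print([len(s) for s in shards])        # [2, 2, 2, 2]
print(all_gather(shards) == params)    # True: full weights recoverable
```

The memory saving is the point: each worker persistently holds only 1/n of the parameters (and, in real FSDP, 1/n of the gradients and optimizer state), paying a communication cost to gather full weights when they are needed.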
We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot. Web publishers have the option to opt out of the use of their web content for Apple Intelligence training with a data usage control.
We never use our users’ private personal data or user interactions when training our foundation models, and we apply filters to remove personally identifiable information like social security and credit card numbers that are publicly available on the Internet. We also filter profanity and other low-quality content to prevent its inclusion in the training corpus. In addition to filtering, we perform data extraction, deduplication, and the application of a model-based classifier to identify high quality documents.
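A minimal sketch of such a filtering and deduplication stage might look like the following; the regexes and exact-hash scheme are illustrative only and are not Apple's production pipeline:

```python
import hashlib
import re

# Illustrative PII patterns only; a real pipeline uses far more filters.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")        # US SSN shape
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")    # rough card number shape

def scrub(text: str) -> str:
    """Redact personally identifiable patterns from a document."""
    return CARD_RE.sub("[REDACTED]", SSN_RE.sub("[REDACTED]", text))

def dedup(docs):
    """Exact deduplication by content hash; production systems also
    apply near-duplicate detection (e.g. MinHash) and quality classifiers."""
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out

docs = ["SSN 123-45-6789 leaked", "SSN 123-45-6789 leaked", "clean doc"]
print([scrub(d) for d in dedup(docs)])
# ['SSN [REDACTED] leaked', 'clean doc']
```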
Post-Training
We find that data quality is essential to model success, so we utilize a hybrid data strategy in our training pipeline, incorporating both human-annotated and synthetic data, and conduct thorough data curation and filtering procedures. We have developed two novel algorithms in post-training: (1) a rejection sampling fine-tuning algorithm with teacher committee, and (2) a reinforcement learning from human feedback (RLHF) algorithm with mirror descent policy optimization and a leave-one-out advantage estimator. We find that these two algorithms lead to significant improvement in the model’s instruction-following quality.
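The leave-one-out advantage estimator can be sketched in a few lines: with k sampled responses per prompt, each sample is baselined against the mean reward of the remaining k-1 samples, so no separate value network is needed. The reward values below are made up for illustration:

```python
# Leave-one-out (RLOO-style) advantage estimation: each of k sampled
# responses to a prompt is compared against the mean reward of the
# other k-1 samples. Advantages for a prompt sum to zero by construction.

def leave_one_out_advantages(rewards):
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

rewards = [1.0, 0.5, 0.0, 0.5]   # made-up rewards for 4 samples of one prompt
adv = leave_one_out_advantages(rewards)
print([round(a, 3) for a in adv])   # [0.667, 0.0, -0.667, 0.0]
```

Samples rewarded above their peers get positive advantage and are reinforced; below-average samples are pushed down, which is the learning signal the RLHF step consumes.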
Optimization
In addition to ensuring our generative models are highly capable, we have used a range of innovative techniques to optimize them on-device and on our private cloud for speed and efficiency. We have applied an extensive set of optimizations for both first token and extended token inference performance.
Both the on-device and server models use grouped-query-attention. We use shared input and output vocab embedding tables to reduce memory requirements and inference cost. These shared embedding tensors are mapped without duplications. The on-device model uses a vocab size of 49K, while the server model uses a vocab size of 100K, which includes additional language and technical tokens.
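As a rough illustration of why grouped-query attention helps at inference time, the bookkeeping below compares KV-cache size when several query heads share one key/value head. The head counts and dimensions are invented for the example; Apple has not published the models' exact attention configuration:

```python
# Toy grouped-query attention (GQA) bookkeeping: several query heads
# share each key/value head, shrinking the KV cache that must be kept
# in memory during generation. All sizes here are illustrative.

n_q_heads, n_kv_heads, head_dim, seq_len = 16, 4, 64, 1024
group_size = n_q_heads // n_kv_heads        # query heads per KV head

def kv_cache_floats(n_heads):
    """Floats held in the cache: keys + values for every position."""
    return 2 * seq_len * n_heads * head_dim

mha = kv_cache_floats(n_q_heads)   # standard multi-head attention cache
gqa = kv_cache_floats(n_kv_heads)  # grouped-query attention cache
print(group_size, mha // gqa)      # 4 heads per group, 4x smaller cache
```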
For on-device inference, we use low-bit palettization, a critical optimization technique that meets the necessary memory, power, and performance requirements. To maintain model quality, we developed a new framework using LoRA adapters that incorporates a mixed 2-bit and 4-bit configuration strategy, averaging 3.7 bits-per-weight, to achieve the same accuracy as the uncompressed models. More aggressively, the model can be compressed to 3.5 bits-per-weight without significant quality loss.
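Palettization can be sketched as follows: weights are stored as small indices into a shared lookup table of centroids, so a 2-bit index addresses a 4-entry palette. This toy uses a fixed palette rather than a learned one (real pipelines typically fit centroids, e.g. by k-means) and ignores the per-layer 2-bit/4-bit mixing described above:

```python
# Minimal palettization sketch: each weight is replaced by the index of
# its nearest centroid in a small shared palette. Storing 2-bit indices
# plus one tiny table per tensor is what cuts the bits-per-weight.

palette = [-0.3, -0.1, 0.1, 0.3]   # 2 bits -> 4 centroids (illustrative)

def palettize(weights):
    """Map each weight to the index of its nearest palette entry."""
    return [min(range(len(palette)), key=lambda i: abs(w - palette[i]))
            for w in weights]

def dequantize(indices):
    """Recover approximate weights by palette lookup."""
    return [palette[i] for i in indices]

w = [0.27, -0.05, 0.12, -0.31]
idx = palettize(w)
print(idx, dequantize(idx))   # [3, 1, 2, 0] [0.3, -0.1, 0.1, -0.3]
```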
Additionally, we use an interactive model latency and power analysis tool, Talaria, to better guide the bit rate selection for each operation. We also utilize activation quantization and embedding quantization, and have developed an approach to enable efficient Key-Value (KV) cache update on our neural engines.
With this set of optimizations, on iPhone 15 Pro we are able to reach a time-to-first-token latency of about 0.6 milliseconds per prompt token, and a generation rate of 30 tokens per second. Notably, this performance is attained before employing token speculation techniques, which further improve the token generation rate.
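Plugging those figures into a back-of-envelope calculation shows what they mean for a typical request; the prompt and output lengths below are illustrative, not measured workloads:

```python
# Back-of-envelope check on the figures above: ~0.6 ms per prompt token
# until the first token appears, then 30 tokens/s of steady generation.

prompt_tokens, output_tokens = 1000, 90
ttft_s = prompt_tokens * 0.6e-3     # time to first token, seconds
gen_s = output_tokens / 30          # steady-state generation time, seconds
print(round(ttft_s, 3), gen_s)      # 0.6 3.0
```

So a 1,000-token prompt would begin responding in roughly 0.6 seconds and finish a 90-token reply about three seconds later, which is why both first-token and extended-token performance are optimized separately.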
Model Adaptation
Our foundation models are fine-tuned for users’ everyday activities, and can dynamically specialize themselves on-the-fly for the task at hand. We utilize adapters, small neural network modules that can be plugged into various layers of the pre-trained model, to fine-tune our models for specific tasks. For our models we adapt the attention matrices, the attention projection matrix, and the fully connected layers in the point-wise feedforward networks for a suitable set of the decoding layers of the transformer architecture.
By fine-tuning only the adapter layers, the original parameters of the base pre-trained model remain unchanged, preserving the general knowledge of the model while tailoring the adapter layers to support specific tasks.
We represent the values of the adapter parameters using 16 bits, and for the ~3 billion parameter on-device model, the parameters for a rank 16 adapter typically require tens of megabytes. The adapter models can be dynamically loaded, temporarily cached in memory, and swapped, giving our foundation model the ability to specialize itself on the fly for the task at hand while efficiently managing memory and guaranteeing the operating system's responsiveness.
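That memory footprint follows directly from the LoRA parameter count: a rank-r adapter adds r * (d_in + d_out) trainable parameters per adapted matrix on top of the frozen base weights. The dimensions and matrix count below are hypothetical, chosen only to show the order of magnitude:

```python
# Rank-r LoRA adds r * (d_in + d_out) parameters per adapted matrix.
# The layer width and number of adapted matrices are invented here;
# the production model's configuration has not been published.

d_in, d_out, rank = 2048, 2048, 16
params_per_matrix = rank * (d_in + d_out)

n_matrices = 64                                    # hypothetical count
total_bytes = params_per_matrix * n_matrices * 2   # 16-bit = 2 bytes each
print(params_per_matrix, total_bytes / 1e6, "MB")  # 65536 8.388608 MB
```

With more adapted matrices or wider layers, the same arithmetic lands in the tens-of-megabytes range quoted above, small enough to load, cache, and swap adapters at runtime.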
To facilitate the training of the adapters, we created an efficient infrastructure that allows us to rapidly retrain, test, and deploy adapters when either the base model or the training data gets updated. The adapter parameters are initialized using the accuracy-recovery adapter introduced in the Optimization section.
Performance and Evaluation
Our focus is on delivering generative models that can enable users to communicate, work, express themselves, and get things done across their Apple products. When benchmarking our models, we focus on human evaluation as we find that these results are highly correlated to user experience in our products. We conducted performance evaluations on both feature-specific adapters and the foundation models.
To illustrate our approach, we look at how we evaluated our adapter for summarization. As product requirements for summaries of emails, messages, and notifications differ in subtle but important ways, we fine-tune accuracy-recovery low-rank adaptation (LoRA) adapters on top of the palettized model to meet these specific requirements. Our training data is based on synthetic summaries generated from larger server models, filtered by a rejection sampling strategy that keeps only high-quality summaries.
To evaluate the product-specific summarization, we use a set of 750 responses carefully sampled for each use case. These evaluation datasets emphasize a diverse set of inputs that our product features are likely to face in production, and include a stratified mixture of single and stacked documents of varying content types and lengths. For product features, it is important to evaluate performance against datasets that are representative of real use cases. We find that overall, our models with adapters generate better summaries than a comparable model.
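A stratified evaluation set like the one described can be assembled by drawing a fixed quota from each (content type, length) stratum so the mix mirrors production traffic; the sampler and toy pool below are purely illustrative:

```python
import random

# Hypothetical stratified sampler: a fixed quota per (content type,
# length bucket) stratum keeps the evaluation set representative of
# production traffic rather than of whatever is easiest to collect.

def stratified_sample(pool, per_stratum, seed=0):
    rng = random.Random(seed)
    strata = {}
    for ex in pool:
        strata.setdefault((ex["type"], ex["length"]), []).append(ex)
    out = []
    for key in sorted(strata):
        items = strata[key]
        out.extend(rng.sample(items, min(per_stratum, len(items))))
    return out

# Toy pool: 10 candidates in each of 4 strata (2 content types x 2 lengths).
pool = [{"type": t, "length": l, "id": f"{t}-{l}-{i}"}
        for t in ("email", "notification")
        for l in ("single", "stacked")
        for i in range(10)]

sample = stratified_sample(pool, per_stratum=3)
print(len(sample))   # 12 examples: 3 drawn from each of the 4 strata
```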
As part of responsible development, we identified and evaluated specific risks inherent to summarization. For example, summaries occasionally remove important nuance or other details in ways that are undesirable. However, we found that the summarization adapter did not amplify sensitive content in over 99% of targeted adversarial examples. We continue to adversarially probe to identify unknown harms and expand our evaluations to help guide further improvements.
In addition to evaluating feature specific performance powered by foundation models and adapters, we evaluate both the on-device and server-based models’ general capabilities. We utilize a comprehensive evaluation set of real-world prompts to test the general model capabilities. These prompts are diverse across different difficulty levels and cover major categories such as brainstorming, classification, closed question answering, coding, extraction, mathematical reasoning, open question answering, rewriting, safety, summarization, and writing.
We compare our models with both open-source models (Phi-3, Gemma, Mistral, DBRX, Llama) and commercial models of comparable size (GPT-3.5, GPT-4).[1] We find that our models are preferred by human graders over most comparable competitor models. On this benchmark, our on-device model, with ~3B parameters, outperforms larger models including Phi-3-mini, Mistral-7B, Gemma-7B, and Llama-3-8B. Our server model compares favorably to DBRX-Instruct, Mixtral-8x22B, GPT-3.5, and Llama-3-70B while being highly efficient.
We use a set of diverse adversarial prompts to test the model performance on harmful content, sensitive topics, and factuality. We measure the violation rates of each model as evaluated by human graders on this evaluation set, with a lower number being desirable. Both the on-device and server models are robust when faced with adversarial prompts, achieving violation rates lower than open-source and commercial models.
Our models are preferred by human graders as safe and helpful over competitor models for these prompts. However, considering the broad capabilities of large language models, we understand the limitation of our safety benchmark. We are actively conducting both manual and automatic red-teaming with internal and external teams to continue evaluating our models' safety.
To further evaluate our models, we use the Instruction-Following Eval (IFEval) benchmark to compare their instruction-following capabilities with models of comparable size. The results suggest that both our on-device and server models follow detailed instructions better than open-source and commercial models of comparable size.
We evaluate our models’ writing ability on our internal summarization and composition benchmarks, consisting of a variety of writing instructions. These results do not refer to our feature-specific adapter for summarization (seen in Figure 3), nor do we have an adapter focused on composition.
Conclusion
The Apple foundation models and adapters introduced at WWDC24 underlie Apple Intelligence, the new personal intelligence system that is integrated deeply into iPhone, iPad, and Mac, and enables powerful capabilities across language, images, actions, and personal context. Our models have been created to help users carry out everyday activities across their Apple products, and have been developed responsibly at every stage, guided by Apple's core values. We look forward to sharing more information soon on our broader family of generative models, including language, diffusion, and coding models.
Footnotes
[1] We compared against the following model versions: gpt-3.5-turbo-0125, gpt-4-0125-preview, Phi-3-mini-4k-instruct, Mistral-7B-Instruct-v0.2, Mixtral-8x22B-Instruct-v0.1, Gemma-1.1-2B, Gemma-1.1-7B, Llama-3-8B-Instruct, and Llama-3-70B-Instruct. The open-source and Apple models are evaluated in bfloat16 precision.