
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

Zeyu Han, Chao Gao, Jinyang Liu, Jeff (Jun) Zhang, and Sai Qian Zhang
Northeastern University; University of California, Riverside; Arizona State University; New York University
{han.zeyu,liu.jinyan}@northeastern.edu, cgao037@ucr.edu, jeffzhang@asu.edu, sai.zhang@nyu.edu

Abstract

Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs. These models, often consisting of billions of parameters, require vast amounts of computational resources for execution. In particular, the expansive scale and computational demands pose considerable challenges when customizing them for specific downstream tasks, especially on hardware platforms constrained in computational capability.

Parameter-Efficient Fine-Tuning (PEFT) provides a practical solution by efficiently adapting large models to various downstream tasks. In particular, PEFT refers to the process of adjusting the parameters of a pre-trained large model to adapt it to a specific task or domain while minimizing the number of additional parameters introduced or the computational resources required. This approach is particularly important when dealing with large-scale language models with high parameter counts, as fine-tuning these models from scratch can be computationally expensive and resource-intensive, posing considerable challenges in the design of the supporting system platform.

In this survey, we present comprehensive studies of various PEFT algorithms, examining their performance and computational overhead. Moreover, we provide an overview of applications developed using different PEFT algorithms and discuss common techniques employed to mitigate the computation costs of PEFT. In addition to the algorithmic perspective, we overview various real-world system designs to investigate the implementation costs associated with different PEFT algorithms. This survey serves as an indispensable resource for researchers aiming to understand both PEFT algorithms and their system implementations, offering detailed insights into recent advancements and practical applications.

Index Terms-Large Language Model, Parameter-Efficient Fine-tuning, Computer System, Distributed System.

I. INTRODUCTION

Large Models (LMs) have recently captured considerable public interest. Their ability to understand context and nuances enables them to proficiently handle diverse tasks across multiple domains, including natural language processing (NLP), computer vision (CV), etc. In the field of NLP, Large Language Models (LLMs) have achieved significant advancements across various tasks including text generation [1], [2], translation [3], [4], personalized chat-bots [5], [6], [7], and summarization [8], demonstrating remarkable proficiency.
Earlier studies [1] have suggested that LLMs exhibit high levels of generalization, enabling them to apply their acquired knowledge to new tasks not included in their original training. This capability is commonly known as zero-shot learning. Nevertheless, fine-tuning remains essential to further enhance LLMs for optimal performance on new user datasets and tasks.
Due to their scale, a widely adopted strategy for fine-tuning LLMs involves adjusting a limited number of LLM parameters while keeping the remainder unchanged. This technique, termed Parameter-Efficient Fine-Tuning (PEFT), involves selectively adjusting a small proportion of the parameters while keeping the rest unaltered. Furthermore, the application of PEFT extends beyond the realm of NLP and has quickly attracted interest in the CV community for fine-tuning vision models with large parameter counts, such as Vision Transformers (ViT) and diffusion models, as well as interdisciplinary models such as vision-language models.
In this survey, we systematically review and categorize recent advancements in PEFT algorithms as well as the system implementation costs associated with various PEFT algorithms across diverse scenarios. Figure 1 presents an overview of the content of this survey. In Section II, we present some fundamental concepts for LLMs and PEFT, including the computational flow of LLMs, basic knowledge of PEFT, and commonly used datasets and tasks.
We categorize all types of PEFT algorithms in Section III according to their computational flow. In Section III-A, we introduce additive algorithms that either introduce additional weight parameters or modify activations. Algorithms that exclusively require fine-tuning of existing parameters fall under the category of selective approaches, introduced in Section III-B. In Section III-C, we explore reparameterized PEFT, which constructs a (low-dimensional) reparameterization of the original model parameters for training while transforming the weights back to maintain the inference speed. Additionally, there exist algorithms that combine the above techniques, and we have classified these as hybrid approaches, elaborating on them in Section III-D. We also investigate strategies for further reducing the computational complexity of different PEFT algorithms, including KV-cache management, pruning, quantization, and memory optimization, in Section IV.
In Section V, we expand the scope of this survey beyond the computational perspective to cover various potential application scenarios. We explore innovations that apply PEFT techniques to different model architectures, including LLMs
Fig. 1: An overview of the content covered in the survey.
(Section V-A), Vision Transformers (Section V-B), Vision-Language alignment models (Section V-C), and Diffusion models (Section V-D), for varied downstream tasks, underscoring PEFT's versatility and applicability in a range of scenarios.
In Section VI, we explore the system design challenges for PEFT methods. The discussion includes three advanced system solutions for practical PEFT deployment: distributed tuning (Section VI-B), PEFT query serving (Section VI-C), and concurrent PEFT tuning (Section VI-D).
In the final Section VII, we summarize our survey and propose several potential future directions from both the algorithm and system perspectives, hoping to provide valuable insights for further research and development in the field.

II. BACKGROUND

In this section, we first discuss the computation flow of LLMs, including the fundamental components, computational complexity, and the flow of computations involved, as a case study. We then provide a brief overview of different PEFT algorithms in Section II-B.

A. Computation flow for LLaMA

In order to gain a deeper understanding of LLMs and other Transformer-based models, we employ LLaMA-7B, a cutting-edge open-source LLM, to scrutinize the architecture of LLMs and the Transformer. As shown in Figure 2 (a), LLaMA consists of three major components: an embedding block, a stack of decoder blocks, and a head block consisting of linear and softmax layers. The embedding layer's primary role is to transform unstructured textual information into chunks of discrete numerical vectors (tokens) to facilitate subsequent processing. The embedded tokens are then delivered to the decoder layers for further processing. Each LLaMA decoder is composed of two fundamental components: Multi-head Self-Attention (MSA) and a Feedforward Network (FFN). In the MSA module, each token is weighted by an attention map obtained from a dot product between two linear mappings of the input tokens. The grouped tokens are then further processed by the feedforward network.

Additionally, Root Mean Square Layer Normalization (RMSNorm) [9] is adopted in LLaMA as a replacement for Layer Normalization to ensure efficient training.
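To make this concrete, below is a minimal PyTorch sketch of RMSNorm as described in [9]: unlike LayerNorm, it skips mean subtraction and bias, rescaling activations by their root mean square with a learnable per-dimension gain. The class name and eps default are illustrative choices, not the reference implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # normalize by the root mean square over the feature dimension
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```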
LLMs distinguish themselves from other deep neural network (DNN) models, such as convolutional neural networks (CNNs), in two significant ways. Firstly, LLMs exhibit an inherent autoregressive nature, necessitating multiple iterations to complete the generation task. Moreover, LLMs incorporate an attention mechanism, a component whose computational complexity scales quadratically with the length of the input. The core computational characteristic of LLMs thus lies in the attention blocks inside each decoder layer. Figure 2 (c) depicts a high-level overview of the computation flow in the attention block.
During the inference process, each decoder takes a 4-dimensional tensor as its input tokens. The input tokens are first multiplied with three weight matrices $W_q$, $W_k$, and $W_v$, producing the outputs referred to as query ($Q$), key ($K$), and value ($V$). Given the MSA module's inability to recognize positional data and the inherent autoregressive nature of LLMs, the query and key undergo a process using Rotary Positional Embedding [10] (RoPE, denoted as $R$). Subsequently, the key and value are combined with those of prior tokens.
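For concreteness, here is a minimal sketch of applying rotary positional embedding to a query or key tensor. The interleaved-pair convention, tensor layout (batch, sequence, heads, head dimension), function name, and base frequency follow common RoPE implementations and are assumptions for illustration.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of feature dimensions by position-dependent angles.
    x: (batch, seq_len, n_heads, head_dim), head_dim even."""
    b, l, h, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))  # (d/2,)
    pos = torch.arange(l).float()                                   # (l,)
    angles = torch.einsum("l,f->lf", pos, inv_freq)                 # (l, d/2)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]   # even/odd feature pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```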
After the positional embedding, the intermediate activations then undergo a series of multiplication, softmax, and residual-addition operations to generate the MSA output:

$$\mathrm{SA}(x)=\mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{h}}}\right)V$$

To be noted here, $d_h$ in the equation refers to the number of feature dimensions in the multi-head attention mechanism.
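A minimal PyTorch sketch of the scaled dot-product attention above; the tensor layout and function name are illustrative.

```python
import math
import torch

def self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q, k, v: (batch, n_heads, seq_len, head_dim)."""
    d_h = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_h)  # (b, h, l, l)
    weights = torch.softmax(scores, dim=-1)            # attention map
    return weights @ v                                 # weighted sum of values
```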
The SA output will then be forwarded to the FFN block for further processing.
Fig. 2: (a) LLaMA architecture. (b) LLaMA auto-regressive pattern. (c) Three common PEFT operations. All learnable components are highlighted in red, while frozen components are highlighted in grey. LoRA is applied to all the Query, Key, and Value blocks. The adapter targets the FFN module. Soft-Prompt focuses on tuning the input activations of each decoder. We show only one decoder for illustration simplicity.
The FFN block contains another three weight matrices, $W_{up}$, $W_{gate}$, and $W_{down}$, and its computation can be illustrated by:

$$\mathrm{FFN}(x)=W_{down}\left(\mathrm{SiLU}\left(W_{gate}\,x\right)\odot\left(W_{up}\,x\right)\right)$$
where $x$ denotes the input of the FFN layer, and SiLU is the nonlinear function used in LLaMA. In the original Transformer, the FFN block can be described by:

$$\mathrm{FFN}(x)=W_{down}\,\mathrm{ReLU}\left(W_{up}\,x+b_{1}\right)+b_{2}$$
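A minimal sketch of the gated (SwiGLU-style) FFN used in LLaMA, as given by the equation above; the module name and hidden size parameter are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLaMAFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN(x) = W_down( SiLU(W_gate x) * (W_up x) )
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```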
The output of the last decoder layer will be sent to a linear layer, which then generates a probability distribution spanning the complete vocabulary to predict the next token in the sequence. The produced token is then concatenated with the previous tokens and used as the input for the next round of processing. This generating process repeats in an auto-regressive manner until a full sequence of tokens, referred to as a completion, is produced (Figure 2 (b)). For training, the computation flow is similar to that for inference, except that the generated sentences are directly compared to the ground-truth output to compute the training loss. Gradients are then computed with respect to the LLM weights to minimize this training loss.
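The auto-regressive loop can be made concrete with a minimal greedy-decoding sketch. The model interface (token ids in, per-position logits out), function name, and defaults are assumptions for illustration, not a specific library API.

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids: torch.Tensor, eos_id: int,
                    max_new_tokens: int = 64) -> torch.Tensor:
    """input_ids: (1, prompt_len) tensor of token ids."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                        # (1, seq_len, vocab)
        next_id = logits[:, -1].argmax(-1, keepdim=True) # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=1)
        if next_id.item() == eos_id:                     # termination token
            break
    return input_ids
```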
To analyze the computation cost and memory overhead in LLMs, we also define a series of parameters used later in Section III. Table I shows the parameter sizes and computation dimensions in the LLaMA-7B model as a starting example.
LLM models generate tokens (words) one per round, as depicted in Figure 2 (b), based on the previous prompt (input) and the previously generated sequence. This process repeats until the model output hits a termination token. To accelerate the inference process in LLM models, a common strategy is to store the previous keys and values in a Key-Value cache (KV-cache), so they need not be recalculated for each new token. Mathematically, the total decoders' KV-cache memory cost (in number of stored elements) can be represented as:

$$\mathrm{Memory}_{KV}=2\times b\times l\times n_{layers}\times n_{heads}\times d_{head}$$

In the equation, $l$ and $b$ are the context length and batch size, and $n_{layers}$ refers to the number of layers. $d_{head}$ is the head dimension and $n_{heads}$ is the number of heads; the factor of 2 accounts for storing both keys and values.
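As a quick sanity check of the formula above, the small helper below computes the KV-cache footprint under assumed LLaMA-7B settings (32 layers, 32 heads, head dimension 128, fp16); the function name and defaults are illustrative.

```python
def kv_cache_bytes(batch: int, ctx_len: int, n_layers: int,
                   n_heads: int, d_head: int, bytes_per_elem: int = 2) -> int:
    """KV-cache size in bytes; the leading 2 accounts for keys and values."""
    return 2 * batch * ctx_len * n_layers * n_heads * d_head * bytes_per_elem

# LLaMA-7B in fp16 at a 2048-token context, batch size 1:
print(kv_cache_bytes(1, 2048, 32, 32, 128) / 2**30)  # -> 1.0 (GiB)
```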

B. Overview of Parameter-Efficient Fine-Tuning

Fine-tuning remains essential to enhance LLM performance on unseen user datasets and tasks. With the size of models growing (e.g., from 1.5B parameters in GPT-2 to 175B in GPT-3), the standard full fine-tuning paradigm requires thousands of GPUs working in parallel, which is highly inefficient and unsustainable. A class of algorithms, namely parameter-efficient fine-tuning (PEFT), has been proposed, which aims to tune a minimal number of parameters while achieving performance better than full tuning on downstream tasks.
In parallel developments, large-scale pre-trained models in the vision and multimodal domains have also demonstrated effective representational learning capabilities, enabling adaptation from large datasets to smaller ones, or across various data modalities, through fine-tuning. Consequently, this capability has made PEFT increasingly attractive to the wider research community.
We categorize the PEFT algorithms into additive, selective, reparameterized, and hybrid fine-tuning based on their operations. As Figure 3 depicts, three major additive fine-tuning approaches are normally used: (1) Adapter; (2) Soft Prompt; (3) Others. They differ from each other in the additional tunable modules or parameters they introduce. Selective fine-tuning, on the other hand, doesn't require any additional parameters; it selects a small subset of parameters from the backbone model and makes only them tunable, keeping the majority of parameters untouched during fine-tuning on downstream tasks (a minimal example is sketched below). We categorize selective fine-tuning based on the grouping of the chosen parameters: (1) Unstructural Masking; (2) Structural Masking.
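As a concrete instance of selective fine-tuning, the sketch below freezes all backbone parameters and unfreezes only bias terms, in the spirit of BitFit-style bias-only tuning; the helper name is illustrative.

```python
import torch.nn as nn

def mark_bias_only_trainable(model: nn.Module) -> None:
    """Freeze the backbone; leave only bias terms trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
```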
TABLE I: Configuration parameters and computation operations for the LLaMA-7B architecture (columns: operation, weights symbol, weights dimension, input tensor dimension, and complexity, for Eqs. 1-4).
Reparameterization represents transforming model parameters between two equivalent forms. Specifically, reparameterized fine-tuning introduces additional low-rank trainable parameters during training, which are then integrated with the original model weights for inference. This approach is categorized into two main strategies: (1) Low-rank Decomposition, and (2) LoRA Derivatives. Hybrid fine-tuning explores the design spaces of different PEFT methods and combines their advantages.
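To illustrate the reparameterized approach, here is a minimal LoRA-style linear layer: a frozen base weight plus a trainable low-rank update $BA$ that can be merged back into the base weight for inference. The class name, initialization, and scaling convention are illustrative assumptions, not the original LoRA implementation.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) / math.sqrt(r))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + trainable low-rank path
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    @torch.no_grad()
    def merge(self) -> None:
        # fold the low-rank update into W for zero-overhead inference
        self.base.weight += self.scale * (self.B @ self.A)
```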

C. Downstream Tasks for LLM Evaluation

Two types of tasks have been widely used for LLM evaluation. The first type is the General Language Understanding Evaluation (GLUE) benchmark [11], which integrates nine sentence- or sentence-pair language-understanding tasks (CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI), chosen for their diversity in dataset sizes, text genres, and difficulty levels, and built on established existing datasets. It also includes a diagnostic dataset specifically designed to evaluate and analyze model performance across various linguistic phenomena inherent in natural language. Additionally, it features a public leaderboard to track performance on the benchmark and a dashboard to visualize model performance on the diagnostic set.
The other type of dataset used in recent LLM papers is commonsense reasoning; the datasets integrated into our study cater to a variety of research facets: (1) OpenBookQA [12] is curated to foster research in advanced question-answering, delving into a profound understanding of both the subject matter and the language in which it is articulated. (2) PIQA [13] primarily emphasizes everyday scenarios, demonstrating a predilection for unconventional solutions. (3) Social IQA [14] emerges as a novel question-answering benchmark tailored for gauging social commonsense intelligence. (4) HellaSwag [15] serves as a dataset, the essence of which is to ascertain the capability of machines in aptly completing sentences. (5) BoolQ [16] is a dataset dedicated to question-answering, particularly for binary responses (yes/no queries). (6) WinoGrande [17] is introduced as a fresh compilation encompassing a substantial 44,000 problems. (7) ARC-easy [18] presents itself as a novel dataset constituting genuine grade-school-level multiple-choice science questions, designed to invigorate research in intricate question-answering. (8) ARC-challenge [18], distinctively, encompasses solely those questions that were inaccurately addressed by both a retrieval-based algorithm and a word co-occurrence algorithm.
Image recognition is the primary benchmark and application for vision models, exemplified by benchmarks such as fine-grained visual categorization (FGVC) and the visual task adaptation benchmark (VTAB). Beyond image classification, video action recognition is another key application area, involving datasets like Kinetics-400 [19], SSv2 [20], and HMDB51 [21]. Additionally, PEFT has been utilized for dense prediction tasks, using datasets like MSCOCO [22], ADE20K [23], and PASCAL VOC [24].

III. PEFT TAXONOMY

The PEFT strategies can be broadly classified into four categories: additive PEFT (Section III-A), which modifies the model architecture by injecting new trainable modules or parameters; selective PEFT (Section III-B), which makes a subset of parameters trainable during fine-tuning; reparameterized PEFT (Section III-C), which constructs a (low-dimensional) reparameterization of the original model parameters for training, then equivalently transforms it back for inference; and hybrid PEFT (Section III-D), which combines advantages from different PEFT methods to build a unified PEFT model. An overview of the different types of PEFT algorithms is depicted in Figure 4.

A. Additive PEFT

Standard full fine-tuning entails substantial computational expense and could also potentially harm the model's generalization ability. To mitigate this problem, a widely employed approach is to keep the pre-trained backbone unchanged and introduce only a minimal number of trainable parameters that are strategically positioned within the model architecture. When fine-tuning for a specific downstream task, only the weights of these additional modules or parameters are updated, which results in a substantial reduction in storage, memory, and computational resource requirements. Because these techniques add parameters, they can be termed Additive Tuning, as shown in Figure 4 (a). Next, we discuss several popular additive PEFT algorithms.
  1. Adapters: Adapter approaches involve the insertion of small adapter layers within Transformer blocks. Typically, an adapter layer consists of a down-projection matrix $W_{down}$, followed by a non-linear activation function $\sigma(\cdot)$, and an up-projection matrix $W_{up}$, as in the sketch below.
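A minimal sketch of such an adapter layer with a residual connection, as commonly used in serial adapter designs; the bottleneck size and activation choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, r: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, r)  # down-projection W_down
        self.act = nn.GELU()               # non-linear activation
        self.up = nn.Linear(r, d_model)    # up-projection W_up

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # bottleneck transform plus residual connection
        return x + self.up(self.act(self.down(x)))
```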