
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

Zeyu Han, Chao Gao, Jinyang Liu, Jeff (Jun) Zhang, and Sai Qian Zhang
Northeastern University, University of California Riverside, Arizona State University, New York University
{han.zeyu,liu.jinyan}@northeastern.edu, cgao037@ucr.edu, jeffzhang@asu.edu, sai.zhang@nyu.edu

Abstract

Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs. These models, often consisting of billions of parameters, require vast amounts of computational resources for execution. In particular, the expansive scale and computational demands pose considerable challenges when customizing them for specific downstream tasks, especially on hardware platforms constrained by computational capability.

Parameter-Efficient Fine-Tuning (PEFT) provides a practical solution by efficiently adapting large models to various downstream tasks. In particular, PEFT refers to the process of adjusting the parameters of a pre-trained large model to adapt it to a specific task or domain while minimizing the number of additional parameters introduced or computational resources required. This approach is particularly important when dealing with large-scale language models with high parameter counts, as fully fine-tuning these models can be computationally expensive and resource-intensive, posing considerable challenges in the design of the supporting system platform.

In this survey, we present comprehensive studies of various PEFT algorithms, examining their performance and computational overhead. Moreover, we provide an overview of applications developed using different PEFT algorithms and discuss common techniques employed to mitigate computation costs for PEFT. In addition to the algorithmic perspective, we overview various real-world system designs to investigate the implementation costs associated with different PEFT algorithms. This survey serves as an indispensable resource for researchers aiming to understand both the PEFT algorithm and its system implementation, offering detailed insights into recent advancements and practical applications.

Index Terms: Large Language Model, Parameter-Efficient Fine-tuning, Computer System, Distributed System.

I. INTRODUCTION

Large Models (LMs) have recently captured considerable public interest. Their ability to understand context and nuances enables them to proficiently handle diverse tasks across multiple domains, including natural language processing (NLP), computer vision (CV), etc. In the field of NLP, Large Language Models (LLMs) have achieved significant advancements across various tasks including text generation [1], [2], translation [3], [4], personalized chat-bots [5], [6], [7], and summarization [8], demonstrating remarkable proficiency.
Earlier studies [1] have suggested that LLMs exhibit high levels of generalization, enabling them to apply their acquired knowledge to new tasks not included in their original training. This capability is commonly known as zero-shot learning. Nevertheless, fine-tuning remains essential to further enhance LLMs for optimal performance on new user datasets and tasks.
Due to their scale, a widely adopted strategy for fine-tuning LLMs involves adjusting only a limited number of LLM parameters while keeping the remainder unchanged. This technique, termed Parameter-Efficient Fine-Tuning (PEFT), selectively adjusts a small proportion of the model parameters while keeping the rest unaltered. Furthermore, the application of PEFT extends beyond the realm of NLP and has quickly attracted interest in the CV community for fine-tuning vision models with large parameter counts, such as Vision Transformers (ViT) and diffusion models, as well as interdisciplinary models such as vision-language models.
In this survey, we systematically review and categorize recent advancements in PEFT algorithms as well as the system implementation costs associated with various PEFT algorithms across diverse scenarios. Figure 1 presents the overview of the content covered in this survey. In Section II, we present some fundamental concepts for LLM and PEFT, including the computational flow of LLMs, basic knowledge of PEFT, and commonly used datasets and tasks.
We categorize all types of PEFT algorithms in Section III according to their computational flow. In Section III-A, we introduce additive algorithms that either introduce additional weight parameters or modify activations. Algorithms that exclusively fine-tune existing parameters fall under the category of selective approaches, introduced in Section III-B. In Section III-C, we explore reparameterized PEFT, which constructs a (low-dimensional) reparameterization of the original model parameters for training while transforming the weights back to maintain the inference speed. Additionally, there exist algorithms that combine the above techniques, and we classify these as hybrid approaches, elaborating on them in Section III-D. We also investigate strategies for further reducing the computational complexity of different PEFT algorithms, including KV-cache management, pruning, quantization, and memory optimization, in Section IV.
In Section V, we expand the scope of this survey beyond the computational perspective to cover various potential application scenarios. We explore innovations that apply PEFT techniques to different model architectures, including LLMs (Section V-A), Vision Transformers (Section V-B), Vision-Language alignment models (Section V-C), and Diffusion models (Section V-D), for varied downstream tasks, underscoring PEFT's versatility and applicability in a range of scenarios.

Fig. 1: A content overview covered in the survey.
In Section VI, we explore the system design challenges for PEFT methods. The discussion covers three advanced system solutions for practical PEFT deployment: distributed tuning (Section VI-B), PEFT query serving (Section VI-C), and concurrent PEFT tuning (Section VI-D).
In the final Section VII, we summarize our survey and propose several potential future directions from both algorithm and system perspectives, hoping to provide valuable insights for further research and development in the field.

II. BACKGROUND

In this section, we first discuss the computation flow of LLMs, including the fundamental components, the computational complexity, and the flow of computations involved, using LLaMA as a case study. We then provide a brief overview of different PEFT algorithms in Section II-B.

A. Computation flow for LLaMA

In order to gain a deeper understanding of LLMs and other Transformer-based models, we employ LLaMA-7B, a cutting-edge open-source LLM, to scrutinize the architecture of LLMs as well as the Transformer. As shown in Figure 2 (a), LLaMA consists of three major components: an embedding block, a stack of decoder blocks, and a head block consisting of a linear layer and a softmax layer. The embedding layer's primary role is to transform unstructured textual information into chunks of discrete numerical vectors (tokens) to facilitate subsequent processing. The embedded tokens are then delivered to the decoder layers for further processing. Each LLaMA decoder is composed of two fundamental components: Multi-head Self-Attention (MSA) and Feedforward Network (FFN). In the MSA module, each of the tokens is aggregated according to an attention map obtained by a dot product between two linear mappings of the input tokens. The grouped tokens are then further processed by a feedforward neural network.

Additionally, Root Mean Square Layer Normalization (RMSNorm) [9] is adopted in LLaMA as a replacement for Layer Normalization to ensure efficient training.
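For concreteness, below is a minimal PyTorch sketch of RMSNorm as described in [9]; the epsilon value and the learnable per-dimension gain are common implementation choices assumed here, not necessarily LLaMA's exact settings.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization [9].

    Unlike LayerNorm, RMSNorm skips mean-centering and the bias term,
    normalizing only by the root mean square of the activations.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the feature dimension.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```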
LLMs distinguish themselves from other deep neural network (DNN) models, such as convolutional neural networks (CNNs), in two significant ways. Firstly, LLMs exhibit an inherent auto-regressive nature, necessitating multiple iterations to complete the generation task. Secondly, LLMs incorporate an attention mechanism, a component whose computational complexity scales quadratically with the length of the input. The core computational characteristic of LLMs thus lies in the attention blocks inside each decoder layer. Figure 2 (c) depicts a high-level overview of the computation flow in the attention block.
During the inference process, each decoder takes a 4-dimensional tensor as the input tokens. The input tokens are first multiplied with three weight matrices $W_q$, $W_k$, and $W_v$, producing the outputs referred to as query ($Q$), key ($K$), and value ($V$). Given the MSA module's inability to recognize positional data and the inherent auto-regressive nature of LLMs, the query and key undergo a process using Rotary Positional Embedding [10] (RoPE, denoted as $R(\cdot)$) to encode the position information. Subsequently, the key and value are combined with prior tokens.
After the positional embedding, the intermediate activations then undergo a series of multiplication, softmax, and residual addition operations to generate the MSA output:

$$\mathrm{SA}(x) = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d_{head}}}\right) V$$

where $d_{head}$ in the equation refers to the number of feature dimensions in the multi-head attention mechanism.
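The following is a minimal PyTorch sketch of this attention computation flow; the rotate-half RoPE formulation, the frequency base of 10000, and the tensor shapes follow common open-source implementations and are assumptions here, not LLaMA's exact code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope(x: torch.Tensor) -> torch.Tensor:
    """Rotary Positional Embedding (rotate-half form).
    x: (batch, n_heads, seq_len, d_head), with d_head even."""
    b, h, l, d = x.shape
    inv_freq = 1.0 / (10000 ** (torch.arange(0, d, 2, device=x.device, dtype=torch.float32) / d))
    pos = torch.arange(l, device=x.device, dtype=torch.float32)
    angles = pos[:, None] * inv_freq[None, :]      # (l, d/2)
    cos = torch.cos(angles).repeat(1, 2)           # (l, d)
    sin = torch.sin(angles).repeat(1, 2)
    x1, x2 = x[..., : d // 2], x[..., d // 2 :]
    return x * cos + torch.cat((-x2, x1), dim=-1) * sin

class MSA(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, l, d = x.shape
        # Project into per-head query, key, and value: (b, n_heads, l, d_head).
        q = self.wq(x).view(b, l, self.n_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(b, l, self.n_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(b, l, self.n_heads, self.d_head).transpose(1, 2)
        q, k = rope(q), rope(k)                    # position-encode query and key
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        # Causal mask: each token may only attend to itself and earlier tokens.
        mask = torch.triu(torch.ones(l, l, dtype=torch.bool, device=x.device), diagonal=1)
        attn = attn.masked_fill(mask, float("-inf"))
        out = F.softmax(attn, dim=-1) @ v          # weighted sum of values
        out = out.transpose(1, 2).reshape(b, l, d)
        return self.wo(out) + x                    # output projection + residual addition
```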
The SA output will then be forwarded to the FFN blocks for further processing. The FFN block has another three weight matrices $W_{up}$, $W_{gate}$, and $W_{down}$, and its computation can be illustrated by:

$$\mathrm{FFN}(x) = W_{down}\left(\mathrm{SiLU}(W_{gate}\, x) \odot (W_{up}\, x)\right)$$

where $x$ denotes the input of the FFN layer, and SiLU is the nonlinear function used in LLaMA. In the original Transformer, the FFN block can be expressed as:

$$\mathrm{FFN}(x) = W_{2}\, \mathrm{ReLU}(W_{1}\, x + b_{1}) + b_{2}$$

Fig. 2: (a) LLaMA architecture. (b) LLaMA auto-regressive pattern. (c) Three common PEFT operations. All the learnable components are highlighted in red, while the frozen components are highlighted in grey. LoRA is applied on all the Query, Key, and Value blocks. The adapter targets the FFN module. Soft-Prompt focuses on tuning the input activation of each decoder. We only show one decoder for illustration simplicity.
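Below is a minimal sketch contrasting the two FFN variants; the absence of bias terms in the gated variant and the hidden dimension follow common open-source implementations and should be treated as assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class LLaMAFFN(nn.Module):
    """Gated FFN used in LLaMA: W_down(SiLU(W_gate x) * (W_up x))."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # Elementwise product of the SiLU-gated branch and the up-projection.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class VanillaFFN(nn.Module):
    """Original Transformer FFN: W2 ReLU(W1 x + b1) + b2."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden)
        self.w2 = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.w2(F.relu(self.w1(x)))
```

For LLaMA-7B, the commonly reported dimensions are d_model = 4096 and d_hidden = 11008.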
The output of the last decoder layer will be sent to a linear layer, which then generates a probability distribution spanning the complete vocabulary to predict the next token in the sequence. The produced token will then be concatenated with the previous tokens and used as the input for the next round of processing. This generating process repeats in an auto-regressive manner until a full sequence of tokens, referred to as a completion, is produced (Figure 2 (b)). For training, the computation flow is similar to that for inference, except that the generated sentences are directly compared to the ground-truth output to compute the training loss. Gradients are then computed across the LLM weights to minimize this training loss.
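A minimal greedy-decoding sketch of this auto-regressive loop is shown below; `model` is any callable returning next-token logits and `eos_id` is a hypothetical termination-token id, both assumptions for illustration.

```python
import torch

@torch.no_grad()
def generate(model, prompt_ids: torch.Tensor, eos_id: int, max_new_tokens: int = 128):
    """Greedy auto-regressive decoding.

    model: callable mapping token ids (1, seq_len) -> logits (1, seq_len, vocab).
    prompt_ids: (1, prompt_len) tensor of input token ids.
    """
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                       # forward pass over the sequence
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        ids = torch.cat([ids, next_id], dim=-1)   # append and feed back in
        if next_id.item() == eos_id:              # stop at the termination token
            break
    return ids
```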
To analyze the computation cost and memory overhead in LLMs, we also define a series of parameters used later in Section III. Table I shows the parameter sizes and computation dimensions in the LLaMA-7B model as a starting example.
LLM models generate one token (word) per round, as depicted in Figure 2, based on the previous prompt (input) and the previously generated sequence. This process is repeated until the model outputs a termination token. To accelerate the inference process in LLM models, a common strategy is to store the previous keys and values in a Key-Value cache (KV-cache), so they do not need to be recalculated for each new token. Mathematically, the total KV-cache memory cost across all decoders can be represented as:

$$\mathrm{Memory}_{KV} = 2 \times b \times l \times n_{layers} \times n_{heads} \times d_{head}$$

where $l$ and $b$ are the context length and batch size, $n_{layers}$ refers to the number of layers, $d_{head}$ is the head dimension, and $n_{heads}$ is the number of heads; the leading factor of 2 accounts for storing both keys and values.
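As a worked example of this formula, the sketch below computes the KV-cache footprint for a LLaMA-7B-like configuration (32 layers, 32 heads, head dimension 128); fp16 storage (2 bytes per element) is an assumption.

```python
def kv_cache_bytes(batch: int, ctx_len: int, n_layers: int,
                   n_heads: int, d_head: int, bytes_per_elem: int = 2) -> int:
    """Total KV-cache size: 2 (K and V) x batch x context x layers x heads x head_dim."""
    return 2 * batch * ctx_len * n_layers * n_heads * d_head * bytes_per_elem

# LLaMA-7B-like configuration at a full 2048-token context.
size = kv_cache_bytes(batch=1, ctx_len=2048, n_layers=32, n_heads=32, d_head=128)
print(f"{size / 2**30:.2f} GiB")  # -> 1.00 GiB in fp16
```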

B. Overview of Parameter-Efficient Fine-Tuning

Fine-tuning remains essential to enhance LLM performance on unseen user datasets and tasks. As model sizes grow (e.g., from 1.5B parameters in GPT-2 to 175B in GPT-3), the standard full fine-tuning paradigm requires thousands of GPUs working in parallel, which is highly inefficient and unsustainable. A class of algorithms, namely parameter-efficient fine-tuning (PEFT), has been proposed, which aims to tune a minimal number of parameters to achieve better performance than full tuning on downstream tasks.
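The sketch below illustrates this premise on a toy backbone: freeze every pre-trained weight, train only a small task-specific head, and count the trainable fraction; the layer sizes and the choice of a classification head are arbitrary assumptions for illustration.

```python
import torch.nn as nn

# Stand-in for a pre-trained backbone (sizes chosen arbitrarily).
backbone = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Freeze the entire backbone -- the starting point of every PEFT method.
for p in backbone.parameters():
    p.requires_grad = False

# Introduce only a small set of task-specific trainable parameters.
head = nn.Linear(4096, 2)  # hypothetical classification head for a downstream task

trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
total = sum(p.numel() for p in backbone.parameters()) + trainable
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.3f}%)")
```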
In parallel developments, large-scale pre-trained models in vision and multimodal domains have also demonstrated their effective representational learning capabilities, enabling adaptation from large datasets to smaller ones or across various data modalities through fine-tuning. Consequently, this capability has made PEFT increasingly attractive to the wider research community.
We categorize the PEFT algorithms into additive, selective, reparameterized, and hybrid fine-tuning based on their operations. As Figure 3 depicts, three major additive fine-tuning approaches are normally used: (1) Adapter; (2) Soft Prompt; (3) Others. They differ from each other in the additional tunable modules or parameters they introduce. Selective fine-tuning, on the other hand, does not require any additional parameters; it selects a small subset of parameters from the backbone model and makes only them tunable, keeping the majority of parameters untouched during fine-tuning on downstream tasks. We categorize selective fine-tuning based on the grouping of chosen parameters: (1) Unstructural Masking; (2) Structural Masking.
TABLE I: Configuration parameters and computation operations for the LLaMA-7B architecture (columns: Operation, Weights Symbol, Weights Dimension, Input Tensor Dimension, Complexity; rows: Eq. 1 through Eq. 4).
Reparametrization represents transforming model parameters between two equivalent forms. Specifically, reparametrized fine-tuning introduces additional low-rank trainable parameters during training, which are then integrated with the original model weights for inference. This approach is categorized into two main strategies: (1) Low-rank Decomposition, and (2) LoRA Derivatives. Hybrid fine-tuning explores the design spaces of different PEFT methods and combines their advantages.
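As a concrete illustration of the reparameterized strategy, below is a minimal LoRA-style sketch: a frozen linear layer plus a trainable low-rank update that can be merged back into the weight for inference; the rank, scaling, and initialization follow common practice and are assumptions here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W + (alpha/r) * B A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the pre-trained weights
        self.scale = alpha / r
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: update starts at 0

    def forward(self, x):
        # Frozen path plus the scaled low-rank path x A^T B^T.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    @torch.no_grad()
    def merge(self):
        """Fold the low-rank update into the frozen weight so inference runs at
        the original model's speed (skip the low-rank branch after merging)."""
        self.base.weight += (self.B @ self.A) * self.scale
```

The zero initialization of B makes the wrapped layer exactly equal to the pre-trained layer at the start of training, and merge() recovers a plain linear layer with no extra inference cost.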

C. Downstream Tasks for LLM Evaluation

Two types of tasks have been widely used for LLM evaluation. The first type is the General Language Understanding Evaluation (GLUE) [11] benchmark, which integrates nine sentence or sentence-pair language understanding tasks (CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI), chosen for their diversity in dataset sizes, text genres, and difficulty levels, and built on established existing datasets. It also includes a diagnostic dataset specifically designed to evaluate and analyze model performance across various linguistic phenomena inherent in natural language. Additionally, it features a public leaderboard to track performance on the benchmark and a dashboard to visualize model performance on the diagnostic set.
The other type of dataset used in recent LLM papers is commonsense reasoning, which caters to a variety of research facets: (1) OpenBookQA [12] is curated to foster research in advanced question-answering, delving into a profound understanding of both the subject matter and the language in which it is articulated. (2) PIQA [13] primarily emphasizes everyday scenarios, demonstrating a predilection for unconventional solutions. (3) Social IQA [14] emerges as a novel question-answering benchmark tailored for gauging social commonsense intelligence. (4) HellaSwag [15] serves as a dataset whose essence is to ascertain the capability of machines in aptly concluding sentences. (5) BoolQ [16] is a dataset dedicated to question-answering, particularly for binary responses (yes/no queries). (6) WinoGrande [17] is introduced as a fresh compilation, encompassing a substantial 44,000 problems. (7) ARC-easy [18] presents itself as a novel dataset constituting genuine grade-school level multiple-choice science questions, designed to invigorate research in intricate question-answering. (8) ARC-challenge [18], distinctively, encompasses solely those questions that were inaccurately addressed by both a retrieval-based algorithm and a word co-occurrence algorithm.
Image recognition is the primary benchmark and application for vision models, exemplified by benchmarks such as fine-grained visual categorization (FGVC) and the visual task adaptation benchmark (VTAB). Beyond image classification, video action recognition is another key application area, involving datasets like Kinetics-400 [19], SSv2 [20], and HMDB51 [21]. Additionally, PEFT has been utilized for dense prediction tasks, using datasets like MSCOCO [22], ADE20K [23], and PASCAL VOC [24].

III. PEFT TAXONOMY

The PEFT strategies can be broadly classified into four categories: additive PEFT (Section III-A), which modifies the model architecture by injecting new trainable modules or parameters; selective PEFT (Section III-B), which makes a subset of parameters trainable during fine-tuning; reparameterized PEFT (Section III-C), which constructs a (low-dimensional) reparameterization of the original model parameters for training, then equivalently transforms it back for inference; and hybrid PEFT (Section III-D), which combines advantages from different PEFT methods to build a unified PEFT model. An overview of the different types of PEFT algorithms is depicted in Figure 4.

A. Additive PEFT
Standard full fine-tuning entails substantial computational expense and could also potentially harm the model's generalization ability. To mitigate this problem, a widely employed approach is to keep the pre-trained backbone unchanged and introduce only a minimal number of trainable parameters that are strategically positioned within the model architecture. While fine-tuning for a specific downstream task, only the weights of these additional modules or parameters are updated, which results in a substantial reduction in storage, memory, and computational resource requirements. Because these techniques add parameters, they can be termed Additive Tuning, as shown in Figure 4 (a). Next, we discuss several popular additive PEFT algorithms.
  1. Adapters: Adapter approaches involve the insertion of small adapter layers within Transformer blocks. Typically, an adapter layer consists of a down-projection matrix $W_{down} \in \mathbb{R}^{d \times r}$, followed by a non-linear activation function $\sigma(\cdot)$, and an up-projection matrix $W_{up} \in \mathbb{R}^{r \times d}$, where $r$ is the bottleneck dimension.
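A minimal sketch of such a bottleneck adapter is given below; the bottleneck dimension, GELU activation, zero initialization of the up-projection, and residual placement are common choices and assumptions, not any specific paper's exact design.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection, non-linearity, up-projection,
    wrapped in a residual connection. Only these weights are trained."""
    def __init__(self, d_model: int, r: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, r)   # W_down: d -> r
        self.act = nn.GELU()                # non-linear activation
        self.up = nn.Linear(r, d_model)     # W_up: r -> d
        nn.init.zeros_(self.up.weight)      # near-identity at initialization
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual keeps the frozen path intact
```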