
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

Zeyu Han, Chao Gao, Jinyang Liu, Jeff (Jun) Zhang, and Sai Qian Zhang
Northeastern University; University of California, Riverside; Arizona State University
New York University
{han.zeyu,liu.jinyan}@northeastern.edu, cgao037@ucr.edu, jeffzhang@asu.edu, sai.zhang@nyu.edu

Abstract

Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs. These models, often consisting of billions of parameters, require vast amounts of computational resources for execution. In particular, the expansive scale and computational demands pose considerable challenges when customizing them for particular downstream tasks, especially over hardware platforms constrained by computational capabilities.

Parameter-Efficient Fine-Tuning (PEFT) provides a practical solution by efficiently adapting large models to various downstream tasks. In particular, PEFT refers to the process of adjusting the parameters of a pre-trained large model to adapt it to a specific task or domain while minimizing the number of additional parameters introduced or computational resources required. This approach is particularly important when dealing with large-scale language models with high parameter counts, as full fine-tuning of these models can be computationally expensive and resource-intensive, posing considerable challenges in the design of the supporting system platform.

In this survey, we present comprehensive studies of various PEFT algorithms, examining their performance and computational overhead. Moreover, we provide an overview of applications developed using different PEFT algorithms and discuss common techniques employed to mitigate computation costs for PEFT. In addition to the algorithmic perspective, we overview various real-world system designs to investigate the implementation costs associated with different PEFT algorithms. This survey serves as an indispensable resource for researchers aiming to understand both the PEFT algorithm and its system implementation, offering detailed insights into recent advancements and practical applications.

Index Terms-Large Language Model, Parameter-Efficient Fine-tuning, Computer System, Distributed System.

I. INTRODUCTION

Large Models (LMs) have recently captured considerable public interest. Their ability to understand context and nuances enables them to proficiently handle diverse tasks across multiple domains, including natural language processing (NLP), computer vision (CV), etc. In the field of NLP, Large Language Models (LLMs) have achieved significant advancements across various tasks including text generation [1], [2], translation [3], [4], personalized chat-bots [5], [6], [7], and summarization [8], demonstrating remarkable proficiency.
Earlier studies [1] have suggested that LLMs exhibit high levels of generalization, enabling them to apply their acquired knowledge to new tasks not included in their original training. This capability is commonly known as zero-shot learning. Nevertheless, fine-tuning remains essential to further enhance LLMs for optimal performance on new user datasets and tasks.
Due to their scale, a widely adopted strategy for fine-tuning LLMs involves adjusting a limited number of LLM parameters while keeping the remainder unchanged. This technique, termed Parameter-Efficient Fine-Tuning (PEFT), selectively adjusts a small proportion of the parameters while keeping the rest unaltered. Furthermore, the application of PEFT extends beyond the realm of NLP and has quickly attracted interest in the CV community for fine-tuning vision models with large parameter counts, such as Vision Transformers (ViT) and diffusion models, as well as interdisciplinary models such as vision-language models.
In this survey, we systematically review and categorize recent advancements in PEFT algorithms as well as the system implementation costs associated with various PEFT algorithms across diverse scenarios. Figure 1 presents an overview of the content covered in this survey. In Section II, we present some fundamental concepts for LLMs and PEFT, including the computational flow of LLMs, basic knowledge of PEFT, and commonly used datasets and tasks.
We categorize all types of PEFT algorithms in Section III according to their computational flow. In Section III-A, we introduce additive algorithms that either introduce additional weight parameters or modify activations. Algorithms that exclusively require fine-tuning of existing parameters fall under the category of selective approaches, and their introduction can be found in Section III-B. In Section III-C, we explore reparameterized PEFT, which constructs a (low-dimensional) reparameterization of the original model parameters for training while transforming the weights back to maintain the inference speed. Additionally, there exist algorithms that combine the above techniques, and we classify these as hybrid approaches, elaborating on them in Section III-D. We also investigate strategies for further reducing the computational complexity of different PEFT algorithms, including KV-cache management, pruning, quantization, and memory optimization, in Section IV.
Fig. 1: A content overview covered in the survey.

In Section V, we expand the scope of this survey beyond the computational perspective to cover various potential application scenarios. We explore innovations in applying PEFT techniques to different model architectures, including LLMs (Section V-A), Vision Transformers (Section V-B), vision-language alignment models (Section V-C), and diffusion models (Section V-D), for varied downstream tasks, underscoring PEFT's versatility and applicability in a range of scenarios.
In Section VI, we explore the system design challenges for PEFT methods. The discussion includes three advanced system solutions for practical PEFT deployment: distributed tuning (Section VI-B), PEFT query serving (Section VI-C), and concurrent PEFT tuning (Section VI-D).
In the final Section VII, we summarize our survey and propose several potential future directions from both the algorithm and system perspectives, hoping to provide valuable insights for further research and development in the field.

II. BACKGROUND

In this section, we first discuss the computation flow of LLMs, including their fundamental components and computational complexity, using LLaMA as a case study. We then provide a brief overview of different PEFT algorithms in Section II-B.

A. Computation flow for LLaMA

In order to gain a deeper understanding of LLMs and other Transformer-based models, we employ LLaMA-7B, a cutting-edge open-source LLM, to scrutinize the architecture of LLMs as well as the Transformer. As shown in Figure 2 (a), LLaMA consists of three major components: an embedding block, a stack of decoder blocks, and a head block consisting of a linear layer and a softmax layer. The embedding layer's primary role is to transform unstructured textual information into chunks of discrete numerical vectors (tokens) to facilitate subsequent processing. The embedded tokens are then delivered to the decoder layers for further processing. Each LLaMA decoder is composed of two fundamental components: Multi-head Self-Attention (MSA) and a Feedforward Network (FFN). In the MSA module, each of the tokens is clustered by an attention map obtained from a dot product between two linear mappings of the input tokens. The grouped tokens are then further processed by the feedforward network.

Additionally, Root Mean Square Layer Normalization (RMSNorm) [9] is adopted in LLaMA as a replacement for Layer Normalization to ensure efficient training.
LLM distinguishes itself from other deep neural network (DNN) models such as convolutional neural networks (CNN) in two significant ways. Firstly, LLM exhibits an inherent autoregressive nature, necessitating multiple iterations to complete the generation task. Moreover, LLM incorporates an attention mechanism, a component with computational complexity that scales quadratically with the length of the inputs. On the other hand, the inherent computation characteristic of LLM lies in the attention blocks inside each decoder layer. Figure 2 (c) depicts the high-level overview of the computation flow in the attention block.
During the inference process, each decoder takes a 4-dimensional tensor as the input tokens. The input tokens are first multiplied with three weight matrices, $W_Q$, $W_K$, and $W_V$, producing the outputs referred to as the query $Q$, key $K$, and value $V$. Given the MSA module's inability to recognize positional data and the inherent autoregressive nature of LLMs, the query and key then undergo a process using Rotary Positional Embedding (RoPE) [10]. Subsequently, the key and value are combined with those of prior tokens.
After the positional embedding, the intermediate activations then undergo a series of multiplications, a softmax, and a residual addition to generate the MSA output:

$$\mathrm{SA}(x) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_{head}}}\right) V$$

where $d_{head}$ refers to the number of feature dimensions in the multi-head attention mechanism.
Fig. 2: (a) LLaMA architecture. (b) LLaMA auto-regressive pattern. (c) Three common PEFT operations. All the learnable components are highlighted in red, while the frozen components are highlighted in grey. LoRA is applied on all the Query, Key, and Value blocks. The adapter targets the FFN module. Soft-Prompt focuses on tuning the input activation of each decoder. We only show one decoder for illustration simplicity.

The SA output is then forwarded to the FFN blocks for further processing. The FFN block contains another three weight matrices, $W_{up}$, $W_{gate}$, and $W_{down}$, and its computation can be illustrated by:

$$\mathrm{FFN}(x) = W_{down}\left(\mathrm{SiLU}(W_{gate}\, x) \odot (W_{up}\, x)\right)$$

where $x$ denotes the input of the FFN layer, and SiLU is the nonlinear function used in LLaMA. In the original Transformer, the FFN block can be expressed as:

$$\mathrm{FFN}(x) = W_{2}\,\mathrm{ReLU}(W_{1}\, x + b_{1}) + b_{2}$$
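To make the decoder computation described above concrete, below is a minimal PyTorch sketch of a single LLaMA-style decoder block (multi-head self-attention followed by a SiLU-gated FFN). The class name, toy dimensions, and the omission of RoPE, RMSNorm, and KV-caching are simplifications for illustration, not the reference LLaMA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LlamaStyleDecoderSketch(nn.Module):
    """Minimal sketch of one decoder block: MSA + SiLU-gated FFN.
    RoPE, RMSNorm placement, and KV-caching are omitted for brevity."""
    def __init__(self, d_model=4096, n_heads=32, d_ffn=11008):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        # Attention projections (W_Q, W_K, W_V, and an output projection)
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)
        # FFN projections (W_gate, W_up, W_down)
        self.w_gate = nn.Linear(d_model, d_ffn, bias=False)
        self.w_up = nn.Linear(d_model, d_ffn, bias=False)
        self.w_down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        # Split projections into heads: (batch, n_heads, seq_len, d_head)
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention with a causal mask
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        mask = torch.triu(torch.ones(t, t), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
        attn = F.softmax(scores, dim=-1) @ v    # (batch, n_heads, seq_len, d_head)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        x = x + self.w_o(attn)                  # residual connection
        # SiLU-gated FFN: W_down(SiLU(W_gate x) * W_up x), plus residual
        x = x + self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
        return x

# Example: one forward pass with toy sizes
block = LlamaStyleDecoderSketch(d_model=64, n_heads=4, d_ffn=128)
out = block(torch.randn(2, 10, 64))            # -> (2, 10, 64)
```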
The output of the last decoder layer will be sent to a linear layer, which then generates a probability distribution spanning the complete vocabulary to predict the next token in the sequence. The produced token will then be concatenated with the previous tokens and used as the input for the next round of processing. This generating process repeats in an auto-regressive manner until a full sequence of tokens, referred to as a completion, is produced (Figure 2 (b)). For training, the computation flow is similar to that for inference, except that the generated sentences are directly compared to the ground truth output and generate the training loss. Gradients will then be computed across the LLM weights to minimize this training loss.
To analyze the computation cost and memory overhead of LLMs, we also define a series of parameters used in the later Section III. Table I shows the parameter sizes and computation dimensions in the LLaMA-7B model as a starting example.
LLM models generate one token (word) per round, as depicted in Figure 2, based on the previous prompt (input) and the previously generated sequence. This process is repeated until the model outputs a termination token. To accelerate the inference process in LLM models, a common strategy is to store the previous Keys and Values in a Key-Value cache (KV-cache), so they do not need to be recalculated for each new token. Mathematically, the total KV-cache memory cost over all decoders can be represented as:

$$\mathrm{Memory}_{\mathrm{KV}} = 2 \times b \times l \times n_{layers} \times n_{heads} \times d_{head}$$

where $l$ and $b$ are the context length and batch size, $n_{layers}$ refers to the number of layers, $d_{head}$ is the head dimension, and $n_{heads}$ is the number of heads. The factor of 2 accounts for storing both keys and values, and the result is measured in number of elements.
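As a quick illustration of the formula above, the sketch below estimates the KV-cache footprint for a LLaMA-7B-like configuration (32 layers, 32 heads, head dimension 128); the batch size, context length, and fp16 precision are assumed purely for the example.

```python
def kv_cache_bytes(batch_size, context_len, n_layers, n_heads, d_head, bytes_per_elem=2):
    """Total KV-cache size: 2 (K and V) x b x l x layers x heads x head_dim elements."""
    return 2 * batch_size * context_len * n_layers * n_heads * d_head * bytes_per_elem

# LLaMA-7B-like settings, fp16 (2 bytes per element) -- assumed for illustration
size = kv_cache_bytes(batch_size=1, context_len=2048, n_layers=32, n_heads=32, d_head=128)
print(f"{size / 2**30:.2f} GiB")  # ~1.0 GiB for a single 2048-token sequence
```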

B. Overview on Parameter Efficient Fine Tuning

Fine-tuning remains essential to enhance LLM performance on unseen user datasets and tasks. With the size of models growing (e.g., from 1.5B parameters in GPT-2 to 175B in GPT-3), the standard full fine-tuning paradigm requires thousands of GPUs working in parallel, which is highly inefficient and unsustainable. A class of algorithms, namely Parameter-Efficient Fine-Tuning (PEFT), has been proposed, which aims to tune a minimal number of parameters to achieve better performance than full tuning on downstream tasks.
In parallel developments, large-scale pre-trained models in vision and multimodal domains have also demonstrated their effective representational learning capabilities, enabling adaptation from large datasets to smaller ones or across various data modalities through fine-tuning. Consequently, this capability has made PEFT increasingly attractive to the wider research community.
We categorize the PEFT algorithms into additive, selective, reparameterized, and hybrid fine-tuning based on their operations. As Figure 3 depicts, three major additive fine-tuning algorithms are normally used: (1) Adapters; (2) Soft Prompts; (3) Others. They differ from each other in terms of the additional tunable modules or parameters. Selective fine-tuning, on the other hand, does not require any additional parameters; it selects a small subset of parameters from the backbone model and makes only them tunable, while keeping the majority of parameters untouched during fine-tuning on downstream tasks. We categorize selective fine-tuning based on the grouping of the chosen parameters: (1) Unstructural Masking; (2) Structural Masking.
TABLE I: Configuration parameters and computation operations for the LLaMA-7B architecture. For each operation (Eq. 1-4), the table lists the weight symbols, weight dimensions, input tensor dimensions, and computational complexity.
Reparametrization represents transforming model parameters between two equivalent forms. Specifically, reparametrized fine-tuning introduces additional low-rank trainable parameters during training, which are then integrated with the original model for inference. This approach is categorized into two main strategies: (1) Low-rank Decomposition, and (2) LoRA Derivatives. Hybrid fine-tuning explores the design spaces of different PEFT methods and combines their advantages.

C. Downstream Tasks for LLM Evaluation

Two types of tasks have been widely used for LLM evaluation, the first type is the General Language Understanding Evaluation (GLUE) [11] benchmark, which integrates nine sentence or sentence-pair language understanding tasks (CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI), chosen for their diversity in dataset sizes, text genres, and difficulty levels, and is based on established existing datasets. It also includes a diagnostic dataset specifically designed to evaluate and analyze model performance across various linguistic phenomena inherent in natural language. Additionally, it features a public leaderboard to track performance on the benchmark and a dashboard to visualize model performance on the diagnostic set.
The other type of dataset that has been used in recent LLM papers is commonsense reasoning, which caters to a variety of research facets: (1) OpenBookQA [12] is curated to foster research in advanced question-answering, delving into a profound understanding of both the subject matter and the language in which it is articulated. (2) PIQA [13] primarily emphasizes everyday scenarios, demonstrating a predilection for unconventional solutions. (3) Social IQA [14] emerges as a novel question-answering benchmark tailored for gauging social commonsense intelligence. (4) HellaSwag [15] serves as a dataset, the essence of which is to ascertain the capability of machines in aptly concluding sentences. (5) BoolQ [16] is a dataset dedicated to question-answering, particularly for binary responses (yes/no queries). (6) WinoGrande [17] is introduced as a fresh compilation, encompassing a substantial 44,000 problems. (7) ARC-easy [18] presents itself as a novel dataset constituting genuine grade-school-level multiple-choice science questions, designed to invigorate research in intricate question-answering. (8) ARC-challenge [18], distinctively, encompasses solely those questions that were inaccurately addressed by both a retrieval-based algorithm and a word co-occurrence algorithm.
Image recognition is the primary benchmark and application for vision models, exemplified by benchmarks such as fine- grained visual categorization (FGVC) and visual task adaptation benchmark (VTAB). Beyond image classification, video action recognition is another key application area, involving datasets like Kinetics-400 [19], SSv2 [20], and HMDB51 [21]. Additionally, PEFT has been utilized for dense prediction tasks, using datasets like MSCOCO [22], ADE20K [23], and PASCAL VOC [24].

III. PEFT TAXONOMY

The PEFT strategies can be broadly classified into four categories: additive PEFT (Section III-A), which modifies the model architecture by injecting new trainable modules or parameters; selective PEFT (Section III-B), which makes a subset of parameters trainable during fine-tuning; reparameterized PEFT (Section III-C), which constructs a (low-dimensional) reparameterization of the original model parameters for training, then equivalently transforms it back for inference; and hybrid PEFT (Section III-D), which combines advantages from different PEFT methods to build a unified PEFT model. An overview of the different types of PEFT algorithms is depicted in Figure 4.

A. Additive PEFT

Standard full fine-tuning entails substantial computational expenses and also could potentially harm the model's generalization ability. To mitigate this problem, a widely employed approach is to maintain the pre-trained backbone unchanged and introduce only a minimal number of trainable parameters that are strategically positioned within the model architecture. While fine-tuning for a specific downstream task, only the weights of these additional modules or parameters are updated, which results in a substantial reduction in storage, memory, and computational resource requirements. Due to their characteristic of adding parameters, these techniques can be termed as Additive Tuning, as shown in Figure 4 (a). Next, we discuss several popular Additive PEFT algorithms.
1. Adapters: Adapter approaches involve the insertion of small adapter layers within Transformer blocks. Typically, an adapter layer consists of a down-projection matrix $W_{down} \in \mathbb{R}^{r \times d}$, followed by a non-linear activation function $\sigma$, and an up-projection matrix $W_{up} \in \mathbb{R}^{d \times r}$. Here, $d$ represents the dimension of the hidden layer, and $r$ serves as the bottleneck dimension, a hyperparameter used in configuring the adapters. Denoting the input to the adapter as $h_{in}$, the computation within the adapter module (with the residual) can be summarized as follows:

$$\mathrm{Adapter}(h_{in}) = W_{up}\,\sigma(W_{down}\, h_{in}) + h_{in}$$
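Below is a minimal PyTorch sketch of such a bottleneck adapter with a residual connection; the hidden size, bottleneck size, ReLU activation, and zero initialization of the up-projection are illustrative choices rather than a prescribed configuration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Adapter(h) = W_up * sigma(W_down * h) + h, inserted inside a frozen Transformer block."""
    def __init__(self, d_model, bottleneck):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # W_down: d -> r
        self.act = nn.ReLU()                         # sigma (choice of nonlinearity varies)
        self.up = nn.Linear(bottleneck, d_model)     # W_up: r -> d
        nn.init.zeros_(self.up.weight)               # start near identity (a common practice)
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))   # residual keeps the frozen path intact

adapter = BottleneckAdapter(d_model=768, bottleneck=64)
y = adapter(torch.randn(4, 16, 768))                 # same shape as the input
```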
Fig. 3: Taxonomy of Parameter-Efficient Fine-Tuning Methods for Large Models.

Fig. 4: Different types of PEFT algorithms. (a) Additive PEFT. (c) Reparameterization PEFT.
The concept of adapters in the field of NLP was initially introduced by Serial Adapter [25] as shown in Figure 5 (a). In their approach, each Transformer block is enhanced by adding two adapter modules, with one positioned after the self-attention layer and the other after the FFN layer, respectively. Subsequent research has aimed to address the additional computational cost associated with adapter layers. A modified framework AdapterFusion [29] was proposed, where adapter layers are inserted only after the 'Add & Norm' step following the FFN layer to enhance the computational efficiency. The adapters mentioned above follow a sequential design, placing adapter layers as bottlenecks within the Transformer blocks. This approach may potentially reduce the model's parallelism and require a trade-off between inference efficiency and accuracy. In contrast, [26] introduced a parallel adapter (PA) approach as depicted in Figure 5 (b), which reorganizes the traditionally sequential adapter layers into a parallel side-network that runs alongside each Transformer sublayer. Similarly, CIAT [27], CoDA [28] and KronA [72] also adopts a parallel adapter design. Except for the parallel design, CoDA employs a sparse activation mechanism to improve the inference efficiency as shown in Figure 5 (c).

Specifically, CoDA uses a soft top-$k$ selection process that identifies the $k$ important tokens in each layer, which are processed by both the frozen pre-trained Transformer layer and the adapter branch to maintain model accuracy. In contrast, the unimportant tokens are processed only by the adapter branch while skipping the heavy pre-trained layer, thereby optimizing inference efficiency without compromising overall performance.
To enhance the performance and generalization of adapters, various studies have implemented multi-task learning strategies, such as AdapterFusion [29], AdaMix [30], PHA [31], AdapterSoup [32], MerA [33], and Hyperformer [34]. AdapterFusion keeps all pre-trained adapters in the model and employs a fusion module to merge the multi-task informations. Unlike AdapterFusion, MerA merges pretrained adapters into a single one through optimal transport based on weights and activations. This approach avoids introducing any additional trainable parameters, thereby enhancing computational efficiency. Hyperformer stores the multi-task information in a shared hypernetwork, which generates task and layer-specific adapter parameters conditioned on task and layer id embeddings. Given a new task, only an additional task embedding needs to be learned, therefore reducing the number of trained parameters.
Fig. 5: Illustration of three representative adapter-based fine-tuning algorithms. Blue represents frozen, while yellow represents trainable.

2. Soft Prompt: Alternatively, prompt tuning presents an additional approach for refining the model to achieve improved performance through fine-tuning. Instead of optimizing discrete token representations through in-context learning, there is a prevailing belief that the continuous embedding space of soft prompts inherently contains more information [96]. Drawing inspiration from this concept, researchers directly prepend adjustable vectors, referred to as soft prompts, to the start of the input sequence. This can be represented as follows:

$$X^{(l)} = \left[\, s^{(l)}_{1}, \ldots, s^{(l)}_{N_S},\; x^{(l)}_{1}, \ldots, x^{(l)}_{N_X} \,\right]$$

where $X^{(l)}$ is the sequence of input tokens for layer $l$, including the soft prompt tokens $s^{(l)}_{i}$ followed by the original input tokens $x^{(l)}_{j}$; $N_S$ is the number of soft prompt tokens, and $N_X$ is the number of original input tokens.
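A minimal sketch of this idea in PyTorch: learnable prompt vectors are concatenated in front of the (frozen) token embeddings before the sequence enters the Transformer layers. The number of prompt tokens, the embedding size, and the initialization scale are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prepends N_S learnable prompt vectors to the embedded input sequence."""
    def __init__(self, n_prompt_tokens, d_model):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)

    def forward(self, embedded_input):               # (batch, N_X, d_model)
        batch = embedded_input.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, embedded_input], dim=1)   # (batch, N_S + N_X, d_model)

soft_prompt = SoftPrompt(n_prompt_tokens=20, d_model=768)
x = soft_prompt(torch.randn(2, 50, 768))             # -> (2, 70, 768)
```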
Prefix-tuning [35] introduces learnable vectors that are prepended to the keys $k$ and values $v$ across all Transformer layers. To ensure stability during the optimization process, Prefix-tuning adopts a reparameterization strategy, which utilizes an MLP layer to generate these prefix vectors rather than optimizing them directly. After fine-tuning, only the prefix vectors are saved for inference. This technique is adapted and improved in several studies [36], [37], [38]. For instance, P-tuning v2 [37] removes the reparameterization and expands its usage to broader model scales and NLP tasks. APT (Adaptive Prefix Tuning) [38] enhances Prefix-tuning by introducing an adaptive gate mechanism to control the prefix importance in each layer. The concurrent works P-tuning [39] and prompt-tuning [40] apply learnable vectors only at the initial word embedding layer rather than at all layers to enhance training and inference efficiency. It is important to highlight that prompt-tuning demonstrates its effectiveness primarily in the context of large models, specifically those with over 11 billion parameters [40]. Complementing this, Xprompt [41] eliminates the negative prompt tokens through hierarchical structured pruning, which closes the performance gap at smaller model scales. The work by [97] provides some theoretical analysis of prompt tuning, demonstrating its universality and limitations in limited-depth Transformers. IDPG (Instance-Dependent Prompt Generation) [42] improves prompt tuning by generating prompts based on each input sentence with a lightweight prompt generator. In a related approach, LPT (Late Prompt Tuning) [43] also leverages a prompt generator to obtain instance-aware prompts. Unlike previous work, LPT adds these prompts only after an intermediate layer, rather than at the initial or all layers. This strategic placement eliminates the gradient calculation below the intermediate layer, thereby significantly accelerating the training speed. Simultaneously, LPT can improve the overall performance because the shorter backpropagation path preserves more task-related information. Inspired by LPT, SPT (Selective Prompt Tuning) [44] delves deeper into the importance of prompt inserting strategies. It introduces a learnable probabilistic gate in each layer to determine whether to use the prompt propagated from the previous layer or inject a newly generated prompt. APrompt [45] employs another prompt inserting strategy. In addition to input prompts inserted at the beginning of the input sequence for each Transformer layer, APrompt also prepends additional learnable prompts to the respective query, key, and value matrices in the self-attention blocks to learn new attention patterns. Besides, APrompt incorporates the learning of a task-specific head.
The concept of soft prompts has been employed for various downstream tasks [98], [99], although their training can be prone to instability and slow convergence. To address this, SPoT [46] uses a source prompt learned from one or multiple tasks to initialize prompts for new tasks. Similarly, the transfer of soft prompts from one task to initialize another is proposed in TPT (transferable prompt tuning) [47], which demonstrates that a better prompt initialization results in a large training convergence speedup. InfoPrompt [48] develops two mutual-information-based loss functions, i.e., head loss and representation loss, to find better prompt initialization and learn sufficient task-relevant information, thereby also expediting convergence. PTP [49] delves into the root causes of training instability. It identifies the steep nature of the loss landscape in conventional prompt tuning, where minor variations in input data can lead to significant loss fluctuations. To mitigate this, PTP introduces perturbation-based regularizers to smooth the loss landscape and consequently stabilize the training process. DePT [52] decomposes the soft prompt into a shorter soft prompt with a pair of low-rank matrices, which are optimized with two distinct learning rates. This strategy not only improves the performance but also enhances training and inference efficiency. SMoP (Sparse Mixture-of-Prompts) [51] reduces the training and inference cost by utilizing short soft prompts. During training, multiple short soft prompts are trained, each tailored to specific subsets of the dataset. During inference, SMoP integrates a gating mechanism that routes each input instance to an appropriate short prompt. This technique not only increases efficiency in both training and inference stages but also retains performance comparable to that achieved with longer soft prompts. To further cut down the number of soft prompt parameters, IPT (Intrinsic Prompt Tuning) [50] identifies an intrinsic task subspace by training an auto-encoder on multiple tasks. Tuning on new tasks then requires adjusting only a few parameters within this subspace, significantly reducing the number of training parameters.
Fig. 6: Illustration of (IA)³ and SSF. Blue represents frozen, while yellow represents trainable.

3. Other Additive Methods: Apart from the methods mentioned above, there are other approaches that strategically incorporate additional parameters during the fine-tuning process. For example, (IA)³ [53] introduces three learnable rescaling vectors, $l_k$, $l_v$, and $l_{ff}$, to rescale the key, value, and FFN activations, respectively, as depicted in Figure 6 (a). The operations within the self-attention block can be described as follows:

$$\mathrm{SA}(x) = \mathrm{softmax}\!\left(\frac{Q\,(l_k \odot K)^{\top}}{\sqrt{d_{head}}}\right)\,(l_v \odot V)$$

In the FFN, the rescaling can be denoted as:

$$\mathrm{FFN}(x) = W_{2}\,\left(l_{ff} \odot \sigma(W_{1}\, x)\right)$$
where $\odot$ is the Hadamard product. Furthermore, the scaling vectors $l_k$ and $l_v$ can be seamlessly integrated into the weight matrices $W_K$ and $W_V$. This integration effectively eliminates the extra computational costs during inference. A similar technique, SSF [55], also performs linear transformations on the model activations, as illustrated in Figure 6 (b). Specifically, after each operation (i.e., MSA, FFN, and layer normalization) in the pre-trained model, an SSF-ADA layer is injected, which performs scaling and shifting on the features generated from the operation. During fine-tuning, only these SSF-ADA layers can be updated, while during inference, similar to (IA)³, these SSF-ADA layers can be merged into the model weights, so no additional inference overhead is incurred. IPA (Inference-Time Policy Adapters) [56] offers a novel approach to align LLMs, such as GPT-4, with user-specific requirements without modifying the base model's parameters. This is particularly significant when dealing with models whose parameters are extremely large and often not directly accessible. IPA achieves this by combining (through multiplication and normalization) the output distribution of a base LLM (base policy) with that of a smaller-sized model (adapter policy) during the decoding phase. During training, the policy adapter's parameters are fine-tuned using reinforcement learning, while the base policy's parameters remain fixed. During inference, IPA decodes with the combined distribution of the base model and the trained policy adapter, tailoring it to fulfill specific user-defined criteria.
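The sketch below illustrates the (IA)³-style rescaling described above: three learnable vectors, initialized to ones, scale the keys, values, and intermediate FFN activations of an otherwise frozen block. The class name and layer sizes are placeholders, not part of the original method's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IA3Scaling(nn.Module):
    """Learnable rescaling vectors l_k, l_v, l_ff applied to frozen projections."""
    def __init__(self, d_model, d_ffn):
        super().__init__()
        self.l_k = nn.Parameter(torch.ones(d_model))   # scales the keys
        self.l_v = nn.Parameter(torch.ones(d_model))   # scales the values
        self.l_ff = nn.Parameter(torch.ones(d_ffn))    # scales the FFN activation

    def attention(self, q, k, v):                      # q, k, v: (batch, seq, d_model)
        k = k * self.l_k
        v = v * self.l_v
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1) @ v

    def ffn(self, x, w1, w2):                          # w1: d_model->d_ffn, w2: d_ffn->d_model
        return w2(self.l_ff * F.relu(w1(x)))

ia3 = IA3Scaling(d_model=64, d_ffn=256)
q = k = v = torch.randn(2, 10, 64)
out = ia3.attention(q, k, v)                           # (2, 10, 64)
```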

B. Selective PEFT B.选择性 PEFT

In contrast to additive PEFT, which increases the model complexity by adding more parameters, selective PEFT fine-tunes a subset of the existing parameters to enhance model performance on downstream tasks, as depicted in Figure 4 (b).
Fig. 7: Illustration of two parameter masking methods. Blue represents frozen, while yellow represents trainable.
Specifically, given a model with parameters $\theta = \{\theta_1, \theta_2, \ldots, \theta_n\}$, where each $\theta_i$ denotes an individual model parameter and $n$ represents the total count of these parameters, the process of selective PEFT is represented by applying a binary mask $M = \{m_1, m_2, \ldots, m_n\}$ to these parameters. Each $m_i$ in $M$ is either 0 or 1, indicating whether the corresponding parameter $\theta_i$ is selected (1) or not selected (0) for fine-tuning. The updated parameter set $\theta'$ after fine-tuning is given by:

$$\theta_i' = \theta_i - \eta \cdot m_i \cdot \frac{\partial \mathcal{L}}{\partial \theta_i}$$

where $\eta$ represents the learning rate, and $\partial \mathcal{L} / \partial \theta_i$ is the gradient of the loss function $\mathcal{L}$ with respect to the parameter $\theta_i$. In this formulation, only the parameters that are selected (i.e., $m_i = 1$) are updated during backpropagation.
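The sketch below shows one simple way to realize this masked update in PyTorch: gradients are multiplied element-wise by a fixed binary mask before the parameter update, so only the selected parameters change. The magnitude-based selection rule is just one hypothetical way to construct M.

```python
import torch

def build_magnitude_mask(param, keep_ratio=0.01):
    """Example selection rule: mark the top keep_ratio fraction of weights as trainable."""
    k = max(1, int(keep_ratio * param.numel()))
    threshold = param.abs().flatten().topk(k).values.min()
    return (param.abs() >= threshold).float()

def masked_sgd_step(params, masks, lr=1e-3):
    """theta_i <- theta_i - lr * m_i * dL/dtheta_i."""
    with torch.no_grad():
        for p, m in zip(params, masks):
            if p.grad is not None:
                p -= lr * m * p.grad

# Toy usage: a single weight matrix with 1% of entries selected
w = torch.nn.Parameter(torch.randn(256, 256))
mask = build_magnitude_mask(w, keep_ratio=0.01)
loss = (w.sum()) ** 2
loss.backward()
masked_sgd_step([w], [mask])
```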
Diff pruning [57] is a representative work that applies a learnable binary mask to the model weights during fine-tuning. To achieve parameter efficiency, the mask is regularized by a differentiable approximation of the $\ell_0$-norm penalty. PaFi [59] simply selects the model parameters with the smallest absolute magnitudes as trainable. FishMask [60] determines parameter importance using the approximate Fisher information, and then selects the top parameters based on this information to form the mask $M$. Similarly, Fish-Dip [61] also uses Fisher information to calculate $M$, but the mask is re-calculated dynamically in each training period. LT-SFT [62] introduces another technique to determine parameter importance, inspired by the Lottery Ticket Hypothesis [100], [101], where the subset of parameters that change the most during an initial fine-tuning stage is selected to form the mask $M$. SAM [63] proposes a second-order approximation method, which approximates the original problem with an analytically solvable optimization function, to help decide the parameter mask. Child-tuning [64] proposes two approaches to select a child network during each training iteration, where only the parameters within this child network can be updated.
However, the above unstructured parameter masking results in an uneven distribution of non-zero masks and diminished hardware efficiency when implementing PEFT. As shown in Figure 7, the structured mask organizes parameter masking in regular patterns, unlike unstructured ones that apply it randomly, and thus can enhance computational and hardware efficiency during training. Therefore, various structured selective PEFT techniques have undergone extensive investigation. Diff pruning proposes a structured pruning strategy by partitioning the weight parameters into local groups and strategically eliminating them together. Similarly, FAR [65] fine-tunes BERT models by grouping weights of the FFN in Transformer blocks into nodes, then ranks and selects the learner nodes using the $\ell_1$ norm. To further reduce the memory access frequency, they also reconfigure the FFN by grouping the learner nodes together. Bitfit [66] is proposed to only fine-tune the bias parameters of each DNN layer, and achieves competitive results for small models. However, this method fails to handle large models. The work by [58] applies NAS to Bitfit, where S-BitFit keeps the structural nature of Bitfit, which restricts the NAS algorithm to choosing, for each bias module, whether or not it is updated. Similar to Bitfit, which fine-tunes a specific module in the Transformer, Xattn Tuning [67] fine-tunes only the cross-attention layers. SPT (sensitivity-aware visual parameter-efficient fine-tuning) [68] first identifies the sensitive parameters, measured by the loss reduction when they are tuned. This sensitivity is calculated using a first-order Taylor expansion, derived from a single forward and backward pass before fine-tuning, in one shot. Next, SPT finds the weight matrices whose number of sensitive parameters exceeds a predefined threshold, and then applies a selected PEFT technique (e.g., LoRA and Adapter) to these targeted weights to achieve structural tuning.

Fig. 8: Illustration of three representative reparameterized PEFT algorithms: (a) LoRA; (b) DyLoRA; (c) DoRA. Blue represents frozen, while yellow represents trainable.

C. Reparameterized PEFT C.重新参数化的 PEFT

Reparameterization stands for equivalently transforming a model's architecture from one to another via transforming its parameters. In the context of PEFT, this often means constructing a low-rank parameterization to achieve the goal of parameter efficiency during training. For inference, the model can be converted to its original weight parameterization, ensuring unchanged inference speed. This procedure is depicted in Figure 4 (c).
Earlier research studies [69] have shown that common pre-trained models exhibit an exceptionally low intrinsic dimensionality. In other words, it is possible to find a low-dimensional reparameterization that is as effective for fine-tuning as the entire parameter space. Intrinsic SAID [69] is the pioneering work in investigating the intrinsic dimension feature during the fine-tuning of LLMs. However, the most widely recognized reparameterization technique is LoRA (Low Rank Adaptation) [70], [102], as shown in Figure 8 (a). For a given pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA introduces two trainable weight matrices, $W_{down} \in \mathbb{R}^{r \times k}$ and $W_{up} \in \mathbb{R}^{d \times r}$, where the rank $r \ll \min(d, k)$, operating in parallel to $W_0$. Let $h_{in}$ represent the input. Under normal conditions, the output through $W_0$ is $h_{out} = W_0\, h_{in}$. Instead, LoRA modifies this output by introducing an incremental update $\Delta W$ that encapsulates task-specific knowledge:

$$h_{out} = W_0\, h_{in} + \frac{\alpha}{r}\, \Delta W\, h_{in} = W_0\, h_{in} + \frac{\alpha}{r}\, W_{up} W_{down}\, h_{in}$$
where $\alpha$ denotes a scaling factor. At the onset of training, $W_{down}$ is initialized using a random Gaussian distribution, while $W_{up}$ is initialized to zero, ensuring that $\Delta W$ initially holds a value of zero. LoRA is straightforward to implement and has been evaluated on models with up to 175 billion parameters. Figure 2 (c) uses a single decoder as an example, where the frozen and learnable components are highlighted in grey and red, respectively. Once fine-tuning is complete, LoRA's adaptive weights seamlessly integrate with the pre-trained backbone weights. This integration ensures that LoRA maintains the model's efficiency, adding no extra burden during inference.
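Below is a minimal LoRA sketch following the formulation above: a frozen linear layer is augmented with a trainable low-rank pair, with the down-projection Gaussian-initialized and the up-projection zero-initialized so that the update starts at zero, plus a merge step that folds the update back into the frozen weight after training. The rank and scaling values are placeholder choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h_out = W0 h_in + (alpha / r) * W_up W_down h_in, with W0 frozen."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                   # freeze the pre-trained weight
        self.down = nn.Parameter(torch.randn(r, d_in) * 0.01)    # Gaussian init
        self.up = nn.Parameter(torch.zeros(d_out, r))            # zero init -> delta W starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.down.t() @ self.up.t())

    @torch.no_grad()
    def merge(self):
        """Fold the low-rank update into W0 so inference carries no extra cost."""
        self.base.weight += self.scale * (self.up @ self.down)

layer = LoRALinear(d_in=768, d_out=768, r=8, alpha=16)
y = layer(torch.randn(2, 10, 768))
layer.merge()                                                    # after fine-tuning
```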
In LoRA training, selecting an appropriate rank has always been a challenging issue. To address this, DyLoRA [76], as depicted in Figure 8 (b), trains the LoRA module on a range of ranks within a predefined training budget, rather than adhering to a single, fixed rank. Specifically, for a given rank range $[r_{min}, r_{max}]$, DyLoRA dynamically chooses a rank $b$ at each iteration of the training process. Consequently, the matrices $W_{down}$ and $W_{up}$ are tailored to the selected rank $b$, resulting in truncated versions $W_{down \downarrow b}$ and $W_{up \downarrow b}$, and the subsequent forward and backward pass during this iteration is restricted to $W_{down \downarrow b}$ and $W_{up \downarrow b}$ instead of $W_{down}$ and $W_{up}$. With this dynamic and search-free approach, DyLoRA significantly reduces the training time required to find an optimal and fixed LoRA rank for specific tasks. AdaLoRA [77] reformulates $\Delta W$ with a singular value decomposition (SVD), denoted as $\Delta W = P \Lambda Q$, where $P$ and $Q$ are orthogonal and $\Lambda$ is a diagonal matrix containing the singular values. All three weight matrices are made learnable. During training, the singular values are pruned iteratively based on their importance scores, which are constructed from the moving average of the magnitude of the gradient-weight product.
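A sketch of the rank-truncation idea behind DyLoRA, reusing the LoRALinear sketch above: at each training step a rank b is sampled from the allowed range, and only the first b rows and columns of the LoRA factors participate in the forward pass. The sampling range and uniform sampling rule are illustrative assumptions.

```python
import random
import torch

def dylora_forward(x, base, down, up, scale, r_min=1, r_max=8):
    """Forward pass with a rank b sampled per iteration; rows/cols beyond b are truncated."""
    b = random.randint(r_min, r_max)
    down_b = down[:b, :]           # (b, d_in)
    up_b = up[:, :b]               # (d_out, b)
    return base(x) + scale * (x @ down_b.t() @ up_b.t())

# Reuses `layer` from the LoRALinear sketch above
out = dylora_forward(torch.randn(2, 10, 768), layer.base, layer.down, layer.up, layer.scale)
```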
To ensure the orthogonality between $P$ and $Q$, i.e., $P^{\top} P = Q Q^{\top} = I$, an additional regularizer term is included in the loss:

$$R(P, Q) = \left\| P^{\top} P - I \right\|_{\mathrm{F}}^{2} + \left\| Q Q^{\top} - I \right\|_{\mathrm{F}}^{2}$$
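The regularizer above can be computed directly from P and Q; the sketch below shows this computation, with the matrix shapes chosen only for the example.

```python
import torch

def orthogonality_regularizer(P, Q):
    """|| P^T P - I ||_F^2 + || Q Q^T - I ||_F^2 (AdaLoRA-style orthogonality penalty)."""
    r_p = P.shape[1]
    r_q = Q.shape[0]
    term_p = P.t() @ P - torch.eye(r_p)
    term_q = Q @ Q.t() - torch.eye(r_q)
    return term_p.pow(2).sum() + term_q.pow(2).sum()

P = torch.randn(768, 8, requires_grad=True)    # d x r
Q = torch.randn(8, 768, requires_grad=True)    # r x k
reg = orthogonality_regularizer(P, Q)           # add to the task loss with a small coefficient
```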
This adaptive approach enables the model to dynamically adjust the rank within each LoRA module, effectively managing its parameter count based on the significance of the weight matrices. However, according to SoRA [78], the importance scores used in AdaLoRA are heuristically constructed and lack rigorous theoretical motivation. Additionally, both the moving average operation and the calculation of the regularizer above introduce extra computation cost during training. To address this, SoRA eliminates the orthogonality premise of $P$ and $Q$. Instead, a gating unit $g$ between $W_{up}$ and $W_{down}$ is directly applied and optimized:

$$\Delta W\, h_{in} = W_{up}\,\left(g \odot (W_{down}\, h_{in})\right)$$
where $\odot$ is the Hadamard product. The gate $g$ is updated using a variation of the proximal gradient iteration for $\ell_1$ loss [103], [104], which has a clear mathematical meaning and does not need the heuristic premise. After training, the zeroed-out gate units are pruned by removing the corresponding columns and rows in $W_{up}$ and $W_{down}$.
Several subsequent studies have aimed to improve LoRA's performance in various aspects. For instance, Laplace-LoRA [81] notices that fine-tuned LLMs often exhibit overconfidence. To enhance the calibration of fine-tuned LLMs, Laplace-LoRA utilizes a Bayesian approach, specifically a post-hoc Laplace approximation [105], [106], to the posterior over the LoRA parameters. LoRA Dropout [82] introduces random noise to the learnable low-rank matrices and increases parameter sparsity to reduce the risk of overfitting. LoRA+ [84] proposes to set different learning rates for the LoRA matrices $W_{up}$ and $W_{down}$, such that $\eta_{up} = \lambda\, \eta_{down}$ with the ratio $\lambda$ fixed, and then tunes $\eta_{down}$.
Thanks to the modular design of LoRA, many studies incorporate multiple LoRA modules in their frameworks to enhance performance. For example, LoRAHub aggregates various LoRA modules trained on different tasks. Given a handful of examples from a new task, LoRAHub can autonomously compose compatible LoRA modules without human intervention via a gradient-free method, Shiwa [107]. MOELoRA employs a Mixture-of-Experts (MOE) approach to train LoRA in a multi-task setting, resulting in multiple expert LoRA modules. To retrieve parameters for a certain task, MOELoRA utilizes a task-motivated gate function that assigns contribution weights to each expert based on the task ID, and the final parameters are calculated through a weighted sum over all experts.
In addition to LoRA, several other reparameterization techniques are emerging with significant potential. For instance, Compacter [71] introduces a light-weight adapter module by parameterizing $W_{down}$ and $W_{up}$ as a sum of Kronecker products, $W = \sum_{i=1}^{n} A_i \otimes B_i$, where $\otimes$ denotes the Kronecker product. They further decrease the parameter count by designating $A_i$ as shared parameters and reparameterizing $B_i$ as the product of two low-rank matrices, effectively reducing the parameter complexity from $\mathcal{O}(kd)$ to $\mathcal{O}(k+d)$. Related studies, such as KronA [72] and KAdaptation [73], also employ the Kronecker product to reparameterize adapter weights, aiming to achieve parameter reduction. HiWi [59] proposes an adapter fine-tuning method that applies an adapter directly to the pre-trained parameters (i.e., the weights or biases within the Transformer block's feed-forward layer) instead of to the hidden representations. Notably, during inference, the adapted parameters can be computed in advance, ensuring that the model's inference latency remains on par with that of traditional full fine-tuning. VeRA (Vector-based Random Matrix Adaptation) [74] employs a single pair of frozen low-rank matrices $B$ and $A$ that are shared across all layers, and adapts these matrices by learning small, trainable scaling vectors $b$ and $d$ (formally denoted by the diagonal matrices $\Lambda_b$ and $\Lambda_d$). Specifically, the reparameterization is given by:

$$h_{out} = W_0\, h_{in} + \Lambda_b\, B\, \Lambda_d\, A\, h_{in}$$
where both $B$ and $A$ are initialized using a random Gaussian distribution. Similar to LoRA, the scaling vector $b$ is initialized to zeros to ensure that the weight matrix is unaffected during the first forward pass. This method significantly reduces the number of trainable parameters compared to LoRA yet maintains the same performance, enabling the fine-tuning of larger models on a single GPU. DoRA (Weight-Decomposed Low-Rank Adaptation) [75] presents a novel approach, as illustrated in Figure 8 (c), by decomposing the model weights $W_0$ into magnitude and direction as follows:

$$W_0 = m\, \frac{V}{\left\| V \right\|_{c}}$$
where $m$ is the magnitude vector, $V$ is the directional matrix, and $\|\cdot\|_c$ is the vector-wise norm of a matrix across each column. Subsequently, DoRA adopts a unique fine-tuning strategy for $m$ and $V$. While both are tunable, only $V$ undergoes LoRA reparameterization, defined as:

$$W' = \underline{m}\, \frac{V + \underline{\Delta V}}{\left\| V + \underline{\Delta V} \right\|_{c}}$$
where $\Delta V$ is the incremental directional update learned by LoRA, and the underlined parameters denote the trainable parameters. Through this methodology, DoRA consistently outperforms LoRA across various tasks and models, demonstrating its superiority.
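The sketch below illustrates the magnitude/direction decomposition and the DoRA-style reconstruction with a low-rank update applied to the directional component; it is a simplified rendering of the equations above under assumed shapes, not the reference implementation.

```python
import torch

def column_norm(W, eps=1e-8):
    """Vector-wise norm of a matrix taken over each column."""
    return W.norm(dim=0, keepdim=True) + eps

# Pre-trained weight W0 (d_out x d_in), decomposed into magnitude m and direction V
W0 = torch.randn(768, 768)
V = W0.clone()                       # directional component
m = column_norm(W0).clone()          # magnitude vector (trainable)

# Low-rank update applied only to the direction: delta_V = W_up @ W_down
W_down = torch.randn(8, 768) * 0.01  # trainable
W_up = torch.zeros(768, 8)           # trainable, zero-initialized
delta_V = W_up @ W_down

# DoRA-style reconstruction: W' = m * (V + delta_V) / ||V + delta_V||_c
V_new = V + delta_V
W_adapted = m * (V_new / column_norm(V_new))   # equals W0 while delta_V is zero
```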

D. Hybrid PEFT D.混合型 PEFT

The efficacy of various PEFT methods can significantly differ across different tasks. As a result, numerous studies aim to either combine the advantages of diverse PEFT approaches or seek to establish a unified perspective by analyzing the similarities among these methods. For instance, UniPELT[90] integrates LoRA, prefix-tuning, and adapters into each Transformer block. To control which PEFT submodules should be activated, they also introduce a gating mechanism. This
在不同的任务中,各种 PEFT 方法的功效会有很大不同。因此,许多研究旨在结合不同 PEFT 方法的优势,或通过分析这些方法之间的相似性来建立统一的观点。例如,UniPELT[90] 将 LoRA、前缀调谐和适配器集成到每个转换器块中。为了控制哪些 PEFT 子模块应被激活,他们还引入了门控机制。这

mechanism consists of three small FFNs that each produce a scalar value , which is then applied to the LoRA, prefix, and adapter matrices, respectively. Across various setups, UniPELT has consistently shown improvements in accuracy ranging from to . S4 [91] explores design spaces for several PEFT methods (i.e., Adapter (A), Prefix (P), BitFit (B), and LoRA (L)) to uncover underlying design patterns. After a series experiments, their findings include: (1) Applying the spindle grouping partitioning for Transformer layers, which results in four layer groups for . Layers in one group have similar behaviors together, which means should be apply similar PEFT strategies. (2) Allocating the number of trainable parameters to layers uniformly. (3) Tuning all the groups. (4) Assigning different PEFT strategies in different group. The resulting design space that has the best performance is:
该机制由三个小型 FFN 组成,每个 FFN 产生一个标量值 ,然后分别应用于 LoRA、前缀和适配器矩阵。在不同的设置下,UniPELT 的准确率一直在 之间。S4 [91] 探索了几种 PEFT 方法(即 Adapter (A)、Prefix (P)、BitFit (B) 和 LoRA (L))的设计空间,以发现潜在的设计模式。经过一系列实验,他们的发现包括(1) 对 Transformer 层应用主轴分组分区,结果产生了四个层组 。一个组中的层具有相似的行为,这意味着应采用相似的 PEFT 策略。(2) 将可训练参数的数量统一分配给各层。(3) 调整所有组。(4) 在不同组中分配不同的 PEFT 策略。最终得出的性能最佳的设计空间是
MAM Adapter [26] explores the intrinsic similarity between three additive PEFT methods: adapters, prefix-tuning, and LoRA, which leads to the development of three variants: the Parallel Adapter, which places adapter layers alongside specific layers (SA or FFN) instead of after them; the Multi-head Parallel Adapter, which divides the parallel adapter into multiple heads, each affecting the attention output of one head in SA; and the Scaled Parallel Adapter, which adds a scaling term after the parallel adapter layer, similar to LoRA. Extensive experimentation revealed that the most effective configuration uses prefix-tuning in the SA layer and the scaled parallel adapter in the FFN layer; this configuration is called the MAM Adapter. LLM-Adapters [94] builds an easy-to-use framework that incorporates various PEFT techniques into LLMs. Through comprehensive benchmarking across multiple datasets, the study reveals several key insights: (1) The most effective locations for series adapters, parallel adapters, and LoRA are after the MLP layers, alongside the MLP layers, and simultaneously following the attention and MLP layers, respectively. (2) Smaller LLMs utilizing PEFT can achieve competitive or even superior results on certain tasks compared to their larger counterparts. (3) With appropriate in-distribution fine-tuning data, smaller models are capable of surpassing larger models in task-specific performance.
Several studies leverage neural architecture search (NAS) to find better PEFT combinations. For example, NOAH [92] finds that different PEFT configurations are best suited to different tasks. To address this, NOAH employs NAS to identify the most effective PEFT configuration for each dataset. Specifically, NOAH's search space encompasses three PEFT methods: Adapter, LoRA, and Visual Prompt Tuning (VPT). It utilizes AutoFormer [108], a one-shot NAS algorithm, for the efficient discovery of optimal prompt modules. In a related vein, AUTOPEFT [93] first establishes a search space that includes serial adapters, parallel adapters, and prefix tuning, and then proposes an effective NAS method based on high-dimensional Bayesian optimisation [109]. Both NOAH and AUTOPEFT demonstrate the capability of NAS to improve PEFT configurations across a variety of tasks.

IV. EFFICIENT PEFT DESIGN

Processing latency and peak memory overhead are pivotal factors from a computational standpoint. This section first introduces a key characteristic of LLM inference that governs the trade-off between latency and memory usage (Section IV-A). Following this, we explore strategies for developing efficient PEFT methods to address computational challenges, including PEFT pruning (Section IV-B), PEFT quantization (Section IV-C), and memory-efficient PEFT techniques (Section IV-D), each designed to enhance model performance while minimizing resource consumption. It is noteworthy that quantization inherently addresses memory overhead concerns. However, given its distinct characteristics, we discuss these quantization methods separately rather than incorporating them under the memory-efficient PEFT section.

A. KV-cache Management for PEFT Efficiency

At the core of an LLM lies an autoregressive Transformer model, as depicted in Figure 2. The autoregressive characteristic poses a major challenge when designing an inference system: every time a new token is generated, the entire set of LLM weights has to be moved from memory to the compute units of the graphics processor, which is very unfriendly to single-user task scheduling and multi-user workload balancing. The challenging part of serving the autoregressive paradigm is that all previous sequences have to be cached and saved for the next iteration; the cached activations generated from the previous sequences are stored as the Key-Value cache (KV-cache).
Storing the KV-cache costs both memory space and IO bandwidth, making the workload memory-bound and under-utilizing the computation power of the system. Previous works proposed a series of solutions, such as KV-cache control management [133] and KV-cache compression [134], to improve throughput or reduce latency. When designing PEFT methods, it is crucial to consider the characteristics of the KV-cache and complement its features. For instance, when applying soft prompts in the inference phase, efficiently leveraging the KV-cache for these additional inputs can help accelerate response times by ensuring that prompt-related data is readily accessible.
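To illustrate why the KV-cache grows with sequence length, the following is a minimal single-head sketch of cached autoregressive decoding in PyTorch; w_q, w_k, w_v and the cache layout are simplified placeholders rather than any particular serving system's implementation.

```python
import torch

def decode_step(x_t, w_q, w_k, w_v, kv_cache):
    """One autoregressive decoding step with a KV-cache.

    x_t: (batch, d) hidden state of the newly generated token.
    kv_cache: dict holding keys/values of all previously processed tokens.
    """
    q = x_t @ w_q                                   # (batch, d)
    k = x_t @ w_k                                   # key for the new token only
    v = x_t @ w_v
    # Append the new key/value; cached entries are never recomputed.
    kv_cache["k"] = torch.cat([kv_cache["k"], k.unsqueeze(1)], dim=1)  # (batch, t, d)
    kv_cache["v"] = torch.cat([kv_cache["v"], v.unsqueeze(1)], dim=1)
    # Attend over all cached positions; memory grows linearly with sequence length.
    scores = (kv_cache["k"] @ q.unsqueeze(-1)).squeeze(-1) / (q.shape[-1] ** 0.5)
    attn = torch.softmax(scores, dim=-1)            # (batch, t)
    return (attn.unsqueeze(1) @ kv_cache["v"]).squeeze(1)              # (batch, d)
```

Here kv_cache would be initialized with empty (batch, 0, d) tensors before the first step; soft-prompt tokens injected by a PEFT method occupy cache slots in exactly the same way as ordinary tokens.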

B. Pruning Strategies for PEFT

The inclusion of pruning can substantially enhance the efficiency of PEFT methods. In particular, AdapterDrop [110] explores the removal of adapters from lower Transformer layers and of multi-task adapters in AdapterFusion [29], showing that pruning can improve training and inference efficiency with minimal decrease in performance. SparseAdapter [111] investigates different pruning methods and finds that adapters with high sparsity ratios can outperform standard adapters. Additionally, its Large-Sparse configuration, which increases the bottleneck dimension while maintaining a constant parameter budget (e.g., doubling the dimension at a correspondingly higher sparsity), substantially enhances the model's capacity and results in improved performance. SPLoRA [112] applies channel-based pruning to the LoRA down-projection and up-projection weights.
Fig. 9: Taxonomy of Efficient PEFT Design.
This pruning affects not only the pre-trained source weights but also the LoRA parameters. Similarly, LoRAPruning [113] adopts structured pruning for both the pre-trained model weights and the LoRA weights. In contrast to unstructured LoRA pruning methods, which primarily focus on sparsifying model weights while leaving LoRA weights dense and thus make weight merging challenging, LoRAPruning enables the weights to be merged easily. Additionally, this work introduces a novel criterion that utilizes LoRA's gradients as an approximation of the gradients of the pre-trained weights, enabling the estimation of weight importance. ProPETL [114] constructs a single shared prototype (e.g., adapter, prefix, or LoRA) across layers and tasks. In addition, ProPETL learns binary masks to prune different sub-networks in different layers and tasks. As a result, the parameters can be reused across layers and tasks, largely increasing the parameter efficiency.

C. Quantization Strategies for PEFT

Quantization serves as another popular technique for improving computational efficiency and reducing memory usage. For example, by investigating the loss landscape of adapters, BI-Adapter [115] finds that adapters are resistant to noise in parameter space. Building on this insight, the authors introduce a clustering-based quantization approach. Remarkably, they demonstrate that a 1-bit quantization of adapters not only minimizes storage requirements but also achieves the best performance among all precision settings. PEQA (Parameter-Efficient and Quantization-aware Adaptation) [116] uses a two-stage pipeline to achieve parameter-efficient and quantization-aware fine-tuning. In the first stage, the pre-trained FFN weight matrix $W$ is quantized as $W \approx s \cdot \overline{W}$, where $s$ represents per-channel scales and $\overline{W}$ denotes the quantized weight. In the second stage, $\overline{W}$ remains fixed, and fine-tuning is conducted only on $s$. This approach not only ensures memory efficiency but also facilitates parameter efficiency. QLoRA [117] proposes several novel techniques, including a 4-bit NormalFloat data type, Double Quantization, and Paged Optimizers, to backpropagate through a 4-bit quantized pre-trained language model into LoRA. These techniques enable the fine-tuning of a 65B language model on a single 48GB GPU while maintaining performance similar to full 16-bit fine-tuning. Similar to the original implementation [70], QLoRA attaches fixed zero-initialized LoRA weights to the quantized pre-trained model as the training start point. However, when applying extreme low-bit (e.g., 2-bit) quantization, the huge quantization error can adversely impact the initialization of LoRA fine-tuning, since the quantized backbone no longer matches the original pre-trained weights at the start of training, which harms the fine-tuning performance as shown in [127]. To solve this, several quantization strategies are proposed to eliminate the quantization error. For example, LoftQ (LoRA-Fine-Tuning-aware Quantization) [118] presents an innovative framework that provides a superior initialization point of the quantized backbone weights and the LoRA weights for subsequent LoRA fine-tuning. This approach addresses the discrepancies caused by quantization by optimizing a Frobenius norm objective during network initialization, which takes both the LoRA weights and the quantized pre-trained backbone into consideration. LoftQ exhibits superior performance in 2-bit quantization over QLoRA, as well as greater generalization to downstream tasks. LQ-LoRA [119] uses an iterative algorithm inspired by robust principal components analysis [135], [136], which decomposes the weight as $W \approx Q + L_1 L_2$ to resolve the inaccuracy caused by the quantization error, where $Q$ is the quantized component that remains fixed and $L_1 L_2$ is the trainable low-rank component. Moreover, this approach leverages integer linear programming to determine a mixed quantization strategy, enabling dynamic quantization configurations for each weight matrix while adhering to a predetermined total bit-rate limit. QA-LoRA [120] addresses another limitation of QLoRA, which struggles to preserve its quantized property after fine-tuning. In QLoRA, the quantized pre-trained weights (NF4) have to be recovered to FP16 to match the LoRA weight precision (FP16) during weight merging. Instead, QA-LoRA uses INT4 quantization and introduces group-wise operators to enable quantization during the inference stage, thereby improving efficiency and accuracy compared with QLoRA.
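As a concrete illustration of the scale-only fine-tuning idea behind PEQA, the sketch below quantizes a weight matrix per output channel and marks only the scales as trainable; the symmetric rounding scheme and bit-width handling are simplified assumptions and do not reproduce PEQA's exact procedure.

```python
import torch
import torch.nn as nn

def quantize_per_channel(w: torch.Tensor, bits: int = 4):
    """Symmetric per-channel quantization: w ≈ s * w_int, one scale per output row."""
    qmax = 2 ** (bits - 1) - 1
    s = w.abs().amax(dim=1, keepdim=True) / qmax           # per-channel scale
    w_int = torch.clamp(torch.round(w / s), -qmax - 1, qmax)
    return s, w_int

class ScaleOnlyLinear(nn.Module):
    """Keeps the integer weights frozen and fine-tunes only the per-channel scales."""
    def __init__(self, weight: torch.Tensor, bits: int = 4):
        super().__init__()
        s, w_int = quantize_per_channel(weight, bits)
        self.register_buffer("w_int", w_int)                # frozen quantized weights
        self.s = nn.Parameter(s)                             # trainable scales (the only tuned parameters)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.s * self.w_int).T
```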
BitDelta [123] introduces a novel 1-bit post-training quantization method that acts on the weight delta between a fine-tuned model and its underlying pre-trained model. Specifically, given the weight matrices $W_{\text{fine}}$ and $W_{\text{base}}$ from the fine-tuned and base models respectively, the weight delta $\Delta = W_{\text{fine}} - W_{\text{base}}$ is binarized as $\hat{\Delta} = \alpha \odot \mathrm{Sign}(\Delta)$. Here, $\alpha$, a high-precision scalar, is initialized from the mean absolute value of $\Delta$, and $\mathrm{Sign}(\Delta)$ indicates the sign of $\Delta$. BitDelta further calibrates the scaling factors via distillation on a compact calibration dataset, while the binary matrices remain unchanged. This approach notably streamlines the deployment of multiple fine-tuned models on shared servers by utilizing a single full-precision base model alongside efficiently batched 1-bit deltas.
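A minimal sketch of the binarization step described above, assuming access to the fine-tuned and base weight matrices; the scale-distillation step is omitted and the sign matrix is kept as a dense float tensor rather than a packed 1-bit format.

```python
import torch

def compress_delta(w_fine: torch.Tensor, w_base: torch.Tensor):
    """Binarize the fine-tuning delta: delta ≈ alpha * sign(delta)."""
    delta = w_fine - w_base
    alpha = delta.abs().mean()          # high-precision scalar, from the mean absolute delta
    sign = torch.sign(delta)            # 1-bit matrix (stored packed in practice)
    return alpha, sign

def apply_delta(w_base: torch.Tensor, alpha: torch.Tensor, sign: torch.Tensor):
    """Reconstruct an approximate fine-tuned weight from the shared base and its 1-bit delta."""
    return w_base + alpha * sign
```

A server holding one full-precision base model can then reconstruct (or batch) many fine-tuned variants from their compact (alpha, sign) pairs.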

D. Memory-efficient PEFT Methods

Fine-tuning full LLMs necessitates substantial training memory owing to their considerable size. While most PEFT methods primarily target parameter efficiency, they still incur a significant memory overhead during training because gradient computation and backpropagation remain necessary. For example, some literature reports that prevalent PEFT techniques such as adapters and LoRA can only moderately reduce memory usage compared to full model fine-tuning [125], [130]. From a computational perspective, memory efficiency therefore remains a critical factor that cannot be overlooked.
To improve memory efficiency, various techniques have been developed to minimize the need for caching gradients for the entire LLM during fine-tuning, thereby reducing memory usage. For example, both Side-Tuning [124] and LST (Ladder-Side Tuning) [125] introduce a learnable network branch parallel to the backbone model. By channeling the backpropagation exclusively through this parallel branch, they circumvent the need to store gradient information for the main model's weights, thus markedly reducing memory requirements during training. Similarly, Res-Tuning [126] disentangles the PEFT tuners (e.g., prompt tuning, adapter) from the backbone model. On top of the disentanglement, a memory-efficient fine-tuning framework named Res-Tuning-Bypass is proposed, which generates a bypass network in parallel with the backbone model by removing the data flow from the decoupled tuners to the backbone. This eliminates the requirement for gradient caching within the backbone model during backpropagation. MEFT [127] (memory-efficient fine-tuning) is an approach inspired by the reversible model [137]. During the training of a reversible model, intermediate activations are not required to be cached in the forward pass, since they can be recalculated from the final output during backpropagation. To save memory during fine-tuning, MEFT investigates how to transform an LLM into its reversible counterpart without additional pre-training. A critical aspect of this transformation is the careful initialization of the newly introduced parameters in the pre-trained models. MEFT demonstrates the importance of parameter initialization, and suggests that these parameters must be initialized in a manner that preserves the pre-trained model's starting point, ensuring that the fine-tuning of the modified model achieves performance on par with full fine-tuning methods. With this key consideration, MEFT introduces three distinct methods, each significantly curtailing the memory traditionally required for storing activations. LoRA-FA [128] addresses a memory-overhead limitation of LoRA fine-tuning. During training, LoRA modules still require high activation memory consumption, because large input activations must be stored during the forward pass to compute gradients during backpropagation. LoRA-FA resolves this issue by freezing both the pre-trained weights and the LoRA down-projection weights, and only updating the up-projection weights. Consequently, the input activation no longer needs to be stored, as the low-dimensional intermediate activation produced by the down-projection is sufficient for computing the gradient of the up-projection. Given that the LoRA rank is much smaller than the model dimension, the activation memory requirement in LoRA-FA can be significantly reduced.
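A minimal sketch of the LoRA-FA idea, using the conventional LoRA factor names A (down-projection, frozen) and B (up-projection, trained); only the small (batch, rank) intermediate is mathematically needed for B's gradient, not the full-width input activation.

```python
import torch
import torch.nn as nn

class LoRAFALinear(nn.Module):
    """LoRA-FA-style layer: W and A are frozen, only B is trained."""
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        out_dim, in_dim = weight.shape
        self.register_buffer("w", weight)                             # frozen pre-trained weight
        self.register_buffer("a", torch.randn(rank, in_dim) * 0.01)   # frozen down-projection
        self.b = nn.Parameter(torch.zeros(out_dim, rank))             # trainable up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.w.T
        # Only this low-rank intermediate (batch, rank) is needed to form B's gradient,
        # instead of the full (batch, in_dim) input activation.
        low_rank = x @ self.a.T
        return base + low_rank @ self.b.T
```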
To further reduce memory usage during fine-tuning, some methods attempt to circumvent backpropagation through the LLM altogether. HyperTuning [129] employs a HyperModel to generate PEFT parameters using only few-shot examples. This approach demonstrates results comparable to those obtained through full model fine-tuning. PEFT Plug-in [130] first trains PEFT modules on small language models, which is more memory-efficient than training on large ones. Subsequently, the research introduces a suite of techniques for seamlessly integrating these trained PEFT modules into LLMs during inference. This strategy effectively circumvents the necessity of gradient-based optimization directly on the larger models, resulting in substantial memory savings. However, it is important to note that both HyperTuning and PEFT Plug-in still require additional model training, and this training cost cannot be entirely overlooked. MeZO [131] introduces a memory-efficient zeroth-order (ZO) optimizer for LLMs. Unlike conventional PEFT techniques, which rely on backpropagation to compute gradients for updating model parameters, MeZO fine-tunes LLMs through forward passes only. It accomplishes this by employing a ZO gradient estimator to calculate the gradient. Notably, MeZO implements an in-place solution for the classic ZO gradient estimator, effectively mitigating memory consumption during execution. This approach allows for efficient fine-tuning of LLMs containing 30 billion parameters on a single GPU, while maintaining performance comparable to fine-tuning with backpropagation. Furthermore, it can substantially decrease storage demands compared to traditional PEFT methods such as LoRA and Adapter.
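The core of the zeroth-order approach is a two-point gradient estimate computed with forward passes only. Below is a simplified sketch of one such update step; MeZO's in-place, seed-based perturbation trick is reduced here to explicit perturbation tensors, and loss_fn is an assumed user-supplied closure.

```python
import torch

def zo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    """One simplified zeroth-order update: estimate the gradient from two forward passes."""
    params = [p for p in model.parameters() if p.requires_grad]
    z = [torch.randn_like(p) for p in params]          # random perturbation direction

    with torch.no_grad():
        for p, zi in zip(params, z):                    # theta + eps * z
            p.add_(eps * zi)
        loss_plus = loss_fn(model, batch)

        for p, zi in zip(params, z):                    # theta - eps * z
            p.sub_(2 * eps * zi)
        loss_minus = loss_fn(model, batch)

        for p, zi in zip(params, z):                    # restore theta
            p.add_(eps * zi)

        grad_scale = (loss_plus - loss_minus) / (2 * eps)
        for p, zi in zip(params, z):                    # SGD-style update along z
            p.sub_(lr * grad_scale * zi)
    return loss_plus
```

Because no backward pass is taken, no activations or gradients need to be stored; the memory cost is essentially that of inference.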

V. PEFT FOR DNNS OF OTHER APPLICATIONS

In Section III, we outlined four categories of PEFT methods along with their improvements. Nonetheless, our discussion did not fully extend to the utilization or adaptation of PEFT techniques beyond traditional architectures (e.g., LLMs) or standard benchmarks (e.g., the GLUE dataset), where the majority of the discussed PEFT methods are applied. Therefore, in this section, we highlight and discuss several of the most representative works that leverage PEFT strategies for various downstream tasks. We do not aim to cover all PEFT application scenarios. Our objective is to showcase the significant influence of PEFT within various research domains, and to demonstrate how to optimize and tailor general-purpose PEFT methods to achieve enhanced performance in specific models or tasks.
Typically, fine-tuning is needed when adapting a pre-trained backbone model to specialized downstream tasks. To this end, this section organizes the discussion around various model architectures, including: LLMs, the Vision Transformer (ViT), Vision-Language Alignment models (VLA), and diffusion models. Within each architectural category, the discussion is further classified based on different downstream tasks.

A. PEFT for LLMs - Beyond the Basics

Beyond common NLP tasks such as NLU and NLG, PEFT techniques boast a wide array of applications across
diverse scenarios. PEFT has been successfully implemented in commonsense question answering [138], [139], multi-level implicit discourse relation recognition [140], out-of-distribution detection [141], privacy protection [142], [143], federated learning [144], and social bias mitigation [145]. In this section, we focus on three representative downstream tasks: visual instruction following, continual learning, and context window extension.
1. Visual Instruction Following: Several studies, including VL-BART [146], MiniGPT-4 [147], and LLaVA [148], have successfully extended the capabilities of LLMs, initially designed for pure text, to comprehend and generate responses to visual inputs. These enhanced models, namely visual instruction-following LLMs, can process both images and text to produce textual responses, which can be benchmarked on tasks such as image captioning [149], [150], [151], [152] and visual question answering (VQA) [153], [154], [155]. However, these methods fine-tune the entire LLM to learn the visual representations, which can be inefficient in both time and memory. Therefore, it is natural to apply PEFT techniques to the fine-tuning of visual instruction-following LLMs. An earlier work, VL-Adapter [156], directly applies several PEFT methods (Adapter [25], Hyperformer [34], and Compacter [71]) on VL-BART [146], then benchmarks them on several image-text and video-text tasks. Results show that vanilla adapters are the best among them, achieving performance on par with full fine-tuning. However, considering the functionality gap between the encoder and decoder in VL-BART, directly assigning identical modular modifications to both leads to suboptimal performance. Therefore, VL-PET [157] selectively integrates PEFT modules into different components of the encoder and decoder. It also introduces a granularity-controlled mechanism for finer-grained control.
To adapt the recently prevalent LLaMA model, LLaMA-Adapter [158] prepends a set of learnable prompts (similar to prefix-tuning) to the input tokens in LLaMA's higher Transformer layers. To avoid unstable fine-tuning with large loss values at early training stages, instead of using randomly initialized weights as in other PEFT methods, LLaMA-Adapter adopts a zero-initialized attention mechanism, which learns a zero-initialized gating factor to adaptively control the contribution of the adaptation prompts to the word tokens. This keeps the fine-tuning starting point the same as the original model and progressively injects new knowledge into the model; a similar idea can be found in MEFT [127] and LoftQ [118], discussed earlier. To represent visual information, LLaMA-Adapter extracts multi-scale global image features using the CLIP image encoder and then projects them into the linguistic embedding space. After that, the feature is added element-wise onto the adaptation prompts at all inserted Transformer layers. LLaMA-Adapter introduces only a small number of learnable parameters in LLaMA-7B, and costs less than one hour for fine-tuning on 8 A100 GPUs. A follow-up work, LLaMA-Adapter V2 [159], demonstrates that the simple multimodal fusion in LLaMA-Adapter cannot generalize to more challenging open-ended multimodal reasoning tasks, where the visual cues tend to dominate the adaptation prompts over the language instruction data. To address this, LLaMA-Adapter V2 decouples the learning of instruction-following ability (to generate long language responses) and vision-language alignment, to avoid interference between visual and language fine-tuning. Specifically, LLaMA-Adapter V2 sets up disjoint parameter groups that are respectively learned from image-text pairs and language instruction data. The visual adaptation prompts are inserted in the early stages of the LLM, while the language adaptation prompts remain at the higher Transformer layers, similar to LLaMA-Adapter. Additionally, LLaMA-Adapter V2 introduces more learnable parameters and several expert systems (e.g., captioning, detection, and OCR) to enhance multimodal performance. LayerNorm Tuning [160] adjusts only the weights of the LayerNorm within each attention block. This straightforward technique can achieve comparable or even better performance than full fine-tuning, while offering substantially greater parameter efficiency than LoRA.
2. Continual Learning (CL): CL aims to learn a sequence of new tasks over time within a single model, with broad applications in scenarios such as dialogue systems [161], information extraction systems [162], and question answering systems [163]. The main challenge in CL is catastrophic forgetting [164]. A popular practice, called architecture-based methods, tackles CL by maintaining task-specific parameters in the model for each new task. Therefore, it is natural to leverage PEFT methods for CL tasks [165], [166], [167], [168]. For example, AdapterCL [165] parameterizes each new task using residual adapters. During testing, since the task ID is not provided, AdapterCL uses an entropy-based classifier to select which adapter to use for a specific task. CPT (Continual Prompt Tuning) [166] trains a soft prompt for each task. Instead of training soft prompts from scratch, CPT proposes a series of techniques (continual prompt initialization, query fusion, memory replay, and a memory-guided technique) to achieve knowledge transfer from preceding and subsequent tasks. O-LoRA (Orthogonal Low-Rank Adaptation) [169] learns distinct tasks within separate low-rank vector subspaces that are kept orthogonal to each other in order to minimize interference. This approach can effectively reduce catastrophic forgetting during the acquisition of new tasks.
3. Context Window Extension: LLMs are typically trained with a pre-defined context size. For example, LLaMA and LLaMA2 have pre-defined context sizes of 2048 and 4096 tokens, respectively. The positional encoding RoPE has weak extrapolation properties [170], which means performance drops noticeably when the input length exceeds the pre-defined context length. To solve this, a naive solution is to fine-tune a pre-trained LLM on longer contexts. However, this escalates computational costs quadratically with context size, straining memory and processing resources. To address this, LongLoRA [171] proposes fine-tuning a pre-trained LLM with LoRA to enlarge the context size. To reduce the perplexity gap between LoRA tuning and full fine-tuning, LongLoRA also opens the embedding and normalization layers for training. To further improve training efficiency in the long-context scenario, LongLoRA introduces a novel shifted sparse attention (S²-Attn) as an efficient substitute for standard self-attention during training. A subsequent study,
LongQLoRA [172], combines the advantages of LongLoRA with QLoRA and Position Interpolation [10] to save GPU memory. This work successfully extends the context length of LLaMA2-13B from 4096 to 8192 tokens on a single V100 with 32GB of memory. LLoCO [173] introduces a pipeline that learns contexts offline through a combination of context compression and LoRA. The process begins by compressing documents into compact contexts, then fine-tuning the LLM using LoRA on the compacted contexts to improve its ability to accurately extract and utilize information from these compressed representations. During model serving, a standard RAG retriever selects both the compressed document and the most relevant LoRA module, and applies them to the LLM for inference. This approach effectively extends the context window of a LLaMA2-7B model to handle substantially longer inputs.
In addition to the limited training-stage sequence length, real-world system memory constraints introduce another critical bottleneck for the context window. Specifically, the capacity of the KV-cache is curtailed by the available system memory. For example, a 30B-parameter LLM operating with an input length of 1024 and a batch size of 128 may require a prohibitively large KV-cache [174], thereby restricting the feasible size of the context window. In response, some strategies have resorted to quantizing the KV-cache [134], [175], but quantization inevitably compromises performance. To effectively counteract this issue without significant loss, GEAR [176] presents a novel approach that employs a low-rank matrix to capture the majority of the coherent bases of the quantization error, complemented by a sparse matrix that addresses errors from outlier entries, thus efficiently minimizing approximation errors.
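The following is a rough sketch of this general decomposition recipe (quantized base plus a low-rank approximation of the residual plus a sparse outlier correction). It is not GEAR's exact algorithm: the quantizer, rank, and outlier selection here are illustrative, and the kv tensor is assumed to be a 2-D (tokens × hidden) slice of the cache.

```python
import torch

def compress_kv(kv: torch.Tensor, bits: int = 4, rank: int = 4, outlier_frac: float = 0.01):
    """Approximate a KV-cache slice as quantized base + low-rank residual + sparse outliers."""
    # 1) Uniform quantization of the dense part.
    qmax = 2 ** bits - 1
    lo, hi = kv.min(), kv.max()
    scale = (hi - lo) / qmax
    q = torch.round((kv - lo) / scale)                  # integer codes
    dequant = q * scale + lo

    # 2) Low-rank approximation of the quantization residual.
    residual = kv - dequant
    u, s, v = torch.svd_lowrank(residual, q=rank)
    low_rank = u @ torch.diag(s) @ v.T

    # 3) Keep the largest remaining errors as a sparse correction.
    remaining = residual - low_rank
    k = max(1, int(outlier_frac * remaining.numel()))
    idx = remaining.abs().flatten().topk(k).indices
    sparse = torch.zeros_like(remaining).flatten()
    sparse[idx] = remaining.flatten()[idx]
    sparse = sparse.reshape(remaining.shape)

    return dequant + low_rank + sparse                  # reconstructed approximation
```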

B. PEFT for ViTs

ViT [177] has emerged as a powerful backbone model in the recent computer vision community. In the ViT model, images are treated as sequences of fixed-size patches, analogous to how LLMs use discrete tokens. These patches undergo linear embedding and then receive positional encodings. Subsequently, they are processed through standard Transformer encoders. The training of ViT can be supervised [177], [178] or self-supervised [179], [180], and ViT can achieve superior performance when trained with more data and a larger model size [181]. However, such scaling inevitably escalates training and storage costs. Therefore, similar to LLMs, PEFT is widely implemented in various downstream tasks, such as dense prediction [182], continual learning [183], [184], and deep metric learning [185]. Here, we focus on two typical tasks to showcase the involvement of PEFT: image classification and video recognition.
1. Image Classification: Image classification on targeted visual datasets is a very common demand with extensive applications, and the pre-train-then-fine-tune paradigm serves as a widespread strategy. A variety of methods leverage PEFT techniques to achieve efficient model tuning [186], [182], [187], [188]. For instance, AdaptFormer [187] inserts adapter modules in parallel to the FFN of the original ViT model for visual recognition tasks. VPT (Visual Prompt Tuning) [186] prepends a small number of task-specific parameters to the input sequence of each Transformer layer. When applying ViT to downstream tasks, only these added parameters and the classification head are set to be trainable. The work in [189] notices that, compared with supervised ViT, VPT often underperforms with self-supervised ViT. Further analysis demonstrates that different pre-training methods and downstream tasks have varying degrees of dependency on Transformer blocks at different locations. To tackle this issue, the work introduces adaptable gates for ViT blocks. These gates dynamically modulate the contribution of prompt tokens to ViT blocks, allowing for a more targeted adaptation of the model to the task at hand.
2. Video Recognition: Several works consider the more challenging adaptation problem of transferring ViT to downstream tasks with a much larger domain gap. For example, ST-Adapter (Spatio-Temporal Adapter) [190] and AIM [191] both insert adapter layers into pre-trained ViT blocks. Their primary goal is to model spatio-temporal information, thereby enabling efficient adaptation of ViTs from image models to video tasks. Notably, both methodologies have exhibited performance that surpasses traditional full-model fine-tuning approaches.

C. PEFT for VLAs

Vision-Language Alignment models (VLAs), such as CLIP [192], ALIGN [193], DeCLIP [194], and FLAVA [195], are designed to learn image and text features that can be aligned within a unified representation space. Each VLA typically consists of separate image and text encoders that extract the respective features. Contrastive learning is leveraged in these models to effectively align the image and text features. Fine-tuning is used to improve the performance of VLAs on specific datasets or tasks, but fine-tuning the full model is computationally intensive. For instance, fine-tuning CLIP RN50x64 requires a batch size of 32,768 and 18 days of training on 592 V100 GPUs [192]. Moreover, full fine-tuning on smaller datasets often leads to catastrophic forgetting [164]. In response to these challenges, and drawing inspiration from the success of PEFT techniques in NLP, a range of PEFT strategies have been proposed and implemented in VLA models, covering tasks such as semantic segmentation [196], [197], [198], point cloud understanding [199], [200], [201], [202], video understanding [203], [204], [205], visual reasoning [206], [207], and temporal action detection [208], to name a few. This section focuses on one common task that uses VLAs: open-vocabulary image classification.
1. Open-vocabulary Image Classification: In open-vocabulary image classification, earlier works design class-specific prompts, e.g., "a photo of a [CLASS]", for each category, and rank images based on their similarity to these textual descriptions. CoOp (Context Optimization) [209] replaces the handcrafted text prompts with learnable vectors, while keeping the entire VLA fixed during training. CoCoOp (Conditional Context Optimization) [210] builds on this by tackling CoOp's limitations in generalizing to unseen classes. It introduces a lightweight neural network that generates an
input-specific context token, dynamically adapting the prompt based on each image, thereby enhancing generalizability, but at the cost of increased computational demands due to the instance-aware operation. ProGrad [211] addresses the over-fitting risk in the few-shot setting by regularizing the soft prompt updates: it only updates prompts whose gradient is aligned with (or at least non-conflicting with) the general knowledge offered by the original prompt. MaPLe [212] notes that existing methods learn prompts either in the language branch or in the vision branch of CLIP, which does not efficiently leverage the multimodal nature of VLAs. To address this, MaPLe proposes branch-aware hierarchical prompts that simultaneously adapt both the language and vision branches, and achieves superior performance. TPT (Test-time Prompt Tuning) [213] studies prompt tuning on the fly without additional training samples. Specifically, during inference, TPT first augments the input image into various views, which are then utilized to tune the learnable prompts. The primary training objective is to ensure the VLA generates consistent responses when faced with these differing views. A follow-up work, DiffTPT [214], further enhances the data diversity of test samples through diffusion models.
In another direction, several studies explore the usage of adapters in VLAs. For example, CLIP-Adapter [215] integrates residual-style adapters after CLIP's text and visual encoders. Therefore, unlike CoOp and CoCoOp, CLIP-Adapter avoids gradient backpropagation through CLIP's encoders, leading to reduced computational requirements in terms of both training memory and time. Tip-Adapter [216] adopts the same design as CLIP-Adapter. Different from CLIP-Adapter, the adapter weights are obtained in a training-free manner from a query-key cache model [217], [218] constructed non-parametrically from few-shot supervision. As a result, Tip-Adapter exhibits great efficiency compared to CLIP-Adapter's SGD training process.
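Below is a simplified sketch of such a training-free key-value cache classifier built from few-shot features, in the spirit of Tip-Adapter; the affinity function and the blending hyper-parameters (alpha, beta) are illustrative defaults rather than the method's tuned values, and the feature extraction itself is assumed to come from a frozen VLA image encoder.

```python
import torch
import torch.nn.functional as F

def build_cache(support_feats: torch.Tensor, support_labels: torch.Tensor, num_classes: int):
    """Keys are L2-normalized few-shot image features; values are their one-hot labels."""
    keys = F.normalize(support_feats, dim=-1)                       # (N*K, d)
    values = F.one_hot(support_labels, num_classes).float()         # (N*K, C)
    return keys, values

def cache_logits(test_feats, keys, values, clip_logits, alpha=1.0, beta=5.0):
    """Blend zero-shot VLA logits with the cache model's few-shot predictions."""
    q = F.normalize(test_feats, dim=-1)                             # (B, d)
    affinity = q @ keys.T                                           # cosine similarities
    cache_pred = torch.exp(-beta * (1.0 - affinity)) @ values       # (B, C)
    return clip_logits + alpha * cache_pred
```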

D. PEFT for Diffusion Models

Diffusion models [219], [220] are a class of generative models that learn to generate data by transforming random noise into a structured output through a progressive denoising process. During training, diffusion models learn to reverse the noise added to training data using a denoising network; during inference, they start from noise and use the denoising network to iteratively create data that mirrors the distribution of the training examples. Diffusion models have various applications [221], [222], [223], [224], [225], the most notable being Stable Diffusion [226], which bridges the gap between text and image with its robust capability to generate coherent and contextually relevant images directly from textual descriptions. Numerous studies leverage PEFT techniques to adapt a pre-trained diffusion model for downstream tasks, including accelerating sampling speed [227], [228], text-to-video adaptation [229], [230], and text-to-3D adaptation [231]. This section mainly focuses on two scenarios: integrating additional input modalities beyond mere text-based conditioning, and customizing content generation based on a pre-trained diffusion model.
1. Additional Input Control: To incorporate additional input modalities (e.g., layout, keypoints) while retaining the extensive knowledge in the pre-trained model, GLIGEN introduces a novel approach that keeps the original model's weights intact and integrates new, trainable gated Transformer layers [232] that take in the new grounding input. The resulting model can not only accurately represent the grounding conditions but also produce high-quality images. Remarkably, the model also generalizes well to unseen objects during inference. ControlNet [233] fine-tunes a trainable copy of the encoding layers of Stable Diffusion while locking its pre-trained parameter weights. The fixed original model and the trainable copy are bridged through zero-convolution layers. These layers, starting with zero-initialized weights, are designed to adapt progressively during training, ensuring that harmful noise does not affect the pre-trained features of Stable Diffusion at the beginning of training. This refined model is capable of conditioning on a variety of inputs such as Canny edges, Hough lines, user scribbles, human keypoints, segmentation maps, shape normals, and depth maps. Concept Sliders [234] introduces plug-and-play LoRA adaptors to allow precise editing of concepts (e.g., age, smiling) within a diffusion model. T2I-Adapter [235] introduces a lightweight adapter model designed to align external control signals with the internal knowledge of text-to-image diffusion models. This adapter enables precise manipulation through structural control (e.g., sketches, depth maps, semantic segmentation maps, and keyposes), color control (e.g., hue and color distribution), and the integration of various controls by composing multiple adapters.
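A minimal sketch of the zero-convolution idea is shown below: a 1×1 convolution whose weights and bias start at zero, so the auxiliary trainable branch contributes nothing at the beginning of training. The frozen_block and trainable_copy modules, and the way the condition enters, are simplified placeholders rather than ControlNet's actual architecture.

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, so its output is zero before any training."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlBranch(nn.Module):
    """Frozen backbone block plus a trainable copy whose output enters through a zero conv."""
    def __init__(self, frozen_block: nn.Module, trainable_copy: nn.Module, channels: int):
        super().__init__()
        self.frozen_block = frozen_block.requires_grad_(False)
        self.trainable_copy = trainable_copy
        self.zero_out = zero_conv(channels)

    def forward(self, x: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # At initialization the added term is exactly zero, leaving the pre-trained path intact.
        return self.frozen_block(x) + self.zero_out(self.trainable_copy(x + condition))
```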
2. Customized Generation: The effectiveness of text-to-image diffusion models is limited by the user's ability to articulate the desired target through text descriptions. For instance, it is difficult to describe the precise features of an innovative toy car that was not encountered during large-scale model training. Consequently, the objective of customized generation is to enable the model to grasp new concepts from a minimal set of user-supplied images. Textual Inversion [236] addresses this by finding a new pseudo-word $S_*$ (similar to the soft prompts discussed in Section III-A2) that represents new, specific concepts in the textual embedding space of pre-trained text-to-image diffusion models. The pseudo-word $S_*$ is optimized via the original optimization objective of the diffusion model given a small image set (typically 3-5 images) depicting the concept, and the pre-trained model is left untouched. During inference, $S_*$ can be treated like any other word and composed with other textual queries (e.g., "a photo of $S_*$ on the beach"). Custom Diffusion [237] tackles a more challenging setting: compositional fine-tuning of multiple concepts. It fine-tunes only the key and value projections that map text features to latent image features in the cross-attention layers, which yields superior performance in multi-concept learning scenarios. Additionally, during fine-tuning, Custom Diffusion prevents model forgetting by introducing a small set of real images with captions akin to the target, and employs augmentation for faster convergence and improved results. IP-Adapter [238] identifies limitations in current approaches (e.g., ControlNet and T2I-Adapter) which project condition signals into the cross-attention modules. When handling image conditions aimed at controlling content, these methods are unable to generate images faithful to the prompted image. The issue stems from the fact that merging image features and text features within the cross-attention layers loses image-specific information, leading to only coarse-grained controllable generation, such as image style rather than image content. To overcome this, IP-Adapter introduces a novel decoupled cross-attention mechanism to distinguish between text and image features. IP-Adapter adds an additional cross-attention layer exclusively for image features within each cross-attention layer, and only the parameters of the new cross-attention layers are trained.

VI. SYSTEM DESIGN CHALLENGE FOR PEFT

A. System Design for PEFT

In this section, we begin by providing a concise overview of cloud-based PEFT systems. Following this, we present the corresponding metrics employed for evaluating the system performance. Additionally, we present three prospective utilization scenarios to illustrate the challenges in system design.
1. Centralized PEFT Query Serving: Cloud providers have recently introduced a range of LLM services aimed at providing user applications through application programming interfaces (APIs) [239], [240]. These APIs facilitate the seamless integration of many ML functionalities into applications. After receiving a query for a specific downstream task through the API, the cloud-based server processes the query with a featured LLM model. Under this scenario, the proposed cloud solution for handling multiple PEFT queries involves storing only a single copy of the LLM together with multiple PEFT modules. This single copy maintains multiple branches of PEFT modules, each associated with different PEFT queries. A case study of a state-of-the-art system can be found in Section VI-C. Figure 10 (b) illustrates the computation pattern for multi-query PEFT inference, wherein packed PEFT queries are scheduled and executed according to their deadlines and current system conditions.
  2. Serving Metrics: To evaluate the system performance of centralized PEFT query serving, we propose a set of evaluation metrics.
• System throughput: Considering both inter-task and intra-task PEFT queries, we use tokens per second to measure the system throughput.
• Memory footprint: The run-time memory consumption during query serving; memory utilization comes from both the model parameters and the KV-cache, as mentioned in Section IV-A.
• Accuracy performance: Real-world queries normally have different context lengths, and performance under varying sequence lengths serves as a performance benchmark.
• Quality of service: Queries are associated with latency requirements, and the deadline miss rate is considered as another benchmark.
3. Distributed System for PEFT: Nevertheless, in the contemporary LLM landscape, personalized tasks are not fully supported by pre-trained models; consequently, extra fine-tuning has to be carried out with the methodologies mentioned in the previous sections. However, a major concern arises when we consider handing these datasets over to cloud providers, since the datasets are personalized.

Fig. 10: (a) Distributed-based system computation pattern; (b) centralized PEFT query inference.
To address this concern, DLoRA [241] presents a distributed PEFT framework. During the PEFT process, the backbone LLM is executed on the cloud servers, while the PEFT modules are trained entirely on the user devices. The DLoRA scheme is depicted in Figure 10 (a).
4. Distributed Metrics: To assess the efficacy of the proposed method, we establish a set of evaluative metrics. For this analysis, and without loss of generality, we adopt language models as the basis for our metric definitions.
  • Accuracy performance: Performance of the fine-tuned model over the downstream tasks.
  • Compute cost: The compute cost during forward and backward propagation operations on edge devices.
  • Communication cost: Refers to the volume of data involved during the transfer of intermediate data between the edge device and the cloud.
5. Multi-PEFT Training: Different from multi-PEFT serving, tuning with multiple customized PEFTs often involves different backbone LLMs. When considering LLM usage across various downstream tasks, pre-trained models typically exhibit subpar performance, and a prevalent approach to adapting an LLM to diverse tasks involves crafting fine-tuned PEFTs. However, simultaneously tuning multiple PEFTs can pose considerable challenges. Questions such as how to manage memory for gradients and model-weight storage, and how to design efficient kernels for batched PEFT training, remain unsolved. PEFTs can be categorized based on their PEFT algorithms and backbone LLM models. The design challenge involves how to consolidate multiple PEFTs sharing the same LLM backbone, as well as multiple PEFTs with different LLM backbones, simultaneously.

B. Case Study: Offsite-Tuning

We already know that fine-tuning LLMs for downstream tasks is challenging for two reasons: dual privacy concerns between the cloud server and the data owner, and issues with computational resources and efficiency. Firstly, the privacy of both parties is at risk: the weights of large models are often proprietary and not made public. Sharing data with model owners for fine-tuning raises data privacy concerns, while providing model weights to data proprietors could compromise the ownership of proprietary models. Secondly, even if downstream users have access to the pre-trained weights, the stringent hardware requirements make transfer learning impractical for most end users.
我们已经知道,针对下游任务微调LLM 具有挑战性,原因有二:云服务器和数据所有者之间的双重隐私问题,以及计算资源和效率问题。首先,双方的隐私都面临风险:大型模型的权重通常是专有的,不会公开。与模型所有者共享数据进行微调可能会导致数据隐私问题,而向数据所有者提供模型权重可能会损害专有模型的所有权。其次,即使下游用户可以获得预训练的权重,严格的硬件要求也会使迁移学习对大多数终端用户来说不切实际。
To resolve these two issues, Offsite-Tuning [242] proposes a privacy-preserving and efficient transfer learning framework that enables foundational models to adapt to downstream tasks without the need to access the complete model weights. The key insight of Offsite-Tuning is that the cloud provider sends an adapter and an emulator to the data proprietor. Then, with the assistance of the emulator, the data proprietor fine-tunes the adapter. The fine-tuned adapter is then sent back to the cloud side, which integrates it into the complete model, creating a fine-tuned foundational model for downstream users.
为了解决这两个问题,Offsite-Tuning[242]提出了一种保护隐私的高效迁移学习框架,使基础模型能够适应下游任务,而无需访问完整的模型权重。场外调优的关键在于云提供商向数据所有者发送适配器和仿真器。然后,在仿真器的协助下,数据所有者对适配器进行微调。经过微调的适配器随后被发送回云端,云端将其集成到完整的模型中,为下游用户创建一个经过微调的基础模型。
Offsite-Tuning safeguards the privacy of data proprietors since they do not need to share their training data directly. It also protects the foundational model owners, as the complete model weights are not shared, and the emulator provided is lossy, with significantly degraded performance. Compared to existing fine-tuning methods that require access to the full model weights, Offsite-Tuning is more resource-efficient because it allows for fine-tuning through a compressed emulator without needing the complete model.
场外调优可以保护数据所有者的隐私,因为他们不需要直接共享训练数据。它还能保护基础模型所有者,因为完整的模型权重不会共享,而且提供的仿真器是有损的,性能会显著下降。与需要访问完整模型权重的现有微调方法相比,场外微调方法更节省资源,因为它允许通过压缩仿真器进行微调,而不需要完整模型。
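A toy sketch of this adapter-plus-emulator workflow is given below: the model owner keeps the full stack of layers, ships trainable boundary adapters together with a lossy emulator of the frozen middle (here crudely approximated by dropping every other layer), and only the adapter weights travel back before being plugged into the full model. The layer counts, the tanh layers, and the layer-dropping compression are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_layers = 32, 12
full_model = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

# Model-owner side: trainable boundary adapters plus a lossy emulator of the frozen middle.
adapters = [full_model[0].copy(), full_model[-1].copy()]  # shared with the data owner, trainable
middle = full_model[1:-1]                                 # never leaves the cloud
emulator = middle[::2]                                    # crude compression: keep every other layer

def run(layers, x):
    for W in layers:
        x = np.tanh(x @ W)
    return x

# Data-owner side: fine-tune only the adapters against the emulator (update step omitted);
# the private training data never leaves its owner.
x_private = rng.standard_normal((4, d))
y_emulated = run([adapters[0]] + emulator + [adapters[1]], x_private)

# Model-owner side, after receiving the fine-tuned adapters: plug them into the full frozen middle.
x_query = rng.standard_normal((4, d))
y_plugged = run([adapters[0]] + middle + [adapters[1]], x_query)
print(y_emulated.shape, y_plugged.shape)
```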

C. Case Study: PetS
C.案例研究:PetS

PEFT algorithms are notable for their ability to distinguish between modifiable and immutable weights within a model. This characteristic inspires developers to serve diverse PEFT tasks built on distinct techniques as collective units that share a backbone LLM. PetS [243] advocates a comprehensive approach to managing multiple PEFT tasks through a unified serving framework. The framework's core advancement lies in translating the varying PEFT tasks into integrated computation kernels to enhance efficiency. Moreover, PetS pioneers a coordinated batching approach and a scheduling methodology, aiming to increase system throughput and leverage task parallelism, respectively.
PEFT 算法的显著特点是能够区分模型中的可修改权重和不可修改权重。这一特点促使开发人员将不同的LLMs 与不同的 PEFT 技术合并为集体单元。PetS 在 [243] 中提出了一个统一的服务框架,主张采用综合方法管理多个 PEFT 任务。该框架的核心进步在于将不同的 PEFT 任务转化为集成计算内核,以提高效率。此外,PetS 还开创了一种协调批处理方法和一种调度方法,分别旨在提高系统吞吐量和利用任务并行性。
As depicted in Figure 11, the PetS framework begins with users registering PEFT tasks through a standardized Application Programming Interface (API). Upon registration, developers are expected to provide the pre-trained model tag (e.g., LLaMA), the PEFT parameters in a compressed format, and the specific PEFT algorithm (e.g., LoRA, Adapter, Bitfit, etc.). These tasks are then assigned unique identifiers, and the inference engine takes charge of query processing. PetS bifurcates the primary computational workload (e.g., linear-layer computations) into three distinct computational operations: (1) dense matrix-vector multiplication (MVM) leveraging the universally accessible, pre-trained weights; (2) bias vector addition (Vadd), using either common or task-exclusive biases; and (3) sparse/dense MVM operations employing task-specific PET parameters. A unified pre-trained weight matrix $W$ is employed across PetS, facilitating the batching of the first operation, $XW$, across all tasks. The subsequent task-specific computations involving PET parameters, despite being relatively small in complexity, are processed individually.
如图 11 所示,PetS 框架首先由用户通过标准化应用编程接口(API)注册 PEFT 任务。注册时,开发人员需要提供预训练模型标签(如 LLaMA)、压缩格式的 PEFT 参数以及特定的 PEFT 算法(如 LoRA、Adapter、Bitfit 等)。然后,这些任务被赋予唯一标识符,推理引擎负责查询处理。PetS 将主要计算工作量(如线性层计算)分成三个不同的计算操作:(1)利用普遍可及的预训练权重进行密集矩阵-向量乘法(MVM)。(2)偏置向量加法(Vadd),使用通用偏置或任务专属偏置。(3) 结合使用特定任务 PET 参数的稀疏/密集 MVM 运算。PetS 采用统一的预训练权重矩阵 ,便于批处理初始操作 。不过,涉及 PET 参数的后续特定任务计算尽管复杂程度相对较低,但仍要单独处理。
Considering the Adapter and Bitfit tasks as an illustration, both aim at the MLP component of LLMs. The Adapter task integrates additional weight segments, whereas Bitfit adjusts bias elements. The Adapter operation is modeled as $Y = X(W + W_{pet}) + b$, where $X$ represents the input for the Adapter task, $W$ and $W_{pet}$ are the original and adapter-specific PEFT weights respectively, and $b$ is the initial bias. The Bitfit operation, on the other hand, is defined as $Y = XW + b_{pet}$, with $b_{pet}$ symbolizing the Bitfit-adjustable bias. These operations are further synthesized as $Y = XW + XW_{pet} + b_{pet}$, delineating that the $XW$ part is amenable to batching through the shared dense MVM, while the task-specific $XW_{pet}$ term maps to the sparse/dense MVM and the $b_{pet}$ segment pertains to the Vadd operation.
以 Adapter 和 Bitfit 任务为例,这两个任务的目标都是LLMs 的 MLP 组件。Adapter 任务整合了额外的权重段,而 Bitfit 则调整了偏置元素。适配器操作建模为 ,其中 代表适配器任务的输入, 分别是原始和适配器特定的 PEFT 权重, 是初始偏置。Bitfit 操作则定义为 ,其中 表示可调整的 Bitfit 偏置。这些操作进一步合成为 ,其中 部分可通过 MVM 进行批处理,而 部分则与 Vadd 操作有关。
Tasks like Diff-Pruning (Section III-B) are handled slightly differently from Bitfit and Adapter. For Diff-Pruning, the computations concerning the shared weight and the 'difference' are conducted separately, and the results are then added up, namely
对于 Diff-Pruning 这样的任务,[III-B 与 Bitfit 和 Adapter 稍有不同。对于 Diff-Pruning,共享权重和 "差值 "的计算是分开进行的。然后将结果相加,即
$Y = XW + X\delta$, where $W$ denotes the backbone model weights and $\delta$ denotes the pruned 'difference' weights, whose computation can be represented as a sparse MVM.
这里, 表示骨干模型权重,而 表示剪枝后的权重,可以表示为稀疏 MVM。
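The decomposition above can be mirrored in a few lines of NumPy: the dense MVM over the shared weight $W$ is batched across all queries, after which each query applies its own lightweight PET operator (an added weight for Adapter-style tasks, a bias for Bitfit, a sparse difference for Diff-Pruning). The shapes, task assignment, and the Python-level dispatch loop are illustrative assumptions, not PetS's fused CUDA kernels.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_queries = 16, 6
W = rng.standard_normal((d, d)) / np.sqrt(d)  # shared pre-trained weight
b = rng.standard_normal(d) * 0.1              # shared bias

# Per-task PET parameters registered with the serving engine (illustrative values).
pet = {
    "adapter": {"W_pet": rng.standard_normal((d, d)) * 0.01},
    "bitfit":  {"b_pet": rng.standard_normal(d) * 0.01},
    "diff":    {"delta": rng.standard_normal((d, d)) * (rng.random((d, d)) < 0.05)},  # sparse difference
}

X = rng.standard_normal((n_queries, d))
task_of = ["adapter", "bitfit", "diff", "adapter", "bitfit", "diff"]

shared = X @ W                      # (1) dense MVM with the shared weight, batched over all queries
Y = np.empty_like(shared)
for i, t in enumerate(task_of):     # (2)+(3) small task-specific operators, handled per query
    if t == "adapter":
        Y[i] = shared[i] + X[i] @ pet[t]["W_pet"] + b      # extra dense MVM + Vadd
    elif t == "bitfit":
        Y[i] = shared[i] + pet[t]["b_pet"]                 # task-specific Vadd only
    else:  # diff-pruning
        Y[i] = shared[i] + X[i] @ pet[t]["delta"]          # sparse MVM over the 'difference'
print(Y.shape)
```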
The other challenge PetS addresses is how to schedule different PEFT requests to achieve high performance. The PetS scheduler attains high parallelism through a two-level scheduling policy: Coordinated Batching (CB) and Macro-batch Streaming (MS), as Figure 12 depicts. Through CB, the input queries are first clustered based on their input length and then grouped based on their shared operator, ensuring that queries of the same sequence length are executed together without wasted padding. The MS strategy then takes the grouped queries produced by coordinated batching, together with the theoretical latency of the different operators and the system modeling parameters, to generate the best execution order.
PetS 提出的另一个挑战是如何调度不同的 PEFT 请求以实现高性能。PetS 调度器通过两级调度策略实现高并行性:协调批处理(CB)和宏批处理流(MS),如图 12 所示。通过 ,输入查询将首先根据其输入长度进行聚类,然后根据其共享运算符进行分组。这样做是为了确保执行相同序列长度的查询,而不浪费填充。MS 策略将利用协调批处理后的分组查询、不同操作符的理论延迟以及系统建模参数来生成最佳执行顺序。
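A simplified rendering of this two-level policy is sketched below: queries are first bucketed by padded sequence length, then grouped by the shared operator they target, and the resulting macro-batches are ordered with a toy latency estimate. The bucketing rule and the cost model are assumptions for illustration; PetS derives the ordering from profiled operator latencies and system modeling parameters.

```python
from collections import defaultdict

# Each query: (query_id, shared_operator, sequence_length) -- illustrative records.
queries = [(0, "mlp", 128), (1, "mlp", 128), (2, "attn", 120), (3, "mlp", 512), (4, "attn", 128)]

def coordinated_batching(qs, bucket=128):
    groups = defaultdict(list)
    for q in qs:
        padded_len = -(-q[2] // bucket) * bucket  # round length up to a padding bucket
        groups[(padded_len, q[1])].append(q)      # then group by the shared operator
    return groups

def macro_batch_streaming(groups):
    est = lambda key, grp: key[0] * len(grp)      # toy latency model: padded length x batch size
    return sorted(groups.items(), key=lambda kg: est(*kg))

for (padded_len, op), batch in macro_batch_streaming(coordinated_batching(queries)):
    print(f"run {op:4s} at padded length {padded_len}: queries {[q[0] for q in batch]}")
```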

D. Parallel PEFT Training Frameworks
D.平行 PEFT 培训框架

a) Design Challenges: Unlike the PetS system, which aims to accommodate flexible multi-PEFT algorithms, SLoRA [244] and Punica [245] focus solely on serving multiple LoRA blocks for various tasks. Designing such multi-PEFT training systems presents key challenges in two main aspects:
a) 设计挑战:PetS 系统旨在适应灵活的多重 PEFT 算法,而 SLoRA [244] 和 Punica [245] 与之不同,它们只专注于促进各种任务的多重 LoRA 模块。设计多重 PEFT 训练系统主要面临两个方面的挑战:
  • Efficient concurrent execution of multiple PEFT models with the same LLM backbone.
    使用同一LLM 主干网高效并发执行多个 PEFT 模型。
  • Designing an efficient system for multi-tenant serving with different LLM backbones.
    为使用不同LLM 主干网的多租户服务设计高效系统。
b) Efficient kernel design: Punica addresses the first challenge by using existing matrix multiplication for the backbone computation and introducing a new CUDA kernel, Segmented Gather Matrix-Vector Multiplication (SGMV), for adding the PEFT add-ons to the backbone computation in a batched manner. This kernel parallelizes the feature-weight multiplication for different requests in the batch and groups requests corresponding to the same PEFT model to increase operational intensity and use GPU Tensor Cores for acceleration.
b) 高效内核设计:Punica 利用现有的矩阵乘法进行主干计算,并引入新的 CUDA 内核--分段收集矩阵-矢量乘法(SGMV),以分批方式将 PEFT 附加组件添加到主干计算中,从而解决了第一个难题。该内核对批次中不同请求的特征量乘法进行并行处理,并将对应于相同 PEFT 模型的请求分组,以提高运行强度并使用 GPU 张量核心进行加速。
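The semantics of such a segmented kernel can be sketched in NumPy: requests in a batch are gathered by the LoRA adapter they use, the shared backbone matmul is applied to the whole batch at once, and each adapter's low-rank update is applied to its own segment. The Python grouping loop stands in for what SGMV performs inside a single CUDA kernel; the shapes and adapter count are assumed.

```python
import numpy as np

rng = np.random.default_rng(3)
d, rank, n_req, n_adapters = 32, 4, 8, 3
W = rng.standard_normal((d, d)) / np.sqrt(d)           # shared backbone weight
A = rng.standard_normal((n_adapters, d, rank)) * 0.01  # per-adapter LoRA down-projections
B = rng.standard_normal((n_adapters, rank, d)) * 0.01  # per-adapter LoRA up-projections

X = rng.standard_normal((n_req, d))                    # one feature vector per request
adapter_idx = rng.integers(0, n_adapters, size=n_req)  # which LoRA model each request uses

Y = X @ W                                              # dense backbone GEMM over the whole batch
for a in range(n_adapters):                            # "segmented gather": per-adapter groups
    rows = np.flatnonzero(adapter_idx == a)
    if rows.size:
        Y[rows] += (X[rows] @ A[a]) @ B[a]             # batched low-rank update for this group
print(Y.shape)
```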
The second challenge goes beyond computational cost: designing an efficient system architecture that can effectively serve multi-tenant PEFT model workloads on the smallest possible set of GPUs, while occupying the fewest GPU resources, is another significant challenge. Punica addresses
第二个挑战是计算成本之外的另一个重大挑战,即设计一个高效的系统架构,在占用最少 GPU 资源的同时,在尽可能少的 GPU 上为多租户 PEFT 模型工作负载提供有效服务。Punica 解决了以下问题
Fig. 11: PetS system overview: (1) Task registration; (2) Task manager; (3) Task scheduling; (4) Task serving. (Image is taken from PetS [243])
图 11:PetS 系统概述:(1) 任务注册;(2) 任务管理器 (3) 任务计划;(4) 任务服务。(图片摘自 PetS [243])
this by scheduling user requests to active GPUs that already serve or train PEFT models, thereby improving GPU utilization. For older requests, Punica periodically migrates them to consolidate workloads, thus freeing up GPU resources for new requests.
为此,Punica 将用户请求调度到已经为 PEFT 模型提供服务或训练的活动 GPU 上,从而提高 GPU 的利用率。对于较早的请求,Punica 会定期迁移这些请求以整合工作负载,从而为新请求释放 GPU 资源。
c) Multi-Tenant PEFT design: Designing an efficient system for multi-tenant PEFT model serving in the Punica framework centers on addressing several key challenges to maximize hardware utilization and minimize resource consumption. The system aims to consolidate multi-tenant LoRA serving workloads onto the smallest possible set of GPUs. This consolidation is achieved through strategic scheduling of user requests to active GPUs that are already serving or training LoRA models, thereby improving GPU utilization. For older requests, Punica periodically migrates them to consolidate workloads further, thus freeing up GPU resources for new requests. It incorporates on-demand loading of LoRA model weights, which introduces only millisecond-level latency. This feature gives Punica the flexibility to dynamically consolidate user requests onto a small set of GPUs without being constrained by the specific LoRA models already running on those GPUs. Besides that, since Punica identifies the decode stage as the predominant factor in the cost of model serving, its design primarily focuses on optimizing decode-stage performance. Other aspects of model serving rely on straightforward techniques, such as on-demand loading of LoRA model weights, to manage resource utilization efficiently.
c) 多租户 PEFT 设计:为在 Punica 框架中提供服务的多租户 PEFT 模型设计一个高效系统的重点是解决几个关键挑战,以最大限度地提高硬件利用率和减少资源消耗。该系统旨在将多租户 LoRA 服务工作负载整合到尽可能少的 GPU 上。这种整合是通过将用户请求战略性地调度到已在服务或训练 LoRA 模型的活动 GPU 上实现的,从而提高了 GPU 的利用率。对于较早的请求,Punica 会定期迁移这些请求,以进一步整合工作负载,从而为新请求释放 GPU 资源。它结合了按需加载 LoRA 模型权重的功能,这只会带来毫秒级的延迟。这一功能使 Punica 能够灵活地将用户请求动态整合到一小部分 GPU 上,而不受已经在这些 GPU 上运行的特定 LoRA 模型的限制。除此之外,Punica 发现解码阶段是影响模型服务成本的主要因素,因此 Punica 的设计主要侧重于优化解码阶段的性能。模型服务的其他方面则利用直接的技术,如按需加载 LoRA 模型权重,来有效管理资源利用率。
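A highly simplified version of this placement policy is sketched below: an incoming request is routed to a GPU that already hosts its LoRA model if one has spare capacity; otherwise the least-loaded GPU loads the adapter on demand. The capacity limit, data structures, and the absence of migration are simplifications invented for illustration.

```python
from collections import defaultdict

class ToyLoRAScheduler:
    def __init__(self, n_gpus, capacity=4):
        self.capacity = capacity
        self.load = [0] * n_gpus          # running requests per GPU
        self.hosted = defaultdict(set)    # gpu -> LoRA models currently resident

    def route(self, lora_id):
        # Prefer a GPU that already serves this LoRA model and still has headroom.
        for gpu, models in self.hosted.items():
            if lora_id in models and self.load[gpu] < self.capacity:
                self.load[gpu] += 1
                return gpu
        # Otherwise pick the least-loaded GPU and load the adapter on demand.
        gpu = min(range(len(self.load)), key=self.load.__getitem__)
        self.hosted[gpu].add(lora_id)
        self.load[gpu] += 1
        return gpu

sched = ToyLoRAScheduler(n_gpus=2)
for req_id, lora in enumerate(["a", "b", "a", "a", "c"]):
    print(f"request {req_id} (lora {lora}) -> gpu {sched.route(lora)}")
```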

VII. Conclusion and Future Directions
VII.结论和未来方向

In the current era dominated by large models and large datasets, PEFT stands out as a highly attractive method for efficiently adapting models to downstream tasks. The technique gains its appeal by addressing the significant challenges posed by traditional full-model fine-tuning, which often imposes substantial computational and data demands. This survey offers a comprehensive examination of the most recent advancements in PEFT, including algorithmic design, computational efficiency, application scenarios, and system implementation. It provides a comprehensive taxonomy and explanation that serves as excellent guidance and a knowledge base, enabling readers of various levels and disciplines to swiftly grasp the core concepts of PEFT.
在当前以大型模型和大型数据集为主导的时代,PEFT 是一种极具吸引力的方法,能有效地调整模型以适应下游任务。传统的全模型微调往往需要大量的计算和数据需求,而 PEFT 技术可以解决这一难题,因此极具吸引力。本研究全面考察了 PEFT 的最新进展,包括 PEFT 的算法设计、计算效率、应用场景和系统实现。它提供了全面的分类和解释,是一个很好的指导和知识库,能让不同层次和学科的读者迅速掌握 PEFT 的核心概念。
Fig. 12: Coordinated Batching (CB) Strategy
图 12:协调分批(CB)策略
For further research on PEFT, we propose a series of possible directions from both algorithm and system perspectives, hoping to inspire more researchers to engage in further studies in these areas.
对于 PEFT 的进一步研究,我们从算法和系统两个角度提出了一系列可能的方向,希望能激励更多的研究人员在这些领域开展进一步的研究。

A. Simplify hyperparameter tuning
A.简化超参数调整

The effectiveness of PEFT is often sensitive to its hyperparameters, such as the bottleneck dimension of the adapter, the rank of LoRA, and the arrangement of the various additive PEFT layers. Manually tuning these hyperparameters requires considerable effort. Therefore, future efforts could focus on developing methods that are less dependent on manual tuning of these parameters, or that automatically find optimal configuration settings. Several studies [76], [77], [78], [91], [92], [93] have started to address this issue, but simpler and more efficient solutions for optimizing these hyperparameters are still needed.
PEFT 的有效性通常对其超参数非常敏感,例如适配器的瓶颈维度、LoRA 的秩以及各种添加 PEFT 层的排列。手动调整这些超参数将耗费大量精力。因此,未来的工作重点可以放在开发较少依赖人工调整这些参数的方法,或者自动找到最佳配置设置。一些研究[76]、[77]、[78]、[91]、[92]、[93]已经开始着手解决这个问题,但还需要更简单有效的优化这些超参数的解决方案。

B. Establish a unified benchmark
B.建立统一基准

Despite the existence of libraries like HuggingFace's PEFT [246] and AdapterHub [247], a comprehensive benchmark for PEFT is still lacking. This gap hinders the ability to fairly compare the performance and efficiency of different PEFT approaches. A well-accepted, up-to-date benchmark akin to MMDetection [248] for object detection would enable researchers to validate their methods against a standard set of tasks and metrics, fostering innovation and collaboration within the community.
尽管存在 HuggingFace's PEFT [246] 和 AdapterHub [247] 等库,但 PEFT 仍然缺乏全面的基准。这一空白阻碍了对不同 PEFT 方法的性能和效率进行公平比较的能力。一个类似于用于物体检测的 MMDetection [248] 的公认的最新基准将使研究人员能够根据一组标准任务和指标来验证他们的方法,从而促进社区内的创新与合作。

C. Enhance training efficiency
C.提高培训效率

The presumed parameter efficiency of PEFT is not always consistent with computational and memory savings during training. Given that the trainable parameters are intertwined with the pre-trained model's architecture, computing and storing activations and gradients for the full model often becomes necessary during fine-tuning. This oversight calls for a rethinking of what constitutes efficiency. As outlined in Section IV, potential solutions lie in the integration of model compression techniques such as pruning and quantization, alongside innovations specifically designed to optimize memory during PEFT tuning [249]. Further research into enhancing
PEFT 假定的参数效率并不总是与训练过程中计算和内存的节省相一致。由于可训练参数与预训练模型的结构相互交织,因此在微调过程中往往需要计算和存储完整模型的激活和梯度。这种疏忽要求我们重新思考什么是效率。正如第四部分所述,潜在的解决方案在于整合模型压缩技术,如剪枝和量化,以及专门设计用于在 PEFT 调整过程中优化内存的创新技术[249]。进一步研究如何提高

the computational efficiency of PEFT methodologies is imperative.
当务之急是提高 PEFT 方法的计算效率。

D. Explore scaling laws
D.探索缩放规律

The design and effectiveness of PEFT methods originally developed for smaller Transformer models do not necessarily scale with larger models. As the size of foundation models increases, identifying and adapting PEFT strategies that remain effective is crucial. This investigation will aid in customizing PEFT methodologies to suit the evolving landscape of large model architectures.
最初为较小变压器模型开发的 PEFT 方法的设计和有效性并不一定适用于较大的模型。随着基础模型规模的扩大,确定和调整仍然有效的 PEFT 策略至关重要。这项研究将有助于定制 PEFT 方法,以适应大型模型结构的不断发展。

E. Serve more models and tasks
E.为更多的模型和任务提供服务

The rise of large foundation models across various domains presents new opportunities for PEFT. Designing PEFT methods tailored to the unique characteristics of models, such as Sora [250], Mamba [251], and LVM [252], can unlock new application scenarios and opportunities.
各领域大型基础模型的兴起为 PEFT 带来了新机遇。针对模型的独特特征设计 PEFT 方法,如 Sora [250]、Mamba [251] 和 LVM [252],可以开启新的应用场景和机遇。

F. Enhancing data privacy
F.加强数据隐私

Trusting centralized systems to serve or fine-tune personalized PEFT modules is yet another issue for system developers. Multiple types of inversion attacks [253], [254] have been proposed that reconstruct users' data by hijacking intermediate results. One direction for future trustworthy LLM system design is to develop encryption protocols for both personal data and intermediate training and inference results.
对于系统开发人员来说,信任集中式系统来提供或微调个性化 PEFT 模块是另一个问题。有人提出了多种类型的反转攻击 [253]、[254],通过劫持中间结果来重建用户数据。未来值得信赖的LLM 系统设计的一个视角是为个人数据以及中间训练和推理结果制定加密协议。

G. PEFT with model compression
G.带有模型压缩功能的 PEFT

Model compression is one of the most effective ways to make LLMs executable on resource-limited devices. Yet, the impact of model compression techniques on the performance of PEFT algorithms running on hardware remains a systemic challenge. Common compression techniques such as quantization and pruning necessitate dedicated hardware platforms to expedite the process, and building such hardware platforms for compressed models is yet another direction for future research.
模型压缩是使LLM 可在资源有限的设备上执行的最有效方法之一。然而,模型压缩技术对在硬件上运行的 PEFT 算法性能的影响仍然是另一个系统性挑战。量化和剪枝等常用压缩技术需要专用的硬件平台来加速处理过程,而为压缩模型构建这样的硬件平台是未来研究的另一个方向。

REFERENCES 参考文献

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in neural information processing systems, vol. 33, pp. 1877-1901, 2020.
[1] T. Brown、B. Mann、N. Ryder、M. Subbiah、J. D. Kaplan、P. Dhariwal、A. Neelakantan、P. Shyam、G. Sastry、A. Askell 等:《语言模型是少数学习者》,《神经信息处理系统进展》,第 33 卷,第 1877-1901 页,2020 年。
[2] Y. Zhuang, Y. Yu, K. Wang, H. Sun, and C. Zhang, "Toolqa: A dataset for question answering with external tools," arXiv preprint arXiv:2306.13304, 2023.
[2] Y. Zhuang、Y. Yu、K. Wang、H. Sun 和 C. Zhang,"Toolqa:A dataset for question answering with external tools," arXiv preprint arXiv:2306.13304, 2023.
[3] W. Zhu, H. Liu, Q. Dong, J. Xu, L. Kong, J. Chen, L. Li, and S. Huang, "Multilingual machine translation with large language models: Empirical results and analysis," arXiv preprint arXiv:2304.04675, 2023.
[3] W. Zhu, H. Liu, Q. Dong, J. Xu, L. Kong, J. Chen, L. Li, and S. Huang, "Multilingual machine translation with large language models:Empirical results and analysis," arXiv preprint arXiv:2304.04675, 2023.
[4] M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. Shaikh, N. Akhtar, J. Wu, and S. Mirjalili, "A survey on large language models: Applications, challenges, limitations, and practical usage," TechRxiv, 2023.
[4] M. U. Hadi、R. Qureshi、A. Shah、M. Irfan、A. Zafar、M. Shaikh、N. Akhtar、J. Wu 和 S. Mirjalili,"大型语言模型调查:应用、挑战、限制和实际使用",TechRxiv,2023。
[5] B. Xu, X. Liu, H. Shen, Z. Han, Y. Li, M. Yue, Z. Peng, Y. Liu, Z. Yao, and D. Xu, "Gentopia: A collaborative platform for tool-augmented arXiv preprint arXiv:2308.04030, 2023.
[5] B. Xu、X. Liu、H. Shen、Z. Han、Y. Li、M. Yue、Z. Peng、Y. Liu、Z. Yao 和 D. Xu,"Gentopia:A collaborative platform for tool-augmented arXiv preprint arXiv:2308.04030, 2023.
[6] G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, "Camel: Communicative agents for "mind" exploration of large language model society," in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[6] G. Li、H. A. A. K. Hammoud、H. Itani、D. Khizbullin 和 B. Ghanem,"Camel:大型语言模型社会的 "心智 "探索交流代理》,第三十七届神经信息处理系统大会,2023 年。

[7] Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang, "Autogen: Enabling next-gen llm applications via multi-agent conversation framework," arXiv preprint arXiv:2308.08155, 2023.
[7] Q. Wu、G. Bansal、J. Zhang、Y. Wu、S. Zhang、E. Zhu、B. Li、L. Jiang、X. Zhang 和 C. Wang,"Autogen:Autogen: Enabling next-genllm applications via multi-agent conversation framework," arXiv preprint arXiv:2308.08155, 2023.
[8] H. Zhang, X. Liu, and J. Zhang, "Summit: Iterative text summarization via chatgpt," arXiv preprint arXiv:2305.14835, 2023.
[8] H. Zhang、X. Liu 和 J. Zhang,"Summit:Iterative text summarization via chatgpt," arXiv preprint arXiv:2305.14835, 2023.
[9] B. Zhang and R. Sennrich, "Root mean square layer normalization," Advances in Neural Information Processing Systems, vol. 32, 2019.
[10] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, "Roformer: Enhanced transformer with rotary position embedding," arXiv preprint arXiv:2104.09864, 2021.
[10] J. Su、Y. Lu、S. Pan、A. Murtadha、B. Wen 和 Y. Liu,"Roformer:带旋转位置嵌入的增强型变压器",arXiv 预印本 arXiv:2104.09864, 2021。
[11] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "Glue: A multi-task benchmark and analysis platform for natural language understanding," arXiv preprint arXiv:1804.07461, 2018.
[11] A. Wang、A. Singh、J. Michael、F. Hill、O. Levy 和 S. R. Bowman,"Glue:A multi-task benchmark and analysis platform for natural language understanding," arXiv preprint arXiv:1804.07461, 2018.
[12] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, "Can a suit of armor conduct electricity? a new dataset for open book question answering," in EMNLP, 2018.
[12] T. Mihaylov、P. Clark、T. Khot 和 A. Sabharwal,"一套盔甲能导电吗?",2018 年 EMNLP 会议。
[13] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi, "Piqa: Reasoning about physical commonsense in natural language," in Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
[13] Y. Bisk、R. Zellers、R. L. Bras、J. Gao 和 Y. Choi,"Piqa:Piqa: Reasoning about physical commonsense in natural language," in Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
[14] M. Sap, H. Rashkin, D. Chen, R. LeBras, and Y. Choi, "Socialiqa: Commonsense reasoning about social interactions," arXiv preprint arXiv:1904.09728, 2019.
[14] M. Sap、H. Rashkin、D. Chen、R. LeBras 和 Y. Choi,"Socialiqa:关于社会互动的常识推理》,arXiv 预印本 arXiv:1904.09728, 2019。
[15] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, "Hellaswag: Can a machine really finish your sentence?" in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[15] R. Zellers、A. Holtzman、Y. Bisk、A. Farhadi 和 Y. Choi,"Hellaswag:机器真的能完成你的句子吗?"《2019 年第 57 届计算语言学协会年会论文集》。
[16] C. e. a. Clark, "Boolq: Exploring the surprising difficulty of natural yes/no questions," in NAACL, 2019.
[17] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi, "Winogrande: An adversarial winograd schema challenge at scale," Communications of the ACM, vol. 64, no. 9, pp. 99-106, 2021.
[17] K. Sakaguchi、R. L. Bras、C. Bhagavatula 和 Y. Choi,"Winogrande:ACM 通信》,第 64 卷,第 9 期,第 99-106 页,2021 年。
[18] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, "Think you have solved question answering? try arc, the ai2 reasoning challenge," arXiv:1803.05457v1, 2018.
[19] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., "The kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[19] W. Kay、J. Carreira、K. Simonyan、B. Zhang、C. Hillier、S. Vijayanarasimhan、F. Viola、T. Green、T. Back、P. Natsev 等:《动力学人类动作视频数据集》,arXiv 预印本 arXiv:1705.06950, 2017。
[20] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag et al., "The" something something" video database for learning and evaluating visual common sense," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5842-5850.
[20] R. Goyal、S. Ebrahimi Kahou、V. Michalski、J. Materzynska、S. Westphal、H. Kim、V. Haenel、I. Fruend、P. Yianilos、M. Mueller-Freitag 等:《用于学习和评估视觉常识的 "某某东西 "视频数据库》,《电气和电子工程师学会计算机视觉国际会议论文集》,2017 年,第 5842-5850 页。
[21] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "Hmdb: a large video database for human motion recognition," in 2011 International conference on computer vision. IEEE, 2011, pp. 2556-2563.
[21] H. Kuehne、H. Jhuang、E. Garrote、T. Poggio 和 T. Serre,"Hmdb:用于人体动作识别的大型视频数据库",2011 年计算机视觉国际会议。IEEE, 2011, pp.
[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in Computer Vision-ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740-755.
[22] T. -Y.Lin、M. Maire、S. Belongie、J. Hays、P. Perona、D. Ramanan、P. Dollár 和 C. L. Zitnick,"Microsoft coco:语境中的常见对象",计算机视觉-ECCV 2014:13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13.Springer, 2014, pp.
[23] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ade20k dataset," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633641.
[23] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ade20k dataset," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.
[24] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (voc) challenge," International journal of computer vision, vol. 88, pp. 303-338, 2010.
[24] M. Everingham、L. Van Gool、C. K. Williams、J. Winn 和 A. Zisserman,"帕斯卡尔视觉对象类别(voc)挑战",《国际计算机视觉杂志》,第 88 卷,第 303-338 页,2010 年。
[25] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, "Parameter-efficient transfer learning for nlp," in International Conference on Machine Learning. PMLR, 2019, pp. 2790-2799.
[25] N. Houlsby、A. Giurgiu、S. Jastrzebski、B. Morrone、Q. De Laroussilhe、A. Gesmundo、M. Attariyan 和 S. Gelly,"Parameter-efficient transfer learning for nlp",国际机器学习会议。PMLR, 2019, pp.
[26] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig, "Towards a unified view of parameter-efficient transfer learning," arXiv preprint arXiv:2110.04366, 2021.
[27] Y. Zhu, J. Feng, C. Zhao, M. Wang, and L. Li, "Counterinterference adapter for multilingual machine translation," arXiv preprint arXiv:2104.08154, 2021.
[28] T. Lei, J. Bai, S. Brahma, J. Ainslie, K. Lee, Y. Zhou, N. Du, V. Y. Zhao, Y. Wu, B. Li et al., "Conditional adapters: Parameter-efficient transfer learning with fast inference," arXiv preprint arXiv:2304.04947, 2023.
[28] T. Lei、J. Bai、S. Brahma、J. Ainslie、K. Lee、Y. Zhou、N. Du、V. Y. Zhao、Y. Wu、B. Li 等人,"Conditional adapters:参数高效转移学习与快速推理》,arXiv preprint arXiv:2304.04947, 2023.
[29] J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho, and I. Gurevych, "Adapterfusion: Non-destructive task composition for transfer learning," arXiv preprint arXiv:2005.00247, 2020.
[29] J. Pfeiffer、A. Kamath、A. Rücklé、K. Cho 和 I. Gurevych,"Adapterfusion:用于迁移学习的非破坏性任务组合",arXiv 预印本 arXiv:2005.00247, 2020。
[30] Y. Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, and J. Gao, "Adamix: Mixture-of-adapter for parameter-efficient tuning of large language models," arXiv preprint arXiv:2205.12410, vol. 1, no. 2, p. 4, 2022.
[30] Y. Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, and J. Gao, "Adamix:Mixture-of-adapter for parameter-efficient tuning of large language models," arXiv preprint arXiv:2205.12410,vol. 1,no. 2,p. 4,2022。
[31] H. Zhao, J. Fu, and Z. He, "Prototype-based hyperadapter for sampleefficient multi-task tuning," arXiv preprint arXiv:2310.11670, 2023.
[32] A. Chronopoulou, M. E. Peters, A. Fraser, and J. Dodge, "Adaptersoup: Weight averaging to improve generalization of pretrained language models," arXiv preprint arXiv:2302.07027, 2023.
[32] A. Chronopoulou、M. E. Peters、A. Fraser 和 J. Dodge,"Adaptersoup:Weight averaging to improve generalization of pretrained language models," arXiv preprint arXiv:2302.07027, 2023.
[33] S. He, R.-Z. Fan, L. Ding, L. Shen, T. Zhou, and D. Tao, "Mera: Merging pretrained adapters for few-shot learning," arXiv preprint arXiv:2308.15982, 2023.
[33] S. He, R.-Z.Fan、L. Ding、L. Shen、T. Zhou 和 D. Tao,"Mera:Merging pretrained adapters for few-shot learning," arXiv preprint arXiv:2308.15982, 2023.
[34] R. K. Mahabadi, S. Ruder, M. Dehghani, and J. Henderson, "Parameterefficient multi-task fine-tuning for transformers via shared hypernetworks," arXiv preprint arXiv:2106.04489, 2021.
[34] R. K. Mahabadi、S. Ruder、M. Dehghani 和 J. Henderson,《通过共享超网络对变压器进行参数有效的多任务微调》,arXiv 预印本 arXiv:2106.04489,2021 年。
[35] X. L. Li and P. Liang, "Prefix-tuning: Optimizing continuous prompts for generation," arXiv preprint arXiv:2101.00190, 2021.
[35] X. L. Li and P. Liang, "Prefix-tuning:Optimizing continuous prompts for generation," arXiv preprint arXiv:2101.00190, 2021.
[36] J. Li, W. Aitken, R. Bhambhoria, and X. Zhu, "Prefix propagation: Parameter-efficient tuning for long sequences," arXiv preprint arXiv:2305.12086, 2023.
[36] J. Li、W. Aitken、R. Bhambhoria 和 X. Zhu,"前缀传播:Parameter-efficient tuning for long sequences," arXiv preprint arXiv:2305.12086, 2023.
[37] X. Liu, K. Ji, Y. Fu, W. L. Tam, Z. Du, Z. Yang, and J. Tang, "P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks," arXiv preprint arXiv:2110.07602, 2021.
[37] X. Liu, K. Ji, Y. Fu, W. L. Tam, Z. Du, Z. Yang, and J. Tang, "P-tuning v2:P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks," arXiv preprint arXiv:2110.07602, 2021.
[38] Z.-R. Zhang, C. Tan, H. Xu, C. Wang, J. Huang, and S. Huang, "Towards adaptive prefix tuning for parameter-efficient language model fine-tuning," arXiv preprint arXiv:2305.15212, 2023.
[38] Z. -R.Zhang, C. Tan, H. Xu, C. Wang, J. Huang, and S. Huang, "Towards adaptive prefix tuning for parameter-efficient language model fine-tuning," arXiv preprint arXiv:2305.15212, 2023.
[39] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang, "Gpt understands, too," arXiv preprint arXiv:2103.10385, 2021.
[40] B. Lester, R. Al-Rfou, and N. Constant, "The power of scale for parameter-efficient prompt tuning," arXiv preprint arXiv:2104.08691, 2021.
[41] F. Ma, C. Zhang, L. Ren, J. Wang, Q. Wang, W. Wu, X. Quan, and D. Song, "Xprompt: Exploring the extreme of prompt tuning," arXiv preprint arXiv:2210.04457, 2022.
[41] F. Ma、C. Zhang、L. Ren、J. Wang、Q. Wang、W. Wu、X. Quan 和 D. Song,"Xprompt:Xprompt: Exploring the extreme of prompt tuning," arXiv preprint arXiv:2210.04457, 2022.
[42] Z. Wu, S. Wang, J. Gu, R. Hou, Y. Dong, V. Vydiswaran, and H. Ma, "Idpg: An instance-dependent prompt generation method," arXiv preprint arXiv:2204.04497, 2022.
[43] X. Liu, T. Sun, X. Huang, and X. Qiu, "Late prompt tuning: A late prompt could be better than many prompts," arXiv preprint arXiv:2210.11292, 2022.
[43] X. Liu, T. Sun, X. Huang, and X.Qiu, "Late prompt tuning:晚期提示可能比多次提示更好",arXiv 预印本 arXiv:2210.11292, 2022。
[44] W. Zhu and M. Tan, "Spt: Learning to selectively insert prompts for better prompt tuning," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 11862 11878.
[44] W. Zhu 和 M. Tan,"Spt:Learning to selectively insert prompts for better prompt tuning," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp.
[45] Q. Wang, Y. Mao, J. Wang, H. Yu, S. Nie, S. Wang, F. Feng, L. Huang, X. Quan, Z. Xu et al., "Aprompt: Attention prompt tuning for efficient adaptation of pre-trained language models," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 9147-9160.
[45] Q. Wang、Y. Mao、J. Wang、H. Yu、S. Nie、S. Wang、F. Feng、L. Huang、X. Quan、Z. Xu 等,"Aprompt:用于高效适应预训练语言模型的注意力提示调整",《2023 年自然语言处理实证方法大会论文集》,2023 年,第 9147-9160 页。
[46] T. Vu, B. Lester, N. Constant, R. A1-Rfou, and D. Cer, "Spot: Better frozen model adaptation through soft prompt transfer," arXiv preprint arXiv:2110.07904, 2021.
[46] T. Vu、B. Lester、N. Constant、R. A1-Rfou 和 D. Cer,"Spot:Better frozen model adaptation through soft prompt transfer," arXiv preprint arXiv:2110.07904, 2021.
[47] Y. Su, X. Wang, Y. Qin, C.-M. Chan, Y. Lin, H. Wang, K. Wen, Z. Liu, P. Li, J. Li et al., "On transferability of prompt tuning for natural language processing," arXiv preprint arXiv:2111.06719, 2021.
[47] Y. Su, X. Wang, Y. Qin, C.-M. Chan, Y. Lin, H. Wang, K. Wen, Z. Liu, P. Li, J. Li et al.Chan、Y. Lin、H. Wang、K. Wen、Z. Liu、P. Li、J. Li 等:"On transferability of prompt tuning for natural language processing," arXiv preprint arXiv:2111.06719, 2021.
[48] J. Wu, T. Yu, R. Wang, Z. Song, R. Zhang, H. Zhao, C. Lu, S. Li, and R. Henao, "Infoprompt: Information-theoretic soft prompt tuning for natural language understanding," arXiv preprint arXiv:2306.04933, 2023.
[48] J. Wu、T. Yu、R. Wang、Z. Song、R. Zhang、H. Zhao、C. Lu、S. Li 和 R. Henao,"Infoprompt:用于自然语言理解的信息论软提示调整",arXiv preprint arXiv:2306.04933, 2023.
[49] L. Chen, H. Huang, and M. Cheng, "Ptp: Boosting stability and performance of prompt tuning with perturbation-based regularizer," arXiv preprint arXiv:2305.02423, 2023.
[50] Y. Qin, X. Wang, Y. Su, Y. Lin, N. Ding, J. Yi, W. Chen, Z. Liu, J. Li, L. Hou et al., "Exploring universal intrinsic task subspace via prompt tuning," arXiv preprint arXiv:2110.07867, 2021.
[50] Y.Qin、X. Wang、Y. Su、Y. Lin、N. Ding、J. Yi、W. Chen、Z. Liu、J. Li、L. Hou 等:"Exploring universal intrinsic task subspace via prompt tuning," arXiv preprint arXiv:2110.07867, 2021.
[51] J.-Y. Choi, J. Kim, J.-H. Park, W.-L. Mok, and S. Lee, "Smop: Towards efficient and effective prompt tuning with sparse mixture-of-prompts," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 14306-14316.
[51] J.-Y. Choi, J. Kim, J.-H.Choi、J. Kim、J.-H.Park, W.-L. Mok, and S. Lee, "Smop:Towards efficient and effective prompt tuning with sparse mixture-of-prompts," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp.
[52] Z. Shi and A. Lipani, "Dept: Decomposed prompt tuning for parameterefficient fine-tuning," arXiv preprint arXiv:2309.05173, 2023.
[52] Z. Shi 和 A. Lipani,"Dept:Decomposed prompt tuning for parameterefficient fine-tuning," arXiv preprint arXiv:2309.05173, 2023.
[53] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. A. Raffel, "Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning," Advances in Neural Information Processing Systems, vol. 35, pp. 1950-1965, 2022.
[53] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. A. Raffel, "Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning," Advances in Neural Information Processing Systems, vol. 35, pp.

[54] T. Zadouri, A. Üstün, A. Ahmadian, B. Ermiş, A. Locatelli, and S. Hooker, "Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning," arXiv preprint arXiv:2309.05444, 2023.
[54] T. Zadouri、A. Üstün、A. Ahmadian、B. Ermiş、A. Locatelli 和 S. Hooker,《将专家混合物推向极限:用于指令调整的参数效率极高的 moe》,arXiv 预印本 arXiv:2309.05444, 2023。
[55] D. Lian, D. Zhou, J. Feng, and X. Wang, "Scaling & shifting your features: A new baseline for efficient model tuning," Advances in Neural Information Processing Systems, vol. 35, pp. 109-123, 2022.
[55] D. Lian, D. Zhou, J. Feng, and X. Wang, "Scaling & shifting your features:A new baseline for efficient model tuning," Advances in Neural Information Processing Systems, vol. 35, pp.
[56] X. Lu, F. Brahman, P. West, J. Jang, K. Chandu, A. Ravichander, L. Qin, P. Ammanabrolu, L. Jiang, S. Ramnath et al., "Inference-time policy adapters (ipa): Tailoring extreme-scale without fine-tuning," arXiv preprint arXiv:2305.15065, 2023.
[56] X. Lu、F. Brahman、P. West、J. Jang、K. Chandu、A. Ravichander、L. Qin、P. Ammanabrolu、L. Jiang、S. Ramnath 等人,"Inference-time policy adapters (ipa):Tailoring extreme-scale without fine-tuning," arXiv preprint arXiv:2305.15065, 2023.
[57] D. Guo, A. M. Rush, and Y. Kim, "Parameter-efficient transfer learning with diff pruning," arXiv preprint arXiv:2012.07463, 2020.
[58] N. Lawton, A. Kumar, G. Thattai, A. Galstyan, and G. V. Steeg, "Neural architecture search for parameter-efficient fine-tuning of large pre-trained language models," arXiv preprint arXiv:2305.16597, 2023.
[58] N. Lawton、A. Kumar、G. Thattai、A. Galstyan 和 G. V. Steeg,"用于对大型预训练语言模型进行参数高效微调的神经架构搜索",arXiv 预印本 arXiv:2305.16597, 2023。
[59] B. Liao, Y. Meng, and C. Monz, "Parameter-efficient fine-tuning without introducing new latency," arXiv preprint arXiv:2305.16742, 2023.
[60] Y.-L. Sung, V. Nair, and C. A. Raffel, "Training neural networks with fixed sparse masks," Advances in Neural Information Processing Systems, vol. 34, pp. 24 193-24 205, 2021.
[60] Y.-L. Sung、V. Nair 和 C. A. Raffel,《使用固定稀疏掩码训练神经网络》,《神经信息处理系统进展》,第 34 卷,第 24 193-24 205 页,2021 年。
[61] S. S. S. Das, R. H. Zhang, P. Shi, W. Yin, and R. Zhang, "Unified low-resource sequence labeling by sample-aware dynamic sparse finetuning," arXiv preprint arXiv:2311.03748, 2023.
[61] S. S. S. Das、R. H. Zhang、P. Shi、W. Yin 和 R. Zhang,《通过样本感知动态稀疏微调的统一低资源序列标注》,arXiv 预印本 arXiv:2311.03748, 2023。
[62] A. Ansell, E. M. Ponti, A. Korhonen, and I. Vulić, "Composable sparse fine-tuning for cross-lingual transfer," arXiv preprint arXiv:2110.07560, 2021
[62] A. Ansell, E. M. Ponti, A. Korhonen, and I. Vulić, "Composable sparse fine-tuning for crosslingual transfer," arXiv preprint arXiv:2110.07560, 2021
[63] Z. Fu, H. Yang, A. M.-C. So, W. Lam, L. Bing, and N. Collier, "On the effectiveness of parameter-efficient fine-tuning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. .
[63] Z. Fu、H. Yang、A. M. -C.So, W. Lam, L. Bing, and N. Collier, "On the effectiveness of parameter-efficient fine-tuning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. .
[64] R. Xu, F. Luo, Z. Zhang, C. Tan, B. Chang, S. Huang, and F. Huang, "Raise a child in large language model: Towards effective and generalizable fine-tuning," arXiv preprint arXiv:2109.05687, 2021.
[64] R. Xu、F. Luo、Z. Zhang、C. Tan、B. Chang、S. Huang 和 F. Huang,"Raise a child in large language model:Towards effective and generalizable fine-tuning," arXiv preprint arXiv:2109.05687, 2021.
[65] D. Vucetic, M. Tayaranian, M. Ziaeefard, J. J. Clark, B. H. Meyer, and W. J. Gross, "Efficient fine-tuning of bert models on the edge," in 2022 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2022, pp. 1838-1842.
[65] D. Vucetic, M. Tayaranian, M. Ziaeefard, J. J. Clark, B. H. Meyer, and W. J. Gross, "Efficient fine-tuning of bert models on the edge," in 2022 IEEE International Symposium on Circuits and Systems (ISCAS).IEEE, 2022, pp.
[66] E. B. Zaken, S. Ravfogel, and Y. Goldberg, "Bitfit: Simple parameterefficient fine-tuning for transformer-based masked language-models," arXiv preprint arXiv:2106.10199, 2021.
[66] E. B. Zaken、S. Ravfogel 和 Y. Goldberg,"Bitfit:基于变换器的掩码语言模型的简单参数微调",arXiv 预印本 arXiv:2106.10199, 2021。
[67] M. Gheini, X. Ren, and J. May, "Cross-attention is all you need: Adapting pretrained transformers for machine translation," arXiv preprint arXiv:2104.08771, 2021.
[67] M. Gheini, X. Ren, and J. May, "Cross-attention is all you need:Adapting pretrained transformers for machine translation," arXiv preprint arXiv:2104.08771, 2021.
[68] H. He, J. Cai, J. Zhang, D. Tao, and B. Zhuang, "Sensitivity-aware visual parameter-efficient fine-tuning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11825 11835.
[68] H. He, J. Cai, J. Zhang, D. Tao, and B. Zhuang, "Sensitivity-aware visual parameter-efficient fine-tuning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp.
[69] A. Aghajanyan, L. Zettlemoyer, and S. Gupta, "Intrinsic dimensionality explains the effectiveness of language model fine-tuning," arXiv preprint arXiv:2012.13255, 2020.
[70] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "Lora: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021
[70] E. J. Hu、Y. Shen、P. Wallis、Z. Allen-Zhu、Y. Li、S. Wang、L. Wang 和 W. Chen,"Lora:Lora: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021
[71] R. Karimi Mahabadi, J. Henderson, and S. Ruder, "Compacter: Efficient low-rank hypercomplex adapter layers," Advances in Neural Information Processing Systems, vol. 34, pp. 1022-1035, 2021.
[71] R. Karimi Mahabadi、J. Henderson 和 S. Ruder,"Compacter:Efficient low-rank hypercomplex adapter layers," Advances in Neural Information Processing Systems, vol. 34, pp.
[72] A. Edalati, M. Tahaei, I. Kobyzev, V. P. Nia, J. J. Clark, and M. Rezagholizadeh, "Krona: Parameter efficient tuning with kronecker adapter," arXiv preprint arXiv:2212.10650, 2022.
[72] A. Edalati、M. Tahaei、I. Kobyzev、V. P. Nia、J. J. Clark 和 M. Rezagholizadeh,"Krona:使用克朗克尔适配器进行参数高效调整》,arXiv 预印本 arXiv:2212.10650, 2022。
[73] X. He, C. Li, P. Zhang, J. Yang, and X. E. Wang, "Parameter-efficient model adaptation for vision transformers," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 817-825.
[73] X. He, C. Li, P. Zhang, J. Yang, and X. E. Wang, "Parameter-efficient model adaptation for vision transformers," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp.
[74] D. J. Kopiczko, T. Blankevoort, and Y. M. Asano, "Vera: Vector-based random matrix adaptation," arXiv preprint arXiv:2310.11454, 2023.
[74] D. J. Kopiczko、T. Blankevoort 和 Y. M. Asano,"Vera:基于向量的随机矩阵适应," arXiv 预印本 arXiv:2310.11454, 2023.
[75] S.-Y. Liu, C.-Y. Wang, H. Yin, P. Molchanov, Y.-C. F. Wang, K.T. Cheng, and M.-H. Chen, "Dora: Weight-decomposed low-rank adaptation," arXiv preprint arXiv:2402.09353, 2024.
[75] S.-Y. Liu, C.-Y.Liu, C.-Y. Wang, H. Yin, P. Molchanov, Y.-C.Wang, H. Yin, P. Molchanov, Y.-C. F. Wang, K.T. Cheng, and M.-H.F.Wang、K.T. Cheng 和 M.-H. Chen,"Dora:Dora.Chen, "Dora:Weight-decomposed low-rank adaptation," arXiv preprint arXiv:2402.09353, 2024.
[76] M. Valipour, M. Rezagholizadeh, I. Kobyzev, and A. Ghodsi, "Dylora: Parameter efficient tuning of pre-trained models using dynamic searchfree low-rank adaptation," arXiv preprint arXiv:2210.07558, 2022.
[76] M. Valipour, M. Rezagholizadeh, I. Kobyzev, and A. Ghodsi, "Dylora:Parameter efficient tuning of pre-trained models using dynamic searchfree low-rank adaptation," arXiv preprint arXiv:2210.07558, 2022.
[77] Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, W. Chen, and T. Zhao, "Adaptive budget allocation for parameter-efficient finetuning," arXiv preprint arXiv:2303.10512, 2023.
[78] N. Ding, X. Lv, Q. Wang, Y. Chen, B. Zhou, Z. Liu, and M. Sun, "Sparse low-rank adaptation of pre-trained language models," arXiv preprint arXiv:2311.11696, 2023.
[79] S. Haobo, H. Zhao, S. Majumder, and T. Lin, "Increasing model capacity for free: A simple strategy for parameter efficient fine-tuning," in The Twelfth International Conference on Learning Representations, 2023.
[79] S. Haobo、H. Zhao、S. Majumder 和 T. Lin,"免费增加模型容量:参数高效微调的简单策略",《第十二届学习表征国际会议》,2023 年。
[80] R. Zhang, R. Qiang, S. A. Somayajula, and P. Xie, "Autolora: Automatically tuning matrix ranks in low-rank adaptation based on meta learning," arXiv preprint arXiv:2403.09113, 2024.
[80] R. Zhang、R. Qiang、S. A. Somayajula 和 P. Xie,"Autolora:基于元学习在低阶适应中自动调整矩阵等级》,arXiv 预印本 arXiv:2403.09113, 2024。
[81] A. X. Yang, M. Robeyns, X. Wang, and L. Aitchison, "Bayesian low-rank adaptation for large language models," arXiv preprint
[81] A. X. Yang, M. Robeyns, X. Wang, and L. Aitchison, "Bayesian lowrank adaptation for large language models," arXiv preprint
[82] Y. Lin, X. Ma, X. Chu, Y. Jin, Z. Yang, Y. Wang, and H. Mei, "Lora dropout as a sparsity regularizer for overfitting control," arXiv preprint arXiv:2404.09610, 2024
[83] X. Meng, D. Dai, W. Luo, Z. Yang, S. Wu, X. Wang, P. Wang, Q. Dong, L. Chen, and Z. Sui, "Periodiclora: Breaking the low-rank bottleneck in lora optimization," arXiv preprint arXiv:2402.16141, 2024.
[83] X. Meng、D. Dai、W. Luo、Z. Yang、S. Wu、X. Wang、P. Wang、Q. Dong、L. Chen 和 Z. Sui,"Periodiclora:打破罗拉优化中的低阶瓶颈",arXiv 预印本 arXiv:2402.16141, 2024.
[84] S. Hayou, N. Ghosh, and B. Yu, "Lora+: Efficient low rank adaptation of large models," arXiv preprint arXiv:2402.12354, 2024.
[84] S. Hayou, N. Ghosh, and B. Yu, "Lora+:Efficient low rank adaptation of large models," arXiv preprint arXiv:2402.12354, 2024.
[85] C. Huang, Q. Liu, B. Y. Lin, T. Pang, C. Du, and M. Lin, "Lorahub: Efficient cross-task generalization via dynamic lora composition," arXiv preprint arXiv:2307.13269, 2023.
[85] C. Huang、Q. Liu、B. Y. Lin、T. Pang、C. Du 和 M. Lin,"Lorahub:Efficient cross-task generalization via dynamic lora composition," arXiv preprint arXiv:2307.13269, 2023.
[86] Q. Liu, X. Wu, X. Zhao, Y. Zhu, D. Xu, F. Tian, and Y. Zheng, "Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications," arXiv preprint arXiv:2310.18339, 2023.
[86] Q. Liu、X. Wu、X. Zhao、Y. Zhu、D. Xu、F. Tian 和 Y. Zheng,"Moelora:基于 moe 的多任务医疗应用参数高效微调方法》,arXiv 预印本 arXiv:2310.18339, 2023。
[87] W. Feng, C. Hao, Y. Zhang, Y. Han, and H. Wang, "Mixture-of-loras: An efficient multitask tuning for large language models," arXiv preprint arXiv:2403.03432, 2024
[87] W. Feng, C. Hao, Y. Zhang, Y. Han, and H. Wang, "Mixture-of-loras:大型语言模型的高效多任务调整",arXiv 预印本 arXiv:2403.03432, 2024
[88] X. Wu, S. Huang, and F. Wei, "Mixture of lora experts," arXiv preprint arXiv:2404.13628, 2024
[89] D. Li, Y. Ma, N. Wang, Z. Cheng, L. Duan, J. Zuo, C. Yang, and M. Tang, "Mixlora: Enhancing large language models fine-tuning with lora based mixture of experts," arXiv preprint arXiv:2404.15159, 2024.
[89] D. Li、Y. Ma、N. Wang、Z. Cheng、L. Duan、J. Zuo、C. Yang 和 M. Tang,"Mixlora:用基于 lora 的专家混合物增强大型语言模型微调》,arXiv 预印本 arXiv:2404.15159, 2024。
[90] Y. Mao, L. Mathias, R. Hou, A. Almahairi, H. Ma, J. Han, W.-t. Yih, and M. Khabsa, "Unipelt: A unified framework for parameter-efficient language model tuning," arXiv preprint arXiv:2110.07577, 2021.
[90] Y. Mao, L. Mathias, R. Hou, A. Almahairi, H. Ma, J. Han, W.-t.Yih, and M. Khabsa, "Unipelt:A unified framework for parameter-efficient language model tuning," arXiv preprint arXiv:2110.07577, 2021.
[91] J. Chen, A. Zhang, X. Shi, M. Li, A. Smola, and D. Yang, "Parameterefficient fine-tuning design spaces," arXiv preprint arXiv:2301.01821, 2023.
[92] Y. Zhang, K. Zhou, and Z. Liu, "Neural prompt search," 2022.
[93] H. Zhou, X. Wan, I. Vulić, and A. Korhonen, "Autopeft: Automatic configuration search for parameter-efficient fine-tuning," arXiv preprint arXiv:2301.12132, 2023.
[93] H. Zhou、X. Wan、I. Vulić 和 A. Korhonen,"Autopeft:参数高效微调的自动配置搜索",arXiv 预印本 arXiv:2301.12132, 2023.
[94] Z. Hu, Y. Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. Lee, L. Bing, and S. Poria, "Llm-adapters: An adapter family for parameter-efficient finetuning of large language models," arXiv preprint arXiv:2304.01933, 2023.
[94] Z. Hu, Y. Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W.Lim, R. K.-W.Lee, L. Bing, and S. Poria, "Llm-adapters:An adapter family for parameter-efficient finetuning of large language models," arXiv preprint arXiv:2304.01933, 2023.
[95] S. Hu, Z. Zhang, N. Ding, Y. Wang, Y. Wang, Z. Liu, and M. Sun, "Sparse structure search for parameter-efficient tuning," arXiv preprint arXiv:2206.07382, 2022
[96] A. Petrov, P. H. Torr, and A. Bibi, "When do prompting and prefixtuning work? a theory of capabilities and limitations," arXiv preprint arXiv:2310.19698, 2023.
[97] Y. Wang, J. Chauhan, W. Wang, and C.-J. Hsieh, "Universality and limitations of prompt tuning," arXiv preprint arXiv:2305.18787, 2023.
[98] Y. Choi and J.-H. Lee, "Codeprompt: Task-agnostic prefix tuning for program and language generation," in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 5282-5297.
[98] Y. Choi 和 J. -H.Lee, "Codeprompt:用于程序和语言生成的任务区分前缀调整",《计算语言学协会论文集》,ACL 2023,2023 年,第 5282-5297 页:ACL 2023, 2023, pp.
[99] H. Wu and X. Shi, "Adversarial soft prompt tuning for cross-domain sentiment analysis," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 2438-2447.
[99] H. Wu and X. Shi, "Adversarial soft prompt tuning for cross-domain sentiment analysis," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp.
[100] J. Frankle and M. Carbin, "The lottery ticket hypothesis: Finding sparse, trainable neural networks," arXiv preprint arXiv:1803.03635, 2018.
[100] J. Frankle 和 M. Carbin,"彩票假设:Finding sparse, trainable neural networks," arXiv preprint arXiv:1803.03635, 2018.
[101] E. Malach, G. Yehudai, S. Shalev-Schwartz, and O. Shamir, "Proving the lottery ticket hypothesis: Pruning is all you need," in International Conference on Machine Learning. PMLR, 2020, pp. 6682-6691
[101] E. Malach、G. Yehudai、S. Shalev-Schwartz 和 O. Shamir,"证明彩票假设:Pruning is all you need," in International Conference on Machine Learning.PMLR, 2020, pp.
[102] V. Fomenko, H. Yu, J. Lee, S. Hsieh, and W. Chen, "A note on lora," arXiv preprint arXiv:2404.05086, 2024.
[102] V. Fomenko、H. Yu、J. Lee、S. Hsieh 和 W. Chen,"A note on lora",arXiv preprint arXiv:2404.05086, 2024。
[103] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM journal on imaging sciences, vol. 2, no. 1, pp. 183-202, 2009.
[103] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM journal on imaging sciences, vol. 2, no. 1, pp.

[104] A. Chambolle, R. A. De Vore, N.-Y. Lee, and B. J. Lucier, "Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage," IEEE Transactions on image processing, vol. 7, no. 3, pp. 319-335, 1998
[104] A. Chambolle、R. A. De Vore、N. -Y.Lee, and B. J. Lucier, "Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage," IEEE Transactions on image processing, vol. 7, no.3, pp.
[105] D. J. MacKay, "A practical bayesian framework for backpropagation networks," Neural computation, vol. 4, no. 3, pp. 448-472, 1992.
[105] D. J. MacKay, "A practical bayesian framework for backpropagation networks," Neural computation, vol. 4, no.3, pp.
[106] J. Antorán, D. Janz, J. U. Allingham, E. Daxberger, R. R. Barbano, E. Nalisnick, and J. M. Hernández-Lobato, "Adapting the linearised laplace model evidence for modern deep learning," in International Conference on Machine Learning. PMLR, 2022, pp. 796-821.
[106] J. Antorán, D. Janz, J. U. Allingham, E. Daxberger, R. R. Barbano, E. Nalisnick, and J. M. Hernández-Lobato, "Adapting the linearised laplace model evidence for modern deep learning," in International Conference on Machine Learning.PMLR,2022 年,第 796-821 页。
[107] J. Liu, A. Moreau, M. Preuss, J. Rapin, B. Roziere, F. Teytaud, and O. Teytaud, "Versatile black-box optimization," in Proceedings of the 2020 Genetic and Evolutionary Computation Conference, 2020, pp. .
[107] J. Liu, A. Moreau, M. Preuss, J. Rapin, B. Roziere, F. Teytaud, and O. Teytaud, "Versatile black-box optimization," in Proceedings of the 2020 Genetic and Evolutionary Computation Conference, 2020, pp. .
[108] M. Chen, H. Peng, J. Fu, and H. Ling, "Autoformer: Searching transformers for visual recognition," in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12270-12280.
[108] M. Chen、H. Peng、J. Fu 和 H. Ling,"Autoformer:搜索视觉识别变换器",《IEEE/CVF 计算机视觉国际会议论文集》,2021 年,第 12270-12280 页。
[109] P. I. Frazier, "A tutorial on bayesian optimization," arXiv preprint arXiv:1807.02811, 2018.
[109] P. I. Frazier,"贝叶斯优化教程",arXiv preprint arXiv:1807.02811,2018.
[110] A. Rücklé, G. Geigle, M. Glockner, T. Beck, J. Pfeiffer, N. Reimers, and I. Gurevych, "Adapterdrop: On the efficiency of adapters in transformers," arXiv preprint arXiv:2010.11918, 2020.
[110] A. Rücklé、G. Geigle、M. Glockner、T. Beck、J. Pfeiffer、N. Reimers 和 I. Gurevych,"Adapterdrop:On the efficiency of adapters in transformers," arXiv preprint arXiv:2010.11918, 2020.
[111] S. He, L. Ding, D. Dong, J. Zhang, and D. Tao, "SparseAdapter: An easy approach for improving the parameter-efficiency of adapters," in Findings of the Association for Computational Linguistics: EMNLP 2022. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 2184-2190. [Online]. Available: https://aclanthology.org/2022.findings-emnlp.160
[111] S. He、L. Ding、D. Dong、J. Zhang 和 D. Tao,"SparseAdapter:提高适配器参数效率的简便方法",计算语言学协会论文集:EMNLP 2022。阿拉伯联合酋长国阿布扎比:阿拉伯联合酋长国阿布扎比:计算语言学协会,2022 年 12 月,第 2184-2190 页。[Online].Available: https://aclanthology.org/2022.findings-emnlp.160
[112] L. Hedegaard, A. Alok, J. Jose, and A. Iosifidis, "Structured pruning adapters," arXiv preprint arXiv:2211.10155, 2022.
[113] M. Zhang, C. Shen, Z. Yang, L. Ou, X. Yu, B. Zhuang et al., "Pruning meets low-rank parameter-efficient fine-tuning," arXiv preprint arXiv:2305.18403, 2023.
[113] M. Zhang、C. Shen、Z. Yang、L. Ou、X. Yu、B. Zhuang 等,"Pruning meets lowrank parameter-efficient fine-tuning,"arXiv preprint arXiv:2305.18403,2023.
[114] G. Zeng, P. Zhang, and W. Lu, "One network, many masks: Towards more parameter-efficient transfer learning," arXiv preprint arXiv:2305.17682, 2023.
[114] G. Zeng, P. Zhang, and W. Lu, "One network, many masks:Towards more parameter-efficient transfer learning," arXiv preprint arXiv:2305.17682, 2023.
[115] S. Jie, H. Wang, and Z.-H. Deng, "Revisiting the parameter efficiency of adapters from the perspective of precision redundancy," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. .
[115] S. Jie、H. Wang 和 Z.-H.Deng, "Revisiting the parameter efficiency of adapters from the perspective of precision redundancy," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. .
[116] J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, and D. Lee, "Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization," arXiv preprint arXiv:2305.14152, 2023.
[116] J. Kim、J. H. Lee、S. Kim、J. Park、K. M. Yoo、S. J. Kwon 和 D. Lee,"通过亚 4 位整数量化实现压缩大型语言模型的内存效率微调",arXiv 预印本 arXiv:2305.14152,2023。
[117] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "Qlora: Efficient finetuning of quantized 1lms," arXiv preprint arXiv:2305.14314, 2023 .
[117] T. Dettmers、A. Pagnoni、A. Holtzman 和 L. Zettlemoyer, "Qlora:Efficient finetuning of quantized 1lms," arXiv preprint arXiv:2305.14314, 2023 .
[118] Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis, W. Chen, and T. Zhao, "Loftq: Lora-fine-tuning-aware quantization for large language models," arXiv preprint arXiv:2310.08659, 2023.
[119] H. Guo, P. Greengard, E. P. Xing, and Y. Kim, "Lq-lora: Low-rank plus quantized matrix decomposition for efficient language model finetuning," arXiv preprint arXiv:2311.12023, 2023.
[119] H. Guo、P. Greengard、E. P. Xing 和 Y. Kim,"Lq-lora:Low-rank plus quantized matrix decomposition for efficient language model finetuning," arXiv preprint arXiv:2311.12023, 2023.
[120] Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, H. Zhang, Z. Chen, X. Zhang, and Q. Tian, "Qa-lora: Quantization-aware low-rank adaptation of large language models," arXiv preprint arXiv:2309.14717, 2023.
[120] Y. Xu、L. Xie、X. Gu、X. Chen、H. Chang、H. Zhang、Z. Chen、X. Zhang 和 Q. Tian,"Qa-lora:Quantization-aware lowrank adaptation of large language models," arXiv preprint arXiv:2309.14717, 2023.
[121] Y. Chai, J. Gkountouras, G. G. Ko, D. Brooks, and G.-Y. Wei, "Int2. 1: Towards fine-tunable quantized large language models with error correction through low-rank adaptation," arXiv preprint arXiv:2306.08162, 2023
[121] Y. Chai、J. Gkountouras、G. G. Ko、D. Brooks 和 G.-Y. Wei,"Int2.Wei,"Int2.1: Towards fine-tunable quantized large language models with error correction through lowrank adaptation," arXiv preprint arXiv:2306.08162, 2023
[122] H. Rajabzadeh, M. Valipour, T. Zhu, M. Tahaei, H. J. Kwon, A. Ghodsi, B. Chen, and M. Rezagholizadeh, "Qdylora: Quantized dynamic lowrank adaptation for efficient large language model tuning," arXiv preprint arXiv:2402.10462, 2024.
[122] H. Rajabzadeh、M. Valipour、T. Zhu、M. Tahaei、H. J. Kwon、A. Ghodsi、B. Chen 和 M. Rezagholizadeh,"Qdylora:Quantized dynamic lowrank adaptation for efficient large language model tuning," arXiv preprint arXiv:2402.10462, 2024.
[123] J. Liu, G. Xiao, K. Li, J. D. Lee, S. Han, T. Dao, and T. Cai, "Bitdelta: Your fine-tune may only be worth one bit," arXiv preprint arXiv:2402.10193, 2024.
[124] J. O. Zhang, A. Sax, A. Zamir, L. Guibas, and J. Malik, "Sidetuning: a baseline for network adaptation via additive side networks," in Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part III 16. Springer, 2020, pp.
[124] J. O. Zhang, A. Sax, A. Zamir, L. Guibas, and J. Malik, "Sidetuning: a baseline for network adaptation via additive side networks," in Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part III 16.Springer, 2020, pp.
[125] Y.-L. Sung, J. Cho, and M. Bansal, "Lst: Ladder side-tuning for parameter and memory efficient transfer learning," Advances in Neural Information Processing Systems, vol. 35, pp. 12991-13005, 2022.
[125] Y.-L. Sung, J. Cho, and M. Bansal, "Lst:Ladder side-tuning for parameter and memory efficient transfer learning," Advances in Neural Information Processing Systems, vol. 35, pp.
[126] Z. Jiang, C. Mao, Z. Huang, A. Ma, Y. Lv, Y. Shen, D. Zhao, and J. Zhou, "Res-tuning: A flexible and efficient tuning paradigm via unbinding tuner from backbone," arXiv preprint arXiv:2310.19859, 2023.
[127] B. Liao, S. Tan, and C. Monz, "Make your pre-trained model reversible: From parameter to memory efficient fine-tuning," arXiv preprint arXiv:2306.00477, 2023.
[128] L. Zhang, L. Zhang, S. Shi, X. Chu, and B. Li, "Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning," arXiv preprint arXiv:2308.03303, 2023.
[129] J. Phang, Y. Mao, P. He, and W. Chen, "Hypertuning: Toward adapting large language models without back-propagation," in International Conference on Machine Learning. PMLR, 2023, pp. 27 854-27 875.
[130] F. Jin, J. Zhang, and C. Zong, "Parameter-efficient tuning for large language model without calculating its gradients," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 321-330.
[131] S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora, "Fine-tuning language models with just forward passes," arXiv preprint arXiv:2305.17333, 2023.
[132] J. Zhao, Z. Zhang, B. Chen, Z. Wang, A. Anandkumar, and Y. Tian, "Galore: Memory-efficient llm training by gradient low-rank projection," arXiv preprint arXiv:2403.03507, 2024.
[133] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, "Efficient memory management for large language model serving with pagedattention," in Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611-626.
[134] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang, "Flexgen: High-throughput generative inference of large language models with a single gpu," in International Conference on Machine Learning. PMLR, 2023, pp. 31 094-31 116.
[135] T. Zhou and D. Tao, "Godec: Randomized low-rank & sparse matrix decomposition in noisy case," in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, 2011.
[136] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, "Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization," Advances in neural information processing systems, vol. 22, 2009.
[137] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse, "The reversible residual network: Backpropagation without storing activations," Advances in neural information processing systems, vol. 30, 2017.
[138] Y. Huang, Y. Li, Y. Xu, L. Zhang, R. Gan, J. Zhang, and L. Wang, "Mvp-tuning: Multi-view knowledge retrieval with prompt tuning for commonsense reasoning," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 13 417-13 432.
[139] Z. Zhao, L. Hu, H. Zhao, Y. Shao, and Y. Wang, "Knowledgeable parameter efficient tuning network for commonsense question answering," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 9051-9063.
[140] H. Zhao, R. He, M. Xiao, and J. Xu, "Infusing hierarchical guidance into prompt tuning: A parameter-efficient framework for multi-level implicit discourse relation recognition," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 6477-6492.
[141] Y. Ouyang, Y. Cao, Y. Gao, Z. Wu, J. Zhang, and X. Dai, "On prefix-tuning for lightweight out-of-distribution detection," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 1533-1545.
[142] M. S. Ozdayi, C. Peris, J. Fitzgerald, C. Dupuy, J. Majmudar, H. Khan, R. Parikh, and R. Gupta, "Controlling the extraction of memorized data from large language models via prompt-tuning," arXiv preprint arXiv:2305.11759, 2023.
[143] G. Xiao, J. Lin, and S. Han, "Offsite-tuning: Transfer learning without full model," arXiv preprint arXiv:2302.04870, 2023.
[144] T. Che, J. Liu, Y. Zhou, J. Ren, J. Zhou, V. S. Sheng, H. Dai, and D. Dou, "Federated learning of large language models with parameter-efficient prompt tuning and adaptive optimization," arXiv preprint arXiv:2310.15080, 2023.
[145] Y. Li, M. Du, X. Wang, and Y. Wang, "Prompt tuning pushes farther, contrastive learning pulls closer: A two-stage approach to mitigate social biases," arXiv preprint arXiv:2307.01595, 2023.
[146] J. Cho, J. Lei, H. Tan, and M. Bansal, "Unifying vision-and-language tasks via text generation," in International Conference on Machine Learning. PMLR, 2021, pp. 1931-1942.
[147] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, "Minigpt-4: Enhancing vision-language understanding with advanced large language models," arXiv preprint arXiv:2304.10592, 2023.
[148] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," arXiv preprint arXiv:2304.08485, 2023.
[149] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical sequence training for image captioning," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
[150] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4651-4659.
[151] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: Lessons learned from the 2015 mscoco image captioning challenge," IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 4, pp. 652-663, 2016.
[152] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, "A comprehensive survey of deep learning for image captioning," ACM Computing Surveys (CsUR), vol. 51, no. 6, pp. 1-36, 2019.
[153] P. Wang, Q. Wu, C. Shen, A. Dick, and A. Van Den Hengel, "Fvqa: Fact-based visual question answering," IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 10, pp. 2413-2427, 2017.
[154] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. Van Den Hengel, "Visual question answering: A survey of methods and datasets," Computer Vision and Image Understanding, vol. 163, pp. 21-40, 2017.
[155] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, "Vqa: Visual question answering," in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425-2433.
[156] Y.-L. Sung, J. Cho, and M. Bansal, "Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5227-5237.
[157] Z.-Y. Hu, Y. Li, M. R. Lyu, and L. Wang, "Vl-pet: Vision-and-language parameter-efficient tuning via granularity control," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3010-3020.
[158] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, and Y. Qiao, "Llama-adapter: Efficient fine-tuning of language models with zero-init attention," arXiv preprint arXiv:2303.16199, 2023.
[159] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu, C. He, X. Yue et al., "Llama-adapter v2: Parameter-efficient visual instruction model," arXiv preprint arXiv:2304.15010, 2023.
[160] B. Zhao, H. Tu, C. Wei, J. Mei, and C. Xie, "Tuning layernorm in attention: Towards efficient multi-modal llm finetuning," arXiv preprint arXiv:2312.11420, 2023.
[161] S. Lee, "Toward continual learning for conversational agents," arXiv preprint arXiv:1712.09943, 2017.
[162] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan, "A survey of web information extraction systems," IEEE transactions on knowledge and data engineering, vol. 18, no. 10, pp. 1411-1428, 2006.
[163] W. Yang, Y. Xie, A. Lin, X. Li, L. Tan, K. Xiong, M. Li, and J. Lin, "End-to-end open-domain question answering with bertserini," arXiv preprint arXiv:1902.01718, 2019.
[164] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the national academy of sciences, vol. 114, no. 13, pp. .
[165] A. Madotto, Z. Lin, Z. Zhou, S. Moon, P. Crook, B. Liu, Z. Yu, E. Cho, and Z. Wang, "Continual learning in task-oriented dialogue systems," arXiv preprint arXiv:2012.15504, 2020.
[166] Q. Zhu, B. Li, F. Mi, X. Zhu, and M. Huang, "Continual prompt tuning for dialog state tracking," arXiv preprint arXiv:2203.06654, 2022.
[167] Y. Dai, H. Lang, Y. Zheng, F. Huang, L. Si, and Y. Li, "Lifelong learning for question answering with hierarchical prompts," arXiv preprint arXiv:2208.14602, 2022.
[168] Z. Liang, F. Wei, Y. Jie, Y. Qian, Z. Hao, and B. Han, "Prompts can play lottery tickets well: Achieving lifelong information extraction via lottery prompt tuning," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 277-292.
[169] X. Wang, T. Chen, Q. Ge, H. Xia, R. Bao, R. Zheng, Q. Zhang, T. Gui, and X. Huang, "Orthogonal subspace learning for language model continual learning," arXiv preprint arXiv:2310.14152, 2023.
[170] S. Chen, S. Wong, L. Chen, and Y. Tian, "Extending context window of large language models via positional interpolation," arXiv preprint arXiv:2306.15595, 2023.
[171] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia, "Longlora: Efficient fine-tuning of long-context large language models," arXiv preprint arXiv:2309.12307, 2023.
[172] J. Yang, "Longqlora: Efficient and effective method to extend context length of large language models," arXiv preprint arXiv:2311.04879, 2023.
[173] S. Tan, X. Li, S. Patil, Z. Wu, T. Zhang, K. Keutzer, J. E. Gonzalez, and R. A. Popa, "Lloco: Learning long contexts offline," arXiv preprint arXiv:2404.07979, 2024.
[174] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett et al., "H2o: Heavy-hitter oracle for efficient generative inference of large language models," Advances in Neural Information Processing Systems, vol. 36, 2024.
[175] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, "Gpt3.int8(): 8-bit matrix multiplication for transformers at scale," Advances in Neural Information Processing Systems, vol. 35, pp. 30 318-30 332, 2022.
[176] H. Kang, Q. Zhang, S. Kundu, G. Jeong, Z. Liu, T. Krishna, and T. Zhao, "Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm," arXiv preprint arXiv:2403.05527, 2024.
[177] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[178] A. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, and L. Beyer, "How to train your vit? data, augmentation, and regularization in vision transformers," arXiv preprint arXiv:2106.10270, 2021.
[179] X. Chen, S. Xie, and K. He, "An empirical study of training self-supervised vision transformers," in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9640-9649.
[180] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. .
[181] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin et al., "Scaling vision transformers to 22 billion parameters," in International Conference on Machine Learning. PMLR, 2023, pp. 7480-7512.
[182] Z. Chen, Y. Duan, W. Wang, J. He, T. Lu, J. Dai, and Y. Qiao, "Vision transformer adapter for dense predictions," arXiv preprint arXiv:2205.08534, 2022.
[183] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister, "Learning to prompt for continual learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 139-149.
[184] Q. Gao, C. Zhao, Y. Sun, T. Xi, G. Zhang, B. Ghanem, and J. Zhang, "A unified continual learning framework with general parameter-efficient tuning," arXiv preprint arXiv:2303.10070, 2023.
[185] L. Ren, C. Chen, L. Wang, and K. Hua, "Learning semantic proxies from visual prompts for parameter-efficient fine-tuning in deep metric learning," arXiv preprint arXiv:2402.02340, 2024.
[186] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, "Visual prompt tuning," in European Conference on Computer Vision. Springer, 2022, pp. 709-727.
[187] S. Chen, C. Ge, Z. Tong, J. Wang, Y. Song, J. Wang, and P. Luo, "Adaptformer: Adapting vision transformers for scalable visual recognition," Advances in Neural Information Processing Systems, vol. 35, pp. .
[188] S. Jie and Z.-H. Deng, "Convolutional bypasses are better vision transformer adapters," arXiv preprint arXiv:2207.07039, 2022.
[189] S. Yoo, E. Kim, D. Jung, J. Lee, and S. Yoon, "Improving visual prompt tuning for self-supervised vision transformers," arXiv preprint arXiv:2306.05067, 2023.
[190] J. Pan, Z. Lin, X. Zhu, J. Shao, and H. Li, "St-adapter: Parameter-efficient image-to-video transfer learning," Advances in Neural Information Processing Systems, vol. 35, pp. 26 462-26 477, 2022.
[191] T. Yang, Y. Zhu, Y. Xie, A. Zhang, C. Chen, and M. Li, "Aim: Adapting image models for efficient video action recognition," arXiv preprint arXiv:2302.03024, 2023.
[192] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International conference on machine learning. PMLR, 2021, pp. 8748-8763.
[193] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, "Scaling up visual and vision-language representation learning with noisy text supervision," in International conference on machine learning. PMLR, 2021, pp. 4904-4916.
[194] Y. Li, F. Liang, L. Zhao, Y. Cui, W. Ouyang, J. Shao, F. Yu, and J. Yan, "Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm," arXiv preprint arXiv:2110.05208, 2021.
[195] A. Singh, R. Hu, V. Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, "Flava: A foundational language and vision alignment model," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. .
[196] M. Xu, Z. Zhang, F. Wei, H. Hu, and X. Bai, "Side adapter network for open-vocabulary semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2945-2954.
[197] Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, "Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip," arXiv preprint arXiv:2308.02487, 2023.
[198] Z. Xu, Z. Chen, Y. Zhang, Y. Song, X. Wan, and G. Li, "Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17 503-17 512.
[199] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li, "Pointclip: Point cloud understanding by clip," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8552-8562.
[200] X. Zhu, R. Zhang, B. He, Z. Guo, Z. Zeng, Z. Qin, S. Zhang, and P. Gao, "Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2639-2650.
[201] Z. Wang, X. Yu, Y. Rao, J. Zhou, and J. Lu, "P2p: Tuning pre-trained image models for point cloud analysis with point-to-pixel prompting," Advances in neural information processing systems, vol. 35, pp. .
[202] T. Huang, B. Dong, Y. Yang, X. Huang, R. W. Lau, W. Ouyang, and W. Zuo, "Clip2point: Transfer clip to point cloud classification with image-depth pre-training," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 157-22 167.
[203] C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie, "Prompting visual-language models for efficient video understanding," in European Conference on Computer Vision. Springer, 2022, pp. 105-124.
[204] B. Ni, H. Peng, M. Chen, S. Zhang, G. Meng, J. Fu, S. Xiang, and H. Ling, "Expanding language-image pretrained models for general video recognition," in European Conference on Computer Vision. Springer, 2022, pp. 1-18.
[205] Z. Lin, S. Geng, R. Zhang, P. Gao, G. de Melo, X. Wang, J. Dai, Y. Qiao, and H. Li, "Frozen clip models are efficient video learners," in European Conference on Computer Vision. Springer, 2022, pp. 388-404.
[206] Z. Han, F. Zhu, Q. Lao, and H. Jiang, "Zero-shot referring expression comprehension via structural similarity between images and captions," arXiv preprint arXiv:2311.17048, 2023.
[207] S. Doveh, A. Arbelle, S. Harary, E. Schwartz, R. Herzig, R. Giryes, R. Feris, R. Panda, S. Ullman, and L. Karlinsky, "Teaching structured vision & language concepts to vision & language models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2657-2668.
[208] S. Nag, X. Zhu, Y.-Z. Song, and T. Xiang, "Zero-shot temporal action detection via vision-language prompting," in European Conference on Computer Vision. Springer, 2022, pp. 681-697.
[209] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, "Learning to prompt for vision-language models," International Journal of Computer Vision, vol. 130, no. 9, pp. 2337-2348, 2022.
[210] ——, "Conditional prompt learning for vision-language models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16816-16825.
[211] B. Zhu, Y. Niu, Y. Han, Y. Wu, and H. Zhang, "Prompt-aligned gradient for prompt tuning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 659-15 669.
[212] M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, "Maple: Multi-modal prompt learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19122
[213] M. Shu, W. Nie, D.-A. Huang, Z. Yu, T. Goldstein, A. Anandkumar, and C. Xiao, "Test-time prompt tuning for zero-shot generalization in vision-language models," Advances in Neural Information Processing Systems, vol. 35, pp. 14274-14289, 2022.
[214] C.-M. Feng, K. Yu, Y. Liu, S. Khan, and W. Zuo, "Diverse data augmentation with diffusions for effective test-time prompt tuning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2704-2714.
[215] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao, "Clip-adapter: Better vision-language models with feature adapters," International Journal of Computer Vision, pp. 1-15, 2023.
[216] R. Zhang, R. Fang, W. Zhang, P. Gao, K. Li, J. Dai, Y. Qiao, and H. Li, "Tip-adapter: Training-free clip-adapter for better vision-language modeling," arXiv preprint arXiv:2111.03930, 2021.
[217] E. Orhan, "A simple cache model for image recognition," Advances in Neural Information Processing Systems, vol. 31, 2018.
[218] E. Grave, M. M. Cisse, and A. Joulin, "Unbounded cache model for online language modeling with open vocabulary," Advances in neural information processing systems, vol. 30, 2017.
[219] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in neural information processing systems, vol. 33, pp. 6840-6851, 2020.
[220] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in International conference on machine learning. PMLR, 2015, pp. .
[221] Z. Han, Y. Wang, L. Zhou, P. Wang, B. Yan, J. Zhou, Y. Wang, and D. Shen, "Contrastive diffusion model with auxiliary guidance for coarse-to-fine pet reconstruction," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 239-249.
[222] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, "Diffusion models: A comprehensive survey of methods and applications," ACM Computing Surveys, vol. 56, no. 4, pp. .
[223] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, "Diffusion models in vision: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[224] P. Dhariwal and A. Nichol, "Diffusion models beat gans on image synthesis," Advances in neural information processing systems, vol. 34, pp. 8780-8794, 2021.
[225] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, "Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 500-22 510.
[226] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. .
[227] S. Luo, Y. Tan, S. Patil, D. Gu, P. von Platen, A. Passos, L. Huang, J. Li, and H. Zhao, "Lcm-lora: A universal stable-diffusion acceleration module," arXiv preprint arXiv:2311.05556, 2023.
[228] W. Chai, D. Zheng, J. Cao, Z. Chen, C. Wang, and C. Ma, "Speedupnet: A plug-and-play hyper-network for accelerating text-to-image diffusion models," arXiv preprint arXiv:2312.08887, 2023.
[229] J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, "Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. .
[230] Z. Xing, Q. Dai, H. Hu, Z. Wu, and Y.-G. Jiang, "Simda: Simple diffusion adapter for efficient video generation," arXiv preprint arXiv:2308.09710, 2023.
[231] B. Zeng, S. Li, Y. Feng, H. Li, S. Gao, J. Liu, H. Li, X. Tang, J. Liu, and B. Zhang, "Ipdreamer: Appearance-controllable 3d object generation with image prompts," arXiv preprint arXiv:2310.05375, 2023.
[232] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., "Flamingo: a visual language model for few-shot learning," Advances in Neural Information Processing Systems, vol. 35, pp. 23716-23736, 2022.
[233] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836-3847.
[234] R. Gandikota, J. Materzynska, T. Zhou, A. Torralba, and D. Bau, "Concept sliders: Lora adaptors for precise control in diffusion models," arXiv preprint arXiv:2311.12092, 2023.
[235] C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, Y. Shan, and X. Qie, "T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models," arXiv preprint arXiv:2302.08453, 2023.
[236] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, "An image is worth one word: Personalizing text-to-image generation using textual inversion," arXiv preprint arXiv:2208.01618, 2022.
[237] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y. Zhu, "Multiconcept customization of text-to-image diffusion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1931-1941.
[238] H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, "Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models," arXiv preprint arXiv:2308.06721, 2023.
[239] OpenAI, "Gpt-4," in https://openai.com/gpt-4, 2023.
[240] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., "Gemini: a family of highly capable multimodal models," arXiv preprint arXiv:2312.11805, 2023.
[241] C. Gao and S. Q. Zhang, "Dlora: Distributed parameter-efficient fine-tuning solution for large language model," arXiv preprint arXiv:2404.05182, 2024.
[242] G. Xiao, J. Lin, and S. Han, "Offsite-tuning: Transfer learning without full model," arXiv preprint arXiv:2302.04870, 2023.
[243] Z. Zhou, X. Wei, J. Zhang, and G. Sun, "PetS: A unified framework for Parameter-Efficient transformers serving," in 2022 USENIX Annual Technical Conference (USENIX ATC 22), 2022, pp. 489-504.
[244] Y. Sheng, S. Cao, D. Li, C. Hooper, N. Lee, S. Yang, C. Chou, B. Zhu, L. Zheng, K. Keutzer et al., "S-lora: Serving thousands of concurrent lora adapters," arXiv preprint arXiv:2311.03285, 2023.
[245] L. Chen, Z. Ye, Y. Wu, D. Zhuo, L. Ceze, and A. Krishnamurthy, "Punica: Multi-tenant lora serving," arXiv preprint arXiv:2310.18547, 2023.
[246] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, and B. Bossan, "Peft: State-of-the-art parameter-efficient fine-tuning methods," https://github.com/huggingface/peft, 2022.
[247] C. Poth, H. Sterz, I. Paul, S. Purkayastha, L. Engländer, T. Imhof, I. Vulić, S. Ruder, I. Gurevych, and J. Pfeiffer, "Adapters: A unified library for parameter-efficient and modular transfer learning," 2023.
[248] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, "MMDetection: Open mmlab detection toolbox and benchmark," arXiv preprint arXiv:1906.07155, 2019.
[249] S. Q. Zhang, T. Tambe, N. Cuevas, G.-Y. Wei, and D. Brooks, "Camel: Co-designing ai models and embedded drams for efficient on-device learning," arXiv preprint arXiv:2305.03148, 2023.
[250] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh, "Video generation models as world simulators," 2024. [Online]. Available: https://openai.com/research/video-generation-models-as-world-simulators
[251] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," arXiv preprint arXiv:2312.00752, 2023.
[252] Y. Bai, X. Geng, K. Mangalam, A. Bar, A. Yuille, T. Darrell, J. Malik, and A. A. Efros, "Sequential modeling enables scalable learning for large vision models," arXiv preprint arXiv:2312.00785, 2023.
[253] A. Dosovitskiy and T. Brox, "Inverting visual representations with convolutional networks," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4829-4837.
[254] Z. He, T. Zhang, and R. B. Lee, "Model inversion attacks against collaborative inference," in Proceedings of the 35th Annual Computer Security Applications Conference, 2019, pp. 148-162.