
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

Zeyu Han, Chao Gao, Jinyang Liu, Jeff (Jun) Zhang, and Sai Qian Zhang
Northeastern University; University of California, Riverside; Arizona State University; New York University
{han.zeyu,liu.jinyan}@northeastern.edu, cgao037@ucr.edu, jeffzhang@asu.edu, sai.zhang@nyu.edu

Abstract

Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs. These models, often consisting of billions of parameters, require vast amounts of computational resources for execution. In particular, the expansive scale and computational demands pose considerable challenges when customizing them for specific downstream tasks, especially on hardware platforms constrained by computational capabilities.

Parameter-Efficient Fine-Tuning (PEFT) provides a practical solution by efficiently adapting large models to various downstream tasks. In particular, PEFT refers to the process of adjusting the parameters of a pre-trained large model to adapt it to a specific task or domain while minimizing the number of additional parameters introduced or the computational resources required. This approach is particularly important when dealing with large-scale language models with high parameter counts, as fine-tuning these models from scratch can be computationally expensive and resource-intensive, posing considerable challenges in the design of the supporting system platform.

In this survey, we present comprehensive studies of various PEFT algorithms, examining their performance and computational overhead. Moreover, we provide an overview of applications developed using different PEFT algorithms and discuss common techniques employed to mitigate computation costs for PEFT. In addition to the algorithmic perspective, we overview various real-world system designs to investigate the implementation costs associated with different PEFT algorithms. This survey serves as an indispensable resource for researchers aiming to understand both the PEFT algorithm and its system implementation, offering detailed insights into recent advancements and practical applications.

Index Terms-Large Language Model, Parameter-Efficient Fine-tuning, Computer System, Distributed System.

I. INTRODUCTION

Large Models (LMs) have recently captured considerable public interest. Their ability to understand context and nuances enables them to proficiently handle diverse tasks across multiple domains, including natural language processing (NLP), computer vision (CV), etc. In the field of NLP, Large Language Models (LLMs) have achieved significant advancements across various tasks including text generation [1], [2], translation [3], [4], personalized chat-bots [5], [6], [7], and summarization [8], demonstrating remarkable proficiency.
Earlier studies [1] have suggested that LLMs exhibit high levels of generalization, enabling them to apply their acquired knowledge to new tasks not included in their original training. This capability is commonly known as zero-shot learning. Nevertheless, fine-tuning remains essential to further enhance LLMs for optimal performance on new user datasets and tasks.
Due to their scale, a widely adopted strategy for fine-tuning LLMs involves adjusting only a limited number of LLM parameters while keeping the remainder unchanged. This technique, termed Parameter-Efficient Fine-Tuning (PEFT), selectively adjusts a small proportion of the model's parameters while keeping the rest unaltered. Furthermore, the application of PEFT extends beyond the realm of NLP and has quickly attracted interest in the CV community for fine-tuning vision models with large parameter counts, such as Vision Transformers (ViT) and diffusion models, as well as interdisciplinary models such as vision-language models.
In this survey, we systematically review and categorize recent advancements in PEFT algorithms as well as the system implementation costs associated with various PEFT algorithms across diverse scenarios. Figure 1 presents an overview of the content of this survey. In Section II, we present some fundamental concepts for LLMs and PEFT, including the computational flow of LLMs, basic knowledge of PEFT, and commonly used datasets and tasks.
We categorize all types of PEFT algorithms in Section III according to their computational flow. In Section III-A, we introduce additive algorithms that either introduce additional weight parameters or modify activations. Algorithms that exclusively fine-tune a subset of the existing parameters fall under the category of selective approaches and are introduced in Section III-B. In Section III-C, we explore reparameterized PEFT, which constructs a (low-dimensional) reparameterization of the original model parameters for training while transforming the weights back to maintain the inference speed. Additionally, there exist algorithms that combine the above techniques, and we classify these as hybrid approaches, elaborating on them in Section III-D. We also investigate strategies for further reducing the computational complexity of different PEFT algorithms, including KV-cache management, pruning, quantization, and memory optimization, in Section IV.
In Section V, we expand the scope of this survey beyond the computational perspective to cover various potential application scenarios. We explore innovations that apply PEFT techniques to different model architectures, including LLMs
Fig. 1: An overview of the content covered in this survey.
(Section V-A), Vision Transformers (Section V-B), Vision-Language alignment models (Section V-C), and Diffusion models (Section V-D), for varied downstream tasks, underscoring PEFT's versatility and applicability in a range of scenarios.
In Section VI, we explore the system design challenges for PEFT methods. The discussion covers three advanced system solutions for practical PEFT deployment: distributed tuning (Section VI-B), PEFT query serving (Section VI-C), and concurrent PEFT tuning (Section VI-D).
In the last section, Section VII, we summarize our survey and propose several potential future directions from both the algorithm and system perspectives, hoping to provide valuable insights for further research and development in the field.

II. BACKGROUND

In this section, we first discuss the computation flow of LLMs, including their fundamental components and computational complexity, using LLaMA as a case study. We then provide a brief overview of different PEFT algorithms in Section II-B.

A. Computation flow for LLaMA

In order to gain a deeper understanding of LLMs and other Transformer-based models, we employ LLaMA-7B, a cutting-edge open-source LLM, to scrutinize the architecture of LLMs as well as the Transformer. As shown in Figure 2 (a), LLaMA consists of three major components: an embedding block, a stack of decoder blocks, and a head block consisting of a linear layer and a softmax layer. The embedding layer's primary role is to transform unstructured textual information into chunks of discrete numerical vectors (tokens) to facilitate subsequent processing. The embedded tokens are then delivered to the decoder layers for further processing. Each LLaMA decoder is composed of two fundamental components: Multi-head Self-Attention (MSA) and a Feedforward Network (FFN). In the MSA module, each of the tokens is clustered by an attention map obtained from a dot product between two linear mappings of the input tokens. Then the grouped tokens are further processed by a feedforward neural network.

Additionally, Root Mean Square Layer Normalization (RMSNorm) [9] is adopted in LLaMA as a replacement for Layer Normalization to ensure efficient training.
LLMs distinguish themselves from other deep neural network (DNN) models, such as convolutional neural networks (CNNs), in two significant ways. Firstly, LLMs exhibit an inherent auto-regressive nature, necessitating multiple iterations to complete the generation task. Moreover, LLMs incorporate an attention mechanism, a component whose computational complexity scales quadratically with the length of the inputs. The inherent computation characteristic of an LLM lies in the attention blocks inside each decoder layer. Figure 2 (c) depicts a high-level overview of the computation flow in the attention block.
During the inference process, each decoder takes a 4-dimensional tensor as the input tokens. The input tokens are first multiplied with three weight matrices W_q, W_k, and W_v, producing the outputs referred to as the query Q, key K, and value V. Given the MSA module's inability to recognize positional data and the inherent auto-regressive nature of LLMs, the query and key then undergo Rotary Positional Embedding [10] (RoPE). Subsequently, the key and value are combined with those of prior tokens.
After the positional embedding, the intermediate activations then undergo a series of multiplications, a softmax, and a residual addition to generate the MSA output as described in Eq. 9. To be noted here, the scaling term in the equation refers to the number of feature dimensions in the multi-head attention mechanism.
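To make the computation flow above concrete, the following is a minimal single-head sketch of the attention step in PyTorch. It is an illustration rather than the LLaMA implementation: multi-head splitting, causal masking, and RoPE are omitted, and the weight names W_q, W_k, and W_v simply follow the notation used above.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v, kv_cache=None):
    """Simplified single-head self-attention with an optional KV-cache.

    x:            (batch, seq_len, d) input activations
    W_q/W_k/W_v:  (d, d_head) projection matrices
    kv_cache:     optional (K_prev, V_prev) tensors from earlier decoding steps
    """
    q = x @ W_q                    # queries (RoPE would be applied to q and k here)
    k = x @ W_k                    # keys
    v = x @ W_v                    # values

    if kv_cache is not None:       # combine with keys/values of prior tokens
        k_prev, v_prev = kv_cache
        k = torch.cat([k_prev, k], dim=1)
        v = torch.cat([v_prev, v], dim=1)

    d_head = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)   # scaled dot product
    attn = F.softmax(scores, dim=-1)
    return attn @ v, (k, v)        # attention output and the updated cache
```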
The SA output will then be forwarded to the FFN blocks for further processing. The FFN block will have another three
Fig. 2: (a) LLaMA architecture. (b) LLaMA auto-regressive pattern. (c) Three common PEFT operations. All the learnable components are highlighted in red, while the frozen components are highlighted in grey. LoRA is applied on all the Query, Key, and Value blocks. The adapter targets the FFN module. Soft-Prompt focused on tuning the input activation of each decoder. We only show one decoder for illustration simplicity.
weight matrices, W_gate, W_up, and W_down, and the computation can be illustrated by:

FFN(x) = W_down (SiLU(W_gate x) ⊙ (W_up x))
where x denotes the input of the FFN layer, and SiLU is the nonlinear function used in LLaMA. In the original Transformer, the FFN block can be written as:

FFN(x) = W_down ReLU(W_up x)
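As an illustration of the gated LLaMA-style FFN described above, the sketch below assumes the W_gate/W_up/W_down naming used in the text and a hypothetical intermediate width d_ffn; it is a minimal example rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFN(nn.Module):
    """LLaMA-style FFN: W_down( SiLU(W_gate x) * (W_up x) ), without biases."""

    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ffn, bias=False)
        self.w_up = nn.Linear(d_model, d_ffn, bias=False)
        self.w_down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise product of the gated branch and the up-projection branch.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```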
The output of the last decoder layer is sent to a linear layer, which then generates a probability distribution spanning the complete vocabulary to predict the next token in the sequence. The produced token is then concatenated with the previous tokens and used as the input for the next round of processing. This generating process repeats in an auto-regressive manner until a full sequence of tokens, referred to as a completion, is produced (Figure 2 (b)). For training, the computation flow is similar to that for inference, except that the generated sentences are directly compared against the ground-truth output to compute the training loss. Gradients are then computed with respect to the LLM weights to minimize this training loss.
To analyze the computation cost and memory overhead in LLMs, we also define a set of parameters used later in Section III. Table I shows the parameter sizes and computation dimensions of the LLaMA-7B model as a starting example.
LLM models generate tokens (words) one per round, as depicted in Figure 2, based on the previous prompt (input) and the previously generated sequence. This process is repeated until the model outputs a termination token. To accelerate the inference process in LLM models, a common strategy is to store the previous keys and values in a key-value cache (KV-cache), so that they do not need to be recalculated for each new token. Mathematically, the total KV-cache memory cost across all decoders can be expressed as in Equation 6, where the cost is determined by the context length, the batch size, the number of layers, the head dimension, and the number of heads.
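As a back-of-the-envelope illustration of this memory cost, the sketch below simply multiplies the quantities named above; the factor of 2 accounts for storing both keys and values, and the 2-byte element width (fp16) is an assumption rather than a value taken from the survey.

```python
def kv_cache_bytes(context_len: int, batch_size: int, n_layers: int,
                   n_heads: int, d_head: int, bytes_per_elem: int = 2) -> int:
    """Estimate the total KV-cache size across all decoder layers.

    The factor of 2 covers the key and value tensors; bytes_per_elem=2
    assumes fp16 storage.
    """
    return 2 * batch_size * context_len * n_layers * n_heads * d_head * bytes_per_elem

# Example with LLaMA-7B-like settings (32 layers, 32 heads, head dim 128),
# batch size 1 and a 2048-token context: roughly 1 GiB of cache.
print(kv_cache_bytes(2048, 1, 32, 32, 128) / 2**30, "GiB")
```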

B. Overview on Parameter Efficient Fine Tuning

Fine-tuning remains essential to enhance LLM performance on unseen user datasets and tasks. With the size of models growing (e.g., from 1.5B parameters in GPT-2 to 175B in GPT-3), the standard full fine-tuning paradigm requires thousands of GPUs working in parallel, which is highly inefficient and unsustainable. A class of algorithms, namely parameter-efficient fine-tuning (PEFT), has been proposed, which aims to tune a minimal number of parameters to achieve better performance than full fine-tuning on downstream tasks.
In parallel developments, large-scale pre-trained models in vision and multimodal domains have also demonstrated their effective representational learning capabilities, enabling adaptation from large datasets to smaller ones or across various data modalities through fine-tuning. Consequently, this capability has made PEFT increasingly attractive to the wider research community.
We categorize the PEFT algorithms into additive, selective, reparameterized, and hybrid fine-tuning based on their operations. As Figure 3 depicts, three major additive fine-tuning approaches are commonly used: (1) Adapters; (2) Soft Prompts; (3) Others. They differ from each other in the additional tunable modules or parameters they introduce. Selective fine-tuning, on the other hand, does not require any additional parameters; it selects a small subset of parameters from the backbone model and makes only them tunable, keeping the majority of parameters untouched during fine-tuning on downstream tasks. We categorize selective fine-tuning based on the grouping of chosen parameters: (1) Unstructural Masking; (2) Structural Masking.
TABLE I: Configuration parameters and computation operations for the LLaMA-7B architecture. Columns: Operation (Eq. 1 to Eq. 4), Weights Symbol, Weights Dimension, Input Tensor Dimension, and Complexity.
Reparameterization represents transforming model parameters between two equivalent forms. Specifically, reparameterized fine-tuning introduces additional low-rank trainable parameters during training, which are then integrated with the original model for inference. This approach is categorized into two main strategies: (1) Low-rank Decomposition and (2) LoRA Derivatives. Hybrid fine-tuning explores the design spaces of different PEFT methods and combines their advantages.

C. Downstream Tasks for LLM Evaluation

Two types of tasks have been widely used for LLM evaluation. The first type is the General Language Understanding Evaluation (GLUE) [11] benchmark, which integrates nine sentence or sentence-pair language understanding tasks (CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI), chosen for their diversity in dataset sizes, text genres, and difficulty levels, and is based on established existing datasets. It also includes a diagnostic dataset specifically designed to evaluate and analyze model performance across various linguistic phenomena inherent in natural language. Additionally, it features a public leaderboard to track performance on the benchmark and a dashboard to visualize model performance on the diagnostic set.
The other type of dataset used in recent LLM papers is commonsense reasoning, which, integrated into our study, caters to a variety of research facets: (1) OpenBookQA [12] is curated to foster research in advanced question-answering, delving into a profound understanding of both the subject matter and the language in which it is articulated. (2) PIQA [13] primarily emphasizes everyday scenarios, demonstrating a predilection for unconventional solutions. (3) Social IQA [14] emerges as a novel question-answering benchmark tailored for gauging social commonsense intelligence. (4) HellaSwag [15] serves as a dataset whose essence is to ascertain the capability of machines to aptly complete sentences. (5) BoolQ [16] is a dataset dedicated to question-answering, particularly for binary responses (yes/no queries). (6) WinoGrande [17] is introduced as a fresh compilation encompassing a substantial 44,000 problems. (7) ARC-easy [18] presents itself as a novel dataset constituting genuine grade-school level multiple-choice science questions, designed to invigorate research in intricate question-answering. (8) ARC-challenge [18], distinctively, encompasses solely those questions that were inaccurately addressed by both a retrieval-based algorithm and a word co-occurrence algorithm.
Image recognition is the primary benchmark and application for vision models, exemplified by benchmarks such as fine- grained visual categorization (FGVC) and visual task adaptation benchmark (VTAB). Beyond image classification, video action recognition is another key application area, involving datasets like Kinetics-400 [19], SSv2 [20], and HMDB51 [21]. Additionally, PEFT has been utilized for dense prediction tasks, using datasets like MSCOCO [22], ADE20K [23], and PASCAL VOC [24].

III. PEFT TAXONOMY

The PEFT strategies can be broadly classified into four categories: additive PEFT (Section III-A), which modifies the model architecture by injecting new trainable modules or parameters; selective PEFT (Section III-B), which makes a subset of parameters trainable during fine-tuning; reparameterized PEFT (Section III-C), which constructs a (low-dimensional) reparameterization of the original model parameters for training, then equivalently transforms it back for inference; and hybrid PEFT (Section III-D), which combines advantages from different PEFT methods to build a unified PEFT model. An overview of the different types of PEFT algorithms is depicted in Figure 4.

A. Additive PEFT

Standard full fine-tuning entails substantial computational expense and can also potentially harm the model's generalization ability. To mitigate this problem, a widely employed approach is to keep the pre-trained backbone unchanged and introduce only a minimal number of trainable parameters that are strategically positioned within the model architecture. While fine-tuning for a specific downstream task, only the weights of these additional modules or parameters are updated, which results in a substantial reduction in storage, memory, and computational resource requirements. Because they add parameters, these techniques can be termed Additive Tuning, as shown in Figure 4 (a). Next, we discuss several popular additive PEFT algorithms.
1. Adapters: Adapter approaches involve the insertion of small adapter layers within Transformer blocks. Typically, an adapter layer consists of a down-projection matrix W_down, followed by a non-linear activation function σ(·), and an up-projection matrix W_up. In this context, d represents the dimension of the hidden layer, and r serves as the bottleneck dimension, which is a hyperparameter used in configuring the adapters. Denoting x as the input to the adapter, the computation within the adapter module (with a residual connection) can be summarized as follows:

Adapter(x) = W_up σ(W_down x) + x
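A minimal sketch of the bottleneck adapter defined above, written as a PyTorch module; d and r follow the notation in the text, and ReLU stands in for the generic non-linearity σ.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Adapter(x) = W_up * sigma(W_down * x) + x  (bottleneck with residual)."""

    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r)   # W_down: project d -> bottleneck r
        self.act = nn.ReLU()          # placeholder for the non-linearity sigma
        self.up = nn.Linear(r, d)     # W_up: project bottleneck r -> d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x))) + x   # residual connection
```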
Fig. 3: Taxonomy of Parameter-Efficient Fine-Tuning Methods for Large Models.
Fig. 4: Different types of PEFT algorithms ((a) Additive PEFT; (c) Reparameterization PEFT).
The concept of adapters in the field of NLP was initially introduced by Serial Adapter [25] as shown in Figure 5 (a). In their approach, each Transformer block is enhanced by adding two adapter modules, with one positioned after the self-attention layer and the other after the FFN layer, respectively. Subsequent research has aimed to address the additional computational cost associated with adapter layers. A modified framework AdapterFusion [29] was proposed, where adapter layers are inserted only after the 'Add & Norm' step following the FFN layer to enhance the computational efficiency. The adapters mentioned above follow a sequential design, placing adapter layers as bottlenecks within the Transformer blocks. This approach may potentially reduce the model's parallelism and require a trade-off between inference efficiency and accuracy. In contrast, [26] introduced a parallel adapter (PA) approach as depicted in Figure 5 (b), which reorganizes the traditionally sequential adapter layers into a parallel side-network that runs alongside each Transformer sublayer. Similarly, CIAT [27], CoDA [28] and KronA [72] also adopts a parallel adapter design. Except for the parallel design, CoDA employs a sparse activation mechanism to improve the inference efficiency as shown in Figure 5 (c).

Specifically, CoDA uses a soft top-k selection process that identifies the k important tokens in each layer, which are processed by both the frozen pre-trained Transformer layer and the adapter branch to maintain model accuracy. In contrast, the unimportant tokens are processed only by the adapter branch and skip the heavy pre-trained layer, thereby optimizing for inference efficiency without compromising overall performance.
To enhance the performance and generalization of adapters, various studies have implemented multi-task learning strategies, such as AdapterFusion [29], AdaMix [30], PHA [31], AdapterSoup [32], MerA [33], and Hyperformer [34]. AdapterFusion keeps all pre-trained adapters in the model and employs a fusion module to merge the multi-task information. Unlike AdapterFusion, MerA merges pre-trained adapters into a single one through optimal transport based on weights and activations. This approach avoids introducing any additional trainable parameters, thereby enhancing computational efficiency. Hyperformer stores the multi-task information in a shared hypernetwork, which generates task- and layer-specific adapter parameters conditioned on task and layer ID embeddings. Given a new task, only an additional task embedding needs to be learned, thereby reducing the number of trained parameters.
2. Soft Prompt: Alternatively, prompt tuning presents an additional approach for refining the model to achieve improved performance through fine-tuning. Instead of optimizing discrete token representations through in-context learning, there is a prevailing belief that the continuous embedding space of soft prompts inherently contains more information [96]. Drawing inspiration from this concept, researchers directly prepend adjustable vectors, referred to as soft prompts, to the start of the input sequence. This can be represented as follows:
Fig. 5: Illustration of three representative adapter-based fine-tuning algorithms ((a) Serial Adapter, (b) Parallel Adapter, (d) Adapter Layer). Blue represents frozen, while yellow represents trainable.
X^(l) = [p_1, ..., p_m, x_1, ..., x_n]

where X^(l) is the sequence of input tokens for layer l, including the soft prompt tokens p_i followed by the original input tokens x_i; m is the number of soft prompt tokens, and n is the number of original input tokens.
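The sketch below illustrates this prepending operation under assumed shapes: a set of learnable prompt vectors is concatenated in front of the token embeddings before they enter a layer. The module name and initialization scale are illustrative rather than taken from any specific method.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prepends m learnable prompt vectors to a sequence of token embeddings."""

    def __init__(self, num_prompt_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, d_model) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, n, d_model)  ->  (batch, m + n, d_model)
        batch = token_embeds.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)
```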
Prefix-tuning [35] introduces learnable vectors that are prepended to the keys and values across all Transformer layers. To ensure stability during the optimization process, Prefix-tuning adopts a reparameterization strategy, which utilizes an MLP layer to generate these prefix vectors rather than optimizing them directly. After fine-tuning, only the prefix vectors are saved for inference. This technique is adapted and improved in several studies [36], [37], [38]. For instance, p-tuning v2 [37] removes the reparameterization and expands its usage to broader model scales and NLP tasks. APT (Adaptive Prefix Tuning) [38] enhances Prefix-tuning by introducing an adaptive gate mechanism to control the prefix importance in each layer. The concurrent works p-tuning [39] and prompt-tuning [40] apply learnable vectors only at the initial word embedding layer rather than at all layers to enhance training and inference efficiency. It is important to highlight that prompt-tuning demonstrates its effectiveness primarily in the context of large models, specifically those with over 11 billion parameters [40]. Complementing this, Xprompt [41] eliminates the negative prompt tokens through hierarchical structured pruning, which closes the performance gap at smaller model scales. The work by [97] provides some theoretical analysis of prompt tuning, demonstrating its universality and limitations in limited-depth Transformers. IDPG (Instance-Dependent Prompt Generation) [42] improves prompt tuning by generating prompts based on each input sentence with a lightweight prompt generator. In a related approach, LPT (Late Prompt Tuning) [43] also leverages a prompt generator to obtain instance-aware prompts. Unlike previous work, LPT adds these prompts only after an intermediate layer, rather than at the initial or all layers. This strategic placement eliminates the gradient calculation below the intermediate layer, thereby significantly accelerating the training speed. Simultaneously, LPT can improve the overall performance because the shorter backpropagation path preserves more task-related information. Inspired by LPT, SPT (Selective Prompt Tuning) [44] delves deeper into the importance of prompt-inserting strategies. It introduces a learnable probabilistic gate in each layer to determine whether to use the prompt propagated from the previous layer or inject a newly generated prompt. APrompt [45] employs another prompt-inserting strategy. In addition to input prompts inserted at the beginning of the input sequence for each Transformer layer, APrompt also prepends additional learnable prompts to the respective query, key, and value matrices in the self-attention blocks to learn new attention patterns. Besides, APrompt incorporates the learning of a task-specific head.
The concept of soft prompts has been employed for various downstream tasks [98], [99], although their training can be prone to instability and slow convergence. To address this, SPoT [46] uses a source prompt learned from one or multiple tasks to initialize prompts for new tasks. Similarly, transferring soft prompts from one task to initialize another is proposed in TPT (Transferable Prompt Tuning) [47], which demonstrates that a better prompt initialization results in a large training convergence speedup. InfoPrompt [48] develops two mutual-information-based loss functions, i.e., head loss and representation loss, to find better prompt initialization and learn sufficient task-relevant information, thereby also expediting convergence. PTP [49] delves into the root causes of training instability. It identifies the steep nature of the loss landscape in conventional prompt tuning, where minor variations in input data can lead to significant loss fluctuations. To mitigate this, PTP introduces perturbation-based regularizers to smooth the loss landscape and consequently stabilize the training process. DePT [52] decomposes the soft prompt into a shorter soft prompt with a pair of low-rank matrices, which are optimized with two distinct learning rates. This strategy not only improves the performance but also enhances training and inference efficiency. SMoP (Sparse Mixture-of-Prompts) [51] reduces the training and inference cost by utilizing short soft prompts. During training, multiple short soft prompts are trained, each tailored to specific subsets of the dataset. During inference, SMoP integrates a gating mechanism that routes each input instance to an appropriate short prompt. This technique not only increases efficiency in both the training and inference stages but also retains performance comparable to that achieved with longer soft prompts. To further cut down the number of soft prompt parameters, IPT (Intrinsic Prompt Tuning) [50] identifies an intrinsic task subspace by training an auto-encoder on multiple tasks. Tuning on new tasks then requires adjusting only a few parameters within this subspace, significantly reducing the number of training parameters.
3. Other Additive Methods: Apart from the methods mentioned above, other approaches have emerged that strategically incorporate additional parameters during the fine-tuning process. For example, (IA)^3 [53] introduces three learnable
Fig. 6: Illustration of (IA)^3 (a) and SSF (b). Blue represents frozen, while yellow represents trainable.
rescaling vectors, l_k, l_v, and l_ff, to rescale the key, value, and FFN activations, respectively, as depicted in Figure 6 (a). The operations within the self-attention block can be described as follows:

SA(x) = softmax( Q (l_k ⊙ K)^T / sqrt(d_head) ) (l_v ⊙ V)
In the FFN, the rescaling can be denoted as:

FFN(x) = W_down (l_ff ⊙ σ(W_up x))
where ⊙ denotes the Hadamard product. Furthermore, the scaling vectors l_k and l_v can be seamlessly integrated into the weight matrices W_k and W_v. This integration effectively eliminates the extra computational cost during inference. A similar technique, SSF [55], also performs linear transformations on the model activations, as illustrated in Figure 6 (b). Specifically, after each operation (i.e., MSA, FFN, and layer normalization) in the pre-trained model, an SSF-ADA layer is injected, which performs scaling and shifting of the features generated by the operation. During fine-tuning, only those SSF-ADA layers are updated, while during inference, similar to (IA)^3, these SSF-ADA layers can be merged into the model weights, so no additional inference overhead is incurred. IPA (Inference-Time Policy Adapters) [56] offers a novel approach to align LLMs, such as GPT-4, with user-specific requirements without modifying the base model's parameters. This is particularly significant when dealing with models whose parameters are extremely large and often not directly accessible. IPA achieves this by combining (through multiplication and normalization) the output distribution of a base LLM (base policy) with that of a smaller-sized model (adapter policy) during the decoding phase. During training, the policy adapter's parameters are fine-tuned using reinforcement learning, while the base policy's parameters remain fixed. During inference, IPA decodes with the combined distribution of the base model and the trained policy adapter, tailoring it to fulfill specific user-defined criteria.
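To illustrate the rescaling operations of [53] described above, the following sketch keeps learnable vectors l_k, l_v, and l_ff and applies them as element-wise (Hadamard) scalings; the module shape and placement are assumptions for illustration rather than the authors' code. Because each vector only scales an existing activation, it can later be folded into the corresponding weight matrix, which is what removes the extra inference cost mentioned above.

```python
import torch
import torch.nn as nn

class IA3Rescale(nn.Module):
    """Learned vectors rescale the keys, values, and intermediate FFN activations."""

    def __init__(self, d_head: int, d_ffn: int):
        super().__init__()
        self.l_k = nn.Parameter(torch.ones(d_head))   # rescales keys
        self.l_v = nn.Parameter(torch.ones(d_head))   # rescales values
        self.l_ff = nn.Parameter(torch.ones(d_ffn))   # rescales FFN activations

    def rescale_kv(self, k: torch.Tensor, v: torch.Tensor):
        # Hadamard product along the feature dimension.
        return self.l_k * k, self.l_v * v

    def rescale_ffn(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.l_ff * hidden
```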

B. Selective PEFT

In contrast to additive PEFT, which increases the model complexity by adding more parameters, selective PEFT fine-tunes a subset of the existing parameters to enhance model performance on downstream tasks, as depicted in Figure 4 (b).
Fig. 7: Illustration of two parameter masking methods. Blue represents frozen, while yellow represents trainable.
Specifically, given a model with parameters θ = {θ_1, θ_2, ..., θ_n}, where each θ_i denotes an individual model parameter and n represents the total count of these parameters, the process of selective PEFT is represented by applying a binary mask M = {m_1, m_2, ..., m_n} to these parameters. Each m_i in M is either 0 or 1, indicating whether the corresponding parameter θ_i is selected (1) or not selected (0) for fine-tuning. The updated parameter set θ' after fine-tuning is given by:

θ'_i = θ_i - η · m_i · ∂L/∂θ_i
where η represents the learning rate, and ∂L/∂θ_i is the gradient of the loss function L with respect to the parameter θ_i. In this formulation, only the selected parameters (i.e., those with m_i = 1) are updated during backpropagation.
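A minimal sketch of this masked update: a fixed 0/1 mask zeroes the gradient of every unselected parameter before the update, so only the selected entries move. How the mask is chosen is left abstract here, since each method below defines its own criterion.

```python
import torch

def masked_sgd_step(params, masks, lr: float) -> None:
    """One selective-PEFT update: theta_i <- theta_i - lr * m_i * grad_i.

    params: iterable of tensors whose .grad has been filled by backward()
    masks:  matching iterable of 0/1 tensors with the same shapes
    """
    with torch.no_grad():
        for theta, m in zip(params, masks):
            if theta.grad is not None:
                theta -= lr * m * theta.grad   # only entries with m == 1 are updated
```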
Diff pruning [57] is a representative work that applies a learnable binary mask to the model weights during fine-tuning. To achieve parameter efficiency, the mask is regularized by a differentiable approximation of the L0-norm penalty. PaFi [59] simply selects the model parameters with the smallest absolute magnitude as trainable. FishMask [60] determines parameter importance using the approximate Fisher information. It then selects the top-k parameters based on this information to form the mask M. Similarly, Fish-Dip [61] also uses Fisher information to calculate the mask, but the mask is re-calculated dynamically in each training period. LT-SFT [62] introduces another technique to determine parameter importance inspired by the Lottery Ticket Hypothesis [100], [101], where the subset of parameters that change the most during an initial fine-tuning stage is selected to form the mask M. SAM [63] proposes a second-order approximation method, which approximates the original problem with an analytically solvable optimization function, to help decide the parameter mask. Child-tuning [64] proposes two approaches to select a child network during each training iteration, where only the parameters within this child network can be updated.
However, the above unstructured parameter masking results in an uneven distribution of non-zero masks and diminished hardware efficiency when implementing PEFT. As shown in Figure 7, a structured mask organizes parameter masking in regular patterns, unlike unstructured masking that applies it randomly, and can thus enhance computational and hardware efficiency during training. Therefore, various structured selective PEFT techniques have undergone extensive investigation. Diff pruning proposes a structured pruning strategy by partitioning the weight parameters into local groups and strategically eliminating them together. Similarly, FAR [65]
Fig. 8: Illustration of three representative reparameterized PEFT algorithms: (a) LoRA, (b) DyLoRA, (c) DoRA. Blue represents frozen, while yellow represents trainable.
fine-tunes BERT models by grouping the weights of the FFN in Transformer blocks into nodes, then ranking and selecting the learner nodes using the L1 norm. To further reduce the memory access frequency, they also reconfigure the FFN by grouping the learner nodes together. BitFit [66] is proposed to fine-tune only the bias parameters of each DNN layer, and it achieves competitive results for small models. However, this method fails to handle large models. The work by [58] applies NAS to BitFit, where S-BitFit keeps the structural nature of BitFit, restricting the NAS algorithm to choosing, for each bias module, whether or not it is fine-tuned. Similar to BitFit, which fine-tunes a specific module in the Transformer, Xattn Tuning [67] fine-tunes only the cross-attention layers. SPT (sensitivity-aware visual parameter-efficient fine-tuning) [68] first identifies the sensitive parameters, measured by the loss reduction when they are tuned. This sensitivity is calculated using a first-order Taylor expansion, derived from a single forward and backward pass before fine-tuning in one shot. Next, SPT finds the weight matrices whose number of sensitive parameters exceeds a predefined threshold, and then applies a selected PEFT technique (e.g., LoRA and Adapter) to these targeted weights to achieve structural tuning.

C. Reparameterized PEFT

Reparameterization stands for equivalently transforming a model's architecture from one to another via transforming its parameters. In the context of PEFT, this often means constructing a low-rank parameterization to achieve the goal of parameter efficiency during training. For inference, the model can be converted to its original weight parameterization, ensuring unchanged inference speed. This procedure is depicted in Figure 4 (c).
Earlier research studies [69] have shown that common pre-trained models exhibit an exceptionally low intrinsic dimensionality. In other words, it is possible to find a low-dimensional reparameterization that is as effective for fine-tuning as the entire parameter space. Intrinsic SAID [69] is the pioneering work in investigating the intrinsic dimension feature during the fine-tuning of LLMs. However, the most widely recognized reparameterization technique is LoRA (Low-Rank Adaptation) [70], [102], as shown in Figure 8 (a). For a given pre-trained weight matrix W_0 ∈ R^{d×k}, LoRA introduces two trainable weight matrices, W_up ∈ R^{d×r} and W_down ∈ R^{r×k}, where the rank r ≪ min(d, k), operating in parallel to W_0. Let x represent the input. Under normal conditions, the output through W_0 is h_out = W_0 x. Instead, LoRA modifies this output by introducing an incremental update ΔW that encapsulates task-specific knowledge:

h_out = W_0 x + (α/r) ΔW x = W_0 x + (α/r) W_up W_down x
where α denotes a scaling factor. At the onset of training, W_down is initialized using a random Gaussian distribution, while W_up is initialized to zero, ensuring that ΔW initially holds a value of zero. LoRA is straightforward to implement and has been evaluated on models with up to 175 billion parameters. Figure 2 (c) uses a single decoder as an example, where the frozen and learnable components are highlighted in grey and red, respectively. Once fine-tuning is complete, LoRA's adaptive weights seamlessly integrate with the pre-trained backbone weights. This integration ensures that LoRA maintains the model's efficiency, adding no extra burden during inference.
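A minimal LoRA sketch matching the update above, using the W_up/W_down naming from this section; the default α and the initialization scale are illustrative choices rather than values prescribed by the survey.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h_out = W_0 x + (alpha / r) * W_up W_down x, with W_0 kept frozen."""

    def __init__(self, d_in: int, d_out: int, r: int, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)       # pre-trained W_0
        self.base.weight.requires_grad_(False)               # frozen backbone
        self.w_down = nn.Parameter(torch.randn(r, d_in) * 0.01)   # Gaussian init
        self.w_up = nn.Parameter(torch.zeros(d_out, r))           # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.w_down.T) @ self.w_up.T

    @torch.no_grad()
    def merge(self) -> None:
        """Fold the low-rank update into W_0 so inference incurs no extra cost."""
        self.base.weight += self.scale * (self.w_up @ self.w_down)
```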
In LoRA training, selecting an appropriate rank has always been a challenging issue. To address this, DyLoRA [76], as depicted in Figure 8 (b), trains the LoRA module on a range of ranks within a predefined training budget, rather than adhering to a single, fixed rank. Specifically, for a given rank range [r_min, r_max], DyLoRA dynamically chooses a rank b at each iteration of the training process. Consequently, the matrices W_down and W_up are tailored to the selected rank b, resulting in truncated versions of W_down and W_up, and the subsequent forward and backward passes during this iteration are restricted to the truncated matrices instead of the full W_down and W_up. With this dynamic, search-free approach, DyLoRA significantly reduces the training time required to find an optimal and fixed LoRA rank for specific tasks. AdaLoRA [77] reformulates ΔW with a singular value decomposition (SVD), denoted as ΔW = P Λ Q, where P and Q are orthogonal and Λ is a diagonal matrix containing the singular values. All three weight matrices are made learnable. During training, the singular values are pruned iteratively based on their importance scores, which are constructed from a moving average of the magnitude of the gradient-weight product.
To ensure the orthogonality of P and Q, i.e., P^T P = Q Q^T = I, an additional regularizer term is included in the loss:

R(P, Q) = ||P^T P - I||_F^2 + ||Q Q^T - I||_F^2
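A small sketch of this regularizer (Eq. 13), assuming P and Q are stored as plain tensors of shape (d, r) and (r, k), respectively.

```python
import torch

def orthogonality_regularizer(P: torch.Tensor, Q: torch.Tensor) -> torch.Tensor:
    """R(P, Q) = ||P^T P - I||_F^2 + ||Q Q^T - I||_F^2."""
    r = P.shape[1]
    eye = torch.eye(r, device=P.device, dtype=P.dtype)
    return ((P.T @ P - eye) ** 2).sum() + ((Q @ Q.T - eye) ** 2).sum()
```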
This adaptive approach enables the model to dynamically adjust the rank within each LoRA module, effectively managing its parameter count based on the significance of the weight matrices. However, according to SoRA [78], the importance scores used in AdaLoRA are heuristically constructed and lack rigorous theoretical motivation. Additionally, both the moving-average operation and the calculation of Eq. 13 introduce extra computation cost during training. To address this, SoRA eliminates the orthogonality premise on P and Q. Instead, a gating unit g between W_up and W_down is directly applied and optimized:

ΔW x = W_up (g ⊙ (W_down x))
where ⊙ is the Hadamard product. The gate g is updated using a variation of the proximal gradient iteration for the L1 loss [103], [104], which has a clear mathematical meaning and does not need the heuristic premise. After training, the zeroed-out gate units are pruned by removing the corresponding columns and rows in W_up and W_down.
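The sketch below illustrates this gating and pruning idea under assumed shapes: a vector gate g sits between W_down and W_up, and its update combines a gradient step with the soft-thresholding (proximal) operator of the L1 penalty. The exact schedule and penalty strength used by SoRA may differ.

```python
import torch

def sora_delta(x, W_down, W_up, g):
    """Gated low-rank update: W_up (g * (W_down x)).

    W_down: (r, d_in), W_up: (d_out, r), g: (r,), x: (..., d_in)
    """
    return (g * (x @ W_down.T)) @ W_up.T

def proximal_l1_step(g, grad, lr: float, lam: float):
    """Gradient step on g followed by soft-thresholding, which drives
    unimportant gate entries exactly to zero so they can later be pruned."""
    g = g - lr * grad
    return torch.sign(g) * torch.clamp(g.abs() - lr * lam, min=0.0)
```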
Several subsequent studies have aimed to improve LoRA's performance in various aspects. For instance, Laplace-LoRA [81] notices that fine-tuned LLMs often exhibit overconfidence. To enhance the calibration of fine-tuned LLMs, Laplace-LoRA utilizes a Bayesian approach, specifically a post-hoc Laplace approximation [105], [106], to the posterior over the LoRA parameters. LoRA Dropout [82] introduces random noise to the learnable low-rank matrices and increases parameter sparsity to reduce the risk of overfitting. LoRA+ [84] proposes to set different learning rates for the LoRA matrices W_up and W_down, such that the learning rate of W_up is a fixed multiple of that of W_down, and only the learning rate of W_down is tuned.
Thanks to the modular design of LoRA, many studies incorporate multiple LoRA modules in their frameworks to enhance performance. For example, LoRAHub aggregates various LoRA modules trained on different tasks. Given a handful of examples from a new task, LoRAHub can autonomously compose compatible LoRA modules without human intervention via a gradient-free method, Shiwa [107]. MOELoRA employs a Mixture-of-Experts (MOE) approach to train LoRA in a multi-task setting, resulting in multiple expert LoRA modules. To retrieve parameters for a certain task, MOELoRA utilizes a task-motivated gate function that assigns contribution weights to each expert based on the task ID, and the final parameters are calculated through a weighted sum of all experts.
In addition to LoRA, several other reparameterization techniques are emerging with significant potential. For instance, Compacter [71] introduces a lightweight adapter module by parameterizing W_up and W_down as a sum of Kronecker products, W = Σ_{i=1}^{n} A_i ⊗ B_i, where ⊗ denotes the Kronecker product. It further decreases the parameter count by designating the A_i as shared parameters and reparameterizing the B_i using the product of two low-rank matrices, effectively reducing the parameter complexity from O(kd) to O(k+d). Related studies, such as KronA [72] and KAdaptation [73], also employ the Kronecker product to reparameterize adapter weights, aiming to achieve parameter reduction. HiWi [59] proposes an adapter fine-tuning method that applies an adapter directly to the pre-trained parameters instead of the hidden representations.
Here, the adapter's input W denotes the weights or biases within the Transformer block's feed-forward layer. Notably, during inference, this method computes the adapted parameters in advance, ensuring that the model's inference latency remains on par with that of traditional full fine-tuning. VeRA (Vector-based Random Matrix Adaptation) [74] employs a single pair of frozen low-rank matrices B and A that are shared across all layers, and adapts these matrices by learning small, trainable scaling vectors b and d (formally denoted by diagonal matrices Λ_b and Λ_d). Specifically, the reparameterization is given by:

h_out = W_0 x + Λ_b B Λ_d A x
where both B and A are initialized using a random Gaussian distribution. Similar to LoRA, the scaling vector b is initialized to zeros to ensure that the weight matrix is unaffected during the first forward pass. This method significantly reduces the number of trainable parameters compared to LoRA yet maintains the same performance, enabling the fine-tuning of larger models on a single GPU. DoRA (Weight-Decomposed Low-Rank Adaptation) [75] presents a novel approach, as illustrated in Figure 8 (c), by decomposing the model weights W_0 into magnitude and direction as follows:

W_0 = m · (V / ||V||_c)
where m is the magnitude vector, V is the directional matrix, and ||·||_c is the vector-wise norm of a matrix across each column. Subsequently, DoRA adopts a unique fine-tuning strategy for m and V. While both are tunable, only V undergoes LoRA reparameterization, defined as:

W' = m · (W_0 + ΔV) / ||W_0 + ΔV||_c
where ΔV is the incremental directional update learned by LoRA, and m and ΔV are the trainable parameters. Through this methodology, DoRA consistently outperforms LoRA across various tasks and models, demonstrating its superiority.
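A minimal sketch of the DoRA update above, assuming a PyTorch module: the magnitude m is learned per column, the direction is the column-normalized sum of the frozen W_0 and a LoRA-style update, and the shapes and initialization are illustrative.

```python
import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    """W' = m * (W_0 + dV) / ||W_0 + dV||_c, with dV = W_up W_down."""

    def __init__(self, W0: torch.Tensor, r: int):
        super().__init__()
        self.W0 = nn.Parameter(W0.clone(), requires_grad=False)  # frozen (d_out, d_in)
        self.m = nn.Parameter(W0.norm(dim=0, keepdim=True))      # per-column magnitude
        self.w_down = nn.Parameter(torch.randn(r, W0.shape[1]) * 0.01)
        self.w_up = nn.Parameter(torch.zeros(W0.shape[0], r))    # dV starts at zero

    def effective_weight(self) -> torch.Tensor:
        V = self.W0 + self.w_up @ self.w_down                    # updated direction
        return self.m * (V / V.norm(dim=0, keepdim=True))        # renormalize columns

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.effective_weight().T
```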

D. Hybrid PEFT

The efficacy of various PEFT methods can significantly differ across different tasks. As a result, numerous studies aim to either combine the advantages of diverse PEFT approaches or seek to establish a unified perspective by analyzing the similarities among these methods. For instance, UniPELT[90] integrates LoRA, prefix-tuning, and adapters into each Transformer block. To control which PEFT submodules should be activated, they also introduce a gating mechanism. This

mechanism consists of three small FFNs that each produce a scalar value, which is then applied to the LoRA, prefix, and adapter matrices, respectively. Across various setups, UniPELT has consistently shown improvements in accuracy. S4 [91] explores the design spaces of several PEFT methods (i.e., Adapter (A), Prefix (P), BitFit (B), and LoRA (L)) to uncover underlying design patterns. After a series of experiments, their findings include: (1) Applying a spindle grouping partition to the Transformer layers, which results in four layer groups; layers in one group behave similarly, which means similar PEFT strategies should be applied to them. (2) Allocating the number of trainable parameters to layers uniformly. (3) Tuning all the groups. (4) Assigning different PEFT strategies to different groups. The best-performing design space assigns a different combination of these PEFT methods to each of the four layer groups.
MAM Adapter[26] explores the intrinsic similarity between three additive PEFT methods: adapters, prefix-tuning, and LoRA, which leads to the development of three variants: Parallel Adapter, which places adapter layers alongside specific layers (SA or FFN) instead of after them; Multi-head Parallel Adapter, which divides the parallel adapter into multiple heads, each affecting the head attention output in SA; and Scaled Parallel Adapter, which adds a scaling term after the parallel adapter layer, similar to LoRA. Extensive experimentation revealed that the most effective configuration involves using prefix-tuning in the SA layer and the scaled parallel adapter in the FFN layer, which is called MAM Adapter. LLM-Adapters [94] builds an easy-to-use framework that incorporates various PEFT techniques into LLMs. Through comprehensive benchmarking across multiple datasets, the study reveals several key insights: (1) The most effective locations for series adapters, parallel adapters, and LoRA are after the MLP layers, alongside the MLP layers, and simultaneously following the Attention layers and MLP layers, respectively. (2) Smaller LLMs utilizing PEFT can achieve competitive or even superior results on certain tasks when compared to their larger counterparts. (3) With appropriate in-distribution fine-tuning data, smaller models are capable of surpassing larger models in task-specific performance.
MAM 适配器[26] 探索了三种附加 PEFT 方法之间的内在相似性:适配器、前缀调整和 LoRA,这导致了三种变体的发展:并行适配器,将适配器层放置在特定层(SA 或 FFN)旁边,而不是在它们之后;多头并行适配器,将并行适配器分成多个头部,每个头部影响 SA 中的头部注意力输出;以及缩放并行适配器,在并行适配器层之后添加一个缩放项,类似于 LoRA。广泛的实验揭示了最有效的配置涉及在 SA 层中使用前缀调整和在 FFN 层中使用缩放并行适配器,这被称为 MAM 适配器。LLM-适配器[94] 构建了一个易于使用的框架,将各种 PEFT 技术整合到 LLMs 中。通过在多个数据集上进行全面基准测试,研究揭示了几个关键见解:(1)系列适配器、并行适配器和 LoRA 的最有效位置分别是在 MLP 层之后、与 MLP 层并行以及同时跟随注意力层和 MLP 层之后。 (2) 利用 PEFT 的较小模型在某些任务上可以取得竞争力甚至优于较大模型的结果。(3) 通过适当的分布内微调数据,较小模型能够在特定任务的性能上超越较大模型。
Several studies leverage neural architecture search (NAS) to find better PEFT combination approaches. For example, NOAH [92] discovers that different PEFT configurations are specifically tailored for different tasks. To address this issue, NOAH employs NAS to identify the most effective PEFT configurations for each dataset. Specifically, NOAH's search space encompasses three PEFT methods: Adapter, LoRA, and Visual Prompt Tuning (VPT). It utilizes AutoFormer [108], a one-shot NAS algorithm, for the efficient discovery of optimal prompt modules. In a related vein, AUTOPEFT [93] first establishes a search space that includes serial adapters, parallel adapters, and prefix tuning. After that, they propose an effective NAS method based on high-dimensional Bayesian optimisation [109]. Both NOAH and AUTOPEFT demonstrate the capability of NAS in enhancing PEFT configurations across a variety of tasks.
几项研究利用神经架构搜索(NAS)来寻找更好的 PEFT 组合方法。例如,NOAH [92] 发现不同的 PEFT 配置专门针对不同的任务定制。为了解决这个问题,NOAH 利用 NAS 来识别每个数据集中最有效的 PEFT 配置。具体来说,NOAH 的搜索空间包括三种 PEFT 方法:Adapter、LoRA 和 Visual Prompt Tuning(VPT)。它利用 AutoFormer [108],一种一次性 NAS 算法,高效地发现最佳提示模块。在相关领域,AUTOPEFT [93] 首先建立了一个搜索空间,其中包括串行适配器、并行适配器和前缀调整。之后,他们提出了一种基于高维多维贝叶斯优化的有效 NAS 方法 [109]。NOAH 和 AUTOPEFT 都展示了 NAS 在增强各种任务中的 PEFT 配置方面的能力。

IV. EFFICIENT PEFT DESIGN
高效的 PEFT 设计

Processing latency and peak memory overhead are pivotal factors to consider from a computational standpoint. This section first introduces the KV-cache, a key characteristic of LLM inference that requires balancing latency against memory usage (Section IV-A). Following this, we explore strategies for developing efficient PEFT methods to address computational challenges, including PEFT pruning (Section IV-B), PEFT quantization (Section IV-C), and memory-efficient PEFT techniques (Section IV-D), each designed to enhance model performance while minimizing resource consumption. It is noteworthy that quantization inherently addresses memory overhead concerns. However, given its distinct characteristics, we address these quantization methods separately rather than incorporating them under the memory-efficient PEFT section.
处理延迟和峰值内存开销是从计算角度考虑的关键因素。本节介绍了LLMs中旨在平衡延迟和内存使用之间的关键特征(第 IV-A 节)。在此之后,我们探讨了开发高效 PEFT 方法以解决计算挑战的策略,包括 PEFT 修剪(第 IV-B 节)、PEFT 量化(第 IV-C 节)和内存高效 PEFT 技术(第 IV-D 节),每种方法都旨在提高模型性能同时最大限度地减少资源消耗。值得注意的是,量化本质上解决了内存开销问题。然而,鉴于其独特特性,我们将这些量化方法单独处理,而不是将它们纳入内存高效 PEFT 部分。

A. KV-cache Management for PEFT Efficiency
A. 为 PEFT 效率管理 KV 缓存

At the core of an LLM lies an autoregressive Transformer model, as depicted in Figure 2. The autoregressive characteristic becomes a major challenge when designing an inference system, because every time a new token is generated, the entire LLM has to transfer all of its weights from different memories to the memory of the graphics processor, which is very unfriendly to single-user task scheduling or multi-user workload balancing. The challenging part of serving the autoregressive paradigm is that all previous sequences have to be cached and saved for the next iteration; the cached activations generated from the previous sequences are stored as the Key-Value cache (KV-cache).
LLMs模型的核心是一个自回归 Transformer 模型,如图 2 所示。当我们看到自回归特性时,它成为设计推理系统的一个主要挑战,因为每次生成一个新的标记时,整个LLM模型都必须将所有权重从不同的内存传输到图形处理器的内存,这对于单用户任务调度或多用户工作负载平衡非常不友好。为提供自回归范式的挑战在于必须缓存和保存所有先前的序列以供下一个迭代使用,从先前序列生成的缓存激活被存储为键值缓存(KV-cache)。
Storing the KV-cache costs both memory space and IO bandwidth, making the workload memory-bound and under-utilizing the computational power of the system. Previous works proposed a series of solutions such as KV-cache control management [133] or KV-cache compression [134] to improve throughput or reduce latency. When designing PEFT methods, it is crucial to consider the characteristics of the KV-cache to complement its features. For instance, when applying soft prompts in the inference phase, efficiently leveraging the KV-cache for these additional inputs can help accelerate response times by ensuring prompt-related data is readily accessible.
KV-cache 的存储将消耗内存空间和 IO 性能,导致工作负载受内存限制且系统的计算能力被低效利用。先前的研究提出了一系列解决方案,如 KV-cache 控制管理[133]或 KV-cache 压缩[134],以提高吞吐量或减少延迟。在设计 PEFT 方法时,关键是考虑 KV-cache 的特性以补充其功能。例如,在推理阶段应用软提示时,有效利用 KV-cache 来处理这些额外输入可以通过确保相关数据可立即访问来加快响应时间。
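The minimal decoding loop below (a generic sketch, not tied to any particular serving system; all tensor shapes are illustrative) shows why the KV-cache grows with the sequence: each new token appends its key and value, and attention at every step reads the entire cache instead of recomputing it.

import torch

def decode_step(x_t, W_q, W_k, W_v, kv_cache):
    # One autoregressive step for a single attention head.
    # x_t: (1, d_model) embedding of the newest token.
    # kv_cache: dict holding the keys/values of all previously generated tokens.
    q = x_t @ W_q                                              # (1, d_head)
    k = x_t @ W_k
    v = x_t @ W_v
    kv_cache["k"] = torch.cat([kv_cache["k"], k], dim=0)       # cache grows with the sequence
    kv_cache["v"] = torch.cat([kv_cache["v"], v], dim=0)
    attn = torch.softmax(q @ kv_cache["k"].t() / k.shape[-1] ** 0.5, dim=-1)
    return attn @ kv_cache["v"]                                # (1, d_head)

d_model, d_head = 64, 64
cache = {"k": torch.empty(0, d_head), "v": torch.empty(0, d_head)}
W_q = W_k = W_v = torch.randn(d_model, d_head)
for _ in range(5):                                             # generate 5 tokens
    out = decode_step(torch.randn(1, d_model), W_q, W_k, W_v, cache)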

B. Pruning Strategies for PEFT
PEFT 的修剪策略

The inclusion of pruning can substantially enhance the efficiency of PEFT methods. In particular, AdapterDrop [110] explores the removal of adapters from lower Transformer layers and of multi-task adapters in AdapterFusion [29], which shows that pruning can improve training and inference efficiency with a minimal decrease in performance. SparseAdapter [111] investigates different pruning methods and finds that high sparsity ratios can outperform standard adapters. Additionally, the Large-Sparse configuration, which increases the bottleneck dimension while maintaining a constant parameter budget (e.g., doubling the dimension at 50% sparsity), substantially enhances the model's capacity, resulting in improved performance. SPLoRA [112] applies channel-based pruning to the LoRA weights A and B.
修剪的加入可以显著提高 PEFT 方法的效率。具体而言,AdapterDrop [110] 探索了从较低的变压器层和 AdapterFusion [29] 中的多任务适配器中移除适配器,表明修剪可以在最小性能降低的情况下提高训练和推理效率。SparseAdapter [111] 研究了不同的修剪方法,并发现高稀疏度比率可以胜过标准适配器。此外,增加瓶颈维度同时保持恒定参数预算(例如,通过 1 稀疏度加倍维度)的 Large-Sparse 配置显著增强了模型的容量,从而提高了性能。SPLoRA [112] 采用基于通道的修剪来修剪 LoRA 权重。
Fig. 9: Taxonomy of Efficient PEFT Design.
图 9:高效 PEFT 设计分类。
This pruning affects not only the source weights W, but also the LoRA parameters A and B. Similarly, LoRAPruning [113] adopts structured pruning not only for the pre-trained model weights but also for the LoRA weights. In contrast to unstructured LoRA pruning methods, which primarily focus on sparsifying model weights while leaving LoRA weights dense, thus making weight merging challenging to achieve, LoRAPruning enables the weights to be merged easily. Additionally, this work also introduces a novel criterion that utilizes LoRA's gradients as an approximation of the gradients for the pre-trained weights, enabling the estimation of weight importance. ProPETL [114] constructs a single shared prototype (e.g., adapter, prefix, or LoRA) across layers and tasks. In addition, ProPETL learns binary masks to prune different sub-networks in different layers and tasks. As a result, the parameters can be reused across layers and tasks, largely increasing the parameter efficiency.
这种修剪不仅影响源权重 ,还影响 LoRA 参数 。类似地,LoRAPruning [113]采用结构化修剪,不仅适用于预训练模型的权重,还适用于 LoRA 的权重。与非结构化的 LoRA 修剪方法相比,后者主要专注于稀疏化模型权重,同时保持 LoRA 权重密集,从而使权重合并难以实现,LoRAPruning 使权重可以轻松合并。此外,该工作还引入了一种新颖的标准,利用 LoRA 的梯度作为预训练权重梯度的近似,从而实现权重重要性的估计。ProPETL [114]在层和任务之间构建一个共享的原型(例如,适配器、前缀或 LoRA)。此外,ProPETL 学习二进制掩码,以修剪不同层和任务中的不同子网络。因此,参数可以在层和任务之间重复使用,大大提高了参数效率。
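A minimal sketch of such structured pruning of the low-rank factors is given below (our illustration in the spirit of channel-based LoRA pruning; the magnitude-product importance criterion and the function name are assumptions rather than the exact criteria used by SPLoRA or LoRAPruning).

import torch

def prune_lora_channels(A, B, keep_ratio=0.5):
    # Hypothetical structured-pruning sketch: rank the channels of the low-rank
    # factors by a magnitude-based importance score and keep only the top ones.
    # A: (r, in_features), B: (out_features, r).
    importance = B.norm(dim=0) * A.norm(dim=1)          # score for each rank channel, shape (r,)
    r_keep = max(1, int(keep_ratio * A.shape[0]))
    idx = importance.topk(r_keep).indices
    return A[idx, :], B[:, idx]

A = torch.randn(16, 768)
B = torch.randn(768, 16)
A_p, B_p = prune_lora_channels(A, B, keep_ratio=0.25)    # rank 16 -> 4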

C. Quantization Strategies for PEFT
PEFT 的量化策略

Quantization serves as another popular technique for improving computational efficiency and reducing memory usage. For example, by investigating the loss landscape of adapters, BI-Adapter [115] finds that adapters are resistant to noise in parameter space. Building on this insight, the authors introduce a clustering-based quantization approach. Remarkably, they demonstrate that a 1-bit quantization of adapters not only minimizes storage requirements but also achieves superior performance among all precision settings. PEQA (Parameter-Efficient and Quantization-aware Adaptation) [116] uses a two-stage pipeline to achieve parameter-efficient and quantization-aware fine-tuning. In the first stage, the pre-trained FFN weight matrix W is quantized as the product of per-channel scales s and a quantized weight W_q. In the second stage, W_q remains fixed, and fine-tuning is only conducted on the scales s. This approach not only ensures memory efficiency but also facilitates parameter efficiency. QLoRA [117] proposes several novel techniques, including a 4-bit NormalFloat format, Double Quantization, and Paged Optimizers, to backpropagate through a 4-bit quantized pre-trained language model into LoRA. These techniques enable the fine-tuning of a 65B language model on a single 48GB GPU while maintaining performance similar to full 16-bit fine-tuning. Similar to the original implementation [70], QLoRA attaches fixed, zero-initialized LoRA weights to the quantized pre-trained model as the training starting point. However, when applying extreme low-bit (e.g., 2-bit) quantization, the huge quantization error can adversely impact the initialization of LoRA fine-tuning, since the quantized backbone plus the zero-initialized LoRA weights no longer recovers the original pre-trained weights, which will harm the fine-tuning performance as shown in [127]. To solve this, several quantization strategies are proposed to eliminate the quantization error. For example, LoftQ (LoRA-Fine-Tuning-aware Quantization) [118] presents an innovative framework that provides a superior initialization point of quantized backbone weights and LoRA weights for subsequent LoRA fine-tuning. This approach addresses the discrepancies caused by quantization through the optimization of a Frobenius norm objective during network initialization, which takes both the LoRA weights and the quantized pre-trained backbone into consideration. LoftQ exhibits superior performance in 2-bit quantization over QLoRA, as well as greater generalization for downstream tasks. LQ-LoRA [119] uses an iterative algorithm inspired by robust principal component analysis [135], [136], which decomposes the weight W into a quantized component Q that remains fixed and a trainable low-rank component L_1 L_2, so that W ≈ Q + L_1 L_2, in order to resolve the inaccuracy caused by the quantization error. Moreover, this approach leverages integer linear programming to determine a mixed quantization strategy, enabling dynamic quantization configurations for each weight matrix while adhering to a predetermined total bit rate limit. QA-LoRA [120] addresses another limitation of QLoRA, which struggles to preserve its quantized property after fine-tuning. In QLoRA, the quantized pre-trained weights (NF4) have to be recovered to FP16 to match the LoRA weight precision (FP16) during weight merging. Instead, QA-LoRA uses INT4 quantization and introduces group-wise operators to enable quantization during the inference stage, thereby improving efficiency and accuracy compared with QLoRA.
BitDelta [123] introduces a novel 1-bit post-training quantization method that acts on the weight delta between a fine-tuned model and its underlying pre-trained model. Specifically, given the weight matrices W_fine and W_base from the fine-tuned and base models respectively, the weight delta Δ = W_fine − W_base is binarized as α · Sign(Δ). Here, α, a high-precision scalar, is initialized to the mean absolute value of the delta, and Sign(Δ) captures the element-wise sign of Δ. BitDelta further calibrates the scaling factors via distillation on a compact calibration dataset, while the binary matrices remain unchanged. This approach notably streamlines the deployment of multiple fine-tuned models on shared servers by utilizing a single full-precision base model alongside efficiently batched 1-bit deltas.
量化是另一种提高计算效率和减少内存使用的流行技术。例如,通过研究适配器的损失景观,BI-Adapter [115] 发现适配器对参数空间中的噪声具有抗性。基于这一洞察力,作者们引入了一种基于聚类的量化方法。值得注意的是,他们证明了适配器的 1 位量化不仅最小化了存储需求,而且在所有精度设置中实现了卓越的性能。PEQA(参数高效和量化感知调整)[116] 使用两阶段流水线实现参数高效和量化感知微调。在第一阶段,预训练的 FFN 权重矩阵 被量化为 ,其中 表示每通道比例, 表示量化权重。在第二阶段, 保持不变,微调仅在 上进行。这种方法不仅确保了内存效率,还促进了参数效率。 QLoRA [117]提出了几种新颖的技术,包括 4 位 NormalFloat、双量化和分页优化器,以将 4 位量化的预训练语言模型反向传播到 LoRA。这些技术使得在单个 48GB GPU 上对 65B 语言模型进行微调成为可能,同时保持与完整 16 位微调相似的性能。与原始实现[70]类似,QLoRA 将固定的零初始化 LoRA 权重附加到量化的预训练模型上作为训练起点。然而,当应用极低位(例如 2 位)量化时,巨大的量化误差可能会对 LoRA 微调的初始化产生不利影响,即量化 ,这将损害微调性能,如[127]所示。为了解决这个问题,提出了几种量化策略来消除量化误差。例如,LoftQ(LoRA-Fine-Tuningaware 量化)[118]提出了一个创新框架,为后续 LoRA 微调提供了量化骨干权重和 LoRA 权重的优越初始化点。 这种方法通过在网络初始化期间优化 Frobenius 范数目标来解决由量化引起的差异,考虑了 LoRA 权重和量化的预训练骨干。LoftQ 在 2 位量化方面表现出比 QLoRA 更优越的性能,同时对下游任务具有更好的泛化能力。LQ-LoRA [119] 使用受鲁棒主成分分析 [135],[136] 启发的迭代算法,将权重分解为 以解决量化误差引起的不准确性,其中 是保持不变的量化分量, 是可训练的低秩分量。此外,这种方法利用整数线性规划确定混合量化策略,为每个权重矩阵提供动态量化配置,同时遵守预定的总比特率限制。QA-LoRA [120] 解决了 QLoRA 的另一个限制,即在微调后难以保持其量化属性。在 QLoRA 中,量化的预训练权重(NF4)必须恢复为 FP16,以匹配权重合并期间的 LoRA 权重精度(FP16)。 相反,QA-LoRA 使用 INT4 量化并引入分组操作符,以在推断阶段实现量化,从而提高效率和准确性,与 QLoRA 相比。BitDelta [123]引入了一种新颖的 1 位后训练量化方法,作用于经过微调的模型与其基础预训练模型之间的权重增量。具体来说,给定来自经过微调和基础模型的权重矩阵 ,权重增量 被二元化为 。在这里, ,一个高精度标量,基于平均绝对增量值 初始化, 表示 的符号。BitDelta 通过蒸馏在紧凑的校准数据集上进一步校准缩放因子,而二进制矩阵保持不变。这种方法通过利用单个全精度基础模型以及高效批处理的 1 位增量,显着简化了在共享服务器上部署多个经过微调的模型。
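The core of the BitDelta idea can be sketched in a few lines (our illustration under the definitions above; the distillation-based calibration of the scale is omitted, and in practice the sign matrix would be stored in packed 1-bit form).

import torch

def binarize_delta(w_fine, w_base):
    # Hypothetical sketch of BitDelta-style 1-bit compression of the fine-tuning
    # delta: keep only sign(delta) plus one high-precision scalar scale.
    delta = w_fine - w_base
    alpha = delta.abs().mean()              # high-precision scalar, mean absolute delta
    sign = torch.sign(delta)                # 1-bit matrix
    return alpha, sign

def apply_delta(w_base, alpha, sign):
    return w_base + alpha * sign            # approximate fine-tuned weight

w_base = torch.randn(1024, 1024)
w_fine = w_base + 0.01 * torch.randn(1024, 1024)
alpha, sign = binarize_delta(w_fine, w_base)
w_approx = apply_delta(w_base, alpha, sign)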

D. Memory-efficient PEFT Methods
D. 内存高效的 PEFT 方法

Fine-tuning the full LLM necessitates substantial training memory owing to its considerable size. While most PEFT methods primarily target parameter efficiency, they still incur a significant memory overhead during training because gradient computation and backpropagation are still necessary for these methods. For example, according to some literature [125], [130], prevalent PEFT techniques such as adapters and LoRA achieve only limited reductions in training memory usage compared to full model fine-tuning. From a computational perspective, memory efficiency also remains a critical factor that cannot be overlooked.
To improve memory efficiency, various techniques have been developed to minimize the need for caching gradients for the entire LLM during fine-tuning, thereby reducing memory usage. For example, both Side-Tuning [124] and LST (Ladder-Side Tuning) [125] introduce a learnable network branch parallel to the backbone model. By channeling the backpropagation exclusively through this parallel branch, they circumvent the need to store gradient information for the main model's weights, thus markedly reducing memory requirements during training. Similarly, Res-Tuning [126] disentangles the PEFT tuners (e.g., prompt tuning, adapter) from the backbone model. On top of the disentanglement, a memory-efficient fine-tuning framework named Res-Tuning-Bypass is proposed, which generates a bypass network in parallel with the backbone model by removing the data flow from the decoupled tuners to the backbone. This eliminates the requirement for gradient caching within the backbone model during backpropagation. MEFT [127] (memory-efficient fine-tuning) is an approach inspired by reversible models [137]. During the training of a reversible model, intermediate activations are not required to be cached in the forward pass; during backpropagation, they can be recalculated from the final output. To save memory during fine-tuning, MEFT investigates how to transform an LLM into its reversible counterpart without additional pre-training. A critical aspect of this transformation is the careful initialization of the newly-introduced parameters in the pre-trained models. MEFT demonstrates the importance of the parameter initialization, and suggests that these parameters must be initialized in a manner that preserves the pre-trained model's starting point, ensuring that the fine-tuning of the modified model achieves performance on par with full fine-tuning methods. With this key consideration, MEFT introduces three distinct methods, each significantly curtailing the memory demands traditionally required for storing activations. LoRA-FA [128] addresses a limitation concerning memory overhead in LoRA fine-tuning. During training, LoRA modules still require high activation memory consumption. This is because, during backpropagation, large input activations must be stored during the forward pass to compute gradients. LoRA-FA resolves this issue by freezing both the pre-trained weight W and the projection-down weight A, and only updating the projection-up weight B. Consequently, the input activation x no longer needs to be stored, as the intermediate activation Ax is adequate for the gradient computation of B. Given that the LoRA rank r is much smaller than the model dimension, the memory requirement for activations in LoRA-
为了提高内存效率,已经开发了各种技术来最小化在微调期间对整个LLM进行梯度缓存的需求,从而减少内存使用量。例如,Side-Tuning [124] 和 LST(Ladder-Side Tuning)[125] 都引入了一个可学习的网络分支,与主干模型并行。通过仅通过这个并行分支传递反向传播,它规避了需要存储主模型权重的梯度信息,从而在训练期间显着减少了内存需求。类似地,Res-Tuning [126] 将 PEFT 调谐器(例如,提示调整器,适配器)与主干模型解耦。在解耦的基础上,提出了一种名为 Res-Tuning-Bypass 的内存高效微调框架,通过从解耦的调谐器到主干模型的数据流中移除数据流,生成一个与主干模型并行的旁路网络。这消除了在反向传播期间主干模型内部的梯度缓存要求。MEFT [127](内存高效微调)是受可逆模型 [137] 启发的方法。 在可逆模型的训练过程中,不需要在前向传播中缓存中间激活。在反向传播过程中,可以从最终输出中重新计算它们。为了在微调过程中节省内存,MEFT 研究了如何将一个LLM转换为可逆对应项,而无需额外的预训练。这种转换的一个关键方面是在预训练模型中谨慎初始化新引入的参数。MEFT 展示了参数初始化的重要性,并建议这些参数必须以一种方式初始化,以保留预训练模型的起始点,确保修改后模型的微调能够达到与完全微调方法相当的性能。在这个关键考虑下,MEFT 提出了三种不同的方法,每种方法都显著减少了传统上用于存储激活所需的内存需求。LoRA-FA [128] 解决了 LoRA 微调中关于内存开销的限制。在训练过程中,LoRA 模块仍然需要高激活内存消耗。 这是因为在反向传播过程中,需要在前向传播期间存储大的输入激活以计算梯度。LoRA-FA 通过冻结预训练权重 和投影下权重 ,仅更新投影上权重 来解决这个问题。因此,输入激活 不再需要被存储,因为中间激活 足以用于 的梯度计算。鉴于 ,在 LoRA 中激活的内存需求-

FA can be significantly reduced.
FA 可以显著减少。
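A minimal sketch of this idea is shown below (our illustration; the class name LoRAFALinear and the initialization scales are assumptions): A is kept as a frozen buffer, so the only activation that must be retained for the backward pass of B is the small r-dimensional projection Ax.

import torch
import torch.nn as nn

class LoRAFALinear(nn.Module):
    # Hypothetical LoRA-FA sketch: the pre-trained weight and the projection-down
    # matrix A are frozen; only the projection-up matrix B is trained, so gradient
    # computation needs only the small intermediate activation A x.
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        in_f, out_f = base.in_features, base.out_features
        self.register_buffer("A", torch.randn(r, in_f) * 0.01)   # frozen projection-down
        self.B = nn.Parameter(torch.zeros(out_f, r))             # trainable projection-up
    def forward(self, x):
        z = x @ self.A.t()          # (batch, r): the only activation B's gradient needs
        return self.base(x) + z @ self.B.t()

layer = LoRAFALinear(nn.Linear(768, 768))
y = layer(torch.randn(4, 768))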
To further reduce memory usage during fine-tuning, some methods attempt to circumvent backpropagation within LLMs to address this issue. HyperTuning [129] employs a HyperModel to generate PEFT parameters using only few-shot examples. This approach demonstrates results comparable to those obtained through full model fine-tuning. PEFT Plug-in [130] first trains PEFT modules on small language models, which is more memory efficient compared to training on large ones. Subsequently, the research introduces a suite of techniques for seamlessly integrating these trained PEFT modules into LLMs during inference. This strategy effectively circumvents the necessity of gradient-based optimization directly on the larger models, resulting in substantial memory savings. However, it is important to note that both HyperModel and PEFT Plug-in still require additional model training, and this training cost cannot be entirely overlooked. MeZO [131] introduces a memory-efficient zeroth-order (ZO) optimizer for LLMs. Unlike conventional PEFT techniques, which rely on backpropagation to compute gradients for updating model parameters, MeZO fine-tunes LLMs through only forward passes. It accomplishes this by employing a ZO gradient estimator to calculate the gradient. Notably, MeZO implements an in-place solution for the classic ZO gradient estimator, effectively mitigating memory consumption during execution. This innovative approach allows for efficient fine-tuning of LLMs containing 30 billion parameters on a single GPU with 80GB of memory, all while maintaining performance that is comparable to fine-tuning using backpropagation. Furthermore, it can substantially decrease storage demands in comparison to traditional PEFT methods such as LoRA and Adapter.
为了进一步减少微调过程中的内存使用,一些方法尝试规避LLMs内的反向传播来解决这个问题。HyperTuning [129] 使用超模型生成 PEFT 参数,仅使用少量示例。这种方法展示了与完全模型微调获得的结果相媲美的效果。PEFT 插件 [130] 首先在小语言模型上训练 PEFT 模块,与在大模型上训练相比更节省内存。随后,研究引入了一系列技术,无缝地将这些训练好的 PEFT 模块集成到LLMs中进行推断。这种策略有效地规避了在更大模型上直接进行基于梯度的优化的必要性,从而实现了大量的内存节省。然而,值得注意的是,HyperModel 和 PEFT 插件仍然需要额外的模型训练,这种训练成本不能完全忽视。MeZO [131] 引入了一种内存高效的零阶(ZO)优化器用于LLMs。与传统的 PEFT 技术不同,后者依赖于反向传播来计算更新模型参数的梯度,MeZO 通过仅进行前向传递来微调LLMs。 它通过使用 ZO 梯度估计器来计算梯度来实现这一点。值得注意的是,MeZO 实现了经典 ZO 梯度估计器的原地解决方案,有效地减少了推理执行过程中的内存消耗。这种创新方法允许在单个 GPU 上高效地微调包含 300 亿参数的LLMs,并且在保持与使用反向传播进行微调相当的性能的同时,不占用任何内存。此外,与传统的 PEFT 方法(如 LoRA 和 Adapter)相比,它可以大幅减少存储需求。
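The following sketch (our simplified, single-direction variant of an SPSA-style zeroth-order step; the function name and hyperparameters are assumptions) conveys the key trick: the random perturbation direction is regenerated from a seed instead of being stored, so the update needs only forward passes and no extra gradient memory.

import torch

@torch.no_grad()
def mezo_step(params, loss_fn, lr=1e-6, eps=1e-3, seed=0):
    # Hypothetical MeZO-style zeroth-order update: the gradient is estimated from
    # two forward passes at theta + eps*z and theta - eps*z, and the random
    # direction z is replayed from a seed rather than kept in memory.
    def perturb(scale):
        torch.manual_seed(seed)
        for p in params:
            p.add_(scale * eps * torch.randn_like(p))

    perturb(+1.0)
    loss_plus = loss_fn()
    perturb(-2.0)                 # move from theta + eps*z to theta - eps*z
    loss_minus = loss_fn()
    perturb(+1.0)                 # restore the original parameters
    g = (loss_plus - loss_minus) / (2 * eps)

    torch.manual_seed(seed)       # replay the same direction z to apply the update in place
    for p in params:
        p.add_(-lr * g * torch.randn_like(p))

# usage sketch: params = list(model.parameters()); mezo_step(params, lambda: compute_loss(model))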

V. PEFT FOR DNNS OF OTHER APPLICATIONS
V. 用于其他应用程序的深度神经网络的 PEFT

In Section III, we outlined four categories of PEFT methods along with their improvements. Nonetheless, our discussion did not fully extend to the utilization or adaptation of PEFT techniques beyond traditional architectures (e.g., LLMs) or standard benchmarks (e.g., the GLUE dataset), where the majority of the discussed PEFT methods are applied. Therefore, in this section, we will highlight and discuss several most representative works that leverages PEFT strategies for various downstream tasks. We do not aim to cover all PEFT application scenarios in this section. Our objective is to showcase the significant influence of PEFT within various research domains, and demonstrate how to optimize and tailor general-purpose PEFT methods to achieve enhanced performance in specific models or tasks.
在第三部分中,我们概述了四类 PEFT 方法及其改进。然而,我们的讨论并没有完全延伸到 PEFT 技术在传统架构(例如LLMs)或标准基准(例如 GLUE 数据集)之外的利用或适应,大多数讨论的 PEFT 方法被应用的地方。因此,在本节中,我们将重点介绍和讨论几项利用 PEFT 策略进行各种下游任务的代表性作品。我们并不打算在本节中涵盖所有 PEFT 应用场景。我们的目标是展示 PEFT 在各个研究领域中的重要影响,并演示如何优化和定制通用 PEFT 方法,以实现特定模型或任务的增强性能。
Typically, fine-tuning happens when adapting a pre-trained backbone model to specialized downstream tasks. To this end, this section organizes the discussion around various model architectures, which include: LLM, Vision Transformer (ViT), Vision-Language Alignment Model (VLA), and Diffusion model. Within each architectural category, the discussion is further classify based on different downstream tasks.
通常,微调发生在将预训练的骨干模型适应专门的下游任务时。为此,本节围绕各种模型架构进行讨论,包括:LLM,Vision Transformer(ViT),Vision-Language Alignment Model(VLA)和 Diffusion 模型。在每个架构类别中,讨论进一步根据不同的下游任务进行分类。

A. PEFT for LLMs - Beyond the Basics
A. PEFT for LLMs - 超越基础

Beyond common tasks in NLP such as NLU and NLG, PEFT techniques boast a wide array of applications across
与 NLP 中常见的任务(如 NLU 和 NLG)不同,PEFT 技术在各个领域都有广泛的应用

diverse scenarios. PEFT has been successfully implemented in commonsense question answering [138], [139], multi-level implicit discourse relation recognition [140], out-of-distribution detection [141], privacy protection [142], [143], federated learning [144], and social biases mitigation [145]. In this section, we pay more focus on three representative downstream tasks: visual instruction following, continual learning, context window extension.
PEFT 已成功应用于常识问题回答[138],多层次隐式话语关系识别[140],分布外检测[141],隐私保护[142],联邦学习[144]和社会偏见缓解[145]等多种场景。在本节中,我们更加关注三个代表性的下游任务:视觉指令跟随,持续学习,上下文窗口扩展。
  1. Visual Instruct Following: Several studies, including VL-BART [146], MiniGPT-4 [147], and LLaVA [148], have successfully extended the capabilities of LLMs, initially designed for pure text, to comprehend and generate responses to visual inputs. These enhanced models, namely visual instruction-following LLMs, can process both images and text to produce textual responses, which can be benchmarked on tasks such as image captioning [149], [150], [151], [152] and visual question answering (VQA) [153], [154], [155]. However, these methods fine-tune the entire LLM to learn the visual representations, which can be inefficient in both time and memory. Therefore, it is natural to apply PEFT techniques in the fine-tuning of visual instruction-following LLMs. An earlier work, VL-Adapter [156], directly applies several PEFT methods (Adapter [25], Hyperformer [34] and Compacter [71]) on VL-BART [146], then benchmarks them on several image-text and video-text tasks. Results show that vanilla adapters are the best among them, achieving performance on par with full fine-tuning. However, considering the functionality gap between the encoders and decoders in VL-BART, directly assigning identical modular modifications leads to suboptimal performance. Therefore, VL-PET [157] selectively integrates PEFT modules into different components of the encoder and decoder. They also introduce a granularity-controlled mechanism for finer-grained control.
    视觉指导以下:包括 VL-BART [146],MiniGPT-4 [147]和 LLaVA [148]在内的几项研究已成功地扩展了最初设计用于纯文本的LLMs的功能,以理解和生成对视觉输入的响应。这些增强型模型,即视觉指导以下LLMs,可以处理图像和文本以生成文本响应,可以在诸如图像字幕[149],[150],[151],[152]和视觉问答(VQA)[153],[154],[155]等任务上进行基准测试。然而,这些方法微调整个LLM以学习视觉表示,这在时间和内存方面可能效率低下。因此,在视觉指导以下LLMs的微调中自然而然地应用 PEFT 技术。早期的工作 VL-Adapter [156]直接在 VLBART [146]上应用了几种 PEFT 方法(Adapter [25],Hyperformer [34]和 Compacter [71]),然后在几个图像文本和视频文本任务上进行基准测试。结果显示,普通适配器在其中表现最佳,可以达到与完全微调相当的性能。 然而,考虑到 VL-BART 中编码器和解码器之间的功能差距,直接分配相同的模块修改会导致性能不佳。因此,VL-PET [157] 选择性地将 PEFT 模块集成到编码器和解码器的不同组件中。他们还引入了一个粒度受控机制,用于更精细地控制。
To adapt the recently prevalent LLaMA model, LLaMA-Adapter [158] prepends a set of learnable prompts (similar to prefix tuning) to the input tokens in LLaMA's higher Transformer layers. To avoid unstable fine-tuning with large loss values at early training stages, instead of the randomly initialized weights of other PEFT methods, LLaMA-Adapter adopts a zero-initialized attention mechanism, which learns a zero-initialized gating factor to adaptively control the contribution of the adaptation prompts to the word tokens. This keeps the fine-tuning starting point the same as the original model and progressively injects new knowledge into the model; a similar idea can be found in MEFT [127] and LoftQ [118] discussed earlier. To represent visual information, LLaMA-Adapter extracts multi-scale global image features using a CLIP image encoder, then projects them into the linguistic embedding space. After that, the feature is element-wisely added onto the adaptation prompts at all inserted Transformer layers. LLaMA-Adapter only introduces 1.2M learnable parameters in LLaMA-7B, and costs less than one hour for fine-tuning on 8 A100 GPUs. A follow-up work, LLaMA-Adapter V2 [159], demonstrates that the simple multimodal fusion in LLaMA-Adapter cannot generalize to more challenging open-ended multimodal reasoning tasks, where the visual cues tend to dominate the adaptation prompts over the language instruction data. To address this, LLaMA-Adapter V2 decouples the learning of the instruction-following ability (to generate long language responses) from vision-language alignment to avoid interference between visual and language fine-tuning. Specifically, LLaMA-Adapter V2 sets disjoint parameter groups which are respectively learned from image-text pairs and language instruction data. The visual adaptation prompts are inserted in the early stages of the LLM, while the language adaptation prompts remain at the higher Transformer layers, similar to LLaMA-Adapter. Additionally, LLaMA-Adapter V2 introduces more learnable parameters and several expert systems (e.g., captioning, detection, and OCR) to enhance multimodal performance. LayerNorm Tuning [160] adjusts only the weights of the LayerNorm within each attention block. This straightforward technique can achieve comparable or even better performance than full fine-tuning, while offering substantially greater parameter efficiency than LoRA.
为了适应最近流行的 LLaMA 模型,LLaMAAdapter [158] 在 LLaMA 的更高 transformer 层中的输入标记之前添加了一组可学习的提示(类似于前缀调整)。为了避免在早期训练阶段出现大损失值的不稳定微调,LLaMA-Adapter 采用了一个零初始化的注意机制,而不是其他 PEFT 方法的随机初始化权重,该机制学习了一个零初始化的门控因子,以自适应地控制适应提示对单词标记的贡献。这可以保持微调的起点与原始模型相同,并逐渐向模型注入新知识,这个类似的想法可以在之前讨论的 MEFT [127] 和 LoftQ [118] 中找到。为了表示视觉信息,LLaMAAdapter 使用 CLIP 图像编码器提取多尺度全局图像特征,然后将这些特征投影到语言嵌入空间。之后,该特征被逐元素地添加到所有插入的 transformer 层中的适应提示上。LLaMA-Adapter 仅在 LLaMA-7B 中引入 可学习参数,并在 8 个 A100 GPU 上进行微调不到一小时的时间。 以下工作 LLaMA-Adapter V2 [159] 表明,LLaMAAdapter 中的简单多模态融合无法推广到更具挑战性的开放式多模态推理任务,其中视觉线索往往比语言指令数据更占主导地位。为了解决这个问题,LLaMA-Adapter V2 将指令遵循能力(生成长语言响应)和视觉-语言对齐的学习分开,以避免视觉和语言微调之间的干扰。具体来说,LLaMA-Adapter V2 设置了不同的参数组,分别从图像文本对和语言指令数据中学习。视觉适应提示插入在LLM的早期阶段,而语言适应提示保持在更高的变压器层,类似于 LLaMAAdapter。此外,LLaMA-Adapter V2 引入了更多可学习的参数和几个专家系统(例如字幕、检测和 OCR)来增强多模态性能。LayerNorm 调整 [160] 仅调整每个注意力块内的 LayerNorm 权重。 这种直接的技术可以实现与微调相当甚至更好的性能,同时比 LoRA 提供约 个更高的参数效率。
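A simplified sketch of zero-initialized attention follows (our illustration; the real LLaMA-Adapter folds the gate into a joint softmax over prompts and tokens, whereas here the prompt branch is gated separately for brevity, and all names and shapes are hypothetical).

import torch
import torch.nn as nn

class ZeroInitPromptAttention(nn.Module):
    # Hypothetical sketch of LLaMA-Adapter-style zero-initialized attention: a
    # learnable gate, starting at zero, scales the attention contribution of the
    # prepended adaptation prompts so fine-tuning starts from the original model.
    def __init__(self, d: int, n_prompt: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompt, d) * 0.02)  # adaptation prompts
        self.gate = nn.Parameter(torch.zeros(1))                      # zero-initialized gating factor
    def forward(self, q, k, v):
        # q, k, v: (seq, d) for a single head; prompts act as extra keys/values
        scale = q.shape[-1] ** 0.5
        attn_tok = torch.softmax(q @ k.t() / scale, dim=-1) @ v
        attn_prm = torch.softmax(q @ self.prompts.t() / scale, dim=-1) @ self.prompts
        return attn_tok + self.gate * attn_prm                        # contributes nothing at init

layer = ZeroInitPromptAttention(d=64, n_prompt=10)
out = layer(torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64))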
  2. Continual Learning (CL): CL aims to learn a sequence of new tasks over time within one single model, which has broad application in scenarios such as dialogue systems [161], information extraction systems [162], and question answering systems [163]. The main challenge in CL is catastrophic forgetting [164]. A popular practice, called architecture-based methods, tackles CL by maintaining task-specific parameters in the model for each new task. Therefore, it is natural to leverage PEFT methods for CL tasks [165], [166], [167], [168]. For example, AdapterCL [165] parameterizes each new task using residual adapters. During testing, since the task-id is not provided, AdapterCL uses an entropy-based classifier to select which adapter to use for accomplishing a specific task. CPT (Continual Prompt Tuning) [166] trains a soft prompt for each task. Instead of training soft prompts from scratch, CPT proposes a series of techniques (continual prompt initialization, query fusion, memory replay, and a memory-guided technique) to achieve knowledge transfer from preceding and subsequent tasks. O-LoRA (orthogonal low-rank adaptation) [169] employs a strategy of learning distinct tasks within separate low-rank vector subspaces that are kept orthogonal to each other in order to minimize interference. This approach can effectively reduce catastrophic forgetting during the acquisition of new tasks.
    持续学习(CL):CL 旨在通过一个单一模型随时间学习一系列新任务,这在对话系统[161]、信息提取系统[162]和问答系统[163]等场景中具有广泛应用。CL 面临的主要挑战是灾难性遗忘[164]。一种流行的做法,称为基于架构的方法,通过在模型中为每个新任务维护特定于任务的参数来解决 CL。因此,自然而然地可以利用 PEFT 方法来处理 CL 任务[165],[166],[167],[168]。例如,AdapterCL [165]使用残余适配器对每个新任务进行参数化。在测试期间,由于未提供任务 ID,AdapterCL 使用基于熵的分类器来选择用于完成特定任务的适配器。CPT(持续提示调整)[166]为每个任务训练一个软提示。CPT 提出了一系列技术(持续提示初始化、查询融合、记忆重放和记忆引导技术)来实现从先前和后续任务中进行知识转移,而不是从头开始训练软提示。 O-LoRA(正交低秩适应)[169]采用了一种学习不同任务的策略,这些任务位于保持彼此正交的低秩向量子空间中,以减少干扰。这种方法可以有效地减少在学习新任务时发生的灾难性遗忘。
  3. Context Window Extension: LLMs are typically trained with a pre-defined context size. For example, LLaMA and LLaMA2 have pre-defined context sizes of 2048 and 4096 tokens, respectively. The positional encoding RoPE has weak extrapolation properties [170], which means the performance drops noticeably once an input length exceeds the pre-defined context length. To solve this, a naive solution is to fine-tune a pre-trained LLM on longer contexts. However, this escalates computational costs quadratically with context size, straining memory and processing resources. To address this, LongLoRA [171] proposes to fine-tune a pre-trained LLM using LoRA to enlarge the context size. To reduce the perplexity gap between LoRA tuning and full fine-tuning, LongLoRA also opens the embedding and normalization layers for training. In order to further improve training efficiency in the long-context scenario, LongLoRA further introduces a novel shifted sparse attention (S2-Attn) as an efficient substitute for standard self-attention during training. A subsequent study,
    上下文窗口扩展: LLMs 通常是使用预定义的上下文大小进行训练。例如,LLaMA 和 LLaMA2 的预定义上下文大小分别为 2048 和 4096 个标记。位置编码 RoPE 具有弱外推特性[170],这意味着在输入长度超过预定义上下文长度时,性能明显下降。为了解决这个问题,一个简单的解决方案是对预训练的 LLM 进行微调以适应更长的上下文。然而,随着上下文大小的增加,这将使计算成本呈二次方增长,给内存和处理资源带来压力。为了解决这个问题,LongLoRA [171] 提出了使用 LoRA 对预训练的 LLM 进行微调以扩大上下文大小。为了减少 LoRA 调整和完全微调之间的困惑差距,LongLoRA 还为训练打开了嵌入和归一化层。为了进一步提高长上下文场景中的训练效率,LongLoRA 进一步引入了一种新颖的移位稀疏注意力( -Attn)作为训练过程中标准自注意力的高效替代品。随后的研究
LongQLoRA [172], combines the advantages of LongLoRA with QLoRA and Position Interpolation [10] to save GPU memory. This work successfully extends the context length of LLaMA2-13B from 4096 to 8192 on a single V100 with 32GB memory. LLoCO [173] introduces a pipeline that learns contexts offline through the combination of context compression and LoRA. The process begins by compressing documents into compact contexts, then fine-tuning the LLM using LoRA on the compacted contexts to improve the LLM's ability to accurately extract and utilize information from these compressed representations. During model serving, a standard RAG retriever selects both the compressed document and the most relevant LoRA module, and applies them to the LLM for inference. This approach effectively extends the context window of a 4k-token LLaMA2-7B model to handle up to 128k tokens.
LongQLoRA [172] 将 LongLoRA 与 QLoRA 和 Position Interpolation [10] 的优势相结合,以节省 GPU 内存。这项工作成功地将 LLaMA2-13B 的上下文长度从 4096 扩展到了单个 V100(32GB 内存)上的 8192。LLoCO [173] 引入了一个通过上下文压缩和 LoRA 的组合离线学习上下文的流水线。该过程从将文档压缩为紧凑上下文开始,然后在紧缩上下文上使用 LoRA 进行微调,以提高模型准确提取和利用这些压缩表示的能力。在模型服务期间,标准的 RAG 检索器选择压缩文档和最相关的 LoRA 模块,并将它们应用于推理的 LLM。这种方法有效地将 令牌 LLaMA2-7B 模型的上下文窗口扩展到处理多达 个令牌。
In addition to the limited training-stage sequence length, real-world system memory constraints introduce another critical bottleneck to the context window. Specifically, the capacity of the KV-cache is curtailed by the available system memory. For example, a 30B-parameter LLM operating with an input length of 1024 and a batch size of 128 might necessitate a prohibitively large KV-cache [174], thereby restricting the feasible size of the context window. In response to this, some strategies have resorted to quantizing the KV-cache [134], [175], but quantization will certainly compromise performance. To effectively counteract this issue without significant loss, GEAR [176] presents a novel approach by employing a low-rank matrix to capture the majority of coherent bases of the quantization error, complemented by a sparse matrix that addresses errors from outlier entries, thus efficiently minimizing approximation errors.
除了有限的训练阶段序列长度外,现实世界系统内存限制引入了上下文窗口的另一个关键瓶颈。具体来说,可用系统内存限制了 KV 缓存的容量。例如,一个具有 1024 个输入长度和 128 个批量大小的 30B 参数LLM可能需要高达 的 KV 缓存[174],从而限制了上下文窗口的可行大小。为了应对这一问题,一些策略已经采用了对 KV 缓存进行量化[134],[175],但量化肯定会影响性能。为了有效地解决这个问题而又不会有显著损失,GEAR [176]提出了一种新颖的方法,即利用低秩矩阵来捕获量化误差的大部分一致基础,辅以一个稀疏矩阵来处理异常条目的错误,从而有效地最小化近似误差。
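A rough sketch of this error-compensation idea is given below (our illustration with uniform quantization and a truncated SVD; the exact quantizer, rank, and outlier selection used in GEAR differ).

import torch

def gear_compress(kv, bits=4, rank=2, outlier_ratio=0.01):
    # Hypothetical sketch in the spirit of GEAR: uniformly quantize a 2-D KV
    # tensor, then approximate the quantization error with a low-rank matrix plus
    # a sparse matrix holding the largest outlier errors.
    lo, hi = kv.min(), kv.max()
    scale = (hi - lo) / (2 ** bits - 1)
    q = torch.round((kv - lo) / scale)
    err = kv - (q * scale + lo)                              # quantization error
    U, S, Vh = torch.linalg.svd(err, full_matrices=False)
    low_rank = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]
    residual = err - low_rank
    k = max(1, int(outlier_ratio * residual.numel()))
    thresh = residual.abs().flatten().topk(k).values.min()
    sparse = torch.where(residual.abs() >= thresh, residual, torch.zeros_like(residual))
    # reconstruction: q * scale + lo + low_rank + sparse
    return q, scale, lo, low_rank, sparse

kv = torch.randn(512, 64)
parts = gear_compress(kv)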

B. PEFT for ViTs
B. PEFT 用于 ViTs

ViT [177] has emerged as a powerful backbone model in the recent computer vision community. In the ViT model, images are treated as sequences of fixed-size patches, analogous to how an LLM uses discrete tokens. These patches undergo linear embedding and then receive positional encodings. Subsequently, they are processed through standard Transformer encoders. The training of ViT can be supervised [177], [178] or self-supervised [179], [180], and ViT can achieve superior performance when trained with more data and larger model sizes [181]. However, such scaling up inevitably escalates training and storage costs. Therefore, similar to LLMs, PEFT is widely implemented in various downstream tasks, such as dense prediction [182], continual learning [183], [184], and deep metric learning [185]. Here, we focus on two typical tasks to showcase the involvement of PEFT: image classification and video recognition.
ViT [177] 已成为最近计算机视觉社区中强大的骨干模型。在 ViT 模型中,图像被视为一系列固定大小的补丁,类似于 LLM 使用离散标记。这些补丁经过线性嵌入,然后接收位置编码。随后,它们通过标准 Transformer 编码器进行处理。ViT 的训练可以是有监督的 [177],[178],也可以是自监督的 [179],[180],在使用更多数据和更大模型尺寸进行训练时,ViT 可以实现更优越的性能 [181]。然而,这种扩展不可避免地会增加训练和存储成本。因此,类似于 LLMs,PEFT 在各种下游任务中得到了广泛实施,例如密集预测 [182],持续学习 [183],[184],深度度量学习 [185]。在这里,我们专注于两个典型任务,以展示 PEFT 的参与:图像分类和视频识别。
  1. Image Classification: Image classification on targeted visual datasets is a very common demand and has extensive applications, and the pre-train-then-fine-tune paradigm serves as a widespread strategy. A variety of methods leverage PEFT techniques to achieve efficient model tuning [186], [182], [187], [188]. For instance, AdaptFormer [187] inserts adapter modules in parallel to the FFN of the original ViT model for visual recognition tasks. VPT (Visual Prompt Tuning) [186] prepends a small number of task-specific parameters to the input sequence of each Transformer layer. When applying ViT to downstream tasks, only these added parameters and the classification head are set to trainable. The work in [189] notices that, compared with supervised ViT, VPT often underperforms with self-supervised ViT. Further analysis demonstrates that different pre-training methods and downstream tasks have varying degrees of dependency on Transformer blocks at different locations. To tackle this issue, the research introduces adaptable gates for ViT blocks. These gates dynamically modulate the contribution of prompt tokens to ViT blocks, allowing for a more targeted adaptation of the model to the task at hand.
    图像分类:针对特定视觉数据集的图像分类是一种非常常见的需求,并具有广泛的应用,而预训练然后微调范式被视为一种普遍的策略。各种方法利用 PEFT 技术实现高效的模型调整[186],[182],[187],[188]。例如,AdaptFormer [187] 在原始 ViT 模型的 FFN 旁插入适配器模块,用于视觉识别任务。VPT(Visual Prompt Tuning)[186] 在每个 Transformer 层的输入序列中添加少量任务特定参数。在将 ViT 应用于下游任务时,只有这些添加的参数和分类头被设置为可训练。[189]的研究发现,与监督 ViT 相比,VPT 通常在自监督 ViT 下表现不佳。进一步的分析表明,不同的预训练方法和下游任务对不同位置的 transformer 块有不同程度的依赖。为了解决这个问题,该研究引入了 ViT 块的可适应门。 这些门动态调节即时令牌对 ViT 块的贡献,从而实现模型更有针对性地适应当前任务。
  2. Video Recognition: Several works consider the more challenging adaptation problem that transfer ViT to downstream tasks that has a much larger domain gap. For example, ST-Adapter (Spatio-Temporal Adapter) [190] and AIM [191] both insert adapters layers into pre-trained ViT blocks. Their primary goal is to model spatial-temporal information, thereby enabling efficient adaptation of ViTs from image models to video tasks. Notably, both methodologies have exhibited performance that surpasses traditional full-model fine-tuning approaches.
    视频识别:一些作品考虑了更具挑战性的适应问题,即将 ViT 转移到具有更大领域差距的下游任务。例如,ST-Adapter(时空适配器)[190] 和 AIM [191] 都将适配器层插入预训练的 ViT 块中。它们的主要目标是建模时空信息,从而实现从图像模型到视频任务的高效适应。值得注意的是,这两种方法都展示出了超越传统全模型微调方法的性能。

C. PEFT for VLAs
C. PEFT 用于 VLA

Vision-Language alignment models (VLA), such as CLIP [192], ALIGN [193], DeCLIP [194], and FLAVA [195], are designed to learn good image and text features that can be aligned within a unified representation space. Each VLA typically consists of separate image and text encoders that extract the respective features. Contrastive learning is leveraged in these models to effectively align the image and text features. Fine-tuning is leveraged to improve the performance of VLAs on specific datasets or tasks, but fine-tuning the full model is computationally intensive. For instance, fine-tuning CLIP RN50x64 requires a batch size of 32,768 and 18 days of training on 592 V100 GPUs [192]. Moreover, full fine-tuning on smaller datasets often leads to catastrophic forgetting [164]. In response to these challenges, and drawing inspiration from the success of PEFT techniques in NLP, a range of PEFT strategies have been proposed and implemented in VLA models, such as semantic segmentation [196], [197], [198], point cloud understanding [199], [200], [201], [202], video understanding [203], [204], [205], visual reasoning [206], [207], temporal action detection [208], to name a few. This section will focus on one common task that uses VLAs: open-vocabulary image classification.
视觉-语言对齐模型(VLA),如 CLIP [192],ALIGN [193],DeCLIP [194]和 FLAVA [195],旨在学习良好的图像和文本特征,这些特征可以在统一的表示空间内对齐。每个 VLA 通常包括单独的图像和文本编码器,用于提取各自的特征。这些模型利用对比学习来有效地对齐图像和文本特征。微调用于提高 VLA 在特定数据集或任务中的性能,但对整个模型进行微调计算密集。例如,微调 CLIP RN50x64 需要批量大小为 32,768,并在 592 个 V100 GPU 上进行 18 天的训练[192]。此外,在较小数据集上进行完全微调通常会导致灾难性遗忘[164]。针对这些挑战,并受到 NLP 中 PEFT 技术成功的启发,一系列 PEFT 策略已被提出并实施在 VLA 模型中,例如语义分割[196],[197],[198],点云理解[199],[200],[201],[202],视频理解[203],[204],[205],视觉推理[206],[207],时间动作检测[208]等。 本节将重点讨论一个使用 VLA 的常见任务:开放词汇图像分类。
  1. Open-vocabulary Image Classification: In open-vocabulary image classification, earlier works design class-specific prompts, e.g., "a photo of a [CLASS]", for each category, and rank images based on their similarity to these textual descriptions. CoOp (Context Optimization) [209] replaces the handcrafted text prompt with learnable vectors, while keeping the entire VLA fixed during training. CoCoOp (Conditional Context Optimization) [210] builds on this by tackling CoOp's limitations in generalizing to unseen classes. It introduces a lightweight neural network that generates an
    开放词汇图像分类:在开放词汇图像分类中,早期的作品设计了特定类别的提示,例如每个类别的一张照片,然后根据这些文本描述的相似性对图像进行排名。CoOp(上下文优化)[209]用可学习的向量替换手工制作的文本提示,同时在训练期间保持整个 VLA 固定。CoCoOp(条件上下文优化)[210]在此基础上解决了 CoOp 在泛化到未见类别方面的局限性。它引入了一个轻量级神经网络,用于生成

    input-specific context token, dynamically adapting the prompt based on each image, thereby enhancing generalizability, but at the cost of increased computational demands due to the instance-aware operation. ProGrad [211] addresses the over-fitting risk in the few-shot setting by regularizing the soft prompt updates: it only applies updates whose gradient is aligned with (or at least not conflicting with) the general knowledge offered by the original prompt. MaPLe [212] notes that existing methods learn prompts either in the language or in the vision branch of CLIP, which does not efficiently leverage the multimodal nature of VLAs. To address this, MaPLe proposes branch-aware hierarchical prompts that simultaneously adapt both the language and vision branches, and achieves superior performance. TPT (test-time prompt tuning) [213] studies prompt tuning on the fly without additional training samples. Specifically, during inference, TPT first augments the input image into various views, which are then utilized to tune the learnable prompts. The primary training objective is to ensure that the VLA generates consistent responses when faced with these differing views. A follow-up work, DiffTPT [214], further enhances the data diversity of test samples through diffusion models.
    根据每个图像动态调整提示,从而增强泛化能力,但由于实例感知操作而增加计算需求的成本。ProGrad [211] 通过规范化软提示更新来解决少样本设置中的过拟合风险,其梯度与仅更新梯度与原始提示提供的通用知识一致(或不冲突)的提示对齐。MaPLe [212] 指出现有方法要么在 CLIP 的语言分支中学习提示,要么在视觉分支中学习提示,这不利于利用 VLAs 的多模态特性。为了解决这个问题,MaPLe 提出了分支感知的分层提示,同时调整语言和视觉分支,并取得了卓越的性能。TPT(测试时提示调整)[213] 研究了即时提示调整,无需额外的训练样本。具体来说,在推断过程中,TPT 首先将输入图像增强为各种视图,然后利用这些视图来调整可学习的提示。 主要的培训目标是确保 VLA 在面对这些不同观点时能够产生一致的回应。后续工作 DiffTPT [214] 通过扩散模型进一步增强了测试样本的数据多样性。
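To illustrate prompt learning of this kind, the sketch below (our illustration; LearnableContext and the embedding shapes are assumptions) keeps the CLIP encoders frozen and optimizes only a set of shared context vectors that are prepended to each class-name embedding, in the spirit of CoOp.

import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    # Hypothetical CoOp-style sketch: M learnable context vectors are shared
    # across classes and prepended to each class-name embedding; the CLIP text
    # and image encoders themselves stay frozen.
    def __init__(self, n_ctx: int, dim: int, class_embeds: torch.Tensor):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)   # replaces the handcrafted prompt
        self.register_buffer("class_embeds", class_embeds)        # (n_cls, n_tok, dim), frozen
    def forward(self):
        n_cls = self.class_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        return torch.cat([ctx, self.class_embeds], dim=1)         # fed to the frozen text encoder

class_embeds = torch.randn(10, 8, 512)            # 10 classes, 8 name tokens, dim 512
prompts = LearnableContext(n_ctx=4, dim=512, class_embeds=class_embeds)()
# logits would then be cosine similarities between frozen image features and the encoded prompts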
In another direction, several studies explore the usage of adapters in VLAs. For example, CLIP-Adapter [215] integrates residual-style adapters after CLIP's text and visual encoders. Therefore, unlike CoOp and CoCoOp, CLIP-Adapter avoids gradient backpropagation through CLIP's encoders, leading to reduced computational requirements in terms of both training memory and time. Tip-Adapter [216] adopts the same design as CLIP-Adapter. Different from CLIP-Adapter, the adapter weights are obtained in a training-free manner from a query-key cache model [217], [218] constructed non-parametrically from few-shot supervision. As a result, Tip-Adapter exhibits great efficiency compared to CLIP-Adapter's SGD training process.
另一方面,有几项研究探讨了在 VLA 中使用适配器的情况。例如,CLIP-Adapter [215] 在 CLIP 的文本和视觉编码器之后集成了残差风格的适配器。因此,与 CoOp 和 CoCoOp 不同,CLIPAdapter 避免了通过 CLIP 的编码器反向传播的梯度,从而减少了在训练内存和时间方面的计算需求。Tip-Adapter [216] 采用了与 CLIP-Adapter 相同的设计。与 CLIP-Adapter 不同,适配器的权重是以无需训练的方式从一个查询-键缓存模型 [217],[218] 中以非参数化方式构建的少样本监督中获得的。因此,与 CLIP-Adapter 的 SGD 训练过程相比,Tip-Adapter 表现出更高的效率。

D. PEFT for Diffusion Models
扩散模型的 D. PEFT

Diffusion models [219], [220] are a class of generative models that learn to generate data by transforming random noise into a structured output through a progressive denoising process. During training, diffusion models learn to reverse the noise added to the training data using a denoising network, while in inference, they start from noise and use the denoising network to iteratively create data that mirrors the same distribution as the training examples. Diffusion models have various applications [221], [222], [223], [224], [225], while the most notable is Stable Diffusion [226], which bridges the gap between text and image with its robust capability to generate coherent and contextually relevant images directly from textual descriptions. Numerous studies leverage PEFT techniques to adapt a pre-trained diffusion model for downstream tasks, including accelerating sampling speed [227], [228], text-to-video adaptation [229], [230], text-to-3D adaptation [231], etc. This section mainly focuses on two scenarios: integrating additional input modalities beyond mere text-based conditioning, and customizing content generation based on a pre-trained diffusion model.
扩散模型[219],[220]是一类生成模型,通过渐进去噪过程,学习将随机噪声转化为结构化输出以生成数据。在训练过程中,扩散模型学习使用去噪网络逆转添加到训练数据中的噪声,而在推断中,它们从噪声开始,使用去噪网络迭代地创建与训练示例相同分布的数据。扩散模型具有各种应用[221],[222],[223],[224],[225],其中最显著的是稳定扩散[226],它通过其强大的能力直接从文本描述中生成连贯且具有上下文相关性的图像,弥合了文本和图像之间的差距。许多研究利用 PEFT 技术来调整预训练的扩散模型以用于下游任务,包括加速采样速度[227],[228],文本到视频的适应[229],[230],文本到 3D 的适应[231]等。本节主要关注两种情景:整合除了纯文本条件之外的额外输入模态,以及基于预训练的扩散模型定制内容生成。
  1. Additional Input Control: To incorporate additional input modalities (e.g., layout, keypoints) while retaining the extensive knowledge in the pre-trained model, GLIGEN introduces a novel approach, which maintains the original model's weights intact and integrates new, trainable gated Transformer layers [232] that take in the new grounding input. The resulting model can not only accurately represent the grounding conditions but also produce high-quality images. Remarkably, the model can also generalize well to unseen objects during inference. ControlNet [233] fine-tunes a trainable copy of the encoding layers from Stable Diffusion while locking its pre-trained parameter weights. The fixed original model and the trainable copy are bridged through zero convolution layers. These layers, starting with zero-initialized weights, are designed to progressively adapt during training, ensuring that harmful noise does not affect the pre-trained features of Stable Diffusion at the beginning of training. This refined model is capable of conditioning on a variety of inputs such as Canny edges, Hough lines, user scribbles, human key points, segmentation maps, shape normals, depths, etc. Concept Sliders [234] introduces plug-and-play LoRA adaptors to allow precise editing of concepts (e.g., age, smiling) within a diffusion model. T2I-Adapter [235] introduces a lightweight adapter model designed to align external control signals with the internal knowledge of text-to-image diffusion models. This adapter enables precise manipulation through structural control (e.g., sketch, depth map, semantic segmentation map, and keypose), color control (e.g., hue and color distribution), and the integration of various controls by composing multiple adapters.
    附加输入控制:为了在保留预训练模型中的广泛知识的同时,结合额外的输入模态(例如布局、关键点),GLIGEN 引入了一种新颖的方法,该方法保持原始模型的权重不变,并集成了新的可训练门控 Transformer 层[232],这些层接收新的基础输入。由此产生的模型不仅可以准确表示基础条件,还可以生成高质量的图像。值得注意的是,在推断期间,该模型还能很好地泛化到未见过的对象。ControlNet [233] 对来自 Stable Diffusion 的编码层的可训练副本进行微调,同时锁定其预训练参数权重。固定的原始模型和可训练副本通过零卷积层相连。这些层从零初始化权重开始,旨在在训练过程中逐渐适应,确保有害噪声不会影响 Stable Diffusion 的预训练特征在训练开始时。这种精制模型能够根据各种输入进行条件化,例如 Canny 边缘、Hough 线、用户涂鸦、人体关键点、分割地图、形状法线、深度等。 概念滑块[234]引入即插即用的 LoRA 适配器,允许在扩散模型内精确编辑概念(例如年龄、微笑)。T2I-Adapter [235]引入了一种轻量级适配器模型,旨在将外部控制信号与文本到图像扩散模型的内部知识对齐。该适配器通过结构控制(例如草图、深度图、语义分割图和关键姿势)、颜色控制(例如色调和颜色分布)以及通过组合多个适配器集成各种控制,实现精确操作。
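The zero-convolution trick itself is simple to sketch (our illustration; channel sizes are arbitrary): because the bridging 1x1 convolution starts with zero weights and bias, the control branch contributes nothing at the first training step and cannot inject harmful noise into the frozen backbone.

import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # Hypothetical sketch of a ControlNet-style "zero convolution": a 1x1 conv
    # whose weights and bias start at zero, so the control branch initially
    # contributes nothing to the pre-trained features.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

frozen_feat = torch.randn(1, 320, 64, 64)              # feature from the locked backbone
control_feat = torch.randn(1, 320, 64, 64)             # feature from the trainable copy
fused = frozen_feat + zero_conv(320)(control_feat)     # equals frozen_feat at initialization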
  2. Customized Generation: The effectiveness of text-to-image diffusion models is limited by the user's ability to articulate the desired target through text descriptions. For instance, it is difficult to describe the precise features of an innovative toy car that was not encountered during large-scale model training. Consequently, the objective of customized generation is to enable the model to grasp new concepts from a minimal set of user-supplied images. Textual Inversion [236] addresses this by finding a new pseudo-word S* (similar to the soft prompts discussed in Section III-A2) that represents new, specific concepts in the textual embedding space of pre-trained text-to-image diffusion models. The pseudo-word is optimized via the original diffusion training objective given a small image set (typically 3-5 images) depicting the concept, while the pre-trained model is left untouched. During inference, S* can be treated like any other word and composed with other textual queries (e.g., "a photo of S* on the beach"). Custom Diffusion [237] tackles a more challenging setting: compositional fine-tuning of multiple concepts. It fine-tunes only the key and value projection matrices that map text features into the attention layers, which yields superior performance in multi-concept learning scenarios. Additionally, during fine-tuning, Custom Diffusion prevents model forgetting by introducing a small set of real images with captions akin to the target, alongside employing augmentation for faster convergence and improved results. IP-Adapter [238] identifies limitations in current approaches (e.g., ControlNet and T2I-Adapter) which project condition signals into the cross-attention modules. When handling image conditions aiming at controlling con-
    定制生成:文本到图像扩散模型的有效性受到用户通过文本描述所需目标的能力的限制。例如,很难描述一个创新玩具车的精确特征,这在大规模模型训练中并未遇到。因此,定制生成的目标是使模型能够从用户提供的最少一组图像中掌握新概念。文本反转[236]通过在预训练的文本到图像扩散模型的文本嵌入空间中找到一个新的伪词 (类似于第 III-A2 节中讨论的软提示),该伪词代表文本描述中的新、具体概念。给定描述概念的小图像集(通常为 3-5 张图像),伪词 通过扩散模型的原始优化目标进行优化,而预训练模型则保持不变。在推断过程中, 可以像其他单词一样处理,并与其他文本查询组合(例如,“海滩上 的照片”)。自定义扩散[237]解决了一个更具挑战性的设置:多个概念的组合微调。 它仅对注意力层中从文本到潜在特征的 映射进行微调,在多概念学习场景中表现出卓越性能。此外,在微调过程中,自定义扩散通过引入一小组与目标类似的带标题的真实图像,同时采用增强技术以实现更快的收敛和改进的结果,防止模型遗忘。IP-Adapter [238]识别了当前方法(例如 ControlNet 和 T2I-Adapter)中将条件信号投影到交叉注意力模块中的局限性。在处理旨在控制 con- 的图像条件时

    tent, these methods are unable to generate images faithful to the prompted image. The issue stems from the fact that merging image features and text features within the cross-attention layers loses image-specific information, leading to only coarse-grained controllable generation, such as image style rather than image content. To overcome this, IP-Adapter introduces a novel decoupled cross-attention mechanism to distinguish between text and image features. IP-Adapter adds an additional cross-attention layer exclusively for image features within each cross-attention block, and only the parameters of the new cross-attention layers are trained.
    在这种情况下,这些方法无法生成与提示图像忠实的图像。问题在于,在跨注意力层中合并图像特征和文本特征会丢失图像特定信息,导致只能生成粗粒度可控的生成,如图像风格而不是图像内容。为了克服这一问题,IP-Adapter 引入了一种新颖的解耦交叉注意力机制,以区分文本和图像特征。IP-Adapter 在每个交叉注意力层中专门为图像特征添加了一个额外的交叉注意力层,只训练新交叉注意力层的参数。

VI. System Design Challenge for PEFT
VI. PEFT 系统设计挑战

A. System design for PEFT
PEFT 的系统设计

In this section, we begin by providing a concise overview of cloud-based PEFT systems. Following this, we present the corresponding metrics employed for evaluating the system performance. Additionally, we present three prospective utilization scenarios to illustrate the challenges in system design.
在本节中,我们首先提供了云端 PEFT 系统的简要概述。随后,我们介绍了用于评估系统性能的相应指标。此外,我们提出了三种潜在的利用场景,以阐明系统设计中的挑战。
  1. Centralized PEFT Query Serving: Cloud providers have recently introduced a range of LLM services aimed at providing user applications through application programming interfaces (APIs) [239], [240]. These APIs facilitate the seamless integration of many ML functionalities into applications. After receiving a query for a specific downstream task through the API, the cloud-based server processes the query with one featured LLM model. Under this scenario, the proposed cloud solution for handling multiple PEFT queries involves storing only a single copy of the LLM and multiple PEFT modules. This single copy maintains multiple branches of PEFT modules, each associated with different PEFT queries. The case study of a state-of-the-art system can be found in Section VI-C. Figure 10 (b) illustrates the computation pattern for multi-query PEFT inference, wherein packed PEFT queries are scheduled and executed according to their deadlines and current system conditions.
    集中式 PEFT 查询服务:云服务提供商最近推出了一系列旨在通过应用程序编程接口(API)[239],[240]提供用户应用程序的服务。这些 API 促进了许多机器学习功能顺利集成到应用程序中。在通过 API 接收到一个特定下游任务的查询后,基于云的服务器使用一个特色模型处理查询。在这种情况下,处理多个 PEFT 查询的提议云解决方案涉及仅存储一个LLM和多个 PEFT 模块的单个副本。这个单个副本维护多个 PEFT 模块的分支,每个分支与不同的 PEFT 查询相关联。一个最先进系统的案例研究可以在第 VI-C 节中找到,图 10(b)说明了多查询 PEFT 推理的计算模式,其中打包的 PEFT 查询根据它们的截止日期和当前系统条件进行调度和执行。
  2. Serving Metrics: To evaluate the system performance of centralized PEFT query serving, we propose a set of evaluation metrics.
    服务指标:为了评估集中式 PEFT 查询服务的系统性能,我们提出了一组评估指标。
  • System throughput: Considering both inter-task and intra-task PEFT queries, we use tokens per second to measure system throughput.
    系统吞吐量:将 PEFT 查询视为任务内和任务间,我们使用每秒标记数来衡量系统吞吐量。
  • Memory footprint: Run-time memory consumption during query serving; the memory utilization comes from both the model parameters and the KV-cache, as mentioned in Section IV-A.
    内存占用:查询服务期间的运行时内存消耗,内存利用率来自模型参数和第 -cache,如第 IV-A 节所述
  • Accuracy performance: Real-world queries normally have different context lengths, and accuracy under varying sequence lengths serves as a performance benchmark.
    准确性表现:真实世界的查询通常具有不同的上下文长度,具有不同长度的性能作为性能基准。
  • Quality of service: Queries are associated with latency requirements, and the deadline miss rate is considered as another benchmark.
    服务质量:查询与延迟要求相关,而未达到截止日期的比率被视为另一个基准。
  3. Distributed System for PEFT: Nevertheless, in contemporary LLMs, personalized tasks are not fully supported by pre-trained models; consequently, extra fine-tuning has to be executed with the methodologies mentioned in the previous sections. However, a major concern is raised when
    用于 PEFT 的分布式系统:然而,在当代LLM模型中,个性化任务并没有得到充分支持,因此,需要使用前面部分提到的方法进行额外的微调。然而,当
Fig. 10: (a) Distributed-based system computation pattern; (b) centralized PEFT Query inference
图 10:(a)基于分布式系统的计算模式;(b)集中式 PEFT 查询推理
we consider giving the datasets to cloud providers since these datasets are personalized.
我们考虑将数据集提供给云服务提供商,因为这些数据集是个性化的。
For this concern, DLoRA [241] presents a distributed PEFT framework. During the PEFT process, the backbone LLM is executed in the cloud servers while the PEFT modules are trained entirely within the user devices. DLoRA scheme is depicted in Figure 10 (a).
对于这个问题,DLoRA [241] 提出了一个分布式 PEFT 框架。在 PEFT 过程中,骨干LLM在云服务器中执行,而 PEFT 模块完全在用户设备中进行训练。DLoRA 方案如图 10(a)所示。
  4. Distributed Metrics: To assess the efficacy of the proposed method, we establish a set of evaluative metrics. For this analysis, and without loss of generality, we adopt language models as the basis for our metric definitions.
    分布式指标:为了评估所提出方法的有效性,我们建立了一组评估指标。在这个分析中,为了不失一般性,我们采用语言模型作为我们指标定义的基础。
  • Accuracy performance: Performance of the fine-tuned model over the downstream tasks.
    准确性表现:微调模型在下游任务上的表现。
  • Compute cost: The compute cost during forward and backward propagation operations on edge devices.
    计算成本:在边缘设备上进行前向和反向传播操作期间的计算成本。
  • Communication cost: Refers to the volume of data involved during the transfer of intermediate data between the edge device and the cloud.
    通信成本:指的是边缘设备和云之间传输中间数据时涉及的数据量。
  5. Multi-PEFT Training: Different from multi-PEFT serving, tuning with multiple customized PEFTs always involves different backbone LLMs. When contemplating LLM usage across various downstream tasks, pre-trained models typically exhibit subpar performance. A prevalent approach to adapting an LLM to diverse tasks involves crafting fine-tuned PEFT modules. However, simultaneously tuning multiple PEFTs can pose considerable challenges. Challenges such as how to manage memory for gradients and model weight storage, and how to design efficient kernels for batched PEFT training, remain unsolved. PEFTs will be categorized based on their PEFT algorithms and backbone LLM models. The design challenge involves how to consolidate multiple PEFTs with the same LLM backbone, as well as PEFTs built on multiple different LLM backbones, simultaneously.
    多 PEFT 训练:与多 PEFT 服务不同,使用多个定制 PEFT 进行调整总是涉及不同的骨干LLMs。在考虑跨多个下游任务使用LLM时,预训练模型通常表现出次优性能。适应LLM到不同任务的一种普遍方法涉及制作微调的 PEFT。然而,同时调整多个 PEFT 可能会带来相当大的挑战。挑战包括如何管理内存梯度和模型权重存储,以及如何为批处理 PEFT 训练设计高效的内核仍未解决。PEFT 将根据其 PEFT 算法和骨干LLM模型进行分类。设计挑战涉及如何 consololidate 具有相同LLM骨干和多个不同LLM骨干的多个 PEFT。

B. Case study: Offsite-Tuning
案例研究:离线调整

We already know that fine-tuning LLM for downstream tasks is challenging for two reasons: dual privacy concerns between cloud server and data owner, and issues with computational resources and efficiency. Firstly, the privacy of both parties is at risk: the weights of large models are often proprietary and not made public. Sharing data with model owners for fine-tuning can lead to data privacy concerns while providing model weights to data proprietors could compromise the ownership of proprietary models. Secondly, even if downstream users have access to pre-trained weights, the stringent hardware requirements make transfer learning impractical for most end users.
我们已经知道,为下游任务微调LLM存在两个挑战:云服务器和数据所有者之间的双重隐私问题,以及计算资源和效率方面的问题。首先,双方的隐私都面临风险:大型模型的权重通常是专有的,不公开。与模型所有者共享数据进行微调可能会引发数据隐私问题,同时向数据所有者提供模型权重可能会损害专有模型的所有权。其次,即使下游用户可以访问预训练的权重,严格的硬件要求使得迁移学习对大多数终端用户来说并不实际。
To resolve these two issues, Offsite-Tuning [242] proposes a privacy-preserving and efficient transfer learning framework that enables foundational models to adapt to downstream tasks without the need to access the complete model weights. The key insight of Offsite-Tuning is that the cloud provider sends an adapter and an emulator to the data proprietor. Then, with the assistance of the emulator, the data proprietor fine-tunes the adapter. The fine-tuned adapter is then sent back to the cloud side, which integrates it into the complete model, creating a fine-tuned foundational model for downstream users.
为了解决这两个问题,Offsite-Tuning [242] 提出了一种保护隐私且高效的迁移学习框架,使基础模型能够适应下游任务,而无需访问完整的模型权重。Offsite-Tuning 的关键见解是云提供商向数据所有者发送一个适配器和一个仿真器。然后,在仿真器的帮助下,数据所有者对适配器进行微调。微调后的适配器随后被发送回云端,将其整合到完整模型中,为下游用户创建一个微调的基础模型。
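The exchange can be pictured with the toy sketch below; it is not Offsite-Tuning's actual implementation: a crude layer-dropping step stands in for the paper's distilled lossy emulator, and the model, layer indices, and training data are all illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """A toy stack of linear layers standing in for a foundation model."""
    def __init__(self, depth=8, width=64):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(width, width) for _ in range(depth))
    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

full_model = TinyLM()

# Cloud side: keep the outermost layers as the trainable "adapter" and replace
# the middle with a lossy "emulator" (here: simply dropping every other middle
# layer, a crude stand-in for the paper's distilled emulator).
adapter_ids, emulator_ids = [0, 1, 6, 7], [2, 4]      # layers 3 and 5 are dropped
package = copy.deepcopy(full_model)
package.layers = nn.ModuleList(package.layers[i]
                               for i in adapter_ids[:2] + emulator_ids + adapter_ids[2:])

# Data-owner side: freeze the emulator, fine-tune only the adapter layers.
for idx, layer in enumerate(package.layers):
    layer.requires_grad_(idx in (0, 1, 4, 5))         # adapter positions inside `package`

opt = torch.optim.Adam([p for p in package.parameters() if p.requires_grad], lr=1e-3)
x, y = torch.randn(16, 64), torch.randn(16, 64)
loss = nn.functional.mse_loss(package(x), y)
loss.backward(); opt.step()

# Only the adapter weights travel back and are plugged into the full model.
for src, dst in zip((0, 1, 4, 5), adapter_ids):
    full_model.layers[dst].load_state_dict(package.layers[src].state_dict())
```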
Offsite-Tuning safeguards the privacy of data proprietors since they do not need to share their training data directly. It also protects the foundational model owners, as the complete model weights are not shared, and the emulator provided is lossy, with significantly degraded performance. Compared to existing fine-tuning methods that require access to the full model weights, Offsite-Tuning is more resource-efficient because it allows for fine-tuning through a compressed emulator without needing the complete model.
离线调整保护数据所有者的隐私,因为他们不需要直接共享他们的训练数据。它还保护基础模型所有者,因为完整的模型权重不会被共享,提供的仿真器是有损的,性能明显下降。与现有需要访问完整模型权重的微调方法相比,离线调整更具资源效率,因为它允许通过压缩的仿真器进行微调,而无需完整模型。

C. Case Study: PetS
C. 案例研究:PetS

The PEFT algorithm is notable for its ability to distinguish between modifiable and immutable weights within a model. This characteristic inspires developers to amalgamate diverse LLMs with distinct PEFT techniques into collective units. PetS, as introduced in [243], advocates for a comprehensive approach to managing multiple PEFT tasks by suggesting a unified serving framework. The framework's core advancement lies in the translation of varying PEFT tasks into integrated computation kernels to enhance efficiency. Moreover, PetS pioneers an orchestrated batching approach and a scheduling methodology, aiming to augment system throughput and leverage task parallelism respectively.
PEFT 算法以其区分模型中可修改和不可修改权重的能力而著称。这一特征激发开发人员将不同的LLMs与不同的 PEFT 技术融合成集体单元。正如[243]中介绍的 PetS 所倡导的,通过提出统一的服务框架,支持综合管理多个 PEFT 任务的方法。该框架的核心进展在于将不同的 PEFT 任务转化为集成计算核心,以增强效率。此外,PetS 开创了一种协调的批处理方法和调度方法,旨在分别增加系统吞吐量和利用任务并行性。
As depicted in Figure 11, the PetS framework begins with users registering PEFT tasks through a standardized Application Programming Interface (API). Upon registration, developers are expected to provide the Pre-Trained Model Tag (e.g., LLaMA), PEFT parameters in a compressed format, and the specific PEFT algorithms (e.g., LoRA, Adapter, Bitfit, etc.). These tasks are then endowed with unique identifiers, and the inference engine takes charge of query processing. PetS bifurcates the primary computational workload (e.g., linear layer computations) into three distinct computational operations: (1) Dense Matrix-Vector Multiplication (MVM) leveraging universally accessible, pre-trained weights. (2) Bias vector addition (Vadd), using either common or task-exclusive biases. (3) A combination of Sparse/dense MVM operations employing task-specific PET parameters. A unified pre-trained weight matrix $W$ is employed across PetS, facilitating the batching of the initial MVM operations $XW$. However, subsequent task-specific computations involving PET parameters, despite being relatively minimal in complexity, are processed individually.
如图 11 所示,PetS 框架始于用户通过标准化的应用程序编程接口(API)注册 PEFT 任务。注册后,开发人员需要提供预训练模型标签(例如 LLaMA)、压缩格式的 PEFT 参数以及特定的 PEFT 算法(例如 LoRA、Adapter、Bitfit 等)。这些任务随后被赋予唯一标识符,并推理引擎负责查询处理。PetS 将主要的计算工作负载(例如线性层计算)分为三个不同的计算操作:(1)利用通用可访问的预训练权重进行密集矩阵-向量乘法(MVM)。 (2)使用常见或任务专用偏置的偏置向量添加(Vadd)。 (3)结合使用特定任务的 PET 参数的稀疏/密集 MVM 操作。在 PetS 中采用统一的预训练权重矩阵 ,便于批处理初始操作 。然而,尽管相对复杂度较低,涉及 PET 参数的后续任务特定计算是单独处理的。
Considering the Adapter and Bitfit tasks as an illustration, both target the MLP component of LLMs. The Adapter task integrates additional weight segments, whereas Bitfit adjusts bias elements. The Adapter operation is modeled as $Y_{ad} = X_{ad}(W + W_{ad}) + b$, where $X_{ad}$ represents the input for the Adapter task, $W$ and $W_{ad}$ are the original and adapter-specific PEFT weights respectively, and $b$ is the initial bias. The Bitfit operation, on the other hand, is defined as $Y_{bf} = X_{bf}W + b_{bf}$, with $b_{bf}$ symbolizing the Bitfit-adjustable bias. These operations are further synthesized as $Y_i = X_iW + X_iW_i + b_i$, delineating that the $X_iW$ part is amenable to batching through MVM, the $X_iW_i$ term corresponds to the task-specific sparse/dense MVM, and the $b_i$ segment pertains to the Vadd operation.
考虑适配器和 Bitfit 任务作为示例,两者都旨在LLMs的 MLP 组件。适配器任务集成了额外的权重段,而 Bitfit 调整了偏置元素。适配器操作被建模为 ,其中 代表适配器任务的输入, 分别是原始和适配器特定的 PEFT 权重, 是初始偏置。另一方面,Bitfit 操作被定义为 ,其中 象征着 Bitfit 可调偏置。这些操作进一步综合为 ,说明 部分适合通过 MVM 进行批处理,而 部分涉及 Vadd 操作。
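The snippet below sketches this decomposition under the formulas reconstructed above: a single batched MVM over the shared weight serves every query, while the low-cost Adapter- and Bitfit-specific terms are applied per query. Dimensions and the task mix are illustrative assumptions, not PetS's kernels.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
W = rng.standard_normal((d, d)) / np.sqrt(d)      # shared pre-trained weight
b = rng.standard_normal(d)                        # shared bias

# Task-specific PEFT parameters (illustrative): an Adapter-style extra weight
# for task 0 and a Bitfit-style adjustable bias for task 1.
W_ad = rng.standard_normal((d, d)) * 0.01
b_bf = rng.standard_normal(d) * 0.01

X = rng.standard_normal((2, d))                   # one query per task, batched together

# (1) Shared dense MVM, batched across both queries.
shared = X @ W
# (2)/(3) Task-specific terms, applied individually per query.
y_adapter = shared[0] + X[0] @ W_ad + b           # Y = X(W + W_ad) + b
y_bitfit  = shared[1] + b_bf                      # Y = XW + b_bf
print(y_adapter.shape, y_bitfit.shape)
```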
For tasks like Diff-Pruning (Section III-B), the computation pattern is slightly different from that of Bitfit and Adapter. For Diff-Pruning, the computations concerning the shared weight and the 'difference' are conducted separately, and the results are then added up, namely
对于像 Diff-Pruning 这样的任务,III-B 与 Bitfit 和 Adapter 有一点不同。对于 Diff-Pruning,关于共享权重和“差异”的计算是分开进行的。然后将结果相加,即
$Y = XW + X\delta$, where $W$ denotes the backbone model weights while $\delta$ denotes the pruned 'difference' weights, so that $X\delta$ can be computed as a sparse MVM.
在这里, 表示骨干模型的权重,而 表示可以表示为稀疏 MVM 的修剪权重。
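A minimal sketch of this two-path computation is given below, using SciPy's sparse matrices for the pruned difference; the sparsity level and sizes are illustrative assumptions.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(2)
d = 64
W = rng.standard_normal((d, d)) / np.sqrt(d)         # shared backbone weight (dense MVM)

# Pruned "difference" weights: mostly zero, stored in a sparse format.
delta = rng.standard_normal((d, d))
delta[rng.random((d, d)) > 0.02] = 0.0               # keep ~2% of entries
delta = sparse.csr_matrix(delta)

x = rng.standard_normal(d)
y = x @ W + delta.T @ x                              # dense MVM + sparse MVM, summed
print(y.shape)                                       # (64,)
```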
The other challenge PetS addresses is how to schedule different PEFT requests to achieve high performance. The PetS scheduler achieves high parallelism through a two-level scheduling policy: Coordinated Batching (CB) and Macro-batch Streaming (MS), as Figure 12 depicts. Through CB, the input queries are first clustered based on their input length and then grouped based on their shared operator, ensuring that queries of the same sequence length are executed together without wasting padding. The MS strategy then takes the grouped queries produced by coordinated batching, together with the theoretical latencies of the different operators and the system modeling parameters, to generate the best execution order.
PetS 提出的另一个挑战是如何安排不同的 PEFT 请求以实现高性能。PetS 调度器通过两级调度策略实现高并行性:协调批处理(CB)和宏批处理流(MS),如图 12 所示。通过 ,输入查询将首先根据其输入长度进行聚类,然后根据它们的共享运算符进行分组。这是为了确保相同序列长度的查询将被执行,而不会浪费填充。MS 策略将在协调批处理后获取分组查询,并考虑不同运算符的理论延迟以及系统建模参数,生成最佳执行顺序。
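The two-level grouping that CB performs can be sketched as below; the query records are hypothetical, and the latency-aware ordering done by MS is deliberately omitted.

```python
from collections import defaultdict

# Each query: (task_id, shared_operator, input_length).  Values are illustrative.
queries = [
    ("t0", "dense_mvm", 128), ("t1", "dense_mvm", 128),
    ("t2", "sparse_mvm", 128), ("t3", "dense_mvm", 512),
]

# Level 1: cluster queries by (padded) input length so no padding is wasted.
by_length = defaultdict(list)
for q in queries:
    by_length[q[2]].append(q)

# Level 2: within each length cluster, group queries sharing the same operator
# so they can be executed as one batched kernel launch.
macro_batches = []
for length, qs in sorted(by_length.items()):
    by_op = defaultdict(list)
    for q in qs:
        by_op[q[1]].append(q)
    for op, group in by_op.items():
        macro_batches.append({"length": length, "op": op,
                              "tasks": [q[0] for q in group]})

for mb in macro_batches:
    print(mb)
# A Macro-batch Streaming scheduler would then order these batches using
# per-operator latency estimates, which are omitted here.
```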

D. Parallel PEFT Training Frameworks
D. 并行 PEFT 训练框架

a) Design Challenges: Unlike the PetS system, which aims to accommodate flexible multi-PEFT algorithms, SLoRA [244] and Punica [245] focus solely on facilitating multiple-LoRA blocks for various tasks. Designing multiple PEFT training systems presents key challenges in two main aspects:
设计挑战:与旨在容纳灵活多 PEFT 算法的 PetS 系统不同,SLoRA [244]和 Punica [245]专注于为各种任务提供多个 LoRA 块的便利。设计多个 PEFT 训练系统在两个主要方面提出了关键挑战:
  • Efficient concurrent execution of multiple PEFT models with the same LLM backbone.
    使用相同LLM骨干的多个 PEFT 模型的高效并发执行。
  • Designing an efficient system for multi-tenant serving with different LLM backbones.
    为具有不同LLM骨干的多租户服务设计高效系统。
b) Efficient kernel design: Punica addresses the first challenge by using existing matrix multiplication for the backbone computation and introducing a new CUDA kernel, Segmented Gather Matrix-Vector Multiplication (SGMV), for adding the PEFT add-ons to the backbone computation in a batched manner. This kernel parallelizes the feature-weight multiplication for different requests in the batch and groups requests corresponding to the same PEFT model to increase operational intensity and use GPU Tensor Cores for acceleration.
b) 高效的内核设计:Punica 通过使用现有的矩阵乘法进行骨干计算,并引入一种新的 CUDA 内核,分段聚合矩阵-向量乘法(SGMV),以批量方式将 PEFT 附加组件添加到骨干计算中来解决第一个挑战。该内核并行化了批处理中不同请求的特征权重乘法,并将对应于相同 PEFT 模型的请求分组,以增加操作强度并利用 GPU 张量核心进行加速。
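The semantics that SGMV implements can be mimicked on the CPU as below; this is only a NumPy sketch of the gather-then-grouped-matmul structure, not the CUDA kernel, and the batch composition and LoRA shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r, n = 64, 8, 6
W = rng.standard_normal((d, d)) / np.sqrt(d)           # shared backbone weight

# Two LoRA models; each request in the batch points at one of them.
loras = {m: (rng.standard_normal((d, r)) * 0.01, rng.standard_normal((r, d)) * 0.01)
         for m in ("lora_a", "lora_b")}
request_model = ["lora_a", "lora_a", "lora_b", "lora_a", "lora_b", "lora_b"]

X = rng.standard_normal((n, d))
Y = X @ W                                              # one batched matmul for all requests

# Gather requests by LoRA model and apply each model's add-on as a grouped matmul,
# mirroring the segmented-gather structure of SGMV.
for model, (A, B) in loras.items():
    idx = [i for i, m in enumerate(request_model) if m == model]
    Y[idx] += (X[idx] @ A) @ B

print(Y.shape)   # (6, 64)
```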
The second challenge goes beyond computational cost: designing an efficient system architecture that can effectively serve multi-tenant PEFT model workloads on the smallest possible set of GPUs, while occupying the least amount of GPU resources, is another significant hurdle. Punica addresses
第二个挑战不仅仅是计算成本,设计一个高效的系统架构,能够有效地为多租户 PEFT 模型工作负载提供服务,同时占用尽可能少的 GPU 资源,是另一个重要挑战。Punica 解决。
Fig. 11: PetS system overview: (1) Tasks register; (2) Task manager (3) Task schedule; (4) Task serving. (Image is taken from PetS [243])
图 11:PetS 系统概述:(1)任务注册;(2)任务管理器;(3)任务调度;(4)任务服务。(图片来源于 PetS [243])
this by scheduling user requests to active GPUs that already serve or train PEFT models, thereby improving GPU utilization. For older requests, Punica periodically migrates them to consolidate workloads, thus freeing up GPU resources for new requests.
通过将用户请求调度到已经为 PEFT 模型提供服务或训练的活跃 GPU,从而提高 GPU 利用率。对于较旧的请求,Punica 定期将它们迁移以整合工作负载,从而为新请求释放 GPU 资源。
c) Multi-Tenant PEFT design: Designing an efficient system for multi-tenant PEFT model serving in the Punica framework focuses on addressing several key challenges to maximize hardware utilization and minimize resource consumption. The system aims to consolidate multi-tenant LoRA serving workloads onto the smallest set of GPUs possible. This consolidation is achieved through strategic scheduling of user requests to active GPUs that are already serving or training LoRA models, thereby improving GPU utilization. For older requests, Punica periodically migrates them to consolidate workloads further, thus freeing up GPU resources for new requests. It also incorporates on-demand loading of LoRA model weights, which introduces only millisecond-level latency; this gives Punica the flexibility to dynamically consolidate user requests onto a small set of GPUs without being constrained by the specific LoRA models already running on those GPUs. Besides that, since Punica identifies the decode stage as the predominant factor in the cost of model serving, its design primarily focuses on optimizing decode-stage performance, while other aspects of model serving rely on straightforward techniques, such as on-demand weight loading, to manage resource utilization efficiently.
c) 多租户 PEFT 设计:为 Punica 框架中提供服务的多租户 PEFT 模型设计一个高效系统,重点解决几个关键挑战,以最大化硬件利用率并最小化资源消耗。该系统旨在将多租户 LoRA 服务工作负载整合到可能的最小一组 GPU 上。通过对用户请求进行战略调度,将其分配给已经在提供或训练 LoRA 模型的活跃 GPU,从而提高 GPU 利用率来实现这种整合。对于较旧的请求,Punica 定期将它们迁移以进一步整合工作负载,从而为新请求释放 GPU 资源。它还包括 LoRA 模型权重的按需加载,仅引入毫秒级延迟。这一特性使 Punica 能够灵活地将用户请求动态整合到一小组 GPU 中,而不受这些 GPU 上已运行的特定 LoRA 模型的限制。此外,Punica 确定解码阶段是模型服务成本的主要因素,Punica 的设计主要侧重于优化解码阶段的性能。 模型服务的其他方面利用直接的技术,例如按需加载 LoRA 模型权重,以有效地管理资源利用率。
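The consolidation policy described above can be illustrated with the toy greedy placement below; it is not Punica's actual scheduler, and the GPU records, capacities, and model names are hypothetical.

```python
# Greedy request placement in the spirit of the consolidation policy described
# above; data structures and capacities are illustrative assumptions.
gpus = [
    {"id": 0, "loaded": {"lora_a"}, "active": 3, "capacity": 8},
    {"id": 1, "loaded": {"lora_b"}, "active": 1, "capacity": 8},
]

def place(request_model):
    # Prefer a GPU that already serves this LoRA model and has spare slots.
    candidates = [g for g in gpus if g["active"] < g["capacity"]]
    hot = [g for g in candidates if request_model in g["loaded"]]
    # Otherwise pick the busiest non-full GPU to keep the workload consolidated,
    # paying only the small on-demand weight-loading cost.
    target = (hot or sorted(candidates, key=lambda g: -g["active"]))[0]
    target["loaded"].add(request_model)
    target["active"] += 1
    return target["id"]

print(place("lora_b"))   # -> 1 (already loaded there)
print(place("lora_c"))   # -> 0 (busiest GPU with spare capacity; weights loaded on demand)
```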

VII. Conclusion and Future Directions
第七部分:结论和未来方向

In the current era dominated by large models and large datasets, PEFT stands out as a highly attractive method for efficiently adapting models to downstream tasks. This technique gains its appeal by addressing the significant challenges posed by traditional full-model fine-tuning, which often imposes substantial computational and data demands. This survey offers a comprehensive examination of the most recent advancements in PEFT, including algorithmic design, computational efficiency, application scenarios, and system implementation. It provides a comprehensive taxonomy and explanation that serve as excellent guidance and a knowledge base, enabling readers of various levels and disciplines to swiftly grasp the core concepts of PEFT.
在当前由大型模型和大型数据集主导的时代,PEFT 作为一种高度吸引人的方法脱颖而出,可以有效地将模型调整到下游任务。这种技术之所以吸引人,是因为它解决了传统的全模型微调所带来的重大挑战,这往往需要大量的计算和数据需求。本调查全面审视了 PEFT 中最新进展,包括算法设计、计算效率、应用场景和 PEFT 的系统实施。它提供了全面的分类和解释,作为出色的指导和知识库,使各个层次和学科的读者能够迅速掌握 PEFT 的核心概念。
Fig. 12: Coordinated Batching (CB) Strategy
图 12:协调批处理(CB)策略
For further research on PEFT, we propose a series of possible directions from both algorithm and system perspectives, hoping to inspire more researchers to engage in further studies in these areas.
对于 PEFT 的进一步研究,我们从算法和系统两个角度提出了一系列可能的方向,希望能激发更多研究人员参与这些领域的进一步研究。

A. Simplify hyperparameter tuning
简化超参数调整

The effectiveness of PEFT is often sensitive to its hyperparameters, such as the bottleneck dimension of the adapter, the rank of LoRA, and the arrangement of various additive PEFT layers. Manually tuning these hyperparameters costs considerable effort. Therefore, future efforts could focus on developing methods that are less dependent on manual tuning of these parameters, or that automatically find the optimal configuration settings. Several studies [76], [77], [78], [91], [92], [93] have started to address this issue, but simpler and more efficient solutions for optimizing these hyperparameters are still needed.
PEFT 的有效性通常对其超参数非常敏感,例如适配器的瓶颈维度、LoRA 的秩以及各种附加 PEFT 层的排列。手动调整这些超参数将耗费大量精力。因此,未来的努力可以集中在开发那些不太依赖于这些参数手动调整的方法,或者自动找到最佳配置设置。一些研究[76],[77],[78],[91],[92],[93]已经开始解决这个问题,但需要更简单和高效的解决方案来优化这些超参数。
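In the simplest case, such a search today still looks like the random-search loop sketched below, which is exactly the kind of manual, trial-driven procedure future work should render unnecessary; the search space and the `train_and_eval` placeholder are illustrative assumptions, not a recommended recipe.

```python
import random

random.seed(0)
search_space = {
    "lora_rank": [4, 8, 16, 32],
    "adapter_bottleneck": [16, 64, 256],
    "learning_rate": [1e-4, 3e-4, 1e-3],
}

def train_and_eval(cfg):
    """Placeholder: in practice this would fine-tune with `cfg` and return a dev score."""
    return random.random()

best_cfg, best_score = None, float("-inf")
for _ in range(10):                         # a tiny random-search budget
    cfg = {k: random.choice(v) for k, v in search_space.items()}
    score = train_and_eval(cfg)
    if score > best_score:
        best_cfg, best_score = cfg, score

print(best_cfg, round(best_score, 3))
```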

B. Establish a unified benchmark
建立统一的基准

Despite the existence of libraries like HuggingFace's PEFT [246] and AdapterHub [247], a comprehensive benchmark for PEFT is still lacking. This gap hinders the ability to fairly compare the performance and efficiency of different PEFT approaches. A well-accepted, up-to-date benchmark akin to MMDetection [248] for object detection would enable researchers to validate their methods against a standard set of tasks and metrics, fostering innovation and collaboration within the community.
尽管存在像 HuggingFace 的 PEFT [246]和 AdapterHub [247]这样的库,但仍然缺乏一个全面的 PEFT 基准。这一差距阻碍了公平比较不同 PEFT 方法的性能和效率的能力。一个被广泛接受的、与目标检测的 MMDetection [248]类似的最新基准将使研究人员能够针对一套标准任务和指标验证他们的方法,促进社区内的创新和合作。

C. Enhance training efficiency
提高培训效率

The presumed parameter efficiency of PEFT is not always consistent with computational and memory savings during training. Given that trainable parameters are intertwined within the pre-trained model's architecture, computing and storing activations and gradients for the full model often become necessary during fine-tuning. This oversight calls for a rethinking of what constitutes efficiency. As outlined in Section IV, potential solutions lie in the integration of model compression techniques such as pruning and quantization, alongside innovations specifically designed to optimize memory during PEFT tuning [249]. Further research into enhancing
PEFT 的假定参数效率并不总是与训练过程中的计算和内存节省一致。鉴于可训练参数与预训练模型的架构交织在一起,通常在微调过程中需要计算和存储完整模型的激活和梯度。这一疏忽需要重新思考何为效率。如第四节所述,潜在解决方案在于整合模型压缩技术,如修剪和量化,以及专门设计的创新,以在 PEFT 调整期间优化内存。进一步研究以增强。

the computational efficiency of PEFT methodologies is imperative.
PEFT 方法的计算效率至关重要。
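The gap between parameter efficiency and training-time efficiency can be seen from a back-of-the-envelope memory breakdown like the one below; every number is an illustrative assumption (fp16 weights and activations, Adam state only for trainable parameters, one stored activation tensor per layer).

```python
# Rough training-memory breakdown for LoRA-style fine-tuning of a
# 7B-parameter-class model; every number below is an illustrative assumption.
GiB = 1024 ** 3

backbone_params   = 7e9
lora_params       = 20e6
bytes_weights     = 2          # fp16 frozen weights
bytes_optim_state = 12         # fp32 master copy + two Adam moments, trainable params only

batch, seq, hidden, layers = 8, 2048, 4096, 32
bytes_act = 2                  # fp16 activations kept for backward (no checkpointing)

weights   = backbone_params * bytes_weights
optimizer = lora_params * bytes_optim_state
acts      = batch * seq * hidden * layers * bytes_act   # one tensor per layer, simplified

for name, v in [("frozen weights", weights), ("optimizer state", optimizer),
                ("activations", acts)]:
    print(f"{name:>16}: {v / GiB:6.1f} GiB")
# The optimizer state of the trainable parameters is negligible; frozen weights
# and stored activations dominate, which is why compression and activation-memory
# optimizations matter for PEFT training efficiency.
```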

D. Explore scaling laws
探索尺度定律

The design and effectiveness of PEFT methods originally developed for smaller Transformer models do not necessarily scale with larger models. As the size of foundation models increases, identifying and adapting PEFT strategies that remain effective is crucial. This investigation will aid in customizing PEFT methodologies to suit the evolving landscape of large model architectures.
PEFT 方法最初是为较小的 Transformer 模型开发的,其设计和有效性不一定随着模型变大而扩展。随着基础模型规模的增加,识别和调整仍然有效的 PEFT 策略至关重要。这项调查将有助于定制 PEFT 方法,以适应大型模型架构的不断发展。

E. Serve more models and tasks
提供更多的模型和任务

The rise of large foundation models across various domains presents new opportunities for PEFT. Designing PEFT methods tailored to the unique characteristics of models, such as Sora [250], Mamba [251], and LVM [252], can unlock new application scenarios and opportunities.
各个领域大型基础模型的崛起为 PEFT 带来了新的机遇。设计针对模型独特特征的 PEFT 方法,如 Sora [250]、Mamba [251]和 LVM [252],可以开启新的应用场景和机会。

F. Enhancing data privacy
增强数据隐私

Trusting centralized systems to serve or fine-tune personalized PEFT modules is yet another issue for system developers. Multiple types of inversion attacks [253], [254] have been proposed to reconstruct users' data by hijacking the intermediate results. One direction for future trustworthy LLM system design involves developing encryption protocols for both personal data and intermediate training and inference results.
信任集中式系统来服务或微调个性化 PEFT 模块对系统开发人员来说是另一个问题。已经提出了多种类型的反演攻击[253],[254],以通过劫持中间结果来重建用户数据。未来值得信赖的LLM系统设计的一个观点涉及开发一个加密协议,用于个人数据和中间训练和推理结果。

G. PEFT with model compression
G. 使用模型压缩的 PEFT

Model compression is one of the most effective ways to make LLMs executable on resource-limited devices. Yet, the impact of model compression techniques on the performance of PEFT algorithms running on hardware remains another systemic challenge. Common compression techniques such as quantization and pruning necessitate dedicated hardware platforms to expedite the process, and building such hardware platforms for compressed models is yet another direction for future research.
模型压缩是使LLM在资源有限的设备上可执行的最有效方法之一。然而,模型压缩技术对在硬件上运行的 PEFT 算法性能的影响仍然是另一个系统性挑战。常见的压缩技术,如量化和修剪,需要专用硬件平台来加快过程,并为压缩模型构建这样的硬件平台是未来研究的另一个方向。

REFERENCES 参考资料

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in neural information processing systems, vol. 33, pp. 1877-1901, 2020.
T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell 等人,“语言模型是少样本学习器”,《神经信息处理系统进展》,第 33 卷,第 1877-1901 页,2020 年。
[2] Y. Zhuang, Y. Yu, K. Wang, H. Sun, and C. Zhang, "Toolqa: A dataset for question answering with external tools," arXiv preprint arXiv:2306.13304, 2023.
Y. Zhuang, Y. Yu, K. Wang, H. Sun, 和 C. Zhang, "Toolqa: 一个用于外部工具 问题回答的数据集," arXiv 预印本 arXiv:2306.13304, 2023.
[3] W. Zhu, H. Liu, Q. Dong, J. Xu, L. Kong, J. Chen, L. Li, and S. Huang, "Multilingual machine translation with large language models: Empirical results and analysis," arXiv preprint arXiv:2304.04675, 2023.
W. Zhu, H. Liu, Q. Dong, J. Xu, L. Kong, J. Chen, L. Li, 和 S. Huang, "使用大型语言模型进行多语言机器翻译:实证结果和分析," arXiv 预印本 arXiv:2304.04675, 2023.
[4] M. U. Hadi, R. Qureshi, A. Shah, M. Irfan, A. Zafar, M. Shaikh, N. Akhtar, J. Wu, and S. Mirjalili, "A survey on large language models: Applications, challenges, limitations, and practical usage," TechRxiv, 2023.
M. U. Hadi,R. Qureshi,A. Shah,M. Irfan,A. Zafar,M. Shaikh,N. Akhtar,J. Wu 和 S. Mirjalili,“大型语言模型调查:应用,挑战,限制和实际使用”,TechRxiv,2023。
[5] B. Xu, X. Liu, H. Shen, Z. Han, Y. Li, M. Yue, Z. Peng, Y. Liu, Z. Yao, and D. Xu, "Gentopia: A collaborative platform for tool-augmented llms," arXiv preprint arXiv:2308.04030, 2023.
B. Xu, X. Liu, H. Shen, Z. Han, Y. Li, M. Yue, Z. Peng, Y. Liu, Z. Yao, and D. Xu, "Gentopia: 一个用于工具增强的协作平台",arXiv 预印本 arXiv:2308.04030, 2023.
[6] G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, "Camel: Communicative agents for "mind" exploration of large language model society," in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
G. Li,H. A. A. K. Hammoud,H. Itani,D. Khizbullin 和 B. Ghanem,“骆驼:大型语言模型社会“心灵”探索的沟通代理”,发表于 2023 年第三十七届神经信息处理系统会议。

[7] Q. Wu, G. Bansal, J. Zhang, Y. Wu, S. Zhang, E. Zhu, B. Li, L. Jiang, X. Zhang, and C. Wang, "Autogen: Enabling next-gen llm applications via multi-agent conversation framework," arXiv preprint arXiv:2308.08155, 2023.
吴琪,班萨尔,张杰,吴阳,张帅,朱恩,李波,江丽,张晓,王超,“Autogen:通过多代理对话框架实现下一代 1001 应用”,arXiv 预印本 arXiv:2308.08155,2023 年。
[8] H. Zhang, X. Liu, and J. Zhang, "Summit: Iterative text summarization via chatgpt," arXiv preprint arXiv:2305.14835, 2023.
H. Zhang, X. Liu, 和 J. Zhang, "Summit: 通过 ChatGPT 迭代文本摘要," arXiv 预印本 arXiv:2305.14835, 2023.
[9] B. Zhang and R. Sennrich, "Root mean square layer normalization," Advances in Neural Information Processing Systems, vol. 32, 2019.
B. Zhang 和 R. Sennrich,“均方根层归一化”,神经信息处理系统进展,第 32 卷,2019 年。
[10] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, "Roformer: Enhanced transformer with rotary position embedding," arXiv preprint arXiv:2104.09864, 2021.
J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, 和 Y. Liu, "Roformer: Enhanced transformer with rotary position embedding," arXiv 预印本 arXiv:2104.09864, 2021.
[11] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "Glue: A multi-task benchmark and analysis platform for natural language understanding," arXiv preprint arXiv:1804.07461, 2018.
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, 和 S. R. Bowman, "Glue: A multi-task benchmark and analysis platform for natural language understanding," arXiv preprint arXiv:1804.07461, 2018.
[12] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, "Can a suit of armor conduct electricity? a new dataset for open book question answering," in EMNLP, 2018.
T. Mihaylov, P. Clark, T. Khot, 和 A. Sabharwal, "一套盔甲能导电吗?一个用于开放式书籍问答的新数据集," 发表于 EMNLP, 2018.
[13] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi, "Piqa: Reasoning about physical commonsense in natural language," in Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
[13] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi, "Piqa: Reasoning about physical commonsense in natural language," in Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020. [13] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, 和 Y. Choi, "Piqa: 推理自然语言中的物理常识," 在第三十四届 AAAI 人工智能大会上, 2020.
[14] M. Sap, H. Rashkin, D. Chen, R. LeBras, and Y. Choi, "Socialiqa: Commonsense reasoning about social interactions," arXiv preprint arXiv:1904.09728, 2019.
M. Sap、H. Rashkin、D. Chen、R. LeBras 和 Y. Choi,"Socialiqa: Commonsense reasoning about social interactions," arXiv 预印本 arXiv:1904.09728,2019。
[15] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, "Hellaswag: Can a machine really finish your sentence?" in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, "Hellaswag: Can a machine really finish your sentence?" in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[16] C. e. a. Clark, "Boolq: Exploring the surprising difficulty of natural yes/no questions," in NAACL, 2019.
C. e. a. Clark,“Boolq:探索自然的是/否问题的令人惊讶的困难”,发表于 NAACL,2019 年。
[17] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi, "Winogrande: An adversarial winograd schema challenge at scale," Communications of the ACM, vol. 64, no. 9, pp. 99-106, 2021.
K. Sakaguchi, R. L. Bras, C. Bhagavatula, 和 Y. Choi, "Winogrande: An adversarial winograd schema challenge at scale," 《ACM 通讯》, vol. 64, no. 9, pp. 99-106, 2021.
[18] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, "Think you have solved question answering? try arc, the ai2 reasoning challenge," arXiv:1803.05457v1, 2018.
P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, "认为你已经解决了问题回答?尝试 ARC,AI2 推理挑战," arXiv:1803.05457v1, 2018.
[19] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., "The kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev 等人,《The kinetics human action video dataset》,arXiv 预印本 arXiv:1705.06950,2017。
[20] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag et al., "The" something something" video database for learning and evaluating visual common sense," in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5842-5850.
R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag 等人,“用于学习和评估视觉常识的“某某某”视频数据库”,发表于 2017 年 IEEE 国际计算机视觉会议论文集,第 5842-5850 页。
[21] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "Hmdb: a large video database for human motion recognition," in 2011 International conference on computer vision. IEEE, 2011, pp. 2556-2563.
H. Kuehne、H. Jhuang、E. Garrote、T. Poggio 和 T. Serre,“Hmdb:用于人类动作识别的大型视频数据库”,2011 年计算机视觉国际会议论文集,IEEE,2011 年,第 2556-2563 页。
[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in Computer Vision-ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740-755.
[22] T.-Y. 林, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, 和 C. L. Zitnick, "Microsoft coco: Common objects in context," in Computer Vision-ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740-755.
[23] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ade20k dataset," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 633641.
B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, 和 A. Torralba, "通过 ade20k 数据集进行场景解析," 在 2017 年 IEEE 计算机视觉和模式识别会议论文集中,第 633-641 页。
[24] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (voc) challenge," International journal of computer vision, vol. 88, pp. 303-338, 2010.
M. Everingham, L. Van Gool, C. K. Williams, J. Winn, 和 A. Zisserman, "The pascal visual object classes (voc) challenge," 国际计算机视觉杂志, vol. 88, pp. 303-338, 2010.
[25] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, "Parameter-efficient transfer learning for nlp," in International Conference on Machine Learning. PMLR, 2019, pp. 2790-2799.
N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, 和 S. Gelly, "Parameter-efficient transfer learning for nlp," 在国际机器学习会议上。PMLR, 2019, 页 2790-2799。
[26] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig, "Towards a unified view of parameter-efficient transfer learning," arXiv preprint arXiv:2110.04366, 2021.
J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, 和 G. Neubig, "Towards a unified view of parameter-efficient transfer learning," arXiv 预印本 arXiv:2110.04366, 2021.
[27] Y. Zhu, J. Feng, C. Zhao, M. Wang, and L. Li, "Counterinterference adapter for multilingual machine translation," arXiv preprint arXiv:2104.08154, 2021.
Y. Zhu, J. Feng, C. Zhao, M. Wang, 和 L. Li, "用于多语言机器翻译的抗干扰适配器," arXiv 预印本 arXiv:2104.08154, 2021.
[28] T. Lei, J. Bai, S. Brahma, J. Ainslie, K. Lee, Y. Zhou, N. Du, V. Y. Zhao, Y. Wu, B. Li et al., "Conditional adapters: Parameter-efficient transfer learning with fast inference," arXiv preprint arXiv:2304.04947, 2023.
[28] T. 雷,J. 白,S. 布拉玛,J. 艾恩斯利,K. 李,Y. 周,N. 杜,V. Y. 赵,Y. 吴,B. 李等,“条件适配器:具有快速推理的参数高效迁移学习”,arXiv 预印本 arXiv:2304.04947,2023 年。
[29] J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho, and I. Gurevych, "Adapterfusion: Non-destructive task composition for transfer learning," arXiv preprint arXiv:2005.00247, 2020.
J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho 和 I. Gurevych, "Adapterfusion: Non-destructive task composition for transfer learning," arXiv 预印本 arXiv:2005.00247, 2020.
[30] Y. Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, and J. Gao, "Adamix: Mixture-of-adapter for parameter-efficient tuning of large language models," arXiv preprint arXiv:2205.12410, vol. 1, no. 2, p. 4, 2022.
Y. Wang, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah 和 J. Gao, "Adamix: Mixture-of-adapter for parameter-efficient tuning of large language models," arXiv 预印本 arXiv:2205.12410, vol. 1, no. 2, p. 4, 2022.
[31] H. Zhao, J. Fu, and Z. He, "Prototype-based hyperadapter for sampleefficient multi-task tuning," arXiv preprint arXiv:2310.11670, 2023.
H. Zhao, J. Fu, 和 Z. He, "基于原型的超适配器用于高效多任务调整," arXiv 预印本 arXiv:2310.11670, 2023.
[32] A. Chronopoulou, M. E. Peters, A. Fraser, and J. Dodge, "Adaptersoup: Weight averaging to improve generalization of pretrained language models," arXiv preprint arXiv:2302.07027, 2023.
[32] A. Chronopoulou, M. E. Peters, A. Fraser, and J. Dodge,“Adaptersoup: Weight averaging to improve generalization of pretrained language models”,arXiv 预印本 arXiv:2302.07027,2023。
[33] S. He, R.-Z. Fan, L. Ding, L. Shen, T. Zhou, and D. Tao, "Mera: Merging pretrained adapters for few-shot learning," arXiv preprint arXiv:2308.15982, 2023.
S. He, R.-Z. Fan, L. Ding, L. Shen, T. Zhou, 和 D. Tao, "Mera: Merging pretrained adapters for few-shot learning," arXiv 预印本 arXiv:2308.15982, 2023.
[34] R. K. Mahabadi, S. Ruder, M. Dehghani, and J. Henderson, "Parameterefficient multi-task fine-tuning for transformers via shared hypernetworks," arXiv preprint arXiv:2106.04489, 2021.
R. K. Mahabadi, S. Ruder, M. Dehghani 和 J. Henderson,"通过共享超网络进行参数高效的变压器多任务微调",arXiv 预印本 arXiv:2106.04489,2021。
[35] X. L. Li and P. Liang, "Prefix-tuning: Optimizing continuous prompts for generation," arXiv preprint arXiv:2101.00190, 2021.
[35] X. L. Li 和 P. Liang, "前缀调整:优化生成的连续提示," arXiv 预印本 arXiv:2101.00190, 2021.
[36] J. Li, W. Aitken, R. Bhambhoria, and X. Zhu, "Prefix propagation: Parameter-efficient tuning for long sequences," arXiv preprint arXiv:2305.12086, 2023.
J. Li, W. Aitken, R. Bhambhoria 和 X. Zhu,“前缀传播:长序列的参数高效调整”,arXiv 预印本 arXiv:2305.12086,2023。
[37] X. Liu, K. Ji, Y. Fu, W. L. Tam, Z. Du, Z. Yang, and J. Tang, "P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks," arXiv preprint arXiv:2110.07602, 2021.
[37] 刘晓, 季凯, 傅宇, 谭伟立, 杜哲, 杨哲, 唐军, "P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks," arXiv preprint arXiv:2110.07602, 2021.
[38] Z.-R. Zhang, C. Tan, H. Xu, C. Wang, J. Huang, and S. Huang, "Towards adaptive prefix tuning for parameter-efficient language model fine-tuning," arXiv preprint arXiv:2305.15212, 2023.
Z.-R. 张,C. 谭,H. 徐,C. 王,J. 黄,和 S. 黄,“针对参数高效语言模型微调的自适应前缀调整”,arXiv 预印本 arXiv:2305.15212,2023 年。
[39] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang, "Gpt understands, too," arXiv preprint arXiv:2103.10385, 2021.
[39] 刘晓,郑阳,杜哲,丁明,钱燕,杨忠,唐军,"Gpt 也懂",arXiv 预印本 arXiv:2103.10385,2021。
[40] B. Lester, R. Al-Rfou, and N. Constant, "The power of scale for parameter-efficient prompt tuning," arXiv preprint arXiv:2104.08691, 2021.
B. Lester, R. Al-Rfou 和 N. Constant,“参数高效提示调整的规模优势”,arXiv 预印本 arXiv:2104.08691,2021。
[41] F. Ma, C. Zhang, L. Ren, J. Wang, Q. Wang, W. Wu, X. Quan, and D. Song, "Xprompt: Exploring the extreme of prompt tuning," arXiv preprint arXiv:2210.04457, 2022.
F. Ma, C. Zhang, L. Ren, J. Wang, Q. Wang, W. Wu, X. Quan, 和 D. Song, "Xprompt: 探索提示调整的极端," arXiv 预印本 arXiv:2210.04457, 2022.
[42] Z. Wu, S. Wang, J. Gu, R. Hou, Y. Dong, V. Vydiswaran, and H. Ma, "Idpg: An instance-dependent prompt generation method," arXiv preprint arXiv:2204.04497, 2022.
Z. Wu, S. Wang, J. Gu, R. Hou, Y. Dong, V. Vydiswaran, 和 H. Ma, "Idpg: An instance-dependent prompt generation method," arXiv 预印本 arXiv:2204.04497, 2022.
[43] X. Liu, T. Sun, X. Huang, and X. Qiu, "Late prompt tuning: A late prompt could be better than many prompts," arXiv preprint arXiv:2210.11292, 2022.
[43] 刘晓,孙涛,黄晓,邱晓,“晚提示调整:晚提示可能比许多提示更好”,arXiv 预印本 arXiv:2210.11292,2022。
[44] W. Zhu and M. Tan, "Spt: Learning to selectively insert prompts for better prompt tuning," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 11862 11878.
W. Zhu 和 M. Tan,“Spt: 学习选择性插入提示以获得更好的提示调整”,发表于 2023 年实证自然语言处理会议论文集,2023 年,第 11862-11878 页。
[45] Q. Wang, Y. Mao, J. Wang, H. Yu, S. Nie, S. Wang, F. Feng, L. Huang, X. Quan, Z. Xu et al., "Aprompt: Attention prompt tuning for efficient adaptation of pre-trained language models," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 9147-9160.
[45] 王琪,毛阳,王军,于航,聂松,王帅,冯飞,黄磊,全晓,徐忠等,“Aprompt:注意力提示调整用于预训练语言模型的高效适应”,发表于 2023 年经验方法在自然语言处理会议论文集,2023 年,第 9147-9160 页。
[46] T. Vu, B. Lester, N. Constant, R. A1-Rfou, and D. Cer, "Spot: Better frozen model adaptation through soft prompt transfer," arXiv preprint arXiv:2110.07904, 2021.
T. Vu, B. Lester, N. Constant, R. A1-Rfou, 和 D. Cer, "Spot: Better frozen model adaptation through soft prompt transfer," arXiv 预印本 arXiv:2110.07904, 2021.
[47] Y. Su, X. Wang, Y. Qin, C.-M. Chan, Y. Lin, H. Wang, K. Wen, Z. Liu, P. Li, J. Li et al., "On transferability of prompt tuning for natural language processing," arXiv preprint arXiv:2111.06719, 2021.
Y. Su, X. Wang, Y. Qin, C.-M. Chan, Y. Lin, H. Wang, K. Wen, Z. Liu, P. Li, J. Li 等人,“关于提示调整在自然语言处理中的可转移性”,arXiv 预印本 arXiv:2111.06719,2021。
[48] J. Wu, T. Yu, R. Wang, Z. Song, R. Zhang, H. Zhao, C. Lu, S. Li, and R. Henao, "Infoprompt: Information-theoretic soft prompt tuning for natural language understanding," arXiv preprint arXiv:2306.04933, 2023.
J. Wu, T. Yu, R. Wang, Z. Song, R. Zhang, H. Zhao, C. Lu, S. Li, 和 R. Henao, "Infoprompt: 信息论软提示调整自然语言理解," arXiv 预印本 arXiv:2306.04933, 2023.
[49] L. Chen, H. Huang, and M. Cheng, "Ptp: Boosting stability and performance of prompt tuning with perturbation-based regularizer," arXiv preprint arXiv:2305.02423, 2023.
L. Chen, H. Huang, 和 M. Cheng, "Ptp: 通过基于扰动的正则化器提高提示调整的稳定性和性能," arXiv 预印本 arXiv:2305.02423, 2023.
[50] Y. Qin, X. Wang, Y. Su, Y. Lin, N. Ding, J. Yi, W. Chen, Z. Liu, J. Li, L. Hou et al., "Exploring universal intrinsic task subspace via prompt tuning," arXiv preprint arXiv:2110.07867, 2021.
Qin Y, Wang X, Su Y, Lin Y, Ding N, Yi J, Chen W, Liu Z, Li J, Hou L 等人,“通过提示调整探索通用内在任务子空间”,arXiv 预印本 arXiv:2110.07867,2021 年。
[51] J.-Y. Choi, J. Kim, J.-H. Park, W.-L. Mok, and S. Lee, "Smop: Towards efficient and effective prompt tuning with sparse mixture-of-prompts," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 14306-14316.
J.-Y. Choi, J. Kim, J.-H. Park, W.-L. Mok, 和 S. Lee, "Smop: Towards efficient and effective prompt tuning with sparse mixture-of-prompts," 在《2023 年自然语言处理经验方法会议论文集》中发表,2023 年,第 14306-14316 页。
[52] Z. Shi and A. Lipani, "Dept: Decomposed prompt tuning for parameterefficient fine-tuning," arXiv preprint arXiv:2309.05173, 2023.
Z. Shi 和 A. Lipani,“Dept: Decomposed prompt tuning for parameter-efficient fine-tuning”,arXiv 预印本 arXiv:2309.05173,2023。
[53] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. A. Raffel, "Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning," Advances in Neural Information Processing Systems, vol. 35, pp. 1950-1965, 2022.
[53] H. 刘,D. 坦,M. Muqeeth,J. Mohta,T. 黄,M. Bansal 和 C. A. Raffel,“少样本参数高效微调比上下文学习更好更便宜,” 神经信息处理系统进展,第 35 卷,第 1950-1965 页,2022 年。

[54] T. Zadouri, A. Üstün, A. Ahmadian, B. Ermiş, A. Locatelli, and S. Hooker, "Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning," arXiv preprint arXiv:2309.05444, 2023.
T. Zadouri,A. Üstün,A. Ahmadian,B. Ermiş,A. Locatelli 和 S. Hooker,“将专家混合推向极限:用于指令调整的极其参数高效的 moe”,arXiv 预印本 arXiv:2309.05444,2023。
[55] D. Lian, D. Zhou, J. Feng, and X. Wang, "Scaling & shifting your features: A new baseline for efficient model tuning," Advances in Neural Information Processing Systems, vol. 35, pp. 109-123, 2022.
D. Lian, D. Zhou, J. Feng, 和 X. Wang, "Scaling & shifting your features: A new baseline for efficient model tuning," Advances in Neural Information Processing Systems, vol. 35, pp. 109-123, 2022.
[56] X. Lu, F. Brahman, P. West, J. Jang, K. Chandu, A. Ravichander, L. Qin, P. Ammanabrolu, L. Jiang, S. Ramnath et al., "Inference-time policy adapters (ipa): Tailoring extreme-scale lms without fine-tuning," arXiv preprint arXiv:2305.15065, 2023.
[56] X.卢,F.布拉曼,P.韦斯特,J.江,K.钱杜,A.拉维钱德,L.秦,P.阿曼纳布罗卢,L.江,S.拉姆纳特等人,“推断时间策略适配器(ipa):定制极端规模 而无需微调”,arXiv 预印本 arXiv:2305.15065,2023 年。
[57] D. Guo, A. M. Rush, and Y. Kim, "Parameter-efficient transfer learning with diff pruning," arXiv preprint arXiv:2012.07463, 2020.
D. Guo, A. M. Rush, 和 Y. Kim, "Parameter-efficient transfer learning with diff pruning," arXiv 预印本 arXiv:2012.07463, 2020.
[58] N. Lawton, A. Kumar, G. Thattai, A. Galstyan, and G. V. Steeg, "Neural architecture search for parameter-efficient fine-tuning of large pre-trained language models," arXiv preprint arXiv:2305.16597, 2023.
N. Lawton, A. Kumar, G. Thattai, A. Galstyan, 和 G. V. Steeg, "神经架构搜索用于大型预训练语言模型的参数高效微调," arXiv 预印本 arXiv:2305.16597, 2023.
[59] B. Liao, Y. Meng, and C. Monz, "Parameter-efficient fine-tuning without introducing new latency," arXiv preprint arXiv:2305.16742, 2023.
[59] B. 廖,Y. 孟,和 C. 蒙兹,“无需引入新的延迟的参数高效微调”,arXiv 预印本 arXiv:2305.16742,2023。
[60] Y.-L. Sung, V. Nair, and C. A. Raffel, "Training neural networks with fixed sparse masks," Advances in Neural Information Processing Systems, vol. 34, pp. 24 193-24 205, 2021.
Y.-L. Sung, V. Nair 和 C. A. Raffel,“使用固定稀疏掩模训练神经网络”,《神经信息处理系统进展》,第 34 卷,第 24 193-24 205 页,2021 年。
[61] S. S. S. Das, R. H. Zhang, P. Shi, W. Yin, and R. Zhang, "Unified low-resource sequence labeling by sample-aware dynamic sparse finetuning," arXiv preprint arXiv:2311.03748, 2023.
S. S. S. 达斯,R. H. 张,P. 史,W. 尹和 R. 张,“通过样本感知动态稀疏微调实现统一的低资源序列标记”,arXiv 预印本 arXiv:2311.03748,2023。
[62] A. Ansell, E. M. Ponti, A. Korhonen, and I. Vulić, "Composable sparse fine-tuning for cross-lingual transfer," arXiv preprint arXiv:2110.07560, 2021
[62] A. Ansell, E. M. Ponti, A. Korhonen, and I. Vulić, "用于跨语言转移的可组合稀疏微调," arXiv 预印本 arXiv:2110.07560, 2021
[63] Z. Fu, H. Yang, A. M.-C. So, W. Lam, L. Bing, and N. Collier, "On the effectiveness of parameter-efficient fine-tuning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. .
[63] Z. Fu, H. Yang, A. M.-C. So, W. Lam, L. Bing, and N. Collier, "On the effectiveness of parameter-efficient fine-tuning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. . 【63】Z. Fu,H. Yang,A. M.-C. So,W. Lam,L. Bing 和 N. Collier,“关于参数高效微调的有效性”,载于第 37 卷第 11 期 2023 年 AAAI 人工智能会议论文集,第 页。
[64] R. Xu, F. Luo, Z. Zhang, C. Tan, B. Chang, S. Huang, and F. Huang, "Raise a child in large language model: Towards effective and generalizable fine-tuning," arXiv preprint arXiv:2109.05687, 2021.
R. Xu, F. Luo, Z. Zhang, C. Tan, B. Chang, S. Huang, 和 F. Huang, "在大型语言模型中培养一个孩子:朝着有效和可泛化的微调迈进," arXiv 预印本 arXiv:2109.05687, 2021.
[65] D. Vucetic, M. Tayaranian, M. Ziaeefard, J. J. Clark, B. H. Meyer, and W. J. Gross, "Efficient fine-tuning of bert models on the edge," in 2022 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2022, pp. 1838-1842.
D. Vucetic, M. Tayaranian, M. Ziaeefard, J. J. Clark, B. H. Meyer, 和 W. J. Gross, "Efficient fine-tuning of bert models on the edge," 在 2022 年 IEEE 国际电路与系统研讨会 (ISCAS) 中。IEEE, 2022, pp. 1838-1842.
[66] E. B. Zaken, S. Ravfogel, and Y. Goldberg, "Bitfit: Simple parameterefficient fine-tuning for transformer-based masked language-models," arXiv preprint arXiv:2106.10199, 2021.
E. B. Zaken, S. Ravfogel 和 Y. Goldberg, "Bitfit: Simple parameterefficient fine-tuning for transformer-based masked language-models," arXiv 预印本 arXiv:2106.10199, 2021.
[67] M. Gheini, X. Ren, and J. May, "Cross-attention is all you need: Adapting pretrained transformers for machine translation," arXiv preprint arXiv:2104.08771, 2021.
M. Gheini, X. Ren 和 J. May,“交叉注意力就是你所需要的一切:为机器翻译调整预训练的 transformers”,arXiv 预印本 arXiv:2104.08771,2021。
[68] H. He, J. Cai, J. Zhang, D. Tao, and B. Zhuang, "Sensitivity-aware visual parameter-efficient fine-tuning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11825 11835.
H. He, J. Cai, J. Zhang, D. Tao, 和 B. Zhuang, "Sensitivity-aware visual parameter-efficient fine-tuning," 在 2023 年 IEEE/CVF 国际计算机视觉会议论文集中, pp. 11825-11835.
[69] A. Aghajanyan, L. Zettlemoyer, and S. Gupta, "Intrinsic dimensionality explains the effectiveness of language model fine-tuning," arXiv preprint arXiv:2012.13255, 2020.
A. Aghajanyan, L. Zettlemoyer, 和 S. Gupta, "内在维度解释语言模型微调的有效性," arXiv 预印本 arXiv:2012.13255, 2020.
[70] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "Lora: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021
[70] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "Lora: Low-rank adaptation of large language models," arXiv preprint arXiv:2106.09685, 2021 [70] 胡恩杰,沈阳,沃利斯,艾伦-朱,李阳,王硕,王磊,陈伟,“Lora: 大型语言模型的低秩适应”,arXiv 预印本 arXiv:2106.09685,2021
[71] R. Karimi Mahabadi, J. Henderson, and S. Ruder, "Compacter: Efficient low-rank hypercomplex adapter layers," Advances in Neural Information Processing Systems, vol. 34, pp. 1022-1035, 2021.
[71] R. Karimi Mahabadi, J. Henderson, 和 S. Ruder, "Compacter: 高效低秩超复适配器层," Advances in Neural Information Processing Systems, vol. 34, pp. 1022-1035, 2021.
[72] A. Edalati, M. Tahaei, I. Kobyzev, V. P. Nia, J. J. Clark, and M. Rezagholizadeh, "Krona: Parameter efficient tuning with kronecker adapter," arXiv preprint arXiv:2212.10650, 2022.
[72] A. Edalati, M. Tahaei, I. Kobyzev, V. P. Nia, J. J. Clark, and M. Rezagholizadeh, "Krona: Parameter efficient tuning with kronecker adapter," arXiv preprint arXiv:2212.10650, 2022. [72] A. Edalati, M. Tahaei, I. Kobyzev, V. P. Nia, J. J. Clark, 和 M. Rezagholizadeh, "Krona: Parameter efficient tuning with kronecker adapter," arXiv 预印本 arXiv:2212.10650, 2022.
[73] X. He, C. Li, P. Zhang, J. Yang, and X. E. Wang, "Parameter-efficient model adaptation for vision transformers," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 817-825.
[73] X. 何,C. 李,P. 张,J. 杨和 X. E. 王,“视觉变压器的参数高效模型适应”,在第 37 卷第 1 期 2023 年 AAAI 人工智能会议论文集中,第 817-825 页。
[74] D. J. Kopiczko, T. Blankevoort, and Y. M. Asano, "Vera: Vector-based random matrix adaptation," arXiv preprint arXiv:2310.11454, 2023.
D. J. Kopiczko、T. Blankevoort 和 Y. M. Asano,"Vera: 基于向量的随机矩阵适应",arXiv 预印本 arXiv:2310.11454,2023。
[75] S.-Y. Liu, C.-Y. Wang, H. Yin, P. Molchanov, Y.-C. F. Wang, K.T. Cheng, and M.-H. Chen, "Dora: Weight-decomposed low-rank adaptation," arXiv preprint arXiv:2402.09353, 2024.
[75] 刘胜宇,王春阳,尹航,P.莫尔查诺夫,王宇春,郑克腾,陈明华,“Dora: 权重分解低秩适应”,arXiv 预印本 arXiv:2402.09353,2024 年。
[76] M. Valipour, M. Rezagholizadeh, I. Kobyzev, and A. Ghodsi, "Dylora: Parameter efficient tuning of pre-trained models using dynamic searchfree low-rank adaptation," arXiv preprint arXiv:2210.07558, 2022.
M. Valipour、M. Rezagholizadeh、I. Kobyzev 和 A. Ghodsi,"Dylora: Parameter efficient tuning of pre-trained models using dynamic searchfree low-rank adaptation," arXiv 预印本 arXiv:2210.07558,2022。
[77] Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, W. Chen, and T. Zhao, "Adaptive budget allocation for parameter-efficient finetuning," arXiv preprint arXiv:2303.10512, 2023.
[77] Q. 张,M. 陈,A. 布哈林,P. 何,Y. 程,W. 陈和 T. 赵,“参数高效微调的自适应预算分配”,arXiv 预印本 arXiv:2303.10512,2023 年。
[78] N. Ding, X. Lv, Q. Wang, Y. Chen, B. Zhou, Z. Liu, and M. Sun, "Sparse low-rank adaptation of pre-trained language models," arXiv preprint arXiv:2311.11696, 2023.
[78] 丁宁,吕晓,王强,陈阳,周斌,刘忠,孙明,“预训练语言模型的稀疏低秩适应”,arXiv 预印本 arXiv:2311.11696,2023。
[79] S. Haobo, H. Zhao, S. Majumder, and T. Lin, "Increasing model capacity for free: A simple strategy for parameter efficient fine-tuning," in The Twelfth International Conference on Learning Representations, 2023.
[79] S. Haobo, H. Zhao, S. Majumder, and T. Lin,“免费增加模型容量:一种简单的参数高效微调策略”,发表于 2023 年第十二届国际学习表示会议。
[80] R. Zhang, R. Qiang, S. A. Somayajula, and P. Xie, "Autolora: Automatically tuning matrix ranks in low-rank adaptation based on meta learning," arXiv preprint arXiv:2403.09113, 2024.
R. Zhang, R. Qiang, S. A. Somayajula, 和 P. Xie, "Autolora: Automatically tuning matrix ranks in low-rank adaptation based on meta learning," arXiv 预印本 arXiv:2403.09113, 2024.
[81] A. X. Yang, M. Robeyns, X. Wang, and L. Aitchison, "Bayesian low-rank adaptation for large language models," arXiv preprint
[81] A. X.杨,M. Robeyns,X. 王和 L. Aitchison,“大型语言模型的贝叶斯低秩适应”,arXiv 预印本
[82] Y. Lin, X. Ma, X. Chu, Y. Jin, Z. Yang, Y. Wang, and H. Mei, "Lora dropout as a sparsity regularizer for overfitting control," arXiv preprint arXiv:2404.09610, 2024
Y.林,X.马,X.楚,Y.金,Z.杨,Y.王和 H.梅,“Lora 辍学作为过度拟合控制的稀疏正则化器”,arXiv 预印本 arXiv:2404.09610,2024
[83] X. Meng, D. Dai, W. Luo, Z. Yang, S. Wu, X. Wang, P. Wang, Q. Dong, L. Chen, and Z. Sui, "Periodiclora: Breaking the low-rank bottleneck in lora optimization," arXiv preprint arXiv:2402.16141, 2024.
[83] X. 孟,D. 戴,W. 罗,Z. 杨,S. 吴,X. 王,P. 王,Q. 董,L. 陈,和 Z. 隋,“Periodiclora: 打破 Lora 优化中的低秩瓶颈”,arXiv 预印本 arXiv:2402.16141,2024 年。
[84] S. Hayou, N. Ghosh, and B. Yu, "Lora+: Efficient low rank adaptation of large models," arXiv preprint arXiv:2402.12354, 2024.
S. Hayou, N. Ghosh, 和 B. Yu, "Lora+: 大型模型的高效低秩适应," arXiv 预印本 arXiv:2402.12354, 2024.
[85] C. Huang, Q. Liu, B. Y. Lin, T. Pang, C. Du, and M. Lin, "Lorahub: Efficient cross-task generalization via dynamic lora composition," arXiv preprint arXiv:2307.13269, 2023.
C.黄,Q.刘,B.Y.林,T.庞,C.杜和 M.林,“Lorahub:通过动态 lora 组合实现高效的跨任务泛化”,arXiv 预印本 arXiv:2307.13269,2023。
[86] Q. Liu, X. Wu, X. Zhao, Y. Zhu, D. Xu, F. Tian, and Y. Zheng, "Moelora: An moe-based parameter efficient fine-tuning method for multi-task medical applications," arXiv preprint arXiv:2310.18339, 2023.
[86] 刘琦,吴旭,赵欣,朱瑜,徐丹,田飞,郑阳,"Moelora: 一种基于 moe 的参数高效微调方法,用于多任务医疗应用," arXiv 预印本 arXiv:2310.18339, 2023.
[87] W. Feng, C. Hao, Y. Zhang, Y. Han, and H. Wang, "Mixture-of-loras: An efficient multitask tuning for large language models," arXiv preprint arXiv:2403.03432, 2024
[87] 冯威,郝超,张宇,韩宇,王红,"Mixture-of-loras: 大型语言模型的高效多任务调整," arXiv 预印本 arXiv:2403.03432, 2024
[88] X. Wu, S. Huang, and F. Wei, "Mixture of lora experts," arXiv preprint arXiv:2404.13628, 2024
[88] X. Wu, S. Huang, and F. Wei, "Lora 专家混合模型," arXiv 预印本 arXiv:2404.13628, 2024
[89] D. Li, Y. Ma, N. Wang, Z. Cheng, L. Duan, J. Zuo, C. Yang, and M. Tang, "Mixlora: Enhancing large language models fine-tuning with lora based mixture of experts," arXiv preprint arXiv:2404.15159, 2024.
[90] Y. Mao, L. Mathias, R. Hou, A. Almahairi, H. Ma, J. Han, W.-t. Yih, and M. Khabsa, "Unipelt: A unified framework for parameter-efficient language model tuning," arXiv preprint arXiv:2110.07577, 2021.
Y. Mao, L. Mathias, R. Hou, A. Almahairi, H. Ma, J. Han, W.-t. Yih, 和 M. Khabsa, "Unipelt: 一个用于参数高效语言模型调整的统一框架," arXiv 预印本 arXiv:2110.07577, 2021.
[91] J. Chen, A. Zhang, X. Shi, M. Li, A. Smola, and D. Yang, "Parameterefficient fine-tuning design spaces," arXiv preprint arXiv:2301.01821, 2023.
J. Chen, A. Zhang, X. Shi, M. Li, A. Smola, 和 D. Yang, "参数高效微调设计空间," arXiv 预印本 arXiv:2301.01821, 2023.
[92] Y. Zhang, K. Zhou, and Z. Liu, "Neural prompt search," 2022.
[92] 张宇,周凯,刘哲,“神经提示搜索”,2022。
[93] H. Zhou, X. Wan, I. Vulić, and A. Korhonen, "Autopeft: Automatic configuration search for parameter-efficient fine-tuning," arXiv preprint arXiv:2301.12132, 2023.
[93] H. Zhou, X. Wan, I. Vulić, and A. Korhonen,“Autopeft: Automatic configuration search for parameter-efficient fine-tuning”,arXiv 预印本 arXiv:2301.12132,2023。
[94] Z. Hu, Y. Lan, L. Wang, W. Xu, E.-P. Lim, R. K.-W. Lee, L. Bing, and S. Poria, "Llm-adapters: An adapter family for parameter-efficient finetuning of large language models," arXiv preprint arXiv:2304.01933, 2023.
Z. 胡,Y. 兰,L. 王,W. 徐,E.-P. 林,R. K.-W. 李,L. 冰和 S. 波里亚,"Llm-适配器:用于大型语言模型参数高效微调的适配器系列",arXiv 预印本 arXiv:2304.01933,2023。
[95] S. Hu, Z. Zhang, N. Ding, Y. Wang, Y. Wang, Z. Liu, and M. Sun, "Sparse structure search for parameter-efficient tuning," arXiv preprint arXiv:2206.07382, 2022
[95] 胡胜利,张哲,丁宁,王宇,王阳,刘哲,孙明,"用于参数高效调整的稀疏结构搜索",arXiv 预印本 arXiv:2206.07382,2022
[96] A. Petrov, P. H. Torr, and A. Bibi, "When do prompting and prefixtuning work? a theory of capabilities and limitations," arXiv preprint arXiv:2310.19698, 2023.
[96] A. Petrov, P. H. Torr, and A. Bibi,“提示和前缀调整何时有效?能力和局限性理论,”arXiv 预印本 arXiv:2310.19698,2023。
[97] Y. Wang, J. Chauhan, W. Wang, and C.-J. Hsieh, "Universality and limitations of prompt tuning," arXiv preprint arXiv:2305.18787, 2023.
王宇,乔翰,王伟和谢长青,“提示调整的普适性和局限性”,arXiv 预印本 arXiv:2305.18787,2023。
[98] Y. Choi and J.-H. Lee, "Codeprompt: Task-agnostic prefix tuning for program and language generation," in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 5282-5297.
Y. Choi 和 J.-H. Lee,“Codeprompt:用于程序和语言生成的任务不可知前缀调整”,发表于计算语言学协会发现:ACL 2023,2023 年,页码 5282-5297。
[99] H. Wu and X. Shi, "Adversarial soft prompt tuning for cross-domain sentiment analysis," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 2438-2447.
吴和石,"对抗性软提示调整用于跨领域情感分析",发表于第 60 届计算语言学年会论文集(第 1 卷:长文),2022 年,第 2438-2447 页。
[100] J. Frankle and M. Carbin, "The lottery ticket hypothesis: Finding sparse, trainable neural networks," arXiv preprint arXiv:1803.03635, 2018.
J. Frankle 和 M. Carbin,“彩票票假设:发现稀疏、可训练的神经网络”,arXiv 预印本 arXiv:1803.03635,2018。
[101] E. Malach, G. Yehudai, S. Shalev-Schwartz, and O. Shamir, "Proving the lottery ticket hypothesis: Pruning is all you need," in International Conference on Machine Learning. PMLR, 2020, pp. 6682-6691
[102] V. Fomenko, H. Yu, J. Lee, S. Hsieh, and W. Chen, "A note on lora," arXiv preprint arXiv:2404.05086, 2024.
[102] V. Fomenko, H. Yu, J. Lee, S. Hsieh, and W. Chen, "A note on lora," arXiv preprint arXiv:2404.05086, 2024. [102] V. Fomenko, H. Yu, J. Lee, S. Hsieh, 和 W. Chen, "关于 lora 的一点说明," arXiv 预印本 arXiv:2404.05086, 2024.
[103] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM journal on imaging sciences, vol. 2, no. 1, pp. 183-202, 2009.
A. Beck 和 M. Teboulle,“用于线性逆问题的快速迭代收缩阈值算法”,《SIAM 图像科学杂志》,第 2 卷,第 1 期,2009 年,183-202 页。

[104] A. Chambolle, R. A. De Vore, N.-Y. Lee, and B. J. Lucier, "Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage," IEEE Transactions on image processing, vol. 7, no. 3, pp. 319-335, 1998
A. Chambolle, R. A. De Vore, N.-Y. Lee, 和 B. J. Lucier, "非线性小波图像处理:通过小波收缩解决变分问题、压缩和去噪," IEEE 图像处理期刊, vol. 7, no. 3, pp. 319-335, 1998
[105] D. J. MacKay, "A practical bayesian framework for backpropagation networks," Neural computation, vol. 4, no. 3, pp. 448-472, 1992.
D. J. MacKay,“反向传播网络的实用贝叶斯框架”,《神经计算》,第 4 卷,第 3 期,1992 年,448-472 页。
[106] J. Antorán, D. Janz, J. U. Allingham, E. Daxberger, R. R. Barbano, E. Nalisnick, and J. M. Hernández-Lobato, "Adapting the linearised laplace model evidence for modern deep learning," in International Conference on Machine Learning. PMLR, 2022, pp. 796-821.
[106] J. Antorán, D. Janz, J. U. Allingham, E. Daxberger, R. R. Barbano, E. Nalisnick, 和 J. M. Hernández-Lobato, "调整线性化拉普拉斯模型证据以适应现代深度学习," 在机器学习国际会议上。PMLR, 2022, 页码 796-821.
[107] J. Liu, A. Moreau, M. Preuss, J. Rapin, B. Roziere, F. Teytaud, and O. Teytaud, "Versatile black-box optimization," in Proceedings of the 2020 Genetic and Evolutionary Computation Conference, 2020, pp. .
[107] 刘杰,A. 莫罗,M. 普鲁斯,J. 拉平,B. 罗兹尔,F. 泰托,和 O. 泰托,“多功能黑盒优化”,在 2020 年遗传和进化计算大会论文集中,2020 年,第 页。
[108] M. Chen, H. Peng, J. Fu, and H. Ling, "Autoformer: Searching transformers for visual recognition," in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12270-12280.
[108] M. Chen, H. Peng, J. Fu, and H. Ling, "Autoformer: Searching transformers for visual recognition," in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12270-12280. [108] M. Chen, H. Peng, J. Fu, 和 H. Ling, "Autoformer: 为视觉识别搜索 transformers," 在 2021 年 IEEE/CVF 国际计算机视觉会议论文集中, pp. 12270-12280.
[109] P. I. Frazier, "A tutorial on bayesian optimization," arXiv preprint arXiv:1807.02811, 2018.
P. I. Frazier,“贝叶斯优化教程”,arXiv 预印本 arXiv:1807.02811,2018。
[110] A. Rücklé, G. Geigle, M. Glockner, T. Beck, J. Pfeiffer, N. Reimers, and I. Gurevych, "Adapterdrop: On the efficiency of adapters in transformers," arXiv preprint arXiv:2010.11918, 2020.
[110] A. Rücklé, G. Geigle, M. Glockner, T. Beck, J. Pfeiffer, N. Reimers, and I. Gurevych, "Adapterdrop: On the efficiency of adapters in transformers," arXiv preprint arXiv:2010.11918, 2020. [110] A. Rücklé, G. Geigle, M. Glockner, T. Beck, J. Pfeiffer, N. Reimers, 和 I. Gurevych, "Adapterdrop: On the efficiency of adapters in transformers," arXiv 预印本 arXiv:2010.11918, 2020.
[111] S. He, L. Ding, D. Dong, J. Zhang, and D. Tao, "SparseAdapter: An easy approach for improving the parameter-efficiency of adapters," in Findings of the Association for Computational Linguistics: EMNLP 2022. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 2184-2190. [Online]. Available: https://aclanthology.org/2022.findings-emnlp. 160
[111] S. He, L. Ding, D. Dong, J. Zhang, and D. Tao,“SparseAdapter: An easy approach for improving the parameter-efficiency of adapters,” in Findings of the Association for Computational Linguistics: EMNLP 2022. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, Dec. 2022, pp. 2184-2190. [Online]. Available: https://aclanthology.org/2022.findings-emnlp. 160
[112] L. Hedegaard, A. Alok, J. Jose, and A. Iosifidis, "Structured pruning adapters," arXiv preprint arXiv:2211.10155, 2022.
L. Hedegaard、A. Alok、J. Jose 和 A. Iosifidis,"Structured pruning adapters," arXiv 预印本 arXiv:2211.10155,2022。
[113] M. Zhang, C. Shen, Z. Yang, L. Ou, X. Yu, B. Zhuang et al., "Pruning meets low-rank parameter-efficient fine-tuning," arXiv preprint arXiv:2305.18403, 2023.
[113] 张明,沈超,杨忠,欧莉,于雪,庄斌等,"修剪遇见低秩参数高效微调",arXiv 预印本 arXiv:2305.18403,2023。
[114] G. Zeng, P. Zhang, and W. Lu, "One network, many masks: Towards more parameter-efficient transfer learning," arXiv preprint arXiv:2305.17682, 2023.
G. 曾,P. 张和 W. 卢,“一个网络,多个面具:朝着更具参数效率的迁移学习”,arXiv 预印本 arXiv:2305.17682,2023。
[115] S. Jie, H. Wang, and Z.-H. Deng, "Revisiting the parameter efficiency of adapters from the perspective of precision redundancy," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. .
[115] S. Jie, H. Wang, 和 Z.-H. Deng, "从精度冗余的角度重新审视适配器的参数效率," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. .
[116] J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, and D. Lee, "Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization," arXiv preprint arXiv:2305.14152, 2023.
[116] J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, and D. Lee, "Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization," arXiv preprint arXiv:2305.14152, 2023. [116] J. Kim,J. H. Lee,S. Kim,J. Park,K. M. Yoo,S. J. Kwon 和 D. Lee,"通过次 4 位整数量化实现压缩大型语言模型的内存高效微调",arXiv 预印本 arXiv:2305.14152,2023。
[117] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "Qlora: Efficient finetuning of quantized llms," arXiv preprint arXiv:2305.14314, 2023.
T. Dettmers, A. Pagnoni, A. Holtzman, 和 L. Zettlemoyer, "Qlora: Efficient finetuning of quantized 1lms," arXiv 预印本 arXiv:2305.14314, 2023.
[118] Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis, W. Chen, and T. Zhao, "Loftq: Lora-fine-tuning-aware quantization for large language models," arXiv preprint arXiv:2310.08659, 2023.
Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis, W. Chen, 和 T. Zhao, "Loftq: Lora-fine-tuning-aware quantization for large language models," arXiv 预印本 arXiv:2310.08659, 2023.
[119] H. Guo, P. Greengard, E. P. Xing, and Y. Kim, "Lq-lora: Low-rank plus quantized matrix decomposition for efficient language model finetuning," arXiv preprint arXiv:2311.12023, 2023.
H. Guo, P. Greengard, E. P. Xing, 和 Y. Kim, "Lq-lora: 低秩加量化矩阵分解用于高效语言模型微调," arXiv 预印本 arXiv:2311.12023, 2023.
[120] Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, H. Zhang, Z. Chen, X. Zhang, and Q. Tian, "Qa-lora: Quantization-aware low-rank adaptation of large language models," arXiv preprint arXiv:2309.14717, 2023.
Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, H. Zhang, Z. Chen, X. Zhang, and Q. Tian,“Qa-lora: Quantization-aware low-rank adaptation of large language models”,arXiv 预印本 arXiv:2309.14717,2023。
[121] Y. Chai, J. Gkountouras, G. G. Ko, D. Brooks, and G.-Y. Wei, "Int2. 1: Towards fine-tunable quantized large language models with error correction through low-rank adaptation," arXiv preprint arXiv:2306.08162, 2023
Y. Chai, J. Gkountouras, G. G. Ko, D. Brooks, and G.-Y. Wei,“Int2. 1:通过低秩适应实现可微调的量化大型语言模型及误差校正”,arXiv 预印本 arXiv:2306.08162,2023
[122] H. Rajabzadeh, M. Valipour, T. Zhu, M. Tahaei, H. J. Kwon, A. Ghodsi, B. Chen, and M. Rezagholizadeh, "Qdylora: Quantized dynamic lowrank adaptation for efficient large language model tuning," arXiv preprint arXiv:2402.10462, 2024.
H. Rajabzadeh、M. Valipour、T. Zhu、M. Tahaei、H. J. Kwon、A. Ghodsi、B. Chen 和 M. Rezagholizadeh,"Qdylora: Quantized dynamic lowrank adaptation for efficient large language model tuning," arXiv 预印本 arXiv:2402.10462,2024。
[123] J. Liu, G. Xiao, K. Li, J. D. Lee, S. Han, T. Dao, and T. Cai, "Bitdelta: Your fine-tune may only be worth one bit," arXiv preprint arXiv:2402.10193, 2024.
[123] 刘杰,肖戈,李凯,李杰迪,韩帅,刀天,蔡涛,“Bitdelta: 您的微调可能只值一个比特”,arXiv 预印本 arXiv:2402.10193,2024。
[124] J. O. Zhang, A. Sax, A. Zamir, L. Guibas, and J. Malik, "Sidetuning: a baseline for network adaptation via additive side networks," in Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part III 16. Springer, 2020, pp.
[124] J. O. Zhang, A. Sax, A. Zamir, L. Guibas, and J. Malik,“Sidetuning: a baseline for network adaptation via additive side networks,” in Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part III 16. Springer, 2020, pp.
[125] Y.-L. Sung, J. Cho, and M. Bansal, "Lst: Ladder side-tuning for parameter and memory efficient transfer learning," Advances in Neural Information Processing Systems, vol. 35, pp. 12991-13005, 2022.
[126] Z. Jiang, C. Mao, Z. Huang, A. Ma, Y. Lv, Y. Shen, D. Zhao, and J. Zhou, "Res-tuning: A flexible and efficient tuning paradigm via unbinding tuner from backbone," arXiv preprint arXiv:2310.19859, 2023.
[127] B. Liao, S. Tan, and C. Monz, "Make your pre-trained model reversible: From parameter to memory efficient fine-tuning," arXiv preprint arXiv:2306.00477, 2023.
[128] L. Zhang, L. Zhang, S. Shi, X. Chu, and B. Li, "Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning," arXiv preprint arXiv:2308.03303, 2023.
[129] J. Phang, Y. Mao, P. He, and W. Chen, "Hypertuning: Toward adapting large language models without back-propagation," in International Conference on Machine Learning. PMLR, 2023, pp. 27854-27875.
[130] F. Jin, J. Zhang, and C. Zong, "Parameter-efficient tuning for large language model without calculating its gradients," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 321-330.
[131] S. Malladi, T. Gao, E. Nichani, A. Damian, J. D. Lee, D. Chen, and S. Arora, "Fine-tuning language models with just forward passes," arXiv preprint arXiv:2305.17333, 2023.
[132] J. Zhao, Z. Zhang, B. Chen, Z. Wang, A. Anandkumar, and Y. Tian, "Galore: Memory-efficient training by gradient low-rank projection," arXiv preprint arXiv:2403.03507, 2024.
[133] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica, "Efficient memory management for large language model serving with pagedattention," in Proceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 611-626.
[134] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang, "Flexgen: High-throughput generative inference of large language models with a single gpu," in International Conference on Machine Learning. PMLR, 2023, pp. 31094-31116.
[135] T. Zhou and D. Tao, "Godec: Randomized low-rank & sparse matrix decomposition in noisy case," in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, 2011.
[136] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma, "Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization," Advances in neural information processing systems, vol. 22, 2009.
[137] A. N. Gomez, M. Ren, R. Urtasun, and R. B. Grosse, "The reversible residual network: Backpropagation without storing activations," Advances in neural information processing systems, vol. 30, 2017.
[138] Y. Huang, Y. Li, Y. Xu, L. Zhang, R. Gan, J. Zhang, and L. Wang, "Mvp-tuning: Multi-view knowledge retrieval with prompt tuning for commonsense reasoning," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 13417-13432.
[139] Z. Zhao, L. Hu, H. Zhao, Y. Shao, and Y. Wang, "Knowledgeable parameter efficient tuning network for commonsense question answering," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 9051-9063.
[140] H. Zhao, R. He, M. Xiao, and J. Xu, "Infusing hierarchical guidance into prompt tuning: A parameter-efficient framework for multi-level implicit discourse relation recognition," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 6477-6492.
[141] Y. Ouyang, Y. Cao, Y. Gao, Z. Wu, J. Zhang, and X. Dai, "On prefix-tuning for lightweight out-of-distribution detection," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 1533-1545.
[142] M. S. Ozdayi, C. Peris, J. Fitzgerald, C. Dupuy, J. Majmudar, H. Khan, R. Parikh, and R. Gupta, "Controlling the extraction of memorized data from large language models via prompt-tuning," arXiv preprint arXiv:2305.11759, 2023.
[143] G. Xiao, J. Lin, and S. Han, "Offsite-tuning: Transfer learning without full model," arXiv preprint arXiv:2302.04870, 2023.
[144] T. Che, J. Liu, Y. Zhou, J. Ren, J. Zhou, V. S. Sheng, H. Dai, and D. Dou, "Federated learning of large language models with parameter-efficient prompt tuning and adaptive optimization," arXiv preprint arXiv:2310.15080, 2023.
[145] Y. Li, M. Du, X. Wang, and Y. Wang, "Prompt tuning pushes farther, contrastive learning pulls closer: A two-stage approach to mitigate social biases," arXiv preprint arXiv:2307.01595, 2023.
[146] J. Cho, J. Lei, H. Tan, and M. Bansal, "Unifying vision-and-language tasks via text generation," in International Conference on Machine Learning. PMLR, 2021, pp. 1931-1942.
[147] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, "Minigpt-4: Enhancing vision-language understanding with advanced large language models," arXiv preprint arXiv:2304.10592, 2023.
[148] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," arXiv preprint arXiv:2304.08485, 2023.
[149] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical sequence training for image captioning," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
[150] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4651-4659.
[151] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: Lessons learned from the 2015 mscoco image captioning challenge," IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 4, pp. 652-663, 2016.
[152] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga, "A comprehensive survey of deep learning for image captioning," ACM Computing Surveys (CSUR), vol. 51, no. 6, pp. 1-36, 2019.
[153] P. Wang, Q. Wu, C. Shen, A. Dick, and A. Van Den Hengel, "Fvqa: Fact-based visual question answering," IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 10, pp. 2413-2427, 2017.
[154] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. Van Den Hengel, "Visual question answering: A survey of methods and datasets," Computer Vision and Image Understanding, vol. 163, pp. 21-40, 2017.
[155] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, "Vqa: Visual question answering," in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425-2433.
[156] Y.-L. Sung, J. Cho, and M. Bansal, "Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5227-5237.
[157] Z.-Y. Hu, Y. Li, M. R. Lyu, and L. Wang, "Vl-pet: Vision-and-language parameter-efficient tuning via granularity control," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3010-3020.
[158] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, and Y. Qiao, "Llama-adapter: Efficient fine-tuning of language models with zero-init attention," arXiv preprint arXiv:2303.16199, 2023.
[159] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu, C. He, X. Yue et al., "Llama-adapter v2: Parameter-efficient visual instruction model," arXiv preprint arXiv:2304.15010, 2023.
[160] B. Zhao, H. Tu, C. Wei, J. Mei, and C. Xie, "Tuning layernorm in attention: Towards efficient multi-modal llm finetuning," arXiv preprint arXiv:2312.11420, 2023.
[161] S. Lee, "Toward continual learning for conversational agents," arXiv preprint arXiv:1712.09943, 2017.
[162] C.-H. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan, "A survey of web information extraction systems," IEEE transactions on knowledge and data engineering, vol. 18, no. 10, pp. 1411-1428, 2006.
[163] W. Yang, Y. Xie, A. Lin, X. Li, L. Tan, K. Xiong, M. Li, and J. Lin, "End-to-end open-domain question answering with bertserini," arXiv preprint arXiv:1902.01718, 2019.
[164] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the national academy of sciences, vol. 114, no. 13, 2017.
[165] A. Madotto, Z. Lin, Z. Zhou, S. Moon, P. Crook, B. Liu, Z. Yu, E. Cho, and Z. Wang, "Continual learning in task-oriented dialogue systems," arXiv preprint arXiv:2012.15504, 2020.
[166] Q. Zhu, B. Li, F. Mi, X. Zhu, and M. Huang, "Continual prompt tuning for dialog state tracking," arXiv preprint arXiv:2203.06654, 2022.
[167] Y. Dai, H. Lang, Y. Zheng, F. Huang, L. Si, and Y. Li, "Lifelong learning for question answering with hierarchical prompts," arXiv preprint arXiv:2208.14602, 2022.
[168] Z. Liang, F. Wei, Y. Jie, Y. Qian, Z. Hao, and B. Han, "Prompts can play lottery tickets well: Achieving lifelong information extraction via lottery prompt tuning," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 277-292.
[169] X. Wang, T. Chen, Q. Ge, H. Xia, R. Bao, R. Zheng, Q. Zhang, T. Gui, and X. Huang, "Orthogonal subspace learning for language model continual learning," arXiv preprint arXiv:2310.14152, 2023.
[170] S. Chen, S. Wong, L. Chen, and Y. Tian, "Extending context window of large language models via positional interpolation," arXiv preprint arXiv:2306.15595, 2023.
[171] Y. Chen, S. Qian, H. Tang, X. Lai, Z. Liu, S. Han, and J. Jia, "Longlora: Efficient fine-tuning of long-context large language models," arXiv preprint arXiv:2309.12307, 2023.
[172] J. Yang, "Longqlora: Efficient and effective method to extend context length of large language models," arXiv preprint arXiv:2311.04879, 2023.
[173] S. Tan, X. Li, S. Patil, Z. Wu, T. Zhang, K. Keutzer, J. E. Gonzalez, and R. A. Popa, "Lloco: Learning long contexts offline," arXiv preprint arXiv:2404.07979, 2024.
[174] Z. Zhang, Y. Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y. Tian, C. Ré, C. Barrett et al., "H2o: Heavy-hitter oracle for efficient generative inference of large language models," Advances in Neural Information Processing Systems, vol. 36, 2024.
[175] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, "Gpt3.int8(): 8-bit matrix multiplication for transformers at scale," Advances in Neural Information Processing Systems, vol. 35, pp. 30318-30332, 2022.
[176] H. Kang, Q. Zhang, S. Kundu, G. Jeong, Z. Liu, T. Krishna, and T. Zhao, "Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm," arXiv preprint arXiv:2403.05527, 2024.
[177] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
[178] A. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, and L. Beyer, "How to train your vit? data, augmentation, and regularization in vision transformers," arXiv preprint arXiv:2106.10270, 2021.
[179] X. Chen, S. Xie, and K. He, "An empirical study of training selfsupervised vision transformers," in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9640-9649.
[180] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, "Masked autoencoders are scalable vision learners," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.
[181] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin et al., "Scaling vision transformers to 22 billion parameters," in International Conference on Machine Learning. PMLR, 2023, pp. 7480-7512.
[182] Z. Chen, Y. Duan, W. Wang, J. He, T. Lu, J. Dai, and Y. Qiao, "Vision transformer adapter for dense predictions," arXiv preprint arXiv:2205.08534, 2022.
[183] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister, "Learning to prompt for continual learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 139-149.
[184] Q. Gao, C. Zhao, Y. Sun, T. Xi, G. Zhang, B. Ghanem, and J. Zhang, "A unified continual learning framework with general parameter-efficient tuning," arXiv preprint arXiv:2303.10070, 2023.
[185] L. Ren, C. Chen, L. Wang, and K. Hua, "Learning semantic proxies from visual prompts for parameter-efficient fine-tuning in deep metric learning," arXiv preprint arXiv:2402.02340, 2024.
[186] M. Jia, L. Tang, B.-C. Chen, C. Cardie, S. Belongie, B. Hariharan, and S.-N. Lim, "Visual prompt tuning," in European Conference on Computer Vision. Springer, 2022, pp. 709-727.
[187] S. Chen, C. Ge, Z. Tong, J. Wang, Y. Song, J. Wang, and P. Luo, "Adaptformer: Adapting vision transformers for scalable visual recognition," Advances in Neural Information Processing Systems, vol. 35, 2022.
[188] S. Jie and Z.-H. Deng, "Convolutional bypasses are better vision transformer adapters," arXiv preprint arXiv:2207.07039, 2022.
[189] S. Yoo, E. Kim, D. Jung, J. Lee, and S. Yoon, "Improving visual prompt tuning for self-supervised vision transformers," arXiv preprint arXiv:2306.05067, 2023.
[190] J. Pan, Z. Lin, X. Zhu, J. Shao, and H. Li, "St-adapter: Parameter-efficient image-to-video transfer learning," Advances in Neural Information Processing Systems, vol. 35, pp. 26462-26477, 2022.
[191] T. Yang, Y. Zhu, Y. Xie, A. Zhang, C. Chen, and M. Li, "Aim: Adapting image models for efficient video action recognition," arXiv preprint arXiv:2302.03024, 2023.
[192] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International conference on machine learning. PMLR, 2021, pp. 8748-8763.
[193] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, "Scaling up visual and vision-language representation learning with noisy text supervision," in International conference on machine learning. PMLR, 2021, pp. 4904-4916.
[194] Y. Li, F. Liang, L. Zhao, Y. Cui, W. Ouyang, J. Shao, F. Yu, and J. Yan, "Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm," arXiv preprint arXiv:2110.05208, 2021.
[195] A. Singh, R. Hu, V. Goswami, G. Couairon, W. Galuba, M. Rohrbach, and D. Kiela, "Flava: A foundational language and vision alignment model," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[196] M. Xu, Z. Zhang, F. Wei, H. Hu, and X. Bai, "Side adapter network for open-vocabulary semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2945-2954.
[197] Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, "Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip," arXiv preprint arXiv:2308.02487, 2023.
[198] Z. Xu, Z. Chen, Y. Zhang, Y. Song, X. Wan, and G. Li, "Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17503-17512.
[199] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li, "Pointclip: Point cloud understanding by clip," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8552-8562.
[200] X. Zhu, R. Zhang, B. He, Z. Guo, Z. Zeng, Z. Qin, S. Zhang, and P. Gao, "Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2639-2650.
[201] Z. Wang, X. Yu, Y. Rao, J. Zhou, and J. Lu, "P2p: Tuning pre-trained image models for point cloud analysis with point-to-pixel prompting," Advances in neural information processing systems, vol. 35, 2022.
[202] T. Huang, B. Dong, Y. Yang, X. Huang, R. W. Lau, W. Ouyang, and W. Zuo, "Clip2point: Transfer clip to point cloud classification with image-depth pre-training," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22157-22167.
[203] C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie, "Prompting visual-language models for efficient video understanding," in European Conference on Computer Vision. Springer, 2022, pp. 105-124.
[204] B. Ni, H. Peng, M. Chen, S. Zhang, G. Meng, J. Fu, S. Xiang, and H. Ling, "Expanding language-image pretrained models for general video recognition," in European Conference on Computer Vision. Springer, 2022, pp. 1-18.
[205] Z. Lin, S. Geng, R. Zhang, P. Gao, G. de Melo, X. Wang, J. Dai, Y. Qiao, and H. Li, "Frozen clip models are efficient video learners," in European Conference on Computer Vision. Springer, 2022, pp. 388-404.
[206] Z. Han, F. Zhu, Q. Lao, and H. Jiang, "Zero-shot referring expression comprehension via structural similarity between images and captions," arXiv preprint arXiv:2311.17048, 2023.
[207] S. Doveh, A. Arbelle, S. Harary, E. Schwartz, R. Herzig, R. Giryes, R. Feris, R. Panda, S. Ullman, and L. Karlinsky, "Teaching structured vision & language concepts to vision & language models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2657-2668.
[208] S. Nag, X. Zhu, Y.-Z. Song, and T. Xiang, "Zero-shot temporal action detection via vision-language prompting," in European Conference on Computer Vision. Springer, 2022, pp. 681-697.
[209] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, "Learning to prompt for vision-language models," International Journal of Computer Vision, vol. 130, no. 9, pp. 2337-2348, 2022.
[210] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, "Conditional prompt learning for vision-language models," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16816-16825.
[211] B. Zhu, Y. Niu, Y. Han, Y. Wu, and H. Zhang, "Prompt-aligned gradient for prompt tuning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15659-15669.
[212] M. U. Khattak, H. Rasheed, M. Maaz, S. Khan, and F. S. Khan, "Maple: Multi-modal prompt learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19122.
[213] M. Shu, W. Nie, D.-A. Huang, Z. Yu, T. Goldstein, A. Anandkumar, and C. Xiao, "Test-time prompt tuning for zero-shot generalization in vision-language models," Advances in Neural Information Processing Systems, vol. 35, pp. 14274-14289, 2022.
[214] C.-M. Feng, K. Yu, Y. Liu, S. Khan, and W. Zuo, "Diverse data augmentation with diffusions for effective test-time prompt tuning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2704-2714.
[215] P. Gao, S. Geng, R. Zhang, T. Ma, R. Fang, Y. Zhang, H. Li, and Y. Qiao, "Clip-adapter: Better vision-language models with feature adapters," International Journal of Computer Vision, pp. 1-15, 2023.
[216] R. Zhang, R. Fang, W. Zhang, P. Gao, K. Li, J. Dai, Y. Qiao, and H. Li, "Tip-adapter: Training-free clip-adapter for better vision-language modeling," arXiv preprint arXiv:2111.03930, 2021.
[217] E. Orhan, "A simple cache model for image recognition," Advances in Neural Information Processing Systems, vol. 31, 2018.
[218] E. Grave, M. M. Cisse, and A. Joulin, "Unbounded cache model for online language modeling with open vocabulary," Advances in neural information processing systems, vol. 30, 2017.
[219] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," Advances in neural information processing systems, vol. 33, pp. 6840-6851, 2020.
[220] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics," in International conference on machine learning. PMLR, 2015.
[221] Z. Han, Y. Wang, L. Zhou, P. Wang, B. Yan, J. Zhou, Y. Wang, and D. Shen, "Contrastive diffusion model with auxiliary guidance for coarse-to-fine pet reconstruction," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 239-249.
[222] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, "Diffusion models: A comprehensive survey of methods and applications," ACM Computing Surveys, vol. 56, no. 4.
[223] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, "Diffusion models in vision: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[224] P. Dhariwal and A. Nichol, "Diffusion models beat gans on image synthesis," Advances in neural information processing systems, vol. 34, pp. 8780-8794, 2021.
[225] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, "Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22500-22510.
[226] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.
[227] S. Luo, Y. Tan, S. Patil, D. Gu, P. von Platen, A. Passos, L. Huang, J. Li, and H. Zhao, "Lcm-lora: A universal stable-diffusion acceleration module," arXiv preprint arXiv:2311.05556, 2023.
[228] W. Chai, D. Zheng, J. Cao, Z. Chen, C. Wang, and C. Ma, "Speedupnet: A plug-and-play hyper-network for accelerating text-to-image diffusion models," arXiv preprint arXiv:2312.08887, 2023.
[229] J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, "Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
[230] Z. Xing, Q. Dai, H. Hu, Z. Wu, and Y.-G. Jiang, "Simda: Simple diffusion adapter for efficient video generation," arXiv preprint arXiv:2308.09710, 2023.
[231] B. Zeng, S. Li, Y. Feng, H. Li, S. Gao, J. Liu, H. Li, X. Tang, J. Liu, and B. Zhang, "Ipdreamer: Appearance-controllable 3d object generation with image prompts," arXiv preprint arXiv:2310.05375, 2023.
[232] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., "Flamingo: a visual language model for few-shot learning," Advances in Neural Information Processing Systems, vol. 35, pp. 23716-23736, 2022.
[233] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836-3847.
[234] R. Gandikota, J. Materzynska, T. Zhou, A. Torralba, and D. Bau, "Concept sliders: Lora adaptors for precise control in diffusion models," arXiv preprint arXiv:2311.12092, 2023.
[235] C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, Y. Shan, and X. Qie, "T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models," arXiv preprint arXiv:2302.08453, 2023.
[236] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, "An image is worth one word: Personalizing text-to-image generation using textual inversion," arXiv preprint arXiv:2208.01618, 2022.
[237] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y. Zhu, "Multiconcept customization of text-to-image diffusion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1931-1941.
[238] H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, "Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models," arXiv preprint arXiv:2308.06721, 2023.
[239] OpenAI, "Gpt-4," in https://openai.com/gpt-4, 2023.
[240] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., "Gemini: a family of highly capable multimodal models," arXiv preprint arXiv:2312.11805, 2023.
[241] C. Gao and S. Q. Zhang, "Dlora: Distributed parameter-efficient fine-tuning solution for large language model," arXiv preprint arXiv:2404.05182, 2024.
[242] G. Xiao, J. Lin, and S. Han, "Offsite-tuning: Transfer learning without full model," arXiv preprint arXiv:2302.04870, 2023.
[243] Z. Zhou, X. Wei, J. Zhang, and G. Sun, "PetS: A unified framework for parameter-efficient transformers serving," in 2022 USENIX Annual Technical Conference (USENIX ATC 22), 2022, pp. 489-504.
[244] Y. Sheng, S. Cao, D. Li, C. Hooper, N. Lee, S. Yang, C. Chou, B. Zhu, L. Zheng, K. Keutzer et al., "S-lora: Serving thousands of concurrent lora adapters," arXiv preprint arXiv:2311.03285, 2023.
[245] L. Chen, Z. Ye, Y. Wu, D. Zhuo, L. Ceze, and A. Krishnamurthy, "Punica: Multi-tenant lora serving," arXiv preprint arXiv:2310.18547, 2023.
[246] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, and B. Bossan, "Peft: State-of-the-art parameter-efficient fine-tuning methods," https://github.com/huggingface/peft, 2022.
[247] C. Poth, H. Sterz, I. Paul, S. Purkayastha, L. Engländer, T. Imhof, I. Vulić, S. Ruder, I. Gurevych, and J. Pfeiffer, "Adapters: A unified library for parameter-efficient and modular transfer learning," 2023.
[248] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, "MMDetection: Open mmlab detection toolbox and benchmark," arXiv preprint arXiv:1906.07155, 2019.
[249] S. Q. Zhang, T. Tambe, N. Cuevas, G.-Y. Wei, and D. Brooks, "Camel: Co-designing ai models and embedded drams for efficient on-device learning," arXiv preprint arXiv:2305.03148, 2023.
[250] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh, "Video generation models as world simulators," 2024. [Online]. Available: https: //openai.com/research/video-generation-models-as-world-simulators
[251] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," arXiv preprint arXiv:2312.00752, 2023.
[252] Y. Bai, X. Geng, K. Mangalam, A. Bar, A. Yuille, T. Darrell, J. Malik, and A. A. Efros, "Sequential modeling enables scalable learning for large vision models," arXiv preprint arXiv:2312.00785, 2023.
[253] A. Dosovitskiy and T. Brox, "Inverting visual representations with convolutional networks," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4829-4837.
[254] Z. He, T. Zhang, and R. B. Lee, "Model inversion attacks against collaborative inference," in Proceedings of the 35th Annual Computer Security Applications Conference, 2019, pp. 148-162.