INTRODUCTION
Prompt tuning (PT) [1] is a parameter-efficient method [2] for adapting large pre-trained models to various downstream tasks. PT prepends a sequence of soft prompt vectors to the input and optimizes the prompt while keeping the pre-trained model parameters frozen. Although PT achieves strong performance, it is sensitive to the initialization of the prompt [3]. To address this issue, Soft Prompt Transfer (SPT) [4] was proposed to improve the performance and stability of PT by transferring prompts.
Recent work [3] on SPT found that initializing with well-trained soft prompts can improve PT performance. SPoT [4] learns a prompt on one or more source tasks and uses it to initialize the prompt for a target task. ATTEMPT [5] trains a soft prompt on each source task and transfers them to target tasks for further learning. MPT [6] introduces prompt distillation to learn a transferable prompt from all source tasks. Most SPT methods learn only one task-specific prompt for each source task. However, even an adequately trained prompt may not suit every data instance of a given task [7]. In other words, a well-trained task-specific prompt from a source task may not be a suitable prompt for the instances of the target task. We build on the SPT approach and explore learning multiple prompts for each source task.
Some studies [8][9] found that ensembling multiple prompts can leverage the complementary strengths of different prompts and stabilize performance on downstream tasks. Typically, the multiple soft prompts are obtained from different initializations or different random seeds. However, these methods require several independent training runs to obtain the prompts [10], which is relatively inefficient.
To address these limitations, we introduce the Snapshot Prompt Ensemble (SPE) method for parameter-efficient soft prompt transfer. Snapshot Ensemble [11] takes snapshots during training to ensemble multiple neural networks at little additional cost. SPE is the first method to apply the idea of Snapshot Ensemble to multi-prompt ensembling. SPE extracts multiple soft prompts from each source task by taking snapshots at different phases of prompt tuning. It then adaptively ensembles these soft prompts through a cross-task attention module to obtain a fused, instance-dependent prompt for the target task. SPE obtains multiple prompts without additional training and provides a more suitable starting point for target prompt training. Extensive experiments on a wide range of NLU tasks with T5 [12] demonstrate that SPE outperforms state-of-the-art methods while tuning only 0.4% of the parameters required by full fine-tuning.
METHOD
The Snapshot Prompt Ensemble method transfers multiple soft prompts from source tasks to the target task. Figure 1 presents an overview of SPE, which consists of two stages: source prompt training and target prompt training. In the first stage, SPE takes snapshots during source prompt tuning and obtains multiple snapshot prompts for each source task. In the second stage, SPE adaptively ensembles these soft prompts through a cross-task attention module to obtain a fused, instance-dependent prompt for the target task.
2.1. Source prompt training
SPE trains multiple source prompts for each source task in S = {S_1, S_2, …, S_N} individually through prompt tuning.
Given a task T with training data
Multiple Snapshot Prompts Extraction
Vanilla prompt tuning yields only a single task-specific prompt for each source task, which may not be suitable for the target task. Meanwhile, the typical practice for obtaining multiple soft prompts, namely several independent training runs, is not computationally efficient.
Therefore, this paper applies the idea of Snapshot Ensemble to prompt tuning. Snapshot Ensemble aims to ensemble multiple neural networks without incurring additional training cost. Similarly, SPE aims to obtain multiple prompts at no additional cost. Specifically, after initialization the prompt is trained starting from epoch 0, and the prompt vectors are updated through repeated forward and backward passes. Once the prompt has been trained to different degrees, SPE takes snapshots to extract the intermediate prompts P_{i,j} (i ∈ {1, …, N}, j ∈ {1, …, K}) and stores them in the source prompt album. Here P_{i,j} denotes the j-th snapshot taken while training the prompt for the i-th source task. Training on N distinct source tasks therefore produces N × K source prompts. Notably, the total training cost of obtaining all snapshot prompts of a task equals the cost of learning a single prompt with vanilla prompt tuning. In other words, SPE efficiently obtains multiple prompts for each source task in a single training run.
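The snapshot extraction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `tune_prompt_with_snapshots`, the plain gradient-descent update, and the toy quadratic loss standing in for the frozen language model are all assumptions made for illustration. The default fractions (60%, 80%, 100%) follow the snapshot schedule reported in the experimental settings.

```python
import numpy as np

def tune_prompt_with_snapshots(loss_grad, prompt_len, dim, total_steps,
                               fractions=(0.6, 0.8, 1.0), lr=0.1, seed=0):
    """Run ONE prompt-tuning pass and copy the prompt at the given progress fractions.

    loss_grad is a hypothetical callable returning dLoss/dPrompt; in a real
    setup it would come from back-propagation through the frozen LM.
    """
    rng = np.random.default_rng(seed)
    prompt = rng.normal(scale=0.02, size=(prompt_len, dim))  # soft prompt, shape (l, d)
    # Convert progress fractions into 1-based step indices at which to snapshot.
    snapshot_steps = {max(1, round(f * total_steps)) for f in fractions}
    album = []
    for step in range(1, total_steps + 1):
        prompt = prompt - lr * loss_grad(prompt)  # only the prompt is updated
        if step in snapshot_steps:
            album.append(prompt.copy())           # snapshot P_{i,j} for this task
    return album

# Toy stand-in for the frozen-LM loss: 0.5 * ||P - target||^2, whose gradient is P - target.
target = np.ones((4, 8))
album = tune_prompt_with_snapshots(lambda p: p - target,
                                   prompt_len=4, dim=8, total_steps=10)
```

The loop runs exactly `total_steps` updates regardless of how many snapshots are taken, which reflects the point that K snapshots cost the same training budget as one vanilla prompt.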
2.2. Target prompt training
After acquiring an album of multiple source snapshot prompts, our method aims to apply a cross-task attention module to adaptively ensemble the multiple soft prompts and obtain a fused and instance-dependent prompt for the target task.
Adaptive Snapshot Prompts Ensembling
SPE uses a cross-task attention module to adaptively generate the weights used to ensemble the source prompts, based on their competence on the target instance. Because the input instance X and the source prompts differ in length, the module first max-pools over the token embedding sequence X = [x_1, x_2, …, x_n] ∈ ℝ^{n×d} and over each prompt embedding sequence P_{i,j} = [p_1, p_2, …, p_l] ∈ ℝ^{l×d}, to obtain
The resulting prompt P_instance is a fused prompt that incorporates information from different tasks and different training phases, and it is instance-dependent. Compared with other SPT methods [4], the cross-task attention module can learn from the source prompts a prompt that is better suited to each target data instance.
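The adaptive ensembling step can be sketched as follows. The max-pooling, the two projection layers, and the layer norm are taken from the text. The ReLU nonlinearity, the dot-product scoring, the softmax normalisation, and every name in the code (`ensemble_prompts`, `W_down`, `W_up`, `gamma`, `beta`) are assumptions, so this shows one plausible shape for such a module rather than the exact one used in SPE.

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-6):
    """Normalise a vector to zero mean and unit variance, then scale and shift."""
    return gamma * (h - h.mean()) / np.sqrt(h.var() + eps) + beta

def ensemble_prompts(X, prompts, W_down, W_up, gamma, beta):
    """Fuse source snapshot prompts into one instance-dependent prompt.

    X:       (n, d) token embeddings of one target instance
    prompts: (M, l, d) album of M = N * K source snapshot prompts
    """
    x_hat = X.max(axis=0)                       # max-pool over tokens -> (d,)
    p_hat = prompts.max(axis=1)                 # max-pool over prompt positions -> (M, d)
    h = np.maximum(x_hat @ W_down, 0.0) @ W_up  # down-project, ReLU (assumed), up-project
    h = layer_norm(h, gamma, beta)              # (d,)
    scores = p_hat @ h                          # one score per source prompt -> (M,)
    weights = np.exp(scores - scores.max())     # numerically stable softmax (assumed)
    weights /= weights.sum()
    p_instance = np.tensordot(weights, prompts, axes=1)  # weighted sum -> (l, d)
    return p_instance, weights

# Toy usage with random values; all sizes are illustrative only.
rng = np.random.default_rng(0)
n, l, d, r, M = 5, 3, 8, 4, 6
X = rng.normal(size=(n, d))
prompts = rng.normal(size=(M, l, d))
W_down, W_up = rng.normal(size=(d, r)), rng.normal(size=(r, d))
p_instance, weights = ensemble_prompts(X, prompts, W_down, W_up,
                                       np.ones(d), np.zeros(d))
```

Because the weights depend on `X`, each target instance receives its own mixture of source prompts. Replacing `weights` with a uniform vector reduces the fusion to plain averaging.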
Because instance-dependent prompts may overlook task-specific information [14], SPE also initializes a task-wise prompt P_task for the target task, which is trained and shared across all instances of that task. The final target prompt combines the instance-wise prompt and the task-wise prompt as follows:
Finally, we concatenate the target prompt with the input and train the model by maximizing the likelihood:
During target prompt training, only the task prompt P_task and the cross-task attention module G are updated, with gradients flowing through P_target, while the source prompts and the pre-trained LM θ remain frozen.
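The combination and training objective described in this subsection can be written out under some assumptions. The sketch below assumes that the instance-wise and task-wise prompts are combined by element-wise addition, that [· ; ·] denotes concatenation along the sequence dimension, and that Y denotes the target output for input X. None of these choices is confirmed by the text available here, so this is one plausible reading rather than the paper's exact formulation.

```latex
\begin{gather}
P_{\text{target}} = P_{\text{instance}} + P_{\text{task}} \\
\max_{P_{\text{task}},\, G} \; p_{\theta}\bigl(Y \mid [P_{\text{target}};\, X]\bigr)
\end{gather}
```

Under this reading, the maximization runs only over P_task and the parameters of G, which matches the statement that the source prompts and θ stay frozen.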
2.3. Parameter Efficiency of SPE
Since the multiple source prompts of a task are extracted in a single training run, the trainable parameters in this stage are only the N source prompts (i.e., N × l × d); snapshot extraction adds no extra parameters. SPE also trains a cross-task attention module with two projection layers and one layer norm, together with a task prompt, which requires 2 × d × r + 2 × d + l × d parameters. In total, SPE has (1 + N) × l × d + 2 × d × r + 2 × d trainable parameters (~0.7M in our experiment), which is less than 0.4% of the parameters updated when fully fine-tuning a T5-base model.
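The parameter count can be checked with a short calculation. N = 6 and l = 100 come from the experimental settings, and d = 768 is the hidden size of T5-base. The projection dimension r = 100 and the figure of roughly 220M parameters for T5-base are assumptions used only to show that the formula lands near the reported ~0.7M; the function name is hypothetical.

```python
def spe_trainable_params(num_sources, prompt_len, dim, bottleneck):
    """Count trainable parameters following the breakdown given in the text."""
    source_prompts = num_sources * prompt_len * dim  # N x l x d
    projections = 2 * dim * bottleneck               # down- and up-projection, 2 x d x r
    layer_norm = 2 * dim                             # scale and bias, 2 x d
    task_prompt = prompt_len * dim                   # l x d
    return source_prompts + projections + layer_norm + task_prompt

# r = 100 is an assumed value, not stated in the text.
total = spe_trainable_params(num_sources=6, prompt_len=100, dim=768, bottleneck=100)
ratio = total / 220e6  # assumed approximate size of T5-base
print(total)           # 692736
print(f"{ratio:.2%}")
```

With these values the total is about 0.69M parameters, consistent with the ~0.7M stated above and below the 0.4% bound.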
EXPERIMENTS
3.1. Experimental settings
Following the setting of ATTEMPT [5], we use six high-resource datasets (MNLI, QNLI, QQP, SST-2, SQuAD, and ReCoRD) as source tasks and eight tasks from GLUE [15] as target tasks, with the pre-trained T5-base model as the backbone. Besides full fine-tuning, we compare our model with parameter-efficient methods such as Adapter [16], BitFit [17], and PT [1], and with soft prompt transfer methods such as SPoT [4] and ATTEMPT [5].
We report Pearson correlation for STS-B and accuracy for the other tasks. By default, we train with a batch size of 32 and a learning rate of 3e-4 on a single NVIDIA A100 GPU with 80 GB of memory. The prompt length is 100 prompt vectors for all benchmarks. During source prompt tuning, SPE takes snapshots at 60%, 80%, and 100% of training.
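For reference, the two reported metrics can be computed as in the sketch below. The function names are illustrative and this is not the evaluation code used in the experiments.

```python
import numpy as np

def pearson(pred, gold):
    """Pearson correlation between predicted and gold scores (used for STS-B)."""
    pred, gold = np.asarray(pred, dtype=float), np.asarray(gold, dtype=float)
    pc, gc = pred - pred.mean(), gold - gold.mean()  # centre both series
    return float((pc @ gc) / np.sqrt((pc @ pc) * (gc @ gc)))

def accuracy(pred, gold):
    """Fraction of predictions that match the gold labels (used for the other tasks)."""
    return float(np.mean(np.asarray(pred) == np.asarray(gold)))
```

Pearson correlation suits STS-B because its targets are continuous similarity scores, whereas the remaining tasks are classification problems.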
3.2. Main Results
Table 1 shows the performance of our method and other baselines on the GLUE benchmark. SPE outperforms baselines and achieves competitive results on several NLU tasks.
Specifically, SPE achieves performance comparable to full fine-tuning while tuning only 0.4% of the parameters. Compared with vanilla prompt tuning, the average score of our method rises substantially from 72.2% to 86.0%, a gain of 13.8 percentage points, highlighting the benefit of transferring knowledge from multiple source tasks. Compared with the parameter-efficient tuning (PET) methods, performance also improves to varying degrees. Compared with the soft prompt transfer methods SPoT and ATTEMPT, SPE improves by 3.7% and 2.6%, respectively. We posit that during iterative source prompt training the prompts acquire task-specific knowledge gradually. Because SPE takes snapshots at different training phases, the snapshot prompts carry different amounts of task-specific knowledge. A probable reason for the good performance of SPE is therefore that it can ensemble prompts with different levels of task-specific knowledge from different tasks.
3.3. Few-shot Adaptation
Following prior work [5], we conduct few-shot adaptation experiments on the BoolQ [18], CB [19], and SciTail [20] tasks to evaluate generalization ability. Table 2 shows that our method outperforms the other methods. This result indicates that SPE generalizes well, as it performs well in both full-dataset and few-shot settings across different tasks.
3.4. Ablation Studies
We conduct ablation experiments to examine the effectiveness of each component of SPE. The results on STS-B and RTE are shown in Figure 2. "No snapshot" means that only a single prompt is obtained for each source task. "No attention" means that the source prompts are ensembled with uniform (average) weights. "No task prompt" means that the task prompt is not added to the instance prompt.
The complete SPE method achieves the best performance. Removing any component decreases performance, indicating that each component contributes to our method. From the results, we derive the following insights. The cross-task attention module enables SPE to ensemble the source prompts effectively, and the task-wise prompt provides information at a different level of granularity for the target task. These two components underpin the effectiveness of SPE. Moreover, taking snapshots during source prompt tuning gives SPE more prompts with diverse and rich knowledge, which further enhances its performance and stability.
3.5. Analysis on Universality
We extend our method to Chinese NLU tasks and to different backbone LM sizes. We choose AFQMC, IFLYTEK, OCNLI, and Tnews from the CLUE benchmark [21] as source tasks, and use Randeng-T5-77M-MultiTask-Chinese and Randeng-T5-784M-MultiTask-Chinese [22] as the pre-trained models.
Table 3 summarizes the performance of the baselines and our method with different LM sizes on four CLUE tasks. Our method benefits substantially from a larger backbone LM, outperforming most methods on the larger backbone model. We hypothesize that Chinese semantic patterns are more complicated, so SPE needs a larger number of parameters to capture enough semantic information. SPE achieves the best performance on most tasks, which demonstrates the effectiveness and universality of our method on both English and Chinese NLU tasks. A possible reason is that our method is language-independent.
CONCLUSION
We present the Snapshot Prompt Ensemble (SPE) method for parameter-efficient soft prompt transfer. SPE obtains multiple prompts for each source task by taking snapshots at different phases of source prompt tuning, and then adaptively ensembles these soft prompts through a cross-task attention module to obtain a fused, instance-dependent prompt for the target task. Through extensive experiments, we demonstrate the effectiveness and efficiency of SPE. In future work, we will investigate how to eliminate potential negative transfer.