
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Zhiqiu Lin*  Samuel Yu*  Zhiyi Kuang  Deepak Pathak  Deva Ramanan
Carnegie Mellon University
{zhiqiul, samuelyu, zkuang, dpathak, deva}@cs.cmu.edu

Abstract

The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification. Project site at link.

1. Introduction

Learning with minimal instruction is a hallmark of human intelligence [86,91,98], and is often studied under the guise of few-shot learning. In the context of few-shot visual classification [18,20,29,46,79,82], a classifier is first pretrained on a set of base classes to learn a good feature representation and then adapted or finetuned on a small amount of novel class data. However, such few-shot setups often face an inherent ambiguity: if the training image contains a golden retriever wearing a hat, how does the learner know if the task is to find dogs, golden retrievers, or even hats? On the other hand, humans have little trouble understanding and even generalizing from as few as one example. How so?

Figure 1. Human perception is internally cross-modal. When we perceive from one modality (such as vision), the same neurons will be triggered in our cerebral cortex as if we were perceiving the object from other modalities (such as language and audio). This phenomenon grants us a strong ability to learn from a few examples with cross-modal information [52,67]. In this work, we propose to leverage cross-modality to adapt multimodal models (such as CLIP [81] and AudioCLIP [27]) that encode different modalities to the same representation space.
We argue that humans make use of multimodal signals and representations (Figure 1) when learning concepts. For example, verbal language has been shown to help toddlers better recognize visual objects given just a few examples. Indeed, there exists ample evidence from neuroscience suggesting that cognitive representations are inherently multimodal. For instance, visual images of a person evoke the same neurons as the textual strings of the person's name [80] and even audio clips of that person talking [70]. Even for infants as young as 1-5 months old, there is a strong correspondence between auditory-visual [52] as well as visual-tactile signals [67]. Such cross-modal or inter-modal representations are fundamental to the human perceptual-cognitive system, allowing us to understand new concepts even with few examples [24].
Cross-modal adaptation (our approach). In this paper, we demonstrate that cross-modal understanding of different modalities (such as image-text or image-audio) can improve the performance of individual modalities. That is, reading about dogs and listening to them bark can help build a better visual classifier for them! To do so, we present a remarkably simple strategy for cross-modal few-shot adaptation: we treat examples from different modalities as additional few-shot examples. For example, given the "1-shot" task of learning a dog classifier, we treat both the textual dog label and the single visual image as training examples for learning a (visual) dog classifier. Learning is straightforward when using frozen textual and visual encoders, such as CLIP [81], that map different modalities to the same representational space. In essence, we have converted the "n-shot" problem to an "(n+1)-shot" problem (Figure 2)! We demonstrate that this basic strategy produces SOTA results across the board with a simple linear classifier, and can be applied to existing finetuning methods or additional modalities (e.g. audio).
Why does it work? From one perspective, it may not be surprising that cross-modal adaptation improves accuracy, since it takes advantage of additional training examples that are "hidden" in the problem definition, e.g. a label name [104] or an annotation policy [68] for each class. However, our experiments demonstrate that multimodal cues are often complementary since they capture different aspects of the underlying concept; a dog label paired with a single visual example is often more performant than two images! For example, Figure 3 demonstrates a one-shot example where the target concept is ambiguous, but becomes clear once we add information from other modalities like language and sound.
Multimodal adaptation (prior art). In contrast to our cross-modal approach, most prior works simply follow the popular practice of finetuning uni-modal foundation models, such as large vision or language models [8, 17, 62]. For example, CoOp [113] and other prompting methods finetune CLIP via prefix tuning to replace hand-engineered prompts such as "a photo of a {cls}" with learned word tokens. Similarly, inspired by parameter-efficient tuning of language models [39], adapter-based methods [21,111] finetune CLIP by inserting lightweight multi-layer perceptrons (MLPs). However, we aim to study the fundamental question of how to finetune multi-modal (as opposed to uni-modal) models. A crucial difference between prior art and ours is the use of textual information, as all existing methods [41,100,111,113] repurpose additional text features as classifier weights instead of training samples. We demonstrate in this paper that cross-modal adaptation is not only more performant but can also benefit prior uni-modal approaches.
Figure 2. Adding additional modalities helps few-shot learning. Adding textual labels to a 2-shot cat-vs-dog classification task leads to better test performance (by turning the problem into a 3-shot cross-modal task!). We visualize cross-modal CLIP [21] features (projected to 2D with principal component analysis) and the resulting classifier learned from them, and observe a large shift in the decision boundary. See Figure 5 for more examples.

Problem setup. We begin by replicating the existing evaluation protocol of other works [81,111,113] on few-shot adaptation of vision-language models, and report performance on 11 diverse downstream datasets.

We produce state-of-the-art accuracy with an embarrassingly simple linear classifier that has access to additional "hidden" training examples in the form of textual labels, resulting in a system that is far more lightweight than prior art. Interestingly, we show that existing approaches [100,111,113], despite already repurposing text features as classifier weights, can still benefit from cross-modal learning. Finally, we extend our work to the audio domain by taking advantage of AudioCLIP [27], which maps audio to the same frozen CLIP representation space. We construct the first (to our knowledge) cross-modal few-shot learning benchmark with audio by intersecting ImageNet [15] and the ESC-50 audio classification dataset [77]. We show that cross-modal audiovisual learning helps both downstream image and audio classification; in summary, one can train better dog image classifiers by listening to them bark!
2. Related Works

Webly-supervised pre-training. Learning foundation models [5] from large-scale web data is becoming a predominant paradigm in AI. In NLP, models such as BERT [17] and GPT-3 [8] are pre-trained on a massive web text corpus with language-modeling objectives and can be transferred to a wide range of downstream tasks, even without explicit supervised finetuning [61, 94]. Self-supervision is also a trending topic in the vision community, and recent methods [26,31] demonstrate even stronger visual representations than fully-supervised pre-trained ones such as on ImageNet [15].
Figure 3. Cross-modality reduces the ambiguity of few-shot learning. Classic (uni-modal) few-shot learning is often underspecified. Even for binary classification, when given only a single image per class (left), it is unclear whether the target class is the animal, the hat, or the background scene. Adding an extra modality, such as text or audio, helps clarify the problem setup (right). Notably, language usually comes "for free" in classification datasets in the form of a textual label per class.

Multimodal foundation models. Recently, foundation models have shifted towards a multimodal supervision paradigm. For visual representation learning, early works transform web image captions into structured outputs for supervised learning, such as multi-label targets [47] or visual n-grams [56]. More recently, CLIP [81] and ALIGN [43] propose a simple contrastive-based approach to embed images and captions into the same representation space, and demonstrate impressive "zero-shot" performance on downstream tasks. Follow-up works enhance multimodal pre-training by incorporating generative-based objectives [2, 57, 106], consistency regularization [60, 69], stronger visual priors [107], phrase-grounding tasks [58, 109], and audiovisual information through videos [27]. In this work, we focus on adapting CLIP [81] and AudioCLIP [27] for few-shot classification because contrastive-based multimodal models are stronger classifiers [2]. Adopting other multimodal models or adapting to tasks other than classification can be interesting future directions.
Adaptation of foundation models. As multimodal pre-trained models have excelled at classic vision tasks [81, 109], there has been surging interest in developing more efficient adaptation methods. However, we observe that most of the trending techniques are built upon successful recipes crafted for uni-modal foundation models. For example, CLIP [81] adopts linear probing [12,31,32,109] and full finetuning [25, 31, 48, 99, 101, 109] when transferring to downstream tasks. Prompt adaptation of CLIP [63, 81] is motivated by the success of prefix-tuning for language models [16,22,30,45,61,78,84,85,89]. Similarly, CLIP-Adapter [21] and Tip-Adapter [111] are inspired by parameter-efficient finetuning methods that optimize lightweight MLPs while freezing the encoder. Yet, all aforementioned methods including WiSE-FT [100] use the other modality, e.g. textual labels, as classifier weights and still calculate a uni-modal softmax loss on the few-shot images. We instead show that incorporating other modalities as training samples is far more effective.

Figure 4. Uni-modal (left) vs. cross-modal adaptation (right). Prior work [21,100,111,113] performs uni-modal adaptation by calculating the loss over a single modality. Cross-modal adaptation makes use of additional training samples from other modalities, exploiting pre-trained encoders that map different modalities to the same representation space. We show that cross-modal learning can also improve prior art and even extends to audio modalities with AudioCLIP [27].
Few-shot classification. Prior successful few-shot learning methods leverage meta-learning [20, 82], metric learning [4, 91, 95], transfer learning [29, 79], and transductive learning [18,46]. These classic algorithms usually assume a large meta-training set for pre-training the network, and then evaluate on multiple episodes of few-shot train (support) and test (query) sets. In this work, we instead follow the new evaluation protocol implemented by recent works on few-shot adaptation with CLIP [81,111,113]: (1) the meta-training phase is replaced with pre-trained CLIP models, and (2) the test sets are the official test splits of each dataset (thus not few-shot). Notably, none of the prior works we compare to in this paper perform optimization with test set samples, and we follow this practice to ensure a fair comparison. We leave semi-supervised [97] or transductive finetuning [18, 40] techniques as future work.
Cross-modal machine learning. Inspired by cross-modal human cognition [9, 49, 70], cross-modal learning is a subfield of multimodal machine learning that aims to use data from additional modalities to improve a uni-modal task. Cross-modal learning does not require instance-wise alignment; for example, existing algorithms [68,104] can benefit from class-level descriptions as opposed to image-level captions. In this work, we propose a lightweight cross-modal learning method by treating data from other modalities as additional training samples. Furthermore, we encourage future works to embrace cross-modal few-shot learning as opposed to the underspecified uni-modal setup (Figure 3).
3. Cross-Modal Adaptation

In this section, we mathematically formalize our approach to cross-modal few-shot learning.
Uni-modal learning. We begin by reviewing standard uni-modal few-shot classification, which learns a classifier from a small dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ of (data, label) pairs and a pre-trained feature encoder $\phi$:

$$\min_{\{w_k\}} \sum_{(x, y) \in D} \mathcal{L}\big(\phi(x), y; \{w_k\}\big), \quad (1)$$

where $\mathcal{L}$ is typically the softmax loss

$$\mathcal{L}\big(\phi(x), y; \{w_k\}\big) = -\log \frac{\exp\big(w_y^\top \phi(x)\big)}{\sum_k \exp\big(w_k^\top \phi(x)\big)}. \quad (2)$$

Our notation separates the feature extractor $\phi$ from the final class weights $\{w_k\}$, since the former is typically pre-trained on a massive source dataset and the latter is trained on the few-shot target dataset. However, sometimes the representation $\phi$ can also be finetuned on the few-shot dataset (as we explore in our experiments). Importantly, both the class weights and feature extractor must live in the same $d$-dimensional space in order to compute their inner product:

$$w_k, \ \phi(x) \in \mathbb{R}^{d}. \quad (3)$$

Though we focus on classification, class models could be learned via other losses (such as centroid prototypes [91]).
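For concreteness, a minimal PyTorch sketch of this uni-modal linear probe over frozen, pre-extracted features; the dimensions and class count below are arbitrary stand-ins, not values from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def unimodal_loss(features: torch.Tensor, labels: torch.Tensor, classifier: nn.Linear) -> torch.Tensor:
    """Softmax loss of Equations 1-2, computed over frozen, L2-normalized features."""
    logits = classifier(F.normalize(features, dim=-1))
    return F.cross_entropy(logits, labels)

# Random stand-ins for features pre-extracted with a frozen encoder phi:
feats = torch.randn(8, 1024)            # 8 few-shot samples, d = 1024 (e.g. CLIP ResNet50)
labels = torch.randint(0, 5, (8,))      # 5 classes
probe = nn.Linear(1024, 5, bias=False)  # rows of probe.weight play the role of the class weights w_k
loss = unimodal_loss(feats, labels, probe)
```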
Cross-modal learning. Our extension to multiple modalities is straightforward; we assume each training example is accompanied by a discrete label $m_i$ denoting its modality:

$$D = \{(x_i, y_i, m_i)\}_{i=1}^{N}, \quad m_i \in M. \quad (4)$$

For example, one may define the set of modalities to be $M = \{\text{visual}, \text{language}\}$ or $M = \{\text{visual}, \text{audio}\}$ (Figure 4). We can then define an associated loss:

$$\min_{\{w_k\}} \sum_{(x, y, m) \in D} \mathcal{L}\big(\phi_m(x), y; \{w_k\}\big), \quad (5)$$

where we crucially assume access to modality-specific feature encoders $\phi_m$ for each $m \in M$. While the individual datapoints $x_i$ may come from different modalities with different dimensions, our formulation requires that the encoders map all modalities to the same fixed-dimensional space:

$$\phi_m(x) \in \mathbb{R}^{d} \quad \text{for all } m \in M. \quad (6)$$

Note that this requirement is satisfied by many multimodal foundation models such as CLIP [81] and ALIGN [43] since they map different modalities into the same $d$-dimensional embedding. We provide training pseudocode for vision-language adaptation (Section 4) in Algorithm 1 for clarity.
Inference: The learned classifier can produce a label prediction for a test example $x$ from any modality $m \in M$:

$$\hat{y} = \arg\max_{k} \ w_k^\top \phi_m(x). \quad (7)$$

This means we can use the same classifier to classify different test modalities (e.g. images and audio clips). In this paper, we mainly evaluate on a single modality (like images) to emphasize that multimodality helps unimodality.
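As a concrete illustration of Equation 7, a minimal PyTorch-style sketch; the encoder and classifier objects are placeholders in the spirit of Algorithm 1, not a released API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict(classifier, encoder, x):
    """Classify a test example from any modality whose encoder maps into the
    shared d-dimensional space (Equation 7)."""
    feats = F.normalize(encoder(x), dim=-1)   # L2-normalize, as in Algorithm 1
    return classifier(feats).argmax(dim=-1)   # the same linear head is reused for every modality

# e.g. predict(classifier, image_encoder, images) or
#      predict(classifier, audio_encoder, audio_clips)
```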
Cross-modal ensembles. We now show that cross-modal learning produces classifiers that are ensembles of modality-specific classifiers, exposing a connection to related approaches for ensembling (such as WiSE-FT [100]). We begin by appealing to the well-known Representer Theorem [87], which shows that optimally-trained classifiers can be represented as linear combinations of their training samples. In the case of a cross-modal linear probe, the weights $w_k$ for class $k$ must be a weighted combination of all $N$ training features, across all modalities:

$$w_k = \sum_{i=1}^{N} \alpha_{ik} \, \phi_{m_i}(x_i). \quad (8)$$

Linear classification via cross-modal adaptation solves for all weights $\alpha_{ik}$ jointly, so as to minimize the empirical risk (or training loss). In contrast, prior art optimizes the image-specific $\alpha$'s independently of the text-specific $\alpha$'s, linearly combining them with a single global $\alpha$ (as in WiSE-FT [100]) or via text-based classifier initialization [21,111]. Our analysis suggests that the joint optimization enabled by cross-modal learning may help other adaptation methods, as our experiments do in fact show.
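The distinction can be made concrete with a short sketch; the tensors below stand for classifier weight matrices of shape (num_classes, d), and the mixing ratio follows the WiSE-FT-style ensembling described above:

```python
import torch

def posthoc_ensemble(w_image: torch.Tensor, w_text: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """WiSE-FT-style ensembling: the image-trained and text-based ("zero-shot")
    classifiers are obtained independently, then mixed with one global ratio."""
    return alpha * w_image + (1.0 - alpha) * w_text

# Cross-modal adaptation instead fits a single weight matrix on image and text
# features within the same mini-batches (Algorithm 1), so the per-sample
# coefficients alpha_ik of Equation 8 are chosen jointly by the optimizer.
```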
Extensions. Although we focus on uni-modal inference tasks (e.g. image classification), the above formulation allows the learned classifier to be applied to multimodal test sets, such as classifying videos by training on images and audio, and then ensembling predictions across the two modalities with Equation 7.
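One simple realization of such multimodal test-time ensembling is sketched below; summing logits is just one possible choice of fusion, not a prescription from the paper:

```python
import torch
import torch.nn.functional as F

def multimodal_predict(classifier, encoders: dict, inputs: dict):
    """Ensemble Equation 7 over several test modalities by summing logits, e.g.
    encoders = {"image": image_encoder, "audio": audio_encoder}."""
    logits = sum(classifier(F.normalize(encoders[m](inputs[m]), dim=-1)) for m in inputs)
    return logits.argmax(dim=-1)
```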

Or, one can extend image classification by providing additional data such as captions and/or attributes. We leave these scenarios as future work. Finally, just as one can optimize uni-modal losses (1) by finetuning the encoder $\phi$, one can similarly finetune the modality-specific encoders $\phi_m$ in the cross-modal setting (5). We explore this finetuning method in the next section.

4. Vision-Language Adaptation

We now explore our cross-modal formulation for a particular multimodal setting. Many prior works [113] explore the intersection of vision and language, and thus that is our initial focus. Interestingly, the influential "zero-shot" and "few-shot" evaluation protocols introduced by prior work can be mapped to our cross-modal setting, with one crucial difference: the textual label of each class can be treated as an explicit training sample. From this perspective, "zero-shot" learning may be more naturally thought of as one-shot cross-modal learning that learns a few-shot model on text and then infers with it on images.
Few-shot evaluation protocol. To ensure a fair comparison, we strictly follow the protocol of CoOp [113] by reporting test performance on 11 public image datasets (Table 5), with ResNet50 [33] as the image encoder backbone. For maximal reproducibility, we use CoOp's dataset splits [113] and the three-fold few-shot train sets sampled with the same random seeds. We adopt the given test split of each dataset as the test set. Some prior works apparently use the large-scale test set to tune hyperparameters for few-shot learning; we instead exercise due diligence by tuning hyperparameters (such as the learning rate, weight decay, and early stopping) on the given few-shot validation set with min(n, 4) examples, where n is the number of training shots. We include PyTorch-style pseudocode (Algorithm 1) and hyperparameter details (Section 8).
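A minimal sketch of the tuning loop implied by this protocol; the train_one_epoch and evaluate helpers and the epoch budget are illustrative placeholders, not the released code:

```python
import copy

def fit_with_early_stopping(classifier, optimizer, train_loader, val_loader,
                            train_one_epoch, evaluate, max_epochs=50):
    """Early-stop on the few-shot validation split; the official test split is never touched."""
    best_val_acc, best_state = 0.0, None
    for _ in range(max_epochs):
        train_one_epoch(classifier, train_loader, optimizer)   # few-shot train split
        val_acc = evaluate(classifier, val_loader)             # min(n, 4) validation examples per class
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_state = copy.deepcopy(classifier.state_dict())
    classifier.load_state_dict(best_state)
    return best_val_acc
```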
Cross-modal adaptation outperforms SOTA. Table 1 shows the effectiveness of our proposal: we surpass all prior art with an embarrassingly simple linear classifier that requires significantly less training time than other carefully-crafted algorithms. In addition, partial finetuning of the last attentional pooling layer of the image encoder sets the new SOTA. To ensure a fair comparison, we augment the class names into sentences using hand-engineered templates selected by Tip-Adapter [111] (Table 5) and follow their practice to initialize the linear layer with text features.

Furthermore, we perform minimal image augmentation with a center crop plus a flipped view instead of random crops as in prior art. As such, we can pre-extract features before training the classifier, leading to significantly less training time as shown in Table 8. We also show that our method can benefit from both image and text augmentation in Table 6.
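A sketch of this pre-extraction step, assuming a dataloader that already applies the resize-and-center-crop preprocessing; the names are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def preextract_features(image_encoder, loader):
    """Encode each few-shot image once under two views (center crop + horizontal flip)
    so the linear classifier can then be trained without re-running the encoder."""
    all_feats, all_labels = [], []
    for images, labels in loader:                                # loader applies resize + center crop
        views = torch.cat([images, torch.flip(images, dims=[3])])  # (B, C, H, W): flip the width axis
        feats = F.normalize(image_encoder(views), dim=-1)
        all_feats.append(feats)
        all_labels.append(torch.cat([labels, labels]))
    return torch.cat(all_feats), torch.cat(all_labels)
```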

In the appendix, we provide more ablations on classifier initialization (Table 12), partial finetuning (Table 13), and a ViT-based backbone (Table 14). Per-dataset results are also in appendix Table 10.
Why does cross-modal learning help? As stated earlier, one reason that cross-modal learning helps is that it turns the original n-shot problem into an (n+1)-shot one. However, Table 1 shows that 1-shot cross-modal linear probing outperforms the 2-shot results of most prior methods. This suggests that training samples from other modalities tend to contain complementary cues [68, 100, 104]. One can loosely observe this in Figure 2 and Figure 5, whereby visual and text examples lie in slightly different parts of the embedding space (indicating the potential to aggressively shape the final decision boundary). In fact, WiSE-FT [100] is inspired by similar reasons to ensemble the uni-modal visual classifier with a "zero-shot" (one-shot-text) classifier (in the linear probing case). However, Equation 8 shows that cross-modal adaptation can also be seen as jointly learning an ensemble, while WiSE-FT [100] learns the visual classifier independently of the text classifier. This suggests that other adaptation methods may benefit from cross-modal learning, as we show next.

Algorithm 1: An example of PyTorch-style pseudocode for cross-modal (vision-language) adaptation. Notably, the image and text samples do not need to be paired, and one may sample different numbers of them per batch. For simplicity, we omit linear classifier initialization and early stopping with validation performance. One can also disable the corresponding grad field of the encoders for partial finetuning, or pre-extract intermediate features to speed up training.

    # w: linear layer initialized with text features
    # T: temperature scaling (default is 100)
    for _ in range(num_iterations):
        # Randomly sample images and texts
        im, im_labels = image_loader.next()
        tx, tx_labels = text_loader.next()
        # Extract image and text features
        im_f = image_encoder(im)
        tx_f = text_encoder(tx)
        # Put in same batch, then L2 normalize
        features = cat((im_f, tx_f), dim=0)
        features = normalize(features, dim=1)
        labels = cat((im_labels, tx_labels), dim=0)
        # Compute softmax (cross-entropy) loss
        logits = w(features)
        loss = cross_entropy_loss(logits / T, labels)
        loss.backward()
        # Update linear layer
        update(w.params)
        # [optional] Update (partial or full) encoders
        update(image_encoder.params)
        update(text_encoder.params)
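The one-shot text samples consumed by text_loader above can be built directly from the class names; a minimal sketch, with the template being one of the hand-engineered prompts from Table 5:

```python
def make_text_samples(class_names, template="a photo of a {cls}."):
    """One text training sample per class, labeled by its class index."""
    prompts = [template.format(cls=name) for name in class_names]
    labels = list(range(len(class_names)))
    return prompts, labels

# make_text_samples(["cat", "dog"]) ->
#   (["a photo of a cat.", "a photo of a dog."], [0, 1])
```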
Figure 5. Additional PCA projection plots for random pairs of classes in ImageNet [15]. Adding one-shot text as training samples can oftentimes aggressively shift the decision boundary.
Method                          | 1-shot | 2-shot | 4-shot | 8-shot | 16-shot | Train speed
Zero-Shot CLIP (58.8)           |   -    |   -    |   -    |   -    |   -     | -
Linear Probing                  |  36.7  |  47.6  |  57.2  |  65.0  |  71.1   |
WiSE-FT [100]                   |  59.1  |  61.8  |  65.3  |  68.4  |  71.6   |
CoOp [113]                      |  59.6  |  62.3  |  66.8  |  69.9  |  73.4   |
ProGrad [114]                   |  62.6  |  64.9  |  68.5  |  71.4  |  74.0   |
Tip-Adapter [111]               |  64.5  |  66.7  |  69.7  |  72.5  |  75.8   |
Tip-Adapter† [111]              |  63.3  |  65.9  |  69.0  |  72.2  |  75.1   |
Cross-Modal Linear Probing      |  64.1  |  67.0  |  70.3  |  73.0  |  76.0   |
Cross-Modal Partial Finetuning  |        |        |        |        |         |
Table 1. Comparison to SOTA using the CoOp [113] protocol, which reports top-1 accuracy across the 11 test sets in Table 5. We include per-dataset results and standard deviations in Section 9. For a fair comparison, we reuse the same few-shot visual samples and hand-engineered text prompts used by Tip-Adapter [111]. The original Tip-Adapter searches over hyperparameters (e.g. early stopping) on the large-scale test set, which may not be realistic for few-shot scenarios. Instead, we rerun their codebase and early-stop on a few-shot validation set (as we do), denoted by †. We reproduce WiSE-FT in our codebase since the original work does not provide few-shot results. In summary, by incorporating one-shot text samples into our training set, a simple cross-modal linear probe already outperforms all prior methods across all shots. Additionally, partial finetuning further improves performance, especially for 8 and 16 shots. Finally, our methods are faster to train than prior work, sometimes significantly (full report in Table 8).
Method                       | 1-shot | 2-shot | 4-shot | 8-shot | 16-shot
Linear Probing               |  36.7  |  47.6  |  57.2  |  65.0  |  71.1
Cross-Modal Linear Probing   |  64.1  |  67.0  |  70.3  |  73.0  |  76.0
Δ                            | +27.4  | +19.4  | +13.1  |  +8.0  |  +4.9
WiSE-FT [100]                |  59.1  |  61.8  |  65.3  |  68.4  |  71.6
Cross-Modal WiSE-FT          |  63.8  |  66.4  |  69.0  |  71.7  |  74.1
Δ                            |  +4.7  |  +4.6  |  +3.7  |  +3.3  |  +2.5
CoOp [113]                   |  59.6  |  62.3  |  66.8  |  69.9  |  73.4
Cross-Modal Prompting        |  62.0  |  64.9  |  68.6  |  71.4  |  74.0
Δ                            |  +2.4  |  +2.6  |  +1.8  |  +1.5  |  +0.6
Tip-Adapter†                 |  63.3  |  65.9  |  69.0  |  72.2  |  75.1
Cross-Modal Adapter          |  64.4  |  67.6  |  70.8  |  73.4  |  75.9
Δ                            |  +1.1  |  +1.7  |  +1.8  |  +1.2  |  +0.8
Table 2. Cross-modal adaptation improves existing methods. We follow the same protocol as Table 1, reporting the delta accuracy between uni-modal and cross-modal variants of various state-of-the-art methods. The consistent boost suggests that cross-modal training is orthogonal to techniques for uni-modal adaptation, such as prompting [113], adapters [39], and robust finetuning [100].

Cross-modal adaptation helps prior art (Table 2). This includes prompting (CoOp [113]), adapters (Tip-Adapter [111]), and robust finetuning (WiSE-FT [100]). We see a large improvement in the low-data regime (1 and 2 shots). Notably, we do not need to tune any methods, and simply reuse the reported hyperparameters. For prompting, we follow CoOp [113] to optimize 16 continuous tokens with the same training setting. For the adapter model, we follow the same 2-layer MLP architecture of CLIP-Adapter [21] with the given residual ratio of 0.2; we outperform Tip-Adapter without relying on their training-free initialization of the MLP.
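A sketch of such an adapter head under this residual-ratio formulation; the bottleneck width below is an assumption, as CLIP-Adapter and Tip-Adapter use their own sizes:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Two-layer MLP blended with the frozen feature via a residual ratio."""
    def __init__(self, dim: int, hidden: int = 256, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, dim), nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # ratio controls how much the adapted feature overrides the frozen one
        return self.ratio * self.mlp(feat) + (1.0 - self.ratio) * feat
```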

For WiSE-FT, we adopt the given ratio (0.5) to post-hoc ensemble the learned and zero-shot classifiers. Overall, our experiments suggest that cross-modal adaptation is consistently effective, and should likely be a baseline moving forward given its ease of implementation (Algorithm 1). For example, instead of separately benchmarking on "zero-shot" (one-shot-text) and few-shot-vision, a cross-modal linear probe would suffice to evaluate the representations of a multimodal model.

5. Vision-Audio Adaptation

We now explore cross-modal adaptation for other modalities such as audio. We pose the following question: can one learn a better visual dog classifier by listening to a dog barking? To examine this question, we curate the first audiovisual benchmark that supports few-shot classification of both image and audio.
Our ImageNet-ESC benchmark. We construct our audiovisual benchmark by intersecting two of the most popular image and audio datasets: ImageNet [15], with 1000 types of objects, and ESC-50 [77], with 50 types of environmental sounds (including animal, nature, human activity, domestic, and urban noises). We use the class names of the two datasets for class matching. For each class in ESC-50, we check whether there is a corresponding ImageNet class that may produce this type of sound. In this process, we observe that the audio-to-object matching can sometimes be one-to-many. For example, the clock-alarm class in ESC-50 can be mapped to either digital clock or analog clock in ImageNet; the dog (barking) class in ESC-50 can be matched to any of the 120 dog species. In such scenarios, we randomly match the classes, e.g. clock-alarm to digital clock and dog to otterhound. Also, we find that some audio classes loosely match with some visual objects, such as drinking-sipping to water bottle and pouring-water to water jug. As such, we create two versions of the dataset: (1) ImageNet-ESC-27, which represents the maximal intersection consisting of all loose matches, and (2) ImageNet-ESC-19, a subset of the former version consisting of more accurate matches. The final matches are shown in appendix Table 9.
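For illustration, the matches mentioned above can be written as a small mapping; this covers only the examples named in the text, and the full list is in appendix Table 9:

```python
# ESC-50 sound class -> matched ImageNet object class (examples from the text only)
esc50_to_imagenet = {
    "clock alarm":      "digital clock",   # could equally map to "analog clock"
    "dog":              "otterhound",      # randomly picked among the 120 dog breeds
    "drinking sipping": "water bottle",    # loose match
    "pouring water":    "water jug",       # loose match
}
```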
Few-shot evaluation protocol. We use five-fold few-shot splits sampled from ImageNet, with each split divided into half for training and validation. Test performance is recorded on the official ImageNet validation set of the corresponding classes. We adopt the predefined five folds of ESC-50, where each fold contains 8 samples per class. We construct 5 splits from ESC-50 by selecting one fold for training and validation, and record test performance on the other 4 folds. We report averaged performance over 25 runs (since we have 5 random splits for each modality). To keep consistent with our vision-language experiments, we adopt a uni-modal validation and test set and leave cross-modal testing for future work.
Audio encoding. We use AudioCLIP [27] with an ESResNeXT backbone [28] as the audio encoder. Because AudioCLIP is trained on a large-scale video dataset (AudioSet [23]) while freezing the pre-trained CLIP text and image encoders, it produces audio embeddings in the same representation space. While AudioCLIP is pretrained on a sizable amount of data, we note that it does not come close to matching the scale of CLIP pretraining [27, 81]. Thus, it does not perform favorably compared to the SOTA for downstream "zero-shot" audio (i.e. one-shot text) classification tasks [27]. However, scaling up audio pretraining is orthogonal to our investigation.
Audio improves image classification. Table 3 shows that adding a random one-shot audio sample improves upon naive image-only linear probing, especially in an extremely low-shot setting. This reaffirms Figure 3's hypothesis that cross-modality can reduce the ambiguity of the uni-modal few-shot setup; in other words, one can learn a better image classifier by listening to object sounds. One exception is the 4-shot performance on ImageNet-ESC-27, where adding audio does not help. We posit that (1) loosely-matched classes can result in noisier training data, and (2) the audio representations are not as robust due to smaller-scale pretraining. This suggests that cross-modal adaptation is less effective when representations are not aligned well or are insufficiently trained. Nevertheless, under most scenarios, cross-modal adaptation helps. Table 15 shows that adding the language modality (i.e. label names) can significantly boost the performance, which is expected because our benchmark is curated with textual information. For all experiments, we follow an identical procedure to the vision-language experiments in Section 4 and provide details in appendix Section 8.
Vision improves audio classification. We additionally evaluate the reverse task: whether adding a random one-shot image sample for downstream audio classification can improve upon audio-only training. Table 4 shows the results, where we see the same favorable trend. This success indicates that our approach is modality-agnostic.
Dataset          | Method             | 1-shot | 2-shot | 4-shot
ImageNet-ESC-19  | Image-Only Linear  |  68.0  |  75.7  |  83.1
ImageNet-ESC-19  | Image-Audio Linear |        |        |
ImageNet-ESC-27  | Image-Only Linear  |  60.1  |  71.8  |
ImageNet-ESC-27  | Image-Audio Linear |        |        |  78.9
Table 3. Image classification results on the ImageNet-ESC benchmark. Adding one audio shot can improve image classification under most few-shot scenarios, even when the audio and vision modalities are only loosely aligned.
Dataset          | Method             | 1-shot | 2-shot | 4-shot
ImageNet-ESC-19  | Audio-Only Linear  |  31.2  |  41.1  |  48.5
ImageNet-ESC-19  | Audio-Image Linear |        |        |
ImageNet-ESC-27  | Audio-Only Linear  |  28.2  |  39.0  |  47.1
ImageNet-ESC-27  | Audio-Image Linear |        |        |
Table 4. Audio classification results on the ImageNet-ESC benchmark. Similar to Table 3, adding one image shot improves few-shot audio classification.
Dataset            | Classes | Train  | Val    | Test   | Hand-crafted Prompt [111]
Caltech101 [19]    | 100     | 4,128  | 1,649  | 2,465  | a photo of a {cls}.
OxfordPets [75]    | 37      | 2,944  | 736    | 3,669  | a photo of a {cls}, a type of pet.
StanfordCars [50]  | 196     | 6,509  | 1,635  | 8,041  | a photo of a {cls}.
Flowers102 [71]    | 102     | 4,093  | 1,633  | 2,463  | a photo of a {cls}, a type of flower.
Food101 [6]        | 101     | 50,500 | 20,200 | 30,300 | a photo of {cls}, a type of food.
FGVCAircraft [66]  | 100     | 3,334  | 3,333  | 3,333  | a photo of a {cls}, a type of aircraft.
SUN397 [103]       | 397     | 15,880 | 3,970  | 19,850 | a photo of a {cls}.
DTD [14]           | 47      | 2,820  | 1,128  | 1,692  | {cls} texture.
EuroSAT [35]       | 10      | 13,500 | 5,400  | 8,100  | a centered satellite photo of {cls}.
UCF101 [93]        | 101     | 7,639  | 1,898  | 3,783  | a photo of a person doing {cls}.
ImageNet [15]      | 1000    | N/A    |        | 50,000 | itap of a {cls}. / a bad photo of the {cls}. / a origami {cls}. / a photo of the large {cls}. / a {cls} in a video game. / art of the {cls}. / a photo of the small {cls}.
Table 5. Detailed statistics of the datasets. We adopt the hand-engineered templates selected by Tip-Adapter [111] unless otherwise stated. Note that this set of templates is identical to the ones selected by CLIP [81] and CoOp [113], except for ImageNet.

6. Ablation Studies

We present a few selected ablation studies in this section. For comprehensive results, please refer to Section 9.
Data augmentation of text samples. Like most prior works [81, 113], we also find that data augmentation can improve downstream performance during vision-language adaptation (cf. Table 1). Notably, since the class names are included as training samples, one can explore augmentation techniques for text (just as random cropping for images). Besides the fixed template "a photo of a {cls}" and hand-crafted templates (Table 5), we also try a template mining strategy that does not rely on the selected dataset-specific templates. To automatically mine for the templates, we search among a pool of 180 templates for 21 templates with the best zero-shot performance on the few-shot validation set of each dataset. We discuss how we collect the 180 templates in appendix Section 8. For image augmentation, we perform standard flipping and random cropping.
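A sketch of this template-mining step; the text_encoder wrapper and the pre-extracted, L2-normalized validation features are placeholders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mine_templates(templates, class_names, text_encoder, val_feats, val_labels, k=21):
    """Rank candidate prompt templates by zero-shot accuracy on the few-shot
    validation features and keep the top k (here, 21 out of a pool of 180)."""
    accuracies = []
    for template in templates:
        prompts = [template.format(cls=name) for name in class_names]
        weights = F.normalize(text_encoder(prompts), dim=-1)   # (num_classes, d) text features
        preds = (val_feats @ weights.t()).argmax(dim=-1)       # zero-shot predictions
        accuracies.append((preds == val_labels).float().mean().item())
    ranked = sorted(range(len(templates)), key=lambda i: accuracies[i], reverse=True)
    return [templates[i] for i in ranked[:k]]
```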
Finetuning | Image Augment  | Text Augment          | 1-shot | 2-shot | 4-shot | 8-shot | 16-shot
Linear     | CenterCrop     | Classname             |  61.8  |  65.3  |  69.0  |  72.0  |
Linear     | CenterCrop     | a photo of a {cls}    |  63.2  |  66.2  |  69.7  |  72.5  |  75.3
Linear     | CenterCrop     | Template Mining       |  63.5  |  67.2  |  70.3  |  73.1  |  75.7
Linear     | CenterCrop     | Hand Engineered [111] |  63.7  |  66.7  |  70.3  |  72.9  |  75.5
Linear     | + Flipped View | Hand Engineered [111] |  64.1  |  67.0  |  70.3  |  73.0  |  76.0
Partial    | CenterCrop     | Classname             |  62.5  |  65.7  |  69.3  |  72.9  |  76.2
Partial    | CenterCrop     | a photo of a {cls}    |  63.8  |  66.8  |  69.8  |  73.4  |  76.7
Partial    | CenterCrop     | Template Mining       |  64.3  |  67.1  |  70.3  |  73.5  |  76.5
Partial    | CenterCrop     | Hand Engineered [111] |  64.6  |  67.2  |  70.2  |  73.7  |  76.9
Partial    | + Flipped View | Hand Engineered [111] |  64.7  |  67.7  |  70.6  |        |  77.2

Table 6. Augmentation for cross-modal adaptation. We evaluate the impact of selected augmentation techniques following the same CoOp protocol as in Table 1.

We show a subset of results in Table 6, and find that all text augmentation techniques provide a sizable boost in performance. We also report comprehensive ablations in appendix Table 11 and compare to the SOTA prompting method ProDA [63]. The salient conclusions include: (1) the performance gain from image augmentation saturates after more than two views, and (2) template mining can be as competitive as a large set of 36 carefully-tuned prompts. In fact, prompting can be viewed as another text augmentation technique under cross-modal adaptation, and we leave this exploration to future work.
Test-time distribution shifts. We examine how robust our approach is against test-time distribution shifts in Table 7. Specifically, we follow the CoOp [113] protocol to report the test performance of a classifier trained on the source dataset (16-shot ImageNet) on 4 distribution-shifted target test sets, including ImageNet-V2 [83], ImageNet-Sketch [96], ImageNet-A [37], and ImageNet-R [36]. As shown in Table 7, cross-modal adaptation can significantly boost the robustness of image-only linear probing and is competitive against baselines designed to address robustness such as CoCoOp [112] and WiSE-FT [100]. Cross-modal adaptation also improves upon WiSE-FT [100] and sets the new SOTA. We can conclude that the language modality plays an important role in robustness, similar to how humans rely on textual cues for recognition [37].
Efficiency. As shown in Table 8, our approaches are much more lightweight because we do not rely on deep finetuning or heavy image augmentations. This allows us to speed up training by pre-extracting features, resulting in rather fast training speeds.

7. Discussion and Limitations

We show that cross-modal training is a lightweight and effective approach for adapting pre-trained multimodal models for downstream uni-modal tasks. One reason for
Method                      | Source: ImageNet | Target: -V2 | -Sketch | -A   | -R
ResNet50
Zero-Shot CLIP              | 58.2             | 51.3        | 33.3    | 21.7 | 56.0
Linear Probing              | 55.9             | 46.0        | 19.1    | 12.7 |
                            | 63.0             | 55.1        | 32.7    | 22.1 | 55.0
CoOp                        | 63.3             | 55.4        |         |      | 56.6
WiSE-FT                     | 62.9             | 54.2        | 33.3    | 20.3 | 57.4
Cross-Modal WiSE-FT         | 65.2             | 56.6        | 35.6    |      |
Cross-Modal Linear Probing  | 64.5             | 55.3        | 33.1    | 20.0 |
ViT-B/16
Zero-Shot CLIP              | 66.7             | 60.8        | 46.2    | 47.8 | 74.0
Linear Probing              | 65.9             | 56.3        | 34.8    |      | 58.4
                            | 71.9             | 64.2        | 46.7    | 48.4 | 74.3
CoOp                        | 71.7             | 64.6        | 47.9    | 49.9 | 75.1
CoCoOp                      | 71.0             | 64.1        | 48.8    |      | 76.2
WiSE-FT                     | 73.0             |             | 49.1    | 49.8 | 77.6
Cross-Modal WiSE-FT         | 72.9             |             | 49.2    |      | 77.8
Cross-Modal Linear Probing  | 73.2             | 64.8        | 47.9    | 48.3 | 76.4
Table 7. Robustness under test-time distribution shifts. We follow CoOp [113]'s protocol for evaluating the test-time performance on variants of ImageNet. We report results with two image encoders (ResNet50 and ViT-B/16), and mark the best and second-best results. Salient conclusions: (a) cross-modal linear probing is much more robust than its uni-modal counterpart while being competitive with previous SOTA methods such as WiSE-FT and CoOp, and (b) it can be further augmented with post-hoc modification through WiSE-FT to achieve the new SOTA.
Method               | Iteration | Time | Accuracy | Gain
Zero-Shot CLIP [81]  | 0         | 0    | 60.33    | 0
Image-Only Linear    |           |      | 56.44    | -3.89
CoOp [113]           |           |      | 62.95    | +2.62
ProGrad [114]        |           |      | 63.45    | +3.12
Tip-Adapter [111]    |           |      |          |