
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

Zhiqiu Lin*  Samuel Yu*  Zhiyi Kuang  Deepak Pathak  Deva Ramanan
Carnegie Mellon University
{zhiqiul, samuelyu, zkuang, dpathak, deva}@cs.cmu.edu

Abstract

The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification. Project site at link.

1. Introduction

Learning with minimal instruction is a hallmark of human intelligence [86,91,98], and is often studied under the guise of few-shot learning. In the context of few-shot visual classification [18,20,29,46,79,82], a classifier is first pretrained on a set of base classes to learn a good feature representation and then adapted or finetuned on a small amount of novel class data. However, such few-shot setups often face an inherent ambiguity: if the training image contains a golden retriever wearing a hat, how does the learner know if the task is to find dogs, golden retrievers, or even hats? On the other hand, humans have little trouble understanding and even generalizing from as few as one example. How so?

Figure 1. Human perception is internally cross-modal. When we perceive from one modality (such as vision), the same neurons will be triggered in our cerebral cortex as if we were perceiving the object from other modalities (such as language and audio). This phenomenon grants us a strong ability to learn from a few examples with cross-modal information [52,67]. In this work, we propose to leverage cross-modality to adapt multimodal models (such as CLIP [81] and AudioCLIP [27]) that encode different modalities to the same representation space.
We argue that humans make use of multimodal signals and representations (Figure 1) when learning concepts. For example, verbal language has been shown to help toddlers better recognize visual objects given just a few examples. Indeed, there exists ample evidence from neuroscience suggesting that cognitive representations are inherently multimodal. For instance, visual images of a person evoke the same neurons as the textual strings of the person's name [80] and even audio clips of that person talking [70]. Even for infants as young as 1-5 months old, there is a strong correspondence between auditory-visual [52] as well as visual-tactile signals [67]. Such cross-modal or inter-modal representations are fundamental to the human perceptual-cognitive system, allowing us to understand new concepts even with few examples [24].
Cross-modal adaptation (our approach). In this paper, we demonstrate that cross-modal understanding of different modalities (such as image-text or image-audio) can improve the performance of individual modalities. That is, reading about dogs and listening to them bark can help build a better visual classifier for them! To do so, we present a remarkably simple strategy for cross-modal few-shot adaptation: we treat examples from different modalities as additional few-shot examples. For example, given the "1-shot" task of learning a dog classifier, we treat both the textual dog label and the single visual image as training examples for learning a (visual) dog classifier. Learning is straightforward when using frozen textual and visual encoders, such as CLIP [81], that map different modalities to the same representational space. In essence, we have converted the "n-shot" problem to an "(n+1)-shot" problem (Figure 2)! We demonstrate that this basic strategy produces SOTA results across the board with a simple linear classifier, and can be applied to existing finetuning methods or additional modalities (e.g. audio).
Why does it work? From one perspective, it may not be surprising that cross-modal adaptation improves accuracy, since it takes advantage of additional training examples that are "hidden" in the problem definition, e.g. a label name [104] or an annotation policy [68] for each class. However, our experiments demonstrate that multimodal cues are often complementary since they capture different aspects of the underlying concept; a dog label paired with a single visual example is often more performant than two images! For example, Figure 3 demonstrates a one-shot example where the target concept is ambiguous, but becomes clear once we add information from other modalities like language and sound.
Multimodal adaptation (prior art). In contrast to our cross-modal approach, most prior works simply follow the popular practice of finetuning uni-modal foundation models, such as large vision or language models [8, 17, 62]. For example, CoOp [113] and other prompting methods finetune CLIP via prefix tuning to replace hand-engineered prompts such as "a photo of a {cls}" with learned word tokens. Similarly, inspired by parameter-efficient tuning of language models [39], adapter-based methods [21,111] finetune CLIP by inserting lightweight multi-layer perceptrons (MLPs). However, we aim to study the fundamental question of how to finetune multi-modal (as opposed to uni-modal) models. A crucial difference between prior art and ours is the use of textual information, as all existing methods [41,100,111,113] repurpose additional text features as classifier weights instead of training samples. We demonstrate in this paper that cross-modal adaptation is not only more performant but can also benefit prior uni-modal approaches.
Figure 2. Adding additional modalities helps few-shot learning. Adding textual labels to a 2-shot cat-vs-dog classification task leads to better test performance (by turning the problem into a 3-shot cross-modal task!). We visualize cross-modal CLIP [21] features (projected to 2D with principal component analysis) and the resulting classifier learned from them, and observe a large shift in the decision boundary. See Figure 5 for more examples.

Problem setup. We begin by replicating the existing evaluation protocol of other works [81,111,113] on few-shot adaptation of vision-language models, and report performance on 11 diverse downstream datasets.

We produce state-of-the-art accuracy with an embarrassingly simple linear classifier that has access to additional "hidden" training examples in the form of textual labels, resulting in a system that is far more lightweight than prior art. Interestingly, we show that existing approaches [100,111,113], despite already repurposing text features as classifier weights, can still benefit from cross-modal learning. Finally, we extend our work to the audio domain by taking advantage of AudioCLIP [27], which maps audio to the same frozen CLIP representation space. We construct the first (to our knowledge) cross-modal few-shot learning benchmark with audio by intersecting ImageNet [15] and the ESC-50 audio classification dataset [77]. We show that cross-modal audiovisual learning helps both downstream image and audio classification; in summary, one can train better dog image classifiers by listening to them bark!
2. Related Works

Webly-supervised pre-training. Learning foundation models [5] from large-scale web data is becoming a predominant paradigm in AI. In NLP, models such as BERT [17] and GPT-3 [8] are pre-trained on a massive web text corpus with language-modeling objectives and can be transferred to a wide range of downstream tasks, even without explicit supervised finetuning [61, 94]. Self-supervision is also a trending topic in the vision community, and recent methods [26,31] demonstrate even stronger visual representations than fully-supervised pre-trained ones such as on ImageNet [15].
Figure 3. Cross-modality reduces the ambiguity of few-shot learning. Classic (uni-modal) few-shot learning is often underspecified. Even for binary classification, when given only a single image per class (left), it is unclear whether the target class is the animal, the hat, or the background scene. Adding an extra modality, such as text or audio, helps clarify the problem setup (right). Notably, language usually comes "for free" in classification datasets in the form of a textual label per class.

Multimodal foundation models. Recently, foundation models have shifted towards a multimodal supervision paradigm. For visual representation learning, early works transform web image captions into structured outputs for supervised learning, such as multi-label targets [47] or visual n-grams [56]. More recently, CLIP [81] and ALIGN [43] propose a simple contrastive-based approach to embed images and captions into the same representation space, and demonstrate impressive "zero-shot" performance on downstream tasks. Follow-up works enhance multimodal pre-training by incorporating generative-based objectives [2, 57, 106], consistency regularization [60, 69], stronger visual priors [107], phrase-grounding tasks [58, 109], and audiovisual information through videos [27]. In this work, we focus on adapting CLIP [81] and AudioCLIP [27] for few-shot classification because contrastive-based multimodal models are stronger classifiers [2]. Adopting other multimodal models or adapting to tasks other than classification can be interesting future directions.
Adaptation of foundation models. As multimodal pre-trained models have excelled at classic vision tasks [81, 109], there has been surging interest in developing more efficient adaptation methods. However, we observe that most of the trending techniques are built upon successful recipes crafted for uni-modal foundation models. For example, CLIP [81] adopts linear probing [12,31,32,109] and full finetuning [25, 31, 48, 99, 101, 109] when transferring to downstream tasks. Prompt adaptation of CLIP [63, 81] is motivated by the success of prefix-tuning for language models [16,22,30,45,61,78,84,85,89]. Similarly, CLIP-Adapter [21] and Tip-Adapter [111] are inspired by parameter-efficient finetuning methods that optimize lightweight MLPs while freezing the encoder. Yet, all aforementioned methods including WiSE-FT [100] use the other modality, e.g. textual labels, as classifier weights and still calculate a uni-modal softmax loss on the few-shot images. We instead show that incorporating other modalities as training samples is far more effective.

Figure 4. Uni-modal (left) vs. cross-modal adaptation (right). Prior work [21,100,111,113] performs uni-modal adaptation by calculating the loss over a single modality. Cross-modal adaptation makes use of additional training samples from other modalities, exploiting pre-trained encoders that map different modalities to the same representation space. We show that cross-modal learning can also improve prior art and even extends to audio modalities with AudioCLIP [27].
Few-shot classification. Prior successful few-shot learning methods leverage meta-learning [20, 82], metric learning [4, 91, 95], transfer learning [29, 79], and transductive learning [18,46]. These classic algorithms usually assume a large meta-training set for pre-training the network, and then evaluate on multiple episodes of few-shot train (support) and test (query) sets. In this work, we instead follow the new evaluation protocol implemented by recent works on few-shot adaptation with CLIP [81,111,113]: (1) the meta-training phase is replaced with pre-trained CLIP models, and (2) the test sets are the official test splits of each dataset (thus not few-shot). Notably, none of the prior works we compare to in this paper perform optimization with test set samples, and we follow this practice to ensure a fair comparison. We leave semi-supervised [97] or transductive finetuning [18, 40] techniques as future work.
Cross-modal machine learning. Inspired by cross-modal human cognition [9, 49, 70], cross-modal learning is a subfield of multimodal machine learning that aims to use data from additional modalities to improve a uni-modal task. Cross-modal learning does not require instance-wise alignment; for example, existing algorithms [68,104] can benefit from class-level descriptions as opposed to image-level captions. In this work, we propose a lightweight cross-modal learning method by treating data from other modalities as additional training samples. Furthermore, we encourage future works to embrace cross-modal few-shot learning as opposed to the underspecified uni-modal setup (Figure 3).
3. Cross-Modal Adaptation

In this section, we mathematically formalize our approach to cross-modal few-shot learning.
Uni-modal learning. We begin by reviewing standard uni-modal few-shot classification, which learns a classifier from a small dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ of (data, label) pairs and a pre-trained feature encoder $\phi$:

$$\min_{\{w_k\}} \sum_{(x, y) \in D} \mathcal{L}\big(\phi(x), y; \{w_k\}\big), \quad (1)$$

where $\mathcal{L}$ is typically the softmax loss

$$\mathcal{L}\big(\phi(x), y; \{w_k\}\big) = -\log \frac{\exp\big(w_y^\top \phi(x)\big)}{\sum_k \exp\big(w_k^\top \phi(x)\big)}. \quad (2)$$

Our notation separates the feature extractor $\phi$ from the final class weights $\{w_k\}$, since the former is typically pre-trained on a massive source dataset and the latter is trained on the few-shot target dataset. However, sometimes the representation $\phi$ can also be finetuned on the few-shot dataset (as we explore in our experiments). Importantly, both the class weights and feature extractor must live in the same $d$-dimensional space in order to compute their inner product:

$$w_k, \ \phi(x) \in \mathbb{R}^{d}. \quad (3)$$

Though we focus on classification, class models could be learned via other losses (such as centroid prototypes [91]).
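For concreteness, a minimal PyTorch sketch of this uni-modal linear probe over frozen, pre-extracted features; the dimensions and class count below are arbitrary stand-ins, not values from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def unimodal_loss(features: torch.Tensor, labels: torch.Tensor, classifier: nn.Linear) -> torch.Tensor:
    """Softmax loss of Equations 1-2, computed over frozen, L2-normalized features."""
    logits = classifier(F.normalize(features, dim=-1))
    return F.cross_entropy(logits, labels)

# Random stand-ins for features pre-extracted with a frozen encoder phi:
feats = torch.randn(8, 1024)            # 8 few-shot samples, d = 1024 (e.g. CLIP ResNet50)
labels = torch.randint(0, 5, (8,))      # 5 classes
probe = nn.Linear(1024, 5, bias=False)  # rows of probe.weight play the role of the class weights w_k
loss = unimodal_loss(feats, labels, probe)
```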
Cross-modal learning. Our extension to multiple modalities is straightforward; we assume each training example is accompanied by a discrete label $m_i$ denoting its modality:

$$D = \{(x_i, y_i, m_i)\}_{i=1}^{N}, \quad m_i \in M. \quad (4)$$

For example, one may define the set of modalities to be $M = \{\text{visual}, \text{language}\}$ or $M = \{\text{visual}, \text{audio}\}$ (Figure 4). We can then define an associated loss:

$$\min_{\{w_k\}} \sum_{(x, y, m) \in D} \mathcal{L}\big(\phi_m(x), y; \{w_k\}\big), \quad (5)$$

where we crucially assume access to modality-specific feature encoders $\phi_m$ for each $m \in M$. While the individual datapoints $x_i$ may come from different modalities with different dimensions, our formulation requires that the encoders map all modalities to the same fixed-dimensional space:

$$\phi_m(x) \in \mathbb{R}^{d} \quad \text{for all } m \in M. \quad (6)$$

Note that this requirement is satisfied by many multimodal foundation models such as CLIP [81] and ALIGN [43] since they map different modalities into the same $d$-dimensional embedding. We provide training pseudocode for vision-language adaptation (Section 4) in Algorithm 1 for clarity.
Inference: The learned classifier can produce a label prediction for a test example $x$ from any modality $m \in M$:

$$\hat{y} = \arg\max_{k} \ w_k^\top \phi_m(x). \quad (7)$$

This means we can use the same classifier to classify different test modalities (e.g. images and audio clips). In this paper, we mainly evaluate on a single modality (like images) to emphasize that multimodality helps unimodality.
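As a concrete illustration of Equation 7, a minimal PyTorch-style sketch; the encoder and classifier objects are placeholders in the spirit of Algorithm 1, not a released API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict(classifier, encoder, x):
    """Classify a test example from any modality whose encoder maps into the
    shared d-dimensional space (Equation 7)."""
    feats = F.normalize(encoder(x), dim=-1)   # L2-normalize, as in Algorithm 1
    return classifier(feats).argmax(dim=-1)   # the same linear head is reused for every modality

# e.g. predict(classifier, image_encoder, images) or
#      predict(classifier, audio_encoder, audio_clips)
```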
Cross-modal ensembles. We now show that cross-modal learning produces classifiers that are ensembles of modality-specific classifiers, exposing a connection to related approaches for ensembling (such as WiSE-FT [100]). We begin by appealing to the well-known Representer Theorem [87], which shows that optimally-trained classifiers can be represented as linear combinations of their training samples. In the case of a cross-modal linear probe, the weights $w_k$ for class $k$ must be a weighted combination of all $N$ training features, across all modalities:

$$w_k = \sum_{i=1}^{N} \alpha_{ik} \, \phi_{m_i}(x_i). \quad (8)$$

Linear classification via cross-modal adaptation solves for all weights $\alpha_{ik}$ jointly, so as to minimize the empirical risk (or training loss). In contrast, prior art optimizes the image-specific $\alpha$'s independently of the text-specific $\alpha$'s, linearly combining them with a single global $\alpha$ (as in WiSE-FT [100]) or via text-based classifier initialization [21,111]. Our analysis suggests that the joint optimization enabled by cross-modal learning may help other adaptation methods, as our experiments do in fact show.
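The distinction can be made concrete with a short sketch; the tensors below stand for classifier weight matrices of shape (num_classes, d), and the mixing ratio follows the WiSE-FT-style ensembling described above:

```python
import torch

def posthoc_ensemble(w_image: torch.Tensor, w_text: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """WiSE-FT-style ensembling: the image-trained and text-based ("zero-shot")
    classifiers are obtained independently, then mixed with one global ratio."""
    return alpha * w_image + (1.0 - alpha) * w_text

# Cross-modal adaptation instead fits a single weight matrix on image and text
# features within the same mini-batches (Algorithm 1), so the per-sample
# coefficients alpha_ik of Equation 8 are chosen jointly by the optimizer.
```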
Extensions. Although we focus on uni-modal inference tasks (e.g. image classification), the above formulation allows the learned classifier to be applied to multimodal test sets, such as classifying videos by training on images and audio, and then ensembling predictions across the two modalities with Equation 7.
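One simple realization of such multimodal test-time ensembling is sketched below; summing logits is just one possible choice of fusion, not a prescription from the paper:

```python
import torch
import torch.nn.functional as F

def multimodal_predict(classifier, encoders: dict, inputs: dict):
    """Ensemble Equation 7 over several test modalities by summing logits, e.g.
    encoders = {"image": image_encoder, "audio": audio_encoder}."""
    logits = sum(classifier(F.normalize(encoders[m](inputs[m]), dim=-1)) for m in inputs)
    return logits.argmax(dim=-1)
```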

Or, one can extend image classification by providing additional data such as captions and/or attributes. We leave these scenarios as future work. Finally, just as one can optimize uni-modal losses (1) by finetuning the encoder $\phi$, one can similarly finetune the modality-specific encoders $\phi_m$ in the cross-modal setting (5). We explore this finetuning method in the next section.

4. Vision-Language Adaptation

We now explore our cross-modal formulation for a particular multimodal setting. Many prior works [113] explore the intersection of vision and language, and thus that is our initial focus. Interestingly, the influential "zero-shot" and "few-shot" evaluation protocols introduced by prior work can be mapped to our cross-modal setting, with one crucial difference: the textual label of each class can be treated as an explicit training sample. From this perspective, "zero-shot" learning may be more naturally thought of as one-shot cross-modal learning that learns a few-shot model on text and then infers with it on images.
Few-shot evaluation protocol. To ensure a fair comparison, we strictly follow the protocol of CoOp [113] by reporting test performance on 11 public image datasets (Table 5), with ResNet50 [33] as the image encoder backbone. For maximal reproducibility, we use CoOp's dataset splits [113] and the three-fold few-shot train sets sampled with the same random seeds. We adopt the given test split of each dataset as the test set. Some prior works apparently use the large-scale test set to tune hyperparameters for few-shot learning; we instead exercise due diligence by tuning hyperparameters (such as the learning rate, weight decay, and early stopping) on the given few-shot validation set with min(n, 4) examples, where n is the number of training shots. We include PyTorch-style pseudocode (Algorithm 1) and hyperparameter details (Section 8).
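A minimal sketch of the tuning loop implied by this protocol; the train_one_epoch and evaluate helpers and the epoch budget are illustrative placeholders, not the released code:

```python
import copy

def fit_with_early_stopping(classifier, optimizer, train_loader, val_loader,
                            train_one_epoch, evaluate, max_epochs=50):
    """Early-stop on the few-shot validation split; the official test split is never touched."""
    best_val_acc, best_state = 0.0, None
    for _ in range(max_epochs):
        train_one_epoch(classifier, train_loader, optimizer)   # few-shot train split
        val_acc = evaluate(classifier, val_loader)             # min(n, 4) validation examples per class
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_state = copy.deepcopy(classifier.state_dict())
    classifier.load_state_dict(best_state)
    return best_val_acc
```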
Cross-modal adaptation outperforms SOTA. Table 1 shows the effectiveness of our proposal: we surpass all prior art with an embarrassingly simple linear classifier that requires significantly less training time than other carefully-crafted algorithms. In addition, partial finetuning of the last attentional pooling layer of the image encoder sets the new SOTA. To ensure a fair comparison, we augment the class names into sentences using hand-engineered templates selected by Tip-Adapter [111] (Table 5) and follow their practice to initialize the linear layer with text features.

Furthermore, we perform minimal image augmentation with a center crop plus a flipped view instead of random crops as in prior art. As such, we can pre-extract features before training the classifier, leading to significantly less training time as shown in Table 8. We also show that our method can benefit from both image and text augmentation in Table 6.
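A sketch of this pre-extraction step, assuming a dataloader that already applies the resize-and-center-crop preprocessing; the names are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def preextract_features(image_encoder, loader):
    """Encode each few-shot image once under two views (center crop + horizontal flip)
    so the linear classifier can then be trained without re-running the encoder."""
    all_feats, all_labels = [], []
    for images, labels in loader:                                # loader applies resize + center crop
        views = torch.cat([images, torch.flip(images, dims=[3])])  # (B, C, H, W): flip the width axis
        feats = F.normalize(image_encoder(views), dim=-1)
        all_feats.append(feats)
        all_labels.append(torch.cat([labels, labels]))
    return torch.cat(all_feats), torch.cat(all_labels)
```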

In the appendix, we provide more ablations on classifier initialization (Table 12), partial finetuning (Table 13), and a ViT-based backbone (Table 14). Per-dataset results are also in appendix Table 10.
Why does cross-modal learning help? As stated earlier, one reason that cross-modal learning helps is that it turns the original n-shot problem into an (n+1)-shot one. However, Table 1 shows that 1-shot cross-modal linear probing outperforms the 2-shot results of most prior methods. This suggests that training samples from other modalities tend to contain complementary cues [68, 100, 104]. One can loosely observe this in Figure 2 and Figure 5, whereby visual and text examples lie in slightly different parts of the embedding space (indicating the potential to aggressively shape the final decision boundary). In fact, WiSE-FT [100] is inspired by similar reasons to ensemble the uni-modal visual classifier with a "zero-shot" (one-shot-text) classifier (in the linear probing case). However, Equation 8 shows that cross-modal adaptation can also be seen as jointly learning an ensemble, while WiSE-FT [100] learns the visual classifier independently of the text classifier. This suggests that other adaptation methods may benefit from cross-modal learning, as we show next.

Algorithm 1: An example of PyTorch-style pseudocode for cross-modal (vision-language) adaptation. Notably, the image and text samples do not need to be paired, and one may sample different numbers of them per batch. For simplicity, we omit linear classifier initialization and early stopping with validation performance. One can also disable the corresponding grad field of the encoders for partial finetuning, or pre-extract intermediate features to speed up training.

    # w: linear layer initialized with text features
    # T: temperature scaling (default is 100)
    for _ in range(num_iterations):
        # Randomly sample images and texts
        im, im_labels = image_loader.next()
        tx, tx_labels = text_loader.next()
        # Extract image and text features
        im_f = image_encoder(im)
        tx_f = text_encoder(tx)
        # Put in same batch, then L2 normalize
        features = cat((im_f, tx_f), dim=0)
        features = normalize(features, dim=1)
        labels = cat((im_labels, tx_labels), dim=0)
        # Compute softmax (cross-entropy) loss
        logits = w(features)
        loss = cross_entropy_loss(logits / T, labels)
        loss.backward()
        # Update linear layer
        update(w.params)
        # [optional] Update (partial or full) encoders
        update(image_encoder.params)
        update(text_encoder.params)
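The one-shot text samples consumed by text_loader above can be built directly from the class names; a minimal sketch, with the template being one of the hand-engineered prompts from Table 5:

```python
def make_text_samples(class_names, template="a photo of a {cls}."):
    """One text training sample per class, labeled by its class index."""
    prompts = [template.format(cls=name) for name in class_names]
    labels = list(range(len(class_names)))
    return prompts, labels

# make_text_samples(["cat", "dog"]) ->
#   (["a photo of a cat.", "a photo of a dog."], [0, 1])
```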
Figure 5. Additional PCA projection plots for random pairs of classes in ImageNet [15]. Adding one-shot text as training samples can oftentimes aggressively shift the decision boundary.
Method                          | 1-shot | 2-shot | 4-shot | 8-shot | 16-shot | Train speed
Zero-Shot CLIP (58.8)           |   -    |   -    |   -    |   -    |   -     | -
Linear Probing                  |  36.7  |  47.6  |  57.2  |  65.0  |  71.1   |
WiSE-FT [100]                   |  59.1  |  61.8  |  65.3  |  68.4  |  71.6   |
CoOp [113]                      |  59.6  |  62.3  |  66.8  |  69.9  |  73.4   |
ProGrad [114]                   |  62.6  |  64.9  |  68.5  |  71.4  |  74.0   |
Tip-Adapter [111]               |  64.5  |  66.7  |  69.7  |  72.5  |  75.8   |
Tip-Adapter† [111]              |  63.3  |  65.9  |  69.0  |  72.2  |  75.1   |
Cross-Modal Linear Probing      |  64.1  |  67.0  |  70.3  |  73.0  |  76.0   |
Cross-Modal Partial Finetuning  |        |        |        |        |         |
Table 1. Comparison to SOTA using the CoOp [113] protocol, which reports top-1 accuracy across the 11 test sets in Table 5. We include per-dataset results and standard deviations in Section 9. For a fair comparison, we reuse the same few-shot visual samples and hand-engineered text prompts used by Tip-Adapter [111]. The original Tip-Adapter searches over hyperparameters (e.g. early stopping) on the large-scale test set, which may not be realistic for few-shot scenarios. Instead, we rerun their codebase and early-stop on a few-shot validation set (as we do), denoted by †. We reproduce WiSE-FT in our codebase since the original work does not provide few-shot results. In summary, by incorporating one-shot text samples into our training set, a simple cross-modal linear probe already outperforms all prior methods across all shots. Additionally, partial finetuning further improves performance, especially for 8 and 16 shots. Finally, our methods are faster to train than prior work, sometimes significantly (full report in Table 8).
Method                       | 1-shot | 2-shot | 4-shot | 8-shot | 16-shot
Linear Probing               |  36.7  |  47.6  |  57.2  |  65.0  |  71.1
Cross-Modal Linear Probing   |  64.1  |  67.0  |  70.3  |  73.0  |  76.0
Δ                            | +27.4  | +19.4  | +13.1  |  +8.0  |  +4.9
WiSE-FT [100]                |  59.1  |  61.8  |  65.3  |  68.4  |  71.6
Cross-Modal WiSE-FT          |  63.8  |  66.4  |  69.0  |  71.7  |  74.1
Δ                            |  +4.7  |  +4.6  |  +3.7  |  +3.3  |  +2.5
CoOp [113]                   |  59.6  |  62.3  |  66.8  |  69.9  |  73.4
Cross-Modal Prompting        |  62.0  |  64.9  |  68.6  |  71.4  |  74.0
Δ                            |  +2.4  |  +2.6  |  +1.8  |  +1.5  |  +0.6
Tip-Adapter†                 |  63.3  |  65.9  |  69.0  |  72.2  |  75.1
Cross-Modal Adapter          |  64.4  |  67.6  |  70.8  |  73.4  |  75.9
Δ                            |  +1.1  |  +1.7  |  +1.8  |  +1.2  |  +0.8
Table 2. Cross-modal adaptation improves existing methods. We follow the same protocol as Table 1, reporting the delta accuracy between uni-modal and cross-modal variants of various state-of-the-art methods. The consistent boost suggests that cross-modal training is orthogonal to techniques for uni-modal adaptation, such as prompting [113], adapters [39], and robust finetuning [100].

Cross-modal adaptation helps prior art (Table 2). This includes prompting (CoOp [113]), adapters (Tip-Adapter [111]), and robust finetuning (WiSE-FT [100]). We see a large improvement in the low-data regime (1 and 2 shots). Notably, we do not need to tune any methods, and simply reuse the reported hyperparameters. For prompting, we follow CoOp [113] to optimize 16 continuous tokens with the same training setting. For the adapter model, we follow the same 2-layer MLP architecture of CLIP-Adapter [21] with the given residual ratio of 0.2; we outperform Tip-Adapter without relying on their training-free initialization of the MLP.
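A sketch of such an adapter head under this residual-ratio formulation; the bottleneck width below is an assumption, as CLIP-Adapter and Tip-Adapter use their own sizes:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Two-layer MLP blended with the frozen feature via a residual ratio."""
    def __init__(self, dim: int, hidden: int = 256, ratio: float = 0.2):
        super().__init__()
        self.ratio = ratio
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, dim), nn.ReLU(inplace=True),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # ratio controls how much the adapted feature overrides the frozen one
        return self.ratio * self.mlp(feat) + (1.0 - self.ratio) * feat
```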

For WiSE-FT, we adopt the given ratio (0.5) to post-hoc ensemble the learned and zero-shot classifiers. Overall, our experiments suggest that cross-modal adaptation is consistently effective, and should likely be a baseline moving forward given its ease of implementation (Algorithm 1). For example, instead of separately benchmarking on "zero-shot" (one-shot-text) and few-shot-vision, a cross-modal linear probe would suffice to evaluate the representations of a multimodal model.

5. Vision-Audio Adaptation

We now explore cross-modal adaptation for other modalities such as audio. We pose the following question: can one learn a better visual dog classifier by listening to a dog barking? To examine this question, we curate the first audiovisual benchmark that supports few-shot classification of both image and audio.
Our ImageNet-ESC benchmark. We construct our audiovisual benchmark by intersecting two of the most popular image and audio datasets: ImageNet [15], with 1000 types of objects, and ESC-50 [77], with 50 types of environmental sounds (including animal, nature, human activity, domestic, and urban noises). We use the class names of the two datasets for class matching. For each class in ESC-50, we check whether there is a corresponding ImageNet class that may produce this type of sound. In this process, we observe that the audio-to-object matching can sometimes be one-to-many. For example, the clock-alarm class in ESC-50 can be mapped to either digital clock or analog clock in ImageNet; the dog (barking) class in ESC-50 can be matched to any of the 120 dog species. In such scenarios, we randomly match the classes, e.g. clock-alarm to digital clock and dog to otterhound. Also, we find that some audio classes loosely match with some visual objects, such as drinking-sipping to water bottle and pouring-water to water jug. As such, we create two versions of the dataset: (1) ImageNet-ESC-27, which represents the maximal intersection consisting of all loose matches, and (2) ImageNet-ESC-19, a subset of the former version consisting of more accurate matches. The final matches are shown in appendix Table 9.
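For illustration, the matches mentioned above can be written as a small mapping; this covers only the examples named in the text, and the full list is in appendix Table 9:

```python
# ESC-50 sound class -> matched ImageNet object class (examples from the text only)
esc50_to_imagenet = {
    "clock alarm":      "digital clock",   # could equally map to "analog clock"
    "dog":              "otterhound",      # randomly picked among the 120 dog breeds
    "drinking sipping": "water bottle",    # loose match
    "pouring water":    "water jug",       # loose match
}
```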
Few-shot evaluation protocol. We use five-fold few-shot splits sampled from ImageNet, with each split divided into half for training and validation. Test performance is recorded on the official ImageNet validation set of the corresponding classes. We adopt the predefined five folds of ESC-50, where each fold contains 8 samples per class. We construct 5 splits from ESC-50 by selecting one fold for training and validation, and record test performance on the other 4 folds. We report averaged performance over 25 runs (since we have 5 random splits for each modality). To keep consistent with our vision-language experiments, we adopt a uni-modal validation and test set and leave cross-modal testing for future work.
Audio encoding. We use AudioCLIP [27] with an ESResNeXT backbone [28] as the audio encoder. Because AudioCLIP is trained on a large-scale video dataset (AudioSet [23]) while freezing the pre-trained CLIP text and image encoders, it produces audio embeddings in the same representation space. While AudioCLIP is pretrained on a sizable amount of data, we note that it does not come close to matching the scale of CLIP pretraining [27, 81]. Thus, it does not perform favorably compared to the SOTA for downstream "zero-shot" audio (i.e. one-shot text) classification tasks [27]. However, scaling up audio pretraining is orthogonal to our investigation.
Audio improves image classification. Table 3 shows that adding a random one-shot audio sample improves upon naive image-only linear probing, especially in an extremely low-shot setting. This reaffirms Figure 3's hypothesis that cross-modality can reduce the ambiguity of the uni-modal few-shot setup; in other words, one can learn a better image classifier by listening to object sounds. One exception is the 4-shot performance on ImageNet-ESC-27, where adding audio does not help. We posit that (1) loosely-matched classes can result in noisier training data, and (2) the audio representations are not as robust due to smaller-scale pretraining. This suggests that cross-modal adaptation is less effective when representations are not aligned well or are insufficiently trained. Nevertheless, under most scenarios, cross-modal adaptation helps. Table 15 shows that adding the language modality (i.e. label names) can significantly boost the performance, which is expected because our benchmark is curated with textual information. For all experiments, we follow an identical procedure to the vision-language experiments in Section 4 and provide details in appendix Section 8.
Vision improves audio classification. We additionally evaluate the reverse task: whether adding a random one-shot image sample for downstream audio classification can improve upon audio-only training. Table 4 shows the results, where we see the same favorable trend. This success indicates that our approach is modality-agnostic.
Dataset          | Method             | 1-shot | 2-shot | 4-shot
ImageNet-ESC-19  | Image-Only Linear  |  68.0  |  75.7  |  83.1
ImageNet-ESC-19  | Image-Audio Linear |        |        |
ImageNet-ESC-27  | Image-Only Linear  |  60.1  |  71.8  |
ImageNet-ESC-27  | Image-Audio Linear |        |        |  78.9
Table 3. Image classification results on the ImageNet-ESC benchmark. Adding one audio shot can improve image classification under most few-shot scenarios, even when the audio and vision modalities are only loosely aligned.
Dataset          | Method             | 1-shot | 2-shot | 4-shot
ImageNet-ESC-19  | Audio-Only Linear  |  31.2  |  41.1  |  48.5
ImageNet-ESC-19  | Audio-Image Linear |        |        |
ImageNet-ESC-27  | Audio-Only Linear  |  28.2  |  39.0  |  47.1
ImageNet-ESC-27  | Audio-Image Linear |        |        |
Table 4. Audio classification results on the ImageNet-ESC benchmark. Similar to Table 3, adding one image shot improves few-shot audio classification.
Dataset            | Classes | Train  | Val    | Test   | Hand-crafted Prompt [111]
Caltech101 [19]    | 100     | 4,128  | 1,649  | 2,465  | a photo of a {cls}.
OxfordPets [75]    | 37      | 2,944  | 736    | 3,669  | a photo of a {cls}, a type of pet.
StanfordCars [50]  | 196     | 6,509  | 1,635  | 8,041  | a photo of a {cls}.
Flowers102 [71]    | 102     | 4,093  | 1,633  | 2,463  | a photo of a {cls}, a type of flower.
Food101 [6]        | 101     | 50,500 | 20,200 | 30,300 | a photo of {cls}, a type of food.
FGVCAircraft [66]  | 100     | 3,334  | 3,333  | 3,333  | a photo of a {cls}, a type of aircraft.
SUN397 [103]       | 397     | 15,880 | 3,970  | 19,850 | a photo of a {cls}.
DTD [14]           | 47      | 2,820  | 1,128  | 1,692  | {cls} texture.
EuroSAT [35]       | 10      | 13,500 | 5,400  | 8,100  | a centered satellite photo of {cls}.
UCF101 [93]        | 101     | 7,639  | 1,898  | 3,783  | a photo of a person doing {cls}.
ImageNet [15]      | 1000    | N/A    |        | 50,000 | itap of a {cls}. / a bad photo of the {cls}. / a origami {cls}. / a photo of the large {cls}. / a {cls} in a video game. / art of the {cls}. / a photo of the small {cls}.
Table 5. Detailed statistics of the datasets. We adopt the hand-engineered templates selected by Tip-Adapter [111] unless otherwise stated. Note that this set of templates is identical to the ones selected by CLIP [81] and CoOp [113], except for ImageNet.

6. Ablation Studies

We present a few selected ablation studies in this section. For comprehensive results, please refer to Section 9.
Data augmentation of text samples. Like most prior works [81, 113], we also find that data augmentation can improve downstream performance during vision-language adaptation (cf. Table 1). Notably, since the class names are included as training samples, one can explore augmentation techniques for text (just as random cropping for images). Besides the fixed template "a photo of a {cls}" and hand-crafted templates (Table 5), we also try a template mining strategy that does not rely on the selected dataset-specific templates. To automatically mine for the templates, we search among a pool of 180 templates for 21 templates with the best zero-shot performance on the few-shot validation set of each dataset. We discuss how we collect the 180 templates in appendix Section 8. For image augmentation, we perform standard flipping and random cropping.
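A sketch of this template-mining step; the text_encoder wrapper and the pre-extracted, L2-normalized validation features are placeholders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mine_templates(templates, class_names, text_encoder, val_feats, val_labels, k=21):
    """Rank candidate prompt templates by zero-shot accuracy on the few-shot
    validation features and keep the top k (here, 21 out of a pool of 180)."""
    accuracies = []
    for template in templates:
        prompts = [template.format(cls=name) for name in class_names]
        weights = F.normalize(text_encoder(prompts), dim=-1)   # (num_classes, d) text features
        preds = (val_feats @ weights.t()).argmax(dim=-1)       # zero-shot predictions
        accuracies.append((preds == val_labels).float().mean().item())
    ranked = sorted(range(len(templates)), key=lambda i: accuracies[i], reverse=True)
    return [templates[i] for i in ranked[:k]]
```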
Finetuning | Image Augment  | Text Augment          | 1-shot | 2-shot | 4-shot | 8-shot | 16-shot
Linear     | CenterCrop     | Classname             |  61.8  |  65.3  |  69.0  |  72.0  |
Linear     | CenterCrop     | a photo of a {cls}    |  63.2  |  66.2  |  69.7  |  72.5  |  75.3
Linear     | CenterCrop     | Template Mining       |  63.5  |  67.2  |  70.3  |  73.1  |  75.7
Linear     | CenterCrop     | Hand Engineered [111] |  63.7  |  66.7  |  70.3  |  72.9  |  75.5
Linear     | + Flipped View | Hand Engineered [111] |  64.1  |  67.0  |  70.3  |  73.0  |  76.0
Partial    | CenterCrop     | Classname             |  62.5  |  65.7  |  69.3  |  72.9  |  76.2
Partial    | CenterCrop     | a photo of a {cls}    |  63.8  |  66.8  |  69.8  |  73.4  |  76.7
Partial    | CenterCrop     | Template Mining       |  64.3  |  67.1  |  70.3  |  73.5  |  76.5
Partial    | CenterCrop     | Hand Engineered [111] |  64.6  |  67.2  |  70.2  |  73.7  |  76.9
Partial    | + Flipped View | Hand Engineered [111] |  64.7  |  67.7  |  70.6  |        |  77.2

Table 6. Augmentation for cross-modal adaptation. We evaluate the impact of selected augmentation techniques following the same CoOp protocol as in Table 1.

We show a subset of results in Table 6, and find that all text augmentation techniques provide a sizable boost in performance. We also report comprehensive ablations in appendix Table 11 and compare to the SOTA prompting method ProDA [63]. The salient conclusions include: (1) the performance gain from image augmentation saturates after more than two views, and (2) template mining can be as competitive as a large set of 36 carefully-tuned prompts. In fact, prompting can be viewed as another text augmentation technique under cross-modal adaptation, and we leave this exploration to future work.
Test-time distribution shifts. We examine how robust our approach is against test-time distribution shifts in Table 7. Specifically, we follow the CoOp [113] protocol to report the test performance of a classifier trained on the source dataset (16-shot ImageNet) on 4 distribution-shifted target test sets, including ImageNet-V2 [83], ImageNet-Sketch [96], ImageNet-A [37], and ImageNet-R [36]. As shown in Table 7, cross-modal adaptation can significantly boost the robustness of image-only linear probing and is competitive against baselines designed to address robustness such as CoCoOp [112] and WiSE-FT [100]. Cross-modal adaptation also improves upon WiSE-FT [100] and sets the new SOTA. We can conclude that the language modality plays an important role in robustness, similar to how humans rely on textual cues for recognition [37].
Efficiency. As shown in Table 8, our approaches are much more lightweight because we do not rely on deep finetuning or heavy image augmentations. This allows us to speed up training by pre-extracting features, resulting in rather fast training speeds.

7. Discussion and Limitations

We show that cross-modal training is a lightweight and effective approach for adapting pre-trained multimodal models for downstream uni-modal tasks. One reason for
Method                      | Source: ImageNet | Target: -V2 | -Sketch | -A   | -R
ResNet50
Zero-Shot CLIP              | 58.2             | 51.3        | 33.3    | 21.7 | 56.0
Linear Probing              | 55.9             | 46.0        | 19.1    | 12.7 |
                            | 63.0             | 55.1        | 32.7    | 22.1 | 55.0
CoOp                        | 63.3             | 55.4        |         |      | 56.6
WiSE-FT                     | 62.9             | 54.2        | 33.3    | 20.3 | 57.4
Cross-Modal WiSE-FT         | 65.2             | 56.6        | 35.6    |      |
Cross-Modal Linear Probing  | 64.5             | 55.3        | 33.1    | 20.0 |
ViT-B/16
Zero-Shot CLIP              | 66.7             | 60.8        | 46.2    | 47.8 | 74.0
Linear Probing              | 65.9             | 56.3        | 34.8    |      | 58.4
                            | 71.9             | 64.2        | 46.7    | 48.4 | 74.3
CoOp                        | 71.7             | 64.6        | 47.9    | 49.9 | 75.1
CoCoOp                      | 71.0             | 64.1        | 48.8    |      | 76.2
WiSE-FT                     | 73.0             |             | 49.1    | 49.8 | 77.6
Cross-Modal WiSE-FT         | 72.9             |             | 49.2    |      | 77.8
Cross-Modal Linear Probing  | 73.2             | 64.8        | 47.9    | 48.3 | 76.4
Table 7. Robustness under test-time distribution shifts. We follow CoOp [113]'s protocol for evaluating the test-time performance on variants of ImageNet. We report results with two image encoders (ResNet50 and ViT-B/16), and mark the best and second-best results. Salient conclusions: (a) cross-modal linear probing is much more robust than its uni-modal counterpart while being competitive with previous SOTA methods such as WiSE-FT and CoOp, and (b) it can be further augmented with post-hoc modification through WiSE-FT to achieve the new SOTA.
Method               | Iteration | Time | Accuracy | Gain
Zero-Shot CLIP [81]  | 0         | 0    | 60.33    | 0
Image-Only Linear    |           |      | 56.44    | -3.89
CoOp [113]           |           |      | 62.95    | +2.62
ProGrad [114]        |           |      | 63.45    | +3.12
Tip-Adapter [111]    |           |      |          |