
A Survey on Visual Transformer

Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, and Dacheng Tao
Kai Han, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, Zhaohui Yang, Yiman Zhang, and Yunhe Wang are with Huawei Noah’s Ark Lab.

E-mail: {kai.han, yunhe.wang}@huawei.com. Hanting Chen, Zhenhua Liu, Yehui Tang, and Zhaohui Yang are also with School of EECS, Peking University.

Dacheng Tao is with the School of Computer Science, in the Faculty of Engineering, at The University of Sydney, 6 Cleveland St, Darlington, NSW 2008, Australia. E-mail: dacheng.tao@sydney.edu.au. Corresponding to Yunhe Wang and Dacheng Tao.

All authors are listed in alphabetical order of last name (except the primary and corresponding authors).
Abstract

Transformer, first applied to the field of natural language processing, is a type of deep neural network mainly based on the self-attention mechanism. Thanks to its strong representation capabilities, researchers are looking at ways to apply transformer to computer vision tasks.

In a variety of visual benchmarks, transformer-based models perform similar to or better than other types of networks such as convolutional and recurrent neural networks.

Given its high performance and less need for vision-specific inductive bias, transformer is receiving more and more attention from the computer vision community.

In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages. The main categories we explore include the backbone network, high/mid-level vision, low-level vision, and video processing.

We also include efficient transformer methods for pushing transformer into real device-based applications. Furthermore, we also take a brief look at the self-attention mechanism in computer vision, as it is the base component in transformer.

Toward the end of this paper, we discuss the challenges and provide several further research directions for vision transformers.

Index Terms:
Transformer, Self-attention, Computer Vision, High-level vision, Low-level vision, Video.

1 Introduction

Deep neural networks (DNNs) have become the fundamental infrastructure in today’s artificial intelligence (AI) systems. Different types of tasks have typically involved different types of networks.

For example, multi-layer perceptron (MLP) or the fully connected (FC) network is the classical type of neural network, which is composed of multiple linear layers and nonlinear activations stacked together [1, 2]. Convolutional neural networks (CNNs) introduce convolutional layers and pooling layers for processing shift-invariant data such as images [3, 4]. Recurrent neural networks (RNNs) utilize recurrent cells to process sequential data or time series data [5, 6]. Transformer is a new type of neural network. It mainly utilizes the self-attention mechanism [7, 8] to extract intrinsic features [9] and shows great potential for extensive use in AI applications.

Figure 1: Key milestones in the development of transformer. The vision transformer models are marked in red.

Transformer was first applied to natural language processing (NLP) tasks, where it achieved significant improvements [9, 10, 11]. For example, Vaswani et al. [9] first proposed the transformer, based on the attention mechanism, for machine translation and English constituency parsing tasks. Devlin et al. [10] introduced a new language representation model called BERT (short for Bidirectional Encoder Representations from Transformers), which pre-trains a transformer on unlabeled text, taking into account the bidirectional context of each word.

When BERT was published, it obtained state-of-the-art performance on 11 NLP tasks. Brown et al. [11] pre-trained a massive transformer-based model called GPT-3 (short for Generative Pre-trained Transformer 3) on 45 TB of compressed plaintext data using 175 billion parameters.

It achieved strong performance on different types of downstream natural language tasks without requiring any fine-tuning. These transformer-based models, with their strong representation capacity, have achieved significant breakthroughs in NLP.

Inspired by the major success of transformer architectures in the field of NLP, researchers have recently applied transformer to computer vision (CV) tasks. In vision applications, CNNs are considered the fundamental component [12, 13], but nowadays transformer is showing it is a potential alternative to CNN. Chen et al. [14] trained a sequence transformer to auto-regressively predict pixels, achieving results comparable to CNNs on image classification tasks. Another vision transformer model is ViT, which applies a pure transformer directly to sequences of image patches to classify the full image.

Recently proposed by Dosovitskiy et al. [15], it has achieved state-of-the-art performance on multiple image recognition benchmarks. In addition to image classification, transformer has been utilized to address a variety of other vision problems, including object detection [16, 17], semantic segmentation [18], image processing [19], and video understanding [20]. Thanks to its exceptional performance, more and more researchers are proposing transformer-based models for improving a wide range of visual tasks.

Due to the rapid increase in the number of transformer-based vision models, keeping pace with the rate of new progress is becoming increasingly difficult. As such, a survey of the existing works is urgently needed and would be beneficial for the community.

In this paper, we focus on providing a comprehensive overview of the recent advances in vision transformers and discuss the potential directions for further improvement.

To facilitate future research on different topics, we categorize the transformer models by their application scenarios, as listed in Table I. The main categories include backbone network, high/mid-level vision, low-level vision, and video processing. High-level vision deals with the interpretation and use of what is seen in the image [21], whereas mid-level vision deals with how this information is organized into what we experience as objects and surfaces [22]. Given the gap between high- and mid-level vision is becoming more obscure in DNN-based vision systems [23, 24], we treat them as a single category here. A few examples of transformer models that address these high/mid-level vision tasks include DETR [16], deformable DETR [17] for object detection, and Max-DeepLab [25] for segmentation. Low-level image processing mainly deals with extracting descriptions from images (such descriptions are usually represented as images themselves) [26]. Typical applications of low-level image processing include super-resolution, image denoising, and style transfer. At present, only a few works [19, 27] in low-level vision use transformers, creating the need for further investigation. Another category is video processing, which is an important part in both computer vision and image-based tasks.

Due to the sequential property of video, transformer is inherently well suited for use on video tasks [20, 28], in which it is beginning to perform on par with conventional CNNs and RNNs. Here, we survey the works associated with transformer-based visual models in order to track the progress in this field. Figure 1 shows the development timeline of vision transformer — undoubtedly, there will be many more milestones in the future.

The rest of the paper is organized as follows. Section 2 discusses the formulation of the standard transformer and the self-attention mechanism.

Section 3 is the main part of the paper, in which we summarize the vision transformer models on backbone, high/mid-level vision, low-level vision, and video tasks. We also briefly describe efficient transformer methods, as they are closely related to our main topic.

In the final section, we give our conclusion and discuss several research directions and challenges. Due to the page limit, we describe the methods of transformer in NLP in the supplemental material, as the research experience may be beneficial for vision tasks.

In the supplemental material, we also review the self-attention mechanism for CV as the supplementary of vision transformer models.

In this survey, we mainly include the representative works (early, pioneering, novel, or inspiring works), since there are many preprint works on arXiv and we cannot include them all within the limited pages.

TABLE I: Representative works of vision transformers.
Category | Sub-category | Method | Highlights | Publication
Backbone | Supervised pretraining | ViT [15] | Image patches, standard transformer | ICLR 2021
Backbone | Supervised pretraining | TNT [29] | Transformer in transformer, local attention | NeurIPS 2021
Backbone | Supervised pretraining | Swin [30] | Shifted window, window-based self-attention | ICCV 2021
Backbone | Self-supervised pretraining | iGPT [14] | Pixel prediction self-supervised learning, GPT model | ICML 2020
Backbone | Self-supervised pretraining | MoCo v3 [31] | Contrastive self-supervised learning, ViT | ICCV 2021
Backbone | Self-supervised pretraining | MAE [32] | Masked image modeling, ViT | CVPR 2022
High/Mid-level vision | Object detection | DETR [16] | Set-based prediction, bipartite matching, transformer | ECCV 2020
High/Mid-level vision | Object detection | Deformable DETR [17] | DETR, deformable attention module | ICLR 2021
High/Mid-level vision | Object detection | UP-DETR [33] | Unsupervised pre-training, random query patch detection | CVPR 2021
High/Mid-level vision | Segmentation | Max-DeepLab [25] | PQ-style bipartite matching, dual-path transformer | CVPR 2021
High/Mid-level vision | Segmentation | VisTR [34] | Instance sequence matching and segmentation | CVPR 2021
High/Mid-level vision | Segmentation | SETR [18] | Sequence-to-sequence prediction, standard transformer | CVPR 2021
High/Mid-level vision | Pose estimation | Hand-Transformer [35] | Non-autoregressive transformer, 3D point set | ECCV 2020
High/Mid-level vision | Pose estimation | HOT-Net [36] | Structured-reference extractor | MM 2020
High/Mid-level vision | Pose estimation | METRO [37] | Progressive dimensionality reduction | CVPR 2021
Low-level vision | Image generation | Image Transformer [27] | Pixel generation using transformer | ICML 2018
Low-level vision | Image generation | Taming transformer [38] | VQ-GAN, auto-regressive transformer | CVPR 2021
Low-level vision | Image generation | TransGAN [39] | GAN using pure transformer architecture | NeurIPS 2021
Low-level vision | Image enhancement | IPT [19] | Multi-task, ImageNet pre-training, transformer model | CVPR 2021
Low-level vision | Image enhancement | TTSR [40] | Texture transformer, RefSR | CVPR 2020
Video processing | Video inpainting | STTN [28] | Spatial-temporal adversarial loss | ECCV 2020
Video processing | Video captioning | Masked Transformer [20] | Masking network, event proposal | CVPR 2018
Multimodality | Classification | CLIP [41] | NLP supervision for images, zero-shot transfer | arXiv 2021
Multimodality | Image generation | DALL-E [42] | Zero-shot text-to-image generation | ICML 2021
Multimodality | Image generation | Cogview [43] | VQ-VAE, Chinese input | NeurIPS 2021
Multimodality | Multi-task | GPT-4 [44] | Large multi-modal model for NLP & CV tasks | arXiv 2023
Efficient transformer | Decomposition | ASH [45] | Number of heads, importance estimation | NeurIPS 2019
Efficient transformer | Distillation | TinyBert [46] | Various losses for different modules | EMNLP Findings 2020
Efficient transformer | Quantization | FullyQT [47] | Fully quantized transformer | EMNLP Findings 2020
Efficient transformer | Architecture design | ConvBert [48] | Local dependence, dynamic convolution | NeurIPS 2020

2 Formulation of Transformer

Transformer [9] was first used in the field of natural language processing (NLP) on machine translation tasks. As shown in Figure 2, it consists of an encoder and a decoder with several transformer blocks of the same architecture. The encoder generates encodings of the inputs, while the decoder takes all the encodings and uses their incorporated contextual information to generate the output sequence.

Each transformer block is composed of a multi-head attention layer, a feed-forward neural network, shortcut connection and layer normalization. In the following, we describe each component of the transformer in detail.

Figure 2: Structure of the original transformer (image from [9]).

2.1 Self-Attention

In the self-attention layer, the input vector is first transformed into three different vectors: the query vector $\mathbf{q}$, the key vector $\mathbf{k}$ and the value vector $\mathbf{v}$ with dimension $d_q=d_k=d_v=d_{model}=512$. Vectors derived from different inputs are then packed together into three different matrices, namely, $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$. Subsequently, the attention function between different input vectors is calculated as follows (and shown in Figure 3 left):

• Step 1: Compute scores between different input vectors with $\mathbf{S}=\mathbf{Q}\cdot\mathbf{K}^{\top}$;
• Step 2: Normalize the scores for the stability of gradient with $\mathbf{S}_n=\mathbf{S}/\sqrt{d_k}$;
• Step 3: Translate the scores into probabilities with the softmax function $\mathbf{P}=\mathrm{softmax}(\mathbf{S}_n)$;
• Step 4: Obtain the weighted value matrix with $\mathbf{Z}=\mathbf{V}\cdot\mathbf{P}$.

The process can be unified into a single function:

$$\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\left(\frac{\mathbf{Q}\cdot\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\cdot\mathbf{V}. \qquad (1)$$

The logic behind Eq. 1 is simple. Step 1 computes scores between each pair of different vectors, and these scores determine the degree of attention that we give other words when encoding the word at the current position.

Step 2 normalizes the scores to enhance gradient stability for improved training, and step 3 translates the scores into probabilities. Finally, each value vector is multiplied by the sum of the probabilities.

Vectors with larger probabilities receive additional focus from the following layers.
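To make the above computation concrete, the following sketch implements Eq. 1 step by step. It is a minimal illustration written for this survey; the function name, the use of PyTorch, and the tensor shapes are our own assumptions rather than code from any of the cited works:

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: tensors of shape (n, d_k); a batch dimension can be prepended in practice.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1)        # Step 1: pairwise scores S = Q K^T, shape (n, n)
    scores = scores / d_k ** 0.5            # Step 2: scale by sqrt(d_k) for gradient stability
    probs = torch.softmax(scores, dim=-1)   # Step 3: turn scores into probabilities
    return probs @ V                        # Step 4: weighted combination of the value vectors
```

For example, calling the function with Q = K = V = torch.randn(10, 512) returns a (10, 512) output in which each row is a probability-weighted mixture of the value vectors.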

The encoder-decoder attention layer in the decoder module is similar to the self-attention layer in the encoder module with the following exceptions: the key matrix $K$ and value matrix $V$ are derived from the encoder module, and the query matrix $Q$ is derived from the previous layer.

Note that the preceding process is invariant to the position of each word, meaning that the self-attention layer lacks the ability to capture the positional information of words in a sentence.

However, the sequential nature of sentences in a language requires us to incorporate the positional information within our encoding. To address this issue and allow the final input vector of the word to be obtained, a positional encoding with dimension $d_{model}$ is added to the original input embedding. Specifically, the position is encoded with the following equations:

$$PE(pos,2i)=\sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right); \qquad (2)$$
$$PE(pos,2i+1)=\cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right), \qquad (3)$$

in which $pos$ denotes the position of the word in a sentence, and $i$ represents the current dimension of the positional encoding. In this way, each element of the positional encoding corresponds to a sinusoid, and it allows the transformer model to learn to attend by relative positions and extrapolate to longer sequence lengths during inference.

Apart from the fixed positional encoding in the vanilla transformer, learned positional encoding [49] and relative positional encoding [50] are also utilized in various models [10, 15].
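As a concrete illustration of Eq. 2 and Eq. 3, the sketch below builds the fixed sinusoidal positional encoding table that is added element-wise to the input embeddings; the function name and arguments are illustrative assumptions made for this survey, not code from [9]:

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model=512):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # even dimension indices 0, 2, 4, ...
    angles = pos / torch.pow(10000.0, two_i / d_model)              # (seq_len, d_model / 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                                 # even dimensions use sine
    pe[:, 1::2] = torch.cos(angles)                                 # odd dimensions use cosine
    return pe                                                       # added to the input embeddings
```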

Figure 3: (Left) Self-attention process. (Right) Multi-head attention. The image is from [9].

Multi-Head Attention. Multi-head attention is a mechanism that can be used to boost the performance of the vanilla self-attention layer. Note that for a given reference word, we often want to focus on several other words when going through the sentence.

A single-head self-attention layer limits our ability to focus on one or more specific positions without influencing the attention on other equally important positions at the same time. Multi-head attention addresses this by giving the attention layers different representation subspaces.

Specifically, different query, key and value matrices are used for different heads, and these matrices can project the input vectors into different representation subspaces after training due to random initialization.

To elaborate on this in greater detail, given an input vector and the number of heads $h$, the input vector is first transformed into three different groups of vectors: the query group, the key group and the value group. In each group, there are $h$ vectors with dimension $d_{q'}=d_{k'}=d_{v'}=d_{model}/h=64$. The vectors derived from different inputs are then packed together into three different groups of matrices: $\{\mathbf{Q}_i\}_{i=1}^{h}$, $\{\mathbf{K}_i\}_{i=1}^{h}$ and $\{\mathbf{V}_i\}_{i=1}^{h}$. The multi-head attention process is shown as follows:

$$\mathrm{MultiHead}(\mathbf{Q}^{\prime},\mathbf{K}^{\prime},\mathbf{V}^{\prime})=\mathrm{Concat}(\mathrm{head}_1,\cdots,\mathrm{head}_h)\mathbf{W}^{o}, \quad \text{where } \mathrm{head}_i=\mathrm{Attention}(\mathbf{Q}_i,\mathbf{K}_i,\mathbf{V}_i). \qquad (4)$$

Here, $\mathbf{Q}^{\prime}$ (and similarly $\mathbf{K}^{\prime}$ and $\mathbf{V}^{\prime}$) is the concatenation of $\{\mathbf{Q}_i\}_{i=1}^{h}$, and $\mathbf{W}^{o}\in\mathbb{R}^{d_{model}\times d_{model}}$ is the projection weight.
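The multi-head mechanism of Eq. 4 can be sketched as below, reusing the scaled_dot_product_attention function from the earlier sketch. The class name, the fused per-head projections and the absence of a batch dimension are simplifying assumptions made for illustration only:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_head = h, d_model // h      # e.g., 512 / 8 = 64 dimensions per head
        self.w_q = nn.Linear(d_model, d_model)     # the h query projections fused into one matrix
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)     # output projection W^o

    def forward(self, x):                          # x: (n, d_model)
        n = x.size(0)
        # project the inputs and split them into h heads of dimension d_head
        q = self.w_q(x).view(n, self.h, self.d_head).transpose(0, 1)    # (h, n, d_head)
        k = self.w_k(x).view(n, self.h, self.d_head).transpose(0, 1)
        v = self.w_v(x).view(n, self.h, self.d_head).transpose(0, 1)
        heads = scaled_dot_product_attention(q, k, v)                    # attention applied per head
        concat = heads.transpose(0, 1).reshape(n, self.h * self.d_head)  # Concat(head_1, ..., head_h)
        return self.w_o(concat)                                          # multiply by W^o
```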

2.2 Other Key Concepts in Transformer

Feed-Forward Network. A feed-forward network (FFN) is applied after the self-attention layers in each encoder and decoder. It consists of two linear transformation layers and a nonlinear activation function within them, and can be denoted as the following function:

$$\mathrm{FFN}(\mathbf{X})=\mathbf{W}_2\sigma(\mathbf{W}_1\mathbf{X}), \qquad (5)$$

where $\mathbf{W}_1$ and $\mathbf{W}_2$ are the two parameter matrices of the two linear transformation layers, and $\sigma$ represents the nonlinear activation function, such as GELU [51]. The dimensionality of the hidden layer is $d_h=2048$.

Residual Connection in the Encoder and Decoder. As shown in Figure 2, a residual connection is added to each sub-layer in the encoder and decoder. This strengthens the flow of information in order to achieve higher performance. A layer normalization [52] follows the residual connection. The output of these operations can be described as:

$$\mathrm{LayerNorm}(\mathbf{X}+\mathrm{Attention}(\mathbf{X})). \qquad (6)$$

Here, $\mathbf{X}$ is used as the input of the self-attention layer, and the query, key and value matrices $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ are all derived from the same input matrix $\mathbf{X}$. A variant, pre-layer normalization (Pre-LN), is also widely used [53, 54, 15]. Pre-LN inserts the layer normalization inside the residual connection and before multi-head attention or FFN. For the normalization layer, there are several alternatives such as batch normalization [55]. Batch normalization usually performs worse when applied to transformer, as the feature values change acutely [56]. Some other normalization algorithms [57, 56, 58] have been proposed to improve the training of transformer.
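Putting the attention, FFN, residual connections and layer normalization together, a single post-LN encoder block corresponding to Eq. 6 can be sketched as follows (the Pre-LN variant would instead normalize the input of each sub-layer). The class reuses the MultiHeadAttention sketch above and is an illustrative assumption, not a reference implementation:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, h=8, d_hidden=2048):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, h)
        self.ffn = nn.Sequential(                  # FFN(X) = W2 * sigma(W1 * X), Eq. 5
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                          # x: (n, d_model)
        x = self.norm1(x + self.attn(x))           # LayerNorm(X + Attention(X)), Eq. 6
        x = self.norm2(x + self.ffn(x))            # residual connection around the FFN
        return x
```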

Final Layer in the Decoder. The final layer in the decoder is used to turn the stack of vectors back into a word. This is achieved by a linear layer followed by a softmax layer. The linear layer projects the vector into a logits vector with $d_{word}$ dimensions, in which $d_{word}$ is the number of words in the vocabulary. The softmax layer is then used to transform the logits vector into probabilities.

When used for CV tasks, most transformers adopt the original transformer’s encoder module. Such transformers can be treated as a new type of feature extractor.

Compared with CNNs which focus only on local characteristics, transformer can capture long-distance characteristics, meaning that it can easily derive global information.

In contrast to RNNs, whose hidden state must be computed sequentially, transformer is more efficient because the output of the self-attention layer and the fully connected layers can be computed in parallel and easily accelerated.

From this, we can conclude that further study into using transformer in computer vision as well as NLP would yield beneficial results.

3 Vision Transformer

In this section, we review the applications of transformer-based models in computer vision, including image classification, high/mid-level vision, low-level vision and video processing.

We also briefly summarize the applications of the self-attention mechanism and model compression methods for efficient transformer.

Figure 4: A taxonomy of backbone using convolution and attention.

3.1 Backbone for Representation Learning

Inspired by the success that transformer has achieved in the field of NLP, some researchers have explored whether similar models can learn useful representations for images.

Given that images involve more dimensions, noise and redundant modality compared to text, they are believed to be more difficult for generative modeling.

Besides CNNs, the transformer can be used as a backbone network for image classification. Wu et al. [59] adopted ResNet as a convenient baseline and used vision transformers to replace the last stage of convolutions. Specifically, they apply convolutional layers to extract low-level features that are then fed into the vision transformer. For the vision transformer, they use a tokenizer to group pixels into a small number of visual tokens, each representing a semantic concept in the image. These visual tokens are used directly for image classification, with the transformers being used to model the relationships between tokens. As shown in Figure 4, the works can be divided into purely using transformer for vision and combining CNN and transformer. We summarize the results of these models in Table II and Figure 6 to demonstrate the development of the backbones. In addition to supervised learning, self-supervised learning is also explored in vision transformer.

3.1.1 Pure Transformer

ViT. Vision Transformer (ViT) [15] is a pure transformer directly applied to sequences of image patches for the image classification task. It follows the transformer's original design as much as possible. Figure 5 shows the framework of ViT.

To handle 2D images, the image $X\in\mathbb{R}^{h\times w\times c}$ is reshaped into a sequence of flattened 2D patches $X_p\in\mathbb{R}^{n\times(p^2\cdot c)}$, where $c$ is the number of channels, $(h,w)$ is the resolution of the original image, and $(p,p)$ is the resolution of each image patch. The effective sequence length for the transformer is therefore $n=hw/p^2$. Because the transformer uses constant widths in all of its layers, a trainable linear projection maps each vectorized patch to the model dimension $d$, the output of which is referred to as patch embeddings.

Similar to BERT's $[class]$ token, a learnable embedding is applied to the sequence of embedding patches. The state of this embedding serves as the image representation. During both the pre-training and fine-tuning stage, the classification heads are attached to the same size.

In addition, 1D position embeddings are added to the patch embeddings in order to retain positional information. It is worth noting that ViT utilizes only the standard transformer’s encoder (except for the place for the layer normalization), whose output precedes an MLP head.
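The input pipeline described above can be sketched as follows: the image is split into $p\times p$ patches, each flattened patch is mapped to dimension $d$ by a trainable linear projection, a learnable $[class]$ token is prepended, and 1D position embeddings are added. This is a simplified illustration written for this survey with assumed default sizes, not the official ViT code:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, h=224, w=224, c=3, p=16, d=768):
        super().__init__()
        self.p = p
        n = (h // p) * (w // p)                                   # n = hw / p^2 patches
        self.proj = nn.Linear(p * p * c, d)                       # trainable linear projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d))       # learnable [class] embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, n + 1, d))   # 1D position embeddings

    def forward(self, x):                                         # x: (b, c, h, w)
        b, c, _, _ = x.shape
        p = self.p
        # reshape the image into a sequence of flattened p*p*c patches
        patches = x.unfold(2, p, p).unfold(3, p, p)               # (b, c, h/p, w/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.proj(patches)                               # (b, n, d) patch embeddings
        cls = self.cls_token.expand(b, -1, -1)                    # prepend the [class] token
        return torch.cat([cls, tokens], dim=1) + self.pos_embed   # add position embeddings
```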

In most cases, ViT is pre-trained on large datasets, and then fine-tuned for downstream tasks with smaller data.

ViT yields modest results when trained on mid-sized datasets such as ImageNet, achieving accuracies of a few percentage points below ResNets of comparable size.

Because transformers lack some inductive biases inherent to CNNs–such as translation equivariance and locality–they do not generalize well when trained on insufficient amounts of data.

However, the authors found that training the models on large datasets (14 million to 300 million images) surpassed inductive bias. When pre-trained at sufficient scale, transformers achieve excellent results on tasks with fewer datapoints.

For example, when pre-trained on the JFT-300M dataset, ViT approached or even exceeded state of the art performance on multiple image recognition benchmarks. Specifically, it reached an accuracy of 88.36% on ImageNet, and 77.16% on the VTAB suite of 19 tasks.

Touvron et al. [60] proposed a competitive convolution-free transformer, called Data-efficient image transformer (DeiT), by training on only the ImageNet database. DeiT-B, the reference vision transformer, has the same architecture as ViT-B and employs 86 million parameters.

With a strong data augmentation, DeiT-B achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. In addition, the authors observe that using a CNN teacher gives better performance than using a transformer.

Specifically, DeiT-B can achieve a top-1 accuracy of 84.40% with the help of a token-based distillation.

Figure 5: The framework of ViT (image from [15]).

Variants of ViT. Following the paradigm of ViT, a series of variants of ViT have been proposed to improve the performance on vision tasks. The main approaches include enhancing locality, self-attention improvement and architecture design.

The original vision transformer is good at capturing long-range dependencies between patches, but disregards local feature extraction, as each 2D patch is projected to a vector with a simple linear layer.

Recently, the researchers begin to pay attention to improve the modeling capacity for local information [29, 61, 62]. TNT [29] further divides the patch into a number of sub-patches and introduces a novel transformer-in-transformer architecture which utilizes an inner transformer block to model the relationship between sub-patches and an outer transformer block for patch-level information exchange.

Twins [63] and CAT [64] alternately perform local and global attention layer-by-layer. Swin Transformers [61, 65] perform local attention within a window and introduce a shifted window partitioning approach for cross-window connections. Shuffle Transformer [66, 67] further utilizes the spatial shuffle operation instead of shifted window partitioning to allow cross-window connections. RegionViT [62] generates regional tokens and local tokens from an image, and local tokens receive global information via attention with regional tokens. In addition to the local attention, some other works propose to boost local information through local feature aggregation, e.g., T2T [68]. These works demonstrate the benefit of the local information exchange and global information exchange in vision transformer.

As a key component of transformer, the self-attention layer provides the ability for global interaction between image patches. Improving the calculation of the self-attention layer has attracted many researchers. DeepViT [69] proposes to establish cross-head communication to re-generate the attention maps to increase the diversity at different layers. KVT [70] introduces the $k$-NN attention to utilize the locality of image patches and ignore noisy tokens by only computing attentions with the top-$k$ similar tokens. Refiner [71] explores attention expansion in higher-dimension space and applies convolution to augment local patterns of the attention maps. XCiT [72] performs self-attention calculation across feature channels rather than tokens, which allows efficient processing of high-resolution images. The computation complexity and attention precision of the self-attention mechanism are two key points for future optimization.

The network architecture is an important factor, as demonstrated in the field of CNNs. The original architecture of ViT is a simple stack of same-shape transformer blocks. New architecture design for vision transformer has been an interesting topic.

The pyramid-like architecture is utilized by many vision transformer models [73, 61, 74, 75, 76, 77] including PVT [73], HVT [78], Swin Transformer [61] and PiT [79]. There are also other types of architectures, such as the two-stream architecture [80] and the U-net architecture [81, 30]. Neural architecture search (NAS) has also been investigated to search for better transformer architectures, e.g., Scaling-ViT [82], ViTAS [83], AutoFormer [84] and GLiT [85]. Currently, both network design and NAS for vision transformer mainly draw on the experience of CNN. In the future, we expect specific and novel architectures to appear in the field of vision transformer.

In addition to the aforementioned approaches, there are some other directions to further improve vision transformer, e.g., positional encoding [86, 87], normalization strategy [88], shortcut connection [89] and removing attention [90, 91, 92, 93].

TABLE II: ImageNet result comparison of representative CNN and vision transformer models. Pure transformer means only using a few convolutions in the stem stage. CNN + Transformer means using convolutions in the intermediate layers. Following [60, 61], the throughput is measured on an NVIDIA V100 GPU with PyTorch and a 224×224 input size.
Model | Params (M) | FLOPs (B) | Throughput (image/s) | Top-1 (%)
CNN
ResNet-50 [12, 68] | 25.6 | 4.1 | 1226 | 79.1
ResNet-101 [12, 68] | 44.7 | 7.9 | 753 | 79.9
ResNet-152 [12, 68] | 60.2 | 11.5 | 526 | 80.8
EfficientNet-B0 [94] | 5.3 | 0.39 | 2694 | 77.1
EfficientNet-B1 [94] | 7.8 | 0.70 | 1662 | 79.1
EfficientNet-B2 [94] | 9.2 | 1.0 | 1255 | 80.1
EfficientNet-B3 [94] | 12 | 1.8 | 732 | 81.6
EfficientNet-B4 [94] | 19 | 4.2 | 349 | 82.9
Pure Transformer
DeiT-Ti [15, 60] | 5 | 1.3 | 2536 | 72.2
DeiT-S [15, 60] | 22 | 4.6 | 940 | 79.8
DeiT-B [15, 60] | 86 | 17.6 | 292 | 81.8
T2T-ViT-14 [68] | 21.5 | 5.2 | 764 | 81.5
T2T-ViT-19 [68] | 39.2 | 8.9 | 464 | 81.9
T2T-ViT-24 [68] | 64.1 | 14.1 | 312 | 82.3
PVT-Small [73] | 24.5 | 3.8 | 820 | 79.8
PVT-Medium [73] | 44.2 | 6.7 | 526 | 81.2
PVT-Large [73] | 61.4 | 9.8 | 367 | 81.7
TNT-S [29] | 23.8 | 5.2 | 428 | 81.5
TNT-B [29] | 65.6 | 14.1 | 246 | 82.9
CPVT-S [86] | 23 | 4.6 | 930 | 80.5
CPVT-B [86] | 88 | 17.6 | 285 | 82.3
Swin-T [61] | 29 | 4.5 | 755 | 81.3
Swin-S [61] | 50 | 8.7 | 437 | 83.0
Swin-B [61] | 88 | 15.4 | 278 | 83.3
CNN + Transformer
Twins-SVT-S [63] | 24 | 2.9 | 1059 | 81.7
Twins-SVT-B [63] | 56 | 8.6 | 469 | 83.2
Twins-SVT-L [63] | 99.2 | 15.1 | 288 | 83.7
Shuffle-T [66] | 29 | 4.6 | 791 | 82.5
Shuffle-S [66] | 50 | 8.9 | 450 | 83.5
Shuffle-B [66] | 88 | 15.6 | 279 | 84.0
CMT-S [95] | 25.1 | 4.0 | 563 | 83.5
CMT-B [95] | 45.7 | 9.3 | 285 | 84.5
VOLO-D1 [96] | 27 | 6.8 | 481 | 84.2
VOLO-D2 [96] | 59 | 14.1 | 244 | 85.2
VOLO-D3 [96] | 86 | 20.6 | 168 | 85.4
VOLO-D4 [96] | 193 | 43.8 | 100 | 85.7
VOLO-D5 [96] | 296 | 69.0 | 64 | 86.1

3.1.2 Transformer with Convolution

Although vision transformers have been successfully applied to various visual tasks due to their ability to capture long-range dependencies within the input, there are still gaps in performance between transformers and existing CNNs.

One main reason can be the lack of ability to extract local information.

Apart from the above-mentioned variants of ViT that enhance locality, combining the transformer with convolution can be a more straightforward way to introduce locality into the conventional transformer.

There are plenty of works trying to augment a conventional transformer block or self-attention layer with convolution. For example, CPVT [86] proposed a conditional positional encoding (CPE) scheme, which is conditioned on the local neighborhood of input tokens and adaptable to arbitrary input sizes, to leverage convolutions for fine-level feature encoding. CvT [97], CeiT [98], LocalViT [99] and CMT [95] analyzed the potential drawbacks when directly borrowing Transformer architectures from NLP and combined the convolutions with transformers together.

Specifically, the feed-forward network (FFN) in each transformer block is combined with a convolutional layer that promotes the correlation among neighboring tokens. LeViT [100] revisited principles from extensive literature on CNNs and applied them to transformers, proposing a hybrid neural network for fast inference image classification. BoTNet [101] replaced the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet, and improved upon the baselines significantly on both instance segmentation and object detection tasks with minimal overhead in latency.

Besides, some researchers have demonstrated that transformer-based models can have more difficulty fitting the data well [15, 102, 103]; in other words, they are sensitive to the choice of optimizer, hyper-parameters, and the training schedule. Visformer [102] revealed the gap between transformers and CNNs with two different training settings. The first one is the standard setting for CNNs, i.e., the training schedule is shorter and the data augmentation only contains random cropping and horizontal flipping. The other one is the training setting used in [60], i.e., the training schedule is longer and the data augmentation is stronger. [103] changed the early visual processing of ViT by replacing its embedding stem with a standard convolutional stem, and found that this change allows ViT to converge faster and enables the use of either AdamW or SGD without a significant drop in accuracy.

In addition to these two works, [100, 95] also choose to add a convolutional stem on top of the transformer.

Figure 6: FLOPs and throughput comparison of representative CNN and vision transformer models. (a) Accuracy vs. FLOPs. (b) Accuracy vs. throughput.

3.1.3 Self-supervised Representation Learning

Generative Based Approach. Generative pre-training methods for images have existed for a long time [104, 105, 106, 107]. Chen et al. [14] re-examined this class of methods and combined it with self-supervised methods. After that, several works [108, 109] were proposed to extend generative based self-supervised learning for vision transformer.

We briefly introduce iGPT [14] to demonstrate its mechanism. This approach consists of a pre-training stage followed by a fine-tuning stage. During the pre-training stage, auto-regressive and BERT objectives are explored.

To implement pixel prediction, a sequence transformer architecture is adopted instead of language tokens (as used in NLP). Pre-training can be thought of as a favorable initialization or regularizer when used in combination with early stopping.

During the fine-tuning stage, they add a small classification head to the model. This helps optimize a classification objective and adapts all weights.

The image pixels are transformed into sequential data by $k$-means clustering. Given an unlabeled dataset $X$ consisting of high dimensional data $\mathbf{x}=(x_1,\cdots,x_n)$, they train the model by minimizing the negative log-likelihood of the data:

$$L_{AR}=\underset{\mathbf{x}\sim X}{\mathbb{E}}[-\log p(\mathbf{x})], \qquad (7)$$

where $p(\mathbf{x})$ is the probability density of the data of images, which can be modeled as:

$$p(\mathbf{x})=\prod_{i=1}^{n}p(x_{\pi_i}|x_{\pi_1},\cdots,x_{\pi_{i-1}},\theta). \qquad (8)$$

Here, the identity permutation $\pi_i=i$ is adopted for $1\leqslant i\leqslant n$, which is also known as raster order. Chen et al. also considered the BERT objective, which samples a sub-sequence $M\subset[1,n]$ such that each index $i$ independently has probability 0.15 of appearing in $M$. $M$ is called the BERT mask, and the model is trained by minimizing the negative log-likelihood of the “masked” elements $x_M$ conditioned on the “unmasked” ones $x_{[1,n]\backslash M}$:

$$L_{BERT}=\underset{\mathbf{x}\sim X}{\mathbb{E}}\,\underset{M}{\mathbb{E}}\sum_{i\in M}[-\log p(x_i|x_{[1,n]\backslash M})]. \qquad (9)$$

During the pre-training stage, they pick either $L_{AR}$ or $L_{BERT}$ and minimize the loss over the pre-training dataset.

The GPT-2 [110] formulation of the transformer decoder block is used. To ensure proper conditioning when training the AR objective, Chen et al. apply the standard upper triangular mask to the $n\times n$ matrix of attention logits. No attention logit masking is required when the BERT objective is used: Chen et al. zero out the positions after the content embeddings are applied to the input sequence. Following the final transformer layer, they apply a layer norm and learn a projection from the output to logits parameterizing the conditional distributions at each sequence element.

When training BERT, they simply ignore the logits at unmasked positions.
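For illustration, the upper triangular masking of the $n\times n$ attention logits mentioned above can be written as in the following sketch; the function name and tensor layout are assumptions made here, not iGPT's actual code:

```python
import torch

def causal_attention_probs(scores):
    # scores: (n, n) attention logits; position i may only attend to positions <= i.
    n = scores.size(-1)
    mask = torch.triu(torch.ones(n, n), diagonal=1).bool()    # strictly upper triangle
    scores = scores.masked_fill(mask, float("-inf"))          # block attention to future positions
    return torch.softmax(scores, dim=-1)
```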

During the fine-tuning stage, they average pool the output of the final layer normalization layer across the sequence dimension to extract a $d$-dimensional vector of features per example. They learn a projection from the pooled feature to class logits and use this projection to minimize a cross entropy loss.

Practical applications offer empirical evidence that the joint objective of cross entropy loss and pretraining loss ($L_{AR}$ or $L_{BERT}$) works even better. After iGPT, masked image modeling methods such as MAE [32] and SimMIM [111] were proposed, which achieve competitive performance on downstream tasks.

iGPT and ViT are two pioneering works to apply transformer for visual tasks.

The difference between iGPT and ViT-like models mainly lies in three aspects: 1) The input of iGPT is a sequence of color palettes obtained by clustering pixels, while ViT uniformly divides the image into a number of local patches; 2) The architecture of iGPT is an encoder-decoder framework, while ViT only has a transformer encoder; 3) iGPT utilizes an auto-regressive self-supervised loss for training, while ViT is trained by a supervised image classification task.

Contrastive Learning Based Approach. Currently, contrastive learning is the most popular manner of self-supervised learning for computer vision. Contrastive learning has been applied to vision transformer for unsupervised pretraining [31, 112, 113].

Chen et al. [31] investigate the effects of several fundamental components for training self-supervised ViT. The authors observe that instability is a major issue that degrades accuracy; these degraded results are indeed partial failures, and they can be improved when training is made more stable.

They introduce a “MoCo v3” framework, which is an incremental improvement of MoCo [114]. Specifically, the authors take two crops of each image under random data augmentation. They are encoded by two encoders, $f_q$ and $f_k$, with output vectors $\mathbf{q}$ and $\mathbf{k}$. Intuitively, $\mathbf{q}$ behaves like a “query” and the goal of learning is to retrieve the corresponding “key”. This is formulated as minimizing a contrastive loss function, which can be written as:

$$\mathcal{L}_q=-\log\frac{\exp(\mathbf{q}\cdot\mathbf{k}^{+}/\tau)}{\exp(\mathbf{q}\cdot\mathbf{k}^{+}/\tau)+\sum_{\mathbf{k}^{-}}\exp(\mathbf{q}\cdot\mathbf{k}^{-}/\tau)}. \qquad (10)$$

Here $\mathbf{k}^{+}$ is $f_k$'s output on the same image as $\mathbf{q}$, known as $\mathbf{q}$'s positive sample. The set $\mathbf{k}^{-}$ consists of $f_k$'s outputs from other images, known as $\mathbf{q}$'s negative samples. $\tau$ is a temperature hyper-parameter for the $l_2$-normalized $\mathbf{q}$, $\mathbf{k}$. MoCo v3 uses the keys that naturally co-exist in the same batch and abandons the memory queue, which they find has diminishing gain if the batch is sufficiently large (e.g., 4096). With this simplification, the contrastive loss can be implemented in a simple way. The encoder $f_q$ consists of a backbone (e.g., ViT), a projection head and an extra prediction head; while the encoder $f_k$ has the backbone and projection head, but not the prediction head. $f_k$ is updated by the moving-average of $f_q$, excluding the prediction head.
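The batch-wise contrastive loss of Eq. 10, where the negatives are simply the keys of the other images in the same batch, can be sketched as below. This is a simplified illustration (it assumes $\mathbf{q}$ and $\mathbf{k}$ are already $l_2$-normalized), not the official MoCo v3 code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, k, tau=0.2):
    # q: (b, d) l2-normalized queries from f_q; k: (b, d) l2-normalized keys from f_k.
    # The positive for q[i] is k[i]; the negatives are the keys of the other images in the batch.
    logits = q @ k.t() / tau                          # (b, b) similarity matrix scaled by temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)            # per-row softmax cross entropy, i.e., Eq. 10
```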

MoCo v3 shows that the instability is a major issue of training the self-supervised ViT, thus they describe a simple trick that can improve the stability in various cases of the experiments. They observe that it is not necessary to train the patch projection layer.

For the standard ViT patch size, the patch projection matrix is complete or over-complete. In this case, random projection should be sufficient to preserve the information of the original patches. However, the trick alleviates the issue but does not solve it.

The model can still be unstable if the learning rate is too large, and the first layer is unlikely to be the essential reason for the instability.

3.1.4 Discussions

All of the components of vision transformer including multi-head self-attention, multi-layer perceptron, shortcut connection, layer normalization, positional encoding and network topology, play key roles in visual recognition.

As stated above, a number of works have been proposed to improve the effectiveness and efficiency of vision transformer. From the results in Figure 6, we can see that combining CNN and transformer achieves better performance, indicating that they complement each other through local connections and global connections. Further investigation into backbone networks can lead to improvements for the whole vision community.

As for the self-supervised representation learning for vision transformer, we still need to make efforts to pursue the kind of success that large-scale pretraining has achieved in the field of NLP.

3.2 High/Mid-level Vision

Recently there has been growing interest in using transformer for high/mid-level computer vision tasks, such as object detection [16, 17, 115, 116, 117], lane detection [118], segmentation [34, 25, 18] and pose estimation [35, 36, 37, 119]. We review these methods in this section.

3.2.1 Generic Object Detection

Traditional object detectors are mainly built upon CNNs, but transformer-based object detection has gained significant interest recently due to its advantageous capability.