
A Study on Self-Supervised Pretraining for Vision Problems in Gastrointestinal Endoscopy

EDWARD SANDERSON AND BOGDAN J. MATUSZEWSKI, (Member, IEEE)
Computer Vision and Machine Learning (CVML) Group, University of Central Lancashire, PR1 2HE Preston, U.K.
Corresponding author: Edward Sanderson (esanderson4@uclan.ac.uk)

This work was supported by the Science and Technology Facilities Council [grant number ST/S005404/1].

Abstract

Solutions to vision tasks in gastrointestinal endoscopy (GIE) conventionally use image encoders pretrained in a supervised manner with ImageNet-1k as backbones. However, the use of modern self-supervised pretraining algorithms and a recent dataset of 100k unlabelled GIE images (Hyperkvasir-unlabelled) may allow for improvements. In this work, we study the fine-tuned performance of models with ResNet50 and ViT-B backbones pretrained in self-supervised and supervised manners with ImageNet-1k and Hyperkvasir-unlabelled (self-supervised only) in a range of GIE vision tasks. In addition to identifying the most suitable pretraining pipeline and backbone architecture for each task, out of those considered, our results suggest three general principles. Firstly, that self-supervised pretraining generally produces more suitable backbones for GIE vision tasks than supervised pretraining. Secondly, that self-supervised pretraining with ImageNet-1k is typically more suitable than pretraining with Hyperkvasir-unlabelled, with the notable exception of monocular depth estimation in colonoscopy. Thirdly, that ViT-Bs are more suitable in polyp segmentation and monocular depth estimation in colonoscopy, ResNet50s are more suitable in polyp detection, and both architectures perform similarly in anatomical landmark recognition and pathological finding characterisation. We hope this work draws attention to the complexity of pretraining for GIE vision tasks, informs the development of more suitable approaches than the convention, and inspires further research on this topic to help advance this development. Code available: github.com/ESandML/SSL4GIE.

INDEX TERMS Gastrointestinal endoscopy, computer vision, self-supervised pretraining, anatomical landmark recognition, pathological finding characterisation, polyp detection, polyp segmentation, monocular depth estimation.

I. INTRODUCTION

Gastrointestinal endoscopy (GIE) is a procedure for screening and treating various digestive disorders that involves the insertion of a thin, flexible tube with a camera and light at the end, known as an endoscope, into either the mouth (gastroscopy) or anus (colonoscopy or sigmoidoscopy) of the patient. The endoscope is then traversed through the gastrointestinal tract as it transmits images of the inner lining to a monitor, where the endoscopist can inspect them for abnormalities and perform any necessary interventions. However, this poses several challenges for the endoscopist, such as the high volume and complexity of visual information, the variability and subtlety of the lesions, and the need for real-time decision making [1].
To help overcome these challenges, computer vision has been identified as offering a promising set of tools for assisting endoscopists with various aspects of data analysis. Such aspects may be framed as traditional computer vision tasks such as image classification, object detection, semantic segmentation, and monocular depth estimation, among others, where the current state-of-the-art solutions for these tasks use deep learning models trained on large amounts of data.
While large datasets suitable for training models to perform image classification with everyday images exist; most notably the publicly available ImageNet-1k [2], but also the privately held JFT-300M [3], [4] and JFT-3B [5]; the datasets available for other computer vision tasks and distributions of images, particularly GIE images [6], are notably smaller. It has become clear that the amount of data a model is trained on has a strong influence on its performance [7], and efforts have therefore been taken to identify ways in which the largest available datasets can be leveraged in the training of models for tasks which these large datasets do not include suitable annotations for, and which may involve images of a dissimilar distribution. A now well-established approach [8] is to train (pretrain) an image classifier from random initialisation with the ImageNet-1k dataset (1.2M everyday images), remove the classification layer and add any decoder components required for the intended (downstream) task to the then pretrained image encoder, and train (fine-tune) the resulting model with a dataset which does include suitable annotations for the downstream task. Encoders used in this manner are often referred to as backbones.
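As a concrete illustration of this convention, the following minimal PyTorch sketch (variable names and the six-class head are illustrative, not taken from any specific codebase) loads an ImageNet-1k classifier, strips its classification layer to obtain a backbone, and attaches a downstream decoder before fine-tuning:

```python
import torch.nn as nn
from torchvision import models

# Pretraining stage: here we simply load a ResNet50 already trained for
# ImageNet-1k classification rather than training it from random initialisation.
classifier = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Remove the classification layer; what remains is the image encoder (backbone).
backbone = nn.Sequential(*list(classifier.children())[:-1])  # keeps conv stages + global pooling

# Add the decoder components required by the downstream task, e.g. a linear
# head for a hypothetical 6-class GIE classification problem.
model = nn.Sequential(backbone, nn.Flatten(), nn.Linear(2048, 6))

# Fine-tuning stage: train `model` end-to-end on the annotated downstream
# dataset (training loop omitted).
```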
The approach of pretraining backbones on image classification with ImageNet-1k may however be limiting for two main reasons. Firstly, the model will learn to make high-level abstractions during pretraining, and since this pretraining is task-specific, these abstractions may not generalise well and may need to be unlearned during fine-tuning. For example, the ground truth class of many images in ImageNet-1k refers to objects in the foreground and training a model to classify images on this basis may lead to the model learning to pay less attention to the background, which could contain information that is useful for the downstream task. Secondly, image classification datasets require annotations which can be expensive to produce, limiting the degree to which we can leverage more data in pretraining. This is particularly true of GIE images [6], which are especially expensive to annotate, and the use of which in pretraining may be beneficial when the downstream task involves such images.
With the aim of addressing these limitations, a significant amount of research into self-supervised pretraining has been undertaken in recent years, leading to a range of popular algorithms [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. Self-supervised pretraining algorithms set taskagnostic objectives that require models to predict targets extracted from the input data, which can allow for the learning of generalisable high-level feature recognition. Additionally, since this paradigm of learning does not require annotations, it provides the potential for leveraging a much larger amount of data and/or data of a more similar distribution to that involved in the downstream task.
A significant amount of research into self-supervised pretraining with everyday images [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], as well as several modalities of medical images [20], [21], [22], [23], [24], [25], [26], [27], has now been undertaken. However, it is still the convention in GIE to employ backbones that have been pretrained in a supervised manner with ImageNet-1k. A set of 99,417 unlabelled GIE images (Hyperkvasir-unlabelled) was however included in the recently released Hyperkvasir dataset [28] which, while much smaller than ImageNet-1k, is significantly larger than other datasets of GIE images. This data should allow for the self-supervised pretraining of GIE-specific backbones, which may be better suited to some tasks in GIE than the described convention. Additionally, self-supervised pretraining with datasets of everyday images, e.g. ImageNet-1k, may also provide opportunities for improvements.

B. CONTRIBUTIONS

This paper presents a study on pretraining encoders for use as backbones in solutions to vision tasks in GIE. We consider twelve encoders, each of a ResNet50 [29] or ViT-B [30] architecture and pretrained with one of six pipelines, including two self-supervised pretraining algorithms per architecture, each used separately with both ImageNet-1k and Hyperkvasir-unlabelled, as well as baselines of supervised pretraining with ImageNet-1k and random initialisation (not pretrained). We use state-of-the-art methods for adapting and fine-tuning each encoder for a range of vision tasks in GIE, namely: anatomical landmark recognition, pathological finding characterisation, polyp detection, polyp segmentation, and monocular depth estimation in colonoscopy; and we compare the resulting models on the basis of their fine-tuned performance using well-established metrics. The overall workflow of our experimentation is illustrated in Fig. 1.
In addition to identifying which architecture and pretraining pipeline (algorithm and data) is most suitable for each task, our results suggest that self-supervised pretraining with ImageNet-1k consistently allows for better performance than supervised pretraining with ImageNet-1k, across all considered tasks and architectures. We also demonstrate that selfsupervised pretraining with ImageNet-1k is typically more suitable than self-supervised pretraining with Hyperkvasirunlabelled, with the notable exception of monocular depth estimation in colonoscopy where the similarity of the pretraining data to the downstream data appears to be more critical than the amount of pretraining data. Additionally, we find that ViT-B backbones are typically more suitable for polyp segmentation and monocular depth estimation in colonoscopy, that ResNet50 backbones are more suitable for polyp detection, and that both architectures perform similarly in anatomical landmark recognition and pathological finding characterisation.
While a number of studies have experimented with self-supervised pretraining for certain GIE vision tasks before [31], [32], [33], [34], [35], only two [34], [35] have compared self-supervised pretraining against the convention of supervised pretraining with ImageNet-1k. Additionally, in their experiments with GIE vision tasks, these works either compared self-supervised pretraining against supervised pretraining of a different architecture with the same dataset, or the same architecture with a different dataset. Our work is therefore the first to compare self-supervised pretraining against supervised pretraining for the same encoder architecture and pretraining data, in terms of fine-tuned performance on GIE vision tasks. Additionally, we consider a much wider scope of self-supervised pretraining algorithms and GIE vision tasks than these previous works, each of which focuses on a single task, and are the first that we know of to experiment with self-supervised pretraining for polyp detection and monocular depth estimation in colonoscopy. Beyond the value of these results in isolation, this wide scope allows us to expose the general principles revealed by our analysis.

FIGURE 1. The overall workflow of our experimentation.

II. INVESTIGATED SELF-SUPERVISED PRETRAINING ALGORITHMS

Self-supervised algorithms for pretraining image encoders for use as backbones can be grouped into four families [36]:
  • Deep metric learning (DML)-based self-supervised pretraining algorithms train an encoder to describe semantically similar images with quantifiably similar representations, and semantically dissimilar images with quantifiably dissimilar representations. This is typically achieved by creating positive pairs, which are distorted variants of the same image, and negative pairs, which are distorted variants of different images, and training the encoder with a contrastive loss that is minimised through a reduction in the distance or angle between the representations of positive pairs, and an increase in the distance or angle between the representations of negative pairs.
  • Self-distillation-based self-supervised pretraining algorithms train an encoder to describe a variant of an image with a representation that allows for a representation of a different variant of the image, produced by another encoder, to be predicted. As a means of avoiding collapse, which occurs when both encoders learn to output the same representation for all images, the second encoder is typically an exponential moving average of the encoder being optimised, though collapse can be avoided through a Siamese network with a stop-gradient on one branch [19].
  • Canonical correlation analysis (CCA)-based selfsupervised pretraining algorithms train an encoder to describe an image in such a way that each feature of its representation is informative of a distinct attribute of the image. This is typically achieved with a loss function that encourages the encoder to maintain a certain amount of variance for each feature in the representation, while establishing uncorrelatedness between features.
  • Masked image modelling (MIM)-based self-supervised pretraining algorithms aim to reproduce the success of masked language modelling (MLM) pretraining algorithms, first introduced for pretraining the transformerbased text encoder BERT [37], in the domain of vision. MIM algorithms are therefore typically used with ViT architectures, which are also inspired by BERT, where the image is split into patches that are treated as a sequence of visual tokens akin to the sequence of word tokens used to represent input text for BERT. In both MLM and MIM, input tokens are randomly masked, and a model is trained to reconstruct these tokens based on the information contained in the remaining tokens.
The rest of this section presents the selection of algorithms considered in our experimentation, which we ensured spanned these four families of self-supervised algorithms. We illustrate and provide a definition of the key details of each algorithm, where we use $f_{\theta}$ to denote the image encoder being optimised for use as a backbone, and explain how we obtained and used encoders pretrained using each algorithm with either ImageNet-1k or Hyperkvasir-unlabelled. Note that any training performed as part of this work was done on an ASUS ESC8000-G4 GPU server with 6× NVIDIA RTX A6000 48GB GPUs. Due to the number of GPUs and amount of memory, the batch sizes used are in multiples of 6 and, for pretraining, are the maximum that we could allow for.

FIGURE 2. Visualisation of the MoCo v3 algorithm. Shown for a per-GPU batch size of 2, and 3 GPUs. We use $g_{\theta}$ to denote the projector, $h_{\theta}$ to denote the predictor, $\phi$ to denote the momentum parameters that are computed with an exponential moving average (denoted ema) of the online parameters $\theta$, and sg is a stop-gradient.

A. MoCo v3

MoCo v3 [14], illustrated in Fig. 2, is the latest iteration of the momentum contrast (MoCo) algorithm, which started as an example of DML. While the distinguishing feature of all iterations of MoCo is the momentum encoder and projector $g_{\phi} \circ f_{\phi}$, which is used to compute a representation for one image variant in each pair, rather than using the online encoder and projector $g_{\theta} \circ f_{\theta}$ to compute both representations as is more conventional in DML, e.g. SimCLR [9], MoCo v3 incorporates a prediction head $h_{\theta}$. The resulting algorithm can be framed as either a DML algorithm that incorporates the principle of self-distillation, or a self-distillation algorithm which uses a contrastive loss. As such, we consider MoCo v3 as a representative of both the DML and self-distillation families.
We define a batch of positive pairs of image variants on a single GPU as $\{(\mathbf{x}_{i,1}, \mathbf{x}_{i,2})\}_{i=1}^{N_b}$. We then define the representations used by MoCo v3 as:
$$\begin{aligned} \mathbf{q}_{i,j} &= h_{\theta}(g_{\theta}(f_{\theta}(\mathbf{x}_{i,j}))), \quad i=1,\ldots,N_b \text{ and } j=1,2 \\ \mathbf{k}_{i,j} &= g_{\phi}(f_{\phi}(\mathbf{x}_{i,j})), \quad i=1,\ldots,N_{Gb} \text{ and } j=1,2 \end{aligned}$$
where $N_{Gb} = N_G N_b$, where $N_G$ is the number of GPUs, and the representations $\{\mathbf{k}_{i,j}\}_{i=N_b+1, j=1}^{N_{Gb}, 2}$ are gathered from the other GPUs (see Fig. 2), where they are computed in the same manner as $\{\mathbf{k}_{i,j}\}_{i=1, j=1}^{N_b, 2}$ on different image variants, i.e. $\{(\mathbf{x}_{i,1}, \mathbf{x}_{i,2})\}_{i=N_b+1}^{N_{Gb}}$. The loss function used by MoCo v3 for a batch on a single GPU can then be defined:

$$\begin{aligned} &\mathcal{L}_{MC3}\left(\{\mathbf{q}_{i,1}\}_{i=1}^{N_b}, \{\mathbf{k}_{i,1}\}_{i=1}^{N_{Gb}}, \{\mathbf{q}_{i,2}\}_{i=1}^{N_b}, \{\mathbf{k}_{i,2}\}_{i=1}^{N_{Gb}}\right) \\ &= \frac{2\tau}{N_b} \sum_{i=1}^{N_b}\left[\mathcal{L}_{INCE}\left(\mathbf{q}_{i,1}, \{\mathbf{k}_{j,2}\}_{j=1}^{N_{Gb}}\right) + \mathcal{L}_{INCE}\left(\mathbf{q}_{i,2}, \{\mathbf{k}_{j,1}\}_{j=1}^{N_{Gb}}\right)\right] \end{aligned}$$
where $\tau$ is the temperature parameter, a constant positive scalar, and $\mathcal{L}_{INCE}$ is the InfoNCE loss [38], which is defined:
$$\mathcal{L}_{INCE}\left(\mathbf{q}_i, \{\mathbf{k}_j\}_{j=1}^{N}\right) = -\log\left(\frac{e^{\operatorname{CoSim}(\mathbf{q}_i, \mathbf{k}_i)/\tau}}{\sum_{j=1}^{N} e^{\operatorname{CoSim}(\mathbf{q}_i, \mathbf{k}_j)/\tau}}\right)$$
where CoSim is the cosine similarity:
$$\operatorname{CoSim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}^{\top}\mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$$
Note that Fig. 2 can be seen as illustrating
$$\frac{2\tau}{N_b} \sum_{i=1}^{N_b} \mathcal{L}_{INCE}\left(\mathbf{q}_{i,1}, \{\mathbf{k}_{j,2}\}_{j=1}^{N_G N_b}\right)$$
whereas $\mathcal{L}_{MC3}$ makes this symmetrical.
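For clarity, the symmetrised loss above reduces to a few lines of PyTorch on a single GPU. The following is a minimal sketch that omits the cross-GPU gathering of keys; the temperature value is illustrative:

```python
import torch
import torch.nn.functional as F

def infonce(q, k, tau):
    """InfoNCE over a batch: q, k are (N, d) tensors where (q[i], k[i]) is the
    positive pair and every other (q[i], k[j]) acts as a negative pair."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / tau                           # cosine similarities scaled by 1/tau
    labels = torch.arange(q.size(0), device=q.device)  # index of the positive key for each query
    return F.cross_entropy(logits, labels)             # averages over the batch

def moco_v3_loss(q1, k1, q2, k2, tau=0.2):
    """Symmetrised MoCo v3 loss: queries from one view against keys from the other."""
    return 2 * tau * (infonce(q1, k2, tau) + infonce(q2, k1, tau))
```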

The algorithm has been designed to work effectively for optimising both ResNet and ViT architectures of $f_{\theta}$. For ViT architectures, the patch embedding layer is frozen as a random linear projection for stability reasons, and the class token [cls] is taken as the output of $f_{\theta}$. The projectors $g_{\theta}$ and $g_{\phi}$, and the predictor $h_{\theta}$, are defined as multilayer perceptrons (MLPs) composed of fully connected layers, batch normalisation and ReLU activations.
We consider the use of MoCo v3 for pretraining both ResNet50 and ViT-B architectures, and use the torchvision implementation of ResNet50 and the ViT-B implementation from the official MoCo v3 codebase. For the encoders pretrained using MoCo v3 with ImageNet-1k, we use the weights provided by the authors. We then used the implementation of MoCo v3 in the official codebase to pretrain encoders with Hyperkvasir-unlabelled, modifying the code only for loading Hyperkvasir-unlabelled and to change the batch size from 4096 to 1536/768 (ResNet50/ViT-B). When fine-tuning the ViT-B models, we unfreeze the patch embedding layer.

B. BARLOW TWINS

Barlow Twins [13], illustrated in Fig. 3, is an example of a CCA algorithm. Barlow Twins trains a model to maintain a certain amount of variance for each feature and to establish uncorrelatedness between features with a loss function that encourages an identity empirical cross-correlation matrix between representations of two distorted variants of the same image. Other examples of CCA differ mainly in the loss function. For example, the loss function used by VicReg [16] encourages the variance of features to be maintained, and uncorrelatedness between features to be established, on representations of individual variants of an image directly, as well as minimising the Euclidean distance between representations of two variants of the same image.

FIGURE 3. Visualisation of the Barlow Twins algorithm. Shown for a per-GPU batch size of 2, and representations of dimensionality 4. We use $g_{\theta}$ to denote the projector.
We define a batch of positive pairs of image variants on a single GPU as $\{(\mathbf{x}_{i,1}, \mathbf{x}_{i,2})\}_{i=1}^{N_b}$. We then define the representations used by Barlow Twins as:
$$\mathbf{z}_{i,j} = g_{\theta}(f_{\theta}(\mathbf{x}_{i,j})), \quad i=1,\ldots,N_b \text{ and } j=1,2$$
which may also be written as $(z_{i,j,k})_{k=1}^{d} = \mathbf{z}_{i,j}$. These representations are normalised to give:
$$\hat{z}_{i,j,k} = \frac{z_{i,j,k} - \frac{1}{N_b}\sum_{m=1}^{N_b} z_{m,j,k}}{\sqrt{\frac{1}{N_b}\sum_{n=1}^{N_b}\left(z_{n,j,k} - \frac{1}{N_b}\sum_{m=1}^{N_b} z_{m,j,k}\right)^2}}$$
The elements of the empirical cross-correlation matrix $(c_{k,l})_{k=1,l=1}^{d,d}$ can then be defined:
$$c_{k,l} = \frac{1}{N_b}\sum_{i=1}^{N_b} \hat{z}_{i,1,k}\,\hat{z}_{i,2,l}$$
which is averaged across GPUs, the result of which we denote $(\bar{c}_{k,l})_{k=1,l=1}^{d,d}$. Finally, the Barlow Twins loss can be defined:

$$\mathcal{L}_{BT}\left((\bar{c}_{k,l})_{k=1,l=1}^{d,d}\right) = \sum_{k=1}^{d}\left(1 - \bar{c}_{k,k}\right)^2 + \lambda \sum_{k=1}^{d}\sum_{l=1}^{d} \mathbb{1}_{[k \neq l]}\,\bar{c}_{k,l}^2$$
where $\lambda$ is a constant positive scalar and $\mathbb{1}$ is an indicator function.
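A single-GPU sketch of $\mathcal{L}_{BT}$ follows directly from these definitions; in the distributed setting the cross-correlation matrix is averaged across GPUs before the loss is computed, and the value of $\lambda$ shown is illustrative:

```python
import torch

def barlow_twins_loss(z1, z2, lam=5e-3):
    """Barlow Twins loss: z1, z2 are (N, d) projector outputs for the two
    distorted variants of the same batch of images."""
    n = z1.size(0)
    # Normalise each feature to zero mean and unit variance across the batch.
    z1 = (z1 - z1.mean(dim=0)) / z1.std(dim=0, unbiased=False)
    z2 = (z2 - z2.mean(dim=0)) / z2.std(dim=0, unbiased=False)
    c = (z1.t() @ z2) / n                                        # empirical cross-correlation matrix (d, d)
    on_diag = (1.0 - torch.diagonal(c)).pow(2).sum()             # push diagonal elements towards 1
    off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()   # push off-diagonal elements towards 0
    return on_diag + lam * off_diag
```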
The algorithm has been designed to work effectively for ResNet architectures of $f_{\theta}$. The projector $g_{\theta}$ is defined as an MLP composed of fully connected layers, batch normalisation and ReLU activations.
We consider the use of Barlow Twins for pretraining ResNet50 architectures, for which we use the torchvision implementation. For the ResNet50 pretrained using Barlow Twins with ImageNet-1k, we use the weights provided by the authors. We then used the implementation of Barlow Twins in the official codebase to pretrain a ResNet50 with Hyperkvasir-unlabelled, modifying the code only for loading Hyperkvasir-unlabelled and to change the batch size from 2048 to 1536.

C. MAE

Masked autoencoders (MAE) [12], illustrated in Fig. 4, is a particularly popular example of the MIM family. It differs from other popular MIM algorithms on two main fronts. First, examples such as BEiT [15], PeCo [39], and SimMIM [40] use an arbitrary token in place of the masked tokens in the input to $f_{\theta}$, where MAE simply omits them. Notably, this is only possible with ViTs due to the use of position embeddings that inform a model of the specific patch of an image that a token corresponds to explicitly. For reconstruction, this does however require the insertion of an arbitrary token at each position of a masked token in the output of $f_{\theta}$, and the processing of the resulting sequence of tokens by a decoder $g_{\theta}$, which has a smaller ViT architecture. Secondly, BEiT and PeCo use the discrete variational autoencoder introduced as a component of DALL-E [41] to quantise all possible image patches into a finite set of visual tokens akin to a vocabulary of words, rather than directly using the patches as visual tokens. This allows the reconstruction to be framed as classifying which token in this finite set the masked token should be, closely following BERT. MAE however takes a more conventional approach to image reconstruction and frames it as a regression problem.
FIGURE 4. Visualisation of the MAE algorithm. Shown for a ViT encoder that treats an image as a $4\times4$ grid of patch tokens, with 75% masking.

As is typical for a ViT, an image is first divided into a sequence of flattened non-overlapping patches that are projected by a patch embedding layer and translated by a position embedding to produce the sequence of visual tokens $(\mathbf{x}_i)_{i=1}^{N_p}$ that are to be concatenated with the [cls] token and fed into the first block. Before concatenating with the [cls] token however, MAE generates a set of uniformly distributed random values $\{\alpha_i \sim \mathcal{U}(0,1)\}_{i=1}^{N_p}$ and computes the permutation $\sigma$ which sorts the set into reverse order, i.e. $\alpha_{\sigma(i)} \geq \alpha_{\sigma(i+1)}$ for $i=1,\ldots,N_p-1$. For a proportion of masking $\gamma \in [0,1]$, selected to ensure that $\gamma N_p - \lfloor \gamma N_p \rfloor = 0$, the sequence passed forward is then $(\tilde{\mathbf{x}}_i)_{i=1}^{(1-\gamma)N_p} = (\mathbf{x}_{\sigma(i)})_{i=1}^{(1-\gamma)N_p}$. In contrast to MIM algorithms that replace rather than omit the masked tokens from the input to $f_{\theta}$, it is important in MAE that the same number of tokens in each input are masked, i.e. $\gamma N_p$ is constant, to allow for batching. If the sequence of visual tokens, i.e. omitting the [cls] token, in the output of $f_{\theta}$ is denoted $(\tilde{\mathbf{z}}_i)_{i=1}^{(1-\gamma)N_p}$, we then create the sequence $(\mathbf{z}_i)_{i=1}^{N_p}$, where:
$$\mathbf{z}_i = \begin{cases} \tilde{\mathbf{z}}_{\sigma^{-1}(i)} & \text{if } 1 \leq \sigma^{-1}(i) \leq (1-\gamma)N_p \\ \mathbf{m} & \text{if } (1-\gamma)N_p + 1 \leq \sigma^{-1}(i) \leq N_p \end{cases}$$
where $\mathbf{m}$ is a learnt arbitrary token. The tokens in $(\mathbf{z}_i)_{i=1}^{N_p}$ are then translated by another position embedding and fed through the decoder blocks with the [cls] token. The output of the decoder blocks is then fed through a prediction head and the [cls] token is removed, leaving the sequence of reconstructed flattened patches for the entire image $(\hat{\mathbf{y}}_i)_{i=1}^{N_p}$. Denoting the sequence of ground truth flattened patches $(\mathbf{y}_i)_{i=1}^{N_p}$, in which the features have been zero-centred and scaled to unit variance for each patch independently, the loss function is defined:
$$\mathcal{L}_{MAE}\left((\hat{\mathbf{y}}_i)_{i=1}^{N_p}\right) = \frac{1}{\gamma N_p d_p} \sum_{i=1}^{N_p} \mathbb{1}_{[(1-\gamma)N_p + 1 \leq \sigma^{-1}(i) \leq N_p]}\left\|\hat{\mathbf{y}}_i - \mathbf{y}_i\right\|^2$$
where $d_p$ is the dimensionality of a patch. The loss is then averaged over all images in the batch on a single GPU, and the update to the model is averaged over GPUs, as is typical in distributed supervised learning.
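The masking and masked-patch regression described above can be sketched as follows. This is a simplified illustration under our own naming, not the official MAE implementation; the masking ratio is the 75% default shown in Fig. 4:

```python
import torch

def random_masking(x, gamma=0.75):
    """x is a (B, N_p, D) batch of embedded patch tokens. Returns the kept tokens,
    the sorting permutation sigma, and a mask that is 1 where a patch is masked."""
    b, n_p, d = x.shape
    n_keep = int(round((1 - gamma) * n_p))
    alpha = torch.rand(b, n_p, device=x.device)           # alpha_i ~ U(0, 1)
    sigma = torch.argsort(alpha, dim=1, descending=True)  # sorts alpha into reverse order
    keep_idx = sigma[:, :n_keep]
    x_kept = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n_p, device=x.device)
    mask.scatter_(1, keep_idx, 0.0)                       # 0 where kept, 1 where masked
    return x_kept, sigma, mask

def mae_loss(y_hat, y, mask):
    """Mean squared error over masked patches only. y_hat, y are (B, N_p, d_p);
    mask is 1 where a patch was masked and 0 where it was kept."""
    per_patch = (y_hat - y).pow(2).mean(dim=-1)           # ||y_hat_i - y_i||^2 / d_p
    return (per_patch * mask).sum() / mask.sum()
```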
As mentioned, MAE has been designed for pretraining ViT architectures specifically. A notable distinction between the use of ViT in MAE and in MoCo v3 is that the loss is computed on the processed visual tokens in MAE, whereas it is computed on the processed [cls] token in MoCo v3.
We consider the use of MAE for pretraining ViT-B architectures, for which we use the implementation from the official MAE codebase. For the ViT-B pretrained using MAE with ImageNet-1k, we use the weights provided by the authors. We then used the implementation of MAE in the official codebase to pretrain a ViT-B with Hyperkvasir-unlabelled, modifying the code only for loading Hyperkvasir-unlabelled and to change the batch size from 4096 to 768.

III. BASELINES

For each of the considered encoder architectures, ResNet50 and ViT-B, we consider two baselines to compare the discussed self-supervised pretraining pipelines against. Most importantly, we consider supervised pretraining with ImageNet-1k, representing the conventional approach for pretraining image encoders for use as backbones in solutions to GIE vision tasks. We then consider no pretraining, i.e. fine-tuning from random initialisation. We use the torchvision implementation and weights for ResNet50, and the timm implementation and weights for ViT-B.
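For reference, the supervised and randomly initialised baselines can be instantiated directly from these libraries, roughly as follows. The weight identifiers shown are the libraries' defaults and are given for illustration; they may not correspond exactly to the checkpoints used in our experiments:

```python
import timm
from torchvision import models

# Supervised ImageNet-1k baselines.
resnet50_supervised = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
vit_b_supervised = timm.create_model("vit_base_patch16_224", pretrained=True)

# Randomly initialised (not pretrained) baselines.
resnet50_random = models.resnet50(weights=None)
vit_b_random = timm.create_model("vit_base_patch16_224", pretrained=False)
```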
We note that we do not directly compare against the state-of-the-art methods for each task. While our primary aim is to study the relative effectiveness of different pretraining pipelines, which such comparisons would not be suitable for due to the need for consistency in all other details, we believe that this would still be informative. However, we cannot compare against previously reported results due to the lack of standardisation in the benchmarks, with different works using different splits and different evaluation methodologies, and re-implementing these methods to allow for a direct comparison would be too time-consuming. To the best of our knowledge, the state-of-the-art for each task uses either a convolutional neural network or some derivative of ViT that has been pretrained in a supervised manner with ImageNet-1k as a backbone, and as such we consider models with a ResNet50 or ViT-B backbone that has been pretrained in a supervised manner with ImageNet-1k as representative of the state-of-the-art.

IV. IMAGE CLASSIFICATION

Image classification is the problem of determining which, out of a predefined set of classes, a given image should be assigned to. In the context of GIE, the predefined set of classes may cover, for example, possible anatomical landmarks, pathological findings, or categories of polyps. In this section, we detail and present our evaluation of the fine-tuned performance of backbones in two of these image classification tasks, namely anatomical landmark recognition and pathological finding characterisation.

A. DATA

The data used in our image classification experiments is taken from the Hyperkvasir-labelled dataset [28], which does not share any instances with Hyperkvasir-unlabelled. We specifically used the anatomical landmarks and pathological findings subsets, the classification of which we treated as two separate problems. For each subset, we combined the data for the upper and lower gastrointestinal tract, and applied a random 80%/10%/10% training/validation/test split, where the validation data is used to determine whether to save the weights after each epoch of training on the training data, and the test data is reserved for evaluating the model after fine-tuning. The number of instances of each class, in total and in each split, are given in Table 1.

B. DECODERS

In image classification, it is typical to simply add a linear classifier to the final representation computed by an encoder to allow for prediction. Following convention, we implement this as a fully connected layer that maps the final representation to a vector of logits, one for each possible class, which is softmax normalised prior to computation of the loss. For the ViT-B models, we use the output [cls] token as the final representation.
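For a ViT-B backbone this amounts to a single fully connected layer applied to the output [cls] token, for example as in the following minimal sketch with illustrative names; for ResNet50 the pooled convolutional features play the same role:

```python
import torch.nn as nn

class LinearClassifier(nn.Module):
    """Backbone followed by a fully connected layer mapping the final
    representation to one logit per class."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feats = self.backbone(x)   # ViT-B: output [cls] token; ResNet50: pooled features
        return self.fc(feats)      # logits; the softmax is applied within the loss
```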

C. FINE-TUNING PROCEDURE

We separately train each model to perform both anatomical landmark recognition and pathological finding characterisation through the same procedure. We use the common fine-tuning procedure hyperparameters given in Table 3 and pre-process the training images using the pipeline detailed in Table 2. The loss is then computed using a cross entropy loss function which, due to the significant class imbalance in the data, is weighted with a value of $N_D/(N_i N_c)$ for the $i$th class, where $N_D$ is the total number of images in the dataset, $N_i$ is the number of images in a particular class, and $N_c$ is the number of classes. Note that these numbers are for the entire dataset, rather than the training set. This weighting ensures that the total sum of weights across all instances is $N_D$, for consistency with unweighted cross entropy. We use the macro F1-score (mF1)$^1$ as the validation metric:
$$\mathrm{mF1} = \frac{1}{N_c}\sum_{i=1}^{N_c} \frac{2\,\mathrm{TP}_i + \epsilon}{2\,\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i + \epsilon}$$
where $\mathrm{TP}_i$ is the number of true positives for the $i$th class, $\mathrm{FP}_i$ is the number of false positives, $\mathrm{FN}_i$ is the number of false negatives, and $\epsilon = 1\mathrm{e}{-8}$. The transformations applied to the validation images include the same resizing and normalisation applied to the training images. Finally, the model is trained on this basis for 50 epochs, with the parameters saved after each epoch that leads to an improvement in mF1 on the validation set, with any batch normalisation synchronised across GPUs.
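As a concrete illustration of this weighting, the sketch below builds the per-class weights from the Table 1 anatomical-landmark class totals and passes them to a weighted cross entropy loss; this is a minimal sketch and does not reproduce the full training loop:

```python
import torch
import torch.nn as nn

# Class totals N_i over the entire anatomical landmarks subset (Table 1):
# Cecum, Ileum, Retroflex-rectum, Pylorus, Retroflex-stomach, Z-line.
class_counts = torch.tensor([1009.0, 9.0, 391.0, 999.0, 764.0, 932.0])
n_d, n_c = class_counts.sum(), class_counts.numel()

weights = n_d / (class_counts * n_c)             # N_D / (N_i * N_c) for each class
criterion = nn.CrossEntropyLoss(weight=weights)  # softmax + weighted cross entropy on logits
```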

D. EVALUATION

We evaluate the resulting image classification models using the corresponding test data, which is pre-processed in the same manner as the validation data, with four metrics, namely mF1 (as defined above), mPrecision, mRecall, and Accuracy:
$$\begin{aligned} \text{mPrecision} &= \frac{1}{N_c}\sum_{i=1}^{N_c} \frac{\mathrm{TP}_i + \epsilon}{\mathrm{TP}_i + \mathrm{FP}_i + \epsilon} \\ \text{mRecall} &= \frac{1}{N_c}\sum_{i=1}^{N_c} \frac{\mathrm{TP}_i + \epsilon}{\mathrm{TP}_i + \mathrm{FN}_i + \epsilon} \\ \text{Accuracy} &= \frac{\sum_{i=1}^{N_c} \mathrm{TP}_i}{N_D} \end{aligned}$$
where $\epsilon = 1\mathrm{e}{-8}$.
For all metrics, a higher value indicates better performance. The results for anatomical landmark recognition are presented in Table 4 and the results for pathological finding characterisation are presented in Table 5.
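All four metrics reduce to simple arithmetic on per-class confusion counts, as in the following sketch (function name and interface are our own):

```python
import torch

def macro_metrics(tp, fp, fn, n_d, eps=1e-8):
    """Macro F1, macro precision, macro recall and accuracy from per-class counts.
    tp, fp, fn are length-N_c tensors of true positives, false positives and
    false negatives; n_d is the total number of evaluated images."""
    mf1 = ((2 * tp + eps) / (2 * tp + fp + fn + eps)).mean()
    mprecision = ((tp + eps) / (tp + fp + eps)).mean()
    mrecall = ((tp + eps) / (tp + fn + eps)).mean()
    accuracy = tp.sum() / n_d
    return mf1, mprecision, mrecall, accuracy
```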

V. OBJECT DETECTION

Object detection is the problem of recognising and locating any objects of interest in an image. In the context of GIE, the objects of interest may be polyps, tools, artefacts, or disease. In this section, we detail and present our evaluation of the fine-tuned performance of backbones in polyp detection specifically.

A. DATA

The data used in our object detection experiments is taken from the Kvasir-SEG dataset [45], which does not share
TABLE 1. Number of instances of each class, in total and in each split.
| Classification Task | Tract | Class | Total | Train. | Val. | Test. |
|---|---|---|---|---|---|---|
| Anatomical landmark recognition | Lower | Cecum | 1009 | 794 | 109 | 106 |
| | | Ileum | 9 | 6 | 2 | 1 |
| | | Retroflex-rectum | 391 | 308 | 32 | 51 |
| | Upper | Pylorus | 999 | 816 | 87 | 96 |
| | | Retroflex-stomach | 764 | 610 | 82 | 72 |
| | | Z-line | 932 | 750 | 98 | 84 |
| Pathological finding characterisation | Lower | Hemorrhoids | 6 | 3 | 3 | 0 |
| | | Polyps | 1028 | 825 | 100 | 103 |
| | | Ulcerative colitis grade 0-1 | 35 | 28 | 4 | 3 |
| | | Ulcerative colitis grade 1 | 201 | 160 | 17 | 24 |
| | | Ulcerative colitis grade 1-2 | 11 | 7 | 2 | 2 |
| | | Ulcerative colitis grade 2 | 443 | 347 | 53 | 43 |
| | | Ulcerative colitis grade 2-3 | 28 | 23 | 2 | 3 |
| | | Ulcerative colitis grade 3 | 133 | 109 | 13 | 11 |
| | Upper | Barrett's short-segment | 53 | 44 | 5 | 4 |
| | | Barrett's | 41 | 32 | 4 | 5 |
| | | Esophagitis A | 403 | 333 | 33 | 37 |
| | | Esophagitis B-D | 260 | 203 | 28 | 29 |
TABLE 2. Pre-processing of training images, which is performed online during training. 1) pads to max(h, w) × max(h, w) for original height h and width w. 2) resizes to 224 × 224 using bicubic interpolation with anti-aliasing [43]. 3) applies colour jitter with brightness factor sampled uniformly from [0.6, 1.4], contrast factor sampled uniformly from [0.5, 1.5], saturation factor sampled uniformly from [0.75, 1.25], and hue factor sampled uniformly from [0.99, 1.01]. 4) applies Gaussian blur with a 25 × 25 kernel with a standard deviation sampled uniformly from [0.001, 2]. 5) applies a rotation of 90° with a probability of 0.5. 6) applies a horizontal flip with a probability of 0.5. 7) applies a vertical flip with a probability of 0.5. 8) applies a rotation of an angle sampled uniformly from [−180°, 180°]. 9) applies an affine transform with horizontal translation sampled uniformly from [−28, 28], vertical translation sampled uniformly from [−28, 28], scaling with factor sampled uniformly from [0.5, 1.5], and shearing of an angle sampled uniformly from [−22.5°, 22.5°]. 10) applies normalisation using the ImageNet-1k pixel mean and standard deviation, for consistency with all pretraining pipelines.
| Operation | Image classification | Object detection | Semantic segmentation | Monocular depth estimation |
| :--- | :---: | :---: | :---: | :---: |
| 1) Pad to square | ✗ | ✗ | ✗ | ✓ |
| 2) Resize | ✓ | ✗ | ✓ | ✓ |
| 3) Colour jitter | ✓ | ✓ | ✓ | ✓ |
| 4) Gaussian blur | ✓ | ✓ | ✓ | ✗ |
| 5) Discrete rotation | ✗ | ✓ | ✗ | ✓ |
| 6) Horizontal flip | ✓ | ✓ | ✓ | ✓ |
| 7) Vertical flip | ✓ | ✓ | ✓ | ✓ |
| 8) Continuous rotation | ✓ | ✗ | ✓ | ✗ |
| 9) Affine transform | ✗ | ✗ | ✓ | ✗ |
| 10) Normalisation | ✓ | ✓ | ✓ | ✓ |
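The image classification column of Table 2, together with the parameter ranges given in its caption, corresponds closely to a standard torchvision pipeline. The following is a minimal sketch of such a pipeline, assuming torchvision transforms applied to PIL images; the hue jitter is omitted because the stated hue factor range does not map directly onto torchvision's hue parameter, and the exact implementation used in the experiments may differ.

```python
# Minimal sketch of the Table 2 pre-processing for image classification (assumptions noted above).
from torchvision import transforms
from torchvision.transforms import InterpolationMode

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

classification_train_transform = transforms.Compose([
    # 2) resize to 224x224 with bicubic interpolation and anti-aliasing
    transforms.Resize((224, 224), interpolation=InterpolationMode.BICUBIC, antialias=True),
    # 3) colour jitter (brightness, contrast, saturation; hue omitted, see text)
    transforms.ColorJitter(brightness=(0.6, 1.4), contrast=(0.5, 1.5), saturation=(0.75, 1.25)),
    # 4) Gaussian blur with a 25x25 kernel and sigma ~ U[0.001, 2]
    transforms.GaussianBlur(kernel_size=25, sigma=(0.001, 2.0)),
    # 6), 7) horizontal and vertical flips, each with probability 0.5
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    # 8) continuous rotation with angle ~ U[-180, 180]
    transforms.RandomRotation(degrees=180),
    # 10) normalisation with the ImageNet-1k statistics
    transforms.ToTensor(),
    transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
```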
TABLE 3. Common fine-tuning procedure hyperparameters.
| Hyperparameter | Value |
| :--- | :--- |
| Batch size | 48 |
| Optimiser | AdamW [44] |
| Initial learning rate | 1e-4 |
| Learning rate schedule | Halve when validation performance does not improve over 10 epochs, until reaching 1e-6 |
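The hyperparameters in Table 3 correspond to an AdamW optimiser with a reduce-on-plateau learning rate schedule. A minimal sketch in PyTorch follows, where the model and the per-epoch training/validation routines are assumed rather than shown:

```python
import torch

def build_optimisation(model: torch.nn.Module):
    """Optimiser and learning rate schedule matching Table 3 (a sketch)."""
    optimiser = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimiser,
        mode="max",   # validation metrics such as mF1/AP/mDice improve upwards;
                      # use mode="min" for error metrics such as the depth validation metric
        factor=0.5,   # halve the learning rate...
        patience=10,  # ...when the metric has not improved over 10 epochs...
        min_lr=1e-6,  # ...until reaching 1e-6
    )
    return optimiser, scheduler

# usage (hypothetical): after each epoch, call scheduler.step(val_metric)
```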
any instances with Hyperkvasir-unlabelled. The dataset includes 1000 GIE images, each of which shows at least one polyp and is paired with both a set of bounding boxes, specifying the location and the horizontal and vertical dimensions of any polyps in the image, and a binary segmentation map indicating which pixels correspond to a polyp and which don't. While the segmentation maps were used in our semantic segmentation experiments, here we use the sets of bounding boxes. We applied a random 80%/10%/10%

training/validation/test split, where the validation data is used to determine whether to save the weights after each epoch of training on the training data, and the test data is reserved for evaluating the model after fine-tuning.
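A minimal sketch of such a random 80%/10%/10% split over image indices is given below; the seed and shuffling routine here are illustrative assumptions rather than the exact procedure used.

```python
import random

def split_indices(num_items: int, seed: int = 0):
    """Randomly split item indices into 80%/10%/10% train/val/test subsets."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    n_train = int(0.8 * num_items)
    n_val = int(0.1 * num_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)  # e.g. the 1000 Kvasir-SEG images
```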

B. DECODERS

For our object detection experiments, we implemented the listed backbones within a Faster R-CNN pipeline [46] with a feature pyramid network (FPN) [47], using the torchvision implementation. We used the existing implementation of the pipeline with a ResNet50 backbone, specifying that all layers of the backbone should be trainable. For the ViT-B models, based on previous analyses of using ViT backbones in object detection [48], [49], we first modified the encoders to efficiently process larger image sizes$^{2}$ by bilinearly interpolating the position embeddings
TABLE 4. Performance in anatomical landmark recognition. The best results for each architecture are highlighted as bold, and the best results overall are underlined.
| Backbone arch. | Pretraining data | Pretraining algo. | mF1 | mPrecision | mRecall | Accuracy |
| :--- | :--- | :--- | ---: | ---: | ---: | ---: |
| ResNet50 | Hyperkvasir-unlabelled | MoCo v3 | 0.823 | 0.989 | 0.823 | 0.988 |
| | Hyperkvasir-unlabelled | Barlow Twins | 0.824 | 0.989 | 0.826 | 0.990 |
| | ImageNet-1k | MoCo v3 | 0.828 | 0.993 | 0.829 | 0.993 |
| | ImageNet-1k | Barlow Twins | 0.826 | 0.991 | 0.828 | 0.990 |
| | ImageNet-1k | Supervised | 0.826 | 0.827 | 0.825 | 0.988 |
| | None | None | 0.793 | 0.957 | 0.795 | 0.956 |
| ViT-B | Hyperkvasir-unlabelled | MoCo v3 | 0.818 | 0.983 | 0.819 | 0.983 |
| | Hyperkvasir-unlabelled | MAE | 0.823 | 0.990 | 0.823 | 0.988 |
| | ImageNet-1k | MoCo v3 | 0.828 | 0.993 | 0.829 | 0.993 |
| | ImageNet-1k | MAE | 0.823 | 0.989 | 0.823 | 0.988 |
| | ImageNet-1k | Supervised | 0.815 | 0.979 | 0.817 | 0.980 |
| | None | None | 0.782 | 0.955 | 0.778 | 0.944 |
TABLE 5. Performance in pathological finding characterisation. The best results for each architecture are highlighted as bold, and the best results overall are underlined.
| Backbone arch. | Pretraining data | Pretraining algo. | mF1 | mPrecision | mRecall | Accuracy |
| :--- | :--- | :--- | ---: | ---: | ---: | ---: |
| ResNet50 | Hyperkvasir-unlabelled | MoCo v3 | 0.542 | 0.531 | 0.495 | 0.746 |
| | Hyperkvasir-unlabelled | Barlow Twins | 0.613 | 0.527 | 0.561 | 0.765 |
| | ImageNet-1k | MoCo v3 | 0.595 | 0.515 | 0.516 | 0.758 |
| | ImageNet-1k | Barlow Twins | 0.628 | 0.682 | 0.545 | 0.765 |
| | ImageNet-1k | Supervised | 0.587 | 0.645 | 0.492 | 0.777 |
| | None | None | 0.491 | 0.584 | 0.437 | 0.621 |
| ViT-B | Hyperkvasir-unlabelled | MoCo v3 | 0.589 | 0.690 | 0.526 | 0.777 |
| | Hyperkvasir-unlabelled | MAE | 0.576 | 0.563 | 0.526 | 0.723 |
| | ImageNet-1k | MoCo v3 | 0.596 | 0.608 | 0.527 | 0.731 |
| | ImageNet-1k | MAE | 0.652 | 0.723 | 0.596 | 0.780 |
| | ImageNet-1k | Supervised | 0.596 | 0.605 | 0.523 | 0.758 |
| | None | None | 0.434 | 0.679 | 0.366 | 0.621 |
and using non-overlapping window self-attention in all but the 3rd, 6th, 9th, and 12th blocks. Window attention, also known as restricted attention [50], independently applies attention to subsets of the sequence of visual tokens, where each subset corresponds to the tokens in a square window of the equivalent feature map, with no overlapping windows. We used 256 tokens in each subset, corresponding to a $16 \times 16$ window of a feature map. We then modified the Faster R-CNN pipeline to use the resulting encoders as backbones with a ViTDet FPN [49].
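A minimal sketch of the position-embedding interpolation step is given below, assuming a ViT-B with a class token, $16 \times 16$ patches, and embeddings pretrained at $224 \times 224$; the window-attention modification and the exact integration into the encoder are not shown.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Bilinearly resize ViT position embeddings to a new patch-grid size.

    pos_embed: (1, 1 + old_grid**2, dim) tensor with a leading class token.
    """
    cls_token, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bilinear", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_token, patch_pos], dim=1)

# e.g. pretrained at 224/16 = 14 patches per side, fine-tuned at 1024/16 = 64
new_pos_embed = interpolate_pos_embed(torch.zeros(1, 1 + 14 * 14, 768), new_grid=64)
```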

C. FINE-TUNING PROCEDURE

In the fine-tuning of both model architectures, we use the common fine-tuning procedure hyperparameters given in Table 3 and pre-process the training images using the pipeline detailed in Table 2. For the ResNet50 models, using the default pre-processing pipeline for the Faster R-CNN implementation, the images in a batch are then each resized with bilinear interpolation to a scale of $\min(800 / \min(h, w), 1333 / \max(h, w))$ of the original size $h \times w$, and then padded to $H \times W$, where $H$ is the maximum

height of the resized images and $W$ is the maximum width of the resized images across the batch. For the ViT-B models, inspired by a previous analysis [48], the images are padded to $1024 \times 1024$ - since several images in the dataset have a height or width larger than 1024, these images are downsampled to half the resolution using bicubic interpolation with anti-aliasing before padding. Transformations are also applied to the bounding boxes in accordance with any spatial transformations applied to the image. The usual multitask loss function for the Faster R-CNN pipeline is used to compute the loss, and we use AP@[.5:.95] as the validation metric for predicted bounding boxes that have a confidence score $\geq 0.05$:
$$\mathrm{AP}@[.5:.95] = \frac{1}{10} \sum_{t \in T} \mathrm{AP}@t$$
where $T = \{0.5, 0.55, \ldots, 0.95\}$ is the set of intersection over union (IoU) thresholds and AP@$t$ is the average precision at the $t^{th}$ IoU threshold. We compute AP@$t$ by first ranking all predicted bounding boxes with respect to the confidence score, from high to low. We then step through

the predicted bounding boxes in rank order and assign the prediction to the true positives if it has an IoU with a target bounding box for the same image that is greater than the IoU threshold, and otherwise assign it to the false positives. At each rank, we then compute the precision and recall using the cumulative number of true positives and false positives and the total number of false negatives. We then determine a strictly monotonically increasing sequence of recall values $(r_i)_{i=1}^{N_r}$, with $r_1 = 0$, $r_{N_r} = 1$, and $(r_i)_{i=2}^{N_r-1}$ being the recall values (excluding 0 and 1) for ranks where false positives and resulting drops in the precision occur, and AP@$t$ is then:
$$\mathrm{AP}@t = \sum_{i=2}^{N_r} (r_i - r_{i-1})\, p(r_i)$$
where $p(r_i)$ is the maximum precision value out of those which correspond to $r_i$, for $i = 2, \ldots, N_r$. The transformations applied to the validation images include the same resizing and/or padding and normalisation applied to the training images, and a batch size of 1 is used to ensure the evaluation of a ResNet50 model on a particular instance is not influenced by other images in a batch (through the padding to $H \times W$). Finally, the model is trained on this basis for 200 epochs, with the parameters saved after each epoch that leads to an improvement in AP@[.5:.95] on the validation set, with any batch normalisation synchronised across GPUs.
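A minimal sketch of one way to compute AP@$t$ in the spirit of the description above is given below, assuming the predictions have already been ranked by confidence and matched against targets at the IoU threshold $t$; the handling of ties and endpoints in the actual evaluation may differ.

```python
def average_precision_at_t(is_true_positive, num_targets):
    """AP@t from confidence-ranked predictions (sketch).

    is_true_positive: booleans for the ranked predictions, True where the prediction
                      matched a target at IoU greater than the threshold t.
    num_targets: total number of target bounding boxes (assumed > 0).
    """
    tp = fp = 0
    precision_at_recall = {}  # recall value -> maximum precision observed at that recall
    for flag in is_true_positive:
        tp += int(flag)
        fp += int(not flag)
        precision = tp / (tp + fp)
        recall = tp / num_targets
        precision_at_recall[recall] = max(precision, precision_at_recall.get(recall, 0.0))

    # Sum (r_i - r_{i-1}) * p(r_i) over the increasing sequence of recall values.
    ap, prev_recall = 0.0, 0.0
    for recall in sorted(precision_at_recall):
        ap += (recall - prev_recall) * precision_at_recall[recall]
        prev_recall = recall
    return ap
```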

D. EVALUATION

We evaluate the resulting object detection models using the test data, which is pre-processed in the same manner as the validation data, with AP@[.5:.95] (AP for conciseness), AP@.5 ($\mathrm{AP}_{50}$), and AP@.75 ($\mathrm{AP}_{75}$) computed for predicted bounding boxes with a confidence score $\geq 0.05$. For all metrics, a higher value indicates better performance. The results are presented in Table 6, and some examples for predicted bounding boxes with a confidence score $\geq 0.5$ are shown in Fig. 5.

VI. SEMANTIC SEGMENTATION

Semantic segmentation is the problem of determining which class, out of a predefined set of classes, each pixel in an image should be assigned to. In the context of GIE, the predefined set of classes will typically include a background class that accounts for anything that is not of interest, as well as any classes that are of interest, for example, polyps, tools, artefacts, or disease. In this section, we detail and present our evaluation of the fine-tuned performance of backbones in polyp segmentation specifically, which is notably a binary segmentation problem.

A. DATA

We used two datasets in our semantic segmentation experiments, namely Kvasir-SEG [45] and CVC-ClinicDB [51]. Kvasir-SEG has already been discussed in the context of our object detection experiments, and we use the

same training/validation/test split here. CVC-ClinicDB includes 612 GIE images, each of which shows at least one polyp and is paired with a binary segmentation map indicating which pixels correspond to a polyp and which don't. We applied a random 80%/10%/10% training/validation/test split, where the validation data is used to determine whether to save the weights after each epoch of training on the training data, and the test data is reserved for evaluating the model after fine-tuning.

B. DECODERS

For our semantic segmentation experiments, we used the listed ResNet50 backbones with a DeepLabV3+ [52] decoder, using the segmentation-models-pytorch implementation. We then used the ViT-B backbones with the segmentation variant of the dense prediction transformer (DPT) [53] decoder, using the implementation provided in the official codebase.
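As an illustration, a ResNet50 segmentation model of this kind can be assembled with segmentation-models-pytorch roughly as follows; loading the weights of the chosen pretrained backbone into the encoder is omitted here and left as an assumption.

```python
import segmentation_models_pytorch as smp

# DeepLabV3+ decoder on a ResNet50 encoder; encoder_weights=None because the
# backbone weights are loaded separately from the chosen pretraining pipeline.
model = smp.DeepLabV3Plus(
    encoder_name="resnet50",
    encoder_weights=None,
    in_channels=3,
    classes=1,  # binary polyp segmentation
)
```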

C. FINE-TUNING PROCEDURE

We separately train each model to perform polyp segmentation with each dataset through the same procedure. We use the common fine-tuning procedure hyperparameters given in Table 3 and pre-process the training images using the pipeline detailed in Table 2. Transformations are also applied to the segmentation maps in accordance with any spatial transformations applied to the image. The loss is then computed using the Dice loss function [54], and we use mDice as the validation metric:
$$\mathrm{mDice} = \frac{1}{N_e} \sum_{i=1}^{N_e} \frac{2\,\mathrm{TP}_i + \epsilon}{2\,\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i + \epsilon}$$
where $N_e$ is the number of instances in the validation/test set, $\mathrm{TP}_i$ is the number of true positives for the $i^{th}$ image, $\mathrm{FP}_i$ is the number of false positives, $\mathrm{FN}_i$ is the number of false negatives, and $\epsilon = 1\mathrm{e}{-8}$. The transformations applied to the validation images include the same resizing and normalisation applied to the training images, with the validation maps also resized to $224 \times 224$. Finally, the model is trained on this basis for 200 epochs, with the parameters saved after each epoch that leads to an improvement in mDice on the validation set, with any batch normalisation synchronised across GPUs.
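A minimal sketch of one common formulation of the Dice loss for binary segmentation is given below, assuming raw logits and 0/1 target maps; the exact formulation in [54] and the one used in the experiments may differ in detail.

```python
import torch

def dice_loss(logits: torch.Tensor, targets: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Soft Dice loss for binary segmentation.

    logits, targets: tensors of shape (batch, 1, H, W); targets are 0/1 maps.
    """
    probs = torch.sigmoid(logits)
    dims = (1, 2, 3)
    intersection = (probs * targets).sum(dim=dims)
    denominator = probs.sum(dim=dims) + targets.sum(dim=dims)
    dice = (2 * intersection + eps) / (denominator + eps)
    return 1 - dice.mean()
```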

D. EVALUATION

We evaluate the resulting semantic segmentation models using the corresponding test data, where the images are pre-processed in the same manner as the validation images, but the segmentation maps are left at their original size. The predictions are therefore resized to this original size using bilinear interpolation prior to binarisation. We then use four metrics, namely the mDice (as defined in (19)), mIoU,
TABLE 6. Performance in polyp detection. The best results for each architecture are highlighted as bold, and the best results overall are underlined.
| Backbone arch. | Pretraining data | Pretraining algo. | AP | $\mathrm{AP}_{50}$ | $\mathrm{AP}_{75}$ |
| :--- | :--- | :--- | ---: | ---: | ---: |
| ResNet50 | Hyperkvasir-unlabelled | MoCo v3 | 0.604 | 0.895 | 0.702 |
| | Hyperkvasir-unlabelled | Barlow Twins | 0.647 | 0.895 | 0.717 |
| | ImageNet-1k | MoCo v3 | 0.640 | 0.905 | 0.731 |
| | ImageNet-1k | Barlow Twins | 0.653 | 0.906 | $\mathbf{0.772}$ |
| | ImageNet-1k | Supervised | 0.633 | 0.889 | 0.716 |
| | None | None | 0.492 | 0.795 | 0.614 |
| ViT-B | Hyperkvasir-unlabelled | MoCo v3 | 0.549 | 0.867 | 0.578 |
| | Hyperkvasir-unlabelled | MAE | 0.563 | 0.839 | 0.666 |
| | ImageNet-1k | MoCo v3 | 0.572 | 0.873 | 0.652 |
| | ImageNet-1k | MAE | 0.643 | 0.921 | 0.748 |
| | ImageNet-1k | Supervised | 0.577 | 0.832 | 0.685 |
| | None | None | 0.281 | 0.609 | 0.197 |
FIGURE 5. Targets (yellow bounding boxes) and predictions (green bounding boxes) for two randomly selected instances of the Kvasir-SEG test set. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA.

mPrecision, and mRecall:
$$\begin{aligned}
\mathrm{mIoU} &= \frac{1}{N_e} \sum_{i=1}^{N_e} \frac{\mathrm{TP}_i + \epsilon}{\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i + \epsilon} \\
\mathrm{mPrecision} &= \frac{1}{N_e} \sum_{i=1}^{N_e} \frac{\mathrm{TP}_i + \epsilon}{\mathrm{TP}_i + \mathrm{FP}_i + \epsilon} \\
\mathrm{mRecall} &= \frac{1}{N_e} \sum_{i=1}^{N_e} \frac{\mathrm{TP}_i + \epsilon}{\mathrm{TP}_i + \mathrm{FN}_i + \epsilon}
\end{aligned}$$
For all metrics, a higher value indicates better performance. The results for Kvasir-SEG are presented in Table 7 and the results for CVC-ClinicDB are presented in Table 8. Examples for Kvasir-SEG are shown in Fig. 6.
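A minimal sketch of these per-image metrics computed from binarised predictions is given below, assuming NumPy arrays; it mirrors the definitions above rather than any particular library implementation.

```python
import numpy as np

def segmentation_metrics(preds, targets, eps=1e-8):
    """mDice, mIoU, mPrecision and mRecall over lists of binary masks."""
    dice, iou, precision, recall = [], [], [], []
    for pred, target in zip(preds, targets):
        pred, target = pred.astype(bool), target.astype(bool)
        tp = np.logical_and(pred, target).sum()
        fp = np.logical_and(pred, ~target).sum()
        fn = np.logical_and(~pred, target).sum()
        dice.append((2 * tp + eps) / (2 * tp + fp + fn + eps))
        iou.append((tp + eps) / (tp + fp + fn + eps))
        precision.append((tp + eps) / (tp + fp + eps))
        recall.append((tp + eps) / (tp + fn + eps))
    return tuple(float(np.mean(m)) for m in (dice, iou, precision, recall))
```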

VII. MONOCULAR DEPTH ESTIMATION

Monocular depth estimation is the problem of predicting, for every pixel in an image, the length of the ray of light that the pixel corresponds to, between the camera and the object that the ray of light has come from. Since the
TABLE 7. Performance in polyp segmentation with Kvasir-SEG. The best results for each architecture are highlighted as bold, and the best results overall are underlined.
| Backbone arch. | Pretraining data | Pretraining algo. | mDice | mIoU | mPrecision | mRecall |
| :--- | :--- | :--- | ---: | ---: | ---: | ---: |
| ResNet50 | Hyperkvasir-unlabelled | MoCo v3 | 0.841 | 0.753 | 0.857 | 0.878 |
| | Hyperkvasir-unlabelled | Barlow Twins | 0.852 | 0.772 | 0.859 | 0.900 |
| | ImageNet-1k | MoCo v3 | 0.883 | 0.812 | 0.866 | 0.936 |
| | ImageNet-1k | Barlow Twins | 0.873 | 0.795 | 0.879 | 0.899 |
| | ImageNet-1k | Supervised | 0.871 | 0.800 | 0.882 | 0.893 |
| | None | None | 0.632 | 0.506 | 0.639 | 0.780 |
| ViT-B | Hyperkvasir-unlabelled | MoCo v3 | 0.861 | 0.788 | 0.867 | 0.898 |
| | Hyperkvasir-unlabelled | MAE | 0.885 | 0.816 | 0.899 | 0.906 |
| | ImageNet-1k | MoCo v3 | 0.889 | 0.824 | 0.900 | 0.907 |
| | ImageNet-1k | MAE | $\underline{0.896}$ | 0.834 | $\mathbf{0.921}$ | 0.902 |
| | ImageNet-1k | Supervised | 0.871 | 0.795 | 0.894 | 0.883 |
| | None | None | 0.755 | 0.650 | 0.785 | 0.815 |
TABLE 8. Performance in polyp segmentation with CVC-ClinicDB. The best results for each architecture are highlighted as bold, and the best results overall are underlined.
| Backbone arch. | Pretraining data | Pretraining algo. | mDice | mIoU | mPrecision | mRecall |
| :--- | :--- | :--- | ---: | ---: | ---: | ---: |
| ResNet50 | Hyperkvasir-unlabelled | MoCo v3 | 0.909 | 0.839 | 0.906 | 0.921 |
| | Hyperkvasir-unlabelled | Barlow Twins | 0.880 | 0.799 | 0.916 | 0.863 |
| | ImageNet-1k | MoCo v3 | 0.920 | 0.856 | 0.938 | 0.909 |
| | ImageNet-1k | Barlow Twins | 0.901 | 0.826 | 0.893 | 0.920 |
| | ImageNet-1k | Supervised | 0.879 | 0.805 | 0.933 | 0.861 |
| | None | None | 0.595 | 0.462 | 0.584 | 0.678 |
| ViT-B | Hyperkvasir-unlabelled | MoCo v3 | 0.909 | 0.848 | 0.916 | 0.926 |
| | Hyperkvasir-unlabelled | MAE | 0.901 | 0.838 | 0.920 | 0.903 |
| | ImageNet-1k | MoCo v3 | 0.911 | 0.848 | 0.940 | 0.901 |
| | ImageNet-1k | MAE | 0.927 | 0.867 | 0.926 | $\underline{0.933}$ |
| | ImageNet-1k | Supervised | 0.907 | 0.841 | 0.912 | 0.910 |
| | None | None | 0.813 | 0.708 | 0.839 | 0.826 |
absolute scale of the scene can only be determined from the parallax observed with a second view, the problem is inherently ill-posed and only relative scale can be determined. In this section, we detail and present our evaluation of the fine-tuned performance of backbones in monocular depth estimation in colonoscopy.

A. DATA

The data used for our depth estimation experiments is taken from the C3VD dataset [55], the only dataset that we know of which includes images captured with a clinical GIE camera (colonoscope, specifically) with paired ground truth depth maps. The dataset was collected by recording segments (sigmoid, descending, transcending, ascending, and cecum) of a high-fidelity 3D silicone phantom colon model with varying textures, emulating different patient-specific tissue features and vasculature patterns at varying optical depths, and varying illumination modes with a clinical colonoscope. Views of an equivalent 3D virtual colon model were then registered with key frames of the resulting videos, allowing

for the rendering of a ground truth depth map for each frame, as well as a surface normal, optical flow, and occlusion map. Each video is also paired with ground truth camera pose, surface model, and coverage map. 22 videos were recorded, with variation in the segment, camera pose, textures, and illumination, amounting to 10015 frames in total. We selected 18 videos (8610 frames) for training, 2 videos (977 frames) for validation, and 2 videos (528 frames) for testing, where the validation and test sets each include one randomly sampled video of the cecum and one randomly sampled video of the transcending segment, since the majority of videos were of one of these segments (8 of cecum and 9 of transcending).

B. DECODERS

For our monocular depth estimation experiments, we used the listed ViT-B backbones with the depth estimation variant of the dense prediction transformer (DPT) [53] decoder, using the implementation provided in the official codebase. Since there is no clear precedent for a decoder architecture

FIGURE 6. Targets and predictions for two randomly selected instances of the Kvasir-SEG test set. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA.
TABLE 9. Performance in monocular depth estimation in colonoscopy. The best results for each architecture are highlighted as bold, and the best results overall are underlined.
| Backbone arch. | Pretraining data | Pretraining algo. | mRMSE (cm) | mMRAE | mMAE (cm) |
| :--- | :--- | :--- | ---: | ---: | ---: |
| ResNet50 | Hyperkvasir-unlabelled | MoCo v3 | 0.177 | 0.0345 | 0.131 |
| | Hyperkvasir-unlabelled | Barlow Twins | 0.178 | 0.0362 | 0.134 |
| | ImageNet-1k | MoCo v3 | 0.207 | 0.0428 | 0.159 |
| | ImageNet-1k | Barlow Twins | 0.203 | 0.0428 | 0.156 |
| | ImageNet-1k | Supervised | 0.208 | 0.0452 | 0.159 |
| | None | None | 0.241 | 0.0515 | 0.188 |
| ViT-B | Hyperkvasir-unlabelled | MoCo v3 | 0.151 | 0.0259 | 0.104 |
| | Hyperkvasir-unlabelled | MAE | $\underline{0.143}$ | $\mathbf{0.0246}$ | $\underline{0.098}$ |
| | ImageNet-1k | MoCo v3 | 0.166 | 0.0274 | 0.116 |
| | ImageNet-1k | MAE | 0.160 | 0.0298 | 0.115 |
| | ImageNet-1k | Supervised | 0.168 | 0.0322 | 0.121 |
| | None | None | 0.226 | 0.0436 | 0.167 |
for ResNet50-based depth estimation,$^{3}$ we designed our own. This decoder, designed to mirror the architecture of ResNet50, has three fusion levels. The first starts with the final feature maps output by a ResNet50 and halves the number of channels with a $1 \times 1$ convolutional layer followed by batch normalisation, before upsampling the resulting feature maps to twice the resolution with bilinear interpolation and concatenating them with the feature maps output by the previous level of the ResNet50. The concatenated features are then processed by three blocks that have the same design as the blocks used in each level of ResNet50. The second and third
levels of the decoder follow the same logic as the first, except that they start with the output of the previous level of the decoder. A prediction head, which has the same design as the prediction head used in the depth estimation variant of the DPT decoder, is then used to predict a depth map from the output of the third level.
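A minimal sketch of one fusion level of such a decoder is given below; for brevity the ResNet50-style blocks are simplified here to plain convolution-batch norm-ReLU blocks, so this is an illustrative approximation rather than the exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionLevel(nn.Module):
    """One fusion level of the ResNet50-style depth decoder described above (sketch).

    Halves the channels of the deeper feature maps with a 1x1 conv + batch norm,
    upsamples them to the resolution of the skip feature maps, concatenates the two,
    and refines the result with a stack of blocks (simplified here).
    """

    def __init__(self, in_channels: int, skip_channels: int, num_blocks: int = 3):
        super().__init__()
        reduced = in_channels // 2
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, reduced, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduced),
        )
        blocks, channels = [], reduced + skip_channels
        for _ in range(num_blocks):
            blocks += [
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            ]
        self.blocks = nn.Sequential(*blocks)

    def forward(self, deep: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.reduce(deep)
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return self.blocks(torch.cat([x, skip], dim=1))

# e.g. FusionLevel(2048, 1024) fuses the final ResNet50 features (2048 channels)
# with the skip features from the previous ResNet50 level (1024 channels).
```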

C. FINE-TUNING PROCEDURE

In the fine-tuning of both model architectures, we use the common fine-tuning procedure hyperparameters given in Table 3 and pre-process the training images using the pipeline detailed in Table 2. Transformations are also applied to the depth maps in accordance with any spatial transformations applied to the image, with absolute depth values scaled to

FIGURE 7. Targets and post-processed predictions for a randomly selected instance from each of the test videos for C3VD. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA.

FIGURE 8. Error maps for the post-processed predictions shown in Fig. 7, illustrating the absolute error with a larger value represented by a darker shade. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA.

$[0, 1]$. The loss is then computed using the scale- and shift-invariant (SSI) mean squared error (MSE) [58] with a multiscale shift-invariant gradient matching term [59], which is computed only on the pixels that are covered by the lens (corners are not covered - see examples in Fig. 7), and we use the mSSI-MSE for pixels covered by the lens as the validation metric:
$$\mathrm{mSSI\text{-}MSE} = \frac{1}{N_e N_v} \sum_{i=1}^{N_e} \sum_{j=1}^{N_v} \left(s_i \hat{y}_{i,j} + t_i - y_{i,j}\right)^2$$
where $N_v$ is the number of pixels covered by the lens in an image, $\hat{y}_{i,j}$ is the output value for the $j^{th}$ pixel covered

by the lens in the $i^{th}$ image, $y_{i,j}$ is the corresponding target value, and $s_i$ and $t_i$ are the scale and shift computed using the closed form solution to the standard least squares problem:
$$\mathbf{h}_i^{*} = \arg\min_{\mathbf{h}_i} \sum_{j=1}^{N_v} \left(\hat{\mathbf{y}}_{i,j}^{\top} \mathbf{h}_i - y_{i,j}\right)^2$$
where $\mathbf{h}_i = (s_i, t_i)^{\top}$ and $\hat{\mathbf{y}}_{i,j} = (\hat{y}_{i,j}, 1)^{\top}$. The transformations applied to the validation images include the same padding, resizing, and normalisation applied to the training images, with the validation maps also padded and resized to

$224 \times 224$ and depth values scaled to [0, 1]. Finally, the model is trained on this basis for 50 epochs, with the parameters saved after each epoch that leads to an improvement in mSSI-MSE on the validation set, with any batch normalisation synchronised across GPUs.
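A minimal sketch of the per-image scale and shift estimation and the resulting SSI MSE term is given below, assuming flattened prediction and target tensors containing only the pixels covered by the lens; the gradient matching term is omitted.

```python
import torch

def scale_and_shift(pred: torch.Tensor, target: torch.Tensor):
    """Closed-form least-squares scale s and shift t aligning pred to target.

    pred, target: 1D tensors over the pixels covered by the lens in one image.
    """
    design = torch.stack([pred, torch.ones_like(pred)], dim=1)   # (N_v, 2)
    solution = torch.linalg.lstsq(design, target.unsqueeze(1)).solution
    s, t = solution[0, 0], solution[1, 0]
    return s, t

def ssi_mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Scale- and shift-invariant MSE for one image."""
    s, t = scale_and_shift(pred, target)
    return ((s * pred + t - target) ** 2).mean()
```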

D. EVALUATION

We evaluate the resulting monocular depth estimation models using the test data, where the images are pre-processed in the same manner as the validation images. We load two target depth maps for each image, one which is pre-processed in the same manner as the validation depth maps, for computing the scale and shift for pixels covered by the lens, and one left at the original size and scale ($[0\,\mathrm{cm}, 10\,\mathrm{cm}]$), for computing the performance. We compute and apply the scale and shift for the prediction, then resize the result to $\max(h, w) \times \max(h, w)$, where $h$ and $w$ are the height and width of the original image, crop to $h \times w$ to remove values for padded pixels, clip values to $[0, 1]$, set any values for pixels not covered by the lens to 0, and scale the resulting values to $[0\,\mathrm{cm}, 10\,\mathrm{cm}]$. We then use three metrics from the SimCol3D challenge [60], namely the arithmetic mean across the test set of: the root MSE (mRMSE), the median relative absolute error (mMRAE), and the mean absolute error (mMAE), which are only applied to pixels covered by the lens:
$$\begin{aligned}
\mathrm{mRMSE} &= \frac{1}{N_e} \sum_{i=1}^{N_e} \sqrt{\frac{1}{N_V} \sum_{j=1}^{N_V} \left(\hat{y}_{i,j} - y_{i,j}\right)^2} \\
\mathrm{mMRAE} &= \frac{1}{N_e} \sum_{i=1}^{N_e} \operatorname{median}_{j=1,\ldots,N_V} \left(\left|\frac{\hat{y}_{i,j} - y_{i,j}}{y_{i,j}}\right|\right) \\
\mathrm{mMAE} &= \frac{1}{N_e N_V} \sum_{i=1}^{N_e} \sum_{j=1}^{N_V} \left|\hat{y}_{i,j} - y_{i,j}\right|
\end{aligned}$$
where $N_V$ is the number of pixels covered by the lens in an image at its original size, $\hat{y}_{i,j}$ is the value in the post-processed prediction for the $j^{th}$ pixel covered by the lens in the $i^{th}$ image at its original size, and $y_{i,j}$ is the corresponding target value. For all metrics, a lower value indicates better performance. The results are presented in Table 9, and some examples are shown in Fig. 7 with corresponding error maps shown in Fig. 8 to help visualise the differences.
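A minimal sketch of these metrics is given below, assuming lists of post-processed predictions and targets in cm together with boolean lens-coverage masks.

```python
import numpy as np

def depth_metrics(preds, targets, masks):
    """mRMSE, mMRAE and mMAE over a test set of depth maps.

    preds, targets: depth maps in cm at the original image size.
    masks: boolean maps that are True for pixels covered by the lens.
    """
    rmse, mrae, mae = [], [], []
    for pred, target, mask in zip(preds, targets, masks):
        diff = pred[mask] - target[mask]
        rmse.append(np.sqrt(np.mean(diff ** 2)))
        mrae.append(np.median(np.abs(diff / target[mask])))
        mae.append(np.mean(np.abs(diff)))
    return float(np.mean(rmse)), float(np.mean(mrae)), float(np.mean(mae))
```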

VIII. ANALYSIS

The results presented in the previous sections primarily provide an indication of the ranking of the pretraining pipelines for each considered GIE vision task. Notably, there is some variation in this ranking, as illustrated in Fig. 9; however, the ViT-B encoder pretrained with MAE and ImageNet-1k most consistently allows for either the best, or highly competitive, downstream performance. Beyond this identification, these results also provide evidence for more general principles regarding the pretraining of encoders

FIGURE 9. Ranking of the performance of each model on each task, as measured by mF1 (anatomical landmark recognition and pathological finding characterisation), AP (polyp detection), mDice (polyp segmentation), and mRMSE (monocular depth estimation in colonoscopy), where a better rank is represented by a greater distance from the centre. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA. Additionally, we refer to anatomical landmark recognition as anat, pathological finding characterisation as path, polyp detection as det, polyp segmentation with Kvasir-SEG as segk, polyp segmentation with CVC-ClinicDB as segc, and monocular depth estimation in colonoscopy as dep.

for use as backbones in solutions to GIE vision tasks, which we reveal through an analysis presented in this section.
First, we demonstrate that self-supervised pretraining is generally more suitable than supervised pretraining. To assess this, we evaluate the relative improvement of each model that uses a backbone pretrained in a self-supervised manner with ImageNet-1k vs. the equivalent model (same architecture and task) that uses a backbone pretrained in a supervised manner with ImageNet-1k. To compute the relative improvement, we consider the primary metric for each task as mF1 (image classification), AP (object detection), mDice (semantic segmentation), and mRMSE (depth estimation), as defined in the discussion of each task. Then, for all but mRMSE, we take the absolute difference between the result and a perfect score of 1, in order to convert each score (higher is better) to a measure of error (lower is better). We do not do this for mRMSE since it is already a measure of error. We then compute the relative improvement using:
$$\%\,\mathrm{Improvement}_{SL \rightarrow SSL} = 100\, \frac{\delta_{SL} - \delta_{SSL}}{\delta_{SL}}$$
where $\delta_{SSL}$ is the error for a model with a backbone pretrained in a self-supervised manner and $\delta_{SL}$ is the error for an equivalent model (same architecture, pretraining data, and task) with a backbone pretrained in a supervised manner. Note that this analysis omits any results for pretraining with Hyperkvasir-unlabelled or no pretraining. We visualise the results of this analysis in Fig. 10, where it can be seen that self-supervised pretraining overwhelmingly provides improvements over

FIGURE 10. Improvement of self-supervised pretraining vs. supervised pretraining for same architecture and pretraining data (ImageNet-1k). For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, MoCo v3 with MC, Barlow Twins with BT, and MAE with MA. Additionally, we refer to anatomical landmark recognition as anat, pathological finding characterisation as path, polyp detection as det, polyp segmentation with Kvasir-SEG as segk, polyp segmentation with CVC-ClinicDB as segc, and monocular depth estimation in colonoscopy as dep.

FIGURE 11. Improvement of pretraining with Hyperkvasir-unlabelled vs. pretraining with ImageNet-1k for same architecture and self-supervised pretraining algorithm. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, MoCo v3 with MC, Barlow Twins with BT, and MAE with MA. Additionally, we refer to anatomical landmark recognition as anat, pathological finding characterisation as path, polyp detection as det, polyp segmentation with Kvasir-SEG as segk, polyp segmentation with CVC-ClinicDB as segc, and monocular depth estimation in colonoscopy as dep.

supervised pretraining in our experiments, with only a single marginal exception to this observed across all self-supervised pretraining algorithms, architectures, and tasks. We can therefore confidently conclude that self-supervised pretraining with ImageNet-1k generally provides better backbones than supervised pretraining with ImageNet-1k. Since supervised pretraining with ImageNet-1k is still the conventional pretraining pipeline for backbones used in solutions to vision tasks in GIE, including the state-of-the-art, this is a crucial finding.
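As a concrete illustration of this calculation, consider the Kvasir-SEG polyp segmentation results in Table 7: the ViT-B pretrained with MAE and ImageNet-1k achieves an mDice of 0.896, i.e. an error of $\delta_{SSL} = 0.104$, while the equivalent supervised ImageNet-1k ViT-B achieves 0.871, i.e. $\delta_{SL} = 0.129$, giving a relative improvement of $100 \times (0.129 - 0.104)/0.129 \approx 19\%$.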
We also demonstrate that self-supervised pretraining with ImageNet-1k is generally more suitable than self-supervised pretraining with Hyperkvasir-unlabelled in the considered downstream tasks, with the notable exception of monocular depth estimation in colonoscopy. To assess this, we use the same measures of error used in the previous analysis and evaluate the relative improvement from pretraining with
Hyperkvasir-unlabelled vs. ImageNet-1k using:
$$\%\,\mathrm{Improvement}_{IN \rightarrow HK} = 100\, \frac{\delta_{IN} - \delta_{HK}}{\delta_{IN}}$$
where $\delta_{HK}$ is the error for a model with a backbone pretrained with Hyperkvasir-unlabelled and $\delta_{IN}$ is the error for an equivalent model (same architecture, pretraining algorithm, and task) with a backbone pretrained with ImageNet-1k.
Note that this analysis omits any results for supervised pretraining or no pretraining. We visualise the results of this analysis in Fig. 11, where it can be seen that self-supervised pretraining with ImageNet-1k generally provides better performance than self-supervised pretraining with Hyperkvasir-unlabelled, with exceptions including the anatomical landmark recognition models with MAE pretrained backbones, as well as all monocular depth estimation models. While the result for the anatomical landmark

FIGURE 12. Improvement of ViT-B over ResNet50 for same pretraining pipeline (data and algorithm). For conciseness, we denote Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, supervised pretraining with SL, and no pretraining with NA-NA. Additionally, we refer to anatomical landmark recognition as anat, pathological finding characterisation as path, polyp detection as det, polyp segmentation with Kvasir-SEG as segk, polyp segmentation with CVC-ClinicDB as segc, and monocular depth estimation in colonoscopy as dep.

FIGURE 13. Distribution of Dice score (higher is better) across the test set for each Kvasir-SEG polyp segmentation model, visualised as box and violin plots. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA. For clarity, the violin plots for ResNet50 models are coloured red and the violin plots for ViT-B models are coloured blue.

While the result for the anatomical landmark recognition models with MAE pretrained backbones shows only a marginal improvement for pretraining with Hyperkvasir-unlabelled vs. ImageNet-1k, the results for the depth estimation models are more significant. This implies that, in comparison to the other tasks, the similarity of the pretraining data to the data used in the depth estimation experiments is much more critical than the amount of pretraining data. While this finding is significant for the development of solutions to vision tasks in GIE, it may have broader implications, and further work may find this to be true for monocular depth estimation in general.
Finally, we demonstrate that models with a ViT-B backbone are generally better than models with a ResNet50 backbone in polyp segmentation and monocular depth estimation in colonoscopy, generally worse in polyp detection, and generally similar in image classification. To assess this,
we use the same measures of error used in the previous analyses and evaluate the relative improvement from using a ViT-B vs. a ResNet50 using:
$$\%\text{Improvement}_{RN \rightarrow VT} = 100\,\frac{\delta_{RN} - \delta_{VT}}{\delta_{RN}}$$
where $\delta_{VT}$ is the error for a model with a ViT-B backbone and $\delta_{RN}$ is the error for an equivalent model (same pretraining pipeline and task) with a ResNet50 backbone.
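The same form of computation applies here; a minimal hedged example with hypothetical error values:

```python
# Hypothetical errors for an equivalent pair of models (same pretraining pipeline
# and task): delta_rn for the ResNet50 backbone, delta_vt for the ViT-B backbone.
delta_rn, delta_vt = 0.140, 0.125
print(f"%Improvement_RN->VT = {100 * (delta_rn - delta_vt) / delta_rn:.1f} %")
# Positive values favour the ViT-B backbone; negative values favour the ResNet50.
```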
Note that this analysis omits any results for pretraining with Barlow Twins or MAE. We visualise the results of this analysis in Fig. 12, where it can be seen that the ResNet50 and ViT-B models perform similarly in anatomical landmark recognition and pathological finding characterisation, that the ResNet50 models perform better than the ViT-B models in polyp detection, and that the ViT-B models generally perform better in the dense prediction tasks of polyp segmentation and monocular depth estimation in colonoscopy.

FIGURE 14. Distribution of Dice score (higher is better) across the test set for each CVC-ClinicDB polyp segmentation model, visualised as box and violin plots. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA. For clarity, the violin plots for ResNet50 models are coloured red and the violin plots for ViT-B models are coloured blue.

FIGURE 15. Distribution of RMSE (lower is better) across the test set for each C3VD monocular depth estimation model, visualised as box and violin plots. For conciseness, we denote ResNet50s with RN, ViT-Bs with VT, Hyperkvasir-unlabelled with HK, ImageNet-1k with IN, MoCo v3 with MC, Barlow Twins with BT, MAE with MA, supervised pretraining with SL, and no pretraining with NA-NA. For clarity, the violin plots for ResNet50 models are coloured red and the violin plots for ViT-B models are coloured blue.

We further demonstrate the advantage of the ViT-B models over the ResNet50 models in dense prediction by visualising the distribution of performance across the Kvasir-SEG, CVC-ClinicDB, and C3VD test sets in Fig. 13, Fig. 14, and Fig. 15, respectively. Such visualisations are only suitable for these experiments since the metrics measure the performance on each instance prior to averaging, which is not the case for our image classification or object detection experiments. While we observe that ResNet50 models are typically better on polyp detection, we note that the polyp detection model with an MAE pretrained backbone with ImageNet-1k performs better than all but two models with ResNet50 backbones with respect to AP, and performs best with respect to AP$_{50}$, further
emphasising the particular robustness of this pretraining pipeline. There is still much to understand about the relative strengths and weaknesses of these architectures, particularly in the context of domains where the availability of data is much lower than that of everyday images, such as GIE. However, these results provide useful insights into which architecture may be better suited to each considered task.
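As an illustration of how such per-image distributions can be produced, the sketch below computes a Dice score for each test image from binary prediction and ground-truth masks and overlays box and violin plots with matplotlib. This is a hedged, minimal example: the mask arrays are random placeholders and the model labels are illustrative only; it is not the evaluation code used in this work.

```python
import numpy as np
import matplotlib.pyplot as plt


def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Dice coefficient between two binary masks (higher is better)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


# Placeholder per-model lists of (prediction, ground-truth) mask pairs.
# In practice these would come from running each fine-tuned model on the test set.
rng = np.random.default_rng(0)
models = {
    "VT-IN-MA": [(rng.random((224, 224)) > 0.4, rng.random((224, 224)) > 0.4) for _ in range(20)],
    "RN-IN-SL": [(rng.random((224, 224)) > 0.5, rng.random((224, 224)) > 0.4) for _ in range(20)],
}

per_model_dice = [[dice_score(p, t) for p, t in pairs] for pairs in models.values()]

fig, ax = plt.subplots()
ax.violinplot(per_model_dice, showextrema=False)  # violin shows the full distribution
ax.boxplot(per_model_dice, widths=0.15)           # box plot overlaid for quartiles/median
ax.set_xticks(range(1, len(models) + 1))
ax.set_xticklabels(models.keys())
ax.set_ylabel("Dice score")
plt.show()
```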
One final note is that, as expected, pretraining with any of the considered pipelines consistently leads to better fine-tuned performance than training on the downstream task from random initialisation.

IX. CONCLUSION

In this work, we studied the pretraining of image encoders for use as backbones in solutions to vision tasks in GIE,
considering variation in encoder architecture, pretraining pipeline (data and algorithm), and downstream task. This was motivated by recent opportunities to improve on the convention of supervised pretraining of backbones on image classification with ImageNet-1k, namely modern self-supervised pretraining algorithms and Hyperkvasir-unlabelled, a relatively large dataset of unlabelled GIE images. We primarily identified the best pretraining pipeline and architecture, out of those considered, for each considered task by adapting the encoders to the tasks with state-of-the-art decoders, fine-tuning the resulting models on datasets that include suitable annotations for the tasks, and evaluating the performance on test sets with well-established metrics. Overall, we found that a ViT-B backbone pretrained using the MAE algorithm and ImageNet-1k was most robust. Additionally, our findings suggest three general principles regarding the pretraining of encoders for use as backbones in solutions to vision tasks in GIE, which we revealed through an analysis of the downstream performance. These include:
  • Self-supervised pretraining generally produces more suitable backbones than supervised pretraining. This result is significant as it is still the convention to use backbones that have been pretrained on ImageNet-1k in a supervised manner - this implies that the current state-of-the-art could be improved upon through selfsupervised pretraining. Additionally, this result contrasts with the results observed for tasks involving everyday images, where supervised pretraining typically leads to better performance.
  • Self-supervised pretraining with ImageNet-1k generally produces more suitable backbones than self-supervised pretraining with Hyperkvasir-unlabelled, with the notable exception of monocular depth estimation in colonoscopy, where the similarity of the pretraining data to the downstream data appears to be more critical than the amount of pretraining data. While this is a useful insight for the development of monocular depth estimation models for GIE, this finding may also be true for monocular depth estimation solutions in other domains.
  • ResNet50 backbones are generally better for polyp detection, whereas ViT-B backbones are generally better for polyp segmentation and monocular depth estimation in colonoscopy, and both architectures perform similarly in anatomical landmark recognition and pathological finding characterisation.
We hope that this paper encourages further work on the topic of pretraining image encoders for use as backbones in solutions to vision tasks in GIE. Firstly, the scope of this work could be extended to more tasks and datasets, as well as decoder architectures and fine-tuning procedures. For example, we considered the Faster R-CNN object detection pipeline, which is a 2-stage detector, and it is worth investigating whether our findings are also true for 1-stage detectors. Additionally, we considered supervised fine-tuning
for monocular depth estimation in colonoscopy, while self-supervised fine-tuning for monocular depth estimation is also a promising research avenue and may benefit from an investigation into pretraining. Also, the impact of existing pretraining pipelines on the hybrid architectures that combine both convolutional and transformer components, and which have found success in polyp segmentation, could be investigated. We believe that such research should lay the groundwork for the development of backbones that are better suited to tasks in GIE, which should allow for significant advancement in the state-of-the-art. Beyond extending the scope of this study and the further investigation of existing pretraining algorithms, we suggest that future work also studies the development of pretraining algorithms specifically for this domain, as well as for other encoder architectures.

ACKNOWLEDGMENT

Data Access Statement: This publication is supported by multiple datasets which are openly available as cited in the 'References' section of this paper.

REFERENCES

[1] F. Renna, M. Martins, A. Neto, A. Cunha, D. Libónio, M. Dinis-Ribeiro, and M. Coimbra, “Artificial intelligence for upper gastrointestinal endoscopy: A roadmap from technology development to clinical practice,” Diagnostics, vol. 12, no. 5, p. 1278, May 2022.

[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248-255.

[3] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” 2015, arXiv:1503.02531.

[4] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1800-1807.

[5] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, “Scaling vision transformers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 1204-1213.
[6] S. Zhu, J. Gao, L. Liu, M. Yin, J. Lin, C. Xu, C. Xu, and J. Zhu, “Public imaging datasets of gastrointestinal endoscopy for artificial intelligence: A review,” J. Digit. Imag., vol. 36, no. 6, pp. 2578-2601, Dec. 2023.

[7] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, “Revisiting unreasonable effectiveness of data in deep learning era,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 843-852.

[8] M. Huh, P. Agrawal, and A. A. Efros, “What makes ImageNet good for transfer learning?” 2016, arXiv:1608.08614.

[9] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton, “A simple framework for contrastive learning of visual representations,” in Proc. 37th Int. Conf. Mach. Learn., vol. 119, 2020, pp. 1597-1607.

[10] M. Caron, H. Touvron, I. Misra, H. Jegou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9630-9640.

[11] M. Oquab et al., “DINOv2: Learning robust visual features without supervision,” 2023, arXiv:2304.07193.
[12] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 15979-15988.

[13] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” in Proc. Int. Conf. Mach. Learn. (ICML), vol. 139, 2021, pp. 12310-12320.

[14] X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised vision transformers,” 2021, arXiv:2104.02057.

[15] H. Bao, L. Dong, S. Piao, and F. Wei, “BEiT: BERT pre-training of image transformers,” 2021, arXiv:2106.08254.

[16] A. Bardes, J. Ponce, and Y. LeCun, “VICReg: Variance-invariancecovariance regularization for self-supervised learning,” 2021, arXiv:2105.04906.

The associate editor coordinating the review of this manuscript and approving it for publication was Binit Lukose.

1. Two different formulations of mF1 can be found in the literature; (13) is the more robust [42] arithmetic mean of individual F1 scores.

2. All considered pretraining was done with images resized to 224 × 224, whereas object detection typically involves larger images, e.g. 1024 × 1024.

3. Popular dense prediction architectures that adopt certain details of ResNets in their design and which may be suitable for depth estimation, such as ResUNet [56] or ResUNet++ [57], do not actually use a ResNet encoder.