Test Time Adaptation for Blind Image Quality Assessment
Abstract
While the design of blind image quality assessment (IQA) algorithms has improved significantly, the distribution shift between the training and testing scenarios often leads to a poor performance of these methods at inference time. This motivates the study of test time adaptation (TTA) techniques to improve their performance at inference time. Existing auxiliary tasks and loss functions used for TTA may not be relevant for quality-aware adaptation of the pre-trained model. In this work, we introduce two novel quality-relevant auxiliary tasks at the batch and sample levels to enable TTA for blind IQA. In particular, we introduce a group contrastive loss at the batch level and a relative rank loss at the sample level to make the model quality aware and adapt to the target data. Our experiments reveal that even using a small batch of images from the test distribution helps achieve significant improvement in performance by updating the batch normalization statistics of the source model.
1 Introduction
The problem of image quality assessment (IQA) is extremely important in diverse image capture, processing, and sharing applications. However, a reference image is often not available for quality assessment. No reference (NR) or blind IQA primarily deals with the question of predicting image quality without using a reference image. Such NR IQA algorithms are often designed using machine learning approaches. More recently, deep learning based approaches have been extremely successful in achieving impressive performance. However, IQA applications are quite diverse and deal with several different distortions and distributional shifts. IQA models often have poor generalization ability and find it difficult to perform well under such shifts.
Test time adaptation (TTA) has emerged as an important approach to address distributional shifts at test time [10]. It has been shown that by modifying a few global parameters of the model using a suitable loss that does not require the ground truth, one can significantly improve the performance of the model on the test data. Further, source-free adaptation, where the source data on which the original model was trained is not available while updating the model, is a realistic setting. While such approaches have been studied extensively in image classification literature [35, 37], there is hardly any literature on TTA for IQA.
There are multiple challenges in designing TTA for IQA. Typical losses used for TTA, such as entropy minimization [37], are not applicable for IQA. For example, IQA is often studied in the regression context. This makes it difficult to extend models based on class confidences [28] or class prototypes [12] for classification to IQA. Also, the relevance of other self-supervised tasks such as rotation prediction [18], context prediction [5], colorization [19], noise prediction [1], feature clustering [3] for adapting IQA models is not clear. While contrastive learning has also been employed for TTA [24], such a framework is not explicitly based on contrasting image quality, and its relevance is also not clear.
Our main contribution is in the design of auxiliary tasks to enable TTA for IQA. We start with a source model trained on a large IQA dataset and fine-tune the model on individual batches of test samples. The first task we introduce for adaptation is based on contrasting groups of low and high quality images in a batch. Thus, we exploit the initial knowledge of the source model and try to adapt it by enforcing quality relationships among the batch of samples. Such a group contrastive (GC) learning approach fits naturally to our setting to account for any errors on individual samples that the source model may be prone to.
In contrast to the GC learning that depends on the batch, our second auxiliary task is an image specific task based on distorted augmentations of different types. Here, our goal is to enable the model to rank the image quality of further distorted versions of each test sample. We explore the role of different distortion types to leverage the maximum benefit of this task. While GC learning is more effective when samples in a batch are diverse in quality, the rank order based learning is more effective when the quality of the images is not extremely poor. Thus, a combination of the tasks helps overcome the shortcoming of both tasks and leads to an overall superior performance.
We study the TTA problem under different settings of source and target datasets for multiple state-of-the-art IQA models. Our results show significant improvements over the source models and highlight the importance of TTA for IQA. We summarize our main contributions as follows:
- We propose source-free test time adaptation techniques in the context of blind image quality assessment to mitigate distribution shifts between train and test data.
- We formulate two novel quality-aware self-supervised auxiliary tasks to adapt to the test data distribution. While group contrastive learning helps capture quality discriminative information among several images within a batch, rank ordering helps maintain the quality order between two different distorted versions of the same image.
- We show that our TTA method can significantly improve the performance of four different quality-aware source models, each on four different test IQA databases.
Figure 1: Block diagram of the general architecture for test time adaptation. During test time training, the normalization layers of the feature extractor are adapted by optimizing a combination of the rank and GC losses. During inference, we use the updated feature extractor and the pre-trained quality regressor to predict the quality score of a test image.
2 Related Work
2.1 Test Time Adaptation
One of the first pieces of work on TTA [35] introduces a joint training framework using a loss for the main task and a self-supervised auxiliary task loss. The choice of a relevant auxiliary task is a challenging part of the design. Sun et al. [35] use rotation prediction as a pretext task for image classification applications. Researchers also explore simpler tasks that are highly correlated with the main task, such as feature alignment with the batch and a simple contrastive loss framework [24] for adaptation.
However, these methods require the source data to train the source model all over again to enable TTA.
TENT [37] studies source-free TTA, where entropy minimization-based adaptation even outperforms some methods that use the source data. Moreover, only the batch normalization parameters of the model are adapted in TENT. There are several works on batch norm statistics adaptation [29, 32] to improve the robustness of the model at test time. SHOT [20] presents a clustering-based pseudo-labeling method to align features from the target domain to the source domain using an information maximization loss. While a plethora of such methods exists, the auxiliary tasks used in these methods are not relevant for IQA.
2.2 No-Reference Image Quality Assessment
Most classical NR IQA algorithms are mainly based on natural scene statistics (NSS) [26, 42, 23]. In [27], the naturalness of the distorted image in the wavelet domain is modeled based on NSS. Saad et al. [31] design a similar NSS model in the discrete cosine transform (DCT) domain. CORNIA [39] and HOSA [38] are among the earliest codebook learning based methods to predict quality. With the emergence of deep learning and the availability of large subject-rated IQA databases, various general-purpose NR IQA methods have been designed based on convolutional neural network (CNN) architectures [40, 22, 2, 25]. Some of these methods require end-to-end training [2, 25] of deep neural networks (DNN), while others [43, 34, 9, 15] are based on updating pre-trained models with some modifications.
More recently, Zhang et al. [43] propose a method to train bilinear DNNs to simultaneously model both authentic and synthetic distortions. Su et al. [34] design a self-adaptive hyper network to provide weights for the quality prediction module parameters and handle various types of distortions and content in the images. Several methods [9, 15] explore transformer-based architectures along with a CNN to capture dependencies between local and global features. MetaIQA [44] proposes meta-learning on synthetic distortions by using a shared quality prior knowledge model to adapt to any kind of distortion.
All the existing methods assume that the train and test data come from the same distribution. If there is a distribution shift across different databases, we need adaptation of the pre-trained model to learn about the target distributional information.
3 Methodology
We propose a novel self-supervised Test Time Adaptation technique for Image Quality Assessment (TTA-IQA) to adapt pre-trained quality models and mitigate distribution shifts between the source and target data. We consider source-free TTA, where we only have access to the pre-trained quality-aware models and no access to the source training data.
When a batch of test data arrives, we adapt the model using the batch without knowledge of the corresponding ground truth quality scores.
3.1 Approach
Let the model trained on the source data be $\mathcal{M}_\theta$, where $\theta = (\theta_f, \theta_r)$ corresponds to the parameters of the network, and $\theta_f$ and $\theta_r$ correspond to the parameters of the feature extractor $f$ and the regression layers $r$, respectively. Thus, for an input image $x$, we denote the predicted quality as $\mathcal{M}_\theta(x) = r_{\theta_r}(f_{\theta_f}(x))$. Since our goal is to learn the distribution shift between train and test data, we only update the parameters of the feature extractor, $\theta_f$, to align the features between the train and test distributions in a lower dimensional space.
The key challenge in TTA is the choice of a self-supervised auxiliary task that is highly correlated with the main task of IQA. However, relying too much on the auxiliary task can affect the performance of the main task. To prevent loss of learnt information induced by the auxiliary task [4], we first project the features to a lower dimensional space using a non-linear projection head $g$, parameterized by $\theta_g$. The auxiliary task drives the adaptation of the feature extractor through the projection head. Thus our model now has three parts parametrized by $(\theta_f, \theta_r, \theta_g)$ to resemble a Y-shape as shown in Figure 1.
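To make the parametrization concrete, the following is a minimal PyTorch sketch of such a Y-shaped model, assuming a ResNet-style backbone; the module structure and the 256-dimensional projection head follow the description in Section 4.1, but the exact layers are illustrative rather than the authors' implementation.

```python
import torch.nn as nn
import torchvision.models as models

class YShapedIQAModel(nn.Module):
    """Feature extractor f shared by a quality regressor r (main task)
    and a projection head g (self-supervised auxiliary task)."""

    def __init__(self, proj_dim=256):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # Feature extractor: all backbone layers except the final classifier.
        self.feature_extractor = nn.Sequential(*list(backbone.children())[:-1])
        feat_dim = backbone.fc.in_features
        # Quality regressor: frozen at test time, predicts a scalar score.
        self.quality_regressor = nn.Linear(feat_dim, 1)
        # Projection head: maps features to a lower-dimensional space
        # where the auxiliary losses are computed.
        self.projection_head = nn.Sequential(
            nn.Linear(feat_dim, proj_dim), nn.ReLU(), nn.Linear(proj_dim, proj_dim)
        )

    def forward(self, x):
        feats = self.feature_extractor(x).flatten(1)
        quality = self.quality_regressor(feats).squeeze(-1)
        z = self.projection_head(feats)
        return quality, z
```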
When a batch of test instances $\{x_1, x_2, \ldots, x_B\}$ arrives, we extract features from the feature extractor, followed by the projection head, and update the set of parameters $(\theta_f, \theta_g)$ by optimizing a self-supervised objective function $\mathcal{L}_{\text{self}}$. Updating all the model parameters of the feature extractor can cause the model to diverge too much from training, and the performance can drop drastically. Inspired by prior work on test time entropy minimization [37] and improving robustness for test data [32], we only update the linear and lower-dimensional feature modulation parameters. In a neural network, normalization layers satisfy these properties. So we only adapt the batch normalization layers by updating their affine (scale and shift) parameters to mitigate distributional shifts. Thus we update these parameters by optimizing the auxiliary task loss given by
$$(\hat{\theta}_f, \hat{\theta}_g) = \arg\min_{\theta_f, \theta_g} \; \mathcal{L}_{\text{self}}\big(x_1, x_2, \ldots, x_B;\, \theta_f, \theta_g\big) \qquad (1)$$
After optimizing the above loss function, we use the updated feature extractor parameters $\hat{\theta}_f$ and the pre-trained quality regressor parameters $\theta_r$ to predict the quality for a batch of test data. Since the distribution across batches can vary significantly, we ignore the earlier updated model and restart the TTA from the source model weights for each new batch of test instances. Thus the updated target model always depends only on the incoming test data.
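As an illustration of restricting the update to normalization layers, the snippet below (a sketch under the model structure assumed above, not the authors' code) gathers only the batch normalization affine parameters of the feature extractor, together with the projection head parameters, for the optimizer.

```python
import torch.nn as nn

def collect_adaptable_params(model):
    """Return only the BN affine (scale/shift) parameters of the feature
    extractor plus all projection-head parameters; everything else stays frozen."""
    params = []
    for module in model.feature_extractor.modules():
        if isinstance(module, nn.BatchNorm2d):
            # Affine parameters modulate features linearly and form a small,
            # low-dimensional set of parameters that is safe to adapt at test time.
            params += [module.weight, module.bias]
    params += list(model.projection_head.parameters())
    # Freeze everything first, then re-enable gradients only for the chosen subset.
    for p in model.parameters():
        p.requires_grad_(False)
    for p in params:
        p.requires_grad_(True)
    return params
```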
Figure 2: Overview of the group contrastive loss framework for test time adaptation. For a batch of images, two different groups are formed based on the pseudo-labels given by the source model. The group contrastive loss tries to minimize the distance between features extracted from images belonging to the same group while maximizing the distance between features from different groups.
3.2 Self-Supervised Auxiliary Tasks
Our goal is to carefully choose self-supervised auxiliary tasks that capture quality-aware distributional information to adapt the feature encoder. We formulate two novel and complementary self-supervised learning techniques which help to learn the distribution shift between train and test data. These self-supervised objective functions are - 1) Group Contrastive Loss and 2) Rank Loss. While the GC loss works well when there is a reasonable separation of quality among the samples in a batch, the rank loss works better even when the quality of the batch samples is similar. On the other hand, the rank loss is meaningful only when the quality of the input image is not extremely low, while the GC loss is independent of the quality of a given image. Thus, a combination of the two losses renders our TTA extremely effective across various scenarios.
3.2.1 Group Contrastive Loss
While contrastive learning has been used for TTA of deep image classification, its direct application does not appear relevant to the task of IQA. Thus, we introduce group contrastive (GC) learning as an auxiliary task for TTA of IQA models.
In particular, we form two groups of images from a single batch of $B$ images based on the pseudo-labels given by the pre-trained source model. We sort the images in ascending order as $x_{(1)}, x_{(2)}, \ldots, x_{(B)}$ based on the pseudo-labels, where $x_{(1)}$ corresponds to the lowest quality image in the batch. We then segregate the images with high quality scores (say, the top $\alpha$ fraction of the data in a batch) and include them in a group of higher-quality images $G_H$. Similarly, we separate out the lower quality images (say, the lowest $\alpha$ fraction of the images in a batch) to form another group $G_L$. Here we assume that $\alpha B$ is an integer for simplicity; else, it can be rounded off to the nearest integer.
The premise behind our loss for GC learning is that images from the same quality group should give similar feature representations in a lower dimensional space while features of images from different groups are separated out. Thus image pairs from the same group act as positive pairs, and image pairs from different groups act as negative pairs. By separating out these two groups, the model adapts itself by better separating the intermediate quality samples.
Let a positive pair $x_i$ and $x_j$ come from the same group, i.e., either both $x_i, x_j \in G_L$ or both $x_i, x_j \in G_H$. We use a modified NT-Xent contrastive loss [4] as our objective function. For a pair of images coming from the lower quality group, where $x_i \in G_L$ and $x_j \in G_L$, the GC loss is defined as
$$\ell^{L}(x_i, x_j) = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big) + \sum_{x_k \in G_H} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)} \qquad (2)$$
where $z_i$ represents the feature at the output of the projection head for sample $x_i$, $\mathrm{sim}(\cdot, \cdot)$ refers to the cosine similarity between two features, and $z_k$ is the feature extracted from an image $x_k$ belonging to the higher quality group $G_H$. Also, $\tau$ represents the temperature scaling parameter. While we define the loss $\ell^{L}(x_i, x_j)$ above when $x_i, x_j \in G_L$, we can define a similar loss $\ell^{H}(x_i, x_j)$ when $x_i, x_j \in G_H$. For all pairs of images within the same group, we obtain the GC loss and add them together to obtain the final loss $\mathcal{L}_{GC}$ as
$$\mathcal{L}_{GC} = \sum_{x_i, x_j \in G_L,\, i \neq j} \ell^{L}(x_i, x_j) \;+\; \sum_{x_i, x_j \in G_H,\, i \neq j} \ell^{H}(x_i, x_j) \qquad (3)$$
A block diagram of the GC loss is shown in Figure 2.
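A minimal PyTorch sketch of this group contrastive loss is given below, assuming the batch has already been scored with pseudo-labels from the source model; the defaults for the grouping fraction `alpha` and the temperature `tau` are illustrative rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def group_contrastive_loss(z, pseudo_scores, alpha=0.25, tau=0.1):
    """z: (B, D) projected features; pseudo_scores: (B,) source-model quality
    predictions. Forms low/high quality groups of size floor(alpha*B) and
    contrasts same-group (positive) pairs against cross-group (negative) pairs.
    alpha and tau are illustrative defaults."""
    B = z.size(0)
    k = max(1, int(alpha * B))
    order = torch.argsort(pseudo_scores)      # indices sorted by ascending quality
    low, high = order[:k], order[-k:]         # lowest- and highest-quality groups
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                     # cosine similarities scaled by temperature

    def one_group(group, other):
        loss, count = 0.0, 0
        for i in group:
            for j in group:
                if i == j:
                    continue
                pos = sim[i, j].exp()
                neg = sim[i, other].exp().sum()   # negatives come from the other group
                loss = loss - torch.log(pos / (pos + neg))
                count += 1
        return loss / max(count, 1)

    return one_group(low, high) + one_group(high, low)
```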
Figure 3: Rank loss framework for test time adaptation. We degrade the test image to two different extents. We then project these three images into the feature space through a shared feature extractor and projection head. The distances between the features extracted from each distorted image and the original test image are computed to compare their rank order.
3.2.2 Rank Loss
Given a sample of images from test data, the rank loss helps adapt the features by capturing quality-aware distributional information at the sample level. We introduce a ranking objective to learn the quality orders between two distorted versions of the test images that are quality discriminable.
We distort each test image $x_i$, $i = 1, \ldots, B$, in the minibatch of size $B$ with two different degrees of degradation of a given distortion type. Let $x_i^{h}$ and $x_i^{l}$ denote the more and less distorted images, respectively. The degradation types include synthetic distortions such as blur, compression, and noise. We choose the degradation levels randomly from two sets of parameters that are sufficiently far apart.
Ideally, the distance between the features extracted from the test image and the highly distorted image should be greater than that between the features extracted from the test image and the lower distorted image. Our rank loss tries to capture this order and adapt the model to the test data.
We project the triplet of images $(x_i, x_i^{h}, x_i^{l})$ into the feature space through the feature extractor and the projection head. We measure the distance between each distortion-augmented image and the original test image using the Euclidean distance in the projected feature space. Mathematically, let $z_i$ be the feature extracted for test sample $x_i$. Similarly, we define $z_i^{h}$ and $z_i^{l}$. Let $d_i^{h}$ denote the distance between $z_i$ and $z_i^{h}$. Similarly, $d_i^{l}$ denotes the distance between $z_i$ and $z_i^{l}$. The target model is now fine-tuned to obtain the correct ranking between the distances by achieving $d_i^{h} > d_i^{l}$, as shown in Figure 3.
The probability of achieving this order is estimated by passing the difference in distances through a sigmoid function as
$$p_i = \sigma\big(d_i^{h} - d_i^{l}\big) = \frac{1}{1 + \exp\big(-(d_i^{h} - d_i^{l})\big)} \qquad (4)$$
We use the binary cross-entropy loss between the true label $y_i = 1$ and the predicted probability $p_i$ to obtain the rank loss as
$$\ell_{\text{rank}}(x_i) = -\big(y_i \log p_i + (1 - y_i)\log(1 - p_i)\big) \qquad (5)$$
For a batch of size $B$, the overall rank loss is given as
$$\mathcal{L}_{\text{rank}} = \sum_{i=1}^{B} \ell_{\text{rank}}(x_i) \qquad (6)$$
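A minimal sketch of this rank loss is shown below, assuming the projected features of the test images and of their two degraded versions are already available; the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def rank_loss(z, z_high, z_low):
    """z, z_high, z_low: (B, D) projected features of the test images and of
    their more and less degraded versions. Enforces d(z, z_high) > d(z, z_low)."""
    d_high = torch.norm(z - z_high, dim=1)   # distance to the heavily distorted version
    d_low = torch.norm(z - z_low, dim=1)     # distance to the mildly distorted version
    p = torch.sigmoid(d_high - d_low)        # probability that the correct order holds
    target = torch.ones_like(p)              # the true label is always 1 (correct order)
    return F.binary_cross_entropy(p, target)
```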
One of the challenges in the rank loss is selecting the distortion type for every image. The original test image may already be distorted, which poses difficulties in deciding the distortion type that can help TTA.
For example, if the test image is extremely blurred and we blur it again with two different levels, both images may be visually indistinguishable, thereby limiting the extent to which adaptation can help. We exploit the source model’s knowledge to overcome this limitation in choosing the distortion type.
We predict the quality scores of the pair $(x_i^{h}, x_i^{l})$ for each distortion type using the source model. We hypothesize that the pair with the maximal difference in predicted quality is sufficiently different in visual quality. Thus choosing such a pair in Equation (5) can help adapt the model.
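One possible way to realize this selection is sketched below; `distort` is a hypothetical helper that applies a named distortion at a given severity, and `predict_quality` is assumed to wrap the source model and return a scalar quality score. The selection simply keeps the distortion type whose degraded pair the source model separates the most.

```python
import torch

def select_distortion_pair(image, predict_quality, distort, types=("blur", "jpeg", "noise")):
    """For each distortion type, create a heavily and a mildly degraded version of
    `image` and keep the pair whose predicted qualities differ the most.
    `distort` and `predict_quality` are hypothetical helpers."""
    best_pair, best_gap = None, -1.0
    with torch.no_grad():
        for t in types:
            x_high = distort(image, t, severity="high")
            x_low = distort(image, t, severity="low")
            gap = (predict_quality(x_low) - predict_quality(x_high)).abs().item()
            if gap > best_gap:
                best_gap, best_pair = gap, (x_high, x_low)
    return best_pair
```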
Our overall self-supervised TTA objective function is a combination of both the rank and GC loss given as
$$\mathcal{L}_{\text{self}} = \mathcal{L}_{\text{rank}} + \lambda\, \mathcal{L}_{GC} \qquad (7)$$
where $\lambda$ is a hyper-parameter used to combine the losses.
| Backbone | Method | KonIQ-10k SROCC | KonIQ-10k PLCC | PIPAL SROCC | PIPAL PLCC | CID2013 SROCC | CID2013 PLCC | LIVE-IQA SROCC | LIVE-IQA PLCC |
|---|---|---|---|---|---|---|---|---|---|
| TReS | Baseline | 0.6520 | 0.6955 | 0.3845 | 0.4078 | 0.5272 | 0.6463 | 0.5435 | 0.4450 |
| TReS | Rotation | 0.6506 | 0.6805 | 0.4061 | 0.4114 | 0.5706 | 0.6651 | 0.5866 | 0.5311 |
| TReS | TTA-IQA | 0.6578 | 0.7074 | 0.4278 | 0.4204 | 0.6032 | 0.6710 | 0.6722 | 0.5963 |
| MUSIQ | Baseline | 0.6304 | 0.6802 | 0.3190 | 0.3414 | 0.5173 | 0.6032 | 0.2596 | 0.3351 |
| MUSIQ | Rotation | 0.6577 | 0.7154 | 0.3665 | 0.3693 | 0.5487 | 0.6164 | 0.3512 | 0.3976 |
| MUSIQ | TTA-IQA | 0.6693 | 0.7230 | 0.3743 | 0.3731 | 0.5499 | 0.6220 | 0.3649 | 0.4031 |
| HyperIQA | Baseline | 0.5861 | 0.6313 | 0.3037 | 0.3304 | 0.4895 | 0.6123 | 0.5143 | 0.4377 |
| HyperIQA | Rotation | 0.6033 | 0.6536 | 0.3268 | 0.3482 | 0.4902 | 0.6150 | 0.6268 | 0.5469 |
| HyperIQA | TTA-IQA | 0.5960 | 0.6495 | 0.3653 | 0.3767 | 0.5039 | 0.5988 | 0.6218 | 0.5438 |
| MetaIQA | Baseline | 0.5162 | 0.4460 | 0.3287 | 0.2955 | 0.7213 | 0.6817 | 0.7323 | 0.6732 |
| MetaIQA | Rotation | 0.5823 | 0.5311 | 0.3353 | 0.3042 | 0.7177 | 0.6745 | 0.7271 | 0.6868 |
| MetaIQA | TTA-IQA | 0.5838 | 0.5428 | 0.4073 | 0.3510 | 0.7809 | 0.7399 | 0.7999 | 0.7726 |
Table 1: Comparison of TTA-IQA with popular NR IQA methods and with a popular auxiliary task, rotation prediction, on authentically and synthetically distorted datasets. Bold entries indicate the best performance of each individual quality-aware model on the corresponding dataset.
4 Experiments
4.1 Quality Models, Datasets and Metrics
We evaluate TTA on four popular IQA databases using four different quality-aware models. In particular, we consider state-of-the-art deep IQA models such as TReS [9], MUSIQ [15], HyperIQA [34], and MetaIQA [44]. These models contain a ResNet backbone with batch normalization layers, which we model as the feature extractor $f$, and we only update its batch normalization parameters. We model the rest of the network as the quality regressor $r$. TReS and MUSIQ use transformers as a part of their architecture, which we include as a part of the quality regressor in all our main experiments. We also explore the adaptation of transformers by including them as part of the feature extractor and updating their layer normalization statistics in the supplementary material.
We project the features extracted from the last layer of the ResNet through a 256-dimensional fully connected (FC) layer corresponding to the self-supervised projection head $g$. These lower dimensional features are used to adapt the model at test time.
Three IQA models, TReS, MUSIQ, and HyperIQA are trained on the LIVEFB database [41] containing 39,810 images. MetaIQA is trained on two synthetically distorted databases, TID2013 [30] and KADID-10k [21].
We choose challenging databases to evaluate the generalization capability of our TTA-IQA method. The test datasets are described as follows:
KonIQ-10k [11] is a popular in-the-wild authentically distorted database consisting of 10,073 quality-scored images.
PIPAL [13] is a large-scale IQA database for evaluating perceptual image restoration. It contains 29k distorted images, including outputs of 19 different GAN-based algorithms.
CID2013 [36] consists of 480 images in six image sets captured by 79 imaging devices.
LIVE-IQA [33] consists of 779 synthetically distorted images from 29 reference images.
We evaluate our results using Spearman’s rank-order correlation coefficient (SROCC) and Pearson’s linear correlation coefficient (PLCC).
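For reference, these two correlations can be computed with SciPy as in the short sketch below.

```python
from scipy.stats import spearmanr, pearsonr

def evaluate(predicted, ground_truth):
    """Return (SROCC, PLCC) between predicted and subjective quality scores."""
    srocc = spearmanr(predicted, ground_truth).correlation
    plcc = pearsonr(predicted, ground_truth)[0]
    return srocc, plcc
```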
4.2 Implementation Details
We implement our setup in PyTorch and conduct all the experiments with a 16 GB NVIDIA RTX A4000 GPU. During TTA, we randomly select a fixed-size patch from the input image and apply quality-preserving augmentations such as horizontal and vertical flips before passing it through the network. We use the ADAM [16] optimizer with a learning rate of 0.001 and set the number of iterations to 3. All the experiments were run with five different seeds using a batch size of 8, and the final results were obtained by averaging.
With regard to the self-supervised auxiliary tasks, we use a combination of the rank loss and the GC loss as our objective function, as given in Equation (7). We choose the fraction $\alpha$ to determine the groups and compute the GC loss with a fixed temperature $\tau$. For the rank loss, the test image is synthetically blurred using a Gaussian blur filter, with the standard deviation of the kernel drawn from two sets of parameters, a larger one for the more distorted version and a smaller one for the less blurred version. For compression distortions, we specify a lower range of quality factors for higher compression rates and a higher range for lower compression. Similarly, we add zero mean Gaussian white noise to the test image, with a larger variance range for higher distortion and a smaller range for lower distortion.
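Putting the pieces together, a per-batch adaptation loop under the settings above (ADAM, learning rate 0.001, 3 iterations, batch size 8) might look like the sketch below; it reuses the helper sketches introduced earlier, and the loss weight `lam` is a placeholder since the exact value of $\lambda$ is not specified here.

```python
import copy
import torch

def adapt_and_predict(source_model, batch, distort, lam=1.0, iters=3, lr=1e-3):
    """Source-free TTA for one test batch: adapt only the BN affine and
    projection-head parameters with the combined rank + GC loss of Equation (7),
    then predict quality. The source model itself is left untouched, so every
    new batch starts from it afresh."""
    model = copy.deepcopy(source_model)
    params = collect_adaptable_params(model)          # BN affine + projection head
    optimizer = torch.optim.Adam(params, lr=lr)

    with torch.no_grad():
        pseudo_scores, _ = source_model(batch)        # pseudo-labels for grouping

    # Build distortion-augmented pairs once per batch using the source model.
    pairs = [select_distortion_pair(x.unsqueeze(0), lambda v: source_model(v)[0], distort)
             for x in batch]
    x_high = torch.cat([p[0] for p in pairs])
    x_low = torch.cat([p[1] for p in pairs])

    for _ in range(iters):
        optimizer.zero_grad()
        _, z = model(batch)
        _, z_high = model(x_high)
        _, z_low = model(x_low)
        loss = rank_loss(z, z_high, z_low) + lam * group_contrastive_loss(z, pseudo_scores)
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        quality, _ = model(batch)                     # frozen regressor on adapted features
    return quality
```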
| Backbone | Rank | GC | KonIQ-10k SRCC | KonIQ-10k PLCC | PIPAL SRCC | PIPAL PLCC | CID2013 SRCC | CID2013 PLCC | LIVE-IQA SRCC | LIVE-IQA PLCC |
|---|---|---|---|---|---|---|---|---|---|---|
| TReS | ✓ | | 0.6562 | 0.6989 | 0.4171 | 0.4204 | 0.6016 | 0.6736 | 0.6705 | 0.5908 |
| TReS | | ✓ | 0.6516 | 0.6946 | 0.4666 | 0.5183 | 0.5366 | 0.6493 | 0.7160 | 0.6290 |
| TReS | ✓ | ✓ | 0.6578 | 0.7074 | 0.4278 | 0.4204 | 0.6032 | 0.6710 | 0.6722 | 0.5963 |
| MUSIQ | ✓ | | 0.6549 | 0.7149 | 0.3768 | 0.3718 | 0.5216 | 0.6034 | 0.3634 | 0.4024 |
| MUSIQ | | ✓ | 0.6611 | 0.7176 | 0.3585 | 0.3642 | 0.5446 | 0.6159 | 0.3299 | 0.3914 |
| MUSIQ | ✓ | ✓ | 0.6693 | 0.7230 | 0.3743 | 0.3731 | 0.5499 | 0.6220 | 0.3649 | 0.4031 |
| HyperIQA | ✓ | | 0.5928 | 0.6455 | 0.3616 | 0.3732 | 0.5039 | 0.5991 | 0.5505 | 0.5050 |
| HyperIQA | | ✓ | 0.6094 | 0.6567 | 0.3333 | 0.3552 | 0.5120 | 0.6255 | 0.6331 | 0.5592 |
| HyperIQA | ✓ | ✓ | 0.5960 | 0.6495 | 0.3653 | 0.3767 | 0.5039 | 0.5988 | 0.6218 | 0.5438 |
| MetaIQA | ✓ | | 0.5580 | 0.5118 | 0.3992 | 0.3437 | 0.7579 | 0.7282 | 0.7894 | 0.7534 |
| MetaIQA | | ✓ | 0.5414 | 0.4636 | 0.3710 | 0.3287 | 0.7861 | 0.7067 | 0.7566 | 0.6927 |
| MetaIQA | ✓ | ✓ | 0.5838 | 0.5428 | 0.4073 | 0.3510 | 0.7809 | 0.7399 | 0.7999 | 0.7726 |
Table 2: Results of the ablation study on authentically and synthetically distorted datasets using popular quality-aware source models. Bold entries indicate the best performance among the three settings.
4.3 Performance Evaluation
Table 1 shows the performance of our method on all four test datasets using all four quality-aware models. In addition to the source model, we also compare with rotation prediction [35] as the auxiliary task during TTA. A comparison with such a task helps understand the role of quality-aware losses for the TTA of IQA models.
We observe that TTA-IQA using the combination of rank and GC loss outperforms the source models in all the cases.
Note that the PIPAL dataset has a huge distribution shift from the authentically distorted LIVEFB dataset on which three of the models were trained. While most source models perform very poorly on PIPAL, we achieve 10%-20% improvement over the source models.
On the KonIQ-10k dataset, TTA-IQA gives around 1%-10% improvement over the source model. As both KonIQ-10k and LIVEFB are authentically distorted datasets, the distribution shift is reasonably small, leading to smaller improvements using adaptation.
CID2013 is also an authentically distorted dataset and gives similar improvements of around 2%-13% over the source models.
Our experiments on the LIVE-IQA database provide a significantly greater improvement of 10%-33% owing to the shift from authentic to synthetic distortions for three models.
Figure 4: Scatter plots of quality predictions for similar quality images from the LIVE-IQA dataset, for the source model and for TTA using only the rank loss or only the GC loss. In each case, we mark with ellipses two groups of images with similar human opinion scores (DMOS).
We also observe from Table 1 that our approach outperforms the rotation prediction task in most cases. Other TTA ideas, such as TENT [37] and training using masked autoencoders [7], are not applicable for the IQA task. In particular, the notion of entropy in regression tasks is not clear. On the other hand, masked reconstructions tend to lead to quality degradation. Thus, we do not compare with them.
4.4 Ablation Study & Other Experiments
Need for Rank Loss and GC Loss for TTA.
We perform an ablation study with respect to the two losses in Table 2. We see from the results that the rank loss and the GC loss individually always improve on the source models. Further, the combination of the rank and GC losses provides an even better performance than the individual losses in most evaluation scenarios. Even in scenarios where the combined loss achieves the second-best performance, it is very close to the best. For the rest of the experiments in this section, we present results on all four datasets using the TReS method alone.
We now discuss scenarios where the GC loss is particularly more useful than the rank loss. If the test image is extremely distorted, none of the three distortion types can create much difference in the perceptual quality of the two degraded versions. To understand this further, we consider highly distorted images (corresponding to a differential mean opinion score (DMOS) greater than 70) of different distortion types in the LIVE-IQA dataset. We apply TTA only for these images using different losses. The SROCC performance for these images in Figure 5 reveals that the rank loss is not very effective. However, the GC loss works well in such scenarios.
Conversely, we also discuss scenarios where the rank loss is more useful than the GC loss. To illustrate this idea, we select multiple images having similar quality, with DMOS values in a narrow range, from the LIVE-IQA dataset. We adapt the source model for this set of images using both the GC and rank losses. From Figure 4, we observe that the rank loss leads to much better adaptation in terms of the resulting model correlating with the ground truth scores. Intuitively, when the input images have a similar quality, the pseudo-labels given by the source model may be noisy, leading to an inaccurate grouping of the images into the lower and higher quality groups. This can adversely affect the GC loss. So the rank loss is more effective than the GC loss for test batches with similar quality images.
Figure 5: TTA-IQA with each loss individually on images of different distortion types from the LIVE-IQA dataset.
Effect of Selecting Multiple Distortion Types in the Rank Loss.
We experiment with the impact of different kinds of distortion types while incorporating the rank loss. In particular, we consider three scenarios: one where only one of the distortion types is used in the rank loss, another where all the three types are used to obtain three loss terms which are summed together to update the model, and a third, where only one of the distortion types is chosen based on the discussion in Section 3.2.2. We observe from Table 3 that choosing the distortion type with the maximal difference in quality gives the best performance for the rank loss. This experiment validates our hypothesis that only a rank loss that can discriminate between the degraded image pairs is useful for TTA.
| Database | Blur | Compression | Noise | All | Best |
|---|---|---|---|---|---|
| KonIQ-10k | 0.656 | 0.656 | 0.667 | 0.615 | 0.656 |
| PIPAL | 0.384 | 0.415 | 0.389 | 0.415 | 0.417 |
| CID2013 | 0.527 | 0.609 | 0.606 | 0.560 | 0.609 |
| LIVE-IQA | 0.578 | 0.605 | 0.618 | 0.634 | 0.671 |
Table 3: Effect of selecting multiple distortion types in the rank loss.
Selection of Group Size by Varying $\alpha$. We explore different values of $\alpha$ for constructing the GC loss. For a batch size of 8, the possible choices of the group size are 2, 3, and 4, corresponding to $\alpha = 0.25$, $0.375$, and $0.5$, respectively. From Table 4, we observe that the performances are roughly similar for different values of $\alpha$, with a slightly superior performance for $\alpha = 0.25$. We note that as $\alpha$ increases, the groups are larger and probably less discriminative in terms of quality, leading to a slightly poorer performance.
| Database | $\alpha = 0.25$ | $\alpha = 0.375$ | $\alpha = 0.5$ |
|---|---|---|---|
| KonIQ-10k | 0.6516 | 0.6497 | 0.6467 |
| PIPAL | 0.4666 | 0.4602 | 0.4616 |
| CID2013 | 0.5366 | 0.5486 | 0.5536 |
| LIVE-IQA | 0.7160 | 0.7171 | 0.6937 |
Table 4: Effect of different group sizes obtained by varying $\alpha$ in the GC loss.
Selection of Number of Groups for GC Loss. In our formulation, we define the GC loss only between two contrastive groups. In principle, we can extend our idea of GC loss to multiple groups. In particular, we can cluster the images in a batch into multiple groups and apply the GC loss between every pair of groups and sum the loss terms. Thus, images coming from the same group act as positive pairs, and images from two different groups are considered as negative pairs. From Table 5, we observe that the number of groups does not impact the performance much.
| Database | 2 groups | 3 groups | 4 groups |
|---|---|---|---|
| KonIQ-10k | 0.6516 | 0.6524 | 0.6509 |
| PIPAL | 0.4666 | 0.4668 | 0.4663 |
| CID2013 | 0.5366 | 0.5327 | 0.5388 |
| LIVE-IQA | 0.7160 | 0.7011 | 0.6883 |
Table 5: Effect of varying the number of groups in the GC loss.
Choice of Number of Iterations for Learning the Auxiliary Task. We also examine the effect of the number of iterations of parameter updates during TTA. Figure 6 shows that the performance on the CID2013 dataset is fairly robust up to 4 iterations. Beyond that, the model overfits the auxiliary task and leads to poor performance at test time.
Figure 6: Effect of increasing the number of iterations.
5 Conclusion
Test time adaptation has become very popular due to its simplicity and the lack of need for end-to-end training. Our work is perhaps one of the first attempts to design a TTA method in the context of blind IQA. While most IQA methods focus on making the models robust enough to perform well in cross-database experiments, our TTA-IQA method can outperform existing state-of-the-art methods because of the adaptation at test time. We formulate novel self-supervised auxiliary tasks using the rank and group contrastive losses, which can learn quality-aware information from the test data. While we primarily explored TTA for IQA, it would be interesting to understand the role of TTA for video quality assessment as well.
Acknowledgement: This work was supported in part by a grant from the Department of Science and Technology, Government of India, under grant CRG/2020/003516.
Appendix
Appendix A Experiments with Transformer Parameters Adaptation
We evaluate the performance of TTA-IQA by updating transformer parameters for better adaptation. For TReS [9] and MUSIQ [15], we incorporate the transformer as a part of the feature extractor. Thus, only the last fully connected (FC) layer works as the quality regressor. Current literature [17] shows that the layer normalization (LN) parameters of transformers are a good choice for test time adaptation. In the case of a vision transformer, the CLS token is also used for adaptation. Table 6 shows the performance on all four datasets by optimizing various parameters using the combination of the rank and group contrastive losses. We observe that adaptation of the transformer parameters alone gives a performance equivalent to the adaptation of the batch normalization (BN) parameters of the convolutional neural network (CNN) backbone.
Thus, it is possible to update models that only use a transformer and achieve significant gains using TTA.
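As an illustration of the alternatives compared in Table 6, the snippet below collects layer normalization affine parameters (and, for a vision transformer, the CLS token) instead of, or in addition to, the BN parameters; attribute names such as `cls_token` are illustrative and depend on the backbone implementation.

```python
import torch.nn as nn

def collect_transformer_params(model, use_ln=True, use_cls=True):
    """Gather LN affine parameters and (optionally) the CLS token for adaptation."""
    params = []
    for module in model.modules():
        if use_ln and isinstance(module, nn.LayerNorm):
            params += [module.weight, module.bias]
    if use_cls and hasattr(model, "cls_token"):   # illustrative attribute name
        params.append(model.cls_token)
    return params
```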
| Backbone | Method | KonIQ-10k SRCC | KonIQ-10k PLCC | PIPAL SRCC | PIPAL PLCC | CID2013 SRCC | CID2013 PLCC | LIVE-IQA SRCC | LIVE-IQA PLCC |
|---|---|---|---|---|---|---|---|---|---|
| TReS | Baseline | 0.6520 | 0.6955 | 0.3845 | 0.4078 | 0.5272 | 0.6463 | 0.5435 | 0.4450 |
| TReS | BN Only | 0.6731 | 0.7151 | 0.4392 | 0.2710 | 0.6173 | 0.6800 | 0.6707 | 0.6006 |
| TReS | LN Only | 0.6694 | 0.7176 | 0.4128 | 0.2690 | 0.6193 | 0.6850 | 0.5998 | 0.5458 |
| TReS | BN+LN | 0.6621 | 0.7059 | 0.4417 | 0.3484 | 0.6123 | 0.6758 | 0.6723 | 0.5948 |
| MUSIQ | Baseline | 0.6304 | 0.6802 | 0.3190 | 0.3414 | 0.5173 | 0.6032 | 0.2596 | 0.3351 |
| MUSIQ | BN Only | 0.6588 | 0.7174 | 0.3772 | 0.3744 | 0.5275 | 0.6126 | 0.3350 | 0.3954 |
| MUSIQ | LN Only | 0.6582 | 0.7155 | 0.3757 | 0.3762 | 0.5403 | 0.6109 | 0.3661 | 0.4044 |
| MUSIQ | CLS Only | 0.6618 | 0.7207 | 0.3776 | 0.3739 | 0.5491 | 0.6212 | 0.3511 | 0.3993 |
| MUSIQ | BN+LN | 0.6552 | 0.7145 | 0.3736 | 0.3732 | 0.5329 | 0.6095 | 0.3767 | 0.4028 |
| MUSIQ | BN+LN+CLS | 0.6598 | 0.7172 | 0.3751 | 0.3764 | 0.5253 | 0.6048 | 0.3569 | 0.3999 |
Table 6: Comparison of TTA-IQA on authentically and synthetically distorted datasets using popular transformer-based NR IQA methods while adapting different sets of parameters.
| Method | LIVEFB → PIPAL | LIVEFB → KonIQ-10k | LIVEFB → SPAQ | LIVEFB → LIVEC | LIVE-IQA → PIPAL | LIVE-IQA → CID2013 | LIVE-IQA → KonIQ-10k | LIVE-IQA → LIVEC |
|---|---|---|---|---|---|---|---|---|
| Baseline | 0.385 | 0.652 | 0.707 | 0.726 | 0.402 | 0.519 | 0.521 | 0.563 |
| TTA-IQA | 0.428 | 0.658 | 0.755 | 0.728 | 0.449 | 0.523 | 0.522 | 0.565 |
Table 7: SRCC performance evaluation of TTA-IQA using the TReS backbone trained on the LIVEFB and LIVE-IQA databases.
| Method | TReS | MUSIQ | HyperIQA | MetaIQA |
|---|---|---|---|---|
| Baseline | 0.535 | 0.404 | 0.496 | 0.591 |
| Rotation | 0.529 | 0.425 | 0.479 | 0.540 |
| TTA-IQA | 0.586 | 0.450 | 0.493 | 0.608 |
Table 8: SRCC performance analysis of TTA-IQA on the DSLR database.
Appendix B Visualizing Images that Justify Need for Both Rank and GC Loss
In Section 4.4, we justify the need for both the rank and GC loss for effective TTA. Here we give a few visual examples of images corresponding to that analysis. In Figure 7, we observe that the images have very poor quality. Hence, distorting these images further creates distorted versions that have perceptually indistinguishable quality ratings. On the other hand, Figure 8 shows similar quality images. Here, as the images have almost similar visual quality, it is difficult to form two different quality groups based on pseudo-labels given by the source model.
Figure 7: Examples of highly distorted images where the GC loss is more effective than the rank loss.
Figure 8: Examples of similar quality images where the rank loss is more effective than the GC loss.
Appendix C Performance of TTA-IQA on Other Databases
C.1 Performance evaluation with synthetic database as source database
In the main paper, we reported performances where the source model is trained on the camera-captured LIVEFB [41] database and tested on various authentic and synthetic databases. In Table 7, we provide more such evaluations with respect to different intra and inter domain comparisons. In particular, we present results when TReS [9] is trained on LIVEFB and evaluated on more intra domain datasets such as SPAQ [6] and LIVEC [8]. We also present results when TReS is trained on a synthetic dataset such as LIVE-IQA [33] and tested on authentic datasets as well as other datasets containing restored images. We observe that TTA-IQA gives a reasonable performance gain over the baseline even when there is a domain shift between the source (synthetic) data and the target (authentic) data.
C.2 Performance on Low-Light Restored Database
To understand the impact of larger domain shifts, we also evaluate on a new database, DSLR [14], where images captured in low light are restored via various image restoration algorithms. Since novel distortions are generated while restoring such low-light images, we evaluate the performance of TTA-IQA with LIVEFB as the source database and DSLR as the target database. We see that TTA-IQA helps improve the performance of most of the methods.
References
- [1] Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 517–526. PMLR, 06–11 Aug 2017.
- [2] Sebastian Bosse, Dominique Maniry, Klaus-Robert Müller, Thomas Wiegand, and Wojciech Samek. Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing, 27(1):206–219, 2018.
- [3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, pages 139–156, Cham, 2018. Springer International Publishing.
- [4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
- [5] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
- [6] Yuming Fang, Hanwei Zhu, Yan Zeng, Kede Ma, and Zhou Wang. Perceptual quality assessment of smartphone photography. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3674–3683, 2020.
- [7] Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A. Efros. Test-time training with masked autoencoders. arXiv preprint arXiv:2209.07522, 2022.
- [8] Deepti Ghadiyaram and Alan C. Bovik. Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing, 25(1):372–387, 2016.
- [9] S Alireza Golestaneh, Saba Dadsetan, and Kris M Kitani. No-reference image quality assessment via transformers, relative ranking, and self-consistency. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3209–3218, 2022.
- [10] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019.
- [11] Vlad Hosu, Hanhe Lin, Tamas Sziranyi, and Dietmar Saupe. Koniq-10k: An ecologically valid database for deep learning of blind image quality assessment. Trans. Img. Proc., 29:4041–4056, jan 2020.
- [12] Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier adjustment module for model-agnostic domain generalization. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 2427–2440. Curran Associates, Inc., 2021.
- [13] Gu Jinjin, Cai Haoming, Chen Haoyu, Ye Xiaoxing, Jimmy S. Ren, and Dong Chao. Pipal: A large-scale image quality assessment dataset for perceptual image restoration. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 633–651, Cham, 2020. Springer International Publishing.
- [14] Vignesh Kannan, Sameer Malik, Nithin C. Babu, and Rajiv Soundararajan. Quality assessment of low-light restored images: A subjective study and an unsupervised model. IEEE Access, 11:68216–68230, 2023.
- [15] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5148–5157, October 2021.
- [16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [17] Takeshi Kojima, Yutaka Matsuo, and Yusuke Iwasawa. Robustifying vision transformer without retraining from scratch by test-time class-conditional feature alignment. In Lud De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 1009–1016. International Joint Conferences on Artificial Intelligence Organization, 7 2022. Main Track.
- [18] Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), Vancouver, Canada, Apr. 2018.
- [19] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- [20] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6028–6039. PMLR, 13–18 Jul 2020.
- [21] Hanhe Lin, Vlad Hosu, and Dietmar Saupe. Kadid-10k: A large-scale artificially distorted iqa database. In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pages 1–3, 2019.
- [22] Xialei Liu, Joost van de Weijer, and Andrew D. Bagdanov. Rankiqa: Learning from rankings for no-reference image quality assessment. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
- [23] Yutao Liu, Ke Gu, Yongbing Zhang, Xiu Li, Guangtao Zhai, Debin Zhao, and Wen Gao. Unsupervised blind image quality evaluation via statistical measurements of structure, naturalness, and perception. IEEE Transactions on Circuits and Systems for Video Technology, 30(4):929–943, 2020.
- [24] Yuejiang Liu, Parth Kothari, Bastien van Delft, Baptiste Bellot-Gurlet, Taylor Mordan, and Alexandre Alahi. Ttt++: When does self-supervised test-time training fail or thrive? In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 21808–21820. Curran Associates, Inc., 2021.
- [25] Kede Ma, Wentao Liu, Kai Zhang, Zhengfang Duanmu, Zhou Wang, and Wangmeng Zuo. End-to-end blind image quality assessment using deep neural networks. IEEE Transactions on Image Processing, 27(3):1202–1213, 2018.
- [26] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695–4708, 2012.
- [27] Anush Krishna Moorthy and Alan Conrad Bovik. Blind image quality assessment: From natural scene statistics to perceptual quality. IEEE Transactions on Image Processing, 20(12):3350–3364, 2011.
- [28] Chaithanya Kumar Mummadi, Robin Hutmacher, Kilian Rambach, Evgeny Levinkov, Thomas Brox, and Jan Hendrik Metzen. Test-time adaptation to distribution shift by confidence maximization and input transformation, 2021.
- [29] Zachary Nado, Shreyas Padhy, D Sculley, Alexander D’Amour, Balaji Lakshminarayanan, and Jasper Snoek. Evaluating prediction-time batch normalization for robustness under covariate shift. arXiv preprint arXiv:2006.10963, 2020.
- [30] Nikolay Ponomarenko, Lina Jin, Oleg Ieremeiev, Vladimir Lukin, Karen Egiazarian, Jaakko Astola, Benoit Vozel, Kacem Chehdi, Marco Carli, Federica Battisti, and C.-C. Jay Kuo. Image database tid2013: Peculiarities, results and perspectives. Signal Processing: Image Communication, 30:57–77, 2015.
- [31] Michele A. Saad, Alan C. Bovik, and Christophe Charrier. Blind image quality assessment: A natural scene statistics approach in the dct domain. IEEE Transactions on Image Processing, 21(8):3339–3352, 2012.
- [32] Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 11539–11551. Curran Associates, Inc., 2020.
- [33] H.R. Sheikh, M.F. Sabir, and A.C. Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing, 15(11):3440–3451, 2006.
- [34] Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang. Blindly assess image quality in the wild guided by a self-adaptive hyper network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- [35] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9229–9248. PMLR, 13–18 Jul 2020.
- [36] Toni Virtanen, Mikko Nuutinen, Mikko Vaahteranoksa, Pirkko Oittinen, and Jukka Häkkinen. Cid2013: A database for evaluating no-reference image quality assessment algorithms. IEEE Transactions on Image Processing, 24(1):390–402, 2015.
- [37] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, 2021.
- [38] Jingtao Xu, Peng Ye, Qiaohong Li, Haiqing Du, Yong Liu, and David Doermann. Blind image quality assessment based on high order statistics aggregation. IEEE Transactions on Image Processing, 25(9):4444–4457, 2016.
- [39] Peng Ye, Jayant Kumar, Le Kang, and David Doermann. Unsupervised feature learning framework for no-reference image quality assessment. In 2012 IEEE conference on computer vision and pattern recognition, pages 1098–1105. IEEE, 2012.
- [40] Peng Ye, Jayant Kumar, Le Kang, and David Doermann. Unsupervised feature learning framework for no-reference image quality assessment. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1098–1105, 2012.
- [41] Zhenqiang Ying, Haoran Niu, Praful Gupta, Dhruv Mahajan, Deepti Ghadiyaram, and Alan Bovik. From patches to pictures (paq-2-piq): Mapping the perceptual space of picture quality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
- [42] Guangtao Zhai, Xiongkuo Min, and Ning Liu. Free-energy principle inspired visual quality assessment: An overview. Digital Signal Processing, 91:11–20, 2019.
- [43] Weixia Zhang, Kede Ma, Jia Yan, Dexiang Deng, and Zhou Wang. Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Transactions on Circuits and Systems for Video Technology, 30(1):36–47, 2020.
- [44] Hancheng Zhu, Leida Li, Jinjian Wu, Weisheng Dong, and Guangming Shi. Metaiqa: Deep meta-learning for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.