
Test Time Adaptation for Blind Image Quality Assessment

Subhadeep Roy  Shankhanil Mitra  Soma Biswas  Rajiv Soundararajan
Indian Institute of Science, Bengaluru, India
subhadeeproy2000@gmail.com, {shankhanilm, somabiswas, rajivs}@iisc.ac.in
Abstract

While the design of blind image quality assessment (IQA) algorithms has improved significantly, the distribution shift between the training and testing scenarios often leads to a poor performance of these methods at inference time. This motivates the study of test time adaptation (TTA) techniques to improve their performance at inference time. Existing auxiliary tasks and loss functions used for TTA may not be relevant for quality-aware adaptation of the pre-trained model. In this work, we introduce two novel quality-relevant auxiliary tasks at the batch and sample levels to enable TTA for blind IQA. In particular, we introduce a group contrastive loss at the batch level and a relative rank loss at the sample level to make the model quality aware and adapt to the target data. Our experiments reveal that even using a small batch of images from the test distribution helps achieve significant improvement in performance by updating the batch normalization statistics of the source model.

* Authors contributed equally to this work.
§ https://github.com/subhadeeproy2000/TTA-IQA

1 Introduction

The problem of image quality assessment (IQA) is extremely important in diverse image capture, processing, and sharing applications. However, a reference image is often not available for quality assessment. No reference (NR) or blind IQA primarily deals with the question of predicting image quality without using a reference image. Such NR IQA algorithms are often designed using machine learning approaches. More recently, deep learning based approaches have achieved impressive performance. However, IQA applications are quite diverse and deal with several different distortions and distributional shifts. IQA models often have poor generalization ability and find it difficult to perform well under such shifts.

Test time adaptation (TTA) has emerged as an important approach to address distributional shifts at test time [10]. It has been shown that by modifying a few global parameters of the model using a suitable loss that does not require the ground truth, one can significantly improve the performance of the model on the test data. Further, source-free adaptation, where the source data on which the original model was trained is not available while updating the model, is a realistic setting. While such approaches have been studied extensively in image classification literature [35, 37], there is hardly any literature on TTA for IQA.

There are multiple challenges in designing TTA for IQA. Typical losses used for TTA, such as entropy minimization [37], are not applicable for IQA. For example, IQA is often studied in the regression context. This makes it difficult to extend models based on class confidences [28] or class prototypes [12] for classification to IQA. Also, the relevance of other self-supervised tasks such as rotation prediction [18], context prediction [5], colorization [19], noise prediction [1], and feature clustering [3] for adapting IQA models is not clear. While contrastive learning has also been employed for TTA [24], such a framework is not explicitly based on contrasting image quality, and its relevance is likewise unclear.

Our main contribution is in the design of auxiliary tasks to enable TTA for IQA. We start with a source model trained on a large IQA dataset and fine-tune the model on individual batches of test samples. The first task we introduce for adaptation is based on contrasting groups of low and high quality images in a batch. Thus, we exploit the initial knowledge of the source model and try to adapt it by enforcing quality relationships among the batch of samples. Such a group contrastive (GC) learning approach fits naturally to our setting to account for any errors on individual samples that the source model may be prone to.

In contrast to the GC learning that depends on the batch, our second auxiliary task is an image specific task based on distorted augmentations of different types. Here, our goal is to enable the model to rank the image quality of further distorted versions of each test sample. We explore the role of different distortion types to leverage the maximum benefit of this task. While GC learning is more effective when samples in a batch are diverse in quality, the rank order based learning is more effective when the quality of the images is not extremely poor. Thus, a combination of the tasks helps overcome the shortcoming of both tasks and leads to an overall superior performance.

We study the TTA problem under different settings of source and target datasets for multiple state of the art IQA models. Our results show significant improvements over the source models and demonstrate the importance of TTA for IQA. We summarize our main contributions as follows:

  • We propose source-free test time adaptation techniques in the context of blind image quality assessment to mitigate distribution shifts between train and test data.

  • We formulate two novel quality-aware self-supervised auxiliary tasks to adapt to the test data distribution. While group contrastive learning helps capture quality discriminative information among several images within a batch, rank ordering helps maintain the quality order between two different distorted versions of the same image.

  • We show that our TTA method can significantly improve the performance of four different quality-aware source models, each on four different test IQA databases.

Figure 1: Block diagram of a general architecture for test time adaptation. At test time training, the normalization layers of the feature extractor are adapted by optimizing the combination of rank and GC loss. At inference time, we predict the quality scores of test images using the updated feature extractor and pre-trained quality regressor.

2 Related Work

2.1 Test Time Adaptation

One of the first pieces of work on TTA [35] introduces a joint training framework using a loss for the main task and a self-supervised auxiliary task loss. The choice of a relevant auxiliary task is a challenging part of the design. Sun et al. [35] use rotation prediction as a pre-text task for image classification applications. Researchers also explore simpler tasks that are highly correlated with the main task, such as feature alignment with the batch and a simple contrastive loss framework [24] for adaptation. However, these methods require the source data to train the source model all over again to enable TTA.

TENT [37] studies source-free TTA, where entropy minimization-based adaptation even outperforms some methods that use the source data. Moreover, only the batch normalization parameters of the model are adapted in TENT. There are several works on batch norm statistics adaptation [29, 32] to improve the robustness of the model at test time. SHOT [20] presents a clustering-based pseudo-labeling method to align features from the target domain to the source domain using an information maximization loss. While a plethora of methods exists as above, the auxiliary tasks in these methods are not relevant for IQA.

2.2 No-Reference Image Quality Assessment

Most classical NR IQA algorithms are mainly based on natural scene statistics (NSS) [26, 42, 23]. In [27], the naturalness of the distorted image in the wavelet domain is modeled based on NSS. Saad et al. [31] also design the NSS model in the discrete cosine transform (DCT) domain. CORNIA [39] and HOSA [38] are among the earliest codebook learning based methods to predict quality. With the emergence of deep learning and the availability of large subject-rated IQA databases, various general-purpose NR IQA methods have been designed based on convolutional neural network (CNN) architectures [40, 22, 2, 25]. Some of these methods require end-to-end training [2, 25] of deep neural networks (DNN), while others [43, 34, 9, 15] are based on updating pre-trained models with some modifications.

More recently, Zhang et al. [43] propose a method to train bilinear DNNs to simultaneously model both authentic and synthetic distortions. Su et al. [34] design a self-adaptive hyper network to provide weights for the quality prediction module parameters and handle various types of distortions and content in the images. Several methods [9, 15] explore transformer-based architectures along with a CNN to capture dependencies between local and global features. MetaIQA [44] proposes meta-learning on synthetic distortions by using a shared quality prior knowledge model to adapt to any kind of distortion.

All the existing methods assume that the train and test data come from the same distribution. If there is a distribution shift across different databases, we need to adapt the pre-trained model to learn the target distribution.

3 Methodology

We propose a novel self-supervised Test Time Adaptation technique for Image Quality Assessment (TTA-IQA) to adapt pre-trained quality models and mitigate distribution shifts between the source and target data. We consider source-free TTA, where we only have access to the pre-trained quality-aware models and no access to the source training data. When a batch of test data $D=\{x_{j}\}_{j=1}^{n}$ arrives, we adapt the model using the batch without knowledge of the corresponding ground truth $\{y_{j}\}_{j=1}^{n}$.

3.1 Approach

Let the model trained on the source data be $f_{\theta}$, where $\theta=(\theta_{e},\theta_{c})$ corresponds to the parameters of the network, and $\theta_{e}$ and $\theta_{c}$ correspond to the parameters of the feature extractor and regression layers, respectively. Thus, for an input image $x$, we denote $f_{\theta}(x)=f_{\theta_{c}}(f_{\theta_{e}}(x))$. Since our goal is to learn the distribution shift between train and test data, we only update the parameters of the feature extractor, $\theta_{e}$, to align the features between the train and test distributions in a lower dimensional space.

The key challenge in TTA is the choice of a self-supervised auxiliary task that is highly correlated with the main task of IQA. However, relying too much on the auxiliary task can affect the performance of the main task. To prevent loss of learnt information induced by the auxiliary task [4], we first project the features to a lower dimensional space using a non-linear projection head $f_{\theta_{s}}$, parameterized by $\theta_{s}$. The auxiliary task drives the adaptation of the feature extractor through the projection head. Thus our model now has three parts parametrized by $(\theta_{e},\theta_{s},\theta_{c})$ to resemble a Y-shape, as shown in Figure 1.

When a batch of test instances $D$ arrives, we extract features from the feature extractor, followed by the projection head, and update the set of parameters $(\theta_{e},\theta_{s})$ by optimizing a self-supervised objective function $\mathcal{L}_{s}(D)$. Updating all the model parameters of the feature extractor $\theta_{e}$ can cause the model to diverge too much from training, and the performance can drop drastically. Inspired by prior work on test time entropy minimization [37] and improving robustness for test data [32], we only update the linear and lower-dimensional feature modulation parameters. In a neural network, normalization layers satisfy these properties. So we only adapt the batch normalization layers by updating the affine parameters to mitigate distributional shifts. Thus we update these parameters by optimizing the auxiliary task loss given by

\theta_{e}^{*},\theta_{s}^{*}=\underset{(\theta_{e},\theta_{s})}{\operatorname{argmin}}\ \mathcal{L}_{s}(D) \qquad (1)

After optimizing the above loss function, we use the updated feature extractor parameters $\theta_{e}^{*}$ and the pre-trained quality regressor parameters $\theta_{c}$ to predict the quality for a batch of test data. Since the distribution across batches can vary significantly, we discard the earlier updated model and restart TTA from the source model weights for every new batch of test instances. Thus, the updated target model depends only on the incoming test data.
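To make the adaptation procedure concrete, the following is a minimal PyTorch sketch of the per-batch update in Equation (1). The names `feature_extractor`, `projection_head`, `auxiliary_loss`, and `source_state` are placeholders and not from the released code; only the batch normalization affine parameters of the feature extractor and the projection head parameters are updated.

```python
# A minimal sketch of the per-batch adaptation in Equation (1): only the BN
# affine parameters of the feature extractor and the projection head are
# updated, and every new batch restarts from the source weights.
import torch
import torch.nn as nn


def bn_affine_parameters(module: nn.Module):
    """Collect only the affine (weight/bias) parameters of BatchNorm layers."""
    params = []
    for m in module.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            params += [p for p in (m.weight, m.bias) if p is not None]
    return params


def adapt_on_batch(feature_extractor, projection_head, auxiliary_loss, batch,
                   source_state, lr=1e-3, iters=3):
    # Restart from the source model: adaptation never carries over across batches.
    feature_extractor.load_state_dict(source_state)
    feature_extractor.train()  # BN layers use the current test-batch statistics

    params = bn_affine_parameters(feature_extractor) + list(projection_head.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)

    for _ in range(iters):
        z = projection_head(feature_extractor(batch))  # projected features
        loss = auxiliary_loss(z, batch)                # L_s = L_gc + lambda * L_r
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return feature_extractor
```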

Figure 2: The overview framework of group contrastive loss for test time adaptation. For a batch of images, two different groups are formed based on pseudo-labels given by the source model. The group contrastive loss tries to minimize the distance between features extracted from images belonging to the same group while maximizing the distance between features from different groups.

3.2 Self-Supervised Auxiliary Tasks

Our goal is to carefully choose self-supervised auxiliary tasks that capture quality-aware distributional information to adapt the feature encoder. We formulate two novel and complementary self-supervised learning techniques which help to learn the distribution shift between train and test data. These self-supervised objective functions are: 1) the group contrastive (GC) loss and 2) the rank loss. While the GC loss works well when there is a reasonable separation of quality among the samples in a batch, the rank loss works better even when the quality of the batch samples is similar. On the other hand, the rank loss is meaningful only when the quality of the input image is not extremely low, while the GC loss is independent of the quality of a given image. Thus, a combination of the two losses renders our TTA extremely effective across various scenarios.

3.2.1 Group Contrastive Loss

While contrastive learning has been used for TTA of deep image classification, its direct application does not appear relevant to the task of IQA. Thus, we introduce group contrastive (GC) learning as an auxiliary task for TTA of IQA models. In particular, we make two groups of images from a single batch of $N$ images based on the pseudo-labels given by the pre-trained source model. We sort the images in ascending order as $x_{(1)}, x_{(2)}, \ldots, x_{(N)}$ based on the pseudo-labels, where $x_{(i)}$, $i=1,2,\ldots,N$, corresponds to the $i^{th}$ lowest quality image in the batch. We then segregate images with high quality scores (say, the top $p$ fraction of the data in a batch) and include them in a group of higher-quality images. Similarly, we separate out lower quality images (say, the lowest $p$ fraction of the images in a batch) to form another group. Here we assume that $pN$ and $(1-p)N$ are integers for simplicity; else, they can be rounded off to the nearest integers.

The premise behind our loss for GC learning is that images from the same quality group should give similar feature representations in a lower dimensional space while features of images from different groups are separated out. Thus image pairs from the same group act as positive pairs, and image pairs from different groups act as negative pairs. By separating out these two groups, the model adapts itself by better separating the intermediate quality samples.

Let a positive pair $x_{(i)}$ and $x_{(j)}$ come from the same group, i.e., either both $i,j\leq pN$ or both $i,j>(1-p)N$. We use a modified NT-Xent contrastive loss [4] as our objective function. For a pair of images coming from the lower quality group, where $i,j\leq pN$ and $i\neq j$, the GC loss is defined as

\mathcal{L}_{i,j}^{gc}=-\log\frac{\exp\left(\operatorname{sim}\left(\boldsymbol{z}_{(i)},\boldsymbol{z}_{(j)}\right)/\tau\right)}{\sum_{k>(1-p)N}^{N}\exp\left(\operatorname{sim}\left(\boldsymbol{z}_{(i)},\boldsymbol{z}_{(k)}\right)/\tau\right)}, \qquad (2)

where $\mathbf{z}_{(i)}=f_{\theta_{s}}(f_{\theta_{e}}(x_{(i)}))$ represents the feature at the output of the projection head for sample $x_{(i)}$, $\operatorname{sim}$ refers to the cosine similarity between two features, and $\mathbf{z}_{(k)}$ is the feature extracted from an image from the higher quality group with $k>(1-p)N$. Also, $\tau$ represents the temperature scaling parameter. While we define the loss above when $i,j\leq pN$, we can define a similar loss when $i,j>(1-p)N$. For all pairs of images within the same group, we obtain the GC loss and add the terms together to obtain the final loss $\mathcal{L}^{gc}$ as

\mathcal{L}^{gc}=\sum_{i=1}^{pN}\sum_{\substack{j=1\\ j\neq i}}^{pN}\mathcal{L}_{i,j}^{gc}+\sum_{i=(1-p)N+1}^{N}\sum_{\substack{j=(1-p)N+1\\ j\neq i}}^{N}\mathcal{L}_{i,j}^{gc}. \qquad (3)

A block diagram of the GC loss is shown in Figure 2.
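As an illustration, a sketch of how Equations (2) and (3) could be computed for a batch is given below; `z` denotes the projected features and `scores` the pseudo-labels from the source model. The variable names and the rounding of the group size are our assumptions, not part of the paper's released implementation.

```python
import torch
import torch.nn.functional as F


def group_contrastive_loss(z, scores, p=0.25, tau=1.0):
    """z: (N, d) projected features; scores: (N,) pseudo-quality labels."""
    N = z.shape[0]
    k = max(1, int(round(p * N)))            # group size pN (rounded)
    order = torch.argsort(scores)            # ascending predicted quality
    low, high = order[:k], order[-k:]        # lowest- and highest-quality groups

    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                    # pairwise cosine similarity / temperature

    def within_group(group, negatives):
        loss = z.new_zeros(())
        for i in group:
            # denominator of Eq. (2): similarities to the opposite group
            denom = torch.logsumexp(sim[i, negatives], dim=0)
            for j in group:
                if i != j:
                    loss = loss - (sim[i, j] - denom)
        return loss

    # Eq. (3): sum over positive pairs inside each of the two groups
    return within_group(low, high) + within_group(high, low)
```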

Figure 3: The rank loss framework for test time adaptation. We distort the test image with two different degrees of degradation. We then project the triplet of images into the feature space through the shared feature extractor and projection head. Distances calculated between the features extracted from each distorted image and the original test image are used to compare their rank order.

3.2.2 Rank Loss

Given a sample of images from the test data, the rank loss helps adapt the features by capturing quality-aware distributional information at the sample level. We introduce a ranking objective to learn the quality order between two distorted versions of each test image that are quality discriminable. We distort each test image $x_{i}$, $i=1,2,\dots,N$, in the minibatch of size $N$ with two different degrees of degradation of a given distortion type. Let $x_{i}^{high}$ and $x_{i}^{low}$ denote the more and less distorted image, respectively. The degradation types include synthetic distortions such as blur, compression, and noise. We choose degradation levels randomly from two sets of parameters that are sufficiently far apart. Ideally, the distance between the features extracted from the test image and the highly distorted image should be greater than that between the features extracted from the test image and the less distorted image. Our rank loss tries to capture this order and adapt the model to the test data.

We project the triplet of images $(x_{i},x_{i}^{high},x_{i}^{low})$ into the feature space through the feature extractor and the projection head. We measure the distance between each distortion-augmented image and the original test image using the Euclidean distance in the projected feature space. Mathematically, let $\mathbf{z}_{i}=f_{\theta_{s}}(f_{\theta_{e}}(x_{i}))$ be the feature extracted for test sample $x_{i}$; we define $\mathbf{z}_{i}^{high}$ and $\mathbf{z}_{i}^{low}$ similarly. Let $d_{i}^{high}$ denote the distance between $\mathbf{z}_{i}$ and $\mathbf{z}_{i}^{high}$, and $d_{i}^{low}$ the distance between $\mathbf{z}_{i}$ and $\mathbf{z}_{i}^{low}$. The target model is then fine-tuned to obtain the correct ranking between the distances, i.e., $d_{i}^{high}\geq d_{i}^{low}$, as shown in Figure 3.

The probability of achieving this order is estimated by passing the difference in distances through a sigmoid function as

P_{i}=\operatorname{Pr}\left(d_{i}^{high}\geq d_{i}^{low}\right)=\frac{\exp\left(d_{i}^{high}-d_{i}^{low}\right)}{1+\exp\left(d_{i}^{high}-d_{i}^{low}\right)}. \qquad (4)

We use binary cross-entropy loss between the true label $\bar{P_{i}}=1$ and the predicted probability $P_{i}$ to obtain the rank loss as

\mathcal{L}_{i}^{r}\left(\bar{P_{i}},P_{i}\right)=-\bar{P_{i}}\log P_{i}-\left(1-\bar{P_{i}}\right)\log\left(1-P_{i}\right). \qquad (5)

For a batch of size $N$, the overall rank loss is given as

\mathcal{L}^{r}=\sum_{i=1}^{N}\mathcal{L}_{i}^{r}. \qquad (6)

One of the challenges in the rank loss is selecting the distortion type for every image. The original test image may already be distorted, which poses difficulties in deciding the distortion type that can help TTA. For example, if the test image is extremely blurred and we blur it again with two different levels, both images may be visually indistinguishable, thereby limiting the extent to which adaptation can help. We exploit the source model’s knowledge to overcome this limitation in choosing the distortion type. We predict the quality scores of the pair $(x_{i}^{high},x_{i}^{low})$ for each distortion type using the source model. We hypothesize that the pair with the maximal difference in predicted quality is sufficiently different in visual quality. Thus choosing such a pair in Equation (5) can help adapt the model.
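A sketch of the sample-level rank loss (Equations (4)-(6)) together with this distortion-type selection heuristic might look as follows. Here `embed` returns projected features, `source_model` returns pseudo-quality scores (assumed scalar per image), and `distort(x, kind, level)` is a hypothetical helper that produces a more or less distorted version; none of these names come from the released code.

```python
import torch
import torch.nn.functional as F


def rank_loss(embed, source_model, images, distort,
              kinds=("blur", "compression", "noise")):
    """images: (N, 3, H, W) test batch; returns L^r of Eq. (6)."""
    losses = []
    for xi in images:
        xi = xi.unsqueeze(0)
        best_pair, best_gap = None, -1.0
        # Pick the distortion type whose two degraded versions differ the most
        # in quality predicted by the source model (Section 3.2.2).
        for kind in kinds:
            x_hi, x_lo = distort(xi, kind, "high"), distort(xi, kind, "low")
            with torch.no_grad():
                gap = (source_model(x_hi) - source_model(x_lo)).abs().item()
            if gap > best_gap:
                best_gap, best_pair = gap, (x_hi, x_lo)
        x_hi, x_lo = best_pair

        z, z_hi, z_lo = embed(xi), embed(x_hi), embed(x_lo)
        d_hi = torch.norm(z - z_hi, dim=1)   # distance to the more distorted version
        d_lo = torch.norm(z - z_lo, dim=1)   # distance to the less distorted version
        p_i = torch.sigmoid(d_hi - d_lo)     # Eq. (4)
        # Eq. (5) with target label 1: d_hi should exceed d_lo
        losses.append(F.binary_cross_entropy(p_i, torch.ones_like(p_i)))
    return torch.stack(losses).sum()         # Eq. (6)
```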

Our overall self-supervised TTA objective function is a combination of both the rank and GC loss given as

\mathcal{L}_{s}=\mathcal{L}^{gc}+\lambda\mathcal{L}^{r}, \qquad (7)

where $\lambda$ is a hyper-parameter used to combine the losses.
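Building on the two sketches above, the combined objective of Equation (7) used during adaptation can be written as below, with $\lambda=1$ as in our experiments; this is illustrative glue code, not the released implementation.

```python
def self_supervised_objective(z, scores, embed, source_model, images, distort, lam=1.0):
    """L_s = L^gc + lambda * L^r (Eq. 7), combining the two sketches above."""
    return group_contrastive_loss(z, scores) + lam * rank_loss(
        embed, source_model, images, distort)
```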

Backbone Database KonIQ-10k PIPAL CID2013 LIVE-IQA
Method SROCC PLCC SROCC PLCC SROCC PLCC SROCC PLCC
TReS Baseline 0.6520 0.6955 0.3845 0.4078 0.5272 0.6463 0.5435 0.4450
Rotation 0.6506 0.6805 0.4061 0.4114 0.5706 0.6651 0.5866 0.5311
TTA-IQA 0.6578 0.7074 0.4278 0.4204 0.6032 0.6710 0.6722 0.5963
MUSIQ Baseline 0.6304 0.6802 0.3190 0.3414 0.5173 0.6032 0.2596 0.3351
Rotation 0.6577 0.7154 0.3665 0.3693 0.5487 0.6164 0.3512 0.3976
TTA-IQA 0.6693 0.7230 0.3743 0.3731 0.5499 0.6220 0.3649 0.4031
HyperIQA Baseline 0.5861 0.6313 0.3037 0.3304 0.4895 0.6123 0.5143 0.4377
Rotation 0.6033 0.6536 0.3268 0.3482 0.4902 0.6150 0.6268 0.5469
TTA-IQA 0.5960 0.6495 0.3653 0.3767 0.5039 0.5988 0.6218 0.5438
MetaIQA Baseline 0.5162 0.4460 0.3287 0.2955 0.7213 0.6817 0.7323 0.6732
Rotation 0.5823 0.5311 0.3353 0.3042 0.7177 0.6745 0.7271 0.6868
TTA-IQA 0.5838 0.5428 0.4073 0.3510 0.7809 0.7399 0.7999 0.7726
Table 1: Comparison of TTA-IQA with popular NR IQA methods and one popular auxiliary task - rotation prediction on authentically and synthetically distorted datasets. Bold entries imply the best performance for every individual quality-aware model on respective datasets.

4 Experiments

4.1 Quality Models, Datasets and Metrics

We evaluate TTA on four popular IQA databases using four different quality-aware models. In particular, we consider state-of-the-art deep IQA models such as TReS [9], MUSIQ [15], HyperIQA [34], and MetaIQA [44]. These models contain a ResNet backbone with batch normalization layers, which we model as the feature extractor $f_{\theta_{e}}$, and we only update its batch normalization parameters. We model the rest of the network as the quality regressor part, $f_{\theta_{c}}$. TReS and MUSIQ use transformers as a part of their architecture, which we include as a part of the quality regressor in all our main experiments. We also explore the adaptation of transformers by including them as part of the feature extractor and updating their layer normalization statistics in the supplementary material.

We project the features extracted from the last layer of the ResNet through a 256-dimensional fully connected (FC) layer corresponding to the self-supervised projection head $f_{\theta_{s}}$. These lower dimensional features are used to adapt the model at test time. Three IQA models, TReS, MUSIQ, and HyperIQA, are trained on the LIVEFB database [41] containing 39,810 images. MetaIQA is trained on two synthetically distorted databases, TID2013 [30] and KADID-10k [21].
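As a rough illustration of the Y-shaped split in Figure 1, one possible wrapper around a ResNet backbone is shown below. The exact backbone/regressor boundary depends on the particular IQA model, so the choice of `resnet50`, the 2048-dimensional feature size, and the head sizes here are assumptions for the sketch.

```python
import torch.nn as nn
from torchvision.models import resnet50


class TTAWrapper(nn.Module):
    """Y-shaped model: shared feature extractor, projection head, quality regressor."""

    def __init__(self, quality_regressor: nn.Module, feat_dim: int = 2048):
        super().__init__()
        backbone = resnet50(weights=None)
        # f_theta_e: everything up to (and including) global average pooling
        self.feature_extractor = nn.Sequential(*list(backbone.children())[:-1])
        # f_theta_s: 256-d projection head used only by the auxiliary losses
        self.projection_head = nn.Sequential(
            nn.Flatten(), nn.Linear(feat_dim, 256), nn.ReLU())
        # f_theta_c: pre-trained quality regressor, kept frozen during TTA
        self.quality_regressor = quality_regressor

    def project(self, x):
        # Used at test time training to compute the GC and rank losses
        return self.projection_head(self.feature_extractor(x))

    def predict_quality(self, x):
        # Used at inference after the BN parameters have been adapted
        feats = self.feature_extractor(x).flatten(1)
        return self.quality_regressor(feats)
```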

We choose challenging databases to evaluate the generalization capability of our TTA-IQA method. The test datasets are described as follows:

KonIQ-10k [11] is a popular in-the-wild authentically distorted database consisting of 10,073 quality-scored images.

PIPAL [13] is a large-scale IQA database for evaluating perceptual image restoration. It contains 29k distorted images, including outputs of 19 different GAN-based algorithms.

CID2013 [36] consists of 480 images in six image sets captured by 79 imaging devices.

LIVE-IQA [33] consists of 779 synthetically distorted images from 29 reference images.

We evaluate our results using Spearman’s rank-order correlation coefficient (SROCC) and Pearson’s linear correlation coefficient (PLCC).

4.2 Implementation Details

We implement our setup in PyTorch and conduct all the experiments on a 16 GB NVIDIA RTX A4000 GPU. During TTA, we randomly select a patch of size $224\times 224$ from the input image and apply quality preserving augmentations such as horizontal flip and vertical flip before passing it through the network for TTA. We use the ADAM [16] optimizer with a learning rate of 0.001 and set the number of iterations to 3. All the experiments were run with five different seeds using a batch size of 8, and the final results were obtained by averaging.

With regard to the self-supervised auxiliary tasks, we use a combination of the rank loss and GC loss as our objective function with $\lambda=1$ in Equation (7). We choose $p=0.25$ to determine the groups. We calculate the GC loss using $\tau=1$. For the rank loss, the test image is synthetically blurred using a Gaussian blur filter of size $5\times 5$ with two sets of standard deviations $\sigma$ for the Gaussian blur kernel. We keep $\sigma\in[40,80]$ for highly distorted images and $\sigma\in[1,20]$ for less blurred images. For compression distortions, we specify the quality factor in $[80,95]$ for lower compression and $[30,60]$ for higher compression rates. Similarly, we add zero mean Gaussian white noise to the test image with variance in $[0.05,0.1]$ for higher distortion and in $[0.005,0.01]$ for lower distortion.
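For concreteness, a hypothetical `distort` helper matching these parameter ranges might be written as below. The kernel size and ranges follow the description above, while the JPEG round-trip via PIL and the input/output conventions are our implementation choices, not taken from the paper.

```python
import io
import random

import torch
from PIL import Image
from torchvision.transforms import functional as TF


def distort(x: torch.Tensor, kind: str, level: str) -> torch.Tensor:
    """x: (1, 3, H, W) float tensor in [0, 1]; returns a distorted copy."""
    if kind == "blur":
        # Gaussian blur with a 5x5 kernel; sigma range depends on the level
        sigma = random.uniform(*((40, 80) if level == "high" else (1, 20)))
        return TF.gaussian_blur(x, kernel_size=5, sigma=sigma)
    if kind == "compression":
        # JPEG round-trip with a level-dependent quality factor
        q = random.randint(*((30, 60) if level == "high" else (80, 95)))
        buf = io.BytesIO()
        TF.to_pil_image(x.squeeze(0)).save(buf, format="JPEG", quality=q)
        buf.seek(0)
        return TF.to_tensor(Image.open(buf)).unsqueeze(0)
    if kind == "noise":
        # Zero-mean Gaussian white noise with a level-dependent variance
        var = random.uniform(*((0.05, 0.1) if level == "high" else (0.005, 0.01)))
        return (x + torch.randn_like(x) * var ** 0.5).clamp(0, 1)
    raise ValueError(f"unknown distortion type: {kind}")
```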

Backbone Database KonIQ-10k PIPAL CID2013 LIVE-IQA
Rank GC SRCC PLCC SRCC PLCC SRCC PLCC SRCC PLCC
TReS \checkmark ×\times 0.6562 0.6989 0.4171 0.4204 0.6016 0.6736 0.6705 0.5908
×\times \checkmark 0.6516 0.6946 0.4666 0.5183 0.5366 0.6493 0.7160 0.6290
\checkmark \checkmark 0.6578 0.7074 0.4278 0.4204 0.6032 0.6710 0.6722 0.5963
MUSIQ \checkmark ×\times 0.6549 0.7149 0.3768 0.3718 0.5216 0.6034 0.3634 0.4024
×\times \checkmark 0.6611 0.7176 0.3585 0.3642 0.5446 0.6159 0.3299 0.3914
\checkmark \checkmark 0.6693 0.7230 0.3743 0.3731 0.5499 0.6220 0.3649 0.4031
HyperIQA \checkmark ×\times 0.5928 0.6455 0.3616 0.3732 0.5039 0.5991 0.5505 0.5050
×\times \checkmark 0.6094 0.6567 0.3333 0.3552 0.5120 0.6255 0.6331 0.5592
\checkmark \checkmark 0.5960 0.6495 0.3653 0.3767 0.5039 0.5988 0.6218 0.5438
MetaIQA \checkmark ×\times 0.5580 0.5118 0.3992 0.3437 0.7579 0.7282 0.7894 0.7534
×\times \checkmark 0.5414 0.4636 0.3710 0.3287 0.7861 0.7067 0.7566 0.6927
\checkmark \checkmark 0.5838 0.5428 0.4073 0.3510 0.7809 0.7399 0.7999 0.7726
Table 2: Ablation study results on authentically and synthetically distorted datasets using popular quality aware source models. Bold entries imply the best performance among all three settings.

4.3 Performance Evaluation

Table 1 shows the performance of our method on all four test datasets using all four quality-aware models. In addition to the source model, we also compare with rotation prediction [35] as the auxiliary task during TTA. A comparison with such a task helps understand the role of quality-aware losses for the TTA of IQA models.

We observe that TTA-IQA using the combination of rank and GC loss outperforms the source models in all the cases. Note that the PIPAL dataset has a huge distribution shift from the authentically distorted LIVEFB dataset on which three of the models were trained. While most source models perform very poorly on PIPAL, we achieve 10%-20% improvement over the source models. On the KonIQ-10k dataset, TTA-IQA gives around 1%-10% improvement over the source model. As both KonIQ-10k and LIVEFB are authentically distorted datasets, the distribution shift is reasonably small, leading to smaller improvements using adaptation. CID2013 is also an authentically distorted dataset and gives similar improvements of around 2%-13% over the source models. Our experiments on the LIVE-IQA database provide a significantly greater improvement of 10%-33% owing to the shift from authentic to synthetic distortions for three models.

Figure 4: Scatter plots of predicted quality for similar quality images from the LIVE-IQA dataset for the source model and for TTA using only the rank loss or only the GC loss. We mark two sets of images having similar human opinion scores (DMOS) by ellipses in each case.

We also observe from Table 1 that our approach outperforms the rotation prediction task in most cases. Other TTA ideas, such as TENT [37] and training using masked autoencoders [7], are not applicable to the IQA task. In particular, the notion of entropy in regression tasks is not clear. On the other hand, masked reconstructions tend to lead to quality degradation. Thus, we do not compare with them.

4.4 Ablation Study & Other Experiments

Need for Rank Loss and GC Loss for TTA. We perform an ablation study with respect to the two losses in Table 2. We see from the results that the rank loss and the GC loss individually always improve on the source models. Further, the combination of the rank and GC loss provides an even better performance over the individual losses in most evaluation scenarios. Even in scenarios where the combination loss achieves the second-best performance, it is very close to the best. For the rest of the experiments in this section, we present results on all four datasets using the TReS method alone.

We now discuss scenarios where the GC loss is particularly more useful than the rank loss. If the test image is extremely distorted, none of the three distortion types can create much difference in the perceptual quality of the two degraded versions. To understand this further, we consider highly distorted images (corresponding to a differential mean opinion score (DMOS) greater than 70) of different distortion types in the LIVE-IQA dataset. We apply TTA only for these images using different losses. The SROCC performance for these images in Figure 5 reveals that the rank loss is not very effective. However, the GC loss works well in such scenarios.

Conversely, we also discuss scenarios where the rank loss is more useful than the GC loss. To illustrate this idea, we select multiple images having similar quality with DMOS in $[28,44]$ from the LIVE-IQA dataset. We adapt the source model for this set of images using both the GC and rank losses. From Figure 4, we observe that the rank loss leads to much better adaptation in terms of the resulting model correlating with the ground truth scores. Intuitively, when the input images have a similar quality, the pseudo-labels given by the source model may be noisy, leading to an inaccurate grouping of the images into the lower and higher quality groups. This can adversely affect the GC loss. So the rank loss is more effective than the GC loss for test batches with similar quality images.

Figure 5: TTA-IQA on different distortion-specific images from the LIVE-IQA dataset with respect to individual losses.

Effect of Selecting Multiple Distortion Types in the Rank Loss. We experiment with the impact of different kinds of distortion types while incorporating the rank loss. In particular, we consider three scenarios: one where only one of the distortion types is used in the rank loss, another where all three types are used to obtain three loss terms that are summed together to update the model, and a third, where only one of the distortion types is chosen based on the discussion in Section 3.2.2. We observe from Table 3 that choosing the distortion type with the maximal difference in quality gives the best performance for the rank loss. This experiment validates our hypothesis that only a rank loss that can discriminate between the degraded image pairs is useful for TTA.

Database Blur Comp Noise All Best
KonIQ-10k 0.656 0.656 0.667 0.615 0.656
PIPAL 0.384 0.415 0.389 0.415 0.417
CID2013 0.527 0.609 0.606 0.560 0.609
LIVE-IQA 0.578 0.605 0.618 0.634 0.671
Table 3: Impact of selecting multiple distortion types on rank loss

Selection of Group Size by Varying $p$. We explore different values of $p$ for constructing the GC loss. For a batch size of 8, possible choices of the group size are $2, 3, 4$, corresponding to $p=0.25, 0.375, 0.5$, respectively. From Table 4, we observe that the performances are roughly similar for different values of $p$, with a slightly superior performance for $p=0.25$. We note that as $p$ increases, the groups are larger and probably less discriminative in terms of quality, leading to a slightly poorer performance.

Database p=0.25 p=0.375 p=0.5
KonIQ-10k 0.6516 0.6497 0.6467
PIPAL 0.4666 0.4602 0.4616
CID2013 0.5366 0.5486 0.5536
LIVE-IQA 0.7160 0.7171 0.6937
Table 4: Impact of varying group size using $p$ for GC loss

Selection of Number of Groups for GC Loss. In our formulation, we define the GC loss only between two contrastive groups. In principle, we can extend our idea of GC loss to multiple groups. In particular, we can cluster the images in a batch into multiple groups, apply the GC loss between every pair of groups, and sum the loss terms. Thus, images coming from the same group act as positive pairs, and images from two different groups are considered as negative pairs. From Table 5, we observe that the number of groups $G$ does not impact the performance much.

Database G=2 G=3 G=4
KonIQ-10k 0.6516 0.6524 0.6509
PIPAL 0.4666 0.4668 0.4663
CID2013 0.5366 0.5327 0.5388
LIVE-IQA 0.7160 0.7011 0.6883
Table 5: Impact of varying number of groups $G$ for GC loss

Choice of Number of Iterations for Learning the Auxiliary Task. We also examine the effect of the number of iterations of parameter updates during TTA. Figure 6 shows that the performance on the CID2013 dataset is fairly robust up to 4 iterations. Beyond that, the model overfits the auxiliary task, leading to poor performance at test time.

Figure 6: Effect of increasing the number of iterations.

5 Conclusion

Test time adaptation has become very popular due to its simplicity and the lack of need for end-to-end training. Our work is perhaps one of the first attempts to design TTA methods in the context of blind IQA. While most IQA methods focus on making the models robust enough to perform well in cross-database experiments, our TTA-IQA method can outperform existing state-of-the-art methods because of the adaptation at test time. We formulate novel self-supervised auxiliary tasks using the rank and group contrastive losses, which can learn quality-aware information from the test data. While we have primarily explored TTA for IQA, it would be interesting to understand the role of TTA for video quality assessment as well.

Acknowledgement: This work was supported in part by a grant from the Department of Science and Technology, Government of India, under grant CRG/2020/003516.

Appendix

Appendix A Experiments with Transformer Parameters Adaptation

We evaluate the performance of TTA-IQA by updating transformer parameters for better adaptation. For TReS [9] and MUSIQ [15], we incorporate the transformer as a part of the feature extractor. Thus, only the last fully connected (FC) layer works as the quality regressor. Current literature [17] shows that the layer normalization (LN) parameters of transformers are a good choice for test time adaptation. In the case of the vision transformer, the CLS token is also used for adaptation. Table 6 shows the performance on all four datasets by optimizing various parameters using the combination of rank and group contrastive loss. We observe that adaptation of the transformer parameters alone gives a performance equivalent to the adaptation of the batch normalization (BN) parameters of the convolutional neural network (CNN) backbone. Thus, it is possible to update models that only use a transformer and achieve significant gains using TTA.
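A sketch of how the adapted parameter sets in Table 6 could be collected is given below; the `cls_token` attribute name follows common ViT-style implementations and is an assumption here rather than part of the released code.

```python
import torch.nn as nn


def adaptable_parameters(model: nn.Module, use_bn=True, use_ln=False, use_cls=False):
    """Collect the affine parameters of BN and/or LN layers (and optionally the CLS token)."""
    params = []
    for m in model.modules():
        if use_bn and isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            params += [p for p in (m.weight, m.bias) if p is not None]
        if use_ln and isinstance(m, nn.LayerNorm):
            params += [p for p in (m.weight, m.bias) if p is not None]
    if use_cls and hasattr(model, "cls_token"):
        params.append(model.cls_token)  # assumed attribute name for ViT-style models
    return params
```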

Backbone Database KONIQ PIPAL CID2013 LIVE-IQA
Method SRCC PLCC SRCC PLCC SRCC PLCC SRCC PLCC
TReS Baseline 0.6520 0.6955 0.3845 0.4078 0.5272 0.6463 0.5435 0.4450
BN Only 0.6731 0.7151 0.4392 0.2710 0.6173 0.6800 0.6707 0.6006
LN Only 0.6694 0.7176 0.4128 0.2690 0.6193 0.6850 0.5998 0.5458
BN+LN 0.6621 0.7059 0.4417 0.3484 0.6123 0.6758 0.6723 0.5948
MUSIQ Baseline 0.6304 0.6802 0.3190 0.3414 0.5173 0.6032 0.2596 0.3351
BN Only 0.6588 0.7174 0.3772 0.3744 0.5275 0.6126 0.3350 0.3954
LN Only 0.6582 0.7155 0.3757 0.3762 0.5403 0.6109 0.3661 0.4044
CLS Only 0.6618 0.7207 0.3776 0.3739 0.5491 0.6212 0.3511 0.3993
BN+LN 0.6552 0.7145 0.3736 0.3732 0.5329 0.6095 0.3767 0.4028
BN+LN+CLS 0.6598 0.7172 0.3751 0.3764 0.5253 0.6048 0.3569 0.3999
Table 6: Comparison of TTA-IQA using popular transformer based NR IQA methods on authentically and synthetically distorted datasets.
Train on LIVEFB LIVE-IQA
Test on PIPAL KonIQ-10k SPAQ LIVEC PIPAL CID2013 KonIQ-10k LIVEC
Baseline 0.385 0.652 0.707 0.726 0.402 0.519 0.521 0.563
TTA-IQA 0.428 0.658 0.755 0.728 0.449 0.523 0.522 0.565
Table 7: SRCC performance evaluation of TTA-IQA with TReS backbone trained on LIVE-IQA database

Method    TReS    MUSIQ   HyperIQA   MetaIQA
Baseline  0.535   0.404   0.496      0.591
Rotation  0.529   0.425   0.479      0.540
TTA-IQA   0.586   0.450   0.493      0.608

Table 8: SRCC performance analysis of TTA-IQA on DSLR database.

Appendix B Visualizing Images that Justify Need for Both Rank and GC Loss

In Section 4.4, we justify the need for both the rank and GC loss for effective TTA. Here we give a few visual examples of images corresponding to that analysis. In Figure 7, we observe that the images have very poor quality. Hence, distorting these images further creates distorted versions that have perceptually indistinguishable quality ratings. On the other hand, Figure 8 shows similar quality images. Here, as the images have almost similar visual quality, it is difficult to form two different quality groups based on pseudo-labels given by the source model.

Figure 7: Examples of highly distorted images in which GC loss is more effective than rank loss
Figure 8: Examples of similar quality images in which rank loss is more effective than GC loss

Appendix C Performance of TTA-IQA on Other Databases

C.1 Performance evaluation with synthetic database as source database

In the main paper, we reported performances where the source model is trained on the camera-captured LIVEFB [41] database and tested on various authentic and synthetic databases. In Table 7, we provide more such evaluations with respect to different intra- and inter-domain comparisons. In particular, we present results when TReS [9] is trained on LIVEFB and evaluated on more intra-domain datasets such as SPAQ [6] and LIVEC [8]. We also present results when TReS is trained on a synthetic dataset such as LIVE-IQA [33] and tested on authentic datasets as well as other datasets containing restored images. We observe that TTA-IQA gives a reasonable performance gain over the baseline even when there is a domain shift between the source (synthetic) data and the target (authentic) data.

C.2 Performance on Low-Light Restored Database

To understand the impact of larger domain shifts, we also evaluate on a new database, DSLR [14], where images captured in low light are restored via various image restoration algorithms. Since novel distortions are generated while restoring such low-light images, we evaluate the performance of TTA-IQA with the source database as LIVEFB and the target database as DSLR. We see that TTA-IQA helps improve the performance of most of the methods.

References

  • [1] Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 517–526. PMLR, 06–11 Aug 2017.
  • [2] Sebastian Bosse, Dominique Maniry, Klaus-Robert Müller, Thomas Wiegand, and Wojciech Samek. Deep neural networks for no-reference and full-reference image quality assessment. IEEE Transactions on Image Processing, 27(1):206–219, 2018.
  • [3] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, pages 139–156, Cham, 2018. Springer International Publishing.
  • [4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
  • [5] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [6] Yuming Fang, Hanwei Zhu, Yan Zeng, Kede Ma, and Zhou Wang. Perceptual quality assessment of smartphone photography. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3674–3683, 2020.
  • [7] Yossi Gandelsman, Yu Sun, Xinlei Chen, and Alexei A. Efros. Test-time training with masked autoencoders. arXiv preprint arXiv:2209.07522, 2022.
  • [8] Deepti Ghadiyaram and Alan C. Bovik. Massive online crowdsourced study of subjective and objective picture quality. IEEE Transactions on Image Processing, 25(1):372–387, 2016.
  • [9] S Alireza Golestaneh, Saba Dadsetan, and Kris M Kitani. No-reference image quality assessment via transformers, relative ranking, and self-consistency. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3209–3218, 2022.
  • [10] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019.
  • [11] Vlad Hosu, Hanhe Lin, Tamas Sziranyi, and Dietmar Saupe. Koniq-10k: An ecologically valid database for deep learning of blind image quality assessment. Trans. Img. Proc., 29:4041–4056, jan 2020.
  • [12] Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier adjustment module for model-agnostic domain generalization. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 2427–2440. Curran Associates, Inc., 2021.
  • [13] Gu Jinjin, Cai Haoming, Chen Haoyu, Ye Xiaoxing, Jimmy S. Ren, and Dong Chao. Pipal: A large-scale image quality assessment dataset for perceptual image restoration. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 633–651, Cham, 2020. Springer International Publishing.
  • [14] Vignesh Kannan, Sameer Malik, Nithin C. Babu, and Rajiv Soundararajan. Quality assessment of low-light restored images: A subjective study and an unsupervised model. IEEE Access, 11:68216–68230, 2023.
  • [15] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 5148–5157, October 2021.
  • [16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [17] Takeshi Kojima, Yutaka Matsuo, and Yusuke Iwasawa. Robustifying vision transformer without retraining from scratch by test-time class-conditional feature alignment. In Lud De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 1009–1016. International Joint Conferences on Artificial Intelligence Organization, 7 2022. Main Track.
  • [18] Nikos Komodakis and Spyros Gidaris. Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations (ICLR), Vancouver, Canada, Apr. 2018.
  • [19] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [20] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6028–6039. PMLR, 13–18 Jul 2020.
  • [21] Hanhe Lin, Vlad Hosu, and Dietmar Saupe. Kadid-10k: A large-scale artificially distorted iqa database. In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pages 1–3, 2019.
  • [22] Xialei Liu, Joost van de Weijer, and Andrew D. Bagdanov. Rankiqa: Learning from rankings for no-reference image quality assessment. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [23] Yutao Liu, Ke Gu, Yongbing Zhang, Xiu Li, Guangtao Zhai, Debin Zhao, and Wen Gao. Unsupervised blind image quality evaluation via statistical measurements of structure, naturalness, and perception. IEEE Transactions on Circuits and Systems for Video Technology, 30(4):929–943, 2020.
  • [24] Yuejiang Liu, Parth Kothari, Bastien van Delft, Baptiste Bellot-Gurlet, Taylor Mordan, and Alexandre Alahi. Ttt++: When does self-supervised test-time training fail or thrive? In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 21808–21820. Curran Associates, Inc., 2021.
  • [25] Kede Ma, Wentao Liu, Kai Zhang, Zhengfang Duanmu, Zhou Wang, and Wangmeng Zuo. End-to-end blind image quality assessment using deep neural networks. IEEE Transactions on Image Processing, 27(3):1202–1213, 2018.
  • [26] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695–4708, 2012.
  • [27] Anush Krishna Moorthy and Alan Conrad Bovik. Blind image quality assessment: From natural scene statistics to perceptual quality. IEEE Transactions on Image Processing, 20(12):3350–3364, 2011.
  • [28] Chaithanya Kumar Mummadi, Robin Hutmacher, Kilian Rambach, Evgeny Levinkov, Thomas Brox, and Jan Hendrik Metzen. Test-time adaptation to distribution shift by confidence maximization and input transformation, 2021.
  • [29] Zachary Nado, Shreyas Padhy, D Sculley, Alexander D’Amour, Balaji Lakshminarayanan, and Jasper Snoek. Evaluating prediction-time batch normalization for robustness under covariate shift. arXiv preprint arXiv:2006.10963, 2020.
  • [30] Nikolay Ponomarenko, Lina Jin, Oleg Ieremeiev, Vladimir Lukin, Karen Egiazarian, Jaakko Astola, Benoit Vozel, Kacem Chehdi, Marco Carli, Federica Battisti, and C.-C. Jay Kuo. Image database tid2013: Peculiarities, results and perspectives. Signal Processing: Image Communication, 30:57–77, 2015.
  • [31] Michele A. Saad, Alan C. Bovik, and Christophe Charrier. Blind image quality assessment: A natural scene statistics approach in the dct domain. IEEE Transactions on Image Processing, 21(8):3339–3352, 2012.
  • [32] Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge. Improving robustness against common corruptions by covariate shift adaptation. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 11539–11551. Curran Associates, Inc., 2020.
  • [33] H.R. Sheikh, M.F. Sabir, and A.C. Bovik. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing, 15(11):3440–3451, 2006.
  • [34] Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang. Blindly assess image quality in the wild guided by a self-adaptive hyper network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [35] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9229–9248. PMLR, 13–18 Jul 2020.
  • [36] Toni Virtanen, Mikko Nuutinen, Mikko Vaahteranoksa, Pirkko Oittinen, and Jukka Häkkinen. Cid2013: A database for evaluating no-reference image quality assessment algorithms. IEEE Transactions on Image Processing, 24(1):390–402, 2015.
  • [37] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, 2021.
  • [38] Jingtao Xu, Peng Ye, Qiaohong Li, Haiqing Du, Yong Liu, and David Doermann. Blind image quality assessment based on high order statistics aggregation. IEEE Transactions on Image Processing, 25(9):4444–4457, 2016.
  • [39] Peng Ye, Jayant Kumar, Le Kang, and David Doermann. Unsupervised feature learning framework for no-reference image quality assessment. In 2012 IEEE conference on computer vision and pattern recognition, pages 1098–1105. IEEE, 2012.
  • [40] Peng Ye, Jayant Kumar, Le Kang, and David Doermann. Unsupervised feature learning framework for no-reference image quality assessment. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1098–1105, 2012.
  • [41] Zhenqiang Ying, Haoran Niu, Praful Gupta, Dhruv Mahajan, Deepti Ghadiyaram, and Alan Bovik. From patches to pictures (paq-2-piq): Mapping the perceptual space of picture quality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [42] Guangtao Zhai, Xiongkuo Min, and Ning Liu. Free-energy principle inspired visual quality assessment: An overview. Digital Signal Processing, 91:11–20, 2019.
  • [43] Weixia Zhang, Kede Ma, Jia Yan, Dexiang Deng, and Zhou Wang. Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Transactions on Circuits and Systems for Video Technology, 30(1):36–47, 2020.
  • [44] Hancheng Zhu, Leida Li, Jinjian Wu, Weisheng Dong, and Guangming Shi. Metaiqa: Deep meta-learning for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.