Introduction
AI-Generated Content (AIGC) refers to any form of content, such as text, images, audio, or video, created with the help of artificial intelligence technology. With the flourishing development of deep learning, the efficiency of AIGC generation has increased, and AIGC images (AGIs) are becoming more prevalent in areas such as culture, entertainment, education, and social media. Unlike natural scene images (NSIs), which are captured from natural scenes, AGIs are directly generated by AI models, as shown in Fig. 1. For example, diffusion models [1] and generative adversarial networks [2] are capable of generating large numbers of images according to our needs. However, due to hardware limitations and varying technical proficiency, the quality of AGIs is inconsistent, and they often require refinement and filtering before exhibition and practical use. Thus, objective quality models for AGIs are urgently needed.
During the last decade, a large amount of effort has been put into constructing image quality assessment (IQA) databases and proposing IQA methods for common image contents, such as NSIs [3], JPEG2000-compressed [4], cartoon [5], computer-generated [6], contrast-changed [7], high dynamic range (HDR) [8], in-the-wild [9], screen content [10], and omnidirectional [11] images. All these kinds of images share some common technical quality assessment dimensions such as illumination, blur, contrast, and texture. However, although AGIs are generated under constraints that encourage similarity to the training images (e.g., NSIs), they possess some unique quality characteristics, and viewers tend to evaluate their quality from different aspects. We summarize the major quality assessment aspects for AGIs here: a) Technical issues, which refer to the common distortions that affect the visibility of the image content; b) AI artifacts, which indicate the confusing and unexpected components appearing in the images; c) Unnaturalness, which stands for content that goes against common sense and causes discomfort during the viewing experience; d) Discrepancy, which denotes the extent of the mismatch between the AGIs and our expectations; e) Aesthetics, which refers to the overall visual appeal and beauty of the images.
However, there has currently been no scientific research specifically targeted at the perceptual quality of AGIs. Therefore, in this paper, we embark on an exploration to address the challenge of evaluating the quality of AIGC by constructing a first-of-its-kind perceptual quality assessment database for AGIs, named AGIQA-1K. Specifically, we employ two latent text-to-image diffusion models [1], stable-inpainting-v1 and stable-diffusion-v2, as the AGI models. Then we choose several of the most popular text keywords from the Internet for AGI generation, and a total of 1,080 AGIs are obtained. Afterward, we carry out a subjective experiment in a well-controlled laboratory environment, where the subjects are asked to perceptually evaluate the quality of the AGIs following the major quality aspects discussed above. Finally, a benchmark experiment is conducted to evaluate the performance of current IQA models, and in-depth discussions are given as well. Our contributions are as follows:
We propose a thorough quality assessment guideline for AGIs, whose major evaluation aspects include technical issues, AI artifacts, unnaturalness, discrepancy, and aesthetics.
We are the first to construct a perceptual AGI quality assessment database (AGIQA-1K), which provides 1,080 AGIs along with quality labels.
A benchmark experiment is conducted to evaluate the performance of current IQA models.
Database Construction
A. AGIs Collection
Considering the success of stable diffusion models, we select two text-to-image diffusion models, stable-inpainting-v1 and stable-diffusion-v2 (sub-models derived from [1]), as the AGI models. To ensure content diversity and catch up with popular trends, we use hot keywords from the PNGIMG website for AGI generation; the employed keywords are exhibited in Table I, which covers the main objects, second objects, places, and styles. Some sample AGIs are further illustrated in Fig. 2.
In order to evaluate the statistical discrepancy between NSIs and AGIs, we present the distributions of five quality-related attributes for comparison. The NSIs are sourced from the in-the-wild KonIQ-10k IQA database [12], while the AGIs are collected through the proposed AGIQA-1K database. The quality-related attributes under consideration are light, contrast, colorfulness, blur, and spatial information (SI). Detailed descriptions of these attributes can be found in [13]. As shown in Fig. 3, the quality-related attribute distributions of NSIs and AGIs are quite similar and tend to be Gaussian-like. Specifically, AGIs are relatively blurrier and contain more spatial information than NSIs.
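To make these attributes concrete, two of them can be computed as in the dependency-free sketch below. This is a minimal illustration, not the exact formulations of [13]: colorfulness here follows the widely used Hasler-Süsstrunk metric, and SI is taken as the standard deviation of the Sobel gradient magnitude (ITU-T P.910 style).

```python
import numpy as np

def colorfulness(img):
    """Hasler-Suesstrunk colorfulness for an HxWx3 uint8 RGB image."""
    r = img[..., 0].astype(float)
    g = img[..., 1].astype(float)
    b = img[..., 2].astype(float)
    rg = r - g              # red-green opponent channel
    yb = 0.5 * (r + g) - b  # yellow-blue opponent channel
    return (np.hypot(rg.std(), yb.std())
            + 0.3 * np.hypot(rg.mean(), yb.mean()))

def spatial_information(gray):
    """SI: std of the Sobel gradient magnitude over a 2-D grayscale image."""
    p = np.pad(gray.astype(float), 1, mode="edge")
    gx = ((p[:-2, 2:] - p[:-2, :-2])
          + 2 * (p[1:-1, 2:] - p[1:-1, :-2])
          + (p[2:, 2:] - p[2:, :-2]))
    gy = ((p[2:, :-2] - p[:-2, :-2])
          + 2 * (p[2:, 1:-1] - p[:-2, 1:-1])
          + (p[2:, 2:] - p[:-2, 2:]))
    return np.hypot(gx, gy).std()
```

Both measures are zero for a uniform image and grow with chromatic spread and edge energy, respectively, which is what the distribution comparison in Fig. 3 relies on.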
B. Subjective Experiment
To evaluate the quality of the AGIs, a subjective experiment is conducted following the guidelines of ITU-R BT.500-13 [14]. The subjects are asked to rate the overall quality of the exhibited AGIs with respect to technical issues, AI artifacts, unnaturalness, discrepancy, and aesthetics. Some typical distortion examples are shown in Fig. 4. The AGIs are presented in random order on an iMac monitor with a resolution of up to
To limit the experiment time for each session to less than half an hour, the experiment is split into 5 sessions, each of which includes the subjective quality evaluation of about 200 AGIs. This results in more than
C. Subjective Data Analysis
After the subjective experiment, all quality ratings from the subjects are collected. The raw rating $r_{ij}$ judged by the $i$-th subject on the $j$-th AGI is first converted into a z-score:

$$z_{ij} = \frac{r_{ij} - \mu_i}{\sigma_i},$$

where $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of the ratings given by the $i$-th subject. After outlier rejection following [14], the z-scores are linearly rescaled and averaged across subjects to obtain the mean opinion score (MOS) of each AGI:

$$MOS_j = \frac{1}{N} \sum_{i=1}^{N} z'_{ij},$$

where $N$ is the number of valid subjects and $z'_{ij}$ is the rescaled z-score.
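The per-subject z-score normalization and cross-subject averaging can be sketched as follows. This is a minimal sketch: the rescaling range and the omission of the BT.500 outlier-rejection step are simplifying assumptions.

```python
import numpy as np

def compute_mos(ratings, lo=0.0, hi=100.0):
    """ratings: (num_subjects, num_images) array of raw scores.
    Z-score each subject's ratings, linearly rescale the z-scores to
    [lo, hi] assuming they lie within +/-3, then average over subjects.
    (BT.500 outlier rejection is omitted for brevity.)"""
    mu = ratings.mean(axis=1, keepdims=True)
    sigma = ratings.std(axis=1, ddof=1, keepdims=True)
    z = (ratings - mu) / sigma
    z = lo + (hi - lo) * (z + 3.0) / 6.0  # map z in [-3, 3] to [lo, hi]
    return z.mean(axis=0)                 # one MOS per image
```

Per-subject normalization removes individual rating biases (some subjects score systematically high or low) before the scores are pooled.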
Experiment
A. Benchmark Models
Due to the absence of pristine reference images in the proposed AGIQA-1K database, only no-reference (NR) IQA models are selected for comparison. The selected models can be classified into three groups:
Handcrafted-based models: This group includes BMPRI [17], CEIQ [18], DSIQA [19], NIQE [20], and SISBLIM [21]. These models extract handcrafted features based on prior knowledge about image quality.
Handcrafted & SVR-based models: This group includes FRIQUEE [22], GMLF [23], HIGRADE [24], NFERM [25], and NFSDM [26]. These models feed handcrafted features into a Support Vector Regression (SVR) model to predict perceptual quality.
Deep learning-based models: This group includes ResNet50 [27], StairIQA [28], and MGQA [29]. These models characterize quality-aware information by training deep neural networks on labeled data.
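To illustrate the second group's feature-plus-regressor pipeline, the sketch below regresses synthetic stand-in "handcrafted" features onto synthetic quality scores with an RBF-kernel SVR. The data, hyperparameters, and the use of scikit-learn in place of LIBSVM are all illustrative assumptions, not the setups of the cited models.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                       # stand-in handcrafted features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # synthetic quality scores

# Standardize the features, then regress quality with an RBF-kernel SVR.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X[:160], y[:160])
pred = model.predict(X[160:])
```

In the real pipeline, each feature vector would come from one of the cited extractors (e.g., NFERM's free-energy features) and each target would be an image's MOS.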
Notably, the models mentioned above have exhibited strong performance in previous IQA tasks for natural scenes.
B. Evaluation Criteria
In this study, three primary metrics are utilized to evaluate the consistency between the predicted scores and Mean Opinion Scores (MOSs): the Spearman Rank Correlation Coefficient (SRoCC), the Pearson Linear Correlation Coefficient (PLCC), and Kendall's Rank Correlation Coefficient (KRoCC). The SRoCC measures the similarity between two sets of rankings, the PLCC computes the linear correlation between the predicted scores and the MOSs, and the KRoCC estimates the ordinal association between the two measured quantities.
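All three criteria are available in `scipy.stats`, so the evaluation step reduces to a few calls (a minimal sketch):

```python
from scipy import stats

def iqa_correlations(pred, mos):
    """Return (SRoCC, PLCC, KRoCC) between predicted scores and MOSs."""
    srocc = stats.spearmanr(pred, mos)[0]   # rank similarity
    plcc = stats.pearsonr(pred, mos)[0]     # linear correlation
    krocc = stats.kendalltau(pred, mos)[0]  # ordinal association
    return srocc, plcc, krocc
```

Note that a monotone but nonlinear predictor scores a perfect SRoCC/KRoCC while its PLCC stays below 1, which is precisely why the logistic mapping below is applied before computing PLCC.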
To map the predicted scores to the MOSs, a five-parameter logistic function is applied, which is a standard practice suggested in [30]:

$$\hat{y} = \beta_1 \left( \frac{1}{2} - \frac{1}{1 + e^{\beta_2 (y - \beta_3)}} \right) + \beta_4 y + \beta_5,$$

where $y$ denotes the predicted score, $\hat{y}$ the mapped score, and $\beta_1, \ldots, \beta_5$ the parameters to be fitted.
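The logistic mapping can be fitted by nonlinear least squares; the sketch below uses `scipy.optimize.curve_fit`, with an initialization heuristic that is our own assumption rather than part of [30]:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic5(y, b1, b2, b3, b4, b5):
    """Five-parameter logistic mapping from predicted score y to the MOS scale."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (y - b3)))) + b4 * y + b5

def map_scores(pred, mos):
    """Fit the logistic parameters on (pred, mos) pairs, then map pred."""
    p0 = [np.max(mos), 1.0, np.mean(pred), 0.0, np.mean(mos)]  # rough init
    beta, _ = curve_fit(logistic5, pred, mos, p0=p0, maxfev=10000)
    return logistic5(pred, *beta)
```

The mapped scores `map_scores(pred, mos)` are then used in place of the raw predictions when computing the PLCC.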
C. Experimental Setup
All the benchmark models in III-A are validated on the proposed AGIQA-1K database. The database is randomly split in an 80/20 ratio for training/testing, while ensuring that images with the same object label fall into the same set. The partitioning and evaluation process is repeated several times for a fair comparison while considering the computational complexity, and the average result is reported as the final performance. For the SVR-based models, the number of repetitions is 1,000, implemented with LIBSVM [31] using a radial basis function (RBF) kernel. For the deep learning-based models, the number of repetitions is 10, using ResNet50 [27] as the network backbone. The Adam optimizer [32] (with an initial learning rate of 0.00001 and a batch size of 40) is used for 100 epochs of training on an NVIDIA GTX 4090Ti GPU.
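An object-consistent 80/20 split of this kind can be reproduced with a group-aware splitter, where each image's group is its object keyword. A minimal sketch (the labels below are illustrative placeholders, not the actual Table I keywords):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Illustrative object labels: 5 keywords x 4 images each.
labels = np.repeat(np.array(["panda", "car", "castle", "dog", "cat"]), 4)

# Hold out ~20% of the *groups*, so that all images sharing an
# object label land in the same set and no content leaks across.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(np.zeros(len(labels)), groups=labels))
```

Splitting by group rather than by image prevents a model from scoring well merely by memorizing the appearance of an object it has already seen during training.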
D. Performance Discussion
The performance results on the proposed AGIQA-1K database and the corresponding subsets for the two generative models are exhibited in Table II, from which we can draw several conclusions. 1) The handcrafted-based methods achieve poor performance on the whole database and on both subsets, which indicates that the extracted handcrafted features are not effective for modeling the quality representation of AGIs. This is because most of the handcrafted features employed by these methods are based on prior knowledge learned from NSIs, which apparently does not hold for AGIs. 2) The deep learning-based methods achieve relatively more competitive performance on the whole database and both subsets. However, they are still far from satisfactory. 3) Nearly all the IQA models achieve their best performance on the whole database and undergo significant performance drops on the stable-diffusion-v2 subset. We attempt to explain this phenomenon: more keywords are utilized for the stable-diffusion-v2 model, which makes the AGIs generated by such a model more diverse and complicated. This makes it more challenging for the IQA models to extract quality-aware features from the AGIs, which inevitably leads to performance drops.
We further validate the performance of the IQA models on the AGIQA-1K database for the anime and realistic styles. The experimental results are listed in Table III. The IQA models achieve similar performance across the different styles, which suggests that style has a limited impact on the performance of current IQA models.
Conclusion
AIGC has become increasingly popular as deep learning techniques keep improving. However, due to hardware constraints and technical limitations, the quality of AGIs can vary, necessitating refinement and filtering prior to practical usage. Therefore, there is a critical need to develop objective models to assess the quality of AGIs. In this paper, we first discuss the significant evaluation aspects for AGI quality assessment: technical issues, AI artifacts, unnaturalness, discrepancy, and aesthetics. Then, we construct the first perceptual AGI quality assessment database, AGIQA-1K, containing 1,080 AGIs generated by diffusion models. A well-organized subjective experiment is conducted to collect quality labels for the AGIs. Subsequently, a benchmark experiment is carried out to evaluate the performance of current IQA models. The experimental results reveal that current IQA models are not well qualified to deal with the AGIQA task, and there is still a long way to go.