
A Perceptual Quality Assessment Exploration for AIGC Images

Publisher: IEEE
Zicheng Zhang; Chunyi Li; Wei Sun; Xiaohong Liu; Xiongkuo Min; Guangtao Zhai


Abstract:

AI Generated Content (AIGC) has gained widespread attention with the increasing efficiency of deep learning in content creation. AIGC, created with the assistance of artificial intelligence technology, includes various forms of content, among which the AI-generated images (AGIs) have brought significant impact to society and have been applied to various fields such as entertainment, education, social media, etc. However, due to hardware limitations and technical proficiency, the quality of AIGC images (AGIs) varies, necessitating refinement and filtering before practical use. Consequently, there is an urgent need for developing objective models to assess the quality of AGIs. Unfortunately, no research has been carried out to investigate the perceptual quality assessment for AGIs specifically. Therefore, in this paper, we first discuss the major evaluation aspects such as technical issues, AI artifacts, unnaturalness, discrepancy, and aesthetics for AGI quality assessment. Then we present the first perceptual AGI quality assessment database, AGIQA-1K, which consists of 1,080 AGIs generated from diffusion models. A well-organized subjective experiment is followed to collect the quality labels of the AGIs. Finally, we conduct a benchmark experiment to evaluate the performance of current image quality assessment (IQA) models. The database is released on https://github.com/lcysyzxdxc/AGIQA-1k-Database.
Published in: 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)
Date of Conference: 10-14 July 2023
Date Added to IEEE Xplore: 28 August 2023
Conference Location: Brisbane, Australia

SECTION I.

Introduction

AI-Generated Content (AIGC) refers to any form of content, such as text, images, audio, or video, that is created with the help of artificial intelligence technology. With the flourishing development of deep learning, the efficiency of AIGC generation has increased, and AIGC images (AGIs) are becoming more prevalent in areas such as culture, entertainment, education, social media, etc. Unlike natural scene images (NSIs) that are captured from natural scenes, AGIs are directly generated from AI models, as shown in Fig. 1. Namely, diffusion models [1] and generative adversarial networks [2] are capable of generating a great number of images according to our needs. However, due to hardware limitations and varying technical proficiency, the quality of AGIs is inconsistent, often requiring refinement and filtering before exhibition and practical use. Thus, objective quality models for AGIs are urgently needed.

Fig. 1. Illustration of the generation process of AGIs and NSIs, where NSIs are captured from the natural scenes and AGIs are directly generated from AI models.

During the last decade, large amounts of effort have been put into constructing image quality assessment (IQA) databases and proposing IQA methods for common image contents, such as NSIs [3], JPEG2000-compressed [4], cartoon [5], computer-generated [6], contrast-changed [7], high dynamic range (HDR) [8], in-the-wild [9], screen content [10], and omnidirectional [11] images. All these kinds of images share some common technical quality assessment dimensions such as illumination, blur, contrast, texture, etc. Although AGIs are generated under restrictions to resemble training images such as NSIs, they exhibit some unique quality characteristics, and viewers tend to evaluate their quality from different aspects. We summarize the major quality assessment aspects for AGIs here: a) Technical issues, which refer to the common distortions that affect the visibility of the image content; b) AI artifacts, which indicate the confusing and unexpected components that appear in the images; c) Unnaturalness, which stands for content that goes against common sense and causes discomfort during the viewing experience; d) Discrepancy, which denotes the extent of mismatch between the AGIs and our expectations; e) Aesthetics, which refers to the overall visual appeal and beauty of the images.

Fig. 2. Sample images from the AGIQA-1k database, where the first to sixth rows show AGIs with (bird, cat, batman, kid, man, woman) as the main objects respectively.

However, there has been no scientific research specifically targeted at the perceptual quality of AGIs so far. Therefore, in this paper, we embark on an exploration to address the challenge of evaluating the quality of AIGC by constructing a first-of-its-kind perceptual quality assessment database for AGIs, named AGIQA-1K. Specifically, we employ two latent text-to-image diffusion models [1], stable-inpainting-v1 and stable-diffusion-v2, as the AGI models. Then we choose several of the most popular text keywords from the Internet for AGI generation, and a total of 1,080 AGIs are obtained. Afterward, we carry out a subjective experiment in a well-controlled laboratory environment, where the subjects are asked to perceptually evaluate the quality of the AGIs following the major quality aspects discussed above. Finally, a benchmark experiment is conducted to evaluate the performance of current IQA models, and in-depth discussions are given as well. Our contributions are summarized as follows:

  • We propose a thorough quality assessment guideline for AGIs, the major evaluation aspects include technical issues, AI artifacts, unnaturalness, discrepancy, and aesthetics.

  • We are the first to construct a perceptual AGI quality assessment database (AGIQA-1K), which provides 1,080 AGIs along with quality labels.

  • A benchmark experiment is conducted to evaluate the performance of current IQA models.

Fig. 3. The normalized probability distributions of the quality-related attributes for NSIs and AGIs. The distributions are obtained from 10,073 NSIs in the KonIQ-10k IQA database [12] and 1,080 AGIs in the proposed AGIQA-1k database respectively. The ‘color’ indicates the colorfulness of the images and the ‘SI’ (spatial information) stands for the content diversity of the images.

Table I. Illustration of text keywords for generating the AGIs. All the keywords mentioned in the table are used for the stable-diffusion-v2 while the keywords marked with * are excluded for the stable-inpainting-v1.

SECTION II.

Database Construction

A. AGIs Collection

Considering the success of stable diffusion models, we select two text-to-image diffusion models, stable-inpainting-v1 and stable-diffusion-v2 (sub-models derived from [1]), as the AGI models. To ensure content diversity and catch up with popular trends, we use the hot keywords from the PNGIMG website for AGI generation; the employed keywords are exhibited in Table I, which covers the main objects, the second objects, places, and styles. Some sample AGIs are further illustrated in Fig. 2.
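
For readers who wish to reproduce a similar collection pipeline, the snippet below is a minimal sketch of prompt-driven generation with the Hugging Face diffusers library. The model identifier and the prompt template are illustrative assumptions; the paper does not specify the exact prompt format built from the Table I keywords.

```python
# Illustrative sketch of prompt-driven AGI generation with Hugging Face
# diffusers; the model ID and prompt template are assumptions, not the
# authors' exact generation pipeline.
import itertools
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
).to("cuda")

main_objects = ["bird", "cat", "batman", "kid", "man", "woman"]  # from Table I
styles = ["anime", "realistic"]                                  # from Table I

for obj, style in itertools.product(main_objects, styles):
    prompt = f"a {style} picture of a {obj}"  # hypothetical prompt template
    image = pipe(prompt).images[0]            # PIL image from the pipeline
    image.save(f"agi_{obj}_{style}.png")
```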

In order to evaluate the statistical discrepancy between NSIs and AGIs, we present the distributions of five quality-related attributes for comparison. The NSIs are sourced from the in-the-wild KonIQ-10k IQA database [12], while the AGIs are drawn from the proposed AGIQA-1K database. The quality-related attributes under consideration are light, contrast, colorfulness, blur, and spatial information (SI). Detailed descriptions of these attributes can be found in [13]. As shown in Fig. 3, the quality-related attribute distributions of NSIs and AGIs are quite similar and tend to be Gaussian-like. Specifically, AGIs are relatively blurrier and contain more spatial information than NSIs.
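
As a rough illustration of two of these attributes, the sketch below computes colorfulness with the widely used Hasler-Süsstrunk metric and SI as the standard deviation of a Sobel-filtered luminance plane (following ITU-T P.910). The exact definitions used for Fig. 3 are those of [13], so treat these formulas as assumptions.

```python
# Sketch of two quality-related attributes compared in Fig. 3, assuming
# the common Hasler-Suesstrunk colorfulness metric and the ITU-T P.910
# definition of spatial information (SI); exact definitions follow [13].
import numpy as np
from scipy import ndimage

def colorfulness(rgb: np.ndarray) -> float:
    """Hasler-Suesstrunk colorfulness of an HxWx3 float RGB image."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    rg = r - g
    yb = 0.5 * (r + g) - b
    return float(np.hypot(rg.std(), yb.std())
                 + 0.3 * np.hypot(rg.mean(), yb.mean()))

def spatial_information(rgb: np.ndarray) -> float:
    """SI: std of the Sobel gradient magnitude of the luminance plane."""
    luma = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    gx = ndimage.sobel(luma, axis=1)  # horizontal gradient
    gy = ndimage.sobel(luma, axis=0)  # vertical gradient
    return float(np.hypot(gx, gy).std())
```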

Fig. 4. Exhibition of some common AGI distortions, where the generation keywords are marked in the top right. The content of (a) is poorly visible due to blur and inexplicit texture. Unexpected artifacts are introduced at the bottom left of (b). (c) contains unnatural content, e.g., the hands of the woman defy common sense. (d) is too simple and does not fit the text keyword of “wearing coat”.

B. Subjective Experiment

To evaluate the quality of AGIs, a subjective experiment is conducted following the guidelines of ITU-R BT.500-13 [14]. The subjects are asked to rate the overall quality levels of the exhibited AGIs from the aspects of technical issues, AI artifacts, unnaturalness, discrepancy, and aesthetics. Some typical distortion examples are shown in Fig. 4. The AGIs are presented in random order on an iMac monitor with a resolution of up to 4096×2304, using an interface designed with Python Tkinter, as shown in Fig. 5. The interface allows viewers to browse the previous and next AGIs and rate them on a quality scale that ranges from 0 to 5, with a minimum interval of 0.1. A total of 22 graduate students (10 males and 12 females) participate in the experiment, and they are seated at a distance of around 1.5 times the screen height (45 cm) in a laboratory with normal indoor lighting.
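
The paper only states that the interface is built with Python Tkinter; the following is a minimal sketch of such a rating tool, showing the 0-to-5 scale with a 0.1 step and previous/next browsing. The file layout (`agis/*.png`) and widget arrangement are assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a Tkinter rating interface in the spirit of Fig. 5;
# paths and layout are illustrative assumptions.
import glob
import tkinter as tk
from PIL import Image, ImageTk  # Pillow

paths = sorted(glob.glob("agis/*.png"))
scores, idx = {}, 0

root = tk.Tk()
label = tk.Label(root)
label.pack(side="left")
# 0-5 scale with a 0.1 step, as described in the subjective experiment
scale = tk.Scale(root, from_=5.0, to=0.0, resolution=0.1, orient="vertical")
scale.pack(side="right", fill="y")

def show(i):
    img = ImageTk.PhotoImage(Image.open(paths[i]))
    label.configure(image=img)
    label.image = img  # keep a reference so Tk doesn't garbage-collect it

def step(delta):
    global idx
    scores[paths[idx]] = scale.get()  # record the rating for the current AGI
    idx = max(0, min(len(paths) - 1, idx + delta))
    show(idx)

tk.Button(root, text="Prev", command=lambda: step(-1)).pack()
tk.Button(root, text="Next", command=lambda: step(+1)).pack()
show(idx)
root.mainloop()
```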

To limit the experiment time of each session to less than half an hour, the experiment is split into 5 sessions, each of which includes the subjective quality evaluation of about 200 AGIs. This results in 22×1,080 = 23,760 quality ratings in total.

Fig. 5. An example of the quality assessment interface, where the AGI and corresponding keywords are shown at the same time. The subject can then evaluate the quality of AGIs and record the quality scores with the scroll bar on the right.

C. Subjective Data Analysis

After the subjective experiment, all quality ratings from the subjects are collected. The raw rating judged by the $i$-th subject on the $j$-th image is denoted by $r_{ij}$. Z-scores are obtained from the raw ratings using the following formula:

$$z_{ij} = \frac{r_{ij} - \mu_i}{\sigma_i}, \tag{1}$$

where $\mu_i = \frac{1}{N_i}\sum_{j=1}^{N_i} r_{ij}$, $\sigma_i = \sqrt{\frac{1}{N_i - 1}\sum_{j=1}^{N_i}\left(r_{ij} - \mu_i\right)^2}$, and $N_i$ is the number of images judged by subject $i$. Next, ratings from unreliable subjects are removed using the subject rejection procedure recommended by ITU-R BT.500-13 [14]. The mean opinion score (MOS) of image $j$ is computed by averaging the rescaled z-scores:

$$\mathrm{MOS}_j = \frac{1}{M}\sum_{i=1}^{M} z'_{ij}, \tag{2}$$

where $\mathrm{MOS}_j$ indicates the MOS for the $j$-th AGI, $M$ is the number of valid subjects, and $z'_{ij}$ are the rescaled z-scores. The corresponding MOS distribution in Fig. 6 is consistent with previous works [15], [16] on subjective diversity.
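
A compact NumPy sketch of Eqs. (1) and (2) follows. The subject rejection step is omitted, and the rescaling of z-scores to [0, 100] is a common convention assumed here, since the paper does not state its exact rescaling.

```python
# Sketch of the z-score normalization of Eq. (1) and the MOS of Eq. (2);
# the ITU-R BT.500 subject-rejection step between them is omitted.
import numpy as np

def compute_mos(ratings: np.ndarray) -> np.ndarray:
    """ratings: (num_subjects, num_images) raw scores r_ij; returns MOS_j."""
    mu = ratings.mean(axis=1, keepdims=True)            # per-subject mean
    sigma = ratings.std(axis=1, ddof=1, keepdims=True)  # per-subject std, N_i - 1
    z = (ratings - mu) / sigma                          # Eq. (1)
    # Rescale z-scores, assumed here to map [-3, 3] onto [0, 100];
    # the paper only says "rescaled".
    z_rescaled = (z + 3.0) * 100.0 / 6.0
    return z_rescaled.mean(axis=0)                      # Eq. (2), mean over subjects
```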

SECTION III.

Experiment

A. Benchmark Models

Due to the absence of pristine reference images in the proposed AGIQA-1k database, only no-reference (NR) IQA models are selected for comparison. The selected models can be classified into three groups:

  • Handcrafted-based models: This group includes BMPRI [17], CEIQ [18], DSIQA [19], NIQE [20], and SISBLIM [21]. These models extract handcrafted features based on prior knowledge about image quality.

  • Handcrafted & SVR-based models: This group includes FRIQUEE [22], GMLF [23], HIGRADE [24], NFERM [25], and NFSDM [26]. These models regress handcrafted features with a Support Vector Regression (SVR) to represent perceptual quality.

  • Deep learning-based models: This group includes ResNet50 [27], StairIQA [28], and MGQA [29]. These models characterize quality-aware information by training deep neural networks from labeled data.

Fig. 6. Illustration of the MOS probability distribution.

Notably, the models mentioned above have exhibited strong performance in previous IQA tasks for natural scenes.
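
As a concrete example of the deep learning group, a generic ResNet50-based quality regressor can be written in a few lines of PyTorch. This is a sketch of the common pattern (a pretrained backbone with a single-score regression head), not the exact StairIQA or MGQA architecture, and the weights API assumes a recent torchvision.

```python
# Generic ResNet50 quality regressor in PyTorch; a sketch of the common
# pattern, not the exact StairIQA [28] or MGQA [29] architecture.
import torch
import torch.nn as nn
from torchvision import models

class QualityRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Replace the 1000-way classification head with a single quality score.
        backbone.fc = nn.Linear(backbone.fc.in_features, 1)
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) normalized image batch -> (B,) predicted scores
        return self.backbone(x).squeeze(-1)
```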

B. Evaluation Criteria

In this study, three primary metrics are utilized to evaluate the consistency between the predicted scores and Mean Opinion Scores (MOSs): the Spearman Rank Correlation Coefficient (SRoCC), the Pearson Linear Correlation Coefficient (PLCC), and Kendall's Rank Correlation Coefficient (KRoCC). The SRoCC metric measures the similarity between two sets of rankings, the PLCC metric computes the linear correlation between the predicted scores and MOSs, and the KRoCC metric estimates the ordinal relationship between two measured quantities.
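
All three criteria are available in SciPy; a minimal sketch (the function and variable names are placeholders) is:

```python
# Computing the three criteria with SciPy; `pred` are model predictions
# (after the logistic mapping of Eq. (3) for PLCC) and `mos` the MOSs.
from scipy import stats

def evaluate(pred, mos):
    srocc, _ = stats.spearmanr(pred, mos)   # monotonic (rank) agreement
    plcc, _ = stats.pearsonr(pred, mos)     # linear correlation
    krocc, _ = stats.kendalltau(pred, mos)  # ordinal association
    return srocc, plcc, krocc
```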

To map the predicted scores to MOSs, a five-parameter logistic function is applied, which is a standard practice suggested in [30]:

$$\hat{y} = \alpha_1\left(0.5 - \frac{1}{1 + e^{\alpha_2 (y - \alpha_3)}}\right) + \alpha_4 y + \alpha_5, \tag{3}$$

where $\{\alpha_i \mid i = 1, 2, \ldots, 5\}$ represent the parameters to be fitted, and $y$ and $\hat{y}$ stand for the predicted and mapped scores, respectively.
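
In practice, the parameters of Eq. (3) can be fitted by nonlinear least squares, e.g. with SciPy's curve_fit; the initial guess below is an assumption rather than a value from the paper.

```python
# Fitting the five-parameter logistic mapping of Eq. (3) with SciPy;
# the initial guess p0 is an assumption, not from the paper.
import numpy as np
from scipy.optimize import curve_fit

def logistic5(y, a1, a2, a3, a4, a5):
    return a1 * (0.5 - 1.0 / (1.0 + np.exp(a2 * (y - a3)))) + a4 * y + a5

def map_scores(pred, mos):
    p0 = [np.max(mos) - np.min(mos), 1.0, np.mean(pred), 0.0, np.mean(mos)]
    params, _ = curve_fit(logistic5, pred, mos, p0=p0, maxfev=10000)
    return logistic5(pred, *params)  # mapped scores \hat{y}
```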

C. Experimental Setup

All the benchmark models in III-A are validated on the proposed AGIQA-1k database. The database is split randomly in an 80/20 ratio for training/testing while ensuring that images with the same object label fall into the same set. The partitioning and evaluation process is repeated several times for a fair comparison while considering the computational complexity, and the average result is reported as the final performance. For the SVR-based models, the split is repeated 1,000 times, implemented by LIBSVM [31] with a radial basis function (RBF) kernel. For the deep learning-based models, the split is repeated 10 times, using ResNet50 [27] as the network backbone. The Adam optimizer [32] (with an initial learning rate of 0.00001 and batch size 40) is used for 100 epochs of training on an NVIDIA GTX 4090Ti GPU.
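
A sketch of this protocol is given below, using scikit-learn (whose SVR is backed by LIBSVM) as a stand-in for the original LIBSVM setup. It assumes a precomputed feature matrix `X`, MOS vector `y`, and per-image object labels; these names are placeholders.

```python
# Sketch of the evaluation protocol: a label-aware 80/20 split (images
# sharing an object label stay in the same set) plus an RBF-kernel SVR.
import numpy as np
from scipy import stats
from sklearn.model_selection import GroupShuffleSplit
from sklearn.svm import SVR

def run_once(features, mos, object_labels, seed):
    # GroupShuffleSplit keeps all images of one object label in one set.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(
        splitter.split(features, mos, groups=object_labels))
    model = SVR(kernel="rbf").fit(features[train_idx], mos[train_idx])
    pred = model.predict(features[test_idx])
    return stats.spearmanr(pred, mos[test_idx])[0]

# Average SRoCC over repeated random splits, e.g.:
# srocc = np.mean([run_once(X, y, labels, s) for s in range(1000)])
```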

Table II. Performance results on the AGIQA-1k database and two different generative model subsets. The best performance results are marked in RED and the second-best performance results are marked in BLUE.

D. Performance Discussion

The performance results on the proposed AGIQA-1K database and the corresponding two generative model subsets are exhibited in Table II, from which we can draw several conclusions. 1) The handcrafted-based methods achieve poor performance on the whole database and both subsets, which indicates that the extracted handcrafted features are not effective for modeling the quality representation of AGIs. This is because most of the handcrafted features employed by these methods are based on prior knowledge learned from NSIs, which apparently does not hold for AGIs. 2) The deep learning-based methods achieve relatively more competitive performance on the whole database and both subsets. However, they are still far from satisfactory. 3) Nearly all the IQA models achieve their best performance on the whole database and undergo significant performance drops on the stable-diffusion-v2 subset. We attribute this phenomenon to the fact that more keywords are utilized for the stable-diffusion-v2 model, making the AGIs generated by this model more diverse and complicated. This makes it more challenging for the IQA models to extract quality-aware features from the AGIs, which inevitably leads to performance drops.

We further validate the performance of the IQA models on the AGIQA-1K database with the anime and realistic styles. The experimental results are listed in Table III. The IQA models achieve similar performance across different styles, which suggests that the styles have a limited impact on the performance of current IQA models.

Table III. Performance results on the AGIQA-1k database with different styles. The best performance results are marked in RED and the second-best performance results are marked in BLUE.

SECTION IV.

Conclusion

AIGC has become increasingly popular as deep learning techniques keep improving. However, due to hardware constraints and technical limitations, the quality of AGIs can vary, necessitating refinement and filtering prior to practical usage. Therefore, there is a critical need for developing objective models to assess the quality of AGIs. In this paper, we first discuss significant evaluation aspects, such as technical issues, AI artifacts, unnaturalness, discrepancy, and aesthetics, for AGI quality assessment. Then, we construct the first perceptual AGI quality assessment database, AGIQA-1K, containing 1,080 AGIs generated from diffusion models. A well-organized subjective experiment is conducted to collect quality labels for the AGIs. Subsequently, a benchmark experiment is carried out to evaluate the performance of current IQA models. The experimental results reveal that the current IQA models are not well qualified to deal with the AGI quality assessment task, and there is still a long way to go.

References

[1] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser and Björn Ommer, "High-resolution image synthesis with latent diffusion models", IEEE/CVF CVPR, 2022.
[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, et al., "Generative adversarial networks", Communications of the ACM, 2020.
[3] Zhenyu Peng, Qiuping Jiang, Feng Shao, Wei Gao and Weisi Lin, "LGGD+: Image retargeting quality assessment by measuring local and global geometric distortions", IEEE TCSVT, 2022.
[4] Luhong Liang, Shiqi Wang, Jianhua Chen, Siwei Ma, Debin Zhao and Wen Gao, "No-reference perceptual image quality metric using gradient profiles for JPEG2000", Signal Processing: Image Communication, 2010.
[5] Chunyi Li, Zicheng Zhang, Wei Sun, Xiongkuo Min and Guangtao Zhai, "A full-reference quality assessment metric for cartoon images", IEEE MMSP, 2022.
[6] Zicheng Zhang, Wei Sun, Xiongkuo Min, Tao Wang, Wei Lu and Guangtao Zhai, "Distinguishing computer-generated images from photographic images: A texture-aware deep learning-based method", IEEE VCIP, 2022.
[7] Shiqi Wang, Kede Ma, Hojatollah Yeganeh, Zhou Wang and Weisi Lin, "A patch-structure representation method for quality assessment of contrast changed images", IEEE SPL, 2015.
[8] Ke Gu, Shiqi Wang, Guangtao Zhai, Siwei Ma, Xiaokang Yang, Weisi Lin, et al., "Blind quality assessment of tone-mapped images via analysis of information naturalness and structure", IEEE TMM, 2016.
[9] Zhenqiang Ying, Haoran Niu, Praful Gupta, Dhruv Mahajan, Deepti Ghadiyaram and Alan Bovik, "From patches to pictures (PaQ-2-PiQ): Mapping the perceptual space of picture quality", IEEE/CVF CVPR, 2020.
[10] Shiqi Wang, Ke Gu, Xinfeng Zhang, Weisi Lin, Siwei Ma and Wen Gao, "Reduced-reference quality assessment of screen content images", IEEE TCSVT, 2018.
[11] Mai Xu, Chen Li, Shanyi Zhang and Patrick Le Callet, "State-of-the-art in 360° video/image processing: Perception, assessment and compression", IEEE JSTSP, 2020.
[12] V. Hosu, H. Lin, T. Sziranyi and D. Saupe, "KonIQ-10k: An ecologically valid database for deep learning of blind image quality assessment", IEEE TIP, 2020.
[13] Vlad Hosu, Franz Hahn, Mohsen Jenadeleh, Hanhe Lin, Hui Men, Tamás Szirányi, et al., "The Konstanz natural video database (KoNViD-1k)", IEEE QoMEX, 2017.
[14] "Methodology for the subjective assessment of the quality of television pictures", ITU-R Recommendation BT.500-11, 2002.
[15] Yixuan Gao, Xiongkuo Min, Yucheng Zhu, Jing Li, Xiao-Ping Zhang and Guangtao Zhai, "Image quality assessment: From mean opinion score to opinion score distribution", ACM MM, 2022.
[16] Hossein Talebi and Peyman Milanfar, "NIMA: Neural image assessment", IEEE TIP, 2018.
[17] Xiongkuo Min, Guangtao Zhai, Ke Gu, Yutao Liu and Xiaokang Yang, "Blind image quality estimation via distortion aggravation", IEEE TBC, 2018.
[18] Jia Yan, Jie Li and Xin Fu, "No-reference quality assessment of contrast-distorted images using contrast enhancement", arXiv preprint, 2019.
[19] Niranjan D. Narvekar and Lina J. Karam, "A no-reference perceptual image sharpness metric based on a cumulative probability of blur detection", QoMEX, 2009.
[20] Anish Mittal, Rajiv Soundararajan and Alan C. Bovik, "Making a “completely blind” image quality analyzer", IEEE SPL, 2012.
[21] Ke Gu, Guangtao Zhai, Xiaokang Yang and Wenjun Zhang, "Hybrid no-reference quality metric for singly and multiply distorted images", IEEE TBC, 2014.
[22] Deepti Ghadiyaram and Alan C. Bovik, "Perceptual quality prediction on authentically distorted images using a bag of features approach", Journal of Vision, 2017.
[23] Wufeng Xue, Xuanqin Mou, Lei Zhang, Alan C. Bovik and Xiangchu Feng, "Blind image quality assessment using joint statistics of gradient magnitude and Laplacian features", IEEE TIP, 2014.
[24] D. Kundu, D. Ghadiyaram, A. C. Bovik and B. L. Evans, "Large-scale crowdsourced study for high dynamic range images", IEEE TIP, 2017.
[25] Ke Gu, Guangtao Zhai, Xiaokang Yang and Wenjun Zhang, "Using free energy principle for blind image quality assessment", IEEE TMM, 2014.
[26] Ke Gu, Guangtao Zhai, Xiaokang Yang, Wenjun Zhang and Longfei Liang, "No-reference image quality assessment metric by combining free energy theory and structural degradation model", IEEE ICME, 2013.
[27] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, "Deep residual learning for image recognition", IEEE/CVF CVPR, 2016.
[28] Wei Sun, Huiyu Duan, Xiongkuo Min, Li Chen and Guangtao Zhai, "Blind quality assessment for in-the-wild images via hierarchical feature fusion strategy", IEEE BMSB, 2022.
[29] Tao Wang, Wei Sun, Xiongkuo Min, Wei Lu, Zicheng Zhang and Guangtao Zhai, "A multi-dimensional aesthetic quality assessment model for mobile game images", IEEE VCIP, 2021.
[30] Hamid R. Sheikh, Muhammad F. Sabir and Alan C. Bovik, "A statistical evaluation of recent full reference image quality assessment algorithms", IEEE TIP, 2006.
[31] Chih-Chung Chang and Chih-Jen Lin, "LIBSVM: A library for support vector machines", ACM TIST, 2011.
[32] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization", ICLR, 2015.