
High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach1   Andreas Blattmann1   Dominik Lorenz1   Patrick Esser2   Björn Ommer1
1Ludwig Maximilian University of Munich & IWR, Heidelberg University, Germany   2Runway ML

https://github.com/CompVis/latent-diffusion
The first two authors contributed equally to this work.
Abstract

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including text-to-image synthesis, unconditional image generation and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

1 Introduction

[Figure 1 panels: Input; ours ($f=4$), PSNR 27.4, R-FID 0.58; DALL-E ($f=8$), PSNR 22.8, R-FID 32.01; VQGAN ($f=16$), PSNR 19.9, R-FID 4.98]

Figure 1: Boosting the upper bound on achievable quality with less aggressive downsampling. Since diffusion models offer excellent inductive biases for spatial data, we do not need the heavy spatial downsampling of related generative models in latent space, but can still greatly reduce the dimensionality of the data via suitable autoencoding models, see Sec. 3. Images are from the DIV2K [1] validation set, evaluated at $512^2$ px. We denote the spatial downsampling factor by $f$. Reconstruction FIDs [29] and PSNR are calculated on ImageNet-val [12]; see also Tab. 8.

Image synthesis is one of the computer vision fields with the most spectacular recent development, but also among those with the greatest computational demands. Especially high-resolution synthesis of complex, natural scenes is presently dominated by scaling up likelihood-based models, potentially containing billions of parameters in autoregressive (AR) transformers [66, 67]. In contrast, the promising results of GANs [27, 3, 40] have been revealed to be mostly confined to data with comparably limited variability as their adversarial learning procedure does not easily scale to modeling complex, multi-modal distributions. Recently, diffusion models [82], which are built from a hierarchy of denoising autoencoders, have shown to achieve impressive results in image synthesis [30, 85] and beyond [45, 7, 48, 57], and define the state-of-the-art in class-conditional image synthesis [15, 31] and super-resolution [72]. Moreover, even unconditional DMs can readily be applied to tasks such as inpainting and colorization [85] or stroke-based synthesis [53], in contrast to other types of generative models [46, 69, 19]. Being likelihood-based models, they do not exhibit mode-collapse and training instabilities as GANs and, by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models [67].

Democratizing High-Resolution Image Synthesis

DMs belong to the class of likelihood-based models, whose mode-covering behavior makes them prone to spend excessive amounts of capacity (and thus compute resources) on modeling imperceptible details of the data [16, 73]. Although the reweighted variational objective [30] aims to address this by undersampling the initial denoising steps, DMs are still computationally demanding, since training and evaluating such a model requires repeated function evaluations (and gradient computations) in the high-dimensional space of RGB images. As an example, training the most powerful DMs often takes hundreds of GPU days (e.g. 150 - 1000 V100 days in [15]), and repeated evaluations on a noisy version of the input space also render inference expensive, so that producing 50k samples takes approximately 5 days [15] on a single A100 GPU. This has two consequences for the research community and users in general: Firstly, training such a model requires massive computational resources only available to a small fraction of the field, and leaves a huge carbon footprint [65, 86]. Secondly, evaluating an already trained model is also expensive in time and memory, since the same model architecture must run sequentially for a large number of steps (e.g. 25 - 1000 steps in [15]).

To increase the accessibility of this powerful model class and at the same time reduce its significant resource consumption, a method is needed that reduces the computational complexity for both training and sampling. Reducing the computational demands of DMs without impairing their performance is, therefore, key to enhancing their accessibility.

Departure to Latent Space

Our approach starts with the analysis of already trained diffusion models in pixel space: Fig. 2 shows the rate-distortion trade-off of a trained model. As with any likelihood-based model, learning can be roughly divided into two stages: First is a perceptual compression stage which removes high-frequency details but still learns little semantic variation. In the second stage, the actual generative model learns the semantic and conceptual composition of the data (semantic compression). We thus aim to first find a perceptually equivalent, but computationally more suitable space, in which we will train diffusion models for high-resolution image synthesis.

Following common practice [96, 67, 23, 11, 66], we separate training into two distinct phases: First, we train an autoencoder which provides a lower-dimensional (and thereby efficient) representational space which is perceptually equivalent to the data space. Importantly, and in contrast to previous work [23, 66], we do not need to rely on excessive spatial compression, as we train DMs in the learned latent space, which exhibits better scaling properties with respect to the spatial dimensionality. The reduced complexity also provides efficient image generation from the latent space with a single network pass. We dub the resulting model class Latent Diffusion Models (LDMs).

A notable advantage of this approach is that we need to train the universal autoencoding stage only once and can therefore reuse it for multiple DM trainings or to explore possibly completely different tasks [81]. This enables efficient exploration of a large number of diffusion models for various image-to-image and text-to-image tasks. For the latter, we design an architecture that connects transformers to the DM’s UNet backbone [71] and enables arbitrary types of token-based conditioning mechanisms, see Sec. 3.3.

Figure 2: Illustrating perceptual and semantic compression: Most bits of a digital image correspond to imperceptible details. While DMs allow suppressing this semantically meaningless information by minimizing the responsible loss term, gradients (during training) and the neural network backbone (training and inference) still need to be evaluated on all pixels, leading to superfluous computations and unnecessarily expensive optimization and inference.

We propose latent diffusion models (LDMs) as an effective generative model and a separate mild compression stage that only eliminates imperceptible details. Data and images from [30].

In sum, our work makes the following contributions:

(i) In contrast to purely transformer-based approaches [23, 66], our method scales more gracefully to higher dimensional data and can thus (a) work on a compression level which provides more faithful and detailed reconstructions than previous work (see Fig. 1) and (b) can be efficiently applied to high-resolution synthesis of megapixel images.

(ii) We achieve competitive performance on multiple tasks (unconditional image synthesis, inpainting, stochastic super-resolution) and datasets while significantly lowering computational costs. Compared to pixel-based diffusion approaches, we also significantly decrease inference costs.

(iii) We show that, in contrast to previous work [93] which learns both an encoder/decoder architecture and a score-based prior simultaneously, our approach does not require a delicate weighting of reconstruction and generative abilities. This ensures extremely faithful reconstructions and requires very little regularization of the latent space.

(iv) We find that for densely conditioned tasks such as super-resolution, inpainting and semantic synthesis, our model can be applied in a convolutional fashion and render large, consistent images of $\sim 1024^2$ px.

(v) Moreover, we design a general-purpose conditioning mechanism based on cross-attention, enabling multi-modal training. We use it to train class-conditional, text-to-image and layout-to-image models.

(vi) Finally, we release pretrained latent diffusion and autoencoding models at https://github.com/CompVis/latent-diffusion which might be reusable for various tasks besides training of DMs [81].

2 Related Work

Generative Models for Image Synthesis The high dimensional nature of images presents distinct challenges to generative modeling. Generative Adversarial Networks (GAN) [27] allow for efficient sampling of high resolution images with good perceptual quality [3, 42], but are difficult to optimize [54, 2, 28] and struggle to capture the full data distribution [55]. In contrast, likelihood-based methods emphasize good density estimation which renders optimization more well-behaved. Variational autoencoders (VAE) [46] and flow-based models [18, 19] enable efficient synthesis of high resolution images [9, 92, 44], but sample quality is not on par with GANs. While autoregressive models (ARM) [95, 94, 6, 10] achieve strong performance in density estimation, computationally demanding architectures [97] and a sequential sampling process limit them to low resolution images. Because pixel based representations of images contain barely perceptible, high-frequency details [16, 73], maximum-likelihood training spends a disproportionate amount of capacity on modeling them, resulting in long training times. To scale to higher resolutions, several two-stage approaches [101, 67, 23, 103] use ARMs to model a compressed latent image space instead of raw pixels.

Recently, Diffusion Probabilistic Models (DM) [82] have achieved state-of-the-art results in density estimation [45] as well as in sample quality [15]. The generative power of these models stems from a natural fit to the inductive biases of image-like data when their underlying neural backbone is implemented as a UNet [71, 30, 85, 15]. The best synthesis quality is usually achieved when a reweighted objective [30] is used for training. In this case, the DM corresponds to a lossy compressor and allows trading image quality for compression capabilities. Evaluating and optimizing these models in pixel space, however, has the downside of low inference speed and very high training costs. While the former can be partially addressed by advanced sampling strategies [84, 75, 47] and hierarchical approaches [31, 93], training on high-resolution image data always requires calculating expensive gradients. We address both drawbacks with our proposed LDMs, which work on a compressed latent space of lower dimensionality. This renders training computationally cheaper and speeds up inference with almost no reduction in synthesis quality (see Fig. 1).

Two-Stage Image Synthesis To mitigate the shortcomings of individual generative approaches, a lot of research [11, 70, 23, 103, 101, 67] has gone into combining the strengths of different methods into more efficient and performant models via a two stage approach. VQ-VAEs [101, 67] use autoregressive models to learn an expressive prior over a discretized latent space. [66] extend this approach to text-to-image generation by learning a joint distribution over discretized image and text representations. More generally, [70] uses conditionally invertible networks to provide a generic transfer between latent spaces of diverse domains. Different from VQ-VAEs, VQGANs [23, 103] employ a first stage with an adversarial and perceptual objective to scale autoregressive transformers to larger images. However, the high compression rates required for feasible ARM training, which introduces billions of trainable parameters [66, 23], limit the overall performance of such approaches, and less compression comes at the price of high computational cost [66, 23]. Our work prevents such trade-offs, as our proposed LDMs scale more gently to higher dimensional latent spaces due to their convolutional backbone. Thus, we are free to choose the level of compression which optimally mediates between learning a powerful first stage, without leaving too much perceptual compression up to the generative diffusion model while guaranteeing high-fidelity reconstructions (see Fig. 1).

While approaches to jointly [93] or separately [80] learn an encoding/decoding model together with a score-based prior exist, the former still require a difficult weighting between reconstruction and generative capabilities [11] and are outperformed by our approach (Sec. 4), and the latter focus on highly structured images such as human faces.

3 Method

To lower the computational demands of training diffusion models towards high-resolution image synthesis, we observe that although diffusion models allow ignoring perceptually irrelevant details by undersampling the corresponding loss terms [30], they still require costly function evaluations in pixel space, which causes huge demands in computation time and energy resources.

We propose to circumvent this drawback by introducing an explicit separation of the compressive from the generative learning phase (see Fig. 2). To achieve this, we utilize an autoencoding model which learns a space that is perceptually equivalent to the image space, but offers significantly reduced computational complexity.

Such an approach offers several advantages: (i) By leaving the high-dimensional image space, we obtain DMs which are computationally much more efficient because sampling is performed on a low-dimensional space. (ii) We exploit the inductive bias of DMs inherited from their UNet architecture [71], which makes them particularly effective for data with spatial structure and therefore alleviates the need for aggressive, quality-reducing compression levels as required by previous approaches [23, 66]. (iii) Finally, we obtain general-purpose compression models whose latent space can be used to train multiple generative models and which can also be utilized for other downstream applications such as single-image CLIP-guided synthesis [25].

3.1 Perceptual Image Compression

Our perceptual compression model is based on previous work [23] and consists of an autoencoder trained by combination of a perceptual loss [106] and a patch-based [33] adversarial objective [20, 23, 103]. This ensures that the reconstructions are confined to the image manifold by enforcing local realism and avoids blurriness introduced by relying solely on pixel-space losses such as $L_2$ or $L_1$ objectives.

More precisely, given an image $x \in \mathbb{R}^{H \times W \times 3}$ in RGB space, the encoder $\mathcal{E}$ encodes $x$ into a latent representation $z = \mathcal{E}(x)$, and the decoder $\mathcal{D}$ reconstructs the image from the latent, giving $\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$, where $z \in \mathbb{R}^{h \times w \times c}$. Importantly, the encoder downsamples the image by a factor $f = H/h = W/w$, and we investigate different downsampling factors $f = 2^m$, with $m \in \mathbb{N}$.

In order to avoid arbitrarily high-variance latent spaces, we experiment with two different kinds of regularizations. The first variant, KL-reg., imposes a slight KL-penalty towards a standard normal on the learned latent, similar to a VAE [46, 69], whereas VQ-reg. uses a vector quantization layer [96] within the decoder. This model can be interpreted as a VQGAN [23] but with the quantization layer absorbed by the decoder. Because our subsequent DM is designed to work with the two-dimensional structure of our learned latent space $z = \mathcal{E}(x)$, we can use relatively mild compression rates and achieve very good reconstructions. This is in contrast to previous works [23, 66], which relied on an arbitrary 1D ordering of the learned space $z$ to model its distribution autoregressively and thereby ignored much of the inherent structure of $z$. Hence, our compression model preserves details of $x$ better (see Tab. 8). The full objective and training details can be found in the supplement.
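For illustration, a shape-level PyTorch sketch of such a compression stage (this is not the autoencoder used in this work, whose full objective combines the perceptual loss, the patch-based adversarial loss and the regularization above; the architecture, channel counts and names below are simplifying assumptions):

```python
import torch
import torch.nn as nn

class ToyCompressionAE(nn.Module):
    """Shape-level sketch: the encoder downsamples by f = 2**num_down and the
    KL-regularized variant predicts a diagonal Gaussian over the latent z."""
    def __init__(self, num_down=2, c=4, width=64):
        super().__init__()
        enc, ch = [nn.Conv2d(3, width, 3, padding=1)], width
        for _ in range(num_down):                      # each stride-2 conv halves H and W
            enc += [nn.SiLU(), nn.Conv2d(ch, 2 * ch, 4, stride=2, padding=1)]
            ch *= 2
        enc += [nn.SiLU(), nn.Conv2d(ch, 2 * c, 3, padding=1)]   # -> (mean, logvar)
        self.encoder = nn.Sequential(*enc)
        dec = [nn.Conv2d(c, ch, 3, padding=1)]
        for _ in range(num_down):                      # mirror: each block doubles H and W
            dec += [nn.SiLU(), nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1)]
            ch //= 2
        dec += [nn.SiLU(), nn.Conv2d(ch, 3, 3, padding=1)]
        self.decoder = nn.Sequential(*dec)

    def encode(self, x):                               # x: (B, 3, H, W)
        mean, logvar = self.encoder(x).chunk(2, dim=1)
        z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)
        kl = 0.5 * (mean.pow(2) + logvar.exp() - 1 - logvar).mean()   # slight KL penalty (KL-reg.)
        return z, kl                                   # z: (B, c, H/f, W/f)

    def decode(self, z):
        return self.decoder(z)

ae = ToyCompressionAE(num_down=2)                      # f = 4
z, kl = ae.encode(torch.randn(1, 3, 256, 256))
assert tuple(z.shape[-2:]) == (64, 64)                 # 256 / f = 64
x_rec = ae.decode(z)                                   # back to (1, 3, 256, 256)
```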

3.2 Latent Diffusion Models

Diffusion Models [82] are probabilistic models designed to learn a data distribution $p(x)$ by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a fixed Markov Chain of length $T$. For image synthesis, the most successful models [30, 15, 72] rely on a reweighted variant of the variational lower bound on $p(x)$, which mirrors denoising score-matching [85]. These models can be interpreted as an equally weighted sequence of denoising autoencoders $\epsilon_\theta(x_t, t)$, $t = 1 \dots T$, which are trained to predict a denoised variant of their input $x_t$, where $x_t$ is a noisy version of the input $x$. The corresponding objective can be simplified to (Sec. B)

L_{DM} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[\|\epsilon - \epsilon_\theta(x_t, t)\|_2^2\Big]\,,   (1)

with $t$ uniformly sampled from $\{1, \dots, T\}$.

Generative Modeling of Latent Representations With our trained perceptual compression models consisting of $\mathcal{E}$ and $\mathcal{D}$, we now have access to an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away. Compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models, as they can now (i) focus on the important, semantic bits of the data and (ii) train in a lower dimensional, computationally much more efficient space.

Unlike previous work that relied on autoregressive, attention-based transformer models in a highly compressed, discrete latent space [66, 23, 103], we can take advantage of image-specific inductive biases that our model offers. This includes the ability to build the underlying UNet primarily from 2D convolutional layers, and further focusing the objective on the perceptually most relevant bits using the reweighted bound, which now reads

Figure 3: We condition LDMs either via concatenation or by a more general cross-attention mechanism. See Sec. 3.3
L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[\|\epsilon - \epsilon_\theta(z_t, t)\|_2^2\Big]\,.   (2)

The neural backbone $\epsilon_\theta(\circ, t)$ of our model is realized as a time-conditional UNet [71]. Since the forward process is fixed, $z_t$ can be efficiently obtained from $\mathcal{E}$ during training, and samples from $p(z)$ can be decoded to image space with a single pass through $\mathcal{D}$.
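A minimal sketch of the corresponding training loss (Eq. 2), assuming a frozen first-stage encoder `encoder`, a time-conditional UNet `eps_theta` and a variance-preserving noise schedule given by its cumulative products `alphas_cumprod`; since the forward process is fixed and Gaussian, $z_t$ can be sampled in closed form:

```python
import torch
import torch.nn.functional as F

def ldm_training_loss(x, encoder, eps_theta, alphas_cumprod):
    """One evaluation of Eq. (2): predict the noise added to the latent z_0 = E(x).
    x: image batch; encoder: frozen first stage E; eps_theta: time-conditional UNet;
    alphas_cumprod: tensor of shape (T,) with the cumulative noise-schedule products."""
    with torch.no_grad():
        z0 = encoder(x)                                        # z_0 = E(x), shape (B, c, h, w)
    a = alphas_cumprod.to(z0.device)
    t = torch.randint(0, a.shape[0], (z0.shape[0],), device=z0.device)  # t ~ Uniform{1, ..., T}
    eps = torch.randn_like(z0)                                 # eps ~ N(0, I)
    a_bar = a[t].view(-1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps       # closed-form forward process
    return F.mse_loss(eps_theta(z_t, t), eps)                  # || eps - eps_theta(z_t, t) ||^2
```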

[Figure 4 sample grid; columns: CelebA-HQ, FFHQ, LSUN-Churches, LSUN-Bedrooms, ImageNet]
Figure 4: Samples from LDMs trained on CelebAHQ [39], FFHQ [41], LSUN-Churches [102], LSUN-Bedrooms [102] and class-conditional ImageNet [12], each with a resolution of $256 \times 256$. Best viewed when zoomed in. For more samples cf. the supplement.

3.3 Conditioning Mechanisms

Similar to other types of generative models [56, 83], diffusion models are in principle capable of modeling conditional distributions of the form $p(z|y)$. This can be implemented with a conditional denoising autoencoder $\epsilon_\theta(z_t, t, y)$ and paves the way to controlling the synthesis process through inputs $y$ such as text [68], semantic maps [61, 33] or other image-to-image translation tasks [34].

In the context of image synthesis, however, combining the generative power of DMs with other types of conditionings beyond class-labels [15] or blurred variants of the input image [72] is so far an under-explored area of research.

We turn DMs into more flexible conditional image generators by augmenting their underlying UNet backbone with the cross-attention mechanism [97], which is effective for learning attention-based models of various input modalities [36, 35]. To pre-process $y$ from various modalities (such as language prompts) we introduce a domain specific encoder $\tau_\theta$ that projects $y$ to an intermediate representation $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$, which is then mapped to the intermediate layers of the UNet via a cross-attention layer implementing $\text{Attention}(Q, K, V) = \text{softmax}\big(\tfrac{QK^T}{\sqrt{d}}\big) \cdot V$, with

Q = W_Q^{(i)} \cdot \varphi_i(z_t), \quad K = W_K^{(i)} \cdot \tau_\theta(y), \quad V = W_V^{(i)} \cdot \tau_\theta(y).

Here, $\varphi_i(z_t) \in \mathbb{R}^{N \times d_\epsilon^i}$ denotes a (flattened) intermediate representation of the UNet implementing $\epsilon_\theta$ and $W_V^{(i)} \in \mathbb{R}^{d \times d_\epsilon^i}$, $W_Q^{(i)} \in \mathbb{R}^{d \times d_\tau}$ & $W_K^{(i)} \in \mathbb{R}^{d \times d_\tau}$ are learnable projection matrices [97, 36]. See Fig. 3 for a visual depiction.
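A minimal single-head sketch of such a cross-attention layer (the UNet used here employs multi-head attention inside its intermediate blocks; the output projection and residual connection below are simplifying assumptions of this sketch):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head sketch of Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V with
    Q = W_Q phi_i(z_t), K = W_K tau_theta(y), V = W_V tau_theta(y)."""
    def __init__(self, d_eps, d_tau, d=64):
        super().__init__()
        self.W_Q = nn.Linear(d_eps, d, bias=False)     # acts on flattened UNet features
        self.W_K = nn.Linear(d_tau, d, bias=False)     # acts on conditioning tokens tau_theta(y)
        self.W_V = nn.Linear(d_tau, d, bias=False)
        self.proj_out = nn.Linear(d, d_eps, bias=False)
        self.scale = d ** -0.5

    def forward(self, phi, tau_y):
        # phi: (B, N, d_eps) flattened UNet features; tau_y: (B, M, d_tau) conditioning tokens
        Q, K, V = self.W_Q(phi), self.W_K(tau_y), self.W_V(tau_y)
        attn = torch.softmax(Q @ K.transpose(1, 2) * self.scale, dim=-1)   # (B, N, M)
        return phi + self.proj_out(attn @ V)           # write attended conditioning back into the UNet features
```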

Based on image-conditioning pairs, we then learn the conditional LDM via

L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[\|\epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y))\|_2^2\Big]\,,   (3)

where both $\tau_\theta$ and $\epsilon_\theta$ are jointly optimized via Eq. 3. This conditioning mechanism is flexible as $\tau_\theta$ can be parameterized with domain-specific experts, e.g. (unmasked) transformers [97] when $y$ are text prompts (see Sec. 4.3.1).

4 Experiments

[Figure 5 sample grid: text-to-image synthesis on LAION with the 1.45B-parameter model. Prompts: 'A street sign that reads "Latent Diffusion"', 'A zombie in the style of Picasso', 'An image of an animal half mouse half octopus', 'An illustration of a slightly conscious neural network', 'A painting of a squirrel eating a burger', 'A watercolor painting of a chair that looks like an octopus', 'A shirt with the inscription: "I love generative models!"']
Figure 5: Samples for user-defined text prompts from our model for text-to-image synthesis, LDM-8 (KL), which was trained on the LAION [78] database. Samples generated with 200 DDIM steps and $\eta = 1.0$. We use unconditional guidance [32] with $s = 10.0$.
Figure 6: Analyzing the training of class-conditional LDMs with different downsampling factors $f$ over 2M train steps on the ImageNet dataset. Pixel-based LDM-1 requires substantially larger train times compared to models with larger downsampling factors (LDM-{4-16}). Too much perceptual compression as in LDM-32 limits the overall sample quality. All models are trained on a single NVIDIA A100 with the same computational budget. Results obtained with 100 DDIM steps [84] and $\kappa = 0$.
Figure 7: Comparing LDMs with varying compression on the CelebA-HQ (left) and ImageNet (right) datasets. Different markers indicate $\{10, 20, 50, 100, 200\}$ sampling steps using DDIM, from right to left along each line. The dashed line shows the FID scores for 200 steps, indicating the strong performance of LDM-{4-8}. FID scores assessed on 5000 samples. All models were trained for 500k (CelebA) / 2M (ImageNet) steps on an A100.

LDMs provide the means for flexible and computationally tractable diffusion-based image synthesis of various image modalities, which we empirically show in the following. Firstly, however, we analyze the gains of our models compared to pixel-based diffusion models in both training and inference. Interestingly, we find that LDMs trained in VQ-regularized latent spaces sometimes achieve better sample quality, even though the reconstruction capabilities of VQ-regularized first stage models slightly fall behind those of their continuous counterparts, cf. Tab. 8. A visual comparison between the effects of first stage regularization schemes on LDM training and their generalization abilities to resolutions $>256^2$ can be found in Appendix D.1. In E.2 we list details on architecture, implementation, training and evaluation for all results presented in this section.

4.1 On Perceptual Compression Tradeoffs

This section analyzes the behavior of our LDMs with different downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$ (abbreviated as LDM-$f$, where LDM-1 corresponds to pixel-based DMs). To obtain a comparable test-field, we fix the computational resources to a single NVIDIA A100 for all experiments in this section and train all models for the same number of steps and with the same number of parameters.

Tab. 8 shows hyperparameters and reconstruction performance of the first stage models used for the LDMs compared in this section. Fig. 6 shows sample quality as a function of training progress for 2M steps of class-conditional models on the ImageNet [12] dataset. We see that, i) small downsampling factors for LDM-{1,2} result in slow training progress, whereas ii) overly large values of $f$ cause stagnating fidelity after comparably few training steps. Revisiting the analysis above (Fig. 1 and 2) we attribute this to i) leaving most of perceptual compression to the diffusion model and ii) too strong first stage compression resulting in information loss and thus limiting the achievable quality. LDM-{4-16} strike a good balance between efficiency and perceptually faithful results, which manifests in a significant FID [29] gap of 38 between pixel-based diffusion (LDM-1) and LDM-8 after 2M training steps.

In Fig. 7, we compare models trained on CelebA-HQ [39] and ImageNet in terms of sampling speed for different numbers of denoising steps with the DDIM sampler [84] and plot it against FID-scores [29]. LDM-{4-8} outperform models with unsuitable ratios of perceptual and conceptual compression. Especially compared to pixel-based LDM-1, they achieve much lower FID scores while simultaneously significantly increasing sample throughput. Complex datasets such as ImageNet require reduced compression rates to avoid reducing quality. In summary, LDM-4 and -8 offer the best conditions for achieving high-quality synthesis results.

CelebA-HQ 256×256
Method                   FID↓        Prec.↑  Recall↑
DC-VAE [63]              15.8        -       -
VQGAN+T. [23] (k=400)    10.2        -       -
PGGAN [39]               8.0         -       -
LSGM [93]                7.22        -       -
UDM [43]                 7.16        -       -
LDM-4 (ours, 500-s)      5.11        0.72    0.49

FFHQ 256×256
Method                   FID↓        Prec.↑  Recall↑
ImageBART [21]           9.57        -       -
U-Net GAN (+aug) [77]    10.9 (7.6)  -       -
UDM [43]                 5.54        -       -
StyleGAN [41]            4.16        0.71    0.46
ProjectedGAN [76]        3.08        0.65    0.46
LDM-4 (ours, 200-s)      4.98        0.73    0.50

LSUN-Churches 256×256
Method                   FID↓   Prec.↑  Recall↑
DDPM [30]                7.89   -       -
ImageBART [21]           7.32   -       -
PGGAN [39]               6.42   -       -
StyleGAN [41]            4.21   -       -
StyleGAN2 [42]           3.86   -       -
ProjectedGAN [76]        1.59   0.61    0.44
LDM-8 (ours, 200-s)      4.02   0.64    0.52

LSUN-Bedrooms 256×256
Method                   FID↓   Prec.↑  Recall↑
ImageBART [21]           5.51   -       -
DDPM [30]                4.9    -       -
UDM [43]                 4.57   -       -
StyleGAN [41]            2.35   0.59    0.48
ADM [15]                 1.90   0.66    0.51
ProjectedGAN [76]        1.52   0.61    0.34
LDM-4 (ours, 200-s)      2.95   0.66    0.48

Table 1: Evaluation metrics for unconditional image synthesis. CelebA-HQ results reproduced from [63, 100, 43], FFHQ from [42, 43]. N-s refers to N sampling steps with the DDIM [84] sampler; the LSUN-Churches model (LDM-8) is trained in KL-regularized latent space. Additional results can be found in the supplementary.

Text-Conditional Image Synthesis

Method             FID↓   IS↑           N_params  Sampling details
CogView [17]       27.10  18.20         4B        self-ranking, rejection rate 0.017
LAFITE [109]       26.94  26.02         75M       -
GLIDE [59]         12.24  -             6B        277 DDIM steps, c.f.g. [32] s=3
Make-A-Scene [26]  11.84  -             4B        c.f.g. for AR models [98] s=5
LDM-KL-8           23.31  20.03 ± 0.33  1.45B     250 DDIM steps
LDM-KL-8-G         12.63  30.29 ± 0.42  1.45B     250 DDIM steps, c.f.g. [32] s=1.5

Table 2: Evaluation of text-conditional image synthesis on the $256 \times 256$-sized MS-COCO [51] dataset: with 250 DDIM [84] steps our model is on par with the most recent diffusion [59] and autoregressive [26] methods despite using significantly fewer parameters. Baseline numbers taken from [109] and [26].

4.2 Image Generation with Latent Diffusion

We train unconditional models of $256^2$ images on CelebA-HQ [39], FFHQ [41], LSUN-Churches and -Bedrooms [102] and evaluate i) sample quality and ii) their coverage of the data manifold using FID [29] and Precision-and-Recall [50]. Tab. 1 summarizes our results. On CelebA-HQ, we report a new state-of-the-art FID of 5.11, outperforming previous likelihood-based models as well as GANs. We also outperform LSGM [93] where a latent diffusion model is trained jointly together with the first stage. In contrast, we train diffusion models in a fixed space and avoid the difficulty of weighing reconstruction quality against learning the prior over the latent space, see Fig. 1-2.

We outperform prior diffusion based approaches on all but the LSUN-Bedrooms dataset, where our score is close to ADM [15], despite utilizing half its parameters and requiring 4-times less train resources (see Appendix E.3.5). Moreover, LDMs consistently improve upon GAN-based methods in Precision and Recall, thus confirming the advantages of their mode-covering likelihood-based training objective over adversarial approaches. In Fig. 4 we also show qualitative results on each dataset.

4.3 Conditional Latent Diffusion

Figure 8: Layout-to-image synthesis with an LDM on COCO [4], see Sec. 4.3.1. Quantitative evaluation in the supplement D.3.

4.3.1 Transformer Encoders for LDMs

By introducing cross-attention based conditioning into LDMs we open them up for various conditioning modalities previously unexplored for diffusion models. For text-to-image modeling, we train a 1.45B parameter KL-regularized LDM conditioned on language prompts on LAION-400M [78]. We employ the BERT-tokenizer [14] and implement $\tau_\theta$ as a transformer [97] to infer a latent code which is mapped into the UNet via (multi-head) cross-attention (Sec. 3.3). This combination of domain specific experts for learning a language representation and visual synthesis results in a powerful model, which generalizes well to complex, user-defined text prompts, cf. Fig. 8 and 5. For quantitative analysis, we follow prior work and evaluate text-to-image generation on the MS-COCO [51] validation set, where our model improves upon powerful AR [66, 17] and GAN-based [109] methods, cf. Tab. 2. We note that applying classifier-free diffusion guidance [32] greatly boosts sample quality, such that the guided LDM-KL-8-G is on par with the recent state-of-the-art AR [26] and diffusion models [59] for text-to-image synthesis, while substantially reducing parameter count. To further analyze the flexibility of the cross-attention based conditioning mechanism we also train models to synthesize images based on semantic layouts on OpenImages [49], and finetune on COCO [4], see Fig. 8. See Sec. D.3 for the quantitative evaluation and implementation details.
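Classifier-free guidance itself requires no additional network: following [32], the conditioning is occasionally replaced by a null embedding during training, and at sampling time the conditional and unconditional noise predictions are mixed with a scale $s$. A hedged sketch (function and argument names are illustrative, not the released implementation):

```python
import torch

def classifier_free_guided_eps(eps_theta, z_t, t, cond, null_cond, s=1.5):
    """Classifier-free guidance [32]: eps = eps_uncond + s * (eps_cond - eps_uncond).
    `null_cond` denotes the embedding of the empty conditioning dropped in during training."""
    eps_cond = eps_theta(z_t, t, cond)         # conditional noise prediction
    eps_uncond = eps_theta(z_t, t, null_cond)  # unconditional noise prediction
    return eps_uncond + s * (eps_cond - eps_uncond)
```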

Lastly, following prior work [15, 3, 23, 21], we evaluate our best-performing class-conditional ImageNet models with $f \in \{4, 8\}$ from Sec. 4.1 in Tab. 3, Fig. 4 and Sec. D.4. Here we outperform the state of the art diffusion model ADM [15] while significantly reducing computational requirements and parameter count, cf. Tab. 18.

Method           FID↓   IS↑            Precision↑  Recall↑  N_params  Sampling details
BigGan-deep [3]  6.95   203.6 ± 2.6    0.87        0.28     340M      -
ADM [15]         10.94  100.98         0.69        0.63     554M      250 DDIM steps
ADM-G [15]       4.59   186.7          0.82        0.52     608M      250 DDIM steps
LDM-4 (ours)     10.56  103.49 ± 1.24  0.71        0.62     400M      250 DDIM steps
LDM-4-G (ours)   3.60   247.67 ± 5.59  0.87        0.48     400M      250 steps, c.f.g. [32], s=1.5

Table 3: Comparison of a class-conditional ImageNet LDM with recent state-of-the-art methods for class-conditional image generation on ImageNet [12]. A more detailed comparison with additional baselines can be found in D.4, Tab. 10 and F. c.f.g. denotes classifier-free guidance with a scale $s$ as proposed in [32].

4.3.2 Convolutional Sampling Beyond $256^2$

By concatenating spatially aligned conditioning information to the input of $\epsilon_\theta$, LDMs can serve as efficient general-purpose image-to-image translation models. We use this to train models for semantic synthesis, super-resolution (Sec. 4.4) and inpainting (Sec. 4.5). For semantic synthesis, we use images of landscapes paired with semantic maps [61, 23] and concatenate downsampled versions of the semantic maps with the latent image representation of a $f = 4$ model (VQ-reg., see Tab. 8). We train on an input resolution of $256^2$ (crops from $384^2$) but find that our model generalizes to larger resolutions and can generate images up to the megapixel regime when evaluated in a convolutional manner (see Fig. 9). We exploit this behavior to also apply the super-resolution models in Sec. 4.4 and the inpainting models in Sec. 4.5 to generate large images between $512^2$ and $1024^2$. For this application, the signal-to-noise ratio (induced by the scale of the latent space) significantly affects the results. In Sec. D.1 we illustrate this when learning an LDM on (i) the latent space as provided by a $f = 4$ model (KL-reg., see Tab. 8), and (ii) a rescaled version, scaled by the component-wise standard deviation.
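Because $\epsilon_\theta$ is fully convolutional and the conditioning is concatenated spatially, the trained weights can simply be evaluated on a larger latent grid at sampling time. A schematic sketch of such convolutional sampling, assuming a 4-channel latent at $f = 4$ and a plain deterministic DDIM-style update (all names and the channel count are illustrative assumptions):

```python
import torch

@torch.no_grad()
def convolutional_sample(eps_theta, decoder, cond_latent, alphas_cumprod):
    """Sample beyond the training resolution: run the convolutional UNet on a latent grid
    whose size is dictated by the spatial conditioning `cond_latent` (e.g. a downsampled
    semantic map), concatenated channel-wise at every denoising step."""
    a = alphas_cumprod.to(cond_latent.device)
    B, _, h, w = cond_latent.shape
    z = torch.randn(B, 4, h, w, device=cond_latent.device)        # e.g. 128x256 latent -> 512x1024 px at f = 4
    for t in reversed(range(a.shape[0])):
        t_b = torch.full((B,), t, dtype=torch.long, device=z.device)
        eps = eps_theta(torch.cat([z, cond_latent], dim=1), t_b)   # concatenation conditioning
        a_bar = a[t]
        z0_pred = (z - (1.0 - a_bar).sqrt() * eps) / a_bar.sqrt()  # predicted clean latent
        a_prev = a[t - 1] if t > 0 else torch.tensor(1.0, device=z.device)
        z = a_prev.sqrt() * z0_pred + (1.0 - a_prev).sqrt() * eps  # deterministic DDIM step (eta = 0)
    return decoder(z)                                              # single pass through D back to image space
```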

The latter, in combination with classifier-free guidance [32], also enables the direct synthesis of $>256^2$ images for the text-conditional LDM-KL-8-G as in Fig. 13.

Figure 9: An LDM trained on $256^2$ resolution can generalize to larger resolution (here: $512 \times 1024$) for spatially conditioned tasks such as semantic synthesis of landscape images. See Sec. 4.3.2.

4.4 Super-Resolution with Latent Diffusion

LDMs can be efficiently trained for super-resolution by directly conditioning on low-resolution images via concatenation (cf. Sec. 3.3). In a first experiment, we follow SR3 [72] and fix the image degradation to a bicubic interpolation with $4\times$-downsampling and train on ImageNet following SR3's data processing pipeline. We use the $f = 4$ autoencoding model pretrained on OpenImages (VQ-reg., cf. Tab. 8) and concatenate the low-resolution conditioning $y$ and the inputs to the UNet, i.e. $\tau_\theta$ is the identity. Our qualitative and quantitative results (see Fig. 10 and Tab. 5) show competitive performance and LDM-SR outperforms SR3 in FID while SR3 has a better IS. A simple image regression model achieves the highest PSNR and SSIM scores; however these metrics do not align well with human perception [106] and favor blurriness over imperfectly aligned high frequency details [72]. Further, we conduct a user study comparing the pixel-baseline with LDM-SR. We follow SR3 [72] where human subjects were shown a low-res image in between two high-res images and asked for preference. The results in Tab. 4 affirm the good performance of LDM-SR. PSNR and SSIM can be pushed by using a post-hoc guiding mechanism [15] and we implement this image-based guider via a perceptual loss, see Sec. D.6.
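Since $\tau_\theta$ is the identity here, conditioning amounts to stacking the low-resolution input channel-wise with the noisy latent before the UNet. A hedged sketch of the corresponding training loss (resizing the conditioning to the latent's spatial size is an assumption of this sketch, as are all names):

```python
import torch
import torch.nn.functional as F

def sr_training_loss(x_hr, y_lr, encoder, eps_theta, alphas_cumprod):
    """Super-resolution conditioning by concatenation (tau_theta = identity):
    the low-resolution conditioning y is concatenated channel-wise with z_t."""
    with torch.no_grad():
        z0 = encoder(x_hr)                                         # latent of the high-res target
    y = F.interpolate(y_lr, size=z0.shape[-2:], mode="bicubic", align_corners=False)
    a = alphas_cumprod.to(z0.device)
    t = torch.randint(0, a.shape[0], (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a_bar = a[t].view(-1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
    return F.mse_loss(eps_theta(torch.cat([z_t, y], dim=1), t), eps)
```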

[Figure 10 panels: bicubic / LDM-SR / SR3]
Figure 10: ImageNet 64→256 super-resolution on ImageNet-Val. LDM-SR has advantages at rendering realistic textures but SR3 can synthesize more coherent fine structures. See appendix for additional samples and cropouts. SR3 results from [72].

User Study                     SR on ImageNet               Inpainting on Places
                               Pixel-DM (f=1)    LDM-4      LAMA [88]    LDM-4
Task 1: Preference vs GT ↑     16.0%             30.4%      13.6%        21.0%
Task 2: Preference Score ↑     29.4%             70.6%      31.9%        68.1%

Table 4: Task 1: Subjects were shown ground truth and generated image and asked for preference. Task 2: Subjects had to decide between two generated images. More details in E.3.6

Since the bicubic degradation process does not generalize well to images which do not follow this pre-processing, we also train a generic model, LDM-BSR, by using more diverse degradation. The results are shown in Sec. D.6.1.

Method                           FID↓       IS↑    PSNR↑        SSIM↑         N_params  samples/s (*)
Image Regression [72]            15.2       121.1  27.9         0.801         625M      N/A
SR3 [72]                         5.2        180.1  26.4         0.762         625M      N/A
LDM-4 (ours, 100 steps)          2.8 / 4.8  166.3  24.4 ± 3.8   0.69 ± 0.14   169M      4.62
LDM-4 (ours, big, 100 steps)     2.4 / 4.3  174.9  24.7 ± 4.1   0.71 ± 0.15   552M      4.5
LDM-4 (ours, 50 steps, guiding)  4.4 / 6.4  153.7  25.8 ± 3.7   0.74 ± 0.12   184M      0.38

Table 5: $\times 4$ upscaling results on ImageNet-Val. ($256^2$). FID is reported as two values: features computed on the validation split / features computed on the train split. (*): throughput assessed on an NVIDIA A100.

4.5 Inpainting with Latent Diffusion

Inpainting is the task of filling masked regions of an image with new content, either because parts of the image are corrupted or to replace existing but undesired content within the image. We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches for this task. Our evaluation follows the protocol of LaMa [88], a recent inpainting model that introduces a specialized architecture relying on Fast Fourier Convolutions [8]. The exact training & evaluation protocol on Places [108] is described in Sec. E.2.2.
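As a rough illustration of how the concatenation-based conditioning of Sec. 3.3 can be set up for inpainting, the sketch below encodes the masked image with the first-stage encoder and stacks it, together with a resized copy of the mask, onto the noisy latent; the exact mask convention and channel layout are assumptions, not a description of our released code.

```python
import torch
import torch.nn.functional as F

def inpainting_condition(encode, image, mask):
    """Build the conditioning for latent inpainting: encode the masked image
    (mask == 1 marks the region to be filled) and append a downsampled mask."""
    masked_image = image * (1.0 - mask)
    c_img = encode(masked_image)                      # e.g. (B, 3, H/4, W/4) for an f=4 first stage
    c_mask = F.interpolate(mask, size=c_img.shape[-2:], mode="nearest")
    return torch.cat([c_img, c_mask], dim=1)

def unet_input(z_t, cond):
    """As for super-resolution, tau_theta is the identity: the conditioning is
    simply concatenated to the noisy latent at every denoising step."""
    return torch.cat([z_t, cond], dim=1)
```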

We first analyze the effect of different design choices for the first stage.

Model (reg.-type)        train throughput     sampling throughput      train+val      FID@2k
                         samples/sec.         @256        @512         hours/epoch    epoch 6
LDM-1 (no first stage)   0.11                 0.26        0.07         20.66          24.74
LDM-4 (KL, w/ attn)      0.32                 0.97        0.34         7.66           15.21
LDM-4 (VQ, w/ attn)      0.33                 0.97        0.34         7.04           14.99
LDM-4 (VQ, w/o attn)     0.35                 0.99        0.36         6.66           15.95

Table 6: Assessing inpainting efficiency. Deviations from Fig. 7 are due to varying GPU settings/batch sizes; cf. the supplement.
[Figure 11 image panels: input / result]
Figure 11: Qualitative results on object removal with our big, w/ ft inpainting model. For more results, see Fig. 22.

In particular, we compare the inpainting efficiency of LDM-1 (i.e. a pixel-based conditional DM) with LDM-4, for both KL and VQ regularizations, as well as VQ-LDM-4 without any attention in the first stage (see Tab. 8), where the latter reduces GPU memory for decoding at high resolutions. For comparability, we fix the number of parameters for all models. Tab. 6 reports the training and sampling throughput at resolutions 256² and 512², the total training time in hours per epoch and the FID score on the validation split after six epochs. Overall, we observe a speed-up of at least 2.7× between pixel- and latent-based diffusion models while improving FID scores by a factor of at least 1.6×.
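The throughput numbers in Tab. 6 are wall-clock measurements; a simple way to estimate sampling throughput (samples per second) is sketched below. The batched sample_fn is a placeholder, and absolute numbers depend on GPU type, batch size and resolution.

```python
import time
import torch

@torch.no_grad()
def sampling_throughput(sample_fn, batch_size, n_batches=10):
    """Average samples/sec over n_batches calls to a (hypothetical) batched
    reverse-diffusion sampler sample_fn(batch_size) -> images."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(n_batches):
        sample_fn(batch_size)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return n_batches * batch_size / (time.time() - t0)
```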

The comparison with other inpainting approaches in Tab. 7 shows that our model with attention improves the overall image quality, as measured by FID, over that of [88]. LPIPS between the unmasked images and our samples is slightly higher than that of [88]. We attribute this to [88] only producing a single result, which tends to recover more of an average image compared to the diverse results produced by our LDM; cf. Fig. 21. Additionally, in a user study (Tab. 4) human subjects favor our results over those of [88].

Based on these initial results, we also trained a larger diffusion model (big in Tab. 7) in the latent space of the VQ-regularized first stage without attention. Following [15], the UNet of this diffusion model uses attention layers on three levels of its feature hierarchy, the BigGAN [3] residual block for up- and downsampling, and has 387M parameters instead of 215M. After training, we noticed a discrepancy in the quality of samples produced at resolutions 256² and 512², which we hypothesize to be caused by the additional attention modules. However, fine-tuning the model for half an epoch at resolution 512² allows the model to adjust to the new feature statistics and sets a new state-of-the-art FID on image inpainting (big, w/o attn, w/ ft in Tab. 7, Fig. 11).

                                 40-50% masked             All samples
Method                           FID ↓    LPIPS ↓          FID ↓    LPIPS ↓
LDM-4 (ours, big, w/ ft)         9.39     0.246±0.042      1.50     0.137±0.080
LDM-4 (ours, big, w/o ft)        12.89    0.257±0.047      2.40     0.142±0.085
LDM-4 (ours, w/ attn)            11.87    0.257±0.042      2.15     0.144±0.084
LDM-4 (ours, w/o attn)           12.60    0.259±0.041      2.37     0.145±0.084
LaMa [88] (recomputed)           12.31    0.243±0.038      2.23     0.134±0.080
LaMa [88]                        12.0     0.24             2.21     0.14
CoModGAN [107]                   10.4     0.26             1.82     0.15
RegionWise [52]                  21.3     0.27             4.75     0.15
DeepFill v2 [104]                22.1     0.28             5.20     0.16
EdgeConnect [58]                 30.5     0.28             8.37     0.16

Table 7: Comparison of inpainting performance on 30k crops of size 512×512 from test images of Places [108]. The column 40-50% reports metrics computed over hard examples where 40-50% of the image region have to be inpainted. (recomputed): metrics recomputed on our test set, since the original test set used in [88] was not available.

5 Limitations & Societal Impact

Limitations

While LDMs significantly reduce computational requirements compared to pixel-based approaches, their sequential sampling process is still slower than that of GANs. Moreover, the use of LDMs can be questionable when high precision is required: although the loss of image quality is very small in our f=4 autoencoding models (see Fig. 1), their reconstruction capability can become a bottleneck for tasks that require fine-grained accuracy in pixel space. We assume that our super-resolution models (Sec. 4.4) are already somewhat limited in this respect.

Societal Impact

Generative models for media like imagery are a double-edged sword: On the one hand, they enable various creative applications, and in particular approaches like ours that reduce the cost of training and inference have the potential to facilitate access to this technology and democratize its exploration. On the other hand, it also means that it becomes easier to create and disseminate manipulated data or spread misinformation and spam. In particular, the deliberate manipulation of images (“deep fakes”) is a common problem in this context, and women in particular are disproportionately affected by it [13, 24].

Generative models can also reveal their training data [5, 90], which is of great concern when the data contain sensitive or personal information and were collected without explicit consent. However, the extent to which this also applies to DMs of images is not yet fully understood.

Finally, deep learning modules tend to reproduce or exacerbate biases that are already present in the data [91, 38, 22]. While diffusion models achieve better coverage of the data distribution than e.g. GAN-based approaches, the extent to which our two-stage approach that combines adversarial training and a likelihood-based objective misrepresents the data remains an important research question.

For a more general, detailed discussion of the ethical considerations of deep generative models, see e.g. [13].

6 Conclusion

We have presented latent diffusion models, a simple and efficient way to significantly improve both the training and sampling efficiency of denoising diffusion models without degrading their quality. Based on this and our cross-attention conditioning mechanism, our experiments demonstrate favorable results compared to state-of-the-art methods across a wide range of conditional image synthesis tasks without task-specific architectures. This work has been supported by the German Federal Ministry for Economic Affairs and Energy within the project ’KI-Absicherung - Safe AI for automated driving’ and by the German Research Foundation (DFG) project 421703927.

References

  • [1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1122–1131. IEEE Computer Society, 2017.
  • [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan, 2017.
  • [3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Int. Conf. Learn. Represent., 2019.
  • [4] Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1209–1218. Computer Vision Foundation / IEEE Computer Society, 2018.
  • [5] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021.
  • [6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 1691–1703. PMLR, 2020.
  • [7] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In ICLR. OpenReview.net, 2021.
  • [8] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolution. In NeurIPS, 2020.
  • [9] Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. CoRR, abs/2011.10650, 2020.
  • [10] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019.
  • [11] Bin Dai and David P. Wipf. Diagnosing and enhancing VAE models. In ICLR (Poster). OpenReview.net, 2019.
  • [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE Computer Society, 2009.
  • [13] Emily Denton. Ethical considerations of generative ai. AI for Content Creation Workshop, CVPR, 2021.
  • [14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
  • [15] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. CoRR, abs/2105.05233, 2021.
  • [16] Sander Dieleman. Musings on typicality, 2020.
  • [17] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering text-to-image generation via transformers. CoRR, abs/2105.13290, 2021.
  • [18] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation, 2015.
  • [19] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
  • [20] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Adv. Neural Inform. Process. Syst., pages 658–666, 2016.
  • [21] Patrick Esser, Robin Rombach, Andreas Blattmann, and Björn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. CoRR, abs/2108.08827, 2021.
  • [22] Patrick Esser, Robin Rombach, and Björn Ommer. A note on data biases in generative models. arXiv preprint arXiv:2012.02516, 2020.
  • [23] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. CoRR, abs/2012.09841, 2020.
  • [24] Mary Anne Franks and Ari Ezra Waldman. Sex, lies, and videotape: Deep fakes and free speech delusions. Md. L. Rev., 78:892, 2018.
  • [25] Kevin Frans, Lisa B. Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. ArXiv, abs/2106.14843, 2021.
  • [26] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. CoRR, abs/2203.13131, 2022.
  • [27] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, 2014.
  • [28] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans, 2017.
  • [29] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Adv. Neural Inform. Process. Syst., pages 6626–6637, 2017.
  • [30] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  • [31] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. CoRR, abs/2106.15282, 2021.
  • [32] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • [33] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 5967–5976. IEEE Computer Society, 2017.
  • [34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, 2017.
  • [35] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver IO: A general architecture for structured inputs &outputs. CoRR, abs/2107.14795, 2021.
  • [36] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and João Carreira. Perceiver: General perception with iterative attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 4651–4664. PMLR, 2021.
  • [37] Manuel Jahn, Robin Rombach, and Björn Ommer. High-resolution complex scene synthesis with transformers. CoRR, abs/2105.06458, 2021.
  • [38] Niharika Jain, Alberto Olmo, Sailik Sengupta, Lydia Manikonda, and Subbarao Kambhampati. Imperfect imaganation: Implications of gans exacerbating biases on facial data augmentation and snapchat selfie lenses. arXiv preprint arXiv:2001.09528, 2020.
  • [39] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017.
  • [40] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4401–4410, 2019.
  • [41] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [42] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. CoRR, abs/1912.04958, 2019.
  • [43] Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and Il-Chul Moon. Score matching model for unbounded data score. CoRR, abs/2106.05527, 2021.
  • [44] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, 2018.
  • [45] Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. CoRR, abs/2107.00630, 2021.
  • [46] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR, 2014.
  • [47] Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. CoRR, abs/2106.00132, 2021.
  • [48] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In ICLR. OpenReview.net, 2021.
  • [49] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982, 2018.
  • [50] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. CoRR, abs/1904.06991, 2019.
  • [51] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.
  • [52] Yuqing Ma, Xianglong Liu, Shihao Bai, Le-Yi Wang, Aishan Liu, Dacheng Tao, and Edwin Hancock. Region-wise generative adversarial imageinpainting for large missing areas. ArXiv, abs/1909.12507, 2019.
  • [53] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. CoRR, abs/2108.01073, 2021.
  • [54] Lars M. Mescheder. On the convergence properties of GAN training. CoRR, abs/1801.04406, 2018.
  • [55] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
  • [56] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
  • [57] Gautam Mittal, Jesse H. Engel, Curtis Hawthorne, and Ian Simon. Symbolic music generation with diffusion models. CoRR, abs/2103.16091, 2021.
  • [58] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, and Mehran Ebrahimi. Edgeconnect: Generative image inpainting with adversarial edge learning. ArXiv, abs/1901.00212, 2019.
  • [59] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. CoRR, abs/2112.10741, 2021.
  • [60] Anton Obukhov, Maximilian Seitzer, Po-Wei Wu, Semen Zhydenko, Jonathan Kyl, and Elvis Yu-Jing Lin. High-fidelity performance metrics for generative models in pytorch, 2020. Version: 0.3.0, DOI: 10.5281/zenodo.4957738.
  • [61] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [62] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [63] Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 823–832. Computer Vision Foundation / IEEE, 2021.
  • [64] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:2104.11222, 2021.
  • [65] David A. Patterson, Joseph Gonzalez, Quoc V. Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. CoRR, abs/2104.10350, 2021.
  • [66] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021.
  • [67] Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, pages 14837–14847, 2019.
  • [68] Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
  • [69] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML, 2014.
  • [70] Robin Rombach, Patrick Esser, and Björn Ommer. Network-to-network translation with conditional invertible neural networks. In NeurIPS, 2020.
  • [71] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pages 234–241. Springer, 2015.
  • [72] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. CoRR, abs/2104.07636, 2021.
  • [73] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. CoRR, abs/1701.05517, 2017.
  • [74] Dave Salvator. NVIDIA Developer Blog. https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32, 2020.
  • [75] Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. CoRR, abs/2104.02600, 2021.
  • [76] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected gans converge faster. CoRR, abs/2111.01007, 2021.
  • [77] Edgar Schönfeld, Bernt Schiele, and Anna Khoreva. A u-net based discriminator for generative adversarial networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8204–8213. Computer Vision Foundation / IEEE, 2020.
  • [78] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021.
  • [79] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, Int. Conf. Learn. Represent., 2015.
  • [80] Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2C: diffusion-denoising models for few-shot conditional generation. CoRR, abs/2106.06819, 2021.
  • [81] Charlie Snell. Alien Dreams: An Emerging Art Scene. https://ml.berkeley.edu/blog/posts/clip-art/, 2021. [Online; accessed November-2021].
  • [82] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. CoRR, abs/1503.03585, 2015.
  • [83] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • [84] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR. OpenReview.net, 2021.
  • [85] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. CoRR, abs/2011.13456, 2020.
  • [86] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for modern deep learning research. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 13693–13696. AAAI Press, 2020.
  • [87] Wei Sun and Tianfu Wu. Learning layout and style reconfigurable gans for controllable image synthesis. CoRR, abs/2003.11571, 2020.
  • [88] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor S. Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. ArXiv, abs/2109.07161, 2021.
  • [89] Tristan Sylvain, Pengchuan Zhang, Yoshua Bengio, R. Devon Hjelm, and Shikhar Sharma. Object-centric image generation from layouts. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 2647–2655. AAAI Press, 2021.
  • [90] Patrick Tinsley, Adam Czajka, and Patrick Flynn. This face does not exist… but it might be yours! identity leakage in generative models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1320–1328, 2021.
  • [91] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528. IEEE, 2011.
  • [92] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In NeurIPS, 2020.
  • [93] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. CoRR, abs/2106.05931, 2021.
  • [94] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, koray kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, 2016.
  • [95] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. CoRR, abs/1601.06759, 2016.
  • [96] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NIPS, pages 6306–6315, 2017.
  • [97] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
  • [98] Rivers Have Wings. Tweet on Classifier-free guidance for autoregressive models. https://twitter.com/RiversHaveWings/status/1478093658716966912, 2022.
  • [99] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface’s transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771, 2019.
  • [100] Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A symbiosis between variational autoencoders and energy-based models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
  • [101] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using VQ-VAE and transformers. CoRR, abs/2104.10157, 2021.
  • [102] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. CoRR, abs/1506.03365, 2015.
  • [103] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan, 2021.
  • [104] Jiahui Yu, Zhe L. Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Free-form image inpainting with gated convolution. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4470–4479, 2019.
  • [105] K. Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. ArXiv, abs/2103.14006, 2021.
  • [106] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [107] Shengyu Zhao, Jianwei Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I-Chao Chang, and Yan Xu. Large scale image completion via co-modulated generative adversarial networks. ArXiv, abs/2103.10428, 2021.
  • [108] Bolei Zhou, Àgata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:1452–1464, 2018.
  • [109] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. LAFITE: towards language-free training for text-to-image generation. CoRR, abs/2111.13792, 2021.

Appendix

Figure 12: Convolutional samples from the semantic landscapes model as in Sec. 4.3.2, finetuned on 512² images.
[Figure 13 image panels; prompts: ’A painting of the last supper by Picasso.’ / ’An oil painting of a latent space.’ / ’An epic painting of Gandalf the Black summoning thunder and lightning in the mountains.’ / ’A sunset over a mountain range, vector image.’]
Figure 13: Combining classifier-free diffusion guidance with the convolutional sampling strategy from Sec. 4.3.2, our 1.45B parameter text-to-image model can be used for rendering images larger than the native 256² resolution the model was trained on.

Appendix A Changelog

Here we list changes between this version (https://arxiv.org/abs/2112.10752v2) of the paper and the previous version, i.e. https://arxiv.org/abs/2112.10752v1.

  • We updated the results on text-to-image synthesis in Sec. 4.3 which were obtained by training a new, larger model (1.45B parameters). This also includes a new comparison to very recent competing methods on this task that were published on arXiv at the same time as ([59, 109]) or after ([26]) the publication of our work.


  • We updated results on class-conditional synthesis on ImageNet in Sec. 4.1, Tab. 3 (see also Sec. D.4) obtained by retraining the model with a larger batch size. The corresponding qualitative results in Fig. 26 and Fig. 27 were also updated. Both the updated text-to-image and the class-conditional model now use classifier-free guidance [32] as a measure to increase visual fidelity.


  • We conducted a user study (following the scheme suggested by Saharia et al [72]) which provides additional evaluation for our inpainting (Sec. 4.5) and superresolution models (Sec. 4.4).


  • Added Fig. 5 to the main paper, moved Fig. 18 to the appendix, added Fig. 13 to the appendix.



Appendix B Detailed Information on Denoising Diffusion Models

Diffusion models can be specified in terms of a signal-to-noise ratio $\text{SNR}(t)=\frac{\alpha_t^2}{\sigma_t^2}$ consisting of sequences $(\alpha_t)_{t=1}^{T}$ and $(\sigma_t)_{t=1}^{T}$ which, starting from a data sample $x_0$, define a forward diffusion process $q$ as

$q(x_t|x_0)=\mathcal{N}(x_t\,|\,\alpha_t x_0,\sigma_t^2\mathbb{I})$   (4)

with the Markov structure for $s<t$:

$q(x_t|x_s)=\mathcal{N}(x_t\,|\,\alpha_{t|s}x_s,\sigma_{t|s}^2\mathbb{I})$   (5)
$\alpha_{t|s}=\frac{\alpha_t}{\alpha_s}$   (6)
$\sigma_{t|s}^2=\sigma_t^2-\alpha_{t|s}^2\sigma_s^2$   (7)
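For concreteness, a small numerical sketch of the forward process defined by Eqs. (4)-(7); the schedules alpha and sigma are assumed to be given as length-T tensors (this is an illustration, not the training code).

```python
import torch

def diffuse(x0, alpha, sigma, t):
    """Sample x_t ~ q(x_t | x_0) = N(alpha_t x_0, sigma_t^2 I), Eq. (4)."""
    return alpha[t] * x0 + sigma[t] * torch.randn_like(x0)

def transition(alpha, sigma, s, t):
    """Parameters of q(x_t | x_s) for s < t, Eqs. (5)-(7). Composing
    q(x_t | x_s) with q(x_s | x_0) recovers Eq. (4), since
    alpha_{t|s} * alpha_s = alpha_t and alpha_{t|s}^2 sigma_s^2 + sigma_{t|s}^2 = sigma_t^2."""
    a_ts = alpha[t] / alpha[s]
    var_ts = sigma[t] ** 2 - a_ts ** 2 * sigma[s] ** 2
    return a_ts, var_ts

def snr(alpha, sigma):
    """Signal-to-noise ratio SNR(t) = alpha_t^2 / sigma_t^2."""
    return alpha ** 2 / sigma ** 2
```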

Denoising diffusion models are generative models $p(x_0)$ which revert this process with a similar Markov structure running backward in time, i.e. they are specified as

$p(x_0)=\int_{z}p(x_T)\prod_{t=1}^{T}p(x_{t-1}|x_t)$   (8)

The evidence lower bound (ELBO) associated with this model then decomposes over the discrete time steps as

$-\log p(x_0)\leq\mathbb{KL}(q(x_T|x_0)\,|\,p(x_T))+\sum_{t=1}^{T}\mathbb{E}_{q(x_t|x_0)}\mathbb{KL}(q(x_{t-1}|x_t,x_0)\,|\,p(x_{t-1}|x_t))$   (9)
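Since all factors in Eq. (9) are Gaussian, each per-step term reduces to a closed-form KL divergence between two (diagonal) Gaussians; a sketch of that elementary building block is given below (an illustrative helper, not part of the original derivation).

```python
import torch

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Elementwise KL( N(mu_q, var_q) || N(mu_p, var_p) ), the building block
    of the per-step terms in Eq. (9); sum over dimensions for the full KL."""
    return 0.5 * (torch.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
```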

The prior $p(x_T)$ is typically chosen as a standard normal distribution and the first term of the ELBO then depends only on the final signal-to-noise ratio $\text{SNR}(T)$. To minimize the remaining terms, a common choice to parameterize $p(x_{t-1}|x_t)$ is to specify it in terms of the true posterior $q(x_{t-1}|x_t,x_0)$ but with the unknown $x_0$ replaced by an estimate $x_\theta(x_t,t)$ based on the current step $x_t$. This gives [45]