
High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach¹   Andreas Blattmann¹   Dominik Lorenz¹   Patrick Esser²   Björn Ommer¹
¹Ludwig Maximilian University of Munich & IWR, Heidelberg University, Germany   ²Runway ML

https://github.com/CompVis/latent-diffusion
The first two authors contributed equally to this work.
Abstract

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including text-to-image synthesis, unconditional image generation and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

1 Introduction

[Figure 1 panels: Input; ours ($f=4$), PSNR 27.4, R-FID 0.58; DALL-E ($f=8$), PSNR 22.8, R-FID 32.01; VQGAN ($f=16$), PSNR 19.9, R-FID 4.98]

Figure 1: Boosting the upper bound on achievable quality with less aggressive downsampling. Since diffusion models offer excellent inductive biases for spatial data, we do not need the heavy spatial downsampling of related generative models in latent space, but can still greatly reduce the dimensionality of the data via suitable autoencoding models, see Sec. 3. Images are from the DIV2K [1] validation set, evaluated at $512^2$ px. We denote the spatial downsampling factor by $f$. Reconstruction FIDs [29] and PSNR are calculated on ImageNet-val [12]; see also Tab. 8.

Image synthesis is one of the computer vision fields with the most spectacular recent development, but also among those with the greatest computational demands. Especially high-resolution synthesis of complex, natural scenes is presently dominated by scaling up likelihood-based models, potentially containing billions of parameters in autoregressive (AR) transformers [66, 67]. In contrast, the promising results of GANs [27, 3, 40] have been revealed to be mostly confined to data with comparably limited variability, as their adversarial learning procedure does not easily scale to modeling complex, multi-modal distributions. Recently, diffusion models [82], which are built from a hierarchy of denoising autoencoders, have been shown to achieve impressive results in image synthesis [30, 85] and beyond [45, 7, 48, 57], and define the state of the art in class-conditional image synthesis [15, 31] and super-resolution [72]. Moreover, even unconditional DMs can readily be applied to tasks such as inpainting and colorization [85] or stroke-based synthesis [53], in contrast to other types of generative models [46, 69, 19]. Being likelihood-based models, they do not exhibit mode collapse and training instabilities as GANs do, and by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models [67].

Democratizing High-Resolution Image Synthesis

DMs belong to the class of likelihood-based models, whose mode-covering behavior makes them prone to spend excessive amounts of capacity (and thus compute resources) on modeling imperceptible details of the data [16, 73]. Although the reweighted variational objective [30] aims to address this by undersampling the initial denoising steps, DMs are still computationally demanding, since training and evaluating such a model requires repeated function evaluations (and gradient computations) in the high-dimensional space of RGB images. As an example, training the most powerful DMs often takes hundreds of GPU days (e.g. 150 - 1000 V100 days in [15]), and repeated evaluations on a noisy version of the input space also render inference expensive, so that producing 50k samples takes approximately 5 days [15] on a single A100 GPU. This has two consequences for the research community and users in general: Firstly, training such a model requires massive computational resources only available to a small fraction of the field, and leaves a huge carbon footprint [65, 86]. Secondly, evaluating an already trained model is also expensive in time and memory, since the same model architecture must run sequentially for a large number of steps (e.g. 25 - 1000 steps in [15]).

To increase the accessibility of this powerful model class and at the same time reduce its significant resource consumption, a method is needed that reduces the computational complexity for both training and sampling. Reducing the computational demands of DMs without impairing their performance is, therefore, key to enhance their accessibility.

Departure to Latent Space

Our approach starts with the analysis of already trained diffusion models in pixel space: Fig. 2 shows the rate-distortion trade-off of a trained model. As with any likelihood-based model, learning can be roughly divided into two stages: First is a perceptual compression stage which removes high-frequency details but still learns little semantic variation. In the second stage, the actual generative model learns the semantic and conceptual composition of the data (semantic compression). We thus aim to first find a perceptually equivalent, but computationally more suitable space, in which we will train diffusion models for high-resolution image synthesis.

Following common practice [96, 67, 23, 11, 66], we separate training into two distinct phases: First, we train an autoencoder which provides a lower-dimensional (and thereby efficient) representational space which is perceptually equivalent to the data space. Importantly, and in contrast to previous work [23, 66], we do not need to rely on excessive spatial compression, as we train DMs in the learned latent space, which exhibits better scaling properties with respect to the spatial dimensionality. The reduced complexity also provides efficient image generation from the latent space with a single network pass. We dub the resulting model class Latent Diffusion Models (LDMs).

A notable advantage of this approach is that we need to train the universal autoencoding stage only once and can therefore reuse it for multiple DM trainings or to explore possibly completely different tasks [81]. This enables efficient exploration of a large number of diffusion models for various image-to-image and text-to-image tasks. For the latter, we design an architecture that connects transformers to the DM’s UNet backbone [71] and enables arbitrary types of token-based conditioning mechanisms, see Sec. 3.3.

Figure 2: Illustrating perceptual and semantic compression: Most bits of a digital image correspond to imperceptible details. While DMs allow suppressing this semantically meaningless information by minimizing the responsible loss term, gradients (during training) and the neural network backbone (training and inference) still need to be evaluated on all pixels, leading to superfluous computations and unnecessarily expensive optimization and inference.

We propose latent diffusion models (LDMs) as an effective generative model and a separate mild compression stage that only eliminates imperceptible details. Data and images from [30].

In sum, our work makes the following contributions:

(i) In contrast to purely transformer-based approaches [23, 66], our method scales more gracefully to higher dimensional data and can thus (a) work on a compression level which provides more faithful and detailed reconstructions than previous work (see Fig. 1) and (b) be efficiently applied to high-resolution synthesis of megapixel images.

(ii) We achieve competitive performance on multiple tasks (unconditional image synthesis, inpainting, stochastic super-resolution) and datasets while significantly lowering computational costs. Compared to pixel-based diffusion approaches, we also significantly decrease inference costs.

(iii) We show that, in contrast to previous work [93] which learns both an encoder/decoder architecture and a score-based prior simultaneously, our approach does not require a delicate weighting of reconstruction and generative abilities. This ensures extremely faithful reconstructions and requires very little regularization of the latent space.

(iv) We find that for densely conditioned tasks such as super-resolution, inpainting and semantic synthesis, our model can be applied in a convolutional fashion and render large, consistent images of $\sim 1024^2$ px.

(v) Moreover, we design a general-purpose conditioning mechanism based on cross-attention, enabling multi-modal training. We use it to train class-conditional, text-to-image and layout-to-image models.

(vi) Finally, we release pretrained latent diffusion and autoencoding models at https://github.com/CompVis/latent-diffusion which might be reusable for various tasks besides training of DMs [81].

2 Related Work

Generative Models for Image Synthesis The high dimensional nature of images presents distinct challenges to generative modeling. Generative Adversarial Networks (GAN) [27] allow for efficient sampling of high resolution images with good perceptual quality [3, 42], but are difficult to optimize [54, 2, 28] and struggle to capture the full data distribution [55]. In contrast, likelihood-based methods emphasize good density estimation which renders optimization more well-behaved. Variational autoencoders (VAE) [46] and flow-based models [18, 19] enable efficient synthesis of high resolution images [9, 92, 44], but sample quality is not on par with GANs. While autoregressive models (ARM) [95, 94, 6, 10] achieve strong performance in density estimation, computationally demanding architectures [97] and a sequential sampling process limit them to low resolution images. Because pixel based representations of images contain barely perceptible, high-frequency details [16, 73], maximum-likelihood training spends a disproportionate amount of capacity on modeling them, resulting in long training times. To scale to higher resolutions, several two-stage approaches [101, 67, 23, 103] use ARMs to model a compressed latent image space instead of raw pixels.

Recently, Diffusion Probabilistic Models (DM) [82] have achieved state-of-the-art results in density estimation [45] as well as in sample quality [15]. The generative power of these models stems from a natural fit to the inductive biases of image-like data when their underlying neural backbone is implemented as a UNet [71, 30, 85, 15]. The best synthesis quality is usually achieved when a reweighted objective [30] is used for training. In this case, the DM corresponds to a lossy compressor and allows to trade image quality for compression capabilities. Evaluating and optimizing these models in pixel space, however, has the downside of low inference speed and very high training costs. While the former can be partially addressed by advanced sampling strategies [84, 75, 47] and hierarchical approaches [31, 93], training on high-resolution image data always requires to calculate expensive gradients. We address both drawbacks with our proposed LDMs, which work on a compressed latent space of lower dimensionality. This renders training computationally cheaper and speeds up inference with almost no reduction in synthesis quality (see Fig. 1).

Two-Stage Image Synthesis To mitigate the shortcomings of individual generative approaches, a lot of research [11, 70, 23, 103, 101, 67] has gone into combining the strengths of different methods into more efficient and performant models via a two stage approach. VQ-VAEs [101, 67] use autoregressive models to learn an expressive prior over a discretized latent space. [66] extend this approach to text-to-image generation by learning a joint distribution over discretized image and text representations. More generally, [70] uses conditionally invertible networks to provide a generic transfer between latent spaces of diverse domains. Different from VQ-VAEs, VQGANs [23, 103] employ a first stage with an adversarial and perceptual objective to scale autoregressive transformers to larger images. However, the high compression rates required for feasible ARM training, which introduces billions of trainable parameters [66, 23], limit the overall performance of such approaches, and less compression comes at the price of high computational cost [66, 23]. Our work prevents such trade-offs, as our proposed LDMs scale more gently to higher dimensional latent spaces due to their convolutional backbone. Thus, we are free to choose the level of compression which optimally mediates between learning a powerful first stage, without leaving too much perceptual compression up to the generative diffusion model while guaranteeing high-fidelity reconstructions (see Fig. 1).

While approaches to jointly [93] or separately [80] learn an encoding/decoding model together with a score-based prior exist, the former still require a difficult weighting between reconstruction and generative capabilities [11] and are outperformed by our approach (Sec. 4), and the latter focus on highly structured images such as human faces.

3 Method

To lower the computational demands of training diffusion models towards high-resolution image synthesis, we observe that although diffusion models allow ignoring perceptually irrelevant details by undersampling the corresponding loss terms [30], they still require costly function evaluations in pixel space, which causes huge demands in computation time and energy resources.

We propose to circumvent this drawback by introducing an explicit separation of the compressive from the generative learning phase (see Fig. 2). To achieve this, we utilize an autoencoding model which learns a space that is perceptually equivalent to the image space, but offers significantly reduced computational complexity.

Such an approach offers several advantages: (i) By leaving the high-dimensional image space, we obtain DMs which are computationally much more efficient because sampling is performed on a low-dimensional space. (ii) We exploit the inductive bias of DMs inherited from their UNet architecture [71], which makes them particularly effective for data with spatial structure and therefore alleviates the need for aggressive, quality-reducing compression levels as required by previous approaches [23, 66]. (iii) Finally, we obtain general-purpose compression models whose latent space can be used to train multiple generative models and which can also be utilized for other downstream applications such as single-image CLIP-guided synthesis [25].

3.1 Perceptual Image Compression

Our perceptual compression model is based on previous work [23] and consists of an autoencoder trained by a combination of a perceptual loss [106] and a patch-based [33] adversarial objective [20, 23, 103]. This ensures that the reconstructions are confined to the image manifold by enforcing local realism and avoids blurriness introduced by relying solely on pixel-space losses such as the $L_2$ or $L_1$ objectives.

More precisely, given an image $x \in \mathbb{R}^{H\times W\times 3}$ in RGB space, the encoder $\mathcal{E}$ encodes $x$ into a latent representation $z = \mathcal{E}(x)$, and the decoder $\mathcal{D}$ reconstructs the image from the latent, giving $\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$, where $z \in \mathbb{R}^{h\times w\times c}$. Importantly, the encoder downsamples the image by a factor $f = H/h = W/w$, and we investigate different downsampling factors $f = 2^m$, with $m \in \mathbb{N}$.
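To make the shape bookkeeping concrete, here is a minimal, hypothetical PyTorch sketch (a toy stand-in, not the paper's actual architecture): an encoder with $m$ stride-2 convolutions realizes $f = 2^m$, and a mirrored decoder maps the latent back to image space.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Toy stand-in for E: m stride-2 convs give a downsampling factor f = 2**m."""
    def __init__(self, m: int = 2, c_latent: int = 4, width: int = 64):
        super().__init__()
        layers, c_in = [], 3
        for _ in range(m):
            layers += [nn.Conv2d(c_in, width, kernel_size=3, stride=2, padding=1), nn.SiLU()]
            c_in = width
        layers += [nn.Conv2d(c_in, c_latent, kernel_size=3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (B, 3, H, W) -> z: (B, c, H/f, W/f)
        return self.net(x)

class ToyDecoder(nn.Module):
    """Toy stand-in for D: mirrors the encoder with m stride-2 transposed convs."""
    def __init__(self, m: int = 2, c_latent: int = 4, width: int = 64):
        super().__init__()
        layers, c_in = [nn.Conv2d(c_latent, width, kernel_size=3, padding=1), nn.SiLU()], width
        for _ in range(m):
            layers += [nn.ConvTranspose2d(c_in, width, kernel_size=4, stride=2, padding=1), nn.SiLU()]
            c_in = width
        layers += [nn.Conv2d(c_in, 3, kernel_size=3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):  # z: (B, c, h, w) -> x_tilde: (B, 3, h*f, w*f)
        return self.net(z)

m = 2                      # f = 2**m = 4, i.e. an "LDM-4"-style factor
E, D = ToyEncoder(m), ToyDecoder(m)
x = torch.randn(1, 3, 256, 256)
z = E(x)                   # (1, 4, 64, 64): spatial dims shrink by f in each direction
x_tilde = D(z)             # (1, 3, 256, 256)
assert z.shape[-1] == x.shape[-1] // 2**m and x_tilde.shape == x.shape
```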

In order to avoid arbitrarily high-variance latent spaces, we experiment with two different kinds of regularizations. The first variant, KL-reg., imposes a slight KL-penalty towards a standard normal on the learned latent, similar to a VAE [46, 69], whereas VQ-reg. uses a vector quantization layer [96] within the decoder. This model can be interpreted as a VQGAN [23] but with the quantization layer absorbed by the decoder. Because our subsequent DM is designed to work with the two-dimensional structure of our learned latent space $z = \mathcal{E}(x)$, we can use relatively mild compression rates and achieve very good reconstructions. This is in contrast to previous works [23, 66], which relied on an arbitrary 1D ordering of the learned space $z$ to model its distribution autoregressively and thereby ignored much of the inherent structure of $z$. Hence, our compression model preserves details of $x$ better (see Tab. 8). The full objective and training details can be found in the supplement.

3.2 Latent Diffusion Models

Diffusion Models [82] are probabilistic models designed to learn a data distribution $p(x)$ by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a fixed Markov Chain of length $T$. For image synthesis, the most successful models [30, 15, 72] rely on a reweighted variant of the variational lower bound on $p(x)$, which mirrors denoising score-matching [85]. These models can be interpreted as an equally weighted sequence of denoising autoencoders $\epsilon_\theta(x_t, t);\ t = 1\dots T$, which are trained to predict a denoised variant of their input $x_t$, where $x_t$ is a noisy version of the input $x$. The corresponding objective can be simplified to (Sec. B)

$$L_{DM} = \mathbb{E}_{x,\,\epsilon\sim\mathcal{N}(0,1),\,t}\Big[\|\epsilon - \epsilon_\theta(x_t, t)\|_2^2\Big],\qquad(1)$$

with $t$ uniformly sampled from $\{1, \dots, T\}$.
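As a concrete reading of Eq. (1), a minimal PyTorch-style sketch of one training step under the standard DDPM forward process $x_t = \sqrt{\bar{\alpha}_t}\,x + \sqrt{1-\bar{\alpha}_t}\,\epsilon$; the linear noise schedule and the generic `eps_model` are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear schedule, as in DDPM
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product \bar{alpha}_t

def dm_loss(eps_model, x):
    """One Monte-Carlo estimate of L_DM in Eq. (1) for a batch of images x."""
    b = x.shape[0]
    t = torch.randint(0, T, (b,), device=x.device)          # t ~ Uniform{1..T}
    eps = torch.randn_like(x)                                # eps ~ N(0, I)
    a = alpha_bar.to(x.device)[t].view(b, 1, 1, 1)
    x_t = a.sqrt() * x + (1.0 - a).sqrt() * eps              # fixed forward (noising) process
    return F.mse_loss(eps_model(x_t, t), eps)                # || eps - eps_theta(x_t, t) ||^2
```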

Generative Modeling of Latent Representations With our trained perceptual compression models consisting of $\mathcal{E}$ and $\mathcal{D}$, we now have access to an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away. Compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models, as they can now (i) focus on the important, semantic bits of the data and (ii) train in a lower dimensional, computationally much more efficient space.

Unlike previous work that relied on autoregressive, attention-based transformer models in a highly compressed, discrete latent space [66, 23, 103], we can take advantage of image-specific inductive biases that our model offers. This includes the ability to build the underlying UNet primarily from 2D convolutional layers, and further focusing the objective on the perceptually most relevant bits using the reweighted bound, which now reads

Figure 3: We condition LDMs either via concatenation or by a more general cross-attention mechanism. See Sec. 3.3
$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\,\epsilon\sim\mathcal{N}(0,1),\,t}\Big[\|\epsilon - \epsilon_\theta(z_t, t)\|_2^2\Big].\qquad(2)$$

The neural backbone $\epsilon_\theta(\circ, t)$ of our model is realized as a time-conditional UNet [71]. Since the forward process is fixed, $z_t$ can be efficiently obtained from $\mathcal{E}$ during training, and samples from $p(z)$ can be decoded to image space with a single pass through $\mathcal{D}$.
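A sketch of how the latent objective in Eq. (2) changes training and sampling in practice, reusing the hypothetical `dm_loss` from the sketch after Eq. (1); the frozen first stage and the final decoding pass are the only differences (the `reverse_diffusion` sampler is a placeholder for any standard DM sampler, not a function defined in the paper):

```python
import torch

# Training: the fixed, pretrained encoder E maps each batch into latent space once;
# the noising and the eps-prediction loss of Eq. (1) are then applied to z = E(x), i.e. Eq. (2).
def ldm_loss(eps_model, E, x):
    with torch.no_grad():          # first stage is frozen, no gradients through E
        z = E(x)
    return dm_loss(eps_model, z)   # same objective as Eq. (1), now on latents

# Sampling: run the learned reverse process entirely in latent space to draw z ~ p(z),
# then decode to image space with a single pass through D:
#   x_sample = D(reverse_diffusion(eps_model, shape_of_z))   # reverse_diffusion: any DM sampler
```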

Figure 4: Samples from LDMs trained on CelebAHQ [39], FFHQ [41], LSUN-Churches [102], LSUN-Bedrooms [102] and class-conditional ImageNet [12], each with a resolution of $256\times 256$. Best viewed when zoomed in. For more samples cf. the supplement.

3.3 Conditioning Mechanisms

Similar to other types of generative models [56, 83], diffusion models are in principle capable of modeling conditional distributions of the form $p(z|y)$. This can be implemented with a conditional denoising autoencoder $\epsilon_\theta(z_t, t, y)$ and paves the way to controlling the synthesis process through inputs $y$ such as text [68], semantic maps [61, 33] or other image-to-image translation tasks [34].

In the context of image synthesis, however, combining the generative power of DMs with other types of conditionings beyond class-labels [15] or blurred variants of the input image [72] is so far an under-explored area of research.

We turn DMs into more flexible conditional image generators by augmenting their underlying UNet backbone with the cross-attention mechanism [97], which is effective for learning attention-based models of various input modalities [36, 35]. To pre-process $y$ from various modalities (such as language prompts) we introduce a domain specific encoder $\tau_\theta$ that projects $y$ to an intermediate representation $\tau_\theta(y) \in \mathbb{R}^{M\times d_\tau}$, which is then mapped to the intermediate layers of the UNet via a cross-attention layer implementing $\text{Attention}(Q,K,V) = \text{softmax}\big(\frac{QK^T}{\sqrt{d}}\big)\cdot V$, with

$$Q = W^{(i)}_Q\cdot\varphi_i(z_t),\quad K = W^{(i)}_K\cdot\tau_\theta(y),\quad V = W^{(i)}_V\cdot\tau_\theta(y).$$

Here, $\varphi_i(z_t) \in \mathbb{R}^{N\times d^i_\epsilon}$ denotes a (flattened) intermediate representation of the UNet implementing $\epsilon_\theta$, and $W^{(i)}_V \in \mathbb{R}^{d\times d^i_\epsilon}$, $W^{(i)}_Q \in \mathbb{R}^{d\times d_\tau}$ & $W^{(i)}_K \in \mathbb{R}^{d\times d_\tau}$ are learnable projection matrices [97, 36]. See Fig. 3 for a visual depiction.
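A minimal single-head sketch of such a cross-attention layer (the actual model uses multi-head attention inside the UNet blocks; dimensions below are illustrative):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Queries from (flattened) UNet features phi_i(z_t), keys/values from tau_theta(y)."""
    def __init__(self, d_eps: int, d_tau: int, d: int = 64):
        super().__init__()
        # learnable projection matrices W_Q^(i), W_K^(i), W_V^(i)
        self.W_Q = nn.Linear(d_eps, d, bias=False)
        self.W_K = nn.Linear(d_tau, d, bias=False)
        self.W_V = nn.Linear(d_tau, d, bias=False)
        self.scale = d ** -0.5

    def forward(self, phi, tau_y):
        # phi:   (B, N, d_eps)  flattened intermediate UNet representation
        # tau_y: (B, M, d_tau)  conditioning tokens from the domain-specific encoder
        Q, K, V = self.W_Q(phi), self.W_K(tau_y), self.W_V(tau_y)
        attn = torch.softmax(Q @ K.transpose(-2, -1) * self.scale, dim=-1)  # softmax(QK^T / sqrt(d))
        return attn @ V                                  # (B, N, d), injected back into the UNet

# toy usage: a 32x32 latent flattened to N=1024 tokens attending over M=77 text tokens
layer = CrossAttention(d_eps=320, d_tau=512)
out = layer(torch.randn(2, 1024, 320), torch.randn(2, 77, 512))
```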

Based on image-conditioning pairs, we then learn the conditional LDM via

$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\,y,\,\epsilon\sim\mathcal{N}(0,1),\,t}\Big[\|\epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y))\|_2^2\Big],\qquad(3)$$

where both $\tau_\theta$ and $\epsilon_\theta$ are jointly optimized via Eq. 3. This conditioning mechanism is flexible as $\tau_\theta$ can be parameterized with domain-specific experts, e.g. (unmasked) transformers [97] when $y$ are text prompts (see Sec. 4.3.1).

4 Experiments

Text-to-Image Synthesis on LAION. 1.45B Model.
[Figure 5 prompts: 'A street sign that reads "Latent Diffusion"', 'A zombie in the style of Picasso', 'An image of an animal half mouse half octopus', 'An illustration of a slightly conscious neural network', 'A painting of a squirrel eating a burger', 'A watercolor painting of a chair that looks like an octopus', 'A shirt with the inscription: "I love generative models!"']
Figure 5: Samples for user-defined text prompts from our model for text-to-image synthesis, LDM-8 (KL), which was trained on the LAION [78] database. Samples generated with 200 DDIM steps and $\eta = 1.0$. We use unconditional guidance [32] with $s = 10.0$.
Figure 6: Analyzing the training of class-conditional LDMs with different downsampling factors $f$ over 2M train steps on the ImageNet dataset. Pixel-based LDM-1 requires substantially larger train times compared to models with larger downsampling factors (LDM-{4-16}). Too much perceptual compression as in LDM-32 limits the overall sample quality. All models are trained on a single NVIDIA A100 with the same computational budget. Results obtained with 100 DDIM steps [84] and $\kappa = 0$.
Figure 7: Comparing LDMs with varying compression on the CelebA-HQ (left) and ImageNet (right) datasets. Different markers indicate $\{10, 20, 50, 100, 200\}$ sampling steps using DDIM, from right to left along each line. The dashed line shows the FID scores for 200 steps, indicating the strong performance of LDM-{4-8}. FID scores assessed on 5000 samples. All models were trained for 500k (CelebA) / 2M (ImageNet) steps on an A100.

LDMs provide a means for flexible and computationally tractable diffusion-based image synthesis of various image modalities, which we show empirically in the following. Firstly, however, we analyze the gains of our models compared to pixel-based diffusion models in both training and inference. Interestingly, we find that LDMs trained in VQ-regularized latent spaces sometimes achieve better sample quality, even though the reconstruction capabilities of VQ-regularized first stage models slightly fall behind those of their continuous counterparts, cf. Tab. 8. A visual comparison between the effects of first stage regularization schemes on LDM training and their generalization abilities to resolutions $>256^2$ can be found in Appendix D.1. In E.2 we list details on architecture, implementation, training and evaluation for all results presented in this section.

4.1 On Perceptual Compression Tradeoffs

This section analyzes the behavior of our LDMs with different downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$ (abbreviated as LDM-$f$, where LDM-1 corresponds to pixel-based DMs). To obtain a comparable test-field, we fix the computational resources to a single NVIDIA A100 for all experiments in this section and train all models for the same number of steps and with the same number of parameters.

Tab. 8 shows hyperparameters and reconstruction performance of the first stage models used for the LDMs compared in this section. Fig. 6 shows sample quality as a function of training progress for 2M steps of class-conditional models on the ImageNet [12] dataset. We see that i) small downsampling factors for LDM-{1,2} result in slow training progress, whereas ii) overly large values of $f$ cause stagnating fidelity after comparably few training steps. Revisiting the analysis above (Fig. 1 and 2), we attribute this to i) leaving most of perceptual compression to the diffusion model and ii) too strong first stage compression resulting in information loss and thus limiting the achievable quality. LDM-{4-16} strike a good balance between efficiency and perceptually faithful results, which manifests in a significant FID [29] gap of 38 between pixel-based diffusion (LDM-1) and LDM-8 after 2M training steps.

In Fig. 7, we compare models trained on CelebA-HQ [39] and ImageNet in terms of sampling speed for different numbers of denoising steps with the DDIM sampler [84] and plot it against FID-scores [29]. LDM-{4-8} outperform models with unsuitable ratios of perceptual and conceptual compression. Especially compared to pixel-based LDM-1, they achieve much lower FID scores while simultaneously significantly increasing sample throughput. Complex datasets such as ImageNet require reduced compression rates to avoid reducing quality. In summary, LDM-4 and -8 offer the best conditions for achieving high-quality synthesis results.

CelebA-HQ $256\times 256$
Method | FID ↓ | Prec. ↑ | Recall ↑
DC-VAE [63] | 15.8 | - | -
VQGAN+T. [23] (k=400) | 10.2 | - | -
PGGAN [39] | 8.0 | - | -
LSGM [93] | 7.22 | - | -
UDM [43] | 7.16 | - | -
LDM-4 (ours, 500-s) | 5.11 | 0.72 | 0.49

FFHQ $256\times 256$
Method | FID ↓ | Prec. ↑ | Recall ↑
ImageBART [21] | 9.57 | - | -
U-Net GAN (+aug) [77] | 10.9 (7.6) | - | -
UDM [43] | 5.54 | - | -
StyleGAN [41] | 4.16 | 0.71 | 0.46
ProjectedGAN [76] | 3.08 | 0.65 | 0.46
LDM-4 (ours, 200-s) | 4.98 | 0.73 | 0.50

LSUN-Churches $256\times 256$
Method | FID ↓ | Prec. ↑ | Recall ↑
DDPM [30] | 7.89 | - | -
ImageBART [21] | 7.32 | - | -
PGGAN [39] | 6.42 | - | -
StyleGAN [41] | 4.21 | - | -
StyleGAN2 [42] | 3.86 | - | -
ProjectedGAN [76] | 1.59 | 0.61 | 0.44
LDM-8 (ours, 200-s) | 4.02 | 0.64 | 0.52

LSUN-Bedrooms $256\times 256$
Method | FID ↓ | Prec. ↑ | Recall ↑
ImageBART [21] | 5.51 | - | -
DDPM [30] | 4.9 | - | -
UDM [43] | 4.57 | - | -
StyleGAN [41] | 2.35 | 0.59 | 0.48
ADM [15] | 1.90 | 0.66 | 0.51
ProjectedGAN [76] | 1.52 | 0.61 | 0.34
LDM-4 (ours, 200-s) | 2.95 | 0.66 | 0.48

Table 1: Evaluation metrics for unconditional image synthesis. CelebA-HQ results reproduced from [63, 100, 43], FFHQ from [42, 43]. $N$-s refers to $N$ sampling steps with the DDIM [84] sampler; some LDMs are trained in a KL-regularized latent space. Additional results can be found in the supplementary.

Text-Conditional Image Synthesis
Method | FID ↓ | IS ↑ | $N_{\text{params}}$ | Sampling details
CogView [17] | 27.10 | 18.20 | 4B | self-ranking, rejection rate 0.017
LAFITE [109] | 26.94 | 26.02 | 75M | -
GLIDE [59] | 12.24 | - | 6B | 277 DDIM steps, c.f.g. [32] s=3
Make-A-Scene [26] | 11.84 | - | 4B | c.f.g. for AR models [98] s=5
LDM-KL-8 | 23.31 | 20.03 ± 0.33 | 1.45B | 250 DDIM steps
LDM-KL-8-G | 12.63 | 30.29 ± 0.42 | 1.45B | 250 DDIM steps, c.f.g. [32] s=1.5

Table 2: Evaluation of text-conditional image synthesis on the $256\times 256$-sized MS-COCO [51] dataset: with 250 DDIM [84] steps our model is on par with the most recent diffusion [59] and autoregressive [26] methods despite using significantly fewer parameters. Numbers for the baselines are taken from [109] and [26].

4.2 Image Generation with Latent Diffusion

We train unconditional models of $256^2$ images on CelebA-HQ [39], FFHQ [41], LSUN-Churches and -Bedrooms [102] and evaluate i) sample quality and ii) their coverage of the data manifold using FID [29] and Precision-and-Recall [50]. Tab. 1 summarizes our results. On CelebA-HQ, we report a new state-of-the-art FID of 5.11, outperforming previous likelihood-based models as well as GANs. We also outperform LSGM [93], where a latent diffusion model is trained jointly together with the first stage. In contrast, we train diffusion models in a fixed space and avoid the difficulty of weighing reconstruction quality against learning the prior over the latent space, see Fig. 1-2.

We outperform prior diffusion based approaches on all but the LSUN-Bedrooms dataset, where our score is close to ADM [15], despite utilizing half its parameters and requiring 4-times less train resources (see Appendix E.3.5). Moreover, LDMs consistently improve upon GAN-based methods in Precision and Recall, thus confirming the advantages of their mode-covering likelihood-based training objective over adversarial approaches. In Fig. 4 we also show qualitative results on each dataset.

4.3 Conditional Latent Diffusion

Figure 8: Layout-to-image synthesis with an LDM on COCO [4], see Sec. 4.3.1. Quantitative evaluation in the supplement D.3.

4.3.1 Transformer Encoders for LDMs

By introducing cross-attention based conditioning into LDMs we open them up for various conditioning modalities previously unexplored for diffusion models. For text-to-image modeling, we train a 1.45B parameter KL-regularized LDM conditioned on language prompts on LAION-400M [78]. We employ the BERT-tokenizer [14] and implement $\tau_\theta$ as a transformer [97] to infer a latent code which is mapped into the UNet via (multi-head) cross-attention (Sec. 3.3). This combination of domain specific experts for learning a language representation and visual synthesis results in a powerful model, which generalizes well to complex, user-defined text prompts, cf. Fig. 8 and 5. For quantitative analysis, we follow prior work and evaluate text-to-image generation on the MS-COCO [51] validation set, where our model improves upon powerful AR [66, 17] and GAN-based [109] methods, cf. Tab. 2. We note that applying classifier-free diffusion guidance [32] greatly boosts sample quality, such that the guided LDM-KL-8-G is on par with the recent state-of-the-art AR [26] and diffusion models [59] for text-to-image synthesis, while substantially reducing parameter count. To further analyze the flexibility of the cross-attention based conditioning mechanism we also train models to synthesize images based on semantic layouts on OpenImages [49], and finetune on COCO [4], see Fig. 8. See Sec. D.3 for the quantitative evaluation and implementation details.
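Classifier-free guidance [32], used here with scale $s$, combines a conditional and an unconditional noise prediction; a minimal sketch of the guided prediction (the handling of the null conditioning is an assumption about the general recipe, not the paper's exact implementation):

```python
import torch

def guided_eps(eps_model, z_t, t, cond, uncond, s: float = 1.5):
    """Classifier-free guidance: eps_uncond + s * (eps_cond - eps_uncond).

    cond / uncond are conditioning embeddings tau_theta(y) for the prompt and for an
    empty ("null") prompt; s is the guidance scale (e.g. s=1.5 in Tab. 3, s=10.0 in Fig. 5).
    """
    eps_cond = eps_model(z_t, t, cond)
    eps_uncond = eps_model(z_t, t, uncond)
    return eps_uncond + s * (eps_cond - eps_uncond)
```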

Lastly, following prior work [15, 3, 23, 21], we evaluate our best-performing class-conditional ImageNet models with $f \in \{4, 8\}$ from Sec. 4.1 in Tab. 3, Fig. 4 and Sec. D.4. Here we outperform the state-of-the-art diffusion model ADM [15] while significantly reducing computational requirements and parameter count, cf. Tab. 18.

Method | FID ↓ | IS ↑ | Precision ↑ | Recall ↑ | $N_{\text{params}}$ | Sampling details
BigGAN-deep [3] | 6.95 | 203.6 ± 2.6 | 0.87 | 0.28 | 340M | -
ADM [15] | 10.94 | 100.98 | 0.69 | 0.63 | 554M | 250 DDIM steps
ADM-G [15] | 4.59 | 186.7 | 0.82 | 0.52 | 608M | 250 DDIM steps
LDM-4 (ours) | 10.56 | 103.49 ± 1.24 | 0.71 | 0.62 | 400M | 250 DDIM steps
LDM-4-G (ours) | 3.60 | 247.67 ± 5.59 | 0.87 | 0.48 | 400M | 250 steps, c.f.g. [32], s=1.5

Table 3: Comparison of a class-conditional ImageNet LDM with recent state-of-the-art methods for class-conditional image generation on ImageNet [12]. A more detailed comparison with additional baselines can be found in D.4, Tab. 10 and F. c.f.g. denotes classifier-free guidance with a scale $s$ as proposed in [32].

4.3.2 Convolutional Sampling Beyond $256^2$

By concatenating spatially aligned conditioning information to the input of $\epsilon_\theta$, LDMs can serve as efficient general-purpose image-to-image translation models. We use this to train models for semantic synthesis, super-resolution (Sec. 4.4) and inpainting (Sec. 4.5). For semantic synthesis, we use images of landscapes paired with semantic maps [61, 23] and concatenate downsampled versions of the semantic maps with the latent image representation of a $f=4$ model (VQ-reg., see Tab. 8). We train on an input resolution of $256^2$ (crops from $384^2$) but find that our model generalizes to larger resolutions and can generate images up to the megapixel regime when evaluated in a convolutional manner (see Fig. 9). We exploit this behavior to also apply the super-resolution models in Sec. 4.4 and the inpainting models in Sec. 4.5 to generate large images between $512^2$ and $1024^2$. For this application, the signal-to-noise ratio (induced by the scale of the latent space) significantly affects the results. In Sec. D.1 we illustrate this when learning an LDM on (i) the latent space as provided by a $f=4$ model (KL-reg., see Tab. 8), and (ii) a rescaled version, scaled by the component-wise standard deviation.

The latter, in combination with classifier-free guidance [32], also enables the direct synthesis of $>256^2$ images for the text-conditional LDM-KL-8-G as in Fig. 13.
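A minimal sketch of the concatenation-based conditioning used for these densely conditioned tasks: the spatially aligned conditioning (e.g. a downsampled semantic map, or the low-resolution image of Sec. 4.4 where $\tau_\theta$ is the identity) is stacked with the noisy latent along the channel dimension, and because the UNet is fully convolutional it can be evaluated at larger spatial sizes than those seen during training (names and shapes are illustrative):

```python
import torch

def concat_conditioned_eps(eps_model, z_t, t, cond_map):
    """Concatenate a spatially aligned conditioning map to the diffusion input.

    z_t:      (B, c, h, w)    noisy latent
    cond_map: (B, c_y, h, w)  e.g. downsampled semantic map or encoded low-res image
    eps_model must accept c + c_y input channels; being convolutional, it also
    runs on larger h, w at sampling time (convolutional sampling beyond 256^2).
    """
    return eps_model(torch.cat([z_t, cond_map], dim=1), t)
```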

Figure 9: An LDM trained on $256^2$ resolution can generalize to larger resolution (here: $512\times 1024$) for spatially conditioned tasks such as semantic synthesis of landscape images. See Sec. 4.3.2.

4.4 Super-Resolution with Latent Diffusion

LDMs can be efficiently trained for super-resolution by directly conditioning on low-resolution images via concatenation (cf. Sec. 3.3). In a first experiment, we follow SR3 [72] and fix the image degradation to a bicubic interpolation with $4\times$-downsampling and train on ImageNet following SR3's data processing pipeline. We use the $f=4$ autoencoding model pretrained on OpenImages (VQ-reg., cf. Tab. 8) and concatenate the low-resolution conditioning $y$ and the inputs to the UNet, i.e. $\tau_\theta$ is the identity. Our qualitative and quantitative results (see Fig. 10 and Tab. 5) show competitive performance, and LDM-SR outperforms SR3 in FID while SR3 has a better IS. A simple image regression model achieves the highest PSNR and SSIM scores; however, these metrics do not align well with human perception [106] and favor blurriness over imperfectly aligned high frequency details [72]. Further, we conduct a user study comparing the pixel-baseline with LDM-SR. We follow SR3 [72] where human subjects were shown a low-res image in between two high-res images and asked for preference. The results in Tab. 4 affirm the good performance of LDM-SR. PSNR and SSIM can be pushed by using a post-hoc guiding mechanism [15], and we implement this image-based guider via a perceptual loss, see Sec. D.6.

[Figure 10 panels: bicubic | LDM-SR | SR3]
Figure 10: ImageNet $64\rightarrow 256$ super-resolution on ImageNet-Val. LDM-SR has advantages at rendering realistic textures but SR3 can synthesize more coherent fine structures. See appendix for additional samples and cropouts. SR3 results from [72].

User Study | SR on ImageNet: Pixel-DM ($f$=1) | SR on ImageNet: LDM-4 | Inpainting on Places: LAMA [88] | Inpainting on Places: LDM-4
Task 1: Preference vs GT ↑ | 16.0% | 30.4% | 13.6% | 21.0%
Task 2: Preference Score ↑ | 29.4% | 70.6% | 31.9% | 68.1%

Table 4: Task 1: Subjects were shown ground truth and generated image and asked for preference. Task 2: Subjects had to decide between two generated images. More details in E.3.6

Since the bicubic degradation process does not generalize well to images which do not follow this pre-processing, we also train a generic model, LDM-BSR, by using more diverse degradation. The results are shown in Sec. D.6.1.

Method | FID ↓ | IS ↑ | PSNR ↑ | SSIM ↑ | $N_{\text{params}}$ | samples/s (*)
Image Regression [72] | 15.2 | 121.1 | 27.9 | 0.801 | 625M | N/A
SR3 [72] | 5.2 | 180.1 | 26.4 | 0.762 | 625M | N/A
LDM-4 (ours, 100 steps) | 2.8 / 4.8 | 166.3 | 24.4 ± 3.8 | 0.69 ± 0.14 | 169M | 4.62
LDM-4 (ours, big, 100 steps) | 2.4 / 4.3 | 174.9 | 24.7 ± 4.1 | 0.71 ± 0.15 | 552M | 4.5
LDM-4 (ours, 50 steps, guiding) | 4.4 / 6.4 | 153.7 | 25.8 ± 3.7 | 0.74 ± 0.12 | 184M | 0.38

Table 5: $\times 4$ upscaling results on ImageNet-Val. ($256^2$). The two FID values per row are computed with features from the validation split and the train split, respectively; (*): sampling throughput assessed on an NVIDIA A100.

4.5 Inpainting with Latent Diffusion

Inpainting is the task of filling masked regions of an image with new content, either because parts of the image are corrupted or to replace existing but undesired content within the image. We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches for this task. Our evaluation follows the protocol of LaMa [88], a recent inpainting model that introduces a specialized architecture relying on Fast Fourier Convolutions [8]. The exact training & evaluation protocol on Places [108] is described in Sec. E.2.2.
涂抹是指在图像的遮蔽区域填充新的内容,这可能是因为图像的某些部分已损坏,也可能是为了替换图像中已有但不想要的内容。我们将评估我们的条件图像生成通用方法与更专业、更先进的方法相比,在这项任务中的效果如何。我们的评估遵循 LaMa[ 88] 的协议,LaMa 是一种最新的内绘模型,它引入了一种依赖于快速傅立叶卷积[ 8] 的专门架构。关于 Places[ 108] 的确切训练和评估协议将在 E.2.2 节中介绍。
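For reference, a minimal sketch of how such conditional generation can be set up for inpainting; the exact conditioning layout (encoded masked image plus downsampled mask, concatenated to the noisy latent) is an assumption of this sketch rather than a verbatim description of the released training code.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of inpainting conditioning: concatenate the noisy latent with
# the encoded masked image and a resized binary mask. `encoder` and the exact
# channel layout are assumptions for illustration.
def inpaint_unet_input(z_t, image, mask, encoder):
    # mask: (B, 1, H, W) with 1 marking the region to be filled in.
    masked_image = image * (1.0 - mask)
    z_masked = encoder(masked_image)                      # (B, C, h, w) latent
    m = F.interpolate(mask, size=z_t.shape[-2:], mode="nearest")
    return torch.cat([z_t, z_masked, m], dim=1)
```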

We first analyze the effect of different design choices for the first stage.
我们首先分析第一阶段不同设计方案的效果。

Model (reg.-type)          train throughput     sampling throughput      train+val       FID@2k
                           samples/sec.         @256        @512         hours/epoch     epoch 6
LDM-1 (no first stage)     0.11                 0.26        0.07         20.66           24.74
LDM-4 (KL, w/ attn)        0.32                 0.97        0.34         7.66            15.21
LDM-4 (VQ, w/ attn)        0.33                 0.97        0.34         7.04            14.99
LDM-4 (VQ, w/o attn)       0.35                 0.99        0.36         6.66            15.95

Table 6: Assessing inpainting efficiency. Deviations from Fig. 7 are due to varying GPU settings/batch sizes; cf. the supplement.
表 6:内绘效率评估。0#:由于 GPU 设置/批量大小的不同,与图 7 存在偏差,参见附录。
input 输入 result 结果
Refer to caption Refer to caption
Refer to caption Refer to caption
Refer to caption Refer to caption
Figure 11: Qualitative results on object removal with our big, w/ ft inpainting model. For more results, see Fig. 22.
图 11:使用我们的大尺度内绘模型去除物体的定性结果。更多结果请参见图 22。

In particular, we compare the inpainting efficiency of LDM-1 (i.e. a pixel-based conditional DM) with LDM-4, for both KL and VQ regularizations, as well as VQ-LDM-4 without any attention in the first stage (see Tab. 8), where the latter reduces GPU memory for decoding at high resolutions. For comparability, we fix the number of parameters for all models. Tab. 6 reports the training and sampling throughput at resolutions 256² and 512², the total training time in hours per epoch and the FID score on the validation split after six epochs. Overall, we observe a speed-up of at least 2.7× between pixel- and latent-based diffusion models while improving FID scores by a factor of at least 1.6×.
特别是,我们比较了 LDM-1(即基于像素的条件 DM)和 LDM-4(KL 和 VQ 正则化)的内绘效率,以及在第一阶段没有任何关注的 VQ-LDM-4(见表 8),后者在高分辨率解码时减少了 GPU 内存。为了便于比较,我们固定了所有模型的参数数。表 66 报告了在分辨率为 2562superscript2562256^{2}5122superscript5122512^{2} 时的训练和采样吞吐量、以小时为单位的总训练时间(每个历时)以及六个历时后验证分割的 FID 分数。总的来说,我们发现基于像素的扩散模型和基于潜像的扩散模型的速度至少提高了 2.7×2.7\times ,而 FID 分数至少提高了 1.6×1.6\times
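The magnitude of this speed-up is consistent with the reduced dimensionality of the latent space; a quick back-of-the-envelope check (assuming the 3-channel f=4 latent from Tab. 8):

```python
# Elements the diffusion backbone processes per 256x256 sample.
pixel_space = 256 * 256 * 3            # LDM-1: pixel-based diffusion
latent_space = (256 // 4) ** 2 * 3     # LDM-4: f=4 latent with c=3
print(pixel_space / latent_space)      # -> 16.0, i.e. 16x fewer elements per step
```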

The comparison with other inpainting approaches in Tab. 7 shows that our model with attention improves the overall image quality, as measured by FID, over that of [88]. LPIPS between the unmasked images and our samples is slightly higher than that of [88]. We attribute this to [88] producing only a single result, which tends to recover more of an average image compared to the diverse results produced by our LDM; cf. Fig. 21. Additionally, in a user study (Tab. 4), human subjects favor our results over those of [88].
表 7 中与其他内绘方法的比较显示,我们的注意力模型提高了 FID 分数的 @3#。表 7 显示,与 [ 88] 相比,我们的注意力模型提高了以 FID 衡量的整体图像质量。未屏蔽图像与我们的样本之间的 LPIPS 略高于 [ 88]。我们认为这是由于[88] 只产生了单一结果,与我们的 LDM 产生的多样化结果相比,它更倾向于恢复平均图像,参见图 21。此外,在一项用户研究中(表 4),与 [ 88] 的结果相比,人类受试者更喜欢我们的结果。

Based on these initial results, we also trained a larger diffusion model (big in Tab. 7) in the latent space of the VQ-regularized first stage without attention. Following [15], the UNet of this diffusion model uses attention layers on three levels of its feature hierarchy and the BigGAN [3] residual block for up- and downsampling, and has 387M parameters instead of 215M. After training, we noticed a discrepancy in the quality of samples produced at resolutions 256² and 512², which we hypothesize to be caused by the additional attention modules. However, fine-tuning the model for half an epoch at resolution 512² allows the model to adjust to the new feature statistics and sets a new state-of-the-art FID on image inpainting (big, w/o attn, w/ ft in Tab. 7, Fig. 11).
基于这些初步结果,我们还在无关注 VQ 规则化第一阶段的潜空间中训练了一个更大的扩散模型(表 7 中的大图)。按照[15]的方法,该扩散模型的 UNet 在其特征层次结构的三个层级上使用了注意力层,使用 BigGAN[ 3] 剩余块进行上采样和下采样,并拥有 387M 而不是 215M 的参数。训练结束后,我们注意到在分辨率为 2562superscript2562256^{2}5122superscript5122512^{2} 时产生的样本质量存在差异,我们推测这是额外的注意力模块造成的。然而,在分辨率为 5122superscript5122512^{2} 时对模型进行了半个历时的微调,使模型能够适应新的特征统计数据,并在图像绘制方面建立了新的 FID 技术水平(表 7、图 11 中的 big、w/o attn、w/ft)。

                                40-50% masked             All samples
Method                          FID ↓     LPIPS ↓         FID ↓     LPIPS ↓
LDM-4 (ours, big, w/ ft)        9.39      0.246 ±0.042    1.50      0.137 ±0.080
LDM-4 (ours, big, w/o ft)       12.89     0.257 ±0.047    2.40      0.142 ±0.085
LDM-4 (ours, w/ attn)           11.87     0.257 ±0.042    2.15      0.144 ±0.084
LDM-4 (ours, w/o attn)          12.60     0.259 ±0.041    2.37      0.145 ±0.084
LaMa [88]†                      12.31     0.243 ±0.038    2.23      0.134 ±0.080
LaMa [88]                       12.0      0.24            2.21      0.14
CoModGAN [107]                  10.4      0.26            1.82      0.15
RegionWise [52]                 21.3      0.27            4.75      0.15
DeepFill v2 [104]               22.1      0.28            5.20      0.16
EdgeConnect [58]                30.5      0.28            8.37      0.16

Table 7: Comparison of inpainting performance on 30k crops of size 512×512 from test images of Places [108]. The column 40-50% reports metrics computed over hard examples where 40-50% of the image region has to be inpainted. †: recomputed on our test set, since the original test set used in [88] was not available.
表 7:Places[ 108] 测试图像中 30k 个大小为 512×512512512512\times 512 的裁剪的内绘性能比较。40-50% 一栏报告的是在需要对 40-50% 的图像区域进行内绘的困难实例中计算得出的指标。由于无法获得[ 88] 中使用的原始测试集,因此在我们的测试集上重新计算了

5 Limitations & Societal Impact
5 局限性和社会影响

Limitations 局限性

While LDMs significantly reduce computational requirements compared to pixel-based approaches, their sequential sampling process is still slower than that of GANs. Moreover, the use of LDMs can be questionable when high precision is required: although the loss of image quality is very small in our f=4 autoencoding models (see Fig. 1), their reconstruction capability can become a bottleneck for tasks that require fine-grained accuracy in pixel space. We assume that our super-resolution models (Sec. 4.4) are already somewhat limited in this respect.
与基于像素的方法相比,LDM 虽然大大降低了计算要求,但其顺序采样过程仍比 GAN 慢。此外,当需要高精度时,使用 LDM 可能会有问题:虽然在我们的 f=4𝑓4f=4 自动编码模型中,图像质量的损失非常小(见图 1),但对于需要像素空间细粒度精度的任务来说,它们的重建能力可能会成为瓶颈。我们假设我们的超分辨率模型(第 4.4 节)在这方面已经受到了一定的限制。

Societal Impact 社会影响

Generative models for media like imagery are a double-edged sword: On the one hand, they enable various creative applications, and in particular approaches like ours that reduce the cost of training and inference have the potential to facilitate access to this technology and democratize its exploration. On the other hand, it also means that it becomes easier to create and disseminate manipulated data or spread misinformation and spam. In particular, the deliberate manipulation of images (“deep fakes”) is a common problem in this context, and women in particular are disproportionately affected by it [13, 24].
图像等媒体的生成模型是一把双刃剑:一方面,它可以促进各种创造性应用,尤其是像我们这样降低训练和推理成本的方法,有可能促进对这一技术的利用,并使其探索民主化。另一方面,这也意味着创建和传播篡改数据或传播错误信息和垃圾邮件变得更加容易。特别是,故意篡改图像("深度伪造")是这方面的一个常见问题,尤其是妇女受到的影响尤为严重[13, 24]。

Generative models can also reveal their training data [5, 90], which is of great concern when the data contain sensitive or personal information and were collected without explicit consent. However, the extent to which this also applies to DMs of images is not yet fully understood.
生成模型也会泄露其训练数据[5, 90],当数据包含敏感或个人信息且未经明确同意而收集时,这一点就非常令人担忧。然而,这种情况在多大程度上也适用于图像的 DMs 还不完全清楚。

Finally, deep learning modules tend to reproduce or exacerbate biases that are already present in the data [91, 38, 22]. While diffusion models achieve better coverage of the data distribution than e.g. GAN-based approaches, the extent to which our two-stage approach that combines adversarial training and a likelihood-based objective misrepresents the data remains an important research question.
最后,深度学习模块往往会重现或加剧数据中已经存在的偏差[91, 38, 22]。虽然扩散模型比基于 GAN 的方法能更好地覆盖数据分布,但我们结合对抗训练和基于似然目标的两阶段方法在多大程度上误导了数据,仍然是一个重要的研究问题。

For a more general, detailed discussion of the ethical considerations of deep generative models, see e.g. [13].
关于深度生成模型的伦理考虑的更广泛、更详细的讨论,请参见[ 13] 等。

6 Conclusion 6 结束语

We have presented latent diffusion models, a simple and efficient way to significantly improve both the training and sampling efficiency of denoising diffusion models without degrading their quality. Based on this and our cross-attention conditioning mechanism, our experiments demonstrate favorable results compared to state-of-the-art methods across a wide range of conditional image synthesis tasks without task-specific architectures. This work has been supported by the German Federal Ministry for Economic Affairs and Energy within the project ’KI-Absicherung - Safe AI for automated driving’ and by the German Research Foundation (DFG) project 421703927.
我们提出了潜在扩散模型,这是一种简单有效的方法,可以显著提高去噪扩散模型的训练和采样效率,而不会降低其质量。在此基础上,再加上我们的交叉注意力调节机制,我们的实验可以在没有特定任务架构的情况下,在广泛的条件图像合成任务中与最先进的方法相比取得良好的效果。

References 参考资料

  • [1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1122–1131. IEEE Computer Society, 2017.
    Eirikur Agustsson 和 Radu Timofte.NTIRE 2017 单图像超分辨率挑战赛:数据集与研究。In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1122-1131.IEEE 计算机协会,2017 年。
  • [2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan, 2017.
    Martin Arjovsky、Soumith Chintala 和 Léon Bottou.Wasserstein gan, 2017.
  • [3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Int. Conf. Learn. Represent., 2019.
    Andrew Brock、Jeff Donahue 和 Karen Simonyan。用于高保真自然图像合成的大规模 GAN 训练。In Int.Conf.Learn.Represent.
  • [4] Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1209–1218. Computer Vision Foundation / IEEE Computer Society, 2018.
    Holger Caesar、Jasper R. R. Uijlings 和 Vittorio Ferrari。Coco-stuff:上下文中的事物和物品类。In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1209-1218.计算机视觉基金会/IEEE计算机学会,2018。
  • [5] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021.
    Nicholas Carlini、Florian Tramer、Eric Wallace、Matthew Jagielski、Ariel Herbert-Voss、Katherine Lee、Adam Roberts、Tom Brown、Dawn Song、Ulfar Erlingsson 等:从大型语言模型中提取训练数据。第 30 届 USENIX 安全研讨会(USENIX Security 21),第 2633-2650 页,2021 年。
  • [6] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 1691–1703. PMLR, 2020.
    Mark Chen、Alec Radford、Rewon Child、Jeffrey Wu、Heewoo Jun、David Luan 和 Ilya Sutskever。从像素生成预训练。ICML,《机器学习研究论文集》第 119 卷,第 1691-1703 页。PMLR, 2020.
  • [7] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. In ICLR. OpenReview.net, 2021.
    Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan.波形梯度:波形生成的梯度估计。In ICLR.OpenReview.net, 2021.
  • [8] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolution. In NeurIPS, 2020.
    Lu Chi、Borui Jiang 和 Yadong Mu。快速傅立叶卷积。在 NeurIPS,2020 年。
  • [9] Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. CoRR, abs/2011.10650, 2020.
    Rewon Child.非常深度的Vaes泛化自回归模型,并能在图像上超越它们。CoRR,abs/2011.10650,2020。
  • [10] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. CoRR, abs/1904.10509, 2019.
    Rewon Child、Scott Gray、Alec Radford 和 Ilya Sutskever。用稀疏变换器生成长序列。CoRR,abs/1904.10509,2019。
  • [11] Bin Dai and David P. Wipf. Diagnosing and enhancing VAE models. In ICLR (Poster). OpenReview.net, 2019.
    Bin Dai 和 David P. Wipf.诊断和增强 VAE 模型。In ICLR (Poster).OpenReview.net, 2019.
  • [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE Computer Society, 2009.
    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li.Imagenet:大规模分层图像数据库。In CVPR, pages 248-255.IEEE 计算机协会,2009 年。
  • [13] Emily Denton. Ethical considerations of generative ai. AI for Content Creation Workshop, CVPR, 2021.
    Emily Denton.生成式人工智能的伦理考量。AI for Content Creation Workshop, CVPR, 2021.
  • [14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
    Jacob Devlin、Ming-Wei Chang、Kenton Lee 和 Kristina Toutanova。BERT:用于语言理解的深度双向变换器预训练。CoRR,ABS/1810.04805,2018。
  • [15] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. CoRR, abs/2105.05233, 2021.
    Prafulla Dhariwal 和 Alex Nichol.扩散模型在图像合成上击败甘斯。CoRR,abs/2105.05233,2021。
  • [16] Sander Dieleman. Musings on typicality, 2020.
    桑德-迪埃勒曼关于典型性的思考,2020 年。
  • [17] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering text-to-image generation via transformers. CoRR, abs/2105.13290, 2021.
    丁明、杨卓一、洪文义、郑文迪、周畅、尹达、林俊扬、邹旭、邵周、杨红霞和唐杰。Cogview:通过转换器掌握文本到图像的生成。CoRR,abs/2105.13290,2021。
  • [18] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation, 2015.
    Laurent Dinh、David Krueger 和 Yoshua Bengio。尼斯:非线性独立分量估计》,2015 年。
  • [19] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
    Laurent Dinh、Jascha Sohl-Dickstein 和 Samy Bengio。使用真实 NVP 的密度估计。In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.OpenReview.net, 2017.
  • [20] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Adv. Neural Inform. Process. Syst., pages 658–666, 2016.
    Alexey Dosovitskiy and Thomas Brox. 基于深度网络的感知相似度指标生成图像。In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Adv. Neural Inform.Process.Syst.》,第 658-666 页,2016 年。
  • [21] Patrick Esser, Robin Rombach, Andreas Blattmann, and Björn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. CoRR, abs/2108.08827, 2021.
    Patrick Esser、Robin Rombach、Andreas Blattmann 和 Björn Ommer。Imagebart:用于自回归图像合成的多项式扩散双向上下文。CoRR,abs/2108.08827,2021。
  • [22] Patrick Esser, Robin Rombach, and Björn Ommer. A note on data biases in generative models. arXiv preprint arXiv:2012.02516, 2020.
    Patrick Esser, Robin Rombach, and Björn Ommer.生成模型中的数据偏差说明。arXiv 预印本 arXiv:2012.02516, 2020.
  • [23] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. CoRR, abs/2012.09841, 2020.
    Patrick Esser, Robin Rombach, and Björn Ommer.用于高分辨率图像合成的驯服变换器。CoRR,abs/2012.09841,2020。
  • [24] Mary Anne Franks and Ari Ezra Waldman. Sex, lies, and videotape: Deep fakes and free speech delusions. Md. L. Rev., 78:892, 2018.
    玛丽-安妮-弗兰克斯和阿里-埃兹拉-瓦尔德曼。性、谎言和录像带:深度伪造与自由言论妄想。Md.L. Rev., 78:892, 2018.
  • [25] Kevin Frans, Lisa B. Soros, and Olaf Witkowski. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. ArXiv, abs/2106.14843, 2021.
    Kevin Frans, Lisa B. Soros, and Olaf Witkowski.Clipdraw:通过语言图像编码器探索文本到绘图的合成。ArXiv, abs/2106.14843, 2021.
  • [26] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. CoRR, abs/2203.13131, 2022.
    Oran Gafni、Adam Polyak、Oron Ashual、Shelly Sheynin、Devi Parikh 和 Yaniv Taigman。制作场景:基于场景和人类先验的文本到图像生成。CoRR,abs/2203.13131,2022。
  • [27] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. CoRR, 2014.
    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio.生成对抗网络。CoRR,2014。
  • [28] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans, 2017.
    Ishaan Gulrajani、Faruk Ahmed、Martin Arjovsky、Vincent Dumoulin 和 Aaron Courville。改进的瓦瑟斯坦甘斯训练》,2017 年。
  • [29] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Adv. Neural Inform. Process. Syst., pages 6626–6637, 2017.
    马丁-豪塞尔、休伯特-拉姆绍尔、托马斯-昂特希纳、伯恩哈德-奈斯勒和塞普-霍赫赖特。通过双时间尺度更新规则训练的甘斯收敛到局部纳什均衡。In Adv.Process.Syst., pages 6626-6637, 2017.
  • [30] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
    Jonathan Ho, Ajay Jain, and Pieter Abbeel.去噪扩散概率模型。In NeurIPS, 2020.
  • [31] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. CoRR, abs/2106.15282, 2021.
    Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans.高保真图像生成的级联扩散模型。CoRR,abs/2106.15282,2021。
  • [32] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
    乔纳森-何(Jonathan Ho)和蒂姆-萨利曼斯(Tim Salimans)。无分类器扩散引导。在 NeurIPS 2021 深度生成模型和下游应用研讨会上,2021 年。
  • [33] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 5967–5976. IEEE Computer Society, 2017.
    菲利普-伊索拉、朱俊彦、周廷辉和阿列克谢-A-埃弗罗斯。利用条件对抗网络实现图像到图像的翻译。In CVPR, pages 5967-5976.IEEE 计算机协会,2017 年。
  • [34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976, 2017.
    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros.利用条件对抗网络实现图像到图像的翻译。2017 IEEE 计算机视觉与模式识别大会(CVPR),第 5967-5976 页,2017 年。
  • [35] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver IO: A general architecture for structured inputs &outputs. CoRR, abs/2107.14795, 2021.
    Andrew Jaegle、Sebastian Borgeaud、Jean-Baptiste Alayrac、Carl Doersch、Catalin Ionescu、David Ding、Skanda Koppula、Daniel Zoran、Andrew Brock、Evan Shelhamer、Olivier J. Hénaff、Matthew M. Botvinick、Andrew Zisserman、Oriol Vinyals 和 João Carreira。感知器 IO:结构化输入输出的通用架构。CoRR,abs/2107.14795,2021。
  • [36] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and João Carreira. Perceiver: General perception with iterative attention. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 4651–4664. PMLR, 2021.
    安德鲁-耶格尔、费利克斯-吉梅诺、安迪-布洛克、奥里奥尔-维尼亚尔斯、安德鲁-齐瑟曼和若昂-卡雷拉。感知器:具有迭代注意力的一般感知。见 Marina Meila 和 Tong Zhang 编辑的《第 38 届国际机器学习大会论文集》(ICML 2021,2021 年 7 月 18-24 日,虚拟活动),《机器学习研究论文集》第 139 卷,第 4651-4664 页。PMLR, 2021.
  • [37] Manuel Jahn, Robin Rombach, and Björn Ommer. High-resolution complex scene synthesis with transformers. CoRR, abs/2105.06458, 2021.
    Manuel Jahn, Robin Rombach, and Björn Ommer.使用变换器的高分辨率复杂场景合成。CoRR,abs/2105.06458,2021。
  • [38] Niharika Jain, Alberto Olmo, Sailik Sengupta, Lydia Manikonda, and Subbarao Kambhampati. Imperfect imaganation: Implications of gans exacerbating biases on facial data augmentation and snapchat selfie lenses. arXiv preprint arXiv:2001.09528, 2020.
    Niharika Jain、Alberto Olmo、Sailik Sengupta、Lydia Manikonda 和 Subbarao Kambhampati。不完美的想象:面部数据增强和 snapchat 自拍镜头的 gans 加剧偏差的影响》。arXiv 预印本 arXiv:2001.09528, 2020 年。
  • [39] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017.
    Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.为提高质量、稳定性和变异而进行的甘斯渐进生长。CoRR,abs/1710.10196,2017。
  • [40] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., pages 4401–4410, 2019.
    Tero Karras、Samuli Laine 和 Timo Aila。基于风格的生成式对抗网络生成器架构。In IEEE Conf.Comput.Pattern Recog.Pattern Recog., pages 4401-4410, 2019.
  • [41] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
    T.Karras, S. Laine, and T. Aila.基于风格的生成式对抗网络生成器架构。In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [42] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. CoRR, abs/1912.04958, 2019.
    Tero Karras、Samuli Laine、Miika Aittala、Janne Hellsten、Jaakko Lehtinen 和 Timo Aila。分析和改进样式表的图像质量。CoRR,abs/1912.04958,2019。
  • [43] Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and Il-Chul Moon. Score matching model for unbounded data score. CoRR, abs/2106.05527, 2021.
    Dongjun Kim、Seungjae Shin、Kyungwoo Song、Wanmo Kang 和 Il-Chul Moon。无约束数据分数的分数匹配模型。CoRR,abs/2106.05527,2021。
  • [44] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, 2018.
    Durk P Kingma 和 Prafulla Dhariwal.Glow:具有可逆 1x1 卷积的生成流。见 S. Bengio、H. Wallach、H. Larochelle、K. Grauman、N. Cesa-Bianchi 和 R. Garnett 编辑,《神经信息处理系统进展》,2018 年。
  • [45] Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. CoRR, abs/2107.00630, 2021.
    Diederik P. Kingma、Tim Salimans、Ben Poole 和 Jonathan Ho。变异扩散模型。CoRR,abs/2107.00630,2021。
  • [46] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR, 2014.
    Diederik P. Kingma 和 Max Welling.自动编码变异贝叶斯。第二届学习表征国际会议,ICLR,2014。
  • [47] Zhifeng Kong and Wei Ping. On fast sampling of diffusion probabilistic models. CoRR, abs/2106.00132, 2021.
    孔志峰、魏平.论扩散概率模型的快速采样.CoRR,abs/2106.00132,2021.
  • [48] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In ICLR. OpenReview.net, 2021.
    孔志峰、平伟、黄佳吉、赵可欣和布莱恩-卡坦扎罗。Diffwave:用于音频合成的多功能扩散模型。In ICLR.OpenReview.net, 2021.
  • [49] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982, 2018.
    Alina Kuznetsova、Hassan Rom、Neil Alldrin、Jasper R. R. Uijlings、Ivan Krasin、Jordi Pont-Tuset、Shahab Kamali、Stefan Popov、Matteo Malloci、Tom Duerig 和 Vittorio Ferrari。开放图像数据集 V4:大规模统一图像分类、对象检测和视觉关系检测。CoRR,abs/1811.00982,2018。
  • [50] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. CoRR, abs/1904.06991, 2019.
    Tuomas Kynkäänniemi、Tero Karras、Samuli Laine、Jaakko Lehtinen 和 Timo Aila。用于评估生成模型的改进精度和召回率度量。CoRR,abs/1904.06991,2019。
  • [51] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.
    Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick.微软 COCO:上下文中的通用对象。CoRR,abs/1405.0312,2014。
  • [52] Yuqing Ma, Xianglong Liu, Shihao Bai, Le-Yi Wang, Aishan Liu, Dacheng Tao, and Edwin Hancock. Region-wise generative adversarial imageinpainting for large missing areas. ArXiv, abs/1909.12507, 2019.
    马玉清、刘祥龙、白世豪、王乐毅、刘爱山、陶大成和埃德温-汉考克。针对大面积缺失的区域生成式对抗图像绘制ArXiv, abs/1909.12507, 2019.
  • [53] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. CoRR, abs/2108.01073, 2021.
    孟晨林、宋洋、宋佳明、吴佳俊、朱俊彦和斯特凡诺-埃尔蒙。Sdedit:用随机微分方程合成和编辑图像。CoRR,abs/2108.01073,2021。
  • [54] Lars M. Mescheder. On the convergence properties of GAN training. CoRR, abs/1801.04406, 2018.
    Lars M. Mescheder.论 GAN 训练的收敛特性。CoRR,abs/1801.04406,2018.
  • [55] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
    Luke Metz、Ben Poole、David Pfau 和 Jascha Sohl-Dickstein。未卷积生成对抗网络。In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.OpenReview.net, 2017.
  • [56] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
    Mehdi Mirza 和 Simon Osindero.条件生成对抗网。CoRR,abs/1411.1784,2014。
  • [57] Gautam Mittal, Jesse H. Engel, Curtis Hawthorne, and Ian Simon. Symbolic music generation with diffusion models. CoRR, abs/2103.16091, 2021.
    高塔姆-米塔尔、杰西-H-恩格尔、柯蒂斯-霍桑和伊恩-西蒙。用扩散模型生成符号音乐。CoRR,abs/2103.16091,2021。
  • [58] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, and Mehran Ebrahimi. Edgeconnect: Generative image inpainting with adversarial edge learning. ArXiv, abs/1901.00212, 2019.
    Kamyar Nazeri、Eric Ng、Tony Joseph、Faisal Z. Qureshi 和 Mehran Ebrahimi。Qureshi 和 Mehran Ebrahimi。边缘连接:使用对抗边缘学习的生成式图像内绘。ArXiv,abs/1901.00212,2019。
  • [59] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: towards photorealistic image generation and editing with text-guided diffusion models. CoRR, abs/2112.10741, 2021.
    Alex Nichol、Prafulla Dhariwal、Aditya Ramesh、Pranav Shyam、Pamela Mishkin、Bob McGrew、Ilya Sutskever 和 Mark Chen。GLIDE:利用文本引导的扩散模型实现逼真图像的生成和编辑。CoRR,abs/2112.10741,2021。
  • [60] Anton Obukhov, Maximilian Seitzer, Po-Wei Wu, Semen Zhydenko, Jonathan Kyl, and Elvis Yu-Jing Lin. High-fidelity performance metrics for generative models in pytorch, 2020. Version: 0.3.0, DOI: 10.5281/zenodo.4957738.
    Anton Obukhov、Maximilian Seitzer、Po-Wei Wu、Semen Zhydenko、Jonathan Kyl 和 Elvis Yu-Jing Lin。pytorch 中生成模型的高保真性能指标,2020.版本:0.3.0,DOI:10.5281/zenodo.4957738。
  • [61] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
    Taesung Park、Ming-Yu Liu、Ting-Chun Wang 和 Jun-Yan Zhu.空间自适应归一化的语义图像合成。IEEE 计算机视觉与模式识别大会论文集,2019 年。
  • [62] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
    Taesung Park、Ming-Yu Liu、Ting-Chun Wang 和 Jun-Yan Zhu。空间自适应归一化的语义图像合成。IEEE/CVF计算机视觉与模式识别会议(CVPR)论文集,2019年6月。
  • [63] Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 823–832. Computer Vision Foundation / IEEE, 2021.
    Gaurav Parmar、Dacheng Li、Kwonjoon Lee 和 Zhuowen Tu。双矛盾生成自动编码器。In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 823-832.计算机视觉基金会/IEEE,2021。
  • [64] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On buggy resizing libraries and surprising subtleties in fid calculation. arXiv preprint arXiv:2104.11222, 2021.
    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu.关于漏洞百出的大小调整库和 Fid 计算中令人惊奇的微妙之处。arXiv 预印本 arXiv:2104.11222, 2021.
  • [65] David A. Patterson, Joseph Gonzalez, Quoc V. Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. CoRR, abs/2104.10350, 2021.
    David A. Patterson、Joseph Gonzalez、Quoc V. Le、Chen Liang、Lluis-Miquel Munguia、Daniel Rothchild、David R. So、Maud Texier 和 Jeff Dean。碳排放与大型神经网络训练。CoRR,abs/2104.10350,2021。
  • [66] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021.
    Aditya Ramesh、Mikhail Pavlov、Gabriel Goh、Scott Gray、Chelsea Voss、Alec Radford、Mark Chen 和 Ilya Sutskever。零镜头文本到图像生成。CoRR,abs/2102.12092,2021。
  • [67] Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, pages 14837–14847, 2019.
    Ali Razavi、Aäron van den Oord 和 Oriol Vinyals。用 VQ-VAE-2 生成多样化的高保真图像。In NeurIPS, pages 14837-14847, 2019.
  • [68] Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
    Scott E. Reed、Zeynep Akata、Xinchen Yan、Lajanugen Logeswaran、Bernt Schiele 和 Honglak Lee。生成式对抗文本图像合成。In ICML, 2016.
  • [69] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML, 2014.
    Danilo Jimenez Rezende、Shakir Mohamed 和 Daan Wierstra。深度生成模型中的随机反向传播和近似推理。第 31 届国际机器学习大会论文集》,ICML,2014 年。
  • [70] Robin Rombach, Patrick Esser, and Björn Ommer. Network-to-network translation with conditional invertible neural networks. In NeurIPS, 2020.
    Robin Rombach、Patrick Esser 和 Björn Ommer.用条件可逆神经网络实现网络到网络的转换。NeurIPS, 2020.
  • [71] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pages 234–241. Springer, 2015.
    Olaf Ronneberger、Philipp Fischer 和 Thomas Brox.U-net:用于生物医学图像分割的卷积网络。In MICCAI (3), volume 9351 of Lecture Notes in Computer Science, pages 234-241.Springer, 2015.
  • [72] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. CoRR, abs/2104.07636, 2021.
    Chitwan Saharia、Jonathan Ho、William Chan、Tim Salimans、David J. Fleet 和 Mohammad Norouzi。通过迭代细化实现图像超分辨率。CoRR,abs/2104.07636,2021。
  • [73] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. CoRR, abs/1701.05517, 2017.
    Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma.Pixelcnn++:用离散化逻辑混合似然和其他修改改进 pixelcnn。CoRR,abs/1701.05517,2017.
  • [74] Dave Salvator. NVIDIA Developer Blog. https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32, 2020.
    Dave Salvator。英伟达开发者博客。https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32,2020 年。
  • [75] Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models. CoRR, abs/2104.02600, 2021.
    Robin San-Roman、Eliya Nachmani 和 Lior Wolf。生成式扩散模型的噪声估计.CoRR,abs/2104.02600,2021。
  • [76] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected gans converge faster. CoRR, abs/2111.01007, 2021.
    阿克塞尔-绍尔、卡什亚普-奇塔、延斯-穆勒和安德烈亚斯-盖格。投影甘斯收敛更快CoRR,abs/2111.01007,2021。
  • [77] Edgar Schönfeld, Bernt Schiele, and Anna Khoreva. A u-net based discriminator for generative adversarial networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8204–8213. Computer Vision Foundation / IEEE, 2020.
    Edgar Schönfeld、Bernt Schiele 和 Anna Khoreva。基于 U-net 的生成式对抗网络判别器。In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8204-8213.计算机视觉基金会/IEEE,2020。
  • [78] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021.
    克里斯托夫-舒曼(Christoph Schuhmann)、理查德-文库(Richard Vencu)、罗曼-博蒙特(Romain Beaumont)、罗伯特-卡茨马尔奇克(Robert Kaczmarczyk)、克莱顿-穆利斯(Clayton Mullis)、阿鲁什-卡塔(Aarush Katta)、西奥-库姆斯(Theo Coombes)、杰尼亚-吉特塞夫(Jenia Jitsev)和阿兰-小松崎(Aran Komatsuzaki)。莱昂-400 米:经剪辑过滤的 4 亿图像-文本对开放数据集,2021 年。
  • [79] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Yoshua Bengio and Yann LeCun, editors, Int. Conf. Learn. Represent., 2015.
    Karen Simonyan 和 Andrew Zisserman.用于大规模图像识别的深度卷积网络。In Yoshua Bengio and Yann LeCun, editors, Int.Conf.Learn.Represent.
  • [80] Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2C: diffusion-denoising models for few-shot conditional generation. CoRR, abs/2106.06819, 2021.
    Abhishek Sinha、Jiaming Song、Chenlin Meng 和 Stefano Ermon。D2C:用于少量条件生成的扩散-去噪模型。CoRR,abs/2106.06819,2021。
  • [81] Charlie Snell. Alien Dreams: An Emerging Art Scene. https://ml.berkeley.edu/blog/posts/clip-art/, 2021. [Online; accessed November-2021].
    查理-斯内尔外星人之梦》:https://ml.berkeley.edu/blog/posts/clip-art/, 2021。[Online; accessed November-2021].
  • [82] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. CoRR, abs/1503.03585, 2015.
    Jascha Sohl-Dickstein、Eric A. Weiss、Niru Maheswaranathan 和 Surya Ganguli。使用非平衡热力学的深度无监督学习。CoRR,abs/1503.03585,2015.
  • [83] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
    Kihyuk Sohn, Honglak Lee, and Xinchen Yan.使用深度条件生成模型学习结构化输出表示。见 C. Cortes、N. Lawrence、D. Lee、M. Sugiyama 和 R. Garnett 编辑的《神经信息处理系统进展》第 28 卷。库兰联合公司,2015 年。
  • [84] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR. OpenReview.net, 2021.
    宋家明、孟晨霖和斯特凡诺-埃尔蒙。去噪扩散隐含模型In ICLR.OpenReview.net, 2021.
  • [85] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. CoRR, abs/2011.13456, 2020.
    杨松、Jascha Sohl-Dickstein、Diederik P. Kingma、Abhishek Kumar、Stefano Ermon 和 Ben Poole。通过随机微分方程进行基于分数的生成建模。CoRR,abs/2011.13456,2020。
  • [86] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for modern deep learning research. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 13693–13696. AAAI Press, 2020.
    Emma Strubell、Ananya Ganesh 和 Andrew McCallum.现代深度学习研究的能源和政策考虑。In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 13693-13696.AAAI Press, 2020.
  • [87] Wei Sun and Tianfu Wu. Learning layout and style reconfigurable gans for controllable image synthesis. CoRR, abs/2003.11571, 2020.
    Wei Sun 和 Tianfu Wu.用于可控图像合成的学习布局和样式可重构甘斯。CoRR,abs/2003.11571,2020.
  • [88] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor S. Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. ArXiv, abs/2109.07161, 2021.
    Roman Suvorov、Elizaveta Logacheva、Anton Mashikhin、Anastasia Remizova、Arsenii Ashukha、Aleksei Silvestrov、Naejin Kong、Harshith Goka、Kiwoong Park 和 Victor S. Lempitsky。利用傅立叶卷积进行分辨率稳健的大掩模涂色。ArXiv,abs/2109.07161,2021。
  • [89] Tristan Sylvain, Pengchuan Zhang, Yoshua Bengio, R. Devon Hjelm, and Shikhar Sharma. Object-centric image generation from layouts. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 2647–2655. AAAI Press, 2021.
    Tristan Sylvain、Pengchuan Zhang、Yoshua Bengio、R. Devon Hjelm 和 Shikhar Sharma。从布局生成以对象为中心的图像。In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 2647-2655.AAAI Press, 2021.
  • [90] Patrick Tinsley, Adam Czajka, and Patrick Flynn. This face does not exist… but it might be yours! identity leakage in generative models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1320–1328, 2021.
    Patrick Tinsley, Adam Czajka, and Patrick Flynn.这张脸并不存在......但它可能是你的!生成模型中的身份泄露。IEEE/CVF 计算机视觉应用冬季会议论文集》,第 1320-1328 页,2021 年。
  • [91] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528. IEEE, 2011.
    Antonio Torralba 和 Alexei A Efros.无偏见地看待数据集偏差。In CVPR 2011, pages 1521-1528.IEEE, 2011.
  • [92] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In NeurIPS, 2020.
    Arash Vahdat 和 Jan Kautz.NVAE:深度分层变异自动编码器。在 NeurIPS,2020 年。
  • [93] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. CoRR, abs/2106.05931, 2021.
    Arash Vahdat、Karsten Kreis 和 Jan Kautz。潜在空间中基于分数的生成建模。CoRR,abs/2106.05931,2021。
  • [94] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, koray kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, 2016.
    Aaron van den Oord、Nal Kalchbrenner、Lasse Espeholt、Koray Kavukcuoglu、Oriol Vinyals 和 Alex Graves。用像素神经解码器生成条件图像神经信息处理系统进展》,2016 年。
  • [95] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. CoRR, abs/1601.06759, 2016.
    Aäron van den Oord、Nal Kalchbrenner 和 Koray Kavukcuoglu。像素递归神经网络。CoRR,abs/1601.06759,2016。
  • [96] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NIPS, pages 6306–6315, 2017.
    Aäron van den Oord、Oriol Vinyals 和 Koray Kavukcuoglu。神经离散表征学习。In NIPS, pages 6306-6315, 2017.
  • [97] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.注意力就是你所需要的一切。In NIPS, pages 5998-6008, 2017.
  • [98] Rivers Have Wings. Tweet on Classifier-free guidance for autoregressive models. https://twitter.com/RiversHaveWings/status/1478093658716966912, 2022.
    河流有翅膀。Tweet on Classifier-free guidance for autoregressive models. https://twitter.com/RiversHaveWings/status/1478093658716966912, 2022.
  • [99] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface’s transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771, 2019.
    Thomas Wolf、Lysandre Debut、Victor Sanh、Julien Chaumond、Clement Delangue、Anthony Moi、Pierric Cistac、Tim Rault、Rémi Louf、Morgan Funtowicz 和 Jamie Brew。Huggingface's transformers:最先进的自然语言处理技术。CoRR,abs/1910.03771,2019。
  • [100] Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A symbiosis between variational autoencoders and energy-based models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
    Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat.VAEBM:变异自动编码器与基于能量的模型之间的共生。第九届学习表征国际会议,ICLR 2021,奥地利虚拟活动,2021年5月3-7日。OpenReview.net, 2021.
  • [101] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using VQ-VAE and transformers. CoRR, abs/2104.10157, 2021.
    Wilson Yan、Yunzhi Zhang、Pieter Abbeel 和 Aravind Srinivas。Videogpt:使用 VQ-VAE 和变压器生成视频。CoRR,abs/2104.10157,2021。
  • [102] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. CoRR, abs/1506.03365, 2015.
    Fisher Yu、Yinda Zhang、Shuran Song、Ari Seff 和 Jianxiong Xiao.LSUN:利用深度学习与人类共同构建大规模图像数据集。CoRR,abs/1506.03365,2015。
  • [103] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan, 2021.
    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu.用改进的 vqgan 进行矢量量化图像建模》,2021 年。
  • [104] Jiahui Yu, Zhe L. Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Free-form image inpainting with gated convolution. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4470–4479, 2019.
    余佳慧、林哲、杨继美、沈晓辉、卢昕和黄轶翔。使用门控卷积的自由形式图像绘制(Free-form image inpainting with gated convolution)。2019 IEEE/CVF 计算机视觉国际会议(ICCV),第 4470-4479 页,2019 年。
  • [105] K. Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. ArXiv, abs/2103.14006, 2021.
    K.Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte.为深度盲图像超分辨率设计实用退化模型。ArXiv,abs/2103.14006,2021。
  • [106] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
    Richard Zhang、Phillip Isola、Alexei A. Efros、Eli Shechtman 和 Oliver Wang。深度特征作为感知度量的不合理有效性。电气和电子工程师协会计算机视觉与模式识别大会论文集》,2018 年 6 月。
  • [107] Shengyu Zhao, Jianwei Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I-Chao Chang, and Yan Xu. Large scale image completion via co-modulated generative adversarial networks. ArXiv, abs/2103.10428, 2021.
    赵胜宇、崔建伟、盛一伦、董玥、梁晓、张一超和徐艳。通过共调制生成式对抗网络完成大规模图像处理。ArXiv,abs/2103.10428,2021。
  • [108] Bolei Zhou, Àgata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:1452–1464, 2018.
    Bolei Zhou, Àgata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba.场所:用于场景识别的千万级图像数据库。IEEE Transactions on Pattern Analysis and Machine Intelligence》,40:1452-1464,2018.
  • [109] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. LAFITE: towards language-free training for text-to-image generation. CoRR, abs/2111.13792, 2021.
    Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun.LAFITE:实现文本到图像生成的无语言训练。CoRR,abs/2111.13792,2021。

Appendix 附录

Refer to caption
Refer to caption
Refer to caption
Figure 12: Convolutional samples from the semantic landscapes model as in Sec. 4.3.2, finetuned on 512² images.
图 12:来自第 4.3.2 节中语义景观模型的卷积样本,在 5122superscript5122512^{2} 图像上进行了微调。
’A painting of the last supper by Picasso.’
毕加索的《最后的晚餐》油画。
Refer to caption
’An oil painting of a latent space.’
一幅潜藏空间的油画。
’An epic painting of Gandalf the Black
黑袍甘道夫的史诗画作
summoning thunder and lightning in the mountains.’
在山中召唤雷电的史诗画作。
Refer to caption Refer to caption
’A sunset over a mountain range, vector image.’
山脉上的日落,矢量图像。
Refer to caption
Figure 13: Combining classifier-free diffusion guidance with the convolutional sampling strategy from Sec. 4.3.2, our 1.45B parameter text-to-image model can be used for rendering images larger than the native 256² resolution the model was trained on.
图 13:结合分类器自由扩散引导和第 4.3.2 节中的卷积采样策略,我们的 1.45B 参数文本到图像模型可用于渲染大于模型所训练的原始 2562superscript2562256^{2} 分辨率的图像。
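As a reminder of the guidance mechanism used for Fig. 13, classifier-free diffusion guidance [32] mixes a conditional and an unconditional score estimate at sampling time; the sketch below is a generic formulation in which the `context` keyword, the null conditioning `c_null` and the scale `s` are assumed placeholders rather than the exact API of our model.

```python
import torch

# Hedged sketch of classifier-free guidance [32]:
#   eps_hat = eps(z_t, c_null) + s * (eps(z_t, c_text) - eps(z_t, c_null))
# `unet`, its `context` keyword and `c_null` are placeholders, not the exact API.
def cfg_epsilon(unet, z_t, t, c_text, c_null, s: float = 5.0):
    eps_uncond = unet(z_t, t, context=c_null)
    eps_cond = unet(z_t, t, context=c_text)
    return eps_uncond + s * (eps_cond - eps_uncond)
```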

Appendix A Changelog 附录 AC 更新日志

Here we list changes between this version (https://arxiv.org/abs/2112.10752v2) of the paper and the previous version, i.e. https://arxiv.org/abs/2112.10752v1.
这里我们列出了本文本版本 ( https://arxiv.org/abs/2112.10752v2) 与上一版本(即 https://arxiv.org/abs/2112.10752v1)之间的变化。

  • We updated the results on text-to-image synthesis in Sec. 4.3 which were obtained by training a new, larger model (1.45B parameters). This also includes a new comparison to very recent competing methods on this task that were published on arXiv at the same time as ([59, 109]) or after ([26]) the publication of our work.


    - 我们更新了第 4.3 节中文本到图像合成的结果,这些结果是通过训练一个新的、更大的模型(1.45B 个参数)获得的。这也包括了与最近在 arXiv 上发表的关于这项任务的竞争方法的新比较,这些方法与我们的工作同时([ 59, 109])或之后([ 26])发表。
  • We updated results on class-conditional synthesis on ImageNet in Sec. 4.1, Tab. 3 (see also Sec. D.4) obtained by retraining the model with a larger batch size. The corresponding qualitative results in Fig. 26 and Fig. 27 were also updated. Both the updated text-to-image and the class-conditional model now use classifier-free guidance [32] as a measure to increase visual fidelity.


    - 我们在第 4.1 节的表 3 中更新了在 ImageNet 上进行类条件合成的结果(另见第 D.3 节)。3 中关于 ImageNet 的类条件合成结果(另见第 D.4 节)。图 26 和图 27 中相应的定性结果也进行了更新。更新后的 "文本到图像 "模型和 "类别条件 "模型现在都使用无分类器引导[32]作为提高视觉逼真度的措施。
  • We conducted a user study (following the scheme suggested by Saharia et al. [72]) which provides additional evaluation for our inpainting (Sec. 4.5) and super-resolution models (Sec. 4.4).


    - 我们进行了一项用户研究(按照 Saharia 等人[72]建议的方案),为我们的内绘制(第 4.5 节)和超分辨率模型(第 4.4 节)提供了额外的评估。
  • Added Fig. 5 to the main paper, moved Fig. 18 to the appendix, added Fig. 13 to the appendix.


    - 在正文中添加了图 5,将图 18 移到了附录中,在附录中添加了图 13。

Appendix B Detailed Information on Denoising Diffusion Models
附录 B 关于去噪扩散模型的详细信息

Diffusion models can be specified in terms of a signal-to-noise ratio SNR(t) = α_t² / σ_t² consisting of sequences (α_t)_{t=1}^T and (σ_t)_{t=1}^T which, starting from a data sample x_0, define a forward diffusion process q as
扩散模型可以用信噪比 SNR(t)=αt2σt2SNR𝑡superscriptsubscript𝛼𝑡2superscriptsubscript𝜎𝑡2\text{SNR}(t)=\frac{\alpha_{t}^{2}}{\sigma_{t}^{2}} 来指定,它由序列 (αt)t=1Tsuperscriptsubscriptsubscript𝛼𝑡𝑡1𝑇(\alpha_{t})_{t=1}^{T}(σt)t=1Tsuperscriptsubscriptsubscript𝜎𝑡𝑡1𝑇(\sigma_{t})_{t=1}^{T} 组成,从数据样本 x0subscript𝑥0x_{0} 开始,定义一个前向扩散过程 q𝑞q

q(x_{t}|x_{0})=\mathcal{N}(x_{t}|\alpha_{t}x_{0},\sigma_{t}^{2}\mathbb{I})   (4)

with the Markov structure for s < t:
的马尔科夫结构:

q(x_{t}|x_{s}) = \mathcal{N}(x_{t}|\alpha_{t|s}x_{s},\sigma_{t|s}^{2}\mathbb{I})   (5)
\alpha_{t|s} = \frac{\alpha_{t}}{\alpha_{s}}   (6)
\sigma_{t|s}^{2} = \sigma_{t}^{2}-\alpha_{t|s}^{2}\sigma_{s}^{2}   (7)

Denoising diffusion models are generative models p(x_0) which revert this process with a similar Markov structure running backward in time, i.e. they are specified as
去噪扩散模型是一种生成模型 p(x0)𝑝subscript𝑥0p(x_{0}) ,它以类似的马尔可夫结构反演这一过程,即在时间上向后运行,具体为

p(x_{0})=\int_{z}p(x_{T})\prod_{t=1}^{T}p(x_{t-1}|x_{t})   (8)

The evidence lower bound (ELBO) associated with this model then decomposes over the discrete time steps as
与该模型相关的证据下限(ELBO)在离散时间步长上的分解为

-\log p(x_{0})\leq\mathbb{KL}(q(x_{T}|x_{0})\,|\,p(x_{T}))+\sum_{t=1}^{T}\mathbb{E}_{q(x_{t}|x_{0})}\mathbb{KL}(q(x_{t-1}|x_{t},x_{0})\,|\,p(x_{t-1}|x_{t}))   (9)

The prior p(x_T) is typically chosen as a standard normal distribution, and the first term of the ELBO then depends only on the final signal-to-noise ratio SNR(T). To minimize the remaining terms, a common choice to parameterize p(x_{t-1}|x_t) is to specify it in terms of the true posterior q(x_{t-1}|x_t, x_0), but with the unknown x_0 replaced by an estimate x_θ(x_t, t) based on the current step x_t. This gives [45]
先验值 p(xT)𝑝subscript𝑥𝑇p(x_{T}) 通常选择标准正态分布,ELBO 的第一项只取决于最终信噪比 SNR(T)SNR𝑇\text{SNR}(T) 。为了最小化其余项,对 p(xt1|xt)𝑝conditionalsubscript𝑥𝑡1subscript𝑥𝑡p(x_{t-1}|x_{t}) 进行参数化的一个常见选择是用真实后验 q(xt1|xt,x0)𝑞conditionalsubscript𝑥𝑡1subscript𝑥𝑡subscript𝑥0q(x_{t-1}|x_{t},x_{0}) 来指定它,但用基于当前步长 xtsubscript𝑥𝑡x_{t} 的估计值 xθ(xt,t)subscript𝑥𝜃subscript𝑥𝑡𝑡x_{\theta}(x_{t},t) 代替未知值 x0subscript𝑥0x_{0} 。这样就得到了[ 45] 。

p(x_{t-1}|x_{t}) \coloneqq q(x_{t-1}|x_{t},x_{\theta}(x_{t},t))   (10)
                 = \mathcal{N}(x_{t-1}|\mu_{\theta}(x_{t},t),\sigma_{t|t-1}^{2}\frac{\sigma_{t-1}^{2}}{\sigma_{t}^{2}}\mathbb{I}),   (11)

where the mean can be expressed as
其中均值可以表示为

\mu_{\theta}(x_{t},t)=\frac{\alpha_{t|t-1}\sigma_{t-1}^{2}}{\sigma_{t}^{2}}x_{t}+\frac{\alpha_{t-1}\sigma_{t|t-1}^{2}}{\sigma_{t}^{2}}x_{\theta}(x_{t},t).   (12)

In this case, the sum in the ELBO simplifies to
在这种情况下,ELBO 的总和简化为

\sum_{t=1}^{T}\mathbb{E}_{q(x_{t}|x_{0})}\mathbb{KL}(q(x_{t-1}|x_{t},x_{0})\,|\,p(x_{t-1}|x_{t}))=\sum_{t=1}^{T}\mathbb{E}_{\mathcal{N}(\epsilon|0,\mathbb{I})}\frac{1}{2}(\text{SNR}(t-1)-\text{SNR}(t))\|x_{0}-x_{\theta}(\alpha_{t}x_{0}+\sigma_{t}\epsilon,t)\|^{2}   (13)

Following [30], we use the reparameterization
按照 [ 30] 的方法,我们使用重参数化

\epsilon_{\theta}(x_{t},t)=(x_{t}-\alpha_{t}x_{\theta}(x_{t},t))/\sigma_{t}   (14)

to express the reconstruction term as a denoising objective,
来表达去噪目标的重建项、

\|x_{0}-x_{\theta}(\alpha_{t}x_{0}+\sigma_{t}\epsilon,t)\|^{2}=\frac{\sigma_{t}^{2}}{\alpha_{t}^{2}}\|\epsilon-\epsilon_{\theta}(\alpha_{t}x_{0}+\sigma_{t}\epsilon,t)\|^{2}   (15)

and the reweighting, which assigns each of the terms the same weight and results in Eq. (1).
以及重新加权,赋予每个项相同的权重,从而得出公式 (1)。
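Put together, Eq. (4), Eq. (14) and Eq. (15) yield the familiar ε-prediction training loop; the sketch below uses an illustrative variance-preserving schedule (α_t² + σ_t² = 1 with linearly spaced α_t²), which is an assumption for this sketch and not the schedule used in the paper.

```python
import torch

# Hedged sketch of the epsilon-parameterized objective: sample t and epsilon,
# form x_t via the forward process q(x_t|x_0) of Eq. (4), and regress the noise
# as in Eqs. (14)-(15) with the reweighting of Eq. (1). Schedule is illustrative.
T = 1000
alpha_sq = torch.linspace(0.9999, 0.0001, T)   # assumed schedule for alpha_t^2
alpha = alpha_sq.sqrt()                        # alpha_t
sigma = (1.0 - alpha_sq).sqrt()                # sigma_t (variance preserving)

def denoising_loss(eps_model, x0: torch.Tensor) -> torch.Tensor:
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    a = alpha.to(x0.device)[t].view(b, 1, 1, 1)
    s = sigma.to(x0.device)[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a * x0 + s * eps                     # forward process, Eq. (4)
    eps_pred = eps_model(x_t, t)               # epsilon_theta(x_t, t), Eq. (14)
    return ((eps - eps_pred) ** 2).mean()      # reweighted objective, Eqs. (1)/(15)
```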

Appendix C Image Guiding Mechanisms
附录 C 图像引导机制

Samples 256²   Guided Convolutional Samples 512²   Convolutional Samples 512²
Refer to caption Refer to caption Refer to caption
Refer to caption Refer to caption Refer to caption
Refer to caption Refer to caption Refer to caption
Figure 14: On landscapes, convolutional sampling with unconditional models can lead to homogeneous and incoherent global structures (see column 2). L_2-guiding with a low-resolution image can help to reestablish coherent global structures.
图 14:在地形上,使用无条件模型的卷积采样会导致同质和不连贯的全局结构(见第 2 栏)。用低分辨率图像进行引导,有助于重新建立连贯的全局结构。

An intriguing feature of diffusion models is that unconditional models can be conditioned at test-time [85, 82, 15]. In particular, [15] presented an algorithm to guide both unconditional and conditional models trained on the ImageNet dataset with a classifier log p_Φ(y|x_t), trained on each x_t of the diffusion process. We directly build on this formulation and introduce post-hoc image-guiding:
扩散模型的一个令人感兴趣的特点是,非条件模型可以在测试时进行调节[85, 82, 15]。特别是,[15] 提出了一种算法,用于指导在 ImageNet 数据集上训练的无条件模型和有条件模型,其分类器为 logpΦ(y|xt)subscript𝑝Φconditional𝑦subscript𝑥𝑡\log p_{\Phi}(y|x_{t}) ,在扩散过程的每个 xtsubscript𝑥𝑡x_{t} 上进行训练。我们在此基础上直接引入了事后图像引导:

For an epsilon-parameterized model with fixed variance, the guiding algorithm as introduced in [15] reads:
对于具有固定方差的ε参数化模型,[15] 中介绍的引导算法如下:

\hat{\epsilon}\leftarrow\epsilon_{\theta}(z_{t},t)+\sqrt{1-\alpha_{t}^{2}}\;\nabla_{z_{t}}\log p_{\Phi}(y|z_{t}).   (16)

This can be interpreted as an update correcting the “score” ε_θ with a conditional distribution log p_Φ(y|z_t).
这可以解释为用条件分布 logpΦ(y|zt)subscript𝑝Φconditional𝑦subscript𝑧𝑡\log p_{\Phi}(y|z_{t}) 修正 "分数" ϵθsubscriptitalic-ϵ𝜃\epsilon_{\theta} 的更新。

So far, this scenario has only been applied to single-class classification models. We re-interpret the guiding distribution p_Φ(y | T(D(z_0(z_t)))) as a general-purpose image-to-image translation task given a target image y, where T can be any differentiable transformation suited to the image-to-image translation task at hand, such as the identity, a downsampling operation or similar.
到目前为止,这种情况只应用于单类分类模型。我们将指导分布 pΦ(y|T(𝒟(z0(zt))))subscript𝑝Φconditional𝑦𝑇𝒟subscript𝑧0subscript𝑧𝑡p_{\Phi}(y|T(\mathcal{D}(z_{0}(z_{t})))) 重新解释为给定目标图像 y𝑦y 的通用图像到图像转换任务,其中 T𝑇T 可以是手头图像到图像转换任务所采用的任何可变变换,如身份、降采样操作或类似操作。

As an example, we can assume a Gaussian guider with fixed variance $\sigma^2 = 1$, such that

$\log p_\Phi(y|z_t) = -\tfrac{1}{2}\,\|y - T(\mathcal{D}(z_0(z_t)))\|_2^2$   (17)

becomes an $L_2$ regression objective.

Fig. 14 demonstrates how this formulation can serve as an upsampling mechanism for an unconditional model trained on $256^2$ images, where unconditional samples of size $256^2$ guide the convolutional synthesis of $512^2$ images and $T$ is a $2\times$ bicubic downsampling. Following this motivation, we also experiment with perceptual similarity guiding and replace the $L_2$ objective with the LPIPS [106] metric, see Sec. 4.4.
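A sketch of the resulting image guider, plugging Eq. (17) into the update of Eq. (16); decoder stands for $\mathcal{D}$, the $z_0$ prediction follows from Eq. (14), and the $2\times$ bicubic downsampling plays the role of $T$ (names and details are assumptions, not the exact code we used):

import torch
import torch.nn.functional as F

def image_guided_epsilon(eps_model, decoder, z_t, t, y_lowres, alpha_t):
    z_in = z_t.detach().requires_grad_(True)
    eps = eps_model(z_in, t)
    # predict z_0 from z_t and the current epsilon estimate (cf. Eq. (14))
    z0 = (z_in - (1.0 - alpha_t ** 2) ** 0.5 * eps) / alpha_t
    # decode to image space and apply T, here a 2x bicubic downsampling
    x0 = decoder(z0)
    x0_small = F.interpolate(x0, scale_factor=0.5, mode="bicubic", align_corners=False)
    # Gaussian guider with unit variance, Eq. (17)
    log_p = -0.5 * ((y_lowres - x0_small) ** 2).sum()
    grad = torch.autograd.grad(log_p, z_in)[0]
    # guided update as in Eq. (16)
    return eps.detach() + (1.0 - alpha_t ** 2) ** 0.5 * grad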

Appendix D Additional Results

D.1 Choosing the Signal-to-Noise Ratio for High-Resolution Synthesis

KL-reg, w/o rescaling | KL-reg, w/ rescaling | VQ-reg, w/o rescaling
Figure 15: Illustrating the effect of latent space rescaling on convolutional sampling, here for semantic image synthesis on landscapes. See Sec. 4.3.2 and Sec. D.1.

As discussed in Sec. 4.3.2, the signal-to-noise ratio induced by the variance of the latent space (i.e. $\text{Var}(z)/\sigma_t^2$) significantly affects the results for convolutional sampling. For example, when training an LDM directly in the latent space of a KL-regularized model (see Tab. 8), this ratio is very high, such that the model allocates a lot of semantic detail early on in the reverse denoising process. In contrast, when rescaling the latent space by the component-wise standard deviation of the latents as described in Sec. G, the SNR is decreased. We illustrate the effect on convolutional sampling for semantic image synthesis in Fig. 15. Note that the VQ-regularized space has a variance close to 1, such that it does not have to be rescaled.
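As a small illustration (an assumption-laden sketch, not code from our training pipeline), the ratio above can be estimated directly from a batch of latents and the noise schedule:

import torch

def latent_snr(z, sigma_t):
    # z: batch of latents E(x); sigma_t: forward-process noise std at step t
    # a KL-regularized latent space can have Var(z) >> 1, i.e. a very high SNR early on
    return z.flatten().var() / sigma_t ** 2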

D.2 Full List of all First Stage Models

We provide a complete list of the various autoencoding models trained on the OpenImages dataset in Tab. 8.

f | |Z| | c | R-FID↓ | R-IS↑ | PSNR↑ | PSIM↓ | SSIM↑
16 (VQGAN [23]) | 16384 | 256 | 4.98 | - | 19.9 ± 3.4 | 1.83 ± 0.42 | 0.51 ± 0.18
16 (VQGAN [23]) | 1024 | 256 | 7.94 | - | 19.4 ± 3.3 | 1.98 ± 0.43 | 0.50 ± 0.18
8 (DALL-E [66]) | 8192 | - | 32.01 | - | 22.8 ± 2.1 | 1.95 ± 0.51 | 0.73 ± 0.13
32 | 16384 | 16 | 31.83 | 40.40 ± 1.07 | 17.45 ± 2.90 | 2.58 ± 0.48 | 0.41 ± 0.18
16 | 16384 | 8 | 5.15 | 144.55 ± 3.74 | 20.83 ± 3.61 | 1.73 ± 0.43 | 0.54 ± 0.18
8 | 16384 | 4 | 1.14 | 201.92 ± 3.97 | 23.07 ± 3.99 | 1.17 ± 0.36 | 0.65 ± 0.16
8 | 256 | 4 | 1.49 | 194.20 ± 3.87 | 22.35 ± 3.81 | 1.26 ± 0.37 | 0.62 ± 0.16
4 | 8192 | 3 | 0.58 | 224.78 ± 5.35 | 27.43 ± 4.26 | 0.53 ± 0.21 | 0.82 ± 0.10
4 | 8192 | 3 | 1.06 | 221.94 ± 4.58 | 25.21 ± 4.17 | 0.72 ± 0.26 | 0.76 ± 0.12
4 | 256 | 3 | 0.47 | 223.81 ± 4.58 | 26.43 ± 4.22 | 0.62 ± 0.24 | 0.80 ± 0.11
2 | 2048 | 2 | 0.16 | 232.75 ± 5.09 | 30.85 ± 4.12 | 0.27 ± 0.12 | 0.91 ± 0.05
2 | 64 | 2 | 0.40 | 226.62 ± 4.83 | 29.13 ± 3.46 | 0.38 ± 0.13 | 0.90 ± 0.05
32 | KL | 64 | 2.04 | 189.53 ± 3.68 | 22.27 ± 3.93 | 1.41 ± 0.40 | 0.61 ± 0.17
32 | KL | 16 | 7.3 | 132.75 ± 2.71 | 20.38 ± 3.56 | 1.88 ± 0.45 | 0.53 ± 0.18
16 | KL | 16 | 0.87 | 210.31 ± 3.97 | 24.08 ± 4.22 | 1.07 ± 0.36 | 0.68 ± 0.15
16 | KL | 8 | 2.63 | 178.68 ± 4.08 | 21.94 ± 3.92 | 1.49 ± 0.42 | 0.59 ± 0.17
8 | KL | 4 | 0.90 | 209.90 ± 4.92 | 24.19 ± 4.19 | 1.02 ± 0.35 | 0.69 ± 0.15
4 | KL | 3 | 0.27 | 227.57 ± 4.89 | 27.53 ± 4.54 | 0.55 ± 0.24 | 0.82 ± 0.11
2 | KL | 2 | 0.086 | 232.66 ± 5.16 | 32.47 ± 4.19 | 0.20 ± 0.09 | 0.93 ± 0.04

Table 8: Complete autoencoder zoo trained on OpenImages, evaluated on ImageNet-Val. † denotes an attention-free autoencoder.

D.3 Layout-to-Image Synthesis

Layout-to-image synthesis on the COCO dataset
Figure 16: More samples from our best model for layout-to-image synthesis, LDM-4, which was trained on the OpenImages dataset and finetuned on the COCO dataset. Samples generated with 100 DDIM steps and $\eta = 0$. Layouts are from the COCO validation set.

Method | FID↓ (COCO 256×256) | FID↓ (OpenImages 256×256) | FID↓ (OpenImages 512×512)
LostGAN-V2 [87] | 42.55 | - | -
OC-GAN [89] | 41.65 | - | -
SPADE [62] | 41.11 | - | -
VQGAN+T [37] | 56.58 | 45.33 | 48.11
LDM-8 (100 steps, ours) | 42.06 | - | -
LDM-4 (200 steps, ours) | 40.91 | 32.02 | 35.80

Table 9: Quantitative comparison of our layout-to-image models on the COCO [4] and OpenImages [49] datasets. Our COCO numbers are obtained either by training from scratch on COCO or by finetuning from OpenImages (see Sec. D.3).

Here we provide the quantitative evaluation and additional samples for our layout-to-image models from Sec. 4.3.1. We train a model on the COCO [4] and one on the OpenImages [49] dataset, which we subsequently additionally finetune on COCO. Tab. 9 shows the results. Our COCO model reaches the performance of recent state-of-the-art models in layout-to-image synthesis when following their training and evaluation protocol [89]. When finetuning from the OpenImages model, we surpass these works. Our OpenImages model surpasses the results of Jahn et al. [37] by a margin of nearly 11 in terms of FID. In Fig. 16 we show additional samples of the model finetuned on COCO.

D.4 Class-Conditional Image Synthesis on ImageNet

Tab. 10 contains the results for our class-conditional LDM measured in FID and Inception score (IS). LDM-8 requires significantly fewer parameters and less compute (see Tab. 18) to achieve very competitive performance. Similar to previous work, we can further boost the performance by training a classifier on each noise scale and guiding with it, see Sec. C. Unlike the pixel-based methods, this classifier is trained very cheaply in latent space. For additional qualitative results, see Fig. 26 and Fig. 27.

Method | FID↓ | IS↑ | Precision↑ | Recall↑ | N_params | Notes
SR3 [72] | 11.30 | - | - | - | 625M | -
ImageBART [21] | 21.19 | - | - | - | 3.5B | -
ImageBART [21] | 7.44 | - | - | - | 3.5B | 0.05 acc. rate
VQGAN+T [23] | 17.04 | 70.6 ± 1.8 | - | - | 1.3B | -
VQGAN+T [23] | 5.88 | 304.8 ± 3.6 | - | - | 1.3B | 0.05 acc. rate
BigGAN-deep [3] | 6.95 | 203.6 ± 2.6 | 0.87 | 0.28 | 340M | -
ADM [15] | 10.94 | 100.98 | 0.69 | 0.63 | 554M | 250 DDIM steps
ADM-G [15] | 4.59 | 186.7 | 0.82 | 0.52 | 608M | 250 DDIM steps
ADM-G, ADM-U [15] | 3.85 | 221.72 | 0.84 | 0.53 | n/a | 2 × 250 DDIM steps
CDM [31] | 4.88 | 158.71 ± 2.26 | - | - | n/a | 2 × 100 DDIM steps
LDM-8 (ours) | 17.41 | 72.92 ± 2.6 | 0.65 | 0.62 | 395M | 200 DDIM steps, 2.9M train steps, batch size 64
LDM-8-G (ours) | 8.11 | 190.43 ± 2.60 | 0.83 | 0.36 | 506M | 200 DDIM steps, classifier scale 10, 2.9M train steps, batch size 64
LDM-8 (ours) | 15.51 | 79.03 ± 1.03 | 0.65 | 0.63 | 395M | 200 DDIM steps, 4.8M train steps, batch size 64
LDM-8-G (ours) | 7.76 | 209.52 ± 4.24 | 0.84 | 0.35 | 506M | 200 DDIM steps, classifier scale 10, 4.8M train steps, batch size 64
LDM-4 (ours) | 10.56 | 103.49 ± 1.24 | 0.71 | 0.62 | 400M | 250 DDIM steps, 178K train steps, batch size 1200
LDM-4-G (ours) | 3.95 | 178.22 ± 2.43 | 0.81 | 0.55 | 400M | 250 DDIM steps, unconditional guidance [32] scale 1.25, 178K train steps, batch size 1200
LDM-4-G (ours) | 3.60 | 247.67 ± 5.59 | 0.87 | 0.48 | 400M | 250 DDIM steps, unconditional guidance [32] scale 1.5, 178K train steps, batch size 1200

Table 10: Comparison of a class-conditional ImageNet LDM with recent state-of-the-art methods for class-conditional image generation on the ImageNet [12] dataset. Entries with an acceptance rate ('acc. rate') use classifier rejection sampling with the given rate, as proposed in [67].

D.5 Sample Quality vs. V100 Days (Continued from Sec. 4.1)

Figure 17: For completeness we also report the training progress of class-conditional LDMs on the ImageNet dataset for a fixed number of 35 V100 days. Results obtained with 100 DDIM steps [84] and $\kappa = 0$. FIDs computed on 5000 samples for efficiency reasons.

For the assessment of sample quality over the training progress in Sec. 4.1, we reported FID and IS scores as a function of train steps. Another possibility is to report these metrics as a function of the used compute resources, measured in V100 days. Such an analysis is additionally provided in Fig. 17, showing qualitatively similar results.

D.6 Super-Resolution

Method | FID↓ | IS↑ | PSNR↑ | SSIM↑
Image Regression [72] | 15.2 | 121.1 | 27.9 | 0.801
SR3 [72] | 5.2 | 180.1 | 26.4 | 0.762
LDM-4 (ours, 100 steps) | 2.8 / 4.8 | 166.3 | 24.4 ± 3.8 | 0.69 ± 0.14
LDM-4 (ours, 50 steps, guiding) | 4.4 / 6.4 | 153.7 | 25.8 ± 3.7 | 0.74 ± 0.12
LDM-4 (ours, 100 steps, guiding) | 4.4 / 6.4 | 154.1 | 25.7 ± 3.7 | 0.73 ± 0.12
LDM-4 (ours, 100 steps, +15 ep.) | 2.6 / 4.6 | 169.76 ± 5.03 | 24.4 ± 3.8 | 0.69 ± 0.14
Pixel-DM (100 steps, +15 ep.) | 5.1 / 7.1 | 163.06 ± 4.67 | 24.1 ± 3.3 | 0.59 ± 0.12

Table 11: ×4 upscaling results on ImageNet-Val. ($256^2$). FID values are given as a pair: features computed on the validation split / features computed on the train split. We also include a pixel-space baseline that receives the same amount of compute as LDM-4. The last two rows received 15 epochs of additional training compared to the former results.

For better comparability between LDMs and diffusion models in pixel space, we extend our analysis from Tab. 5 by comparing a diffusion model trained for the same number of steps and with a comparable number of parameters to our LDM (it is not possible to exactly match both architectures since the diffusion model operates in pixel space). The results of this comparison are shown in the last two rows of Tab. 11 and demonstrate that the LDM achieves better performance while allowing for significantly faster sampling. A qualitative comparison is given in Fig. 20, which shows random samples from both the LDM and the diffusion model in pixel space.

D.6.1 LDM-BSR: General Purpose SR Model via Diverse Image Degradation

bicubic | LDM-SR | LDM-BSR
Figure 18: LDM-BSR generalizes to arbitrary inputs and can be used as a general-purpose upsampler, upscaling samples from a class-conditional LDM (image cf. Fig. 4) to $1024^2$ resolution. In contrast, using a fixed degradation process (see Sec. 4.4) hinders generalization.

To evaluate the generalization of our LDM-SR, we apply it both to synthetic LDM samples from a class-conditional ImageNet model (Sec. 4.1) and to images crawled from the internet. Interestingly, we observe that LDM-SR, trained only with a bicubicly downsampled conditioning as in [72], does not generalize well to images which do not follow this pre-processing. Hence, to obtain a super-resolution model for a wide range of real-world images, which can contain complex superpositions of camera noise, compression artifacts, blur and interpolations, we replace the bicubic downsampling operation in LDM-SR with the degradation pipeline from [105]. The BSR-degradation process is a degradation pipeline which applies JPEG compression noise, camera sensor noise, different image interpolations for downsampling, Gaussian blur kernels and Gaussian noise in a random order to an image. We found that using the BSR-degradation process with the original parameters as in [105] leads to a very strong degradation. Since a more moderate degradation process seemed appropriate for our application, we adapted the parameters of the BSR-degradation (our adapted degradation process can be found in our code base at https://github.com/CompVis/latent-diffusion). Fig. 18 illustrates the effectiveness of this approach by directly comparing LDM-SR with LDM-BSR. The latter produces images much sharper than the models confined to a fixed pre-processing, making it suitable for real-world applications. Further results of LDM-BSR are shown on LSUN-Cows in Fig. 19.
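A simplified sketch of such a randomized degradation pipeline; the operations mirror the description above, but all parameter ranges and the exact set of operations are illustrative assumptions (the adapted parameters are in our code base):

import io
import random
import numpy as np
from PIL import Image, ImageFilter

def random_degrade(img, scale=4):
    # img: RGB PIL image; apply blur, noise, downsampling and JPEG compression in random order
    ops = ["blur", "noise", "resize", "jpeg"]
    random.shuffle(ops)
    for op in ops:
        if op == "blur":
            img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.2, 2.0)))
        elif op == "noise":
            arr = np.asarray(img).astype(np.float32)
            arr += np.random.normal(0.0, random.uniform(1.0, 10.0), arr.shape)
            img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
        elif op == "resize":
            w, h = img.size
            resample = random.choice([Image.NEAREST, Image.BILINEAR, Image.BICUBIC])
            img = img.resize((w // scale, h // scale), resample=resample)
        elif op == "jpeg":
            buf = io.BytesIO()
            img.save(buf, format="JPEG", quality=random.randint(30, 95))
            img = Image.open(io.BytesIO(buf.getvalue())).convert("RGB")
    return img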

bicubic | LDM-BSR
Figure 19: LDM-BSR generalizes to arbitrary inputs and can be used as a general-purpose upsampler, upscaling samples from the LSUN-Cows dataset to $1024^2$ resolution.
input | GT | Pixel Baseline #1 | Pixel Baseline #2 | LDM #1 | LDM #2
Figure 20: Qualitative super-resolution comparison of two random samples from LDM-SR and the baseline diffusion model in pixel space. Evaluated on the ImageNet validation set after the same number of training steps.

Appendix E Implementation Details and Hyperparameters

E.1 Hyperparameters

We provide an overview of the hyperparameters of all trained LDM models in Tab. 12, Tab. 13, Tab. 14 and Tab. 15.

 | CelebA-HQ 256×256 | FFHQ 256×256 | LSUN-Churches 256×256 | LSUN-Bedrooms 256×256
f | 4 | 4 | 8 | 4
z-shape | 64×64×3 | 64×64×3 | - | 64×64×3
|Z| | 8192 | 8192 | - | 8192
Diffusion steps | 1000 | 1000 | 1000 | 1000
Noise Schedule | linear | linear | linear | linear
N_params | 274M | 274M | 294M | 274M
Channels | 224 | 224 | 192 | 224
Depth | 2 | 2 | 2 | 2
Channel Multiplier | 1,2,3,4 | 1,2,3,4 | 1,2,2,4,4 | 1,2,3,4
Attention resolutions | 32, 16, 8 | 32, 16, 8 | 32, 16, 8, 4 | 32, 16, 8
Head Channels | 32 | 32 | 24 | 32
Batch Size | 48 | 42 | 96 | 48
Iterations | 410k | 635k | 500k | 1.9M
Learning Rate | 9.6e-5 | 8.4e-5 | 5.0e-5 | 9.6e-5

Table 12: Hyperparameters for the unconditional LDMs producing the numbers shown in Tab. 1. All models trained on a single NVIDIA A100.

 | LDM-1 | LDM-2 | LDM-4 | LDM-8 | LDM-16 | LDM-32
z-shape | 256×256×3 | 128×128×2 | 64×64×3 | 32×32×4 | 16×16×8 | 8×8×32
|Z| | - | 2048 | 8192 | 16384 | 16384 | 16384
Diffusion steps | 1000 | 1000 | 1000 | 1000 | 1000 | 1000
Noise Schedule | linear | linear | linear | linear | linear | linear
Model Size | 396M | 391M | 391M | 395M | 395M | 395M
Channels | 192 | 192 | 192 | 256 | 256 | 256
Depth | 2 | 2 | 2 | 2 | 2 | 2
Channel Multiplier | 1,1,2,2,4,4 | 1,2,2,4,4 | 1,2,3,5 | 1,2,4 | 1,2,4 | 1,2,4
Number of Heads | 1 | 1 | 1 | 1 | 1 | 1
Batch Size | 7 | 9 | 40 | 64 | 112 | 112
Iterations | 2M | 2M | 2M | 2M | 2M | 2M
Learning Rate | 4.9e-5 | 6.3e-5 | 8e-5 | 6.4e-5 | 4.5e-5 | 4.5e-5
Conditioning | CA | CA | CA | CA | CA | CA
CA-resolutions | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | 16, 8, 4 | 8, 4, 2
Embedding Dimension | 512 | 512 | 512 | 512 | 512 | 512
Transformer Depth | 1 | 1 | 1 | 1 | 1 | 1

Table 13: Hyperparameters for the conditional LDMs trained on the ImageNet dataset for the analysis in Sec. 4.1. All models trained on a single NVIDIA A100.

 | LDM-1 | LDM-2 | LDM-4 | LDM-8 | LDM-16 | LDM-32
z-shape | 256×256×3 | 128×128×2 | 64×64×3 | 32×32×4 | 16×16×8 | 8×8×32
|Z| | - | 2048 | 8192 | 16384 | 16384 | 16384
Diffusion steps | 1000 | 1000 | 1000 | 1000 | 1000 | 1000
Noise Schedule | linear | linear | linear | linear | linear | linear
Model Size | 270M | 265M | 274M | 258M | 260M | 258M
Channels | 192 | 192 | 224 | 256 | 256 | 256
Depth | 2 | 2 | 2 | 2 | 2 | 2
Channel Multiplier | 1,1,2,2,4,4 | 1,2,2,4,4 | 1,2,3,4 | 1,2,4 | 1,2,4 | 1,2,4
Attention resolutions | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | 16, 8, 4 | 8, 4, 2
Head Channels | 32 | 32 | 32 | 32 | 32 | 32
Batch Size | 9 | 11 | 48 | 96 | 128 | 128
Iterations | 500k | 500k | 500k | 500k | 500k | 500k
Learning Rate | 9e-5 | 1.1e-4 | 9.6e-5 | 9.6e-5 | 1.3e-4 | 1.3e-4

Table 14: Hyperparameters for the unconditional LDMs trained on the CelebA dataset for the analysis in Fig. 7. All models trained on a single NVIDIA A100 for 500k iterations; if a model converged earlier, we used the best checkpoint for assessing the provided FID scores.

Task | Text-to-Image | Layout-to-Image | Layout-to-Image | Class-Label-to-Image | Super Resolution | Inpainting | Semantic-Map-to-Image
Dataset | LAION | OpenImages | COCO | ImageNet | ImageNet | Places | Landscapes
f | 8 | 4 | 8 | 4 | 4 | 4 | 8
z-shape | 32×32×4 | 64×64×3 | 32×32×4 | 64×64×3 | 64×64×3 | 64×64×3 | 32×32×4
|Z| | - | 8192 | 16384 | 8192 | 8192 | 8192 | 16384
Diffusion steps | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000
Noise Schedule | linear | linear | linear | linear | linear | linear | linear
Model Size | 1.45B | 306M | 345M | 395M | 169M | 215M | 215M
Channels | 320 | 128 | 192 | 192 | 160 | 128 | 128
Depth | 2 | 2 | 2 | 2 | 2 | 2 | 2
Channel Multiplier | 1,2,4,4 | 1,2,3,4 | 1,2,4 | 1,2,3,5 | 1,2,2,4 | 1,4,8 | 1,4,8
Number of Heads | 8 | 1 | 1 | 1 | 1 | 1 | 1
Dropout | - | - | 0.1 | - | - | - | -
Batch Size | 680 | 24 | 48 | 1200 | 64 | 128 | 48
Iterations | 390K | 4.4M | 170K | 178K | 860K | 360K | 360K
Learning Rate | 1.0e-4 | 4.8e-5 | 4.8e-5 | 1.0e-4 | 6.4e-5 | 1.0e-6 | 4.8e-5
Conditioning | CA | CA | CA | CA | concat | concat | concat
(C)A-resolutions | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | 32, 16, 8 | - | - | -
Embedding Dimension | 1280 | 512 | 512 | 512 | - | - | -
Transformer Depth | 1 | 3 | 2 | 1 | - | - | -

Table 15: Hyperparameters for the conditional LDMs from Sec. 4. All models trained on a single NVIDIA A100 except for the inpainting model which was trained on eight V100.

E.2 Implementation Details

E.2.1 Implementations of $\tau_\theta$ for conditional LDMs

For the experiments on text-to-image and layout-to-image (Sec. 4.3.1) synthesis, we implement the conditioner $\tau_\theta$ as an unmasked transformer which processes a tokenized version of the input $y$ and produces an output $\zeta := \tau_\theta(y)$, where $\zeta \in \mathbb{R}^{M \times d_\tau}$. More specifically, the transformer is implemented from $N$ transformer blocks consisting of global self-attention layers, layer normalization and position-wise MLPs as follows (adapted from https://github.com/lucidrains/x-transformers):

$\zeta \leftarrow \mathrm{TokEmb}(y) + \mathrm{PosEmb}(y)$   (18)
for $i = 1, \dots, N$:
  $\zeta_1 \leftarrow \mathrm{LayerNorm}(\zeta)$   (19)
  $\zeta_2 \leftarrow \mathrm{MultiHeadSelfAttention}(\zeta_1) + \zeta$   (20)
  $\zeta_3 \leftarrow \mathrm{LayerNorm}(\zeta_2)$   (21)
  $\zeta \leftarrow \mathrm{MLP}(\zeta_3) + \zeta_2$   (22)
$\zeta \leftarrow \mathrm{LayerNorm}(\zeta)$   (23)

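A minimal PyTorch sketch of Eqs. (18)-(23); class names, the MLP width and the number of attention heads are illustrative assumptions, and the defaults correspond to the text-to-image configuration in Tab. 17:

import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    # one block of Eqs. (19)-(22): pre-norm self-attention and MLP, each with a residual
    def __init__(self, dim, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):
        z1 = self.norm1(z)
        z2 = self.attn(z1, z1, z1)[0] + z      # Eq. (20)
        z3 = self.norm2(z2)
        return self.mlp(z3) + z2               # Eq. (22)

class TauTheta(nn.Module):
    # Eq. (18): token + positional embeddings; Eq. (23): final LayerNorm
    def __init__(self, vocab_size, max_len, dim=1280, depth=32, n_heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        self.blocks = nn.ModuleList([TransformerEncoderBlock(dim, n_heads) for _ in range(depth)])
        self.norm = nn.LayerNorm(dim)

    def forward(self, y):                      # y: (batch, seq_len) token ids
        z = self.tok_emb(y) + self.pos_emb[:, : y.shape[1]]
        for block in self.blocks:
            z = block(z)
        return self.norm(z)                    # zeta in R^{M x d_tau}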
With $\zeta$ available, the conditioning is mapped into the UNet via the cross-attention mechanism as depicted in Fig. 3. We modify the “ablated UNet” [15] architecture and replace the self-attention layer with a shallow (unmasked) transformer consisting of $T$ blocks with alternating layers of (i) self-attention, (ii) a position-wise MLP and (iii) a cross-attention layer; see Tab. 16. Note that without (ii) and (iii), this architecture is equivalent to the “ablated UNet”.

While it would be possible to increase the representational power of $\tau_\theta$ by additionally conditioning on the time step $t$, we do not pursue this choice as it reduces the speed of inference. We leave a more detailed analysis of this modification to future work.

For the text-to-image model, we rely on a publicly available tokenizer [99] (https://huggingface.co/transformers/model_doc/bert.html#berttokenizerfast). The layout-to-image model discretizes the spatial locations of the bounding boxes and encodes each box as a $(l, b, c)$-tuple, where $l$ denotes the (discrete) top-left and $b$ the bottom-right position. Class information is contained in $c$.
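A hedged sketch of such a discretization (the number of spatial bins and the exact tuple layout are illustrative assumptions, not the exact encoding used for our models):

def encode_box(box, num_bins=256):
    # box: (x0, y0, x1, y1, c) with normalized corner coordinates and class label c
    x0, y0, x1, y1, c = box
    to_bin = lambda v: min(int(v * num_bins), num_bins - 1)
    l = to_bin(y0) * num_bins + to_bin(x0)   # discrete top-left position
    b = to_bin(y1) * num_bins + to_bin(x1)   # discrete bottom-right position
    return l, b, c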

See Tab. 17 for the hyperparameters of $\tau_\theta$ and Tab. 13 for those of the UNet for both of the above tasks.

Note that the class-conditional model as described in Sec. 4.1 is also implemented via cross-attention, where $\tau_\theta$ is a single learnable embedding layer with a dimensionality of 512, mapping classes $y$ to $\zeta \in \mathbb{R}^{1 \times 512}$.
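In code this is just a single embedding table (a sketch; 1000 is the number of ImageNet classes):

import torch.nn as nn

# tau_theta for the class-conditional model: maps a class index y to zeta in R^{1 x 512}
class_embedding = nn.Embedding(num_embeddings=1000, embedding_dim=512)
# y: LongTensor of class labels with shape (batch,)
# zeta = class_embedding(y).unsqueeze(1)  ->  shape (batch, 1, 512), consumed by cross-attention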

input: $\mathbb{R}^{h \times w \times c}$
LayerNorm: $\mathbb{R}^{h \times w \times c}$
Conv1x1: $\mathbb{R}^{h \times w \times d \cdot n_h}$
Reshape: $\mathbb{R}^{h \cdot w \times d \cdot n_h}$
$\times T$ { SelfAttention, MLP, CrossAttention }: $\mathbb{R}^{h \cdot w \times d \cdot n_h}$
Reshape: $\mathbb{R}^{h \times w \times d \cdot n_h}$
Conv1x1: $\mathbb{R}^{h \times w \times c}$

Table 16: Architecture of a transformer block as described in Sec. E.2.1, replacing the self-attention layer of the standard “ablated UNet” architecture [15]. Here, $n_h$ denotes the number of attention heads and $d$ the dimensionality per head.
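A corresponding PyTorch sketch of this block (a simplified sketch: the normalization, MLP width and projection details are assumptions; the cross-attention attends from the flattened feature map to $\zeta$):

import torch.nn as nn

class UNetTransformerBlock(nn.Module):
    # Tab. 16: norm, 1x1 conv in, T x (self-attention, MLP, cross-attention), 1x1 conv out
    def __init__(self, c, d, n_heads, depth_T, d_cond):
        super().__init__()
        inner = d * n_heads
        self.norm = nn.GroupNorm(1, c)  # stand-in for the LayerNorm over (h, w, c) in Tab. 16
        self.proj_in = nn.Conv2d(c, inner, kernel_size=1)
        self.self_attn = nn.ModuleList([nn.MultiheadAttention(inner, n_heads, batch_first=True) for _ in range(depth_T)])
        self.mlps = nn.ModuleList([nn.Sequential(nn.Linear(inner, 4 * inner), nn.GELU(), nn.Linear(4 * inner, inner)) for _ in range(depth_T)])
        self.cross_attn = nn.ModuleList([nn.MultiheadAttention(inner, n_heads, batch_first=True, kdim=d_cond, vdim=d_cond) for _ in range(depth_T)])
        self.proj_out = nn.Conv2d(inner, c, kernel_size=1)

    def forward(self, x, zeta):                 # x: (b, c, h, w), zeta: (b, M, d_cond)
        b, c, h, w = x.shape
        z = self.proj_in(self.norm(x)).flatten(2).transpose(1, 2)   # (b, h*w, d*n_h)
        for sa, mlp, ca in zip(self.self_attn, self.mlps, self.cross_attn):
            z = sa(z, z, z)[0] + z
            z = mlp(z) + z
            z = ca(z, zeta, zeta)[0] + z
        z = z.transpose(1, 2).reshape(b, -1, h, w)
        return self.proj_out(z)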

 | Text-to-Image | Layout-to-Image
seq-length | 77 | 92
depth $N$ | 32 | 16
dim | 1280 | 512

Table 17: Hyperparameters for the experiments with transformer encoders in Sec. 4.3.

E.2.2 Inpainting

input | GT | LaMa [88] | LDM #1 | LDM #2 | LDM #3
Figure 21: Qualitative results on image inpainting. In contrast to [88], our generative approach enables generation of multiple diverse samples for a given input.
input | result | input | result
Figure 22: More qualitative results on object removal as in Fig. 11.

For our experiments on image inpainting in Sec. 4.5, we used the code of [88] to generate synthetic masks. We use a fixed set of 2k validation and 30k testing samples from Places [108]. During training, we use random crops of size $256 \times 256$ and evaluate on crops of size $512 \times 512$. This follows the training and testing protocol in [88] and reproduces their reported metrics (see Tab. 7). We include additional qualitative results of LDM-4, w/ attn in Fig. 21 and of LDM-4, w/o attn, big, w/ ft in Fig. 22.

E.3 Evaluation Details

This section provides additional details on evaluation for the experiments shown in Sec. 4.

E.3.1 Quantitative Results in Unconditional and Class-Conditional Image Synthesis

We follow common practice and estimate the statistics for calculating the FID, Precision and Recall scores [29, 50] shown in Tab. 1 and 10 based on 50k samples from our models and the entire training set of each of the shown datasets. For calculating FID scores we use the torch-fidelity package [60]. However, since different data processing pipelines might lead to different results [64], we also evaluate our models with the script provided by Dhariwal and Nichol [15]. We find that the results mainly coincide, except for the ImageNet and LSUN-Bedrooms datasets, where we notice slightly varying scores of 7.76 (torch-fidelity) vs. 7.77 (Nichol and Dhariwal) and 2.95 vs. 3.0. For the future we emphasize the importance of a unified procedure for sample quality assessment. Precision and Recall are also computed by using the script provided by Nichol and Dhariwal.
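For reference, a typical FID evaluation with the torch-fidelity package [60] looks roughly as follows (the directory paths are placeholders; the exact options we used may differ):

import torch_fidelity

metrics = torch_fidelity.calculate_metrics(
    input1="samples_50k/",   # directory with 50k generated samples
    input2="train_set/",     # directory with the reference (training) images
    cuda=True,
    fid=True,
    isc=True,
)
print(metrics)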

E.3.2 Text-to-Image Synthesis

Following the evaluation protocol of [66] we compute FID and Inception Score for the Text-to-Image models from Tab. 2 by comparing generated samples with 30000 samples from the validation set of the MS-COCO dataset [51]. FID and Inception Scores are computed with torch-fidelity.

E.3.3 Layout-to-Image Synthesis

For assessing the sample quality of our layout-to-image models from Tab. 9 on the COCO dataset, we follow common practice [89, 37, 87] and compute FID scores on the 2048 unaugmented examples of the COCO Segmentation Challenge split. To obtain better comparability, we use the exact same samples as in [37]. For the OpenImages dataset we similarly follow their protocol and use 2048 center-cropped test images from the validation set.

E.3.4 Super Resolution

We evaluate the super-resolution models on ImageNet following the pipeline suggested in [72], i.e. images with a shorter side of less than 256 px are removed (both for training and evaluation). On ImageNet, the low-resolution images are produced using bicubic interpolation with anti-aliasing. FIDs are evaluated using torch-fidelity [60], and we produce samples on the validation split. For FID scores, we additionally compare to reference features computed on the train split, see Tab. 5 and Tab. 11.
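A hedged sketch of producing the low-resolution inputs in this way, using torchvision (the exact preprocessing parameters are assumptions):

import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def make_lowres(img, factor=4):
    # img: image tensor of shape (C, H, W); bicubic downsampling with anti-aliasing
    h, w = img.shape[-2:]
    return TF.resize(img, [h // factor, w // factor],
                     interpolation=InterpolationMode.BICUBIC, antialias=True)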

E.3.5 Efficiency Analysis

For efficiency reasons we compute the sample quality metrics plotted in Fig. 6, 17 and 7 based on 5k samples. Therefore, the results might vary from those shown in Tab. 1 and 10. All models have a comparable number of parameters as provided in Tab. 13 and 14. We maximize the learning rates of the individual models such that they still train stably. Therefore, the learning rates vary slightly between different runs, cf. Tab. 13 and 14.

E.3.6 User Study

For the results of the user study presented in Tab. 4 we followed the protocol of [72] and use the two-alternative forced-choice paradigm to assess human preference scores for two distinct tasks. In Task 1, subjects were shown a low-resolution/masked image between the corresponding ground-truth high-resolution/unmasked version and a synthesized image, which was generated by using the middle image as conditioning. For super-resolution, subjects were asked: 'Which of the two images is a better high quality version of the low resolution image in the middle?'. For inpainting we asked: 'Which of the two images contains more realistic inpainted regions of the image in the middle?'. In Task 2, humans were similarly shown the low-res/masked version and asked for preference between two corresponding images generated by the two competing methods. As in [72], humans viewed the images for 3 seconds before responding.

Appendix F Computational Requirements

Method | Compute (Generator) | Compute (Classifier) | Compute (Overall) | Inference Throughput | N_params | FID↓ | IS↑ | Precision↑ | Recall↑
LSUN-Churches 256²:
StyleGAN2 [42] | 64 | - | 64 | - | 59M | 3.86 | - | - | -
LDM-8 (ours, 100 steps, 410K) | 18 | - | 18 | 6.80 | 256M | 4.02 | - | 0.64 | 0.52
LSUN-Bedrooms 256²:
ADM [15] (1000 steps) | 232 | - | 232 | 0.03 | 552M | 1.9 | - | 0.66 | 0.51
LDM-4 (ours, 200 steps, 1.9M) | 60 | - | 55 | 1.07 | 274M | 2.95 | - | 0.66 | 0.48
CelebA-HQ 256²:
LDM-4 (ours, 500 steps, 410K) | 14.4 | - | 14.4 | 0.43 | 274M | 5.11 | - | 0.72 | 0.49
FFHQ 256²:
StyleGAN2 [42] | 32.13 | - | 32.13 | - | 59M | 3.8 | - | - | -
LDM-4 (ours, 200 steps, 635K) | 26 | - | 26 | 1.07 | 274M | 4.98 | - | 0.73 | 0.50
ImageNet 256²:
VQGAN-f-4 (ours, first stage) | 29 | - | 29 | - | 55M | 0.58†† | - | - | -
VQGAN-f-8 (ours, first stage) | 66 | - | 66 | - | 68M | 1.14†† | - | - | -
BigGAN-deep [3] | 128-256 | - | 128-256 | - | 340M | 6.95 | 203.6 ± 2.6 | 0.87 | 0.28
ADM [15] (250 steps) | 916 | - | 916 | 0.12 | 554M | 10.94 | 100.98 | 0.69 | 0.63
ADM-G [15] (25 steps) | 916 | 46 | 962 | 0.7 | 608M | 5.58 | - | 0.81 | 0.49
ADM-G [15] (250 steps) | 916 | 46 | 962 | 0.07 | 608M | 4.59 | 186.7 | 0.82 | 0.52
ADM-G, ADM-U [15] (250 steps) | 329 | 30 | 349 | n/a | n/a | 3.85 | 221.72 | 0.84 | 0.53
LDM-8-G (ours, 100, 2.9M) | 79 | 12 | 91 | 1.93 | 506M | 8.11 | 190.4 ± 2.6 | 0.83 | 0.36
LDM-8 (ours, 200 ddim steps, 2.9M, batch size 64) | 79 | - | 79 | 1.9 | 395M | 17.41 | 72.92 | 0.65 | 0.62
LDM-4 (ours, 250 ddim steps, 178K, batch size 1200) | 271 | - | 271 | 0.7 | 400M | 10.56 | 103.49 ± 1.24 | 0.71 | 0.62
LDM-4-G (ours, 250 ddim steps, 178K, batch size 1200, classifier-free guidance [32] scale 1.25) | 271 | - | 271 | 0.4 | 400M | 3.95 | 178.22 ± 2.43 | 0.81 | 0.55
LDM-4-G (ours, 250 ddim steps, 178K, batch size 1200, classifier-free guidance [32] scale 1.5) | 271 | - | 271 | 0.4 | 400M | 3.60 | 247.67 ± 5.59 | 0.87 | 0.48

Table 18: Comparing compute requirements during training and inference throughput with state-of-the-art generative models. Compute during training in V100-days; numbers of competing methods taken from [15] unless stated differently. Throughput measured in samples/sec on a single NVIDIA A100; numbers taken from [15]; assumed to be trained on 25M train examples; ††: R-FID vs. ImageNet validation set.

In Tab. 18 we provide a more detailed analysis of our used compute resources and compare our best performing models on the CelebA-HQ, FFHQ, LSUN and ImageNet datasets with the recent state-of-the-art models by using their provided numbers, cf. [15]. As they report their used compute in V100 days and we train all our models on a single NVIDIA A100 GPU, we convert the A100 days to V100 days by assuming a ×2.2 speedup of the A100 over the V100 [74] (this factor corresponds to the speedup of the A100 over the V100 for a U-Net, as defined in Fig. 1 in [74]). To assess sample quality, we additionally report FID scores on the reported datasets. We closely reach the performance of state-of-the-art methods such as StyleGAN2 [42] and ADM [15] while significantly reducing the required compute resources.

Appendix G Details on Autoencoder Models

We train all our autoencoder models in an adversarial manner following [23], such that a patch-based discriminator $D_\psi$ is optimized to differentiate original images from reconstructions $\mathcal{D}(\mathcal{E}(x))$. To avoid arbitrarily scaled latent spaces, we regularize the latent $z$ to be zero centered and obtain small variance by introducing a regularizing loss term $L_{reg}$.

We investigate two different regularization methods: (i) a low-weighted Kullback-Leibler term between $q_\mathcal{E}(z|x) = \mathcal{N}(z; \mathcal{E}_\mu, \mathcal{E}_{\sigma^2})$ and a standard normal distribution $\mathcal{N}(z; 0, 1)$, as in a standard variational autoencoder [46, 69], and (ii) regularizing the latent space with a vector quantization layer by learning a codebook of $|\mathcal{Z}|$ different exemplars [96].

To obtain high-fidelity reconstructions we only use a very small regularization for both scenarios, i.e. we either weight the KL term by a factor $\sim 10^{-6}$ or choose a high codebook dimensionality $|\mathcal{Z}|$.

The full objective to train the autoencoding model $(\mathcal{E}, \mathcal{D})$ reads:

$L_{\text{Autoencoder}} = \min_{\mathcal{E},\mathcal{D}}\max_{\psi}\Big(L_{rec}(x, \mathcal{D}(\mathcal{E}(x))) - L_{adv}(\mathcal{D}(\mathcal{E}(x))) + \log D_\psi(x) + L_{reg}(x; \mathcal{E}, \mathcal{D})\Big)$   (25)
DM Training in Latent Space

Note that for training diffusion models on the learned latent space, we again distinguish two cases when learning $p(z)$ or $p(z|y)$ (Sec. 4.3): (i) For a KL-regularized latent space, we sample $z = \mathcal{E}_\mu(x) + \mathcal{E}_\sigma(x)\cdot\varepsilon =: \mathcal{E}(x)$, where $\varepsilon \sim \mathcal{N}(0,1)$. When rescaling the latent, we estimate the component-wise variance

$\hat{\sigma}^2 = \frac{1}{bchw}\sum_{b,c,h,w}\big(z^{b,c,h,w} - \hat{\mu}\big)^2$

from the first batch in the data, where $\hat{\mu} = \frac{1}{bchw}\sum_{b,c,h,w} z^{b,c,h,w}$. The output of $\mathcal{E}$ is scaled such that the rescaled latent has unit standard deviation, i.e. $z \leftarrow \frac{z}{\hat{\sigma}} = \frac{\mathcal{E}(x)}{\hat{\sigma}}$. (ii) For a VQ-regularized latent space, we extract $z$ before the quantization layer and absorb the quantization operation into the decoder, i.e. it can be interpreted as the first layer of $\mathcal{D}$.
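A minimal sketch of the rescaling in case (i); encoder is a placeholder for $\mathcal{E}$ returning mean and standard deviation, and the factor is computed once on the first training batch:

import torch

@torch.no_grad()
def estimate_scale_factor(encoder, first_batch):
    mu, sigma = encoder(first_batch)                   # E_mu(x), E_sigma(x)
    z = mu + sigma * torch.randn_like(sigma)           # z = E_mu(x) + E_sigma(x) * eps
    sigma_hat = (z - z.mean()).pow(2).mean().sqrt()    # std over batch, channels and spatial dims
    return 1.0 / sigma_hat

# afterwards every latent is rescaled as z <- scale_factor * z, so that Var(z) is approximately 1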

Appendix H Additional Qualitative Results

Finally, we provide additional qualitative results for our landscapes model (Fig. 12, 23, 24 and 25), our class-conditional ImageNet model (Fig. 26-27) and our unconditional models for the CelebA-HQ, FFHQ and LSUN datasets (Fig. 28-31). As for the inpainting model in Sec. 4.5, we also fine-tuned the semantic landscapes model from Sec. 4.3.2 directly on $512^2$ images and depict qualitative results in Fig. 12 and Fig. 23. For those of our models trained on comparably small datasets, we additionally show nearest neighbors in VGG [79] feature space for samples from our models in Fig. 32-34.

Semantic Synthesis on Flickr-Landscapes [23] ($512^2$ finetuning)
Figure 23: Convolutional samples from the semantic landscapes model as in Sec. 4.3.2, finetuned on $512^2$ images.
Figure 24: An LDM trained at $256^2$ resolution can generalize to larger resolutions for spatially conditioned tasks such as semantic synthesis of landscape images. See Sec. 4.3.2.
Semantic Synthesis on Flickr-Landscapes [23]
Figure 25: When provided with a semantic map as conditioning, our LDMs generalize to substantially larger resolutions than those seen during training. Although this model was trained on inputs of size $256^2$, it can be used to create high-resolution samples such as the ones shown here, which are of resolution $1024 \times 384$.
Random class-conditional samples on the ImageNet dataset
Figure 26: Random samples from LDM-4 trained on the ImageNet dataset. Sampled with classifier-free guidance [32] scale $s = 5.0$ and 200 DDIM steps with $\eta = 1.0$.
Random class-conditional samples on the ImageNet dataset
Figure 27: Random samples from LDM-4 trained on the ImageNet dataset. Sampled with classifier-free guidance [32] scale $s = 3.0$ and 200 DDIM steps with $\eta = 1.0$.
Random samples on the CelebA-HQ dataset
Figure 28: Random samples of our best performing model LDM-4 on the CelebA-HQ dataset. Sampled with 500 DDIM steps and $\eta = 0$ (FID = 5.15).
Random samples on the FFHQ dataset
Figure 29: Random samples of our best performing model LDM-4 on the FFHQ dataset. Sampled with 200 DDIM steps and $\eta = 1$ (FID = 4.98).
Random samples on the LSUN-Churches dataset
Figure 30: Random samples of our best performing model LDM-8 on the LSUN-Churches dataset. Sampled with 200 DDIM steps and $\eta = 0$ (FID = 4.48).
Random samples on the LSUN-Bedrooms dataset
Figure 31: Random samples of our best performing model LDM-4 on the LSUN-Bedrooms dataset. Sampled with 200 DDIM steps and $\eta = 1$ (FID = 2.95).
Nearest Neighbors on the CelebA-HQ dataset
Figure 32: Nearest neighbors of our best CelebA-HQ model, computed in the feature space of a VGG-16 [79]. The leftmost sample is from our model. The remaining samples in each row are its 10 nearest neighbors.
Nearest Neighbors on the FFHQ dataset
Figure 33: Nearest neighbors of our best FFHQ model, computed in the feature space of a VGG-16 [79]. The leftmost sample is from our model. The remaining samples in each row are its 10 nearest neighbors.
Nearest Neighbors on the LSUN-Churches dataset
Figure 34: Nearest neighbors of our best LSUN-Churches model, computed in the feature space of a VGG-16 [79]. The leftmost sample is from our model. The remaining samples in each row are its 10 nearest neighbors.