High-Resolution Image Synthesis with Latent Diffusion Models
Abstract
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond.
Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining.
However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to
sequential evaluations.
To enable DM training on limited computational resources while retaining their quality and flexibility,
we apply them in the latent space of powerful pretrained autoencoders.
In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point
between complexity reduction and detail preservation,
greatly boosting visual fidelity.
By introducing cross-attention layers into the model architecture,
we turn diffusion models into powerful and flexible
generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner.
Our latent diffusion models (LDMs) achieve new state-of-the-art scores for image
inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including
text-to-image synthesis, unconditional image generation and super-resolution, while
significantly reducing computational requirements compared to pixel-based DMs.
1 Introduction
Figure 1 (panel columns: Input, ours, DALL-E, VQGAN; metrics: PSNR, R-FID): Boosting the upper bound on achievable quality with less aggressive downsampling. Since diffusion models offer excellent inductive biases for spatial data, we do not need the heavy spatial downsampling of related generative models, but can still greatly reduce the dimensionality of the data via suitable autoencoding models, see Sec. 3. Images are from the DIV2K [1] validation set. We denote the spatial downsampling factor by f. Reconstruction FIDs [29] and PSNR are calculated on ImageNet-val [12]; see also Tab. 8.
Image synthesis is one of the computer vision fields with the most spectacular recent development, but also among those with the greatest computational demands. Especially high-resolution synthesis of complex, natural scenes is presently dominated by scaling up likelihood-based models, potentially containing billions of parameters in autoregressive (AR) transformers [66, 67].
In contrast, the promising results of GANs [27, 3, 40] have been revealed to be mostly confined to data with comparably limited variability, as their adversarial learning procedure does not easily scale to modeling complex, multi-modal distributions. Recently, diffusion models [82], which are built from a hierarchy of denoising autoencoders, have been shown to achieve impressive results in image synthesis [30, 85] and beyond [45, 7, 48, 57], and define the state-of-the-art in class-conditional image synthesis [15, 31] and super-resolution [72]. Moreover, even unconditional DMs can readily be applied to tasks such as inpainting and colorization [85] or stroke-based synthesis [53], in contrast to other types of generative models [46, 69, 19].
Being likelihood-based models, they do not exhibit mode-collapse and training instabilities as GANs do and, by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models [67].
Democratizing High-Resolution Image Synthesis
DMs belong to the class of likelihood-based models, whose mode-covering behavior makes them prone to spend excessive amounts of capacity (and thus compute resources) on modeling imperceptible details of the data [16, 73].
Although the reweighted variational objective [30]
aims to address this
by undersampling the initial denoising steps,
DMs are still computationally demanding, since training and evaluating such a model requires repeated function evaluations (and gradient computations) in the high-dimensional space of RGB images.
As an example, training the most powerful DMs often takes hundreds of GPU days (e.g. 150 - 1000 V100 days in [15])
and repeated evaluations on a noisy version of the input space also render inference expensive, so that producing 50k samples takes approximately 5 days [15] on a single A100 GPU.
This has two consequences for the research community and users in general:
Firstly, training such a model
requires massive computational resources only available to a small fraction of the field,
and leaves a huge carbon footprint [65, 86].
Secondly, evaluating an already trained model is also expensive in time and memory, since the same model architecture must
run sequentially for a large number of steps (e.g. 25 - 1000 steps in [15]).
To increase the accessibility of this powerful model class and at the same time
reduce its significant resource consumption, a method is needed that
reduces the computational complexity for both training and sampling.
Reducing the computational demands of DMs without impairing their performance is, therefore, key to enhance their accessibility.
Departure to Latent Space
Our approach starts with the analysis of already trained diffusion models in pixel space:
Fig. 2 shows the rate-distortion trade-off of a trained model. As with any likelihood-based model, learning can be roughly divided into two stages: First is a perceptual compression stage which removes high-frequency details
but still learns little semantic variation.
In the second stage, the actual generative model learns the semantic and conceptual composition of the data (semantic compression).
We thus aim to first find a perceptually equivalent, but computationally more suitable space, in which we will train diffusion models for high-resolution image synthesis.
Following common practice [96, 67, 23, 11, 66],
we separate training into two distinct phases: First, we train an autoencoder which provides a lower-dimensional (and thereby efficient) representational space which is perceptually equivalent to the data space.
Importantly, and in contrast to previous work [23, 66],
we do not need to
rely on excessive spatial compression, as we train DMs in the learned latent space, which
exhibits better scaling properties with respect to the spatial dimensionality.
The reduced complexity also provides efficient image generation from the latent space with a single network pass.
We dub the resulting model class Latent Diffusion Models (LDMs).
A notable
advantage of this approach is that we need to train the universal
autoencoding stage only once and can therefore reuse it for multiple DM
trainings or to explore possibly completely different tasks
[81].
This enables efficient exploration of a large number of diffusion models for various image-to-image and text-to-image tasks.
For the latter, we design an architecture that connects transformers to the DM’s UNet backbone [71]
and enables arbitrary types of token-based conditioning mechanisms, see Sec. 3.3.

Figure 2: Illustrating perceptual and semantic compression: Most bits of a digital image correspond to imperceptible details. While DMs allow suppressing this semantically meaningless information by minimizing the responsible loss term, gradients (during training) and the neural backbone (training and inference) still need to be evaluated on all pixels, leading to superfluous computations and unnecessarily expensive optimization and inference.
We propose latent diffusion models (LDMs) as an effective generative model and a separate mild compression stage that only eliminates imperceptible details. Data and images from [30].
In sum, our work makes the following contributions:
(i) In contrast to purely transformer-based approaches [23, 66], our method scales more gracefully to higher dimensional data and can thus (a) work on a compression level which provides more faithful and detailed reconstructions than previous work (see Fig. 1) and (b) can be efficiently applied to high-resolution synthesis of megapixel images.
(ii) We achieve competitive performance on multiple tasks (unconditional image synthesis, inpainting, stochastic super-resolution)
and datasets while significantly lowering computational costs.
Compared to pixel-based diffusion approaches, we also significantly decrease inference costs.
(iii) We show that, in contrast to previous work [93] which learns both an encoder/decoder architecture and a score-based prior simultaneously, our
approach does not require a delicate weighting of reconstruction and generative abilities.
This ensures extremely faithful reconstructions and requires very little regularization of the latent space.
(iv) We find that for densely conditioned tasks such as super-resolution, inpainting and semantic synthesis, our model can be applied in a convolutional fashion and render large, consistent images of $\sim 1024^2$ px.
(v) Moreover, we design a general-purpose conditioning mechanism based on cross-attention, enabling multi-modal training.
We use it to train class-conditional, text-to-image and layout-to-image models.
(vi) Finally, we release pretrained latent diffusion and autoencoding models at https://github.com/CompVis/latent-diffusion which might be reusable for various tasks besides the training of DMs [81].
2 Related Work
Generative Models for Image Synthesis
The high dimensional nature of images presents distinct challenges to generative modeling.
Generative Adversarial Networks (GAN) [27]
allow for efficient sampling of high resolution images with good perceptual quality [3, 42], but are difficult to optimize [54, 2, 28] and struggle to capture the full data distribution
[55].
In contrast, likelihood-based methods emphasize good density estimation which renders optimization more well-behaved.
Variational autoencoders (VAE) [46]
and flow-based models [18, 19] enable efficient synthesis of high resolution images [9, 92, 44], but sample quality is not on par with GANs.
While autoregressive models (ARM) [95, 94, 6, 10] achieve strong performance in density
estimation, computationally demanding architectures
[97] and a sequential sampling process limit them to low resolution images.
Because pixel based representations of images contain barely
perceptible, high-frequency details [16, 73], maximum-likelihood training spends a
disproportionate amount of capacity on modeling them, resulting in
long training times.
To scale to higher resolutions,
several two-stage approaches [101, 67, 23, 103]
use ARMs to model a compressed latent image space instead of raw pixels.
Recently, Diffusion Probabilistic Models (DM) [82], have achieved state-of-the-art results in density estimation [45] as well as in sample quality [15]. The generative power of these models stems from a natural fit to the inductive biases of image-like data when their underlying neural backbone is implemented as a UNet [71, 30, 85, 15].
The best synthesis quality is usually achieved when a reweighted objective [30] is used for training. In this case, the DM corresponds to a lossy compressor and allows trading image quality for compression capabilities. Evaluating and optimizing these models in pixel space, however, has the downside of low inference speed and very high training costs. While the former can be partially addressed by advanced sampling strategies [84, 75, 47] and hierarchical approaches [31, 93], training on high-resolution image data always requires calculating expensive gradients. We address both drawbacks with our proposed LDMs, which work on a compressed latent space of lower dimensionality.
This renders training computationally cheaper and speeds up inference with
almost no reduction in synthesis quality (see
Fig. 1).
Two-Stage Image Synthesis
To mitigate the shortcomings of individual generative approaches, a lot of research [11, 70, 23, 103, 101, 67] has gone into combining the strengths of different methods into more efficient and performant models via a two stage approach. VQ-VAEs [101, 67] use autoregressive models to learn an expressive prior over a discretized latent space.
[66] extend this approach to text-to-image generation by learning a joint distribution over discretized image and text representations.
More generally, [70] uses conditionally invertible networks to provide a generic transfer between latent spaces of diverse domains.
Different from VQ-VAEs, VQGANs [23, 103] employ a first stage with an adversarial and perceptual objective to scale autoregressive transformers to larger images.
However, the high compression rates required for feasible ARM training, which introduces billions of trainable parameters [66, 23], limit the overall performance of such approaches and less compression comes at the price of high computational cost [66, 23].
Our work prevents such trade-offs, as our proposed LDMs scale more gently to higher dimensional latent spaces due to their convolutional backbone. Thus, we are free to choose the level of compression which optimally mediates between learning a powerful first stage, without leaving too much perceptual compression up to the generative diffusion model while guaranteeing high-fidelity reconstructions (see Fig. 1).
While approaches to jointly [93] or separately [80]
learn an encoding/decoding model together with a score-based prior exist,
the former still require a difficult weighting between reconstruction and generative capabilities [11] and are outperformed by our approach (Sec. 4), and the latter focus on highly structured images such as human faces.
3 Method
To lower the computational demands of training diffusion models towards high-resolution image synthesis,
we observe
that although diffusion models
allow
to ignore perceptually irrelevant details by undersampling the corresponding loss terms [30],
they still require costly function evaluations in pixel space,
which causes
huge demands in computation time and energy resources.
We propose to circumvent this drawback by introducing an explicit separation
of the compressive from the generative learning phase (see
Fig. 2).
To achieve this, we utilize an autoencoding model which learns a space that is perceptually equivalent to the image space,
but offers significantly reduced computational complexity.
Such an approach offers several advantages: (i)
By leaving the high-dimensional image space, we
obtain DMs which are computationally much more efficient
because sampling is performed on a low-dimensional space.
(ii)
We exploit the inductive bias of DMs inherited from their UNet architecture
[71], which makes them particularly effective for data with spatial
structure and therefore
alleviates the need for aggressive, quality-reducing compression levels as required by previous
approaches [23, 66].
(iii)
Finally, we obtain general-purpose compression models whose latent space
can be used to train multiple generative models
and which can also be utilized
for other downstream applications such as single-image CLIP-guided synthesis [25].
3.1 Perceptual Image Compression
Our perceptual compression model
is based on previous work [23] and
consists of an autoencoder trained by combination of a perceptual
loss [106] and a patch-based [33] adversarial objective [20, 23, 103].
This ensures that the reconstructions are confined to the image manifold by enforcing local realism and avoids blurriness introduced by relying solely on pixel-space losses such as $L_2$ or $L_1$ objectives.
More precisely, given an image $x \in \mathbb{R}^{H \times W \times 3}$ in RGB space, the encoder $\mathcal{E}$ encodes $x$ into a latent representation $z = \mathcal{E}(x)$, and the decoder $\mathcal{D}$ reconstructs the image from the latent, giving $\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$, where $z \in \mathbb{R}^{h \times w \times c}$. Importantly, the encoder downsamples the image by a factor $f = H/h = W/w$, and we investigate different downsampling factors $f = 2^m$, with $m \in \mathbb{N}$.
In order to avoid arbitrarily
high-variance
latent spaces, we experiment with two different kinds of regularizations. The
first variant, KL-reg., imposes a slight KL-penalty towards a standard
normal on the learned latent, similar to
a VAE [46, 69], whereas VQ-reg.
uses a vector quantization layer [96] within
the decoder. This model can be interpreted as a VQGAN [23] but with the quantization layer
absorbed by the decoder. Because our subsequent DM is designed to work with the two-dimensional structure of our learned latent space $z = \mathcal{E}(x)$, we can use relatively mild compression rates and achieve very good reconstructions. This is in contrast to previous works [23, 66], which relied on an arbitrary 1D ordering of the learned space $z$ to model its distribution autoregressively and thereby ignored much of the inherent structure of $z$. Hence, our compression model preserves details of $x$ better (see Tab. 8).
The full objective and training details can be found in the supplement.
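As a rough illustration of this first stage, the sketch below (PyTorch; module names, channel counts and the sampling of the latent are illustrative assumptions, not the released implementation) shows an encoder that downsamples an RGB image by a factor $f = 2^m$ into a compact latent, a decoder that maps the latent back to pixel space, and the slight KL penalty of the KL-reg. variant. The perceptual and patch-based adversarial terms of the actual training objective are omitted.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, m: int = 2, base_ch: int = 64, z_ch: int = 4):
        super().__init__()
        layers, ch = [nn.Conv2d(3, base_ch, 3, padding=1)], base_ch
        for _ in range(m):  # each strided conv halves H and W, so f = 2**m
            layers += [nn.SiLU(), nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1)]
            ch *= 2
        layers += [nn.SiLU(), nn.Conv2d(ch, 2 * z_ch, 1)]  # predicts mean and log-variance
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).chunk(2, dim=1)  # (mean, logvar)

class Decoder(nn.Module):
    def __init__(self, m: int = 2, base_ch: int = 64, z_ch: int = 4):
        super().__init__()
        ch = base_ch * 2 ** m
        layers = [nn.Conv2d(z_ch, ch, 3, padding=1)]
        for _ in range(m):  # upsample back to the input resolution
            layers += [nn.SiLU(), nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1)]
            ch //= 2
        layers += [nn.SiLU(), nn.Conv2d(ch, 3, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

def kl_penalty(mean, logvar):
    # slight KL penalty towards a standard normal; weighted with a small factor in practice
    return 0.5 * torch.mean(mean.pow(2) + logvar.exp() - 1.0 - logvar)

x = torch.randn(1, 3, 256, 256)
mean, logvar = Encoder(m=2)(x)                            # 64x64 latent, i.e. f = 4
z = mean + torch.randn_like(mean) * (0.5 * logvar).exp()  # reparameterized latent sample
x_rec = Decoder(m=2)(z)                                   # back to (1, 3, 256, 256)
```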
3.2 Latent Diffusion Models
Diffusion Models [82] are probabilistic models designed to learn a data distribution $p(x)$ by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a fixed Markov Chain of length $T$. For image synthesis, the most successful models [30, 15, 72] rely on a reweighted variant of the variational lower bound on $p(x)$, which mirrors denoising score-matching [85]. These models can be interpreted as an equally weighted sequence of denoising autoencoders $\epsilon_\theta(x_t, t)$, $t = 1 \ldots T$, which are trained to predict a denoised variant of their input $x_t$, where $x_t$ is a noisy version of the input $x$. The corresponding objective can be simplified to (Sec. B)
$$L_{DM} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[\, \big\lVert \epsilon - \epsilon_\theta(x_t, t) \big\rVert_2^2 \,\Big] \qquad (1)$$
with $t$ uniformly sampled from $\{1, \ldots, T\}$.
Generative Modeling of Latent Representations
With our trained perceptual compression models consisting of $\mathcal{E}$ and $\mathcal{D}$,
we now have access to an efficient, low-dimensional latent space in which high-frequency, imperceptible details are
abstracted away.
Compared to the high-dimensional pixel space, this space is more suitable
for likelihood-based generative models, as they can now (i) focus on the important, semantic bits of the data
and (ii) train in a lower dimensional, computationally much more efficient space.
Unlike previous work that relied on autoregressive, attention-based
transformer models in a highly compressed, discrete latent space [66, 23, 103],
we can take advantage of image-specific inductive biases that our model offers.
This includes the ability to build the underlying UNet primarily from 2D convolutional layers,
and further focusing the objective on the perceptually most relevant bits using
the reweighted bound, which now reads

Figure 3: We condition LDMs either via concatenation or by a more general cross-attention mechanism. See Sec. 3.3.
$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[\, \big\lVert \epsilon - \epsilon_\theta(z_t, t) \big\rVert_2^2 \,\Big] \qquad (2)$$
The neural backbone $\epsilon_\theta(\circ, t)$ of our model is realized as a time-conditional UNet [71]. Since the forward process is fixed, $z_t$ can be efficiently obtained from $\mathcal{E}$ during training, and samples from $p(z)$ can be decoded to image space with a single pass through $\mathcal{D}$.
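A minimal sketch of this training objective (Eq. 2) is given below, assuming a frozen `encoder` implementing $\mathcal{E}$ and a time-conditional UNet `eps_model(z_t, t)`; the names and the noise schedule are illustrative placeholders rather than the released implementation.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)               # assumed fixed forward-process noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ldm_loss(eps_model, encoder, x):
    """One training step of Eq. (2): predict the noise added to a latent z = E(x)."""
    with torch.no_grad():
        z = encoder(x)                              # frozen first stage
    t = torch.randint(0, T, (z.shape[0],), device=z.device)
    eps = torch.randn_like(z)
    a_bar = alphas_cumprod.to(z.device)[t].view(-1, 1, 1, 1)
    z_t = a_bar.sqrt() * z + (1.0 - a_bar).sqrt() * eps  # closed-form forward diffusion
    return F.mse_loss(eps_model(z_t, t), eps)       # || eps - eps_theta(z_t, t) ||_2^2
```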
Figure 4 (columns: CelebA-HQ, FFHQ, LSUN-Churches, LSUN-Bedrooms, class-conditional ImageNet): Samples from LDMs trained on CelebA-HQ [39], FFHQ [41], LSUN-Churches [102], LSUN-Bedrooms [102] and class-conditional ImageNet [12]. Best viewed when zoomed in. For more samples see the appendix.
3.3 Conditioning Mechanisms
Similar to other types of generative models [56, 83], diffusion models are in principle capable of modeling conditional distributions of the form $p(z|y)$. This can be implemented with a conditional denoising autoencoder $\epsilon_\theta(z_t, t, y)$ and paves the way to controlling the synthesis process through inputs $y$ such as text [68], semantic maps [61, 33] or other image-to-image translation tasks [34].
In the context of image synthesis, however, combining the
generative power of DMs with other types of conditionings beyond class-labels [15]
or blurred variants
of the input image [72]
is so far an under-explored area of research.
We turn DMs into more flexible conditional image generators by
augmenting their underlying UNet backbone with the cross-attention mechanism [97],
which
is
effective for learning attention-based models of various input modalities [36, 35].
To pre-process $y$ from various modalities (such as language prompts) we introduce a domain specific encoder $\tau_\theta$ that projects $y$ to an intermediate representation $\tau_\theta(y) \in \mathbb{R}^{M \times d_\tau}$, which is then mapped to the intermediate layers of the UNet via a cross-attention layer implementing $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\big(\tfrac{QK^T}{\sqrt{d}}\big) \cdot V$, with $Q = W_Q^{(i)} \cdot \varphi_i(z_t)$, $K = W_K^{(i)} \cdot \tau_\theta(y)$, $V = W_V^{(i)} \cdot \tau_\theta(y)$.
Here, $\varphi_i(z_t) \in \mathbb{R}^{N \times d_\epsilon^i}$ denotes a (flattened) intermediate representation of the UNet implementing $\epsilon_\theta$, and $W_V^{(i)} \in \mathbb{R}^{d \times d_\epsilon^i}$, $W_Q^{(i)} \in \mathbb{R}^{d \times d_\tau}$ & $W_K^{(i)} \in \mathbb{R}^{d \times d_\tau}$ are learnable projection matrices [97, 36]. See Fig. 3 for a visual depiction.
Based on image-conditioning pairs, we then learn the conditional LDM via
$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[\, \big\lVert \epsilon - \epsilon_\theta\big(z_t, t, \tau_\theta(y)\big) \big\rVert_2^2 \,\Big] \qquad (3)$$
where both $\tau_\theta$ and $\epsilon_\theta$ are jointly optimized via Eq. 3. This conditioning mechanism is flexible as $\tau_\theta$ can be parameterized with domain-specific experts, e.g. (unmasked) transformers [97] when $y$ are text prompts (see Sec. 4.3.1).
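For concreteness, a single-head version of this cross-attention layer can be sketched as follows (PyTorch; dimensions, token counts and the residual wiring into the UNet are illustrative assumptions, and the actual model uses multi-head attention).

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: queries from UNet features, keys/values from tau_theta(y)."""
    def __init__(self, d_eps: int, d_tau: int, d: int = 64):
        super().__init__()
        self.d = d
        self.to_q = nn.Linear(d_eps, d, bias=False)  # W_Q^(i): projects UNet features
        self.to_k = nn.Linear(d_tau, d, bias=False)  # W_K^(i): projects conditioning tokens
        self.to_v = nn.Linear(d_tau, d, bias=False)  # W_V^(i)
        self.to_out = nn.Linear(d, d_eps)

    def forward(self, phi, tau_y):
        # phi: (B, N, d_eps) flattened spatial UNet features; tau_y: (B, M, d_tau) conditioning tokens
        q, k, v = self.to_q(phi), self.to_k(tau_y), self.to_v(tau_y)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.d), dim=-1)
        return self.to_out(attn @ v)                 # (B, N, d_eps), added back into the UNet block

phi = torch.randn(2, 16 * 16, 320)                   # illustrative flattened feature map phi_i(z_t)
tau_y = torch.randn(2, 77, 768)                      # e.g. transformer-encoded text tokens tau_theta(y)
out = CrossAttention(d_eps=320, d_tau=768)(phi, tau_y)
```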
4 Experiments
Figure 5: Samples for user-defined text prompts from our model for text-to-image synthesis, LDM-8 (KL), trained on the LAION [78] database (1.45B parameters). Prompts: 'A street sign that reads "Latent Diffusion"', 'A zombie in the style of Picasso', 'An image of an animal half mouse half octopus', 'An illustration of a slightly conscious neural network', 'A painting of a squirrel eating a burger', 'A watercolor painting of a chair that looks like an octopus', 'A shirt with the inscription: "I love generative models!"'. Samples generated with 200 DDIM steps and classifier-free guidance [32].


Figure 6: Analyzing the training of class-conditional LDMs with different downsampling factors f over 2M train steps on the ImageNet dataset. Pixel-based LDM-1 requires substantially longer training than models with larger downsampling factors (LDM-{4-16}). Too much perceptual compression, as in LDM-32, limits the overall sample quality. All models are trained on a single NVIDIA A100 with the same computational budget. Results obtained with 100 DDIM steps [84].


Figure 7: Comparing LDMs with varying compression on the CelebA-HQ (left) and ImageNet (right) datasets. Different markers along each line indicate numbers of DDIM sampling steps, from right to left. The dashed line shows the FID scores for 200 steps, indicating the strong performance of LDM-{4-8}. FID scores assessed on 5000 samples. All models were trained for 500k (CelebA) / 2M (ImageNet) steps on an A100.
LDMs provide a flexible and computationally tractable means of diffusion-based image synthesis for various image modalities, which we show empirically in the following. Firstly, however, we analyze the gains of our models compared to pixel-based diffusion models in both training and inference. Interestingly, we find that LDMs trained in VQ-regularized latent spaces sometimes achieve better sample quality, even though the reconstruction capabilities of VQ-regularized first stage models slightly fall behind those of their continuous counterparts, cf. Tab. 8. A visual comparison between the effects of first stage regularization schemes on LDM training and their generalization abilities to higher resolutions can be found in Appendix D.1. In E.2 we list details on architecture, implementation, training and evaluation for all results presented in this section.
4.1 On Perceptual Compression Tradeoffs
This section analyzes the behavior of our LDMs with different downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$ (abbreviated as LDM-$f$, where LDM-1 corresponds to pixel-based DMs).
To obtain a
comparable test-field, we fix the computational resources
to a
single NVIDIA A100 for all experiments in this section and train all models for
the same number of steps and with the same number of parameters.
Tab. 8 shows hyperparameters and reconstruction performance of the first stage models used for the LDMs compared in this section.
Fig. 6 shows sample quality as a function of training
progress for 2M steps of class-conditional models on the
ImageNet [12] dataset. We see that, i) small
downsampling factors for LDM-1,2 result in slow training
progress, whereas ii) overly large values of $f$ cause stagnating fidelity
after comparably few training steps. Revisiting the analysis above
(Fig. 1 and 2) we
attribute this to i) leaving most of perceptual compression to the diffusion
model and ii) too strong first stage compression resulting in information loss
and thus limiting the achievable quality. LDM-4-16
strike a good balance between efficiency and perceptually faithful results,
which manifests in a significant FID [29] gap of 38 between pixel-based diffusion (LDM-1) and LDM-8 after 2M training steps.
In Fig. 7, we compare models trained on CelebA-HQ [39] and ImageNet in terms of sampling speed for different numbers of denoising steps with the DDIM sampler [84] and plot it against FID scores [29]. LDM-{4-8} outperform models with unsuitable ratios of perceptual and conceptual compression.
Especially compared to
pixel-based LDM-1, they achieve much lower FID scores while
simultaneously significantly increasing sample throughput.
Complex datasets such as ImageNet require reduced compression rates to avoid
reducing quality.
In summary, LDM-4 and -8 offer the best conditions for achieving high-quality synthesis results.
CelebA-HQ

| Method | FID | Prec. | Recall |
|---|---|---|---|
| DC-VAE [63] | 15.8 | - | - |
| VQGAN+T. [23] (k=400) | 10.2 | - | - |
| PGGAN [39] | 8.0 | - | - |
| LSGM [93] | 7.22 | - | - |
| UDM [43] | 7.16 | - | - |
| LDM-4 (ours, 500-s†) | 5.11 | 0.72 | 0.49 |

FFHQ

| Method | FID | Prec. | Recall |
|---|---|---|---|
| ImageBART [21] | 9.57 | - | - |
| U-Net GAN (+aug) [77] | 10.9 (7.6) | - | - |
| UDM [43] | 5.54 | - | - |
| StyleGAN [41] | 4.16 | 0.71 | 0.46 |
| ProjectedGAN [76] | 3.08 | 0.65 | 0.46 |
| LDM-4 (ours, 200-s) | 4.98 | 0.73 | 0.50 |

LSUN-Churches

| Method | FID | Prec. | Recall |
|---|---|---|---|
| DDPM [30] | 7.89 | - | - |
| ImageBART [21] | 7.32 | - | - |
| PGGAN [39] | 6.42 | - | - |
| StyleGAN [41] | 4.21 | - | - |
| StyleGAN2 [42] | 3.86 | - | - |
| ProjectedGAN [76] | 1.59 | 0.61 | 0.44 |
| LDM-8* (ours, 200-s) | 4.02 | 0.64 | 0.52 |

LSUN-Bedrooms

| Method | FID | Prec. | Recall |
|---|---|---|---|
| ImageBART [21] | 5.51 | - | - |
| DDPM [30] | 4.9 | - | - |
| UDM [43] | 4.57 | - | - |
| StyleGAN [41] | 2.35 | 0.59 | 0.48 |
| ADM [15] | 1.90 | 0.66 | 0.51 |
| ProjectedGAN [76] | 1.52 | 0.61 | 0.34 |
| LDM-4 (ours, 200-s) | 2.95 | 0.66 | 0.48 |

Table 1: Evaluation metrics for unconditional image synthesis. CelebA-HQ results reproduced from [63, 100, 43], FFHQ results from [42, 43]. †: N-s refers to N sampling steps with the DDIM [84] sampler. *: trained in KL-regularized latent space. Additional results can be found in the supplementary.
Text-Conditional Image Synthesis

| Method | FID | IS | Nparams | Sampling |
|---|---|---|---|---|
| CogView† [17] | 27.10 | 18.20 | 4B | self-ranking, rejection rate 0.017 |
| LAFITE† [109] | 26.94 | 26.02 | 75M | - |
| GLIDE* [59] | 12.24 | - | 6B | 277 DDIM steps, c.f.g. [32] |
| Make-A-Scene* [26] | 11.84 | - | 4B | c.f.g. for AR models [98] |
| LDM-KL-8 | 23.31 | 20.03 | 1.45B | 250 DDIM steps |
| LDM-KL-8-G* | 12.63 | 30.29 | 1.45B | 250 DDIM steps, c.f.g. [32] |

Table 2: Evaluation of text-conditional image synthesis on the MS-COCO [51] dataset: with 250 DDIM [84] steps our model is on par with the most recent diffusion [59] and autoregressive [26] methods despite using significantly fewer parameters. †/*: numbers from [109]/[26].
4.2 Image Generation with Latent Diffusion
We train unconditional models of $256^2$ images on CelebA-HQ [39], FFHQ [41], LSUN-Churches and -Bedrooms [102] and evaluate i) sample quality and ii) coverage of the data manifold using FID [29] and Precision-and-Recall [50]. Tab. 1 summarizes our results. On CelebA-HQ, we report a new state-of-the-art FID of 5.11, outperforming previous likelihood-based models as well as GANs. We also outperform LSGM [93], where a latent diffusion model is trained jointly together with the first stage.
In contrast, we train diffusion models in a fixed space and avoid the
difficulty of weighing reconstruction quality against learning the prior over
the latent space, see Fig. 1-2.
We outperform prior diffusion based approaches on all but the LSUN-Bedrooms
dataset, where our score is close to
ADM [15], despite utilizing half its
parameters and requiring 4-times less train resources (see Appendix E.3.5).
Moreover, LDMs consistently improve upon GAN-based methods in Precision and Recall, thus confirming the advantages of their mode-covering likelihood-based training objective over adversarial approaches.
In Fig. 4 we also show qualitative results on each dataset.
4.3 Conditional Latent Diffusion
Figure 8: Layout-to-image synthesis with an LDM on COCO [4], see Sec. 4.3.1. Quantitative evaluation in Sec. D.3 of the supplement.
4.3.1 Transformer Encoders for LDMs
By introducing cross-attention based conditioning into LDMs
we open them up for various
conditioning modalities previously unexplored for diffusion models.
For text-to-image modeling, we train a 1.45B parameter KL-regularized LDM conditioned on language prompts on LAION-400M [78].
We employ the BERT-tokenizer [14] and implement $\tau_\theta$ as a transformer [97] to infer a latent code which
is mapped into the UNet via (multi-head) cross-attention (Sec. 3.3).
This combination of domain specific experts for learning a language representation and visual synthesis results in a powerful model, which generalizes well to complex, user-defined text prompts, cf. Fig. 8 and 5. For quantitative analysis, we follow prior work and evaluate text-to-image generation on the MS-COCO [51] validation set, where our model improves upon powerful AR [66, 17] and GAN-based [109] methods, cf. Tab. 2. We note that applying classifier-free diffusion guidance [32] greatly boosts sample quality, such that the guided LDM-KL-8-G is on par with the recent state-of-the-art AR [26] and diffusion models [59] for text-to-image synthesis, while substantially reducing parameter count.
To further analyze the flexibility of the cross-attention based conditioning mechanism we also train models to synthesize images based on semantic layouts on OpenImages [49],
and finetune on COCO [4], see Fig. 8.
See Sec. D.3 for the quantitative evaluation and implementation details.
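As a reference for the guided variants reported above (e.g. LDM-KL-8-G), classifier-free guidance [32] combines a conditional and an unconditional noise prediction at every sampling step; the sketch below uses illustrative names and an arbitrary guidance scale and is not the released sampling code.

```python
import torch

def guided_eps(eps_model, z_t, t, cond, null_cond, s=5.0):
    """Classifier-free guidance [32]: mix conditional and unconditional noise predictions.

    `eps_model`, `cond` (e.g. encoded text) and `null_cond` (the empty conditioning)
    are placeholders; the guidance scale s is a free hyperparameter.
    """
    eps_cond = eps_model(z_t, t, cond)
    eps_uncond = eps_model(z_t, t, null_cond)
    return eps_uncond + s * (eps_cond - eps_uncond)  # s = 1 recovers the conditional prediction
```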
Lastly, following prior work [15, 3, 23, 21], we evaluate our best-performing class-conditional ImageNet models with $f \in \{4, 8\}$ from Sec. 4.1 in Tab. 3, Fig. 4 and Sec. D.4. Here we outperform the state of the art diffusion model ADM [15] while significantly reducing computational requirements and parameter count, cf. Tab 18.
| Method | FID | IS | Precision | Recall | Nparams | Sampling |
|---|---|---|---|---|---|---|
| BigGan-deep [3] | 6.95 | 203.6 | 0.87 | 0.28 | 340M | - |
| ADM [15] | 10.94 | 100.98 | 0.69 | 0.63 | 554M | 250 DDIM steps |
| ADM-G [15] | 4.59 | 186.7 | 0.82 | 0.52 | 608M | 250 DDIM steps |
| LDM-4 (ours) | 10.56 | 103.49 | 0.71 | 0.62 | 400M | 250 DDIM steps |
| LDM-4-G (ours) | 3.60 | 247.67 | 0.87 | 0.48 | 400M | 250 steps, c.f.g. [32] |

Table 3: Comparison of a class-conditional ImageNet LDM with recent state-of-the-art methods for class-conditional image generation on ImageNet [12]. A more detailed comparison with additional baselines can be found in D.4, Tab. 10 and F. c.f.g. denotes classifier-free guidance with a scale s as proposed in [32].
4.3.2 Convolutional Sampling Beyond $256^2$
By concatenating spatially aligned conditioning information to the input of $\epsilon_\theta$, LDMs can serve as efficient general-purpose image-to-image translation models. We use this to train models for semantic synthesis, super-resolution (Sec. 4.4) and inpainting (Sec. 4.5). For semantic synthesis, we use images of landscapes paired with semantic maps [61, 23] and concatenate downsampled versions of the semantic maps with the latent image representation of a $f=4$ model (VQ-reg., see Tab. 8). We train on an input resolution of $256^2$ (crops from $384^2$) but find that our model generalizes to larger resolutions and can generate images up to the megapixel regime when evaluated in a convolutional manner (see Fig. 9). We exploit this behavior to also apply the super-resolution models in Sec. 4.4 and the inpainting models in Sec. 4.5 to generate large images between $512^2$ and $1024^2$. For this application, the signal-to-noise ratio (induced by the scale of the latent space) significantly affects the results. In Sec. D.1 we illustrate this when learning an LDM on (i) the latent space as provided by a $f=4$ model (KL-reg., see Tab. 8), and (ii) a rescaled version, scaled by the component-wise standard deviation.
The latter, in combination with classifier-free guidance [32], also enables the direct synthesis of $1024^2$ images for the text-conditional LDM-KL-8-G as in Fig. 13.
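A minimal sketch of this concatenation-based spatial conditioning (as opposed to cross-attention) is given below; channel counts and the semantic-map encoding are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def conditioned_unet_input(z_t, cond_map, latent_size):
    """Concatenation-based conditioning: resize a spatially aligned map and stack it onto z_t."""
    cond = F.interpolate(cond_map, size=latent_size, mode="nearest")
    return torch.cat([z_t, cond], dim=1)          # the UNet simply receives extra input channels

z_t = torch.randn(1, 4, 64, 64)                   # latent of a 256x256 training crop at f = 4
sem = torch.randn(1, 8, 256, 256)                 # illustrative (e.g. one-hot) semantic map
unet_in = conditioned_unet_input(z_t, sem, latent_size=(64, 64))

# Because the whole pipeline is convolutional, the same UNet can later be run on a
# larger latent (e.g. twice the height/width) to render a correspondingly larger image.
```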

Figure 9: An LDM trained at resolution $256^2$ can generalize to larger resolutions for spatially conditioned tasks such as semantic synthesis of landscape images. See Sec. 4.3.2.
4.4 Super-Resolution with Latent Diffusion
LDMs can be efficiently trained for super-resolution by directly conditioning on low-resolution images via concatenation (cf. Sec. 3.3).
In a first experiment, we follow SR3 [72] and fix the image degradation to a bicubic interpolation with $4\times$-downsampling and train on ImageNet following SR3's data processing pipeline. We use the $f=4$ autoencoding model pretrained on OpenImages (VQ-reg., cf. Tab. 8) and concatenate the low-resolution conditioning $y$ and the inputs to the UNet, i.e. $\tau_\theta$ is the identity.
Our qualitative and quantitative results (see Fig. 10 and Tab. 5) show competitive performance and LDM-SR outperforms SR3 in FID while SR3 has a better IS.
A simple image regression model achieves the highest PSNR and SSIM scores;
however these metrics do not align well with human perception [106] and favor blurriness over imperfectly aligned high frequency details [72].
Further, we conduct a user study comparing the pixel-baseline with LDM-SR. We follow SR3 [72] where human subjects were shown a low-res image in between two high-res images and asked for preference. The results in Tab. 4 affirm the good performance of LDM-SR.
PSNR and SSIM can be pushed by using a post-hoc guiding mechanism [15] and we implement this image-based guider via a perceptual loss, see Sec. D.6.
Figure 10 (columns: bicubic, LDM-SR, SR3): ImageNet 64→256 super-resolution on ImageNet-Val. LDM-SR has advantages at rendering realistic textures but SR3 can synthesize more coherent fine structures. See the appendix for additional samples and cropouts. SR3 results from [72].
| User Study | SR on ImageNet: Pixel-DM | SR on ImageNet: LDM-4 | Inpainting on Places: LAMA [88] | Inpainting on Places: LDM-4 |
|---|---|---|---|---|
| Task 1: Preference vs GT | 16.0% | 30.4% | 13.6% | 21.0% |
| Task 2: Preference Score | 29.4% | 70.6% | 31.9% | 68.1% |

Table 4: Task 1: subjects were shown ground truth and a generated image and asked for their preference. Task 2: subjects had to decide between two generated images. More details in E.3.6.
Since the bicubic degradation process does not generalize well to images which do not follow this pre-processing, we also train a generic model, LDM-BSR, by using more diverse degradation. The results are shown in Sec. D.6.1.
| Method | FID | IS | PSNR | SSIM | Nparams | Throughput* |
|---|---|---|---|---|---|---|
| Image Regression [72] | 15.2 | 121.1 | 27.9 | 0.801 | 625M | N/A |
| SR3 [72] | 5.2 | 180.1 | 26.4 | 0.762 | 625M | N/A |
| LDM-4 (ours, 100 steps) | 2.8†/4.8‡ | 166.3 | 24.4±3.8 | 0.69±0.14 | 169M | 4.62 |
| LDM-4 (ours, big, 100 steps) | 2.4†/4.3‡ | 174.9 | 24.7±4.1 | 0.71±0.15 | 552M | 4.5 |
| LDM-4 (ours, 50 steps, guiding) | 4.4†/6.4‡ | 153.7 | 25.8±3.7 | 0.74±0.12 | 184M | 0.38 |

Table 5: ×4 upscaling results on ImageNet-Val.; †: FID features computed on the validation split, ‡: FID features computed on the train split; *: assessed on an NVIDIA A100.
4.5 Inpainting with Latent Diffusion
Inpainting is the task of filling masked regions of an image with new content either
because parts of the image are corrupted or to
replace existing but undesired content within the image. We evaluate how our
general approach for conditional image generation compares to more specialized,
state-of-the-art approaches for this task. Our evaluation follows the protocol
of LaMa[88], a recent inpainting model that introduces a specialized
architecture relying on Fast Fourier Convolutions[8].
The exact training & evaluation protocol on Places[108] is described in Sec. E.2.2.
We first analyze the effect of different design choices for
the first stage.
| Model (reg.-type) | Train throughput (samples/sec.) | Sampling throughput† @256 | Sampling throughput† @512 | Train+val (hours/epoch) | FID@2k (epoch 6) |
|---|---|---|---|---|---|
| LDM-1 (no first stage) | 0.11 | 0.26 | 0.07 | 20.66 | 24.74 |
| LDM-4 (KL, w/ attn) | 0.32 | 0.97 | 0.34 | 7.66 | 15.21 |
| LDM-4 (VQ, w/ attn) | 0.33 | 0.97 | 0.34 | 7.04 | 14.99 |
| LDM-4 (VQ, w/o attn) | 0.35 | 0.99 | 0.36 | 6.66 | 15.95 |

Table 6: Assessing inpainting efficiency. †: deviations from Fig. 7 due to different GPU settings/batch sizes, cf. the appendix.
Figure 11 (columns: input, result): Qualitative results on object removal with our large-scale inpainting model. For more results, see Fig. 22.
In particular, we compare the inpainting efficiency of LDM-1
(i.e. a pixel-based conditional DM) with
LDM-4, for both KL and VQ regularizations,
as well as VQ-LDM-4 without any attention in the first stage
(see Tab. 8), where the latter reduces GPU memory for decoding at high resolutions. For comparability, we fix the number of parameters for all models.
Tab. 6 reports the training and sampling throughput at resolution $256^2$ and $512^2$, the total training time in hours per epoch and the FID score on the validation split after six epochs. Overall, we observe a speed-up of at least $2.7\times$ between pixel- and latent-based diffusion models while improving FID scores by a factor of at least $1.6\times$.
The comparison with other inpainting approaches in Tab. 7
shows that our model with attention improves the overall image quality as
measured by FID over that of [88]. LPIPS between the unmasked
images and our samples is slightly higher than that of [88].
We attribute this to [88] only producing a
single result which tends to recover more of an average image
compared to the diverse results produced by our LDM
cf. Fig. 21.
Additionally in a user study (Tab. 4) human subjects favor our results over those of [88].
Based on these initial results, we also trained a larger diffusion model
(big in Tab. 7) in the
latent space of the VQ-regularized first stage without attention.
Following [15], the UNet of this diffusion model uses attention layers on
three levels of its feature hierarchy, the BigGAN [3] residual block
for up- and downsampling and has 387M parameters instead of 215M.
After training, we noticed a discrepancy in the quality of samples produced at resolutions $256^2$ and $512^2$, which we hypothesize to be caused by the additional attention modules. However, fine-tuning the model for half an epoch at resolution $512^2$ allows the model to adjust to the new feature statistics and sets a new state of the art FID on image inpainting (big, w/o attn, w/ ft in Tab. 7, Fig. 11).
| Method | 40-50% masked: FID | 40-50% masked: LPIPS | All samples: FID | All samples: LPIPS |
|---|---|---|---|---|
| LDM-4 (ours, big, w/ ft) | 9.39 | 0.246 ± 0.042 | 1.50 | 0.137 ± 0.080 |
| LDM-4 (ours, big, w/o ft) | 12.89 | 0.257 ± 0.047 | 2.40 | 0.142 ± 0.085 |
| LDM-4 (ours, w/ attn) | 11.87 | 0.257 ± 0.042 | 2.15 | 0.144 ± 0.084 |
| LDM-4 (ours, w/o attn) | 12.60 | 0.259 ± 0.041 | 2.37 | 0.145 ± 0.084 |
| LaMa [88]† | 12.31 | 0.243 ± 0.038 | 2.23 | 0.134 ± 0.080 |
| LaMa [88] | 12.0 | 0.24 | 2.21 | 0.14 |
| CoModGAN [107] | 10.4 | 0.26 | 1.82 | 0.15 |
| RegionWise [52] | 21.3 | 0.27 | 4.75 | 0.15 |
| DeepFill v2 [104] | 22.1 | 0.28 | 5.20 | 0.16 |
| EdgeConnect [58] | 30.5 | 0.28 | 8.37 | 0.16 |

Table 7: Comparison of inpainting performance on 30k crops of size 512×512 from test images of Places [108]. The column "40-50%" reports metrics computed over hard examples where 40-50% of the image region have to be inpainted. †: recomputed on our test set, since the original test set used in [88] was not available.
5 Limitations & Societal Impact
Limitations
While LDMs significantly reduce computational requirements compared to pixel-based approaches,
their sequential sampling process is still slower than that of GANs.
Moreover, the use of LDMs can be questionable when high precision is required:
although the loss of image quality is very small in our $f=4$ autoencoding models (see Fig. 1),
their reconstruction capability can become a bottleneck for tasks that require fine-grained accuracy in pixel space.
We assume that our superresolution models (Sec. 4.4) are already somewhat limited in this respect.
Societal Impact
Generative models for media like imagery are a double-edged sword: On the one hand, they enable various creative applications,
and in particular approaches like ours that reduce the cost of training and inference have the potential
to facilitate access to this technology and democratize its exploration.
On the other hand, it also means that it becomes easier to create and disseminate manipulated data or spread misinformation and spam.
In particular, the deliberate manipulation of images (“deep fakes”) is a common problem in this context,
and women in particular are disproportionately affected by it [13, 24].
Generative models can also reveal their training data [5, 90],
which is of great concern when the data contain sensitive or personal information
and were collected without explicit consent.
However, the extent to which this also applies to DMs of images is not yet fully understood.
Finally, deep learning modules tend to reproduce or exacerbate biases that are already present in the data [91, 38, 22].
While diffusion models achieve better coverage of the data distribution than e.g. GAN-based approaches,
the extent to which our two-stage approach that combines adversarial training and a likelihood-based objective
misrepresents the data remains an important research question.
For a more general, detailed discussion of the ethical considerations of deep generative models, see e.g. [13].
6 Conclusion
We have presented latent diffusion models, a simple and efficient way to significantly improve both the
training and sampling efficiency of denoising diffusion models without
degrading their quality. Based on this and our cross-attention
conditioning mechanism, our experiments could demonstrate favorable results
compared to state-of-the-art methods across a wide range of conditional image
synthesis tasks without task-specific architectures.
† This work has been supported by the German Federal Ministry for Economic Affairs and Energy within the project 'KI-Absicherung - Safe AI for automated driving' and by the German Research Foundation (DFG) project 421703927.
References
-
[1]
Eirikur Agustsson and Radu Timofte.
NTIRE 2017 challenge on single image super-resolution: Dataset and
study.
In 2017 IEEE Conference on Computer Vision and Pattern
Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26,
2017, pages 1122–1131. IEEE Computer Society, 2017.
Eirikur Agustsson 和 Radu Timofte.NTIRE 2017 单图像超分辨率挑战赛:数据集与研究。In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1122-1131.IEEE 计算机协会,2017 年。 -
[2]
Martin Arjovsky, Soumith Chintala, and Léon Bottou.
Wasserstein gan, 2017.
Martin Arjovsky、Soumith Chintala 和 Léon Bottou.Wasserstein gan, 2017. -
[3]
Andrew Brock, Jeff Donahue, and Karen Simonyan.
Large scale GAN training for high fidelity natural image synthesis.
In Int. Conf. Learn. Represent., 2019.
Andrew Brock、Jeff Donahue 和 Karen Simonyan。用于高保真自然图像合成的大规模 GAN 训练。In Int.Conf.Learn.Represent. -
[4]
Holger Caesar, Jasper R. R. Uijlings, and Vittorio Ferrari.
Coco-stuff: Thing and stuff classes in context.
In 2018 IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages
1209–1218. Computer Vision Foundation / IEEE Computer Society, 2018.
Holger Caesar、Jasper R. R. Uijlings 和 Vittorio Ferrari。Coco-stuff:上下文中的事物和物品类。In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 1209-1218.计算机视觉基金会/IEEE计算机学会,2018。 -
[5]
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel
Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar
Erlingsson, et al.
Extracting training data from large language models.
In 30th USENIX Security Symposium (USENIX Security 21), pages
2633–2650, 2021.
Nicholas Carlini、Florian Tramer、Eric Wallace、Matthew Jagielski、Ariel Herbert-Voss、Katherine Lee、Adam Roberts、Tom Brown、Dawn Song、Ulfar Erlingsson 等:从大型语言模型中提取训练数据。第 30 届 USENIX 安全研讨会(USENIX Security 21),第 2633-2650 页,2021 年。 -
[6]
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and
Ilya Sutskever.
Generative pretraining from pixels.
In ICML, volume 119 of Proceedings of Machine Learning
Research, pages 1691–1703. PMLR, 2020.
Mark Chen、Alec Radford、Rewon Child、Jeffrey Wu、Heewoo Jun、David Luan 和 Ilya Sutskever。从像素生成预训练。ICML,《机器学习研究论文集》第 119 卷,第 1691-1703 页。PMLR, 2020. -
[7]
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William
Chan.
Wavegrad: Estimating gradients for waveform generation.
In ICLR. OpenReview.net, 2021.
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan.波形梯度:波形生成的梯度估计。In ICLR.OpenReview.net, 2021. -
[8]
Lu Chi, Borui Jiang, and Yadong Mu.
Fast fourier convolution.
In NeurIPS, 2020.
Lu Chi、Borui Jiang 和 Yadong Mu。快速傅立叶卷积。在 NeurIPS,2020 年。 -
[9]
Rewon Child.
Very deep vaes generalize autoregressive models and can outperform
them on images.
CoRR, abs/2011.10650, 2020.
Rewon Child.非常深度的Vaes泛化自回归模型,并能在图像上超越它们。CoRR,abs/2011.10650,2020。 -
[10]
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever.
Generating long sequences with sparse transformers.
CoRR, abs/1904.10509, 2019.
Rewon Child、Scott Gray、Alec Radford 和 Ilya Sutskever。用稀疏变换器生成长序列。CoRR,abs/1904.10509,2019。 -
[11]
Bin Dai and David P. Wipf.
Diagnosing and enhancing VAE models.
In ICLR (Poster). OpenReview.net, 2019.
Bin Dai 和 David P. Wipf.诊断和增强 VAE 模型。In ICLR (Poster).OpenReview.net, 2019. -
[12]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li.
Imagenet: A large-scale hierarchical image database.
In CVPR, pages 248–255. IEEE Computer Society, 2009.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li.Imagenet:大规模分层图像数据库。In CVPR, pages 248-255.IEEE 计算机协会,2009 年。 -
[13]
Emily Denton.
Ethical considerations of generative ai.
AI for Content Creation Workshop, CVPR, 2021.
Emily Denton.生成式人工智能的伦理考量。AI for Content Creation Workshop, CVPR, 2021. -
[14]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
BERT: pre-training of deep bidirectional transformers for language
understanding.
CoRR, abs/1810.04805, 2018.
Jacob Devlin、Ming-Wei Chang、Kenton Lee 和 Kristina Toutanova。BERT:用于语言理解的深度双向变换器预训练。CoRR,ABS/1810.04805,2018。 -
[15]
Prafulla Dhariwal and Alex Nichol.
Diffusion models beat gans on image synthesis.
CoRR, abs/2105.05233, 2021.
Prafulla Dhariwal 和 Alex Nichol.扩散模型在图像合成上击败甘斯。CoRR,abs/2105.05233,2021。 -
[16]
Sander Dieleman.
Musings on typicality, 2020.
桑德-迪埃勒曼关于典型性的思考,2020 年。 -
[17]
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang
Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang.
Cogview: Mastering text-to-image generation via transformers.
CoRR, abs/2105.13290, 2021.
[18]
Laurent Dinh, David Krueger, and Yoshua Bengio.
Nice: Non-linear independent components estimation, 2015.
[19]
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio.
Density estimation using real NVP.
In 5th International Conference on Learning Representations,
ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track
Proceedings. OpenReview.net, 2017.
[20]
Alexey Dosovitskiy and Thomas Brox.
Generating images with perceptual similarity metrics based on deep
networks.
In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle
Guyon, and Roman Garnett, editors, Adv. Neural Inform. Process. Syst.,
pages 658–666, 2016.
[21]
Patrick Esser, Robin Rombach, Andreas Blattmann, and Björn Ommer.
Imagebart: Bidirectional context with multinomial diffusion for
autoregressive image synthesis.
CoRR, abs/2108.08827, 2021.
[22]
Patrick Esser, Robin Rombach, and Björn Ommer.
A note on data biases in generative models.
arXiv preprint arXiv:2012.02516, 2020.
[23]
Patrick Esser, Robin Rombach, and Björn Ommer.
Taming transformers for high-resolution image synthesis.
CoRR, abs/2012.09841, 2020.
[24]
Mary Anne Franks and Ari Ezra Waldman.
Sex, lies, and videotape: Deep fakes and free speech delusions.
Md. L. Rev., 78:892, 2018.
[25]
Kevin Frans, Lisa B. Soros, and Olaf Witkowski.
Clipdraw: Exploring text-to-drawing synthesis through language-image
encoders.
ArXiv, abs/2106.14843, 2021.
[26]
Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv
Taigman.
Make-a-scene: Scene-based text-to-image generation with human priors.
CoRR, abs/2203.13131, 2022.
[27]
Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio.
Generative adversarial networks.
CoRR, 2014.
[28]
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron
Courville.
Improved training of wasserstein gans, 2017.
[29]
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp
Hochreiter.
Gans trained by a two time-scale update rule converge to a local nash
equilibrium.
In Adv. Neural Inform. Process. Syst., pages 6626–6637, 2017.
[30]
Jonathan Ho, Ajay Jain, and Pieter Abbeel.
Denoising diffusion probabilistic models.
In NeurIPS, 2020.
[31]
Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi,
and Tim Salimans.
Cascaded diffusion models for high fidelity image generation.
CoRR, abs/2106.15282, 2021.
[32]
Jonathan Ho and Tim Salimans.
Classifier-free diffusion guidance.
In NeurIPS 2021 Workshop on Deep Generative Models and
Downstream Applications, 2021.
[33]
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros.
Image-to-image translation with conditional adversarial networks.
In CVPR, pages 5967–5976. IEEE Computer Society, 2017.
[34]
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros.
Image-to-image translation with conditional adversarial networks.
2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), pages 5967–5976, 2017.
[35]
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch,
Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan
Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick, Andrew Zisserman,
Oriol Vinyals, and João Carreira.
Perceiver IO: A general architecture for structured inputs
&outputs.
CoRR, abs/2107.14795, 2021.
[36]
Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and
João Carreira.
Perceiver: General perception with iterative attention.
In Marina Meila and Tong Zhang, editors, Proceedings of the 38th
International Conference on Machine Learning, ICML 2021, 18-24 July 2021,
Virtual Event, volume 139 of Proceedings of Machine Learning Research,
pages 4651–4664. PMLR, 2021.
[37]
Manuel Jahn, Robin Rombach, and Björn Ommer.
High-resolution complex scene synthesis with transformers.
CoRR, abs/2105.06458, 2021.
[38]
Niharika Jain, Alberto Olmo, Sailik Sengupta, Lydia Manikonda, and Subbarao
Kambhampati.
Imperfect imaganation: Implications of gans exacerbating biases on
facial data augmentation and snapchat selfie lenses.
arXiv preprint arXiv:2001.09528, 2020.
[39]
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.
Progressive growing of gans for improved quality, stability, and
variation.
CoRR, abs/1710.10196, 2017.
[40]
Tero Karras, Samuli Laine, and Timo Aila.
A style-based generator architecture for generative adversarial
networks.
In IEEE Conf. Comput. Vis. Pattern Recog., pages 4401–4410,
2019.
[41]
T. Karras, S. Laine, and T. Aila.
A style-based generator architecture for generative adversarial
networks.
In 2019 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), 2019.
[42]
Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and
Timo Aila.
Analyzing and improving the image quality of stylegan.
CoRR, abs/1912.04958, 2019.
[43]
Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and Il-Chul Moon.
Score matching model for unbounded data score.
CoRR, abs/2106.05527, 2021.
[44]
Durk P Kingma and Prafulla Dhariwal.
Glow: Generative flow with invertible 1x1 convolutions.
In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
and R. Garnett, editors, Advances in Neural Information Processing
Systems, 2018.
[45]
Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho.
Variational diffusion models.
CoRR, abs/2107.00630, 2021.
[46]
Diederik P. Kingma and Max Welling.
Auto-Encoding Variational Bayes.
In 2nd International Conference on Learning Representations,
ICLR, 2014.
[47]
Zhifeng Kong and Wei Ping.
On fast sampling of diffusion probabilistic models.
CoRR, abs/2106.00132, 2021.
[48]
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro.
Diffwave: A versatile diffusion model for audio synthesis.
In ICLR. OpenReview.net, 2021.
[49]
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin,
Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig,
and Vittorio Ferrari.
The open images dataset V4: unified image classification, object
detection, and visual relationship detection at scale.
CoRR, abs/1811.00982, 2018.
[50]
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen,
and Timo Aila.
Improved precision and recall metric for assessing generative models.
CoRR, abs/1904.06991, 2019.
[51]
Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B.
Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and
C. Lawrence Zitnick.
Microsoft COCO: common objects in context.
CoRR, abs/1405.0312, 2014.
[52]
Yuqing Ma, Xianglong Liu, Shihao Bai, Le-Yi Wang, Aishan Liu, Dacheng Tao, and
Edwin Hancock.
Region-wise generative adversarial imageinpainting for large missing
areas.
ArXiv, abs/1909.12507, 2019.
[53]
Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano
Ermon.
Sdedit: Image synthesis and editing with stochastic differential
equations.
CoRR, abs/2108.01073, 2021.
[54]
Lars M. Mescheder.
On the convergence properties of GAN training.
CoRR, abs/1801.04406, 2018.
[55]
Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein.
Unrolled generative adversarial networks.
In 5th International Conference on Learning Representations,
ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track
Proceedings. OpenReview.net, 2017.
[56]
Mehdi Mirza and Simon Osindero.
Conditional generative adversarial nets.
CoRR, abs/1411.1784, 2014.
[57]
Gautam Mittal, Jesse H. Engel, Curtis Hawthorne, and Ian Simon.
Symbolic music generation with diffusion models.
CoRR, abs/2103.16091, 2021.
[58]
Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, and Mehran Ebrahimi.
Edgeconnect: Generative image inpainting with adversarial edge
learning.
ArXiv, abs/1901.00212, 2019.
[59]
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin,
Bob McGrew, Ilya Sutskever, and Mark Chen.
GLIDE: towards photorealistic image generation and editing with
text-guided diffusion models.
CoRR, abs/2112.10741, 2021.
[60]
Anton Obukhov, Maximilian Seitzer, Po-Wei Wu, Semen Zhydenko, Jonathan Kyl, and
Elvis Yu-Jing Lin.
High-fidelity performance metrics for generative models in pytorch,
2020.
Version: 0.3.0, DOI: 10.5281/zenodo.4957738.
[61]
Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu.
Semantic image synthesis with spatially-adaptive normalization.
In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2019.
[62]
Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu.
Semantic image synthesis with spatially-adaptive normalization.
In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), June 2019.
[63]
Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu.
Dual contradistinctive generative autoencoder.
In IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2021, virtual, June 19-25, 2021, pages 823–832. Computer Vision
Foundation / IEEE, 2021.
[64]
Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu.
On buggy resizing libraries and surprising subtleties in fid
calculation.
arXiv preprint arXiv:2104.11222, 2021.
[65]
David A. Patterson, Joseph Gonzalez, Quoc V. Le, Chen Liang, Lluis-Miquel
Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean.
Carbon emissions and large neural network training.
CoRR, abs/2104.10350, 2021.
[66]
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec
Radford, Mark Chen, and Ilya Sutskever.
Zero-shot text-to-image generation.
CoRR, abs/2102.12092, 2021.
[67]
Ali Razavi, Aäron van den Oord, and Oriol Vinyals.
Generating diverse high-fidelity images with VQ-VAE-2.
In NeurIPS, pages 14837–14847, 2019.
[68]
Scott E. Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele,
and Honglak Lee.
Generative adversarial text to image synthesis.
In ICML, 2016.
[69]
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra.
Stochastic backpropagation and approximate inference in deep
generative models.
In Proceedings of the 31st International Conference on
International Conference on Machine Learning, ICML, 2014.
[70]
Robin Rombach, Patrick Esser, and Björn Ommer.
Network-to-network translation with conditional invertible neural
networks.
In NeurIPS, 2020.
[71]
Olaf Ronneberger, Philipp Fischer, and Thomas Brox.
U-net: Convolutional networks for biomedical image segmentation.
In MICCAI (3), volume 9351 of Lecture Notes in
Computer Science, pages 234–241. Springer, 2015.
[72]
Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and
Mohammad Norouzi.
Image super-resolution via iterative refinement.
CoRR, abs/2104.07636, 2021.
[73]
Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma.
Pixelcnn++: Improving the pixelcnn with discretized logistic mixture
likelihood and other modifications.
CoRR, abs/1701.05517, 2017.
[74]
Dave Salvator.
NVIDIA Developer Blog.
https://developer.nvidia.com/blog/getting-immediate-speedups-with-a100-tf32,
2020.
[75]
Robin San-Roman, Eliya Nachmani, and Lior Wolf.
Noise estimation for generative diffusion models.
CoRR, abs/2104.02600, 2021.
[76]
Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger.
Projected gans converge faster.
CoRR, abs/2111.01007, 2021.
[77]
Edgar Schönfeld, Bernt Schiele, and Anna Khoreva.
A u-net based discriminator for generative adversarial networks.
In 2020 IEEE/CVF Conference on Computer Vision and Pattern
Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages
8204–8213. Computer Vision Foundation / IEEE, 2020.
[78]
Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk,
Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran
Komatsuzaki.
Laion-400m: Open dataset of clip-filtered 400 million image-text
pairs, 2021.
[79]
Karen Simonyan and Andrew Zisserman.
Very deep convolutional networks for large-scale image recognition.
In Yoshua Bengio and Yann LeCun, editors, Int. Conf. Learn.
Represent., 2015.
[80]
Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon.
D2C: diffusion-denoising models for few-shot conditional
generation.
CoRR, abs/2106.06819, 2021.
[81]
Charlie Snell.
Alien Dreams: An Emerging Art Scene.
https://ml.berkeley.edu/blog/posts/clip-art/, 2021.
[Online; accessed November-2021].
[82]
Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya
Ganguli.
Deep unsupervised learning using nonequilibrium thermodynamics.
CoRR, abs/1503.03585, 2015.
[83]
Kihyuk Sohn, Honglak Lee, and Xinchen Yan.
Learning structured output representation using deep conditional
generative models.
In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett,
editors, Advances in Neural Information Processing Systems, volume 28.
Curran Associates, Inc., 2015.
[84]
Jiaming Song, Chenlin Meng, and Stefano Ermon.
Denoising diffusion implicit models.
In ICLR. OpenReview.net, 2021.
[85]
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano
Ermon, and Ben Poole.
Score-based generative modeling through stochastic differential
equations.
CoRR, abs/2011.13456, 2020.
[86]
Emma Strubell, Ananya Ganesh, and Andrew McCallum.
Energy and policy considerations for modern deep learning research.
In The Thirty-Fourth AAAI Conference on Artificial
Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of
Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium
on Educational Advances in Artificial Intelligence, EAAI 2020, New York,
NY, USA, February 7-12, 2020, pages 13693–13696. AAAI Press, 2020.
[87]
Wei Sun and Tianfu Wu.
Learning layout and style reconfigurable gans for controllable image
synthesis.
CoRR, abs/2003.11571, 2020.
[88]
Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova,
Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong
Park, and Victor S. Lempitsky.
Resolution-robust large mask inpainting with fourier convolutions.
ArXiv, abs/2109.07161, 2021.
[89]
Tristan Sylvain, Pengchuan Zhang, Yoshua Bengio, R. Devon Hjelm, and Shikhar
Sharma.
Object-centric image generation from layouts.
In Thirty-Fifth AAAI Conference on Artificial Intelligence,
AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial
Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in
Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021,
pages 2647–2655. AAAI Press, 2021.
[90]
Patrick Tinsley, Adam Czajka, and Patrick Flynn.
This face does not exist… but it might be yours! identity leakage
in generative models.
In Proceedings of the IEEE/CVF Winter Conference on Applications
of Computer Vision, pages 1320–1328, 2021.
[91]
Antonio Torralba and Alexei A Efros.
Unbiased look at dataset bias.
In CVPR 2011, pages 1521–1528. IEEE, 2011.
[92]
Arash Vahdat and Jan Kautz.
NVAE: A deep hierarchical variational autoencoder.
In NeurIPS, 2020.
[93]
Arash Vahdat, Karsten Kreis, and Jan Kautz.
Score-based generative modeling in latent space.
CoRR, abs/2106.05931, 2021.
[94]
Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, koray kavukcuoglu, Oriol
Vinyals, and Alex Graves.
Conditional image generation with pixelcnn decoders.
In Advances in Neural Information Processing Systems, 2016.
[95]
Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu.
Pixel recurrent neural networks.
CoRR, abs/1601.06759, 2016.
[96]
Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu.
Neural discrete representation learning.
In NIPS, pages 6306–6315, 2017.
[97]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin.
Attention is all you need.
In NIPS, pages 5998–6008, 2017.
[98]
Rivers Have Wings.
Tweet on Classifier-free guidance for autoregressive models.
https://twitter.com/RiversHaveWings/status/1478093658716966912,
2022.
[99]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue,
Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz,
and Jamie Brew.
Huggingface’s transformers: State-of-the-art natural language
processing.
CoRR, abs/1910.03771, 2019.
[100]
Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat.
VAEBM: A symbiosis between variational autoencoders and
energy-based models.
In 9th International Conference on Learning Representations,
ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[101]
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas.
Videogpt: Video generation using VQ-VAE and transformers.
CoRR, abs/2104.10157, 2021.
[102]
Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao.
LSUN: construction of a large-scale image dataset using deep
learning with humans in the loop.
CoRR, abs/1506.03365, 2015.
[103]
Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander
Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu.
Vector-quantized image modeling with improved vqgan, 2021.
[104]
Jiahui Yu, Zhe L. Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang.
Free-form image inpainting with gated convolution.
2019 IEEE/CVF International Conference on Computer Vision
(ICCV), pages 4470–4479, 2019.
[105]
K. Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte.
Designing a practical degradation model for deep blind image
super-resolution.
ArXiv, abs/2103.14006, 2021.
[106]
Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang.
The unreasonable effectiveness of deep features as a perceptual
metric.
In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2018.
[107]
Shengyu Zhao, Jianwei Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I-Chao
Chang, and Yan Xu.
Large scale image completion via co-modulated generative adversarial
networks.
ArXiv, abs/2103.10428, 2021.
[108]
Bolei Zhou, Àgata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio
Torralba.
Places: A 10 million image database for scene recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence,
40:1452–1464, 2018.
[109]
Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu,
Jiuxiang Gu, Jinhui Xu, and Tong Sun.
LAFITE: towards language-free training for text-to-image
generation.
CoRR, abs/2111.13792, 2021.
Appendix



Figure 12: Convolutional samples from the semantic landscapes model in Sec. 4.3.2, fine-tuned on … images.
[Figure 13 image grid; panel prompts: ’A painting of the last supper by Picasso.’ · ’An oil painting of a latent space.’ · ’An epic painting of Gandalf the Black summoning thunder and lightning in the mountains.’ · ’A sunset over a mountain range, vector image.’]
Figure 13: Combining classifier-free diffusion guidance with the convolutional sampling strategy from Sec. 4.3.2, our 1.45B-parameter text-to-image model can be used to render images larger than the native resolution on which the model was trained.
Appendix A Changelog
Here we list changes between this version (https://arxiv.org/abs/2112.10752v2) of the paper and the previous version, i.e. https://arxiv.org/abs/2112.10752v1.
• We updated the results on text-to-image synthesis in Sec. 4.3, which were obtained by training a new, larger model (1.45B parameters). This also includes a new comparison to very recent competing methods on this task that were published on arXiv at the same time as ([59, 109]) or after ([26]) the publication of our work.
• We updated results on class-conditional synthesis on ImageNet in Sec. 4.1, Tab. 3 (see also Sec. D.4), obtained by retraining the model with a larger batch size. The corresponding qualitative results in Fig. 26 and Fig. 27 were also updated. Both the updated text-to-image and the class-conditional model now use classifier-free guidance [32] as a measure to increase visual fidelity.
• We conducted a user study (following the scheme suggested by Saharia et al. [72]) which provides additional evaluation of our inpainting (Sec. 4.5) and super-resolution (Sec. 4.4) models.
• We added Fig. 5 to the main paper, moved Fig. 18 to the appendix, and added Fig. 13 to the appendix.
Appendix B Detailed Information on Denoising Diffusion Models
Diffusion models can be specified in terms of a signal-to-noise ratio $\mathrm{SNR}(t) = \alpha_t^2/\sigma_t^2$, consisting of sequences $(\alpha_t)_{t=1}^{T}$ and $(\sigma_t)_{t=1}^{T}$ which, starting from a data sample $x_0$, define a forward diffusion process $q$ as

$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t \mid \alpha_t x_0,\, \sigma_t^2 \mathbb{I}\right)$    (4)

with the Markov structure for $s < t$:

$q(x_t \mid x_s) = \mathcal{N}\!\left(x_t \mid \alpha_{t|s}\, x_s,\, \sigma_{t|s}^2 \mathbb{I}\right)$    (5)
$\alpha_{t|s} = \dfrac{\alpha_t}{\alpha_s}$    (6)
$\sigma_{t|s}^2 = \sigma_t^2 - \alpha_{t|s}^2\, \sigma_s^2$    (7)
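As a concrete illustration of Eqs. (4)–(7), the following is a minimal NumPy sketch of sampling from the forward process. The variance-preserving schedule and the names (`alpha`, `sigma`, `q_sample`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative variance-preserving schedule: alpha_t^2 + sigma_t^2 = 1,
# with alpha_0 = 1 (clean data) and alpha_T close to 0 (almost pure noise).
T = 1000
alpha = np.sqrt(np.linspace(1.0, 1e-5, T + 1))
sigma = np.sqrt(1.0 - alpha**2)

def snr(t):
    """Signal-to-noise ratio SNR(t) = alpha_t^2 / sigma_t^2."""
    return alpha[t]**2 / sigma[t]**2

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) = N(alpha_t * x_0, sigma_t^2 * I), cf. Eq. (4)."""
    eps = rng.standard_normal(x0.shape)
    return alpha[t] * x0 + sigma[t] * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 64, 64))   # stand-in for a (latent) image
x_t = q_sample(x0, t=500, rng=rng)      # partially noised sample at an intermediate step
```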
Denoising diffusion models are generative models $p(x_0)$ which revert this process with a similar Markov structure running backward in time, i.e. they are specified as

$p(x_0) = \displaystyle\int p(x_T) \prod_{t=1}^{T} p(x_{t-1} \mid x_t)\, dx_{1:T}$    (8)
The evidence lower bound (ELBO) associated with this model then decomposes over the discrete time steps as

$-\log p(x_0) \;\leq\; \mathrm{KL}\!\left(q(x_T \mid x_0)\,\|\, p(x_T)\right) \;+\; \displaystyle\sum_{t=1}^{T} \mathbb{E}_{q(x_t \mid x_0)}\, \mathrm{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\,\|\, p(x_{t-1} \mid x_t)\right)$    (9)
The prior $p(x_T)$ is typically chosen as a standard normal distribution, and the first term of the ELBO then depends only on the final signal-to-noise ratio $\mathrm{SNR}(T)$. To minimize the remaining terms, a common choice to parameterize $p(x_{t-1} \mid x_t)$ is to specify it in terms of the true posterior $q(x_{t-1} \mid x_t, x_0)$, but with the unknown $x_0$ replaced by an estimate $x_\theta(x_t, t)$ based on the current step $x_t$. This gives [45]
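Continuing the sketch above (reusing `alpha`, `sigma`, `x_t`, and `rng`), one reverse step under this parameterization might look as follows. The Gaussian posterior coefficients follow the standard conditioning formulas used in [45], and `x_theta` stands in for the trained denoiser; treat this as an illustrative assumption rather than the paper's exact implementation.

```python
def reverse_step(x_t, t, x_theta, rng):
    """One ancestral sampling step of p(x_{t-1} | x_t) = q(x_{t-1} | x_t, x_theta(x_t, t)),
    i.e. the true forward posterior with the unknown x_0 replaced by the model's estimate."""
    s = t - 1
    alpha_ts = alpha[t] / alpha[s]                     # alpha_{t|s}, cf. Eq. (6)
    var_ts = sigma[t]**2 - alpha_ts**2 * sigma[s]**2   # sigma_{t|s}^2, cf. Eq. (7)
    x0_hat = x_theta(x_t, t)                           # estimate of x_0 from the current step
    mean = (alpha_ts * sigma[s]**2 / sigma[t]**2) * x_t \
         + (alpha[s] * var_ts / sigma[t]**2) * x0_hat
    var = var_ts * sigma[s]**2 / sigma[t]**2
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

# Example call with a trivial stand-in denoiser that simply rescales x_t:
x_prev = reverse_step(x_t, t=500, x_theta=lambda x, t: x / alpha[t], rng=rng)
```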