
Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng & Stefano Ermon

Stanford University
{tsong,chenlin,ermon}@cs.stanford.edu
Abstract

Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps in order to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a particular Markovian diffusion process. We generalize DDPMs via a class of non-Markovian diffusion processes that lead to the same training objective. These non-Markovian processes can correspond to generative processes that are deterministic, giving rise to implicit models that produce high quality samples much faster. We empirically demonstrate that DDIMs can produce high quality samples $10\times$ to $50\times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, perform semantically meaningful image interpolation directly in the latent space, and reconstruct observations with very low error.

1 Introduction

Deep generative models have demonstrated the ability to produce high quality samples in many domains (Karras et al., 2020; van den Oord et al., 2016a). In terms of image generation, generative adversarial networks (GANs, Goodfellow et al. (2014)) currently exhibit higher sample quality than likelihood-based methods such as variational autoencoders (Kingma & Welling, 2013), autoregressive models (van den Oord et al., 2016b) and normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2016). However, GANs require very specific choices in optimization and architectures in order to stabilize training (Arjovsky et al., 2017; Gulrajani et al., 2017; Karras et al., 2018; Brock et al., 2018), and could fail to cover modes of the data distribution (Zhao et al., 2018).

Recent works on iterative generative models (Bengio et al., 2014), such as denoising diffusion probabilistic models (DDPM, Ho et al. (2020)) and noise conditional score networks (NCSN, Song & Ermon (2019)) have demonstrated the ability to produce samples comparable to that of GANs, without having to perform adversarial training. To achieve this, many denoising autoencoding models are trained to denoise samples corrupted by various levels of Gaussian noise. Samples are then produced by a Markov chain which, starting from white noise, progressively denoises it into an image. This generative Markov Chain process is either based on Langevin dynamics (Song & Ermon, 2019) or obtained by reversing a forward diffusion process that progressively turns an image into noise (Sohl-Dickstein et al., 2015).

A critical drawback of these models is that they require many iterations to produce a high quality sample. For DDPMs, this is because the generative process (from noise to data) approximates the reverse of the forward diffusion process (from data to noise), which could have thousands of steps; iterating over all the steps is required to produce a single sample, which is much slower compared to GANs, which only need one pass through a network. For example, it takes around 20 hours to sample 50k images of size $32\times 32$ from a DDPM, but less than a minute to do so from a GAN on a Nvidia 2080 Ti GPU. This becomes more problematic for larger images, as sampling 50k images of size $256\times 256$ could take nearly $1000$ hours on the same GPU.

To close this efficiency gap between DDPMs and GANs, we present denoising diffusion implicit models (DDIMs). DDIMs are implicit probabilistic models (Mohamed & Lakshminarayanan, 2016) and are closely related to DDPMs, in the sense that they are trained with the same objective function. In Section 3, we generalize the forward diffusion process used by DDPMs, which is Markovian, to non-Markovian ones, for which we are still able to design suitable reverse generative Markov chains. We show that the resulting variational training objectives have a shared surrogate objective, which is exactly the objective used to train DDPM. Therefore, we can freely choose from a large family of generative models using the same neural network simply by choosing a different, non-Markovian diffusion process (Section 4.1) and the corresponding reverse generative Markov chain. In particular, we are able to use non-Markovian diffusion processes which lead to "short" generative Markov chains (Section 4.2) that can be simulated in a small number of steps. This can massively increase sample efficiency at only a minor cost in sample quality.

In Section 5, we demonstrate several empirical benefits of DDIMs over DDPMs. First, DDIMs have superior sample generation quality compared to DDPMs when we accelerate sampling by $10\times$ to $100\times$ using our proposed method. Second, DDIM samples have the following "consistency" property, which does not hold for DDPMs: if we start with the same initial latent variable and generate several samples with Markov chains of various lengths, these samples will have similar high-level features. Third, because of this "consistency", we can perform semantically meaningful image interpolation by manipulating the initial latent variable in DDIMs, unlike DDPMs, which interpolate near the image space due to the stochastic generative process.

2 Background

Figure 1: Graphical models for diffusion (left) and non-Markovian (right) inference models.

Given samples from a data distribution $q({\bm{x}}_0)$, we are interested in learning a model distribution $p_\theta({\bm{x}}_0)$ that approximates $q({\bm{x}}_0)$ and is easy to sample from. Denoising diffusion probabilistic models (DDPMs, Sohl-Dickstein et al. (2015); Ho et al. (2020)) are latent variable models of the form

$$p_\theta({\bm{x}}_0)=\int p_\theta({\bm{x}}_{0:T})\,\mathrm{d}{\bm{x}}_{1:T},\quad\text{where}\quad p_\theta({\bm{x}}_{0:T}):=p_\theta({\bm{x}}_T)\prod_{t=1}^{T}p^{(t)}_\theta({\bm{x}}_{t-1}|{\bm{x}}_t)$$ (1)

where ${\bm{x}}_1,\ldots,{\bm{x}}_T$ are latent variables in the same sample space as ${\bm{x}}_0$ (denoted as ${\mathcal{X}}$). The parameters $\theta$ are learned to fit the data distribution $q({\bm{x}}_0)$ by maximizing a variational lower bound:

$$\max_\theta{\mathbb{E}}_{q({\bm{x}}_0)}[\log p_\theta({\bm{x}}_0)]\geq\max_\theta{\mathbb{E}}_{q({\bm{x}}_0,{\bm{x}}_1,\ldots,{\bm{x}}_T)}\left[\log p_\theta({\bm{x}}_{0:T})-\log q({\bm{x}}_{1:T}|{\bm{x}}_0)\right]$$ (2)

where $q({\bm{x}}_{1:T}|{\bm{x}}_0)$ is some inference distribution over the latent variables. Unlike typical latent variable models (such as the variational autoencoder (Rezende et al., 2014)), DDPMs are learned with a fixed (rather than trainable) inference procedure $q({\bm{x}}_{1:T}|{\bm{x}}_0)$, and latent variables are relatively high dimensional. For example, Ho et al. (2020) considered the following Markov chain with Gaussian transitions parameterized by a decreasing sequence $\alpha_{1:T}\in(0,1]^T$:

$$q({\bm{x}}_{1:T}|{\bm{x}}_0):=\prod_{t=1}^{T}q({\bm{x}}_t|{\bm{x}}_{t-1}),\quad\text{where}\quad q({\bm{x}}_t|{\bm{x}}_{t-1}):={\mathcal{N}}\left(\sqrt{\frac{\alpha_t}{\alpha_{t-1}}}{\bm{x}}_{t-1},\left(1-\frac{\alpha_t}{\alpha_{t-1}}\right){\bm{I}}\right)$$ (3)

where the covariance matrix is ensured to have positive terms on its diagonal. This is called the forward process due to the autoregressive nature of the sampling procedure (from ${\bm{x}}_0$ to ${\bm{x}}_T$). We call the latent variable model $p_\theta({\bm{x}}_{0:T})$, which is a Markov chain that samples from ${\bm{x}}_T$ to ${\bm{x}}_0$, the generative process, since it approximates the intractable reverse process $q({\bm{x}}_{t-1}|{\bm{x}}_t)$. Intuitively, the forward process progressively adds noise to the observation ${\bm{x}}_0$, whereas the generative process progressively denoises a noisy observation (Figure 1, left).

A special property of the forward process is that

$$q({\bm{x}}_t|{\bm{x}}_0):=\int q({\bm{x}}_{1:t}|{\bm{x}}_0)\,\mathrm{d}{\bm{x}}_{1:(t-1)}={\mathcal{N}}({\bm{x}}_t;\sqrt{\alpha_t}{\bm{x}}_0,(1-\alpha_t){\bm{I}});$$

so we can express ${\bm{x}}_t$ as a linear combination of ${\bm{x}}_0$ and a noise variable $\epsilon$:

$${\bm{x}}_t=\sqrt{\alpha_t}{\bm{x}}_0+\sqrt{1-\alpha_t}\,\epsilon,\quad\text{where}\quad\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}).$$ (4)

When we set $\alpha_T$ sufficiently close to $0$, $q({\bm{x}}_T|{\bm{x}}_0)$ converges to a standard Gaussian for all ${\bm{x}}_0$, so it is natural to set $p_\theta({\bm{x}}_T):={\mathcal{N}}({\bm{0}},{\bm{I}})$. If all the conditionals are modeled as Gaussians with trainable mean functions and fixed variances, the objective in Eq. (2) can be simplified to (please refer to Appendix C.2 for details):

$$L_\gamma(\epsilon_\theta):=\sum_{t=1}^{T}\gamma_t\,{\mathbb{E}}_{{\bm{x}}_0\sim q({\bm{x}}_0),\,\epsilon_t\sim{\mathcal{N}}({\bm{0}},{\bm{I}})}\left[\lVert\epsilon_\theta^{(t)}(\sqrt{\alpha_t}{\bm{x}}_0+\sqrt{1-\alpha_t}\epsilon_t)-\epsilon_t\rVert_2^2\right]$$ (5)

where $\epsilon_\theta:=\{\epsilon_\theta^{(t)}\}_{t=1}^T$ is a set of $T$ functions, each $\epsilon_\theta^{(t)}:{\mathcal{X}}\to{\mathcal{X}}$ (indexed by $t$) is a function with trainable parameters $\theta^{(t)}$, and $\gamma:=[\gamma_1,\ldots,\gamma_T]$ is a vector of positive coefficients in the objective that depends on $\alpha_{1:T}$. In Ho et al. (2020), the objective with $\gamma={\bm{1}}$ is optimized instead to maximize generation performance of the trained model; this is also the same objective used in noise conditional score networks (Song & Ermon, 2019) based on score matching (Hyvärinen, 2005; Vincent, 2011). From a trained model, ${\bm{x}}_0$ is sampled by first sampling ${\bm{x}}_T$ from the prior $p_\theta({\bm{x}}_T)$, and then sampling ${\bm{x}}_{t-1}$ from the generative processes iteratively.

The length $T$ of the forward process is an important hyperparameter in DDPMs. From a variational perspective, a large $T$ allows the reverse process to be close to a Gaussian (Sohl-Dickstein et al., 2015), so that the generative process modeled with Gaussian conditional distributions becomes a good approximation; this motivates the choice of large $T$ values, such as $T=1000$ in Ho et al. (2020). However, as all $T$ iterations have to be performed sequentially, instead of in parallel, to obtain a sample ${\bm{x}}_0$, sampling from DDPMs is much slower than sampling from other deep generative models, which makes them impractical for tasks where compute is limited and latency is critical.

3 Variational Inference for non-Markovian Forward Processes

Because the generative model approximates the reverse of the inference process, we need to rethink the inference process in order to reduce the number of iterations required by the generative model. Our key observation is that the DDPM objective in the form of $L_\gamma$ only depends on the marginals $q({\bm{x}}_t|{\bm{x}}_0)$ (we slightly abuse this term, as well as "joints", when only conditioned on ${\bm{x}}_0$), but not directly on the joint $q({\bm{x}}_{1:T}|{\bm{x}}_0)$. Since there are many inference distributions (joints) with the same marginals, we explore alternative inference processes that are non-Markovian, which leads to new generative processes (Figure 1, right). These non-Markovian inference processes lead to the same surrogate objective function as DDPM, as we will show below. In Appendix A, we show that the non-Markovian perspective also applies beyond the Gaussian case.

3.1 Non-Markovian forward processes

Let us consider a family ${\mathcal{Q}}$ of inference distributions, indexed by a real vector $\sigma\in\mathbb{R}_{\geq 0}^T$:

$$q_\sigma({\bm{x}}_{1:T}|{\bm{x}}_0):=q_\sigma({\bm{x}}_T|{\bm{x}}_0)\prod_{t=2}^{T}q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,{\bm{x}}_0)$$ (6)

where $q_\sigma({\bm{x}}_T|{\bm{x}}_0)={\mathcal{N}}(\sqrt{\alpha_T}{\bm{x}}_0,(1-\alpha_T){\bm{I}})$ and for all $t>1$,

$$q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,{\bm{x}}_0)={\mathcal{N}}\left(\sqrt{\alpha_{t-1}}{\bm{x}}_0+\sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\frac{{\bm{x}}_t-\sqrt{\alpha_t}{\bm{x}}_0}{\sqrt{1-\alpha_t}},\ \sigma_t^2{\bm{I}}\right).$$ (7)

The mean function is chosen in order to ensure that $q_\sigma({\bm{x}}_t|{\bm{x}}_0)={\mathcal{N}}(\sqrt{\alpha_t}{\bm{x}}_0,(1-\alpha_t){\bm{I}})$ for all $t$ (see Lemma 1 of Appendix B), so that it defines a joint inference distribution that matches the "marginals" as desired. The forward process (we overload the term "forward process" for cases where the inference model is not a diffusion) can be derived from Bayes' rule:

$$q_\sigma({\bm{x}}_t|{\bm{x}}_{t-1},{\bm{x}}_0)=\frac{q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,{\bm{x}}_0)\,q_\sigma({\bm{x}}_t|{\bm{x}}_0)}{q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_0)},$$ (8)

which is also Gaussian (although we do not use this fact for the remainder of this paper). Unlike the diffusion process in Eq. (3), the forward process here is no longer Markovian, since each ${\bm{x}}_t$ could depend on both ${\bm{x}}_{t-1}$ and ${\bm{x}}_0$. The magnitude of $\sigma$ controls how stochastic the forward process is; when $\sigma\to{\bm{0}}$, we reach an extreme case where, as long as we observe ${\bm{x}}_0$ and ${\bm{x}}_t$ for some $t$, ${\bm{x}}_{t-1}$ becomes known and fixed.
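As a quick numerical sanity check of this marginal-matching property, the following scalar sketch draws ${\bm{x}}_t$ via Eq. (4) and then ${\bm{x}}_{t-1}$ via Eq. (7), and verifies that the resulting mean and standard deviation match $\sqrt{\alpha_{t-1}}{\bm{x}}_0$ and $\sqrt{1-\alpha_{t-1}}$. The particular schedule values are illustrative only.

```python
import torch

# Monte Carlo check (a sketch) that the reverse conditional of Eq. (7) preserves the
# marginals of Eq. (4): x_{t-1} drawn through q_sigma(x_{t-1} | x_t, x_0) should be
# distributed as N(sqrt(alpha_{t-1}) x_0, (1 - alpha_{t-1}) I) for any fixed x_0.
torch.manual_seed(0)
a_t, a_prev, sigma = 0.5, 0.7, 0.2           # illustrative alpha_t, alpha_{t-1}, sigma_t
x0, n = torch.tensor(1.3), 1_000_000
x_t = a_t ** 0.5 * x0 + (1 - a_t) ** 0.5 * torch.randn(n)                 # Eq. (4)
mean = (a_prev ** 0.5 * x0
        + (1 - a_prev - sigma ** 2) ** 0.5 * (x_t - a_t ** 0.5 * x0) / (1 - a_t) ** 0.5)
x_prev = mean + sigma * torch.randn(n)                                    # Eq. (7)
print(x_prev.mean().item(), (a_prev ** 0.5 * x0).item())                  # both ~ 1.09
print(x_prev.std().item(), (1 - a_prev) ** 0.5)                           # both ~ 0.55
```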

3.2 Generative process and unified variational inference objective

Next, we define a trainable generative process $p_\theta({\bm{x}}_{0:T})$ where each $p_\theta^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_t)$ leverages knowledge of $q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,{\bm{x}}_0)$. Intuitively, given a noisy observation ${\bm{x}}_t$, we first make a prediction of the corresponding ${\bm{x}}_0$ (learning a distribution over the predictions is also possible, but empirically we found little benefit from it), and then use it to obtain a sample ${\bm{x}}_{t-1}$ through the reverse conditional distribution $q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,{\bm{x}}_0)$, which we have defined.

For some ${\bm{x}}_0\sim q({\bm{x}}_0)$ and $\epsilon_t\sim{\mathcal{N}}({\bm{0}},{\bm{I}})$, ${\bm{x}}_t$ can be obtained using Eq. (4). The model $\epsilon_\theta^{(t)}({\bm{x}}_t)$ then attempts to predict $\epsilon_t$ from ${\bm{x}}_t$, without knowledge of ${\bm{x}}_0$. By rewriting Eq. (4), one can then predict the denoised observation, which is a prediction of ${\bm{x}}_0$ given ${\bm{x}}_t$:

$$f_\theta^{(t)}({\bm{x}}_t):=({\bm{x}}_t-\sqrt{1-\alpha_t}\cdot\epsilon_\theta^{(t)}({\bm{x}}_t))/\sqrt{\alpha_t}.$$ (9)

We can then define the generative process with a fixed prior $p_\theta({\bm{x}}_T)={\mathcal{N}}({\bm{0}},{\bm{I}})$ and

$$p_\theta^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_t)=\begin{cases}{\mathcal{N}}(f_\theta^{(1)}({\bm{x}}_1),\sigma_1^2{\bm{I}})&\text{if }t=1\\ q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,f_\theta^{(t)}({\bm{x}}_t))&\text{otherwise,}\end{cases}$$ (10)

where $q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,f_\theta^{(t)}({\bm{x}}_t))$ is defined as in Eq. (7) with ${\bm{x}}_0$ replaced by $f_\theta^{(t)}({\bm{x}}_t)$. We add some Gaussian noise (with covariance $\sigma_1^2{\bm{I}}$) for the case of $t=1$ to ensure that the generative process is supported everywhere.

We optimize $\theta$ via the following variational inference objective (which is a functional over $\epsilon_\theta$):

$$J_\sigma(\epsilon_\theta):={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q_\sigma({\bm{x}}_{0:T})}[\log q_\sigma({\bm{x}}_{1:T}|{\bm{x}}_0)-\log p_\theta({\bm{x}}_{0:T})]$$ (11)
$$={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q_\sigma({\bm{x}}_{0:T})}\left[\log q_\sigma({\bm{x}}_T|{\bm{x}}_0)+\sum_{t=2}^{T}\log q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,{\bm{x}}_0)-\sum_{t=1}^{T}\log p_\theta^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_t)-\log p_\theta({\bm{x}}_T)\right]$$

where we factorize $q_\sigma({\bm{x}}_{1:T}|{\bm{x}}_0)$ according to Eq. (6) and $p_\theta({\bm{x}}_{0:T})$ according to Eq. (1).

From the definition of $J_\sigma$, it would appear that a different model has to be trained for every choice of $\sigma$, since it corresponds to a different variational objective (and a different generative process). However, $J_\sigma$ is equivalent to $L_\gamma$ for certain weights $\gamma$, as we show below.

Theorem 1.

For all $\sigma>{\bm{0}}$, there exists $\gamma\in\mathbb{R}_{>0}^T$ and $C\in\mathbb{R}$, such that $J_\sigma=L_\gamma+C$.

The variational objective $L_\gamma$ is special in the sense that if the parameters $\theta$ of the models $\epsilon_\theta^{(t)}$ are not shared across different $t$, then the optimal solution for $\epsilon_\theta$ will not depend on the weights $\gamma$ (as the global optimum is achieved by separately maximizing each term in the sum). This property of $L_\gamma$ has two implications. On the one hand, it justifies the use of $L_{\bm{1}}$ as a surrogate objective function for the variational lower bound in DDPMs; on the other hand, since $J_\sigma$ is equivalent to some $L_\gamma$ by Theorem 1, the optimal solution of $J_\sigma$ is also the same as that of $L_{\bm{1}}$. Therefore, if parameters are not shared across $t$ in the model $\epsilon_\theta$, then the $L_{\bm{1}}$ objective used by Ho et al. (2020) can be used as a surrogate objective for the variational objective $J_\sigma$ as well.

4 Sampling from Generalized Generative Processes

With $L_{\bm{1}}$ as the objective, we are not only learning a generative process for the Markovian inference process considered in Sohl-Dickstein et al. (2015) and Ho et al. (2020), but also generative processes for many non-Markovian forward processes parametrized by $\sigma$ that we have described. Therefore, we can essentially use pretrained DDPM models as the solutions to the new objectives, and focus on finding a generative process that is better at producing samples subject to our needs by changing $\sigma$.

4.1 Denoising Diffusion Implicit Models

From $p_\theta({\bm{x}}_{1:T})$ in Eq. (10), one can generate a sample ${\bm{x}}_{t-1}$ from a sample ${\bm{x}}_t$ via:

$${\bm{x}}_{t-1}=\sqrt{\alpha_{t-1}}\underbrace{\left(\frac{{\bm{x}}_t-\sqrt{1-\alpha_t}\,\epsilon_\theta^{(t)}({\bm{x}}_t)}{\sqrt{\alpha_t}}\right)}_{\text{"predicted }{\bm{x}}_0\text{"}}+\underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\epsilon_\theta^{(t)}({\bm{x}}_t)}_{\text{"direction pointing to }{\bm{x}}_t\text{"}}+\underbrace{\sigma_t\epsilon_t}_{\text{random noise}}$$ (12)

where $\epsilon_t\sim{\mathcal{N}}({\bm{0}},{\bm{I}})$ is standard Gaussian noise independent of ${\bm{x}}_t$, and we define $\alpha_0:=1$. Different choices of $\sigma$ values result in different generative processes, all while using the same model $\epsilon_\theta$, so re-training the model is unnecessary. When $\sigma_t=\sqrt{(1-\alpha_{t-1})/(1-\alpha_t)}\sqrt{1-\alpha_t/\alpha_{t-1}}$ for all $t$, the forward process becomes Markovian, and the generative process becomes a DDPM.
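As a concrete reading of Eq. (12), the sketch below performs one generative step from ${\bm{x}}_t$ to ${\bm{x}}_{t-1}$ for an arbitrary choice of $\sigma_t$. The interface of `eps_model` (standing in for the trained $\epsilon_\theta^{(t)}$) and the passing of scalar schedule values are assumptions of this sketch; setting `sigma_t = 0` gives the deterministic DDIM update, while the Markovian choice of $\sigma_t$ above recovers DDPM.

```python
import torch

def generalized_step(eps_model, x_t, t, alpha_t, alpha_prev, sigma_t):
    """One step of Eq. (12): map x_t to x_{t-1}, with alpha_prev = alpha_{t-1} (alpha_0 := 1)."""
    eps = eps_model(x_t, t)
    x0_pred = (x_t - (1 - alpha_t) ** 0.5 * eps) / alpha_t ** 0.5       # "predicted x_0" (Eq. (9))
    dir_xt = (1 - alpha_prev - sigma_t ** 2) ** 0.5 * eps               # "direction pointing to x_t"
    noise = sigma_t * torch.randn_like(x_t) if sigma_t > 0 else 0.0     # random noise term
    return alpha_prev ** 0.5 * x0_pred + dir_xt + noise
```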

We note another special case when $\sigma_t=0$ for all $t$ (although this case is not covered in Theorem 1, we can always approximate it by making $\sigma_t$ very small); the forward process becomes deterministic given ${\bm{x}}_{t-1}$ and ${\bm{x}}_0$, except for $t=1$; in the generative process, the coefficient before the random noise $\epsilon_t$ becomes zero. The resulting model becomes an implicit probabilistic model (Mohamed & Lakshminarayanan, 2016), where samples are generated from latent variables with a fixed procedure (from ${\bm{x}}_T$ to ${\bm{x}}_0$). We name this the denoising diffusion implicit model (DDIM, pronounced /d:Im/), because it is an implicit probabilistic model trained with the DDPM objective (despite the forward process no longer being a diffusion).

4.2 Accelerated generation processes

In the previous sections, the generative process is considered as the approximation to the reverse process; since the forward process has $T$ steps, the generative process is also forced to sample $T$ steps. However, as the denoising objective $L_{\bm{1}}$ does not depend on the specific forward procedure as long as $q_\sigma({\bm{x}}_t|{\bm{x}}_0)$ is fixed, we may also consider forward processes with lengths smaller than $T$, which accelerates the corresponding generative processes without having to train a different model.

Figure 2: Graphical model for accelerated generation, where $\tau=[1,3]$.

Let us consider the forward process as defined not on all the latent variables ${\bm{x}}_{1:T}$, but on a subset $\{{\bm{x}}_{\tau_1},\ldots,{\bm{x}}_{\tau_S}\}$, where $\tau$ is an increasing sub-sequence of $[1,\ldots,T]$ of length $S$. In particular, we define the sequential forward process over ${\bm{x}}_{\tau_1},\ldots,{\bm{x}}_{\tau_S}$ such that $q({\bm{x}}_{\tau_i}|{\bm{x}}_0)={\mathcal{N}}(\sqrt{\alpha_{\tau_i}}{\bm{x}}_0,(1-\alpha_{\tau_i}){\bm{I}})$ matches the "marginals" (see Figure 2 for an illustration). The generative process now samples latent variables according to $\text{reversed}(\tau)$, which we term the (sampling) trajectory. When the length of the sampling trajectory is much smaller than $T$, we may achieve significant increases in computational efficiency due to the iterative nature of the sampling process.

Using a similar argument as in Section 3, we can justify using the model trained with the $L_{\bm{1}}$ objective, so no changes are needed in training. We show that only slight changes to the updates in Eq. (12) are needed to obtain the new, faster generative processes, which applies to DDPM, DDIM, as well as all generative processes considered in Eq. (10). We include these details in Appendix C.1.
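A minimal sketch of the resulting accelerated sampler follows, under the same assumptions as the earlier sketches (a trained noise predictor `eps_model(x, t)` and the cumulative schedule $\alpha_{1:T}$). Setting all entries of `sigmas` to zero gives the DDIM of Section 4.1; the $\sigma$ choices discussed in Section 5 recover a DDPM-like process restricted to $\tau$.

```python
import torch

@torch.no_grad()
def accelerated_sample(eps_model, alpha, tau, shape, sigmas=None):
    """Sample along the sub-sequence tau via Eq. (12) restricted to tau (a sketch).

    alpha:  1-D tensor of length T holding alpha_1, ..., alpha_T from Eq. (4)
    tau:    increasing list of timesteps in [1, ..., T]
    sigmas: per-step noise scales sigma_{tau_i}; all zeros (default) gives DDIM
    """
    if sigmas is None:
        sigmas = [0.0] * len(tau)
    # Prepend alpha_0 := 1 so that alpha_full[t] is alpha_t for t = 0, ..., T.
    alpha_full = torch.cat([torch.ones(1, device=alpha.device, dtype=alpha.dtype), alpha])
    x = torch.randn(shape, device=alpha.device, dtype=alpha.dtype)       # x_{tau_S} ~ N(0, I)
    for i in reversed(range(len(tau))):
        t = tau[i]
        t_prev = tau[i - 1] if i > 0 else 0                              # previous element of tau
        a_t, a_prev = alpha_full[t], alpha_full[t_prev]
        s = sigmas[i]                                                    # must satisfy s**2 <= 1 - a_prev
        eps = eps_model(x, t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()              # Eq. (9)
        x = a_prev.sqrt() * x0_pred + (1 - a_prev - s ** 2).sqrt() * eps # Eq. (12) along tau
        if s > 0:
            x = x + s * torch.randn_like(x)
    return x
```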

In principle, this means that we can train a model with an arbitrary number of forward steps but only sample from some of them in the generative process. Therefore, the trained model could consider many more steps than what is considered in (Ho et al., 2020), or even a continuous time variable $t$ (Chen et al., 2020). We leave empirical investigations of this aspect as future work.

4.3 Relevance to Neural ODEs

Moreover, we can rewrite the DDIM iterate according to Eq. (12), and its similarity to Euler integration for solving ordinary differential equations (ODEs) becomes more apparent:

$$\frac{{\bm{x}}_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}}=\frac{{\bm{x}}_t}{\sqrt{\alpha_t}}+\left(\sqrt{\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}}-\sqrt{\frac{1-\alpha_t}{\alpha_t}}\right)\epsilon_\theta^{(t)}({\bm{x}}_t)$$ (13)

To derive the corresponding ODE, we can reparameterize $(\sqrt{1-\alpha}/\sqrt{\alpha})$ with $\sigma$ and $({\bm{x}}/\sqrt{\alpha})$ with $\bar{{\bm{x}}}$. In the continuous case, $\sigma$ and ${\bm{x}}$ are functions of $t$, where $\sigma:{\mathbb{R}}_{\geq 0}\to{\mathbb{R}}_{\geq 0}$ is continuous and increasing with $\sigma(0)=0$. Equation (13) can then be treated as an Euler method over the following ODE:

$$\mathrm{d}\bar{{\bm{x}}}(t)=\epsilon_\theta^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^2+1}}\right)\mathrm{d}\sigma(t),$$ (14)

where the initial condition is ${\bm{x}}(T)\sim{\mathcal{N}}(0,\sigma(T))$ for a very large $\sigma(T)$ (which corresponds to the case of $\alpha\approx 0$). This suggests that with enough discretization steps, we can also reverse the generation process (going from $t=0$ to $T$), which encodes ${\bm{x}}_0$ to ${\bm{x}}_T$ and simulates the reverse of the ODE in Eq. (14). This suggests that unlike DDPM, we can use DDIM to obtain encodings of the observations (in the form of ${\bm{x}}_T$), which might be useful for other downstream applications that require latent representations of a model.
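Since the deterministic case is an Euler discretization of Eq. (14), the same update can be run with the time order reversed to encode an observation, as described above. A hedged sketch of this encoding direction is given below, rearranging Eq. (13) to step from ${\bm{x}}_{t-\Delta t}$ to ${\bm{x}}_t$; evaluating $\epsilon_\theta$ at the current iterate is a first-order approximation and an assumption of this sketch, not a prescription from the paper.

```python
import torch

@torch.no_grad()
def ddim_encode(eps_model, alpha, tau, x0):
    """Encode x_0 into x_{tau_S} by running the Euler step of Eq. (13) in reverse order (a sketch)."""
    alpha_full = torch.cat([torch.ones(1, device=alpha.device, dtype=alpha.dtype), alpha])
    x = x0
    for i in range(len(tau)):
        t_prev = tau[i - 1] if i > 0 else 0
        t = tau[i]
        a_prev, a_t = alpha_full[t_prev], alpha_full[t]
        eps = eps_model(x, t)   # first-order approximation: evaluate eps_theta at the current iterate
        # Eq. (13) rearranged: x_t / sqrt(a_t) = x_{t-dt} / sqrt(a_{t-dt}) + (...) * eps
        x = a_t.sqrt() * (x / a_prev.sqrt()
                          + (((1 - a_t) / a_t).sqrt() - ((1 - a_prev) / a_prev).sqrt()) * eps)
    return x
```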

In a concurrent work, Song et al. (2020) proposed a "probability flow ODE" that aims to recover the marginal densities of a stochastic differential equation (SDE) based on scores, from which a similar sampling schedule can be obtained. Here, we state that our ODE is equivalent to a special case of theirs (which corresponds to a continuous-time analog of DDPM).

Proposition 1.

The ODE in Eq. (14) with the optimal model $\epsilon_\theta^{(t)}$ has an equivalent probability flow ODE corresponding to the "Variance-Exploding" SDE in Song et al. (2020).

We include the proof in Appendix B. While the ODEs are equivalent, the sampling procedures are not, since the Euler method for the probability flow ODE will make the following update:

$$\frac{{\bm{x}}_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}}=\frac{{\bm{x}}_t}{\sqrt{\alpha_t}}+\frac{1}{2}\left(\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}-\frac{1-\alpha_t}{\alpha_t}\right)\cdot\sqrt{\frac{\alpha_t}{1-\alpha_t}}\cdot\epsilon_\theta^{(t)}({\bm{x}}_t)$$ (15)

which is equivalent to ours if $\alpha_t$ and $\alpha_{t-\Delta t}$ are close enough. In fewer sampling steps, however, these choices will make a difference; we take Euler steps with respect to $\mathrm{d}\sigma(t)$ (which depends less directly on the scaling of "time" $t$) whereas Song et al. (2020) take Euler steps with respect to $\mathrm{d}t$.

5 Experiments

In this section, we show that DDIMs outperform DDPMs in terms of image generation when fewer iterations are considered, giving speed-ups of $10\times$ to $100\times$ over the original DDPM generation process. Moreover, unlike DDPMs, once the initial latent variables ${\bm{x}}_T$ are fixed, DDIMs retain high-level image features regardless of the generation trajectory, so they are able to perform interpolation directly from the latent space. DDIMs can also be used to encode samples that reconstruct them from the latent code, which DDPMs cannot do due to the stochastic sampling process.

For each dataset, we use the same trained model with $T=1000$ and the objective being $L_\gamma$ from Eq. (5) with $\gamma={\bm{1}}$; as we argued in Section 3, no changes are needed with regard to the training procedure. The only change that we make is how we produce samples from the model; we achieve this by controlling $\tau$ (which controls how fast the samples are obtained) and $\sigma$ (which interpolates between the deterministic DDIM and the stochastic DDPM).

We consider different sub-sequences $\tau$ of $[1,\ldots,T]$ and different variance hyperparameters $\sigma$ indexed by elements of $\tau$. To simplify comparisons, we consider $\sigma$ of the form:

$$\sigma_{\tau_i}(\eta)=\eta\sqrt{(1-\alpha_{\tau_{i-1}})/(1-\alpha_{\tau_i})}\sqrt{1-\alpha_{\tau_i}/\alpha_{\tau_{i-1}}},$$ (16)

where $\eta\in\mathbb{R}_{\geq 0}$ is a hyperparameter that we can directly control. This includes an original DDPM generative process when $\eta=1$ and DDIM when $\eta=0$. We also consider DDPM where the random noise has a larger standard deviation than $\sigma(1)$, which we denote as $\hat{\sigma}$: $\hat{\sigma}_{\tau_i}=\sqrt{1-\alpha_{\tau_i}/\alpha_{\tau_{i-1}}}$. This is used by the implementation in Ho et al. (2020) only to obtain the CIFAR10 samples, but not samples of the other datasets. We include more details in Appendix D.
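Eq. (16) translates into a small helper that produces the per-step noise scales, which can be fed to an accelerated sampler such as the sketch in Section 4.2. The indexing convention (`alpha` indexable by timestep with $\alpha_0:=1$ for the first element of $\tau$, e.g. the prepended schedule from the earlier sketch) is an assumption of this sketch.

```python
def sigma_schedule(alpha, tau, eta):
    """Per-step sigma_{tau_i}(eta) of Eq. (16); eta = 0 gives DDIM, eta = 1 gives DDPM."""
    sigmas = []
    for i in range(len(tau)):
        a_t = alpha[tau[i]]
        a_prev = alpha[tau[i - 1]] if i > 0 else 1.0      # alpha_{tau_0} := alpha_0 = 1
        sigmas.append(eta * ((1 - a_prev) / (1 - a_t)) ** 0.5
                          * (1 - a_t / a_prev) ** 0.5)
    return sigmas
```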

5.1 Sample quality and efficiency

In Table 1, we report the quality of the generated samples with models trained on CIFAR10 and CelebA, as measured by Frechet Inception Distance (FID (Heusel et al., 2017)), where we vary the number of timesteps used to generate a sample ($\dim(\tau)$) and the stochasticity of the process ($\eta$). As expected, the sample quality becomes higher as we increase $\dim(\tau)$, presenting a trade-off between sample quality and computational costs. We observe that DDIM ($\eta=0$) achieves the best sample quality when $\dim(\tau)$ is small, and DDPM ($\eta=1$ and $\hat{\sigma}$) typically has worse sample quality compared to its less stochastic counterparts with the same $\dim(\tau)$, except for the case of $\dim(\tau)=1000$ and $\hat{\sigma}$ reported by Ho et al. (2020), where DDIM is marginally worse. However, the sample quality of $\hat{\sigma}$ becomes much worse for smaller $\dim(\tau)$, which suggests that it is ill-suited for shorter trajectories. DDIM, on the other hand, achieves high sample quality much more consistently.

In Figure 3, we show CIFAR10 and CelebA samples with the same number of sampling steps and varying $\sigma$. For the DDPM, the sample quality deteriorates rapidly when the sampling trajectory has 10 steps. For the case of $\hat{\sigma}$, the generated images seem to have more noisy perturbations under short trajectories; this explains why the FID scores are much worse than those of other methods, as FID is very sensitive to such perturbations (as discussed in Jolicoeur-Martineau et al. (2020)).

In Figure 4, we show that the amount of time needed to produce a sample scales linearly with the length of the sample trajectory. This suggests that DDIM is useful for producing samples more efficiently, as samples can be generated in much fewer steps. Notably, DDIM is able to produce samples with quality comparable to 1000-step models within $20$ to $100$ steps, which is a $10\times$ to $50\times$ speed-up compared to the original DDPM. Even though DDPM could also achieve reasonable sample quality with 100 steps, DDIM requires much fewer steps to achieve this; on CelebA, the FID score of the 100-step DDPM is similar to that of the 20-step DDIM.

Table 1: CIFAR10 and CelebA image generation measured in FID. $\eta=1.0$ and $\hat{\sigma}$ are cases of DDPM (although Ho et al. (2020) only considered $T=1000$ steps, and $S<T$ can be seen as simulating DDPMs trained with $S$ steps), and $\eta=0.0$ indicates DDIM.
                    CIFAR10 (32×32)                       CelebA (64×64)
S            10      20      50      100     1000     10      20      50      100     1000
η = 0.0      13.36   6.84    4.67    4.16    4.04     17.33   13.73   9.17    6.53    3.51
η = 0.2      14.04   7.11    4.77    4.25    4.09     17.66   14.11   9.51    6.79    3.64
η = 0.5      16.66   8.35    5.25    4.46    4.29     19.86   16.06   11.01   8.09    4.28
η = 1.0      41.07   18.36   8.01    5.78    4.73     33.12   26.03   18.48   13.93   5.98
σ̂           367.43  133.37  32.72   9.99    3.17     299.71  183.83  71.71   45.20   3.26
Figure 3: CIFAR10 and CelebA samples with $\dim(\tau)=10$ and $\dim(\tau)=100$.
Figure 4: Hours to sample 50k images with one Nvidia 2080 Ti GPU and samples at different steps.

5.2 Sample consistency in DDIMs

For DDIM, the generative process is deterministic, and ${\bm{x}}_0$ depends only on the initial state ${\bm{x}}_T$. In Figure 5, we observe the generated images under different generative trajectories (i.e., different $\tau$) while starting with the same initial ${\bm{x}}_T$. Interestingly, for the generated images with the same initial ${\bm{x}}_T$, most high-level features are similar, regardless of the generative trajectory. In many cases, samples generated with only 20 steps are already very similar to ones generated with 1000 steps in terms of high-level features, with only minor differences in details. Therefore, it would appear that ${\bm{x}}_T$ alone is an informative latent encoding of the image; the minor details that affect sample quality are encoded in the parameters, as longer sample trajectories give better quality samples but do not significantly affect the high-level features. We show more samples in Appendix D.4.

Figure 5: Samples from DDIM with the same random ${\bm{x}}_T$ and different numbers of steps.

5.3 Interpolation in deterministic generative processes

Figure 6: Interpolation of samples from DDIM with $\dim(\tau)=50$.

Since the high-level features of the DDIM sample are encoded by x_T, we are interested to see whether it would exhibit the semantic interpolation effect similar to that observed in other implicit probabilistic models, such as GANs (Goodfellow et al., 2014). This is different from the interpolation procedure in Ho et al. (2020), since in DDPM the same x_T would lead to highly diverse x_0 due to the stochastic generative process.⁶ In Figure 6, we show that simple interpolations in x_T can lead to semantically meaningful interpolations between two samples. We include more details and samples in Appendix D.5. This allows DDIM to control the generated images on a high level directly through the latent variables, which DDPMs cannot.
由于 DDIM 样本的高级特征由 x_T 编码,我们有兴趣了解它是否会表现出与其他隐式概率模型(如 GANs(Goodfellow 等人,2014))中观察到的类似的语义插值效应。这与 Ho 等人(2020)中的插值过程不同,因为在 DDPM 中,由于随机生成过程⁶,相同的 x_T 会导致高度多样化的 x_0。在图 6 中,我们表明在 x_T 中的简单插值可导致两个样本之间具有语义意义的插值。我们在附录 D.5 中包含了更多细节和样本。这使得 DDIM 能够直接通过潜在变量在高层次上控制生成的图像,而 DDPM 则不能。

⁶ Although it might be possible if one interpolates all T noises, like what is done in Song & Ermon (2020).
⁶ 尽管如果像 Song & Ermon(2020)中所做的那样对所有 T 个噪声进行插值,这也许是可能的。
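To make the latent-space interpolation concrete, below is a minimal sketch of spherical linear interpolation (slerp; Shoemake, 1985) between two latents x_T. The exact scheme used for Figure 6 is described in Appendix D.5, so treat this function as an illustrative assumption rather than the paper's exact procedure; all names are hypothetical.

```python
import numpy as np

def slerp(z0, z1, alpha):
    """Spherical linear interpolation between two latents x_T drawn from N(0, I).

    A sketch of the latent-space interpolation discussed above;
    alpha = 0 returns z0 and alpha = 1 returns z1.
    """
    z0_flat, z1_flat = z0.ravel(), z1.ravel()
    cos_theta = np.dot(z0_flat, z1_flat) / (np.linalg.norm(z0_flat) * np.linalg.norm(z1_flat))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return (np.sin((1 - alpha) * theta) * z0 + np.sin(alpha * theta) * z1) / np.sin(theta)

# Each interpolated latent would then be decoded with the deterministic DDIM generative process.
```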

5.4 Reconstruction from Latent Space
5.4 从潜在空间进行重建

As DDIM is the Euler integration for a particular ODE, it would be interesting to see whether it can encode from x_0 to x_T (reverse of Eq. (14)) and reconstruct x_0 from the resulting x_T (forward of Eq. (14)).⁷ We consider encoding and decoding on the CIFAR-10 test set with the CIFAR-10 model, using S steps for both encoding and decoding; we report the per-dimension mean squared error (scaled to [0, 1]) in Table 2. Our results show that DDIMs have lower reconstruction error for larger S values and have properties similar to Neural ODEs and normalizing flows. The same cannot be said for DDPMs due to their stochastic nature.
由于 DDIM 是针对特定常微分方程的欧拉积分,研究它是否能够从 x_0 编码到 x_T(式(14)的逆过程),以及从所得的 x_T 重建 x_0(式(14)的正过程)⁷,这将是很有趣的。我们在 CIFAR-10 测试集上使用 CIFAR-10 模型进行编码和解码,编码和解码均采用 S 步;我们在表 2 中报告了每维均方误差(缩放到 [0, 1])。我们的结果表明,对于较大的 S 值,DDIM 的重建误差较低,并且具有与神经 ODE 和归一化流类似的特性。由于其随机性,对于 DDPM 则不能这样说。

⁷ Since x_T and x_0 have the same dimensions, their compression qualities are not our immediate concern.
⁷ 由于 x_T 和 x_0 具有相同的维度,它们的压缩质量并非我们当下所关注的。

Table 2: Reconstruction error with DDIM on CIFAR-10 test set, rounded to 10⁻⁴.
表 2:在 CIFAR-10 测试集上使用 DDIM 的重建误差,四舍五入到 10⁻⁴。
S        10       20       50       100      200      500      1000
Error    0.014    0.0065   0.0023   0.0009   0.0004   0.0001   0.0001
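As a sketch of how such an encode-then-decode experiment can be run, the deterministic (η = 0) update can be applied with the α schedule traversed in either direction. The snippet below is an assumption-level illustration (eps_model is a hypothetical wrapper around ε_θ, and the timestep indexing is simplified), not the exact evaluation code behind Table 2.

```python
import numpy as np

def ddim_deterministic_step(x, eps, a_cur, a_next):
    """One Euler step of the deterministic (eta = 0) DDIM update.

    x: current iterate, eps: noise prediction at the current step,
    a_cur / a_next: cumulative alpha at the current and next selected timestep.
    a_next > a_cur denoises; a_next < a_cur encodes (the reverse of Eq. (14)).
    """
    x0_pred = (x - np.sqrt(1.0 - a_cur) * eps) / np.sqrt(a_cur)
    return np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps

def reconstruct(x0, eps_model, alphas):
    """Encode x0 -> x_T, decode back, and return the per-dimension MSE.

    eps_model(x, t) is a hypothetical wrapper around epsilon_theta^{(t)};
    alphas[t] is the cumulative alpha at timestep t, with alphas[0] close to 1.
    """
    ts = list(range(len(alphas)))
    x = x0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):              # encode: alpha decreases
        x = ddim_deterministic_step(x, eps_model(x, t_cur), alphas[t_cur], alphas[t_next])
    for t_cur, t_next in zip(ts[::-1][:-1], ts[::-1][1:]):  # decode: alpha increases
        x = ddim_deterministic_step(x, eps_model(x, t_cur), alphas[t_cur], alphas[t_next])
    return np.mean((x - x0) ** 2)
```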

6 Related Work 6 相关工作

Our work is based on a large family of existing methods on learning generative models as transition operators of Markov chains (Sohl-Dickstein et al., 2015; Bengio et al., 2014; Salimans et al., 2014; Song et al., 2017; Goyal et al., 2017; Levy et al., 2017). Among them, denoising diffusion probabilistic models (DDPMs, Ho et al. (2020)) and noise conditional score networks (NCSN, Song & Ermon (2019; 2020)) have recently achieved high sample quality comparable to GANs (Brock et al., 2018; Karras et al., 2018). DDPMs optimize a variational lower bound to the log-likelihood, whereas NCSNs optimize the score matching objective (Hyvärinen, 2005) over a nonparametric Parzen density estimator of the data (Vincent, 2011; Raphan & Simoncelli, 2011).
我们的工作基于一大类现有的将生成模型作为马尔可夫链的转移算子来学习的方法(Sohl-Dickstein 等人,2015;Bengio 等人,2014;Salimans 等人,2014;Song 等人,2017;Goyal 等人,2017;Levy 等人,2017)。其中,去噪扩散概率模型(DDPMs,Ho 等人(2020))和噪声条件得分网络(NCSN,Song 和 Ermon(2019;2020))最近取得了与生成对抗网络(GANs)(Brock 等人,2018;Karras 等人,2018)相当的高样本质量。DDPMs 优化对数似然的变分下界,而 NCSNs 在数据的非参数 Parzen 密度估计器(Vincent,2011;Raphan 和 Simoncelli,2011)上优化得分匹配目标(Hyvärinen,2005)。

Despite their different motivations, DDPMs and NCSNs are closely related. Both use a denoising autoencoder objective for many noise levels, and both use a procedure similar to Langevin dynamics to produce samples (Neal et al., 2011). Since Langevin dynamics is a discretization of a gradient flow (Jordan et al., 1998), both DDPM and NCSN require many steps to achieve good sample quality. This aligns with the observation that DDPM and existing NCSN methods have trouble generating high-quality samples in a few iterations.
尽管它们的动机不同,但 DDPMs 和 NCSNs 密切相关。两者都针对多个噪声水平使用去噪自编码器目标,并且都使用类似于朗之万动力学的过程来生成样本(Neal 等人,2011)。由于朗之万动力学是梯度流的离散化(Jordan 等人,1998),DDPM 和 NCSN 都需要许多步骤才能获得良好的样本质量。这与观察到的 DDPM 和现有的 NCSN 方法在几次迭代中难以生成高质量样本的情况相符。

DDIM, on the other hand, is an implicit generative model (Mohamed & Lakshminarayanan, 2016) where samples are uniquely determined from the latent variables. Hence, DDIM has certain properties that resemble GANs (Goodfellow et al., 2014) and invertible flows (Dinh et al., 2016), such as the ability to produce semantically meaningful interpolations. We derive DDIM from a purely variational perspective, where the restrictions of Langevin dynamics are not relevant; this could partially explain why we are able to observe superior sample quality compared to DDPM under fewer iterations. The sampling procedure of DDIM is also reminiscent of neural networks with continuous depth (Chen et al., 2018; Grathwohl et al., 2018), since the samples it produces from the same latent variable have similar high-level visual features, regardless of the specific sample trajectory.
另一方面,DDIM 是一种隐式生成模型(Mohamed 和 Lakshminarayanan,2016),其中样本由潜在变量唯一确定。因此,DDIM 具有某些类似于 GANs(Goodfellow 等人,2014)和可逆流(Dinh 等人,2016)的特性,例如能够产生具有语义意义的插值。我们从纯粹的变分角度推导 DDIM,其中朗之万动力学的限制并不相关;这可以部分解释为什么在较少的迭代次数下,我们能够观察到与 DDPM 相比更优的样本质量。DDIM 的采样过程也让人联想到具有连续深度的神经网络(Chen 等人,2018;Grathwohl 等人,2018),因为它从相同的潜在变量产生的样本具有相似的高级视觉特征,而与特定的样本轨迹无关。

7 Discussion 7 讨论

We have presented DDIMs – an implicit generative model trained with denoising auto-encoding / score matching objectives – from a purely variational perspective. DDIM is able to generate high-quality samples much more efficiently than existing DDPMs and NCSNs, with the ability to perform meaningful interpolations from the latent space. The non-Markovian forward process presented here seems to suggest continuous forward processes other than Gaussian (which cannot be done in the original diffusion framework, since Gaussian is the only stable distribution with finite variance). We also demonstrated a discrete case with a multinomial forward process in Appendix A, and it would be interesting to investigate similar alternatives for other combinatorial structures.
我们从纯粹的变分角度提出了 DDIMs——一种使用去噪自编码/得分匹配目标训练的隐式生成模型。DDIM 能够比现有的 DDPMs 和 NCSNs 更高效地生成高质量样本,并且能够从潜在空间进行有意义的插值。这里提出的非马尔可夫前向过程似乎表明了除高斯以外的连续前向过程(这在原始扩散框架中无法实现,因为高斯是具有有限方差的唯一稳定分布)。我们还在附录 A 中展示了具有多项分布前向过程的离散情况,研究其他组合结构的类似替代方案将是有趣的。

Moreover, since the sampling procedure of DDIMs is similar to that of a neural ODE, it would be interesting to see if methods that decrease the discretization error in ODEs, including multi-step methods such as Adams-Bashforth (Butcher & Goodwin, 2008), could be helpful for further improving sample quality in fewer steps (Queiruga et al., 2020). It is also relevant to investigate whether DDIMs exhibit other properties of existing implicit models (Bau et al., 2019).
此外,由于 DDIMs 的采样过程与神经 ODE 的采样过程相似,研究包括 Adams-Bashforth(Butcher & Goodwin,2008)等多步方法在内的降低 ODE 离散化误差的方法是否有助于在更少的步骤中进一步提高样本质量(Queiruga 等人,2020)将是很有趣的。研究 DDIMs 是否表现出现有隐式模型的其他特性(Bau 等人,2019)也具有相关性。

Acknowledgements 致谢

The authors would like to thank Yang Song and Shengjia Zhao for helpful discussions over the ideas, Kuno Kim for reviewing an earlier draft of the paper, and Sharvil Nanavati and Sophie Liu for identifying typos. This research was supported by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024), and Amazon AWS.
作者们想要感谢杨松和赵胜家就这些想法进行的有益讨论,感谢 Kuno Kim 对论文早期草稿的审阅,以及感谢 Sharvil Nanavati 和 Sophie Liu 指出的排版错误。本研究得到了美国国家科学基金会(#1651565,#1522054,#1733686)、美国海军研究办公室(N00014 - 19 - 1 - 2145)、美国空军科学研究办公室(FA9550 - 19 - 1 - 0024)和亚马逊网络服务的支持。

References

  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, January 2017.
  • Bau et al. (2019) David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Seeing what a gan cannot generate. In Proceedings of the IEEE International Conference on Computer Vision, pp.  4502–4511, 2019.
  • Bengio et al. (2014) Yoshua Bengio, Eric Laufer, Guillaume Alain, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. In International Conference on Machine Learning, pp. 226–234, January 2014.
  • Bishop (2006) Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.
  • Brock et al. (2018) Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, September 2018.
  • Butcher & Goodwin (2008) John Charles Butcher and Nicolette Goodwin. Numerical methods for ordinary differential equations, volume 2. Wiley Online Library, 2008.
  • Chen et al. (2020) Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, September 2020.
  • Chen et al. (2018) Ricky T Q Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, June 2018.
  • Dinh et al. (2016) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, May 2016.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Goyal et al. (2017) Anirudh Goyal, Nan Rosemary Ke, Surya Ganguli, and Yoshua Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net. In Advances in Neural Information Processing Systems, pp. 4392–4402, 2017.
  • Grathwohl et al. (2018) Will Grathwohl, Ricky T Q Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, October 2018.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5769–5779, 2017.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two Time-Scale update rule converge to a local nash equilibrium. arXiv preprint arXiv:1706.08500, June 2017.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, June 2020.
  • Hyvärinen (2005) Aapo Hyvärinen. Estimation of Non-Normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.
  • Jolicoeur-Martineau et al. (2020) Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Rémi Tachet des Combes, and Ioannis Mitliagkas. Adversarial score matching and improved sampling for image generation. September 2020.
  • Jordan et al. (1998) Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker–planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
  • Karras et al. (2018) Tero Karras, Samuli Laine, and Timo Aila. A Style-Based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, December 2018.
  • Karras et al. (2020) Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  8110–8119, 2020.
  • Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-Encoding variational bayes. arXiv preprint arXiv:1312.6114v10, December 2013.
  • Levy et al. (2017) Daniel Levy, Matthew D Hoffman, and Jascha Sohl-Dickstein. Generalizing hamiltonian monte carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.
  • Mohamed & Lakshminarayanan (2016) Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, October 2016.
  • Neal et al. (2011) Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2, 2011.
  • Queiruga et al. (2020) Alejandro F Queiruga, N Benjamin Erichson, Dane Taylor, and Michael W Mahoney. Continuous-in-depth neural networks. arXiv preprint arXiv:2008.02389, 2020.
  • Raphan & Simoncelli (2011) Martin Raphan and Eero P Simoncelli. Least squares estimation without priors or supervision. Neural computation, 23(2):374–420, February 2011. ISSN 0899-7667, 1530-888X.
  • Rezende & Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, May 2015.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp.  234–241. Springer, 2015.
  • Salimans et al. (2014) Tim Salimans, Diederik P Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. arXiv preprint arXiv:1410.6460, October 2014.
  • Shoemake (1985) Ken Shoemake. Animating rotation with quaternion curves. In Proceedings of the 12th annual conference on Computer graphics and interactive techniques, pp.  245–254, 1985.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, March 2015.
  • Song et al. (2017) Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-nice-mc: Adversarial training for mcmc. arXiv preprint arXiv:1706.07561, June 2017.
  • Song & Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600, July 2019.
  • Song & Ermon (2020) Yang Song and Stefano Ermon. Improved techniques for training Score-Based generative models. arXiv preprint arXiv:2006.09011, June 2020.
  • Song et al. (2020) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • van den Oord et al. (2016a) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, September 2016a.
  • van den Oord et al. (2016b) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, January 2016b.
  • Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, May 2016.
  • Zhao et al. (2018) Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, and Stefano Ermon. Bias and generalization in deep generative models: An empirical study. In Advances in Neural Information Processing Systems, pp. 10792–10801, 2018.

Appendix A Non-Markovian Forward Processes for a Discrete Case
附录 A 离散情形下的非马尔可夫正向过程

In this section, we describe a non-Markovian forward processes for discrete data and corresponding variational objectives. Since the focus of this paper is to accelerate reverse models corresponding to the Gaussian diffusion, we leave empirical evaluations as future work.
在本节中,我们描述了一种用于离散数据的非马尔可夫正向过程以及相应的变分目标。由于本文的重点是加速与高斯扩散相对应的反向模型,我们将实证评估留作未来的工作。

For a categorical observation 𝒙0subscript𝒙0{\bm{x}}_{0} that is a one-hot vector with K𝐾K possible values, we define the forward process as follows. First, we have q(𝒙t|𝒙0)𝑞conditionalsubscript𝒙𝑡subscript𝒙0q({\bm{x}}_{t}|{\bm{x}}_{0}) as the following categorical distribution:
对于一个类别观测值 𝒙0subscript0{\bm{x}}_{0} (它是一个具有 KK 种可能值的独热向量),我们将前向过程定义如下。首先,我们有 q(𝒙t|𝒙0)conditionalsubscriptsubscript0q({\bm{x}}_{t}|{\bm{x}}_{0}) 作为以下类别分布:

q({\bm{x}}_{t}|{\bm{x}}_{0})=\mathrm{Cat}(\alpha_{t}{\bm{x}}_{0}+(1-\alpha_{t}){\bm{1}}_{K})    (17)

where 𝟏KKsubscript1𝐾superscript𝐾{\bm{1}}_{K}\in\mathbb{R}^{K} is a vector with all entries being 1/K1𝐾1/K, and αtsubscript𝛼𝑡\alpha_{t} decreasing from α0=1subscript𝛼01\alpha_{0}=1 for t=0𝑡0t=0 to αT=0subscript𝛼𝑇0\alpha_{T}=0 for t=T𝑡𝑇t=T. Then we define q(𝒙t1|𝒙t,𝒙0)𝑞conditionalsubscript𝒙𝑡1subscript𝒙𝑡subscript𝒙0q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}) as the following mixture distribution:
其中 𝟏KKsubscript1superscript{\bm{1}}_{K}\in\mathbb{R}^{K} 是一个所有元素均为 1/K11/K 的向量, αtsubscript\alpha_{t}α0=1subscript01\alpha_{0}=1 针对 t=00t=0αT=0subscript0\alpha_{T}=0 针对 t=Tt=T 逐渐减小。然后,我们将 q(𝒙t1|𝒙t,𝒙0)conditionalsubscript1subscriptsubscript0q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}) 定义为如下的混合分布:

q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})=\begin{cases}\mathrm{Cat}({\bm{x}}_{t})&\text{with probability }\sigma_{t}\\ \mathrm{Cat}({\bm{x}}_{0})&\text{with probability }(\alpha_{t-1}-\sigma_{t}\alpha_{t})\\ \mathrm{Cat}({\bm{1}}_{K})&\text{with probability }(1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}\end{cases}    (18)

or equivalently: 或等价地:

q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})=\mathrm{Cat}\left(\sigma_{t}{\bm{x}}_{t}+(\alpha_{t-1}-\sigma_{t}\alpha_{t}){\bm{x}}_{0}+((1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}){\bm{1}}_{K}\right),    (19)

which is consistent with how we have defined q(𝒙t|𝒙0)𝑞conditionalsubscript𝒙𝑡subscript𝒙0q({\bm{x}}_{t}|{\bm{x}}_{0}).
这与我们对 q(𝒙t|𝒙0)conditionalsubscriptsubscript0q({\bm{x}}_{t}|{\bm{x}}_{0}) 的定义一致。

Similarly, we can define our reverse process pθ(𝒙t1|𝒙t)subscript𝑝𝜃conditionalsubscript𝒙𝑡1subscript𝒙𝑡p_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t}) as:
同样地,我们可以将我们的逆过程 pθ(𝒙t1|𝒙t)subscriptconditionalsubscript1subscriptp_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t}) 定义为:

p_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t})=\mathrm{Cat}\left(\sigma_{t}{\bm{x}}_{t}+(\alpha_{t-1}-\sigma_{t}\alpha_{t})f_{\theta}^{(t)}({\bm{x}}_{t})+((1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}){\bm{1}}_{K}\right),    (20)

where fθ(t)(𝒙t)superscriptsubscript𝑓𝜃𝑡subscript𝒙𝑡f_{\theta}^{(t)}({\bm{x}}_{t}) maps 𝒙tsubscript𝒙𝑡{\bm{x}}_{t} to a K𝐾K-dimensional vector. As (1αt1)(1αt)σt01subscript𝛼𝑡11subscript𝛼𝑡subscript𝜎𝑡0(1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}\to 0, the sampling process will become less stochastic, in the sense that it will either choose 𝒙tsubscript𝒙𝑡{\bm{x}}_{t} or the predicted 𝒙0subscript𝒙0{\bm{x}}_{0} with high probability. The KL divergence
其中 fθ(t)(𝒙t)superscriptsubscriptsubscriptf_{\theta}^{(t)}({\bm{x}}_{t})𝒙tsubscript{\bm{x}}_{t} 映射到一个 KK 维向量。随着 (1αt1)(1αt)σt01subscript11subscriptsubscript0(1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}\to 0 ,采样过程的随机性会降低,从某种意义上说,它会以高概率选择 𝒙tsubscript{\bm{x}}_{t} 或预测的 𝒙0subscript0{\bm{x}}_{0} 。KL 散度

D_{\mathrm{KL}}(q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|p_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t}))    (21)

is well-defined, and is simply the KL divergence between two categoricals. Therefore, the resulting variational objective function should be easy to optimize as well. Moreover, as KL divergence is convex, we have this upper bound (which is tight when the right hand side goes to zero):
定义明确,它仅仅是两个分类之间的 KL 散度。因此,所得的变分目标函数也应该易于优化。此外,由于 KL 散度是凸的,我们有这个上界(当右侧趋于零时是紧的):

D_{\mathrm{KL}}(q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|p_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t}))\leq(\alpha_{t-1}-\sigma_{t}\alpha_{t})\,D_{\mathrm{KL}}(\mathrm{Cat}({\bm{x}}_{0})\|\mathrm{Cat}(f_{\theta}^{(t)}({\bm{x}}_{t}))).

The right hand side is simply a multi-class classification loss (up to constants), so we can arrive at similar arguments regarding how changes in σtsubscript𝜎𝑡\sigma_{t} do not affect the objective (up to re-weighting).
右侧仅仅是一个多类分类损失(至多为常数),因此我们可以得出类似的论点,即 σtsubscript\sigma_{t} 的变化如何不会影响目标(至多为重加权)。
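As a concrete illustration of this discrete forward process, the following minimal sketch (an assumption-level illustration, not code from the paper; no empirical evaluation is implied) draws x_t from Eq. (17) and x_{t-1} from the mixture in Eqs. (18)/(19).

```python
import numpy as np

def sample_xt(x0_onehot, alpha_t, rng):
    """Draw x_t ~ q(x_t | x_0) = Cat(alpha_t * x_0 + (1 - alpha_t) * 1_K), Eq. (17)."""
    K = x0_onehot.shape[0]
    probs = alpha_t * x0_onehot + (1.0 - alpha_t) / K
    return rng.multinomial(1, probs)

def sample_xtm1(xt_onehot, x0_onehot, alpha_tm1, alpha_t, sigma_t, rng):
    """Draw x_{t-1} ~ q(x_{t-1} | x_t, x_0), the categorical mixture of Eqs. (18)/(19)."""
    K = x0_onehot.shape[0]
    probs = (sigma_t * xt_onehot
             + (alpha_tm1 - sigma_t * alpha_t) * x0_onehot
             + ((1.0 - alpha_tm1) - (1.0 - alpha_t) * sigma_t) / K)
    return rng.multinomial(1, probs)

# Example with K = 5 classes and x_0 being class 2 (hypothetical hyperparameters).
rng = np.random.default_rng(0)
x0 = np.eye(5)[2]
xt = sample_xt(x0, alpha_t=0.6, rng=rng)
xtm1 = sample_xtm1(xt, x0, alpha_tm1=0.8, alpha_t=0.6, sigma_t=0.3, rng=rng)
```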

Appendix B Proofs 附录 B 证明

Lemma 1.
引理 1.

For q_σ(x_{1:T}|x_0) defined in Eq. (6) and q_σ(x_{t-1}|x_t, x_0) defined in Eq. (7), we have:
对于在式(6)中定义的 q_σ(x_{1:T}|x_0) 和在式(7)中定义的 q_σ(x_{t-1}|x_t, x_0),我们有:

q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t}}{\bm{x}}_{0},(1-\alpha_{t}){\bm{I}})    (22)
Proof.
证明。

Assume that for any t ≤ T, q_σ(x_t|x_0) = N(√(α_t) x_0, (1 − α_t) I) holds. If
假设对于任何 t ≤ T,q_σ(x_t|x_0) = N(√(α_t) x_0, (1 − α_t) I) 成立。如果

q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t-1}}{\bm{x}}_{0},(1-\alpha_{t-1}){\bm{I}})    (23)

then we can prove the statement with an induction argument for t from T to 1, since the base case (t = T) already holds.
那么我们可以用对 t 从 T 到 1 的归纳论证来证明该陈述,因为基础情况(t = T)已经成立。

First, we have that
首先,我们有

q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0}):=\int_{{\bm{x}}_{t}}q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0})\,q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\,\mathrm{d}{\bm{x}}_{t}

and 并且

q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t}}{\bm{x}}_{0},(1-\alpha_{t}){\bm{I}})    (24)
q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})={\mathcal{N}}\left(\sqrt{\alpha_{t-1}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t-1}-\sigma^{2}_{t}}\cdot\frac{{\bm{x}}_{t}-\sqrt{\alpha_{t}}{\bm{x}}_{0}}{\sqrt{1-\alpha_{t}}},\ \sigma_{t}^{2}{\bm{I}}\right).    (25)

From Bishop (2006) (2.115), we have that qσ(𝒙t1|𝒙0)subscript𝑞𝜎conditionalsubscript𝒙𝑡1subscript𝒙0q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0}) is Gaussian, denoted as 𝒩(μt1,Σt1)𝒩subscript𝜇𝑡1subscriptΣ𝑡1{\mathcal{N}}(\mu_{t-1},\Sigma_{t-1}) where
从 Bishop(2006)(2.115)中,我们得知 qσ(𝒙t1|𝒙0)subscriptconditionalsubscript1subscript0q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0}) 是高斯分布,记为 𝒩(μt1,Σt1)subscript1subscript1{\mathcal{N}}(\mu_{t-1},\Sigma_{t-1}) ,其中

\mu_{t-1}=\sqrt{\alpha_{t-1}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t-1}-\sigma^{2}_{t}}\cdot\frac{\sqrt{\alpha_{t}}{\bm{x}}_{0}-\sqrt{\alpha_{t}}{\bm{x}}_{0}}{\sqrt{1-\alpha_{t}}}    (26)
=\sqrt{\alpha_{t-1}}{\bm{x}}_{0}    (27)

and 并且

\Sigma_{t-1}=\sigma_{t}^{2}{\bm{I}}+\frac{1-\alpha_{t-1}-\sigma^{2}_{t}}{1-\alpha_{t}}(1-\alpha_{t}){\bm{I}}=(1-\alpha_{t-1}){\bm{I}}    (28)

Therefore, qσ(𝒙t1|𝒙0)=𝒩(αt1𝒙0,(1αt1)𝑰)subscript𝑞𝜎conditionalsubscript𝒙𝑡1subscript𝒙0𝒩subscript𝛼𝑡1subscript𝒙01subscript𝛼𝑡1𝑰q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t-1}}{\bm{x}}_{0},(1-\alpha_{t-1}){\bm{I}}), which allows us to apply the induction argument. ∎
因此, qσ(𝒙t1|𝒙0)=𝒩(αt1𝒙0,(1αt1)𝑰)subscriptconditionalsubscript1subscript0subscript1subscript01subscript1q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t-1}}{\bm{x}}_{0},(1-\alpha_{t-1}){\bm{I}}) ,这使我们能够应用归纳论证。∎

See Theorem 1. 见定理 1。

Proof.
证明。

From the definition of J_σ:
从 J_σ 的定义:

J_{\sigma}(\epsilon_{\theta}):={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q({\bm{x}}_{0:T})}\left[\log q_{\sigma}({\bm{x}}_{T}|{\bm{x}}_{0})+\sum_{t=2}^{T}\log q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})-\sum_{t=1}^{T}\log p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t})\right]    (29)
\equiv{\mathbb{E}}_{{\bm{x}}_{0:T}\sim q({\bm{x}}_{0:T})}\left[\sum_{t=2}^{T}D_{\mathrm{KL}}(q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t}))-\log p_{\theta}^{(1)}({\bm{x}}_{0}|{\bm{x}}_{1})\right]

where we use ≡ to denote "equal up to a value that does not depend on ε_θ (but may depend on q_σ)". For t > 1:
我们使用 ≡ 表示"在不依赖于 ε_θ(但可能依赖于 q_σ)的值上相等"。对于 t > 1:

{\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{t}\sim q({\bm{x}}_{0},{\bm{x}}_{t})}[D_{\mathrm{KL}}(q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t}))]
={\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{t}\sim q({\bm{x}}_{0},{\bm{x}}_{t})}[D_{\mathrm{KL}}(q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},f_{\theta}^{(t)}({\bm{x}}_{t})))]
\equiv{\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{t}\sim q({\bm{x}}_{0},{\bm{x}}_{t})}\left[\frac{\lVert{\bm{x}}_{0}-f_{\theta}^{(t)}({\bm{x}}_{t})\rVert_{2}^{2}}{2\sigma_{t}^{2}}\right]    (30)
={\mathbb{E}}_{{\bm{x}}_{0}\sim q({\bm{x}}_{0}),\,\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}),\,{\bm{x}}_{t}=\sqrt{\alpha_{t}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon}\left[\frac{\lVert\frac{{\bm{x}}_{t}-\sqrt{1-\alpha_{t}}\epsilon}{\sqrt{\alpha_{t}}}-\frac{{\bm{x}}_{t}-\sqrt{1-\alpha_{t}}\epsilon_{\theta}^{(t)}({\bm{x}}_{t})}{\sqrt{\alpha_{t}}}\rVert_{2}^{2}}{2\sigma_{t}^{2}}\right]    (31)
={\mathbb{E}}_{{\bm{x}}_{0}\sim q({\bm{x}}_{0}),\,\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}),\,{\bm{x}}_{t}=\sqrt{\alpha_{t}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon}\left[\frac{\lVert\epsilon-\epsilon_{\theta}^{(t)}({\bm{x}}_{t})\rVert_{2}^{2}}{2d\sigma_{t}^{2}\alpha_{t}}\right]    (32)

where d𝑑d is the dimension of 𝒙0subscript𝒙0{\bm{x}}_{0}. For t=1𝑡1t=1:
其中 dd𝒙0subscript0{\bm{x}}_{0} 的维度。对于 t=11t=1

{\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{1}\sim q({\bm{x}}_{0},{\bm{x}}_{1})}\left[-\log p_{\theta}^{(1)}({\bm{x}}_{0}|{\bm{x}}_{1})\right]\equiv{\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{1}\sim q({\bm{x}}_{0},{\bm{x}}_{1})}\left[\frac{\lVert{\bm{x}}_{0}-f_{\theta}^{(1)}({\bm{x}}_{1})\rVert_{2}^{2}}{2\sigma_{1}^{2}}\right]    (33)
={\mathbb{E}}_{{\bm{x}}_{0}\sim q({\bm{x}}_{0}),\,\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}),\,{\bm{x}}_{1}=\sqrt{\alpha_{1}}{\bm{x}}_{0}+\sqrt{1-\alpha_{1}}\epsilon}\left[\frac{\lVert\epsilon-\epsilon_{\theta}^{(1)}({\bm{x}}_{1})\rVert_{2}^{2}}{2d\sigma_{1}^{2}\alpha_{1}}\right]    (34)

Therefore, when γt=1/(2dσt2αt)subscript𝛾𝑡12𝑑superscriptsubscript𝜎𝑡2subscript𝛼𝑡\gamma_{t}=1/(2d\sigma_{t}^{2}\alpha_{t}) for all t{1,,T}𝑡1𝑇t\in\{1,\ldots,T\}, we have
因此,当对于所有 t{1,,T}1t\in\{1,\ldots,T\}γt=1/(2dσt2αt)subscript12superscriptsubscript2subscript\gamma_{t}=1/(2d\sigma_{t}^{2}\alpha_{t}) 时,我们有

J_{\sigma}(\epsilon_{\theta})\equiv\sum_{t=1}^{T}\frac{1}{2d\sigma_{t}^{2}\alpha_{t}}{\mathbb{E}}\left[\lVert\epsilon_{\theta}^{(t)}({\bm{x}}_{t})-\epsilon_{t}\rVert_{2}^{2}\right]=L_{\gamma}(\epsilon_{\theta})    (35)

for all ϵθsubscriptitalic-ϵ𝜃\epsilon_{\theta}. From the definition of ``\equiv'', we have that Jσ=Lγ+Csubscript𝐽𝜎subscript𝐿𝛾𝐶J_{\sigma}=L_{\gamma}+C. ∎
对于所有 ϵθsubscript\epsilon_{\theta} 。根据“ \equiv ”的定义,我们有 Jσ=Lγ+CsubscriptsubscriptJ_{\sigma}=L_{\gamma}+C 。∎



See Proposition 1. 见命题 1。

Proof.
证明。

In the context of the proof, we consider t as a continuous, independent "time" variable and x and α as functions of t. First, let us consider a reparametrization between DDIM and the VE-SDE⁸ by introducing the variables x̄ and σ:
在证明的背景下,我们将 t 视为连续、独立的"时间"变量,将 x 和 α 视为 t 的函数。首先,通过引入变量 x̄ 和 σ,让我们考虑 DDIM 和 VE-SDE⁸ 之间的重新参数化:

\bar{{\bm{x}}}(t)=\bar{{\bm{x}}}(0)+\sigma(t)\epsilon,\quad\epsilon\sim{\mathcal{N}}(0,{\bm{I}}),    (36)

for t ∈ [0, ∞) and an increasing continuous function σ: ℝ_{≥0} → ℝ_{≥0} where σ(0) = 0.
对于 t ∈ [0, ∞) 以及一个递增的连续函数 σ: ℝ_{≥0} → ℝ_{≥0},其中 σ(0) = 0。

⁸ Refer to (Song et al., 2020) for more details of VE-SDE.
⁸ 有关 VE-SDE 的更多详细信息,请参考(Song 等人,2020)。

We can then define α(t)𝛼𝑡\alpha(t) and 𝒙(t)𝒙𝑡{\bm{x}}(t) corresponding to DDIM case as:
然后,我们可以将对应于 DDIM 情况的 α(t)\alpha(t)𝒙(t){\bm{x}}(t) 定义为:

\bar{{\bm{x}}}(t)=\frac{{\bm{x}}(t)}{\sqrt{\alpha(t)}}    (37)
\sigma(t)=\sqrt{\frac{1-\alpha(t)}{\alpha(t)}}.    (38)

This also means that:
这也意味着:

{\bm{x}}(t)=\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}    (39)
\alpha(t)=\frac{1}{1+\sigma^{2}(t)},    (40)

which establishes a bijection between (x, α) and (x̄, σ). From Equation (4) we have (note that α(0) = 1):

\frac{{\bm{x}}(t)}{\sqrt{\alpha(t)}}=\frac{{\bm{x}}(0)}{\sqrt{\alpha(0)}}+\sqrt{\frac{1-\alpha(t)}{\alpha(t)}}\epsilon,\quad\epsilon\sim{\mathcal{N}}(0,{\bm{I}})    (41)

which can be reparametrized into a form that is consistent with VE-SDE:
其可重新参数化为与 VE-SDE 一致的形式:

\bar{{\bm{x}}}(t)=\bar{{\bm{x}}}(0)+\sigma(t)\epsilon.    (42)

Now, we derive the ODE forms for both DDIM and VE-SDE and show that they are equivalent.
现在,我们推导出 DDIM 和 VE - SDE 的常微分方程形式,并表明它们是等价的。

ODE form for DDIM
用于 DDIM 的常微分方程形式

We repeat Equation (13) here:
我们在此重复方程(13):

\frac{{\bm{x}}_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}}=\frac{{\bm{x}}_{t}}{\sqrt{\alpha_{t}}}+\left(\sqrt{\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}}-\sqrt{\frac{1-\alpha_{t}}{\alpha_{t}}}\right)\epsilon_{\theta}^{(t)}({\bm{x}}_{t}),    (43)

which is equivalent to:
这相当于:

\bar{{\bm{x}}}(t-\Delta t)=\bar{{\bm{x}}}(t)+(\sigma(t-\Delta t)-\sigma(t))\cdot\epsilon_{\theta}^{(t)}({\bm{x}}(t))    (44)

Divide both sides by (Δt)Δ𝑡(-\Delta t) and as Δt0Δ𝑡0\Delta t\to 0, we have:
将两边除以 (Δt)(-\Delta t) ,并且由于 Δt00\Delta t\to 0 ,我们有:

\frac{\mathrm{d}\bar{{\bm{x}}}(t)}{\mathrm{d}t}=\frac{\mathrm{d}\sigma(t)}{\mathrm{d}t}\,\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}\right),    (45)

which is exactly what we have in Equation (14).
这正是我们在方程(14)中所拥有的。

We note that for the optimal model, ϵθ(t)superscriptsubscriptitalic-ϵ𝜃𝑡\epsilon_{\theta}^{(t)} is a minimizer:
我们注意到,对于最优模型, ϵθ(t)superscriptsubscript\epsilon_{\theta}^{(t)} 是一个极小值点:

\epsilon_{\theta}^{(t)}=\operatorname*{arg\,min}_{f_{t}}{\mathbb{E}}_{{\bm{x}}(0)\sim q({\bm{x}}),\,\epsilon\sim{\mathcal{N}}(0,{\bm{I}})}[\lVert f_{t}({\bm{x}}(t))-\epsilon\rVert_{2}^{2}]    (46)

where x(t) = √(α(t)) x(0) + √(1 − α(t)) ε.
其中 x(t) = √(α(t)) x(0) + √(1 − α(t)) ε。

ODE form for VE-SDE
用于 VE-SDE 的常微分方程形式

Define p_t(x̄) as the data distribution perturbed with σ²(t) variance Gaussian noise. The probability flow for the VE-SDE is defined as in Song et al. (2020):
将 p_t(x̄) 定义为受到方差为 σ²(t) 的高斯噪声扰动的数据分布。VE-SDE 的概率流按照 Song 等人(2020)定义为:

\mathrm{d}\bar{{\bm{x}}}=-\frac{1}{2}g(t)^{2}\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}})\,\mathrm{d}t    (47)

where g(t)=dσ2(t)dt𝑔𝑡dsuperscript𝜎2𝑡d𝑡g(t)=\sqrt{\frac{\mathrm{d}\sigma^{2}(t)}{\mathrm{d}t}} is the diffusion coefficient, and 𝒙¯logpt(𝒙¯)subscript¯𝒙subscript𝑝𝑡¯𝒙\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}}) is the score of ptsubscript𝑝𝑡p_{t}.
其中 g(t)=dσ2(t)dtsuperscript2g(t)=\sqrt{\frac{\mathrm{d}\sigma^{2}(t)}{\mathrm{d}t}} 是扩散系数, 𝒙¯logpt(𝒙¯)subscriptsubscript\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}})ptsubscriptp_{t} 的得分。

The σ(t)𝜎𝑡\sigma(t)-perturbed score function 𝒙¯logpt(𝒙¯)subscript¯𝒙subscript𝑝𝑡¯𝒙\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}}) is also a minimizer (from denoising score matching (Vincent, 2011)):
σ(t)\sigma(t) -扰动得分函数 𝒙¯logpt(𝒙¯)subscriptsubscript\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}}) 也是一个最小化器(来自去噪得分匹配(Vincent,2011)):

\nabla_{\bar{{\bm{x}}}}\log p_{t}=\operatorname*{arg\,min}_{g_{t}}{\mathbb{E}}_{{\bm{x}}(0)\sim q({\bm{x}}),\,\epsilon\sim{\mathcal{N}}(0,{\bm{I}})}[\lVert g_{t}(\bar{{\bm{x}}})+\epsilon/\sigma(t)\rVert_{2}^{2}]    (48)

where x̄(t) = x̄(0) + σ(t) ε.
其中 x̄(t) = x̄(0) + σ(t) ε。

Since there is an equivalence between 𝒙(t)𝒙𝑡{\bm{x}}(t) and 𝒙¯(t)¯𝒙𝑡\bar{{\bm{x}}}(t), we have the following relationship:
由于 𝒙(t){\bm{x}}(t)𝒙¯(t)\bar{{\bm{x}}}(t) 之间存在等价关系,我们有以下关系:

\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}})=-\frac{\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}\right)}{\sigma(t)}    (49)

from Equation (46) and Equation (48). Plugging Equation (49) and the definition of g(t) into Equation (47), we have:
由方程(46)和方程(48)可得。将方程(49)和 g(t) 的定义代入方程(47),我们有:

\mathrm{d}\bar{{\bm{x}}}(t)=\frac{1}{2}\frac{\mathrm{d}\sigma^{2}(t)}{\mathrm{d}t}\frac{\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}\right)}{\sigma(t)}\mathrm{d}t,    (50)

and we have the following by rearranging terms:
通过重新排列项,我们有以下内容:

\frac{\mathrm{d}\bar{{\bm{x}}}(t)}{\mathrm{d}t}=\frac{\mathrm{d}\sigma(t)}{\mathrm{d}t}\,\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}\right)    (51)

which is equivalent to Equation (45). In both cases the initial conditions are 𝒙¯(T)𝒩(𝟎,σ2(T)𝑰)similar-to¯𝒙𝑇𝒩0superscript𝜎2𝑇𝑰\bar{{\bm{x}}}(T)\sim{\mathcal{N}}({\bm{0}},\sigma^{2}(T){\bm{I}}), so the resulting ODEs are identical. ∎
这等同于方程(45)。在这两种情况下,初始条件均为 𝒙¯(T)𝒩(𝟎,σ2(T)𝑰)similar-to0superscript2\bar{{\bm{x}}}(T)\sim{\mathcal{N}}({\bm{0}},\sigma^{2}(T){\bm{I}}) ,因此所得的常微分方程是相同的。∎

Appendix C Additional Derivations
附录 C 额外的推导

C.1 Accelerated sampling processes
C.1 加速抽样过程

In the accelerated case, we can consider the inference process to be factored as:
在加速的情况下,我们可以认为推理过程可分解为:

q_{\sigma,\tau}({\bm{x}}_{1:T}|{\bm{x}}_{0})=q_{\sigma,\tau}({\bm{x}}_{\tau_{S}}|{\bm{x}}_{0})\prod_{i=1}^{S}q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0})\prod_{t\in\bar{\tau}}q_{\sigma,\tau}({\bm{x}}_{t}|{\bm{x}}_{0})    (52)

where τ𝜏\tau is a sub-sequence of [1,,T]1𝑇[1,\ldots,T] of length S𝑆S with τS=Tsubscript𝜏𝑆𝑇\tau_{S}=T, and let τ¯:={1,,T}τassign¯𝜏1𝑇𝜏\bar{\tau}:=\{1,\ldots,T\}\setminus\tau be its complement. Intuitively, the graphical model of {𝒙τi}i=1Ssuperscriptsubscriptsubscript𝒙subscript𝜏𝑖𝑖1𝑆\{{\bm{x}}_{\tau_{i}}\}_{i=1}^{S} and 𝒙0subscript𝒙0{\bm{x}}_{0} form a chain, whereas the graphical model of {𝒙t}tτ¯subscriptsubscript𝒙𝑡𝑡¯𝜏\{{\bm{x}}_{t}\}_{t\in\bar{\tau}} and 𝒙0subscript𝒙0{\bm{x}}_{0} forms a star graph. We define:
其中 τ\tau[1,,T]1[1,\ldots,T] 的长度为 SS 且具有 τS=Tsubscript\tau_{S}=T 的子序列,令 τ¯:={1,,T}τassign1\bar{\tau}:=\{1,\ldots,T\}\setminus\tau 为其补集。直观地说, {𝒙τi}i=1Ssuperscriptsubscriptsubscriptsubscript1\{{\bm{x}}_{\tau_{i}}\}_{i=1}^{S}𝒙0subscript0{\bm{x}}_{0} 的图形模型形成一条链,而 {𝒙t}tτ¯subscriptsubscript\{{\bm{x}}_{t}\}_{t\in\bar{\tau}}𝒙0subscript0{\bm{x}}_{0} 的图形模型形成一个星图。我们定义:

q_{\sigma,\tau}({\bm{x}}_{t}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t}}{\bm{x}}_{0},(1-\alpha_{t}){\bm{I}})\quad\forall t\in\bar{\tau}\cup\{T\}    (53)
q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0})={\mathcal{N}}\left(\sqrt{\alpha_{\tau_{i-1}}}{\bm{x}}_{0}+\sqrt{1-\alpha_{\tau_{i-1}}-\sigma^{2}_{\tau_{i}}}\cdot\frac{{\bm{x}}_{\tau_{i}}-\sqrt{\alpha_{\tau_{i}}}{\bm{x}}_{0}}{\sqrt{1-\alpha_{\tau_{i}}}},\ \sigma_{\tau_{i}}^{2}{\bm{I}}\right)\quad\forall i\in[S]

where the coefficients are chosen such that:
其中系数的选取应满足:

q_{\sigma,\tau}({\bm{x}}_{\tau_{i}}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{\tau_{i}}}{\bm{x}}_{0},(1-\alpha_{\tau_{i}}){\bm{I}})\quad\forall i\in[S]    (54)

i.e., the ``marginals'' match.
即,“边际”匹配。

The corresponding ``generative process'' is defined as:
相应的“生成过程”定义为:

p_{\theta}({\bm{x}}_{0:T}):=\underbrace{p_{\theta}({\bm{x}}_{T})\prod_{i=1}^{S}p^{(\tau_{i})}_{\theta}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}})}_{\text{use to produce samples}}\times\underbrace{\prod_{t\in\bar{\tau}}p_{\theta}^{(t)}({\bm{x}}_{0}|{\bm{x}}_{t})}_{\text{in variational objective}}    (55)

where only part of the models are actually being used to produce samples. The conditionals are:
在其中只有部分模型实际上被用于生成样本。条件是:

p_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}})=q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},f_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i-1}}))\quad\text{if}\ i\in[S],\ i>1    (56)
p_{\theta}^{(t)}({\bm{x}}_{0}|{\bm{x}}_{t})={\mathcal{N}}(f_{\theta}^{(t)}({\bm{x}}_{t}),\sigma_{t}^{2}{\bm{I}})\quad\text{otherwise,}    (57)

where we leverage qσ,τ(𝒙τi1|𝒙τi,𝒙0)subscript𝑞𝜎𝜏conditionalsubscript𝒙subscript𝜏𝑖1subscript𝒙subscript𝜏𝑖subscript𝒙0q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0}) as part of the inference process (similar to what we have done in Section 3). The resulting variational objective becomes (define 𝒙τL+1=subscript𝒙subscript𝜏𝐿1{\bm{x}}_{\tau_{L+1}}=\varnothing for conciseness):
在我们将 qσ,τ(𝒙τi1|𝒙τi,𝒙0)subscriptconditionalsubscriptsubscript1subscriptsubscriptsubscript0q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0}) 用作推理过程的一部分的地方(类似于我们在第 3 节中所做的)。所得的变分目标变为(为简洁起见定义 𝒙τL+1=subscriptsubscript1{\bm{x}}_{\tau_{L+1}}=\varnothing ):

J(\epsilon_{\theta})={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q_{\sigma,\tau}({\bm{x}}_{0:T})}[\log q_{\sigma,\tau}({\bm{x}}_{1:T}|{\bm{x}}_{0})-\log p_{\theta}({\bm{x}}_{0:T})]    (58)
={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q_{\sigma,\tau}({\bm{x}}_{0:T})}\Bigg[\sum_{t\in\bar{\tau}}D_{\mathrm{KL}}(q_{\sigma,\tau}({\bm{x}}_{t}|{\bm{x}}_{0})\|p_{\theta}^{(t)}({\bm{x}}_{0}|{\bm{x}}_{t}))    (59)
\qquad\qquad\qquad\qquad+\sum_{i=1}^{L}D_{\mathrm{KL}}(q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0})\|p_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}}))\Bigg]

where each KL divergence is between two Gaussians with variance independent of θ𝜃\theta. A similar argument to the proof used in Theorem 1 can show that the variational objective J𝐽J can also be converted to an objective of the form Lγsubscript𝐿𝛾L_{\gamma}.
其中每个 KL 散度是在两个方差与 θ\theta 无关的高斯分布之间。与定理 1 中使用的证明类似的论证可以表明,变分目标 JJ 也可以转换为形式为 LγsubscriptL_{\gamma} 的目标。

C.2 Derivation of denoising objectives for DDPMs
C.2 针对去噪扩散概率模型的去噪目标的推导

We note that in Ho et al. (2020), a diffusion hyperparameter β_t⁹ is first introduced, and then the relevant variables α_t := 1 − β_t and ᾱ_t = ∏_{i=1}^{t} α_i are defined. In this paper, we have used the notation α_t to represent the variable ᾱ_t in Ho et al. (2020) for three reasons. First, it makes it more clear that we only need to choose one set of hyperparameters, reducing possible cross-references of the derived variables. Second, it allows us to introduce the generalization as well as the acceleration case more easily, because the inference process is no longer motivated by a diffusion. Third, there exists an isomorphism between α_{1:T} and 1, …, T, which is not the case for β_t.
我们注意到在 Ho 等人(2020)中,首先引入了扩散超参数 β_t⁹,然后定义了相关变量 α_t := 1 − β_t 和 ᾱ_t = ∏_{i=1}^{t} α_i。在本文中,我们使用符号 α_t 来表示 Ho 等人(2020)中的变量 ᾱ_t,原因有三。首先,它更清楚地表明我们只需要选择一组超参数,减少了派生变量可能的交叉引用。其次,它使我们能够更轻松地引入推广以及加速情况,因为推理过程不再受扩散的驱动。第三,α_{1:T} 与 1, …, T 之间存在同构,而 β_t 并非如此。

⁹ In this section we use teal to color notations used in Ho et al. (2020).
⁹ 在本节中,我们使用青绿色为 Ho 等人(2020)中使用的符号着色。

In this section, we use βtsubscript𝛽𝑡\beta_{t} and αtsubscript𝛼𝑡\alpha_{t} to be more consistent with the derivation in Ho et al. (2020), where
在本节中,我们使用 βtsubscript\beta_{t}αtsubscript\alpha_{t} 以与 Ho 等人(2020 年)的推导更加一致,其中

{\color[rgb]{0,.5,.5}\alpha_{t}}=\frac{\alpha_{t}}{\alpha_{t-1}}    (60)
{\color[rgb]{0,.5,.5}\beta_{t}}=1-\frac{\alpha_{t}}{\alpha_{t-1}}    (61)

can be uniquely determined from αtsubscript𝛼𝑡\alpha_{t} (i.e. α¯tsubscript¯𝛼𝑡\bar{\alpha}_{t}).
可从 αtsubscript\alpha_{t} (即 α¯tsubscript\bar{\alpha}_{t} )唯一确定。
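For reference, a small sketch of the change of variables in Eqs. (60)-(61), under the assumption that the cumulative α_t of this paper is stored as an array with the convention α_0 = 1 (the function and variable names are hypothetical):

```python
import numpy as np

def ho_alphas_betas(alpha_bar):
    """Recover the per-step alpha_t and beta_t of Ho et al. (2020) from the
    cumulative alpha_t used in this paper, via Eqs. (60)-(61).

    alpha_bar: array of length T with alpha_bar[t-1] = alpha_t of this paper;
    alpha_0 = 1 is assumed for the t = 1 entry.
    """
    alpha_prev = np.concatenate(([1.0], alpha_bar[:-1]))
    alpha_ho = alpha_bar / alpha_prev   # Eq. (60)
    beta_ho = 1.0 - alpha_ho            # Eq. (61)
    return alpha_ho, beta_ho
```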

First, from the diffusion forward process:
首先,从扩散正向过程:

q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})={\mathcal{N}}\Bigg(\underbrace{\frac{\sqrt{\alpha_{t-1}}{\color[rgb]{0,.5,.5}\beta_{t}}}{1-\alpha_{t}}{\bm{x}}_{0}+\frac{\sqrt{{\color[rgb]{0,.5,.5}\alpha_{t}}}(1-\alpha_{t-1})}{1-\alpha_{t}}{\bm{x}}_{t}}_{{\color[rgb]{0,.5,.5}\tilde{\mu}({\bm{x}}_{t},{\bm{x}}_{0})}},\ \frac{1-\alpha_{t-1}}{1-\alpha_{t}}{\color[rgb]{0,.5,.5}\beta_{t}}{\bm{I}}\Bigg)

Ho et al. (2020) considered a specific type of p_θ^{(t)}(x_{t-1}|x_t):
Ho 等人(2020)考虑了一种特定类型的 p_θ^{(t)}(x_{t-1}|x_t):

p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t})={\mathcal{N}}\left({\color[rgb]{0,.5,.5}\mu_{\theta}({\bm{x}}_{t},t)},\sigma_{t}{\bm{I}}\right)    (62)

which leads to the following variational objective:
这导致了以下变分目标:

{\color[rgb]{0,.5,.5}L}:={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q({\bm{x}}_{0:T})}\left[\log q({\bm{x}}_{T}|{\bm{x}}_{0})+\sum_{t=2}^{T}\log q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})-\sum_{t=1}^{T}\log p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t})\right]    (63)
\equiv{\mathbb{E}}_{{\bm{x}}_{0:T}\sim q({\bm{x}}_{0:T})}\left[\sum_{t=2}^{T}\underbrace{D_{\mathrm{KL}}(q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t}))}_{{\color[rgb]{0,.5,.5}L_{t-1}}}-\log p_{\theta}^{(1)}({\bm{x}}_{0}|{\bm{x}}_{1})\right]

One can write: 一个人可以写道:

{\color[rgb]{0,.5,.5}L_{t-1}}={\mathbb{E}}_{q}\left[\frac{1}{2\sigma_{t}^{2}}\lVert{\color[rgb]{0,.5,.5}\mu_{\theta}({\bm{x}}_{t},t)}-{\color[rgb]{0,.5,.5}\tilde{\mu}({\bm{x}}_{t},{\bm{x}}_{0})}\rVert_{2}^{2}\right]    (64)

Ho et al. (2020) chose the parametrization
何等人(2020 年)选择了参数化

{\color[rgb]{0,.5,.5}\mu_{\theta}({\bm{x}}_{t},t)}=\frac{1}{\sqrt{{\color[rgb]{0,.5,.5}\alpha_{t}}}}\left({\bm{x}}_{t}-\frac{{\color[rgb]{0,.5,.5}\beta_{t}}}{\sqrt{1-\alpha_{t}}}{\color[rgb]{0,.5,.5}\epsilon_{\theta}({\bm{x}}_{t},t)}\right)    (65)

which can be simplified to:
其可简化为:

{\color[rgb]{0,.5,.5}L_{t-1}}={\mathbb{E}}_{{\bm{x}}_{0},\epsilon}\left[\frac{{\color[rgb]{0,.5,.5}\beta_{t}}^{2}}{2\sigma_{t}^{2}(1-\alpha_{t}){\color[rgb]{0,.5,.5}\alpha_{t}}}\lVert\epsilon-{\color[rgb]{0,.5,.5}\epsilon_{\theta}(\sqrt{\alpha_{t}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon,t)}\rVert_{2}^{2}\right]    (66)

Appendix D Experimental Details
附录 D 实验细节

D.1 Datasets and architectures
D.1 数据集与架构

We consider 4 image datasets with various resolutions: CIFAR10 (32×32, unconditional), CelebA (64×64), LSUN Bedroom (256×256) and LSUN Church (256×256). For all datasets, we set the hyperparameters α according to the heuristic in (Ho et al., 2020) to make the results directly comparable. We use the same model for each dataset, and only compare the performance of different generative processes. For CIFAR10, Bedroom and Church, we obtain the pretrained checkpoints from the original DDPM implementation; for CelebA, we trained our own model using the denoising objective L_1.
我们考虑了具有各种分辨率的 4 个图像数据集:CIFAR10(32×32,无条件)、CelebA(64×64)、LSUN 卧室(256×256)和 LSUN 教堂(256×256)。对于所有数据集,我们根据(Ho 等人,2020)中的启发式方法设置超参数 α,以使结果直接具有可比性。我们对每个数据集使用相同的模型,仅比较不同生成过程的性能。对于 CIFAR10、卧室和教堂,我们从原始的 DDPM 实现中获得预训练的检查点;对于 CelebA,我们使用去噪目标 L_1 训练了自己的模型。

Our architecture for ϵθ(t)(𝒙t)superscriptsubscriptitalic-ϵ𝜃𝑡subscript𝒙𝑡\epsilon_{\theta}^{(t)}({\bm{x}}_{t}) follows that in Ho et al. (2020), which is a U-Net (Ronneberger et al., 2015) based on a Wide ResNet (Zagoruyko & Komodakis, 2016). We use the pretrained models from Ho et al. (2020) for CIFAR10, Bedroom and Church, and train our own model for the CelebA 64×64646464\times 64 model (since a pretrained model is not provided). Our CelebA model has five feature map resolutions from 64×64646464\times 64 to 4×4444\times 4, and we use the original CelebA dataset (not CelebA-HQ) using the pre-processing technique from the StyleGAN (Karras et al., 2018) repository.
我们的 ϵθ(t)(𝒙t)superscriptsubscriptsubscript\epsilon_{\theta}^{(t)}({\bm{x}}_{t}) 架构遵循 Ho 等人(2020)的架构,这是一个基于宽残差网络(Zagoruyko 和 Komodakis,2016)的 U-Net(Ronneberger 等人,2015)。我们使用 Ho 等人(2020)针对 CIFAR10、Bedroom 和 Church 的预训练模型,并为 CelebA 64×64646464\times 64 模型训练我们自己的模型(因为未提供预训练模型)。我们的 CelebA 模型具有从 64×64646464\times 644×4444\times 4 的五个特征图分辨率,并且我们使用原始的 CelebA 数据集(不是 CelebA-HQ),采用来自 StyleGAN(Karras 等人,2018)存储库的预处理技术。

Table 3: LSUN Bedroom and Church image generation results, measured in FID. For 1000 steps DDPM, the FIDs are 6.36 for Bedroom and 7.89 for Church.
表 3:以 FID 衡量的 LSUN 卧室和教堂图像生成结果。对于 1000 步的 DDPM,卧室的 FID 为 6.36,教堂的 FID 为 7.89。
                   Bedroom (256×256)                Church (256×256)
dim(τ)             10      20      50     100       10      20      50     100
DDIM (η = 0.0)     16.95   8.89    6.75   6.62      19.45   12.47   10.84  10.58
DDPM (η = 1.0)     42.78   22.77   10.81  6.81      51.56   23.37   11.16   8.27

D.2 Reverse process sub-sequence selection

We consider two types of selection procedure for $\tau$ given the desired $\dim(\tau) < T$:

  • Linear: we select the timesteps such that $\tau_i = \lfloor ci \rfloor$ for some $c$;

  • Quadratic: we select the timesteps such that $\tau_i = \lfloor ci^2 \rfloor$ for some $c$.

The constant value $c$ is selected such that $\tau_{-1}$ is close to $T$. We used quadratic for CIFAR10 and linear for the remaining datasets. These choices achieve slightly better FID than their alternatives in the respective datasets.
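A minimal sketch of both selection procedures follows. The function name and the particular choice of $c$ (so that the last selected timestep lands at $T$) are ours; the paper only requires that $\tau_{-1}$ be close to $T$.

```python
import numpy as np

def select_timesteps(T, dim_tau, quadratic=False):
    """Return an increasing sub-sequence tau of length dim_tau from {1, ..., T}."""
    i = np.arange(1, dim_tau + 1)
    if quadratic:
        c = T / (dim_tau ** 2)          # so that floor(c * dim_tau^2) is close to T
        tau = np.floor(c * i ** 2)
    else:
        c = T / dim_tau                 # so that floor(c * dim_tau) is close to T
        tau = np.floor(c * i)
    # very small floors may need clipping to >= 1 in practice
    return np.clip(tau, 1, T).astype(int)
```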

D.3 Closed form equations for each sampling step

From the general sampling equation in Eq. (12), we have the following update equation:

$${\bm{x}}_{\tau_{i-1}}(\eta)=\sqrt{\alpha_{\tau_{i-1}}}\left(\frac{{\bm{x}}_{\tau_{i}}-\sqrt{1-\alpha_{\tau_{i}}}\,\epsilon_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i}})}{\sqrt{\alpha_{\tau_{i}}}}\right)+\sqrt{1-\alpha_{\tau_{i-1}}-\sigma_{\tau_{i}}(\eta)^{2}}\cdot\epsilon_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i}})+\sigma_{\tau_{i}}(\eta)\,\epsilon$$

where

$$\sigma_{\tau_{i}}(\eta)=\eta\sqrt{\frac{1-\alpha_{\tau_{i-1}}}{1-\alpha_{\tau_{i}}}}\sqrt{1-\frac{\alpha_{\tau_{i}}}{\alpha_{\tau_{i-1}}}}$$
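A sketch of one application of this update as a NumPy function. The argument names are ours: `eps_pred` stands for $\epsilon_{\theta}^{(\tau_i)}({\bm{x}}_{\tau_i})$, and `alpha_now`, `alpha_prev` for $\alpha_{\tau_i}$, $\alpha_{\tau_{i-1}}$.

```python
import numpy as np

def generalized_step(x_t, eps_pred, alpha_now, alpha_prev, eta, rng=None):
    """One reverse step x_{tau_i} -> x_{tau_{i-1}} of the generalized process.
    eta = 0 gives DDIM (deterministic); eta = 1 gives the DDPM-like update."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = eta * np.sqrt((1 - alpha_prev) / (1 - alpha_now)) \
                * np.sqrt(1 - alpha_now / alpha_prev)
    # "predicted x_0" term
    x0_pred = (x_t - np.sqrt(1 - alpha_now) * eps_pred) / np.sqrt(alpha_now)
    # direction pointing to x_t, plus random noise scaled by sigma
    noise = rng.standard_normal(x_t.shape) if eta > 0 else 0.0
    return np.sqrt(alpha_prev) * x0_pred \
         + np.sqrt(1 - alpha_prev - sigma ** 2) * eps_pred \
         + sigma * noise
```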

For the case of $\hat{\sigma}$ (DDPM with a larger variance), the update equation becomes:

$${\bm{x}}_{\tau_{i-1}}=\sqrt{\alpha_{\tau_{i-1}}}\left(\frac{{\bm{x}}_{\tau_{i}}-\sqrt{1-\alpha_{\tau_{i}}}\,\epsilon_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i}})}{\sqrt{\alpha_{\tau_{i}}}}\right)+\sqrt{1-\alpha_{\tau_{i-1}}-\sigma_{\tau_{i}}(1)^{2}}\cdot\epsilon_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i}})+\hat{\sigma}_{\tau_{i}}\,\epsilon$$

which uses a different coefficient for $\epsilon$ compared with the update for $\eta=1$, but uses the same coefficient for the non-stochastic parts. This update is more stochastic than the update for $\eta=1$, which explains why it achieves worse performance when $\dim(\tau)$ is small.
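A corresponding sketch for the $\hat{\sigma}$ variant. Since $\hat{\sigma}$ is defined earlier in the paper rather than in this appendix, we assume here that $\hat{\sigma}_{\tau_i}=\sqrt{1-\alpha_{\tau_i}/\alpha_{\tau_{i-1}}}$, i.e. the $\sqrt{\beta_t}$ choice of Ho et al. (2020); the function and argument names are again ours.

```python
import numpy as np

def sigma_hat_step(x_t, eps_pred, alpha_now, alpha_prev, rng=None):
    """DDPM update with the larger variance: same non-stochastic coefficients
    as the eta = 1 update, but noise scaled by sigma_hat instead of sigma(1)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma_1 = np.sqrt((1 - alpha_prev) / (1 - alpha_now)) \
            * np.sqrt(1 - alpha_now / alpha_prev)
    sigma_hat = np.sqrt(1 - alpha_now / alpha_prev)   # assumed sqrt(beta_t) choice
    x0_pred = (x_t - np.sqrt(1 - alpha_now) * eps_pred) / np.sqrt(alpha_now)
    return np.sqrt(alpha_prev) * x0_pred \
         + np.sqrt(1 - alpha_prev - sigma_1 ** 2) * eps_pred \
         + sigma_hat * rng.standard_normal(x_t.shape)
```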

D.4 Samples and Consistency

We show more samples in Figure 7 (CIFAR10), Figure 8 (CelebA), Figure 10 (Church) and consistency results of DDIM in Figure 9 (CelebA).

Figure 7: CIFAR10 samples from 1000 step DDPM, 1000 step DDIM and 100 step DDIM.

Figure 8: CelebA samples from 1000 step DDPM, 1000 step DDIM and 100 step DDIM.

Figure 9: CelebA samples from DDIM with the same random ${\bm{x}}_T$ and different number of steps.

Figure 10: Church samples from 100 step DDPM and 100 step DDIM.

D.5 Interpolation

To generate interpolations on a line, we randomly sample two initial ${\bm{x}}_T$ values from the standard Gaussian, interpolate them with spherical linear interpolation (Shoemake, 1985), and then use the DDIM to obtain ${\bm{x}}_0$ samples.

$${\bm{x}}_{T}^{(\alpha)}=\frac{\sin((1-\alpha)\theta)}{\sin(\theta)}\,{\bm{x}}_{T}^{(0)}+\frac{\sin(\alpha\theta)}{\sin(\theta)}\,{\bm{x}}_{T}^{(1)}\qquad(67)$$

where $\theta=\arccos\left(\frac{({\bm{x}}_{T}^{(0)})^{\top}{\bm{x}}_{T}^{(1)}}{\lVert{\bm{x}}_{T}^{(0)}\rVert\,\lVert{\bm{x}}_{T}^{(1)}\rVert}\right)$. These values are used to produce DDIM samples.
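A minimal sketch of Eq. (67); the helper name `slerp` and its arguments are ours.

```python
import numpy as np

def slerp(x0, x1, alpha):
    """Spherical linear interpolation between two latents, Eq. (67)."""
    cos_theta = np.dot(x0.ravel(), x1.ravel()) / (np.linalg.norm(x0) * np.linalg.norm(x1))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return (np.sin((1 - alpha) * theta) * x0 + np.sin(alpha * theta) * x1) / np.sin(theta)
```

For example, `slerp(x_T_0, x_T_1, 0.3)` gives the interpolated latent at $\alpha=0.3$, which is then decoded with the DDIM generative process.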

To generate interpolations on a grid, we sample four latent variables and separate them into two pairs; then we use slerp with the pairs under the same $\alpha$, and use slerp over the interpolated samples across the pairs (under an independently chosen interpolation coefficient). We show more grid interpolation results in Figure 11 (CelebA), Figure 12 (Bedroom), and Figure 13 (Church).
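A sketch of the grid procedure, reusing the `slerp` helper defined above; the second, independently chosen coefficient is denoted `beta` here, and all names are ours.

```python
def slerp_grid(z, alphas, betas):
    """z: four latents forming two pairs, (z[0], z[1]) and (z[2], z[3]).
    Returns a len(betas) x len(alphas) grid of interpolated latents."""
    grid = []
    for beta in betas:
        row = []
        for alpha in alphas:
            a = slerp(z[0], z[1], alpha)    # interpolate within the first pair
            b = slerp(z[2], z[3], alpha)    # interpolate within the second pair
            row.append(slerp(a, b, beta))   # interpolate across the pairs
        grid.append(row)
    return grid
```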

Figure 11: More interpolations from the CelebA DDIM with $\dim(\tau)=50$.

Figure 12: More interpolations from the Bedroom DDIM with $\dim(\tau)=50$.

Figure 13: More interpolations from the Church DDIM with $\dim(\tau)=50$.