这是用户在 2024-12-11 18:03 为 https://ar5iv.labs.arxiv.org/html/2010.02502?_immersive_translate_auto_translate=1 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?

Denoising Diffusion Implicit Models
去噪扩散隐式模型

Jiaming Song, Chenlin Meng & Stefano Ermon
宋家铭,孟晨林 & 斯特凡诺·埃尔蒙

Stanford University
{tsong,chenlin,ermon}@cs.stanford.edu
斯坦福大学 {tsong,chenlin,ermon}@cs.stanford.edu
Abstract 摘要

Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps in order to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a particular Markovian diffusion process. We generalize DDPMs via a class of non-Markovian diffusion processes that lead to the same training objective. These non-Markovian processes can correspond to generative processes that are deterministic, giving rise to implicit models that produce high quality samples much faster. We empirically demonstrate that DDIMs can produce high quality samples 10×10\times to 50×50\times faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, perform semantically meaningful image interpolation directly in the latent space, and reconstruct observations with very low error.
去噪扩散概率模型(DDPMs)在没有对抗训练的情况下实现了高质量的图像生成,但它们需要模拟马尔可夫链进行多步操作以生成样本。为了加速采样,我们提出了去噪扩散隐式模型(DDIMs),这是一类更高效的迭代隐式概率模型,其训练过程与 DDPMs 相同。在 DDPMs 中,生成过程被定义为特定马尔可夫扩散过程的逆过程。我们通过一类非马尔可夫扩散过程对 DDPMs 进行了推广,这些过程导致相同的训练目标。这些非马尔可夫过程可以对应于确定性的生成过程,从而产生隐式模型,能够更快地产生高质量样本。我们通过实验证明,DDIMs 在墙钟时间上比 DDPMs 快 10×10\times50×50\times ,允许我们在样本质量与计算之间进行权衡,直接在潜在空间中执行语义上有意义的图像插值,并以非常低的误差重建观察结果。

1 Introduction 1 引言

Deep generative models have demonstrated the ability to produce high quality samples in many domains (Karras et al., 2020; van den Oord et al., 2016a). In terms of image generation, generative adversarial networks (GANs, Goodfellow et al. (2014)) currently exhibits higher sample quality than likelihood-based methods such as variational autoencoders (Kingma & Welling, 2013), autoregressive models (van den Oord et al., 2016b) and normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2016). However, GANs require very specific choices in optimization and architectures in order to stabilize training (Arjovsky et al., 2017; Gulrajani et al., 2017; Karras et al., 2018; Brock et al., 2018), and could fail to cover modes of the data distribution (Zhao et al., 2018).
深度生成模型在许多领域展示了生成高质量样本的能力(Karras et al., 2020; van den Oord et al., 2016a)。在图像生成方面,生成对抗网络(GANs,Goodfellow et al. (2014))目前展现出比基于似然的方法(如变分自编码器(Kingma & Welling, 2013)、自回归模型(van den Oord et al., 2016b)和归一化流(Rezende & Mohamed, 2015; Dinh et al., 2016))更高的样本质量。然而,GANs 在优化和架构上需要非常特定的选择以稳定训练(Arjovsky et al., 2017; Gulrajani et al., 2017; Karras et al., 2018; Brock et al., 2018),并可能无法覆盖数据分布的模式(Zhao et al., 2018)。

Recent works on iterative generative models (Bengio et al., 2014), such as denoising diffusion probabilistic models (DDPM, Ho et al. (2020)) and noise conditional score networks (NCSN, Song & Ermon (2019)) have demonstrated the ability to produce samples comparable to that of GANs, without having to perform adversarial training. To achieve this, many denoising autoencoding models are trained to denoise samples corrupted by various levels of Gaussian noise. Samples are then produced by a Markov chain which, starting from white noise, progressively denoises it into an image. This generative Markov Chain process is either based on Langevin dynamics (Song & Ermon, 2019) or obtained by reversing a forward diffusion process that progressively turns an image into noise (Sohl-Dickstein et al., 2015).
最近关于迭代生成模型的研究(Bengio 等,2014),例如去噪扩散概率模型(DDPM,Ho 等(2020))和噪声条件评分网络(NCSN,Song & Ermon(2019))已展示出生成与 GANs 相当的样本的能力,而无需进行对抗训练。为了实现这一点,许多去噪自编码模型被训练以去噪受到不同程度高斯噪声污染的样本。然后,通过一个马尔可夫链生成样本,该链从白噪声开始,逐步将其去噪成图像。这个生成的马尔可夫链过程要么基于朗之万动力学(Song & Ermon,2019),要么通过逆转一个逐步将图像转变为噪声的前向扩散过程获得(Sohl-Dickstein 等,2015)。

A critical drawback of these models is that they require many iterations to produce a high quality sample. For DDPMs, this is because that the generative process (from noise to data) approximates the reverse of the forward diffusion process (from data to noise), which could have thousands of steps; iterating over all the steps is required to produce a single sample, which is much slower compared to GANs, which only needs one pass through a network. For example, it takes around 20 hours to sample 50k images of size 32×3232\times 32 from a DDPM, but less than a minute to do so from a GAN on a Nvidia 2080 Ti GPU. This becomes more problematic for larger images as sampling 50k images of size 256×256256\times 256 could take nearly 10001000 hours on the same GPU.
这些模型的一个关键缺陷是它们需要多次迭代才能生成高质量的样本。对于 DDPM 而言,这是因为生成过程(从噪声到数据)近似于正向扩散过程(从数据到噪声)的反向过程,这可能需要数千个步骤;为了生成一个样本,必须遍历所有步骤,这与只需通过网络一次的 GAN 相比,速度要慢得多。例如,从 DDPM 中采样 50,000 张大小为 32×32323232\times 32 的图像大约需要 20 小时,而在 Nvidia 2080 Ti GPU 上,从 GAN 中采样则不到一分钟。对于更大的图像,这个问题变得更加严重,因为在同一 GPU 上,采样 50,000 张大小为 256×256256256256\times 256 的图像可能需要近 100010001000 小时。

To close this efficiency gap between DDPMs and GANs, we present denoising diffusion implicit models (DDIMs). DDIMs are implicit probabilistic models (Mohamed & Lakshminarayanan, 2016) and are closely related to DDPMs, in the sense that they are trained with the same objective function. In Section 3, we generalize the forward diffusion process used by DDPMs, which is Markovian, to non-Markovian ones, for which we are still able to design suitable reverse generative Markov chains. We show that the resulting variational training objectives have a shared surrogate objective, which is exactly the objective used to train DDPM. Therefore, we can freely choose from a large family of generative models using the same neural network simply by choosing a different, non-Markovian diffusion process (Section 4.1) and the corresponding reverse generative Markov Chain. In particular, we are able to use non-Markovian diffusion processes which lead to "short" generative Markov chains (Section 4.2) that can be simulated in a small number of steps. This can massively increase sample efficiency only at a minor cost in sample quality.
为了缩小 DDPM 和 GAN 之间的效率差距,我们提出了去噪扩散隐式模型(DDIM)。DDIM 是隐式概率模型(Mohamed & Lakshminarayanan, 2016),与 DDPM 密切相关,因为它们使用相同的目标函数进行训练。在第 3 节中,我们将 DDPM 使用的前向扩散过程(这是马尔可夫过程)推广到非马尔可夫过程,对于这些过程,我们仍然能够设计合适的反向生成马尔可夫链。我们展示了所得到的变分训练目标具有共享的替代目标,这正是用于训练 DDPM 的目标。因此,我们可以通过选择不同的非马尔可夫扩散过程(第 4.1 节)和相应的反向生成马尔可夫链,随意选择来自大量生成模型的选项,使用相同的神经网络。特别是,我们能够使用导致“短”生成马尔可夫链的非马尔可夫扩散过程(第 4.2 节),这些链可以在少量步骤中进行模拟。这可以在样本质量仅略有损失的情况下大幅提高样本效率。

In Section 5, we demonstrate several empirical benefits of DDIMs over DDPMs. First, DDIMs have superior sample generation quality compared to DDPMs, when we accelerate sampling by 10×10\times to 100×100\times using our proposed method. Second, DDIM samples have the following ``consistency'' property, which does not hold for DDPMs: if we start with the same initial latent variable and generate several samples with Markov chains of various lengths, these samples would have similar high-level features. Third, because of ``consistency'' in DDIMs, we can perform semantically meaningful image interpolation by manipulating the initial latent variable in DDIMs, unlike DDPMs which interpolates near the image space due to the stochastic generative process.
在第 5 节中,我们展示了 DDIM 相对于 DDPM 的几个实证优势。首先,当我们使用所提出的方法将采样加速从 10×10\times100×100\times 时,DDIM 的样本生成质量优于 DDPM。其次,DDIM 样本具有以下“一致性”特性,而 DDPM 则不具备:如果我们从相同的初始潜变量开始,并使用不同长度的马尔可夫链生成多个样本,这些样本将具有相似的高层特征。第三,由于 DDIM 中的“一致性”,我们可以通过操控 DDIM 中的初始潜变量进行语义上有意义的图像插值,而 DDPM 由于随机生成过程则在图像空间附近进行插值。

2 Background

Refer to caption
Refer to caption
Figure 1: Graphical models for diffusion (left) and non-Markovian (right) inference models.

Given samples from a data distribution q(𝒙0)q({\bm{x}}_{0}), we are interested in learning a model distribution pθ(𝒙0)p_{\theta}({\bm{x}}_{0}) that approximates q(𝒙0)q({\bm{x}}_{0}) and is easy to sample from. Denoising diffusion probabilistic models (DDPMs, Sohl-Dickstein et al. (2015); Ho et al. (2020)) are latent variable models of the form

pθ(𝒙0)=pθ(𝒙0:T)d𝒙1:T,wherepθ(𝒙0:T):=pθ(𝒙T)t=1Tpθ(t)(𝒙t1|𝒙t)\displaystyle p_{\theta}({\bm{x}}_{0})=\int p_{\theta}({\bm{x}}_{0:T})\mathrm{d}{\bm{x}}_{1:T},\quad\text{where}\quad p_{\theta}({\bm{x}}_{0:T}):=p_{\theta}({\bm{x}}_{T})\prod_{t=1}^{T}p^{(t)}_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t}) (1)

where 𝒙1,,𝒙T{\bm{x}}_{1},\ldots,{\bm{x}}_{T} are latent variables in the same sample space as 𝒙0{\bm{x}}_{0} (denoted as 𝒳{\mathcal{X}}). The parameters θ\theta are learned to fit the data distribution q(𝒙0)q({\bm{x}}_{0}) by maximizing a variational lower bound:

maxθ𝔼q(𝒙0)[logpθ(𝒙0)]maxθ𝔼q(𝒙0,𝒙1,,𝒙T)[logpθ(𝒙0:T)logq(𝒙1:T|𝒙0)]\displaystyle\max_{\theta}{\mathbb{E}}_{q({\bm{x}}_{0})}[\log p_{\theta}({\bm{x}}_{0})]\leq\max_{\theta}{\mathbb{E}}_{q({\bm{x}}_{0},{\bm{x}}_{1},\ldots,{\bm{x}}_{T})}\left[\log p_{\theta}({\bm{x}}_{0:T})-\log q({\bm{x}}_{1:T}|{\bm{x}}_{0})\right] (2)

where q(𝒙1:T|𝒙0)q({\bm{x}}_{1:T}|{\bm{x}}_{0}) is some inference distribution over the latent variables. Unlike typical latent variable models (such as the variational autoencoder (Rezende et al., 2014)), DDPMs are learned with a fixed (rather than trainable) inference procedure q(𝒙1:T|𝒙0)q({\bm{x}}_{1:T}|{\bm{x}}_{0}), and latent variables are relatively high dimensional. For example, Ho et al. (2020) considered the following Markov chain with Gaussian transitions parameterized by a decreasing sequence α1:T(0,1]T\alpha_{1:T}\in(0,1]^{T}:

q(𝒙1:T|𝒙0):=t=1Tq(𝒙t|𝒙t1),whereq(𝒙t|𝒙t1):=𝒩(αtαt1𝒙t1,(1αtαt1)𝑰)\displaystyle q({\bm{x}}_{1:T}|{\bm{x}}_{0}):=\prod_{t=1}^{T}q({\bm{x}}_{t}|{\bm{x}}_{t-1}),\text{where}\ q({\bm{x}}_{t}|{\bm{x}}_{t-1}):={\mathcal{N}}\left(\sqrt{\frac{\alpha_{t}}{\alpha_{t-1}}}{\bm{x}}_{t-1},\left(1-\frac{\alpha_{t}}{\alpha_{t-1}}\right){\bm{I}}\right) (3)

where the covariance matrix is ensured to have positive terms on its diagonal. This is called the forward process due to the autoregressive nature of the sampling procedure (from 𝒙0{\bm{x}}_{0} to 𝒙T{\bm{x}}_{T}). We call the latent variable model pθ(𝒙0:T)p_{\theta}({\bm{x}}_{0:T}), which is a Markov chain that samples from 𝒙T{\bm{x}}_{T} to 𝒙0{\bm{x}}_{0}, the generative process, since it approximates the intractable reverse process q(𝒙t1|𝒙t)q({\bm{x}}_{t-1}|{\bm{x}}_{t}). Intuitively, the forward process progressively adds noise to the observation 𝒙0{\bm{x}}_{0}, whereas the generative process progressively denoises a noisy observation (Figure 1, left).

A special property of the forward process is that

q(𝒙t|𝒙0):=q(𝒙1:t|𝒙0)d𝒙1:(t1)=𝒩(𝒙t;αt𝒙0,(1αt)𝑰);q({\bm{x}}_{t}|{\bm{x}}_{0}):=\int q({\bm{x}}_{1:t}|{\bm{x}}_{0})\mathrm{d}{\bm{x}}_{1:(t-1)}={\mathcal{N}}({\bm{x}}_{t};\sqrt{\alpha_{t}}{\bm{x}}_{0},(1-\alpha_{t}){\bm{I}});

so we can express 𝒙t{\bm{x}}_{t} as a linear combination of 𝒙0{\bm{x}}_{0} and a noise variable ϵ\epsilon:

𝒙t=αt𝒙0+1αtϵ,whereϵ𝒩(𝟎,𝑰).\displaystyle{\bm{x}}_{t}=\sqrt{\alpha_{t}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon,\quad\text{where}\quad\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}). (4)

When we set αT\alpha_{T} sufficiently close to 0, q(𝒙T|𝒙0)q({\bm{x}}_{T}|{\bm{x}}_{0}) converges to a standard Gaussian for all 𝒙0{\bm{x}}_{0}, so it is natural to set pθ(𝒙T):=𝒩(𝟎,𝑰)p_{\theta}({\bm{x}}_{T}):={\mathcal{N}}({\bm{0}},{\bm{I}}). If all the conditionals are modeled as Gaussians with trainable mean functions and fixed variances, the objective in Eq. (2) can be simplified to111Please refer to Appendix C.2 for details.:

Lγ(ϵθ)\displaystyle L_{\gamma}(\epsilon_{\theta}) :=t=1Tγt𝔼𝒙0q(𝒙0),ϵt𝒩(𝟎,𝑰)[ϵθ(t)(αt𝒙0+1αtϵt)ϵt22]\displaystyle:=\sum_{t=1}^{T}\gamma_{t}{\mathbb{E}}_{{\bm{x}}_{0}\sim q({\bm{x}}_{0}),\epsilon_{t}\sim{\mathcal{N}}({\bm{0}},{\bm{I}})}\left[{\lVert{\epsilon_{\theta}^{(t)}(\sqrt{\alpha_{t}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon_{t})-\epsilon_{t}}\rVert}_{2}^{2}\right] (5)

where ϵθ:={ϵθ(t)}t=1T\epsilon_{\theta}:=\{\epsilon_{\theta}^{(t)}\}_{t=1}^{T} is a set of TT functions, each ϵθ(t):𝒳𝒳\epsilon_{\theta}^{(t)}:{\mathcal{X}}\to{\mathcal{X}} (indexed by tt) is a function with trainable parameters θ(t)\theta^{(t)}, and γ:=[γ1,,γT]\gamma:=[\gamma_{1},\ldots,\gamma_{T}] is a vector of positive coefficients in the objective that depends on α1:T\alpha_{1:T}. In Ho et al. (2020), the objective with γ=𝟏\gamma={\bm{1}} is optimized instead to maximize generation performance of the trained model; this is also the same objective used in noise conditional score networks (Song & Ermon, 2019) based on score matching (Hyvärinen, 2005; Vincent, 2011). From a trained model, 𝒙0{\bm{x}}_{0} is sampled by first sampling 𝒙T{\bm{x}}_{T} from the prior pθ(𝒙T)p_{\theta}({\bm{x}}_{T}), and then sampling 𝒙t1{\bm{x}}_{t-1} from the generative processes iteratively.

The length TT of the forward process is an important hyperparameter in DDPMs. From a variational perspective, a large TT allows the reverse process to be close to a Gaussian (Sohl-Dickstein et al., 2015), so that the generative process modeled with Gaussian conditional distributions becomes a good approximation; this motivates the choice of large TT values, such as T=1000T=1000 in Ho et al. (2020). However, as all TT iterations have to be performed sequentially, instead of in parallel, to obtain a sample 𝒙0{\bm{x}}_{0}, sampling from DDPMs is much slower than sampling from other deep generative models, which makes them impractical for tasks where compute is limited and latency is critical.

3 Variational Inference for non-Markovian Forward Processes

Because the generative model approximates the reverse of the inference process, we need to rethink the inference process in order to reduce the number of iterations required by the generative model. Our key observation is that the DDPM objective in the form of LγL_{\gamma} only depends on the marginals222We slightly abuse this term (as well as joints) when only conditioned on 𝒙0{\bm{x}}_{0}. q(𝒙t|𝒙0)q({\bm{x}}_{t}|{\bm{x}}_{0}), but not directly on the joint q(𝒙1:T|𝒙0)q({\bm{x}}_{1:T}|{\bm{x}}_{0}). Since there are many inference distributions (joints) with the same marginals, we explore alternative inference processes that are non-Markovian, which leads to new generative processes (Figure 1, right). These non-Markovian inference process lead to the same surrogate objective function as DDPM, as we will show below. In Appendix A, we show that the non-Markovian perspective also applies beyond the Gaussian case.

3.1 Non-Markovian forward processes

Let us consider a family 𝒬{\mathcal{Q}} of inference distributions, indexed by a real vector σ0T\sigma\in\mathbb{R}_{\geq 0}^{T}:

qσ(𝒙1:T|𝒙0):=qσ(𝒙T|𝒙0)t=2Tqσ(𝒙t1|𝒙t,𝒙0)\displaystyle q_{\sigma}({\bm{x}}_{1:T}|{\bm{x}}_{0}):=q_{\sigma}({\bm{x}}_{T}|{\bm{x}}_{0})\prod_{t=2}^{T}q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}) (6)

where qσ(𝒙T|𝒙0)=𝒩(αT𝒙0,(1αT)𝑰)q_{\sigma}({\bm{x}}_{T}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{T}}{\bm{x}}_{0},(1-\alpha_{T}){\bm{I}}) and for all t>1t>1,

qσ(𝒙t1|𝒙t,𝒙0)=𝒩(αt1𝒙0+1αt1σt2𝒙tαt𝒙01αt,σt2𝑰).\displaystyle q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})={\mathcal{N}}\left(\sqrt{\alpha_{t-1}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t-1}-\sigma^{2}_{t}}\cdot{\frac{{\bm{x}}_{t}-\sqrt{\alpha_{t}}{\bm{x}}_{0}}{\sqrt{1-\alpha_{t}}}},\sigma_{t}^{2}{\bm{I}}\right). (7)

The mean function is chosen to order to ensure that qσ(𝒙t|𝒙0)=𝒩(αt𝒙0,(1αt)𝑰)q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t}}{\bm{x}}_{0},(1-\alpha_{t}){\bm{I}}) for all tt (see Lemma 1 of Appendix B), so that it defines a joint inference distribution that matches the ``marginals'' as desired. The forward process333We overload the term “forward process” for cases where the inference model is not a diffusion. can be derived from Bayes' rule:

qσ(𝒙t|𝒙t1,𝒙0)=qσ(𝒙t1|𝒙t,𝒙0)qσ(𝒙t|𝒙0)qσ(𝒙t1|𝒙0),\displaystyle q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{t-1},{\bm{x}}_{0})=\frac{q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0})}{q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0})}, (8)

which is also Gaussian (although we do not use this fact for the remainder of this paper). Unlike the diffusion process in Eq. (3), the forward process here is no longer Markovian, since each 𝒙t{\bm{x}}_{t} could depend on both 𝒙t1{\bm{x}}_{t-1} and 𝒙0{\bm{x}}_{0}. The magnitude of σ\sigma controls the how stochastic the forward process is; when σ𝟎\sigma\to{\bm{0}}, we reach an extreme case where as long as we observe 𝒙0{\bm{x}}_{0} and 𝒙t{\bm{x}}_{t} for some tt, then 𝒙t1{\bm{x}}_{t-1} become known and fixed.

3.2 Generative process and unified variational inference objective

Next, we define a trainable generative process pθ(𝒙0:T)p_{\theta}({\bm{x}}_{0:T}) where each pθ(t)(𝒙t1|𝒙t)p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t}) leverages knowledge of qσ(𝒙t1|𝒙t,𝒙0)q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}). Intuitively, given a noisy observation 𝒙t{\bm{x}}_{t}, we first make a prediction444Learning a distribution over the predictions is also possible, but empirically we found little benefits of it. of the corresponding 𝒙0{\bm{x}}_{0}, and then use it to obtain a sample 𝒙t1{\bm{x}}_{t-1} through the reverse conditional distribution qσ(𝒙t1|𝒙t,𝒙0)q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}), which we have defined.

For some 𝒙0q(𝒙0){\bm{x}}_{0}\sim q({\bm{x}}_{0}) and ϵt𝒩(𝟎,𝑰)\epsilon_{t}\sim{\mathcal{N}}({\bm{0}},{\bm{I}}), 𝒙t{\bm{x}}_{t} can be obtained using Eq. (4). The model ϵθ(t)(𝒙t)\epsilon_{\theta}^{(t)}({\bm{x}}_{t}) then attempts to predict ϵt\epsilon_{t} from 𝒙t{\bm{x}}_{t}, without knowledge of 𝒙0{\bm{x}}_{0}. By rewriting Eq. (4), one can then predict the denoised observation, which is a prediction of 𝒙0{\bm{x}}_{0} given 𝒙t{\bm{x}}_{t}:

fθ(t)(𝒙t):=(𝒙t1αtϵθ(t)(𝒙t))/αt.\displaystyle f_{\theta}^{(t)}({\bm{x}}_{t}):=({\bm{x}}_{t}-\sqrt{1-\alpha_{t}}\cdot\epsilon_{\theta}^{(t)}({\bm{x}}_{t}))/\sqrt{\alpha_{t}}. (9)

We can then define the generative process with a fixed prior pθ(𝒙T)=𝒩(𝟎,𝑰)p_{\theta}({\bm{x}}_{T})={\mathcal{N}}({\bm{0}},{\bm{I}}) and

pθ(t)(𝒙t1|𝒙t)={𝒩(fθ(1)(𝒙1),σ12𝑰)ift=1qσ(𝒙t1|𝒙t,fθ(t)(𝒙t))otherwise,\displaystyle p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t})=\begin{cases}{\mathcal{N}}(f_{\theta}^{(1)}({\bm{x}}_{1}),\sigma_{1}^{2}{\bm{I}})&\text{if}\ t=1\\ q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},f_{\theta}^{(t)}({\bm{x}}_{t}))&\text{otherwise,}\end{cases} (10)

where qσ(𝒙t1|𝒙t,fθ(t)(𝒙t))q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},f_{\theta}^{(t)}({\bm{x}}_{t})) is defined as in Eq. (7) with 𝒙0{\bm{x}}_{0} replaced by fθ(t)(𝒙t)f_{\theta}^{(t)}({\bm{x}}_{t}). We add some Gaussian noise (with covariance σ12𝑰\sigma_{1}^{2}{\bm{I}}) for the case of t=1t=1 to ensure that the generative process is supported everywhere.

We optimize θ\theta via the following variational inference objective (which is a functional over ϵθ\epsilon_{\theta}):

Jσ(ϵθ):=𝔼𝒙0:Tqσ(𝒙0:T)[logqσ(𝒙1:T|𝒙0)logpθ(𝒙0:T)]\displaystyle J_{\sigma}(\epsilon_{\theta}):={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q_{\sigma}({\bm{x}}_{0:T})}[\log q_{\sigma}({\bm{x}}_{1:T}|{\bm{x}}_{0})-\log p_{\theta}({\bm{x}}_{0:T})] (11)
=\displaystyle= 𝔼𝒙0:Tqσ(𝒙0:T)[logqσ(𝒙T|𝒙0)+t=2Tlogqσ(𝒙t1|𝒙t,𝒙0)t=1Tlogpθ(t)(𝒙t1|𝒙t)logpθ(𝒙T)]\displaystyle\ {\mathbb{E}}_{{\bm{x}}_{0:T}\sim q_{\sigma}({\bm{x}}_{0:T})}\left[\log q_{\sigma}({\bm{x}}_{T}|{\bm{x}}_{0})+\sum_{t=2}^{T}\log q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})-\sum_{t=1}^{T}\log p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t})-\log p_{\theta}({\bm{x}}_{T})\right]

where we factorize qσ(𝒙1:T|𝒙0)q_{\sigma}({\bm{x}}_{1:T}|{\bm{x}}_{0}) according to Eq. (6) and pθ(𝒙0:T)p_{\theta}({\bm{x}}_{0:T}) according to Eq. (1).

From the definition of JσJ_{\sigma}, it would appear that a different model has to be trained for every choice of σ\sigma, since it corresponds to a different variational objective (and a different generative process). However, JσJ_{\sigma} is equivalent to LγL_{\gamma} for certain weights γ\gamma, as we show below.

Theorem 1.

For all σ>𝟎\sigma>{\bm{0}}, there exists γ>0T\gamma\in\mathbb{R}_{>0}^{T} and CC\in\mathbb{R}, such that Jσ=Lγ+CJ_{\sigma}=L_{\gamma}+C.

The variational objective LγL_{\gamma} is special in the sense that if parameters θ\theta of the models ϵθ(t)\epsilon_{\theta}^{(t)} are not shared across different tt, then the optimal solution for ϵθ\epsilon_{\theta} will not depend on the weights γ\gamma (as global optimum is achieved by separately maximizing each term in the sum). This property of LγL_{\gamma} has two implications. On the one hand, this justified the use of L𝟏L_{\bm{1}} as a surrogate objective function for the variational lower bound in DDPMs; on the other hand, since JσJ_{\sigma} is equivalent to some LγL_{\gamma} from Theorem 1, the optimal solution of JσJ_{\sigma} is also the same as that of L𝟏L_{\bm{1}}. Therefore, if parameters are not shared across tt in the model ϵθ\epsilon_{\theta}, then the L𝟏L_{\bm{1}} objective used by Ho et al. (2020) can be used as a surrogate objective for the variational objective JσJ_{\sigma} as well.

4 Sampling from Generalized Generative Processes

With L𝟏L_{\bm{1}} as the objective, we are not only learning a generative process for the Markovian inference process considered in Sohl-Dickstein et al. (2015) and Ho et al. (2020), but also generative processes for many non-Markovian forward processes parametrized by σ\sigma that we have described. Therefore, we can essentially use pretrained DDPM models as the solutions to the new objectives, and focus on finding a generative process that is better at producing samples subject to our needs by changing σ\sigma.

4.1 Denoising Diffusion Implicit Models

From pθ(𝒙1:T)p_{\theta}({\bm{x}}_{1:T}) in Eq. (10), one can generate a sample 𝒙t1{\bm{x}}_{t-1} from a sample 𝒙t{\bm{x}}_{t} via:

𝒙t1\displaystyle{\bm{x}}_{t-1} =αt1(𝒙t1αtϵθ(t)(𝒙t)αt)`` predicted 𝒙0''+1αt1σt2ϵθ(t)(𝒙t)``direction pointing to 𝒙t''+σtϵtrandom noise\displaystyle=\sqrt{\alpha_{t-1}}\underbrace{\left(\frac{{\bm{x}}_{t}-\sqrt{1-\alpha_{t}}\epsilon_{\theta}^{(t)}({\bm{x}}_{t})}{\sqrt{\alpha_{t}}}\right)}_{\text{`` predicted }{\bm{x}}_{0}\text{''}}+\underbrace{\sqrt{1-\alpha_{t-1}-\sigma_{t}^{2}}\cdot\epsilon_{\theta}^{(t)}({\bm{x}}_{t})}_{\text{``direction pointing to }{\bm{x}}_{t}\text{''}}+\underbrace{\sigma_{t}\epsilon_{t}}_{\text{random noise}} (12)

where ϵt𝒩(𝟎,𝑰)\epsilon_{t}\sim{\mathcal{N}}({\bm{0}},{\bm{I}}) is standard Gaussian noise independent of 𝒙t{\bm{x}}_{t}, and we define α0:=1\alpha_{0}:=1. Different choices of σ\sigma values results in different generative processes, all while using the same model ϵθ\epsilon_{\theta}, so re-training the model is unnecessary. When σt=(1αt1)/(1αt)1αt/αt1\sigma_{t}=\sqrt{(1-\alpha_{t-1})/(1-\alpha_{t})}\sqrt{1-\alpha_{t}/\alpha_{t-1}} for all tt, the forward process becomes Markovian, and the generative process becomes a DDPM.

We note another special case when σt=0\sigma_{t}=0 for all tt555Although this case is not covered in Theorem 1, we can always approximate it by making σt\sigma_{t} very small.; the forward process becomes deterministic given 𝒙t1{\bm{x}}_{t-1} and 𝒙0{\bm{x}}_{0}, except for t=1t=1; in the generative process, the coefficient before the random noise ϵt\epsilon_{t} becomes zero. The resulting model becomes an implicit probabilistic model (Mohamed & Lakshminarayanan, 2016), where samples are generated from latent variables with a fixed procedure (from 𝒙T{\bm{x}}_{T} to 𝒙0{\bm{x}}_{0}). We name this the denoising diffusion implicit model (DDIM, pronounced /d:Im/), because it is an implicit probabilistic model trained with the DDPM objective (despite the forward process no longer being a diffusion).

4.2 Accelerated generation processes

In the previous sections, the generative process is considered as the approximation to the reverse process; since of the forward process has TT steps, the generative process is also forced to sample TT steps. However, as the denoising objective L𝟏L_{\bm{1}} does not depend on the specific forward procedure as long as qσ(𝒙t|𝒙0)q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0}) is fixed, we may also consider forward processes with lengths smaller than TT, which accelerates the corresponding generative processes without having to train a different model.

Refer to caption
Figure 2: Graphical model for accelerated generation, where τ=[1,3]\tau=[1,3].

Let us consider the forward process as defined not on all the latent variables 𝒙1:T{\bm{x}}_{1:T}, but on a subset {𝒙τ1,,𝒙τS}\{{\bm{x}}_{\tau_{1}},\ldots,{\bm{x}}_{\tau_{S}}\}, where τ\tau is an increasing sub-sequence of [1,,T][1,\ldots,T] of length SS. In particular, we define the sequential forward process over 𝒙τ1,,𝒙τS{\bm{x}}_{\tau_{1}},\ldots,{\bm{x}}_{\tau_{S}} such that q(𝒙τi|𝒙0)=𝒩(ατi𝒙0,(1ατi)𝑰)q({\bm{x}}_{\tau_{i}}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{\tau_{i}}}{\bm{x}}_{0},(1-\alpha_{\tau_{i}}){\bm{I}}) matches the ``marginals'' (see Figure 2 for an illustration). The generative process now samples latent variables according to reversed(τ)\text{reversed}(\tau), which we term (sampling) trajectory. When the length of the sampling trajectory is much smaller than TT, we may achieve significant increases in computational efficiency due to the iterative nature of the sampling process.

Using a similar argument as in Section 3, we can justify using the model trained with the L𝟏L_{\bm{1}} objective, so no changes are needed in training. We show that only slight changes to the updates in Eq. (12) are needed to obtain the new, faster generative processes, which applies to DDPM, DDIM, as well as all generative processes considered in Eq. (10). We include these details in Appendix C.1.

In principle, this means that we can train a model with an arbitrary number of forward steps but only sample from some of them in the generative process. Therefore, the trained model could consider many more steps than what is considered in (Ho et al., 2020) or even a continuous time variable tt (Chen et al., 2020). We leave empirical investigations of this aspect as future work.

4.3 Relevance to Neural ODEs

Moreover, we can rewrite the DDIM iterate according to Eq. (12), and its similarity to Euler integration for solving ordinary differential equations (ODEs) becomes more apparent:

𝒙tΔtαtΔt=𝒙tαt+(1αtΔtαtΔt1αtαt)ϵθ(t)(𝒙t)\displaystyle\frac{{\bm{x}}_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}}=\frac{{\bm{x}}_{t}}{\sqrt{\alpha_{t}}}+\left(\sqrt{\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}}-\sqrt{\frac{1-\alpha_{t}}{\alpha_{t}}}\right)\epsilon_{\theta}^{(t)}({\bm{x}}_{t}) (13)

To derive the corresponding ODE, we can reparameterize (1α/α)(\sqrt{1-\alpha}/\sqrt{\alpha}) with σ\sigma and (𝒙/α)({\bm{x}}/\sqrt{\alpha}) with 𝒙¯\bar{{\bm{x}}}. In the continuous case, σ\sigma and 𝒙{\bm{x}} are functions of tt, where σ:00\sigma:{\mathbb{R}}_{\geq 0}\to{\mathbb{R}}_{\geq 0} is continous, increasing with σ(0)=0\sigma(0)=0. Equation (13) with can be treated as a Euler method over the following ODE:

d𝒙¯(t)=ϵθ(t)(𝒙¯(t)σ2+1)dσ(t),\displaystyle\mathrm{d}\bar{{\bm{x}}}(t)=\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}+1}}\right)\mathrm{d}\sigma(t), (14)

where the initial conditions is 𝒙(T)𝒩(0,σ(T)){\bm{x}}(T)\sim{\mathcal{N}}(0,\sigma(T)) for a very large σ(T)\sigma(T) (which corresponds to the case of α0\alpha\approx 0). This suggests that with enough discretization steps, the we can also reverse the generation process (going from t=0t=0 to TT), which encodes 𝒙0{\bm{x}}_{0} to 𝒙T{\bm{x}}_{T} and simulates the reverse of the ODE in Eq. (14). This suggests that unlike DDPM, we can use DDIM to obtain encodings of the observations (as the form of 𝒙T{\bm{x}}_{T}), which might be useful for other downstream applications that requires latent representations of a model.

In a concurrent work, (Song et al., 2020) proposed a ``probability flow ODE'' that aims to recover the marginal densities of a stochastic differential equation (SDE) based on scores, from which a similar sampling schedule can be obtained. Here, we state that the our ODE is equivalent to a special case of theirs (which corresponds to a continuous-time analog of DDPM).

Proposition 1.

The ODE in Eq. (14) with the optimal model ϵθ(t)\epsilon_{\theta}^{(t)} has an equivalent probability flow ODE corresponding to the ``Variance-Exploding'' SDE in Song et al. (2020).

We include the proof in Appendix B. While the ODEs are equivalent, the sampling procedures are not, since the Euler method for the probability flow ODE will make the following update:

𝒙tΔtαtΔt=𝒙tαt+12(1αtΔtαtΔt1αtαt)αt1αtϵθ(t)(𝒙t)\displaystyle\frac{{\bm{x}}_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}}=\frac{{\bm{x}}_{t}}{\sqrt{\alpha_{t}}}+\frac{1}{2}\left(\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}-\frac{1-\alpha_{t}}{\alpha_{t}}\right)\cdot\sqrt{\frac{\alpha_{t}}{1-\alpha_{t}}}\cdot\epsilon_{\theta}^{(t)}({\bm{x}}_{t}) (15)

which is equivalent to ours if αt\alpha_{t} and αtΔt\alpha_{t-\Delta t} are close enough. In fewer sampling steps, however, these choices will make a difference; we take Euler steps with respect to dσ(t)\mathrm{d}\sigma(t) (which depends less directly on the scaling of ``time'' tt) whereas Song et al. (2020) take Euler steps with respect to dt\mathrm{d}t.

5 Experiments

In this section, we show that DDIMs outperform DDPMs in terms of image generation when fewer iterations are considered, giving speed ups of 10×10\times to 100×100\times over the original DDPM generation process. Moreover, unlike DDPMs, once the initial latent variables 𝒙T{\bm{x}}_{T} are fixed, DDIMs retain high-level image features regardless of the generation trajectory, so they are able to perform interpolation directly from the latent space. DDIMs can also be used to encode samples that reconstruct them from the latent code, which DDPMs cannot do due to the stochastic sampling process.

For each dataset, we use the same trained model with T=1000T=1000 and the objective being LγL_{\gamma} from Eq. (5) with γ=𝟏\gamma={\bm{1}}; as we argued in Section 3, no changes are needed with regards to the training procedure. The only changes that we make is how we produce samples from the model; we achieve this by controlling τ\tau (which controls how fast the samples are obtained) and σ\sigma (which interpolates between the deterministic DDIM and the stochastic DDPM).

We consider different sub-sequences τ\tau of [1,,T][1,\ldots,T] and different variance hyperparameters σ\sigma indexed by elements of τ\tau. To simplify comparisons, we consider σ\sigma with the form:

στi(η)=η(1ατi1)/(1ατi)1ατi/ατi1,\displaystyle\sigma_{\tau_{i}}(\eta)=\eta\sqrt{(1-\alpha_{\tau_{i-1}})/(1-\alpha_{\tau_{i}})}\sqrt{1-{\alpha_{\tau_{i}}}/{\alpha_{\tau_{i-1}}}}, (16)

where η0\eta\in\mathbb{R}_{\geq 0} is a hyperparameter that we can directly control. This includes an original DDPM generative process when η=1\eta=1 and DDIM when η=0\eta=0. We also consider DDPM where the random noise has a larger standard deviation than σ(1)\sigma(1), which we denote as σ^\hat{\sigma}: σ^τi=1ατi/ατi1\hat{\sigma}_{\tau_{i}}=\sqrt{1-{\alpha_{\tau_{i}}}/{\alpha_{\tau_{i-1}}}} . This is used by the implementation in Ho et al. (2020) only to obtain the CIFAR10 samples, but not samples of the other datasets. We include more details in Appendix D.

5.1 Sample quality and efficiency

In Table 1, we report the quality of the generated samples with models trained on CIFAR10 and CelebA, as measured by Frechet Inception Distance (FID (Heusel et al., 2017)), where we vary the number of timesteps used to generate a sample (dim(τ)\dim(\tau)) and the stochasticity of the process (η\eta). As expected, the sample quality becomes higher as we increase dim(τ)\dim(\tau), presenting a trade-off between sample quality and computational costs. We observe that DDIM (η=0\eta=0) achieves the best sample quality when dim(τ)\dim(\tau) is small, and DDPM (η=1\eta=1 and σ^\hat{\sigma}) typically has worse sample quality compared to its less stochastic counterparts with the same dim(τ)\dim(\tau), except for the case for dim(τ)=1000\dim(\tau)=1000 and σ^\hat{\sigma} reported by Ho et al. (2020) where DDIM is marginally worse. However, the sample quality of σ^\hat{\sigma} becomes much worse for smaller dim(τ)\dim(\tau), which suggests that it is ill-suited for shorter trajectories. DDIM, on the other hand, achieves high sample quality much more consistently.

In Figure 3, we show CIFAR10 and CelebA samples with the same number of sampling steps and varying σ\sigma. For the DDPM, the sample quality deteriorates rapidly when the sampling trajectory has 10 steps. For the case of σ^\hat{\sigma}, the generated images seem to have more noisy perturbations under short trajectories; this explains why the FID scores are much worse than other methods, as FID is very sensitive to such perturbations (as discussed in Jolicoeur-Martineau et al. (2020)).

In Figure 4, we show that the amount of time needed to produce a sample scales linearly with the length of the sample trajectory. This suggests that DDIM is useful for producing samples more efficiently, as samples can be generated in much fewer steps. Notably, DDIM is able to produce samples with quality comparable to 1000 step models within 2020 to 100100 steps, which is a 10×10\times to 50×50\times speed up compared to the original DDPM. Even though DDPM could also achieve reasonable sample quality with 100×100\times steps, DDIM requires much fewer steps to achieve this; on CelebA, the FID score of the 100 step DDPM is similar to that of the 20 step DDIM.

Table 1: CIFAR10 and CelebA image generation measured in FID. η=1.0\eta=1.0 and σ^\hat{\sigma} are cases of DDPM (although Ho et al. (2020) only considered T=1000T=1000 steps, and S<TS<T can be seen as simulating DDPMs trained with SS steps), and η=0.0\eta=0.0 indicates DDIM.
CIFAR10 (32×3232\times 32) CelebA (64×6464\times 64)
SS 10 20 50 100 1000 10 20 50 100 1000
η\eta 0.00.0 13.36 6.84 4.67 4.16 4.04 17.33 13.73 9.17 6.53 3.51
0.2 14.04 7.11 4.77 4.25 4.09 17.66 14.11 9.51 6.79 3.64
0.5 16.66 8.35 5.25 4.46 4.29 19.86 16.06 11.01 8.09 4.28
1.0 41.07 18.36 8.01 5.78 4.73 33.12 26.03 18.48 13.93 5.98
σ^\hat{\sigma} 367.43 133.37 32.72 9.99 3.17 299.71 183.83 71.71 45.20 3.26
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 3: CIFAR10 and CelebA samples with dim(τ)=10\dim(\tau)=10 and dim(τ)=100\dim(\tau)=100.
Refer to caption
Refer to caption
Figure 4: Hours to sample 50k images with one Nvidia 2080 Ti GPU and samples at different steps.

5.2 Sample consistency in DDIMs

For DDIM, the generative process is deterministic, and 𝒙0{\bm{x}}_{0} would depend only on the initial state 𝒙T{\bm{x}}_{T}. In Figure 5, we observe the generated images under different generative trajectories (i.e. different τ\tau) while starting with the same initial 𝒙T{\bm{x}}_{T}. Interestingly, for the generated images with the same initial 𝒙T{\bm{x}}_{T}, most high-level features are similar, regardless of the generative trajectory. In many cases, samples generated with only 20 steps are already very similar to ones generated with 1000 steps in terms of high-level features, with only minor differences in details. Therefore, it would appear that 𝒙T{\bm{x}}_{T} alone would be an informative latent encoding of the image; and minor details that affects sample quality are encoded in the parameters, as longer sample trajectories gives better quality samples but do not significantly affect the high-level features. We show more samples in Appendix D.4.

Refer to caption
Refer to caption
Refer to caption
Figure 5: Samples from DDIM with the same random 𝒙T{\bm{x}}_{T} and different number of steps.

5.3 Interpolation in deterministic generative processes

Refer to caption
Refer to caption
Refer to caption
Figure 6: Interpolation of samples from DDIM with dim(τ)=50\dim(\tau)=50.

Since the high level features of the DDIM sample is encoded by 𝒙T{\bm{x}}_{T}, we are interested to see whether it would exhibit the semantic interpolation effect similar to that observed in other implicit probabilistic models, such as GANs (Goodfellow et al., 2014). This is different from the interpolation procedure in Ho et al. (2020), since in DDPM the same 𝒙T{\bm{x}}_{T} would lead to highly diverse 𝒙0{\bm{x}}_{0} due to the stochastic generative process666Although it might be possible if one interpolates all TT noises, like what is done in Song & Ermon (2020).. In Figure 6, we show that simple interpolations in 𝒙T{\bm{x}}_{T} can lead to semantically meaningful interpolations between two samples. We include more details and samples in Appendix D.5. This allows DDIM to control the generated images on a high level directly through the latent variables, which DDPMs cannot.

5.4 Reconstruction from Latent Space

As DDIM is the Euler integration for a particular ODE, it would be interesting to see whether it can encode from 𝒙0{\bm{x}}_{0} to 𝒙T{\bm{x}}_{T} (reverse of Eq. (14)) and reconstruct 𝒙0{\bm{x}}_{0} from the resulting 𝒙T{\bm{x}}_{T} (forward of Eq. (14))777Since 𝒙T{\bm{x}}_{T} and 𝒙0{\bm{x}}_{0} have the same dimensions, their compression qualities are not our immediate concern.. We consider encoding and decoding on the CIFAR-10 test set with the CIFAR-10 model with SS steps for both encoding and decoding; we report the per-dimension mean squared error (scaled to [0,1][0,1]) in Table 2. Our results show that DDIMs have lower reconstruction error for larger SS values and have properties similar to Neural ODEs and normalizing flows. The same cannot be said for DDPMs due to their stochastic nature.

Table 2: Reconstruction error with DDIM on CIFAR-10 test set, rounded to 10410^{-4}.
SS 10 20 50 100 200 500 1000
Error 0.014 0.0065 0.0023 0.0009 0.0004 0.0001 0.00010.0001

6 Related Work

Our work is based on a large family of existing methods on learning generative models as transition operators of Markov chains (Sohl-Dickstein et al., 2015; Bengio et al., 2014; Salimans et al., 2014; Song et al., 2017; Goyal et al., 2017; Levy et al., 2017). Among them, denoising diffusion probabilistic models (DDPMs, Ho et al. (2020)) and noise conditional score networks (NCSN, Song & Ermon (2019; 2020)) have recently achieved high sample quality comparable to GANs (Brock et al., 2018; Karras et al., 2018). DDPMs optimize a variational lower bound to the log-likelihood, whereas NCSNs optimize the score matching objective (Hyvärinen, 2005) over a nonparametric Parzen density estimator of the data (Vincent, 2011; Raphan & Simoncelli, 2011).

Despite their different motivations, DDPMs and NCSNs are closely related. Both use a denoising autoencoder objective for many noise levels, and both use a procedure similar to Langevin dynamics to produce samples (Neal et al., 2011). Since Langevin dynamics is a discretization of a gradient flow (Jordan et al., 1998), both DDPM and NCSN require many steps to achieve good sample quality. This aligns with the observation that DDPM and existing NCSN methods have trouble generating high-quality samples in a few iterations.

DDIM, on the other hand, is an implicit generative model (Mohamed & Lakshminarayanan, 2016) where samples are uniquely determined from the latent variables. Hence, DDIM has certain properties that resemble GANs (Goodfellow et al., 2014) and invertible flows (Dinh et al., 2016), such as the ability to produce semantically meaningful interpolations. We derive DDIM from a purely variational perspective, where the restrictions of Langevin dynamics are not relevant; this could partially explain why we are able to observe superior sample quality compared to DDPM under fewer iterations. The sampling procedure of DDIM is also reminiscent of neural networks with continuous depth (Chen et al., 2018; Grathwohl et al., 2018), since the samples it produces from the same latent variable have similar high-level visual features, regardless of the specific sample trajectory.

7 Discussion

We have presented DDIMs – an implicit generative model trained with denoising auto-encoding / score matching objectives – from a purely variational perspective. DDIM is able to generate high-quality samples much more efficiently than existing DDPMs and NCSNs, with the ability to perform meaningful interpolations from the latent space. The non-Markovian forward process presented here seems to suggest continuous forward processes other than Gaussian (which cannot be done in the original diffusion framework, since Gaussian is the only stable distribution with finite variance). We also demonstrated a discrete case with a multinomial forward process in Appendix A, and it would be interesting to investigate similar alternatives for other combinatorial structures.

Moreover, since the sampling procedure of DDIMs is similar to that of an neural ODE, it would be interesting to see if methods that decrease the discretization error in ODEs, including multi-step methods such as Adams-Bashforth (Butcher & Goodwin, 2008), could be helpful for further improving sample quality in fewer steps (Queiruga et al., 2020). It is also relevant to investigate whether DDIMs exhibit other properties of existing implicit models (Bau et al., 2019).

Acknowledgements

The authors would like to thank Yang Song and Shengjia Zhao for helpful discussions over the ideas, Kuno Kim for reviewing an earlier draft of the paper, and Sharvil Nanavati and Sophie Liu for identifying typos. This research was supported by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024), and Amazon AWS.

References

  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, January 2017.
  • Bau et al. (2019) David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Seeing what a gan cannot generate. In Proceedings of the IEEE International Conference on Computer Vision, pp.  4502–4511, 2019.
  • Bengio et al. (2014) Yoshua Bengio, Eric Laufer, Guillaume Alain, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. In International Conference on Machine Learning, pp. 226–234, January 2014.
  • Bishop (2006) Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.
  • Brock et al. (2018) Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, September 2018.
  • Butcher & Goodwin (2008) John Charles Butcher and Nicolette Goodwin. Numerical methods for ordinary differential equations, volume 2. Wiley Online Library, 2008.
  • Chen et al. (2020) Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, September 2020.
  • Chen et al. (2018) Ricky T Q Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, June 2018.
  • Dinh et al. (2016) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, May 2016.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Goyal et al. (2017) Anirudh Goyal, Nan Rosemary Ke, Surya Ganguli, and Yoshua Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net. In Advances in Neural Information Processing Systems, pp. 4392–4402, 2017.
  • Grathwohl et al. (2018) Will Grathwohl, Ricky T Q Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, October 2018.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5769–5779, 2017.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two Time-Scale update rule converge to a local nash equilibrium. arXiv preprint arXiv:1706.08500, June 2017.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, June 2020.
  • Hyvärinen (2005) Aapo Hyvärinen. Estimation of Non-Normalized statistical models by score matching. Journal of Machine Learning Researc h, 6:695–709, 2005.
  • Jolicoeur-Martineau et al. (2020) Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Rémi Tachet des Combes, and Ioannis Mitliagkas. Adversarial score matching and improved sampling for image generation. September 2020.
  • Jordan et al. (1998) Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker–planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
  • Karras et al. (2018) Tero Karras, Samuli Laine, and Timo Aila. A Style-Based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, December 2018.
  • Karras et al. (2020) Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  8110–8119, 2020.
  • Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-Encoding variational bayes. arXiv preprint arXiv:1312.6114v10, December 2013.
  • Levy et al. (2017) Daniel Levy, Matthew D Hoffman, and Jascha Sohl-Dickstein. Generalizing hamiltonian monte carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.
  • Mohamed & Lakshminarayanan (2016) Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, October 2016.
  • Neal et al. (2011) Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2, 2011.
  • Queiruga et al. (2020) Alejandro F Queiruga, N Benjamin Erichson, Dane Taylor, and Michael W Mahoney. Continuous-in-depth neural networks. arXiv preprint arXiv:2008.02389, 2020.
  • Raphan & Simoncelli (2011) Martin Raphan and Eero P Simoncelli. Least squares estimation without priors or supervision. Neural computation, 23(2):374–420, February 2011. ISSN 0899-7667, 1530-888X.
  • Rezende & Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, May 2015.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp.  234–241. Springer, 2015.
  • Salimans et al. (2014) Tim Salimans, Diederik P Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. arXiv preprint arXiv:1410.6460, October 2014.
  • Shoemake (1985) Ken Shoemake. Animating rotation with quaternion curves. In Proceedings of the 12th annual conference on Computer graphics and interactive techniques, pp.  245–254, 1985.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, March 2015.
  • Song et al. (2017) Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-nice-mc: Adversarial training for mcmc. arXiv preprint arXiv:1706.07561, June 2017.
  • Song & Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600, July 2019.
  • Song & Ermon (2020) Yang Song and Stefano Ermon. Improved techniques for training Score-Based generative models. arXiv preprint arXiv:2006.09011, June 2020.
  • Song et al. (2020) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • van den Oord et al. (2016a) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, September 2016a.
  • van den Oord et al. (2016b) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, January 2016b.
  • Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, May 2016.
  • Zhao et al. (2018) Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, and Stefano Ermon. Bias and generalization in deep generative models: An empirical study. In Advances in Neural Information Processing Systems, pp. 10792–10801, 2018.

Appendix A Non-Markovian Forward Processes for a Discrete Case

In this section, we describe a non-Markovian forward processes for discrete data and corresponding variational objectives. Since the focus of this paper is to accelerate reverse models corresponding to the Gaussian diffusion, we leave empirical evaluations as future work.

For a categorical observation 𝒙0{\bm{x}}_{0} that is a one-hot vector with KK possible values, we define the forward process as follows. First, we have q(𝒙t|𝒙0)q({\bm{x}}_{t}|{\bm{x}}_{0}) as the following categorical distribution:

q(𝒙t|𝒙0)=Cat(αt𝒙0+(1αt)𝟏K)\displaystyle q({\bm{x}}_{t}|{\bm{x}}_{0})=\mathrm{Cat}(\alpha_{t}{\bm{x}}_{0}+(1-\alpha_{t}){\bm{1}}_{K}) (17)

where 𝟏KK{\bm{1}}_{K}\in\mathbb{R}^{K} is a vector with all entries being 1/K1/K, and αt\alpha_{t} decreasing from α0=1\alpha_{0}=1 for t=0t=0 to αT=0\alpha_{T}=0 for t=Tt=T. Then we define q(𝒙t1|𝒙t,𝒙0)q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}) as the following mixture distribution:

q(𝒙t1|𝒙t,𝒙0)={Cat(𝒙t)with probability σtCat(𝒙0)with probability (αt1σtαt)Cat(𝟏K)with probability (1αt1)(1αt)σt,\displaystyle q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})=\begin{cases}\mathrm{Cat}({\bm{x}}_{t})&\text{with probability }\sigma_{t}\\ \mathrm{Cat}({\bm{x}}_{0})&\text{with probability }(\alpha_{t-1}-\sigma_{t}\alpha_{t})\\ \mathrm{Cat}({\bm{1}}_{K})&\text{with probability }(1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}\end{cases}, (18)

or equivalently:

q(𝒙t1|𝒙t,𝒙0)=Cat(σt𝒙t+(αt1σtαt)𝒙0+((1αt1)(1αt)σt)𝟏K),\displaystyle q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})=\mathrm{Cat}\left(\sigma_{t}{\bm{x}}_{t}+(\alpha_{t-1}-\sigma_{t}\alpha_{t}){\bm{x}}_{0}+((1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}){\bm{1}}_{K}\right), (19)

which is consistent with how we have defined q(𝒙t|𝒙0)q({\bm{x}}_{t}|{\bm{x}}_{0}).

Similarly, we can define our reverse process pθ(𝒙t1|𝒙t)p_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t}) as:

pθ(𝒙t1|𝒙t)=Cat(σt𝒙t+(αt1σtαt)fθ(t)(𝒙t)+((1αt1)(1αt)σt)𝟏K),\displaystyle p_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t})=\mathrm{Cat}\left(\sigma_{t}{\bm{x}}_{t}+(\alpha_{t-1}-\sigma_{t}\alpha_{t})f_{\theta}^{(t)}({\bm{x}}_{t})+((1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}){\bm{1}}_{K}\right), (20)

where fθ(t)(𝒙t)f_{\theta}^{(t)}({\bm{x}}_{t}) maps 𝒙t{\bm{x}}_{t} to a KK-dimensional vector. As (1αt1)(1αt)σt0(1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}\to 0, the sampling process will become less stochastic, in the sense that it will either choose 𝒙t{\bm{x}}_{t} or the predicted 𝒙0{\bm{x}}_{0} with high probability. The KL divergence

DKL(q(𝒙t1|𝒙t,𝒙0)pθ(𝒙t1|𝒙t))\displaystyle D_{\mathrm{KL}}(q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|p_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t})) (21)

is well-defined, and is simply the KL divergence between two categoricals. Therefore, the resulting variational objective function should be easy to optimize as well. Moreover, as KL divergence is convex, we have this upper bound (which is tight when the right hand side goes to zero):

DKL(q(𝒙t1|𝒙t,𝒙0)pθ(𝒙t1|𝒙t))(αt1σtαt)DKL(Cat(𝒙0)Cat(fθ(t)(𝒙t))).\displaystyle D_{\mathrm{KL}}(q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|p_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t}))\leq(\alpha_{t-1}-\sigma_{t}\alpha_{t})D_{\mathrm{KL}}(\mathrm{Cat}({\bm{x}}_{0})\|\mathrm{Cat}(f_{\theta}^{(t)}({\bm{x}}_{t}))).

The right hand side is simply a multi-class classification loss (up to constants), so we can arrive at similar arguments regarding how changes in σt\sigma_{t} do not affect the objective (up to re-weighting).

Appendix B Proofs

Lemma 1.

For qσ(𝐱1:T|𝐱0)q_{\sigma}({\bm{x}}_{1:T}|{\bm{x}}_{0}) defined in Eq. (6) and qσ(𝐱t1|𝐱t,𝐱0)q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}) defined in Eq. (7), we have:

qσ(𝒙t|𝒙0)=𝒩(αt𝒙0,(1αt)𝑰)\displaystyle q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t}}{\bm{x}}_{0},(1-\alpha_{t}){\bm{I}}) (22)
Proof.

Assume for any tTt\leq T, qσ(𝒙t|𝒙0)=𝒩(αt𝒙0,(1αt)𝑰)q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t}}{\bm{x}}_{0},(1-\alpha_{t}){\bm{I}}) holds, if:

qσ(𝒙t1|𝒙0)=𝒩(αt1𝒙0,(1αt1)𝑰)\displaystyle q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t-1}}{\bm{x}}_{0},(1-\alpha_{t-1}){\bm{I}}) (23)

then we can prove the statement with an induction argument for tt from TT to 11, since the base case (t=Tt=T) already holds.

First, we have that

qσ(𝒙t1|𝒙0):=𝒙tqσ(𝒙t|𝒙0)qσ(𝒙t1|𝒙t,𝒙0)d𝒙tq_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0}):=\int_{{\bm{x}}_{t}}q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0})q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\mathrm{d}{\bm{x}}_{t}

and

qσ(𝒙t|𝒙0)=𝒩(αt𝒙0,(1αt)𝑰)\displaystyle q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t}}{\bm{x}}_{0},(1-\alpha_{t}){\bm{I}}) (24)
qσ(𝒙t1|𝒙t,𝒙0)=𝒩(αt1𝒙0+1αt1σt2𝒙tαt𝒙01αt,σt2𝑰).\displaystyle q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})={\mathcal{N}}\left(\sqrt{\alpha_{t-1}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t-1}-\sigma^{2}_{t}}\cdot{\frac{{\bm{x}}_{t}-\sqrt{\alpha_{t}}{\bm{x}}_{0}}{\sqrt{1-\alpha_{t}}}},\sigma_{t}^{2}{\bm{I}}\right). (25)

From Bishop (2006) (2.115), we have that qσ(𝒙t1|𝒙0)q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0}) is Gaussian, denoted as 𝒩(μt1,Σt1){\mathcal{N}}(\mu_{t-1},\Sigma_{t-1}) where

μt1\displaystyle\mu_{t-1} =αt1𝒙0+1αt1σt2αt𝒙0αt𝒙01αt\displaystyle=\sqrt{\alpha_{t-1}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t-1}-\sigma^{2}_{t}}\cdot{\frac{\sqrt{\alpha_{t}}{\bm{x}}_{0}-\sqrt{\alpha_{t}}{\bm{x}}_{0}}{\sqrt{1-\alpha_{t}}}} (26)
=αt1𝒙0\displaystyle=\sqrt{\alpha_{t-1}}{\bm{x}}_{0} (27)

and

Σt1\displaystyle\Sigma_{t-1} =σt2𝑰+1αt1σt21αt(1αt)𝑰=(1αt1)𝑰\displaystyle=\sigma_{t}^{2}{\bm{I}}+\frac{1-\alpha_{t-1}-\sigma^{2}_{t}}{1-\alpha_{t}}(1-\alpha_{t}){\bm{I}}=(1-\alpha_{t-1}){\bm{I}} (28)

Therefore, qσ(𝒙t1|𝒙0)=𝒩(αt1𝒙0,(1αt1)𝑰)q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t-1}}{\bm{x}}_{0},(1-\alpha_{t-1}){\bm{I}}), which allows us to apply the induction argument. ∎

See 1

Proof.

From the definition of JσJ_{\sigma}:

Jσ(ϵθ)\displaystyle J_{\sigma}(\epsilon_{\theta}) :=𝔼𝒙0:Tq(𝒙0:T)[logqσ(𝒙T|𝒙0)+t=2Tlogqσ(𝒙t1|𝒙t,𝒙0)t=1Tlogpθ(t)(𝒙t1|𝒙t)]\displaystyle:={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q({\bm{x}}_{0:T})}\left[\log q_{\sigma}({\bm{x}}_{T}|{\bm{x}}_{0})+\sum_{t=2}^{T}\log q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})-\sum_{t=1}^{T}\log p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t})\right] (29)
𝔼𝒙0:Tq(𝒙0:T)[t=2TDKL(qσ(𝒙t1|𝒙t,𝒙0))pθ(t)(𝒙t1|𝒙t))logpθ(1)(𝒙0|𝒙1)]\displaystyle\equiv{\mathbb{E}}_{{\bm{x}}_{0:T}\sim q({\bm{x}}_{0:T})}\left[\sum_{t=2}^{T}D_{\mathrm{KL}}(q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}))\|p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t}))-\log p_{\theta}^{(1)}({\bm{x}}_{0}|{\bm{x}}_{1})\right]

where we use \equiv to denote ``equal up to a value that does not depend on ϵθ\epsilon_{\theta} (but may depend on qσq_{\sigma})''. For t>1t>1:

𝔼𝒙0,𝒙tq(𝒙0,𝒙t)[DKL(qσ(𝒙t1|𝒙t,𝒙0))pθ(t)(𝒙t1|𝒙t))]\displaystyle{\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{t}\sim q({\bm{x}}_{0},{\bm{x}}_{t})}[D_{\mathrm{KL}}(q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}))\|p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t}))]
=\displaystyle= 𝔼𝒙0,𝒙tq(𝒙0,𝒙t)[DKL(qσ(𝒙t1|𝒙t,𝒙0))qσ(𝒙t1|𝒙t,fθ(t)(𝒙t)))]\displaystyle\ {\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{t}\sim q({\bm{x}}_{0},{\bm{x}}_{t})}[D_{\mathrm{KL}}(q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}))\|q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},f_{\theta}^{(t)}({\bm{x}}_{t})))]
\displaystyle\equiv 𝔼𝒙0,𝒙tq(𝒙0,𝒙t)[𝒙0fθ(t)(𝒙t)222σt2]\displaystyle\ {\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{t}\sim q({\bm{x}}_{0},{\bm{x}}_{t})}\left[\frac{{\lVert{{\bm{x}}_{0}-f_{\theta}^{(t)}({\bm{x}}_{t})}\rVert}_{2}^{2}}{2\sigma_{t}^{2}}\right] (30)
=\displaystyle= 𝔼𝒙0q(𝒙0),ϵ𝒩(𝟎,𝑰),𝒙t=αt𝒙0+1αtϵ[(𝒙t1αtϵ)αt(𝒙t1αtϵθ(t)(𝒙t))αt222σt2]\displaystyle\ {\mathbb{E}}_{{\bm{x}}_{0}\sim q({\bm{x}}_{0}),\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}),{\bm{x}}_{t}=\sqrt{\alpha_{t}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon}\left[\frac{{\lVert{\frac{{({\bm{x}}_{t}-\sqrt{1-\alpha_{t}}\epsilon)}}{\sqrt{\alpha_{t}}}-\frac{({\bm{x}}_{t}-\sqrt{1-\alpha_{t}}\epsilon_{\theta}^{(t)}({\bm{x}}_{t}))}{\sqrt{\alpha_{t}}}}\rVert}_{2}^{2}}{2\sigma_{t}^{2}}\right] (31)
=\displaystyle= 𝔼𝒙0q(𝒙0),ϵ𝒩(𝟎,𝑰),𝒙t=αt𝒙0+1αtϵ[ϵϵθ(t)(𝒙t)222dσt2αt]\displaystyle\ {\mathbb{E}}_{{\bm{x}}_{0}\sim q({\bm{x}}_{0}),\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}),{\bm{x}}_{t}=\sqrt{\alpha_{t}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon}\left[\frac{{\lVert{\epsilon-\epsilon_{\theta}^{(t)}({\bm{x}}_{t})}\rVert}_{2}^{2}}{2d\sigma_{t}^{2}\alpha_{t}}\right] (32)

where dd is the dimension of 𝒙0{\bm{x}}_{0}. For t=1t=1:

𝔼𝒙0,𝒙1q(𝒙0,𝒙1)[logpθ(1)(𝒙0|𝒙1)]𝔼𝒙0,𝒙1q(𝒙0,𝒙1)[𝒙0fθ(t)(𝒙1)222σ12]\displaystyle\ {\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{1}\sim q({\bm{x}}_{0},{\bm{x}}_{1})}\left[-\log p_{\theta}^{(1)}({\bm{x}}_{0}|{\bm{x}}_{1})\right]\equiv{\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{1}\sim q({\bm{x}}_{0},{\bm{x}}_{1})}\left[\frac{{\lVert{{\bm{x}}_{0}-f_{\theta}^{(t)}({\bm{x}}_{1})}\rVert}_{2}^{2}}{2\sigma_{1}^{2}}\right] (33)
=\displaystyle= 𝔼𝒙0q(𝒙0),ϵ𝒩(𝟎,𝑰),𝒙1=α1𝒙0+1αtϵ[ϵϵθ(1)(𝒙1)222dσ12α1]\displaystyle\ {\mathbb{E}}_{{\bm{x}}_{0}\sim q({\bm{x}}_{0}),\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}),{\bm{x}}_{1}=\sqrt{\alpha_{1}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon}\left[\frac{{\lVert{\epsilon-\epsilon_{\theta}^{(1)}({\bm{x}}_{1})}\rVert}_{2}^{2}}{2d\sigma_{1}^{2}\alpha_{1}}\right] (34)

Therefore, when γt=1/(2dσt2αt)\gamma_{t}=1/(2d\sigma_{t}^{2}\alpha_{t}) for all t{1,,T}t\in\{1,\ldots,T\}, we have

Jσ(ϵθ)t=1T12dσt2αt𝔼[ϵθ(t)(𝒙t)ϵt22]=Lγ(ϵθ)\displaystyle J_{\sigma}(\epsilon_{\theta})\equiv\sum_{t=1}^{T}\frac{1}{2d\sigma_{t}^{2}\alpha_{t}}{\mathbb{E}}\left[{\lVert{\epsilon_{\theta}^{(t)}({\bm{x}}_{t})-\epsilon_{t}}\rVert}_{2}^{2}\right]=L_{\gamma}(\epsilon_{\theta}) (35)

for all ϵθ\epsilon_{\theta}. From the definition of ``\equiv'', we have that Jσ=Lγ+CJ_{\sigma}=L_{\gamma}+C. ∎

See 1

Proof.

In the context of the proof, we consider tt as a continous, independent ``time'' variable and 𝒙{\bm{x}} and α\alpha as functions of tt. First, let us consider a reparametrization between DDIM and the VE-SDE888Refer to (Song et al., 2020) for more details of VE-SDE. by introducing the variables 𝒙¯\bar{{\bm{x}}} and σ\sigma:

𝒙¯(t)=𝒙¯(0)+σ(t)ϵ,ϵ𝒩(0,𝑰),\displaystyle\bar{{\bm{x}}}(t)=\bar{{\bm{x}}}(0)+\sigma(t)\epsilon,\quad\epsilon\sim{\mathcal{N}}(0,{\bm{I}}), (36)

for t[0,)t\in[0,\infty) and an increasing continuous function σ:00\sigma:{\mathbb{R}}_{\geq 0}\to{\mathbb{R}}_{\geq 0} where σ(0)=0\sigma(0)=0.

We can then define α(t)\alpha(t) and 𝒙(t){\bm{x}}(t) corresponding to DDIM case as:

𝒙¯(t)=𝒙(t)α(t)\displaystyle\bar{{\bm{x}}}(t)=\frac{{\bm{x}}(t)}{\sqrt{\alpha(t)}} (37)
σ(t)=1α(t)α(t).\displaystyle\sigma(t)=\sqrt{\frac{1-\alpha(t)}{\alpha(t)}}. (38)

This also means that:

𝒙(t)=𝒙¯(t)σ2(t)+1\displaystyle{\bm{x}}(t)=\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}} (39)
α(t)=11+σ2(t),\displaystyle\alpha(t)=\frac{1}{1+\sigma^{2}(t)}, (40)

which establishes an bijection between (𝒙,α)({\bm{x}},\alpha) and (𝒙¯,σ)(\bar{{\bm{x}}},\sigma). From Equation (4) we have (note that α(0)=1\alpha(0)=1):

𝒙(t)α(t)=𝒙(0)α(0)+1α(t)α(t)ϵ,ϵ𝒩(0,𝑰)\displaystyle\frac{{\bm{x}}(t)}{\sqrt{\alpha(t)}}=\frac{{\bm{x}}(0)}{\sqrt{\alpha(0)}}+\sqrt{\frac{1-\alpha(t)}{\alpha(t)}}\epsilon,\quad\epsilon\sim{\mathcal{N}}(0,{\bm{I}}) (41)

which can be reparametrized into a form that is consistent with VE-SDE:

𝒙¯(t)=𝒙¯(0)+σ(t)ϵ.\displaystyle\bar{{\bm{x}}}(t)=\bar{{\bm{x}}}(0)+\sigma(t)\epsilon. (42)

Now, we derive the ODE forms for both DDIM and VE-SDE and show that they are equivalent.

ODE form for DDIM

We repeat Equation (13) here:

𝒙tΔtαtΔt=𝒙tαt+(1αtΔtαtΔt1αtαt)ϵθ(t)(𝒙t),\displaystyle\frac{{\bm{x}}_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}}=\frac{{\bm{x}}_{t}}{\sqrt{\alpha_{t}}}+\left(\sqrt{\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}}-\sqrt{\frac{1-\alpha_{t}}{\alpha_{t}}}\right)\epsilon_{\theta}^{(t)}({\bm{x}}_{t}), (43)

which is equivalent to:

𝒙¯(tΔt)=𝒙¯(t)+(σ(tΔt)σ(t))ϵθ(t)(𝒙(t))\displaystyle\bar{{\bm{x}}}(t-\Delta t)=\bar{{\bm{x}}}(t)+(\sigma(t-\Delta t)-\sigma(t))\cdot\epsilon_{\theta}^{(t)}({\bm{x}}(t)) (44)

Divide both sides by (Δt)(-\Delta t) and as Δt0\Delta t\to 0, we have:

d𝒙¯(t)dt=dσ(t)dtϵθ(t)(𝒙¯(t)σ2(t)+1),\displaystyle\frac{\mathrm{d}\bar{{\bm{x}}}(t)}{\mathrm{d}t}=\frac{\mathrm{d}\sigma(t)}{\mathrm{d}t}\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}\right), (45)

which is exactly what we have in Equation (14).

We note that for the optimal model, ϵθ(t)\epsilon_{\theta}^{(t)} is a minimizer:

ϵθ(t)=argminft𝔼𝒙(0)q(𝒙),ϵ𝒩(0,𝑰)[ft(𝒙(t))ϵ22]\displaystyle\epsilon_{\theta}^{(t)}=\operatorname*{arg\,min}_{f_{t}}{\mathbb{E}}_{{\bm{x}}(0)\sim q({\bm{x}}),\epsilon\sim{\mathcal{N}}(0,{\bm{I}})}[{\lVert{f_{t}({\bm{x}}(t))-\epsilon}\rVert}_{2}^{2}] (46)

where 𝒙(t)=α(t)𝒙(t)+1α(t)ϵ{\bm{x}}(t)=\sqrt{\alpha(t)}{\bm{x}}(t)+\sqrt{1-\alpha(t)}\epsilon.

ODE form for VE-SDE

Define pt(𝒙¯)p_{t}(\bar{{\bm{x}}}) as the data distribution perturbed with σ2(t)\sigma^{2}(t) variance Gaussian noise. The probability flow for VE-SDE is defined as Song et al. (2020):

d𝒙¯=12g(t)2𝒙¯logpt(𝒙¯)dt\displaystyle\mathrm{d}\bar{{\bm{x}}}=-\frac{1}{2}g(t)^{2}\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}})\mathrm{d}t (47)

where g(t)=dσ2(t)dtg(t)=\sqrt{\frac{\mathrm{d}\sigma^{2}(t)}{\mathrm{d}t}} is the diffusion coefficient, and 𝒙¯logpt(𝒙¯)\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}}) is the score of ptp_{t}.

The σ(t)\sigma(t)-perturbed score function 𝒙¯logpt(𝒙¯)\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}}) is also a minimizer (from denoising score matching (Vincent, 2011)):

𝒙¯logpt=argmingt𝔼𝒙(0)q(𝒙),ϵ𝒩(0,𝑰)[gt(𝒙¯)+ϵ/σ(t)22]\displaystyle\nabla_{\bar{{\bm{x}}}}\log p_{t}=\operatorname*{arg\,min}_{g_{t}}{\mathbb{E}}_{{\bm{x}}(0)\sim q({\bm{x}}),\epsilon\sim{\mathcal{N}}(0,{\bm{I}})}[{\lVert{g_{t}(\bar{{\bm{x}}})+\epsilon/\sigma(t)}\rVert}_{2}^{2}] (48)

where 𝒙¯(t)=𝒙¯(t)+σ(t)ϵ\bar{{\bm{x}}}(t)=\bar{{\bm{x}}}(t)+\sigma(t)\epsilon.

Since there is an equivalence between 𝒙(t){\bm{x}}(t) and 𝒙¯(t)\bar{{\bm{x}}}(t), we have the following relationship:

𝒙¯logpt(𝒙¯)=ϵθ(t)(𝒙¯(t)σ2(t)+1)σ(t)\displaystyle\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}})=-\frac{\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}\right)}{\sigma(t)} (49)

from Equation (46) and Equation (48). Plug Equation (49) and definition of g(t)g(t) in Equation (47), we have:

d𝒙¯(t)=12dσ2(t)dtϵθ(t)(𝒙¯(t)σ2(t)+1)σ(t)dt,\displaystyle\mathrm{d}\bar{{\bm{x}}}(t)=\frac{1}{2}\frac{\mathrm{d}\sigma^{2}(t)}{\mathrm{d}t}\frac{\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}\right)}{\sigma(t)}\mathrm{d}t, (50)

and we have the following by rearranging terms:

d𝒙¯(t)dt=dσ(t)dtϵθ(t)(𝒙¯(t)σ2(t)+1)\displaystyle\frac{\mathrm{d}\bar{{\bm{x}}}(t)}{\mathrm{d}t}=\frac{\mathrm{d}\sigma(t)}{\mathrm{d}t}\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}\right) (51)

which is equivalent to Equation (45). In both cases the initial conditions are 𝒙¯(T)𝒩(𝟎,σ2(T)𝑰)\bar{{\bm{x}}}(T)\sim{\mathcal{N}}({\bm{0}},\sigma^{2}(T){\bm{I}}), so the resulting ODEs are identical. ∎

Appendix C Additional Derivations

C.1 Accelerated sampling processes

In the accelerated case, we can consider the inference process to be factored as:

qσ,τ(𝒙1:T|𝒙0)=qσ,τ(𝒙τS|𝒙0)i=1Sqσ,τ(𝒙τi1|𝒙τi,𝒙0)tτ¯qσ,τ(𝒙t|𝒙0)\displaystyle q_{\sigma,\tau}({\bm{x}}_{1:T}|{\bm{x}}_{0})=q_{\sigma,\tau}({\bm{x}}_{\tau_{S}}|{\bm{x}}_{0})\prod_{i=1}^{S}q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0})\prod_{t\in\bar{\tau}}q_{\sigma,\tau}({\bm{x}}_{t}|{\bm{x}}_{0}) (52)

where τ\tau is a sub-sequence of [1,,T][1,\ldots,T] of length SS with τS=T\tau_{S}=T, and let τ¯:={1,,T}τ\bar{\tau}:=\{1,\ldots,T\}\setminus\tau be its complement. Intuitively, the graphical model of {𝒙τi}i=1S\{{\bm{x}}_{\tau_{i}}\}_{i=1}^{S} and 𝒙0{\bm{x}}_{0} form a chain, whereas the graphical model of {𝒙t}tτ¯\{{\bm{x}}_{t}\}_{t\in\bar{\tau}} and 𝒙0{\bm{x}}_{0} forms a star graph. We define:

qσ,τ(𝒙t|𝒙0)=𝒩(αt𝒙0,(1αt)𝑰)tτ¯{T}\displaystyle q_{\sigma,\tau}({\bm{x}}_{t}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t}}{\bm{x}}_{0},(1-\alpha_{t}){\bm{I}})\quad\forall t\in\bar{\tau}\cup\{T\} (53)
qσ,τ(𝒙τi1|𝒙τi,𝒙0)=𝒩(ατi1𝒙0+1ατi1στi2𝒙τiατi𝒙01ατi,στi2𝑰)i[S]\displaystyle q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0})={\mathcal{N}}\left(\sqrt{\alpha_{\tau_{i-1}}}{\bm{x}}_{0}+\sqrt{1-\alpha_{\tau_{i-1}}-\sigma^{2}_{\tau_{i}}}\cdot{\frac{{\bm{x}}_{\tau_{i}}-\sqrt{\alpha_{\tau_{i}}}{\bm{x}}_{0}}{\sqrt{1-\alpha_{\tau_{i}}}}},\sigma_{{\tau_{i}}}^{2}{\bm{I}}\right)\ \forall i\in[S]

where the coefficients are chosen such that:

qσ,τ(𝒙τi|𝒙0)=𝒩(ατi𝒙0,(1ατi)𝑰)i[S]\displaystyle q_{\sigma,\tau}({\bm{x}}_{\tau_{i}}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{\tau_{i}}}{\bm{x}}_{0},(1-\alpha_{\tau_{i}}){\bm{I}})\quad\forall i\in[S] (54)

i.e., the ``marginals'' match.

The corresponding ``generative process'' is defined as:

pθ(𝒙0:T):=pθ(𝒙T)i=1Spθ(τi)(𝒙τi1|𝒙τi)use to produce samples×tτ¯pθ(t)(𝒙0|𝒙t)in variational objective\displaystyle p_{\theta}({\bm{x}}_{0:T}):=\underbrace{p_{\theta}({\bm{x}}_{T})\prod_{i=1}^{S}p^{(\tau_{i})}_{\theta}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}})}_{\text{use to produce samples}}\times\underbrace{\prod_{t\in\bar{\tau}}p_{\theta}^{(t)}({\bm{x}}_{0}|{\bm{x}}_{t})}_{\text{in variational objective}} (55)

where only part of the models are actually being used to produce samples. The conditionals are:

pθ(τi)(𝒙τi1|𝒙τi)=qσ,τ(𝒙τi1|𝒙τi,fθ(τi)(𝒙τi1))ifi[S],i>1\displaystyle p_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}})=q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},f_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i-1}}))\quad\text{if}\ i\in[S],i>1 (56)
pθ(t)(𝒙0|𝒙t)=𝒩(fθ(t)(𝒙t),σt2𝑰)otherwise,\displaystyle p_{\theta}^{(t)}({\bm{x}}_{0}|{\bm{x}}_{t})={\mathcal{N}}(f_{\theta}^{(t)}({\bm{x}}_{t}),\sigma_{t}^{2}{\bm{I}})\quad\text{otherwise,} (57)

where we leverage qσ,τ(𝒙τi1|𝒙τi,𝒙0)q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0}) as part of the inference process (similar to what we have done in Section 3). The resulting variational objective becomes (define 𝒙τL+1={\bm{x}}_{\tau_{L+1}}=\varnothing for conciseness):

J(ϵθ)\displaystyle J(\epsilon_{\theta}) =𝔼𝒙0:Tqσ,τ(𝒙0:T)[logqσ,τ(𝒙1:T|𝒙0)logpθ(𝒙0:T)]\displaystyle={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q_{\sigma,\tau}({\bm{x}}_{0:T})}[\log q_{\sigma,\tau}({\bm{x}}_{1:T}|{\bm{x}}_{0})-\log p_{\theta}({\bm{x}}_{0:T})] (58)
=𝔼𝒙0:Tqσ,τ(𝒙0:T)[tτ¯DKL(qσ,τ(𝒙t|𝒙0)pθ(t)(𝒙0|𝒙t)\displaystyle={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q_{\sigma,\tau}({\bm{x}}_{0:T})}\Bigg{[}\sum_{t\in\bar{\tau}}D_{\mathrm{KL}}(q_{\sigma,\tau}({\bm{x}}_{t}|{\bm{x}}_{0})\|p_{\theta}^{(t)}({\bm{x}}_{0}|{\bm{x}}_{t}) (59)
+i=1LDKL(qσ,τ(𝒙τi1|𝒙τi,𝒙0)pθ(τi)(𝒙τi1|𝒙τi)))]\displaystyle\qquad\qquad\qquad\qquad+\sum_{i=1}^{L}D_{\mathrm{KL}}(q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0})\|p_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}})))\Bigg{]}

where each KL divergence is between two Gaussians with variance independent of θ\theta. A similar argument to the proof used in Theorem 1 can show that the variational objective JJ can also be converted to an objective of the form LγL_{\gamma}.

C.2 Derivation of denoising objectives for DDPMs

We note that in Ho et al. (2020), a diffusion hyperparameter βt\beta_{t}999In this section we use teal to color notations used in Ho et al. (2020). is first introduced, and then relevant variables αt:=1βt\alpha_{t}:=1-\beta_{t} and α¯t=t=1Tαt\bar{\alpha}_{t}=\prod_{t=1}^{T}\alpha_{t} are defined. In this paper, we have used the notation αt\alpha_{t} to represent the variable α¯t\bar{\alpha}_{t} in Ho et al. (2020) for three reasons. First, it makes it more clear that we only need to choose one set of hyperparameters, reducing possible cross-references of the derived variables. Second, it allows us to introduce the generalization as well as the acceleration case easier, because the inference process is no longer motivated by a diffusion. Third, there exists an isomorphism between α1:T\alpha_{1:T} and 1,,T1,\ldots,T, which is not the case for βt\beta_{t}.

In this section, we use βt\beta_{t} and αt\alpha_{t} to be more consistent with the derivation in Ho et al. (2020), where

αt=αtαt1\displaystyle{\color[rgb]{0,.5,.5}\alpha_{t}}=\frac{\alpha_{t}}{\alpha_{t-1}} (60)
βt=1αtαt1\displaystyle{\color[rgb]{0,.5,.5}\beta_{t}}=1-\frac{\alpha_{t}}{\alpha_{t-1}} (61)

can be uniquely determined from αt\alpha_{t} (i.e. α¯t\bar{\alpha}_{t}).

First, from the diffusion forward process:

q(𝒙t1|𝒙t,𝒙0)=𝒩(αt1βt1αt𝒙0+αt(1αt1)1αt𝒙tμ~(𝒙t,𝒙0),1αt11αtβt𝑰)\displaystyle q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})={\mathcal{N}}\Bigg{(}\underbrace{\frac{\sqrt{\alpha_{t-1}}{\color[rgb]{0,.5,.5}\beta_{t}}}{1-\alpha_{t}}{\bm{x}}_{0}+\frac{\sqrt{{\color[rgb]{0,.5,.5}\alpha_{t}}}(1-\alpha_{t-1})}{1-\alpha_{t}}{\bm{x}}_{t}}_{\color[rgb]{0,.5,.5}\tilde{\mu}({\bm{x}}_{t},{\bm{x}}_{0})},\frac{1-\alpha_{t-1}}{1-\alpha_{t}}{\color[rgb]{0,.5,.5}\beta_{t}}{\bm{I}}\Bigg{)}

Ho et al. (2020) considered a specific type of pθ(t)(𝒙t1|𝒙t)p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t}):

pθ(t)(𝒙t1|𝒙t)=𝒩(μθ(𝒙t,t),σt𝑰)\displaystyle p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t})={\mathcal{N}}\left({\color[rgb]{0,.5,.5}\mu_{\theta}({\bm{x}}_{t},t)},\sigma_{t}{\bm{I}}\right) (62)

which leads to the following variational objective:

L\displaystyle{\color[rgb]{0,.5,.5}L} :=𝔼𝒙0:Tq(𝒙0:T)[q(𝒙T|𝒙0)+t=2Tlogq(𝒙t1|𝒙t,𝒙0)t=1Tlogpθ(t)(𝒙t1|𝒙t)]\displaystyle:={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q({\bm{x}}_{0:T})}\left[q({\bm{x}}_{T}|{\bm{x}}_{0})+\sum_{t=2}^{T}\log q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})-\sum_{t=1}^{T}\log p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t})\right] (63)
𝔼𝒙0:Tq(𝒙0:T)[t=2TDKL(q(𝒙t1|𝒙t,𝒙0))pθ(t)(𝒙t1|𝒙t))Lt1logpθ(1)(𝒙0|𝒙1)]\displaystyle\equiv{\mathbb{E}}_{{\bm{x}}_{0:T}\sim q({\bm{x}}_{0:T})}\left[\sum_{t=2}^{T}\underbrace{D_{\mathrm{KL}}(q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}))\|p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t}))}_{\color[rgb]{0,.5,.5}L_{t-1}}-\log p_{\theta}^{(1)}({\bm{x}}_{0}|{\bm{x}}_{1})\right]

One can write:

Lt1=𝔼q[12σt2μθ(𝒙t,t)μ~(𝒙t,𝒙0)22]\displaystyle{\color[rgb]{0,.5,.5}L_{t-1}}={\mathbb{E}}_{q}\left[\frac{1}{2\sigma_{t}^{2}}{\lVert{{\color[rgb]{0,.5,.5}\mu_{\theta}({\bm{x}}_{t},t)}-{\color[rgb]{0,.5,.5}\tilde{\mu}({\bm{x}}_{t},{\bm{x}}_{0})}}\rVert}_{2}^{2}\right] (64)

Ho et al. (2020) chose the parametrization

μθ(𝒙t,t)=1αt(𝒙tβt1αtϵθ(𝒙t,t))\displaystyle{\color[rgb]{0,.5,.5}\mu_{\theta}({\bm{x}}_{t},t)}=\frac{1}{\sqrt{{\color[rgb]{0,.5,.5}\alpha_{t}}}}\left({\bm{x}}_{t}-\frac{{\color[rgb]{0,.5,.5}\beta_{t}}}{\sqrt{1-\alpha_{t}}}{\color[rgb]{0,.5,.5}\epsilon_{\theta}({\bm{x}}_{t},t)}\right) (65)

which can be simplified to:

Lt1=𝔼𝒙0,ϵ[βt22σt2(1αt)αtϵϵθ(αt𝒙0+1αtϵ,t)22]\displaystyle{\color[rgb]{0,.5,.5}L_{t-1}}={\mathbb{E}}_{{\bm{x}}_{0},\epsilon}\left[\frac{{\color[rgb]{0,.5,.5}\beta_{t}}^{2}}{2\sigma_{t}^{2}(1-\alpha_{t}){\color[rgb]{0,.5,.5}\alpha_{t}}}{\lVert{\epsilon-{\color[rgb]{0,.5,.5}\epsilon_{\theta}(\sqrt{\alpha_{t}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon,t)}}\rVert}_{2}^{2}\right] (66)

Appendix D Experimental Details

D.1 Datasets and architectures

We consider 4 image datasets with various resolutions: CIFAR10 (32×3232\times 32, unconditional), CelebA (64×6464\times 64), LSUN Bedroom (256×256256\times 256) and LSUN Church (256×256256\times 256). For all datasets, we set the hyperparameters α\alpha according to the heuristic in (Ho et al., 2020) to make the results directly comparable. We use the same model for each dataset, and only compare the performance of different generative processes. For CIFAR10, Bedroom and Church, we obtain the pretrained checkpoints from the original DDPM implementation; for CelebA, we trained our own model using the denoising objective L𝟏L_{\bm{1}}.

Our architecture for ϵθ(t)(𝒙t)\epsilon_{\theta}^{(t)}({\bm{x}}_{t}) follows that in Ho et al. (2020), which is a U-Net (Ronneberger et al., 2015) based on a Wide ResNet (Zagoruyko & Komodakis, 2016). We use the pretrained models from Ho et al. (2020) for CIFAR10, Bedroom and Church, and train our own model for the CelebA 64×6464\times 64 model (since a pretrained model is not provided). Our CelebA model has five feature map resolutions from 64×6464\times 64 to 4×44\times 4, and we use the original CelebA dataset (not CelebA-HQ) using the pre-processing technique from the StyleGAN (Karras et al., 2018) repository.

Table 3: LSUN Bedroom and Church image generation results, measured in FID. For 1000 steps DDPM, the FIDs are 6.36 for Bedroom and 7.89 for Church.
Bedroom (256×256256\times 256) Church (256×256256\times 256)
dim(τ)\dim(\tau) 10 20 50 100 10 20 50 100
DDIM (η=0.0\eta=0.0) 16.95 8.89 6.75 6.62 19.45 12.47 10.84 10.58
DDPM (η=1.0\eta=1.0) 42.78 22.77 10.81 6.81 51.56 23.37 11.16 8.27

D.2 Reverse process sub-sequence selection

We consider two types of selection procedure for τ\tau given the desired dim(τ)<T\dim(\tau)<T:

  • Linear: we select the timesteps such that τi=ci\tau_{i}=\lfloor ci\rfloor for some cc;

  • Quadratic: we select the timesteps such that τi=ci2\tau_{i}=\lfloor ci^{2}\rfloor for some cc.

The constant value cc is selected such that τ1\tau_{-1} is close to TT. We used quadratic for CIFAR10 and linear for the remaining datasets. These choices achieve slightly better FID than their alternatives in the respective datasets.

D.3 Closed form equations for each sampling step

From the general sampling equation in Eq. (12), we have the following update equation:

𝒙τi1(η)=ατi1(𝒙τi1ατiϵθ(τi)(𝒙τi)ατi)+1ατi1στi(η)2ϵθ(τi)(𝒙τi)+στi(η)ϵ\displaystyle{\bm{x}}_{\tau_{i-1}}(\eta)=\sqrt{\alpha_{\tau_{i-1}}}\left(\frac{{\bm{x}}_{\tau_{i}}-\sqrt{1-\alpha_{\tau_{i}}}\epsilon_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i}})}{\sqrt{\alpha_{\tau_{i}}}}\right)+\sqrt{1-\alpha_{\tau_{i-1}}-\sigma_{\tau_{i}}(\eta)^{2}}\cdot\epsilon_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i}})+\sigma_{\tau_{i}}(\eta)\epsilon

where

στi(η)=η1ατi11ατi1ατiατi1\sigma_{\tau_{i}}(\eta)=\eta\sqrt{\frac{1-\alpha_{\tau_{i-1}}}{1-\alpha_{\tau_{i}}}}\sqrt{1-\frac{\alpha_{\tau_{i}}}{\alpha_{\tau_{i-1}}}}

For the case of σ^\hat{\sigma} (DDPM with a larger variance), the update equation becomes:

𝒙τi1=ατi1(𝒙τi1ατiϵθ(τi)(𝒙τi)ατi)+1ατi1στi(1)2ϵθ(τi)(𝒙τi)+σ^τiϵ\displaystyle{\bm{x}}_{\tau_{i-1}}=\sqrt{\alpha_{\tau_{i-1}}}\left(\frac{{\bm{x}}_{\tau_{i}}-\sqrt{1-\alpha_{\tau_{i}}}\epsilon_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i}})}{\sqrt{\alpha_{\tau_{i}}}}\right)+\sqrt{1-\alpha_{\tau_{i-1}}-\sigma_{\tau_{i}}(1)^{2}}\cdot\epsilon_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i}})+\hat{\sigma}_{\tau_{i}}\epsilon

which uses a different coefficient for ϵ\epsilon compared with the update for η=1\eta=1, but uses the same coefficient for the non-stochastic parts. This update is more stochastic than the update for η=1\eta=1, which explains why it achieves worse performance when dim(τ)\dim(\tau) is small.

D.4 Samples and Consistency

We show more samples in Figure 7 (CIFAR10), Figure 8 (CelebA), Figure 10 (Church) and consistency results of DDIM in Figure 9 (CelebA).

Refer to caption
Refer to caption
Refer to caption
Figure 7: CIFAR10 samples from 1000 step DDPM, 1000 step DDIM and 100 step DDIM.
Refer to caption
Refer to caption
Refer to caption
Figure 8: CelebA samples from 1000 step DDPM, 1000 step DDIM and 100 step DDIM.
Refer to caption
Figure 9: CelebA samples from DDIM with the same random 𝒙T{\bm{x}}_{T} and different number of steps.
Refer to caption
Refer to caption
Figure 10: Church samples from 100 step DDPM and 100 step DDIM.

D.5 Interpolation

To generate interpolations on a line, we randomly sample two initial 𝒙T{\bm{x}}_{T} values from the standard Gaussian, interpolate them with spherical linear interpolation (Shoemake, 1985), and then use the DDIM to obtain 𝒙0{\bm{x}}_{0} samples.

𝒙T(α)=sin((1α)θ)sin(θ)𝒙T(0)+sin(αθ)sin(θ)𝒙T(1)\displaystyle{\bm{x}}_{T}^{(\alpha)}=\frac{\sin((1-\alpha)\theta)}{\sin(\theta)}{\bm{x}}_{T}^{(0)}+\frac{\sin(\alpha\theta)}{\sin(\theta)}{\bm{x}}_{T}^{(1)} (67)

where θ=arccos((𝒙T(0))𝒙T(1)𝒙T(0)𝒙T(1))\theta=\arccos\left(\frac{({\bm{x}}_{T}^{(0)})^{\top}{\bm{x}}_{T}^{(1)}}{{\lVert{{\bm{x}}_{T}^{(0)}}\rVert}{\lVert{{\bm{x}}_{T}^{(1)}}\rVert}}\right). These values are used to produce DDIM samples.

To generate interpolations on a grid, we sample four latent variables and separate them in to two pairs; then we use slerp with the pairs under the same α\alpha, and use slerp over the interpolated samples across the pairs (under an independently chosen interpolation coefficient). We show more grid interpolation results in Figure 11 (CelebA), Figure 12 (Bedroom), and Figure 13 (Church).

Refer to caption
Figure 11: More interpolations from the CelebA DDIM with dim(τ)=50\dim(\tau)=50.
Refer to caption
Figure 12: More interpolations from the Bedroom DDIM with dim(τ)=50\dim(\tau)=50.
Refer to caption
Figure 13: More interpolations from the Church DDIM with dim(τ)=50\dim(\tau)=50.