
Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng & Stefano Ermon

Stanford University
{tsong,chenlin,ermon}@cs.stanford.edu
Abstract

Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps in order to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient class of iterative implicit probabilistic models with the same training procedure as DDPMs. In DDPMs, the generative process is defined as the reverse of a particular Markovian diffusion process. We generalize DDPMs via a class of non-Markovian diffusion processes that lead to the same training objective. These non-Markovian processes can correspond to generative processes that are deterministic, giving rise to implicit models that produce high quality samples much faster. We empirically demonstrate that DDIMs can produce high quality samples $10\times$ to $50\times$ faster in terms of wall-clock time compared to DDPMs, allow us to trade off computation for sample quality, perform semantically meaningful image interpolation directly in the latent space, and reconstruct observations with very low error.

1 Introduction

Deep generative models have demonstrated the ability to produce high quality samples in many domains (Karras et al., 2020; van den Oord et al., 2016a). In terms of image generation, generative adversarial networks (GANs, Goodfellow et al. (2014)) currently exhibit higher sample quality than likelihood-based methods such as variational autoencoders (Kingma & Welling, 2013), autoregressive models (van den Oord et al., 2016b) and normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2016). However, GANs require very specific choices in optimization and architectures in order to stabilize training (Arjovsky et al., 2017; Gulrajani et al., 2017; Karras et al., 2018; Brock et al., 2018), and could fail to cover modes of the data distribution (Zhao et al., 2018).

Recent works on iterative generative models (Bengio et al., 2014), such as denoising diffusion probabilistic models (DDPM, Ho et al. (2020)) and noise conditional score networks (NCSN, Song & Ermon (2019)) have demonstrated the ability to produce samples comparable to that of GANs, without having to perform adversarial training. To achieve this, many denoising autoencoding models are trained to denoise samples corrupted by various levels of Gaussian noise. Samples are then produced by a Markov chain which, starting from white noise, progressively denoises it into an image. This generative Markov Chain process is either based on Langevin dynamics (Song & Ermon, 2019) or obtained by reversing a forward diffusion process that progressively turns an image into noise (Sohl-Dickstein et al., 2015).

A critical drawback of these models is that they require many iterations to produce a high quality sample. For DDPMs, this is because the generative process (from noise to data) approximates the reverse of the forward diffusion process (from data to noise), which could have thousands of steps; iterating over all the steps is required to produce a single sample, which is much slower compared to GANs, which only need one pass through a network. For example, it takes around 20 hours to sample 50k images of size $32\times 32$ from a DDPM, but less than a minute to do so from a GAN on a Nvidia 2080 Ti GPU. This becomes more problematic for larger images, as sampling 50k images of size $256\times 256$ could take nearly $1000$ hours on the same GPU.

To close this efficiency gap between DDPMs and GANs, we present denoising diffusion implicit models (DDIMs). DDIMs are implicit probabilistic models (Mohamed & Lakshminarayanan, 2016) and are closely related to DDPMs, in the sense that they are trained with the same objective function. In Section 3, we generalize the forward diffusion process used by DDPMs, which is Markovian, to non-Markovian ones, for which we are still able to design suitable reverse generative Markov chains. We show that the resulting variational training objectives have a shared surrogate objective, which is exactly the objective used to train DDPM. Therefore, we can freely choose from a large family of generative models using the same neural network simply by choosing a different, non-Markovian diffusion process (Section 4.1) and the corresponding reverse generative Markov chain. In particular, we are able to use non-Markovian diffusion processes which lead to "short" generative Markov chains (Section 4.2) that can be simulated in a small number of steps. This can massively increase sample efficiency at only a minor cost in sample quality.

In Section 5, we demonstrate several empirical benefits of DDIMs over DDPMs. First, DDIMs have superior sample generation quality compared to DDPMs when we accelerate sampling by $10\times$ to $100\times$ using our proposed method. Second, DDIM samples have the following "consistency" property, which does not hold for DDPMs: if we start with the same initial latent variable and generate several samples with Markov chains of various lengths, these samples will have similar high-level features. Third, because of this "consistency", we can perform semantically meaningful image interpolation by manipulating the initial latent variable in DDIMs, unlike DDPMs, which interpolate near the image space due to the stochastic generative process.

2 Background

Figure 1: Graphical models for diffusion (left) and non-Markovian (right) inference models.

Given samples from a data distribution $q({\bm{x}}_0)$, we are interested in learning a model distribution $p_\theta({\bm{x}}_0)$ that approximates $q({\bm{x}}_0)$ and is easy to sample from. Denoising diffusion probabilistic models (DDPMs, Sohl-Dickstein et al. (2015); Ho et al. (2020)) are latent variable models of the form

$$p_\theta({\bm{x}}_0)=\int p_\theta({\bm{x}}_{0:T})\,\mathrm{d}{\bm{x}}_{1:T},\quad\text{where}\quad p_\theta({\bm{x}}_{0:T}):=p_\theta({\bm{x}}_T)\prod_{t=1}^{T}p^{(t)}_\theta({\bm{x}}_{t-1}|{\bm{x}}_t)$$ (1)

where ${\bm{x}}_1,\ldots,{\bm{x}}_T$ are latent variables in the same sample space as ${\bm{x}}_0$ (denoted as ${\mathcal{X}}$). The parameters $\theta$ are learned to fit the data distribution $q({\bm{x}}_0)$ by maximizing a variational lower bound:

$$\max_\theta{\mathbb{E}}_{q({\bm{x}}_0)}[\log p_\theta({\bm{x}}_0)]\geq\max_\theta{\mathbb{E}}_{q({\bm{x}}_0,{\bm{x}}_1,\ldots,{\bm{x}}_T)}\left[\log p_\theta({\bm{x}}_{0:T})-\log q({\bm{x}}_{1:T}|{\bm{x}}_0)\right]$$ (2)

where $q({\bm{x}}_{1:T}|{\bm{x}}_0)$ is some inference distribution over the latent variables. Unlike typical latent variable models (such as the variational autoencoder (Rezende et al., 2014)), DDPMs are learned with a fixed (rather than trainable) inference procedure $q({\bm{x}}_{1:T}|{\bm{x}}_0)$, and latent variables are relatively high dimensional. For example, Ho et al. (2020) considered the following Markov chain with Gaussian transitions parameterized by a decreasing sequence $\alpha_{1:T}\in(0,1]^T$:

$$q({\bm{x}}_{1:T}|{\bm{x}}_0):=\prod_{t=1}^{T}q({\bm{x}}_t|{\bm{x}}_{t-1}),\quad\text{where}\quad q({\bm{x}}_t|{\bm{x}}_{t-1}):={\mathcal{N}}\left(\sqrt{\frac{\alpha_t}{\alpha_{t-1}}}{\bm{x}}_{t-1},\left(1-\frac{\alpha_t}{\alpha_{t-1}}\right){\bm{I}}\right)$$ (3)

where the covariance matrix is ensured to have positive terms on its diagonal. This is called the forward process due to the autoregressive nature of the sampling procedure (from ${\bm{x}}_0$ to ${\bm{x}}_T$). We call the latent variable model $p_\theta({\bm{x}}_{0:T})$, which is a Markov chain that samples from ${\bm{x}}_T$ to ${\bm{x}}_0$, the generative process, since it approximates the intractable reverse process $q({\bm{x}}_{t-1}|{\bm{x}}_t)$. Intuitively, the forward process progressively adds noise to the observation ${\bm{x}}_0$, whereas the generative process progressively denoises a noisy observation (Figure 1, left).

A special property of the forward process is that

$$q({\bm{x}}_t|{\bm{x}}_0):=\int q({\bm{x}}_{1:t}|{\bm{x}}_0)\,\mathrm{d}{\bm{x}}_{1:(t-1)}={\mathcal{N}}({\bm{x}}_t;\sqrt{\alpha_t}{\bm{x}}_0,(1-\alpha_t){\bm{I}});$$

so we can express ${\bm{x}}_t$ as a linear combination of ${\bm{x}}_0$ and a noise variable $\epsilon$:

$${\bm{x}}_t=\sqrt{\alpha_t}{\bm{x}}_0+\sqrt{1-\alpha_t}\,\epsilon,\quad\text{where}\quad\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}).$$ (4)

When we set $\alpha_T$ sufficiently close to $0$, $q({\bm{x}}_T|{\bm{x}}_0)$ converges to a standard Gaussian for all ${\bm{x}}_0$, so it is natural to set $p_\theta({\bm{x}}_T):={\mathcal{N}}({\bm{0}},{\bm{I}})$. If all the conditionals are modeled as Gaussians with trainable mean functions and fixed variances, the objective in Eq. (2) can be simplified to (please refer to Appendix C.2 for details):

$$L_\gamma(\epsilon_\theta):=\sum_{t=1}^{T}\gamma_t\,{\mathbb{E}}_{{\bm{x}}_0\sim q({\bm{x}}_0),\,\epsilon_t\sim{\mathcal{N}}({\bm{0}},{\bm{I}})}\left[\lVert\epsilon_\theta^{(t)}(\sqrt{\alpha_t}{\bm{x}}_0+\sqrt{1-\alpha_t}\epsilon_t)-\epsilon_t\rVert_2^2\right]$$ (5)

where $\epsilon_\theta:=\{\epsilon_\theta^{(t)}\}_{t=1}^T$ is a set of $T$ functions, each $\epsilon_\theta^{(t)}:{\mathcal{X}}\to{\mathcal{X}}$ (indexed by $t$) is a function with trainable parameters $\theta^{(t)}$, and $\gamma:=[\gamma_1,\ldots,\gamma_T]$ is a vector of positive coefficients in the objective that depends on $\alpha_{1:T}$. In Ho et al. (2020), the objective with $\gamma={\bm{1}}$ is optimized instead to maximize generation performance of the trained model; this is also the same objective used in noise conditional score networks (Song & Ermon, 2019) based on score matching (Hyvärinen, 2005; Vincent, 2011). From a trained model, ${\bm{x}}_0$ is sampled by first sampling ${\bm{x}}_T$ from the prior $p_\theta({\bm{x}}_T)$, and then sampling ${\bm{x}}_{t-1}$ from the generative processes iteratively.

The length $T$ of the forward process is an important hyperparameter in DDPMs. From a variational perspective, a large $T$ allows the reverse process to be close to a Gaussian (Sohl-Dickstein et al., 2015), so that the generative process modeled with Gaussian conditional distributions becomes a good approximation; this motivates the choice of large $T$ values, such as $T=1000$ in Ho et al. (2020). However, as all $T$ iterations have to be performed sequentially, instead of in parallel, to obtain a sample ${\bm{x}}_0$, sampling from DDPMs is much slower than sampling from other deep generative models, which makes them impractical for tasks where compute is limited and latency is critical.

3 Variational Inference for non-Markovian Forward Processes

Because the generative model approximates the reverse of the inference process, we need to rethink the inference process in order to reduce the number of iterations required by the generative model. Our key observation is that the DDPM objective in the form of $L_\gamma$ only depends on the marginals $q({\bm{x}}_t|{\bm{x}}_0)$ (we slightly abuse this term, as well as "joints", when only conditioned on ${\bm{x}}_0$), but not directly on the joint $q({\bm{x}}_{1:T}|{\bm{x}}_0)$. Since there are many inference distributions (joints) with the same marginals, we explore alternative inference processes that are non-Markovian, which leads to new generative processes (Figure 1, right). These non-Markovian inference processes lead to the same surrogate objective function as DDPM, as we will show below. In Appendix A, we show that the non-Markovian perspective also applies beyond the Gaussian case.

3.1 Non-Markovian forward processes

Let us consider a family ${\mathcal{Q}}$ of inference distributions, indexed by a real vector $\sigma\in\mathbb{R}_{\geq 0}^T$:

$$q_\sigma({\bm{x}}_{1:T}|{\bm{x}}_0):=q_\sigma({\bm{x}}_T|{\bm{x}}_0)\prod_{t=2}^{T}q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,{\bm{x}}_0)$$ (6)

where $q_\sigma({\bm{x}}_T|{\bm{x}}_0)={\mathcal{N}}(\sqrt{\alpha_T}{\bm{x}}_0,(1-\alpha_T){\bm{I}})$ and for all $t>1$,

$$q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,{\bm{x}}_0)={\mathcal{N}}\left(\sqrt{\alpha_{t-1}}{\bm{x}}_0+\sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\frac{{\bm{x}}_t-\sqrt{\alpha_t}{\bm{x}}_0}{\sqrt{1-\alpha_t}},\ \sigma_t^2{\bm{I}}\right).$$ (7)

The mean function is chosen in order to ensure that $q_\sigma({\bm{x}}_t|{\bm{x}}_0)={\mathcal{N}}(\sqrt{\alpha_t}{\bm{x}}_0,(1-\alpha_t){\bm{I}})$ for all $t$ (see Lemma 1 of Appendix B), so that it defines a joint inference distribution that matches the "marginals" as desired. The forward process (we overload the term "forward process" for cases where the inference model is not a diffusion) can be derived from Bayes' rule:

$$q_\sigma({\bm{x}}_t|{\bm{x}}_{t-1},{\bm{x}}_0)=\frac{q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,{\bm{x}}_0)\,q_\sigma({\bm{x}}_t|{\bm{x}}_0)}{q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_0)},$$ (8)

which is also Gaussian (although we do not use this fact for the remainder of this paper). Unlike the diffusion process in Eq. (3), the forward process here is no longer Markovian, since each ${\bm{x}}_t$ could depend on both ${\bm{x}}_{t-1}$ and ${\bm{x}}_0$. The magnitude of $\sigma$ controls how stochastic the forward process is; when $\sigma\to{\bm{0}}$, we reach an extreme case where, as long as we observe ${\bm{x}}_0$ and ${\bm{x}}_t$ for some $t$, ${\bm{x}}_{t-1}$ becomes known and fixed.
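As a quick numerical sanity check of this marginal-matching property, the following scalar sketch draws ${\bm{x}}_t$ via Eq. (4) and then ${\bm{x}}_{t-1}$ via Eq. (7), and verifies that the resulting mean and standard deviation match $\sqrt{\alpha_{t-1}}{\bm{x}}_0$ and $\sqrt{1-\alpha_{t-1}}$. The particular schedule values are illustrative only.

```python
import torch

# Monte Carlo check (a sketch) that the reverse conditional of Eq. (7) preserves the
# marginals of Eq. (4): x_{t-1} drawn through q_sigma(x_{t-1} | x_t, x_0) should be
# distributed as N(sqrt(alpha_{t-1}) x_0, (1 - alpha_{t-1}) I) for any fixed x_0.
torch.manual_seed(0)
a_t, a_prev, sigma = 0.5, 0.7, 0.2           # illustrative alpha_t, alpha_{t-1}, sigma_t
x0, n = torch.tensor(1.3), 1_000_000
x_t = a_t ** 0.5 * x0 + (1 - a_t) ** 0.5 * torch.randn(n)                 # Eq. (4)
mean = (a_prev ** 0.5 * x0
        + (1 - a_prev - sigma ** 2) ** 0.5 * (x_t - a_t ** 0.5 * x0) / (1 - a_t) ** 0.5)
x_prev = mean + sigma * torch.randn(n)                                    # Eq. (7)
print(x_prev.mean().item(), (a_prev ** 0.5 * x0).item())                  # both ~ 1.09
print(x_prev.std().item(), (1 - a_prev) ** 0.5)                           # both ~ 0.55
```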

3.2 Generative process and unified variational inference objective

Next, we define a trainable generative process $p_\theta({\bm{x}}_{0:T})$ where each $p_\theta^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_t)$ leverages knowledge of $q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,{\bm{x}}_0)$. Intuitively, given a noisy observation ${\bm{x}}_t$, we first make a prediction of the corresponding ${\bm{x}}_0$ (learning a distribution over the predictions is also possible, but empirically we found little benefit from it), and then use it to obtain a sample ${\bm{x}}_{t-1}$ through the reverse conditional distribution $q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,{\bm{x}}_0)$, which we have defined.

For some ${\bm{x}}_0\sim q({\bm{x}}_0)$ and $\epsilon_t\sim{\mathcal{N}}({\bm{0}},{\bm{I}})$, ${\bm{x}}_t$ can be obtained using Eq. (4). The model $\epsilon_\theta^{(t)}({\bm{x}}_t)$ then attempts to predict $\epsilon_t$ from ${\bm{x}}_t$, without knowledge of ${\bm{x}}_0$. By rewriting Eq. (4), one can then predict the denoised observation, which is a prediction of ${\bm{x}}_0$ given ${\bm{x}}_t$:

$$f_\theta^{(t)}({\bm{x}}_t):=({\bm{x}}_t-\sqrt{1-\alpha_t}\cdot\epsilon_\theta^{(t)}({\bm{x}}_t))/\sqrt{\alpha_t}.$$ (9)

We can then define the generative process with a fixed prior $p_\theta({\bm{x}}_T)={\mathcal{N}}({\bm{0}},{\bm{I}})$ and

$$p_\theta^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_t)=\begin{cases}{\mathcal{N}}(f_\theta^{(1)}({\bm{x}}_1),\sigma_1^2{\bm{I}})&\text{if }t=1\\ q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,f_\theta^{(t)}({\bm{x}}_t))&\text{otherwise,}\end{cases}$$ (10)

where $q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,f_\theta^{(t)}({\bm{x}}_t))$ is defined as in Eq. (7) with ${\bm{x}}_0$ replaced by $f_\theta^{(t)}({\bm{x}}_t)$. We add some Gaussian noise (with covariance $\sigma_1^2{\bm{I}}$) for the case of $t=1$ to ensure that the generative process is supported everywhere.

We optimize $\theta$ via the following variational inference objective (which is a functional over $\epsilon_\theta$):

$$J_\sigma(\epsilon_\theta):={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q_\sigma({\bm{x}}_{0:T})}[\log q_\sigma({\bm{x}}_{1:T}|{\bm{x}}_0)-\log p_\theta({\bm{x}}_{0:T})]$$ (11)
$$={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q_\sigma({\bm{x}}_{0:T})}\left[\log q_\sigma({\bm{x}}_T|{\bm{x}}_0)+\sum_{t=2}^{T}\log q_\sigma({\bm{x}}_{t-1}|{\bm{x}}_t,{\bm{x}}_0)-\sum_{t=1}^{T}\log p_\theta^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_t)-\log p_\theta({\bm{x}}_T)\right]$$

where we factorize $q_\sigma({\bm{x}}_{1:T}|{\bm{x}}_0)$ according to Eq. (6) and $p_\theta({\bm{x}}_{0:T})$ according to Eq. (1).

From the definition of $J_\sigma$, it would appear that a different model has to be trained for every choice of $\sigma$, since it corresponds to a different variational objective (and a different generative process). However, $J_\sigma$ is equivalent to $L_\gamma$ for certain weights $\gamma$, as we show below.

Theorem 1.

For all $\sigma>{\bm{0}}$, there exists $\gamma\in\mathbb{R}_{>0}^T$ and $C\in\mathbb{R}$, such that $J_\sigma=L_\gamma+C$.

The variational objective $L_\gamma$ is special in the sense that if the parameters $\theta$ of the models $\epsilon_\theta^{(t)}$ are not shared across different $t$, then the optimal solution for $\epsilon_\theta$ will not depend on the weights $\gamma$ (as the global optimum is achieved by separately maximizing each term in the sum). This property of $L_\gamma$ has two implications. On the one hand, it justifies the use of $L_{\bm{1}}$ as a surrogate objective function for the variational lower bound in DDPMs; on the other hand, since $J_\sigma$ is equivalent to some $L_\gamma$ by Theorem 1, the optimal solution of $J_\sigma$ is also the same as that of $L_{\bm{1}}$. Therefore, if parameters are not shared across $t$ in the model $\epsilon_\theta$, then the $L_{\bm{1}}$ objective used by Ho et al. (2020) can be used as a surrogate objective for the variational objective $J_\sigma$ as well.

4 Sampling from Generalized Generative Processes

With $L_{\bm{1}}$ as the objective, we are not only learning a generative process for the Markovian inference process considered in Sohl-Dickstein et al. (2015) and Ho et al. (2020), but also generative processes for many non-Markovian forward processes parametrized by $\sigma$ that we have described. Therefore, we can essentially use pretrained DDPM models as the solutions to the new objectives, and focus on finding a generative process that is better at producing samples subject to our needs by changing $\sigma$.

4.1 Denoising Diffusion Implicit Models

From $p_\theta({\bm{x}}_{1:T})$ in Eq. (10), one can generate a sample ${\bm{x}}_{t-1}$ from a sample ${\bm{x}}_t$ via:

$${\bm{x}}_{t-1}=\sqrt{\alpha_{t-1}}\underbrace{\left(\frac{{\bm{x}}_t-\sqrt{1-\alpha_t}\,\epsilon_\theta^{(t)}({\bm{x}}_t)}{\sqrt{\alpha_t}}\right)}_{\text{"predicted }{\bm{x}}_0\text{"}}+\underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2}\cdot\epsilon_\theta^{(t)}({\bm{x}}_t)}_{\text{"direction pointing to }{\bm{x}}_t\text{"}}+\underbrace{\sigma_t\epsilon_t}_{\text{random noise}}$$ (12)

where $\epsilon_t\sim{\mathcal{N}}({\bm{0}},{\bm{I}})$ is standard Gaussian noise independent of ${\bm{x}}_t$, and we define $\alpha_0:=1$. Different choices of $\sigma$ values result in different generative processes, all while using the same model $\epsilon_\theta$, so re-training the model is unnecessary. When $\sigma_t=\sqrt{(1-\alpha_{t-1})/(1-\alpha_t)}\sqrt{1-\alpha_t/\alpha_{t-1}}$ for all $t$, the forward process becomes Markovian, and the generative process becomes a DDPM.
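As a concrete reading of Eq. (12), the sketch below performs one generative step from ${\bm{x}}_t$ to ${\bm{x}}_{t-1}$ for an arbitrary choice of $\sigma_t$. The interface of `eps_model` (standing in for the trained $\epsilon_\theta^{(t)}$) and the passing of scalar schedule values are assumptions of this sketch; setting `sigma_t = 0` gives the deterministic DDIM update, while the Markovian choice of $\sigma_t$ above recovers DDPM.

```python
import torch

def generalized_step(eps_model, x_t, t, alpha_t, alpha_prev, sigma_t):
    """One step of Eq. (12): map x_t to x_{t-1}, with alpha_prev = alpha_{t-1} (alpha_0 := 1)."""
    eps = eps_model(x_t, t)
    x0_pred = (x_t - (1 - alpha_t) ** 0.5 * eps) / alpha_t ** 0.5       # "predicted x_0" (Eq. (9))
    dir_xt = (1 - alpha_prev - sigma_t ** 2) ** 0.5 * eps               # "direction pointing to x_t"
    noise = sigma_t * torch.randn_like(x_t) if sigma_t > 0 else 0.0     # random noise term
    return alpha_prev ** 0.5 * x0_pred + dir_xt + noise
```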

We note another special case when $\sigma_t=0$ for all $t$ (although this case is not covered in Theorem 1, we can always approximate it by making $\sigma_t$ very small); the forward process becomes deterministic given ${\bm{x}}_{t-1}$ and ${\bm{x}}_0$, except for $t=1$; in the generative process, the coefficient before the random noise $\epsilon_t$ becomes zero. The resulting model becomes an implicit probabilistic model (Mohamed & Lakshminarayanan, 2016), where samples are generated from latent variables with a fixed procedure (from ${\bm{x}}_T$ to ${\bm{x}}_0$). We name this the denoising diffusion implicit model (DDIM, pronounced /d:Im/), because it is an implicit probabilistic model trained with the DDPM objective (despite the forward process no longer being a diffusion).

4.2 Accelerated generation processes

In the previous sections, the generative process is considered as the approximation to the reverse process; since the forward process has $T$ steps, the generative process is also forced to sample $T$ steps. However, as the denoising objective $L_{\bm{1}}$ does not depend on the specific forward procedure as long as $q_\sigma({\bm{x}}_t|{\bm{x}}_0)$ is fixed, we may also consider forward processes with lengths smaller than $T$, which accelerates the corresponding generative processes without having to train a different model.

Figure 2: Graphical model for accelerated generation, where $\tau=[1,3]$.

Let us consider the forward process as defined not on all the latent variables ${\bm{x}}_{1:T}$, but on a subset $\{{\bm{x}}_{\tau_1},\ldots,{\bm{x}}_{\tau_S}\}$, where $\tau$ is an increasing sub-sequence of $[1,\ldots,T]$ of length $S$. In particular, we define the sequential forward process over ${\bm{x}}_{\tau_1},\ldots,{\bm{x}}_{\tau_S}$ such that $q({\bm{x}}_{\tau_i}|{\bm{x}}_0)={\mathcal{N}}(\sqrt{\alpha_{\tau_i}}{\bm{x}}_0,(1-\alpha_{\tau_i}){\bm{I}})$ matches the "marginals" (see Figure 2 for an illustration). The generative process now samples latent variables according to $\text{reversed}(\tau)$, which we term the (sampling) trajectory. When the length of the sampling trajectory is much smaller than $T$, we may achieve significant increases in computational efficiency due to the iterative nature of the sampling process.

Using a similar argument as in Section 3, we can justify using the model trained with the $L_{\bm{1}}$ objective, so no changes are needed in training. We show that only slight changes to the updates in Eq. (12) are needed to obtain the new, faster generative processes, which applies to DDPM, DDIM, as well as all generative processes considered in Eq. (10). We include these details in Appendix C.1.
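A minimal sketch of the resulting accelerated sampler follows, under the same assumptions as the earlier sketches (a trained noise predictor `eps_model(x, t)` and the cumulative schedule $\alpha_{1:T}$). Setting all entries of `sigmas` to zero gives the DDIM of Section 4.1; the $\sigma$ choices discussed in Section 5 recover a DDPM-like process restricted to $\tau$.

```python
import torch

@torch.no_grad()
def accelerated_sample(eps_model, alpha, tau, shape, sigmas=None):
    """Sample along the sub-sequence tau via Eq. (12) restricted to tau (a sketch).

    alpha:  1-D tensor of length T holding alpha_1, ..., alpha_T from Eq. (4)
    tau:    increasing list of timesteps in [1, ..., T]
    sigmas: per-step noise scales sigma_{tau_i}; all zeros (default) gives DDIM
    """
    if sigmas is None:
        sigmas = [0.0] * len(tau)
    # Prepend alpha_0 := 1 so that alpha_full[t] is alpha_t for t = 0, ..., T.
    alpha_full = torch.cat([torch.ones(1, device=alpha.device, dtype=alpha.dtype), alpha])
    x = torch.randn(shape, device=alpha.device, dtype=alpha.dtype)       # x_{tau_S} ~ N(0, I)
    for i in reversed(range(len(tau))):
        t = tau[i]
        t_prev = tau[i - 1] if i > 0 else 0                              # previous element of tau
        a_t, a_prev = alpha_full[t], alpha_full[t_prev]
        s = sigmas[i]                                                    # must satisfy s**2 <= 1 - a_prev
        eps = eps_model(x, t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()              # Eq. (9)
        x = a_prev.sqrt() * x0_pred + (1 - a_prev - s ** 2).sqrt() * eps # Eq. (12) along tau
        if s > 0:
            x = x + s * torch.randn_like(x)
    return x
```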

In principle, this means that we can train a model with an arbitrary number of forward steps but only sample from some of them in the generative process. Therefore, the trained model could consider many more steps than what is considered in (Ho et al., 2020), or even a continuous time variable $t$ (Chen et al., 2020). We leave empirical investigations of this aspect as future work.

4.3 Relevance to Neural ODEs

Moreover, we can rewrite the DDIM iterate according to Eq. (12), and its similarity to Euler integration for solving ordinary differential equations (ODEs) becomes more apparent:

$$\frac{{\bm{x}}_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}}=\frac{{\bm{x}}_t}{\sqrt{\alpha_t}}+\left(\sqrt{\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}}-\sqrt{\frac{1-\alpha_t}{\alpha_t}}\right)\epsilon_\theta^{(t)}({\bm{x}}_t)$$ (13)

To derive the corresponding ODE, we can reparameterize $(\sqrt{1-\alpha}/\sqrt{\alpha})$ with $\sigma$ and $({\bm{x}}/\sqrt{\alpha})$ with $\bar{{\bm{x}}}$. In the continuous case, $\sigma$ and ${\bm{x}}$ are functions of $t$, where $\sigma:{\mathbb{R}}_{\geq 0}\to{\mathbb{R}}_{\geq 0}$ is continuous and increasing with $\sigma(0)=0$. Equation (13) can then be treated as an Euler method over the following ODE:

$$\mathrm{d}\bar{{\bm{x}}}(t)=\epsilon_\theta^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^2+1}}\right)\mathrm{d}\sigma(t),$$ (14)

where the initial condition is ${\bm{x}}(T)\sim{\mathcal{N}}(0,\sigma(T))$ for a very large $\sigma(T)$ (which corresponds to the case of $\alpha\approx 0$). This suggests that with enough discretization steps, we can also reverse the generation process (going from $t=0$ to $T$), which encodes ${\bm{x}}_0$ to ${\bm{x}}_T$ and simulates the reverse of the ODE in Eq. (14). This suggests that unlike DDPM, we can use DDIM to obtain encodings of the observations (in the form of ${\bm{x}}_T$), which might be useful for other downstream applications that require latent representations of a model.
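Since the deterministic case is an Euler discretization of Eq. (14), the same update can be run with the time order reversed to encode an observation, as described above. A hedged sketch of this encoding direction is given below, rearranging Eq. (13) to step from ${\bm{x}}_{t-\Delta t}$ to ${\bm{x}}_t$; evaluating $\epsilon_\theta$ at the current iterate is a first-order approximation and an assumption of this sketch, not a prescription from the paper.

```python
import torch

@torch.no_grad()
def ddim_encode(eps_model, alpha, tau, x0):
    """Encode x_0 into x_{tau_S} by running the Euler step of Eq. (13) in reverse order (a sketch)."""
    alpha_full = torch.cat([torch.ones(1, device=alpha.device, dtype=alpha.dtype), alpha])
    x = x0
    for i in range(len(tau)):
        t_prev = tau[i - 1] if i > 0 else 0
        t = tau[i]
        a_prev, a_t = alpha_full[t_prev], alpha_full[t]
        eps = eps_model(x, t)   # first-order approximation: evaluate eps_theta at the current iterate
        # Eq. (13) rearranged: x_t / sqrt(a_t) = x_{t-dt} / sqrt(a_{t-dt}) + (...) * eps
        x = a_t.sqrt() * (x / a_prev.sqrt()
                          + (((1 - a_t) / a_t).sqrt() - ((1 - a_prev) / a_prev).sqrt()) * eps)
    return x
```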

In a concurrent work, Song et al. (2020) proposed a "probability flow ODE" that aims to recover the marginal densities of a stochastic differential equation (SDE) based on scores, from which a similar sampling schedule can be obtained. Here, we state that our ODE is equivalent to a special case of theirs (which corresponds to a continuous-time analog of DDPM).

Proposition 1.

The ODE in Eq. (14) with the optimal model $\epsilon_\theta^{(t)}$ has an equivalent probability flow ODE corresponding to the "Variance-Exploding" SDE in Song et al. (2020).

We include the proof in Appendix B. While the ODEs are equivalent, the sampling procedures are not, since the Euler method for the probability flow ODE will make the following update:

$$\frac{{\bm{x}}_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}}=\frac{{\bm{x}}_t}{\sqrt{\alpha_t}}+\frac{1}{2}\left(\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}-\frac{1-\alpha_t}{\alpha_t}\right)\cdot\sqrt{\frac{\alpha_t}{1-\alpha_t}}\cdot\epsilon_\theta^{(t)}({\bm{x}}_t)$$ (15)

which is equivalent to ours if $\alpha_t$ and $\alpha_{t-\Delta t}$ are close enough. In fewer sampling steps, however, these choices will make a difference; we take Euler steps with respect to $\mathrm{d}\sigma(t)$ (which depends less directly on the scaling of "time" $t$) whereas Song et al. (2020) take Euler steps with respect to $\mathrm{d}t$.

5 Experiments

In this section, we show that DDIMs outperform DDPMs in terms of image generation when fewer iterations are considered, giving speed-ups of $10\times$ to $100\times$ over the original DDPM generation process. Moreover, unlike DDPMs, once the initial latent variables ${\bm{x}}_T$ are fixed, DDIMs retain high-level image features regardless of the generation trajectory, so they are able to perform interpolation directly from the latent space. DDIMs can also be used to encode samples that reconstruct them from the latent code, which DDPMs cannot do due to the stochastic sampling process.

For each dataset, we use the same trained model with $T=1000$ and the objective being $L_\gamma$ from Eq. (5) with $\gamma={\bm{1}}$; as we argued in Section 3, no changes are needed with regard to the training procedure. The only change that we make is how we produce samples from the model; we achieve this by controlling $\tau$ (which controls how fast the samples are obtained) and $\sigma$ (which interpolates between the deterministic DDIM and the stochastic DDPM).

We consider different sub-sequences $\tau$ of $[1,\ldots,T]$ and different variance hyperparameters $\sigma$ indexed by elements of $\tau$. To simplify comparisons, we consider $\sigma$ of the form:

$$\sigma_{\tau_i}(\eta)=\eta\sqrt{(1-\alpha_{\tau_{i-1}})/(1-\alpha_{\tau_i})}\sqrt{1-\alpha_{\tau_i}/\alpha_{\tau_{i-1}}},$$ (16)

where $\eta\in\mathbb{R}_{\geq 0}$ is a hyperparameter that we can directly control. This includes an original DDPM generative process when $\eta=1$ and DDIM when $\eta=0$. We also consider DDPM where the random noise has a larger standard deviation than $\sigma(1)$, which we denote as $\hat{\sigma}$: $\hat{\sigma}_{\tau_i}=\sqrt{1-\alpha_{\tau_i}/\alpha_{\tau_{i-1}}}$. This is used by the implementation in Ho et al. (2020) only to obtain the CIFAR10 samples, but not samples of the other datasets. We include more details in Appendix D.
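Eq. (16) translates into a small helper that produces the per-step noise scales, which can be fed to an accelerated sampler such as the sketch in Section 4.2. The indexing convention (`alpha` indexable by timestep with $\alpha_0:=1$ for the first element of $\tau$, e.g. the prepended schedule from the earlier sketch) is an assumption of this sketch.

```python
def sigma_schedule(alpha, tau, eta):
    """Per-step sigma_{tau_i}(eta) of Eq. (16); eta = 0 gives DDIM, eta = 1 gives DDPM."""
    sigmas = []
    for i in range(len(tau)):
        a_t = alpha[tau[i]]
        a_prev = alpha[tau[i - 1]] if i > 0 else 1.0      # alpha_{tau_0} := alpha_0 = 1
        sigmas.append(eta * ((1 - a_prev) / (1 - a_t)) ** 0.5
                          * (1 - a_t / a_prev) ** 0.5)
    return sigmas
```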

5.1 Sample quality and efficiency

In Table 1, we report the quality of the generated samples with models trained on CIFAR10 and CelebA, as measured by Frechet Inception Distance (FID (Heusel et al., 2017)), where we vary the number of timesteps used to generate a sample ($\dim(\tau)$) and the stochasticity of the process ($\eta$). As expected, the sample quality becomes higher as we increase $\dim(\tau)$, presenting a trade-off between sample quality and computational costs. We observe that DDIM ($\eta=0$) achieves the best sample quality when $\dim(\tau)$ is small, and DDPM ($\eta=1$ and $\hat{\sigma}$) typically has worse sample quality compared to its less stochastic counterparts with the same $\dim(\tau)$, except for the case of $\dim(\tau)=1000$ and $\hat{\sigma}$ reported by Ho et al. (2020), where DDIM is marginally worse. However, the sample quality of $\hat{\sigma}$ becomes much worse for smaller $\dim(\tau)$, which suggests that it is ill-suited for shorter trajectories. DDIM, on the other hand, achieves high sample quality much more consistently.

In Figure 3, we show CIFAR10 and CelebA samples with the same number of sampling steps and varying $\sigma$. For the DDPM, the sample quality deteriorates rapidly when the sampling trajectory has 10 steps. For the case of $\hat{\sigma}$, the generated images seem to have more noisy perturbations under short trajectories; this explains why the FID scores are much worse than those of other methods, as FID is very sensitive to such perturbations (as discussed in Jolicoeur-Martineau et al. (2020)).

In Figure 4, we show that the amount of time needed to produce a sample scales linearly with the length of the sample trajectory. This suggests that DDIM is useful for producing samples more efficiently, as samples can be generated in much fewer steps. Notably, DDIM is able to produce samples with quality comparable to 1000-step models within $20$ to $100$ steps, which is a $10\times$ to $50\times$ speed-up compared to the original DDPM. Even though DDPM could also achieve reasonable sample quality with 100 steps, DDIM requires much fewer steps to achieve this; on CelebA, the FID score of the 100-step DDPM is similar to that of the 20-step DDIM.

Table 1: CIFAR10 and CelebA image generation measured in FID. $\eta=1.0$ and $\hat{\sigma}$ are cases of DDPM (although Ho et al. (2020) only considered $T=1000$ steps, and $S<T$ can be seen as simulating DDPMs trained with $S$ steps), and $\eta=0.0$ indicates DDIM.
                    CIFAR10 (32×32)                       CelebA (64×64)
S            10      20      50      100     1000     10      20      50      100     1000
η = 0.0      13.36   6.84    4.67    4.16    4.04     17.33   13.73   9.17    6.53    3.51
η = 0.2      14.04   7.11    4.77    4.25    4.09     17.66   14.11   9.51    6.79    3.64
η = 0.5      16.66   8.35    5.25    4.46    4.29     19.86   16.06   11.01   8.09    4.28
η = 1.0      41.07   18.36   8.01    5.78    4.73     33.12   26.03   18.48   13.93   5.98
σ̂           367.43  133.37  32.72   9.99    3.17     299.71  183.83  71.71   45.20   3.26
Figure 3: CIFAR10 and CelebA samples with $\dim(\tau)=10$ and $\dim(\tau)=100$.
Figure 4: Hours to sample 50k images with one Nvidia 2080 Ti GPU and samples at different steps.

5.2 Sample consistency in DDIMs

For DDIM, the generative process is deterministic, and ${\bm{x}}_0$ depends only on the initial state ${\bm{x}}_T$. In Figure 5, we observe the generated images under different generative trajectories (i.e., different $\tau$) while starting with the same initial ${\bm{x}}_T$. Interestingly, for the generated images with the same initial ${\bm{x}}_T$, most high-level features are similar, regardless of the generative trajectory. In many cases, samples generated with only 20 steps are already very similar to ones generated with 1000 steps in terms of high-level features, with only minor differences in details. Therefore, it would appear that ${\bm{x}}_T$ alone is an informative latent encoding of the image; the minor details that affect sample quality are encoded in the parameters, as longer sample trajectories give better quality samples but do not significantly affect the high-level features. We show more samples in Appendix D.4.

Figure 5: Samples from DDIM with the same random ${\bm{x}}_T$ and different numbers of steps.

5.3 Interpolation in deterministic generative processes

Figure 6: Interpolation of samples from DDIM with $\dim(\tau)=50$.

Since the high-level features of the DDIM sample are encoded by x_T, we are interested to see whether it would exhibit the semantic interpolation effect similar to that observed in other implicit probabilistic models, such as GANs (Goodfellow et al., 2014). This is different from the interpolation procedure in Ho et al. (2020), since in DDPM the same x_T would lead to highly diverse x_0 due to the stochastic generative process.⁶ In Figure 6, we show that simple interpolations in x_T can lead to semantically meaningful interpolations between two samples. We include more details and samples in Appendix D.5. This allows DDIM to control the generated images on a high level directly through the latent variables, which DDPMs cannot.
由于 DDIM 样本的高级特征由 x_T 编码,我们有兴趣了解它是否会表现出与其他隐式概率模型(如 GANs(Goodfellow 等人,2014))中观察到的类似的语义插值效应。这与 Ho 等人(2020)中的插值过程不同,因为在 DDPM 中,由于随机生成过程⁶,相同的 x_T 会导致高度多样化的 x_0。在图 6 中,我们表明在 x_T 中的简单插值可导致两个样本之间具有语义意义的插值。我们在附录 D.5 中包含了更多细节和样本。这使得 DDIM 能够直接通过潜在变量在高层次上控制生成的图像,而 DDPM 则不能。

⁶ Although it might be possible if one interpolates all T noises, like what is done in Song & Ermon (2020).
⁶ 尽管如果像 Song & Ermon(2020)中所做的那样对所有 T 个噪声进行插值,这也许是可能的。
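To make the latent-space interpolation concrete, below is a minimal sketch of spherical linear interpolation (slerp; Shoemake, 1985) between two latents x_T. The exact scheme used for Figure 6 is described in Appendix D.5, so treat this function as an illustrative assumption rather than the paper's exact procedure; all names are hypothetical.

```python
import numpy as np

def slerp(z0, z1, alpha):
    """Spherical linear interpolation between two latents x_T drawn from N(0, I).

    A sketch of the latent-space interpolation discussed above;
    alpha = 0 returns z0 and alpha = 1 returns z1.
    """
    z0_flat, z1_flat = z0.ravel(), z1.ravel()
    cos_theta = np.dot(z0_flat, z1_flat) / (np.linalg.norm(z0_flat) * np.linalg.norm(z1_flat))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return (np.sin((1 - alpha) * theta) * z0 + np.sin(alpha * theta) * z1) / np.sin(theta)

# Each interpolated latent would then be decoded with the deterministic DDIM generative process.
```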

5.4 Reconstruction from Latent Space
5.4 从潜在空间进行重建

As DDIM is the Euler integration for a particular ODE, it would be interesting to see whether it can encode from x_0 to x_T (reverse of Eq. (14)) and reconstruct x_0 from the resulting x_T (forward of Eq. (14)).⁷ We consider encoding and decoding on the CIFAR-10 test set with the CIFAR-10 model, using S steps for both encoding and decoding; we report the per-dimension mean squared error (scaled to [0, 1]) in Table 2. Our results show that DDIMs have lower reconstruction error for larger S values and have properties similar to Neural ODEs and normalizing flows. The same cannot be said for DDPMs due to their stochastic nature.
由于 DDIM 是针对特定常微分方程的欧拉积分,研究它是否能够从 x_0 编码到 x_T(式(14)的逆过程),以及从所得的 x_T 重建 x_0(式(14)的正过程)⁷,这将是很有趣的。我们在 CIFAR-10 测试集上使用 CIFAR-10 模型进行编码和解码,编码和解码均采用 S 步;我们在表 2 中报告了每维均方误差(缩放到 [0, 1])。我们的结果表明,对于较大的 S 值,DDIM 的重建误差较低,并且具有与神经 ODE 和归一化流类似的特性。由于其随机性,对于 DDPM 则不能这样说。

⁷ Since x_T and x_0 have the same dimensions, their compression qualities are not our immediate concern.
⁷ 由于 x_T 和 x_0 具有相同的维度,它们的压缩质量并非我们当下所关注的。

Table 2: Reconstruction error with DDIM on CIFAR-10 test set, rounded to 10⁻⁴.
表 2:在 CIFAR-10 测试集上使用 DDIM 的重建误差,四舍五入到 10⁻⁴。
S        10       20       50       100      200      500      1000
Error    0.014    0.0065   0.0023   0.0009   0.0004   0.0001   0.0001
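As a sketch of how such an encode-then-decode experiment can be run, the deterministic (η = 0) update can be applied with the α schedule traversed in either direction. The snippet below is an assumption-level illustration (eps_model is a hypothetical wrapper around ε_θ, and the timestep indexing is simplified), not the exact evaluation code behind Table 2.

```python
import numpy as np

def ddim_deterministic_step(x, eps, a_cur, a_next):
    """One Euler step of the deterministic (eta = 0) DDIM update.

    x: current iterate, eps: noise prediction at the current step,
    a_cur / a_next: cumulative alpha at the current and next selected timestep.
    a_next > a_cur denoises; a_next < a_cur encodes (the reverse of Eq. (14)).
    """
    x0_pred = (x - np.sqrt(1.0 - a_cur) * eps) / np.sqrt(a_cur)
    return np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps

def reconstruct(x0, eps_model, alphas):
    """Encode x0 -> x_T, decode back, and return the per-dimension MSE.

    eps_model(x, t) is a hypothetical wrapper around epsilon_theta^{(t)};
    alphas[t] is the cumulative alpha at timestep t, with alphas[0] close to 1.
    """
    ts = list(range(len(alphas)))
    x = x0
    for t_cur, t_next in zip(ts[:-1], ts[1:]):              # encode: alpha decreases
        x = ddim_deterministic_step(x, eps_model(x, t_cur), alphas[t_cur], alphas[t_next])
    for t_cur, t_next in zip(ts[::-1][:-1], ts[::-1][1:]):  # decode: alpha increases
        x = ddim_deterministic_step(x, eps_model(x, t_cur), alphas[t_cur], alphas[t_next])
    return np.mean((x - x0) ** 2)
```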

6 Related Work 6 相关工作

Our work is based on a large family of existing methods on learning generative models as transition operators of Markov chains (Sohl-Dickstein et al., 2015; Bengio et al., 2014; Salimans et al., 2014; Song et al., 2017; Goyal et al., 2017; Levy et al., 2017). Among them, denoising diffusion probabilistic models (DDPMs, Ho et al. (2020)) and noise conditional score networks (NCSN, Song & Ermon (2019; 2020)) have recently achieved high sample quality comparable to GANs (Brock et al., 2018; Karras et al., 2018). DDPMs optimize a variational lower bound to the log-likelihood, whereas NCSNs optimize the score matching objective (Hyvärinen, 2005) over a nonparametric Parzen density estimator of the data (Vincent, 2011; Raphan & Simoncelli, 2011).
我们的工作基于一大类现有的将生成模型作为马尔可夫链的转移算子来学习的方法(Sohl-Dickstein 等人,2015;Bengio 等人,2014;Salimans 等人,2014;Song 等人,2017;Goyal 等人,2017;Levy 等人,2017)。其中,去噪扩散概率模型(DDPMs,Ho 等人(2020))和噪声条件得分网络(NCSN,Song 和 Ermon(2019;2020))最近取得了与生成对抗网络(GANs)(Brock 等人,2018;Karras 等人,2018)相当的高样本质量。DDPMs 优化对数似然的变分下界,而 NCSNs 在数据的非参数 Parzen 密度估计器(Vincent,2011;Raphan 和 Simoncelli,2011)上优化得分匹配目标(Hyvärinen,2005)。

Despite their different motivations, DDPMs and NCSNs are closely related. Both use a denoising autoencoder objective for many noise levels, and both use a procedure similar to Langevin dynamics to produce samples (Neal et al., 2011). Since Langevin dynamics is a discretization of a gradient flow (Jordan et al., 1998), both DDPM and NCSN require many steps to achieve good sample quality. This aligns with the observation that DDPM and existing NCSN methods have trouble generating high-quality samples in a few iterations.
尽管它们的动机不同,但 DDPMs 和 NCSNs 密切相关。两者都针对多个噪声水平使用去噪自编码器目标,并且都使用类似于朗之万动力学的过程来生成样本(Neal 等人,2011)。由于朗之万动力学是梯度流的离散化(Jordan 等人,1998),DDPM 和 NCSN 都需要许多步骤才能获得良好的样本质量。这与观察到的 DDPM 和现有的 NCSN 方法在几次迭代中难以生成高质量样本的情况相符。

DDIM, on the other hand, is an implicit generative model (Mohamed & Lakshminarayanan, 2016) where samples are uniquely determined from the latent variables. Hence, DDIM has certain properties that resemble GANs (Goodfellow et al., 2014) and invertible flows (Dinh et al., 2016), such as the ability to produce semantically meaningful interpolations. We derive DDIM from a purely variational perspective, where the restrictions of Langevin dynamics are not relevant; this could partially explain why we are able to observe superior sample quality compared to DDPM under fewer iterations. The sampling procedure of DDIM is also reminiscent of neural networks with continuous depth (Chen et al., 2018; Grathwohl et al., 2018), since the samples it produces from the same latent variable have similar high-level visual features, regardless of the specific sample trajectory.
另一方面,DDIM 是一种隐式生成模型(Mohamed 和 Lakshminarayanan,2016),其中样本由潜在变量唯一确定。因此,DDIM 具有某些类似于 GANs(Goodfellow 等人,2014)和可逆流(Dinh 等人,2016)的特性,例如能够产生具有语义意义的插值。我们从纯粹的变分角度推导 DDIM,其中朗之万动力学的限制并不相关;这可以部分解释为什么在较少的迭代次数下,我们能够观察到与 DDPM 相比更优的样本质量。DDIM 的采样过程也让人联想到具有连续深度的神经网络(Chen 等人,2018;Grathwohl 等人,2018),因为它从相同的潜在变量产生的样本具有相似的高级视觉特征,而与特定的样本轨迹无关。

7 Discussion 7 讨论

We have presented DDIMs – an implicit generative model trained with denoising auto-encoding / score matching objectives – from a purely variational perspective. DDIM is able to generate high-quality samples much more efficiently than existing DDPMs and NCSNs, with the ability to perform meaningful interpolations from the latent space. The non-Markovian forward process presented here seems to suggest continuous forward processes other than Gaussian (which cannot be done in the original diffusion framework, since Gaussian is the only stable distribution with finite variance). We also demonstrated a discrete case with a multinomial forward process in Appendix A, and it would be interesting to investigate similar alternatives for other combinatorial structures.
我们从纯粹的变分角度提出了 DDIMs——一种使用去噪自编码/得分匹配目标训练的隐式生成模型。DDIM 能够比现有的 DDPMs 和 NCSNs 更高效地生成高质量样本,并且能够从潜在空间进行有意义的插值。这里提出的非马尔可夫前向过程似乎表明了除高斯以外的连续前向过程(这在原始扩散框架中无法实现,因为高斯是具有有限方差的唯一稳定分布)。我们还在附录 A 中展示了具有多项分布前向过程的离散情况,研究其他组合结构的类似替代方案将是有趣的。

Moreover, since the sampling procedure of DDIMs is similar to that of a neural ODE, it would be interesting to see if methods that decrease the discretization error in ODEs, including multi-step methods such as Adams-Bashforth (Butcher & Goodwin, 2008), could be helpful for further improving sample quality in fewer steps (Queiruga et al., 2020). It is also relevant to investigate whether DDIMs exhibit other properties of existing implicit models (Bau et al., 2019).
此外,由于 DDIMs 的采样过程与神经 ODE 的采样过程相似,研究包括 Adams-Bashforth(Butcher & Goodwin,2008)等多步方法在内的降低 ODE 离散化误差的方法是否有助于在更少的步骤中进一步提高样本质量(Queiruga 等人,2020)将是很有趣的。研究 DDIMs 是否表现出现有隐式模型的其他特性(Bau 等人,2019)也具有相关性。

Acknowledgements 致谢

The authors would like to thank Yang Song and Shengjia Zhao for helpful discussions over the ideas, Kuno Kim for reviewing an earlier draft of the paper, and Sharvil Nanavati and Sophie Liu for identifying typos. This research was supported by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024), and Amazon AWS.
作者们想要感谢杨松和赵胜家就这些想法进行的有益讨论,感谢 Kuno Kim 对论文早期草稿的审阅,以及感谢 Sharvil Nanavati 和 Sophie Liu 指出的排版错误。本研究得到了美国国家科学基金会(#1651565,#1522054,#1733686)、美国海军研究办公室(N00014 - 19 - 1 - 2145)、美国空军科学研究办公室(FA9550 - 19 - 1 - 0024)和亚马逊网络服务的支持。

References

  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, January 2017.
  • Bau et al. (2019) David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Seeing what a gan cannot generate. In Proceedings of the IEEE International Conference on Computer Vision, pp.  4502–4511, 2019.
  • Bengio et al. (2014) Yoshua Bengio, Eric Laufer, Guillaume Alain, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. In International Conference on Machine Learning, pp. 226–234, January 2014.
  • Bishop (2006) Christopher M Bishop. Pattern recognition and machine learning. springer, 2006.
  • Brock et al. (2018) Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, September 2018.
  • Butcher & Goodwin (2008) John Charles Butcher and Nicolette Goodwin. Numerical methods for ordinary differential equations, volume 2. Wiley Online Library, 2008.
  • Chen et al. (2020) Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713, September 2020.
  • Chen et al. (2018) Ricky T Q Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, June 2018.
  • Dinh et al. (2016) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, May 2016.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Goyal et al. (2017) Anirudh Goyal, Nan Rosemary Ke, Surya Ganguli, and Yoshua Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net. In Advances in Neural Information Processing Systems, pp. 4392–4402, 2017.
  • Grathwohl et al. (2018) Will Grathwohl, Ricky T Q Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, October 2018.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5769–5779, 2017.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two Time-Scale update rule converge to a local nash equilibrium. arXiv preprint arXiv:1706.08500, June 2017.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, June 2020.
  • Hyvärinen (2005) Aapo Hyvärinen. Estimation of Non-Normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.
  • Jolicoeur-Martineau et al. (2020) Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Rémi Tachet des Combes, and Ioannis Mitliagkas. Adversarial score matching and improved sampling for image generation. September 2020.
  • Jordan et al. (1998) Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the fokker–planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
  • Karras et al. (2018) Tero Karras, Samuli Laine, and Timo Aila. A Style-Based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, December 2018.
  • Karras et al. (2020) Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  8110–8119, 2020.
  • Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-Encoding variational bayes. arXiv preprint arXiv:1312.6114v10, December 2013.
  • Levy et al. (2017) Daniel Levy, Matthew D Hoffman, and Jascha Sohl-Dickstein. Generalizing hamiltonian monte carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.
  • Mohamed & Lakshminarayanan (2016) Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, October 2016.
  • Neal et al. (2011) Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2, 2011.
  • Queiruga et al. (2020) Alejandro F Queiruga, N Benjamin Erichson, Dane Taylor, and Michael W Mahoney. Continuous-in-depth neural networks. arXiv preprint arXiv:2008.02389, 2020.
  • Raphan & Simoncelli (2011) Martin Raphan and Eero P Simoncelli. Least squares estimation without priors or supervision. Neural computation, 23(2):374–420, February 2011. ISSN 0899-7667, 1530-888X.
  • Rezende & Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, May 2015.
  • Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp.  234–241. Springer, 2015.
  • Salimans et al. (2014) Tim Salimans, Diederik P Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. arXiv preprint arXiv:1410.6460, October 2014.
  • Shoemake (1985) Ken Shoemake. Animating rotation with quaternion curves. In Proceedings of the 12th annual conference on Computer graphics and interactive techniques, pp.  245–254, 1985.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585, March 2015.
  • Song et al. (2017) Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-nice-mc: Adversarial training for mcmc. arXiv preprint arXiv:1706.07561, June 2017.
  • Song & Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. arXiv preprint arXiv:1907.05600, July 2019.
  • Song & Ermon (2020) Yang Song and Stefano Ermon. Improved techniques for training Score-Based generative models. arXiv preprint arXiv:2006.09011, June 2020.
  • Song et al. (2020) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • van den Oord et al. (2016a) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, September 2016a.
  • van den Oord et al. (2016b) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, January 2016b.
  • Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, May 2016.
  • Zhao et al. (2018) Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, and Stefano Ermon. Bias and generalization in deep generative models: An empirical study. In Advances in Neural Information Processing Systems, pp. 10792–10801, 2018.

Appendix A Non-Markovian Forward Processes for a Discrete Case
附录 A 离散情形下的非马尔可夫正向过程

In this section, we describe a non-Markovian forward processes for discrete data and corresponding variational objectives. Since the focus of this paper is to accelerate reverse models corresponding to the Gaussian diffusion, we leave empirical evaluations as future work.
在本节中,我们描述了一种用于离散数据的非马尔可夫正向过程以及相应的变分目标。由于本文的重点是加速与高斯扩散相对应的反向模型,我们将实证评估留作未来的工作。

For a categorical observation 𝒙0subscript𝒙0{\bm{x}}_{0} that is a one-hot vector with K𝐾K possible values, we define the forward process as follows. First, we have q(𝒙t|𝒙0)𝑞conditionalsubscript𝒙𝑡subscript𝒙0q({\bm{x}}_{t}|{\bm{x}}_{0}) as the following categorical distribution:
对于一个类别观测值 𝒙0subscript0{\bm{x}}_{0} (它是一个具有 KK 种可能值的独热向量),我们将前向过程定义如下。首先,我们有 q(𝒙t|𝒙0)conditionalsubscriptsubscript0q({\bm{x}}_{t}|{\bm{x}}_{0}) 作为以下类别分布:

q({\bm{x}}_{t}|{\bm{x}}_{0})=\mathrm{Cat}(\alpha_{t}{\bm{x}}_{0}+(1-\alpha_{t}){\bm{1}}_{K})    (17)

where 𝟏KKsubscript1𝐾superscript𝐾{\bm{1}}_{K}\in\mathbb{R}^{K} is a vector with all entries being 1/K1𝐾1/K, and αtsubscript𝛼𝑡\alpha_{t} decreasing from α0=1subscript𝛼01\alpha_{0}=1 for t=0𝑡0t=0 to αT=0subscript𝛼𝑇0\alpha_{T}=0 for t=T𝑡𝑇t=T. Then we define q(𝒙t1|𝒙t,𝒙0)𝑞conditionalsubscript𝒙𝑡1subscript𝒙𝑡subscript𝒙0q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}) as the following mixture distribution:
其中 𝟏KKsubscript1superscript{\bm{1}}_{K}\in\mathbb{R}^{K} 是一个所有元素均为 1/K11/K 的向量, αtsubscript\alpha_{t}α0=1subscript01\alpha_{0}=1 针对 t=00t=0αT=0subscript0\alpha_{T}=0 针对 t=Tt=T 逐渐减小。然后,我们将 q(𝒙t1|𝒙t,𝒙0)conditionalsubscript1subscriptsubscript0q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0}) 定义为如下的混合分布:

q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})=\begin{cases}\mathrm{Cat}({\bm{x}}_{t})&\text{with probability }\sigma_{t}\\ \mathrm{Cat}({\bm{x}}_{0})&\text{with probability }(\alpha_{t-1}-\sigma_{t}\alpha_{t})\\ \mathrm{Cat}({\bm{1}}_{K})&\text{with probability }(1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}\end{cases}    (18)

or equivalently: 或等价地:

q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})=\mathrm{Cat}\left(\sigma_{t}{\bm{x}}_{t}+(\alpha_{t-1}-\sigma_{t}\alpha_{t}){\bm{x}}_{0}+((1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}){\bm{1}}_{K}\right),    (19)

which is consistent with how we have defined q(𝒙t|𝒙0)𝑞conditionalsubscript𝒙𝑡subscript𝒙0q({\bm{x}}_{t}|{\bm{x}}_{0}).
这与我们对 q(𝒙t|𝒙0)conditionalsubscriptsubscript0q({\bm{x}}_{t}|{\bm{x}}_{0}) 的定义一致。

Similarly, we can define our reverse process pθ(𝒙t1|𝒙t)subscript𝑝𝜃conditionalsubscript𝒙𝑡1subscript𝒙𝑡p_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t}) as:
同样地,我们可以将我们的逆过程 pθ(𝒙t1|𝒙t)subscriptconditionalsubscript1subscriptp_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t}) 定义为:

p_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t})=\mathrm{Cat}\left(\sigma_{t}{\bm{x}}_{t}+(\alpha_{t-1}-\sigma_{t}\alpha_{t})f_{\theta}^{(t)}({\bm{x}}_{t})+((1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}){\bm{1}}_{K}\right),    (20)

where fθ(t)(𝒙t)superscriptsubscript𝑓𝜃𝑡subscript𝒙𝑡f_{\theta}^{(t)}({\bm{x}}_{t}) maps 𝒙tsubscript𝒙𝑡{\bm{x}}_{t} to a K𝐾K-dimensional vector. As (1αt1)(1αt)σt01subscript𝛼𝑡11subscript𝛼𝑡subscript𝜎𝑡0(1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}\to 0, the sampling process will become less stochastic, in the sense that it will either choose 𝒙tsubscript𝒙𝑡{\bm{x}}_{t} or the predicted 𝒙0subscript𝒙0{\bm{x}}_{0} with high probability. The KL divergence
其中 fθ(t)(𝒙t)superscriptsubscriptsubscriptf_{\theta}^{(t)}({\bm{x}}_{t})𝒙tsubscript{\bm{x}}_{t} 映射到一个 KK 维向量。随着 (1αt1)(1αt)σt01subscript11subscriptsubscript0(1-\alpha_{t-1})-(1-\alpha_{t})\sigma_{t}\to 0 ,采样过程的随机性会降低,从某种意义上说,它会以高概率选择 𝒙tsubscript{\bm{x}}_{t} 或预测的 𝒙0subscript0{\bm{x}}_{0} 。KL 散度

D_{\mathrm{KL}}(q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|p_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t}))    (21)

is well-defined, and is simply the KL divergence between two categoricals. Therefore, the resulting variational objective function should be easy to optimize as well. Moreover, as KL divergence is convex, we have this upper bound (which is tight when the right hand side goes to zero):
定义明确,它仅仅是两个分类之间的 KL 散度。因此,所得的变分目标函数也应该易于优化。此外,由于 KL 散度是凸的,我们有这个上界(当右侧趋于零时是紧的):

D_{\mathrm{KL}}(q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|p_{\theta}({\bm{x}}_{t-1}|{\bm{x}}_{t}))\leq(\alpha_{t-1}-\sigma_{t}\alpha_{t})\,D_{\mathrm{KL}}(\mathrm{Cat}({\bm{x}}_{0})\|\mathrm{Cat}(f_{\theta}^{(t)}({\bm{x}}_{t}))).

The right hand side is simply a multi-class classification loss (up to constants), so we can arrive at similar arguments regarding how changes in σtsubscript𝜎𝑡\sigma_{t} do not affect the objective (up to re-weighting).
右侧仅仅是一个多类分类损失(至多为常数),因此我们可以得出类似的论点,即 σtsubscript\sigma_{t} 的变化如何不会影响目标(至多为重加权)。
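As a concrete illustration of this discrete forward process, the following minimal sketch (an assumption-level illustration, not code from the paper; no empirical evaluation is implied) draws x_t from Eq. (17) and x_{t-1} from the mixture in Eqs. (18)/(19).

```python
import numpy as np

def sample_xt(x0_onehot, alpha_t, rng):
    """Draw x_t ~ q(x_t | x_0) = Cat(alpha_t * x_0 + (1 - alpha_t) * 1_K), Eq. (17)."""
    K = x0_onehot.shape[0]
    probs = alpha_t * x0_onehot + (1.0 - alpha_t) / K
    return rng.multinomial(1, probs)

def sample_xtm1(xt_onehot, x0_onehot, alpha_tm1, alpha_t, sigma_t, rng):
    """Draw x_{t-1} ~ q(x_{t-1} | x_t, x_0), the categorical mixture of Eqs. (18)/(19)."""
    K = x0_onehot.shape[0]
    probs = (sigma_t * xt_onehot
             + (alpha_tm1 - sigma_t * alpha_t) * x0_onehot
             + ((1.0 - alpha_tm1) - (1.0 - alpha_t) * sigma_t) / K)
    return rng.multinomial(1, probs)

# Example with K = 5 classes and x_0 being class 2 (hypothetical hyperparameters).
rng = np.random.default_rng(0)
x0 = np.eye(5)[2]
xt = sample_xt(x0, alpha_t=0.6, rng=rng)
xtm1 = sample_xtm1(xt, x0, alpha_tm1=0.8, alpha_t=0.6, sigma_t=0.3, rng=rng)
```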

Appendix B Proofs 附录 B 证明

Lemma 1.
引理 1.

For q_σ(x_{1:T}|x_0) defined in Eq. (6) and q_σ(x_{t-1}|x_t, x_0) defined in Eq. (7), we have:
对于在式(6)中定义的 q_σ(x_{1:T}|x_0) 和在式(7)中定义的 q_σ(x_{t-1}|x_t, x_0),我们有:

q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t}}{\bm{x}}_{0},(1-\alpha_{t}){\bm{I}})    (22)
Proof.
证明。

Assume that for any t ≤ T, q_σ(x_t|x_0) = N(√(α_t) x_0, (1 − α_t) I) holds. If
假设对于任何 t ≤ T,q_σ(x_t|x_0) = N(√(α_t) x_0, (1 − α_t) I) 成立。如果

q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t-1}}{\bm{x}}_{0},(1-\alpha_{t-1}){\bm{I}})    (23)

then we can prove the statement with an induction argument for t from T to 1, since the base case (t = T) already holds.
那么我们可以用对 t 从 T 到 1 的归纳论证来证明该陈述,因为基础情况(t = T)已经成立。

First, we have that
首先,我们有

q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0}):=\int_{{\bm{x}}_{t}}q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0})\,q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\,\mathrm{d}{\bm{x}}_{t}

and 并且

q_{\sigma}({\bm{x}}_{t}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t}}{\bm{x}}_{0},(1-\alpha_{t}){\bm{I}})    (24)
q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})={\mathcal{N}}\left(\sqrt{\alpha_{t-1}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t-1}-\sigma^{2}_{t}}\cdot\frac{{\bm{x}}_{t}-\sqrt{\alpha_{t}}{\bm{x}}_{0}}{\sqrt{1-\alpha_{t}}},\ \sigma_{t}^{2}{\bm{I}}\right).    (25)

From Bishop (2006) (2.115), we have that qσ(𝒙t1|𝒙0)subscript𝑞𝜎conditionalsubscript𝒙𝑡1subscript𝒙0q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0}) is Gaussian, denoted as 𝒩(μt1,Σt1)𝒩subscript𝜇𝑡1subscriptΣ𝑡1{\mathcal{N}}(\mu_{t-1},\Sigma_{t-1}) where
从 Bishop(2006)(2.115)中,我们得知 qσ(𝒙t1|𝒙0)subscriptconditionalsubscript1subscript0q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0}) 是高斯分布,记为 𝒩(μt1,Σt1)subscript1subscript1{\mathcal{N}}(\mu_{t-1},\Sigma_{t-1}) ,其中

\mu_{t-1}=\sqrt{\alpha_{t-1}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t-1}-\sigma^{2}_{t}}\cdot\frac{\sqrt{\alpha_{t}}{\bm{x}}_{0}-\sqrt{\alpha_{t}}{\bm{x}}_{0}}{\sqrt{1-\alpha_{t}}}    (26)
=\sqrt{\alpha_{t-1}}{\bm{x}}_{0}    (27)

and 并且

\Sigma_{t-1}=\sigma_{t}^{2}{\bm{I}}+\frac{1-\alpha_{t-1}-\sigma^{2}_{t}}{1-\alpha_{t}}(1-\alpha_{t}){\bm{I}}=(1-\alpha_{t-1}){\bm{I}}    (28)

Therefore, qσ(𝒙t1|𝒙0)=𝒩(αt1𝒙0,(1αt1)𝑰)subscript𝑞𝜎conditionalsubscript𝒙𝑡1subscript𝒙0𝒩subscript𝛼𝑡1subscript𝒙01subscript𝛼𝑡1𝑰q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t-1}}{\bm{x}}_{0},(1-\alpha_{t-1}){\bm{I}}), which allows us to apply the induction argument. ∎
因此, qσ(𝒙t1|𝒙0)=𝒩(αt1𝒙0,(1αt1)𝑰)subscriptconditionalsubscript1subscript0subscript1subscript01subscript1q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t-1}}{\bm{x}}_{0},(1-\alpha_{t-1}){\bm{I}}) ,这使我们能够应用归纳论证。∎

See Theorem 1. 见定理 1。

Proof.
证明。

From the definition of J_σ:
从 J_σ 的定义:

J_{\sigma}(\epsilon_{\theta}):={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q({\bm{x}}_{0:T})}\left[\log q_{\sigma}({\bm{x}}_{T}|{\bm{x}}_{0})+\sum_{t=2}^{T}\log q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})-\sum_{t=1}^{T}\log p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t})\right]    (29)
\equiv{\mathbb{E}}_{{\bm{x}}_{0:T}\sim q({\bm{x}}_{0:T})}\left[\sum_{t=2}^{T}D_{\mathrm{KL}}(q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t}))-\log p_{\theta}^{(1)}({\bm{x}}_{0}|{\bm{x}}_{1})\right]

where we use ≡ to denote "equal up to a value that does not depend on ε_θ (but may depend on q_σ)". For t > 1:
我们使用 ≡ 表示"在不依赖于 ε_θ(但可能依赖于 q_σ)的值上相等"。对于 t > 1:

{\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{t}\sim q({\bm{x}}_{0},{\bm{x}}_{t})}[D_{\mathrm{KL}}(q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t}))]
={\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{t}\sim q({\bm{x}}_{0},{\bm{x}}_{t})}[D_{\mathrm{KL}}(q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|q_{\sigma}({\bm{x}}_{t-1}|{\bm{x}}_{t},f_{\theta}^{(t)}({\bm{x}}_{t})))]
\equiv{\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{t}\sim q({\bm{x}}_{0},{\bm{x}}_{t})}\left[\frac{\lVert{\bm{x}}_{0}-f_{\theta}^{(t)}({\bm{x}}_{t})\rVert_{2}^{2}}{2\sigma_{t}^{2}}\right]    (30)
={\mathbb{E}}_{{\bm{x}}_{0}\sim q({\bm{x}}_{0}),\,\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}),\,{\bm{x}}_{t}=\sqrt{\alpha_{t}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon}\left[\frac{\lVert\frac{{\bm{x}}_{t}-\sqrt{1-\alpha_{t}}\epsilon}{\sqrt{\alpha_{t}}}-\frac{{\bm{x}}_{t}-\sqrt{1-\alpha_{t}}\epsilon_{\theta}^{(t)}({\bm{x}}_{t})}{\sqrt{\alpha_{t}}}\rVert_{2}^{2}}{2\sigma_{t}^{2}}\right]    (31)
={\mathbb{E}}_{{\bm{x}}_{0}\sim q({\bm{x}}_{0}),\,\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}),\,{\bm{x}}_{t}=\sqrt{\alpha_{t}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon}\left[\frac{\lVert\epsilon-\epsilon_{\theta}^{(t)}({\bm{x}}_{t})\rVert_{2}^{2}}{2d\sigma_{t}^{2}\alpha_{t}}\right]    (32)

where d𝑑d is the dimension of 𝒙0subscript𝒙0{\bm{x}}_{0}. For t=1𝑡1t=1:
其中 dd𝒙0subscript0{\bm{x}}_{0} 的维度。对于 t=11t=1

{\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{1}\sim q({\bm{x}}_{0},{\bm{x}}_{1})}\left[-\log p_{\theta}^{(1)}({\bm{x}}_{0}|{\bm{x}}_{1})\right]\equiv{\mathbb{E}}_{{\bm{x}}_{0},{\bm{x}}_{1}\sim q({\bm{x}}_{0},{\bm{x}}_{1})}\left[\frac{\lVert{\bm{x}}_{0}-f_{\theta}^{(1)}({\bm{x}}_{1})\rVert_{2}^{2}}{2\sigma_{1}^{2}}\right]    (33)
={\mathbb{E}}_{{\bm{x}}_{0}\sim q({\bm{x}}_{0}),\,\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}),\,{\bm{x}}_{1}=\sqrt{\alpha_{1}}{\bm{x}}_{0}+\sqrt{1-\alpha_{1}}\epsilon}\left[\frac{\lVert\epsilon-\epsilon_{\theta}^{(1)}({\bm{x}}_{1})\rVert_{2}^{2}}{2d\sigma_{1}^{2}\alpha_{1}}\right]    (34)

Therefore, when γt=1/(2dσt2αt)subscript𝛾𝑡12𝑑superscriptsubscript𝜎𝑡2subscript𝛼𝑡\gamma_{t}=1/(2d\sigma_{t}^{2}\alpha_{t}) for all t{1,,T}𝑡1𝑇t\in\{1,\ldots,T\}, we have
因此,当对于所有 t{1,,T}1t\in\{1,\ldots,T\}γt=1/(2dσt2αt)subscript12superscriptsubscript2subscript\gamma_{t}=1/(2d\sigma_{t}^{2}\alpha_{t}) 时,我们有

J_{\sigma}(\epsilon_{\theta})\equiv\sum_{t=1}^{T}\frac{1}{2d\sigma_{t}^{2}\alpha_{t}}{\mathbb{E}}\left[\lVert\epsilon_{\theta}^{(t)}({\bm{x}}_{t})-\epsilon_{t}\rVert_{2}^{2}\right]=L_{\gamma}(\epsilon_{\theta})    (35)

for all ϵθsubscriptitalic-ϵ𝜃\epsilon_{\theta}. From the definition of ``\equiv'', we have that Jσ=Lγ+Csubscript𝐽𝜎subscript𝐿𝛾𝐶J_{\sigma}=L_{\gamma}+C. ∎
对于所有 ϵθsubscript\epsilon_{\theta} 。根据“ \equiv ”的定义,我们有 Jσ=Lγ+CsubscriptsubscriptJ_{\sigma}=L_{\gamma}+C 。∎



See Proposition 1. 见命题 1。

Proof.
证明。

In the context of the proof, we consider t as a continuous, independent "time" variable and x and α as functions of t. First, let us consider a reparametrization between DDIM and the VE-SDE⁸ by introducing the variables x̄ and σ:
在证明的背景下,我们将 t 视为连续、独立的"时间"变量,将 x 和 α 视为 t 的函数。首先,通过引入变量 x̄ 和 σ,让我们考虑 DDIM 和 VE-SDE⁸ 之间的重新参数化:

\bar{{\bm{x}}}(t)=\bar{{\bm{x}}}(0)+\sigma(t)\epsilon,\quad\epsilon\sim{\mathcal{N}}(0,{\bm{I}}),    (36)

for t ∈ [0, ∞) and an increasing continuous function σ: ℝ_{≥0} → ℝ_{≥0} where σ(0) = 0.
对于 t ∈ [0, ∞) 以及一个递增的连续函数 σ: ℝ_{≥0} → ℝ_{≥0},其中 σ(0) = 0。

⁸ Refer to (Song et al., 2020) for more details of VE-SDE.
⁸ 有关 VE-SDE 的更多详细信息,请参考(Song 等人,2020)。

We can then define α(t)𝛼𝑡\alpha(t) and 𝒙(t)𝒙𝑡{\bm{x}}(t) corresponding to DDIM case as:
然后,我们可以将对应于 DDIM 情况的 α(t)\alpha(t)𝒙(t){\bm{x}}(t) 定义为:

\bar{{\bm{x}}}(t)=\frac{{\bm{x}}(t)}{\sqrt{\alpha(t)}}    (37)
\sigma(t)=\sqrt{\frac{1-\alpha(t)}{\alpha(t)}}.    (38)

This also means that:
这也意味着:

{\bm{x}}(t)=\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}    (39)
\alpha(t)=\frac{1}{1+\sigma^{2}(t)},    (40)

which establishes a bijection between (x, α) and (x̄, σ). From Equation (4) we have (note that α(0) = 1):

\frac{{\bm{x}}(t)}{\sqrt{\alpha(t)}}=\frac{{\bm{x}}(0)}{\sqrt{\alpha(0)}}+\sqrt{\frac{1-\alpha(t)}{\alpha(t)}}\epsilon,\quad\epsilon\sim{\mathcal{N}}(0,{\bm{I}})    (41)

which can be reparametrized into a form that is consistent with VE-SDE:
其可重新参数化为与 VE-SDE 一致的形式:

\bar{{\bm{x}}}(t)=\bar{{\bm{x}}}(0)+\sigma(t)\epsilon.    (42)

Now, we derive the ODE forms for both DDIM and VE-SDE and show that they are equivalent.
现在,我们推导出 DDIM 和 VE - SDE 的常微分方程形式,并表明它们是等价的。

ODE form for DDIM
用于 DDIM 的常微分方程形式

We repeat Equation (13) here:
我们在此重复方程(13):

\frac{{\bm{x}}_{t-\Delta t}}{\sqrt{\alpha_{t-\Delta t}}}=\frac{{\bm{x}}_{t}}{\sqrt{\alpha_{t}}}+\left(\sqrt{\frac{1-\alpha_{t-\Delta t}}{\alpha_{t-\Delta t}}}-\sqrt{\frac{1-\alpha_{t}}{\alpha_{t}}}\right)\epsilon_{\theta}^{(t)}({\bm{x}}_{t}),    (43)

which is equivalent to:
这相当于:

\bar{{\bm{x}}}(t-\Delta t)=\bar{{\bm{x}}}(t)+(\sigma(t-\Delta t)-\sigma(t))\cdot\epsilon_{\theta}^{(t)}({\bm{x}}(t))    (44)

Divide both sides by (Δt)Δ𝑡(-\Delta t) and as Δt0Δ𝑡0\Delta t\to 0, we have:
将两边除以 (Δt)(-\Delta t) ,并且由于 Δt00\Delta t\to 0 ,我们有:

\frac{\mathrm{d}\bar{{\bm{x}}}(t)}{\mathrm{d}t}=\frac{\mathrm{d}\sigma(t)}{\mathrm{d}t}\,\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}\right),    (45)

which is exactly what we have in Equation (14).
这正是我们在方程(14)中所拥有的。

We note that for the optimal model, ϵθ(t)superscriptsubscriptitalic-ϵ𝜃𝑡\epsilon_{\theta}^{(t)} is a minimizer:
我们注意到,对于最优模型, ϵθ(t)superscriptsubscript\epsilon_{\theta}^{(t)} 是一个极小值点:

\epsilon_{\theta}^{(t)}=\operatorname*{arg\,min}_{f_{t}}{\mathbb{E}}_{{\bm{x}}(0)\sim q({\bm{x}}),\,\epsilon\sim{\mathcal{N}}(0,{\bm{I}})}[\lVert f_{t}({\bm{x}}(t))-\epsilon\rVert_{2}^{2}]    (46)

where x(t) = √(α(t)) x(0) + √(1 − α(t)) ε.
其中 x(t) = √(α(t)) x(0) + √(1 − α(t)) ε。

ODE form for VE-SDE
用于 VE-SDE 的常微分方程形式

Define p_t(x̄) as the data distribution perturbed with σ²(t) variance Gaussian noise. The probability flow for the VE-SDE is defined as in Song et al. (2020):
将 p_t(x̄) 定义为受到方差为 σ²(t) 的高斯噪声扰动的数据分布。VE-SDE 的概率流按照 Song 等人(2020)定义为:

\mathrm{d}\bar{{\bm{x}}}=-\frac{1}{2}g(t)^{2}\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}})\,\mathrm{d}t    (47)

where g(t)=dσ2(t)dt𝑔𝑡dsuperscript𝜎2𝑡d𝑡g(t)=\sqrt{\frac{\mathrm{d}\sigma^{2}(t)}{\mathrm{d}t}} is the diffusion coefficient, and 𝒙¯logpt(𝒙¯)subscript¯𝒙subscript𝑝𝑡¯𝒙\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}}) is the score of ptsubscript𝑝𝑡p_{t}.
其中 g(t)=dσ2(t)dtsuperscript2g(t)=\sqrt{\frac{\mathrm{d}\sigma^{2}(t)}{\mathrm{d}t}} 是扩散系数, 𝒙¯logpt(𝒙¯)subscriptsubscript\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}})ptsubscriptp_{t} 的得分。

The σ(t)𝜎𝑡\sigma(t)-perturbed score function 𝒙¯logpt(𝒙¯)subscript¯𝒙subscript𝑝𝑡¯𝒙\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}}) is also a minimizer (from denoising score matching (Vincent, 2011)):
σ(t)\sigma(t) -扰动得分函数 𝒙¯logpt(𝒙¯)subscriptsubscript\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}}) 也是一个最小化器(来自去噪得分匹配(Vincent,2011)):

\nabla_{\bar{{\bm{x}}}}\log p_{t}=\operatorname*{arg\,min}_{g_{t}}{\mathbb{E}}_{{\bm{x}}(0)\sim q({\bm{x}}),\,\epsilon\sim{\mathcal{N}}(0,{\bm{I}})}[\lVert g_{t}(\bar{{\bm{x}}})+\epsilon/\sigma(t)\rVert_{2}^{2}]    (48)

where x̄(t) = x̄(0) + σ(t) ε.
其中 x̄(t) = x̄(0) + σ(t) ε。

Since there is an equivalence between 𝒙(t)𝒙𝑡{\bm{x}}(t) and 𝒙¯(t)¯𝒙𝑡\bar{{\bm{x}}}(t), we have the following relationship:
由于 𝒙(t){\bm{x}}(t)𝒙¯(t)\bar{{\bm{x}}}(t) 之间存在等价关系,我们有以下关系:

\nabla_{\bar{{\bm{x}}}}\log p_{t}(\bar{{\bm{x}}})=-\frac{\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}\right)}{\sigma(t)}    (49)

from Equation (46) and Equation (48). Plugging Equation (49) and the definition of g(t) into Equation (47), we have:
由方程(46)和方程(48)可得。将方程(49)和 g(t) 的定义代入方程(47),我们有:

\mathrm{d}\bar{{\bm{x}}}(t)=\frac{1}{2}\frac{\mathrm{d}\sigma^{2}(t)}{\mathrm{d}t}\frac{\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}\right)}{\sigma(t)}\mathrm{d}t,    (50)

and we have the following by rearranging terms:
通过重新排列项,我们有以下内容:

\frac{\mathrm{d}\bar{{\bm{x}}}(t)}{\mathrm{d}t}=\frac{\mathrm{d}\sigma(t)}{\mathrm{d}t}\,\epsilon_{\theta}^{(t)}\left(\frac{\bar{{\bm{x}}}(t)}{\sqrt{\sigma^{2}(t)+1}}\right)    (51)

which is equivalent to Equation (45). In both cases the initial conditions are 𝒙¯(T)𝒩(𝟎,σ2(T)𝑰)similar-to¯𝒙𝑇𝒩0superscript𝜎2𝑇𝑰\bar{{\bm{x}}}(T)\sim{\mathcal{N}}({\bm{0}},\sigma^{2}(T){\bm{I}}), so the resulting ODEs are identical. ∎
这等同于方程(45)。在这两种情况下,初始条件均为 𝒙¯(T)𝒩(𝟎,σ2(T)𝑰)similar-to0superscript2\bar{{\bm{x}}}(T)\sim{\mathcal{N}}({\bm{0}},\sigma^{2}(T){\bm{I}}) ,因此所得的常微分方程是相同的。∎

Appendix C Additional Derivations
附录 C 额外的推导

C.1 Accelerated sampling processes
C.1 加速抽样过程

In the accelerated case, we can consider the inference process to be factored as:
在加速的情况下,我们可以认为推理过程可分解为:

q_{\sigma,\tau}({\bm{x}}_{1:T}|{\bm{x}}_{0})=q_{\sigma,\tau}({\bm{x}}_{\tau_{S}}|{\bm{x}}_{0})\prod_{i=1}^{S}q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0})\prod_{t\in\bar{\tau}}q_{\sigma,\tau}({\bm{x}}_{t}|{\bm{x}}_{0})    (52)

where τ𝜏\tau is a sub-sequence of [1,,T]1𝑇[1,\ldots,T] of length S𝑆S with τS=Tsubscript𝜏𝑆𝑇\tau_{S}=T, and let τ¯:={1,,T}τassign¯𝜏1𝑇𝜏\bar{\tau}:=\{1,\ldots,T\}\setminus\tau be its complement. Intuitively, the graphical model of {𝒙τi}i=1Ssuperscriptsubscriptsubscript𝒙subscript𝜏𝑖𝑖1𝑆\{{\bm{x}}_{\tau_{i}}\}_{i=1}^{S} and 𝒙0subscript𝒙0{\bm{x}}_{0} form a chain, whereas the graphical model of {𝒙t}tτ¯subscriptsubscript𝒙𝑡𝑡¯𝜏\{{\bm{x}}_{t}\}_{t\in\bar{\tau}} and 𝒙0subscript𝒙0{\bm{x}}_{0} forms a star graph. We define:
其中 τ\tau[1,,T]1[1,\ldots,T] 的长度为 SS 且具有 τS=Tsubscript\tau_{S}=T 的子序列,令 τ¯:={1,,T}τassign1\bar{\tau}:=\{1,\ldots,T\}\setminus\tau 为其补集。直观地说, {𝒙τi}i=1Ssuperscriptsubscriptsubscriptsubscript1\{{\bm{x}}_{\tau_{i}}\}_{i=1}^{S}𝒙0subscript0{\bm{x}}_{0} 的图形模型形成一条链,而 {𝒙t}tτ¯subscriptsubscript\{{\bm{x}}_{t}\}_{t\in\bar{\tau}}𝒙0subscript0{\bm{x}}_{0} 的图形模型形成一个星图。我们定义:

q_{\sigma,\tau}({\bm{x}}_{t}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{t}}{\bm{x}}_{0},(1-\alpha_{t}){\bm{I}})\quad\forall t\in\bar{\tau}\cup\{T\}    (53)
q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0})={\mathcal{N}}\left(\sqrt{\alpha_{\tau_{i-1}}}{\bm{x}}_{0}+\sqrt{1-\alpha_{\tau_{i-1}}-\sigma^{2}_{\tau_{i}}}\cdot\frac{{\bm{x}}_{\tau_{i}}-\sqrt{\alpha_{\tau_{i}}}{\bm{x}}_{0}}{\sqrt{1-\alpha_{\tau_{i}}}},\ \sigma_{\tau_{i}}^{2}{\bm{I}}\right)\quad\forall i\in[S]

where the coefficients are chosen such that:
其中系数的选取应满足:

q_{\sigma,\tau}({\bm{x}}_{\tau_{i}}|{\bm{x}}_{0})={\mathcal{N}}(\sqrt{\alpha_{\tau_{i}}}{\bm{x}}_{0},(1-\alpha_{\tau_{i}}){\bm{I}})\quad\forall i\in[S]    (54)

i.e., the ``marginals'' match.
即,“边际”匹配。

The corresponding ``generative process'' is defined as:
相应的“生成过程”定义为:

p_{\theta}({\bm{x}}_{0:T}):=\underbrace{p_{\theta}({\bm{x}}_{T})\prod_{i=1}^{S}p^{(\tau_{i})}_{\theta}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}})}_{\text{use to produce samples}}\times\underbrace{\prod_{t\in\bar{\tau}}p_{\theta}^{(t)}({\bm{x}}_{0}|{\bm{x}}_{t})}_{\text{in variational objective}}    (55)

where only part of the models are actually being used to produce samples. The conditionals are:
在其中只有部分模型实际上被用于生成样本。条件是:

p_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}})=q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},f_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i-1}}))\quad\text{if}\ i\in[S],\ i>1    (56)
p_{\theta}^{(t)}({\bm{x}}_{0}|{\bm{x}}_{t})={\mathcal{N}}(f_{\theta}^{(t)}({\bm{x}}_{t}),\sigma_{t}^{2}{\bm{I}})\quad\text{otherwise,}    (57)

where we leverage qσ,τ(𝒙τi1|𝒙τi,𝒙0)subscript𝑞𝜎𝜏conditionalsubscript𝒙subscript𝜏𝑖1subscript𝒙subscript𝜏𝑖subscript𝒙0q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0}) as part of the inference process (similar to what we have done in Section 3). The resulting variational objective becomes (define 𝒙τL+1=subscript𝒙subscript𝜏𝐿1{\bm{x}}_{\tau_{L+1}}=\varnothing for conciseness):
在我们将 qσ,τ(𝒙τi1|𝒙τi,𝒙0)subscriptconditionalsubscriptsubscript1subscriptsubscriptsubscript0q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0}) 用作推理过程的一部分的地方(类似于我们在第 3 节中所做的)。所得的变分目标变为(为简洁起见定义 𝒙τL+1=subscriptsubscript1{\bm{x}}_{\tau_{L+1}}=\varnothing ):

J(\epsilon_{\theta})={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q_{\sigma,\tau}({\bm{x}}_{0:T})}[\log q_{\sigma,\tau}({\bm{x}}_{1:T}|{\bm{x}}_{0})-\log p_{\theta}({\bm{x}}_{0:T})]    (58)
={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q_{\sigma,\tau}({\bm{x}}_{0:T})}\Bigg[\sum_{t\in\bar{\tau}}D_{\mathrm{KL}}(q_{\sigma,\tau}({\bm{x}}_{t}|{\bm{x}}_{0})\|p_{\theta}^{(t)}({\bm{x}}_{0}|{\bm{x}}_{t}))    (59)
\qquad\qquad\qquad\qquad+\sum_{i=1}^{L}D_{\mathrm{KL}}(q_{\sigma,\tau}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}},{\bm{x}}_{0})\|p_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i-1}}|{\bm{x}}_{\tau_{i}}))\Bigg]

where each KL divergence is between two Gaussians with variance independent of θ𝜃\theta. A similar argument to the proof used in Theorem 1 can show that the variational objective J𝐽J can also be converted to an objective of the form Lγsubscript𝐿𝛾L_{\gamma}.
其中每个 KL 散度是在两个方差与 θ\theta 无关的高斯分布之间。与定理 1 中使用的证明类似的论证可以表明,变分目标 JJ 也可以转换为形式为 LγsubscriptL_{\gamma} 的目标。

C.2 Derivation of denoising objectives for DDPMs
C.2 针对去噪扩散概率模型的去噪目标的推导

We note that in Ho et al. (2020), a diffusion hyperparameter β_t⁹ is first introduced, and then the relevant variables α_t := 1 − β_t and ᾱ_t = ∏_{i=1}^{t} α_i are defined. In this paper, we have used the notation α_t to represent the variable ᾱ_t in Ho et al. (2020) for three reasons. First, it makes it more clear that we only need to choose one set of hyperparameters, reducing possible cross-references of the derived variables. Second, it allows us to introduce the generalization as well as the acceleration case more easily, because the inference process is no longer motivated by a diffusion. Third, there exists an isomorphism between α_{1:T} and 1, …, T, which is not the case for β_t.
我们注意到在 Ho 等人(2020)中,首先引入了扩散超参数 β_t⁹,然后定义了相关变量 α_t := 1 − β_t 和 ᾱ_t = ∏_{i=1}^{t} α_i。在本文中,我们使用符号 α_t 来表示 Ho 等人(2020)中的变量 ᾱ_t,原因有三。首先,它更清楚地表明我们只需要选择一组超参数,减少了派生变量可能的交叉引用。其次,它使我们能够更轻松地引入推广以及加速情况,因为推理过程不再受扩散的驱动。第三,α_{1:T} 与 1, …, T 之间存在同构,而 β_t 并非如此。

⁹ In this section we use teal to color notations used in Ho et al. (2020).
⁹ 在本节中,我们使用青绿色为 Ho 等人(2020)中使用的符号着色。

In this section, we use βtsubscript𝛽𝑡\beta_{t} and αtsubscript𝛼𝑡\alpha_{t} to be more consistent with the derivation in Ho et al. (2020), where
在本节中,我们使用 βtsubscript\beta_{t}αtsubscript\alpha_{t} 以与 Ho 等人(2020 年)的推导更加一致,其中

{\color[rgb]{0,.5,.5}\alpha_{t}}=\frac{\alpha_{t}}{\alpha_{t-1}}    (60)
{\color[rgb]{0,.5,.5}\beta_{t}}=1-\frac{\alpha_{t}}{\alpha_{t-1}}    (61)

can be uniquely determined from αtsubscript𝛼𝑡\alpha_{t} (i.e. α¯tsubscript¯𝛼𝑡\bar{\alpha}_{t}).
可从 αtsubscript\alpha_{t} (即 α¯tsubscript\bar{\alpha}_{t} )唯一确定。
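For reference, a small sketch of the change of variables in Eqs. (60)-(61), under the assumption that the cumulative α_t of this paper is stored as an array with the convention α_0 = 1 (the function and variable names are hypothetical):

```python
import numpy as np

def ho_alphas_betas(alpha_bar):
    """Recover the per-step alpha_t and beta_t of Ho et al. (2020) from the
    cumulative alpha_t used in this paper, via Eqs. (60)-(61).

    alpha_bar: array of length T with alpha_bar[t-1] = alpha_t of this paper;
    alpha_0 = 1 is assumed for the t = 1 entry.
    """
    alpha_prev = np.concatenate(([1.0], alpha_bar[:-1]))
    alpha_ho = alpha_bar / alpha_prev   # Eq. (60)
    beta_ho = 1.0 - alpha_ho            # Eq. (61)
    return alpha_ho, beta_ho
```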

First, from the diffusion forward process:
首先,从扩散正向过程:

q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})={\mathcal{N}}\Bigg(\underbrace{\frac{\sqrt{\alpha_{t-1}}{\color[rgb]{0,.5,.5}\beta_{t}}}{1-\alpha_{t}}{\bm{x}}_{0}+\frac{\sqrt{{\color[rgb]{0,.5,.5}\alpha_{t}}}(1-\alpha_{t-1})}{1-\alpha_{t}}{\bm{x}}_{t}}_{{\color[rgb]{0,.5,.5}\tilde{\mu}({\bm{x}}_{t},{\bm{x}}_{0})}},\ \frac{1-\alpha_{t-1}}{1-\alpha_{t}}{\color[rgb]{0,.5,.5}\beta_{t}}{\bm{I}}\Bigg)

Ho et al. (2020) considered a specific type of p_θ^{(t)}(x_{t-1}|x_t):
Ho 等人(2020)考虑了一种特定类型的 p_θ^{(t)}(x_{t-1}|x_t):

p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t})={\mathcal{N}}\left({\color[rgb]{0,.5,.5}\mu_{\theta}({\bm{x}}_{t},t)},\sigma_{t}{\bm{I}}\right)    (62)

which leads to the following variational objective:
这导致了以下变分目标:

{\color[rgb]{0,.5,.5}L}:={\mathbb{E}}_{{\bm{x}}_{0:T}\sim q({\bm{x}}_{0:T})}\left[\log q({\bm{x}}_{T}|{\bm{x}}_{0})+\sum_{t=2}^{T}\log q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})-\sum_{t=1}^{T}\log p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t})\right]    (63)
\equiv{\mathbb{E}}_{{\bm{x}}_{0:T}\sim q({\bm{x}}_{0:T})}\left[\sum_{t=2}^{T}\underbrace{D_{\mathrm{KL}}(q({\bm{x}}_{t-1}|{\bm{x}}_{t},{\bm{x}}_{0})\|p_{\theta}^{(t)}({\bm{x}}_{t-1}|{\bm{x}}_{t}))}_{{\color[rgb]{0,.5,.5}L_{t-1}}}-\log p_{\theta}^{(1)}({\bm{x}}_{0}|{\bm{x}}_{1})\right]

One can write: 一个人可以写道:

{\color[rgb]{0,.5,.5}L_{t-1}}={\mathbb{E}}_{q}\left[\frac{1}{2\sigma_{t}^{2}}\lVert{\color[rgb]{0,.5,.5}\mu_{\theta}({\bm{x}}_{t},t)}-{\color[rgb]{0,.5,.5}\tilde{\mu}({\bm{x}}_{t},{\bm{x}}_{0})}\rVert_{2}^{2}\right]    (64)

Ho et al. (2020) chose the parametrization
何等人(2020 年)选择了参数化

{\color[rgb]{0,.5,.5}\mu_{\theta}({\bm{x}}_{t},t)}=\frac{1}{\sqrt{{\color[rgb]{0,.5,.5}\alpha_{t}}}}\left({\bm{x}}_{t}-\frac{{\color[rgb]{0,.5,.5}\beta_{t}}}{\sqrt{1-\alpha_{t}}}{\color[rgb]{0,.5,.5}\epsilon_{\theta}({\bm{x}}_{t},t)}\right)    (65)

which can be simplified to:
其可简化为:

{\color[rgb]{0,.5,.5}L_{t-1}}={\mathbb{E}}_{{\bm{x}}_{0},\epsilon}\left[\frac{{\color[rgb]{0,.5,.5}\beta_{t}}^{2}}{2\sigma_{t}^{2}(1-\alpha_{t}){\color[rgb]{0,.5,.5}\alpha_{t}}}\lVert\epsilon-{\color[rgb]{0,.5,.5}\epsilon_{\theta}(\sqrt{\alpha_{t}}{\bm{x}}_{0}+\sqrt{1-\alpha_{t}}\epsilon,t)}\rVert_{2}^{2}\right]    (66)

Appendix D Experimental Details
附录 D 实验细节

D.1 Datasets and architectures
D.1 数据集与架构

We consider 4 image datasets with various resolutions: CIFAR10 (32×32, unconditional), CelebA (64×64), LSUN Bedroom (256×256) and LSUN Church (256×256). For all datasets, we set the hyperparameters α according to the heuristic in (Ho et al., 2020) to make the results directly comparable. We use the same model for each dataset, and only compare the performance of different generative processes. For CIFAR10, Bedroom and Church, we obtain the pretrained checkpoints from the original DDPM implementation; for CelebA, we trained our own model using the denoising objective L_1.
我们考虑了具有各种分辨率的 4 个图像数据集:CIFAR10(32×32,无条件)、CelebA(64×64)、LSUN 卧室(256×256)和 LSUN 教堂(256×256)。对于所有数据集,我们根据(Ho 等人,2020)中的启发式方法设置超参数 α,以使结果直接具有可比性。我们对每个数据集使用相同的模型,仅比较不同生成过程的性能。对于 CIFAR10、卧室和教堂,我们从原始的 DDPM 实现中获得预训练的检查点;对于 CelebA,我们使用去噪目标 L_1 训练了自己的模型。

Our architecture for ϵθ(t)(𝒙t)superscriptsubscriptitalic-ϵ𝜃𝑡subscript𝒙𝑡\epsilon_{\theta}^{(t)}({\bm{x}}_{t}) follows that in Ho et al. (2020), which is a U-Net (Ronneberger et al., 2015) based on a Wide ResNet (Zagoruyko & Komodakis, 2016). We use the pretrained models from Ho et al. (2020) for CIFAR10, Bedroom and Church, and train our own model for the CelebA 64×64646464\times 64 model (since a pretrained model is not provided). Our CelebA model has five feature map resolutions from 64×64646464\times 64 to 4×4444\times 4, and we use the original CelebA dataset (not CelebA-HQ) using the pre-processing technique from the StyleGAN (Karras et al., 2018) repository.
我们的 ϵθ(t)(𝒙t)superscriptsubscriptsubscript\epsilon_{\theta}^{(t)}({\bm{x}}_{t}) 架构遵循 Ho 等人(2020)的架构,这是一个基于宽残差网络(Zagoruyko 和 Komodakis,2016)的 U-Net(Ronneberger 等人,2015)。我们使用 Ho 等人(2020)针对 CIFAR10、Bedroom 和 Church 的预训练模型,并为 CelebA 64×64646464\times 64 模型训练我们自己的模型(因为未提供预训练模型)。我们的 CelebA 模型具有从 64×64646464\times 644×4444\times 4 的五个特征图分辨率,并且我们使用原始的 CelebA 数据集(不是 CelebA-HQ),采用来自 StyleGAN(Karras 等人,2018)存储库的预处理技术。

Table 3: LSUN Bedroom and Church image generation results, measured in FID. For 1000 steps DDPM, the FIDs are 6.36 for Bedroom and 7.89 for Church.
表 3:以 FID 衡量的 LSUN 卧室和教堂图像生成结果。对于 1000 步的 DDPM,卧室的 FID 为 6.36,教堂的 FID 为 7.89。
                   Bedroom (256×256)                Church (256×256)
dim(τ)             10      20      50     100       10      20      50     100
DDIM (η = 0.0)     16.95   8.89    6.75   6.62      19.45   12.47   10.84  10.58
DDPM (η = 1.0)     42.78   22.77   10.81  6.81      51.56   23.37   11.16   8.27

D.2 Reverse process sub-sequence selection

We consider two types of selection procedure for $\tau$ given the desired $\dim(\tau) < T$:

  • Linear: we select the timesteps such that $\tau_i = \lfloor ci \rfloor$ for some $c$;

  • Quadratic: we select the timesteps such that $\tau_i = \lfloor ci^2 \rfloor$ for some $c$.

The constant value $c$ is selected such that $\tau_{-1}$ is close to $T$. We used quadratic for CIFAR10 and linear for the remaining datasets. These choices achieve slightly better FID than their alternatives in the respective datasets.
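A minimal sketch of both selection procedures follows. The function name and the particular choice of $c$ (so that the last selected timestep lands at $T$) are ours; the paper only requires that $\tau_{-1}$ be close to $T$.

```python
import numpy as np

def select_timesteps(T, dim_tau, quadratic=False):
    """Return an increasing sub-sequence tau of length dim_tau from {1, ..., T}."""
    i = np.arange(1, dim_tau + 1)
    if quadratic:
        c = T / (dim_tau ** 2)          # so that floor(c * dim_tau^2) is close to T
        tau = np.floor(c * i ** 2)
    else:
        c = T / dim_tau                 # so that floor(c * dim_tau) is close to T
        tau = np.floor(c * i)
    # very small floors may need clipping to >= 1 in practice
    return np.clip(tau, 1, T).astype(int)
```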

D.3 Closed form equations for each sampling step

From the general sampling equation in Eq. (12), we have the following update equation:

$${\bm{x}}_{\tau_{i-1}}(\eta)=\sqrt{\alpha_{\tau_{i-1}}}\left(\frac{{\bm{x}}_{\tau_{i}}-\sqrt{1-\alpha_{\tau_{i}}}\,\epsilon_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i}})}{\sqrt{\alpha_{\tau_{i}}}}\right)+\sqrt{1-\alpha_{\tau_{i-1}}-\sigma_{\tau_{i}}(\eta)^{2}}\cdot\epsilon_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i}})+\sigma_{\tau_{i}}(\eta)\,\epsilon$$

where

$$\sigma_{\tau_{i}}(\eta)=\eta\sqrt{\frac{1-\alpha_{\tau_{i-1}}}{1-\alpha_{\tau_{i}}}}\sqrt{1-\frac{\alpha_{\tau_{i}}}{\alpha_{\tau_{i-1}}}}$$
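A sketch of one application of this update as a NumPy function. The argument names are ours: `eps_pred` stands for $\epsilon_{\theta}^{(\tau_i)}({\bm{x}}_{\tau_i})$, and `alpha_now`, `alpha_prev` for $\alpha_{\tau_i}$, $\alpha_{\tau_{i-1}}$.

```python
import numpy as np

def generalized_step(x_t, eps_pred, alpha_now, alpha_prev, eta, rng=None):
    """One reverse step x_{tau_i} -> x_{tau_{i-1}} of the generalized process.
    eta = 0 gives DDIM (deterministic); eta = 1 gives the DDPM-like update."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = eta * np.sqrt((1 - alpha_prev) / (1 - alpha_now)) \
                * np.sqrt(1 - alpha_now / alpha_prev)
    # "predicted x_0" term
    x0_pred = (x_t - np.sqrt(1 - alpha_now) * eps_pred) / np.sqrt(alpha_now)
    # direction pointing to x_t, plus random noise scaled by sigma
    noise = rng.standard_normal(x_t.shape) if eta > 0 else 0.0
    return np.sqrt(alpha_prev) * x0_pred \
         + np.sqrt(1 - alpha_prev - sigma ** 2) * eps_pred \
         + sigma * noise
```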

For the case of $\hat{\sigma}$ (DDPM with a larger variance), the update equation becomes:

$${\bm{x}}_{\tau_{i-1}}=\sqrt{\alpha_{\tau_{i-1}}}\left(\frac{{\bm{x}}_{\tau_{i}}-\sqrt{1-\alpha_{\tau_{i}}}\,\epsilon_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i}})}{\sqrt{\alpha_{\tau_{i}}}}\right)+\sqrt{1-\alpha_{\tau_{i-1}}-\sigma_{\tau_{i}}(1)^{2}}\cdot\epsilon_{\theta}^{(\tau_{i})}({\bm{x}}_{\tau_{i}})+\hat{\sigma}_{\tau_{i}}\,\epsilon$$

which uses a different coefficient for $\epsilon$ compared with the update for $\eta=1$, but uses the same coefficient for the non-stochastic parts. This update is more stochastic than the update for $\eta=1$, which explains why it achieves worse performance when $\dim(\tau)$ is small.
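A corresponding sketch for the $\hat{\sigma}$ variant. Since $\hat{\sigma}$ is defined earlier in the paper rather than in this appendix, we assume here that $\hat{\sigma}_{\tau_i}=\sqrt{1-\alpha_{\tau_i}/\alpha_{\tau_{i-1}}}$, i.e. the $\sqrt{\beta_t}$ choice of Ho et al. (2020); the function and argument names are again ours.

```python
import numpy as np

def sigma_hat_step(x_t, eps_pred, alpha_now, alpha_prev, rng=None):
    """DDPM update with the larger variance: same non-stochastic coefficients
    as the eta = 1 update, but noise scaled by sigma_hat instead of sigma(1)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma_1 = np.sqrt((1 - alpha_prev) / (1 - alpha_now)) \
            * np.sqrt(1 - alpha_now / alpha_prev)
    sigma_hat = np.sqrt(1 - alpha_now / alpha_prev)   # assumed sqrt(beta_t) choice
    x0_pred = (x_t - np.sqrt(1 - alpha_now) * eps_pred) / np.sqrt(alpha_now)
    return np.sqrt(alpha_prev) * x0_pred \
         + np.sqrt(1 - alpha_prev - sigma_1 ** 2) * eps_pred \
         + sigma_hat * rng.standard_normal(x_t.shape)
```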

D.4 Samples and Consistency

We show more samples in Figure 7 (CIFAR10), Figure 8 (CelebA), Figure 10 (Church) and consistency results of DDIM in Figure 9 (CelebA).

Figure 7: CIFAR10 samples from 1000 step DDPM, 1000 step DDIM and 100 step DDIM.

Figure 8: CelebA samples from 1000 step DDPM, 1000 step DDIM and 100 step DDIM.

Figure 9: CelebA samples from DDIM with the same random ${\bm{x}}_T$ and different number of steps.

Figure 10: Church samples from 100 step DDPM and 100 step DDIM.

D.5 Interpolation

To generate interpolations on a line, we randomly sample two initial ${\bm{x}}_T$ values from the standard Gaussian, interpolate them with spherical linear interpolation (Shoemake, 1985), and then use the DDIM to obtain ${\bm{x}}_0$ samples.

$${\bm{x}}_{T}^{(\alpha)}=\frac{\sin((1-\alpha)\theta)}{\sin(\theta)}\,{\bm{x}}_{T}^{(0)}+\frac{\sin(\alpha\theta)}{\sin(\theta)}\,{\bm{x}}_{T}^{(1)}\qquad(67)$$

where $\theta=\arccos\left(\frac{({\bm{x}}_{T}^{(0)})^{\top}{\bm{x}}_{T}^{(1)}}{\lVert{\bm{x}}_{T}^{(0)}\rVert\,\lVert{\bm{x}}_{T}^{(1)}\rVert}\right)$. These values are used to produce DDIM samples.
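A minimal sketch of Eq. (67); the helper name `slerp` and its arguments are ours.

```python
import numpy as np

def slerp(x0, x1, alpha):
    """Spherical linear interpolation between two latents, Eq. (67)."""
    cos_theta = np.dot(x0.ravel(), x1.ravel()) / (np.linalg.norm(x0) * np.linalg.norm(x1))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return (np.sin((1 - alpha) * theta) * x0 + np.sin(alpha * theta) * x1) / np.sin(theta)
```

For example, `slerp(x_T_0, x_T_1, 0.3)` gives the interpolated latent at $\alpha=0.3$, which is then decoded with the DDIM generative process.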

To generate interpolations on a grid, we sample four latent variables and separate them into two pairs; then we use slerp with the pairs under the same $\alpha$, and use slerp over the interpolated samples across the pairs (under an independently chosen interpolation coefficient). We show more grid interpolation results in Figure 11 (CelebA), Figure 12 (Bedroom), and Figure 13 (Church).
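A sketch of the grid procedure, reusing the `slerp` helper defined above; the second, independently chosen coefficient is denoted `beta` here, and all names are ours.

```python
def slerp_grid(z, alphas, betas):
    """z: four latents forming two pairs, (z[0], z[1]) and (z[2], z[3]).
    Returns a len(betas) x len(alphas) grid of interpolated latents."""
    grid = []
    for beta in betas:
        row = []
        for alpha in alphas:
            a = slerp(z[0], z[1], alpha)    # interpolate within the first pair
            b = slerp(z[2], z[3], alpha)    # interpolate within the second pair
            row.append(slerp(a, b, beta))   # interpolate across the pairs
        grid.append(row)
    return grid
```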

Figure 11: More interpolations from the CelebA DDIM with $\dim(\tau)=50$.

Figure 12: More interpolations from the Bedroom DDIM with $\dim(\tau)=50$.

Figure 13: More interpolations from the Church DDIM with $\dim(\tau)=50$.