Perspectives on diffusion

Reading time ~35 minutes

Diffusion models appear to come in many shapes and forms. If you pick two random research papers about diffusion and look at how they describe the model class in their respective introductions, chances are they will go about it in very different ways. This can be both frustrating and enlightening: frustrating, because it makes it harder to spot relationships and equivalences across papers and implementations – but also enlightening, because these various perspectives each reveal new connections and are a breeding ground for new ideas. This blog post is an overview of the perspectives on diffusion I’ve found useful.

Last year, I wrote a blog post titled “diffusion models are autoencoders”. The title was tongue-in-cheek, but it highlighted a close connection between diffusion models and autoencoders, which I felt had been underappreciated up until then. Since far more ML practitioners were familiar with autoencoders than with diffusion models at the time, it seemed like a good idea to try and change that.

Since then, I’ve realised that I could probably write a whole series of blog posts, each highlighting a different perspective or equivalence. Unfortunately I only seem to be able to produce one or two blog posts a year, despite efforts to increase the frequency. So instead, this post will cover all of them at once in considerably less detail – but hopefully enough to pique your curiosity, or to make you see diffusion models in a new light.

This post will probably be most useful to those who already have at least a basic understanding of diffusion models. If you don’t count yourself among this group, or you’d like a refresher, check out my earlier blog posts on the topic:

Before we start, a disclaimer: some of these connections are deliberately quite handwavy. They are intended to build intuition and understanding, and are not supposed to be taken literally, for the most part – this is a blog post, not a peer-reviewed research paper.

That said, I welcome any corrections and thoughts about the ways in which these equivalences don’t quite hold, or could even be misleading. Feel free to leave a comment, or reach out to me on Twitter (@sedielem) or Threads (@sanderdieleman). If you have a different perspective that I haven’t covered here, please share it as well.

Alright, here goes (click to scroll to each section):

  1. Diffusion models are autoencoders
  2. Diffusion models are deep latent variable models
  3. Diffusion models predict the score function
  4. Diffusion models solve reverse SDEs
  5. Diffusion models are flow-based models
  6. Diffusion models are recurrent neural networks
  7. Diffusion models are autoregressive models
  8. Diffusion models estimate expectations
  9. Discrete and continuous diffusion models
  10. Alternative formulations
  11. Consistency
  12. Defying conventions
  13. Closing thoughts
  14. Acknowledgements
  15. References

Diffusion models are autoencoders

Denoising autoencoders are neural networks whose input is corrupted by noise, and they are tasked to predict the clean input, i.e. to remove the corruption. Doing well at this task requires learning about the distribution of the clean data. They have been very popular for representation learning, and in the early days of deep learning, they were also used for layer-wise pre-training of deep neural networks [1].

It turns out that the neural network used in a diffusion model usually solves a very similar problem: given an input example corrupted by noise, it predicts some quantity associated with the data distribution. This can be the corresponding clean input (as in denoising autoencoders), the noise that was added, or something in between (more on that later). All of these are equivalent in some sense when the corruption process is linear, i.e., the noise is additive: we can turn a model that predicts the noise into a model that predicts the clean input, simply by subtracting its prediction from the noisy input. In neural network parlance, we would be adding a residual connection from the input to the output.
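As a minimal sketch of this equivalence (NumPy; the function name is hypothetical, and the noise is assumed to be added without rescaling the clean input), converting a noise prediction into a clean-input prediction is a single subtraction:

```python
import numpy as np

def clean_from_noise_prediction(noisy_input, predicted_noise):
    """Turn a noise prediction into a clean-input prediction.

    Assumes additive corruption: noisy_input = clean_input + noise.
    This is the "residual connection" from the input to the output.
    """
    return noisy_input - predicted_noise

rng = np.random.default_rng(0)
clean = rng.normal(size=4)
noise = 0.5 * rng.normal(size=4)
noisy = clean + noise

# A perfect noise predictor would give back the clean input exactly.
recovered = clean_from_noise_prediction(noisy, predicted_noise=noise)
```

With a scaled corruption process the conversion is still linear, but the scale factors would have to be taken into account.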

Schematic diagram of a denoising autoencoder (left) and a diffusion model (right).

There are a few key differences:

  • Denoising autoencoders often have some sort of information bottleneck somewhere in the middle, to learn a useful representation of the input whose capacity is constrained in some way. The denoising task itself is merely a means to an end, and not what we actually want to use the models for once we’ve trained them. The neural networks used for diffusion models don’t typically have such a bottleneck, as we are more interested in their predictions, rather than the internal representations they construct along the way to be able to make those predictions.

  • Denoising autoencoders can be trained with a variety of types of noise. For example, parts of the input could be masked out (masking noise), or we could add noise drawn from some arbitrary distribution (often Gaussian). For diffusion models, we usually stick with additive Gaussian noise because of its helpful mathematical properties, which simplify a lot of operations.

  • Another important difference is that denoising autoencoders are usually trained to deal only with noise of a particular strength. In a diffusion model, we have to be able to make predictions for inputs with a lot of noise, or with very little noise. The noise level is provided to the neural network as an extra input.

As mentioned, I’ve already discussed this relationship in detail in a previous blog post, so check that out if you are keen to explore this connection more thoroughly.

Diffusion models are deep latent variable models

Sohl-Dickstein et al. first suggested using a diffusion process to gradually destroy structure in data, and then constructing a generative model by learning to reverse this process, in a 2015 ICML paper [2]. Five years later, Ho et al. built on this to develop Denoising Diffusion Probabilistic Models or DDPMs [3], which formed the blueprint of modern diffusion models along with score-based models (see below).

DDPM graphical model.

In this formulation, represented by the graphical model above, $x_T$ (latent) represents Gaussian noise and $x_0$ (observed) represents the data distribution. These random variables are bridged by a finite number of intermediate latent variables $x_t$ (typically $T = 1000$), which form a Markov chain, i.e. $x_{t-1}$ only depends on $x_t$, and not directly on any preceding random variables in the chain.

The parameters of the Markov chain are fit using variational inference to reverse a diffusion process, which is itself a Markov chain (in the other direction, represented by $q(x_t \mid x_{t-1})$ in the diagram) that gradually adds Gaussian noise to the data. Concretely, as in Variational Autoencoders (VAEs) [4, 5], we can write down an Evidence Lower Bound (ELBO), a bound on the log likelihood, which we can maximise tractably. In fact, this section could just as well have been titled “diffusion models are deep VAEs”, but I’ve already used “diffusion models are autoencoders” for a different perspective, so I figured this might have been a bit confusing.

We know $q(x_t \mid x_{t-1})$ is Gaussian by construction, but $p(x_{t-1} \mid x_t)$, which we are trying to fit with our model, need not be! However, as long as each individual step is small enough (i.e. $T$ is large enough), it turns out that we can parameterise $p(x_{t-1} \mid x_t)$ as if it were Gaussian, and the approximation error will be small enough for this model to still produce good samples. This is kind of surprising when you think about it, as during sampling, any errors may accumulate over $T$ steps.
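To make the forward Markov chain a bit more tangible, here is a rough NumPy sketch. The transition $q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I)$ and the linear schedule for $\beta_t$ are taken from the DDPM paper rather than from this post, and the one-dimensional "dataset" is purely illustrative:

```python
import numpy as np

T = 1000
# Assumed linear variance schedule (the values used by Ho et al.; not stated in this post).
betas = np.linspace(1e-4, 0.02, T)

def forward_step(x_prev, beta, rng):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta) * x_{t-1}, beta * I)."""
    noise = rng.normal(size=x_prev.shape)
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * noise

rng = np.random.default_rng(0)
x = np.full(10_000, 3.0)  # toy "data": every example equals 3.0
for beta in betas:
    # Markov property: each step only looks at the previous state.
    x = forward_step(x, beta, rng)

final_mean, final_std = x.mean(), x.std()
```

After all $T$ steps the samples should be close to standard Gaussian noise, regardless of where they started. Because each individual $\beta_t$ is small, each step only perturbs the previous state slightly, which is what makes the Gaussian approximation of the reverse step reasonable.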

Full disclosure: out of all the different perspectives on diffusion in this blog post, this is probably the one I understand least well. Sort of ironic, given how popular it is, but variational inference has always been a little bit mysterious to me. I will stop here, and mostly defer to a few others who have described this perspective in detail (apart from the original DDPM paper, of course):

Diffusion models predict the score function

Most likelihood-based generative models parameterise the log-likelihood of an input $x$, $\log p(x \mid \theta)$, and then fit the model parameters $\theta$ to maximise it, either approximately (as in VAEs) or exactly (as in flow-based models or autoregressive models). Because log-likelihoods represent probability distributions, and probability distributions have to be normalised, this usually requires some constraints to ensure all possible values for the parameters $\theta$ yield valid distributions. For example, autoregressive models have causal masking to ensure this, and most flow-based models require invertible neural network architectures.

It turns out there is another way to fit distributions that neatly sidesteps this normalisation requirement, called score matching [6]. It’s based on the observation that the so-called score function, $s_\theta(x) := \nabla_x \log p(x \mid \theta)$, is invariant to the scaling of $p(x \mid \theta)$. This is easy to see:

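$$\nabla_x \log \left( \alpha \cdot p(x \mid \theta) \right) = \nabla_x \left( \log \alpha + \log p(x \mid \theta) \right) = \nabla_x \log \alpha + \nabla_x \log p(x \mid \theta) = 0 + \nabla_x \log p(x \mid \theta).$$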

Any arbitrary scale factor applied to the probability density simply disappears. Therefore, if we have a model that parameterises a score estimate $\hat{s}_\theta(x)$ directly, we can fit the distribution by minimising the score matching loss (instead of maximising the likelihood directly):

$$\mathcal{L}_{SM} := \left( \hat{s}_\theta(x) - \nabla_x \log p(x) \right)^2.$$

In this form, however, this loss function is not practical, because we do not have a good way to compute ground truth scores $\nabla_x \log p(x)$ for any data point $x$. There are a few tricks that can be applied to sidestep this requirement, and transform this into a loss function that’s easy to compute, including implicit score matching (ISM) [6], sliced score matching (SSM) [7] and denoising score matching (DSM) [8]. We’ll take a closer look at this last one:

$$\mathcal{L}_{DSM} := \left( \hat{s}_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log p(\tilde{x} \mid x) \right)^2.$$

Here, $\tilde{x}$ is obtained by adding Gaussian noise to $x$. This means $p(\tilde{x} \mid x)$ is a Gaussian distribution $\mathcal{N}(x, \sigma^2)$, and the ground truth conditional score function can be calculated in closed form:

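$$\nabla_{\tilde{x}} \log p(\tilde{x} \mid x) = \nabla_{\tilde{x}} \log \left( \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{\tilde{x} - x}{\sigma} \right)^2} \right) = \nabla_{\tilde{x}} \log \frac{1}{\sigma \sqrt{2\pi}} - \nabla_{\tilde{x}} \left( \frac{1}{2} \left( \frac{\tilde{x} - x}{\sigma} \right)^2 \right) = 0 - \frac{1}{2} \cdot 2 \left( \frac{\tilde{x} - x}{\sigma} \right) \cdot \frac{1}{\sigma} = \frac{x - \tilde{x}}{\sigma^2}.$$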

This form has a very intuitive interpretation: it is a scaled version of the Gaussian noise added to $x$ to obtain $\tilde{x}$. Therefore, making $\tilde{x}$ more likely by following the score (= gradient ascent on the log-likelihood) directly corresponds to removing (some of) the noise:

$$\tilde{x} + \eta \cdot \nabla_{\tilde{x}} \log p(\tilde{x} \mid x) = \tilde{x} + \frac{\eta}{\sigma^2} (x - \tilde{x}) = \frac{\eta}{\sigma^2} x + \left( 1 - \frac{\eta}{\sigma^2} \right) \tilde{x}.$$

If we choose the step size $\eta = \sigma^2$, we recover the clean data $x$ in a single step.
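The closed-form conditional score and the single-step recovery are easy to check numerically. The following is a small NumPy sketch (function names are hypothetical): it compares the formula $(x - \tilde{x}) / \sigma^2$ against a finite-difference gradient of the Gaussian log-density, and then takes one gradient ascent step with $\eta = \sigma^2$.

```python
import numpy as np

def log_gaussian(x_noisy, x_clean, sigma):
    """log p(x_noisy | x_clean) for N(x_clean, sigma^2), elementwise."""
    return -0.5 * ((x_noisy - x_clean) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def conditional_score(x_noisy, x_clean, sigma):
    """Closed-form gradient of log p(x_noisy | x_clean) with respect to x_noisy."""
    return (x_clean - x_noisy) / sigma**2

rng = np.random.default_rng(0)
sigma = 0.3
x = rng.normal(size=5)
x_noisy = x + sigma * rng.normal(size=5)

# Central finite differences as an independent check of the closed form.
eps = 1e-5
numerical_score = (
    log_gaussian(x_noisy + eps, x, sigma) - log_gaussian(x_noisy - eps, x, sigma)
) / (2 * eps)

# One step of gradient ascent with step size eta = sigma^2 removes all of the noise.
eta = sigma**2
x_denoised = x_noisy + eta * conditional_score(x_noisy, x, sigma)
```

Of course, this only works because the conditional score has access to the clean $x$. A trained model only sees $\tilde{x}$, so it can remove the noise only approximately.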

$\mathcal{L}_{SM}$ and $\mathcal{L}_{DSM}$ are different loss functions, but the neat thing is that they have the same minimum in expectation: $\mathbb{E}_x[\mathcal{L}_{SM}] = \mathbb{E}_{x, \tilde{x}}[\mathcal{L}_{DSM}] + C$, where $C$ is some constant. Pascal Vincent derived this equivalence back in 2010 (before score matching was cool!) and I strongly recommend reading his tech report about it [8] if you want to deepen your understanding.

One important question this approach raises is: how much noise should we add, i.e. what should $\sigma$ be? Picking a particular fixed value for this hyperparameter doesn’t actually work very well in practice. At low noise levels, it is very difficult to estimate the score accurately in low-density regions. At high noise levels, this is less of a problem, because the added noise spreads out the density in all directions – but then the distribution that we’re modelling is significantly distorted by the noise. What works well is to model the density at many different noise levels. Once we have such a model, we can anneal $\sigma$ during sampling, starting with lots of noise and gradually dialing it down. Song & Ermon describe these issues and their elegant solution in detail in their 2019 paper [9].
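To illustrate the annealing idea, here is a toy sketch in NumPy. Everything specific in it is an assumption made for illustration: the "dataset" consists of just two points, the exact score of the noise-perturbed distribution stands in for a trained network, and sampling uses Langevin dynamics updates $x \leftarrow x + \frac{\alpha}{2} s(x, \sigma) + \sqrt{\alpha} z$ with a step size $\alpha$ proportional to $\sigma^2$, in the spirit of Song & Ermon's procedure rather than a faithful reproduction of it.

```python
import numpy as np

data_points = np.array([-1.0, 1.0])  # toy dataset with two equally likely points

def score(x, sigma):
    """Exact score of the data distribution convolved with N(0, sigma^2)."""
    diff = data_points[None, :] - x[:, None]            # shape (n, 2)
    logits = -0.5 * (diff / sigma) ** 2
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)       # responsibility of each point
    return (weights * diff).sum(axis=1) / sigma**2

def annealed_langevin(n_samples, sigmas, steps_per_level=50, c=0.2, seed=0):
    rng = np.random.default_rng(seed)
    x = sigmas[0] * rng.normal(size=n_samples)          # start from broad noise
    for sigma in sigmas:                                # high noise -> low noise
        alpha = c * sigma**2                            # step size shrinks with the noise
        for _ in range(steps_per_level):
            z = rng.normal(size=n_samples)
            x = x + 0.5 * alpha * score(x, sigma) + np.sqrt(alpha) * z
    return x

sigmas = np.geomspace(5.0, 0.01, 20)
samples = annealed_langevin(2000, sigmas)
```

At the highest noise levels the two modes are blurred into a single blob, so samples can move around freely. As $\sigma$ decreases, they are gradually funnelled towards one of the two data points.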

This combination of denoising score matching at many different noise levels with gradual annealing of the noise during sampling yields a model that’s essentially equivalent to a DDPM, but the derivation is completely different – no ELBOs in sight! To learn more about this perspective, check out Yang Song’s excellent blog post on the topic.

Diffusion models solve reverse SDEs

In both of the previous perspectives (deep latent variable models and score matching), we consider a discrete and finite set of steps. These steps correspond to different levels of Gaussian noise, and we can write down a monotonic mapping $\sigma(t)$ which maps the step index $t$ to the standard deviation of the noise at that step.

If we let the number of steps go to infinity, it makes sense to replace the discrete index variable with a continuous value $t$ on an interval $[0, T]$, which can be interpreted as a time variable, i.e. $\sigma(t)$ now describes the evolution of the standard deviation of the noise over time. In continuous time, we can describe the diffusion process which gradually adds noise to data points $x$ with a stochastic differential equation (SDE):

$$\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w.$$

This equation relates an infinitesimal change in $x$ to an infinitesimal change in $t$, and $\mathrm{d}w$ represents infinitesimal Gaussian noise: the increment of a Wiener process. $f$ and $g$ are called the drift and diffusion coefficients respectively. Particular choices for $f$ and $g$ yield time-continuous versions of the Markov chains used to formulate DDPMs.
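For intuition, a forward SDE of this form can be simulated with the Euler-Maruyama method, which replaces $\mathrm{d}t$ with a small step and $\mathrm{d}w$ with Gaussian noise of variance equal to that step. The sketch below is an illustration under assumptions not taken from this post: it uses the drift $f(x, t) = -\frac{1}{2} \beta x$ and diffusion coefficient $g(t) = \sqrt{\beta}$ with a constant $\beta$, which is a simplified version of the variance-preserving SDE from Song et al., and the helper name is hypothetical.

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, t_end, n_steps, rng):
    """Simulate dx = drift(x, t) dt + diffusion(t) dw from t = 0 to t = t_end."""
    dt = t_end / n_steps
    x = np.array(x0, dtype=float)
    for i in range(n_steps):
        t = i * dt
        dw = np.sqrt(dt) * rng.normal(size=x.shape)  # Wiener increment ~ N(0, dt)
        x = x + drift(x, t) * dt + diffusion(t) * dw
    return x

beta = 10.0  # assumed constant noise rate

def drift(x, t):
    return -0.5 * beta * x      # f(x, t): shrinks the signal

def diffusion(t):
    return np.sqrt(beta)        # g(t): injects noise

rng = np.random.default_rng(0)
x0 = np.full(10_000, 3.0)       # toy "data": every example equals 3.0
x_end = euler_maruyama(x0, drift, diffusion, t_end=1.0, n_steps=1000, rng=rng)
end_mean, end_std = x_end.mean(), x_end.std()
```

With these choices, the samples at the end of the interval should be approximately standard Gaussian, much like $x_T$ in the discrete-time formulation.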

SDEs combine differential equations with random variables, which can seem a bit daunting at first. Luckily we don’t need too much of the advanced SDE machinery that exists to understand how this perspective can be useful for diffusion models. However, there is one very important result that we can make use of. Given an SDE that describes a diffusion process like the one above, we can write down another SDE that describes the process in the other direction, i.e. reverses time [10]:

$$\mathrm{d}x = \left( f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right) \mathrm{d}t + g(t)\,\mathrm{d}\bar{w}.$$

This equation also describes a diffusion process. $\mathrm{d}\bar{w}$ is the reversed Wiener process, and $\nabla_x \log p_t(x)$ is the time-dependent score function. The time dependence comes from the fact that the noise level changes over time.

Explaining why this is the case is beyond the scope of this blog post, but the original paper by Yang Song and colleagues that introduced the SDE-based formalism for diffusion models [11] is well worth a read.

Concretely, if we have a way to estimate the time-dependent score function, we can simulate the reverse diffusion process, and therefore draw samples from the data distribution starting from noise. So we can once again train a neural network to predict this quantity, and plug it into the reverse SDE to obtain a continuous-time diffusion model.

In practice, simulating this SDE requires discretising the time variable t again, so you might wonder what the point of all this is. What’s neat is that this discretisation is now something we can decide at sampling-time, and it does not have to be fixed before we train our score prediction model. In other words, we can trade off sample quality for computational cost in a very natural way without changing the model, by choosing the number of sampling steps.
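The sketch below illustrates this trade-off on a toy problem. It rests on several assumptions: the data distribution is a one-dimensional Gaussian, so the time-dependent score is known exactly and can stand in for a trained network; the forward SDE uses $f(x, t) = -\frac{1}{2} \beta x$ and $g(t) = \sqrt{\beta}$ with constant $\beta$; and the reverse SDE $\mathrm{d}x = (f - g^2 \nabla_x \log p_t(x))\,\mathrm{d}t + g\,\mathrm{d}\bar{w}$ is discretised with simple Euler-Maruyama steps backward in time.

```python
import numpy as np

beta, t_end = 10.0, 1.0   # assumed constant noise rate and time horizon
mu0, s0 = 2.0, 0.5        # toy data distribution N(mu0, s0^2)

def marginal(t):
    """Mean and variance of p_t under dx = -0.5 * beta * x dt + sqrt(beta) dw."""
    decay = np.exp(-beta * t)
    return mu0 * np.sqrt(decay), s0**2 * decay + (1.0 - decay)

def score(x, t):
    """Exact time-dependent score; stands in for a trained score network."""
    mean, var = marginal(t)
    return (mean - x) / var

def sample_reverse_sde(n_samples, n_steps, seed=0):
    rng = np.random.default_rng(seed)
    dt = t_end / n_steps
    x = rng.normal(size=n_samples)                     # p_T is very close to N(0, 1)
    for i in range(n_steps):
        t = t_end - i * dt
        drift = -0.5 * beta * x - beta * score(x, t)   # f - g^2 * score
        noise = np.sqrt(beta * dt) * rng.normal(size=n_samples)
        x = x - drift * dt + noise                     # step from t to t - dt
    return x

# Same "model", two different discretisations chosen at sampling time.
fine = sample_reverse_sde(20_000, n_steps=500)
coarse = sample_reverse_sde(20_000, n_steps=10)
fine_error = abs(fine.std() - s0)
coarse_error = abs(coarse.std() - s0)
```

With 500 steps the samples should closely match the data distribution, whereas with 10 steps the standard deviation is noticeably off: the score function is identical in both cases, only the compute budget differs.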

Diffusion models are flow-based models

Remember flow-based models [12, 13]? They aren’t very popular for generative modelling these days, which I think is mainly because they tend to require more parameters than other types of models to achieve the same level of performance. This is due to their limited expressivity: neural networks used in flow-based models are required to be invertible, and the log-determinant of the Jacobian must be easy to compute, which imposes significant constraints on the kinds of computations that are possible.

At least, this is the case for discrete normalising flows. Continuous normalising flows (CNFs) [14, 15] also exist, and usually take the form of an ordinary differential equation (ODE) parameterised by a neural network, which describes a deterministic path between samples from the data distribution and corresponding samples from a simple base distribution (e.g. standard Gaussian). CNFs are not affected by the aforementioned neural network architecture constraints, but in their original form, they require backpropagation through an ODE solver to train. Although some tricks exist to do this more efficiently, this probably also presents a barrier to widespread adoption.

Let’s revisit the SDE formulation of diffusion models, which describes a stochastic process mapping samples from a simple base distribution to samples from the data distribution. An interesting question to ask is: what does the distribution of the intermediate samples $p_t(x)$ look like, and how does it evolve over time? This is governed by the so-called Fokker-Planck equation. If you want to see what this looks like in practice, check out appendix D.1 of Song et al. (2021) [11].

Here’s where it gets wild: there exists an ODE that describes a deterministic process whose time-dependent distributions are exactly the same as those of the stochastic process described by the SDE. This is called the probability flow ODE. What’s more, it has a simple closed form:

$$\mathrm{d}x = \left( f(x, t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x) \right) \mathrm{d}t.$$

This equation describes both the forward and backward process (just flip the sign to go in the other direction), and note that the time-dependent score function $\nabla_x \log p_t(x)$ once again features. To prove this, you can write down the Fokker-Planck equations for both the SDE and the probability flow ODE, and do some algebra to show that they are the same, and hence must have the same solution $p_t(x)$.

Note that this ODE does not describe the same process as the SDE: that would be impossible, because a deterministic differential equation cannot describe a stochastic process. Instead, it describes a different process with the unique property that the distributions $p_t(x)$ are the same for both processes. Check out the probability flow ODE section in Yang Song’s blog post for a great diagram comparing both processes.

The implications of this are profound: there is now a bijective mapping between particular samples from the simple base distribution, and samples from the data distribution. We have a sampling process where all the randomness is contained in the initial base distribution sample – once that’s been sampled, going from there to a data sample is completely deterministic. It also means that we can map data points to their corresponding latent representations by simulating the ODE forward, manipulating them, and then mapping them back to the data space by simulating the ODE backward.
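Here is a toy illustration of this encode-and-decode round trip, again under assumptions that are mine rather than the post's: a one-dimensional Gaussian data distribution whose score is known exactly (standing in for a trained network), a forward SDE with $f(x, t) = -\frac{1}{2} \beta x$ and $g(t) = \sqrt{\beta}$, and plain Euler integration of the probability flow ODE $\mathrm{d}x = (f - \frac{1}{2} g^2 \nabla_x \log p_t(x))\,\mathrm{d}t$.

```python
import numpy as np

beta, t_end = 10.0, 1.0   # assumed constant noise rate and time horizon
mu0, s0 = 2.0, 0.5        # toy data distribution N(mu0, s0^2)

def score(x, t):
    """Exact time-dependent score; stands in for a trained score network."""
    decay = np.exp(-beta * t)
    mean = mu0 * np.sqrt(decay)
    var = s0**2 * decay + (1.0 - decay)
    return (mean - x) / var

def ode_velocity(x, t):
    """Right-hand side of the probability flow ODE: f - 0.5 * g^2 * score."""
    return -0.5 * beta * x - 0.5 * beta * score(x, t)

def integrate(x, t_start, t_stop, n_steps=1000):
    """Euler integration; a negative step runs the ODE backward in time."""
    dt = (t_stop - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        x = x + ode_velocity(x, t) * dt
        t += dt
    return x

data = np.array([1.0, 2.0, 3.0])
latents = integrate(data, 0.0, t_end)             # data -> base distribution
reconstruction = integrate(latents, t_end, 0.0)   # base distribution -> data
```

No random numbers are drawn anywhere: the same latent always maps to the same data point. The reconstruction is only approximate because of the Euler discretisation, and the error should shrink as the number of steps grows.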

The model described by the probability flow ODE is a continuous normalising flow, but it’s one that we managed to train without having to backpropagate through an ODE, rendering the approach much more scalable.

The fact that all this is possible, without even changing anything about how the model is trained, still feels like magic to me. We can plug our score predictor into the reverse SDE from the previous section, or the ODE from this one, and get out two different generative models that model the same distribution in different ways. How cool is that?

As a bonus, the probability flow ODE also enables likelihood computation for diffusion models (see appendix D.2 of Song et al. (2021) [11]). This also requires solving the ODE, so it’s roughly as expensive as sampling.

For all of the reasons above, the probability flow ODE paradigm has proven quite popular recently. Among other examples, it is used by Karras et al. [16] as a basis for their work investigating various diffusion modelling design choices, and my colleagues and I recently used it for our work on diffusion language models [17]. It has also been generalised and extended beyond diffusion processes, to enable learning a mapping between any pair of distributions, e.g. in the form of Flow Matching [18], Rectified Flows [19] and Stochastic Interpolants [20].

Side note: another way to obtain a deterministic sampling process for diffusion models is given by DDIM [21], which is based on the deep latent variable model perspective.

Diffusion models are recurrent neural networks (RNNs)

Sampling from a diffusion model involves making repeated predictions with a neural network and using those predictions to update a canvas, which starts out filled with random noise. If we consider the full computational graph of this process, it starts to look a lot like a recurrent neural network (RNN). In RNNs, there is a hidden state which repeatedly gets updated by passing it through a recurrent cell, which consists of one or more nonlinear parameterised operations (e.g. the gating mechanisms of LSTMs [22]). Here, the hidden state is the canvas, so it lives in the input space, and the cell is formed by the denoiser neural network that we’ve trained for our diffusion model.

Schematic diagram of the unrolled diffusion sampling loop.

RNNs are usually trained with backpropagation through time (BPTT), with gradients propagated through the recurrence. The number of recurrent steps to backpropagate through is often limited to some maximum number to reduce the computational cost, which is referred to as truncated BPTT. Diffusion models are also trained by backpropagation, but only through one step at a time. In some sense, diffusion models present a way to train deep recurrent neural networks without backpropagating through the recurrence at all, yielding a much more scalable training procedure.

RNNs are usually deterministic, so this analogy makes the most sense for the deterministic process based on the probability flow ODE described in the previous section – though injecting noise into the hidden state of RNNs as a means of regularisation is not unheard of, so I think the analogy also works for the stochastic process.

The total depth of this computation graph in terms of the number of nonlinear layers is given by the number of layers in our neural network, multiplied by the number of sampling steps. We can look at the unrolled recurrence as a very deep neural network in its own right, with potentially thousands of layers. This is a lot of depth, but it stands to reason that a challenging task like generative modelling of real-world data requires such deep computation graphs.
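The unrolled recurrence can be sketched as follows. This is an illustration, not the procedure used by any specific model: the "cell" is a stand-in for a trained denoiser (for a toy dataset consisting of a single value, the ideal clean-input prediction is simply that value), the update is a deterministic Euler step in the noise level $\sigma$ in the style of Karras et al., and the layer count is a made-up number.

```python
import numpy as np

TOY_DATA_VALUE = 0.7  # hypothetical dataset: every pixel of every image equals 0.7

def denoiser_cell(canvas, sigma):
    """Stand-in for the trained network: predicts the clean input from the canvas."""
    return np.full_like(canvas, TOY_DATA_VALUE)

def sample(canvas, sigmas):
    """Unrolled sampling loop: the canvas plays the role of the RNN hidden state."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        direction = (canvas - denoiser_cell(canvas, sigma)) / sigma
        canvas = canvas + (sigma_next - sigma) * direction   # one recurrent update
    return canvas

rng = np.random.default_rng(0)
sigmas = np.append(np.geomspace(10.0, 0.01, 50), 0.0)   # 50 steps, ending at zero noise
initial_canvas = sigmas[0] * rng.normal(size=(8, 8))    # canvas starts as pure noise
result = sample(initial_canvas, sigmas)

layers_per_network = 30                                  # hypothetical network depth
total_depth = layers_per_network * (len(sigmas) - 1)     # layers times sampling steps
```

With these made-up numbers, the unrolled graph is 1500 nonlinear layers deep, even though only a single 30-layer network is ever trained.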

We can also consider what happens if we do not use the same neural network at each diffusion sampling step, but potentially different ones for different ranges of noise levels. These networks can be trained separately and independently, and can even have different architectures. This means we are effectively “untying the weights” in our very deep network, turning it from an RNN into a plain old deep neural network, but we are still able to avoid having to backpropagate through all of it in one go. Stable Diffusion XL [23] uses this approach to great effect for its “Refiner” model, so I think it might start to catch on.

When I started my PhD in 2010, training neural networks with more than two hidden layers was a chore: backprop didn’t work well out of the box, so we used unsupervised layer-wise pre-training [1, 24] to find a good initialisation which would make backpropagation possible. Nowadays, even hundreds of nonlinear layers do not form an obstacle anymore. Therefore it’s not inconceivable that several years from now, training networks with tens of thousands of layers by backprop will be within reach. At that point, the “divide and conquer” approach that diffusion models offer might lose its luster, and perhaps we’ll all go back to training deep variational autoencoders! (Note that the same “divide and conquer” perspective equally applies to autoregressive models, so they would become obsolete as well, in that case.)

One question this perspective raises is whether diffusion models might actually work better if we backpropagated through the sampling procedure for two or more steps. This approach isn’t popular, which probably indicates that it isn’t cost-effective in practice. There is one important exception (sort of): models which use self-conditioning [25], such as Recurrent Interface Networks (RINs) [26], pass some form of state between the diffusion sampling steps, in addition to the updated canvas. To enable the model to learn to make use of this state, an approximation of it is made available during training by running an additional forward pass. There is no additional backward pass though, so this doesn’t really count as two steps of BPTT – more like 1.5 steps.

Diffusion models are autoregressive models

For diffusion models of natural images, the sampling process tends to produce large-scale structure first, and then iteratively adds more and more fine-grained details. Indeed, there seems to be almost a direct correspondence between noise levels and feature scales, which I discussed in more detail in Section 5 of a previous blog post.

But why is this the case? To understand this, it helps to think in terms of spatial frequencies. Large-scale features in images correspond to low spatial frequencies, whereas fine-grained details correspond to high frequencies. We can decompose images into their spatial frequency components using the 2D Fourier transform (or some variant of it). This is often the first step in image compression algorithms, because the human visual system is known to be much less sensitive to high frequencies, and this can be exploited by compressing them more aggressively than low frequencies.

Visualisation of the spatial frequency components of the 8x8 discrete cosine transform, used in e.g. JPEG.

Natural images, along with many other natural signals, exhibit an interesting phenomenon in the frequency domain: the magnitude of different frequency components tends to drop off proportionally to the inverse of the frequency [27]: $S(f) \propto 1/f$ (or the inverse of the square of the frequency, if you’re looking at power spectra instead of magnitude spectra).

Gaussian noise, on the other hand, has a flat spectrum: in expectation, all frequencies have the same magnitude. Since the Fourier transform is a linear operation, adding Gaussian noise to a natural image yields a new image whose spectrum is the sum of the spectrum of the original image, and the flat spectrum of the noise. In the log-domain, this superposition of the two spectra looks like a hinge, which shows how the addition of noise obscures any structure present in higher spatial frequencies (see figure below). The larger the standard deviation of this noise, the more spatial frequencies will be affected.

Magnitude spectra of natural images, Gaussian noise, and noisy images.
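
The hinge shape can be reproduced numerically. The following is a minimal NumPy sketch, not taken from the post: it uses a synthetic random image with a roughly 1/f magnitude spectrum as a stand-in for a natural image, and the function names, image size and noise level are all assumptions chosen for illustration.

```python
import numpy as np


def radial_magnitude_spectrum(img):
    """Average |FFT| over rings of (rounded) integer spatial frequency.

    Assumes a square image with an even side length.
    """
    n = img.shape[0]
    mag = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    fy, fx = np.indices(img.shape) - n // 2  # frequency coordinates, DC at centre
    radius = np.hypot(fx, fy).round().astype(int)
    sums = np.bincount(radius.ravel(), weights=mag.ravel())
    counts = np.bincount(radius.ravel())
    return (sums / counts)[1 : n // 2]  # frequencies 1 .. n/2 - 1


def synthetic_one_over_f_image(n=128, seed=0):
    """Random image whose magnitude spectrum falls off roughly as 1/f (an assumption
    standing in for a real natural image)."""
    rng = np.random.default_rng(seed)
    fy = np.fft.fftfreq(n)[:, None] * n
    fx = np.fft.fftfreq(n)[None, :] * n
    f = np.hypot(fx, fy)
    f[0, 0] = 1.0  # avoid dividing by zero at the DC component
    spectrum = np.fft.fft2(rng.standard_normal((n, n))) / f
    img = np.real(np.fft.ifft2(spectrum))
    return img / img.std()  # unit variance, like standardised image data


def hinge_demo(sigma=1.0, n=128, seed=0):
    """Radial spectra of a clean image, of Gaussian noise, and of their sum."""
    rng = np.random.default_rng(seed + 1)
    img = synthetic_one_over_f_image(n, seed)
    noise = sigma * rng.standard_normal((n, n))
    return (
        radial_magnitude_spectrum(img),
        radial_magnitude_spectrum(noise),
        radial_magnitude_spectrum(img + noise),
    )


clean, noise, noisy = hinge_demo()
# At low frequencies the noisy spectrum tracks the image;
# at high frequencies it tracks the flat noise floor.
print(noisy[0] / clean[0], noisy[-1] / noise[-1])
```

Plotting the three curves on log-log axes should show the decaying image spectrum, the flat noise spectrum, and the hinge where the two cross. Increasing `sigma` raises the noise floor and moves the crossover towards lower frequencies.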

Since diffusion models are constructed by progressively adding more noise to input examples, we can say that this process increasingly drowns out lower and lower frequency content, until all structure is erased (for natural images, at least). When sampling from the model, we go in the opposite direction and effectively add structure at higher and higher spatial frequencies. This basically looks like autoregression, but in frequency space! Rissanen et al. (2023) discuss this observation in Section 2.2 of their paper [28] on generative modelling with inverse heat dissipation (as an alternative to Gaussian diffusion), though they do not make the connection to autoregressive models. I added that bit, so this section could have a provocative title.

An important caveat is that this interpretation relies on the frequency characteristics of natural signals, so for applications of diffusion models in other domains (e.g. language modelling, see Section 2 of my blog post on diffusion language models), the analogy may not make sense.

Diffusion models estimate expectations

Consider the transition density $p(\mathbf{x}_t \mid \mathbf{x}_0)$, which describes the distribution of the noisy data example $\mathbf{x}_t$ at time $t$, conditioned on the original clean input $\mathbf{x}_0$ it was derived from (by adding noise). Based on samples from this distribution, the neural network used in a diffusion model is tasked with predicting the expectation $\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]$ (or some linear time-dependent function of it). This may seem a tad obvious, but I wanted to highlight some of the implications.

First, it provides another motivation for why the mean squared error (MSE) is the right loss function to use for training diffusion models. During training, the expectation $\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]$ is not known, so instead we supervise the model using $\mathbf{x}_0$ itself. Because the minimiser of the MSE loss is precisely the expectation, we end up recovering (an approximation of) $\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]$, even though we don’t know this quantity a priori. This is a bit different from typical supervised learning problems, where the ideal outcome would be for the model to predict exactly the targets used to supervise it (barring any label errors). Here, we purposely do not want that. More generally, the notion of being able to estimate conditional expectations, even though we only provide supervision through samples, is very powerful.
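
A toy example makes this concrete. The sketch below is an illustration under stated assumptions, not the training procedure of any particular model: the data are $\pm 1$ with equal probability, the noise is added without rescaling ($x_t = x_0 + \sigma\varepsilon$), and the "model" is just a piecewise-constant function of $x_t$, whose MSE-optimal value in each bin is the average of the $x_0$ samples that land there. For this setup the posterior mean is known in closed form, $\mathbb{E}[x_0 \mid x_t] = \tanh(x_t / \sigma^2)$, so the two can be compared directly.

```python
import numpy as np


def binned_mse_minimiser(sigma=1.0, n=400_000, lo=-2.0, hi=2.0, n_bins=40, seed=0):
    """Fit a piecewise-constant predictor of x0 from xt by minimising the MSE.

    For a constant prediction within a bin, the MSE-optimal value is the mean
    of the targets in that bin, so no gradient descent is needed here.
    """
    rng = np.random.default_rng(seed)
    x0 = rng.choice([-1.0, 1.0], size=n)         # toy "clean data"
    xt = x0 + sigma * rng.standard_normal(n)     # noisy samples from p(xt | x0)

    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.digitize(xt, edges) - 1             # bin index, -1 or n_bins if out of range
    inside = (idx >= 0) & (idx < n_bins)
    sums = np.bincount(idx[inside], weights=x0[inside], minlength=n_bins)
    counts = np.bincount(idx[inside], minlength=n_bins)

    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, sums / counts


sigma = 1.0
centres, estimate = binned_mse_minimiser(sigma=sigma)
exact = np.tanh(centres / sigma**2)              # closed-form E[x0 | xt] for this toy setup
print(np.max(np.abs(estimate - exact)))          # small, up to sampling and binning error
```

The predictor was only ever shown targets equal to $-1$ or $+1$, yet what it learns is the smooth curve in between. Predicting the targets exactly is not even possible here, because the same $x_t$ can arise from either value of $x_0$.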

Second, it explains why distillation [29] of diffusion models [30, 31, 32] is such a compelling proposition: in this setting, we are able to supervise a diffusion model directly with an approximation of the target expectation $\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]$ that we want it to predict, because that is what the teacher model already provides. As a result, the variance of the training loss will be much lower than if we had trained the model from scratch, and convergence will be much faster. Of course, this is only useful if you already have a trained model on hand to use as a teacher.
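
The variance reduction can be seen in a small numerical experiment. This is a sketch under simplifying assumptions: data $x_0 = \pm 1$ with equal probability, noising $x_t = x_0 + \sigma\varepsilon$, a "teacher" that is the exact posterior mean $\tanh(x_t / \sigma^2)$ for that setup, and a hypothetical one-parameter "student" $a \cdot \tanh(x_t / \sigma^2)$. It compares the per-example gradient of the squared error with respect to $a$ under the two kinds of supervision.

```python
import numpy as np


def gradient_stats(a=0.9, sigma=1.0, n=200_000, seed=0):
    """Mean and variance of per-example MSE gradients w.r.t. the student parameter a."""
    rng = np.random.default_rng(seed)
    x0 = rng.choice([-1.0, 1.0], size=n)
    xt = x0 + sigma * rng.standard_normal(n)

    teacher = np.tanh(xt / sigma**2)   # exact E[x0 | xt]; stands in for a perfect teacher
    student = a * teacher              # hypothetical, slightly mis-scaled student

    # d/da of (student - target)^2 for each example, with two choices of target.
    grad_scratch = 2.0 * (student - x0) * teacher       # supervised with samples x0
    grad_distill = 2.0 * (student - teacher) * teacher  # supervised with the teacher

    return (
        (grad_scratch.mean(), grad_scratch.var()),
        (grad_distill.mean(), grad_distill.var()),
    )


(mean_s, var_s), (mean_d, var_d) = gradient_stats()
print(mean_s, mean_d)  # the two average gradients agree up to sampling error
print(var_s, var_d)    # the per-example variance is far lower with teacher targets
```

Both forms of supervision push the student in the same direction on average, but the sample-based targets carry the extra noise of $x_0$ around its conditional mean. A real teacher is only an approximation of the expectation, so in practice the benefit comes with whatever bias the teacher has.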

Discrete and continuous diffusion models

So far, we have covered several perspectives that consider a finite set of discrete noise levels, and several perspectives that use a notion of continuous time, combined with a mapping function $\sigma(t)$ to map time steps to the corresponding standard deviation of the noise. These are typically referred to as discrete-time and continuous-time respectively. One thing that’s quite neat is that this is mostly a matter of interpretation: models trained within a discrete-time perspective can usually be repurposed quite easily to work in the continuous-time setting [16], and vice versa.
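
One way to see why the two views are interchangeable is to turn a discrete noise schedule into a continuous function $\sigma(t)$. The sketch below is only one possible recipe, with assumed specifics: a linear $\beta$ schedule with 1000 steps between $10^{-4}$ and $0.02$ (values commonly used with DDPMs), the noise level expressed relative to the signal as $\sigma_i = \sqrt{(1 - \bar\alpha_i) / \bar\alpha_i}$, and log-linear interpolation between the discrete levels. The function names are made up for the example.

```python
import numpy as np


def ddpm_sigmas(num_steps=1000, beta_min=1e-4, beta_max=0.02):
    """Noise-to-signal standard deviations for a linear beta schedule (assumed values).

    With x_i = sqrt(abar_i) * x0 + sqrt(1 - abar_i) * eps, dividing by sqrt(abar_i)
    gives x0 + sigma_i * eps with sigma_i = sqrt((1 - abar_i) / abar_i).
    """
    betas = np.linspace(beta_min, beta_max, num_steps)
    alphas_cumprod = np.cumprod(1.0 - betas)
    return np.sqrt((1.0 - alphas_cumprod) / alphas_cumprod)


def sigma_of_t(t, sigmas):
    """Continuous sigma(t) for t in [0, 1], interpolating log(sigma) between the levels."""
    steps = np.linspace(0.0, 1.0, len(sigmas))
    return np.exp(np.interp(t, steps, np.log(sigmas)))


sigmas = ddpm_sigmas()
# Any real-valued time now has a noise level, including ones between training steps.
print(sigma_of_t(np.array([0.0, 0.25, 0.5, 1.0]), sigmas))
```

Going the other way, a continuous-time model can be used with a discrete sampler by evaluating $\sigma(t)$ at a finite set of time steps.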

Another way in which diffusion models can be discrete or continuous, is with respect to the input space. In the literature, I’ve found that it is sometimes unclear whether “continuous” or “discrete” are meant to be with respect to time, or with respect to the input. This is especially important because some perspectives only really make sense for continuous input, as they rely on gradients with respect to the input (i.e. all perspectives based on the score function).

All four combinations of discreteness/continuity exist:

  • discrete time, continuous input: the original deep latent variable model perspective (DDPMs), as well as the score-based perspective;
  • continuous time, continuous input: SDE- and ODE-based perspectives;
  • discrete time, discrete input: D3PM [33], MaskGIT [34], Mask-predict [35], ARDM [36], Multinomial diffusion [37] and SUNDAE [38] are all methods that use iterative refinement on discrete inputs – whether all of these should be considered diffusion models isn’t entirely clear (it depends on who you ask);
  • continuous time, discrete input: Continuous Time Markov Chains (CTMCs) [39], Score-based Continuous-time Discrete Diffusion Models [40] and Blackout Diffusion [41] all pair discrete input with continuous time – this setting is also often handled by embedding discrete data in Euclidean space, and then performing input-continuous diffusion in that space, as in e.g. Analog Bits [25], Self-conditioned Embedding Diffusion [42] and CDCD [17].

Alternative formulations

Recently, a few papers have proposed new derivations of this class of models from first principles with the benefit of hindsight, avoiding concepts such as differential equations, ELBOs or score matching altogether. These works provide yet another perspective on diffusion models, which may be more accessible because it requires less background knowledge.

Inversion by Direct Iteration (InDI) [43] is a formulation rooted in image restoration, intended to harness iterative refinement to improve perceptual quality. No assumptions are made about the nature of the image degradations, and models are trained on paired low-quality and high-quality examples. Iterative α-(de)blending [44] uses linear interpolation between samples from two different distributions as a starting point to obtain a deterministic mapping between the distributions. Both of these methods are also closely related to Flow Matching [18], Rectified Flow [19] and Stochastic Interpolants [20] discussed earlier.
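
To give a flavour of the interpolation-based view, here is a sketch in the spirit of iterative (de)blending, on a toy problem where no neural network is needed. The assumptions are mine: the two distributions are one-dimensional Gaussians, $x_0 \sim \mathcal{N}(0, 1)$ and $x_1 \sim \mathcal{N}(m, s^2)$, sampled independently, so that for the blend $x_\alpha = (1 - \alpha) x_0 + \alpha x_1$ the quantity a network would be trained to predict, $\mathbb{E}[x_1 - x_0 \mid x_\alpha]$, is available in closed form (it is linear in $x_\alpha$ because everything is jointly Gaussian).

```python
import numpy as np


def deblend_gaussians(m=2.0, s=0.5, n=20_000, num_steps=200, seed=0):
    """Deterministically map samples of N(0, 1) to (approximately) N(m, s^2).

    Each step moves the blend parameter alpha forward a little, following the
    expected direction x1 - x0 given the current blended sample.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)  # alpha = 0: samples from the first distribution
    alphas = np.linspace(0.0, 1.0, num_steps + 1)
    for a, a_next in zip(alphas[:-1], alphas[1:]):
        var = (1.0 - a) ** 2 + (a * s) ** 2      # Var[x_alpha]
        cov = a * s**2 - (1.0 - a)               # Cov[x1 - x0, x_alpha]
        direction = m + cov / var * (x - a * m)  # E[x1 - x0 | x_alpha = x]
        x = x + (a_next - a) * direction
    return x


samples = deblend_gaussians()
print(samples.mean(), samples.std())  # close to m = 2.0 and s = 0.5
```

The only randomness is in the starting samples; everything after that is a deterministic map. With real data the closed-form `direction` would be replaced by a learned regressor, trained with the MSE against sampled differences $x_1 - x_0$.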

Consistency

A few different notions of “consistency” in diffusion models have arisen in the literature recently:

  • Consistency models (CM) [45] are trained to map points on any trajectory of the probability flow ODE to the trajectory’s origin (i.e. the clean data point), enabling sampling in a single step. This is done indirectly by taking pairs of points on a particular trajectory and ensuring that the model output is the same for both (hence “consistency”). There is a distillation variant which starts from an existing diffusion model, but it is also possible to train a consistency model from scratch.

  • Consistent diffusion models (CDM) [46] are trained using a regularisation term that explicitly encourages consistency, which they define to mean that the prediction of the denoiser should correspond to the conditional expectation $\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]$ (see earlier).

  • FP-Diffusion [47] takes the Fokker-Planck equation describing the evolution across time of $p_t(\mathbf{x})$, and introduces an explicit regularisation term to ensure that it holds.

Each of these properties would trivially hold for an ideal diffusion model (i.e. fully converged, in the limit of infinite capacity). However, real diffusion models are approximate, and so they tend not to hold in practice, which is why it makes sense to add mechanisms to explicitly enforce them.

The main reason for including this section here is that I wanted to highlight a recent paper by Lai et al. (2023) [48] that shows that these three different notions of consistency are essentially different perspectives on the same thing. I thought this was a very elegant result, and it definitely suits the theme of this blog post!

Defying conventions

Apart from all these different perspectives on a conceptual level, the diffusion literature is also particularly fraught in terms of reinventing notation and defying conventions, in my experience. Sometimes, even two different descriptions of the same conceptual perspective look nothing alike. This doesn’t help accessibility and increases the barrier to entry. (I’m not blaming anyone for this, to be clear – in fact, I suspect I might be contributing to the problem with this blog post. Sorry about that.)

There are also a few other seemingly innocuous details and parameterisation choices that can have profound implications. Here are three things to watch out for:

  • By and large, people use variance-preserving (VP) diffusion processes, where in addition to adding noise at each step, the current canvas is rescaled to preserve the overall variance. However, the variance-exploding (VE) formulation, where no rescaling happens and the variance of the added noise increases towards infinity, has also gained some followers. Most notably it is used by Karras et al. (2022) [16]. Some results that hold for VP diffusion might not hold for VE diffusion or vice versa (without making the requisite changes), and this might not be mentioned explicitly. If you’re reading a diffusion paper, make sure you are aware of which formulation is used, and whether any assumptions are being made about it.

  • Sometimes, the neural network used in a diffusion model is parameterised to predict the (standardised) noise added to the input, or the score function; sometimes it predicts the clean input instead, or even a time-dependent combination of the two (as in e.g. v-prediction [30]). All of these targets are equivalent in the sense that they are time-dependent linear functions of each other and the noisy input $\mathbf{x}_t$. But it is important to understand how this interacts with the relative weighting of loss contributions for different time steps during training, which can significantly affect model performance. Out of the box, predicting the standardised noise seems to be a great choice for image data. When modelling certain other quantities (e.g. latents in latent diffusion), people have found predicting the clean input to work better. This is primarily because it implies a different weighting of noise levels, and hence feature scales.

  • It is generally understood that the standard deviation of the noise added by the corruption process increases with time, i.e. entropy increases over time, as it tends to do in our universe. Therefore, $\mathbf{x}_0$ corresponds to clean data, and $\mathbf{x}_T$ (for some large enough $T$) corresponds to pure noise. Some works (e.g. Flow Matching [18]) invert this convention, which can be very confusing if you don’t notice it straight away.
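
The equivalence between prediction targets mentioned in the second point above can be written down in a few lines. This is a sketch assuming a variance-preserving parameterisation $\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol{\varepsilon}$ with $\alpha_t^2 + \sigma_t^2 = 1$, and the v-prediction target $\mathbf{v} = \alpha_t \boldsymbol{\varepsilon} - \sigma_t \mathbf{x}_0$. The helper name `to_x0_eps` is made up for the example.

```python
import numpy as np


def to_x0_eps(pred, kind, xt, alpha, sigma):
    """Convert a network output of the given kind into (x0, eps) estimates.

    Assumes xt = alpha * x0 + sigma * eps with alpha**2 + sigma**2 == 1.
    """
    if kind == "x0":
        x0 = pred
        eps = (xt - alpha * x0) / sigma
    elif kind == "eps":
        eps = pred
        x0 = (xt - sigma * eps) / alpha
    elif kind == "v":
        x0 = alpha * xt - sigma * pred
        eps = sigma * xt + alpha * pred
    else:
        raise ValueError(f"unknown prediction kind: {kind!r}")
    return x0, eps


# Round trip on random data: all three targets carry the same information.
rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(5), rng.standard_normal(5)
alpha, sigma = np.cos(0.3), np.sin(0.3)
xt = alpha * x0 + sigma * eps
v = alpha * eps - sigma * x0

for kind, pred in [("x0", x0), ("eps", eps), ("v", v)]:
    x0_hat, eps_hat = to_x0_eps(pred, kind, xt, alpha, sigma)
    print(kind, np.allclose(x0_hat, x0), np.allclose(eps_hat, eps))
```

The conversions are exact, but the losses are not interchangeable: a unit of squared error on `eps` corresponds to a factor $(\sigma_t / \alpha_t)^2$ on `x0`, which is where the implicit reweighting of noise levels comes from.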

Finally, it’s worth noting that the definition of “diffusion” in the context of generative modelling has grown to be quite broad, and is now almost equivalent to “iterative refinement”. A lot of “diffusion models” for discrete input are not actually based on diffusion processes, but they are of course closely related, so the scope of this label has gradually been extended to include them. It’s not clear where to draw the line: if any model which implements iterative refinement through inversion of a gradual corruption process is a diffusion model, then all autoregressive models are also diffusion models. To me, that seems confusing enough so as to render the term useless.

Closing thoughts

Learning about diffusion models right now must be a pretty confusing experience, but the exploration of all these different perspectives has resulted in a diverse toolbox of methods which can all be combined together, because ultimately, the underlying model is always the same. I’ve also found that learning about how the different perspectives relate to each other has considerably deepened my understanding. Some things that are a mystery from one perspective are clear as day in another.

If you are just getting started with diffusion, hopefully this post will help guide you towards the right things to learn next. If you are a seasoned diffuser, I hope I’ve broadened your perspectives and I hope you’ve learnt something new nevertheless. Thanks for reading!

What's your favourite perspective on diffusion? Are there any useful perspectives that I've missed? Please share your thoughts in the comments below, or reach out on Twitter (@sedielem) or Threads (@sanderdieleman) if you prefer. Email is okay too.

I will also be at ICML 2023 in Honolulu and would be happy to chat in person!

If you would like to cite this post in an academic context, you can use this BibTeX snippet:

  author = {Dieleman, Sander},
  title = {Perspectives on diffusion},
  url = {},
  year = {2023}

Acknowledgements

Thanks to my colleagues at Google DeepMind for various discussions, which continue to shape my thoughts on this topic! Thanks to Ayan Das, Ira Korshunova, Peyman Milanfar, and Çağlar Ünlü for suggestions and corrections. 

References

  1. Bengio, Lamblin, Popovici, Larochelle, “Greedy Layer-Wise Training of Deep Networks”, Neural Information Processing Systems, 2006.