
Generative Modeling by Estimating Gradients of the Data Distribution

This blog post focuses on a promising new direction for generative modeling. We can learn score functions (gradients of log probability density functions) on a large number of noise-perturbed data distributions, then generate samples with Langevin-type sampling. The resulting generative models, often called score-based generative models, have several important advantages over existing model families: GAN-level sample quality without adversarial training, flexible model architectures, exact log-likelihood computation, and inverse problem solving without re-training models. In this blog post, we will show you in more detail the intuition, basic concepts, and potential applications of score-based generative models.

Introduction

Existing generative modeling techniques can largely be grouped into two categories based on how they represent probability distributions.

  1. likelihood-based models, which directly learn the distribution’s probability density (or mass) function via (approximate) maximum likelihood. Typical likelihood-based models include autoregressive models, normalizing flow models, energy-based models (EBMs), and variational auto-encoders (VAEs).
  2. implicit generative models, where the probability distribution is implicitly represented by a model of its sampling process. The most prominent example is generative adversarial networks (GANs), where new samples from the data distribution are synthesized by transforming a random Gaussian vector with a neural network.
Bayesian networks, Markov random fields (MRF), autoregressive models, and normalizing flow models are all examples of likelihood-based models. All these models represent the probability density or mass function of a distribution.
GAN is an example of implicit models. It implicitly represents a distribution over all objects that can be produced by the generator network.

Likelihood-based models and implicit generative models, however, both have significant limitations. Likelihood-based models either require strong restrictions on the model architecture to ensure a tractable normalizing constant for likelihood computation, or must rely on surrogate objectives to approximate maximum likelihood training. Implicit generative models, on the other hand, often require adversarial training, which is notoriously unstable and can lead to mode collapse.

In this blog post, I will introduce another way to represent probability distributions that may circumvent several of these limitations. The key idea is to model the gradient of the log probability density function, a quantity often known as the (Stein) score function. Such score-based models are not required to have a tractable normalizing constant, and can be directly learned by score matching.

Score function (the vector field) and density function (contours) of a mixture of two Gaussians.

Score-based models have achieved state-of-the-art performance on many downstream tasks and applications. These tasks include, among others, image generation (yes, better than GANs!), audio synthesis, shape generation, and music generation. Moreover, score-based models have connections to normalizing flow models, therefore allowing exact likelihood computation and representation learning. Additionally, modeling and estimating scores facilitates inverse problem solving, with applications such as image inpainting, image colorization, compressive sensing, and medical image reconstruction (e.g., CT, MRI).

1024 x 1024 samples generated from score-based models

This post aims to show you the motivation and intuition of score-based generative modeling, as well as its basic concepts, properties and applications.

The score function, score-based models, and score matching

Suppose we are given a dataset $\{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$, where each point is drawn independently from an underlying data distribution $p(\mathbf{x})$. Given this dataset, the goal of generative modeling is to fit a model to the data distribution such that we can synthesize new data points at will by sampling from the distribution.

In order to build such a generative model, we first need a way to represent a probability distribution. One such way, as in likelihood-based models, is to directly model the probability density function (p.d.f.) or probability mass function (p.m.f.). (Hereafter we only consider probability density functions; probability mass functions are similar.) Let $f_\theta(\mathbf{x}) \in \mathbb{R}$ be a real-valued function parameterized by a learnable parameter $\theta$. We can define a p.d.f. via

$$p_\theta(\mathbf{x}) = \frac{e^{-f_\theta(\mathbf{x})}}{Z_\theta}, \tag{1}$$

where $Z_\theta > 0$ is a normalizing constant dependent on $\theta$, such that $\int p_\theta(\mathbf{x}) \mathrm{d}\mathbf{x} = 1$. Here the function $f_\theta(\mathbf{x})$ is often called an unnormalized probabilistic model, or energy-based model.

We can train $p_\theta(\mathbf{x})$ by maximizing the log-likelihood of the data

$$\max_\theta \sum_{i=1}^N \log p_\theta(\mathbf{x}_i). \tag{2}$$

However, equation (2) requires $p_\theta(\mathbf{x})$ to be a normalized probability density function. This is undesirable because in order to compute $p_\theta(\mathbf{x})$, we must evaluate the normalizing constant $Z_\theta$, a typically intractable quantity for any general $f_\theta(\mathbf{x})$. Thus to make maximum likelihood training feasible, likelihood-based models must either restrict their model architectures (e.g., causal convolutions in autoregressive models, invertible networks in normalizing flow models) to make $Z_\theta$ tractable, or approximate the normalizing constant (e.g., variational inference in VAEs, or MCMC sampling used in contrastive divergence), which may be computationally expensive.

By modeling the score function instead of the density function, we can sidestep the difficulty of intractable normalizing constants. The score function of a distribution $p(\mathbf{x})$ is defined as $\nabla_\mathbf{x} \log p(\mathbf{x})$, and a model for the score function is called a score-based model, which we denote as $\mathbf{s}_\theta(\mathbf{x})$. The score-based model is learned such that $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_\mathbf{x} \log p(\mathbf{x})$, and can be parameterized without worrying about the normalizing constant. For example, we can easily parameterize a score-based model with the energy-based model defined in equation (1), via

$$\mathbf{s}_\theta(\mathbf{x}) = \nabla_\mathbf{x} \log p_\theta(\mathbf{x}) = -\nabla_\mathbf{x} f_\theta(\mathbf{x}) - \underbrace{\nabla_\mathbf{x} \log Z_\theta}_{=0} = -\nabla_\mathbf{x} f_\theta(\mathbf{x}). \tag{3}$$

Note that the score-based model $\mathbf{s}_\theta(\mathbf{x})$ is independent of the normalizing constant $Z_\theta$! This significantly expands the family of models that we can tractably use, since we don't need any special architectures to make the normalizing constant tractable.
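As a concrete illustration of equation (3), here is a minimal PyTorch sketch that obtains a score by differentiating an unnormalized energy with autograd; the network `Energy` and its sizes are hypothetical stand-ins, not from the original post, but the point carries over to any architecture: the normalizing constant never appears.

```python
import torch
import torch.nn as nn

class Energy(nn.Module):
    """A hypothetical unnormalized model f_theta(x); any architecture works."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.SiLU(), nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # one scalar energy per sample

def score_from_energy(f, x):
    # s_theta(x) = -grad_x f_theta(x), as in equation (3); Z_theta is never needed
    x = x.requires_grad_(True)
    energy = f(x).sum()
    return -torch.autograd.grad(energy, x, create_graph=True)[0]
```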

Parameterizing probability density functions. No matter how you change the model family and parameters, it has to be normalized (area under the curve must integrate to one).
Parameterizing score functions. No need to worry about normalization.

Similar to likelihood-based models, we can train score-based models by minimizing the Fisher divergence between the model and the data distributions, defined as

$$\mathbb{E}_{p(\mathbf{x})}\big[\| \nabla_\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}) \|_2^2\big]. \tag{5}$$

(Fisher divergence is typically defined between two distributions $p$ and $q$ as $\mathbb{E}_{p(\mathbf{x})}\big[\| \nabla_\mathbf{x} \log p(\mathbf{x}) - \nabla_\mathbf{x} \log q(\mathbf{x}) \|_2^2\big]$, labeled (4) in the original post; here we slightly abuse the term as the name of the closely related expression above for score-based models.)

Intuitively, the Fisher divergence compares the squared $\ell_2$ distance between the ground-truth data score and the score-based model. Directly computing this divergence, however, is infeasible because it requires access to the unknown data score $\nabla_\mathbf{x} \log p(\mathbf{x})$. Fortunately, there exists a family of methods called score matching (commonly used score matching methods include denoising score matching and sliced score matching; here is an introduction to score matching and sliced score matching) that minimize the Fisher divergence without knowledge of the ground-truth data score. Score matching objectives can directly be estimated on a dataset and optimized with stochastic gradient descent, analogous to the log-likelihood objective for training likelihood-based models (with known normalizing constants). We can train the score-based model by minimizing a score matching objective, without requiring adversarial optimization.

Additionally, using the score matching objective gives us a considerable amount of modeling flexibility. The Fisher divergence itself does not require $\mathbf{s}_\theta(\mathbf{x})$ to be an actual score function of any normalized distribution; it simply compares the $\ell_2$ distance between the ground-truth data score and the score-based model, with no additional assumptions on the form of $\mathbf{s}_\theta(\mathbf{x})$. In fact, the only requirement on the score-based model is that it should be a vector-valued function with the same input and output dimensionality, which is easy to satisfy in practice.

As a brief summary, we can represent a distribution by modeling its score function, which can be estimated by training a score-based model of free-form architectures with score matching.
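To make this concrete, below is a minimal, hypothetical PyTorch sketch of training a free-form score network with sliced score matching, one of the score matching variants mentioned above. The network `ScoreNet`, its sizes, and the training loop are illustrative assumptions, not the implementation from the referenced papers.

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """A free-form vector-valued network: input and output share the same dimensionality."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.SiLU(),
                                 nn.Linear(128, 128), nn.SiLU(),
                                 nn.Linear(128, dim))

    def forward(self, x):
        return self.net(x)

def sliced_score_matching_loss(score_net, x):
    # E_v E_p(x)[ v^T grad_x (v^T s(x)) + 0.5 (v^T s(x))^2 ], estimated with one random projection v
    x = x.requires_grad_(True)
    v = torch.randn_like(x)
    s = score_net(x)
    sv = torch.sum(s * v)
    grad_sv = torch.autograd.grad(sv, x, create_graph=True)[0]
    return (torch.sum(grad_sv * v, dim=-1) + 0.5 * torch.sum(s * v, dim=-1) ** 2).mean()

# hypothetical usage on 2-D toy data
score_net = ScoreNet(dim=2)
optimizer = torch.optim.Adam(score_net.parameters(), lr=1e-3)
for _ in range(1000):
    x = torch.randn(256, 2)  # stand-in for samples from the data distribution p(x)
    loss = sliced_score_matching_loss(score_net, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```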

Langevin dynamics

Once we have trained a score-based model $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_\mathbf{x} \log p(\mathbf{x})$, we can use an iterative procedure called Langevin dynamics to draw samples from it.

Langevin dynamics provides an MCMC procedure to sample from a distribution $p(\mathbf{x})$ using only its score function $\nabla_\mathbf{x} \log p(\mathbf{x})$. Specifically, it initializes the chain from an arbitrary prior distribution $\mathbf{x}_0 \sim \pi(\mathbf{x})$, and then iterates the following

$$\mathbf{x}_{i+1} \gets \mathbf{x}_i + \epsilon \nabla_\mathbf{x} \log p(\mathbf{x}) + \sqrt{2\epsilon}\, \mathbf{z}_i, \qquad i = 0, 1, \cdots, K, \tag{6}$$

where $\mathbf{z}_i \sim \mathcal{N}(0, I)$. When $\epsilon \to 0$ and $K \to \infty$, $\mathbf{x}_K$ obtained from the procedure in (6) converges to a sample from $p(\mathbf{x})$ under some regularity conditions. In practice, the error is negligible when $\epsilon$ is sufficiently small and $K$ is sufficiently large.

Using Langevin dynamics to sample from a mixture of two Gaussians.

Note that Langevin dynamics accesses $p(\mathbf{x})$ only through $\nabla_\mathbf{x} \log p(\mathbf{x})$. Since $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_\mathbf{x} \log p(\mathbf{x})$, we can produce samples from our score-based model $\mathbf{s}_\theta(\mathbf{x})$ by plugging it into equation (6).
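A minimal sketch of this sampler, assuming a trained `score_net` such as the hypothetical one above, could look like the following; the step size and number of iterations are illustrative rather than tuned values.

```python
import torch

@torch.no_grad()
def langevin_dynamics(score_net, x0, eps=1e-3, n_steps=1000):
    # iterate x_{i+1} <- x_i + eps * s_theta(x_i) + sqrt(2 * eps) * z_i,
    # i.e., equation (6) with the score model in place of the true score
    x = x0.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + eps * score_net(x) + (2 * eps) ** 0.5 * z
    return x

# hypothetical usage: start the chain from a simple prior pi(x)
samples = langevin_dynamics(score_net, torch.randn(256, 2))
```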

Naive score-based generative modeling and its pitfalls

So far, we've discussed how to train a score-based model with score matching, and then produce samples via Langevin dynamics. However, this naive approach has had limited success in practice. We'll talk about some pitfalls of score matching that received little attention in prior works.

Score-based generative modeling with score matching + Langevin dynamics.

The key challenge is the fact that the estimated score functions are inaccurate in low density regions, where few data points are available for computing the score matching objective. This is expected as score matching minimizes the Fisher divergence

$$\mathbb{E}_{p(\mathbf{x})}\big[\| \nabla_\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}) \|_2^2\big] = \int p(\mathbf{x}) \| \nabla_\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}) \|_2^2 \mathrm{d}\mathbf{x}.$$

Since the $\ell_2$ differences between the true data score function and score-based model are weighted by $p(\mathbf{x})$, they are largely ignored in low density regions where $p(\mathbf{x})$ is small. This behavior can lead to subpar results, as illustrated by the figure below:

Estimated scores are only accurate in high density regions.

When sampling with Langevin dynamics, our initial sample is highly likely in low density regions when data reside in a high dimensional space. Therefore, having an inaccurate score-based model will derail Langevin dynamics from the very beginning of the procedure, preventing it from generating high quality samples that are representative of the data.

Score-based generative modeling with multiple noise perturbations

How can we bypass the difficulty of accurate score estimation in regions of low data density? Our solution is to perturb data points with noise and train score-based models on the noisy data points instead. When the noise magnitude is sufficiently large, it can populate low data density regions to improve the accuracy of estimated scores. For example, here is what happens when we perturb a mixture of two Gaussians with additional Gaussian noise.

Estimated scores are accurate everywhere for the noise-perturbed data distribution due to reduced low data density regions.

Yet another question remains: how do we choose an appropriate noise scale for the perturbation process? Larger noise can obviously cover more low density regions for better score estimation, but it over-corrupts the data and alters it significantly from the original distribution. Smaller noise, on the other hand, causes less corruption of the original data distribution, but does not cover the low density regions as well as we would like.

To achieve the best of both worlds, we use multiple scales of noise perturbations simultaneously. Suppose we always perturb the data with isotropic Gaussian noise, and let there be a total of $L$ increasing standard deviations $\sigma_1 < \sigma_2 < \cdots < \sigma_L$. We first perturb the data distribution $p(\mathbf{x})$ with each of the Gaussian noise $\mathcal{N}(0, \sigma_i^2 I), i = 1, 2, \cdots, L$ to obtain a noise-perturbed distribution

$$p_{\sigma_i}(\mathbf{x}) = \int p(\mathbf{y})\, \mathcal{N}(\mathbf{x}; \mathbf{y}, \sigma_i^2 I)\, \mathrm{d}\mathbf{y}.$$

Note that we can easily draw samples from $p_{\sigma_i}(\mathbf{x})$ by sampling $\mathbf{x} \sim p(\mathbf{x})$ and computing $\mathbf{x} + \sigma_i \mathbf{z}$, with $\mathbf{z} \sim \mathcal{N}(0, I)$.
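In code, this perturbation is a one-liner; the sketch below is a hypothetical helper, not part of the referenced codebases.

```python
import torch

def perturb(x, sigma_i):
    # sample from p_{sigma_i} by adding isotropic Gaussian noise of std sigma_i to data x ~ p(x)
    return x + sigma_i * torch.randn_like(x)
```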

Next, we estimate the score function of each noise-perturbed distribution, $\nabla_\mathbf{x} \log p_{\sigma_i}(\mathbf{x})$, by training a Noise Conditional Score-Based Model $\mathbf{s}_\theta(\mathbf{x}, i)$ (also called a Noise Conditional Score Network, or NCSN, when parameterized with a neural network) with score matching, such that $\mathbf{s}_\theta(\mathbf{x}, i) \approx \nabla_\mathbf{x} \log p_{\sigma_i}(\mathbf{x})$ for all $i = 1, 2, \cdots, L$.

We apply multiple scales of Gaussian noise to perturb the data distribution (first row), and jointly estimate the score functions for all of them (second row).
Perturbing an image with multiple scales of Gaussian noise.

The training objective for $\mathbf{s}_\theta(\mathbf{x}, i)$ is a weighted sum of Fisher divergences for all noise scales. In particular, we use the objective below:

$$\sum_{i=1}^L \lambda(i)\, \mathbb{E}_{p_{\sigma_i}(\mathbf{x})}\big[\| \nabla_\mathbf{x} \log p_{\sigma_i}(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}, i) \|_2^2\big], \tag{7}$$

where $\lambda(i) \in \mathbb{R}_{>0}$ is a positive weighting function, often chosen to be $\lambda(i) = \sigma_i^2$. The objective (7) can be optimized with score matching, exactly as in optimizing the naive (unconditional) score-based model $\mathbf{s}_\theta(\mathbf{x})$.
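Below is a hedged sketch of how objective (7) could be estimated with denoising score matching, where the score of the Gaussian perturbation kernel serves as a tractable regression target. The function and its conventions (a `score_net(x, i)` conditioned on an integer noise index) are illustrative assumptions, not the original implementation.

```python
import torch

def ncsn_loss(score_net, x, sigmas):
    # sigmas: 1-D tensor of increasing noise levels sigma_1 < ... < sigma_L
    i = torch.randint(0, len(sigmas), (x.shape[0],), device=x.device)  # pick a noise scale per example
    sigma = sigmas[i].view(-1, *([1] * (x.dim() - 1)))                 # broadcast to x's shape
    z = torch.randn_like(x)
    x_tilde = x + sigma * z                                            # a sample from p_{sigma_i}
    target = -z / sigma                                                # score of N(x_tilde; x, sigma_i^2 I) w.r.t. x_tilde
    s = score_net(x_tilde, i)
    per_example = ((s - target) ** 2).flatten(1).sum(dim=-1)
    return (sigmas[i] ** 2 * per_example).mean()                       # lambda(i) = sigma_i^2 weighting
```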

After training our noise-conditional score-based model $\mathbf{s}_\theta(\mathbf{x}, i)$, we can produce samples from it by running Langevin dynamics for $i = L, L-1, \cdots, 1$ in sequence. This method is called annealed Langevin dynamics (defined by Algorithm 1 in the original NCSN paper, and improved by follow-up work), since the noise scale $\sigma_i$ decreases (anneals) gradually over time.
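A sketch of annealed Langevin dynamics under the same hypothetical conventions as above (noise-index conditioning, step sizes proportional to $\sigma_i^2$ in the spirit of NCSN's Algorithm 1) might look like this; the hyperparameters are illustrative.

```python
import torch

@torch.no_grad()
def annealed_langevin(score_net, shape, sigmas, eps=2e-5, n_steps_each=100):
    # sigmas: increasing noise levels sigma_1 < ... < sigma_L, matching training
    x = torch.rand(shape)                                  # arbitrary unstructured initialization
    for i in reversed(range(len(sigmas))):                 # anneal from the largest noise scale down to the smallest
        alpha = eps * (sigmas[i] / sigmas[-1]) ** 2        # step size proportional to sigma_i^2
        idx = torch.full((shape[0],), i, dtype=torch.long)
        for _ in range(n_steps_each):
            z = torch.randn_like(x)
            x = x + alpha * score_net(x, idx) + (2 * alpha) ** 0.5 * z
    return x
```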

Annealed Langevin dynamics combine a sequence of Langevin chains with gradually decreasing noise scales.
Annealed Langevin dynamics for the Noise Conditional Score Network (NCSN) model, trained on CelebA (left) and CIFAR-10 (right). We can start from unstructured noise, modify images according to the scores, and generate nice samples. The method achieved state-of-the-art Inception score on CIFAR-10 at its time.

Here are some practical recommendations for tuning score-based generative models with multiple noise scales:

With such best practices, we are able to generate high quality image samples with comparable quality to GANs on various datasets, such as below:

Samples from the NCSNv2 model. From left to right: FFHQ 256x256, LSUN bedroom 128x128, LSUN tower 128x128, LSUN church_outdoor 96x96, and CelebA 64x64.

Score-based generative modeling with stochastic differential equations (SDEs)

As we already discussed, adding multiple noise scales is critical to the success of score-based generative models. By generalizing the number of noise scales to infinity, we obtain not only higher quality samples, but also, among others, exact log-likelihood computation and controllable generation for inverse problem solving.

In addition to this introduction, we have tutorials written in Google Colab to provide a step-by-step guide for training a toy model on MNIST. We also have more advanced code repositories that provide full-fledged implementations for large scale applications.

Link | Description
Open In Colab | Tutorial of score-based generative modeling with SDEs in JAX + FLAX
Open In Colab | Load our pretrained checkpoints and play with sampling, likelihood computation, and controllable synthesis (JAX + FLAX)
Open In Colab | Tutorial of score-based generative modeling with SDEs in PyTorch
Open In Colab | Load our pretrained checkpoints and play with sampling, likelihood computation, and controllable synthesis (PyTorch)
Code in JAX | Score SDE codebase in JAX + FLAX
Code in PyTorch | Score SDE codebase in PyTorch

Perturbing data with an SDE

When the number of noise scales approaches infinity, we essentially perturb the data distribution with continuously growing levels of noise. In this case, the noise perturbation procedure is a continuous-time stochastic process, as demonstrated below

Perturbing data to noise with a continuous-time stochastic process.

How can we represent a stochastic process in a concise way? Many stochastic processes (diffusion processes in particular) are solutions of stochastic differential equations (SDEs). In general, an SDE possesses the following form:

$$\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\, \mathrm{d}t + g(t)\, \mathrm{d}\mathbf{w}, \tag{8}$$

where $\mathbf{f}(\cdot, t): \mathbb{R}^d \to \mathbb{R}^d$ is a vector-valued function called the drift coefficient, $g(t) \in \mathbb{R}$ is a real-valued function called the diffusion coefficient, $\mathbf{w}$ denotes a standard Brownian motion, and $\mathrm{d}\mathbf{w}$ can be viewed as infinitesimal white noise. The solution of a stochastic differential equation is a continuous collection of random variables $\{ \mathbf{x}(t) \}_{t \in [0, T]}$. These random variables trace stochastic trajectories as the time index $t$ grows from the start time $0$ to the end time $T$. Let $p_t(\mathbf{x})$ denote the (marginal) probability density function of $\mathbf{x}(t)$. Here $t \in [0, T]$ is analogous to $i = 1, 2, \cdots, L$ when we had a finite number of noise scales, and $p_t(\mathbf{x})$ is analogous to $p_{\sigma_i}(\mathbf{x})$. Clearly, $p_0(\mathbf{x}) = p(\mathbf{x})$ is the data distribution since no perturbation is applied to data at $t = 0$. After perturbing $p(\mathbf{x})$ with the stochastic process for a sufficiently long time $T$, $p_T(\mathbf{x})$ becomes close to a tractable noise distribution $\pi(\mathbf{x})$, called a prior distribution. We note that $p_T(\mathbf{x})$ is analogous to $p_{\sigma_L}(\mathbf{x})$ in the case of finite noise scales, which corresponds to applying the largest noise perturbation $\sigma_L$ to the data.

The SDE in (8) is hand designed, similarly to how we hand-designed $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ in the case of finite noise scales. There are numerous ways to add noise perturbations, and the choice of SDEs is not unique. For example, the following SDE

$$\mathrm{d}\mathbf{x} = e^{t}\, \mathrm{d}\mathbf{w} \tag{9}$$

perturbs data with a Gaussian noise of mean zero and exponentially growing variance, which is analogous to perturbing data with $\mathcal{N}(0, \sigma_1^2 I), \mathcal{N}(0, \sigma_2^2 I), \cdots, \mathcal{N}(0, \sigma_L^2 I)$ when $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ is a geometric progression. Therefore, the SDE should be viewed as part of the model, much like $\{\sigma_1, \sigma_2, \cdots, \sigma_L\}$. In our work, we provide three SDEs that generally work well for images: the Variance Exploding SDE (VE SDE), the Variance Preserving SDE (VP SDE), and the sub-VP SDE.
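As an illustration, a forward perturbation like the SDE in (9) can be simulated numerically with an Euler-Maruyama discretization. The following is a minimal, hypothetical sketch; the step count and the specific drift/diffusion functions are illustrative choices, not the schedules used in the referenced papers.

```python
import math
import torch

def euler_maruyama_forward(x0, drift, diffusion, T=1.0, n_steps=1000):
    # simulate dx = f(x, t) dt + g(t) dw from t = 0 to t = T, as in equation (8)
    dt = T / n_steps
    x, t = x0.clone(), 0.0
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + drift(x, t) * dt + diffusion(t) * math.sqrt(dt) * z
        t += dt
    return x

# hypothetical usage for the SDE in equation (9): zero drift, diffusion g(t) = e^t
x_T = euler_maruyama_forward(
    torch.randn(256, 2),
    drift=lambda x, t: torch.zeros_like(x),
    diffusion=lambda t: math.exp(t),
)
```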

Reversing the SDE for sample generation

Recall that with a finite number of noise scales, we can generate samples by reversing the perturbation process with annealed Langevin dynamics, i.e., sequentially sampling from each noise-perturbed distribution using Langevin dynamics. For infinite noise scales, we can analogously reverse the perturbation process for sample generation by using the reverse SDE.

Generate data from noise by reversing the perturbation procedure.

Importantly, any SDE has a corresponding reverse SDE, whose closed form is given by