
Generative Modeling by Estimating Gradients of the Data Distribution

This blog post focuses on a promising new direction for generative modeling. We can learn score functions (gradients of log probability density functions) on a large number of noise-perturbed data distributions, then generate samples with Langevin-type sampling. The resulting generative models, often called score-based generative models, have several important advantages over existing model families: GAN-level sample quality without adversarial training, flexible model architectures, exact log-likelihood computation, and inverse problem solving without re-training models. In this blog post, we will show you in more detail the intuition, basic concepts, and potential applications of score-based generative models.

Introduction

Existing generative modeling techniques can largely be grouped into two categories based on how they represent probability distributions.

  1. likelihood-based models, which directly learn the distribution’s probability density (or mass) function via (approximate) maximum likelihood. Typical likelihood-based models include autoregressive models, normalizing flow models, energy-based models (EBMs), and variational auto-encoders (VAEs).
  2. implicit generative models, where the probability distribution is implicitly represented by a model of its sampling process. The most prominent example is generative adversarial networks (GANs), where new samples from the data distribution are synthesized by transforming a random Gaussian vector with a neural network.
Bayesian networks, Markov random fields (MRF), autoregressive models, and normalizing flow models are all examples of likelihood-based models. All these models represent the probability density or mass function of a distribution.
GAN is an example of implicit models. It implicitly represents a distribution over all objects that can be produced by the generator network.

Likelihood-based models and implicit generative models, however, both have significant limitations. Likelihood-based models either require strong restrictions on the model architecture to ensure a tractable normalizing constant for likelihood computation, or must rely on surrogate objectives to approximate maximum likelihood training. Implicit generative models, on the other hand, often require adversarial training, which is notoriously unstable and can lead to mode collapse.

In this blog post, I will introduce another way to represent probability distributions that may circumvent several of these limitations. The key idea is to model the gradient of the log probability density function, a quantity often known as the (Stein) score function. Such score-based models are not required to have a tractable normalizing constant, and can be directly learned by score matching.

Score function (the vector field) and density function (contours) of a mixture of two Gaussians.

Score-based models have achieved state-of-the-art performance on many downstream tasks and applications. These tasks include, among others, image generation (yes, better than GANs!), audio synthesis, shape generation, and music generation. Moreover, score-based models have connections to normalizing flow models, therefore allowing exact likelihood computation and representation learning. Additionally, modeling and estimating scores facilitates inverse problem solving, with applications such as image inpainting, image colorization, compressive sensing, and medical image reconstruction (e.g., CT, MRI).

1024 x 1024 samples generated from score-based models

This post aims to show you the motivation and intuition of score-based generative modeling, as well as its basic concepts, properties and applications.

The score function, score-based models, and score matching

Suppose we are given a dataset $\{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$, where each point is drawn independently from an underlying data distribution $p(\mathbf{x})$. Given this dataset, the goal of generative modeling is to fit a model to the data distribution such that we can synthesize new data points at will by sampling from the distribution.

In order to build such a generative model, we first need a way to represent a probability distribution. One such way, as in likelihood-based models, is to directly model the probability density function (p.d.f.) or probability mass function (p.m.f.). (Hereafter we only consider probability density functions; probability mass functions are similar.) Let $f_\theta(\mathbf{x}) \in \mathbb{R}$ be a real-valued function parameterized by a learnable parameter $\theta$. We can define a p.d.f. via

$$p_\theta(\mathbf{x}) = \frac{e^{-f_\theta(\mathbf{x})}}{Z_\theta}, \tag{1}$$

where $Z_\theta > 0$ is a normalizing constant dependent on $\theta$, such that $\int p_\theta(\mathbf{x}) \mathrm{d}\mathbf{x} = 1$. Here the function $f_\theta(\mathbf{x})$ is often called an unnormalized probabilistic model, or energy-based model.

We can train $p_\theta(\mathbf{x})$ by maximizing the log-likelihood of the data

$$\max_\theta \sum_{i=1}^N \log p_\theta(\mathbf{x}_i). \tag{2}$$

However, equation (2) requires $p_\theta(\mathbf{x})$ to be a normalized probability density function. This is undesirable because in order to compute $p_\theta(\mathbf{x})$, we must evaluate the normalizing constant $Z_\theta$, a typically intractable quantity for any general $f_\theta(\mathbf{x})$. Thus to make maximum likelihood training feasible, likelihood-based models must either restrict their model architectures (e.g., causal convolutions in autoregressive models, invertible networks in normalizing flow models) to make $Z_\theta$ tractable, or approximate the normalizing constant (e.g., variational inference in VAEs, or MCMC sampling used in contrastive divergence), which may be computationally expensive.

By modeling the score function instead of the density function, we can sidestep the difficulty of intractable normalizing constants. The score function of a distribution $p(\mathbf{x})$ is defined as $\nabla_\mathbf{x} \log p(\mathbf{x})$, and a model for the score function is called a score-based model, which we denote as $\mathbf{s}_\theta(\mathbf{x})$. The score-based model is learned such that $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_\mathbf{x} \log p(\mathbf{x})$, and can be parameterized without worrying about the normalizing constant. For example, we can easily parameterize a score-based model with the energy-based model defined in equation (1), via

$$\mathbf{s}_\theta(\mathbf{x}) = \nabla_\mathbf{x} \log p_\theta(\mathbf{x}) = -\nabla_\mathbf{x} f_\theta(\mathbf{x}) - \underbrace{\nabla_\mathbf{x} \log Z_\theta}_{=0} = -\nabla_\mathbf{x} f_\theta(\mathbf{x}). \tag{3}$$

Note that the score-based model $\mathbf{s}_\theta(\mathbf{x})$ is independent of the normalizing constant $Z_\theta$! This significantly expands the family of models that we can tractably use, since we don't need any special architectures to make the normalizing constant tractable.
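As a concrete illustration of equation (3), here is a minimal PyTorch sketch that obtains a score by differentiating an unnormalized energy with autograd; the network `Energy` and its sizes are hypothetical stand-ins, not from the original post, but the point carries over to any architecture: the normalizing constant never appears.

```python
import torch
import torch.nn as nn

class Energy(nn.Module):
    """A hypothetical unnormalized model f_theta(x); any architecture works."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.SiLU(), nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # one scalar energy per sample

def score_from_energy(f, x):
    # s_theta(x) = -grad_x f_theta(x), as in equation (3); Z_theta is never needed
    x = x.requires_grad_(True)
    energy = f(x).sum()
    return -torch.autograd.grad(energy, x, create_graph=True)[0]
```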

Parameterizing probability density functions. No matter how you change the model family and parameters, it has to be normalized (area under the curve must integrate to one).
Parameterizing score functions. No need to worry about normalization.

Similar to likelihood-based models, we can train score-based models by minimizing the Fisher divergence between the model and the data distributions, defined as

$$\mathbb{E}_{p(\mathbf{x})}\big[\| \nabla_\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}) \|_2^2\big]. \tag{5}$$

(Fisher divergence is typically defined between two distributions $p$ and $q$ as $\mathbb{E}_{p(\mathbf{x})}\big[\| \nabla_\mathbf{x} \log p(\mathbf{x}) - \nabla_\mathbf{x} \log q(\mathbf{x}) \|_2^2\big]$, labeled (4) in the original post; here we slightly abuse the term as the name of the closely related expression above for score-based models.)

Intuitively, the Fisher divergence compares the squared $\ell_2$ distance between the ground-truth data score and the score-based model. Directly computing this divergence, however, is infeasible because it requires access to the unknown data score $\nabla_\mathbf{x} \log p(\mathbf{x})$. Fortunately, there exists a family of methods called score matching (commonly used score matching methods include denoising score matching and sliced score matching; here is an introduction to score matching and sliced score matching) that minimize the Fisher divergence without knowledge of the ground-truth data score. Score matching objectives can directly be estimated on a dataset and optimized with stochastic gradient descent, analogous to the log-likelihood objective for training likelihood-based models (with known normalizing constants). We can train the score-based model by minimizing a score matching objective, without requiring adversarial optimization.

Additionally, using the score matching objective gives us a considerable amount of modeling flexibility. The Fisher divergence itself does not require $\mathbf{s}_\theta(\mathbf{x})$ to be an actual score function of any normalized distribution; it simply compares the $\ell_2$ distance between the ground-truth data score and the score-based model, with no additional assumptions on the form of $\mathbf{s}_\theta(\mathbf{x})$. In fact, the only requirement on the score-based model is that it should be a vector-valued function with the same input and output dimensionality, which is easy to satisfy in practice.

As a brief summary, we can represent a distribution by modeling its score function, which can be estimated by training a score-based model of free-form architectures with score matching.
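To make this concrete, below is a minimal, hypothetical PyTorch sketch of training a free-form score network with sliced score matching, one of the score matching variants mentioned above. The network `ScoreNet`, its sizes, and the training loop are illustrative assumptions, not the implementation from the referenced papers.

```python
import torch
import torch.nn as nn

class ScoreNet(nn.Module):
    """A free-form vector-valued network: input and output share the same dimensionality."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.SiLU(),
                                 nn.Linear(128, 128), nn.SiLU(),
                                 nn.Linear(128, dim))

    def forward(self, x):
        return self.net(x)

def sliced_score_matching_loss(score_net, x):
    # E_v E_p(x)[ v^T grad_x (v^T s(x)) + 0.5 (v^T s(x))^2 ], estimated with one random projection v
    x = x.requires_grad_(True)
    v = torch.randn_like(x)
    s = score_net(x)
    sv = torch.sum(s * v)
    grad_sv = torch.autograd.grad(sv, x, create_graph=True)[0]
    return (torch.sum(grad_sv * v, dim=-1) + 0.5 * torch.sum(s * v, dim=-1) ** 2).mean()

# hypothetical usage on 2-D toy data
score_net = ScoreNet(dim=2)
optimizer = torch.optim.Adam(score_net.parameters(), lr=1e-3)
for _ in range(1000):
    x = torch.randn(256, 2)  # stand-in for samples from the data distribution p(x)
    loss = sliced_score_matching_loss(score_net, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```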

Langevin dynamics

Once we have trained a score-based model $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_\mathbf{x} \log p(\mathbf{x})$, we can use an iterative procedure called Langevin dynamics to draw samples from it.

Langevin dynamics provides an MCMC procedure to sample from a distribution $p(\mathbf{x})$ using only its score function $\nabla_\mathbf{x} \log p(\mathbf{x})$. Specifically, it initializes the chain from an arbitrary prior distribution $\mathbf{x}_0 \sim \pi(\mathbf{x})$, and then iterates the following

$$\mathbf{x}_{i+1} \gets \mathbf{x}_i + \epsilon \nabla_\mathbf{x} \log p(\mathbf{x}) + \sqrt{2\epsilon}\, \mathbf{z}_i, \qquad i = 0, 1, \cdots, K, \tag{6}$$

where $\mathbf{z}_i \sim \mathcal{N}(0, I)$. When $\epsilon \to 0$ and $K \to \infty$, $\mathbf{x}_K$ obtained from the procedure in (6) converges to a sample from $p(\mathbf{x})$ under some regularity conditions. In practice, the error is negligible when $\epsilon$ is sufficiently small and $K$ is sufficiently large.

Using Langevin dynamics to sample from a mixture of two Gaussians.

Note that Langevin dynamics accesses $p(\mathbf{x})$ only through $\nabla_\mathbf{x} \log p(\mathbf{x})$. Since $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_\mathbf{x} \log p(\mathbf{x})$, we can produce samples from our score-based model $\mathbf{s}_\theta(\mathbf{x})$ by plugging it into equation (6).
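A minimal sketch of this sampler, assuming a trained `score_net` such as the hypothetical one above, could look like the following; the step size and number of iterations are illustrative rather than tuned values.

```python
import torch

@torch.no_grad()
def langevin_dynamics(score_net, x0, eps=1e-3, n_steps=1000):
    # iterate x_{i+1} <- x_i + eps * s_theta(x_i) + sqrt(2 * eps) * z_i,
    # i.e., equation (6) with the score model in place of the true score
    x = x0.clone()
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + eps * score_net(x) + (2 * eps) ** 0.5 * z
    return x

# hypothetical usage: start the chain from a simple prior pi(x)
samples = langevin_dynamics(score_net, torch.randn(256, 2))
```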

Naive score-based generative modeling and its pitfalls

So far, we've discussed how to train a score-based model with score matching, and then produce samples via Langevin dynamics. However, this naive approach has had limited success in practice. We'll talk about some pitfalls of score matching that received little attention in prior works.

Score-based generative modeling with score matching + Langevin dynamics.

The key challenge is the fact that the estimated score functions are inaccurate in low density regions, where few data points are available for computing the score matching objective. This is expected as score matching minimizes the Fisher divergence

$$\mathbb{E}_{p(\mathbf{x})}\big[\| \nabla_\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}) \|_2^2\big] = \int p(\mathbf{x}) \| \nabla_\mathbf{x} \log p(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}) \|_2^2 \mathrm{d}\mathbf{x}.$$

Since the $\ell_2$ differences between the true data score function and score-based model are weighted by $p(\mathbf{x})$, they are largely ignored in low density regions where $p(\mathbf{x})$ is small. This behavior can lead to subpar results, as illustrated by the figure below:

Estimated scores are only accurate in high density regions.

When sampling with Langevin dynamics, our initial sample is highly likely in low density regions when data reside in a high dimensional space. Therefore, having an inaccurate score-based model will derail Langevin dynamics from the very beginning of the procedure, preventing it from generating high quality samples that are representative of the data.

Score-based generative modeling with multiple noise perturbations

How can we bypass the difficulty of accurate score estimation in regions of low data density? Our solution is to perturb data points with noise and train score-based models on the noisy data points instead. When the noise magnitude is sufficiently large, it can populate low data density regions to improve the accuracy of estimated scores. For example, here is what happens when we perturb a mixture of two Gaussians with additional Gaussian noise.

Estimated scores are accurate everywhere for the noise-perturbed data distribution due to reduced low data density regions.

Yet another question remains: how do we choose an appropriate noise scale for the perturbation process? Larger noise can obviously cover more low density regions for better score estimation, but it over-corrupts the data and alters it significantly from the original distribution. Smaller noise, on the other hand, causes less corruption of the original data distribution, but does not cover the low density regions as well as we would like.

To achieve the best of both worlds, we use multiple scales of noise perturbations simultaneously. Suppose we always perturb the data with isotropic Gaussian noise, and let there be a total of $L$ increasing standard deviations $\sigma_1 < \sigma_2 < \cdots < \sigma_L$. We first perturb the data distribution $p(\mathbf{x})$ with each of the Gaussian noise $\mathcal{N}(0, \sigma_i^2 I), i = 1, 2, \cdots, L$ to obtain a noise-perturbed distribution

$$p_{\sigma_i}(\mathbf{x}) = \int p(\mathbf{y})\, \mathcal{N}(\mathbf{x}; \mathbf{y}, \sigma_i^2 I)\, \mathrm{d}\mathbf{y}.$$

Note that we can easily draw samples from $p_{\sigma_i}(\mathbf{x})$ by sampling $\mathbf{x} \sim p(\mathbf{x})$ and computing $\mathbf{x} + \sigma_i \mathbf{z}$, with $\mathbf{z} \sim \mathcal{N}(0, I)$.
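In code, this perturbation is a one-liner; the sketch below is a hypothetical helper, not part of the referenced codebases.

```python
import torch

def perturb(x, sigma_i):
    # sample from p_{sigma_i} by adding isotropic Gaussian noise of std sigma_i to data x ~ p(x)
    return x + sigma_i * torch.randn_like(x)
```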

Next, we estimate the score function of each noise-perturbed distribution, $\nabla_\mathbf{x} \log p_{\sigma_i}(\mathbf{x})$, by training a Noise Conditional Score-Based Model $\mathbf{s}_\theta(\mathbf{x}, i)$ (also called a Noise Conditional Score Network, or NCSN, when parameterized with a neural network) with score matching, such that $\mathbf{s}_\theta(\mathbf{x}, i) \approx \nabla_\mathbf{x} \log p_{\sigma_i}(\mathbf{x})$ for all $i = 1, 2, \cdots, L$.

We apply multiple scales of Gaussian noise to perturb the data distribution (first row), and jointly estimate the score functions for all of them (second row).
Perturbing an image with multiple scales of Gaussian noise.

The training objective for $\mathbf{s}_\theta(\mathbf{x}, i)$ is a weighted sum of Fisher divergences for all noise scales. In particular, we use the objective below:

$$\sum_{i=1}^L \lambda(i)\, \mathbb{E}_{p_{\sigma_i}(\mathbf{x})}\big[\| \nabla_\mathbf{x} \log p_{\sigma_i}(\mathbf{x}) - \mathbf{s}_\theta(\mathbf{x}, i) \|_2^2\big], \tag{7}$$

where $\lambda(i) \in \mathbb{R}_{>0}$ is a positive weighting function, often chosen to be $\lambda(i) = \sigma_i^2$. The objective (7) can be optimized with score matching, exactly as in optimizing the naive (unconditional) score-based model $\mathbf{s}_\theta(\mathbf{x})$.
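Below is a hedged sketch of how objective (7) could be estimated with denoising score matching, where the score of the Gaussian perturbation kernel serves as a tractable regression target. The function and its conventions (a `score_net(x, i)` conditioned on an integer noise index) are illustrative assumptions, not the original implementation.

```python
import torch

def ncsn_loss(score_net, x, sigmas):
    # sigmas: 1-D tensor of increasing noise levels sigma_1 < ... < sigma_L
    i = torch.randint(0, len(sigmas), (x.shape[0],), device=x.device)  # pick a noise scale per example
    sigma = sigmas[i].view(-1, *([1] * (x.dim() - 1)))                 # broadcast to x's shape
    z = torch.randn_like(x)
    x_tilde = x + sigma * z                                            # a sample from p_{sigma_i}
    target = -z / sigma                                                # score of N(x_tilde; x, sigma_i^2 I) w.r.t. x_tilde
    s = score_net(x_tilde, i)
    per_example = ((s - target) ** 2).flatten(1).sum(dim=-1)
    return (sigmas[i] ** 2 * per_example).mean()                       # lambda(i) = sigma_i^2 weighting
```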

After training our noise-conditional score-based model $\mathbf{s}_\theta(\mathbf{x}, i)$, we can produce samples from it by running Langevin dynamics for $i = L, L-1, \cdots, 1$ in sequence. This method is called annealed Langevin dynamics (defined by Algorithm 1 in the original NCSN paper, and improved by follow-up work), since the noise scale $\sigma_i$ decreases (anneals) gradually over time.
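A sketch of annealed Langevin dynamics under the same hypothetical conventions as above (noise-index conditioning, step sizes proportional to $\sigma_i^2$ in the spirit of NCSN's Algorithm 1) might look like this; the hyperparameters are illustrative.

```python
import torch

@torch.no_grad()
def annealed_langevin(score_net, shape, sigmas, eps=2e-5, n_steps_each=100):
    # sigmas: increasing noise levels sigma_1 < ... < sigma_L, matching training
    x = torch.rand(shape)                                  # arbitrary unstructured initialization
    for i in reversed(range(len(sigmas))):                 # anneal from the largest noise scale down to the smallest
        alpha = eps * (sigmas[i] / sigmas[-1]) ** 2        # step size proportional to sigma_i^2
        idx = torch.full((shape[0],), i, dtype=torch.long)
        for _ in range(n_steps_each):
            z = torch.randn_like(x)
            x = x + alpha * score_net(x, idx) + (2 * alpha) ** 0.5 * z
    return x
```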

Annealed Langevin dynamics combine a sequence of Langevin chains with gradually decreasing noise scales.
Annealed Langevin dynamics for the Noise Conditional Score Network (NCSN) model, trained on CelebA (left) and CIFAR-10 (right). We can start from unstructured noise, modify images according to the scores, and generate nice samples. The method achieved state-of-the-art Inception score on CIFAR-10 at its time.

Here are some practical recommendations for tuning score-based generative models with multiple noise scales:

With such best practices, we are able to generate high quality image samples with comparable quality to GANs on various datasets, such as below:

Samples from the NCSNv2 model. From left to right: FFHQ 256x256, LSUN bedroom 128x128, LSUN tower 128x128, LSUN church_outdoor 96x96, and CelebA 64x64.

Score-based generative modeling with stochastic differential equations (SDEs)

As we already discussed, adding multiple noise scales is critical to the success of score-based generative models. By generalizing the number of noise scales to infinity, we obtain not only higher quality samples, but also, among others, exact log-likelihood computation and controllable generation for inverse problem solving.

In addition to this introduction, we have tutorials written in Google Colab to provide a step-by-step guide for training a toy model on MNIST. We also have more advanced code repositories that provide full-fledged implementations for large scale applications.

Link | Description
Open In Colab | Tutorial of score-based generative modeling with SDEs in JAX + FLAX
Open In Colab | Load our pretrained checkpoints and play with sampling, likelihood computation, and controllable synthesis (JAX + FLAX)
Open In Colab | Tutorial of score-based generative modeling with SDEs in PyTorch
Open In Colab | Load our pretrained checkpoints and play with sampling, likelihood computation, and controllable synthesis (PyTorch)
Code in JAX | Score SDE codebase in JAX + FLAX
Code in PyTorch | Score SDE codebase in PyTorch

Perturbing data with an SDE

When the number of noise scales approaches infinity, we essentially perturb the data distribution with continuously growing levels of noise. In this case, the noise perturbation procedure is a continuous-time stochastic process, as demonstrated below

Perturbing data to noise with a continuous-time stochastic process.

How can we represent a stochastic process in a concise way? Many stochastic processes (diffusion processes in particular) are solutions of stochastic differential equations (SDEs). In general, an SDE possesses the following form:

$$\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\, \mathrm{d}t + g(t)\, \mathrm{d}\mathbf{w}, \tag{8}$$

where $\mathbf{f}(\cdot, t): \mathbb{R}^d \to \mathbb{R}^d$ is a vector-valued function called the drift coefficient, $g(t) \in \mathbb{R}$ is a real-valued function called the diffusion coefficient, $\mathbf{w}$ denotes a standard Brownian motion, and $\mathrm{d}\mathbf{w}$ can be viewed as infinitesimal white noise. The solution of a stochastic differential equation is a continuous collection of random variables $\{ \mathbf{x}(t) \}_{t \in [0, T]}$. These random variables trace stochastic trajectories as the time index $t$ grows from the start time $0$ to the end time $T$. Let $p_t(\mathbf{x})$ denote the (marginal) probability density function of $\mathbf{x}(t)$. Here $t \in [0, T]$ is analogous to $i = 1, 2, \cdots, L$ when we had a finite number of noise scales, and $p_t(\mathbf{x})$ is analogous to $p_{\sigma_i}(\mathbf{x})$. Clearly, $p_0(\mathbf{x}) = p(\mathbf{x})$ is the data distribution since no perturbation is applied to data at $t = 0$. After perturbing $p(\mathbf{x})$ with the stochastic process for a sufficiently long time $T$, $p_T(\mathbf{x})$ becomes close to a tractable noise distribution $\pi(\mathbf{x})$, called a prior distribution. We note that $p_T(\mathbf{x})$ is analogous to $p_{\sigma_L}(\mathbf{x})$ in the case of finite noise scales, which corresponds to applying the largest noise perturbation $\sigma_L$ to the data.

The SDE in (8) is hand designed, similarly to how we hand-designed $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ in the case of finite noise scales. There are numerous ways to add noise perturbations, and the choice of SDEs is not unique. For example, the following SDE

$$\mathrm{d}\mathbf{x} = e^{t}\, \mathrm{d}\mathbf{w} \tag{9}$$

perturbs data with a Gaussian noise of mean zero and exponentially growing variance, which is analogous to perturbing data with $\mathcal{N}(0, \sigma_1^2 I), \mathcal{N}(0, \sigma_2^2 I), \cdots, \mathcal{N}(0, \sigma_L^2 I)$ when $\sigma_1 < \sigma_2 < \cdots < \sigma_L$ is a geometric progression. Therefore, the SDE should be viewed as part of the model, much like $\{\sigma_1, \sigma_2, \cdots, \sigma_L\}$. In our work, we provide three SDEs that generally work well for images: the Variance Exploding SDE (VE SDE), the Variance Preserving SDE (VP SDE), and the sub-VP SDE.
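As an illustration, a forward perturbation like the SDE in (9) can be simulated numerically with an Euler-Maruyama discretization. The following is a minimal, hypothetical sketch; the step count and the specific drift/diffusion functions are illustrative choices, not the schedules used in the referenced papers.

```python
import math
import torch

def euler_maruyama_forward(x0, drift, diffusion, T=1.0, n_steps=1000):
    # simulate dx = f(x, t) dt + g(t) dw from t = 0 to t = T, as in equation (8)
    dt = T / n_steps
    x, t = x0.clone(), 0.0
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + drift(x, t) * dt + diffusion(t) * math.sqrt(dt) * z
        t += dt
    return x

# hypothetical usage for the SDE in equation (9): zero drift, diffusion g(t) = e^t
x_T = euler_maruyama_forward(
    torch.randn(256, 2),
    drift=lambda x, t: torch.zeros_like(x),
    diffusion=lambda t: math.exp(t),
)
```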

Reversing the SDE for sample generation

Recall that with a finite number of noise scales, we can generate samples by reversing the perturbation process with annealed Langevin dynamics, i.e., sequentially sampling from each noise-perturbed distribution using Langevin dynamics. For infinite noise scales, we can analogously reverse the perturbation process for sample generation by using the reverse SDE.

Generate data from noise by reversing the perturbation procedure.

Importantly, any SDE has a corresponding reverse SDE, whose closed form is given by