
On Variational Bounds of Mutual Information

Ben Poole Sherjil Ozair Aäron van den Oord Alexander A. Alemi George Tucker

Abstract

Estimating and optimizing Mutual Information (MI) is core to many problems in machine learning; however, bounding MI in high dimensions is challenging. To establish tractable and scalable objectives, recent work has turned to variational bounds parameterized by neural networks, but the relationships and tradeoffs between these bounds remain unclear. In this work, we unify these recent developments in a single framework. We find that the existing variational lower bounds degrade when the MI is large, exhibiting either high bias or high variance. To address this problem, we introduce a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance. On high-dimensional, controlled problems, we empirically characterize the bias and variance of the bounds and their gradients and demonstrate the effectiveness of our new bounds for estimation and representation learning.

1. Introduction

Estimating the relationship between pairs of variables is a fundamental problem in science and engineering. Quantifying the degree of the relationship requires a metric that captures a notion of dependency. Here, we focus on mutual information (MI), denoted $I(X; Y)$, which is a reparameterization-invariant measure of dependency:

$$I(X; Y) = \mathbb{E}_{p(x, y)}\!\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right].$$
Mutual information estimators are used in computational neuroscience (Palmer et al., 2015), Bayesian optimal experimental design (Ryan et al., 2016; Foster et al., 2018), understanding neural networks (Tishby et al., 2000; Tishby & Zaslavsky, 2015; Gabrié et al., 2018), and more. In practice, estimating MI is challenging as we typically have access to
Figure 1. Schematic of variational bounds of mutual information presented in this paper. Nodes are colored based on their tractability for estimation and optimization: green bounds can be used for both, yellow for optimization but not estimation, and red for neither. Children are derived from their parents by introducing new approximations or assumptions.
samples but not the underlying distributions (Paninski, 2003; McAllester & Stratos, 2018). Existing sample-based estimators are brittle, with the hyperparameters of the estimator impacting the scientific conclusions (Saxe et al., 2018).
Beyond estimation, many methods use upper bounds on MI to limit the capacity or contents of representations. For example in the information bottleneck method (Tishby et al., 2000; Alemi et al., 2016), the representation is optimized to solve a downstream task while being constrained to contain as little information as possible about the input. These techniques have proven useful in a variety of domains, from restricting the capacity of discriminators in GANs (Peng et al., 2018) to preventing representations from containing information about protected attributes (Moyer et al., 2018).
Lastly, there is a growing set of methods in representation learning that maximize the mutual information between a learned representation and an aspect of the data. Specifically, given samples from a data distribution, $x \sim p(x)$, the goal is to learn a stochastic representation of the data $p_\theta(y \mid x)$ that has maximal MI with $X$ subject to constraints on the mapping (e.g. Bell & Sejnowski, 1995; Krause et al., 2010; Hu et al., 2017; van den Oord et al., 2018; Hjelm et al., 2018; Alemi et al., 2017). To maximize MI, we can compute gradients of a lower bound on MI with respect to the parameters $\theta$ of the stochastic encoder $p_\theta(y \mid x)$, which may not require directly estimating MI.
While many parametric and non-parametric (Nemenman et al., 2004; Kraskov et al., 2004; Reshef et al., 2011; Gao et al., 2015) techniques have been proposed to address MI estimation and optimization problems, few of them scale up to the dataset size and dimensionality encountered in modern machine learning problems.
To overcome these scaling difficulties, recent work combines variational bounds (Blei et al., 2017; Donsker & Varadhan, 1983; Barber & Agakov, 2003; Nguyen et al., 2010; Foster et al., 2018) with deep learning (Alemi et al., 2016; 2017; van den Oord et al., 2018; Hjelm et al., 2018; Belghazi et al., 2018) to enable differentiable and tractable estimation of mutual information. These papers introduce flexible parametric distributions or critics parameterized by neural networks that are used to approximate unknown densities (e.g. $p(x \mid y)$) or density ratios ($p(x, y)/(p(x)p(y))$).
In spite of their effectiveness, the properties of existing variational estimators of MI are not well understood. In this paper, we introduce several results that begin to demystify these approaches and present novel bounds with improved properties (see Fig. 1 for a schematic):
  • We provide a review of existing estimators, discussing their relationships and tradeoffs, including the first proof that the noise contrastive loss in van den Oord et al. (2018) is a lower bound on MI, and that the heuristic "bias corrected gradients" in Belghazi et al. (2018) can be justified as unbiased estimates of the gradients of a different lower bound on MI.
  • We derive a new continuum of multi-sample lower bounds that can flexibly trade off bias and variance, generalizing the bounds of (Nguyen et al., 2010; van den Oord et al., 2018).
  • We show how to leverage known conditional structure, yielding simple lower and upper bounds that sandwich MI in the representation learning context when $p(y \mid x)$ is tractable.
  • We systematically evaluate the bias and variance of MI estimators and their gradients on controlled high-dimensional problems.
  • We demonstrate the utility of our variational upper and lower bounds in the context of decoder-free disentangled representation learning on dSprites (Matthey et al., 2017).

2. Variational bounds of MI

Here, we review existing variational bounds on MI in a unified framework, and present several new bounds that trade off bias and variance and naturally leverage known conditional densities when they are available. A schematic of the bounds we consider is presented in Fig. 1. We begin by reviewing the classic upper and lower bounds of Barber & Agakov (2003) and then show how to derive the lower bounds of Donsker & Varadhan (1983); Nguyen et al. (2010); Belghazi et al. (2018) from an unnormalized variational distribution. Generalizing the unnormalized bounds to the multi-sample setting yields the bound proposed in van den Oord et al. (2018), and provides the basis for our interpolated bound.

2.1. Normalized upper and lower bounds

Upper bounding MI is challenging, but is possible when the conditional distribution $p(y \mid x)$ is known (e.g. in deep representation learning where $y$ is the stochastic representation of the input $x$). We can build a tractable variational upper bound by introducing a variational approximation $q(y)$ to the intractable marginal $p(y)$. By multiplying and dividing the integrand in MI by $q(y)$ and dropping a negative KL term, we get a tractable variational upper bound (Barber & Agakov, 2003):

$$I(X; Y) = \mathbb{E}_{p(x, y)}\!\left[\log \frac{p(y \mid x)}{p(y)}\right] \le \mathbb{E}_{p(x, y)}\!\left[\log \frac{p(y \mid x)}{q(y)}\right] = \mathbb{E}_{p(x)}\!\left[\mathrm{KL}\!\left(p(y \mid x)\,\|\,q(y)\right)\right] \equiv R, \quad (1)$$
which is often referred to as the rate in generative models (Alemi et al., 2017). This bound is tight when $q(y) = p(y)$, and requires that computing $\log q(y)$ is tractable. This variational upper bound is often used as a regularizer to limit the capacity of a stochastic representation (e.g. Rezende et al., 2014; Kingma & Welling, 2013; Burgess et al., 2018). In Alemi et al. (2016), this upper bound is used to prevent the representation from carrying information about the input that is irrelevant for the downstream classification task.
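To make the rate concrete, the following is a minimal NumPy sketch (not the implementation used for any experiments in this paper) that estimates $R$ for the common special case of a diagonal-Gaussian encoder $p(y \mid x) = \mathcal{N}(\mu(x), \mathrm{diag}(\sigma^2(x)))$ and a standard normal variational marginal $q(y) = \mathcal{N}(0, I)$, where the per-example KL has a closed form; the arrays `mu` and `sigma` are placeholders for whatever encoder network produces them.

```python
import numpy as np

def rate_upper_bound(mu, sigma):
    """Monte-Carlo estimate of R = E_{p(x)}[ KL( N(mu(x), diag(sigma(x)^2)) || N(0, I) ) ].

    mu, sigma: [batch, dim] arrays of per-example encoder means and stddevs.
    Returns the average KL in nats, an upper bound on I(X; Y) for this encoder.
    """
    kl_per_dim = 0.5 * (mu ** 2 + sigma ** 2 - 2.0 * np.log(sigma) - 1.0)
    return kl_per_dim.sum(axis=1).mean()

# Toy usage with random stand-ins for the encoder outputs.
rng = np.random.default_rng(0)
mu = rng.normal(size=(128, 10))
sigma = np.exp(0.1 * rng.normal(size=(128, 10)))
print(rate_upper_bound(mu, sigma))
```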
Unlike the upper bound, most variational lower bounds on mutual information do not require direct knowledge of any conditional densities. To establish an initial lower bound on mutual information, we factor MI the opposite direction as the upper bound, and replace the intractable conditional distribution $p(x \mid y)$ with a tractable optimization problem over a variational distribution $q(x \mid y)$. As shown in Barber & Agakov (2003), this yields a lower bound on MI due to the non-negativity of the KL divergence:

$$I(X; Y) = \mathbb{E}_{p(x, y)}\!\left[\log \frac{p(x \mid y)}{p(x)}\right] \ge \mathbb{E}_{p(x, y)}\!\left[\log q(x \mid y)\right] + h(X) \equiv I_{\mathrm{BA}}, \quad (2)$$
where $h(X)$ is the differential entropy of $X$. The bound is tight when $q(x \mid y) = p(x \mid y)$, in which case the first term equals the negative conditional entropy $-h(X \mid Y)$.
Unfortunately, evaluating this objective is generally intractable as the differential entropy of $X$ is often unknown. If $h(X)$ is known, this provides a tractable estimate of a lower bound on MI. Otherwise, one can still compare the amount of information different variables (e.g., $Y_1$ and $Y_2$) carry about $X$.
In the representation learning context where $X$ is data and $Y$ is a learned stochastic representation, the first term of $I_{\mathrm{BA}}$ can be thought of as negative reconstruction error or distortion, and the gradient of $I_{\mathrm{BA}}$ with respect to the "encoder" $p(y \mid x)$ and variational "decoder" $q(x \mid y)$ is tractable. Thus we can use this objective to learn an encoder $p(y \mid x)$ that maximizes $I(X; Y)$ as in Alemi et al. (2017). However, this approach to representation learning requires building a tractable decoder $q(x \mid y)$, which is challenging when $x$ is high-dimensional, for example in video representation learning (van den Oord et al., 2016).

2.2. Unnormalized lower bounds

To derive tractable lower bounds that do not require a tractable decoder, we turn to unnormalized distributions for the variational family of $q(x \mid y)$, and show how this recovers the estimators of Donsker & Varadhan (1983) and Nguyen et al. (2010).
We choose an energy-based variational family that uses a critic $f(x, y)$ and is scaled by the data density $p(x)$:

$$q(x \mid y) = \frac{p(x)}{Z(y)}\, e^{f(x, y)}, \quad \text{where } Z(y) = \mathbb{E}_{p(x)}\!\left[e^{f(x, y)}\right]. \quad (3)$$
Substituting this distribution into $I_{\mathrm{BA}}$ (Eq. 2) gives a lower bound on MI which we refer to as $I_{\mathrm{UBA}}$ for the Unnormalized version of the Barber and Agakov bound:

$$I(X; Y) \ge \mathbb{E}_{p(x, y)}\!\left[f(x, y)\right] - \mathbb{E}_{p(y)}\!\left[\log Z(y)\right] \equiv I_{\mathrm{UBA}}. \quad (4)$$
This bound is tight when $f(x, y) = \log p(y \mid x) + c(y)$, where $c(y)$ is solely a function of $y$ (and not $x$). Note that by scaling $q(x \mid y)$ by $p(x)$, the intractable differential entropy term in $I_{\mathrm{BA}}$ cancels, but we are still left with an intractable log partition function, $\log Z(y)$, that prevents evaluation or gradient computation. If we apply Jensen's inequality to $\mathbb{E}_{p(y)}[\log Z(y)]$, we can lower bound Eq. 4 to recover the bound of Donsker & Varadhan (1983):

$$I(X; Y) \ge \mathbb{E}_{p(x, y)}\!\left[f(x, y)\right] - \log \mathbb{E}_{p(x)p(y)}\!\left[e^{f(x, y)}\right] \equiv I_{\mathrm{DV}}. \quad (5)$$
However, this objective is still intractable. Applying Jensen's inequality the other direction, replacing $\log Z(y) = \log \mathbb{E}_{p(x)}[e^{f(x, y)}]$ with $\mathbb{E}_{p(x)}[f(x, y)]$, results in a tractable objective, but produces an upper bound on Eq. 4 (which is itself a lower bound on mutual information). Thus evaluating $I_{\mathrm{DV}}$ using a Monte-Carlo approximation of the expectations as in MINE (Belghazi et al., 2018) produces estimates that are neither an upper nor a lower bound on MI. Recent work has studied the convergence and asymptotic consistency of such nested Monte-Carlo estimators, but does not address the problem of building bounds that hold with finite samples (Rainforth et al., 2018; Mathieu et al., 2018).
To form a tractable bound, we can upper bound the log partition function using the inequality $\log x \le \frac{x}{a} + \log a - 1$ for all $x, a > 0$. Applying this inequality to the second term of Eq. 4 gives $\mathbb{E}_{p(y)}[\log Z(y)] \le \mathbb{E}_{p(y)}\!\left[\frac{Z(y)}{a(y)} + \log a(y) - 1\right]$, which is tight when $a(y) = Z(y)$. This results in a Tractable Unnormalized version of the Barber and Agakov (TUBA) lower bound on MI that admits unbiased estimates and gradients:

$$I(X; Y) \ge 1 + \mathbb{E}_{p(x, y)}\!\left[f(x, y)\right] - \mathbb{E}_{p(y)}\!\left[\frac{\mathbb{E}_{p(x)}[e^{f(x, y)}]}{a(y)} + \log a(y)\right] \equiv I_{\mathrm{TUBA}}. \quad (6)$$
To tighten this lower bound, we maximize it with respect to the variational parameters $a(y)$ and $f(x, y)$. In the InfoMax setting, we can additionally maximize the bound with respect to the stochastic encoder $p_\theta(y \mid x)$ to increase $I(X; Y)$. Unlike the min-max objective of GANs, all parameters are optimized towards the same objective.
This bound holds for any choice of $a(y) > 0$, with simplifications recovering existing bounds. Letting $a(y)$ be the constant $e$ recovers the bound of Nguyen, Wainwright, and Jordan (Nguyen et al., 2010), also known as $f$-GAN KL (Nowozin et al., 2016) and MINE-f (Belghazi et al., 2018):

$$I(X; Y) \ge \mathbb{E}_{p(x, y)}\!\left[f(x, y)\right] - e^{-1}\,\mathbb{E}_{p(x)p(y)}\!\left[e^{f(x, y)}\right] \equiv I_{\mathrm{NWJ}}. \quad (7)$$
This tractable bound no longer requires learning $a(y)$, but now $f(x, y)$ must learn to self-normalize, yielding a unique optimal critic $f^*(x, y) = 1 + \log \frac{p(y \mid x)}{p(y)}$. This requirement of self-normalization is a common choice when learning log-linear models and empirically has been shown not to negatively impact performance (Mnih & Teh, 2012).
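As a concrete illustration, the NumPy sketch below estimates $I_{\mathrm{TUBA}}$ and its $I_{\mathrm{NWJ}}$ special case from a single minibatch, assuming a $[K, K]$ matrix of critic scores with `scores[i, j] = f(x_i, y_j)`: the diagonal holds joint samples and the off-diagonal entries stand in for samples from $p(x)p(y)$. The score-matrix convention and the per-example `log_a` baseline vector are illustrative assumptions, not the setup used in our experiments.

```python
import numpy as np

def tuba_lower_bound(scores, log_a):
    """I_TUBA estimate from a [K, K] critic score matrix.

    scores[i, j] = f(x_i, y_j); diagonal entries are joint samples.
    log_a[j] = log a(y_j): a learned baseline, a constant, or an exponential
    moving average (the MINE-style choice discussed below).
    """
    K = scores.shape[0]
    joint_term = np.mean(np.diag(scores))
    off_diag = ~np.eye(K, dtype=bool)
    # E_{p(x)}[e^{f(x, y_j)}] estimated from the K-1 independent pairs in column j.
    exp_marg = np.sum(np.exp(scores) * off_diag, axis=0) / (K - 1)
    return 1.0 + joint_term - np.mean(exp_marg / np.exp(log_a) + log_a)

def nwj_lower_bound(scores):
    """I_NWJ: the TUBA special case with the constant baseline a(y) = e."""
    return tuba_lower_bound(scores, log_a=np.ones(scores.shape[0]))
```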
Finally, we can set $a(y)$ to be the scalar exponential moving average (EMA) of $e^{f(x, y)}$ across minibatches. This pushes the normalization constant to be independent of $y$, but it no longer has to exactly self-normalize. With this choice of $a$, the gradients of $I_{\mathrm{TUBA}}$ exactly yield the "improved MINE gradient estimator" from Belghazi et al. (2018). This provides sound justification for the heuristic optimization
procedure proposed by Belghazi et al. (2018). However, instead of using the critic in the $I_{\mathrm{DV}}$ bound to get an estimate that is not a bound on MI as in Belghazi et al. (2018), one can compute an estimate with $I_{\mathrm{TUBA}}$, which results in a valid lower bound.
To summarize, these unnormalized bounds are attractive because they provide tractable estimators which become tight with the optimal critic. However, in practice they exhibit high variance due to their reliance on high variance upper bounds on the log partition function.

2.3. Multi-sample unnormalized lower bounds

To reduce variance, we extend the unnormalized bounds to depend on multiple samples, and show how to recover the low-variance but high-bias MI estimator proposed by van den Oord et al. (2018).
Our goal is to estimate $I(X_1; Y)$ given samples from $p(x_1, y)$ and access to $K - 1$ additional samples $x_{2:K} \sim r^{K-1}(x_{2:K})$ (potentially from a different distribution than $X_1$). For any random variable $Z$ independent from $X$ and $Y$, $I(X, Z; Y) = I(X; Y)$, therefore:

$$I(X_1; Y) = \mathbb{E}_{r^{K-1}(x_{2:K})}\!\left[I(X_1; Y)\right] = I(X_{1:K}; Y).$$
This multi-sample mutual information can be estimated using any of the previous bounds, and has the same optimal critic as for $I(X_1; Y)$. For $I_{\mathrm{NWJ}}$, we have that the optimal critic is $f^*(x_1, y) = 1 + \log \frac{p(y \mid x_1)}{p(y)}$. However, the critic can now also depend on the additional samples $x_{2:K}$. In particular, setting the critic to $1 + \log \frac{e^{f(x_1, y)}}{a(y;\, x_{1:K})}$, $I_{\mathrm{NWJ}}$ becomes:

$$I(X_1; Y) \ge 1 + \mathbb{E}_{p(x_{1:K})p(y \mid x_1)}\!\left[\log \frac{e^{f(x_1, y)}}{a(y;\, x_{1:K})}\right] - \mathbb{E}_{p(x_{1:K})p(y)}\!\left[\frac{e^{f(x_1, y)}}{a(y;\, x_{1:K})}\right], \quad (8)$$
where we have written the critic using parameters $(f, a)$ to highlight the close connection to the variational parameters in $I_{\mathrm{TUBA}}$. One way to leverage these additional samples from $r(x)$ is to build a Monte-Carlo estimate of the partition function $Z(y)$:

$$a(y;\, x_{1:K}) = m(y;\, x_{1:K}) = \frac{1}{K}\sum_{i=1}^{K} e^{f(x_i, y)}.$$
Intriguingly, with this choice, the high-variance term in $I_{\mathrm{NWJ}}$ that estimates an upper bound on $\log Z(y)$ is now bounded, as $e^{f(x_1, y)}$ appears in the numerator and also in the denominator (scaled by $\frac{1}{K}$), so the ratio can never exceed $K$. If we average the bound over $K$ replicates, reindexing $x_1$ as $x_i$ for each term, then the last term in Eq. 8 becomes the constant 1:

$$\frac{1}{K}\sum_{i=1}^{K} \mathbb{E}_{p(x_{1:K})p(y)}\!\left[\frac{e^{f(x_i, y)}}{m(y;\, x_{1:K})}\right] = \mathbb{E}_{p(x_{1:K})p(y)}\!\left[\frac{m(y;\, x_{1:K})}{m(y;\, x_{1:K})}\right] = 1,$$
and we exactly recover the lower bound on MI proposed by van den Oord et al. (2018):

$$I(X; Y) \ge \mathbb{E}\!\left[\frac{1}{K}\sum_{i=1}^{K} \log \frac{e^{f(x_i, y_i)}}{\frac{1}{K}\sum_{j=1}^{K} e^{f(x_j, y_i)}}\right] \equiv I_{\mathrm{NCE}},$$
where the expectation is over $K$ independent samples from the joint distribution: $\prod_j p(x_j, y_j)$. This provides a proof that $I_{\mathrm{NCE}}$ is a lower bound on MI. Unlike $I_{\mathrm{NWJ}}$, where the optimal critic depends on both the conditional and marginal densities, the optimal critic for $I_{\mathrm{NCE}}$ is $f(x, y) = \log p(y \mid x) + c(y)$, where $c(y)$ is any function that depends on $y$ but not $x$ (Ma & Collins, 2018). Thus the critic only has to learn the conditional density and not the marginal density $p(y)$.
As pointed out in van den Oord et al. (2018), $I_{\mathrm{NCE}}$ is upper bounded by $\log K$, meaning that this bound will be loose when $I(X; Y) > \log K$. Although the optimal critic does not depend on the batch size and can be fit with smaller mini-batches, accurately estimating mutual information still needs a large batch size at test time if the mutual information is high.
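For concreteness, here is a NumPy sketch of the $I_{\mathrm{NCE}}$ estimate from a single minibatch, again assuming a $[K, K]$ score matrix with `scores[i, j] = f(x_i, y_j)`; it is a sketch of the estimator above, not the implementation used for the experiments.

```python
import numpy as np

def infonce_lower_bound(scores):
    """I_NCE estimate from a [K, K] matrix with scores[i, j] = f(x_i, y_j).

    For each y_i, the paired x_i is the positive example and the other K-1 x's
    in the batch are contrastive samples. The estimate can never exceed log K.
    """
    K = scores.shape[0]
    # Stable log-softmax over each column (all x's scored against a fixed y_i).
    col_max = scores.max(axis=0, keepdims=True)
    log_norm = col_max + np.log(np.exp(scores - col_max).sum(axis=0, keepdims=True))
    log_softmax = scores - log_norm
    return np.mean(np.diag(log_softmax)) + np.log(K)
```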

2.4. Nonlinearly interpolated lower bounds

The multi-sample perspective on $I_{\mathrm{NWJ}}$ allows us to make other choices for the functional form of the critic. Here we propose one simple form for a critic that allows us to nonlinearly interpolate between $I_{\mathrm{NWJ}}$ and $I_{\mathrm{NCE}}$, effectively bridging the gap between the low-bias, high-variance $I_{\mathrm{NWJ}}$ estimator and the high-bias, low-variance $I_{\mathrm{NCE}}$ estimator. Similarly to Eq. 8, we set the critic to $1 + \log \frac{e^{f(x_1, y)}}{\alpha\, m(y;\, x_{1:K}) + (1 - \alpha)\, q(y)}$ with $\alpha \in [0, 1]$ to get a continuum of lower bounds:

$$I(X_1; Y) \ge 1 + \mathbb{E}_{p(x_{1:K})p(y \mid x_1)}\!\left[\log \frac{e^{f(x_1, y)}}{\alpha\, m(y;\, x_{1:K}) + (1 - \alpha)\, q(y)}\right] - \mathbb{E}_{p(x_{1:K})p(y)}\!\left[\frac{e^{f(x_1, y)}}{\alpha\, m(y;\, x_{1:K}) + (1 - \alpha)\, q(y)}\right] \equiv I_{\alpha}.$$
By interpolating between $q(y)$ ($\alpha = 0$) and $m(y;\, x_{1:K})$ ($\alpha = 1$), we can recover $I_{\mathrm{NWJ}}$ or $I_{\mathrm{NCE}}$. Unlike $I_{\mathrm{NCE}}$, which is
upper bounded by $\log K$, the interpolated bound is upper bounded by $\log \frac{K}{\alpha}$, allowing us to use $\alpha$ to tune the tradeoff between bias and variance. We can maximize this lower bound in terms of $q(y)$ and $f(x, y)$. Note that unlike $I_{\mathrm{NCE}}$, for $\alpha > 0$ the last term does not vanish and we must sample $y$ independently from $p(y)$ to form a Monte Carlo approximation for that term. In practice we use a leave-one-out estimate, holding out an element from the minibatch for the independent $y$ in the second term. We conjecture that the optimal critic for the interpolated bound is achieved when $f(x, y) = \log p(y \mid x)$ and $q(y) = p(y)$, and use this choice when evaluating the accuracy of the estimates and gradients of $I_{\alpha}$ with optimal critics.
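The sketch below shows one way the interpolated estimate could be computed from a single minibatch, with `scores[i, j] = f(x_i, y_j)` and `log_q[j]` holding $\log q(y_j)$ for the learned marginal. The correction term here reuses the off-diagonal pairs as (approximately) independent samples rather than implementing the leave-one-out scheme described above, so treat it as a simplified variant rather than the exact estimator.

```python
import numpy as np

def interpolated_lower_bound(scores, log_q, alpha):
    """Sketch of I_alpha from a [K, K] score matrix and a learned log q(y).

    alpha -> 1 behaves like the InfoNCE estimate, alpha -> 0 like an NWJ-style
    estimate with the learned marginal q(y) in the denominator.
    """
    K = scores.shape[0]
    exp_scores = np.exp(scores)
    m = exp_scores.mean(axis=0)                          # m(y_j; x_{1:K})
    denom = alpha * m + (1.0 - alpha) * np.exp(log_q)    # interpolated normalizer
    joint_term = np.mean(np.log(np.diag(exp_scores) / denom))
    off_diag = ~np.eye(K, dtype=bool)
    indep_term = (exp_scores / denom[None, :])[off_diag].mean()
    return 1.0 + joint_term - indep_term
```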

2.5. Structured bounds with tractable encoders

In the previous sections we presented one variational upper bound and several variational lower bounds. While these bounds are flexible and can make use of any architecture or parameterization for the variational families, we can additionally take into account known problem structure. Here we present several special cases of the previous bounds that can be leveraged when the conditional distribution $p(y \mid x)$ is known. This case is common in representation learning where $x$ is data and $y$ is a learned stochastic representation.

InfoNCE with a tractable conditional.

An optimal critic for $I_{\mathrm{NCE}}$ is given by $f(x, y) = \log p(y \mid x)$, so we can simply use the conditional density when it is known. This gives us a lower bound on MI without additional variational parameters:

$$I(X; Y) \ge \mathbb{E}\!\left[\frac{1}{K}\sum_{i=1}^{K} \log \frac{p(y_i \mid x_i)}{\frac{1}{K}\sum_{j=1}^{K} p(y_i \mid x_j)}\right], \quad (12)$$
where the expectation is over $\prod_j p(x_j, y_j)$.

Leave one out upper bound.

Recall that the variational upper bound (Eq. 1) is minimized when our variational $q(y)$ matches the true marginal distribution $p(y)$. Given a minibatch of $K$ $(x_i, y_i)$ pairs, we can approximate $p(y) \approx \frac{1}{K}\sum_i p(y \mid x_i)$ (Chen et al., 2018). For each example $y_i$ in the minibatch, we can approximate $p(y_i)$ with the mixture over all other elements: $q_i(y_i) = \frac{1}{K-1}\sum_{j \ne i} p(y_i \mid x_j)$. With this choice of variational distribution, the variational upper bound is:

$$I(X; Y) \le \mathbb{E}\!\left[\frac{1}{K}\sum_{i=1}^{K} \log \frac{p(y_i \mid x_i)}{\frac{1}{K-1}\sum_{j \ne i} p(y_i \mid x_j)}\right], \quad (13)$$
where the expectation is over $\prod_j p(x_j, y_j)$. Combining Eq. 12 and Eq. 13, we can sandwich MI without introducing learned variational distributions. Note that the only difference between these bounds is whether $p(y_i \mid x_i)$ is included in the denominator. Similar mixture distributions have been used in prior work, but they require additional parameters (Tomczak & Welling, 2018; Kolchinsky et al., 2017).
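When $p(y \mid x)$ is tractable, both bounds need only the conditional densities evaluated on a minibatch. A NumPy sketch under the assumed layout `log_cond[i, j] = log p(y_i | x_j)`:

```python
import numpy as np

def sandwich_bounds(log_cond):
    """Lower (Eq. 12) and upper (Eq. 13) bounds on I(X; Y) from a [K, K] matrix
    with log_cond[i, j] = log p(y_i | x_j) on a minibatch of K pairs."""
    K = log_cond.shape[0]
    diag = np.diag(log_cond)

    # Eq. 12: mixture over all K conditionals, including the paired one.
    row_max = log_cond.max(axis=1, keepdims=True)
    log_mix_all = row_max[:, 0] + np.log(np.exp(log_cond - row_max).mean(axis=1))
    lower = np.mean(diag - log_mix_all)

    # Eq. 13: leave the paired conditional out of the mixture.
    masked = np.where(np.eye(K, dtype=bool), -np.inf, log_cond)
    row_max = masked.max(axis=1, keepdims=True)
    log_mix_loo = row_max[:, 0] + np.log(np.exp(masked - row_max).sum(axis=1) / (K - 1))
    upper = np.mean(diag - log_mix_loo)
    return lower, upper
```

The only difference between the two computations is whether the paired term enters the mixture in the denominator, mirroring the discussion above.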

Reparameterizing critics.

For $I_{\mathrm{NWJ}}$, the optimal critic is given by $f^*(x, y) = 1 + \log \frac{p(y \mid x)}{p(y)}$, so it is possible to use a critic of the form $f(x, y) = 1 + \log \frac{p(y \mid x)}{q(y)}$ and optimize only over $q(y)$ when $p(y \mid x)$ is known. The resulting bound resembles the variational upper bound (Eq. 1) with a correction term to make it a lower bound:

$$I(X; Y) \ge 1 + \mathbb{E}_{p(x, y)}\!\left[\log \frac{p(y \mid x)}{q(y)}\right] - \mathbb{E}_{p(x)p(y)}\!\left[\frac{p(y \mid x)}{q(y)}\right]. \quad (14)$$
This bound is valid for any choice of $q(y)$, including unnormalized $q$.
Similarly, for the interpolated bounds we can use $f(x, y) = \log p(y \mid x)$ and only optimize over the $q(y)$ in the denominator. In practice, we find reparameterizing the critic to be beneficial, as the critic no longer needs to learn the mapping between $x$ and $y$, and instead only has to learn an approximate marginal $q(y)$ in the typically lower-dimensional representation space.
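A short NumPy sketch of the reparameterized-critic bound of Eq. 14, using the same assumed `log_cond[i, j] = log p(y_i | x_j)` layout plus a vector `log_q[i]` of $\log q(y_i)$ values from the learned (possibly unnormalized) marginal; only `log_q` would be optimized in this setting.

```python
import numpy as np

def reparam_critic_lower_bound(log_cond, log_q):
    """Eq. 14 estimate: I >= 1 + E_{p(x,y)}[log p(y|x)/q(y)] - E_{p(x)p(y)}[p(y|x)/q(y)]."""
    K = log_cond.shape[0]
    log_ratio = log_cond - log_q[:, None]           # log p(y_i|x_j) - log q(y_i)
    joint_term = np.mean(np.diag(log_ratio))        # paired (x_i, y_i) samples
    off_diag = ~np.eye(K, dtype=bool)
    marg_term = np.exp(log_ratio[off_diag]).mean()  # independent (x_j, y_i) pairs
    return 1.0 + joint_term - marg_term
```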

Upper bounding total correlation.

Minimizing statistical dependency in representations is a common goal in disentangled representation learning. Prior work has focused on two approaches that both minimize lower bounds: (1) using adversarial learning (Kim & Mnih, 2018; Hjelm et al., 2018), or (2) using minibatch approximations where again a lower bound is minimized (Chen et al., 2018). To measure and minimize statistical dependency, we would like an upper bound, not a lower bound. In the case of a mean field encoder $p(y \mid x) = \prod_j p(y_j \mid x)$, we can factor the total correlation into two information terms, and form a tractable upper bound. First, we can write the total correlation as $TC(Y) = \sum_j I(X; Y_j) - I(X; Y)$. We can then use either the standard (Eq. 1) or the leave one out upper bound (Eq. 13) for each term in the summation, and any of the lower bounds for $I(X; Y)$. Using the leave one out upper bound (Eq. 13) for each term and $I_{\mathrm{NCE}}$ with the tractable conditional (Eq. 12) for the lower bound, we get a tractable upper bound on total correlation without any variational distributions or critics. Broadly, we can convert lower bounds on mutual information into upper bounds on KL divergences when the conditional distribution is tractable.
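To illustrate the construction, the sketch below combines the per-dimension leave-one-out upper bounds with the Eq. 12 lower bound on $I(X; Y)$ for a mean field encoder. It takes as input a hypothetical `[D, K, K]` array of per-dimension log conditionals, `per_dim_log_cond[d, i, j] = log p(y_{i,d} | x_j)`; this layout is an assumption made for the sketch.

```python
import numpy as np

def tc_upper_bound(per_dim_log_cond):
    """Upper bound on TC(Y) = sum_d I(X; Y_d) - I(X; Y) for a mean-field encoder.

    Each I(X; Y_d) is upper bounded with the leave-one-out bound (Eq. 13);
    I(X; Y) is lower bounded with the tractable-conditional InfoNCE (Eq. 12).
    """
    D, K, _ = per_dim_log_cond.shape
    eye = np.eye(K, dtype=bool)

    def log_mean_exp(a, leave_one_out=False):
        if leave_one_out:
            a = np.where(eye, -np.inf, a)
        m = a.max(axis=1, keepdims=True)
        count = K - 1 if leave_one_out else K
        return m[:, 0] + np.log(np.exp(a - m).sum(axis=1) / count)

    upper_terms = sum(
        np.mean(np.diag(per_dim_log_cond[d]) - log_mean_exp(per_dim_log_cond[d], True))
        for d in range(D))
    joint_log_cond = per_dim_log_cond.sum(axis=0)   # mean-field: log p(y_i | x_j)
    lower_joint = np.mean(np.diag(joint_log_cond) - log_mean_exp(joint_log_cond))
    return upper_terms - lower_joint
```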

2.6. From density ratio estimators to bounds

Note that the optimal critics for both $I_{\mathrm{NWJ}}$ and $I_{\mathrm{NCE}}$ are functions of the log density ratio $\log \frac{p(y \mid x)}{p(y)}$. So, given a log density ratio estimator, we can estimate the optimal critic and form a lower bound on MI. In practice, we find that
Figure 2. Performance of bounds at estimating mutual information. Top: the dataset is a correlated Gaussian, with the correlation $\rho$ stepping over time. Bottom: the dataset is created by drawing $(x, y)$ from the correlated Gaussian and then transforming $y$ to get $(W y)^3$, where $W$ is a random matrix and the cubing is elementwise. Critics are trained to maximize each lower bound on MI, and the objective (light) and smoothed objective (dark) are plotted for each technique and critic type. The single-sample bounds ($I_{\mathrm{NWJ}}$ and $I_{\mathrm{JS}}$) have higher variance than $I_{\mathrm{NCE}}$ and $I_{\alpha}$, but achieve competitive estimates on both datasets. While $I_{\mathrm{NCE}}$ is a poor estimator of MI with the small training batch size of 64, the interpolated bounds are able to provide less biased estimates than $I_{\mathrm{NCE}}$ with less variance than $I_{\mathrm{NWJ}}$. For the more challenging nonlinear relationship in the bottom set of panels, the best estimates of MI are achieved with the interpolated bounds. Using a joint critic (orange) outperforms a separable critic (blue) for $I_{\mathrm{NWJ}}$ and $I_{\mathrm{JS}}$, while the multi-sample bounds are more robust to the choice of critic architecture.
training a critic using the Jensen-Shannon divergence (as in Nowozin et al. (2016); Hjelm et al. (2018)) yields an estimate of the log density ratio that is lower variance and as accurate as training with $I_{\mathrm{NWJ}}$. Empirically, we find that training the critic using gradients of $I_{\mathrm{NWJ}}$ can be unstable due to the exponential from the upper bound on the log partition function in the $I_{\mathrm{NWJ}}$ objective. Instead, one can train a log density ratio estimator to maximize a lower bound on the Jensen-Shannon (JS) divergence, and use the density ratio estimate in $I_{\mathrm{NWJ}}$ (see Appendix D for details). We call this approach $I_{\mathrm{JS}}$, as we update the critic using the JS divergence as in Hjelm et al. (2018), but still compute a MI lower bound with $I_{\mathrm{NWJ}}$. This approach is similar to (Poole et al., 2016; Mescheder et al., 2017) but results in a bound instead of an unbounded estimate based on a Monte-Carlo approximation of the $f$-divergence.
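A sketch of the two pieces of this approach, assuming the same $[K, K]$ score-matrix convention as above: the critic is trained (with any autodiff framework) to maximize a GAN-style softplus parameterization of the JS lower bound, and the MI estimate is then computed by plugging the shifted critic $1 + f$ (the form the optimal $I_{\mathrm{NWJ}}$ critic takes) into $I_{\mathrm{NWJ}}$. The exact recipe is given in Appendix D; treat this as an assumed reconstruction rather than the implementation used in the experiments.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def js_critic_objective(scores):
    """Training objective for the critic: a Jensen-Shannon (GAN-style) lower bound.

    Maximizing it over f drives f(x, y) toward the log density ratio
    log p(y|x)/p(y) on the support of the data.
    """
    K = scores.shape[0]
    off_diag = ~np.eye(K, dtype=bool)
    joint = -softplus(-np.diag(scores)).mean()      # scores on joint samples
    marginal = softplus(scores[off_diag]).mean()    # scores on independent pairs
    return joint - marginal

def js_mi_estimate(scores):
    """I_JS: evaluate I_NWJ with the JS-trained critic shifted by 1."""
    off_diag = ~np.eye(scores.shape[0], dtype=bool)
    shifted = scores + 1.0
    return np.diag(shifted).mean() - np.exp(shifted[off_diag] - 1.0).mean()
```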

3. Experiments

First, we evaluate the performance of MI bounds on two simple tractable toy problems. Then, we conduct a more thorough analysis of the bias/variance tradeoffs in MI estimates and gradient estimates given the optimal critic. Our goal in these experiments was to verify the theoretical results in Section 2, and show that the interpolated bounds can achieve better estimates of MI when the relationship between the variables is nonlinear. Finally, we highlight the utility of these bounds for disentangled representation learning on the dSprites datasets.

Comparing estimates across different lower bounds.

We applied our estimators to two different toy problems: (1) a correlated Gaussian problem taken from Belghazi et al. (2018), where $(x, y)$ are drawn from a 20-d Gaussian distribution with correlation $\rho$ (see Appendix B for details), and we vary $\rho$ over time; and (2) the same as in (1), but we apply a random linear transformation followed by a cubic nonlinearity to $y$ to get samples $(W y)^3$. As long as the linear transformation is full rank, the MI is unchanged. We find that the single-sample unnormalized critic estimates of MI exhibit high variance, and are challenging to tune for even these problems. In contrast, the multi-sample estimates of $I_{\mathrm{NCE}}$ are low variance, but have estimates that saturate at $\log$ (batch size). The interpolated bounds trade off bias for variance, and achieve the best estimates of MI for the second problem. None of the estimators exhibit low variance and good estimates of MI at high rates, supporting the theoretical findings of McAllester & Stratos (2018).
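A NumPy sketch of this data-generating process, with a closed-form ground-truth MI; the batch size, dimensionality, and exact construction of the random linear map are placeholder choices (Appendix B has the settings actually used).

```python
import numpy as np

def correlated_gaussian_batch(rho, dim=20, batch_size=128, cubic=False, seed=0):
    """Samples (x, y) with known MI: each coordinate pair is bivariate Gaussian
    with correlation rho, so I(X; Y) = -(dim / 2) * log(1 - rho^2).

    If cubic=True, y is passed through a fixed random linear map followed by an
    elementwise cube; as long as the map is full rank, the MI is unchanged.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch_size, dim))
    y = rho * x + np.sqrt(1.0 - rho ** 2) * rng.normal(size=(batch_size, dim))
    if cubic:
        W = np.random.default_rng(1234).normal(size=(dim, dim))  # fixed across batches
        y = (y @ W) ** 3
    true_mi = -0.5 * dim * np.log(1.0 - rho ** 2)
    return x, y, true_mi
```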
Efficiency-accuracy tradeoffs for critic architectures. One major difference between the approaches of van den Oord et al. (2018) and Belghazi et al. (2018) is the structure of the critic architecture. van den Oord et al. (2018) use a separable critic $f(x, y) = g(x)^\top h(y)$, which requires only $2K$ forward passes through a neural network for a batch size of $K$. However, Belghazi et al. (2018) use a joint critic, where $x$ and $y$ are concatenated and fed as input to one network, thus requiring $K^2$ forward passes. For both toy problems, we found that separable critics (orange)
Figure 3. Bias and variance of MI estimates with the optimal critic. While $I_{\mathrm{NWJ}}$ is unbiased when given the optimal critic, $I_{\mathrm{NCE}}$ can exhibit large bias that grows linearly with MI. The interpolated $I_{\alpha}$ bounds trade off bias and variance to recover more accurate bounds in terms of MSE in certain regimes.
increased the variance of the estimator and generally performed worse than joint critics (blue) when using $I_{\mathrm{NWJ}}$ or $I_{\mathrm{JS}}$ (Fig. 2). However, joint critics scale poorly with batch size, and it is possible that separable critics require larger neural networks to get similar performance.
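The forward-pass counts follow directly from how the $[K, K]$ score matrix is filled. A sketch with placeholder callables `g`, `h` (embedding networks) and `f_joint` (a network scoring a concatenated pair); these names and signatures are illustrative, not the architectures used in the experiments.

```python
import numpy as np

def separable_scores(x, y, g, h):
    """Separable critic f(x, y) = g(x)^T h(y): 2K forward passes fill the K x K matrix."""
    return g(x) @ h(y).T                                  # [K, K], entry [i, j] = f(x_i, y_j)

def joint_scores(x, y, f_joint):
    """Joint critic f([x; y]) on every concatenated pair: K^2 forward passes."""
    K = x.shape[0]
    pairs = np.concatenate([np.repeat(x, K, axis=0),      # x_i repeated K times
                            np.tile(y, (K, 1))], axis=1)  # y_0..y_{K-1} cycled
    return np.asarray(f_joint(pairs)).reshape(K, K)       # [i, j] = f(x_i, y_j)
```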

Bias-variance tradeoff for optimal critics.

To better understand the behavior of different estimators, we analyzed the bias and variance of each estimator as a function of batch size given the optimal critic (Fig. 3). We again evaluated the estimators on the 20-d correlated Gaussian distribution and varied $\rho$ to achieve different values of MI. While $I_{\mathrm{NWJ}}$ is an unbiased estimator of MI, it exhibits high variance when the MI is large and the batch size is small. As noted in van den Oord et al. (2018), the $I_{\mathrm{NCE}}$ estimate is upper bounded by $\log$ (batch size). This results in high bias but low variance when the batch size is small and the MI is large. In this regime, the absolute value of the bias grows linearly with MI because the objective saturates to a constant while the MI continues to grow linearly. In contrast, the $I_{\alpha}$ bounds are less biased than $I_{\mathrm{NCE}}$ and lower variance than $I_{\mathrm{NWJ}}$, resulting in a mean squared error (MSE) that can be smaller than either $I_{\mathrm{NWJ}}$ or $I_{\mathrm{NCE}}$. We can also see that the leave one out upper bound (Eq. 13) has large bias and variance when the batch size is too small.

Bias-variance tradeoffs for representation learning.

To better understand whether the bias and variance of the estimated MI impact representation learning, we looked at the accuracy of the gradients of the estimates with respect to a stochastic encoder $p_\theta(y \mid x)$ versus the true gradient of MI with respect to the encoder. In order to have access to ground truth gradients, we restrict our model to the correlated Gaussian problem, where we have a separate
Figure 4. Gradient accuracy of MI estimators. Left: MSE between the true encoder gradients and approximate gradients as a function of mutual information and batch size (colors the same as in Fig. 3). Right: for each mutual information and batch size, we evaluated the $I_{\alpha}$ bound with different $\alpha$'s and found the $\alpha$ that had the smallest gradient MSE. For small MI and small batch size, $I_{\mathrm{NCE}}$-like objectives are preferred, while for large MI and large batch size, $I_{\mathrm{NWJ}}$-like objectives are preferred.
correlation parameter $\rho_i$ for each dimension $i$, and look at the gradient of MI with respect to the vector of parameters $\rho$. We evaluate the accuracy of the gradients by computing the MSE between the true and approximate gradients. For different settings of the parameters $\rho$, we identify which $\alpha$ performs best as a function of batch size and mutual information. In Fig. 4, we show that the optimal $\alpha$ for the interpolated bounds depends strongly on batch size and the true mutual information. For smaller batch sizes and MIs, $\alpha$ close to 1 ($I_{\mathrm{NCE}}$-like) is preferred, while for larger batch sizes and MIs, $\alpha$ closer to 0 ($I_{\mathrm{NWJ}}$-like) is preferred. The reduced gradient MSE of the $I_{\alpha}$ bounds points to their utility as an objective for training encoders in the InfoMax setting.

3.1. Decoder-free representation learning on dSprites

Many recent papers in representation learning have focused on learning latent representations in a generative model that correspond to human-interpretable or "disentangled" concepts (Higgins et al., 2016; Burgess et al., 2018; Chen et al., 2018; Kumar et al., 2017). While the exact definition of disentangling remains elusive (Locatello et al., 2018; Higgins et al., 2018; Mathieu et al., 2018), many papers have focused on reducing statistical dependency between latent variables as a proxy (Kim & Mnih, 2018; Chen et al., 2018; Kumar et al., 2017). Here we show how a decoder-free information maximization approach subject to smoothness and independence constraints can retain much of the representation learning capabilities of latent-variable generative models on the dSprites dataset (a 2d dataset of white shapes on a black background with varying shape, rotation, scale, and position from Matthey et al. (2017)).
To estimate and maximize the information contained in the representation $Y$ about the input $X$, we use the $I_{\mathrm{NWJ}}$ lower bound, with a structured critic that leverages the known

stochastic encoder $p(y \mid x)$ but learns an unnormalized variational approximation $q(y)$ to the prior. To encourage independence, we form an upper bound on the total correlation of the representation, $TC(Y)$, by leveraging our novel variational bounds. In particular, we reuse the lower bound of $I(X; Y)$, and use the leave one out upper bounds (Eq. 13) for each $I(X; Y_j)$. Unlike prior work in this area with VAEs (Kim & Mnih, 2018; Chen et al., 2018; Hjelm et al., 2018; Kumar et al., 2017), this approach tractably estimates and removes statistical dependency in the representation without resorting to adversarial techniques, moment matching, or minibatch lower bounds in the wrong direction.
As demonstrated in Krause et al. (2010), information maximization alone is ineffective at learning useful representations from finite data. Furthermore, minimizing statistical dependency is also insufficient, as we can always find an invertible function that maintains the same amount of information and correlation structure, but scrambles the representation (Locatello et al., 2018). We can avoid these issues by introducing additional inductive biases into the representation learning problem. In particular, here we add a simple smoothness regularizer that forces nearby points in $x$ space to be mapped to similar regions in $y$ space.
The resulting regularized InfoMax objective we optimize combines the lower bound on $I(X; Y)$ described above, the upper bound on $TC(Y)$, and the smoothness regularizer (see the sketch after the next paragraph for the assumed structure).
We use the convolutional encoder architecture from Burgess et al. (2018); Locatello et al. (2018) for $p(y \mid x)$, and a two hidden layer fully-connected neural network to parameterize the unnormalized variational marginal $q(y)$ used by the critic.
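The objective equation itself is not reproduced in this snapshot, but the preceding paragraphs pin down its ingredients. The sketch below shows the assumed structure only: an MI lower bound to maximize, the $TC(Y)$ upper bound and a smoothness penalty to subtract. The trade-off weights `lam_tc` and `lam_smooth`, and the L2-between-perturbed-encodings form of the penalty, are hypothetical placeholders rather than the settings used here.

```python
import numpy as np

def smoothness_penalty(encode_mean, x, noise_scale=0.1, rng=None):
    """Hypothetical smoothness term: encodings of x and a slightly perturbed x
    should be close (the exact form of the regularizer is not recoverable here)."""
    rng = rng or np.random.default_rng(0)
    x_noisy = x + noise_scale * rng.normal(size=x.shape)
    return np.mean(np.sum((encode_mean(x) - encode_mean(x_noisy)) ** 2, axis=1))

def regularized_infomax_objective(mi_lower_bound, tc_upper, smooth,
                                  lam_tc=1.0, lam_smooth=1.0):
    """Assumed structure: reward information in the code, penalize dependence
    between code dimensions and non-smooth encodings. Weights are placeholders."""
    return mi_lower_bound - lam_tc * tc_upper - lam_smooth * smooth
```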
Empirically, we find that this variational regularized infomax objective is able to learn x and y position, and scale, but not rotation (Fig. 5, see Chen et al. (2018) for more details on the visualization). To the best of our knowledge, the only other decoder-free representation learning result on dSprites is Pfau & Burgess (2018), which recovers shape and rotation but not scale on a simplified version of the dSprites dataset with one shape.

4. Discussion

In this work, we reviewed and presented several new bounds on mutual information. We showed that our new interpolated bounds are able to trade off bias for variance to yield better estimates of MI. However, none of the approaches we considered here are capable of providing low-variance,
Figure 5. Feature selectivity on dSprites. The representation learned with our regularized InfoMax objective exhibits disentangled features for position and scale, but not rotation. Each row corresponds to a different active latent dimension. The first column depicts the position tuning of the latent variable, where the x and y axes correspond to the position of the sprite, and the color corresponds to the average activation of the latent variable in response to an input at that position (red is high, blue is low). The scale and rotation columns show the average value of the latent on the y axis, and the value of the ground truth factor (scale or rotation) on the x axis.
low-bias estimates when the MI is large and the batch size is small. Future work should identify whether such estimators are impossible (McAllester & Stratos, 2018), or whether certain distributional assumptions or neural network inductive biases can be leveraged to build tractable estimators. Alternatively, it may be easier to estimate gradients of MI than to estimate MI itself. For example, maximizing $I_{\mathrm{BA}}$ is feasible even though we do not have access to the constant data entropy $h(X)$. There may be better approaches in this setting when we do not care about MI estimation and only care about computing gradients of MI for minimization or maximization.
A limitation of our analysis and experiments is that they focus on the regime where the dataset is infinite and there is no overfitting. In this setting, we do not have to worry about differences in MI on training vs. heldout data, nor do we have to tackle biases of finite samples. Addressing and understanding this regime is an important area for future work.
Another open question is whether mutual information maximization is a more useful objective for representation learning than other unsupervised or self-supervised approaches (Noroozi & Favaro, 2016; Doersch et al., 2015; Dosovitskiy et al., 2014). While deviating from mutual information maximization loses a number of connections to information theory, it may provide other mechanisms for learning features that are useful for downstream tasks. In future work, we hope to evaluate these estimators on larger-scale representation learning tasks to address these questions.

References

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., and Murphy, K. Fixing a broken elbo, 2017.
Barber, D. and Agakov, F. The IM algorithm: A variational approach to information maximization. In NIPS, pp. 201-208. MIT Press, 2003.
Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Hjelm, D., and Courville, A. Mutual information neural estimation. In International Conference on Machine Learning, pp. 530-539, 2018.
Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.
Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
Chen, T. Q., Li, X., Grosse, R., and Duvenaud, D. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
Doersch, C., Gupta, A., and Efros, A. A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422-1430, 2015.
Donsker, M. D. and Varadhan, S. S. Asymptotic evaluation of certain markov process expectations for large time. iv. Communications on Pure and Applied Mathematics, 36 (2):183-212, 1983.
Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 766-774, 2014.
Foster, A., Jankowiak, M., Bingham, E., Teh, Y. W., Rainforth, T., and Goodman, N. Variational optimal experiment design: Efficient automation of adaptive experiments. In NeurIPS Bayesian Deep Learning Workshop, 2018.

Gabrié, M., Manoel, A., Luneau, C., Barbier, J., Macris, N., Krzakala, F., and Zdeborová, L. Entropy and mutual information in models of deep neural networks. arXiv preprint arXiv:1805.09785, 2018.
Gao, S., Ver Steeg, G., and Galstyan, A. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pp. 277-286, 2015.
Higgins, I., Matthey, L., Glorot, X., Pal, A., Uria, B., Blundell, C., Mohamed, S., and Lerchner, A. Early visual concept learning with unsupervised deep learning. arXiv preprint arXiv:1606.05579, 2016.
Higgins, I., Amos, D., Pfau, D., Racanière, S., Matthey, L., Rezende, D. J., and Lerchner, A. Towards a definition of disentangled representations. CoRR, abs/1812.02230, 2018.
Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
Hu, W., Miyato, T., Tokui, S., Matsumoto, E., and Sugiyama, M. Learning discrete representations via information maximizing self-augmented training. arXiv preprint arXiv:1702.08720, 2017.
Kim, H. and Mnih, A. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
Kingma, D. P. and Welling, M. Auto-encoding variational bayes. International Conference on Learning Representations, 2013.
Kolchinsky, A., Tracey, B. D., and Wolpert, D. H. Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436, 2017.
Kraskov, A., Stögbauer, H., and Grassberger, P. Estimating mutual information. Physical review E, 69(6):066138, 2004.
Krause, A., Perona, P., and Gomes, R. G. Discriminative clustering by regularized information maximization. In Advances in neural information processing systems, pp. 775-783, 2010.
Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.
Locatello, F., Bauer, S., Lucic, M., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.
Ma, Z. and Collins, M. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. arXiv preprint arXiv:1809.01812, 2018.
Mathieu, E., Rainforth, T., Siddharth, N., and Teh, Y. W. Disentangling disentanglement in variational auto-encoders, 2018.
Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
McAllester, D. and Stratos, K. Formal limitations on the measurement of mutual information, 2018.
Mescheder, L., Nowozin, S., and Geiger, A. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.
Mnih, A. and Teh, Y. W. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.
Moyer, D., Gao, S., Brekelmans, R., Galstyan, A., and Ver Steeg, G. Invariant representations without adversarial training. In Advances in Neural Information Processing Systems, pp. 9102-9111, 2018.
Nemenman, I., Bialek, W., and van Steveninck, R. d. R. Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5), 2004.
Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847-5861, 2010.
Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69-84. Springer, 2016.
Nowozin, S., Cseke, B., and Tomioka, R. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271-279, 2016.
Palmer, S. E., Marre, O., Berry, M. J., and Bialek, W. Predictive information in a sensory population. Proceedings of the National Academy of Sciences, 112(22):6908-6913, 2015.
Paninski, L. Estimation of entropy and mutual information. Neural computation, 15(6):1191-1253, 2003.

Peng, X. B., Kanazawa, A., Toyer, S., Abbeel, P., and Levine, S. Variational discriminator bottleneck: Improving imitation learning, inverse rl, and gans by constraining information flow. arXiv preprint arXiv:1810.00821, 2018.
Pfau, D. and Burgess, C. P. Minimally redundant laplacian eigenmaps. 2018.
Poole, B., Alemi, A. A., Sohl-Dickstein, J., and Angelova, A. Improved generator objectives for gans. arXiv preprint arXiv:1612.02780, 2016.
Rainforth, T., Cornish, R., Yang, H., and Warrington, A. On nesting monte carlo estimators. In International Conference on Machine Learning, pp. 4264-4273, 2018.
Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. Detecting novel associations in large data sets. science, 334(6062):1518-1524, 2011.
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278-1286, 2014.
Ryan, E. G., Drovandi, C. C., McGree, J. M., and Pettitt, A. N. A review of modern computational algorithms for bayesian optimal design. International Statistical Review, 84(1):128-154, 2016.
Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., and Cox, D. D. On the information bottleneck theory of deep learning. In International Conference on Learning Representations, 2018.
Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE, pp. 1-5. IEEE, 2015.
Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.
Tomczak, J. and Welling, M. Vae with a vampprior. In Storkey, A. and Perez-Cruz, F. (eds.), Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pp. 1214-1223, Playa Blanca, Lanzarote, Canary Islands, 09-11 Apr 2018. PMLR.
van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790-4798, 2016.
van den Oord, A., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

  1. Google Brain, MILA, DeepMind. Correspondence to: Ben Poole <pooleb@google.com>.
    Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
  2. can also be derived the opposite direction by plugging the critic into .
  3. The derivation by van den Oord et al. (2018) relied on an approximation, which we show is unnecessary.