
On Variational Bounds of Mutual Information

Ben Poole Sherjil Ozair Aäron van den Oord Alexander A. Alemi George Tucker

Abstract

Estimating and optimizing Mutual Information (MI) is core to many problems in machine learning; however, bounding MI in high dimensions is challenging. To establish tractable and scalable objectives, recent work has turned to variational bounds parameterized by neural networks, but the relationships and tradeoffs between these bounds remain unclear. In this work, we unify these recent developments in a single framework. We find that the existing variational lower bounds degrade when the MI is large, exhibiting either high bias or high variance. To address this problem, we introduce a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance. On high-dimensional, controlled problems, we empirically characterize the bias and variance of the bounds and their gradients and demonstrate the effectiveness of our new bounds for estimation and representation learning.

1. Introduction

Estimating the relationship between pairs of variables is a fundamental problem in science and engineering. Quantifying the degree of the relationship requires a metric that captures a notion of dependency. Here, we focus on mutual information (MI), denoted $I(X;Y)$, which is a reparameterization-invariant measure of dependency:
$$I(X;Y) = \mathbb{E}_{p(x,y)}\left[\log \frac{p(x \mid y)}{p(x)}\right] = \mathbb{E}_{p(x,y)}\left[\log \frac{p(y \mid x)}{p(y)}\right] \tag{1}$$
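To make the quantity concrete, below is a minimal sketch of ours (not from the paper) of the closed-form MI between jointly Gaussian variables; the function name and the specific correlation value are illustrative. Controlled settings like this, where the true MI is known, are what make it possible to check estimators against ground truth.

```python
import numpy as np

def gaussian_mi(rho: float, dim: int = 1) -> float:
    """MI (in nats) between X and Y when each (X_i, Y_i) is a bivariate
    Gaussian pair with correlation rho and the dim coordinates are independent."""
    return -0.5 * dim * np.log(1.0 - rho ** 2)

print(gaussian_mi(rho=0.9, dim=20))  # ~16.6 nats: a "large MI" regime
```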
Mutual information estimators are used in computational neuroscience (Palmer et al., 2015), Bayesian optimal experimental design (Ryan et al., 2016; Foster et al., 2018), understanding neural networks (Tishby et al., 2000; Tishby & Zaslavsky, 2015; Gabrié et al., 2018), and more. In practice, estimating MI is challenging as we typically have access to samples but not the underlying distributions (Paninski, 2003; McAllester & Stratos, 2018). Existing sample-based estimators are brittle, with the hyperparameters of the estimator impacting the scientific conclusions (Saxe et al., 2018).

Figure 1. Schematic of variational bounds of mutual information presented in this paper. Nodes are colored based on their tractability for estimation and optimization: green bounds can be used for both, yellow for optimization but not estimation, and red for neither. Children are derived from their parents by introducing new approximations or assumptions.
Beyond estimation, many methods use upper bounds on MI to limit the capacity or contents of representations. For example in the information bottleneck method (Tishby et al., 2000; Alemi et al., 2016), the representation is optimized to solve a downstream task while being constrained to contain as little information as possible about the input. These techniques have proven useful in a variety of domains, from restricting the capacity of discriminators in GANs (Peng et al., 2018) to preventing representations from containing information about protected attributes (Moyer et al., 2018).
Lastly, there is a growing set of methods in representation learning that maximize the mutual information between a learned representation and an aspect of the data. Specifically, given samples from a data distribution, $x \sim p(x)$, the goal is to learn a stochastic representation of the data $p_\theta(y \mid x)$ that has maximal MI with $X$ subject to constraints on the mapping (e.g. Bell & Sejnowski, 1995; Krause et al., 2010; Hu et al., 2017; van den Oord et al., 2018; Hjelm et al., 2018; Alemi et al., 2017). To maximize MI, we can compute gradients of a lower bound on MI with respect to the parameters $\theta$ of the stochastic encoder $p_\theta(y \mid x)$, which may not require directly estimating MI.
While many parametric and non-parametric (Nemenman et al., 2004; Kraskov et al., 2004; Reshef et al., 2011; Gao et al., 2015) techniques have been proposed to address MI estimation and optimization problems, few of them scale up to the dataset size and dimensionality encountered in modern machine learning problems.
To overcome these scaling difficulties, recent work combines variational bounds (Blei et al., 2017; Donsker & Varadhan, 1983; Barber & Agakov, 2003; Nguyen et al., 2010; Foster et al., 2018) with deep learning (Alemi et al., 2016; 2017; van den Oord et al., 2018; Hjelm et al., 2018; Belghazi et al., 2018) to enable differentiable and tractable estimation of mutual information. These papers introduce flexible parametric distributions or critics parameterized by neural networks that are used to approximate unknown densities ($p(y)$, $p(y \mid x)$) or density ratios ($p(x \mid y)/p(x)$).
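To make the notion of a critic concrete, here is a small sketch of our own (not the authors' code): a separable critic $f(x,y) = g(x)\cdot h(y)$ built from untrained random-weight MLPs, producing a batch-by-batch matrix of scores. In practice the critic's parameters would be trained by maximizing one of the lower bounds discussed below; the names `mlp` and `critic_scores` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, hidden, out_dim):
    """A tiny two-layer ReLU network with fixed random weights (illustrative only)."""
    W1, b1 = rng.normal(size=(in_dim, hidden)) / np.sqrt(in_dim), np.zeros(hidden)
    W2, b2 = rng.normal(size=(hidden, out_dim)) / np.sqrt(hidden), np.zeros(out_dim)
    return lambda z: np.maximum(z @ W1 + b1, 0.0) @ W2 + b2

g, h = mlp(10, 64, 32), mlp(10, 64, 32)

def critic_scores(x, y):
    """Separable critic: scores[i, j] = f(x_i, y_j) = g(x_i) . h(y_j)."""
    return g(x) @ h(y).T

x = rng.normal(size=(128, 10))
y = x + 0.1 * rng.normal(size=(128, 10))  # a toy correlated (x, y) pair
scores = critic_scores(x, y)              # shape [128, 128]
```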
In spite of their effectiveness, the properties of existing variational estimators of MI are not well understood. In this paper, we introduce several results that begin to demystify these approaches and present novel bounds with improved properties (see Fig. 1 for a schematic):
  • We provide a review of existing estimators, discussing their relationships and tradeoffs, including the first proof that the noise contrastive loss in van den Oord et al. (2018) is a lower bound on MI, and that the heuristic "bias corrected gradients" in Belghazi et al. (2018) can be justified as unbiased estimates of the gradients of a different lower bound on MI.
  • We derive a new continuum of multi-sample lower bounds that can flexibly trade off bias and variance, generalizing the bounds of (Nguyen et al., 2010; van den Oord et al., 2018).
  • We show how to leverage known conditional structure, yielding simple lower and upper bounds that sandwich MI in the representation learning context when $p(y \mid x)$ is tractable.
  • We systematically evaluate the bias and variance of MI estimators and their gradients on controlled highdimensional problems.
  • We demonstrate the utility of our variational upper and lower bounds in the context of decoder-free disentangled representation learning on dSprites (Matthey et al., 2017).

2. Variational bounds of MI

Here, we review existing variational bounds on MI in a unified framework, and present several new bounds that trade off bias and variance and naturally leverage known conditional densities when they are available. A schematic of the bounds we consider is presented in Fig. 1. We begin by reviewing the classic upper and lower bounds of Barber & Agakov (2003) and then show how to derive the lower bounds of Donsker & Varadhan (1983); Nguyen et al. (2010); Belghazi et al. (2018) from an unnormalized variational distribution. Generalizing the unnormalized bounds to the multi-sample setting yields the bound proposed in van den Oord et al. (2018), and provides the basis for our interpolated bound.

2.1. Normalized upper and lower bounds

Upper bounding MI is challenging, but is possible when the conditional distribution $p(y \mid x)$ is known (e.g. in deep representation learning where $y$ is the stochastic representation). We can build a tractable variational upper bound by introducing a variational approximation $q(y)$ to the intractable marginal $p(y)$. By multiplying and dividing the integrand in MI by $q(y)$ and dropping a negative KL term, we get a tractable variational upper bound (Barber & Agakov, 2003):
$$I(X;Y) = \mathbb{E}_{p(x,y)}\left[\log \frac{p(y \mid x)}{q(y)}\right] - \mathrm{KL}\left(p(y) \,\|\, q(y)\right) \le \mathbb{E}_{p(x)}\left[\mathrm{KL}\left(p(y \mid x) \,\|\, q(y)\right)\right] \equiv R,$$
which is often referred to as the rate $R$ in generative models (Alemi et al., 2017). This bound is tight when $q(y) = p(y)$, and requires that computing $\log q(y)$ is tractable. This variational upper bound is often used as a regularizer to limit the capacity of a stochastic representation (e.g. Rezende et al., 2014; Kingma & Welling, 2013; Burgess et al., 2018). In Alemi et al. (2016), this upper bound is used to prevent the representation from carrying information about the input that is irrelevant for the downstream classification task.
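As a sketch of how this upper bound is computed in practice (our illustration, assuming a diagonal-Gaussian encoder $p(y \mid x) = \mathcal{N}(\mu(x), \mathrm{diag}(\sigma(x)^2))$ and a standard normal $q(y) = \mathcal{N}(0, I)$, which are common but not required choices), the per-example KL term has a closed form and is averaged over a minibatch:

```python
import numpy as np

def gaussian_rate(mu, sigma):
    """Monte-Carlo estimate of R = E_{p(x)}[ KL(p(y|x) || q(y)) ], assuming
    p(y|x) = N(mu(x), diag(sigma(x)^2)) and q(y) = N(0, I).
    mu, sigma: [batch, dim] arrays produced by an encoder (not shown here)."""
    kl_per_dim = 0.5 * (mu ** 2 + sigma ** 2 - 2.0 * np.log(sigma) - 1.0)
    return kl_per_dim.sum(axis=1).mean()

# toy minibatch of encoder outputs
mu = 0.5 * np.random.randn(4, 8)
sigma = np.full((4, 8), 0.8)
print(gaussian_rate(mu, sigma))  # upper bound on I(X;Y) in nats for this batch
```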
Unlike the upper bound, most variational lower bounds on mutual information do not require direct knowledge of any conditional densities. To establish an initial lower bound on mutual information, we factor MI the opposite direction as the upper bound, and replace the intractable conditional distribution $p(x \mid y)$ with a tractable optimization problem over a variational distribution $q(x \mid y)$. As shown in Barber & Agakov (2003), this yields a lower bound on MI due to the non-negativity of the KL divergence:
$$I(X;Y) = \mathbb{E}_{p(x,y)}\left[\log \frac{q(x \mid y)}{p(x)}\right] + \mathbb{E}_{p(y)}\left[\mathrm{KL}\left(p(x \mid y) \,\|\, q(x \mid y)\right)\right] \ge \mathbb{E}_{p(x,y)}\left[\log q(x \mid y)\right] + h(X) \equiv I_{\mathrm{BA}}, \tag{2}$$
where $h(X)$ is the differential entropy of $X$. The bound is tight when $q(x \mid y) = p(x \mid y)$, in which case the first term equals the negative conditional entropy, $-h(X \mid Y)$.
Unfortunately, evaluating this objective is generally intractable as the differential entropy of $X$ is often unknown. If $h(X)$ is known, this provides a tractable estimate of a lower bound on MI. Otherwise, one can still compare the amount of information different variables (e.g., $Y_1$ and $Y_2$) carry about $X$.
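When $h(X)$ is known (e.g., for synthetic data), the bound admits a simple Monte-Carlo estimate. The sketch below is ours and assumes a diagonal-Gaussian variational decoder $q(x \mid y)$; the argument names are illustrative.

```python
import numpy as np

def i_ba_estimate(x, x_mean, x_logvar, h_x):
    """Monte-Carlo estimate of I_BA = E_{p(x,y)}[log q(x|y)] + h(X), assuming
    q(x|y) = N(x_mean(y), diag(exp(x_logvar(y)))).
    x: [batch, dim] data paired with the y's fed to the decoder;
    x_mean, x_logvar: decoder outputs; h_x: known differential entropy of X."""
    log_q = -0.5 * (((x - x_mean) ** 2) / np.exp(x_logvar)
                    + x_logvar + np.log(2.0 * np.pi)).sum(axis=1)
    return log_q.mean() + h_x

# usage: i_ba_estimate(x_batch, dec_mean, dec_logvar, h_x=known_entropy)
```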
In the representation learning context where $X$ is data and $Y$ is a learned stochastic representation, the first term of $I_{\mathrm{BA}}$ can be thought of as negative reconstruction error or distortion, and the gradient of $I_{\mathrm{BA}}$ with respect to the "encoder" $p(y \mid x)$ and variational "decoder" $q(x \mid y)$ is tractable. Thus we can use this objective to learn an encoder $p(y \mid x)$ that maximizes $I(X;Y)$ as in Alemi et al. (2017). However, this approach to representation learning requires building a tractable decoder $q(x \mid y)$, which is challenging when $X$ is high-dimensional and $I(X;Y)$ is large, for example in video representation learning (van den Oord et al., 2016).

2.2. Unnormalized lower bounds

To derive tractable lower bounds that do not require a tractable decoder, we turn to unnormalized distributions for the variational family of $q(x \mid y)$, and show how this recovers the estimators of Donsker & Varadhan (1983) and Nguyen et al. (2010).
We choose an energy-based variational family that uses a critic $f(x, y)$ and is scaled by the data density $p(x)$:
$$q(x \mid y) = \frac{p(x)}{Z(y)} e^{f(x, y)}, \quad \text{where } Z(y) = \mathbb{E}_{p(x)}\left[e^{f(x, y)}\right]. \tag{3}$$
Substituting this distribution into (Eq. 2) gives a lower bound on MI which we refer to as $I_{\mathrm{UBA}}$ for the Unnormalized version of the Barber and Agakov bound:
$$I_{\mathrm{UBA}} \equiv \mathbb{E}_{p(x,y)}\left[f(x, y)\right] - \mathbb{E}_{p(y)}\left[\log Z(y)\right] \le I(X;Y). \tag{4}$$
This bound is tight when $f(x, y) = \log p(y \mid x) + c(y)$, where $c(y)$ is solely a function of $y$ (and not $x$). Note that by scaling $q(x \mid y)$ by $p(x)$, the intractable differential entropy term in $I_{\mathrm{BA}}$ cancels, but we are still left with an intractable log partition function, $\log Z(y)$, that prevents evaluation or gradient computation. If we apply Jensen's inequality to $\mathbb{E}_{p(y)}\left[\log Z(y)\right]$, we can lower bound Eq. 4 to recover the bound of Donsker & Varadhan (1983):
$$I_{\mathrm{UBA}} \ge \mathbb{E}_{p(x,y)}\left[f(x, y)\right] - \log \mathbb{E}_{p(x)p(y)}\left[e^{f(x, y)}\right] \equiv I_{\mathrm{DV}}. \tag{5}$$
However, this objective is still intractable. Applying Jensen's inequality in the other direction by replacing $\log Z(y)$ with $\mathbb{E}_{p(x)}\left[f(x, y)\right]$ results in a tractable objective, but produces an upper bound on Eq. 4 (which is itself a lower bound on mutual information). Thus evaluating $I_{\mathrm{DV}}$ using a Monte-Carlo approximation of the expectations as in MINE (Belghazi et al., 2018) produces estimates that are neither an upper or lower bound on MI. Recent work has studied the convergence and asymptotic consistency of such nested Monte-Carlo estimators, but does not address the problem of building bounds that hold with finite samples (Rainforth et al., 2018; Mathieu et al., 2018).
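For concreteness, here is a sketch of our own of the Monte-Carlo evaluation described above, computed from a batch score matrix (diagonal entries score joint pairs, off-diagonal entries score independently drawn pairs, a common minibatch construction we assume here). The log of an empirical mean inside `log_mean_exp` is exactly what makes this evaluation biased.

```python
import numpy as np

def log_mean_exp(a):
    """Numerically stable log(mean(exp(a)))."""
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

def dv_estimate(scores):
    """Monte-Carlo evaluation of the Donsker-Varadhan objective from a
    [batch, batch] matrix with scores[i, j] = f(x_i, y_j): diagonal entries
    are joint samples, off-diagonal entries approximate p(x)p(y)."""
    n = scores.shape[0]
    joint_term = np.diag(scores).mean()                # E_{p(x,y)}[f(x,y)]
    marginal_scores = scores[~np.eye(n, dtype=bool)]   # independent pairs
    return joint_term - log_mean_exp(marginal_scores)  # biased: log of an average
```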
To form a tractable bound, we can upper bound the log partition function using the inequality: $\log(x) \le \frac{x}{a} + \log(a) - 1$ for all $x, a > 0$. Applying this inequality to the second term of Eq. 4 gives: $\mathbb{E}_{p(y)}\left[\log Z(y)\right] \le \mathbb{E}_{p(y)}\left[\frac{Z(y)}{a(y)} + \log(a(y)) - 1\right]$, which is tight when $a(y) = Z(y)$. This results in a Tractable Unnormalized version of the Barber and Agakov (TUBA) lower bound on MI that admits unbiased estimates and gradients:
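Combining the pieces above, the resulting bound takes the form $I_{\mathrm{TUBA}} = 1 + \mathbb{E}_{p(x,y)}\left[f(x, y)\right] - \mathbb{E}_{p(y)}\left[\frac{Z(y)}{a(y)} + \log a(y)\right]$, and every expectation can be estimated without bias from samples. The sketch below is ours; it assumes the same minibatch score-matrix construction as in the previous sketch, and takes the baseline as a per-example $\log a(y)$ vector, which could come from a learned network or be a constant (the choice $a(y) = e$ recovers the bound of Nguyen et al. (2010)).

```python
import numpy as np

def tuba_estimate(scores, log_a):
    """Unbiased Monte-Carlo evaluation of the TUBA bound from a [batch, batch]
    score matrix (diagonal = joint pairs, off-diagonal = independent pairs) and
    a per-example baseline log_a[j] = log a(y_j). Setting log_a = 1 everywhere
    (i.e. a(y) = e) recovers the bound of Nguyen et al. (2010)."""
    n = scores.shape[0]
    joint_term = np.diag(scores).mean()           # E_{p(x,y)}[f(x,y)]
    mask = ~np.eye(n, dtype=bool)
    # Z_hat(y_j) = mean over independent x_i of exp(f(x_i, y_j))
    z_hat = np.where(mask, np.exp(scores), 0.0).sum(axis=0) / (n - 1)
    partition_term = np.mean(z_hat / np.exp(log_a) + log_a - 1.0)
    return joint_term - partition_term
```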