
Representation Learning: A Review and New Perspectives

Yoshua Bengio, Aaron Courville, and Pascal Vincent

Department of computer science and operations research, U. Montreal
† also, Canadian Institute for Advanced Research (CIFAR)
Abstract

The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.

Index Terms:
Deep learning, representation learning, feature learning, unsupervised learning, Boltzmann Machine, autoencoder, neural nets

1 Introduction

The performance of machine learning methods is heavily dependent on the choice of data representation (or features) on which they are applied. For that reason, much of the actual effort in deploying machine learning algorithms goes into the design of preprocessing pipelines and data transformations that result in a representation of the data that can support effective machine learning. Such feature engineering is important but labor-intensive and highlights the weakness of current learning algorithms: their inability to extract and organize the discriminative information from the data. Feature engineering is a way to take advantage of human ingenuity and prior knowledge to compensate for that weakness. In order to expand the scope and ease of applicability of machine learning, it would be highly desirable to make learning algorithms less dependent on feature engineering, so that novel applications could be constructed faster, and more importantly, to make progress towards Artificial Intelligence (AI). An AI must fundamentally understand the world around us, and we argue that this can only be achieved if it can learn to identify and disentangle the underlying explanatory factors hidden in the observed milieu of low-level sensory data.

This paper is about representation learning, i.e., learning representations of the data that make it easier to extract useful information when building classifiers or other predictors. In the case of probabilistic models, a good representation is often one that captures the posterior distribution of the underlying explanatory factors for the observed input. A good representation is also one that is useful as input to a supervised predictor. Among the various ways of learning representations, this paper focuses on deep learning methods: those that are formed by the composition of multiple non-linear transformations, with the goal of yielding more abstract – and ultimately more useful – representations. Here we survey this rapidly developing area with special emphasis on recent progress. We consider some of the fundamental questions that have been driving research in this area. Specifically, what makes one representation better than another? Given an example, how should we compute its representation, i.e. perform feature extraction? Also, what are appropriate objectives for learning good representations?

2 Why should we care about learning representations?

Representation learning has become a field in itself in the machine learning community, with regular workshops at the leading conferences such as NIPS and ICML, and a new conference dedicated to it, ICLR (the International Conference on Learning Representations), sometimes under the header of Deep Learning or Feature Learning. Although depth is an important part of the story, many other priors are interesting and can be conveniently captured when the problem is cast as one of learning a representation, as discussed in the next section. The rapid increase in scientific activity on representation learning has been accompanied and nourished by a remarkable string of empirical successes both in academia and in industry. Below, we briefly highlight some of these high points.

Speech Recognition and Signal Processing

Speech was one of the early applications of neural networks, in particular convolutional (or time-delay) neural networks (see Bengio (1993) for a review of early work in this area). The recent revival of interest in neural networks, deep learning, and representation learning has had a strong impact in the area of speech recognition, with breakthrough results (Dahl et al., 2010; Deng et al., 2010; Seide et al., 2011a; Mohamed et al., 2012; Dahl et al., 2012; Hinton et al., 2012) obtained by several academics as well as researchers at industrial labs bringing these algorithms to a larger scale and into products. For example, Microsoft has released in 2012 a new version of their MAVIS (Microsoft Audio Video Indexing Service) speech system based on deep learning (Seide et al., 2011a). These authors managed to reduce the word error rate on four major benchmarks by about 30% (e.g. from 27.4% to 18.5% on RT03S) compared to state-of-the-art models based on Gaussian mixtures for the acoustic modeling and trained on the same amount of data (309 hours of speech). The relative improvement in error rate obtained by Dahl et al. (2012) on a smaller large-vocabulary speech recognition benchmark (Bing mobile business search dataset, with 40 hours of speech) is between 16% and 23%.

Representation-learning algorithms have also been applied to music, substantially beating the state-of-the-art in polyphonic transcription (Boulanger-Lewandowski et al., 2012), with relative error improvement between 5% and 30% on a standard benchmark of 4 datasets. Deep learning also helped to win MIREX (Music Information Retrieval) competitions, e.g. in 2011 on audio tagging (Hamel et al., 2011).

Object Recognition

The beginnings of deep learning in 2006 have focused on the MNIST digit image classification problem (Hinton et al., 2006; Bengio et al., 2007), breaking the supremacy of SVMs (1.4% error) on this dataset (for the knowledge-free version of the task, where no image-specific prior is used, such as image deformations or convolutions). The latest records are still held by deep networks: Ciresan et al. (2012) currently claims the title of state-of-the-art for the unconstrained version of the task (e.g., using a convolutional architecture), with 0.27% error, and Rifai et al. (2011c) is state-of-the-art for the knowledge-free version of MNIST, with 0.81% error.

In the last few years, deep learning has moved from digits to object recognition in natural images, and the latest breakthrough has been achieved on the ImageNet dataset (the 1000-class ImageNet benchmark, whose results are detailed at http://www.image-net.org/challenges/LSVRC/2012/results.html), bringing down the state-of-the-art error rate from 26.1% to 15.3% (Krizhevsky et al., 2012).

Natural Language Processing

Besides speech recognition, there are many other Natural Language Processing (NLP) applications of representation learning. Distributed representations for symbolic data were introduced by Hinton (1986), and first developed in the context of statistical language modeling by Bengio et al. (2003) in so-called neural net language models (Bengio, 2008). They are all based on learning a distributed representation for each word, called a word embedding. Adding a convolutional architecture, Collobert et al. (2011) developed the SENNA system (downloadable from http://ml.nec-labs.com/senna/) that shares representations across the tasks of language modeling, part-of-speech tagging, chunking, named entity recognition, semantic role labeling and syntactic parsing. SENNA approaches or surpasses the state-of-the-art on these tasks but is simpler and much faster than traditional predictors. Learning word embeddings can be combined with learning image representations in a way that allows text and images to be associated. This approach has been used successfully to build Google’s image search, exploiting huge quantities of data to map images and queries in the same space (Weston et al., 2010) and it has recently been extended to deeper multi-modal representations (Srivastava and Salakhutdinov, 2012).

The neural net language model was also improved by adding recurrence to the hidden layers (Mikolov et al., 2011), allowing it to beat the state-of-the-art (smoothed n-gram models) not only in terms of perplexity (exponential of the average negative log-likelihood of predicting the right next word, going down from 140 to 102) but also in terms of word error rate in speech recognition (since the language model is an important component of a speech recognition system), decreasing it from 17.2% (KN5 baseline) or 16.9% (discriminative language model) to 14.4% on the Wall Street Journal benchmark task. Similar models have been applied in statistical machine translation (Schwenk et al., 2012; Le et al., 2013), improving perplexity and BLEU scores. Recursive auto-encoders (which generalize recurrent networks) have also been used to beat the state-of-the-art in full sentence paraphrase detection (Socher et al., 2011a) almost doubling the F1 score for paraphrase detection. Representation learning can also be used to perform word sense disambiguation (Bordes et al., 2012), bringing up the accuracy from 67.8% to 70.2% on the subset of Senseval-3 where the system could be applied (with subject-verb-object sentences). Finally, it has also been successfully used to surpass the state-of-the-art in sentiment analysis (Glorot et al., 2011b; Socher et al., 2011b).

Multi-Task and Transfer Learning, Domain Adaptation

Transfer learning is the ability of a learning algorithm to exploit commonalities between different learning tasks in order to share statistical strength, and transfer knowledge across tasks. As discussed below, we hypothesize that representation learning algorithms have an advantage for such tasks because they learn representations that capture underlying factors, a subset of which may be relevant for each particular task, as illustrated in Figure 1. This hypothesis seems confirmed by a number of empirical results showing the strengths of representation learning algorithms in transfer learning scenarios.


Figure 1: Illustration of representation-learning discovering explanatory factors (middle hidden layer, in red), some explaining the input (semi-supervised setting), and some explaining the target for each task. Because these subsets overlap, sharing of statistical strength helps generalization.

Most impressive are the two transfer learning challenges held in 2011 and won by representation learning algorithms. First, the Transfer Learning Challenge, presented at an ICML 2011 workshop of the same name, was won using unsupervised layer-wise pre-training (Bengio, 2011; Mesnil et al., 2011). A second Transfer Learning Challenge was held the same year and won by Goodfellow et al. (2011). Results were presented at NIPS 2011’s Challenges in Learning Hierarchical Models Workshop. In the related domain adaptation setup, the target remains the same but the input distribution changes (Glorot et al., 2011b; Chen et al., 2012). In the multi-task learning setup, representation learning has also been found advantageous (Krizhevsky et al., 2012; Collobert et al., 2011), because of shared factors across tasks.

3 What makes a representation good?

3.1 Priors for Representation Learning in AI

In Bengio and LeCun (2007), one of us introduced the notion of AI-tasks, which are challenging for current machine learning algorithms, and involve complex but highly structured dependencies. One reason why explicitly dealing with representations is interesting is because they can be convenient to express many general priors about the world around us, i.e., priors that are not task-specific but would be likely to be useful for a learning machine to solve AI-tasks. Examples of such general-purpose priors are the following:

\bullet Smoothness: assumes the function to be learned $f$ is s.t. $x \approx y$ generally implies $f(x) \approx f(y)$. This most basic prior is present in most machine learning, but is insufficient to get around the curse of dimensionality, see Section 3.2.

\bullet Multiple explanatory factors: the data generating distribution is generated by different underlying factors, and for the most part what one learns about one factor generalizes in many configurations of the other factors. The objective to recover or at least disentangle these underlying factors of variation is discussed in Section 3.5. This assumption is behind the idea of distributed representations, discussed in Section 3.3 below.

\bullet A hierarchical organization of explanatory factors: the concepts that are useful for describing the world around us can be defined in terms of other concepts, in a hierarchy, with more abstract concepts higher in the hierarchy, defined in terms of less abstract ones. This assumption is exploited with deep representations, elaborated in Section 3.4 below.

\bullet Semi-supervised learning: with inputs $X$ and target $Y$ to predict, a subset of the factors explaining $X$'s distribution explain much of $Y$, given $X$. Hence representations that are useful for $P(X)$ tend to be useful when learning $P(Y|X)$, allowing sharing of statistical strength between the unsupervised and supervised learning tasks, see Section 4.

\bullet Shared factors across tasks: with many $Y$'s of interest or many learning tasks in general, tasks (e.g., the corresponding $P(Y|X,{\rm task})$) are explained by factors that are shared with other tasks, allowing sharing of statistical strengths across tasks, as discussed in the previous section (Multi-Task and Transfer Learning, Domain Adaptation).

\bullet Manifolds: probability mass concentrates near regions that have a much smaller dimensionality than the original space where the data lives. This is explicitly exploited in some of the auto-encoder algorithms and other manifold-inspired algorithms described respectively in Sections 7.2 and 8.

\bullet Natural clustering: different values of categorical variables such as object classes are associated with separate manifolds. More precisely, the local variations on the manifold tend to preserve the value of a category, and a linear interpolation between examples of different classes in general involves going through a low density region, i.e., $P(X|Y=i)$ for different $i$ tend to be well separated and not overlap much. For example, this is exploited in the Manifold Tangent Classifier discussed in Section 8.3. This hypothesis is consistent with the idea that humans have named categories and classes because of such statistical structure (discovered by their brain and propagated by their culture), and machine learning tasks often involve predicting such categorical variables.

\bullet Temporal and spatial coherence: consecutive (from a sequence) or spatially nearby observations tend to be associated with the same value of relevant categorical concepts, or result in a small move on the surface of the high-density manifold. More generally, different factors change at different temporal and spatial scales, and many categorical concepts of interest change slowly. When attempting to capture such categorical variables, this prior can be enforced by making the associated representations slowly changing, i.e., penalizing changes in values over time or space. This prior was introduced in Becker and Hinton (1992) and is discussed in Section 11.3.

\bullet Sparsity: for any given observation $x$, only a small fraction of the possible factors are relevant. In terms of representation, this could be represented by features that are often zero (as initially proposed by Olshausen and Field (1996)), or by the fact that most of the extracted features are insensitive to small variations of $x$. This can be achieved with certain forms of priors on latent variables (peaked at 0), or by using a non-linearity whose value is often flat at 0 (i.e., 0 and with a 0 derivative), or simply by penalizing the magnitude of the Jacobian matrix (of derivatives) of the function mapping input to representation. This is discussed in Sections 6.1.1 and 7.2.

\bullet Simplicity of Factor Dependencies: in good high-level representations, the factors are related to each other through simple, typically linear dependencies. This can be seen in many laws of physics, and is assumed when plugging a linear predictor on top of a learned representation.

We can view many of the above priors as ways to help the learner discover and disentangle some of the underlying (and a priori unknown) factors of variation that the data may reveal. This idea is pursued further in Sections 3.5 and 11.4.

3.2 Smoothness and the Curse of Dimensionality

For AI-tasks, such as vision and NLP, it seems hopeless to rely only on simple parametric models (such as linear models) because they cannot capture enough of the complexity of interest unless provided with the appropriate feature space. Conversely, machine learning researchers have sought flexibility in local non-parametric learners such as kernel machines with a fixed generic local-response kernel (such as the Gaussian kernel); local in the sense that the value of the learned function at $x$ depends mostly on training examples $x^{(t)}$'s close to $x$. Unfortunately, as argued at length by Bengio and Monperrus (2005); Bengio et al. (2006a); Bengio and LeCun (2007); Bengio (2009); Bengio et al. (2010), most of these algorithms only exploit the principle of local generalization, i.e., the assumption that the target function (to be learned) is smooth enough, so they rely on examples to explicitly map out the wrinkles of the target function. Generalization is mostly achieved by a form of local interpolation between neighboring training examples. Although smoothness can be a useful assumption, it is insufficient to deal with the curse of dimensionality, because the number of such wrinkles (ups and downs of the target function) may grow exponentially with the number of relevant interacting factors, when the data are represented in raw input space. We advocate learning algorithms that are flexible and non-parametric (we understand non-parametric as including all learning algorithms whose capacity can be increased appropriately as the amount of data and its complexity demands it, e.g. including mixture models and neural networks where the number of parameters is a data-selected hyper-parameter) but do not rely exclusively on the smoothness assumption. Instead, we propose to incorporate generic priors such as those enumerated above into representation-learning algorithms. Smoothness-based learners (such as kernel machines) and linear models can still be useful on top of such learned representations. In fact, the combination of learning a representation and kernel machine is equivalent to learning the kernel, i.e., the feature space. Kernel machines are useful, but they depend on a prior definition of a suitable similarity metric, or a feature space in which naive similarity metrics suffice. We would like to use the data, along with very generic priors, to discover those features, or equivalently, a similarity function.

3.3 Distributed representations

Good representations are expressive, meaning that a reasonably-sized learned representation can capture a huge number of possible input configurations. A simple counting argument helps us to assess the expressiveness of a model producing a representation: how many parameters does it require compared to the number of input regions (or configurations) it can distinguish? Learners of one-hot representations, such as traditional clustering algorithms, Gaussian mixtures, nearest-neighbor algorithms, decision trees, or Gaussian SVMs all require $O(N)$ parameters (and/or $O(N)$ examples) to distinguish $O(N)$ input regions. One could naively believe that one cannot do better. However, RBMs, sparse coding, auto-encoders or multi-layer neural networks can all represent up to $O(2^k)$ input regions using only $O(N)$ parameters (with $k$ the number of non-zero elements in a sparse representation, and $k=N$ in non-sparse RBMs and other dense representations). These are all distributed representations (where $k$ out of $N$ representation elements or feature values can be independently varied, e.g., they are not mutually exclusive: each concept is represented by having $k$ features being turned on or active, while each feature is involved in representing many concepts) or sparse representations (distributed representations where only a few of the elements can be varied at a time, i.e., $k<N$). The generalization of clustering to distributed representations is multi-clustering, where either several clusterings take place in parallel or the same clustering is applied on different parts of the input, such as in the very popular hierarchical feature extraction for object recognition based on a histogram of cluster categories detected in different patches of an image (Lazebnik et al., 2006; Coates and Ng, 2011a). The exponential gain from distributed or sparse representations is discussed further in section 3.2 (and Figure 3.2) of Bengio (2009). It comes about because each parameter (e.g. the parameters of one of the units in a sparse code, or one of the units in a Restricted Boltzmann Machine) can be re-used in many examples that are not simply near neighbors of each other, whereas with local generalization, different regions in input space are basically associated with their own private set of parameters, e.g., as in decision trees, nearest-neighbors, Gaussian SVMs, etc. In a distributed representation, an exponentially large number of possible subsets of features or hidden units can be activated in response to a given input. In a single-layer model, each feature is typically associated with a preferred input direction, corresponding to a hyperplane in input space, and the code or representation associated with that input is precisely the pattern of activation (which features respond to the input, and how much). This is in contrast with a non-distributed representation such as the one learned by most clustering algorithms, e.g., k-means, in which the representation of a given input vector is a one-hot code identifying which one of a small number of cluster centroids best represents the input. As discussed in Bengio (2009), things are only slightly better when allowing continuous-valued membership values, e.g., in ordinary mixture models (with separate parameters for each mixture component), but the difference in representational power is still exponential (Montufar and Morton, 2012). The situation may also seem better with a decision tree, where each given input is associated with a one-hot code over the tree leaves, which deterministically selects associated ancestors (the path from root to node). Unfortunately, the number of different regions represented (equal to the number of leaves of the tree) still only grows linearly with the number of parameters used to specify it (Bengio and Delalleau, 2011).
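To make the counting argument concrete, here is a purely illustrative Python sketch (the random anchor points, hyperplanes, and sizes below are arbitrary choices, not part of any of the cited models). It compares a one-hot coder that assigns each input to the nearest of $N$ anchor points, as a clustering algorithm would, with a distributed binary coder given by the signs of $N$ random linear projections; with the same parameter budget, the former distinguishes at most $N$ input regions while the latter reaches on the order of $2^N$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 10))   # many 10-dimensional inputs
N = 8                                    # same small "parameter budget" for both coders

# One-hot code: index of the nearest of N anchor points (as in k-means clustering).
anchors = rng.standard_normal((N, 10))
one_hot_codes = np.argmin(((X[:, None, :] - anchors[None]) ** 2).sum(-1), axis=1)

# Distributed code: N binary features given by the signs of N random hyperplanes.
W = rng.standard_normal((10, N))
distributed_codes = (X @ W > 0).astype(int)

print(len(np.unique(one_hot_codes)))               # at most N distinct regions (here 8)
print(len(np.unique(distributed_codes, axis=0)))   # up to 2**N regions (here close to 256)
```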

3.4 Depth and abstraction

Depth is a key aspect to representation learning strategies we consider in this paper. As we will discuss, deep architectures are often challenging to train effectively and this has been the subject of much recent research and progress. However, despite these challenges, they carry two significant advantages that motivate our long-term interest in discovering successful training strategies for deep architectures. These advantages are: (1) deep architectures promote the re-use of features, and (2) deep architectures can potentially lead to progressively more abstract features at higher layers of representations (more removed from the data).

Feature re-use. The notion of re-use, which explains the power of distributed representations, is also at the heart of the theoretical advantages behind deep learning, i.e., constructing multiple levels of representation or learning a hierarchy of features. The depth of a circuit is the length of the longest path from an input node of the circuit to an output node of the circuit. The crucial property of a deep circuit is that its number of paths, i.e., ways to re-use different parts, can grow exponentially with its depth. Formally, one can change the depth of a given circuit by changing the definition of what each node can compute, but only by a constant factor. The typical computations we allow in each node include: weighted sum, product, artificial neuron model (such as a monotone non-linearity on top of an affine transformation), computation of a kernel, or logic gates. Theoretical results clearly show families of functions where a deep representation can be exponentially more efficient than one that is insufficiently deep (Håstad, 1986; Håstad and Goldmann, 1991; Bengio et al., 2006a; Bengio and LeCun, 2007; Bengio and Delalleau, 2011). If the same family of functions can be represented with fewer parameters (or more precisely with a smaller VC-dimension), learning theory would suggest that it can be learned with fewer examples, yielding improvements in both computational efficiency (less nodes to visit) and statistical efficiency (less parameters to learn, and re-use of these parameters over many different kinds of inputs).
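As a small illustration of this path-counting argument (an illustrative sketch with arbitrary layer widths, not a reproduction of the cited circuit-complexity results), the snippet below counts input-to-output paths in a fully-connected layered graph by dynamic programming; the parameter count grows linearly with depth while the number of ways parts can be re-used grows exponentially:

```python
def count_io_paths(layer_widths):
    """Count directed input-to-output paths in a fully-connected layered graph:
    paths_to[j] is the number of distinct paths reaching node j of the current layer."""
    paths_to = [1] * layer_widths[0]                 # one trivial path per input node
    for width in layer_widths[1:]:
        # every node of the new layer receives all paths ending in the previous layer
        paths_to = [sum(paths_to)] * width
    return sum(paths_to)

widths = [8, 8, 8, 8]                                        # three layers of weights
n_params = sum(a * b for a, b in zip(widths, widths[1:]))    # 192 weights (biases ignored)
print(n_params, count_io_paths(widths))                      # 192 parameters vs. 4096 paths
```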

Abstraction and invariance. Deep architectures can lead to abstract representations because more abstract concepts can often be constructed in terms of less abstract ones. In some cases, such as in the convolutional neural network (LeCun et al., 1998b), we build this abstraction in explicitly via a pooling mechanism (see section 11.2). More abstract concepts are generally invariant to most local changes of the input. That makes the representations that capture these concepts generally highly non-linear functions of the raw input. This is obviously true of categorical concepts, where more abstract representations detect categories that cover more varied phenomena (e.g. larger manifolds with more wrinkles) and thus they potentially have greater predictive power. Abstraction can also appear in high-level continuous-valued attributes that are only sensitive to some very specific types of changes in the input. Learning these sorts of invariant features has been a long-standing goal in pattern recognition.

3.5 Disentangling Factors of Variation

Beyond being distributed and invariant, we would like our representations to disentangle the factors of variation. Different explanatory factors of the data tend to change independently of each other in the input distribution, and only a few at a time tend to change when one considers a sequence of consecutive real-world inputs.

Complex data arise from the rich interaction of many sources. These factors interact in a complex web that can complicate AI-related tasks such as object classification. For example, an image is composed of the interaction between one or more light sources, the object shapes and the material properties of the various surfaces present in the image. Shadows from objects in the scene can fall on each other in complex patterns, creating the illusion of object boundaries where there are none and dramatically affecting the perceived object shape. How can we cope with these complex interactions? How can we disentangle the objects and their shadows? Ultimately, we believe the approach we adopt for overcoming these challenges must leverage the data itself, using vast quantities of unlabeled examples, to learn representations that separate the various explanatory sources. Doing so should give rise to a representation significantly more robust to the complex and richly structured variations extant in natural data sources for AI-related tasks.

It is important to distinguish between the related but distinct goals of learning invariant features and learning to disentangle explanatory factors. The central difference is the preservation of information. Invariant features, by definition, have reduced sensitivity in the direction of invariance. This is the goal of building features that are insensitive to variation in the data that are uninformative to the task at hand. Unfortunately, it is often difficult to determine a priori which set of features and variations will ultimately be relevant to the task at hand. Further, as is often the case in the context of deep learning methods, the feature set being trained may be destined to be used in multiple tasks that may have distinct subsets of relevant features. Considerations such as these lead us to the conclusion that the most robust approach to feature learning is to disentangle as many factors as possible, discarding as little information about the data as is practical. If some form of dimensionality reduction is desirable, then we hypothesize that the local directions of variation least represented in the training data should be first to be pruned out (as in PCA, for example, which does it globally instead of around each example).

3.6 Good criteria for learning representations?

One of the challenges of representation learning that distinguishes it from other machine learning tasks such as classification is the difficulty in establishing a clear objective, or target for training. In the case of classification, the objective is (at least conceptually) obvious, we want to minimize the number of misclassifications on the training dataset. In the case of representation learning, our objective is far-removed from the ultimate objective, which is typically learning a classifier or some other predictor. Our problem is reminiscent of the credit assignment problem encountered in reinforcement learning. We have proposed that a good representation is one that disentangles the underlying factors of variation, but how do we translate that into appropriate training criteria? Is it even necessary to do anything but maximize likelihood under a good model or can we introduce priors such as those enumerated above (possibly data-dependent ones) that help the representation better do this disentangling? This question remains clearly open but is discussed in more detail in Sections 3.5 and 11.4.

4 Building Deep Representations

In 2006, a breakthrough in feature learning and deep learning was initiated by Geoff Hinton and quickly followed up in the same year (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007), and soon after by Lee et al. (2008) and many more later. It has been extensively reviewed and discussed in Bengio (2009). A central idea, referred to as greedy layerwise unsupervised pre-training, was to learn a hierarchy of features one level at a time, using unsupervised feature learning to learn a new transformation at each level to be composed with the previously learned transformations; essentially, each iteration of unsupervised feature learning adds one layer of weights to a deep neural network. Finally, the set of layers could be combined to initialize a deep supervised predictor, such as a neural network classifier, or a deep generative model, such as a Deep Boltzmann Machine (Salakhutdinov and Hinton, 2009).
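The following minimal numpy sketch illustrates the greedy layerwise idea under simplifying assumptions (tied-weight sigmoid auto-encoders trained by plain batch gradient descent on squared reconstruction error, with arbitrary sizes and learning rate); it is not the procedure of the cited papers, which typically stack RBMs or denoising auto-encoders with more careful training:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_layer(X, n_hidden, lr=0.1, epochs=50, seed=0):
    """Train one tied-weight sigmoid auto-encoder on X by batch gradient descent
    on the squared reconstruction error; return the learned encoder (W, b)."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = 0.1 * rng.standard_normal((n_in, n_hidden))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W + b)                     # encoder f(x)
        R = sigmoid(H @ W.T + c)                   # tied-weight decoder g(h)
        dR = (R - X) * R * (1 - R)                 # gradient through the output sigmoid
        dH = (dR @ W) * H * (1 - H)                # back-propagated into the code
        W -= lr * (X.T @ dH + dR.T @ H) / len(X)   # W appears in encoder and decoder
        b -= lr * dH.mean(0)
        c -= lr * dR.mean(0)
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Stack auto-encoders: each layer is trained on the codes of the previous one."""
    layers, H = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_layer(H, n_hidden)
        layers.append((W, b))
        H = sigmoid(H @ W + b)          # representation fed to the next layer
    return layers, H                    # weights could initialize a deep supervised net

X = np.random.default_rng(1).random((500, 64))
layers, top_code = greedy_pretrain(X, [32, 16])
print(top_code.shape)                   # (500, 16)
```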

This paper is mostly about feature learning algorithms that can be used to form deep architectures. In particular, it was empirically observed that layerwise stacking of feature extraction often yielded better representations, e.g., in terms of classification error (Larochelle et al., 2009; Erhan et al., 2010b), quality of the samples generated by a probabilistic model (Salakhutdinov and Hinton, 2009) or in terms of the invariance properties of the learned features (Goodfellow et al., 2009). Whereas this section focuses on the idea of stacking single-layer models, Section 10 follows up with a discussion on joint training of all the layers.

After greedy layerwise unsupervised pre-training, the resulting deep features can be used either as input to a standard supervised machine learning predictor (such as an SVM) or as initialization for a deep supervised neural network (e.g., by appending a logistic regression layer or purely supervised layers of a multi-layer neural network). The layerwise procedure can also be applied in a purely supervised setting, called the greedy layerwise supervised pre-training (Bengio et al., 2007). For example, after the first one-hidden-layer MLP is trained, its output layer is discarded and another one-hidden-layer MLP can be stacked on top of it, etc. Although results reported in Bengio et al. (2007) were not as good as for unsupervised pre-training, they were nonetheless better than without pre-training at all. Alternatively, the outputs of the previous layer can be fed as extra inputs for the next layer (in addition to the raw input), as successfully done in Yu et al. (2010). Another variant (Seide et al., 2011b) pre-trains in a supervised way all the previously added layers at each step of the iteration, and in their experiments this discriminant variant yielded better results than unsupervised pre-training.

Whereas combining single layers into a supervised model is straightforward, it is less clear how layers pre-trained by unsupervised learning should be combined to form a better unsupervised model. We cover here some of the approaches to do so, but no clear winner emerges and much work has to be done to validate existing proposals or improve them.

The first proposal was to stack pre-trained RBMs into a Deep Belief Network (Hinton et al., 2006) or DBN, where the top layer is interpreted as an RBM and the lower layers as a directed sigmoid belief network. However, it is not clear how to approximate maximum likelihood training to further optimize this generative model. One option is the wake-sleep algorithm (Hinton et al., 2006) but more work should be done to assess the efficiency of this procedure in terms of improving the generative model.

The second approach that has been put forward is to combine the RBM parameters into a Deep Boltzmann Machine (DBM), by basically halving the RBM weights to obtain the DBM weights (Salakhutdinov and Hinton, 2009). The DBM can then be trained by approximate maximum likelihood as discussed in more details later (Section 10.2). This joint training has brought substantial improvements, both in terms of likelihood and in terms of classification performance of the resulting deep feature learner (Salakhutdinov and Hinton, 2009).

Another early approach was to stack RBMs or auto-encoders into a deep auto-encoder (Hinton and Salakhutdinov, 2006). If we have a series of encoder-decoder pairs $(f^{(i)}(\cdot), g^{(i)}(\cdot))$, then the overall encoder is the composition of the encoders, $f^{(N)}(\ldots f^{(2)}(f^{(1)}(\cdot)))$, and the overall decoder is its “transpose” (often with transposed weight matrices as well), $g^{(1)}(g^{(2)}(\ldots g^{(N)}(\cdot)))$. The deep auto-encoder (or its regularized version, as discussed in Section 7.2) can then be jointly trained, with all the parameters optimized with respect to a global reconstruction error criterion. More work on this avenue clearly needs to be done, and it was probably avoided by fear of the challenges in training deep feedforward networks, discussed in Section 10 along with very encouraging recent results.
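A sketch of this composition (illustrative code with randomly initialized tied weights rather than pre-trained ones): the overall encoder applies $f^{(1)}$ through $f^{(N)}$ in order, the overall decoder applies $g^{(N)}$ back down to $g^{(1)}$ with transposed weight matrices, and joint training would minimize the resulting global reconstruction error (only evaluated here; the gradient-based fine-tuning itself is omitted):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
sizes = [64, 32, 16]                                   # d_x -> 32 -> 16
Ws = [0.1 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]                  # encoder biases
cs = [np.zeros(m) for m in sizes[:-1]]                 # decoder biases

def encode(x):                                         # f^(N)(... f^(2)(f^(1)(x)))
    for W, b in zip(Ws, bs):
        x = sigmoid(x @ W + b)
    return x

def decode(h):                                         # g^(1)(... g^(N)(h)), transposed weights
    for W, c in zip(reversed(Ws), reversed(cs)):
        h = sigmoid(h @ W.T + c)
    return h

X = rng.random((100, sizes[0]))
global_reconstruction_error = np.mean((decode(encode(X)) - X) ** 2)
print(global_reconstruction_error)                     # criterion minimized in joint fine-tuning
```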

Yet another recently proposed approach to training deep architectures  (Ngiam et al., 2011) is to consider the iterative construction of a free energy function (i.e., with no explicit latent variables, except possibly for a top-level layer of hidden units) for a deep architecture as the composition of transformations associated with lower layers, followed by top-level hidden units. The question is then how to train a model defined by an arbitrary parametrized (free) energy function. Ngiam et al. (2011) have used Hybrid Monte Carlo (Neal, 1993), but other options include contrastive divergence (Hinton, 1999; Hinton et al., 2006), score matching (Hyvärinen, 2005, 2008), denoising score matching (Kingma and LeCun, 2010; Vincent, 2011), ratio-matching (Hyvärinen, 2007) and noise-contrastive estimation (Gutmann and Hyvarinen, 2010).

5 Single-layer learning modules

Within the community of researchers interested in representation learning, there has developed two broad parallel lines of inquiry: one rooted in probabilistic graphical models and one rooted in neural networks. Fundamentally, the difference between these two paradigms is whether the layered architecture of a deep learning model is to be interpreted as describing a probabilistic graphical model or as describing a computation graph. In short, are hidden units considered latent random variables or as computational nodes?

To date, the dichotomy between these two paradigms has remained in the background, perhaps because they appear to have more characteristics in common than separating them. We suggest that this is likely a function of the fact that much recent progress in both of these areas has focused on single-layer greedy learning modules and the similarities between the types of single-layer models that have been explored: mainly, the restricted Boltzmann machine (RBM) on the probabilistic side, and the auto-encoder variants on the neural network side. Indeed, as shown by one of us (Vincent, 2011) and others (Swersky et al., 2011), in the case of the restricted Boltzmann machine, training the model via an inductive principle known as score matching (Hyvärinen, 2005) (to be discussed in sec. 6.4.3) is essentially identical to applying a regularized reconstruction objective to an auto-encoder. Another strong link between pairs of models on both sides of this divide is when the computational graph for computing representation in the neural network model corresponds exactly to the computational graph that corresponds to inference in the probabilistic model, and this happens to also correspond to the structure of graphical model itself (e.g., as in the RBM).

The connection between these two paradigms becomes more tenuous when we consider deeper models where, in the case of a probabilistic model, exact inference typically becomes intractable. In the case of deep models, the computational graph diverges from the structure of the model. For example, in the case of a deep Boltzmann machine, unrolling variational (approximate) inference into a computational graph results in a recurrent graph structure. We have performed preliminary exploration (Savard, 2011) of deterministic variants of deep auto-encoders whose computational graph is similar to that of a deep Boltzmann machine (in fact very close to the mean-field variational approximations associated with the Boltzmann machine), and that is one interesting intermediate point to explore (between the deterministic approaches and the graphical model approaches).

In the next few sections we will review the major developments in single-layer training modules used to support feature learning and particularly deep learning. We divide these sections between (Section 6) the probabilistic models, with inference and training schemes that directly parametrize the generative – or decoding – pathway and (Section 7) the typically neural network-based models that directly parametrize the encoding pathway. Interestingly, some models, like Predictive Sparse Decomposition (PSD) (Kavukcuoglu et al., 2008) inherit both properties, and will also be discussed (Section 7.2.4). We then present a different view of representation learning, based on the associated geometry and the manifold assumption, in Section 8.

First, let us consider an unsupervised single-layer representation learning algorithm spanning all three views: probabilistic, auto-encoder, and manifold learning.

Principal Components Analysis

We will use probably the oldest feature extraction algorithm, principal components analysis (PCA), to illustrate the probabilistic, auto-encoder and manifold views of representation-learning. PCA learns a linear transformation $h = f(x) = W^T x + b$ of input $x \in \mathbb{R}^{d_x}$, where the columns of the $d_x \times d_h$ matrix $W$ form an orthogonal basis for the $d_h$ orthogonal directions of greatest variance in the training data. The result is $d_h$ features (the components of representation $h$) that are decorrelated. The three interpretations of PCA are the following: a) it is related to probabilistic models (Section 6) such as probabilistic PCA, factor analysis and the traditional multivariate Gaussian distribution (the leading eigenvectors of the covariance matrix are the principal components); b) the representation it learns is essentially the same as that learned by a basic linear auto-encoder (Section 7.2); and c) it can be viewed as a simple linear form of linear manifold learning (Section 8), i.e., characterizing a lower-dimensional region in input space near which the data density is peaked. Thus, PCA may be in the back of the reader’s mind as a common thread relating these various viewpoints. Unfortunately the expressive power of linear features is very limited: they cannot be stacked to form deeper, more abstract representations since the composition of linear operations yields another linear operation. Here, we focus on recent algorithms that have been developed to extract non-linear features, which can be stacked in the construction of deep networks, although some authors simply insert a non-linearity between learned single-layer linear projections (Le et al., 2011c; Chen et al., 2012).
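A minimal sketch of this linear transformation (illustrative variable names and data): $W$ is obtained from the top right singular vectors of the centered training data, equivalently the leading eigenvectors of the sample covariance, and $b = -W^T \bar{x}$ so that the code of the mean input is zero:

```python
import numpy as np

def pca_fit(X, d_h):
    """Return (W, b) such that h = W.T x + b projects onto the d_h directions
    of greatest variance in the training data, yielding decorrelated features."""
    mean = X.mean(axis=0)
    # Rows of Vt (right singular vectors of the centered data) are the eigenvectors
    # of the sample covariance, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:d_h].T                      # d_x x d_h matrix with orthonormal columns
    b = -W.T @ mean                     # so that the mean input maps to the zero code
    return W, b

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))   # correlated inputs
W, b = pca_fit(X, d_h=2)
H = X @ W + b                           # the learned representation h = W^T x + b
print(np.round(np.cov(H.T), 3))         # (nearly) diagonal: the components are decorrelated
```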

Another rich family of feature extraction techniques that this review does not cover in any detail due to space constraints is Independent Component Analysis or ICA (Jutten and Herault, 1991; Bell and Sejnowski, 1997). Instead, we refer the reader to Hyvärinen et al. (2001a, 2009). Note that, while in the simplest case (complete, noise-free) ICA yields linear features, in the more general case it can be equated with a linear generative model with non-Gaussian independent latent variables, similar to sparse coding (section 6.1.1), which result in non-linear features. Therefore, ICA and its variants like Independent and Topographic ICA (Hyvärinen et al., 2001b) can and have been used to build deep networks (Le et al., 2010, 2011c): see section 11.2. The notion of obtaining independent components also appears similar to our stated goal of disentangling underlying explanatory factors through deep networks. However, for complex real-world distributions, it is doubtful that the relationship between truly independent underlying factors and the observed high-dimensional data can be adequately characterized by a linear transformation.

6 Probabilistic Models

From the probabilistic modeling perspective, the question of feature learning can be interpreted as an attempt to recover a parsimonious set of latent random variables that describe a distribution over the observed data. We can express as $p(x,h)$ a probabilistic model over the joint space of the latent variables, $h$, and observed data or visible variables $x$. Feature values are conceived as the result of an inference process to determine the probability distribution of the latent variables given the data, i.e. $p(h \mid x)$, often referred to as the posterior probability. Learning is conceived in terms of estimating a set of model parameters that (locally) maximizes the regularized likelihood of the training data. The probabilistic graphical model formalism gives us two possible modeling paradigms in which we can consider the question of inferring latent variables, directed and undirected graphical models, which differ in their parametrization of the joint distribution $p(x,h)$, yielding major impact on the nature and computational costs of both inference and learning.

6.1 Directed Graphical Models

Directed latent factor models separately parametrize the conditional likelihood $p(x \mid h)$ and the prior $p(h)$ to construct the joint distribution, $p(x,h) = p(x \mid h)\,p(h)$. Examples of this decomposition include: Principal Components Analysis (PCA) (Roweis, 1997; Tipping and Bishop, 1999), sparse coding (Olshausen and Field, 1996), sigmoid belief networks (Neal, 1992) and the newly introduced spike-and-slab sparse coding model (Goodfellow et al., 2011).

6.1.1 Explaining Away

Directed models often lead to one important property: explaining away, i.e., a priori independent causes of an event can become non-independent given the observation of the event. Latent factor models can generally be interpreted as latent cause models, where the $h$ activations cause the observed $x$. This renders the a priori independent $h$ to be non-independent. As a consequence, recovering the posterior distribution of $h$, $p(h \mid x)$ (which we use as a basis for feature representation), is often computationally challenging and can be entirely intractable, especially when $h$ is discrete.

A classic example that illustrates the phenomenon is to imagine you are on vacation away from home and you receive a phone call from the security system company, telling you that the alarm has been activated. You begin worrying your home has been burglarized, but then you hear on the radio that a minor earthquake has been reported in the area of your home. If you happen to know from prior experience that earthquakes sometimes cause your home alarm system to activate, then suddenly you relax, confident that your home has very likely not been burglarized.

The example illustrates how the alarm activation rendered two otherwise entirely independent causes, burglarized and earthquake, to become dependent – in this case, the dependency is one of mutual exclusivity. Since both burglarized and earthquake are very rare events and both can cause alarm activation, the observation of one explains away the other. Despite the computational obstacles we face when attempting to recover the posterior over $h$, explaining away promises to provide a parsimonious $p(h \mid x)$, which can be an extremely useful characteristic of a feature encoding scheme. If one thinks of a representation as being composed of various feature detectors and estimated attributes of the observed input, it is useful to allow the different features to compete and collaborate with each other to explain the input. This is naturally achieved with directed graphical models, but can also be achieved with undirected models (see Section 6.2) such as Boltzmann machines if there are lateral connections between the corresponding units or corresponding interaction terms in the energy function that defines the probability model.

Probabilistic Interpretation of PCA. PCA can be given a natural probabilistic interpretation (Roweis, 1997; Tipping and Bishop, 1999) as factor analysis:

$$p(h) = \mathcal{N}(h;\, 0,\, \sigma_h^2 \mathbf{I})$$
$$p(x \mid h) = \mathcal{N}(x;\, Wh + \mu_x,\, \sigma_x^2 \mathbf{I}), \qquad (1)$$

where $x \in \mathbb{R}^{d_x}$, $h \in \mathbb{R}^{d_h}$, $\mathcal{N}(v; \mu, \Sigma)$ is the multivariate normal density of $v$ with mean $\mu$ and covariance $\Sigma$, and columns of $W$ span the same space as the leading $d_h$ principal components, but are not constrained to be orthonormal.
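To make the generative reading of Eq. 1 concrete, a minimal sketch (NumPy; $W$, $\mu_x$ and the noise scales below are arbitrary illustrative values) draws $h$ from its isotropic Gaussian prior and then $x$ from the conditional Gaussian $p(x \mid h)$:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 4, 2
W = rng.normal(size=(d_x, d_h))       # factor loading matrix; columns span the principal subspace
mu_x = np.zeros(d_x)
sigma_h, sigma_x = 1.0, 0.1

h = rng.normal(0.0, sigma_h, size=d_h)              # h ~ N(0, sigma_h^2 I)
x = W @ h + mu_x + rng.normal(0.0, sigma_x, d_x)    # x | h ~ N(W h + mu_x, sigma_x^2 I)
print("latent h:", h)
print("observed x:", x)
```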

Sparse Coding. Like PCA, sparse coding has both a probabilistic and non-probabilistic interpretation. Sparse coding also relates a latent representation $h$ (either a vector of random variables or a feature vector, depending on the interpretation) to the data $x$ through a linear mapping $W$, which we refer to as the dictionary. The difference between sparse coding and PCA is that sparse coding includes a penalty to ensure a sparse activation of $h$ is used to encode each input $x$. From a non-probabilistic perspective, sparse coding can be seen as recovering the code or feature vector associated with a new input $x$ via:

$$h^{*} = f(x) = \operatorname*{argmin}_{h} \|x - Wh\|_2^2 + \lambda \|h\|_1, \qquad (2)$$

Learning the dictionary $W$ can be accomplished by optimizing the following training criterion with respect to $W$:

$$\mathcal{J}_{\mathrm{SC}} = \sum_{t} \|x^{(t)} - W h^{*(t)}\|_2^2, \qquad (3)$$

where $x^{(t)}$ is the $t$-th example and $h^{*(t)}$ is the corresponding sparse code determined by Eq. 2. $W$ is usually constrained to have unit-norm columns (because one can arbitrarily exchange scaling of column $i$ with scaling of $h^{(t)}_i$, such a constraint is necessary for the L1 penalty to have any effect).
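The inference of Eq. 2 and the dictionary update for Eq. 3 can be sketched as follows. This is only an illustration: ISTA is used here as one convenient solver for the L1-regularized inference problem, and the dictionary step is plain gradient descent with column renormalization, not necessarily the algorithm used in any of the works cited above.

```python
import numpy as np

def ista(x, W, lam, n_steps=100):
    """Approximately solve h* = argmin_h ||x - W h||_2^2 + lam * ||h||_1 (Eq. 2)."""
    L = np.linalg.norm(W, 2) ** 2          # spectral norm squared; step size 1/(2L) below
    h = np.zeros(W.shape[1])
    for _ in range(n_steps):
        grad = W.T @ (W @ h - x)           # gradient of the squared error (up to a factor 2)
        z = h - grad / L
        h = np.sign(z) * np.maximum(np.abs(z) - lam / (2 * L), 0.0)   # soft-thresholding
    return h

def dictionary_step(X, W, lam, lr=0.1):
    """One gradient step on J_SC = sum_t ||x(t) - W h*(t)||_2^2 (Eq. 3), then renormalize columns."""
    H = np.stack([ista(x, W, lam) for x in X])        # sparse codes, one row per example
    grad_W = -2 * (X - H @ W.T).T @ H                 # d J_SC / d W
    W = W - lr * grad_W / len(X)
    return W / np.linalg.norm(W, axis=0, keepdims=True), H   # unit-norm columns

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                          # 50 toy examples in R^8
W = rng.normal(size=(8, 16))                          # overcomplete dictionary, d_h = 16
W /= np.linalg.norm(W, axis=0, keepdims=True)
for _ in range(20):
    W, H = dictionary_step(X, W, lam=0.5)
print("average fraction of exact zeros per code:", np.mean(H == 0.0))
```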

The probabilistic interpretation of sparse coding differs from that of PCA, in that instead of a Gaussian prior on the latent random variable $h$, we use a sparsity-inducing Laplace prior (corresponding to an L1 penalty):

$$p(h) = \prod_{i}^{d_h} \frac{\lambda}{2} \exp(-\lambda |h_i|)$$
$$p(x \mid h) = \mathcal{N}(x;\, Wh + \mu_x,\, \sigma_x^2 \mathbf{I}). \qquad (4)$$

In the case of sparse coding, because we will seek a sparse representation (i.e., one with many features set to exactly zero), we will be interested in recovering the MAP (maximum a posteriori) value of $h$: i.e. $h^{*} = \operatorname*{argmax}_h p(h \mid x)$ rather than its expected value $\mathbb{E}[h \mid x]$. Under this interpretation, dictionary learning proceeds as maximizing the likelihood of the data given these MAP values of $h^{*}$: $\operatorname*{argmax}_W \prod_t p(x^{(t)} \mid h^{*(t)})$ subject to the norm constraint on $W$. Note that this parameter learning scheme, subject to the MAP values of the latent $h$, is not standard practice in the probabilistic graphical model literature. Typically the likelihood of the data $p(x) = \sum_h p(x \mid h)\,p(h)$ is maximized directly. In the presence of latent variables, expectation maximization is employed, where the parameters are optimized with respect to the marginal likelihood, i.e., summing or integrating the joint log-likelihood over all values of the latent variables under their posterior $P(h \mid x)$, rather than considering only the single MAP value of $h$. The theoretical properties of this form of parameter learning are not yet well understood but seem to work well in practice (e.g. k-Means vs Gaussian mixture models and Viterbi training for HMMs). Note also that the interpretation of sparse coding as a MAP estimation can be questioned (Gribonval, 2011), because even though the interpretation of the L1 penalty as a log-prior is a possible interpretation, there can be other Bayesian interpretations compatible with the training criterion.

Sparse coding is an excellent example of the power of explaining away. Even with a very overcomplete dictionary (overcomplete: with more dimensions of $h$ than dimensions of $x$), the MAP inference process used in sparse coding to find $h^{*}$ can pick out the most appropriate bases and zero the others, despite them having a high degree of correlation with the input. This property arises naturally in directed graphical models such as sparse coding and is entirely owing to the explaining away effect. It is not seen in commonly used undirected probabilistic models such as the RBM, nor is it seen in parametric feature encoding methods such as auto-encoders. The trade-off is that, compared to methods such as RBMs and auto-encoders, inference in sparse coding involves an extra inner-loop of optimization to find $h^{*}$, with a corresponding increase in the computational cost of feature extraction. Compared to auto-encoders and RBMs, the code in sparse coding is a free variable for each example, and in that sense the implicit encoder is non-parametric.

One might expect that the parsimony of the sparse coding representation and its explaining away effect would be advantageous and indeed it seems to be the case. Coates and Ng (2011a) demonstrated on the CIFAR-10 object classification task (Krizhevsky and Hinton, 2009) with a patch-based feature extraction pipeline, that in the regime with few ($<1000$) labeled training examples per class, the sparse coding representation significantly outperformed other highly competitive encoding schemes. Possibly because of these properties, and because of the very computationally efficient algorithms that have been proposed for it (in comparison with the general case of inference in the presence of explaining away), sparse coding enjoys considerable popularity as a feature learning and encoding paradigm. There are numerous examples of its successful application as a feature representation scheme, including natural image modeling (Raina et al., 2007; Kavukcuoglu et al., 2008; Coates and Ng, 2011a; Yu et al., 2011), audio classification (Grosse et al., 2007), NLP (Bagnell and Bradley, 2009), as well as being a very successful model of the early visual cortex (Olshausen and Field, 1996). Sparsity criteria can also be generalized successfully to yield groups of features that prefer to all be zero, but if one or a few of them are active then the penalty for activating others in the group is small. Different group sparsity patterns can incorporate different forms of prior knowledge (Kavukcuoglu et al., 2009; Jenatton et al., 2009; Bach et al., 2011; Gregor et al., 2011).

Spike-and-Slab Sparse Coding. Spike-and-slab sparse coding (S3C) is one example of a promising variation on sparse coding for feature learning (Goodfellow et al., 2012). The S3C model possesses a set of latent binary spike variables together with a set of latent real-valued slab variables. The activation of the spike variables dictates the sparsity pattern. S3C has been applied to the CIFAR-10 and CIFAR-100 object classification tasks (Krizhevsky and Hinton, 2009), and shows the same pattern as sparse coding of superior performance in the regime of relatively few ($<1000$) labeled examples per class (Goodfellow et al., 2012). In fact, in both the CIFAR-100 dataset (with 500 examples per class) and the CIFAR-10 dataset (when the number of examples is reduced to a similar range), the S3C representation actually outperforms sparse coding representations. This advantage was revealed clearly with S3C winning the NIPS'2011 Transfer Learning Challenge (Goodfellow et al., 2011).

6.2 Undirected Graphical Models

Undirected graphical models, also called Markov random fields (MRFs), parametrize the joint $p(x,h)$ through a product of unnormalized non-negative clique potentials:

$$p(x,h) = \frac{1}{Z_\theta} \prod_i \psi_i(x) \prod_j \eta_j(h) \prod_k \nu_k(x,h) \qquad (5)$$

where $\psi_i(x)$, $\eta_j(h)$ and $\nu_k(x,h)$ are the clique potentials describing the interactions between the visible elements, between the hidden variables, and those between the visible and hidden variables, respectively. The partition function $Z_\theta$ ensures that the distribution is normalized. Within the context of unsupervised feature learning, we generally see a particular form of Markov random field called a Boltzmann distribution with clique potentials constrained to be positive:

$$p(x,h) = \frac{1}{Z_\theta} \exp\left(-\mathcal{E}_\theta(x,h)\right), \qquad (6)$$

where $\mathcal{E}_\theta(x,h)$ is the energy function and contains the interactions described by the MRF clique potentials and $\theta$ are the model parameters that characterize these interactions.

The Boltzmann machine was originally defined as a network of symmetrically-coupled binary random variables or units. These stochastic units can be divided into two groups: (1) the visible units $x \in \{0,1\}^{d_x}$ that represent the data, and (2) the hidden or latent units $h \in \{0,1\}^{d_h}$ that mediate dependencies between the visible units through their mutual interactions. The pattern of interaction is specified through the energy function:

$$\mathcal{E}^{\mathrm{BM}}_\theta(x,h) = -\frac{1}{2} x^T U x - \frac{1}{2} h^T V h - x^T W h - b^T x - d^T h, \qquad (7)$$

where $\theta = \{U, V, W, b, d\}$ are the model parameters which respectively encode the visible-to-visible interactions, the hidden-to-hidden interactions, the visible-to-hidden interactions, the visible self-connections, and the hidden self-connections (called biases). To avoid over-parametrization, the diagonals of $U$ and $V$ are set to zero.

The Boltzmann machine energy function specifies the probability distribution over $[x,h]$, via the Boltzmann distribution, Eq. 6, with the partition function $Z_\theta$ given by:

$$Z_\theta = \sum_{x_1=0}^{x_1=1} \cdots \sum_{x_{d_x}=0}^{x_{d_x}=1} \; \sum_{h_1=0}^{h_1=1} \cdots \sum_{h_{d_h}=0}^{h_{d_h}=1} \exp\left(-\mathcal{E}^{\mathrm{BM}}_\theta(x,h;\theta)\right). \qquad (8)$$
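For small models the quantities in Eqs. 7–8 can be evaluated directly; the following minimal sketch (NumPy, with random illustrative parameters of our own choosing) computes the energy of one configuration and the partition function by brute-force enumeration, which is exactly what becomes infeasible as $d_x$ and $d_h$ grow:

```python
import numpy as np
from itertools import product

def bm_energy(x, h, U, V, W, b, d):
    """Boltzmann machine energy of Eq. 7 for binary vectors x, h (diagonals of U, V are zero)."""
    return (-0.5 * x @ U @ x - 0.5 * h @ V @ h
            - x @ W @ h - b @ x - d @ h)

rng = np.random.default_rng(0)
d_x, d_h = 3, 2
U = rng.normal(size=(d_x, d_x)); U = (U + U.T) / 2; np.fill_diagonal(U, 0.0)
V = rng.normal(size=(d_h, d_h)); V = (V + V.T) / 2; np.fill_diagonal(V, 0.0)
W = rng.normal(size=(d_x, d_h))
b, d = rng.normal(size=d_x), rng.normal(size=d_h)

x = np.array([1.0, 0.0, 1.0]); h = np.array([1.0, 1.0])
print("energy of (x, h):", bm_energy(x, h, U, V, W, b, d))

# Eq. 8 by brute force: 2^{d_x} * 2^{d_h} = 32 terms here, exponential in general
Z = sum(np.exp(-bm_energy(np.array(xc, dtype=float), np.array(hc, dtype=float), U, V, W, b, d))
        for xc in product([0, 1], repeat=d_x) for hc in product([0, 1], repeat=d_h))
print("partition function Z:", Z)
```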

This joint probability distribution gives rise to the set of conditional distributions of the form:

$$P(h_i \mid x, h_{\setminus i}) = \mathrm{sigmoid}\left(\sum_j W_{ji} x_j + \sum_{i' \neq i} V_{ii'} h_{i'} + d_i\right) \qquad (9)$$
$$P(x_j \mid h, x_{\setminus j}) = \mathrm{sigmoid}\left(\sum_i W_{ji} h_i + \sum_{j' \neq j} U_{jj'} x_{j'} + b_j\right). \qquad (10)$$

In general, inference in the Boltzmann machine is intractable. For example, computing the conditional probability of $h_i$ given the visibles, $P(h_i \mid x)$, requires marginalizing over the rest of the hiddens, which implies evaluating a sum with $2^{d_h - 1}$ terms:

$$P(h_i \mid x) = \sum_{h_1=0}^{h_1=1} \cdots \sum_{h_{i-1}=0}^{h_{i-1}=1} \; \sum_{h_{i+1}=0}^{h_{i+1}=1} \cdots \sum_{h_{d_h}=0}^{h_{d_h}=1} P(h \mid x) \qquad (11)$$

However with some judicious choices in the pattern of interactions between the visible and hidden units, more tractable subsets of the model family are possible, as we discuss next.

Restricted Boltzmann Machines (RBMs). The RBM is likely the most popular subclass of Boltzmann machine (Smolensky, 1986). It is defined by restricting the interactions in the Boltzmann energy function, in Eq. 7, to only those between $h$ and $x$, i.e. $\mathcal{E}^{\mathrm{RBM}}_\theta$ is $\mathcal{E}^{\mathrm{BM}}_\theta$ with $U = \mathbf{0}$ and $V = \mathbf{0}$. As such, the RBM can be said to form a bipartite graph with the visibles and the hiddens forming two layers of vertices in the graph (and no connection between units of the same layer). With this restriction, the RBM possesses the useful property that the conditional distribution over the hidden units factorizes given the visibles:

$$P(h \mid x) = \prod_i P(h_i \mid x)$$
$$P(h_i = 1 \mid x) = \mathrm{sigmoid}\left(\sum_j W_{ji} x_j + d_i\right). \qquad (12)$$

Likewise, the conditional distribution over the visible units given the hiddens also factorizes:

$$P(x \mid h) = \prod_j P(x_j \mid h)$$
$$P(x_j = 1 \mid h) = \mathrm{sigmoid}\left(\sum_i W_{ji} h_i + b_j\right). \qquad (13)$$

This makes inferences readily tractable in RBMs. For example, the RBM feature representation is taken to be the set of posterior marginals $P(h_i \mid x)$, which, given the conditional independence described in Eq. 12, are immediately available. Note that this is in stark contrast to the situation with popular directed graphical models for unsupervised feature extraction, where computing the posterior probability is intractable.
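A minimal sketch of the factorized conditionals of Eqs. 12–13 and the block Gibbs sampling they enable (NumPy; the parameters are random rather than trained, and the helper names are ours):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_x(x, W, d):
    return sigmoid(x @ W + d)          # Eq. 12: factorized posterior marginals P(h_i = 1 | x)

def p_x_given_h(h, W, b):
    return sigmoid(h @ W.T + b)        # Eq. 13: factorized P(x_j = 1 | h)

def block_gibbs_step(x, W, b, d, rng):
    h = (rng.random(W.shape[1]) < p_h_given_x(x, W, d)).astype(float)
    x = (rng.random(W.shape[0]) < p_x_given_h(h, W, b)).astype(float)
    return x, h

rng = np.random.default_rng(0)
d_x, d_h = 6, 4
W = 0.1 * rng.normal(size=(d_x, d_h))
b, d = np.zeros(d_x), np.zeros(d_h)
x = rng.integers(0, 2, d_x).astype(float)
for _ in range(10):                    # a short block Gibbs chain
    x, h = block_gibbs_step(x, W, b, d, rng)
print("feature representation P(h_i=1 | x):", p_h_given_x(x, W, d))
```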

Importantly, the tractability of the RBM does not extend to its partition function, which still involves summing an exponential number of terms. It does imply however that we can limit the number of terms to $\min\{2^{d_x}, 2^{d_h}\}$. Usually this is still an unmanageable number of terms and therefore we must resort to approximate methods to deal with its estimation.

It is difficult to overstate the impact the RBM has had to the fields of unsupervised feature learning and deep learning. It has been used in a truly impressive variety of applications, including fMRI image classification (Schmah et al., 2009), motion and spatial transformations (Taylor and Hinton, 2009; Memisevic and Hinton, 2010), collaborative filtering (Salakhutdinov et al., 2007) and natural image modeling (Ranzato and Hinton, 2010; Courville et al., 2011b).

6.3 Generalizations of the RBM to Real-valued data

Important progress has been made in the last few years in defining generalizations of the RBM that better capture real-valued data, in particular real-valued image data, by better modeling the conditional covariance of the input pixels. The standard RBM, as discussed above, is defined with both binary visible variables $v \in \{0,1\}$ and binary latent variables $h \in \{0,1\}$. The tractability of inference and learning in the RBM has inspired many authors to extend it, via modifications of its energy function, to model other kinds of data distributions. In particular, there have been multiple attempts to develop RBM-type models of real-valued data, where $x \in \mathbb{R}^{d_x}$. The most straightforward approach to modeling real-valued observations within the RBM framework is the so-called Gaussian RBM (GRBM), where the only change in the RBM energy function is to the visible units biases, by adding a bias term that is quadratic in the visible units $x$. While it probably remains the most popular way to model real-valued data within the RBM framework, Ranzato and Hinton (2010) suggest that the GRBM has proved to be a somewhat unsatisfactory model of natural images. The trained features typically do not represent sharp edges that occur at object boundaries and lead to latent representations that are not particularly useful features for classification tasks. Ranzato and Hinton (2010) argue that the failure of the GRBM to adequately capture the statistical structure of natural images stems from the exclusive use of the model capacity to capture the conditional mean at the expense of the conditional covariance. Natural images, they argue, are chiefly characterized by the covariance of the pixel values, not by their absolute values. This point is supported by the common use of preprocessing methods that standardize the global scaling of the pixel values across images in a dataset or across the pixel values within each image.

These kinds of concerns about the ability of the GRBM to model natural image data have led to the development of alternative RBM-based models that each attempt to take on this objective of better modeling non-diagonal conditional covariances. Ranzato and Hinton (2010) introduced the mean and covariance RBM (mcRBM). Like the GRBM, the mcRBM is a 2-layer Boltzmann machine that explicitly models the visible units as Gaussian distributed quantities. However unlike the GRBM, the mcRBM uses its hidden layer to independently parametrize both the mean and covariance of the data through two sets of hidden units. The mcRBM is a combination of the covariance RBM (cRBM) (Ranzato et al., 2010a), that models the conditional covariance, with the GRBM that captures the conditional mean. While the mcRBM has shown considerable potential as the basis of a highly successful phoneme recognition system (Dahl et al., 2010), it seems that due to difficulties in training the mcRBM, the model has been largely superseded by the mPoT model. The mPoT model (mean-product of Student's T-distributions model) (Ranzato et al., 2010b) is a combination of the GRBM and the product of Student's T-distributions model (Welling et al., 2003). It is an energy-based model where the conditional distribution over the visible units conditioned on the hidden variables is a multivariate Gaussian (non-diagonal covariance) and the complementary conditional distribution over the hidden variables given the visibles is a set of independent Gamma distributions.

The PoT model has recently been generalized to the mPoT model (Ranzato et al., 2010b) to include nonzero Gaussian means by the addition of GRBM-like hidden units, similarly to how the mcRBM generalizes the cRBM. The mPoT model has been used to synthesize large-scale natural images (Ranzato et al., 2010b) that show large-scale features and shadowing structure. It has been used to model natural textures (Kivinen and Williams, 2012) in a tiled-convolution configuration (see section 11.2).

Another recently introduced RBM-based model with the objective of having the hidden units encode both the mean and covariance information is the spike-and-slab Restricted Boltzmann Machine (ssRBM) (Courville et al., 2011a, b). The ssRBM is defined as having both a real-valued “slab” variable and a binary “spike” variable associated with each unit in the hidden layer. The ssRBM has been demonstrated as a feature learning and extraction scheme in the context of CIFAR-10 object classification (Krizhevsky and Hinton, 2009) from natural images and has performed well in the role (Courville et al., 2011a, b). When trained convolutionally (see Section 11.2) on full CIFAR-10 natural images, the model demonstrated the ability to generate natural image samples that seem to capture the broad statistical structure of natural images better than previous parametric generative models, as illustrated with the samples of Figure 2.

Figure 2: (Top) Samples from convolutionally trained $\mu$-ssRBM from Courville et al. (2011b). (Bottom) Images in CIFAR-10 training set closest (L2 distance with contrast normalized training images) to corresponding model samples on top. The model does not appear to be overfitting particular training examples.

The mcRBM, mPoT and ssRBM each set out to model real-valued data such that the hidden units encode not only the conditional mean of the data but also its conditional covariance. Other than differences in the training schemes, the most significant difference between these models is how they encode their conditional covariance. While the mcRBM and the mPoT use the activation of the hidden units to enforce constraints on the covariance of $x$, the ssRBM uses the hidden unit to pinch the precision matrix along the direction specified by the corresponding weight vector. These two ways of modeling conditional covariance diverge when the dimensionality of the hidden layer is significantly different from that of the input. In the overcomplete setting, sparse activation with the ssRBM parametrization permits variance only in the select directions of the sparsely activated hidden units. This is a property the ssRBM shares with sparse coding models (Olshausen and Field, 1996; Grosse et al., 2007). On the other hand, in the case of the mPoT or mcRBM, an overcomplete set of constraints on the covariance implies that capturing arbitrary covariance along a particular direction of the input requires decreasing potentially all constraints with positive projection in that direction. This perspective would suggest that the mPoT and mcRBM do not appear to be well suited to provide a sparse representation in the overcomplete setting.

6.4 RBM parameter estimation

Many of the RBM training methods we discuss here are applicable to more general undirected graphical models, but are particularly practical in the RBM setting. Freund and Haussler (1994) proposed a learning algorithm for harmoniums (RBMs) based on projection pursuit. Contrastive Divergence (Hinton, 1999; Hinton et al., 2006) has been used most often to train RBMs, and many recent papers use Stochastic Maximum Likelihood (Younes, 1999; Tieleman, 2008).

As discussed in Sec. 6.1, in training probabilistic models parameters are typically adapted in order to maximize the likelihood of the training data (or equivalently the log-likelihood, or its penalized version, which adds a regularization term). With $T$ training examples, the log likelihood is given by:

$$\sum_{t=1}^{T} \log P(x^{(t)}; \theta) = \sum_{t=1}^{T} \log \sum_{h \in \{0,1\}^{d_h}} P(x^{(t)}, h; \theta). \qquad (14)$$

Gradient-based optimization requires its gradient, which for Boltzmann machines, is given by:

$$\frac{\partial}{\partial \theta_i} \sum_{t=1}^{T} \log p(x^{(t)}) = -\sum_{t=1}^{T} \mathbb{E}_{p(h \mid x^{(t)})}\!\left[\frac{\partial}{\partial \theta_i} \mathcal{E}^{\mathrm{BM}}_\theta(x^{(t)}, h)\right] + \sum_{t=1}^{T} \mathbb{E}_{p(x,h)}\!\left[\frac{\partial}{\partial \theta_i} \mathcal{E}^{\mathrm{BM}}_\theta(x, h)\right], \qquad (15)$$

where we have the expectations with respect to $p(h^{(t)} \mid x^{(t)})$ in the “clamped” condition (also called the positive phase), and over the full joint $p(x,h)$ in the “unclamped” condition (also called the negative phase). Intuitively, the gradient acts to locally move the model distribution (the negative phase distribution) toward the data distribution (positive phase distribution), by pushing down the energy of $(h, x^{(t)})$ pairs (for $h \sim P(h \mid x^{(t)})$) while pushing up the energy of $(h, x)$ pairs (for $(h, x) \sim P(h, x)$) until the two forces are in equilibrium, at which point the sufficient statistics (gradient of the energy function) have equal expectations with $x$ sampled from the training distribution or with $x$ sampled from the model.

The RBM conditional independence properties imply that the expectation in the positive phase of Eq. 15 is tractable. The negative phase term – arising from the partition function’s contribution to the log-likelihood gradient – is more problematic because the computation of the expectation over the joint is not tractable. The various ways of dealing with the partition function’s contribution to the gradient have brought about a number of different training algorithms, many trying to approximate the log-likelihood gradient.

To approximate the expectation of the joint distribution in the negative phase contribution to the gradient, it is natural to again consider exploiting the conditional independence of the RBM in order to specify a Monte Carlo approximation of the expectation over the joint:

$$\mathbb{E}_{p(x,h)}\!\left[\frac{\partial}{\partial \theta_i} \mathcal{E}^{\mathrm{RBM}}_\theta(x,h)\right] \approx \frac{1}{L} \sum_{l=1}^{L} \frac{\partial}{\partial \theta_i} \mathcal{E}^{\mathrm{RBM}}_\theta(\tilde{x}^{(l)}, \tilde{h}^{(l)}), \qquad (16)$$

with the samples $(\tilde{x}^{(l)}, \tilde{h}^{(l)})$ drawn by a block Gibbs MCMC (Markov chain Monte Carlo) sampling procedure:

$$\tilde{x}^{(l)} \sim P(x \mid \tilde{h}^{(l-1)})$$
$$\tilde{h}^{(l)} \sim P(h \mid \tilde{x}^{(l)}).$$

Naively, for each gradient update step, one would start a Gibbs sampling chain, wait until the chain converges to the equilibrium distribution and then draw a sufficient number of samples to approximate the expected gradient with respect to the model (joint) distribution in Eq. 16. Then restart the process for the next step of approximate gradient ascent on the log-likelihood. This procedure has the obvious flaw that waiting for the Gibbs chain to “burn-in” and reach equilibrium anew for each gradient update cannot form the basis of a practical training algorithm. Contrastive Divergence (Hinton, 1999; Hinton et al., 2006), Stochastic Maximum Likelihood (Younes, 1999; Tieleman, 2008) and fast-weights persistent contrastive divergence or FPCD (Tieleman and Hinton, 2009) are all ways to avoid or reduce the need for burn-in.

6.4.1 Contrastive Divergence

Contrastive divergence (CD) estimation (Hinton, 1999; Hinton et al., 2006) estimates the negative phase expectation (Eq. 15) with a very short Gibbs chain (often just one step) initialized at the training data used in the positive phase. This reduces the variance of the gradient estimator and still moves in a direction that pulls the negative chain samples towards the associated positive chain samples. Much has been written about the properties and alternative interpretations of CD and its similarity to auto-encoder training, e.g. Carreira-Perpiñan and Hinton (2005); Yuille (2005); Bengio and Delalleau (2009); Sutskever and Tieleman (2010).
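A minimal sketch of a CD-1 update for a binary RBM (NumPy; the learning rate, data and initialization are illustrative choices of ours, not taken from the cited papers). The negative-phase chain is initialized at the training data and run for a single Gibbs step:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_update(X, W, b, d, lr, rng):
    """One CD-1 step: the negative-phase Gibbs chain starts at the data itself."""
    ph_data = sigmoid(X @ W + d)                               # positive phase: P(h=1 | x)
    h = (rng.random(ph_data.shape) < ph_data).astype(float)    # sample hiddens
    px_recon = sigmoid(h @ W.T + b)                            # one Gibbs half-step back to x
    ph_recon = sigmoid(px_recon @ W + d)                       # and up to h again
    n = X.shape[0]
    W += lr * (X.T @ ph_data - px_recon.T @ ph_recon) / n      # approximate likelihood gradient
    b += lr * (X - px_recon).mean(axis=0)
    d += lr * (ph_data - ph_recon).mean(axis=0)
    return W, b, d

rng = np.random.default_rng(0)
X = (rng.random((100, 8)) < 0.3).astype(float)                 # toy binary data
W = 0.01 * rng.normal(size=(8, 4)); b = np.zeros(8); d = np.zeros(4)
for _ in range(200):
    W, b, d = cd1_update(X, W, b, d, lr=0.05, rng=rng)
print("mean reconstruction error:",
      ((X - sigmoid(sigmoid(X @ W + d) @ W.T + b)) ** 2).mean())
```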

6.4.2 Stochastic Maximum Likelihood

The Stochastic Maximum Likelihood (SML) algorithm (also known as persistent contrastive divergence or PCD) (Younes, 1999; Tieleman, 2008) is an alternative way to sidestep an extended burn-in of the negative phase Gibbs sampler. At each gradient update, rather than initializing the Gibbs chain at the positive phase sample as in CD, SML initializes the chain at the last state of the chain used for the previous update. In other words, SML uses a continually running Gibbs chain (or often a number of Gibbs chains run in parallel) from which samples are drawn to estimate the negative phase expectation. Despite the model parameters changing between updates, these changes should be small enough that only a few steps of Gibbs (in practice, often one step is used) are required to maintain samples from the equilibrium distribution of the Gibbs chain, i.e. the model distribution.
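Relative to CD, the only change in SML/PCD is where the negative-phase chain starts; a minimal sketch of that difference (NumPy, same kind of toy binary RBM as in the CD sketch above), with a set of persistent "fantasy" particles advanced by one Gibbs step per update:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sml_update(X, X_chain, W, b, d, lr, rng):
    """One SML/PCD step: the persistent chain X_chain is advanced by one Gibbs step per update."""
    ph_data = sigmoid(X @ W + d)                                   # positive phase from the data
    h_chain = (rng.random((len(X_chain), W.shape[1]))
               < sigmoid(X_chain @ W + d)).astype(float)           # negative phase from the chain
    X_chain = (rng.random(X_chain.shape)
               < sigmoid(h_chain @ W.T + b)).astype(float)
    ph_chain = sigmoid(X_chain @ W + d)
    W += lr * (X.T @ ph_data / len(X) - X_chain.T @ ph_chain / len(X_chain))
    b += lr * (X.mean(axis=0) - X_chain.mean(axis=0))
    d += lr * (ph_data.mean(axis=0) - ph_chain.mean(axis=0))
    return X_chain, W, b, d

rng = np.random.default_rng(0)
X = (rng.random((100, 8)) < 0.3).astype(float)
W = 0.01 * rng.normal(size=(8, 4)); b = np.zeros(8); d = np.zeros(4)
X_chain = (rng.random((20, 8)) < 0.5).astype(float)                # 20 persistent fantasy particles
for _ in range(500):
    X_chain, W, b, d = sml_update(X, X_chain, W, b, d, lr=0.02, rng=rng)
```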

A troublesome aspect of SML is that it relies on the Gibbs chain to mix well (especially between modes) for learning to succeed. Typically, as learning progresses and the weights of the RBM grow, the ergodicity of the Gibbs sampler begins to break down (when weights become large, the estimated distribution is more peaky, and the chain takes a very long time to mix, i.e. to move from mode to mode, so that in practice the gradient estimator can be very poor; this is a serious chicken-and-egg problem because if sampling is not effective, neither is the training procedure, which may seem to stall and yields even larger weights). If the learning rate $\epsilon$ associated with gradient ascent $\theta \leftarrow \theta + \epsilon \hat{g}$ (with $E[\hat{g}] \approx \frac{\partial \log p_\theta(x)}{\partial \theta}$) is not reduced to compensate, then the Gibbs sampler will diverge from the model distribution and learning will fail. Desjardins et al. (2010); Cho et al. (2010); Salakhutdinov (2010b, a) have all considered various forms of tempered transitions to address the failure of Gibbs chain mixing, and convincing solutions have not yet been clearly demonstrated. A recently introduced promising avenue relies on depth itself, showing that mixing between modes is much easier on deeper layers (Bengio et al., 2013) (Sec. 9.4).

Tieleman and Hinton (2009) have proposed quite a different approach to addressing potential mixing problems of SML with their fast-weights persistent contrastive divergence (FPCD), and it has also been exploited to train Deep Boltzmann Machines (Salakhutdinov, 2010a) and construct a pure sampling algorithm for RBMs (Breuleux et al., 2011). FPCD builds on the surprising but robust tendency of Gibbs chains to mix better during SML learning than when the model parameters are fixed. The phenomenon is rooted in the form of the likelihood gradient itself (Eq. 15). The samples drawn from the SML Gibbs chain are used in the negative phase of the gradient, which implies that the learning update will slightly increase the energy (decrease the probability) of those samples, making the region in the neighborhood of those samples less likely to be resampled and therefore making it more likely that the samples will move somewhere else (typically going near another mode). Rather than drawing samples from the distribution of the current model (with parameters $\theta$), FPCD exaggerates this effect by drawing samples from a local perturbation of the model with parameters $\theta^{*}$ and an update

$$\theta^{*}_{t+1} = (1-\eta)\,\theta_{t+1} + \eta\,\theta^{*}_{t} + \epsilon^{*} \frac{\partial}{\partial \theta_i}\left(\sum_{t=1}^{T} \log p(x^{(t)})\right), \qquad (17)$$

where $\epsilon^{*}$ is the relatively large fast-weight learning rate ($\epsilon^{*} > \epsilon$) and $0 < \eta < 1$ (but near 1) is a forgetting factor that keeps the perturbed model close to the current model. Unlike tempering, FPCD does not converge to the model distribution as $\epsilon$ and $\epsilon^{*}$ go to 0, and further work is necessary to characterize the nature of its approximation to the model distribution. Nevertheless, FPCD is a popular and apparently effective means of drawing approximate samples from the model distribution that faithfully represent its diversity, at the price of sometimes generating spurious samples in between two modes (because the fast weights roughly correspond to a smoothed view of the current model's energy function). It has been applied in a variety of applications (Tieleman and Hinton, 2009; Ranzato et al., 2011; Kivinen and Williams, 2012) and it has been transformed into a sampling algorithm (Breuleux et al., 2011) that also shares this fast mixing property with herding (Welling, 2009), for the same reason, i.e., introducing negative correlations between consecutive samples of the chain in order to promote faster mixing.
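One possible reading of the fast-weight update of Eq. 17 is sketched below (NumPy; for brevity only the weight matrix gets a fast copy, and the hyperparameter values are arbitrary illustrative choices of ours, not those of the original FPCD paper). The persistent chain is driven by the perturbed parameters, which track the regular ones but take much larger gradient steps:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fpcd_update(X, X_chain, W, Wf, b, d, lr, lr_fast, eta, rng):
    """One FPCD-style step. The persistent chain uses the fast weights Wf, which follow W
    (forgetting factor eta, Eq. 17) but move with a much larger learning rate (lr_fast > lr)."""
    ph_data = sigmoid(X @ W + d)
    h_chain = (rng.random((len(X_chain), W.shape[1]))
               < sigmoid(X_chain @ Wf + d)).astype(float)      # negative phase uses fast weights
    X_chain = (rng.random(X_chain.shape)
               < sigmoid(h_chain @ Wf.T + b)).astype(float)
    ph_chain = sigmoid(X_chain @ Wf + d)
    grad_W = X.T @ ph_data / len(X) - X_chain.T @ ph_chain / len(X_chain)
    W += lr * grad_W                                            # slow (regular) parameters
    Wf = (1 - eta) * W + eta * Wf + lr_fast * grad_W            # Eq. 17 applied to W only
    b += lr * (X.mean(axis=0) - X_chain.mean(axis=0))
    d += lr * (ph_data.mean(axis=0) - ph_chain.mean(axis=0))
    return X_chain, W, Wf, b, d

rng = np.random.default_rng(0)
X = (rng.random((100, 8)) < 0.3).astype(float)
W = 0.01 * rng.normal(size=(8, 4)); Wf = W.copy()
b = np.zeros(8); d = np.zeros(4)
X_chain = (rng.random((20, 8)) < 0.5).astype(float)
for _ in range(500):
    X_chain, W, Wf, b, d = fpcd_update(X, X_chain, W, Wf, b, d,
                                       lr=0.02, lr_fast=0.2, eta=0.95, rng=rng)
```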

6.4.3 Pseudolikelihood, Ratio-matching and More

While CD, SML and FPCD are by far the most popular methods for training RBMs and RBM-based models, all of these methods are perhaps most naturally described as offering different approximations to maximum likelihood training. There exist other inductive principles that are alternatives to maximum likelihood that can also be used to train RBMs. In particular, these include pseudo-likelihood (Besag, 1975) and ratio-matching (Hyvärinen, 2007). Both of these inductive principles attempt to avoid explicitly dealing with the partition function, and their asymptotic efficiency has been analyzed (Marlin and de Freitas, 2011). Pseudo-likelihood seeks to maximize the product of all one-dimensional conditional distributions of the form $P(x_d \mid x_{\setminus d})$, while ratio-matching can be interpreted as an extension of score matching (Hyvärinen, 2005) to discrete data types. Both methods amount to weighted differences of the gradient of the RBM free energy (the free energy $\mathcal{F}(x;\theta)$ is the energy associated with the data marginal probability, $\mathcal{F}(x;\theta) = -\log P(x) - \log Z_\theta$, and is tractable for the RBM) evaluated at a data point and at neighboring points. One potential drawback of these methods is that depending on the parametrization of the energy function, their computational requirements may scale up to $O(n_d)$ worse than CD, SML, FPCD, or denoising score matching (Kingma and LeCun, 2010; Vincent, 2011), discussed below. Marlin et al. (2010) empirically compared all of these methods (except denoising score matching) on a range of classification, reconstruction and density modeling tasks and found that, in general, SML provided the best combination of overall performance and computational tractability. However, in a later study, the same authors (Swersky et al., 2011) found denoising score matching to be a competitive inductive principle both in terms of classification performance (with respect to SML) and in terms of computational efficiency (with respect to analytically obtained score matching). Denoising score matching is a special case of the denoising auto-encoder training criterion (Section 7.2.2) when the reconstruction error residual equals a gradient, i.e., the score function associated with an energy function, as shown in (Vincent, 2011).
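For the RBM, the one-dimensional conditionals $P(x_d \mid x_{\setminus d})$ can be written in terms of the free energy, which makes the pseudo-likelihood objective cheap to estimate stochastically. A minimal sketch (NumPy; the one-random-bit-per-example estimator below is a common simplification, not necessarily the exact variant analyzed in the cited papers):

```python
import numpy as np

def free_energy(X, W, b, d):
    """RBM free energy F(x) = -b^T x - sum_i log(1 + exp(x^T W_i + d_i)), so P(x) ∝ exp(-F(x))."""
    return -X @ b - np.logaddexp(0.0, X @ W + d).sum(axis=1)

def stochastic_pseudolikelihood(X, W, b, d, rng):
    """Monte Carlo estimate of the pseudo-likelihood: for each example, score one randomly
    chosen dimension x_d against its flipped value, and scale by d_x."""
    n, d_x = X.shape
    idx = rng.integers(0, d_x, size=n)
    X_flip = X.copy()
    X_flip[np.arange(n), idx] = 1.0 - X_flip[np.arange(n), idx]
    # log P(x_d | x_\d) = log sigmoid(F(x with bit d flipped) - F(x))
    log_p = -np.logaddexp(0.0, -(free_energy(X_flip, W, b, d) - free_energy(X, W, b, d)))
    return d_x * log_p.mean()

rng = np.random.default_rng(0)
X = (rng.random((100, 8)) < 0.3).astype(float)
W = 0.01 * rng.normal(size=(8, 4)); b = np.zeros(8); d = np.zeros(4)
print("stochastic pseudo-likelihood estimate:", stochastic_pseudolikelihood(X, W, b, d, rng))
```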

In the spirit of the Boltzmann machine gradient (Eq. 15) several approaches have been proposed to train energy-based models. One is noise-contrastive estimation (Gutmann and Hyvarinen, 2010), in which the training criterion is transformed into a probabilistic classification problem: distinguish between (positive) training examples and (negative) noise samples generated by a broad distribution (such as the Gaussian). Another family of approaches, more in the spirit of Contrastive Divergence, relies on distinguishing positive examples (of the training distribution) and negative examples obtained by perturbations of the positive examples (Collobert and Weston, 2008; Bordes et al., 2012; Weston et al., 2010).

7 Directly Learning A Parametric Map from Input to Representation

Within the framework of probabilistic models adopted in Section 6, the learned representation is always associated with latent variables, specifically with their posterior distribution given an observed input $x$. Unfortunately, this posterior distribution tends to become very complicated and intractable if the model has more than a couple of interconnected layers, whether in the directed or undirected graphical model frameworks. It then becomes necessary to resort to sampling or approximate inference techniques, and to pay the associated computational and approximation error price. If the true posterior has a large number of modes that matter then current inference techniques may face an unsurmountable challenge or endure a potentially serious approximation. This is in addition to the difficulties raised by the intractable partition function in undirected graphical models. Moreover a posterior distribution over latent variables is not yet a simple usable feature vector that can for example be fed to a classifier. So actual feature values are typically derived from that distribution, taking the latent variable's expectation (as is typically done with RBMs), their marginal probability, or finding their most likely value (as in sparse coding). If we are to extract stable deterministic numerical feature values in the end anyway, an alternative (apparently) non-probabilistic feature learning paradigm that focuses on carrying out this part of the computation, very efficiently, is that of auto-encoders and other directly parametrized feature or representation functions. The commonality between these methods is that they learn a direct encoding, i.e., a parametric map from inputs to their representation.

Regularized auto-encoders, discussed next, also involve learning a decoding function that maps back from representation to input space. Sections 8.1 and 11.3 discuss direct encoding methods that do not require a decoder, such as semi-supervised embedding (Weston et al., 2008) and slow feature analysis (Wiskott and Sejnowski, 2002).

7.1 Auto-Encoders

In the auto-encoder framework (LeCun, 1987; Bourlard and Kamp, 1988; Hinton and Zemel, 1994), one starts by explicitly defining a feature-extracting function in a specific parametrized closed form. This function, that we will denote $f_\theta$, is called the encoder and will allow the straightforward and efficient computation of a feature vector $h = f_\theta(x)$ from an input $x$. For each example $x^{(t)}$ from a data set $\{x^{(1)}, \ldots, x^{(T)}\}$, we define

$$h^{(t)} = f_\theta(x^{(t)}) \qquad (18)$$

where $h^{(t)}$ is the feature-vector or representation or code computed from $x^{(t)}$. Another closed form parametrized function $g_\theta$, called the decoder, maps from feature space back into input space, producing a reconstruction $r = g_\theta(h)$. Whereas probabilistic models are defined from an explicit probability function and are trained to maximize (often approximately) the data likelihood (or a proxy), auto-encoders are parametrized through their encoder and decoder and are trained using a different training principle. The set of parameters $\theta$ of the encoder and decoder are learned simultaneously on the task of reconstructing as well as possible the original input, i.e. attempting to incur the lowest possible reconstruction error $L(x,r)$ – a measure of the discrepancy between $x$ and its reconstruction $r$ – over training examples. Good generalization means low reconstruction error at test examples, while having high reconstruction error for most other $x$ configurations. To capture the structure of the data-generating distribution, it is therefore important that something in the training criterion or the parametrization prevents the auto-encoder from learning the identity function, which has zero reconstruction error everywhere. This is achieved through various means in the different forms of auto-encoders, as described below in more detail, and we call these regularized auto-encoders. A particular form of regularization consists in constraining the code to have a low dimension, and this is what the classical auto-encoder or PCA do.

In summary, basic auto-encoder training consists in finding a value of parameter vector $\theta$ minimizing reconstruction error

$\mathcal{J}_{\mathrm{AE}}(\theta) = \sum_t L(x^{(t)}, g_\theta(f_\theta(x^{(t)})))$   (19)

where $x^{(t)}$ is a training example. This minimization is usually carried out by stochastic gradient descent as in the training of Multi-Layer-Perceptrons (MLPs). Since auto-encoders were primarily developed as MLPs predicting their input, the most commonly used forms for the encoder and decoder are affine mappings, optionally followed by a non-linearity:

$f_\theta(x) = s_f(b + Wx)$   (20)
$g_\theta(h) = s_g(d + W'h)$   (21)

where $s_f$ and $s_g$ are the encoder and decoder activation functions (typically the element-wise sigmoid or hyperbolic tangent non-linearity, or the identity function if staying linear). The set of parameters of such a model is $\theta = \{W, b, W', d\}$, where $b$ and $d$ are called encoder and decoder bias vectors, and $W$ and $W'$ are the encoder and decoder weight matrices.

The choice of $s_g$ and $L$ depends largely on the input domain range and nature, and they are usually chosen so that $L$ returns a negative log-likelihood for the observed value of $x$. A natural choice for an unbounded domain is a linear decoder with a squared reconstruction error, i.e., $s_g(a) = a$ and $L(x,r) = \|x - r\|^2$. If inputs are bounded between $0$ and $1$ however, ensuring a similarly-bounded reconstruction can be achieved by using $s_g = \mathrm{sigmoid}$. In addition, if the inputs are of a binary nature, a binary cross-entropy loss, $L(x,r) = -\sum_{i=1}^{d_x} x_i \log(r_i) + (1 - x_i)\log(1 - r_i)$, is sometimes used.
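To make the preceding definitions concrete, here is a minimal NumPy sketch of the basic auto-encoder of Eqs. 18-21: an affine+sigmoid encoder and decoder trained by stochastic gradient descent on squared reconstruction error. The class name, dimensions, learning rate and toy data are illustrative choices, not part of the original formulation.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class BasicAutoEncoder:
    """Affine + sigmoid encoder/decoder trained by SGD on squared error (Eqs. 19-21)."""
    def __init__(self, d_x, d_h, lr=0.1):
        self.W = rng.normal(0, 0.01, (d_h, d_x))   # encoder weight matrix W
        self.Wp = rng.normal(0, 0.01, (d_x, d_h))  # decoder weight matrix W'
        self.b = np.zeros(d_h)                     # encoder bias b
        self.d = np.zeros(d_x)                     # decoder bias d
        self.lr = lr

    def encode(self, x):
        return sigmoid(self.b + self.W.dot(x))     # h = s_f(b + W x)

    def decode(self, h):
        return sigmoid(self.d + self.Wp.dot(h))    # r = s_g(d + W' h)

    def sgd_step(self, x):
        h = self.encode(x)
        r = self.decode(h)
        # Gradients of L(x, r) = ||x - r||^2 through the two sigmoid layers.
        dr = 2.0 * (r - x) * r * (1.0 - r)         # w.r.t. decoder pre-activation
        dh = self.Wp.T.dot(dr) * h * (1.0 - h)     # w.r.t. encoder pre-activation
        self.Wp -= self.lr * np.outer(dr, h)
        self.d -= self.lr * dr
        self.W -= self.lr * np.outer(dh, x)
        self.b -= self.lr * dh
        return np.sum((x - r) ** 2)

# Toy usage on random binary inputs.
X = (rng.rand(500, 20) > 0.5).astype(float)
ae = BasicAutoEncoder(d_x=20, d_h=10)
for epoch in range(10):
    err = np.mean([ae.sgd_step(x) for x in X])
print("mean reconstruction error:", err)
```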

If both encoder and decoder use a sigmoid non-linearity, then $f_\theta(x)$ and $g_\theta(h)$ have the exact same form as the conditionals $P(h \mid v)$ and $P(v \mid h)$ of binary RBMs (see Section 6.2). This similarity motivated an initial study (Bengio et al., 2007) of the possibility of replacing RBMs with auto-encoders as the basic pre-training strategy for building deep networks, as well as the comparative analysis of auto-encoder reconstruction error gradient and contrastive divergence updates (Bengio and Delalleau, 2009).

One notable difference in the parametrization is that RBMs use a single weight matrix, which follows naturally from their energy function, whereas the auto-encoder framework allows for a different matrix in the encoder and decoder. In practice however, weight-tying, in which one defines $W' = W^T$, may be (and most often is) used, rendering the parametrizations identical. The usual training procedures however differ greatly between the two approaches. A practical advantage of training auto-encoder variants is that they define a simple tractable optimization objective that can be used to monitor progress.

In the case of a linear auto-encoder (linear encoder and decoder) with squared reconstruction error, minimizing Eq. 19 learns the same subspace as PCA. (Contrary to traditional PCA loading factors, but similarly to the parameters learned by probabilistic PCA, the weight vectors learned by a linear auto-encoder are not constrained to form an orthonormal basis, nor to have a meaningful ordering; they will however span the same subspace.) This is also true when using a sigmoid nonlinearity in the encoder (Bourlard and Kamp, 1988), but not if the weights $W$ and $W'$ are tied ($W' = W^T$), because $W$ cannot be forced into being small and $W'$ large to achieve a linear encoder.

Similarly, Le et al. (2011b) recently showed that adding a regularization term of the form $\sum_t \sum_j s_3(W_j x^{(t)})$ to a linear auto-encoder with tied weights, where $s_3$ is a nonlinear convex function, yields an efficient algorithm for learning linear ICA.

7.2 Regularized Auto-Encoders

Like PCA, auto-encoders were originally seen as a dimensionality reduction technique and thus used a bottleneck, i.e., $d_h < d_x$. On the other hand, successful uses of sparse coding and RBM approaches tend to favour overcomplete representations, i.e., $d_h > d_x$. This can allow the auto-encoder to simply duplicate the input in the features, with perfect reconstruction without having extracted more meaningful features. Recent research has demonstrated very successful alternative ways, called regularized auto-encoders, to “constrain” the representation, even when it is overcomplete. The effect of a bottleneck or of this regularization is that the auto-encoder cannot reconstruct everything well: it is trained to reconstruct the training examples well, and generalization means that reconstruction error is also small on test examples. An interesting justification (Ranzato et al., 2008) for the sparsity penalty (or any penalty that restricts in a soft way the volume of hidden configurations easily accessible by the learner) is that it acts in spirit like the partition function of RBMs, by making sure that only few input configurations can have a low reconstruction error.

Alternatively, one can view the objective of the regularization applied to an auto-encoder as making the representation as “constant” (insensitive) as possible with respect to changes in input. This view immediately justifies two variants of regularized auto-encoders described below: contractive auto-encoders reduce the number of effective degrees of freedom of the representation (around each point) by making the encoder contractive, i.e., making the derivative of the encoder small (thus making the hidden units saturate), while the denoising auto-encoder makes the whole mapping “robust”, i.e., insensitive to small random perturbations, or contractive, making sure that the reconstruction cannot stay good when moving in most directions around a training example.

7.2.1 Sparse Auto-Encoders

The earliest uses of single-layer auto-encoders for building deep architectures by stacking them (Bengio et al., 2007) considered the idea of tying the encoder weights and decoder weights to restrict capacity as well as the idea of introducing a form of sparsity regularization (Ranzato et al., 2007). Sparsity in the representation can be achieved by penalizing the hidden unit biases (making these additive offset parameters more negative) (Ranzato et al., 2007; Lee et al., 2008; Goodfellow et al., 2009; Larochelle and Bengio, 2008) or by directly penalizing the output of the hidden unit activations (making them closer to their saturating value at 0) (Ranzato et al., 2008; Le et al., 2011a; Zou et al., 2011). Penalizing the bias runs the danger that the weights could compensate for the bias, which could hurt numerical optimization. When directly penalizing the hidden unit outputs, several variants can be found in the literature, but a clear comparative analysis is still lacking. Although the L1 penalty (i.e., simply the sum of output elements $h_j$ in the case of sigmoid non-linearity) would seem the most natural (because of its use in sparse coding), it is used in few papers involving sparse auto-encoders. A close cousin of the L1 penalty is the Student-t penalty ($\log(1 + h_j^2)$), originally proposed for sparse coding (Olshausen and Field, 1996). Several papers penalize the average output $\bar{h}_j$ (e.g., over a minibatch), and instead of pushing it to 0, encourage it to approach a fixed target, either through a mean-square error penalty, or maybe more sensibly (because $h_j$ behaves like a probability), a Kullback-Leibler divergence with respect to the binomial distribution with probability $\rho$: $-\rho \log \bar{h}_j - (1 - \rho)\log(1 - \bar{h}_j) + \text{constant}$, e.g., with $\rho = 0.05$.
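As a small illustration of the average-activation penalty just described, the following sketch computes the Kullback-Leibler sparsity penalty toward a target rate $\rho$ over a minibatch of sigmoid codes; the function name and the random minibatch are made up for the example.

```python
import numpy as np

def kl_sparsity_penalty(H, rho=0.05, eps=1e-8):
    """KL divergence between a target activation rate rho and the average
    activation of each hidden unit over a minibatch (rows of H are examples),
    as described in Section 7.2.1."""
    h_bar = np.clip(H.mean(axis=0), eps, 1.0 - eps)  # average output per unit
    kl = -rho * np.log(h_bar) - (1.0 - rho) * np.log(1.0 - h_bar)
    # The "+ constant" term (entropy of the target) is omitted since it does
    # not depend on the parameters and therefore has no gradient.
    return kl.sum()

# Example: sigmoid codes for a minibatch of 32 examples, 50 hidden units.
rng = np.random.RandomState(0)
H = 1.0 / (1.0 + np.exp(-rng.normal(size=(32, 50))))
print(kl_sparsity_penalty(H, rho=0.05))
```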

7.2.2 Denoising Auto-Encoders

Vincent et al. (2008, 2010) proposed altering the training objective in Eq. 19 from mere reconstruction to that of denoising an artificially corrupted input, i.e. learning to reconstruct the clean input from a corrupted version. Learning the identity is no longer enough: the learner must capture the structure of the input distribution in order to optimally undo the effect of the corruption process, with the reconstruction essentially being a nearby but higher density point than the corrupted input. Figure 3 illustrates that the Denoising Auto-Encoder (DAE) is learning a reconstruction function that corresponds to a vector field pointing towards high-density regions (the manifold where examples concentrate).

Figure 3: When data concentrate near a lower-dimensional manifold, the corruption vector is typically almost orthogonal to the manifold, and the reconstruction function learns to denoise, map from low-probability configurations (corrupted inputs) to high-probability ones (original inputs), creating a vector field aligned with the score (derivative of the estimated density).

Formally, the objective optimized by a DAE is:

$\mathcal{J}_{\mathrm{DAE}} = \sum_t \mathbb{E}_{q(\tilde{x} \mid x^{(t)})}\left[ L(x^{(t)}, g_\theta(f_\theta(\tilde{x}))) \right]$   (22)

where $\mathbb{E}_{q(\tilde{x} \mid x^{(t)})}[\cdot]$ averages over corrupted examples $\tilde{x}$ drawn from corruption process $q(\tilde{x} \mid x^{(t)})$. In practice this is optimized by stochastic gradient descent, where the stochastic gradient is estimated by drawing one or a few corrupted versions of $x^{(t)}$ each time $x^{(t)}$ is considered. Corruptions considered in Vincent et al. (2010) include additive isotropic Gaussian noise, salt and pepper noise for gray-scale images, and masking noise (salt or pepper only), e.g., setting some randomly chosen inputs to 0 (independently per example). Masking noise has been used in most of the simulations. Qualitatively better features are reported with denoising, resulting in improved classification, and DAE features performed similarly or better than RBM features. Chen et al. (2012) show that a simpler alternative with a closed form solution can be obtained when restricting to a linear auto-encoder and have successfully applied it to domain adaptation.
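A minimal sketch of the DAE criterion of Eq. 22 with masking noise, tied weights and squared error on the clean input is given below; the corruption level, dimensions and training schedule are arbitrary illustrative choices rather than the settings used in the cited papers.

```python
import numpy as np

rng = np.random.RandomState(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def mask_corrupt(x, p=0.3):
    """Masking noise q(x~|x): each coordinate is set to 0 with probability p."""
    return x * (rng.rand(*x.shape) >= p)

def dae_sgd_step(x, W, b, d, lr=0.1):
    """One stochastic step on the DAE objective (Eq. 22): tied weights W' = W^T,
    sigmoid encoder/decoder, squared reconstruction error of the *clean* x."""
    x_tilde = mask_corrupt(x)                 # draw one corrupted version of x
    h = sigmoid(b + W.dot(x_tilde))           # encode the corrupted input
    r = sigmoid(d + W.T.dot(h))               # reconstruct
    dr = 2.0 * (r - x) * r * (1.0 - r)        # grad w.r.t. decoder pre-activation
    dh = W.dot(dr) * h * (1.0 - h)            # grad w.r.t. encoder pre-activation
    W -= lr * (np.outer(dh, x_tilde) + np.outer(h, dr))  # W appears in both paths
    b -= lr * dh
    d -= lr * dr
    return np.sum((x - r) ** 2)

# Toy usage.
d_x, d_h = 20, 10
W = rng.normal(0, 0.01, (d_h, d_x)); b = np.zeros(d_h); d = np.zeros(d_x)
X = (rng.rand(500, d_x) > 0.5).astype(float)
for epoch in range(10):
    errs = [dae_sgd_step(x, W, b, d) for x in X]
print("denoising reconstruction error:", np.mean(errs))
```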

Vincent (2011) relates DAEs to energy-based probabilistic models: DAEs basically learn in $r(\tilde{x}) - \tilde{x}$ a vector pointing in the direction of the estimated score $\frac{\partial \log p(\tilde{x})}{\partial \tilde{x}}$ (Figure 3). In the special case of linear reconstruction and squared error, Vincent (2011) shows that training an affine-sigmoid-affine DAE amounts to learning an energy-based model, whose energy function is very close to that of a GRBM. Training uses a regularized variant of the score matching parameter estimation technique (Hyvärinen, 2005, 2008; Kingma and LeCun, 2010) termed denoising score matching (Vincent, 2011). Swersky (2010) had shown that training GRBMs with score matching is equivalent to training a regular auto-encoder with an additional regularization term, while, following up on the theoretical results in Vincent (2011), Swersky et al. (2011) showed the practical advantage of denoising to implement score matching efficiently. Finally Alain and Bengio (2012) generalize Vincent (2011) and prove that DAEs of arbitrary parametrization with small Gaussian corruption noise are general estimators of the score.

7.2.3 Contractive Auto-Encoders

Contractive Auto-Encoders (CAE), proposed by Rifai et al. (2011a), follow up on Denoising Auto-Encoders (DAE) and share a similar motivation of learning robust representations. CAEs achieve this by adding an analytic contractive penalty to Eq. 19: the Frobenius norm of the encoder's Jacobian, which penalizes the sensitivity of learned features to infinitesimal input variations. Let $J(x) = \frac{\partial f_\theta}{\partial x}(x)$ be the Jacobian matrix of the encoder at $x$. The CAE's training objective is

$\mathcal{J}_{\mathrm{CAE}} = \sum_t L(x^{(t)}, g_\theta(f_\theta(x^{(t)}))) + \lambda \left\| J(x^{(t)}) \right\|_F^2$   (23)

where $\lambda$ is a hyper-parameter controlling the strength of the regularization. For an affine sigmoid encoder, the contractive penalty term is easy to compute:

$J_j(x) = f_\theta(x)_j (1 - f_\theta(x)_j) W_j$
$\left\| J(x) \right\|_F^2 = \sum_j \left( f_\theta(x)_j (1 - f_\theta(x)_j) \right)^2 \|W_j\|^2$   (24)
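The following sketch evaluates the contractive penalty of Eq. 24 for an affine+sigmoid encoder and the resulting CAE cost of Eq. 23 on a minibatch; the random weights and the value of $\lambda$ are placeholders.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(x, W, b):
    """Frobenius norm of the encoder Jacobian for h = sigmoid(b + W x), Eq. 24."""
    h = sigmoid(b + W.dot(x))
    s = (h * (1.0 - h)) ** 2               # per-unit sensitivity factors
    w_norms = np.sum(W ** 2, axis=1)       # ||W_j||^2 for each hidden row W_j
    return np.dot(s, w_norms)

def cae_cost(X, W, b, Wp, d, lam=0.1):
    """CAE objective (Eq. 23): squared reconstruction error plus lambda * penalty."""
    total = 0.0
    for x in X:
        h = sigmoid(b + W.dot(x))
        r = sigmoid(d + Wp.dot(h))
        total += np.sum((x - r) ** 2) + lam * contractive_penalty(x, W, b)
    return total

# Toy usage.
rng = np.random.RandomState(2)
d_x, d_h = 20, 10
W = rng.normal(0, 0.1, (d_h, d_x)); b = np.zeros(d_h)
Wp = W.T.copy(); d = np.zeros(d_x)      # tied decoder weights, as in the CAE
X = rng.rand(8, d_x)
print(cae_cost(X, W, b, Wp, d))
```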

There are at least three notable differences with DAEs, which may be partly responsible for the better performance that CAE features seem to empirically demonstrate: (a) the sensitivity of the features is penalized (i.e., the robustness of the representation is encouraged) rather than that of the reconstruction; (b) the penalty is analytic rather than stochastic: an efficiently computable expression replaces what might otherwise require $d_x$ corrupted samples to size up (i.e., the sensitivity in $d_x$ directions); (c) a hyper-parameter $\lambda$ allows a fine control of the trade-off between reconstruction and robustness (while the two are mingled in a DAE). Note however that there is a tight connection between the DAE and the CAE: as shown in (Alain and Bengio, 2012), a DAE with small corruption noise can be seen (through a Taylor expansion) as a type of contractive auto-encoder where the contractive penalty is on the whole reconstruction function rather than just on the encoder (note that in the CAE, the decoder weights are tied to the encoder weights, to avoid degenerate solutions, and this should also make the decoder contractive).

A potential disadvantage of the CAE's analytic penalty is that it amounts to only encouraging robustness to infinitesimal input variations. This is remedied in Rifai et al. (2011b) with the CAE+H, that penalizes all higher order derivatives, in an efficient stochastic manner, by adding a term that encourages $J(x)$ and $J(x + \epsilon)$ to be close:

$\mathcal{J}_{\mathrm{CAE+H}} = \sum_t L(x^{(t)}, g_\theta(x^{(t)})) + \lambda \left\| J(x^{(t)}) \right\|_F^2 + \gamma\, \mathbb{E}_\epsilon\left[ \left\| J(x) - J(x+\epsilon) \right\|_F^2 \right]$   (25)

where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$, and $\gamma$ is the associated regularization strength hyper-parameter.

The DAE and CAE have been successfully used to win the final phase of the Unsupervised and Transfer Learning Challenge (Mesnil et al., 2011). The representation learned by the CAE tends to be saturated rather than sparse, i.e., most hidden units are near the extremes of their range (e.g., 0 or 1), and their derivative $\frac{\partial h_i(x)}{\partial x}$ is near 0. The non-saturated units are few and sensitive to the inputs, with their associated filters (weight vectors) together forming a basis explaining the local changes around $x$, as discussed in Section 8.2. Another way to get saturated (nearly binary) units is semantic hashing (Salakhutdinov and Hinton, 2007).

7.2.4 Predictive Sparse Decomposition

Sparse coding (Olshausen and Field, 1996) may be viewed as a kind of auto-encoder that uses a linear decoder with a squared reconstruction error, but whose non-parametric encoder $f_\theta$ performs the comparatively non-trivial and relatively costly iterative minimization of Eq. 2. A practically successful variant of sparse coding and auto-encoders, named Predictive Sparse Decomposition or PSD (Kavukcuoglu et al., 2008), replaces that costly and highly non-linear encoding step by a fast non-iterative approximation during recognition (computing the learned features). PSD has been applied to object recognition in images and video (Kavukcuoglu et al., 2009, 2010; Jarrett et al., 2009), but also to audio (Henaff et al., 2011), mostly within the framework of multi-stage convolutional deep architectures (Section 11.2). The main idea can be summarized by the following equation for the training criterion, which is simultaneously optimized with respect to hidden codes (representation) $h^{(t)}$ and with respect to parameters $(W, \alpha)$:

$\mathcal{J}_{\mathrm{PSD}} = \sum_t \lambda \|h^{(t)}\|_1 + \|x^{(t)} - W h^{(t)}\|_2^2 + \|h^{(t)} - f_\alpha(x^{(t)})\|_2^2$   (26)

where $x^{(t)}$ is the input vector for example $t$, $h^{(t)}$ is the optimized hidden code for that example, and $f_\alpha(\cdot)$ is the encoding function, the simplest variant being

$f_\alpha(x^{(t)}) = \tanh(b + W^T x^{(t)})$   (27)

where encoding weights are the transpose of decoding weights. Many variants have been proposed, including the use of a shrinkage operation instead of the hyperbolic tangent (Kavukcuoglu et al., 2010). Note how the L1 penalty on $h$ tends to make them sparse, and how this is the same criterion as sparse coding with dictionary learning (Eq. 3) except for the additional constraint that one should be able to approximate the sparse codes $h$ with a parametrized encoder $f_\alpha(x)$. One can thus view PSD as an approximation to sparse coding, where we obtain a fast approximate encoder. Once PSD is trained, object representations $f_\alpha(x)$ are used to feed a classifier. They are computed quickly and can be further fine-tuned: the encoder can be viewed as one stage or one layer of a trainable multi-stage system such as a feedforward neural network.
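The sketch below evaluates the PSD criterion of Eq. 26 and performs a crude proximal-gradient minimization over the code $h$ for fixed parameters; it is a simplified stand-in for the optimization used in the cited work, with arbitrary step sizes and dimensions.

```python
import numpy as np

def psd_cost(x, h, W, b, lam=0.5):
    """PSD training criterion of Eq. 26 for a single example."""
    pred = np.tanh(b + W.T.dot(x))                 # fast encoder f_alpha(x), Eq. 27
    return (lam * np.abs(h).sum()
            + np.sum((x - W.dot(h)) ** 2)
            + np.sum((h - pred) ** 2))

def optimize_code(x, W, b, lam=0.5, n_steps=50, step=0.01):
    """Minimize Eq. 26 over the code h for fixed parameters, with simple
    proximal (soft-thresholding) gradient steps, a crude stand-in for the
    iterative sparse-code inference that PSD tries to approximate."""
    pred = np.tanh(b + W.T.dot(x))
    h = pred.copy()                                # warm-start at the encoder output
    for _ in range(n_steps):
        grad = -2.0 * W.T.dot(x - W.dot(h)) + 2.0 * (h - pred)
        h = h - step * grad
        h = np.sign(h) * np.maximum(np.abs(h) - step * lam, 0.0)  # prox of lam*|h|_1
    return h

# Toy usage: random dictionary and one input.
rng = np.random.RandomState(3)
d_x, d_h = 20, 30
W = rng.normal(0, 0.1, (d_x, d_h)); b = np.zeros(d_h)
x = rng.rand(d_x)
h = optimize_code(x, W, b)
print(psd_cost(x, h, W, b), np.mean(h != 0))
```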

PSD can also be seen as a kind of auto-encoder where the codes $h$ are given some freedom that can help to further improve reconstruction. One can also view the encoding penalty added on top of sparse coding as a kind of regularizer that forces the sparse codes to be nearly computable by a smooth and efficient encoder. This is in contrast with the codes obtained by complete optimization of the sparse coding criterion, which are highly non-smooth or even non-differentiable, a problem that motivated other approaches to smooth the inferred codes of sparse coding (Bagnell and Bradley, 2009), so a sparse coding stage could be jointly optimized along with following stages of a deep architecture.

8 Representation Learning as Manifold Learning

Another important perspective on representation learning is based on the geometric notion of manifold. Its premise is the manifold hypothesis, according to which real-world data presented in high dimensional spaces are expected to concentrate in the vicinity of a manifold $\mathcal{M}$ of much lower dimensionality $d_{\mathcal{M}}$, embedded in high dimensional input space $\mathbb{R}^{d_x}$. This prior seems particularly well suited for AI tasks such as those involving images, sounds or text, for which most uniformly sampled input configurations are unlike natural stimuli. As soon as there is a notion of “representation” then one can think of a manifold by considering the variations in input space, which are captured by or reflected (by corresponding changes) in the learned representation. To first approximation, some directions are well preserved (the tangent directions of the manifold) while others aren't (directions orthogonal to the manifolds). With this perspective, the primary unsupervised learning task is then seen as modeling the structure of the data-supporting manifold. (Actually, data points need not strictly lie on the “manifold”, but the probability density is expected to fall off sharply as one moves away from it, and it may actually be constituted of several possibly disconnected manifolds with different intrinsic dimensionality.) The representation being learned can be associated with an intrinsic coordinate system on the embedded manifold. The archetypal manifold modeling algorithm is, not surprisingly, also the archetypal low dimensional representation learning algorithm: Principal Component Analysis, which models a linear manifold. It was initially devised with the objective of finding the closest linear manifold to a cloud of data points. The principal components, i.e., the representation $f_\theta(x)$ that PCA yields for an input point $x$, uniquely locate its projection on that manifold: they correspond to intrinsic coordinates on the manifold. Data manifolds for complex real world domains are however expected to be strongly non-linear. Their modeling is sometimes approached as patchworks of locally linear tangent spaces (Vincent and Bengio, 2003; Brand, 2003). The large majority of algorithms built on this geometric perspective adopt a non-parametric approach, based on a training set nearest neighbor graph (Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004; Hinton and Roweis, 2003; van der Maaten and Hinton, 2008). In these non-parametric approaches, each high-dimensional training point has its own set of free low-dimensional embedding coordinates, which are optimized so that certain properties of the neighborhood graph computed in the original high dimensional input space are best preserved. These methods however do not directly learn a parametrized feature extraction function $f_\theta(x)$ applicable to new test points (for several of these techniques, representations for new points can be computed using the Nyström approximation, as proposed as an extension in (Bengio et al., 2004), but this remains cumbersome and computationally expensive), which seriously limits their use as feature extractors, except in a transductive setting. Comparatively few non-linear manifold learning methods have been proposed that learn a parametric map which can directly compute a representation for new points; we will focus on these.

8.1 Learning a parametric mapping based on a neighborhood graph

Some of the above non-parametric manifold learning algorithms can be modified to learn a parametric mapping $f_\theta$, i.e., applicable to new points: instead of having free low-dimensional embedding coordinate “parameters” for each training point, these coordinates are obtained through an explicitly parametrized function, as with the parametric variant (van der Maaten, 2009) of t-SNE (van der Maaten and Hinton, 2008).

Instead, Semi-Supervised Embedding (Weston et al., 2008) learns a direct encoding while taking into account the manifold hypothesis through a neighborhood graph. A parametrized neural network architecture simultaneously learns a manifold embedding and a classifier. The training criterion encourages training set neighbors to have similar representations.

The reduced and tightly controlled number of free parameters in such parametric methods, compared to their pure non-parametric counterparts, forces models to generalize the manifold shape non-locally (Bengio et al., 2006b), which can translate into better features and final performance (van der Maaten and Hinton, 2008). However, basing the modeling of manifolds on training set neighborhood relationships might be risky statistically in high dimensional spaces (sparsely populated due to the curse of dimensionality), as e.g. most Euclidean nearest neighbors risk having too little in common semantically. The nearest neighbor graph is simply not densely enough populated to map out satisfyingly the wrinkles of the target manifold (Bengio and Monperrus, 2005; Bengio et al., 2006b; Bengio and LeCun, 2007). It can also become problematic computationally to consider all pairs of data points (even if pairs are picked stochastically, many must be considered before obtaining one that weighs significantly on the optimization objective), which scales quadratically with training set size.

8.2 Learning to represent non-linear manifolds

Can we learn a manifold without requiring nearest neighbor searches? Yes, for example, with regularized auto-encoders or PCA. In PCA, the sensitivity of the extracted components (the code) to input changes is the same regardless of position $x$. The tangent space is the same everywhere along the linear manifold. By contrast, for a non-linear manifold, the tangent of the manifold changes as we move on the manifold, as illustrated in Figure 6. In non-linear representation-learning algorithms it is convenient to think about the local variations in the representation as the input $x$ is varied on the manifold, i.e., as we move among high-probability configurations. As we discuss below, the first derivative of the encoder therefore specifies the shape of the manifold (its tangent plane) around an example $x$ lying on it. If the density was really concentrated on the manifold, and the encoder had captured that, we would find the encoder derivatives to be non-zero only in the directions spanned by the tangent plane.

Let us consider sparse coding in this light: parameter matrix $W$ may be interpreted as a dictionary of input directions from which a different subset will be picked to model the local tangent space at an $x$ on the manifold. That subset corresponds to the active, i.e., non-zero, features for input $x$. Non-zero component $h_i$ will be sensitive to small changes of the input in the direction of the associated weight vector $W_{:,i}$, whereas inactive features are more likely to be stuck at 0 until a significant displacement has taken place in input space.

The Local Coordinate Coding (LCC) algorithm (Yu et al., 2009) is very similar to sparse coding, but is explicitly derived from a manifold perspective. Using the same notation as that of sparse coding in Eq. 2, LCC replaces the regularization term $\|h^{(t)}\|_1 = \sum_j |h_j^{(t)}|$, yielding the objective

$\mathcal{J}_{\mathrm{LCC}} = \sum_t \left( \|x^{(t)} - W h^{(t)}\|_2^2 + \lambda \sum_j |h_j^{(t)}| \, \|W_{:,j} - x^{(t)}\|^{1+p} \right)$   (28)

This is identical to sparse coding when $p = -1$, but with larger $p$ it encourages the active anchor points for $x^{(t)}$ (i.e., the codebook vectors $W_{:,j}$ with non-negligible $|h_j^{(t)}|$ that are combined to reconstruct $x^{(t)}$) to be not too far from $x^{(t)}$, hence the local aspect of the algorithm. An important theoretical contribution of Yu et al. (2009) is to show that any Lipschitz-smooth function $\phi: \mathcal{M} \rightarrow \mathbb{R}$ defined on a smooth nonlinear manifold $\mathcal{M}$ embedded in $\mathbb{R}^{d_x}$ can be well approximated by a globally linear function with respect to the resulting coding scheme (i.e., linear in $h$), where the accuracy of the approximation and required number $d_h$ of anchor points depend on $d_{\mathcal{M}}$ rather than $d_x$. This result has been further extended with the use of local tangent directions (Yu and Zhang, 2010), as well as to multiple layers (Lin et al., 2010).
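A small sketch of the LCC criterion of Eq. 28 follows, mainly to make explicit how the locality weights $\|W_{:,j} - x^{(t)}\|^{1+p}$ modulate the L1 term; the codes, anchors and values of $p$ used here are arbitrary, for illustration only.

```python
import numpy as np

def lcc_cost(X, H, W, lam=0.1, p=1.0):
    """Local Coordinate Coding criterion of Eq. 28.
    X: (T, d_x) inputs, H: (T, d_h) codes, W: (d_x, d_h) anchor points (columns)."""
    total = 0.0
    for x, h in zip(X, H):
        recon = np.sum((x - W.dot(h)) ** 2)
        # distance of each anchor W[:, j] to x, raised to the power 1 + p
        dists = np.linalg.norm(W - x[:, None], axis=0) ** (1.0 + p)
        locality = lam * np.sum(np.abs(h) * dists)
        total += recon + locality
    return total

# Toy usage: with p = -1 the locality weights are constant 1 and Eq. 28
# reduces to the sparse coding criterion; larger p penalizes far-away anchors.
rng = np.random.RandomState(4)
T, d_x, d_h = 5, 10, 15
X = rng.rand(T, d_x); W = rng.rand(d_x, d_h); H = rng.rand(T, d_h) * 0.1
print(lcc_cost(X, H, W, p=1.0), lcc_cost(X, H, W, p=-1.0))
```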

Let us now consider the efficient non-iterative “feed-forward” encoders $f_\theta$, used by PSD and the auto-encoders reviewed in Section 7.2, that are in the form of Eq. 20 or 27. The computed representation for $x$ will be only significantly sensitive to input space directions associated with non-saturated hidden units (see e.g. Eq. 24 for the Jacobian of a sigmoid layer). These directions to which the representation is significantly sensitive, like in the case of PCA or sparse coding, may be viewed as spanning the tangent space of the manifold at training point $x$.

Figure 4: The tangent vectors to the high-density manifold as estimated by a Contractive Auto-Encoder (Rifai et al., 2011a). The original input is shown on the top left. Each tangent vector (images on right side of first row) corresponds to a plausible additive deformation of the original input, as illustrated on the second row, where a bit of the 3rd singular vector is added to the original, to form a translated and deformed image. Unlike in PCA, the tangent vectors are different for different inputs, because the estimated manifold is highly non-linear.

Rifai et al. (2011a) empirically analyze in this light the singular value spectrum of the Jacobian (derivatives of representation vector with respect to input vector) of a trained CAE. Here the SVD provides an ordered orthonormal basis of most sensitive directions. The spectrum is sharply decreasing, indicating a relatively small number of significantly sensitive directions. This is taken as empirical evidence that the CAE indeed modeled the tangent space of a low-dimensional manifold. The leading singular vectors form a basis for the tangent plane of the estimated manifold, as illustrated in Figure 4. The CAE criterion is believed to achieve this thanks to its two opposing terms: the isotropic contractive penalty, that encourages the representation to be equally insensitive to changes in any input directions, and the reconstruction term, that pushes different training points (in particular neighbors) to have a different representation (so they may be reconstructed accurately), thus counteracting the isotropic contractive pressure only in directions tangent to the manifold.

Analyzing learned representations through the lens of the spectrum of the Jacobian and relating it to the notion of tangent space of a manifold is feasible, whenever the mapping is differentiable, and regardless of how it was learned, whether as direct encoding (as in auto-encoder variants), or derived from latent variable inference (as in sparse coding or RBMs). Exact low dimensional manifold models (like PCA) would yield non-zero singular values associated to directions along the manifold, and exact zeros for directions orthogonal to the manifold. But in smooth models like the CAE or the RBM we will instead have large versus relatively small singular values (as opposed to non-zero versus exactly zero).
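The kind of spectral analysis described above can be sketched in a few lines: compute the encoder Jacobian at a point (Eq. 24 gives its rows for a sigmoid layer) and take its SVD. With the random stand-in weights used here the spectrum is of course not meaningful, but with a trained CAE the leading right singular vectors estimate the local tangent plane of the data manifold.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def encoder_jacobian(x, W, b):
    """Jacobian dh/dx of h = sigmoid(b + W x): rows are h_j (1 - h_j) W_j (Eq. 24)."""
    h = sigmoid(b + W.dot(x))
    return (h * (1.0 - h))[:, None] * W        # shape (d_h, d_x)

rng = np.random.RandomState(5)
d_x, d_h = 50, 100
W = rng.normal(0, 0.3, (d_h, d_x))            # stand-in for trained CAE weights
b = rng.normal(0, 0.3, d_h)
x = rng.rand(d_x)

J = encoder_jacobian(x, W, b)
U, S, Vt = np.linalg.svd(J, full_matrices=False)
# A sharply decreasing spectrum suggests few significantly sensitive directions;
# the leading right singular vectors (rows of Vt) span the estimated local
# tangent plane of the manifold at x.
print("top singular values:", np.round(S[:5], 3))
tangent_basis = Vt[:5]                         # leading tangent directions at x
```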

8.3 Leveraging the modeled tangent spaces

The local tangent space, at a point along the manifold, can be thought of as capturing locally valid transformations that were prominent in the training data. For example, Rifai et al. (2011c) examine the tangent directions extracted with an SVD of the Jacobian of CAEs trained on digits, images, or text-document data: they appear to correspond to small translations or rotations for images or digits, and to substitutions of words within a same theme for documents. Such very local transformations along a data manifold are not expected to change class identity. To build their Manifold Tangent Classifier (MTC), Rifai et al. (2011c) then apply techniques such as tangent distance (Simard et al., 1993) and tangent propagation (Simard et al., 1992), that were initially developed to build classifiers that are insensitive to input deformations provided as prior domain knowledge. Now these techniques are applied using the local leading tangent directions extracted by a CAE, i.e., not using any prior domain knowledge (except the broad prior about the existence of a manifold). This approach set a new record for MNIST digit classification among prior-knowledge free approaches (it yielded a 0.81% error rate using the full MNIST training set, with no prior deformations, and no convolution).
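As an illustration of how such estimated tangents can be used, here is a sketch of a one-sided tangent distance: the distance from a point $y$ to the affine subspace spanned by the tangent vectors estimated at $x$. This is a simplified version of the tangent-distance idea used by the MTC, not the exact procedure of Rifai et al. (2011c); the orthonormal tangent basis here is synthetic.

```python
import numpy as np

def tangent_distance(x, y, T):
    """One-sided tangent distance: Euclidean distance from y to the affine
    subspace {x + T a}, where the columns of T are tangent vectors at x
    (e.g., leading right singular vectors of a CAE's encoder Jacobian)."""
    a, *_ = np.linalg.lstsq(T, y - x, rcond=None)  # best in-plane displacement
    return np.linalg.norm(x + T.dot(a) - y)

# Toy usage: y is x shifted along a tangent direction plus a little noise;
# the tangent distance ignores the in-plane shift that plain Euclidean
# distance would count in full.
rng = np.random.RandomState(6)
d_x, k = 30, 3
T = np.linalg.qr(rng.normal(size=(d_x, k)))[0]   # orthonormal tangent basis
x = rng.rand(d_x)
y = x + 2.0 * T[:, 0] + 0.01 * rng.normal(size=d_x)
print(np.linalg.norm(x - y), tangent_distance(x, y, T))
```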

9 Connections between Probabilistic and Direct Encoding models

The standard likelihood framework for probabilistic models decomposes the training criterion for models with parameters $\theta$ in two parts: the log-likelihood $\log P(x \mid \theta)$ (or $\log P(x \mid h, \theta)$ with latent variables $h$), and the prior $\log P(\theta)$ (or $\log P(h \mid \theta) + \log P(\theta)$ with latent variables).

9.1 PSD: a probabilistic interpretation

In the case of the PSD algorithm, a connection can be made between the above standard probabilistic view and the direct encoding computation graph. The probabilistic model of PSD is the same directed generative model $P(x \mid h)$ of sparse coding (Section 6.1.1), which only accounts for the decoder. The encoder is viewed as an approximate inference mechanism to guess $P(h \mid x)$ and initialize a MAP iterative inference (where the sparse prior $P(h)$ is taken into account). However, in PSD, the encoder is trained jointly with the decoder, rather than simply taking the end result of iterative inference as a target to approximate. An interesting view (suggested by Ian Goodfellow, personal communication) to reconcile these facts is that the encoder is a parametric approximation for the MAP solution of a variational lower bound on the joint log-likelihood. When MAP learning is viewed as a special case of variational learning (where the approximation of the joint log-likelihood is with a Dirac distribution located at the MAP solution), the variational recipe tells us to simultaneously improve the likelihood (reduce reconstruction error) and improve the variational approximation (reduce the discrepancy between the encoder output and the latent variable value). Hence PSD sits at the intersection of probabilistic models (with latent variables) and direct encoding methods (which directly parametrize the mapping from input to representation). RBMs also sit at the intersection because their particular parametrization includes an explicit mapping from input to representation, thanks to the restricted connectivity between hidden units. However, this nice property does not extend to their natural deep generalizations, i.e., Deep Boltzmann Machines, discussed in Section 10.2.

9.2 Regularized Auto-Encoders Capture Local Structure of the Density

Can we also say something about the probabilistic interpretation of regularized auto-encoders? Their training criterion does not fit the standard likelihood framework because this would involve a data-dependent “prior”. An interesting hypothesis emerges to answer that question, out of recent theoretical results (Vincent, 2011; Alain and Bengio, 2012): the training criterion of regularized auto-encoders, instead of being a form of maximum likelihood, corresponds to a different inductive principle, such as score matching. The score matching connection is discussed in Section 7.2.2 and has been shown for a particular parametrization of DAE and equivalent Gaussian RBM (Vincent, 2011). The work in Alain and Bengio (2012) generalizes this idea to a broader class of parametrizations (arbitrary encoders and decoders), and shows that by regularizing the auto-encoder so that it be contractive, one obtains that the reconstruction function and its derivative estimate first and second derivatives of the underlying data-generative density. This view can be exploited to successfully sample from auto-encoders, as shown in Rifai et al. (2012); Bengio et al. (2012). The proposed sampling algorithms are MCMCs similar to Langevin MCMC, using not just the estimated first derivative of the density but also the estimated manifold tangents so as to stay close to manifolds of high density.

Figure 5: Reconstruction function $r(x)$ (green) learned by a high-capacity autoencoder on 1-dimensional input, minimizing reconstruction error at training examples $x^{(t)}$ ($r(x^{(t)})$ in red) while trying to be as constant as possible otherwise. The dotted line is the identity reconstruction (which might be obtained without the regularizer). The blue arrows show the vector field of $r(x) - x$ pointing towards high density peaks estimated by the model, and estimating the score (log-density derivative).

This interpretation connects well with the geometric perspective introduced in Section 8. The regularization effects (e.g., due to a sparsity regularizer, a contractive regularizer, or the denoising criterion) ask the learned representation to be as insensitive as possible to the input, while minimizing reconstruction error on the training examples forces the representation to contain just enough information to distinguish them. The solution is that variations along the high-density manifolds are preserved while other variations are compressed: the reconstruction function should be as constant as possible while reproducing training examples, i.e., points near a training example should be mapped to that training example (Figure 5). The reconstruction function should map an input towards the nearest point on the manifold, i.e., the difference between reconstruction and input is a vector aligned with the estimated score (the derivative of the log-density with respect to the input). The score can be zero on the manifold (where reconstruction error is also zero), at local maxima of the log-density, but it can also be zero at local minima. It means that we cannot equate low reconstruction error with high estimated probability. The second derivatives of the log-density correspond to the first derivatives of the reconstruction function, and on the manifold (where the first derivative is 0), they indicate the tangent directions of the manifold (where the first derivative remains near 0).

Figure 6: Sampling from regularized auto-encoders (Rifai et al., 2012; Bengio et al., 2012): each MCMC step adds to the current state $x$ the noise $\delta$, mostly in the directions of the estimated manifold tangent plane $H$, and projects back towards the manifold (high-density regions) by performing a reconstruction step.

As illustrated in Figure 6, the basic idea of the auto-encoder sampling algorithms in Rifai et al. (2012); Bengio et al. (2012) is to make MCMC moves where one (a) moves toward the manifold by following the density gradient (i.e., applying a reconstruction) and (b) adds noise in the directions of the leading singular vectors of the reconstruction (or encoder) Jacobian, corresponding to those associated with smallest second derivative of the log-density.
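A schematic implementation of this sampling procedure, for the toy tied-weight sigmoid auto-encoder used in the earlier sketches, is given below; the noise scale, number of tangent directions and the untrained weights are placeholders, and the cited papers use refinements beyond this bare alternation.

```python
import numpy as np

rng = np.random.RandomState(7)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def reconstruct(x, W, b, d):
    """Tied-weight sigmoid auto-encoder reconstruction r(x) = g(f(x))."""
    return sigmoid(d + W.T.dot(sigmoid(b + W.dot(x))))

def tangent_basis(x, W, b, k=5):
    """Leading right singular vectors of the encoder Jacobian at x (local tangents)."""
    h = sigmoid(b + W.dot(x))
    J = (h * (1.0 - h))[:, None] * W
    return np.linalg.svd(J, full_matrices=False)[2][:k]   # shape (k, d_x)

def sample_chain(x0, W, b, d, n_steps=100, sigma=0.1, k=5):
    """Alternate (a) a reconstruction step toward high density and (b) noise
    injected mostly within the estimated tangent plane, as in Figure 6."""
    x, samples = x0.copy(), []
    for _ in range(n_steps):
        x = reconstruct(x, W, b, d)                       # move toward the manifold
        V = tangent_basis(x, W, b, k)                     # local tangent directions
        x = x + V.T.dot(sigma * rng.randn(k))             # perturb along the tangents
        samples.append(x.copy())
    return np.array(samples)

# Toy usage with untrained (random) weights, just to show the mechanics.
d_x, d_h = 20, 40
W = rng.normal(0, 0.3, (d_h, d_x)); b = np.zeros(d_h); d = np.zeros(d_x)
chain = sample_chain(rng.rand(d_x), W, b, d)
print(chain.shape)
```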

9.3 Learning Approximate Inference

Let us now consider more closely how a representation is computed in probabilistic models with latent variables, when iterative inference is required. There is a computation graph (possibly with random number generation in some of the nodes, in the case of MCMC) that maps inputs to representation, and in the case of deterministic inference (e.g., MAP inference or variational inference), that function could be optimized directly. This is a way to generalize PSD that has been explored in recent work on probabilistic models at the intersection of inference and learning (Bagnell and Bradley, 2009; Gregor and LeCun, 2010b; Grubb and Bagnell, 2010; Salakhutdinov and Larochelle, 2010; Stoyanov et al., 2011; Eisner, 2012), where a central idea is that instead of using a generic inference mechanism, one can use one that is learned and is more efficient, taking advantage of the specifics of the type of data on which it is applied.

9.4 Sampling Challenges

A troubling challenge with many probabilistic models with latent variables like most Boltzmann machine variants is that good MCMC sampling is required as part of the learning procedure, but that sampling becomes extremely inefficient (or unreliable) as training progresses because the modes of the learned distribution become sharper, making mixing between modes very slow. Whereas initially during training a learner assigns mass almost uniformly, as training progresses, its entropy decreases, approaching the entropy of the target distribution as more examples and more computation are provided. According to our Manifold and Natural Clustering priors of Section 3.1, the target distribution has sharp modes (manifolds) separated by extremely low density areas. Mixing then becomes more difficult because MCMC methods, by their very nature, tend to make small steps to nearby high-probability configurations. This is illustrated in Figure 7.

Figure 7: Top: early during training, MCMC mixes easily between modes because the estimated distribution has high entropy and puts enough mass everywhere for small-steps movements (MCMC) to go from mode to mode. Bottom: later on, training relying on good mixing can stall because estimated modes are separated by wide low-density deserts.

Bengio et al. (2013) suggest that deep representations could help mixing between such well separated modes, based on both theoretical arguments and on empirical evidence. The idea is that if higher-level representations disentangle better the underlying abstract factors, then small steps in this abstract space (e.g., swapping from one category to another) can easily be done by MCMC. The high-level representations can then be mapped back to the input space in order to obtain input-level samples, as in the Deep Belief Networks (DBN) sampling algorithm (Hinton et al., 2006). This has been demonstrated both with DBNs and with the newly proposed algorithm for sampling from contracting and denoising auto-encoders (Rifai et al., 2012; Bengio et al., 2012). This observation alone does not suffice to solve the problem of training a DBN or a DBM, but it may provide a crucial ingredient, and it makes it possible to consider successfully sampling from deep models trained by procedures that do not require an MCMC, like the stacked regularized auto-encoders used in Rifai et al. (2012).

9.5 Evaluating and Monitoring Performance

It is always possible to evaluate a feature learning algorithm in terms of its usefulness with respect to a particular task (e.g. object classification), with a predictor that is fed or initialized with the learned features. In practice, we do this by saving the features learned (e.g. at regular intervals during training, to perform early stopping) and training a cheap classifier on top (such as a linear classifier). However, training the final classifier can be a substantial computational overhead (e.g., supervised fine-tuning a deep neural network takes usually more training iterations than the feature learning itself), so we may want to avoid having to train a classifier for every training iteration of the unsupervised learner and every hyper-parameter setting. More importantly this may give an incomplete evaluation of the features (what would happen for other tasks?). All these issues motivate the use of methods to monitor and evaluate purely unsupervised performance. This is rather easy with all the auto-encoder variants (with some caution outlined below) and rather difficult with the undirected graphical models such as the RBM and Boltzmann machines.

For auto-encoder and sparse coding variants, test set reconstruction error can readily be computed, but by itself may be misleading because larger capacity (e.g., more features, more training time) tends to systematically lead to lower reconstruction error, even on the test set. Hence it cannot be used reliably for selecting most hyper-parameters. On the other hand, denoising reconstruction error is clearly immune to this problem, so that solves the problem for DAEs. Based on the connection between DAEs and CAEs uncovered in Bengio et al. (2012); Alain and Bengio (2012), this immunity can be extended to CAEs, but not to the hyper-parameter controlling the amount of noise or of contraction.
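As a minimal sketch of such monitoring (assuming a tied-weight sigmoid DAE with masking noise whose parameters W, b, c were trained elsewhere; the function names and shapes are illustrative, not from any particular implementation), one can estimate the denoising reconstruction error on a held-out set by corrupting each test example with the training-time noise process and measuring reconstruction of the clean input:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def denoising_error(X_test, W, b, c, corruption=0.3, n_repeats=5, seed=0):
    """Average squared denoising reconstruction error of a tied-weight sigmoid DAE.

    X_test: (n_examples, n_inputs); W: (n_inputs, n_hidden); b: hidden bias; c: input bias.
    Inputs are corrupted by zeroing a random fraction of entries (masking noise).
    """
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_repeats):  # average over noise draws to reduce estimator variance
        mask = rng.random(X_test.shape) > corruption
        X_tilde = X_test * mask                    # corrupted input
        H = sigmoid(X_tilde @ W + b)               # encoder
        X_rec = sigmoid(H @ W.T + c)               # tied-weight decoder
        errs.append(((X_rec - X_test) ** 2).sum(axis=1).mean())
    return float(np.mean(errs))

# Hypothetical usage with random parameters, just to show the call signature:
rng = np.random.default_rng(1)
X = (rng.random((100, 20)) > 0.5).astype(float)
W, b, c = 0.1 * rng.standard_normal((20, 10)), np.zeros(10), np.zeros(20)
print(denoising_error(X, W, b, c))
```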

For RBMs and some (not too deep) Boltzmann machines, one option is the use of Annealed Importance Sampling (Murray and Salakhutdinov, 2009) in order to estimate the partition function (and thus the test log-likelihood). Note that this estimator can have high variance and that it becomes less reliable (variance becomes too large) as the model becomes more interesting, with larger weights, more non-linearity, sharper modes and a sharper probability density function (see our previous discussion in Section 9.4). Another interesting and recently proposed option for RBMs is to track the partition function during training (Desjardins et al., 2011), which could be useful for early stopping and reducing the cost of ordinary AIS. For toy RBMs (e.g., 25 hidden units or less, or 25 inputs or less), the exact log-likelihood can also be computed analytically, and this can be a good way to debug and verify some properties of interest.
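For the toy-RBM case, the exact computation is straightforward; the sketch below (a hedged illustration with arbitrary random parameters, assuming a binary-binary RBM with energy $E(v,h) = -b^{T}v - c^{T}h - v^{T}Wh$) enumerates all visible configurations to obtain the partition function and hence the exact average test log-likelihood:

```python
import numpy as np
from itertools import product

def free_energy(V, W, b, c):
    """Free energy of a binary RBM for rows of V: F(v) = -v.b - sum_j softplus(c_j + (v.W)_j)."""
    return -(V @ b) - np.logaddexp(0.0, V @ W + c).sum(axis=1)

def exact_log_likelihood(V_test, W, b, c):
    n_visible = W.shape[0]
    # Enumerate all 2^n_visible configurations to compute log Z exactly (toy sizes only).
    all_v = np.array(list(product([0, 1], repeat=n_visible)), dtype=float)
    log_Z = np.logaddexp.reduce(-free_energy(all_v, W, b, c))
    return (-free_energy(V_test, W, b, c) - log_Z).mean()

# Hypothetical toy RBM with random weights, just to show the computation:
rng = np.random.default_rng(0)
n_visible, n_hidden = 8, 5
W = 0.5 * rng.standard_normal((n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
V_test = (rng.random((20, n_visible)) > 0.5).astype(float)
print("exact average test log-likelihood:", exact_log_likelihood(V_test, W, b, c))
```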

10 Global Training of Deep Models

One of the most interesting challenges raised by deep architectures is: how should we jointly train all the levels? In the previous section and in Section 4 we have only discussed how single-layer models could be combined to form a deep model. Here we consider joint training of all the levels and the difficulties that may arise.

10.1 The Challenge of Training Deep Architectures

Higher-level abstraction means more non-linearity. It means that two nearby input configurations may be interpreted very differently because a few surface details change the underlying semantics, whereas most other changes in the surface details would not change the underlying semantics. The representations associated with input manifolds may be complex because the mapping from input to representation may have to unfold and distort input manifolds that generally have complicated shapes into spaces where distributions are much simpler, where relations between factors are simpler, maybe even linear or involving many (conditional) independencies. Our expectation is that modeling the joint distribution between high-level abstractions and concepts should be much easier in the sense of requiring much less data to learn. The hard part is learning a good representation that does this unfolding and disentangling. This may be at the price of a more difficult training problem, possibly involving ill-conditioning and local minima.

It is only since 2006 that researchers have seriously investigated ways to train deep architectures, with the exception of convolutional networks (LeCun et al., 1998b). The first realization (Section 4) was that unsupervised or supervised layer-wise training was easier, and that this could be taken advantage of by stacking single-layer models into deeper ones.

It is interesting to ask why the layerwise unsupervised pre-training procedure sometimes helps a supervised learner (Erhan et al., 2010b). There seems to be a more general principle at play (first suggested to us by Leon Bottou): that of guiding the training of intermediate representations, which may be easier than trying to learn it all in one go. This is nicely related to the curriculum learning idea (Bengio et al., 2009), that it may be much easier to learn simpler concepts first and then build higher-level ones on top of simpler ones. This is also coherent with the success of several deep learning algorithms that provide some such guidance for intermediate representations, like Semi-Supervised Embedding (Weston et al., 2008).

The question of why unsupervised pre-training could be helpful was extensively studied (Erhan et al., 2010b), trying to dissect the answer into a regularization effect and an optimization effect. The regularization effect is clear from the experiments where the stacked RBMs or denoising auto-encoders are used to initialize a supervised classification neural network (Erhan et al., 2010b). It may simply come from the use of unsupervised learning to bias the learning dynamics and initialize it in the basin of attraction of a “good” local minimum (of the training criterion), where “good” is in terms of generalization error. The underlying hypothesis exploited by this procedure is that some of the features or latent factors that are good at capturing the leading variations in the input distribution are also good at capturing the variations in the target output random variables of interest (e.g., classes). The optimization effect is more difficult to tease out because the top two layers of a deep neural net can just overfit the training set whether the lower layers compute useful features or not, but there are several indications that optimizing the lower levels with respect to a supervised training criterion can be challenging.

One such indication is that changing the numerical conditions of the optimization procedure can have a profound impact on the joint training of a deep architecture, for example by changing the initialization range and changing the type of non-linearity used (Glorot and Bengio, 2010), much more so than with shallow architectures. One hypothesis to explain some of the difficulty in the optimization of deep architectures is centered on the singular values of the Jacobian matrix associated with the transformation from the features at one level into the features at the next level (Glorot and Bengio, 2010). If these singular values are all small (less than 1), then the mapping is contractive in every direction and gradients would vanish when propagated backwards through many layers. This is a problem already discussed for recurrent neural networks (Bengio et al., 1994), which can be seen as very deep networks with shared parameters at each layer, when unfolded in time. This optimization difficulty has motivated the exploration of second-order methods for deep architectures and recurrent networks, in particular Hessian-free second-order methods (Martens, 2010; Martens and Sutskever, 2011). Unsupervised pre-training has also been proposed to help training recurrent networks and temporal RBMs (Sutskever et al., 2009), i.e., at each time step there is a local signal to guide the discovery of good features to capture in the state variables: model with the current state (as hidden units) the joint distribution of the previous state and the current input. Natural gradient (Amari, 1998) methods that can be applied to networks with millions of parameters (i.e. with good scaling properties) have also been proposed (Le Roux et al., 2008b; Pascanu and Bengio, 2013). Cho et al. (2011) proposes to use adaptive learning rates for RBM training, along with a novel and interesting idea for a gradient estimator that takes into account the invariance of the model to flipping hidden unit bits and inverting signs of corresponding weight vectors. At least one study indicates that the choice of initialization (to make the Jacobian of each layer closer to 1 across all its singular values) could substantially reduce the training difficulty of deep networks (Glorot and Bengio, 2010) and this is coherent with the success of the initialization procedure of Echo State Networks (Jaeger, 2007), as recently studied by Sutskever (2012). There are also several experimental results (Glorot and Bengio, 2010; Glorot et al., 2011a; Nair and Hinton, 2010) showing that the choice of hidden units non-linearity could influence both training and generalization performance, with particularly interesting results obtained with sparse rectifying units (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a; Krizhevsky et al., 2012). An old idea regarding the ill-conditioning issue with neural networks is that of symmetry breaking: part of the slowness of convergence may be due to many units moving together (like sheep) and all trying to reduce the output error for the same examples. By initializing with sparse weights (Martens, 2010) or by using often saturated non-linearities (such as rectifiers as max-pooling units), gradients only flow along a few paths, which may help hidden units to specialize more quickly. 
Another promising idea to improve the conditioning of neural network training is to nullify the average value and slope of each hidden unit output (Raiko et al., 2012), and possibly locally normalize magnitude as well (Jarrett et al., 2009). The debate still rages between using online methods such as stochastic gradient descent and using second-order methods on large minibatches (of several thousand examples) (Martens, 2010; Le et al., 2011a), with a variant of stochastic gradient descent recently winning an optimization challenge (https://sites.google.com/site/nips2011workshop/optimization-challenges).
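The role of the layer Jacobian can be illustrated numerically; the following sketch (an illustration under our own arbitrary choices of layer size and input, not an experiment from the cited papers) compares the singular values of the Jacobian of a tanh layer under a naive unit-variance initialization and under a variance-scaled initialization in the spirit of Glorot and Bengio (2010):

```python
import numpy as np

def layer_jacobian_singular_values(n_in, n_out, init_scale, seed=0):
    """Singular values of the Jacobian of h = tanh(W x) at a random input x."""
    rng = np.random.default_rng(seed)
    W = init_scale * rng.standard_normal((n_out, n_in))
    x = rng.standard_normal(n_in)
    h = np.tanh(W @ x)
    J = (1.0 - h ** 2)[:, None] * W     # d h / d x = diag(1 - h^2) W
    return np.linalg.svd(J, compute_uv=False)

n = 500
naive = layer_jacobian_singular_values(n, n, init_scale=1.0)
# Variance-scaled initialization in the spirit of Glorot and Bengio (2010):
scaled = layer_jacobian_singular_values(n, n, init_scale=np.sqrt(2.0 / (n + n)))
print("naive init : median singular value %.3f" % np.median(naive))   # saturated units, tiny values
print("scaled init: median singular value %.3f" % np.median(scaled))  # values much closer to 1
```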

Finally, several recent results exploiting large quantities of labeled data suggest that with proper initialization and choice of non-linearity, very deep purely supervised networks can be trained successfully without any layerwise pre-training (Ciresan et al., 2010; Glorot et al., 2011a; Seide et al., 2011a; Krizhevsky et al., 2012). Researchers report that in such conditions, layerwise unsupervised pre-training brought little or no improvement over pure supervised learning from scratch when training for long enough. This reinforces the hypothesis that unsupervised pre-training acts as a prior, which may be less necessary when very large quantities of labeled data are available, but begs the question of why this had not been discovered earlier. The latest results reported in this respect (Krizhevsky et al., 2012) are particularly interesting because they drastically reduced the error rate of object recognition on a benchmark (the 1000-class ImageNet task) on which many more traditional computer vision approaches had been evaluated (http://www.image-net.org/challenges/LSVRC/2012/results.html). The main techniques that allowed this success include the following: efficient GPU training allowing one to train longer (more than 100 million visits of examples), an aspect first reported by Lee et al. (2009a); Ciresan et al. (2010), a large number of labeled examples, artificially transformed examples (see Section 11.1), a large number of tasks (1000 or 10000 classes for ImageNet), convolutional architecture with max-pooling (see Section 11 for these latter two techniques), rectifying non-linearities (discussed above), careful initialization (discussed above), careful parameter update and adaptive learning rate heuristics, layerwise feature normalization (across features), and a new dropout trick based on injecting strong binary multiplicative noise on hidden units. This trick is similar to the binary noise injection used at each layer of a stack of denoising auto-encoders. Future work will hopefully help identify which of these elements matter most, how to generalize them across a large variety of tasks and architectures, and in particular contexts where most examples are unlabeled, i.e., including an unsupervised component in the training criterion.
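The binary multiplicative noise ("dropout") trick mentioned above can be sketched in a few lines; the version below is the common "inverted dropout" variant (rescaling kept units at training time so the test-time forward pass is unchanged), which is an illustrative simplification rather than the exact scheme of Krizhevsky et al. (2012):

```python
import numpy as np

def dropout_forward(H, drop_prob=0.5, train=True, rng=None):
    """Apply binary multiplicative noise to hidden activations H (inverted-dropout variant)."""
    if not train or drop_prob == 0.0:
        return H                                    # at test time, use the full activations
    rng = rng or np.random.default_rng()
    mask = (rng.random(H.shape) >= drop_prob)       # strong binary noise on hidden units
    return H * mask / (1.0 - drop_prob)             # rescale so the expected activation is unchanged

# Hypothetical usage on a batch of hidden activations:
rng = np.random.default_rng(0)
H = rng.random((4, 8))
print(dropout_forward(H, drop_prob=0.5, train=True, rng=rng))
```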

10.2 Joint Training of Deep Boltzmann Machines

We now consider the problem of joint training of all layers of a specific unsupervised model, the Deep Boltzmann Machine (DBM). Whereas much progress (albeit with many unanswered questions) has been made on jointly training all the layers of deep architectures using back-propagated gradients (i.e., mostly in the supervised setting), much less work has been done on their purely unsupervised counterpart, e.g. with DBMs (joint training of all the layers of a Deep Belief Net is much more challenging because of the much harder inference problem involved). Note however that one could hope that the successful techniques described in the previous section could be applied to unsupervised learning algorithms.

Like the RBM, the DBM is another particular subset of the Boltzmann machine family of models where the units are again arranged in layers. However, unlike the RBM, the DBM possesses multiple layers of hidden units, with units in odd-numbered layers being conditionally independent given even-numbered layers, and vice-versa. With respect to the Boltzmann energy function of Eq. 7, the DBM corresponds to setting $U=0$ and a sparse connectivity structure in both $V$ and $W$. We can make the structure of the DBM more explicit by specifying its energy function. For the model with two hidden layers it is given as:

\[
{\cal E}_{\theta}^{\mathrm{DBM}}(v,h^{(1)},h^{(2)};\theta) = -v^{T}Wh^{(1)} - {h^{(1)}}^{T}Vh^{(2)} - {d^{(1)}}^{T}h^{(1)} - {d^{(2)}}^{T}h^{(2)} - b^{T}v, \qquad (29)
\]

with $\theta=\{W,V,d^{(1)},d^{(2)},b\}$. The DBM can also be characterized as a bipartite graph between two sets of vertices, formed by odd and even-numbered layers (with $v:=h^{(0)}$).

10.2.1 Mean-field approximate inference

A key point of departure from the RBM is that the posterior distribution over the hidden units (given the visibles) is no longer tractable, due to the interactions between the hidden units. Salakhutdinov and Hinton (2009) resort to a mean-field approximation to the posterior. Specifically, in the case of a model with two hidden layers, we wish to approximate $P\left(h^{(1)},h^{(2)}\mid v\right)$ with the factored distribution $Q_{v}(h^{(1)},h^{(2)})=\prod_{j=1}^{N_{1}}Q_{v}\left(h^{(1)}_{j}\right)\,\prod_{i=1}^{N_{2}}Q_{v}\left(h^{(2)}_{i}\right)$, such that the KL divergence $\mathrm{KL}\left(P\left(h^{(1)},h^{(2)}\mid v\right)\|Q_{v}(h^{(1)},h^{(2)})\right)$ is minimized or, equivalently, that a lower bound to the log likelihood is maximized:

\[
\log P(v) > \mathcal{L}(Q_{v}) \equiv \sum_{h^{(1)}}\sum_{h^{(2)}} Q_{v}(h^{(1)},h^{(2)}) \log\left(\frac{P(v,h^{(1)},h^{(2)})}{Q_{v}(h^{(1)},h^{(2)})}\right) \qquad (30)
\]

Maximizing this lower-bound with respect to the mean-field distribution $Q_{v}(h^{(1)},h^{(2)})$ (by setting derivatives to zero) yields the following mean field update equations:

\[
\hat{h}^{(1)}_{i} \leftarrow \mathrm{sigmoid}\left(\sum_{j} W_{ji} v_{j} + \sum_{k} V_{ik} \hat{h}^{(2)}_{k} + d^{(1)}_{i}\right) \qquad (31)
\]
\[
\hat{h}^{(2)}_{k} \leftarrow \mathrm{sigmoid}\left(\sum_{i} V_{ik} \hat{h}^{(1)}_{i} + d^{(2)}_{k}\right) \qquad (32)
\]

Note how the above equations ostensibly look like a fixed point recurrent neural network, i.e., with constant input. In the same way that an RBM can be associated with a simple auto-encoder, the above mean-field update equations for the DBM can be associated with a recurrent auto-encoder. In that case the training criterion involves the reconstruction error at the last or at consecutive time steps. This type of model has been explored by Savard (2011) and Seung (1998) and shown to do a better job at denoising than ordinary auto-encoders.
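A minimal sketch of this fixed-point iteration is given below (the parameter shapes, e.g. $W$ of size $n_v \times n_1$ and $V$ of size $n_1 \times n_2$, and the update order are our own illustrative choices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field(v, W, V, d1, d2, n_iters=50, tol=1e-6):
    """Iterate the DBM mean-field updates of Eq. (31)-(32) until convergence.

    v: (n_v,) visible vector; W: (n_v, n_1); V: (n_1, n_2); d1, d2: hidden biases.
    Returns the variational parameters (h1_hat, h2_hat) of the factorized posterior Q_v.
    """
    h1 = np.full(W.shape[1], 0.5)
    h2 = np.full(V.shape[1], 0.5)
    for _ in range(n_iters):
        h1_new = sigmoid(v @ W + h2 @ V.T + d1)     # Eq. (31)
        h2_new = sigmoid(h1_new @ V + d2)           # Eq. (32)
        converged = max(np.abs(h1_new - h1).max(), np.abs(h2_new - h2).max()) < tol
        h1, h2 = h1_new, h2_new
        if converged:
            break
    return h1, h2

# Hypothetical toy usage:
rng = np.random.default_rng(0)
W, V = 0.1 * rng.standard_normal((20, 10)), 0.1 * rng.standard_normal((10, 5))
d1, d2 = np.zeros(10), np.zeros(5)
v = (rng.random(20) > 0.5).astype(float)
h1_hat, h2_hat = mean_field(v, W, V, d1, d2)
```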

Iterating Eq. (31)–(32) until convergence yields the $Q$ parameters of the “variational positive phase” of Eq. (33):

\[
\begin{aligned}
\mathcal{L}(Q_{v}) &= \mathbb{E}_{Q_{v}}\left[\log P(v,h^{(1)},h^{(2)}) - \log Q_{v}(h^{(1)},h^{(2)})\right] \\
&= \mathbb{E}_{Q_{v}}\left[-{\cal E}_{\theta}^{\mathrm{DBM}}(v,h^{(1)},h^{(2)}) - \log Q_{v}(h^{(1)},h^{(2)})\right] - \log Z_{\theta} \\
\frac{\partial\mathcal{L}(Q_{v})}{\partial\theta} &= -\mathbb{E}_{Q_{v}}\left[\frac{\partial{\cal E}_{\theta}^{\mathrm{DBM}}(v,h^{(1)},h^{(2)})}{\partial\theta}\right] + \mathbb{E}_{P}\left[\frac{\partial{\cal E}_{\theta}^{\mathrm{DBM}}(v,h^{(1)},h^{(2)})}{\partial\theta}\right] \qquad (33)
\end{aligned}
\]

This variational learning procedure leaves the “negative phase” untouched, which can thus be estimated through SML or Contrastive Divergence (Hinton, 2000) as in the RBM case.

10.2.2 Training Deep Boltzmann Machines

The major difference between training a DBM and an RBM is that instead of maximizing the likelihood directly, we choose parameters to maximize the lower-bound on the likelihood given in Eq. 30. The SML-based algorithm for maximizing this lower-bound is as follows (a schematic sketch in code follows the list):

  1. Clamp the visible units to a training example.
  2. Iterate over Eq. (31)–(32) until convergence.
  3. Generate negative phase samples $v^{-}$, $h^{(1)-}$ and $h^{(2)-}$ through SML.
  4. Compute $\partial\mathcal{L}(Q_{v})/\partial\theta$ using the values obtained in steps 2–3.
  5. Finally, update the model parameters with a step of approximate stochastic gradient ascent.
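The sketch below illustrates one iteration of this procedure for a two-hidden-layer DBM in plain numpy; it is a simplified, hedged illustration (mean-field positive phase as in Eq. (31)–(32), one sweep of Gibbs sampling on persistent chains for the negative phase, plain gradient ascent), not the exact implementation of Salakhutdinov and Hinton (2009):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dbm_sml_step(v_data, params, chains, lr=0.01, mf_iters=10, rng=None):
    """One approximate stochastic gradient ascent step on the bound of Eq. (30).

    v_data: (batch, n_v) training examples; params: dict with W (n_v, n_1), V (n_1, n_2), b, d1, d2;
    chains: dict of persistent negative-phase states v, h1, h2 (updated in place and returned).
    """
    rng = rng or np.random.default_rng()
    W, V = params["W"], params["V"]
    b, d1, d2 = params["b"], params["d1"], params["d2"]

    # Positive phase: mean-field inference (Eq. 31-32) with the visibles clamped to the data.
    h1 = np.full((v_data.shape[0], W.shape[1]), 0.5)
    h2 = np.full((v_data.shape[0], V.shape[1]), 0.5)
    for _ in range(mf_iters):
        h1 = sigmoid(v_data @ W + h2 @ V.T + d1)
        h2 = sigmoid(h1 @ V + d2)

    # Negative phase: one sweep of Gibbs sampling on persistent chains (SML).
    vq, h1q, h2q = chains["v"], chains["h1"], chains["h2"]
    h1q = (rng.random(h1q.shape) < sigmoid(vq @ W + h2q @ V.T + d1)).astype(float)
    h2q = (rng.random(h2q.shape) < sigmoid(h1q @ V + d2)).astype(float)
    vq = (rng.random(vq.shape) < sigmoid(h1q @ W.T + b)).astype(float)
    chains.update(v=vq, h1=h1q, h2=h2q)

    # Parameter update: difference of data-dependent and model expectations, as in Eq. (33).
    n, m = v_data.shape[0], vq.shape[0]
    params["W"] += lr * (v_data.T @ h1 / n - vq.T @ h1q / m)
    params["V"] += lr * (h1.T @ h2 / n - h1q.T @ h2q / m)
    params["b"] += lr * (v_data.mean(0) - vq.mean(0))
    params["d1"] += lr * (h1.mean(0) - h1q.mean(0))
    params["d2"] += lr * (h2.mean(0) - h2q.mean(0))
    return params, chains

# Hypothetical usage with random initial parameters and persistent chains:
rng = np.random.default_rng(0)
n_v, n_1, n_2, n_chains = 20, 10, 5, 16
params = dict(W=0.01 * rng.standard_normal((n_v, n_1)),
              V=0.01 * rng.standard_normal((n_1, n_2)),
              b=np.zeros(n_v), d1=np.zeros(n_1), d2=np.zeros(n_2))
chains = dict(v=(rng.random((n_chains, n_v)) > 0.5).astype(float),
              h1=np.zeros((n_chains, n_1)), h2=np.zeros((n_chains, n_2)))
v_batch = (rng.random((32, n_v)) > 0.5).astype(float)
params, chains = dbm_sml_step(v_batch, params, chains, rng=rng)
```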

While the above procedure appears to be a simple extension of the highly effective SML scheme for training RBMs, as we demonstrate in  Desjardins et al. (2012), this procedure seems vulnerable to falling in poor local minima which leave many hidden units effectively dead (not significantly different from its random initialization with small norm).

The failure of the SML joint training strategy was noted by Salakhutdinov and Hinton (2009). As an alternative, they proposed a greedy layer-wise training strategy. This procedure consists in pre-training the layers of the DBM, in much the same way as the Deep Belief Network: i.e. by stacking RBMs and training each layer to independently model the output of the previous layer. A final joint “fine-tuning” is done following the above SML-based procedure.

11 Building-In Invariance

It is well understood that incorporating prior domain knowledge helps machine learning. Exploring good strategies for doing so is a very important research avenue. However, if we are to advance our understanding of core machine learning principles, it is important that we keep comparisons between predictors fair and maintain a clear awareness of the prior domain knowledge used by different learning algorithms, especially when comparing their performance on benchmark problems. We have so far only presented algorithms that exploited only generic inductive biases for high dimensional problems, thus making them potentially applicable to any high dimensional problem. The most prevalent approach to incorporating prior knowledge is to hand-design better features to feed a generic classifier, and has been used extensively in computer vision (e.g. (Lowe, 1999)). Here, we rather focus on how basic domain knowledge of the input, in particular its topological structure (e.g. bitmap images having a 2D structure), may be used to learn better features.

11.1 Generating transformed examples

Generalization performance is usually improved by providing a larger quantity of representative data. This can be achieved by generating new examples by applying small random deformations to the original training examples, using deformations that are known not to change the target variables of interest, e.g., an object class is invariant to small transformations of images such as translations, rotations, scaling, or shearing. This old approach (Baird, 1990) has been recently applied with great success in the work of Ciresan et al. (2010), who used an efficient GPU implementation (a $40\times$ speedup) to train a standard but large deep multilayer Perceptron on deformed MNIST digits. Using both affine and elastic deformations (Simard et al., 2003), with plain old stochastic gradient descent, they reach a record 0.32% classification error rate.
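A minimal sketch of this kind of data augmentation is given below; it applies only small random rotations and translations (SciPy's ndimage routines are assumed available), whereas the elastic deformations of Simard et al. (2003) would additionally warp each image with a smoothed random displacement field:

```python
import numpy as np
from scipy.ndimage import rotate, shift  # assumes SciPy is available

def random_deformations(images, n_copies=4, max_angle=10.0, max_shift=2.0, seed=0):
    """Generate label-preserving deformed copies of each image (small rotations and translations).

    images: (n, height, width) array. Returns (n * n_copies, height, width).
    """
    rng = np.random.default_rng(seed)
    out = []
    for img in images:
        for _ in range(n_copies):
            angle = rng.uniform(-max_angle, max_angle)        # small, class-preserving rotation
            dy, dx = rng.uniform(-max_shift, max_shift, size=2)
            deformed = rotate(img, angle, reshape=False, order=1, mode="constant")
            deformed = shift(deformed, (dy, dx), order=1, mode="constant")
            out.append(deformed)
    return np.stack(out)

# Hypothetical usage on random "digit-like" images:
fake_digits = np.random.default_rng(1).random((5, 28, 28))
augmented = random_deformations(fake_digits)
print(augmented.shape)   # (20, 28, 28)
```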

11.2 Convolution and pooling

Another powerful approach is based on even more basic knowledge of merely the topological structure of the input dimensions. By this we mean e.g., the 2D layout of pixels in images or audio spectrograms, the 3D structure of videos, the 1D sequential structure of text or of temporal sequences in general. Based on such structure, one can define local receptive fields (Hubel and Wiesel, 1959), so that each low-level feature will be computed from only a subset of the input: a neighborhood in the topology (e.g. a sub-image at a given position). This topological locality constraint corresponds to a layer having a very sparse weight matrix with non-zeros only allowed for topologically local connections. Computing the associated matrix products can of course be made much more efficient than having to handle a dense matrix, in addition to the statistical gain from a much smaller number of free parameters. In domains with such topological structure, similar input patterns are likely to appear at different positions, and nearby values (e.g. consecutive frames or nearby pixels) are likely to have stronger dependencies that are also important to model the data. In fact these dependencies can be exploited to discover the topology (Le Roux et al., 2008a), i.e. recover a regular grid of pixels out of a set of vectors without any order information, e.g. after the elements have been arbitrarily shuffled in the same way for all examples. Thus a same local feature computation is likely to be relevant at all translated positions of the receptive field. Hence the idea of sweeping such a local feature extractor over the topology: this corresponds to a convolution, and transforms an input into a similarly shaped feature map. Equivalently to sweeping, this may be seen as static but differently positioned replicated feature extractors that all share the same parameters. This is at the heart of convolutional networks (LeCun et al., 1989, 1998b) which have been applied both to object recognition and to image segmentation (Turaga et al., 2010). Another hallmark of the convolutional architecture is that values computed by the same feature detector applied at several neighboring input locations are then summarized through a pooling operation, typically taking their max or their sum. This confers the resulting pooled feature layer some degree of invariance to input translations, and this style of architecture (alternating selective feature extraction and invariance-creating pooling) has been the basis of convolutional networks, the Neocognitron (Fukushima, 1980) and HMAX (Riesenhuber and Poggio, 1999) models, and argued to be the architecture used by mammalian brains for object recognition (Riesenhuber and Poggio, 1999; Serre et al., 2007; DiCarlo et al., 2012). The output of a pooling unit will be the same irrespective of where a specific feature is located inside its pooling region. Empirically the use of pooling seems to contribute significantly to improved classification accuracy in object classification tasks (LeCun et al., 1998b; Boureau et al., 2010, 2011). A successful variant of pooling connected to sparse coding is L2 pooling (Hyvärinen et al., 2009; Kavukcuoglu et al., 2009; Le et al., 2010), for which the pool output is the square root of the possibly weighted sum of squares of filter outputs. Ideally, we would like to generalize feature-pooling so as to learn what features should be pooled together, e.g. 
as successfully done in several papers (Hyvärinen and Hoyer, 2000; Kavukcuoglu et al., 2009; Le et al., 2010; Ranzato and Hinton, 2010; Courville et al., 2011b; Coates and Ng, 2011b; Gregor et al., 2011). In this way, the pool output learns to be invariant to the variations captured by the span of the features pooled.
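To make the locality and invariance structure concrete, here is a minimal numpy sketch of the two alternating operations: a "valid" convolution of an image with a bank of local filters, followed by non-overlapping max-pooling (the filter values, sizes and the rectification are arbitrary illustrative choices):

```python
import numpy as np

def conv2d_valid(image, filters):
    """Valid 2D convolution (really cross-correlation) of one image with a bank of filters.

    image: (H, W); filters: (n_filters, k, k). Returns feature maps (n_filters, H-k+1, W-k+1).
    """
    n_f, k, _ = filters.shape
    H, W = image.shape
    out = np.empty((n_f, H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            patch = image[i:i + k, j:j + k]            # local receptive field
            out[:, i, j] = (filters * patch).reshape(n_f, -1).sum(axis=1)
    return out

def max_pool(feature_maps, p=2):
    """Non-overlapping p x p max pooling, giving some invariance to small translations."""
    n_f, H, W = feature_maps.shape
    H2, W2 = H // p, W // p
    trimmed = feature_maps[:, :H2 * p, :W2 * p]
    return trimmed.reshape(n_f, H2, p, W2, p).max(axis=(2, 4))

# Hypothetical usage:
rng = np.random.default_rng(0)
image = rng.random((28, 28))
filters = rng.standard_normal((8, 5, 5))
pooled = max_pool(np.maximum(0.0, conv2d_valid(image, filters)))   # rectified, then pooled
print(pooled.shape)   # (8, 12, 12)
```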

Patch-based training

The simplest approach for learning a convolutional layer in an unsupervised fashion is patch-based training: simply feeding a generic unsupervised feature learning algorithm with local patches extracted at random positions of the inputs. The resulting feature extractor can then be swept over the input to produce the convolutional feature maps. That map may be used as a new input for the next layer, and the operation repeated to thus learn and stack several layers. Such an approach was recently used with Independent Subspace Analysis (Le et al., 2011c) on 3D video blocks, reaching the state-of-the-art on Hollywood2, UCF, KTH and YouTube action recognition datasets. Similarly (Coates and Ng, 2011a) compared several feature learners with patch-based training and reached state-of-the-art results on several classification benchmarks. Interestingly, in this work performance was almost as good with very simple k-means clustering as with more sophisticated feature learners. We however conjecture that this is the case only because patches are rather low dimensional (compared to the dimension of a whole image). A large dataset might provide sufficient coverage of the space of e.g. edges prevalent in $6\times 6$ patches, so that a distributed representation is not absolutely necessary. Another plausible explanation for this success is that the clusters identified in each image patch are then pooled into a histogram of cluster counts associated with a larger sub-image. Whereas the output of a regular clustering is a one-hot non-distributed code, this histogram is itself a distributed representation, and the “soft” k-means (Coates and Ng, 2011a) representation allows not only the nearest filter but also its neighbors to be active.
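A hedged sketch of this patch-based pipeline, in the spirit of (but much simplified from) Coates and Ng (2011a), is given below: random patches are extracted, plain k-means centroids are learned on them, and each image is then encoded by a pooled histogram of hard centroid assignments over its patches (no whitening and no "soft" triangle encoding):

```python
import numpy as np

def extract_patches(images, patch_size=6, n_patches=5000, seed=0):
    """Sample random patch vectors from a set of (H, W) images, with per-patch mean removal."""
    rng = np.random.default_rng(seed)
    H, W = images.shape[1:]
    patches = np.empty((n_patches, patch_size * patch_size))
    for t in range(n_patches):
        img = images[rng.integers(len(images))]
        i = rng.integers(H - patch_size + 1)
        j = rng.integers(W - patch_size + 1)
        patches[t] = img[i:i + patch_size, j:j + patch_size].ravel()
    return patches - patches.mean(axis=1, keepdims=True)

def kmeans(X, k=50, n_iters=20, seed=0):
    """Plain k-means on patch vectors; returns the centroid matrix (k, dim)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = X[assign == c].mean(axis=0)
    return centroids

def encode_image(image, centroids, patch_size=6, stride=2):
    """Histogram of nearest-centroid assignments over a grid of patches (a pooled representation)."""
    counts = np.zeros(len(centroids))
    H, W = image.shape
    for i in range(0, H - patch_size + 1, stride):
        for j in range(0, W - patch_size + 1, stride):
            p = image[i:i + patch_size, j:j + patch_size].ravel()
            p = p - p.mean()
            counts[((centroids - p) ** 2).sum(axis=1).argmin()] += 1
    return counts / counts.sum()

# Hypothetical usage on random images:
rng = np.random.default_rng(0)
images = rng.random((20, 28, 28))
centroids = kmeans(extract_patches(images, n_patches=2000), k=20)
print(encode_image(images[0], centroids).shape)   # (20,)
```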

Convolutional and tiled-convolutional training

It is possible to directly train large convolutional layers using an unsupervised criterion. An early approach (Jain and Seung, 2008) trained a standard but deep convolutional MLP on the task of denoising images, i.e. as a deep, convolutional, denoising auto-encoder. Convolutional versions of the RBM or its extensions have also been developed (Desjardins and Bengio, 2008; Lee et al., 2009a; Taylor et al., 2010) as well as a probabilistic max-pooling operation built into Convolutional Deep Networks (Lee et al., 2009a, b; Krizhevsky, 2010). Other unsupervised feature learning approaches that were adapted to the convolutional setting include PSD (Kavukcuoglu et al., 2009, 2010; Jarrett et al., 2009; Henaff et al., 2011), a convolutional version of sparse coding called deconvolutional networks (Zeiler et al., 2010), Topographic ICA (Le et al., 2010), and mPoT that Kivinen and Williams (2012) applied to modeling natural textures. Gregor and LeCun (2010a); Le et al. (2010) also demonstrated the technique of tiled-convolution, where parameters are shared only between feature extractors whose receptive fields are $k$ steps away (so the ones looking at immediate neighbor locations are not shared). This allows pooling units to be invariant to more than just translations, and is a hybrid between convolutional networks and earlier neural networks with local connections but no weight sharing (LeCun, 1986, 1989).

Alternatives to pooling

Alternatively, one can also use explicit knowledge of the expected invariants expressed mathematically to define transformations that are robust to a known family of input deformations, using so-called scattering operators (Mallat, 2012; Bruna and Mallat, 2011), which can be computed in a way interestingly analogous to deep convolutional networks and wavelets. Like convolutional networks, the scattering operators alternate two types of operations: convolution and pooling (as a norm). Unlike convolutional networks, the proposed approach keeps at each level all of the information about the input (in a way that can be inverted), and automatically yields a very sparse (but very high-dimensional) representation. Another difference is that the filters are not learned but instead set so as to guarantee that a priori specified invariances are robustly achieved. Just a few levels were sufficient to achieve impressive results on several benchmark datasets.

11.3 Temporal coherence and slow features

The principle of identifying slowly moving/changing factors in temporal/spatial data has been investigated by many (Becker and Hinton, 1992; Wiskott and Sejnowski, 2002; Hurri and Hyvärinen, 2003; Körding et al., 2004; Cadieu and Olshausen, 2009) as a principle for finding useful representations. In particular this idea has been applied to image sequences and as an explanation for why V1 simple and complex cells behave the way they do. A good overview can be found in Hurri and Hyvärinen (2003); Berkes and Wiskott (2005).

More recently, temporal coherence has been successfully exploited in deep architectures to model video (Mobahi et al., 2009). It was also found that temporal coherence discovered visual features similar to those obtained by ordinary unsupervised feature learning (Bergstra and Bengio, 2009), and a temporal coherence penalty has been combined with a training criterion for unsupervised feature learning (Zou et al., 2011), sparse auto-encoders with L1 regularization, in this case, yielding improved classification performance.

The temporal coherence prior can be expressed in several ways, the simplest being the squared difference between feature values at times $t$ and $t+1$. Other plausible temporal coherence priors include the following. First, instead of penalizing the squared change, penalizing the absolute value (or a similar sparsity penalty) would state that most of the time the change should be exactly 0, which would intuitively make sense for the real-life factors that surround us. Second, one would expect that instead of just being slowly changing, different factors could be associated with their own different time scale. The specificity of their time scale could thus become a hint to disentangle explanatory factors. Third, one would expect that some factors should really be represented by a group of numbers (such as the $x$, $y$, and $z$ position of some object in space and the pose parameters of Hinton et al. (2011)) rather than by a single scalar, and that these groups tend to move together. Structured sparsity penalties (Kavukcuoglu et al., 2009; Jenatton et al., 2009; Bach et al., 2011; Gregor et al., 2011) could be used for this purpose.
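The simplest of these penalties can be written in a few lines; the sketch below computes both the squared-difference and the absolute-difference (sparse-change) versions for a feature trajectory, as quantities one might add to a feature-learning criterion (the trajectory here is synthetic and purely illustrative):

```python
import numpy as np

def temporal_coherence_penalties(F):
    """Temporal coherence penalties for a feature trajectory F of shape (T, n_features).

    Returns (squared-difference penalty, absolute-difference penalty); the latter prefers
    changes that are exactly zero most of the time.
    """
    dF = F[1:] - F[:-1]                  # feature change between times t and t+1
    return (dF ** 2).mean(), np.abs(dF).mean()

# Hypothetical usage on features computed over a short sequence:
rng = np.random.default_rng(0)
F = np.cumsum(0.1 * rng.standard_normal((100, 16)), axis=0)   # slowly drifting features
print(temporal_coherence_penalties(F))
```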

11.4 Algorithms to Disentangle Factors of Variation

The goal of building invariant features is to remove sensitivity of the representation to directions of variance in the data that are uninformative to the task at hand. However it is often the case that the goal of feature extraction is the disentangling or separation of many distinct but informative factors in the data, e.g., in a video of people: subject identity, action performed, subject pose relative to the camera, etc. In this situation, the methods of generating invariant features, such as feature-pooling, may be inadequate.

The process of building invariant features can be seen as consisting of two steps. First, low-level features are recovered that account for the data. Second, subsets of these low level features are pooled together to form higher-level invariant features, exemplified by the pooling and subsampling layers of convolutional neural networks. The invariant representation formed by the pooling features offers an incomplete window on the data as the detailed representation of the lower-level features is abstracted away in the pooling procedure. While we would like higher-level features to be more abstract and exhibit greater invariance, we have little control over what information is lost through pooling. What we really would like is for a particular feature set to be invariant to the irrelevant features and disentangle the relevant features. Unfortunately, it is often difficult to determine a priori which set of features will ultimately be relevant to the task at hand.

An interesting approach to taking advantage of some of the factors of variation known to exist in the data is the transforming auto-encoder (Hinton et al., 2011): instead of a scalar pattern detector (e.g,. corresponding to the probability of presence of a particular form in the input) one can think of the features as organized in groups that include both a pattern detector and pose parameters that specify attributes of the detected pattern. In (Hinton et al., 2011), what is assumed a priori is that pairs of examples (or consecutive ones) are observed with an associated value for the corresponding change in the pose parameters. For example, an animal that controls its eyes knows what changes to its ocular motor system were applied when going from one image on its retina to the next. In that work, it is also assumed that the pose changes are the same for all the pattern detectors, and this makes sense for global changes such as image translation and camera geometry changes. Instead, we would like to discover the pose parameters and attributes that should be associated with each feature detector, without having to specify ahead of time what they should be, force them to be the same for all features, and having to necessarily observe the changes in all of the pose parameters or attributes.

The approach taken recently in the Manifold Tangent Classifier, discussed in Section 8.3, is interesting in this respect. Without any supervision or prior knowledge, it finds prominent local factors of variation (tangent vectors to the manifold, extracted from a CAE, interpreted as locally valid input ”deformations”). Higher-level features are subsequently encouraged to be invariant to these factors of variation, so that they must depend on other characteristics. In a sense this approach is disentangling valid local deformations along the data manifold from other, more drastic changes, associated to other factors of variation such as those that affect class identity. (The changes that affect class identity might, in input space, actually be of similar magnitude to local deformations, but not follow along the manifold, i.e. cross zones of low density.)

One solution to the problem of information loss that would fit within the feature-pooling paradigm, is to consider many overlapping pools of features based on the same low-level feature set. Such a structure would have the potential to learn a redundant set of invariant features that may not cause significant loss of information. However it is not obvious what learning principle could be applied that can ensure that the features are invariant while maintaining as much information as possible. While a Deep Belief Network or a Deep Boltzmann Machine (as discussed in sections 4 and 10.2 respectively) with two hidden layers would, in principle, be able to preserve information into the “pooling” second hidden layer, there is no guarantee that the second layer features are more invariant than the “low-level” first layer features. However, there is some empirical evidence that the second layer of the DBN tends to display more invariance than the first layer (Erhan et al., 2010a).

A more principled approach, from the perspective of ensuring a more robust compact feature representation, can be conceived by reconsidering the disentangling of features through the lens of its generative equivalent – feature composition. Since many unsupervised learning algorithms have a generative interpretation (or a way to reconstruct inputs from their high-level representation), the generative perspective can provide insight into how to think about disentangling factors. The majority of the models currently used to construct invariant features have the interpretation that their low-level features linearly combine to construct the data. (As an aside, if we are given only the values of the higher-level pooling features, we cannot accurately recover the data because we do not know how to apportion credit for the pooling feature values to the lower-level features. This is simply the generative version of the consequences of the loss of information caused by pooling.) This is a fairly rudimentary form of feature composition with significant limitations. For example, it is not possible to linearly combine a feature with a generic transformation (such as translation) to generate a transformed version of the feature. Nor can we even consider a generic color feature being linearly combined with a gray-scale stimulus pattern to generate a colored pattern. It would seem that if we are to take the notion of disentangling seriously we require a richer interaction of features than that offered by simple linear combinations.

12 Conclusion

This review of representation learning and deep learning has covered three major and apparently disconnected approaches: the probabilistic models (both the directed kind such as sparse coding and the undirected kind such as Boltzmann machines), the reconstruction-based algorithms related to auto-encoders, and the geometrically motivated manifold-learning approaches. Drawing connections between these approaches is currently a very active area of research and is likely to continue to produce models and methods that take advantage of the relative strengths of each paradigm.

Practical Concerns and Guidelines. One of the criticisms addressed to artificial neural networks and deep learning algorithms is that they have many hyper-parameters and variants and that exploring their configurations and architectures is an art. This has motivated an earlier book on the “Tricks of the Trade” (Orr and Muller, 1998) of which LeCun et al. (1998a) is still relevant for training deep architectures, in particular what concerns initialization, ill-conditioning and stochastic gradient descent. A good and more modern compendium of good training practice, particularly adapted to training RBMs, is provided in Hinton (2010), while a similar guide oriented more towards deep neural networks can be found in Bengio (2013), both of which are part of a novel version of the above book. Recent work on automating hyper-parameter search (Bergstra and Bengio, 2012; Bergstra et al., 2011; Snoek et al., 2012) is also making it more convenient, efficient and reproducible.

Incorporating Generic AI-level Priors. We have covered many high-level generic priors that we believe could bring machine learning closer to AI by improving representation learning. Many of these priors relate to the assumed existence of multiple underlying factors of variation, whose variations are in some sense orthogonal to each other. They are expected to be organized at multiple levels of abstraction, hence the need for deep architectures, which also have statistical advantages because they allow to re-use parameters in a combinatorially efficient way. Only a few of these factors would typically be relevant for any particular example, justifying sparsity of representation. These factors are expected to be related to simple (e.g., linear) dependencies, with subsets of these explaining different random variables of interest (inputs, tasks) and varying in structured ways in time and space (temporal and spatial coherence). We expect future successful applications of representation learning to refine and increase that list of priors, and to incorporate most of them instead of focusing on only one. Research in training criteria that better take these priors into account are likely to move us closer to the long-term objective of discovering learning algorithms that can disentangle the underlying explanatory factors.

Inference. We anticipate that methods based on directly parametrizing a representation function will incorporate more and more of the iterative type of computation one finds in the inference procedures of probabilistic latent-variable models. There is already movement in the other direction, with probabilistic latent-variable models exploiting approximate inference mechanisms that are themselves learned (i.e., producing a parametric description of the representation function). A major appeal of probabilistic models is that the semantics of the latent variables are clear and this allows a clean separation of the problems of modeling (choose the energy function), inference (estimating $P(h|x)$), and learning (optimizing the parameters), using generic tools in each case. On the other hand, doing approximate inference and not taking that approximation into account explicitly in the approximate optimization for learning could have detrimental effects, hence the appeal of learning approximate inference. More fundamentally, there is the question of the multimodality of the posterior $P(h|x)$. If there are exponentially many probable configurations of values of the factors $h_i$ that can explain $x$, then we seem to be stuck with very poor inference, either focusing on a single mode (MAP inference), assuming some kind of strong factorization (as in variational inference) or using an MCMC that cannot visit enough modes of $P(h|x)$. What we propose as food for thought is the idea of dropping the requirement of an explicit representation of the posterior and settle for an implicit representation that exploits potential structure in $P(h|x)$ in order to represent it compactly: even though $P(h|x)$ may have an exponential number of modes, it may be possible to represent it with a small set of numbers. For example, consider computing a deterministic feature representation $f(x)$ that implicitly captures the information about a highly multi-modal $P(h|x)$, in the sense that all the questions (e.g. making some prediction about some target concept) that can be asked from $P(h|x)$ can also be answered from $f(x)$.
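To make the idea of a "small set of numbers" concrete, consider the standard special case in which the posterior factorizes over K binary factors, as it does for an RBM: the K conditional means sigmoid(Wx + c) form a deterministic f(x) that fully determines a distribution over 2^K configurations, so any question about P(h|x) can be answered from those K numbers. The NumPy sketch below (toy parameters, purely illustrative; the factorized case is of course the easy one, not the non-factorial posteriors discussed above) verifies this against brute-force enumeration.

import itertools
import numpy as np

rng = np.random.default_rng(1)
K, D = 8, 5                       # 2**K = 256 configurations of h
W = rng.standard_normal((K, D))
c = rng.standard_normal(K)
x = rng.standard_normal(D)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Compact representation: K numbers f(x)_i = P(h_i = 1 | x), as in an RBM posterior.
f_x = sigmoid(W @ x + c)

# Brute force: enumerate all 2**K configurations under the factorized posterior.
configs = np.array(list(itertools.product([0, 1], repeat=K)))
log_p = configs @ np.log(f_x) + (1 - configs) @ np.log(1 - f_x)
p = np.exp(log_p)                 # sums to 1 over all 256 configurations

# Any question about P(h|x), e.g. E[h] or P(h = h*), is answered from f(x) alone.
E_h_enum = p @ configs
h_star = configs[123]
p_star_enum = p[123]
p_star_compact = np.prod(np.where(h_star == 1, f_x, 1 - f_x))

assert np.allclose(E_h_enum, f_x)
assert np.isclose(p_star_enum, p_star_compact)
print("K =", K, "numbers summarize a distribution over", 2 ** K, "configurations")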

Optimization. Much remains to be done to better understand the successes and failures of training deep architectures, both in the supervised case (with many recent successes) and the unsupervised case (where much more work is needed). Although regularization effects can be important on small datasets, the effects that persist on very large datasets suggest that some optimization issues are involved. Are they due more to local minima (we now know there are huge numbers of them) and the dynamics of the training procedure? Or are they due mostly to ill-conditioning, which might be handled by approximate second-order methods? These basic questions remain unanswered and deserve much more study.
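As a toy illustration of the ill-conditioning hypothesis (a textbook example, not an experiment from this review), the sketch below minimizes a badly conditioned quadratic: plain gradient descent needs a number of steps on the order of the condition number, whereas a single Newton (second-order) step reaches the minimum. Retaining some of this benefit at scale is what approximate second-order methods aim for.

import numpy as np

# Badly conditioned quadratic: f(w) = 0.5 * w^T A w, condition number 1e4.
A = np.diag([1.0, 1e4])
w0 = np.array([1.0, 1.0])

def grad(w):
    return A @ w

# Plain gradient descent: the step size is limited by the largest curvature (1e4),
# so progress along the low-curvature direction is extremely slow.
w = w0.copy()
lr = 1.0 / np.max(np.diag(A))
steps = 0
while np.linalg.norm(w) > 1e-6 and steps < 200000:
    w = w - lr * grad(w)
    steps += 1
print("gradient descent steps:", steps)

# A Newton step rescales by the inverse curvature and lands on the minimum directly.
w_newton = w0 - np.linalg.solve(A, grad(w0))
print("after one Newton step:", w_newton)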

Acknowledgments

The authors would like to thank David Warde-Farley, Razvan Pascanu and Ian Goodfellow for useful feedback, as well as NSERC, CIFAR and the Canada Research Chairs for funding.
