
Chapter 5

Machine Learning Basics

Deep learning is a specific kind of machine learning. To understand deep learning well, one must have a solid understanding of the basic principles of machine learning. This chapter provides a brief course in the most important general principles that are applied throughout the rest of the book. Novice readers or those who want a wider perspective are encouraged to consider machine learning textbooks with a more comprehensive coverage of the fundamentals, such as Murphy (2012) or Bishop (2006). If you are already familiar with machine learning basics, feel free to skip ahead to section 5.11. That section covers some perspectives on traditional machine learning techniques that have strongly influenced the development of deep learning algorithms.
We begin with a definition of what a learning algorithm is and present an example: the linear regression algorithm. We then proceed to describe how the challenge of fitting the training data differs from the challenge of finding patterns that generalize to new data. Most machine learning algorithms have settings called hyperparameters, which must be determined outside the learning algorithm itself; we discuss how to set these using additional data. Machine learning is essentially a form of applied statistics with increased emphasis on the use of computers to statistically estimate complicated functions and a decreased emphasis on proving confidence intervals around these functions; we therefore present the two central approaches to statistics: frequentist estimators and Bayesian inference. Most machine learning algorithms can be divided into the categories of supervised learning and unsupervised learning; we describe these categories and give some examples of simple learning algorithms from each category. Most deep learning algorithms are based on an optimization algorithm called stochastic gradient descent. We describe how to combine various algorithm components, such as an optimization algorithm, a cost function, a model, and a dataset, to build a machine learning algorithm. Finally, in section 5.11, we describe some of the factors that have limited the ability of traditional machine learning to generalize. These challenges have motivated the development of deep learning algorithms that overcome these obstacles.

5.1 Learning Algorithms

A machine learning algorithm is an algorithm that is able to learn from data. But what do we mean by learning? Mitchell (1997) provides a succinct definition: "A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$." One can imagine a wide variety of experiences $E$, tasks $T$, and performance measures $P$, and we do not attempt in this book to formally define what may be used for each of these entities. Instead, in the following sections, we provide intuitive descriptions and examples of the different kinds of tasks, performance measures, and experiences that can be used to construct machine learning algorithms.

5.1.1 The Task, $T$

Machine learning enables us to tackle tasks that are too difficult to solve with fixed programs written and designed by human beings. From a scientific and philosophical point of view, machine learning is interesting because developing our understanding of it entails developing our understanding of the principles that underlie intelligence.
In this relatively formal definition of the word "task," the process of learning itself is not the task. Learning is our means of attaining the ability to perform the task. For example, if we want a robot to be able to walk, then walking is the task. We could program the robot to learn to walk, or we could attempt to directly write a program that specifies how to walk manually.
Machine learning tasks are usually described in terms of how the machine learning system should process an example. An example is a collection of features that have been quantitatively measured from some object or event that we want the machine learning system to process. We typically represent an example as a vector $\boldsymbol{x} \in \mathbb{R}^n$ where each entry $x_i$ of the vector is another feature. For example, the features of an image are usually the values of the pixels in the image.
Many kinds of tasks can be solved with machine learning. Some of the most common machine learning tasks include the following:
  • Classification: In this type of task, the computer program is asked to specify which of $k$ categories some input belongs to. To solve this task, the learning algorithm is usually asked to produce a function $f: \mathbb{R}^n \rightarrow \{1, \ldots, k\}$. When $y = f(\boldsymbol{x})$, the model assigns an input described by vector $\boldsymbol{x}$ to a category identified by numeric code $y$. There are other variants of the classification task, for example, where $f$ outputs a probability distribution over classes. An example of a classification task is object recognition, where the input is an image (usually described as a set of pixel brightness values), and the output is a numeric code identifying the object in the image. For example, the Willow Garage PR2 robot is able to act as a waiter that can recognize different kinds of drinks and deliver them to people on command (Goodfellow et al., 2010). Modern object recognition is best accomplished with deep learning (Krizhevsky et al., 2012; Ioffe and Szegedy, 2015). Object recognition is the same basic technology that enables computers to recognize faces (Taigman et al., 2014), which can be used to automatically tag people in photo collections and for computers to interact more naturally with their users.
  • Classification with missing inputs: Classification becomes more challenging if the computer program is not guaranteed that every measurement in its input vector will always be provided. To solve the classification task, the learning algorithm only has to define a single function mapping from a vector input to a categorical output. When some of the inputs may be missing, rather than providing a single classification function, the learning algorithm must learn a set of functions. Each function corresponds to classifying $\boldsymbol{x}$ with a different subset of its inputs missing. This kind of situation arises frequently in medical diagnosis, because many kinds of medical tests are expensive or invasive. One way to efficiently define such a large set of functions is to learn a probability distribution over all the relevant variables, then solve the classification task by marginalizing out the missing variables. With $n$ input variables, we can now obtain all $2^n$ different classification functions needed for each possible set of missing inputs, but the computer program needs to learn only a single function describing the joint probability distribution. See Goodfellow et al. (2013b) for an example of a deep probabilistic model applied to such a task in this way. Many of the other tasks described in this section can also be generalized to work with missing inputs; classification with missing inputs is just one example of what machine learning can do.
  • Regression: In this type of task, the computer program is asked to predict a numerical value given some input. To solve this task, the learning algorithm is asked to output a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$. This type of task is similar to classification, except that the format of output is different. An example of a regression task is the prediction of the expected claim amount that an insured person will make (used to set insurance premiums), or the prediction of future prices of securities. These kinds of predictions are also used for algorithmic trading.
  • Transcription: In this type of task, the machine learning system is asked to observe a relatively unstructured representation of some kind of data and transcribe the information into discrete textual form. For example, in optical character recognition, the computer program is shown a photograph containing an image of text and is asked to return this text in the form of a sequence of characters (e.g., in ASCII or Unicode format). Google Street View uses deep learning to process address numbers in this way (Goodfellow et al., 2014d). Another example is speech recognition, where the computer program is provided an audio waveform and emits a sequence of characters or word ID codes describing the words that were spoken in the audio recording. Deep learning is a crucial component of modern speech recognition systems used at major companies, including Microsoft, IBM and Google (Hinton et al., 2012b).
  • Machine translation: In a machine translation task, the input already consists of a sequence of symbols in some language, and the computer program must convert this into a sequence of symbols in another language. This is commonly applied to natural languages, such as translating from English to French. Deep learning has recently begun to have an important impact on this kind of task (Sutskever et al., 2014; Bahdanau et al., 2015).
  • Structured output: Structured output tasks involve any task where the output is a vector (or other data structure containing multiple values) with important relationships between the different elements. This is a broad category and subsumes the transcription and translation tasks described above, as well as many other tasks. One example is parsing - mapping a natural language sentence into a tree that describes its grammatical structure by tagging nodes of the trees as being verbs, nouns, adverbs, and so on. See Collobert (2011) for an example of deep learning applied to a parsing task. Another example is pixel-wise segmentation of images, where the computer program assigns every pixel in an image to a specific category.
For example, deep learning can be used to annotate the locations of roads in aerial photographs (Mnih and Hinton, 2010). The output form need not mirror the structure of the input as closely as in these annotation-style tasks. For example, in image captioning, the computer program observes an image and outputs a natural language sentence describing the image (Kiros et al., 2014a,b; Mao et al., 2015; Vinyals et al., 2015b; Donahue et al., 2014; Karpathy and Li, 2015; Fang et al., 2015; Xu et al., 2015). These tasks are called structured output tasks because the program must output several values that are all tightly interrelated. For example, the words produced by an image captioning program must form a valid sentence.
  • Anomaly detection: In this type of task, the computer program sifts through a set of events or objects and flags some of them as being unusual or atypical. An example of an anomaly detection task is credit card fraud detection. By modeling your purchasing habits, a credit card company can detect misuse of your cards. If a thief steals your credit card or credit card information, the thief's purchases will often come from a different probability distribution over purchase types than your own. The credit card company can prevent fraud by placing a hold on an account as soon as that card has been used for an uncharacteristic purchase. See Chandola et al. (2009) for a survey of anomaly detection methods.
  • Synthesis and sampling: In this type of task, the machine learning algorithm is asked to generate new examples that are similar to those in the training data. Synthesis and sampling via machine learning can be useful for media applications when generating large volumes of content by hand would be expensive, boring, or require too much time. For example, video games can automatically generate textures for large objects or landscapes, rather than requiring an artist to manually label each pixel (Luo et al., 2013). In some cases, we want the sampling or synthesis procedure to generate a specific kind of output given the input. For example, in a speech synthesis task, we provide a written sentence and ask the program to emit an audio waveform containing a spoken version of that sentence. This is a kind of structured output task, but with the added qualification that there is no single correct output for each input, and we explicitly desire a large amount of variation in the output, in order for the output to seem more natural and realistic.
  • Imputation of missing values: In this type of task, the machine learning algorithm is given a new example $\boldsymbol{x} \in \mathbb{R}^n$, but with some entries $x_i$ of $\boldsymbol{x}$ missing. The algorithm must provide a prediction of the values of the missing entries.
  • Denoising: In this type of task, the machine learning algorithm is given as input a corrupted example $\tilde{\boldsymbol{x}} \in \mathbb{R}^n$ obtained by an unknown corruption process from a clean example $\boldsymbol{x} \in \mathbb{R}^n$. The learner must predict the clean example $\boldsymbol{x}$ from its corrupted version $\tilde{\boldsymbol{x}}$, or more generally predict the conditional probability distribution $p(\boldsymbol{x} \mid \tilde{\boldsymbol{x}})$.
  • Density estimation or probability mass function estimation: In the density estimation problem, the machine learning algorithm is asked to learn a function $p_{\text{model}}: \mathbb{R}^n \rightarrow \mathbb{R}$, where $p_{\text{model}}(\boldsymbol{x})$ can be interpreted as a probability density function (if $\mathbf{x}$ is continuous) or a probability mass function (if $\mathbf{x}$ is discrete) on the space that the examples were drawn from. To do such a task well (we will specify exactly what that means when we discuss performance measures $P$), the algorithm needs to learn the structure of the data it has seen. It must know where examples cluster tightly and where they are unlikely to occur. Most of the tasks described above require the learning algorithm to at least implicitly capture the structure of the probability distribution. Density estimation enables us to explicitly capture that distribution. In principle, we can then perform computations on that distribution to solve the other tasks as well. For example, if we have performed density estimation to obtain a probability distribution $p(\boldsymbol{x})$, we can use that distribution to solve the missing value imputation task. If a value $x_i$ is missing, and all the other values, denoted $\boldsymbol{x}_{-i}$, are given, then we know the distribution over it is given by $p(x_i \mid \boldsymbol{x}_{-i})$, as the short numeric sketch below illustrates. In practice, density estimation does not always enable us to solve all these related tasks, because in many cases the required operations on $p(\boldsymbol{x})$ are computationally intractable.
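As a toy illustration of this marginalization, the following sketch computes $p(x_1 \mid x_2)$ from a stored joint table over two binary variables. It assumes NumPy is available, and the joint distribution values are hypothetical:

```python
import numpy as np

# Hypothetical joint distribution p(x1, x2) over two binary variables,
# stored as a 2x2 table whose entries sum to 1 (rows: x1, columns: x2).
p_joint = np.array([[0.30, 0.10],
                    [0.20, 0.40]])

# Impute a missing x1 given an observed x2 = 1 by marginalizing:
# p(x1 | x2 = 1) = p(x1, x2 = 1) / sum over x1' of p(x1', x2 = 1)
x2 = 1
p_x1_given_x2 = p_joint[:, x2] / p_joint[:, x2].sum()
print(p_x1_given_x2)  # [0.2 0.8]
```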
Of course, many other tasks and types of tasks are possible. The types of tasks we list here are intended only to provide examples of what machine learning can do, not to define a rigid taxonomy of tasks.

5.1.2 The Performance Measure, $P$

To evaluate the abilities of a machine learning algorithm, we must design a quantitative measure of its performance. Usually this performance measure $P$ is specific to the task $T$ being carried out by the system.
For tasks such as classification, classification with missing inputs, and transcription, we often measure the accuracy of the model. Accuracy is just the proportion of examples for which the model produces the correct output. We can also obtain equivalent information by measuring the error rate, the proportion of examples for which the model produces an incorrect output. We often refer to the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0 if it is correctly classified and 1 if it is not. For tasks such as density estimation, it does not make sense to measure accuracy, error rate, or any other kind of 0-1 loss. Instead, we must use a different performance metric that gives the model a continuous-valued score for each example. The most common approach is to report the average log-probability the model assigns to some examples.
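To make these measures concrete, here is a minimal sketch, assuming NumPy; the label and probability arrays are hypothetical. It computes accuracy, the error rate as the average 0-1 loss, and an average log-probability score:

```python
import numpy as np

# Hypothetical predicted and true class labels for five test examples.
y_pred = np.array([0, 1, 1, 2, 0])
y_true = np.array([0, 1, 2, 2, 0])

zero_one_loss = (y_pred != y_true).astype(float)  # 1 for each misclassified example
error_rate = zero_one_loss.mean()                 # expected 0-1 loss: 0.2
accuracy = 1.0 - error_rate                       # proportion correct: 0.8

# For density estimation, report the average log-probability instead.
# p_model holds hypothetical model probabilities for each test example.
p_model = np.array([0.9, 0.8, 0.1, 0.7, 0.95])
avg_log_prob = np.log(p_model).mean()
```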
Usually we are interested in how well the machine learning algorithm performs on data that it has not seen before, since this determines how well it will work when deployed in the real world. We therefore evaluate these performance measures using a test set of data that is separate from the data used for training the machine learning system.
The choice of performance measure may seem straightforward and objective, but it is often difficult to choose a performance measure that corresponds well to the desired behavior of the system.
In some cases, this is because it is difficult to decide what should be measured. For example, when performing a transcription task, should we measure the accuracy of the system at transcribing entire sequences, or should we use a more fine-grained performance measure that gives partial credit for getting some elements of the sequence correct? When performing a regression task, should we penalize the system more if it frequently makes medium-sized mistakes or if it rarely makes very large mistakes? These kinds of design choices depend on the application.
In other cases, we know what quantity we would ideally like to measure, but measuring it is impractical. For example, this arises frequently in the context of density estimation. Many of the best probabilistic models represent probability distributions only implicitly. Computing the actual probability value assigned to a specific point in space in many such models is intractable. In these cases, one must design an alternative criterion that still corresponds to the design objectives, or design a good approximation to the desired criterion.

5.1.3 The Experience, $E$

Machine learning algorithms can be broadly categorized as unsupervised or supervised by what kind of experience they are allowed to have during the learning process.
Most of the learning algorithms in this book can be understood as being allowed to experience an entire dataset. A dataset is a collection of many examples, as defined in section 5.1.1. Sometimes we call examples data points.
One of the oldest datasets studied by statisticians and machine learning researchers is the Iris dataset (Fisher, 1936). It is a collection of measurements of different parts of 150 iris plants. Each individual plant corresponds to one example. The features within each example are the measurements of each part of the plant: the sepal length, sepal width, petal length and petal width. The dataset also records which species each plant belonged to. Three different species are represented in the dataset.
Unsupervised learning algorithms experience a dataset containing many features, then learn useful properties of the structure of this dataset. In the context of deep learning, we usually want to learn the entire probability distribution that generated a dataset, whether explicitly, as in density estimation, or implicitly, for tasks like synthesis or denoising. Some other unsupervised learning algorithms perform other roles, like clustering, which consists of dividing the dataset into clusters of similar examples.
Supervised learning algorithms experience a dataset containing features, but each example is also associated with a label or target. For example, the Iris dataset is annotated with the species of each iris plant. A supervised learning algorithm can study the Iris dataset and learn to classify iris plants into three different species based on their measurements.
Roughly speaking, unsupervised learning involves observing several examples of a random vector $\mathbf{x}$ and attempting to implicitly or explicitly learn the probability distribution $p(\mathbf{x})$, or some interesting properties of that distribution; while supervised learning involves observing several examples of a random vector $\mathbf{x}$ and an associated value or vector $\mathbf{y}$, then learning to predict $\mathbf{y}$ from $\mathbf{x}$, usually by estimating $p(\mathbf{y} \mid \mathbf{x})$. The term supervised learning originates from the view of the target $\mathbf{y}$ being provided by an instructor or teacher who shows the machine learning system what to do. In unsupervised learning, there is no instructor or teacher, and the algorithm must learn to make sense of the data without this guide.
Unsupervised learning and supervised learning are not formally defined terms. The lines between them are often blurred. Many machine learning technologies can be used to perform both tasks. For example, the chain rule of probability states that for a vector $\mathbf{x} \in \mathbb{R}^n$, the joint distribution can be decomposed as
$$p(\mathbf{x}) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}).$$
This decomposition means that we can solve the ostensibly unsupervised problem of modeling $p(\mathbf{x})$ by splitting it into $n$ supervised learning problems. Alternatively, we can solve the supervised learning problem of learning $p(y \mid \boldsymbol{x})$ by using traditional unsupervised learning technologies to learn the joint distribution $p(\boldsymbol{x}, y)$, then inferring
$$p(y \mid \boldsymbol{x}) = \frac{p(\boldsymbol{x}, y)}{\sum_{y'} p\left(\boldsymbol{x}, y'\right)}.$$
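To make the decomposition concrete, for $n = 3$ the chain rule reads
$$p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2),$$
so modeling the joint distribution amounts to solving three supervised prediction problems: each factor predicts $x_i$ from the entries that precede it.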
Though unsupervised learning and supervised learning are not completely formal or distinct concepts, they do help roughly categorize some of the things we do with machine learning algorithms. Traditionally, people refer to regression, classification and structured output problems as supervised learning. Density estimation in support of other tasks is usually considered unsupervised learning.
Other variants of the learning paradigm are possible. For example, in semisupervised learning, some examples include a supervision target but others do not. In multi-instance learning, an entire collection of examples is labeled as containing or not containing an example of a class, but the individual members of the collection are not labeled. For a recent example of multi-instance learning with deep models, see Kotzias et al. (2015).
Some machine learning algorithms do not just experience a fixed dataset. For example, reinforcement learning algorithms interact with an environment, so there is a feedback loop between the learning system and its experiences. Such algorithms are beyond the scope of this book. Please see Sutton and Barto (1998) or Bertsekas and Tsitsiklis (1996) for information about reinforcement learning, and Mnih et al. (2013) for the deep learning approach to reinforcement learning.
Most machine learning algorithms simply experience a dataset. A dataset can be described in many ways. In all cases, a dataset is a collection of examples, which are in turn collections of features.
One common way of describing a dataset is with a design matrix. A design matrix is a matrix containing a different example in each row. Each column of the matrix corresponds to a different feature. For instance, the Iris dataset contains 150 examples with four features for each example. This means we can represent the dataset with a design matrix $\boldsymbol{X} \in \mathbb{R}^{150 \times 4}$, where $X_{i,1}$ is the sepal length of plant $i$, $X_{i,2}$ is the sepal width of plant $i$, etc. We describe most of the learning algorithms in this book in terms of how they operate on design matrix datasets.
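As a concrete sketch of the design matrix view, assuming scikit-learn is installed (its `load_iris` helper ships the Fisher Iris data):

```python
import numpy as np
from sklearn.datasets import load_iris  # assumes scikit-learn is installed

iris = load_iris()
X = iris.data    # design matrix of shape (150, 4): one example per row
y = iris.target  # species label for each example, values in {0, 1, 2}

print(X.shape)   # (150, 4)
print(X[0])      # sepal length, sepal width, petal length, petal width of plant 1
```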
Of course, to describe a dataset as a design matrix, it must be possible to describe each example as a vector, and each of these vectors must be the same size. This is not always possible. For example, if you have a collection of photographs with different widths and heights, then different photographs will contain different numbers of pixels, so not all the photographs may be described with the same length of vector. In section 9.7 and chapter 10, we describe how to handle different types of such heterogeneous data. In cases like these, rather than describing the dataset as a matrix with $m$ rows, we describe it as a set containing $m$ elements: $\{\boldsymbol{x}^{(1)}, \boldsymbol{x}^{(2)}, \ldots, \boldsymbol{x}^{(m)}\}$. This notation does not imply that any two example vectors $\boldsymbol{x}^{(i)}$ and $\boldsymbol{x}^{(j)}$ have the same size.
In the case of supervised learning, the example contains a label or target as well as a collection of features. For example, if we want to use a learning algorithm to perform object recognition from photographs, we need to specify which object appears in each of the photos. We might do this with a numeric code, with 0 signifying a person, 1 signifying a car, 2 signifying a cat, and so forth. Often when working with a dataset containing a design matrix of feature observations $\boldsymbol{X}$, we also provide a vector of labels $\boldsymbol{y}$, with $y_i$ providing the label for example $i$.
Of course, sometimes the label may be more than just a single number. For example, if we want to train a speech recognition system to transcribe entire sentences, then the label for each example sentence is a sequence of words.
Just as there is no formal definition of supervised and unsupervised learning, there is no rigid taxonomy of datasets or experiences. The structures described here cover most cases, but it is always possible to design new ones for new applications.

5.1.4 Example: Linear Regression

Our definition of a machine learning algorithm as an algorithm that is capable of improving a computer program's performance at some task via experience is somewhat abstract. To make this more concrete, we present an example of a simple machine learning algorithm: linear regression. We will return to this example repeatedly as we introduce more machine learning concepts that help to understand the algorithm's behavior.
As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector $\boldsymbol{x} \in \mathbb{R}^n$ as input and predict the value of a scalar $y \in \mathbb{R}$ as its output. The output of linear regression is a linear function of the input. Let $\hat{y}$ be the value that our model predicts $y$ should take on. We define the output to be
$$\hat{y} = \boldsymbol{w}^\top \boldsymbol{x},$$
where $\boldsymbol{w} \in \mathbb{R}^n$ is a vector of parameters.
Parameters are values that control the behavior of the system. In this case, $w_i$ is the coefficient that we multiply by feature $x_i$ before summing up the contributions from all the features. We can think of $\boldsymbol{w}$ as a set of weights that determine how each feature affects the prediction. If a feature $x_i$ receives a positive weight $w_i$, then increasing the value of that feature increases the value of our prediction $\hat{y}$. If a feature receives a negative weight, then increasing the value of that feature decreases the value of our prediction. If a feature's weight is large in magnitude, then it has a large effect on the prediction. If a feature's weight is zero, it has no effect on the prediction.
We thus have a definition of our task $T$: to predict $y$ from $\boldsymbol{x}$ by outputting $\hat{y} = \boldsymbol{w}^\top \boldsymbol{x}$. Next we need a definition of our performance measure, $P$.
Suppose that we have a design matrix of $m$ example inputs that we will not use for training, only for evaluating how well the model performs. We also have a vector of regression targets providing the correct value of $y$ for each of these examples. Because this dataset will only be used for evaluation, we call it the test set. We refer to the design matrix of inputs as $\boldsymbol{X}^{(\text{test})}$ and the vector of regression targets as $\boldsymbol{y}^{(\text{test})}$.
One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If $\hat{\boldsymbol{y}}^{(\text{test})}$ gives the predictions of the model on the test set, then the mean squared error is given by
$$\text{MSE}_{\text{test}} = \frac{1}{m} \sum_i \left(\hat{\boldsymbol{y}}^{(\text{test})} - \boldsymbol{y}^{(\text{test})}\right)_i^2.$$
Intuitively, one can see that this error measure decreases to 0 when $\hat{\boldsymbol{y}}^{(\text{test})} = \boldsymbol{y}^{(\text{test})}$. We can also see that
$$\text{MSE}_{\text{test}} = \frac{1}{m} \left\| \hat{\boldsymbol{y}}^{(\text{test})} - \boldsymbol{y}^{(\text{test})} \right\|_2^2,$$
so the error increases whenever the Euclidean distance between the predictions and the targets increases.
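A minimal sketch of both forms of this computation, assuming NumPy and hypothetical prediction and target vectors:

```python
import numpy as np

# Hypothetical test-set predictions and targets.
y_hat = np.array([1.0, 2.5, 0.3])
y_test = np.array([1.2, 2.0, 0.0])
m = len(y_test)

mse_a = np.mean((y_hat - y_test) ** 2)           # (1/m) * sum of squared errors
mse_b = np.linalg.norm(y_hat - y_test) ** 2 / m  # (1/m) * squared Euclidean distance
assert np.isclose(mse_a, mse_b)                  # the two forms agree
```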
To make a machine learning algorithm, we need to design an algorithm that will improve the weights $\boldsymbol{w}$ in a way that reduces $\text{MSE}_{\text{test}}$ when the algorithm is allowed to gain experience by observing a training set $\left(\boldsymbol{X}^{(\text{train})}, \boldsymbol{y}^{(\text{train})}\right)$. One intuitive way of doing this (which we justify later, in section 5.5.1) is just to minimize the mean squared error on the training set, $\text{MSE}_{\text{train}}$.
To minimize $\text{MSE}_{\text{train}}$, we can simply solve for where its gradient is 0:
$$\nabla_{\boldsymbol{w}} \text{MSE}_{\text{train}} = 0$$
$$\Rightarrow \nabla_{\boldsymbol{w}} \frac{1}{m} \left\| \boldsymbol{X}^{(\text{train})} \boldsymbol{w} - \boldsymbol{y}^{(\text{train})} \right\|_2^2 = 0$$
$$\Rightarrow 2 \boldsymbol{X}^{(\text{train})\top} \boldsymbol{X}^{(\text{train})} \boldsymbol{w} - 2 \boldsymbol{X}^{(\text{train})\top} \boldsymbol{y}^{(\text{train})} = 0$$
$$\Rightarrow \boldsymbol{w} = \left( \boldsymbol{X}^{(\text{train})\top} \boldsymbol{X}^{(\text{train})} \right)^{-1} \boldsymbol{X}^{(\text{train})\top} \boldsymbol{y}^{(\text{train})} \quad (5.12)$$
The system of equations whose solution is given by equation 5.12 is known as the normal equations. Evaluating equation 5.12 constitutes a simple learning algorithm. For an example of the linear regression learning algorithm in action, see figure 5.1.
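The following sketch evaluates the normal equations, assuming NumPy; the synthetic data and its parameters are invented for illustration. Solving the linear system directly is numerically preferable to forming the matrix inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))                      # synthetic design matrix
true_w = np.array([0.5, -2.0, 1.0])                      # invented ground-truth weights
y_train = X_train @ true_w + 0.1 * rng.normal(size=100)  # noisy linear targets

# Normal equations: w = (X^T X)^{-1} X^T y, solved as a linear system.
w = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)
print(w)  # close to true_w
```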
Figure 5.1: A linear regression problem, with a training set consisting of ten data points, each containing one feature. Because there is only one feature, the weight vector $\boldsymbol{w}$ contains only a single parameter to learn, $w_1$. (Left) Observe that linear regression learns to set $w_1$ such that the line $y = w_1 x$ comes as close as possible to passing through all the training points. (Right) The plotted point indicates the value of $w_1$ found by the normal equations, which we can see minimizes the mean squared error on the training set.

It is worth noting that the term linear regression is often used to refer to a slightly more sophisticated model with one additional parameter: an intercept term $b$. In this model
$$\hat{y} = \boldsymbol{w}^\top \boldsymbol{x} + b,$$
so the mapping from parameters to predictions is still a linear function, but the mapping from features to predictions is now an affine function. This extension to affine functions means that the plot of the model's predictions still looks like a line, but it need not pass through the origin. Instead of adding the bias parameter $b$, one can continue to use the model with only weights but augment $\boldsymbol{x}$ with an extra entry that is always set to 1. The weight corresponding to the extra 1 entry plays the role of the bias parameter. We frequently use the term "linear" when referring to affine functions throughout this book.
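A small sketch of this augmentation trick, assuming NumPy and a hypothetical two-feature design matrix:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])                        # hypothetical design matrix
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # extra feature fixed at 1

# A length-3 weight vector on X_aug behaves like (w, b) on the original X:
w_aug = np.array([0.5, -1.0, 2.0])                # last entry plays the role of b
assert np.allclose(X_aug @ w_aug, X @ w_aug[:2] + w_aug[2])
```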
The intercept term $b$ is often called the bias parameter of the affine transformation. This terminology derives from the point of view that the output of the transformation is biased toward being $b$ in the absence of any input. This term is different from the idea of a statistical bias, in which a statistical estimation algorithm's expected estimate of a quantity is not equal to the true quantity.
Linear regression is of course an extremely simple and limited learning algorithm, but it provides an example of how a learning algorithm can work. In subsequent sections we describe some of the basic principles underlying learning algorithm design and demonstrate how these principles can be used to build more complicated learning algorithms.

5.2 Capacity, Overfitting and Underfitting

The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs - not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization.
Typically, when training a machine learning model, we have access to a training set; we can compute some error measure on the training set, called the training error; and we reduce this training error. So far, what we have described is simply an optimization problem. What separates machine learning from optimization is that we want the generalization error, also called the test error, to be low as well. The generalization error is defined as the expected value of the error on a new input. Here the expectation is taken across different possible inputs, drawn from the distribution of inputs we expect the system to encounter in practice.
We typically estimate the generalization error of a machine learning model by measuring its performance on a test set of examples that were collected separately from the training set.
In our linear regression example, we trained the model by minimizing the training error,
$$\frac{1}{m^{(\text{train})}} \left\| \boldsymbol{X}^{(\text{train})} \boldsymbol{w} - \boldsymbol{y}^{(\text{train})} \right\|_2^2,$$
but we actually care about the test error, $\frac{1}{m^{(\text{test})}} \left\| \boldsymbol{X}^{(\text{test})} \boldsymbol{w} - \boldsymbol{y}^{(\text{test})} \right\|_2^2$.
How can we affect performance on the test set when we can observe only the training set? The field of statistical learning theory provides some answers. If the training and the test set are collected arbitrarily, there is indeed little we can do. If we are allowed to make some assumptions about how the training and test set are collected, then we can make some progress.
The training and test data are generated by a probability distribution over datasets called the data-generating process. We typically make a set of assumptions known collectively as the i.i.d. assumptions. These assumptions are that the examples in each dataset are independent from each other, and that the training set and test set are identically distributed, drawn from the same probability distribution as each other. This assumption enables us to describe the data-generating process with a probability distribution over a single example. The same distribution is then used to generate every train example and every test example. We call that shared underlying distribution the data-generating distribution, denoted $p_{\text{data}}$. This probabilistic framework and the i.i.d. assumptions enable us to mathematically study the relationship between training error and test error.
One immediate connection we can observe between training error and test error is that the expected training error of a randomly selected model is equal to the expected test error of that model. Suppose we have a probability distribution $p(\boldsymbol{x}, y)$ and we sample from it repeatedly to generate the training set and the test set. For some fixed value $\boldsymbol{w}$, the expected training set error is exactly the same as the expected test set error, because both expectations are formed using the same dataset sampling process. The only difference between the two conditions is the name we assign to the dataset we sample.
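This claim is easy to check empirically. A minimal simulation sketch, assuming NumPy; the fixed weight vector and data-generating process are invented for illustration. It samples a fresh "training" set and "test" set from the same distribution and shows the two errors agree on average:

```python
import numpy as np

rng = np.random.default_rng(1)
w = np.array([1.0, -1.0])  # fixed ahead of time, never fit to either dataset

def sample_dataset(m):
    # Train and test sets are drawn i.i.d. from the same invented distribution.
    X = rng.normal(size=(m, 2))
    y = X @ np.array([2.0, 0.5]) + rng.normal(size=m)
    return X, y

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

train_errs = [mse(*sample_dataset(20), w) for _ in range(1000)]
test_errs = [mse(*sample_dataset(20), w) for _ in range(1000)]
print(np.mean(train_errs), np.mean(test_errs))  # nearly identical on average
```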
Of course, when we use a machine learning algorithm, we do not fix the parameters ahead of time, then sample both datasets. We sample the training set, then use it to choose the parameters to reduce training set error, then sample the test set. Under this process, the expected test error is greater than or equal to the expected value of training error. The factors determining how well a machine learning algorithm will perform are its ability to
  1. Make the training error small.
  2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting. Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set. Overfitting occurs when the gap between the training error and test error is too large.
We can control whether a model is more likely to overfit or underfit by altering its capacity. Informally, a model's capacity is its ability to fit a wide variety of functions. Models with low capacity may struggle to fit the training set. Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.
One way to control the capacity of a learning algorithm is by choosing its hypothesis space, the set of functions that the learning algorithm is allowed to select as being the solution. For example, the linear regression algorithm has the set of all linear functions of its input as its hypothesis space. We can generalize linear regression to include polynomials, rather than just linear functions, in its hypothesis space. Doing so increases the model's capacity.
A polynomial of degree 1 gives us the linear regression model with which we are already familiar, with the prediction
$$\hat{y} = b + wx.$$
By introducing $x^2$ as another feature provided to the linear regression model, we can learn a model that is quadratic as a function of $x$:
$$\hat{y} = b + w_1 x + w_2 x^2.$$
Though this model implements a quadratic function of its input, the output is still a linear function of the parameters, so we can still use the normal equations to train the model in closed form. We can continue to add more powers of $x$ as additional features, for example, to obtain a polynomial of degree 9:
$$\hat{y} = b + \sum_{i=1}^{9} w_i x^i.$$
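A sketch of this idea, assuming NumPy and an invented synthetic dataset: expand a scalar input into the powers $x^0, \ldots, x^9$ and fit the weights with the Moore-Penrose pseudoinverse, as in figure 5.2:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=8)                           # eight scalar inputs
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.1 * rng.normal(size=8)  # quadratic plus noise

degree = 9
X_poly = np.vander(x, N=degree + 1, increasing=True)  # columns: x^0, x^1, ..., x^9

# With more parameters (10) than examples (8), the normal equations are
# underdetermined; the pseudoinverse picks the minimum-norm solution.
w = np.linalg.pinv(X_poly) @ y
```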
Machine learning algorithms will generally perform best when their capacity is appropriate for the true complexity of the task they need to perform and the amount of training data they are provided with. Models with insufficient capacity are unable to solve complex tasks. Models with high capacity can solve complex tasks, but when their capacity is higher than needed to solve the present task, they may overfit.
Figure 5.2 shows this principle in action. We compare a linear, quadratic and degree-9 predictor attempting to fit a problem where the true underlying function is quadratic. The linear function is unable to capture the curvature in the true underlying problem, so it underfits. The degree-9 predictor is capable of representing the correct function, but it is also capable of representing infinitely many other functions that pass exactly through the training points, because we have more parameters than training examples. We have little chance of choosing a solution that generalizes well when so many wildly different solutions exist. In this example, the quadratic model is perfectly matched to the true structure of the task, so it generalizes well to new data.

Figure 5.2: We fit three models to this example training set. The training data was generated synthetically, by randomly sampling $x$ values and choosing $y$ deterministically by evaluating a quadratic function. (Left) A linear function fit to the data suffers from underfitting: it cannot capture the curvature that is present in the data. (Center) A quadratic function fit to the data generalizes well to unseen points. It does not suffer from a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here we used the Moore-Penrose pseudoinverse to solve the underdetermined normal equations. The solution passes through all the training points exactly, but we have not been lucky enough for it to extract the correct structure. It now has a deep valley between two training points that does not appear in the true underlying function. It also increases sharply on the left side of the data, while the true function decreases in this area.
So far we have described only one way of changing a model's capacity: by changing the number of input features it has, and simultaneously adding new parameters associated with those features. There are in fact many ways to change a model's capacity. Capacity is not determined only by the choice of model. The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective. This is called the representational capacity of the model. In many cases, finding the best function within this family is a difficult optimization problem. In practice, the learning algorithm does not actually find the best function, but merely one that significantly reduces the training error. These additional limitations, such as the imperfection of the optimization algorithm, mean that the learning algorithm's effective capacity may be less than the representational capacity of the model family.
Our modern ideas about improving the generalization of machine learning models are refinements of thought dating back to philosophers at least as early as Ptolemy. Many early scholars invoke a principle of parsimony that is now most widely known as Occam's razor (c. 1287-1347). This principle states that among competing hypotheses that explain known observations equally well, we should choose the "simplest" one. This idea was formalized and made more precise in the twentieth century by the founders of statistical learning theory (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995).
Statistical learning theory provides various means of quantifying model capacity. Among these, the most well known is the Vapnik-Chervonenkis dimension, or VC dimension. The VC dimension measures the capacity of a binary classifier. The VC dimension is defined as being the largest possible value of $m$ for which there exists a training set of $m$ different $\boldsymbol{x}$ points that the classifier can label arbitrarily.
Quantifying the capacity of the model enables statistical learning theory to make quantitative predictions. The most important results in statistical learning theory show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases (Vapnik and Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995). These bounds provide intellectual justification that machine learning algorithms can work, but they are rarely used in practice when working with deep learning algorithms. This is in part because the bounds are often quite loose and in part because it can be quite difficult to determine the capacity of deep learning algorithms. The problem of determining the capacity of a deep learning model is especially difficult because the effective capacity is limited by the capabilities of the optimization algorithm, and we have little theoretical understanding of the general nonconvex optimization problems involved in deep learning.
We must remember that while simpler functions are more likely to generalize (to have a small gap between training and test error), we must still choose a sufficiently complex hypothesis to achieve low training error. Typically, training error decreases until it asymptotes to the minimum possible error value as model capacity increases (assuming the error measure has a minimum value). Typically, generalization error has a U-shaped curve as a function of model capacity. This is illustrated in figure 5.3.

Figure 5.3: Typical relationship between capacity and error. Training and test error behave differently. At the left end of the graph, training error and generalization error are both high. This is the underfitting regime. As we increase capacity, training error decreases, but the gap between training and generalization error increases. Eventually, the size of this gap outweighs the decrease in training error, and we enter the overfitting regime, where capacity is too large, above the optimal capacity.
To reach the most extreme case of arbitrarily high capacity, we introduce the concept of nonparametric models. So far, we have seen only parametric models, such as linear regression. Parametric models learn a function described by a parameter vector whose size is finite and fixed before any data is observed. Nonparametric models have no such limitation.
Sometimes, nonparametric models are just theoretical abstractions (such as an algorithm that searches over all possible probability distributions) that cannot be implemented in practice. However, we can also design practical nonparametric models by making their complexity a function of the training set size. One example of such an algorithm is nearest neighbor regression. Unlike linear regression, which has a fixed-length vector of weights, the nearest neighbor regression model simply stores the $\boldsymbol{X}$ and $\boldsymbol{y}$ from the training set. When asked to classify a test point $\boldsymbol{x}$, the model looks up the nearest entry in the training set and returns the associated regression target. In other words, $\hat{y} = y_i$ where $i = \arg\min \left\| \boldsymbol{X}_{i,:} - \boldsymbol{x} \right\|_2^2$. The algorithm can also be generalized to distance metrics other than the $L^2$ norm, such as learned distance metrics (Goldberger et al., 2005). If the algorithm is allowed to break ties by averaging the $y_i$ values for all $\boldsymbol{X}_{i,:}$ that are tied for nearest, then this algorithm is able to achieve the minimum possible training error (which might be greater than zero, if two identical inputs are associated with different outputs) on any regression dataset.
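A minimal sketch of nearest neighbor regression with tie-averaging, assuming NumPy; the training data is hypothetical:

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x):
    # Return the regression target of the nearest training row in squared
    # L2 distance, averaging over any ties as described above.
    dists = np.sum((X_train - x) ** 2, axis=1)
    nearest = dists == dists.min()
    return y_train[nearest].mean()

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])  # hypothetical training data
y_train = np.array([0.0, 10.0, 20.0])
print(nearest_neighbor_predict(X_train, y_train, np.array([0.9, 1.2])))  # 10.0
```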
Finally, we can also create a nonparametric learning algorithm by wrapping a parametric learning algorithm inside another algorithm that increases the number of parameters as needed. For example, we could imagine an outer loop of learning that changes the degree of the polynomial learned by linear regression on top of a polynomial expansion of the input.
The ideal model is an oracle that simply knows the true probability distribution that generates the data. Even such a model will still incur some error on many problems, because there may still be some noise in the distribution. In the case of supervised learning, the mapping from $\boldsymbol{x}$ to $y$ may be inherently stochastic, or $y$ may be a deterministic function that involves other variables besides those included in $\boldsymbol{x}$. The error incurred by an oracle making predictions from the true distribution $p(\boldsymbol{x}, y)$ is called the Bayes error.
Training and generalization error vary as the size of the training set varies. Expected generalization error can never increase as the number of training examples increases. For nonparametric models, more data yield better generalization until the best possible error is achieved. Any fixed parametric model with less than optimal capacity will asymptote to an error value that exceeds the Bayes error. See figure 5.4 for an illustration. Note that it is possible for the model to have optimal capacity and yet still have a large gap between training and generalization errors. In this situation, we may be able to reduce this gap by gathering more training examples.

5.2.1 The No Free Lunch Theorem

Learning theory claims that a machine learning algorithm can generalize well from a finite training set of examples. This seems to contradict some basic principles of logic. Inductive reasoning, or inferring general rules from a limited set of examples, is not logically valid. To logically infer a rule describing every member of a set, one must have information about every member of that set.
In part, machine learning avoids this problem by offering only probabilistic rules, rather than the entirely certain rules used in purely logical reasoning. Machine learning promises to find rules that are probably correct about most members of the set they concern.
Unfortunately, even this does not resolve the entire problem. The no free lunch theorem for machine learning (Wolpert, 1996) states that, averaged over all possible data-generating distributions, every classification algorithm has the
Figure 5.4: The effect of the training dataset size on the train and test error, as well as on the optimal model capacity. We constructed a synthetic regression problem based on adding a moderate amount of noise to a degree-5 polynomial, generated a single test set, and then generated several different sizes of training set. For each size, we generated 40 different training sets in order to plot error bars showing 95 percent confidence intervals. (Top) The MSE on the training and test set for two different models: a quadratic model, and a model with degree chosen to minimize the test error. Both are fit in closed form. For the quadratic model, the training error increases as the size of the training set increases. This is because larger datasets are harder to fit. Simultaneously, the test error decreases, because fewer incorrect hypotheses are consistent with the training data. The quadratic model does not have enough capacity to solve the task, so its test error asymptotes to a high value. The test error at optimal capacity asymptotes to the Bayes error. The training error can fall below the Bayes error, due to the ability of the training algorithm to memorize specific instances of the training set. As the training size increases to infinity, the training error of any fixed-capacity model (here, the quadratic model) must rise to at least the Bayes error. (Bottom) As the training set size increases, the optimal capacity (shown here as the degree of the optimal polynomial regressor) increases. The optimal capacity plateaus after reaching sufficient complexity to solve the task.

same error rate when classifying previously unobserved points. In other words, in some sense, no machine learning algorithm is universally any better than any other. The most sophisticated algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class.
Fortunately, these results hold only when we average over all possible data-generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.
This means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the "real world" that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data-generating distributions we care about.

5.2.2 Regularization

The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. We do so by building a set of preferences into the learning algorithm. When these preferences are aligned with the learning problems that we ask the algorithm to solve, it performs better.
So far, the only method of modifying a learning algorithm that we have discussed concretely is to increase or decrease the model's representational capacity by adding or removing functions from the hypothesis space of solutions the learning algorithm is able to choose from. We gave the specific example of increasing or decreasing the degree of a polynomial for a regression problem. The view we have described so far is oversimplified.
The behavior of our algorithm is strongly affected not just by how large we make the set of functions allowed in its hypothesis space, but by the specific identity of those functions. The learning algorithm we have studied so far, linear regression, has a hypothesis space consisting of the set of linear functions of its input. These linear functions can be useful for problems where the relationship between inputs and outputs truly is close to linear. They are less useful for problems that behave in a very nonlinear fashion. For example, linear regression would not perform well if we tried to use it to predict $\sin(x)$ from $x$. We can thus control the performance of our algorithms by choosing what kind of functions we allow them to draw solutions from, as well as by controlling the amount of these functions.
We can also give a learning algorithm a preference for one solution over another

in its hypothesis space. This means that both functions are eligible, but one is preferred. The unpreferred solution will be chosen only if it fits the training data significantly better than the preferred solution.
For example, we can modify the training criterion for linear regression to include weight decay. To perform linear regression with weight decay, we minimize a sum $J(\boldsymbol{w})$ comprising both the mean squared error on the training set and a criterion that expresses a preference for the weights to have smaller squared $L^2$ norm. Specifically,
$$J(\boldsymbol{w}) = \text{MSE}_{\text{train}} + \lambda \boldsymbol{w}^\top \boldsymbol{w},$$
where $\lambda$ is a value chosen ahead of time that controls the strength of our preference for smaller weights. When $\lambda = 0$, we impose no preference, and larger $\lambda$ forces the weights to become smaller. Minimizing $J(\boldsymbol{w})$ results in a choice of weights that make a tradeoff between fitting the training data and being small. This gives us solutions that have a smaller slope, or that put weight on fewer of the features. As an example of how we can control a model's tendency to overfit or underfit via weight decay, we can train a high-degree polynomial regression model with different values of $\lambda$. See figure 5.5 for the results.
More generally, we can regularize a model that learns a function $f(\boldsymbol{x}; \boldsymbol{\theta})$ by adding a penalty called a regularizer to the cost function. In the case of weight decay, the regularizer is $\Omega(\boldsymbol{w}) = \boldsymbol{w}^\top \boldsymbol{w}$. In chapter 7, we will see that many other regularizers are possible.
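As an illustration, here is a minimal sketch of linear regression with weight decay, assuming the criterion $J(\boldsymbol{w})$ above; setting its gradient to zero gives the closed-form solution $\boldsymbol{w} = (\boldsymbol{X}^\top\boldsymbol{X} + m\lambda\boldsymbol{I})^{-1}\boldsymbol{X}^\top\boldsymbol{y}$, where the factor $m$ comes from the $1/m$ inside the MSE. The helper name is hypothetical.

```python
import numpy as np

# Minimal sketch: linear regression with weight decay (ridge regression).
# Minimizing J(w) = MSE_train + lam * w.T @ w yields the closed form
# w = (X.T X + m * lam * I)^{-1} X.T y.
def ridge_fit(X, y, lam):
    m, n = X.shape
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.0))   # close to the true weights
print(ridge_fit(X, y, lam=10.0))  # larger lambda shrinks the weights toward zero
```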
Expressing preferences for one function over another is a more general way of controlling a model's capacity than including or excluding members from the hypothesis space. We can think of excluding a function from a hypothesis space as expressing an infinitely strong preference against that function.
In our weight decay example, we expressed our preference for linear functions defined with smaller weights explicitly, via an extra term in the criterion we minimize. There are many other ways of expressing preferences for different solutions, both implicitly and explicitly. Together, these different approaches are known as regularization. Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. Regularization is one of the central concerns of the field of machine learning, rivaled in its importance only by optimization.
The no free lunch theorem has made it clear that there is no best machine learning algorithm, and, in particular, no best form of regularization. Instead we must choose a form of regularization that is well suited to the particular task we want to solve. The philosophy of deep learning in general and this book in particular is that a wide range of tasks (such as all the intellectual tasks that

people can do) may all be solved effectively using very general-purpose forms of regularization.

5.3 Hyperparameters and Validation Sets

Most machine learning algorithms have hyperparameters, settings that we can use to control the algorithm's behavior. The values of hyperparameters are not adapted by the learning algorithm itself (though we can design a nested learning procedure in which one learning algorithm learns the best hyperparameters for another learning algorithm).
The polynomial regression example in figure 5.2 has a single hyperparameter: the degree of the polynomial, which acts as a capacity hyperparameter. The $\lambda$ value used to control the strength of weight decay is another example of a hyperparameter.
Figure 5.5: We fit a high-degree polynomial regression model to our example training set from figure 5.2. The true function is quadratic, but here we use only models with degree 9. We vary the amount of weight decay to prevent these high-degree models from overfitting. (Left) With very large $\lambda$, we can force the model to learn a function with no slope at all. This underfits because it can only represent a constant function. (Center) With a medium value of $\lambda$, the learning algorithm recovers a curve with the right general shape. Even though the model is capable of representing functions with much more complicated shapes, weight decay has encouraged it to use a simpler function described by smaller coefficients. (Right) With weight decay approaching zero (i.e., using the Moore-Penrose pseudoinverse to solve the underdetermined problem with minimal regularization), the degree-9 polynomial overfits significantly, as we saw in figure 5.2.
Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because the setting is difficult to optimize. More frequently, the setting must be a hyperparameter because it is not appropriate to learn that hyperparameter on the training set. This applies to all hyperparameters that control model capacity. If learned on the training set, such hyperparameters would always choose the maximum possible model capacity, resulting in overfitting (refer to figure 5.3). For example, we can always fit the training set better with a higher-degree polynomial and a weight decay setting of $\lambda = 0$ than we could with a lower-degree polynomial and a positive weight decay setting.
To solve this problem, we need a validation set of examples that the training algorithm does not observe.
Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set, can be used to estimate the generalization error of a learner, after the learning process has completed. It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training data. Specifically, we split the training data into two disjoint subsets. One of these subsets is used to learn the parameters. The other subset is our validation set, used to estimate the generalization error during or after training, allowing for the hyperparameters to be updated accordingly. The subset of data used to learn the parameters is still typically called the training set, even though this may be confused with the larger pool of data used for the entire training process. The subset of data used to guide the selection of hyperparameters is called the validation set. Typically, one uses about 80 percent of the training data for training and 20 percent for validation. Since the validation set is used to "train" the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error does. After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.
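A minimal sketch of this train/validation protocol follows, using the polynomial degree as the capacity hyperparameter; the 80/20 split, helper names, and synthetic data are illustrative assumptions.

```python
import numpy as np

# Sketch of hyperparameter selection with a held-out validation set: about
# 80 percent of the training data learns the parameters, the remaining 20
# percent estimates generalization error for each candidate degree.
def select_degree(x, y, degrees, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    split = int(0.8 * len(x))
    tr, va = perm[:split], perm[split:]
    best_degree, best_err = None, np.inf
    for d in degrees:
        coeffs = np.polyfit(x[tr], y[tr], deg=d)                  # fit on the training subset
        err = np.mean((np.polyval(coeffs, x[va]) - y[va]) ** 2)   # validation MSE
        if err < best_err:
            best_degree, best_err = d, err
    return best_degree

x = np.linspace(-1, 1, 100)
y = x ** 2 + 0.05 * np.random.default_rng(1).normal(size=100)
print(select_degree(x, y, degrees=[1, 2, 5, 9]))  # usually selects a low degree such as 2
```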
In practice, when the same test set has been used repeatedly to evaluate performance of different algorithms over many years, and especially if we consider all the attempts from the scientific community at beating the reported state-ofthe-art performance on that test set, we end up having optimistic evaluations with the test set as well. Benchmarks can thus become stale and then do not reflect the true field performance of a trained system. Thankfully, the community tends to move on to new (and usually more ambitious and larger) benchmark datasets.

5.3.1 Cross-Validation

Dividing the dataset into a fixed training set and a fixed test set can be problematic if it results in the test set being small. A small test set implies statistical uncertainty around the estimated average test error, making it difficult to claim that algorithm $A$ works better than algorithm $B$ on the given task.
When the dataset has hundreds of thousands of examples or more, this is not a serious issue. When the dataset is too small, alternative procedures enable one to use all the examples in the estimation of the mean test error, at the price of increased computational cost. These procedures are based on the idea of repeating the training and testing computation on different randomly chosen subsets or splits of the original dataset. The most common of these is the $k$-fold cross-validation procedure, shown in algorithm 5.1, in which a partition of the dataset is formed by splitting it into $k$ nonoverlapping subsets. The test error may then be estimated by taking the average test error across $k$ trials. On trial $i$, the $i$-th subset of the data is used as the test set, and the rest of the data is used as the training set. One problem is that no unbiased estimators of the variance of such average error estimators exist (Bengio and Grandvalet, 2004), but approximations are typically used.
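To make the procedure concrete, here is a NumPy sketch of $k$-fold cross-validation in the style of algorithm 5.1 (shown below); the learning algorithm `A` and per-example loss `L` are stand-ins for the abstract objects in the pseudocode, and the toy mean-predictor learner is purely illustrative.

```python
import numpy as np

# Sketch of k-fold cross-validation: A maps a dataset to a learned function,
# L maps a learned function and an example to a scalar loss.
def k_fold_xv(D, A, L, k):
    folds = np.array_split(np.arange(len(D)), k)  # k mutually exclusive subsets
    errors = []
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        f_i = A([D[t] for t in train_idx])             # train on all folds but i
        errors.extend(L(f_i, D[t]) for t in folds[i])  # test on fold i
    return np.array(errors)  # the mean of this vector estimates generalization error

# Illustrative usage: a learner that always predicts the mean training target,
# evaluated with squared error on (input, target) pairs.
D = [(x, 2.0 * x) for x in np.linspace(0.0, 1.0, 20)]
A = lambda data: (lambda z: np.mean([y for _, y in data]))
L = lambda f, z: (f(z) - z[1]) ** 2
print(k_fold_xv(D, A, L, k=5).mean())
```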

5.4 Estimators, Bias and Variance

The field of statistics gives us many tools to achieve the machine learning goal of solving a task not only on the training set but also to generalize. Foundational concepts such as parameter estimation, bias and variance are useful to formally characterize notions of generalization, underfitting and overfitting.

5.4.1 Point Estimation

Point estimation is the attempt to provide the single "best" prediction of some quantity of interest. In general the quantity of interest can be a single parameter or a vector of parameters in some parametric model, such as the weights in our linear regression example in section 5.1.4, but it can also be a whole function.
To distinguish estimates of parameters from their true value, our convention will be to denote a point estimate of a parameter $\boldsymbol{\theta}$ by $\hat{\boldsymbol{\theta}}$.
Let $\{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}\}$ be a set of $m$ independent and identically distributed

Algorithm 5.1 The $k$-fold cross-validation algorithm. It can be used to estimate generalization error of a learning algorithm $A$ when the given dataset $\mathbb{D}$ is too small for a simple train/test or train/valid split to yield accurate estimation of generalization error, because the mean of a loss $L$ on a small test set may have too high a variance. The dataset $\mathbb{D}$ contains as elements the abstract examples $\boldsymbol{z}^{(i)}$ (for the $i$-th example), which could stand for an (input, target) pair $\boldsymbol{z}^{(i)} = (\boldsymbol{x}^{(i)}, y^{(i)})$ in the case of supervised learning, or for just an input $\boldsymbol{z}^{(i)} = \boldsymbol{x}^{(i)}$ in the case of unsupervised learning. The algorithm returns the vector of errors $\boldsymbol{e}$ for each example in $\mathbb{D}$, whose mean is the estimated generalization error. The errors on individual examples can be used to compute a confidence interval around the mean (equation 5.47). Though these confidence intervals are not well justified after the use of cross-validation, it is still common practice to use them to declare that algorithm $A$ is better than algorithm $B$ only if the confidence interval of the error of algorithm $A$ lies below and does not intersect the confidence interval of algorithm $B$.
Define KFoldXV($\mathbb{D}, A, L, k$):
Require: $\mathbb{D}$, the given dataset, with elements $\boldsymbol{z}^{(i)}$
Require: $A$, the learning algorithm, seen as a function that takes a dataset as input and outputs a learned function
Require: $L$, the loss function, seen as a function from a learned function $f$ and an example $\boldsymbol{z}^{(i)} \in \mathbb{D}$ to a scalar $\in \mathbb{R}$
Require: $k$, the number of folds
Split $\mathbb{D}$ into $k$ mutually exclusive subsets $\mathbb{D}_i$, whose union is $\mathbb{D}$
for $i$ from 1 to $k$ do
    $f_i = A(\mathbb{D} \setminus \mathbb{D}_i)$
    for $\boldsymbol{z}^{(j)}$ in $\mathbb{D}_i$ do
        $e_j = L(f_i, \boldsymbol{z}^{(j)})$
    end for
end for
Return $\boldsymbol{e}$
(i.i.d.) data points. A point estimator or statistic is any function of the data:
$$\hat{\boldsymbol{\theta}}_m = g(\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}).$$
The definition does not require that $g$ return a value that is close to the true $\boldsymbol{\theta}$ or even that the range of $g$ be the same as the set of allowable values of $\boldsymbol{\theta}$. This definition of a point estimator is very general and would enable the designer of an estimator great flexibility. While almost any function thus qualifies as an estimator,

a good estimator is a function whose output is close to the true underlying $\boldsymbol{\theta}$ that generated the training data.
For now, we take the frequentist perspective on statistics. That is, we assume that the true parameter value $\boldsymbol{\theta}$ is fixed but unknown, while the point estimate $\hat{\boldsymbol{\theta}}$ is a function of the data. Since the data is drawn from a random process, any function of the data is random. Therefore $\hat{\boldsymbol{\theta}}$ is a random variable.
Point estimation can also refer to the estimation of the relationship between input and target variables. We refer to these types of point estimates as function estimators.
Function Estimation Sometimes we are interested in performing function estimation (or function approximation). Here, we are trying to predict a variable $\boldsymbol{y}$ given an input vector $\boldsymbol{x}$. We assume that there is a function $f(\boldsymbol{x})$ that describes the approximate relationship between $\boldsymbol{y}$ and $\boldsymbol{x}$. For example, we may assume that $\boldsymbol{y} = f(\boldsymbol{x}) + \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon}$ stands for the part of $\boldsymbol{y}$ that is not predictable from $\boldsymbol{x}$. In function estimation, we are interested in approximating $f$ with a model or estimate $\hat{f}$. Function estimation is really just the same as estimating a parameter $\boldsymbol{\theta}$; the function estimator $\hat{f}$ is simply a point estimator in function space. The linear regression example (discussed in section 5.1.4) and the polynomial regression example (discussed in section 5.2) both illustrate scenarios that may be interpreted as either estimating a parameter $\boldsymbol{w}$ or estimating a function $\hat{f}$ mapping from $\boldsymbol{x}$ to $y$.
We now review the most commonly studied properties of point estimators and discuss what they tell us about these estimators.

5.4.2 Bias

The bias of an estimator is defined as
$$\text{bias}(\hat{\boldsymbol{\theta}}_m) = \mathbb{E}(\hat{\boldsymbol{\theta}}_m) - \boldsymbol{\theta}, \qquad (5.20)$$
where the expectation is over the data (seen as samples from a random variable) and $\boldsymbol{\theta}$ is the true underlying value of $\boldsymbol{\theta}$ used to define the data-generating distribution. An estimator $\hat{\boldsymbol{\theta}}_m$ is said to be unbiased if $\text{bias}(\hat{\boldsymbol{\theta}}_m) = \mathbf{0}$, which implies that $\mathbb{E}(\hat{\boldsymbol{\theta}}_m) = \boldsymbol{\theta}$. An estimator $\hat{\boldsymbol{\theta}}_m$ is said to be asymptotically unbiased if $\lim_{m \to \infty} \text{bias}(\hat{\boldsymbol{\theta}}_m) = \mathbf{0}$, which implies that $\lim_{m \to \infty} \mathbb{E}(\hat{\boldsymbol{\theta}}_m) = \boldsymbol{\theta}$.
Example: Bernoulli Distribution Consider a set of samples $\{x^{(1)}, \ldots, x^{(m)}\}$ that are independently and identically distributed according to a Bernoulli distri-

bution with mean $\theta$:
$$P(x^{(i)}; \theta) = \theta^{x^{(i)}} (1 - \theta)^{(1 - x^{(i)})}.$$
A common estimator for the $\theta$ parameter of this distribution is the mean of the training samples:
$$\hat{\theta}_m = \frac{1}{m} \sum_{i=1}^m x^{(i)}. \qquad (5.22)$$
To determine whether this estimator is biased, we can substitute equation 5.22 into equation 5.20:
$$\text{bias}(\hat{\theta}_m) = \mathbb{E}[\hat{\theta}_m] - \theta = \mathbb{E}\left[\frac{1}{m}\sum_{i=1}^m x^{(i)}\right] - \theta = \frac{1}{m}\sum_{i=1}^m \mathbb{E}[x^{(i)}] - \theta = \frac{1}{m}\sum_{i=1}^m \theta - \theta = \theta - \theta = 0.$$
Since $\text{bias}(\hat{\theta}_m) = 0$, we say that our estimator $\hat{\theta}_m$ is unbiased.
Example: Gaussian Distribution Estimator of the Mean Now, consider a set of samples $\{x^{(1)}, \ldots, x^{(m)}\}$ that are independently and identically distributed according to a Gaussian distribution $p(x^{(i)}) = \mathcal{N}(x^{(i)}; \mu, \sigma^2)$, where $i \in \{1, \ldots, m\}$. Recall that the Gaussian probability density function is given by
$$p(x^{(i)}; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2} \frac{(x^{(i)} - \mu)^2}{\sigma^2}\right).$$
A common estimator of the Gaussian mean parameter is known as the sample mean:
$$\hat{\mu}_m = \frac{1}{m} \sum_{i=1}^m x^{(i)}.$$
To determine the bias of the sample mean, we are again interested in calculating its expectation:
$$\text{bias}(\hat{\mu}_m) = \mathbb{E}[\hat{\mu}_m] - \mu = \mathbb{E}\left[\frac{1}{m}\sum_{i=1}^m x^{(i)}\right] - \mu = \frac{1}{m}\sum_{i=1}^m \mathbb{E}[x^{(i)}] - \mu = \mu - \mu = 0.$$
Thus we find that the sample mean is an unbiased estimator of the Gaussian mean parameter.
Example: Estimators of the Variance of a Gaussian Distribution For this example, we compare two different estimators of the variance parameter $\sigma^2$ of a Gaussian distribution. We are interested in knowing if either estimator is biased.
The first estimator of $\sigma^2$ we consider is known as the sample variance:
$$\hat{\sigma}_m^2 = \frac{1}{m} \sum_{i=1}^m \left(x^{(i)} - \hat{\mu}_m\right)^2,$$
where $\hat{\mu}_m$ is the sample mean. More formally, we are interested in computing
$$\text{bias}(\hat{\sigma}_m^2) = \mathbb{E}[\hat{\sigma}_m^2] - \sigma^2. \qquad (5.37)$$
We begin by evaluating the term $\mathbb{E}[\hat{\sigma}_m^2]$:
$$\mathbb{E}[\hat{\sigma}_m^2] = \mathbb{E}\left[\frac{1}{m}\sum_{i=1}^m \left(x^{(i)} - \hat{\mu}_m\right)^2\right] = \frac{m-1}{m}\sigma^2.$$
Returning to equation 5.37, we conclude that the bias of $\hat{\sigma}_m^2$ is $-\sigma^2 / m$. Therefore, the sample variance is a biased estimator.
The unbiased sample variance estimator
$$\tilde{\sigma}_m^2 = \frac{1}{m-1} \sum_{i=1}^m \left(x^{(i)} - \hat{\mu}_m\right)^2$$
provides an alternative approach. As the name suggests this estimator is unbiased. That is, we find that $\mathbb{E}[\tilde{\sigma}_m^2] = \sigma^2$:
$$\mathbb{E}[\tilde{\sigma}_m^2] = \mathbb{E}\left[\frac{1}{m-1}\sum_{i=1}^m \left(x^{(i)} - \hat{\mu}_m\right)^2\right] = \frac{m}{m-1}\mathbb{E}[\hat{\sigma}_m^2] = \frac{m}{m-1}\left(\frac{m-1}{m}\sigma^2\right) = \sigma^2.$$
We have two estimators: one is biased, and the other is not. While unbiased estimators are clearly desirable, they are not always the "best" estimators. As we will see we often use biased estimators that possess other important properties.
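A quick numerical check of these two estimators: averaging each over many resampled datasets approximates its expectation. The sample size and seed below are arbitrary.

```python
import numpy as np

# ddof=0 gives the biased sample variance (divide by m);
# ddof=1 gives the unbiased version (divide by m - 1).
rng = np.random.default_rng(0)
m, true_var = 5, 4.0
samples = rng.normal(0.0, np.sqrt(true_var), size=(100000, m))
print(samples.var(axis=1, ddof=0).mean())  # approx (m-1)/m * 4 = 3.2 (biased)
print(samples.var(axis=1, ddof=1).mean())  # approx 4.0 (unbiased)
```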

5.4.3 Variance and Standard Error

Another property of the estimator that we might want to consider is how much we expect it to vary as a function of the data sample. Just as we computed the expectation of the estimator to determine its bias, we can compute its variance. The variance of an estimator is simply the variance
$$\text{Var}(\hat{\theta}),$$
where the random variable is the training set. Alternately, the square root of the variance is called the standard error, denoted $\text{SE}(\hat{\theta})$.
The variance, or the standard error, of an estimator provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data-generating process. Just as we might like an estimator to exhibit low bias, we would also like it to have relatively low variance.
When we compute any statistic using a finite number of samples, our estimate of the true underlying parameter is uncertain, in the sense that we could have obtained other samples from the same distribution and their statistics would have

been different. The expected degree of variation in any estimator is a source of error that we want to quantify.
The standard error of the mean is given by
$$\text{SE}(\hat{\mu}_m) = \sqrt{\text{Var}\left[\frac{1}{m}\sum_{i=1}^m x^{(i)}\right]} = \frac{\sigma}{\sqrt{m}},$$
where $\sigma^2$ is the true variance of the samples $x^{(i)}$. The standard error is often estimated by using an estimate of $\sigma$. Unfortunately, neither the square root of the sample variance nor the square root of the unbiased estimator of the variance provide an unbiased estimate of the standard deviation. Both approaches tend to underestimate the true standard deviation but are still used in practice. The square root of the unbiased estimator of the variance is less of an underestimate. For large $m$, the approximation is quite reasonable.
The standard error of the mean is very useful in machine learning experiments. We often estimate the generalization error by computing the sample mean of the error on the test set. The number of examples in the test set determines the accuracy of this estimate. Taking advantage of the central limit theorem, which tells us that the mean will be approximately distributed with a normal distribution, we can use the standard error to compute the probability that the true expectation falls in any chosen interval. For example, the 95 percent confidence interval centered on the mean $\hat{\mu}_m$ is
$$\left(\hat{\mu}_m - 1.96\,\text{SE}(\hat{\mu}_m),\; \hat{\mu}_m + 1.96\,\text{SE}(\hat{\mu}_m)\right), \qquad (5.47)$$
under the normal distribution with mean $\hat{\mu}_m$ and variance $\text{SE}(\hat{\mu}_m)^2$. In machine learning experiments, it is common to say that algorithm $A$ is better than algorithm $B$ if the upper bound of the 95 percent confidence interval for the error of algorithm $A$ is less than the lower bound of the 95 percent confidence interval for the error of algorithm $B$.
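A minimal sketch of this confidence-interval computation on a vector of per-example test losses; the simulated 0/1 losses and error rate are arbitrary.

```python
import numpy as np

# 95 percent confidence interval around the estimated generalization error,
# using the standard error of the mean of per-example test losses.
def mean_with_ci(errors):
    m = len(errors)
    mu = np.mean(errors)
    se = np.std(errors, ddof=1) / np.sqrt(m)  # estimated standard error
    return mu - 1.96 * se, mu + 1.96 * se

errors = np.random.default_rng(0).binomial(1, 0.12, size=1000)  # simulated 0/1 losses
print(mean_with_ci(errors))  # interval around the estimated error rate near 0.12
```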
Example: Bernoulli Distribution We once again consider a set of samples $\{x^{(1)}, \ldots, x^{(m)}\}$ drawn independently and identically from a Bernoulli distribution (recall $P(x^{(i)}; \theta) = \theta^{x^{(i)}}(1-\theta)^{(1-x^{(i)})}$). This time we are interested in computing the variance of the estimator $\hat{\theta}_m = \frac{1}{m}\sum_{i=1}^m x^{(i)}$:
$$\text{Var}\left(\hat{\theta}_m\right) = \text{Var}\left(\frac{1}{m}\sum_{i=1}^m x^{(i)}\right) = \frac{1}{m^2}\sum_{i=1}^m \text{Var}\left(x^{(i)}\right) = \frac{1}{m^2}\, m\, \theta(1-\theta) = \frac{1}{m}\theta(1-\theta).$$
The variance of the estimator decreases as a function of $m$, the number of examples in the dataset. This is a common property of popular estimators that we will return to when we discuss consistency (see section 5.4.5).

5.4.4 Trading off Bias and Variance to Minimize Mean Squared Error

Bias and variance measure two different sources of error in an estimator. Bias measures the expected deviation from the true value of the function or parameter. Variance, on the other hand, provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.
What happens when we are given a choice between two estimators, one with more bias and one with more variance? How do we choose between them? For example, imagine that we are interested in approximating the function shown in figure 5.2 and we are only offered the choice between a model with large bias and one that suffers from large variance. How do we choose between them?
The most common way to negotiate this trade-off is to use cross-validation. Empirically, cross-validation is highly successful on many real-world tasks. Alternatively, we can also compare the mean squared error (MSE) of the estimates:
$$\text{MSE} = \mathbb{E}\left[(\hat{\theta}_m - \theta)^2\right] = \text{Bias}(\hat{\theta}_m)^2 + \text{Var}(\hat{\theta}_m). \qquad (5.54)$$
The MSE measures the overall expected deviation, in a squared error sense, between the estimator and the true value of the parameter $\theta$. As is clear from equation 5.54, evaluating the MSE incorporates both the bias and the variance. Desirable estimators are those with small MSE and these are estimators that manage to keep both their bias and variance somewhat in check.
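A small simulation can verify the decomposition in equation 5.54, here for the biased variance estimator from the earlier example; the constants are arbitrary.

```python
import numpy as np

# Numerical check: MSE = E[(theta_hat - theta)^2] equals Bias^2 + Var
# for the biased sample variance estimator (ddof=0).
rng = np.random.default_rng(0)
m, sigma2 = 5, 4.0
est = rng.normal(0.0, np.sqrt(sigma2), size=(200000, m)).var(axis=1, ddof=0)
mse = np.mean((est - sigma2) ** 2)
bias2 = (np.mean(est) - sigma2) ** 2
var = np.var(est)
print(mse, bias2 + var)  # the two quantities match
```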
The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting and overfitting. When generalization
Figure 5.6: As capacity increases ($x$-axis), bias (dotted) tends to decrease and variance (dashed) tends to increase, yielding another U-shaped curve for generalization error (bold curve). If we vary capacity along one axis, there is an optimal capacity, with underfitting when the capacity is below this optimum and overfitting when it is above. This relationship is similar to the relationship between capacity, underfitting, and overfitting, discussed in section 5.2 and figure 5.3.
error is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias. This is illustrated in figure 5.6, where we see again the U-shaped curve of generalization error as a function of capacity.

5.4.5 Consistency

So far we have discussed the properties of various estimators for a training set of fixed size. Usually, we are also concerned with the behavior of an estimator as the amount of training data grows. In particular, we usually wish that, as the number of data points $m$ in our dataset increases, our point estimates converge to the true value of the corresponding parameters. More formally, we would like that
$$\operatorname{plim}_{m \to \infty} \hat{\theta}_m = \theta. \qquad (5.55)$$
The symbol plim indicates convergence in probability, meaning that for any $\epsilon > 0$, $P(|\hat{\theta}_m - \theta| > \epsilon) \to 0$ as $m \to \infty$. The condition described by equation 5.55 is known as consistency. It is sometimes referred to as weak consistency, with strong consistency referring to the almost sure convergence of $\hat{\theta}$ to $\theta$. Almost

sure convergence of a sequence of random variables $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots$ to a value $\mathbf{x}$ occurs when $p(\lim_{m \to \infty} \mathbf{x}^{(m)} = \mathbf{x}) = 1$.
Consistency ensures that the bias induced by the estimator diminishes as the number of data examples grows. However, the reverse is not true - asymptotic unbiasedness does not imply consistency. For example, consider estimating the mean parameter of a normal distribution , with a dataset consisting of samples: . We could use the first sample of the dataset as an unbiased estimator: . In that case, , so the estimator is unbiased no matter how many data points are seen. This, of course, implies that the estimate is asymptotically unbiased. However, this is not a consistent estimator as it is not the case that as .
一致性确保了估计器引起的偏差随着数据示例数量的增加而减小。然而,反过来却并非如此--渐近无偏性并不意味着一致性。例如,考虑用 样本组成的数据集来估计正态分布 的平均参数 .我们可以使用数据集的第一个样本 作为无偏估计值: 。在这种情况下, ,因此无论看到多少个数据点,估计器都是无偏的。当然,这意味着估计值是渐近无偏的。然而,这并不是一个一致的估计器,因为 并不是

5.5 Maximum Likelihood Estimation

We have seen some definitions of common estimators and analyzed their properties. But where did these estimators come from? Rather than guessing that some function might make a good estimator and then analyzing its bias and variance, we would like to have some principle from which we can derive specific functions that are good estimators for different models.
The most common such principle is the maximum likelihood principle.
Consider a set of $m$ examples $\mathbb{X} = \{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}\}$ drawn independently from the true but unknown data-generating distribution $p_{\text{data}}(\mathbf{x})$.
Let $p_{\text{model}}(\mathbf{x}; \boldsymbol{\theta})$ be a parametric family of probability distributions over the same space indexed by $\boldsymbol{\theta}$. In other words, $p_{\text{model}}(\boldsymbol{x}; \boldsymbol{\theta})$ maps any configuration $\boldsymbol{x}$ to a real number estimating the true probability $p_{\text{data}}(\boldsymbol{x})$.
The maximum likelihood estimator for $\boldsymbol{\theta}$ is then defined as
$$\boldsymbol{\theta}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} p_{\text{model}}(\mathbb{X}; \boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \prod_{i=1}^m p_{\text{model}}(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}).$$
This product over many probabilities can be inconvenient for various reasons. For example, it is prone to numerical underflow. To obtain a more convenient but equivalent optimization problem, we observe that taking the logarithm of the likelihood does not change its arg max but does conveniently transform a product

into a sum:
$$\boldsymbol{\theta}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^m \log p_{\text{model}}(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}).$$
Because the arg max does not change when we rescale the cost function, we can divide by $m$ to obtain a version of the criterion that is expressed as an expectation with respect to the empirical distribution $\hat{p}_{\text{data}}$ defined by the training data:
$$\boldsymbol{\theta}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(\boldsymbol{x}; \boldsymbol{\theta}). \qquad (5.59)$$
One way to interpret maximum likelihood estimation is to view it as minimizing the dissimilarity between the empirical distribution $\hat{p}_{\text{data}}$, defined by the training set, and the model distribution, with the degree of dissimilarity between the two measured by the KL divergence. The KL divergence is given by
$$D_{\text{KL}}(\hat{p}_{\text{data}} \| p_{\text{model}}) = \mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text{data}}} \left[\log \hat{p}_{\text{data}}(\boldsymbol{x}) - \log p_{\text{model}}(\boldsymbol{x})\right].$$
The term on the left is a function only of the data-generating process, not the model. This means when we train the model to minimize the KL divergence, we need only minimize
$$-\mathbb{E}_{\mathbf{x} \sim \hat{p}_{\text{data}}} \left[\log p_{\text{model}}(\boldsymbol{x})\right],$$
which is of course the same as the maximization in equation 5.59.
Minimizing this KL divergence corresponds exactly to minimizing the crossentropy between the distributions. Many authors use the term "cross-entropy" to identify specifically the negative log-likelihood of a Bernoulli or softmax distribution, but that is a misnomer. Any loss consisting of a negative log-likelihood is a crossentropy between the empirical distribution defined by the training set and the probability distribution defined by model. For example, mean squared error is the cross-entropy between the empirical distribution and a Gaussian model.
We can thus see maximum likelihood as an attempt to make the model distribution match the empirical distribution $\hat{p}_{\text{data}}$. Ideally, we would like to match the true data-generating distribution $p_{\text{data}}$, but we have no direct access to this distribution.
While the optimal $\boldsymbol{\theta}$ is the same regardless of whether we are maximizing the likelihood or minimizing the KL divergence, the values of the objective functions are different. In software, we often phrase both as minimizing a cost function. Maximum likelihood thus becomes minimization of the negative log-likelihood (NLL), or equivalently, minimization of the cross-entropy. The perspective of maximum likelihood as minimum KL divergence becomes helpful in this case because the KL divergence has a known minimum value of zero. The negative log-likelihood can actually become negative when $\boldsymbol{x}$ is real-valued.
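To make the software view concrete, here is a sketch that recovers the maximum likelihood estimate of a Gaussian mean by minimizing the NLL; the grid search is a crude stand-in for a proper optimizer, and the data are synthetic.

```python
import numpy as np

# NLL of a unit-variance Gaussian model with mean theta. The theta that
# minimizes this NLL is the maximum likelihood estimate (the sample mean).
def nll(theta, x):
    return 0.5 * np.sum((x - theta) ** 2) + 0.5 * len(x) * np.log(2 * np.pi)

x = np.random.default_rng(0).normal(3.0, 1.0, size=100)
grid = np.linspace(0.0, 6.0, 601)
theta_ml = grid[np.argmin([nll(t, x) for t in grid])]
print(theta_ml, x.mean())  # both close to 3.0
```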

5.5.1 Conditional Log-Likelihood and Mean Squared Error

The maximum likelihood estimator can readily be generalized to estimate a conditional probability $P(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta})$ in order to predict $\mathbf{y}$ given $\mathbf{x}$. This is actually the most common situation because it forms the basis for most supervised learning. If $\boldsymbol{X}$ represents all our inputs and $\boldsymbol{Y}$ all our observed targets, then the conditional maximum likelihood estimator is
$$\boldsymbol{\theta}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} P(\boldsymbol{Y} \mid \boldsymbol{X}; \boldsymbol{\theta}).$$
If the examples are assumed to be i.i.d., then this can be decomposed into
$$\boldsymbol{\theta}_{\text{ML}} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^m \log P(\boldsymbol{y}^{(i)} \mid \boldsymbol{x}^{(i)}; \boldsymbol{\theta}). \qquad (5.63)$$
Example: Linear Regression as Maximum Likelihood Linear regression, introduced in section 5.1.4, may be justified as a maximum likelihood procedure. Previously, we motivated linear regression as an algorithm that learns to take an input $\boldsymbol{x}$ and produce an output value $\hat{y}$. The mapping from $\boldsymbol{x}$ to $\hat{y}$ is chosen to minimize mean squared error, a criterion that we introduced more or less arbitrarily. We now revisit linear regression from the point of view of maximum likelihood estimation. Instead of producing a single prediction $\hat{y}$, we now think of the model as producing a conditional distribution $p(y \mid \boldsymbol{x})$. We can imagine that with an infinitely large training set, we might see several training examples with the same input value $\boldsymbol{x}$ but different values of $y$. The goal of the learning algorithm is now to fit the distribution $p(y \mid \boldsymbol{x})$ to all those different $y$ values that are all compatible with $\boldsymbol{x}$. To derive the same linear regression algorithm we obtained before, we define $p(y \mid \boldsymbol{x}) = \mathcal{N}(y; \hat{y}(\boldsymbol{x}; \boldsymbol{w}), \sigma^2)$. The function $\hat{y}(\boldsymbol{x}; \boldsymbol{w})$ gives the prediction of the mean of the Gaussian. In this example, we assume that the variance is fixed to some constant $\sigma^2$ chosen by the user. We will see that this choice of the functional form of $p(y \mid \boldsymbol{x})$ causes the maximum likelihood estimation procedure to yield the same learning algorithm as we developed before. Since the examples are assumed to be i.i.d., the conditional log-likelihood (equation 5.63) is given by
$$\sum_{i=1}^m \log p(y^{(i)} \mid \boldsymbol{x}^{(i)}; \boldsymbol{\theta}) = -m \log \sigma - \frac{m}{2} \log(2\pi) - \sum_{i=1}^m \frac{\|\hat{y}^{(i)} - y^{(i)}\|^2}{2\sigma^2}, \qquad (5.65)$$
where $\hat{y}^{(i)}$ is the output of the linear regression on the $i$-th input $\boldsymbol{x}^{(i)}$ and $m$ is the number of the training examples. Comparing the log-likelihood with the mean squared error,
$$\text{MSE}_{\text{train}} = \frac{1}{m} \sum_{i=1}^m \|\hat{y}^{(i)} - y^{(i)}\|^2,$$
we immediately see that maximizing the log-likelihood with respect to $\boldsymbol{w}$ yields the same estimate of the parameters $\boldsymbol{w}$ as does minimizing the mean squared error. The two criteria have different values but the same location of the optimum. This justifies the use of the MSE as a maximum likelihood estimation procedure. As we will see, the maximum likelihood estimator has several desirable properties.
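A numerical check that the two criteria share the same optimum: fitting by least squares and by gradient descent on the fixed-variance Gaussian NLL yield the same weights. The learning rate, iteration count, and synthetic data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.3, size=100)

# Weights that minimize the MSE directly (normal equations via lstsq)
w_mse = np.linalg.lstsq(X, y, rcond=None)[0]

# Gradient descent on the NLL of the fixed-variance Gaussian model:
# the gradient is X.T (Xw - y) / sigma^2, so the optimum is the same.
sigma = 0.3
w = np.zeros(2)
for _ in range(2000):
    w -= 5e-4 * X.T @ (X @ w - y) / sigma ** 2
print(w_mse, w)  # the two estimates agree
```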

5.5.2 Properties of Maximum Likelihood

The main appeal of the maximum likelihood estimator is that it can be shown to be the best estimator asymptotically, as the number of examples $m \to \infty$, in terms of its rate of convergence as $m$ increases.
Under appropriate conditions, the maximum likelihood estimator has the property of consistency (see section 5.4.5), meaning that as the number of training examples approaches infinity, the maximum likelihood estimate of a parameter converges to the true value of the parameter. These conditions are as follows:
  • The true distribution $p_{\text{data}}$ must lie within the model family $p_{\text{model}}(\cdot; \boldsymbol{\theta})$. Otherwise, no estimator can recover $p_{\text{data}}$.
  • The true distribution $p_{\text{data}}$ must correspond to exactly one value of $\boldsymbol{\theta}$. Otherwise, maximum likelihood can recover the correct $p_{\text{data}}$ but will not be able to determine which value of $\boldsymbol{\theta}$ was used by the data-generating process.
There are other inductive principles besides the maximum likelihood estimator, many of which share the property of being consistent estimators. Consistent estimators can differ, however, in their statistical efficiency, meaning that one consistent estimator may obtain lower generalization error for a fixed number of samples $m$, or equivalently, may require fewer examples to obtain a fixed level of generalization error.
Statistical efficiency is typically studied in the parametric case (as in linear regression), where our goal is to estimate the value of a parameter (assuming it is possible to identify the true parameter), not the value of a function. A way to measure how close we are to the true parameter is by the expected mean squared error, computing the squared difference between the estimated and true parameter

values, where the expectation is over $m$ training samples from the data-generating distribution. That parametric mean squared error decreases as $m$ increases, and for $m$ large, the Cramér-Rao lower bound (Rao, 1945; Cramér, 1946) shows that no consistent estimator has a lower MSE than the maximum likelihood estimator.
For these reasons (consistency and efficiency), maximum likelihood is often considered the preferred estimator to use for machine learning. When the number of examples is small enough to yield overfitting behavior, regularization strategies such as weight decay may be used to obtain a biased version of maximum likelihood that has less variance when training data is limited.

5.6 Bayesian Statistics

So far we have discussed frequentist statistics and approaches based on estimating a single value of $\boldsymbol{\theta}$, then making all predictions thereafter based on that one estimate. Another approach is to consider all possible values of $\boldsymbol{\theta}$ when making a prediction. The latter is the domain of Bayesian statistics.
As discussed in section 5.4.1, the frequentist perspective is that the true parameter value $\boldsymbol{\theta}$ is fixed but unknown, while the point estimate $\hat{\boldsymbol{\theta}}$ is a random variable on account of it being a function of the dataset (which is seen as random).
The Bayesian perspective on statistics is quite different. The Bayesian uses probability to reflect degrees of certainty in states of knowledge. The dataset is directly observed and so is not random. On the other hand, the true parameter $\boldsymbol{\theta}$ is unknown or uncertain and thus is represented as a random variable.
Before observing the data, we represent our knowledge of $\boldsymbol{\theta}$ using the prior probability distribution, $p(\boldsymbol{\theta})$ (sometimes referred to as simply "the prior"). Generally, the machine learning practitioner selects a prior distribution that is quite broad (i.e., with high entropy) to reflect a high degree of uncertainty in the value of $\boldsymbol{\theta}$ before observing any data. For example, one might assume a priori that $\boldsymbol{\theta}$ lies in some finite range or volume, with a uniform distribution. Many priors instead reflect a preference for "simpler" solutions (such as smaller magnitude coefficients, or a function that is closer to being constant).
Now consider that we have a set of data samples $\{x^{(1)}, \ldots, x^{(m)}\}$. We can recover the effect of data on our belief about $\boldsymbol{\theta}$ by combining the data likelihood $p(x^{(1)}, \ldots, x^{(m)} \mid \boldsymbol{\theta})$ with the prior via Bayes' rule:
$$p(\boldsymbol{\theta} \mid x^{(1)}, \ldots, x^{(m)}) = \frac{p(x^{(1)}, \ldots, x^{(m)} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(x^{(1)}, \ldots, x^{(m)})}.$$
In the scenarios where Bayesian estimation is typically used, the prior begins as a relatively uniform or Gaussian distribution with high entropy, and the observation of the data usually causes the posterior to lose entropy and concentrate around a few highly likely values of the parameters.
Relative to maximum likelihood estimation, Bayesian estimation offers two important differences. First, unlike the maximum likelihood approach that makes predictions using a point estimate of $\boldsymbol{\theta}$, the Bayesian approach is to make predictions using a full distribution over $\boldsymbol{\theta}$. For example, after observing $m$ examples, the predicted distribution over the next data sample, $x^{(m+1)}$, is given by
$$p(x^{(m+1)} \mid x^{(1)}, \ldots, x^{(m)}) = \int p(x^{(m+1)} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid x^{(1)}, \ldots, x^{(m)})\, d\boldsymbol{\theta}.$$
Here each value of $\boldsymbol{\theta}$ with positive probability density contributes to the prediction of the next example, with the contribution weighted by the posterior density itself. After having observed $\{x^{(1)}, \ldots, x^{(m)}\}$, if we are still quite uncertain about the value of $\boldsymbol{\theta}$, then this uncertainty is incorporated directly into any predictions we might make.
In section 5.4, we discussed how the frequentist approach addresses the uncertainty in a given point estimate of $\boldsymbol{\theta}$ by evaluating its variance. The variance of the estimator is an assessment of how the estimate might change with alternative samplings of the observed data. The Bayesian answer to the question of how to deal with the uncertainty in the estimator is to simply integrate over it, which tends to protect well against overfitting. This integral is of course just an application of the laws of probability, making the Bayesian approach simple to justify, while the frequentist machinery for constructing an estimator is based on the rather ad hoc decision to summarize all knowledge contained in the dataset with a single point estimate.
The second important difference between the Bayesian approach to estimation and the maximum likelihood approach is due to the contribution of the Bayesian prior distribution. The prior has an influence by shifting probability mass density towards regions of the parameter space that are preferred a priori. In practice, the prior often expresses a preference for models that are simpler or more smooth. Critics of the Bayesian approach identify the prior as a source of subjective human judgment affecting the predictions.
Bayesian methods typically generalize much better when limited training data is available but typically suffer from high computational cost when the number of training examples is large.
Example: Bayesian Linear Regression Here we consider the Bayesian estimation approach to learning the linear regression parameters. In linear regression, we learn a linear mapping from an input vector $\boldsymbol{x} \in \mathbb{R}^n$ to predict the value of a scalar $y \in \mathbb{R}$. The prediction is parametrized by the vector $\boldsymbol{w} \in \mathbb{R}^n$:
$$\hat{y} = \boldsymbol{w}^\top \boldsymbol{x}.$$
Given a set of $m$ training samples $(\boldsymbol{X}^{(\text{train})}, \boldsymbol{y}^{(\text{train})})$, we can express the prediction of $\boldsymbol{y}$ over the entire training set as
$$\hat{\boldsymbol{y}}^{(\text{train})} = \boldsymbol{X}^{(\text{train})} \boldsymbol{w}.$$
Expressed as a Gaussian conditional distribution on $\boldsymbol{y}^{(\text{train})}$, we have
$$p(\boldsymbol{y}^{(\text{train})} \mid \boldsymbol{X}^{(\text{train})}, \boldsymbol{w}) = \mathcal{N}(\boldsymbol{y}^{(\text{train})}; \boldsymbol{X}^{(\text{train})}\boldsymbol{w}, \boldsymbol{I}) \propto \exp\left(-\frac{1}{2}\left(\boldsymbol{y}^{(\text{train})} - \boldsymbol{X}^{(\text{train})}\boldsymbol{w}\right)^\top \left(\boldsymbol{y}^{(\text{train})} - \boldsymbol{X}^{(\text{train})}\boldsymbol{w}\right)\right),$$
where we follow the standard MSE formulation in assuming that the Gaussian variance on $\boldsymbol{y}$ is one. In what follows, to reduce the notational burden, we refer to $(\boldsymbol{X}^{(\text{train})}, \boldsymbol{y}^{(\text{train})})$ as simply $(\boldsymbol{X}, \boldsymbol{y})$.
To determine the posterior distribution over the model parameter vector $\boldsymbol{w}$, we first need to specify a prior distribution. The prior should reflect our naive belief about the value of these parameters. While it is sometimes difficult or unnatural to express our prior beliefs in terms of the parameters of the model, in practice we typically assume a fairly broad distribution, expressing a high degree of uncertainty about $\boldsymbol{\theta}$. For real-valued parameters it is common to use a Gaussian as a prior distribution,
$$p(\boldsymbol{w}) = \mathcal{N}(\boldsymbol{w}; \boldsymbol{\mu}_0, \boldsymbol{\Lambda}_0),$$
where $\boldsymbol{\mu}_0$ and $\boldsymbol{\Lambda}_0$ are the prior distribution mean vector and covariance matrix respectively.
With the prior thus specified, we can now proceed in determining the posterior distribution over the model parameters:
$$p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}) \propto p(\boldsymbol{y} \mid \boldsymbol{X}, \boldsymbol{w})\, p(\boldsymbol{w}) \propto \exp\left(-\frac{1}{2}(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w})^\top(\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w})\right) \exp\left(-\frac{1}{2}(\boldsymbol{w} - \boldsymbol{\mu}_0)^\top \boldsymbol{\Lambda}_0^{-1} (\boldsymbol{w} - \boldsymbol{\mu}_0)\right).$$
We now define $\boldsymbol{\Lambda}_m = \left(\boldsymbol{X}^\top\boldsymbol{X} + \boldsymbol{\Lambda}_0^{-1}\right)^{-1}$ and $\boldsymbol{\mu}_m = \boldsymbol{\Lambda}_m \left(\boldsymbol{X}^\top\boldsymbol{y} + \boldsymbol{\Lambda}_0^{-1}\boldsymbol{\mu}_0\right)$. Using these new variables, we find that the posterior may be rewritten as a Gaussian distribution:
$$p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}) \propto \exp\left(-\frac{1}{2}(\boldsymbol{w} - \boldsymbol{\mu}_m)^\top \boldsymbol{\Lambda}_m^{-1} (\boldsymbol{w} - \boldsymbol{\mu}_m)\right).$$
All terms that do not include the parameter vector $\boldsymbol{w}$ have been omitted; they are implied by the fact that the distribution must be normalized to integrate to 1. Equation 3.23 shows how to normalize a multivariate Gaussian distribution.
Examining this posterior distribution enables us to gain some intuition for the effect of Bayesian inference. In most situations, we set $\boldsymbol{\mu}_0$ to $\mathbf{0}$. If we set $\boldsymbol{\Lambda}_0 = \frac{1}{\alpha}\boldsymbol{I}$, then $\boldsymbol{\mu}_m$ gives the same estimate of $\boldsymbol{w}$ as does frequentist linear regression with a weight decay penalty of $\alpha \boldsymbol{w}^\top \boldsymbol{w}$. One difference is that the Bayesian estimate is undefined if $\alpha$ is set to zero; we are not allowed to begin the Bayesian learning process with an infinitely wide prior on $\boldsymbol{w}$. The more important difference is that the Bayesian estimate provides a covariance matrix, showing how likely all the different values of $\boldsymbol{w}$ are, rather than providing only the estimate $\boldsymbol{\mu}_m$.
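A NumPy sketch of this posterior update, assuming $\boldsymbol{\mu}_0 = \mathbf{0}$ and $\boldsymbol{\Lambda}_0 = \frac{1}{\alpha}\boldsymbol{I}$ as in the discussion above; the synthetic data and the value of $\alpha$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=30)

alpha = 1.0
Lambda0_inv = alpha * np.eye(2)                  # inverse prior covariance
Lambda_m = np.linalg.inv(X.T @ X + Lambda0_inv)  # posterior covariance
mu_m = Lambda_m @ (X.T @ y)                      # posterior mean (mu_0 = 0)
print(mu_m)      # matches ridge regression with weight decay alpha
print(Lambda_m)  # uncertainty about each weight and their correlation
```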

5.6.1 Maximum a Posteriori (MAP) Estimation

While the most principled approach is to make predictions using the full Bayesian posterior distribution over the parameter $\boldsymbol{\theta}$, it is still often desirable to have a single point estimate. One common reason for desiring a point estimate is that most operations involving the Bayesian posterior for most interesting models are intractable, and a point estimate offers a tractable approximation. Rather than simply returning to the maximum likelihood estimate, we can still gain some of the benefit of the Bayesian approach by allowing the prior to influence the choice of the point estimate. One rational way to do this is to choose the maximum a posteriori (MAP) point estimate. The MAP estimate chooses the point of

maximal posterior probability (or maximal probability density in the more common case of continuous $\boldsymbol{\theta}$):
$$\boldsymbol{\theta}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta} \mid \boldsymbol{x}) = \arg\max_{\boldsymbol{\theta}} \left[\log p(\boldsymbol{x} \mid \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})\right]. \qquad (5.79)$$
We recognize, on the righthand side, $\log p(\boldsymbol{x} \mid \boldsymbol{\theta})$, that is, the standard log-likelihood term, and $\log p(\boldsymbol{\theta})$, corresponding to the prior distribution.
As an example, consider a linear regression model with a Gaussian prior on the weights $\boldsymbol{w}$. If this prior is given by $\mathcal{N}\left(\boldsymbol{w}; \mathbf{0}, \frac{1}{\lambda}\boldsymbol{I}^2\right)$, then the log-prior term in equation 5.79 is proportional to the familiar $\lambda \boldsymbol{w}^\top \boldsymbol{w}$ weight decay penalty, plus a term that does not depend on $\boldsymbol{w}$ and does not affect the learning process. MAP Bayesian inference with a Gaussian prior on the weights thus corresponds to weight decay.
As with full Bayesian inference, MAP Bayesian inference has the advantage of leveraging information that is brought by the prior and cannot be found in the training data. This additional information helps to reduce the variance in the MAP point estimate (in comparison to the ML estimate). However, it does so at the price of increased bias.
Many regularized estimation strategies, such as maximum likelihood learning regularized with weight decay, can be interpreted as making the MAP approximation to Bayesian inference. This view applies when the regularization consists of adding an extra term to the objective function that corresponds to $\log p(\boldsymbol{\theta})$. Not all regularization penalties correspond to MAP Bayesian inference. For example, some regularizer terms may not be the logarithm of a probability distribution. Other regularization terms depend on the data, which of course a prior probability distribution is not allowed to do.
MAP Bayesian inference provides a straightforward way to design complicated yet interpretable regularization terms. For example, a more complicated penalty term can be derived by using a mixture of Gaussians, rather than a single Gaussian distribution, as the prior (Nowlan and Hinton, 1992).

5.7 Supervised Learning Algorithms

Recall from section 5.1.3 that supervised learning algorithms are, roughly speaking, learning algorithms that learn to associate some input with some output, given a training set of examples of inputs $\boldsymbol{x}$ and outputs $\boldsymbol{y}$. In many cases the outputs $\boldsymbol{y}$ may be difficult to collect automatically and must be provided by a human

"supervisor," but the term still applies even when the training set targets were collected automatically.
"监督员",但即使在自动收集训练集目标的情况下,该术语仍然适用。

5.7.1 Probabilistic Supervised Learning

Most supervised learning algorithms in this book are based on estimating a probability distribution $p(y \mid \boldsymbol{x})$. We can do this simply by using maximum likelihood estimation to find the best parameter vector $\boldsymbol{\theta}$ for a parametric family of distributions $p(y \mid \boldsymbol{x}; \boldsymbol{\theta})$.
We have already seen that linear regression corresponds to the family

$$p(y \mid \boldsymbol{x}; \boldsymbol{\theta}) = \mathcal{N}\left( y; \boldsymbol{\theta}^\top \boldsymbol{x}, \boldsymbol{I} \right). \tag{5.80}$$

We can generalize linear regression to the classification scenario by defining a different family of probability distributions. If we have two classes, class 0 and class 1, then we need only specify the probability of one of these classes. The probability of class 1 determines the probability of class 0, because these two values must add up to 1.
The normal distribution over real-valued numbers that we used for linear regression is parametrized in terms of a mean. Any value we supply for this mean is valid. A distribution over a binary variable is slightly more complicated, because its mean must always be between 0 and 1. One way to solve this problem is to use the logistic sigmoid function to squash the output of the linear function into the interval $(0, 1)$ and interpret that value as a probability:

$$p(y = 1 \mid \boldsymbol{x}; \boldsymbol{\theta}) = \sigma\left( \boldsymbol{\theta}^\top \boldsymbol{x} \right). \tag{5.81}$$
This approach is known as logistic regression (a somewhat strange name since we use the model for classification rather than regression).
In the case of linear regression, we were able to find the optimal weights by solving the normal equations. Logistic regression is somewhat more difficult. There is no closed-form solution for its optimal weights. Instead, we must search for them by maximizing the log-likelihood. We can do this by minimizing the negative log-likelihood using gradient descent.
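As a concrete illustration, the following minimal sketch (hypothetical NumPy code with made-up data, step size, and iteration count) minimizes the negative log-likelihood of the logistic regression model by gradient descent; the gradient of the mean loss takes the simple form $\frac{1}{m} X^\top (\sigma(X\boldsymbol{\theta}) - \boldsymbol{y})$.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)   # synthetic binary labels

theta = np.zeros(2)
lr = 0.1
for _ in range(1000):
    p = sigmoid(X @ theta)                  # model's estimate of P(y = 1 | x)
    grad = X.T @ (p - y) / len(y)           # gradient of the mean negative log-likelihood
    theta -= lr * grad

print(theta / np.linalg.norm(theta))   # roughly aligned with the true direction [2, -1] / ||[2, -1]||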
This same strategy can be applied to essentially any supervised learning problem, by writing down a parametric family of conditional probability distributions over the right kind of input and output variables.

5.7.2 Support Vector Machines

One of the most influential approaches to supervised learning is the support vector machine (Boser et al., 1992; Cortes and Vapnik, 1995). This model is similar to logistic regression in that it is driven by a linear function $\boldsymbol{w}^\top \boldsymbol{x} + b$. Unlike logistic regression, the support vector machine does not provide probabilities, but only outputs a class identity. The SVM predicts that the positive class is present when $\boldsymbol{w}^\top \boldsymbol{x} + b$ is positive. Likewise, it predicts that the negative class is present when $\boldsymbol{w}^\top \boldsymbol{x} + b$ is negative.
One key innovation associated with support vector machines is the kernel trick. The kernel trick consists of observing that many machine learning algorithms can be written exclusively in terms of dot products between examples. For example, it can be shown that the linear function used by the support vector machine can be re-written as

$$\boldsymbol{w}^\top \boldsymbol{x} + b = b + \sum_{i=1}^{m} \alpha_i \boldsymbol{x}^\top \boldsymbol{x}^{(i)}, \tag{5.82}$$

where $\boldsymbol{x}^{(i)}$ is a training example, and $\boldsymbol{\alpha}$ is a vector of coefficients. Rewriting the learning algorithm this way enables us to replace $\boldsymbol{x}$ with the output of a given feature function $\phi(\boldsymbol{x})$ and the dot product with a function $k\left( \boldsymbol{x}, \boldsymbol{x}^{(i)} \right) = \phi(\boldsymbol{x}) \cdot \phi\left( \boldsymbol{x}^{(i)} \right)$ called a kernel. The $\cdot$ operator represents an inner product analogous to $\phi(\boldsymbol{x})^\top \phi\left( \boldsymbol{x}^{(i)} \right)$. For some feature spaces, we may not use literally the vector inner product. In some infinite dimensional spaces, we need to use other kinds of inner products, for example, inner products based on integration rather than summation. A complete development of these kinds of inner products is beyond the scope of this book.
After replacing dot products with kernel evaluations, we can make predictions using the function

$$f(\boldsymbol{x}) = b + \sum_i \alpha_i k\left( \boldsymbol{x}, \boldsymbol{x}^{(i)} \right). \tag{5.83}$$

This function is nonlinear with respect to $\boldsymbol{x}$, but the relationship between $\phi(\boldsymbol{x})$ and $f(\boldsymbol{x})$ is linear. Also, the relationship between $\boldsymbol{\alpha}$ and $f(\boldsymbol{x})$ is linear. The kernel-based function is exactly equivalent to preprocessing the data by applying $\phi(\boldsymbol{x})$ to all inputs, then learning a linear model in the new transformed space.
The kernel trick is powerful for two reasons. First, it enables us to learn models that are nonlinear as a function of $\boldsymbol{x}$ using convex optimization techniques that are guaranteed to converge efficiently. This is possible because we consider $\phi$ fixed and optimize only $\boldsymbol{\alpha}$; that is, the optimization algorithm can view the decision function as being linear in a different space. Second, the kernel function $k$ often admits an implementation that is significantly more computationally efficient than naively constructing two $\phi(\boldsymbol{x})$ vectors and explicitly taking their dot product.
In some cases, $\phi(\boldsymbol{x})$ can even be infinite dimensional, which would result in an infinite computational cost for the naive, explicit approach. In many cases, $k(\boldsymbol{x}, \boldsymbol{x}')$ is a nonlinear, tractable function of $\boldsymbol{x}$ even when $\phi(\boldsymbol{x})$ is intractable. As an example of an infinite-dimensional feature space with a tractable kernel, we construct a feature mapping $\phi(x)$ over the nonnegative integers $x$. Suppose that this mapping returns a vector containing $x$ ones followed by infinitely many zeros. We can write a kernel function $k\left( x, x^{(i)} \right) = \min\left( x, x^{(i)} \right)$ that is exactly equivalent to the corresponding infinite-dimensional dot product.
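This equivalence is easy to check numerically. The sketch below (hypothetical code, using a truncated version of the infinite feature vector, since only finitely many entries of $\phi(x)$ can be nonzero for any finite $x$) verifies that the dot product of the step-function features equals the min kernel.

import numpy as np

def phi(x, dim=100):
    # Truncated version of the infinite feature map: x ones, then zeros.
    v = np.zeros(dim)
    v[:x] = 1.0
    return v

def k_min(x1, x2):
    return min(x1, x2)

for x1 in range(10):
    for x2 in range(10):
        assert phi(x1) @ phi(x2) == k_min(x1, x2)
print("min kernel matches the feature-space dot product")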
The most commonly used kernel is the Gaussian kernel,

$$k(\boldsymbol{u}, \boldsymbol{v}) = \mathcal{N}\left( \boldsymbol{u} - \boldsymbol{v}; \boldsymbol{0}, \sigma^2 \boldsymbol{I} \right), \tag{5.84}$$

where $\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is the standard normal density. This kernel is also known as the radial basis function (RBF) kernel, because its value decreases along lines in $\boldsymbol{v}$ space radiating outward from $\boldsymbol{u}$. The Gaussian kernel corresponds to a dot product in an infinite-dimensional space, but the derivation of this space is less straightforward than in our example of the min kernel over the integers.
We can think of the Gaussian kernel as performing a kind of template matching. A training example $\boldsymbol{x}$ associated with training label $y$ becomes a template for class $y$. When a test point $\boldsymbol{x}'$ is near $\boldsymbol{x}$ according to Euclidean distance, the Gaussian kernel has a large response, indicating that $\boldsymbol{x}'$ is very similar to the $\boldsymbol{x}$ template. The model then puts a large weight on the associated training label $y$. Overall, the prediction will combine many such training labels weighted by the similarity of the corresponding training examples.
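The following sketch (a hypothetical NumPy illustration of equation 5.83, not a trained SVM; the coefficients $\alpha_i$ are simply set to the labels rather than learned) shows the template-matching flavor of a Gaussian-kernel predictor.

import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    # Proportional to the normal density of u - v; constants do not affect the sign.
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

# Two training "templates," one per class, with labels in {-1, +1}.
X_train = np.array([[0.0, 0.0], [3.0, 3.0]])
y_train = np.array([-1.0, 1.0])
alpha, b = y_train, 0.0   # stand-in coefficients for illustration only

def f(x):
    return b + sum(a * gaussian_kernel(x, xi) for a, xi in zip(alpha, X_train))

print(np.sign(f(np.array([0.5, 0.2]))))   # -1.0: closer to the first template
print(np.sign(f(np.array([2.5, 3.1]))))   # +1.0: closer to the second template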
Support vector machines are not the only algorithm that can be enhanced using the kernel trick. Many other linear models can be enhanced in this way. The category of algorithms that employ the kernel trick is known as kernel machines, or kernel methods (Williams and Rasmussen, 1996; Schölkopf et al., 1999).
A major drawback to kernel machines is that the cost of evaluating the decision function is linear in the number of training examples, because the $i$-th example contributes a term $\alpha_i k\left( \boldsymbol{x}, \boldsymbol{x}^{(i)} \right)$ to the decision function. Support vector machines are able to mitigate this by learning an $\boldsymbol{\alpha}$ vector that contains mostly zeros. Classifying a new example then requires evaluating the kernel function only for the training examples that have nonzero $\alpha_i$. These training examples are known as support vectors.
Kernel machines also suffer from a high computational cost of training when the dataset is large. We revisit this idea in section 5.9. Kernel machines with generic kernels struggle to generalize well. We explain why in section 5.11. The modern incarnation of deep learning was designed to overcome these limitations of kernel machines. The current deep learning renaissance began when Hinton et al. (2006) demonstrated that a neural network could outperform the RBF kernel SVM on the MNIST benchmark.

5.7.3 Other Simple Supervised Learning Algorithms

We have already briefly encountered another nonprobabilistic supervised learning algorithm, nearest neighbor regression. More generally, $k$-nearest neighbors is a family of techniques that can be used for classification or regression. As a nonparametric learning algorithm, $k$-nearest neighbors is not restricted to a fixed number of parameters. We usually think of the $k$-nearest neighbors algorithm as not having any parameters, but rather implementing a simple function of the training data. In fact, there is not even really a training stage or learning process. Instead, at test time, when we want to produce an output $y$ for a new test input $\boldsymbol{x}$, we find the $k$-nearest neighbors to $\boldsymbol{x}$ in the training data $\boldsymbol{X}$. We then return the average of the corresponding $y$ values in the training set. This works for essentially any kind of supervised learning where we can define an average over $y$ values. In the case of classification, we can average over one-hot code vectors $\boldsymbol{c}$ with $c_y = 1$ and $c_i = 0$ for all other values of $i$. We can then interpret the average over these one-hot codes as giving a probability distribution over classes. As a nonparametric learning algorithm, $k$-nearest neighbors can achieve very high capacity. For example, suppose we have a multiclass classification task and measure performance with 0-1 loss. In this setting, 1-nearest neighbor converges to double the Bayes error as the number of training examples approaches infinity. The error in excess of the Bayes error results from choosing a single neighbor by breaking ties between equally distant neighbors randomly. When there is infinite training data, all test points $\boldsymbol{x}$ will have infinitely many training set neighbors at distance zero. If we allow the algorithm to use all these neighbors to vote, rather than randomly choosing one of them, the procedure converges to the Bayes error rate. The high capacity of $k$-nearest neighbors enables it to obtain high accuracy given a large training set. It does so at high computational cost, however, and it may generalize very badly given a small finite training set. One weakness of $k$-nearest neighbors is that it cannot learn that one feature is more discriminative than another. For example, imagine we have a regression task with $\boldsymbol{x} \in \mathbb{R}^{100}$ drawn from an isotropic Gaussian distribution, but only a single variable $x_1$ is relevant to the output. Suppose further that this feature simply encodes the output directly, that $y = x_1$ in all cases. Nearest neighbor regression will not be able to detect this simple pattern.
The nearest neighbor of most points $\boldsymbol{x}$ will be determined by the large number of features $x_2$ through $x_{100}$, not by the lone feature $x_1$. Thus the output on small training sets will essentially be random.
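A quick sketch (hypothetical NumPy code illustrating the scenario above, with an arbitrary training set size) shows how 1-nearest neighbor regression fails when the single relevant feature is buried among many irrelevant ones.

import numpy as np

rng = np.random.default_rng(0)
n_train, dim = 50, 100

X_train = rng.normal(size=(n_train, dim))
y_train = X_train[:, 0]                 # the output is exactly the first feature

def knn_predict(x, k=1):
    dists = np.sum((X_train - x) ** 2, axis=1)
    return y_train[np.argsort(dists)[:k]].mean()

X_test = rng.normal(size=(1000, dim))
preds = np.array([knn_predict(x) for x in X_test])
errors = preds - X_test[:, 0]
# MSE is about 2, worse than always predicting zero (MSE 1):
# distances are dominated by the 99 irrelevant features.
print(np.mean(errors ** 2))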
Another type of learning algorithm that also breaks the input space into regions and has separate parameters for each region is the decision tree (Breiman et al., 1984) and its many variants. As shown in figure 5.7, each node of the decision tree is associated with a region in the input space, and internal nodes break that region into one subregion for each child of the node (typically using an axis-aligned cut). Space is thus subdivided into nonoverlapping regions, with a one-to-one correspondence between leaf nodes and input regions. Each leaf node usually maps every point in its input region to the same output. Decision trees are usually trained with specialized algorithms that are beyond the scope of this book. The learning algorithm can be considered nonparametric if it is allowed to learn a tree of arbitrary size, though decision trees are usually regularized with size constraints that turn them into parametric models in practice. Decision trees as they are typically used, with axis-aligned splits and constant outputs within each node, struggle to solve some problems that are easy even for logistic regression. For example, if we have a two-class problem, and the positive class occurs wherever $x_2 > x_1$, the decision boundary is not axis aligned. The decision tree will thus need to approximate the decision boundary with many nodes, implementing a step function that constantly walks back and forth across the true decision function with axis-aligned steps.
As we have seen, nearest neighbor predictors and decision trees have many limitations. Nonetheless, they are useful learning algorithms when computational resources are constrained. We can also build intuition for more sophisticated learning algorithms by thinking about the similarities and differences between sophisticated algorithms and $k$-nearest neighbors or decision tree baselines.
See Murphy (2012), Bishop (2006), Hastie et al. (2001) or other machine learning textbooks for more material on traditional supervised learning algorithms.

5.8 Unsupervised Learning Algorithms

Recall from section 5.1.3 that unsupervised algorithms are those that experience only "features" but not a supervision signal. The distinction between supervised and unsupervised algorithms is not formally and rigidly defined because there is no objective test for distinguishing whether a value is a feature or a target provided by a supervisor. Informally, unsupervised learning refers to most attempts to extract information from a distribution that do not require human labor to annotate examples. The term is usually associated with density estimation, learning to draw samples from a distribution, learning to denoise data from some distribution, finding a manifold that the data lies near, or clustering the data into groups of related examples.

Figure 5.7: Diagrams describing how a decision tree works. (Top)Each node of the tree chooses to send the input example to the child node on the left (0) or to the child node on the right (1). Internal nodes are drawn as circles and leaf nodes as squares. Each node is displayed with a binary string identifier corresponding to its position in the tree, obtained by appending a bit to its parent identifier (0 = choose left or top, 1 = choose right or bottom). (Bottom)The tree divides space into regions. The 2-D plane shows how a decision tree might divide $\mathbb{R}^2$. The nodes of the tree are plotted in this plane, with each internal node drawn along the dividing line it uses to categorize examples, and leaf nodes drawn in the center of the region of examples they receive. The result is a piecewise-constant function, with one piece per leaf. Each leaf requires at least one training example to define, so it is not possible for the decision tree to learn a function that has more local maxima than the number of training examples.

A classic unsupervised learning task is to find the "best" representation of the data. By "best" we can mean different things, but generally speaking we are looking for a representation that preserves as much information about $\boldsymbol{x}$ as possible while obeying some penalty or constraint aimed at keeping the representation simpler or more accessible than $\boldsymbol{x}$ itself.
There are multiple ways of defining a simpler representation. Three of the most common include lower-dimensional representations, sparse representations, and independent representations. Low-dimensional representations attempt to compress as much information about $\boldsymbol{x}$ as possible in a smaller representation. Sparse representations (Barlow, 1989; Olshausen and Field, 1996; Hinton and Ghahramani, 1997) embed the dataset into a representation whose entries are mostly zeros for most inputs. The use of sparse representations typically requires increasing the dimensionality of the representation, so that the representation becoming mostly zeros does not discard too much information. This results in an overall structure of the representation that tends to distribute data along the axes of the representation space. Independent representations attempt to disentangle the sources of variation underlying the data distribution such that the dimensions of the representation are statistically independent.
Of course these three criteria are certainly not mutually exclusive. Low-dimensional representations often yield elements that have fewer or weaker dependencies than the original high-dimensional data. This is because one way to reduce the size of a representation is to find and remove redundancies. Identifying and removing more redundancy enables the dimensionality reduction algorithm to achieve more compression while discarding less information.
The notion of representation is one of the central themes of deep learning and therefore one of the central themes in this book. In this section, we develop some simple examples of representation learning algorithms. Together, these example algorithms show how to operationalize all three of the criteria above. Most of the remaining chapters introduce additional representation learning algorithms that develop these criteria in different ways or introduce other criteria.
表征概念是深度学习的核心主题之一,因此也是本书的核心主题之一。在本节中,我们将开发一些表征学习算法的简单示例。这些示例算法共同展示了如何实现上述三个标准。其余大部分章节将介绍其他表征学习算法,这些算法将以不同的方式发展这些标准或引入其他标准。

Figure 5.8: PCA learns a linear projection that aligns the direction of greatest variance with the axes of the new space. (Left)The original data consist of samples of $\boldsymbol{x}$. In this space, the variance might occur along directions that are not axis aligned. (Right)The transformed data $\boldsymbol{z} = \boldsymbol{x}^\top \boldsymbol{W}$ now varies most along the axis $z_1$. The direction of second-most variance is now along $z_2$.

5.8.1 Principal Components Analysis

In section 2.12, we saw that the principal components analysis algorithm provides a means of compressing data. We can also view PCA as an unsupervised learning algorithm that learns a representation of data. This representation is based on two of the criteria for a simple representation described above. PCA learns a representation that has lower dimensionality than the original input. It also learns a representation whose elements have no linear correlation with each other. This is a first step toward the criterion of learning representations whose elements are statistically independent. To achieve full independence, a representation learning algorithm must also remove the nonlinear relationships between variables.
PCA learns an orthogonal, linear transformation of the data that projects an input $\boldsymbol{x}$ to a representation $\boldsymbol{z}$ as shown in figure 5.8. In section 2.12, we saw that we could learn a one-dimensional representation that best reconstructs the original data (in the sense of mean squared error) and that this representation actually corresponds to the first principal component of the data. Thus we can use PCA as a simple and effective dimensionality reduction method that preserves as much of the information in the data as possible (again, as measured by least-squares reconstruction error). In the following, we will study how the PCA representation decorrelates the original data representation $\boldsymbol{X}$.
Let us consider the $m \times n$ design matrix $\boldsymbol{X}$. We will assume that the data has a mean of zero, $\mathbb{E}[\boldsymbol{x}] = \boldsymbol{0}$. If this is not the case, the data can easily be centered by subtracting the mean from all examples in a preprocessing step.
The unbiased sample covariance matrix associated with $\boldsymbol{X}$ is given by

$$\mathrm{Var}[\boldsymbol{x}] = \frac{1}{m-1} \boldsymbol{X}^\top \boldsymbol{X}. \tag{5.85}$$

PCA finds a representation (through linear transformation) $\boldsymbol{z} = \boldsymbol{x}^\top \boldsymbol{W}$, where $\mathrm{Var}[\boldsymbol{z}]$ is diagonal.
In section 2.12, we saw that the principal components of a design matrix $\boldsymbol{X}$ are given by the eigenvectors of $\boldsymbol{X}^\top \boldsymbol{X}$. From this view,

$$\boldsymbol{X}^\top \boldsymbol{X} = \boldsymbol{W} \boldsymbol{\Lambda} \boldsymbol{W}^\top. \tag{5.86}$$

In this section, we exploit an alternative derivation of the principal components. The principal components may also be obtained via singular value decomposition (SVD). Specifically, they are the right singular vectors of $\boldsymbol{X}$. To see this, let $\boldsymbol{W}$ be the right singular vectors in the decomposition $\boldsymbol{X} = \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{W}^\top$. We then recover the original eigenvector equation with $\boldsymbol{W}$ as the eigenvector basis:

$$\boldsymbol{X}^\top \boldsymbol{X} = \left( \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{W}^\top \right)^\top \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{W}^\top = \boldsymbol{W} \boldsymbol{\Sigma}^2 \boldsymbol{W}^\top. \tag{5.87}$$
The SVD is helpful to show that PCA results in a diagonal $\mathrm{Var}[\boldsymbol{z}]$. Using the SVD of $\boldsymbol{X}$, we can express the variance of $\boldsymbol{X}$ as

$$\mathrm{Var}[\boldsymbol{x}] = \frac{1}{m-1} \boldsymbol{X}^\top \boldsymbol{X} = \frac{1}{m-1} \left( \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{W}^\top \right)^\top \boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{W}^\top = \frac{1}{m-1} \boldsymbol{W} \boldsymbol{\Sigma}^\top \boldsymbol{\Sigma} \boldsymbol{W}^\top,$$

where we use the fact that $\boldsymbol{U}^\top \boldsymbol{U} = \boldsymbol{I}$, because the $\boldsymbol{U}$ matrix of the singular value decomposition is defined to be orthogonal. This shows that the covariance of $\boldsymbol{z}$ is diagonal as required:

$$\mathrm{Var}[\boldsymbol{z}] = \frac{1}{m-1} \boldsymbol{Z}^\top \boldsymbol{Z} = \frac{1}{m-1} \boldsymbol{W}^\top \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{W} = \frac{1}{m-1} \boldsymbol{W}^\top \boldsymbol{W} \boldsymbol{\Sigma}^2 \boldsymbol{W}^\top \boldsymbol{W} = \frac{1}{m-1} \boldsymbol{\Sigma}^2,$$

where this time we use the fact that $\boldsymbol{W}^\top \boldsymbol{W} = \boldsymbol{I}$, again from the definition of the SVD.
The above analysis shows that when we project the data $\boldsymbol{x}$ to $\boldsymbol{z}$, via the linear transformation $\boldsymbol{W}$, the resulting representation has a diagonal covariance matrix (as given by $\boldsymbol{\Sigma}^2$), which immediately implies that the individual elements of $\boldsymbol{z}$ are mutually uncorrelated.

This ability of PCA to transform data into a representation where the elements are mutually uncorrelated is a very important property of PCA. It is a simple example of a representation that attempts to disentangle the unknown factors of variation underlying the data. In the case of PCA, this disentangling takes the form of finding a rotation of the input space (described by $\boldsymbol{W}$) that aligns the principal axes of variance with the basis of the new representation space associated with $\boldsymbol{z}$.
While correlation is an important category of dependency between elements of the data, we are also interested in learning representations that disentangle more complicated forms of feature dependencies. For this, we will need more than what can be done with a simple linear transformation.

5.8.2 $k$-means Clustering

Another example of a simple representation learning algorithm is $k$-means clustering. The $k$-means clustering algorithm divides the training set into $k$ different clusters of examples that are near each other. We can thus think of the algorithm as providing a $k$-dimensional one-hot code vector $\boldsymbol{h}$ representing an input $\boldsymbol{x}$. If $\boldsymbol{x}$ belongs to cluster $i$, then $h_i = 1$, and all other entries of the representation $\boldsymbol{h}$ are zero.
The one-hot code provided by $k$-means clustering is an example of a sparse representation, because the majority of its entries are zero for every input. Later, we develop other algorithms that learn more flexible sparse representations, where more than one entry can be nonzero for each input $\boldsymbol{x}$. One-hot codes are an extreme example of sparse representations that lose many of the benefits of a distributed representation. The one-hot code still confers some statistical advantages (it naturally conveys the idea that all examples in the same cluster are similar to each other), and it confers the computational advantage that the entire representation may be captured by a single integer.
The $k$-means algorithm works by initializing $k$ different centroids $\{ \boldsymbol{\mu}^{(1)}, \ldots, \boldsymbol{\mu}^{(k)} \}$ to different values, then alternating between two different steps until convergence. In one step, each training example is assigned to cluster $i$, where $i$ is the index of the nearest centroid $\boldsymbol{\mu}^{(i)}$. In the other step, each centroid $\boldsymbol{\mu}^{(i)}$ is updated to the mean of all training examples $\boldsymbol{x}^{(j)}$ assigned to cluster $i$.
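A minimal sketch of the two alternating steps follows (hypothetical NumPy code; the initialization and fixed iteration count are deliberately simplistic, and real implementations guard against local optima and empty clusters more carefully).

import numpy as np

def k_means(X, k, n_iters=100):
    # Initialize centroids to k distinct training examples
    # (here one per true cluster, to keep the demo deterministic).
    mu = X[[0, 100, 200]].copy()
    for _ in range(n_iters):
        # Step 1: assign each example to the nearest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned examples.
        for i in range(k):
            if np.any(assign == i):
                mu[i] = X[assign == i].mean(axis=0)
    return mu, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (-5.0, 0.0, 5.0)])
mu, assign = k_means(X, k=3)
print(np.sort(mu[:, 0]).round(1))   # approximately [-5., 0., 5.]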
One difficulty pertaining to clustering is that the clustering problem is inherently ill posed, in the sense that there is no single criterion that measures how well a clustering of the data corresponds to the real world. We can measure properties of the clustering, such as the average Euclidean distance from a cluster centroid to the members of the cluster. This enables us to tell how well we are able to reconstruct the training data from the cluster assignments. We do not know how well the cluster assignments correspond to properties of the real world. Moreover, there may be many different clusterings that all correspond well to some property of the real world. We may hope to find a clustering that relates to one feature but obtain a different, equally valid clustering that is not relevant to our task. For example, suppose that we run two clustering algorithms on a dataset consisting of images of red trucks, images of red cars, images of gray trucks, and images of gray cars. If we ask each clustering algorithm to find two clusters, one algorithm may find a cluster of cars and a cluster of trucks, while another may find a cluster of red vehicles and a cluster of gray vehicles. Suppose we also run a third clustering algorithm, which is allowed to determine the number of clusters. This may assign the examples to four clusters, red cars, red trucks, gray cars, and gray trucks. This new clustering now at least captures information about both attributes, but it has lost information about similarity. Red cars are in a different cluster from gray cars, just as they are in a different cluster from gray trucks. The output of the clustering algorithm does not tell us that red cars are more similar to gray cars than they are to gray trucks. They are different from both things, and that is all we know.
These issues illustrate some of the reasons that we may prefer a distributed representation to a one-hot representation. A distributed representation could have two attributes for each vehicle - one representing its color and one representing whether it is a car or a truck. It is still not entirely clear what the optimal distributed representation is (how can the learning algorithm know whether the two attributes we are interested in are color and car-versus-truck rather than manufacturer and age?), but having many attributes reduces the burden on the algorithm to guess which single attribute we care about and gives us the ability to measure similarity between objects in a fine-grained way by comparing many attributes instead of just testing whether one attribute matches.

5.9 Stochastic Gradient Descent

Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent (SGD). Stochastic gradient descent is an extension of the gradient descent algorithm introduced in section 4.3.
A recurring problem in machine learning is that large training sets are necessary for good generalization, but large training sets are also more computationally expensive.
The cost function used by a machine learning algorithm often decomposes as a sum over training examples of some per-example loss function. For example, the negative conditional log-likelihood of the training data can be written as

$$J(\boldsymbol{\theta}) = \mathbb{E}_{\boldsymbol{x}, y \sim \hat{p}_{\text{data}}} L(\boldsymbol{x}, y, \boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} L\left( \boldsymbol{x}^{(i)}, y^{(i)}, \boldsymbol{\theta} \right), \tag{5.96}$$

where $L$ is the per-example loss $L(\boldsymbol{x}, y, \boldsymbol{\theta}) = -\log p(y \mid \boldsymbol{x}; \boldsymbol{\theta})$.
For these additive cost functions, gradient descent requires computing

$$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \nabla_{\boldsymbol{\theta}} L\left( \boldsymbol{x}^{(i)}, y^{(i)}, \boldsymbol{\theta} \right). \tag{5.97}$$

The computational cost of this operation is $O(m)$. As the training set size grows to billions of examples, the time to take a single gradient step becomes prohibitively long.
The insight of SGD is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples. Specifically, on each step of the algorithm, we can sample a minibatch of examples $\mathbb{B} = \left\{ \boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m')} \right\}$ drawn uniformly from the training set. The minibatch size $m'$ is typically chosen to be a relatively small number of examples, ranging from one to a few hundred. Crucially, $m'$ is usually held fixed as the training set size $m$ grows. We may fit a training set with billions of examples using updates computed on only a hundred examples.
The estimate of the gradient is formed as

$$\boldsymbol{g} = \frac{1}{m'} \nabla_{\boldsymbol{\theta}} \sum_{i=1}^{m'} L\left( \boldsymbol{x}^{(i)}, y^{(i)}, \boldsymbol{\theta} \right), \tag{5.98}$$

using examples from the minibatch $\mathbb{B}$. The stochastic gradient descent algorithm then follows the estimated gradient downhill:

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \epsilon \boldsymbol{g}, \tag{5.99}$$

where $\epsilon$ is the learning rate.
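Putting equations 5.98 and 5.99 together, the core of SGD fits in a few lines. The sketch below (hypothetical NumPy code, applied to linear regression with squared error for concreteness; the learning rate, minibatch size, and step count are made up) is illustrative rather than a production optimizer.

import numpy as np

rng = np.random.default_rng(0)
m = 100_000
X = rng.normal(size=(m, 5))
y = X @ np.array([3.0, -1.0, 0.0, 2.0, 0.5]) + rng.normal(scale=0.1, size=m)

theta = np.zeros(5)
eps, m_prime = 0.01, 64                           # learning rate and minibatch size

for _ in range(2000):
    idx = rng.integers(0, m, size=m_prime)        # sample a minibatch uniformly
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ theta - yb) / m_prime        # minibatch gradient of (1/2) mean squared error
    theta -= eps * g                              # follow the estimated gradient downhill

print(theta.round(2))   # close to the true weights [3., -1., 0., 2., 0.5]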
Gradient descent in general has often been regarded as slow or unreliable. In the past, the application of gradient descent to nonconvex optimization problems was regarded as foolhardy or unprincipled. Today, we know that the machine learning models described in part II work very well when trained with gradient descent. The optimization algorithm may not be guaranteed to arrive at even a local minimum in a reasonable amount of time, but it often finds a very low value of the cost function quickly enough to be useful.
Stochastic gradient descent has many important uses outside the context of deep learning. It is the main way to train large linear models on very large datasets. For a fixed model size, the cost per SGD update does not depend on the training set size $m$. In practice, we often use a larger model as the training set size increases, but we are not forced to do so. The number of updates required to reach convergence usually increases with training set size. However, as $m$ approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set. Increasing $m$ further will not extend the amount of training time needed to reach the model's best possible test error. From this point of view, one can argue that the asymptotic cost of training a model with SGD is $O(1)$ as a function of $m$.
Prior to the advent of deep learning, the main way to learn nonlinear models was to use the kernel trick in combination with a linear model. Many kernel learning algorithms require constructing an $m \times m$ matrix $G_{i,j} = k\left( \boldsymbol{x}^{(i)}, \boldsymbol{x}^{(j)} \right)$. Constructing this matrix has computational cost $O(m^2)$, which is clearly undesirable for datasets with billions of examples. In academia, starting in 2006, deep learning was initially interesting because it was able to generalize to new examples better than competing algorithms when trained on medium-sized datasets with tens of thousands of examples. Soon after, deep learning garnered additional interest in industry because it provided a scalable way of training nonlinear models on large datasets.
Stochastic gradient descent and many enhancements to it are described further in chapter 8.

5.10 Building a Machine Learning Algorithm

Nearly all deep learning algorithms can be described as particular instances of a fairly simple recipe: combine a specification of a dataset, a cost function, an optimization procedure and a model.
For example, the linear regression algorithm combines a dataset consisting of $\boldsymbol{X}$ and $\boldsymbol{y}$, the cost function

$$J(\boldsymbol{w}, b) = -\mathbb{E}_{\boldsymbol{x}, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid \boldsymbol{x}), \tag{5.100}$$

the model specification $p_{\text{model}}(y \mid \boldsymbol{x}) = \mathcal{N}\left( y; \boldsymbol{x}^\top \boldsymbol{w} + b, 1 \right)$, and, in most cases, the optimization algorithm defined by solving for where the gradient of the cost is zero using the normal equations.
By realizing that we can replace any of these components mostly independently from the others, we can obtain a wide range of algorithms.
The cost function typically includes at least one term that causes the learning process to perform statistical estimation. The most common cost function is the negative log-likelihood, so that minimizing the cost function causes maximum likelihood estimation.
The cost function may also include additional terms, such as regularization terms. For example, we can add weight decay to the linear regression cost function to obtain

$$J(\boldsymbol{w}, b) = \lambda \| \boldsymbol{w} \|_2^2 - \mathbb{E}_{\boldsymbol{x}, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid \boldsymbol{x}). \tag{5.101}$$

This still allows closed form optimization.
If we change the model to be nonlinear, then most cost functions can no longer be optimized in closed form. This requires us to choose an iterative numerical optimization procedure, such as gradient descent.
The recipe for constructing a learning algorithm by combining models, costs, and optimization algorithms supports both supervised and unsupervised learning. The linear regression example shows how to support supervised learning. Unsupervised learning can be supported by defining a dataset that contains only $\boldsymbol{X}$ and providing an appropriate unsupervised cost and model. For example, we can obtain the first PCA vector by specifying that our loss function is

$$J(\boldsymbol{w}) = \mathbb{E}_{\boldsymbol{x} \sim \hat{p}_{\text{data}}} \| \boldsymbol{x} - r(\boldsymbol{x}; \boldsymbol{w}) \|_2^2, \tag{5.102}$$

while our model is defined to have $\boldsymbol{w}$ with norm one and reconstruction function $r(\boldsymbol{x}) = \boldsymbol{w}^\top \boldsymbol{x} \boldsymbol{w}$.
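The following sketch (hypothetical NumPy code; the data, step size, and iteration count are arbitrary) checks this claim empirically: minimizing the reconstruction error of equation 5.102 over unit-norm $\boldsymbol{w}$, here by projected gradient descent, recovers the first principal component up to sign.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([3.0, 1.0, 0.3])   # anisotropic data
X -= X.mean(axis=0)

# Reference answer: the first right singular vector of X.
w_svd = np.linalg.svd(X, full_matrices=False)[2][0]

# On the unit sphere, E||x - (w^T x) w||^2 = E||x||^2 - E[(w^T x)^2],
# so we do projected gradient descent on the w-dependent part.
w = rng.normal(size=3)
w /= np.linalg.norm(w)
for _ in range(500):
    grad = -2 * X.T @ (X @ w) / len(X)   # gradient of -E[(w^T x)^2]
    w -= 0.05 * grad
    w /= np.linalg.norm(w)               # project back onto ||w|| = 1

print(abs(w @ w_svd).round(4))           # ~1.0: the same direction up to sign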
In some cases, the cost function may be a function that we cannot actually evaluate, for computational reasons. In these cases, we can still approximately minimize it using iterative numerical optimization, as long as we have some way of approximating its gradients.
Most machine learning algorithms make use of this recipe, though it may not be immediately obvious. If a machine learning algorithm seems especially unique or hand designed, it can usually be understood as using a special-case optimizer. Some models, such as decision trees and $k$-means, require special-case optimizers because their cost functions have flat regions that make them inappropriate for minimization by gradient-based optimizers. Recognizing that most machine learning algorithms can be described using this recipe helps to see the different algorithms as part of a taxonomy of methods for doing related tasks that work for similar reasons, rather than as a long list of algorithms that each have separate justifications.

5.11 Challenges Motivating Deep Learning

The simple machine learning algorithms described in this chapter work well on a wide variety of important problems. They have not succeeded, however, in solving the central problems in AI, such as recognizing speech or recognizing objects.
The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.
This section is about how the challenge of generalizing to new examples becomes exponentially more difficult when working with high-dimensional data, and how the mechanisms used to achieve generalization in traditional machine learning are insufficient to learn complicated functions in high-dimensional spaces. Such spaces also often impose high computational costs. Deep learning was designed to overcome these and other obstacles.

5.11.1 The Curse of Dimensionality

Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This phenomenon is known as the curse of dimensionality. Of particular concern is that the number of possible distinct configurations of a set of variables increases exponentially as the number of variables increases.
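For instance (a one-line illustration of the counting argument, with made-up numbers), with $v = 10$ distinguishable values per variable the number of joint configurations $v^d$ explodes as the number of variables $d$ grows:

# With v = 10 values per variable, the number of configurations grows as v**d.
v = 10
for d in (1, 2, 3, 10):
    print(d, v ** d)   # 10, 100, 1000, 10000000000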
The curse of dimensionality arises in many places in computer science, especially in machine learning.
Figure 5.9: As the number of relevant dimensions of the data increases (from left to right), the number of configurations of interest may grow exponentially. (Left)In this one-dimensional example, we have one variable for which we only care to distinguish 10 regions of interest. With enough examples falling within each of these regions (each region corresponds to a cell in the illustration), learning algorithms can easily generalize correctly. A straightforward way to generalize is to estimate the value of the target function within each region (and possibly interpolate between neighboring regions). (Center)With two dimensions, it is more difficult to distinguish 10 different values of each variable. We need to keep track of up to $10 \times 10 = 100$ regions, and we need at least that many examples to cover all those regions. (Right)With three dimensions, this grows to $10^3 = 1{,}000$ regions and at least that many examples. For $d$ dimensions and $v$ values to be distinguished along each axis, we seem to need $O(v^d)$ regions and examples. This is an instance of the curse of dimensionality. Figure graciously provided by Nicolas Chapados.
One challenge posed by the curse of dimensionality is a statistical challenge. As illustrated in figure 5.9, a statistical challenge arises because the number of possible configurations of $\boldsymbol{x}$ is much larger than the number of training examples. To understand the issue, let us consider that the input space is organized into a grid, as in the figure. We can describe low-dimensional space with a small number of grid cells that are mostly occupied by the data. When generalizing to a new data point, we can usually tell what to do simply by inspecting the training examples that lie in the same cell as the new input. For example, if estimating the probability density at some point $\boldsymbol{x}$, we can just return the number of training examples in the same unit volume cell as $\boldsymbol{x}$, divided by the total number of training examples. If we wish to classify an example, we can return the most common class of training examples in the same cell. If we are doing regression, we can average the target values observed over the examples in that cell. But what about the cells for which we have seen no example? Because in high-dimensional spaces the number of configurations is huge, much larger than our number of examples, a typical grid cell has no training example associated with it. How could we possibly say something meaningful about these new configurations? Many traditional machine learning algorithms simply assume that the output at a new point should be approximately the same as the output at the nearest training point.

5.11.2 Local Constancy and Smoothness Regularization

To generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. We have seen these priors incorporated as explicit beliefs in the form of probability distributions over parameters of the model. More informally, we may also discuss prior beliefs as directly influencing the function itself and influencing the parameters only indirectly, as a result of the relationship between the parameters and the function. Additionally, we informally discuss prior beliefs as being expressed implicitly by choosing algorithms that are biased toward choosing some class of functions over another, even though these biases may not be expressed (or even be possible to express) in terms of a probability distribution representing our degree of belief in various functions.
Among the most widely used of these implicit "priors" is the smoothness prior, or local constancy prior. This prior states that the function we learn should not change very much within a small region.
Many simpler algorithms rely exclusively on this prior to generalize well, and as a result, they fail to scale to the statistical challenges involved in solving AI-level tasks. Throughout this book, we describe how deep learning introduces additional (explicit and implicit) priors in order to reduce the generalization error on sophisticated tasks. Here, we explain why the smoothness prior alone is insufficient for these tasks.
There are many different ways to implicitly or explicitly express a prior belief that the learned function should be smooth or locally constant. All these different methods are designed to encourage the learning process to learn a function $f^*$ that satisfies the condition

$$f^*(\boldsymbol{x}) \approx f^*(\boldsymbol{x} + \boldsymbol{\epsilon}) \tag{5.103}$$

for most configurations $\boldsymbol{x}$ and small change $\boldsymbol{\epsilon}$. In other words, if we know a good answer for an input $\boldsymbol{x}$ (for example, if $\boldsymbol{x}$ is a labeled training example), then that answer is probably good in the neighborhood of $\boldsymbol{x}$. If we have several good answers in some neighborhood, we would combine them (by some form of averaging or interpolation) to produce an answer that agrees with as many of them as much as possible.
An extreme example of the local constancy approach is the $k$-nearest neighbors family of learning algorithms. These predictors are literally constant over each region containing all the points $\boldsymbol{x}$ that have the same set of $k$ nearest neighbors in the training set. For $k = 1$, the number of distinguishable regions cannot be more than the number of training examples.
While the $k$-nearest neighbors algorithm copies the output from nearby training examples, most kernel machines interpolate between training set outputs associated with nearby training examples. An important class of kernels is the family of local kernels, where $k(\boldsymbol{u}, \boldsymbol{v})$ is large when $\boldsymbol{u} = \boldsymbol{v}$ and decreases as $\boldsymbol{u}$ and $\boldsymbol{v}$ grow further apart from each other. A local kernel can be thought of as a similarity function that performs template matching, by measuring how closely a test example $\boldsymbol{x}$ resembles each training example $\boldsymbol{x}^{(i)}$. Much of the modern motivation for deep learning is derived from studying the limitations of local template matching and how deep models are able to succeed in cases where local template matching fails (Bengio et al., 2006b).
Decision trees also suffer from the limitations of exclusively smoothness-based learning, because they break the input space into as many regions as there are leaves and use a separate parameter (or sometimes many parameters for extensions of decision trees) in each region. If the target function requires a tree with at least $n$ leaves to be represented accurately, then at least $n$ training examples are required to fit the tree. A multiple of $n$ is needed to achieve some level of statistical confidence in the predicted output.
In general, to distinguish $O(k)$ regions in input space, all these methods require $O(k)$ examples. Typically there are $O(k)$ parameters, with $O(1)$ parameters associated with each of the $O(k)$ regions. The nearest neighbor scenario, in which each training example can be used to define at most one region, is illustrated in figure 5.10.
Is there a way to represent a complex function that has many more regions to be distinguished than the number of training examples? Clearly, assuming only smoothness of the underlying function will not allow a learner to do that. For example, imagine that the target function is a kind of checkerboard. A checkerboard contains many variations, but there is a simple structure to them. Imagine what happens when the number of training examples is substantially smaller than the number of black and white squares on the checkerboard. Based on only local generalization and the smoothness or local constancy prior, the learner would be guaranteed to correctly guess the color of a new point if it lay within the same checkerboard square as a training example. There is no guarantee, however, that the learner could correctly extend the checkerboard pattern to points lying in squares that do not contain training examples. With this prior alone, the only information that an example tells us is the color of its square, and the only way to get the colors of the entire checkerboard right is to cover each of its cells with at least one example.
The smoothness assumption and the associated nonparametric learning algorithms work extremely well as long as there are enough examples for the learning algorithm to observe high points on most peaks and low points on most valleys of the true underlying function to be learned. This is generally true when the function to be learned is smooth enough and varies in few enough dimensions. In high dimensions, even a very smooth function can change smoothly but in a different way along each dimension. If the function additionally behaves differently in various regions, it can become extremely complicated to describe with a set of training examples. If the function is complicated (we want to distinguish a huge number of regions compared to the number of examples), is there any hope to generalize well?
The answer to both of these questions - whether it is possible to represent a complicated function efficiently, and whether it is possible for the estimated function to generalize well to new inputs - is yes. The key insight is that a very large number of regions, such as $O(2^k)$, can be defined with $O(k)$ examples, so long as we introduce some dependencies between the regions through additional assumptions about the underlying data-generating distribution. In this way, we can actually generalize nonlocally (Bengio and Monperrus, 2005; Bengio et al., 2006c). Many different deep learning algorithms provide implicit or explicit assumptions that are reasonable for a broad range of AI tasks in order to capture these advantages.
Figure 5.10: Illustration of how the nearest neighbor algorithm breaks up the input space into regions. An example (represented here by a circle) within each region defines the region boundary (represented here by the lines). The $y$ value associated with each example defines what the output should be for all points within the corresponding region. The regions defined by nearest neighbor matching form a geometric pattern called a Voronoi diagram. The number of these contiguous regions cannot grow faster than the number of training examples. While this figure illustrates the behavior of the nearest neighbor algorithm specifically, other machine learning algorithms that rely exclusively on the local smoothness prior for generalization exhibit similar behaviors: each training example only informs the learner about how to generalize in some neighborhood immediately surrounding that example.

Other approaches to machine learning often make stronger, task-specific assumptions. For example, we could easily solve the checkerboard task by providing the assumption that the target function is periodic. Usually we do not include such strong, task-specific assumptions in neural networks so that they can generalize to a much wider variety of structures. AI tasks have structure that is much too complex to be limited to simple, manually specified properties such as periodicity, so we want learning algorithms that embody more general-purpose assumptions. The core idea in deep learning is that we assume that the data was generated by the composition of factors, or features, potentially at multiple levels in a hierarchy. Many other similarly generic assumptions can further improve deep learning algorithms. These apparently mild assumptions allow an exponential gain in the relationship between the number of examples and the number of regions that can be distinguished. We describe these exponential gains more precisely in sections 6.4.1, 15.4 and 15.5. The exponential advantages conferred by the use of deep distributed representations counter the exponential challenges posed by the curse of dimensionality.

5.11.3 Manifold Learning

An important concept underlying many ideas in machine learning is that of a manifold.
A manifold is a connected region. Mathematically, it is a set of points associated with a neighborhood around each point. From any given point, the manifold locally appears to be a Euclidean space. In everyday life, we experience the surface of the world as a 2-D plane, but it is in fact a spherical manifold in 3-D space.
The concept of a neighborhood surrounding each point implies the existence of transformations that can be applied to move on the manifold from one position to a neighboring one. In the example of the world's surface as a manifold, one can walk north, south, east, or west.
Although there is a formal mathematical meaning to the term "manifold," in machine learning it tends to be used more loosely to designate a connected set of points that can be approximated well by considering only a small number of degrees of freedom, or dimensions, embedded in a higher-dimensional space. Each dimension corresponds to a local direction of variation. See figure 5.11 for an example of training data lying near a one-dimensional manifold embedded in two-dimensional space. In the context of machine learning, we allow the dimensionality of the manifold to vary from one point to another. This often happens when a manifold intersects itself. For example, a figure eight is a manifold that has a single dimension in most places but two dimensions at the intersection at the center.

Figure 5.11: Data sampled from a distribution in a two-dimensional space that is actually concentrated near a one-dimensional manifold, like a twisted string. The solid line indicates the underlying manifold that the learner should infer.
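Data like that shown in figure 5.11 is straightforward to simulate. The sketch below (an illustrative construction, not the actual data behind the figure) draws points concentrated near a one-dimensional spiral embedded in two-dimensional space, so the ambient dimension is 2 while the intrinsic dimension is 1:

```python
# Data concentrated near a 1-D manifold (a spiral) embedded in 2-D space:
# a single degree of freedom t traces out the "twisted string".
import numpy as np

rng = np.random.default_rng(0)
n = 500
t = rng.uniform(0.0, 4.0 * np.pi, size=n)          # the one degree of freedom
manifold = np.column_stack([t * np.cos(t),         # a spiral in the plane
                            t * np.sin(t)])
X = manifold + rng.normal(scale=0.1, size=(n, 2))  # samples hug the curve

print(X.shape)  # (500, 2): ambient dimension 2, intrinsic dimension 1
```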
Many machine learning problems seem hopeless if we expect the machine learning algorithm to learn functions with interesting variations across all of $\mathbb{R}^n$. Manifold learning algorithms surmount this obstacle by assuming that most of $\mathbb{R}^n$ consists of invalid inputs, and that interesting inputs occur only along a collection of manifolds containing a small subset of points, with interesting variations in the output of the learned function occurring only along directions that lie on the manifold, or with interesting variations happening only when we move from one manifold to another. Manifold learning was introduced in the case of continuous-valued data and in the unsupervised learning setting, although this probability concentration idea can be generalized to both discrete data and the supervised learning setting: the key assumption remains that probability mass is highly concentrated.
The assumption that the data lies along a low-dimensional manifold may not always be correct or useful. We argue that in the context of AI tasks, such as those that involve processing images, sounds, or text, the manifold assumption is at least approximately correct. The evidence in favor of this assumption consists of two categories of observations.

Figure 5.12: Sampling images uniformly at random (by randomly picking each pixel according to a uniform distribution) gives rise to noisy images. Although there is a nonzero probability of generating an image of a face or of any other object frequently encountered in AI applications, we never actually observe this happening in practice. This suggests that the images encountered in AI applications occupy a negligible proportion of the volume of image space.
The first observation in favor of the manifold hypothesis is that the probability distribution over images, text strings, and sounds that occur in real life is highly concentrated. Uniform noise essentially never resembles structured inputs from these domains. Figure 5.12 shows how, instead, uniformly sampled points look like the patterns of static that appear on analog television sets when no signal is available. Similarly, if you generate a document by picking letters uniformly at random, what is the probability that you will get a meaningful English-language text? Almost zero, again, because most of the long sequences of letters do not correspond to a natural language sequence: the distribution of natural language sequences occupies a very small volume in the total space of sequences of letters.
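Both thought experiments are easy to reproduce on a small scale, as in the following sketch (array sizes and the alphabet are illustrative choices):

```python
# Uniform sampling almost never produces structured inputs: random pixels
# look like TV static, and random letters almost never form natural language.
import numpy as np

rng = np.random.default_rng(0)

# An "image" whose pixels are independent and uniform over [0, 255]:
noise_image = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
print(noise_image[:2, :8])  # rows of unstructured values, i.e., static

# A "document" of letters drawn uniformly at random:
alphabet = np.array(list("abcdefghijklmnopqrstuvwxyz "))
random_text = "".join(rng.choice(alphabet, size=60))
print(random_text)  # gibberish with probability essentially one
```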
Of course, concentrated probability distributions are not sufficient to show that the data lies on a reasonably small number of manifolds. We must also establish that the examples we encounter are connected to each other by other examples, with each example surrounded by other highly similar examples that can be reached by applying transformations to traverse the manifold. The second argument in favor of the manifold hypothesis is that we can imagine such neighborhoods and transformations, at least informally. In the case of images, we can certainly think of many possible transformations that allow us to trace out a manifold in image space: we can gradually dim or brighten the lights, gradually move or rotate objects in the image, gradually alter the colors on the surfaces of objects, and so forth. Multiple manifolds are likely involved in most applications. For example, the manifold of human face images may not be connected to the manifold of cat face images.
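As a small concrete example of such a transformation (a sketch with a synthetic stand-in image, not code from the book), rescaling pixel intensities moves an image along a brightness direction of the manifold:

```python
# One transformation that moves along a manifold in image space: gradually
# brightening an image. Each step yields another plausible image nearby.
import numpy as np

def brighten(image, factor):
    """Scale intensities while staying in the valid [0, 1] range."""
    return np.clip(image * factor, 0.0, 1.0)

image = np.full((8, 8), 0.4)  # synthetic stand-in for a natural image
trajectory = [brighten(image, f) for f in np.linspace(0.6, 1.4, num=5)]
print([round(im.mean(), 2) for im in trajectory])  # brightness rises smoothly
```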
These thought experiments convey some intuitive reasons supporting the manifold hypothesis. More rigorous experiments (Cayton, 2005; Narayanan and Mitter, 2010; Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al., 2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger and Saul, 2004) clearly support the hypothesis for a large class of datasets of interest in AI.
When the data lies on a low-dimensional manifold, it can be most natural for machine learning algorithms to represent the data in terms of coordinates on the manifold, rather than in terms of coordinates in $\mathbb{R}^n$. In everyday life, we can think of roads as 1-D manifolds embedded in 3-D space. We give directions to specific addresses in terms of address numbers along these 1-D roads, not in terms of coordinates in 3-D space. Extracting these manifold coordinates is challenging but holds the promise of improving many machine learning algorithms. This general principle is applied in many contexts. Figure 5.13 shows the manifold structure of a dataset consisting of faces. By the end of this book, we will have developed the methods necessary to learn such a manifold structure. In figure 20.6, we will see how a machine learning algorithm can successfully accomplish this goal.
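As one concrete, shallow example of extracting manifold coordinates, the sketch below applies scikit-learn's Isomap, a classical manifold learning algorithm, to recover a single coordinate along a noisy spiral. The data construction and hyperparameters are illustrative choices; the deep learning methods developed later in the book pursue the same goal with far more expressive models.

```python
# Recovering a 1-D manifold coordinate from 2-D ambient coordinates with
# Isomap (a classical manifold learning algorithm; illustrative settings).
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 4.0 * np.pi, size=300))   # true degree of freedom
X = np.column_stack([t * np.cos(t), t * np.sin(t)])    # spiral in 2-D space
X = X + rng.normal(scale=0.05, size=X.shape)           # small off-manifold noise

coords = Isomap(n_neighbors=10, n_components=1).fit_transform(X)

# The learned coordinate should track t closely (up to sign and scale):
print(abs(np.corrcoef(coords[:, 0], t)[0, 1]))         # close to 1.0
```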
This concludes part I, which has provided the basic concepts in mathematics and machine learning that are employed throughout the remaining parts of the book. You are now prepared to embark on your study of deep learning.
Figure 5.13: Training examples from the QMUL Multiview Face Dataset (Gong et al., 2000), for which the subjects were asked to move in such a way as to cover the two-dimensional manifold corresponding to two angles of rotation. We would like learning algorithms to be able to discover and disentangle such manifold coordinates. Figure 20.6 illustrates such a feat.

  1. Unless there is a reason to use a particular covariance structure, we typically assume a diagonal covariance matrix $\boldsymbol{\Sigma}$.