
Chapter 4

Numerical Computation

Machine learning algorithms usually require a high amount of numerical computation. This typically refers to algorithms that solve mathematical problems by methods that update estimates of the solution via an iterative process, rather than analytically deriving a formula to provide a symbolic expression for the correct solution. Common operations include optimization (finding the value of an argument that minimizes or maximizes a function) and solving systems of linear equations. Even just evaluating a mathematical function on a digital computer can be difficult when the function involves real numbers, which cannot be represented precisely using a finite amount of memory.

4.1 Overflow and Underflow

The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns. This means that for almost all real numbers, we incur some approximation error when we represent the number in the computer. In many cases, this is just rounding error. Rounding error is problematic, especially when it compounds across many operations, and can cause algorithms that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error.
One form of rounding error that is particularly devastating is underflow. Underflow occurs when numbers near zero are rounded to zero. Many functions behave qualitatively differently when their argument is zero rather than a small positive number. For example, we usually want to avoid division by zero (some software environments will raise exceptions when this occurs, others will return a result with a placeholder not-a-number value) or taking the logarithm of zero (this is usually treated as $-\infty$, which then becomes not-a-number if it is used for many further arithmetic operations).
Another highly damaging form of numerical error is overflow. Overflow occurs when numbers with large magnitude are approximated as $\infty$ or $-\infty$. Further arithmetic will usually change these infinite values into not-a-number values.
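Both failure modes are easy to reproduce with IEEE double-precision floats. A small illustrative Python sketch (not part of the original text):

```python
import math

# Underflow: exp of a very negative number rounds to exactly zero.
tiny = math.exp(-1000.0)
print(tiny)        # 0.0

# Overflow: arithmetic that exceeds the largest representable double
# saturates at infinity.
big = 1e308 * 10.0
print(big)         # inf

# Further arithmetic on infinite values usually yields not-a-number.
print(big - big)   # nan
```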
One example of a function that must be stabilized against underflow and overflow is the softmax function. The softmax function is often used to predict the probabilities associated with a multinoulli distribution. The softmax function is defined to be
$$\mathrm{softmax}(\boldsymbol{x})_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}.$$
Consider what happens when all the $x_i$ are equal to some constant $c$. Analytically, we can see that all the outputs should be equal to $\frac{1}{n}$. Numerically, this may not occur when $c$ has large magnitude. If $c$ is very negative, then $\exp(c)$ will underflow. This means the denominator of the softmax will become 0, so the final result is undefined. When $c$ is very large and positive, $\exp(c)$ will overflow, again resulting in the expression as a whole being undefined. Both of these difficulties can be resolved by instead evaluating $\mathrm{softmax}(\boldsymbol{z})$, where $\boldsymbol{z} = \boldsymbol{x} - \max_i x_i$. Simple algebra shows that the value of the softmax function is not changed analytically by adding or subtracting a scalar from the input vector. Subtracting $\max_i x_i$ results in the largest argument to $\exp$ being 0, which rules out the possibility of overflow. Likewise, at least one term in the denominator has a value of 1, which rules out the possibility of underflow in the denominator leading to a division by zero.
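The shift-by-the-maximum trick can be sketched in a few lines of plain Python (an illustrative implementation, not code from the book):

```python
import math

def softmax(x):
    """Numerically stable softmax: shift by max(x) before exponentiating."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]   # largest argument to exp is 0
    total = sum(exps)                       # at least one term equals 1
    return [e / total for e in exps]

# A naive softmax would overflow here; the stabilized version
# returns the analytic answer 1/n for equal inputs.
print(softmax([1000.0, 1000.0, 1000.0]))
```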
There is still one small problem. Underflow in the numerator can still cause the expression as a whole to evaluate to zero. This means that if we implement $\log \mathrm{softmax}(\boldsymbol{x})$ by first running the softmax subroutine then passing the result to the $\log$ function, we could erroneously obtain $-\infty$. Instead, we must implement a separate function that calculates $\log \mathrm{softmax}$ in a numerically stable way. The log softmax function can be stabilized using the same trick as we used to stabilize the softmax function.
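A stable log softmax applies the same shift but computes $z_i - \log \sum_j \exp(z_j)$ directly, never passing an underflowed zero through the logarithm (again an illustrative sketch, not the book's code):

```python
import math

def log_softmax(x):
    """Stable log softmax: never takes the log of an underflowed zero."""
    m = max(x)
    z = [xi - m for xi in x]
    log_denom = math.log(sum(math.exp(zi) for zi in z))
    return [zi - log_denom for zi in z]

# Computing log(softmax([-1000, 0])[0]) naively gives log(0) = -inf;
# the stable version returns a finite value close to -1000.
print(log_softmax([-1000.0, 0.0])[0])
```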
For the most part, we do not explicitly detail all the numerical considerations involved in implementing the various algorithms described in this book. Developers of low-level libraries should keep numerical issues in mind when implementing deep learning algorithms. Most readers of this book can simply rely on low-level libraries that provide stable implementations. In some cases, it is possible to implement a new algorithm and have the new implementation automatically stabilized. Theano (Bergstra et al., 2010; Bastien et al., 2012) is an example of a software package that automatically detects and stabilizes many common numerically unstable expressions that arise in the context of deep learning.

4.2 Poor Conditioning

Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output.
Consider the function $f(\boldsymbol{x}) = \boldsymbol{A}^{-1}\boldsymbol{x}$. When $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ has an eigenvalue decomposition, its condition number is
$$\max_{i,j} \left| \frac{\lambda_i}{\lambda_j} \right|.$$
This is the ratio of the magnitudes of the largest and smallest eigenvalues. When this number is large, matrix inversion is particularly sensitive to error in the input.
This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse. In practice, the error will be compounded further by numerical errors in the inversion process itself.
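For a matrix whose eigenvalues are known (e.g., a diagonal matrix, whose eigenvalues are simply its diagonal entries), the condition number is trivial to compute. A small illustrative helper, not from the book:

```python
def condition_number(eigenvalues):
    """max_{i,j} |lambda_i / lambda_j|: ratio of the largest to the
    smallest eigenvalue magnitude."""
    mags = [abs(lam) for lam in eigenvalues]
    return max(mags) / min(mags)

# A = diag(1000, 1) is perfectly invertible but poorly conditioned:
# relative errors in x can be amplified ~1000x by f(x) = A^{-1} x.
print(condition_number([1000.0, 1.0]))   # 1000.0
```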

4.3 Gradient-Based Optimization

Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function $f(\boldsymbol{x})$ by altering $\boldsymbol{x}$. We usually phrase most optimization problems in terms of minimizing $f(\boldsymbol{x})$. Maximization may be accomplished via a minimization algorithm by minimizing $-f(\boldsymbol{x})$.
The function we want to minimize or maximize is called the objective function, or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms.
We often denote the value that minimizes or maximizes a function with a superscript $*$. For example, we might say $\boldsymbol{x}^* = \arg\min f(\boldsymbol{x})$.
Figure 4.1: Gradient descent. An illustration of how the gradient descent algorithm uses the derivatives of a function to follow the function downhill to a minimum.
We assume the reader is already familiar with calculus but provide a brief review of how calculus concepts relate to optimization here.
Suppose we have a function $y = f(x)$, where both $x$ and $y$ are real numbers. The derivative of this function is denoted as $f'(x)$ or as $\frac{dy}{dx}$. The derivative $f'(x)$ gives the slope of $f(x)$ at the point $x$. In other words, it specifies how to scale a small change in the input to obtain the corresponding change in the output: $f(x + \epsilon) \approx f(x) + \epsilon f'(x)$.
The derivative is therefore useful for minimizing a function because it tells us how to change $x$ in order to make a small improvement in $y$. For example, we know that $f(x - \epsilon\,\operatorname{sign}(f'(x)))$ is less than $f(x)$ for small enough $\epsilon$. We can thus reduce $f(x)$ by moving $x$ in small steps with the opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847). See figure 4.1 for an example of this technique.
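In one dimension the technique amounts to repeatedly stepping against the sign (and magnitude) of the derivative. A minimal sketch, with a made-up quadratic objective for illustration:

```python
def gradient_descent_1d(f_prime, x, lr=0.1, steps=100):
    """Move x in small steps opposite the derivative."""
    for _ in range(steps):
        x = x - lr * f_prime(x)
    return x

# f(x) = (x - 3)^2 has derivative f'(x) = 2 (x - 3) and a minimum at x = 3.
x_min = gradient_descent_1d(lambda x: 2.0 * (x - 3.0), x=0.0)
print(x_min)   # converges to approximately 3.0
```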
When $f'(x) = 0$, the derivative provides no information about which direction to move. Points where $f'(x) = 0$ are known as critical points, or stationary points. A local minimum is a point where $f(x)$ is lower than at all neighboring points, so it is no longer possible to decrease $f(x)$ by making infinitesimal steps. A local maximum is a point where $f(x)$ is higher than at all neighboring points,

Figure 4.2: Types of critical points. Examples of the three types of critical points in one dimension. A critical point is a point with zero slope. Such a point can either be a local minimum, which is lower than the neighboring points; a local maximum, which is higher than the neighboring points; or a saddle point, which has neighbors that are both higher and lower than the point itself.
so it is not possible to increase $f(x)$ by making infinitesimal steps. Some critical points are neither maxima nor minima. These are known as saddle points. See figure 4.2 for examples of each type of critical point.
A point that obtains the absolute lowest value of $f(x)$ is a global minimum. It is possible for there to be only one global minimum or multiple global minima of the function. It is also possible for there to be local minima that are not globally optimal. In the context of deep learning, we optimize functions that may have many local minima that are not optimal and many saddle points surrounded by very flat regions. All of this makes optimization difficult, especially when the input to the function is multidimensional. We therefore usually settle for finding a value of $f$ that is very low but not necessarily minimal in any formal sense. See figure 4.3 for an example.
We often minimize functions that have multiple inputs: $f : \mathbb{R}^n \to \mathbb{R}$. For the concept of "minimization" to make sense, there must still be only one (scalar) output.
For functions with multiple inputs, we must make use of the concept of partial derivatives. The partial derivative $\frac{\partial}{\partial x_i} f(\boldsymbol{x})$ measures how $f$ changes as only the variable $x_i$ increases at point $\boldsymbol{x}$. The gradient generalizes the notion of derivative to the case where the derivative is with respect to a vector: the gradient of $f$ is the vector containing all the partial derivatives, denoted $\nabla_{\boldsymbol{x}} f(\boldsymbol{x})$. Element $i$ of the gradient is the partial derivative of $f$ with respect to $x_i$. In multiple dimensions,
Figure 4.3: Approximate minimization. Optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions even though they are not truly minimal, so long as they correspond to significantly low values of the cost function.
critical points are points where every element of the gradient is equal to zero.
The directional derivative in direction $\boldsymbol{u}$ (a unit vector) is the slope of the function $f$ in direction $\boldsymbol{u}$. In other words, the directional derivative is the derivative of the function $f(\boldsymbol{x} + \alpha\boldsymbol{u})$ with respect to $\alpha$, evaluated at $\alpha = 0$. Using the chain rule, we can see that $\frac{\partial}{\partial\alpha} f(\boldsymbol{x} + \alpha\boldsymbol{u})$ evaluates to $\boldsymbol{u}^\top \nabla_{\boldsymbol{x}} f(\boldsymbol{x})$ when $\alpha = 0$.
To minimize $f$, we would like to find the direction in which $f$ decreases the fastest. We can do this using the directional derivative:
$$\min_{\boldsymbol{u},\, \boldsymbol{u}^\top\boldsymbol{u} = 1} \boldsymbol{u}^\top \nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \min_{\boldsymbol{u},\, \boldsymbol{u}^\top\boldsymbol{u} = 1} \|\boldsymbol{u}\|_2 \, \|\nabla_{\boldsymbol{x}} f(\boldsymbol{x})\|_2 \cos\theta,$$
where $\theta$ is the angle between $\boldsymbol{u}$ and the gradient. Substituting in $\|\boldsymbol{u}\|_2 = 1$ and ignoring factors that do not depend on $\boldsymbol{u}$, this simplifies to $\min_{\boldsymbol{u}} \cos\theta$. This is minimized when $\boldsymbol{u}$ points in the opposite direction as the gradient. In other words, the gradient points directly uphill, and the negative gradient points directly downhill. We can decrease $f$ by moving in the direction of the negative gradient. This is known as the method of steepest descent, or gradient descent.
Steepest descent proposes a new point
$$\boldsymbol{x}' = \boldsymbol{x} - \epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x}),$$
where $\epsilon$ is the learning rate, a positive scalar determining the size of the step. We can choose $\epsilon$ in several different ways. A popular approach is to set $\epsilon$ to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate $f(\boldsymbol{x} - \epsilon \nabla_{\boldsymbol{x}} f(\boldsymbol{x}))$ for several values of $\epsilon$ and choose the one that results in the smallest objective function value. This last strategy is called a line search.
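A sketch of steepest descent with a crude line search over a handful of candidate learning rates; the objective and the candidate values are made up for illustration:

```python
def steepest_descent(f, grad, x, lrs=(0.001, 0.01, 0.1), steps=500):
    """At each step, try x' = x - eps * grad(x) for several candidate
    learning rates eps and keep whichever gives the smallest f(x')."""
    for _ in range(steps):
        g = grad(x)
        candidates = [[xi - eps * gi for xi, gi in zip(x, g)] for eps in lrs]
        x = min(candidates, key=f)
    return x

# f(x) = x0^2 + 10 x1^2 with gradient (2 x0, 20 x1); minimum at the origin.
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad = lambda x: [2.0 * x[0], 20.0 * x[1]]
x_min = steepest_descent(f, grad, [5.0, 5.0])
print(x_min)   # approaches [0.0, 0.0]
```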
Steepest descent converges when every element of the gradient is zero (or, in practice, very close to zero). In some cases, we may be able to avoid running this iterative algorithm and just jump directly to the critical point by solving the equation $\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = 0$ for $\boldsymbol{x}$.
Although gradient descent is limited to optimization in continuous spaces, the general concept of repeatedly making a small move (that is approximately the best small move) toward better configurations can be generalized to discrete spaces. Ascending an objective function of discrete parameters is called hill climbing (Russel and Norvig, 2003).

4.3.1 Beyond the Gradient: Jacobian and Hessian Matrices

Sometimes we need to find all the partial derivatives of a function whose input and output are both vectors. The matrix containing all such partial derivatives is known as a Jacobian matrix. Specifically, if we have a function $\boldsymbol{f} : \mathbb{R}^m \to \mathbb{R}^n$, then the Jacobian matrix $\boldsymbol{J} \in \mathbb{R}^{n \times m}$ of $\boldsymbol{f}$ is defined such that $J_{i,j} = \frac{\partial}{\partial x_j} f(\boldsymbol{x})_i$.
We are also sometimes interested in a derivative of a derivative. This is known as a second derivative. For example, for a function $f : \mathbb{R}^n \to \mathbb{R}$, the derivative with respect to $x_i$ of the derivative of $f$ with respect to $x_j$ is denoted as $\frac{\partial^2}{\partial x_i \partial x_j} f$. In a single dimension, we can denote $\frac{d^2}{dx^2} f$ by $f''(x)$. The second derivative tells us how the first derivative will change as we vary the input. This is important because it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone. We can think of the second derivative as measuring curvature. Suppose we have a quadratic function (many functions that arise in practice are not quadratic but can be approximated well as quadratic, at least locally). If such a function has a second derivative of zero, then there is no curvature. It is a perfectly flat line, and its value can be predicted using only the gradient. If the gradient is 1, then we can make a step of size $\epsilon$ along the negative gradient, and the cost function will decrease by $\epsilon$. If the second derivative is negative, the function curves downward, so the cost function will actually decrease by more than $\epsilon$. Finally, if the second derivative is positive, the function curves upward, so the cost function can decrease by less than $\epsilon$. See
Figure 4.4: The second derivative determines the curvature of a function. Here we show quadratic functions with various curvature. The dashed line indicates the value of the cost function we would expect based on the gradient information alone as we make a gradient step downhill. With negative curvature, the cost function actually decreases faster than the gradient predicts. With no curvature, the gradient predicts the decrease correctly. With positive curvature, the function decreases more slowly than expected and eventually begins to increase, so steps that are too large can actually increase the function inadvertently.
figure 4.4 to see how different forms of curvature affect the relationship between the value of the cost function predicted by the gradient and the true value.
When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix $\boldsymbol{H}(f)(\boldsymbol{x})$ is defined such that
$$\boldsymbol{H}(f)(\boldsymbol{x})_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j} f(\boldsymbol{x}).$$
Equivalently, the Hessian is the Jacobian of the gradient.
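As a sketch, each Hessian entry can be approximated numerically with central finite differences; this is an illustrative check on a made-up function, not how libraries actually compute Hessians:

```python
def hessian_fd(f, x, h=1e-5):
    """Approximate H[i][j] = d^2 f / (dx_i dx_j) with central differences."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                p = list(x)
                p[i] += di
                p[j] += dj
                return f(p)
            H[i][j] = (shifted(h, h) - shifted(h, -h)
                       - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)
    return H

# f(x) = x0^2 + 3 x0 x1 has exact Hessian [[2, 3], [3, 0]] (symmetric).
f = lambda x: x[0] ** 2 + 3.0 * x[0] * x[1]
H = hessian_fd(f, [1.0, 2.0])
print(H)
```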
Anywhere that the second partial derivatives are continuous, the differential operators are commutative; that is, their order can be swapped:
$$\frac{\partial^2}{\partial x_i \partial x_j} f(\boldsymbol{x}) = \frac{\partial^2}{\partial x_j \partial x_i} f(\boldsymbol{x}).$$
This implies that $H_{i,j} = H_{j,i}$, so the Hessian matrix is symmetric at such points. Most of the functions we encounter in the context of deep learning have a symmetric Hessian almost everywhere. Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of

eigenvectors. The second derivative in a specific direction represented by a unit vector $\boldsymbol{d}$ is given by $\boldsymbol{d}^\top \boldsymbol{H} \boldsymbol{d}$. When $\boldsymbol{d}$ is an eigenvector of $\boldsymbol{H}$, the second derivative in that direction is given by the corresponding eigenvalue. For other directions of $\boldsymbol{d}$, the directional second derivative is a weighted average of all the eigenvalues, with weights between 0 and 1, and eigenvectors that have a smaller angle with $\boldsymbol{d}$ receiving more weight. The maximum eigenvalue determines the maximum second derivative, and the minimum eigenvalue determines the minimum second derivative.
The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function $f(\boldsymbol{x})$ around the current point $\boldsymbol{x}^{(0)}$:
$$f(\boldsymbol{x}) \approx f(\boldsymbol{x}^{(0)}) + (\boldsymbol{x} - \boldsymbol{x}^{(0)})^\top \boldsymbol{g} + \frac{1}{2} (\boldsymbol{x} - \boldsymbol{x}^{(0)})^\top \boldsymbol{H} (\boldsymbol{x} - \boldsymbol{x}^{(0)}),$$
where $\boldsymbol{g}$ is the gradient and $\boldsymbol{H}$ is the Hessian at $\boldsymbol{x}^{(0)}$. If we use a learning rate of $\epsilon$, then the new point $\boldsymbol{x}$ will be given by $\boldsymbol{x}^{(0)} - \epsilon\boldsymbol{g}$. Substituting this into our approximation, we obtain
$$f(\boldsymbol{x}^{(0)} - \epsilon\boldsymbol{g}) \approx f(\boldsymbol{x}^{(0)}) - \epsilon \boldsymbol{g}^\top \boldsymbol{g} + \frac{1}{2} \epsilon^2 \boldsymbol{g}^\top \boldsymbol{H} \boldsymbol{g}.$$
There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When $\boldsymbol{g}^\top \boldsymbol{H} \boldsymbol{g}$ is zero or negative, the Taylor series approximation predicts that increasing $\epsilon$ forever will decrease $f$ forever. In practice, the Taylor series is unlikely to remain accurate for large $\epsilon$, so one must resort to more heuristic choices of $\epsilon$ in this case. When $\boldsymbol{g}^\top \boldsymbol{H} \boldsymbol{g}$ is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields
$$\epsilon^* = \frac{\boldsymbol{g}^\top \boldsymbol{g}}{\boldsymbol{g}^\top \boldsymbol{H} \boldsymbol{g}}.$$
In the worst case, when $\boldsymbol{g}$ aligns with the eigenvector of $\boldsymbol{H}$ corresponding to the maximal eigenvalue $\lambda_{\max}$, then this optimal step size is given by $\frac{1}{\lambda_{\max}}$. To the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues of the Hessian thus determine the scale of the learning rate.
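The optimal step size from the second-order Taylor model can be computed directly; a small sketch in which the matrix and gradient are made up for illustration:

```python
def optimal_step_size(g, H):
    """epsilon* = (g^T g) / (g^T H g), the step that minimizes the
    second-order Taylor approximation along the negative gradient."""
    gtg = sum(gi * gi for gi in g)
    Hg = [sum(H[i][j] * g[j] for j in range(len(g))) for i in range(len(g))]
    gtHg = sum(gi * hgi for gi, hgi in zip(g, Hg))
    return gtg / gtHg

# When g aligns with an eigenvector of H whose eigenvalue is 4,
# the optimal step size collapses to 1 / 4.
H = [[4.0, 0.0], [0.0, 1.0]]
g = [2.0, 0.0]   # eigenvector of H with eigenvalue 4
print(optimal_step_size(g, H))   # 0.25
```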