
89. Calculus Basics with Python

89.1. Introduction

Calculus is a foundation of modern mathematics. Deriving machine learning algorithms requires working through formulas, and that work relies on calculus. This experiment first reviews the basics of calculus, then explains the principles behind the gradient descent algorithm and the least squares method in detail, and finally uses these methods to solve a simple optimization problem. The material is presented in a deliberately simple way, and every concept comes with a small example that illustrates it well, so please take the time to read and work through it carefully.

89.2. Knowledge Points

  • Linear and nonlinear functions

  • Derivatives and partial derivatives

  • Chain rule

  • Gradient descent algorithm

  • Local optimum and global optimum

  • Least squares method

Functions are among the most important concepts in elementary mathematics. Most of us have been learning about functions and their basic properties since an early age, so we will not revisit those familiar properties here and will only cover the aspects of functions that matter for machine learning.

89.3. Linear and Nonlinear Functions

A function describes a mapping from one variable to another.

If this relationship is a first-degree function, we call it a linear relationship (also known as a linear function). More intuitively, if the two variables are taken as the horizontal and vertical coordinates of a point and the resulting graph is a straight line in the plane, the relationship between them is linear. Conversely, if the graph is not a straight line, the relationship is called nonlinear, as with exponential, logarithmic, and power functions.

Similarly, in machine learning we generally call problems that can be solved with a linear function linear problems, and problems that require a nonlinear function nonlinear problems. Take classification as an example, as shown in the figure below:

https://cdn.aibydoing.com/hands-on-ai/images/8f3268d83aec2b30e9d8690a27125054-0.jpg

In Figure 1, the data can be separated by a straight line, so this is a linear problem, and the function describing that line is called a linear classifier. Similarly, in Figure 2 a curve is needed to separate the classes, so this is a nonlinear problem, and the curve is called a nonlinear classifier.

89.4. Derivatives

The value of a function changes as its independent variable changes, and the rate of this change is the derivative. The derivative of a function $f(x)$ can be written as $f'(x)$ or as $\frac{df}{dx}$. What the derivative measures is essentially how strongly a variable influences the function value.

The larger the absolute value of the derivative, the more a unit change in the variable affects the function value. A positive derivative means the function value increases as the variable increases; a negative derivative means the function value decreases as the variable increases.

Some common functions and their derivatives are listed below:

| Original function $f$ | Derivative $f'$ |
| --- | --- |
| any constant | $0$ |
| $x$ | $1$ |
| $e^x$ | $e^x$ |
| $x^2$ | $2x$ |
| $\frac{1}{x}$ | $-\frac{1}{x^2}$ |
| $\ln(x)$ | $\frac{1}{x}$ |
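If you want to double-check this table, symbolic differentiation is convenient. Below is a minimal sketch using SymPy (an extra dependency; it is not used elsewhere in this experiment):

import sympy as sp

x = sp.Symbol('x')
# differentiate each entry of the table with respect to x
for f in [sp.Integer(3), x, sp.exp(x), x**2, 1/x, sp.log(x)]:
    print(f, "->", sp.diff(f, x))

The printed results should match the table row by row.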

89.5. Partial Derivatives

The functions listed above have a single independent variable. How do we differentiate when there are several?

For example, for the function $f = x + y$, how do we measure the influence of $x$ and of $y$ on the value of $f$ separately? To answer this, mathematics introduces the concept of the partial derivative.

For a multivariable function $f$, taking the partial derivative of $f$ with respect to one variable $x$ is simple: treat all the other variables as constants and differentiate with respect to $x$ as in the single-variable case. The result is the partial derivative of $f$ with respect to $x$, written $\frac{\partial f}{\partial x}$. For example:

If $f = x + 2y$, then $\frac{\partial f}{\partial x} = 1$ and $\frac{\partial f}{\partial y} = 2$.

If $f = x^2 + 2xy + y^2$, then $\frac{\partial f}{\partial x} = 2x + 2y + 0 = 2x + 2y$ and $\frac{\partial f}{\partial y} = 0 + 2x + 2y = 2x + 2y$.
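These partial derivatives can also be checked symbolically; here is a small sketch with SymPy (assuming it is installed):

import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + 2*x*y + y**2
print(sp.diff(f, x))  # 2*x + 2*y
print(sp.diff(f, y))  # 2*x + 2*y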

89.6. Gradient

As mentioned above, for a single-variable function the sign of the derivative tells us the "direction" in which the variable pushes the function value: up or down.

How do we express the direction of change of the function value for a multivariable function? This is where the gradient comes in. The direction of the gradient is the direction in which the function value $f$ increases the fastest.

The gradient is a vector whose length equals the number of independent variables; its components are the partial derivatives with respect to each variable. For example, the gradient of $f(x, y, z)$ is $\left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\right)$.

For the function $f = x^2 + 2xy + y^2$, the gradient is $(2x + 2y,\ 2x + 2y)$. At a specific point, say $x = 1, y = 1$, the gradient is $(4, 4)$; at the point $x = 10, y = 20$, it is $(60, 60)$.
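A quick numerical check of these gradient values (a sketch; the helper name grad_f is ours and does not appear in the rest of the experiment):

def grad_f(x, y):
    # gradient of f = x^2 + 2xy + y^2 is (2x + 2y, 2x + 2y)
    return (2*x + 2*y, 2*x + 2*y)

print(grad_f(1, 1))    # (4, 4)
print(grad_f(10, 20))  # (60, 60)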

You can think of the gradient of a multivariable function (the vector of its partial derivatives) as the counterpart of the derivative of a single-variable function.

89.7. Chain Rule

Differentiating "simple functions" like those above, whether ordinary or partial derivatives, is fairly straightforward. Differentiating "composite functions", however, requires something new. Consider, for example:

$f(x) = \frac{1}{x}$
$g(x) = e^x$
$y = f(g(x))$

To differentiate the composite function $y$ with respect to $x$, we can use the chain rule.

The chain rule has the following form:

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$

We first treat $g(x)$ as a whole and denote it by $u$. The expression then becomes $y = f(u) = \frac{1}{u}$, which gives:

$\frac{dy}{du} = f'(u) = -\frac{1}{u^2} = -\frac{1}{g(x)^2} = -\frac{1}{(e^x)^2} = -\frac{1}{e^{2x}}$

Since $u = e^x$, we have $\frac{du}{dx} = e^x$.

Then, according to the chain rule, we obtain:

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = -\frac{1}{e^{2x}} \cdot e^x = -\frac{1}{e^x}$

That is the chain rule from start to finish. When facing a composite function, we first treat the inner function as a whole and differentiate the outer function, and then differentiate the inner function. This process of peeling the function like an onion, differentiating one layer at a time with respect to that layer's variable and finally multiplying all the derivatives together to obtain the derivative of the composite function, is called the chain rule. The chain rule is used heavily in the backpropagation of neural networks.
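SymPy can confirm the chain-rule result above; a minimal sketch (assuming SymPy is available):

import sympy as sp

x = sp.Symbol('x')
y = 1 / sp.exp(x)      # y = f(g(x)) with f(u) = 1/u and g(x) = e^x
print(sp.diff(y, x))   # prints -exp(-x), i.e. -1/e^x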

89.8. Optimization Theory

89.9. Objective Function

Simply put, the objective function is the goal of the whole problem. In machine learning we often take the distance between the predicted values and the true values as the objective function: the smaller this function, the more accurate the model's predictions.

Model training is essentially the process of using an appropriate algorithm to find the minimum of the objective function.

To make the explanation concrete, consider a simple optimization problem: a 20 m piece of wire is cut into two segments, and each segment is bent into a square. How long should the two segments be so that the sum of the areas of the two squares is as small as possible?

Let one segment have length $x$; the other then has length $(20 - x)$. Based on the description above, we can define the objective function of this problem as follows:

$y = s_1 + s_2 = \left(\frac{x}{4}\right)^2 + \left(\frac{20 - x}{4}\right)^2$

The function $y$ above is the objective function of the whole problem. Our goal is to find the minimum of $y$ and the value of $x$ at which it is attained.

We can define this function $y$ in Python:

import math


def obj_fun(x):
    # area of the first square (side x/4) plus area of the second square (side (20-x)/4)
    s1 = math.pow(x/4, 2)
    s2 = math.pow((20-x)/4, 2)
    return s1+s2
obj_fun(x=5)
15.625

Above, we defined the objective function obj_fun, which computes the value of $y$ for any $x$ according to the formula above. With the objective function in hand, the next step is to find its minimum. How do we do that? Here we use the gradient descent algorithm.

89.10. Gradient Descent Algorithm

Since this problem involves only a single-variable function, we could simply apply the rules of differentiation and find its minimum directly, so using gradient descent here is a bit of overkill. But to ease into the algorithm, we will use this relatively simple function as the objective.

Gradient descent can be likened to walking down a mountain. Suppose someone is stuck on a mountain and needs to get to the bottom quickly, but heavy fog keeps visibility very low, so the full path down cannot be seen and only the immediate surroundings are visible. In other words, we have to take a step, look around, take another step, and look around again. This is where the gradient descent algorithm comes in, as follows:

https://cdn.aibydoing.com/hands-on-ai/images/60b99fc78e60a08f4eb94f814ccd90d7-0.jpg

First, from the current position, find the steepest direction around that position and take one step in that direction. Then, from the new position, find the steepest direction again and take another step. Repeat until you reach the lowest point.

The objective function can likewise be seen as a mountain, and our goal is to find its minimum, that is, the bottom of the mountain. The steepest downhill direction mentioned above, the direction in which the objective value decreases fastest, is the direction opposite to the gradient (the gradient points in the direction in which the function value increases fastest).

Therefore, to find the minimum of a function, we compute the gradient at the current position, update the position according to the gradient, compute the gradient at the new position, update again, and so on until the function value stops decreasing. The mathematical expression for updating the position from the current position and the gradient is:

$\theta_1 = \theta_0 - \alpha \nabla J(\theta_0)$

The formula above describes the process of finding the minimum of the function $J(\theta)$.

Here $\theta_0$ is the current position, $\theta_1$ is the next position, $\alpha$ is the step size (the distance moved in one update), and $\nabla J(\theta_0)$ is the gradient at $\theta_0$; the minus sign makes the update move in the direction opposite to the gradient.

To keep the calculation simple, we use a single-variable function as an example. Let $J(\theta) = \theta^2$. Since the gradient of a single-variable function is just its derivative, $\nabla J(\theta) = J'(\theta) = 2\theta$. Set the step size to 0.4 and start gradient descent at $\theta_0 = 1$. Now we iterate:

$\theta_0 = 1$
$\theta_1 = \theta_0 - \alpha J'(\theta_0) = 1 - 0.4 \times 2 = 0.2$
$\theta_2 = \theta_1 - \alpha J'(\theta_1) = 0.2 - 0.4 \times 0.4 = 0.04$
$\theta_3 = \theta_2 - \alpha J'(\theta_2) = 0.04 - 0.4 \times 0.08 = 0.008$
$\theta_4 = \theta_3 - \alpha J'(\theta_3) = 0.008 - 0.4 \times 0.016 = 0.0016$

As shown above, after four updates the value of $J(\theta)$ keeps getting smaller. After thousands of iterations, $J(\theta)$ would approach the minimum. It may never equal the minimum exactly, but it can get arbitrarily close.
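The four updates above can be reproduced with a few lines of Python (a minimal sketch; the variable names are ours):

theta = 1.0   # starting point theta_0
alpha = 0.4   # step size
for i in range(4):
    grad = 2 * theta              # J'(theta) = 2 * theta
    theta = theta - alpha * grad
    print(theta)                  # approximately 0.2, 0.04, 0.008, 0.0016 (up to rounding)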

Now let us return to minimizing the total area of the two squares formed from the wire. The function is still:

$y = s_1 + s_2 = \left(\frac{x}{4}\right)^2 + \left(\frac{20 - x}{4}\right)^2$

Differentiating, the gradient of this function is $y' = \frac{x}{4} - \frac{5}{2}$.

Let's write the gradient function for this objective:

def grad_fun(x):
    # y' = x/4-5/2
    return x/4-5/2
grad_fun(1)
-2.25

Next, let's use the gradient descent algorithm to find the minimum of $y$:

# Stop when the decrease of the function value is smaller than 0.00000001,
# i.e. when we are already very close to the minimum and no longer need to iterate
e = 0.00000001
alpha = 1  # step size
x = 1  # initial position
y0 = obj_fun(x)
ylists = [y0]
while True:
    gx = grad_fun(x)
    # core update rule of gradient descent
    x = x - alpha*gx
    y1 = obj_fun(x)
    ylists.append(y1)
    if abs(y1-y0) < e:
        break
    y0 = y1
print("x at the minimum of y:", x)
print("minimum of y:", y1)
x at the minimum of y: 9.99971394602207
minimum of y: 12.50000001022836

From the results above, we can see that $y$ attains its minimum when $x \approx 10$, that is, when the wire is cut into two equal segments.

Next, let's plot the iteration history of $y$:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

plt.title("loss")
plt.plot(np.array(ylists))
plt.show()
[Figure: the objective value decreasing over the gradient descent iterations]

As the figure shows, gradient descent found the minimum of the function $y$. Because of the finite step size, the result may not match the exact answer (it can only approach it). Still, this method can be used to search for minimum points of far more complex functions as well.

Next, let's find the minimum of a more complex function, $f(x, y) = x^2 + y^2 - 6x - 4y + 29$.

First, write the function $f(x, y)$:

def func(x, y):
    return x**2+y**2-6*x-4*y+29
func(1, 1)
21

Next, write the function that computes the gradient. Since there are two unknowns, it must return the partial derivatives with respect to both variables:

def grads(x, y):
    # partial derivative with respect to x
    dx = 2*x-6
    # partial derivative with respect to y
    dy = 2*y-4
    return dx, dy
grads(1, 1)
(-4, -2)

Now let's use gradient descent to find the minimum of this function:

e = 0.00000001
alpha = 0.1  # step size
x = 1
y = 1
f0 = func(x, y)
flists = [f0]
while True:
    dx, dy = grads(x, y)
    # core update rule of gradient descent
    x = x - alpha*dx
    y = y - alpha*dy
    f1 = func(x, y)
    flists.append(f1)
    if abs(f1-f0) < e:
        break
    f0 = f1
print(f"minimum point of f: x={x}, y={y}")
print(f"minimum of f: {f1}")
minimum point of f: x=2.999891109642585, y=1.9999455548212925
minimum of f: 16.000000014821385

From these results, we can conclude that the function attains its minimum of about 16 when $x \approx 3$ and $y \approx 2$ (indeed, $f = (x-3)^2 + (y-2)^2 + 16$, so the exact minimum is 16 at the point $(3, 2)$).

89.11. Local Optimum and Global Optimum

Of course, the gradient descent algorithm does not always find the minimum of a function, as shown below:

https://cdn.aibydoing.com/hands-on-ai/images/583b153a4accb11c9876f9999b9d3330-0.jpg

As the figure shows, gradient descent found a local minimum of the function but not the true minimum.

In fact, complex functions contain many such minima. A minimum within some neighborhood is called a local optimum, and the smallest among all of these minima is called the global optimum.

Therefore, to make it as likely as possible that the point we find is the global optimum, the choice of step size in gradient descent is critical. A somewhat larger step size lets the algorithm jump across the "valleys" around a local minimum and reach the true minimum; but if the step size is too large, we may fail to converge to the answer we want at all.
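As a toy illustration of this sensitivity to the starting point, consider the function $f(x) = (x^2 - 1)^2 + 0.3x$, which has two minima (this example is ours, not from the original text). Gradient descent started on different sides of the hump lands in different valleys:

def f(x):
    return (x**2 - 1)**2 + 0.3*x


def df(x):
    # derivative of f
    return 4*x*(x**2 - 1) + 0.3


def descend(x, alpha=0.05, steps=200):
    for _ in range(steps):
        x = x - alpha * df(x)
    return x


print(descend(2.0))    # ends near x ≈ 0.96, the shallower local minimum
print(descend(-2.0))   # ends near x ≈ -1.04, the global minimum

Experimenting with alpha shows the trade-off described above: larger steps can hop out of the shallow valley, but they can also make the iteration unstable.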

In fact, many machine learning problems are optimization problems over very complex functions, which often have huge numbers of local extrema. Sometimes it is genuinely hard to tell whether the point we found is the global optimum, and we usually have no proof that it is the true minimum (short of exhaustively enumerating all candidate points).

In practice, much of the time we can only tune the parameters as well as we can and accept a good local minimum as the answer. In other words, the solution of a trained model is often just a local optimum.

This is a bit like finding a partner: we can never guarantee that our current partner is the best possible match. We have merely found a local optimum in the forest, but since that solution is a rather good one among many, we should cherish them.

89.12. Least Squares Method

The least squares method is one of the most commonly used tools for fitting linear models. It is mainly used to fit data points; put simply, it finds a suitable line (a functional form) to describe how the data are distributed. If we want to fit a single-variable relationship with this method, the problem can be stated as follows:

Suppose there are $n$ data points with coordinates $(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$, and we want to use the function $y = kx + b$ to describe the relationship between the $y$ coordinates and the $x$ coordinates of these points. The least squares method lets us determine the specific values of $k$ and $b$.

First, let's simulate such a data set. Assume the data follow $y = 2x + 3$ plus some uniform random noise (in the code below the noise is drawn from the interval $(-1, 0]$).

x = [0]
y = [3]
for i in range(60):
    xi = np.random.rand()*5
    yi = 2*xi+3 + np.random.rand()*(-1)  # uniform noise in (-1, 0]
    x.append(xi)
    y.append(yi)
plt.plot(x, y, '.')
[<matplotlib.lines.Line2D at 0x1178e7cd0>]
[Figure: scatter plot of the simulated data points]

As shown in the figure above, we want to find a straight line $y = kx + b$ that captures the distribution of all the data.

What kind of line best represents the data? If every data point fell exactly on the line, that line would clearly be the best. In practice, data are never that ideal; they usually look like the figure above, showing a clear trend without lying exactly on a line.

Given such real data points, how do we find the line that best reflects their distribution? The least squares method answers this. Its core idea is to "minimize the sum of squared errors": the line $y = kx + b$ whose sum of squared vertical distances to all the data points is smallest is the best-fitting line.

The sum of squared errors is defined as follows:

$f = \sum_{i=1}^{n}\left(y_i^{\mathrm{true}} - y_i^{\mathrm{pred}}\right)^2 = \sum_{i=1}^{n}\left(y_i - kx_i - b\right)^2$

At a minimum (or any extremum) of $f$, the derivative with respect to each unknown must be 0. Since the unknowns here are $k$ and $b$, the partial derivatives of $f$ with respect to $k$ and $b$ should both be 0:

$\frac{\partial f}{\partial k} = -2\sum_{i=1}^{n}\left(y_i - kx_i - b\right)x_i = 0$
$\frac{\partial f}{\partial b} = -2\sum_{i=1}^{n}\left(y_i - kx_i - b\right) = 0$

Expanding the parentheses (and dropping the constant factor $-2$) gives:
$\sum_{i=1}^{n}(x_i y_i) - k\sum_{i=1}^{n}(x_i^2) - b\sum_{i=1}^{n}(x_i) = 0$
$\sum_{i=1}^{n}(y_i) - k\sum_{i=1}^{n}(x_i) - nb = 0$

Let $A = \sum_{i=1}^{n}(x_i^2)$, $B = \sum_{i=1}^{n}(x_i)$, $C = \sum_{i=1}^{n}(x_i y_i)$, and $D = \sum_{i=1}^{n}(y_i)$. The equations above then become:

$Ak + Bb = C$
$Bk + nb = D$

Solving this system of two linear equations gives:

$k = \frac{Cn - BD}{An - B^2}$
$b = \frac{AD - CB}{An - B^2}$

Since the data points are known, we can easily compute the specific values of $k$ and $b$ from these formulas.

That is the whole procedure of the least squares method: first choose the form of the function describing the data set, then write down the sum of squared errors for that function and the data, and finally solve for each parameter by setting the derivatives to zero as above.

Let's use least squares to fit the data simulated above and compare the estimated $k$ and $b$ with the true values:

A = np.sum(np.multiply(x, x))
B = np.sum(x)
C = np.sum(np.multiply(x, y))
D = np.sum(y)
n = len(x)
k = (C*n-B*D)/(A*n-B*B)
b = (A*D-C*B)/(A*n-B*B)
print("True values: k = 2, b = 3")
print("Least squares estimates: k = {}, b = {}".format(k, b))
True values: k = 2, b = 3
Least squares estimates: k = 1.9970925836606068, b = 2.469418590058912

Of course, this only applies least squares to a fairly simple single-variable linear function. (Note that the estimated $b$ comes out closer to 2.5 than to 3 because the simulated noise above has mean $-0.5$ rather than 0.) Least squares also works well for fitting more complex multivariate functions, but the derivations are longer, so they are not shown here. If you are interested, try assuming the original function has the form $y = k_1 x^2 + k_2 x + b$ and use least squares with data points to determine each coefficient; a sketch of one way to do this follows.
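As a sketch of that exercise, one convenient route is to set up a design matrix with columns $x^2$, $x$, $1$ and let NumPy's least-squares solver handle the normal equations (np.linalg.lstsq). The data x2, y2 below are freshly simulated for the quadratic case and are our own assumption, not the data set used above:

import numpy as np

# simulate data from y = 1*x^2 + 2*x + 3 plus small uniform noise
x2 = np.random.rand(100) * 5
y2 = x2**2 + 2*x2 + 3 + np.random.rand(100) - 0.5

# design matrix with columns [x^2, x, 1]
X = np.column_stack([x2**2, x2, np.ones_like(x2)])
coef, residuals, rank, sv = np.linalg.lstsq(X, y2, rcond=None)
print("estimated k1, k2, b:", coef)   # should be close to 1, 2, 3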

89.13. Summary

Gradient descent is one of the most widely used optimization methods in machine learning. For ease of explanation, the examples in this experiment were kept simple; if you want a deeper understanding of the algorithm, try studying artificial neural networks to see more advanced applications of gradient descent.

It is recommended to install the GetVM browser extension and use its Jupyter Notebook online environment to practice the code.