Error bounds for approximations with deep ReLU networks
Keywords
Deep ReLU networks; approximation complexity
1. Introduction
Recently, multiple successful applications of deep neural networks to pattern recognition problems LeCun et al. (2015), Schmidhuber (2015) have revived active interest in theoretical properties of such networks, in particular their expressive power. It has been argued that deep networks may be more expressive than shallow ones of comparable size (see, e.g., Bianchini and Scarselli (2014), Delalleau and Bengio (2011), Montufar et al. (2014), Raghu et al. (2016), Telgarsky (2015)). In contrast to a shallow network, a deep one can be viewed as a long sequence of non-commutative transformations, which is a natural setting for high expressiveness (cf. the well-known Solovay–Kitaev theorem on fast approximation of arbitrary quantum operations by sequences of non-commutative gates, see Dawson and Nielsen (2006), Kitaev et al. (2002)).
There are various ways to characterize expressive power of networks. Delalleau and Bengio (2011) consider sum–product networks and prove for certain classes of polynomials that they are much more easily represented by deep networks than by shallow networks. Montufar et al. (2014) estimate the number of linear regions in the network’s landscape. Bianchini and Scarselli (2014) give bounds for Betti numbers characterizing topological properties of functions represented by networks. Telgarsky (2015), Telgarsky (2016) provides specific examples of classification problems where deep networks are provably more efficient than shallow ones.
In the context of classification problems, a general and standard approach to characterizing expressiveness is based on the notion of the Vapnik–Chervonenkis dimension (Vapnik & Chervonenkis, 2015). There exist several bounds for VC-dimension of deep networks with piece-wise polynomial activation functions that go back to geometric techniques of Goldberg and Jerrum (1995) and earlier results of Warren (1968); see Bartlett et al. (1998), Sakurai (1999) and the book (Anthony & Bartlett, 2009). There is a related concept, the fat-shattering dimension, for real-valued approximation problems Anthony and Bartlett (2009), Kearns and Schapire (1990).
A very general approach to expressiveness in the context of approximation is the method of nonlinear widths by DeVore, Howard, and Micchelli (1989) that concerns approximation of a family of functions under assumption of a continuous dependence of the model on the approximated function.
In this paper we examine the problem of shallow-vs-deep expressiveness from the perspective of approximation theory and general spaces of functions having derivatives up to certain order (Sobolev-type spaces). In this framework, the problem of expressiveness is very well studied in the case of shallow networks with a single hidden layer, where it is known, in particular, that to approximate a $C^n$-function on a $d$-dimensional set with infinitesimal error $\epsilon$ one needs a network of size about $\epsilon^{-d/n}$, assuming a smooth activation function (see, e.g., Mhaskar (1996), Pinkus (1999) for a number of related rigorous upper and lower bounds and further qualifications of this result). Much less seems to be known about deep networks in this setting, though Mhaskar, Liao, and Poggio (2016), Mhaskar and Poggio (2016) have recently introduced functional spaces constructed using deep dependency graphs and obtained expressiveness bounds for related deep networks.
We will focus our attention on networks with the ReLU activation function $\sigma(x)=\max(0,x)$, which, despite its utter simplicity, seems to be the most popular choice in practical applications (LeCun et al., 2015). We will consider $L^\infty$-error of approximation of functions belonging to the Sobolev spaces $\mathcal{W}^{n,\infty}([0,1]^d)$ (without any assumptions of hierarchical structure). We will often consider families of approximations, as the approximated function runs over the unit ball in $\mathcal{W}^{n,\infty}([0,1]^d)$. In such cases we will distinguish scenarios of fixed and adaptive network architectures. Our goal is to obtain lower and upper bounds on the expressiveness of deep and shallow networks in different scenarios. We measure complexity of networks in a conventional way, by counting the number of their weights and computation units (cf. Anthony and Bartlett (2009)).
The main body of the paper consists of Sections 2 (The ReLU network model), 3 (Upper bounds) and 4 (Lower bounds).
In Section 2 we describe our ReLU network model and show that the ReLU function is replaceable by any other continuous piece-wise linear activation function, up to constant factors in complexity asymptotics (Proposition 1).
In Section 3 we establish several upper bounds on the complexity of approximating by ReLU networks, in particular showing that deep networks are quite efficient for approximating smooth functions. Specifically:
- In Section 3.1 we show that the function $f(x)=x^2$ can be $\epsilon$-approximated by a network of depth and complexity $O(\ln(1/\epsilon))$ (Proposition 2). This also leads to similar upper bounds on the depth and complexity that are sufficient to implement an approximate multiplication in a ReLU network (Proposition 3).
- In Section 3.2 we describe a ReLU network architecture of depth $O(\ln(1/\epsilon))$ and complexity $O(\epsilon^{-d/n}\ln(1/\epsilon))$ that is capable of approximating with error $\epsilon$ any function from $F_{d,n}$, the unit ball in $\mathcal{W}^{n,\infty}([0,1]^d)$ (Theorem 1).
- In Section 3.3 we show that, even with fixed-depth networks, one can further decrease the approximation complexity if the network architecture is allowed to depend on the approximated function. Specifically, we prove that one can $\epsilon$-approximate a given Lipschitz function on the segment $[0,1]$ by a depth-6 ReLU network with $O\big(\frac{1}{\epsilon\ln(1/\epsilon)}\big)$ connections and activation units (Theorem 2). This upper bound is of interest since it lies below the lower bound provided by the method of nonlinear widths under assumption of continuous model selection (see Section 4.1).
In Section 4 we obtain several lower bounds on the complexity of approximation by deep and shallow ReLU networks, using different approaches and assumptions.
- In Section 4.1 we recall the general lower bound provided by the method of continuous nonlinear widths. This method assumes that parameters of the approximation continuously depend on the approximated function, but does not assume anything about how the approximation depends on its parameters. In this setup, at least $c\epsilon^{-d/n}$ connections and weights are required for an $\epsilon$-approximation on $F_{d,n}$ (Theorem 3). As already mentioned, for $d=n=1$ this lower bound is above the upper bound provided by Theorem 2.
- In Section 4.2 we consider the setup where the same network architecture is used to approximate all functions in $F_{d,n}$, but the weights are not assumed to continuously depend on the function. In this case, application of existing results on VC-dimension of deep piece-wise polynomial networks yields a lower bound $c\epsilon^{-d/(2n)}$ in general and a lower bound $c\epsilon^{-d/n}\ln^{-2p-1}(1/\epsilon)$ if the network depth grows as $O(\ln^p(1/\epsilon))$ (Theorem 4).
- In Section 4.3 we consider an individual approximation, without any assumptions regarding it as an element of a family as in Sections 4.1 and 4.2. We prove that for any $d,n$ there exists a function in $\mathcal{W}^{n,\infty}([0,1]^d)$ such that its approximation complexity is not $o(\epsilon^{-d/(9n)})$ as $\epsilon\to 0$ (Theorem 5).
- In Section 4.4 we prove that $\epsilon$-approximation of any nonlinear $C^2$-function by a network of fixed depth $L$ requires at least $c\epsilon^{-1/(2(L-2))}$ computation units (Theorem 6). By comparison with Theorem 1, this shows that for sufficiently smooth functions approximation by fixed-depth ReLU networks is less efficient than by unbounded-depth networks.
In Section 5 we discuss the obtained bounds and summarize their implications, in particular comparing deep vs. shallow networks and fixed vs. adaptive architectures.
The arXiv preprint of the first version of the present work appeared almost simultaneously with the work of Liang and Srikant (2016) containing results partly overlapping with our results in Sections 3.1 Fast deep approximation of squaring and multiplication, 3.2 Fast deep approximation of general smooth functions, 4.4 Slow approximation of smooth functions by shallow networks. Liang and Srikant consider networks equipped with both ReLU and threshold activation functions. They prove a logarithmic upper bound for the complexity of approximating the function $f(x)=x^2$, which is analogous to our Proposition 2. Then, they extend this upper bound to polynomials and smooth functions. In contrast to our treatment of generic smooth functions based on standard Sobolev spaces, they impose more complex assumptions on the function (including, in particular, how many derivatives it has) that depend on the required approximation accuracy $\epsilon$. As a consequence, they obtain strong $O(\ln^c(1/\epsilon))$ complexity bounds rather different from our bound in Theorem 1 (in fact, our lower bound proved in Theorem 5 rules out, in general, such strong upper bounds for functions having only finitely many derivatives). Also, Liang and Srikant prove a lower bound for the complexity of approximating convex functions by shallow networks. Our version of this result, given in Section 4.4, is different in that we assume smoothness and nonlinearity instead of global convexity.

Fig. 1. A feedforward neural network having 3 input units (diamonds), 1 output unit (square), and 7 computation units with nonlinear activation (circles). The network has 4 layers.
2. The ReLU network model
Throughout the paper, we consider feedforward neural networks with the ReLU (Rectified Linear Unit) activation function $\sigma(x)=\max(0,x)$. The network consists of several input units, one output unit, and a number of "hidden" computation units. Each hidden unit performs an operation of the form $$y=\sigma\Big(\sum_{k}w_kx_k+b\Big)\qquad(1)$$ with some weights (adjustable parameters) $w_k$ and $b$ depending on the unit. The output unit is also a computation unit, but without the nonlinearity, i.e., it computes $y=\sum_kw_kx_k+b$. The units are grouped in layers, and the inputs $x_k$ of a computation unit in a certain layer are outputs of some units belonging to any of the preceding layers (see Fig. 1). Note that we allow connections between units in non-neighboring layers. Occasionally, when this cannot cause confusion, we may denote the network and the function it implements by the same symbol.
The depth of the network, the number of units and the total number of weights are standard measures of network complexity (Anthony & Bartlett, 2009). We will use these measures throughout the paper. The number of weights is, clearly, the sum of the total number of connections and the number of computation units. We identify the depth with the number of layers (in particular, the most common type of neural networks – shallow networks having a single hidden layer – are depth-3 networks according to this convention).
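To make this network model and the complexity measures above concrete, here is a minimal sketch (ours, not taken from the paper; all class and variable names are illustrative) of a feedforward net in exactly this sense: hidden units compute $\sigma(\sum_k w_kx_k+b)$, the output unit is linear, a unit may read the outputs of any preceding layer, and complexity is measured by depth, number of computation units and number of weights.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ReLUNet:
    """Units grouped in layers; a unit may read outputs of any preceding layer."""

    def __init__(self, n_inputs):
        self.layers = [["input"] * n_inputs]        # layer 0: input units, no parameters

    def add_layer(self, units):
        # each unit is (inputs, weights, bias, is_output), where
        # inputs is a list of (layer_index, unit_index) pairs
        self.layers.append(units)

    def forward(self, x):
        values = [list(x)]                          # outputs of layer 0
        for layer in self.layers[1:]:
            out = []
            for inputs, w, b, is_output in layer:
                z = sum(wi * values[l][u] for wi, (l, u) in zip(w, inputs)) + b
                out.append(z if is_output else relu(z))   # output unit has no nonlinearity
            values.append(out)
        return values[-1][0]

    # the conventional complexity measures used throughout the paper
    def depth(self):
        return len(self.layers)
    def units(self):
        return sum(len(layer) for layer in self.layers[1:])
    def weights(self):
        return sum(len(w) + 1 for layer in self.layers[1:] for (_, w, _, _) in layer)

# 2 inputs -> 2 hidden ReLU units -> linear output with a skip connection to input 0
net = ReLUNet(2)
net.add_layer([([(0, 0), (0, 1)], [1.0, -1.0], 0.0, False),
               ([(0, 0), (0, 1)], [-1.0, 1.0], 0.0, False)])
net.add_layer([([(1, 0), (1, 1), (0, 0)], [1.0, 1.0, 0.5], 0.0, True)])
print(net.forward([0.3, 0.8]), net.depth(), net.units(), net.weights())  # ~0.65, depth 3, 3 units, 10 weights
```

The example computes $|x_1-x_2|+\tfrac12x_1$ and illustrates that skip connections to non-neighboring layers are allowed and counted in the number of weights.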
We finish this subsection with a proposition showing that, given our complexity measures, using the ReLU activation function is not much different from using any other piece-wise linear activation function with finitely many breakpoints: one can replace one network by an equivalent one but having another activation function while only increasing the number of units and weights by constant factors. This justifies our restricted attention to the ReLU networks (which could otherwise have been perceived as an excessively particular example of networks).
Proposition 1
Let $\rho:\mathbb{R}\to\mathbb{R}$ be any continuous piece-wise linear function with $M$ breakpoints, where $1\le M<\infty$.
- (a) Let $\eta$ be a network with the activation function $\rho$, having depth $L$, $W$ weights and $U$ computation units. Then there exists a ReLU network $\tilde\eta$ that has depth $L$, not more than $(M+1)^2W$ weights and not more than $(M+1)U$ units, and that computes the same function as $\eta$.
- (b) Conversely, let $\eta$ be a ReLU network of depth $L$ with $W$ weights and $U$ computation units. Let $D$ be a bounded subset of $\mathbb{R}^d$, where $d$ is the input dimension of $\eta$. Then there exists a network with the activation function $\rho$ that has depth $L$, $4W$ weights and $2U$ units, and that computes the same function as $\eta$ on the set $D$.
Proof
(a) Let $a_1<\cdots<a_M$ be the breakpoints of $\rho$, i.e., the points where its derivative is discontinuous. We can then express $\rho$ via the ReLU function $\sigma$, as a linear combination of $M+1$ shifted ReLU terms, $\rho(x)=h+c_0\sigma(a_1-x)+\sum_{m=1}^{M}c_m\sigma(x-a_m)$, with appropriately chosen coefficients $c_0,\dots,c_M$ and $h$. It follows that the computation performed by a single $\rho$-unit, $x\mapsto\rho(\sum_kw_kx_k+b)$, can be equivalently represented by a linear combination of a constant function and computations of $M+1$ $\sigma$-units. We can then replace one-by-one all the $\rho$-units in the network by $\sigma$-units, without changing the output of the network. Obviously, these replacements do not change the network depth. Since each hidden unit gets replaced by $M+1$ new units, the number of units in the new network is not greater than $M+1$ times their number in the original network. Note also that the number of connections in the network is multiplied, at most, by $(M+1)^2$. Indeed, each unit replacement entails replacing each of the incoming and outgoing connections of this unit by $M+1$ new connections, and each connection is replaced twice: as an incoming and as an outgoing one. These considerations imply the claimed complexity bounds for the resulting $\sigma$-network $\tilde\eta$.
(b) Let $a$ be any breakpoint of $\rho$, so that $\rho'(a-)\ne\rho'(a+)$. Let $r$ be the distance separating $a$ from the nearest other breakpoint, so that $\rho$ is linear on $[a-r,a]$ and on $[a,a+r]$ (if $\rho$ has only one breakpoint, any $r>0$ will do). Then, for any $\lambda>0$, we can express the ReLU function $\sigma$ in the $\frac{r}{\lambda}$-neighborhood of 0 as a linear combination of a constant and two rescaled, shifted copies of $\rho$. It follows that a computation performed by a single $\sigma$-unit can be equivalently represented by a linear combination of a constant function and two $\rho$-units, provided the condition (2) that the arguments stay in this neighborhood holds. Since $D$ is a bounded set, we can choose the scaling parameter $\lambda$ at each unit of the initial network so as to satisfy condition (2) for all network inputs from $D$. Then, like in (a), we replace each $\sigma$-unit with two $\rho$-units, which produces the desired $\rho$-network. □
3. Upper bounds
Throughout the paper, we will be interested in approximating functions $f:[0,1]^d\to\mathbb{R}$ by ReLU networks. Given a function $f$ and its approximation $\tilde f$, by the approximation error we will always mean the uniform maximum error $$\|f-\tilde f\|_\infty=\max_{\mathbf x\in[0,1]^d}|f(\mathbf x)-\tilde f(\mathbf x)|.$$
3.1. Fast deep approximation of squaring and multiplication
Our first key result shows that ReLU networks with unconstrained depth can very efficiently approximate the function $f(x)=x^2$ (more efficiently than any fixed-depth network, as we will see in Section 4.4). Our construction uses the "sawtooth" function that has previously appeared in the paper by Telgarsky (2015).
Proposition 2
The function $f(x)=x^2$ on the segment $[0,1]$ can be approximated with any error $\epsilon>0$ by a ReLU network having the depth and the number of weights and computation units $O(\ln(1/\epsilon))$.
Proof
Consider the "tooth" (or "mirror") function $g:[0,1]\to[0,1]$, $$g(x)=\begin{cases}2x,&x<\frac12,\\2(1-x),&x\ge\frac12,\end{cases}$$ and the iterated functions $g_s=\underbrace{g\circ g\circ\cdots\circ g}_{s}$. Telgarsky has shown (see Lemma 2.4 in Telgarsky (2015)) that $g_s$ is a "sawtooth" function with $2^{s-1}$ uniformly distributed "teeth" (each application of $g$ doubles the number of teeth); see Fig. 2(a). Our key observation now is that the function $f(x)=x^2$ can be approximated by linear combinations of the functions $g_s$. Namely, let $f_m$ be the piece-wise linear interpolation of $f$ with $2^m+1$ uniformly distributed breakpoints $\frac{k}{2^m},\ k=0,\dots,2^m$: $$f_m\Big(\frac{k}{2^m}\Big)=\Big(\frac{k}{2^m}\Big)^2,\qquad k=0,\dots,2^m$$ (see Fig. 2(b)). The function $f_m$ approximates $f$ with the error $2^{-2m-2}$. Now note that refining the interpolation from $f_{m-1}$ to $f_m$ amounts to adjusting it by a function proportional to a sawtooth function: $$f_{m-1}(x)-f_m(x)=\frac{g_m(x)}{2^{2m}}.$$ Hence $$f_m(x)=x-\sum_{s=1}^{m}\frac{g_s(x)}{2^{2s}}.$$ Since $g$ can be implemented by a finite ReLU network (as $g(x)=2\sigma(x)-4\sigma(x-\frac12)+2\sigma(x-1)$) and since construction of $f_m$ only involves linear operations and compositions of $g$, we can implement $f_m$ by a ReLU network having depth and the number of weights and computation units all being $O(m)$ (see Fig. 2(c)). This implies the claim of the proposition. □
Since $$xy=\frac{1}{2}\big((x+y)^2-x^2-y^2\big),\qquad(3)$$ we can use Proposition 2 to efficiently implement accurate multiplication in a ReLU network. The implementation will depend on the required accuracy and the magnitude of the multiplied quantities.

Fig. 2. Fast approximation of the function $f(x)=x^2$ from Proposition 2: (a) the "tooth" function $g$ and the iterated "sawtooth" functions $g_s$; (b) the approximating functions $f_m$; (c) the network architecture for $f_m$.
Proposition 3
Given $M>0$ and $\epsilon\in(0,1)$, there is a ReLU network $\eta$ with two input units that implements a function $\widetilde\times:\mathbb{R}^2\to\mathbb{R}$ so that
- (a) for any inputs $x,y$, if $|x|\le M$ and $|y|\le M$, then $|\widetilde\times(x,y)-xy|\le\epsilon$;
- (b) if $x=0$ or $y=0$, then $\widetilde\times(x,y)=0$;
- (c) the depth and the number of weights and computation units in $\eta$ are not greater than $c_1\ln(1/\epsilon)+c_2$ with an absolute constant $c_1$ and a constant $c_2=c_2(M)$.
Proof
Let $\tilde f_{\mathrm{sq},\delta}$ be the approximate squaring function from Proposition 2 such that $\tilde f_{\mathrm{sq},\delta}(0)=0$ and $|\tilde f_{\mathrm{sq},\delta}(x)-x^2|\le\delta$ for $x\in[0,1]$. Assume without loss of generality that $M\ge1$ and set $$\widetilde\times(x,y)=2M^2\Big(\tilde f_{\mathrm{sq},\delta}\Big(\frac{|x+y|}{2M}\Big)-\tilde f_{\mathrm{sq},\delta}\Big(\frac{|x|}{2M}\Big)-\tilde f_{\mathrm{sq},\delta}\Big(\frac{|y|}{2M}\Big)\Big),\qquad(4)$$ where $\delta=\frac{\epsilon}{6M^2}$. Then property (b) is immediate and (a) follows easily using expansion (3). To conclude (c), observe that computation (4) consists of three instances of $\tilde f_{\mathrm{sq},\delta}$ and finitely many linear and ReLU operations, so, using Proposition 2, we can implement $\widetilde\times$ by a ReLU network such that its depth and the number of computation units and weights are $O(\ln(1/\delta))$, i.e. are $O(\ln(1/\epsilon)+\ln M)$. □
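A rough numerical sketch (ours) of this approximate multiplication: it combines the approximate squaring `f_m` from the previous sketch with identity (3), rescaled so that all squaring inputs lie in $[0,1]$ when $|x|,|y|\le M$. Observe also that the result is exactly zero whenever one factor is zero.

```python
import numpy as np

def g(x):
    return np.minimum(2 * x, 2 - 2 * x)

def f_m(x, m):                              # approximate square on [0, 1], error 2**(-2m-2)
    approx, t = np.array(x, float), np.array(x, float)
    for s in range(1, m + 1):
        t = g(t)
        approx = approx - t / 4.0 ** s
    return approx

def approx_mult(x, y, m, M=1.0):            # assumes |x| <= M and |y| <= M
    sq = lambda t: f_m(np.abs(t) / (2 * M), m)
    return 2 * M ** 2 * (sq(x + y) - sq(x) - sq(y))   # xy = ((x+y)^2 - x^2 - y^2) / 2

x, y = 0.37, -0.61
for m in range(1, 8):
    print(m, float(approx_mult(x, y, m)), x * y)      # error at most 6 * M^2 * 2^(-2m-2)
print(float(approx_mult(0.0, 0.9, 5)))                # exactly 0 when one factor is 0
```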
3.2. Fast deep approximation of general smooth functions
In order to formulate our general result, Theorem 1, we consider the Sobolev spaces $\mathcal{W}^{n,\infty}([0,1]^d)$ with $n\in\{1,2,\dots\}$. Recall that $\mathcal{W}^{n,\infty}([0,1]^d)$ is defined as the space of functions on $[0,1]^d$ lying in $L^\infty$ along with their weak derivatives up to order $n$. The norm in $\mathcal{W}^{n,\infty}([0,1]^d)$ can be defined by $$\|f\|_{\mathcal{W}^{n,\infty}([0,1]^d)}=\max_{\mathbf n:|\mathbf n|\le n}\operatorname*{ess\,sup}_{\mathbf x\in[0,1]^d}|D^{\mathbf n}f(\mathbf x)|,$$ where $\mathbf n=(n_1,\dots,n_d)\in\{0,1,\dots\}^d$, $|\mathbf n|=n_1+\cdots+n_d$, and $D^{\mathbf n}f$ is the respective weak derivative. Here and in the sequel we denote vectors by boldface characters. The space $\mathcal{W}^{n,\infty}([0,1]^d)$ can be equivalently described as consisting of the functions from $C^{n-1}([0,1]^d)$ such that all their derivatives of order $n-1$ are Lipschitz continuous.
Throughout the paper, we denote by $F_{d,n}$ the unit ball in $\mathcal{W}^{n,\infty}([0,1]^d)$: $$F_{d,n}=\{f\in\mathcal{W}^{n,\infty}([0,1]^d):\|f\|_{\mathcal{W}^{n,\infty}([0,1]^d)}\le1\}.$$
Also, it will now be convenient to make a distinction between networks and network architectures: we define the latter as the former with unspecified weights. We say that a network architecture is capable of expressing any function from $F_{d,n}$ with error $\epsilon$, meaning that this can be achieved by some weight assignment.
Theorem 1
For any $d,n$ and $\epsilon\in(0,1)$, there is a ReLU network architecture that
- 1. is capable of expressing any function from $F_{d,n}$ with error $\epsilon$;
- 2. has the depth at most $c(\ln(1/\epsilon)+1)$ and at most $c\epsilon^{-d/n}(\ln(1/\epsilon)+1)$ weights and computation units, with some constant $c=c(d,n)$.
Proof
The proof will consist of two steps. We start with approximating $f$ by a sum–product combination $f_1$ of local Taylor polynomials and one-dimensional piecewise-linear functions. After that, we will use results of the previous section to approximate $f_1$ by a neural network.
Let $N$ be a positive integer. Consider a partition of unity formed by a grid of $(N+1)^d$ functions $\phi_{\mathbf m}$ on the domain $[0,1]^d$: $$\sum_{\mathbf m}\phi_{\mathbf m}(\mathbf x)\equiv1,\qquad\mathbf x\in[0,1]^d.$$ Here $\mathbf m=(m_1,\dots,m_d)\in\{0,1,\dots,N\}^d$, and the function $\phi_{\mathbf m}$ is defined as the product $$\phi_{\mathbf m}(\mathbf x)=\prod_{k=1}^{d}\psi\Big(3N\Big(x_k-\frac{m_k}{N}\Big)\Big),\qquad(5)$$ where $$\psi(x)=\begin{cases}1,&|x|<1,\\0,&|x|>2,\\2-|x|,&1\le|x|\le2\end{cases}$$ (see Fig. 3). Note that $$\|\phi_{\mathbf m}\|_\infty=1\qquad(6)$$ and $$\operatorname{supp}\phi_{\mathbf m}\subset\Big\{\mathbf x:\Big|x_k-\frac{m_k}{N}\Big|<\frac{2}{3N}\ \forall k\Big\}.\qquad(7)$$
For any $\mathbf m\in\{0,\dots,N\}^d$, consider the degree-$(n-1)$ Taylor polynomial for the function $f$ at $\mathbf x=\frac{\mathbf m}{N}$: $$P_{\mathbf m}(\mathbf x)=\sum_{\mathbf n:|\mathbf n|<n}\frac{D^{\mathbf n}f\big(\frac{\mathbf m}{N}\big)}{\mathbf n!}\Big(\mathbf x-\frac{\mathbf m}{N}\Big)^{\mathbf n},\qquad(8)$$ with the usual conventions $\mathbf n!=\prod_{k=1}^dn_k!$ and $(\mathbf x-\frac{\mathbf m}{N})^{\mathbf n}=\prod_{k=1}^d(x_k-\frac{m_k}{N})^{n_k}$. Now define an approximation to $f$ by $$f_1(\mathbf x)=\sum_{\mathbf m\in\{0,\dots,N\}^d}\phi_{\mathbf m}(\mathbf x)P_{\mathbf m}(\mathbf x).\qquad(9)$$ We bound the approximation error using the Taylor expansion of $f$: using the support property (7) and the bound (6), the observation that any $\mathbf x$ belongs to the support of at most $2^d$ functions $\phi_{\mathbf m}$, a standard bound for the Taylor remainder, and the property $\|f\|_{\mathcal{W}^{n,\infty}([0,1]^d)}\le1$, we obtain $$\|f-f_1\|_\infty\le\frac{2^dd^n}{n!}\Big(\frac{1}{N}\Big)^n.$$

Fig. 3. Functions $\phi_m$ forming a partition of unity for $d=1$ in the proof of Theorem 1.
It follows that if we choose $$N=\Big\lceil\Big(\frac{n!}{2^dd^n}\,\frac{\epsilon}{2}\Big)^{-1/n}\Big\rceil\qquad(10)$$ (where $\lceil\cdot\rceil$ is the ceiling function), then $$\|f-f_1\|_\infty\le\frac{\epsilon}{2}.\qquad(11)$$ Note that, by (8), the coefficients of the polynomials $P_{\mathbf m}$ are uniformly bounded for all $f\in F_{d,n}$: $$P_{\mathbf m}(\mathbf x)=\sum_{\mathbf n:|\mathbf n|<n}a_{\mathbf m,\mathbf n}\Big(\mathbf x-\frac{\mathbf m}{N}\Big)^{\mathbf n},\qquad|a_{\mathbf m,\mathbf n}|\le1.\qquad(12)$$
We have therefore reduced our task to the following: construct a network architecture capable of approximating with uniform error $\frac{\epsilon}{2}$ any function of the form (9), assuming that $N$ is given by (10) and the polynomials $P_{\mathbf m}$ are of the form (12).
Expand $f_1$ as $$f_1(\mathbf x)=\sum_{\mathbf m}\sum_{\mathbf n:|\mathbf n|<n}a_{\mathbf m,\mathbf n}\,\phi_{\mathbf m}(\mathbf x)\Big(\mathbf x-\frac{\mathbf m}{N}\Big)^{\mathbf n}.\qquad(13)$$ The expansion is a linear combination of not more than $d^n(N+1)^d$ terms $\phi_{\mathbf m}(\mathbf x)(\mathbf x-\frac{\mathbf m}{N})^{\mathbf n}$. Each of these terms is a product of at most $d+n-1$ piece-wise linear univariate factors: $d$ functions $\psi(3N(x_k-\frac{m_k}{N}))$ (see (5)) and at most $n-1$ linear expressions $x_k-\frac{m_k}{N}$. We can implement an approximation of this product by a neural network with the help of Proposition 3. Specifically, let $\widetilde\times$ be the approximate multiplication from Proposition 3 for a suitable constant $M$ and some accuracy $\delta$ to be chosen later, and consider the approximation $\tilde f_{\mathbf m,\mathbf n}$ of the product $\phi_{\mathbf m}(\mathbf x)(\mathbf x-\frac{\mathbf m}{N})^{\mathbf n}$ obtained by the chained application of $\widetilde\times$ to these factors (14). Using statement (c) of Proposition 3, we see that $\tilde f_{\mathbf m,\mathbf n}$ can be implemented by a ReLU network with the depth and the number of weights and computation units not larger than $c_1\ln(1/\delta)$, for some constant $c_1=c_1(d,n)$.
Now we estimate the error of this approximation. Note that we have and for all and all . By statement (a) of Proposition 3, if and , then . Repeatedly applying this observation to all approximate multiplications in (14) while assuming , we see that the arguments of all these multiplications are bounded by our (equal to ) and the statement (a) of Proposition 3 holds for each of them. We then have (15)Moreover, by statement (b) of Proposition 3, (16)Now we define the full approximation by (17)We estimate the approximation error of : where in the first step we use expansion (13), in the second the identity (16), in the third the bound and the fact that for at most functions , and in the fourth the bound (15). It follows that if we choose (18)then and hence, by (11),
On the other hand, note that by (17), $\tilde f$ can be implemented by a network consisting of parallel subnetworks that compute each of the $\tilde f_{\mathbf m,\mathbf n}$; the final output is obtained by weighting the outputs of the subnetworks with the weights $a_{\mathbf m,\mathbf n}$. The architecture of the full network does not depend on $f$; only the weights $a_{\mathbf m,\mathbf n}$ do. As already shown, each of these subnetworks has not more than $c_1\ln(1/\delta)$ layers, weights and computation units, with some constant $c_1=c_1(d,n)$. There are not more than $d^n(N+1)^d$ such subnetworks. Therefore, the full network for $\tilde f$ has not more than $c_1\ln(1/\delta)$ layers and $c_2d^n(N+1)^d\ln(1/\delta)$ weights and computation units, with some constant $c_2=c_2(d,n)$. With $\delta$ given by (18) and $N$ given by (10), we obtain the claimed complexity bounds. □
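A small numerical sketch (ours, for $d=1$ and $n=2$, with an arbitrary smooth test function) of the first step of this proof: the partition of unity $\phi_m$ on a grid of spacing $1/N$ combined with the local Taylor polynomials $P_m$; the uniform error of $f_1$ decays like $N^{-n}$.

```python
import numpy as np

def psi(t):                                  # trapezoid bump: 1 on [-1,1], 0 outside [-2,2]
    return np.clip(2.0 - np.abs(t), 0.0, 1.0)

f  = lambda x: np.sin(2 * np.pi * x) / (2 * np.pi) ** 2   # a smooth test function
df = lambda x: np.cos(2 * np.pi * x) / (2 * np.pi)        # its first derivative

x = np.linspace(0.0, 1.0, 20001)
for N in [5, 10, 20, 40]:
    f1 = np.zeros_like(x)
    for m in range(N + 1):
        phi_m = psi(3 * N * (x - m / N))                  # partition of unity on [0, 1]
        P_m   = f(m / N) + df(m / N) * (x - m / N)        # degree-1 Taylor polynomial at m/N
        f1   += phi_m * P_m
    print(N, np.max(np.abs(f1 - f(x))))                   # expect decay ~ C * N**(-2)
```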
We remark that an analog of Theorem 1 for so-called $k$th order activation functions can be found in Mhaskar (1993).
3.3. Faster approximations using adaptive network architectures
Theorem 1 provides an upper bound for the approximation complexity in the case when the same network architecture is used to approximate all functions in $F_{d,n}$. We can consider an alternative, "adaptive architecture" scenario where not only the weights, but also the architecture is adjusted to the approximated function. We expect, of course, that this would decrease the complexity of the resulting architectures, in general (at the price of needing to find the appropriate architecture). In this section we show that we can indeed obtain better upper bounds in this scenario.
For simplicity, we will only consider the case $d=n=1$. Then, $\mathcal{W}^{1,\infty}([0,1])$ is the space of Lipschitz functions on the segment $[0,1]$. The set $F_{1,1}$ consists of functions having both $\|f\|_\infty$ and the Lipschitz constant bounded by 1. Theorem 1 provides an upper bound $O(\frac1\epsilon\ln\frac1\epsilon)$ for the number of weights and computation units, but in this special case there is in fact a better bound $O(\frac1\epsilon)$ obtained simply by piece-wise interpolation.
Namely, given $f\in F_{1,1}$ and $\epsilon>0$, set $T=\lceil\frac1\epsilon\rceil$ and let $\tilde f$ be the piece-wise linear interpolation of $f$ with $T+1$ uniformly spaced breakpoints $\frac tT,\ t=0,\dots,T$ (i.e., $\tilde f(\frac tT)=f(\frac tT)$). The function $\tilde f$ is also Lipschitz with constant 1 and hence $\|f-\tilde f\|_\infty\le\epsilon$ (since for any $x\in[0,1]$ we can find $t$ such that $|x-\frac tT|\le\frac{1}{2T}$ and then $|f(x)-\tilde f(x)|\le|f(x)-f(\frac tT)|+|\tilde f(\frac tT)-\tilde f(x)|\le\frac1T\le\epsilon$). At the same time, the function $\tilde f$ can be expressed in terms of the ReLU function $\sigma$ by $$\tilde f(x)=b+\sum_{t=0}^{T-1}c_t\,\sigma\Big(x-\frac tT\Big)$$ with some coefficients $b$ and $c_0,\dots,c_{T-1}$. This expression can be viewed as a special case of the depth-3 ReLU network with $O(\frac1\epsilon)$ weights and computation units.
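For comparison, here is a sketch (ours; the test function is an arbitrary Lipschitz-1 example) of this baseline: the piecewise linear interpolant on $T$ equal subintervals written as a depth-3 ReLU network $\tilde f(x)=f(0)+\sum_k c_k\sigma(x-\frac kT)$.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
f = lambda x: np.sin(3 * x) / 3                   # a test function with Lipschitz constant <= 1

def interpolant_coeffs(f, T):
    knots  = np.arange(T + 1) / T
    slopes = T * np.diff(f(knots))                # slope of the interpolant on each piece
    return np.concatenate([[slopes[0]], np.diff(slopes)])   # one coefficient per ReLU unit

x = np.linspace(0.0, 1.0, 10001)
for T in [4, 16, 64, 256]:
    c = interpolant_coeffs(f, T)
    f_tilde = f(0.0) + sum(ck * relu(x - k / T) for k, ck in enumerate(c))
    print(T, np.max(np.abs(f_tilde - f(x))))      # at most 1/(2T) for any Lipschitz(1) f
```

For a generic Lipschitz function the error of this fixed-architecture baseline decays like $1/T$, so accuracy $\epsilon$ costs about $1/\epsilon$ units; Theorem 2 below improves on this by a logarithmic factor using function-dependent architectures.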
We show now how the bound can be improved by using adaptive architectures.
Theorem 2
For any $f\in F_{1,1}$ and $\epsilon\in(0,\frac12)$, there exists a depth-6 ReLU network $\eta$ (with architecture depending on $f$) that provides an $\epsilon$-approximation of $f$ while having not more than $\frac{c}{\epsilon\ln(1/\epsilon)}$ weights, connections and computation units. Here $c$ is an absolute constant.
Proof
We first explain the idea of the proof. We start with interpolating $f$ by a piece-wise linear function, but not on the length scale $\sim\epsilon$; instead, we do it on a coarser length scale $\sim m\epsilon$, with some $m\gg1$. We then create a "cache" of auxiliary subnetworks that we use to fill in the details and go down to the scale $\sim\epsilon$, in each of the coarse subintervals. This allows us to reduce the amount of computations for small $\epsilon$ because the complexity of the cache only depends on $m$. The assignment of cached subnetworks to the subintervals is encoded in the network architecture and depends on the function $f$. We optimize $m$ by balancing the complexity of the cache with that of the initial coarse approximation. This leads to $m\sim\ln(1/\epsilon)$ and hence to the reduction of the total complexity of the network by a factor $\sim\ln(1/\epsilon)$ compared to the simple piece-wise linear approximation on the scale $\epsilon$. This construction is inspired by a similar argument used to prove the $O(\frac{2^n}{n})$ upper bound for the complexity of Boolean circuits implementing $n$-ary functions (Shannon, 1949).
The proof becomes simpler if, in addition to the ReLU function $\sigma$, we are allowed to use the threshold activation function $$x\mapsto\mathbb{1}[x\ge0]\qquad(19)$$ in our neural network. Since this function is discontinuous, we cannot just use Proposition 1 to replace threshold units by $\sigma$-units. We will first prove the analog of the claimed result for the model including threshold units, and then we will show how to construct a purely ReLU network.
Lemma 1
For any $f\in F_{1,1}$ and $\epsilon\in(0,\frac12)$, there exists a depth-5 network including $\sigma$-units and threshold units, that provides an $\epsilon$-approximation of $f$ while having not more than $\frac{c}{\epsilon\ln(1/\epsilon)}$ weights, where $c$ is an absolute constant.
Proof
Given , we will construct an approximation to in the form Here, is the piece-wise linear interpolation of with the breakpoints , for some positive integer to be chosen later. Since is Lipschitz with constant 1, is also Lipschitz with constant 1. We will denote by the intervals between the breakpoints: We will now construct as an approximation to the difference (20)Note that vanishes at the endpoints of the intervals : (21)and is Lipschitz with constant 2: (22)since and are Lipschitz with constant 1.
To define , we first construct a set of cached functions. Let be a positive integer to be chosen later. Let be the set of piecewise linear functions with the breakpoints and the properties and Note that the size of is not larger than .
If is any Lipschitz function with constant 2 and , then can be approximated by some with error not larger than : namely, take .
Moreover, if is defined by (20), then, using (21), (22), on each interval the function can be approximated with error not larger than by a properly rescaled function . Namely, for each we can define the function by . Then it is Lipschitz with constant 2 and , so we can find such that This can be equivalently written as Note that the obtained assignment is not injective, in general ( will be larger than ).
We can then define on the whole by (23)This approximates with error on : (24)and hence, by (20), for the full approximation we will also have (25)Note that the approximation has properties analogous to (21), (22) : (26)
(27)in particular, is continuous on .
We will now rewrite in a different form interpretable as a computation by a neural network. Specifically, using our additional activation function given by (19), we can express as (28)Indeed, given , observe that all the terms in the inner sum vanish except for the one corresponding to the determined by the condition . For this particular we have . Since , we conclude that (28) agrees with (23).
Let us also expand over the basis of shifted ReLU functions: Substituting this expansion in (28), we finally obtain (29)
Now consider the implementation of by a neural network. The term can clearly be implemented by a depth-3 ReLU network using connections and computation units. The term can be implemented by a depth-5 network with - and -units as follows (we denote a computation unit by with a superscript indexing the layer and a subscript indexing the unit within the layer).
- 1. The first layer contains the single input unit.
- 2. The second layer contains units computing
- 3. The third layer contains units computing . This is equivalent to , because .
- 4. The fourth layer contains units computing
- 5. The final layer consists of a single output unit
Examining this network, we see that the total number of connections and units in it is and hence is . This also holds for the full network implementing , since the term requires even fewer layers, connections and units. The output units of the subnetworks for and can be merged into the output unit for , so the depth of the full network is the maximum of the depths of the networks implementing and , i.e., is 5 (see Fig. 4).
We show now how to modify the constructed network so as to remove -units. We only need to modify the part of the network. We will show that for any we can replace with a function (defined below) that
- (a) obeys the following analog of approximation bound (24): (30)
- (b) and is implementable by a depth-6 ReLU network having complexity with an absolute constant independent of .
Since can be taken arbitrarily small, the theorem then follows by arguing as in Lemma 1, only with replaced by .
As a first step, we approximate by a continuous piece-wise linear function , with a small : Let be defined as in (29), but with replaced by : Since is a continuous piece-wise linear function with three breakpoints, we can express it via the ReLU function, and hence implement by a purely ReLU network, as in Proposition 1, and the complexity of the implementation does not depend on . However, replacing with affects the function on the intervals , introducing there a large error (of magnitude ). But recall that both and vanish at the points , by (21), (26). We can then largely remove this newly introduced error by simply suppressing near the points .
Precisely, consider the continuous piece-wise linear function and the full comb-like filtering function Note that is continuous piece-wise linear with breakpoints, and . We then define our final modification of as (31)
Lemma 2
The function defined in (31) obeys the bound (30).
Proof
Given , let and be determined from the representation (i.e., is the relative position of in the respective interval ). Consider several possibilities for :
- 1. . In this case . Note that (32)because, by construction, , and by (26), (27). It follows that both terms in (31) vanish, i.e., . But, since is Lipschitz with constant 2 by (22) and , we have . This implies .
- 2. . In this case and . Using (32), we find that . It follows that .
- 3. . In this case . Since is Lipschitz with constant 1, . Both and are Lipschitz with constant 2 (by (22), (27)) and vanish at and (by (21), (26)). It follows that
It remains to verify the complexity property (b) of the function . As already mentioned, can be implemented by a depth-5 purely ReLU network with not more than weights, connections and computation units, where is an absolute constant independent of . The function can be implemented by a shallow, depth-3 network with units and connection. Then, computation of can be implemented by a network including two subnetworks for computing and , and an additional layer containing two -units as written in (31). We thus obtain 6 layers in the resulting full network and, choosing and in the same way as in Lemma 1, obtain the bound for the number of its connections, weights, and computation units. □
4. Lower bounds
4.1. Continuous nonlinear widths
The method of continuous nonlinear widths (DeVore et al., 1989) is a very general approach to the analysis of parameterized nonlinear approximations, based on the assumption of continuous selection of their parameters. We are interested in the following lower bound for the complexity of approximations in $\mathcal{W}^{n,\infty}([0,1]^d)$.
Theorem 3
DeVore et al. (1989), Theorem 4.2
Fix $d,n$. Let $W$ be a positive integer and $\eta:\mathbb{R}^W\to L^\infty([0,1]^d)$ be any mapping between the space $\mathbb{R}^W$ and the space $L^\infty([0,1]^d)$. Suppose that there is a continuous map $\mathcal{M}:F_{d,n}\to\mathbb{R}^W$ such that $\|f-\eta(\mathcal{M}(f))\|_\infty\le\epsilon$ for all $f\in F_{d,n}$. Then $W\ge c\epsilon^{-d/n}$, with some constant $c$ depending only on $d$ and $n$.
We apply this theorem by taking $\eta$ to be some ReLU network architecture, and $\mathbb{R}^W$ the corresponding weight space. It follows that if a ReLU network architecture is capable of expressing any function from $F_{d,n}$ with error $\epsilon$, then, under the hypothesis of continuous weight selection, the network must have at least $c\epsilon^{-d/n}$ weights. The number of connections is then lower bounded by $\frac c2\epsilon^{-d/n}$ (since the number of weights is not larger than the sum of the number of computation units and the number of connections, and there are at least as many connections as units).
The hypothesis of continuous weight selection is crucial in Theorem 3. By examining our proof of the counterpart upper bound in Theorem 1, one can check that the weights are selected there in a continuous manner, so this upper bound $O(\epsilon^{-d/n}\ln(1/\epsilon))$ asymptotically lies above the lower bound $c\epsilon^{-d/n}$, in agreement with Theorem 3. We remark, however, that the optimal choice of the network weights (minimizing the error) is known to be discontinuous in general, even for shallow networks (Kainen, Kůrková, & Vogt, 1999).
We also compare the bounds of Theorem 2 and Theorem 3. In the case $d=n=1$, Theorem 3 provides a lower bound $c\epsilon^{-1}$ for the number of weights and connections. On the other hand, in the adaptive architecture scenario, Theorem 2 provides the upper bound $O(\frac{1}{\epsilon\ln(1/\epsilon)})$ for the number of weights, connections and computation units. The fact that this latter bound is asymptotically below the bound of Theorem 3 reflects the extra expressiveness associated with variable network architecture.
4.2. Bounds based on VC-dimension
In this section we consider the setup where the same network architecture is used to approximate all functions $f\in F_{d,n}$, but the dependence of the weights on $f$ is not assumed to be necessarily continuous. In this setup, some lower bounds on the network complexity can be obtained as a consequence of existing upper bounds on VC-dimension of networks with piece-wise polynomial activation functions and Boolean outputs (Anthony & Bartlett, 2009). In the next theorem, part (a) is a more general but weaker bound, while part (b) is a stronger bound assuming a constrained growth of the network depth.
Theorem 4
Fix $d,n$.
- (a) For any $\epsilon\in(0,1)$, a ReLU network architecture capable of approximating any function $f\in F_{d,n}$ with error $\epsilon$ must have at least $c\epsilon^{-d/(2n)}$ weights, with some constant $c=c(d,n)>0$.
- (b) Let $p\ge0$ and $c_1>0$ be some constants. For any $\epsilon\in(0,\frac12)$, if a ReLU network architecture of depth $L\le c_1\ln^p(1/\epsilon)$ is capable of approximating any function $f\in F_{d,n}$ with error $\epsilon$, then the network must have at least $c_2\epsilon^{-d/n}\ln^{-2p-1}(1/\epsilon)$ weights, with some constant $c_2=c_2(d,n,p,c_1)>0$.
Proof
Recall that given a class of Boolean functions on , the VC-dimension of is defined as the size of the largest shattered subset , i.e. the largest subset on which can compute any dichotomy (see, e.g., (Anthony & Bartlett, 2009), Section 3.3). We are interested in the case when is the family of functions obtained by applying thresholds to a ReLU network with fixed architecture but variable weights. In this case Theorem 8.7 of Anthony and Bartlett (2009) implies that (33)and Theorem 8.8 implies that (34)where is the number of weights, is the network depth, and is an absolute constant.
Given a positive integer to be chosen later, choose as a set of points in the cube such that the distance between any two of them is not less than . Given any assignment of values , we can construct a smooth function satisfying for all by setting (35)with some function such that and if .
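The following sketch (ours, in $d=1$ with $n=2$; the bump $\phi$, the values of $n$, $N$, and the threshold are illustrative choices, not the ones fixed in (35)–(37)) shows the idea of this construction: scaled bumps placed on a grid realize any prescribed Boolean pattern at the grid points, while staying in a ball of the Sobolev norm whose radius depends only on the bump.

```python
import numpy as np

n, N = 2, 16
phi = lambda t: np.where(np.abs(t) < 1,
                         np.exp(-1.0 / np.maximum(1e-12, 1 - t ** 2)), 0.0)  # smooth bump, support (-1, 1)

rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=N + 1)                    # an arbitrary Boolean assignment

def f(x):                                             # f = N**(-n) * sum_m z_m * phi(N (x - m/N))
    return sum(N ** (-n) * z[m] * phi(N * (x - m / N)) for m in range(N + 1))

grid = np.arange(N + 1) / N
# f(m/N) = N**(-n) * phi(0) exactly when z_m = 1, and 0 otherwise, so thresholding
# the values at the grid points recovers the whole assignment z:
recovered = [f(g) > 0.5 * N ** (-n) * phi(0.0) for g in grid]
print(all(bool(r) == bool(zm) for r, zm in zip(recovered, z)))   # True
```

Since every pattern $z$ is realized by some such $f$ of bounded norm, thresholded networks that approximate all of them must shatter the grid, which is how the VC-dimension bounds enter the proof.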
Let us obtain a condition ensuring that such . For any multi-index , so if (36)with the constant , then .
Now set (37)Suppose that there is a ReLU network architecture that can approximate, by adjusting its weights, any with error less than . Denote by the output of the network for the input vector and the vector of weights .
Consider any assignment of Boolean values . Set and let be given by (35) (see Fig. 5); then (36) holds and hence .
By assumption, there is then a vector of weights, , such that for all we have , and in particular so the thresholded network has outputs Since the Boolean values were arbitrary, we conclude that the subset is shattered and hence Expressing through with (37), we obtain (38)To establish part (a) of the Theorem, we apply bound (33) to the network : (39)where is the number of weights in , which is the same as in if we do not count the threshold parameter. Combining (38) with (39), we obtain the desired lower bound with .

Fig. 5. A function $f$ considered in the proof of Theorem 4.
To establish part (b) of the theorem,we use bound (34) and the hypothesis : (40)Combining (38) with (40), we obtain (41)Trying a of the form with a constant , we get Comparing this with (41), we see that if we choose , then for sufficiently small we have and hence , as claimed. We can ensure that for all by further decreasing . □
We remark that the constrained depth hypothesis of part (b) is satisfied, with $p=1$, by the architecture used for the upper bound in Theorem 1. The bound stated in part (b) of Theorem 4 matches the upper bound of Theorem 1 and the lower bound of Theorem 3 up to a power of $\ln(1/\epsilon)$.
4.3. Adaptive network architectures
Our goal in this section is to obtain a lower bound for the approximation complexity in the scenario where the network architecture may depend on the approximated function. This lower bound is thus a counterpart to the upper bound of Section 3.3.
To state this result we define the complexity $\mathcal{N}(f,\epsilon)$ of approximating the function $f$ with error $\epsilon$ as the minimal number of hidden computation units in a ReLU network that provides such an approximation.
Theorem 5
For any $d,n$, there exists $f\in\mathcal{W}^{n,\infty}([0,1]^d)$ such that $\mathcal{N}(f,\epsilon)$ is not $o(\epsilon^{-d/(9n)})$ as $\epsilon\to0$.
The proof relies on the following lemma.
Lemma 3
Fix $d,n$. For any sufficiently small $\epsilon>0$ there exists $f\in F_{d,n}$ such that $\mathcal{N}(f,\epsilon)\ge c\epsilon^{-d/(8n)}$, with some constant $c=c(d,n)>0$.
Proof
Observe that all the networks with not more than $m$ hidden computation units can be embedded in the single "enveloping" network that has $m$ hidden layers, each consisting of $m$ units, and that includes all the connections between units not in the same layer (see Fig. 6(a)). The number of weights in this enveloping network is $O(m^4)$. On the other hand, Theorem 4(a) states that at least $c\epsilon^{-d/(2n)}$ weights are needed for an architecture capable of $\epsilon$-approximating any function in $F_{d,n}$. It follows that there is a function $f\in F_{d,n}$ that cannot be $\epsilon$-approximated by networks with fewer than $c_1\epsilon^{-d/(8n)}$ computation units. □
Before proceeding to the proof of Theorem 5, note that $\mathcal{N}(f,\epsilon)$ is a monotone decreasing function of $\epsilon$ with a few obvious properties: (42) (follows by multiplying the weights of the output unit of the approximating network by a constant); (43) (follows by approximating $f$ by an approximation of $g$); (44) (follows by combining approximating networks for $f$ and $g$ as in Fig. 6(b)).

Fig. 6. (a) Embedding a network with $m$ hidden units into an "enveloping" network (see Lemma 3). (b) Putting sub-networks in parallel to form an approximation for the sum or difference of two functions, see Eq. (44).
Proof of Theorem 5
The claim of Theorem 5 is similar to the claim of Lemma 3, but is about a single function satisfying a slightly weaker complexity bound at multiple values of . We will assume that Theorem 5 is false, i.e., (45)for all , and we will reach contradiction by presenting violating this assumption. Specifically, we construct this as (46)with some , , and we will make sure that (47)for a sequence of .
We determine sequentially. Suppose we have already found ; let us describe how we define .
First, we set (48)In particular, this ensures that so that the function defined by the series (46) will be in , because .
Next, using Lemma 3 and Eq. (42), observe that if is sufficiently small, then we can find such that (49)In addition, by assumption (45), if is small enough then (50)Let us choose and so that both (49), (50) hold. Obviously, we can also make sure that as .
Let us check that the above choice of ensures that inequality (47) holds for all : Here in the first step we use inequality (43), in the second the monotonicity of , in the third the monotonicity of and the setting (48), in the fourth the inequality (44), and in the fifth the conditions (49), (50). □
4.4. Slow approximation of smooth functions by shallow networks
In this section we show that, in contrast to deep ReLU networks, shallow ReLU networks relatively inefficiently approximate sufficiently smooth ($C^2$) nonlinear functions. We remark that Liang and Srikant (2016) prove a similar result assuming global convexity instead of smoothness and nonlinearity.
Theorem 6
Let $f\in C^2([0,1]^d)$ be a nonlinear function (i.e., not of the form $f(\mathbf x)\equiv\mathbf a\cdot\mathbf x+b$ on the whole $[0,1]^d$). Then, for any fixed $L$, a depth-$L$ ReLU network approximating $f$ with error $\epsilon\in(0,1)$ must have at least $c\epsilon^{-1/(2(L-2))}$ weights and computation units, with some constant $c=c(f,L)>0$.
Proof
Since and is nonlinear, we can find and such that for all and the function is strictly convex or concave on . Suppose without loss of generality that it is strictly convex: (51)Suppose that is an -approximation of function , and let be implemented by a ReLU network of depth . Let . Then also approximates with error not larger than . Moreover, since is obtained from by a linear substitution , can be implemented by a ReLU network of the same depth and with the number of units and weights not larger than in (we can obtain from by replacing the input layer in with a single unit, accordingly modifying the input connections, and adjusting the weights associated with these connections). It is thus sufficient to establish the claimed bounds for .
By construction, $\tilde f_1$ is a continuous piece-wise linear function of $x_1$. Denote by $M$ the number of linear pieces in $\tilde f_1$. We will use the following counting lemma.
Lemma 4
$M\le(2U)^{L-2}$, where $U$ is the number of computation units in the network computing $\tilde f_1$.
Proof
This bound, up to minor details, is proved in Lemma 2.1 of Telgarsky (2015). Precisely, Telgarsky's lemma states that if a network has a single input, connections only between neighboring layers, at most $m$ units in a layer, and a piece-wise linear activation function with $p$ pieces, then the number of linear pieces in the output of the network is not greater than $(pm)^L$. By examining the proof of the lemma we see that it will remain valid for networks with connections not necessarily between neighboring layers, if we replace $m$ by $U$ in the expression $(pm)^L$. Moreover, we can slightly strengthen the bound by noting that in the present paper the input and output units are counted as separate layers, only units of layers 3 to $L$ have multiple incoming connections, and the activation function is applied only in layers 2 to $L-1$. By following Telgarsky's arguments, this gives the slightly more accurate bound $(pU)^{L-2}$ (i.e., with the power $L-2$ instead of $L$). It remains to note that the ReLU activation function corresponds to $p=2$. □
Lemma 4 implies that there is an interval $[a,b]$ of length not less than $(2U)^{-(L-2)}$ on which the function $\tilde f_1$ is linear. Then, by the approximation accuracy assumption, $|\tilde f_1-f_1|\le\epsilon$ on this interval, while by (51) and by the linearity of $\tilde f_1$ on $[a,b]$, $\max_{[a,b]}|\tilde f_1-f_1|\ge c_1(b-a)^2$ with some constant $c_1=c_1(f)>0$. It follows that $\epsilon\ge c_1(2U)^{-2(L-2)}$ and hence $U\ge c\epsilon^{-1/(2(L-2))}$, which implies the claimed bound. Since there are at least as many weights as computation units in a network, a similar bound holds for the number of weights. □
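A numerical sketch (ours) of the convexity part of this argument for the model case $f(x)=x^2$: any function that is linear on a subinterval of length $h$ misses $x^2$ there by at least $h^2/8$, so a continuous piecewise linear approximation with $M$ pieces of $[0,1]$ cannot have uniform error below about $1/(8M^2)$; accuracy $\epsilon$ therefore needs on the order of $\epsilon^{-1/2}$ linear pieces, which Lemma 4 converts into a lower bound on the number of units of a fixed-depth network.

```python
import numpy as np

f = lambda x: x ** 2
x = np.linspace(0.0, 1.0, 200001)

def equal_piece_error(M):
    # per-piece optimal linear fit of x**2: the chord over the piece, shifted
    # down by half of the maximal chord-minus-function gap (equioscillation)
    worst = 0.0
    for k in range(M):
        a, b = k / M, (k + 1) / M
        xs = x[(x >= a) & (x <= b)]
        chord = f(a) + (f(b) - f(a)) / (b - a) * (xs - a)
        gap = np.max(chord - f(xs))                   # equals (b - a)**2 / 4 for x**2
        worst = max(worst, np.max(np.abs(chord - gap / 2 - f(xs))))
    return worst

for M in [2, 4, 8, 16, 32]:
    print(M, equal_piece_error(M), 1 / (8 * M ** 2))  # observed error vs 1/(8 M^2)
```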
5. Discussion
We discuss some implications of the obtained bounds.
Deep vs. shallow ReLU approximations of smooth functions
Our results clearly show that deep ReLU networks more efficiently express smooth functions than shallow ReLU networks. By Theorem 1, functions from the Sobolev space can be -approximated by ReLU networks with depth and the number of computation units . In contrast, by Theorem 6, a nonlinear function from cannot be -approximated by a ReLU network of fixed depth with the number of units less than . In particular, it follows that in terms of the required number of computation units, unbounded-depth approximations of functions from are asymptotically strictly more efficient than approximations with a fixed depth at least when (assuming also , so that ). The efficiency of depth is even more pronounced for very smooth functions such as polynomials, which can be implemented by deep networks using only units (cf. Proposition 2, Proposition 3 and the proof of Theorem 1). Liang and Srikant describe in Liang and Srikant (2016) some conditions on the approximated function (resembling conditions of local analyticity) under which complexity of deep -approximation is with a constant power .
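For intuition on why very smooth functions are so cheap for deep networks, the following sketch (my own illustration, assuming NumPy) implements the well-known "sawtooth" squaring construction of the kind underlying Proposition 2: the identity $x^2 = x - \sum_{s\ge 1} g_s(x)/2^{2s}$, where $g_s$ is the $s$-fold composition of a hat function $g$, yields an approximation of $x^2$ with error $2^{-2m-2}$ using $O(m)$ ReLU units, i.e. $O(\ln(1/\epsilon))$ units for accuracy $\epsilon$:

# Deep ReLU approximation of x^2 on [0, 1] via the sawtooth construction.
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def hat(x):
    # g(x) = 2*relu(x) - 4*relu(x - 1/2) + 2*relu(x - 1): triangular "hat" on [0, 1],
    # realized by three ReLU units.
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def square_approx(x, m):
    """Depth-O(m) ReLU approximation of x^2 on [0, 1] with error at most 2^(-2m-2)."""
    g, total = x, x
    for s in range(1, m + 1):
        g = hat(g)                      # g_s(x), one additional ReLU layer
        total = total - g / 4.0 ** s    # skip connection with weight 2^(-2s)
    return total

xs = np.linspace(0.0, 1.0, 10001)
for m in [2, 4, 6, 8]:
    err = np.max(np.abs(square_approx(xs, m) - xs ** 2))
    print(f"m = {m}:  max error = {err:.2e}  (bound 2^(-2m-2) = {2.0 ** (-2 * m - 2):.2e})")

Squaring gives multiplication via $xy = \frac{1}{2}((x+y)^2 - x^2 - y^2)$, and from there polynomials and local Taylor approximations, which is the standard route to the logarithmic unit counts mentioned above.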
Continuous model selection vs. function-dependent network architectures
When approximating a function by a neural network, one can either view the network architecture as fixed and only tune the weights, or optimize the architecture as well. Moreover, when tuning the weights, one can either require them to continuously depend on the approximated function or not. We naturally expect that more freedom in the choice of the approximation should lead to higher expressiveness.
Our bounds confirm this expectation to a certain extent. Specifically, the complexity of $\epsilon$-approximation of functions from the unit ball in $\mathcal{W}^{1,\infty}([0,1]^d)$ is lower bounded by $c\epsilon^{-d}$ in the scenario with a fixed architecture and continuously selected weights (see Theorem 3). On the other hand, we show in Theorem 2 that this complexity is upper bounded by $O(\epsilon^{-d/2})$, up to a logarithmic factor, if we are allowed to adjust the network architecture. This bound is achieved by finite-depth (depth-6) ReLU networks using the idea of reused subnetworks familiar from the theory of Boolean circuits (Shannon, 1949).
In the case of fixed architecture, we have not established any evidence of complexity improvement for unconstrained weight selection compared to continuous weight selection. We remark, however, that already for approximations with depth-3 networks the optimal weights are known to depend discontinuously, in general, on the approximated function (Kainen et al., 1999). On the other hand, part b) of Theorem 4 shows that if the network depth scales as $O(\ln(1/\epsilon))$, discontinuous weight selection cannot improve the continuous-case complexity by more than a factor that is some power of $\ln(1/\epsilon)$.
Upper vs. lower complexity bounds
We indicate the gaps between respective upper and lower bounds in the three scenarios mentioned above: fixed architectures with continuous selection of weights, fixed architectures with unconstrained selection of weights, or adaptive architectures.
For fixed architectures with continuous selection, the lower bound $c\epsilon^{-d/n}$ is provided by Theorem 3 and the upper bound $c\epsilon^{-d/n}\ln(1/\epsilon)$ by Theorem 1, so these bounds are tight up to a factor of $\ln(1/\epsilon)$.
In the case of a fixed architecture but unconstrained weight selection, part b) of Theorem 4 gives a lower bound of order $\epsilon^{-d/n}$ divided by a power of $\ln(1/\epsilon)$, under the assumption that the depth is constrained by $O(\ln(1/\epsilon))$. This differs from the upper bound of Theorem 1 only by a factor that is a power of $\ln(1/\epsilon)$. Without this depth constraint we only have the significantly weaker lower bound $c\epsilon^{-d/(2n)}$ (part a) of Theorem 4).
In the case of adaptive architectures, there is a big gap between our upper and lower bounds. The upper bound, of order $\epsilon^{-d/2}$, is given by Theorem 2 for $n = 1$. The lower bound, proved for general $n$ in Theorem 5, only states that there are functions in the unit ball of $\mathcal{W}^{n,\infty}([0,1]^d)$ for which the complexity is not $o(\epsilon^{-d/(2n)})$.
ReLU vs. smooth activation functions
A popular general-purpose method of approximation is shallow (depth-3) networks with smooth activation functions (e.g., the logistic sigmoid). Upper and lower approximation complexity bounds for these networks (Maiorov and Meir (2000), Mhaskar (1996)) show that the complexity scales as $\epsilon^{-d/n}$ up to logarithmic factors. Comparing this with our bounds in Theorems 1, 2 and 4, it appears that deep ReLU networks are roughly (up to logarithmic factors) as expressive as shallow networks with smooth activation functions.
6. Conclusion
We have established several upper and lower bounds for the expressive power of deep ReLU networks in the context of approximation in Sobolev spaces. We should note, however, that this setting may not quite reflect typical real world applications, which usually possess symmetries and hierarchical and other structural properties substantially narrowing the actually interesting classes of approximated functions (LeCun et al., 2015). Some recent publications introduce and study expressive power of deep networks in frameworks bridging this gap, in particular, graph-based hierarchical approximations are studied in Mhaskar et al. (2016), Mhaskar and Poggio (2016) and convolutional arithmetic circuits in Cohen, Sharir, and Shashua (2015). Theoretical analysis of expressiveness of deep networks taking into account such properties of real data seems to be an important and promising direction of future research.
Acknowledgments
The author thanks Matus Telgarsky and the anonymous referees for multiple helpful comments on the preliminary versions of the paper. The research was funded by Skolkovo Institute of Science and Technology
(project “Recognition of sparse geometric data”, 2016).
References
- Anthony and Bartlett (2009). Neural network learning: Theoretical foundations. Cambridge University Press.
- Bartlett et al. (1998). Almost linear VC-dimension bounds for piecewise polynomial networks. Neural Computation, 10(8), 2159-2173.
- Bianchini and Scarselli (2014). On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8), 1553-1565.
- Cohen, Sharir, and Shashua (2015). On the expressive power of deep learning: A tensor analysis. arXiv preprint arXiv:1509.05009.
- Dawson and Nielsen (2006). The Solovay-Kitaev algorithm. Quantum Information & Computation, 6(1), 81-95.
- Delalleau and Bengio (2011). Shallow vs. deep sum-product networks. Advances in Neural Information Processing Systems, 666-674.
- DeVore, Howard, and Micchelli (1989). Optimal nonlinear approximation. Manuscripta Mathematica, 63(4), 469-478.
- Goldberg and Jerrum (1995). Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning, 18(2-3), 131-148.
- Kainen et al. (1999). Approximation by neural networks is not continuous. Neurocomputing, 29(1), 47-56.
- Kearns and Schapire (1990). Efficient distribution-free learning of probabilistic concepts. Proceedings of the 31st Annual Symposium on Foundations of Computer Science, IEEE, 382-391.
- Kitaev et al. (2002). Classical and quantum computation. American Mathematical Society, Boston, MA, USA.
- LeCun et al. (2015). Deep learning. Nature, 521(7553), 436-444.
- Liang and Srikant (2016). Why deep neural networks? arXiv preprint arXiv:1610.04161.
- Maiorov and Meir (2000). On the near optimality of the stochastic approximation of smooth functions by neural networks. Advances in Computational Mathematics, 13(1), 79-103.
- Mhaskar (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8(1), 164-177.
- Mhaskar (1993). Approximation properties of a multilayered feedforward artificial neural network. Advances in Computational Mathematics, 1(1), 61-80.
- Mhaskar, Liao, and Poggio (2016). Learning real and Boolean functions: When is deep better than shallow. arXiv preprint arXiv:1603.00988.
- Mhaskar and Poggio (2016). Deep vs. shallow networks: An approximation theory perspective. arXiv preprint arXiv:1608.03287.
- Montufar et al. (2014). On the number of linear regions of deep neural networks. Advances in Neural Information Processing Systems, 2924-2932.
- Pinkus (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8, 143-195.
- Raghu et al. (2016). On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336.
- Sakurai (1999). Tight bounds for the VC-dimension of piecewise polynomial networks. Advances in Neural Information Processing Systems, 323-329.
- Schmidhuber (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
- Shannon (1949). The synthesis of two-terminal switching circuits. Bell Labs Technical Journal, 28(1), 59-98.
- Telgarsky (2015). Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101.
- Telgarsky (2016). Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485.
- Vapnik and Chervonenkis (2015). On the uniform convergence of relative frequencies of events to their probabilities. In Measures of Complexity, Springer, 11-30.
- Warren (1968). Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133(1), 167-178.
¹ The author thanks Matus Telgarsky for suggesting this part of the theorem.
© 2017 Elsevier Ltd. All rights reserved.