Neural Networks
Volume 94, October 2017, Pages 103-114

Error bounds for approximations with deep ReLU networks

https://doi.org/10.1016/j.neunet.2017.07.002

Abstract

We study the expressive power of shallow and deep neural networks with piece-wise linear activation functions. We establish new rigorous upper and lower bounds for the network complexity in the setting of approximations in Sobolev spaces. In particular, we prove that deep ReLU networks more efficiently approximate smooth functions than shallow networks. In the case of approximations of 1D Lipschitz functions we describe adaptive depth-6 network architectures that are more efficient than the standard shallow architecture.

Keywords

Deep ReLU networks
Approximation complexity

1. Introduction

Recently, multiple successful applications of deep neural networks to pattern recognition problems (LeCun et al., 2015; Schmidhuber, 2015) have revived active interest in theoretical properties of such networks, in particular their expressive power. It has been argued that deep networks may be more expressive than shallow ones of comparable size (see, e.g., Bianchini and Scarselli (2014), Delalleau and Bengio (2011), Montufar et al. (2014), Raghu et al. (2016), Telgarsky (2015)). In contrast to a shallow network, a deep one can be viewed as a long sequence of non-commutative transformations, which is a natural setting for high expressiveness (cf. the well-known Solovay–Kitaev theorem on fast approximation of arbitrary quantum operations by sequences of non-commutative gates; see Dawson and Nielsen (2006), Kitaev et al. (2002)).

There are various ways to characterize expressive power of networks. Delalleau and Bengio (2011) consider sum–product networks and prove for certain classes of polynomials that they are much more easily represented by deep networks than by shallow networks. Montufar et al. (2014) estimate the number of linear regions in the network’s landscape. Bianchini and Scarselli (2014) give bounds for Betti numbers characterizing topological properties of functions represented by networks. Telgarsky (2015), Telgarsky (2016) provides specific examples of classification problems where deep networks are provably more efficient than shallow ones.

In the context of classification problems, a general and standard approach to characterizing expressiveness is based on the notion of the Vapnik–Chervonenkis dimension (Vapnik & Chervonenkis, 2015). There exist several bounds for the VC-dimension of deep networks with piece-wise polynomial activation functions that go back to geometric techniques of Goldberg and Jerrum (1995) and earlier results of Warren (1968); see Bartlett et al. (1998), Sakurai (1999) and the book (Anthony & Bartlett, 2009). There is a related concept, the fat-shattering dimension, for real-valued approximation problems (Anthony & Bartlett, 2009; Kearns & Schapire, 1990).

A very general approach to expressiveness in the context of approximation is the method of nonlinear widths by DeVore, Howard, and Micchelli (1989) that concerns approximation of a family of functions under assumption of a continuous dependence of the model on the approximated function.

In this paper we examine the problem of shallow-vs-deep expressiveness from the perspective of approximation theory and general spaces of functions having derivatives up to a certain order (Sobolev-type spaces). In this framework, the problem of expressiveness is very well studied in the case of shallow networks with a single hidden layer, where it is known, in particular, that to approximate a C^n-function on a d-dimensional set with infinitesimal error ϵ one needs a network of size about ϵ^{-d/n}, assuming a smooth activation function (see, e.g., Mhaskar (1996), Pinkus (1999) for a number of related rigorous upper and lower bounds and further qualifications of this result). Much less seems to be known about deep networks in this setting, though Mhaskar, Liao, and Poggio (2016), Mhaskar and Poggio (2016) have recently introduced functional spaces constructed using deep dependency graphs and obtained expressiveness bounds for related deep networks.

We will focus our attention on networks with the ReLU activation function σ(x) = max(0, x), which, despite its utter simplicity, seems to be the most popular choice in practical applications (LeCun et al., 2015). We will consider the L^∞-error of approximation of functions belonging to the Sobolev spaces W^{n,∞}([0,1]^d) (without any assumptions of hierarchical structure). We will often consider families of approximations, as the approximated function runs over the unit ball F_{d,n} in W^{n,∞}([0,1]^d). In such cases we will distinguish scenarios of fixed and adaptive network architectures. Our goal is to obtain lower and upper bounds on the expressiveness of deep and shallow networks in different scenarios. We measure the complexity of networks in a conventional way, by counting the number of their weights and computation units (cf. Anthony and Bartlett (2009)).

The main body of the paper consists of Sections 2 (The ReLU network model), 3 (Upper bounds) and 4 (Lower bounds).

In Section 2 we describe our ReLU network model and show that the ReLU function is replaceable by any other continuous piece-wise linear activation function, up to constant factors in complexity asymptotics (Proposition 1).

In Section 3 we establish several upper bounds on the complexity of approximating by ReLU networks, in particular showing that deep networks are quite efficient for approximating smooth functions. Specifically:

  • In Section 3.1 we show that the function f(x) = x^2 can be ϵ-approximated by a network of depth and complexity O(ln(1/ϵ)) (Proposition 2). This also leads to similar upper bounds on the depth and complexity that are sufficient to implement an approximate multiplication in a ReLU network (Proposition 3).

  • In Section 3.2 we describe a ReLU network architecture of depth O(ln(1/ϵ)) and complexity O(ϵ^{-d/n} ln(1/ϵ)) that is capable of approximating with error ϵ any function from F_{d,n} (Theorem 1).

  • In Section 3.3 we show that, even with fixed-depth networks, one can further decrease the approximation complexity if the network architecture is allowed to depend on the approximated function. Specifically, we prove that one can ϵ-approximate a given Lipschitz function on the segment [0,1] by a depth-6 ReLU network with O(1/(ϵ ln(1/ϵ))) connections and activation units (Theorem 2). This upper bound is of interest since it lies below the lower bound provided by the method of nonlinear widths under the assumption of continuous model selection (see Section 4.1).

In Section 4 we obtain several lower bounds on the complexity of approximation by deep and shallow ReLU networks, using different approaches and assumptions.

  • In Section 4.1 we recall the general lower bound provided by the method of continuous nonlinear widths. This method assumes that the parameters of the approximation continuously depend on the approximated function, but does not assume anything about how the approximation depends on its parameters. In this setup, at least ∼ϵ^{-d/n} connections and weights are required for an ϵ-approximation on F_{d,n} (Theorem 3). As already mentioned, for d = n = 1 this lower bound is above the upper bound provided by Theorem 2.

  • In Section 4.2 we consider the setup where the same network architecture is used to approximate all functions in F_{d,n}, but the weights are not assumed to continuously depend on the function. In this case, application of existing results on the VC-dimension of deep piece-wise polynomial networks yields an ∼ϵ^{-d/(2n)} lower bound in general, and an ∼ϵ^{-d/n} ln^{-2p-1}(1/ϵ) lower bound if the network depth grows as O(ln^p(1/ϵ)) (Theorem 4).

  • In Section 4.3 we consider an individual approximation, without any assumptions regarding it as an element of a family as in Sections 4.1 and 4.2. We prove that for any d, n there exists a function in W^{n,∞}([0,1]^d) such that its approximation complexity is not o(ϵ^{-d/(9n)}) as ϵ → 0 (Theorem 5).

  • In Section 4.4 we prove that an ϵ-approximation of any nonlinear C^2-function by a network of fixed depth L requires at least ∼ϵ^{-1/(2(L-2))} computation units (Theorem 6). By comparison with Theorem 1, this shows that for sufficiently smooth functions approximation by fixed-depth ReLU networks is less efficient than by unbounded-depth networks.

In Section 5 we discuss the obtained bounds and summarize their implications, in particular comparing deep vs. shallow networks and fixed vs. adaptive architectures.

The arXiv preprint of the first version of the present work appeared almost simultaneously with the work of Liang and Srikant (2016), which contains results partly overlapping with our results in Sections 3.1 (Fast deep approximation of squaring and multiplication), 3.2 (Fast deep approximation of general smooth functions) and 4.4 (Slow approximation of smooth functions by shallow networks). Liang and Srikant consider networks equipped with both ReLU and threshold activation functions. They prove a logarithmic upper bound for the complexity of approximating the function f(x) = x^2, which is analogous to our Proposition 2. Then they extend this upper bound to polynomials and smooth functions. In contrast to our treatment of generic smooth functions based on standard Sobolev spaces, they impose more complex assumptions on the function (including, in particular, how many derivatives it has) that depend on the required approximation accuracy ϵ. As a consequence, they obtain strong O(ln^c(1/ϵ)) complexity bounds rather different from our bound in Theorem 1 (in fact, our lower bound proved in Theorem 5 rules out, in general, such strong upper bounds for functions having only finitely many derivatives). Also, Liang and Srikant prove a lower bound for the complexity of approximating convex functions by shallow networks. Our version of this result, given in Section 4.4, is different in that we assume smoothness and nonlinearity instead of global convexity.


Fig. 1. A feedforward neural network having 3 input units (diamonds), 1 output unit (square), and 7 computation units with nonlinear activation (circles). The network has 4 layers and 16+8=24 weights.

2. The ReLU network model

Throughout the paper, we consider feedforward neural networks with the ReLU (Rectified Linear Unit) activation function σ(x) = max(0, x). The network consists of several input units, one output unit, and a number of "hidden" computation units. Each hidden unit performs an operation of the form

(1)  y = σ(∑_{k=1}^N w_k x_k + b)

with some weights (adjustable parameters) (w_k)_{k=1}^N and b depending on the unit. The output unit is also a computation unit, but without the nonlinearity, i.e., it computes y = ∑_{k=1}^N w_k x_k + b. The units are grouped in layers, and the inputs (x_k)_{k=1}^N of a computation unit in a certain layer are outputs of some units belonging to any of the preceding layers (see Fig. 1). Note that we allow connections between units in non-neighboring layers. Occasionally, when this cannot cause confusion, we may denote the network and the function it implements by the same symbol.
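To make the model concrete, here is a minimal NumPy sketch (ours, not from the paper) of a single hidden unit computing (1) and of a tiny network in which the output unit reads from both the input layer and the hidden layer, illustrating the allowed non-neighboring connections; all weights and sizes are made-up placeholders.

```python
import numpy as np

def relu(x):
    # ReLU activation sigma(x) = max(0, x)
    return np.maximum(0.0, x)

def hidden_unit(inputs, weights, bias):
    # One computation unit: y = sigma(sum_k w_k x_k + b), cf. Eq. (1)
    return relu(np.dot(weights, inputs) + bias)

def output_unit(inputs, weights, bias):
    # The output unit is linear (no nonlinearity)
    return np.dot(weights, inputs) + bias

# 3 input units, 2 hidden units, 1 output unit that also sees the inputs directly
x = np.array([0.2, -0.5, 1.0])
h = np.array([hidden_unit(x, np.array([1.0, -1.0, 0.5]), 0.1),
              hidden_unit(x, np.array([0.3, 0.8, -0.2]), -0.4)])
y = output_unit(np.concatenate([x, h]),
                np.array([0.1, 0.2, -0.3, 1.0, -1.0]), 0.0)
print(y)
```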

The depth of the network, the number of units and the total number of weights are standard measures of network complexity (Anthony & Bartlett, 2009). We will use these measures throughout the paper. The number of weights is, clearly, the sum of the total number of connections and the number of computation units. We identify the depth with the number of layers (in particular, the most common type of neural networks – shallow networks having a single hidden layer – are depth-3 networks according to this convention).

We finish this subsection with a proposition showing that, given our complexity measures, using the ReLU activation function is not much different from using any other piece-wise linear activation function with finitely many breakpoints: one can replace one network by an equivalent one but having another activation function while only increasing the number of units and weights by constant factors. This justifies our restricted attention to the ReLU networks (which could otherwise have been perceived as an excessively particular example of networks).

Proposition 1

Let ρ: R → R be any continuous piece-wise linear function with M breakpoints, where 1 ≤ M < ∞.

  • (a)

    Let ξ be a network with the activation function ρ, having depth L, W weights and U computation units. Then there exists a ReLU network η that has depth L, not more than (M+1)^2 W weights and not more than (M+1)U units, and that computes the same function as ξ.

  • (b)

    Conversely, let η be a ReLU network of depth L with W weights and U computation units. Let D be a bounded subset of R^n, where n is the input dimension of η. Then there exists a network with the activation function ρ that has depth L, 4W weights and 2U units, and that computes the same function as η on the set D.

Proof

(a) Let a_1 < ⋯ < a_M be the breakpoints of ρ, i.e., the points where its derivative is discontinuous: ρ′(a_k+) ≠ ρ′(a_k−). We can then express ρ via the ReLU function σ, as a linear combination

ρ(x) = c_0 σ(a_1 − x) + ∑_{m=1}^M c_m σ(x − a_m) + h

with appropriately chosen coefficients (c_m)_{m=0}^M and h. It follows that the computation performed by a single ρ-unit,

x_1, …, x_N ↦ ρ(∑_{k=1}^N w_k x_k + b),

can be equivalently represented by a linear combination of a constant function and the computations of M+1 σ-units,

x_1, …, x_N ↦ σ(∑_{k=1}^N w_k x_k + b − a_m),  m = 1, …, M,
x_1, …, x_N ↦ σ(a_1 − b − ∑_{k=1}^N w_k x_k),  m = 0

(here m is the index of a σ-unit). We can then replace one-by-one all the ρ-units in the network ξ by σ-units, without changing the output of the network. Obviously, these replacements do not change the network depth. Since each hidden unit gets replaced by M+1 new units, the number of units in the new network is not greater than M+1 times their number in the original network. Note also that the number of connections in the network is multiplied, at most, by (M+1)^2. Indeed, each unit replacement entails replacing each of the incoming and outgoing connections of this unit by M+1 new connections, and each connection is replaced twice: as an incoming and as an outgoing one. These considerations imply the claimed complexity bounds for the resulting σ-network η.
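As a numerical illustration of part (a) (ours, not from the paper), the sketch below expresses a hard-tanh-like piece-wise linear function with M = 2 breakpoints at −1 and 1 as a linear combination of ReLUs; the coefficients c_0 = 0, c_1 = 1, c_2 = −1, h = −1 are worked out by hand for this particular example.

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)

def hard_tanh(x):
    # example piece-wise linear activation rho with breakpoints a1 = -1, a2 = 1
    return np.clip(x, -1.0, 1.0)

def hard_tanh_via_relu(x):
    # rho(x) = c0*sigma(a1 - x) + c1*sigma(x - a1) + c2*sigma(x - a2) + h
    # slopes of rho are 0, 1, 0, so c0 = 0, c1 = 1, c2 = -1, h = rho(a1) = -1
    return 0.0 * relu(-1.0 - x) + relu(x + 1.0) - relu(x - 1.0) - 1.0

xs = np.linspace(-3, 3, 1001)
print(np.max(np.abs(hard_tanh(xs) - hard_tanh_via_relu(xs))))  # ~0 up to float error
```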

(b) Let a be any breakpoint of ρ, so that ρ′(a+) ≠ ρ′(a−). Let r_0 be the distance separating a from the nearest other breakpoint, so that ρ is linear on [a, a + r_0] and on [a − r_0, a] (if ρ has only one breakpoint, any r_0 > 0 will do). Then, for any r > 0, we can express the ReLU function σ via ρ in the r-neighborhood of 0:

σ(x) = ( ρ(a + (r_0/(2r)) x) − ρ(a − r_0/2 + (r_0/(2r)) x) − ρ(a) + ρ(a − r_0/2) ) / ( (ρ′(a+) − ρ′(a−)) r_0/(2r) ),  x ∈ [−r, r].

It follows that a computation performed by a single σ-unit,

x_1, …, x_N ↦ σ(∑_{k=1}^N w_k x_k + b),

can be equivalently represented by a linear combination of a constant function and two ρ-units,

x_1, …, x_N ↦ ρ(a + (r_0/(2r)) b + (r_0/(2r)) ∑_{k=1}^N w_k x_k),  ρ(a − r_0/2 + (r_0/(2r)) b + (r_0/(2r)) ∑_{k=1}^N w_k x_k),

provided the condition

(2)  ∑_{k=1}^N w_k x_k + b ∈ [−r, r]

holds. Since D is a bounded set, we can choose r at each unit of the initial network η sufficiently large so as to satisfy condition (2) for all network inputs from D. Then, like in (a), we replace each σ-unit with two ρ-units, which produces the desired ρ-network.  □

3. Upper bounds

Throughout the paper, we will be interested in approximating functions f: [0,1]^d → R by ReLU networks. Given a function f: [0,1]^d → R and its approximation f̃, by the approximation error we will always mean the uniform maximum error

‖f − f̃‖_∞ = max_{x ∈ [0,1]^d} |f(x) − f̃(x)|.

3.1. Fast deep approximation of squaring and multiplication

Our first key result shows that ReLU networks with unconstrained depth can very efficiently approximate the function f(x) = x^2 (more efficiently than any fixed-depth network, as we will see in Section 4.4). Our construction uses the "sawtooth" function that has previously appeared in the paper by Telgarsky (2015).

Proposition 2

The function f(x) = x^2 on the segment [0,1] can be approximated with any error ϵ > 0 by a ReLU network having the depth and the number of weights and computation units O(ln(1/ϵ)).

Proof

Consider the "tooth" (or "mirror") function g: [0,1] → [0,1],

g(x) = { 2x,        x < 1/2,
         2(1 − x),  x ≥ 1/2,

and the iterated functions

g_s(x) = (g ∘ g ∘ ⋯ ∘ g)(x)  (s-fold composition).

Telgarsky has shown (see Lemma 2.4 in Telgarsky (2015)) that g_s is a "sawtooth" function with 2^{s−1} uniformly distributed "teeth" (each application of g doubles the number of teeth):

g_s(x) = { 2^s (x − 2k/2^s),  x ∈ [2k/2^s, (2k+1)/2^s],  k = 0, 1, …, 2^{s−1} − 1,
           2^s (2k/2^s − x),  x ∈ [(2k−1)/2^s, 2k/2^s],  k = 1, 2, …, 2^{s−1}

(see Fig. 2(a)). Our key observation now is that the function f(x) = x^2 can be approximated by linear combinations of the functions g_s. Namely, let f_m be the piece-wise linear interpolation of f with 2^m + 1 uniformly distributed breakpoints k/2^m, k = 0, …, 2^m:

f_m(k/2^m) = (k/2^m)^2,  k = 0, …, 2^m

(see Fig. 2(b)). The function f_m approximates f with the error ϵ_m = 2^{−2m−2}. Now note that refining the interpolation from f_{m−1} to f_m amounts to adjusting it by a function proportional to a sawtooth function:

f_{m−1}(x) − f_m(x) = g_m(x) / 2^{2m}.

Hence

f_m(x) = x − ∑_{s=1}^m g_s(x) / 2^{2s}.

Since g can be implemented by a finite ReLU network (as g(x) = 2σ(x) − 4σ(x − 1/2) + 2σ(x − 1)) and since the construction of f_m only involves O(m) linear operations and compositions of g, we can implement f_m by a ReLU network having the depth and the number of weights and computation units all being O(m) (see Fig. 2(c)). This implies the claim of the proposition.  □
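A quick NumPy sketch (ours, not part of the proof) of this construction: it builds f_m(x) = x − ∑_{s=1}^m g_s(x)/2^{2s} from the tooth function g and checks that the uniform error on [0,1] is about 2^{−2m−2}.

```python
import numpy as np

def g(x):
    # "tooth" function, expressible with three ReLUs:
    # g(x) = 2*relu(x) - 4*relu(x - 1/2) + 2*relu(x - 1)
    r = np.maximum
    return 2*r(x, 0) - 4*r(x - 0.5, 0) + 2*r(x - 1.0, 0)

def f_m(x, m):
    # f_m(x) = x - sum_{s=1}^m g_s(x) / 2^(2s), with g_s = g applied s times
    out, gs = np.array(x, dtype=float), np.array(x, dtype=float)
    for s in range(1, m + 1):
        gs = g(gs)                      # g_s = g o g_{s-1}
        out = out - gs / 4.0**s
    return out

xs = np.linspace(0, 1, 100001)
for m in (2, 4, 6, 8):
    err = np.max(np.abs(f_m(xs, m) - xs**2))
    print(m, err, 2.0**(-2*m - 2))      # observed error vs. the bound 2^(-2m-2)
```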

Since

(3)  xy = ((x + y)^2 − x^2 − y^2)/2,

we can use Proposition 2 to efficiently implement accurate multiplication in a ReLU network. The implementation will depend on the required accuracy and the magnitude of the multiplied quantities.

Fig. 2. Fast approximation of the function f(x) = x^2 from Proposition 2: (a) the "tooth" function g and the iterated "sawtooth" functions g_2, g_3; (b) the approximating functions f_m; (c) the network architecture for f_4.

Proposition 3

Given M > 0 and ϵ ∈ (0,1), there is a ReLU network η with two input units that implements a function ×̃ : R^2 → R so that

  • (a)

    for any inputs x, y, if |x| ≤ M and |y| ≤ M, then |×̃(x, y) − xy| ≤ ϵ;

  • (b)

    if x = 0 or y = 0, then ×̃(x, y) = 0;

  • (c)

    the depth and the number of weights and computation units in η are not greater than c_1 ln(1/ϵ) + c_2 with an absolute constant c_1 and a constant c_2 = c_2(M).

Proof

Let f̃_{sq,δ} be the approximate squaring function from Proposition 2 such that f̃_{sq,δ}(0) = 0 and |f̃_{sq,δ}(x) − x^2| < δ for x ∈ [0,1]. Assume without loss of generality that M ≥ 1 and set

(4)  ×̃(x, y) = (M^2/8) ( f̃_{sq,δ}(|x + y|/(2M)) − f̃_{sq,δ}(|x|/(2M)) − f̃_{sq,δ}(|y|/(2M)) ),

where δ = 8ϵ/(3M^2). Then property (b) is immediate and (a) follows easily using expansion (3). To conclude (c), observe that computation (4) consists of three instances of f̃_{sq,δ} and finitely many linear and ReLU operations, so, using Proposition 2, we can implement ×̃ by a ReLU network such that its depth and the number of computation units and weights are O(ln(1/δ)), i.e. are O(ln(1/ϵ) + ln M).  □

3.2. Fast deep approximation of general smooth functions

In order to formulate our general result, Theorem 1, we consider the Sobolev spaces W^{n,∞}([0,1]^d) with n = 1, 2, …. Recall that W^{n,∞}([0,1]^d) is defined as the space of functions on [0,1]^d lying in L^∞ along with their weak derivatives up to order n. The norm in W^{n,∞}([0,1]^d) can be defined by

‖f‖_{W^{n,∞}([0,1]^d)} = max_{𝐧: |𝐧| ≤ n} ess sup_{x ∈ [0,1]^d} |D^𝐧 f(x)|,

where 𝐧 = (n_1, …, n_d) ∈ {0, 1, …}^d, |𝐧| = n_1 + ⋯ + n_d, and D^𝐧 f is the respective weak derivative. Here and in the sequel we denote vectors by boldface characters. The space W^{n,∞}([0,1]^d) can be equivalently described as consisting of the functions from C^{n−1}([0,1]^d) such that all their derivatives of order n−1 are Lipschitz continuous.

Throughout the paper, we denote by F_{d,n} the unit ball in W^{n,∞}([0,1]^d):

F_{d,n} = { f ∈ W^{n,∞}([0,1]^d) : ‖f‖_{W^{n,∞}([0,1]^d)} ≤ 1 }.

Also, it will now be convenient to make a distinction between networks and network architectures: we define the latter as the former with unspecified weights. We say that a network architecture is capable of expressing any function from F_{d,n} with error ϵ, meaning that this can be achieved by some weight assignment.

Theorem 1

For any d, n and ϵ ∈ (0,1), there is a ReLU network architecture that

  • 1.

    is capable of expressing any function from F_{d,n} with error ϵ;

  • 2.

    has the depth at most c(ln(1/ϵ) + 1) and at most cϵ^{-d/n}(ln(1/ϵ) + 1) weights and computation units, with some constant c = c(d, n).

Proof

The proof will consist of two steps. We start with approximating f by a sum–product combination f_1 of local Taylor polynomials and one-dimensional piecewise-linear functions. After that, we will use the results of the previous section to approximate f_1 by a neural network.

Let N be a positive integer. Consider a partition of unity formed by a grid of (N+1)^d functions ϕ_𝐦 on the domain [0,1]^d:

∑_𝐦 ϕ_𝐦(x) ≡ 1,  x ∈ [0,1]^d.

Here 𝐦 = (m_1, …, m_d) ∈ {0, 1, …, N}^d, and the function ϕ_𝐦 is defined as the product

(5)  ϕ_𝐦(x) = ∏_{k=1}^d ψ(3N(x_k − m_k/N)),

where

ψ(x) = { 1,        |x| < 1,
         0,        2 < |x|,
         2 − |x|,  1 ≤ |x| ≤ 2

(see Fig. 3). Note that

(6)  ‖ψ‖_∞ = 1 and ‖ϕ_𝐦‖_∞ = 1 for all 𝐦,

and

(7)  supp ϕ_𝐦 ⊂ {x : |x_k − m_k/N| < 1/N for all k}.

For any 𝐦 ∈ {0, …, N}^d, consider the degree-(n−1) Taylor polynomial for the function f at x = 𝐦/N:

(8)  P_𝐦(x) = ∑_{𝐧: |𝐧| < n} (D^𝐧 f / 𝐧!)|_{x = 𝐦/N} (x − 𝐦/N)^𝐧,

with the usual conventions 𝐧! = ∏_{k=1}^d n_k! and (x − 𝐦/N)^𝐧 = ∏_{k=1}^d (x_k − m_k/N)^{n_k}. Now define an approximation to f by

(9)  f_1 = ∑_{𝐦 ∈ {0,…,N}^d} ϕ_𝐦 P_𝐦.

We bound the approximation error using the Taylor expansion of f:

|f(x) − f_1(x)| = |∑_𝐦 ϕ_𝐦(x)(f(x) − P_𝐦(x))|
  ≤ ∑_{𝐦: |x_k − m_k/N| < 1/N ∀k} |f(x) − P_𝐦(x)|
  ≤ 2^d max_{𝐦: |x_k − m_k/N| < 1/N ∀k} |f(x) − P_𝐦(x)|
  ≤ (2^d d^n / n!) (1/N)^n max_{𝐧: |𝐧| = n} ess sup_{x ∈ [0,1]^d} |D^𝐧 f(x)|
  ≤ (2^d d^n / n!) (1/N)^n.

Here in the second step we used the support property (7) and the bound (6), in the third the observation that any x ∈ [0,1]^d belongs to the support of at most 2^d functions ϕ_𝐦, in the fourth a standard bound for the Taylor remainder, and in the fifth the property ‖f‖_{W^{n,∞}([0,1]^d)} ≤ 1.
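A small NumPy sketch (ours, for intuition only) of this step in dimension d = 1: it builds the partition of unity from ψ and the local Taylor approximation f_1 for a sample smooth function, and checks the two sums numerically; the sample function and all names are made up.

```python
import numpy as np

N = 5                                            # grid size (d = 1, n = 2)

def psi(x):
    # trapezoid bump: 1 on |x| < 1, 0 for |x| > 2, linear in between
    return np.clip(2.0 - np.abs(x), 0.0, 1.0)

def phi(m, x):
    # phi_m(x) = psi(3N(x - m/N)), Eq. (5) with d = 1
    return psi(3 * N * (x - m / N))

f, df = np.sin, np.cos                           # sample smooth f and its derivative

def P(m, x):
    # degree-(n-1) = degree-1 Taylor polynomial of f at m/N, Eq. (8)
    a = m / N
    return f(a) + df(a) * (x - a)

xs = np.linspace(0, 1, 2001)
partition = sum(phi(m, xs) for m in range(N + 1))
f1 = sum(phi(m, xs) * P(m, xs) for m in range(N + 1))   # Eq. (9)
print(np.max(np.abs(partition - 1.0)))                  # ~0: the phi_m sum to 1 on [0,1]
print(np.max(np.abs(f1 - f(xs))))                       # small local-Taylor error
```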


Fig. 3. Functions (ϕ_m)_{m=0}^5 forming a partition of unity for d = 1, N = 5 in the proof of Theorem 1.

It follows that if we choose

(10)  N = ⌈ ( (n!/(2^d d^n)) (ϵ/2) )^{−1/n} ⌉

(where ⌈·⌉ is the ceiling function), then

(11)  ‖f − f_1‖_∞ ≤ ϵ/2.

Note that, by (8), the coefficients of the polynomials P_𝐦 are uniformly bounded for all f ∈ F_{d,n}:

(12)  P_𝐦(x) = ∑_{𝐧: |𝐧| < n} a_{𝐦,𝐧} (x − 𝐦/N)^𝐧,  |a_{𝐦,𝐧}| ≤ 1.

We have therefore reduced our task to the following: construct a network architecture capable of approximating with uniform error ϵ/2 any function of the form (9), assuming that N is given by (10) and the polynomials P_𝐦 are of the form (12).

Expand f_1 as

(13)  f_1(x) = ∑_{𝐦 ∈ {0,…,N}^d} ∑_{𝐧: |𝐧| < n} a_{𝐦,𝐧} ϕ_𝐦(x) (x − 𝐦/N)^𝐧.

The expansion is a linear combination of not more than d^n (N+1)^d terms ϕ_𝐦(x)(x − 𝐦/N)^𝐧. Each of these terms is a product of at most d + n − 1 piece-wise linear univariate factors: d functions ψ(3N x_k − 3m_k) (see (5)) and at most n − 1 linear expressions x_k − m_k/N. We can implement an approximation of this product by a neural network with the help of Proposition 3. Specifically, let ×̃ be the approximate multiplication from Proposition 3 for M = d + n and some accuracy δ to be chosen later, and consider the approximation of the product ϕ_𝐦(x)(x − 𝐦/N)^𝐧 obtained by the chained application of ×̃:

(14)  f̃_{𝐦,𝐧}(x) = ×̃( ψ(3N x_1 − 3m_1), ×̃( ψ(3N x_2 − 3m_2), …, ×̃( x_k − m_k/N, … ) ) ).

Using statement (c) of Proposition 3, we see that f̃_{𝐦,𝐧} can be implemented by a ReLU network with the depth and the number of weights and computation units not larger than (d + n) c_1 ln(1/δ), for some constant c_1 = c_1(d, n).

Now we estimate the error of this approximation. Note that we have |ψ(3N x_k − 3m_k)| ≤ 1 and |x_k − m_k/N| ≤ 1 for all k and all x ∈ [0,1]^d. By statement (a) of Proposition 3, if |a| ≤ 1 and |b| ≤ M, then |×̃(a, b)| ≤ |b| + δ. Repeatedly applying this observation to all approximate multiplications in (14) while assuming δ < 1, we see that the arguments of all these multiplications are bounded by our M (equal to d + n), and statement (a) of Proposition 3 holds for each of them. We then have

(15)  |f̃_{𝐦,𝐧}(x) − ϕ_𝐦(x)(x − 𝐦/N)^𝐧|
  = |×̃(ψ(3Nx_1 − 3m_1), ×̃(ψ(3Nx_2 − 3m_2), ×̃(ψ(3Nx_3 − 3m_3), …))) − ψ(3Nx_1 − 3m_1) ψ(3Nx_2 − 3m_2) ψ(3Nx_3 − 3m_3) ⋯|
  ≤ |×̃(ψ(3Nx_1 − 3m_1), ×̃(ψ(3Nx_2 − 3m_2), ×̃(ψ(3Nx_3 − 3m_3), …))) − ψ(3Nx_1 − 3m_1) ×̃(ψ(3Nx_2 − 3m_2), ×̃(ψ(3Nx_3 − 3m_3), …))|
    + |ψ(3Nx_1 − 3m_1)| |×̃(ψ(3Nx_2 − 3m_2), ×̃(ψ(3Nx_3 − 3m_3), …)) − ψ(3Nx_2 − 3m_2) ×̃(ψ(3Nx_3 − 3m_3), …)| + ⋯
  ≤ (d + n)δ.

Moreover, by statement (b) of Proposition 3,

(16)  f̃_{𝐦,𝐧}(x) = ϕ_𝐦(x)(x − 𝐦/N)^𝐧 = 0,  x ∉ supp ϕ_𝐦.

Now we define the full approximation by

(17)  f̃ = ∑_{𝐦 ∈ {0,…,N}^d} ∑_{𝐧: |𝐧| < n} a_{𝐦,𝐧} f̃_{𝐦,𝐧}.

We estimate the approximation error of f̃:

|f̃(x) − f_1(x)| = |∑_{𝐦 ∈ {0,…,N}^d} ∑_{𝐧: |𝐧| < n} a_{𝐦,𝐧} (f̃_{𝐦,𝐧}(x) − ϕ_𝐦(x)(x − 𝐦/N)^𝐧)|
  = |∑_{𝐦: x ∈ supp ϕ_𝐦} ∑_{𝐧: |𝐧| < n} a_{𝐦,𝐧} (f̃_{𝐦,𝐧}(x) − ϕ_𝐦(x)(x − 𝐦/N)^𝐧)|
  ≤ 2^d max_{𝐦: x ∈ supp ϕ_𝐦} ∑_{𝐧: |𝐧| < n} |f̃_{𝐦,𝐧}(x) − ϕ_𝐦(x)(x − 𝐦/N)^𝐧|
  ≤ 2^d d^n (d + n)δ,

where in the first step we use expansion (13), in the second the identity (16), in the third the bound |a_{𝐦,𝐧}| ≤ 1 and the fact that x ∈ supp ϕ_𝐦 for at most 2^d functions ϕ_𝐦, and in the fourth the bound (15). It follows that if we choose

(18)  δ = ϵ / (2^{d+1} d^n (d + n)),

then ‖f̃ − f_1‖_∞ ≤ ϵ/2 and hence, by (11),

‖f̃ − f‖_∞ ≤ ‖f̃ − f_1‖_∞ + ‖f_1 − f‖_∞ ≤ ϵ/2 + ϵ/2 = ϵ.

On the other hand, note that, by (17), f̃ can be implemented by a network consisting of parallel subnetworks that compute each of the f̃_{𝐦,𝐧}; the final output is obtained by weighting the outputs of the subnetworks with the weights a_{𝐦,𝐧}. The architecture of the full network does not depend on f; only the weights a_{𝐦,𝐧} do. As already shown, each of these subnetworks has not more than c_1 ln(1/δ) layers, weights and computation units, with some constant c_1 = c_1(d, n). There are not more than d^n (N+1)^d such subnetworks. Therefore, the full network for f̃ has not more than c_1 ln(1/δ) + 1 layers and d^n (N+1)^d (c_1 ln(1/δ) + 1) weights and computation units. With δ given by (18) and N given by (10), we obtain the claimed complexity bounds.  □

We remark that an analog of Theorem 1 for so-called k'th order activation functions with k ≥ 2 can be found in Mhaskar (1993).

3.3. Faster approximations using adaptive network architectures

Theorem 1 provides an upper bound for the approximation complexity in the case when the same network architecture is used to approximate all functions in F_{d,n}. We can consider an alternative, "adaptive architecture" scenario where not only the weights, but also the architecture is adjusted to the approximated function. We expect, of course, that this would decrease the complexity of the resulting architectures, in general (at the price of needing to find the appropriate architecture). In this section we show that we can indeed obtain better upper bounds in this scenario.

For simplicity, we will only consider the case d = n = 1. Then W^{1,∞}([0,1]) is the space of Lipschitz functions on the segment [0,1]. The set F_{1,1} consists of functions f having both ‖f‖_∞ and the Lipschitz constant bounded by 1. Theorem 1 provides an upper bound O((1/ϵ) ln(1/ϵ)) for the number of weights and computation units, but in this special case there is in fact a better bound O(1/ϵ), obtained simply by piece-wise linear interpolation.

Namely, given f ∈ F_{1,1} and ϵ > 0, set T = ⌈1/ϵ⌉ and let f̃ be the piece-wise linear interpolation of f with the T + 1 uniformly spaced breakpoints (t/T)_{t=0}^T (i.e., f̃(t/T) = f(t/T), t = 0, …, T). The function f̃ is also Lipschitz with constant 1 and hence ‖f − f̃‖_∞ ≤ 1/T ≤ ϵ (since for any x ∈ [0,1] we can find t such that |x − t/T| ≤ 1/(2T), and then |f(x) − f̃(x)| ≤ |f(x) − f(t/T)| + |f̃(t/T) − f̃(x)| ≤ 2 · 1/(2T) = 1/T). At the same time, the function f̃ can be expressed in terms of the ReLU function σ by

f̃(x) = b + ∑_{t=0}^{T−1} w_t σ(x − t/T)

with some coefficients b and (w_t)_{t=0}^{T−1}. This expression can be viewed as a special case of a depth-3 ReLU network with O(1/ϵ) weights and computation units.
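A minimal NumPy sketch (ours) of this shallow construction: the coefficients are b = f(0) and w_t = slope_t − slope_{t−1} (with slope_{−1} = 0), so that the sum of shifted ReLUs reproduces the piece-wise linear interpolant exactly.

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)

def shallow_interpolant(f, T):
    # Piece-wise linear interpolation of f on breakpoints t/T written as
    # f_tilde(x) = b + sum_t w_t * relu(x - t/T)
    knots = np.arange(T + 1) / T
    vals = f(knots)
    slopes = T * np.diff(vals)                      # slope on each subinterval
    w = np.diff(np.concatenate([[0.0], slopes]))    # w_t = slope_t - slope_{t-1}
    b = vals[0]
    return lambda x: b + sum(w[t] * relu(x - t / T) for t in range(T))

f = np.sin                                          # a sample Lipschitz function (illustration)
T = 100                                             # roughly 1/epsilon breakpoints
f_tilde = shallow_interpolant(f, T)
xs = np.linspace(0, 1, 5000)
print(np.max(np.abs(f(xs) - f_tilde(xs))))          # error of order 1/T
```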

We show now how the bound O(1/ϵ) can be improved by using adaptive architectures.

Theorem 2

For any f ∈ F_{1,1} and ϵ ∈ (0, 1/2), there exists a depth-6 ReLU network η (with architecture depending on f) that provides an ϵ-approximation of f while having not more than c/(ϵ ln(1/ϵ)) weights, connections and computation units. Here c is an absolute constant.

Proof

We first explain the idea of the proof. We start with interpolating f by a piece-wise linear function, but not on the length scale ∼ϵ; instead, we do it on a coarser length scale ∼mϵ, with some m = m(ϵ) > 1. We then create a "cache" of auxiliary subnetworks that we use to fill in the details and go down to the scale ∼ϵ, in each of the mϵ-subintervals. This allows us to reduce the amount of computation for small ϵ because the complexity of the cache only depends on m. The assignment of cached subnetworks to the subintervals is encoded in the network architecture and depends on the function f. We optimize m by balancing the complexity of the cache with that of the initial coarse approximation. This leads to m ∼ ln(1/ϵ) and hence to the reduction of the total complexity of the network by a factor ∼ln(1/ϵ) compared to the simple piece-wise linear approximation on the scale ϵ. This construction is inspired by a similar argument used to prove the O(2^n/n) upper bound for the complexity of Boolean circuits implementing n-ary functions (Shannon, 1949).

The proof becomes simpler if, in addition to the ReLU function σ, we are allowed to use the activation function

(19)  ρ(x) = { x,  x ∈ [0, 1),
              0,  x ∉ [0, 1)

in our neural network. Since ρ is discontinuous, we cannot just use Proposition 1 to replace ρ-units by σ-units. We will first prove the analog of the claimed result for the model including ρ-units, and then we will show how to construct a purely ReLU network.

Lemma 1

For any f ∈ F_{1,1} and ϵ ∈ (0, 1/2), there exists a depth-5 network including σ-units and ρ-units that provides an ϵ-approximation of f while having not more than c/(ϵ ln(1/ϵ)) weights, where c is an absolute constant.

Proof

Given f ∈ F_{1,1}, we will construct an approximation f̃ to f in the form

f̃ = f̃_1 + f̃_2.

Here, f̃_1 is the piece-wise linear interpolation of f with the breakpoints {t/T}_{t=0}^T, for some positive integer T to be chosen later. Since f is Lipschitz with constant 1, f̃_1 is also Lipschitz with constant 1. We will denote by I_t the intervals between the breakpoints:

I_t = [t/T, (t+1)/T),  t = 0, …, T − 1.

We will now construct f̃_2 as an approximation to the difference

(20)  f_2 = f − f̃_1.

Note that f_2 vanishes at the endpoints of the intervals I_t:

(21)  f_2(t/T) = 0,  t = 0, …, T,

and f_2 is Lipschitz with constant 2:

(22)  |f_2(x_1) − f_2(x_2)| ≤ 2|x_1 − x_2|,

since f and f̃_1 are Lipschitz with constant 1.

To define f̃_2, we first construct a set Γ of cached functions. Let m be a positive integer to be chosen later. Let Γ be the set of piecewise linear functions γ: [0,1] → R with the breakpoints {r/m}_{r=0}^m and the properties

γ(0) = γ(1) = 0

and

γ(r/m) − γ((r−1)/m) ∈ {−2/m, 0, 2/m},  r = 1, …, m.

Note that the size |Γ| of Γ is not larger than 3^m.

If g: [0,1] → R is any Lipschitz function with constant 2 and g(0) = g(1) = 0, then g can be approximated by some γ ∈ Γ with error not larger than 2/m: namely, take γ(r/m) = (2/m) ⌊g(r/m) m/2⌋.

Moreover, if f_2 is defined by (20), then, using (21), (22), on each interval I_t the function f_2 can be approximated with error not larger than 2/(Tm) by a properly rescaled function γ ∈ Γ. Namely, for each t = 0, …, T − 1 we can define the function g by g(y) = T f_2((t + y)/T). Then it is Lipschitz with constant 2 and g(0) = g(1) = 0, so we can find γ_t ∈ Γ such that

sup_{y ∈ [0,1)} |T f_2((t + y)/T) − γ_t(y)| ≤ 2/m.

This can be equivalently written as

sup_{x ∈ I_t} |f_2(x) − (1/T) γ_t(Tx − t)| ≤ 2/(Tm).

Note that the obtained assignment t ↦ γ_t is not injective, in general (T will be larger than |Γ|).
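The following NumPy sketch (ours, illustrative only) carries out this caching step for a sample Lipschitz f: it computes f_2 = f − f̃_1, quantizes each rescaled piece to a function γ_t on the grid r/m via the floor rule above, and compares the result with the 2/(Tm) bound; all names and parameter values are made up.

```python
import numpy as np

f = np.sin                        # sample function from F_{1,1} (illustration)
T, m = 40, 4                      # coarse grid size and cache resolution

knots = np.arange(T + 1) / T
vals = f(knots)

def f1(x):
    # piece-wise linear interpolation of f on the coarse grid {t/T}
    return np.interp(x, knots, vals)

def gamma_t(t):
    # quantized profile of f_2 = f - f1 on I_t, rescaled to [0,1]:
    # gamma_t(r/m) = (2/m) * floor( T * f_2((t + r/m)/T) * m/2 )
    r = np.arange(m + 1) / m
    g = T * (f((t + r) / T) - f1((t + r) / T))
    return (2.0 / m) * np.floor(g * m / 2.0)

# sup_{x in I_t} |f_2(x) - (1/T) gamma_t(Tx - t)| versus the 2/(Tm) bound
worst = 0.0
for t in range(T):
    y = np.linspace(0, 1, 200, endpoint=False)
    x = (t + y) / T
    approx = np.interp(y, np.arange(m + 1) / m, gamma_t(t)) / T
    worst = max(worst, np.max(np.abs((f(x) - f1(x)) - approx)))
print(worst, 2.0 / (T * m))
```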

We can then define f̃_2 on the whole [0,1) by

(23)  f̃_2(x) = (1/T) γ_t(Tx − t),  x ∈ I_t,  t = 0, …, T − 1.

This f̃_2 approximates f_2 with error 2/(Tm) on [0,1):

(24)  sup_{x ∈ [0,1)} |f_2(x) − f̃_2(x)| ≤ 2/(Tm),

and hence, by (20), for the full approximation f̃ = f̃_1 + f̃_2 we will also have

(25)  sup_{x ∈ [0,1)} |f(x) − f̃(x)| ≤ 2/(Tm).

Note that the approximation f̃_2 has properties analogous to (21), (22):

(26)  f̃_2(t/T) = 0,  t = 0, …, T,

(27)  |f̃_2(x_1) − f̃_2(x_2)| ≤ 2|x_1 − x_2|;

in particular, f̃_2 is continuous on [0,1).

We will now rewrite f̃_2 in a different form interpretable as a computation by a neural network. Specifically, using our additional activation function ρ given by (19), we can express f̃_2 as

(28)  f̃_2(x) = (1/T) ∑_{γ ∈ Γ} γ( ∑_{t: γ_t = γ} ρ(Tx − t) ).

Indeed, given x ∈ [0,1), observe that all the terms in the inner sum vanish except for the one corresponding to the t determined by the condition x ∈ I_t. For this particular t we have ρ(Tx − t) = Tx − t. Since γ(0) = 0, we conclude that (28) agrees with (23).

Let us also expand each γ ∈ Γ over the basis of shifted ReLU functions:

γ(x) = ∑_{r=0}^{m−1} c_{γ,r} σ(x − r/m),  x ∈ [0,1].

Substituting this expansion in (28), we finally obtain

(29)  f̃_2(x) = (1/T) ∑_{γ ∈ Γ} ∑_{r=0}^{m−1} c_{γ,r} σ( ∑_{t: γ_t = γ} ρ(Tx − t) − r/m ).

Now consider the implementation of f˜ by a neural network. The term f˜1 can clearly be implemented by a depth-3 ReLU network using O(T) connections and computation units. The term f˜2 can be implemented by a depth-5 network with ρ- and σ-units as follows (we denote a computation unit by Q with a superscript indexing the layer and a subscript indexing the unit within the layer).

  • 1.

    The first layer contains the single input unit Q^{(1)}.

  • 2.

    The second layer contains T units (Q_t^{(2)})_{t=1}^T computing Q_t^{(2)} = ρ(T Q^{(1)} − t).

  • 3.

    The third layer contains |Γ| units (Q_γ^{(3)})_{γ ∈ Γ} computing Q_γ^{(3)} = σ(∑_{t: γ_t = γ} Q_t^{(2)}). This is equivalent to Q_γ^{(3)} = ∑_{t: γ_t = γ} Q_t^{(2)}, because Q_t^{(2)} ≥ 0.

  • 4.

    The fourth layer contains m|Γ| units (Q_{γ,r}^{(4)})_{γ ∈ Γ, r = 0,…,m−1} computing Q_{γ,r}^{(4)} = σ(Q_γ^{(3)} − r/m).

  • 5.

    The final layer consists of a single output unit Q^{(5)} = ∑_{γ ∈ Γ} ∑_{r=0}^{m−1} (c_{γ,r}/T) Q_{γ,r}^{(4)}.

Examining this network, we see that the total number of connections and units in it is O(T + m|Γ|) and hence is O(T + m·3^m). This also holds for the full network implementing f̃ = f̃_1 + f̃_2, since the term f̃_1 requires even fewer layers, connections and units. The output units of the subnetworks for f̃_1 and f̃_2 can be merged into the output unit for f̃_1 + f̃_2, so the depth of the full network is the maximum of the depths of the networks implementing f̃_1 and f̃_2, i.e., is 5 (see Fig. 4).

Now, given ϵ ∈ (0, 1/2), take m = ⌈(1/2) log_3(1/ϵ)⌉ and T = ⌈2/(mϵ)⌉. Then, by (25), the approximation error max_{x ∈ [0,1]} |f(x) − f̃(x)| ≤ 2/(Tm) ≤ ϵ, while T + m·3^m = O(1/(ϵ ln(1/ϵ))), which implies the claimed complexity bound.  □
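A tiny numeric illustration (ours) of this parameter choice, comparing the cache-based complexity T + m·3^m with the ∼1/ϵ cost of plain piece-wise linear interpolation:

```python
import math

for eps in (1e-2, 1e-3, 1e-4, 1e-5):
    m = math.ceil(0.5 * math.log(1 / eps, 3))   # cache resolution
    T = math.ceil(2 / (m * eps))                # coarse grid size
    cached = T + m * 3**m                       # complexity of the adaptive construction
    plain = math.ceil(1 / eps)                  # complexity of simple interpolation
    print(eps, m, T, cached, plain, round(plain / cached, 2))
```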


Fig. 4. Architecture of the network implementing the function f̃ = f̃_1 + f̃_2 from Lemma 1.

We show now how to modify the constructed network so as to remove the ρ-units. We only need to modify the f̃_2 part of the network. We will show that for any δ > 0 we can replace f̃_2 with a function f̃_{3,δ} (defined below) that

  • (a)

    obeys the following analog of the approximation bound (24):

    (30)  sup_{x ∈ [0,1]} |f_2(x) − f̃_{3,δ}(x)| ≤ 8δ/T + 2/(Tm),

  • (b)

    and is implementable by a depth-6 ReLU network having complexity c(T + m·3^m), with an absolute constant c independent of δ.

Since δ can be taken arbitrarily small, the theorem then follows by arguing as in Lemma 1, only with f̃_2 replaced by f̃_{3,δ}.

As a first step, we approximate ρ by a continuous piece-wise linear function ρ_δ, with a small δ > 0:

ρ_δ(y) = { y,                   y ∈ [0, 1 − δ),
           ((1 − δ)/δ)(1 − y),  y ∈ [1 − δ, 1),
           0,                   y ∉ [0, 1).

Let f̃_{2,δ} be defined as f̃_2 in (29), but with ρ replaced by ρ_δ:

f̃_{2,δ}(x) = (1/T) ∑_{γ ∈ Γ} ∑_{r=0}^{m−1} c_{γ,r} σ( ∑_{t: γ_t = γ} ρ_δ(Tx − t) − r/m ).

Since ρ_δ is a continuous piece-wise linear function with three breakpoints, we can express it via the ReLU function, and hence implement f̃_{2,δ} by a purely ReLU network, as in Proposition 1, and the complexity of the implementation does not depend on δ. However, replacing ρ with ρ_δ affects the function f̃_2 on the intervals ((t − δ)/T, t/T], t = 1, …, T, introducing there a large error (of magnitude O(1/T)). But recall that both f_2 and f̃_2 vanish at the points t/T, t = 0, …, T, by (21), (26). We can then largely remove this newly introduced error by simply suppressing f̃_{2,δ} near the points t/T.

Precisely, consider the continuous piece-wise linear function

ϕ_δ(y) = { 0,              y ∉ [0, 1 − δ),
           y/δ,            y ∈ [0, δ),
           1,              y ∈ [δ, 1 − 2δ),
           (1 − δ − y)/δ,  y ∈ [1 − 2δ, 1 − δ)

and the full comb-like filtering function

Φ_δ(x) = ∑_{t=0}^{T−1} ϕ_δ(Tx − t).

Note that Φ_δ is continuous piece-wise linear with 4T breakpoints, and 0 ≤ Φ_δ(x) ≤ 1. We then define our final modification of f̃_2 as

(31)  f̃_{3,δ}(x) = σ( f̃_{2,δ}(x) + 2Φ_δ(x) − 1 ) − σ( 2Φ_δ(x) − 1 ).

Lemma 2

The function f̃_{3,δ} obeys the bound (30).

Proof

Given x ∈ [0,1), let t ∈ {0, …, T − 1} and y ∈ [0,1) be determined from the representation x = (t + y)/T (i.e., y is the relative position of x in the respective interval I_t). Consider several possibilities for y:

  • 1.

    y ∈ [1 − δ, 1]. In this case Φ_δ(x) = 0. Note that

    (32)  sup_{x ∈ [0,1]} |f̃_{2,δ}(x)| ≤ 1,

    because, by construction, sup_{x ∈ [0,1]} |f̃_{2,δ}(x)| ≤ sup_{x ∈ [0,1]} |f̃_2(x)|, and sup_{x ∈ [0,1]} |f̃_2(x)| ≤ 1 by (26), (27). It follows that both terms in (31) vanish, i.e., f̃_{3,δ}(x) = 0. But, since f_2 is Lipschitz with constant 2 by (22) and f_2((t+1)/T) = 0, we have |f_2(x)| = |f_2(x) − f_2((t+1)/T)| ≤ 2|y − 1|/T ≤ 2δ/T. This implies |f_2(x) − f̃_{3,δ}(x)| ≤ 2δ/T.

  • 2.

    y ∈ [δ, 1 − 2δ]. In this case Φ_δ(x) = 1 and f̃_{2,δ}(x) = f̃_2(x). Using (32), we find that f̃_{3,δ}(x) = f̃_{2,δ}(x) = f̃_2(x). It follows that |f_2(x) − f̃_{3,δ}(x)| = |f_2(x) − f̃_2(x)| ≤ 2/(Tm).

  • 3.

    y ∈ [0, δ] ∪ [1 − 2δ, 1 − δ]. In this case f̃_{2,δ}(x) = f̃_2(x). Since σ is Lipschitz with constant 1, |f̃_{3,δ}(x)| ≤ |f̃_{2,δ}(x)| = |f̃_2(x)|. Both f_2 and f̃_2 are Lipschitz with constant 2 (by (22), (27)) and vanish at t/T and (t+1)/T (by (21), (26)). It follows that

    |f_2(x) − f̃_{3,δ}(x)| ≤ |f_2(x)| + |f̃_2(x)| ≤ 2 · { 2|x − t/T|       if y ∈ [0, δ],
                                                         2|x − (t+1)/T|   if y ∈ [1 − 2δ, 1 − δ] }  ≤ 8δ/T.

It remains to verify the complexity property (b) of the function f̃_{3,δ}. As already mentioned, f̃_{2,δ} can be implemented by a depth-5 purely ReLU network with not more than c(T + m·3^m) weights, connections and computation units, where c is an absolute constant independent of δ. The function Φ_δ can be implemented by a shallow, depth-3 network with O(T) units and connections. Then, the computation of f̃_{3,δ} can be implemented by a network including two subnetworks for computing f̃_{2,δ} and Φ_δ, and an additional layer containing two σ-units as written in (31). We thus obtain 6 layers in the resulting full network and, choosing T and m in the same way as in Lemma 1, obtain the bound c/(ϵ ln(1/ϵ)) for the number of its connections, weights, and computation units.  □

4. Lower bounds

4.1. Continuous nonlinear widths

The method of continuous nonlinear widths (DeVore et al., 1989) is a very general approach to the analysis of parameterized nonlinear approximations, based on the assumption of continuous selection of their parameters. We are interested in the following lower bound for the complexity of approximations in W^{n,∞}([0,1]^d).

Theorem 3

DeVore et al. (1989), Theorem 4.2

Fix d, n. Let W be a positive integer and η: R^W → C([0,1]^d) be any mapping between the space R^W and the space C([0,1]^d). Suppose that there is a continuous map M: F_{d,n} → R^W such that ‖f − η(M(f))‖_∞ ≤ ϵ for all f ∈ F_{d,n}. Then W ≥ cϵ^{−d/n}, with some constant c depending only on n.

We apply this theorem by taking η to be some ReLU network architecture, and R^W the corresponding weight space. It follows that if a ReLU network architecture is capable of expressing any function from F_{d,n} with error ϵ, then, under the hypothesis of continuous weight selection, the network must have at least cϵ^{−d/n} weights. The number of connections is then lower bounded by (c/2)ϵ^{−d/n} (since the number of weights is not larger than the sum of the number of computation units and the number of connections, and there are at least as many connections as units).

The hypothesis of continuous weight selection is crucial in Theorem 3. One can check, by examining our proof of the counterpart upper bound O(ϵ^{−d/n} ln(1/ϵ)) in Theorem 1, that the weights are selected there in a continuous manner; this upper bound therefore asymptotically lies above cϵ^{−d/n}, in agreement with Theorem 3. We remark, however, that the optimal choice of the network weights (minimizing the error) is known to be discontinuous in general, even for shallow networks (Kainen, Kůrková, & Vogt, 1999).

We also compare the bounds of Theorems 2 and 3. In the case d = n = 1, Theorem 3 provides a lower bound cϵ^{−1} for the number of weights and connections. On the other hand, in the adaptive architecture scenario, Theorem 2 provides the upper bound c/(ϵ ln(1/ϵ)) for the number of weights, connections and computation units. The fact that this latter bound is asymptotically below the bound of Theorem 3 reflects the extra expressiveness associated with variable network architecture.

4.2. Bounds based on VC-dimension

In this section we consider the setup where the same network architecture is used to approximate all functions f ∈ F_{d,n}, but the dependence of the weights on f is not assumed to be necessarily continuous. In this setup, some lower bounds on the network complexity can be obtained as a consequence of existing upper bounds on the VC-dimension of networks with piece-wise polynomial activation functions and Boolean outputs (Anthony & Bartlett, 2009). In the next theorem, part (a) is a more general but weaker bound, while part (b) is a stronger bound assuming a constrained growth of the network depth.

Theorem 4

Fix $d$, $n$.

  • (a)

    For any $\epsilon\in(0,1)$, a ReLU network architecture capable of approximating any function $f\in F_{d,n}$ with error $\epsilon$ must have at least $c\epsilon^{-d/(2n)}$ weights, with some constant $c=c(d,n)>0$.

  • (b)

    Let $p\ge 0$, $c_1>0$ be some constants. For any $\epsilon\in(0,\tfrac12)$, if a ReLU network architecture of depth $L\le c_1\ln^p(1/\epsilon)$ is capable of approximating any function $f\in F_{d,n}$ with error $\epsilon$, then the network must have at least $c_2\epsilon^{-d/n}\ln^{-2p-1}(1/\epsilon)$ weights, with some constant $c_2=c_2(d,n,p,c_1)>0$.¹

Proof

Recall that given a class $H$ of Boolean functions on $[0,1]^d$, the VC-dimension of $H$ is defined as the size of the largest shattered subset $S\subset[0,1]^d$, i.e. the largest subset on which $H$ can compute any dichotomy (see, e.g., Anthony & Bartlett (2009), Section 3.3). We are interested in the case when $H$ is the family of functions obtained by applying thresholds $\mathbf{1}(x>a)$ to a ReLU network with fixed architecture but variable weights. In this case Theorem 8.7 of Anthony and Bartlett (2009) implies that
\begin{equation}
\operatorname{VCdim}(H)\le c_3 W^2, \tag{33}
\end{equation}
and Theorem 8.8 implies that
\begin{equation}
\operatorname{VCdim}(H)\le c_3 L^2 W\ln W, \tag{34}
\end{equation}
where $W$ is the number of weights, $L$ is the network depth, and $c_3$ is an absolute constant.

Given a positive integer $N$ to be chosen later, choose $S$ as a set of $N^d$ points $\mathbf{x}_1,\ldots,\mathbf{x}_{N^d}$ in the cube $[0,1]^d$ such that the distance between any two of them is not less than $\frac{1}{N}$. Given any assignment of values $y_1,\ldots,y_{N^d}\in\mathbb{R}$, we can construct a smooth function $f$ satisfying $f(\mathbf{x}_m)=y_m$ for all $m$ by setting
\begin{equation}
f(\mathbf{x})=\sum_{m=1}^{N^d} y_m\,\phi\big(N(\mathbf{x}-\mathbf{x}_m)\big), \tag{35}
\end{equation}
with some $C^\infty$ function $\phi:\mathbb{R}^d\to\mathbb{R}$ such that $\phi(\mathbf{0})=1$ and $\phi(\mathbf{x})=0$ if $|\mathbf{x}|>\frac12$.
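
For concreteness, the construction (35) is easy to instantiate numerically. The following Python sketch does so for $d=2$, using the standard smooth bump $\phi(\mathbf{x})=\exp\!\big(1-\frac{1}{1-(2|\mathbf{x}|)^2}\big)$ for $|\mathbf{x}|<\frac12$ and $\phi(\mathbf{x})=0$ otherwise; this particular $\phi$ and the grid of centers $\mathbf{x}_m$ are illustrative choices, not prescribed by the proof.

```python
import numpy as np

def bump(x):
    """Smooth bump phi: R^d -> R with phi(0) = 1 and support in |x| <= 1/2."""
    r2 = np.sum(np.asarray(x, dtype=float) ** 2, axis=-1)
    out = np.zeros_like(r2)
    inside = 4.0 * r2 < 1.0                       # |x| < 1/2
    out[inside] = np.exp(1.0 - 1.0 / (1.0 - 4.0 * r2[inside]))
    return out

def f_from_values(y, centers, N):
    """Return f(x) = sum_m y_m * phi(N (x - x_m)), as in Eq. (35)."""
    def f(x):
        x = np.atleast_2d(x)                      # shape (batch, d)
        diffs = N * (x[:, None, :] - centers[None, :, :])
        return bump(diffs) @ y                    # bumps have disjoint supports
    return f

# Example: d = 2, N = 4, centers on a grid with pairwise distances >= 1/N.
N, d = 4, 2
grid = (np.arange(N) + 0.5) / N
centers = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, d)
y = np.random.choice([0.0, 1.0], size=N ** d)     # a Boolean-type assignment
f = f_from_values(y, centers, N)
print(np.allclose(f(centers), y))                 # interpolation f(x_m) = y_m: prints True
```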

Let us obtain a condition ensuring that such $f\in F_{d,n}$. For any multi-index $\mathbf{n}$,
\[
\max_{\mathbf{x}}|D^{\mathbf{n}}f(\mathbf{x})| = N^{|\mathbf{n}|}\max_m|y_m|\max_{\mathbf{x}}|D^{\mathbf{n}}\phi(\mathbf{x})|,
\]
so if
\begin{equation}
\max_m|y_m|\le c_4 N^{-n}, \tag{36}
\end{equation}
with the constant $c_4=\big(\max_{\mathbf{n}:|\mathbf{n}|\le n}\max_{\mathbf{x}}|D^{\mathbf{n}}\phi(\mathbf{x})|\big)^{-1}$, then $f\in F_{d,n}$.
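
In more detail, differentiating (35) term by term and using that the scaled bumps have disjoint supports gives
\[
D^{\mathbf{n}}f(\mathbf{x})=\sum_{m=1}^{N^d} y_m\,N^{|\mathbf{n}|}\,(D^{\mathbf{n}}\phi)\big(N(\mathbf{x}-\mathbf{x}_m)\big),
\]
with at most one nonzero summand at every $\mathbf{x}$; taking maxima over $\mathbf{x}$ and $m$ yields the displayed identity.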

Now set
\begin{equation}
\epsilon=\frac{c_4}{3}N^{-n}. \tag{37}
\end{equation}
Suppose that there is a ReLU network architecture $\eta$ that can approximate, by adjusting its weights, any $f\in F_{d,n}$ with error less than $\epsilon$. Denote by $\eta(\mathbf{x},\mathbf{w})$ the output of the network for the input vector $\mathbf{x}$ and the vector of weights $\mathbf{w}$.

Consider any assignment $\mathbf{z}$ of Boolean values $z_1,\ldots,z_{N^d}\in\{0,1\}$. Set
\[
y_m = z_m c_4 N^{-n},\quad m=1,\ldots,N^d,
\]
and let $f$ be given by (35) (see Fig. 5); then (36) holds and hence $f\in F_{d,n}$.

By assumption, there is then a vector of weights, $\mathbf{w}=\mathbf{w}_{\mathbf{z}}$, such that for all $m$ we have $|\eta(\mathbf{x}_m,\mathbf{w}_{\mathbf{z}})-y_m|\le\epsilon$, and in particular
\[
\eta(\mathbf{x}_m,\mathbf{w}_{\mathbf{z}})
\begin{cases}
\ge c_4 N^{-n}-\epsilon > \dfrac{c_4 N^{-n}}{2}, & \text{if } z_m=1,\\[4pt]
\le \epsilon < \dfrac{c_4 N^{-n}}{2}, & \text{if } z_m=0,
\end{cases}
\]
so the thresholded network $\eta_1=\mathbf{1}\big(\eta>\frac{c_4 N^{-n}}{2}\big)$ has outputs
\[
\eta_1(\mathbf{x}_m,\mathbf{w}_{\mathbf{z}})=z_m,\quad m=1,\ldots,N^d.
\]
Since the Boolean values $z_m$ were arbitrary, we conclude that the subset $S$ is shattered and hence
\[
\operatorname{VCdim}(\eta_1)\ge N^d.
\]
Expressing $N$ through $\epsilon$ with (37), we obtain
\begin{equation}
\operatorname{VCdim}(\eta_1)\ge\Big(\frac{3\epsilon}{c_4}\Big)^{-d/n}. \tag{38}
\end{equation}
To establish part (a) of the theorem, we apply bound (33) to the network $\eta_1$:
\begin{equation}
\operatorname{VCdim}(\eta_1)\le c_3 W^2, \tag{39}
\end{equation}
where $W$ is the number of weights in $\eta_1$, which is the same as in $\eta$ if we do not count the threshold parameter. Combining (38) with (39), we obtain the desired lower bound $W\ge c\epsilon^{-d/(2n)}$ with $c=\big(\frac{c_4}{3}\big)^{d/(2n)}c_3^{-1/2}$.


Fig. 5. A function $f$ considered in the proof of Theorem 4 (for $d=2$).

To establish part (b) of the theorem, we use bound (34) and the hypothesis $L\le c_1\ln^p(1/\epsilon)$:
\begin{equation}
\operatorname{VCdim}(\eta_1)\le c_3 c_1^2\ln^{2p}\big(\tfrac{1}{\epsilon}\big)\, W\ln W. \tag{40}
\end{equation}
Combining (38) with (40), we obtain
\begin{equation}
W\ln W\ge \frac{1}{c_3 c_1^2}\Big(\frac{3\epsilon}{c_4}\Big)^{-d/n}\ln^{-2p}\big(\tfrac{1}{\epsilon}\big). \tag{41}
\end{equation}
Trying a $W$ of the form $W_{c_2}=c_2\epsilon^{-d/n}\ln^{-2p-1}(1/\epsilon)$ with a constant $c_2$, we get
\begin{align*}
W_{c_2}\ln W_{c_2} &= c_2\epsilon^{-d/n}\ln^{-2p-1}\big(\tfrac{1}{\epsilon}\big)\Big(\tfrac{d}{n}\ln\big(\tfrac{1}{\epsilon}\big)+\ln c_2-(2p+1)\ln\ln\big(\tfrac{1}{\epsilon}\big)\Big)\\
&=\Big(\tfrac{c_2 d}{n}+o(1)\Big)\epsilon^{-d/n}\ln^{-2p}\big(\tfrac{1}{\epsilon}\big).
\end{align*}
Comparing this with (41), we see that if we choose $c_2<\big(\frac{c_4}{3}\big)^{d/n}\frac{n}{d c_3 c_1^2}$, then for sufficiently small $\epsilon$ we have $W\ln W\ge W_{c_2}\ln W_{c_2}$ and hence $W\ge W_{c_2}$, as claimed. We can ensure that $W\ge W_{c_2}$ for all $\epsilon\in(0,\frac12)$ by further decreasing $c_2$.  □

We remark that the constrained depth hypothesis of part (b) is satisfied, with $p=1$, by the architecture used for the upper bound in Theorem 1. The bound stated in part (b) of Theorem 4 matches the upper bound of Theorem 1 and the lower bound of Theorem 3 up to a power of $\ln(1/\epsilon)$.

4.3. Adaptive network architectures

Our goal in this section is to obtain a lower bound for the approximation complexity in the scenario where the network architecture may depend on the approximated function. This lower bound is thus a counterpart to the upper bound of Section 3.3.

To state this result we define the complexity $\mathcal{N}(f,\epsilon)$ of approximating the function $f$ with error $\epsilon$ as the minimal number of hidden computation units in a ReLU network that provides such an approximation.

Theorem 5

For any $d$, $n$, there exists $f\in W^{n,\infty}([0,1]^d)$ such that $\mathcal{N}(f,\epsilon)$ is not $o\big(\epsilon^{-d/(9n)}\big)$ as $\epsilon\to 0$.

The proof relies on the following lemma.

Lemma 3

Fix $d$, $n$. For any sufficiently small $\epsilon>0$ there exists $f_\epsilon\in F_{d,n}$ such that $\mathcal{N}(f_\epsilon,\epsilon)\ge c_1\epsilon^{-d/(8n)}$, with some constant $c_1=c_1(d,n)>0$.

Proof

Observe that all the networks with not more than $m$ hidden computation units can be embedded in the single “enveloping” network that has $m$ hidden layers, each consisting of $m$ units, and that includes all the connections between units not in the same layer (see Fig. 6(a)). The number of weights in this enveloping network is $O(m^4)$. On the other hand, Theorem 4(a) states that at least $c\epsilon^{-d/(2n)}$ weights are needed for an architecture capable of $\epsilon$-approximating any function in $F_{d,n}$. It follows that there is a function $f_\epsilon\in F_{d,n}$ that cannot be $\epsilon$-approximated by networks with fewer than $c_1\epsilon^{-d/(8n)}$ computation units.  □
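
To make the $O(m^4)$ count concrete, the following Python sketch counts the weights of the enveloping network; the input layer of $d$ units and the single output unit are illustrative assumptions, connections run between every pair of units lying in different layers, and each computation unit carries one bias.

```python
def enveloping_network_weights(m, d=2):
    """Weights of the 'enveloping' network: m hidden layers of m units each,
    an input layer of d units, a single output unit, connections between every
    pair of units lying in different layers, and one bias per computation unit."""
    layer_sizes = [d] + [m] * m + [1]
    connections = sum(layer_sizes[i] * layer_sizes[j]
                      for i in range(len(layer_sizes))
                      for j in range(i + 1, len(layer_sizes)))
    biases = m * m + 1
    return connections + biases

for m in (2, 4, 8, 16, 32):
    w = enveloping_network_weights(m)
    print(f"m={m:3d}: weights={w:9d}, weights/m**4={w / m**4:.3f}")
```

The ratio stabilizes near $1/2$, consistent with the dominant term $\binom{m}{2}m^2\approx m^4/2$ coming from pairs of distinct hidden layers.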

Before proceeding to the proof of Theorem 5, note that $\mathcal{N}(f,\epsilon)$ is a monotone decreasing function of $\epsilon$ with a few obvious properties:
\begin{equation}
\mathcal{N}(af,|a|\epsilon)=\mathcal{N}(f,\epsilon),\quad\text{for any } a\in\mathbb{R}\setminus\{0\} \tag{42}
\end{equation}
(follows by multiplying the weights of the output unit of the approximating network by a constant);
\begin{equation}
\mathcal{N}(f\pm g,\epsilon+\|g\|_\infty)\le \mathcal{N}(f,\epsilon) \tag{43}
\end{equation}
(follows by approximating $f\pm g$ by an approximation of $f$);
\begin{equation}
\mathcal{N}(f_1\pm f_2,\epsilon_1+\epsilon_2)\le \mathcal{N}(f_1,\epsilon_1)+\mathcal{N}(f_2,\epsilon_2) \tag{44}
\end{equation}
(follows by combining approximating networks for $f_1$ and $f_2$ as in Fig. 6(b)).

Fig. 6. (a) Embedding a network with $m=4$ hidden units into an “enveloping” network (see Lemma 3). (b) Putting sub-networks in parallel to form an approximation for the sum or difference of two functions, see Eq. (44).

Proof of Theorem 5

The claim of Theorem 5 is similar to the claim of Lemma 3, but concerns a single function $f$ satisfying a slightly weaker complexity bound at multiple values of $\epsilon\to 0$. We will assume that Theorem 5 is false, i.e.,
\begin{equation}
\mathcal{N}(f,\epsilon)=o\big(\epsilon^{-d/(9n)}\big) \tag{45}
\end{equation}
for all $f\in W^{n,\infty}([0,1]^d)$, and we will reach a contradiction by presenting an $f$ violating this assumption. Specifically, we construct this $f$ as
\begin{equation}
f=\sum_{k=1}^{\infty} a_k f_k, \tag{46}
\end{equation}
with some $a_k\in\mathbb{R}$, $f_k\in F_{d,n}$, and we will make sure that
\begin{equation}
\mathcal{N}(f,\epsilon_k)\ge \epsilon_k^{-d/(9n)} \tag{47}
\end{equation}
for a sequence of $\epsilon_k\to 0$.

We determine $a_k$, $f_k$, $\epsilon_k$ sequentially. Suppose we have already found $\{a_s,f_s,\epsilon_s\}_{s=1}^{k-1}$; let us describe how we define $a_k$, $f_k$, $\epsilon_k$.

First, we set
\begin{equation}
a_k=\min_{s=1,\ldots,k-1}\epsilon_s 2^{s-k}. \tag{48}
\end{equation}
In particular, this ensures that $a_k\le\epsilon_1 2^{1-k}$, so that the function $f$ defined by the series (46) will be in $W^{n,\infty}([0,1]^d)$, because $\|f_k\|_{W^{n,\infty}([0,1]^d)}\le 1$.

Next, using Lemma 3 and Eq. (42), observe that if $\epsilon_k$ is sufficiently small, then we can find $f_k\in F_{d,n}$ such that
\begin{equation}
\mathcal{N}(a_k f_k,3\epsilon_k)=\mathcal{N}\Big(f_k,\frac{3\epsilon_k}{a_k}\Big)\ge c_1\Big(\frac{3\epsilon_k}{a_k}\Big)^{-d/(8n)}\ge 2\epsilon_k^{-d/(9n)}. \tag{49}
\end{equation}
In addition, by assumption (45), if $\epsilon_k$ is small enough then
\begin{equation}
\mathcal{N}\Big(\sum_{s=1}^{k-1}a_s f_s,\epsilon_k\Big)\le\epsilon_k^{-d/(9n)}. \tag{50}
\end{equation}
Let us choose $\epsilon_k$ and $f_k$ so that both (49) and (50) hold. Obviously, we can also make sure that $\epsilon_k\to 0$ as $k\to\infty$.

Let us check that the above choice of $\{a_k,f_k,\epsilon_k\}_{k=1}^{\infty}$ ensures that inequality (47) holds for all $k$:
\begin{align*}
\mathcal{N}\Big(\sum_{s=1}^{\infty}a_s f_s,\epsilon_k\Big)
&\ge \mathcal{N}\Big(\sum_{s=1}^{k}a_s f_s,\;\epsilon_k+\Big\|\sum_{s=k+1}^{\infty}a_s f_s\Big\|_\infty\Big)\\
&\ge \mathcal{N}\Big(\sum_{s=1}^{k}a_s f_s,\;\epsilon_k+\sum_{s=k+1}^{\infty}a_s\Big)\\
&\ge \mathcal{N}\Big(\sum_{s=1}^{k}a_s f_s,\;2\epsilon_k\Big)\\
&\ge \mathcal{N}(a_k f_k,3\epsilon_k)-\mathcal{N}\Big(\sum_{s=1}^{k-1}a_s f_s,\epsilon_k\Big)\\
&\ge \epsilon_k^{-d/(9n)}.
\end{align*}
Here in the first step we use inequality (43), in the second the monotonicity of $\mathcal{N}(f,\epsilon)$, in the third the monotonicity of $\mathcal{N}(f,\epsilon)$ and the setting (48), in the fourth the inequality (44), and in the fifth the conditions (49) and (50).  □

4.4. Slow approximation of smooth functions by shallow networks

In this section we show that, in contrast to deep ReLU networks, shallow ReLU networks approximate sufficiently smooth ($C^2$) nonlinear functions relatively inefficiently. We remark that Liang and Srikant (2016) prove a similar result assuming global convexity instead of smoothness and nonlinearity.

Theorem 6

Let $f\in C^2([0,1]^d)$ be a nonlinear function (i.e., not of the form $f(x_1,\ldots,x_d)\equiv a_0+\sum_{k=1}^d a_k x_k$ on the whole $[0,1]^d$). Then, for any fixed $L$, a depth-$L$ ReLU network approximating $f$ with error $\epsilon\in(0,1)$ must have at least $c\epsilon^{-1/(2(L-2))}$ weights and computation units, with some constant $c=c(f,L)>0$.

Proof

Since $f\in C^2([0,1]^d)$ and $f$ is nonlinear, we can find $\mathbf{x}_0\in[0,1]^d$ and $\mathbf{v}\in\mathbb{R}^d$ such that $\mathbf{x}_0+x\mathbf{v}\in[0,1]^d$ for all $x\in[-1,1]$ and the function $f_1:x\mapsto f(\mathbf{x}_0+x\mathbf{v})$ is strictly convex or concave on $[-1,1]$. Suppose without loss of generality that it is strictly convex:
\begin{equation}
\min_{x\in[-1,1]}f_1''(x)=c_1>0. \tag{51}
\end{equation}
Suppose that $\tilde f$ is an $\epsilon$-approximation of the function $f$, and let $\tilde f$ be implemented by a ReLU network $\eta$ of depth $L$. Let $\tilde f_1:x\mapsto \tilde f(\mathbf{x}_0+x\mathbf{v})$. Then $\tilde f_1$ also approximates $f_1$, with error not larger than $\epsilon$. Moreover, since $\tilde f_1$ is obtained from $\tilde f$ by the linear substitution $\mathbf{x}=\mathbf{x}_0+x\mathbf{v}$, $\tilde f_1$ can be implemented by a ReLU network $\eta_1$ of the same depth $L$ and with the number of units and weights not larger than in $\eta$ (we can obtain $\eta_1$ from $\eta$ by replacing the input layer in $\eta$ with a single unit, accordingly modifying the input connections, and adjusting the weights associated with these connections). It is thus sufficient to establish the claimed bounds for $\eta_1$.

By construction, $\tilde f_1$ is a continuous piece-wise linear function of $x$. Denote by $M$ the number of linear pieces in $\tilde f_1$. We will use the following counting lemma.

Lemma 4

$M\le(2U)^{L-2}$, where $U$ is the number of computation units in $\eta_1$.

Proof

This bound, up to minor details, is proved in Lemma 2.1 of Telgarsky (2015). Precisely, Telgarsky’s lemma states that if a network has a single input, connections only between neighboring layers, at most $m$ units in a layer, and a piece-wise linear activation function with $t$ pieces, then the number of linear pieces in the output of the network is not greater than $(tm)^L$. By examining the proof of the lemma we see that it remains valid for networks with connections not necessarily between neighboring layers, if we replace $m$ by $U$ in the expression $(tm)^L$. Moreover, we can slightly strengthen the bound by noting that in the present paper the input and output units are counted as separate layers, only units of layers 3 to $L$ have multiple incoming connections, and the activation function is applied only in layers 2 to $L-1$. By following Telgarsky’s arguments, this gives the slightly more accurate bound $(tU)^{L-2}$ (i.e., with the power $L-2$ instead of $L$). It remains to note that the ReLU activation function corresponds to $t=2$.  □
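
As an informal numerical check of this piece-counting bound (not part of the proof), the following Python sketch builds a small random ReLU network with a single input, counts the maximal grid intervals on which the ReLU activation pattern is constant — the network is affine on each of them, so for a generic network this coincides with the number of linear pieces visible on the sampled interval — and compares the count with $(2U)^{L-2}$. The layer sizes, interval and grid resolution are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_relu_params(hidden_sizes):
    """Random weights for a fully connected ReLU net with 1 input, the given
    hidden layer sizes, and a single linear output unit."""
    sizes = [1] + list(hidden_sizes) + [1]
    return [(rng.standard_normal((sizes[i + 1], sizes[i])),
             rng.standard_normal(sizes[i + 1]))
            for i in range(len(sizes) - 1)]

def activation_regions(params, a=-3.0, b=3.0, num=20001):
    """Count maximal intervals of [a, b] (sampled on a fine grid) on which the
    ReLU activation pattern of the hidden units is constant."""
    h = np.linspace(a, b, num)[:, None]
    patterns = []
    for k, (W, bias) in enumerate(params):
        h = h @ W.T + bias
        if k < len(params) - 1:               # ReLU on hidden layers only
            patterns.append(h > 0)
            h = np.maximum(h, 0.0)
    patterns = np.concatenate(patterns, axis=1)
    changes = np.any(patterns[1:] != patterns[:-1], axis=1)
    return 1 + int(changes.sum())

hidden = [3, 3]                               # two hidden layers of 3 units each
params = random_relu_params(hidden)
U = sum(hidden) + 1                           # computation units: hidden + output
L = len(hidden) + 2                           # depth, counting input and output layers
print("regions:", activation_regions(params), "  bound (2U)**(L-2):", (2 * U) ** (L - 2))
```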

Lemma 4 implies that there is an interval $[a,b]\subset[-1,1]$ of length not less than $2(2U)^{-(L-2)}$ on which the function $\tilde f_1$ is linear. Let $g=f_1-\tilde f_1$. Then, by the approximation accuracy assumption, $\sup_{x\in[a,b]}|g(x)|\le\epsilon$, while by (51) and by the linearity of $\tilde f_1$ on $[a,b]$, $\min_{x\in[a,b]}g''(x)\ge c_1>0$. It follows that
\[
\max\big(g(a),g(b)\big)\ge g\Big(\frac{a+b}{2}\Big)+\frac{c_1}{2}\Big(\frac{b-a}{2}\Big)^2
\]
and hence
\[
\epsilon\ge\frac12\Big(\max\big(g(a),g(b)\big)-g\Big(\frac{a+b}{2}\Big)\Big)\ge\frac{c_1}{4}\Big(\frac{b-a}{2}\Big)^2\ge\frac{c_1}{4}(2U)^{-2(L-2)},
\]
which implies the claimed bound $U\ge\frac12\big(\frac{4\epsilon}{c_1}\big)^{-1/(2(L-2))}$. Since there are at least as many weights as computation units in a network, a similar bound holds for the number of weights.  □

5. Discussion

We discuss some implications of the obtained bounds.

Deep vs. shallow ReLU approximations of smooth functions

Our results clearly show that deep ReLU networks express smooth functions more efficiently than shallow ReLU networks. By Theorem 1, functions from the Sobolev space $W^{n,\infty}([0,1]^d)$ can be $\epsilon$-approximated by ReLU networks with depth $O(\ln(1/\epsilon))$ and with $O(\epsilon^{-d/n}\ln(1/\epsilon))$ computation units. In contrast, by Theorem 6, a nonlinear function from $C^2([0,1]^d)$ cannot be $\epsilon$-approximated by a ReLU network of fixed depth $L$ with fewer than $c\epsilon^{-1/(2(L-2))}$ units. In particular, it follows that, in terms of the required number of computation units, unbounded-depth approximations of functions from $W^{n,\infty}([0,1]^d)$ are asymptotically strictly more efficient than approximations with a fixed depth $L$ at least when $\frac{d}{n}<\frac{1}{2(L-2)}$ (assuming also $n>2$, so that $W^{n,\infty}([0,1]^d)\subset C^2([0,1]^d)$). The efficiency of depth is even more pronounced for very smooth functions such as polynomials, which can be implemented by deep networks using only $O(\ln(1/\epsilon))$ units (cf. Propositions 2 and 3 and the proof of Theorem 1). Liang and Srikant (2016) describe some conditions on the approximated function (resembling conditions of local analyticity) under which the complexity of deep $\epsilon$-approximation is $O(\ln^c(1/\epsilon))$ with a constant power $c$.
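
As a purely numerical illustration of this comparison, the following Python sketch tabulates the two unit counts with all constants set to 1, for parameter values $d=1$, $n=100$, $L=6$ chosen (arbitrarily) so that $\frac{d}{n}<\frac{1}{2(L-2)}$; only the trend as $\epsilon\to 0$ is meaningful.

```python
import math

# Illustrative only: unit counts with all constants set to 1, for d = 1,
# n = 100 and fixed depth L = 6, so that d/n = 0.01 < 1/(2(L-2)) = 0.125.
d, n, L = 1, 100, 6
for k in range(4, 25, 4):
    eps = 10.0 ** (-k)
    deep = eps ** (-d / n) * math.log(1.0 / eps)   # Theorem 1: O(eps^{-d/n} ln(1/eps)) units
    shallow = eps ** (-1.0 / (2 * (L - 2)))        # Theorem 6: at least c * eps^{-1/(2(L-2))} units
    print(f"eps=1e-{k:02d}: unbounded depth ~ {deep:8.1f}, fixed depth L=6 >= ~ {shallow:8.1f}")
```

For small enough $\epsilon$ the fixed-depth lower bound exceeds the unbounded-depth count, which is the asymptotic statement above; where the crossover happens depends on the ignored constants.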

Continuous model selection vs. function-dependent network architectures

When approximating a function by a neural network, one can either view the network architecture as fixed and only tune the weights, or optimize the architecture as well. Moreover, when tuning the weights, one can either require them to continuously depend on the approximated function or not. We naturally expect that more freedom in the choice of the approximation should lead to higher expressiveness.

Our bounds confirm this expectation to a certain extent. Specifically, the complexity of $\epsilon$-approximation of functions from the unit ball $F_{1,1}$ in $W^{1,\infty}([0,1])$ is lower bounded by $c\epsilon^{-1}$ in the scenario with a fixed architecture and continuously selected weights (see Theorem 3). On the other hand, we show in Theorem 2 that this complexity is upper bounded by $O\big(\frac{1}{\epsilon\ln(1/\epsilon)}\big)$ if we are allowed to adjust the network architecture. This bound is achieved by finite-depth (depth-6) ReLU networks using the idea of reused subnetworks familiar from the theory of Boolean circuits (Shannon, 1949).

In the case of fixed architecture, we have not established any evidence of complexity improvement for unconstrained weight selection compared to continuous weight selection. We remark, however, that already for approximations with depth-3 networks, the optimal weights are known to depend discontinuously, in general, on the approximated function (Kainen et al., 1999). On the other hand, part (b) of Theorem 4 shows that if the network depth scales as $O(\ln^p(1/\epsilon))$, discontinuous weight selection cannot improve the continuous-case complexity by more than a factor that is some power of $\ln(1/\epsilon)$.

Upper vs. lower complexity bounds

We indicate the gaps between the respective upper and lower bounds in the three scenarios mentioned above: fixed architectures with continuous selection of weights, fixed architectures with unconstrained selection of weights, and adaptive architectures.

For fixed architectures with continuous selection, the lower bound $c\epsilon^{-d/n}$ is provided by Theorem 3 and the upper bound $O(\epsilon^{-d/n}\ln(1/\epsilon))$ by Theorem 1, so these bounds are tight up to a factor $O(\ln(1/\epsilon))$.

In the case of a fixed architecture but unconstrained selection, part (b) of Theorem 4 gives the lower bound $c\epsilon^{-d/n}\ln^{-2p-1}(1/\epsilon)$ under the assumption that the depth is constrained by $O(\ln^p(1/\epsilon))$. This differs only by a factor of $O(\ln^{2p+2}(1/\epsilon))$ from the upper bound of Theorem 1. Without this depth constraint we only have the significantly weaker bound $c\epsilon^{-d/(2n)}$ (part (a) of Theorem 4).

In the case of adaptive architectures, there is a big gap between our upper and lower bounds. The upper bound $O\big(\frac{1}{\epsilon\ln(1/\epsilon)}\big)$ is given by Theorem 2 for $d=n=1$. The lower bound, proved for general $d$, $n$ in Theorem 5, only states that there is an $f\in W^{n,\infty}([0,1]^d)$ for which the complexity is not $o\big(\epsilon^{-d/(9n)}\big)$.

ReLU vs. smooth activation functions

A popular general-purpose method of approximation is shallow (depth-3) networks with smooth activation functions (e.g., the logistic sigmoid). Upper and lower approximation complexity bounds for these networks (Maiorov & Meir, 2000; Mhaskar, 1996) show that the complexity scales as $\epsilon^{-d/n}$ up to some $\ln(1/\epsilon)$ factors. Comparing this with our bounds in Theorems 1, 2 and 4, it appears that deep ReLU networks are roughly (up to $\ln(1/\epsilon)$ factors) as expressive as shallow networks with smooth activation functions.

6. Conclusion

We have established several upper and lower bounds for the expressive power of deep ReLU networks in the context of approximation in Sobolev spaces. We should note, however, that this setting may not quite reflect typical real-world applications, which usually possess symmetries and hierarchical and other structural properties substantially narrowing the actually interesting classes of approximated functions (LeCun et al., 2015). Some recent publications introduce and study the expressive power of deep networks in frameworks bridging this gap; in particular, graph-based hierarchical approximations are studied in Mhaskar et al. (2016) and Mhaskar and Poggio (2016), and convolutional arithmetic circuits in Cohen, Sharir, and Shashua (2015). Theoretical analysis of the expressiveness of deep networks taking into account such properties of real data seems to be an important and promising direction of future research.

Acknowledgments

The author thanks Matus Telgarsky and the anonymous referees for multiple helpful comments on the preliminary versions of the paper. The research was funded by Skolkovo Institute of Science and Technology (project “Recognition of sparse geometric data”, 2016).

References

¹ The author thanks Matus Telgarsky for suggesting this part of the theorem.
