
Neural Networks

Volume 94, October 2017, Pages 103-114

Error bounds for approximations with deep ReLU networks

https://doi.org/10.1016/j.neunet.2017.07.002

Abstract

We study the expressive power of shallow and deep neural networks with piece-wise linear activation functions. We establish new rigorous upper and lower bounds for the network complexity in the setting of approximations in Sobolev spaces. In particular, we prove that deep ReLU networks more efficiently approximate smooth functions than shallow networks. In the case of approximations of 1D Lipschitz functions we describe adaptive depth-6 network architectures more efficient than the standard shallow architecture.

Keywords

Deep ReLU networks
Approximation complexity


1. Introduction

Recently, multiple successful applications of deep neural networks to pattern recognition problems (LeCun et al., 2015; Schmidhuber, 2015) have revived active interest in theoretical properties of such networks, in particular their expressive power. It has been argued that deep networks may be more expressive than shallow ones of comparable size (see, e.g., Bianchini and Scarselli (2014), Delalleau and Bengio (2011), Montufar et al. (2014), Raghu et al. (2016), Telgarsky (2015)). In contrast to a shallow network, a deep one can be viewed as a long sequence of non-commutative transformations, which is a natural setting for high expressiveness (cf. the well-known Solovay–Kitaev theorem on fast approximation of arbitrary quantum operations by sequences of non-commutative gates, see Dawson and Nielsen (2006), Kitaev et al. (2002)).

There are various ways to characterize the expressive power of networks. Delalleau and Bengio (2011) consider sum–product networks and prove for certain classes of polynomials that they are much more easily represented by deep networks than by shallow networks. Montufar et al. (2014) estimate the number of linear regions in the network's landscape. Bianchini and Scarselli (2014) give bounds for Betti numbers characterizing topological properties of functions represented by networks. Telgarsky (2015, 2016) provides specific examples of classification problems where deep networks are provably more efficient than shallow ones.

In the context of classification problems, a general and standard approach to characterizing expressiveness is based on the notion of the Vapnik–Chervonenkis dimension (Vapnik & Chervonenkis, 2015). There exist several bounds for the VC-dimension of deep networks with piece-wise polynomial activation functions that go back to geometric techniques of Goldberg and Jerrum (1995) and earlier results of Warren (1968); see Bartlett et al. (1998), Sakurai (1999) and the book (Anthony & Bartlett, 2009). There is a related concept, the fat-shattering dimension, for real-valued approximation problems (Anthony & Bartlett, 2009; Kearns & Schapire, 1990).

A very general approach to expressiveness in the context of approximation is the method of nonlinear widths by DeVore, Howard, and Micchelli (1989) that concerns approximation of a family of functions under assumption of a continuous dependence of the model on the approximated function.

In this paper we examine the problem of shallow-vs-deep expressiveness from the perspective of approximation theory and general spaces of functions having derivatives up to a certain order (Sobolev-type spaces). In this framework, the problem of expressiveness is very well studied in the case of shallow networks with a single hidden layer, where it is known, in particular, that to approximate a $C^n$-function on a $d$-dimensional set with small error $\epsilon$ one needs a network of size about $\epsilon^{-d/n}$, assuming a smooth activation function (see, e.g., Mhaskar (1996), Pinkus (1999) for a number of related rigorous upper and lower bounds and further qualifications of this result). Much less seems to be known about deep networks in this setting, though Mhaskar, Liao, and Poggio (2016), Mhaskar and Poggio (2016) have recently introduced functional spaces constructed using deep dependency graphs and obtained expressiveness bounds for related deep networks.

We will focus our attention on networks with the ReLU activation function $\sigma(x)=\max(0,x)$, which, despite its utter simplicity, seems to be the most popular choice in practical applications (LeCun et al., 2015). We will consider the $L^\infty$-error of approximation of functions belonging to the Sobolev spaces $W^{n,\infty}([0,1]^d)$ (without any assumptions of hierarchical structure). We will often consider families of approximations, as the approximated function runs over the unit ball $F_{d,n}$ in $W^{n,\infty}([0,1]^d)$. In such cases we will distinguish scenarios of fixed and adaptive network architectures. Our goal is to obtain lower and upper bounds on the expressiveness of deep and shallow networks in different scenarios. We measure complexity of networks in a conventional way, by counting the number of their weights and computation units (cf. Anthony and Bartlett (2009)).

The main body of the paper consists of Section 2 (the ReLU network model), Section 3 (upper bounds) and Section 4 (lower bounds).

In Section 2 we describe our ReLU network model and show that the ReLU function is replaceable by any other continuous piece-wise linear activation function, up to constant factors in complexity asymptotics (Proposition 1).

In Section 3 we establish several upper bounds on the complexity of approximating by ReLU networks, in particular showing that deep networks are quite efficient for approximating smooth functions. Specifically:

  • In Section 3.1 we show that the function $f(x)=x^2$ can be $\epsilon$-approximated by a network of depth and complexity $O(\ln(1/\epsilon))$ (Proposition 2). This also leads to similar upper bounds on the depth and complexity that are sufficient to implement an approximate multiplication in a ReLU network (Proposition 3).

  • In Section 3.2 we describe a ReLU network architecture of depth $O(\ln(1/\epsilon))$ and complexity $O(\epsilon^{-d/n}\ln(1/\epsilon))$ that is capable of approximating with error $\epsilon$ any function from $F_{d,n}$ (Theorem 1).

  • In Section 3.3 we show that, even with fixed-depth networks, one can further decrease the approximation complexity if the network architecture is allowed to depend on the approximated function. Specifically, we prove that one can $\epsilon$-approximate a given Lipschitz function on the segment $[0,1]$ by a depth-6 ReLU network with $O\big(\frac{1}{\epsilon\ln(1/\epsilon)}\big)$ connections and activation units (Theorem 2). This upper bound is of interest since it lies below the lower bound provided by the method of nonlinear widths under the assumption of continuous model selection (see Section 4.1).

In Section 4 we obtain several lower bounds on the complexity of approximation by deep and shallow ReLU networks, using different approaches and assumptions.

  • In Section 4.1 we recall the general lower bound provided by the method of continuous nonlinear widths. This method assumes that the parameters of the approximation continuously depend on the approximated function, but does not assume anything about how the approximation depends on its parameters. In this setup, at least $\epsilon^{-d/n}$ connections and weights are required for an $\epsilon$-approximation on $F_{d,n}$ (Theorem 3). As already mentioned, for $d=n=1$ this lower bound is above the upper bound provided by Theorem 2.

  • In Section 4.2 we consider the setup where the same network architecture is used to approximate all functions in $F_{d,n}$, but the weights are not assumed to depend continuously on the function. In this case, application of existing results on the VC-dimension of deep piece-wise polynomial networks yields an $\epsilon^{-d/(2n)}$ lower bound in general and an $\epsilon^{-d/n}\ln^{-2p-1}(1/\epsilon)$ lower bound if the network depth grows as $O(\ln^p(1/\epsilon))$ (Theorem 4).

  • In Section 4.3 we consider an individual approximation, without any assumptions regarding it as an element of a family as in Sections 4.1 and 4.2. We prove that for any $d,n$ there exists a function in $W^{n,\infty}([0,1]^d)$ such that its approximation complexity is not $o(\epsilon^{-d/(9n)})$ as $\epsilon\to 0$ (Theorem 5).

  • In Section 4.4 we prove that $\epsilon$-approximation of any nonlinear $C^2$-function by a network of fixed depth $L$ requires at least $\epsilon^{-1/(2(L-2))}$ computation units (Theorem 6). By comparison with Theorem 1, this shows that for sufficiently smooth functions approximation by fixed-depth ReLU networks is less efficient than by unbounded-depth networks.

In Section 5 we discuss the obtained bounds and summarize their implications, in particular comparing deep vs. shallow networks and fixed vs. adaptive architectures.

The arXiv preprint of the first version of the present work appeared almost simultaneously with the work of Liang and Srikant (2016), containing results partly overlapping with our results in Sections 3.1 (fast deep approximation of squaring and multiplication), 3.2 (fast deep approximation of general smooth functions) and 4.4 (slow approximation of smooth functions by shallow networks). Liang and Srikant consider networks equipped with both ReLU and threshold activation functions. They prove a logarithmic upper bound for the complexity of approximating the function $f(x)=x^2$, which is analogous to our Proposition 2. Then, they extend this upper bound to polynomials and smooth functions. In contrast to our treatment of generic smooth functions based on standard Sobolev spaces, they impose more complex assumptions on the function (including, in particular, how many derivatives it has) that depend on the required approximation accuracy $\epsilon$. As a consequence, they obtain strong $O(\ln^c(1/\epsilon))$ complexity bounds rather different from our bound in Theorem 1 (in fact, our lower bound proved in Theorem 5 rules out, in general, such strong upper bounds for functions having only finitely many derivatives). Also, Liang and Srikant prove a lower bound for the complexity of approximating convex functions by shallow networks. Our version of this result, given in Section 4.4, is different in that we assume smoothness and nonlinearity instead of global convexity.


Fig. 1. A feedforward neural network having 3 input units (diamonds), 1 output unit (square), and 7 computation units with nonlinear activation (circles). The network has 4 layers and 16+8=24 weights.

2. The ReLU network model

Throughout the paper, we consider feedforward neural networks with the ReLU (Rectified Linear Unit) activation function $\sigma(x)=\max(0,x)$. The network consists of several input units, one output unit, and a number of "hidden" computation units. Each hidden unit performs an operation of the form
$$y=\sigma\Big(\sum_{k=1}^{N} w_k x_k + b\Big) \tag{1}$$
with some weights (adjustable parameters) $(w_k)_{k=1}^{N}$ and $b$ depending on the unit. The output unit is also a computation unit, but without the nonlinearity, i.e., it computes $y=\sum_{k=1}^{N} w_k x_k + b$. The units are grouped in layers, and the inputs $(x_k)_{k=1}^{N}$ of a computation unit in a certain layer are outputs of some units belonging to any of the preceding layers (see Fig. 1). Note that we allow connections between units in non-neighboring layers. Occasionally, when this cannot cause confusion, we may denote the network and the function it implements by the same symbol.
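For concreteness, here is a minimal Python/NumPy sketch of this computation model. The architecture and the numeric weights are arbitrary toy values chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def unit(inputs, weights, bias):
    # a hidden computation unit, eq. (1): y = sigma(sum_k w_k x_k + b)
    return relu(np.dot(weights, inputs) + bias)

def toy_network(x1, x2):
    # layer 2 (first hidden layer)
    h1 = unit([x1, x2], [1.0, -1.0], 0.0)
    h2 = unit([x1, x2], [0.5, 0.5], -0.25)
    # layer 3 (second hidden layer); note the skip connection to the input x1
    h3 = unit([h1, h2, x1], [1.0, 2.0, -1.0], 0.1)
    # output unit: the same affine form, but without the nonlinearity
    return 0.3 * h3 - 1.0 * h1 + 0.2

# depth 4 (inputs, two hidden layers, output); 9 connections and 4 computation
# units (3 hidden + 1 output), i.e. 9 + 4 = 13 weights in the counting convention used below
print(toy_network(0.3, 0.7))
```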

The depth of the network, the number of units and the total number of weights are standard measures of network complexity (Anthony & Bartlett, 2009). We will use these measures throughout the paper. The number of weights is, clearly, the sum of the total number of connections and the number of computation units. We identify the depth with the number of layers (in particular, the most common type of neural networks – shallow networks having a single hidden layer – are depth-3 networks according to this convention).

We finish this subsection with a proposition showing that, given our complexity measures, using the ReLU activation function is not much different from using any other piece-wise linear activation function with finitely many breakpoints: one can replace one network by an equivalent one but having another activation function while only increasing the number of units and weights by constant factors. This justifies our restricted attention to the ReLU networks (which could otherwise have been perceived as an excessively particular example of networks).

Proposition 1

Let $\rho:\mathbb{R}\to\mathbb{R}$ be any continuous piece-wise linear function with $M$ breakpoints, where $1\le M<\infty$.

  • (a)

    Let $\xi$ be a network with the activation function $\rho$, having depth $L$, $W$ weights and $U$ computation units. Then there exists a ReLU network $\eta$ that has depth $L$, not more than $(M+1)^2 W$ weights and not more than $(M+1)U$ units, and that computes the same function as $\xi$.

  • (b)

    Conversely, let $\eta$ be a ReLU network of depth $L$ with $W$ weights and $U$ computation units. Let $D$ be a bounded subset of $\mathbb{R}^n$, where $n$ is the input dimension of $\eta$. Then there exists a network with the activation function $\rho$ that has depth $L$, $4W$ weights and $2U$ units, and that computes the same function as $\eta$ on the set $D$.

Proof

(a) Let $a_1<\dots<a_M$ be the breakpoints of $\rho$, i.e., the points where its derivative is discontinuous: $\rho'(a_k+)\ne\rho'(a_k-)$. We can then express $\rho$ via the ReLU function $\sigma$, as a linear combination
$$\rho(x)=c_0\,\sigma(a_1-x)+\sum_{m=1}^{M}c_m\,\sigma(x-a_m)+h$$
with appropriately chosen coefficients $(c_m)_{m=0}^{M}$ and $h$. It follows that the computation performed by a single $\rho$-unit,
$$x_1,\dots,x_N\mapsto\rho\Big(\sum_{k=1}^{N}w_k x_k+b\Big),$$
can be equivalently represented by a linear combination of a constant function and computations of $M+1$ $\sigma$-units,
$$x_1,\dots,x_N\mapsto\begin{cases}\sigma\Big(\sum_{k=1}^{N}w_k x_k+b-a_m\Big), & m=1,\dots,M,\\[4pt] \sigma\Big(a_1-b-\sum_{k=1}^{N}w_k x_k\Big), & m=0\end{cases}$$
(here $m$ is the index of a $\sigma$-unit). We can then replace one-by-one all the $\rho$-units in the network $\xi$ by $\sigma$-units, without changing the output of the network. Obviously, these replacements do not change the network depth. Since each hidden unit gets replaced by $M+1$ new units, the number of units in the new network is not greater than $M+1$ times their number in the original network. Note also that the number of connections in the network is multiplied, at most, by $(M+1)^2$. Indeed, each unit replacement entails replacing each of the incoming and outgoing connections of this unit by $M+1$ new connections, and each connection is replaced twice: as an incoming and as an outgoing one. These considerations imply the claimed complexity bounds for the resulting $\sigma$-network $\eta$.
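As a numerical illustration of this decomposition (not part of the original proof), the sketch below picks an arbitrary continuous piece-wise linear $\rho$ (made-up breakpoints, slopes and offset), builds one explicit choice of the coefficients $(c_m)_{m=0}^{M}$ and $h$, and checks that the ReLU combination reproduces $\rho$.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# A made-up continuous piece-wise linear rho: breakpoints a_1 < ... < a_M,
# slopes s_0, ..., s_M on the M+1 linear pieces, and the value rho(a_1).
a = np.array([-1.0, 0.0, 2.0])
s = np.array([0.1, -0.5, 1.0, 0.3])
rho_at_a1 = 0.7

def rho(x):
    # direct evaluation of the piece-wise linear function
    y = rho_at_a1 + s[0] * np.minimum(x - a[0], 0.0)
    for m in range(len(a)):
        hi = a[m + 1] if m + 1 < len(a) else np.inf
        y = y + s[m + 1] * (np.clip(x, a[m], hi) - a[m])
    return y

# One choice of coefficients for rho(x) = c_0*sigma(a_1 - x) + sum_m c_m*sigma(x - a_m) + h:
# c_0 cancels the leftmost slope, c_1 installs the slope s_1, each further c_m is the
# slope jump at a_m, and h = rho(a_1).
c0 = -s[0]
c = np.diff(s)
c[0] = s[1]
h = rho_at_a1

def rho_via_relu(x):
    y = c0 * relu(a[0] - x) + h
    for m in range(len(a)):
        y = y + c[m] * relu(x - a[m])
    return y

xs = np.linspace(-3.0, 4.0, 2001)
print(np.max(np.abs(rho(xs) - rho_via_relu(xs))))  # ~0 up to float rounding
```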

(b) Let $a$ be any breakpoint of $\rho$, so that $\rho'(a+)\ne\rho'(a-)$. Let $r_0$ be the distance separating $a$ from the nearest other breakpoint, so that $\rho$ is linear on $[a,a+r_0]$ and on $[a-r_0,a]$ (if $\rho$ has only one node, any $r_0>0$ will do). Then, for any $r>0$, we can express the ReLU function $\sigma$ via $\rho$ in the $r$-neighborhood of 0:
$$\sigma(x)=\frac{\rho\big(a+\tfrac{r_0}{2r}x\big)-\rho\big(a-\tfrac{r_0}{2}+\tfrac{r_0}{2r}x\big)-\rho(a)+\rho\big(a-\tfrac{r_0}{2}\big)}{\big(\rho'(a+)-\rho'(a-)\big)\tfrac{r_0}{2r}},\quad x\in[-r,r].$$
It follows that a computation performed by a single $\sigma$-unit,
$$x_1,\dots,x_N\mapsto\sigma\Big(\sum_{k=1}^{N}w_k x_k+b\Big),$$
can be equivalently represented by a linear combination of a constant function and two $\rho$-units,
$$x_1,\dots,x_N\mapsto\rho\Big(a+\tfrac{r_0}{2r}b+\tfrac{r_0}{2r}\sum_{k=1}^{N}w_k x_k\Big),\qquad \rho\Big(a-\tfrac{r_0}{2}+\tfrac{r_0}{2r}b+\tfrac{r_0}{2r}\sum_{k=1}^{N}w_k x_k\Big),$$
provided the condition
$$\sum_{k=1}^{N}w_k x_k+b\in[-r,r] \tag{2}$$
holds. Since $D$ is a bounded set, we can choose $r$ at each unit of the initial network $\eta$ sufficiently large so as to satisfy condition (2) for all network inputs from $D$. Then, like in (a), we replace each $\sigma$-unit with two $\rho$-units, which produces the desired $\rho$-network.  □
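A quick numerical check of this local identity for $\sigma$, with the concrete (arbitrary) choice $\rho(x)=|x|$, $a=0$, $r_0=1$, $r=5$:

```python
import numpy as np

rho = np.abs                 # example rho with a single breakpoint at a = 0; slopes -1 and +1
a, r0, r = 0.0, 1.0, 5.0     # arbitrary values for this check
slope_jump = 1.0 - (-1.0)    # rho'(a+) - rho'(a-)

def sigma_via_rho(x):
    # the local representation of sigma via rho on [-r, r]
    t = r0 / (2 * r) * x
    num = rho(a + t) - rho(a - r0 / 2 + t) - rho(a) + rho(a - r0 / 2)
    return num / (slope_jump * r0 / (2 * r))

xs = np.linspace(-r, r, 2001)
print(np.max(np.abs(sigma_via_rho(xs) - np.maximum(xs, 0.0))))  # ~0 on [-r, r]
```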

3. Upper bounds

Throughout the paper, we will be interested in approximating functions $f:[0,1]^d\to\mathbb{R}$ by ReLU networks. Given a function $f:[0,1]^d\to\mathbb{R}$ and its approximation $\tilde f$, by the approximation error we will always mean the uniform maximum error $\|f-\tilde f\|_\infty=\max_{x\in[0,1]^d}|f(x)-\tilde f(x)|$.
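In code, this uniform error can only be estimated by evaluating both functions on a fine grid; a small illustrative sketch for $d=1$ (the two functions below are hypothetical examples, not taken from the paper):

```python
import numpy as np

def sup_error(f, f_tilde, pts=1001):
    # grid estimate of the uniform error ||f - f_tilde||_inf on [0,1] (d = 1 for simplicity)
    xs = np.linspace(0.0, 1.0, pts)
    return np.max(np.abs(f(xs) - f_tilde(xs)))

# example: a crude piece-wise linear interpolation of x^2 with knots 0, 1/2, 1
print(sup_error(lambda x: x ** 2,
                lambda x: np.where(x < 0.5, 0.5 * x, 1.5 * x - 0.5)))  # 0.0625 = 1/16
```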

3.1. Fast deep approximation of squaring and multiplication

Our first key result shows that ReLU networks with unconstrained depth can very efficiently approximate the function $f(x)=x^2$ (more efficiently than any fixed-depth network, as we will see in Section 4.4). Our construction uses the "sawtooth" function that has previously appeared in the paper by Telgarsky (2015).

Proposition 2

The function $f(x)=x^2$ on the segment $[0,1]$ can be approximated with any error $\epsilon>0$ by a ReLU network having the depth and the number of weights and computation units $O(\ln(1/\epsilon))$.

Proof

Consider the "tooth" (or "mirror") function $g:[0,1]\to[0,1]$,
$$g(x)=\begin{cases}2x, & x<\tfrac12,\\ 2(1-x), & x\ge\tfrac12,\end{cases}$$
and the iterated functions
$$g_s(x)=\underbrace{g\circ g\circ\cdots\circ g}_{s}(x).$$
Telgarsky has shown (see Lemma 2.4 in Telgarsky (2015)) that $g_s$ is a "sawtooth" function with $2^{s-1}$ uniformly distributed "teeth" (each application of $g$ doubles the number of teeth):
$$g_s(x)=\begin{cases}2^s\big(x-\tfrac{2k}{2^s}\big), & x\in\big[\tfrac{2k}{2^s},\tfrac{2k+1}{2^s}\big],\ k=0,1,\dots,2^{s-1}-1,\\[4pt] 2^s\big(\tfrac{2k}{2^s}-x\big), & x\in\big[\tfrac{2k-1}{2^s},\tfrac{2k}{2^s}\big],\ k=1,2,\dots,2^{s-1},\end{cases}$$
(see Fig. 2(a)). Our key observation now is that the function $f(x)=x^2$ can be approximated by linear combinations of the functions $g_s$. Namely, let $f_m$ be the piece-wise linear interpolation of $f$ with $2^m+1$ uniformly distributed breakpoints $\tfrac{k}{2^m},\ k=0,\dots,2^m$:
$$f_m\Big(\frac{k}{2^m}\Big)=\Big(\frac{k}{2^m}\Big)^2,\quad k=0,\dots,2^m$$
(see Fig. 2(b)). The function $f_m$ approximates $f$ with the error $\epsilon_m=2^{-2m-2}$. Now note that refining the interpolation from $f_{m-1}$ to $f_m$ amounts to adjusting it by a function proportional to a sawtooth function:
$$f_{m-1}(x)-f_m(x)=\frac{g_m(x)}{2^{2m}}.$$
Hence
$$f_m(x)=x-\sum_{s=1}^{m}\frac{g_s(x)}{2^{2s}}.$$
Since $g$ can be implemented by a finite ReLU network (as $g(x)=2\sigma(x)-4\sigma\big(x-\tfrac12\big)+2\sigma(x-1)$) and since the construction of $f_m$ only involves $O(m)$ linear operations and compositions of $g$, we can implement $f_m$ by a ReLU network having depth and the number of weights and computation units all being $O(m)$ (see Fig. 2(c)). This implies the claim of the proposition.  □
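The construction is easy to reproduce numerically. The following sketch (an illustration, not the network itself) evaluates $f_m$ exactly as in the proof and checks on a dense grid that the error matches the bound $\epsilon_m=2^{-2m-2}$.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def g(x):
    # the "tooth" function, written with three ReLU units as in the proof
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def f_m(x, m):
    # f_m(x) = x - sum_{s=1}^m g_s(x) / 2^(2s),  g_s = s-fold composition of g
    y = np.array(x, dtype=float)
    gs = np.array(x, dtype=float)
    for s in range(1, m + 1):
        gs = g(gs)
        y = y - gs / 2 ** (2 * s)
    return y

xs = np.linspace(0.0, 1.0, 2 ** 12 + 1)   # dyadic grid, so it contains the worst-case points
for m in range(1, 7):
    err = np.max(np.abs(f_m(xs, m) - xs ** 2))
    print(f"m = {m}:  error {err:.3e}   bound 2^(-2m-2) = {2.0 ** (-2 * m - 2):.3e}")
```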

Since
$$xy=\frac12\big((x+y)^2-x^2-y^2\big), \tag{3}$$
we can use Proposition 2 to efficiently implement accurate multiplication in a ReLU network. The implementation will depend on the required accuracy and the magnitude of the multiplied quantities.
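As a simplified illustration of this idea, restricted to $x,y\in[0,1]$ so that no scaling constant $M$ is needed (a sketch of the principle only, not the network $\widetilde{\times}$ of Proposition 3 below), one can combine three copies of the approximation $f_m$ from Proposition 2 via identity (3):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def g(x):
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def approx_sq(x, m):
    # the approximation f_m of x^2 on [0,1] from Proposition 2 (error <= 2^(-2m-2))
    y, gs = np.array(x, dtype=float), np.array(x, dtype=float)
    for s in range(1, m + 1):
        gs = g(gs)
        y = y - gs / 2 ** (2 * s)
    return y

def approx_mul(x, y, m):
    # identity (3): xy = ((x+y)^2 - x^2 - y^2) / 2; (x+y)/2 is rescaled into [0,1],
    # so (x+y)^2 is recovered as 4 * approx_sq((x+y)/2).  Total error <= 3 * 2^(-2m-2).
    return 0.5 * (4 * approx_sq((x + y) / 2, m) - approx_sq(x, m) - approx_sq(y, m))

rng = np.random.default_rng(0)
x, y = rng.random(10000), rng.random(10000)
for m in (4, 8, 12):
    print(m, np.max(np.abs(approx_mul(x, y, m) - x * y)))
```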


Fig. 2. Fast approximation of the function $f(x)=x^2$ from Proposition 2: (a) the "tooth" function $g$ and the iterated "sawtooth" functions $g_2,g_3$; (b) the approximating functions $f_m$; (c) the network architecture for $f_4$.

Proposition 3

Given $M>0$ and $\epsilon\in(0,1)$, there is a ReLU network $\eta$ with two input units that implements a function $\widetilde{\times}:\mathbb{R}^2\to\mathbb{R}$ so that

  • (a)

    for any inputs $x,y$, if $|x|\le M$ and $|y|\le M$, then $|\widetilde{\times}(x,y)-xy|\le\epsilon$;

  • (b)

    if $x=0$ or $y=0$, then $\widetilde{\times}(x,y)=0$;

  • (c)

    the depth and the number of weights and computation units in $\eta$ are not greater than $c_1\ln(1/\epsilon)+c_2$ with an absolute constant $c_1$ and a constant $c_2=c_2(M)$.

Proof

Let $\tilde f_{\mathrm{sq},\delta}$ be the approximate squaring function from Proposition 2 such that $\tilde f_{\mathrm{sq},\delta}(0)=0$ and $|\tilde f_{\mathrm{sq},\delta}(x)-x^2|\le\delta$ for $x\in[0,1]$. Assume without loss of generality that $M\ge 1$ and set
$$\widetilde{\times}(x,y)=\frac{M^2}{8}\Big(\tilde f_{\mathrm{sq},\delta}\Big(\frac{|x+y|}{2M}\Big)-\tilde f_{\mathrm{sq},\delta}\Big(\frac{|x|}{2M}\Big)-\tilde f_{\mathrm{sq},\delta}\Big(\frac{|y|}{2M}\Big)\Big), \tag{4}$$
where $\delta=\frac{8\epsilon}{3M^2}$. Then property (b) is immediate and (a) follows easily using expansion (3). To conclude (c), observe that computation (4) consists of three instances of