=Paper=
{{Paper
|id=Vol-1885/100
|storemode=property
|title=Bounds on Sparsity of One-Hidden-Layer Perceptron Networks
|pdfUrl=https://ceur-ws.org/Vol-1885/100.pdf
|volume=Vol-1885
|authors=Věra Kůrková
|dblpUrl=https://dblp.org/rec/conf/itat/Kurkova17
}}
==Bounds on Sparsity of One-Hidden-Layer Perceptron Networks==
J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 100–105, CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 V. Kůrková

Věra Kůrková, Institute of Computer Science, Czech Academy of Sciences
vera@cs.cas.cz, WWW home page: http://www.cs.cas.cz/~vera

Abstract: Limitations of one-hidden-layer (shallow) perceptron networks to sparsely represent multivariable functions are investigated. A concrete class of functions is described whose computation by shallow perceptron networks requires either a large number of units or is unstable due to large output weights. The class is constructed using pseudo-noise sequences, which have many features of random sequences but can be generated using special polynomials. Connections with the central paradox of coding theory are discussed.

1 Introduction

To identify and explain efficient network designs, it is necessary to develop a theoretical understanding of the influence of a proper choice of network architecture and type of units on reducing network complexity. Bengio and LeCun [5], who recently revived the interest in deep networks, conjectured that "most functions that can be represented compactly by deep architectures cannot be represented by a compact shallow architecture". On the other hand, a recent empirical study demonstrated that shallow networks can learn some functions previously learned by deep ones using the same numbers of parameters as the original deep networks [1].

It is well known that shallow networks with merely one hidden layer of computational units of many common types can approximate within any accuracy any reasonable function on a compact domain and can also exactly compute any function on a finite domain [9, 22]. All these universality-type results are proven assuming that the numbers of network units are potentially infinite or, in the case of finite domains, at least as large as the sizes of the domains. However, in practical applications, various constraints on numbers and sizes of network parameters limit the feasibility of implementations.

Whereas many upper bounds on numbers of units in shallow networks which are sufficient for a given accuracy of function approximation are known (see, e.g., the survey article [10] and references therein), fewer lower bounds are available. Some bounds hold merely for types of computational units that are not commonly used (see, e.g., [20, 19]). Proofs of lower bounds are generally much more difficult than derivations of upper ones.

Characterization of tasks which can be computed by considerably sparser deep networks than shallow ones requires proving lower bounds on the complexity of shallow networks which are larger than upper bounds on the complexity of deep ones. An important step towards this goal is the exploration of functions which cannot be computed or approximated by shallow networks satisfying various sparsity constraints.

Investigation of sparsity of artificial neural networks has a biological motivation. A number of studies confirmed that only a small fraction of neurons have a high rate of firing at any time (sparse activity) and that each neuron is connected to only a limited number of other neurons (sparse connectivity) [17]. The simplest measure of sparse connectivity between hidden units and network outputs is the number of non-zero output weights. This number is in some literature called the "l0-pseudo-norm". However, it is neither a norm nor a pseudo-norm. Its minimization is not a convex problem and its solution is NP-hard [23]. Thus instead of the "l0-pseudo-norm", l1- and l2-norms have been used as measures of network sparsity, as they can be implemented as stabilizers in weight-decay regularization techniques (see, e.g., [8] and references therein). Also in online dictionary learning, l1-norms were used as stabilizers [21, 7].

Bengio et al. [4] suggested that a cause of large model complexities of shallow networks might be the "amount of variations" of the functions to be computed. In [14], we presented some examples showing that sparsity of shallow networks computing the same input-output functions strongly depends on the types of their units. We proposed to use as a measure of sparsity variational norms tailored to dictionaries of computational units. These norms were used as tools in nonlinear approximation theory. We showed that variational norms can be employed to obtain lower bounds on sparsity measured by l1-norms. For many dictionaries of computational units, we derived lower bounds on these norms using a probabilistic argument based on the Chernoff Bound [14, 15]. The bounds hold for almost all functions representing binary classifiers on sufficiently large finite domains. In [13] we complemented these probabilistic results by a concrete construction of binary classifiers with large variational norms with respect to signum perceptrons.

In this paper, we investigate sparsity of shallow networks computing real-valued functions on finite rectangular domains. Such domains can be 2-dimensional (e.g., pixels of photographs) or high-dimensional (e.g., digitized high-dimensional cubes), but typically they are quite large. We describe a construction of a class of functions on such domains based on matrices with orthogonal rows. Estimating variational norms of these functions from below, we obtain lower bounds on l1-norms of shallow networks with signum perceptrons. We show that these networks must have either large numbers of hidden units or some of their output weights must be large. Neither is desirable, as large output weights may lead to instability of computation. We illustrate our general construction by a concrete class of circulant matrices generated by pseudo-noise sequences. We discuss the effect of pseudo-randomness on network complexity.

The paper is organized as follows. Section 2 contains basic concepts on shallow networks and dictionaries of computational units. In Section 3, sparsity is investigated in terms of the l1-norm and norms tailored to computational units. In Section 4, a concrete construction of classes of functions with large variational norms based on orthogonal matrices is described. In Section 5, the general results are illustrated by a concrete example of matrices obtained from pseudo-noise sequences. Section 6 is a brief discussion.
2 Preliminaries

One-hidden-layer networks with single linear outputs (shallow networks) compute input-output functions from sets of the form

span_n G := { ∑_{i=1}^n w_i g_i | w_i ∈ R, g_i ∈ G },

where G, called a dictionary, is a set of functions computable by a given type of units, the coefficients w_i are called output weights, and n is the number of hidden units. This number is the simplest measure of model complexity.

In this paper, we focus on representations of functions on finite domains X ⊂ R^d. We denote by

F(X) := { f | f : X → R }

the set of all real-valued functions on X. On F(X) we have the Euclidean inner product defined as

⟨f, g⟩ := ∑_{u∈X} f(u) g(u)

and the Euclidean norm

‖f‖ := √⟨f, f⟩.

We investigate networks with units from the dictionary of signum perceptrons

P_d(X) := { sgn(v · x + b) : X → {−1, 1} | v ∈ R^d, b ∈ R },

where sgn(t) := −1 for t < 0 and sgn(t) := 1 for t ≥ 0. Note that from the point of view of model complexity, there is only a minor difference between networks with signum perceptrons and those with Heaviside perceptrons, as

sgn(t) = 2ϑ(t) − 1   and   ϑ(t) := (sgn(t) + 1)/2,

where ϑ(t) = 0 for t < 0 and ϑ(t) = 1 for t ≥ 0. An advantage of signum perceptrons is that all units from the dictionary P_d(X) have norms of the same size, equal to √(card X).
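To make the notation concrete, the following minimal sketch (not from the paper; it assumes NumPy and arbitrarily chosen illustrative weights and units) evaluates a shallow signum perceptron network, i.e., an element of span_n P_d(X), on a small finite grid domain and checks that every unit of P_d(X) has norm √(card X).

<pre>
import numpy as np

def sgn(t):
    """Signum as defined above: -1 for t < 0, +1 for t >= 0."""
    return np.where(t < 0, -1.0, 1.0)

def signum_perceptron(v, b, X):
    """One unit of the dictionary P_d(X): x -> sgn(v . x + b) on a finite domain X."""
    return sgn(X @ v + b)

def shallow_network(weights, units, X):
    """Input-output function from span_n G: sum_i w_i g_i evaluated on all of X."""
    return sum(w * signum_perceptron(v, b, X) for w, (v, b) in zip(weights, units))

# Illustrative example (assumed data): a finite domain X of card X = 16 points in R^2.
grid = np.linspace(0.0, 1.0, 4)
X = np.array([[x, y] for x in grid for y in grid])

units = [(np.array([1.0, -1.0]), 0.1), (np.array([0.5, 2.0]), -1.0)]
weights = [0.7, -1.3]
f = shallow_network(weights, units, X)

# Every signum perceptron g in P_d(X) has Euclidean norm sqrt(card X).
g = signum_perceptron(*units[0], X)
print(np.linalg.norm(g), np.sqrt(len(X)))   # both 4.0 here
print(f.shape)                              # one value per point of X
</pre>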
3 Measures of Sparsity

The simplest measure of sparse connectivity between the hidden layer and the network output is the number of non-zero output weights. In some literature, the number of non-zero coefficients among the w_i's in an input-output function

f = ∑_{i=1}^n w_i g_i    (1)

from span_n G is called an "l0-pseudo-norm" (in quotation marks) and denoted ‖w‖_0. However, it is neither a norm nor a pseudo-norm. The quantity ‖w‖_0 is always an integer and thus ‖·‖_0 does not satisfy the homogeneity property of a norm (‖λx‖ = |λ|‖x‖ for all λ). Moreover, the "unit ball" {w ∈ R^n | ‖w‖_0 ≤ 1} is non-convex and unbounded. Minimization of the "l0-pseudo-norm" of the vector of output weights is a difficult non-convex optimization task. Instead of "l0", the l1-norm defined as

‖w‖_1 = ∑_{i=1}^n |w_i|

and the l2-norm defined as

‖w‖_2 = √( ∑_{i=1}^n w_i^2 )

of output weight vectors w = (w_1, ..., w_n) have been used in weight-decay regularization [8]. These norms can be implemented as stabilizers modifying the error functionals which are minimized during learning. A network with a large l1- or l2-norm of its output-weight vector must have either a large number of units or some large output weights. Neither property is desirable, as they imply a large model complexity or instability of computation caused by large output weights.

Many dictionaries of computational units are overcomplete, and thus the representation (1) as a linear combination of units from the dictionary need not be unique. For a finite dictionary G, the minimum of the l1-norms of output-weight vectors of shallow networks with units from G computing f is equal to a norm tailored to the dictionary G. This norm, called G-variation, has been used as a tool for estimating rates of approximation of functions by networks with increasing "l0-pseudo-norms". G-variation is defined for a bounded subset G of a normed linear space (X, ‖·‖) as

‖f‖_G := inf { c ∈ R_+ | f/c ∈ cl_X conv (G ∪ −G) },

where −G := {−g | g ∈ G}, cl_X denotes the closure with respect to the topology induced by the norm ‖·‖_X, and conv is the convex hull. Variation with respect to the dictionary of Heaviside perceptrons (called variation with respect to half-spaces) was introduced by Barron [2] and we extended it to general sets in [11].

As G-variation is a norm, it can be made arbitrarily large by multiplying a function by a scalar. Also in the theoretical analysis of approximation capabilities of shallow networks, it has to be taken into account that the approximation error ‖f − span_n G‖ in any norm ‖·‖ can be made arbitrarily large by multiplying f by a scalar. Indeed, for every c > 0,

‖c f − span_n G‖ = c ‖f − span_n G‖.

Thus, both G-variation and errors in approximation by span_n G have to be studied either for sets of normalized functions or for sets of functions of a given fixed norm.

G-variation is related to l1-sparsity and can be used for estimating its lower bounds. The proof of the next proposition follows easily from the definition.

Proposition 1. Let G be a finite subset of (X, ‖·‖) with card G = k. Then, for every f ∈ X,

‖f‖_G = min { ∑_{i=1}^k |w_i| : f = ∑_{i=1}^k w_i g_i, w_i ∈ R, g_i ∈ G }.

Another important property of G-variation is its role in estimates of rates of approximation by networks with small "l0-pseudo-norms". This follows from the Maurey–Jones–Barron Theorem [3]. Here we state its reformulation from [11, 16, 12] in terms of G-variation merely for the finite-dimensional Hilbert space (F(X), ‖·‖) with the Euclidean norm. By G^o is denoted the set of normalized elements of G, i.e., G^o = { g/‖g‖ | g ∈ G }.

Theorem 2. Let X ⊂ R^d be finite, G be a finite subset of F(X), s_G = max_{g∈G} ‖g‖, and f ∈ F(X). Then for every n,

‖f − span_n G‖ ≤ ‖f‖_{G^o} / √n ≤ s_G ‖f‖_G / √n.

Theorem 2 together with Proposition 1 implies that for any function f that can be l1-sparsely represented by a shallow network with units from a dictionary G and for any n, there exists an input-output function f_n of a network with n units, f_n = ∑_{i=1}^n w_i g_i (and so ‖w‖_0 ≤ n), such that

‖f − f_n‖ ≤ s_G ‖f‖_G / √n.

Lower bounds on variational norms can be obtained by geometrical arguments. The following theorem from [16] shows that functions which are "nearly orthogonal" to all elements of a dictionary G have large G-variations.

Theorem 3. Let (X, ‖·‖_X) be a Hilbert space and G its bounded subset. Then for every f ∈ X \ G^⊥,

‖f‖_G ≥ ‖f‖^2 / sup_{g∈G} |⟨g, f⟩|.
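For a small finite dictionary, Proposition 1 turns G-variation into a linear program: minimize the l1-norm of the output weights subject to exact representation of f. The sketch below is only an illustration under assumed data (a random ±1 dictionary), not the paper's construction; it uses SciPy's linprog and also evaluates the lower bound of Theorem 3 for comparison.

<pre>
import numpy as np
from scipy.optimize import linprog

def g_variation(G, f):
    """Proposition 1: for a finite dictionary given by the columns of G,
    ||f||_G = min ||w||_1 subject to G w = f.  Standard LP split w = p - q, p, q >= 0."""
    m, k = G.shape                        # m = card X, k = card G
    c = np.ones(2 * k)                    # minimize sum(p) + sum(q) = ||w||_1
    A_eq = np.hstack([G, -G])             # G p - G q = f
    res = linprog(c, A_eq=A_eq, b_eq=f, bounds=[(0, None)] * (2 * k))
    return res.fun if res.success else np.inf

def theorem3_lower_bound(G, f):
    """Theorem 3: ||f||_G >= ||f||^2 / sup_g |<g, f>|."""
    return np.dot(f, f) / np.max(np.abs(G.T @ f))

# Assumed illustrative data: 8 random sign vectors on a domain of 6 points.
rng = np.random.default_rng(0)
G = rng.choice([-1.0, 1.0], size=(6, 8))
w_true = np.array([0.5, 0, -1.0, 0, 0, 2.0, 0, 0])
f = G @ w_true                            # some f in span G

print("l1-norm of one representation:", np.abs(w_true).sum())
print("G-variation (min l1-norm)    :", g_variation(G, f))
print("Theorem 3 lower bound        :", theorem3_lower_bound(G, f))
</pre>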
4 Constructive Lower Bounds on Variational Norms

In this section, we derive lower bounds on the l1-sparsity of shallow signum perceptron networks from lower bounds on variational norms with respect to signum perceptrons. Theorem 3 implies that functions which are nearly orthogonal to all elements of a dictionary have large variations. The inner product

⟨f, g⟩ = ∑_{x∈X} f(x) g(x)

of any two functions f, g on a finite domain X is invariant under reordering of the points in X.

Estimating inner products of functions with signum perceptrons on sets of points in general positions is quite difficult, so we focus on functions on square domains

X = {x_1, ..., x_n} × {y_1, ..., y_n} ⊂ R^d

formed by points in grid-like positions. For example, pixels of pictures in R^d as well as digitized high-dimensional cubes can form such square domains. Functions on square domains can be represented by square matrices. For a function f on X = {x_1, ..., x_n} × {y_1, ..., y_n}, we denote by M(f) the n × n matrix defined as

M(f)_{i,j} = f(x_i, y_j).

On the other hand, an n × n matrix M induces a function f_M on X such that

f_M(x_i, y_j) = M_{i,j}.

We prove a lower bound on variation with respect to signum perceptrons for functions on square domains represented by matrices with orthogonal rows. To obtain these bounds from Theorem 3, we have to estimate inner products of these functions with signum perceptrons. We derive estimates of these inner products using the following two lemmas. The first one is from [13] and the second one follows from the Cauchy–Schwarz Inequality.

Lemma 1. Let d = d_1 + d_2, {x_i | i = 1, ..., n} ⊂ R^{d_1}, {y_j | j = 1, ..., n} ⊂ R^{d_2}, and X = {x_1, ..., x_n} × {y_1, ..., y_n} ⊂ R^d. Then for every g ∈ P_d(X) there exists a reordering of rows and columns of the n × n matrix M(g) such that in the reordered matrix each row and each column starts with a (possibly empty) initial segment of −1's followed by a (possibly empty) segment of +1's.

Lemma 2. Let M be an n × n matrix with orthogonal rows, v_1, ..., v_n be its row vectors, and a = max_{i=1,...,n} ‖v_i‖. Then for any subset I of the set of indices of rows and any subset J of the set of indices of columns of the matrix M,

| ∑_{i∈I} ∑_{j∈J} M_{i,j} | ≤ a √(card I · card J).
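As a numerical sanity check of Lemma 2 (not from the paper), one can take a ±1 matrix with orthogonal rows, here a Sylvester-type matrix used purely as an illustrative stand-in, and verify that the absolute sum over any submatrix stays below a·√(card I · card J):

<pre>
import numpy as np

def sylvester_matrix(k):
    """2^k x 2^k +/-1 matrix with mutually orthogonal rows (illustrative stand-in)."""
    H = np.array([[1.0]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

M = sylvester_matrix(3)                          # 8 x 8, orthogonal rows
a = max(np.linalg.norm(row) for row in M)        # maximal row norm, here sqrt(8)

rng = np.random.default_rng(1)
for _ in range(1000):
    I = rng.choice(8, size=rng.integers(1, 9), replace=False)
    J = rng.choice(8, size=rng.integers(1, 9), replace=False)
    lhs = abs(M[np.ix_(I, J)].sum())
    rhs = a * np.sqrt(len(I) * len(J))
    assert lhs <= rhs + 1e-9                     # the bound of Lemma 2 holds
print("Lemma 2 bound verified on 1000 random index sets")
</pre>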
The following theorem gives a lower bound on variation with respect to signum perceptrons for functions induced by matrices with orthogonal rows.

Theorem 4. Let M be an n × n matrix with orthogonal rows, v_1, ..., v_n be its row vectors, a = max_{i=1,...,n} ‖v_i‖, d = d_1 + d_2, {x_i | i = 1, ..., n} ⊂ R^{d_1}, {y_j | j = 1, ..., n} ⊂ R^{d_2}, X = {x_i | i = 1, ..., n} × {y_j | j = 1, ..., n} ⊂ R^d, and f_M : X → R be defined as f_M(x_i, y_j) = M_{i,j}. Then

‖f_M‖_{P_d(X)} ≥ a / ⌈log_2 n⌉.

Sketch of the proof. For each signum perceptron g ∈ P_d(X), we permute both matrices: the one induced by g and the one with orthogonal rows. To estimate the inner product of the permuted matrices, we apply Lemmas 1 and 2 to a partition of both matrices into submatrices. The submatrices of the permuted M(g) have all entries equal to +1 or all entries equal to −1. The permuted matrix M has orthogonal rows, and thus we can estimate from above the sums of the entries of its submatrices with the same rows and columns as the submatrices of M(g). The partition is constructed by iterating the same decomposition at most ⌈log_2 n⌉ times. By Theorem 3,

‖f_M‖_{P_d(X)} ≥ ‖f_M‖^2 / sup_{g∈P_d(X)} ⟨f_M, g⟩ ≥ a / ⌈log_2 n⌉.

Theorem 4 shows that shallow perceptron networks computing functions generated by orthogonal n × n matrices must have l1-norms bounded from below by a / ⌈log_2 n⌉.

All signum perceptrons on a domain X with card X = n × n have norms equal to n. The largest lower bound implied by Theorem 4 for functions induced by n × n matrices with orthogonal rows, which have norms equal to n, is achieved for matrices where all rows have the same norm, equal to √n. Such functions have variations with respect to signum perceptrons at least

√n / ⌈log_2 n⌉.

In particular, when the domain is the 2d-dimensional Boolean cube {0, 1}^{2d} = {0, 1}^d × {0, 1}^d, the lower bound is

2^{d/2} / d.

So the lower bound grows exponentially with d.
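A short illustration of Theorem 4 under an assumed choice of matrices: ±1 Hadamard matrices (not the paper's construction, just a convenient family with orthogonal rows of equal norm √n) give a = √n, so the bound √n / ⌈log_2 n⌉ can be tabulated; for n = 2^d it coincides with 2^{d/2}/d.

<pre>
import math
import numpy as np
from scipy.linalg import hadamard    # Sylvester-Hadamard matrices, n a power of 2

# Illustrative stand-in for a matrix with orthogonal rows of equal norm sqrt(n);
# then Theorem 4 gives ||f_M||_{P_d(X)} >= sqrt(n) / ceil(log2 n) = 2^(d/2)/d for n = 2^d.
for d in range(2, 11):
    n = 2 ** d
    M = hadamard(n)
    assert np.allclose(M @ M.T, n * np.eye(n))      # orthogonal rows
    a = max(np.linalg.norm(row) for row in M)       # = sqrt(n)
    bound = a / math.ceil(math.log2(n))             # Theorem 4 lower bound
    print(f"d = {d:2d}  n = {n:5d}  bound = {bound:8.2f}  (2^(d/2)/d = {2**(d/2)/d:8.2f})")
</pre>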
5 Computation of Functions Induced by Pseudo-Noise Sequences by Shallow Perceptron Networks

In this section, we apply our general results to a class of circulant matrices with rows formed by shifted segments of pseudo-noise sequences. These sequences are deterministic but exhibit some properties of random sequences.

An infinite sequence a_0, a_1, ..., a_i, ... of elements of {0, 1} is called a k-th order linear recurring sequence if for some h_0, ..., h_k ∈ {0, 1}

a_i = ∑_{j=1}^k a_{i−j} h_{k−j} mod 2

for all i ≥ k. It is called a k-th order pseudo-noise (PN) sequence (or pseudo-random sequence) if it is a k-th order linear recurring sequence with minimal period 2^k − 1. PN-sequences are generated by primitive polynomials. A polynomial

h(x) = ∑_{j=0}^m h_j x^j

is called a primitive polynomial of degree m when the smallest integer n for which h(x) divides x^n + 1 is n = 2^m − 1.

PN-sequences have many useful applications because some of their properties mimic those of random sequences. A run is a string of consecutive 1's or a string of consecutive 0's. In any segment of length 2^k − 1 of a k-th order PN-sequence, one half of the runs have length 1, one quarter have length 2, one eighth have length 3, and so on. In particular, there is one run of 1's of length k and one run of 0's of length k − 1. Thus every segment of length 2^k − 1 contains 2^{k−1} ones and 2^{k−1} − 1 zeros [18, p. 410].

An important property of PN-sequences is their low autocorrelation. The autocorrelation of a sequence a_0, a_1, ..., a_i, ... of elements of {0, 1} with period 2^k − 1 is defined as

ρ(t) = (1 / (2^k − 1)) ∑_{j=0}^{2^k − 2} (−1)^{a_j + a_{j+t}}.    (2)

For every PN-sequence and for every t = 1, ..., 2^k − 2,

ρ(t) = −1 / (2^k − 1)    (3)

[18, p. 411].

Let τ : {0, 1} → {−1, 1} be defined as τ(x) = (−1)^x (i.e., τ(0) = 1 and τ(1) = −1). We say that a 2^k × 2^k matrix L_k(α) is induced by a k-th order PN-sequence α = (a_0, a_1, ..., a_i, ...) when for all i = 1, ..., 2^k, L_{i,1} = 1, for all j = 1, ..., 2^k, L_{1,j} = 1, and for all i = 2, ..., 2^k and j = 2, ..., 2^k

L_k(α)_{i,j} = τ(A_{i−1, j−1}),

where A is the (2^k − 1) × (2^k − 1) circulant matrix with rows formed by shifted segments of length 2^k − 1 of the sequence α. The next proposition, following from equations (2) and (3), shows that for any PN-sequence α the matrix L_k(α) has orthogonal rows.

Proposition 5. Let k be a positive integer, α = (a_0, a_1, ..., a_i, ...) be a k-th order PN-sequence, and L_k(α) be the 2^k × 2^k matrix induced by α. Then all pairs of rows of L_k(α) are orthogonal.

Applying Theorem 4 to the 2^k × 2^k matrix L_k(α) induced by a k-th order PN-sequence α, we obtain a lower bound of the form 2^{k/2} / k on the variation with respect to signum perceptrons of the function induced by the matrix L_k(α). So in any shallow perceptron network computing this function, the number of units or the sizes of some output weights depend on k exponentially.
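The construction of this section can be checked numerically for small k. The sketch below is illustrative code, not from the paper; it assumes the primitive polynomial x^4 + x + 1 of degree 4, generates one period of the corresponding PN-sequence by the recurrence above, builds the circulant matrix A and the matrix L_k(α), and verifies Proposition 5 (pairwise orthogonal rows), from which Theorem 4 yields the bound 2^{k/2}/k.

<pre>
import numpy as np

def pn_sequence(h, length):
    """k-th order linear recurring sequence a_i = sum_{j=1}^k a_{i-j} h_{k-j} mod 2.
    h = (h_0, ..., h_k) are the coefficients of the generating polynomial."""
    k = len(h) - 1
    a = [1] * k                                      # any nonzero initial state works
    while len(a) < length:
        a.append(sum(a[-j] * h[k - j] for j in range(1, k + 1)) % 2)
    return a

def pn_matrix(alpha, k):
    """The 2^k x 2^k matrix L_k(alpha): first row and column all 1's,
    L[i, j] = tau(A[i-1, j-1]) with tau(x) = (-1)^x and A the circulant matrix
    of shifted length-(2^k - 1) segments of alpha."""
    p = 2 ** k - 1
    A = np.array([[alpha[(i + j) % p] for j in range(p)] for i in range(p)])
    L = np.ones((p + 1, p + 1))
    L[1:, 1:] = (-1.0) ** A                          # tau(0) = 1, tau(1) = -1
    return L

k = 4
h = (1, 1, 0, 0, 1)                                  # assumed: x^4 + x + 1, primitive of degree 4
alpha = pn_sequence(h, 2 ** k - 1)                   # one period of length 15
L = pn_matrix(alpha, k)

n = 2 ** k
assert np.allclose(L @ L.T, n * np.eye(n))           # Proposition 5: pairwise orthogonal rows
print("Theorem 4 bound 2^(k/2)/k =", np.sqrt(n) / np.ceil(np.log2(n)))   # 1.0 for k = 4
</pre>

For k = 4 the bound is still small; its exponential growth in k is the point of the construction.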
6 Discussion

We investigated limitations of shallow perceptron networks in sparsely representing real-valued functions. We considered sparsity measured by the l1-norm, which has been used in weight-decay regularization techniques [8] and in online dictionary learning [7]. We proved lower bounds on l1-norms of output weight vectors of shallow signum perceptron networks computing functions on square domains induced by matrices with orthogonal rows. We illustrated our general results by an example of a class of matrices constructed using pseudo-noise sequences. These deterministic sequences mimic some properties of random sequences. We showed that shallow perceptron networks which compute functions constructed using these sequences must have either large numbers of hidden units or some of their output weights must be large.

There is an interesting analogy with the central paradox of coding theory. This paradox is expressed in the title of the article "Any code of which we cannot think is good" [6]. It was proven there that any code which is truly random (in the sense that there is no concise way to generate the code) is good (it meets the Gilbert–Varshamov bound on distance versus redundancy). However, despite sophisticated constructions for codes derived over the years, no one has succeeded in finding a constructive procedure that yields such good codes. Similarly, computation of "any function of which we cannot think" (truly random) by shallow perceptron networks might be intractable. Our results show that computation of functions exhibiting some randomness properties by shallow perceptron networks is difficult in the sense that it requires networks of large complexities. Such functions can be constructed using deterministic algorithms and have many applications. Properties of pseudo-noise sequences were exploited for constructions of codes, interplanetary satellite picture transmission, precision measurements, acoustics, radar camouflage, and light diffusers. These sequences permit designs of surfaces that scatter incoming signals very broadly, making reflected energy "invisible" or "inaudible".

Acknowledgments. This work was partially supported by the Czech Grant Agency grant GA 15-18108S and institutional support of the Institute of Computer Science RVO 67985807.

References

[1] L. J. Ba and R. Caruana. Do deep networks really need to be deep? In Z. Ghahramani et al., editors, Advances in Neural Information Processing Systems, volume 27, pages 1–9, 2014.
[2] A. R. Barron. Neural net approximation. In K. S. Narendra, editor, Proc. 7th Yale Workshop on Adaptive and Learning Systems, pages 69–72. Yale University Press, 1992.
[3] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39:930–945, 1993.
[4] Y. Bengio, O. Delalleau, and N. Le Roux. The curse of highly variable functions for local kernel machines. In Advances in Neural Information Processing Systems, volume 18, pages 107–114. MIT Press, 2006.
[5] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines. MIT Press, 2007.
[6] J. T. Coffey and R. M. Goodman. Any code of which we cannot think is good. IEEE Transactions on Information Theory, 36:1453–1461, 1990.
[7] D. L. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest. Communications on Pure and Applied Mathematics, 59:797–829, 2006.
[8] T. L. Fine. Feedforward Neural Network Methodology. Springer, Berlin Heidelberg, 1999.
[9] Y. Ito. Finite mapping by neural networks and truth functions. Mathematical Scientist, 17:69–77, 1992.
[10] P. C. Kainen, V. Kůrková, and M. Sanguineti. Dependence of computational models on input dimension: Tractability of approximation and optimization tasks. IEEE Transactions on Information Theory, 58:1203–1214, 2012.
[11] V. Kůrková. Dimension-independent rates of approximation by neural networks. In K. Warwick and M. Kárný, editors, Computer-Intensive Methods in Control and Signal Processing. The Curse of Dimensionality, pages 261–270. Birkhäuser, Boston, MA, 1997.
[12] V. Kůrková. Complexity estimates based on integral transforms induced by computational units. Neural Networks, 33:160–167, 2012.
[13] V. Kůrková. Constructive lower bounds on model complexity of shallow perceptron networks. Neural Computing and Applications, DOI 10.1007/s00521-017-2965-0, 2017.
[14] V. Kůrková and M. Sanguineti. Model complexities of shallow networks representing highly varying functions. Neurocomputing, 171:598–604, 2016.
[15] V. Kůrková and M. Sanguineti. Probabilistic lower bounds for approximation by shallow perceptron networks. Neural Networks, 91:34–41, 2017.
[16] V. Kůrková, P. Savický, and K. Hlaváčková. Representations and rates of approximation of real-valued Boolean functions by neural networks. Neural Networks, 11:651–659, 1998.
[17] S. B. Laughlin and T. J. Sejnowski. Communication in neural networks. Science, 301:1870–1874, 2003.
[18] F. J. MacWilliams and N. J. A. Sloane. The Theory of Error-Correcting Codes. North-Holland Publishing Co., 1977.
[19] V. E. Maiorov and R. Meir. On the near optimality of the stochastic approximation of smooth functions by neural networks. Advances in Computational Mathematics, 13:79–103, 2000.
[20] V. E. Maiorov and A. Pinkus. Lower bounds for approximation by MLP neural networks. Neurocomputing, 25:81–91, 1999.
[21] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.
[22] A. Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195, 1999.
[23] A. M. Tillmann. On the computational intractability of exact and approximate dictionary learning. IEEE Signal Processing Letters, 22:45–49, 2015.