<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Bounds on Sparsity of One-Hidden-Layer Perceptron Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Věra Kůrková</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, Czech Academy of Sciences</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1885</volume>
      <fpage>100</fpage>
      <lpage>105</lpage>
      <abstract>
        <p>Limitations of one-hidden-layer (shallow) perceptron networks in sparsely representing multivariable functions are investigated. A concrete class of functions is described whose computation by shallow perceptron networks requires either a large number of units or is unstable due to large output weights. The class is constructed using pseudo-noise sequences, which have many features of random sequences but can be generated using special polynomials. Connections with the central paradox of coding theory are discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        To identify and explain efficient network designs, it is
necessary to develop a theoretical understanding of the
influence of the choice of network architecture and type of
units on reducing network complexity. Bengio and LeCun
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], who recently revived the interest in deep networks,
conjectured that “most functions that can be represented
compactly by deep architectures cannot be represented by
a compact shallow architecture”. On the other hand, a
recent empirical study demonstrated that shallow networks
can learn some functions previously learned by deep ones
using the same numbers of parameters as the original deep
networks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        It is well-known that shallow networks with merely
one hidden layer of computational units of many common
types can approximate within any accuracy any
reasonable function on a compact domain and also can exactly
compute any function on a finite domain [
        <xref ref-type="bibr" rid="ref22 ref9">9, 22</xref>
        ]. All these
universality type results are proven assuming that numbers
of network units are potentially infinite or, in the case of
finite domains, are at least as large as sizes of the domains.
However, in practical applications, various constraints on
numbers and sizes of network parameters limit feasibility
of implementations.
      </p>
      <p>
        Whereas many upper bounds on numbers of units in
shallow networks which are sufficient for a given
accuracy of function approximation are known (see, e.g., the
survey article [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and references therein), fewer lower
bounds are available. Some bounds hold merely for types
of computational units that are not commonly used (see,
e.g., [
        <xref ref-type="bibr" rid="ref19 ref20">20, 19</xref>
        ]). Proofs of lower bounds are generally much
more difficult than derivations of upper ones.
      </p>
      <p>Characterization of tasks which can be computed by
considerably sparser deep networks than shallow ones
requires proving lower bounds on complexity of shallow
networks which are larger than upper bounds on complexity
of deep ones. An important step towards this goal is
exploration of functions which cannot be computed or
approximated by shallow networks satisfying various
sparsity constraints.</p>
      <p>
        Investigation of sparsity of artificial neural networks has
a biological motivation. A number of studies confirmed
that only a small fraction of neurons have a high rate of
firing at any time (sparse activity) and that each neuron
is connected to only a limited number of other neurons
(sparse connectivity) [
        <xref ref-type="bibr" rid="ref17">17</xref>
]. The simplest measure of
sparse connectivity between hidden units and network
outputs is the number of non-zero output weights. In
some literature, this number is called the “l0-pseudo-norm”.
However, it is neither a norm nor a pseudo-norm. Its
minimization is a non-convex problem whose solution is NP-hard
[
        <xref ref-type="bibr" rid="ref23">23</xref>
]. Thus, instead of the “l0-pseudo-norm”, l1 and l2-norms
have been used as measures of network sparsity as they
can be implemented as stabilizers in weight-decay
regularization techniques (see, e.g., [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and references therein).
Also in online dictionary learning, l1-norms were used as
stabilizers [
        <xref ref-type="bibr" rid="ref21 ref7">21, 7</xref>
        ].
      </p>
      <p>
        Bengio et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
] suggested that a cause of the large model
complexity of shallow networks might lie in the “amount
of variations” of functions to be computed. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we
presented some examples showing that sparsity of
shallow networks computing the same input-output functions
strongly depends on types of their units. We proposed
to use as a measure of sparsity variational norms
tailored to dictionaries of computational units. These norms
were used as tools in nonlinear approximation theory.
We showed that variational norms can be employed to
obtain lower bounds on sparsity measured by l1-norms.
For many dictionaries of computational units, we derived
lower bounds on these norms using a probabilistic
argument based on the Chernoff Bound [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. The bounds
hold for almost all functions representing binary classifiers
on sufficiently large finite domains. In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] we
complemented these probabilistic results by a concrete
construction of binary classifiers with large variational norms with
respect to signum perceptrons.
      </p>
      <p>In this paper, we investigate sparsity of shallow
networks computing real-valued functions on finite
rectangular domains. Such domains can be 2-dimensional (e.g.,
pixels of photographs) or high-dimensional (e.g., digitized
high-dimensional cubes), but typically they are quite large.
We describe a construction of a class of functions on such
domains based on matrices with orthogonal rows.
Estimating variational norms of these functions from below,
we obtain lower bounds on l1-norms of shallow networks
with signum perceptrons. We show that these networks
must have either large numbers of hidden units or some
large output weights. Neither is desirable, as large output
weights may lead to instability of
computation. We illustrate our general construction by a concrete
class of circulant matrices generated by pseudo-noise
sequences. We discuss the effect of pseudo-randomness on
network complexity.</p>
      <p>The paper is organized as follows. Section 2 contains
basic concepts on shallow networks and dictionaries of
computational units. In Section 3, sparsity is investigated
in terms of l1-norm and norms tailored to computational
units. In Section 4, a concrete construction of classes of
functions with large variational norms based on
orthogonal matrices is described. In Section 5, the general results
are illustrated by a concrete example of matrices obtained
from pseudo-noise sequences. Section 6 is a brief
discussion.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 Preliminaries</title>
      <p>One-hidden-layer networks with single linear outputs
(shallow networks) compute input-output functions from
sets of the form
span_n G := { ∑_{i=1}^{n} w_i g_i | w_i ∈ R, g_i ∈ G },
where G, called a dictionary, is a set of functions
computable by a given type of units, the coefficients wi are
called output weights, and n is the number of hidden units.
This number is the simplest measure of model complexity.</p>
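      <p>As a minimal sketch (the helper names here are ours, purely illustrative), an element of span_n G with signum-perceptron units sgn(v · x + b) can be computed as follows:</p>

```python
# Sketch: an input-output function from span_n G, where each dictionary
# unit is a signum perceptron x -> sgn(v . x + b). Names are illustrative.

def sgn(t):
    # sgn(t) = -1 for t below 0 and sgn(t) = 1 for t >= 0
    return 1.0 if t >= 0 else -1.0

def signum_perceptron(v, b):
    # one dictionary unit g(x) = sgn(v . x + b)
    return lambda x: sgn(sum(vi * xi for vi, xi in zip(v, x)) + b)

def shallow_network(weights, units):
    # an element of span_n G: linear combination of n hidden units
    return lambda x: sum(w * g(x) for w, g in zip(weights, units))

g1 = signum_perceptron([1.0, -1.0], 0.0)
g2 = signum_perceptron([0.5, 0.5], -1.0)
f = shallow_network([2.0, -3.0], [g1, g2])
print(f([1.0, 0.0]))  # 2*sgn(1.0) - 3*sgn(-0.5) = 5.0
```

      <p>Here n = 2; the coefficients 2.0 and −3.0 are the output weights whose number and sizes are the subject of the sparsity measures studied below.</p>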
      <p>In this paper, we focus on representations of functions
on finite domains X ⊂ R^d. We denote by</p>
      <p>F(X) := { f | f : X → R }
the set of all real-valued functions on X. On F(X) we
have the Euclidean inner product defined as
⟨f, g⟩ := ∑_{u∈X} f(u) g(u)
and the Euclidean norm
‖f‖ := √⟨f, f⟩.</p>
      <p>We investigate networks with units from the dictionary
of signum perceptrons</p>
      <p>Pd(X) := { sgn(v · x + b) : X → {−1, 1} | v ∈ R^d, b ∈ R },
where sgn(t) := −1 for t &lt; 0 and sgn(t) := 1 for t ≥ 0.
Note that from the point of view of model complexity,
there is only a minor difference between networks with
signum perceptrons and those with Heaviside perceptrons,
as
sgn(t) = 2ϑ(t) − 1 and ϑ(t) = (sgn(t) + 1)/2,
where ϑ(t) = 0 for t &lt; 0 and ϑ(t) = 1 for t ≥ 0. An
advantage of signum perceptrons is that all units from the
dictionary Pd(X) have the same norm, equal to √(card X).
</p>
    </sec>
    <sec id="sec-3">
      <title>3 Measures of Sparsity</title>
      <p>The simplest measure of sparse connectivity between
the hidden layer and the network output is the number of
non-zero output weights. In some literature, the number of
non-zero coefficients among the wi’s in an input-output
function</p>
      <p>f = ∑_{i=1}^{n} w_i g_i    (1)
from span G is called an “l0-pseudo-norm”, in quotation
marks, and denoted ‖w‖0. However, it is neither a norm
nor a pseudo-norm. The quantity ‖w‖0 is always an integer,
and thus ‖·‖0 does not satisfy the homogeneity property
of a norm (‖λx‖ = |λ| ‖x‖ for all λ). Moreover, the “unit
ball” {w ∈ R^n | ‖w‖0 ≤ 1} is non-convex and unbounded.</p>
      <p>Minimization of the “l0-pseudo-norm” of the vector of
output weights is a difficult non-convex optimization task.
Instead of “l0”, the l1-norm defined as
‖w‖1 = ∑_{i=1}^{n} |w_i|
and the l2-norm defined as
‖w‖2 = ( ∑_{i=1}^{n} w_i² )^{1/2}</p>
      <p>
        of output-weight vectors w = (w1, . . . , wn) have been used
in weight-decay regularization [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These norms can be
implemented as stabilizers modifying error functionals
which are minimized during learning. A network with a
large l1 or l2-norm of its output-weight vector must have
either a large number of units or some output weights must
be large. Neither of these properties is desirable, as they
imply large model complexity or instability of
computation caused by large output weights.
      </p>
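      <p>The three measures discussed above can be sketched in a few lines (a toy example with our own function names):</p>

```python
# Sketch: the "l0-pseudo-norm", l1-norm, and l2-norm of an
# output-weight vector w = (w1, ..., wn).

def l0(w):
    # number of non-zero output weights; integer-valued, not homogeneous
    return sum(1 for wi in w if wi != 0)

def l1(w):
    return sum(abs(wi) for wi in w)

def l2(w):
    return sum(wi * wi for wi in w) ** 0.5

w = [0.0, 3.0, -4.0]
print(l0(w), l1(w), l2(w))  # 2 7.0 5.0
# doubling w doubles l1 and l2 but leaves l0 unchanged
print(l0([2 * wi for wi in w]))  # 2
```

      <p>Scaling w leaves l0 unchanged while l1 and l2 double, which is exactly the failure of the homogeneity property noted for the “l0-pseudo-norm”.</p>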
      <p>Many dictionaries of computational units are
overcomplete and thus the representation (1) as a linear
combination of units from the dictionary need not be unique.
For a finite dictionary G, the minimum of the l1-norms of
output-weight vectors of shallow networks with units from
G computing f is equal to a norm tailored to the dictionary
G. This norm, called G-variation, has been used as a tool
for estimation of rates of approximation of functions by
networks with increasing “l0-pseudo-norms”. G-variation
is defined for a bounded subset G of a normed linear space
(X, ‖·‖) as</p>
      <p>
        ‖f‖G := inf { c ∈ R+ | f/c ∈ clX conv (G ∪ −G) },
where −G := { −g | g ∈ G }, clX denotes the closure with
respect to the topology induced by the norm ‖·‖X, and
conv is the convex hull. Variation with respect to the
dictionary of Heaviside perceptrons (called variation with
respect to half-spaces) was introduced by Barron [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and we
extended it to general sets in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>As G-variation is a norm, it can be made arbitrarily
large by multiplying a function by a scalar. Also in
theoretical analysis of approximation capabilities of shallow
networks, it has to be taken into account that the
approximation error ‖f − span_n G‖ in any norm ‖·‖ can be made
arbitrarily large by multiplying f by a scalar. Indeed, for
every c &gt; 0,</p>
      <p>‖c f − span_n G‖ = c ‖f − span_n G‖.</p>
      <p>Thus, both G-variation and errors in approximation by
spann G have to be studied either for sets of normalized
functions or for sets of functions of a given fixed norm.</p>
      <p>G-variation is related to l1-sparsity and can be used
for estimating its lower bounds. The proof of the next
proposition follows easily from the definition.</p>
      <p>Proposition 1. Let G be a finite subset of (X, ‖·‖) with
card G = k. Then, for every f ∈ X,
‖f‖G = min { ∑_{i=1}^{k} |w_i| : f = ∑_{i=1}^{k} w_i g_i, w_i ∈ R, g_i ∈ G }.</p>
      <p>
        Another important property of G-variation is its role
in estimates of rates of approximation by networks with
small “l0-pseudo-norms”. This follows from the
Maurey-Jones-Barron Theorem [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Here we state its reformulation
from [
        <xref ref-type="bibr" rid="ref11 ref12 ref16">11, 16, 12</xref>
] in terms of G-variation, merely for the finite-dimensional
Hilbert space (F(X), ‖·‖) with the Euclidean
norm. By G° we denote the set of normalized elements of
G, i.e., G° := { g/‖g‖ | g ∈ G }.
      </p>
      <p>Theorem 2. Let X ⊂ R^d be finite, G be a finite subset of
F(X), sG = max_{g∈G} ‖g‖, and f ∈ F(X). Then for every
n,
‖f − span_n G‖ ≤ ‖f‖_{G°} / √n ≤ sG ‖f‖G / √n.</p>
      <p>Theorem 2 together with Proposition 1 implies that for any
function f that can be l1-sparsely represented by a shallow
network with units from a dictionary G and for any n, there
exists an input-output function fn of a network with n units,
fn = ∑_{i=1}^{n} w_i g_i (so that ‖w‖0 ≤ n), such that
‖f − fn‖ ≤ sG ‖f‖G / √n.</p>
      <p>
        Lower bounds on variational norms can be obtained by
geometrical arguments. The following theorem from [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
shows that functions which are “nearly orthogonal” to all
elements of a dictionary G have large G-variations.
Theorem 3. Let (X, ‖·‖X) be a Hilbert space and G its
bounded subset. Then for every f ∈ X \ G⊥,
‖f‖G ≥ ‖f‖² / sup_{g∈G} |⟨g, f⟩|.</p>
      <p>In the following, we derive lower bounds on l1-sparsity of
shallow signum perceptron networks from lower bounds
on variational norms with respect to signum perceptrons.
      </p>
      <p>Theorem 3 implies that functions which are nearly
orthogonal to all elements of a dictionary have large
variations. The inner product
⟨f, g⟩ = ∑_{x∈X} f(x) g(x)</p>
      <p>of any two functions f, g on a finite domain X is invariant
under reordering of the points in X.</p>
      <p>To estimate inner products of functions with signum
perceptrons on sets of points in general position is quite
difficult, so we focus on functions on square domains</p>
      <p>X = {x1, . . . , xn} × {y1, . . . , yn} ⊂ Rd
formed by points in grid-like positions. For example,
pixels of pictures in Rd as well as digitized high-dimensional
cubes can form such square domains. Functions on square
domains can be represented by square matrices. For a
function f on X = {x1, . . . , xn} × {y1, . . . , yn} we denote
by M( f ) the n × n matrix defined as</p>
      <p>M(f)_{i,j} = f(x_i, y_j).</p>
      <p>On the other hand, an n × n matrix M induces a function
fM on X such that</p>
      <p>f_M(x_i, y_j) = M_{i,j}.</p>
      <p>
        We prove a lower bound on variation with respect to
signum perceptrons for functions on square domains
represented by matrices with orthogonal rows. To obtain
these bounds from Theorem 3, we have to estimate inner
products of these functions with signum perceptrons. We
derive estimates of these inner products using the
following two lemmas. The first one is from [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and the second
one follows from the Cauchy-Schwartz Inequality.
Lemma 1. Let d = d1 + d2, {xi | i = 1, . . . , n} ⊂
Rd1 , {y j | j = 1, . . . , n} ⊂ Rd2 , and X = {x1, . . . , xn} ×
{y1, . . . , yn} ⊂ Rd . Then for every g ∈ Pd (X ) there exists a
reordering of rows and columns of the n × n matrix M(g)
such that in the reordered matrix each row and each
column starts with a (possibly empty) initial segment of −1’s
followed by a (possibly empty) segment of +1’s.
Lemma 2. Let M be an n × n matrix with orthogonal rows,
v1, . . . , vn be its row vectors, and a = max_{i=1,...,n} ‖v_i‖. Then for
any subset I of the set of indices of rows and any subset J
of the set of indices of columns of the matrix M,
| ∑_{i∈I} ∑_{j∈J} M_{i,j} | ≤ a √(card I · card J).
      </p>
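      <p>Lemma 2 can be checked numerically; the sketch below (our own toy example, not from the paper) verifies the submatrix-sum bound for a 4 × 4 matrix with orthogonal rows:</p>

```python
# Sketch: check Lemma 2 on a 4 x 4 matrix with orthogonal rows
# (a Hadamard-type matrix; a = max row norm = 2).
from itertools import chain, combinations

M = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]
a = max(sum(x * x for x in row) ** 0.5 for row in M)  # 2.0

def subsets(idx):
    # all non-empty subsets of an index set
    return chain.from_iterable(combinations(idx, r) for r in range(1, len(idx) + 1))

# Lemma 2: |sum over the I x J submatrix| bounded by a * sqrt(|I| * |J|)
ok = all(
    a * (len(I) * len(J)) ** 0.5 - abs(sum(M[i][j] for i in I for j in J)) >= -1e-9
    for I in subsets(range(4)) for J in subsets(range(4))
)
print(ok)  # True
```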
      <p>The following theorem gives a lower bound on variation
with respect to signum perceptrons for functions induced
by matrices with orthogonal rows.</p>
      <p>Theorem 4. Let M be an n × n matrix with orthogonal
rows, v1, . . . , vn be its row vectors, a = max_{i=1,...,n} ‖v_i‖, d =
d1 + d2, {x_i | i = 1, . . . , n} ⊂ R^{d1}, {y_j | j = 1, . . . , n} ⊂ R^{d2},
X = {x_1, . . . , x_n} × {y_1, . . . , y_n} ⊂ R^d, and
f_M : X → R be defined as f_M(x_i, y_j) = M_{i,j}. Then
‖f_M‖_{Pd(X)} ≥ a / ⌈log2 n⌉.</p>
      <p>Sketch of the proof.</p>
      <p>For each signum perceptron g ∈ Pd (X ), we permute both
matrices: the one induced by g and the one with
orthogonal rows. To estimate the inner product of the permuted
matrices, we apply Lemmas 1 and 2 to a partition of both
matrices into submatrices. The submatrices of permuted
M(g) have all entries either equal to +1 or all entries equal
to −1. The permuted matrix M has orthogonal rows and
thus we can estimate from above the sums of entries of its
submatrices with the same rows and columns as
submatrices of M(g). The partition is constructed by iterating
at most dlog2 ne-times the same decomposition. By
Theorem 3,</p>
      <p>‖f_M‖_{Pd(X)} ≥ ‖f_M‖² / sup_{g∈Pd(X)} ⟨f_M, g⟩ ≥ a / ⌈log2 n⌉.</p>
      <p>Theorem 4 shows that shallow perceptron networks
computing functions generated by orthogonal n × n matrices
must have l1-norms bounded from below by a / ⌈log2 n⌉.</p>
      <p>All signum perceptrons on a domain X with card X =
n × n have norms equal to n. The largest lower bound
implied by Theorem 4 for functions induced by n × n
matrices with orthogonal rows, which have norms equal to n, is
achieved for matrices whose rows all have the same norm,
equal to √n. Such functions have variations with respect
to signum perceptrons at least
√n / ⌈log2 n⌉.</p>
      <p>In particular, when the domain is the 2d-dimensional
Boolean cube {0, 1}^{2d} = {0, 1}^d × {0, 1}^d, then the lower
bound is 2^{d/2}/d.</p>
      <p>In this section, we apply our general results to a class of
circulant matrices with rows formed by shifted segments
of pseudo-noise sequences. These sequences are
deterministic but exhibit some properties of random sequences.</p>
      <p>An infinite sequence a0, a1, . . . , ai, . . . of elements of
{0, 1} is called a k-th order linear recurring sequence if for
some h0, . . . , hk ∈ {0, 1},</p>
      <p>a_i = ∑_{j=1}^{k} a_{i−j} h_{k−j} mod 2
for all i ≥ k. It is called a k-th order pseudo-noise (PN)
sequence (or pseudo-random sequence) if it is a k-th order
linear recurring sequence with minimal period 2^k − 1.
PN-sequences are generated by primitive polynomials. A
polynomial
h(x) = ∑_{j=0}^{m} h_j x^j</p>
      <p>is called a primitive polynomial of degree m when the
smallest integer n for which h(x) divides x^n + 1 is n = 2^m − 1.</p>
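      <p>The definition above can be tested directly; the following sketch (our choice of h(x) = x^4 + x + 1, a polynomial known to be primitive) searches for the smallest n with h(x) dividing x^n + 1 over GF(2):</p>

```python
# Sketch: test primitivity of h(x) = x^4 + x + 1 over GF(2).
# h(x) divides x^n + 1 iff x^n = 1 mod h(x); we look for the smallest such n.

H = 0b10011  # coefficient bits of h(x) = x^4 + x + 1
DEG = 4

def xn_mod_h(n):
    r = 1  # the polynomial 1; bits encode GF(2) coefficients
    for _ in range(n):
        r = r * 2  # multiply by x
        if r >= 2 ** DEG:
            r ^= H  # reduce modulo h(x)
    return r

divisors = [n for n in range(1, 2 ** DEG) if xn_mod_h(n) == 1]
print(divisors)  # [15] -- the smallest such n is 2^4 - 1, so h is primitive
```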
      <p>PN sequences have many useful applications because
some of their properties mimic those of random sequences.
A run is a string of consecutive 1’s or a string of
consecutive 0’s. In any segment of length 2k − 1 of a k-th order
PN-sequence, one-half of the runs have length 1, one
quarter have length 2, one-eighth have length 3, and so on. In
particular, there is one run of length k of 1’s, one run of
length k − 1 of 0’s. Thus every segment of length 2k − 1
contains 2k/2 ones and 2k/2 − 1 zeros [18, p.410].</p>
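      <p>These run and balance properties can be observed on a concrete sequence; a sketch assuming the primitive polynomial x^4 + x + 1, whose recurrence is a_i = a_{i−3} + a_{i−4} mod 2:</p>

```python
# Sketch: one period of a 4th-order PN-sequence generated by the
# recurrence from h(x) = x^4 + x + 1, and its balance property.

def pn_sequence(seed, length):
    a = list(seed)  # k = 4 initial bits, not all zero
    while len(a) != length:
        a.append((a[-3] + a[-4]) % 2)  # a_i = a_{i-3} + a_{i-4} mod 2
    return a

k = 4
period = 2 ** k - 1  # minimal period 15
seq = pn_sequence([1, 0, 0, 0], period)
print(seq)
print(sum(seq), period - sum(seq))  # 8 7 -- 2^(k-1) ones, 2^(k-1) - 1 zeros
```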
      <p>An important property of PN-sequences is their low
autocorrelation. The autocorrelation of a sequence
a0, a1, . . . , ai, . . . of elements of {0, 1} with period 2k − 1
is defined as
ρ(t) = (1/(2^k − 1)) ∑_{j=0}^{2^k−2} (−1)^{a_j + a_{j+t}}.    (2)</p>
      <p>For every PN-sequence and for every t = 1, . . . , 2^k − 2,
ρ(t) = −1/(2^k − 1)    (3)
[18, p. 411].</p>
      <p>Let τ : {0, 1} → {−1, 1} be defined as
τ(x) = (−1)^x
(i.e., τ(0) = 1 and τ(1) = −1). We say that a 2^k × 2^k
matrix Lk(α) is induced by a k-th order PN-sequence
α = (a0, a1, . . . , ai, . . . ) when for all i = 1, . . . , 2^k, L_{i,1} = 1,
for all j = 1, . . . , 2^k, L_{1,j} = 1, and for all i = 2, . . . , 2^k and
j = 2, . . . , 2^k,</p>
      <p>Lk(α)_{i,j} = τ(A_{i−1,j−1}),
where A is the (2k − 1) × (2k − 1) circulant matrix with
rows formed by shifted segments of length 2k − 1 of the
sequence α. The next proposition following from the
equations (2) and (3) shows that for any PN-sequence α the
matrix Lk(α) has orthogonal rows.</p>
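      <p>Equations (2) and (3) can likewise be checked numerically for a concrete PN-sequence (again the order-4 sequence generated from x^4 + x + 1; all names are ours):</p>

```python
# Sketch: the autocorrelation rho(t) of a 4th-order PN-sequence equals
# -1/(2^k - 1) for every t = 1, ..., 2^k - 2, as in equation (3).

def pn_bits(length):
    a = [1, 0, 0, 0]
    while len(a) != length:
        a.append((a[-3] + a[-4]) % 2)
    return a

k = 4
p = 2 ** k - 1
a = pn_bits(2 * p)  # two periods, so a[j + t] is always in range

def rho(t):
    return sum((-1) ** (a[j] + a[j + t]) for j in range(p)) / p

print(all(rho(t) == -1 / p for t in range(1, p)))  # True
```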
      <p>Proposition 5. Let k be a positive integer, α =
(a0, a1, . . . , ai, . . . ) be a k-th order PN-sequence, and
Lk(α) be the 2k × 2k matrix induced by α. Then all pairs
of rows of Lk(α) are orthogonal.</p>
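      <p>Proposition 5 can be verified directly for k = 4 (a sketch with our own encoding of the bordered circulant matrix; the PN-sequence again comes from x^4 + x + 1):</p>

```python
# Sketch: build the 2^k x 2^k matrix L_k(alpha) for k = 4 and check that
# all pairs of its rows are orthogonal (Proposition 5).

def pn_bits(length):
    a = [1, 0, 0, 0]
    while len(a) != length:
        a.append((a[-3] + a[-4]) % 2)
    return a

k = 4
m = 2 ** k - 1
alpha = pn_bits(2 * m)
tau = lambda x: 1 - 2 * x  # tau(0) = 1, tau(1) = -1, i.e. (-1)^x
# (2^k - 1) x (2^k - 1) circulant matrix: row s is the shifted segment
A = [[alpha[s + j] for j in range(m)] for s in range(m)]
# border with a first row and first column of ones
L = [[1] * (m + 1)] + [[1] + [tau(x) for x in row] for row in A]
dots = [sum(ri * si for ri, si in zip(L[r], L[s]))
        for r in range(m + 1) for s in range(r + 1, m + 1)]
print(all(d == 0 for d in dots))  # True
```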
      <p>Applying Theorem 4 to the 2^k × 2^k matrix Lk(α)
induced by a k-th order PN-sequence α, we obtain a lower
bound of the form 2^{k/2}/k on variation with respect to signum
perceptrons of the function induced by the matrix Lk(α).
So in any shallow perceptron network computing this
function, either the number of units or the sizes of some output
weights depend on k exponentially.</p>
    </sec>
    <sec id="sec-4">
      <title>6 Discussion</title>
      <p>
        We investigated limitations of shallow perceptron
networks in sparsely representing real-valued functions. We
considered sparsity measured by the l1-norm which has been
used in weight-decay regularization techniques [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and in
online dictionary learning [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We proved lower bounds
on l1-norms of output weight vectors of shallow signum
perceptron networks computing functions on square
domains induced by matrices with orthogonal rows. We
illustrated our general results by an example of a class of
matrices constructed using pseudo-noise sequences. These
deterministic sequences mimic some properties of
random sequences. We showed that shallow perceptron
networks, which compute functions constructed using these
sequences, must have either large numbers of hidden units
or some of their output weights must be large.
      </p>
      <p>
        There is an interesting analogy with the central paradox
of coding theory. This paradox is expressed in the title of
the article “Any code of which we cannot think is good”
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It was proven there that any code which is truly
random (in the sense that there is no concise way to generate
the code) is good (it meets the Gilbert-Varshamov bound
on distance versus redundancy). However, despite
sophisticated constructions for codes derived over the years, no
one has succeeded in finding a constructive procedure that
yields such good codes. Similarly, computation of “any
function of which we cannot think” (truly random) by
shallow perceptron networks might be intractable. Our
results show that computation of functions exhibiting some
randomness properties by shallow perceptron networks is
difficult in the sense that it requires networks of large
complexities. Such functions can be constructed using
deterministic algorithms and have many applications.
Properties of pseudo-noise sequences were exploited for
constructions of codes, interplanetary satellite picture
transmission, precision measurements, acoustics, radar
camouflage, and light diffusers. These sequences permit designs
of surfaces that scatter incoming signals very broadly,
making reflected energy “invisible” or “inaudible”.
      </p>
      <p>Acknowledgments. This work was partially supported by
the Czech Grant Agency grant GA 15-18108S and
institutional support of the Institute of Computer Science RVO
67985807.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Ba</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          .
          <article-title>Do deep networks really need to be deep?</article-title>
          In Z. Ghahramani et al., editors,
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>27</volume>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Barron</surname>
          </string-name>
          .
          <article-title>Neural net approximation</article-title>
          . In K. S. Narendra, editor,
          <source>Proc. 7th Yale Workshop on Adaptive and Learning Systems</source>
          , pages
          <fpage>69</fpage>
          -
          <lpage>72</lpage>
          . Yale University Press,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Barron</surname>
          </string-name>
          .
          <article-title>Universal approximation bounds for superpositions of a sigmoidal function</article-title>
          .
          <source>IEEE Trans. on Information Theory</source>
          ,
          <volume>39</volume>
          :
          <fpage>930</fpage>
          -
          <lpage>945</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Delalleau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. Le</given-names>
            <surname>Roux</surname>
          </string-name>
          .
          <article-title>The curse of highly variable functions for local kernel machines</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , volume
          <volume>18</volume>
          , pages
          <fpage>107</fpage>
          -
          <lpage>114</lpage>
          . MIT Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          .
          <article-title>Scaling learning algorithms towards AI</article-title>
          . In L.
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Chapelle</surname>
          </string-name>
          , D. DeCoste, and J. Weston, editors,
          <source>Large-Scale Kernel Machines</source>
          . MIT Press,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Coffey</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Goodman</surname>
          </string-name>
          .
          <article-title>Any code of which we cannot think is good</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          ,
          <volume>36</volume>
          :
          <fpage>1453</fpage>
          -
          <lpage>1461</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Donoho</surname>
          </string-name>
          .
          <article-title>For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest</article-title>
          .
          <source>Communications on Pure and Applied Mathematics</source>
          ,
          <volume>59</volume>
          :
          <fpage>797</fpage>
          -
          <lpage>829</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Fine</surname>
          </string-name>
          .
          <source>Feedforward Neural Network Methodology</source>
          . Springer, Berlin Heidelberg,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ito</surname>
          </string-name>
          .
          <article-title>Finite mapping by neural networks and truth functions</article-title>
          .
          <source>Mathematical Scientist</source>
          ,
          <volume>17</volume>
          :
          <fpage>69</fpage>
          -
          <lpage>77</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Kainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Dependence of computational models on input dimension: Tractability of approximation and optimization tasks</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          ,
          <volume>58</volume>
          :
          <fpage>1203</fpage>
          -
          <lpage>1214</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          .
          <article-title>Dimension-independent rates of approximation by neural networks</article-title>
          . In K. Warwick and M. Kárný, editors,
          <source>Computer-Intensive Methods in Control and Signal Processing. The Curse of Dimensionality</source>
          , pages
          <fpage>261</fpage>
          -
          <lpage>270</lpage>
          . Birkhäuser, Boston, MA,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          .
          <article-title>Complexity estimates based on integral transforms induced by computational units</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>33</volume>
          :
          <fpage>160</fpage>
          -
          <lpage>167</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          .
          <article-title>Constructive lower bounds on model complexity of shallow perceptron networks</article-title>
          .
          <source>Neural Computing and Applications</source>
          , DOI
          <pub-id pub-type="doi">10.1007/s00521-017-2965-0</pub-id>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Model complexities of shallow networks representing highly varying functions</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>171</volume>
          :
          <fpage>598</fpage>
          -
          <lpage>604</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Probabilistic lower bounds for approximation by shallow perceptron networks</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>91</volume>
          :
          <fpage>34</fpage>
          -
          <lpage>41</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Savický</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Hlaváčková</surname>
          </string-name>
          .
          <article-title>Representations and rates of approximation of real-valued Boolean functions by neural networks</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>11</volume>
          :
          <fpage>651</fpage>
          -
          <lpage>659</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Laughlin</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Sejnowski</surname>
          </string-name>
          .
          <article-title>Communication in neural networks</article-title>
          .
          <source>Science</source>
          ,
          <volume>301</volume>
          :
          <fpage>1870</fpage>
          -
          <lpage>1874</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F. J.</given-names>
            <surname>MacWilliams</surname>
          </string-name>
          and
          <string-name>
            <given-names>N. J. A.</given-names>
            <surname>Sloane</surname>
          </string-name>
          .
          <source>The Theory of Error-Correcting Codes</source>
          . North Holland Publishing Co.,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>V. E.</given-names>
            <surname>Maiorov</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Meir</surname>
          </string-name>
          .
          <article-title>On the near optimality of the stochastic approximation of smooth functions by neural networks</article-title>
          .
          <source>Advances in Computational Mathematics</source>
          ,
          <volume>13</volume>
          :
          <fpage>79</fpage>
          -
          <lpage>103</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>V. E.</given-names>
            <surname>Maiorov</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Pinkus</surname>
          </string-name>
          .
          <article-title>Lower bounds for approximation by MLP neural networks</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>25</volume>
          :
          <fpage>81</fpage>
          -
          <lpage>91</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ponce</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Sapiro</surname>
          </string-name>
          .
          <article-title>Online learning for matrix factorization and sparse coding</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>11</volume>
          :
          <fpage>19</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pinkus</surname>
          </string-name>
          .
          <article-title>Approximation theory of the MLP model in neural networks</article-title>
          .
          <source>Acta Numerica</source>
          ,
          <volume>8</volume>
          :
          <fpage>143</fpage>
          -
          <lpage>195</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Tillmann</surname>
          </string-name>
          .
          <article-title>On the computational intractability of exact and approximate dictionary learning</article-title>
          .
          <source>IEEE Signal Processing Letters</source>
          ,
          <volume>22</volume>
          :
          <fpage>45</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>