J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 100–105
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 V. Kůrková



                  Bounds on Sparsity of One-Hidden-Layer Perceptron Networks

                                                               Věra Kůrková

                                         Institute of Computer Science, Czech Academy of Sciences
                                                             vera@cs.cas.cz,
                                           WWW home page: http://www.cs.cas.cz/∼vera

Abstract: Limitations of one-hidden-layer (shallow) perceptron networks to sparsely represent multivariable functions are investigated. A concrete class of functions is described whose computation by shallow perceptron networks requires either a large number of units or is unstable due to large output weights. The class is constructed using pseudo-noise sequences, which have many features of random sequences but can be generated using special polynomials. Connections with the central paradox of coding theory are discussed.

1 Introduction

To identify and explain efficient network designs, it is necessary to develop a theoretical understanding of the influence of a proper choice of network architecture and type of units on reducing network complexity. Bengio and LeCun [5], who recently revived the interest in deep networks, conjectured that "most functions that can be represented compactly by deep architectures cannot be represented by a compact shallow architecture". On the other hand, a recent empirical study demonstrated that shallow networks can learn some functions previously learned by deep ones using the same numbers of parameters as the original deep networks [1].

It is well-known that shallow networks with merely one hidden layer of computational units of many common types can approximate within any accuracy any reasonable function on a compact domain and can also exactly compute any function on a finite domain [9, 22]. All these universality-type results are proven assuming that numbers of network units are potentially infinite or, in the case of finite domains, are at least as large as the sizes of the domains. However, in practical applications, various constraints on numbers and sizes of network parameters limit the feasibility of implementations.

Whereas many upper bounds on numbers of units in shallow networks which are sufficient for a given accuracy of function approximation are known (see, e.g., the survey article [10] and references therein), fewer lower bounds are available. Some bounds hold merely for types of computational units that are not commonly used (see, e.g., [20, 19]). Proofs of lower bounds are generally much more difficult than derivations of upper ones.

Characterization of tasks which can be computed by considerably sparser deep networks than shallow ones requires proving lower bounds on the complexity of shallow networks which are larger than upper bounds on the complexity of deep ones. An important step towards this goal is the exploration of functions which cannot be computed or approximated by shallow networks satisfying various sparsity constraints.

Investigation of sparsity of artificial neural networks has a biological motivation. A number of studies confirmed that only a small fraction of neurons have a high rate of firing at any time (sparse activity) and that each neuron is connected to only a limited number of other neurons (sparse connectivity) [17]. The simplest measure of sparse connectivity between hidden units and network outputs is the number of non-zero output weights. This number is in some literature called the "$l_0$-pseudo-norm". However, it is neither a norm nor a pseudo-norm. Its minimization is a non-convex problem and its solution is NP-hard [23]. Thus instead of the "$l_0$-pseudo-norm", $l_1$ and $l_2$-norms have been used as measures of network sparsity, as they can be implemented as stabilizers in weight-decay regularization techniques (see, e.g., [8] and references therein). Also in online dictionary learning, $l_1$-norms were used as stabilizers [21, 7].

Bengio et al. [4] suggested that a cause of large model complexities of shallow networks might be in the "amount of variations" of functions to be computed. In [14], we presented some examples showing that sparsity of shallow networks computing the same input-output functions strongly depends on the types of their units. We proposed to use as a measure of sparsity variational norms tailored to dictionaries of computational units. These norms were used as tools in nonlinear approximation theory. We showed that variational norms can be employed to obtain lower bounds on sparsity measured by $l_1$-norms. For many dictionaries of computational units, we derived lower bounds on these norms using a probabilistic argument based on the Chernoff Bound [14, 15]. The bounds hold for almost all functions representing binary classifiers on sufficiently large finite domains. In [13] we complemented these probabilistic results by a concrete construction of binary classifiers with large variational norms with respect to signum perceptrons.

In this paper, we investigate sparsity of shallow networks computing real-valued functions on finite rectangular domains. Such domains can be 2-dimensional (e.g., pixels of photographs) or high-dimensional (e.g., digitized high-dimensional cubes), but typically they are quite large. We describe a construction of a class of functions on such
domains based on matrices with orthogonal rows. Estimating variational norms of these functions from below, we obtain lower bounds on $l_1$-norms of shallow networks with signum perceptrons. We show that these networks must have either large numbers of hidden units or some of their output weights must be large. Neither is desirable, as large output weights may lead to instability of computation. We illustrate our general construction by a concrete class of circulant matrices generated by pseudo-noise sequences. We discuss the effect of pseudo-randomness on network complexity.

The paper is organized as follows. Section 2 contains basic concepts on shallow networks and dictionaries of computational units. In Section 3, sparsity is investigated in terms of the $l_1$-norm and norms tailored to computational units. In Section 4, a concrete construction of classes of functions with large variational norms based on orthogonal matrices is described. In Section 5, the general results are illustrated by a concrete example of matrices obtained from pseudo-noise sequences. Section 6 is a brief discussion.

2 Preliminaries

One-hidden-layer networks with single linear outputs (shallow networks) compute input-output functions from sets of the form
\[
\mathrm{span}_n G := \left\{ \sum_{i=1}^{n} w_i g_i \;\middle|\; w_i \in \mathbb{R},\ g_i \in G \right\},
\]
where $G$, called a dictionary, is a set of functions computable by a given type of units, the coefficients $w_i$ are called output weights, and $n$ is the number of hidden units. This number is the simplest measure of model complexity.

In this paper, we focus on representations of functions on finite domains $X \subset \mathbb{R}^d$. We denote by
\[
\mathcal{F}(X) := \{ f \mid f : X \to \mathbb{R} \}
\]
the set of all real-valued functions on $X$. On $\mathcal{F}(X)$ we have the Euclidean inner product defined as
\[
\langle f, g \rangle := \sum_{u \in X} f(u) g(u)
\]
and the Euclidean norm
\[
\| f \| := \sqrt{\langle f, f \rangle}.
\]

We investigate networks with units from the dictionary of signum perceptrons
\[
P_d(X) := \{ \operatorname{sgn}(v \cdot \, \cdot \, + b) : X \to \{-1, 1\} \mid v \in \mathbb{R}^d,\ b \in \mathbb{R} \},
\]
where $\operatorname{sgn}(t) := -1$ for $t < 0$ and $\operatorname{sgn}(t) := 1$ for $t \geq 0$. Note that from the point of view of model complexity, there is only a minor difference between networks with signum perceptrons and those with Heaviside perceptrons, as
\[
\operatorname{sgn}(t) = 2\vartheta(t) - 1 \quad \text{and} \quad \vartheta(t) := \frac{\operatorname{sgn}(t) + 1}{2},
\]
where $\vartheta(t) = 0$ for $t < 0$ and $\vartheta(t) = 1$ for $t \geq 0$. An advantage of signum perceptrons is that all units from the dictionary $P_d(X)$ have the same norm, equal to $\sqrt{\mathrm{card}\, X}$.
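As an aside for readers who prefer executable notation, the following minimal Python sketch (an illustration added here, not part of the original text) evaluates two signum perceptrons on a small finite grid domain and forms an element of $\mathrm{span}_2 P_2(X)$; the domain, weight vectors, and biases are arbitrary illustrative choices.

    import numpy as np

    # Finite square domain X = {1,...,4} x {1,...,4} viewed as 16 points in R^2 (so d = 2).
    n = 4
    X = np.array([(x, y) for x in range(1, n + 1) for y in range(1, n + 1)], dtype=float)

    def signum_perceptron(v, b):
        # Values of sgn(v . x + b) on the whole domain X, a vector in {-1,1}^(card X).
        return np.where(X @ np.asarray(v, dtype=float) + b >= 0, 1.0, -1.0)

    # A shallow network with two hidden units: an element of span_2 P_2(X).
    g1 = signum_perceptron([1.0, -1.0], 0.5)
    g2 = signum_perceptron([0.0, 1.0], -2.5)
    w = np.array([0.7, -1.3])
    f = w[0] * g1 + w[1] * g2

    print(f.reshape(n, n))                       # the computed function, shown as a 4 x 4 table
    print(np.linalg.norm(g1), np.sqrt(len(X)))   # every unit has norm sqrt(card X) = 4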
3 Measures of Sparsity

The simplest measure of sparse connectivity between the hidden layer and the network output is the number of non-zero output weights. In some literature, the number of non-zero coefficients among the $w_i$'s in an input-output function
\[
f = \sum_{i=1}^{n} w_i g_i \qquad (1)
\]
from $\mathrm{span}\, G$ is called an "$l_0$-pseudo-norm" (in quotation marks) and denoted $\|w\|_0$. However, it is neither a norm nor a pseudo-norm. The quantity $\|w\|_0$ is always an integer and thus $\|\cdot\|_0$ does not satisfy the homogeneity property of a norm ($\|\lambda x\| = |\lambda| \|x\|$ for all $\lambda$). Moreover, the "unit ball" $\{ w \in \mathbb{R}^n \mid \|w\|_0 \leq 1 \}$ is non-convex and unbounded.

Minimization of the "$l_0$-pseudo-norm" of the vector of output weights is a difficult non-convex optimization task. Instead of "$l_0$", the $l_1$-norm defined as
\[
\|w\|_1 = \sum_{i=1}^{n} |w_i|
\]
and the $l_2$-norm defined as
\[
\|w\|_2 = \sqrt{\sum_{i=1}^{n} w_i^2}
\]
of output weight vectors $w = (w_1, \ldots, w_n)$ have been used in weight-decay regularization [8]. These norms can be implemented as stabilizers modifying error functionals which are minimized during learning. A network with a large $l_1$ or $l_2$-norm of its output-weight vector must have either a large number of units or some output weights must be large. Neither of these properties is desirable, as they imply a large model complexity or instability of computation caused by large output weights.

Many dictionaries of computational units are overcomplete and thus the representation (1) as a linear combination of units from the dictionary need not be unique. For a finite dictionary $G$, the minimum of the $l_1$-norms of output-weight vectors of shallow networks with units from $G$ computing $f$ is equal to a norm tailored to the dictionary $G$. This norm, called $G$-variation, has been used as a tool for estimation of rates of approximation of functions by networks with increasing "$l_0$-pseudo-norms".
$G$-variation is defined for a bounded subset $G$ of a normed linear space $(\mathcal{X}, \|\cdot\|)$ as
\[
\| f \|_G := \inf \left\{ c \in \mathbb{R}_+ \;\middle|\; \frac{f}{c} \in \mathrm{cl}_{\mathcal{X}}\, \mathrm{conv}\, (G \cup -G) \right\},
\]
where $-G := \{ -g \mid g \in G \}$, $\mathrm{cl}_{\mathcal{X}}$ denotes the closure with respect to the topology induced by the norm $\|\cdot\|_{\mathcal{X}}$, and $\mathrm{conv}$ is the convex hull. Variation with respect to the dictionary of Heaviside perceptrons (called variation with respect to half-spaces) was introduced by Barron [2] and we extended it to general sets in [11].

As $G$-variation is a norm, it can be made arbitrarily large by multiplying a function by a scalar. Also in theoretical analysis of approximation capabilities of shallow networks, it has to be taken into account that the approximation error $\| f - \mathrm{span}_n G \|$ in any norm $\|\cdot\|$ can be made arbitrarily large by multiplying $f$ by a scalar. Indeed, for every $c > 0$,
\[
\| c f - \mathrm{span}_n G \| = c \| f - \mathrm{span}_n G \|.
\]
Thus, both $G$-variation and errors in approximation by $\mathrm{span}_n G$ have to be studied either for sets of normalized functions or for sets of functions of a given fixed norm.

$G$-variation is related to $l_1$-sparsity and can be used for estimating its lower bounds. The proof of the next proposition follows easily from the definition.

Proposition 1. Let $G$ be a finite subset of $(\mathcal{X}, \|\cdot\|)$ with $\mathrm{card}\, G = k$. Then, for every $f \in \mathcal{X}$,
\[
\| f \|_G = \min \left\{ \sum_{i=1}^{k} |w_i| \;\middle|\; f = \sum_{i=1}^{k} w_i g_i,\ w_i \in \mathbb{R},\ g_i \in G \right\}.
\]
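To illustrate Proposition 1 numerically, the following sketch (an illustration added here, not from the original text) computes the minimal $l_1$-norm of output weights, and hence the $G$-variation, of a function on a small one-dimensional domain, with $G$ taken as the full dictionary of signum perceptrons on that domain; the domain size, the target function, and the use of scipy.optimize.linprog are assumptions of this toy example.

    import numpy as np
    from scipy.optimize import linprog

    # Toy domain X = {1,...,m} in R^1 and the full dictionary G of signum perceptrons on it
    # (all "step" functions and their negations, 2m distinct elements).
    m = 6
    X = np.arange(1, m + 1)
    patterns = set()
    for c in range(m + 1):                         # sgn(x - c - 0.5) and its negation
        g = np.where(X > c, 1, -1)
        patterns.add(tuple(g))
        patterns.add(tuple(-g))
    G = np.array(sorted(patterns), dtype=float).T  # columns are dictionary elements, shape (m, 2m)

    f = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 2.0])  # an arbitrary function on X

    # Proposition 1: ||f||_G = min { sum_i |w_i| : f = sum_i w_i g_i }.  Write w = p - q with
    # p, q >= 0 and solve the resulting linear program.
    k = G.shape[1]
    res = linprog(np.ones(2 * k), A_eq=np.hstack([G, -G]), b_eq=f,
                  bounds=(0, None), method="highs")
    print("G-variation of f (minimal l1-norm of output weights):", res.fun)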
Another important property of $G$-variation is its role in estimates of rates of approximation by networks with small "$l_0$-pseudo-norms". This follows from the Maurey-Jones-Barron Theorem [3]. Here we state its reformulation from [11, 16, 12] in terms of $G$-variation merely for the finite-dimensional Hilbert space $(\mathcal{F}(X), \|\cdot\|)$ with the Euclidean norm. By $G^o$ we denote the set of normalized elements of $G$, i.e., $G^o = \{ g/\|g\| \mid g \in G \}$.

Theorem 2. Let $X \subset \mathbb{R}^d$ be finite, $G$ be a finite subset of $\mathcal{F}(X)$, $s_G = \max_{g \in G} \|g\|$, and $f \in \mathcal{F}(X)$. Then for every $n$,
\[
\| f - \mathrm{span}_n G \| \leq \frac{\| f \|_{G^o}}{\sqrt{n}} \leq \frac{s_G \| f \|_G}{\sqrt{n}}.
\]

Theorem 2 together with Proposition 1 implies that for any function that can be $l_1$-sparsely represented by a shallow network with units from a dictionary $G$ and for any $n$, there exists an input-output function $f_n$ of a network with $n$ units, $f_n = \sum_{i=1}^{n} w_i g_i$ (and so $\|w\|_0 \leq n$), such that
\[
\| f - f_n \| \leq \frac{s_G \| f \|_G}{\sqrt{n}}.
\]

Lower bounds on variational norms can be obtained by geometrical arguments. The following theorem from [16] shows that functions which are "nearly orthogonal" to all elements of a dictionary $G$ have large $G$-variations.

Theorem 3. Let $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$ be a Hilbert space and $G$ its bounded subset. Then for every $f \in \mathcal{X} \setminus G^{\perp}$,
\[
\| f \|_G \geq \frac{\| f \|^2}{\sup_{g \in G} |g \cdot f|}.
\]
4 Constructive Lower Bounds on Variational Norms

In this section, we derive lower bounds on $l_1$-sparsity of shallow signum perceptron networks from lower bounds on variational norms with respect to signum perceptrons.

Theorem 3 implies that functions which are nearly orthogonal to all elements of a dictionary have large variations. The inner product
\[
\langle f, g \rangle = \sum_{x \in X} f(x) g(x)
\]
of any two functions $f, g$ on a finite domain $X$ is invariant under reordering of the points in $X$.

Estimating inner products of functions with signum perceptrons on sets of points in general positions is quite difficult, so we focus on functions on square domains
\[
X = \{x_1, \ldots, x_n\} \times \{y_1, \ldots, y_n\} \subset \mathbb{R}^d
\]
formed by points in grid-like positions. For example, pixels of pictures in $\mathbb{R}^d$ as well as digitized high-dimensional cubes can form such square domains. Functions on square domains can be represented by square matrices. For a function $f$ on $X = \{x_1, \ldots, x_n\} \times \{y_1, \ldots, y_n\}$ we denote by $M(f)$ the $n \times n$ matrix defined as
\[
M(f)_{i,j} = f(x_i, y_j).
\]
On the other hand, an $n \times n$ matrix $M$ induces a function $f_M$ on $X$ such that
\[
f_M(x_i, y_j) = M_{i,j}.
\]

We prove a lower bound on variation with respect to signum perceptrons for functions on square domains represented by matrices with orthogonal rows. To obtain these bounds from Theorem 3, we have to estimate inner products of these functions with signum perceptrons. We derive estimates of these inner products using the following two lemmas. The first one is from [13] and the second one follows from the Cauchy-Schwarz Inequality.

Lemma 1. Let $d = d_1 + d_2$, $\{x_i \mid i = 1, \ldots, n\} \subset \mathbb{R}^{d_1}$, $\{y_j \mid j = 1, \ldots, n\} \subset \mathbb{R}^{d_2}$, and $X = \{x_1, \ldots, x_n\} \times \{y_1, \ldots, y_n\} \subset \mathbb{R}^d$. Then for every $g \in P_d(X)$ there exists a
reordering of rows and columns of the $n \times n$ matrix $M(g)$ such that in the reordered matrix each row and each column starts with a (possibly empty) initial segment of $-1$'s followed by a (possibly empty) segment of $+1$'s.

Lemma 2. Let $M$ be an $n \times n$ matrix with orthogonal rows, $v_1, \ldots, v_n$ be its row vectors, and $a = \max_{i=1,\ldots,n} \|v_i\|$. Then for any subset $I$ of the set of indices of rows and any subset $J$ of the set of indices of columns of the matrix $M$,
\[
\sum_{i \in I} \sum_{j \in J} M_{i,j} \leq a \sqrt{\mathrm{card}\, I \; \mathrm{card}\, J}.
\]
The following theorem gives a lower bound on variation with respect to signum perceptrons for functions induced by matrices with orthogonal rows.

Theorem 4. Let $M$ be an $n \times n$ matrix with orthogonal rows, $v_1, \ldots, v_n$ be its row vectors, $a = \max_{i=1,\ldots,n} \|v_i\|$, $d = d_1 + d_2$, $\{x_i \mid i = 1, \ldots, n\} \subset \mathbb{R}^{d_1}$, $\{y_j \mid j = 1, \ldots, n\} \subset \mathbb{R}^{d_2}$, $X = \{x_i \mid i = 1, \ldots, n\} \times \{y_j \mid j = 1, \ldots, n\} \subset \mathbb{R}^d$, and $f_M : X \to \mathbb{R}$ be defined as $f_M(x_i, y_j) = M_{i,j}$. Then
\[
\| f_M \|_{P_d(X)} \geq \frac{a}{\lceil \log_2 n \rceil}.
\]

Sketch of the proof. For each signum perceptron $g \in P_d(X)$, we permute both matrices: the one induced by $g$ and the one with orthogonal rows. To estimate the inner product of the permuted matrices, we apply Lemmas 1 and 2 to a partition of both matrices into submatrices. The submatrices of the permuted $M(g)$ have all entries equal to $+1$ or all entries equal to $-1$. The permuted matrix $M$ has orthogonal rows and thus we can estimate from above the sums of entries of its submatrices with the same rows and columns as the submatrices of $M(g)$. The partition is constructed by iterating the same decomposition at most $\lceil \log_2 n \rceil$ times. By Theorem 3,
\[
\| f_M \|_{P_d(X)} \geq \frac{\| f_M \|^2}{\sup_{g \in P_d(X)} \langle f_M, g \rangle} \geq \frac{a}{\lceil \log_2 n \rceil}. \qquad \Box
\]

Theorem 4 shows that shallow perceptron networks computing functions generated by orthogonal $n \times n$ matrices must have $l_1$-norms bounded from below by $\frac{a}{\lceil \log_2 n \rceil}$.

All signum perceptrons on a domain $X$ with $\mathrm{card}\, X = n \times n$ have norms equal to $n$. The largest lower bound implied by Theorem 4 for functions induced by $n \times n$ matrices with orthogonal rows, which have norms equal to $n$, is achieved for matrices where all rows have the same norm, equal to $\sqrt{n}$. Such functions have variations with respect to signum perceptrons at least
\[
\frac{\sqrt{n}}{\lceil \log_2 n \rceil}.
\]
In particular, when the domain is the $2d$-dimensional Boolean cube $\{0,1\}^{2d} = \{0,1\}^d \times \{0,1\}^d$, the lower bound is
\[
\frac{2^{d/2}}{d}.
\]
So the lower bound grows with $d$ exponentially.
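A few values of this bound, computed by the lines below (added for illustration), show the exponential growth: for $d = 10, 20, 30, 40$ the bound $2^{d/2}/d$ is roughly 3.2, 51.2, 1092, and 26214.

    import math

    # The bound sqrt(n)/ceil(log2 n) for n = 2^d points per factor, i.e. 2^(d/2)/d.
    for d in (10, 20, 30, 40):
        n = 2 ** d
        print(d, math.sqrt(n) / math.ceil(math.log2(n)))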
5 Computation of Functions Induced by Pseudo-Noise Sequences by Shallow Perceptron Networks

In this section, we apply our general results to a class of circulant matrices with rows formed by shifted segments of pseudo-noise sequences. These sequences are deterministic but exhibit some properties of random sequences.

An infinite sequence $a_0, a_1, \ldots, a_i, \ldots$ of elements of $\{0,1\}$ is called a $k$-th order linear recurring sequence if for some $h_0, \ldots, h_k \in \{0,1\}$
\[
a_i = \sum_{j=1}^{k} a_{i-j} h_{k-j} \mod 2
\]
for all $i \geq k$. It is called a $k$-th order pseudo-noise (PN) sequence (or pseudo-random sequence) if it is a $k$-th order linear recurring sequence with minimal period $2^k - 1$. PN-sequences are generated by primitive polynomials. A polynomial
\[
h(x) = \sum_{j=0}^{m} h_j x^j
\]
is called a primitive polynomial of degree $m$ when the smallest integer $n$ for which $h(x)$ divides $x^n + 1$ is $n = 2^m - 1$.

PN-sequences have many useful applications because some of their properties mimic those of random sequences. A run is a string of consecutive 1's or a string of consecutive 0's. In any segment of length $2^k - 1$ of a $k$-th order PN-sequence, one half of the runs have length 1, one quarter have length 2, one eighth have length 3, and so on. In particular, there is one run of 1's of length $k$ and one run of 0's of length $k - 1$. Thus every segment of length $2^k - 1$ contains $2^{k-1}$ ones and $2^{k-1} - 1$ zeros [18, p. 410].

An important property of PN-sequences is their low autocorrelation. The autocorrelation of a sequence $a_0, a_1, \ldots, a_i, \ldots$ of elements of $\{0,1\}$ with period $2^k - 1$ is defined as
\[
\rho(t) = \frac{1}{2^k - 1} \sum_{j=0}^{2^k - 2} (-1)^{a_j + a_{j+t}}. \qquad (2)
\]
For every PN-sequence and for every $t = 1, \ldots, 2^k - 2$,
\[
\rho(t) = -\frac{1}{2^k - 1} \qquad (3)
\]
[18, p. 411].
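The following sketch (added here for illustration; the primitive polynomial $x^4 + x + 1$ and the initial state are assumed, illustrative choices) generates a 4-th order PN-sequence by its linear recurrence, checks the balance of ones and zeros, and verifies property (3) that the autocorrelation (2) equals $-1/(2^k - 1)$ for all non-zero shifts.

    import numpy as np

    # A 4-th order PN-sequence from the primitive polynomial x^4 + x + 1,
    # i.e. the recurrence a_i = a_{i-3} + a_{i-4} (mod 2); its period is 2^4 - 1 = 15.
    k = 4
    period = 2 ** k - 1
    a = [1, 0, 0, 0]                                   # any non-zero initial segment
    for i in range(k, period):
        a.append((a[i - 3] + a[i - 4]) % 2)
    a = np.array(a)

    print("one period:", a, "ones:", a.sum(), "zeros:", period - a.sum())   # 8 ones, 7 zeros

    def rho(t):
        # Autocorrelation (2), with indices taken modulo the period.
        return np.mean((-1.0) ** (a + np.roll(a, -t)))

    print([round(rho(t), 4) for t in range(1, period)])   # property (3): all equal -1/15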
Let $\tau : \{0,1\} \to \{-1,1\}$ be defined as
\[
\tau(x) = (-1)^x
\]
(i.e., $\tau(0) = 1$ and $\tau(1) = -1$). We say that a $2^k \times 2^k$ matrix $L_k(\alpha)$ is induced by a $k$-th order PN-sequence $\alpha = (a_0, a_1, \ldots, a_i, \ldots)$ when for all $i = 1, \ldots, 2^k$, $L_{i,1} = 1$, for all $j = 1, \ldots, 2^k$, $L_{1,j} = 1$, and for all $i = 2, \ldots, 2^k$ and $j = 2, \ldots, 2^k$,
\[
L_k(\alpha)_{i,j} = \tau(A_{i-1,j-1}),
\]
where $A$ is the $(2^k - 1) \times (2^k - 1)$ circulant matrix with rows formed by shifted segments of length $2^k - 1$ of the sequence $\alpha$. The next proposition, which follows from equations (2) and (3), shows that for any PN-sequence $\alpha$ the matrix $L_k(\alpha)$ has orthogonal rows.

Proposition 5. Let $k$ be a positive integer, $\alpha = (a_0, a_1, \ldots, a_i, \ldots)$ be a $k$-th order PN-sequence, and $L_k(\alpha)$ be the $2^k \times 2^k$ matrix induced by $\alpha$. Then all pairs of rows of $L_k(\alpha)$ are orthogonal.

Applying Theorem 4 to the $2^k \times 2^k$ matrix $L_k(\alpha)$ induced by a $k$-th order PN-sequence $\alpha$, we obtain a lower bound of the form $\frac{2^{k/2}}{k}$ on the variation with respect to signum perceptrons of the function induced by the matrix $L_k(\alpha)$. So in any shallow perceptron network computing this function, the number of units or the sizes of some output weights depend on $k$ exponentially.
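The construction and the resulting bound can be checked numerically. The sketch below (an added illustration using the same 4-th order PN-sequence as in the previous sketch) builds the 16 x 16 matrix $L_4(\alpha)$, verifies Proposition 5, and evaluates the lower bound of Theorem 4, which for $n = 2^k$ and maximal row norm $2^{k/2}$ equals $2^{k/2}/k$.

    import numpy as np

    # The same 4-th order PN-sequence as in the previous sketch.
    k = 4
    period = 2 ** k - 1
    a = [1, 0, 0, 0]
    for i in range(k, period):
        a.append((a[i - 3] + a[i - 4]) % 2)

    tau = 1 - 2 * np.array(a)                            # tau(x) = (-1)^x: 0 -> 1, 1 -> -1
    A = np.array([np.roll(tau, -i) for i in range(period)])   # circulant matrix of shifts

    L = np.ones((2 ** k, 2 ** k))
    L[1:, 1:] = A                                        # first row and first column equal to 1

    # Proposition 5: the rows of L_k(alpha) are pairwise orthogonal.
    print(np.allclose(L @ L.T, (2 ** k) * np.eye(2 ** k)))        # True

    # Theorem 4 with n = 2^k and maximal row norm 2^(k/2): lower bound 2^(k/2)/k.
    n = 2 ** k
    print("lower bound:", np.sqrt(n) / np.ceil(np.log2(n)))       # 1.0 for k = 4; grows as 2^(k/2)/k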
6 Discussion

We investigated limitations of shallow perceptron networks to sparsely represent real-valued functions. We considered sparsity measured by the $l_1$-norm, which has been used in weight-decay regularization techniques [8] and in online dictionary learning [7]. We proved lower bounds on $l_1$-norms of output weight vectors of shallow signum perceptron networks computing functions on square domains induced by matrices with orthogonal rows. We illustrated our general results by an example of a class of matrices constructed using pseudo-noise sequences. These deterministic sequences mimic some properties of random sequences. We showed that shallow perceptron networks which compute functions constructed using these sequences must have either large numbers of hidden units or some of their output weights must be large.

There is an interesting analogy with the central paradox of coding theory. This paradox is expressed in the title of the article "Any code of which we cannot think is good" [6]. It was proven there that any code which is truly random (in the sense that there is no concise way to generate the code) is good (it meets the Gilbert-Varshamov bound on distance versus redundancy). However, despite sophisticated constructions for codes derived over the years, no one has succeeded in finding a constructive procedure that yields such good codes. Similarly, computation of "any function of which we cannot think" (truly random) by shallow perceptron networks might be intractable. Our results show that computation of functions exhibiting some randomness properties by shallow perceptron networks is difficult in the sense that it requires networks of large complexities. Such functions can be constructed using deterministic algorithms and have many applications. Properties of pseudo-noise sequences were exploited for constructions of codes, interplanetary satellite picture transmission, precision measurements, acoustics, radar camouflage, and light diffusers. These sequences permit designs of surfaces that scatter incoming signals very broadly, making reflected energy "invisible" or "inaudible".

Acknowledgments. This work was partially supported by the Czech Grant Agency grant GA 15-18108S and by institutional support of the Institute of Computer Science RVO 67985807.

References

[1] L. J. Ba and R. Caruana. Do deep networks really need to be deep? In Z. Ghahramani et al., editors, Advances in Neural Information Processing Systems, volume 27, pages 1–9, 2014.
[2] A. R. Barron. Neural net approximation. In K. S. Narendra, editor, Proc. 7th Yale Workshop on Adaptive and Learning Systems, pages 69–72. Yale University Press, 1992.
[3] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39:930–945, 1993.
[4] Y. Bengio, O. Delalleau, and N. Le Roux. The curse of highly variable functions for local kernel machines. In Advances in Neural Information Processing Systems, volume 18, pages 107–114. MIT Press, 2006.
[5] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines. MIT Press, 2007.
[6] J. T. Coffey and R. M. Goodman. Any code of which we cannot think is good. IEEE Transactions on Information Theory, 36:1453–1461, 1990.
[7] D. L. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59:797–829, 2006.
[8] T. L. Fine. Feedforward Neural Network Methodology. Springer, Berlin Heidelberg, 1999.
[9] Y. Ito. Finite mapping by neural networks and truth functions. Mathematical Scientist, 17:69–77, 1992.
[10] P. C. Kainen, V. Kůrková, and M. Sanguineti. Dependence of computational models on input dimension: Tractability of approximation and optimization tasks. IEEE Transactions on Information Theory, 58:1203–1214, 2012.
[11] V. Kůrková. Dimension-independent rates of approximation by neural networks. In K. Warwick and M. Kárný, editors, Computer-Intensive Methods in Control and Signal Processing. The Curse of Dimensionality, pages 261–270. Birkhäuser, Boston, MA, 1997.
[12] V. Kůrková. Complexity estimates based on integral transforms induced by computational units. Neural Networks, 33:160–167, 2012.
     [13] V. Kůrková. Constructive lower bounds on model complex-
          ity of shallow perceptron networks. Neural Computing and
          Applications, DOI 10.1007/s00521-017-2965-0, 2017.
     [14] V. Kůrková and M. Sanguineti. Model complexities of shal-
          low networks representing highly varying functions. Neu-
          rocomputing, 171:598–604, 2016.
     [15] V. Kůrková and M. Sanguineti. Probabilistic lower bounds
          for approximation by shallow perceptron networks. Neural
          Networks, 91:34–41, 2017.
     [16] V. Kůrková, P. Savický, and K. Hlaváčková. Representa-
          tions and rates of approximation of real-valued Boolean
          functions by neural networks. Neural Networks, 11:651–
          659, 1998.
     [17] S. B. Laughlin and T. J. Sejnowski. Communication in neu-
          ral networks. Science, 301:1870–1874, 2003.
     [18] F. J. MacWilliams and N. J. A. Sloane. The Theory of Error-
          Correcting Codes. North Holland Publishing Co., 1977.
     [19] V. E. Maiorov and R. Meir. On the near optimality of
          the stochastic approximation of smooth functions by neu-
          ral networks. Advances in Computational Mathematics,
          13:79–103, 2000.
     [20] V. E. Maiorov and A. Pinkus. Lower bounds for approxima-
          tion by MLP neural networks. Neurocomputing, 25:81–91,
          1999.
     [21] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learn-
          ing for matrix factorization and sparse coding. Journal of
          Machine Learning Research, 11:19–60, 2010.
     [22] A. Pinkus. Approximation theory of the MLP model in
          neural networks. Acta Numerica, 8:143–195, 1999.
     [23] A. M. Tillmann. On the computational intractability of ex-
          act and approximate dictionary learning. IEEE Signal Pro-
          cessing Letters, 22:45–49, 2015.