<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Bounds on Sparsity of One-Hidden-Layer Perceptron Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Věra Kůrková</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, Czech Academy of Sciences</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1885</volume>
      <fpage>100</fpage>
      <lpage>105</lpage>
      <abstract>
        <p>Limitations of one-hidden-layer (shallow) perceptron networks in sparsely representing multivariable functions are investigated. A concrete class of functions is described whose computation by shallow perceptron networks requires either a large number of units or is unstable due to large output weights. The class is constructed using pseudo-noise sequences, which have many features of random sequences but can be generated using special polynomials. Connections with the central paradox of coding theory are discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        To identify and explain efficient network designs, it is
necessary to develop a theoretical understanding of the
influence of the choice of network architecture and type of
units on reducing network complexity. Bengio and LeCun
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], who recently revived the interest in deep networks,
conjectured that “most functions that can be represented
compactly by deep architectures cannot be represented by
a compact shallow architecture”. On the other hand, a
recent empirical study demonstrated that shallow networks
can learn some functions previously learned by deep ones
using the same numbers of parameters as the original deep
networks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        It is well-known that shallow networks with merely
one hidden layer of computational units of many common
types can approximate within any accuracy any
reasonable function on a compact domain and also can exactly
compute any function on a finite domain [
        <xref ref-type="bibr" rid="ref22 ref9">9, 22</xref>
        ]. All these
universality type results are proven assuming that numbers
of network units are potentially infinite or, in the case of
finite domains, are at least as large as sizes of the domains.
However, in practical applications, various constraints on
numbers and sizes of network parameters limit feasibility
of implementations.
      </p>
      <p>
        Whereas many upper bounds on numbers of units in
shallow networks which are sufficient for a given
accuracy of function approximation are known (see, e.g., the
survey article [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and references therein), fewer lower
bounds are available. Some bounds hold merely for types
of computational units that are not commonly used (see,
e.g., [
        <xref ref-type="bibr" rid="ref19 ref20">20, 19</xref>
        ]). Proofs of lower bounds are generally much
more difficult than derivations of upper ones.
      </p>
      <p>Characterization of tasks which can be computed by
considerably sparser deep networks than shallow ones
requires proving lower bounds on complexity of shallow
networks which are larger than upper bounds on complexity
of deep ones. An important step towards this goal is
exploration of functions which cannot be computed or
approximated by shallow networks satisfying various
sparsity constraints.</p>
      <p>
        Investigation of sparsity of artificial neural networks has
a biological motivation. A number of studies confirmed
that only a small fraction of neurons have a high rate of
firing at any time (sparse activity) and that each neuron
is connected to only a limited number of other neurons
(sparse connectivity) [
        <xref ref-type="bibr" rid="ref17">17</xref>
]. The simplest measure of
sparse connectivity between hidden units and network
outputs is the number of non-zero output weights. In
some literature, this number is called the “l0-pseudo-norm”.
However, it is neither a norm nor a pseudo-norm. Its
minimization is a non-convex problem whose solution is NP-hard
[
        <xref ref-type="bibr" rid="ref23">23</xref>
]. Thus, instead of the “l0-pseudo-norm”, l1 and l2-norms
have been used as measures of network sparsity as they
can be implemented as stabilizers in weight-decay
regularization techniques (see, e.g., [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and references therein).
Also in online dictionary learning, l1-norms were used as
stabilizers [
        <xref ref-type="bibr" rid="ref21 ref7">21, 7</xref>
        ].
      </p>
      <p>
        Bengio et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
] suggested that a cause of the large model
complexity of shallow networks might lie in the “amount
of variations” of functions to be computed. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], we
presented some examples showing that sparsity of
shallow networks computing the same input-output functions
strongly depends on types of their units. We proposed
to use as a measure of sparsity variational norms
tailored to dictionaries of computational units. These norms
were used as tools in nonlinear approximation theory.
We showed that variational norms can be employed to
obtain lower bounds on sparsity measured by l1-norms.
For many dictionaries of computational units, we derived
lower bounds on these norms using a probabilistic
argument based on the Chernoff Bound [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. The bounds
hold for almost all functions representing binary classifiers
on sufficiently large finite domains. In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] we
complemented these probabilistic results by a concrete
construction of binary classifiers with large variational norms with
respect to signum perceptrons.
      </p>
      <p>In this paper, we investigate sparsity of shallow
networks computing real-valued functions on finite
rectangular domains. Such domains can be 2-dimensional (e.g.,
pixels of photographs) or high-dimensional (e.g., digitized
high-dimensional cubes), but typically they are quite large.
We describe a construction of a class of functions on such
domains based on matrices with orthogonal rows.
Estimating variational norms of these functions from below,
we obtain lower bounds on l1-norms of shallow networks
with signum perceptrons. We show that these networks
must have either large numbers of hidden units or some
large output weights. Neither is desirable, as large output
weights may lead to instability of
computation. We illustrate our general construction by a concrete
class of circulant matrices generated by pseudo-noise
sequences. We discuss the effect of pseudo-randomness on
network complexity.</p>
      <p>The paper is organized as follows. Section 2 contains
basic concepts on shallow networks and dictionaries of
computational units. In Section 3, sparsity is investigated
in terms of l1-norm and norms tailored to computational
units. In Section 4, a concrete construction of classes of
functions with large variational norms based on
orthogonal matrices is described. In Section 5, the general results
are illustrated by a concrete example of matrices obtained
from pseudo-noise sequences. Section 6 is a brief
discussion.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 Preliminaries</title>
      <p>One-hidden-layer networks with single linear outputs
(shallow networks) compute input-output functions from
sets of the form
span_n G := { ∑_{i=1}^{n} w_i g_i | w_i ∈ R, g_i ∈ G },
where G, called a dictionary, is a set of functions
computable by a given type of units, the coefficients wi are
called output weights, and n is the number of hidden units.
This number is the simplest measure of model complexity.</p>
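      <p>As a minimal sketch (the helper names here are ours, purely illustrative), an element of span_n G with signum-perceptron units sgn(v · x + b) can be computed as follows:</p>

```python
# Sketch: an input-output function from span_n G, where each dictionary
# unit is a signum perceptron x -> sgn(v . x + b). Names are illustrative.

def sgn(t):
    # sgn(t) = -1 for t below 0 and sgn(t) = 1 for t >= 0
    return 1.0 if t >= 0 else -1.0

def signum_perceptron(v, b):
    # one dictionary unit g(x) = sgn(v . x + b)
    return lambda x: sgn(sum(vi * xi for vi, xi in zip(v, x)) + b)

def shallow_network(weights, units):
    # an element of span_n G: linear combination of n hidden units
    return lambda x: sum(w * g(x) for w, g in zip(weights, units))

g1 = signum_perceptron([1.0, -1.0], 0.0)
g2 = signum_perceptron([0.5, 0.5], -1.0)
f = shallow_network([2.0, -3.0], [g1, g2])
print(f([1.0, 0.0]))  # 2*sgn(1.0) - 3*sgn(-0.5) = 5.0
```

      <p>Here n = 2; the coefficients 2.0 and −3.0 are the output weights whose number and sizes are the subject of the sparsity measures studied below.</p>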
      <p>In this paper, we focus on representations of functions
on finite domains X ⊂ R^d. We denote by</p>
      <p>F(X) := { f | f : X → R }
the set of all real-valued functions on X. On F(X) we
have the Euclidean inner product defined as
⟨f, g⟩ := ∑_{u∈X} f(u) g(u)
and the Euclidean norm
‖f‖ := √⟨f, f⟩.</p>
      <p>We investigate networks with units from the dictionary
of signum perceptrons</p>
      <p>Pd(X) := { sgn(v · x + b) : X → {−1, 1} | v ∈ R^d, b ∈ R },
where sgn(t) := −1 for t &lt; 0 and sgn(t) := 1 for t ≥ 0.
Note that from the point of view of model complexity,
there is only a minor difference between networks with
signum perceptrons and those with Heaviside perceptrons,
as
sgn(t) = 2ϑ(t) − 1 and ϑ(t) = (sgn(t) + 1)/2,
where ϑ(t) = 0 for t &lt; 0 and ϑ(t) = 1 for t ≥ 0. An
advantage of signum perceptrons is that all units from the
dictionary Pd(X) have the same norm, equal to √(card X).
</p>
    </sec>
    <sec id="sec-3">
      <title>3 Measures of Sparsity</title>
      <p>The simplest measure of sparse connectivity between
the hidden layer and the network output is the number of
non-zero output weights. In some literature, the number of
non-zero coefficients among the wi’s in an input-output
function</p>
      <p>f = ∑_{i=1}^{n} w_i g_i    (1)
from span G is called an “l0-pseudo-norm”, in quotation
marks, and denoted ‖w‖0. However, it is neither a norm
nor a pseudo-norm. The quantity ‖w‖0 is always an integer,
and thus ‖·‖0 does not satisfy the homogeneity property
of a norm (‖λx‖ = |λ| ‖x‖ for all λ). Moreover, the “unit
ball” {w ∈ R^n | ‖w‖0 ≤ 1} is non-convex and unbounded.</p>
      <p>Minimization of the “l0-pseudo-norm” of the vector of
output weights is a difficult non-convex optimization task.
Instead of “l0”, the l1-norm defined as
‖w‖1 = ∑_{i=1}^{n} |w_i|
and the l2-norm defined as
‖w‖2 = ( ∑_{i=1}^{n} w_i² )^{1/2}</p>
      <p>
        of output-weight vectors w = (w1, . . . , wn) have been used
in weight-decay regularization [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These norms can be
implemented as stabilizers modifying error functionals
which are minimized during learning. A network with a
large l1 or l2-norm of its output-weight vector must have
either a large number of units or some output weights must
be large. Neither of these properties is desirable, as they
imply large model complexity or instability of
computation caused by large output weights.
      </p>
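      <p>The three measures discussed above can be sketched in a few lines (a toy example with our own function names):</p>

```python
# Sketch: the "l0-pseudo-norm", l1-norm, and l2-norm of an
# output-weight vector w = (w1, ..., wn).

def l0(w):
    # number of non-zero output weights; integer-valued, not homogeneous
    return sum(1 for wi in w if wi != 0)

def l1(w):
    return sum(abs(wi) for wi in w)

def l2(w):
    return sum(wi * wi for wi in w) ** 0.5

w = [0.0, 3.0, -4.0]
print(l0(w), l1(w), l2(w))  # 2 7.0 5.0
# doubling w doubles l1 and l2 but leaves l0 unchanged
print(l0([2 * wi for wi in w]))  # 2
```

      <p>Scaling w leaves l0 unchanged while l1 and l2 double, which is exactly the failure of the homogeneity property noted for the “l0-pseudo-norm”.</p>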
      <p>Many dictionaries of computational units are
overcomplete and thus the representation (1) as a linear
combination of units from the dictionary need not be unique.
For a finite dictionary G, the minimum of the l1-norms of
output-weight vectors of shallow networks with units from
G computing f is equal to a norm tailored to the dictionary
G. This norm, called G-variation, has been used as a tool
for estimation of rates of approximation of functions by
networks with increasing “l0-pseudo-norms”. G-variation
is defined for a bounded subset G of a normed linear space
(X, ‖·‖) as</p>
      <p>
        ‖f‖G := inf { c ∈ R+ | f/c ∈ clX conv (G ∪ −G) },
where −G := { −g | g ∈ G }, clX denotes the closure with
respect to the topology induced by the norm ‖·‖X, and
conv is the convex hull. Variation with respect to the
dictionary of Heaviside perceptrons (called variation with
respect to half-spaces) was introduced by Barron [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and we
extended it to general sets in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>As G-variation is a norm, it can be made arbitrarily
large by multiplying a function by a scalar. Also in
theoretical analysis of approximation capabilities of shallow
networks, it has to be taken into account that the
approximation error ‖f − span_n G‖ in any norm ‖·‖ can be made
arbitrarily large by multiplying f by a scalar. Indeed, for
every c &gt; 0,</p>
      <p>‖c f − span_n G‖ = c ‖f − span_n G‖.</p>
      <p>Thus, both G-variation and errors in approximation by
spann G have to be studied either for sets of normalized
functions or for sets of functions of a given fixed norm.</p>
      <p>G-variation is related to l1-sparsity and can be used
for estimating its lower bounds. The proof of the next
proposition follows easily from the definition.</p>
      <p>Proposition 1. Let G be a finite subset of (X, ‖·‖) with
card G = k. Then, for every f ∈ X,
‖f‖G = min { ∑_{i=1}^{k} |w_i| : f = ∑_{i=1}^{k} w_i g_i, w_i ∈ R, g_i ∈ G }.</p>
      <p>
        Another important property of G-variation is its role
in estimates of rates of approximation by networks with
small “l0-pseudo-norms”. This follows from the
Maurey-Jones-Barron Theorem [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Here we state its reformulation
from [
        <xref ref-type="bibr" rid="ref11 ref12 ref16">11, 16, 12</xref>
] in terms of G-variation, merely for the finite-dimensional
Hilbert space (F(X), ‖·‖) with the Euclidean
norm. By G° we denote the set of normalized elements of
G, i.e., G° := { g/‖g‖ | g ∈ G }.
      </p>
      <p>Theorem 2. Let X ⊂ R^d be finite, G be a finite subset of
F(X), sG = max_{g∈G} ‖g‖, and f ∈ F(X). Then for every
n,
‖f − span_n G‖ ≤ ‖f‖_{G°} / √n ≤ sG ‖f‖G / √n.</p>
      <p>Theorem 2 together with Proposition 1 implies that for any
function f that can be l1-sparsely represented by a shallow
network with units from a dictionary G and for any n, there
exists an input-output function fn of a network with n units,
fn = ∑_{i=1}^{n} w_i g_i (so that ‖w‖0 ≤ n), such that
‖f − fn‖ ≤ sG ‖f‖G / √n.</p>
      <p>
        Lower bounds on variational norms can be obtained by
geometrical arguments. The following theorem from [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
shows that functions which are “nearly orthogonal” to all
elements of a dictionary G have large G-variations.
Theorem 3. Let (X, ‖·‖X) be a Hilbert space and G its
bounded subset. Then for every f ∈ X \ G⊥,
‖f‖G ≥ ‖f‖² / sup_{g∈G} |⟨g, f⟩|.</p>
      <p>In the following, we derive lower bounds on l1-sparsity of
shallow signum perceptron networks from lower bounds
on variational norms with respect to signum perceptrons.
      </p>
      <p>Theorem 3 implies that functions which are nearly
orthogonal to all elements of a dictionary have large
variations. The inner product
⟨f, g⟩ = ∑_{x∈X} f(x) g(x)</p>
      <p>of any two functions f, g on a finite domain X is invariant
under reordering of the points in X.</p>
      <p>To estimate inner products of functions with signum
perceptrons on sets of points in general position is quite
difficult, so we focus on functions on square domains</p>
      <p>X = {x1, . . . , xn} × {y1, . . . , yn} ⊂ Rd
formed by points in grid-like positions. For example,
pixels of pictures in Rd as well as digitized high-dimensional
cubes can form such square domains. Functions on square
domains can be represented by square matrices. For a
function f on X = {x1, . . . , xn} × {y1, . . . , yn} we denote
by M( f ) the n × n matrix defined as</p>
      <p>M(f)_{i,j} = f(x_i, y_j).</p>
      <p>On the other hand, an n × n matrix M induces a function
fM on X such that</p>
      <p>f_M(x_i, y_j) = M_{i,j}.</p>
      <p>
        We prove a lower bound on variation with respect to
signum perceptrons for functions on square domains
represented by matrices with orthogonal rows. To obtain
these bounds from Theorem 3, we have to estimate inner
products of these functions with signum perceptrons. We
derive estimates of these inner products using the
following two lemmas. The first one is from [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and the second
one follows from the Cauchy-Schwartz Inequality.
Lemma 1. Let d = d1 + d2, {xi | i = 1, . . . , n} ⊂
Rd1 , {y j | j = 1, . . . , n} ⊂ Rd2 , and X = {x1, . . . , xn} ×
{y1, . . . , yn} ⊂ Rd . Then for every g ∈ Pd (X ) there exists a
reordering of rows and columns of the n × n matrix M(g)
such that in the reordered matrix each row and each
column starts with a (possibly empty) initial segment of −1’s
followed by a (possibly empty) segment of +1’s.
Lemma 2. Let M be an n × n matrix with orthogonal rows,
v1, . . . , vn be its row vectors, and a = max_{i=1,...,n} ‖v_i‖. Then for
any subset I of the set of indices of rows and any subset J
of the set of indices of columns of the matrix M,
| ∑_{i∈I} ∑_{j∈J} M_{i,j} | ≤ a √(card I · card J).
      </p>
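      <p>Lemma 2 can be checked numerically; the sketch below (our own toy example, not from the paper) verifies the submatrix-sum bound for a 4 × 4 matrix with orthogonal rows:</p>

```python
# Sketch: check Lemma 2 on a 4 x 4 matrix with orthogonal rows
# (a Hadamard-type matrix; a = max row norm = 2).
from itertools import chain, combinations

M = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]
a = max(sum(x * x for x in row) ** 0.5 for row in M)  # 2.0

def subsets(idx):
    # all non-empty subsets of an index set
    return chain.from_iterable(combinations(idx, r) for r in range(1, len(idx) + 1))

# Lemma 2: |sum over the I x J submatrix| bounded by a * sqrt(|I| * |J|)
ok = all(
    a * (len(I) * len(J)) ** 0.5 - abs(sum(M[i][j] for i in I for j in J)) >= -1e-9
    for I in subsets(range(4)) for J in subsets(range(4))
)
print(ok)  # True
```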
      <p>The following theorem gives a lower bound on variation
with respect to signum perceptrons for functions induced
by matrices with orthogonal rows.</p>
      <p>Theorem 4. Let M be an n × n matrix with orthogonal
rows, v1, . . . , vn be its row vectors, a = max_{i=1,...,n} ‖v_i‖, d =
d1 + d2, {x_i | i = 1, . . . , n} ⊂ R^{d1}, {y_j | j = 1, . . . , n} ⊂ R^{d2},
X = {x_1, . . . , x_n} × {y_1, . . . , y_n} ⊂ R^d, and
f_M : X → R be defined as f_M(x_i, y_j) = M_{i,j}. Then
‖f_M‖_{Pd(X)} ≥ a / ⌈log2 n⌉.</p>
      <p>Sketch of the proof.</p>
      <p>For each signum perceptron g ∈ Pd (X ), we permute both
matrices: the one induced by g and the one with
orthogonal rows. To estimate the inner product of the permuted
matrices, we apply Lemmas 1 and 2 to a partition of both
matrices into submatrices. The submatrices of permuted
M(g) have all entries either equal to +1 or all entries equal
to −1. The permuted matrix M has orthogonal rows and
thus we can estimate from above the sums of entries of its
submatrices with the same rows and columns as
submatrices of M(g). The partition is constructed by iterating
at most dlog2 ne-times the same decomposition. By
Theorem 3,</p>
      <p>‖f_M‖_{Pd(X)} ≥ ‖f_M‖² / sup_{g∈Pd(X)} ⟨f_M, g⟩ ≥ a / ⌈log2 n⌉.</p>
      <p>Theorem 4 shows that shallow perceptron networks
computing functions generated by orthogonal n × n matrices
must have l1-norms bounded from below by a / ⌈log2 n⌉.</p>
      <p>All signum perceptrons on a domain X with card X =
n × n have norms equal to n. The largest lower bound
implied by Theorem 4 for functions induced by n × n
matrices with orthogonal rows, which have norms equal to n, is
achieved for matrices whose rows all have the same norm,
equal to √n. Such functions have variations with respect
to signum perceptrons at least
√n / ⌈log2 n⌉.</p>
      <p>In particular, when the domain is the 2d-dimensional
Boolean cube {0, 1}^{2d} = {0, 1}^d × {0, 1}^d, then the lower
bound is 2^{d/2}/d.</p>
      <p>In this section, we apply our general results to a class of
circulant matrices with rows formed by shifted segments
of pseudo-noise sequences. These sequences are
deterministic but exhibit some properties of random sequences.</p>
      <p>An infinite sequence a0, a1, . . . , ai, . . . of elements of
{0, 1} is called a k-th order linear recurring sequence if for
some h0, . . . , hk ∈ {0, 1},</p>
      <p>a_i = ∑_{j=1}^{k} a_{i−j} h_{k−j} mod 2
for all i ≥ k. It is called a k-th order pseudo-noise (PN)
sequence (or pseudo-random sequence) if it is a k-th order
linear recurring sequence with minimal period 2^k − 1.
PN-sequences are generated by primitive polynomials. A
polynomial
h(x) = ∑_{j=0}^{m} h_j x^j</p>
      <p>is called a primitive polynomial of degree m when the
smallest integer n for which h(x) divides x^n + 1 is n = 2^m − 1.</p>
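      <p>The definition above can be tested directly; the following sketch (our choice of h(x) = x^4 + x + 1, a polynomial known to be primitive) searches for the smallest n with h(x) dividing x^n + 1 over GF(2):</p>

```python
# Sketch: test primitivity of h(x) = x^4 + x + 1 over GF(2).
# h(x) divides x^n + 1 iff x^n = 1 mod h(x); we look for the smallest such n.

H = 0b10011  # coefficient bits of h(x) = x^4 + x + 1
DEG = 4

def xn_mod_h(n):
    r = 1  # the polynomial 1; bits encode GF(2) coefficients
    for _ in range(n):
        r = r * 2  # multiply by x
        if r >= 2 ** DEG:
            r ^= H  # reduce modulo h(x)
    return r

divisors = [n for n in range(1, 2 ** DEG) if xn_mod_h(n) == 1]
print(divisors)  # [15] -- the smallest such n is 2^4 - 1, so h is primitive
```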
      <p>PN sequences have many useful applications because
some of their properties mimic those of random sequences.
A run is a string of consecutive 1’s or a string of
consecutive 0’s. In any segment of length 2k − 1 of a k-th order
PN-sequence, one-half of the runs have length 1, one
quarter have length 2, one-eighth have length 3, and so on. In
particular, there is one run of length k of 1’s, one run of
length k − 1 of 0’s. Thus every segment of length 2k − 1
contains 2k/2 ones and 2k/2 − 1 zeros [18, p.410].</p>
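      <p>These run and balance properties can be observed on a concrete sequence; a sketch assuming the primitive polynomial x^4 + x + 1, whose recurrence is a_i = a_{i−3} + a_{i−4} mod 2:</p>

```python
# Sketch: one period of a 4th-order PN-sequence generated by the
# recurrence from h(x) = x^4 + x + 1, and its balance property.

def pn_sequence(seed, length):
    a = list(seed)  # k = 4 initial bits, not all zero
    while len(a) != length:
        a.append((a[-3] + a[-4]) % 2)  # a_i = a_{i-3} + a_{i-4} mod 2
    return a

k = 4
period = 2 ** k - 1  # minimal period 15
seq = pn_sequence([1, 0, 0, 0], period)
print(seq)
print(sum(seq), period - sum(seq))  # 8 7 -- 2^(k-1) ones, 2^(k-1) - 1 zeros
```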
      <p>An important property of PN-sequences is their low
autocorrelation. The autocorrelation of a sequence
a0, a1, . . . , ai, . . . of elements of {0, 1} with period 2k − 1
is defined as
ρ(t) = (1/(2^k − 1)) ∑_{j=0}^{2^k−2} (−1)^{a_j + a_{j+t}}.    (2)</p>
      <p>For every PN-sequence and for every t = 1, . . . , 2^k − 2,
ρ(t) = −1/(2^k − 1)    (3)
[18, p. 411].</p>
      <p>Let τ : {0, 1} → {−1, 1} be defined as
τ(x) = (−1)^x
(i.e., τ(0) = 1 and τ(1) = −1). We say that a 2^k × 2^k
matrix Lk(α) is induced by a k-th order PN-sequence
α = (a0, a1, . . . , ai, . . . ) when for all i = 1, . . . , 2^k, L_{i,1} = 1,
for all j = 1, . . . , 2^k, L_{1,j} = 1, and for all i = 2, . . . , 2^k and
j = 2, . . . , 2^k,</p>
      <p>Lk(α)_{i,j} = τ(A_{i−1,j−1}),
where A is the (2k − 1) × (2k − 1) circulant matrix with
rows formed by shifted segments of length 2k − 1 of the
sequence α. The next proposition following from the
equations (2) and (3) shows that for any PN-sequence α the
matrix Lk(α) has orthogonal rows.</p>
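      <p>Equations (2) and (3) can likewise be checked numerically for a concrete PN-sequence (again the order-4 sequence generated from x^4 + x + 1; all names are ours):</p>

```python
# Sketch: the autocorrelation rho(t) of a 4th-order PN-sequence equals
# -1/(2^k - 1) for every t = 1, ..., 2^k - 2, as in equation (3).

def pn_bits(length):
    a = [1, 0, 0, 0]
    while len(a) != length:
        a.append((a[-3] + a[-4]) % 2)
    return a

k = 4
p = 2 ** k - 1
a = pn_bits(2 * p)  # two periods, so a[j + t] is always in range

def rho(t):
    return sum((-1) ** (a[j] + a[j + t]) for j in range(p)) / p

print(all(rho(t) == -1 / p for t in range(1, p)))  # True
```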
      <p>Proposition 5. Let k be a positive integer, α =
(a0, a1, . . . , ai, . . . ) be a k-th order PN-sequence, and
Lk(α) be the 2k × 2k matrix induced by α. Then all pairs
of rows of Lk(α) are orthogonal.</p>
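      <p>Proposition 5 can be verified directly for k = 4 (a sketch with our own encoding of the bordered circulant matrix; the PN-sequence again comes from x^4 + x + 1):</p>

```python
# Sketch: build the 2^k x 2^k matrix L_k(alpha) for k = 4 and check that
# all pairs of its rows are orthogonal (Proposition 5).

def pn_bits(length):
    a = [1, 0, 0, 0]
    while len(a) != length:
        a.append((a[-3] + a[-4]) % 2)
    return a

k = 4
m = 2 ** k - 1
alpha = pn_bits(2 * m)
tau = lambda x: 1 - 2 * x  # tau(0) = 1, tau(1) = -1, i.e. (-1)^x
# (2^k - 1) x (2^k - 1) circulant matrix: row s is the shifted segment
A = [[alpha[s + j] for j in range(m)] for s in range(m)]
# border with a first row and first column of ones
L = [[1] * (m + 1)] + [[1] + [tau(x) for x in row] for row in A]
dots = [sum(ri * si for ri, si in zip(L[r], L[s]))
        for r in range(m + 1) for s in range(r + 1, m + 1)]
print(all(d == 0 for d in dots))  # True
```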
      <p>Applying Theorem 4 to the 2^k × 2^k matrix Lk(α)
induced by a k-th order PN-sequence α, we obtain a lower
bound of the form 2^{k/2}/k on variation with respect to signum
perceptrons of the function induced by the matrix Lk(α).
So in any shallow perceptron network computing this
function, either the number of units or the sizes of some output
weights depend on k exponentially.</p>
    </sec>
    <sec id="sec-4">
      <title>6 Discussion</title>
      <p>
        We investigated limitations of shallow perceptron
networks in sparsely representing real-valued functions. We
considered sparsity measured by the l1-norm which has been
used in weight-decay regularization techniques [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and in
online dictionary learning [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We proved lower bounds
on l1-norms of output weight vectors of shallow signum
perceptron networks computing functions on square
domains induced by matrices with orthogonal rows. We
illustrated our general results by an example of a class of
matrices constructed using pseudo-noise sequences. These
deterministic sequences mimic some properties of
random sequences. We showed that shallow perceptron
networks, which compute functions constructed using these
sequences, must have either large numbers of hidden units
or some of their output weights must be large.
      </p>
      <p>
        There is an interesting analogy with the central paradox
of coding theory. This paradox is expressed in the title of
the article “Any code of which we cannot think is good”
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It was proven there that any code which is truly
random (in the sense that there is no concise way to generate
the code) is good (it meets the Gilbert-Varshamov bound
on distance versus redundancy). However, despite
sophisticated constructions for codes derived over the years, no
one has succeeded in finding a constructive procedure that
yields such good codes. Similarly, computation of “any
function of which we cannot think” (truly random) by
shallow perceptron networks might be intractable. Our
results show that computation of functions exhibiting some
randomness properties by shallow perceptron networks is
difficult in the sense that it requires networks of large
complexities. Such functions can be constructed using
deterministic algorithms and have many applications.
Properties of pseudo-noise sequences were exploited for
constructions of codes, interplanetary satellite picture
transmission, precision measurements, acoustics, radar
camouflage, and light diffusers. These sequences permit designs
of surfaces that scatter incoming signals very broadly,
making reflected energy “invisible” or “inaudible”.
      </p>
      <p>Acknowledgments. This work was partially supported by
the Czech Grant Agency grant GA 15-18108S and
institutional support of the Institute of Computer Science RVO
67985807.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Ba</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          .
          <article-title>Do deep networks really need to be deep?</article-title>
          In Z. Ghahramani et al., editors,
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>27</volume>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Barron</surname>
          </string-name>
          .
          <article-title>Neural net approximation</article-title>
          . In K. S. Narendra, editor,
          <source>Proc. 7th Yale Workshop on Adaptive and Learning Systems</source>
          , pages
          <fpage>69</fpage>
          -
          <lpage>72</lpage>
          . Yale University Press,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Barron</surname>
          </string-name>
          .
          <article-title>Universal approximation bounds for superpositions of a sigmoidal function</article-title>
          .
          <source>IEEE Trans. on Information Theory</source>
          ,
          <volume>39</volume>
          :
          <fpage>930</fpage>
          -
          <lpage>945</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Delalleau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. Le</given-names>
            <surname>Roux</surname>
          </string-name>
          .
          <article-title>The curse of highly variable functions for local kernel machines</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , volume
          <volume>18</volume>
          , pages
          <fpage>107</fpage>
          -
          <lpage>114</lpage>
          . MIT Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          .
          <article-title>Scaling learning algorithms towards AI</article-title>
          . In L.
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Chapelle</surname>
          </string-name>
          , D. DeCoste, and J. Weston, editors,
          <source>Large-Scale Kernel Machines</source>
          . MIT Press,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Coffey</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Goodman</surname>
          </string-name>
          .
          <article-title>Any code of which we cannot think is good</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          ,
          <volume>36</volume>
          :
          <fpage>1453</fpage>
          -
          <lpage>1461</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Donoho</surname>
          </string-name>
          .
          <article-title>For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest</article-title>
          .
          <source>Communications on Pure and Applied Mathematics</source>
          ,
          <volume>59</volume>
          :
          <fpage>797</fpage>
          -
          <lpage>829</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Fine</surname>
          </string-name>
          .
          <source>Feedforward Neural Network Methodology</source>
          . Springer, Berlin Heidelberg,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ito</surname>
          </string-name>
          .
          <article-title>Finite mapping by neural networks and truth functions</article-title>
          .
          <source>Mathematical Scientist</source>
          ,
          <volume>17</volume>
          :
          <fpage>69</fpage>
          -
          <lpage>77</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Kainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Dependence of computational models on input dimension: Tractability of approximation and optimization tasks</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          ,
          <volume>58</volume>
          :
          <fpage>1203</fpage>
          -
          <lpage>1214</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          .
          <article-title>Dimension-independent rates of approximation by neural networks</article-title>
          . In K. Warwick and M. Kárný, editors,
          <source>Computer-Intensive Methods in Control and Signal Processing. The Curse of Dimensionality</source>
          , pages
          <fpage>261</fpage>
          -
          <lpage>270</lpage>
          . Birkhäuser, Boston, MA,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          .
          <article-title>Complexity estimates based on integral transforms induced by computational units</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>33</volume>
          :
          <fpage>160</fpage>
          -
          <lpage>167</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          .
          <article-title>Constructive lower bounds on model complexity of shallow perceptron networks</article-title>
          .
          <source>Neural Computing and Applications</source>
          , DOI
          <pub-id pub-type="doi">10.1007/s00521-017-2965-0</pub-id>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Model complexities of shallow networks representing highly varying functions</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>171</volume>
          :
          <fpage>598</fpage>
          -
          <lpage>604</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Probabilistic lower bounds for approximation by shallow perceptron networks</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>91</volume>
          :
          <fpage>34</fpage>
          -
          <lpage>41</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Savický</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Hlaváčková</surname>
          </string-name>
          .
          <article-title>Representations and rates of approximation of real-valued Boolean functions by neural networks</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>11</volume>
          :
          <fpage>651</fpage>
          -
          <lpage>659</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Laughlin</surname>
          </string-name>
          and
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Sejnowski</surname>
          </string-name>
          .
          <article-title>Communication in neural networks</article-title>
          .
          <source>Science</source>
          ,
          <volume>301</volume>
          :
          <fpage>1870</fpage>
          -
          <lpage>1874</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F. J.</given-names>
            <surname>MacWilliams</surname>
          </string-name>
          and
          <string-name>
            <given-names>N. J. A.</given-names>
            <surname>Sloane</surname>
          </string-name>
          .
          <source>The Theory of Error-Correcting Codes</source>
          . North Holland Publishing Co.,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>V. E.</given-names>
            <surname>Maiorov</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Meir</surname>
          </string-name>
          .
          <article-title>On the near optimality of the stochastic approximation of smooth functions by neural networks</article-title>
          .
          <source>Advances in Computational Mathematics</source>
          ,
          <volume>13</volume>
          :
          <fpage>79</fpage>
          -
          <lpage>103</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>V. E.</given-names>
            <surname>Maiorov</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Pinkus</surname>
          </string-name>
          .
          <article-title>Lower bounds for approximation by MLP neural networks</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>25</volume>
          :
          <fpage>81</fpage>
          -
          <lpage>91</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mairal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ponce</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Sapiro</surname>
          </string-name>
          .
          <article-title>Online learning for matrix factorization and sparse coding</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>11</volume>
          :
          <fpage>19</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pinkus</surname>
          </string-name>
          .
          <article-title>Approximation theory of the MLP model in neural networks</article-title>
          .
          <source>Acta Numerica</source>
          ,
          <volume>8</volume>
          :
          <fpage>143</fpage>
          -
          <lpage>195</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Tillmann</surname>
          </string-name>
          .
          <article-title>On the computational intractability of exact and approximate dictionary learning</article-title>
          .
          <source>IEEE Signal Processing Letters</source>
          ,
          <volume>22</volume>
          :
          <fpage>45</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>