<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Limitations of One-Hidden-Layer Perceptron Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Věra Kůrková</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, Czech Academy of Sciences</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>167</fpage>
      <lpage>171</lpage>
      <abstract>
        <p>The limitations of one-hidden-layer perceptron networks in efficiently representing finite mappings are investigated. It is shown that almost any uniformly randomly chosen mapping on a sufficiently large finite domain cannot be tractably represented by a one-hidden-layer perceptron network. This existential probabilistic result is complemented by a concrete example of a class of functions constructed using pseudo-random sequences. Analogies with the central paradox of coding theory and the no-free-lunch theorem are discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A widely used type of neural-network architecture is
a network with one hidden layer of computational units
(such as perceptrons or radial or kernel units) and one
linear output unit. Recently, new hybrid learning algorithms
for feedforward networks with two or more hidden layers,
called deep networks [
        <xref ref-type="bibr" rid="ref3 ref9">3, 9</xref>
        ], were successfully applied to
various pattern recognition tasks. Thus a theoretical
analysis identifying tasks for which shallow networks require
considerably larger model complexities than deep ones is
needed. In [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], Bengio et al. suggested that a cause
of the large model complexities of shallow networks with one
hidden layer might lie in the “amount of variation” of the
functions to be computed, and they illustrated this
suggestion with an example: the representation of d-dimensional
parities by a Gaussian SVM.
      </p>
      <p>
        In practical applications, feedforward networks
compute functions on finite domains in R^d representing, e.g.,
scattered empirical data or pixels of images. It is
well-known that shallow networks with many types of
computational units have the “universal representation
property”, i.e., they can exactly represent any real-valued
function on a finite domain. This property holds, e.g.,
for networks of perceptrons with any sigmoidal
activation function [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and for networks with Gaussian
radial units [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. However, proofs of universal
representation capabilities assume that the network has a number of
hidden units equal to the size of the domain of the function to be
computed. For large domains, this can be a factor
limiting practical implementations. Upper bounds on rates of
approximation of multivariable functions by shallow
networks with increasing numbers of units have been studied in
terms of variational norms tailored to the types of network
units (see, e.g., [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and references therein).
      </p>
      <p>In this paper, we employ these norms to derive lower
bounds on the model complexities of shallow networks
representing finite mappings. Using geometrical properties
of high-dimensional spaces, we show that a
representation of almost any uniformly randomly chosen function on
a “large” finite domain by a shallow perceptron network
requires a “large” number of units or “large” sizes of output
weights. We illustrate this existential probabilistic result
by a concrete construction of a class of functions based on
Hadamard and pseudo-noise matrices. We discuss analogies
with the central paradox of coding theory and the no-free-lunch
theorem.</p>
      <p>The paper is organized as follows. Section 2 contains
basic concepts and notation on shallow networks and
dictionaries of computational units. Section 3 reviews
variational norms as tools for the investigation of network
complexity and proves estimates of the probability
distributions of their sizes. In Section 4,
concrete examples of functions which cannot be tractably
represented by perceptron networks are constructed using
Hadamard and pseudo-noise matrices. Section 5 is a brief
discussion.</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>One-hidden-layer networks with single linear outputs
(shallow networks) compute input-output functions from
sets of the form

span_n G := { ∑_{i=1}^{n} w_i g_i | w_i ∈ R, g_i ∈ G },

where G, called a dictionary, is a set of functions
computable by a given type of units, the coefficients w_i are
called output weights, and n is the number of hidden units.
This number is sometimes used as a measure of model
complexity.</p>
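      <p>The computation of an input-output function from span_n G can be illustrated with a short numeric sketch (ours, not part of the original analysis; all names are illustrative): a one-hidden-layer network of signum perceptron units evaluated on a finite domain.</p>

```python
import numpy as np

def signum(t):
    """sgn(t) := -1 for t < 0 and 1 for t >= 0, applied elementwise."""
    return np.where(t < 0, -1.0, 1.0)

def shallow_net(X, V, biases, w):
    """Evaluate sum_i w_i * sgn(v_i . x + b_i) at every point x of X.

    X: (m, d) domain points; V: (n, d) input weights of the n hidden
    units; biases: (n,); w: (n,) output weights.
    """
    hidden = signum(X @ V.T + biases)  # (m, n) matrix of unit outputs g_i(x)
    return hidden @ w                  # single linear output unit

# A 2-unit network on a 4-point domain in R^2.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
V = np.array([[1., 1.], [1., -1.]])
b = np.array([-0.5, 0.0])
w = np.array([0.5, 0.5])
f = shallow_net(X, V, b, w)
```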
      <p>In this paper, we focus on representations of functions
on finite domains X ⊂ R^d. We denote by

F(X) := { f | f : X → R }

the set of all real-valued functions on X. On F(X) we
have the Euclidean inner product

⟨f, g⟩ := ∑_{u∈X} f(u) g(u)

and the induced Euclidean norm ‖f‖ := ⟨f, f⟩^{1/2}. To
distinguish the inner product ⟨., .⟩ on F(X) from the
inner product on X ⊂ R^d, we denote the latter by ·, i.e.,
for u, v ∈ X, u · v := ∑_{i=1}^{d} u_i v_i.</p>
      <p>We investigate networks with units from the dictionary
of signum perceptrons

P_d(X) := { sgn(v · . + b) : X → {−1, 1} | v ∈ R^d, b ∈ R },

where sgn(t) := −1 for t &lt; 0 and sgn(t) := 1 for t ≥ 0.
Note that from the point of view of model complexity,
there is only a minor difference between networks with
signum perceptrons and those with Heaviside perceptrons,
as

sgn(t) = 2ϑ(t) − 1 and ϑ(t) = (sgn(t) + 1)/2,

where ϑ(t) := 0 for t &lt; 0 and ϑ(t) := 1 for t ≥ 0.</p>
    </sec>
    <sec id="sec-2a">
      <title>Model Complexity and Variational Norms</title>
      <p>
A useful tool for derivation of estimates of numbers of
units and sizes of output weights in shallow networks is
the concept of a variational norm tailored to network units
introduced in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] as an extension of a concept of
variation with respect to half-spaces from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For a subset G of
a normed linear space (X, ‖.‖_X), G-variation (variation
with respect to the set G), denoted by ‖.‖_G, is defined as

‖f‖_G := inf { c ∈ R_+ | f/c ∈ cl_X conv (G ∪ −G) },

where cl_X denotes the closure with respect to the norm
‖.‖_X on X, −G := {−g | g ∈ G}, and

conv G := { ∑_{i=1}^{k} a_i g_i | a_i ∈ [0, 1], ∑_{i=1}^{k} a_i = 1, g_i ∈ G, k ∈ N }
is the convex hull of G. The following straightforward
consequence of the definition of G-variation shows that in
all representations of a function with “large” G-variation
by shallow networks with units from the dictionary G, the
number of units must be “large” or the absolute values of
some output weights must be “large”.
      </p>
      <p>Proposition 1. Let G be a finite subset of a normed linear
space (X, ‖.‖_X). Then for every f ∈ span G,

‖f‖_G = min { ∑_{i=1}^{k} |w_i| : f = ∑_{i=1}^{k} w_i g_i, w_i ∈ R, g_i ∈ G }.</p>
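      <p>Proposition 1 turns the computation of G-variation for a finite dictionary into an ℓ1-minimization problem, which can be solved as a linear program. The following sketch (ours; it assumes SciPy's linprog and uses the standard split w = w_plus − w_minus) illustrates this.</p>

```python
import numpy as np
from scipy.optimize import linprog

def variation_norm(f, G):
    """G-variation of f over a finite dictionary (Proposition 1):
    minimize sum_i |w_i| subject to f = sum_i w_i g_i.

    f: (m,) target values; G: (n, m), rows are the dictionary functions.
    The split w = w_plus - w_minus with w_plus, w_minus >= 0 makes the
    objective linear.  Returns None if f is not in span G.
    """
    n, m = G.shape
    c = np.ones(2 * n)                 # sum(w_plus) + sum(w_minus)
    A_eq = np.hstack([G.T, -G.T])      # G^T (w_plus - w_minus) = f
    res = linprog(c, A_eq=A_eq, b_eq=f,
                  bounds=[(0, None)] * (2 * n), method="highs")
    return res.fun if res.success else None

# f = g1 + 2*g2: with linearly independent g1, g2 the representation is
# unique, so the variation equals |1| + |2| = 3.
G = np.array([[1., 1., -1., -1.],
              [1., -1., 1., -1.]])
f = G[0] + 2 * G[1]
v = variation_norm(f, G)
```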
      <p>
        Note that classes of functions defined by constraints on
their variational norms represent a type of concept similar to
the classes of functions defined by constraints on both the
numbers of gates and the sizes of output weights studied in the
theory of circuit complexity [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        To derive lower bounds on variational norms, we use the
following theorem from [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], showing that functions which
are “not correlated” with any element of the dictionary G
have large variations.
      </p>
      <p>Theorem 2. Let (X, ‖.‖_X) be a Hilbert space with inner
product ⟨., .⟩_X and G its bounded subset. Then for every
f ∈ X − G^⊥,

‖f‖_G ≥ ‖f‖_X² / sup_{g∈G} |⟨f, g⟩_X|.</p>
      <p>The following theorem shows that when a
dictionary G(X ) is not “too large”, then for a “large”
domain X , almost any randomly chosen function has large
G(X )-variation. We denote by</p>
      <p>
        S_r(X) := { f ∈ F(X) | ‖f‖ = r }

the sphere of radius r in F(X), and, for f ∈ F(X), f° := f/‖f‖.
The proof of the theorem is based on the geometry of
spheres in high-dimensional Euclidean spaces: in large
dimensions, most of the surface area of a sphere lies very close
to its “equator” [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Theorem 3. Let d be a positive integer, X ⊂ R^d with
card X = m, G(X) a subset of F(X) with card G(X) = n
such that ‖g‖ ≤ r for all g ∈ G(X), μ the uniform
probability measure on S_r(X), and b &gt; 0. Then

μ({ f ∈ S_r(X) | ‖f‖_{G(X)} ≥ b }) ≥ 1 − 2n e^{−m/(2b²)}.</p>
      <p>Proof. For g ∈ S_r(X) and ε ∈ (0, 1), denote

C(g, ε) := { h ∈ S_r(X) | |⟨h°, g°⟩| ≥ ε }.

As C(g, ε) is a union of two polar caps in R^{card X}, whose
measure decreases exponentially with the dimension m, we
have

μ(C(g, ε)) ≤ 2 e^{−mε²/2}

(see, e.g., [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). By Theorem 2,

{ f ∈ S_r(X) | ‖f‖_{G(X)} ≥ b } ⊇ S_r(X) − ∪_{g∈G(X)} C(g, 1/b).

Hence the statement follows. □
      </p>
      <p>
        Theorem 3 can be applied to dictionaries G(X) on
domains X ⊂ R^d with card X = m that are “relatively
small”. In particular, dictionaries of signum and
Heaviside perceptrons are relatively “small”. Estimates of their
sizes can be obtained from bounds on the numbers of linearly
separable dichotomies into which finite subsets of R^d can be
partitioned. Various estimates of numbers of dichotomies
have been derived by several authors, starting from results
by Schläfli [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The next bound is obtained by combining
a theorem from [7, p. 330] with an upper bound on a partial
sum of binomial coefficients.
      </p>
      <p>Theorem 4. For every d and every X ⊂ R^d such that
card X = m,

card P_d(X) ≤ 2 ∑_{i=0}^{d} (m−1 choose i) ≤ 2 m^d / d!.</p>
      <p>Combining Theorems 3 and 4, we obtain a lower bound
on the measure of the set of functions whose variation with
respect to signum perceptrons is bounded from below by
a given bound b.</p>
      <p>Corollary 1. Let d be a positive integer, X ⊂ R^d with
card X = m, μ the uniform probability measure on S_√m(X),
and b &gt; 0. Then

μ({ f ∈ S_√m(X) | ‖f‖_{P_d(X)} ≥ b }) ≥ 1 − (4 m^d/d!) e^{−m/(2b²)}.</p>
      <p>For example, for the domain X = {0, 1}^d and b = 2^{d/4}, we
obtain from Corollary 1 the lower bound

1 − (2^{d²+2}/d!) e^{−2^{d/2−1}}

on the probability that a function on {0, 1}^d with
norm 2^{d/2} has variation with respect to signum
perceptrons greater than or equal to 2^{d/4}. Thus for large d,
almost any uniformly randomly chosen function on the d-dimensional
Boolean cube {0, 1}^d with the same norm 2^{d/2} as the signum
perceptrons has variation with respect to signum
perceptrons that grows exponentially with d.</p>
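      <p>The lower bound of Corollary 1 can be evaluated numerically. The following sketch (ours; the function name is illustrative) computes the bound 1 − (4 m^d/d!) e^{−m/(2b²)} in log-space to avoid overflow, crudely clamping it to 0 when the tail term exceeds 1.</p>

```python
import math

def variation_prob_lower_bound(d, b):
    """Corollary 1: lower bound on the probability that a uniformly random
    function on {0,1}^d of norm 2^(d/2) has P_d-variation >= b, namely
    1 - (4 m^d / d!) * exp(-m / (2 b^2)) with m = 2^d, clamped to 0 when
    the tail term exceeds 1."""
    m = 2 ** d
    log_tail = (math.log(4) + d * math.log(m) - math.lgamma(d + 1)
                - m / (2.0 * b * b))
    return 1.0 - math.exp(log_tail) if log_tail < 0 else 0.0

# With b = 2^(d/4), the bound tends to 1 as d grows.
p20 = variation_prob_lower_bound(20, 2 ** (20 / 4))
```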
    </sec>
    <sec id="sec-2b">
      <title>Construction of Functions with Large Variations</title>
      <p>
The results derived in the previous section are
existential. In this section, we construct a class of functions
that cannot be represented by shallow perceptron
networks of low model complexity. We construct such
functions using Hadamard matrices, and we show that the class of
Hadamard matrices contains circulant matrices whose rows
are segments of pseudo-noise sequences, which mimic
some properties of random sequences.</p>
      <p>Recall that a Hadamard matrix of order m is an m × m
square matrix M with entries in {−1, 1} such that any two
distinct rows (equivalently, columns) of M are
orthogonal. Note that this property is invariant under
permutations of rows or columns and under flipping the signs of
all entries in a column or a row. Any two distinct rows of
a Hadamard matrix differ in exactly m/2 positions.</p>
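      <p>Hadamard matrices of order 2^k can be generated by the classical Sylvester construction; the following sketch (ours, for illustration) builds one and checks the orthogonality of distinct rows and the m/2-difference property.</p>

```python
import numpy as np

def sylvester_hadamard(k):
    """Sylvester's recursion H_{2n} = [[H_n, H_n], [H_n, -H_n]], H_1 = [1],
    yielding a Hadamard matrix of order 2^k."""
    H = np.array([[1]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

H = sylvester_hadamard(3)          # order m = 8
gram = H @ H.T                     # orthogonality of distinct rows => m * I
diff = int((H[0] != H[1]).sum())   # distinct rows differ in m/2 positions
```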
      <p>The next theorem gives a lower bound on the variation with
respect to signum perceptrons of a {−1, 1}-valued
function constructed using a Hadamard matrix.</p>
      <p>Theorem 5. Let M be an m × m Hadamard matrix,
{x_i | i = 1, . . . , m} ⊂ R^d, {y_j | j = 1, . . . , m} ⊂ R^d, X =
{x_i | i = 1, . . . , m} × {y_j | j = 1, . . . , m} ⊂ R^{2d}, and f_M : X →
{−1, 1} be defined as f_M(x_i, y_j) := M_{i,j}. Then

‖f_M‖_{P_d(X)} ≥ √m / log₂ m.</p>
      <p>Proof. By Theorem 2,

‖f_M‖_{P_d(X)} ≥ ‖f_M‖² / sup_{g∈P_d(X)} |⟨f_M, g⟩| = m² / sup_{g∈P_d(X)} |⟨f_M, g⟩|.

For each g ∈ P_d(X), let M(g) be the m × m matrix defined
as M(g)_{i,j} := g(x_i, y_j). It is easy to see that

⟨f_M, g⟩ = ∑_{i,j} M_{i,j} M(g)_{i,j}.</p>
      <p>Using suitable permutations, we reorder the rows and
columns of both matrices M(g) and M in such a way that
each row and each column of the reordered matrix M̄(g)
starts with a (possibly empty) initial segment of −1’s
followed by a (possibly empty) segment of 1’s. Denoting by M̄
the correspondingly reordered matrix M, we have

⟨f_M, g⟩ = ∑_{i,j} M_{i,j} M(g)_{i,j} = ∑_{i,j} M̄_{i,j} M̄(g)_{i,j}.

As the property of being a Hadamard matrix is invariant
under permutations of rows and columns, we can apply the
Lindsey lemma [8, p. 88] to the submatrices of the Hadamard
matrix M̄ on which all entries of the matrix M̄(g) are
either −1 or 1. Thus we obtain an upper bound m√m on
the difference between the numbers of +1’s and −1’s in suitable
submatrices of M̄. Iterating the procedure at most log₂ m times,
we obtain the upper bound m√m log₂ m on
∑_{i,j} M̄_{i,j} M̄(g)_{i,j} = ⟨f_M, g⟩. Thus

‖f_M‖_{P_d(X)} ≥ m² / (m√m log₂ m) = √m / log₂ m. □</p>
      <p>Theorem 5 shows that functions whose representation
by shallow perceptron networks requires a number of units
or sizes of output weights bounded from below by √m / log₂ m
can be constructed using Hadamard matrices. In
particular, when the domain is the d-dimensional Boolean cube
{0, 1}^d with d even, one has m = 2^{d/2} and the lower bound
is 2^{d/4}/(d/2). So the lower bound grows exponentially
with d.</p>
      <p>Recall that if a Hadamard matrix of order m exists, then
m = 1, m = 2, or m is divisible by 4 [14, p. 44]. It is
conjectured that a Hadamard matrix exists for every order
divisible by 4. Listings of Hadamard matrices of various
orders can be found in Neil Sloane’s library of Hadamard
matrices.</p>
      <p>We show that suitable Hadamard matrices can be
obtained from pseudo-noise sequences. An infinite sequence
a_0, a_1, . . . , a_i, . . . of elements of {0, 1} is called a k-th order
linear recurring sequence if, for some h_1, . . . , h_k ∈ {0, 1},

a_i = (h_1 a_{i−1} + · · · + h_k a_{i−k}) mod 2

for all i ≥ k. It is called a k-th order pseudo-noise (PN)
sequence (or pseudo-random sequence) if it is a k-th order
linear recurring sequence with minimal period 2^k − 1.</p>
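      <p>A k-th order PN sequence can be generated directly from the linear recurrence. The following sketch (ours) uses recurrence coefficients corresponding to the primitive polynomial x^4 + x + 1, an assumption of this illustration, and checks that the minimal period is 2^4 − 1 = 15.</p>

```python
def pn_sequence(h, seed, length):
    """Linear recurring sequence over {0,1}:
    a_i = (h[0]*a_{i-1} + ... + h[k-1]*a_{i-k}) mod 2 for i >= k."""
    a = list(seed)
    k = len(seed)
    while len(a) < length:
        a.append(sum(c * x for c, x in zip(h, a[-1:-k - 1:-1])) % 2)
    return a

def minimal_period(a):
    """Smallest p with a_i = a_{i+p} on the generated segment."""
    for p in range(1, len(a)):
        if all(a[i] == a[i + p] for i in range(len(a) - p)):
            return p
    return len(a)

# Coefficients from x^4 + x + 1: a_i = a_{i-3} + a_{i-4} mod 2, k = 4.
seq = pn_sequence([0, 0, 1, 1], [1, 0, 0, 0], 64)
period = minimal_period(seq)
ones = sum(seq[:period])   # a PN segment has 2^(k-1) ones
```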
      <p>A 2^k × 2^k matrix L is called pseudo-noise if L_{1,i} = 0 and
L_{i,1} = 0 for all i = 1, . . . , 2^k, and for all i = 2, . . . , 2^k and
j = 2, . . . , 2^k,

L_{i,j} = L̄_{i−1,j−1},

where the (2^k − 1) × (2^k − 1) matrix L̄ is a circulant matrix
with rows formed by shifted segments of length 2^k − 1 of
a k-th order pseudo-noise sequence.</p>
      <p>PN sequences have many useful applications because
some of their properties mimic those of random sequences.
A run is a string of consecutive 1’s or a string of
consecutive 0’s. In any segment of length 2^k − 1 of a k-th order
PN sequence, one half of the runs have length 1, one
quarter have length 2, one eighth have length 3, and so on. In
particular, there is one run of 1’s of length k and one run of
0’s of length k − 1. Thus every segment of length 2^k − 1
contains 2^{k−1} ones and 2^{k−1} − 1 zeros [14, p. 410].</p>
      <p>Let τ : {0, 1} → {−1, 1} be defined as τ(x) := (−1)^x (i.e.,
τ(0) = 1 and τ(1) = −1). The following theorem states
that a matrix obtained by applying τ to the entries of
a pseudo-noise matrix is a Hadamard matrix.</p>
      <p>Theorem 6. Let L be a 2^k × 2^k pseudo-noise matrix and
Lτ the 2^k × 2^k matrix with entries in {−1, 1}
obtained from L by applying τ to all its entries. Then Lτ
is a Hadamard matrix.</p>
      <p>Proof. We show that the inner product of any two distinct rows
of Lτ is equal to zero. The autocorrelation of a sequence
a_0, a_1, . . . , a_i, . . . of elements of {0, 1} with period 2^k − 1
is defined as

ρ(t) := (1/(2^k − 1)) ∑_{j=0}^{2^k−2} (−1)^{a_j + a_{j+t}}.

For every pseudo-noise sequence,

ρ(t) = −1/(2^k − 1)

for every t = 1, . . . , 2^k − 2 [14, p. 411]. Thus the inner
product of any two distinct rows of the matrix L̄τ is equal to −1.
As all elements of the first column of Lτ are equal to 1, the inner
product of every pair of its rows other than the first is equal to
zero. The first row, which is all 1’s, is orthogonal to every other
row because each segment of length 2^k − 1 of a PN sequence contains
exactly one more 1 than 0’s. □</p>
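      <p>Theorem 6 can be checked numerically. The following sketch (ours; the recurrence coefficients again correspond to x^4 + x + 1, an assumption of this illustration) builds a 2^4 × 2^4 pseudo-noise matrix, applies τ, and verifies that the result is a Hadamard matrix.</p>

```python
import numpy as np

def pn_matrix(k, h, seed):
    """2^k x 2^k pseudo-noise matrix: first row and first column all 0; the
    remaining block has rows a[i:i+n], the shifted segments (of length
    n = 2^k - 1) of the k-th order PN sequence a."""
    n = 2 ** k - 1
    a = list(seed)
    while len(a) < 2 * n:
        a.append(sum(c * x for c, x in zip(h, a[-1:-k - 1:-1])) % 2)
    L = np.zeros((n + 1, n + 1), dtype=int)
    for i in range(n):
        L[i + 1, 1:] = a[i:i + n]
    return L

L = pn_matrix(4, [0, 0, 1, 1], [1, 0, 0, 0])  # taps from x^4 + x + 1
H = 1 - 2 * L          # tau: 0 -> +1, 1 -> -1
gram = H @ H.T         # Hadamard iff this equals 16 * I
```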
      <p>Theorem 5 implies that for every pseudo-noise matrix L
of order 2^k and X ⊂ R^d such that card X = 2^k × 2^k, there
exists a function f_{Lτ} : X → {−1, 1}, induced by the
matrix Lτ obtained from L by replacing 0’s with 1’s and 1’s
with −1’s, such that

‖f_{Lτ}‖_{P_d(X)} ≥ 2^{k/2}/k.

So the variation of f_{Lτ} with respect to signum
perceptrons grows exponentially with k. In particular, setting
X = {0, 1}^d, where d = 2k is even, we obtain a function of
d variables whose variation with respect to signum
perceptrons grows exponentially with d:

‖f_{Lτ}‖_{P_d(X)} ≥ 2^{d/4}/(d/2).</p>
      <p>Representation of this function by a shallow perceptron
network requires a number of units or sizes of some output
weights that grow exponentially with d.</p>
      <p>It is easy to show that for each even integer d, the
function induced by the Sylvester-Hadamard matrix

M_{u,v} = (−1)^{u·v},

where u, v ∈ {0, 1}^{d/2}, can be represented by a
two-hidden-layer network with d/2 units in each hidden layer.</p>
    </sec>
    <sec id="sec-3">
      <title>Discussion</title>
      <p>We proved that almost any uniformly randomly chosen
function on a sufficiently large finite set in R^d has large
variation with respect to signum perceptrons, and thus it
cannot be tractably represented by a shallow perceptron
network.</p>
      <p>
        It seems to be a paradox that although the representations
of almost all functions by shallow perceptron networks
are “intractable”, it is difficult to construct concrete examples
of such functions. The situation can be rephrased, in analogy
with the title of an article from coding theory, “Any code of
which we cannot think is good” [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], as “the representation by shallow perceptron
networks of almost any function of which we cannot think
is intractable”. A central paradox of coding
theory concerns the existence and construction of the best
codes. Virtually every linear code is good (in the sense
that it meets the Gilbert-Varshamov bound on distance
versus redundancy); however, despite the sophisticated
constructions for codes derived over the years, no one has
succeeded in demonstrating a constructive procedure that
yields such good codes.
      </p>
      <p>The only class of functions with “large” variations
that we have succeeded in constructing is the class described
in Section 4, based on Hadamard matrices. These matrices
include pseudo-noise (pseudo-random) matrices whose
rows are obtained as shifts of segments of pseudo-noise
sequences. Such sequences have been used in the
construction of codes, in interplanetary satellite picture
transmission, in precision measurements, in acoustics, in radar
camouflage, and in light diffusers. Pseudo-noise sequences
permit the design of surfaces that scatter incoming signals
very broadly, making the reflected energy “invisible” or
“inaudible”.</p>
      <p>
        It should be emphasized that, similarly to the “no free lunch
theorem” [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], our results assume a uniform distribution of the
functions to be represented. However, probability
distributions of functions modeling practical tasks of interest
(such as colors of pixels in a photograph) might be highly
nonuniform.
Acknowledgments. This work was partially supported by
grant COST LD13002 of the Ministry of Education of
the Czech Republic and by institutional support RVO 67985807
of the Institute of Computer Science.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Ball</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>An elementary introduction to modern convex geometry</article-title>
          . In: S. Levy, (ed.),
          <source>Flavors of Geometry</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>58</lpage>
          , Cambridge University Press,
          <year>1997</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Barron</surname>
            ,
            <given-names>A. R.</given-names>
          </string-name>
          :
          <article-title>Neural net approximation</article-title>
          . In: K. S. Narendra, (ed.),
          <source>Proc. 7th Yale Workshop on Adaptive and Learning Systems</source>
          ,
          <volume>69</volume>
          -
          <fpage>72</fpage>
          , Yale University Press,
          <year>1992</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Learning deep architectures for AI</article-title>
          .
          <source>Foundations and Trends in Machine Learning</source>
          <volume>2</volume>
          (
          <year>2009</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>127</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delalleau</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Roux</surname>
          </string-name>
          , N.:
          <article-title>The curse of dimensionality for local kernel machines</article-title>
          .
          <source>Technical Report 1258</source>
          ,
          Département d'Informatique et Recherche Opérationnelle, Université de Montréal,
          <year>2005</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Delalleau</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le Roux</surname>
          </string-name>
          , N.:
          <article-title>The curse of highly variable functions for local kernel machines</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          <volume>18</volume>
          ,
          <fpage>107</fpage>
          -
          <lpage>114</lpage>
          , MIT Press,
          <year>2006</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Coffey</surname>
            ,
            <given-names>J. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodman</surname>
            ,
            <given-names>R. M.:</given-names>
          </string-name>
          <article-title>Any code of which we cannot think is good</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          <volume>36</volume>
          (
          <year>1990</year>
          ),
          <fpage>1453</fpage>
          -
          <lpage>1461</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Cover</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition</article-title>
          .
          <source>IEEE Trans. on Electronic Computers</source>
          <volume>14</volume>
          (
          <year>1965</year>
          ),
          <fpage>326</fpage>
          -
          <lpage>334</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Erdös</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spencer</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          :
          <article-title>Probabilistic Methods in Combinatorics</article-title>
          . Academic Press,
          <year>1974</year>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osindero</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teh</surname>
            ,
            <given-names>Y. W.:</given-names>
          </string-name>
          <article-title>A fast learning algorithm for deep belief nets</article-title>
          .
          <source>Neural Computation</source>
          <volume>18</volume>
          (
          <year>2006</year>
          ),
          <fpage>1527</fpage>
          -
          <lpage>1554</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Ito</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Finite mapping by neural networks and truth functions</article-title>
          .
          <source>Mathematical Scientist</source>
          <volume>17</volume>
          (
          <year>1992</year>
          ),
          <fpage>69</fpage>
          -
          <lpage>77</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Kainen</surname>
            ,
            <given-names>P. C.</given-names>
          </string-name>
          , Kůrková, V.,
          <string-name>
            <surname>Sanguineti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Dependence of computational models on input dimension: Tractability of approximation and optimization tasks</article-title>
          .
          <source>IEEE Trans. on Information Theory</source>
          <volume>58</volume>
          (
          <year>2012</year>
          ),
          <fpage>1203</fpage>
          -
          <lpage>1214</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Kůrková</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Dimension-independent rates of approximation by neural networks</article-title>
          . In: K. Warwick and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kárný</surname>
          </string-name>
          , (eds.),
          <source>Computer-Intensive Methods in Control and Signal Processing. The Curse of Dimensionality</source>
          ,
          <fpage>261</fpage>
          -
          <lpage>270</lpage>
          , Birkhäuser, Boston, MA, 1997
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Kůrková</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savický</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hlaváčková</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Representations and rates of approximation of real-valued Boolean functions by neural networks</article-title>
          .
          <source>Neural Networks</source>
          <volume>11</volume>
          (
          <year>1998</year>
          ),
          <fpage>651</fpage>
          -
          <lpage>659</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>MacWilliams</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sloane</surname>
            ,
            <given-names>N. J. A.</given-names>
          </string-name>
          :
          <article-title>The Theory of Error-Correcting Codes</article-title>
          . North Holland, New York,
          <year>1977</year>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Micchelli</surname>
            ,
            <given-names>C. A.</given-names>
          </string-name>
          :
          <article-title>Interpolation of scattered data: Distance matrices and conditionally positive definite functions</article-title>
          .
          <source>Constructive Approximation</source>
          <volume>2</volume>
          (
          <year>1986</year>
          ),
          <fpage>11</fpage>
          -
          <lpage>22</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Roychowdhury</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siu</surname>
          </string-name>
          , K. -Y.,
          <string-name>
            <surname>Orlitsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Neural models and spectral methods</article-title>
          . In: V.
          <string-name>
            <surname>Roychowdhury</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Siu</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Orlitsky</surname>
          </string-name>
          , (eds.),
          <source>Theoretical Advances in Neural Computation and Learning</source>
          ,
          <fpage>3</fpage>
          -
          <lpage>36</lpage>
          , Springer, New York,
          <year>1994</year>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Schläfli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Theorie der vielfachen Kontinuität</article-title>
          .
          <source>Zürcher &amp; Furrer</source>
          , Zürich,
          <year>1901</year>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Wolpert</surname>
            ,
            <given-names>D. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macready</surname>
          </string-name>
          , W. G.:
          <article-title>No free lunch theorems for optimization</article-title>
          .
          <source>IEEE Transactions on Evolutionary Computation</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ) (
          <year>1997</year>
          ),
          <fpage>67</fpage>
          -
          <lpage>82</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>