<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multivariable Approximation by Convolutional Kernel Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Věra Kůrková</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, Academy of Sciences of the Czech Republic</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>1649</volume>
      <fpage>118</fpage>
      <lpage>122</lpage>
      <abstract>
        <p>Computational units induced by convolutional kernels, together with biologically inspired perceptrons, belong to the most widespread types of units used in neurocomputing. Radial convolutional kernels with varying widths form RBF (radial-basis-function) networks, while such kernels with fixed widths are used in the SVM (support vector machine) algorithm. We investigate the suitability of various convolutional kernel units for function approximation. We show that properties of Fourier transforms of convolutional kernels determine whether sets of input-output functions of networks with kernel units are large enough to be universal approximators. We compare these properties with conditions guaranteeing positive semidefiniteness of convolutional kernels.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Computational units induced by radial and convolutional
kernels together with perceptrons belong to the most
widespread types of units used in neurocomputing. In
contrast to biologically inspired perceptrons [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
localized radial units [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] were introduced merely due to their
good mathematical properties. Radial-basis-function units
(RBF) computing spherical waves were followed by
kernel units [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Kernel units in their most general form include
all types of computational units that are functions of
two vector variables: an input vector and a parameter
vector. However, the term kernel unit is often reserved
for units computing symmetric positive semidefinite
functions of two variables. Networks with these units have
been widely used for classification with maximal margin
by the support vector machine algorithm (SVM) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as
well as for regression [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>
        Other important kernel units are units induced by
convolutional kernels in the form of translations of functions of
one vector variable. Isotropic RBF units can be viewed as
non-symmetric kernel units obtained from convolutional
radial kernels by adding a width parameter. Variability
of widths is a strong property: it makes it possible to apply
arguments based on classical results on approximation of
functions by sequences of their convolutions with scaled bump
functions to prove universal approximation capabilities of
many types of RBF networks [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. Moreover, some
estimates of rates of approximation by RBF networks
exploit variability of widths [
        <xref ref-type="bibr" rid="ref10 ref13 ref9">9, 10, 13</xref>
        ].
      </p>
      <p>
        On the other hand, symmetric positive semidefinite
kernels (which include some classes of RBFs with fixed
width parameters) benefit from geometrical properties of
reproducing kernel Hilbert spaces (RKHS) generated by
these kernels. These properties allow maximal margin
classification to be extended from finite-dimensional
spaces to sets of data that are not linearly separable,
by embedding them into infinite-dimensional spaces [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <sec id="sec-1-1">
        <title>Moreover, symmetric positive semidefinite kernels gener</title>
        <p>
          ate stabilizers in the form of norms on RKHSs suitable for
modeling generalization in terms of regularization [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and
enable characterizations of theoretically optimal solutions
of learning tasks [
          <xref ref-type="bibr" rid="ref11 ref19 ref3">3, 19, 11</xref>
          ].
        </p>
        <p>
          Arguments proving the universal approximation
property of RBF networks using sequences of scaled kernels
might suggest that variability of widths is necessary for the
universal approximation. However, for the special case of
the Gaussian kernel, the universal approximation property
holds even when the width is fixed and only the centers are
varying [
          <xref ref-type="bibr" rid="ref12 ref14">14, 12</xref>
          ].
        </p>
        <p>On the other hand, it is easy to find some examples
of positive semidefinite kernels such that sets of
input-output functions of shallow networks with units generated
by these kernels are too small to be universal
approximators. For example, networks with product kernel units of
the form K(x, y) = k(x)k(y) generate as input-output
functions only scalar multiples ck(x) of the function k.</p>
        <p>In this paper, we investigate capabilities of networks
with one hidden layer of convolutional kernel units to
approximate multivariable functions. We show that a crucial
property influencing whether sets of input-output
functions of convolutional kernel networks are large enough
to be universal approximators is the behavior of the Fourier
transform of the one-variable function generating the
convolutional kernel. We give a necessary and sufficient
condition for universal approximation by kernel networks in
terms of the Fourier transforms of kernels. We compare
this condition with properties of kernels guaranteeing their
positive definiteness. We illustrate our results by
examples of some common kernels such as Gaussian, Laplace,
parabolic, rectangle, and triangle.</p>
      <p>The paper is organized as follows. In section 2,
notations and basic concepts on one-hidden-layer networks
and kernel units are introduced. In section 3, a
necessary and sufficient condition on a convolutional kernel that
guarantees that networks with units induced by the kernel
have the universal approximation property is given. In section 4
this condition is compared with a condition guaranteeing
that a kernel is positive semidefinite, and some examples
of kernels satisfying both or only one of these conditions are
given. Section 5 is a brief discussion.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Preliminaries</title>
      <p>
        Radial-basis-function networks as well as kernel models
belong to the class of one-hidden-layer networks with one
linear output unit. Such networks compute input-output
functions from sets of the form
span G = { ∑_{i=1}^{n} wi gi | wi ∈ R, gi ∈ G, n ∈ N+ },
where the set G is called a dictionary [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and R, N+
denote the sets of real numbers and positive integers, respectively.
      </p>
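      <p>To make the notion of an input-output function in span G concrete, the following
sketch evaluates a network ∑ wi gi whose units gi are induced by a Gaussian
convolutional kernel. It is only an illustration; the function names and the choice
of the Gaussian dictionary are ours, not part of the paper.</p>
      <preformat>
import numpy as np

def gaussian_unit(x, y, a=1.0):
    # kernel unit induced by the convolutional kernel K(x, y) = exp(-a^2 ||x - y||^2)
    return np.exp(-a**2 * np.sum((np.asarray(x) - np.asarray(y))**2))

def network_output(x, weights, centers, a=1.0):
    # element of span G: f(x) = sum_i w_i K(x, y_i), a one-hidden-layer network
    return sum(w * gaussian_unit(x, y, a) for w, y in zip(weights, centers))

# a network with three units in dimension d = 2
centers = [(0.0, 0.0), (1.0, -1.0), (-0.5, 2.0)]
weights = [0.7, -1.2, 0.4]
print(network_output((0.3, 0.1), weights, centers))
      </preformat>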
      <sec id="sec-2-1">
        <p>Typically, dictionaries are parameterized families of functions
modeling computational units, i.e., they are of the form
GK (X ,Y ) = {K(·, y) : X → R | y ∈ Y },
where K : X × Y → R is a function of two variables, an
input vector x ∈ X ⊆ Rd and a parameter y ∈ Y ⊆ Rs. Such
functions of two variables are called kernels. This term,
derived from the German term “kern”, has been used since
1904 in the theory of integral operators [18, p.291].</p>
      </sec>
      <sec id="sec-2-2">
        <p>An important class of kernels are convolutional kernels,
which are obtained by translations of one-variable
functions k : Rd → R as</p>
        <p>K(x, y) = k(x − y).</p>
        <p>Radial convolutional kernels are convolutional kernels
obtained as translations of radial functions, i.e., functions of
the form</p>
        <p>k(x) = k1(‖x‖),
where k1 : R+ → R.</p>
      </sec>
      <sec id="sec-2-3">
        <p>The convolution is an operation defined as
f ∗ g(x) = ∫_{Rd} f(y) g(x − y) dy
[20, p.170].</p>
        <p>Recall that a kernel K : X × X → R is called positive
semidefinite if for any positive integer m, any x1, . . . , xm ∈
X and any a1, . . . , am ∈ R,
∑_{i=1}^{m} ∑_{j=1}^{m} ai aj K(xi, xj) ≥ 0.</p>
        <p>Similarly, a function of one variable k : Rd → R is called
positive semidefinite if for any positive integer m, any
x1, . . . , xm ∈ Rd and any a1, . . . , am ∈ R,
∑_{i=1}^{m} ∑_{j=1}^{m} ai aj k(xi − xj) ≥ 0.</p>
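        <p>As a sanity check of this definition, the following sketch (our own
illustration, not from the paper) evaluates the quadratic form
∑_{i,j} ai aj k(xi − xj) for the one-dimensional Gaussian k(x) = e^{−x^2}
at random points and coefficients; every value comes out nonnegative,
in agreement with the Gaussian being positive semidefinite.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)

def quad_form(k, xs, a):
    # sum_i sum_j a_i a_j k(x_i - x_j), the quadratic form from the definition above
    return sum(a[i] * a[j] * k(xs[i] - xs[j])
               for i in range(len(xs)) for j in range(len(xs)))

gauss = lambda t: np.exp(-t**2)     # k(x) = exp(-x^2)
for _ in range(5):
    xs = rng.normal(size=8)         # m = 8 points x_1, ..., x_m in R
    a = rng.normal(size=8)          # coefficients a_1, ..., a_m
    print(quad_form(gauss, xs, a))  # nonnegative for every sample
        </preformat>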
      </sec>
      <sec id="sec-2-4">
        <p>For symmetric positive semidefinite kernels K, the sets
span GK (X ) of input-output functions of networks with
units induced by the kernel K are contained in Hilbert
spaces defined by these kernels. These spaces are called
reproducing kernel Hilbert spaces (RKHS) and denoted
HK (X ). They are formed by the functions from
span GK (X ) = span{Kx | x ∈ X }, where Kx(·) = K(x, ·),
together with the limits of their Cauchy sequences in the norm
‖·‖K. The norm ‖·‖K is induced by the inner product
⟨·, ·⟩K, which is defined on GK (X ) = {Kx | x ∈ X } as
⟨Kx, Ky⟩K = K(x, y). So span GK (X ) ⊂ HK (X ).</p>
      </sec>
      <sec id="sec-2-5">
        <title>3 Universal approximation capability of convolutional kernel networks</title>
        <p>In this section, we investigate conditions guaranteeing that
sets of input-output functions of convolutional kernel
networks are large enough to be universal approximators.</p>
        <p>The universal approximation property is formally
defined as density in a normed linear space. A class of
one-hidden-layer networks with units from a dictionary G is
said to have the universal approximation property in a
normed linear space (X , ‖·‖X ) if it is dense in this space,
i.e., clX span G = X , where span G denotes the linear
span of G and clX denotes the closure with respect to the
topology induced by the norm ‖·‖X . More precisely, for
every f ∈ X and every ε &gt; 0 there exist a positive integer
n, g1, . . . , gn ∈ G, and w1, . . . , wn ∈ R such that
‖ f − ∑_{i=1}^{n} wi gi ‖X &lt; ε.</p>
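        <p>The sketch below illustrates this definition numerically (it is our own
example; the target function, the interval, and the fixed-width Gaussian
dictionary are arbitrary choices). It fits a target f by networks
∑_{i=1}^{n} wi k(· − yi) with a growing number n of translated units,
solving for the weights by least squares; the discretized L 2 error
decreases as n grows, as density would suggest.</p>
        <preformat>
import numpy as np

# target f on a grid discretizing the interval X = [-3, 3]
x = np.linspace(-3.0, 3.0, 400)
f = np.sin(2.0 * x) * np.exp(-0.2 * x**2)

k = lambda t: np.exp(-t**2)              # fixed-width Gaussian kernel k(x) = exp(-x^2)

for n in (2, 5, 10, 20, 40):
    centers = np.linspace(-3.0, 3.0, n)  # translation parameters y_1, ..., y_n
    G = k(x[:, None] - centers[None, :]) # columns are the units k(. - y_i) on the grid
    w, *_ = np.linalg.lstsq(G, f, rcond=None)
    err = np.sqrt(np.mean((f - G @ w)**2))
    print(n, err)                        # discretized L2 error shrinks as n grows
        </preformat>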
        <p>Function spaces where the universal approximation
property has been of interest are the spaces (C(X ), ‖·‖sup) of
continuous functions on subsets X of Rd (typically
compact) with the supremum norm
‖ f ‖sup = sup_{x∈X} | f (x)|,
and the spaces (L p(Rd ), ‖·‖L p ) of functions on Rd with
finite ∫_{Rd} | f (y)|^p dy and the norm
‖ f ‖L p = ( ∫_{Rd} | f (y)|^p dy )^{1/p}.</p>
        <p>Recall that the d-dimensional Fourier transform is an
isometry on L 2(Rd ), defined on L 2(Rd ) ∩ L 1(Rd ) as
fˆ(s) = (2π)^{−d/2} ∫_{Rd} f (x) e^{−i x·s} dx
and extended to L 2(Rd ) [20, p.183].</p>
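        <p>As a quick numerical check of this definition in dimension d = 1 (a sketch of
ours, not part of the paper; the plain Riemann-sum discretization is an arbitrary
choice), one can approximate the transform of a Gaussian and compare it with its
closed form under this normalization.</p>
        <preformat>
import numpy as np

a = 1.0
x = np.linspace(-20.0, 20.0, 4001)       # integration grid
dx = x[1] - x[0]
f = np.exp(-a**2 * x**2)                 # f(x) = exp(-a^2 x^2)

def fourier(s):
    # Riemann-sum approximation of (2 pi)^(-1/2) * integral of f(x) exp(-i x s) dx
    return dx / np.sqrt(2.0 * np.pi) * np.sum(f * np.exp(-1j * x * s))

for s in (0.0, 0.5, 1.0, 2.0):
    exact = np.exp(-s**2 / (4.0 * a**2)) / (a * np.sqrt(2.0))
    print(s, fourier(s).real, exact)     # the two values agree closely
        </preformat>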
      </sec>
      <sec id="sec-2-6">
        <p>Note that the Fourier transform of an even function is
real and the Fourier transform of a radial function is radial.
If k ∈ L 1(Rd ), then kˆ is uniformly continuous and with
increasing frequencies converges to zero, i.e.,
lim_{‖s‖→∞} kˆ(s) = 0.</p>
        <p>The following theorem gives a necessary and sufficient
condition on a convolutional kernel that guarantees that the
class of input-output functions computable by networks
with units induced by the kernel can approximate
arbitrarily well all functions in L 2(Rd ). The condition is
formulated in terms of the size of the set of frequencies for which
the Fourier transform is equal to zero. By λ is denoted the
Lebesgue measure.</p>
      </sec>
      <sec id="sec-2-7">
        <p>Theorem 1. Let d be a positive integer, k ∈ L 1(Rd ) ∩
L 2(Rd ) be even, K : Rd × Rd → R be defined as K(x, y) =
k(x − y), and X ⊆ Rd be Lebesgue measurable. Then
span GK (X ) is dense in (L 2(X ), ‖·‖L 2 ) if and only if
λ ({s ∈ Rd | kˆ(s) = 0}) = 0.</p>
        <sec id="sec-2-7-1">
          <p>Proof. First, we prove the necessity. To prove it by
contradiction, assume that λ (S) ≠ 0, where
S = {s ∈ Rd | kˆ(s) = 0}. Take any function
f ∈ L 2(Rd ) ∩ L 1(Rd ) with a positive Fourier transform
(for example, f can be the Gaussian). Let ε &gt; 0 be such
that
ε &lt; ∫_S fˆ(s)^2 ds.</p>
          <p>Assume that there exist n, wi ∈ R, and yi ∈ Rd such that
‖ f − ∑_{i=1}^{n} wi k(· − yi)‖^2_{L 2} &lt; ε.
Then by the Plancherel Theorem [20, p.188],
‖ fˆ − ∑_{i=1}^{n} wi (k(· − yi))ˆ ‖^2_{L 2} = ‖ fˆ − ∑_{i=1}^{n} w¯i kˆ ‖^2_{L 2},
where w¯i(s) = wi e^{−i yi·s}. Hence
‖ fˆ − ∑_{i=1}^{n} w¯i kˆ ‖^2_{L 2} ≥ ∫_S fˆ(s)^2 ds &gt; ε,
which is a contradiction.</p>
          <p>To prove the sufficiency, we first assume that X = Rd .</p>
        </sec>
      </sec>
      <sec id="sec-2-8">
        <p>We prove it by contradiction, so we suppose that
clL 2 span GK (Rd ) = clL 2 span {K(·, y) | y ∈ Rd } ≠ L 2(Rd ).</p>
      </sec>
      <sec id="sec-2-9">
        <p>Then by the Hahn-Banach Theorem [20, p. 60] there exists
a bounded linear functional l on L 2(Rd ) such that
for all f ∈ clL 2 span GK (Rd ), l( f ) = 0, and for some
f0 ∈ L 2(Rd ) \ clL 2 span GK (Rd ), l( f0) = 1. By the Riesz
Representation Theorem [5, p.206], l can be expressed as an
inner product with some h ∈ L 2(Rd ).</p>
        <p>As k is even, for all y ∈ Rd ,
⟨h, K(·, y)⟩ = ∫_{Rd} h(x) k(x − y) dx = ∫_{Rd} h(x) k(y − x) dx = h ∗ k(y) = 0.</p>
        <p>By the Young Inequality for convolutions, h ∗ k ∈ L 2(Rd ),
and so by the Plancherel Theorem [20, p.188],
‖(h ∗ k)ˆ‖_{L 2} = 0.
As (h ∗ k)ˆ = (2π)^{d/2} hˆ kˆ [20, p.183], we have ‖hˆ kˆ‖_{L 2} = 0 and so
∫_{Rd} (hˆ(s) kˆ(s))^2 ds = 0.</p>
        <p>As the set
S = {s ∈ Rd | kˆ(s) = 0}
has Lebesgue measure zero, we have
∫_{Rd \ S} hˆ(s)^2 kˆ(s)^2 ds = ∫_{Rd} hˆ(s)^2 kˆ(s)^2 ds = 0.</p>
        <p>As for all s ∈ Rd \ S, kˆ(s)^2 &gt; 0, we have ‖hˆ‖_{L 2} = 0. So
‖h‖_{L 2} = 0, and hence by the Cauchy-Schwarz Inequality
we get
1 = l( f0) = ∫_{Rd} f0(y) h(y) dy ≤ ‖ f0‖_{L 2} ‖h‖_{L 2} = 0,
which is a contradiction.</p>
        <p>Extending a function f from L 2(X ) to f¯ from L 2(Rd )
by setting its values equal to zero outside of X , and
restricting approximations of f¯ by functions from
span GK (Rd ) to X , we get the statement for any Lebesgue
measurable subset X of Rd . ✷</p>
      </sec>
      <sec id="sec-2-11">
        <title>Theorem 1 shows that sets of input-output functions of</title>
        <p>convolutional kernel networks are large enough to
approximate arbitrarily well all L 2-functions if and only if the</p>
      </sec>
      <sec id="sec-2-12">
        <title>Fourier transform of the function k is almost everywhere non-zero.</title>
        <p>Theorem 1 implies that when kˆ(s) is equal to zero for all
s such that ksk ≥ r for some r &gt; 0 (the Fourier transform
is band-limited), then the set span GK (Rd ) is too small to
have the universal approximation capability. In the next
section we show, that some of such kernels are positive
semidefinite. So they can be used for classification by the</p>
      </sec>
      <sec id="sec-2-13">
        <title>SVM algorithm but they are not suitable for function approximation.</title>
        <p>✷</p>
        <p>Positive semidefinitness and universal
approximation property</p>
      </sec>
      <sec id="sec-2-14">
        <title>In this section, we compare a condition on positive semidefinitness of a convolutional kernel with the condition on the universal approximation property derived in the previous section.</title>
      </sec>
      <sec id="sec-2-15">
        <p>As the inverse Fourier transform of a convolutional kernel can be expressed as
K(x, y) = k(x − y) = (2π)^{−d/2} ∫_{Rd} kˆ(s) e^{i(x−y)·s} ds,
it is easy to verify that when kˆ is positive or non-negative,
then K defined as K(x, y) = k(x − y) is positive definite or
positive semidefinite, respectively.</p>
        <p>Indeed, to verify that ∑_{j,l=1}^{n} aj al K(xj, xl ) ≥ 0, we
express K in terms of the inverse Fourier transform. Thus
we get
∑_{j,l=1}^{n} aj al K(xj, xl ) = ∑_{j,l=1}^{n} aj al (2π)^{−d/2} ∫_{Rd} kˆ(s) e^{i(xj−xl)·s} ds
= (2π)^{−d/2} ∫_{Rd} kˆ(s) | ∑_{j=1}^{n} aj e^{i xj·s} |^2 ds ≥ 0.</p>
        <p>Proposition 2. Let k ∈ L 1(Rd ) ∩ L 2(Rd ) be an even
function such that kˆ(s) ≥ 0 for all s ∈ Rd . Then K(x, y) =
k(x − y) is positive semidefinite.</p>
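        <p>A small numerical illustration of Proposition 2 (our own sketch, not part of
the paper): Gram matrices of the Laplace kernel, whose Fourier transform is
nonnegative, have only nonnegative eigenvalues, whereas the rectangle kernel
discussed below already produces a negative eigenvalue on three points.</p>
        <preformat>
import numpy as np

def gram(k, xs):
    # Gram matrix G_ij = k(x_i - x_j) of a convolutional kernel on points x_1, ..., x_m
    xs = np.asarray(xs, dtype=float)
    return k(xs[:, None] - xs[None, :])

laplace = lambda t: np.exp(-np.abs(t))               # Fourier transform nonnegative
rect = lambda t: np.heaviside(0.5 - np.abs(t), 0.0)  # indicator of (-1/2, 1/2)

pts = [0.0, 0.3, 0.6]
print(np.linalg.eigvalsh(gram(laplace, pts)))  # all eigenvalues nonnegative
print(np.linalg.eigvalsh(gram(rect, pts)))     # one eigenvalue is 1 - sqrt(2), negative
        </preformat>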
      </sec>
      <sec id="sec-2-16">
        <p>A complete characterization of positive semidefinite
bounded continuous kernels follows from the Bochner Theorem.</p>
      </sec>
      <sec id="sec-2-17">
        <p>Theorem 3 (Bochner). A bounded continuous function
k : Rd → C is positive semidefinite iff k is the Fourier
transform of a nonnegative finite Borel measure μ, i.e.,
k(x) = (2π)^{−d/2} ∫_{Rd} e^{−i x·s} dμ(s).</p>
      </sec>
      <sec id="sec-2-18">
        <p>The Bochner Theorem implies that when the Borel measure μ
has a density, then the condition in
Proposition 2 is both sufficient and necessary.</p>
      </sec>
      <sec id="sec-2-19">
        <p>Comparison of the characterization of kernels for which,
by Theorem 1, one-hidden-layer kernel networks are
universal approximators with the condition on positive
semidefiniteness from Proposition 2 shows that there are
positive semidefinite kernels which do not generate
networks possessing the universal approximation capability,
and there also are kernels which are not positive
semidefinite but induce networks with the universal approximation
property. The first ones are suitable for SVM but not for
regression, while the second ones can be used for
regression but are not suitable for SVM. In the sequel, we give
some examples of such kernels.</p>
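        <p>The examples below can also be checked mechanically. The following sketch (ours,
not from the paper) evaluates the closed-form Fourier transforms of the
one-dimensional example kernels, as they are stated in the text (their
normalizations differ, which affects neither signs nor zero sets), and reports
whether each transform is nonnegative (Proposition 2) and on what fraction of a
frequency grid it vanishes (a proxy for the zero-set condition of Theorem 1).</p>
        <preformat>
import numpy as np

s = np.linspace(-8.0, 8.0, 100001)
s = s[s != 0.0]                          # avoid the removable singularities at 0

# closed-form Fourier transforms of the one-dimensional example kernels
transforms = {
    "gaussian":  np.exp(-s**2 / 4.0) / np.sqrt(2.0),
    "laplace":   2.0 / (1.0 + (2.0 * np.pi * s)**2),
    "rect":      np.sin(np.pi * s) / (np.pi * s),          # the sinc function
    "triangle":  (np.sin(np.pi * s) / (np.pi * s))**2,     # sinc squared
    "parabolic": 3.0 * (np.sin(s) - s * np.cos(s)) / s**3,
    "sinc":      np.heaviside(0.5 - np.abs(s), 0.0),       # rect: zero for |s| above 1/2
}

for name, ft in transforms.items():
    nonneg = bool(np.min(ft) &gt;= -1e-12)            # positive semidefinite (Proposition 2)
    vanish = float(np.mean(np.abs(ft) &lt; 1e-12))    # universal approximation fails if not ~0
    print(name, nonneg, round(vanish, 2))
        </preformat>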
      </sec>
      <sec id="sec-2-20">
        <p>A paradigmatic example of a convolutional kernel is the
Gaussian kernel ga : Rd → R defined for a width a &gt; 0 as
ga = e^{−a^2 ‖·‖^2}.
For any fixed width a and any dimension d,
gˆa = (√2 a)^{−d} e^{−‖·‖^2/(4a^2)}.
So the Gaussian kernel is positive definite and the class of
Gaussian kernel networks has the universal approximation property.</p>
      </sec>
      <sec id="sec-2-23">
        <p>The rectangle kernel is defined as
rect(x) = 1 for x ∈ (−1/2, 1/2),
otherwise rect(x) = 0.</p>
      </sec>
      <sec id="sec-2-24">
        <p>Its Fourier transform is the sinc function,
sinc(s) = sin(π s)/(π s).
So the Fourier transform of rect is not non-negative, but its
zeros form a discrete set of Lebesgue measure zero.
Thus the rectangle kernel is not positive semidefinite, but it
induces a class of networks with the universal
approximation property. On the other hand, the Fourier transform of
sinc is the rectangle kernel, and thus sinc is positive
semidefinite but does not induce networks with the universal
approximation property.</p>
      </sec>
      <sec id="sec-2-27">
        <p>The Laplace kernel is defined for any a &gt; 0 as
l(x) = e^{−a|x|}.
Its Fourier transform is positive:
lˆ(s) = 2a / (a^2 + (2πs)^2).</p>
        <p>The triangle kernel is defined as
tri(x) = 1 + x for x ∈ (−1, 0),
tri(x) = 1 − x for x ∈ (0, 1),
otherwise tri(x) = 0.</p>
      </sec>
      <sec id="sec-2-28">
        <p>Its Fourier transform is also positive, as it equals
sinc(s)^2 = (sin(π s)/(π s))^2.</p>
      </sec>
      <sec id="sec-2-29">
        <p>Thus both the Laplace and the triangle kernel are positive
definite and induce networks having the universal approximation property.</p>
        <p>The parabolic (Epanechnikov) kernel is defined as
epi(x) = (3/4)(1 − x^2) for x ∈ (−1, 1),
otherwise epi(x) = 0.</p>
      </sec>
      <sec id="sec-2-30">
        <p>Its Fourier transform equals
(3/s^3)(sin(s) − s cos(s)) for s ≠ 0,
and 1 for s = 0.</p>
      </sec>
      <sec id="sec-2-31">
        <p>So the parabolic kernel is not positive semidefinite, but it
induces networks with the universal approximation property.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5 Discussion</title>
      <p>We investigated the effect of properties of the Fourier
transform of a kernel function on the suitability of the
convolutional kernel for function approximation (universal
approximation property) and for the maximal margin
classification algorithm (positive semidefiniteness). We showed
that these properties depend on the behavior of the Fourier
transform as the frequencies increase to
infinity. For the universal approximation property, the Fourier
transform can be negative but cannot be zero on any set of
frequencies of non-zero Lebesgue measure. On the other
hand, functions with non-negative Fourier transforms are
positive semidefinite even if they are compactly supported.</p>
      <sec id="sec-3-1">
        <p>We illustrated our results by the paradigmatic example
of the multivariable Gaussian kernel and by some
one-dimensional examples. The multivariable Gaussian is a
product of one-variable functions, and thus its multivariable
Fourier transform can be computed using transforms of
one-variable Gaussians. Fourier transforms of other radial
multivariable kernels are more complicated; their
expressions include Bessel functions and the Hankel transform.</p>
      </sec>
      <sec id="sec-3-3">
        <p>Investigation of properties of Fourier transforms of
multivariable radial convolutional kernels is a subject of our future work.</p>
        <sec id="sec-3-3-1">
          <title>Acknowledgments</title>
          <p>This work was partially supported by
the grant GA ČR 15-18108S and by the institutional support of the
Institute of Computer Science RVO 67985807.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Broomhead</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <article-title>Error bounds for approximation with neural networks</article-title>
          .
          <source>Complex Systems</source>
          ,
          <volume>2</volume>
          :
          <fpage>321</fpage>
          -
          <lpage>355</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          and
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <article-title>Support vector networks</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>20</volume>
          :
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cucker</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Smale</surname>
          </string-name>
          .
          <article-title>On the mathematical foundations of learning</article-title>
          .
          <source>Bulletin of AMS</source>
          ,
          <volume>39</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cucker</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <source>Learning Theory: An Approximation Theory Viewpoint</source>
          . Cambridge University Press, Cambridge,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <source>Modern Analysis</source>
          . Dover, New York,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Girosi</surname>
          </string-name>
          .
          <article-title>An equivalence between sparse approximation and support vector machines</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>10</volume>
          :
          <fpage>1455</fpage>
          -
          <lpage>1480</lpage>
          <source>(AI memo 1606)</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Girosi</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Poggio</surname>
          </string-name>
          .
          <article-title>Regularization algorithms for learning that are equivalent to multilayer networks</article-title>
          .
          <source>Science</source>
          ,
          <volume>247</volume>
          (
          <issue>4945</issue>
          ):
          <fpage>978</fpage>
          -
          <lpage>982</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gribonval</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Vandergheynst</surname>
          </string-name>
          .
          <article-title>On the exponential convergence of matching pursuits in quasi-incoherent dictionaries</article-title>
          .
          <source>IEEE Trans. on Information Theory</source>
          ,
          <volume>52</volume>
          :
          <fpage>255</fpage>
          -
          <lpage>261</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Kainen</surname>
          </string-name>
          , V. Kůrková, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Complexity of Gaussian radial basis networks approximating smooth functions</article-title>
          .
          <source>J. of Complexity</source>
          ,
          <volume>25</volume>
          :
          <fpage>63</fpage>
          -
          <lpage>74</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Kainen</surname>
          </string-name>
          , V. Kůrková, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Dependence of computational models on input dimension: Tractability of approximation and optimization tasks</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          ,
          <volume>58</volume>
          :
          <fpage>1203</fpage>
          -
          <lpage>1214</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          .
          <article-title>Neural network learning as an inverse problem</article-title>
          .
          <source>Logic Journal of IGPL</source>
          ,
          <volume>13</volume>
          :
          <fpage>551</fpage>
          -
          <lpage>559</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Kainen</surname>
          </string-name>
          .
          <article-title>Comparing fixed and variable-width Gaussian networks</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>57</volume>
          :
          <fpage>23</fpage>
          -
          <lpage>28</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Model complexities of shallow networks representing highly varying functions</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>171</volume>
          :
          <fpage>598</fpage>
          -
          <lpage>604</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H. N.</given-names>
            <surname>Mhaskar</surname>
          </string-name>
          .
          <article-title>Versatile Gaussian networks</article-title>
          .
          <source>In Proceedings of IEEE Workshop of Nonlinear Image Processing</source>
          , pages
          <fpage>70</fpage>
          -
          <lpage>73</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Minsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Papert</surname>
          </string-name>
          . Perceptrons. MIT Press,
          <year>1969</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Sandberg</surname>
          </string-name>
          .
          <article-title>Universal approximation using radial-basis-function networks</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>3</volume>
          :
          <fpage>246</fpage>
          -
          <lpage>257</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Sandberg</surname>
          </string-name>
          .
          <article-title>Approximation and radial basis function networks</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>5</volume>
          :
          <fpage>305</fpage>
          -
          <lpage>316</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pietsch</surname>
          </string-name>
          .
          <article-title>Eigenvalues and s-Numbers</article-title>
          . Cambridge University Press, Cambridge,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Poggio</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Smale</surname>
          </string-name>
          .
          <article-title>The mathematics of learning: dealing with data</article-title>
          .
          <source>Notices of AMS</source>
          ,
          <volume>50</volume>
          :
          <fpage>537</fpage>
          -
          <lpage>544</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>W.</given-names>
            <surname>Rudin</surname>
          </string-name>
          .
          <source>Functional Analysis</source>
          . McGraw-Hill
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          .
          <article-title>Learning with Kernels - Support Vector Machines, Regularization, Optimization and Beyond</article-title>
          . MIT Press, Cambridge,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>