ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 118–122
http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 V. Kůrková



                 Multivariable Approximation by Convolutional Kernel Networks

                                                             Věra Kůrková

                                     Institute of Computer Science, Academy of Sciences of the Czech Republic,
                                                            vera@cs.cas.cz,
                                            WWW home page: http://www.cs.cas.cz/~vera

      Abstract: Computational units induced by convolutional kernels together with biologically inspired perceptrons belong to the most widespread types of units used in neurocomputing. Radial convolutional kernels with varying widths form RBF (radial-basis-function) networks, and these kernels with fixed widths are used in the SVM (support vector machine) algorithm. We investigate suitability of various convolutional kernel units for function approximation. We show that properties of Fourier transforms of convolutional kernels determine whether sets of input-output functions of networks with kernel units are large enough to be universal approximators. We compare these properties with conditions guaranteeing positive semidefiniteness of convolutional kernels.

1    Introduction

Computational units induced by radial and convolutional kernels together with perceptrons belong to the most widespread types of units used in neurocomputing. In contrast to biologically inspired perceptrons [15], localized radial units [1] were introduced merely due to their good mathematical properties. Radial-basis-function (RBF) units computing spherical waves were followed by kernel units [7]. Kernel units in the most general form include all types of computational units which are functions of two vector variables: an input vector and a parameter vector. However, often the term kernel unit is reserved merely for units computing symmetric positive semidefinite functions of two variables. Networks with these units have been widely used for classification with maximal margin by the support vector machine (SVM) algorithm [2] as well as for regression [21].
   Other important kernel units are units induced by convolutional kernels in the form of translations of functions of one vector variable. Isotropic RBF units can be viewed as nonsymmetric kernel units obtained from convolutional radial kernels by adding a width parameter. Variability of widths is a strong property. It makes it possible to apply arguments based on classical results on approximation of functions by sequences of their convolutions with scaled bump functions to prove universal approximation capabilities of many types of RBF networks [16, 17]. Moreover, some estimates of rates of approximation by RBF networks exploit variability of widths [9, 10, 13].
   On the other hand, symmetric positive semidefinite kernels (which include some classes of RBFs with fixed width parameters) benefit from geometrical properties of reproducing kernel Hilbert spaces (RKHSs) generated by these kernels. These properties allow an extension of maximal margin classification from finite dimensional spaces also to sets of data which are not linearly separable, by embedding them into infinite dimensional spaces [2]. Moreover, symmetric positive semidefinite kernels generate stabilizers in the form of norms on RKHSs suitable for modeling generalization in terms of regularization [6], and they enable characterizations of theoretically optimal solutions of learning tasks [3, 19, 11].
   Arguments proving the universal approximation property of RBF networks using sequences of scaled kernels might suggest that variability of widths is necessary for the universal approximation. However, for the special case of the Gaussian kernel, the universal approximation property holds even when the width is fixed and merely centers are varying [14, 12].
   On the other hand, it is easy to find examples of positive semidefinite kernels such that the sets of input-output functions of shallow networks with units generated by these kernels are too small to be universal approximators. For example, networks with product kernel units of the form $K(x, y) = k(x)k(y)$ generate as input-output functions only scalar multiples $c\,k(x)$ of the function $k$, since $\sum_{i=1}^n w_i k(x) k(y_i) = \left( \sum_{i=1}^n w_i k(y_i) \right) k(x)$.
   In this paper, we investigate capabilities of networks with one hidden layer of convolutional kernel units to approximate multivariable functions. We show that a crucial property influencing whether sets of input-output functions of convolutional kernel networks are large enough to be universal approximators is the behavior of the Fourier transform of the one-variable function generating the convolutional kernel. We give a necessary and sufficient condition for universal approximation by kernel networks in terms of the Fourier transforms of kernels. We compare this condition with properties of kernels guaranteeing their positive definiteness. We illustrate our results by examples of some common kernels such as the Gaussian, Laplace, parabolic, rectangle, and triangle kernels.
   The paper is organized as follows. In Section 2, notations and basic concepts on one-hidden-layer networks and kernel units are introduced. In Section 3, a necessary and sufficient condition on a convolutional kernel is derived that guarantees that networks with units induced by the kernel have the universal approximation property. In Section 4, this condition is compared with a condition guaranteeing that a kernel is positive semidefinite, and some examples of kernels satisfying both or one of these conditions are given. Section 5 is a brief discussion.

2    Preliminaries

Radial-basis-function networks as well as kernel models belong to the class of one-hidden-layer networks with one linear output unit. Such networks compute input-output functions from sets of the form

    $$ \mathrm{span}\, G = \left\{ \sum_{i=1}^n w_i g_i \;\middle|\; w_i \in \mathbb{R},\ g_i \in G,\ n \in \mathbb{N}_+ \right\}, $$

where the set $G$ is called a dictionary [8], and $\mathbb{R}$, $\mathbb{N}_+$ denote the sets of real numbers and positive integers, respectively. Typically, dictionaries are parameterized families of functions modeling computational units, i.e., they are of the form

    $$ G_K(X, Y) = \{ K(\cdot, y) : X \to \mathbb{R} \mid y \in Y \}, $$

where $K : X \times Y \to \mathbb{R}$ is a function of two variables, an input vector $x \in X \subseteq \mathbb{R}^d$ and a parameter $y \in Y \subseteq \mathbb{R}^s$. Such functions of two variables are called kernels. This term, derived from the German term “kern”, has been used since 1904 in the theory of integral operators [18, p. 291].
   An important class of kernels are convolutional kernels, which are obtained by translations of one-variable functions $k : \mathbb{R}^d \to \mathbb{R}$ as

    $$ K(x, y) = k(x - y). $$

Radial convolutional kernels are convolutional kernels obtained as translations of radial functions, i.e., functions of the form

    $$ k(x) = k_1(\|x\|), $$

where $k_1 : \mathbb{R}_+ \to \mathbb{R}$.
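   For illustration, an input-output function from $\mathrm{span}\, G_K(\mathbb{R}^d)$ with a radial convolutional kernel can be evaluated directly from the definitions above. The following minimal sketch (not part of the original apparatus; it assumes NumPy, chooses the Gaussian $k(x) = e^{-a^2\|x\|^2}$ as an example kernel, and uses illustrative variable names) computes $f(x) = \sum_i w_i k(x - y_i)$:

    import numpy as np

    def gaussian_k(x, a=1.0):
        # Radial convolutional kernel k(x) = k_1(||x||) with k_1(t) = exp(-a^2 t^2).
        return np.exp(-a**2 * np.sum(x**2, axis=-1))

    def kernel_network(x, centers, weights, a=1.0):
        # One-hidden-layer kernel network: f(x) = sum_i w_i k(x - y_i),
        # i.e., an element of span G_K(R^d) with n = len(centers) hidden units.
        return sum(w * gaussian_k(x - y, a) for w, y in zip(weights, centers))

    # Example: a network with three Gaussian units in R^2.
    centers = [np.array([0.0, 0.0]), np.array([1.0, -1.0]), np.array([-0.5, 0.5])]
    weights = [2.0, -1.0, 0.5]
    print(kernel_network(np.array([0.2, 0.1]), centers, weights))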
   The convolution is the operation defined as

    $$ f * g(x) = \int_{\mathbb{R}^d} f(x - y)\, g(y)\, dy = \int_{\mathbb{R}^d} f(y)\, g(x - y)\, dy $$

[20, p. 170].
   Recall that a kernel $K : X \times X \to \mathbb{R}$ is called positive semidefinite if for any positive integer $m$, any $x_1, \dots, x_m \in X$, and any $a_1, \dots, a_m \in \mathbb{R}$,

    $$ \sum_{i=1}^m \sum_{j=1}^m a_i a_j K(x_i, x_j) \geq 0. $$

Similarly, a function of one variable $k : \mathbb{R}^d \to \mathbb{R}$ is called positive semidefinite if for any positive integer $m$, any $x_1, \dots, x_m \in \mathbb{R}^d$, and any $a_1, \dots, a_m \in \mathbb{R}$,

    $$ \sum_{i=1}^m \sum_{j=1}^m a_i a_j k(x_i - x_j) \geq 0. $$
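   Positive semidefiniteness of $k$ is equivalent to the requirement that every Gram matrix $M$ with entries $M_{ij} = k(x_i - x_j)$ has only nonnegative eigenvalues, which can be probed numerically. A minimal sketch (an illustration assuming NumPy; the sample points are an arbitrary choice):

    import numpy as np

    def gram_matrix(k, points):
        # M[i, j] = k(x_i - x_j) for the convolutional kernel K(x, y) = k(x - y).
        return np.array([[k(x - y) for y in points] for x in points])

    k_gauss = lambda x: np.exp(-np.sum(np.atleast_1d(x)**2))
    pts = [np.array([t]) for t in np.linspace(-2.0, 2.0, 7)]
    eigs = np.linalg.eigvalsh(gram_matrix(k_gauss, pts))
    print(eigs.min() >= -1e-12)  # True: the Gaussian kernel is positive semidefinite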
   For symmetric positive semidefinite kernels $K$, the sets $\mathrm{span}\, G_K(X)$ of input-output functions of networks with units induced by the kernel $K$ are contained in Hilbert spaces defined by these kernels. These spaces are called reproducing kernel Hilbert spaces (RKHSs) and denoted $\mathcal{H}_K(X)$. They are formed by functions from

    $$ \mathrm{span}\, G_K(X) = \mathrm{span}\{ K_x \mid x \in X \}, $$

where $K_x(\cdot) = K(x, \cdot)$, together with the limits of their Cauchy sequences in the norm $\|\cdot\|_K$. The norm $\|\cdot\|_K$ is induced by the inner product $\langle \cdot, \cdot \rangle_K$, which is defined on $G_K(X) = \{ K_x \mid x \in X \}$ as

    $$ \langle K_x, K_y \rangle_K = K(x, y). $$

So $\mathrm{span}\, G_K(X) \subset \mathcal{H}_K(X)$.

3    Universal approximation capability of convolutional kernel networks

In this section, we investigate conditions guaranteeing that sets of input-output functions of convolutional kernel networks are large enough to be universal approximators.
   The universal approximation property is formally defined as density in a normed linear space. A class of one-hidden-layer networks with units from a dictionary $G$ is said to have the universal approximation property in a normed linear space $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$ if it is dense in this space, i.e., $\mathrm{cl}_{\mathcal{X}}\, \mathrm{span}\, G = \mathcal{X}$, where $\mathrm{span}\, G$ denotes the linear span of $G$ and $\mathrm{cl}_{\mathcal{X}}$ denotes the closure with respect to the topology induced by the norm $\|\cdot\|_{\mathcal{X}}$. More precisely, for every $f \in \mathcal{X}$ and every $\varepsilon > 0$ there exist a positive integer $n$, $g_1, \dots, g_n \in G$, and $w_1, \dots, w_n \in \mathbb{R}$ such that

    $$ \left\| f - \sum_{i=1}^n w_i g_i \right\|_{\mathcal{X}} < \varepsilon. $$
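   This $\varepsilon$-criterion can be explored empirically: fixing a target $f$, one can fit the weights of $n$ kernel units by least squares and watch the error decrease as $n$ grows. A minimal one-dimensional sketch (an illustration assuming NumPy; the target, grid, and width are arbitrary choices, and the discrete grid only mimics the $\mathcal{L}^2$ norm):

    import numpy as np

    x = np.linspace(-4.0, 4.0, 400)             # discretized domain
    target = np.sin(3 * x) * np.exp(-x**2 / 4)  # an arbitrary target f

    for n in (3, 10, 30):
        centers = np.linspace(-4.0, 4.0, n)
        # Design matrix of fixed-width Gaussian units: Phi[i, j] = k(x_i - y_j).
        Phi = np.exp(-(x[:, None] - centers[None, :])**2)
        w, *_ = np.linalg.lstsq(Phi, target, rcond=None)
        err = np.sqrt(np.mean((Phi @ w - target)**2))
        print(n, err)  # the L2-type error shrinks as n increases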

   Function spaces where the universal approximation property has been of interest are the spaces $(C(X), \|\cdot\|_{\sup})$ of continuous functions on subsets $X$ of $\mathbb{R}^d$ (typically compact) with the supremum norm

    $$ \| f \|_{\sup} = \sup_{x \in X} | f(x) | $$

and the spaces $(\mathcal{L}^p(\mathbb{R}^d), \|\cdot\|_{\mathcal{L}^p})$ of functions on $\mathbb{R}^d$ with finite $\int_{\mathbb{R}^d} | f(y) |^p dy$ and the norm

    $$ \| f \|_{\mathcal{L}^p} = \left( \int_{\mathbb{R}^d} | f(y) |^p dy \right)^{1/p}. $$

   Recall that the $d$-dimensional Fourier transform is an isometry on $\mathcal{L}^2(\mathbb{R}^d)$ defined on $\mathcal{L}^2(\mathbb{R}^d) \cap \mathcal{L}^1(\mathbb{R}^d)$ as

    $$ \hat{f}(s) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{-i x \cdot s} f(x)\, dx $$

and extended to $\mathcal{L}^2(\mathbb{R}^d)$ [20, p. 183].
   Note that the Fourier transform of an even function is real, and the Fourier transform of a radial function is radial. If $k \in \mathcal{L}^1(\mathbb{R}^d)$, then $\hat{k}$ is uniformly continuous and converges to zero with increasing frequencies, i.e.,

    $$ \lim_{\|s\| \to \infty} \hat{k}(s) = 0. $$

   The following theorem gives a necessary and sufficient condition on a convolutional kernel guaranteeing that the class of input-output functions computable by networks with units induced by the kernel can approximate arbitrarily well all functions in $\mathcal{L}^2(\mathbb{R}^d)$. The condition is formulated in terms of the size of the set of frequencies at which the Fourier transform is equal to zero. By $\lambda$ is denoted the Lebesgue measure.

Theorem 1. Let $d$ be a positive integer, $k \in \mathcal{L}^1(\mathbb{R}^d) \cap \mathcal{L}^2(\mathbb{R}^d)$ be even, $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be defined as $K(x, y) = k(x - y)$, and $X \subseteq \mathbb{R}^d$ be Lebesgue measurable. Then $\mathrm{span}\, G_K(X)$ is dense in $(\mathcal{L}^2(X), \|\cdot\|_{\mathcal{L}^2})$ if and only if $\lambda(\{ s \in \mathbb{R}^d \mid \hat{k}(s) = 0 \}) = 0$.

   Proof. Denote $S = \{ s \in \mathbb{R}^d \mid \hat{k}(s) = 0 \}$. First, we prove the necessity. To prove it by contradiction, assume that $\lambda(S) \neq 0$. Take any function $f \in \mathcal{L}^2(\mathbb{R}^d) \cap \mathcal{L}^1(\mathbb{R}^d)$ with a positive Fourier transform (for example, $f$ can be the Gaussian). Let $\varepsilon > 0$ be such that

    $$ \varepsilon^2 < \int_S \hat{f}(s)^2 ds. $$

Assume that there exist $n$, $w_i \in \mathbb{R}$, and $y_i \in \mathbb{R}^d$ such that

    $$ \left\| f - \sum_{i=1}^n w_i k(\cdot - y_i) \right\|_{\mathcal{L}^2} < \varepsilon. $$

Then by the Plancherel Theorem [20, p. 188] and the translation property $\widehat{k(\cdot - y_i)}(s) = e^{-i y_i \cdot s} \hat{k}(s)$,

    $$ \varepsilon^2 > \left\| f - \sum_{i=1}^n w_i k(\cdot - y_i) \right\|_{\mathcal{L}^2}^2 = \left\| \hat{f} - \sum_{i=1}^n \bar{w}_i \hat{k} \right\|_{\mathcal{L}^2}^2, $$

where $\bar{w}_i(s) = w_i e^{-i y_i \cdot s}$. Hence

    $$ \varepsilon^2 > \int_{\mathbb{R}^d \setminus S} \left| \hat{f}(s) - \sum_{i=1}^n \bar{w}_i(s) \hat{k}(s) \right|^2 ds + \int_S \hat{f}(s)^2 ds \geq \int_S \hat{f}(s)^2 ds > \varepsilon^2, $$

which is a contradiction.
   To prove the sufficiency, we first assume that $X = \mathbb{R}^d$. We prove it by contradiction, so we suppose that

    $$ \mathrm{cl}_{\mathcal{L}^2}\, \mathrm{span}\, G_K(\mathbb{R}^d) = \mathrm{cl}_{\mathcal{L}^2}\, \mathrm{span}\, \{ K(\cdot, y) \mid y \in \mathbb{R}^d \} \neq \mathcal{L}^2(\mathbb{R}^d). $$

Then by the Hahn-Banach Theorem [20, p. 60] there exists a bounded linear functional $l$ on $\mathcal{L}^2(\mathbb{R}^d)$ such that $l(f) = 0$ for all $f \in \mathrm{cl}_{\mathcal{L}^2}\, \mathrm{span}\, G_K(\mathbb{R}^d)$ and $l(f_0) = 1$ for some $f_0 \in \mathcal{L}^2(\mathbb{R}^d) \setminus \mathrm{cl}_{\mathcal{L}^2}\, \mathrm{span}\, G_K(\mathbb{R}^d)$. By the Riesz Representation Theorem [5, p. 206], $l$ can be expressed as an inner product with some $h \in \mathcal{L}^2(\mathbb{R}^d)$.
   As $k$ is even, for all $y \in \mathbb{R}^d$,

    $$ \langle h, K(\cdot, y) \rangle = \int_{\mathbb{R}^d} h(x)\, k(x - y)\, dx = \int_{\mathbb{R}^d} h(x)\, k(y - x)\, dx = h * k(y) = 0. $$

By the Young Inequality for convolutions, $h * k \in \mathcal{L}^2(\mathbb{R}^d)$, and so by the Plancherel Theorem [20, p. 188],

    $$ \| \widehat{h * k} \|_{\mathcal{L}^2} = 0. $$

As $\widehat{h * k} = (2\pi)^{d/2}\, \hat{h}\, \hat{k}$ [20, p. 183], we have $\| \hat{h} \hat{k} \|_{\mathcal{L}^2} = 0$ and so

    $$ \int_{\mathbb{R}^d} | \hat{h}(s) |^2\, \hat{k}(s)^2\, ds = 0. $$

As the set $S = \{ s \in \mathbb{R}^d \mid \hat{k}(s) = 0 \}$ has Lebesgue measure zero, we have

    $$ \int_{\mathbb{R}^d} | \hat{h}(s) |^2\, \hat{k}(s)^2\, ds = \int_{\mathbb{R}^d \setminus S} | \hat{h}(s) |^2\, \hat{k}(s)^2\, ds = 0. $$

As $\hat{k}(s)^2 > 0$ for all $s \in \mathbb{R}^d \setminus S$, we get $\hat{h} = 0$ almost everywhere, so $\| \hat{h} \|_{\mathcal{L}^2} = 0$. So $\| h \|_{\mathcal{L}^2} = 0$, and hence by the Cauchy-Schwartz Inequality we get

    $$ 1 = l(f_0) = \int_{\mathbb{R}^d} f_0(y)\, h(y)\, dy \leq \| f_0 \|_{\mathcal{L}^2} \| h \|_{\mathcal{L}^2} = 0, $$

which is a contradiction.
   Extending a function $f$ from $\mathcal{L}^2(X)$ to $\bar{f}$ from $\mathcal{L}^2(\mathbb{R}^d)$ by setting its values equal to zero outside of $X$, and restricting approximations of $\bar{f}$ by functions from $\mathrm{span}\, G_K(\mathbb{R}^d)$ to $X$, we get the statement for any Lebesgue measurable subset $X$ of $\mathbb{R}^d$.    ✷

   Theorem 1 shows that sets of input-output functions of convolutional kernel networks are large enough to approximate arbitrarily well all $\mathcal{L}^2$-functions if and only if the Fourier transform of the function $k$ is almost everywhere nonzero.
   Theorem 1 implies that when $\hat{k}(s)$ is equal to zero for all $s$ such that $\|s\| \geq r$ for some $r > 0$ (the Fourier transform is band-limited), then the set $\mathrm{span}\, G_K(\mathbb{R}^d)$ is too small to have the universal approximation capability. In the next section we show that some of such kernels are positive semidefinite. So they can be used for classification by the SVM algorithm, but they are not suitable for function approximation.
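   The band-limited case can be observed in a small experiment: least-squares fitting of a target whose spectrum lies outside the band leaves the error large for translates of sinc, while fixed-width Gaussian translates drive it down. A rough numerical sketch (an illustration assuming NumPy; the width, grid, and target are arbitrary choices, and the finite grid only mimics the $\mathcal{L}^2$ setting):

    import numpy as np

    x = np.linspace(-5.0, 5.0, 501)
    target = np.cos(4 * np.pi * x)            # frequency 2, far outside the band of sinc
    centers = np.linspace(-5.0, 5.0, 120)

    kernels = {"gauss": lambda t: np.exp(-(t / 0.2)**2),  # fixed width, spectrum a.e. nonzero
               "sinc": np.sinc}                           # spectrum vanishes for |s| > 1/2
    for name, k in kernels.items():
        Phi = k(x[:, None] - centers[None, :])            # columns are translates k(. - y_j)
        w, *_ = np.linalg.lstsq(Phi, target, rcond=None)
        rel_err = np.linalg.norm(Phi @ w - target) / np.linalg.norm(target)
        print(name, round(float(rel_err), 3))             # gauss: small; sinc: close to 1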

4    Positive semidefiniteness and universal approximation property

In this section, we compare a condition on positive semidefiniteness of a convolutional kernel with the condition on the universal approximation property derived in the previous section.
   As a convolutional kernel can be expressed via the inverse Fourier transform as

    $$ K(x, y) = k(x - y) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \hat{k}(s)\, e^{i (x - y) \cdot s}\, ds, $$

it is easy to verify that when $\hat{k}$ is positive or nonnegative, then $K$ defined as $K(x, y) = k(x - y)$ is positive definite or semidefinite, respectively.
   Indeed, to verify that $\sum_{j,l=1}^n a_j a_l K(x_j, x_l) \geq 0$, we express $K$ in terms of the inverse Fourier transform. Thus we get

    $$ \sum_{j,l=1}^n a_j a_l K(x_j, x_l) = \sum_{j,l=1}^n a_j a_l \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \hat{k}(s)\, e^{i (x_j - x_l) \cdot s}\, ds = $$

    $$ \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \left( \sum_{j=1}^n a_j e^{i x_j \cdot s} \right) \left( \sum_{l=1}^n a_l e^{-i x_l \cdot s} \right) \hat{k}(s)\, ds = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \left| \sum_{j=1}^n a_j e^{i x_j \cdot s} \right|^2 \hat{k}(s)\, ds \geq 0. $$

   The following proposition is well known (see, e.g., [4]).

Proposition 2. Let $k \in \mathcal{L}^1(\mathbb{R}^d) \cap \mathcal{L}^2(\mathbb{R}^d)$ be an even function such that $\hat{k}(s) \geq 0$ for all $s \in \mathbb{R}^d$. Then $K(x, y) = k(x - y)$ is positive semidefinite.

   A complete characterization of positive semidefinite bounded continuous kernels follows from the Bochner Theorem.

Theorem 3 (Bochner). A bounded continuous function $k : \mathbb{R}^d \to \mathbb{C}$ is positive semidefinite iff $k$ is the Fourier transform of a nonnegative finite Borel measure $\mu$, i.e.,

    $$ k(x) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{-i x \cdot s}\, \mu(ds). $$

   The Bochner Theorem implies that when the Borel measure $\mu$ has a density, the condition in Proposition 2 is both sufficient and necessary.
   Comparison of the characterization of kernels for which, by Theorem 1, one-hidden-layer kernel networks are universal approximators with the condition on positive semidefiniteness from Proposition 2 shows that there are positive semidefinite kernels which do not generate networks possessing the universal approximation capability, and there also are kernels which are not positive definite but induce networks with the universal approximation property. The first ones are suitable for SVM but not for regression, while the second ones can be used for regression but are not suitable for SVM. In the sequel, we give some examples of such kernels.
   A paradigmatic example of a convolutional kernel is the Gaussian kernel $g_a : \mathbb{R}^d \to \mathbb{R}$ defined for a width $a > 0$ as

    $$ g_a = e^{-a^2 \|\cdot\|^2}. $$

For any fixed width $a$ and any dimension $d$,

    $$ \hat{g}_a = (\sqrt{2}\, a)^{-d}\, e^{-\|\cdot\|^2 / (4 a^2)}. $$

So the Gaussian kernel is positive definite, and the class of Gaussian kernel networks has the universal approximation property.
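   As a worked check (added here; it uses only the classical one-dimensional Gaussian integral $\int_{\mathbb{R}} e^{-i x s} e^{-a^2 x^2} dx = \frac{\sqrt{\pi}}{a} e^{-s^2/(4a^2)}$), the $d$-dimensional transform factorizes into one-variable transforms:

    \[
    \hat{g}_a(s) = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} e^{-i x \cdot s}\, e^{-a^2 \|x\|^2}\, dx
                 = \prod_{j=1}^d \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} e^{-i x_j s_j}\, e^{-a^2 x_j^2}\, dx_j
                 = \prod_{j=1}^d \frac{e^{-s_j^2/(4a^2)}}{\sqrt{2}\, a}
                 = (\sqrt{2}\, a)^{-d}\, e^{-\|s\|^2/(4a^2)},
    \]

which is everywhere positive.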
   The rectangle kernel is defined as

    $$ \mathrm{rect}(x) = 1 \ \text{for} \ x \in (-1/2, 1/2), \qquad \mathrm{rect}(x) = 0 \ \text{otherwise}. $$

Its Fourier transform is the sinc function

    $$ \widehat{\mathrm{rect}}(s) = \mathrm{sinc}(s) = \frac{\sin(\pi s)}{\pi s}. $$

So the Fourier transform of rect is not nonnegative, but its zeros form a discrete set of Lebesgue measure zero. Thus the rectangle kernel is not positive semidefinite but induces a class of networks with the universal approximation property. On the other hand, the Fourier transform of sinc is the rectangle kernel, and thus sinc is positive semidefinite but does not induce networks with the universal approximation property.
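   The failure of positive semidefiniteness of rect can be exhibited on three points; a minimal numerical sketch (an illustration assuming NumPy; the points 0, 0.4, 0.8 are one concrete choice):

    import numpy as np

    rect = lambda t: float(abs(t) < 0.5)
    pts = [0.0, 0.4, 0.8]
    M = np.array([[rect(u - v) for v in pts] for u in pts])  # Gram matrix of rect
    print(np.linalg.eigvalsh(M))  # smallest eigenvalue is 1 - sqrt(2) < 0, so rect is not PSD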
   The Laplace kernel is defined for any $a > 0$ as

    $$ l(x) = e^{-a |x|}. $$

Its Fourier transform is positive, as

    $$ \hat{l}(s) = \frac{2a}{a^2 + (2\pi s)^2}. $$

   The triangle kernel is defined as

    $$ \mathrm{tri}(x) = 1 - |x| \ \text{for} \ x \in (-1, 1), \qquad \mathrm{tri}(x) = 0 \ \text{otherwise}. $$

Its Fourier transform is positive, as

    $$ \widehat{\mathrm{tri}}(s) = \mathrm{sinc}(s)^2 = \left( \frac{\sin(\pi s)}{\pi s} \right)^2. $$

Thus both the Laplace and the triangle kernel are positive definite and induce networks having the universal approximation property.
   The parabolic (Epanechnikov) kernel is defined as

    $$ \mathrm{epi}(x) = \tfrac{3}{4} (1 - x^2) \ \text{for} \ x \in (-1, 1), \qquad \mathrm{epi}(x) = 0 \ \text{otherwise}. $$

Its Fourier transform is

    $$ \widehat{\mathrm{epi}}(s) = \frac{3}{s^3} \left( \sin(s) - s \cos(s) \right) \ \text{for} \ s \neq 0, \qquad \widehat{\mathrm{epi}}(0) = 1. $$

This transform changes sign (for example, it is negative at $s = 2\pi$), but its zeros form a discrete set of Lebesgue measure zero. So the parabolic kernel is not positive semidefinite but induces networks with the universal approximation property.
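   The formula for $\widehat{\mathrm{epi}}$ can be double-checked numerically; a small sketch (an added verification assuming NumPy and SciPy; the test frequencies are arbitrary):

    import numpy as np
    from scipy.integrate import quad

    def epi_hat(s):
        # Numerically integrate (3/4)(1 - x^2) cos(sx) over (-1, 1);
        # the sine part of exp(-isx) vanishes because epi is even.
        val, _ = quad(lambda x: 0.75 * (1 - x**2) * np.cos(s * x), -1.0, 1.0)
        return val

    for s in (1.0, np.pi, 2 * np.pi):
        closed = 3 * (np.sin(s) - s * np.cos(s)) / s**3
        print(s, epi_hat(s), closed)  # quadrature matches the closed form; negative at s = 2*pi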

5    Discussion

We investigated the effect of properties of the Fourier transform of a kernel function on the suitability of the convolutional kernel for function approximation (the universal approximation property) and for the maximal margin classification algorithm (positive semidefiniteness). We showed that these properties depend on the way the Fourier transform behaves as frequencies increase to infinity. For the universal approximation property, the Fourier transform can be negative, but it cannot be zero on any set of frequencies of nonzero Lebesgue measure. On the other hand, functions with nonnegative Fourier transforms are positive semidefinite even if they are compactly supported. We illustrated our results by the paradigmatic example of the multivariable Gaussian kernel and by some one-dimensional examples. The multivariable Gaussian is a product of one-variable functions, and thus its multivariable Fourier transform can be computed using transforms of one-variable Gaussians. Fourier transforms of other radial multivariable kernels are more complicated; their expressions include Bessel functions and the Hankel transform. Investigation of properties of Fourier transforms of multivariable radial convolutional kernels is a subject of our future work.

Acknowledgments. This work was partially supported by the grant GA ČR 15-18108S and institutional support of the Institute of Computer Science RVO 67985807.

References

 [1] D. S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321–355, 1988.
 [2] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
 [3] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of AMS, 39:1–49, 2002.
 [4] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge, 2007.
 [5] A. Friedman. Modern Analysis. Dover, New York, 1982.
 [6] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10:1455–1480 (AI memo 1606), 1998.
 [7] F. Girosi and T. Poggio. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247(4945):978–982, 1990.
 [8] R. Gribonval and P. Vandergheynst. On the exponential convergence of matching pursuits in quasi-incoherent dictionaries. IEEE Transactions on Information Theory, 52:255–261, 2006.
 [9] P. C. Kainen, V. Kůrková, and M. Sanguineti. Complexity of Gaussian radial basis networks approximating smooth functions. Journal of Complexity, 25:63–74, 2009.
[10] P. C. Kainen, V. Kůrková, and M. Sanguineti. Dependence of computational models on input dimension: Tractability of approximation and optimization tasks. IEEE Transactions on Information Theory, 58:1203–1214, 2012.
[11] V. Kůrková. Neural network learning as an inverse problem. Logic Journal of IGPL, 13:551–559, 2005.
[12] V. Kůrková and P. C. Kainen. Comparing fixed and variable-width Gaussian networks. Neural Networks, 57:23–28, 2014.
[13] V. Kůrková and M. Sanguineti. Model complexities of shallow networks representing highly varying functions. Neurocomputing, 171:598–604, 2016.
[14] H. N. Mhaskar. Versatile Gaussian networks. In Proceedings of the IEEE Workshop on Nonlinear Image Processing, pages 70–73, 1995.
[15] M. Minsky and S. Papert. Perceptrons. MIT Press, 1969.
[16] J. Park and I. Sandberg. Universal approximation using radial-basis-function networks. Neural Computation, 3:246–257, 1991.
[17] J. Park and I. Sandberg. Approximation and radial basis function networks. Neural Computation, 5:305–316, 1993.
[18] A. Pietsch. Eigenvalues and s-Numbers. Cambridge University Press, Cambridge, 1987.
[19] T. Poggio and S. Smale. The mathematics of learning: dealing with data. Notices of the AMS, 50:537–544, 2003.
[20] W. Rudin. Functional Analysis. McGraw-Hill, 1991.
[21] B. Schölkopf and A. J. Smola. Learning with Kernels – Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge, 2002.