<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multivariable Approximation by Convolutional Kernel Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Věra Kůrková</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, Academy of Sciences of the Czech Republic</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>1649</volume>
      <fpage>118</fpage>
      <lpage>122</lpage>
      <abstract>
        <p>Computational units induced by convolutional kernels, together with biologically inspired perceptrons, belong to the most widespread types of units used in neurocomputing. Radial convolutional kernels with varying widths form RBF (radial-basis-function) networks, while such kernels with fixed widths are used in the SVM (support vector machine) algorithm. We investigate the suitability of various convolutional kernel units for function approximation. We show that properties of Fourier transforms of convolutional kernels determine whether sets of input-output functions of networks with kernel units are large enough to be universal approximators. We compare these properties with conditions guaranteeing positive semidefiniteness of convolutional kernels.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Computational units induced by radial and convolutional
kernels together with perceptrons belong to the most
widespread types of units used in neurocomputing. In
contrast to biologically inspired perceptrons [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
localized radial units [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] were introduced merely due to their
good mathematical properties. Radial-basis-function units
(RBF) computing spherical waves were followed by
kernel units [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Kernel units in their most general form include
all types of computational units that are functions of
two vector variables: an input vector and a parameter
vector. However, the term kernel unit is often reserved
for units computing symmetric positive semidefinite
functions of two variables. Networks with these units have
been widely used for classification with maximal margin
by the support vector machine algorithm (SVM) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as
well as for regression [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>
        Other important kernel units are units induced by
convolutional kernels in the form of translations of functions of
one vector variable. Isotropic RBF units can be viewed as
non-symmetric kernel units obtained from convolutional
radial kernels by adding a width parameter. Variability
of widths is a strong property: it makes it possible to apply
arguments based on classical results on approximation of
functions by sequences of their convolutions with scaled bump
functions to prove universal approximation capabilities of
many types of RBF networks [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ]. Moreover, some
estimates of rates of approximation by RBF networks
exploit variability of widths [
        <xref ref-type="bibr" rid="ref10 ref13 ref9">9, 10, 13</xref>
        ].
      </p>
      <p>
        On the other hand, symmetric positive semidefinite
kernels (which include some classes of RBFs with fixed
width parameters) benefit from geometrical properties of
reproducing kernel Hilbert spaces (RKHS) generated by
these kernels. These properties allow maximal margin
classification to be extended from finite-dimensional
spaces to sets of data that are not linearly separable,
by embedding them into infinite-dimensional spaces [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <sec id="sec-1-1">
        <title>Moreover, symmetric positive semidefinite kernels gener</title>
        <p>
          ate stabilizers in the form of norms on RKHSs suitable for
modeling generalization in terms of regularization [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and
enable characterizations of theoretically optimal solutions
of learning tasks [
          <xref ref-type="bibr" rid="ref11 ref19 ref3">3, 19, 11</xref>
          ].
        </p>
        <p>
          Arguments proving the universal approximation
property of RBF networks using sequences of scaled kernels
might suggest that variability of widths is necessary for the
universal approximation. However, for the special case of
the Gaussian kernel, the universal approximation property
holds even when the width is fixed and only the centers are
varying [
          <xref ref-type="bibr" rid="ref12 ref14">14, 12</xref>
          ].
        </p>
        <p>On the other hand, it is easy to find some examples
of positive semidefinite kernels such that sets of
input-output functions of shallow networks with units generated
by these kernels are too small to be universal
approximators. For example, networks with product kernel units of
the form K(x, y) = k(x)k(y) generate as input-output
functions only scalar multiples ck(x) of the function k.</p>
        <p>In this paper, we investigate capabilities of networks
with one hidden layer of convolutional kernel units to
approximate multivariable functions. We show that a crucial
property influencing whether sets of input-output
functions of convolutional kernel networks are large enough
to be universal approximators is the behavior of the Fourier
transform of the one-variable function generating the
convolutional kernel. We give a necessary and sufficient
condition for universal approximation by kernel networks in
terms of the Fourier transforms of kernels. We compare
this condition with properties of kernels guaranteeing their
positive definiteness. We illustrate our results by
examples of some common kernels such as Gaussian, Laplace,
parabolic, rectangle, and triangle.</p>
      <p>The paper is organized as follows. In section 2,
notations and basic concepts on one-hidden-layer networks
and kernel units are introduced. In section 3, a
necessary and sufficient condition on a convolutional kernel that
guarantees that networks with units induced by the kernel
have the universal approximation property is given. In section 4
this condition is compared with a condition guaranteeing
that a kernel is positive semidefinite, and some examples
of kernels satisfying both or only one of these conditions are
given. Section 5 is a brief discussion.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Preliminaries</title>
      <p>
        Radial-basis-function networks as well as kernel models
belong to the class of one-hidden-layer networks with one
linear output unit. Such networks compute input-output
functions from sets of the form
span G = { ∑_{i=1}^{n} wi gi | wi ∈ R, gi ∈ G, n ∈ N+ },
where the set G is called a dictionary [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and R, N+
denote the sets of real numbers and positive integers, respectively.
      </p>
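      <p>To make the notion of an input-output function in span G concrete, the following
sketch evaluates a network ∑ wi gi whose units gi are induced by a Gaussian
convolutional kernel. It is only an illustration; the function names and the choice
of the Gaussian dictionary are ours, not part of the paper.</p>
      <preformat>
import numpy as np

def gaussian_unit(x, y, a=1.0):
    # kernel unit induced by the convolutional kernel K(x, y) = exp(-a^2 ||x - y||^2)
    return np.exp(-a**2 * np.sum((np.asarray(x) - np.asarray(y))**2))

def network_output(x, weights, centers, a=1.0):
    # element of span G: f(x) = sum_i w_i K(x, y_i), a one-hidden-layer network
    return sum(w * gaussian_unit(x, y, a) for w, y in zip(weights, centers))

# a network with three units in dimension d = 2
centers = [(0.0, 0.0), (1.0, -1.0), (-0.5, 2.0)]
weights = [0.7, -1.2, 0.4]
print(network_output((0.3, 0.1), weights, centers))
      </preformat>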
      <sec id="sec-2-1">
        <p>Typically, dictionaries are parameterized families of functions
modeling computational units, i.e., they are of the form
GK (X ,Y ) = {K(·, y) : X → R | y ∈ Y },
where K : X × Y → R is a function of two variables, an
input vector x ∈ X ⊆ Rd and a parameter y ∈ Y ⊆ Rs. Such
functions of two variables are called kernels. This term,
derived from the German term “kern”, has been used since
1904 in the theory of integral operators [18, p.291].</p>
      </sec>
      <sec id="sec-2-2">
        <p>An important class of kernels are convolutional kernels,
which are obtained by translations of one-variable
functions k : Rd → R as</p>
        <p>K(x, y) = k(x − y).</p>
        <p>Radial convolutional kernels are convolutional kernels
obtained as translations of radial functions, i.e., functions of
the form</p>
        <p>k(x) = k1(‖x‖),
where k1 : R+ → R.</p>
      </sec>
      <sec id="sec-2-3">
        <p>The convolution is an operation defined as
f ∗ g(x) = ∫_{Rd} f(y) g(x − y) dy
[20, p.170].</p>
        <p>Recall that a kernel K : X × X → R is called positive
semidefinite if for any positive integer m, any x1, . . . , xm ∈
X and any a1, . . . , am ∈ R,
∑_{i=1}^{m} ∑_{j=1}^{m} ai aj K(xi, xj) ≥ 0.</p>
        <p>Similarly, a function of one variable k : Rd → R is called
positive semidefinite if for any positive integer m, any
x1, . . . , xm ∈ Rd and any a1, . . . , am ∈ R,
∑_{i=1}^{m} ∑_{j=1}^{m} ai aj k(xi − xj) ≥ 0.</p>
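        <p>As a sanity check of this definition, the following sketch (our own
illustration, not from the paper) evaluates the quadratic form
∑_{i,j} ai aj k(xi − xj) for the one-dimensional Gaussian k(x) = e^{−x^2}
at random points and coefficients; every value comes out nonnegative,
in agreement with the Gaussian being positive semidefinite.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)

def quad_form(k, xs, a):
    # sum_i sum_j a_i a_j k(x_i - x_j), the quadratic form from the definition above
    return sum(a[i] * a[j] * k(xs[i] - xs[j])
               for i in range(len(xs)) for j in range(len(xs)))

gauss = lambda t: np.exp(-t**2)     # k(x) = exp(-x^2)
for _ in range(5):
    xs = rng.normal(size=8)         # m = 8 points x_1, ..., x_m in R
    a = rng.normal(size=8)          # coefficients a_1, ..., a_m
    print(quad_form(gauss, xs, a))  # nonnegative for every sample
        </preformat>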
      </sec>
      <sec id="sec-2-4">
        <p>For symmetric positive semidefinite kernels K, the sets
span GK (X ) of input-output functions of networks with
units induced by the kernel K are contained in Hilbert
spaces defined by these kernels. These spaces are called
reproducing kernel Hilbert spaces (RKHS) and denoted
HK (X ). They are formed by the functions from
span GK (X ) = span{Kx | x ∈ X }, where Kx(·) = K(x, ·),
together with the limits of their Cauchy sequences in the norm
‖·‖K. The norm ‖·‖K is induced by the inner product
⟨·, ·⟩K, which is defined on GK (X ) = {Kx | x ∈ X } as
⟨Kx, Ky⟩K = K(x, y). So span GK (X ) ⊂ HK (X ).</p>
      </sec>
      <sec id="sec-2-5">
        <title>3 Universal approximation capability of convolutional kernel networks</title>
        <p>In this section, we investigate conditions guaranteeing that
sets of input-output functions of convolutional kernel
networks are large enough to be universal approximators.</p>
        <p>The universal approximation property is formally
defined as density in a normed linear space. A class of
one-hidden-layer networks with units from a dictionary G is
said to have the universal approximation property in a
normed linear space (X , ‖·‖X ) if it is dense in this space,
i.e., clX span G = X , where span G denotes the linear
span of G and clX denotes the closure with respect to the
topology induced by the norm ‖·‖X . More precisely, for
every f ∈ X and every ε &gt; 0 there exist a positive integer
n, g1, . . . , gn ∈ G, and w1, . . . , wn ∈ R such that
‖ f − ∑_{i=1}^{n} wi gi ‖X &lt; ε.</p>
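        <p>The sketch below illustrates this definition numerically (it is our own
example; the target function, the interval, and the fixed-width Gaussian
dictionary are arbitrary choices). It fits a target f by networks
∑_{i=1}^{n} wi k(· − yi) with a growing number n of translated units,
solving for the weights by least squares; the discretized L 2 error
decreases as n grows, as density would suggest.</p>
        <preformat>
import numpy as np

# target f on a grid discretizing the interval X = [-3, 3]
x = np.linspace(-3.0, 3.0, 400)
f = np.sin(2.0 * x) * np.exp(-0.2 * x**2)

k = lambda t: np.exp(-t**2)              # fixed-width Gaussian kernel k(x) = exp(-x^2)

for n in (2, 5, 10, 20, 40):
    centers = np.linspace(-3.0, 3.0, n)  # translation parameters y_1, ..., y_n
    G = k(x[:, None] - centers[None, :]) # columns are the units k(. - y_i) on the grid
    w, *_ = np.linalg.lstsq(G, f, rcond=None)
    err = np.sqrt(np.mean((f - G @ w)**2))
    print(n, err)                        # discretized L2 error shrinks as n grows
        </preformat>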
        <p>Function spaces where the universal approximation
property has been of interest are the spaces (C(X ), ‖·‖sup) of
continuous functions on subsets X of Rd (typically
compact) with the supremum norm
‖ f ‖sup = sup_{x∈X} | f (x)|,
and the spaces (L p(Rd ), ‖·‖L p ) of functions on Rd with
finite ∫_{Rd} | f (y)|^p dy and the norm
‖ f ‖L p = ( ∫_{Rd} | f (y)|^p dy )^{1/p}.</p>
        <p>Recall that the d-dimensional Fourier transform is an
isometry on L 2(Rd ), defined on L 2(Rd ) ∩ L 1(Rd ) as
fˆ(s) = (2π)^{−d/2} ∫_{Rd} f (x) e^{−i x·s} dx
and extended to L 2(Rd ) [20, p.183].</p>
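        <p>As a quick numerical check of this definition in dimension d = 1 (a sketch of
ours, not part of the paper; the plain Riemann-sum discretization is an arbitrary
choice), one can approximate the transform of a Gaussian and compare it with its
closed form under this normalization.</p>
        <preformat>
import numpy as np

a = 1.0
x = np.linspace(-20.0, 20.0, 4001)       # integration grid
dx = x[1] - x[0]
f = np.exp(-a**2 * x**2)                 # f(x) = exp(-a^2 x^2)

def fourier(s):
    # Riemann-sum approximation of (2 pi)^(-1/2) * integral of f(x) exp(-i x s) dx
    return dx / np.sqrt(2.0 * np.pi) * np.sum(f * np.exp(-1j * x * s))

for s in (0.0, 0.5, 1.0, 2.0):
    exact = np.exp(-s**2 / (4.0 * a**2)) / (a * np.sqrt(2.0))
    print(s, fourier(s).real, exact)     # the two values agree closely
        </preformat>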
      </sec>
      <sec id="sec-2-6">
        <p>Note that the Fourier transform of an even function is
real and the Fourier transform of a radial function is radial.
If k ∈ L 1(Rd ), then kˆ is uniformly continuous and with
increasing frequencies converges to zero, i.e.,
lim_{‖s‖→∞} kˆ(s) = 0.</p>
        <p>The following theorem gives a necessary and sufficient
condition on a convolutional kernel that guarantees that the
class of input-output functions computable by networks
with units induced by the kernel can approximate
arbitrarily well all functions in L 2(Rd ). The condition is
formulated in terms of the size of the set of frequencies for which
the Fourier transform is equal to zero. By λ is denoted the
Lebesgue measure.</p>
      </sec>
      <sec id="sec-2-7">
        <p>Theorem 1. Let d be a positive integer, k ∈ L 1(Rd ) ∩
L 2(Rd ) be even, K : Rd × Rd → R be defined as K(x, y) =
k(x − y), and X ⊆ Rd be Lebesgue measurable. Then
span GK (X ) is dense in (L 2(X ), ‖·‖L 2 ) if and only if
λ ({s ∈ Rd | kˆ(s) = 0}) = 0.</p>
        <sec id="sec-2-7-1">
          <p>Proof. First, we prove the necessity. To prove it by
contradiction, assume that λ (S) ≠ 0, where
S = {s ∈ Rd | kˆ(s) = 0}. Take any function
f ∈ L 2(Rd ) ∩ L 1(Rd ) with a positive Fourier transform
(for example, f can be the Gaussian). Let ε &gt; 0 be such
that
ε &lt; ∫_S fˆ(s)^2 ds.</p>
          <p>Assume that there exist n, wi ∈ R, and yi ∈ Rd such that
‖ f − ∑_{i=1}^{n} wi k(· − yi)‖^2_{L 2} &lt; ε.
Then by the Plancherel Theorem [20, p.188],
‖ fˆ − ∑_{i=1}^{n} wi (k(· − yi))ˆ ‖^2_{L 2} = ‖ fˆ − ∑_{i=1}^{n} w¯i kˆ ‖^2_{L 2},
where w¯i(s) = wi e^{−i yi·s}. Hence
‖ fˆ − ∑_{i=1}^{n} w¯i kˆ ‖^2_{L 2} ≥ ∫_S fˆ(s)^2 ds &gt; ε,
which is a contradiction.</p>
          <p>To prove the sufficiency, we first assume that X = Rd .</p>
        </sec>
      </sec>
      <sec id="sec-2-8">
        <p>We prove it by contradiction, so we suppose that
clL 2 span GK (Rd ) = clL 2 span {K(·, y) | y ∈ Rd } ≠ L 2(Rd ).</p>
      </sec>
      <sec id="sec-2-9">
        <p>Then by the Hahn-Banach Theorem [20, p. 60] there exists
a bounded linear functional l on L 2(Rd ) such that
for all f ∈ clL 2 span GK (Rd ), l( f ) = 0, and for some
f0 ∈ L 2(Rd ) \ clL 2 span GK (Rd ), l( f0) = 1. By the Riesz
Representation Theorem [5, p.206], l can be expressed as an
inner product with some h ∈ L 2(Rd ).</p>
        <p>As k is even, for all y ∈ Rd ,
⟨h, K(·, y)⟩ = ∫_{Rd} h(x) k(x − y) dx = ∫_{Rd} h(x) k(y − x) dx = h ∗ k(y) = 0.</p>
        <p>By the Young Inequality for convolutions, h ∗ k ∈ L 2(Rd ),
and so by the Plancherel Theorem [20, p.188],
‖(h ∗ k)ˆ‖_{L 2} = 0.
As (h ∗ k)ˆ = (2π)^{d/2} hˆ kˆ [20, p.183], we have ‖hˆ kˆ‖_{L 2} = 0 and so
∫_{Rd} (hˆ(s) kˆ(s))^2 ds = 0.</p>
        <p>As the set
S = {s ∈ Rd | kˆ(s) = 0}
has Lebesgue measure zero, we have
∫_{Rd \ S} hˆ(s)^2 kˆ(s)^2 ds = ∫_{Rd} hˆ(s)^2 kˆ(s)^2 ds = 0.</p>
        <p>As for all s ∈ Rd \ S, kˆ(s)^2 &gt; 0, we have ‖hˆ‖_{L 2} = 0. So
‖h‖_{L 2} = 0, and hence by the Cauchy-Schwarz Inequality
we get
1 = l( f0) = ∫_{Rd} f0(y) h(y) dy ≤ ‖ f0‖_{L 2} ‖h‖_{L 2} = 0,
which is a contradiction.</p>
        <p>Extending a function f from L 2(X ) to f¯ from L 2(Rd )
by setting its values equal to zero outside of X , and
restricting approximations of f¯ by functions from
span GK (Rd ) to X , we get the statement for any Lebesgue
measurable subset X of Rd . ✷</p>
      </sec>
      <sec id="sec-2-11">
        <title>Theorem 1 shows that sets of input-output functions of</title>
        <p>convolutional kernel networks are large enough to
approximate arbitrarily well all L 2-functions if and only if the</p>
      </sec>
      <sec id="sec-2-12">
        <title>Fourier transform of the function k is almost everywhere non-zero.</title>
        <p>Theorem 1 implies that when kˆ(s) is equal to zero for all
s such that ksk ≥ r for some r &gt; 0 (the Fourier transform
is band-limited), then the set span GK (Rd ) is too small to
have the universal approximation capability. In the next
section we show, that some of such kernels are positive
semidefinite. So they can be used for classification by the</p>
      </sec>
      <sec id="sec-2-13">
        <title>SVM algorithm but they are not suitable for function approximation.</title>
        <p>✷</p>
        <p>Positive semidefinitness and universal
approximation property</p>
      </sec>
      <sec id="sec-2-14">
        <title>In this section, we compare a condition on positive semidefinitness of a convolutional kernel with the condition on the universal approximation property derived in the previous section.</title>
      </sec>
      <sec id="sec-2-15">
        <p>As the inverse Fourier transform of a convolutional kernel can be expressed as
K(x, y) = k(x − y) = (2π)^{−d/2} ∫_{Rd} kˆ(s) e^{i(x−y)·s} ds,
it is easy to verify that when kˆ is positive or non-negative,
then K defined as K(x, y) = k(x − y) is positive definite or
positive semidefinite, respectively.</p>
        <p>Indeed, to verify that ∑_{j,l=1}^{n} aj al K(xj, xl ) ≥ 0, we
express K in terms of the inverse Fourier transform. Thus
we get
∑_{j,l=1}^{n} aj al K(xj, xl ) = ∑_{j,l=1}^{n} aj al (2π)^{−d/2} ∫_{Rd} kˆ(s) e^{i(xj−xl)·s} ds
= (2π)^{−d/2} ∫_{Rd} kˆ(s) | ∑_{j=1}^{n} aj e^{i xj·s} |^2 ds ≥ 0.</p>
        <p>Proposition 2. Let k ∈ L 1(Rd ) ∩ L 2(Rd ) be an even
function such that kˆ(s) ≥ 0 for all s ∈ Rd . Then K(x, y) =
k(x − y) is positive semidefinite.</p>
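        <p>A small numerical illustration of Proposition 2 (our own sketch, not part of
the paper): Gram matrices of the Laplace kernel, whose Fourier transform is
nonnegative, have only nonnegative eigenvalues, whereas the rectangle kernel
discussed below already produces a negative eigenvalue on three points.</p>
        <preformat>
import numpy as np

def gram(k, xs):
    # Gram matrix G_ij = k(x_i - x_j) of a convolutional kernel on points x_1, ..., x_m
    xs = np.asarray(xs, dtype=float)
    return k(xs[:, None] - xs[None, :])

laplace = lambda t: np.exp(-np.abs(t))               # Fourier transform nonnegative
rect = lambda t: np.heaviside(0.5 - np.abs(t), 0.0)  # indicator of (-1/2, 1/2)

pts = [0.0, 0.3, 0.6]
print(np.linalg.eigvalsh(gram(laplace, pts)))  # all eigenvalues nonnegative
print(np.linalg.eigvalsh(gram(rect, pts)))     # one eigenvalue is 1 - sqrt(2), negative
        </preformat>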
      </sec>
      <sec id="sec-2-16">
        <p>A complete characterization of positive semidefinite
bounded continuous kernels follows from the Bochner Theorem.</p>
      </sec>
      <sec id="sec-2-17">
        <p>Theorem 3 (Bochner). A bounded continuous function
k : Rd → C is positive semidefinite iff k is the Fourier
transform of a nonnegative finite Borel measure μ, i.e.,
k(x) = (2π)^{−d/2} ∫_{Rd} e^{−i x·s} dμ(s).</p>
      </sec>
      <sec id="sec-2-18">
        <p>The Bochner Theorem implies that when the Borel measure μ
has a density, then the condition in
Proposition 2 is both sufficient and necessary.</p>
      </sec>
      <sec id="sec-2-19">
        <p>Comparison of the characterization of kernels for which,
by Theorem 1, one-hidden-layer kernel networks are
universal approximators with the condition on positive
semidefiniteness from Proposition 2 shows that there are
positive semidefinite kernels which do not generate
networks possessing the universal approximation capability,
and there also are kernels which are not positive
semidefinite but induce networks with the universal approximation
property. The first ones are suitable for SVM but not for
regression, while the second ones can be used for
regression but are not suitable for SVM. In the sequel, we give
some examples of such kernels.</p>
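        <p>The examples below can also be checked mechanically. The following sketch (ours,
not from the paper) evaluates the closed-form Fourier transforms of the
one-dimensional example kernels, as they are stated in the text (their
normalizations differ, which affects neither signs nor zero sets), and reports
whether each transform is nonnegative (Proposition 2) and on what fraction of a
frequency grid it vanishes (a proxy for the zero-set condition of Theorem 1).</p>
        <preformat>
import numpy as np

s = np.linspace(-8.0, 8.0, 100001)
s = s[s != 0.0]                          # avoid the removable singularities at 0

# closed-form Fourier transforms of the one-dimensional example kernels
transforms = {
    "gaussian":  np.exp(-s**2 / 4.0) / np.sqrt(2.0),
    "laplace":   2.0 / (1.0 + (2.0 * np.pi * s)**2),
    "rect":      np.sin(np.pi * s) / (np.pi * s),          # the sinc function
    "triangle":  (np.sin(np.pi * s) / (np.pi * s))**2,     # sinc squared
    "parabolic": 3.0 * (np.sin(s) - s * np.cos(s)) / s**3,
    "sinc":      np.heaviside(0.5 - np.abs(s), 0.0),       # rect: zero for |s| above 1/2
}

for name, ft in transforms.items():
    nonneg = bool(np.min(ft) &gt;= -1e-12)            # positive semidefinite (Proposition 2)
    vanish = float(np.mean(np.abs(ft) &lt; 1e-12))    # universal approximation fails if not ~0
    print(name, nonneg, round(vanish, 2))
        </preformat>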
      </sec>
      <sec id="sec-2-20">
        <p>A paradigmatic example of a convolutional kernel is the
Gaussian kernel ga : Rd → R defined for a width a &gt; 0 as
ga = e^{−a^2 ‖·‖^2}.
For any fixed width a and any dimension d,
gˆa = (√2 a)^{−d} e^{−‖·‖^2/(4a^2)}.
So the Gaussian kernel is positive definite and the class of
Gaussian kernel networks has the universal approximation property.</p>
      </sec>
      <sec id="sec-2-23">
        <p>The rectangle kernel is defined as
rect(x) = 1 for x ∈ (−1/2, 1/2),
otherwise rect(x) = 0.</p>
      </sec>
      <sec id="sec-2-24">
        <p>Its Fourier transform is the sinc function,
sinc(s) = sin(π s)/(π s).
So the Fourier transform of rect is not non-negative, but its
zeros form a discrete set of Lebesgue measure zero.
Thus the rectangle kernel is not positive semidefinite, but it
induces a class of networks with the universal
approximation property. On the other hand, the Fourier transform of
sinc is the rectangle kernel, and thus sinc is positive
semidefinite but does not induce networks with the universal
approximation property.</p>
      </sec>
      <sec id="sec-2-27">
        <p>The Laplace kernel is defined for any a &gt; 0 as
l(x) = e^{−a|x|}.
Its Fourier transform is positive:
lˆ(s) = 2a / (a^2 + (2πs)^2).</p>
        <p>The triangle kernel is defined as
tri(x) = 1 + x for x ∈ (−1, 0),
tri(x) = 1 − x for x ∈ (0, 1),
otherwise tri(x) = 0.</p>
      </sec>
      <sec id="sec-2-28">
        <p>Its Fourier transform is also positive, as it equals
sinc(s)^2 = (sin(π s)/(π s))^2.</p>
      </sec>
      <sec id="sec-2-29">
        <p>Thus both the Laplace and the triangle kernel are positive
definite and induce networks having the universal approximation property.</p>
        <p>The parabolic (Epanechnikov) kernel is defined as
epi(x) = (3/4)(1 − x^2) for x ∈ (−1, 1),
otherwise epi(x) = 0.</p>
      </sec>
      <sec id="sec-2-30">
        <p>Its Fourier transform equals
(3/s^3)(sin(s) − s cos(s)) for s ≠ 0,
and 1 for s = 0.</p>
      </sec>
      <sec id="sec-2-31">
        <p>So the parabolic kernel is not positive semidefinite, but it
induces networks with the universal approximation property.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5 Discussion</title>
      <p>We investigated the effect of properties of the Fourier
transform of a kernel function on the suitability of the
convolutional kernel for function approximation (universal
approximation property) and for the maximal margin
classification algorithm (positive semidefiniteness). We showed
that these properties depend on the behavior of the Fourier
transform as the frequencies increase to
infinity. For the universal approximation property, the Fourier
transform can be negative but cannot be zero on any set of
frequencies of non-zero Lebesgue measure. On the other
hand, functions with non-negative Fourier transforms are
positive semidefinite even if they are compactly supported.</p>
      <sec id="sec-3-1">
        <p>We illustrated our results by the paradigmatic example
of the multivariable Gaussian kernel and by some
one-dimensional examples. The multivariable Gaussian is a
product of one-variable functions, and thus its multivariable
Fourier transform can be computed using transforms of
one-variable Gaussians. Fourier transforms of other radial
multivariable kernels are more complicated; their
expressions include Bessel functions and the Hankel transform.</p>
      </sec>
      <sec id="sec-3-3">
        <p>Investigation of properties of Fourier transforms of
multivariable radial convolutional kernels is a subject of our future work.</p>
        <sec id="sec-3-3-1">
          <title>Acknowledgments</title>
          <p>This work was partially supported by
the grant GA ČR 15-18108S and by the institutional support of the
Institute of Computer Science RVO 67985807.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Broomhead</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <article-title>Error bounds for approximation with neural networks</article-title>
          .
          <source>Complex Systems</source>
          ,
          <volume>2</volume>
          :
          <fpage>321</fpage>
          -
          <lpage>355</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          and
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <article-title>Support vector networks</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>20</volume>
          :
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cucker</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Smale</surname>
          </string-name>
          .
          <article-title>On the mathematical foundations of learning</article-title>
          .
          <source>Bulletin of AMS</source>
          ,
          <volume>39</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cucker</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <source>Learning Theory: An Approximation Theory Viewpoint</source>
          . Cambridge University Press, Cambridge,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <source>Modern Analysis</source>
          . Dover, New York,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Girosi</surname>
          </string-name>
          .
          <article-title>An equivalence between sparse approximation and support vector machines</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>10</volume>
          :
          <fpage>1455</fpage>
          -
          <lpage>1480</lpage>
          <source>(AI memo 1606)</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Girosi</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Poggio</surname>
          </string-name>
          .
          <article-title>Regularization algorithms for learning that are equivalent to multilayer networks</article-title>
          .
          <source>Science</source>
          ,
          <volume>247</volume>
          (
          <issue>4945</issue>
          ):
          <fpage>978</fpage>
          -
          <lpage>982</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gribonval</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Vandergheynst</surname>
          </string-name>
          .
          <article-title>On the exponential convergence of matching pursuits in quasi-incoherent dictionaries</article-title>
          .
          <source>IEEE Trans. on Information Theory</source>
          ,
          <volume>52</volume>
          :
          <fpage>255</fpage>
          -
          <lpage>261</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Kainen</surname>
          </string-name>
          , V. Kůrková, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Complexity of Gaussian radial basis networks approximating smooth functions</article-title>
          .
          <source>J. of Complexity</source>
          ,
          <volume>25</volume>
          :
          <fpage>63</fpage>
          -
          <lpage>74</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Kainen</surname>
          </string-name>
          , V. Kůrková, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Dependence of computational models on input dimension: Tractability of approximation and optimization tasks</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          ,
          <volume>58</volume>
          :
          <fpage>1203</fpage>
          -
          <lpage>1214</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          .
          <article-title>Neural network learning as an inverse problem</article-title>
          .
          <source>Logic Journal of IGPL</source>
          ,
          <volume>13</volume>
          :
          <fpage>551</fpage>
          -
          <lpage>559</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Kainen</surname>
          </string-name>
          .
          <article-title>Comparing fixed and variable-width Gaussian networks</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>57</volume>
          :
          <fpage>23</fpage>
          -
          <lpage>28</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Model complexities of shallow networks representing highly varying functions</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>171</volume>
          :
          <fpage>598</fpage>
          -
          <lpage>604</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H. N.</given-names>
            <surname>Mhaskar</surname>
          </string-name>
          .
          <article-title>Versatile Gaussian networks</article-title>
          .
          <source>In Proceedings of IEEE Workshop of Nonlinear Image Processing</source>
          , pages
          <fpage>70</fpage>
          -
          <lpage>73</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Minsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Papert</surname>
          </string-name>
          . Perceptrons. MIT Press,
          <year>1969</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Sandberg</surname>
          </string-name>
          .
          <article-title>Universal approximation using radial-basis-function networks</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>3</volume>
          :
          <fpage>246</fpage>
          -
          <lpage>257</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Sandberg</surname>
          </string-name>
          .
          <article-title>Approximation and radial basis function networks</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>5</volume>
          :
          <fpage>305</fpage>
          -
          <lpage>316</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pietsch</surname>
          </string-name>
          .
          <article-title>Eigenvalues and s-Numbers</article-title>
          . Cambridge University Press, Cambridge,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Poggio</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Smale</surname>
          </string-name>
          .
          <article-title>The mathematics of learning: dealing with data</article-title>
          .
          <source>Notices of AMS</source>
          ,
          <volume>50</volume>
          :
          <fpage>537</fpage>
          -
          <lpage>544</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>W.</given-names>
            <surname>Rudin</surname>
          </string-name>
          .
          <source>Functional Analysis</source>
          . McGraw-Hill
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          .
          <article-title>Learning with Kernels - Support Vector Machines, Regularization, Optimization and Beyond</article-title>
          . MIT Press, Cambridge,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>