<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Probabilistic Bounds on Complexity of Networks Computing Binary Classification Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Věra Kůrková</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcello Sanguineti</string-name>
          <email>marcello.sanguineti@unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIBRIS, University of Genoa</institution>
          ,
<addr-line>Genoa, Italy</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science, Czech Academy of Sciences</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>2203</volume>
      <fpage>86</fpage>
      <lpage>91</lpage>
      <abstract>
<p>Complexity of feedforward networks computing binary classification tasks is investigated. To deal with the unmanageably large number of these tasks on domains of even moderate size, a probabilistic model characterizing the relevance of the classification tasks is introduced. Approximate measures of sparsity of networks computing randomly chosen functions are studied in terms of variational norms tailored to dictionaries of computational units. Probabilistic lower bounds on these norms are derived using the Chernoff-Hoeffding Bound on sums of independent random variables, which need not be identically distributed. Consequences of the probabilistic results for the choice of dictionaries of computational units are discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <sec id="sec-1-1">
        <title>It has long been known that one-hidden-layer (shallow)</title>
        <p>
          networks with computational units of many common types
can exactly compute any function on a finite domain [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>In particular, they can perform any binary-classification</title>
        <p>task. Proofs of theorems on the universal
approximation and representation properties of feedforward networks
guarantee their power to express wide classes of functions,
but do not deal with the efficiency of such representations.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Typically, such arguments assume that the number of units</title>
        <p>is unbounded or is as large as the size of the domain of
functions to be computed. For large domains,
implementations of such networks might not be feasible.</p>
      </sec>
      <sec id="sec-1-4">
        <title>A proper choice of a network architecture and a type</title>
        <p>
          of its units can, in some cases, considerably reduce
network complexity. For example, a classification of points
in the d-dimensional Boolean cube {0, 1}d according to
the parity of the numbers of 1’s cannot be computed by a
Gaussian SVM network with less than 2d−1 support
vectors [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] (i.e., it cannot be computed by a shallow network
with less than 2d−1 Gaussian SVM units). On the other
hand, it is easy to show that the parity function (as well as
any generalized parity, the set of which forms the Fourier
basis) can be computed by a shallow network with merely
d + 1 Heaviside perceptrons [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
      </sec>
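      <p>The parity representation with d + 1 Heaviside units can be written
down explicitly via a telescoping identity for (−1)^s, where s is the
number of 1’s in the input. The following sketch is our illustration, not
code from the paper; the constant term is realized as a Heaviside unit
with zero input weights and threshold 0, giving d + 1 units in total:</p>
      <preformat># Sketch: parity on {0,1}^d as a shallow network with d + 1 Heaviside
# units, using the telescoping identity
#   (-1)^s = 1 + sum_{k=1..d} 2*(-1)^k * H(s - k),
# where s is the number of 1's and H(t) = 1 for t at least 0, else 0.
import itertools

def heaviside(t):
    return 1.0 if t >= 0 else 0.0

d = 5
for x in itertools.product([0, 1], repeat=d):
    s = sum(x)
    # unit k fires iff sum_i x_i is at least k; its output weight is 2*(-1)^k;
    # the leading 1.0 is a Heaviside unit with zero input weights
    net = 1.0 + sum(2.0 * (-1) ** k * heaviside(s - k)
                    for k in range(1, d + 1))
    assert net == (-1.0) ** s  # network output equals the parity (-1)^s
print("parity on the", d, "-dimensional cube computed with", d + 1, "units")</preformat>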
      <sec id="sec-1-5">
        <title>The basic measure of sparsity of a network with a single linear output is the number of its nonzero output weights.</title>
        <p>
          The number of nonzero entries of a vector in Rn is called
“ l0-pseudonorm”. The quotation marks are used as l0 is
not homogenous and its “unit ball” is unbounded and
nonconvex. Thus, minimization of the number of nonzero
entries of an output-weight vector is a difficult nonconvex
optimization task. Minimization of “ l0-pseudonorm” has
been studied in signal processing, where it was shown that
in some cases it is NP-hard [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <sec id="sec-1-5-1">
          <title>A good approximation of convexification of “l0</title>
          <p>
            pseudonorm” is the l1-norm [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. In neurocomputing,
l1norm has been used as a stabilizer in weight-decay
regularization techniques [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. In statistical learning theory, it
l1-norm plays an important role in LASSO regularization
[
            <xref ref-type="bibr" rid="ref19">19</xref>
            ].
          </p>
          <p>
            Networks with large l1-norms of output-weight vectors
have either large numbers of units or some of the weights
are large. Both are not desirable: implementation of
networks with large numbers of units might not be
feasible and large output weights might lead to instability of
computation. The minimum of the l1-norms of
outputweight vectors of all networks computing a given function
is bounded from below by the variational norm tailored to
a type of network units, which is a critical factor in
estimates of upper bounds on network complexity [
            <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
            ].
          </p>
          <p>To identify and explain design of networks capable
of efficient classifications, one has to focus on suitable
classes of tasks. Even on a domain of a moderate size,
there exists an enormous number of functions representing
multi-class or binary classifications. For example, when
the size of a domain is equal to 80, then the number of
classifications into 10 classes is 1080 and when its size is
267, then the number of binary classification tasks is 2267.</p>
        </sec>
        <sec id="sec-1-5-2">
          <title>These numbers are larger than the estimated number 1078</title>
          <p>
            of atoms in the observable universe (see, e.g., [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]).
Obviously, most classification tasks on such domains are not
likely to be relevant for neurocomputing, as they do not
model any task of practical interest.
          </p>
        </sec>
      </sec>
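      <p>These counts are immediate to verify with exact integer arithmetic
(a quick check of the arithmetic above; our snippet, not from the paper):</p>
      <preformat># Quick check of the task counts with Python's arbitrary-precision integers.
n_multiclass = 10 ** 80   # classifications of 80 points into 10 classes
n_binary = 2 ** 267       # binary classifications of 267 points
atoms = 10 ** 78          # estimated number of atoms in the observable universe
print(n_multiclass > atoms, n_binary > atoms)  # True True</preformat>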
      <sec id="sec-1-6">
        <title>In this paper, we investigate how to choose dictionar</title>
        <p>ies of network units such that binary classification tasks
can be efficiently solved. We assume that elements of a
finite domain in Rd represent vectors of features,
measurements, or observations for which some prior knowledge
is available about probabilities that a presence of each of
these features implies the property described by one of the
classes. For example, when vectors in the domain
represent ordered sets of medical symptoms, certain values of
some of these symptoms might indicate a high
probability of some diagnosis, while others might indicate a low
probability or be irrelevant.</p>
        <p>For sets of classification tasks endowed with product
probability distributions, we explore suitability of
dictionaries of computational units in terms of values of
variational norms tailored to the dictionaries. We analyze
consequences of the concentration of measure
phenomena which imply that with increasing sizes of function
domains, correlations between network units and functions
tend to concentrate around their mean or median values.</p>
      </sec>
      <sec id="sec-1-7">
        <title>We derive lower bounds on variational norms of functions</title>
        <p>to be computed and on l1-norms of output-weight
vectors of networks computing these functions. To obtain the
lower bounds, we apply the Chernoff-Hoeffding Bound [5,
Theorem 1.11] on sums of independent random variables
not necessarily identically distributed. We show that when
a priori knowledge of classification tasks is limited, then
sparsity can only be achieved with large sizes of
dictionaries. On the other hand, when such knowledge is biased,
then there exist functions with which most functions on a
large domain are highly correlated. If some of these
functions is close to an element of a dictionary, then most
functions can be well approximated by sparse networks with
units from the dictionary.</p>
      </sec>
      <sec id="sec-1-8">
        <title>The paper is organized as follows. In Section 2, we in</title>
        <p>troduce basic concepts on feedforward networks,
dictionaries of computational units, and approximate measures
of network sparsity. In Section 3, we propose a
probabilistic model of classification tasks and analyze properties
of approximate measures of sparsity using the
Chernoff</p>
      </sec>
      <sec id="sec-1-9">
        <title>Hoeffding Bound. In Section 4, we derive estimates of</title>
        <p>probability distributions of values of variational norms and
analyze consequences of these estimates for choice of
dictionaries suitable for tasks modeled by the given
probabilities. Section 5 is a brief discussion.
2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 Approximate measures of network sparsity</title>
      <sec id="sec-2-1">
        <title>We investigate computation of classification tasks repre</title>
        <p>sented by binary-valued functions on finite domains X ⊂
Rd . We denote by</p>
        <p>B(X ) := { f | f : X → {−1, 1}}
the set of all functions on X with values in {−1, 1} and by</p>
        <p>F (X ) := { f | f : X → R}
the set of all real-valued functions on X .</p>
        <p>When card X = m and X = {x1, . . . , xm} is a linear
ordering of X , then the mapping ι : F (X ) → Rm defined
as ι( f ) := ( f (x1), . . . , f (xm)) is an isomorphism. So, on</p>
        <sec id="sec-2-1-1">
          <title>F (X ) we have the Euclidean inner product defined as</title>
          <p>h f , gi := ∑ f (u)g(u)</p>
          <p>u∈X
and the Euclidean norm k f k := ph f , f i. We consider
binary-valued functions with the range {−1, 1} instead
o√fc{ar0d, 1X}. as all functions in B(X ) have norms equal to</p>
      <p>A feedforward network with a single linear output can
compute input-output functions from the set</p>
      <p>span G := {∑_{i=1}^{n} w_i g_i | w_i ∈ R, g_i ∈ G, n ∈ N},</p>
      <p>where G, called a dictionary, is a parameterized family
of functions. In networks with one hidden layer (called
shallow networks), G is formed by functions computable
by a given type of computational units, whereas in
networks with several hidden layers (called deep networks), it
is formed by combinations and compositions of functions
representing units from lower layers (see, e.g., [<xref ref-type="bibr" rid="ref16 ref2">2, 16</xref>]).</p>
      <p>Formally, the number of hidden units in a shallow
network or in the last hidden layer of a deep one can be
described as the number of nonzero entries of the vector of
output weights of the network. In applied mathematics,
the number of nonzero entries of a vector w ∈ R^n, denoted
‖w‖_0, is called the “l0-pseudonorm” as it satisfies the
equation</p>
      <p>‖w‖_0 = ∑_{i=1}^{n} w_i^0.</p>
      <p>The quotation marks are used because ‖w‖_0 is
neither a norm nor a pseudonorm. Minimization of the
“l0-pseudonorm” is a difficult nonconvex problem, as l0 lacks
the homogeneity property of a norm and its “unit ball” is
not convex.</p>
      <p>Instead of the nonconvex l0-functional, its
approximation by the l1-norm</p>
      <p>‖w‖_1 = ∑_{i=1}^{n} |w_i|</p>
      <p>has been used as a stabilizer in weight-decay
regularization methods [<xref ref-type="bibr" rid="ref7">7</xref>]. Some insight into the efficiency of
computation of a function f by networks with units from a
dictionary G can be obtained from investigation of the minima
of l1-norms of all vectors from the set</p>
      <p>W_f(G) := {w = (w_1, . . . , w_n) | f = ∑_{i=1}^{n} w_i g_i, g_i ∈ G, n ∈ N}.</p>
      <p>Minima of l1-norms of elements of W_f(G) are bounded
from below by a norm of f tailored to the dictionary G,
called G-variation. It is defined for a bounded subset G
of a normed linear space (X, ‖·‖_X) as</p>
      <p>‖f‖_G := inf {c ∈ R_+ | f/c ∈ cl_X conv(G ∪ −G)},</p>
      <p>where −G := {−g | g ∈ G}, cl_X denotes the closure with
respect to the topology induced by the norm ‖·‖_X, and
conv denotes the convex hull. Variation with respect to
Heaviside perceptrons (called variation with respect to
half-spaces) was introduced in [<xref ref-type="bibr" rid="ref1">1</xref>] and extended to general
dictionaries in [<xref ref-type="bibr" rid="ref9">9</xref>].</p>
      <p>It is easy to check (see [<xref ref-type="bibr" rid="ref13">13</xref>]) that for a finite
dictionary G and any f such that the set W_f(G) is nonempty,
the G-variation of f is equal to the minimum of l1-norms of
output-weight vectors of shallow networks with units from
G which compute f, i.e.,</p>
      <p>‖f‖_G = min {‖w‖_1 | w ∈ W_f(G)}.</p>
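      <p>For a small domain and a finite dictionary, this identity can be
checked numerically. The sketch below (our illustration, not code from
the paper; it assumes NumPy and SciPy are available) computes
min {‖w‖_1 | w ∈ W_f(G)} by linear programming, splitting w into its
positive and negative parts:</p>
      <preformat># Sketch: G-variation of f as the minimal l1-norm of output weights,
# computed by linear programming. Writing w = u - v with u, v nonnegative
# turns "minimize ||w||_1 subject to A w = f" into a standard-form LP.
import numpy as np
from scipy.optimize import linprog

m, k = 8, 20                               # card X and dictionary size
rng = np.random.default_rng(1)
A = rng.choice([-1.0, 1.0], size=(m, k))   # columns are g_1, ..., g_k in B(X)
f = rng.choice([-1.0, 1.0], size=m)        # target function in B(X)

c = np.ones(2 * k)                         # objective: sum(u) + sum(v)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=f, bounds=(0, None))
if res.success:
    print("G-variation of f (minimal l1-norm):", res.fun)
else:
    print("f is not in span G for this random dictionary")</preformat>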
        <sec id="sec-2-4-1">
          <title>Thus lower bounds on minima of l1-norms of output</title>
          <p>
            weight vectors of networks computing a function f can be
obtained from lower bounds on variational norms. Such
bounds can be derived using the following theorem, which
is a special case of a more general result [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] proven using
          </p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>Hahn-Banach theorem. By G⊥ is denoted the orthogonal</title>
        <p>complement of G in the Hilbert space F (X ).</p>
        <p>Theorem 1. Let X be a finite subset of Rd and G be a
bounded subset of F (X ). Then for every f ∈ F (X ) \ G⊥,
f 2
k k
k f kG ≥ supg∈G |hg, f i|
.</p>
      </sec>
      <sec id="sec-2-6">
        <title>So functions which are nearly orthogonal to all elements</title>
        <p>of a dictionary G have large G-variations. On the other
hand, if a function is correlated with some element of G,
then it is close to this element and so can be well
approximated by an element of G.
3</p>
      </sec>
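      <p>Continuing the sketch above, the lower bound of Theorem 1 is a
one-line computation (our illustration; for f in B(X) one has
‖f‖² = m, so the bound is m divided by the largest absolute inner
product of f with the dictionary):</p>
      <preformat># Sketch: the Theorem 1 lower bound on G-variation, reusing A and f
# from the previous snippet (f must not be orthogonal to all of G).
corr = np.abs(A.T @ f).max()   # largest absolute inner product with G
if corr > 0:
    print("Theorem 1 lower bound:", (f @ f) / corr)  # never exceeds res.fun</preformat>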
    </sec>
    <sec id="sec-3">
      <title>3 Probabilistic bounds</title>
      <sec id="sec-3-1">
        <title>When we do not have any prior knowledge about a type of</title>
        <p>classification tasks to be computed, we have to assume that
a network from the class has to be capable to compute any
uniformly randomly chosen function on a given domain.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Often in practical applications, most of the binary-valued</title>
        <p>functions o a given domain are not likely to represent tasks
of interest. In such cases some knowledge is available that
can be expressed in terms of a discrete probability measure
on the set of all functions on X .</p>
      <p>For a finite domain X = {x_1, . . . , x_m} ⊂ R^d, a
function f in B(X) can be represented as a vector
(f(x_1), . . . , f(x_m)) ∈ {−1, 1}^m ⊂ R^m. We assume that for
each x_i ∈ X, there exists a known probability p_i ∈ [0, 1]
that f(x_i) = 1. For p = (p_1, . . . , p_m), we denote by
ρ_p : B(X) → [0, 1] the product probability defined for every
f ∈ B(X) as</p>
      <p>ρ_p(f) := ∏_{i=1}^{m} ρ_{p,i}(f),</p>
      <p>where ρ_{p,i}(f) := p_i if f(x_i) = 1 and ρ_{p,i}(f) := 1 − p_i if
f(x_i) = −1. It is easy to verify that ρ_p is a probability
measure on B(X).</p>
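      <p>A short simulation (our sketch, not code from the paper; all names
are ours) illustrates both the sampling of f according to ρ_p and the
concentration of the inner products studied below:</p>
      <preformat># Sketch: sampling from the product measure rho_p on B(X) and observing
# the concentration of inner products with a fixed h in B(X).
import numpy as np

rng = np.random.default_rng(0)
m = 400                                # size of the domain X
p = rng.uniform(0.3, 0.9, size=m)      # known probabilities that f(x_i) = 1

def sample_f(rng, p):
    # f(x_i) = 1 with probability p_i, else -1
    return np.where(rng.random(p.size) > 1.0 - p, 1.0, -1.0)

h = sample_f(rng, p)                   # some fixed h in B(X)
inner = np.array([sample_f(rng, p) @ h for _ in range(20000)])
lam = m ** (-0.25)                     # the choice lambda = m^(-1/4)
tail = np.mean(np.abs(inner - inner.mean()) > m * lam)
print("empirical tail:", tail,
      "  Chernoff-Hoeffding bound:", np.exp(-m * lam ** 2 / 2))</preformat>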
      <p>When card X is large, the set F(X) is isometric to a
high-dimensional Euclidean space and B(X) to a
high-dimensional Hamming cube. In high-dimensional spaces
and cubes various concentration of measure phenomena
occur [<xref ref-type="bibr" rid="ref14">14</xref>]. We apply the Chernoff-Hoeffding Bound on
sums of independent random variables, which do not need
to be identically distributed [5, Theorem 1.11], to obtain
estimates of distributions of inner products of any fixed
function h ∈ B(X) with functions randomly chosen from
B(X) with probability ρ_p.</p>
      <sec id="sec-3-3">
        <title>Theorem 2 (Chernoff-Hoeffding Bound). Let m be a pos</title>
        <p>itive integer, Y1, . . . ,Ym independent random variables with
values in real intervals of lengths c1, . . . , cm, respectively,
m
ε &gt; 0, and Y := ∑i=1 Yi. Then</p>
        <p>Pr (|Y − E(Y )| ≥ ε) ≤ e− ∑im2=ε12ci2 .</p>
      <p>For a function h ∈ B(X) and p = (p_1, . . . , p_m), where
p_i ∈ [0, 1], we denote by</p>
      <p>μ(h, p) := E_p(⟨h, f⟩ | f ∈ B(X))</p>
      <p>the mean value of inner products of h with f randomly
chosen from B(X) with probability ρ_p, and by h° := h/‖h‖
its normalization. The next theorem estimates the
distribution of these inner products.</p>
      <p>Theorem 3. Let X = {x_1, . . . , x_m} ⊂ R^d, p = (p_1, . . . , p_m)
be such that p_i ∈ [0, 1], i = 1, . . . , m, and h ∈ B(X). Then
the inner product of h with f randomly chosen from B(X)
with probability ρ_p(f) satisfies for every λ &gt; 0</p>
      <p>i) Pr(|⟨f, h⟩ − μ(h, p)| &gt; mλ) ≤ e^{−mλ²/2};</p>
      <p>ii) Pr(|⟨f°, h°⟩ − μ(h, p)/m| &gt; λ) ≤ e^{−mλ²/2}.</p>
      <p>Proof. Let F_h : B(X) → B(X) be an operator
composed of sign flips mapping h to the constant function
equal to 1, i.e., F_h(h)(x_i) = 1 for all i = 1, . . . , m and,
for all f ∈ F(X) and all i = 1, . . . , m, F_h(f)(x_i) = f(x_i)
if h(x_i) = 1 and F_h(f)(x_i) = −f(x_i) if h(x_i) = −1. Let
p(h) = (p(h)_1, . . . , p(h)_m) be defined as p(h)_i = p_i if
h(x_i) = 1 and p(h)_i = 1 − p_i if h(x_i) = −1. The inverse
operator F_h^{−1} maps the random variable F_h(f) ∈ B(X) such
that</p>
      <p>Pr(F_h(f)(x_i) = 1) = p(h)_i    (1)</p>
      <p>to the random variable f ∈ B(X) such that</p>
      <p>Pr(f(x_i) = 1) = p_i.</p>
      <sec id="sec-3-4">
        <title>The next theorem estimates probability distributions of variational norms in dependence on the size of a dictionary.</title>
        <p>
          Theorem 5. Let X = {x1, . . . , xm} ⊂ Rd , G =
{g1, . . . , gk} ⊂ B(X ), and p = (p1, . . . , pm) such that
pi ∈ [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ], i = 1, . . . , m. Then for every f ∈ B(X )
randomly chosen according to ρp and every λ &gt; 0
m
Pr k f kG ≥ μG(p) + mλ
&gt; 1 − k e− mλ22 .
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>Proof. By Theorem 3 (i), we get</title>
        <p>Pr |h f , hi − μ(h, p)| &gt; mλ ∀h ∈ G</p>
      </sec>
      <sec id="sec-3-6">
        <title>Hence,</title>
        <p>≤ ke− mλ22 .
&gt; 1 − ke− mλ22 .</p>
        <p>Pr ( f (xi) = 1) = pi.</p>
        <p>Since the inner product is invariant under sign flipping,
for every f ∈ B(X ) we have h f , hi = hFh( f ), (1, . . . , 1)i =
m
∑i=1 Fh( f )(xi). Thus the mean value of the sum of random
m
variables ∑i=1 Fh( f )(xi) is μ(h, p). Applying to this sum
the Chernoff-Hoeffding Bound stated in Theorem 2 with
c1 = · · · = cm = 2 and ε = mλ , we get</p>
        <p>m
Pr | ∑ Fh( f )(xi) − μ(h, p)| &gt; mλ
i=1
which proves i).</p>
        <p>ii) follows from i) as all functions in B(X ) have norms
equal to √m. 2</p>
      <p>Theorem 3 shows that when the domain X is large, most
inner products of any given function with functions
randomly chosen from B(X) with probability ρ_p are
concentrated around their mean values. For example, setting
λ = m^{−1/4}, we get e^{−mλ²/2} = e^{−m^{1/2}/2}, which decreases
exponentially fast with increasing size m of the domain.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Dictionaries for efficient classification</title>
      <sec id="sec-4-1">
        <title>Theorem 3 implies that when a dictionary G contains a</title>
        <p>function h, for which the mean value μ(h, p) is large, then
most functions randomly chosen with respect to the
probability distribution ρp are correlated with h. Thus most
classification tasks characterized by ρp can be well
approximated by a network with just one element h. A dictionary</p>
      </sec>
      <sec id="sec-4-2">
        <title>G is also suitable for a given task when such function h can be well approximated by a small network with units from G.</title>
      </sec>
      <sec id="sec-4-3">
        <title>It is easy to calculate the mean value μ(h, p) of inner</title>
        <p>products of a fixed function h from B(X ) with randomly
chosen functions from B(X ) with respect to the
probability ρp.</p>
        <p>
          Proposition 4. Let h ∈ B(X ) and p = (p1 . . . , pm), where
pi ∈ [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ] for each i = 1, . . . , m. Then for a function f
randomly chosen in B(X ) according to ρp, the mean value
of h f , hi satisfies
μ(h, p) = ∑ (2pi − 1) + ∑ (1 − 2pi) ,
        </p>
        <p>i∈Ih i∈Jh
where Ih = {i ∈ {1, . . . , m} | h(xi) = 1} and Jh = {i ∈
{1, . . . , m} | h(xi) = −1}.</p>
      </sec>
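      <p>Proposition 4 is easy to confirm numerically (our sketch, not code
from the paper; the closed form is compared against a Monte Carlo
estimate):</p>
      <preformat># Sketch: closed form of Proposition 4 vs. a Monte Carlo estimate.
import numpy as np

rng = np.random.default_rng(2)
m = 200
p = rng.uniform(0.0, 1.0, size=m)               # probabilities p_i
h = np.where(rng.random(m) > 0.5, 1.0, -1.0)    # fixed h in B(X)

# mu(h, p) = sum over I_h of (2 p_i - 1) + sum over J_h of (1 - 2 p_i)
mu = np.sum(2 * p[h == 1.0] - 1) + np.sum(1 - 2 * p[h == -1.0])

def sample_f(rng, p):
    # f(x_i) = 1 with probability p_i, else -1
    return np.where(rng.random(p.size) > 1.0 - p, 1.0, -1.0)

est = np.mean([sample_f(rng, p) @ h for _ in range(50000)])
print("closed form:", mu, "  Monte Carlo:", est)</preformat>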
      <sec id="sec-4-4">
        <title>By Theorem 1, variation with respect to a dictionary of a function is large when the function is nearly orthogonal</title>
        <p>Pr |h f , hi − μ(h, p)| ≤ mλ ∀h ∈ G
As |h f , hi− μ(h, p)| ≤ mλ implies |h f , hi| ≤ μ(h, p)+mλ ,
we get</p>
        <p>Pr |h f , hi| ≤ μ(h, p) + mλ ∀h ∈ G
&gt; 1 − ke− mλ22 .</p>
      </sec>
      <sec id="sec-4-5">
        <title>So by Theorem 1</title>
        <p>m
Pr k f kG ≥ μ(h, p) + mλ ∀h ∈ G
&gt; 1 − k e− mλ22 .</p>
        <p>Since by the definition, for everyh ∈ G one has μG(p) ≥
μ(h, p), we obtain</p>
        <p>m m
μG(p) + mλ ≤ μ(h, p) + mλ
and so</p>
        <p>m
Pr k f kG ≥ μG(p) + mλ
&gt; 1 − k e− mλ22 .</p>
        <p>2</p>
        <p>Theorem 5 shows that when for all computational units
h in a dictionary G, the mean values μ(h, p) are small,
then for large m almost all functions randomly chosen
according to ρp are nearly orthogonal to all elements of the
dictionary G. For example, setting λ = m−1/4, we get a
m1/2
probability greater than 1 − ke− 2 that a randomly
chosen function has G-variation greater or equal to μG(p)m+m3/4 .
Thus when for large m, μG(p) is small, G-variation of most
m
functions is large unless the size k of a dictionary G
outweighs the factor e− mλ22 .</p>
      </sec>
      <sec id="sec-4-6">
        <title>Function with large G-variations cannot be computed by networks that have both the number of hidden units and all absolute vales of output weights small.</title>
        <p>
          Corollary 1. Let X = {x1, . . . , xm} ⊂ Rd , G =
{g1, . . . , gk} ⊂ B(X ), and p = (p1, . . . , pm) such that
pi ∈ [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ], i = 1, . . . , m. Then for every f ∈ B(X )
randomly chosen according to ρp, and every λ &gt; 0,
m
Pr min kwk1 | w ∈ Wf (G) ≥ μG(p) + mλ
&gt;
1 − k e− mλ22 .
        </p>
      </sec>
      <sec id="sec-4-7">
        <title>Corollary 1 implies that computation of most classifica</title>
        <p>tion tasks randomly chosen from B(X ) with the product
probability ρp either requires to perform an ill-conditioned
task by a moderate network or a well-conditioned task by
a large network.</p>
        <p>In particular, for the uniform distribution pi = 1/2 for
all i = 1, . . . , m, for every h ∈ B(X ) the mean value μ(h, p)
is zero. Thus for any dictionary G ⊂ B(X ), almost
all functions uniformly randomly chosen from B(X ) are
nearly orthogonal to all elements of the dictionary. So we
get the following two corollaries.</p>
        <p>Corollary 2. Let X = {x1, . . . , xm} ⊂ Rd and f ∈ B(X )
be uniformly randomly chosen. Then for every h ∈ B(X )
and every λ &gt; 0</p>
        <p>Pr |h f , hi| &gt; mλ</p>
        <p>Corollary 3. Let X = {x1, . . . , xm} ⊂ Rd and G =
{g1, . . . , gk} ⊂ B(X ). Then for every f ∈ B(X ) uniformly
randomly chosen and every λ &gt; 0</p>
        <p>1
Pr k f kG ≥ λ</p>
        <p>≥ 1 − k e− mλ22 .</p>
        <p>When we do not have any a priori knowledge about the
task, we have to assume that the probability on B(X ) is
uniform. Corollary 3 shows that unless a dictionary G is
sufficiently large to outweigh the factore− mλ22 , most
functions randomly chosen in B(X ) according to ρp have
Gvariations greater or equal to 1/λ . So for small λ and
sufficiently largem, most such functions cannot be computed
by linear combinations of small numbers of elements of</p>
      </sec>
      <sec id="sec-4-8">
        <title>G with small coefficients. Similar situation occurs when probabilities are nearly uniform.</title>
      </sec>
      <sec id="sec-4-9">
        <title>Many common dictionaries used in neurocomputing are</title>
        <p>
          m1/2
relatively small with respect to the factor e− 2 . For
example, the size of the dictionary of signum perceptrons
Pd (X ) on a set X of m points in Rd is well-known since
the work of Schläfli [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. He estimated the number of
linearly separated dichotomies of m points in Rd . His upper
bound states that for every X ⊂ Rd such that card X = m,
d
card Pd (X ) ≤ 2 ∑
l=1
m − 1
l
        </p>
        <p>md
≤ 2 d!
.</p>
        <p>
          (2)
(see, e.g., [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]). The set Pd (X ) forms only a small fraction
of the set of all functions in the set B(X ), whose
cardinality is equal to 2m. Also other dictionaries of {−1,
1}valued functions generated by dichotomies of m points in
Rd defined by nonlinear separating surfaces (such as
hyperspheres or hypercones) are relatively small (see [4,
Table I ]).
5
        </p>
      </sec>
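      <p>The gap between such dictionary sizes and the factor e^{−m^{1/2}/2}
is easy to examine with exact integer arithmetic (our sketch, not code
from the paper; it uses the reconstructed bound (2) with the sum taken
over l = 0, . . . , d):</p>
      <preformat># Sketch: Schlaefli/Cover upper bound on the number of signum perceptrons
# versus card B(X) = 2^m and versus the factor k * exp(-sqrt(m)/2).
import math

def perceptron_dichotomies(m, d):
    """Upper bound 2 * sum_{l=0..d} binom(m-1, l) on card P_d(X)."""
    return 2 * sum(math.comb(m - 1, l) for l in range(d + 1))

m, d = 10 ** 6, 20
k = perceptron_dichotomies(m, d)
print("card P_d(X) bound:  about 10 **", len(str(k)) - 1)
print("card B(X) = 2^m:    about 10 **", round(m * math.log10(2)))
# factor from Corollary 3 with lambda = m^(-1/4), as a base-10 exponent:
log10_factor = (len(str(k)) - 1) - math.sqrt(m) / (2 * math.log(10))
print("k * exp(-sqrt(m)/2): about 10 **", round(log10_factor))</preformat>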
    </sec>
    <sec id="sec-5">
      <title>5 Discussion</title>
      <p>As the number of binary-valued functions modeling
classification tasks grows exponentially with the size of their
domains, we proposed to model the relevance of such tasks
for a given application area by a probabilistic model. For
sets of classification tasks endowed with product
probability distributions, we investigated the complexity of networks
computing these tasks. We explored network
complexity in terms of approximate measures of sparsity
formalized by l1-norms and variational norms. For functions on large
domains, we analyzed implications of the concentration
of measure phenomena for correlations between network
units and randomly chosen functions.</p>
      <p>We focused on classification tasks characterized by
product probabilities. To derive estimates of the complexity of
networks computing randomly chosen functions, we used
the Chernoff-Hoeffding Bound on sums of independent
random variables. An extension of our analysis to tasks
characterized by more general probability distributions is
a subject of our future work. To obtain estimates for more
general probability distributions, we plan to apply versions
of the Chernoff-Hoeffding Bound stated in [<xref ref-type="bibr" rid="ref6">6</xref>], which hold
in situations when the random variables are not independent.</p>
      <sec id="sec-5-1">
        <title>Acknowledgments. V. K. was partially supported by</title>
        <p>the Czech Grant Foundation grant GA18-23827S and
institutional support of the Institute of Computer Science</p>
      </sec>
      <sec id="sec-5-2">
        <title>RVO 67985807. M. S. was partially supported by a</title>
        <p>FFABR grant of the Italian Ministry of Education,
University and Research (MIUR). He is Research Associate
at INM (Institute for Marine Enginering) of CNR
(National Research Council of Italy) under the Project PDGP
2018/20 DIT.AD016.001 “Technologies for Smart
Communities” and he is a member of GNAMPA-INdAM
(Gruppo Nazionale per l’Analisi Matematica, la
Probabilità e le loro Applicazioni - Instituto Nazionale di Alta</p>
      </sec>
      <sec id="sec-5-3">
        <title>Matematica).</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Barron</surname>
          </string-name>
          .
          <article-title>Neural net approximation</article-title>
          . In K. S. Narendra, editor,
          <source>Proc. 7th Yale Workshop on Adaptive and Learning Systems</source>
          , pages
          <fpage>69</fpage>
          -
          <lpage>72</lpage>
          . Yale University Press,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          .
          <article-title>Deep learning of representations</article-title>
          .
          In M. Bianchini, M. Maggini, and L. Jain, editors,
          <source>Handbook of Neural Information Processing</source>
          . Springer, Berlin, Heidelberg,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Delalleau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. Le</given-names>
            <surname>Roux</surname>
          </string-name>
          .
          <article-title>The curse of highly variable functions for local kernel machines</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , volume
          <volume>18</volume>
          , pages
          <fpage>107</fpage>
          -
          <lpage>114</lpage>
          . MIT Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Cover</surname>
          </string-name>
          .
          <article-title>Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition</article-title>
          .
          <source>IEEE Trans. on Electronic Computers</source>
          ,
          <volume>14</volume>
          :
          <fpage>326</fpage>
          -
          <lpage>334</lpage>
          ,
          <year>1965</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Doerr</surname>
          </string-name>
          .
          <article-title>Analyzing randomized search heuristics: Tools from probability theory</article-title>
          .
          <source>In Theory of Randomized Search Heuristics - Foundations and Recent Developments, chapter 1</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          . World Scientific Publishing,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Dubhashi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Panconesi</surname>
          </string-name>
          .
          <article-title>Concentration of Measure for the Analysis of Randomized Algorithms</article-title>
          . Cambridge University Press, New York,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Fine</surname>
          </string-name>
          .
          <source>Feedforward Neural Network Methodology</source>
          . Springer, Berlin Heidelberg,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ito</surname>
          </string-name>
          .
          <article-title>Finite mapping by neural networks and truth functions</article-title>
          .
          <source>Mathematical Scientist</source>
          ,
          <volume>17</volume>
          :
          <fpage>69</fpage>
          -
          <lpage>77</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          .
          <article-title>Dimension-independent rates of approximation by neural networks</article-title>
          . In K. Warwick and M. Kárný, editors,
          <source>Computer-Intensive Methods in Control and Signal Processing. The Curse of Dimensionality</source>
          , pages
          <fpage>261</fpage>
          -
          <lpage>270</lpage>
          . Birkhäuser, Boston, MA,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          .
          <article-title>Complexity estimates based on integral transforms induced by computational units</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>33</volume>
          :
          <fpage>160</fpage>
          -
          <lpage>167</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Approximate minimization of the regularized expected error over kernel models</article-title>
          .
          <source>Mathematics of Operations Research</source>
          ,
          <volume>33</volume>
          :
          <fpage>747</fpage>
          -
          <lpage>756</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Model complexities of shallow networks representing highly varying functions</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>171</volume>
          :
          <fpage>598</fpage>
          -
          <lpage>604</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Savický</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Hlaváčková</surname>
          </string-name>
          .
          <article-title>Representations and rates of approximation of real-valued Boolean functions by neural networks</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>11</volume>
          :
          <fpage>651</fpage>
          -
          <lpage>659</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ledoux</surname>
          </string-name>
          .
          <article-title>The Concentration of Measure Phenomenon</article-title>
          . AMS, Providence,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tegmark</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Rolnick</surname>
          </string-name>
          .
          <article-title>Why does deep and cheap learning work so well</article-title>
          ?
          <source>J. of Statistical Physics</source>
          ,
          <volume>168</volume>
          :
          <fpage>1223</fpage>
          -
          <lpage>1247</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H. N.</given-names>
            <surname>Mhaskar</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Poggio</surname>
          </string-name>
          .
          <article-title>Deep vs. shallow networks: An approximation theory perspective</article-title>
          .
          <source>Analysis and Applications</source>
          ,
          <volume>14</volume>
          :
          <fpage>829</fpage>
          -
          <lpage>848</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Plan</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Vershynin</surname>
          </string-name>
          .
          <article-title>One-bit compressed sensing by linear programming</article-title>
          .
          <source>Communications in Pure and Applied Mathematics</source>
          ,
          <volume>66</volume>
          :
          <fpage>1275</fpage>
          -
          <lpage>1297</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Schläfli</surname>
          </string-name>
          . Theorie der Vielfachen Kontinuität. Zürcher &amp; Furrer, Zürich,
          <year>1901</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hastie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          .
          <article-title>Statistical Learning with Sparsity: The Lasso and Generalizations</article-title>
          . Chapman &amp; Hall/CRC, London,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.M.</given-names>
            <surname>Tillmann</surname>
          </string-name>
          .
          <article-title>On the computational intractability of exact and approximate dictionary learning</article-title>
          .
          <source>IEEE Signal Processing Letters</source>
          ,
          <volume>22</volume>
          :
          <fpage>45</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>