<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Probabilistic Bounds on Complexity of Networks Computing Binary Classification Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Věra Kůrková</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcello Sanguineti</string-name>
          <email>marcello.sanguineti@unige.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIBRIS, University of Genoa</institution>
          ,
<addr-line>Genoa, Italy</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science, Czech Academy of Sciences</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>2203</volume>
      <fpage>86</fpage>
      <lpage>91</lpage>
      <abstract>
<p>Complexity of feedforward networks computing binary classification tasks is investigated. To deal with the unmanageably large number of these tasks on domains of even moderate size, a probabilistic model characterizing the relevance of the classification tasks is introduced. Approximate measures of sparsity of networks computing randomly chosen functions are studied in terms of variational norms tailored to dictionaries of computational units. Probabilistic lower bounds on these norms are derived using the Chernoff-Hoeffding Bound on sums of independent random variables, which need not be identically distributed. Consequences of the probabilistic results for the choice of dictionaries of computational units are discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <sec id="sec-1-1">
        <title>It has long been known that one-hidden-layer (shallow)</title>
        <p>
          networks with computational units of many common types
can exactly compute any function on a finite domain [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>In particular, they can perform any binary-classification</title>
        <p>task. Proofs of theorems on the universal
approximation and representation properties of feedforward networks
guarantee their power to express wide classes of functions,
but do not deal with the efficiency of such representations.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Typically, such arguments assume that the number of units</title>
        <p>is unbounded or is as large as the size of the domain of
functions to be computed. For large domains,
implementations of such networks might not be feasible.</p>
      </sec>
      <sec id="sec-1-4">
        <title>A proper choice of a network architecture and a type</title>
        <p>
          of its units can, in some cases, considerably reduce
network complexity. For example, a classification of points
in the d-dimensional Boolean cube {0, 1}d according to
the parity of the numbers of 1’s cannot be computed by a
Gaussian SVM network with less than 2d−1 support
vectors [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] (i.e., it cannot be computed by a shallow network
with less than 2d−1 Gaussian SVM units). On the other
hand, it is easy to show that the parity function (as well as
any generalized parity, the set of which forms the Fourier
basis) can be computed by a shallow network with merely
d + 1 Heaviside perceptrons [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
      </sec>
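      <p>The parity representation with d + 1 Heaviside units can be written
down explicitly via a telescoping identity for (−1)^s, where s is the
number of 1’s in the input. The following sketch is our illustration, not
code from the paper; the constant term is realized as a Heaviside unit
with zero input weights and threshold 0, giving d + 1 units in total:</p>
      <preformat># Sketch: parity on {0,1}^d as a shallow network with d + 1 Heaviside
# units, using the telescoping identity
#   (-1)^s = 1 + sum_{k=1..d} 2*(-1)^k * H(s - k),
# where s is the number of 1's and H(t) = 1 for t at least 0, else 0.
import itertools

def heaviside(t):
    return 1.0 if t >= 0 else 0.0

d = 5
for x in itertools.product([0, 1], repeat=d):
    s = sum(x)
    # unit k fires iff sum_i x_i is at least k; its output weight is 2*(-1)^k;
    # the leading 1.0 is a Heaviside unit with zero input weights
    net = 1.0 + sum(2.0 * (-1) ** k * heaviside(s - k)
                    for k in range(1, d + 1))
    assert net == (-1.0) ** s  # network output equals the parity (-1)^s
print("parity on the", d, "-dimensional cube computed with", d + 1, "units")</preformat>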
      <sec id="sec-1-5">
        <title>The basic measure of sparsity of a network with a single linear output is the number of its nonzero output weights.</title>
        <p>
          The number of nonzero entries of a vector in Rn is called
“ l0-pseudonorm”. The quotation marks are used as l0 is
not homogenous and its “unit ball” is unbounded and
nonconvex. Thus, minimization of the number of nonzero
entries of an output-weight vector is a difficult nonconvex
optimization task. Minimization of “ l0-pseudonorm” has
been studied in signal processing, where it was shown that
in some cases it is NP-hard [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <sec id="sec-1-5-1">
          <title>A good approximation of convexification of “l0</title>
          <p>
            pseudonorm” is the l1-norm [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]. In neurocomputing,
l1norm has been used as a stabilizer in weight-decay
regularization techniques [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. In statistical learning theory, it
l1-norm plays an important role in LASSO regularization
[
            <xref ref-type="bibr" rid="ref19">19</xref>
            ].
          </p>
          <p>
            Networks with large l1-norms of output-weight vectors
have either large numbers of units or some of the weights
are large. Both are not desirable: implementation of
networks with large numbers of units might not be
feasible and large output weights might lead to instability of
computation. The minimum of the l1-norms of
outputweight vectors of all networks computing a given function
is bounded from below by the variational norm tailored to
a type of network units, which is a critical factor in
estimates of upper bounds on network complexity [
            <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
            ].
          </p>
          <p>To identify and explain design of networks capable
of efficient classifications, one has to focus on suitable
classes of tasks. Even on a domain of a moderate size,
there exists an enormous number of functions representing
multi-class or binary classifications. For example, when
the size of a domain is equal to 80, then the number of
classifications into 10 classes is 1080 and when its size is
267, then the number of binary classification tasks is 2267.</p>
        </sec>
        <sec id="sec-1-5-2">
          <title>These numbers are larger than the estimated number 1078</title>
          <p>
            of atoms in the observable universe (see, e.g., [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]).
Obviously, most classification tasks on such domains are not
likely to be relevant for neurocomputing, as they do not
model any task of practical interest.
          </p>
        </sec>
      </sec>
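      <p>These counts are immediate to verify with exact integer arithmetic
(a quick check of the arithmetic above; our snippet, not from the paper):</p>
      <preformat># Quick check of the task counts with Python's arbitrary-precision integers.
n_multiclass = 10 ** 80   # classifications of 80 points into 10 classes
n_binary = 2 ** 267       # binary classifications of 267 points
atoms = 10 ** 78          # estimated number of atoms in the observable universe
print(n_multiclass > atoms, n_binary > atoms)  # True True</preformat>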
      <sec id="sec-1-6">
        <title>In this paper, we investigate how to choose dictionar</title>
        <p>ies of network units such that binary classification tasks
can be efficiently solved. We assume that elements of a
finite domain in Rd represent vectors of features,
measurements, or observations for which some prior knowledge
is available about probabilities that a presence of each of
these features implies the property described by one of the
classes. For example, when vectors in the domain
represent ordered sets of medical symptoms, certain values of
some of these symptoms might indicate a high
probability of some diagnosis, while others might indicate a low
probability or be irrelevant.</p>
        <p>For sets of classification tasks endowed with product
probability distributions, we explore suitability of
dictionaries of computational units in terms of values of
variational norms tailored to the dictionaries. We analyze
consequences of the concentration of measure
phenomena which imply that with increasing sizes of function
domains, correlations between network units and functions
tend to concentrate around their mean or median values.</p>
      </sec>
      <sec id="sec-1-7">
        <title>We derive lower bounds on variational norms of functions</title>
        <p>to be computed and on l1-norms of output-weight
vectors of networks computing these functions. To obtain the
lower bounds, we apply the Chernoff-Hoeffding Bound [5,
Theorem 1.11] on sums of independent random variables
not necessarily identically distributed. We show that when
a priori knowledge of classification tasks is limited, then
sparsity can only be achieved with large sizes of
dictionaries. On the other hand, when such knowledge is biased,
then there exist functions with which most functions on a
large domain are highly correlated. If some of these
functions is close to an element of a dictionary, then most
functions can be well approximated by sparse networks with
units from the dictionary.</p>
      </sec>
      <sec id="sec-1-8">
        <title>The paper is organized as follows. In Section 2, we in</title>
        <p>troduce basic concepts on feedforward networks,
dictionaries of computational units, and approximate measures
of network sparsity. In Section 3, we propose a
probabilistic model of classification tasks and analyze properties
of approximate measures of sparsity using the
Chernoff</p>
      </sec>
      <sec id="sec-1-9">
        <title>Hoeffding Bound. In Section 4, we derive estimates of</title>
        <p>probability distributions of values of variational norms and
analyze consequences of these estimates for choice of
dictionaries suitable for tasks modeled by the given
probabilities. Section 5 is a brief discussion.
2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 Approximate measures of network sparsity</title>
      <sec id="sec-2-1">
        <title>We investigate computation of classification tasks repre</title>
        <p>sented by binary-valued functions on finite domains X ⊂
Rd . We denote by</p>
        <p>B(X ) := { f | f : X → {−1, 1}}
the set of all functions on X with values in {−1, 1} and by</p>
        <p>F (X ) := { f | f : X → R}
the set of all real-valued functions on X .</p>
        <p>When card X = m and X = {x1, . . . , xm} is a linear
ordering of X , then the mapping ι : F (X ) → Rm defined
as ι( f ) := ( f (x1), . . . , f (xm)) is an isomorphism. So, on</p>
        <sec id="sec-2-1-1">
          <title>F (X ) we have the Euclidean inner product defined as</title>
          <p>h f , gi := ∑ f (u)g(u)</p>
          <p>u∈X
and the Euclidean norm k f k := ph f , f i. We consider
binary-valued functions with the range {−1, 1} instead
o√fc{ar0d, 1X}. as all functions in B(X ) have norms equal to</p>
      <p>A feedforward network with a single linear output can
compute input-output functions from the set</p>
      <p>span G := {∑_{i=1}^{n} w_i g_i | w_i ∈ R, g_i ∈ G, n ∈ N},</p>
      <p>where G, called a dictionary, is a parameterized family
of functions. In networks with one hidden layer (called
shallow networks), G is formed by functions computable
by a given type of computational units, whereas in
networks with several hidden layers (called deep networks), it
is formed by combinations and compositions of functions
representing units from lower layers (see, e.g., [<xref ref-type="bibr" rid="ref16 ref2">2, 16</xref>]).</p>
      <p>Formally, the number of hidden units in a shallow
network or in the last hidden layer of a deep one can be
described as the number of nonzero entries of the vector of
output weights of the network. In applied mathematics,
the number of nonzero entries of a vector w ∈ R^n, denoted
‖w‖_0, is called the “l0-pseudonorm” as it satisfies the
equation</p>
      <p>‖w‖_0 = ∑_{i=1}^{n} w_i^0.</p>
      <p>The quotation marks are used because ‖w‖_0 is
neither a norm nor a pseudonorm. Minimization of the
“l0-pseudonorm” is a difficult nonconvex problem, as l0 lacks
the homogeneity property of a norm and its “unit ball” is
not convex.</p>
      <p>Instead of the nonconvex l0-functional, its
approximation by the l1-norm</p>
      <p>‖w‖_1 = ∑_{i=1}^{n} |w_i|</p>
      <p>has been used as a stabilizer in weight-decay
regularization methods [<xref ref-type="bibr" rid="ref7">7</xref>]. Some insight into the efficiency of
computation of a function f by networks with units from a
dictionary G can be obtained from investigation of the minima
of l1-norms of all vectors from the set</p>
      <p>W_f(G) := {w = (w_1, . . . , w_n) | f = ∑_{i=1}^{n} w_i g_i, g_i ∈ G, n ∈ N}.</p>
      <p>Minima of l1-norms of elements of W_f(G) are bounded
from below by a norm of f tailored to the dictionary G,
called G-variation. It is defined for a bounded subset G
of a normed linear space (X, ‖·‖_X) as</p>
      <p>‖f‖_G := inf {c ∈ R_+ | f/c ∈ cl_X conv(G ∪ −G)},</p>
      <p>where −G := {−g | g ∈ G}, cl_X denotes the closure with
respect to the topology induced by the norm ‖·‖_X, and
conv denotes the convex hull. Variation with respect to
Heaviside perceptrons (called variation with respect to
half-spaces) was introduced in [<xref ref-type="bibr" rid="ref1">1</xref>] and extended to general
dictionaries in [<xref ref-type="bibr" rid="ref9">9</xref>].</p>
      <p>It is easy to check (see [<xref ref-type="bibr" rid="ref13">13</xref>]) that for a finite
dictionary G and any f such that the set W_f(G) is nonempty,
the G-variation of f is equal to the minimum of l1-norms of
output-weight vectors of shallow networks with units from
G which compute f, i.e.,</p>
      <p>‖f‖_G = min {‖w‖_1 | w ∈ W_f(G)}.</p>
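      <p>For a small domain and a finite dictionary, this identity can be
checked numerically. The sketch below (our illustration, not code from
the paper; it assumes NumPy and SciPy are available) computes
min {‖w‖_1 | w ∈ W_f(G)} by linear programming, splitting w into its
positive and negative parts:</p>
      <preformat># Sketch: G-variation of f as the minimal l1-norm of output weights,
# computed by linear programming. Writing w = u - v with u, v nonnegative
# turns "minimize ||w||_1 subject to A w = f" into a standard-form LP.
import numpy as np
from scipy.optimize import linprog

m, k = 8, 20                               # card X and dictionary size
rng = np.random.default_rng(1)
A = rng.choice([-1.0, 1.0], size=(m, k))   # columns are g_1, ..., g_k in B(X)
f = rng.choice([-1.0, 1.0], size=m)        # target function in B(X)

c = np.ones(2 * k)                         # objective: sum(u) + sum(v)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=f, bounds=(0, None))
if res.success:
    print("G-variation of f (minimal l1-norm):", res.fun)
else:
    print("f is not in span G for this random dictionary")</preformat>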
        <sec id="sec-2-4-1">
          <title>Thus lower bounds on minima of l1-norms of output</title>
          <p>
            weight vectors of networks computing a function f can be
obtained from lower bounds on variational norms. Such
bounds can be derived using the following theorem, which
is a special case of a more general result [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] proven using
          </p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>Hahn-Banach theorem. By G⊥ is denoted the orthogonal</title>
        <p>complement of G in the Hilbert space F (X ).</p>
        <p>Theorem 1. Let X be a finite subset of Rd and G be a
bounded subset of F (X ). Then for every f ∈ F (X ) \ G⊥,
f 2
k k
k f kG ≥ supg∈G |hg, f i|
.</p>
      </sec>
      <sec id="sec-2-6">
        <title>So functions which are nearly orthogonal to all elements</title>
        <p>of a dictionary G have large G-variations. On the other
hand, if a function is correlated with some element of G,
then it is close to this element and so can be well
approximated by an element of G.
3</p>
      </sec>
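      <p>Continuing the sketch above, the lower bound of Theorem 1 is a
one-line computation (our illustration; for f in B(X) one has
‖f‖² = m, so the bound is m divided by the largest absolute inner
product of f with the dictionary):</p>
      <preformat># Sketch: the Theorem 1 lower bound on G-variation, reusing A and f
# from the previous snippet (f must not be orthogonal to all of G).
corr = np.abs(A.T @ f).max()   # largest absolute inner product with G
if corr > 0:
    print("Theorem 1 lower bound:", (f @ f) / corr)  # never exceeds res.fun</preformat>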
    </sec>
    <sec id="sec-3">
      <title>3 Probabilistic bounds</title>
      <sec id="sec-3-1">
        <title>When we do not have any prior knowledge about a type of</title>
        <p>classification tasks to be computed, we have to assume that
a network from the class has to be capable to compute any
uniformly randomly chosen function on a given domain.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Often in practical applications, most of the binary-valued</title>
        <p>functions o a given domain are not likely to represent tasks
of interest. In such cases some knowledge is available that
can be expressed in terms of a discrete probability measure
on the set of all functions on X .</p>
      <p>For a finite domain X = {x_1, . . . , x_m} ⊂ R^d, a
function f in B(X) can be represented as a vector
(f(x_1), . . . , f(x_m)) ∈ {−1, 1}^m ⊂ R^m. We assume that for
each x_i ∈ X, there exists a known probability p_i ∈ [0, 1]
that f(x_i) = 1. For p = (p_1, . . . , p_m), we denote by
ρ_p : B(X) → [0, 1] the product probability defined for every
f ∈ B(X) as</p>
      <p>ρ_p(f) := ∏_{i=1}^{m} ρ_{p,i}(f),</p>
      <p>where ρ_{p,i}(f) := p_i if f(x_i) = 1 and ρ_{p,i}(f) := 1 − p_i if
f(x_i) = −1. It is easy to verify that ρ_p is a probability
measure on B(X).</p>
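      <p>A short simulation (our sketch, not code from the paper; all names
are ours) illustrates both the sampling of f according to ρ_p and the
concentration of the inner products studied below:</p>
      <preformat># Sketch: sampling from the product measure rho_p on B(X) and observing
# the concentration of inner products with a fixed h in B(X).
import numpy as np

rng = np.random.default_rng(0)
m = 400                                # size of the domain X
p = rng.uniform(0.3, 0.9, size=m)      # known probabilities that f(x_i) = 1

def sample_f(rng, p):
    # f(x_i) = 1 with probability p_i, else -1
    return np.where(rng.random(p.size) > 1.0 - p, 1.0, -1.0)

h = sample_f(rng, p)                   # some fixed h in B(X)
inner = np.array([sample_f(rng, p) @ h for _ in range(20000)])
lam = m ** (-0.25)                     # the choice lambda = m^(-1/4)
tail = np.mean(np.abs(inner - inner.mean()) > m * lam)
print("empirical tail:", tail,
      "  Chernoff-Hoeffding bound:", np.exp(-m * lam ** 2 / 2))</preformat>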
      <p>When card X is large, the set F(X) is isometric to a
high-dimensional Euclidean space and B(X) to a
high-dimensional Hamming cube. In high-dimensional spaces
and cubes various concentration of measure phenomena
occur [<xref ref-type="bibr" rid="ref14">14</xref>]. We apply the Chernoff-Hoeffding Bound on
sums of independent random variables, which do not need
to be identically distributed [5, Theorem 1.11], to obtain
estimates of distributions of inner products of any fixed
function h ∈ B(X) with functions randomly chosen from
B(X) with probability ρ_p.</p>
      <sec id="sec-3-3">
        <title>Theorem 2 (Chernoff-Hoeffding Bound). Let m be a pos</title>
        <p>itive integer, Y1, . . . ,Ym independent random variables with
values in real intervals of lengths c1, . . . , cm, respectively,
m
ε &gt; 0, and Y := ∑i=1 Yi. Then</p>
        <p>Pr (|Y − E(Y )| ≥ ε) ≤ e− ∑im2=ε12ci2 .</p>
      <p>For a function h ∈ B(X) and p = (p_1, . . . , p_m), where
p_i ∈ [0, 1], we denote by</p>
      <p>μ(h, p) := E_p(⟨h, f⟩ | f ∈ B(X))</p>
      <p>the mean value of inner products of h with f randomly
chosen from B(X) with probability ρ_p, and by h° := h/‖h‖
its normalization. The next theorem estimates the
distribution of these inner products.</p>
      <p>Theorem 3. Let X = {x_1, . . . , x_m} ⊂ R^d, p = (p_1, . . . , p_m)
be such that p_i ∈ [0, 1], i = 1, . . . , m, and h ∈ B(X). Then
the inner product of h with f randomly chosen from B(X)
with probability ρ_p(f) satisfies for every λ &gt; 0</p>
      <p>i) Pr(|⟨f, h⟩ − μ(h, p)| &gt; mλ) ≤ e^{−mλ²/2};</p>
      <p>ii) Pr(|⟨f°, h°⟩ − μ(h, p)/m| &gt; λ) ≤ e^{−mλ²/2}.</p>
      <p>Proof. Let F_h : B(X) → B(X) be an operator
composed of sign flips mapping h to the constant function
equal to 1, i.e., F_h(h)(x_i) = 1 for all i = 1, . . . , m and,
for all f ∈ F(X) and all i = 1, . . . , m, F_h(f)(x_i) = f(x_i)
if h(x_i) = 1 and F_h(f)(x_i) = −f(x_i) if h(x_i) = −1. Let
p(h) = (p(h)_1, . . . , p(h)_m) be defined as p(h)_i = p_i if
h(x_i) = 1 and p(h)_i = 1 − p_i if h(x_i) = −1. The inverse
operator F_h^{−1} maps the random variable F_h(f) ∈ B(X) such
that</p>
      <p>Pr(F_h(f)(x_i) = 1) = p(h)_i    (1)</p>
      <p>to the random variable f ∈ B(X) such that</p>
      <p>Pr(f(x_i) = 1) = p_i.</p>
      <sec id="sec-3-4">
        <title>The next theorem estimates probability distributions of variational norms in dependence on the size of a dictionary.</title>
        <p>
          Theorem 5. Let X = {x1, . . . , xm} ⊂ Rd , G =
{g1, . . . , gk} ⊂ B(X ), and p = (p1, . . . , pm) such that
pi ∈ [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ], i = 1, . . . , m. Then for every f ∈ B(X )
randomly chosen according to ρp and every λ &gt; 0
m
Pr k f kG ≥ μG(p) + mλ
&gt; 1 − k e− mλ22 .
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>Proof. By Theorem 3 (i), we get</title>
        <p>Pr |h f , hi − μ(h, p)| &gt; mλ ∀h ∈ G</p>
      </sec>
      <sec id="sec-3-6">
        <title>Hence,</title>
        <p>≤ ke− mλ22 .
&gt; 1 − ke− mλ22 .</p>
        <p>Pr ( f (xi) = 1) = pi.</p>
        <p>Since the inner product is invariant under sign flipping,
for every f ∈ B(X ) we have h f , hi = hFh( f ), (1, . . . , 1)i =
m
∑i=1 Fh( f )(xi). Thus the mean value of the sum of random
m
variables ∑i=1 Fh( f )(xi) is μ(h, p). Applying to this sum
the Chernoff-Hoeffding Bound stated in Theorem 2 with
c1 = · · · = cm = 2 and ε = mλ , we get</p>
        <p>m
Pr | ∑ Fh( f )(xi) − μ(h, p)| &gt; mλ
i=1
which proves i).</p>
        <p>ii) follows from i) as all functions in B(X ) have norms
equal to √m. 2</p>
      <p>Theorem 3 shows that when the domain X is large, most
inner products of any given function with functions
randomly chosen from B(X) with probability ρ_p are
concentrated around their mean values. For example, setting
λ = m^{−1/4}, we get e^{−mλ²/2} = e^{−m^{1/2}/2}, which decreases
exponentially fast with increasing size m of the domain.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Dictionaries for efficient classification</title>
      <sec id="sec-4-1">
        <title>Theorem 3 implies that when a dictionary G contains a</title>
        <p>function h, for which the mean value μ(h, p) is large, then
most functions randomly chosen with respect to the
probability distribution ρp are correlated with h. Thus most
classification tasks characterized by ρp can be well
approximated by a network with just one element h. A dictionary</p>
      </sec>
      <sec id="sec-4-2">
        <title>G is also suitable for a given task when such function h can be well approximated by a small network with units from G.</title>
      </sec>
      <sec id="sec-4-3">
        <title>It is easy to calculate the mean value μ(h, p) of inner</title>
        <p>products of a fixed function h from B(X ) with randomly
chosen functions from B(X ) with respect to the
probability ρp.</p>
        <p>
          Proposition 4. Let h ∈ B(X ) and p = (p1 . . . , pm), where
pi ∈ [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ] for each i = 1, . . . , m. Then for a function f
randomly chosen in B(X ) according to ρp, the mean value
of h f , hi satisfies
μ(h, p) = ∑ (2pi − 1) + ∑ (1 − 2pi) ,
        </p>
        <p>i∈Ih i∈Jh
where Ih = {i ∈ {1, . . . , m} | h(xi) = 1} and Jh = {i ∈
{1, . . . , m} | h(xi) = −1}.</p>
      </sec>
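      <p>Proposition 4 is easy to confirm numerically (our sketch, not code
from the paper; the closed form is compared against a Monte Carlo
estimate):</p>
      <preformat># Sketch: closed form of Proposition 4 vs. a Monte Carlo estimate.
import numpy as np

rng = np.random.default_rng(2)
m = 200
p = rng.uniform(0.0, 1.0, size=m)               # probabilities p_i
h = np.where(rng.random(m) > 0.5, 1.0, -1.0)    # fixed h in B(X)

# mu(h, p) = sum over I_h of (2 p_i - 1) + sum over J_h of (1 - 2 p_i)
mu = np.sum(2 * p[h == 1.0] - 1) + np.sum(1 - 2 * p[h == -1.0])

def sample_f(rng, p):
    # f(x_i) = 1 with probability p_i, else -1
    return np.where(rng.random(p.size) > 1.0 - p, 1.0, -1.0)

est = np.mean([sample_f(rng, p) @ h for _ in range(50000)])
print("closed form:", mu, "  Monte Carlo:", est)</preformat>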
      <sec id="sec-4-4">
        <title>By Theorem 1, variation with respect to a dictionary of a function is large when the function is nearly orthogonal</title>
        <p>Pr |h f , hi − μ(h, p)| ≤ mλ ∀h ∈ G
As |h f , hi− μ(h, p)| ≤ mλ implies |h f , hi| ≤ μ(h, p)+mλ ,
we get</p>
        <p>Pr |h f , hi| ≤ μ(h, p) + mλ ∀h ∈ G
&gt; 1 − ke− mλ22 .</p>
      </sec>
      <sec id="sec-4-5">
        <title>So by Theorem 1</title>
        <p>m
Pr k f kG ≥ μ(h, p) + mλ ∀h ∈ G
&gt; 1 − k e− mλ22 .</p>
        <p>Since by the definition, for everyh ∈ G one has μG(p) ≥
μ(h, p), we obtain</p>
        <p>m m
μG(p) + mλ ≤ μ(h, p) + mλ
and so</p>
        <p>m
Pr k f kG ≥ μG(p) + mλ
&gt; 1 − k e− mλ22 .</p>
        <p>2</p>
        <p>Theorem 5 shows that when for all computational units
h in a dictionary G, the mean values μ(h, p) are small,
then for large m almost all functions randomly chosen
according to ρp are nearly orthogonal to all elements of the
dictionary G. For example, setting λ = m−1/4, we get a
m1/2
probability greater than 1 − ke− 2 that a randomly
chosen function has G-variation greater or equal to μG(p)m+m3/4 .
Thus when for large m, μG(p) is small, G-variation of most
m
functions is large unless the size k of a dictionary G
outweighs the factor e− mλ22 .</p>
      </sec>
      <sec id="sec-4-6">
        <title>Function with large G-variations cannot be computed by networks that have both the number of hidden units and all absolute vales of output weights small.</title>
        <p>
          Corollary 1. Let X = {x1, . . . , xm} ⊂ Rd , G =
{g1, . . . , gk} ⊂ B(X ), and p = (p1, . . . , pm) such that
pi ∈ [
          <xref ref-type="bibr" rid="ref1">0, 1</xref>
          ], i = 1, . . . , m. Then for every f ∈ B(X )
randomly chosen according to ρp, and every λ &gt; 0,
m
Pr min kwk1 | w ∈ Wf (G) ≥ μG(p) + mλ
&gt;
1 − k e− mλ22 .
        </p>
      </sec>
      <sec id="sec-4-7">
        <title>Corollary 1 implies that computation of most classifica</title>
        <p>tion tasks randomly chosen from B(X ) with the product
probability ρp either requires to perform an ill-conditioned
task by a moderate network or a well-conditioned task by
a large network.</p>
        <p>In particular, for the uniform distribution pi = 1/2 for
all i = 1, . . . , m, for every h ∈ B(X ) the mean value μ(h, p)
is zero. Thus for any dictionary G ⊂ B(X ), almost
all functions uniformly randomly chosen from B(X ) are
nearly orthogonal to all elements of the dictionary. So we
get the following two corollaries.</p>
        <p>Corollary 2. Let X = {x1, . . . , xm} ⊂ Rd and f ∈ B(X )
be uniformly randomly chosen. Then for every h ∈ B(X )
and every λ &gt; 0</p>
        <p>Pr |h f , hi| &gt; mλ</p>
        <p>Corollary 3. Let X = {x1, . . . , xm} ⊂ Rd and G =
{g1, . . . , gk} ⊂ B(X ). Then for every f ∈ B(X ) uniformly
randomly chosen and every λ &gt; 0</p>
        <p>1
Pr k f kG ≥ λ</p>
        <p>≥ 1 − k e− mλ22 .</p>
        <p>When we do not have any a priori knowledge about the
task, we have to assume that the probability on B(X ) is
uniform. Corollary 3 shows that unless a dictionary G is
sufficiently large to outweigh the factore− mλ22 , most
functions randomly chosen in B(X ) according to ρp have
Gvariations greater or equal to 1/λ . So for small λ and
sufficiently largem, most such functions cannot be computed
by linear combinations of small numbers of elements of</p>
      </sec>
      <sec id="sec-4-8">
        <title>G with small coefficients. Similar situation occurs when probabilities are nearly uniform.</title>
      </sec>
      <sec id="sec-4-9">
        <title>Many common dictionaries used in neurocomputing are</title>
        <p>
          m1/2
relatively small with respect to the factor e− 2 . For
example, the size of the dictionary of signum perceptrons
Pd (X ) on a set X of m points in Rd is well-known since
the work of Schläfli [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. He estimated the number of
linearly separated dichotomies of m points in Rd . His upper
bound states that for every X ⊂ Rd such that card X = m,
d
card Pd (X ) ≤ 2 ∑
l=1
m − 1
l
        </p>
        <p>md
≤ 2 d!
.</p>
        <p>
          (2)
(see, e.g., [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]). The set Pd (X ) forms only a small fraction
of the set of all functions in the set B(X ), whose
cardinality is equal to 2m. Also other dictionaries of {−1,
1}valued functions generated by dichotomies of m points in
Rd defined by nonlinear separating surfaces (such as
hyperspheres or hypercones) are relatively small (see [4,
Table I ]).
5
        </p>
      </sec>
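      <p>The gap between such dictionary sizes and the factor e^{−m^{1/2}/2}
is easy to examine with exact integer arithmetic (our sketch, not code
from the paper; it uses the reconstructed bound (2) with the sum taken
over l = 0, . . . , d):</p>
      <preformat># Sketch: Schlaefli/Cover upper bound on the number of signum perceptrons
# versus card B(X) = 2^m and versus the factor k * exp(-sqrt(m)/2).
import math

def perceptron_dichotomies(m, d):
    """Upper bound 2 * sum_{l=0..d} binom(m-1, l) on card P_d(X)."""
    return 2 * sum(math.comb(m - 1, l) for l in range(d + 1))

m, d = 10 ** 6, 20
k = perceptron_dichotomies(m, d)
print("card P_d(X) bound:  about 10 **", len(str(k)) - 1)
print("card B(X) = 2^m:    about 10 **", round(m * math.log10(2)))
# factor from Corollary 3 with lambda = m^(-1/4), as a base-10 exponent:
log10_factor = (len(str(k)) - 1) - math.sqrt(m) / (2 * math.log(10))
print("k * exp(-sqrt(m)/2): about 10 **", round(log10_factor))</preformat>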
    </sec>
    <sec id="sec-5">
      <title>5 Discussion</title>
      <p>As the number of binary-valued functions modeling
classification tasks grows exponentially with the size of their
domains, we proposed to model the relevance of such tasks
for a given application area by a probabilistic model. For
sets of classification tasks endowed with product
probability distributions, we investigated the complexity of networks
computing these tasks. We explored network
complexity in terms of approximate measures of sparsity
formalized by l1-norms and variational norms. For functions on large
domains, we analyzed implications of the concentration
of measure phenomena for correlations between network
units and randomly chosen functions.</p>
      <p>We focused on classification tasks characterized by
product probabilities. To derive estimates of the complexity of
networks computing randomly chosen functions, we used
the Chernoff-Hoeffding Bound on sums of independent
random variables. An extension of our analysis to tasks
characterized by more general probability distributions is
a subject of our future work. To obtain estimates for more
general probability distributions, we plan to apply versions
of the Chernoff-Hoeffding Bound stated in [<xref ref-type="bibr" rid="ref6">6</xref>], which hold
in situations when the random variables are not independent.</p>
      <sec id="sec-5-1">
        <title>Acknowledgments. V. K. was partially supported by</title>
        <p>the Czech Grant Foundation grant GA18-23827S and
institutional support of the Institute of Computer Science</p>
      </sec>
      <sec id="sec-5-2">
        <title>RVO 67985807. M. S. was partially supported by a</title>
        <p>FFABR grant of the Italian Ministry of Education,
University and Research (MIUR). He is Research Associate
at INM (Institute for Marine Enginering) of CNR
(National Research Council of Italy) under the Project PDGP
2018/20 DIT.AD016.001 “Technologies for Smart
Communities” and he is a member of GNAMPA-INdAM
(Gruppo Nazionale per l’Analisi Matematica, la
Probabilità e le loro Applicazioni - Instituto Nazionale di Alta</p>
      </sec>
      <sec id="sec-5-3">
        <title>Matematica).</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Barron</surname>
          </string-name>
          .
          <article-title>Neural net approximation</article-title>
          . In K. S. Narendra, editor,
          <source>Proc. 7th Yale Workshop on Adaptive and Learning Systems</source>
          , pages
          <fpage>69</fpage>
          -
          <lpage>72</lpage>
          . Yale University Press,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          .
          <article-title>Deep learning of representations</article-title>
          .
          In M. Bianchini, M. Maggini, and L. Jain, editors,
          <source>Handbook of Neural Information Processing</source>
          . Springer, Berlin, Heidelberg,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Delalleau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. Le</given-names>
            <surname>Roux</surname>
          </string-name>
          .
          <article-title>The curse of highly variable functions for local kernel machines</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , volume
          <volume>18</volume>
          , pages
          <fpage>107</fpage>
          -
          <lpage>114</lpage>
          . MIT Press,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Cover</surname>
          </string-name>
          .
          <article-title>Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition</article-title>
          .
          <source>IEEE Trans. on Electronic Computers</source>
          ,
          <volume>14</volume>
          :
          <fpage>326</fpage>
          -
          <lpage>334</lpage>
          ,
          <year>1965</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Doerr</surname>
          </string-name>
          .
          <article-title>Analyzing randomized search heuristics: Tools from probability theory</article-title>
          .
          <source>In Theory of Randomized Search Heuristics - Foundations and Recent Developments, chapter 1</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          . World Scientific Publishing,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Dubhashi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Panconesi</surname>
          </string-name>
          .
          <article-title>Concentration of Measure for the Analysis of Randomized Algorithms</article-title>
          . Cambridge University Press, New York,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Fine</surname>
          </string-name>
          .
          <source>Feedforward Neural Network Methodology</source>
          . Springer, Berlin Heidelberg,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ito</surname>
          </string-name>
          .
          <article-title>Finite mapping by neural networks and truth functions</article-title>
          .
          <source>Mathematical Scientist</source>
          ,
          <volume>17</volume>
          :
          <fpage>69</fpage>
          -
          <lpage>77</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          .
          <article-title>Dimension-independent rates of approximation by neural networks</article-title>
          . In K. Warwick and M. Kárný, editors,
          <source>Computer-Intensive Methods in Control and Signal Processing. The Curse of Dimensionality</source>
          , pages
          <fpage>261</fpage>
          -
          <lpage>270</lpage>
          . Birkhäuser, Boston, MA,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          .
          <article-title>Complexity estimates based on integral transforms induced by computational units</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>33</volume>
          :
          <fpage>160</fpage>
          -
          <lpage>167</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Approximate minimization of the regularized expected error over kernel models</article-title>
          .
          <source>Mathematics of Operations Research</source>
          ,
          <volume>33</volume>
          :
          <fpage>747</fpage>
          -
          <lpage>756</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguineti</surname>
          </string-name>
          .
          <article-title>Model complexities of shallow networks representing highly varying functions</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>171</volume>
          :
          <fpage>598</fpage>
          -
          <lpage>604</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kůrková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Savický</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Hlaváčková</surname>
          </string-name>
          .
          <article-title>Representations and rates of approximation of real-valued Boolean functions by neural networks</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>11</volume>
          :
          <fpage>651</fpage>
          -
          <lpage>659</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ledoux</surname>
          </string-name>
          .
          <article-title>The Concentration of Measure Phenomenon</article-title>
          . AMS, Providence,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tegmark</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Rolnick</surname>
          </string-name>
          .
          <article-title>Why does deep and cheap learning work so well</article-title>
          ?
          <source>J. of Statistical Physics</source>
          ,
          <volume>168</volume>
          :
          <fpage>1223</fpage>
          -
          <lpage>1247</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H. N.</given-names>
            <surname>Mhaskar</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Poggio</surname>
          </string-name>
          .
          <article-title>Deep vs. shallow networks: An approximation theory perspective</article-title>
          .
          <source>Analysis and Applications</source>
          ,
          <volume>14</volume>
          :
          <fpage>829</fpage>
          -
          <lpage>848</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Plan</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Vershynin</surname>
          </string-name>
          .
          <article-title>One-bit compressed sensing by linear programming</article-title>
          .
          <source>Communications in Pure and Applied Mathematics</source>
          ,
          <volume>66</volume>
          :
          <fpage>1275</fpage>
          -
          <lpage>1297</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Schläfli</surname>
          </string-name>
          . Theorie der Vielfachen Kontinuität. Zürcher &amp; Furrer, Zürich,
          <year>1901</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hastie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          .
          <article-title>Statistical Learning with Sparsity: The Lasso and Generalizations</article-title>
          . Chapman &amp; Hall/CRC, London,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.M.</given-names>
            <surname>Tillmann</surname>
          </string-name>
          .
          <article-title>On the computational intractability of exact and approximate dictionary learning</article-title>
          .
          <source>IEEE Signal Processing Letters</source>
          ,
          <volume>22</volume>
          :
          <fpage>45</fpage>
          -
          <lpage>49</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>