Learning simplified functions to understand

Bruno Apolloni (1) and Ernesto Damiani (2)

(1) Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano, apolloni@di.unimi.it
(2) Center on Cyber-Physical Systems, Khalifa University, ernesto.damiani@ku.ac.ae

Abstract. We propose an unprecedented approach to post-hoc interpretable machine learning. Facing a complex phenomenon, rather than fully capturing its mechanisms through a universal learner, albeit structured in modular building blocks, we train a robust neural network, no matter its complexity, to use as an oracle. Then we approximate its behavior via a linear combination of simple, explicit functions of its input. Simplicity is achieved by (i) marginal functions mapping individual inputs to the network output, (ii) these functions consisting of univariate polynomials of low degree, and (iii) a small number of polynomials being involved in the linear combination, whose input is properly granulated. With this contrivance, we handle various real-world learning scenarios arising from the composition of expertise and experimental frameworks, ranging from cooperative training instances to transfer learning. Concise theoretical considerations and comparative numerical experiments further detail and support the proposed approach.

Keywords: Explainable AI, post-hoc interpretable ML, ridge polynomials, compatible explanation, transfer learning, minimum description length.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

When facing complex phenomena, we have surrendered to the idea that someone else provides immediate answers to our questions about them. It is our typical attitude when we query search engines (we "google" something out) or any URM (universal responder machine), thus reviving old myths such as the Delphi Oracle or the Golem. The same occurs when we decide to solve either a regression or a classification problem via cognitive algorithms, such as a neural network. The operational pact is: once we assume that one of the above subjects is well assessed, we may inquire and get a response, but we are not allowed to ask why that answer. Cognitive systems have provided formidable solutions to many problems, such as speech-to-text or text-to-speech translation. The rapid evolution of deep neural networks and their pervasion of many appliance interfaces, jointly with the widespread habit of searching for answers from online URMs, prospect disquieting Orwellian scenarios. Whole communities of people renounce understanding why some answers occur, as a modern implementation of the ancient statement "contra factum non valet argumentum" (Aristotle, Metaph., L. IV, c. III), with answers taking the value of facts in this scenario.

Willing to elaborate on complex phenomena such as traffic jams in a crowded city, the structural stress of a bridge, or money lending in a FinTech environment, we need to state relationships between the target variable y (such as traffic intensity or stress value) and the broad array x of input variables characterizing the phenomenon's environment. Let us assume we have an algorithm g available such that, for many x ∈ X, g(x) ≈ y with a satisfactory approximation. However, g may not be understandable, either because it is unknown, like an Oracle, or because it is a function so complex and endowed with so many parameters that we cannot perceive its real trend (or any intermediate condition).
Lack of understandability is a typical issue for Artificial Neural Networks; since the Eighties of the past century it has motivated the search for techniques deducing formal expressions from trained neural networks, a goal loosely recapped in the sentence "passing from synapses to rules" [3]. This issue is currently reviving, now addressed to AI tools in general.

Today, possibly fostered by the general simplificative mood [4], researchers look for understandable functions g̃ replacing g, a thread that the Explainable AI taxonomy [1] denotes as Interpretable Machine Learning with a global-model-agnostic interpretability goal. Simplification may happen at various levels and with various strategies in search of surrogate models, all pivoting around a few strongholds like:

– Linear relations are the most understandable and identifiable as well. The most elementary ones are just the sum of atomic functions plus a bias.
– To be efficient, the atomic functions must be relatively simple, albeit not necessarily linear in turn, hence multi-derivable and with a small number of arguments, possibly only one, obtained through a projection from the entire X. In the last case, we expect the projections to be grosso modo orthogonal to one another.

Within the family of Additive Index Models (AIM), the general model can be written as [19]:

  g̃(x) = ν + γ_1 h_1(β_1^T x) + … + γ_k h_k(β_k^T x) + ε        (1)

where

– β_i identifies hyperplane projections rendering the h_i ridge functions (informally, we obtain a ridge function of a vector x by computing a univariate function of the inner product between x and a constant vector a representing the direction; as a result, the ridge function is constant on the hyperplanes orthogonal to the ridge direction). The general wisdom is to work with orthogonal projections (limit case β_i^T x = x_i) in order to better exploit the information in x.
– The shape of the h_i's is up to us. Common solutions are:
  1. h_i a polynomial, possibly linear, so that the overall expression of the complex function corresponds to the solution of a linear regression problem;
  2. h_i a spline function, so that we can infer its shape according to entropic criteria [20, 14];
  3. h_i a neural network, to have many parameters (though well organized to support interpretation), perfectly fitting g (at least theoretically) via g̃ [16].
– The coefficients γ_i are 1 in the original AIM [14]; they represent a suitable set of parameters in many other models.

Worth mentioning that interpreting g̃(x) = g̃(E[y|x]) = g̃(µ) with ε = 0 leads to Generalized Additive Models (GAM) [9] (Generalized Linear Models (GLM) for h_i linear). Identifiability [20] and convergence [14] results are available for options 2 and 3, where convergence holds for the mean square error on the pair (g(x), g̃(x)) under the usual training procedure of g̃ as a whole.

Our approach is rather different. We maintain the above general model (1). We train the function in an ensemble learning fashion with h_i as in option 1, and we set β_i^T x = x_i. In other words, we train each single h_i to replicate the marginal shape of g for the individual x_i's. This results in a one-shot training. Then we estimate the linear combination of the h_i's in (1). Moreover, in order to make sure that the learned h_i's stay simple enough to afford human understanding of the overall expression involving them, we bargain the (g(x), g̃(x)) Mean Squared Error (MSE) with 1) the number of components of x involved in each h_i and 2) the degree of the polynomials in h_i, both according to the minimum description length criterion, and 3) the accuracy of x, i.e., the size of its vector quantization.
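As a notational aid, the following minimal Python sketch shows how a surrogate of form (1) is evaluated; the bias, weights, ridge directions, and ridge functions used in the toy instantiation are purely illustrative and are not taken from the paper.

```python
import numpy as np

def aim_predict(X, nu, gammas, betas, ridge_funcs):
    """Evaluate the Additive Index Model (1):
    g~(x) = nu + sum_i gamma_i * h_i(beta_i^T x)."""
    out = np.full(X.shape[0], nu, dtype=float)
    for gamma_i, beta_i, h_i in zip(gammas, betas, ridge_funcs):
        out += gamma_i * h_i(X @ beta_i)   # h_i is univariate: a ridge function
    return out

# Toy instantiation on 3 inputs: two ridge directions, one quadratic and one linear h_i.
X = np.random.uniform(-1, 1, size=(5, 3))
g_tilde = aim_predict(
    X, nu=0.1,
    gammas=[1.0, -0.5],
    betas=[np.array([1.0, 0.0, 0.0]),                 # limit case beta^T x = x_1
           np.array([0.0, 1.0, 1.0]) / np.sqrt(2)],
    ridge_funcs=[lambda t: t**2, lambda t: t])
```

Our specialization amounts to restricting the β_i to coordinate projections and the h_i to low-degree univariate polynomials, as sketched for the full procedure in Section 3.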
The original cognitive system's role remains crucial in our approach, but we move it to the background. A deeply-trained neural network acts as an oracle that supplies all the training examples needed for a very accurate inference of the approximating functions. These functions supply a manageable interpretation of what the cognitive system may have learned in a sub-symbolic way. Of course, lay users may find the resulting interpretation hard to follow (making it a non-immediately actionable explanation according to [1]). Still, we claim it provides a concrete and trustworthy starting point for users with some technical education wishing to find an explanation of the cognitive system's behavior (when trying to understand complex behavior, there is no "free lunch" [18]).

The paper is structured as follows. Section 2 illustrates the theoretical framework of our approach. Section 3 describes our inference procedure, which we call Marginal Functions Ensemble (MFE). Section 4 provides numerical experiments contrasted with literature results, while concluding remarks are given in Section 5.

2 Compatible functions

While the spring of modern science has been the willingness to develop mathematical theories to describe/explain the world's workings, we may say that the aim of a somehow post-modern science is, more modestly, to face the complexity of phenomena with compatible tools. The classical paradigm is the airplane wing: it does the same job as birds' wings but is more straightforward and feasible. Facing complex phenomena such as those mentioned in the Introduction, the scientific community first relied on probability models as the last frontier before the unexplainable (hic sunt leones). It later surrendered to cognitive algorithms (with a transition through fuzzy set frameworks). The final paradigm is: "It does not matter why, provided it works" [11]. This sentence summarizes the general philosophy of cognitive algorithms, whose operation is inspired by living organisms, human beings in primis, and whose most common operational paradigm is "learning from examples."

Actually, in the last three decades, a hard competition has been fought between two ways of implementing the paradigm. Besides the sub-symbolic one, represented by the cognitive algorithms having neural networks as the main computational tool, the second way, the symbolic one [11], is represented by the computational learning vein having PAC-learning methods as the main theoretical tools.

– Neural networks are families of general-purpose functions endowed with many parameters, so as to be adaptable to reproduce any computable function (see [7]).
– PAC-Learning addresses specific classes of functions, described by explicit symbolic expressions, to be satisfactorily approximated by items of specific classes of hypotheses, with the same description facilities.

Hence, in search of compatible tools, the second way would appear the most suitable for us. PAC stands for "Probably Approximately Correct"; the general scheme is the following. Consider a class of functions C that is hard, to some extent, to identify; for instance, circles dividing positive from negative points on a plane (do not be misled by the simplicity of the function: try to devise from scratch an algorithm that identifies a circle dividing any set of points which are in turn separable by a circle). Then, for a given set S of points separated by an unknown circle c, the goal is to draw, based on S, a circle, call it h (a hypothesis), that is very close to c.
We have no ambition of discovering precisely c; instead, we look for an h that works almost as well as c, in the sense that, questioned on a new point x in the plane X, h(x) = c(x) with high probability (see Figure 1). In that, it meets precisely our notion of compatibility.

Fig. 1: PAC learning basics. X → the sampling space; c → the Boolean target function; h → the approximating hypothesis. Bullets → positive points; rhombuses → negative points. Outside the intersection between c and h, points will be labeled wrongly by h, i.e., c(x) ≠ h(x).

In opposition to the "Turing machine versus perceptron" duel for the title of computational paradigm champion, which was won by the former, the Computational Learning paradigm succumbed to neural networks in the learning-by-examples practice. Suppose we know C and H (the class of the hypotheses h), and we can compute a given complexity parameter of their symmetric difference (trivial when C and H are classes of circles, definitely less trivial in many real-life conditions). In that case, we could determine lower bounds on the learning algorithms' performance as a function of the size of S. This procedure is reminiscent of what happens, for instance, when expressing a linear regression problem's accuracy. Things, however, are more complex. In the case of Boolean functions like labeling circles, we express accuracy in terms of the probability of sampling a set S, based on which we may compute an h such that the symmetric difference between c and h has a given probability (hence an error probability, since c(x) ≠ h(x) therein) for the x distribution law. This result appears very powerful since it is distribution-independent (i.e., it holds whatever the distribution law of x), but unfortunately the upper bounds constraining the size of S are very loose, outside of any feasible sampling plan. Things work a bit better if we know the distribution law of x, but worse if we do not know C. Computational Learning Theory still remains a formidable field of theoretical elaborations but has been abandoned for practical purposes.

Nevertheless, for the reasons discussed in the Introduction, we believe that the idea of approximating a goal function g with a class H of symbolic functions is a fundamental one to achieve AI interpretability. In these pages, we try exploiting and adapting results from Computational Learning Theory in this vein. This style of learning introduces two complexity issues: computational complexity and sample complexity. The former concerns the running time of the algorithm which computes h. While it proves excessive if H is the class of Disjunctive Normal Forms (DNF), since the related satisfiability problem is NP-complete [8], in our case, where hypotheses consist of linear combinations of a small number of one-dimensional polynomials of small degree, this issue is irrelevant. Sample complexity is appreciated in terms of the upper bound on the number of samples to observe in order to have acceptable twin probabilities characterizing the learning goal. This quantity denotes the information amount the learning algorithm needs to identify a hypothesis within its class, thus resulting in a measure of the information richness/difficulty of the function to be learned – a measure that may denote its interpretability by humans.
Let us start from the following theorem as an essential reference for our considerations.

Theorem 1. [2] For a space X, assume we are given
– a concept class C on X with complexity index D_{C,C} = µ;
– a set Z_m of labeled samples z_m ∈ (X × {0, 1})^m;
– a fairly strongly surjective function A : Z_m → H computing consistent hypotheses.

Consider the families of random sets {c ∈ C : z_{m+M} = {(x_i, c(x_i)), i = 1, …, m + M}} when z_m spans the samples in Z_m and the specifications of their random suffixes Z_M, with M → +∞ according to any distribution law. For a given (z_m, h) with h = A(z_m), denote by
– µ_h the complexity index D_{C,H} for the adopted computational approximation,
– t_h the number of points misclassified by h,
– U_{c△h} the random variable given by the probability measure of the symmetric difference c △ h between c and h.

Then, for

  m ≥ max{ (2/ε) log(1/δ), 5.5 (µ_h + t_h − 1)/ε },

A is a learning algorithm for every z_m and c ∈ C ensuring that P(U_{c△h} ≤ ε) ≥ 1 − δ.

Let us highlight the following points:
1. The complexity index D_{A,B} is a function of the symmetric difference between the two classes in its argument, denoting how intricate that difference is. The most used index is the Vapnik-Chervonenkis dimension [15]; another suitable one is the detail [5]. The higher the complexity index, the greater the learning difficulty.
2. D increases with the decrease of the computational approximation, in turn depending on the accuracy.
3. The function A computing the hypothesis h is consistent if h(x) = c(x) for each x ∈ z_m. For a non-consistent A, we take note of the number t_h of mistaken samples.
4. Complexity and approximation of a hypothesis sum up (via µ_h + t_h) to determine the learning task's difficulty.

Though the theorem refers to classes of Boolean functions, we may derive the main lesson that three strongholds characterize the complexity of the task of learning a function:
1. the complexity of the class C to learn, paired with the class H from which to get its hypotheses,
2. the approximation of the hypothesis with which we want to explain the labeling of the points,
3. the accuracy with which we register the input data.

When trying to learn real-world functions, sample complexity plays a crucial role because of the cost of achieving samples. The high and possibly unknown complexity of C, when we abandon case studies, prevents us from hazarding analytical evaluations of the learned function's accuracy, normally left to the toss of test sets. What remains are the directions to improve our learning, which find immediate companions in the mechanisms with which our brain may interpret the learned function. Namely:
– the complexity of the hypothesis function maps into the number of involved variables and parameters. This mapping represents an operational implementation of the explanation selection activity [13] that the human brain performs to understand the behavior of an algorithm, or of a phenomenon in general, in terms of assignment of causal responsibilities [10]. Up to our experience, the number of variables and parameters should be at most 5; otherwise, most humans would not be able to understand their role in the target function [12].
– the approximation of the hypothesis maps in our brain into a trade-off between formulas that are precise but difficult to manage and approximate formulas from which to derive at least the first hints.
– the accuracy of the data maps into our attitude of "never shooting a sparrow with a cannon" or tracking a butterfly.
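To give a feel for the sample sizes prescribed by Theorem 1, here is a minimal Python evaluation of the bound; the values of ε, δ, µ_h and t_h are illustrative choices of ours, not figures from the paper.

```python
import math

def pac_sample_bound(eps, delta, mu_h, t_h):
    """Sample size prescribed by Theorem 1:
    m >= max{ (2/eps) * log(1/delta), 5.5 * (mu_h + t_h - 1) / eps }."""
    term_confidence = (2.0 / eps) * math.log(1.0 / delta)
    term_complexity = 5.5 * (mu_h + t_h - 1) / eps
    return math.ceil(max(term_confidence, term_complexity))

# Illustrative values: a hypothesis class of complexity index 5,
# a consistent learner (t_h = 0) versus one mistaking 10 samples.
for t_h in (0, 10):
    m = pac_sample_bound(eps=0.05, delta=0.05, mu_h=5, t_h=t_h)
    print(f"t_h = {t_h:2d}  ->  m >= {m}")
```

Even for such a small complexity index, the bound quickly reaches the order of thousands of samples, which concretizes the looseness lamented above.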
3 The MFE procedure

The core of the procedure is rather simple and quick to run. We may configure it as a function φ of three arguments:
1. the list ls of input variables to take into consideration,
2. the degree polydeg of the polynomials through which to interpolate the marginals g_i of the target function g, and
3. the granulation rate gr, i.e., the partition size according to which the input variables are vector quantized.

Namely (a minimal sketch of φ is given at the end of this section):
– each variable x ∈ ls is sorted and partitioned into groups of size gr; then each value in a group is replaced by the group mean, and the outcome is x̃;
– a base learner h_i is computed as the regression between the target y and the selected x̃_i;
– the approximating h is the linear regressor of y, again, on the weak learners' outputs.

The use of h and the evaluation of its performance are carried out as usual. Rather, some attention must be devoted to the experimental environment in which the procedure is applied. The three strongholds are:
1. Neural network as an oracle. We assume we have trained a sufficiently powerful neural network to approximate the target function satisfactorily. Hence, as with an oracle, we may obtain a reliable answer on any input.
2. Explanations like the ones from domain experts. Since we are not interested in discovering the truth, but rather in getting wisely understandable descriptions, we circumscribe the input domain to the field of interest.
3. Surgical bombardment of the field of interest. The "bombs" are symbolic functions with which we fit a huge set of input/output pairs supplied by the neural network. According to their description length, we drop them sequentially, i.e., in a progression with function degree and number of variables. Entropic criteria will decide whether each bomb is to be maintained in or removed from the final arsenal.
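The following is a minimal Python sketch of φ as we read the three steps above (granulation, marginal polynomial base learners, final linear combination), built on numpy and scikit-learn. The function names (granulate, mfe_fit, mfe_predict) are ours; the paper does not prescribe an implementation, so this is only one possible reading of the procedure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def granulate(x, gr):
    """Sort a variable, split it into groups of size gr, and replace
    each value with the mean of its group (a 1-D vector quantization)."""
    if gr <= 1:
        return x.copy()
    order = np.argsort(x)
    x_tilde = np.empty_like(x, dtype=float)
    for start in range(0, len(x), gr):
        idx = order[start:start + gr]
        x_tilde[idx] = x[idx].mean()
    return x_tilde

def mfe_fit(X, y, ls, polydeg, gr):
    """phi(ls, polydeg, gr): fit one univariate polynomial per selected
    variable (the marginal base learners h_i), then linearly combine
    their outputs with a final regression."""
    base_learners = []
    H = np.empty((len(y), len(ls)))
    for j, i in enumerate(ls):
        x_tilde = granulate(X[:, i], gr)
        coeffs = np.polyfit(x_tilde, y, deg=polydeg)   # h_i
        base_learners.append((i, coeffs))
        H[:, j] = np.polyval(coeffs, x_tilde)
    combiner = LinearRegression().fit(H, y)            # estimates nu and gamma_i
    return base_learners, combiner

def mfe_predict(X, base_learners, combiner):
    # at prediction time the polynomials are applied to the raw inputs
    H = np.column_stack([np.polyval(c, X[:, i]) for i, c in base_learners])
    return combiner.predict(H)
```

When a neural oracle is in place, the same routine can simply be fed with the oracle's outputs instead of the original targets, which is what we later call regressing the Oracle data.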
4 Numerical experiments

We carried out two families of experiments. The former is on artificial data, to compare the efficiency of our method with a template of the AIM-style ones [19]. The second one concerns real data available in the UCI repository.

4.1 Artificial data

For comparison's sake, we considered the simulation study S5 in [19], concerning the regression

  y = x_1 − x_2 + 2(x_3 + x_4 + x_5 + x_6) / (0.5 + (1.5 + x_3 + x_5 − x_4 − x_6)^2) + ε        (2)

A (training-set, test-set) pair of sizes (10000, 1000) has been generated by starting with a uniform random seed in [−1, 1] for each x_i, to which the same bias, constituted by a uniform random value in [−1, 1], was added to inject a correlation of 0.5 between each pair of variables. A final rescaling by a factor 0.5 was applied to gather the computed x_i's in the range [−1, 1] again. The y_i's are computed directly through (2) (a generation sketch is given at the end of this subsection).

In Figure 2 we contrast the MFE architecture (on the left) with the one in the reference paper (on the right). The symbolic functions in the blue boxes are replaced by neural networks in the azure boxes. Moreover, while in the former we have a double training, at the weak learners' level and at their ensemble level, in the latter we have the usual training of a neural network, albeit with a distinguished structure of the network.

Fig. 2: The computational architectures. Left: MFE; loss_i refers to the training of the base learners h_i, LOSS to the training of their linear combination Σ. Right: the xNN architecture from [19]; w and β correspond to β and γ in (1), respectively; N and Ω are normalizing nodes and weighting nodes, respectively; Loss refers to the overall training of xNN.

For training and test set sizes (10000, 10000), xNN claims a root mean square error on the test set RMSE = 1.0049 (actually a bit greater than the optimum of its competitors, = 1.0005). Figure 3 displays our results as a function of the three parameters of the routine φ. A first thing we note is that the polydeg = 2 surface almost coincides with the polydeg = 3 one (actually, in both cases we also considered a power x^{1/2}), both denoting RMSEs definitely lower than the xNN one. A second remark is on the apparently small sensitivity of the performance to the granulation. Figure 4, left, digs deeper into this aspect by showing an extensive course of this trend (gr from 1 to 1500). We note an almost insensitive RMSE growth for granulations up to 200. In any case, even for a granulation of 1500, the MFE performance is better than the xNN one. Finally, Figure 4, right, shows the scattering of the regressed values with respect to the original ones. The way of getting better accuracy will be dealt with in the next subsection in regard to natural data.

Fig. 3: MFE performance surfaces. z axis → RMSE; x axis → granulation size (ranging from 1 to 600); y axis → left extreme of the index interval selecting the involved variables from the set {x_1, …, x_6}. Surface label → degree of the regressed polynomials.

Fig. 4: Left: course of the RMSE with the granulation size; parameters as in the labels. Right: scatter plot of the regressed versus original data of the test set, in the most favorable setting of the parameters (all variables used, polydeg = 3, no vector quantization).
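For concreteness, a possible reading of the generation recipe described above (a shared uniform bias to induce a 0.5 correlation, followed by a 0.5 rescaling) is sketched below; the seed and the helper name make_s5 are ours, and the noise term ε of (2) is left out, as the text computes the y_i's directly through (2).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only

def make_s5(n):
    """Generate n samples of the S5 regression (2) with pairwise
    correlation 0.5 injected through a shared uniform bias."""
    x = rng.uniform(-1.0, 1.0, size=(n, 6))
    bias = rng.uniform(-1.0, 1.0, size=(n, 1))   # same bias for all x_i of a record
    x = 0.5 * (x + bias)                          # rescale back into [-1, 1]
    y = (x[:, 0] - x[:, 1]
         + 2.0 * (x[:, 2] + x[:, 3] + x[:, 4] + x[:, 5])
         / (0.5 + (1.5 + x[:, 2] + x[:, 4] - x[:, 3] - x[:, 5]) ** 2))
    # an additive noise term, as the epsilon in (2), could be added here
    return x, y

X_train, y_train = make_s5(10000)
X_test,  y_test  = make_s5(1000)
```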
4.2 Real data

Collecting real data on complex phenomena such as those mentioned in the Introduction is a costly task per se. Thus we moved to the public UCI repository [6] and to an application field that is the icon of simplified scenarios: Facebook transactions. Namely, the benchmark Facebook Comment Volume lists the number of comments in the first 24 hours after the publication of a post on Facebook as a function of 53 variables characterizing the post, such as "Page popularity" and "Page category." The number of records is 40,949; the overall variables are 54, of which we considered only the first 33. This decision limits the problem's complexity (up to us in the end) at the cost of neglecting some ancillary variables. In greater detail, the data (extracted by a Web crawler) are:

f1 number of likes for the source of the document.
f2 number of individuals who have so far visited this page.
f3 the daily number of activities by visitors to the page.
f4 category of the source of the document.
f5-f29 min, max, average, median, and standard deviation of features f30 to f33, plus the difference between f32 and f33.
f30 total number of comments before a pre-established date.
f31 number of comments in the last 24 hours before the pre-established date.
f32 number of comments in the last 48 to last 24 hours before the pre-established date.
f33 number of comments in the first 24 hours after the publication of the post, but before the pre-established date.

Finally, after deletion of less relevant variables based on p-values computed with the usual R regression tools (https://cran.r-project.org), we come to the problem of computing f33 from f25 to f32. We tackled this problem in three modes:
1. through a deep neural network,
2. through MFE on the original f33,
3. through MFE on the output of the deep neural network, where the latter plays the Oracle's role.

Moreover, we solved the problem at two sample scales: small (training set size = 1000, test set size = 1000 → experiment size 2000) and large (training set size = 10000, test set size = 10000 → experiment size 20000), overall working on the first 30,000 records of the dataset.

Deep neural network. We trained an 8-30-32-1 neural network (DNN) on the above dataset with standard Keras (https://keras.io) options, to baseline the learning difficulty of our task and to build up the Oracle for mode 3. To this aim, we preferred limiting the input variables exactly to the 8 on which we will base the interpretation of the learned network. Given the asymmetry of the variable distributions, we avoided any normalization and feature extraction (for instance, via PCA), apart from a uniform constraining of each variable's values to the interval [−1, 1]. Figure 5 reports the RMSE curves in recall as a function of the learning cycles, and the original-versus-reconstructed values scatter plots in both training and recall phases, for the two sizings of training and test sets. Despite the regular course of the RMSEs, with a noticeable improvement when we enlarge the training set, the first scatter plot, related to the smaller sample size, denotes overfitting with consequently lousy generalization on the test set. The second one is as expected. Hence we decided to use the second neural network as the Oracle and register the value 0.028016 as the RMSE of this Oracle.

Fig. 5: Course of the loss descent and learning scatter plots of the DNN. Parameters as in the labels.
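The paper states only that standard Keras options were used; as an indication, a minimal sketch of the 8-30-32-1 regressor might look as follows, where the activations, optimizer, loss, batch size, and number of epochs are our assumptions rather than the authors' reported settings.

```python
from tensorflow import keras

def build_oracle():
    """An 8-30-32-1 feed-forward regressor; hyperparameters are illustrative."""
    model = keras.Sequential([
        keras.Input(shape=(8,)),
        keras.layers.Dense(30, activation="relu"),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse",
                  metrics=[keras.metrics.RootMeanSquaredError()])
    return model

# X_train, y_train: the 8 selected features (f25-f32) rescaled to [-1, 1], and f33.
# dnn = build_oracle()
# dnn.fit(X_train, y_train, epochs=300, batch_size=32,
#         validation_data=(X_test, y_test))
# y_oracle = dnn.predict(X_train).ravel()   # targets for MFE mode 3
```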
Regressing the original data. Our inference device is the same as the one used in the case study, with the sole exception of the number of input variables, now eight. The first experiment, with size 2000, shows that MFE does not face overfitting. The pictures in Figure 6, first row, denote a progressive degradation of the performance with the input granularity and with the reduction of the number of input variables, but close trends of the error in training and recall. As for the degree of the regressing polynomials, we may see that in degraded conditions the linear regression may prove even more efficient than the non-linear ones. We remark that the course of the RMSE with the number of variables has an appreciable slope only when their reduction goes beyond 3, thus allowing us to run our interpretation on 5 or fewer variables. The analogue of Figure 5 is in Figure 6, second row, where, since MFE is a one-shot procedure, in place of the error descent curves we show the RMSE surfaces as in the first row, now referred to an experiment size of 20000.

Fig. 6: First row: MFE performance surfaces for experiment size 2000. Left picture: same notations as in Figure 3; right picture: contrasting training and recall surfaces for polynomial degree = 3. Second row: MFE performance surfaces and scatter plots for the different experiment sizes. Surface colors as in the first row; scatter plot colors as in Figure 5.

Regressing the Oracle data. If we repeat the experiment in the previous section, apart from the target, now replaced by the output of the DNN trained on 10000 examples (in the role of the Oracle), we obtain error surfaces that are very close to those in Figure 6. Figure 7, left side, contrasts the recall RMSE surfaces of the two experiments. To summarize the course of the RMSE, in Table 1 we report these values in training and recall for the DNN and for MFE runs in similar conditions, i.e., using all eight variables and with no approximation of their values (let us call them ideal conditions). We note that MFE outperforms the NN not only when running on the same sample size but also, as for recall, on the smaller sample size. Inferring MFE on the NN output may suffer from a slight overfitting that disappears when we abandon ideal conditions, as shown in Figure 7, left side. However, this overfitting is very limited, as shown in the right picture of the figure.

RMSE       NN10000     MFE1000      MFE10000     MFEOrac
training   0.014272    0.0187171    0.0108914    0.0094463
recall     0.028016    0.0274221    0.0137267    0.0211474

Table 1: Contrasting the performances of the methods considered in the paper.

Fig. 7: Left: contrasting the RMSE surfaces computed on the original test set and on the DNN-generated one; right: reconstruction scatter plots of the second target in ideal conditions. Notations as in Figure 6.

5 Discussion and concluding remarks

Expressing a complex function by means of a combination of ridge functions is a recurrent modeling pattern across scientific domains. We started from the notion that its success as a modeling device may be due to the fact that it provides a dimensional decomposition of complex functions that naturally facilitates human understanding and interpretation of the underlying phenomenon. These functions can be seen as modular components of the expression of the overall phenomenon; the fact that they are symbolic and can be learned independently and at different times makes our approach a potential (interpretable) alternative to multi-stage learning for achieving model adaptation. In a non-stationary situation, we could retrain our model piece-wise, using different retraining windows to handle the selective obsolescence of the knowledge represented by the individual atomic functions, or the new domain knowledge typical of transfer learning [21]. Each atomic component being symbolic, its phase-in/phase-out lifecycle would be understandable for humans. We plan to investigate using the performance gap between different configurations of the base learners h_i as a measure of the discrepancy between the source and target domains [17].

According to [1], "there is no standard and generally accepted definition of explainable AI. The XAI term tends to refer to the many initiatives and efforts made in response to concerns regarding AI transparency and trust concerns, more than to a formal technical concept". These initiatives aim at white-boxing AI, i.e., gaining a quantitative understanding by humans of AI models' operation (https://www.iarai.ac.at/research/a-theory-of-ai/white-boxing-interpretability/), where deep neural networks represent the paradigmatic target. Unfortunately, deep learners generally prove as powerful at learning hard functions as they are impenetrable to any form of understanding. With similar targets, we assume quantitative understanding to be an unavoidable premise to achieving human-understandable explanations (for instance, in simplified natural language).
We argue that the accuracy measures of quantitative interpretations are not a guarantee, but a concrete prerequisite, of a satisfactory approximation of any conceptual explanation. This assumption contrasts with the popular (but, in our opinion, debatable) idea that an explanation that is not accessible to laypeople is defective by definition. As for a climbing trip to the Alps, to reach the peak we must be trained. Our experimental results support the idea that our symbolically driven approach may provide a level of accuracy suitable for practical applications, as a complementary interpretable alternative to deep learning architectures.

References

1. Adadi, A., Berrada, M.: Peeking inside the black-box: A survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160 (2018)
2. Apolloni, B., Chiaravalli, S.: PAC learning of concept classes through the boundaries of their items. Theoretical Computer Science 172, 91–120 (1997)
3. Apolloni, B., Kurfess, F. (eds.): From Synapses to Rules – Discovering Symbolic Rules from Neural Processed Data. Kluwer Academic/Plenum Publishers, New York (2002)
4. Apolloni, B., Shehhi, A., Damiani, E.: The simplification conspiracy. In: Progresses in Artificial Intelligence and Neural Systems, Smart Innovation, Systems and Technologies, pp. 11–23. Springer-Verlag (2019)
5. Apolloni, B., Bassis, S., Malchiodi, D., Pedrycz, W.: The Puzzle of Granular Computing, vol. 138. Springer-Verlag, Berlin (2008)
6. Corke, P.I.: A robotics toolbox for MATLAB. IEEE Robotics & Automation Magazine 3(1), 24–32 (1996)
7. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2(4), 303–314 (1989)
8. Garey, M.R., Johnson, D.S.: Computers and Intractability: a Guide to the Theory of NP-Completeness. W. H. Freeman, San Francisco (1978)
9. Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models. Chapman & Hall, London (1990)
10. Josephson, J.R., Josephson, S.G.: Abductive Inference: Computation, Philosophy, Technology. Cambridge University Press (1994)
11. Kelley, T.D.: Symbolic and sub-symbolic representations in computational models of human cognition: What can be learned from biology? Theory & Psychology 13(6), 847–860 (2003)
12. Kohlhase, A., Kohlhase, M., Fuersich, M.: Visual structure in mathematical expressions. In: Proceedings of the International Conference on Intelligent Computer Mathematics. Springer, Cham (2017)
13. Miller, T.: Explanation in artificial intelligence: Insights from the social sciences. ArXiv abs/1706.07269 (2019)
14. Ruan, L., Yuan, M.: Dimension reduction and parameter estimation for additive index models. Statistics and Its Interface 4 (2010)
15. Vapnik, V.: Estimation of Dependences Based on Empirical Data. Springer, New York (1982)
16. Vaughan, J., Sudjianto, A., Brahimi, E., Chen, J.J., Nair, V.: Explainable neural networks based on additive index models. ArXiv abs/1806.01933 (2018)
17. Wang, B., Mendez, J., Cai, M., Eaton, E.: Transfer learning via minimizing the performance gap between domains. In: Advances in Neural Information Processing Systems, vol. 32, pp. 10645–10655 (2019)
18. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1(1), 67–82 (1997)
19. Yang, Z., Zhang, A., Sudjianto, A.: https://arxiv.org/pdf/1901.03838.pdf
20. Yuan, M.: On the identifiability of additive index models. Statistica Sinica 21, 1901–1911 (2011)
21. Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., He, Q.: A comprehensive survey on transfer learning. Proceedings of the IEEE, pp. 1–34 (2020)