<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning simplified functions to understand</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bruno Apolloni</string-name>
          <email>apolloni@di.unimi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ernesto Damiani</string-name>
          <email>ernesto.damiani@ku.ac.ae</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center on Cyber-Physical Systems, Khalifa University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose an unprecedented approach to post-hoc interpretable machine learning. Facing a complex phenomenon, rather than fully capturing its mechanisms through a universal learner, albeit structured in modular building blocks, we train a robust neural network, no matter its complexity, to use as an oracle. Then we approximate its behavior via a linear combination of simple, explicit functions of its input. Simplicity is achieved by (i) marginal functions mapping individual inputs to the network output, (ii) these functions consisting of univariate polynomials of low degree, and (iii) a small number of polynomials being involved in the linear combination, whose input is properly granulated. With this contrivance, we handle various real-world learning scenarios arising from the composition of expertise and experimental frameworks. They range from cooperative training instances to transfer learning. Concise theoretical considerations and comparative numerical experiments further detail and support the proposed approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable AI</kwd>
        <kwd>Post-hoc Interpretable ML</kwd>
        <kwd>ridge polynomials</kwd>
        <kwd>compatible explanation</kwd>
        <kwd>transfer learning</kwd>
        <kwd>minimum description length</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Faced with complex phenomena, we have surrendered to the idea that someone
will provide immediate answers to our questions about them. This is our typical
attitude when we query search engines (we google something) or any URM
(universal responder machine), thereby reviving old myths such as the Oracle of Delphi
or the Golem. The same occurs when we decide to solve either a regression or a
classification problem via cognitive algorithms, such as a neural network. The
operational pact is: once we assume that one of the above subjects is well
assessed, we may inquire and get a response, but we are not allowed to ask why that
answer was given.</p>
      <p>Cognitive systems have provided formidable solutions to many problems, such
as speech-to-text or text-to-speech translation. The rapid evolution of deep
neural networks and their pervasiveness in many appliance interfaces, together with
the widespread habit of seeking answers from online URMs, raise the prospect of disquieting
Orwellian scenarios. Whole communities of people renounce understanding why
some answers occur, in a modern implementation of the ancient statement
"contra factum non valet argumentum", with answers taking the value of facts in
this scenario.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        When elaborating on complex phenomena such as traffic jams in a crowded
city, the structural stress of a bridge, or money lending in a FinTech environment,
we need to state relationships between the target variable $y$ (such as traffic
intensity or stress value) and the broad array $x$ of input variables characterizing
the phenomenon's environment. Let us assume we have an algorithm $g$ available such
that, for many $x \in X$, $g(x) \approx y$ with a satisfactory approximation. However, $g$
may not be understandable, either because it is unknown, like an Oracle, or because
it is a function so complex and endowed with so many parameters that we cannot
perceive its real trend (or any intermediate condition). Lack of understandability
is a typical issue for Artificial Neural Networks and has motivated the search for
techniques deducing formal expressions from trained neural networks, a goal loosely
recapped in the sentence "passing from synapses to rules" [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], since the Eighties
of the past century. This issue is currently being revived, now addressed to AI tools
in general.
      </p>
      <p>
        Today, possibly fostered by the general simplifying mood [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], researchers
look for understandable functions $\tilde{g}$ replacing $g$, a thread that the Explainable
AI taxonomy [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] denotes as Interpretable Machine Learning with a global,
model-agnostic interpretability goal. Simplification may happen at various levels and
with various strategies in search of surrogate models, all pivoting around a few
strongholds like:
- Linear relations are the most understandable and the most identifiable as well. The
most elementary ones are just the sum of atomic functions plus a bias.
- To be efficient, the atomic functions must be relatively simple, albeit not
necessarily linear in turn, hence differentiable several times and with a small number
of arguments, possibly only one, obtained through a projection from the entire $X$. In
the last case, we expect the projections to be roughly orthogonal to one
another.
      </p>
      <p>
        Within the family of Additive Index Models (AIM), the general model can be
written as [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]:
$$\tilde{g}(x) = \gamma_1 h_1(\beta_1^T x) + \dots + \gamma_k h_k(\beta_k^T x) \qquad (1)$$
where
- $\beta_i$ identifies hyperplane projections rendering the $h_i$ ridge functions (see footnote 5). The
general wisdom is to work with orthogonal projections (limit case $\beta_i^T x = x_i$) in
order to better exploit the information in $x$.
- The shape of the $h_i$s is up to us. Common solutions are:
1. $h_i$ a polynomial, possibly linear, so that the overall expression of the
complex function corresponds to the solution of a linear regression
problem;
2. $h_i$ a spline function, so that we can infer its shape according to entropic
criteria [
        <xref ref-type="bibr" rid="ref14 ref20">20, 14</xref>
        ];
3. $h_i$ a neural network, to have many parameters (though well organized
to support interpretation), perfectly fitting $g$ (at least theoretically) via
$\tilde{g}$ [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
- The coefficients $\gamma_i$ are 1 in the original AIM [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which represent a
suitable set of parameters in many models. Worth mentioning that interpreting
g~pxq g~pErx|ysq g~p q with 0 leads to Generalized Additive Models
(GAM) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (Generalized Linear Models (GLM) for hi linear).
      </p>
      <p>
        Identifiability [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and convergence [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] results are available for options 2
and 3, where convergence holds for the mean square error on the pair
$(g(x), \tilde{g}(x))$ under the usual training procedure of $\tilde{g}$ as a whole.
      </p>
      <p>Our approach is rather different. We maintain the above general model (1).
We train the function in an ensemble learning model with $h_i$ as in option 1,
and we let each projection vector $\beta_i$ select a single input, i.e., $\beta_i^T x = x_i$.
In other words, we train each single $h_i$ to replicate the
marginal shape of $g$ for the individual $x_i$s. This procedure results in a one-shot
training. Then we estimate the linear combination of the $h_i$s in (1). Moreover,
to make sure that the learned $h_i$s stay simple enough to afford human
understanding of the overall expression involving them, we trade off the
Mean Squared Error (MSE) on the pair $(g(x), \tilde{g}(x))$ against 1) the number of components of $x$ involved in
each $h_i$ and 2) the degree of the polynomials in $h_i$, both according to the minimum
description length criterion, and 3) the accuracy of $x$, i.e., the size of its vector
quantization.</p>
      <p>
        The original cognitive system's role remains crucial in our approach, but we
move it to the background. A deeply trained neural network acts as an oracle
that supplies all the training examples needed for a very accurate inference of the
approximating functions. These functions supply a manageable interpretation of
what the cognitive system may have learned in a sub-symbolic way. Of course,
lay users may find the resulting interpretation hard to follow (making it a
non-immediately actionable explanation according to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). Still, we claim it provides
a concrete and trustworthy starting point to users with some technical education
wishing to find an explanation of the cognitive system's behavior (when trying
to understand complex behavior, there is no "free lunch" [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]).
      </p>
      <p>5 Informally, we obtain a ridge function of a vector $x$ by computing a univariate function
of the inner product between $x$ and a constant vector $a$, which represents the direction.
As a result, the ridge function is constant on the hyperplanes that are orthogonal to
the ridge direction.</p>
      <p>The paper is structured as follows. Section 2 illustrates the theoretical
framework of our approach. Section 3 describes our inference procedure, which we call
Marginal Functions Ensemble (MFE). Section 4 provides numerical experiments
contrasted with literature results, while concluding remarks are given in Section
5.</p>
    </sec>
    <sec id="sec-2">
      <title>Compatible functions</title>
      <p>
        While the spring of modern science has been the willingness to develop
mathematical theories to describe and explain the world's workings, we may say that the
aim of a somehow post-modern science is, more modestly, to face the
complexity of phenomena with compatible tools. The classical paradigm is the airplane
wing: it does the same job as birds' wings but is more straightforward and
feasible. Facing complex phenomena such as those mentioned in the Introduction,
the scientific community first relied on probability models as the last frontier
before the unexplainable (hic sunt leones). It later surrendered to cognitive
algorithms (with a transition through fuzzy set frameworks). The final paradigm
is: "It does not matter why, provided it works" [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This sentence summarizes
the general philosophy of cognitive algorithms, whose operation is inspired by living
organisms, human beings in primis, and whose most common operational paradigm
is "learning from examples."
      </p>
      <p>
        Actually, in the last three decades, hard competition has been fought
between two ways of implementing the paradigm. Besides the sub-symbolic one,
represented by the cognitive algorithms having neural networks as the main
computational tool, the second way, the symbolic one [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], is represented by the
computational learning vein having PAC-learning methods as the main
theoretical tools.
      </p>
      <p>
        - Neural networks are families of general-purpose functions endowed with
many parameters, so as to be adaptable to reproduce any computable function
(see [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]).
- PAC-Learning addresses specific classes of functions, described by explicit
symbolic expressions, to be satisfactorily approximated by items of specific
classes of hypotheses, with the same description facilities.
      </p>
      <p>Hence, in search of compatible tools, the second way would appear the most
suitable for us.</p>
      <p>PAC stands for "Probably Approximately Correct"; the general scheme is
the following. Consider a class of functions $C$ that is hard, to some extent, to
identify; for instance, circles dividing positive from negative points on a plane (see footnote 6).
Then, for a given set $S$ of points separated by an unknown circle $c$, the goal is
to draw, based on $S$, a circle, call it $h$ (a hypothesis), that is very close to $c$. We
have no ambition of discovering $c$ precisely; instead, we look for an $h$ that works
almost as well as $c$, in the sense that, questioned on a new point $x$ in the plane
$X$, $h(x) = c(x)$ with high probability (see Figure 1). In that, it meets precisely
our notion of compatibility.</p>
      <p>6 Do not be misled by the simplicity of the function; try to devise from scratch an
algorithm that identifies a circle dividing any set of points which in turn are separable
by a circle.</p>
      <p>In opposition to the "Turing machine-versus-perceptron" duel for the title of
computational paradigm champion, which was won by the former, the
Computational Learning paradigm succumbed to neural networks in the
learning-by-examples practice. Suppose we know $C$ and $H$ (the class of the hypotheses $h$),
and we can compute a given complexity parameter of their symmetric difference
(trivial when $C$ and $H$ are classes of circles, definitely less trivial in many
real-life conditions). In that case, we could determine lower bounds to the
learning algorithms' performance as a function of the size of $S$. This procedure is
reminiscent of what happens, for instance, when expressing the accuracy of a linear regression
problem. Things, however, are more complex. In the case of Boolean functions
like labeling circles, we express accuracy in terms of the probability of sampling a
set $S$, based on which we may compute an $h$ such that the symmetric
difference between $c$ and $h$ has a given probability (hence an error probability, since
$c(x) \neq h(x)$ therein) for the distribution law of $x$. This result appears very
powerful since it is distribution-independent (i.e., it holds whatever the distribution
law of $x$), but unfortunately, the upper bounds constraining the size of $S$ are
very loose, outside of any feasible sampling plan. Things work a bit better if we
know the distribution law of $x$, but worse if we do not know $C$.</p>
      <p>Computational Learning Theory remains a formidable field of theoretical
elaboration but has been abandoned for practical purposes. Nevertheless, for the
reasons discussed in the Introduction, we believe that the idea of approximating
a goal function $g$ with a class $H$ of symbolic functions is fundamental to
achieving AI interpretability. In these pages, we try to exploit and adapt results
from Computational Learning Theory in this vein.</p>
      <p>
        This style of learning introduces two complexity issues: computational
complexity and sample complexity. The former concerns the running time of the
algorithm that computes $h$. While it proves excessive if $H$ is the class of Disjunctive
Normal Forms (DNF), since the related satisfiability problem is NP-complete [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
in our case, where hypotheses consist of linear combinations of a small number
of one-dimensional polynomials of small degree, this issue is irrelevant. Sample
complexity is appreciated in terms of the upper bound to the number of samples
to observe in order to have acceptable twin probabilities characterizing the learning goal.
This quantity denotes the amount of information the learning algorithm needs to
identify a hypothesis within its class, thus resulting in a measure of the
information richness/difficulty of the function to be learned, a measure that may
denote its interpretability by humans.
      </p>
      <p>Let us start from the following theorem as an essential reference for our
considerations.</p>
      <p>
        Theorem 1. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] For a space $X$, assume we are given
- a concept class $C$ on $X$ with complexity index $D_{C,C}$;
- a set $Z_m$ of labeled samples $z_m \in X \times \{0, 1\}$;
- a fairly strongly surjective function $A : Z_m \mapsto H$ computing consistent
hypotheses.
      </p>
      <p>Consider the families of random sets $\{c \in C : z_{m+M} = \{(x_i, c(x_i)), i = 1, \dots, m+M\}\}$
when $z_m$ spans the samples in $Z_m$ and the specifications of
their random suffixes $Z_M$, with $M \to \infty$ according to any distribution law.</p>
      <p>For a given $(z_m, h)$ with $h = A(z_m)$, denote with
- $\mu_h$ the complexity index $D_{C,H}$ for the adopted computational approximation;
- $t_h$ the number of points misclassified by $h$;
- $U_{c \div h}$ the random variable given by the probability measure of the symmetric
difference $c \div h$ between $c$ and $h$.</p>
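      <p>Then, for $m \ge \max\left\{ \frac{2}{\varepsilon} \log \frac{1}{\delta}, \frac{5.5(\mu_h + t_h - 1)}{\varepsilon} \right\}$, $A$ is a learning algorithm for every
$z_m$ and $c \in C$ ensuring that $P(U_{c \div h} \le \varepsilon) \ge 1 - \delta$.</p>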
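      <p>As a rough numerical illustration of the sample-size requirement in Theorem 1, the following Python sketch computes the smallest integer $m$ meeting it. It assumes the bound reads $m \ge \max\{(2/\varepsilon)\log(1/\delta),\ 5.5(\mu_h + t_h - 1)/\varepsilon\}$ with a natural logarithm, where $\mu_h$ is the complexity index and $t_h$ the number of misclassified sample points; the function name pac_sample_bound and all numerical values are hypothetical.</p>
      <preformat>
```python
import math


def pac_sample_bound(epsilon, delta, mu_h, t_h=0):
    """Smallest integer m meeting the (assumed) reading of the bound:

    m >= max( (2/epsilon) * ln(1/delta), 5.5 * (mu_h + t_h - 1) / epsilon )
    """
    if not (1 > epsilon > 0 and 1 > delta > 0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    # Term driven by the required confidence 1 - delta.
    confidence_term = (2.0 / epsilon) * math.log(1.0 / delta)
    # Term driven by hypothesis complexity plus its mistakes on the sample.
    complexity_term = 5.5 * (mu_h + t_h - 1) / epsilon
    return math.ceil(max(confidence_term, complexity_term))


# Hypothetical setting: error 0.25, confidence 0.95, complexity index 3.
consistent = pac_sample_bound(epsilon=0.25, delta=0.05, mu_h=3)
# Same setting, but the hypothesis misclassifies 4 sample points.
with_errors = pac_sample_bound(epsilon=0.25, delta=0.05, mu_h=3, t_h=4)
print(consistent, with_errors)
```
      </preformat>
      <p>Under this reading, the complexity index and the number of mistakes enter the bound only through their sum, so a more complex hypothesis and a less accurate one are penalized in the same way.</p>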
      <p>
        Let us highlight the following points:
1. The complexity index $D_{A,B}$ is a function of the symmetric difference between the
two classes in argument, denoting how clear-cut their symmetric difference is. The
most used index is the Vapnik-Chervonenkis dimension [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]; another suitable
one is the detail [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The higher the complexity index, the greater the learning
difficulty.
2. $D$ increases as the computational approximation decreases, which in turn
depends on the accuracy.
3. The function $A$ computing the hypothesis $h$ is consistent if $h(x) = c(x)$ for
each $x \in z_m$. For a non-consistent $A$, we take note of the number $t_h$ of mistaken
samples.
4. The complexity and the approximation of a hypothesis sum up (via $\mu_h + t_h$) to
determine the learning task's difficulty.
      </p>
      <p>Though the theorem refers to classes of Boolean functions, we may derive
the main lesson that three strongholds characterize the complexity of the task
of learning a function:
1. the complexity of the class C to learn, paired with the class H to get its
hypotheses,
2. the approximation of the hypothesis with which we want to explain the
labeling of the points,
3. the accuracy with which we register the input data.</p>
      <p>
        When trying to learn real-world functions, sampling complexity plays a
crucial role because of the cost of obtaining samples. The high and possibly unknown
complexity of $C$, when we abandon case studies, prevents us from hazarding
analytical evaluations of the learned function's accuracy, normally left to the toss
of test sets. What remains are the directions to improve our learning, which find
immediate companions in the mechanisms with which our brain may interpret
the learned function. Namely:
- The complexity of the hypothesis function maps into the number of involved
variables and parameters. This mapping represents an operational
implementation of the explanation selection activity [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] that the human brain performs to
understand the behavior of an algorithm, or a phenomenon in general, in
terms of assignment of causal responsibilities [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In our experience, the
number of variables and parameters should be at most 5. Otherwise, most
humans would not be able to understand their role in the target function
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
- The approximation of the hypothesis maps in our brain into a trade-off between
formulas that are precise but difficult to manage and approximate formulas from which to
derive at least the first hints.
- The accuracy of data maps into our attitude of "never shooting a sparrow with
a cannon" or tracking a butterfly.
      </p>
    </sec>
    <sec id="sec-3">
      <title>The MFE procedure</title>
      <p>The core of the procedure is rather simple and quick to run. We may configure
it as a function of three arguments:
1. the list ls of input variables to take into consideration,
2. the degree polydeg of the polynomial through which to interpolate the marginals
$g_i$ of the target function $g$, and
3. the granulation rate gr, i.e., the partition size used to vector-quantize
the input variables.</p>
      <p>Namely,
- Each variable $x \in$ ls is sorted and partitioned into groups of size gr. Then
each value in a group is replaced by the group mean. The outcome is $\tilde{x}$.
- A base learner $h_i$ is computed as the regression between the target $y$ and
the selected $\tilde{x}_i$.
- The approximating $h$ is the linear regressor of $y$, again, on the base (weak) learners'
outputs.</p>
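      <p>The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names granulate, least_squares, polyval, fit_mfe and predict_mfe are hypothetical; the ensemble stage is assumed to include a bias term; base learners are evaluated on the granulated inputs during training and on raw inputs at prediction time; and a small normal-equation solver stands in for a library regression routine.</p>
      <preformat>
```python
def granulate(values, gr):
    """Vector-quantize one variable: sort it, cut it into consecutive groups
    of size gr, and replace each value by its group mean (input order kept)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for start in range(0, len(order), gr):
        group = order[start:start + gr]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
    return out


def least_squares(rows, targets):
    """Least-squares weights via the normal equations and Gauss-Jordan
    elimination (adequate only for a few well-conditioned columns)."""
    n = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def polyval(coeffs, x):
    """Evaluate c0 + c1*x + c2*x**2 + ... at x."""
    return sum(c * x ** d for d, c in enumerate(coeffs))


def fit_mfe(columns, y, ls, polydeg, gr):
    """columns maps a variable name to its list of values; ls lists the
    variables to use. Returns the base polynomials and the ensemble weights."""
    base, outputs = {}, []
    for name in ls:
        x_tilde = granulate(columns[name], gr)
        feats = [[v ** d for d in range(polydeg + 1)] for v in x_tilde]
        base[name] = least_squares(feats, y)  # base learner h_i
        outputs.append([polyval(base[name], v) for v in x_tilde])
    # Ensemble stage: linear regression of y on the base learners' outputs
    # (the leading 1.0 is an assumed bias column).
    rows = [[1.0] + [out[k] for out in outputs] for k in range(len(y))]
    return base, least_squares(rows, y)


def predict_mfe(model, point, ls):
    """Apply the ensemble to one input point given as a name -> value dict."""
    base, w = model
    return w[0] + sum(wi * polyval(base[n], point[n]) for wi, n in zip(w[1:], ls))


# Toy additive target on a 5 x 5 grid (hypothetical data, not from the paper).
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
cols = {"x1": [a for a in grid for _ in grid],
        "x2": [b for _ in grid for b in grid]}
y = [2 * a + b ** 2 for a, b in zip(cols["x1"], cols["x2"])]
model = fit_mfe(cols, y, ls=["x1", "x2"], polydeg=2, gr=1)
errors = [predict_mfe(model, {"x1": a, "x2": b}, ["x1", "x2"]) - t
          for a, b, t in zip(cols["x1"], cols["x2"], y)]
rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
print(rmse)
```
      </preformat>
      <p>On this noise-free additive target over a product grid the marginal fits recover the target up to a constant, so the printed RMSE sits at floating-point level. The solver divides by zero when a design matrix is singular, for instance when gr is so large that a variable has no more than polydeg distinct granulated values.</p>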
      <p>The use of $h$ and the evaluation of its performance are carried out as usual.
Some attention is instead due to the experimental environment to which the
procedure is applied. The three strongholds are:
1. Neural network as an oracle. We assume that a sufficiently
powerful neural network has been well trained to approximate the target function satisfactorily.
Hence, as with an oracle, we may obtain a reliable answer on any input.
2. Explanations like the ones from domain experts. Since we are not interested in
discovering the truth but rather in getting wisely understandable descriptions,
we circumscribe the input domain to the field of interest.
3. Surgical bombardment of the field of interest. The "bombs" are symbolic
functions with which we fit a huge set of input/output pairs supplied by the
neural network. We drop them sequentially according to their description length,
that is, in a progression of function degree and number
of variables. Entropic criteria decide whether each bomb is
maintained in or removed from the final arsenal.</p>
    </sec>
    <sec id="sec-4">
      <title>Numerical experiments</title>
      <p>
        We carried out two families of experiments. The first one is on artificial data, to
compare the efficiency of our method with a template of the AIM-style ones [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The
second one concerns real data available in the UCI repository.
      </p>
      <sec id="sec-4-1">
        <title>Artificial data</title>
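        <p>
          For the sake of comparison, we considered the simulation study S5 in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], concerning
the regression
[Equation (2), expressing y as a function of x1, ..., x6; its operators did not survive extraction. Surviving terms, in source order: x1, x2, 0.5, 2(x3, (1.5, x4, x3, x5, x5, x6), x4, x6)^2.]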
A (training set, test set) pair of size (10000, 1000) has been generated by starting
with a uniform random seed in $[-1, 1]$ for each $x_i$, to which a common bias,
drawn uniformly at random in $[-1, 1]$, was added to inject a correlation of 0.5
between each pair of variables. A final rescaling by a factor 0.5 was applied to
bring the computed $x_i$s back into the range $[-1, 1]$. The $y_i$s are computed directly
through (2). In Figure 2 we contrast the MFE architecture (on the left) with
the one in the reference paper (on the right). The symbolic functions in the blue
boxes are replaced by neural networks in the azure boxes. Moreover, while in the
former we have a double training, at the weak learners' level and at their ensemble
level, in the latter we have the usual training of a neural network, albeit with a
distinguished structure of the network.
        </p>
        <p>[Figure 2: MFE architecture (left) versus the architecture of the reference paper (right). Recoverable labels: inputs x1, ..., xk; base learners h1(x1), ..., hk(xk); target y; per-learner loss_i; overall LOSS.]</p>
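        <p>The input sampling scheme described above can be sketched as follows. The sketch assumes that each input is $x_i = 0.5(u_i + b)$, with $u_i$ and $b$ independent and uniform on $[-1, 1]$ and $b$ shared by all variables of the same sample. Under this reading each term has variance 1/3, so the covariance of two inputs is 0.25 (1/3) and their variance is 0.25 (2/3), which gives the stated correlation of 0.5. The function names and the seed are hypothetical, and the target of equation (2) is deliberately left out.</p>
        <preformat>
```python
import random


def make_inputs(n_samples, n_vars=6, seed=0):
    """Draw samples with x_i = 0.5 * (u_i + b), where u_i and the bias b
    are uniform on [-1, 1] and b is shared within each sample."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        bias = rng.uniform(-1.0, 1.0)  # one bias per sample, common to all x_i
        rows.append([0.5 * (rng.uniform(-1.0, 1.0) + bias)
                     for _ in range(n_vars)])
    return rows


def correlation(u, v):
    """Sample Pearson correlation of two equally long sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


train = make_inputs(10000)
x1 = [row[0] for row in train]
x2 = [row[1] for row in train]
print(round(correlation(x1, x2), 2))
```
        </preformat>
        <p>The halving keeps every input inside $[-1, 1]$, since the sum of the seed and the bias ranges over $[-2, 2]$.</p>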
        <p>For training and test set sizes $S = (10000, 10000)$, xNN claims a root mean
square error on the test set of RMSE = 1.0049 (actually a bit greater than the
optimum of its competitors, 1.0005). Figure 3 displays our results as a function
of the three parameters of the routine. A first thing we note is that the polydeg = 2
surface almost coincides with the polydeg = 3 one (see footnote 7), both denoting RMSEs
definitely lower than the xNN one. A second remark concerns the apparently small
sensitivity of the performance to the granulation.</p>
        <p>Figure 4, left, digs deeper into this aspect by showing an extended course of
this trend (gr from 1 to 1500). We note an almost imperceptible RMSE growth for a
granulation up to 200. In any case, even for a granulation of 1500, the MFE performance
is better than the xNN one. Finally, Figure 4, right, shows the scatter of the
regressed values with respect to the original ones. The way of getting better
accuracy will be dealt with in the next subsection in regard to natural data.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Real data</title>
        <p>
          Collecting real data on complex phenomena such as the ones mentioned in the
Introduction is a costly task per se. Thus we move to the public UCI repository [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
and to an application field that is the icon of simplified scenarios: Facebook
transactions. Namely, the benchmark Facebook Comment Volume lists the number of
comments in the first 24 hours after the publication of a post on Facebook as a
function of 53 variables characterizing the post, such as "Page popularity" and
"Page category." The number of records is 40,949 and the overall variables are 54,
of which we considered only the first 33. This decision limits the problem's
complexity (up to us in the end) at the cost of neglecting some ancillary variables.
In greater detail, the data (extracted by a Web crawler) are:
- f1: number of likes for the source of the document.
- f2: number of individuals who have visited this page so far.
- f3: the daily number of activities by visitors to the page.
- f4: category of the source of the document.
- f5-f29: min, max, average, median, and standard deviation of features f30 to f33, plus the
difference between f32 and f33.
- f30: the total number of comments before a pre-established date.
- f31: number of comments in the last 24 hours before the pre-established date.
- f32: number of comments in the last 48 to last 24 hours before the pre-established date.
- f33: number of comments in the first 24 hours after the publication of the post, but
before the pre-established date.
        </p>
        <p>7 Actually, in both cases we also considered a power $x^{1/2}$.</p>
        <p>Fig. 3: MFE performance surfaces. z axis → RMSE, x axis → granulation size (ranging
from 1 to 600), y axis → left extreme of the index interval selecting the involved
variables from the set $\{x_1, \dots, x_6\}$. Surface label → degree of the regressed polynomials.</p>
        <p>Fig. 4: Left: course of RMSE with the granulation size; parameters as in the labels
(RMSE versus gr, with curves for polydeg = 1, 2, 3).
Right: scatter plot of the regressed versus original data of the test set (y_reg versus y_test), in the most
favorable setting of the parameters (all variables used, polydeg = 3, no vector quantization).</p>
        <p>Finally, after deletion of the less relevant variables based on the p-values computed
with the usual R regression tools (see footnote 8), we come to the problem of computing f33
from f25 to f32. We tackled this problem in three modes:
1. through a deep neural network;
2. through MFE regressing the original f33;
3. through MFE regressing the output of the deep neural network, where the latter plays
the Oracle's role.</p>
        <p>Moreover, we solved the problem at two sample scales: small (training set size
= 1000, testing set size = 1000 → experiment size 2000) and large (training set
size = 10000, testing set size = 10000 → experiment size 20000), overall working
on the first 30,000 records of the dataset.</p>
        <p>Deep neural network. We trained an 8-30-32-1 neural network (DNN) on
the above dataset with standard Keras (see footnote 9) options, to baseline the learning difficulty
of our task and to build the Oracle for mode 3. To this aim, we preferred limiting
the input variables to exactly the 8 on which we will base the interpretation
of the learned network. Given the asymmetry of the variable distributions, we
avoid any normalization or feature extraction (for instance, via PCA) other than
uniformly constraining each variable's values to the interval $[-1, 1]$. Figure 5
reports the RMSE curves in recall as a function of the learning cycles, together with the
scatter plots of original versus reconstructed values in both the training and recall phases, for
the two sizings of the training and test sets. Despite the regular course of the RMSEs,
with a noticeable improvement when we enlarge the training set, the first scatter
plot, related to the smaller sample size, denotes overfitting with consequently poor
generalization on the test set. The second one is as expected. Hence we decide
to use the second neural network as the Oracle and register the value 0.028016
as the RMSE of this Oracle.</p>
        <p>8 https://cran.r-project.org</p>
        <p>9 https://keras.io</p>
        <p>Regressing the original data. Our inference device is the same as the one
used in the case study, with the sole exception of the number of input variables,
which is now eight. The first experiment, with size 2000, shows that MFE
does not suffer from overfitting. The pictures in Figure 6, first row, denote a progressive
degradation of the performance with the input granularity and with the reduction
of the number of input variables, but close trends of the error in training and
recall. As for the degree of the regressing polynomials, we may see that in degraded
conditions linear regression may prove even more efficient than non-linear
ones. We remark that the course of RMSE with the number of variables has an
appreciable slope only when more than 3 of them are removed, thus allowing us to run our
interpretation on 5 or fewer variables. The analogue of Figure 5 is in Figure 6,
second row, where, since MFE is a one-shot procedure, in place of the error
descent curves we show the RMSE surfaces as in the first row, now referred to
an experiment size of 20000.</p>
        <p>[Figure 6: RMSE surfaces for experiment size 2000 (first row) and experiment size 20000 (second row).]</p>
        <p>Regressing the Oracle data. If we repeat the experiment of the previous
section except for the target, now replaced by the output of the DNN trained
on 10000 examples (in the role of the Oracle), we obtain error surfaces that are
very close to those in Figure 6. Figure 7, left side, contrasts the recall RMSE
surfaces of the two experiments. To summarize the course of RMSE, in Table 1
we report these values in training and recall for the DNN and for the MFE runs in
similar conditions, i.e., using all eight variables and with no approximation on
their values (let us call them ideal conditions). We note that MFE outperforms
the NN not only when running on the same sample size but also, as for recall, on
the smaller sample size. Inferring MFE on the NN output may suffer from a
slight overfitting that disappears when we abandon ideal conditions, as shown
in Figure 7, left side. However, this overfitting is very limited, as shown in the
right picture of the figure.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>RMSE</title>
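      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>RMSE</th><th>NN10000</th><th>MFE1000</th><th>MFE10000</th><th>MFEOrac</th></tr>
          </thead>
          <tbody>
            <tr><td>training</td><td>0.014272</td><td>0.0187171</td><td>0.0108914</td><td>0.0094463</td></tr>
            <tr><td>recall</td><td>0.028016</td><td>0.0274221</td><td>0.0137267</td><td>0.0211474</td></tr>
          </tbody>
        </table>
      </table-wrap>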
    </sec>
    <sec id="sec-6">
      <title>Discussion and concluding remarks</title>
      <p>
        Expressing a complex function by means of a combination of ridge functions
is a recurrent modeling pattern across scientific domains. We started from the
notion that its success as a modeling device may be due to the fact that it
provides a dimensional decomposition of complex functions that naturally facilitates
human understanding and interpretation of the underlying phenomenon. These
functions can be seen as modular components of the expression of the overall
phenomenon; the fact that they are symbolic and can be learned independently
and at different times makes our approach a potential (interpretable) alternative
to multi-stage learning for achieving model adaptation. In a non-stationary
situation, we could retrain our model piece-wise, using different retraining windows
to handle the selective obsolescence of the knowledge represented by the individual
atomic functions, or the new domain knowledge typical of transfer learning [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
Since each atomic component is symbolic, its phase-in/phase-out lifecycle would be
understandable to humans. We plan to investigate using the performance gap
between different base learners' configurations as a measure of the
discrepancy between the source and target domains [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
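      <p>The piece-wise retraining idea can be sketched as follows. The sketch assumes that the model is a plain sum of univariate polynomials and that a single component is refit by least squares on the residual left by the other, frozen components; this backfitting-style step is chosen here for illustration and is not necessarily the procedure the approach would adopt. The function names, the drift scenario, and the window size are hypothetical.</p>
      <preformatted>
```python
import numpy as np

def predict(components, x):
    # Additive model: one univariate polynomial per input, summed.
    return sum(np.polyval(c, x[:, j]) for j, c in enumerate(components))

def refit_component(components, x_window, y_window, j, degree):
    # Freeze every component except j and refit j on the residual they leave.
    others = sum(np.polyval(c, x_window[:, k])
                 for k, c in enumerate(components) if k != j)
    updated = list(components)
    updated[j] = np.polyfit(x_window[:, j], y_window - others, degree)
    return updated

def rmse(components, x, y):
    return float(np.sqrt(np.mean((predict(components, x) - y) ** 2)))

# Model learned earlier: f(x) = 2*x0 + x1^2 (coefficients, highest degree first).
components = [np.array([2.0, 0.0]), np.array([1.0, 0.0, 0.0])]

# Recent window in which only the dependence on x0 has drifted to -x0.
rng = np.random.default_rng(1)
x_recent = rng.uniform(-1.0, 1.0, size=(200, 2))
y_recent = -x_recent[:, 0] + x_recent[:, 1] ** 2

before = rmse(components, x_recent, y_recent)
updated = refit_component(components, x_recent, y_recent, j=0, degree=1)
after = rmse(updated, x_recent, y_recent)
print(f"RMSE before: {before:.4f}, after refitting component 0: {after:.2e}")
```
      </preformatted>
      <p>Only the obsolete component changes, while the component for the second input is carried over untouched, so the updated model remains readable as the old expression with one term replaced.</p>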
      <p>
        According to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], "there is no standard and generally accepted definition of
explainable AI. The XAI term tends to refer to the many initiatives and efforts
made in response to concerns regarding AI transparency and trust concerns,
more than to a formal technical concept". These initiatives aim at white-boxing
AI, i.e., at giving humans a quantitative understanding of how AI models operate<sup>10</sup>,
where deep neural networks represent the paradigmatic target. Unfortunately,
deep learners generally prove as powerful at learning hard functions as they are
impenetrable to any form of understanding. With similar targets in mind, we take
quantitative understanding to be an unavoidable premise for achieving human-understandable
explanations (for instance, in simplified natural language). We argue that
accuracy measures of quantitative interpretations are not a guarantee but a concrete
prerequisite of a satisfactory approximation of any conceptual explanation. This
assumption contrasts with the popular (but in our opinion debatable) idea that an
explanation that is not accessible to laypeople is defective by definition. As with
a climbing trip in the Alps: to reach the peak, we must be trained.
      </p>
      <p>Our experimental results support the idea that our symbolically driven
approach may provide the level of accuracy suitable for practical applications as a
complementary interpretable alternative to deep learning architectures.</p>
      <p><sup>10</sup> https://www.iarai.ac.at/research/a-theory-of-ai/white-boxing-interpretability/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Adadi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berrada</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Peeking inside the black-box: A survey on explainable artificial intelligence (XAI)</article-title>
          .
          <source>IEEE Access</source>
          <volume>6</volume>
          ,
          <fpage>52138</fpage>
          –
          <lpage>52160</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Apolloni</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiaravalli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>PAC learning of concept classes through the boundaries of their items</article-title>
          .
          <source>Theoretical Computer Science</source>
          <volume>172</volume>
          ,
          <fpage>91</fpage>
          –
          <lpage>120</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Apolloni</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kurfess</surname>
            ,
            <given-names>F</given-names>
          </string-name>
          . (eds.):
          <article-title>From Synapses to Rules – Discovering Symbolic Rules from Neural Processed Data</article-title>
          . Kluwer Academic/Plenum Publishers, New York (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Apolloni</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shehhi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Damiani</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>The simplification conspiracy</article-title>
          .
          <source>In: Progresses in Artificial Intelligence and Neural Systems. Smart Innovation, Systems and Technologies</source>
          . pp.
          <fpage>11</fpage>
          –
          <lpage>23</lpage>
          . Springer-Verlag,
          <source>Lecture Notes in Artificial Intel</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Apolloni</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bassis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malchiodi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedrycz</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <source>The Puzzle of Granular Computing</source>
          , vol.
          <volume>138</volume>
          . Springer-Verlag, Berlin (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Corke</surname>
            ,
            <given-names>P.I.</given-names>
          </string-name>
          :
          <article-title>A robotics toolbox for MATLAB</article-title>
          .
          <source>IEEE Robotics &amp; Automation Magazine</source>
          <volume>3</volume>
          (
          <issue>1</issue>
          ),
          <fpage>24</fpage>
          –
          <lpage>32</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cybenko</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Approximation by superpositions of a sigmoidal function</article-title>
          .
          <source>Mathematics of Control, Signals and Systems</source>
          <volume>2</volume>
          (
          <issue>4</issue>
          ),
          <fpage>303</fpage>
          –
          <lpage>314</lpage>
          (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Garey</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          :
          <article-title>Computers and Intractability: A Guide to the Theory of NP-Completeness</article-title>
          .
          <publisher-name>W. H. Freeman</publisher-name>
          , San Francisco (
          <year>1978</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          :
          <article-title>Generalized additive models</article-title>
          .
          <publisher-name>Chapman &amp; Hall</publisher-name>
          , London (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Josephson</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Josephson</surname>
            ,
            <given-names>S.G.</given-names>
          </string-name>
          : Abductive Inference: Computation, Philosophy, Technology. Cambridge University Press (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kelley</surname>
            ,
            <given-names>T.D.</given-names>
          </string-name>
          :
          <article-title>Symbolic and sub-symbolic representations in computational models of human cognition: What can be learned from biology?</article-title>
          <source>Theory &amp; Psychology</source>
          <volume>13</volume>
          (
          <issue>6</issue>
          ),
          <fpage>847</fpage>
          –
          <lpage>860</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kohlhase</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kohlhase</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuersich</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Visual structure in mathematical expressions</article-title>
          .
          <source>Proceedings of the International Conference on Intelligent Computer Mathematics</source>
          . Springer, Cham (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Explanation in artificial intelligence: Insights from the social sciences</article-title>
          .
          <source>arXiv</source>
          abs/1706.07269 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ruan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Dimension reduction and parameter estimation for additive index models</article-title>
          .
          <source>Statistics and Its Interface</source>
          <volume>4</volume>
          (
          <issue>01</issue>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Estimation of dependences based on empirical data</article-title>
          . Springer, New York (
          <year>1982</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Vaughan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sudjianto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brahimi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nair</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Explainable neural networks based on additive index models</article-title>
          .
          <source>arXiv</source>
          abs/1806.01933
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eaton</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Transfer learning via minimizing the performance gap between domains</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          , vol.
          <volume>32</volume>
          , pp.
          <fpage>10645</fpage>
          –
          <lpage>10655</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Wolpert</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macready</surname>
            ,
            <given-names>W.G.</given-names>
          </string-name>
          :
          <article-title>No free lunch theorems for optimization</article-title>
          .
          <source>IEEE Transactions on Evolutionary Computation</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>67</fpage>
          –
          <lpage>82</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sudjianto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          : https://arxiv.org/pdf/1901.03838.pdf
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>On the identifiability of additive index models</article-title>
          .
          <source>Statistica Sinica</source>
          <volume>21</volume>
          ,
          <fpage>1901</fpage>
          –
          <lpage>1911</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>A comprehensive survey on transfer learning</article-title>
          .
          <source>Proceedings of the IEEE</source>
          pp.
          <fpage>1</fpage>
          –
          <lpage>34</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>