=Paper=
{{Paper
|id=Vol-2253/paper71
|storemode=property
|title=On the Readability of Deep Learning Models: the role of Kernel-based Deep Architectures
|pdfUrl=https://ceur-ws.org/Vol-2253/paper71.pdf
|volume=Vol-2253
|authors=Danilo Croce,Daniele Rossini,Roberto Basili
|dblpUrl=https://dblp.org/rec/conf/clic-it/CroceR018
}}
==On the Readability of Deep Learning Models: the role of Kernel-based Deep Architectures==
On the Readability of Deep Learning Models: the role of Kernel-based Deep Architectures

Danilo Croce, Daniele Rossini and Roberto Basili
Department of Enterprise Engineering, University of Roma, Tor Vergata
{croce,basili}@info.uniroma2.it

Abstract

English. Deep Neural Networks achieve state-of-the-art performances in several semantic NLP tasks, but they lack explanation capabilities, as the underlying acquired models have limited interpretability. In other words, tracing back causal connections between the linguistic properties of an input instance and the produced classification is not possible. In this paper, we propose to apply Layerwise Relevance Propagation over linguistically motivated neural architectures, namely Kernel-based Deep Architectures (KDA), to guide argumentations and explanation inferences. In this way, decisions provided by a KDA can be linked to the semantics of input examples, used to linguistically motivate the network output.

Italiano. Deep Neural Networks today achieve state-of-the-art results in many NLP tasks, but the limited interpretability of the trained models restricts the understanding of their inferences. In other words, it is not possible to determine causal connections between the linguistic properties of an example and the classification produced by the network. In this work, Layerwise Relevance Propagation is applied to Kernel-based Deep Architectures (KDA) to determine connections between the semantics of the input and the output class, which correspond to linguistically transparent explanations of the decision.

1 Introduction

Deep Neural Networks are usually criticized as they are not epistemologically transparent devices, i.e. their models cannot be used to provide explanations of the resulting inferences. An example can be neural question classification (QC) (e.g. (Croce et al., 2017)). In QC the correct category of a question is detected to optimize the later stages of a question answering system (Li and Roth, 2006). An epistemologically transparent learning system should trace back the causal connections between the proposed question category and the linguistic properties of the input question. For example, the system could motivate the decision "What is the capital of Zimbabwe?" refers to a Location, with a sentence such as: Since it is similar to "What is the capital of California?", which also refers to a Location. Unfortunately, neural models, such as Multilayer Perceptrons (MLP), Long Short-Term Memory Networks (LSTM) (Hochreiter and Schmidhuber, 1997), or even Attention-based Networks (Larochelle and Hinton, 2010), correspond to parameters that have no clear conceptual counterpart: it is thus difficult to trace back the network components (e.g. neurons or layers in the resulting topology) responsible for the answer.

In image classification, Layerwise Relevance Propagation (LRP) (Bach et al., 2015) has been used to decompose backward, across the MLP layers, the evidence about the contribution of individual input fragments (i.e. pixels of the input images) to the final decision. Evaluation against the MNIST and ILSVRC benchmarks suggests that LRP activates associations between input and output fragments, thus tracing back meaningful causal connections.

In this paper, we propose the use of a similar mechanism over a linguistically motivated network architecture, the Kernel-based Deep Architecture (KDA) (Croce et al., 2017). Tree Kernels (Collins and Duffy, 2001) are here used to integrate syntactic/semantic information within an MLP network.
We will show how KDA input nodes correspond to linguistic instances and how, by applying the LRP method, we are able to trace back causal associations between the semantic classification and such instances. Evaluation of the LRP algorithm is based on the idea that explanations improve the user expectations about the correctness of an answer, and it shows its applicability in human computer interfaces.

In the rest of the paper, Section 2 describes the KDA neural approach, while Section 3 illustrates how LRP connects to KDAs. In Section 4 early results of the evaluation are reported.

2 Training Neural Networks in Kernel Spaces

Given a training set D, a kernel K(o_i, o_j) is a similarity function over D^2 that corresponds to a dot product in the implicit kernel space, i.e., K(o_i, o_j) = Φ(o_i) · Φ(o_j). Kernel functions are used by learning algorithms, such as Support Vector Machines (Shawe-Taylor and Cristianini, 2004), to efficiently operate on instances in the kernel space: their advantage is that the projection function Φ(o) = x ∈ R^n is never explicitly computed. The Nyström method is a factorization method applied to derive a new low-dimensional embedding x̃ in a l-dimensional space, with l ≪ n, so that G ≈ G̃ = X̃ X̃^T, where G = X X^T is the Gram matrix such that G_ij = Φ(o_i) · Φ(o_j) = K(o_i, o_j). The approximation G̃ is obtained using a subset of l columns of the matrix, i.e., a selection of a subset L ⊂ D of the available examples, called landmarks. Given l randomly sampled columns of G, let C ∈ R^{|D| × l} be the matrix of these sampled columns. Then, we can rearrange the columns and rows of G and define X = [X_1 X_2] such that

G = [W, X_1^T X_2 ; X_2^T X_1, X_2^T X_2]   and   C = [W ; X_2^T X_1]

where W = X_1^T X_1, i.e., the subset of G that contains only landmarks. The Nyström approximation can be defined as:

G ≈ G̃ = C W^† C^T    (1)

where W^† denotes the Moore-Penrose inverse of W. If we apply the Singular Value Decomposition (SVD) to W, which is symmetric positive definite, we get W = U S V^T = U S U^T. Then it is straightforward to see that W^† = U S^{-1} U^T = U S^{-1/2} S^{-1/2} U^T and that, by substitution, G ≈ G̃ = (C U S^{-1/2})(C U S^{-1/2})^T = X̃ X̃^T. Given an example o ∈ D, its new low-dimensional representation x̃ is determined by considering the corresponding row c of C as

x̃ = c U S^{-1/2}    (2)

where c is the vector whose dimensions contain the evaluations of the kernel function between o and each landmark o_j ∈ L. Therefore, the method produces l-dimensional vectors.
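As a concrete illustration of the projection in Eq. 2, the following minimal numpy sketch derives H_Ny = U S^{-1/2} from the landmark kernel matrix W and embeds one example from its kernel evaluations c. The linear kernel, the random data and the landmark sampling are hypothetical stand-ins for the linguistic kernels used in the paper, not the authors' implementation.

<pre>
import numpy as np

def nystrom_projection(W, eps=1e-12):
    """Compute the Nystrom projection matrix H_Ny = U S^{-1/2} (Eq. 2)
    from W, the l x l kernel matrix among the landmarks (W = U S U^T)."""
    U, s, _ = np.linalg.svd(W)
    return U @ np.diag(1.0 / np.sqrt(s + eps))   # eps guards against tiny singular values

# Toy example with a linear kernel K(o_i, o_j) = o_i . o_j over random vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))       # 100 training examples
L = X[:10]                           # l = 10 randomly selected landmarks
W = L @ L.T                          # kernel evaluations among landmarks
H_Ny = nystrom_projection(W)         # l x l projection matrix

c = X[42] @ L.T                      # kernel evaluations K(o, l_i) for one example o
x_tilde = c @ H_Ny                   # its l-dimensional embedding (Eq. 2)
print(x_tilde.shape)                 # (10,)
</pre>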
Given a labeled dataset, a Multi-Layer Perceptron (MLP) architecture can be defined, with a specific Nyström layer based on the Nyström embeddings of Eq. 2 (Croce et al., 2017). Such a Kernel-based Deep Architecture (KDA) has an input layer, a Nyström layer, a possibly empty sequence of non-linear hidden layers and a final classification layer, which produces the output. In particular, the input layer corresponds to the input vector c, i.e., the row of the C matrix associated to an example o. It is then mapped to the Nyström layer, through the projection in Equation 2. Notice that the embedding also provides the proper weights, defined by U S^{-1/2}, so that the mapping can be expressed through the Nyström matrix H_Ny = U S^{-1/2}: it corresponds to a pre-training stage based on the SVD. Formally, the low-dimensional embedding of an input example o, x̃ = c H_Ny = c U S^{-1/2}, encodes the kernel space. Any neural network can then be adopted: in the rest of this paper, we assume that a traditional Multi-Layer Perceptron (MLP) architecture is stacked in order to solve the targeted classification problems. The final layer of the KDA is the classification layer, whose dimensionality depends on the classification task: it computes a linear classification function with a softmax operator.

A KDA is stimulated by an input vector c which corresponds to the kernel evaluations K(o, l_i) between each example o and the landmarks l_i. Linguistic kernels (such as Semantic Tree Kernels (Croce et al., 2011)) depend on the syntactic/semantic similarity between the input x and the subset of landmarks l_i used for the space reconstruction. We will see hereafter how tracing back through relevance propagation into a KDA architecture corresponds to determining which semantic landmarks contribute mostly to the final output decision.
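The KDA forward pass can thus be summarized as a standard MLP whose first weight matrix is the fixed, SVD-derived H_Ny. A minimal numpy sketch follows, reusing c and H_Ny from the previous snippet; the single tanh hidden layer, the layer sizes and the random initialization are illustrative assumptions, not the configuration used by the authors.

<pre>
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kda_forward(c, H_Ny, W_h, b_h, W_out, b_out):
    """KDA forward pass: kernel evaluations c -> Nystrom layer (fixed weights
    H_Ny from the SVD) -> one tanh hidden layer -> linear layer with softmax."""
    x_tilde = c @ H_Ny                   # Nystrom layer (Eq. 2), not trained
    h = np.tanh(x_tilde @ W_h + b_h)     # non-linear hidden layer, trained by backprop
    return softmax(h @ W_out + b_out)    # class probabilities

# Hypothetical sizes: l = 10 landmarks, 32 hidden units, 6 output classes.
rng = np.random.default_rng(1)
W_h, b_h = rng.normal(size=(10, 32)), np.zeros(32)
W_out, b_out = rng.normal(size=(32, 6)), np.zeros(6)
# probs = kda_forward(c, H_Ny, W_h, b_h, W_out, b_out)   # c and H_Ny from the previous sketch
</pre>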
3 Layer-wise Relevance Propagation in Kernel-based Deep Architectures

Layer-wise Relevance Propagation (LRP, presented in (Bach et al., 2015)) is a framework which allows to decompose the prediction of a deep neural network computed over a sample, e.g. an image, down to relevance scores for the single input dimensions, such as a subset of pixels.

Formally, let f : R^d → R^+ be a positive real-valued function taking a vector x ∈ R^d as input: f quantifies, for example, the probability of x characterizing a certain class. Layer-wise Relevance Propagation assigns to each dimension, or feature, x_d, a relevance score R_d^(1) such that:

f(x) ≈ Σ_d R_d^(1)    (3)

Features whose score R_d^(1) > 0 (or R_d^(1) < 0) correspond to evidence in favor of (or against) the output classification. In other words, LRP allows to identify fragments of the input playing key roles in the decision, by propagating relevance backwards. Let us suppose to know the relevance score R_j^(l+1) of a neuron j at network layer l+1; then it can be decomposed into messages R_{i←j}^(l,l+1) sent to neurons i in layer l:

R_j^(l+1) = Σ_{i ∈ (l)} R_{i←j}^(l,l+1)    (4)

Hence the relevance of a neuron i at layer l can be defined as:

R_i^(l) = Σ_{j ∈ (l+1)} R_{i←j}^(l,l+1)    (5)

Note that Eq. 4 and 5 are such that Eq. 3 holds. In this work, we adopted the ε-rule defined in (Bach et al., 2015) to compute the messages R_{i←j}^(l,l+1), i.e.

R_{i←j}^(l,l+1) = z_ij / (z_j + ε · sign(z_j)) · R_j^(l+1)

where z_ij = x_i w_ij, z_j = Σ_i z_ij, and ε > 0 is a numerical stabilizing term that must be small. Notice that the weights w_ij correspond to weighted activations of input neurons. If we apply LRP to a KDA, it implicitly traces the relevance back to the input layer, i.e. to the landmarks. It thus tracks back syntactic, semantic and lexical relations between a question and the landmark, and it grants high relevance to the relations the network selected as highly discriminating for the class representations it learned; note that this is different from similarity in terms of kernel-function evaluation, as the latter is task independent whereas LRP scores are not. Notice also that each landmark is uniquely associated to an entry of the input vector c, as shown in Section 2, and, as a member of the training dataset, it also corresponds to a known class.
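To make the propagation concrete, here is a minimal numpy sketch of the ε-rule applied to one dense layer of the KDA sketched above. It is an illustrative re-implementation of Eq. 4-5, not the authors' code; biases are omitted for brevity and the variable names refer to the hypothetical KDA snippet of Section 2.

<pre>
import numpy as np

def lrp_epsilon_dense(x, W, R_out, eps=1e-8):
    """epsilon-rule (Bach et al., 2015) for one dense layer: redistribute the
    output relevances R_out onto the inputs x, with z_ij = x_i * w_ij and
    z_j = sum_i z_ij (biases omitted for brevity)."""
    z_ij = x[:, None] * W                                  # shape (n_in, n_out)
    z_j = z_ij.sum(axis=0)                                 # layer pre-activations
    denom = z_j + eps * np.where(z_j >= 0, 1.0, -1.0)      # stabilized denominator
    return (z_ij / denom) @ R_out                          # input relevances R_i

# Propagating back through the illustrative KDA (hypothetical variables):
# R_hidden    = lrp_epsilon_dense(h, W_out, R_output)      # classification -> hidden layer
# R_x_tilde   = lrp_epsilon_dense(x_tilde, W_h, R_hidden)  # hidden -> Nystrom layer
# R_landmarks = lrp_epsilon_dense(c, H_Ny, R_x_tilde)      # Nystrom -> landmark relevances
</pre>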
4 Explanatory Models

LRP allows the automatic compilation of justifications for the KDA classifications: explanations are possible using landmarks {ℓ} as examples. The landmarks {ℓ} that the LRP method produces as the most active elements in layer 0 are semantic analogues of input annotated examples. An Explanatory Model is the function in charge of compiling the linguistically fluent explanation of individual analogies (or differences) with the input case. The meaningfulness of such analogies makes the resulting explanation clear and should increase the user confidence in the system reliability. When a sentence o is classified, LRP assigns activation scores r_ℓ^s to each individual landmark ℓ: let L^(+) (or L^(−)) denote the set of landmarks with positive (or negative) activation scores.

Formally, an explanation is characterized by a triple e = ⟨s, C, τ⟩ where s is the input sentence, C is the predicted label and τ is the modality of the explanation: τ = +1 for positive (i.e. acceptance) statements, while τ = −1 corresponds to rejections of the decision C.

A landmark ℓ is positively activated for a given sentence s if there are no more than k−1 other active landmarks ℓ' whose activation value is higher than the one for ℓ (k is a parameter used to make the explanation depend on no more than k landmarks, denoted by L_k), i.e.

|{ℓ' ∈ L^(+) : ℓ' ≠ ℓ ∧ r_{ℓ'}^s ≥ r_ℓ^s > 0}| < k

A landmark is negatively activated when |{ℓ' ∈ L^(−) : ℓ' ≠ ℓ ∧ r_{ℓ'}^s ≤ r_ℓ^s < 0}| < k. Positively (or negatively) activated landmarks in L_k are assigned an activation value a(ℓ, s) = +1 (or −1). For all other non-activated landmarks: a(ℓ, s) = 0.

Given the explanation e = ⟨s, C, τ⟩, a landmark ℓ whose (known) class is C_ℓ is consistent (or inconsistent) with e according to whether the function

δ(C_ℓ, C) · a(ℓ, s) · τ

is positive (or negative, respectively), where δ(C', C) = 2 δ_kron(C' = C) − 1 and δ_kron is the Kronecker delta.

The explanatory model is then a function M(e, L_k) which maps an explanation e and a subset L_k of the active and consistent landmarks L for e into a sentence in natural language. Of course several definitions for M(e, L_k) and L_k are possible. A general explanatory model would be:

M(e, L_k) =
  "s is C since it is similar to ℓ"   ∀ℓ ∈ L_k^+, if τ > 0
  "s is not C since it is different from ℓ which is C"   ∀ℓ ∈ L_k^−, if τ < 0
  "s is C but I don't know why"   if L_k = ∅

where L_k^+, L_k^− ⊆ L_k are the partitions of landmarks with positive and negative relevance scores in L_k, respectively. Here we provide examples for two explanatory models, used during the experimental evaluation.

A first possible model returns the analogy only with the (unique) consistent landmark with the highest positive score if τ = 1, and with the lowest negative score when τ = −1. The explanation of a rejected decision in the Argument Classification of a Semantic Role Labeling task (Vanzo et al., 2016), described by the triple e1 = ⟨"vai in camera da letto", SOURCE of BRINGING, −1⟩, is:

  I think "in camera da letto" IS NOT [SOURCE] of [BRINGING] in "Vai in camera da letto" (LU:[vai]) since it's different from "sul tavolino" which is [SOURCE] of [BRINGING] in "Portami il mio catalogo sul tavolino" (LU:[porta])

The second model uses two active landmarks: one consistent and one contradictory with respect to the decision. For the triple e1 = ⟨"vai in camera da letto", GOAL of MOTION, 1⟩ the second model produces:

  I think "in camera da letto" IS [GOAL] of [MOTION] in "Vai in camera da letto" (LU:[vai]) since it recalls "al telefono" which is [GOAL] of [MOTION] in "Vai al telefono e controlla se ci sono messaggi" (LU:[vai]) and it IS NOT [SOURCE] of [BRINGING] since different from "sul tavolino" which is the [SOURCE] of [BRINGING] in "Portami il mio catalogo sul tavolino" (LU:[portami])
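A minimal Python sketch of the first explanatory model follows: given the LRP relevance scores of the landmarks (e.g. from the propagation sketched in Section 3), it picks the most relevant landmark consistent with the acceptance or rejection of the decision and verbalizes the analogy. The data structures, the parameter k and the sentence templates are illustrative assumptions, not the authors' implementation; the second model of the paper would combine one consistent and one contradictory landmark in the same way.

<pre>
def explain(sentence, predicted_class, tau, landmarks, relevances, k=5):
    """First explanatory model: verbalize the analogy with the most relevant
    landmark that is consistent with the acceptance (tau = +1) or rejection
    (tau = -1) of the decision. `landmarks` is a list of (text, class) pairs,
    `relevances` the LRP scores of each landmark for this sentence."""
    ranked = sorted(zip(landmarks, relevances), key=lambda p: p[1], reverse=(tau > 0))
    for (text, cls), r in ranked[:k]:                # at most k active landmarks (L_k)
        a = 1 if r > 0 else (-1 if r < 0 else 0)     # activation value a(l, s)
        delta = 1 if cls == predicted_class else -1  # delta(C_l, C)
        if a != 0 and delta * a * tau > 0:           # consistency check
            if tau > 0:
                return f'"{sentence}" is {predicted_class} since it is similar to "{text}"'
            return (f'"{sentence}" is not {predicted_class} since it is different from '
                    f'"{text}" which is {cls}')
    return f'"{sentence}" is {predicted_class} but I don\'t know why'
</pre>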
4.1 Evaluation methodology

In order to evaluate the impact of the produced explanations, we defined the following task: given a classification decision, i.e. the input o is classified as C, measure the impact of the explanation e on the belief that a user exhibits about the statement "o ∈ C is true". This information can be modeled through the estimates of the following probabilities: P(o ∈ C), which characterizes the amount of confidence the user has in accepting the statement, and its corresponding form P(o ∈ C|e), i.e. the same quantity in the case the user is provided with the explanation e. The core idea is that semantically coherent and exhaustive explanations must indicate correct classifications, whereas incoherent or non-existent explanations must hint towards wrong classifications. A quantitative measure of such an increase (or decrease) in confidence is the Information Gain (IG, (Kononenko and Bratko, 1991)) of the decision o ∈ C. Notice that IG measures the increase of probability corresponding to correct decisions, and the reduction of probability in case the decision is wrong. This amount suitably addresses the shift in uncertainty −log2(P(·)) between the two (subjective) estimates, i.e., P(o ∈ C) vs. P(o ∈ C|e).

Different explanatory models M can also be compared. The relative Information Gain I_M is measured against a collection of explanations e ∈ T_M generated by M and then normalized by the collection's entropy E as follows:

I_M = (1 / E) · (1 / |T_M|) · Σ_{e ∈ T_M} I(e)

where I(e) is the IG of each explanation (more details are in (Kononenko and Bratko, 1991)).
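The sketch below shows one way to operationalize this measure in Python, following the description above under the balanced setting used in the paper (prior P(o ∈ C) = 0.5, so E = 1 bit): the gain of each explanation is the reduction in surprisal of the true outcome once the posterior is fixed by the annotator's rating. It is an illustrative reading of the Kononenko and Bratko criterion, whose exact formulation may differ in details, and the example posteriors are taken from Table 1 below.

<pre>
import math

def information_gain(posterior, correct, prior=0.5):
    """IG of a single decision: change in surprisal -log2 P(.) of the true
    outcome when moving from the prior to the explanation-induced posterior."""
    if correct:
        return -math.log2(prior) + math.log2(posterior)
    return -math.log2(1 - prior) + math.log2(1 - posterior)

def relative_information_gain(ratings, prior=0.5):
    """Average IG over a collection of rated explanations, normalized by the
    prior entropy E. `ratings` is a list of (P(o in C | e), decision_was_correct)."""
    entropy = -(prior * math.log2(prior) + (1 - prior) * math.log2(1 - prior))
    return sum(information_gain(p, c, prior) for p, c in ratings) / (len(ratings) * entropy)

# Hypothetical ratings: a "V.Good" explanation of a correct decision and a
# "Bad" explanation of a wrong decision, with the posteriors of Table 1.
print(relative_information_gain([(0.95, True), (0.2, False)]))
</pre>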
5 Experimental Evaluation

The effectiveness of the proposed approach has been measured against two different semantic processing tasks, i.e. Question Classification (QC) over the UIUC dataset (Li and Roth, 2006) and Argument Classification in Semantic Role Labeling (SRL-AC) over the HuRIC dataset (Bastianelli et al., 2014; Vanzo et al., 2016). The adopted architecture consisted in an LRP-integrated KDA with 1 hidden layer and 500 landmarks for QC, 2 hidden layers and 100 landmarks for SRL-AC, and a stabilization term ε = 10e−8.

We defined five quality categories and associated each with a value of P(o ∈ C|e), as shown in Table 1. Three annotators then independently rated explanations generated from a collection composed of an equal number of correct and wrong classifications (for a total amount of 300 and 64 explanations, respectively, for QC and SRL-AC). This perfect balancing makes the prior probability P(o ∈ C) equal to 0.5, i.e. maximal entropy with a baseline IG = 0 in the [−1, 1] range. Notice that annotators had no information on the system classification performance, but just knowledge of the explanation dataset entropy.

Table 1: Posterior probabilities w.r.t. quality categories
Category | P(o ∈ C|e) | 1 − P(o ∈ C|e)
V.Good   | 0.95 | 0.05
Good     | 0.80 | 0.20
Weak     | 0.50 | 0.50
Bad      | 0.20 | 0.80
Incoher. | 0.05 | 0.95

Table 2: Information gains for the two Explanatory Models applied to the QC and SRL-AC datasets
Model         | QC    | SRL-AC
One landmark  | 0.548 | 0.669
Two landmarks | 0.580 | 0.784

5.1 Question Classification

Experimental evaluations (for details on KDA performance against the task, see (Croce et al., 2017)) showed that both models were able to gain more than half the bit required to ascertain whether the network statement is true or not (Table 2). Consider:

  I think "What year did Oklahoma become a state ?" refers to a NUMBER since recalls me "The film Jaws was made in what year ?"

Here the model returned coherent supporting evidence, a somewhat easy case given the available discriminative pair, i.e. "What year". The system is able to capture semantic similarities even in poorer conditions, e.g.:

  I think "Where is the Mall of the America ?" refers to a LOCATION since recalls me "What town was the setting for The Music Man ?" which refers to a LOCATION.

This high quality explanation is achieved even with such poor lexical overlap. It seems that richer representations are here involved, with grammatical and semantic similarity used as the main information in the decision at hand. Let us consider:

  I think "Mexican pesos are worth what in U.S. dollars ?" refers to a DESCRIPTION since it recalls me "What is the Bernoulli Principle ?"

Here the provided explanation is incoherent, as expected since the classification is wrong. Now consider:

  I think "What is the sales tax in Minnesota ?" refers to a NUMBER since it recalls me "What is the population of Mozambique ?" and does not refer to a ENTITY since different from "What is a fear of slime ?".

Although the explanation seems fairly coherent, it is actually misleading, as ENTITY is the annotated class. This shows how the system may lack contextual information, as humans do, against inherently ambiguous questions.

5.2 Argument Classification

Evaluation also targeted a second task, that is, Argument Classification in Semantic Role Labeling (SRL-AC): KDA is here fed with vectors from tree kernel evaluations, as discussed in (Croce et al., 2011). The evaluation is carried out over the HuRIC dataset (Vanzo et al., 2016), including about 240 domotic commands in Italian, comprising about 450 roles. The system has an accuracy of 91.2% on about 90 examples, while the training and development sets have a size of, respectively, 270 and 90 examples. We considered 64 explanations for measuring the IG of the two explanation models. Table 2 confirms that both explanatory models performed even better than in QC. This is due to the narrower linguistic domain (14 frames are involved) and the clearer boundaries between classes: annotators seem more sensitive to the explanatory information to assess the network decision. Examples of generated sentences are:

  I think "con me" is NOT the MANNER of COTHEME in "Robot vieni con me nel soggiorno? (LU:[vieni])" since it does NOT recall me "lentamente" which is MANNER in "Per favore segui quella persona lentamente (LU:[segui])". It is rather COTHEME of COTHEME since it recalls me "mi" which is COTHEME in "Seguimi nel bagno (LU:[segui])".

6 Conclusion and Future Works

This paper describes an LRP application to a KDA that makes use of analogies as explanations of a neural network decision. A methodology to measure the explanation quality has also been proposed, and the experimental evidence confirms the effectiveness of the method in increasing the trust of a user in automatic classifications. Future work will focus on the selection of subtrees as meaningful evidence for the explanation, on the modeling of negative information for disambiguation, as well as on a more in-depth investigation of the landmark selection policies. Moreover, improved experimental scenarios involving users and dialogues will also be designed, e.g. involving further investigation within Semantic Role Labeling, using the method proposed in (Croce et al., 2012).

References

Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7).

Emanuele Bastianelli, Giuseppe Castellucci, Danilo Croce, Luca Iocchi, Roberto Basili, and Daniele Nardi. 2014. HuRIC: a human robot interaction corpus. In LREC, pages 4519–4526. European Language Resources Association (ELRA).

Michael Collins and Nigel Duffy. 2001. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02), July 7-12, 2002, Philadelphia, PA, USA, pages 263–270. Association for Computational Linguistics, Morristown, NJ, USA.

Danilo Croce, Alessandro Moschitti, and Roberto Basili. 2011. Structured lexical similarity via convolution kernels on dependency trees. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1034–1046. Association for Computational Linguistics.

Danilo Croce, Alessandro Moschitti, Roberto Basili, and Martha Palmer. 2012. Verb classification using distributional similarity in syntactic and semantic structures. In ACL (1), pages 263–272. The Association for Computer Linguistics.

Danilo Croce, Simone Filice, Giuseppe Castellucci, and Roberto Basili. 2017. Deep learning in semantic kernel spaces. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 345–354, Vancouver, Canada, July. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780, November.

Igor Kononenko and Ivan Bratko. 1991. Information-based evaluation criterion for classifier's performance. Machine Learning, 6(1):67–80, January.

Hugo Larochelle and Geoffrey E. Hinton. 2010. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Proceedings of Neural Information Processing Systems (NIPS), pages 1243–1251.

Xin Li and Dan Roth. 2006. Learning question classifiers: the role of semantic information. Natural Language Engineering, 12(3):229–249.

John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK.

Andrea Vanzo, Danilo Croce, Roberto Basili, and Daniele Nardi. 2016. Context-aware spoken language understanding for human robot interaction. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.