=Paper=
{{Paper
|id=Vol-2253/paper71
|storemode=property
|title=On the Readability of Deep Learning Models: the role of Kernel-based Deep Architectures
|pdfUrl=https://ceur-ws.org/Vol-2253/paper71.pdf
|volume=Vol-2253
|authors=Danilo Croce,Daniele Rossini,Roberto Basili
|dblpUrl=https://dblp.org/rec/conf/clic-it/CroceR018
}}
==On the Readability of Deep Learning Models: the role of Kernel-based Deep Architectures==
On the Readability of Deep Learning Models: the role of Kernel-based Deep Architectures

Danilo Croce, Daniele Rossini and Roberto Basili
Department of Enterprise Engineering, University of Roma, Tor Vergata
{croce,basili}@info.uniroma2.it

Abstract

English. Deep Neural Networks achieve state-of-the-art performances in several semantic NLP tasks, but they lack explanation capabilities, as the underlying acquired models have limited interpretability. In other words, tracing back causal connections between the linguistic properties of an input instance and the produced classification is not possible. In this paper, we propose to apply Layerwise Relevance Propagation over linguistically motivated neural architectures, namely Kernel-based Deep Architectures (KDA), to guide argumentations and explanation inferences. In this way, decisions provided by a KDA can be linked to the semantics of input examples, used to linguistically motivate the network output.

Italiano. Deep Neural Networks today achieve state-of-the-art results in many NLP tasks, but the limited interpretability of the trained models restricts the understanding of their inferences. In other words, it is not possible to determine causal connections between the linguistic properties of an example and the classification produced by the network. In this work, Layerwise Relevance Propagation is applied to Kernel-based Deep Architectures (KDA) to determine connections between the semantics of the input and the output class, which correspond to linguistically transparent explanations of the decision.

1 Introduction

Deep Neural Networks are usually criticized as they are not epistemologically transparent devices, i.e. their models cannot be used to provide explanations of the resulting inferences. An example can be neural question classification (QC) (e.g. (Croce et al., 2017)). In QC the correct category of a question is detected to optimize the later stages of a question answering system (Li and Roth, 2006). An epistemologically transparent learning system should trace back the causal connections between the proposed question category and the linguistic properties of the input question. For example, the system could motivate the decision "What is the capital of Zimbabwe?" refers to a Location, with a sentence such as: Since it is similar to "What is the capital of California?", which also refers to a Location. Unfortunately, neural models, such as Multilayer Perceptrons (MLP), Long Short-Term Memory Networks (LSTM) (Hochreiter and Schmidhuber, 1997), or even Attention-based Networks (Larochelle and Hinton, 2010), correspond to parameters that have no clear conceptual counterpart: it is thus difficult to trace back the network components (e.g. neurons or layers in the resulting topology) responsible for the answer.

In image classification, Layerwise Relevance Propagation (LRP) (Bach et al., 2015) has been used to decompose backward, across the MLP layers, the evidence about the contribution of individual input fragments (i.e. pixels of the input images) to the final decision. Evaluation against the MNIST and ILSVRC benchmarks suggests that LRP activates associations between input and output fragments, thus tracing back meaningful causal connections.

In this paper, we propose the use of a similar mechanism over a linguistically motivated network architecture, the Kernel-based Deep Architecture (KDA) (Croce et al., 2017). Tree Kernels (Collins and Duffy, 2001) are here used to integrate syntactic/semantic information within an MLP network.
We will show how KDA input nodes correspond to linguistic instances and how, by applying the LRP method, we are able to trace back causal associations between the semantic classification and such instances. Evaluation of the LRP algorithm is based on the idea that explanations improve the user expectations about the correctness of an answer, and it shows its applicability in human computer interfaces.

In the rest of the paper, Section 2 describes the KDA neural approach, while Section 3 illustrates how LRP connects to KDAs. In Section 4 early results of the evaluation are reported.

2 Training Neural Networks in Kernel Spaces

Given a training set D, a kernel K(o_i, o_j) is a similarity function over D^2 that corresponds to a dot product in the implicit kernel space, i.e., K(o_i, o_j) = Φ(o_i) · Φ(o_j). Kernel functions are used by learning algorithms, such as Support Vector Machines (Shawe-Taylor and Cristianini, 2004), to efficiently operate on instances in the kernel space: their advantage is that the projection function Φ(o) = x ∈ R^n is never explicitly computed. The Nyström method is a factorization method applied to derive a new low-dimensional embedding x̃ in a l-dimensional space, with l ≪ n, so that G ≈ G̃ = X̃ X̃^T, where G = X X^T is the Gram matrix such that G_ij = Φ(o_i) · Φ(o_j) = K(o_i, o_j). The approximation G̃ is obtained using a subset of l columns of the matrix, i.e., a selection of a subset L ⊂ D of the available examples, called landmarks. Given l randomly sampled columns of G, let C ∈ R^{|D| × l} be the matrix of these sampled columns. Then, we can rearrange the columns and rows of G and define X = [X_1 X_2] such that

G = [W, X_1^T X_2 ; X_2^T X_1, X_2^T X_2]   and   C = [W ; X_2^T X_1]

where W = X_1^T X_1, i.e., the subset of G that contains only landmarks. The Nyström approximation can be defined as:

G ≈ G̃ = C W^† C^T    (1)

where W^† denotes the Moore-Penrose inverse of W. If we apply the Singular Value Decomposition (SVD) to W, which is symmetric positive definite, we get W = U S V^T = U S U^T. Then it is straightforward to see that W^† = U S^{-1} U^T = U S^{-1/2} S^{-1/2} U^T and that, by substitution, G ≈ G̃ = (C U S^{-1/2})(C U S^{-1/2})^T = X̃ X̃^T. Given an example o ∈ D, its new low-dimensional representation x̃ is determined by considering the corresponding row c of C as

x̃ = c U S^{-1/2}    (2)

where c is the vector whose dimensions contain the evaluations of the kernel function between o and each landmark o_j ∈ L. Therefore, the method produces l-dimensional vectors.
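As a concrete illustration of the projection in Eq. 2, the following minimal numpy sketch derives H_Ny = U S^{-1/2} from the landmark kernel matrix W and embeds one example from its kernel evaluations c. The linear kernel, the random data and the landmark sampling are hypothetical stand-ins for the linguistic kernels used in the paper, not the authors' implementation.

<pre>
import numpy as np

def nystrom_projection(W, eps=1e-12):
    """Compute the Nystrom projection matrix H_Ny = U S^{-1/2} (Eq. 2)
    from W, the l x l kernel matrix among the landmarks (W = U S U^T)."""
    U, s, _ = np.linalg.svd(W)
    return U @ np.diag(1.0 / np.sqrt(s + eps))   # eps guards against tiny singular values

# Toy example with a linear kernel K(o_i, o_j) = o_i . o_j over random vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))       # 100 training examples
L = X[:10]                           # l = 10 randomly selected landmarks
W = L @ L.T                          # kernel evaluations among landmarks
H_Ny = nystrom_projection(W)         # l x l projection matrix

c = X[42] @ L.T                      # kernel evaluations K(o, l_i) for one example o
x_tilde = c @ H_Ny                   # its l-dimensional embedding (Eq. 2)
print(x_tilde.shape)                 # (10,)
</pre>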
Given a labeled dataset, a Multi-Layer Perceptron (MLP) architecture can be defined, with a specific Nyström layer based on the Nyström embeddings of Eq. 2 (Croce et al., 2017). Such a Kernel-based Deep Architecture (KDA) has an input layer, a Nyström layer, a possibly empty sequence of non-linear hidden layers and a final classification layer, which produces the output. In particular, the input layer corresponds to the input vector c, i.e., the row of the C matrix associated to an example o. It is then mapped to the Nyström layer, through the projection in Equation 2. Notice that the embedding also provides the proper weights, defined by U S^{-1/2}, so that the mapping can be expressed through the Nyström matrix H_Ny = U S^{-1/2}: it corresponds to a pre-training stage based on the SVD. Formally, the low-dimensional embedding of an input example o, x̃ = c H_Ny = c U S^{-1/2}, encodes the kernel space. Any neural network can then be adopted: in the rest of this paper, we assume that a traditional Multi-Layer Perceptron (MLP) architecture is stacked in order to solve the targeted classification problems. The final layer of the KDA is the classification layer, whose dimensionality depends on the classification task: it computes a linear classification function with a softmax operator.

A KDA is stimulated by an input vector c which corresponds to the kernel evaluations K(o, l_i) between each example o and the landmarks l_i. Linguistic kernels (such as Semantic Tree Kernels (Croce et al., 2011)) depend on the syntactic/semantic similarity between the input x and the subset of landmarks l_i used for the space reconstruction. We will see hereafter how tracing back through relevance propagation into a KDA architecture corresponds to determining which semantic landmarks contribute mostly to the final output decision.
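The KDA forward pass can thus be summarized as a standard MLP whose first weight matrix is the fixed, SVD-derived H_Ny. A minimal numpy sketch follows, reusing c and H_Ny from the previous snippet; the single tanh hidden layer, the layer sizes and the random initialization are illustrative assumptions, not the configuration used by the authors.

<pre>
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kda_forward(c, H_Ny, W_h, b_h, W_out, b_out):
    """KDA forward pass: kernel evaluations c -> Nystrom layer (fixed weights
    H_Ny from the SVD) -> one tanh hidden layer -> linear layer with softmax."""
    x_tilde = c @ H_Ny                   # Nystrom layer (Eq. 2), not trained
    h = np.tanh(x_tilde @ W_h + b_h)     # non-linear hidden layer, trained by backprop
    return softmax(h @ W_out + b_out)    # class probabilities

# Hypothetical sizes: l = 10 landmarks, 32 hidden units, 6 output classes.
rng = np.random.default_rng(1)
W_h, b_h = rng.normal(size=(10, 32)), np.zeros(32)
W_out, b_out = rng.normal(size=(32, 6)), np.zeros(6)
# probs = kda_forward(c, H_Ny, W_h, b_h, W_out, b_out)   # c and H_Ny from the previous sketch
</pre>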
3 Layer-wise Relevance Propagation in Kernel-based Deep Architectures

Layer-wise Relevance Propagation (LRP, presented in (Bach et al., 2015)) is a framework which allows to decompose the prediction of a deep neural network computed over a sample, e.g. an image, down to relevance scores for the single input dimensions, such as a subset of pixels.

Formally, let f : R^d → R^+ be a positive real-valued function taking a vector x ∈ R^d as input: f quantifies, for example, the probability of x characterizing a certain class. Layer-wise Relevance Propagation assigns to each dimension, or feature, x_d, a relevance score R_d^(1) such that:

f(x) ≈ Σ_d R_d^(1)    (3)

Features whose score R_d^(1) > 0 (or R_d^(1) < 0) correspond to evidence in favor of (or against) the output classification. In other words, LRP allows to identify fragments of the input playing key roles in the decision, by propagating relevance backwards. Let us suppose to know the relevance score R_j^(l+1) of a neuron j at network layer l+1; then it can be decomposed into messages R_{i←j}^(l,l+1) sent to neurons i in layer l:

R_j^(l+1) = Σ_{i ∈ (l)} R_{i←j}^(l,l+1)    (4)

Hence the relevance of a neuron i at layer l can be defined as:

R_i^(l) = Σ_{j ∈ (l+1)} R_{i←j}^(l,l+1)    (5)

Note that Eq. 4 and 5 are such that Eq. 3 holds. In this work, we adopted the ε-rule defined in (Bach et al., 2015) to compute the messages R_{i←j}^(l,l+1), i.e.

R_{i←j}^(l,l+1) = z_ij / (z_j + ε · sign(z_j)) · R_j^(l+1)

where z_ij = x_i w_ij, z_j = Σ_i z_ij, and ε > 0 is a numerical stabilizing term that must be small. Notice that the weights w_ij correspond to weighted activations of input neurons. If we apply LRP to a KDA, it implicitly traces the relevance back to the input layer, i.e. to the landmarks. It thus tracks back syntactic, semantic and lexical relations between a question and the landmark, and it grants high relevance to the relations the network selected as highly discriminating for the class representations it learned; note that this is different from similarity in terms of kernel-function evaluation, as the latter is task independent whereas LRP scores are not. Notice also that each landmark is uniquely associated to an entry of the input vector c, as shown in Section 2, and, as a member of the training dataset, it also corresponds to a known class.
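To make the propagation concrete, here is a minimal numpy sketch of the ε-rule applied to one dense layer of the KDA sketched above. It is an illustrative re-implementation of Eq. 4-5, not the authors' code; biases are omitted for brevity and the variable names refer to the hypothetical KDA snippet of Section 2.

<pre>
import numpy as np

def lrp_epsilon_dense(x, W, R_out, eps=1e-8):
    """epsilon-rule (Bach et al., 2015) for one dense layer: redistribute the
    output relevances R_out onto the inputs x, with z_ij = x_i * w_ij and
    z_j = sum_i z_ij (biases omitted for brevity)."""
    z_ij = x[:, None] * W                                  # shape (n_in, n_out)
    z_j = z_ij.sum(axis=0)                                 # layer pre-activations
    denom = z_j + eps * np.where(z_j >= 0, 1.0, -1.0)      # stabilized denominator
    return (z_ij / denom) @ R_out                          # input relevances R_i

# Propagating back through the illustrative KDA (hypothetical variables):
# R_hidden    = lrp_epsilon_dense(h, W_out, R_output)      # classification -> hidden layer
# R_x_tilde   = lrp_epsilon_dense(x_tilde, W_h, R_hidden)  # hidden -> Nystrom layer
# R_landmarks = lrp_epsilon_dense(c, H_Ny, R_x_tilde)      # Nystrom -> landmark relevances
</pre>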
4 Explanatory Models

LRP allows the automatic compilation of justifications for the KDA classifications: explanations are possible using landmarks {ℓ} as examples. The landmarks {ℓ} that the LRP method produces as the most active elements in layer 0 are semantic analogues of input annotated examples. An Explanatory Model is the function in charge of compiling the linguistically fluent explanation of individual analogies (or differences) with the input case. The meaningfulness of such analogies makes the resulting explanation clear and should increase the user confidence in the system reliability. When a sentence o is classified, LRP assigns activation scores r_ℓ^s to each individual landmark ℓ: let L^(+) (or L^(−)) denote the set of landmarks with positive (or negative) activation scores.

Formally, an explanation is characterized by a triple e = ⟨s, C, τ⟩ where s is the input sentence, C is the predicted label and τ is the modality of the explanation: τ = +1 for positive (i.e. acceptance) statements, while τ = −1 corresponds to rejections of the decision C.

A landmark ℓ is positively activated for a given sentence s if there are no more than k−1 other active landmarks ℓ' whose activation value is higher than the one for ℓ (k is a parameter used to make the explanation depend on no more than k landmarks, denoted by L_k), i.e.

|{ℓ' ∈ L^(+) : ℓ' ≠ ℓ ∧ r_{ℓ'}^s ≥ r_ℓ^s > 0}| < k

A landmark is negatively activated when |{ℓ' ∈ L^(−) : ℓ' ≠ ℓ ∧ r_{ℓ'}^s ≤ r_ℓ^s < 0}| < k. Positively (or negatively) activated landmarks in L_k are assigned an activation value a(ℓ, s) = +1 (or −1). For all other non-activated landmarks: a(ℓ, s) = 0.

Given the explanation e = ⟨s, C, τ⟩, a landmark ℓ whose (known) class is C_ℓ is consistent (or inconsistent) with e according to whether the function

δ(C_ℓ, C) · a(ℓ, s) · τ

is positive (or negative, respectively), where δ(C', C) = 2 δ_kron(C' = C) − 1 and δ_kron is the Kronecker delta.

The explanatory model is then a function M(e, L_k) which maps an explanation e and a subset L_k of the active and consistent landmarks L for e into a sentence in natural language. Of course several definitions for M(e, L_k) and L_k are possible. A general explanatory model would be:

M(e, L_k) =
  "s is C since it is similar to ℓ"   ∀ℓ ∈ L_k^+, if τ > 0
  "s is not C since it is different from ℓ which is C"   ∀ℓ ∈ L_k^−, if τ < 0
  "s is C but I don't know why"   if L_k = ∅

where L_k^+, L_k^− ⊆ L_k are the partitions of landmarks with positive and negative relevance scores in L_k, respectively. Here we provide examples for two explanatory models, used during the experimental evaluation.

A first possible model returns the analogy only with the (unique) consistent landmark with the highest positive score if τ = 1, and with the lowest negative score when τ = −1. The explanation of a rejected decision in the Argument Classification of a Semantic Role Labeling task (Vanzo et al., 2016), described by the triple e1 = ⟨"vai in camera da letto", SOURCE of BRINGING, −1⟩, is:

  I think "in camera da letto" IS NOT [SOURCE] of [BRINGING] in "Vai in camera da letto" (LU:[vai]) since it's different from "sul tavolino" which is [SOURCE] of [BRINGING] in "Portami il mio catalogo sul tavolino" (LU:[porta])

The second model uses two active landmarks: one consistent and one contradictory with respect to the decision. For the triple e1 = ⟨"vai in camera da letto", GOAL of MOTION, 1⟩ the second model produces:

  I think "in camera da letto" IS [GOAL] of [MOTION] in "Vai in camera da letto" (LU:[vai]) since it recalls "al telefono" which is [GOAL] of [MOTION] in "Vai al telefono e controlla se ci sono messaggi" (LU:[vai]) and it IS NOT [SOURCE] of [BRINGING] since different from "sul tavolino" which is the [SOURCE] of [BRINGING] in "Portami il mio catalogo sul tavolino" (LU:[portami])
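A minimal Python sketch of the first explanatory model follows: given the LRP relevance scores of the landmarks (e.g. from the propagation sketched in Section 3), it picks the most relevant landmark consistent with the acceptance or rejection of the decision and verbalizes the analogy. The data structures, the parameter k and the sentence templates are illustrative assumptions, not the authors' implementation; the second model of the paper would combine one consistent and one contradictory landmark in the same way.

<pre>
def explain(sentence, predicted_class, tau, landmarks, relevances, k=5):
    """First explanatory model: verbalize the analogy with the most relevant
    landmark that is consistent with the acceptance (tau = +1) or rejection
    (tau = -1) of the decision. `landmarks` is a list of (text, class) pairs,
    `relevances` the LRP scores of each landmark for this sentence."""
    ranked = sorted(zip(landmarks, relevances), key=lambda p: p[1], reverse=(tau > 0))
    for (text, cls), r in ranked[:k]:                # at most k active landmarks (L_k)
        a = 1 if r > 0 else (-1 if r < 0 else 0)     # activation value a(l, s)
        delta = 1 if cls == predicted_class else -1  # delta(C_l, C)
        if a != 0 and delta * a * tau > 0:           # consistency check
            if tau > 0:
                return f'"{sentence}" is {predicted_class} since it is similar to "{text}"'
            return (f'"{sentence}" is not {predicted_class} since it is different from '
                    f'"{text}" which is {cls}')
    return f'"{sentence}" is {predicted_class} but I don\'t know why'
</pre>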
4.1 Evaluation methodology

In order to evaluate the impact of the produced explanations, we defined the following task: given a classification decision, i.e. the input o is classified as C, measure the impact of the explanation e on the belief that a user exhibits about the statement "o ∈ C is true". This information can be modeled through the estimates of the following probabilities: P(o ∈ C), which characterizes the amount of confidence the user has in accepting the statement, and its corresponding form P(o ∈ C|e), i.e. the same quantity in the case the user is provided with the explanation e. The core idea is that semantically coherent and exhaustive explanations must indicate correct classifications, whereas incoherent or non-existent explanations must hint towards wrong classifications. A quantitative measure of such an increase (or decrease) in confidence is the Information Gain (IG, (Kononenko and Bratko, 1991)) of the decision o ∈ C. Notice that IG measures the increase of probability corresponding to correct decisions, and the reduction of probability in case the decision is wrong. This amount suitably addresses the shift in uncertainty −log2(P(·)) between the two (subjective) estimates, i.e., P(o ∈ C) vs. P(o ∈ C|e).

Different explanatory models M can also be compared. The relative Information Gain I_M is measured against a collection of explanations e ∈ T_M generated by M and then normalized by the collection's entropy E as follows:

I_M = (1 / E) · (1 / |T_M|) · Σ_{e ∈ T_M} I(e)

where I(e) is the IG of each explanation (more details are in (Kononenko and Bratko, 1991)).
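The sketch below shows one way to operationalize this measure in Python, following the description above under the balanced setting used in the paper (prior P(o ∈ C) = 0.5, so E = 1 bit): the gain of each explanation is the reduction in surprisal of the true outcome once the posterior is fixed by the annotator's rating. It is an illustrative reading of the Kononenko and Bratko criterion, whose exact formulation may differ in details, and the example posteriors are taken from Table 1 below.

<pre>
import math

def information_gain(posterior, correct, prior=0.5):
    """IG of a single decision: change in surprisal -log2 P(.) of the true
    outcome when moving from the prior to the explanation-induced posterior."""
    if correct:
        return -math.log2(prior) + math.log2(posterior)
    return -math.log2(1 - prior) + math.log2(1 - posterior)

def relative_information_gain(ratings, prior=0.5):
    """Average IG over a collection of rated explanations, normalized by the
    prior entropy E. `ratings` is a list of (P(o in C | e), decision_was_correct)."""
    entropy = -(prior * math.log2(prior) + (1 - prior) * math.log2(1 - prior))
    return sum(information_gain(p, c, prior) for p, c in ratings) / (len(ratings) * entropy)

# Hypothetical ratings: a "V.Good" explanation of a correct decision and a
# "Bad" explanation of a wrong decision, with the posteriors of Table 1.
print(relative_information_gain([(0.95, True), (0.2, False)]))
</pre>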
5 Experimental Evaluation

The effectiveness of the proposed approach has been measured against two different semantic processing tasks, i.e. Question Classification (QC) over the UIUC dataset (Li and Roth, 2006) and Argument Classification in Semantic Role Labeling (SRL-AC) over the HuRIC dataset (Bastianelli et al., 2014; Vanzo et al., 2016). The adopted architecture consisted in an LRP-integrated KDA with 1 hidden layer and 500 landmarks for QC, 2 hidden layers and 100 landmarks for SRL-AC, and a stabilization term ε = 10e−8.

We defined five quality categories and associated each with a value of P(o ∈ C|e), as shown in Table 1. Three annotators then independently rated explanations generated from a collection composed of an equal number of correct and wrong classifications (for a total amount of 300 and 64 explanations, respectively, for QC and SRL-AC). This perfect balancing makes the prior probability P(o ∈ C) equal to 0.5, i.e. maximal entropy with a baseline IG = 0 in the [−1, 1] range. Notice that annotators had no information on the system classification performance, but just knowledge of the explanation dataset entropy.

Table 1: Posterior probabilities w.r.t. quality categories
Category | P(o ∈ C|e) | 1 − P(o ∈ C|e)
V.Good   | 0.95 | 0.05
Good     | 0.80 | 0.20
Weak     | 0.50 | 0.50
Bad      | 0.20 | 0.80
Incoher. | 0.05 | 0.95

Table 2: Information gains for the two Explanatory Models applied to the QC and SRL-AC datasets
Model         | QC    | SRL-AC
One landmark  | 0.548 | 0.669
Two landmarks | 0.580 | 0.784

5.1 Question Classification

Experimental evaluations (for details on KDA performance against the task, see (Croce et al., 2017)) showed that both models were able to gain more than half the bit required to ascertain whether the network statement is true or not (Table 2). Consider:

  I think "What year did Oklahoma become a state ?" refers to a NUMBER since recalls me "The film Jaws was made in what year ?"

Here the model returned coherent supporting evidence, a somewhat easy case given the available discriminative pair, i.e. "What year". The system is able to capture semantic similarities even in poorer conditions, e.g.:

  I think "Where is the Mall of the America ?" refers to a LOCATION since recalls me "What town was the setting for The Music Man ?" which refers to a LOCATION.

This high quality explanation is achieved even with such poor lexical overlap. It seems that richer representations are here involved, with grammatical and semantic similarity used as the main information in the decision at hand. Let us consider:

  I think "Mexican pesos are worth what in U.S. dollars ?" refers to a DESCRIPTION since it recalls me "What is the Bernoulli Principle ?"

Here the provided explanation is incoherent, as expected since the classification is wrong. Now consider:

  I think "What is the sales tax in Minnesota ?" refers to a NUMBER since it recalls me "What is the population of Mozambique ?" and does not refer to a ENTITY since different from "What is a fear of slime ?".

Although the explanation seems fairly coherent, it is actually misleading, as ENTITY is the annotated class. This shows how the system may lack contextual information, as humans do, against inherently ambiguous questions.

5.2 Argument Classification

Evaluation also targeted a second task, that is, Argument Classification in Semantic Role Labeling (SRL-AC): KDA is here fed with vectors from tree kernel evaluations, as discussed in (Croce et al., 2011). The evaluation is carried out over the HuRIC dataset (Vanzo et al., 2016), including about 240 domotic commands in Italian, comprising about 450 roles. The system has an accuracy of 91.2% on about 90 examples, while the training and development sets have a size of, respectively, 270 and 90 examples. We considered 64 explanations for measuring the IG of the two explanation models. Table 2 confirms that both explanatory models performed even better than in QC. This is due to the narrower linguistic domain (14 frames are involved) and the clearer boundaries between classes: annotators seem more sensitive to the explanatory information to assess the network decision. Examples of generated sentences are:

  I think "con me" is NOT the MANNER of COTHEME in "Robot vieni con me nel soggiorno? (LU:[vieni])" since it does NOT recall me "lentamente" which is MANNER in "Per favore segui quella persona lentamente (LU:[segui])". It is rather COTHEME of COTHEME since it recalls me "mi" which is COTHEME in "Seguimi nel bagno (LU:[segui])".

6 Conclusion and Future Works

This paper describes an LRP application to a KDA that makes use of analogies as explanations of a neural network decision. A methodology to measure the explanation quality has also been proposed, and the experimental evidence confirms the effectiveness of the method in increasing the trust of a user in automatic classifications. Future work will focus on the selection of subtrees as meaningful evidence for the explanation, on the modeling of negative information for disambiguation, as well as on a more in-depth investigation of the landmark selection policies. Moreover, improved experimental scenarios involving users and dialogues will also be designed, e.g. involving further investigation within Semantic Role Labeling, using the method proposed in (Croce et al., 2012).

References

Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7).

Emanuele Bastianelli, Giuseppe Castellucci, Danilo Croce, Luca Iocchi, Roberto Basili, and Daniele Nardi. 2014. HuRIC: a human robot interaction corpus. In LREC, pages 4519–4526. European Language Resources Association (ELRA).

Michael Collins and Nigel Duffy. 2001. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL '02), July 7-12, 2002, Philadelphia, PA, USA, pages 263–270. Association for Computational Linguistics, Morristown, NJ, USA.

Danilo Croce, Alessandro Moschitti, and Roberto Basili. 2011. Structured lexical similarity via convolution kernels on dependency trees. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1034–1046. Association for Computational Linguistics.

Danilo Croce, Alessandro Moschitti, Roberto Basili, and Martha Palmer. 2012. Verb classification using distributional similarity in syntactic and semantic structures. In ACL (1), pages 263–272. The Association for Computer Linguistics.

Danilo Croce, Simone Filice, Giuseppe Castellucci, and Roberto Basili. 2017. Deep learning in semantic kernel spaces. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 345–354, Vancouver, Canada, July. Association for Computational Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780, November.

Igor Kononenko and Ivan Bratko. 1991. Information-based evaluation criterion for classifier's performance. Machine Learning, 6(1):67–80, January.

Hugo Larochelle and Geoffrey E. Hinton. 2010. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Proceedings of Neural Information Processing Systems (NIPS), pages 1243–1251.

Xin Li and Dan Roth. 2006. Learning question classifiers: the role of semantic information. Natural Language Engineering, 12(3):229–249.

John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK.

Andrea Vanzo, Danilo Croce, Roberto Basili, and Daniele Nardi. 2016. Context-aware spoken language understanding for human robot interaction. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.