=Paper=
{{Paper
|id=Vol-2600/paper20
|storemode=property
|title=Mitigating Bias in Deep Nets with Knowledge Bases: the Case of Natural Language Understanding for Robots
|pdfUrl=https://ceur-ws.org/Vol-2600/paper20.pdf
|volume=Vol-2600
|authors=Martino Mensio,Emanuele Bastianelli,Ilaria Tiddi,Giuseppe Rizzo
|dblpUrl=https://dblp.org/rec/conf/aaaiss/MensioBT020
}}
==Mitigating Bias in Deep Nets with Knowledge Bases: the Case of Natural Language Understanding for Robots==
Martino Mensio (Knowledge Media Institute, The Open University, UK, martino.mensio@open.ac.uk), Emanuele Bastianelli (The Interaction Lab, Heriot-Watt University, UK, emanuele.bastianelli@hw.ac.uk), Ilaria Tiddi (Department of Computer Science, Vrije Universiteit Amsterdam, NL, i.tiddi@vu.nl), Giuseppe Rizzo (LINKS Foundation, Italy, giuseppe.rizzo@linksfoundation.com)

Abstract

In this paper, we tackle the lack of understandability of deep learning systems by integrating heterogeneous knowledge sources; specifically, we present how we used FrameNet to guarantee correct learning for an LSTM-based semantic parser in the task of Spoken Language Understanding for robots. The problem of the explainability of Artificial Intelligence (AI) systems, i.e. their ability to explain decisions to both experts and end users, has attracted growing attention in recent years, as it affects their credibility and trustworthiness. Trusting these systems is fundamental in the context of AI-based robotic companions interacting in natural language, as the users' acceptance of the robot also relies on its ability to explain the reasons behind its actions. Following similar approaches, we first use the values of the neural attention layers employed in the semantic parser as a clue to analyze and interpret the model's behavior and reveal the intrinsic bias induced by the training data. We then show how the integration of knowledge from external resources such as FrameNet can help minimize, or mitigate, such bias, and consequently guarantee that the model provides the correct interpretations. Our preliminary but promising results suggest that (i) attention layers can improve model understandability; (ii) the integration of different knowledge bases can help overcome the limitations of machine learning models; and (iii) an approach combining the strengths of both knowledge engineering and machine learning can foster the development of more transparent, understandable intelligent systems.

Introduction

With the dramatic success of new machine learning techniques relying on deep architectures, the number of Artificial Intelligence (AI)-based systems has rapidly increased. Events such as the Cambridge Analytica scandal and the disruptions of the 2016 US elections have brought researchers and practitioners to question the explainability of these systems, i.e. their ability to explain decisions to both experts and end-users, resulting in a number of initiatives to improve their understandability and trustworthiness (cf. DARPA's eXplainable AI program [1]; the "right to explanation" requested by the European General Data Protection Regulation; and the "Ethics guidelines for Trustworthy AI" published by the European Union in December 2018 [2]). In the context of robotic companions interacting in natural language using AI techniques, where our research is placed, trust and transparency are fundamental aspects, as the users' acceptance of robot assistants will also be based on their ability to explain the reasons behind their actions, if and when required.

Let us take the example of a robot understanding spoken commands given by a human, e.g. "take the book from the table", where a corresponding robot action such as take(book, table) has to be instantiated correctly. Such instantiation is generally triggered by a trained model, where noise, over-fitting, and mislabeling could indeed lead to an undesired output, e.g. the robot placing the book on the table. In the view of symbiotic autonomous robots (Rosenthal, Biswas, and Veloso 2010) that rely on humans to overcome their limitations and correct their actions, a transparent model could help identify and make explicit the reason(s) behind the wrong behavior of the robot.

Our motivation is the semantic processing of robotic commands (also called semantic parsing) from spoken language utterances, i.e. the process of mapping natural language sentences to formal meaning representations. The formal meaning representation theory we rely upon is Frame Semantics (Fillmore 1985), which describes actions and events expressed in language through conceptual structures called semantic frames. This theory also states that a frame is evoked in a sentence through the occurrence of specific lexical units, i.e. words (such as verbs and nouns) that linguistically express the underlying situation. To identify such frames, we built a semantic parser based on a multi-layer Long Short-Term Memory (LSTM) neural network with attention (Mensio et al. 2018), and trained it over the Human-Robot Interaction Corpus (HuRIC) (Bastianelli et al. 2014). LSTMs, like many similar deep-net-based models, have an opaque nature, i.e. they do not give clear clues on the way they behave, which may complicate the understanding of undesired behaviors such as, in our case, an incorrect robot behavior. Moreover, the inner workings of such models tend to be harder to understand when they are trained on small, domain-specific datasets (such as HuRIC), as these often lack effective representativeness of the problem domain. The questions we wish to answer in this work are therefore:

• how can we better understand our LSTM-based model?
• how can we identify undesired behaviors in the model?
• is there a way to mitigate such undesired behaviors?

To answer the first two questions, we rely on the idea that linguistic theories could be used in the context of our semantic parser to obtain a better understanding of the model, i.e. they could be exploited to explain the model's behavior. Recent trends in deep learning have shown that visual explanations for a model's behavior can be obtained through the analysis of the values of its attention layers in a number of tasks (Machine Translation (Bahdanau, Cho, and Bengio 2014), Sentiment Analysis (Lin et al. 2017), Image Captioning (Xu et al. 2015)), thanks to their ability to correlate inputs and outputs. Inspired by these works, our hypothesis is that we can use attentions to achieve some degree of explainability for the LSTM-based parser, and that Frame Semantics can be the key to drive the interpretation process. We therefore use attentions to capture the interpretation of spoken commands and, more specifically, use the values that the attention layer assigns to each word of a given sentence to detect which word is the lexical unit evoking (i.e. causing) the identified frame. We show how this not only gives us a hint about the model's behavior, but also that attentions help unveil the intrinsic bias induced by our training data. Here, we exploit the linguistic knowledge encoded in an external resource such as FrameNet (Baker, Fillmore, and Lowe 1998) in a data augmentation strategy, with the goal of mitigating the corpus bias, improving the explanations that the model provides and, consequently, the overall model results.

Although preliminary, our promising results suggest that attention layers combined with Frame Semantics do provide a clue to a more explainable model, and that the integration of external knowledge bases can help overcome the inner limitations of machine learning models. More importantly, our method suggests that the combination of knowledge engineering and machine learning techniques can be beneficial for the development of more transparent, understandable intelligent systems.

Copyright (c) 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Author contributions: EB and IT developed the theoretical framework and directed the project; EB and GR designed the experiments; MM derived the models, performed experiments and analysed the results; IT and EB wrote the manuscript in consultation with GR and MM.

[1] https://www.darpa.mil/program/explainable-artificial-intelligence
[2] https://ec.europa.eu/futurium/en/ai-alliance-consultation/guidelines

Motivation and Background

In this section, we present the theoretical and technical background of our work. We first discuss Frame Semantics, which we use as our linguistic theory of reference, and then describe the technical details of our neural network-based semantic parser.

Fundamentals of Frame Semantics

Frame Semantics is a theory that formalizes how a sentence is related to semantic frames. Each frame is a conceptual structure representing an action or, more generally, an event or situation (e.g. the action of Taking). Frames are further specified by a set of frame elements (e.g. the THEME, representing the object taken while performing the action Taking), which enhance the meaning of the frame with additional information. According to Frame Semantics, frames are evoked in sentences by specific words, called lexical units (LU). Lexical units are responsible for conveying the meaning of the frames, representing hooks between the textual surface and the theory itself. In the example of Figure 1, the frame Taking is evoked in the sentence "take the book from the table" by the LU take, while "the book" and "from the table" represent the THEME and the ORIGIN frame elements respectively:

[take]Taking [the book]THEME [from the table]ORIGIN

Figure 1: Example of semantic frame annotation for the sentence "take the book from the table".

The process of annotating Frame Semantics over natural language involves three different tasks. First, all the frames evoked in a sentence are identified by looking at the potential LUs contained in it. This task is generally called Frame Prediction or Frame Induction; here, we refer to it as Action Detection (AD), as we are dealing with the action expressed by the person uttering the command to the robot. The second task is called Argument Identification (AI, sometimes also called Boundary Detection) and is responsible for finding the spans of text corresponding to possible frame elements. The last task is called Argument Classification (AC) and consists in assigning a label to the spans identified during AI. Note that the AI and AC tasks are often referred to together as the process of Semantic Role Labelling.

If we take the example of "take the book from the table", the frame Taking would be predicted in the AD step by identifying the LU take. In the AI step, "the book" and "from the table" would be identified as two frame element spans, and respectively classified as THEME and ORIGIN frame elements in the following AC step.

A multi-layer LSTM-based parser

In our previous work (Mensio et al. 2018), we presented a semantic parser for robotic commands, called 3LSTM-ATT, based on a multi-layer LSTM network exploiting attention mechanisms. The 3LSTM-ATT topology is shown in Figure 2. The network was adapted from (Liu and Lane 2016) so that each layer could carry out one of the three semantic parsing tasks presented above. We briefly describe the network in the following, and refer the reader to the original paper for more details.
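To make the AD/AI/AC annotation scheme concrete, here is a minimal sketch of how a frame annotation and the IOB labels used by the parser could be represented; the class and field names are illustrative, not HuRIC's actual format:

```python
from dataclasses import dataclass, field

@dataclass
class FrameElement:
    label: str    # e.g. "Theme" (AC label)
    span: tuple   # (start, end) token indexes, end exclusive (AI span)

@dataclass
class FrameAnnotation:
    frame: str                       # e.g. "Taking" (AD label)
    lu_index: int                    # token index of the lexical unit
    elements: list = field(default_factory=list)

tokens = ["take", "the", "book", "from", "the", "table"]
ann = FrameAnnotation(
    frame="Taking",
    lu_index=0,  # "take" evokes the frame
    elements=[
        FrameElement("Theme", (1, 3)),   # "the book"
        FrameElement("Origin", (3, 6)),  # "from the table"
    ],
)

def iob_labels(ann, n_tokens):
    """Derive the IOB sequence used by the AI task from the element spans."""
    labels = ["O"] * n_tokens
    for el in ann.elements:
        start, end = el.span
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels
```

For the example sentence, `iob_labels(ann, 6)` yields the Beginning/Inside/Outside tags that the second network layer described below predicts token by token.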
The input to the network is a tokenized sentence, where each token is embedded using the GloVe word embeddings (Pennington, Socher, and Manning 2014), pre-trained over the Common Crawl resource [3]. The sequence is first encoded with a bidirectional LSTM (L1). For the AD task, a single contextual representation c for the whole sequence is computed through an attention layer (Bahdanau, Cho, and Bengio 2014), which is in turn passed through a fully connected layer with a final softmax activation to obtain per-frame probabilities. The sequence out of L1 is further encoded with an LSTM (L2) with self-attention (Cheng, Dong, and Lapata 2016). The single hidden representations of the tokens are classified through a dense layer with softmax into IOB labels, which denote whether a word is at the Beginning, Inside or Outside of a frame element span. The LSTM at L2 is modified so that, at each time step, the output of the dense layer at t−1 is provided as additional input to the LSTM cell at time t. The third and final encoding layer (L3) takes as input the output of L2 and the output of L1 through highway connections. The same type of encoder used in L2 is applied in L3, with the difference that the dense layer outputs frame element labels instead of IOB ones.

Figure 2: The neural network for the semantic parser. The connections in green represent highway connections between the first and the third layer.

The simple attention mechanism (Bahdanau, Cho, and Bengio 2014) used for the AD task is a layer that gives an insight into the contribution that a certain input gives to the production of a given output. The final contextual representation of a sentence c is evaluated as the weighted sum:

c = Σ_i a_i · h_i

where h_i represents the encoding of the i-th token and the attention value (or score) a_i is evaluated through a simple feedforward network f_att(h_i). Roughly speaking, this attention layer evaluates a value a_i for each encoded input token. Since the AD classification layer operates over the contextual representation c, each value a_i indicates how much each word in a sentence contributes to the final classification of a frame. For this reason, it can intrinsically provide an explanation for the model's behavior, as it summarizes a much broader set of values that would be more difficult to interpret directly (e.g. looking at all the values of the self-learned weights). In fact, it makes it possible to single out a restricted subset of features, because not all the inputs have the same importance. The self-attentions used in the two other layers (L2 and L3) instead encode the relationships among all the input objects, e.g. how much each token contributes to the representation of all the other tokens for a given task. We point the reader to the original paper for more details about the self-attention layers.

Hypotheses and Challenges

Taking up our research questions again, at this point we ask:

• how can we better understand the LSTM-based model we built?
• how can we identify an undesired behavior in such a model?
• is there a way to mitigate any undesired behaviors?

Our first question can be answered by looking at the attention layer values to get hints on the model's behavior. As previously discussed, attentions give the chance to explore the intermediate classification steps, enabling the interpretability of how the system processes a given input, an aspect that we can exploit for a better understanding of our process. As a first attempt, this work aims at answering the previous questions by taking into account only the ability of the system to detect the correct frame. For this reason, we will focus on the analysis of the attention values for the AD task alone, and leave the analysis of the other two tasks for future work.

We then answer our second question by aligning attentions and the linguistic theory. On the one hand, we have the Frame Semantics theory, which states that frames are evoked in natural language by specific words called lexical units. On the other, we have the attention values computed by the network to balance the input words in the final contextual representation used to classify the frame. Our assumption therefore is that, by annotating data with Frame Semantics, an algorithm learning from such data should implicitly encode the theory itself, through an attempt at learning it (or a good approximation of it). If the network is learning correctly from the data, we should therefore observe an alignment between the values produced by the attention of the AD layer and what is stated by Frame Semantics, e.g. we should notice relevant values attributed to words that could possibly be lexical units for the classified frame. Should the network not follow the underlying Frame Semantics theory, this could mean not only that the model is merely following patterns statistically evident in the data (and not related to the theory), but also that an incorrect explanation for its behavior would be provided if requested. Our challenge is first to verify whether the words receiving the highest attention values are the correct lexical units of a classified frame (e.g. given the sentence "take the book from the table", the word take should be given a high attention value). Finally, we need a mitigation strategy for the cases where the attention turns out to be focused on the incorrect lexical element, so as to ensure that the correct explanation for a decision can be provided.

[3] http://commoncrawl.org/
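As an illustration, the additive attention used for the AD task can be sketched as follows; the parameter shapes and the form of the scoring network are assumptions for the sake of the example, not the paper's exact configuration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ad_attention(h, w, v):
    """Additive attention over token encodings.

    h: (n_tokens, d) hidden states from the bidirectional LSTM (L1)
    w: (d, d_att) and v: (d_att,) parameters of the small feedforward
       scoring network f_att (hypothetical shapes)
    Returns the attention values a and the contextual vector c = sum_i a_i h_i.
    """
    scores = np.tanh(h @ w) @ v   # one unnormalized score per token
    a = softmax(scores)           # attention values, non-negative, sum to 1
    c = a @ h                     # weighted sum of the token encodings
    return a, c

rng = np.random.default_rng(0)
h = rng.normal(size=(6, 8))       # e.g. "take the book from the table"
w = rng.normal(size=(8, 4))
v = rng.normal(size=(4,))
a, c = ad_attention(h, w, v)
# argmax(a) points at the token the frame classifier relied on most,
# which is what the paper compares against the gold lexical unit
```

Because the frame classifier only sees c, the vector a is exactly the per-word contribution discussed above.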
Given that HuRIC's annotations are based on Frame Semantics, we propose to augment the dataset using additional examples from the FrameNet corpus (Baker, Fillmore, and Lowe 1998). Although FrameNet covers a different domain w.r.t. HuRIC, i.e. written vs. spoken language, we believe that, by using a data augmentation strategy, the algorithm can be driven to rely on patterns consistent with the theory, and thus to achieve better generalization.

Approach

In this section, we show the design of the overall approach, namely (1) how we align the model to the Frame Semantics theory; (2) how we use these alignments to identify misbehavior by the model; and (3) the data augmentation strategy we use to mitigate the bias in the model.

Aligning Attentions and Linguistic Theory

As previously explained, the attention values produced by our 3LSTM-ATT parser during the AD stage can be used to guess which words in a sentence are more relevant to the classified frame. We can use these values to attempt an alignment between words and the linguistic theory, namely to establish which words are lexical units or other relevant words such as prepositions.

The parser has been trained over the previously mentioned HuRIC dataset, which contains transcriptions of user commands tagged with Frame Semantics. The annotated frames generally correspond to actions like taking objects or moving to a specific position. The dataset contains 585 frame occurrences over 526 sentences on 16 different frame types, for an average of ∼36 sentences per frame. The results, obtained over this dataset through a 5-fold cross validation stratified on the frame types, are reported in Table 1. Compared to the results of (Bastianelli et al. 2016) (BAS16 henceforth) [4], our parser obtains better results for both the AD and AI tasks.

Table 1: Parser performance in terms of F-Measure for AD, AI and AC, compared to BAS16. Only gold values are considered as input of each task.

Corpus      AD      AI      AC
BAS16       94.67%  90.74%  94.93%
3LSTM-ATT   96.33%  94.35%  91.77%

(Mis-)alignment of Attention Values

Differently from BAS16, we can take advantage of the attention layer in the AD step to understand our system's behavior when classifying a frame for a given sentence. As explained, our assumption is that the word receiving the highest value from the AD attention layer may be the LU for the classified frame.

In order to prove this hypothesis, we need to quantitatively measure the alignment between the attention values and the "gold" LU for a given frame. Let S = (w_1, ..., w_m) be a sentence as a sequence of m words w. Gold LUs are available in HuRIC, so let ŵ_i be the gold LU for the i-th sentence S_i [5]. Let us consider the attention layer as a (simplified) function f_att(w) that attributes an attention value to a word w (for clarity, w is a shortcut for its hidden representation h). The ALU (LU-alignment) measure over the N sentences of the dataset can then be calculated as follows:

ALU = (1/N) Σ_{i=1..N} I(argmax_{w ∈ S_i} f_att(w) = ŵ_i)    (1)

where I(·) is the indicator function.

Although lexical units carry most of the meaning of a frame, there are still many ambiguous cases, where a verb alone may evoke different frames. Consider for example the verb take, which may evoke the frame Bringing, e.g. in the sentence "take the book to the table", or Taking, e.g. "take the book from the table". The meaning in this case is carried not by the LU alone, but also by its co-occurrence with other specific words or syntactic structures. The preposition to in the first example clearly introduces an argument representing the destination of a motion (i.e. the GOAL frame element), helping in choosing the frame Bringing over Taking for the word take. It is thus legitimate to think that, in these cases, part of the attention values should also focus on such discriminant words.

We thus designed a second measure, which we call ADS (discriminant alignment), with the aim of taking into account additional discriminant words besides the LU. To this end, we annotated the discriminant words for each sentence in the dataset. For each sentence S = (w_1, ..., w_m), we created a vector of gold discriminant word indexes v_g = (gd_1, ..., gd_m), where each gd_j ∈ {0, 1} is set to 1 if its position corresponds to a discriminant word in S. Given the attention values obtained from the AD layer, we created a vector of classified discriminant word indexes v_c = (cd_1, ..., cd_m), where each cd_j = I(f_att(w_j) ≥ 0.01) [6]. Finally, we calculated Precision and Recall over these vectors in the following way:

P = Σ_{j=1..m} I(cd_j = gd_j = 1) / Σ_{j=1..m} cd_j    (2)

R = Σ_{j=1..m} I(cd_j = gd_j = 1) / Σ_{j=1..m} gd_j    (3)

through which we obtained the F-Measure. ADS was finally calculated as the macro-average of the F-Measure over all the sentences S_i in the dataset.

The HuRIC→HuRIC row of Table 2 shows the scores for ALU and ADS obtained when training and testing over HuRIC. As we can see from the 11.17% ALU and 20.53% ADS values, the model reaches good results on the AD task (96.33% of F-Measure), but is quite misaligned from the linguistic theory.

[4] Please note that BAS16 also makes use of perceptual features, while our parser relies only on linguistic inputs.
[5] Sentence splitting was applied in order to have 1 frame per sentence, for the rare HuRIC cases containing more than one frame per sentence.
[6] This threshold was set to filter attention noise. The study of how to properly set this threshold is left for future work.
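The two alignment measures can be sketched as follows; the function and variable names are ours, and the per-sentence attention values are assumed to be already extracted from the AD layer:

```python
def alu(attention, gold_lu):
    """LU-alignment (Eq. 1): fraction of sentences whose highest-attention
    token index equals the gold lexical unit index.

    attention: list of per-sentence attention value lists
    gold_lu:   list of gold LU token indexes, one per sentence
    """
    hits = sum(
        1 for a, lu in zip(attention, gold_lu)
        if max(range(len(a)), key=a.__getitem__) == lu
    )
    return hits / len(attention)

def ads(attention, gold_disc, threshold=0.01):
    """Discriminant alignment: macro-averaged F-measure between thresholded
    attention (Eqs. 2-3) and gold discriminant-word indicator vectors."""
    f_scores = []
    for a, gd in zip(attention, gold_disc):
        cd = [1 if v >= threshold else 0 for v in a]  # classified indicators
        tp = sum(1 for c, g in zip(cd, gd) if c == g == 1)
        p = tp / sum(cd) if sum(cd) else 0.0          # precision (Eq. 2)
        r = tp / sum(gd) if sum(gd) else 0.0          # recall (Eq. 3)
        f_scores.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f_scores) / len(f_scores)
```

Both functions operate purely on the attention outputs and gold annotations, so they can be computed for any trained model without retraining.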
Indeed, the error analysis we carried out on the attention values showed that the model is following recurrent latent patterns which are completely unrelated to the theory, rather than generalizing the linguistic theory as expected. In other words, the model concentrates its attention on words that are not discriminative with respect to the label set of frame elements and, in general, to the respective frame; yet, it was able to produce the correct classification.

An example of such behavior is reported in Figure 3: while the Taking frame is indeed correctly identified in the sentence "take the red shoes", the attention values reveal that the model's attention falls on the two words "the" and "red", which do not convey any frame meaning in this context, while the correct LU take receives only 0.1% of the attention. As an additional proof, a similar sentence with a different frame, e.g. "inspect the red shoes", is classified with the same frame Taking (instead of Inspecting), with most of the attention falling again on the words "the" and "red".

Figure 3: Attention analysis for two different input sentences (HuRIC→HuRIC setting). The attention falls mostly on words that are not LUs, e.g. "the", "red".
  "take the red shoes" – gold: Taking, predicted: Taking – attention: take 0.001, the 0.627, red 0.371, shoes 0
  "inspect the red shoes" – gold: Inspecting, predicted: Taking – attention: inspect 0.065, the 0.214, red 0.716, shoes 0

A first consideration arising from the above analysis is that linguistic phenomena are not equally represented in HuRIC (i.e. some frames occur in correspondence with more frequent, but not necessarily significant, grammatical patterns), and this lack of representativeness might cause intrinsic bias. This prevents the model from learning the underlying linguistic theory, and from generalizing from it.

Mitigating the Data Bias

If we hypothesize that our model does not generalize towards the linguistic theory as it should, due to the lack of representativeness of the dataset, a natural solution is to increase the number of training examples and see whether the alignment measures improve without compromising the performance. Since HuRIC is tagged with Frame Semantics following the same scheme as the FrameNet corpus (Baker, Fillmore, and Lowe 1998), the first solution at hand to attenuate the bias with more examples consists in integrating HuRIC with examples from FrameNet itself. For the purpose of comparison, we selected only the FrameNet examples annotated with frames also contained in HuRIC. This selection resulted in a subset of 6,814 frame examples, for an average of ∼425 examples per frame.

Although sharing the same background linguistic theory, however, the two datasets belong to two different domains, namely written text vs. spoken commands. This may indeed lead to a drop in terms of performance. Let us take the example of a FrameNet-annotated sentence for the frame Taking:

In the late 1870s, he defaulted on a loan from rancher Archibald Stewart, so [Stewart]AGENT [took]Taking [the Las Vegas Ranch]THEME [for his own]EXPLANATION.

Indeed, the variability of the language in FrameNet is much higher than in HuRIC. On the one hand, this can negatively affect the overall performance, as the complexity of the task increases. On the other, the network will access more evidence in terms of theory-related patterns, e.g. seeing the frame Taking associated more often with co-occurring verbs like take than with other unrelated words like shoes. Our aim is therefore to reach a good trade-off between the model's performance and its degree of generalization which, in turn, reveals the degree of understandability (explainability) of its behavior.

Experiments and Results

In order to support our hypotheses about the mitigation strategy, we designed two additional experimental settings with the goal of evaluating the change in the model's behavior:

• FN→HuRIC: a model is trained over the full subset of samples coming from FrameNet, and is tested on the whole HuRIC dataset;
• FN+Hu→HuRIC: the evaluation follows a 5-fold cross validation. At each validation turn, the training set consists of FrameNet + 80% of HuRIC, leaving the remaining 20% as test set. The distribution of frames is uniformly stratified.

Table 2 presents the results of ALU and ADS for both configurations. The performance of the semantic parser in terms of F-Measure for the AD, AI and AC tasks is reported as well. Please note that the HuRIC→HuRIC results differ from the ones in Table 1, which shows the performance of the single tasks in isolation (i.e. each task receives gold information from the previous steps); here, instead, we consider the full semantic parsing pipeline.

It appears clearly how the parser performance and the alignment scores are reversed in the two settings. The models trained only on FrameNet do not achieve high performance, reaching only approx. 68% for the AD task. When the two datasets are combined, an increase of ∼19 points is achieved for the same task. This is still very low when compared to the 96.33% achieved with HuRIC only. That said, looking at the alignment scores, we notice that this drop of performance comes to the advantage of the model's explainability. When trained only on FrameNet, in fact, the ALU and ADS scores reach 93.92% and 84.64% respectively. This confirms that the AD attention layer is focusing on the relevant words, hence giving us a hint that the model is correctly learning the linguistic theory. The introduction of HuRIC into the training sample helps in raising the parsing performance to convincing levels, while not completely deteriorating the alignment: ALU and ADS still drop by ∼40 and ∼34 points respectively but, considering the performance reached by the parser, this can be considered an encouraging trade-off.

Figure 4: Result of the attention analysis over the three different training conditions for the sentence "get the dishes from the dining room" (gold frame: Taking).
  HuRIC→HuRIC – predicted: Taking – attention: get 0.001, the 0.908, dishes 0.018, from 0.041, the 0.021, dining 0.011, room 0
  FN→HuRIC – predicted: Entering – attention: get 0.994, the 0.002, dishes 0, from 0.004, the 0, dining 0, room 0
  FN+Hu→HuRIC – predicted: Taking – attention: get 0.984, the 0, dishes 0.001, from 0.012, the 0.003, dining 0, room 0
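The FN+Hu→HuRIC setting described above can be sketched as follows; the helper names and the example format are assumptions for illustration, not the authors' code:

```python
from collections import defaultdict
import random

def stratified_folds(examples, n_folds=5, seed=0):
    """Split (sentence, frame) examples into folds stratified by frame type,
    so each fold roughly preserves the frame distribution."""
    by_frame = defaultdict(list)
    for ex in examples:
        by_frame[ex["frame"]].append(ex)
    folds = [[] for _ in range(n_folds)]
    rng = random.Random(seed)
    for frame_examples in by_frame.values():
        rng.shuffle(frame_examples)
        for i, ex in enumerate(frame_examples):
            folds[i % n_folds].append(ex)
    return folds

def fn_plus_hu_splits(huric, framenet):
    """FN+Hu->HuRIC setting: FrameNet examples (restricted to frames that also
    occur in HuRIC) join 80% of HuRIC for training; the held-out 20% is the
    test set, rotating over 5 stratified folds."""
    huric_frames = {ex["frame"] for ex in huric}
    fn_subset = [ex for ex in framenet if ex["frame"] in huric_frames]
    folds = stratified_folds(huric)
    for k, fold in enumerate(folds):
        train = fn_subset + [ex for j, f in enumerate(folds) if j != k for ex in f]
        yield train, fold
```

Note that the FrameNet examples appear in every training split but never in a test split, so the evaluation always measures performance on the spoken-command domain.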
Testing the use of different amount LU the attention - still- spreads- its mass over non-relevant words of examples from FrameNet and HuRIC may HuRICàHuRIC result Taking Inspecting in an (5a–5c). 0.065 0.214 This leads 0.716 to errors 0 in the frame classifications. A even better balancing of linguistic variance and the domain- more stable behavior can be observed when both FrameNet specificity. and HuRIC are used as training set. The attention values, in fact, stabilize mostly over L Us and discriminant words, al- Table 2: End-to-end performances and alignment scores of though with more dense or sparse values. This contributes to the 3LSTM-ATT parser for the three different training set- a much better frame classifications, giving us an insight of tings. F-Measure is reported for the AD, AI and AC tasks. the difference of the results in Table 2. This confirms the idea that the use of a compatible exter- HuRIC→HuRIC FN→HuRIC FN+Hu→HuRIC nal resource such as FrameNet can help in reducing the bias AD 96.33% 68.06% 87.60% of poorly represented corpora that can affect deep network AI 93.57% 77.14% 81.27% architectures. At the same time, attention values can be an- AC 87.22% 62.70% 72.44% alyzed to interpret the outcome of the model classification ALU 11.17% 93.92% 51.31% (frames/actions in our case). More importantly, this method ADS 20.53% 84.64% 50.83% promotes the idea that knowledge engineering, which helps encodes and elicit expert’s knowledge (e.g. FrameNet), and In order to better demonstrate the trade-off between parser machine learning techniques can be combined to develop performances and theory alignment, we also perform a qual- more transparent and understandable systems. itative analysis on the test examples. In Figure 4, we show the AD tagging and attention values produced by the dif- Related Work ferent training settings (i.e. HuRIC, FN, FN+Hu) for the sentence get the dishes from the dining room. 
When trained We divided the related work in three parts: (i) approaches to on HuRIC only, the network learns again unwanted patterns enable more explainable deep learning-based applications, and, although the frame Taking is correctly classified, the at- with a particular focus on text classification and attention tention mostly falls on the article the, also spreading with methods, (ii) approaches to mitigate bias in data and (iii) minor values on the rest of the words. By using FrameNet approaches for semantic parsing in the robotics domain. as training set, the attention falls back to the verb take that corresponds to the current L U. However, the frame classi- Explainability for Deep Learning fication fails, predicting Entering. The correct frame classi- Explainability for deep learning methods can be divided fication (Taking) with attention values matching the correct in three families. A first family, including perturbation L U and, to a minor extent, the discriminant preposition from experiments (Zeiler and Fergus 2014), saliency map- is finally obtained when using a combination of the two cor- based methods (Simonyan, Vedaldi, and Zisserman 2013), pora, as in the last row. LIME (Ribeiro, Singh, and Guestrin 2016) and influence The same behavior can be observed if we consider also functions (Koh and Liang 2017), relies on methods trying other discriminant words in the sentence. Figure 5 shows to identify the relevant features treating the model as a black again frame parsing and attentions values over different box. An approximated model is built by observing concur- sentences. Discriminative words are here reported as well rent changes between the input and the output, so that it can (D ISC). In all the four examples it appears clear that when provide simple explanations. the system is trained only over the HuRIC resource, the at- A second family of approaches focuses on inspecting the tention is unstable, i.e. 
either it distributes similarly among internal representations and input processing. By observing more or less relevant words (5a), or more strongly attend- the inner parameters (weights of the neural network, or other ing on non-discriminant words at all (5b). In other cases, the latent variables), these methods try to give a meaning to attention indeed does attend on discriminant words, but ei- layers and operations in a bottom-up way (Zhang and Zhu ther the final frame classification is wrong for a lack of value 2018). For this reason, their application is difficult to scale on the L U (5c), or, even if the frame is correct, we lose the for networks with lots of layers and parameters. dependence of the classification outcome on the L U (5d). A third family consists in the intrinsically explainable Model Frame gold Frame pred look for the wrench in the bathroom HuRICàHuRIC Bringing Bringing 0 0.0002 0.0019 0.7515 0.0545 0.1101 0.0192 0.0567 0.0057 LU DISC - - - - - FNàHuRIC Bringing Taking 0.9394 0 0.0606 0 0 0 0 0 0 HuRICàHuRIC Searching Perception_active 0.0032 0.0567 0.0294 0.0789 0.4687 0.3628 0.0001 FN+HuàHuRIC Bringing Bringing 0.089 0 0.0001 0.9109 0 0 0 0 0 FNàHuRIC Searching Perception_active 0.9999 0 0 0.0001 0 0 0 FN+HuàHuRIC Searching Searching 0.1034 0.8966 0 0 0 0 0 Model Frame gold Frame pred bring it to the side of the bathtub Model Frame gold Frame pred takeLU the - jar DISC to - the - table - of - the - kitchen HuRICàHuRIC Bringing Bringing 0.0003 LU - 0.003 - 0.2815DISC 0.2205 - 0.1785- 0.1267 - 0.1796 - 0.01- FNàHuRIC HuRICàHuRIC Bringing Bringing Placing Bringing 0 0 0.00010.00190.46050.7515 0.0002 00.0545 0.5394 0.1101 0 0.0192 0 0.0567 0 0.0057 FN+HuàHuRIC FNàHuRIC Bringing Bringing Bringing Taking 0.689 0.9394 0 0.01070.06060.3002 0 0 0 0 0 0 0 0 0 0 0 FN+HuàHuRIC Bringing Bringing 0.089 0 0.0001(a) 0.9109 0 0 0 0 0 Model Frame gold Frame pred robot please take the mug to the sink Model Frame gold Frame pred look for the wrench in the 
bathroom - - LU - - DISC - - Model Frame gold Frame pred bring it to the side of the bathtub LU DISC - - - - - HuRICàHuRIC Bringing Taking 0 0 0.0004 0.0192 0.0343 0.9104 0.0352 0 LU - DISC - - - - - HuRICàHuRIC Searching Perception_active 0.0032 0.0567 0.0294 0.0789 0.4687 0.3628 0.0001 FNàHuRIC Bringing Following 0 0.4691 0.4703 0 0.0606 0 0 0 HuRICàHuRIC Bringing Bringing 0.0003 0.003 0.2815 0.2205 0.1785 0.1267 0.1796 0.01 FNàHuRIC Searching Perception_active 0.9999 0 0 0.0001 0 0 0 FN+HuàHuRIC Bringing Bringing 0 0 0.1773 0 0 0.8227 0 0 FNàHuRIC Bringing Placing 0 0.0001 0.4605 0 0.5394 0 0 0 FN+HuàHuRIC Searching Searching 0.1034 0.8966 0 0 0 0 0 FN+HuàHuRIC Bringing Bringing 0.689 0.0107 0.3002 0 0 0 0 0 (b) Model Frame gold Frame pred take the jar to the table of the kitchen Model Frame gold robot Frame pred please take the mug to the sink LU - - DISC - - - - - Model Frame gold Frame pred look for the wrench in the bathroom - - LU - - DISC - - HuRICàHuRIC Bringing Bringing 0 0.0002 0.0019 0.7515 0.0545 0.1101 0.0192 0.0567 0.0057 LU DISC - - - - - HuRICàHuRIC Bringing Taking 0 0 0.0004 0.0192 0.0343 0.9104 0.0352 0 FNàHuRIC Bringing Taking 0.9394 0 0.0606 0 0 0 0 0 0 HuRICàHuRIC Searching Perception_active 0.0032 0.0567 0.0294 0.0789 0.4687 0.3628 0.0001 FNàHuRIC Bringing Following 0 0.4691 0.4703 0 0.0606 0 0 0 FN+HuàHuRIC Bringing Bringing 0.089 0 0.0001 0.9109 0 0 0 0 0 FNàHuRIC Searching Perception_active 0.9999 0 0 0.0001 0 0 0 FN+HuàHuRIC Bringing Bringing 0 0 0.1773 0 0 0.8227 0 0 FN+HuàHuRIC Searching Searching 0.1034 0.8966 0 0 0 0 0 Model Frame gold Frame pred bring it (c) to the side of the bathtub LU - DISC - - - - - Model Frame gold Frame pred take the jar to the table of the kitchen HuRICàHuRIC Bringing Bringing 0.0003 0.003 0.2815 0.2205 0.1785 0.1267 0.1796 0.01 LU - - DISC - - - - - FNàHuRIC Bringing Placing 0 0.0001 0.4605 0 0.5394 0 0 0 HuRICàHuRIC Bringing Bringing 0 0.0002 0.0019 0.7515 0.0545 0.1101 0.0192 0.0567 0.0057 FN+HuàHuRIC 
Bringing Bringing 0.689 0.0107 0.3002 0 0 0 0 0 FNàHuRIC Bringing Taking 0.9394 0 0.0606 0 0 0 0 0 0 FN+HuàHuRIC Bringing Bringing 0.089 0 0.0001 0.9109 0 0 0 0 0 Model Frame gold Frame pred robot please take the mug to the sink (d) - - LU - - DISC - - Model Frame gold Frame pred bring it to the side of the bathtub HuRICàHuRIC Bringing Taking 0 0.0004 0.0192 Figure 5: Attention analysis in relation to both the L and discriminant 0 0.0343 words (D UISC ) for0.9104 0.0352 the three training-0 settings. DISCLU - - - - - FNàHuRIC Bringing Following 0 0.4691 0.4703 0 0.0606 0 0 0 HuRICàHuRIC Bringing Bringing 0.0003 0.003 0.2815 0.2205 0.1785 0.1267 0.1796 0.01 FN+HuàHuRIC models, which are complex Bringing enoughBringing to reach good perfor- 0 0 0.1773 0 tems, (Adomavicius 0et al. 2014) 0.8227 propose 0 0 to mitigate the bi- FNàHuRIC Bringing Placing 0 0.0001 0.4605 0 0.5394 0 0 0 mances, yetFN+HuàHuRIC providing good hints Bringing for interpretation. Bringing 0.689 Atten- 0.0107 ased 0.3002 customers’ 0 ratings 0 after 0 the classification, 0 0 both with a tion layers exactly provide a relevance measure between systematic algorithm and with an interactive user-interface. the inputs and outputs, by learning a salience map between Several data augmentation methods for Generative Adver- the two other network Model layers, which Frame gold canpred Frame be further robot visual- please sarial Networks take the thatmug use image to intensity the normalization, sink ro- ized using heat-maps independently from the domain - con-- tation, LU re-scaling, - cropping, - flipping, DISC and - Gaussian - noise in- sidered. Visual attention (Mnih HuRICàHuRIC Bringing et al.Taking 2014) has0 been used 0 jection were 0.0004 0.0192presented 0.0343in the context 0.9104 of medical 0.0352 0 image anal- in the automatic Image Captioning FNàHuRIC Bringing task (Xu et0 al. 2015; Following 0.4691 ysis (Drozdzal 0.4703 0 et al. 2018; Hu0 et al. 2018; 0.0606 0 Roth 0 et al. 2015). You et al. 
2016) where, given FN+HuàHuRIC Bringing an input Bringing picture0 a textual 0 Little work 0.1773 0 has been0 done on how0 to exploit 0.8227 0 alignments caption is generated. The attention values can be observed between knowledge bases for machine learning systems. to highlight the area of the picture which most contributed The Knowledge Representation community has mostly fo- to generate specific words in the caption. and which can cused on empirically analyzing the effects of data links, be visualized using heat-maps independently from the do- i.e. (Tiddi, d’Aquin, and Motta 2014) uses alignments to main considered. Self-attentions (Bahdanau, Cho, and Ben- quantify bias in datasets pairwise, without suggesting mit- gio 2014) have also been widely applied in many text pro- igation solutions; (Ding et al. 2010) discussed the confusion cessing tasks, such as Sentiment Analysis (Lin et al. 2017) of provenance and ground truth generated by owl:sameAs and Question Answering (Hermann et al. 2015). Visual ex- in the context of bioinformatics datasets; (Beek et al. 2018) planations were used in these cases to explain alignments gathers and fixed erroneous identity statements offering between the words of the input and output sentences. them in a large-scale dataset. Knowledge bases integrated with deep nets have so far Bias in Data been used to improve the embedding space at training time In their work, (Zhao et al. 2017) studies the problem of quan- or to explain the model’s outputs a posteriori (cfr. (Hitzler tifying gender bias in data and models for multi-label object et al. 2019) for a representative selection). To the best of our classification and visual semantic role labeling, developing knowledge, our work is the first using an external knowledge a calibration strategy that introduces frequency-constraints bases aligned to the training corpus to mitigate the bias in a on the training corpus. 
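The attention-based alignment analysis used throughout this work can be sketched in a few lines of code. The snippet below is a minimal, hypothetical reconstruction (the exact formulas for the alignment scores are not spelled out here): it assumes that ALU and ADS are computed as the fraction of test sentences whose highest attention weight falls on the annotated lexical unit (LU) or on a discriminant word (DISC), respectively.

```python
# Hypothetical sketch of the attention/LU alignment scores (ALU, ADS).
# Each example holds the per-token attention weights produced by the parser,
# the index of the lexical unit (LU), and the indices of discriminant words.

def alignment_scores(examples):
    """Return (ALU, ADS): fraction of sentences whose attention peak
    falls on the LU, respectively on a discriminant word."""
    on_lu = on_disc = 0
    for ex in examples:
        att = ex["attention"]
        peak = max(range(len(att)), key=att.__getitem__)  # argmax token
        on_lu += peak == ex["lu"]
        on_disc += peak in ex["disc"]
    n = len(examples)
    return on_lu / n, on_disc / n

# Toy run on "get the dishes from the dining room"
# (attention values taken from the FN+Hu -> HuRIC row of Figure 4).
examples = [{
    "attention": [0.984, 0.0, 0.001, 0.012, 0.003, 0.0, 0.0],
    "lu": 0,      # "get"
    "disc": {3},  # "from"
}]
alu, ads = alignment_scores(examples)
print(alu, ads)  # -> 1.0 0.0: the attention peak is on the LU
```

On such per-sentence peaks, a high ALU under the FN training setting and a low ALU under HuRIC-only training would reproduce the contrast reported in Table 2.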
Semantic Parsing for Robotic Applications

A variety of approaches have been proposed in the last two decades to create semantic parsers for commands of virtual and real autonomous agents. With the breakthrough of statistical models, many machine learning techniques have been applied to semantically parse robot instructions, from sequential labelling (Kollar et al. 2010) to Statistical Machine Translation (Chen and Mooney 2011), learning-to-rank (Kim and Mooney 2013) and probabilistic graphical models (Tellex et al. 2011). Statistical methods have also been applied to induce grammars that parse human commands into suitable meaning representations (Artzi and Zettlemoyer 2013; Thomason et al. 2015). These approaches were implemented mostly in discretized environments, relying on ad-hoc and formulaic representation formalisms, and often dealing with constrained vocabularies. Our work, on the contrary, builds upon the idea of relying on linguistically sound theories of meaning representation, e.g. Frame Semantics, to bridge between linguistic knowledge and robot internal representations. We build upon (Bastianelli et al. 2016) to design a parser that identifies the semantic frames expressed in robot commands, but rely on a bidirectional LSTM network.

Conclusions

In this paper, we have presented an approach relying on the integration of heterogeneous knowledge sources to mitigate the biased results of a deep learning-based semantic parser for Spoken Language Understanding for robots, and to improve the model's understandability. We discussed how current models do not necessarily learn the underlying linguistic theory, but rather focus on unwanted, unexpected patterns, because of an intrinsic bias induced by the size and domain-specificity of the training dataset. We showed how the values of the attention layers of the network can be used as a clue to analyze and interpret the model's behavior, i.e. the classification of frames in our case. Finally, we have provided evidence that external resources such as FrameNet can help to reduce the bias in the training data, while also guaranteeing the correct interpretations (or explanations) of the model's behavior. While being a preliminary attempt to measure a more complex phenomenon, our work suggests that the strengths of both knowledge engineering and machine learning can be combined to foster the development of more transparent, understandable intelligent systems.

Future work will focus in the first instance on designing more thorough evaluation schemes to obtain a better quantitative understanding of the model's behavior. Secondly, we will focus on identifying the correct balance between the domain-specific samples and the external ones, also testing new pairs of datasets if possible. An analysis carried out by gradually combining the samples and showing how the performances and the explainability measures behave across several datasets and domains is indeed crucial. Extending the use of more knowledge bases through their links (e.g. WordNet, ConceptNet) is another route we wish to follow. Finally, we will explore the idea of interactive, symbiotic explanations, where the model can be corrected through spoken dialogue with the user.

References

Adomavicius, G.; Bockstedt, J.; Curley, S.; and Zhang, J. 2014. De-biasing user preference ratings in recommender systems. In RecSys 2014 Workshop on Interfaces and Human Decision Making for Recommender Systems (IntRS 2014), 2–9.

Artzi, Y., and Zettlemoyer, L. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics 1(1):49–62.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Baker, C. F.; Fillmore, C. J.; and Lowe, J. B. 1998. The Berkeley FrameNet project. In Proceedings of ACL and COLING, Association for Computational Linguistics, 86–90.

Bastianelli, E.; Castellucci, G.; Croce, D.; Iocchi, L.; Basili, R.; and Nardi, D. 2014. HuRIC: a Human Robot Interaction Corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland: European Language Resources Association (ELRA).

Bastianelli, E.; Croce, D.; Vanzo, A.; Basili, R.; and Nardi, D. 2016. A discriminative approach to grounded spoken language understanding in interactive robotics. In Proceedings of the 2016 International Joint Conference on Artificial Intelligence (IJCAI).

Beek, W.; Raad, J.; Wielemaker, J.; and Van Harmelen, F. 2018. sameAs.cc: The closure of 500M owl:sameAs statements. In European Semantic Web Conference, 65–80. Springer.

Chen, D. L., and Mooney, R. J. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on AI, 859–865.

Cheng, J.; Dong, L.; and Lapata, M. 2016. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 551–561. Austin, Texas: Association for Computational Linguistics.

Ding, L.; Shinavier, J.; Finin, T.; McGuinness, D. L.; et al. 2010. owl:sameAs and Linked Data: an empirical study. In Proceedings of the Second Web Science Conference.

Drozdzal, M.; Chartrand, G.; Vorontsov, E.; Shakeri, M.; Di Jorio, L.; Tang, A.; Romero, A.; Bengio, Y.; Pal, C.; and Kadoury, S. 2018. Learning normalized inputs for iterative estimation in medical image segmentation. Medical Image Analysis 44:1–13.

Fillmore, C. J. 1985. Frames and the semantics of understanding. Quaderni di Semantica 6(2):222–254.

Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 1693–1701.

Hitzler, P.; Bianchi, F.; Ebrahimi, M.; and Sarker, M. K. 2019. Neural-symbolic integration and the semantic web. Semantic Web (Preprint):1–9.

Hu, X.; Chung, A. G.; Fieguth, P.; Khalvati, F.; Haider, M. A.; and Wong, A. 2018. ProstateGAN: Mitigating data bias via prostate diffusion imaging synthesis with generative adversarial networks. arXiv preprint arXiv:1811.05817.

Kim, J., and Mooney, R. J. 2013. Adapting discriminative reranking to grounded language learning. In ACL (1), 218–227. The Association for Computer Linguistics.

Koh, P. W., and Liang, P. 2017. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730.

Kollar, T.; Tellex, S.; Roy, D.; and Roy, N. 2010. Toward understanding natural language directions. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, HRI '10, 259–266.

Lin, Z.; Feng, M.; Santos, C. N. d.; Yu, M.; Xiang, B.; Zhou, B.; and Bengio, Y. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.

Liu, B., and Lane, I. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. In INTERSPEECH, 685–689. ISCA.

Mensio, M.; Bastianelli, E.; Tiddi, I.; and Rizzo, G. 2018. A multi-layer LSTM-based approach for robot command interaction modeling. Workshop on Language and Robotics, IROS 2018.

Mnih, V.; Heess, N.; Graves, A.; et al. 2014. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, 2204–2212.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144. ACM.

Rosenthal, S.; Biswas, J.; and Veloso, M. 2010. An effective personal mobile robot agent through symbiotic human-robot interaction. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Volume 1, 915–922. International Foundation for Autonomous Agents and Multiagent Systems.

Roth, H. R.; Lu, L.; Liu, J.; Yao, J.; Seff, A.; Cherry, K.; Kim, L.; and Summers, R. M. 2015. Improving computer-aided detection using convolutional neural networks and random view aggregation. IEEE Transactions on Medical Imaging 35(5):1170–1181.

Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.

Tellex, S.; Kollar, T.; Dickerson, S.; Walter, M.; Banerjee, A.; Teller, S.; and Roy, N. 2011. Approaching the symbol grounding problem with probabilistic graphical models. AI Magazine 32(4):64–76.

Thomason, J.; Zhang, S.; Mooney, R.; and Stone, P. 2015. Learning to interpret natural language commands through human-robot dialog. In Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI), IJCAI'15, 1923–1929. AAAI Press.

Tiddi, I.; d'Aquin, M.; and Motta, E. 2014. Quantifying the bias in data links. In International Conference on Knowledge Engineering and Knowledge Management, 531–546. Springer.

Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2048–2057.

You, Q.; Jin, H.; Wang, Z.; Fang, C.; and Luo, J. 2016. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4651–4659.

Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 818–833. Springer.

Zhang, Q.-s., and Zhu, S.-C. 2018. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19(1):27–39.

Zhao, J.; Wang, T.; Yatskar, M.; Ordonez, V.; and Chang, K.-W. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457.