Mitigating Bias in Deep Nets with Knowledge Bases: the Case of Natural Language Understanding for Robots

Martino Mensio∗, Emanuele Bastianelli⋆, Ilaria Tiddi† and Giuseppe Rizzo‡

∗ Knowledge Media Institute, The Open University, UK, martino.mensio@open.ac.uk
⋆ The Interaction Lab, Heriot-Watt University, UK, emanuele.bastianelli@hw.ac.uk
† Department of Computer Science, Vrije Universiteit Amsterdam, NL, i.tiddi@vu.nl
‡ LINKS Foundation, Italy, giuseppe.rizzo@linksfoundation.com


Abstract

In this paper, we tackle the problem of the lack of understandability of deep learning systems by integrating heterogeneous knowledge sources; specifically, we present how we used FrameNet to guarantee correct learning for an LSTM-based semantic parser in the task of Spoken Language Understanding for robots. The problem of the explainability of Artificial Intelligence (AI) systems, i.e. their ability to explain decisions to both experts and end users, has attracted growing attention in recent years, affecting their credibility and trustworthiness. Trusting these systems is fundamental in the context of AI-based robotic companions interacting in natural language, as the users' acceptance of the robot also relies on its ability to explain the reasons behind its actions. Following similar approaches, we first use the values of the neural attention layers employed in the semantic parser as a clue to analyze and interpret the model's behavior and reveal the intrinsic bias induced by the training data. We then show how the integration of knowledge from external resources such as FrameNet can help minimize, or mitigate, such bias, and consequently guarantee that the model provides the correct interpretations. Our preliminary but promising results suggest that (i) attention layers can improve model understandability; (ii) the integration of different knowledge bases can help overcome the limitations of machine learning models; and (iii) an approach combining the strengths of both knowledge engineering and machine learning can foster the development of more transparent, understandable intelligent systems.

Introduction

With the dramatic success of new machine learning techniques relying on deep architectures, the number of Artificial Intelligence (AI)-based systems has rapidly increased. Events such as the Cambridge Analytica scandal and the disruptions of the 2016 US elections have brought researchers and practitioners to question the explainability of these systems, i.e. their ability to explain decisions to both experts and end users, resulting in a number of initiatives to improve their understandability and trustworthiness (cf. DARPA's eXplainable AI program¹; the "right to explanation" requested by the European General Data Protection Regulation; and the "Ethics guidelines for Trustworthy AI" published by the European Union in December 2018²). In the context of robotic companions interacting in natural language using AI techniques, where our research is placed, trust and transparency are fundamental aspects, as the users' acceptance of robot assistants will also be based on their ability to explain the reasons behind their actions, if and when required.

Let us take the example of a robot understanding spoken commands given by a human, e.g. "take the book from the table", where a corresponding robot action such as take(book, table) has to be instantiated correctly. Such instantiation is generally triggered by a trained model, where noise, over-fitting, and mislabeling could indeed lead to an undesired output, e.g. the robot placing the book on the table. In the view of symbiotic autonomous robots (Rosenthal, Biswas, and Veloso 2010) that rely on humans to overcome their limitations and correct their actions, a transparent model could help identify and make explicit the reason(s) behind the wrong behavior of the robot.

Our motivation is the semantic processing of robotic commands (also called semantic parsing) from spoken language utterances, i.e. the process of mapping natural language sentences to formal meaning representations. The formal meaning representation theory we rely upon is Frame Semantics (Fillmore 1985), describing actions and events expressed in language through conceptual structures called semantic frames. This theory also states that a frame is evoked in a sentence through the occurrence of specific lexical units, i.e. words (such as verbs and nouns) that linguistically express the underlying situation. To identify such frames, we

Copyright © 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

EB and IT developed the theoretical framework and directed the project; EB and GR designed the experiments; MM derived the models, performed experiments and analysed the results; IT and EB wrote the manuscript in consultation with GR and MM.

¹ https://www.darpa.mil/program/explainable-artificial-intelligence
² https://ec.europa.eu/futurium/en/ai-alliance-consultation/guidelines
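As an illustrative sketch only (the class and method names below are ours, not the parser's actual output format), the command-to-action instantiation discussed in the introduction can be modeled as a small frame structure that renders a predicted frame as a robot action:

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    """A predicted semantic frame with its frame elements (hypothetical structure)."""
    name: str                                      # e.g. "Taking"
    lexical_unit: str                              # the word evoking the frame, e.g. "take"
    elements: dict = field(default_factory=dict)   # frame element label -> head word

    def to_action(self) -> str:
        """Render the frame as a robot action string, e.g. take(book, table)."""
        args = ", ".join(self.elements.values())
        return f"{self.lexical_unit}({args})"

# "take the book from the table" -> Taking(THEME=book, ORIGIN=table)
frame = Frame(name="Taking", lexical_unit="take",
              elements={"THEME": "book", "ORIGIN": "table"})
print(frame.to_action())  # take(book, table)
```

A misclassified frame (e.g. Placing instead of Taking) would yield a different action, which is exactly the kind of undesired output a transparent model should help diagnose.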
built a semantic parser based on a multi-layer Long Short-Term Memory (LSTM) neural network with attention (Mensio et al. 2018), and trained it over the Human-Robot Interaction Corpus (HuRIC) (Bastianelli et al. 2014). LSTMs, like many similar deep net-based models, have an opaque nature, i.e. they do not give clear clues on the way they behave, which may complicate the understanding of undesired behaviors such as, in our case, an incorrect robot behavior. Moreover, understanding the inner workings of such models tends to be harder when they are trained on small, domain-specific datasets (such as HuRIC), as these often lack effective representativeness of the problem domain. The questions we wish to answer in this work are therefore:

• how can we better understand our LSTM-based model?
• how can we identify undesired behaviors in the model?
• is there a way to mitigate such undesired behaviors?

To answer the first two questions, we rely on the idea that linguistic theories could be used in the context of our semantic parser to obtain more understanding of the model, i.e. they could be exploited to explain the model's behavior. Recent trends in deep learning have shown that visual explanations for a model's behavior can be obtained through the analysis of the values of the attention layers in a number of tasks (Machine Translation (Bahdanau, Cho, and Bengio 2014), Sentiment Analysis (Lin et al. 2017), Image Captioning (Xu et al. 2015)), given their ability to correlate inputs and outputs. Inspired by these works, our hypothesis is that we can use attentions to achieve some degree of explainability for the LSTM-based parser, and that Frame Semantics can be the key to drive the interpretation process. We therefore use attentions to capture the interpretation of spoken commands and, more specifically, use the values that the attention layer assigns to each word of a given sentence to detect which word is the lexical unit evoking (i.e. causing) the identified frame. We show how this not only gives us a hint on the model behavior, but also that attentions help unveil the intrinsic bias induced by our training data. Here, we exploit the linguistic knowledge encoded in an external resource such as FrameNet (Baker, Fillmore, and Lowe 1998) in a data augmentation strategy, with the goal of mitigating the corpus bias, improving the explanations that the model provides and, consequently, the overall model results.

Although preliminary, our promising results suggest that attention layers combined with Frame Semantics do provide a clue to a more explainable model, and that the integration of external knowledge bases can help overcome the inner limitations of machine learning models. More importantly, our method suggests that the combination of knowledge engineering and machine learning techniques can be beneficial for the development of more transparent, understandable intelligent systems.

Motivation and Background

In this section, we present the theoretical and technical background of our work. We first discuss Frame Semantics, which we use as the linguistic theory of reference, and then describe the technical details of our neural network-based semantic parser.

Fundamentals of Frame Semantics

Frame Semantics is a theory that formalizes how a sentence is related to semantic frames. Each frame is a conceptual structure representing an action or, more in general, an event or situation (e.g. the action of Taking). Frames are further specified by a set of frame elements (e.g. the THEME, representing the object taken while performing the action Taking), which enhance the meaning of the frame with additional information. According to Frame Semantics, frames are evoked in sentences by specific words, called lexical units (LU). Lexical units are responsible for conveying the meaning of the frames, representing hooks between the textual surface and the theory itself. In the example of Figure 1, the frame Taking is evoked in the sentence "take the book from the table" by the LU take, while the book and from the table represent the THEME and the ORIGIN frame elements respectively:

[take]Taking
[the book]THEME
[from the table]ORIGIN

Figure 1: Example of semantic frame annotation for the sentence "take the book from the table".

The process of annotating Frame Semantics over natural language involves three different tasks. First, all the frames evoked in a sentence are identified by looking at the potential LUs contained in it. This task is generally called Frame Prediction or Frame Induction. Here, we refer to it as Action Detection (AD), as we are dealing with the action expressed by the person uttering the command to the robot. The second task is called Argument Identification (AI, sometimes also called Boundary Detection) and is responsible for finding the spans of text corresponding to possible frame elements. The last task is called Argument Classification (AC) and consists in assigning a label to the spans identified during the AI. Note that the AI and AC tasks are often referred to together as the process of Semantic Role Labelling.

If we take the example of "take the book from the table", the frame Taking would be predicted in the AD step by identifying the LU take. In the AI step, the book and from the table would be identified as 2 frame element spans, and respectively classified as THEME and ORIGIN frame elements in the following AC step.

A multi-layer LSTM-based parser

In our previous work (Mensio et al. 2018), we presented a semantic parser for robotic commands, called 3LSTM-ATT, based on a multi-layer LSTM network exploiting attention mechanisms. The 3LSTM-ATT topology is shown in Figure 2. The network was adapted from (Liu and Lane 2016) so that each layer could carry out one of the three semantic parsing tasks presented above. We briefly describe the network in the following, and refer the reader to the original paper for more details.

The input to the network is a tokenized sentence, where each token is embedded using the GloVe word embeddings (Pennington, Socher, and Manning 2014), pre-trained
over the Common Crawl resource³. The sequence is first encoded with a bidirectional LSTM (L1). For the AD task, a single contextual representation for the whole sequence c is computed through an attention layer (Bahdanau, Cho, and Bengio 2014), which is in turn passed through a fully connected layer with a final softmax activation to obtain per-frame probabilities. The sequence out of L1 is further encoded with an LSTM (L2) with self-attention (Cheng, Dong, and Lapata 2016). The single hidden representations of the tokens are classified through a dense layer with softmax into IOB labels, which denote whether a word is at the Beginning, the Inside or the Outside of a frame element span. The LSTM at L2 is modified so that, at each time step, the output of the dense layer at t − 1 is provided as additional input to the LSTM cell at time t. The third and final encoding layer (L3) takes as input the output of L2 and the output of L1 through highway connections. The same type of encoder used in L2 is applied in L3, with the difference that the dense layer outputs frame element labels instead of IOB ones.

Figure 2: The neural network for the semantic parser. The connections in green represent highway connections between the first and the third layer.

The simple attention mechanism (Bahdanau, Cho, and Bengio 2014) used for the AD task is a layer that gives an insight into the contribution that a certain input gives to the production of a given output. The final contextual representation of a sentence c is evaluated as the weighted sum:

    c = Σ_i a_i h_i,

where h_i represents the encoding of the i-th token and the attention value (or score) a_i is evaluated through a simple feedforward network f_att(h_i). Roughly speaking, this attention layer evaluates a value a_i for each encoded input token. Since the AD classification layer operates over the contextual representation c, each value a_i indicates how much each word in a sentence contributes to the final classification of a frame. For this reason, it can intrinsically provide an explanation for the model behavior, as it summarizes a much broader set of values that can be more difficult to interpret (e.g. looking at all the values of the self-learned weights). In fact, it makes it possible to highlight a restricted subset of features, because not all the inputs have the same importance. The self-attentions used in the two other layers (L2 and L3) instead encode the relationships among all the input objects, e.g. how much each token contributes to the representation of all the other tokens for a given task. We point the reader to the original paper for more details about the self-attention layers.

Hypotheses and Challenges

Taking back our research questions, at this point we ask:

• how can we better understand the LSTM-based model we built?
• how can we identify an undesired behavior in such a model?
• is there a way to mitigate any undesired behaviors?

Our first question can be answered by looking at the attention layer values to get hints on the model's behavior. As previously discussed, attentions give the chance to explore the intermediate classification steps, enabling the interpretability of how the system processes a given input – an aspect that we can exploit for a better understanding of our process. As a first attempt, this work aims at answering the previous questions by taking into account only the ability of the system to detect the correct frame. For this reason, we will focus on the analysis of the attention values for the AD task alone. We leave the analysis of the other two tasks for forthcoming work.

We thus answer our second question by aligning attentions and the linguistic theory. On the one hand, we have the Frame Semantics theory stating that frames are evoked in natural language by specific words called lexical units. On the other, we have the attention values computed by the network to balance the input words in the final contextual representation used to classify the frame. Our assumption therefore is that, by annotating data with Frame Semantics, the algorithm learning from such data should implicitly encode the theory itself, through an attempt at learning it (or a good approximation of it). If the network is learning correctly from the data, we should therefore observe an alignment between the values produced by the attention of the AD layer and what is stated by Frame Semantics, e.g. we should notice relevant values attributed to words that could possibly be lexical units for the classified frame. Should the network not follow the underlying Frame Semantics theory, this could mean not only that the model is merely following patterns statistically evident in the data (and not related to the theory), but also that an incorrect explanation for its behavior would be provided if requested. Our challenge is first to verify whether the words receiving the highest attention values are the correct lexical units of a classified frame (e.g. given the sentence "take the book from the table", the word take should be given a high attention value).

Finally, we need a mitigation strategy to overcome the cases where the attention turns out to be focused on the incorrect lexical element, and consequently ensure that the correct explanation for a decision can be provided. Given that

³ http://commoncrawl.org/
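The attention-based pooling used for the AD task (a per-token score from a small feedforward scorer, normalized and used to weight the token encodings into c = Σ_i a_i h_i) can be sketched in a few lines of NumPy. This is a toy illustration under our own assumptions: the real parser uses trained LSTM hidden states and learned attention parameters, for which W and v stand in here.

```python
import numpy as np

def attention_pool(H, W, v):
    """Compute c = sum_i a_i * h_i over token encodings H (n_tokens x d).

    a_i = softmax_i(v . tanh(W h_i)): a single-layer scoring net standing
    in for the learned f_att; W and v are stand-in parameters.
    """
    scores = np.tanh(H @ W.T) @ v      # one scalar score per token
    a = np.exp(scores - scores.max())  # numerically stable softmax
    a = a / a.sum()                    # normalize to attention weights
    c = a @ H                          # weighted sum of token encodings
    return c, a

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))            # 6 tokens, 8-dim encodings
W = rng.normal(size=(8, 8))
v = rng.normal(size=8)
c, a = attention_pool(H, W, v)
# a sums to 1; the token with the largest a_i is the one the model
# attends to most when classifying the frame
```

Inspecting a after a forward pass is exactly what the alignment analysis below relies on: if the model follows the linguistic theory, the largest weight should fall on the lexical unit.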
HuRIC's annotations are based on Frame Semantics, so we propose to augment the dataset using additional examples from the FrameNet corpus (Baker, Fillmore, and Lowe 1998). Although FrameNet covers a different domain w.r.t. HuRIC, i.e. written vs. spoken language, we believe that, by using a data augmentation strategy, the algorithm can be driven to rely on patterns consistent with the theory, and thus to achieve better generalization.

Approach

In this section, we show the design of the overall approach, namely (1) how we align the model to the Frame Semantics theory; (2) how we use these alignments to identify misbehavior by the model; and (3) the data augmentation strategy we use to mitigate the bias in the model.

Aligning Attentions and Linguistic Theory

As previously explained, the attention values produced by our 3LSTM-ATT parser during the AD stage can be used to guess which words in a sentence are more relevant to the classified frame. We can use these values to attempt an alignment between words and the linguistic theory, namely which words are lexical units or other relevant words such as prepositions.

The parser has been trained over the previously mentioned HuRIC dataset, which contains transcriptions of user commands tagged with Frame Semantics. The annotated frames generally correspond to actions like taking objects or moving to a specific position. The dataset contains 585 frame occurrences over 526 sentences on 16 different frame types, for an average of ∼36 sentences per frame. The results, obtained over this dataset through a 5-fold cross validation stratified on the frame types, are reported in Table 1. Compared to the results of (Bastianelli et al. 2016) (BAS16 henceforth)⁴, our parser obtains better results for both the AD and AI tasks.

Table 1: Parser performances in terms of F-Measure for AD, AI and AC, compared to BAS16. Only gold values are considered as input of each task.

    Corpus         AD        AI        AC
    BAS16          94.67%    90.74%    94.93%
    3LSTM-ATT      96.33%    94.35%    91.77%

(Mis-)alignment of Attention Values

Differently from BAS16, we can take advantage of the attention layer in the AD step to understand our system's behavior when classifying a frame for a given sentence. As explained, our assumption is that the word receiving the highest value from the AD attention layer may be the LU for the classified frame.

In order to prove such a hypothesis, we need to quantitatively measure the alignment between the attention values and the "gold" LU for a given frame. Let S = (w1, ..., wm) be a sentence as a sequence of m words w. Gold LUs are available in HuRIC, so let ŵi be the gold LU for the i-th sentence Si⁵. Let us consider the attention layer as a (simplified) function f_att(w) that attributes an attention value to a word w (for clarity, w is a shortcut for the hidden representation h). The ALU (LU-alignment) measure can then be calculated as follows:

    ALU = (1/N) Σ_{i=1..N} I(argmax_{w∈Si} f_att(w) = ŵi)    (1)

where I(·) is the indicator function.

Although lexical units carry most of the meaning of a frame, there are still many ambiguous cases, where a verb alone may evoke different frames. Consider for example the verb take, which may evoke the frame Bringing, e.g. in the sentence take the book to the table, or Taking, e.g. take the book from the table. The meaning in this case is not carried by the LU alone, but also by its co-occurrence with other specific words or syntactic structures. The preposition to in the first example clearly introduces an argument representing the destination of a motion (i.e. the GOAL frame element), helping in choosing the frame Bringing over Taking for the word take. It is thus legitimate to think that, in these cases, part of the attention values should also focus on such discriminant words.

We thus designed a second measure, which we call AD (discriminant alignment), with the aim of taking into account additional discriminant words besides the LU. To this end, we annotated the discriminant words for each sentence in the dataset. For each sentence S = (w1, ..., wm), we created a vector of gold discriminant word indexes vg = (gd1, ..., gdm), where each gdj ∈ {0, 1} is set to 1 if its position corresponds to a discriminant word in S. Given the attention values obtained from the AD layer, we created a vector of classified discriminant word indexes vc = (cd1, ..., cdm), where each cdj = I(f_att(wj) ≥ 0.01)⁶. Finally, we calculated Precision and Recall over these vectors in the following way:

    P = Σ_{j=1..m} I(cdj = gdj = 1) / Σ_{j=1..m} cdj    (2)

    R = Σ_{j=1..m} I(cdj = gdj = 1) / Σ_{j=1..m} gdj    (3)

through which we obtained the F-Measure. The AD alignment was finally calculated as the macro-average of the F-Measure over all the sentences Si in the dataset.

The HuRIC→HuRIC row of Table 2 shows the scores for ALU and AD obtained when training and testing over HuRIC. As we can see from the 11.17% on ALU and 20.53% on AD values, the model reaches good results on the AD task (96.33% F-Measure), but is quite misaligned

⁴ Please note that BAS16 also makes use of perceptual features, while our parser relies only on linguistic inputs.
⁵ Sentence splitting was applied in order to have 1 frame per sentence, for the rare HuRIC cases containing more than one frame per sentence.
⁶ This threshold was set to filter attention noise. The study of how to properly set this threshold is left for future work.
from the linguistic theory. Indeed, the error analysis we carried out on the attention values showed that the model is following latent patterns which are completely unrelated to the theory, rather than generalizing the linguistic theory as expected. In other words, the model concentrates its attention on recurrent words that are not discriminative with respect to the respective frame; yet, it was able to produce the correct classification.

      Model         Frame gold   Frame pred   take    the     red     shoes
                                              LU      -       -       -
      HuRIC→HuRIC   Taking       Taking       0.001   0.627   0.371   0

      Model         Frame gold   Frame pred   inspect   the     red     shoes
                                              LU        -       -       -
      HuRIC→HuRIC   Inspecting   Taking       0.065     0.214   0.716   0

Figure 3: Attention analysis for two different input sentences. The attention falls mostly on words that are not LUs, e.g. the, red.

   An example of such behavior is reported in Figure 3: while the Taking frame is indeed correctly identified, the attention values reveal that the model's attention falls on the two words the and red, which do not convey any frame meaning in this context, while the correct LU take receives only 0.1% of the attention. As an additional proof, a similar sentence with a different frame, e.g. inspect the red shoes, is classified with the same frame Taking (instead of Inspecting), with most of the attention again falling on the words the, red.
   A first consideration arising from the above analysis is that linguistic phenomena are not equally represented in HuRIC (i.e. some frames occur in correspondence of more frequent, but not necessarily significant, grammatical patterns), and this lack of representativeness might cause intrinsic bias. This prevents the model from learning the underlying linguistic theory, and from generalizing from it.

Mitigating the Data Bias

If we hypothesize that our model does not generalize towards the linguistic theory as it should, due to the lack of representativeness of the dataset, a natural solution is to increase the number of training examples and check whether the alignment measures improve without compromising the performances. Since HuRIC is tagged with Frame Semantics following the same scheme as the FrameNet corpus (Baker, Fillmore, and Lowe 1998), the first solution at hand to attenuate the bias with more examples consists in integrating HuRIC with examples from FrameNet itself. For the purpose of comparison, we selected only the FrameNet examples annotated with frames also contained in HuRIC. This selection resulted in a subset of 6,814 frame examples, for an average of ∼425 examples per frame.
   Although sharing the same background linguistic theory, the two datasets belong to two different domains, namely written text vs. spoken commands. This may indeed lead to a drop in terms of performance. Let us take the example of a FrameNet sentence annotated for the frame Taking:

   In the late 1870s, he defaulted on a loan from rancher Archibald Stewart, so [Stewart]AGENT [took]Taking [the Las Vegas Ranch]THEME [for his own]EXPLANATION.

Indeed, the label set of frame elements and, in general, the variability of the language in FrameNet is much higher than in HuRIC. On the one hand, this can negatively contribute to the overall performance, as the complexity of the task increases. On the other, the network will access more evidence in terms of theory-related patterns, e.g. seeing more often the association of the frame Taking with co-occurring verbs like take than with other unrelated words like shoes. Our aim is therefore to reach a good trade-off between the model's performance and its degree of generalization, which, in turn, reveals the degree of understandability (explainability) of its behavior.

Experiments and Results

In order to support our hypotheses about the mitigation strategy, we designed two additional experimental settings with the goal of evaluating the change in the model behavior:

• FN→HuRIC: a model is trained over the full subset of samples coming from FrameNet, and is tested on the whole HuRIC dataset;
• FN+Hu→HuRIC: the evaluation follows a 5-fold cross validation. At each validation turn, the training set consists of FrameNet + 80% of HuRIC, leaving the remaining 20% as test set. The distribution of frames is uniformly stratified.

Table 2 presents the ALU and AD results for both configurations. The performances of the semantic parser in terms of F-Measure for the AD, AI and AC tasks are reported as well. Please note that the HuRIC→HuRIC results differ from the ones in Table 1, which shows the performances of the single tasks in isolation (i.e. each task receives gold information from the previous steps); here, instead, we consider the full semantic parsing pipeline.
   It appears clear how the parser performances and the alignment measure scores are reversed for the two different settings. The models trained only on FrameNet do not achieve high performances, reaching only approx. 68% for the AD task. When the two datasets are combined, an increase of ∼19 points is achieved for the same task. This is still very low when compared to the 96.33% achieved with HuRIC only. That said, by looking at the alignment measure scores, we notice that this drop of performance comes at the advantage of the model's explainability. When trained only on FrameNet, in fact, the ALU and AD scores reach 93.92% and 84.64% respectively. This confirms that the AD attention layer is focusing on the relevant words, hence giving us a hint that the model is correctly learning the linguistic theory. The introduction of HuRIC to the training sample helps in raising the parsing performances to convincing levels, while not completely deteriorating the alignment. The ALU and AD still drop by ∼40 and ∼34 points respectively, but considering the performances reached by the
      Model         Frame gold   Frame pred   get     the     dishes   from    the     dining   room
                                              LU      -       -        -       -       -        -
      HuRIC→HuRIC   Taking       Taking       0.001   0.908   0.018    0.041   0.021   0.011    0
      FN→HuRIC      Taking       Entering     0.994   0.002   0        0.004   0       0        0
      FN+Hu→HuRIC   Taking       Taking       0.984   0       0.001    0.012   0.003   0        0

Figure 4: Result of the attention analysis over the three different training conditions for the sentence get the dishes from the dining room.
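The (mis-)alignment that Figure 4 visualizes is what the alignment scores in Table 2 quantify. As a minimal illustrative sketch (function names, the binary gold-LU mask encoding, and the default threshold value are our assumptions, not the paper's exact formulation): an ALU-style check of whether the top-attended word is a gold LU, plus a per-sentence F-Measure between thresholded attention positions and gold LUs, macro-averaged over the dataset.

```python
from typing import List, Sequence

def lu_match(attention: Sequence[float], gold_lu: Sequence[int]) -> bool:
    # ALU-style check: does the highest-attention word coincide with a gold LU?
    top = max(range(len(attention)), key=lambda j: attention[j])
    return gold_lu[top] == 1

def sentence_f1(attention: Sequence[float], gold_lu: Sequence[int],
                threshold: float = 0.1) -> float:
    # Per-sentence F-Measure between the attention positions above a noise
    # threshold (cf. footnote 6; the value 0.1 is an assumption) and the
    # gold LU positions.
    pred = [1 if a >= threshold else 0 for a in attention]
    tp = sum(p and g for p, g in zip(pred, gold_lu))
    prec = tp / sum(pred) if sum(pred) else 0.0
    rec = tp / sum(gold_lu) if sum(gold_lu) else 0.0
    return 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0

def macro_alignment(attentions: List[Sequence[float]],
                    gold_lus: List[Sequence[int]],
                    threshold: float = 0.1) -> float:
    # Macro-average of the per-sentence F-Measure over all sentences.
    scores = [sentence_f1(a, g, threshold) for a, g in zip(attentions, gold_lus)]
    return sum(scores) / len(scores)
```

For instance, on the HuRIC→HuRIC row of Figure 4, `lu_match([0.001, 0.908, 0.018, 0.041, 0.021, 0.011, 0], [1, 0, 0, 0, 0, 0, 0])` is False: the attention peaks on the rather than on the LU get, while the FN→HuRIC row yields True.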
parser, this can be considered an encouraging trade-off. Although the bias has been corrected to a certain extent, the overall results suggest that HuRIC is still introducing some noise, which diverts the system from full alignment with the underlying theory. Testing different amounts of examples from FrameNet and HuRIC may result in an even better balance between linguistic variance and domain-specificity.

Table 2: End-to-end performances and alignment scores of the 3LSTM-ATT parser for the three different training settings. F-Measure is reported for the AD, AI and AC tasks.

          HuRIC→HuRIC   FN→HuRIC   FN+Hu→HuRIC
   AD     96.33%        68.06%     87.60%
   AI     93.57%        77.14%     81.27%
   AC     87.22%        62.70%     72.44%
   ALU    11.17%        93.92%     51.31%
   ADS    20.53%        84.64%     50.83%

   In order to better demonstrate the trade-off between parser performances and theory alignment, we also performed a qualitative analysis on the test examples. In Figure 4, we show the AD tagging and attention values produced by the different training settings (i.e. HuRIC, FN, FN+Hu) for the sentence get the dishes from the dining room. When trained on HuRIC only, the network again learns unwanted patterns and, although the frame Taking is correctly classified, the attention mostly falls on the article the, also spreading with minor values over the rest of the words. When using FrameNet as the training set, the attention falls back on the verb get, which corresponds to the current LU. However, the frame classification fails, predicting Entering. The correct frame classification (Taking), with attention values matching the correct LU and, to a minor extent, the discriminant preposition from, is finally obtained when using a combination of the two corpora, as in the last row.
   The same behavior can be observed if we also consider other discriminant words in the sentence. Figure 5 shows again frame parsing and attention values over different sentences. Discriminative words are here reported as well (DISC). In all four examples it appears clear that when the system is trained only over the HuRIC resource, the attention is unstable: either it distributes similarly among more or less relevant words (5a), or it attends more strongly to words that are not discriminant at all (5b). In other cases, the attention does attend to discriminant words, but either the final frame classification is wrong for lack of value on the LU (5c), or, even if the frame is correct, we lose the dependence of the classification outcome on the LU (5d).
   When using FrameNet as a training set, the system is able to better attend to LUs (5b–5d). The distance in the application domain, however, seems to still prevent the system from also attending to discriminant words. For the same reason, in other cases the attention still spreads its mass over non-relevant words (5a–5c). This leads to errors in the frame classifications. A more stable behavior can be observed when both FrameNet and HuRIC are used as the training set. The attention values, in fact, stabilize mostly over LUs and discriminant words, although with denser or sparser values. This contributes to much better frame classifications, giving us an insight into the difference of the results in Table 2.
   This confirms the idea that the use of a compatible external resource such as FrameNet can help in reducing the bias of poorly represented corpora that can affect deep network architectures. At the same time, attention values can be analyzed to interpret the outcome of the model classification (frames/actions in our case). More importantly, this method promotes the idea that knowledge engineering, which helps encode and elicit experts' knowledge (e.g. FrameNet), and machine learning techniques can be combined to develop more transparent and understandable systems.

Related Work

We divided the related work in three parts: (i) approaches to enable more explainable deep learning-based applications, with a particular focus on text classification and attention methods, (ii) approaches to mitigate bias in data and (iii) approaches for semantic parsing in the robotics domain.

Explainability for Deep Learning

Explainability for deep learning methods can be divided into three families. A first family, including perturbation experiments (Zeiler and Fergus 2014), saliency map-based methods (Simonyan, Vedaldi, and Zisserman 2013), LIME (Ribeiro, Singh, and Guestrin 2016) and influence functions (Koh and Liang 2017), relies on methods that try to identify the relevant features while treating the model as a black box. An approximated model is built by observing concurrent changes between the input and the output, so that it can provide simple explanations.
   A second family of approaches focuses on inspecting the internal representations and input processing. By observing the inner parameters (weights of the neural network, or other latent variables), these methods try to give a meaning to layers and operations in a bottom-up way (Zhang and Zhu 2018). For this reason, their application is difficult to scale to networks with many layers and parameters.
   A third family consists in the intrinsically explainable
      Model         Frame gold   Frame pred   bring    it       to       the      side     of       the      bathtub
                                              LU       -        DISC     -        -        -        -        -
      HuRIC→HuRIC   Bringing     Bringing     0.0003   0.003    0.2815   0.2205   0.1785   0.1267   0.1796   0.01
      FN→HuRIC      Bringing     Placing      0        0.0001   0.4605   0        0.5394   0        0        0
      FN+Hu→HuRIC   Bringing     Bringing     0.689    0.0107   0.3002   0        0        0        0        0

                                                  (a)

      Model         Frame gold   Frame pred          look     for      the      wrench   in       the      bathroom
                                                     LU       DISC     -        -        -        -        -
      HuRIC→HuRIC   Searching    Perception_active   0.0032   0.0567   0.0294   0.0789   0.4687   0.3628   0.0001
      FN→HuRIC      Searching    Perception_active   0.9999   0        0        0.0001   0        0        0
      FN+Hu→HuRIC   Searching    Searching           0.1034   0.8966   0        0        0        0        0

                                                  (b)

      Model         Frame gold   Frame pred   robot   please   take     the      mug      to       the      sink
                                              -       -        LU       -        -        DISC     -        -
      HuRIC→HuRIC   Bringing     Taking       0       0        0.0004   0.0192   0.0343   0.9104   0.0352   0
      FN→HuRIC      Bringing     Following    0       0.4691   0.4703   0        0.0606   0        0        0
      FN+Hu→HuRIC   Bringing     Bringing     0       0        0.1773   0        0        0.8227   0        0

                                                  (c)

      Model         Frame gold   Frame pred   take     the      jar      to       the      table    of       the      kitchen
                                              LU       -        -        DISC     -        -        -        -        -
      HuRIC→HuRIC   Bringing     Bringing     0        0.0002   0.0019   0.7515   0.0545   0.1101   0.0192   0.0567   0.0057
      FN→HuRIC      Bringing     Taking       0.9394   0        0.0606   0        0        0        0        0        0
      FN+Hu→HuRIC   Bringing     Bringing     0.089    0        0.0001   0.9109   0        0        0        0        0

                                                  (d)

                                                                       LU              -                  DISC              -                  -                -                  -               -
                Model           Frame gold Frame pred              take        the                 jar             to            the               table             of                the         kitchen
              HuRICàHuRIC          Bringing         Bringing         0.0003          0.003               0.2815           0.2205         0.1785             0.1267              0.1796            0.01
                                                                    LU          -                   -             DISC             -          -                  -                   -                -
               FNàHuRIC             Bringing         Placing              0       0.0001       0.4605                       0       0.5394                      0                  0               0
            HuRICàHuRIC          Bringing        Bringing           0         0.0002     0.0019     0.7515                   0.0545    0.1101                   0.0192             0.0567          0.0057
              FN+HuàHuRIC    Bringing               Bringing         0.689          0.0107          0.3002                  0                  0                0                  0                0
              FNàHuRIC    Bringing                Taking          0.9394        0             0.0606               0               0                 0               0                 0                0

            FN+HuàHuRIC          Bringing        Bringing         0.089         0             0.0001             0.9109            0                 0               0                 0                0
                   Model          Frame gold      Frame pred         robot           please                take            the                 mug                   to            the             sink
                                                                                                   (d)
                                                                       -                -                   LU              -                    -               DISC              -                -
                 Model            Frame gold     Frame pred          bring             it                  to              the                side              of               the             bathtub
              HuRICàHuRIC           Bringing        Taking             0
                                                                  0.0004   0.0192
      Figure 5: Attention analysis in relation to both the L and discriminant           0
                                                                                      0.0343
                                                                               words (D UISC ) for0.9104  0.0352
                                                                                                   the three  training-0 settings.
                                                                  DISCLU      -        -
                                                                                       -          -         -
               FNàHuRIC            Bringing        Following           0            0.4691                0.4703             0            0.0606                 0                  0               0
              HuRICàHuRIC          Bringing        Bringing         0.0003           0.003               0.2815           0.2205         0.1785             0.1267              0.1796            0.01
            FN+HuàHuRIC
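As an illustration of how per-word relevance scores like those in Figure 5 arise, the following sketch implements a minimal additive attention layer over token encodings. The architecture, dimensions, and random weights are ours for illustration and are not the authors' exact model:

```python
# Illustrative sketch (not the paper's exact architecture): an additive
# attention layer that turns per-token encoder states into a normalized
# relevance distribution over the words of a command.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(states, W, v):
    """states: (T, d) token encodings -> (T,) weights summing to 1."""
    scores = np.tanh(states @ W) @ v   # one scalar score per token
    return softmax(scores)

tokens = "look for the wrench in the bathroom".split()
d, h = 16, 8
states = rng.normal(size=(len(tokens), d))   # stand-in for BiLSTM outputs
W = rng.normal(size=(d, h))                  # learned projection (random here)
v = rng.normal(size=h)                       # learned context vector (random here)

weights = attention_weights(states, W, v)
for tok, w in zip(tokens, weights):
    print(f"{tok:>10} {w:.4f}")              # per-word values, as in Figure 5
```

Because the weights form a probability distribution over the input tokens, each row of such a table sums to 1, which is what makes them directly comparable across training settings.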
models, which are complex enough to reach good performances, yet provide good hints for interpretation. Attention layers provide exactly such a relevance measure between the inputs and the outputs, by learning a salience map between two other network layers, which can be further visualized as heat-maps independently from the domain considered. Visual attention (Mnih et al. 2014) has been used in the automatic Image Captioning task (Xu et al. 2015; You et al. 2016) where, given an input picture, a textual caption is generated. The attention values can be inspected to highlight the area of the picture that most contributed to generating specific words in the caption. Attention mechanisms (Bahdanau, Cho, and Bengio 2014) have also been widely applied in many text processing tasks, such as Sentiment Analysis (Lin et al. 2017) and Question Answering (Hermann et al. 2015). Visual explanations were used in these cases to explain alignments between the words of the input and output sentences.

Bias in Data

In their work, (Zhao et al. 2017) study the problem of quantifying gender bias in data and models for multi-label object classification and visual semantic role labeling, developing a calibration strategy that introduces frequency constraints on the training corpus. In the context of recommender systems, (Adomavicius et al. 2014) propose to mitigate biased customers' ratings after the classification, both with a systematic algorithm and with an interactive user interface. Several data augmentation methods for Generative Adversarial Networks that use image intensity normalization, rotation, re-scaling, cropping, flipping, and Gaussian noise injection were presented in the context of medical image analysis (Drozdzal et al. 2018; Hu et al. 2018; Roth et al. 2015).

Little work has been done on how to exploit alignments between knowledge bases for machine learning systems. The Knowledge Representation community has mostly focused on empirically analyzing the effects of data links: (Tiddi, d'Aquin, and Motta 2014) use alignments to quantify the bias in pairs of datasets, without suggesting mitigation solutions; (Ding et al. 2010) discussed the confusion of provenance and ground truth generated by owl:sameAs in the context of bioinformatics datasets; (Beek et al. 2018) gathered and fixed erroneous identity statements, offering them as a large-scale dataset.

Knowledge bases integrated with deep nets have so far been used to improve the embedding space at training time or to explain the model's outputs a posteriori (cf. (Hitzler et al. 2019) for a representative selection). To the best of our knowledge, our work is the first to use an external knowledge base aligned to the training corpus to mitigate the bias in a training dataset in the context of deep nets.
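The general idea of countering the bias of a small, domain-specific corpus with an aligned external resource can be sketched as follows. The data, function names, and the simple frame-label-based alignment are illustrative assumptions, not the paper's actual pipeline:

```python
# Hypothetical sketch of the FN+Hu idea: augment a small domain corpus
# (HuRIC-like) with sentences from an external resource (FrameNet-like)
# that share the same frame labels, so each frame is evidenced by more
# lexically diverse examples. All data below is made up for illustration.
from collections import Counter

huric_like = [
    ("take the jar to the table of the kitchen", "Bringing"),
    ("bring it to the side of the bathtub", "Bringing"),
    ("look for the wrench in the bathroom", "Searching"),
]

framenet_like = [
    ("she brought the documents to the meeting", "Bringing"),
    ("they searched the woods for the missing dog", "Searching"),
    ("he took the parcel to the post office", "Bringing"),
    ("the police searched the area thoroughly", "Searching"),
]

def augment(domain, external, target_frames):
    """Add external examples only for frames the domain corpus covers."""
    extra = [(s, f) for s, f in external if f in target_frames]
    return domain + extra

frames = {f for _, f in huric_like}
training_set = augment(huric_like, framenet_like, frames)

print(Counter(f for _, f in training_set))
```

The point of the alignment is that only frames already present in the domain corpus are augmented, so the external resource widens the lexical evidence per class without introducing frames the robot cannot ground.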
Semantic Parsing for Robotic Applications

A variety of approaches have been proposed in the last two decades to create semantic parsers for commands of virtual and real autonomous agents. With the breakthrough of statistical models, many machine learning techniques have been applied to semantically parse robot instructions, from sequential labelling (Kollar et al. 2010) and Statistical Machine Translation (Chen and Mooney 2011) to learning-to-rank (Kim and Mooney 2013) and probabilistic graphical models (Tellex et al. 2011). Statistical methods have also been applied to induce grammars that parse human commands into suitable meaning representations (Artzi and Zettlemoyer 2013; Thomason et al. 2015). These approaches were implemented mostly in discretized environments, relying on ad-hoc and formulaic representation formalisms, and often dealing with constrained vocabularies. Our work, on the contrary, builds upon the idea of relying on linguistically sound theories of meaning representation, e.g. Frame Semantics, to bridge between linguistic knowledge and robot internal representations. We build upon (Bastianelli et al. 2016) to design a parser that identifies the semantic frames expressed in robot commands, but rely on a bidirectional LSTM network.

Conclusions

In this paper, we have presented an approach relying on the integration of heterogeneous knowledge sources to mitigate the biased results of a deep learning-based semantic parser for Spoken Language Understanding for robots, and to improve the model's understandability. We discussed how current models do not necessarily learn the underlying linguistic theory, but rather focus on unwanted, unexpected patterns, because of an intrinsic bias induced by the size and domain-specificity of the training dataset. We showed how the values of the attention layers of the network can be used as a clue to analyze and interpret the model's behavior, such as the frame classification in our case. Finally, we have provided evidence that external resources such as FrameNet can help to reduce the bias in the training data, while also guaranteeing correct interpretations (or explanations) of the model's behavior. While being a preliminary attempt to measure a more complex phenomenon, our work suggests that the strengths of both knowledge engineering and machine learning can be combined to foster the development of more transparent, understandable intelligent systems.

Future work will first focus on designing more thorough evaluation schemes to obtain a better quantitative understanding of the model's behavior. Secondly, we will focus on identifying the correct balance between the domain-specific samples and the external ones, also testing new pairs of datasets where possible. An analysis carried out by gradually combining the samples and showing how the performance and the explainability measures behave across several datasets and domains is indeed crucial. Extending the approach to more knowledge bases through their links (e.g. WordNet, ConceptNet) is another route we wish to follow. Finally, we will explore the idea of interactive, symbiotic explanations, where the model can be corrected through spoken dialogue with the user.

References

Adomavicius, G.; Bockstedt, J.; Curley, S.; and Zhang, J. 2014. De-biasing user preference ratings in recommender systems. In RecSys 2014 Workshop on Interfaces and Human Decision Making for Recommender Systems (IntRS 2014), 2–9.

Artzi, Y., and Zettlemoyer, L. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association for Computational Linguistics 1(1):49–62.

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Baker, C. F.; Fillmore, C. J.; and Lowe, J. B. 1998. The Berkeley FrameNet project. In Proceedings of ACL and COLING, Association for Computational Linguistics, 86–90.

Bastianelli, E.; Castellucci, G.; Croce, D.; Iocchi, L.; Basili, R.; and Nardi, D. 2014. HuRIC: a human robot interaction corpus. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland: European Language Resources Association (ELRA).

Bastianelli, E.; Croce, D.; Vanzo, A.; Basili, R.; and Nardi, D. 2016. A discriminative approach to grounded spoken language understanding in interactive robotics. In Proceedings of the 2016 International Joint Conference on Artificial Intelligence (IJCAI).

Beek, W.; Raad, J.; Wielemaker, J.; and Van Harmelen, F. 2018. sameAs.cc: The closure of 500M owl:sameAs statements. In European Semantic Web Conference, 65–80. Springer.

Chen, D. L., and Mooney, R. J. 2011. Learning to interpret natural language navigation instructions from observations. In Proceedings of the 25th AAAI Conference on AI, 859–865.

Cheng, J.; Dong, L.; and Lapata, M. 2016. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 551–561. Austin, Texas: Association for Computational Linguistics.

Ding, L.; Shinavier, J.; Finin, T.; McGuinness, D. L.; et al. 2010. owl:sameAs and linked data: An empirical study. In Proceedings of the Second Web Science Conference.

Drozdzal, M.; Chartrand, G.; Vorontsov, E.; Shakeri, M.; Di Jorio, L.; Tang, A.; Romero, A.; Bengio, Y.; Pal, C.; and Kadoury, S. 2018. Learning normalized inputs for iterative estimation in medical image segmentation. Medical Image Analysis 44:1–13.

Fillmore, C. J. 1985. Frames and the semantics of understanding. Quaderni di Semantica 6(2):222–254.

Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 1693–1701.

Hitzler, P.; Bianchi, F.; Ebrahimi, M.; and Sarker, M. K. 2019. Neural-symbolic integration and the semantic web. Semantic Web (Preprint):1–9.

Hu, X.; Chung, A. G.; Fieguth, P.; Khalvati, F.; Haider, M. A.; and Wong, A. 2018. ProstateGAN: Mitigating data bias via prostate diffusion imaging synthesis with generative adversarial networks. arXiv preprint arXiv:1811.05817.

Kim, J., and Mooney, R. J. 2013. Adapting discriminative reranking to grounded language learning. In ACL (1), 218–227. The Association for Computer Linguistics.

Koh, P. W., and Liang, P. 2017. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730.

Kollar, T.; Tellex, S.; Roy, D.; and Roy, N. 2010. Toward understanding natural language directions. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, HRI '10, 259–266.

Lin, Z.; Feng, M.; Santos, C. N. d.; Yu, M.; Xiang, B.; Zhou, B.; and Bengio, Y. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.

Liu, B., and Lane, I. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. In INTERSPEECH, 685–689. ISCA.

Mensio, M.; Bastianelli, E.; Tiddi, I.; and Rizzo, G. 2018. A multi-layer LSTM-based approach for robot command interaction modeling. In Workshop on Language and Robotics, IROS 2018.

Mnih, V.; Heess, N.; Graves, A.; et al. 2014. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, 2204–2212.

Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144. ACM.

Rosenthal, S.; Biswas, J.; and Veloso, M. 2010. An effective personal mobile robot agent through symbiotic human-robot interaction. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, 915–922. International Foundation for Autonomous Agents and Multiagent Systems.

Roth, H. R.; Lu, L.; Liu, J.; Yao, J.; Seff, A.; Cherry, K.; Kim, L.; and Summers, R. M. 2015. Improving computer-aided detection using convolutional neural networks and random view aggregation. IEEE Transactions on Medical Imaging 35(5):1170–1181.

Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.

Tellex, S.; Kollar, T.; Dickerson, S.; Walter, M.; Banerjee, A.; Teller, S.; and Roy, N. 2011. Approaching the symbol grounding problem with probabilistic graphical models. AI Magazine 32(4):64–76.

Thomason, J.; Zhang, S.; Mooney, R.; and Stone, P. 2015. Learning to interpret natural language commands through human-robot dialog. In Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI), IJCAI'15, 1923–1929. AAAI Press.

Tiddi, I.; d'Aquin, M.; and Motta, E. 2014. Quantifying the bias in data links. In International Conference on Knowledge Engineering and Knowledge Management, 531–546. Springer.

Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2048–2057.

You, Q.; Jin, H.; Wang, Z.; Fang, C.; and Luo, J. 2016. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4651–4659.

Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, 818–833. Springer.

Zhang, Q.-s., and Zhu, S.-C. 2018. Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19(1):27–39.

Zhao, J.; Wang, T.; Yatskar, M.; Ordonez, V.; and Chang, K.-W. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457.