<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>I don't understand! Evaluation Methods for Natural Language Explanations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miruna Clinciu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arash Eshghi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helen Hastie</string-name>
<email>h.hastie@hw.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Heriot-Watt University</institution>
          ,
          <addr-line>Edinburgh</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Edinburgh</institution>
          ,
          <addr-line>Edinburgh</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Explainability of intelligent systems is key for their future adoption. While much work is ongoing on developing methods for explaining complex opaque systems, there is little current work on evaluating how effective these explanations are, in particular with respect to the user's understanding. Natural language (NL) explanations can be seen as an intuitive channel between humans and artificial intelligence systems, in particular for enhancing transparency. This paper presents existing work on how evaluation methods from the field of Natural Language Generation (NLG) can be mapped onto NL explanations. We also present a preliminary investigation into the relationship between linguistic features and human evaluation, using a dataset of NL explanations derived from Bayesian Networks.</p>
      </abstract>
      <kwd-group>
        <kwd>Explanations</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Natural Language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The rapid advance of Artificial Intelligence poses fundamental ethical and social concerns, and providing the right explanations to instil transparency in AI systems is a main topic of discussion. An intuitive medium for providing explanations is natural language, and with recent regulations comes an increasing need for evaluation methods for natural language explanations that help us assess the quality of those explanations in relation to the system they explain. This need for evaluating explanations has been further validated by studies from social science and psychology [
        <xref ref-type="bibr" rid="ref11 ref17 ref2 ref8">2, 8, 11,
17</xref>
        ].
      </p>
      <p>
        Particular questions arise around the amount of information needed to explain without overloading the user, the lexical choices made to match the user's understanding and expertise level, and the linguistic style adopted. Attributes such as informativeness, clarity, coherence, readability and effectiveness have been linked to human evaluation dimensions frequently used in the field of Natural Language Generation (NLG) [
        <xref ref-type="bibr" rid="ref17 ref2">2, 17</xref>
        ]. Considering the strong focus of NLG researchers on evaluating natural language, we propose that mapping existing NLG methods onto NL explanations can provide insights into the definition of a good explanation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>To better understand how we can define what makes an effective NL explanation, we designed and gathered the ExBAN corpus (Explanations from Bayesian Networks). This corpus provides NL explanations for a set of Bayesian Networks, motivated mainly by the fact that Bayesian Networks are frequently used for the detection of anomalies in data and can approximate deep learning models. They also allow us to sense-check our explanation evaluation techniques, as they are reasonably easy for the non-expert user to understand.</p>
      <p>
        This paper presents current work on the evaluation of NL explanations, and also includes a preliminary new linguistic analysis. The paper is structured into four parts: (1) we introduce the ExBAN corpus; (2) we present how automatic and human evaluation metrics from the field of NLG can be mapped onto NL explanations; (3) we present an analysis of how linguistic features correlate with human evaluation metrics; and (4) finally, we discuss how evaluation methods can capture the quality of NL explanations. Further details of this work can be found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>ExBAN Corpus</title>
      <p>
        Existing datasets of explanations have enabled significant progress in the way that explanations provide transparency for machine learning algorithms. However, less attention has been paid to methods for explaining structured data, such as Bayesian Networks. Bayesian Networks have “the ability to cover any model with a probabilistic interpretation including supervised, unsupervised, and reinforcement learning (including deep learning)” [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Also, their graphical
representation can be used for extracting information [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The ExBAN corpus is used here for evaluation, but it could also be used to train models that generate natural language explanations from graphical models such as Bayes Nets, and from other structured data more broadly.
      </p>
      <sec id="sec-2-1">
        <title>ExBAN Corpus Description</title>
      <p>Definition: ExBAN is a corpus of natural language explanations for the graphical representations of three Bayesian Networks (see Figure 1).</p>
      <p>Purpose: Possible application areas for the corpus include explainable AI, general artificial intelligence, academic linguistic research and natural language processing.</p>
      <p>The ExBAN Corpus (Explanations for BAyesian Networks) consists of NL explanations collected in a two-step process:
1. NL explanations were produced by human subjects (84 participants in total).
2. In a separate study, these explanations were rated on a 7-point Likert scale, in terms of informativeness and clarity (250 explanations in total, rated by 97 participants; each explanation was rated by a minimum of 3 participants).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>NLG Evaluation Methods</title>
      <p>Models trained iteratively with large amounts of data are particularly hard to evaluate in a cost-effective and timely manner. Therefore, creating automatic methods for evaluating NLG systems that can capture the human-likeness of the generated output is essential.</p>
      <p>
        Human Evaluation. Explanations should be clear and easily understood by
users, providing the right information in order to create better communication
[
        <xref ref-type="bibr" rid="ref6 ref9">6, 9</xref>
        ]. We focus on two dimensions: informativeness and clarity.
      </p>
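      <p>As an illustration only, the following minimal sketch shows how such per-explanation Likert ratings might be aggregated before analysis, assuming a table with one row per individual judgement; the file and column names are hypothetical:</p>
      <preformat>
# Minimal sketch: aggregating per-explanation human ratings (Python).
# Assumes a CSV with one row per judgement and hypothetical columns
# "explanation_id", "informativeness" and "clarity" (1-7 Likert ratings).
import pandas as pd

ratings = pd.read_csv("exban_ratings.csv")  # hypothetical file name

# Mean rating per explanation, keeping only explanations with >= 3 raters.
agg = (ratings.groupby("explanation_id")
              .agg(n_raters=("informativeness", "size"),
                   informativeness=("informativeness", "mean"),
                   clarity=("clarity", "mean")))
agg = agg[agg["n_raters"] >= 3]
print(agg.head())
      </preformat>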
      <p>
        Automatic Evaluation. Here we describe the automatic metrics used in the field of NLG evaluation and selected for this study, specifically: 1) word-based (untrained) metrics such as BLEU, METEOR and ROUGE, and 2) pre-trained metrics such as BERTScore and BLEURT. We briefly describe each in turn:
- BLEU [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a widely used metric in the field of NLG (borrowed from Machine Translation (MT)) that compares the n-grams of a candidate text (e.g. one generated by an algorithm) with the n-grams of a reference text. The number of matches defines the goodness of the candidate text.
- SacreBLEU was proposed by [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] as a new version of BLEU that calculates scores on detokenized text by applying its own metric-internal preprocessing.
- METEOR was created to address the weaknesses of BLEU; it evaluates generated text by computing a score based on explicit word-to-word matches between a candidate and a reference. When multiple references are used, the candidate text is scored against each reference and the best score is reported.
- ROUGE [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] evaluates the n-gram overlap of the generated text (candidate) with a reference.
- ROUGE-L computes the longest common subsequence (LCS) between a pair of sentences [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
- BERTScore [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] is a token-level matching metric with pre-trained
contextual embeddings using BERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] that matches words in candidate and reference sentences using cosine similarity.
- BLEURT [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is a text generation metric based on BERT, pre-trained on synthetic data; it uses “random perturbations of Wikipedia sentences augmented with a diverse set of lexical and semantic-level supervision signals”. BLEURT uses a collection of metrics and models from prior work, including BLEU and ROUGE.
      </p>
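      <p>To make the comparison concrete, the following is a minimal sketch of scoring a single candidate explanation against a reference with one word-overlap metric and one pre-trained metric, assuming the sacrebleu and bert-score Python packages; the example texts are invented:</p>
      <preformat>
# Minimal sketch: word-overlap vs. pre-trained metrics (Python).
import sacrebleu
from bert_score import score as bert_score

candidate = "Rain makes the grass wet."          # invented example
reference = "The grass is wet because it rained."

# SacreBLEU scores detokenized text, applying its own
# metric-internal preprocessing.
bleu = sacrebleu.sentence_bleu(candidate, [reference])
print(f"BLEU: {bleu.score:.1f}")

# BERTScore matches candidate and reference tokens via contextual
# BERT embeddings and cosine similarity; F1 is commonly reported.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
      </preformat>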
      <sec id="sec-3-1">
        <title>Correlation of Automatic Metrics with Human Evaluation</title>
      <p>
        In order to investigate the degree to which automatic metrics for NLG can capture the quality of NL explanations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we ran a correlation analysis between the automatic metrics and human judgements. As shown in Figure 2, we can draw the following conclusions:
- Word-overlap metrics such as BLEU (n = 1, 2, 3, 4), METEOR and ROUGE (n = 1, 2) presented low correlation with human ratings. This might be due to certain limitations, such as the fact that they rely on word overlap and are not invariant to paraphrases.
- BERTScore and BLEURT outperformed the other metrics, producing higher correlation with human ratings on all diagrams. These metrics might capture some relevant aspects of explanations, as word representations are dynamically informed by the words around them.
      </p>
      <p>Based on the human evaluation scores for informativeness and clarity, in Figure 3 we present examples of explanations with high scores for informativeness and clarity (“Good” examples) and with low scores (“Bad” examples). As observed, all automatic metrics are reasonably good at capturing and evaluating the “Bad” examples of explanations. We can also see that only BLEURT (BRT) is sensitive enough to capture informativeness and clarity for both types of example.</p>
      <p>We extracted a number of linguistic features, presented in Table 1, to explore whether there were any linguistic constructs in our dataset that mapped to good or bad explanations. For example, complex syntactic constructions might lead to difficulty in understanding for the user, as reflected in the Height tree and Length tree features. Other features given in Table 1 were motivated by similar studies correlating features with user ratings for surface realisation in NLG [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and from social psychology [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Linguistic features extracted from the explanations.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Feature</th>
              <th>Description</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Total words</td>
              <td>total number of words</td>
            </tr>
            <tr>
              <td>Sentence Length</td>
              <td>average sentence length</td>
            </tr>
            <tr>
              <td>Nr Nouns</td>
              <td>number of nouns per explanation (NN - singular common noun, NNS - plural common noun, NNP - proper noun)</td>
            </tr>
            <tr>
              <td>WDT</td>
              <td>number of wh-determiners ("which")</td>
            </tr>
            <tr>
              <td>CC</td>
              <td>number of coordinating conjunctions</td>
            </tr>
            <tr>
              <td>Avg tf-idf</td>
              <td>average tf-idf score of content words</td>
            </tr>
            <tr>
              <td>Height tree</td>
              <td>depth of syntactic embedding</td>
            </tr>
            <tr>
              <td>Length tree</td>
              <td>number of children of the syntactic tree</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We mapped the linguistic features in Table 1 to human evaluation metrics
(informativeness and clarity) to see if there was any correlation between these
features and the quality of the explanation, as rated by humans.</p>
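      <p>As a minimal sketch of this feature extraction, the following computes the shallow features of Table 1 with NLTK, assuming its tokenizer and POS-tagger models are available; the tree-based features (Height tree, Length tree) and Avg tf-idf additionally require a syntactic parser and a corpus-level tf-idf model, and are omitted here:</p>
      <preformat>
# Minimal sketch: shallow linguistic features from Table 1 (Python).
# Assumes nltk with the "punkt" and "averaged_perceptron_tagger"
# models downloaded, e.g. via nltk.download(...).
import nltk

def shallow_features(explanation: str) -> dict:
    sentences = nltk.sent_tokenize(explanation)
    tokens = nltk.word_tokenize(explanation)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return {
        "total_words": len(tokens),
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
        "nr_nouns": sum(t in ("NN", "NNS", "NNP") for t in tags),
        "wdt": tags.count("WDT"),  # wh-determiners, e.g. "which"
        "cc": tags.count("CC"),    # coordinating conjunctions
    }

print(shallow_features("The grass is wet because it rained, which is common."))
      </preformat>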
      <p>Our preliminary analysis shows some trends, but more investigation is needed to confirm these. We calculated Spearman's correlation coefficient between the linguistic features and the human evaluation ratings, for both informativeness and clarity, on a sample of 166 datapoints. With regard to informativeness, sentence length (r = 0.29), the number of nouns (r = 0.36) and the number of coordinating conjunctions (r = 0.23) present weak correlations.</p>
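      <p>The correlation computation itself is straightforward; a minimal sketch with scipy, using invented feature values and mean ratings for illustration:</p>
      <preformat>
# Minimal sketch: Spearman's rank correlation between one linguistic
# feature and aggregated informativeness ratings (Python). The two
# lists are invented, parallel over the same explanations.
from scipy.stats import spearmanr

nr_nouns = [4, 7, 2, 9, 5]
informativeness = [3.7, 5.0, 2.3, 6.0, 4.3]

rho, p_value = spearmanr(nr_nouns, informativeness)
print(f"Spearman r = {rho:.2f} (p = {p_value:.3f})")
      </preformat>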
      <p>Linguistic features do not seem to capture well the level of clarity of a sentence, as no correlation was found in this regard. This is perhaps because clarity is multi-dimensional and implies more than lexical-syntactic relationships, involving other factors such as causality, common sense and general knowledge.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>Finding accurate automatic measures is challenging, particularly for explanations, as the pragmatic and cognitive processes underlying explanations, such as reasoning, causality, and common sense, might not be captured. In our study, the embedding-based metrics perform better than the word-overlap-based ones, but we would recommend a larger study to show this empirically. Future work would involve examining the effectiveness of automatic metrics across a wider variety of explanation tasks and datasets. Finally, the next step is to use this work to automatically generate natural language explanations from structured data such as Bayes Nets, and this work contributes towards ensuring the quality of such explanations.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported by the EPSRC Centre for Doctoral Training in Robotics
and Autonomous Systems at Heriot-Watt University and the University of
Edinburgh. Clinciu's PhD is funded by Schlumberger Cambridge Research
Limited (EP/L016834/1, 2018-2021). This work was also supported by the EPSRC
ORCA Hub (EP/R026173/1, 2017-2021) and UKRI Trustworthy Autonomous
Systems Node on Trust (EP/V026682/1, 2020-2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Abelson</surname>
            ,
            <given-names>R.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leddo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gross</surname>
            ,
            <given-names>P.H.</given-names>
          </string-name>
          :
          <article-title>The Strength of Conjunctive Explanations</article-title>
          .
          <source>Personality and Social Psychology Bulletin</source>
          <volume>13</volume>
          (
          <issue>2</issue>
          ) (
          <year>1987</year>
          ). https://doi.org/10.1177/0146167287132001
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Clinciu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hastie</surname>
          </string-name>
          , H.:
          <article-title>Let's Evaluate Explanations!</article-title>
          <source>HRI 2020 Workshop on Test Methods and Metrics for Effective HRI in Real World Human-Robot Teams</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Clinciu</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eshghi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hastie</surname>
          </string-name>
          , H.:
          <article-title>A study of automatic metrics for the evaluation of natural language explanations</article-title>
          .
          <source>In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:</source>
          Main Volume. pp.
          <fpage>2376</fpage>
          -
          <lpage>2387</lpage>
          . Association for Computational Linguistics, Online (Apr
          <year>2021</year>
          ), https://www.aclweb.org/anthology/2021.eacl-main.202
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Dethlefs</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuayahuitl</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rieser</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lemon</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Cluster-based prediction of user ratings for stylistic surface realisation</article-title>
          .
          <source>In: 14th Conference of the European Chapter of the Association for Computational Linguistics</source>
          <year>2014</year>
          , EACL
          <year>2014</year>
          . pp.
          <fpage>702</fpage>
          -
          <lpage>711</lpage>
          . Association for Computational Linguistics (ACL) (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: NAACL HLT</source>
          <year>2019</year>
          <article-title>- 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies -</article-title>
          <source>Proceedings of the Conference</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . Association for Computational Linguistics (ACL) (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Metrics and Evaluation of Spoken Dialogue Systems</article-title>
          .
          <source>In: DataDriven Methods for Adaptive Spoken Dialogue Systems</source>
          . Springer Publishing Company, Incorporated (
          <year>2012</year>
          ). https://doi.org/10.1007/978-1-4614-4803-7_7
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Koncel-Kedziorski</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bekal</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lapata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajishirzi</surname>
          </string-name>
          , H.:
          <article-title>Text generation from knowledge graphs with graph transformers</article-title>
          .
          <source>In: NAACL HLT</source>
          <year>2019</year>
          <article-title>- 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies -</article-title>
          <source>Proceedings of the Conference</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Leake</surname>
            ,
            <given-names>D.B.</given-names>
          </string-name>
          :
          <article-title>Evaluating Explanations</article-title>
          . Psychology Press (Feb
          <year>2014</year>
          ). https://doi.org/10.4324/9781315807072
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Lemon</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pietquin</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Data-Driven Methods for Adaptive Spoken Dialogue Systems: Computational Learning for Conversational Interfaces</article-title>
          . Springer Publishing Company, Incorporated (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.Y.</given-names>
          </string-name>
          :
          <article-title>ROUGE: A Package for Automatic Evaluation of Summaries</article-title>
          .
          <source>In: Text Summarization Branches Out</source>
          . Association for Computational Linguistics, Barcelona, Spain (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Mohseni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zarei</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ragan</surname>
          </string-name>
          , E.D.:
          <article-title>A Survey of Evaluation Methods and Measures for Interpretable Machine Learning</article-title>
          .
          <source>ACM Transactions on Interactive Intelligent Systems</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Papineni</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roukos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
          </string-name>
          , W.J.:
          <article-title>BLEU: a Method for Automatic Evaluation of Machine Translation</article-title>
          .
          <source>IBM Research Report RC22176</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          (
          <year>2001</year>
          ). https://doi.org/10.3115/1073083.1073135, http://dl.acm.org/citation.cfm?id=1073135
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Post</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A call for clarity in reporting BLEU scores</article-title>
          .
          <source>In: Proceedings of the Third Conference on Machine Translation: Research Papers</source>
          . pp.
          <fpage>186</fpage>
          -
          <lpage>191</lpage>
          . Association for Computational Linguistics, Brussels, Belgium (Oct
          <year>2018</year>
          ). https://doi.org/10.18653/v1/W18-6319, https://www.aclweb.org/anthology/W18-6319
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Schluter</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>The limits of automatic summarisation according to ROUGE</article-title>
          .
          <source>In: 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017 - Proceedings of Conference</source>
          . vol.
          <volume>2</volume>
          (
          <year>2017</year>
          ). https://doi.org/10.18653/v1/e17-2007
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Sellam</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>BLEURT: Learning robust metrics for text generation</article-title>
          .
          <source>In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>7881</fpage>
          -
          <lpage>7892</lpage>
          . Association for Computational Linguistics, Online (Jul
          <year>2020</year>
          ). https://doi.org/10.18653/v1/2020.acl-main.704, https://www.aclweb.org/anthology/2020.acl-main.704
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>S.C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shafto</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Explainable Artificial Intelligence via Bayesian Teaching</article-title>
          .
          <source>In: Neural Information Processing Systems Workshop: Teaching Machines, Robots, and Humans</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Zemla</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sloman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bechlivanidis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lagnado</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          :
          <article-title>Evaluating everyday explanations</article-title>
          .
          <source>Psychonomic Bulletin &amp; Review</source>
          <volume>24</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1488</fpage>
          -
          <lpage>1500</lpage>
          (Oct
          <year>2017</year>
          ). https://doi.org/10.3758/s13423-017-1258-z
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kishore</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artzi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          .
          <source>In: International Conference on Learning Representations</source>
          (
          <year>2020</year>
          ), https://openreview.net/forum?id=SkeHuCVFDr
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>