<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An approach to Evaluative AI through Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Ermellino</string-name>
          <email>andrea.ermellino@intesasanpaolo.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Malandri</string-name>
          <email>lorenzo.malandri@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Mercorio</string-name>
          <email>fabio.mercorio@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Navid Nobani</string-name>
          <email>navid.nobani@unimib.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Serino</string-name>
          <email>a.serino3@campus.unimib.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Large Language Models, Explainable AI, Human-Centered AI</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CRISP Research Centre, University of Milano-Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data Science &amp; Responsible AI, Intesa Sanpaolo</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Economics, Management and Statistics, University of Milano-Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Department of Statistics and Quantitative Methods, University of Milano-Bicocca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>eXplainable AI (XAI) has been gaining research interest across several AI applications. However, current XAI methods often fall short of involving the user in the decision-making process, as XAI explains to the user a decision already made by the algorithm, preventing the user from evaluating alternatives. In this setting, Evaluative AI encourages balanced human-AI collaboration by addressing the issues of over- and under-reliance on AI systems and involving the user in evaluating the pros/cons of each recommendation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Many state-of-the-art (SOTA) techniques in AI aim at explaining AI decisions. Depending on the
underlying model and purpose, explanations can be provided either at a global level, giving insight
into the overall functioning of the model, or at a local one, explaining a single output of the model
(e.g., [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1, 2, 3, 4, 5</xref>
        ]). These techniques aim to explain the model’s inner workings that led to the output
generation and can be divided into different categories such as counterfactual explanations, prototypes
and criticisms, and contrastive explanations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Regardless of the explanation technique and its level
of presentation, current eXplainable AI (XAI) frameworks applied to decision support systems
provide the decision maker with a set of explanations about how an outcome recommended by the AI
model was reached and the reasons for accepting or rejecting it. However, this paradigm does not involve
users in the decision but only allows them to accept or reject the machine’s result.
      </p>
      <p>
        As argued by Miller [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], this approach limits engagement and, as a consequence, the trust of
decision-makers or, even worse, leads them to accept the recommended decision blindly. Moreover, not knowing
the full range of possibilities limits users’ ability to form their own opinions and possibly correct
the machine’s behaviour.
      </p>
      <p>CEUR</p>
      <p>ceur-ws.org</p>
      <p>
        Following this direction, Miller [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposes a paradigm in which AI-based Decision Aids (DAs)
do not explain a single outcome but evidence for and against possible outcomes, with the user in
charge of evaluating them and making a decision. Building on this paradigm, our work proposes
the EADS (Evaluative AI-based Decision Support) framework. Our approach allows users to explore all
the possible outcomes, and their explanations, through a natural language dialogue system to earn
users’ trust and leverage their expertise and engagement in the decision-making process. In this sense,
our approach relies on the principles of the conversational explanation framework [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ] that exploits
natural language to provide explanations and interact with the user. Specifically, we employ an LLM to
generate conversational explanations that (i) make the dialogue more effective and (ii) better address the
context related to the user domain, resulting in better decisions.
      </p>
      <p>Motivation. Current XAI-based decision support approaches present an output to the user, who can only
accept or reject it based on the explanations and context provided by the agent.
A system that does not offer pre-determined decisions but instead analyses the advantages and disadvantages of
various alternatives would make the decision-making support more comprehensive.
This approach enhances explainability and allows the user to evaluate the possibilities, leading to more
informed decision-making and increased user confidence in the system.</p>
      <p>
        To further clarify the matter, let us consider the following mock conversation between
a user (User) and a system (Whmcs) trained and deployed to classify millions of
online job ads over a standard occupation taxonomy [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]:
      </p>
      <p>Classic XAI Framework
User: Why have you classified this job as "Data engineer"?
Whmcs: I saw "Data" and "Architect" words.
User: What is the most important word for classifying this job ad?
Whmcs: "Architect"
User: Which word did you not consider for the classification?
Whmcs: "Senior"
User: What if the input was "Senior Python data architect"?
Whmcs: Then, the class would be "Data Scientist".
User: Why instead of classifying it as "Software engineer", have you chosen "Data engineer"?
Whmcs: I never saw titles containing the word "architect" being labelled as "Software engineer" in the training data.
User: Why did you ignore the word "senior"?
Whmcs: I consider this word as "too general".
User: Answer me what "too general" means by considering me as a data scientist.
Whmcs: The TF-IDF score of the word "senior" - calculated based on the training data - is distant from the score of the rest of the terms.
User: What if I tell you that I’m your developer?
Whmcs: I would say that I took that decision because the word "senior" is on the list of stopwords.</p>
      <p>
        This mock example illustrates a conversational explanation framework in which the user interacts with
an XAI agent, as part of a decision support system, to understand its decision [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]. The user
may trust the XAI agent’s explanations, and thus accept the classification of "Data engineer" for a
specific job ad, or may not trust them. It is important to note that the user is only partially involved in the
decision-making process, through interacting with the agent to understand the system’s output. The same
example, converted to the approach we propose, can be depicted as follows:
      </p>
      <p>Evaluative AI Framework
User: How could this job ad be classified?
Whmcs: It could be "Data engineer", "Data Scientist" or "Software engineer".
User: Why could it be "Data engineer"?
Whmcs: It could be "Data engineer" because I saw the words "Data" and "Architect"
User: Why could it be "Data scientist"?
Whmcs: It could be "Data Scientist" because I saw the word "Data", but I didn’t see words like "Python" or "R"
User: Why could it be "Software engineer"?
Whmcs: It could be "Software engineer" because I saw the word "Data", but I’ve never seen titles containing the word "Architect" being labelled as "Software engineer" in the training data.</p>
      <p>
        In this new conversation, the user interacts with an Evaluative AI framework that provides pros and
cons for each potential classification of the job ad under consideration. As can
be seen from the example, at the end of the conversation, the system does not justify (i.e. explain) a
definitive output but rather supports the user in the decision-making process by presenting the pros and
cons of each available outcome (e.g. Data engineer and Data Scientist). Contrary to the initial example,
in the Evaluative AI-based decision system, the user controls the decision-making process, and the system
serves as a helpful tool by offering evaluative information. We argue that this approach is more effective
for decision support because it aligns with the cognitive decision-making process that people use when
making judgements and decisions (e.g., [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ]).
      </p>
      <p>Moreover, our proposed framework makes decision-making support even more robust by using not a
single XAI agent but a consensus system capable of providing a more grounded evaluation.</p>
      <sec id="sec-1-1">
        <title>Contribution. The contribution of EADS is two-fold:</title>
        <list list-type="order">
          <list-item><p>we propose an approach realising Evaluative AI, built on top of [<xref ref-type="bibr" rid="ref7">7</xref>], to support the decision-maker,
utilising current SOTA XAI techniques;</p></list-item>
          <list-item><p>we employ conversational explanations via an LLM to enhance the interaction between human and agent
and to increase the informative power of explanations [<xref ref-type="bibr" rid="ref10">10</xref>].</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Works</title>
      <p>
        eXplainable AI. XAI is increasingly recognized as a critical component of AI-driven, human-centric
systems across diverse sectors, from healthcare [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to robotics [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. XAI methods explain the
internal workings and decision-making processes of otherwise opaque AI models, as highlighted by
Guidotti et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The growing complexity of AI models, coupled with
stringent regulatory requirements such as the EU’s General Data Protection Regulation (GDPR), which
mandates a "right to explanations," has turned XAI from a luxury into a necessary
feature. The GDPR emphasizes the need for "meaningful information" about the logic of automated
decisions, underscoring the importance of presenting explanations in a form understandable to all users,
regardless of their technical expertise. This shift not only emphasises the significance of natural language
in making AI explanations accessible but also marks a pivotal evolution in the interaction between AI
systems and their users.
      </p>
      <p>
        Natural Language Explanation. Although the Evaluative AI paradigm was recently introduced, the
use of natural language to reinforce explanations is a popular trend. According to Sokol and Flach [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],
using natural language to provide explanations is especially effective for a non-expert audience, making
the information easily accessible. At the same time, narration in natural language increases the credibility
of explanations and improves the user’s awareness of the explanation, facilitating their approval (see e.g.,
[
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ]). This highlights the significance of natural language in making information more accessible and
increasing its acceptance through a clear and effective communication approach.
      </p>
      <p>
        Research on methodologies for Natural Language Generated (NLG) Explanations has advanced
from rigid early techniques to complex and sophisticated approaches (see [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). In its early stages,
text generation was based on rule-based systems, where predefined rules set the text’s structure and
content [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Concurrently, template-based systems were developed, utilizing preset structures filled
with specific data to combine flexibility and control in content generation [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref5">20, 5, 21, 22</xref>
        ]. More recently,
the adoption of deep neural networks and attention mechanisms [23] has significantly enhanced the
generation of natural language explanations, surpassing the constraints of earlier methods by improving
naturalness, fluency, and adaptability in various contexts [24, 25, 26].
      </p>
      <p>LLMs and Prompt Engineering. Recent advancements in LLMs are mitigating the constraints of
requiring extensive labelled data sets and significant computational resources for task-specific tuning.
Leveraging in-context learning [27], for instance, LLMs handle tasks efficiently and with high accuracy [28]
through one-shot and few-shot learning methods [29, 30]. These models, informed by explainers
about feature contributions, provide clear and precise explanations that are easily understandable even by
non-experts [31, 32, 33]. Despite challenges such as the need for prompt engineering to ensure accuracy and
avoid errors that could impact system credibility, LLMs show promise in surpassing the limitations of
traditional rule-based and template-based approaches.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Formalising Evaluative AI</title>
      <p>[Figure: overview of the EADS framework. Recoverable labels: ML Model; XAI Component with Explainer A, Explainer B and Explainer C, each marking features as Pros, Neutral or Cons for Class 1, Class 2 and Class 3; Consensus System and Consensus assessment; LLM presenting the Pros/Cons of each option; the Decision-Maker, who interacts to evaluate and decides, reaching an Informed Decision.]</p>
      <p>[...] decision-making process, the outputs from each explainer are integrated within a consensus system. This
system quantifies the degree of agreement among the explainers.</p>
      <p>LLM Component: The LLM component serves two primary functions. Firstly, it functions as a
conversational agent, enabling users to interact and gain insights from the explanations provided by the
XAI component. Secondly, it acts as an agent that enhances the context of these explanations, thereby
enriching their informational value.</p>
      <sec id="sec-3-1">
        <title>3.1. Problem Formulation</title>
        <p>
          We build and formalise our framework on top of the Evaluative AI framework introduced by Miller [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
First of all, we generalize the classification problem as described by Guidotti et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]:
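Definition 3.1 (Machine Learning "Black Box" Classifier). A predictor, also named model or classifier,
is a function
          <disp-formula><label>(1)</label><tex-math>b : \mathcal{X}^{(m)} \rightarrow \mathcal{Y}</tex-math></disp-formula>
        </p>
        <p>which maps data instances (tuples) <inline-formula><tex-math>x</tex-math></inline-formula> from a feature space <inline-formula><tex-math>\mathcal{X}^{(m)}</tex-math></inline-formula> with <inline-formula><tex-math>m</tex-math></inline-formula> features to a decision <inline-formula><tex-math>y</tex-math></inline-formula> in a
target space <inline-formula><tex-math>\mathcal{Y}</tex-math></inline-formula>. We write <inline-formula><tex-math>b(x) = y</tex-math></inline-formula> to denote the decision <inline-formula><tex-math>y</tex-math></inline-formula> predicted by <inline-formula><tex-math>b</tex-math></inline-formula>, and <inline-formula><tex-math>b(\mathcal{X}) = \mathcal{Y}</tex-math></inline-formula> as a shorthand
for <inline-formula><tex-math>\{b(x) \mid x \in \mathcal{X}\} = \mathcal{Y}</tex-math></inline-formula>. An instance <inline-formula><tex-math>x</tex-math></inline-formula> consists of a set of <inline-formula><tex-math>m</tex-math></inline-formula> attribute-value pairs <inline-formula><tex-math>(a_i, v_i)</tex-math></inline-formula>, where <inline-formula><tex-math>a_i</tex-math></inline-formula> is a
feature (or attribute) and <inline-formula><tex-math>v_i</tex-math></inline-formula> is a value from the domain of <inline-formula><tex-math>a_i</tex-math></inline-formula>. The domain of a feature can be continuous or
categorical. The target space <inline-formula><tex-math>\mathcal{Y}</tex-math></inline-formula> (with dimensionality equal to one) contains the different labels (classes
or outcomes), and also, in this case, the domain can be continuous or categorical. Note that, in the case
of ordinal classification, labels in <inline-formula><tex-math>\mathcal{Y}</tex-math></inline-formula> have an order. A predictor <inline-formula><tex-math>b</tex-math></inline-formula> can be a machine-learning model, a
domain expert rule-based system, or any combination of algorithmic and human knowledge processing.
In the following, we denote by <inline-formula><tex-math>b</tex-math></inline-formula> a black box predictor whose internals are either unknown to the observer
or are known but uninterpretable by humans.</p>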
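        <p>A Machine Learning "Black Box" Classifier underlies the XAI Component in implementing an
Evaluative AI framework. A key feature of the XAI Component is the presence of <inline-formula><tex-math>N</tex-math></inline-formula> explainers, which collectively
form a consensus system. Each explainer <inline-formula><tex-math>e_i</tex-math></inline-formula> within the XAI Component takes as input all scores <inline-formula><tex-math>s_k</tex-math></inline-formula>
produced by the soft Machine Learning Black Box classifier for every label <inline-formula><tex-math>y_k \in \mathcal{Y}</tex-math></inline-formula>, given a single data
instance <inline-formula><tex-math>x</tex-math></inline-formula>.</p>
        <p>
          In this setup, the soft Machine Learning Black Box classifier refers to the machine learning model that
outputs probabilistic scores for each class label <inline-formula><tex-math>y_k</tex-math></inline-formula> when presented with the input data instance <inline-formula><tex-math>x</tex-math></inline-formula>. These
probabilistic scores <inline-formula><tex-math>s_k</tex-math></inline-formula> represent the model’s confidence that each class <inline-formula><tex-math>y_k</tex-math></inline-formula> is the correct
prediction for <inline-formula><tex-math>x</tex-math></inline-formula>. Each explainer <inline-formula><tex-math>e_i</tex-math></inline-formula> takes these scores <inline-formula><tex-math>s_k</tex-math></inline-formula> as input and performs its analysis, contributing
to the collective interpretation of the XAI Component. The aggregation of insights from all explainers
in the consensus system enhances the transparency and interpretability of the overall decision-making
process of the machine learning model, promoting a comprehensive understanding of the predictions for
the given data instance <inline-formula><tex-math>x</tex-math></inline-formula>. Burkart and Huber [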
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] define the explanation generation problem as follows:
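Definition 3.2 (Explanation Generation). An explanator function <inline-formula><tex-math>e</tex-math></inline-formula> is defined as
          <disp-formula><label>(2)</label><tex-math>e : (\mathcal{X} \rightarrow \mathcal{Y}) \times (\mathcal{X} \times \mathcal{Y}) \rightarrow \mathcal{E}</tex-math></disp-formula>
        </p>
        <p>which takes a supervised Machine Learning (SML) model (black box or interpretable) and a specific
data set as input and provides an explanation belonging to the set <inline-formula><tex-math>\mathcal{E}</tex-math></inline-formula> of all possible explanations as an
output. There are two possible explanation generation problems:</p>
        <list list-type="simple">
          <list-item><p>Global: extracting a global explanation from a model that is representative of some specific data set <inline-formula><tex-math>\mathcal{D}'</tex-math></inline-formula>,
i.e., <inline-formula><tex-math>e(b, \mathcal{D}')</tex-math></inline-formula> in case of a black box model or <inline-formula><tex-math>e(w, \mathcal{D}')</tex-math></inline-formula> for interpretable models;</p></list-item>
          <list-item><p>Local: instance explanators extract an explanation for a single test input <inline-formula><tex-math>x</tex-math></inline-formula> and the corresponding prediction
<inline-formula><tex-math>y</tex-math></inline-formula>, i.e., <inline-formula><tex-math>e(b, (x, y))</tex-math></inline-formula> or <inline-formula><tex-math>e(w, (x, y))</tex-math></inline-formula>.</p></list-item>
        </list>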
        <p>Given this definition, we introduce the first type of explanation of an SML model, namely the direct
interpretation of a given black box model in a post-hoc fashion. This is achieved by utilizing global, often
model-agnostic, explanators. A well-known example is partial dependence plots [34].</p>
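        <p>The XAI Component produces a collective interpretation derived from its underlying explainers,
which provide explanations in the form of numerical scores in the range [-1, +1]. Each
explainer <inline-formula><tex-math>e_i</tex-math></inline-formula> assesses the contribution of each feature in the feature space <inline-formula><tex-math>\mathcal{X}^{(m)}</tex-math></inline-formula> to the classification of the
instance <inline-formula><tex-math>x</tex-math></inline-formula> in the class <inline-formula><tex-math>y_k \in \mathcal{Y}</tex-math></inline-formula>. This assessment classifies each feature as one of the following:</p>
        <list list-type="simple">
          <list-item><p>Pros: if the feature positively influenced the classification and the explainer assigns a score <inline-formula><tex-math>&gt; \tau</tex-math></inline-formula>;</p></list-item>
          <list-item><p>Cons: if the feature negatively influenced the classification and the explainer assigns a score <inline-formula><tex-math>&lt; -\tau</tex-math></inline-formula>;</p></list-item>
          <list-item><p>Neutral: if the explainer assigns a score between <inline-formula><tex-math>-\tau</tex-math></inline-formula> and <inline-formula><tex-math>+\tau</tex-math></inline-formula>, i.e. the feature had no significant impact on the
classification.</p></list-item>
        </list>
        <p>Here <inline-formula><tex-math>\tau</tex-math></inline-formula> is an arbitrary threshold. Therefore, each explainer <inline-formula><tex-math>e_i</tex-math></inline-formula> of the XAI Component can determine the
contribution of individual features to the classification outcome. The output of the individual explainer <inline-formula><tex-math>e_i</tex-math></inline-formula>
is an object defined as Evaluation, formalised as:
Definition 3.3 (Evaluation). An evaluation <inline-formula><tex-math>E_i</tex-math></inline-formula> produced by the explainer <inline-formula><tex-math>e_i</tex-math></inline-formula> is a mapping that associates,
for each class <inline-formula><tex-math>y_k \in \mathcal{Y}</tex-math></inline-formula>, a categorization label from the set <inline-formula><tex-math>\mathcal{L} = \{\mathit{Pros}, \mathit{Cons}, \mathit{Neutral}\}</tex-math></inline-formula> to each feature of
the feature space <inline-formula><tex-math>\mathcal{X}^{(m)}</tex-math></inline-formula>. Formally:
          <disp-formula><label>(3)</label><tex-math>E_i : \mathcal{Y} \rightarrow (\mathcal{X}^{(m)} \rightarrow \mathcal{L})</tex-math></disp-formula>
        </p>
        <p>As a result, the evaluation <inline-formula><tex-math>E_i</tex-math></inline-formula> produced by the explainer <inline-formula><tex-math>e_i</tex-math></inline-formula> contains the features categorised as pro, con
and neutral to the classification of the data instance <inline-formula><tex-math>x</tex-math></inline-formula>.</p>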
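        <p>The thresholding behind Definition 3.3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name evaluate, the nested-dictionary layout, the feature names and the example scores are all hypothetical, and a score exactly equal to the threshold is treated as neutral, a boundary case the text leaves open.</p>
        <preformat><![CDATA[
```python
from typing import Dict

# Hypothetical layout: class label -> feature name -> explainer score in [-1, +1].
Scores = Dict[str, Dict[str, float]]
# Class label -> feature name -> "pros" | "cons" | "neutral".
Evaluation = Dict[str, Dict[str, str]]


def evaluate(scores: Scores, tau: float) -> Evaluation:
    """Map one explainer's feature scores to Pros / Cons / Neutral labels."""
    evaluation: Evaluation = {}
    for label, feature_scores in scores.items():
        evaluation[label] = {}
        for feature, score in feature_scores.items():
            if score > tau:
                category = "pros"
            elif score < -tau:
                category = "cons"
            else:
                # Assumption: scores with |score| <= tau count as neutral.
                category = "neutral"
            evaluation[label][feature] = category
    return evaluation


# Made-up scores for two classes and two features, for illustration only.
example_scores = {
    "setosa": {"petal_length": 0.42, "sepal_width": -0.03},
    "virginica": {"petal_length": -0.35, "sepal_width": 0.08},
}
example_evaluation = evaluate(example_scores, tau=0.10)
```
]]></preformat>
        <p>Running the same function once per explainer would give one Evaluation per explainer, which is the input the consensus step expects.</p>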
        <p>The collective interpretation of the XAI Component is derived from its explainers <inline-formula><tex-math>e_i</tex-math></inline-formula> by assessing the
level of Consensus. We adopt a Consensus measure proposed in [35] to assess the level of agreement of
the individual explainers <inline-formula><tex-math>e_i</tex-math></inline-formula> of the XAI Component as follows.</p>
        <p>Similarly, Le et al. [36] propose Weight of Evidence (WoE) as a probabilistic method for analysing
variable importance. To do so, the authors propose summing the WoE of each feature <inline-formula><tex-math>a_j</tex-math></inline-formula> to calculate the
WoE of the hypothesis <inline-formula><tex-math>h</tex-math></inline-formula>, articulated by the decision maker.</p>
        <p>Definition 3.4 (Consensus Measure). Consensus Measure is a measure of dispersion introduced as a
representation of agreement and disagreement. Building on the generally accepted Shannon entropy, this
measure uses a probability distribution and the distance between categories to produce a value spanning
the unit interval. The measure is applied to the Likert scale (or any ordinal scale) to determine degrees
of consensus. Using this measure, data on ordinal scales can be given a value of dispersion that is both
logically and theoretically sound. Since consensus is a function of shared group feelings towards an issue,
this "feeling" can be captured through a Likert scale that measures the extent to which an entity agrees or
disagrees with the question. The Likert scale adopted in our framework is shown in Tab. 3.4 where:</p>
        <list list-type="bullet">
          <list-item><p><inline-formula><tex-math>\phi_{ij}</tex-math></inline-formula> is the numeric score representing the impact of feature <inline-formula><tex-math>a_j</tex-math></inline-formula> on the classification of data instance <inline-formula><tex-math>x</tex-math></inline-formula>
for the individual explainer <inline-formula><tex-math>e_i</tex-math></inline-formula>;</p></list-item>
          <list-item><p><inline-formula><tex-math>\tau</tex-math></inline-formula> is an arbitrary threshold;</p></list-item>
          <list-item><p>Evaluation Mapping value is the result of Eq. 3.</p></list-item>
        </list>
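        <table-wrap>
          <label>Tab. 3.4</label>
          <table>
            <thead>
              <tr><th>Explainer Score (<inline-formula><tex-math>\phi_{ij}</tex-math></inline-formula>)</th><th>Evaluation Mapping</th></tr>
            </thead>
            <tbody>
              <tr><td><inline-formula><tex-math>\phi_{ij} &lt; -\tau</tex-math></inline-formula></td><td>CON</td></tr>
              <tr><td><inline-formula><tex-math>-\tau &lt; \phi_{ij} &lt; \tau</tex-math></inline-formula></td><td>NEUTRAL</td></tr>
              <tr><td><inline-formula><tex-math>\phi_{ij} &gt; \tau</tex-math></inline-formula></td><td>PRO</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The measure satisfies the following rules:</p>
        <list list-type="order">
          <list-item><p>For a given (even) number of individuals participating in a discussion on some question of interest,
if an equal number of individuals, n/2, separate themselves into two disjoint groups, each centred on
the strongly disagree and strongly agree categories, the group is considered to have no consensus;</p></list-item>
          <list-item><p>if all the participants classify themselves in the same category of the Likert scale, regardless of the
category, then the consensus of the group is considered to be complete at 100%;</p></list-item>
          <list-item><p>if the mix of participants is such that n/2 + 1 participants assign themselves to any one category,
the degree of consensus must be greater than 0, for the balance in the group is no longer equal at
the extreme categories.</p></list-item>
        </list>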
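        <p>Hence, a complete lack of consensus generates a value of 0, and a complete consensus of opinion yields
a value of 1. Every other combination of Likert scale categories must result in a value within the unit
interval. The consensus value of feature <inline-formula><tex-math>a_j</tex-math></inline-formula> for a given class <inline-formula><tex-math>y_k \in \mathcal{Y}</tex-math></inline-formula> is defined as:
          <disp-formula><label>(4)</label><tex-math>\mathrm{Cns}(X) = 1 + \sum_{i=1}^{n} p_i \log_2\left(1 - \frac{|X_i - \mu_X|}{d_X}\right)</tex-math></disp-formula>
where:</p>
        <list list-type="bullet">
          <list-item><p><inline-formula><tex-math>p_i</tex-math></inline-formula> is the probability (relative frequency) of outcome <inline-formula><tex-math>X_i</tex-math></inline-formula> (with <inline-formula><tex-math>i</tex-math></inline-formula> ranging from 1 to <inline-formula><tex-math>n = 3</tex-math></inline-formula>);</p></list-item>
          <list-item><p><inline-formula><tex-math>X_i</tex-math></inline-formula> is the outcome in Likert scale value;</p></list-item>
          <list-item><p><inline-formula><tex-math>\mu_X</tex-math></inline-formula> is the weighted mean of <inline-formula><tex-math>X</tex-math></inline-formula> using probabilities <inline-formula><tex-math>p_i</tex-math></inline-formula> as weights;</p></list-item>
          <list-item><p><inline-formula><tex-math>d_X</tex-math></inline-formula> is the dimension of the Likert scale adopted.</p></list-item>
        </list>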
        <p>Using this measure, a collective interpretation of the XAI Component can be obtained, also quantifying
the degree of agreement of the individual explainers that make up the component.</p>
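        <p>As a rough illustration of the consensus measure, the sketch below computes it for the Likert values that the explainers assign to one feature of one class. It rests on an assumption not spelled out in the text: the scale dimension is taken to be the width of the scale (maximum minus minimum, i.e. 2 for a three-point scale), which is the choice under which an even split between the two extreme categories gives exactly 0. The function name and signature are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import Sequence


def consensus(likert_values: Sequence[int], scale_min: int = 1, scale_max: int = 3) -> float:
    """Consensus of a group of Likert ratings, in the unit interval."""
    if not likert_values:
        raise ValueError("at least one Likert value is required")
    # Assumption: the scale dimension d_X is the width max - min.
    width = scale_max - scale_min
    n = len(likert_values)
    # Relative frequency p_i of each observed outcome X_i.
    probs = {value: count / n for value, count in Counter(likert_values).items()}
    # Weighted mean mu_X of the outcomes, using p_i as weights.
    mean = sum(value * p for value, p in probs.items())
    # 1 + sum_i p_i * log2(1 - |X_i - mu_X| / d_X); unobserved outcomes contribute 0.
    return 1.0 + sum(
        p * math.log2(1.0 - abs(value - mean) / width) for value, p in probs.items()
    )


full_agreement = consensus([3, 3, 3])   # all explainers say "pro"
even_split = consensus([1, 3])          # one "con", one "pro"
majority = consensus([1, 1, 3])         # two "con", one "pro"
```
]]></preformat>
        <p>Under these assumptions, full agreement evaluates to 1, the even split between the extremes to 0, and the two-against-one case to a value strictly between 0 and 1, matching the three rules listed for the measure.</p>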
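        <p>In this framework, an LLM is used to generate natural language, providing the user with textual
explanations and giving them context. Following [37], the LLM is formalised as follows:
Definition 3.5 (Large Language Model). Language models are a fundamental building block of current
SOTA natural language processing pipelines. While the objectives used to train these models vary, one
popular choice is a next-step prediction objective. This approach constructs a generative model of the
distribution
          <disp-formula><label>(5)</label><tex-math>\Pr(x_1, x_2, \ldots, x_n)</tex-math></disp-formula>
where <inline-formula><tex-math>x_1, x_2, \ldots, x_n</tex-math></inline-formula> is a sequence of tokens from a vocabulary <inline-formula><tex-math>\mathcal{V}</tex-math></inline-formula>, by applying the chain rule of probability
          <disp-formula><label>(6)</label><tex-math>\Pr(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \Pr(x_i \mid x_1, \ldots, x_{i-1})</tex-math></disp-formula>
SOTA LMs use neural networks to estimate this probability distribution. We let <inline-formula><tex-math>f_\theta(x_i \mid x_1, \ldots, x_{i-1})</tex-math></inline-formula> denote
the likelihood of token <inline-formula><tex-math>x_i</tex-math></inline-formula> when evaluating the neural network <inline-formula><tex-math>f</tex-math></inline-formula> with parameters <inline-formula><tex-math>\theta</tex-math></inline-formula>.</p>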
        <p>Users can interact with the LLM through input prompts:
Definition 3.6 (Prompt). A prompt is a sequence of tokens <inline-formula><tex-math>x_1, \ldots, x_i</tex-math></inline-formula> that conditions the Large Language
Model text generation process. Given an input prompt, a language model can generate new text by
iteratively sampling <inline-formula><tex-math>\hat{x}_{i+1} \sim f_\theta(x_{i+1} \mid x_1, \ldots, x_i)</tex-math></inline-formula> and then feeding <inline-formula><tex-math>\hat{x}_{i+1}</tex-math></inline-formula> back into the model to sample <inline-formula><tex-math>\hat{x}_{i+2} \sim f_\theta(x_{i+2} \mid x_1, \ldots, \hat{x}_{i+1})</tex-math></inline-formula>. This process is repeated until a desired stopping criterion is reached. Variations of this
text generation method include deterministically choosing the most-probable token rather than sampling
(i.e., greedy sampling) or setting all but the top-n probabilities to zero and renormalizing the probabilities
before sampling (i.e., top-n sampling [38]) [37].</p>
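        <p>The generation loop of Definition 3.6 and its two variations can be sketched with a toy next-token distribution standing in for the neural network. Everything below is hypothetical and for illustration only: toy_model, its lookup table and probabilities, and the stop token are not part of EADS or of any particular LLM API.</p>
        <preformat><![CDATA[
```python
import random
from typing import Callable, Dict, List, Optional

Distribution = Dict[str, float]  # token -> probability


def greedy_choice(probs: Distribution) -> str:
    """Deterministically pick the most probable token."""
    return max(probs, key=probs.get)


def top_n_filter(probs: Distribution, n: int) -> Distribution:
    """Keep the n most probable tokens and renormalise their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:n]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}


def generate(
    model: Callable[[List[str]], Distribution],
    prompt: List[str],
    max_new_tokens: int,
    top_n: Optional[int] = None,
    rng: Optional[random.Random] = None,
    stop_token: str = "<eos>",
) -> List[str]:
    """Append tokens one at a time, feeding each new token back into the model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = model(tokens)
        if top_n is not None:
            probs = top_n_filter(probs, top_n)
        if rng is None:
            next_token = greedy_choice(probs)  # greedy variant
        else:
            candidates = list(probs)
            weights = [probs[t] for t in candidates]
            next_token = rng.choices(candidates, weights=weights, k=1)[0]
        tokens.append(next_token)
        if next_token == stop_token:  # stopping criterion
            break
    return tokens


def toy_model(context: List[str]) -> Distribution:
    """Hypothetical stand-in for f_theta: depends only on the last token."""
    table = {"data": {"engineer": 0.5, "scientist": 0.3, "architect": 0.2}}
    return table.get(context[-1], {"<eos>": 1.0})


greedy_output = generate(toy_model, ["senior", "data"], max_new_tokens=3)
sampled_output = generate(toy_model, ["data"], 1, top_n=2, rng=random.Random(0))
```
]]></preformat>
        <p>With top_n=2 the least probable continuation is removed before sampling, so only the two most likely tokens can ever be drawn.</p>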
      </sec>
      <sec id="sec-3-2">
        <title>3.2. From Explaining to Supporting Decisions</title>
        <p>
          We introduce EADS, offering a new approach for DA systems to transition from the conventional XAI
framework to the Evaluative AI framework, as suggested by Miller [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. One of the main limitations of
current XAI approaches when applied to decision support systems is that they provide the user with
explanations of a single answer (i.e. output of the black box), as seen in the first conversation example.
This approach limits the decision-maker’s ability to evaluate different choices. More precisely, the
limitation in the first conversation example lies in the "hard classification" approach of the system, which
identifies and explains a single best choice. Instead, we adopt a "soft classification" approach, i.e.,
a membership/classification score for each possible class. In this way, the evaluative system supports
the decision-making process by presenting a set of possible choices with pros, cons and neutral features,
enabling the decision-maker to guide the process with the system’s support, resulting in a
machine-in-the-loop paradigm. Moreover, we supplement our approach with two additional components to make the
underlying model more understandable, accountable and transparent. The first is an LLM to enhance the
interaction between the AI agent and the human decision-maker. The second one gives users more robust
information thanks to a consensus system of individual explainers.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Framework Components</title>
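        <p>In this section, the implementation of the framework is shown step-by-step, taking up the four fundamental
elements described in Sec. 3.1: 1. Machine Learning Black Box (soft) Classifier training, 2. XAI
Component construction and Evaluation generation, 3. Consensus assessment, and 4. Prompt engineering.</p>
        <sec>
          <title>3.3.1. ML (soft) classifier training</title>
          <p>First, a model is trained on a training dataset, and its performance is evaluated on a test dataset
(using the hard classification approach). Therefore, given a new data instance <inline-formula><tex-math>x</tex-math></inline-formula> to be evaluated, the
trained model must no longer return the single prediction <inline-formula><tex-math>y</tex-math></inline-formula> (hard classification) but produces a score <inline-formula><tex-math>s_k</tex-math></inline-formula>
(see Sec. 3.1) for each target class <inline-formula><tex-math>y_k \in \mathcal{Y}</tex-math></inline-formula>, thus softly approaching the classification problem as described
above (see Sec. 3.1 and 3.2).</p>
        </sec>
        <sec>
          <title>3.3.2. XAI Component construction and Evaluation generation</title>
          <p>The next step is constructing the XAI Component and generating the Evaluations. As described in Sec. 3.1,
the component consists of <inline-formula><tex-math>N</tex-math></inline-formula> individual explainers. Each one, given an instance <inline-formula><tex-math>x</tex-math></inline-formula>, evaluates for each class <inline-formula><tex-math>y_k \in \mathcal{Y}</tex-math></inline-formula>
the impact of each feature within the feature space <inline-formula><tex-math>\mathcal{X}^{(m)}</tex-math></inline-formula> on the classification of <inline-formula><tex-math>x</tex-math></inline-formula> in class <inline-formula><tex-math>y_k</tex-math></inline-formula>.
The individual explainer <inline-formula><tex-math>e_i</tex-math></inline-formula> expresses the impact of the feature in the form of a numerical score in the
range [-1, +1].</p>
          <p>Algorithm 1 shows how the Evaluation of an individual explainer <inline-formula><tex-math>e_i</tex-math></inline-formula> for a given data instance <inline-formula><tex-math>x</tex-math></inline-formula> is
obtained. In Algorithm 1:</p>
          <list list-type="bullet">
            <list-item><p>model is the trained Machine Learning soft classifier;</p></list-item>
            <list-item><p>explainer is an individual explainer of the XAI Component; the method explainer(model) is used as
a placeholder for the actual explainer instantiation method, the signature of which varies depending
on the library used;</p></list-item>
            <list-item><p><inline-formula><tex-math>\tau</tex-math></inline-formula> is the arbitrary threshold for the Likert Mapping step;</p></list-item>
            <list-item><p>pros, cons, neutral are the sets containing categorised features as a result of the evaluation mapping
step;</p></list-item>
            <list-item><p>the method explain(x) is used as a placeholder for the actual explanation generation
method of the explainer <inline-formula><tex-math>e_i</tex-math></inline-formula>, the signature of which varies depending on the library used;</p></list-item>
            <list-item><p>explanation(<inline-formula><tex-math>y_k</tex-math></inline-formula>, feature) is the numerical score given by the explainer <inline-formula><tex-math>e_i</tex-math></inline-formula> to assess the impact of
the feature on the classification of data instance <inline-formula><tex-math>x</tex-math></inline-formula> in class <inline-formula><tex-math>y_k</tex-math></inline-formula>.</p></list-item>
          </list>
          <p>The XAI Component is constructed by repeating Algorithm 1 N times, once for each explainer. The result of
the N executions is N Evaluations that are used as input for the next implementation step.</p>
        </sec>
        <sec>
          <title>3.3.3. Consensus assessment</title>
          <p>After constructing the XAI Component and obtaining all Evaluations from the component’s explainers,
we proceed with the consensus assessment for collective interpretation of the component.</p>
        </sec>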
        <sec id="sec-3-3-1">
          <title>Algorithm 2: LIKERT SCALE MAPPING</title>
          <p>Input: evaluation</p>
          <p>Output: likert_evaluation</p>
          <preformat>likert_evaluation ← ∅;
for y_k in Y do
    for feature in X^(m) do
        if evaluation(y_k, feature) ∈ cons then
            likert_evaluation(y_k, feature) ← 1;
        else if evaluation(y_k, feature) ∈ neutral then
            likert_evaluation(y_k, feature) ← 2;
        else
            likert_evaluation(y_k, feature) ← 3;
return likert_evaluation</preformat>
          <p>Algo 2 shows the mapping of Evaluation elements onto the Likert scale described in Def. 3.4. Repeating
Algo 2 N times, once for each Evaluation, yields a complete mapping of the output of the XAI
Component into a format useful for consensus assessment.</p>
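          <p>The mapping step can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the coding 1 = con, 2 = neutral, 3 = pro is an assumption (the actual codes are those of Def. 3.4), and the dictionary layout, the name likert_mapping and the example Evaluation are hypothetical.</p>
          <preformat>
```python
# Assumed Likert coding; the actual codes are those defined in Def. 3.4.
LIKERT_CODES = {"cons": 1, "neutral": 2, "pros": 3}


def likert_mapping(evaluation):
    """Map one Evaluation onto Likert values.

    evaluation maps class name -> {"pros": set, "cons": set, "neutral": set}.
    Returns class name -> {feature: Likert value}.
    """
    likert_evaluation = {}
    for class_name, categories in evaluation.items():
        likert_evaluation[class_name] = {}
        for category, features in categories.items():
            for feature in features:
                likert_evaluation[class_name][feature] = LIKERT_CODES[category]
    return likert_evaluation


# Hypothetical Evaluation produced by a single explainer for one class.
example_evaluation = {
    "Setosa": {
        "pros": set(),
        "cons": {"petal length (cm)", "petal width (cm)"},
        "neutral": {"sepal length (cm)", "sepal width (cm)"},
    }
}
likert_evaluation = likert_mapping(example_evaluation)
print(likert_evaluation)
```
          </preformat>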
          <p>With the output of the XAI Component explainers mapped into a format
useful for consensus assessment, Algo 3 shows how to obtain the consensus score for a feature given a
class   ∈  following Eq. 4. In Algo 3:</p>
          <p>• likert_evaluations contains all the evaluations resulting from N iterations of Algo 2;
• likert_outcome is a set containing all possible outcomes of the Likert scale described in Tab. 3.4.</p>
          <p>Algorithm 3: CONSENSUS</p>
          <p>Input: likert_evaluations,   , feature</p>
          <p>Output: consensus_score
1 likert_outcome ← [1, 2, 3];</p>
          <p>Repeating Algo 3 for each class   ∈  and each feature in the feature space  ()
yields all the consensus scores needed for the next step.</p>
          <p>Following the formalisation described in Sec. 3.1, we implemented the framework using the benchmark
Iris dataset for a preliminary study. The black box "soft" classifier, trained on a portion of the dataset, was
a Random Forest, and the XAI Component was made up of the SHAP [39], LIME [40] and ELI5<sup>1</sup> explainers.
We set the threshold  = 0.10 for the Evaluation generation following Algo 1. After the results were
obtained, the Likert Mapping and Consensus Assessment steps were carried out following Algos 2 and 3.</p>
          <p>The results of the last two steps were structured in two different JSONs and formatted into a prompt
explaining the task to be performed. The prompt also provides the information needed to interpret the two JSONs,
so that the model can understand what kind of requests the user can make and how it should
fulfil them, thus giving the LLM a frame of action. The formatted prompt constitutes the system prompt
of the model. After sending this instruction, the framework is ready to use.</p>
          <p>The following conversation is extracted from a real interaction with EADS:
Implemented Evaluative AI Framework
User: How could this instance be classified?
EADS: It could be "Virginica", "Versicolor" or "Setosa".
User: I want to know about "Setosa"
EADS: Certainly! Here’s a breakdown of the features and their impact on the classification of the instance under the class "Setosa":</p>
          <p>1. sepal length (cm):
• Considered as a CON by 0 explainers.
• Considered NEUTRAL by 3 explainer.
• Considered as a PRO by 0 explainers.
• Consensus: 100%</p>
          <p>2. sepal width (cm):
• Considered as a CON by 0 explainers.
• Considered NEUTRAL by 3 explainers.
• Considered as a PRO by 0 explainers.
• Consensus: 100%</p>
          <p>3. petal length (cm):
• Considered as a CON by 2 explainers.
• Considered NEUTRAL by 1 explainer.
• Considered as a PRO by 0 explainers.
• Consensus: 76.59%</p>
          <p>4. petal width (cm):
• Considered as a CON by 2 explainers.
• Considered NEUTRAL by 1 explainer.
• Considered as a PRO by 0 explainers.
• Consensus: 76.59%</p>
          <p>For the Setosa class, the features "sepal length (cm)" and "sepal width (cm)" are considered neutral by all explainers with a full consensus.
However, the features "petal length (cm)" and "petal width (cm)" are mostly viewed as cons, with a slightly lesser consensus of 77%.</p>
          <p>User: And what about "Versicolor"?
EADS: Here’s the explanation for "Versicolor":</p>
          <p>1. sepal length (cm):
• Considered as a CON by 0 explainers.
• Considered NEUTRAL by 3 explainer.
• Considered as a PRO by 0 explainers.
• Consensus: 100%
. . .</p>
          <p>1 https://eli5.readthedocs.io/en/latest/index.html</p>
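          <p>The consensus percentages in the conversation above can be reproduced with the short Python sketch below. It rests on assumptions, because Eq. 4 is not restated in this part of the text: the Likert coding 1 = con, 2 = neutral, 3 = pro, the ordinal consensus measure of Tastle and Wierman [35], and a scale width of 3. Under these choices three neutral votes give 100% and two con votes plus one neutral vote give 76.59%, matching the reported values. The function name consensus and its width parameter are illustrative.</p>
          <preformat>
```python
import math
from collections import Counter


def consensus(votes, width=3):
    """Ordinal consensus of a list of Likert votes.

    Computes 1 + sum_i p_i * log2(1 - |x_i - mean| / width), where p_i is
    the share of votes with value x_i. The width of 3 is an assumption
    chosen because it reproduces the percentages reported in the text.
    """
    n = len(votes)
    mean = sum(votes) / n
    score = 1.0
    for value, count in Counter(votes).items():
        share = count / n
        score += share * math.log2(1 - abs(value - mean) / width)
    return score


# Three explainers all rate the feature as neutral (coded 2).
print(round(100 * consensus([2, 2, 2]), 2))  # 100.0
# Two explainers rate it as a con (1), one as neutral (2).
print(round(100 * consensus([1, 1, 2]), 2))  # 76.59
```
          </preformat>
          <p>The measure is symmetric around the mean, so reversing the assumed coding (1 = pro, 3 = con) would give the same scores for these examples.</p>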
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. User Evaluation</title>
        <p>For the user study we recruited 50 users through the Prolific<sup>2</sup> platform. The participants were
remunerated in accordance with the prevailing average compensation rate on Prolific. To take part in
the user study, participants had to be fluent in English and hold a bachelor’s degree or higher.
Since EADS is meant to be used by domain experts, and not exclusively by LLM and XAI
experts, we decided not to filter participants based on their technical or professional profile. They took on
average 8.12 minutes to finish the survey. A sample of the user study is provided<sup>3</sup>.</p>
        <p>
          Each respondent was provided with a data instance from the 20 newsgroups dataset, a
commonly used benchmark dataset in XAI (e.g. in [
          <xref ref-type="bibr" rid="ref20">40, 20</xref>
          ]). It is a collection of approximately 20,000
newsgroup documents, partitioned across 20 different topics including politics, sports, and technology. In
the user study, we considered a subset of 4 classes: atheism, christian, space, and graphics. The document
provided to the respondents belonged to the class atheism. Note that the classes atheism and
christian are challenging to distinguish, because the two topics are often mixed in the documents and
because they contain features that do not generalise outside the validation set [40].
        </p>
        <p>In addition to the document, the participants were given 10 interactions with EADS, useful for
making an informed decision about the provided instance. After reading the document and the EADS
responses to the 10 questions, the user was asked to classify the document into one of the 4 topics. After
making this decision, the user was shown the true label of the document and asked to answer a questionnaire
of 4 questions. The questions were chosen in accordance with recent literature on XAI evaluation [41] to
test the following 5 properties of the system:</p>
        <p>Q1: Transparency to explain how the AI system works.</p>
        <p>Q2: Trust to increase users’ confidence in the AI system.</p>
        <p>Q3: Satisfaction to increase the ease of use or enjoyment of users when interacting with an AI system.</p>
        <p>Q4: Effectiveness to help users make good decisions.</p>
        <p>Q5: Debugging to enable users to identify defects in the AI system.</p>
        <p>Does EADS help the user in the decision-making process? The first four properties are tested
through direct questions. The users indicate a value on a Likert scale from 1 to 4. Q1 aimed to assess the
level of Transparency of EADS, so we asked participants to indicate how easy it was to understand the EADS explanations.
Value 1 stood for "very easy" and value 4 for "very difficult". For Q1, 58% of participants indicated a
value of 1-2, 32% indicated value 3 and only 10% indicated value 4. The answers to Q1 indicate that
most evaluators found it easy to understand the explanations provided verbatim by EADS. A sizeable proportion
of evaluators (42%) nevertheless encountered some difficulty with the explanation, which suggests that in certain
scenarios or tasks, or in the presence of particular data instances, an explanation in text format is not always
the best choice and other ways of presenting the explanation might be more suitable (e.g., images, video,
multimedia).</p>
        <p>2 https://app.prolific.com/
3 https://forms.gle/HCeA4MEWEpdcR6MM8</p>
        <p>Q2 aimed to understand how EADS fosters Trust, that is, increased confidence in the AI system. Users were
asked whether having the opinion of three different explainers increases the reliability of EADS. In this case
value 1 stood for "not at all" and value 4 stood for "extremely". For Q2, 72% of participants indicated
a value of 3-4 and just 28% indicated a value of 1-2. The responses to Q2 demonstrate the potential of
the Consensus System and its ability to lend robustness to the explanations provided by EADS. In the
critical scenarios for which EADS was designed, the Consensus System emerges as a pivotal component:
its ability to enhance the reliability of the entire AI system and bolster the confidence of decision makers
is paramount, fostering greater support throughout the cognitive process.</p>
        <p>For Q3, we asked users how satisfied they were with EADS, in order to evaluate Satisfaction,
that is, the ease of use or enjoyment of users interacting with an AI system. Users could indicate value
1 for "very dissatisfied" and value 4 for "very satisfied". For Q3, 66% of participants indicated a value
of 3-4, demonstrating a positive overall experience with EADS.</p>
        <p>To assess the Effectiveness of EADS in helping users make good decisions, Q4 asked whether they believe EADS has
the capability to assist decision-makers in making informed decisions across different scenarios. Users
could express 1 for "not at all" and 4 for "extremely". For Q4, 68% of participants indicated a value
of 3-4. The responses to Q4 support the initial thesis that a system that gives users the opportunity
to explore the pros, cons, and neutral aspects of each possibility supports the cognitive process more than
a single opinion (decision) taken by the ML model and provided along with its explanation.</p>
        <p>Does EADS help the correctness of the classification? Property Q5 is measured through the
classification of the documents. 88% of the respondents classified the instance under the Atheism
class, while 12% categorized it as Christian. None selected the Space or Graphics classes. This outcome
shows that, given sufficient supporting information, users are able to correctly classify the instance. It is
noteworthy that cases classified as Christian are not necessarily wrong, since the discussion pertains to
atheism, but in comparison with the Christian stance rather than with religious positions in general (see, e.g., the
provided survey sample). In such instances, a classifier, even with the correct prediction, would have
chosen the Atheism class, leaving the end user with no other option. Instead, EADS improves the
decision-making process by enabling users to consider various alternatives and to intercept misclassifications by the
ML algorithm. Moreover, it is important to note that none of the users, after interacting with EADS,
selected the Space or Graphics classes, which would have undoubtedly been errors. An XAI approach,
on the other hand, relies on the class predicted by a classification algorithm. To test this behaviour, we
implemented a SOTA classifier for 20Newsgroup [42] on the 4 classes used in the user study. The total
number of instances in the four classes is 3,756, which we divided into 2,256 for training and 1,500 for
testing. The test set contains 319 instances of the class atheism, of which 291 (91.2%) are correctly
classified, 14 (4.4%) are classified as christian and 14 (4.4%) are assigned to the other two classes. Therefore, EADS
not only involves the user in the decision-making process, but also helps in debugging misclassifications
of the classifier or of the XAI algorithm in the prediction of the correct class.</p>
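        <p>The split sizes and percentages reported for the classifier can be checked directly from the counts given above. The short Python check below uses only those counts; the variable names are illustrative.</p>
        <preformat>
```python
# Train/test split reported for the four classes.
train_size, test_size = 2256, 1500
assert train_size + test_size == 3756

# Predictions for the 319 test instances whose true class is atheism.
atheism_total = 319
counts = {"atheism": 291, "christian": 14, "other": 14}
assert sum(counts.values()) == atheism_total

# Share of each predicted class, in percent, rounded to one decimal.
shares = {label: round(100 * n / atheism_total, 1) for label, n in counts.items()}
print(shares)
```
        </preformat>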
        <p>Plots/graphs of the discussed results are in the additional material.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Hallucination Check</title>
        <p>Using LLMs requires particular attention to the accuracy of the results, since hallucinations are a
well-known problem [43]. A possible way to mitigate this problem is to refine the prompt through prompt
engineering [44]. Beyond the user evaluation, we also checked for hallucinations by testing fifty different input
instances with our prompt on both open-source models<sup>4, 5</sup> and closed-source models<sup>6, 7</sup>. In none of the
tests performed did we find any inaccuracies or inconsistencies that would indicate hallucinations.</p>
        <p>4 https://llama.meta.com/llama3/
5 https://mistral.ai/technology/models
6 https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4
7 https://cohere.com/blog/command-r-plus-microsoft-azure</p>
        <p>As black box AI-based tools become widespread, it is crucial to emphasize the integral involvement
of users in the decision-making process to ensure trustworthiness and transparency. In this paper, we
propose EADS, a novel framework that connects Evaluative AI with conversational explanations realised
via LLMs. To realize EADS, we (i) formalize our approach and link it to the SOTA in XAI and LLM;
(ii) propose a framework that supports the user with pros and cons of different possible decisions while
preserving current XAI approaches, and (iii) integrate an LLM in the framework to transform the role of
the decision maker from passive component of the system to a part of the system that directly interacts
with it and forms her decisions based on the rich information provided by the decision system.</p>
        <p>The user study has shown that the majority of users appreciate the approach employed by EADS,
which involves the user in the decision process and employs LLMs to enrich the explainers’ output,
enhancing trust, satisfaction, and effectiveness in decision-making processes supported by ML algorithms
(Q2, Q3, and Q4). Furthermore, the results of question Q5 in the user study indicate that EADS, by
involving the user in the decision-making process and shifting the decision locus from the machine to the
user, enables the improvement of algorithm performance and facilitates more accurate decision-making.
Regarding the transparency of the output (Q1), ambivalent results show that, although almost none of
the users found the system very challenging to interact with, a minority of users encountered some
difficulties in comprehension. This suggests that natural language interaction may not always be the
optimal interface, prompting us to consider integrating additional tools such as graphs/plots, images, and
reports in future developments.</p>
        <p>Our research delineates the limitations into two main categories: general limitations inherited from the
underlying technologies and specific limitations of our proposed approach.</p>
        <p>General Limitations</p>
        <p>Hallucination in Language Models: LLMs are known to produce hallucinated content, a challenge
well-documented across numerous studies. Many methodologies have been proposed to mitigate
this issue, but a definitive solution that ensures reliable detection in all cases remains elusive. We
tested our framework using both open source and closed source LLMs without encountering
hallucinations; nevertheless, our study acknowledges this limitation and does not address it directly, given the extensive
ongoing research in this area.</p>
        <p>Intended Audience of the System: The foundational technology of our system, Evaluative AI, is
primarily designed for decision-makers. Unlike traditional XAI approaches where explanations can be
tailored to different audiences, Evaluative AI targets authoritative figures within their domains, limiting
its broader applicability.</p>
        <p>Specific Limitations</p>
        <p>Scalability with Increased Complexity: The output of our framework can become overwhelming as
the number of classes and features increases. This scalability issue arises from the system’s design to
provide consensus values for each feature and class, which can be complex to manage and interpret.</p>
        <p>Data Modality Constraints: Our framework is currently implemented for text and tabular data. Its
effectiveness may not extend seamlessly to other modalities, such as images, audio, or video, thus limiting
its applicability in different domains.</p>
        <p>[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
Attention is all you need, Advances in neural information processing systems 30 (2017).
[24] U. Ehsan, P. Tambwekar, L. Chan, B. Harrison, M. O. Riedl, Automated rationale generation: a
technique for explainable ai and its effects on human perceptions, in: ACM IUI, 2019, pp. 263–274.
[25] F. Costa, S. Ouyang, P. Dolog, A. Lawlor, Automatic generation of natural language explanations,
in: Proceedings of the 23rd international conference on intelligent user interfaces companion, 2018,
pp. 1–2.
[26] E. Kokalj, B. Škrlj, N. Lavrač, S. Pollak, M. Robnik-Šikonja, Bert meets shapley: Extending shap
explanations to transformer-based classifiers, in: Proceedings of the EACL hackashop on news
media content analysis and automated report generation, 2021, pp. 16–21.
[27] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, Z. Sui, A survey for in-context
learning, arXiv preprint arXiv:2301.00234 (2022).
[28] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al.,
A survey of large language models, arXiv (2023).
[29] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are
unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[30] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information
processing systems 33 (2020) 1877–1901.
[31] D. Slack, S. Krishna, H. Lakkaraju, S. Singh, Explaining machine learning models with interactive
natural language conversations using talktomodel, Nature Machine Intelligence 5 (2023) 873–883.
[32] J. Chun, K. Elkins, explainable ai with gpt4 for story analysis and generation: A novel framework
for diachronic sentiment analysis, International Journal of Digital Humanities 5 (2023) 507–532.
[33] T. Ali, P. Kostakos, Huntgpt: Integrating machine learning-based anomaly detection and explainable
ai with large language models (llms), arXiv preprint arXiv:2309.16021 (2023).
[34] A. Goldstein, A. Kapelner, J. Bleich, E. Pitkin, Peeking inside the black box: Visualizing statistical
learning with plots of individual conditional expectation, Journal of Computational and Graphical
Statistics 24 (2015) 44–65.
[35] W. J. Tastle, M. J. Wierman, Consensus and dissention: A measure of ordinal dispersion, IJAR 45
(2007) 531–545.
[36] T. Le, T. Miller, R. Singh, L. Sonenberg, Towards the new xai: A hypothesis-driven approach to
decision support using evidence, arXiv preprint arXiv:2402.01292 (2024).
[37] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. B. Brown,
D. Song, U. Erlingsson, et al., Extracting training data from large language models., USENIX
Security Symposium 6 (2021).
[38] A. Fan, M. Lewis, Y. Dauphin, Hierarchical neural story generation, arXiv preprint
arXiv:1805.04833 (2018).
[39] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, Advances in neural
information processing systems 30 (2017).
[40] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions of any
classifier, ACM SIGKDD (2016).
[41] R. Confalonieri, J. M. Alonso-Moral, An operational framework for guiding human evaluation in
explainable and trustworthy ai, IEEE Intelligent Systems (2023).
[42] Y. Lin, Y. Meng, X. Sun, Q. Han, K. Kuang, J. Li, F. Wu, Bertgcn: Transductive text classification
by combining gnn and bert, in: ACL-IJCNLP, 2021, pp. 1456–1462.
[43] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of
hallucination in natural language generation, ACM Computing Surveys 55 (2023) 1–38.
[44] S. Tonmoy, S. Zaman, V. Jain, A. Rani, V. Rawte, A. Chadha, A. Das, A comprehensive survey
of hallucination mitigation techniques in large language models, arXiv preprint arXiv:2401.01313
(2024).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guidotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Monreale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruggieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Turini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          ,
          <article-title>A survey of methods for explaining black box models, ACM computing surveys (CSUR) 51 (</article-title>
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gunning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stefik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stumpf</surname>
          </string-name>
          , G.-
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          , Xaiexplainable artificial intelligence,
          <source>Science robotics 4</source>
          (
          <year>2019</year>
          )
          <article-title>eaay7120</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Arrieta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Díaz-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Del</given-names>
            <surname>Ser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bennetot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tabik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barbado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gil-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Benjamins</surname>
          </string-name>
          , et al.,
          <article-title>Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai</article-title>
          ,
          <source>Information fusion 58</source>
          (
          <year>2020</year>
          )
          <fpage>82</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Holzinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saranti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Molnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Biecek</surname>
          </string-name>
          , W. Samek,
          <article-title>Explainable ai methods-a brief overview</article-title>
          , in: International workshop on extending explainable
          <source>AI beyond deep models and classifiers</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Malandri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seveso</surname>
          </string-name>
          ,
          <article-title>Model-contrastive explanations through symbolic reasoning, Decision Support Systems 176 (</article-title>
          <year>2024</year>
          )
          <fpage>114040</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Burkart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Huber</surname>
          </string-name>
          ,
          <article-title>A survey on the explainability of supervised machine learning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>70</volume>
          (
          <year>2021</year>
          )
          <fpage>245</fpage>
          -
          <lpage>317</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Explainable ai is dead, long live explainable ai! hypothesis-driven decision support using evaluative ai</article-title>
          ,
          <source>in: Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>333</fpage>
          -
          <lpage>342</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Madumal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sonenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vetere</surname>
          </string-name>
          ,
          <article-title>A grounded interaction protocol for explainable artificial intelligence</article-title>
          , arXiv preprint arXiv:
          <year>1903</year>
          .
          <volume>02409</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Malandri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nobani</surname>
          </string-name>
          ,
          <article-title>Convxai: a system for multimodal interaction with any black-box explainer</article-title>
          ,
          <source>Cognitive Computation</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Cambria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Malandri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nobani</surname>
          </string-name>
          ,
          <article-title>A survey on xai and natural language explanations</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>60</volume>
          (
          <year>2023</year>
          )
          <fpage>103111</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Rall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Peluso</surname>
          </string-name>
          ,
          <article-title>A data-frame theory of sensemaking, in: Expertise out of context</article-title>
          , Psychology Press,
          <year>2007</year>
          , pp.
          <fpage>118</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Clancey</surname>
          </string-name>
          ,
          <article-title>Psychology and ai at a crossroads: How might complex systems explain themselves?</article-title>
          ,
          <source>The American Journal of Psychology</source>
          <volume>135</volume>
          (
          <year>2022</year>
          )
          <fpage>365</fpage>
          -
          <lpage>378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Peirce</surname>
          </string-name>
          ,
          <article-title>Writings of Charles S. Peirce: a chronological edition</article-title>
          , volume
          <volume>8</volume>
          : 1890-1892, Indiana University Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Payrovnaziri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rengifo-Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Explainable artificial intelligence models using real-world electronic health record data: a systematic scoping review</article-title>
          ,
          <source>JAMIA</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Joint mind modeling for explanation generation in complex human-robot collaborative tasks</article-title>
          ,
          <source>IEEE RO-MAN</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sokol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Flach</surname>
          </string-name>
          ,
          <article-title>Conversational explanations of machine learning predictions through class-contrastive counterfactual statements</article-title>
          , in:
          <source>IJCAI</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>5785</fpage>
          -
          <lpage>5786</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Chaves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Gerosa</surname>
          </string-name>
          ,
          <article-title>How should my chatbot interact? A survey on social characteristics in human-chatbot interaction design</article-title>
          ,
          <source>International Journal of Human-Computer Interaction</source>
          <volume>37</volume>
          (
          <year>2021</year>
          )
          <fpage>729</fpage>
          -
          <lpage>758</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
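          <string-name>
            <given-names>M.</given-names>
            <surname>De Gennaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Krumhuber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lucas</surname>
          </string-name>
          ,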
          <article-title>Effectiveness of an empathic chatbot in combating adverse effects of social exclusion on mood</article-title>
          ,
          <source>Frontiers in Psychology</source>
          <volume>10</volume>
          (
          <year>2020</year>
          )
          <fpage>3061</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Hendricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Akata</surname>
          </string-name>
          ,
          <article-title>Generating counterfactual explanations with natural language</article-title>
          , in:
          <source>ICML Workshop on Human Interpretability in Machine Learning</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>95</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Malandri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nobani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seveso</surname>
          </string-name>
          ,
          <article-title>ContrXT: Generating contrastive explanations from any text classifier</article-title>
          ,
          <source>Information Fusion</source>
          <volume>81</volume>
          (
          <year>2022</year>
          )
          <fpage>103</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Malandri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mezzanzanica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nobani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seveso</surname>
          </string-name>
          , et al.,
          <article-title>The good, the bad, and the explainer: A tool for contrastive explanations of text classifiers</article-title>
          , in:
          <source>IJCAI</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>5936</fpage>
          -
          <lpage>5939</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>I.</given-names>
            <surname>Donadello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dragoni</surname>
          </string-name>
          ,
          <article-title>Bridging signals to natural language explanations with explanation graphs</article-title>
          ,
          <source>Proceedings of the 2nd Italian Workshop on Explainable Artificial Intelligence</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>