<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Empirical Investigation of Users' Assessment of XAI Explanations: Identifying the Sweet Spot of Explanation Complexity and Value</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felix Liedeker</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Düsing</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcel Nieveler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Cimiano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Semantic Computing Group, CITEC, Bielefeld University</institution>
          ,
          <addr-line>Inspiration 1, 33619 Bielefeld</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
<p>While the importance of explainable artificial intelligence in high-stakes decision-making is widely recognized in the existing literature, empirical studies assessing users' perceived value of explanations are scarce. In this paper, we aim to address this shortcoming by conducting an empirical study focused on measuring the perceived value of the following types of explanations: plain explanations based on feature attribution, counterfactual explanations, and complex counterfactual explanations. We measure an explanation's value using five dimensions: perceived accuracy, understandability, plausibility, sufficiency of detail, and user satisfaction. Our findings indicate a sweet spot of explanation complexity, with both dimensional and structural complexity positively impacting the perceived value up to a certain threshold.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable AI (XAI)</kwd>
        <kwd>Explanation Complexity</kwd>
        <kwd>Counterfactual Explanation</kwd>
        <kwd>User Perception</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, we have witnessed the growing prevalence of Artificial Intelligence (AI) models in various
tasks, including high-stakes decision-making in domains such as clinical decision support [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and credit risk scoring [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Building trust in such AI systems and complying with the respective
legislation necessitate the deployment of eXplainable AI (XAI) methods to gain an understanding
of the inner workings of AI systems and to ensure the correctness of their output [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>While there has been increased attention devoted to the empirical evaluation of XAI methods,
user studies investigating the perceived quality, value, understandability and informativeness of
explanations are scarce. We address this shortcoming by conducting an empirical study, asking
users to assess the perceived value of different types of explanations for a credit risk assessment
model. We measure the value of an explanation using five commonly applied dimensions
(perceived accuracy, understandability, plausibility, sufficiency of detail, and user satisfaction).
Our subsequent analyses focus on how the complexity of the provided explanations affects
their perceived value. Accordingly, we aim to answer the following research question: How
does explanation complexity affect the perceived value of XAI explanations, and is there a trade-off
between explanation complexity and value? To do so, we build on existing literature to introduce
and empirically verify two novel notions of explanation complexity: dimensional and structural
complexity of explanations, both adding to the overall explanation complexity.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        The opacity of black box models and the difficulties associated with verifying or understanding
their output have given rise to the subfield of XAI. Despite the wide recognition of the
importance of XAI, the concept of explainability is still underspecified [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Offering users explanations for decisions made by AI systems that are easy to grasp,
trustworthy, and usable is a central challenge of XAI research. Numerous scholars have approached this
challenge from different perspectives: by investigating how humans tend to explain and what
general desiderata can be found for explanations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], as well as examining various characteristics
of explanations such as the output format of an explanation [5].
      </p>
      <p>Today, counterfactual explanations (CFs) are among the most popular explanation methods,
since CFs resemble the way humans naturally explain decisions [6]. Various algorithms for
the generation of CFs have been proposed [7]. In addition, a plethora of different metrics and
algorithms have been introduced [8] to automatically evaluate explanations - though human
evaluation of explanations is still considered the gold standard [9].</p>
      <p>Previous studies investigated the influence of the generation process of explanations on
user perception (cf. e.g. [10, 11]). Wang and Yin [12] uncovered that users’ perception of the
helpfulness of an explanation also depends on the method used for explanation generation.</p>
      <p>Studying the effect of different generation processes implicitly links to the investigation
of explanation complexity, because different generation algorithms may impose constraints on
different properties of explanations, such as the length or sparsity of explanations [11].</p>
      <p>Huysmans et al. [13] empirically investigated the correlation between the comprehensibility
and presentation complexity of explanations. They found that larger, i.e. more complex,
representations decreased participants' answer accuracy and confidence.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Types of Explanations</title>
        <p>Our focus is on the perception of explanations that differ in terms of their overall explanation
complexity. For this purpose, we introduce three different types of explanations in the following:</p>
        <p>Plain. Plain explanations - inspired by existing feature importance approaches (e.g.,
LIME [14]) - are the simplest type included in our study. They are created by taking
the k most relevant features to construct an explanation of the form Because X, the prediction is
P, where X is the set of the k most relevant features.</p>
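        <p>To make this construction concrete, the following minimal Python sketch (our illustration, not
the authors' code; feature names and values are hypothetical) renders a plain explanation from the
k most relevant features:</p>
        <preformat>
# Sketch: render a plain explanation of the form "Because X, the prediction is P"
# from the k most relevant features (e.g., as ranked by LIME).

def plain_explanation(top_features, prediction):
    # top_features: list of (feature, value) pairs, already ranked by relevance
    conditions = " and ".join(f"{name} is {value}" for name, value in top_features)
    return f"Because {conditions}, the prediction is {prediction}."

# Hypothetical sample with k = 2:
print(plain_explanation([("Credit amount", "high"), ("Duration", "long")], "high risk"))
# prints: Because Credit amount is high and Duration is long, the prediction is high risk.
        </preformat>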
        <p>Counterfactual Explanations. CFs are among the most popular explanation methods for
XAI and the second type of explanation included in our study. They provide explanations of the
form If X had been different, the prediction would have changed from P to Q. Thus, they explain a
decision indirectly by providing a hypothetical, but similar counterexample [15].</p>
        <p>Complex Counterfactual Explanations. Inspired by the recently growing interest in
semifactual explanations (i.e., changes that do not change the output decision [16]), we define
complex CFs (CCFs) as explanations of the form Even if X would be different, but Y would
be different, the prediction is still P. By adding an Even if-clause to these explanations, we
deliberately increase their complexity, allowing us to test for the impact of complexity later on.</p>
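        <p>Analogously, the two counterfactual-style templates can be sketched as follows (again our
illustration; the exact wording shown to participants is not reproduced here):</p>
        <preformat>
# Sketch: text templates for CFs and complex CFs (CCFs); all values hypothetical.

def cf_explanation(changes, old_pred, new_pred):
    # "If X had been different, the prediction would have changed from P to Q."
    x = " and ".join(f"{f} had been {v}" for f, v in changes.items())
    return f"If {x}, the prediction would have changed from {old_pred} to {new_pred}."

def ccf_explanation(changed, compensating, pred):
    # "Even if X would be different, but Y would be different, the prediction is still P."
    x = " and ".join(f"{f} were {v}" for f, v in changed.items())
    y = " and ".join(f"{f} were {v}" for f, v in compensating.items())
    return f"Even if {x}, but {y}, the prediction would still be {pred}."

print(cf_explanation({"Duration": "shorter"}, "high risk", "low risk"))
print(ccf_explanation({"Duration": "shorter"}, {"Credit amount": "higher"}, "high risk"))
        </preformat>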
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Explanation Complexity</title>
        <p>In the realm of task complexity, a distinction is made between presentation complexity [17] and
domain complexity [18]. Presentation complexity is induced by different information presentation
formats, whereas context and environment as well as the dimensionality of data [19] contribute
to the domain complexity. Since we used a single data set in our study, the domain complexity is
fixed. We expand the notion of presentation complexity to dimensional and structural complexity
to allow fine-grained control of explanation complexity in our study. In the following, we
motivate our choice of these notions of complexity, describe which explanation features affect
them, and explain how they contribute to overall explanation complexity:</p>
        <p>
          Dimensional Complexity. The presumption that shorter explanations are more
comprehensible is widely established in the field of XAI [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The rationale here is that explanations
containing fewer features (i.e. shorter explanations) require less cognitive effort from users to
be understood easily [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Our decision to define dimensional complexity as a contributing factor
to explanation complexity is motivated by the assumption that requiring greater cognitive effort
indicates greater explanation complexity. For the purposes of this work, we measure dimensional
complexity as the number of features contained in each of the described types of explanations.
Hence, adding further features to an explanation implies an increased dimensional complexity.
        </p>
        <p>Structural Complexity. In addition to the length of an explanation, its type and method of
generation are known to affect the explanation complexity, too [19]. In this vein, we define
structural complexity as a function of the type of explanation provided. Here, structural refers
to the structure of the explanation that is presented to the user and does not account for the
complexity of the explanation generation method itself. Previous studies found that users can
comprehend plain explanations more easily and tend to trust them more [12]. Consequently,
we assign CFs a higher structural complexity than plain explanations. With respect to CCFs, we
previously claimed that we deliberately designed them to be more complex than CFs. Thus, we
argue that their structural complexity is the highest among the three types.</p>
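        <p>The paper orders explanations by overall complexity but does not give an explicit formula for
combining the two notions. As one hedged reading, an additive score over structural rank and feature
count, with structural rank as tie-breaker, reproduces the complexity ordering used in Figure 3:</p>
        <preformat>
# Sketch (an assumption, not the authors' formula): ordering explanations by
# overall complexity. Structural rank: Plain (1), CF (2), CCF (3); dimensional
# complexity: number of features in the explanation.

STRUCTURAL_RANK = {"Plain": 1, "CF": 2, "CCF": 3}

def complexity_key(explanation_type, n_features):
    rank = STRUCTURAL_RANK[explanation_type]
    return (rank + n_features, rank)  # additive score, ties broken by structure

conditions = [(t, k) for t in STRUCTURAL_RANK for k in (1, 2, 3)]
for t, k in sorted(conditions, key=lambda c: complexity_key(*c)):
    print(f"({t}, {k})")  # from (Plain, 1) up to (CCF, 3)
        </preformat>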
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. User Study</title>
      <sec id="sec-4-1">
        <title>4.1. Data and Model</title>
        <p>Our study is based on a binary classification problem using the German Credit Data [20]. The
data set contains credit and customer information as well as the credit risk. Given the complexity
of the original data set, we use a simplified version of it (https://www.kaggle.com/datasets/uciml/german-credit,
last accessed: 10.01.2024) and process it further by dropping samples with missing values and the
Checking Account feature. Duration and Credit amount are encoded as categorical instead of
numeric values. Amounts were originally stated in Deutsche Mark but were changed to United States
Dollars (USD) for the convenience of the English-speaking study participants (amounts in Deutsche
Mark in 1994, adjusted for inflation, are almost equal to USD today). The processed data set contains
522 samples and is used to train a simple neural network that achieves 67.38% accuracy. Note that
accuracy is not critical in our setting, since we are interested in the quality of the explanations instead.</p>
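        <p>A minimal sketch of this preprocessing and training pipeline is given below. The column names
follow the Kaggle CSV, but the bin edges, the network size, and the name of the label column are
assumptions, since the exact pipeline is not published:</p>
        <preformat>
# Sketch of the preprocessing described above (assumptions flagged inline).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("german_credit_data.csv")           # simplified Kaggle version
df = df.drop(columns=["Checking account"]).dropna()  # drop feature and missing rows

# Encode Duration and Credit amount as categorical bins (bin count is an assumption).
df["Duration"] = pd.cut(df["Duration"], bins=3, labels=["short", "medium", "long"])
df["Credit amount"] = pd.cut(df["Credit amount"], bins=3, labels=["low", "medium", "high"])

X = pd.get_dummies(df.drop(columns=["Risk"]))        # "Risk" label column assumed
y = df["Risk"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
model.fit(X_tr, y_tr)
print(f"test accuracy: {model.score(X_te, y_te):.2%}")
        </preformat>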
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Study Design</title>
        <p>The different explanation types described in Section 3 are evaluated by the participants in our
study. Explanations are generated in the following fashion: Plain: Calculate the most relevant
features with LIME [14] and pick the k most important features. CFs: Calculate CFs for all
samples with the open-source Python library Alibi Explain [21]. Filtering is then applied to
only include CFs that are feasible, e.g. sex does not change or age does not decrease in the CF
instance. Complex CFs: A combination of both previous explanations is handcrafted.</p>
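        <p>The feasibility filter can be sketched as a simple rule check on candidate CF instances (our
reading; the paper names only these two example rules, so any further constraints are assumptions):</p>
        <preformat>
# Sketch: filter candidate CFs (e.g., produced by Alibi Explain) for feasibility.

def is_feasible(original, counterfactual):
    # Rule 1: sex must not change in the CF instance.
    if counterfactual["Sex"] != original["Sex"]:
        return False
    # Rule 2: age must not decrease in the CF instance.
    if counterfactual["Age"] &lt; original["Age"]:
        return False
    return True

sample = {"Sex": "male", "Age": 35, "Duration": "long"}
candidates = [
    {"Sex": "male", "Age": 35, "Duration": "short"},   # feasible
    {"Sex": "female", "Age": 35, "Duration": "long"},  # infeasible: sex changed
    {"Sex": "male", "Age": 30, "Duration": "long"},    # infeasible: age decreased
]
feasible = [cf for cf in candidates if is_feasible(sample, cf)]
print(len(feasible), "feasible CF(s)")  # 1 feasible CF(s)
        </preformat>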
        <p>We use the following data for the study: 9 different explanations (our three types of
explanations, each with one, two or three features included in the explanation) are generated for 12
samples from the entire data set. Each participant is randomly assigned an explanation type
and rates 8 different explanations with differing numbers of features of this explanation type.</p>
        <p>Participants are asked to rate each explanation on the following questions (with the respective
quality dimension in bold). Questions are answered on a 4-point Likert scale from 1 (Definitely
YES) to 4 (Definitely NOT), plus an I don’t know option.</p>
        <p>• Perceived Accuracy: Is the class predicted by the model accurate?
• Understandability: Is the provided explanation understandable?
• Plausibility: Is the provided explanation plausible?
• Sufficiency of Detail: Does the provided explanation have sufficient detail?
• User Satisfaction: Is the provided explanation satisfying?</p>
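        <p>For the analyses below, answers can be coded numerically; the mapping in the following sketch
is our assumption (the paper does not specify its coding), with I don’t know treated as missing:</p>
        <preformat>
# Sketch: numeric coding of the 4-point Likert answers (assumed, not published).
LIKERT = {
    "Definitely YES": 1,
    "Rather YES": 2,
    "Rather NOT": 3,
    "Definitely NOT": 4,
    "I don't know": None,  # treated as missing
}

answers = ["Definitely YES", "Rather NOT", "I don't know"]
scores = [LIKERT[a] for a in answers if LIKERT[a] is not None]
print(scores)  # [1, 3]
        </preformat>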
        <p>Our online study was conducted in early 2024 using Prolific.com and was designed such that
participation should take around 8 minutes (Prolific reported a median completion time of 7 min
and 37 s). Participants received a monetary reward of £1.50, chosen to match the current German
minimum wage of €12.41 per hour. Overall, 280 participants took part in the study. 166 (60.81%)
were male, 102 (37.36%) female, and 5 (1.83%) non-binary. The age of the participants ranged
from 18 to 77 years (M = 34.20, SD = 12.17). 61.17% of participants hold a higher education
degree. When asked whether the task was difficult, 67.77% of participants answered Rather NOT
or Definitely NOT. Participants reported a mean experience with ML of 2.92 (SD = 0.8) on a scale
from 1 (No experience) to 4 (Extensive experience).</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Preliminary Results</title>
        <p>In the following, we present the preliminary results from the analyses we performed
on our study data. We measure the perceived value of explanations in our study using
the five dimensions perceived accuracy, understandability, plausibility, sufficiency of detail, and
satisfaction. Below, we focus in particular on understandability as well as sufficiency of
detail and highlight the results for these dimensions specifically. We chose these two dimensions
as they illustrate the effects of the different notions of complexity on the perceived value very
well. Moreover, we observe very similar patterns for the remaining dimensions.</p>
        <p>On Dimensional Complexity. In Figure 1, we provide the user feedback regarding the
understandability and sufficiency of detail for explanations. The plot shows the percentage of
participants for each of the possible answers. Here, we group explanations according to the
number of features contained in them, which ultimately determines their dimensional complexity.</p>
        <p>
          The results in terms of understandability in Figure 1 (a) show that, despite the increased
dimensional complexity, explanations containing more than one feature are perceived as more
understandable (&gt;80% answered Rather YES or Definitely YES) than those containing a single
feature only (&lt;75% positive answers). Additionally, explanations with three features are slightly
less understandable than those with two features, indicating a sweet spot at two features. Figure
1 (b), on the other hand, shows that participants considered significantly fewer explanations
to be sufficiently detailed: only about 35% for explanations with one feature, 50% with two
features, and 60% with three. Furthermore, participants favor explanations of higher dimensional
complexity w.r.t. the sufficiency of detail: in contrast to the understandability results, explanations
containing three features are evaluated significantly better than those with two features only.
Accordingly, the overall value, and the perceived sufficiency of detail in particular, increases for
explanations of medium or high dimensional complexity. Explanations mentioning a single
feature only have the lowest perceived value among all dimensions. The Kruskal-Wallis test finds
a significant difference between explanations of different dimensional complexity for
understandability (H=11.11, p&lt;0.01) as well as the sufficiency of detail (H=27.19, p&lt;0.01). While these
findings seem to contradict existing works that assumed explanations of low sparsity to be best (e.g.,
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]), they are in line with Keane and Smyth [22], who argued that explanations of moderate
sparsity allow humans to get a better grasp of the concepts underlying the decision process.
        </p>
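        <p>The reported H statistics correspond to a standard Kruskal-Wallis test across the rating groups;
a minimal sketch with scipy (using placeholder ratings, not the study data) looks as follows:</p>
        <preformat>
# Sketch: Kruskal-Wallis test across dimensional-complexity groups.
from scipy.stats import kruskal

ratings_one_feature = [3, 4, 2, 3, 4]      # placeholder understandability ratings
ratings_two_features = [1, 2, 1, 2, 1]
ratings_three_features = [2, 2, 1, 3, 2]

h, p = kruskal(ratings_one_feature, ratings_two_features, ratings_three_features)
print(f"H={h:.2f}, p={p:.4f}")  # the paper reports H=11.11, p&lt;0.01 on the real data
        </preformat>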
        <p>On Structural Complexity. Next, we investigate how structural complexity correlates
with explanation value. Figure 2 contains participants’ feedback aggregated by the type of
explanation. We sort them according to their structural complexity as explained in Section 3.2.</p>
        <p>Figure 2 (a) shows that both plain explanations and CCFs are understandable for about 70%
of the participants. For CFs, even 75% of the users agree that the explanations are
understandable, making explanations of medium structural complexity the most understandable.
Again, the findings are statistically significant for understandability (H=44.92, p&lt;0.01) and
sufficiency of detail (H=10.15, p&lt;0.01). Regarding the perceived sufficiency of detail, we observe
that CCFs are considered sufficiently detailed by more than half of the participants. For plain
explanations and CFs, only about 40% of participants agree that explanations are of sufficient detail.</p>
        <p>[Figure 2: Participants' answers per explanation type (Plain, CF, CCF); (a) understandability,
(b) sufficiency of detail; answer options: Definitely YES, Rather YES, Rather NOT, Definitely NOT,
I don't know.]</p>
        <p>On Explanation Complexity. Finally, we study the impact of overall explanation complexity
on the perceived value of explanations by grouping the feedback for each combination of
explanation type and number of features and sorting these groups by their explanation complexity.</p>
        <p>From Figure 3 (a) we learn that explanations of particularly low and high complexity are the
least understandable (&lt;70% agreement). It furthermore shows that explanations with low to
medium explanation complexity are most understandable (about 75% agreement). Additional
interesting findings are: (1) plain explanations become strictly more understandable with an
increasing number of features, (2) CFs receive similar evaluations regardless of the number of
features, and (3) CCFs become less understandable when adding more features.</p>
        <p>In terms of the sufficiency of detail, Figure 3 (b) confirms the previous findings on explanations
with the lowest and highest complexity: they are also among those considered least detailed (&lt;50%
agreement). In contrast to the findings on understandability, however, we find that explanations
with medium to high complexity are considered best when it comes to the sufficiency of detail.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Outlook</title>
      <p>In this paper, we investigated users’ perception of the value of different types of explanations
and focused in particular on the correlation with explanation complexity. For this purpose, we
conducted a user study to collect feedback from 280 participants. Our preliminary findings
suggest that both dimensional and structural complexity correlate positively with the value of
explanations. In particular, users perceive explanations of either high dimensional or structural
complexity as more detailed and explanations with medium to high dimensional or structural
complexity as better understandable. Finally, our findings regarding the overall complexity
identify a sweet spot of explanation complexity at medium complexity. Such explanations receive
the overall best evaluation by our study participants in all dimensions of an explanation’s value.</p>
      <p>[Figure 3: Participants' answers for each combination of explanation type and number of
features, from (Plain, 1) to (CCF, 3), ordered by explanation complexity (low to high); (a)
Understandability; x-axis: percentage of answers.]</p>
      <p>However, we acknowledge that our current findings are limited to a single task and only three
different types of explanations. Therefore, in future work, we intend to extend our previous
user study to contain additional use cases, tasks, and types of explanations. This will allow us
to also take the domain complexity into account. We also aim to measure the alignment of the
perceived explanation value with metrics of explanation quality commonly used in automated
explanation evaluation. Finally, we seek to formalize our empirical findings and propose a
taxonomy of explanation complexity that contributes to effective explanation design.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially funded by the Deutsche Forschungsgemeinschaft: TRR 318/1 2021 –
438445824 and the Ministry of Culture and Science of North Rhine-Westphalia: NW21-059A SAIL.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[5] G. Vilone, L. Longo, Classification of Explainable Artificial Intelligence Methods through
Their Output Formats, Machine Learning and Knowledge Extraction 3 (2021) 615–661.
[6] S. Wachter, B. Mittelstadt, C. Russell, Counterfactual explanations without opening the
black box: Automated decisions and the GDPR, Harv. JL &amp; Tech. 31 (2017) 841.
[7] I. Stepin, J. M. Alonso, A. Catala, M. Pereira-Fariña, A Survey of Contrastive and
Counterfactual Explanation Generation Methods for Explainable Artificial Intelligence, IEEE
Access 9 (2021) 11974–12001.
[8] V. Singh, K. Cyras, R. Inam, Explainability metrics and properties for counterfactual
explanation methods, in: International Workshop on Explainable, Transparent Autonomous
Agents and Multi-Agent Systems, Springer, 2022, pp. 155–172.
[9] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning,
arXiv preprint arXiv:1702.08608 (2017).
[10] J. Aechtner, L. Cabrera, D. Katwal, P. Onghena, D. P. Valenzuela, A. Wilbik, Comparing
User Perception of Explanations Developed with XAI Methods, in: IEEE International
Conference on Fuzzy Systems, 2022, pp. 1–7.
[11] M. Förster, P. Hühn, M. Klier, K. Kluge, User-centric explainable AI: Design and evaluation
of an approach to generate coherent counterfactual explanations for structured data,
Journal of Decision Systems 32 (2023) 700–731.
[12] X. Wang, M. Yin, Are Explanations Helpful? A Comparative Study of the Effects of
Explanations in AI-Assisted Decision-Making, in: International Conference on Intelligent
User Interfaces, 2021, pp. 318–328.
[13] J. Huysmans, K. Dejaeger, C. Mues, J. Vanthienen, B. Baesens, An empirical evaluation of
the comprehensibility of decision table, tree and rule based predictive models, Decision
Support Systems 51 (2011) 141–154.
[14] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?" Explaining the predictions
of any classifier, in: 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, 2016, pp. 1135–1144.
[15] Y. Jia, J. McDermid, I. Habli, Enhancing the value of counterfactual explanations for deep
learning, in: International Conference on AI in Medicine, Springer, 2021, pp. 389–394.
[16] E. Kenny, W. Huang, The utility of “even if” semifactual explanation to optimise positive
outcomes, Advances in Neural Information Processing Systems 36 (2024).
[17] C. Speier, The influence of information presentation formats on complex task
decisionmaking performance, International Journal of Human-Computer Studies (2006) 1115–1131.
[18] J. Swait, W. Adamowicz, The Influence of Task Complexity on Consumer Choice: A Latent
Class Model of Decision Strategy Switching, Journal of Consumer Research 28 (2001)
135–148.
[19] L. Weber, S. Lapuschkin, A. Binder, W. Samek, Beyond explaining: Opportunities and
challenges of xai-based model improvement, Information Fusion 92 (2023) 154–176.
[20] H. Hofmann, Statlog (German Credit Data), UCI Machine Learning Repository, 1994.
[21] J. Klaise, A. V. Looveren, G. Vacanti, A. Coca, Alibi explain: Algorithms for explaining
machine learning models, Journal of Machine Learning Research 22 (2021) 1–7.
[22] M. T. Keane, B. Smyth, Good counterfactuals and where to find them: A case-based
technique for generating counterfactuals for explainable AI (XAI), in: International Conference
on Case-Based Reasoning Research and Development, Springer, 2020, pp. 163–178.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Düsing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <article-title>Federated learning to improve counterfactual explanations for sepsis treatment prediction</article-title>
          , in: International Conference on AI in Medicine, Springer,
          <year>2023</year>
          , pp.
          <fpage>86</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Noriega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Rivera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <article-title>Machine Learning for Credit Risk Prediction: A Systematic Literature Review</article-title>
          ,
          <source>Data 8</source>
          (
          <year>2023</year>
          )
          <fpage>169</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Speith</surname>
          </string-name>
          ,
          <article-title>A Review of Taxonomies of Explainable Artificial Intelligence (XAI) Methods</article-title>
          ,
          <source>Conference on Fairness, Accountability, and Transparency</source>
          (
          <year>2022</year>
          )
          <fpage>2239</fpage>
          -
          <lpage>2250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Explanation in artificial intelligence: Insights from the social sciences</article-title>
          ,
          <source>Artificial intelligence 267</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>