<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>ACM Recommender Systems Conference, Prague, Czech Republic.
wardatzky@ifi.uzh.ch (K. Wardatzky); inel@ifi.uzh.ch (O. Inel); bernstein@ifi.uzh.ch (A. Bernstein)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Toward Operationalizing a Comprehensive Evaluation Framework for Recommender Systems Explanations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kathrin Wardatzky</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oana Inel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abraham Bernstein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Zurich</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>The evaluation of recommender systems' explanations is an ongoing research challenge with frequent publication of new evaluation approaches and metrics. We noticed differences in which aspects of the explanations are evaluated between research developing new explanation methods or explainable recommender systems (i.e., algorithm-focused studies) and research focusing on the impact of explanations on the users (i.e., user-centered studies). While the algorithmic side focuses on developing new metrics that allow for an offline evaluation and comparison of explanation approaches and their direct output, the user-centered side evaluates formatted explanations that are often simulated and not generated by an explainer. It is rare that an explanation method is evaluated and compared to other approaches from generation to interaction with the user. In this position paper, we argue that a comprehensive evaluation requires a combination of both sides and, therefore, a multidisciplinary effort to develop a feasible approach that captures the strengths and weaknesses of an explanation method. We take the first steps in this direction by employing a theoretical, comprehensive evaluation framework and discussing the challenges that we came across in this process.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable Recommender Systems</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Human-Centered AI</kwd>
        <kwd>Evaluation Metrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recommender systems (RS) are indispensable in our everyday online interactions. They influence what
news articles we read to inform ourselves [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], the media we consume for entertainment [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], what
items we buy [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and what jobs we apply for [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Explanations are often seen as a means to increase
the transparency of RS, help users make better and more informed decisions, and increase their trust
and satisfaction [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. However, there are still no established guidelines or standards for the evaluation
of the explanations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This makes it difficult to decide which explanation approach works best for
a given RS application. Evaluating explanations is further complicated by the fact that
their effect might differ between user types or application domains [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
      <p>
        Miller [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] divided the process of human-to-human explanation into three elements: (1) the cognitive
process in which the causes for an event that is to be explained are identified and selected, (2) the product
of the cognitive process which is the explanation, and (3) the social process of transferring the knowledge
to the explainee — ideally in a way that they understand the causes of the event. Donoso-Guzmán et al.
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] adapted this definition to AI explanations. The identification and selection of the causes that will be
part of the explanation happen in the generation stage. The output, or product, of an explainer is often
a structured abstraction that requires an additional step to design the explanation in the desired format.
The final social process then happens in the communication stage. Thus, multiple points throughout this
process require an evaluation of different properties to ensure that the final product is of high quality.
      </p>
      <p>
        In the context of reading 1,012 papers for a survey on explanations in RS [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we noticed that
many publications that propose new explainable RS or explanation methods mainly evaluate the first
two elements of the explanation process, without considering the final product and communication.
Conversely, the human-computer-interaction-centered papers focus on evaluating the final two elements,
often with simulated recommendations and explanations. We argue that a thorough evaluation requires
a holistic view integrating all four elements: generation, abstraction, formatting, and communication.
To that end, we discuss our efforts to build the infrastructure that allows the systematic evaluation and
comparison of explanation methods for RS whilst considering all four elements. We concentrate our work
on the evaluation of explanations that will ultimately be shown to an RS end-user. As a foundation, we
selected the framework for user-centered evaluation by Donoso-Guzmán et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], as it is built upon
the well-known user experience evaluation framework for RS by Knijnenburg et al. [14] and is, to the
best of our knowledge, the first framework to integrate the dimension of the explanation process. Since
the framework of Donoso-Guzmán et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is intended to evaluate AI explanations in general, in this
paper, we map the suggested explanation properties to the recommendation scenario. Our ultimate goal
is to build a tool that allows us to evaluate and compare the strengths and weaknesses of explanation
methods across multiple dimensions, from explanation generation to the impact on users.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>We now discuss prior work on the evaluation of AI explanations in RS applications and explainable
AI (XAI) evaluation in general. For brevity, we limit this section to related work with a focus on the
evaluation of explanations. We divide the section into practical and theoretical evaluation frameworks.</p>
      <sec id="sec-2-1">
        <title>2.1. Practical Evaluation Frameworks</title>
        <p>
          With the persistent challenge of evaluating AI explanations, several frameworks aiming to compare
different explanation methods have been published. These frameworks range from general XAI applications
[
          <xref ref-type="bibr" rid="ref10">10, 15</xref>
          ] to specific explanation methods (e.g., Agarwal et al. [16] for attribute-based explanations). In
the realm of RS, Coba et al. [17] were, to the best of our knowledge, the first to implement a framework
focusing on explanations and their evaluation. What these frameworks have in common is that they all
focus the evaluation on the generation and abstraction elements of the explanations. One exception we
found is Ariza-Casabona et al. [18], who recently published a comparative evaluation approach for
text-based explanations in RS that also assesses the explanation format.
        </p>
        <p>Despite the availability of these frameworks, we still lack a structured evaluation approach that
includes all four elements of explanations and that can be applied to multiple XAI approaches in RS,
which is why we opted to operationalize a theoretical framework.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Theoretical Evaluation Frameworks</title>
        <p>The challenges of evaluating AI explanations have been pointed out in the literature for many years.
Doshi-Velez and Kim [19] proposed one of the first taxonomies of evaluation approaches and discussed
the issues connected with each approach. Their taxonomy consists of three evaluation approaches:
(1) application-grounded evaluation, where actual users solve the actual task, (2) human-grounded
metrics, where users perform a simplified task, and (3) functionally-grounded evaluation, where a
formal definition of explanation quality is used as a proxy to evaluate the explanations without users.
They point out that the evaluation types inform each other: the functional proxies reflect real-world
performance, and the simplified tasks capture the essence of the actual application setting.</p>
        <p>
          Since then, multiple surveys [20, 21, 22] summarized the evaluation approaches for explanations.
Some of these surveys resulted in theoretical frameworks and guidelines aiming for a more structured
and standardized evaluation approach. Nauta et al. [23] surveyed 361 papers and derived 12 properties
of explanation quality. Then, they categorized the quantitative evaluation methods from these papers
along the properties to provide suitable objective measures to evaluate and compare the aspects of
explanation approaches. In contrast, Donoso-Guzmán et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] surveyed evaluation approaches from
a user experience perspective. They categorized the evaluation approaches of 29 papers along the
user-centered evaluation framework for RS developed by Knijnenburg et al. [14] and the elements
of explanations inspired by Miller [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. To our knowledge, such a theory-grounded framework that
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Explanation Properties and Their Measures</title>
      <p>
        In this section, we go through each explanation component of the framework of Donoso-Guzmán et al.
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], map the explanation properties that are evaluated to the RS application, and, if needed, extend the
component with properties from the explainable RS literature. We structured the section along the five
core elements of the theoretical evaluation framework: the objective system aspects describe the system’s
abilities, the explanation aspects group properties describing the quality of an explanation, the subjective
system aspects include the users’ perception of the objective system aspects, the user experience describes
what the users encounter when interacting with the system, and the interaction aspects describe factors
relating to a possible system adoption [
        <xref ref-type="bibr" rid="ref13">13, 14</xref>
        ]. Additionally, we discuss the importance of considering
the situational and personal characteristics dimensions, which were not analyzed in the framework due to
a lack of information on their impact on the explanation effects. The idea is to evaluate explanation
approaches on all components to determine their strengths and weaknesses, to allow AI practitioners
or researchers to select the appropriate method for their application and carefully consider possible
trade-offs. Figure 1 provides an overview of the explanation properties and their categorization. For
brevity, we refer to Table 1 in Donoso-Guzmán et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for their definitions. Each subsequent section
provides an overview of what is already there and then addresses remaining challenges.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Objective System Aspects</title>
        <p>The evaluation of objective system aspects is centered around the explanation generation and abstraction
stages and consists of six properties: AI model performance, AI model certainty, XAI certainty, continuity,
separability, and consistency. The RS performance and certainty can impact the explanation quality and
perception of the overall system [24]. Therefore, the ranking or rating capabilities of the RS should be
reported along with the model certainty as part of a comprehensive evaluation of the explanations. There
are plenty of evaluation frameworks for RS that offer a variety of metrics to evaluate the recommendation
capabilities (to start with, see the list provided by the RecSys conference:
https://github.com/ACMRecSys/recsys-evaluation-frameworks). On the explanation method side, the certainty
property can be evaluated with a checkbox stating whether (un-)certainty values are present in the explanation.</p>
        <p>Adapting metrics from classification to local explanations in top-k recommendations. A
series of metrics has been proposed to evaluate the remaining explanation properties (consistency,
continuity, and separability) that cover objective system aspects. Both the OpenXAI [16] and Quantus
[15] frameworks, for example, compiled a list of consistency and continuity metrics for classification
approaches. When applying these consistency and continuity metrics to a top-k recommendation
problem with local explanations, however, the evaluation time increases significantly. In addition, many
metrics in the classification domain are designed for post-hoc, feature-importance explanation methods.
It is difficult to implement measures that would allow a comparison of different explanation methods on these
properties. One could argue that explanations derived from intrinsically explainable models should be
inherently consistent with high degrees of continuity and separability. Having the evaluation results
for intrinsically explainable models and being able to compare both approaches could provide a better
intuition of how good, e.g., the continuity of post-hoc explainers is.</p>
        <p>
          Need for additional evaluation properties. In addition to the explanation properties mentioned
by Donoso-Guzmán et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], we suggest evaluating the coverage of the explainer, as in the number of
recommendations that can(not) be explained. This property differs from the completeness property, as
the latter assumes an explanation exists. Depending on the definition of explainable that is implemented
in the model, it might not be possible to explain each recommendation of a top-k list. An example
would be the Novel and Explainable Matrix Factorization model [25], which follows the intuition that
an item is explainable if a given number of users in the neighborhood of a target user have also interacted with the
item. Otherwise, no explanation can be provided. The authors suggest evaluating this with a metric
derived from information retrieval, Explainable NDCG, which not only measures how many items in a
top-k recommendation list can be explained but also their position in the ranking and the ratings of
the nearest neighbors. The metric, however, only works for collaborative filtering explanations and
requires explicit rating data. A measure that is independent of the explanation approach would be to
simply report the average ratio of items without explanation in a top-k recommendation list.
        </p>
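        <p>The approach-independent measure described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not part of the framework: the function names, the dictionary-based data layout, and the toy data are hypothetical and chosen for this example.</p>
        <preformat>
```python
from statistics import mean


def unexplained_ratio(top_k_items, explanations):
    """Share of items in one top-k list for which no explanation is available."""
    if not top_k_items:
        raise ValueError("top-k list must not be empty")
    # An item counts as unexplained if it has no entry or an empty entry.
    missing = sum(1 for item in top_k_items if not explanations.get(item))
    return missing / len(top_k_items)


def average_unexplained_ratio(recommendations, explanations_per_user):
    """Average the per-list ratio over all users that received a top-k list."""
    return mean(
        unexplained_ratio(items, explanations_per_user.get(user, {}))
        for user, items in recommendations.items()
    )


# Hypothetical toy data: two users with top-4 lists (assumption for illustration).
recommendations = {"u1": ["a", "b", "c", "d"], "u2": ["e", "f", "g", "h"]}
explanations_per_user = {
    "u1": {"a": "neighbors interacted with a", "b": "neighbors interacted with b"},
    "u2": {"e": "path via genre", "f": "path via actor", "g": "path via director"},
}

ratio = average_unexplained_ratio(recommendations, explanations_per_user)
print(ratio)
```
        </preformat>
        <p>Because the sketch only checks whether an explanation exists, it could in principle be applied to any explainer type, at the cost of ignoring the ranking position that Explainable NDCG takes into account.</p>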
        <p>In summary, the main challenge of evaluating objective system aspects is the adaptation of existing
metrics from a classification to a recommendation problem.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Explanation Aspects</title>
        <p>The explanation aspects consist of eight properties in the literature: necessity, sufficiency, correctness,
completeness, contrastivity, size, structure, and representativeness. All of them are evaluated in the
abstraction stage. Additionally, correctness offers evaluation opportunities in the generation and format
stages, completeness in the format stage, and structure in the communication stage.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Adapting metrics from classification to local explanations in top-k recommendations</title>
        <p>Similar to the challenges mentioned in Section 3.1, the majority of the metrics that are used to evaluate necessity,
sufficiency, correctness, completeness, and contrastivity were developed with a classification problem
in mind. They are often specific to post-hoc explainers, and one can argue whether these properties
should indeed be evaluated for intrinsically explainable models.</p>
        <p>
          Modifying explanation abstraction. In our implementation efforts, we currently treat the structure
of an explanation at the abstraction level as a parameter that determines how the explanation will
be formatted and presented at the communication level. One could follow, for example, the data
visualization guidelines from Chatti et al. [26] to determine a suitable format. A remaining challenge is
that there are still too many options that are suitable for most explainer types, even when applying such
guidelines. More research is needed to determine the best way to present the explanations for different
user types or in different contexts (e.g., application domains, mobile vs. desktop screen) [
          <xref ref-type="bibr" rid="ref11 ref13">13, 11</xref>
          ].
        </p>
        <p>
          Dependency of explanation properties on the RS. The size of the explanation can be dependent on
the RS (e.g., an extracted path of the random walk recommender RP3 [27] will always consist of three
hops). Alternatively, a parameter can determine how many aspects or features should be extracted in
the generation process. In all other cases, the average number of elements in an explanation should be
reported along with the standard deviation to get an impression of the explanation size.
        </p>
        <p>
          Additional evaluation properties. On the RS side, there are indications that users prefer to see
diverse explanations that mention popular features and that the explanations are based on the target
user’s recent interaction history [28]. The corresponding metrics have been mostly applied to assess
path-based explanations derived from knowledge graphs [29] or feature-based explanations [18, 30],
but could be adapted to evaluate the abstraction level of other explanation types. In these use cases, the
evaluated notion of diversity is often related to the type of nodes and relations that are mentioned in a
path-based explanation, or the diversity in the mentioned item features in feature-based explanations.
Future work should investigate whether this operationalization of diversity in explanations should be
extended to other notions. For brevity, we refer to Vrijenhoek et al. [31] and Heitz et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] for the
conceptualization and operationalization of different notions of recommendation diversity.
        </p>
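        <p>Reporting the explanation size as described above, i.e., the average number of elements together with the standard deviation, could look like the following minimal sketch. Representing an explanation as a list of elements, the function name, and the use of the sample (rather than population) standard deviation are assumptions made for this illustration.</p>
        <preformat>
```python
from statistics import mean, stdev


def explanation_size_stats(explanations):
    """Return (mean size, sample standard deviation) over a set of explanations.

    Each explanation is assumed to be a list of its elements
    (e.g., features, aspects, or path hops).
    """
    if not explanations:
        raise ValueError("at least one explanation is required")
    sizes = [len(explanation) for explanation in explanations]
    # The sample standard deviation is undefined for a single observation.
    spread = stdev(sizes) if len(sizes) > 1 else 0.0
    return mean(sizes), spread


# Hypothetical feature-based explanations (toy data, assumption for illustration).
explanations = [
    ["genre", "actor"],
    ["genre", "director", "year", "actor"],
    ["genre", "actor", "year"],
]

avg_size, sd_size = explanation_size_stats(explanations)
print(avg_size, sd_size)
```
        </preformat>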
        <p>In summary, the evaluation of explanation aspects faces additional challenges besides the metric
adaptation (Section 3.1), such as (1) many properties lack the option to compare intrinsically explainable
methods with post-hoc approaches, and (2) there are too many options in which an abstract explanation
can be formatted. Evaluating all of them, particularly with a user study, is not feasible.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.3. Subjective System Aspects</title>
        <p>The seven properties that fall into the subjective system aspects (explanation power, form of cognitive
chunks, information expectedness, perceived model competence, cognitive load, relevance to the tasks,
and alignment with situational context) are used to evaluate a user’s perception of the explanations.
Therefore, all of them are evaluated at the communication stage with users.</p>
        <p>
          Metrics specific to explanation type. To limit the number of dimensions that need to be evaluated
with a user study, some metrics have been proposed to evaluate the explanation power, form of cognitive
chunks, information expectedness, and relevance to the tasks at the abstraction or format stage. These
metrics are either specific to an explanation type, e.g., pragmatism to evaluate the relevance to the task
property for counterfactual explanations [23], or the metrics require a ground truth to be compared
with, e.g., BLEU and METEOR scores [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], to evaluate the form of cognitive chunks.
        </p>
        <p>
          Ground truth proxy. In the RS domain, user reviews are often used as a proxy for a ground
truth for text explanations (see, e.g., Ariza-Casabona et al. [18]). Aside from only being applicable to RS
that leverage user reviews in their recommendation process and present explanations in natural
language, this practice raises other challenges related to the nature of user reviews. Reviews are often biased
towards very good or very bad opinions, as users rarely leave a review if the product or experience met
but did not exceed their expectations [32]. Additionally, certain user types are
more likely to leave a review than others [33].
        </p>
        <p>The main challenge of evaluating the subjective system aspects is a continuation of the issues pointed
out in Section 3.2. The proposed metrics to reduce the number of dimensions that would need to be
evaluated with a user study are currently not sufficient.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.4. User Experience and Interaction</title>
        <p>All explanation properties in the user experience and interaction are evaluated at the communication
stage with users.</p>
      </sec>
      <sec id="sec-3-6">
        <title>Alignment of evaluation goal with measurement</title>
        <p>While the interaction properties (reliance, performance, and efficiency) can be captured with implicit measures, such as the number of clicks,
time spent in the application, and whether the item that the user selected aligns with a prior goal (e.g.,
selecting healthier meals), the user experience properties often require the users to fill in questionnaires.
These questionnaires need to be carefully crafted and consider the notion of the properties that should
be evaluated. Taking the understanding property as an example, the results might differ if the user is
asked to rate how much a presented explanation increases their understanding of the system, compared
to asking questions that capture their understanding before and after seeing the explanations. In these
cases, leveraging insights and instruments from social science would benefit the design of the user
studies instead of having the user rate how well an explanation is perceived in each property.</p>
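        <p>The implicit interaction measures mentioned above (number of clicks, time spent in the application, and alignment of the selected item with a prior goal) can be aggregated from session logs. The following is a minimal sketch only: the Session record, its field names, and the toy sessions are hypothetical and do not describe an existing logging format.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass
class Session:
    """One logged user session (hypothetical structure for illustration)."""
    clicks: int
    seconds_in_app: float
    selected_item_matches_goal: bool


def summarize_interaction(sessions):
    """Aggregate implicit measures over all logged sessions."""
    n = len(sessions)
    if n == 0:
        raise ValueError("at least one session is required")
    return {
        "mean_clicks": sum(s.clicks for s in sessions) / n,
        "mean_seconds": sum(s.seconds_in_app for s in sessions) / n,
        # Booleans sum as 0/1, giving the share of goal-aligned selections.
        "goal_alignment_rate": sum(s.selected_item_matches_goal for s in sessions) / n,
    }


# Hypothetical toy sessions (assumption for illustration).
sessions = [
    Session(3, 40.0, True),
    Session(5, 80.0, False),
    Session(4, 60.0, True),
    Session(4, 60.0, True),
]

summary = summarize_interaction(sessions)
print(summary)
```
        </preformat>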
        <p>The main challenge with the user evaluation of the communication stage is to balance the coverage
of properties that should be evaluated and the number of dimensions in the user study.</p>
      </sec>
      <sec id="sec-3-7">
        <title>3.5. Situational and Personal Characteristics</title>
        <p>
          The situational and personal characteristics were not specifically integrated in the framework of
Donoso-Guzmán et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] due to a lack of information on how these characteristics impact the explanation
properties.
        </p>
        <p>
          No general knowledge about impact. The impact of individual personal and situational
characteristics on explanation properties concerning the user experience and interaction has been explored for
specific settings (e.g., [34, 35, 36, 37]), but it is challenging to derive general conclusions from them [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          Lack of diverse participants. An additional issue is that the majority of the user evaluations
employ participants from WEIRD (western, educated, industrialized, rich, democratic) societies, while
evaluating explanations in application domains targeting a diverse range of user types [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Prior work
has shown that users with different cultural backgrounds have different preferences when it comes to
interface design [38], which suggests that the preferences, requirements, and needs might also differ for
the format and communication elements of RS explanations.
        </p>
        <p>Collaborations with researchers who investigate the information needs and preferences across
different personalities and cultural backgrounds would be a great starting point to further investigate the
impact of user and situational characteristics that need to be taken into consideration when evaluating
RS explanations.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper, we argue that evaluating RS explanations requires a holistic approach that integrates
algorithmic and user-centered aspects to cover all four stages of explainability. To address this need, we
explored an extension of a theoretical framework to cover all elements. Specifically, we discussed the
challenges that we are facing while operationalizing this framework, which is intended for the evaluation of AI explanations,
for the large-scale evaluation and comparison of RS explanation methods. The main issues that we are
facing are related to (1) adapting metrics from a classification problem, (2) finding measures that
allow a comparison of intrinsically explainable RS with post-hoc explanation methods at explanation
generation and abstraction, and (3) limiting the dimensions of user evaluation at the communication
level. Nonetheless, we see the adoption of such a framework into the evaluation practices as a promising
way to thoroughly evaluate and compare RS explanations, allowing for systematic investigation of open
research gaps, such as the impact of situational and personal characteristics on explanation properties.
Future work should look into whether the adoption of a multi-faceted evaluation framework for RS, such
as FEVR [39], which covers the evaluation objectives and design space, can help limit the evaluation
dimensions while preserving comprehensiveness. We further plan to release a first version of the
framework, which extends the Cornac framework [40, 41, 42] with explanation methods and evaluation
metrics. Ultimately, we believe that the use of a holistic view on RS explanation evaluations will lead to
a better understanding of the impact of various RS explanations, helping both practitioners to build
better systems and scientists to be more systematic when exploring novel approaches.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly for grammar and spelling
checks. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.
evaluation framework for explainable ai, in: L. Longo (Ed.), Explainable Artificial Intelligence,
Springer Nature Switzerland, Cham, 2023, pp. 183–204.
[14] B. P. Knijnenburg, M. C. Willemsen, Z. Gantner, H. Soncu, C. Newell, Explaining the user experience
of recommender systems, User Modeling and User-Adapted Interaction 22 (2012) 441–504. URL:
https://doi.org/10.1007/s11257-011-9118-4. doi:10.1007/s11257-011-9118-4.
[15] A. Hedström, L. Weber, D. Krakowczyk, D. Bareeva, F. Motzkus, W. Samek, S. Lapuschkin,
M. M. M. Höhne, Quantus: An explainable ai toolkit for responsible evaluation of neural
network explanations and beyond, Journal of Machine Learning Research 24 (2023) 1–11. URL:
http://jmlr.org/papers/v24/22-0142.html.
[16] C. Agarwal, S. Krishna, E. Saxena, M. Pawelczyk, N. Johnson, I. Puri, M. Zitnik, H. Lakkaraju,
OpenXAI: Towards a transparent evaluation of model explanations, in: Thirty-sixth Conference
on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL:
https://openreview.net/forum?id=MU2495w47rz.
[17] L. Coba, R. Confalonieri, M. Zanker, Recoxplainer: A library for development and offline evaluation
of explainable recommender systems, IEEE Computational Intelligence Magazine 17 (2022) 46–58.
doi:10.1109/MCI.2021.3129958.
[18] A. Ariza-Casabona, L. Boratto, M. Salamó, A comparative analysis of text-based explainable
recommender systems, in: Proceedings of the 18th ACM Conference on Recommender Systems,
RecSys ’24, Association for Computing Machinery, New York, NY, USA, 2024, p. 105–115. URL:
https://doi.org/10.1145/3640457.3688069. doi:10.1145/3640457.3688069.
[19] F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning, 2017. URL:
https://arxiv.org/abs/1702.08608. arXiv:1702.08608.
[20] D. V. Carvalho, E. M. Pereira, J. S. Cardoso, Machine learning interpretability: A survey on methods
and metrics, Electronics 8 (2019). URL: https://www.mdpi.com/2079-9292/8/8/832.
doi:10.3390/electronics8080832.
[21] G. Vilone, L. Longo, Notions of explainability and evaluation approaches for explainable artificial
intelligence, Information Fusion 76 (2021) 89–106. URL:
https://www.sciencedirect.com/science/article/pii/S1566253521001093. doi:10.1016/j.inffus.2021.05.009.
[22] F. Bodria, F. Giannotti, R. Guidotti, F. Naretto, D. Pedreschi, S. Rinzivillo, Benchmarking
and survey of explanation methods for black box models, Data Mining and Knowledge
Discovery 37 (2023) 1719–1778. URL: https://doi.org/10.1007/s10618-023-00933-9.
doi:10.1007/s10618-023-00933-9.
[23] M. Nauta, J. Trienes, S. Pathak, E. Nguyen, M. Peters, Y. Schmitt, J. Schlötterer, M. van Keulen,
C. Seifert, From anecdotal evidence to quantitative evaluation methods: A systematic review on
evaluating explainable ai, ACM Comput. Surv. 55 (2023). URL: https://doi.org/10.1145/3583558.
doi:10.1145/3583558.
[24] A. R. Mohammadi, A. Peintner, M. Müller, E. Zangerle, Are we explaining the same recommenders?
incorporating recommender performance for evaluating explainers, in: Proceedings of the 18th
ACM Conference on Recommender Systems, RecSys ’24, Association for Computing Machinery,
New York, NY, USA, 2024, p. 1113–1118. URL: https://doi.org/10.1145/3640457.3691709.
doi:10.1145/3640457.3691709.
[25] L. Coba, P. Symeonidis, M. Zanker, Personalised novel and explainable matrix factorisation, Data &amp;
Knowledge Engineering 122 (2019) 142–158. URL:
https://www.sciencedirect.com/science/article/pii/S0169023X1830332X. doi:10.1016/j.datak.2019.06.003.
[26] M. A. Chatti, M. Guesmi, A. Muslim, Visualization for recommendation explainability: A survey and
new perspectives, ACM Trans. Interact. Intell. Syst. 14 (2024). URL: https://doi.org/10.1145/3672276.
doi:10.1145/3672276.
[27] B. Paudel, F. Christoffel, C. Newell, A. Bernstein, Updatable, accurate, diverse, and scalable
recommendations for interactive applications, ACM Trans. Interact. Intell. Syst. 7 (2016). URL:
https://doi.org/10.1145/2955101. doi:10.1145/2955101.
[28] G. Balloccu, L. Boratto, G. Fenu, M. Marras, Post processing recommender systems with
knowledge graphs for recency, popularity, and diversity of explanations, in: Proceedings of the 45th
International ACM SIGIR Conference on Research and Development in Information Retrieval,
SIGIR ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 646–656. URL:
https://doi.org/10.1145/3477495.3532041. doi:10.1145/3477495.3532041.
[29] G. Balloccu, L. Boratto, G. Fenu, M. Marras, Reinforcement recommendation reasoning through
knowledge graphs for explanation path quality, Knowledge-Based Systems 260 (2023) 110098.
URL: https://www.sciencedirect.com/science/article/pii/S0950705122011947.
doi:10.1016/j.knosys.2022.110098.
[30] L. Li, Y. Zhang, L. Chen, Generate neural template explanations for recommendation, in:
Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management,
CIKM ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 755–764. URL:
https://doi.org/10.1145/3340531.3411992. doi:10.1145/3340531.3411992.
[31] S. Vrijenhoek, S. Daniil, J. Sandel, L. Hollink, Diversity of what? on the different conceptualizations
of diversity in recommender systems, in: Proceedings of the 2024 ACM Conference on Fairness,
Accountability, and Transparency, FAccT ’24, Association for Computing Machinery, New York,
NY, USA, 2024, p. 573–584. URL: https://doi.org/10.1145/3630106.3658926.
doi:10.1145/3630106.3658926.
[32] N. Hu, P. A. Pavlou, J. Zhang, On self-selection biases in online product reviews, MIS Quarterly 41
(2017) 449–475. URL: https://www.jstor.org/stable/26629722.
[33] C. K. Manner, W. C. Lane, Who posts online customer reviews? the role of sociodemographics and
personality traits, Journal of Consumer Satisfaction, Dissatisfaction and Complaining Behavior 30
(2017) 19–42. URL: https://www.jcsdcb.com/index.php/JCSDCB/article/view/226.
[34] M. Guesmi, M. A. Chatti, L. Vorgerd, T. Ngo, S. Joarder, Q. U. Ain, A. Muslim, Explaining user
models with different levels of detail for transparent recommendation: A user study, in: Adjunct
Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization,
UMAP ’22 Adjunct, ACM, New York, NY, USA, 2022, p. 175–183. URL:
https://doi.org/10.1145/3511047.3537685. doi:10.1145/3511047.3537685.
[35] D. Wilkinson, Ö. Alkan, Q. V. Liao, M. Mattetti, I. Vejsbjerg, B. P. Knijnenburg, E. Daly, Why
or why not? the effect of justification styles on chatbot recommendations, ACM Transactions
on Information Systems 39 (2021) 1–21. URL: https://doi.org/10.1145/3441715.
doi:10.1145/3441715.
[36] D. C. Hernandez-Bocanegra, J. Ziegler, Explaining review-based recommendations: Effects of
profile transparency, presentation style and user characteristics, i-com 19 (2020) 181–200. URL:
https://doi.org/10.1515/icom-2020-0021. doi:10.1515/icom-2020-0021.
[37] S. Berkovsky, R. Taib, Y. Hijikata, P. Braslavski, B. Knijnenburg, A cross-cultural analysis of trust
in recommender systems, in: Proceedings of the 26th Conference on User Modeling, Adaptation
and Personalization, UMAP ’18, ACM, New York, NY, USA, 2018, p. 285–289. URL:
https://doi.org/10.1145/3209219.3209251. doi:10.1145/3209219.3209251.
[38] K. Reinecke, A. Bernstein, Improving performance, perceived usability, and aesthetics with
culturally adaptive user interfaces, ACM Transactions on Computer-Human Interaction 18 (2011)
1–29. URL: https://dl.acm.org/doi/10.1145/1970378.1970382. doi:10.1145/1970378.1970382.
[39] E. Zangerle, C. Bauer, Evaluating recommender systems: Survey and framework, ACM Comput.</p>
      <p>Surv. 55 (2022). URL: https://doi.org/10.1145/3556536. doi:10.1145/3556536.
[40] A. Salah, Q.-T. Truong, H. W. Lauw, Cornac: A comparative framework for multimodal
recommender systems, Journal of Machine Learning Research 21 (2020) 1–5.
[41] Q.-T. Truong, A. Salah, T.-B. Tran, J. Guo, H. W. Lauw, Exploring cross-modality utilization in
recommender systems, IEEE Internet Computing (2021).
[42] Q.-T. Truong, A. Salah, H. Lauw, Multi-modal recommender systems: Hands-on exploration, in:
Fifteenth ACM Conference on Recommender Systems, 2021, pp. 834–837.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Raza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <article-title>News recommender system: a review of recent progress, challenges, and opportunities</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>55</volume>
          (
          <year>2022</year>
          )
          <fpage>749</fpage>
          -
          <lpage>800</lpage>
          . URL: https://link.springer.com/10.1007/s10462-021-10043-x. doi:10.1007/s10462-021-10043-x.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Heitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Lischka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Abdullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Laugwitz</surname>
          </string-name>
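          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Meyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,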
          <article-title>Deliberative diversity for news recommendations: Operationalization and experimental user study</article-title>
          ,
          in:
          <source>Proceedings of the 17th ACM Conference on Recommender Systems</source>
          , RecSys '23, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>813</fpage>
          -
          <lpage>819</lpage>
          . URL: https://doi.org/10.1145/3604915.3608834. doi:10.1145/3604915.3608834.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lamkhede</surname>
          </string-name>
          ,
          <article-title>Augmenting netflix search with in-session adapted recommendations</article-title>
          ,
          in:
          <source>Proceedings of the 16th ACM Conference on Recommender Systems</source>
          , RecSys '22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>542</fpage>
          -
          <lpage>545</lpage>
          . URL: https://doi.org/10.1145/3523227.3547407. doi:10.1145/3523227.3547407.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bontempelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chapus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rigaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morlon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lorant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Salha-Galvan</surname>
          </string-name>
          ,
          <article-title>Flow moods: Recommending music by moods on deezer</article-title>
          ,
          in:
          <source>Proceedings of the 16th ACM Conference on Recommender Systems</source>
          , RecSys '22, Association for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>452</fpage>
          -
          <lpage>455</lpage>
          . URL: https://doi.org/10.1145/3523227.3547378. doi:10.1145/3523227.3547378.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>An industrial framework for personalized serendipitous recommendation in e-commerce</article-title>
          ,
          in:
          <source>Proceedings of the 17th ACM Conference on Recommender Systems</source>
          , RecSys '23, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , p.
          <fpage>1015</fpage>
          -
          <lpage>1018</lpage>
          . URL: https://doi.org/10.1145/3604915.3610234. doi:10.1145/3604915.3610234.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dhameliya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Desai</surname>
          </string-name>
          ,
          <article-title>Job recommender systems: A survey</article-title>
          , in:
          <source>2019 Innovations in Power and Advanced Computing Technologies (i-PACT)</source>
          , volume
          <volume>1</volume>
          , IEEE, Piscataway, New Jersey, USA,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1109/i-PACT44901.2019.8960231.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tintarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Masthoff</surname>
          </string-name>
          ,
          <article-title>A Survey of Explanations in Recommender Systems</article-title>
          , in:
          <source>2007 IEEE 23rd International Conference on Data Engineering Workshop</source>
          , IEEE, Istanbul, Turkey,
          <year>2007</year>
          , pp.
          <fpage>801</fpage>
          -
          <lpage>810</lpage>
          . URL: http://ieeexplore.ieee.org/document/4401070/. doi:10.1109/ICDEW.2007.4401070.
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Nunes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <article-title>A systematic review and taxonomy of explanations in decision support and recommender systems</article-title>
          ,
          <source>User Modeling and User-Adapted Interaction</source>
          <volume>27</volume>
          (
          <year>2017</year>
          )
          <fpage>393</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Explainable recommendation: A survey and new perspectives</article-title>
          ,
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Arya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K. E.</given-names>
            <surname>Bellamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dhurandhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hoffman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Houde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Luss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mojsilovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mourad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pedemonte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raghavendra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Richards</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sattigeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shanmugam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Varshney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>One explanation does not fit all: A toolkit and taxonomy of AI explainability techniques</article-title>
          ,
          <year>2019</year>
          . URL: http://arxiv.org/abs/1909.03012. arXiv:1909.03012.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Wardatzky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Inel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rossetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <article-title>Whom do explanations serve? a systematic literature survey of user characteristics in explainable recommender systems evaluation</article-title>
          ,
          <source>ACM Trans. Recomm. Syst</source>
          .
          <volume>3</volume>
          (
          <year>2025</year>
          ). URL: https://doi.org/10.1145/3716394. doi:10.1145/3716394.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Explanation in artificial intelligence: Insights from the social sciences</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>267</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0004370218305988. doi:10.1016/j.artint.2018.07.007.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Donoso-Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ooge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Verbert</surname>
          </string-name>
          ,
          <article-title>Towards a comprehensive human-centred</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>