<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>From latent factors to language: a user study on LLM-generated explanations for an inherently interpretable matrix-based recommender system</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maxime Manderlier</string-name>
          <email>maxime.manderlier@umons.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabian Lecron</string-name>
          <email>fabian.lecron@umons.ac.be</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Vu Thanh</string-name>
          <email>olivier.vuthanh@umons.ac.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Gillis</string-name>
          <email>nicolas.gillis@umons.ac.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Operational Research, Faculty of Engineering, University of Mons (UMONS)</institution>
          ,
          <addr-line>Mons</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Technological Innovation Management, Faculty of Engineering, University of Mons (UMONS)</institution>
          ,
          <addr-line>Mons</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We investigate whether large language models (LLMs) can generate effective, user-facing explanations from a mathematically interpretable recommendation model. The model is based on constrained matrix factorization, where user types are explicitly represented and predicted item scores share the same scale as observed ratings, making the model's internal representations and predicted scores directly interpretable. This structure is translated into natural language explanations using carefully designed LLM prompts. Many works in explainable AI rely on automatic evaluation metrics, which often fail to capture users' actual needs and perceptions. In contrast, we adopt a user-centered approach: we conduct a study with 326 participants who assessed the quality of the explanations across five key dimensions (transparency, effectiveness, persuasion, trust, and satisfaction), as well as the recommendations themselves. To evaluate how different explanation strategies are perceived, we generate multiple explanation types from the same underlying model, varying the input information provided to the LLM. Our analysis reveals that all explanation types are generally well received, with moderate statistical differences between strategies. User comments further underscore how participants react to each type of explanation, offering complementary insights beyond the quantitative results.</p>
      </abstract>
      <kwd-group>
        <kwd>explainable recommendations</kwd>
        <kwd>collaborative filtering</kwd>
        <kwd>matrix factorization</kwd>
        <kwd>large language models</kwd>
        <kwd>user study</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recommender systems have become essential tools for helping users navigate large catalogs of content.
However, the algorithms powering these systems are often opaque, making it difficult for users to
understand or trust their suggestions. This has motivated a growing body of work in explainable
recommendation, which seeks to provide users with reasons for each recommendation. While many existing
approaches generate explanations post hoc—often disconnected from the underlying model—another
line of research focuses on designing inherently interpretable models.</p>
      <p>In this work, we build on a matrix factorization model that is mathematically interpretable by design:
user types are explicitly represented, and predicted scores are constrained to remain within the same
range as observed ratings. This allows internal representations to be meaningfully interpreted in terms
of user preferences. The challenge, however, lies in translating these internal representations into
user-facing explanations.</p>
      <p>Large language models (LLMs) offer a natural interface for generating such explanations. Given
carefully designed prompts, they are capable of reasoning over structured information and expressing it
in fluent natural language. This raises an important question: can LLMs successfully act as explanation
generators for recommendation models whose internals are interpretable?</p>
      <p>This paper investigates that question through three main objectives:
1. Evaluating an interpretable recommendation model. We assess whether a mathematically
interpretable model can satisfy users in terms of the recommendations it produces, even without
additional explanation layers.
2. Generating explanations using LLMs. We examine whether LLMs can leverage the model’s
internal structure to generate effective and coherent explanations.
3. Comparing explanation strategies. We contrast explanations grounded in model internals
with alternatives based on external information (e.g., user history), to assess the trade-offs between
transparency and other user-centered goals.</p>
      <p>These questions are addressed through a user study in which participants evaluate the
recommendations and explanations they receive across multiple dimensions. Our findings highlight the potential of
LLMs to act as a bridge between model interpretability and user-facing explanation, and offer guidance
for designing explanation strategies that are both faithful and efective. All materials, including data
preparation, prompts, and statistical analysis code, are available in a public repository to support
transparency and reproducibility.1</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Early research on explanations in recommender systems emphasized the importance of transparency and
user-centered design. Herlocker et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] conducted one of the first empirical studies on collaborative
filtering (CF) explanation interfaces. They explored both white-box and black-box strategies and found
that simple and transparent justifications—such as rating histograms and past accuracy—were more
persuasive and trustworthy than abstract ones. Tintarev and Masthoff [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed a comprehensive
framework for analyzing explanations, identifying seven core aims: transparency, scrutability, trust,
effectiveness, persuasiveness, efficiency, and satisfaction. They highlighted the trade-offs between these
objectives and the impact of explanation format and interface design.
      </p>
      <p>
        Beyond system transparency, trust has emerged as a central but complex evaluation axis. Rong et
al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] conducted a meta-analysis of 97 XAI user studies and showed that explanations generally improve
subjective understanding and, to some extent, collaboration. However, their effects on trust and usability
remain inconsistent. More targeted studies in recommender systems revealed similar complexity. For
example, Ooge et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] found that providing explanations—regardless of type—increased adolescents’
trust in educational recommendations. Kunkel et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] observed that personalized explanations boosted
trust more effectively than impersonal ones. Millecamp et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] further revealed that user traits (e.g.,
personality or domain expertise) moderated trust responses. Liao and Sundar [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] highlighted framing
effects, while Bucinca et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] questioned the validity of proxy tasks for measuring trust, arguing for
more holistic evaluations.
      </p>
      <p>
        The emergence of large language models (LLMs) has significantly transformed explanation generation
in recommender systems. Shi et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed LLM-SRR, which enriches user reviews via LLMs and
integrates them into knowledge graphs to identify explanatory paths. These are then translated into
natural language, improving semantic faithfulness. Wang et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] introduced LR-Recsys, a system
using LLM-generated contrastive explanations embedded into DNN recommenders, which significantly
improved performance. Gao et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] developed Chat-REC, a conversational recommender using
structured prompts with LLMs to generate both recommendations and explanations.
      </p>
      <p>
        Some works have directly evaluated user responses to LLM-generated explanations. Feng et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
through a user study, found that contextualized LLM explanations—those incorporating user
history—improved users’ intent to act and better met their cognitive needs compared to generic ones. Silva
et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] compared personalized and generic explanations, showing that their effectiveness varies
depending on the familiarity of the recommended item.
      </p>
      <p>
        Several surveys now map the landscape of LLM-based explainable recommender systems. Zhao et
al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] categorized approaches by training paradigm (pre-training, fine-tuning, prompting) and noted
      </p>
      <sec id="sec-2-1">
        <title>1https://github.com/MaximeUM/interpretable-mf-llm-explanations</title>
        <p>
          the growing use of CoT prompting for nuanced reasoning. Lin et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] proposed a 2D taxonomy
(“where” and “how” to adapt LLMs), while Chen [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and Vats et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] focused on explanation quality,
personalization, and fairness. Across these surveys, a common theme is the lack of user-centered
evaluation and the risk of hallucination in LLM explanations.
        </p>
        <p>
          At the model level, several frameworks have pushed the boundaries of LLM alignment. Lei et
al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] introduced RecExplainer, aligning LLMs with black-box recommenders through behavioral
and intention-based strategies. Ma et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] presented XRec, which uses collaborative embeddings
from LightGCN and integrates them into LLMs via Mixture-of-Experts adapters. Bismay et al. [20]
proposed ReasoningRec, which uses LLM-generated synthetic explanations to fine-tune a smaller LLM,
improving both recommendation accuracy and explanation quality. Luo et al. [21] developed LLMXRec,
a two-stage approach using instruction-tuned LLMs for post-hoc explanation generation. Zhao et
al. [22] introduced LANE, which uses Chain-of-Thought prompting and attention alignment to generate
logical and transparent justifications.
        </p>
        <p>Other notable contributions include consequence-based explanations [23], which highlight the
potential impact or outcomes of accepting a recommendation, and modular agent-based frameworks [24],
which structure recommendation reasoning across profile, memory, planning, and action modules.</p>
        <p>Despite these advances, most LLM-based methods focus on post-hoc explanations for black-box
models, leaving a gap between formal interpretability and natural language justification. Our work
addresses this gap by generating natural explanations from a matrix factorization model that is
interpretable by design. To our knowledge, it is one of the first to bridge mathematical transparency and
natural justification in a unified system.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Recommender system</title>
        <p>For the recommendation algorithm, we use BSSMF (Bounded simplex-structured matrix factorization)
[25, 26], which allows us to generate recommendations that are mathematically interpretable. This
model decomposes the user-item interaction matrix X ∈ R^(m×n), where m is the number of items and n
the number of users, into two smaller matrices, W ∈ R^(m×r) and H ∈ R^(r×n), such that X ≈ WH.</p>
        <p>Each column of W represents a latent user type and contains the predicted ratings this latent type
would assign to each item. Unlike traditional matrix factorization approaches, where latent factors are
unconstrained and often not directly interpretable, BSSMF constrains the values in W to lie within the
same rating range as X (e.g., between 1 and 5). This makes it possible to interpret the values in W as
actual scores, facilitating explanation.</p>
        <p>Each column of H indicates how much a given user aligns with each latent type. BSSMF constrains
the values in H to be non-negative and to sum to one, meaning that each user is expressed as a convex
combination of the latent user types, effectively forming a soft clustering over the user base.</p>
        <p>Together, these constraints enable BSSMF to remain both expressive and inherently explainable:
W reveals the preferences of interpretable user types, and H tells us how each user combines these
types. This structure forms the basis for the explanations we generate throughout this study. Moreover,
BSSMF is much more robust to the choice of the factorization rank, r, and to overfitting than standard
unconstrained matrix factorization models [26]. The reason is that the additional bound constraints on
W and H imply that the entries of WH remain in the same range as the data.
3.1.1. From ratings to recommendations
To promote catalog diversity, we avoid recommending only the top-3 highest-scoring movies. Instead,
we sample 3 recommendations from a pool of up to 20 candidate items with predicted scores ≥ 4 (always
including the actual top-3). The probability of selecting each movie is proportional to its predicted
score. This approach balances relevance with diversity.</p>
        <p>[Figure 1: Comparison of recommended movies with and without diversity.]</p>
        <p>As illustrated in Figure 1, this strategy significantly increases catalog coverage: the number of distinct
recommended movies rises from 56 (with deterministic top-3) to 95 (with sampling), out of a set of 100
movies (see subsection 4.1 for details on the selection process).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Generating explanations</title>
        <p>As previously discussed, the BSSMF algorithm allows us to generate user types that can be interpreted.
While this interpretation can be done manually for small-scale problems, it becomes infeasible as the
data size increases—which is the case in most real-world recommender systems.</p>
        <p>We therefore propose leveraging the capabilities of Large Language Models (LLMs) to:
1. Interpret the user types. This serves two main purposes: (i) providing interpretable insights
into the latent user types, which are the building blocks used to reconstruct any individual
user—valuable information for business teams, analysts, and system designers; and (ii) reusing
these interpretations to explain the recommendations made to users.
2. Explain individual recommendations. By analyzing how a recommendation emerges from the
weighted combination of user types, we can generate a natural language explanation tailored to
each user and item.
3.2.1. Generating explanations in practice
To generate meaningful explanations using a Large Language Model (LLM), two elements are critical:
selecting a model with sufficient reasoning capabilities, and designing a prompt that aligns with the
structure of the data. We evaluated several instruction-tuned models, including Llama 3 3B Instruct,
Llama 3 8B Instruct [27], Mixtral 8x7B Instruct, Mixtral 8x22B Instruct [28], Mistral Small 24B Instruct,
Llama 3 70B Instruct [27], and DeepSeek R1 Distill Llama 70B [29].</p>
        <p>For each model, we assessed two aspects: the quality of the descriptions generated for user types
(e.g., are they well-defined, coherent, and distinctive?) and the quality of the explanations provided for
individual users. Among the tested models, DeepSeek consistently delivered the best results on both
fronts. Its reasoning is more structured, and it returns its step-by-step thought process before delivering
the final answer. This feature is particularly valuable for us, as it allows us to better understand how
the model arrives at a given explanation—facilitating post hoc analysis and qualitative validation of its
outputs.</p>
        <p>Although we do not detail this model comparison process here—since comparing LLMs is not the
main goal of this paper—it required considerable effort and tuning. We emphasize that the objective
is not to benchmark models, but to verify that a sufficiently capable LLM can generate high-quality,
interpretable explanations within our setup.</p>
        <p>The selected model, DeepSeek-R1-Distill-Llama-70B, is not the largest in terms of parameter count,
but it is nonetheless substantial. Running it in inference mode with float16 precision requires a setup
with eight A40 GPUs (40GB VRAM each).</p>
        <p>Explaining the user types The first step in our pipeline is to interpret the latent user types obtained
from our matrix factorization model. We prompt the LLM with the following instruction:
We have a recommendation system based on matrix factorization, capable of
generating recommendations. We aim to interpret the user types. The matrix
X (containing ratings between 1 and 5) has dimensions (movies x users). We
decompose it as follows: X = W * H, where W has dimensions (movies x latent
factors) and H has dimensions (latent factors x users). A rating above 4 is
considered very good. A rating of 3 is acceptable. A rating below 3 indicates
less interesting movies for the latent user. The values in W fall within the
same range as those in X. Each column of W represents a user type. Each column
of H sums up to 1. Your role is to interpret the user types. For each user
type, provide a description in a maximum of 100 words. Base your explanation
on all the movies associated with the user type. It is very important that
the descriptions of the user types are explicit and different. Each user
type’s description should neither be too obvious nor too generalistic, in
order to obtain distinctive and characteristic user types. We provide you
with a column of W. Do not list too many liked movies to describe the user
type, but rather focus on the characteristics of these movies... You can
mention 2–3 movies if necessary, but do not make a long list of liked movies.
Be careful not to say that a user type likes a movie if that movie has a
rating below 4. If you realize that your descriptions for two (or more) user
types are too similar, it means you are not distinguishing them enough. Focus
on what really differentiates them. Please reason step by step, and put your
final answer within \boxed{}.</p>
        <p>This prompt was carefully designed to guide the LLM toward producing compact, distinct, and
insightful descriptions of each latent user type, while enforcing consistency with the underlying data.
To build the user prompt, we extract each column of the matrix W, representing the
preferences of a user type across all movies. Since a user type is mathematically defined by its scores over the
entire item space, we include all movies, ranked by predicted rating. For each, we provide the French
title, predicted score, and translated genres—enabling the LLM to interpret the user type meaningfully,
as item IDs alone would not be understandable. This results in a purely mathematical interpretation
made accessible through natural language.</p>
        <p>Explaining the recommendations To explain the recommendations, we consider several
approaches.</p>
        <p>The first, already introduced, is to leverage the mathematical structure of the model to generate
natural language explanations. In this approach, we use the latent user types and the model's internal
weighting to justify recommendations in a transparent and faithful way.</p>
        <p>Model-based explanation (user types + latent weights):</p>
        <p>We use a recommendation system based on matrix factorization to suggest
relevant movies based on user preferences.</p>
        <p>How it works:
- The matrix X (movies x users) contains ratings from 1 to 5.
- We decompose it as X = W * H, where:
- W (movies x latent factors) represents ratings from user types.
- H (latent factors x users) weights the influence of each user type for
a given user (each column of H sums up to 1).
- A movie’s final score is a weighted average of evaluations from
multiple user types.</p>
        <p>Your task:
- Explain why this movie might appeal to the user without mentioning
matrix factorization or user type weightings.
- Highlight broad trends rather than linking the recommendation to a single
user type.</p>
        <p>Guidelines:
- Justify the recommendation in a maximum of two sentences.
- Emphasize thematic, tonal, stylistic, or emotional similarities
rather than just genre overlap.
- Adopt a natural, professional, and engaging tone.
- Frame the explanation as insightful advice, not a film summary.
- Avoid robotic or generic phrasing.
- Express measured enthusiasm to spark interest without exaggeration.
- Please reason step by step, and put your final answer within \boxed{}.
Final Goal:
- The recommendation should feel meaningful, not generic.
- The explanation should spark curiosity and interest.
- Reflect a mix of preferences, not a single user type’s
perspective.</p>
        <p>Finally, translate everything into French and use the informal "tu" form.
At the end, we only want the French final answer in the box. Do not add
information, just the final answer.</p>
        <p>We also explore an alternative strategy that relies on simplified information: instead of referring
to the model’s internal computations, the explanation is based on the user’s previously liked movies.
While this may seem intuitive, it is less transparent, as it does not reflect how the model actually works.
Our recommender system is based on BSSMF, which embeds users and items in a shared latent space
and makes recommendations via dot products between these embeddings. As a result, the model does
not directly rely on previously liked items when recommending new ones. Explanations based on
viewing history, although easy to understand, do not faithfully represent the reasoning behind the
recommendations.</p>
        <p>History-based explanation (based on liked movies):</p>
        <p>We generate a personalized explanation to help the user understand why a
recommended movie might be a good match.</p>
        <p>How it works:
- The explanation is based on:
- The title of the recommended movie.
- Its genres.
- The titles and genres of movies the user has previously
watched and rated highly (at least 4 stars).
- The goal is to highlight meaningful connections between past
preferences and this recommendation, without mentioning an algorithm.
Your task:
- Justify the recommendation in a maximum of two sentences.
- Emphasize thematic, tonal, stylistic, or emotional similarities
rather than just genre overlap.
- Adopt a natural, professional, and engaging tone.
- Frame the explanation as insightful advice, not a film summary.
- Avoid robotic or generic phrasing.
- Express measured enthusiasm to spark interest without exaggeration.
- Please reason step by step, and put your final answer within \boxed{}.
Final goal:
- The explanation should feel relevant and meaningful to the user.
- It should spark curiosity and encourage them to watch the movie.
Finally, translate everything into French and use the informal "tu" form.
At the end, we only want the French final answer in the box. Do not add
information, just the final answer.</p>
        <p>Finally, we include a third type of explanation that combines both previous approaches: it references
the user types and the model’s logic, while also connecting the recommended item to the user’s past
preferences. This hybrid strategy aims to benefit from both transparency and familiarity.
Combined explanation (model reasoning + history):</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Study Design</title>
      <sec id="sec-4-1">
        <title>4.1. Selecting a representative and usable movie subset</title>
        <p>Choosing an appropriate dataset for a user study in recommender systems is not trivial. While most
benchmarks rely on well-known datasets such as MovieLens 100K or 1M [30], these collections contain
relatively old movies, which may not reflect the type of content typically consumed by modern users. To
address this limitation, we opted to extract a subset of movies from the larger MovieLens 32M dataset2.</p>
        <p>Our decision to work with a reduced subset was not due to computational constraints—BSSMF scales
to larger datasets—but rather because the number of user interactions we expect to collect is limited. A
smaller, well-curated matrix allows us to preserve the structure of a realistic recommendation scenario
while keeping the study manageable.</p>
        <p>To build a representative and diverse pool of 100 movies for the user study, we selected items based
on popularity while penalizing older movies to avoid temporal bias. French titles, synopses, and poster
images were retrieved using the TMDB API3 to ensure language consistency. We also ensured diversity
by limiting the selection to one movie per saga. The full data preparation pipeline is available in our
GitHub repository.</p>
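As one hypothetical instance of such a selection rule: popularity measured by rating count, discounted exponentially with age. The half-life and reference year below are our assumptions for illustration; the actual pipeline is in the GitHub repository.

```python
import math

def selection_score(num_ratings, release_year, ref_year=2024, half_life=15.0):
    """Popularity (rating count) discounted by movie age.

    The exponential half-life penalty is an illustrative assumption, not
    the authors' exact weighting: a movie half_life years old keeps half
    the weight of a brand-new movie with the same popularity.
    """
    age = max(0, ref_year - release_year)
    return num_ratings * math.exp(-math.log(2) * age / half_life)

# A recent movie outranks an equally popular older one:
recent = selection_score(1000, 2020)
older = selection_score(1000, 2000)
```

Ranking the catalog by such a score and keeping the top 100 (one movie per saga) would reproduce the kind of popularity-with-recency selection described above.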
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Selecting a consistent user base</title>
        <p>To collect user data, we relied on accessible participants to whom we could easily distribute the
questionnaire. Specifically, we collected responses from students and staff members at our university.
Since the university is French-speaking, the entire study was conducted in French to match participants’
native language. While this introduces a potential language-related bias, we consider it minimal—such
a bias would also exist if the study were conducted in English.</p>
        <p>We are aware of the sampling bias that results from recruiting participants from an academic
environment, as it restricts our sample to a relatively educated segment of the population. However, we
argue that since movies are a widely consumed cultural product, this bias is mitigated in practice and
does not prevent us from collecting coherent and meaningful data for our purposes.</p>
        <p>In total, we collected data from 326 users (see subsection 4.3 for details).</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Collecting data</title>
        <p>
          As we aim to conduct a user-centered study, it is essential to define a rigorous data collection protocol.
To guide our methodology, we follow best practices outlined in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], which provides a comprehensive
overview of experimental designs for explainability-focused user studies.
4.3.1. Collecting ratings to train the recommendation algorithm
To train the recommendation algorithm, we collected 10 ratings from each of 440 participants, using
the curated pool of 100 movies. Each participant rated a random subset of 10 movies, promoting matrix
        </p>
        <sec id="sec-4-3-1">
          <title>2https://grouplens.org/datasets/movielens/32m/</title>
          <p>
            3https://developer.themoviedb.org/
diversity. A synopsis (retrieved via the TMDB API4) was provided when needed. Users rated each movie
using a 5-point Likert scale: “I really like it”, “I like it”, “It’s okay”, “I don’t like it much”, “I really don’t
like it”. This resulted in 4400 user-item interactions.
4.3.2. Evaluating recommendations and explanations
To evaluate the usefulness and impact of explanations, we employ a between-subjects design, dividing
users into four distinct groups, each exposed to a different type of explanation (or none). This decision
is supported by findings in [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], which reports that 55% of reviewed studies adopt a between-subjects
design. While within-subjects designs allow each participant to compare different conditions—e.g., by
ranking explanations—they introduce significant biases. In particular, asking participants to compare
explanations implicitly suggests that some explanations must be better than others, thereby confounding
the evaluation of whether explanations actually enhance understanding or trust. In contrast, the
between-subjects design allows us to evaluate the intrinsic value of each explanation type without
introducing comparative framing effects.
          </p>
          <p>We split participants into four groups depending on the explanation strategy: one group receives
only the recommendation and synopsis (no explanation), while the other three receive model-based,
history-based, or combined explanations, as detailed in subsection 3.2.</p>
          <p>The 440 participants from the initial rating phase were evenly assigned to one of the experimental
groups for the recommendation evaluation. In total, 326 participants completed this second phase, each
evaluating three personalized recommendations, for a total of 978 evaluations.</p>
          <p>For each recommended movie, participants rated the recommendation (U2) and had the option to
leave a free-text comment.</p>
          <p>Participants in Groups 1, 2, and 3 (who received explanations) also evaluated the explanation across
several key dimensions using a 5-point Likert scale: “Strongly agree”, “Somewhat agree”, “Neither agree
nor disagree”, “Somewhat disagree”, “Strongly disagree”.</p>
          <p>To avoid bias, the statements were presented in random order. The evaluated dimensions and their
associated statements are summarized in Table 1.</p>
          <p>We included two items for both transparency and effectiveness, as these dimensions are conceptually
broader. T1 and T2 capture local (why this item) and global (how the system works) understanding,
respectively. E1 assesses whether the explanation supports informed decision-making, while E2 targets
intuitive alignment with user preferences. The other dimensions—persuasion, trust, and satisfaction—are
each measured by a single, focused item.</p>
          <p>Impact of explanations on recommendation evaluation. A key question in explainable
recommendation is whether explanations can enhance users’ appreciation of the recommendations themselves.
This goes beyond transparency or trust to ask whether users actually like the recommendations more
when they are explained.</p>
          <p>This hypothesis is rooted in classical psychology: the because effect [31] shows that people are more
likely to comply with a request when given a reason—even if it is trivial. However, the notion of
“acceptance” can be ambiguous: does an explanation make users more likely to engage with the content,
or does it bias their evaluation of the item?</p>
          <p>Our design allows us to investigate this question via question U2, which is asked in all groups,
including the one with no explanation. Comparing the U2 scores between Group 0 (no explanation)
and Groups 1–3 (with explanation) provides an estimate of whether receiving an explanation affects
users’ appreciation of the recommendation.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Experimental setup</title>
        <p>We base our study on a BSSMF model trained on a subset of the MovieLens 32M dataset. The interaction
matrix includes 135,616 users, comprising both MovieLens users and the 440 real users who
participated in our evaluation. A total of 100 items and 3,898,839 user-item interactions are retained
after filtering.</p>
        <p>To evaluate generalization and guide the choice of latent dimensionality, we split the dataset into a
training and a test set. We randomly select 5 interactions per user for the test set, restricted to users
with more than 10 ratings. The remaining interactions are used for training.</p>
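        <p>The per-user holdout described above can be sketched as follows; the data layout, function name, and
default parameters are illustrative, not the code actually used in the study:</p>

```python
# Sketch of the holdout split: 5 randomly chosen interactions per user go to
# the test set, restricted to users with more than 10 ratings. The layout
# (user -> list of (item, rating) pairs) is an assumption for this sketch.
import random

def split_interactions(ratings_by_user, n_test=5, min_ratings=10, seed=42):
    rng = random.Random(seed)
    train, test = {}, {}
    for user, interactions in ratings_by_user.items():
        if len(interactions) > min_ratings:
            held_out = rng.sample(interactions, n_test)
            test[user] = held_out
            train[user] = [x for x in interactions if x not in held_out]
        else:
            # Too few ratings: keep everything in the training set
            train[user] = list(interactions)
    return train, test

# Toy example: user "u1" has 12 ratings, "u2" only 3
data = {
    "u1": [(i, 4.0) for i in range(12)],
    "u2": [(i, 3.0) for i in range(3)],
}
train, test = split_interactions(data)
print(len(train["u1"]), len(test["u1"]))  # 7 held-in, 5 held out
```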
        <p>To guide the choice of latent dimensionality, we trained models with dimensionalities of 3, 5, and 10.
As expected, increasing the dimensionality improved training accuracy (RMSE = 0.72, 0.68, and 0.61,
respectively), but test RMSE did not follow the same trend: 0.83, 0.85, and 0.89, respectively. We selected
a dimensionality of 5 as a compromise, balancing generalization and interpretability, with a manageable
number of distinct and meaningful user types.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Discovered user types</title>
        <p>The method described in paragraph 3.2.1 allows us to uncover the user types that characterize our
system. In our setup, we identify the following user types:</p>
        <p>User Type 1: The Epic Enthusiast
This user craves high-octane, emotionally charged experiences. They favor
intense, dramatic films like "Saving Private Ryan" and "Braveheart," which
offer epic storytelling and serious themes. While they enjoy a mix of
genres, including lighter fare like "Zootopia", their true passion lies
in dramatic, action-packed narratives that leave a lasting impact.
User Type 2: The Thought Provoker
With a penchant for dark, psychological themes, this user seeks films
that challenge their perspective. Movies like "Parasite" and "Fight Club"
reveal a love for complex, thought-provoking narratives that delve into
the human condition, offering both intellectual stimulation and emotional depth.</p>
        <p>User Type 3: The Nostalgic Visionary
This user appreciates a blend of nostalgia and visual artistry. Films like
"Aladdin" and "Twelve Monkeys" showcase their enjoyment of both timeless
stories and visually stunning cinema. They value strong narratives and
memorable visuals, savoring a diverse range of genres but always seeking
that special cinematic magic.</p>
        <p>User Type 4: The Thrill Seeker
Encompassing a broad spectrum of genres, this user thrives on high-energy
experiences. From action-packed blockbusters like "Terminator" to
suspenseful dramas like "Gone Girl", they love stories with complex
characters and thrilling plots, always chasing the next adrenaline rush
in their cinematic journey.</p>
        <p>User Type 5: The Dark Explorer
This user is drawn to intense, unconventional narratives. Films like
"Joker" and "Pulp Fiction" highlight their taste for dark, boundary-pushing
themes. They enjoy both action and intellectual engagement, often seeking
stories that explore the deeper, darker aspects of human nature and
society.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Examples of explanations</title>
        <p>To illustrate the nature of the generated explanations, Table 2 presents three representative examples
per group, each reflecting the typical style and content characteristic of its explanation category.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. How are the explanations and recommendations perceived overall?</title>
        <p>We begin by analyzing how users globally perceive the recommendations and explanations they received,
regardless of the specific evaluation dimensions defined earlier. The goal here is to obtain an overall
picture of user responses across the different experimental conditions.</p>
        <p>To this end, we compute and compare the mean scores for each evaluation question, across all four
groups defined in our study. This allows us to assess how each type of explanation influences the way
users experience the system.</p>
        <p>Figure 2 shows the mean scores across all evaluation criteria, indicating consistently high satisfaction
levels in all groups. Since the median score is 4 for every group and criterion, we do not report medians
separately. In addition to mean scores, we also analyzed alternative indicators such as the proportion
of low ratings (≤ 2), percentile distributions, and raincloud plots. As these analyses yielded results
consistent with the comparison of means, we opted to present this more compact and synthetic figure.
All complementary analyses and visualizations are available in our GitHub repository.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Statistical analysis</title>
        <p>
          A common approach for comparing multiple groups is the one-way ANOVA test [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], which assumes
normality, homoscedasticity, and independent observations. While independence is guaranteed by our
experimental design (random group assignment), the other assumptions must be tested. We use the
Shapiro–Wilk test [32] for normality and Levene’s test [33] for homoscedasticity.
        </p>
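          <p>As an illustration, the assumption checks described here, together with the non-parametric group
comparison used in this section, can be run with SciPy roughly as follows; the group sizes and Likert
responses are invented for the sketch:</p>

```python
# Sketch of the statistical pipeline: Shapiro-Wilk (normality), Levene
# (homoscedasticity), and Kruskal-Wallis (non-parametric group comparison).
# The data below are hypothetical 5-point Likert responses, one list per group.
from scipy import stats

groups = {
    "Group 0": [4, 5, 3, 4, 4, 5, 2, 4],
    "Group 1": [4, 4, 5, 3, 4, 5, 4, 3],
    "Group 2": [5, 4, 4, 4, 3, 5, 4, 4],
    "Group 3": [3, 4, 4, 5, 4, 3, 4, 4],
}

# Shapiro-Wilk: normality within each group (typically violated for coarse
# Likert data)
for name, scores in groups.items():
    w, p = stats.shapiro(scores)
    print(f"{name}: Shapiro-Wilk p = {p:.3f}")

# Levene: homogeneity of variances across groups
_, p_levene = stats.levene(*groups.values())
print(f"Levene p = {p_levene:.3f}")

# Kruskal-Wallis: rank-based comparison of the four groups
h, p_kw = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis H = {h:.2f}, p = {p_kw:.3f}")
```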
          <p>As expected given the limited range of Likert responses, normality is clearly violated (p = 0.000 in
all cases). Homoscedasticity is only partially satisfied, with Levene’s p-values ranging from 0.001 to 0.286.
Although ANOVA is sometimes used despite such violations, we adopt a more robust approach and rely
on the Kruskal–Wallis test [34, 35]—its non-parametric equivalent—which assumes neither normality
nor homoscedasticity. Results are reported in Table 3, and full details of the tests are provided in our
GitHub repository.</p>
          <p>Table 2. Examples of generated explanations (three per group).</p>
          <p>Group 1 (model-based):
Forrest Gump might interest you as it blends an epic and emotional story, touching on key
historical moments, with deep characters and surprising twists.
You’ll love “Jurassic Park” for its mix of epic action and deeper reflections on human ambition,
all brought to life with stunning visuals.
“Iron Man” combines high-octane action with thoughtful storytelling, making it an exciting choice
for those who enjoy both thrilling adventures and stories that explore deep themes such as
identity and responsibility.</p>
          <p>Group 2 (history-based):
You’re going to love “The Martian” because, like “Guardians of the Galaxy” or “Interstellar”, it
perfectly blends space adventure with a moving, hope-filled story. The film captures that balance
between suspense and emotion that resonated with fans of “Good Will Hunting” and intense
dramas.
You might enjoy “Iron Man” because it blends action and intelligence, with a complex character
and deeper themes, much like “Gone Girl” or “Batman”. It’s also a visually stunning adventure
— perfect for those who liked “Thor” or “Jurassic Park”.
“Gone Girl” might appeal to you because, like “Ocean’s Eleven” and “Ex Machina”, it pulls you
into a twisted and psychological story with gripping twists. Its mix of suspense and reflection on
human relationships might spark your curiosity, just like the intense films you’ve enjoyed.</p>
          <p>Group 3 (combined):
You might enjoy this film as, like “Se7en” or “Memento”, it cleverly blends suspense, a complex
plot, and striking revelations with strong psychological tension.
This movie might interest you because it combines a gripping story with intellectual stakes,
much like “The Bourne Identity”. The emotion and depth of the characters, similar to those in
“Forrest Gump”, will draw you into a historical and suspenseful narrative.
You’ll love “Django Unchained” for its mix of intense action and deep dramatic storytelling,
much like in “Saving Private Ryan”. This film, with its complex characters and reflection on dark
themes, will also remind you of “The Silence of the Lambs” and “American History X”, with a
unique style that will captivate you.</p>
          <p>5.4.1. Impact of the explanation on recommendation appreciation</p>
          <p>As shown in Table 3, we observe no statistically significant differences between groups for U2. This
suggests that explanations do not bias users’ stated preferences toward the recommended items,
consistent with findings from Lu et al. [36], who found only marginal changes in preference ratings before
and after exposure to explanations—except when explanations were provided by peers.</p>
          <p>However, as shown in Figure 2, among the participants who received an explanation (Groups 1–3),
responses to P1 (persuasion) indicate that users are generally more inclined to follow the recommendation.
This suggests that while explanations do not significantly affect the stated preference for the movie (U2),
they may increase users’ willingness to engage with the recommended content.</p>
          <p>5.4.2. Perception of the explanations across experimental groups</p>
          <p>To identify which specific explanation strategies differ, we perform post hoc pairwise comparisons
using Dunn’s test [37], a non-parametric method suited for rank-based data. The test is applied only to
questions where a significant group effect was detected. Results are summarized in Table 4.</p>
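          <p>A rough, self-contained sketch of Dunn’s pairwise z-test with the standard tie correction is given
below. It is an illustrative implementation, not the code used in the study, and it omits any p-value
adjustment for multiple comparisons:</p>

```python
# Sketch of Dunn's post hoc test: pool all observations, assign midranks,
# then compare mean ranks of each group pair with a tie-corrected z statistic.
from itertools import combinations
from statistics import NormalDist

def dunn_pairwise(groups):
    """groups: dict name -> list of scores. Returns {(a, b): two-sided p}."""
    pooled = sorted(x for g in groups.values() for x in g)
    n = len(pooled)
    # Average (mid)rank for each distinct value, handling ties
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1..j
        i = j
    # Tie-corrected variance term: n(n+1)/12 - sum(t^3 - t) / (12(n-1))
    ties = sum(t ** 3 - t for t in (pooled.count(v) for v in set(pooled)))
    var = n * (n + 1) / 12 - ties / (12 * (n - 1))
    mean_rank = {g: sum(rank_of[x] for x in xs) / len(xs)
                 for g, xs in groups.items()}
    norm = NormalDist()
    pvals = {}
    for a, b in combinations(groups, 2):
        se = (var * (1 / len(groups[a]) + 1 / len(groups[b]))) ** 0.5
        z = (mean_rank[a] - mean_rank[b]) / se
        pvals[(a, b)] = 2 * (1 - norm.cdf(abs(z)))
    return pvals

# Toy example with three hypothetical groups of Likert scores
pvals = dunn_pairwise({"G1": [4, 5, 3, 4, 5],
                       "G2": [2, 3, 2, 3, 2],
                       "G3": [4, 4, 5, 5, 3]})
for pair, p in sorted(pvals.items()):
    print(pair, round(p, 4))
```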
          <p>While Dunn’s test identifies statistically significant differences, it does not indicate the direction or
magnitude of these differences. We therefore compute Cliff’s delta [38, 39] for each significant pairwise
comparison. This non-parametric effect size measures how frequently values from one group exceed
those from another, ranging from −1 (complete dominance of Group B) to +1 (complete dominance of
Group A), with 0 indicating no difference. The results are visualized in Figure 3.</p>
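          <p>Cliff’s delta is simple to compute directly. The following sketch, with invented data, also maps
|δ| to the conventional magnitude bands (negligible &lt; 0.147, small &lt; 0.33, medium &lt; 0.474,
large otherwise):</p>

```python
# Sketch of Cliff's delta: the probability that a value from group A exceeds
# one from group B, minus the reverse, in [-1, +1]. Data are illustrative.
def cliffs_delta(a, b):
    """Return Cliff's delta for two samples."""
    more = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (more - less) / (len(a) * len(b))

def magnitude(delta):
    """Map |delta| to the conventional qualitative bands."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

# Hypothetical Likert scores for two groups
group_a = [4, 5, 4, 3, 5, 4]
group_b = [3, 4, 3, 4, 3, 3]
d = cliffs_delta(group_a, group_b)
print(f"delta = {d:.3f} ({magnitude(d)})")
```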
          <p>Interpretation of results. Effect sizes computed via Cliff’s delta (Figure 3) show that all significant
differences between groups fall within the negligible or small range. This suggests that while certain
explanation strategies are rated differently, these differences remain modest in practice—reinforcing
that all approaches are generally well received.</p>
        <p>For the transparency dimension (T1 and T2), Group 2 is perceived as more transparent than Group 1.
This is somewhat counterintuitive, as Group 1 explanations are grounded in the model’s internal logic,
whereas Group 2 explanations rely solely on the user’s viewing history. However, qualitative feedback
suggests that Group 2 explanations may feel more relatable, thanks to their explicit references to familiar
movies—even if they are not faithful to the actual reasoning of the model.</p>
          <p>Regarding effectiveness, Group 3 is rated less favorably than Groups 1 and 2 for E1 (informativeness),
and less favorably than Group 2 for E2 (ability to judge whether one would like the movie). Although
Group 3 explanations combine internal reasoning with historical context, this hybrid strategy may
introduce verbosity or less coherent narratives. Some users expressed (see subsection 5.5) confusion
over the links drawn between movies and a desire for clearer, more focused justifications.</p>
        <p>For satisfaction (S1), Group 3 also receives slightly lower ratings compared to Groups 1 and 2.
This suggests that richer explanations do not always translate to greater satisfaction—possibly due to
information overload or a lack of narrative clarity.</p>
          <p>Overall, these findings highlight a trade-off: while combining different information sources can
enrich explanations, simpler and more targeted strategies may prove more effective if they better match
user expectations and maintain clarity.</p>
          <p>Figure 3. Effect sizes (Cliff’s delta) for significant post hoc differences. Group comparisons shown:
T1: 1 vs 2; T2: 1 vs 2; E1: 1 vs 3; E1: 2 vs 3; E2: 2 vs 3; S1: 1 vs 3; S1: 2 vs 3. Magnitude bands:
negligible (|δ| &lt; 0.147), small (0.147–0.33), medium (0.33–0.474), large (≥ 0.474).</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Qualitative feedback</title>
        <p>We analyzed the free-text comments provided by users in each condition to better understand how
explanations were perceived. As expected, participants in Group 0, who received no explanation,
mostly left short confirmations that the recommendations were appealing, but their comments lacked
further elaboration. In Group 1 (explanations based on model internals), several users appreciated the
presence of an explanation and even stated that it made them more interested in the movie than the
synopsis itself. However, many found the wording overly generic, emotional, or vague, with comments
suggesting that the explanations could apply to any movie or any user. Some participants noted a lack
of concrete elements or specificity, and a few were bothered by stylistic choices (e.g., being addressed
informally). Group 2 (history-based explanations) generated more polarized feedback: participants
liked the references to previously seen movies, but often criticized the lack of depth in the justification,
the repetitive nature of some phrases, or the weak relevance of the mentioned titles. The frequent reuse
of the same reference movies—caused in our setup by the limited number of previously rated movies (10
per user)—was noted as a limitation. This reinforces the idea that history-based explanations may be
less suitable in cold-start scenarios, where user interaction data is sparse and diversity in justifications
is harder to achieve. In Group 3 (hybrid explanations), several users struggled to understand the
connections made between recommended movies and their viewing history, sometimes pointing out
that the movies seemed unrelated or that the links felt unjustified. Overall, while each explanation
strategy had its strengths, user comments highlighted recurrent concerns about vagueness, genericity,
and the need for more specific and grounded justifications.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>This work investigates how interpretability and natural language explanations can be jointly integrated
into recommender systems in a way that is both faithful to the underlying model and understandable
to users. We propose a two-stage framework that starts from a constrained matrix factorization
algorithm—BSSMF—designed to yield human-interpretable latent factors, and then translate these
signals into textual justifications using a large language model. Unlike many existing approaches that
generate explanations post hoc for black-box models, our method grounds the generation process in a
model whose behavior is fully transparent and mathematically interpretable.</p>
      <p>Through a controlled user study, we compared four explanation strategies—including no explanation,
LLM-based explanations faithful to the model’s structure, explanations based on user history, and
a combination of both. Results show that even without explanations, users consistently rated the
recommendations highly, suggesting that interpretable models such as BSSMF are able to provide strong
relevance signals by design. This supports the idea that interpretability and performance can coexist
and that models crafted with transparency in mind can serve as solid foundations for explainable AI.</p>
      <p>However, our study also reveals important nuances. Although strategy 1 (based on the model’s
internal logic) is the most faithful to how recommendations are actually computed, it was not always
rated higher than strategy 2 (based on user history), which some users found more familiar or personally
meaningful. This points to a fundamental challenge in explainable recommendation: the explanations
that are most truthful are not always the most effective in terms of user perception. The design of
future explanations should take into account transparency, effectiveness, persuasion, trust, and overall
user satisfaction, along with other factors that may influence the user experience.</p>
      <p>Furthermore, the movie domain naturally provides a wealth of contextual signals—e.g., visual elements,
familiar titles, and synopses—that may reduce users’ reliance on explanations. The same system deployed
in domains such as job recommendation, online education, or medical decision support could yield
very different outcomes. Explanations in those contexts are likely to be more critical for building user
trust, ensuring fairness, and supporting informed decision-making. Generalizing our approach to such
domains will be an important direction for future research.</p>
      <p>Another insight from our qualitative analysis is that users often perceive the explanations as repetitive.
Despite the LLM’s fluency and coherence, repeated exposure to similar explanations reduces their
perceived usefulness. This highlights the importance of introducing diversity into explanation content,
to sustain user engagement over time. In parallel, we observe that users differ in their preferences for
explanation style, structure, and function. This suggests a complementary need for personalization,
where explanation systems adapt to individual user profiles. Future work should therefore explore both
axes: generating a wider variety of faithful explanations, and tailoring them to the preferences and
expectations of each user.</p>
      <p>Beyond individual personalization, future systems could also adapt at the session or task level,
selecting explanations dynamically based on context, cognitive load, or user engagement signals. We
believe that LLMs, when coupled with inherently interpretable models, offer a unique opportunity to
reach this level of adaptivity: they can generate varied narratives grounded in the same underlying
mathematical factors, potentially addressing both the need for faithfulness and the demand for engaging
content.</p>
      <p>In sum, this work demonstrates that it is possible to build recommender systems that are interpretable
by design and capable of generating user-friendly explanations. While our findings are promising,
they also highlight the complexity of explanation design and the need for more adaptive, user-aware
approaches. Bridging the gap between model transparency and perceived clarity remains a central
challenge—one that future work must continue to address through interdisciplinary methods combining
recommender systems, cognitive psychology, and natural language generation.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We thank all participants for their time and valuable feedback. According to the policies of our institution,
this type of study did not require formal ethics board approval. Nevertheless, all participants were
properly informed about the purpose of the study, their rights, and how their data would be used, in
accordance with ethical guidelines.</p>
      <p>The present research benefited from computational resources made available on Lucia, the Tier-1
supercomputer of the Walloon Region, infrastructure funded by the Walloon Region under the grant
agreement n°1910247.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4o for the following activities: Grammar
and spelling check and Paraphrase and reword. After using this tool, the authors reviewed and edited
the content as needed and take full responsibility for the publication’s content. In addition, large
language models were used for experimental design purposes, specifically to generate explanations
for recommendations, as detailed in the paper. This usage is mentioned here for transparency, and is
distinct from the writing assistance described above.</p>
      <p>arXiv:2406.02377.
[20] M. Bismay, X. Dong, J. Caverlee, ReasoningRec: Bridging personalized recommendations and
human-interpretable explanations through LLM reasoning, 2024. arXiv:2410.23180.
[21] Y. Luo, M. Cheng, H. Zhang, J. Lu, E. Chen, Unlocking the potential of large language models
for explainable recommendations, in: Database Systems for Advanced Applications, 2024, pp.
286–303. doi:10.1007/978-981-97-5569-1_18.
[22] H. Zhao, S. Zheng, L. Wu, B. Yu, et al., LANE: Logic alignment of non-tuning large language
models and online recommendation systems for explainable reason generation, 2024. arXiv:2407.02833.
[23] S. Lubos, Improving recommender systems with large language models, in: Adjunct Proceedings
of the 32nd ACM Conference on User Modeling, Adaptation and Personalization, 2024, pp. 40–44.
doi:10.1145/3631700.3664919.
[24] Q. Peng, H. Liu, H. Huang, Q. Yang, et al., A survey on LLM-powered agents for recommender
systems, 2025. arXiv:2502.10050.
[25] O. Vu Thanh, N. Gillis, F. Lecron, Bounded simplex-structured matrix factorization, in: ICASSP
2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2022, pp. 9062–9066. doi:10.1109/ICASSP43922.2022.9747124.
[26] O. Vu Thanh, N. Gillis, F. Lecron, Bounded simplex-structured matrix factorization: Algorithms,
identifiability and applications, IEEE Transactions on Signal Processing 71 (2023) 2434–2447.
doi:10.1109/TSP.2023.3289704.
[27] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, et al., The Llama 3 herd of models, 2024.
arXiv:2407.21783.
[28] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, et al., Mixtral of experts, 2024. arXiv:2401.04088.
[29] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, et al., DeepSeek-R1: Incentivizing reasoning capability
in LLMs via reinforcement learning, 2025. arXiv:2501.12948.
[30] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact.
Intell. Syst. 5 (2015). doi:10.1145/2827872.
[31] E. J. Langer, A. Blank, B. Chanowitz, The mindlessness of ostensibly thoughtful action: The role
of "placebic" information in interpersonal interaction, Journal of Personality and Social Psychology
36 (1978) 635. doi:10.1037/0022-3514.36.6.635.
[32] S. S. Shapiro, M. B. Wilk, An analysis of variance test for normality, Biometrika 52 (1965) 591–611.
doi:10.1093/biomet/52.3-4.591.
[33] H. Levene, Robust tests for equality of variances, Contributions to Probability and Statistics (1960)
278–292.
[34] W. H. Kruskal, W. A. Wallis, Use of ranks in one-criterion variance analysis, Journal of the
American Statistical Association 47 (1952) 583–621. doi:10.1080/01621459.1952.10483441.
[35] P. E. McKight, J. Najab, Kruskal-Wallis test, The Corsini Encyclopedia of Psychology (2010) 1–1.
doi:10.1002/9780470479216.corpsy0491.
[36] H. Lu, W. Ma, Y. Wang, M. Zhang, et al., User perception of recommendation explanation: Are
your explanations what users need?, ACM Trans. Inf. Syst. 41 (2023). doi:10.1145/3565480.
[37] O. J. Dunn, Multiple comparisons using rank sums, Technometrics 6 (1964) 241–252.
doi:10.1080/00401706.1964.10490181.
[38] K. Meissel, E. S. Yao, Using Cliff's delta as a non-parametric effect size measure: an accessible web
app and R tutorial, Practical Assessment, Research, and Evaluation 29 (2024). doi:10.7275/pare.1977.
[39] R. J. Grissom, J. J. Kim, Effect sizes for research: Univariate and multivariate applications, 2012.
doi:10.4324/9780203803233.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Herlocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          ,
          <article-title>Explaining collaborative filtering recommendations</article-title>
          ,
          <source>in: Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work</source>
          ,
          <year>2000</year>
          , p.
          <fpage>241</fpage>
          -
          <lpage>250</lpage>
          . doi:
          <volume>10</volume>
          .1145/358916.358995.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tintarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Masthof</surname>
          </string-name>
          ,
          <article-title>A survey of explanations in recommender systems</article-title>
          ,
          <source>in: 2007 IEEE 23rd International Conference on Data Engineering Workshop</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>801</fpage>
          -
          <lpage>810</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICDEW.
          <year>2007</year>
          .
          <volume>4401070</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Leemann</surname>
          </string-name>
          , T.-T. Nguyen,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fiedler</surname>
          </string-name>
          , et al.,
          <article-title>Towards human-centered explainable ai: A survey of user studies for model explanations</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>46</volume>
          (
          <year>2024</year>
          )
          <fpage>2104</fpage>
          -
          <lpage>2122</lpage>
          . doi:
          <volume>10</volume>
          .1109/TPAMI.
          <year>2023</year>
          .
          <volume>3331846</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ooge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Verbert</surname>
          </string-name>
          ,
          <article-title>Explaining recommendations in e-learning: Efects on adolescents' trust</article-title>
          ,
          <source>in: Proceedings of the 27th International Conference on Intelligent User Interfaces</source>
          ,
          <year>2022</year>
          , p.
          <fpage>93</fpage>
          -
          <lpage>105</lpage>
          . doi:
          <volume>10</volume>
          .1145/3490099.3511140.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kunkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Donkers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-M. Barbu</surname>
          </string-name>
          , et al.,
          <article-title>Let me explain: Impact of personal and impersonal explanations on trust in recommender systems</article-title>
          ,
          <source>in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems</source>
          ,
          <year>2019</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi:
          <volume>10</volume>
          .1145/3290605. 3300717.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Millecamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Htun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Conati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Verbert</surname>
          </string-name>
          ,
          <article-title>To explain or not to explain: the efects of personal characteristics when explaining music recommendations</article-title>
          ,
          <source>in: Proceedings of the 24th International Conference on Intelligent User Interfaces</source>
          ,
          <year>2019</year>
          , p.
          <fpage>397</fpage>
          -
          <lpage>407</lpage>
          . doi:
          <volume>10</volume>
          .1145/3301275.3302313.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Sundar</surname>
          </string-name>
          ,
          <article-title>How should ai systems talk to users when collecting their personal information? efects of role framing and self-referencing on human-ai interaction</article-title>
          ,
          <source>in: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems</source>
          ,
          <year>2021</year>
          . doi:
          <volume>10</volume>
          .1145/3411764. 3445415.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Buçinca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Z.</given-names>
            <surname>Gajos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Glassman</surname>
          </string-name>
          ,
          <article-title>Proxy tasks and subjective measures can be misleading in evaluating explainable ai systems</article-title>
          ,
          <source>in: Proceedings of the 25th International Conference on Intelligent User Interfaces</source>
          ,
          <year>2020</year>
          , p.
          <fpage>454</fpage>
          -
          <lpage>464</lpage>
          . doi:<pub-id pub-id-type="doi">10.1145/3377325.3377498</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xia</surname>
          </string-name>
          , et al.,
          <article-title>LLM-powered explanations: Unraveling recommendations through subgraph reasoning</article-title>
          ,
          <year>2024</year>
          . <pub-id pub-id-type="arxiv">arXiv:2406.15859</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>The blessing of reasoning: LLM-based contrastive explanations in black-box recommender systems</article-title>
          ,
          <year>2025</year>
          . <pub-id pub-id-type="arxiv">arXiv:2502.16759</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          , et al.,
          <article-title>Chat-REC: Towards interactive and explainable LLMs-augmented recommender system</article-title>
          ,
          <year>2023</year>
          . <pub-id pub-id-type="arxiv">arXiv:2303.14524</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feuerriegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. R.</given-names>
            <surname>Shrestha</surname>
          </string-name>
          ,
          <article-title>Contextualizing recommendation explanations with LLMs: A user study</article-title>
          ,
          <year>2025</year>
          . <pub-id pub-id-type="arxiv">arXiv:2501.12152</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marinho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Said</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Willemsen</surname>
          </string-name>
          ,
          <article-title>Leveraging chatgpt for automated human-centered explanations in recommender systems</article-title>
          ,
          <source>in: Proceedings of the 29th International Conference on Intelligent User Interfaces</source>
          ,
          <year>2024</year>
          , p.
          <fpage>597</fpage>
          -
          <lpage>608</lpage>
          . doi:<pub-id pub-id-type="doi">10.1145/3640543.3645171</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>Recommender systems in the era of large language models (LLMs)</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>36</volume>
          (
          <year>2024</year>
          )
          <fpage>6889</fpage>
          -
          <lpage>6907</lpage>
          . doi:<pub-id pub-id-type="doi">10.1109/TKDE.2024.3392335</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>How can recommender systems benefit from large language models: A survey</article-title>
          ,
          <source>ACM Trans. Inf. Syst.</source>
          <volume>43</volume>
          (
          <year>2025</year>
          ). doi:<pub-id pub-id-type="doi">10.1145/3678004</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A survey on large language models for personalized and explainable recommendations</article-title>
          ,
          <year>2023</year>
          . <pub-id pub-id-type="arxiv">arXiv:2311.12338</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vats</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chadha</surname>
          </string-name>
          ,
          <article-title>Exploring the impact of large language models on recommender systems: An extensive review</article-title>
          ,
          <year>2024</year>
          . <pub-id pub-id-type="arxiv">arXiv:2402.18590</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          , et al.,
          <article-title>RecExplainer: Aligning large language models for explaining recommendation models</article-title>
          ,
          <source>in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2024</year>
          , p.
          <fpage>1530</fpage>
          -
          <lpage>1541</lpage>
          . doi:<pub-id pub-id-type="doi">10.1145/3637528.3671802</pub-id>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>XRec: Large language models for explainable recommendation</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>