<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Metrics: Revisiting the ReDial Dataset for Evaluating Conversational Recommender Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Müller</string-name>
          <email>michael.m.mueller@uibk.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amir Reza Mohammadi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Peintner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Beatriz Barroso Gstrein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eva Zangerle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Günther Specht</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Conversational Recommender Systems, User Centric</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Innsbruck</institution>,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Current evaluation of Conversational Recommender Systems (CRS) relies heavily on automatic metrics such as Accuracy, BLEU and ROUGE, often using datasets like ReDial. However, these metrics and benchmarks overlook essential user-centric aspects of conversational quality. In this paper, we take a user-focused perspective by applying Large Language Model (LLM) annotators and the CRS-Que framework to over 10,000 ReDial conversations. Our analysis uncovers significant limitations: ReDial exhibits strong popularity bias, conversations are brief and lack qualities such as adaptability, humanness, and rapport, and traditional metrics fail to capture these dimensions. These results highlight the need for richer, multi-dimensional evaluation protocols and improved datasets that better reflect authentic user experience. We discuss implications for future CRS research and the development of user-centric assessment frameworks.</p>
      </abstract>
      <kwd-group>
        <kwd>Conversational Recommender Systems</kwd>
        <kwd>User Centric</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Conversational recommender systems (CRS) are increasingly central to online platforms, enabling users
to express preferences and receive personalized suggestions through natural, multi-turn dialogue [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
Over the past decade, CRS have evolved from rule-based and retrieval models to sophisticated neural
architectures, culminating in the integration of LLMs that offer unprecedented fluency and contextual
understanding [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. These advances have enabled CRS to handle complex user queries, adapt to
evolving preferences, and generate more engaging conversational experiences.
      </p>
      <p>
        Despite these technical improvements, evaluating CRS remains a significant challenge. Traditional
evaluation methods—such as BLEU, ROUGE, and offline accuracy metrics—focus on reference-based
comparisons and often overlook crucial user-centric aspects like adaptability, rapport, and conversational
quality [
        <xref ref-type="bibr" rid="ref3 ref6 ref7">6, 7, 3</xref>
        ]. While these metrics provide reproducible benchmarks, they fail to capture the nuanced
qualities that define effective human-computer interaction, such as empathy, transparency, and trust.
      </p>
      <p>
        Benchmark datasets such as ReDial [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are widely used for CRS evaluation, but may not fully reflect
the diversity and richness of real-world conversations. Recent frameworks, including CRS-Que [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
and CAFE [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], highlight the need for multi-dimensional, user-focused assessment protocols that go
beyond surface-level metrics. However, conducting large-scale user studies is resource-intensive, and
the reliability of LLMs as scalable annotators or user simulators remains an open question [
        <xref ref-type="bibr" rid="ref10 ref7">7, 10</xref>
        ].
      </p>
      <p>In this paper, we present our first steps towards a more user-centric evaluation of CRS. We critically
examine the ReDial dataset and its evaluation methodology by leveraging LLM annotators and the
CRS-Que framework to analyze conversational quality across multiple dimensions. Our exploratory analysis
reveals several limitations in both the dataset and prevailing metrics, motivating the need for richer
evaluation protocols and improved benchmarks. These initial findings aim to inform future research
on authentic user-centric assessment and the design of next-generation conversational recommender
systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        CRS have attracted growing research interest, with evaluation protocols evolving alongside advances in
system design. Early work focused on system-centric metrics, relying on datasets such as ReDial [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for
benchmarking recommendation accuracy and conversational fluency. ReDial remains a foundational
resource, enabling large-scale evaluation and supporting both automatic and user-centric analysis.
Extensions like TG-ReDial [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and E-ReDial [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] have introduced topic-guided threads and explanation
annotations, broadening the scope for context-rich and explainable CRS evaluation.
      </p>
      <p>
        Traditional evaluation methods, including BLEU and ROUGE [
        <xref ref-type="bibr" rid="ref1 ref11 ref12 ref8">8, 1, 11, 12</xref>
        ], provide reproducible
benchmarks but often fail to capture subjective qualities critical to user experience. Recent studies
highlight that these metrics do not correlate well with user satisfaction or conversational quality [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
For example, Manzoor et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] demonstrate weak alignment between automated scores and human
ratings, underscoring the limitations of reference-based evaluation.
      </p>
      <p>
        To address these gaps, user-centric frameworks such as ResQue [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and CRS-Que [
        <xref ref-type="bibr" rid="ref14 ref6">6, 14</xref>
        ] have been
developed, introducing multi-dimensional criteria like adaptability, rapport, humanness, and response
quality. These frameworks have been validated in controlled user studies with music and mobile phone
CRS agents [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], showing that conversational dimensions significantly influence satisfaction and trust.
Large-scale studies, including those by Yun et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and Manzoor et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], further demonstrate
the importance of longitudinal and comparative evaluation, revealing that LLM-powered CRS can aid
preference clarification and outperform retrieval models on perceived response quality.
      </p>
      <p>
        Recent papers have leveraged ReDial for both objective and subjective evaluation protocols. For
instance, Zhang et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] use the ReDial dataset to assess recommendation accuracy and the emotional
quality of generated responses in empathetic CRS, employing metrics such as Recall@N and AUC,
as well as human and LLM-based ratings for user satisfaction. Similarly, Wang et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] utilize
ReDial as a benchmark for contextual and time-aware CRS, focusing on extracting internal knowledge
from conversation context. Their evaluation combines automatic metrics with user-centric human
assessments, where crowd-workers rate generated responses for fluency, informativeness, and coherence
on a 0–2 scale, with final scores averaged across annotators.
      </p>
      <p>
        Crowdsourced platforms such as CRS Arena [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] offer scalable alternatives for user studies, though
with reduced experimental control. Meanwhile, the emergence of LLMs has enabled new paradigms for
CRS evaluation. LLMs can serve as annotators and user simulators, automating the assessment of
conversational quality across multiple dimensions. Frameworks like RecUserSim [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], SimpleUserSim [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ],
and CONCEPT [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] leverage agent-based simulation and LLM annotation to benchmark CRS agents at
scale, integrating criteria such as social intelligence and adaptability.
      </p>
      <p>
        Despite these advances, challenges remain. LLM-based simulation can suffer from prompt brittleness,
data leakage, and limited behavioral diversity [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. While some studies report that LLM-generated
annotations closely match human ratings [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], others find that crowdsourced judgments are more
reliable and that reference-based metrics often fail to capture human preferences [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. The reliability
and validity of LLMs as annotators and simulators require further systematic comparison with human
studies [
        <xref ref-type="bibr" rid="ref10 ref18">18, 10</xref>
        ].
      </p>
      <p>
        In summary, CRS evaluation is shifting toward richer, user-centric protocols that combine
large-scale automated annotation with targeted user studies. Our work builds on these trends by critically
assessing ReDial and its evaluation methodology, leveraging LLM annotators and CRS-Que to highlight
the limitations of current benchmarks and metrics [
        <xref ref-type="bibr" rid="ref3 ref6 ref7">22, 23, 6, 7, 3</xref>
        ]. This motivates the development of
improved datasets and multi-dimensional evaluation frameworks for authentic user-centric assessment
in CRS research.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <sec id="sec-3-1">
        <title>3.1. ReDial Dataset</title>
        <p>
          The ReDial dataset [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] was developed to facilitate research on conversational recommender systems at
the intersection of goal-directed and free-form dialogue. It contains 11,348 human-human conversations
about movie recommendations, collected via Amazon Mechanical Turk (AMT) between December 2017
and June 2018.
        </p>
        <p>For data collection, pairs of qualified AMT workers from English-speaking countries were matched
in real time and assigned the roles of seeker (requesting recommendations) and recommender. Workers
were required to have a high approval rate and over 1,000 approved HITs. Before each task, participants
provided informed consent via a dedicated form outlining the study’s purpose and methodology. The
custom interface enforced tagging of movie mentions with the ‘@’ symbol, offering a searchable list
of movies sourced from DBpedia, with the option to add new titles. Instructions emphasized formal
language, a minimum of ten messages per conversation, and discussion of at least four distinct movies.
Dialogues not meeting these criteria, or containing offensive or off-topic content, were removed.</p>
        <p>Each conversation is stored as a JSON object in jsonl format, with fields for conversation and
worker IDs, a list of messages, movie mentions, and per-movie labels indicating whether the seeker
or recommender has seen, liked, or suggested each movie. Messages include text, sender ID, and time
offset. No additional preprocessing was performed on the released data.</p>
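        <p>As a concrete illustration, the following minimal Python sketch loads a single record from the
training file. The file name and field names are assumptions based on the released schema as described
above and should be verified against the actual data.</p>
        <preformat><![CDATA[
import json

# Read one ReDial conversation (jsonl: one JSON object per line).
with open("train_data.jsonl") as f:  # hypothetical local path
    conv = json.loads(f.readline())

print(conv["conversationId"])
print(conv["movieMentions"])    # e.g. {"111776": "Super Troopers (2001)", ...}
for msg in conv["messages"]:
    # each message carries its text, the sender's worker ID, and a time offset
    print(msg["senderWorkerId"], msg["timeOffset"], msg["text"])
# per-movie labels from the seeker's side: suggested / seen / liked flags
print(conv["initiatorQuestions"])
]]></preformat>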
      </sec>
      <sec id="sec-3-2">
        <title>3.2. BLEU and ROUGE Scores</title>
        <p>BLEU (BiLingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting
Evaluation) are the most widely adopted automatic metrics for evaluating natural language generation,
including CRS.</p>
        <p>BLEU was originally developed for machine translation [<xref ref-type="bibr" rid="ref24">24</xref>]. It measures the precision of n-gram
overlaps between a candidate response and one or more reference responses, penalizing overly short
outputs via a brevity penalty. BLEU is computed as:</p>
        <p>BLEU = BP ⋅ exp(∑_{n=1}^{N} w_n log p_n),
where p_n is the precision of n-grams, w_n are the weights (typically uniform, w_n = 1/N), and BP is the brevity penalty.
BLEU can be calculated for different n-gram orders (e.g., BLEU-1, BLEU-4), with higher n capturing
more contextual similarity.</p>
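        <p>To make the definition concrete, the following is a minimal, unsmoothed Python sketch of
sentence-level BLEU as given above; it is not a reference implementation (tools such as sacrebleu add
smoothing and tokenization).</p>
        <preformat><![CDATA[
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngram_counts(cand, n), ngram_counts(ref, n)
        total = sum(cand_ngrams.values())
        # clipped count: a candidate n-gram is credited at most as often
        # as it appears in the reference
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        if total == 0 or clipped == 0:
            return 0.0  # any zero p_n zeroes the geometric mean (no smoothing)
        log_precisions.append(math.log(clipped / total))
    # brevity penalty BP: penalize candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)  # uniform w_n = 1/N

print(round(bleu("the cat sat on the mat", "the cat is on the mat", max_n=2), 3))  # 0.707
]]></preformat>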
        <p>ROUGE was introduced for automatic summarization [<xref ref-type="bibr" rid="ref25">25</xref>]. Unlike BLEU, ROUGE is recall-oriented,
focusing on how much of the reference content is covered by the candidate. Common variants include
ROUGE-N, which measures recall of matching n-grams; ROUGE-L, which is based on the longest
common subsequence (LCS); and ROUGE-S, which measures skip-bigram overlap. For example,
ROUGE-N is computed as:</p>
        <p>ROUGE-N = ∑_{ngram ∈ Ref} min(Count_cand(ngram), Count_ref(ngram)) / ∑_{ngram ∈ Ref} Count_ref(ngram)</p>
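        <p>A matching sketch of the clipped-count ROUGE-N recall above; the official ROUGE toolkit
additionally handles stemming and multiple references.</p>
        <preformat><![CDATA[
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    cand_ngrams = ngram_counts(candidate.split(), n)
    ref_ngrams = ngram_counts(reference.split(), n)
    total_ref = sum(ref_ngrams.values())
    if total_ref == 0:
        return 0.0
    # numerator: reference n-grams also produced by the candidate, clipped
    overlap = sum(min(cand_ngrams[g], c) for g, c in ref_ngrams.items())
    return overlap / total_ref

print(round(rouge_n("the cat sat on the mat", "the cat is on the mat", n=1), 3))  # 0.833
]]></preformat>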
        <p>
          In CRS research, BLEU and ROUGE can be used to evaluate the quality of generated responses by
comparing them to human-written reference replies in datasets such as ReDial [
          <xref ref-type="bibr" rid="ref7 ref8">8, 7</xref>
          ]. These metrics are
easy to compute and allow for reproducible benchmarking across models and datasets.
        </p>
        <p>Despite their popularity, BLEU and ROUGE have notable limitations in the context of CRS. Both
metrics rely on surface-level n-gram matching, which means they often fail to account for semantic
equivalence, paraphrasing, or contextually appropriate responses. They do not measure conversational
qualities such as coherence, empathy, adaptability, or user satisfaction, which are critical in CRS.</p>
        <fig id="fig1">
          <label>Figure 1</label>
          <caption>
            <p>Annotation prompt (abridged), with an excerpt of a ReDial conversation and the LLM's per-dimension ratings.</p>
          </caption>
          <preformat>
You are evaluating a conversation between a seeker and a recommender system.
Please rate the performance of the recommender system according to the following dimensions.
For each of the following dimensions, rate from 1 (poor) to 5 (excellent).
Please answer in this format:
dimension_name: [score 1-5]

Seeker: Hi there, how are you? I'm looking for movie recommendations
Recommender: I am doing okay. What kind of movies do you like?
Seeker: I like animations like The Triplets of Belleville (2003) and Waking Life (2001)
Seeker: I also enjoy Mary and Max (2009). Anything Artistic
Recommender: You might like The Boss Baby (2017) that was a good movie.
[...]

accuracy: How well do the recommended items match the seeker's interests?
novelty: Are the recommendations new or surprising to the seeker?
[...]

accuracy: 4
novelty: 3
interaction_adequacy: 5
[...]
          </preformat>
        </fig>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methods</title>
      <p>We began with an exploratory analysis of the ReDial dataset to characterize its conversational structure
and recommendation patterns. This included examining the distribution of mentioned movies,
identifying potential popularity bias, and analyzing message lengths to assess the depth and diversity of
interactions. These insights informed our subsequent annotation and evaluation procedures.</p>
      <p>To assess conversational quality in the ReDial dataset, we developed an automated annotation pipeline
using LLMs. We processed the 10,003 training conversations with the OpenAI GPT-4.1 nano model
via the batch API. Each conversation was preprocessed by replacing movie mention tokens with their
corresponding movie titles for readability, and formatted as a transcript with each turn labeled by role
(Seeker or Recommender).</p>
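      <p>A minimal sketch of this preprocessing step is shown below: movie mention tokens (e.g.
"@111776") are replaced by titles, and each turn is labeled by role. Field names are assumptions based
on the released ReDial schema.</p>
      <preformat><![CDATA[
import json
import re

def to_transcript(conv):
    titles = conv["movieMentions"]         # {"111776": "Super Troopers (2001)", ...}
    seeker_id = conv["initiatorWorkerId"]  # the seeker opened the conversation

    def resolve(match):
        # replace "@<id>" with the movie title, keeping the token if unknown
        return titles.get(match.group(1), match.group(0))

    lines = []
    for msg in conv["messages"]:
        role = "Seeker" if msg["senderWorkerId"] == seeker_id else "Recommender"
        text = re.sub(r"@(\d+)", resolve, msg["text"])
        lines.append(f"{role}: {text}")
    return "\n".join(lines)

with open("train_data.jsonl") as f:  # hypothetical local path
    print(to_transcript(json.loads(f.readline())))
]]></preformat>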
      <p>
        For each dialogue, we generated a standardized annotation prompt (see Figure 1) instructing the LLM
to rate the recommender’s performance across multiple user-centric dimensions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. These included
both ResQue and CRS-Que criteria: accuracy, novelty, interaction adequacy, explainability, adaptability,
understanding, response quality, attentiveness, ease of use, usefulness, user control, transparency,
humanness, rapport, trust, satisfaction, and behavioral intention. The LLM assigned a score from 1
(poor) to 5 (excellent) for each dimension, following a specified response format.
      </p>
      <p>Model outputs were parsed to extract scores, and results were validated to ensure all required
dimensions were present and within the valid range. Dialogues with incomplete or malformed ratings
were excluded from analysis. The annotation process with GPT-4.1 nano required approximately 18
hours and cost $0.76; using different model sizes would proportionally affect cost and throughput.</p>
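      <p>The parsing and validation logic can be sketched as follows; the dimension list mirrors the
criteria above, and the expected output format follows the prompt in Figure 1.</p>
      <preformat><![CDATA[
import re

DIMENSIONS = [
    "accuracy", "novelty", "interaction_adequacy", "explainability",
    "adaptability", "understanding", "response_quality", "attentiveness",
    "ease_of_use", "usefulness", "user_control", "transparency",
    "humanness", "rapport", "trust", "satisfaction", "behavioral_intention",
]

def parse_ratings(llm_output):
    """Return a complete {dimension: score} dict, or None if malformed."""
    scores = {}
    for name, value in re.findall(r"(\w+):\s*([1-5])\b", llm_output):
        if name in DIMENSIONS:
            scores[name] = int(value)
    # dialogues with incomplete or out-of-range ratings are excluded
    return scores if len(scores) == len(DIMENSIONS) else None

print(parse_ratings("accuracy: 4\nnovelty: 3"))  # None: incomplete annotation
]]></preformat>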
      <p>We configured GPT-4.1 nano with a temperature of 0.0 to ensure deterministic outputs and reduce
annotation variance. The entire annotation process was conducted in a single batch session to maintain
consistency. GPT-4.1 nano was selected for its cost-effectiveness while maintaining sufficient capabilities
for multi-dimensional assessment. The annotation prompt was embedded as a system message to
establish consistent evaluation criteria across all conversations.</p>
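      <p>For illustration, a single annotation request under these settings might look as follows, assuming
the OpenAI Python client; the actual pipeline submitted all requests through the batch API, and the
model identifier is an assumption to be checked against the provider's model list.</p>
      <preformat><![CDATA[
from openai import OpenAI

annotation_prompt = (
    "You are evaluating a conversation between a seeker and a recommender "
    "system. For each dimension, rate from 1 (poor) to 5 (excellent). "
    "Answer in this format: dimension_name: [score 1-5]"
)
transcript = "Seeker: Hi, any movie tips?\nRecommender: Do you like comedies?"

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4.1-nano",   # assumed identifier for GPT-4.1 nano
    temperature=0.0,        # deterministic outputs, as described above
    messages=[
        {"role": "system", "content": annotation_prompt},  # fixed criteria
        {"role": "user", "content": transcript},           # one dialogue
    ],
)
print(response.choices[0].message.content)
]]></preformat>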
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Exploratory Analysis</title>
        <p>We conducted an exploratory analysis of the ReDial dataset to better understand its conversational
structure and the diversity of recommendations it contains. Figure 2 illustrates the distribution of all
mentioned movies, revealing a pronounced long-tail pattern: a small number of movies are referenced
very frequently, while the majority appear only rarely. This concentration suggests a potential popularity
bias, where conversations are dominated by well-known or currently popular titles, such as many
Marvel movies. As a result, the dataset may be less suitable for evaluating CRS models that aim to
recommend diverse or niche items, since most dialogues revolve around a limited set of mainstream
movies.</p>
        <p>This popularity bias raises questions about the representativeness of user preferences and the diversity
of recommendations in ReDial. It is possible that recommenders in the dataset tend to suggest movies
they have recently watched or that are trending at the time, rather than exploring a broader range
of options. Figure 3 further highlights this effect by showing the top 10 most frequently mentioned
movies.</p>
        <p>We also analyzed the length of messages exchanged in the conversations. As shown in Figure 4, the
average message contains only 6.8 words, indicating that interactions are generally brief. Such short
messages may not provide sufficient context for effective preference elicitation, potentially limiting the
ability of CRS models to understand user needs and deliver personalized recommendations.</p>
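        <p>The counts behind these observations can be reproduced with a short script along the following
lines (field names again assumed from the released schema); note that movieMentions lists each movie
once per conversation, so the tally is a per-conversation proxy for mention frequency.</p>
        <preformat><![CDATA[
import json
from collections import Counter

mention_counts, msg_lengths = Counter(), []
with open("train_data.jsonl") as f:  # hypothetical local path
    for line in f:
        conv = json.loads(line)
        # one entry per movie per conversation: a proxy for popularity
        mention_counts.update(conv["movieMentions"].values())
        msg_lengths.extend(len(m["text"].split()) for m in conv["messages"])

print(mention_counts.most_common(10))       # top-10 most mentioned movies
print(sum(msg_lengths) / len(msg_lengths))  # mean message length (~6.8 words)
]]></preformat>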
        <p>
          Moreover, message length has implications for the use of automatic evaluation metrics. BLEU, being
precision-oriented, penalizes candidates that are much shorter than the reference due to the brevity
penalty, while ROUGE, which is recall-oriented, may yield lower scores for short candidates that miss
important n-grams from the reference. This relationship could help explain why recent studies, such as
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], have found that BLEU and ROUGE scores computed on ReDial conversations do not correlate well
with user satisfaction or conversational quality.
        </p>
        <p>Therefore, our exploratory analysis highlights several limitations of the ReDial dataset for user-centric
CRS evaluation, including popularity bias, limited diversity, and the brevity of conversational turns.
These factors should be considered when interpreting results and designing future benchmarks.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Annotation Results</title>
        <p>Our initial findings highlight several promising directions for future research in user-centric evaluation
of CRS. A primary avenue is the systematic comparison of LLM-based annotations with human ratings
through controlled user studies. While LLMs offer scalability, it remains unclear how well their
judgments reflect genuine user perceptions of conversational quality. Future work should design
experiments that directly compare LLM and human ratings across CRS-Que dimensions, such as
adaptability, rapport, and humanness, to establish the reliability and validity of automated annotation.
This will be essential for developing robust evaluation protocols that can be adopted by the broader
research community.</p>
        <p>Another important challenge is to move beyond traditional reference-based metrics like BLEU and
ROUGE, which often fail to capture the nuanced qualities of effective CRS interactions. One promising
approach is the creation of synthetic datasets composed of high-quality CRS responses, as identified by
LLM annotation. These datasets could serve as new benchmarks for evaluating conversational systems,
allowing researchers to investigate the correlation between automated metrics and user-centric quality.</p>
        <p>The evolution of LLM providers and models also presents an opportunity for longitudinal research.
As LLMs continue to improve, it will be important to monitor whether their assessments of
user-centric metrics like CRS-Que dimensions converge with or diverge from human judgments over time.
Comparative studies involving multiple LLMs and versions may show the stability and generalizability
of automated evaluation, and can inform best practices for using LLMs to benchmark fast or scalable
CRS models in production environments.</p>
        <p>Preference elicitation remains a critical open problem in CRS research. Our analysis of ReDial
suggests that users are often hesitant to declare their preferences, resulting in sparse or ambiguous
conversational data. Future work should explore conversational strategies and interface designs that
encourage users to share richer and more explicit feedback. This could involve adaptive questioning,
context-aware prompts, or the integration of external knowledge to support preference clarification.
Understanding how to effectively elicit preferences will not only improve recommendation quality but
also enhance the validity of user-centric evaluation.</p>
        <p>Finally, broader methodological questions remain regarding the integration of LLM-based evaluation
into the CRS development lifecycle. Researchers should investigate how automated annotation can
be combined with traditional user studies, crowdsourcing, and simulation to create comprehensive,
multi-layered evaluation frameworks. There is also a need to address potential biases in LLM annotation,
ensure transparency in evaluation processes, and develop guidelines for interpreting automated scores
in the context of real-world user satisfaction and trust.</p>
        <p>By pursuing these directions, future research can advance the development of richer, user-focused
evaluation protocols, improved datasets, and practical guidelines for authentic user-centric assessment
in conversational recommender systems. This will ultimately support the design of CRS that better
meet user needs and expectations in diverse application domains.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper presents our first steps toward a more user-centric evaluation of CRS. Our exploratory analysis
of the ReDial dataset revealed a pronounced popularity bias, limited diversity in recommendations,
and generally brief conversational turns, all of which constrain the dataset’s suitability for evaluating
nuanced CRS behaviors. Through multi-dimensional annotation of ReDial conversations using LLMs
and the CRS-Que framework, we identified significant limitations in both the dataset and prevailing
evaluation metrics. Our analysis shows that human conversations in ReDial often lack key qualities such
as adaptability, humanness, and rapport, while traditional metrics like BLEU and ROUGE fail to capture
these user-centric aspects. LLM-based annotation offers a scalable approach for multi-dimensional
assessment and could be used to test whether smaller, scalable models for production can achieve good
user-centric scores—assuming LLMs can reliably judge these qualities. However, accuracy cannot be
fully assessed at present due to the lack of ground truth. These findings underscore the need for richer
evaluation protocols and improved datasets that better reflect authentic conversational quality and user
experience. Further validation of LLM-based annotation against human ratings is required. Future
research should focus on developing comprehensive, user-focused evaluation frameworks, exploring
new benchmarks, and advancing methods for preference elicitation and conversational design. By
addressing these challenges, the CRS community can move toward systems that not only provide
accurate recommendations but also foster engaging, adaptive, and satisfying user interactions.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] D. Jannach, A. Manzoor, W. Cai, L. Chen, A Survey on Conversational Recommender Systems, ACM Comput. Surv. 54 (2021) 105:1-105:36. doi:10.1145/3453154.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] C. Gao, W. Lei, X. He, M. de Rijke, T.-S. Chua, Advances and challenges in conversational recommender systems: A survey, AI Open 2 (2021) 100-126. URL: https://www.sciencedirect.com/science/article/pii/S2666651021000164. doi:10.1016/j.aiopen.2021.06.002.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] X. Zhang, R. Xie, Y. Lyu, X. Xin, P. Ren, M. Liang, B. Zhang, Z. Kang, M. de Rijke, Z. Ren, Towards Empathetic Conversational Recommender Systems, in: Proceedings of the 18th ACM Conference on Recommender Systems, RecSys '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 84-93. doi:10.1145/3640457.3688133.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] L. Wang, S. Joty, W. Gao, X. Zeng, K.-F. Wong, Improving conversational recommender system via contextual and time-aware modeling with less domain-specific knowledge, IEEE Transactions on Knowledge and Data Engineering 36 (2024) 6447-6461. doi:10.1109/TKDE.2024.3397321.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] T. Yang, L. Chen, Unleashing the retrieval potential of large language models in conversational recommender systems, in: Proceedings of the 18th ACM Conference on Recommender Systems, RecSys '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 43-52. doi:10.1145/3640457.3688146.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Y. Jin, L. Chen, W. Cai, X. Zhao, CRS-Que: A user-centric evaluation framework for conversational recommender systems, ACM Trans. Recomm. Syst. 2 (2024). doi:10.1145/3631534.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Manzoor, S. C. Ziegler, K. M. P. Garcia, D. Jannach, ChatGPT as a conversational recommender system: A user-centric analysis, in: Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization, UMAP '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 267-272. doi:10.1145/3627043.3659574.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] R. Li, S. Kahou, H. Schulz, V. Michalski, L. Charlin, C. Pal, Towards deep conversational recommendations, in: Advances in Neural Information Processing Systems 31 (NIPS 2018), NIPS'18, Curran Associates Inc., Red Hook, NY, USA, 2018, pp. 9748-9758.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] C. Bauer, L. Chen, N. Ferro, N. Fuhr, Conversational Agents: A Framework for Evaluation (CAFE) (Dagstuhl Perspectives Workshop 24352), Dagstuhl Reports 14 (2025) 53-58. doi:10.4230/DagRep.14.8.53.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] N. Chen, Q. Dai, X. Dong, X.-M. Wu, Z. Dong, Large Language Models as Evaluators for Conversational Recommender Systems: Benchmarking System Performance from a User-Centric Perspective, 2025. doi:10.48550/arXiv.2501.09493. arXiv:2501.09493.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] K. Zhou, Y. Zhou, W. X. Zhao, X. Wang, J.-R. Wen, Towards topic-guided conversational recommender system, 2020. doi:10.48550/arXiv.2010.04125. arXiv:2010.04125.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] S. Guo, S. Zhang, W. Sun, P. Ren, Z. Chen, Z. Ren, Towards explainable conversational recommender systems, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 2786-2795. doi:10.1145/3539618.3591884.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] P. Pu, L. Chen, R. Hu, A user-centric evaluation framework for recommender systems, in: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 157-164. doi:10.1145/2043932.2043962.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Y. Jin, L. Chen, W. Cai, P. Pu, Key qualities of conversational recommender systems: From users' perspective, in: Proceedings of the 9th International Conference on Human-Agent Interaction, HAI '21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 93-102. doi:10.1145/3472307.3484164.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Yun, Y.-k. Lim, User Experience with LLM-powered Conversational Recommendation Systems: A Case of Music Recommendation, 2025. doi:10.1145/3706598.3713347. arXiv:2502.15229.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Wang, H. Lu, J. Caverlee, E. H. Chi, M. Chen, Large Language Models as Data Augmenters for Cold-Start Item Recommendation, in: Companion Proceedings of the ACM Web Conference 2024, WWW '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 726-729. doi:10.1145/3589335.3651532.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] N. Bernard, H. Joko, F. Hasibi, K. Balog, CRS Arena: Crowdsourced benchmarking of conversational recommender systems, in: Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining, WSDM '25, Association for Computing Machinery, New York, NY, USA, 2025, pp. 1028-1031. doi:10.1145/3701551.3704120.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] L. Chen, Q. Dai, Z. Zhang, X. Feng, M. Zhang, P. Tang, X. Chen, Y. Zhu, Z. Dong, RecUserSim: A realistic and diverse user simulator for evaluating conversational recommender systems, in: Companion Proceedings of the ACM Web Conference 2025, WWW '25, Association for Computing Machinery, New York, NY, USA, 2025, pp. 133-142. doi:10.1145/3701716.3715258.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] L. Zhu, X. Huang, J. Sang, How reliable is your simulator? Analysis on the limitations of current LLM-based user simulators for conversational recommendation, in: Companion Proceedings of the ACM Web Conference 2024, WWW '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 1726-1732. doi:10.1145/3589335.3651955.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] C. Huang, P. Qin, Y. Deng, W. Lei, J. Lv, T.-S. Chua, CONCEPT: An evaluation protocol on conversational recommender systems with system-centric and user-centric factors, 2024. doi:10.48550/arXiv.2404.03304. arXiv:2404.03304.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] L. Gienapp, T. Hagen, M. Fröbe, M. Hagen, B. Stein, M. Potthast, H. Scells, The viability of crowdsourcing for RAG evaluation, 2025. doi:10.48550/arXiv.2504.15689. arXiv:2504.15689.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] D. Jannach, C. Bauer, Escaping the McNamara fallacy: Towards more impactful recommender systems research, AI Magazine 41 (2020) 79-95. doi:10.1609/aimag.v41i4.5312.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] D. Jannach, Evaluating conversational recommender systems, Artificial Intelligence Review 56 (2023) 2365-2400. doi:10.1007/s10462-022-10229-x.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, Association for Computational Linguistics, USA, 2002, pp. 311-318. doi:10.3115/1073083.1073135.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74-81. URL: https://aclanthology.org/W04-1013/.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>