<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VideolandGPT: A User Study on a Conversational Recommender System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mateo Gutierrez Granada</string-name>
          <email>Mateo.Gutierrez.Granada@rtl.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dina Zilbershtein</string-name>
          <email>zilbershtein.dina@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daan Odijk</string-name>
          <email>Daan.Odijk@rtl.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Barile</string-name>
          <email>f.barile@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Maastricht University</institution>
          ,
          <addr-line>Maastricht</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>RTL Nederland B.V.</institution>
          ,
          <addr-line>Hilversum</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper investigates how large language models (LLMs) can enhance recommender systems, with a specific focus on Conversational Recommender Systems that leverage user preferences and personalised candidate selections from existing ranking models. We introduce VideolandGPT, a recommender system for Videoland, a Video-on-Demand (VOD) platform, which uses ChatGPT to select from a predetermined set of contents, considering the additional context indicated by users' interactions with a chat interface. We evaluate ranking metrics, user experience, and fairness of recommendations, comparing a personalised and a non-personalised version of the system in a between-subjects user study. Our results indicate that the personalised version outperforms the non-personalised one in terms of accuracy and general user satisfaction, while both versions increase the visibility of items that are not at the top of the recommendation lists. However, both versions present inconsistent behaviour in terms of fairness, as the system may generate recommendations which are not available on Videoland.</p>
      </abstract>
      <kwd-group>
        <kwd>ChatGPT</kwd>
        <kwd>Conversational Recommender Systems</kwd>
        <kwd>Video Recommendations</kwd>
        <kwd>Fairness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <sec id="sec-1-1">
        <title>1. Introduction</title>
        <p>Recommender systems have revolutionized various
industries such as e-commerce, media, and online
advertising by providing customized experiences based on users’
profiles and behaviors. Initially, content filtering was
used to match users based on their preferred categories
[1], but the development of collaborative filtering
techniques such as matrix factorization (MF) has enabled
more effective personalization [2, 3]. More recently, the
development of attention mechanisms that efficiently
connect encoder and decoder via Transformer blocks [4]
represented a significant advancement in neural
architectures, initially for natural language processing. The
emergence of Large Language Models (LLMs), such as
BERT [5], and subsequently GPT-3 and ChatGPT [6, 7, 8],
is a direct result of this breakthrough.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <p>As the Transformer architecture gained popularity in
other domains, recommender system scholars also saw
potential in the attention mechanism [9, 10], recognizing
the utility of sequential information [11, 12, 13].
Breakthroughs in NLP research continued with the addition
of new LLMs such as PaLM [14] and LLaMA [15]. These
advancements have not gone unnoticed by researchers
exploring LLMs as recommenders [16, 17, 18]. However,
LLMs are prone to hallucinations [19], along with factually
accurate but contextually inconsistent outcomes. Updating
the parametric knowledge base and accommodating input
token length are also significant challenges. Consequently,
modern research often sees LLMs as summarization and
reasoning engines rather than knowledge-based solutions
for recommender systems, despite efforts to merge these
approaches [20].</p>
    </sec>
    <sec id="sec-3">
      <p>This paper examines the impact of LLMs on a
recommender system that can converse and reason within the
users’ context, using their preferences and a set of
personalised candidates. The study involves users from RTL’s
Videoland (https://www.videoland.com/), the largest Dutch
video-on-demand (VOD) platform. The aim is to investigate,
through a user study, the user experience of personalised
recommendations in a conversational context, including
situations where users explicitly state their preferences
using natural language. We examine whether there is a
discernible difference between users’ personalised and
non-personalised LLM recommendations. Furthermore, we aim
to determine if users are exposed to titles beyond the top
ranking.</p>
      <p>In addition to its focus on recommendation accuracy
and performance, this study evaluates the safety and
fairness of recommendations generated by our proposed
Conversational Recommender System (CRS). We analyze
if the LLM adheres to fairness definitions proposed by the
research community [21]. Adopting the principle of
fairness as “no harm”, it becomes evident that recommending
items not accessible on the Videoland platform
undermines the platform’s interests by encouraging people
to find relevant content somewhere else. In this
context, our analysis prioritizes aligning our recommender
system with Videoland’s objectives and avoiding any
adverse impact on the platform’s operations and goals.</p>
    </sec>
    <sec id="sec-4">
      <title>2. VideolandGPT</title>
      <p>We evaluate our approach on a prototype conversational
recommender system for Videoland, which we detail in this
section. We base our prototype on the Ranking Model that
we presented in [13]. The architecture used to integrate
the Ranking Model with the LLM’s knowledge and
capabilities is illustrated in Figure 1.</p>
      <p>In this architecture, the Ranking Model is considered
a critical component of the solution and also a modular
building block that can be replaced as needed. In our
case, the model is an ensemble comprising a matrix
factorization component [3] and a neural component [13],
which utilizes the attention mechanism and sequential
information in the Interaction History. Our Ranking Model
retrieves the top 300 titles for each user, reducing the
catalog by approximately 90%. We believe this number
achieves a balance between relevance and discoverability.</p>
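      <p>The ensemble scoring above is not published in implementation form; as a minimal sketch, assuming dictionary-based per-title scores, an equal default weighting, and the illustrative names blend_scores and top_candidates, the candidate selection could look like this:</p>

```python
from typing import Dict, List

def blend_scores(mf_scores: Dict[str, float],
                 neural_scores: Dict[str, float],
                 mf_weight: float = 0.5) -> Dict[str, float]:
    """Combine matrix-factorization and neural scores per title;
    a title missing from one component contributes 0 from it."""
    titles = set(mf_scores) | set(neural_scores)
    return {t: mf_weight * mf_scores.get(t, 0.0)
               + (1.0 - mf_weight) * neural_scores.get(t, 0.0)
            for t in titles}

def top_candidates(mf_scores: Dict[str, float],
                   neural_scores: Dict[str, float],
                   k: int = 300) -> List[str]:
    """Return the k highest-scoring titles as the LLM's candidate list."""
    blended = blend_scores(mf_scores, neural_scores)
    return sorted(blended, key=blended.get, reverse=True)[:k]
```

      <p>Swapping in a different ranker only requires replacing the score dictionaries, which reflects the modular role of the Ranking Model noted above.</p>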
      <p>Our Natural Language Prompt is created to give
precise instructions to the LLM Chat model to recommend
titles from Videoland’s candidates. We specify that the
model should retrieve three items and provide
explanations for each recommendation to improve explainability.
An example of the prompt is illustrated in Figure 2, which
takes a candidate list of items sorted based on particular
criteria and the user profile for which the
recommendations are intended. This approach enables flexibility in
accommodating various ranking methods.</p>
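      <p>The prompt assembly can be sketched as follows; the exact wording of our prompt is shown in Figure 2, so the template text below is only an illustrative assumption:</p>

```python
def build_prompt(candidates, user_profile, n_items=3):
    """Assemble an instruction for the LLM Chat model: recommend
    n_items titles drawn only from the ranked candidate list, each
    with a short explanation."""
    catalogue = "\n".join(f"- {title}" for title in candidates)
    return (
        "You are a recommender for the Videoland platform.\n"
        f"User profile: {user_profile}\n"
        f"Recommend exactly {n_items} titles, chosen only from the "
        "candidate list below, and explain each recommendation.\n"
        f"Candidates (ranked):\n{catalogue}"
    )
```

      <p>Because the candidate list is passed in as an argument, any ranking method can feed the same template, matching the flexibility described above.</p>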
      <p>The LLM Chat is designed to suggest a list of items
that best matches the user’s query and the candidate list
of recommendations. As the conversation progresses, the
user can either accept a recommendation or give feedback
to the system to refine their discovery preferences. The
user can also request new titles, ask for explanations for a
particular recommendation, or seek further information
related to it. We test our prototype with gpt-3.5-turbo as
the LLM.</p>
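      <p>As a hedged sketch of this conversation flow (not the deployed code), each turn appends to a message history that is re-sent to the chat model; the llm argument stands in for a call to gpt-3.5-turbo and is injected so the flow can be exercised without the API:</p>

```python
def chat_turn(history, user_message, llm):
    """Run one turn of the conversation: append the user's message,
    query the chat model, and append its reply so later refinement
    requests keep the full dialogue context."""
    messages = history + [{"role": "user", "content": user_message}]
    reply = llm(messages)  # e.g. a thin wrapper around gpt-3.5-turbo
    return messages + [{"role": "assistant", "content": reply}]
```
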
      <p>The post-processing step serves two critical functions.
First, it enriches the LLM Chat’s response with relevant
metadata, such as the title’s artwork and a direct link
to stream it on Videoland. This additional information
enhances the user’s experience and makes it easier to
access and enjoy recommended titles. Second, the
post-processing step acts as a safeguard to remove any
recommended title that is not directly aligned with the
candidates for recommendation. In our experiment, we
intentionally omitted this safeguard to examine the potential
impact on platform fairness of not removing any
recommended title that is offered by other platforms.</p>
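      <p>A minimal sketch of this post-processing step, assuming hypothetical metadata fields artwork and url and an enforce_safeguard flag (the flag models the safeguard that was deliberately disabled in the experiment):</p>

```python
def postprocess(llm_titles, candidates, metadata, enforce_safeguard=True):
    """Enrich each recommended title with display metadata and,
    when the safeguard is on, drop titles outside the candidate list."""
    allowed = set(candidates)
    enriched = []
    for title in llm_titles:
        if enforce_safeguard and title not in allowed:
            continue  # title is not a candidate: discard it
        info = metadata.get(title, {})
        enriched.append({"title": title,
                         "artwork": info.get("artwork"),
                         "url": info.get("url")})
    return enriched
```
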
      <sec id="sec-4-1">
        <title>3. User Study</title>
        <p>We conduct a small-scale user study to evaluate the
performance of the recommender system. We compare two
versions of the recommender system: a personalised
version, based on users’ recommendations, and a
non-personalised one, based on the most popular titles. The
study aims to assess user satisfaction and platform fairness
aspects, and to answer the main research questions: How
can LLMs enhance (our) recommender systems? Can
such a system converse and reason within the user’s
context, using their preferences and a set of personalised
candidates? Is a personalised chat-based recommender
system perceived to be more enjoyable and more relevant
compared to its non-personalised counterpart?</p>
        <p>In a separate study, Radensky et al. [22] examined
the impact of confidence signal patterns on user trust
and reliance in a music CRS. Their research inspired our
evaluation approach, although our study covers broader
aspects beyond confidence signals.</p>
        <p>Participants The assignment of random groups was
done prior to the study. The participants comprised
employees within RTL. In total, 27 out of 42 invited
participants took part in the study, with ages ranging
from 26 to 48; 35% of them were women. Participation in the
survey was voluntary, and the employees had not
previously interacted with VideolandGPT. The sole
requirement for participation was that the respondents must
have watched at least one title on Videoland within the
last 6 months to have personalised recommendations.</p>
        <p>Experiment Protocol The experiment’s design is
presented in Figure 3. All study participants were explicitly
requested to engage with the system in English throughout
the study. Following this, the respondents were randomly
divided into two groups. Participants, unaware of the
version they were using, engaged with either a personalized
or non-personalized VideolandGPT, the latter featuring
top popular titles from Videoland’s collection, ensuring
unbiased results.</p>
        <p>Each participant was assigned a set of five tasks with
the specific structures provided for each of them to ensure
a more standardized evaluation process. Descriptions of
the tasks are provided in Table 1. However, participants
were informed they could use their own words during
interactions with the system, promoting natural
conversation. The study was conducted online over a designated
four-day period, offering convenience and flexibility to
participants.</p>
        <p>Assessment of the system was based on diverse forms
of describing users’ preferences, which included previously
loved titles, topics, current or desired emotions, preferred
company for movie-watching, and free-form requests.
During the conversations, users had the opportunity to
request the system to refine the recommendations twice,
resulting in a maximum exposure to 9 items per task. Each
task was completed in separate instances of the same
version of the recommender, ensuring an isolated
examination.</p>
        <p>At the end of each task, respondents specified the title
they considered the most relevant recommendation for
them or stated that they did not receive a satisfactory
recommendation. This feedback was used to understand
VideolandGPT’s recommendation capabilities, accuracy
and fairness to the platform of the recommender.
Furthermore, because the participants were not exposed to
VideolandGPT previously and to ensure the experiment’s
integrity, the order of the tasks was changed every five
collected responses. By varying the task order, we sought
to avoid any systematic influence on participants’
responses, ensuring that the respondents’ reactions to the
tasks remained impartial and unaffected by the sequence
in which they were presented.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Descriptions of the tasks and their suggested initial prompts.</p></caption>
          <table>
            <thead>
              <tr><th>Task</th><th>Suggested initial prompt</th></tr>
            </thead>
            <tbody>
              <tr><td>Title</td><td>Show me the most relevant titles considering that I like &lt;TITLE&gt;.</td></tr>
              <tr><td>Topic</td><td>Show me the most relevant titles based on my passion for &lt;TOPIC&gt;.</td></tr>
              <tr><td>Emotion</td><td>Show me the most relevant titles that will make me feel &lt;EMOTION/DESIRE&gt;.</td></tr>
              <tr><td>Context</td><td>Show me the most relevant titles to watch with &lt;GF/BF/SON/FRIEND&gt; on a &lt;DAY OF THE WEEK and/or EVENING/AFTERNOON/MORNING&gt;.</td></tr>
              <tr><td>Free</td><td>&lt;Ask for 3 items to be recommended in any form you would like.&gt;</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>After completing the tasks, participants were directed
to fill out the questionnaire. The results of the Likert
questions are presented in Figure 4. Moreover, participants
were asked to rank the tasks based on their satisfaction
with the conversation with the recommender and were
encouraged to provide any additional feedback they had
regarding its use. In addition, participants were asked
about their native language, to explore any potential
correlation between the quality of recommendations and
their language background. This question was particularly
relevant, as Videoland’s collection primarily consists of
content in Dutch (57% of the titles, accounting for 63% of
the total available minutes).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Evaluation</title>
      <p>We evaluate this study both quantitatively and
qualitatively by analyzing the data collected from the logs of the
conversation and the received questionnaire answers.</p>
      <p>It is important to note that not all responses from the
conversations yielded usable data due to various reasons
such as incomplete or ambiguous queries. As a result,
we obtained 50 valid observations for each version of the
recommender (five per respondent, one for each task).</p>
      <p>Difference between two versions of the conversational
recommender Table 2 presents the metrics used to evaluate
both versions. We measured accuracy and relevance of
recommendations by allowing participants to interact with
3, 6, or 9 recommended titles (with 8% of sessions
interacting with other numbers &lt; 9) in our experiment. We
evaluated the recommendations’ performance using nDCG@9
and HR@9 metrics, considering all participants regardless
of the number of titles they interacted with.</p>
      <p>The personalised framework demonstrated a 10%
relative improvement over the non-personalised version in
all tasks, highlighting the effectiveness of chat-based
recommendations in improving user satisfaction and relevance
in our research context.</p>
      <p>To assess the fairness of the recommender system to the
platform and its compliance with the rules, we measured
the proportion of recommended and chosen titles that were
in the candidate list and the chosen titles that were not on
the candidate list. Moreover, we consider as a measure of
efficiency the number of unique titles recommended per
user. While the personalised recommender outperformed
in relevance metrics, our examination revealed
inconsistencies in fairness metrics. For both recommenders, over
22% of tasks had user-selected recommendations that were
not available on Videoland, suggesting that the system
occasionally generated recommendations beyond the
platform’s content availability, despite our attempts to control
it.</p>
      <p>Finally, the results presented in Figure 5 indicate how
often users choose titles beyond the top ranking items. Our
findings demonstrate that having a large pool of candidates
is valuable, as users frequently select titles from across the
entire range of recommendations.</p>
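      <p>For reference, nDCG@9 and HR@9 with binary relevance can be computed as below; this is a standard textbook sketch, not the evaluation code used in the study:</p>

```python
import math

def ndcg_at_k(ranked, relevant, k=9):
    """Binary-relevance nDCG@k for one session's ranked titles."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, title in enumerate(ranked[:k]) if title in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

def hr_at_k(ranked, relevant, k=9):
    """Hit rate@k: 1.0 if any relevant title appears in the top k."""
    return 1.0 if any(title in relevant for title in ranked[:k]) else 0.0
```
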
      <p>Table 2 reports, for the personalised and the
non-personalised version: Recommended in Candidates, Chosen in
Candidates, Chosen but not in Candidates, and Unique
Titles per User.</p>
      <p>Overall experience of using a conversational
recommender In the second phase of our evaluation, we analyzed
the feedback received from the questionnaire. The metrics
substantiated the results, revealing a statistically significant
positive correlation (Pearson coefficient of 0.26) between
quantitative metrics like nDCG@9 and qualitative metrics
like users’ task rankings. For instance, the Title task was
preferred by 30% and 60% of participants for the
personalised and non-personalised versions, respectively, in their
rankings, aligning with corresponding nDCG scores. These
findings endorse our metrics’ effectiveness in capturing user
preferences and judgments. However, it is important to
note that this difference, while notable, is not statistically
significant due to the relatively small sample size of
participants. Consequently, providing an explanation for why
the non-personalised version performed better on this task
is challenging and requires further investigation.</p>
      <p>The answers to the Likert questions indicate that a
comparable proportion of respondents agreed or strongly
agreed with three or more statements for both versions of
the recommender system (70% for personalised and 60% for
non-personalised). However, there is a notable difference:
40% of respondents in the personalised version expressed
agreement with all statements, while only 10% did so in the
non-personalised version. This suggests that the
personalised version garnered a higher percentage of highly
satisfied users. Additionally, we can observe that the
non-personalised version elicited more neutral responses from
the participants, suggesting a more mixed perception of its
recommendations.</p>
      <p>The findings from the open-ended questions shed light
on user perceptions of the recommender system’s
experience. Notably, 80% of users perceived the personalised
version as enjoyable, even when their specific requests were
not entirely met. In contrast, 60% of users found the
non-personalised version enjoyable despite similar
circumstances. The respondents also mentioned that this
recommender “could bring added value to the Videoland
experience”. A common piece of feedback from participants
who expressed dissatisfaction with their experience was the
unavailability of relevant titles on the platform.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Discussion and Conclusion</title>
      <p>Our study demonstrated that the personalised
recommender outperformed the non-personalised version by
delivering more relevant recommendations to users. However,
it is important to recognise that both versions of the
recommender still, in some cases, suggested titles that were not
available on the platform, contrary to our initial
expectations. This aspect highlights the need for further
improvements and considerations in ensuring system consistency.
Despite this drawback, the study shed light on the potential
of personalised chat-based recommendations to improve
user satisfaction and relevance, offering valuable insights
for future developments in recommender systems.</p>
      <p>Limitations of the study include a primarily
Dutch-speaking sample (65% of all of the participants) due to the
platform catering to a Dutch-speaking population, a limited
sample size, and the need to consider privacy and user
preferences when implementing conversational recommender
systems. Furthermore, if users explicitly share personal
details with a conversational recommender system, it could
impact their comfort in utilizing the system. Safeguards
must be in place to ensure safety and prevent users from
exploiting the system.</p>
      <p>In conclusion, the study emphasizes the potential of
personalised chat-based recommendations to enhance user
experience, but further research is required to develop a
safer mechanism for LLM usage, ensuring adherence to
rules and understanding potential unfair scenarios.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[1] M. Naumov, D. Mudigere, H.-J. M. Shi, J. Huang,
N. Sundaraman, J. Park, X. Wang, U. Gupta, C.-J. Wu,
A. G. Azzolini, et al., Deep learning recommendation model
for personalization and recommendation systems, arXiv
preprint arXiv:1906.00091 (2019).</p>
      <p>[2] Y. Koren, R. Bell, C. Volinsky, Matrix factorization
techniques for recommender systems, Computer 42 (2009)
30–37.</p>
      <p>[3] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering
for implicit feedback datasets, in: 2008 Eighth IEEE
International Conference on Data Mining, IEEE, 2008, pp.
263–272.</p>
      <p>[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is
all you need, Advances in Neural Information Processing
Systems 30 (2017).</p>
      <p>[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
Pre-training of deep bidirectional transformers for language
understanding, in: Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), Association for
Computational Linguistics, Minneapolis, Minnesota, 2019, pp.
4171–4186.</p>
      <p>[6] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever,
Improving language understanding by generative
pre-training, OpenAI Technical Report (2018).</p>
      <p>[7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,
I. Sutskever, Language models are unsupervised multitask
learners, OpenAI Blog 1 (2019) 9.</p>
      <p>[8] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D.
Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry,
A. Askell, et al., Language models are few-shot learners,
Advances in Neural Information Processing Systems 33
(2020) 1877–1901.</p>
      <p>[9] T. Donkers, B. Loepp, J. Ziegler, Sequential user-based
recurrent neural network recommendations, in: Proceedings
of the Eleventh ACM Conference on Recommender Systems,
2017, pp. 152–160.</p>
      <p>[10] W.-C. Kang, J. McAuley, Self-attentive sequential
recommendation, in: 2018 IEEE International Conference
on Data Mining (ICDM), 2018, pp. 197–206.</p>
      <p>[11] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang,
BERT4Rec: Sequential recommendation with bidirectional
encoder representations from transformer, in: Proceedings
of the 28th ACM International Conference on Information
and Knowledge Management, 2019, pp. 1441–1450.</p>
      <p>[12] Q. Chen, H. Zhao, W. Li, P. Huang, W. Ou, Behavior
sequence transformer for e-commerce recommendation in
Alibaba, in: Proceedings of the 1st International Workshop
on Deep Learning Practice for High-Dimensional Sparse
Data, 2019, pp. 1–4.</p>
      <p>[13] M. Gutierrez Granada, D. Odijk, Recommendations
at Videoland, in: Proceedings of the 15th ACM Conference
on Recommender Systems, RecSys ’21, Association for
Computing Machinery, New York, NY, USA, 2021, pp.
580–582.</p>
      <p>[14] A. Chowdhery, S. Narang, J. Devlin, M. Bosma,
G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton,
S. Gehrmann, et al., PaLM: Scaling language modeling with
pathways, arXiv preprint arXiv:2204.02311 (2022).</p>
      <p>[15] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A.
Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro,
F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample,
LLaMA: Open and efficient foundation language models,
arXiv preprint arXiv:2302.13971 (2023).</p>
      <p>[16] J. Li, W. Zhang, T. Wang, G. Xiong, A. Lu, G. G.
Medioni, GPT4Rec: A generative framework for personalized
recommendation and user interests interpretation, ArXiv
abs/2304.03879 (2023).</p>
      <p>[17] J. Liu, C. Liu, R. Lv, K. Zhou, Y. B. Zhang, Is ChatGPT
a good recommender? A preliminary study, ArXiv
abs/2304.10149 (2023).</p>
      <p>[18] Z. Cui, J. Ma, C. Zhou, J. Zhou, H. Yang, M6-Rec:
Generative pretrained language models are open-ended
recommender systems, ArXiv abs/2205.08084 (2022).</p>
      <p>[19] P. P. Ray, ChatGPT: A comprehensive review on
background, applications, key challenges, bias, ethics,
limitations and future scope, Internet of Things and
Cyber-Physical Systems 3 (2023) 121–154.</p>
      <p>[20] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, X. Wu,
Unifying large language models and knowledge graphs: A
roadmap, ArXiv abs/2306.08302 (2023).</p>
      <p>[21] J. J. Smith, L. Beattie, H. Cramer, Scoping fairness
objectives and identifying fairness metrics for recommender
systems: The practitioners’ perspective, in: Proceedings of
the ACM Web Conference 2023, WWW ’23, Association for
Computing Machinery, New York, NY, USA, 2023, pp.
3648–3659.</p>
      <p>[22] M. Radensky, J. A. Séguin, J. S. Lim, K. Olson,
R. Geiger, “I think you might like this”: Exploring effects of
confidence signal patterns on trust in and reliance on
conversational recommender systems, in: Proceedings of the
2023 ACM Conference on Fairness, Accountability, and
Transparency, 2023, pp. 792–804.</p>
      <p>[23] D. C. Toader, G. D. Boca, R. Toader, M. Macelaru,
C. Toader, D. S. Ighian, A. T. G. Rădulescu, The effect of
social presence and chatbot errors on trust, Sustainability
(2019).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>