<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VideolandGPT: A User Study on a Conversational Recommender System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mateo Gutierrez Granada</string-name>
          <email>Mateo.Gutierrez.Granada@rtl.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dina Zilbershtein</string-name>
          <email>zilbershtein.dina@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daan Odijk</string-name>
          <email>Daan.Odijk@rtl.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Barile</string-name>
          <email>f.barile@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Maastricht University</institution>
          ,
          <addr-line>Maastricht</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>RTL Nederland B.V.</institution>
          ,
          <addr-line>Hilversum</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper investigates how large language models (LLMs) can enhance recommender systems, with a specific focus on Conversational Recommender Systems that leverage user preferences and personalised candidate selections from existing ranking models. We introduce VideolandGPT, a recommender system for Videoland, a Video-on-Demand (VOD) platform, which uses ChatGPT to select from a predetermined set of contents, considering the additional context indicated by users' interactions with a chat interface. We evaluate ranking metrics, user experience, and fairness of recommendations, comparing a personalised and a non-personalised version of the system in a between-subjects user study. Our results indicate that the personalised version outperforms the non-personalised one in terms of accuracy and general user satisfaction, while both versions increase the visibility of items that are not at the top of the recommendation lists. However, both versions present inconsistent behaviour in terms of fairness, as the system may generate recommendations which are not available on Videoland.</p>
      </abstract>
      <kwd-group>
        <kwd>ChatGPT</kwd>
        <kwd>Conversational Recommender Systems</kwd>
        <kwd>Video Recommendations</kwd>
        <kwd>Fairness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <sec id="sec-1-1">
        <title>1. Introduction</title>
        <p>Recommender systems have revolutionized various
industries such as e-commerce, media, and online
advertising by providing customized experiences based on users’
profiles and behaviors. Initially, content filtering was
used to match users based on their preferred categories
[1], but the development of collaborative filtering
techniques such as matrix factorization (MF) has enabled
more effective personalization [2, 3]. More recently, the
development of attention mechanisms that efficiently
connect encoder and decoder via Transformer blocks [4]
represented a significant advancement in neural
architectures, initially for natural language processing. The
emergence of Large Language Models (LLMs), such as
BERT [5], and subsequently GPT-3 and ChatGPT [6, 7, 8],
is a direct result of this breakthrough.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <p>As the Transformer architecture gained popularity in
other domains, recommender system scholars also saw
potential in the attention mechanism [9, 10], recognizing
the utility of sequential information [11, 12, 13].
Breakthroughs in NLP research continued with the addition
of new LLMs such as PaLM [14] and LLaMA [15]. These
advancements have not gone unnoticed by researchers
exploring LLMs as recommenders [16, 17, 18]. However,
LLMs are prone to hallucinations [19], along with factually
accurate but contextually inconsistent outcomes. Updating
the parametric knowledge base and accommodating input
token length are also significant challenges. Consequently,
modern research often sees LLMs as summarization and
reasoning engines rather than knowledge-based solutions
for recommender systems, despite efforts to merge these
approaches [20].</p>
    </sec>
    <sec id="sec-3">
      <p>This paper examines the impact of LLMs on a
recommender system that can converse and reason within the
users’ context, using their preferences and a set of
personalised candidates. The study involves users from RTL’s
Videoland (https://www.videoland.com/), the largest Dutch
video-on-demand (VOD) platform. The aim is to investigate,
through a user study, the user experience of personalised
recommendations in a conversational context, including
situations where users explicitly state their preferences
using natural language. We examine whether there is a
discernible difference between users’ personalised and
non-personalised LLM recommendations. Furthermore, we aim
to determine if users are exposed to titles beyond the top
ranking.</p>
      <p>In addition to its focus on recommendation accuracy
and performance, this study evaluates the safety and
fairness of recommendations generated by our proposed
Conversational Recommender System (CRS). We analyze
if the LLM adheres to fairness definitions proposed by the
research community [21]. Adopting the principle of
fairness as “no harm”, it becomes evident that recommending
items not accessible on the Videoland platform
undermines the platform’s interests by encouraging people
to find relevant content somewhere else. In this
context, our analysis prioritizes aligning our recommender
system with Videoland’s objectives and avoiding any
adverse impact on the platform’s operations and goals.</p>
    </sec>
    <sec id="sec-4">
      <title>2. VideolandGPT</title>
      <p>We evaluate our approach on a prototype conversational
recommender system for Videoland, which we detail in this
section. We base our prototype on the Ranking Model that
we presented in [13]. The architecture used to integrate
the Ranking Model with the LLM’s knowledge and
capabilities is illustrated in Figure 1.</p>
      <p>In this architecture, the Ranking Model is considered
a critical component of the solution and also a modular
building block that can be replaced as needed. In our
case, the model is an ensemble comprising a matrix
factorization component [3] and a neural component [13],
which utilizes the attention mechanism and sequential
information in the Interaction History. Our Ranking Model
retrieves the top 300 titles for each user, reducing the
catalog by approximately 90%. We believe this number
achieves a balance between relevance and discoverability.</p>
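      <p>The ensemble scoring above is not published in implementation form; as a minimal sketch, assuming dictionary-based per-title scores, an equal default weighting, and the illustrative names blend_scores and top_candidates, the candidate selection could look like this:</p>

```python
from typing import Dict, List

def blend_scores(mf_scores: Dict[str, float],
                 neural_scores: Dict[str, float],
                 mf_weight: float = 0.5) -> Dict[str, float]:
    """Combine matrix-factorization and neural scores per title;
    a title missing from one component contributes 0 from it."""
    titles = set(mf_scores) | set(neural_scores)
    return {t: mf_weight * mf_scores.get(t, 0.0)
               + (1.0 - mf_weight) * neural_scores.get(t, 0.0)
            for t in titles}

def top_candidates(mf_scores: Dict[str, float],
                   neural_scores: Dict[str, float],
                   k: int = 300) -> List[str]:
    """Return the k highest-scoring titles as the LLM's candidate list."""
    blended = blend_scores(mf_scores, neural_scores)
    return sorted(blended, key=blended.get, reverse=True)[:k]
```

      <p>Swapping in a different ranker only requires replacing the score dictionaries, which reflects the modular role of the Ranking Model noted above.</p>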
      <p>Our Natural Language Prompt is created to give
precise instructions to the LLM Chat model to recommend
titles from Videoland’s candidates. We specify that the
model should retrieve three items and provide
explanations for each recommendation to improve explainability.
An example of the prompt is illustrated in Figure 2, which
takes a candidate list of items sorted based on particular
criteria and the user profile for which the
recommendations are intended. This approach enables flexibility in
accommodating various ranking methods.</p>
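      <p>The prompt assembly can be sketched as follows; the exact wording of our prompt is shown in Figure 2, so the template text below is only an illustrative assumption:</p>

```python
def build_prompt(candidates, user_profile, n_items=3):
    """Assemble an instruction for the LLM Chat model: recommend
    n_items titles drawn only from the ranked candidate list, each
    with a short explanation."""
    catalogue = "\n".join(f"- {title}" for title in candidates)
    return (
        "You are a recommender for the Videoland platform.\n"
        f"User profile: {user_profile}\n"
        f"Recommend exactly {n_items} titles, chosen only from the "
        "candidate list below, and explain each recommendation.\n"
        f"Candidates (ranked):\n{catalogue}"
    )
```

      <p>Because the candidate list is passed in as an argument, any ranking method can feed the same template, matching the flexibility described above.</p>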
      <p>The LLM Chat is designed to suggest a list of items
that best matches the user’s query and the candidate list
of recommendations. As the conversation progresses, the
user can either accept a recommendation or give feedback
to the system to refine their discovery preferences. The
user can also request new titles, ask for explanations for a
particular recommendation, or seek further information
related to it. We test our prototype with gpt-3.5-turbo as
the LLM.</p>
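      <p>As a hedged sketch of this conversation flow (not the deployed code), each turn appends to a message history that is re-sent to the chat model; the llm argument stands in for a call to gpt-3.5-turbo and is injected so the flow can be exercised without the API:</p>

```python
def chat_turn(history, user_message, llm):
    """Run one turn of the conversation: append the user's message,
    query the chat model, and append its reply so later refinement
    requests keep the full dialogue context."""
    messages = history + [{"role": "user", "content": user_message}]
    reply = llm(messages)  # e.g. a thin wrapper around gpt-3.5-turbo
    return messages + [{"role": "assistant", "content": reply}]
```
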
      <p>The post-processing step serves two critical functions.
First, it enriches the LLM Chat’s response with relevant
metadata, such as the title’s artwork and a direct link
to stream it on Videoland. This additional information
enhances the user’s experience and makes it easier to
access and enjoy recommended titles. Second, the
post-processing step acts as a safeguard to remove any
recommended title that is not directly aligned with the
candidates for recommendation. In our experiment, we
intentionally omitted this safeguard to examine the potential
impact on platform fairness of not removing any
recommended title that is offered by other platforms.</p>
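      <p>A minimal sketch of this post-processing step, assuming hypothetical metadata fields artwork and url and an enforce_safeguard flag (the flag models the safeguard that was deliberately disabled in the experiment):</p>

```python
def postprocess(llm_titles, candidates, metadata, enforce_safeguard=True):
    """Enrich each recommended title with display metadata and,
    when the safeguard is on, drop titles outside the candidate list."""
    allowed = set(candidates)
    enriched = []
    for title in llm_titles:
        if enforce_safeguard and title not in allowed:
            continue  # title is not a candidate: discard it
        info = metadata.get(title, {})
        enriched.append({"title": title,
                         "artwork": info.get("artwork"),
                         "url": info.get("url")})
    return enriched
```
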
      <sec id="sec-4-1">
        <title>3. User Study</title>
        <p>We conduct a small-scale user study to evaluate the
performance of the recommender system. We compare two
versions of the recommender system: a personalised
version, based on users’ recommendations, and a
non-personalised one, based on the most popular titles. The
study aims to assess user satisfaction and platform fairness
aspects, and to answer the main research questions: How
can LLMs enhance (our) recommender systems? Can
such a system converse and reason within the user’s
context, using their preferences and a set of personalised
candidates? Is a personalised chat-based recommender
system perceived to be more enjoyable and more relevant
compared to its non-personalised counterpart?</p>
        <p>In a separate study, Radensky et al. [22] examined
the impact of confidence signal patterns on user trust
and reliance in a music CRS. Their research inspired our
evaluation approach, although our study covers broader
aspects beyond confidence signals.</p>
        <p>Participants The assignment of random groups was
done prior to the study. The participants comprised
employees within RTL. In total, 27 out of 42 invited
participants took part in the study, with ages ranging
from 26 to 48; 35% of them were women. Participation in the
survey was voluntary, and the employees had not
previously interacted with VideolandGPT. The sole
requirement for participation was that the respondents must
have watched at least one title on Videoland within the
last 6 months to have personalised recommendations.</p>
        <p>Experiment Protocol The experiment’s design is
presented in Figure 3. All study participants were explicitly
requested to engage with the system in English throughout
the study. Following this, the respondents were randomly
divided into two groups. Participants, unaware of the
version they were using, engaged with either a personalized
or non-personalized VideolandGPT, the latter featuring
top popular titles from Videoland’s collection, ensuring
unbiased results.</p>
        <p>Each participant was assigned a set of five tasks with
the specific structures provided for each of them to ensure
a more standardized evaluation process. Descriptions of
the tasks are provided in Table 1. However, participants
were informed they could use their own words during
interactions with the system, promoting natural
conversation. The study was conducted online over a designated
four-day period, offering convenience and flexibility to
participants.</p>
        <p>Assessment of the system was based on diverse forms
of describing users’ preferences, which included previously
loved titles, topics, current or desired emotions, preferred
company for movie-watching, and free-form requests.
During the conversations, users had the opportunity to
request the system to refine the recommendations twice,
resulting in a maximum exposure to 9 items per task. Each
task was completed in separate instances of the same
version of the recommender, ensuring an isolated
examination.</p>
        <p>At the end of each task, respondents specified the title
they considered the most relevant recommendation for
them or stated that they did not receive a satisfactory
recommendation. This feedback was used to understand
VideolandGPT’s recommendation capabilities, accuracy
and fairness to the platform of the recommender.
Furthermore, because the participants were not exposed to
VideolandGPT previously and to ensure the experiment’s
integrity, the order of the tasks was changed every five
collected responses. By varying the task order, we sought
to avoid any systematic influence on participants’
responses, ensuring that the respondents’ reactions to the
tasks remained impartial and unaffected by the sequence
in which they were presented.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Descriptions of the tasks and their suggested initial prompts.</p></caption>
          <table>
            <thead>
              <tr><th>Task</th><th>Suggested initial prompt</th></tr>
            </thead>
            <tbody>
              <tr><td>Title</td><td>Show me the most relevant titles considering that I like &lt;TITLE&gt;.</td></tr>
              <tr><td>Topic</td><td>Show me the most relevant titles based on my passion for &lt;TOPIC&gt;.</td></tr>
              <tr><td>Emotion</td><td>Show me the most relevant titles that will make me feel &lt;EMOTION/DESIRE&gt;.</td></tr>
              <tr><td>Context</td><td>Show me the most relevant titles to watch with &lt;GF/BF/SON/FRIEND&gt; on a &lt;DAY OF THE WEEK and/or EVENING/AFTERNOON/MORNING&gt;.</td></tr>
              <tr><td>Free</td><td>&lt;Ask for 3 items to be recommended in any form you would like.&gt;</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>After completing the tasks, participants were directed
to fill out the questionnaire. The results of the Likert
questions are presented in Figure 4. Moreover, participants
were asked to rank the tasks based on their satisfaction
with the conversation with the recommender and were
encouraged to provide any additional feedback they had
regarding its use. In addition, participants were asked
about their native language, to explore any potential
correlation between the quality of recommendations and
their language background. This question was particularly
relevant, as Videoland’s collection primarily consists of
content in Dutch (57% of the titles, accounting for 63% of
the total available minutes).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Evaluation</title>
      <p>We evaluate this study both quantitatively and
qualitatively by analyzing the data collected from the logs of the
conversation and the received questionnaire answers.</p>
      <p>It is important to note that not all responses from the
conversations yielded usable data due to various reasons
such as incomplete or ambiguous queries. As a result,
we obtained 50 valid observations for each version of the
recommender (five per respondent, one for each task).</p>
      <p>Difference between two versions of the conversational
recommender Table 2 presents the metrics used to evaluate
both versions. We measured accuracy and relevance of
recommendations by allowing participants to interact with
3, 6, or 9 recommended titles (with 8% of sessions
interacting with other numbers &lt; 9) in our experiment. We
evaluated the recommendations’ performance using nDCG@9
and HR@9 metrics, considering all participants regardless
of the number of titles they interacted with.</p>
      <p>The personalised framework demonstrated a 10%
relative improvement over the non-personalised version in
all tasks, highlighting the effectiveness of chat-based
recommendations in improving user satisfaction and relevance
in our research context.</p>
      <p>To assess the fairness of the recommender system to the
platform and its compliance with the rules, we measured
the proportion of recommended and chosen titles that were
in the candidate list and the chosen titles that were not on
the candidate list. Moreover, we consider as a measure of
efficiency the number of unique titles recommended per
user. While the personalised recommender outperformed
in relevance metrics, our examination revealed
inconsistencies in fairness metrics. For both recommenders, over
22% of tasks had user-selected recommendations that were
not available on Videoland, suggesting that the system
occasionally generated recommendations beyond the
platform’s content availability, despite our attempts to control
it.</p>
      <p>Finally, the results presented in Figure 5 indicate how
often users choose titles beyond the top ranking items. Our
findings demonstrate that having a large pool of candidates
is valuable, as users frequently select titles from across the
entire range of recommendations.</p>
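      <p>For reference, nDCG@9 and HR@9 with binary relevance can be computed as below; this is a standard textbook sketch, not the evaluation code used in the study:</p>

```python
import math

def ndcg_at_k(ranked, relevant, k=9):
    """Binary-relevance nDCG@k for one session's ranked titles."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, title in enumerate(ranked[:k]) if title in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

def hr_at_k(ranked, relevant, k=9):
    """Hit rate@k: 1.0 if any relevant title appears in the top k."""
    return 1.0 if any(title in relevant for title in ranked[:k]) else 0.0
```
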
      <p>Table 2 reports, for the personalised and the
non-personalised version: Recommended in Candidates, Chosen in
Candidates, Chosen but not in Candidates, and Unique
Titles per User.</p>
      <p>Overall experience of using a conversational
recommender In the second phase of our evaluation, we analyzed
the feedback received from the questionnaire. The metrics
substantiated the results, revealing a statistically significant
positive correlation (Pearson coefficient of 0.26) between
quantitative metrics like nDCG@9 and qualitative metrics
like users’ task rankings. For instance, the Title task was
preferred by 30% and 60% of participants for the
personalised and non-personalised versions, respectively, in their
rankings, aligning with corresponding nDCG scores. These
findings endorse our metrics’ effectiveness in capturing user
preferences and judgments. However, it is important to
note that this difference, while notable, is not statistically
significant due to the relatively small sample size of
participants. Consequently, providing an explanation for why
the non-personalised version performed better on this task
is challenging and requires further investigation.</p>
      <p>The answers to the Likert questions indicate that a
comparable proportion of respondents agreed or strongly
agreed with three or more statements for both versions of
the recommender system (70% for personalised and 60% for
non-personalised). However, there is a notable difference:
40% of respondents in the personalised version expressed
agreement with all statements, while only 10% did so in the
non-personalised version. This suggests that the
personalised version garnered a higher percentage of highly
satisfied users. Additionally, we can observe that the
non-personalised version elicited more neutral responses from
the participants, suggesting a more mixed perception of its
recommendations.</p>
      <p>The findings from the open-ended questions shed light
on user perceptions of the recommender system’s
experience. Notably, 80% of users perceived the personalised
version as enjoyable, even when their specific requests were
not entirely met. In contrast, 60% of users found the
non-personalised version enjoyable despite similar
circumstances. The respondents also mentioned that this
recommender “could bring added value to the Videoland
experience”. A common piece of feedback from participants
who expressed dissatisfaction with their experience was the
unavailability of relevant titles on the platform.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Discussion and Conclusion</title>
      <p>Our study demonstrated that the personalised
recommender outperformed the non-personalised version by
delivering more relevant recommendations to users. However,
it is important to recognise that both versions of the
recommender still, in some cases, suggested titles that were not
available on the platform, contrary to our initial
expectations. This aspect highlights the need for further
improvements and considerations in ensuring system consistency.
Despite this drawback, the study shed light on the potential
of personalised chat-based recommendations to improve
user satisfaction and relevance, offering valuable insights
for future developments in recommender systems.</p>
      <p>Limitations of the study include a primarily
Dutch-speaking sample (65% of all of the participants) due to the
platform catering to a Dutch-speaking population, a limited
sample size, and the need to consider privacy and user
preferences when implementing conversational recommender
systems. Furthermore, if users explicitly share personal
details with a conversational recommender system, it could
impact their comfort in utilizing the system. Safeguards
must be in place to ensure safety and prevent users from
exploiting the system.</p>
      <p>In conclusion, the study emphasizes the potential of
personalised chat-based recommendations to enhance user
experience, but further research is required to develop a
safer mechanism for LLM usage, ensuring adherence to
rules and understanding potential unfair scenarios.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[1] M. Naumov, D. Mudigere, H.-J. M. Shi, J. Huang,
N. Sundaraman, J. Park, X. Wang, U. Gupta, C.-J. Wu,
A. G. Azzolini, et al., Deep learning recommendation model
for personalization and recommendation systems, arXiv
preprint arXiv:1906.00091 (2019).</p>
      <p>[2] Y. Koren, R. Bell, C. Volinsky, Matrix factorization
techniques for recommender systems, Computer 42 (2009)
30–37.</p>
      <p>[3] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering
for implicit feedback datasets, in: 2008 Eighth IEEE
International Conference on Data Mining, IEEE, 2008, pp.
263–272.</p>
      <p>[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is
all you need, Advances in Neural Information Processing
Systems 30 (2017).</p>
      <p>[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
Pre-training of deep bidirectional transformers for language
understanding, in: Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies,
Volume 1 (Long and Short Papers), Association for
Computational Linguistics, Minneapolis, Minnesota, 2019, pp.
4171–4186.</p>
      <p>[6] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever,
Improving language understanding by generative
pre-training, OpenAI Technical Report (2018).</p>
      <p>[7] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,
I. Sutskever, Language models are unsupervised multitask
learners, OpenAI Blog 1 (2019) 9.</p>
      <p>[8] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D.
Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry,
A. Askell, et al., Language models are few-shot learners,
Advances in Neural Information Processing Systems 33
(2020) 1877–1901.</p>
      <p>[9] T. Donkers, B. Loepp, J. Ziegler, Sequential user-based
recurrent neural network recommendations, in: Proceedings
of the Eleventh ACM Conference on Recommender Systems,
2017, pp. 152–160.</p>
      <p>[10] W.-C. Kang, J. McAuley, Self-attentive sequential
recommendation, in: 2018 IEEE International Conference
on Data Mining (ICDM), 2018, pp. 197–206.</p>
      <p>[11] F. Sun, J. Liu, J. Wu, C. Pei, X. Lin, W. Ou, P. Jiang,
BERT4Rec: Sequential recommendation with bidirectional
encoder representations from transformer, in: Proceedings
of the 28th ACM International Conference on Information
and Knowledge Management, 2019, pp. 1441–1450.</p>
      <p>[12] Q. Chen, H. Zhao, W. Li, P. Huang, W. Ou, Behavior
sequence transformer for e-commerce recommendation in
Alibaba, in: Proceedings of the 1st International Workshop
on Deep Learning Practice for High-Dimensional Sparse
Data, 2019, pp. 1–4.</p>
      <p>[13] M. Gutierrez Granada, D. Odijk, Recommendations
at Videoland, in: Proceedings of the 15th ACM Conference
on Recommender Systems, RecSys ’21, Association for
Computing Machinery, New York, NY, USA, 2021, pp.
580–582.</p>
      <p>[14] A. Chowdhery, S. Narang, J. Devlin, M. Bosma,
G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton,
S. Gehrmann, et al., PaLM: Scaling language modeling with
pathways, arXiv preprint arXiv:2204.02311 (2022).</p>
      <p>[15] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A.
Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro,
F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample,
LLaMA: Open and efficient foundation language models,
arXiv preprint arXiv:2302.13971 (2023).</p>
      <p>[16] J. Li, W. Zhang, T. Wang, G. Xiong, A. Lu, G. G.
Medioni, GPT4Rec: A generative framework for personalized
recommendation and user interests interpretation, ArXiv
abs/2304.03879 (2023).</p>
      <p>[17] J. Liu, C. Liu, R. Lv, K. Zhou, Y. B. Zhang, Is ChatGPT
a good recommender? A preliminary study, ArXiv
abs/2304.10149 (2023).</p>
      <p>[18] Z. Cui, J. Ma, C. Zhou, J. Zhou, H. Yang, M6-Rec:
Generative pretrained language models are open-ended
recommender systems, ArXiv abs/2205.08084 (2022).</p>
      <p>[19] P. P. Ray, ChatGPT: A comprehensive review on
background, applications, key challenges, bias, ethics,
limitations and future scope, Internet of Things and
Cyber-Physical Systems 3 (2023) 121–154.</p>
      <p>[20] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, X. Wu,
Unifying large language models and knowledge graphs: A
roadmap, ArXiv abs/2306.08302 (2023).</p>
      <p>[21] J. J. Smith, L. Beattie, H. Cramer, Scoping fairness
objectives and identifying fairness metrics for recommender
systems: The practitioners’ perspective, in: Proceedings of
the ACM Web Conference 2023, WWW ’23, Association for
Computing Machinery, New York, NY, USA, 2023, pp.
3648–3659.</p>
      <p>[22] M. Radensky, J. A. Séguin, J. S. Lim, K. Olson,
R. Geiger, “I think you might like this”: Exploring effects of
confidence signal patterns on trust in and reliance on
conversational recommender systems, in: Proceedings of the
2023 ACM Conference on Fairness, Accountability, and
Transparency, 2023, pp. 792–804.</p>
      <p>[23] D. C. Toader, G. D. Boca, R. Toader, M. Macelaru,
C. Toader, D. S. Ighian, A. T. G. Rădulescu, The effect of
social presence and chatbot errors on trust, Sustainability
(2019).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>