<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Preliminary Study of ChatGPT on News Recommendation: Personalization, Provider Fairness, and Fake News</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xinyi Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yongfeng Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edward C. Malthouse</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Northwestern University</institution>
          ,
          <addr-line>Evanston, IL</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Rutgers University</institution>
          ,
          <addr-line>New Brunswick, NJ</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>19</volume>
      <issue>2017</issue>
      <fpage>18</fpage>
      <lpage>22</lpage>
      <abstract>
        <p>Online news platforms commonly employ personalized news recommendation methods to help users discover interesting articles, and many previous works have utilized language model techniques to capture user interests and understand news content. With the emergence of large language models such as the GPT, T5, and LLaMA series, a new recommendation paradigm has emerged that leverages pre-trained language models for making recommendations. ChatGPT, with its user-friendly interface and growing popularity, has become a prominent choice for text-based tasks. Considering the growing reliance on ChatGPT for language tasks, the importance of news recommendation in addressing social issues, and the trend of using language models in recommendations, this study conducts an initial investigation of ChatGPT's performance in news recommendations, focusing on three perspectives: personalized news recommendation, news provider fairness, and fake news detection. Since the output of ChatGPT is sensitive to the input phrasing, we aim to explore the constraints present in ChatGPT's generated responses for each perspective. Additionally, we investigate whether specific prompt formats can alleviate these constraints or whether these limitations require further attention from researchers in the future. We also go beyond fixed evaluations by developing a webpage to monitor ChatGPT's performance on the investigated tasks and prompts on a weekly basis. We aim to contribute to, and encourage, further research into enhancing news recommendation performance through the utilization of large language models such as ChatGPT.</p>
      </abstract>
      <kwd-group>
        <kwd>News recommendations</kwd>
        <kwd>Large language models</kwd>
        <kwd>ChatGPT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In today’s information-overloaded society, online platforms like Google News and Microsoft News are attracting users to read news online [1]. However, the daily volume of new news articles poses a challenge for users to find ones that align with their interests [2]. To address this, news recommendation systems (RS) are crucial for assisting users in discovering relevant articles. News articles contain rich textual information, making language model techniques like Gated Recurrent Units (GRUs) [3], Long Short-Term Memory (LSTM) [4], Convolutional Neural Networks (CNNs) [5], and attention mechanisms [6] popular choices for modeling users’ interests and comprehending article content [7, 8, 9]. Furthermore, pre-trained language models and prompt learning techniques have demonstrated their effectiveness in various language tasks [10], leading RS researchers to approach recommendation as a language task to leverage the power of these techniques [11, 12, 13].</p>
      <p>This study aims to evaluate ChatGPT, a prominent language model developed by OpenAI, in the context of news RS tasks. Given the success of ChatGPT in various natural language processing (NLP) tasks and the growing recognition of recommendation as a language-related task [13], our research focuses on three key perspectives: personalized news recommendation, news provider fairness, and fake news detection. Within each perspective, our objective is to identify limitations in ChatGPT’s response generation and explore the potential effectiveness of specific prompt formats or requirements to address these limitations. Additionally, we aim to shed light on areas that might require further attention from future researchers, as certain limitations may not be easily resolved through prompt design alone. We anticipate that ChatGPT will improve and address certain concerns through user feedback. Therefore, we have developed a webpage<sup>1</sup> to track its progress on the tasks we have been exploring, with updates provided on a weekly basis.</p>
      <p>We hope our study will inspire OpenAI researchers and the wider scientific community to delve deeper into improving the performance of language models such as ChatGPT in news RS tasks.</p>
      <p><sup>1</sup> https://imrecommender.github.io/ChatNews/</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>News Recommendation. Existing news RS methods utilize NLP techniques like denoising auto-encoders [14], GRU networks and CNNs [7], and attention mechanisms [15] to understand news content and model users’ interests based on their reading behavior [8, 9]. While content understanding and personalized recommendations are essential, it is equally important to address social issues associated with news RS, including filter bubbles [16], echo chambers [17], the spread of fake news [18], popularity bias [19], user-side fairness [20, 21], and provider-side fairness [22, 23, 24]. In this study, we not only evaluate ChatGPT’s zero-shot performance on the personalized recommendation task but also examine whether it appropriately addresses provider bias and fake news concerns. By investigating these aspects, we aim to shed light on the broader societal implications of employing ChatGPT for news RS.</p>
      <p>Pre-trained Language Models and RS. Pre-trained language models like BERT [25] and GPT [26], which are trained on large-scale datasets, have shown adaptability to various downstream tasks, and prompt learning techniques [3] have further improved their performance. This success has led to a shift in RS, treating recommendation tasks as language tasks [13, 27]. Researchers have proposed various approaches, such as converting item-based recommendation to text-based tasks and utilizing textual descriptions for user behavior [11], employing personalized prompt learning for explainable recommendation [28], transforming user behavior into text-based inquiries [12], and adopting flexible text-to-text approaches for RS [13]. In this work, we investigate ChatGPT’s zero-shot performance on news recommendation tasks, leveraging its capabilities as a pre-trained language model.</p>
      <p>ChatGPT. ChatGPT has gained immense popularity within a short period, leading to numerous studies that explore its strengths and limitations. Qin et al. [29] assess ChatGPT’s performance on various NLP tasks, while Bang et al. [30] provide a comprehensive technical evaluation of its capabilities in multitasking, multimodal, and multilingual applications. Zhou et al. [31] explore ethical concerns associated with ChatGPT usage. Li et al. [32] study the fairness of ChatGPT in education, criminology, finance, and healthcare. Liu et al. [33] construct a benchmark to evaluate ChatGPT’s performance in RS tasks like rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization. While ChatGPT is known to have limitations, including bias and the potential for generating fake information [34], our research aims to explore the social issues related to using ChatGPT for news recommendation, particularly provider bias and fake news detection. We investigate potential prompt formats that can help mitigate these issues or highlight areas requiring further attention.</p>
      <sec id="sec-2-1">
        <title>3. Evaluations of ChatGPT</title>
        <p>This section evaluates ChatGPT’s performance in news recommendations using zero-shot approaches. We specifically focus on three key tasks: personalized recommendations, fairness of news providers, and trustworthiness of the generated responses. Our approach involves first identifying any limitations in ChatGPT’s responses using simple prompts. We then construct additional prompts to address these limitations or emphasize the need for further attention to these specific issues when utilizing language models like ChatGPT for news recommendation. To facilitate reproducibility, we have made the prompts and code available in a GitHub repository<sup>2</sup>. For our analysis, we utilize data samples from the Microsoft News Dataset (MIND) [1].</p>
        <sec id="sec-eval-1">
          <title>3.1. Personalized Recommendation of ChatGPT</title>
          <p>This subsection uses a random sample of 30 users from the MIND dataset to detect limitations and gain insights into ChatGPT’s performance when it generates recommendations for individual users based on a set of unread articles. Based on our investigation of ChatGPT’s response generation using the initial prompt provided by Liu et al. [33], we observe that ChatGPT struggles to effectively differentiate between articles previously read by a user and the candidate articles. As a result, ChatGPT may generate recommendations that include articles already read by the user. Building upon this observation, we propose hypothesis 1:</p>
          <p>Hypothesis 1: Improving the organization of prompts by using the JSON format with explicit keys, instead of solely relying on textual descriptions, will better distinguish the articles read by a user from the candidate articles.</p>
          <p>We evaluate four different prompts (prompt 0 to prompt 3), shown in Figure 1. We feed each prompt to the model five times for each user and count the number of users whose responses contain articles that the user has previously read. We conduct an exact binomial test to investigate further. The results indicate that when utilizing prompt 3 from Figure 1, the probability of having articles previously read by the user in the response was found to be zero. However, we could not reach the same conclusion for the other prompts. Based on these findings, we can infer that when dealing with lengthy texts, and when it is crucial to differentiate specific information, utilizing a JSON format with explicit keys proves to be more effective than relying solely on textual descriptions.</p>
          <p>We further assess ChatGPT’s zero-shot personalized RS capability by comparing it to several baselines, including LSTUR [7], TANR [35], NRMS [36], and NAML [8], using the metrics top-k Hit Ratio (Hit@k) and Normalized Discounted Cumulative Gain (nDCG@k). The results, presented in Table 1, indicate that ChatGPT’s zero-shot news RS performance is inferior to existing deep neural-based models. However, we observe that there is a high probability (over 93.3%) that the top-5 articles recommended by ChatGPT are from the same historical topics as the user’s interests, whereas in the ground truth, there is only a 60% chance that the clicked article belongs to the same categories as the historical articles. This suggests that ChatGPT is capable of understanding the categories of historical articles that users are interested in. However, user interests are dynamic, and without fine-tuning or training on the news dataset, ChatGPT’s RS performance is inferior to that of existing deep neural-based models. This highlights the need for further research and potential fine-tuning approaches to enhance ChatGPT’s recommendation performance in the news domain.</p>
        </sec>
        <sec id="sec-eval-2">
          <title>3.2. News Provider Fairness</title>
          <p>Most news organizations that create content (i.e., providers) depend on advertising for a substantial fraction of their operating revenues, supplementing other revenue sources such as user-subscriber fees, cable TV carriage fees, and donations. Digital advertising depends on attracting users to the news site, and an important referring source of visitors is news, social media, and search platforms, which implement RS. Reduced levels of ad revenue have contributed to news organizations closing, which has created vast news deserts in the US, where communities no longer have news coverage [37]. When Facebook changed its RS in 2018, small news organizations had decreases in traffic and ad revenue [38], and countries such as Australia are attempting to regulate platforms and have them pay news organizations for their content. Platforms that implement news RS must therefore balance the needs of different stakeholders with multiple objectives, and they may want to guarantee that various providers receive some “fair” proportion of recommendations. While provider fairness is often addressed as a post-processing step in news RS [23, 39], our objective is to first identify any biases related to news provider fairness using ChatGPT and then explore potential prompt improvements to alleviate these concerns. We divide providers into two groups, popular and unpopular, and we utilize precision@k to assess the proportion of popular providers among the top-k recommendations.</p>
          <p>The first scenario involves not providing candidate articles to ChatGPT but instead asking it for recommendations based on the articles that a user has read before.</p>
          <p>In our preliminary experiment using initial prompt 0 from Figure 2, we observe that ChatGPT mistakenly labels some popular providers as unpopular in its responses.</p>
          <p>This prompts us to further investigate provider fairness metrics from two perspectives: the user’s perspective, where we adjust the popularity labels based on a predefined list of 100 popular providers, and ChatGPT’s perspective, where we evaluate its performance using the popularity labels assigned by ChatGPT in its responses.</p>
          <p>Additionally, in the initial experiment, we notice that ChatGPT tends to recommend articles from providers labeled as popular by ChatGPT. This finding prompts us to propose the following hypothesis:</p>
        </sec>
      </sec>
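      <p>As an illustration of the JSON-keyed prompt organization proposed in hypothesis 1, the following is a minimal sketch of how such a prompt might be assembled; the key names and instruction wording are illustrative assumptions, not the exact prompts of Figure 1:</p>
      <preformat>
```python
import json

def build_json_prompt(history_titles, candidate_titles, k=5):
    """Assemble a recommendation prompt with explicit JSON keys.

    Separating already-read articles from candidates under distinct keys
    is the property hypothesis 1 credits for keeping previously read
    items out of the recommendations.
    """
    payload = {
        "previously_read_articles": history_titles,
        "candidate_articles": candidate_titles,
    }
    instruction = (
        f"Recommend the top {k} articles for this user. "
        "Choose only from 'candidate_articles' and never repeat "
        "anything listed in 'previously_read_articles'."
    )
    # The model receives the instruction followed by the structured payload.
    return instruction + "\n" + json.dumps(payload, indent=2)
```
      </preformat>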
      <sec id="sec-2-2">
        <p>Hypothesis 2: Explicitly specifying the number of articles from both popular and unpopular providers will mitigate the issue of provider bias based on a user’s tolerance.</p>
        <p>To evaluate hypothesis 2, six prompts (prompt 0 to prompt 5 in Figure 2) are applied. The results shown in Figure 3 support hypothesis 2: ChatGPT demonstrates efficient controllability, which is a significant advantage compared to existing models that aim to address the news provider bias issue. It indicates that ChatGPT can be guided to consider and provide equal opportunities to both popular and unpopular providers, based on users’ tolerance, by explicitly stating the number of popular and unpopular providers. Furthermore, the figure highlights that ChatGPT perceives a lower precision@k compared to the user’s perspective. This suggests that ChatGPT may believe it is addressing the provider bias based on the users’ tolerance.</p>
        <p>Besides detecting provider bias when no candidate articles are provided, we also observe this issue when candidate articles are provided, using the initial prompt 0 in Figure 4. This bias may be influenced by the presence of provider bias in the user’s history, where the user shows a preference for articles from popular providers, and we propose hypothesis 3:</p>
        <p>Hypothesis 3: Explicitly indicating the priority of less popular providers mitigates ChatGPT’s provider bias when candidate articles are provided.</p>
        <p>Prompt 3 in Figure 4 incorporates the term ‘provider fairness’, which aligns with the definition in our study. The results presented in Figure 5 demonstrate that explicitly stating the priority of less popular providers can effectively mitigate the provider bias issue in ChatGPT’s recommendations. This reduction in bias is statistically significant (p &lt; 0.05), as indicated by the precision@5 metric. The difference in precision@10, however, is not statistically significant (p &gt; 0.1). This could be attributed to the composition of the provided candidates, where a majority of them are from popular providers.</p>
        <p>Another notable finding is the disparity between the precision of ChatGPT’s and the user’s perspectives. Comparing the disparity between prompt 2 and prompt 4, as well as prompt 3 and prompt 5 in Figure 3, it becomes evident that reintroducing the list of popular and unpopular providers in the prompts decreases the disparity. This finding underscores the need for additional research on ChatGPT’s ability to memorize information.</p>
        <sec id="sec-eval-3">
          <title>3.3. Trustfulness of ChatGPT</title>
          <p>The use of ChatGPT has opened up possibilities for human-computer interaction and information generation, but it also brings an ethical concern: the generation of deceptive information, particularly in the form of fake news [40, 41, 42]. As the popularity of ChatGPT increases, so does the potential risk of disseminating false or misleading information, leading to distorted perceptions of events and fostering incorrect beliefs and decisions among the public. To address these concerns, this subsection investigates the trustworthiness of ChatGPT in providing news recommendations, employing the same 30 users and conducting 5 independent trials for each prompt under examination.</p>
          <p>In our investigation using the initial prompt 0, where no candidate articles are provided and ChatGPT is asked to recommend one article based on the user’s reading interests (as depicted in Figure 6), we observe the existence of fake news generation. However, the rate of generating fake news (i.e., news with titles that cannot be verifiably found via Google search) is inconsistent, fluctuating between approximately one half and one third of users. Building on this finding, we formulate hypothesis 4 to explore whether presenting candidate articles in designed prompts for ChatGPT to make recommendations could effectively reduce the issue of fake news generation.</p>
        </sec>
      </sec>
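      <p>The precision@k measure used above for provider fairness can be sketched as follows; this is a minimal illustration, where the popular-provider set may be either the predefined list of 100 popular providers (user perspective) or the labels ChatGPT itself assigned (ChatGPT perspective):</p>
      <preformat>
```python
def precision_at_k(recommended_providers, popular_set, k):
    """Fraction of the top-k recommended articles whose provider is popular.

    recommended_providers: provider names in recommendation order.
    popular_set: set of providers considered popular under the chosen
    perspective (user-defined list or ChatGPT-assigned labels).
    """
    top_k = recommended_providers[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for provider in top_k if provider in popular_set)
    return hits / len(top_k)
```
      </preformat>
      <p>A lower value indicates that fewer of the top-k slots went to popular providers, i.e., more exposure for the unpopular group.</p>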
      <sec id="sec-2-3">
        <p>Hypothesis 4: Providing candidate articles based solely on title information will significantly decrease the likelihood of generating fake news during ChatGPT’s recommendations.</p>
      </sec>
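      <p>When candidate articles are supplied, one simple way to operationalize the fake-news check is to flag recommended titles that match no supplied candidate. This is an illustrative proxy for the Google-search verification described above, not the paper’s exact procedure:</p>
      <preformat>
```python
def fake_title_rate(responses, candidate_titles):
    """Share of recommended titles that match none of the provided candidates.

    responses: list of per-trial lists of recommended titles.
    candidate_titles: the titles actually offered to the model.
    A title outside the candidate list serves as a proxy for a
    fabricated (unverifiable) article.
    """
    known = {t.strip().lower() for t in candidate_titles}
    recommended = [t.strip().lower() for trial in responses for t in trial]
    if not recommended:
        return 0.0
    fakes = sum(1 for t in recommended if t not in known)
    return fakes / len(recommended)
```
      </preformat>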
    </sec>
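    <p>The exact binomial tests behind the significance statements in this section can be sketched with the standard library alone. This computes a one-sided tail probability; the baseline rate p is whatever null rate is being tested (an assumption of this sketch, since the paper does not spell out its null hypotheses):</p>
    <preformat>
```python
from math import comb

def binom_tail(x, n, p):
    """Exact one-sided binomial p-value: P(X >= x) for X ~ Binomial(n, p).

    Example use: n prompt trials, x of which contained an already-read
    or fabricated article, tested against a hypothesized rate p.
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x, n + 1))
```
      </preformat>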
    <sec id="sec-eval-3b">
      <p>To test this hypothesis, we further evaluate three different prompts (prompt 1 to prompt 3) with candidates provided in different forms, as shown in Figure 6. Prompt 1 and prompt 2 represent each article using both its ID and title, while prompt 3 represents each article using only its title.</p>
      <p>Our empirical findings indicate that when utilizing prompt 1 and prompt 2, approximately 1 out of 10 users on average receive recommended responses containing fake IDs. The presence of fake IDs under prompt 1 and prompt 2 (as shown in Figure 6) can be attributed to ChatGPT’s difficulty in handling numerical values and the lack of concrete meaningful words in its training data for the short strings in prompt 2. Nevertheless, this represents a substantial, statistically significant (p &lt; 0.05) decrease in the generation of fake news compared to the performance observed with prompt 0. The provision of candidate articles for ChatGPT during news recommendations thus plays a significant role in mitigating the generation of fake news compared to scenarios where no candidates are provided.</p>
      <p>When using only the title information (prompt 3 in Figure 6), there is a further reduction in the probability of generating fake news, reaching 1 out of 150, which confirms hypothesis 4. However, it is essential to acknowledge that the issue of generating fake news is not completely eliminated, and addressing the broader social challenges arising from the dissemination of fake news articles when utilizing large language models like ChatGPT remains a crucial area of concern.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <p>This study evaluates ChatGPT’s performance in news recommendations, with a focus on personalization, provider fairness, and fake news. Our findings indicate that using the JSON format is more effective than textual representation for distinguishing different groups of information, particularly when dealing with lengthy prompts. We observe that ChatGPT exhibits an inherent provider bias, but it can be controlled and adjusted based on users’ tolerances by explicitly specifying the number of accepted popular and unpopular providers or by prioritizing the unpopular ones. Despite providing explicit candidate articles, the issue of generating fake news cannot be completely resolved; however, the probability of generating fake news during recommendations is significantly lower than when making recommendations directly without providing candidate options. To address the challenge of fake news, enhancing the trustworthiness and reliability of language models is crucial in the news domain and remains an important area for further research. Additionally, we identify the need to improve ChatGPT’s memorization capability. This work aims to provide valuable insights and directions for future studies that seek to enhance news recommendation performance using language models like ChatGPT. Furthermore, we have created a webpage to encourage more researchers to actively participate in this field of study.</p>
      <p>A promising and important area for future research is to investigate ethical issues around the use of ChatGPT for news recommendation. The task of recommending news is especially complex because the system goals extend far beyond identifying articles of interest to a user [43]. News RS should avoid creating experience cocoons, echo chambers, and filter bubbles, where users only encounter stories that reinforce their existing beliefs, interests, and ideologies [44]. The hazards of manipulation are great, e.g., a political party attempting to manipulate the system to show stories on a certain event to inflate their importance. Further research can investigate how to formulate prompts to manage exposure diversity and biases, and to safeguard against manipulation.</p>
    </sec>
    <sec id="sec-refs">
      <title>References</title>
      <p>[1] F. Wu, Y. Qiao, J.-H. Chen, C. Wu, T. Qi, J. Lian, D. Liu, X. Xie, J. Gao, W. Wu, et al., MIND: A large-scale dataset for news recommendation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3597–3606.</p>
      <p>[2] J. Lian, F. Zhang, X. Xie, G. Sun, Towards better representation learning for personalized news recommendation: a multi-channel deep fusion approach, in: IJCAI, 2018, pp. 3805–3811.</p>
      <p>[3] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014).</p>
      <p>[4] R. C. Staudemeyer, E. R. Morris, Understanding LSTM – a tutorial into long short-term memory recurrent neural networks, arXiv preprint arXiv:1909.09586 (2019).</p>
      <p>[5] Y. Chen, Convolutional neural network for sentence classification, Master’s thesis, University of Waterloo, 2015.</p>
      <p>[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[7] M. An, F. Wu, C. Wu, K. Zhang, Z. Liu, X. Xie, Neural news recommendation with long- and short-term user representations, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 336–345.</p>
      <p>[8] C. Wu, F. Wu, T. Qi, C. Li, Y. Huang, Is news recommendation a sequential recommendation task?, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 2382–2386.</p>
      <p>[9] C. Wu, F. Wu, M. An, J. Huang, Y. Huang, X. Xie, Neural news recommendation with attentive multi-view learning, arXiv preprint arXiv:1907.05576 (2019).</p>
      <p>[10] W. Jin, Y. Cheng, Y. Shen, W. Chen, X. Ren, A good prompt is worth millions of parameters? Low-resource prompt-based learning for vision-language models, arXiv preprint arXiv:2110.08484 (2021).</p>
      <p>[11] Y. Zhang, H. Ding, Z. Shui, Y. Ma, J. Zou, A. Deoras, H. Wang, Language models as recommender systems: Evaluations and limitations, in: I (Still) Can’t Believe It’s Not Better! NeurIPS 2021 Workshop, 2021.</p>
      <p>[12] Z. Cui, J. Ma, C. Zhou, J. Zhou, H. Yang, M6-Rec: Generative pretrained language models are open-ended recommender systems, arXiv preprint arXiv:2205.08084 (2022).</p>
      <p>[13] S. Geng, S. Liu, Z. Fu, Y. Ge, Y. Zhang, Recommendation as language processing (RLP): A unified pretrain, personalized prompt &amp; predict paradigm (P5), RecSys (2022).</p>
      <p>[14] S. Okura, Y. Tagami, S. Ono, A. Tajima, Embedding-based news recommendation for millions of users, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1933–1942.</p>
      <p>[15] C. Wu, F. Wu, M. An, J. Huang, Y. Huang, X. Xie, Neural news recommendation with attentive multi-view learning, arXiv preprint arXiv:1907.05576 (2019).</p>
      <p>[16] T. T. Nguyen, P.-M. Hui, F. M. Harper, L. Terveen, J. A. Konstan, Exploring the filter bubble: the effect of using recommender systems on content diversity, in: Proceedings of the 23rd International Conference on World Wide Web, 2014, pp. 677–686.</p>
      <p>[17] M. Cinelli, G. De Francisci Morales, A. Galeazzi, W. Quattrociocchi, M. Starnini, The echo chamber effect on social media, Proceedings of the National Academy of Sciences 118 (2021) e2023301118.</p>
      <p>[18] S. Vosoughi, D. Roy, S. Aral, The spread of true and false news online, Science 359 (2018) 1146–1151.</p>
      <p>[19] H. Abdollahpouri, M. Mansoury, R. Burke, B. Mobasher, E. Malthouse, User-centered evaluation of popularity bias in recommender systems, in: Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization, 2021, pp. 119–129.</p>
      <p>[20] Y. Li, H. Chen, S. Xu, Y. Ge, Y. Zhang, Towards personalized fairness based on causal notion, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1054–1063.</p>
      <p>[21] C. Wu, F. Wu, X. Wang, Y. Huang, X. Xie, Fairness-aware news recommendation with decomposed adversarial learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>