<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Preliminary Study of ChatGPT on News Recommendation: Personalization, Provider Fairness, and Fake News</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xinyi Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yongfeng Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edward C. Malthouse</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Northwestern University</institution>
          ,
          <addr-line>Evanston, IL</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Rutgers University</institution>
          ,
          <addr-line>New Brunswick, NJ</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>19</volume>
      <issue>2017</issue>
      <fpage>18</fpage>
      <lpage>22</lpage>
      <abstract>
        <p>Online news platforms commonly employ personalized news recommendation methods to help users discover interesting articles, and many previous works have utilized language model techniques to capture user interests and understand news content. With the emergence of large language models such as the GPT, T5, and LLaMA series, a new recommendation paradigm has emerged that leverages pre-trained language models for making recommendations. ChatGPT, with its user-friendly interface and growing popularity, has become a prominent choice for text-based tasks. Considering the growing reliance on ChatGPT for language tasks, the importance of news recommendation in addressing social issues, and the trend of using language models in recommendations, this study conducts an initial investigation of ChatGPT's performance in news recommendations, focusing on three perspectives: personalized news recommendation, news provider fairness, and fake news detection. Since the output of ChatGPT is sensitive to the input phrasing, we aim to explore the constraints present in ChatGPT's generated responses for each perspective. Additionally, we investigate whether specific prompt formats can alleviate these constraints or whether these limitations require further attention from researchers in the future. We also go beyond fixed evaluations by developing a webpage to monitor ChatGPT's performance on the investigated tasks and prompts on a weekly basis. We aim to contribute to, and encourage, further research into enhancing news recommendation performance through the utilization of large language models such as ChatGPT.</p>
      </abstract>
      <kwd-group>
        <kwd>News recommendations</kwd>
        <kwd>Large language models</kwd>
        <kwd>ChatGPT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In today’s information-overloaded society, online platforms like Google News and Microsoft News are attracting users to read news online [1]. However, the daily volume of new news articles poses a challenge for users to find ones that align with their interests [2]. To address this, news recommendation systems (RS) are crucial for assisting users in discovering relevant articles. News articles contain rich textual information, making language model techniques like Gated Recurrent Units (GRUs) [3], Long Short-Term Memory (LSTM) [4], Convolutional Neural Networks (CNNs) [5], and attention mechanisms [6] popular choices for modeling users’ interests and comprehending article content [7, 8, 9]. Furthermore, pre-trained language models and prompt learning techniques have demonstrated their effectiveness in various language tasks [10], leading RS researchers to approach recommendation as a language task to leverage the power of these techniques [11, 12, 13].</p>
      <p>This study aims to evaluate ChatGPT, a prominent language model developed by OpenAI, in the context of news RS tasks. Given the success of ChatGPT in various natural language processing (NLP) tasks and the growing recognition of recommendation as a language-related task [13], our research focuses on three key perspectives: personalized news recommendation, news provider fairness, and fake news detection. Within each perspective, our objective is to identify limitations in ChatGPT’s response generation and explore the potential effectiveness of specific prompt formats or requirements to address these limitations. Additionally, we aim to shed light on areas that might require further attention from future researchers, as certain limitations may not be easily resolved through prompt design alone. We anticipate that ChatGPT will improve and address certain concerns through user feedback. Therefore, we have developed a webpage<sup>1</sup> to track its progress on the tasks we have been exploring, with updates provided on a weekly basis.</p>
      <p>We hope our study will inspire OpenAI researchers and the wider scientific community to delve deeper into improving the performance of language models such as ChatGPT in news RS tasks.</p>
      <p><sup>1</sup> https://imrecommender.github.io/ChatNews/</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>News Recommendation. Existing news RS methods utilize NLP techniques like denoising auto-encoders [14], GRU networks and CNNs [7], and attention mechanisms [15] to understand news content and model users’ interests based on their reading behavior [8, 9]. While content understanding and personalized recommendations are essential, it is equally important to address social issues associated with news RS, including filter bubbles [16], echo chambers [17], the spread of fake news [18], popularity bias [19], user-side fairness [20, 21], and provider-side fairness [22, 23, 24]. In this study, we not only evaluate ChatGPT’s zero-shot performance on the personalized recommendation task but also examine whether it appropriately addresses provider bias and fake news concerns. By investigating these aspects, we aim to shed light on the broader societal implications of employing ChatGPT for news RS.</p>
      <p>Pre-trained Language Models and RS. Pre-trained language models like BERT [25] and GPT [26], which are trained on large-scale datasets, have shown adaptability to various downstream tasks, and prompt learning techniques [3] have further improved their performance. This success has led to a shift in RS, treating recommendation tasks as language tasks [13, 27]. Researchers have proposed various approaches, such as converting item-based recommendation to text-based tasks and utilizing textual descriptions for user behavior [11], employing personalized prompt learning for explainable recommendation [28], transforming user behavior into text-based inquiries [12], and adopting flexible text-to-text approaches for RS [13]. In this work, we investigate ChatGPT’s zero-shot performance on news recommendation tasks, leveraging its capabilities as a pre-trained language model.</p>
      <p>ChatGPT. ChatGPT has gained immense popularity within a short period, leading to numerous studies that explore its strengths and limitations. Qin et al. [29] assess ChatGPT’s performance on various NLP tasks, while Bang et al. [30] provide a comprehensive technical evaluation of its capabilities in multitasking, multimodal, and multilingual applications. Zhou et al. [31] explore ethical concerns associated with ChatGPT usage. Li et al. [32] study the fairness of ChatGPT in education, criminology, finance, and healthcare. Liu et al. [33] construct a benchmark to evaluate ChatGPT’s performance in RS tasks like rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization. While ChatGPT is known to have limitations, including bias and the potential for generating fake information [34], our research aims to explore the social issues related to using ChatGPT for news recommendation, particularly provider bias and fake news detection. We investigate potential prompt formats that can help mitigate these issues or highlight areas requiring further attention.</p>
      <sec id="sec-2-1">
        <title>3. Evaluations of ChatGPT</title>
        <p>This section evaluates ChatGPT’s performance in news recommendations using zero-shot approaches. We specifically focus on three key tasks: personalized recommendations, fairness of news providers, and trustworthiness of the generated responses. Our approach involves first identifying any limitations in ChatGPT’s responses using simple prompts. We then construct additional prompts to address these limitations or emphasize the need for further attention to these specific issues when utilizing language models like ChatGPT for news recommendation. To facilitate reproducibility, we have made the prompts and code available in a GitHub repository<sup>2</sup>. For our analysis, we utilize data samples from the Microsoft News Dataset (MIND) [1].</p>
        <sec id="sec-eval-1">
          <title>3.1. Personalized Recommendation of ChatGPT</title>
          <p>This subsection uses a random sample of 30 users from the MIND dataset to detect limitations and gain insights into ChatGPT’s performance when it generates recommendations for individual users based on a set of unread articles. Based on our investigation of ChatGPT’s response generation using the initial prompt provided by Liu et al. [33], we observe that ChatGPT struggles to effectively differentiate between articles previously read by a user and the candidate articles. As a result, ChatGPT may generate recommendations that include articles already read by the user. Building upon this observation, we propose hypothesis 1:</p>
          <p>Hypothesis 1: Improving the organization of prompts by using the JSON format with explicit keys, instead of solely relying on textual descriptions, will better distinguish the articles read by a user from the candidate articles.</p>
          <p>We evaluate four different prompts (prompt 0 to prompt 3), shown in Figure 1. We feed each prompt to the model five times for each user and count the number of users whose responses contain articles that the user has previously read. We conduct an exact binomial test to investigate further. The results indicate that when utilizing prompt 3 from Figure 1, the probability of having articles previously read by the user in the response was found to be zero. However, we could not reach the same conclusion for the other prompts. Based on these findings, we can infer that when dealing with lengthy texts, and when it is crucial to differentiate specific information, utilizing a JSON format with explicit keys proves to be more effective than relying solely on textual descriptions.</p>
          <p>We further assess ChatGPT’s zero-shot personalized RS capability by comparing it to several baselines, including LSTUR [7], TANR [35], NRMS [36], and NAML [8], using the metrics top-k Hit Ratio (Hit@k) and Normalized Discounted Cumulative Gain (nDCG@k). The results, presented in Table 1, indicate that ChatGPT’s zero-shot news RS performance is inferior to existing deep neural-based models. However, we observe that there is a high probability (over 93.3%) that the top-5 articles recommended by ChatGPT are from the same historical topics as the user’s interests, whereas in the ground truth, there is only a 60% chance that the clicked article belongs to the same categories as the historical articles. This suggests that ChatGPT is capable of understanding the categories of historical articles that users are interested in. However, user interests are dynamic, and without fine-tuning or training on the news dataset, ChatGPT’s RS performance is inferior to that of existing deep neural-based models. This highlights the need for further research and potential fine-tuning approaches to enhance ChatGPT’s recommendation performance in the news domain.</p>
        </sec>
        <sec id="sec-eval-2">
          <title>3.2. News Provider Fairness</title>
          <p>Most news organizations that create content (i.e., providers) depend on advertising for a substantial fraction of their operating revenues, supplementing other revenue sources such as user-subscriber fees, cable TV carriage fees, and donations. Digital advertising depends on attracting users to the news site, and an important referring source of visitors is news, social media, and search platforms, which implement RS. Reduced levels of ad revenue have contributed to news organizations closing, which has created vast news deserts in the US, where communities no longer have news coverage [37]. When Facebook changed its RS in 2018, small news organizations had decreases in traffic and ad revenue [38], and countries such as Australia are attempting to regulate platforms and have them pay news organizations for their content. Platforms that implement news RS must therefore balance the needs of different stakeholders with multiple objectives, and they may want to guarantee that various providers receive some “fair” proportion of recommendations. While provider fairness is often addressed as a post-processing step in news RS [23, 39], our objective is to first identify any biases related to news provider fairness using ChatGPT and then explore potential prompt improvements to alleviate these concerns. We divide providers into two groups, popular and unpopular, and we utilize precision@k to assess the proportion of popular providers among the top-k recommendations.</p>
          <p>The first scenario involves not providing candidate articles to ChatGPT but instead asking it for recommendations based on the articles that a user has read before.</p>
          <p>In our preliminary experiment using initial prompt 0 from Figure 2, we observe that ChatGPT mistakenly labels some popular providers as unpopular in its responses.</p>
          <p>This prompts us to further investigate provider fairness metrics from two perspectives: the user’s perspective, where we adjust the popularity labels based on a predefined list of 100 popular providers, and ChatGPT’s perspective, where we evaluate its performance using the popularity labels assigned by ChatGPT in its responses.</p>
          <p>Additionally, in the initial experiment, we notice that ChatGPT tends to recommend articles from providers labeled as popular by ChatGPT. This finding prompts us to propose the following hypothesis:</p>
        </sec>
      </sec>
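      <p>As an illustration of the JSON-keyed prompt organization proposed in hypothesis 1, the following is a minimal sketch of how such a prompt might be assembled; the key names and instruction wording are illustrative assumptions, not the exact prompts of Figure 1:</p>
      <preformat>
```python
import json

def build_json_prompt(history_titles, candidate_titles, k=5):
    """Assemble a recommendation prompt with explicit JSON keys.

    Separating already-read articles from candidates under distinct keys
    is the property hypothesis 1 credits for keeping previously read
    items out of the recommendations.
    """
    payload = {
        "previously_read_articles": history_titles,
        "candidate_articles": candidate_titles,
    }
    instruction = (
        f"Recommend the top {k} articles for this user. "
        "Choose only from 'candidate_articles' and never repeat "
        "anything listed in 'previously_read_articles'."
    )
    # The model receives the instruction followed by the structured payload.
    return instruction + "\n" + json.dumps(payload, indent=2)
```
      </preformat>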
      <sec id="sec-2-2">
        <p>Hypothesis 2: Explicitly specifying the number of articles from both popular and unpopular providers will mitigate the issue of provider bias based on a user’s tolerance.</p>
        <p>To evaluate hypothesis 2, six prompts (prompt 0 to prompt 5 in Figure 2) are applied. The results shown in Figure 3 support hypothesis 2: ChatGPT demonstrates efficient controllability, which is a significant advantage compared to existing models that aim to address the news provider bias issue. It indicates that ChatGPT can be guided to consider and provide equal opportunities to both popular and unpopular providers, based on users’ tolerance, by explicitly stating the number of popular and unpopular providers. Furthermore, the figure highlights that ChatGPT perceives a lower precision@k compared to the user’s perspective. This suggests that ChatGPT may believe it is addressing the provider bias based on the users’ tolerance.</p>
        <p>Besides detecting provider bias when no candidate articles are provided, we also observe this issue when candidate articles are provided, using the initial prompt 0 in Figure 4. This bias may be influenced by the presence of provider bias in the user’s history, where the user shows a preference for articles from popular providers, and we propose hypothesis 3:</p>
        <p>Hypothesis 3: Explicitly indicating the priority of less popular providers mitigates ChatGPT’s provider bias when candidate articles are provided.</p>
        <p>Prompt 3 in Figure 4 incorporates the term ‘provider fairness’, which aligns with the definition in our study. The results presented in Figure 5 demonstrate that explicitly stating the priority of less popular providers can effectively mitigate the provider bias issue in ChatGPT’s recommendations. This reduction in bias is statistically significant (p &lt; 0.05), as indicated by the precision@5 metric. The difference in precision@10, however, is not statistically significant (p &gt; 0.1). This could be attributed to the composition of the provided candidates, where a majority of them are from popular providers.</p>
        <p>Another notable finding is the disparity between the precision of ChatGPT’s and the user’s perspectives. Comparing the disparity between prompt 2 and prompt 4, as well as prompt 3 and prompt 5 in Figure 3, it becomes evident that reintroducing the list of popular and unpopular providers in the prompts decreases the disparity. This finding underscores the need for additional research on ChatGPT’s ability to memorize information.</p>
        <sec id="sec-eval-3">
          <title>3.3. Trustfulness of ChatGPT</title>
          <p>The use of ChatGPT has opened up possibilities for human-computer interaction and information generation, but it also brings an ethical concern: the generation of deceptive information, particularly in the form of fake news [40, 41, 42]. As the popularity of ChatGPT increases, so does the potential risk of disseminating false or misleading information, leading to distorted perceptions of events and fostering incorrect beliefs and decisions among the public. To address these concerns, this subsection investigates the trustworthiness of ChatGPT in providing news recommendations, employing the same 30 users and conducting 5 independent trials for each prompt under examination.</p>
          <p>In our investigation using the initial prompt 0, where no candidate articles are provided and ChatGPT is asked to recommend one article based on the user’s reading interests (as depicted in Figure 6), we observe the existence of fake news generation. However, the rate of generating fake news (i.e., news with titles that cannot be verifiably found via Google search) is inconsistent, fluctuating between approximately one half and one third of users. Building on this finding, we formulate hypothesis 4 to explore whether presenting candidate articles in designed prompts for ChatGPT to make recommendations could effectively reduce the issue of fake news generation.</p>
        </sec>
      </sec>
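      <p>The precision@k measure used above for provider fairness can be sketched as follows; this is a minimal illustration, where the popular-provider set may be either the predefined list of 100 popular providers (user perspective) or the labels ChatGPT itself assigned (ChatGPT perspective):</p>
      <preformat>
```python
def precision_at_k(recommended_providers, popular_set, k):
    """Fraction of the top-k recommended articles whose provider is popular.

    recommended_providers: provider names in recommendation order.
    popular_set: set of providers considered popular under the chosen
    perspective (user-defined list or ChatGPT-assigned labels).
    """
    top_k = recommended_providers[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for provider in top_k if provider in popular_set)
    return hits / len(top_k)
```
      </preformat>
      <p>A lower value indicates that fewer of the top-k slots went to popular providers, i.e., more exposure for the unpopular group.</p>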
      <sec id="sec-2-3">
        <p>Hypothesis 4: Providing candidate articles based solely on title information will significantly decrease the likelihood of generating fake news during ChatGPT’s recommendations.</p>
      </sec>
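      <p>When candidate articles are supplied, one simple way to operationalize the fake-news check is to flag recommended titles that match no supplied candidate. This is an illustrative proxy for the Google-search verification described above, not the paper’s exact procedure:</p>
      <preformat>
```python
def fake_title_rate(responses, candidate_titles):
    """Share of recommended titles that match none of the provided candidates.

    responses: list of per-trial lists of recommended titles.
    candidate_titles: the titles actually offered to the model.
    A title outside the candidate list serves as a proxy for a
    fabricated (unverifiable) article.
    """
    known = {t.strip().lower() for t in candidate_titles}
    recommended = [t.strip().lower() for trial in responses for t in trial]
    if not recommended:
        return 0.0
    fakes = sum(1 for t in recommended if t not in known)
    return fakes / len(recommended)
```
      </preformat>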
    </sec>
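    <p>The exact binomial tests behind the significance statements in this section can be sketched with the standard library alone. This computes a one-sided tail probability; the baseline rate p is whatever null rate is being tested (an assumption of this sketch, since the paper does not spell out its null hypotheses):</p>
    <preformat>
```python
from math import comb

def binom_tail(x, n, p):
    """Exact one-sided binomial p-value: P(X >= x) for X ~ Binomial(n, p).

    Example use: n prompt trials, x of which contained an already-read
    or fabricated article, tested against a hypothesized rate p.
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x, n + 1))
```
      </preformat>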
    <sec id="sec-eval-3b">
      <p>To test this hypothesis, we further evaluate three different prompts (prompt 1 to prompt 3) with candidates provided in different forms, as shown in Figure 6. Prompt 1 and prompt 2 represent each article using both its ID and title, while prompt 3 represents each article using only its title.</p>
      <p>Our empirical findings indicate that when utilizing prompt 1 and prompt 2, approximately 1 out of 10 users on average receive recommended responses containing fake IDs. The presence of fake IDs under prompt 1 and prompt 2 (as shown in Figure 6) can be attributed to ChatGPT’s difficulty in handling numerical values and the lack of concrete meaningful words in its training data for the short strings in prompt 2. Nevertheless, this represents a substantial, statistically significant (p &lt; 0.05) decrease in the generation of fake news compared to the performance observed with prompt 0. The provision of candidate articles for ChatGPT during news recommendations thus plays a significant role in mitigating the generation of fake news compared to scenarios where no candidates are provided.</p>
      <p>When using only the title information (prompt 3 in Figure 6), there is a further reduction in the probability of generating fake news, reaching 1 out of 150, which confirms hypothesis 4. However, it is essential to acknowledge that the issue of generating fake news is not completely eliminated, and addressing the broader social challenges arising from the dissemination of fake news articles when utilizing large language models like ChatGPT remains a crucial area of concern.</p>
    </sec>
    <sec id="sec-3">
      <title>4. Conclusion</title>
      <p>This study evaluates ChatGPT’s performance in news recommendations, with a focus on personalization, provider fairness, and fake news. Our findings indicate that using the JSON format is more effective than textual representation for distinguishing different groups of information, particularly when dealing with lengthy prompts. We observe that ChatGPT exhibits an inherent provider bias, but it can be controlled and adjusted based on users’ tolerances by explicitly specifying the number of accepted popular and unpopular providers or by prioritizing the unpopular ones. Despite providing explicit candidate articles, the issue of generating fake news cannot be completely resolved; however, the probability of generating fake news during recommendations is significantly lower than when making recommendations directly without providing candidate options. To address the challenge of fake news, enhancing the trustworthiness and reliability of language models is crucial in the news domain and remains an important area for further research. Additionally, we identify the need to improve ChatGPT’s memorization capability. This work aims to provide valuable insights and directions for future studies that seek to enhance news recommendation performance using language models like ChatGPT. Furthermore, we have created a webpage to encourage more researchers to actively participate in this field of study.</p>
      <p>A promising and important area for future research is to investigate ethical issues around the use of ChatGPT for news recommendation. The task of recommending news is especially complex because the system goals extend far beyond identifying articles of interest to a user [43]. News RS should avoid creating experience cocoons, echo chambers, and filter bubbles, where users only encounter stories that reinforce their existing beliefs, interests, and ideologies [44]. The hazards of manipulation are great, e.g., a political party attempting to manipulate the system to show stories on a certain event to inflate their importance. Further research can investigate how to formulate prompts to manage exposure diversity and biases, and to safeguard against manipulation.</p>
    </sec>
    <sec id="sec-refs">
      <title>References</title>
      <p>[1] F. Wu, Y. Qiao, J.-H. Chen, C. Wu, T. Qi, J. Lian, D. Liu, X. Xie, J. Gao, W. Wu, et al., MIND: A large-scale dataset for news recommendation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3597–3606.</p>
      <p>[2] J. Lian, F. Zhang, X. Xie, G. Sun, Towards better representation learning for personalized news recommendation: a multi-channel deep fusion approach, in: IJCAI, 2018, pp. 3805–3811.</p>
      <p>[3] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078 (2014).</p>
      <p>[4] R. C. Staudemeyer, E. R. Morris, Understanding LSTM – a tutorial into long short-term memory recurrent neural networks, arXiv preprint arXiv:1909.09586 (2019).</p>
      <p>[5] Y. Chen, Convolutional neural network for sentence classification, Master’s thesis, University of Waterloo, 2015.</p>
      <p>[6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[7] M. An, F. Wu, C. Wu, K. Zhang, Z. Liu, X. Xie, Neural news recommendation with long- and short-term user representations, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 336–345.</p>
      <p>[8] C. Wu, F. Wu, T. Qi, C. Li, Y. Huang, Is news recommendation a sequential recommendation task?, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 2382–2386.</p>
      <p>[9] C. Wu, F. Wu, M. An, J. Huang, Y. Huang, X. Xie, Neural news recommendation with attentive multi-view learning, arXiv preprint arXiv:1907.05576 (2019).</p>
      <p>[10] W. Jin, Y. Cheng, Y. Shen, W. Chen, X. Ren, A good prompt is worth millions of parameters? Low-resource prompt-based learning for vision-language models, arXiv preprint arXiv:2110.08484 (2021).</p>
      <p>[11] Y. Zhang, H. Ding, Z. Shui, Y. Ma, J. Zou, A. Deoras, H. Wang, Language models as recommender systems: Evaluations and limitations, in: I (Still) Can’t Believe It’s Not Better! NeurIPS 2021 Workshop, 2021.</p>
      <p>[12] Z. Cui, J. Ma, C. Zhou, J. Zhou, H. Yang, M6-Rec: Generative pretrained language models are open-ended recommender systems, arXiv preprint arXiv:2205.08084 (2022).</p>
      <p>[13] S. Geng, S. Liu, Z. Fu, Y. Ge, Y. Zhang, Recommendation as language processing (RLP): A unified pretrain, personalized prompt &amp; predict paradigm (P5), RecSys (2022).</p>
      <p>[14] S. Okura, Y. Tagami, S. Ono, A. Tajima, Embedding-based news recommendation for millions of users, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1933–1942.</p>
      <p>[15] C. Wu, F. Wu, M. An, J. Huang, Y. Huang, X. Xie, Neural news recommendation with attentive multi-view learning, arXiv preprint arXiv:1907.05576 (2019).</p>
      <p>[16] T. T. Nguyen, P.-M. Hui, F. M. Harper, L. Terveen, J. A. Konstan, Exploring the filter bubble: the effect of using recommender systems on content diversity, in: Proceedings of the 23rd International Conference on World Wide Web, 2014, pp. 677–686.</p>
      <p>[17] M. Cinelli, G. De Francisci Morales, A. Galeazzi, W. Quattrociocchi, M. Starnini, The echo chamber effect on social media, Proceedings of the National Academy of Sciences 118 (2021) e2023301118.</p>
      <p>[18] S. Vosoughi, D. Roy, S. Aral, The spread of true and false news online, Science 359 (2018) 1146–1151.</p>
      <p>[19] H. Abdollahpouri, M. Mansoury, R. Burke, B. Mobasher, E. Malthouse, User-centered evaluation of popularity bias in recommender systems, in: Proceedings of the 29th ACM Conference on User Modeling, Adaptation and Personalization, 2021, pp. 119–129.</p>
      <p>[20] Y. Li, H. Chen, S. Xu, Y. Ge, Y. Zhang, Towards personalized fairness based on causal notion, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1054–1063.</p>
      <p>[21] C. Wu, F. Wu, X. Wang, Y. Huang, X. Xie, Fairness-aware news recommendation with decomposed adversarial learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>