<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>T. E. Kolb);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>DJ - an Expert Study on News Recommendations Beyond Accuracy</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas E. Kolb</string-name>
          <email>thomas.kolb@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irina Nalis</string-name>
          <email>irina.nalis-neuner@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julia Neidhardt</string-name>
          <email>julia.neidhardt@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Christian Doppler Laboratory for Recommender Systems, TU Wien</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>In the past, recommender systems were primarily focused on optimizing accuracy. However, in recent years, there has been an increasing awareness that considerations beyond accuracy are necessary. The definition of what constitutes a good recommendation is a crucial issue. The most precise prediction may not always be the recommendation that satisfies the user best. This study offers a comprehensive investigation into the present advancements within the realm of beyond-accuracy measurements, especially the metrics diversity, serendipity, and novelty. Collaborative efforts between algorithmic models and domain experts can enrich recommendation quality, particularly in labeling and categorizing content. To address this, we present a study conducted by experts in the news domain. This study provides new insights into the multifaceted nature of this challenge. Employing an interdisciplinary approach, we underscore the significance of constructing a system that revolves around the user. Recent discussions about algorithmic content filtering and its societal implications underscore the importance of maintaining human involvement in the decision-making loop.</p>
      </abstract>
      <kwd-group>
        <kwd>Accuracy</kwd>
        <kwd>recommender systems</kwd>
        <kwd>beyond-accuracy measures</kwd>
        <kwd>domain-expert study</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recommender systems have traditionally focused on accuracy, aiming to predict how users
rate items based on their past preferences. However, in recent years, there has been a
growing recognition that assessing recommender systems based solely on accuracy is insufficient.
Beyond-accuracy measures, such as novelty, diversity, serendipity, and coverage, have emerged
as crucial dimensions for evaluating recommender systems and attract an increasing number of
studies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. News recommender systems face several challenges and potential problems that
impact their ability to promote diversity, novelty, and serendipity. Avoiding the problem of filter
bubbles [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] requires designing recommender systems that prioritize presenting a variety of
viewpoints and topics to users, rather than solely relying on personalized recommendations
based on past behavior.
      </p>
      <p>
        Another research gap is methodological: user perspectives and real-world evaluations are
rarely integrated into recommender system studies. According to
a survey of research papers on the performance of recommender systems, the vast majority
of the studies evaluated algorithmic models exclusively through offline trials [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Few papers
combined offline experiments with user research, which leaves user opinions and real-world
evaluations underrepresented in recommender systems research.
      </p>
      <p>
        Integrating user feedback, such as explicit ratings or explicit indications of interest in specific
topics or news sources, can enhance personalization while maintaining a balance with diverse
and serendipitous recommendations. Exploring how different recommendation strategies affect
user engagement and satisfaction is crucial. Developing algorithms that explicitly optimize for
diversity, novelty, and serendipity involves considering not only the relevance and popularity of
articles but also the diversity of perspectives, coverage of niche topics, and unexpected content
that may pique user interest. This is where our interdisciplinary research, which combines insights
from user experience research and the social sciences, can contribute to a comprehensive understanding of the
complexities and trade-offs involved in designing diverse and engaging news recommendations.
For instance, Choi et al. (2022) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] propose a novel news recommendation model that aims
to provide personalized recommendations by considering a user’s individual interest rather
than relying solely on popular articles. Winecoff et al. (2019) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] emphasize the importance of
integrating psychological concepts into recommender systems (RS) by addressing the limitations
of commonly used similarity functions and algorithm validations. Their research highlights the
need to develop similarity functions that align with human cognition and to rigorously evaluate
their performance through methodologically sound user testing.
      </p>
      <p>
        In the design of recommender systems, it is crucial to consider users’ psychology and
incorporate domain expertise to ensure the systems meet users’ needs and preferences [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. To address
the limitations of traditional recommendation techniques and facilitate a more comprehensive
evaluation, we propose an approach that investigates beyond-accuracy measures with domain
experts [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. In this workshop submission, we present the results of a labeling study with
experts who have profound domain knowledge, as they work as editors at the news outlet
under study. This approach enables a more systematic study of beyond-accuracy measures
in the specific context of the news domain.
      </p>
      <p>
        Therefore, our study also answers calls to integrate the user perspective into the development
of beyond-accuracy measures. For instance, some authors [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] underline the necessity to
closely observe the process in which the individual is confronted with the recommendation and
whether, for instance, one is at the beginning of a decision-making process or has already decided.
Depending on the decision phase, the recommender system should consider entirely different
sets of items to cater to these perspectives. Moreover, specifics of the news domain and user
needs for information and entertainment [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] need to be considered.
      </p>
      <p>
        Within our study, we aim to extract multi-faceted feature sets that capture different aspects
of user preferences and behavior toward reading recommendations in the news domain in
reference to the beyond-accuracy paradigm, e.g., [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Therefore, editors of a news outlet were
invited to participate in a labeling study on ideal reading recommendations from the perspective
of diversity, novelty, and serendipity. The research design for this study integrates domain
expertise to improve automated categorization; it is therefore built around an expert survey and
labeling study to evaluate reading recommendations in recommender systems. By involving
relevant stakeholders, in our case editors of the news outlet that is being investigated, this study
aims to overcome the limitations of previous approaches and contribute to the development of
recommender systems that align with users’ needs and values. Moreover, it seeks to harvest
domain expertise that expands a mere user-centric approach, which is described in
human-in-the-loop approaches, yet oftentimes comes at the cost of overlooking domain expertise [10].
The integration of domain expertise is crucial in building models when labeling texts with
categories requires specialized knowledge and only limited annotated data are available [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Moreover, as demonstrated by Han et al. (2022) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], collaborative efforts with domain experts
can improve automated categorization by leveraging their understanding of categories and
their confidence in annotation. Hence, our study provides novel insights into the design and
development process.
      </p>
      <p>
        This submission particularly explores the concept of serendipity in recommender systems.
Serendipity refers to surprising or unexpected discoveries that are still relevant to users [
        <xref ref-type="bibr" rid="ref1">1, 11</xref>
        ].
Although serendipity has attracted considerable research interest, designing for serendipity
remains a challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The current interpretation of serendipity as a narrow evaluation metric
for algorithmic performance limits our understanding and hinders the ability to design for
serendipity. According to Smets et al. (2022) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], serendipity ought to be viewed as a user
experience rather than just an offline evaluation statistic like novelty or diversity. As a result,
a user-centric analysis is beneficial to any attempt to research serendipity in recommender
systems. The following topics are being addressed:
        <list list-type="bullet">
          <list-item><p>Content analysis and model performance: Evaluation of ranking and content analysis regarding the diversity of news article ressorts (news sections).</p></list-item>
          <list-item><p>Emerging characteristics of beyond-accuracy measures: Which characteristics of a reading recommendation are ideally present?</p></list-item>
          <list-item><p>User behavior and domain-specific preferences: What does an ideal reading recommendation look like?</p></list-item>
        </list>
      </p>
      <p>Our research aligns with the broader field of Digital Humanism [12], which emphasizes the
importance of striving for beyond-accuracy objectives and fairness in software programs and
algorithms [13]. By focusing on beyond-accuracy measures, we aim to enhance the quality
and usefulness of recommender systems while promoting fairness and user-centric design
principles.</p>
      <p>In the following sections, we will discuss the concepts of diversity, serendipity, and
novelty, highlighting their importance in evaluating recommender systems. We will also present
our experimental design for evaluating the influence of different features on serendipitous
encounters. This work represents a step towards a more integrated and user-centric approach
to beyond-accuracy measures in recommender systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Fairness and Bias</title>
        <p>
          The relevance of an item depends on various factors, including the consumer’s goals, situational
context, and the specific purpose of the recommender system from different stakeholders’
viewpoints. The current focus on optimizing accuracy measures using historical data may
not adequately capture the true value and relevance of recommended items. Therefore, bias
is attracting increasing attention, as fairness requires that items as well as user preferences
and stakeholder perspectives (e.g., those of news outlets) be better represented.
However, a recent review [14] criticizes the fragmented and
inconsistent nature of existing studies on bias in recommender systems, calling for a systematic
survey and taxonomy to organize research on recommendation debiasing. Existing research
has focused on addressing this issue by increasing the coverage of long-tail items. To further
advance the development of metrics of news recommenders, Smets et al. (2022) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] emphasize
the importance of considering multiple stakeholders in the design of news recommenders,
revealing that their development involves a negotiation process among actors beyond just users.
Their findings call for an expanded framework that accounts for preconditions, product owners,
and indirect stakeholder involvement, offering a more comprehensive understanding of news
recommenders.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Beyond Accuracy Paradigm</title>
        <p>
          Beyond accuracy measures. User behavior and decision-making are significantly influenced
by recommender systems, which highlights the necessity for ethically sound designs that uphold
democratic principles. It is important to ensure that these platforms, which enable users
to interact with the vast amount of information online, uphold the values of the cultures and
people that use them [15]. Although many researchers have attempted to evaluate their
methodologies in recent years using criteria other than accuracy (e.g., novelty, diversity,
serendipity, or coverage), there are still a number of significant problems that need to be resolved
[16]. Usually, relatively basic user models are taken into account in static settings, and
beyond-accuracy solutions are not tailored to the requirements or preferences of particular users or
groups of users, nor are they adjusted to different domains. Therefore, a rising number of people
are calling for these technologies to be evaluated for more than just accuracy (e.g., [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]). Under
the umbrella concept of beyond accuracy, several aspects are frequently discussed in order to
prevent potentially harmful effects from personalized recommendations that have been shown
to lead to filter bubbles [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Frequently discussed are diversity, novelty, and serendipity [17].
Notwithstanding the necessary critique and the call for beyond-accuracy measures, recommender
systems should continue to focus on accuracy since it helps to lower prediction errors. For
instance, it appears vital in many cases to prevent inaccurate recommendations because,
among other things, these could affect how well the system is perceived [18]. Therefore, it is crucial
to develop measurements that allow for a sensible trade-off between accuracy and other criteria
for good recommendations.
        </p>
        <p>Diversity. Diversity in news recommendation is a topic of significant research interest.
In order to keep users interested, a news recommender system (NRS) must strike a balance between
remaining relevant to their interests, hence avoiding content that is too diversified or irrelevant
to their preferences, and still offering sufficient variety, e.g., by introducing users to new topics
and categories. Raza and Ding (2023) [19] present
a deep neural network that meets the needs of users to obtain information on topics in which
they have shown interest before, but that goes beyond accuracy metrics. Lee and Lee (2022)
[20] explored the role of perceived personalization and news diversity in news recommendation
services with a focus on the efficacy of news diversity and trust in NRS. They looked at how
users’ intention to continue using the service was affected by perceived personalization and news
diversity. They found that perceived personalization significantly affected continuance intention,
mediated by trust, user enjoyment, and perceived utility.</p>
        <p>Novelty. The term novelty often refers to whether something is new. When recommendations
include products or topics the user was previously unfamiliar with, they are seen as being
more helpful [21]. Users, however, can differ in terms of their needs for novelty or variety due
to differences in their personalities and interests. In the research on recommender systems,
novelty is often described in two conceptually distinct ways [17]. The first takes into
account whether a person is familiar with a certain item, which is clearly challenging to
capture. The second defines novelty as the antithesis of popularity; hence, it could be
applied as a measure to counterbalance popularity bias [22]. Moreover, novelty is closely related
to serendipity; both beyond-accuracy measures come with the challenge of
finding a good balance between relevance and positive surprise.</p>
        <p>
          Serendipity. The concept of serendipity is discussed by Smets et al. (2022) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] from the
perspective of its potential to help mitigate popularity bias and to increase the usefulness
of recommendations by enabling better discoverability. In their research in line with the
beyond-accuracy paradigm on diversity, coverage, and serendipity, Smets et al. (2022) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
present a feature repository and highlight the importance of integrating metadata, user interface,
and information access in facilitating serendipitous experiences. In addition, Niu and Al-Doulat (2021) [23]
provide evidence for the potential of serendipity by investigating the
use of surprise in improving user satisfaction and inspiring curiosity in recommender systems
in the health domain. Ziarani and Ravanmehr (2021) [24] present a systematic literature review
on serendipity in recommender systems, exploring various aspects and approaches. Chen et al.
(2021) [25] examine the values of user exploration in recommender systems, emphasizing the
importance of serendipity in addressing the cold-start and filter-bubble problems. Furthermore,
Abdollahpouri et al. (2021) [13] focus on news recommender systems and discuss the direction
toward the next generation of these systems. Another issue that is particularly relevant when designing
for serendipity is the cold-start problem. Xu et al. (2022) [26] propose a self-serendipity
recommender system that addresses the cold-start problem and provides diverse but relevant
recommendations.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Editor Study</title>
      <p>This collaborative study with Falter Verlagsgesellschaft m.b.H.<sup>1</sup> (FALTER) focuses on
investigating the needs for diversity, serendipity, and novelty in recommender systems. The research
aims to understand the variations of these needs within the news domain. It further explores
how bias co-evolves with various outcome measures, including beyond-accuracy objectives.
Specifically conducted within the news domain, the study examines the weekly newspaper
FALTER, which publishes a wide range of news items both in print and online formats. By
analyzing the content, metadata, and additional news formats, the research aims to enhance
the recommendation process, aligning it more effectively with the preferences and expectations
of users.</p>
      <p>An expert-in-the-loop approach was chosen to answer the raised questions. This is especially
important as recent research has shown that multiple different stakeholders are often involved
in such a system [27]. In addition, a survey was issued to the editors on their personal perception
of the importance of diverse, novel, and serendipitous reading recommendations. Moreover, the editors
were invited to answer an open-ended question on what an ideal reading recommendation
would look like. In total, twelve editors participated in the study. The editors were chosen so as
to represent the whole news medium by covering all important areas (feuilleton, politics,
city-life, nature, economy, etc.).
      </p>
      <sec id="sec-3-0">
        <title>3.1. Data</title>
        <p>The data corpus provided by FALTER contains more than 100,000 news articles. Because FALTER
is a weekly newspaper with a focus on investigative journalism, this corpus consists of
high-quality texts. However, it is important to note that the texts are exclusively in German.
FALTER employs a monetization model based on subscriptions, which results in the majority of
their articles being concealed behind a paywall. To enhance the reproducibility of this approach,
we have provided access to the articles utilized for this study within a GitHub repository
(https://github.com/ThomasEKolb/an-expert-study-on-news-recommendations-beyond-accuracy). The
repository grants access to both content and metadata of 875 news articles. These articles were
utilized in either of two ways: as candidate items or as baseline recommendations for the scope
of this study.</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Candidate Item Selection &amp; Baseline Recommendations</title>
        <p>A representative set of candidate items was created based on the 15 most recent significant
stories of each of the participating editors. It is important to note that columns were excluded
from this selection process. This process was supported by a domain expert from FALTER.
In principle, the resulting list would consist of 12 (editors) × 15 (candidate items) = 180 articles;
in practice, a total of 168 articles were chosen because the editors did not consistently have
15 significant articles each. Pyserini [28] is
utilized to generate recommendations for each of the 168 candidate articles. A traditional lexical
model (BM-25) is employed to capture the essence of Lucene-based tools, such as Elasticsearch
(https://www.elastic.co/elasticsearch/), which are widely employed in the industry. The recommendations are generated based on the
corpus presented beforehand. Recommendations are considered valid if the recommended news
article was published within the past 365 days, using a snapshot of the corpus dated 2022-12-12.
The initial three paragraphs of the query items serve as the search query parameter for the
recommendation process, employing the default parameters of the BM-25 algorithm within the
Pyserini framework. A set of 15 items is recommended for each candidate article, resulting in a
total of 2520 recommended items.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Survey Implementation</title>
        <p>In addition to asking the editors about their preferences regarding beyond-accuracy metrics
such as diversity, novelty, and serendipity, as well as about their view of a good reading
recommendation, each editor was requested to rank 15 recommendations for 15 of their own articles.
Assigning a ranking to a collection of 15 items typically proves to be challenging. Thus, we used
a ranking technique known as best-worst scaling (BWS), which increases the agreement among
annotators compared with alternative methodologies [29]. The tuples for the survey were
created by using a script<sup>5</sup> provided by Kiritchenko &amp; Mohammad alongside their publication
[30]. The parameters were set as follows: “factor: 2”, “4 items per tuple”, and a total of
“100 iterations”. Figure 1 shows one of the surveys that was generated based on the provided
tuples. At the top, the query article is displayed, featuring its headline, subtitle, and a concise
text excerpt. The headline is linked to the full version of the article. Directly beneath the
query article, a list presents four suggested articles (= one tuple created following the BWS
methodology). The editor has the option to designate one article as the “best” choice and another
as the “worst” option for a good reading recommendation in relation to the query article.
LimeSurvey<sup>6</sup> was used as the survey tool. For each editor, a dedicated survey was created
programmatically. LimeSurvey offers great flexibility in configuring an interface for the BWS approach.</p>
        <p>Once all editors completed their survey, the scores were calculated based on the collected
annotations using counts analysis. This variation of BWS computes scores for each item by
subtracting the percentage of times it was chosen as the worst from the percentage of times it
was chosen as the best, resulting in a score ranging from -1 to +1. This makes it possible to create a more
robust and reliable article ranking for the given query articles. The outcomes of the various
stages are available in the repository linked to this work. To assess the quality of the labeled
data, the split-half reliability was calculated for each editor.</p>
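        <p>The counts analysis described above can be illustrated with a minimal Python sketch. It is not the script used in the study: the function name bws_scores, the annotation format (shown items, best item, worst item), and the article ids are assumptions made for illustration only.</p>
        <preformat>
```python
from collections import defaultdict


def bws_scores(annotations):
    """Counts analysis for best-worst scaling.

    annotations: iterable of (items, best, worst), where items is the tuple
    shown to the annotator and best/worst are the two selected items.
    Returns {item: share chosen as best - share chosen as worst}.
    """
    shown = defaultdict(int)
    best_count = defaultdict(int)
    worst_count = defaultdict(int)
    for items, best, worst in annotations:
        for item in items:
            shown[item] += 1
        best_count[best] += 1
        worst_count[worst] += 1
    return {
        item: (best_count[item] - worst_count[item]) / n
        for item, n in shown.items()
    }


# Hypothetical annotations for one query article (article ids are made up).
annotations = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "e"), "a", "c"),
    (("b", "c", "d", "e"), "b", "d"),
]
scores = bws_scores(annotations)

# Sorting by score yields the article ranking for this query article.
ranking = sorted(scores, key=scores.get, reverse=True)
```
        </preformat>
        <p>In this toy example, an item that is always chosen as best receives +1, an item that is always chosen as worst receives -1, and an item that is never selected receives 0.</p>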
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Analysis</title>
        <p>Using ClayRS [31], a Python framework for evaluation, we computed the correlation
between BM-25 based recommendations and the outcomes of the editor study. The framework’s
evaluation module<sup>7</sup> facilitates the use of diverse metrics to analyze the results. Furthermore, this
framework is specifically crafted to enhance the replicability of the research conducted. Besides
the correlation metrics, the comparison of the two article rankings involves an examination of
the Jaccard similarity (intersection over union) of the articles’ ressorts (news sections). This analysis enables the
exploration of the degree of similarity in the ressorts present within the top positions of
both lists.</p>
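        <p>As a hedged illustration of the Jaccard comparison, the following Python sketch computes the intersection over union of the ressorts found in the top-k positions of two ranked lists. The helper names and the ressort labels are hypothetical, and the sketch does not model the exclusion of list pairs with missing ressort data; it is not the ClayRS implementation.</p>
        <preformat>
```python
def jaccard(a, b):
    """Intersection over union of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a.union(set_b)
    if not union:
        return None  # undefined when both sets are empty
    return len(set_a.intersection(set_b)) / len(union)


def ressort_overlap_at_k(ressorts_a, ressorts_b, k):
    """Jaccard similarity of the ressorts in the top-k of two rankings."""
    return jaccard(ressorts_a[:k], ressorts_b[:k])


# Hypothetical ressort labels of two ranked lists for the same query article.
bm25_ressorts = ["Politik", "Stadtleben", "Feuilleton", "Natur", "Medien"]
editor_ressorts = ["Politik", "Medien", "Natur", "Feuilleton", "Wirtschaft"]

overlap_top3 = ressort_overlap_at_k(bm25_ressorts, editor_ressorts, 3)
overlap_top5 = ressort_overlap_at_k(bm25_ressorts, editor_ressorts, 5)
```
        </preformat>
        <p>With these made-up labels, the top three positions share one of five distinct ressorts, while the top five share four of six, so the overlap grows as more positions are considered.</p>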
        <p>
          We presented the expert raters with an open-ended question on their individual perception
of an ideal reading recommendation. In addition, they were provided with a description of diversity,
novelty, and serendipity and asked to rank their individual preferences for each of these
beyond-accuracy measures for ideal reading recommendations. By applying a multi-method
approach of a labeling study in combination with a survey, we were able to mitigate several
shortcomings of research on recommender systems as described by [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>One key takeaway is the relatively high prioritization of serendipity in comparison with the two
other beyond-accuracy measures. The expert study included a survey component
to solicit the editors’ opinions on ideal reading recommendations. Numerous editors stressed
the value of serendipity and the exploration of new knowledge and ideas in their written
comments. They reported a need for reading suggestions that would broaden their knowledge,
pique their curiosity, and provide them with a compelling reason to keep reading. Additionally,
in order to assess beyond-accuracy metrics, the editors were asked to rank separately the significance of
diversity, novelty, and serendipity for a reading recommendation. This study
emphasizes how crucial it is to include serendipity as a fundamental component of recommender
systems in the news domain. While accuracy is unquestionably important, offering suggestions
that go beyond mere accuracy and encourage chance encounters can increase user satisfaction and
engagement. As one editor so eloquently put it, the ideal reading recommendation should be like
a talented DJ who creates a mix that surprises and engages the listener. By incorporating the
editors’ insights and giving serendipity priority, recommender
systems can more effectively achieve the targeted results of expanding knowledge, arousing
interest, and introducing users to cutting-edge concepts and material. The editors’ answers emphasize the value
of serendipity in the context of news recommendations, highlighting its usefulness for fostering an
interesting and educational reading experience.
      </p>
      <p><sup>7</sup> https://swapuniba.github.io/ClayRS/evaluation/introduction/</p>
      <p>[Table 1. Columns: mean, std.]</p>
      <p>Following Kiritchenko and Mohammad [30], the split-half
reliability of the BWS conducted with the editors was assessed. The average Spearman correlation
coefficient, computed for each query article and its corresponding recommendations, is 0.785,
with a standard deviation of 0.092. These results indicate that the editors exhibited a
high level of consistency during the labeling task. Another crucial aspect is the evaluation of
the BM-25 based recommendations and the comparison of their ranking with the article ranking
created by the editors during the labeling task. Table 1 shows that there is only a low
positive correlation between the two lists. This highlights how important such labeled datasets
are. It is not enough to just use “another” recommendation algorithm and apply it to a dataset.
The outcomes from calculating the mean of the Jaccard similarity<sup>8</sup> and the corresponding
standard deviation across all pairs of ranked lists emphasize this observation. There is a certain
ressort-wise overlap, which increases as more items within the list are considered, from
the top three up to the full list of 15 items.</p>
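      <p>The split-half reliability reported above can be sketched in Python as follows. The exact splitting procedure used in the study is not described here, so this sketch assumes that the annotated tuples are randomly divided into two halves, that BWS scores are computed per half with counts analysis, and that the Spearman correlation of the two score sets is averaged over repeated random splits. All function names, the number of trials, and the toy data are hypothetical.</p>
      <preformat>
```python
import math
import random
from collections import defaultdict
from itertools import combinations
from statistics import mean


def bws_scores(annotations):
    """Best-minus-worst share per item from (items, best, worst) annotations."""
    shown, net = defaultdict(int), defaultdict(int)
    for items, best, worst in annotations:
        for item in items:
            shown[item] += 1
        net[best] += 1
        net[worst] -= 1
    return {item: net[item] / n for item, n in shown.items()}


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    positions = defaultdict(list)
    for rank, value in enumerate(sorted(values), start=1):
        positions[value].append(rank)
    return [mean(positions[value]) for value in values]


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return math.nan  # undefined for constant scores
    return cov / math.sqrt(var_x * var_y)


def split_half_reliability(annotations, trials=100, seed=0):
    """Mean Spearman correlation between scores from two random halves."""
    rng = random.Random(seed)
    pool = list(annotations)
    correlations = []
    for _ in range(trials):
        rng.shuffle(pool)
        half = len(pool) // 2
        first, second = bws_scores(pool[:half]), bws_scores(pool[half:])
        shared = sorted(set(first).intersection(second))
        if len(shared) >= 2:  # a correlation needs at least two items
            rho = spearman([first[i] for i in shared],
                           [second[i] for i in shared])
            if not math.isnan(rho):
                correlations.append(rho)
    return mean(correlations) if correlations else math.nan


# Hypothetical, fully consistent annotator: a larger id is always preferred.
annotations = [(t, max(t), min(t)) for t in combinations(range(6), 4)]
reliability = split_half_reliability(annotations)
```
      </preformat>
      <p>For a perfectly consistent annotator, as in this toy data, the two halves produce very similar rankings and the reliability is close to 1; noisier annotations would lower it.</p>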
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>This study goes beyond the conventional approaches in recommender systems by emphasizing the
importance of incorporating individual-level factors and user characteristics into the
recommendation process. Unlike previous works that mainly focused on independent individuals
without considering social context or domain-specific expertise, this research acknowledges
the domain knowledge of news editors in the investigation of news reading recommendations.
The findings underscore the effectiveness of integrating an expert study with BWS and the
LimeSurvey platform for efficiently acquiring domain knowledge. To overcome the limitations
of traditional single-item approaches in previous studies, we integrated forced-choice and
best-worst scaling methods, which provide a more robust and
context-aware assessment of users’ judgments of reading recommendations.</p>
      <p><sup>8</sup> List pairs with missing ressort data were excluded from the Jaccard similarity calculation.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>Future studies could deepen these results by drawing upon psychological theories and empirical
studies to help understand why individuals have certain preferences and how these preferences
are associated with contextual information, personality, and demographic characteristics. The
insights into the editors’ perspective on the beyond-accuracy measures underline the importance
of further investigating serendipity. Furthermore, considering this as a multi-stakeholder issue,
it is essential to incorporate the audience’s viewpoint to perform a comprehensive comparison
of diverse opinions regarding these metrics, for instance via A/B tests, which could be designed
as controlled field experiments. Moreover, performing topic extraction on the news articles
presents an additional opportunity for enhancing the depth of understanding within the labeled
data.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research is supported by the Christian Doppler Research Association (CDG).</p>
      <p>[10] N. Sambasivan, R. Veeraraghavan, The deskilling of domain expertise in AI development, in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–14.</p>
      <p>[11] L. Björneborn, Three key affordances for serendipity: Toward a framework connecting environmental and personal factors in serendipitous encounters, Journal of Documentation (2017).</p>
      <p>[12] H. Werthner, E. Prem, E. A. Lee, C. Ghezzi, Perspectives on Digital Humanism, Springer Nature, 2022.</p>
      <p>[13] H. Abdollahpouri, E. C. Malthouse, J. A. Konstan, B. Mobasher, J. Gilbert, Toward the next generation of news recommender systems, in: Companion Proceedings of the Web Conference 2021, 2021, pp. 402–406.</p>
      <p>[14] J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, X. He, Bias and debias in recommender system: A survey and future directions, ACM Transactions on Information Systems 41 (2023) 1–39.</p>
      <p>[15] J. Stray, A. Halevy, P. Assar, D. Hadfield-Menell, C. Boutilier, A. Ashar, L. Beattie, M. Ekstrand, C. Leibowicz, C. M. Sehat, et al., Building human values into recommender systems: An interdisciplinary synthesis, arXiv preprint arXiv:2207.10192 (2022).</p>
      <p>[16] D. Jannach, P. Pu, F. Ricci, M. Zanker, Recommender systems: Trends and frontiers, AI Magazine 43 (2022) 145–150.</p>
      <p>[17] M. Kaminskas, D. Bridge, Diversity, serendipity, novelty, and coverage: a survey and empirical analysis of beyond-accuracy objectives in recommender systems, ACM Transactions on Interactive Intelligent Systems (TiiS) 7 (2016) 1–42.</p>
      <p>[18] F. Fouss, E. Fernandes, A closer-to-reality model for comparing relevant dimensions of recommender systems, with application to novelty, Inf. 12 (2021) 500.</p>
      <p>[19] S. Raza, C. Ding, Relevancy and diversity in news recommendations, in: CEUR Workshop Proceedings, volume 3411, 2023, pp. 6–15.</p>
      <p>[20] S. Y. Lee, S. W. Lee, Normative or effective? The role of news diversity and trust in news recommendation services, International Journal of Human–Computer Interaction (2022) 1–14.</p>
      <p>[21] B. Alhijawi, A. A. Awajan, S. Fraihat, Survey on the objectives of recommender systems: Measures, solutions, evaluation methodology, and new perspectives, ACM Computing Surveys 55 (2022) 1–38.</p>
      <p>[22] H. Abdollahpouri, M. Mansoury, R. Burke, B. Mobasher, The connection between popularity bias, calibration, and fairness in recommendation, in: Proceedings of the 14th ACM Conference on Recommender Systems, 2020, pp. 726–731.</p>
      <p>[23] X. Niu, A. Al-Doulat, Luckyfind: Leveraging surprise to improve user satisfaction and inspire curiosity in a recommender system, in: Proceedings of the 2021 Conference on Human Information Interaction and Retrieval, 2021, pp. 163–172.</p>
      <p>[24] R. J. Ziarani, R. Ravanmehr, Serendipity in recommender systems: a systematic literature review, Journal of Computer Science and Technology 36 (2021) 375–396.</p>
      <p>[25] M. Chen, Y. Wang, C. Xu, Y. Le, M. Sharma, L. Richardson, S.-L. Wu, E. Chi, Values of user exploration in recommender systems, in: Proceedings of the 15th ACM Conference on Recommender Systems, 2021, pp. 85–95.</p>
      <p>[26] Y. Xu, E. Wang, Y. Yang, H. Xiong, GS2-rs: A generative approach for alleviating cold start and filter bubbles in recommender systems, IEEE Transactions on Knowledge and Data Engineering (2023).</p>
      <p>[27] H. Abdollahpouri, G. Adomavicius, R. Burke, I. Guy, D. Jannach, T. Kamishima, J. Krasnodebski, L. Pizzato, Multistakeholder recommendation: Survey and research directions, User Modeling and User-Adapted Interaction 30 (2020) 127–158. URL: https://doi.org/10.1007/s11257-019-09256-1. doi:10.1007/s11257-019-09256-1.</p>
      <p>[28] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, R. Nogueira, Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations, in: Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), 2021, pp. 2356–2362.</p>
      <p>[29] S. Kiritchenko, S. Mohammad, Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 465–470. URL: https://aclanthology.org/P17-2074. doi:10.18653/v1/P17-2074.</p>
      <p>[30] S. Kiritchenko, S. M. Mohammad, Capturing reliable fine-grained sentiment associations by crowdsourcing and best–worst scaling, in: Proceedings of The 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), San Diego, California, 2016.</p>
      <p>[31] P. Lops, C. Musto, M. Polignano, Semantics-aware content representations for reproducible recommender systems (score), in: Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization, UMAP ’22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 354–356. URL: https://doi.org/10.1145/3503252.3533723. doi:10.1145/3503252.3533723.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Smets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Michiels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Björneborn</surname>
          </string-name>
          ,
          <article-title>Serendipity in recommender systems beyond the algorithm: A feature repository and experimental design</article-title>
          ,
          <source>in: 16th ACM Conference on Recommender Systems. CEUR Workshop Proceedings</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>44</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Michiels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leysen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Goethals</surname>
          </string-name>
          ,
          <article-title>What are filter bubbles really? a review of the conceptual and empirical work</article-title>
          ,
          <source>in: Adjunct Proceedings of the 30th ACM Conference on User Modeling, Adaptation and Personalization</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>274</fpage>
          -
          <lpage>279</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <article-title>Escaping the McNamara fallacy: towards more impactful recommender systems research</article-title>
          ,
          <source>AI Magazine</source>
          <volume>41</volume>
          (
          <year>2020</year>
          )
          <fpage>79</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gim</surname>
          </string-name>
          ,
          <article-title>Do not read the same news! enhancing diversity and personalization of news recommendation</article-title>
          ,
          <source>in: Companion Proceedings of the Web Conference 2022</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1211</fpage>
          -
          <lpage>1215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Winecoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brasoveanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Casavant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Washabaugh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <article-title>Users in the loop: a psychologically-informed approach to similar item retrieval</article-title>
          ,
          <source>in: Proceedings of the 13th ACM Conference on Recommender Systems</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rezapour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Devkota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Diesner</surname>
          </string-name>
          ,
          <article-title>An expert‐in‐the‐loop method for domain‐specific document categorization based on small training data</article-title>
          ,
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>74</volume>
          (
          <year>2022</year>
          )
          <fpage>669</fpage>
          -
          <lpage>684</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <article-title>Digital nudging with recommender systems: Survey and future directions</article-title>
          ,
          <source>Computers in Human Behavior Reports</source>
          <volume>3</volume>
          (
          <year>2021</year>
          )
          <fpage>100052</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Quadrana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <article-title>Session-based recommender systems</article-title>
          ,
          <source>in: Recommender Systems Handbook</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>301</fpage>
          -
          <lpage>334</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <article-title>Evaluating conversational recommender systems: A landscape of research</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>56</volume>
          (
          <year>2023</year>
          )
          <fpage>2365</fpage>
          -
          <lpage>2400</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>