RADio-: a simplified codebase for evaluating normative diversity in recommender systems
                         Sanne Vrijenhoek¹,²
                         ¹ AI, Media and Democracy Lab, Amsterdam, the Netherlands
                         ² Institute for Information Law, University of Amsterdam, the Netherlands


                                        Abstract
                                        Diversity is one of the core beyond-accuracy objectives considered in the development of news recommender
                                        systems. However, there is a clear gap between its technical conceptualization, typically as an intra-list distance,
                                        and a more normative interpretation, which touches upon the role the recommender system plays in society.
                                        Vrijenhoek et al. [1] proposed to instead use rank-aware divergence metrics to express normative diversity in
                                        news recommendations. This work describes a repository that allows for easy implementation of these metrics,
                                        by making the different diversity aspects and tactics configurable. It also contains an example implementation
                                        and analysis of the results. With its modular setup, the repository thus allows for conceptualizations of diversity
                                        that can be tailored to the news domain they need to be applied in.



                         1. Introduction
                         In its technical conceptualization, ‘diversity’ prevents a recommender system from recommending
                         the same type of content over and over again [2], and is one of the primary values considered in
                         research on news recommender systems [3]. Diversity is usually expressed as an ‘intra-list distance’,
                         measuring whether the items within the recommendation are sufficiently different from each other [4].
                         However, this definition based on intra-list distance does not fully reflect the requirements of a normative
                         interpretation of diversity, which relates to the role the recommender system plays in society [5, 6, 7].
                            Diversity has characteristics of an essentially contested concept [8], and its interpretations can vary
                         greatly among different people and implementations [9]. We may consider different aspects when
                         talking about diversity, such as political viewpoints [10], different ethnicities [11] or diversity of category
                         and topic [12]. We may also talk about diversity in different contexts: for example, the recommendation
                         should reflect society [13], it should counter existing biases [14], or expose the reader to new things [15].
                            Most, if not all, of these conceptualizations are relevant to the domain of news recommendations.
                         News recommenders may play a gatekeeping role in the type of news that is exposed to the public,
                         and thus need to be capable of incorporating editorial values [16, 17]. The different aspects could
                         logically be incorporated in the intra-list distance formalization mentioned above with a sufficiently
                         sophisticated way to calculate distance between articles. However, the different contexts cannot be
                         captured by a metric that only considers the items within a recommendation. To solve this, Vrijenhoek
                         et al. [1] proposed an alternative diversity metric, conceptualized as a rank-aware divergence metric.
                         This was called the RADio framework, where RADio is short for Rank-Aware Divergence (plus -io).
                            With these divergence metrics, the presence of a certain diversity aspect in the recommendation is
                         compared to the presence of that aspect in an external context distribution. When these distributions
                         are similar, the divergence is low; when they are very different, divergence is high. There is no clear-
                         cut ‘optimal’ divergence score. In some cases one could strive for a recommendation similar to the
                         context distribution (for example, be reflective of political voices in government), in others for a higher
                         divergence score (for example, expose a reader to new perspectives). To show how this could work in
                         practice, RADio implemented the diversity metrics (DART) outlined in Vrijenhoek et al. [18]: Calibration,
                         Fragmentation, Activation, Representation and Alternative Voices, which are inspired by democratic
                         theory. The metrics were prototyped with news recommenders trained on the Microsoft News Dataset
                         (MIND) [19].

                          INRA’24: News Recommendation and Analytics Workshop, October 18, 2024, Bari, IT
                          Email: s.vrijenhoek@uva.nl (S. Vrijenhoek)
                          ORCID: 0000-0002-1031-4746 (S. Vrijenhoek)
                          © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

   In order to do justice to the normative underpinnings of the DART metrics, the RADio metrics needed
metadata that was not included in MIND. This metadata would include things like which political
viewpoints are expressed in an article, whether the article is written in a neutral or subjective tone, or
whether the article mentions people from a minority background. This type of information is notoriously
hard to extract from text alone, and RADio often needed to rely on proxies that were known not to be exactly
right, but were necessary to prototype how the framework could theoretically function.
   Despite the fact that they were simplifications, the data preprocessing and augmentation steps
to identify these proxies were already quite elaborate. For example, political opinions would be
approximated by the mention of political actors in the text. These actors would be identified by 1)
scraping the article body; 2) running Named Entity Recognition on the full texts; 3) attempting to match
entities of type Person to their entry on Wikidata; 4) checking whether this person was a politician,
and for which party. Without a gold standard it was not possible to evaluate the performance of this
approach, but even just looking at the procedure makes it quite clear that there are many ways in
which this process can fail. Variations in how a name is written (Barack Obama vs. President Obama)
could leave a political actor unidentified, and new elections or party compositions would render past
results invalid. This approach, or even one based on regular expressions and/or string matching, would
probably work well enough in a contained experiment over a limited amount of time where the relevant
actors are already known, such as in Michiels et al. [20] and Einarsson et al. [13]. An implementation
that monitors an algorithm in real time would probably benefit from a more sophisticated approach to
viewpoint diversity, such as in Draws et al. [21]. Lastly, RADio’s implementation of the DART metrics
also distracts from the findings of Vrijenhoek et al. [9], who argue that diversity can be conceptualized
in many ways, depending on the domain’s requirements.
   This work describes a repository that allows for easy implementation of the divergence-based
metrics, by making the different diversity aspects and tactics configurable. The code can be found
on GitHub.¹ This paper works under the assumption that whoever implements the framework has a
data preprocessing or annotation pipeline that contains the required metadata for the metrics to work.
While it still keeps the DART metrics in the repository to give examples of metric configuration, the
framework can also accommodate domains beyond news recommendation. Keep in mind that the
repository does not provide plug-and-play metrics, and that conceptualizing diversity within a news
recommender system is still very much a matter of discussion with stakeholders from outside technical
teams [22, 23].


2. The repository
The repository consists of three primary components: a Jupyter notebook which showcases how metrics
could be configured, a class for building the rank-aware distributions, and a class for calculating the
divergence scores.

2.1. Building distributions
In this part of the framework, we aim to build the distributions for the recommendation and context
respectively. In order to do this, we pass the framework the list of relevant articles (either in the
recommendation or in the context), and tell it which feature to look for. When building the distribution,
the framework can optionally account for the rank of an article in the recommendation. It will then
count articles that appear higher up in the list more strongly than those that appear lower, by weighting them
with the harmonic number² of the length of the list. Making a distribution rank-aware only makes sense
when there is some sort of meaning in the ordering of the articles; for example, in a recommendation
ranked by predicted relevance, or in a reading history where the most recently read articles are listed first.
It does not make sense in cases where such a meaning cannot be found; for example, when considering
all articles that have been published over the last few days. The framework can accommodate both
categorical and numerical data. Categorical data can have either a single value or multiple values per
article. Numerical data needs to be discretized into bins. The number of bins can be set, and defaults
to 10. Discretization discards important information: for example, the metric no longer knows that
certain bins are closer to each other than others. Future work may look into
alternative ways of calculating divergence for numerical data.

¹ https://github.com/svrijenhoek/radio-/
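
To make the distribution-building step concrete, the following is a minimal sketch and not the repository's own code. It assumes that a rank-aware weight is 1/rank divided by the harmonic number of the list length (one reading of the description above), that an article with multiple values splits its weight evenly over them, and that numerical values are placed in equal-width bins. The names `harmonic_number`, `build_distribution` and `discretize` are hypothetical.

```python
from collections import defaultdict


def harmonic_number(n):
    """Exact n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def build_distribution(features, rank_aware=False):
    """Turn a ranked list of per-article feature values into a distribution.

    `features` holds one entry per article, in ranked order. An entry is
    either a single value or a list of values (multi-valued feature).
    """
    n = len(features)
    if n == 0:
        return {}
    # Assumption: rank-aware weights are 1/rank, normalised by H_n so that
    # they sum to one; otherwise every article counts equally (1/n).
    norm = harmonic_number(n) if rank_aware else float(n)
    distribution = defaultdict(float)
    for rank, values in enumerate(features, start=1):
        if not isinstance(values, (list, tuple, set)):
            values = [values]
        weight = (1.0 / rank if rank_aware else 1.0) / norm
        # Assumption: an article's weight is split evenly over its values.
        for value in values:
            distribution[value] += weight / len(values)
    return dict(distribution)


def discretize(values, n_bins=10):
    """Map numerical values to equal-width bin indices 0..n_bins-1."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # guard against a zero-width range
    return [min(int((v - low) / width), n_bins - 1) for v in values]
```

For a ranked list such as `["news", "sports", "news"]`, the rank-aware variant gives ‘news’ more mass than the plain count would, because the first position carries the largest weight.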

2.2. Calculating divergence
Within RADio, diversity is conceptualized as a rank-aware divergence score between a recommendation
and a context:
                    D_f^*(P, Q) = \sum_x Q^*(x) \, f\left( \frac{P^*(x)}{Q^*(x)} \right)                    (1)
  where 𝑥 refers to the relevant feature to consider, 𝑃 to the recommendation, and 𝑄 to the context. As
explained in the previous section, both the recommendation 𝑃 and context 𝑄 can be set up to be
rank-aware. For more details regarding the justification of setting up diversity as a rank-aware divergence
score, see Vrijenhoek et al. [1]. Within this framework, we can calculate the divergence using either
the Kullback-Leibler (KL) divergence or the Jensen-Shannon (JS) divergence [24]. While KL divergence is
more commonly known, JS divergence has the added benefit of being 1) symmetric and 2) bounded between 0
and 1 (when a base-2 logarithm is used), and is thus the default option within the framework.
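
As an illustration of the two divergences, the sketch below computes them for distributions stored as dictionaries mapping a feature value to its probability. It is a simplified stand-in, not the repository's implementation, which may differ in smoothing or normalisation; the function names are hypothetical. Choosing f(t) = t log t in Equation (1) recovers the KL divergence of 𝑃 from 𝑄.

```python
from math import log2


def kl_divergence(p, q):
    """KL(P || Q) in bits. Infinite if P puts mass where Q has none."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # terms with P(x) = 0 contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return float("inf")
        total += px * log2(px / qx)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and within [0, 1]."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2, defined on the joint support.
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Identical distributions yield a JS divergence of 0 and distributions with disjoint support yield 1, whereas the KL divergence is unbounded in the latter case; this is one practical reason to prefer JS as a default.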

2.3. Configuring the metrics
While the repository contains instructions for configuring all the original RADio metrics, for this paper
we will discuss the configuration and output of the Calibration metric in more detail as an example.
While Calibration is, from a normative perspective, not the most interesting metric, it relies on data that
is supplied in MIND itself, and therefore does not require complicated data augmentation to show
meaningful results.
   The goal of Calibration is to measure to what extent a recommendation is tailored to a user’s
preferences. Thus, we want this score to show low divergence, meaning that there is a large overlap
between the recommendation and what a user wants to see. In this setup, we approximate a user’s
preferences by looking at the categories of articles they have consumed in the past: their reading
history. Note that this is just an example implementation, and that there are likely many better ways
to express a user’s interests than through categories in past reading behavior.

   In summary, we configure the metric in the following way:

Table 1
Configuration of the Calibration metric
                        Metric component        Configuration
                        Feature (x)             article category
                        Context (Q)             user history
                        Feature type            categorical; here single but could be multi
                        Rank-aware              both recommendation and context (P and Q)
                        Desired value           low divergence

² https://en.wikipedia.org/wiki/Harmonic_number

  We expect all recommendations to be represented in a DataFrame, with columns for 1) the impression
ID; 2) the time of the impression; 3) the ID of the user this impression was shown to; 4) the reading
history of that user; and 5) one or more generated recommendations, corresponding to different
recommendation algorithms. We assume that an apply-method will be called to calculate the diversity
metrics for each row, and thus for each of the different algorithms. We first configure a Metric:


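     # DiversityMetric is the metric class provided by the repository.
     Calibration = DiversityMetric(
         feature_type='cat',              # categorical feature (article category)
         rank_aware_recommendation=True,  # weight the recommendation (P) by rank
         rank_aware_context=True,         # weight the context (Q) by rank
         divergence='JSD',                # Jensen-Shannon divergence
         context='dynamic'                # context is recomputed for every row
     )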


     Here, ‘feature_type’, ‘rank_aware_recommendation’, ‘rank_aware_context’ and ‘divergence’ correspond
     to the information summarized in Table 1. The ‘context’ parameter is there for efficiency. If the
     context is ‘dynamic’, it will need to be calculated for every row. This is the case here, as we are looking at
     the users’ reading history, which is of course different for every user. The context can also be ‘static’, or
     the same for all users. This is the case when for example looking at all articles published, or when
     considering an external distribution. Next, we write a ‘calculate_calibration’ function to pass the right
     recommendation and context to the framework:


     def calculate_calibration(recommendations, history):
         # One score per recommendation algorithm, in the order of `recommendations`.
         scores = []
         # The context is the user's reading history; its features are retrieved once.
         context_features = get_features(history, 'category')
         for recommendation in recommendations:
             recommendation_features = get_features(recommendation, 'category')
             if context_features and recommendation_features:
                 calibration = Calibration.compute(context_features,
                                                   recommendation_features)
                 scores.append(calibration)
             else:
                 # No score can be computed when either feature list is empty.
                 scores.append(None)
         return scores

        We expect ‘recommendations’ to be a list where each entry in the list corresponds to a different
     algorithm. Each entry again consists of a list of article IDs. We also expect that these are ordered
     by which article is going to be recommended first according to that algorithm. Next, we tell the
     framework to retrieve the ‘category’ feature for each article in both the recommendations and the
     reading history. The resulting lists of features are given to the framework to, under the hood, build the
     corresponding distributions and calculate the divergence. The resulting ‘scores’ is a list of scores, each
     entry corresponding to one of the recommendation algorithms.
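
     The ‘get_features’ helper is not shown in this paper. The sketch below is one hypothetical way to
     write it, together with a small wrapper that scores a single row. It assumes article metadata is
     held in a dictionary keyed by article ID and that each row exposes a ‘history’ field plus one field per
     algorithm; the names `ARTICLES`, `score_row` and the column names are assumptions for illustration only.

```python
# Hypothetical article metadata store: article ID -> metadata fields.
ARTICLES = {
    "N1": {"category": "news"},
    "N2": {"category": "sports"},
    "N3": {"category": "finance"},
}


def get_features(article_ids, feature, articles=ARTICLES):
    """Return the requested feature for each known article, keeping order."""
    return [
        articles[article_id][feature]
        for article_id in article_ids
        if article_id in articles and feature in articles[article_id]
    ]


def score_row(row, algorithms, calculate):
    """Score one impression: one value per algorithm, keyed by its name.

    `calculate` is a function such as calculate_calibration, taking the list
    of recommendations and the reading history.
    """
    recommendations = [row[name] for name in algorithms]
    scores = calculate(recommendations, row["history"])
    return dict(zip(algorithms, scores))
```

     With a pandas DataFrame, a wrapper of this kind could be passed to `DataFrame.apply` with `axis=1`
     so that it is called once per row.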


     3. Output
     We run the configured metric on the news articles and recommendations of the ‘MINDsmall_dev’ dataset,
     which can be obtained from the Microsoft website³. We compare the recommendations generated by
     the LSTUR [25] and NRMS [26] algorithms, trained using the code supplied by Microsoft⁴, to those from
     two simple baseline algorithms: a random selection, and a selection based on article popularity. For
     the popularity baseline, the popularity of an item is derived from the clicks recorded in the dataset.
     However, there are many articles with zero recorded clicks, and in case of a tie in the number of clicks
     the recommender makes a random selection.

     ³ https://msnews.github.io/
     ⁴ https://github.com/recommenders-team/recommenders/tree/main

Table 2
Statistics of the Calibration scores for each recommendation algorithm
                                  Statistic   lstur   nrms    pop     random
                                  mean        0.575   0.572   0.665   0.662
                                  min         0.000   0.000   0.000   0.000
                                  25%         0.461   0.458   0.581   0.558
                                  50%         0.564   0.559   0.666   0.662
                                  75%         0.681   0.677   0.752   0.768
                                  max         0.994   0.994   0.994   0.994
                                  std         0.159   0.160   0.132   0.154

   It is quite hard to pinpoint what exactly a ‘good’ divergence score would be. However, when we
compare the algorithm we are interested in to a baseline algorithm, we can draw some conclusions on
how that algorithm impacts the behavior of the metric. In this example, the first difference in metric
outcomes can already be observed from basic statistics on the outcomes, shown in Table 2. For the mean
and all three quartiles, the divergence of the neural recommenders is lower than that of the baseline
recommenders; the minimum and maximum are identical across all four algorithms. As expected, the
neural recommenders are more tailored to the users’ preferences
than the baselines. Note that this does not mean that neural recommenders are generally more diverse
than baseline ones; it just means that in this conception of diversity, and in this setting, the neural
recommenders show more of the desired behavior than the baselines do.
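
Summary statistics of this kind can be computed from the per-impression scores with the standard library alone. The sketch below is an illustration under assumptions: the `describe` name is hypothetical, `None` scores (impressions for which no score could be computed) are dropped, and quartiles use linear interpolation. It needs at least two valid scores.

```python
from statistics import mean, quantiles, stdev


def describe(scores):
    """Summary statistics of a list of scores, ignoring None entries."""
    values = sorted(s for s in scores if s is not None)
    # "inclusive" treats the data as the full population when interpolating.
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "min": values[0],
        "25%": q25,
        "50%": q50,
        "75%": q75,
        "max": values[-1],
        "std": stdev(values),  # sample standard deviation
    }
```

Calling `describe` once per algorithm and placing the results side by side gives a table with the same rows as Table 2, which makes the comparison against the baselines straightforward.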
   Figures 1 and 2, which can also be found in the repository, provide more detail on the behavior
of the metric. In Figure 1 we see that the neural recommenders show similar patterns, and that the
baseline recommenders behave similarly to each other too. It also shows the effect of the time of
day; there may have been meaningful events that influence the type of articles a recommender system
can choose from, and thus make the algorithm choose articles that diverge from the users’ personal
preferences. Note that this is not necessarily bad, if the primary goal of the recommender is to inform
readers about important events happening in the world. In Figure 2, the neural recommenders have
distinctly lower divergence, which means that the recommendations they generate are closer to the
users’ reading history. Given that there are clear differences between the baseline recommenders in
this figure but not in Figure 1, aggregating scores per user evidently has meaningful consequences;
the popularity recommender may generate more Calibrated recommendations for some users
than for others.
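
The two views differ only in how the per-impression scores are grouped before averaging. A minimal sketch of such grouping is given below; the function name `mean_score_by` and the row fields ‘user_id’ and ‘time’ are assumptions, not names taken from the repository.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by(rows, key, algorithm):
    """Average the scores of one algorithm within groups defined by `key`."""
    groups = defaultdict(list)
    for row in rows:
        score = row[algorithm]
        if score is not None:  # skip impressions without a score
            groups[key(row)].append(score)
    return {group: mean(values) for group, values in groups.items()}
```

Grouping with `key=lambda row: row["user_id"]` corresponds to the per-user view of Figure 2, while a key such as `lambda row: row["time"].hour` (assuming ‘time’ holds a `datetime`) corresponds to a view over time as in Figure 1.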


4. Discussion
Sections 2 and 3 explain how to technically implement the RADio- framework to measure normative diversity
in recommendations. The example metrics are tailored towards news recommendation, but the framework
can be adapted to suit a wide range of applications. However, this does not answer the question
of how one should go about conceptualizing diversity for their application. This can be exceptionally
challenging for technical teams that, while they are the ones that need to implement the metric, often
do not have all the domain knowledge necessary for making such decisions. As such, it is important
that all relevant stakeholders are brought to the table. In the case of news recommendation, these would
include editorial, but also strategic and business roles [22]. Readers themselves also bring a different
perspective on what they value in their news, and why they would choose to read certain items but
not others [27, 28, 29, 30]. Lastly, one should not underestimate the effects of interface design on users’
reading behavior. Even a perfectly built and diverse algorithm may not accomplish what it is intended
to do due to position bias or simply differences between users [31].
Figure 1: Lineplot of the average Calibration scores over time


Figure 2: Distributional plot of the average Calibration score per user


   Vrijenhoek et al. [9] interviewed professionals in the media sector, and noted all the different ways
they spoke about diversity. The taxonomy that resulted from this, which is split into goals, aspects
and contexts of diversity, could serve as a starting point for other implementations; at the very least,
it should facilitate discussion and ease the identification of domain-specific needs and requirements.
Furthermore, one could take inspiration from literature beyond the technological domain. For example,
those working on news recommendation could look into how social scientists conceptualize diversity,
and draw inspiration from democratic theory and the role news plays in society [6, 18]. However, while
democratic theory is directly relevant to news, it should not be blindly applied to other domains. Rather,
we would encourage those from other domains to invest time choosing or building their own normative
framework [7].
   Without a doubt, conceptualizing and implementing diversity in any kind of recommender system is
a complicated process, and it is unlikely that a perfect (or even a good) solution will be attained in a
single iteration. One could argue that aiming for one would only prevent any progress from happening.
Rather, perhaps we should aim for imperfect solutions; ones that we fully understand, and where we
can exactly pinpoint what the metric does and does not do. As such, we would also urge readers
not to resort to opaque solutions such as off-the-shelf Large Language Models, which may be easy to
implement but are not under the control and full understanding of your organization. Solutions that
we know are simplified, perhaps even ‘stupid’, can be discussed and criticized, and thus be improved
upon. It is our hope that the RADio- codebase will make at least the technical part of the process more
straightforward.


Acknowledgments
This work builds upon the code of the original RADio framework, which was a collaboration between
the author of this work, Gabriel Bénédict and Mateo Gutierrez Granada. I would like to thank Savvina
Daniil for testing and reviewing the code, Johannes Kruse for making the repository’s very first pull
request, and Lucien Heitz and Alain Starke for proofreading.


References
 [1] S. Vrijenhoek, G. Bénédict, M. Gutierrez Granada, D. Odijk, M. De Rijke, RADio–Rank-Aware
     Divergence Metrics to Measure Normative Diversity in News Recommendations, in: Proceedings
     of the 16th ACM Conference on Recommender Systems, 2022, pp. 208–219.
 [2] M. Kunaver, T. Požrl, Diversity in recommender systems–a survey, Knowledge-Based Systems 123
     (2017) 154–162.
 [3] C. Bauer, C. Bagchi, O. A. Hundogan, K. van Es, Where are the values? a systematic literature
     review on news recommender systems, ACM Transactions on Recommender Systems (2024).
 [4] S. Vargas, P. Castells, Rank and relevance in novelty and diversity metrics for recommender systems,
     in: Proceedings of the fifth ACM conference on Recommender systems, 2011, pp. 109–116.
 [5] N. Helberger, K. Karppinen, L. D’Acunto, Exposure diversity as a design principle for recommender
     systems, Information, Communication & Society 21 (2018) 191–207.
 [6] N. Helberger, On the democratic role of news recommenders, in: Algorithms, Automation, and
     News, Routledge, 2021, pp. 14–33.
 [7] S. Vrijenhoek, L. Michiels, J. Kruse, A. Starke, N. Tintarev, J. Viader Guerrero, Normalize: The first
     workshop on normative design and evaluation of recommender systems, in: Proceedings of the
     17th ACM Conference on Recommender Systems, 2023, pp. 1252–1254.
 [8] W. B. Gallie, Essentially contested concepts, Aristotelian Society, 1956.
 [9] S. Vrijenhoek, S. Daniil, J. Sandel, L. Hollink, Diversity of what? on the different conceptualizations
     of diversity in recommender systems, in: The 2024 ACM Conference on Fairness, Accountability,
     and Transparency, 2024, pp. 573–584.
[10] M. Haim, A. Graefe, H.-B. Brosius, Burst of the filter bubble? effects of personalization on the
     diversity of Google News, Digital Journalism 6 (2018) 330–343.
[11] M. Mitchell, D. Baker, N. Moorosi, E. Denton, B. Hutchinson, A. Hanna, T. Gebru, J. Morgenstern,
     Diversity and inclusion metrics in subset selection, in: Proceedings of the AAAI/ACM Conference
     on AI, Ethics, and Society, 2020, pp. 117–123.
[12] J. Möller, D. Trilling, N. Helberger, B. van Es, Do not blame it on the algorithm: an empirical
     assessment of multiple recommender systems and their impact on content diversity, in: Digital
     media, political polarization and challenges to democracy, Routledge, 2020, pp. 45–63.
[13] Á. M. Einarsson, R. Helles, S. Lomborg, Algorithmic agenda-setting: the subtle effects of news
     recommender systems on political agendas in the Danish 2022 general election, Information,
     Communication & Society (2024) 1–21.
[14] B. Huebner, T. E. Kolb, J. Neidhardt, Evaluating group fairness in news recommendations: A
     comparative study of algorithms and metrics, in: Adjunct Proceedings of the 32nd ACM Conference
     on User Modeling, Adaptation and Personalization, UMAP Adjunct ’24, Association for Computing
     Machinery, New York, NY, USA, 2024, pp. 337–346. doi:10.1145/3631700.3664897.
[15] L. Heitz, J. A. Lischka, R. Abdullah, L. Laugwitz, H. Meyer, A. Bernstein, Deliberative diversity
     for news recommendations: Operationalization and experimental user study, in: Proceedings
     of the 17th ACM Conference on Recommender Systems, RecSys ’23, Association for Computing
     Machinery, New York, NY, USA, 2023, pp. 813–819. doi:10.1145/3604915.3608834.
[16] L. A. Møller, Recommended for you: how newspapers normalise algorithmic news recommendation
     to fit their gatekeeping role, Journalism Studies 23 (2022) 800–817.
[17] S. Blassnig, E. Strikovic, E. Mitova, A. Urman, A. Hannák, C. de Vreese, F. Esser, A balancing act:
     How media professionals perceive the implementation of news recommender systems, Digital
     Journalism (2023) 1–23.
[18] S. Vrijenhoek, M. Kaya, N. Metoui, J. Möller, D. Odijk, N. Helberger, Recommenders with a mission:
     assessing diversity in news recommendations, in: Proceedings of the 2021 conference on human
     information interaction and retrieval, 2021, pp. 173–183.
[19] F. Wu, Y. Qiao, J.-H. Chen, C. Wu, T. Qi, J. Lian, D. Liu, X. Xie, J. Gao, W. Wu, et al., MIND: A
     large-scale dataset for news recommendation, in: Proceedings of the 58th annual meeting of the
     association for computational linguistics, 2020, pp. 3597–3606.
[20] L. Michiels, J. Vannieuwenhuyze, J. Leysen, R. Verachtert, A. Smets, B. Goethals, How should we
     measure filter bubbles? a regression model and evidence for online news, in: Proceedings of the
     17th ACM Conference on Recommender Systems, 2023, pp. 640–651.
[21] T. Draws, N. Roy, O. Inel, A. Rieger, R. Hada, M. O. Yalcin, B. Timmermans, N. Tintarev, Viewpoint
     diversity in search results, in: European Conference on Information Retrieval, Springer, 2023, pp.
     279–297.
[22] A. Smets, J. Hendrickx, P. Ballon, We’re in this together: a multi-stakeholder approach for news
     recommenders, Digital Journalism 10 (2022) 1813–1831.
[23] N. Helberger, M. van Drunen, J. Moeller, S. Vrijenhoek, S. Eskens, Towards a normative perspective
     on journalistic AI: Embracing the messy reality of normative ideals, 2022.
[24] M. L. Menéndez, J. Pardo, L. Pardo, M. Pardo, The Jensen-Shannon divergence, Journal of the
     Franklin Institute 334 (1997) 307–318.
[25] M. An, F. Wu, C. Wu, K. Zhang, Z. Liu, X. Xie, Neural news recommendation with long-and
     short-term user representations, in: Proceedings of the 57th annual meeting of the association for
     computational linguistics, 2019, pp. 336–345.
[26] C. Wu, F. Wu, S. Ge, T. Qi, Y. Huang, X. Xie, Neural news recommendation with multi-head
     self-attention, in: Proceedings of the 2019 conference on empirical methods in natural language
     processing and the 9th international joint conference on natural language processing (EMNLP-
     IJCNLP), 2019, pp. 6389–6394.
[27] A. Starke, A. S. Bremnes, E. Knudsen, D. Trilling, C. Trattner, Perception versus reality: Evaluating
     user awareness of political selective exposure in news recommender systems, in: Adjunct Pro-
     ceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization, 2024,
     pp. 286–291.
[28] J. Moeller, F. Löecherbach, J. Möller, N. Helberger, Out of control?: Using interactive testing to
     understand user agency in news recommendation systems, in: News Quality in the Digital Age,
     Routledge, 2023, pp. 117–133.
[29] L. Van den Bogaert, D. Geerts, J. Harambam, Putting a human face on the algorithm: co-designing
     recommender personae to democratize news recommender systems, Digital Journalism (2022)
     1–21.
[30] F. Loecherbach, K. Welbers, J. Moeller, D. Trilling, W. Van Atteveldt, Is this a click towards
     diversity? explaining when and why news users make diverse choices, in: Proceedings of the
     13th ACM Web Science Conference 2021, 2021, pp. 282–290.
[31] N. Mattis, T. Groot Kormelink, P. K. Masur, J. Moeller, W. van Atteveldt, Nudging news readers: A
     mixed-methods approach to understanding when and how interface nudges affect news selection,
     Digital Journalism (2024) 1–21.