Empowering Editors: How Automated Recommendations Support Editorial Article Curation

Anastasiia Klimashevskaia¹,∗, Mehdi Elahi¹, Dietmar Jannach¹,², Christoph Trattner¹ and Simen Buodd³
¹ MediaFutures, University of Bergen, Bergen, Norway
² University of Klagenfurt, Klagenfurt, Austria
³ VG, Schibsted Media, Oslo, Norway

12th International Workshop on News Recommendation and Analytics in Conjunction with ACM RecSys 2024, Bari, Italy
∗ Corresponding author.
anastasiia.klimashevskaia@uib.no (A. Klimashevskaia); mehdi.elahi@uib.no (M. Elahi); dietmar.jannach@aau.at (D. Jannach); christoph.trattner@uib.no (C. Trattner); simen.buodd@vg.no (S. Buodd)
https://www.anaklim.info/ (A. Klimashevskaia)
ORCID: 0000-0002-8946-667X (A. Klimashevskaia); 0000-0003-2203-9195 (M. Elahi); 0000-0002-4698-8507 (D. Jannach); 0000-0002-1193-0508 (C. Trattner); 0009-0008-5509-3693 (S. Buodd)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Abstract
The application of recommender systems in the news domain has grown rapidly in recent years. Various news outlets are proposing to fully automate the newspaper front page through automated recommendation. In this paper, however, we explore the synergy of editorial and algorithmic news curation by analyzing the front page of a real-world news platform, where news articles are either selected automatically by a recommendation algorithm or selected manually by editors. An investigation of the interaction log data from an online newspaper revealed that while the editorial staff focuses on content that is generally popular across large parts of the audience, algorithmic curation can additionally provide small yet noteworthy personalization touches for individual readers. The results of the analysis demonstrate an example of successful coexistence of editorial and algorithmic news curation.

Keywords
News Recommender Systems, Editorial Mission, Diversity, Popularity

1. Introduction
In the current digital age, reading news articles online has become part of our everyday routines. Traditionally, editors decide which news articles are relevant or important enough to include on the front page, and which of them to emphasize [1]. This process requires them to manually select the articles they consider the most interesting and important [2]. At the same time, overall news coverage and topic diversity should also be taken into account as part of editorial values [2, 3, 4, 5, 6].

However, with the enormous number of news articles published every day, it is becoming challenging for editors to go through the entire catalog and decide which articles to select. Modern news platforms therefore often leverage digital tools that provide automated mechanisms, e.g., to extend the list of articles selected by editors with additional articles. News Recommender Systems (NRS) can offer such services by analyzing the online click behavior of readers and offering them a personalized experience when navigating the news.

Researchers have proposed various solutions to the news recommendation problem [7, 8], using collaborative filtering [9, 10, 11], content-based approaches [12, 13], and hybrids of the two [14, 15]. Apart from classic recommendation challenges such as cold start [16] and various undesired effects [3], news recommender systems also face unique difficulties due to the dynamic nature of content and user preferences. News articles have short lifespans and time-dependent relevance, which necessitates timely delivery of relevant content [7, 17]. In addition, the unstructured format of news articles, the variety of media used to present the news (short- and long-form articles, quick headlines or images, video and audio formats), and the typical lack of user profiles due to anonymous browsing present further challenges [18, 7]. These factors arguably make news recommendation more challenging than many other application domains. Last but not least, the news setting also imposes technical requirements: minimizing response time, scaling to large request volumes, and adapting to the constraints of mobile devices [18].

Despite these difficulties, automated recommendations can benefit readers by enhancing their satisfaction and engagement [19], and news platforms by increasing key performance indicators such as the click-through rate (CTR) [20]. Various works have explored NRS as a means to replace the work of newspaper editors, in attempts to fully automate front page curation [7, 17]. In this paper, we instead explore how editorial efforts can be supported by an NRS, by conducting a comprehensive analysis of the interaction logs and the front page of a real-world commercial news platform. The news articles on this platform are selected either by editors or by an automated recommendation algorithm. Specifically, we compare the articles selected by editors and by the algorithm in terms of several metrics commonly used to assess the popularity and diversity of recommended content, including average recommendation popularity (ARP), the Gini index, entropy, and miscalibration. We demonstrate that while the editorial staff can concentrate on widely appealing content for broad audiences, even limited personalization through recommendation can be beneficial, improving content diversity and the exposure of more niche content.

2. Methods
We adopt an observational research approach in which we study the combination of editorial and algorithmic choices on a national online newspaper platform in terms of article diversity and popularity, as well as how well the curation matches user preferences regarding topical diversity and content popularity (mainstreamness). In the following subsections, we describe the observed setting, the collected dataset, and the metrics used.

2.1. Application Setting
Verdens Gang (widely known as VG)¹ is a national online newspaper in Norway, serving daily readers both with breaking and essential news (often called hard news [21]) and with articles focusing more on sport, entertainment, and lifestyle (soft news [21]). While access to hard news is free of charge, some of the soft-news articles are behind a paywall that requires a premium subscription.

Most of VG's front page is curated and manually assembled by the editorial staff, who try to balance two primary goals in their article selection. First, they aim to maintain integrity and the journalistic mission of keeping the population informed and updated from diverse standpoints. Second, they have a vested interest in maintaining the newspaper's revenue by placing some longer-form paywalled content on the front page, hoping to encourage readers to become subscribers.

To support the editorial staff, VG employs a recommender algorithm that selects some of the paywalled soft-news content for individual users in a personalized way. To this end, the algorithm ranks a pool of articles by how likely each article is to entice users to subscribe. For personalization, the algorithm additionally takes into account factors such as (i) user demographics and previously expressed preferences for specific topics, and (ii) articles that a particular user has repeatedly seen without interacting with them; such articles are considered "unsuccessful" for this user and are not recommended again.
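The selection logic above is described only at a high level, so the following is a minimal, hypothetical Python sketch of how such a personalized ranking step could look, not VG's implementation. It assumes a precomputed per-article subscription-propensity score (`subscribe_score`), a per-user topic affinity map (`topic_affinity`), a count of how often the user saw each article without clicking (`ignored_counts`) with a cutoff `max_ignored`, and a simple additive scoring rule weighted by `affinity_weight`; all of these names and the scoring rule are illustrative assumptions. Demographic features are omitted. The sketch also excludes articles already picked by editors, a constraint VG applies to avoid repetition.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    article_id: str
    topic: str
    subscribe_score: float  # hypothetical model estimate of subscription propensity


def select_for_user(
    pool: list[Article],
    editor_picked: set[str],
    ignored_counts: dict[str, int],
    topic_affinity: dict[str, float],
    k: int = 3,
    max_ignored: int = 2,
    affinity_weight: float = 0.5,
) -> list[str]:
    """Return up to k article ids for one user (illustrative sketch only)."""
    candidates = [
        a
        for a in pool
        # Never repeat an article that editors already placed on the front page.
        if a.article_id not in editor_picked
        # Drop "unsuccessful" articles the user has repeatedly seen without clicking.
        and ignored_counts.get(a.article_id, 0) < max_ignored
    ]

    def score(a: Article) -> float:
        # Subscription propensity, nudged by the user's (hypothetical) topic affinity.
        return a.subscribe_score + affinity_weight * topic_affinity.get(a.topic, 0.0)

    ranked = sorted(candidates, key=score, reverse=True)
    return [a.article_id for a in ranked[:k]]


if __name__ == "__main__":
    pool = [
        Article("a", "Sport", 0.9),
        Article("b", "Consumer", 0.5),
        Article("c", "Celebrity", 0.4),
        Article("d", "Sport", 0.3),
    ]
    picks = select_for_user(
        pool,
        editor_picked={"a"},
        ignored_counts={"b": 3},
        topic_affinity={"Sport": 1.0},
        k=2,
    )
    print(picks)  # ['d', 'c']
```

In this toy run, article "a" is removed because an editor already picked it, "b" is removed as repeatedly ignored, and the topic affinity lifts "d" above "c". A single additive score keeps the sketch short; a real system would more likely learn such weights from data.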
In the end, the newspaper's front page contains premium articles selected both by editors and by the algorithm; see Figure 1. We emphasize that the proportion of free to paywalled content on the front page can change depending on the current news situation. We also note that, to avoid repetition, the algorithm cannot select items that an editor has already selected. Consequently, the algorithm's pool of available items can be slightly smaller than the editors' pool.

1 https://www.vg.no/

Figure 1: Stylized front page. The mock-up demonstrates how both free and premium content can be placed on the front page at a given time, both by the editors and by the algorithm.

2.2. Dataset Characteristics
Our analysis is based on a dataset from VG containing front page user-item interaction data, consisting of both impressions and clicks, for a period of one month (June 2023). The dataset also contains additional descriptive information for the news articles, such as the topic of each article and whether it was placed on the front page by the editors or by the algorithm. During preprocessing, users with fewer than 10 item impressions/clicks and items with no clicks were excluded from the dataset to reduce potential noise. As a result, the preprocessed dataset consists of 272 premium articles and more than 50,000 users, with over 23 million rows in total.

Our analyses focus only on premium soft-news articles, for two reasons. First, looking only at premium content means analyzing only logged-in user profiles with more reliable interaction data. Second, it allows a fairer comparison between editorial and algorithmic curation, as the algorithm can select only premium content. To investigate topical diversity, we consider the article categories available for each news item.
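As a rough illustration, the two preprocessing filters described in this subsection can be expressed in a few lines of Python. The `Event` record with a `kind` of "impression" or "click" is a hypothetical stand-in for VG's log format, and the order of the filters (items first, then users, each applied once) is an assumption, since the text does not specify it.

```python
from __future__ import annotations

from collections import Counter
from typing import NamedTuple


class Event(NamedTuple):
    user_id: str
    item_id: str
    kind: str  # hypothetical log field: "impression" or "click"


def preprocess(events: list[Event], min_user_events: int = 10) -> list[Event]:
    """Drop never-clicked items, then users with too few interactions (one pass each)."""
    # 1) Keep only items that received at least one click.
    clicked_items = {e.item_id for e in events if e.kind == "click"}
    kept = [e for e in events if e.item_id in clicked_items]

    # 2) Keep only users with at least `min_user_events` impressions/clicks.
    per_user = Counter(e.user_id for e in kept)
    return [e for e in kept if per_user[e.user_id] >= min_user_events]


if __name__ == "__main__":
    log = (
        [Event("u1", "i1", "impression")] * 9
        + [Event("u1", "i1", "click")]
        + [Event("u1", "i2", "impression")] * 2  # i2 is never clicked
        + [Event("u2", "i1", "impression")] * 3  # u2 has too few events
    )
    cleaned = preprocess(log)
    print(len(cleaned), {e.user_id for e in cleaned}, {e.item_id for e in cleaned})
```

Applying the filters in the opposite order, or repeating them until nothing changes, can yield a slightly different dataset, which is why the order is worth stating explicitly in a reproducible pipeline.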
We limit our analyses to the four main article categories (i.e., Consumer, General News, Sport, and Celebrity) present in the collected dataset.

2.3. Metrics
To compare the characteristics of editorial and algorithmic choices, we investigated the popularity level and topic diversity of the premium articles selected by both approaches. Furthermore, we measured the extent of miscalibration with respect to these aspects [22]. Specifically, we used the following metrics:
• Average Recommendation Popularity (ARP) [23]. This metric quantifies the average popularity of the items selected for recommendation. We define item popularity as the normalized number of clicks an item received during the observation period.
• Gini Index [24]. The Gini index measures the inequality of a frequency distribution. In our setting, we measure how often each item is recommended on the front page. A very uneven distribution indicates that the recommendations focus on a small set of (usually popular) items. A higher Gini index indicates greater inequality and, in a recommendation scenario, lower diversity.
• Shannon Entropy [25]. We adopt this metric as a measure of (category) diversity in the recommendations. It captures how evenly, on average, the news categories are represented in the recommendations. Higher entropy values indicate higher diversity and better topic coverage.
• Miscalibration (MC) [22, 26, 27]. Calibration approaches aim to ensure that the recommendations provided to a user match the distribution of the user's past preferences. Miscalibration, in turn, quantifies the discrepancy between the recommendations and the user's preferences. To measure it, we compared the probability distribution vectors describing user profiles with the distributions of the recommendations the users received, using the Jensen-Shannon Divergence [28] as the distance between the two vectors. For item categories, we considered two separate cases, using either article topics (MC-Div) or item popularity groups (MC-Pop), with items grouped into highly, medium, and less popular items².

In addition, we analyze click-through rates (CTR) to gauge user satisfaction and experience with both curation approaches. This metric is commonly used in real-world recommendation settings to evaluate recommendation accuracy and effectiveness, and it is simple and quick to compute. However, we acknowledge that CTR has certain limitations in terms of reliability, which we discuss in detail after presenting the results.

3. Results
We report a summary of our results in Table 1. In terms of popularity metrics, a notable difference can be observed between the algorithmic recommendations and the manual news curation by the editors. The algorithm's selections are spread more evenly over the article pool than the editorial ones, as shown by the algorithm's Gini index of 0.46, only slightly more than half of the editors' value of 0.83. A similar trend is observed for average recommendation popularity (ARP): the algorithm's recommendations reached a value of 0.22, while the editors' selection reached 0.37. Surprisingly, however, the editors' manual selection resulted in lower popularity miscalibration (MC-Pop of 0.31 versus 0.39 for the algorithm), more closely matching the popularity tendencies of users. This suggests that the editorial approach may align better with the broad spectrum of user preferences regarding the popularity of news content.
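To make the popularity metrics used above concrete, the following minimal Python sketch computes normalized item popularity, ARP, the Gini index over how often each item was shown, and the three popularity groups used for MC-Pop. Two choices are assumptions, as the text does not fix them: popularity is normalized by the maximum click count, and the 20%/60%/20% split is taken over items ranked by popularity.

```python
from __future__ import annotations


def normalized_popularity(clicks: dict[str, int]) -> dict[str, float]:
    """Item popularity as clicks divided by the maximum click count (an assumption)."""
    top = max(clicks.values()) or 1  # guard against an all-zero catalog
    return {item: c / top for item, c in clicks.items()}


def arp(rec_lists: list[list[str]], popularity: dict[str, float]) -> float:
    """Average Recommendation Popularity: mean over lists of the mean item popularity."""
    per_list = [sum(popularity[i] for i in recs) / len(recs) for recs in rec_lists if recs]
    return sum(per_list) / len(per_list)


def gini(frequencies: list[float]) -> float:
    """Gini index of how often each catalog item was recommended (0 = perfectly even)."""
    x = sorted(frequencies)
    n, total = len(x), sum(x)
    if n == 0 or total == 0:
        return 0.0
    # Standard form for ascending values: sum_i (2i - n - 1) * x_i / (n * sum(x)).
    return sum((2 * i - n - 1) * xi for i, xi in enumerate(x, start=1)) / (n * total)


def popularity_groups(popularity: dict[str, float], head: float = 0.2, tail: float = 0.2) -> dict[str, str]:
    """Label the top 20% of items 'high', the bottom 20% 'low', and the rest 'medium'."""
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    n = len(ranked)
    n_head, n_tail = round(head * n), round(tail * n)
    labels = {}
    for rank, item in enumerate(ranked):
        if rank < n_head:
            labels[item] = "high"
        elif rank >= n - n_tail:
            labels[item] = "low"
        else:
            labels[item] = "medium"
    return labels


if __name__ == "__main__":
    pop = normalized_popularity({"a": 10, "b": 5, "c": 0})
    print(pop)                            # {'a': 1.0, 'b': 0.5, 'c': 0.0}
    print(arp([["a", "b"], ["c"]], pop))  # 0.375
    print(gini([0, 0, 0, 4]))             # 0.75
```

With this form of the Gini index, perfectly even exposure gives 0, while exposure concentrated on a single item gives (n - 1)/n, which approaches 1 as the catalog grows. Items that were never shown should be included with a frequency of 0, otherwise the inequality is underestimated.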
The editorial selection also appears to gravitate more generally toward mainstream, popular content that is potentially interesting to the majority of users. In contrast, the content recommended by the algorithm has lower overall popularity, which can be interpreted as more personalized niche content for smaller demographics. Combined on the front page, the two approaches have the potential to complement each other, providing enhanced utility to the whole reader base.

Table 1
Results of analyzing the front page of VG, where news articles are either selected manually by the editors or recommended by an automated algorithm. The arrows indicate whether lower (↓) or higher (↑) values are considered desirable. Gini index, ARP, and MC-Pop describe content popularity; Entropy and MC-Div describe content diversity.

| News delivery   | Personalized | Gini index ↓ | ARP ↓ | MC-Pop ↓ | Entropy ↑ | MC-Div ↓ | CTR ↑ |
|-----------------|--------------|--------------|-------|----------|-----------|----------|-------|
| Editorial staff | No           | 0.83         | 0.37  | 0.31     | 1.02      | 0.29     | 0.02  |
| Algorithm       | Yes          | 0.46         | 0.22  | 0.39     | 1.09      | 0.28     | 0.03  |

2 The thresholds separating the items into these popularity groups were set according to the Pareto rule and the majority of previous research on popularity bias, which defines the top 20% of content as the most popular and the bottom 20% as the least popular, assigning the rest to the medium-popularity group [29].

With respect to topic diversity, the recommendation algorithm interestingly exhibited an entropy level close to that of the editors' selection: 1.09 for the algorithm versus 1.02 for the editors. Although this difference is marginal, it may suggest that the algorithm includes slightly more diverse topics in its recommendations than the handpicked articles do. Similarly, the diversity miscalibration (MC-Div) of the algorithm's recommendations differs little from that of the manual news selection: 0.28 for the algorithm-generated recommendations versus 0.29 for the editors' selection.
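The diversity and calibration metrics can likewise be sketched in Python. The code below computes the Shannon entropy of a category distribution and the Jensen-Shannon divergence between a user's profile distribution and the distribution of the items recommended to that user, averaged over users as a miscalibration score. The logarithm bases (natural log for entropy; base 2 for the divergence, which bounds it in [0, 1]) and the per-user averaging are assumptions, since the text does not state them. Note that SciPy's `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, i.e., the square root of the divergence, so results from it are not directly comparable.

```python
from __future__ import annotations

import math
from collections import Counter


def distribution(labels: list[str], support: list[str]) -> list[float]:
    """Relative frequency of each label in `support` (e.g., topics or popularity groups)."""
    counts = Counter(labels)
    total = sum(counts[s] for s in support)
    return [counts[s] / total if total else 0.0 for s in support]


def shannon_entropy(p: list[float]) -> float:
    """Entropy in nats; higher values mean categories are represented more evenly."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p: list[float], q: list[float]) -> float:
    # Terms with p_i = 0 contribute nothing; here q is the mixture, so q_i > 0 whenever p_i > 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence (base 2, so 0 <= JSD <= 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def miscalibration(profiles: dict[str, list[str]], recs: dict[str, list[str]], support: list[str]) -> float:
    """Mean JSD between each user's profile and recommendation distributions."""
    scores = [
        js_divergence(distribution(profiles[u], support), distribution(recs[u], support))
        for u in profiles
        if u in recs
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    topics = ["Consumer", "General News", "Sport", "Celebrity"]
    print(shannon_entropy([0.25] * 4))            # maximum for four topics: ln 4
    print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # 1.0 (disjoint distributions)
    profiles = {"u1": ["Sport", "Sport", "Consumer", "Celebrity"]}
    recs = {"u1": ["Sport", "Consumer"]}
    print(miscalibration(profiles, recs, topics))
```

Passing article topics as labels gives an MC-Div-style score, and passing the popularity-group labels ("high", "medium", "low") gives an MC-Pop-style score with the same function.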
Despite the similar MC-Div values, these results may suggest that the algorithmic approach aligns slightly better with a broader spectrum of user preferences in terms of diversity. Overall, with regard to news topic diversity, the diversity miscalibration results show that both selection approaches match users' preferences across news topics to a similar degree.

Finally, both methods also achieve fairly similar CTRs, with the algorithm performing marginally better (0.03) than the editorial staff (0.02). This result may indicate that further personalized curation of news content has the potential to increase user engagement on the platform. However, as discussed in the following section, further investigation is required before this result can be generalized and firm conclusions can be drawn.

4. Discussion
The overall results are promising and highlight the potential of an automated algorithmic approach to support editorial article curation. Several aspects of the results merit further discussion, which we address in this section.

Position Bias. To better understand the differences in the observed metrics, we discuss possible limitations and factors other than algorithmic performance that can affect the results. One such factor is position bias: the very top of the front page is generally reserved for editorial curation, since it may require a more cautious human touch. This could give the content curated by the editors an advantage in exposure, popularity, and consequently clicks. Note, however, that within the row-based structure of the front page, the first three rows are typically reserved exclusively for free "hard" news about the most important current events.
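A toy position-based click model can make the position-bias argument more concrete. In this model, the probability of a click is the product of the probability that a reader examines a slot and the probability that the article is relevant. The power-law decay of examination probability with rank and the parameter `eta` are illustrative assumptions, not estimates from VG's data; [30] addresses how such position bias can instead be estimated from click logs.

```python
"""Toy position-based click model (PBM), for illustration only."""


def examination_prob(rank: int, eta: float = 1.0) -> float:
    """Hypothetical probability that a reader examines the slot at `rank` (1 = top)."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    # Power-law decay: attention drops quickly near the top, slowly further down.
    return (1.0 / rank) ** eta


def expected_ctr(rank: int, relevance: float, eta: float = 1.0) -> float:
    """Under the PBM, a click requires both examination and relevance."""
    return examination_prob(rank, eta) * relevance


def debiased_ctr(observed_ctr: float, rank: int, eta: float = 1.0) -> float:
    """Inverse-propensity correction: divide observed CTR by the examination probability."""
    return observed_ctr / examination_prob(rank, eta)


if __name__ == "__main__":
    relevance = 0.05  # same hypothetical relevance for every slot
    for rank in (1, 2, 10, 11):
        ctr = expected_ctr(rank, relevance)
        print(f"rank {rank:2d}: expected CTR {ctr:.4f}, debiased {debiased_ctr(ctr, rank):.4f}")
```

Under these assumptions (with `eta=1`), the expected CTR of an equally relevant article halves between the first and second slot but drops by less than 10% between the tenth and eleventh, in line with the argument that position bias matters most at the very top of the page.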
Position bias has an immense impact on the very first positions but generally weakens rapidly further down the page [30]. Thus, while we acknowledge that position bias is present in the observed setting, we believe its influence on our results may be negligible, given where the studied content is positioned on the page.

Article Selection Pool. Another factor that requires consideration is potential variation in the article selection pool. For instance, editors sometimes have early access to content that is likely to be popular, which reduces the algorithm's opportunity to select these items. Assessing this effect would require further analysis of the content available for selection to each side, as the candidate pool in news recommendation is highly dynamic and constantly changing.

Click-Through Rates. CTR has become one of the most common metrics in online evaluations of recommender models: it is simple, easy to calculate, and can be monitored live through user activity logs. Some recommendation approaches proposed in the literature optimize algorithms for predicted CTR [31], as it appears more realistic than classic information retrieval metrics such as precision and recall. However, the CTR metric can be problematic in several ways. First, it has been shown that higher CTR does not necessarily mean increased profitability of a recommender [32], which warns against employing it as the main key performance indicator (KPI). Second, CTR results can be non-trivial to interpret, as they are derived from implicit user feedback, which is sometimes wrongly taken as evidence of user satisfaction and recommendation relevance [33].
Finally, CTR values are strongly affected by position bias: the attention an item receives may be due to its favorable position on the front page rather than to its quality or relevance [31]. This particular drawback may matter less in our scenario, as the contents of VG's front page are highly dynamic and change constantly.

Recommendation Diversity. While some earlier research on news recommendation raised concerns that algorithmic curation may cause filter bubbles and low topic diversity [34], other works show that algorithms are capable of recommending a rather diverse set of topics to newspaper readers and can help avoid filter bubbles [35, 36]. Addressing this challenge is beyond the scope of our analysis, but our results at least show that, in terms of topic diversity, algorithmic recommendation does not necessarily differ substantially from human news curation. Indeed, our findings (e.g., in terms of entropy) indicate that the editors and the algorithm draw on similarly diverse pools of news topics. Additional metrics should be used to further investigate the diversity of news coverage.

5. Conclusion
In this paper, we conducted an observational study to explore the synergy of editorial and algorithmic news curation. We examined the front page of a real-world news platform where news articles are either delivered by an automated recommendation algorithm or selected manually by editors. The results of the analysis revealed that the recommendations generated by the automated algorithm can effectively assist the editorial staff, complementing the front page selection with more personalized niche content of similar topic diversity for smaller demographics.
In future work, it would be insightful to conduct a qualitative analysis involving interviews with news editors to explore their methods and techniques for content selection and curation. In addition, we are interested in learning more about their reflections on how automated algorithms can be better integrated into this process, and the extent to which this can facilitate the delivery of news content to the audience in an effective and responsible way.

Acknowledgments
This research is funded by SFI MediaFutures partners and the Research Council of Norway (grant number 309339).

References
[1] M. Carlson, Automating judgment? Algorithmic judgment, news knowledge, and journalistic professionalism, New Media & Society 20 (2018) 1755–1772.
[2] C. Toraman, F. Can, A front-page news-selection algorithm based on topic modelling using raw text, Journal of Information Science 41 (2015) 676–685.
[3] M. Elahi, D. Jannach, L. Skjærven, E. Knudsen, H. Sjøvaag, K. Tolonen, Ø. Holmstad, I. Pipkin, E. Throndsen, A. Stenbom, et al., Towards responsible media recommendation, AI and Ethics (2022) 1–12.
[4] N. Helberger, On the democratic role of news recommenders, in: Algorithms, Automation, and News, Routledge, 2021, pp. 14–33.
[5] F. Lu, A. Dumitrache, D. Graus, Beyond optimizing for clicks: Incorporating editorial values in news recommendation, in: Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, 2020, pp. 145–153.
[6] C. Trattner, D. Jannach, E. Motta, I. Costera Meijer, N. Diakopoulos, M. Elahi, A. L. Opdahl, B. Tessem, N. Borch, M. Fjeld, et al., Responsible media technology and AI: challenges and research directions, AI and Ethics 2 (2022) 585–594.
[7] M. Karimi, D. Jannach, M. Jugovac, News recommender systems–survey and roads ahead, Information Processing & Management 54 (2018) 1203–1227.
[8] S. Raza, C. Ding, News recommender system: a review of recent progress, challenges, and opportunities, Artificial Intelligence Review (2022) 1–52.
[9] Y. Xiao, P. Ai, C.-h. Hsu, H. Wang, X. Jiao, Time-ordered collaborative filtering for news recommendation, China Communications 12 (2015) 53–62.
[10] Z. Lu, Z. Dou, J. Lian, X. Xie, Q. Yang, Content-based collaborative filtering for news topic recommendation, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
[11] F. Garcin, K. Zhou, B. Faltings, V. Schickel, Personalized news recommendation based on collaborative filtering, in: 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, volume 1, IEEE, 2012, pp. 437–441.
[12] M. Kompan, M. Bieliková, Content-based news recommendation, in: E-Commerce and Web Technologies, Springer, 2010, pp. 61–72.
[13] O. Phelan, K. McCarthy, M. Bennett, B. Smyth, Terms of a feather: Content-based news recommendation and discovery using Twitter, in: European Conference on Information Retrieval, Springer, 2011, pp. 448–459.
[14] H. Wen, L. Fang, L. Guan, A hybrid approach for personalized recommendation of news on the web, Expert Systems with Applications 39 (2012) 5806–5814.
[15] P. M. Gabriel De Souza, D. Jannach, A. M. Da Cunha, Contextual hybrid session-based news recommendation with recurrent neural networks, IEEE Access 7 (2019) 169185–169203.
[16] B. Lika, K. Kolomvatsos, S. Hadjiefthymiades, Facing the cold start problem in recommender systems, Expert Systems with Applications 41 (2014) 2065–2073.
[17] Ö. Özgöbek, J. A. Gulla, R. C. Erdur, A survey on challenges and methods in news recommendation, in: International Conference on Web Information Systems and Technologies, volume 2, SCITEPRESS, 2014, pp. 278–285.
[18] J. Atle Gulla, K. C. Almeroth, M. Tavakolifard, F. Hopfgartner, Proceedings of the 2013 International News Recommender Systems Workshop and Challenge, ACM, 2013.
[19] T.-P. Liang, H.-J. Lai, Y.-C. Ku, Personalized content recommendation and user satisfaction: Theoretical synthesis and empirical findings, Journal of Management Information Systems 23 (2006) 45–70.
[20] J. Liu, P. Dolan, E. R. Pedersen, Personalized news recommendation based on click behavior, in: Proceedings of the 15th International Conference on Intelligent User Interfaces, 2010, pp. 31–40.
[21] C. Reinemann, J. Stanyer, S. Scherr, G. Legnante, Hard and soft news: A review of concepts, operationalizations and key findings, Journalism 13 (2012) 221–239.
[22] H. Steck, Calibrated recommendations, in: Proceedings of the 12th ACM Conference on Recommender Systems, 2018, pp. 154–162.
[23] H. Yin, B. Cui, J. Li, J. Yao, C. Chen, Challenging the long tail recommendation, Proceedings of the VLDB Endowment 5 (2012) 896–907.
[24] R. Dorfman, A formula for the Gini coefficient, The Review of Economics and Statistics (1979) 146–149.
[25] L. Jost, Entropy and diversity, Oikos 113 (2006) 363–375.
[26] H. Abdollahpouri, M. Mansoury, R. Burke, B. Mobasher, The connection between popularity bias, calibration, and fairness in recommendation, in: Proceedings of the 14th ACM Conference on Recommender Systems, 2020, pp. 726–731.
[27] A. Klimashevskaia, M. Elahi, D. Jannach, C. Trattner, L. Skjærven, Mitigating popularity bias in recommendation: potential and limits of calibration approaches, in: International Workshop on Algorithmic Bias in Search and Recommendation, Springer, 2022, pp. 82–90.
[28] J. Lin, Divergence measures based on the Shannon entropy, IEEE Transactions on Information Theory 37 (1991) 145–151.
[29] A. Klimashevskaia, D. Jannach, M. Elahi, C. Trattner, A survey on popularity bias in recommender systems, User Modeling and User-Adapted Interaction (2024) 1–58.
[30] X. Wang, N. Golbandi, M. Bendersky, D. Metzler, M. Najork, Position bias estimation for unbiased learning to rank in personal search, in: Proceedings of the 11th ACM International Conference on Web Search and Data Mining, 2018, pp. 610–618.
[31] H. Guo, J. Yu, Q. Liu, R. Tang, Y. Zhang, PAL: a position-bias aware learning framework for CTR prediction in live recommender systems, in: Proceedings of the 13th ACM Conference on Recommender Systems, 2019, pp. 452–456.
[32] H.-H. Chen, C.-A. Chung, H.-C. Huang, W. Tsui, Common pitfalls in training and evaluating recommender systems, ACM SIGKDD Explorations Newsletter 19 (2017) 37–45.
[33] H. Zheng, D. Wang, Q. Zhang, H. Li, T. Yang, Do clicks measure recommendation relevancy? An empirical user study, in: Proceedings of the 4th ACM Conference on Recommender Systems, 2010, pp. 249–252.
[34] M. Haim, A. Graefe, H.-B. Brosius, Burst of the filter bubble? Effects of personalization on the diversity of Google News, Digital Journalism 6 (2018) 330–343.
[35] S. Vrijenhoek, M. Kaya, N. Metoui, J. Möller, D. Odijk, N. Helberger, Recommenders with a mission: assessing diversity in news recommendations, in: Proceedings of the 2021 Conference on Human Information Interaction and Retrieval, 2021, pp. 173–183.
[36] J. Möller, D. Trilling, N. Helberger, B. van Es, Do not blame it on the algorithm: an empirical assessment of multiple recommender systems and their impact on content diversity, in: Digital Media, Political Polarization and Challenges to Democracy, Routledge, 2020, pp. 45–63.