A Supervised Machine Learning Approach for Supporting Editorial Article Selection

Bilal Mahmood1,2,∗, Mehdi Elahi1,2, Farhad Vadiee1,2, Samia Touileb1,2 and Lubos Steskal3

1 MediaFutures, Lars Hilles gate 30, 5008 Bergen, Norway
2 University of Bergen, Fosswinckels gate 6, 5007 Bergen, Norway
3 TV 2 Bergen, Lars Hilles gate 30, 5008 Bergen, Norway

Abstract
Editors on news platforms play a crucial role in various editorial tasks and responsibilities. One of the key tasks carried out by editors regularly is reviewing the latest news articles and manually selecting a set of related articles that could be interesting for readers to explore further. While this task is important, it can pose challenges, as it may take a substantial amount of time to search the database of published articles, check their content, and hand-select the most relevant ones. In this paper, we address this challenge by proposing an automatic approach that can support editors in this process and assist them in selecting related articles for a given news article. The approach is based on Supervised Machine Learning (SML) and leverages state-of-the-art text embedding models to create representations of news articles. A machine learning classifier is built using these embeddings and is utilized to predict scores for available articles based on their relatedness to a target article. The top articles are then recommended to the editor for consideration in the list of the most related articles. We evaluated our approach using a real-world dataset from one of Norway's largest editor-managed commercial media houses, i.e., TV 2. The dataset includes editors' feedback in the form of manually selected related news articles for each news story, which has been used as ground truth to assess the effectiveness of our proposed approach. The results are promising, reflecting the effectiveness of the proposed approach in handling the related article selection task in the editorial process in the news domain.

Keywords
Editorial Recommendations, Recommender Systems, Supervised Machine Learning

1. Introduction

One of the grand challenges in digital environments is the growing number of daily news articles published online. It is estimated that hundreds of thousands of news articles are published globally each day by different news publishers [1]. With such a large number of articles, it is becoming increasingly difficult for online users to find relevant news articles to read. Recommender Systems (RSs) are digital tools designed to help users find the most relevant news articles based on their interests. These systems typically analyze data collected from users while they browse news articles online, building a reading profile that represents their news preferences and affinities. These profiles are then used to match news articles with users' preferences and to recommend a list of the most interesting articles for further browsing. This has made news recommendations central to how users find and interact with news outlets [2]. Although this process may seem straightforward, it often fails to take into account the editorial mission, which plays a significant role in the management of news publishing operations, e.g., curating, fact-checking, and selecting important news for readers to consume. Such a role ensures timely and accurate reporting of the latest news. Moreover, it involves different stakeholders and is a key part of media organizations, e.g., newspapers and news platforms.
An important aspect of this mission is to provide unbiased information and include diverse perspectives [3]. This helps prevent one-sided or unbalanced news, which can potentially damage the democratic values of modern societies or the reputation of newspapers [4]. Such an editorial mission can ensure diversity and fairness in reporting, which is necessary for maintaining public trust and credibility.

In this paper, we focus on an editorial process carried out daily by editors (and journalists) and propose an automatic approach to support them in this process. Our approach is designed as a recommendation tool for editors, assisting them in their daily editorial task of selecting related articles for a given news story. Our approach uses state-of-the-art text embeddings to build a representation of the textual content of the news articles, followed by a machine learning classifier that learns from the choices made by editors (and journalists) in selecting related articles (as a form of feedback) and incorporates this information to better generate a list of related article recommendations. While this is a notable challenge in the news recommendation domain, it has, to our knowledge, received limited attention from the relevant research communities [5]. To achieve that goal, we have formulated the following research questions:

• RQ1: Which combination of embedding model and machine learning classifier best predicts the relevance between articles based on editorial feedback?
• RQ2: What is the best candidate size for generating Top-N recommendations using the best-performing machine learning classifier with various embedding models?

We obtained a comprehensive real-world dataset from TV 2, one of Norway's leading editor-managed commercial media houses. This dataset contains 49,757 news articles, curated with editor-selected related articles. State-of-the-art text embedding models from OpenAI, as well as models available for the Norwegian language, were employed to encode the textual content of the news articles and used as features to represent them. We evaluated our proposed approach by comparing a set of popular machine learning classifiers against each other to determine the best-performing classifiers in terms of various evaluation metrics. We considered two evaluation scenarios: (i) classification and (ii) recommendation. In the former scenario, we evaluated the trained classifiers and compared them, while taking into account the embedding models used, in terms of Accuracy, Precision, Recall, and Area Under the Receiver Operating Characteristic Curve (AUC). In the latter scenario, we generated related article recommendations, to be suggested to the news editors, based on the relevance scores predicted by the classifiers.
We considered Precision@5, Recall@5, and MAP@5 to measure the quality of the recommendations, and included a simpler K-Nearest-Neighbors (KNN) baseline following the approach described in [5]. The results of both evaluation scenarios demonstrate the effectiveness of our proposed automatic approach in identifying related articles and generating recommendations for editors to support the editorial process.

The rest of the paper is structured as follows: In Section 2, we review the related work, and in Section 3, we describe the methodology used in this work. In Section 4, we discuss the experimental results, and finally, in Section 5, we provide a discussion and conclusion.

2. Related Work

Over the past years, research on News Recommender Systems (NRSs) has drawn considerable interest from the academic community. This growing body of work has explored various approaches employed by online platforms for publishing news, including social networks, and has examined how automated algorithms are extensively utilized alongside editorial moderation. According to the literature in this domain, research attention has primarily focused on the development of novel algorithms that can effectively analyze different types of user data collected by news platforms, learn user preferences, and build models to generate recommendations tailored to those preferences [6]. Naturally, the main focus of these works has been on improving recommendation accuracy from the end-user's perspective. This emphasis has led to the development of a wide range of algorithms aimed at enhancing recommendation quality, primarily based on accuracy-oriented metrics. Most algorithms rely on popular approaches such as content-based filtering and collaborative filtering, each of which exploits different types of data, such as content data (e.g., the title and description of the news articles) or user data (e.g., clicks on the news articles), in the recommendation process. Additionally, other algorithms have focused on hybridizing these two approaches to address their respective limitations [7].

While employing novel algorithms is certainly crucial for generating quality recommendations, other factors should also be considered. Such considerations may become particularly important in the news domain, where editorial curation also plays a significant role. Reviewing the literature, a few studies have investigated this aspect of the news domain, e.g., by comparing the mechanisms used for news selection by editors with those of algorithms, according to the opinions of the audience [2]. A notable example is a research study that conducted a field experiment to examine the differences in performance between automated recommendations and editorial curation [8]. The findings indicated that several factors, including the editors' experience and the quantity of user data provided to the algorithm, can influence the performance of these approaches. A limited number of studies have taken a multi-stakeholder perspective, emphasizing the role of other stakeholders in this domain [9]. News organizations are examples of such stakeholders, and they may have additional considerations in the news domain [10], such as editorial values and their responsibility towards the public audience. Other considerations can be public service goals or regulatory requirements [11].
Incorporating all of these considerations into the recommendation process should result in positive impacts, e.g., an inclusive and fair representation of the diverse ideas within a democratic society [12]. Indeed, editorially managed news platforms are often regarded as crucial for informing the public about significant societal issues, a key aspect of democracy, and for serving them with responsible news recommendations [13]. The British Broadcasting Corporation (BBC), for example, has set principles for both news publications and TV programs. These principles focus on delivering content that reflects the different cultures of their audience and includes various viewpoints. By adhering to these principles, more comprehensive coverage is maintained, which can resonate with the interests of a wider range of the public [14]. Similarly, policymakers such as the Council of Europe have created standards for public broadcasters. These standards require that programs reflect the cultural and linguistic diversity of their audiences [15]. Such efforts are important for establishing an inclusive media landscape that respects and represents the various audiences in modern society.

It is worth noting that, despite the examples of research mentioned above, the editorial aspects of news recommendation have so far received limited attention in the research community. We believe there is a need for further research in this field, and this paper aims to address that need. Moreover, the paper differs from these prior research works in various aspects. First, while the majority of existing research primarily focuses on recommending news articles to "users" of news platforms based on their preferences, we propose a recommendation approach that supports "editors" (and journalists) by providing recommendations to assist them in the selection of related articles. Furthermore, previous studies have largely focused on a recommendation scenario only, with evaluation based on metrics designed for that specific scenario. In contrast, this paper considers both classification and recommendation scenarios, utilizing two distinct sets of metrics tailored to each. We believe this dual approach is better aligned with real-world editorial practices, where editors (and journalists) first search for related articles, narrow the results down to a shortlist of top candidate articles, and then make their final selections from those candidates.

3. Methodology

3.1. Dataset

We received a real-world dataset from TV 2, one of Norway's largest editor-managed commercial media houses. The dataset contains 49,757 news articles published between January 1st, 2018 and January 3rd, 2023, for which editors picked at least one related article. After preprocessing the dataset, which involved dropping invalid articles, we were left with a total of 37,614 valid news articles. Each article has a median of two related articles.

Figure 1: The schematic view of our proposed (automatic) approach, based on supervised machine learning, capable of supporting editors when selecting related news articles.

To prepare the dataset for supervised machine learning, we computed the similarity between news articles and, for each news article, considered the five most similar articles from the past year. This prefiltering step was conducted to avoid the high computational expense of comparing all articles against each other in real-world scenarios, where most articles will expectedly be dissimilar (and unrelated).
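To make the prefiltering step concrete, the following is a minimal sketch, not the authors' actual pipeline: it assumes precomputed embedding vectors and publication dates, and the function name `prefilter_candidates` and all inputs are illustrative. For a corpus of this size, an approximate nearest-neighbor index would in practice replace the dense similarity matrix.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def prefilter_candidates(embeddings, published, top_k=5):
    """For each article, return the top_k most similar articles
    published within the preceding year (illustrative sketch)."""
    sims = cosine_similarity(embeddings)  # (n, n); an ANN index would
                                          # replace this dense matrix in practice
    candidates = {}
    for i in range(len(published)):
        pub_i = published.iloc[i]
        # Restrict to articles published within one year before article i.
        window = (published < pub_i) & (published >= pub_i - pd.Timedelta(days=365))
        pool = np.flatnonzero(window.to_numpy())
        if pool.size == 0:
            candidates[i] = []
            continue
        ranked = pool[np.argsort(-sims[i, pool])]  # most similar first
        candidates[i] = ranked[:top_k].tolist()
    return candidates

# Tiny synthetic example (random vectors stand in for real embeddings).
emb = np.random.rand(6, 4)
dates = pd.Series(pd.to_datetime(["2021-01-01", "2021-03-01", "2021-06-01",
                                  "2021-09-01", "2022-01-15", "2022-02-01"]))
print(prefilter_candidates(emb, dates, top_k=2))
```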
In addition, this step allowed us to focus on the challenging task of identifying related articles within a prefiltered set of similar articles, where not all articles had been selected by editors as related. Following this step, articles that had not been selected by the editors as related were assigned a target label of 0, while those considered related were assigned a target label of 1. It is worth noting that our methodology is inspired by the current workflow of the editors in their selection of related articles, where they typically use a search tool to find a short list of candidate articles and then mark them as related (or unrelated) based on their domain knowledge and editorial principles. Figure 1 illustrates this process.

To represent the textual information of the news articles, three different embedding models were considered (along with the languages they support): OpenAI's text-embedding-3-small (multilingual, https://openai.com/index/new-embedding-models-and-api-updates/), NB-SBERT-base (English and Norwegian, https://huggingface.co/NbAiLab/nb-sbert-base), and NorBERT3-large (Norwegian, https://huggingface.co/ltg/norbert3-large). Each of these pre-trained models produces embedding vectors of a different size: OpenAI's model produces a 1,536-dimensional representation, NorBERT3-large produces a 1,024-dimensional vector representation of the text, and NB-SBERT-base produces 768-dimensional embeddings of the textual information in the news article. Mean pooling of the output layer is used to produce the embeddings for NorBERT3-large and NB-SBERT-base.

3.2. Evaluation

We considered a set of popular machine learning classifiers offered by the Scikit-learn library [16], i.e., K-Nearest-Neighbors (KNN), Random Forest, and Gradient Boosting. We used standard models with their default parameters for training, validating, and testing the performance of the different classifiers. As a baseline, we used a Random classifier that predicted whether a potential news article was related (relevant) or not with an equal probability of 50%. To train and evaluate the classifiers, we employed a time-based evaluation strategy [17], where all articles published from 2018 to 2020 were used for training the classifiers, articles published in 2021 were used for validation, and articles published in 2022, as well as articles from January 2023, were used for testing. Table 1 presents the dataset characteristics. Since we considered three different embedding models for representing the textual information in the news articles, this resulted in three different subsets for training and evaluating the classifiers, as can be seen in the table.

Table 1: Characteristics of the datasets used for training, validating, and testing the classifiers

Embedding Models | Training Set (2018–2020)                 | Validation Set (2021)                   | Test Set (2022)
                 | Dimension       #Articles  Related [%]   | Dimension      #Articles  Related [%]  | Dimension     #Articles  Related [%]
OpenAI           | (220248, 3074)  34446      29.4          | (17471, 3074)  2932       22.9         | (1372, 3074)  236        23.2
SBERT            | (226225, 1538)  34446      28.7          | (17929, 1538)  2932       22.3         | (1412, 1538)  236        22.5
NorBERT3         | (227266, 2050)  34446      28.5          | (18059, 2050)  2932       22.1         | (1429, 2050)  236        22.3

In total, there were 34,446 news articles in the training set, 2,932 news articles in the validation set, and 236 news articles in the test set. The prepared training dataset based on the OpenAI embeddings had 220,248 news article–potential article pairs, with a feature size of 3,074.
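The feature sizes in Table 1 (e.g., 3,074 for the 1,536-dimensional OpenAI embeddings) are consistent with concatenating the embeddings of both articles in a pair plus two additional scalar features. The paper does not specify those two extras; the sketch below assumes, purely for illustration, the pair's cosine similarity and the publication gap in days.

```python
import numpy as np

def pair_features(emb_a, emb_b, days_apart):
    """Assemble one training example for an (article, candidate) pair.
    The two extra scalars (cosine similarity, days apart) are our own
    assumption, not specified in the paper."""
    cos = float(emb_a @ emb_b /
                (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
    return np.concatenate([emb_a, emb_b, [cos, days_apart]])

# Example: OpenAI embeddings -> 1536 + 1536 + 2 = 3074 features.
x = pair_features(np.random.rand(1536), np.random.rand(1536), days_apart=12.0)
assert x.shape == (3074,)
# The target label y is 1 if the editor selected the candidate as related, else 0.
```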
Similarly, the training datasets for SBERT and NorBERT3 had 226,225 and 227,266 news article–potential article pairs, respectively. The number of pairs differs across embedding models because the five most similar articles, used for target labeling, depend on the embedding model and therefore vary. Figure 2 presents the set of features (denoted as X) as well as the target label (denoted as Y) that were used for training the different classification models.

Figure 2: Features used for training different machine learning classifiers.

3.3. Metrics

3.3.1. Classification Scenario

To evaluate the effectiveness of our proposed approach, we employed several common evaluation metrics: Accuracy, Precision, Recall, and the Area Under the Receiver Operating Characteristic Curve (AUC). Although we also computed the F1 score, its results were similar to those of the other metrics and are therefore not reported. The formulas for these metrics are presented below. In the context of this scenario, TP (True Positive) denotes the articles predicted as related by the classifier and selected as related by the editor. FP (False Positive) represents the articles predicted as related by the classifier but not selected as related by the editor. FN (False Negative) refers to the articles predicted as not related by the classifier but selected as related based on the editor's decision. Lastly, TN (True Negative) indicates the articles predicted as not related by the classifier and not selected as related by the editor.

Accuracy measures the overall correctness of the model, defined as the ratio of correctly predicted instances (both true positives and true negatives) to the total number of instances:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision measures the proportion of true positive predictions among all positive predictions made by the classifier, reflecting its capability to minimize false positives:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall, also known as Sensitivity, measures the proportion of actual positives correctly identified by the classifier:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

The Area Under the ROC Curve (AUC) is a widely adopted metric for evaluating the performance of a classifier. It indicates the probability that the classifier ranks a randomly chosen positive instance higher than a randomly chosen negative instance. The AUC is calculated based on the following components:

$$\mathrm{Specificity} = \frac{TN}{FP + TN}, \qquad \mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

Here, Specificity measures the proportion of true negatives correctly identified, while Sensitivity (also known as the True Positive Rate or Recall) measures the proportion of true positives correctly identified by the classifier. The AUC combines these two aspects to provide a single scalar value that summarizes the overall performance of the classifier across different threshold values.
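As a hedged sketch of how this classification evaluation could be reproduced with Scikit-learn's default models: the synthetic arrays below merely stand in for the time-based splits of Table 1, and the variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

# Synthetic placeholders standing in for the time-based splits of Table 1.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, size=200)
X_test, y_test = rng.normal(size=(50, 10)), rng.integers(0, 2, size=50)

clf = GradientBoostingClassifier()          # default parameters, as in the paper
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)                # hard labels for Accuracy/Precision/Recall
y_score = clf.predict_proba(X_test)[:, 1]   # relatedness scores for AUC

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall   :", recall_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_score))
```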
3.3.2. Recommendation Scenario

Since we aim to support editors in finding related articles for a given news article through recommendation, we considered metrics [18] that specifically measure recommendation quality, i.e., Precision@K, Recall@K, and MAP@K, where we set K = 5. Precision@K is a common metric that measures the accuracy of recommending relevant items. To compute it, the top K items are selected as recommendations for each news article i, and Precision@K for that article is calculated as follows:

$$P_i@K = \frac{|L_i \cap \hat{L}_i|}{|\hat{L}_i|}$$

Here, $L_i$ denotes the set of related articles selected by the editor for a given news article $i$ in the test set $T$, and $\hat{L}_i$ represents the recommendation list containing the top $K$ articles in the candidate set with the highest scores predicted by the machine learning classifier for news article $i$. The overall Precision@K (P@K) is obtained by averaging the $P_i@K$ values across all news articles in the test set.

Recall@K (R@K) is another important metric used to evaluate the effectiveness of a recommender system. For a given news article $i$, $R_i@K$ is defined as:

$$R_i@K = \frac{|L_i \cap \hat{L}_i|}{|L_i|}$$

where $L_i$ and $\hat{L}_i$ are defined as above. The overall Recall@K (R@K) is computed by averaging the $R_i@K$ values across all news articles in the test set.

Mean Average Precision (MAP@K) is a metric that assesses the quality of the ranking in recommender systems. MAP@K is computed as the arithmetic mean of the Average Precision@K (AP@K) across all news articles in the test set. The Average Precision for the top K recommendations (AP@K) is calculated as follows:

$$AP@K = \frac{1}{\min(N, K)} \sum_{i=1}^{K} P@i \cdot rel(i)$$

Here, $rel(i)$ is an indicator function that equals 1 if the $i$th recommended item is related and 0 otherwise, $N$ represents the total number of related articles for a given news article, and $K$ is the size of the recommendation list.
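The following minimal Python implementations mirror the formulas above; they are illustrative rather than the authors' evaluation code, and the function names are our own.

```python
# Minimal implementations of P@K, R@K, and AP@K as defined above.

def precision_at_k(recommended, relevant, k=5):
    top = recommended[:k]
    return len(set(top) & relevant) / len(top) if top else 0.0

def recall_at_k(recommended, relevant, k=5):
    top = recommended[:k]
    return len(set(top) & relevant) / len(relevant) if relevant else 0.0

def average_precision_at_k(recommended, relevant, k=5):
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:          # rel(i) = 1
            hits += 1
            score += hits / i         # P@i at each relevant rank
    return score / min(len(relevant), k) if relevant else 0.0

# MAP@5 is the mean of average_precision_at_k over all test articles.
recs = ["a", "b", "c", "d", "e"]
print(average_precision_at_k(recs, relevant={"a", "d"}, k=5))  # (1/1 + 2/4)/2 = 0.75
```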
4. Results

To address our research questions, we conducted a set of experiments focused on the classification of news articles and the prediction of related ones (Experiment A), and then used the predictions to generate recommendations of related news articles (Experiment B). Experiment A addresses RQ1, and Experiment B addresses RQ2. In this section, we describe the results of these experiments.

4.1. Experiment A: Classification of News Articles Based on Editorial Feedback

We built and evaluated several well-known machine learning classifiers, i.e., Gradient Boosting, K-Nearest Neighbors (KNN), and Random Forest. Each classifier was trained on the training data described in Section 3 (Methodology), created based on the various embedding models, specifically OpenAI, SBERT, and NorBERT3, used to encode the news articles. This allowed us to assess the performance of each classifier-embedding combination in accurately predicting the relevance of articles. Our goal was to identify the best combination of classifier and embedding model for predicting a set of related news articles based on editorial feedback (i.e., our ground truth). The results of these predictions are subsequently used for the task of Top-N news recommendations (see Experiment B). Table 2 presents the results of Experiment A. As can be seen, the best-performing classifier overall is Gradient Boosting, regardless of the embedding model used, with respect to most of the considered metrics. Additionally, it is noteworthy that the classifiers almost always perform substantially better than the baseline (random).

Table 2: Comparison of the performance of different classifiers and embedding models on the test set

Embedding Models | Machine Learning   | Accuracy | Precision | Recall | AUC
OpenAI           | KNN                | 0.840    | 0.763     | 0.447  | 0.833
                 | Random Forest      | 0.802    | 0.883     | 0.167  | 0.806
                 | Gradient Boosting  | 0.861    | 0.836     | 0.497  | 0.887
                 | Random (baseline)  | 0.511    | 0.240     | 0.513  | 0.500
SBERT            | KNN                | 0.865    | 0.823     | 0.513  | 0.850
                 | Random Forest      | 0.866    | 0.864     | 0.481  | 0.859
                 | Gradient Boosting  | 0.904    | 0.861     | 0.682  | 0.930
                 | Random (baseline)  | 0.510    | 0.231     | 0.506  | 0.500
NorBERT3         | KNN                | 0.880    | 0.873     | 0.541  | 0.851
                 | Random Forest      | 0.876    | 0.873     | 0.519  | 0.894
                 | Gradient Boosting  | 0.910    | 0.886     | 0.686  | 0.936
                 | Random (baseline)  | 0.524    | 0.243     | 0.540  | 0.500

Using the OpenAI model, Gradient Boosting achieved an accuracy of 0.861, while Random Forest and K-Nearest Neighbors (KNN) obtained accuracy values of 0.802 and 0.840, respectively. In terms of precision, however, Random Forest performs best with a value of 0.883; the precision value is 0.836 for Gradient Boosting and 0.763 for K-Nearest Neighbors. In terms of recall, surprisingly, the Random classifier performs best with a score of 0.513, followed by Gradient Boosting with 0.497 and K-Nearest Neighbors with a comparable 0.447. Strangely, Random Forest does not perform well on this metric, obtaining a value of only 0.167. Similarly, in terms of AUC, Gradient Boosting is the best with a value of 0.887, followed by K-Nearest Neighbors with 0.833 and Random Forest with 0.806. For the baseline (Random) classifier, expectedly, the values were lowest for these metrics: 0.511 for accuracy, 0.240 for precision, and 0.500 for AUC.

Considering SBERT as the embedding model, Gradient Boosting yields an accuracy of 0.904, while this value was 0.865 for K-Nearest Neighbors and 0.866 for Random Forest. Comparing the precision values, Random Forest achieves better results with a score of 0.864, slightly outperforming Gradient Boosting with 0.861; K-Nearest Neighbors achieves 0.823. In terms of recall, the best performer is Gradient Boosting with a value of 0.682; for K-Nearest Neighbors and Random Forest, the recall scores were 0.513 and 0.481, respectively. In terms of AUC, Gradient Boosting is the best with a value of 0.930; the result for K-Nearest Neighbors is 0.850 and for Random Forest 0.859. For the baseline (Random) classifier, the recorded metrics were as follows: an accuracy of 0.510, a precision of 0.231, a recall of 0.506, and an AUC of 0.500.

When the NorBERT3 embedding model was applied, the results were overall better than with the other models for nearly all classifiers. Moreover, for all metrics, Gradient Boosting outperformed the other classifiers. For accuracy, the value for this classifier is 0.910, while for K-Nearest Neighbors it was 0.880 and for Random Forest 0.876. For precision, Gradient Boosting showed a value of 0.886, while both K-Nearest Neighbors and Random Forest showed a value of 0.873. For recall, Gradient Boosting again achieved the best score with 0.686, while K-Nearest Neighbors and Random Forest obtained 0.541 and 0.519, respectively. Finally, for AUC, Gradient Boosting was the best with 0.936, followed by Random Forest with 0.894 and K-Nearest Neighbors with 0.851. For the baseline (Random) classifier, again, the values of the metrics were the lowest, with an accuracy of 0.524, a precision of 0.243, a recall of 0.540, and an AUC of 0.500.
Overall, the results of Experiment A demonstrate the effectiveness of our approach based on supervised machine learning in predicting whether a potential news article is related to a given article or not.

4.2. Experiment B: Recommendation of Related News Articles

To address RQ2, we considered a scenario where a new story is published on a news platform, and the editor (and/or a journalist) wants to find a set of related articles for it. These related articles could also be considered editorial suggestions from the perspective of the user who is reading that story on the news platform. Our proposed approach supports editors by automating this process: it recommends a set of short-listed candidate articles that an editor can check and select from as related articles. This is achieved by ranking the news articles according to the relatedness scores predicted by the top-performing machine learning classifier (i.e., Gradient Boosting) and then selecting the Top-N articles for recommendation. We consider N = 5, meaning that we recommend 5 news articles to the editors. Again, these 5 recommended articles are chosen from a larger set of candidate articles (the candidate set), which are the most similar news articles to the target article. The recommendation list is then evaluated against the selections made by the editors. We adopted various metrics to evaluate the related article recommendations, i.e., Recall@5, Precision@5, and MAP@5, as described earlier in the metrics section.
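To make this recommendation step concrete, the sketch below scores a prefiltered candidate set with a trained classifier and returns the Top-N highest-scoring article ids. The helper `pair_features` repeats the illustrative feature assembly from the earlier sketch, and `clf` is assumed to be a fitted classifier such as Gradient Boosting; none of these names come from the actual system.

```python
import numpy as np

def pair_features(emb_a, emb_b, days_apart):
    # Same illustrative feature assembly as in the earlier sketch.
    cos = float(emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
    return np.concatenate([emb_a, emb_b, [cos, days_apart]])

def recommend_related(target_emb, candidates, clf, n=5):
    """Rank a prefiltered candidate set by predicted relatedness, return Top-N ids.

    candidates: list of (article_id, embedding, days_apart) tuples;
    clf: a fitted scikit-learn classifier exposing predict_proba."""
    X = np.vstack([pair_features(target_emb, emb, days)
                   for _, emb, days in candidates])
    scores = clf.predict_proba(X)[:, 1]           # predicted relatedness score
    order = np.argsort(-scores)                   # highest score first
    return [candidates[i][0] for i in order[:n]]  # Top-N article ids
```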
The results for varying candidate sizes are presented in Figure 3. As can be seen, the quality of the recommendations can change significantly depending on the size of the candidate set used for generating recommendations, for all of the considered embedding models. When the OpenAI embeddings and the Gradient Boosting classifier are used (Figure 3, top left), the Recall@5 curve reaches its peak at a candidate size of 20 and then decreases steadily. Precision@5 shows similar behavior and also peaks at a candidate size of 20, whereas MAP@5 decreases steadily, with its peak at 5.

Figure 3: Comparison of the quality of related article recommendations as the size of the candidate set varies, for different embedding models and the Gradient Boosting classifier.

For NorBERT3 and the Gradient Boosting classifier (Figure 3, bottom left), we observe that Recall@5 and MAP@5 peak at a candidate size of 5, whereas the highest value of Precision@5 is reached at a candidate size of 30. For SBERT and the Gradient Boosting classifier (Figure 3, bottom right), the best values for Recall@5 and MAP@5 occur at a candidate size of 5, and for Precision@5 at 10. For the sake of comparison, we also randomly selected articles from the candidate set created using the OpenAI embeddings; the key difference in this baseline is that similarity among the articles was not considered when generating the recommendation list for the editors. The results can be seen in Figure 3 (top right), showing a similar trend where the metric values continuously decrease as the candidate size increases, with the best performance observed at a candidate size of 5.

It is important to note that, in the recommendation scenario, a crucial consideration was the proportion of related articles successfully retrieved within the Top-N articles recommended to the editors. Therefore, we prioritized Recall@5 when choosing the best candidate size. The results are presented in Table 3. We also show baseline results obtained without applying the machine learning classifiers, where the recommendations are generated by considering only the 5 most similar articles based on Cosine similarity [19, 20]. Overall, the results indicate that, across all three embedding models, the OpenAI embeddings with a candidate size of 20 and the Gradient Boosting classifier yielded the best results in terms of Precision@5 and Recall@5. Interestingly, the MAP@5 results for this model and classifier were comparable to those of the similarity-based method (baseline). Surprisingly, the other embedding models did not show significant differences from the similarity-based baseline in terms of the best candidate size, Precision@5, and Recall@5. However, we observed a considerable improvement in MAP@5.

Table 3: Overall summary of the results for the experiments focused on the recommendation scenario with varying candidate sizes

Embedding Models | Best Classifier    | Best Size | Recall@5 | Precision@5 | MAP@5
OpenAI           | Gradient Boosting  | 20        | 0.467    | 0.122       | 0.361
                 | Similarity-based   | –         | 0.436    | 0.107       | 0.366
SBERT            | Gradient Boosting  | 5         | 0.311    | 0.073       | 0.270
                 | Similarity-based   | –         | 0.311    | 0.073       | 0.234
NorBERT3         | Gradient Boosting  | 5         | 0.238    | 0.059       | 0.217
                 | Similarity-based   | –         | 0.238    | 0.059       | 0.187

5. Discussion and Conclusion

One of the crucial roles of editors on news platforms is to curate news articles and ensure that related content is appropriately linked to a target news article. This empowers the users of the platforms to explore a set of manually selected articles, chosen according to editorial principles, that may also be of interest for further reading. While this process is important for maintaining the relevance and coherence of news articles on the platform, it can be both expensive and time-consuming, often requiring significant effort.

In this paper, we proposed an automatic approach to support editors in this task by utilizing state-of-the-art embedding models to encode the textual content of articles, thereby creating robust vector representations of the news content. These representations are then used by a set of popular machine learning classifiers to learn from the data and predict a relevance score indicating the level of relatedness among articles.

We evaluated our approach on a real-world dataset provided by TV 2, one of Norway's largest editor-managed commercial media houses. We considered two evaluation scenarios: (i) a classification scenario, where we assessed the accuracy of the classifiers' predictions, and (ii) a recommendation scenario, where the output of the classifiers is utilized to generate article recommendations that editors might consider as related content. We employed several evaluation metrics, including Precision, Recall, and AUC for the classification scenario, as well as Precision@K, Recall@K, and MAP@K for the recommendation scenario, to comprehensively assess the performance of our approach. The results of our experiments are promising, reflecting the effectiveness of our approach in addressing essential aspects of the editorial process in the news domain by accurately classifying articles and recommending related ones.
It is worth noting that the recommendation scenario we considered in this paper is common in real-world settings, which often begin with an editor using some form of search engine to find a set of short-listed candidate articles, among which the most related articles are selected manually. The size of the candidate article set is an important factor in this process, since a larger set is expected to increase the chance of finding more related articles. However, this can be computationally expensive, as it may require calculating the relatedness against all articles published in the past. Moreover, conducting such a calculation in its entirety may be unnecessary, since the majority of articles are unlikely to be related to each other. Hence, finding a reasonable candidate size can be very beneficial in real-world scenarios. We therefore report the best candidate size on the test set for the different embedding models and the best-performing classifier.

In future work, we plan to add more features about the news articles (e.g., authorship, news categories, entities) for training our classification model. In addition, we plan to further fine-tune the embedding models to improve the quality of the representations generated for the news articles. This could positively impact the accuracy of determining relatedness or similarity between articles. Additionally, we plan to incorporate fine-tuned GPT models and potentially utilize them to rank candidate articles when generating recommendations.

Acknowledgments

This work was in part supported by the Research Council of Norway with funding to MediaFutures: Research Centre for Responsible Media Technology and Innovation, through the Centre for Research-based Innovation scheme, project number 309339.

References

[1] D. Thompson, How many stories do newspapers publish per day?, The Atlantic (2016). URL: https://www.theatlantic.com/technology/archive/2016/05/how-many-stories-do-newspapers-publish-per-day/483845/, accessed: 2024-08-19.
[2] N. Thurman, J. Moeller, N. Helberger, D. Trilling, My friends, editors, algorithms, and I: Examining audience attitudes to news selection, Digital Journalism 7 (2019) 447–469.
[3] C. Trattner, D. Jannach, E. Motta, I. Costera Meijer, N. Diakopoulos, M. Elahi, A. L. Opdahl, B. Tessem, N. Borch, M. Fjeld, et al., Responsible media technology and AI: challenges and research directions, AI and Ethics 2 (2022) 585–594.
[4] M. Elahi, D. Jannach, L. Skjærven, E. Knudsen, H. Sjøvaag, K. Tolonen, Ø. Holmstad, I. Pipkin, E. Throndsen, A. Stenbom, et al., Towards responsible media recommendation, AI and Ethics (2022) 1–12.
[5] B. Mahmood, M. Elahi, S. Touileb, L. Steskal, C. Trattner, Incorporating editorial feedback in the evaluation of news recommender systems, in: Adjunct Proceedings of the 32nd ACM Conference on User Modeling, Adaptation and Personalization, 2024, pp. 148–153.
[6] E. Mitova, S. Blassnig, E. Strikovic, A. Urman, A. Hannak, C. H. de Vreese, F. Esser, News recommender systems: A programmatic research review, Annals of the International Communication Association 47 (2023) 84–113.
[7] M. Karimi, D. Jannach, M. Jugovac, News recommender systems – survey and roads ahead, Information Processing & Management 54 (2018) 1203–1227.
[8] C. Peukert, A. Sen, J. Claussen, The editor and the algorithm: Recommendation technology in online news, Management Science (2023).
[9] H. Abdollahpouri, G. Adomavicius, R. Burke, I. Guy, D. Jannach, T. Kamishima, J. Krasnodebski, L. Pizzato, Multistakeholder recommendation: Survey and research directions, User Modeling and User-Adapted Interaction 30 (2020) 127–158.
[10] F. Lu, A. Dumitrache, D. Graus, Beyond optimizing for clicks: Incorporating editorial values in news recommendation, in: Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, 2020, pp. 145–153.
[11] N. Tintarev, E. Sullivan, D. Guldin, S. Qiu, D. Odjik, Same, same, but different: algorithmic diversification of viewpoints in news, in: Adjunct Publication of the 26th Conference on User Modeling, Adaptation and Personalization, 2018, pp. 7–13.
[12] N. Helberger, On the democratic role of news recommenders, in: Algorithms, Automation, and News, Routledge, 2021, pp. 14–33.
[13] E. Brogi, D. Borges, R. Carlini, I. Nenadic, K. Bleyer-Simon, J. E. Kermer, U. Reviglio della Venaria, M. Trevisan, S. Verza, The European Media Freedom Act: media freedom, freedom of expression and pluralism, Technical Report, Policy Department for Citizens' Rights and Constitutional Affairs, 2023.
[14] BBC, Mission, values and public purposes - About the BBC, https://www.bbc.com/aboutthebbc/governance/mission, 2019.
[15] Council of Europe, Commissioner, Public service broadcasting under threat in Europe, https://www.coe.int/en/web/commissioner/-/public-service-broadcasting-under-threat-in-europe, 2017.
[16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[17] C. Huyen, Designing Machine Learning Systems, O'Reilly Media, USA, 2022.
[18] M. Schedl, H. Zamani, C.-W. Chen, Y. Deldjoo, M. Elahi, Current challenges and visions in music recommender systems research, International Journal of Multimedia Information Retrieval 7 (2018) 95–116.
[19] M. J. Pazzani, D. Billsus, Content-based recommendation systems, in: The Adaptive Web: Methods and Strategies of Web Personalization, Springer, 2007, pp. 325–341.
[20] P. Lops, M. De Gemmis, G. Semeraro, Content-based recommender systems: State of the art and trends, Recommender Systems Handbook (2011) 73–105.