Recommendation systems for news articles at the BBC

Maria Panteli, British Broadcasting Corporation, London, United Kingdom, maria.panteli@bbc.co.uk
Alessandro Piscopo, British Broadcasting Corporation, London, United Kingdom, alessandro.piscopo@bbc.co.uk
Adam Harland, British Broadcasting Corporation, Glasgow, United Kingdom, adam.harland@bbc.co.uk
Jonathan Tutcher, British Broadcasting Corporation, Salford, United Kingdom, jon.tutcher@bbc.co.uk
Felix Mercer Moss, British Broadcasting Corporation, Bristol, United Kingdom, felix.mercermoss@bbc.co.uk

ABSTRACT
Personalised user experiences have improved engagement in many industry applications. When it comes to news recommendations, and especially for a public service broadcaster like the BBC, recommendation systems need to be in line with the editorial policy and the business values of the organisation. In this paper we describe how we develop recommendation systems for news articles at the BBC. We present three models and describe how they compare with baseline approaches such as random and popularity. We also discuss the metrics we use, the unique challenges we face, and the considerations needed to ensure the recommendations we generate uphold the trust and quality standards of the BBC.

CCS CONCEPTS
• Information systems → Recommender systems; • Computing methodologies → Machine learning approaches.

KEYWORDS
recommendations, news, neural networks

© Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). INRA'19, September 2019, Copenhagen, Denmark.

1 INTRODUCTION
The BBC is one of the world's leading public service broadcasters. Its services—television, radio, digital—reach more than 80% of the UK's adult population every week [2] and 279 million people worldwide (World Service [4]). This large audience has access to a vast and diverse amount of content, including video, audio, and text, spanning topics such as news, sport, and entertainment. In order to enable its audience to enjoy the best possible experience, it is crucial for the BBC to adopt strategies to guide users to the most relevant and engaging content.

The main approach until recently has been to manually curate content following the guidelines formally documented in an editorial tome [3]. These have been developed to ensure quality across all products, uphold the BBC values, and build audience trust. Although manual curation is an excellent way to surface quality content, it is not tailored to the user and is hard to scale—the greater the amount of content, the harder it is for curators to find relevant items for each type of content. In order to deliver an experience which is relevant, timely, and contextually useful to every single user, the BBC combines editorial curation with personalised, automated approaches. Data-driven recommendations are a key part of these approaches: they are an important tool to enhance users' ability to explore and discover content they would not be aware of otherwise (see e.g. [26, 31, 36, 37]) and have been successfully tested and deployed by several media providers (e.g. Netflix [20]) and e-commerce companies (e.g. Amazon [41]).

According to the mission of the BBC, the organisation must "act in the public interest, serving all audiences through the provision of impartial, high-quality and distinctive output and services which inform, educate, and entertain" [5]. Following this mission, the BBC must be a provider of accurate and unbiased information, and the content it produces and distributes must aim to engage diverse audiences. Amongst the diverse types of content produced by the BBC, news is the product that likely contributes most to its reputation as a trustworthy and authoritative media outlet. Besides the UK service BBC News¹, the BBC produces, broadcasts, and delivers online news in more than 40 languages. Hence, it is of utmost importance for automated recommendation approaches implemented on any BBC news service not only to be as accurate as possible, but also to conform with the principles outlined above. This paper reports early results of the experiments we carried out to that end. In particular, it describes the development of recommendation systems for BBC news articles and the challenges in building data-driven applications for a public service broadcaster. The case study adopted in the experiment was the application of recommendation systems to BBC Mundo², a Spanish-language news website and part of the BBC World Service [6].

The structure of this paper is as follows. Section 2 defines the problem addressed in the current work, and Section 3 discusses prior related work. Section 4 describes the methodology, including the data, models, and evaluation approaches. Finally, results are presented and discussed in Sections 5 and 6.

¹ Please note that 'News' capitalised refers to the UK channel, whereas lowercase refers to the type of content.
² https://www.bbc.com/mundo

2 PROBLEM DEFINITION
Our goal is to build recommendation systems for news articles. Recommendations in the news domain have been characterised distinctly in the literature [38] due to the short life-cycle of items and the vast amounts of anonymous users. Considering the reputation of the BBC and the responsibility it has to deliver trustworthy and authoritative news to its audience, we highlight the following challenges in achieving our goal.

Non-signed-in users. The majority of users on any BBC news platform are not signed in. This means that we have limited information about the user and the items they have previously interacted with. We typically work with session-based information, i.e. user-item interactions that occurred within 30 minutes of each other. This means that our recommendation models need to achieve high accuracy for cold-start user scenarios or predict the user's taste after as little as one item interaction.

Many cold-start items. The publication cycle on any news platform is rapid and unrelenting. BBC News is no different. Fresh items are regularly uploaded, and any recommendation system we implement should be able to serve an item within minutes of publication. Additionally, articles may become obsolete or gain sudden relevance following an event—consider for example the case of breaking news. Recommendation approaches must thus be able to take these characteristics into account, not being based solely on a user's history, but considering the content and context of the articles they read.

Architecture constraints. Because of the popularity of BBC news, multiple stakeholders (internal and external) rely on and set the requirements for the news platform. Any changes to the system architecture that could affect other stakeholders need to be thoroughly investigated and justified. Our recommendation models often have to adapt to the existing architecture, which means that our system architecture choices are somewhat constrained.

Mistakes are not tolerated. BBC news, and the Mundo platform in particular, are consumed by millions of users. For the majority of these users, this is the only BBC platform they visit. News is also a very sensitive domain, as it is not just entertainment but also the way in which people inform and educate themselves. Mistakes in data-driven recommendations could lead to misinformation or compromise our quality standards, which would greatly impact our audience. The bar for the performance of the system is set very high to limit the risk of unexpected behaviour.

Fairness and impartiality. The BBC has built its trust after many years of thoughtful manual curation and expert editorial guidance. It commits to delivering content in a fair, impartial and honest way, and data-driven recommendations should live up to, and advance, these standards. Algorithmic fairness and impartiality in recommendation systems are increasingly discussed in the literature [19, 33], but with no standardised solutions yet. We consider evaluation metrics that help us track the risk and bias induced by our recommendation systems.

The above challenges drive the decisions we make around which models and evaluation strategies to implement. For example, we place significant focus upon offline evaluation to avoid unexpected behaviour; we use a variety of metrics to track the quality of recommendations; we consider recency-based systems an essential baseline for news recommendations; and we adopt content-based approaches to tackle the cold-start scenarios. More details about our choices and how they relate to these challenges are provided in Sections 3 and 4.

3 RELATED WORK
Recommendation systems in the news domain have been investigated for more than a decade [27, 38], following various approaches. Collaborative filtering [15] relies on past user behaviour to formulate recommendations based on commonalities across user preferences. Content-based approaches rely on item properties (or user profiles constructed from the properties of the items they consume) to recommend related items [10, 29, 39]. Rather than considering the long user history, session-based approaches focus on user-item interactions that occur within a certain time frame or context [40, 43]. Finally, hybrid systems may put together aspects from these approaches and use a broader range of features, in order to achieve a more nuanced representation of user activity [18, 30]. Content-based, session-based, and hybrid approaches appear to be the most suitable to address some of the problems we outlined earlier, namely the large number of anonymous users and cold-start items (Section 2).

Beyond the news domain, recommendation systems have been investigated in a variety of industrial applications. Approaches vary between traditional content-based and collaborative filtering while, more recently, the advent of deep neural networks has facilitated the development of hybrid strategies [45]. These have been applied to the problem of accommodation search at Airbnb [22], product advertisement at Criteo [28], video recommendations at YouTube [14], and movie recommendations at Netflix [20]. Industry approaches using neural networks are of particular interest to us due to the scalability of the systems and the domain-agnostic capability of neural networks.

Considering the system architecture, some neural network-based approaches for recommending textual content are end-to-end (for example [1]): that is, the model takes as input the text of items related to a user, extracts features for the items and the user, and ultimately outputs a recommendation. Other approaches rely on separate modules for extracting features for the content and the user and for generating recommendations [16]. Here, we take the latter approach for a number of reasons. First, an end-to-end approach was not compatible with the current architecture of the system, over which we have limited control (Section 2).

Figure 1: Sequence length distribution in our dataset.
The graph includes 99% of sequence lengths, in order to leave out the long tail and improve readability.

Figure 2: User visits distribution in our dataset. The graph includes 99% of the number of visits, in order to leave out the long tail and improve readability.

Second, separating content representation from the generation of recommendations enables further experimentation and increases the ability of the system to retrieve new items [16].

4 METHODOLOGY

4.1 Data
The BBC collects detailed user interaction data for its digital services, providing information about users and the circumstances of their visits to BBC websites. For the purpose of this analysis, we used 15 days' worth of data from BBC Mundo, spanning from the 6th to the 20th of April 2019. We define a sequence, or visit, as any succession of user interactions (i.e. page views) within 30 minutes of each other. Page views were aggregated into sequences according to this definition. In this dataset, the average number of user interactions we collected per day was in the order of millions. As shown in Figure 1, most recorded sessions included only a single article read (i.e., a sequence of length 1), which is a common observation in news delivery platforms [16]. Users often visited BBC Mundo only once over the time-span considered (Figure 2).

Like all statistical learning models, to robustly evaluate recommender system performance the data is required to be appropriately split. In traditional machine learning problems, where the raw data takes the form of input-output pairs, this split is relatively straightforward. Assuming there is enough data, a common split might be 80%, 10%, 10% into training, validation and test sets respectively. For recommender systems, the temporal nature of the data makes the situation a little different. While we still need to perform a train/validation/test split, referred to from now on as the test split, we also need to perform an additional split, henceforth referred to as the query split. The query split describes the process of transforming a temporal sequence of consumption logs into a single or group of feature-target pairs suitable for ingestion into algorithmic learning models.

For the test split, our initial thought was to discard the temporal dimension and sample user sessions according to pre-determined train/test/validation fractions. While the simplicity of this approach is attractive, we decided that maximising the similarity between our offline testing framework and our online production environment was more important. The temporal approach we implemented is displayed in Figure 3, where we choose a thirteen-day period for training, the next day for validation, and the following day for test. As we have the capacity to train and serve fresh consumer-facing models every day, we aim for this offline approach to reflect our production environment sufficiently for inferences in the former to provide valuable information about the latter.

For the query split, we take a user session from a given period defined earlier in the current section and divide it into the maximum number of trigrams while preserving temporal order. Then, for each trigram, the first two elements (article vectors) represent the user profile, while the third and final element is the groundtruth item used as a target for our models. The length of the user profile was chosen based upon two factors: (1) our client-side serving infrastructure is currently limited to providing the current and previous article; and (2) exploratory analysis indicated that minimal gains were made from increasing the number of items that make up the user profile.

Figure 3: Two splits were performed upon the raw user logs. The test split temporally divided the dataset into train (13 days), validation (1 day) and test (1 day). Then, for the query split, each user log session was split into trigrams whereby the first two items represented a user profile and model input, while the third represented the groundtruth and model output.

4.2 System architecture and models
All recommendation models we implemented were constrained by the need for compatibility with our current system architecture. This consists of three main components. The first is responsible for generating article embeddings. The second takes user data and article embeddings as input and produces a user embedding. Finally, the outputs of the first and the second modules are combined by the third component, which ranks the recommended articles for a user based on a nearest neighbour search in the latent article space (Figure 4).

Figure 4: Overview of system architecture and how it relates to the development of user models. A given content representation module provides article embeddings—currently LDA vectors—that are fed into both a user representation module and a nearest neighbour search component. The recommended articles for a user denote the K nearest neighbours to the user vector.

The content representation module generates article embeddings. The article embeddings were derived using a Latent Dirichlet Allocation (LDA) model, found performant in related research [9]. LDA is an unsupervised topic modelling approach that represents each document by the probability of a number of topics. The number of topics is defined in advance. Prior work from another BBC team found the optimal number of topics to be 75 for a related dataset of BBC Mundo articles [9].

The user representation module generates user embeddings. The user embeddings are derived from the article embeddings and previous user interactions. Our experiments focused primarily on developing models to derive user embeddings. We explored neural network approaches that combine both content and user data, as well as models based only on user interactions (i.e. the cosine-based collaborative filtering model, Section 4.2.2).

The output of the user representation module is subsequently processed by the recommendation generation module. This component takes as input a user embedding and performs an approximate nearest neighbour search in the article latent space, returning as output the K articles with the smallest distance to the user embedding. The distance is computed using the angular metric from the Python package ANNOY [7], defined as √(2(1 − cos(a, b))) for a user embedding a and an article embedding b.

We evaluated three different models to derive the user embeddings: a) a weighted average of item embeddings (Section 4.2.1), b) a cosine-based collaborative filtering method (Section 4.2.2), and c) a rank-optimised neural network (Section 4.2.3). The sections below describe each approach in detail.

4.2.1 Weighted average of item embeddings. The first user representation model we tested derived the user embeddings from the weighted average of item embeddings, for all items consumed by a given user within a session. The most recently consumed item was weighted by a factor α, while the rest of the items in the user's session were weighted by 1 − α.

4.2.2 Cosine-based collaborative filtering. The second approach was a combination of simple user-item collaborative filtering and a session-based approach. Since users do not need to log in to view the articles, we had no explicit user profile and instead treated each session as a user. To generate the sparse user-item matrix, we took the article IDs for all user sessions within a given time window. The inputs to the model at prediction time were the IDs of the articles viewed in the current user session, and the output was the K highest scored items based on these interactions.
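The session-based scoring described here can be sketched as follows. This is a minimal illustrative example on invented toy data, not the production system (which operates on a large sparse session-item matrix over a time window); the session vectors, article indices, and `recommend` helper are all hypothetical.

```python
# Illustrative sketch of session-based cosine collaborative filtering.
# Toy data: each row is one historical session, each column an article ID.
import numpy as np

sessions = np.array([
    [1, 1, 0, 0],   # this session read articles 0 and 1
    [0, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

def recommend(current_items, k=2):
    """Score articles by the cosine similarity between the current
    session vector and all historical sessions, then rank unseen ones."""
    current = np.zeros(sessions.shape[1])
    current[current_items] = 1.0
    # Cosine similarity of the current session with every past session.
    sims = sessions @ current / (
        np.linalg.norm(sessions, axis=1) * np.linalg.norm(current) + 1e-12)
    # Articles that appear in similar sessions receive higher scores.
    scores = sims @ sessions
    scores[current_items] = -np.inf  # do not re-recommend seen items
    return np.argsort(scores)[::-1][:k]

print(recommend([0]))  # sessions containing article 0 also read article 1
```

Treating each anonymous session as a "user" in this way is what lets the approach work without sign-in data, at the cost of having to regenerate the session-item matrix frequently.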
Our metric for scoring the articles to recommend was the cosine distance between the current user session and all other user sessions.

4.2.3 Rank-optimised neural network. Motivated by the awareness that a simple linear combination of a user's current and previous article representations led to modest performance gains over using solely the current article, we sought to explore non-linear combinations of these vectors. Artificial neural networks are ideally suited to fitting such non-linear functions, and we were encouraged by the results reported by others who have successfully used deep architectures to solve information retrieval problems, e.g. [11, 14, 22, 46].

The challenge we faced was to design a neural network architecture which learned a latent representation of a user profile (current and previous article) to minimise the distance between itself and the latent representation of the most appropriately recommended article (in this case, the subsequently consumed article). One way of reflecting this problem is a pointwise architecture that behaves in a way similar to a regression problem. The model illustrated in Figure 5 takes a user profile (two concatenated 75-length vectors) and an article (a 75-length vector) as input, and passes each through a five-layer perceptron (with 1024, 512, 256, 128 and 75 hidden units, each with rectified linear activation functions). The model then minimises the binary cross-entropy between the target and the inner product of the final layer of the two perceptrons. Batch normalisation placed before the activation functions of the initial layers was found to significantly boost performance while also halving convergence time, facilitating greater experimentation. Training runs including dropout layers produced no improvement in accuracy, so dropout was not included in the final model. Negative articles were randomly over-sampled from the population of positive articles, whereby each training user profile has one positive article and five negative articles.

Once this model had been trained, two further models were derived from it for use in the prediction environment. The first, the user model, took only the user profile as input and returned the final layer of the connected five-layer perceptron. The second, the article model, took only a single article as input and returned the fifth layer of its own five-layer perceptron. The article model was then used to transform all of the raw LDA embeddings into the article model embedding space before being fed into our vector-based nearest neighbour index.

Figure 5: Pointwise neural network architecture for the learning-to-rank problem.

4.3 Evaluation
The aim of our work is not only to increase user engagement with BBC products, but also to inform, educate, and entertain—according to the mission of our organisation. We build recommendation systems taking into account these values and develop evaluation strategies that reflect our mission. This section focuses on offline evaluation metrics and the baselines we use in our experiments. Online evaluation is also a big part of our work but goes beyond the scope of this paper, which focuses on preliminary results.

4.3.1 Metrics. When developing recommendation models offline, we currently monitor and optimise performance with reference to a suite of six quantitative metrics. For all metrics (with the exception of inter-list diversity) a value can be computed for each groundtruth/recommendations list pair. The overall metric is computed as the mean value over all groundtruth/recommendations list pairs within the test period. For each metric, in addition to calculating the overall value, we also estimate the item-normalised value by first taking the mean metric value for every unique groundtruth item. This value provides an insight into the performance of an algorithm independently of the test set bias towards popular groundtruth items. All metrics were calculated upon recommendation lists of length K = 100. We use a relatively large K motivated by the finding that deeper cut-offs in offline experiments provide greater robustness and discriminative power [42], as well as by the fact that we have to exclude a lot of the recommended items a posteriori due to our extensive business rules. A brief description of each metric is provided below (for further details see [12, 21, 34]).

Normalised Discounted Cumulative Gain (NDCG). It measures the gain of a document based on its ranked position in the top-100 list, with lower ranks discounted by a logarithmic factor, and normalises the result by the maximum gain of an ideal top-100 list.

Hitrate. A recall-based metric whereby a recommended list of items is assigned 1 if it contains the groundtruth item, and 0 otherwise.

Intra-list diversity. It estimates the average distance between every pair of items in a recommendations list. For the experiments reported here, the distance between two articles is measured as the ANNOY angular distance (described formally in Section 4.2) between two article embeddings.
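As an illustration, the angular distance from Section 4.2 and the intra-list diversity built on it can be computed directly. This is a hedged sketch on invented toy vectors: the paper obtains the distance via the ANNOY package, whereas here the same formula, √(2(1 − cos(a, b))), is reproduced with NumPy, and the `items` array is hypothetical.

```python
# Sketch of intra-list diversity using the angular metric
# sqrt(2 * (1 - cos(a, b))) defined in Section 4.2.
import itertools
import numpy as np

def angular_distance(a, b):
    """Angular distance between two embeddings, in [0, 2]."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.sqrt(2.0 * (1.0 - np.clip(cos, -1.0, 1.0)))

def intra_list_diversity(embeddings):
    """Mean angular distance over every pair of items in one list."""
    pairs = list(itertools.combinations(range(len(embeddings)), 2))
    return sum(angular_distance(embeddings[i], embeddings[j])
               for i, j in pairs) / len(pairs)

# Toy article embeddings standing in for the 75-dimensional LDA vectors.
rng = np.random.default_rng(0)
items = rng.random((5, 75))
print(round(intra_list_diversity(items), 3))
```

Note that identical embeddings give a distance of 0 and orthogonal ones give √2, so a diversity score near 0 indicates a topically homogeneous list.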
It measures how diverse the recom- mended items across multiple lists are. It compares two lists of recommendations and computes the ratio of unique items 0.15 in these lists over the total number of recommended items between these lists. 0.10 Popularity-based surprisal. It measures how novel or sur- prising the items in a list are. It is formally defined as the log of the inverse popularity of an item (i.e. the probability 0.05 of observing an item in the recommendations) [12]. Recency. : Measures how recent the recommended items 0.00 Random Popularity Recency Content-based Weighted avg. collaborative neural network Cosine-based Rank-optimised are. It calculates the time difference between the recommen- embeddings similarity of article filtering dation request and the age of the recommended items using a Gaussian decay function. The mean is set to 1 and the standard deviation is chosen such that articles of 7 days old or more receive a score less than 0.5. The ideal recommendation engine would optimise all these metrics providing recommendations that are relevant to the Figure 6: Overall and item-normalised NDCG for user, but that are also diverse, recent, and avoid the popular- the four baselines described in Section 4.3.2 and the ity bias. In practice this is usually a trade-off as an algorithm three user models described in Section 4.2. that provides more accurate results is, conversely, less likely to produce diverse ones (and vice versa). In line with our values and objectives, we sometimes choose algorithms that items (the random model), all other baselines show clear per- favour diverse and recent content at the cost of a certain formance improvements for both overall and item-normalised degree of accuracy. NDCG. The popularity and recency recommenders returned higher values than the content-based similarity (CS) model 4.3.2 Baselines. 
We compare our user models to four baseline for NDCG overall; however, if the most popular items are approaches and require that each new user model outperforms factored out by looking at the item-normalised score, the the existing ones. We consider the following recommenders opposite is true. The recency recommender scored particu- as baselines: larly high NDCG overall which confirms our expectation that ∙ Random recommender : Produces 𝐾 random recom- users in a news platform prefer to consume fresh content. mendations. Of the implemented models, the cosine-based collaborative ∙ Recency-based recommender : Ranks item by recency filtering (CF) model (Section 4.2.2) outperformed all base- and returns the top 𝐾 most recent items. lines and other models by a significant margin, this being the ∙ Popularity-based recommender : Ranks items by popu- case both for overall and item-normalised NDCG and hitrate. larity and returns the top 𝐾 most popular items. However, this significant advantage in accuracy comes at a ∙ Content similarity recommender : Finds the 𝐾 nearest cost to inter-list diversity and surprisal, where both other neighbours of an item (e.g., the last item consumed by models returned higher scores. However, this effect was not a user) using the ANNOY angular distance between observed with the intra-list diversity metric, indicating that item embeddings. individual CF lists contained more diverse content while the lists of the other models were more distinct. Our offline experiments report results on the four baselines The weighted average (WA) model (described in Section 4.2.1, defined in above and the three models defined in Section 4.2. with 𝛼 optimised at 0.7) achieved accuracy scores surpassing We use the NDCG metric to comment on the accuracy of the all the baselines in item-normalised NDCG, although as ex- systems and the remaining metrics defined in Section 4.3.1 pected, this was not the case for NDCG overall. 
This suggests to comment on qualitative aspects of the recommendations. that the model consistently projects into relevant regions of the embedding space, and that the nearest neighbours are not 5 RESULTS just most popular candidates. Despite returning marginally The NDCG scores for each recommender system are shown higher NDCG scores, the WA results are salient mainly for in Figure 6. The scores from all metrics are summarised in how similar they are across the board, to the CS baseline Table 1. that lacks information from the previous article. Accuracy scores recorded for the baselines models were in The rank-optimised neural network (NN) model (Section 4.2.3) line with expectations. Compared to a random selection of returned accuracy scores that were a clear step up from both Recommendation systems for news articles at the BBC INRA’19September, 2019Copenhagen, Denmark Table 1: Benchmark results of competing models after generating 100-length lists of recommendations. For the sake of brevity, we report here only overall metrics. Recommender System Hitrate NDCG Intra-list diversity Inter-list diversity Surprisal Recency Random baseline 0.005 0.001 1.192 0.995 0.430 0.010 Recency baseline 0.695 0.163 1.175 0.000 0.000 0.975 Popularity baseline 0.315 0.049 1.170 0.000 0.000 0.495 Content similarity baseline 0.085 0.021 0.641 0.968 0.790 0.018 Weighted average of item embeddings 0.065 0.022 0.641 0.968 0.790 0.018 Cosine-based collaborative filtering 0.741 0.244 1.154 0.584 0.480 0.512 Rank-optimised neural network 0.128 0.040 0.909 0.731 0.781 0.036 other LDA-based models (CF and WA). This was the case network had an impact that was also weaker than expected. for both variants of NDCG and particularly so for hitrate, These unintuitive results raise further questions that we plan indicating that the NN model was optimised more for recall to explore in the future. 
than precision and could possibly benefit from further rerank- Fundamentally, we believe there is scope to optimise the ing procedures. The NN model also distinguished itself from NN approach further so that it will perform more competi- CF and WA models in the diversity and surprisal metrics. tively with CF. To achieve this end we have multiple strate- Results suggest the NN model produces more distinct lists gies. These fall into three categories: model architecture, data, (indicated by higher inter-list diversity) but that those lists and training improvements. are more topically homogenous (indicated by lower intra-list We know that learning to rank in a pointwise framework diversity and surprisal metrics). is not optimal. Both pairwise and listwise approaches should, in theory, achieve better results (see [13, 23]). Pairwise loss 6 DISCUSSION functions together with triplet loss architectures have demon- strated impressive results elsewhere but our own early ex- The first cycle of research in our journey to find the best news periments have indicated they are difficult to train, tending recommender for BBC Mundo is complete. In Section 2 we towards significant underfitting. have outlined the characteristics of the problem we address: a A key reason for this may be the under-representation majority of non-signed in users; a large number of cold-start of negative examples in our training set. Adopting a higher items; architectural constraints; and high quality demands, proportion of negative training examples may address this, not only in terms of accuracy, but also in what concerns but also using more informed negative sampling techniques fairness and impartiality of recommendations. may be required (such as weighted approximate-rank pair- One of the lessons we learned is that—unsurprisingly— wise loss [44]). Even with the current pointwise architecture balancing the different aspects of our problem is hard. 
One model may satisfy one of our requirements whilst failing to fulfil another. A pure collaborative filtering approach is currently our best option to maximise offline scoring accuracy, but that comes at the cost of reduced diversity (and a degree of recency, depending on how regularly we retrain). Moreover, the performance of the CF model was not entirely unexpected, as it has been shown [17, 25, 32] that such simple methods typically outperform neural approaches when only logged user items are used, and that the latter only start to perform well when the input features contain additional contextual metadata. However, as with most collaborative filtering approaches, this model suffers from the item cold-start problem, and so frequent regeneration of the sparse user-item matrix would be required. Therefore, we cannot depend upon a solution derived purely from user interactions.
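As an illustration of the kind of item-item, cosine-based collaborative filtering referred to above, the sketch below scores unseen articles by their similarity to a user's read history. The toy matrix and scoring rule are illustrative assumptions (the paper does not detail its CF formulation), and a production system would operate on the sparse user-item matrix mentioned in the text:

```python
import numpy as np

# Hypothetical toy interaction matrix: rows = users, columns = articles
# (1 = the user read the article).
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1.0]])

# Cosine similarity between item (column) vectors.
norms = np.linalg.norm(X, axis=0)
norms[norms == 0] = 1.0          # guard against items with no interactions
item_sim = (X / norms).T @ (X / norms)

def recommend(user_row, k=2):
    """Score unseen items by summed similarity to the user's read items."""
    scores = item_sim @ user_row
    scores[user_row > 0] = -np.inf  # mask already-read articles
    return np.argsort(-scores)[:k]

# A user who read articles 0 and 1 is recommended articles 2 and 3.
print(recommend(np.array([1.0, 1.0, 0.0, 0.0])).tolist())  # [2, 3]
```

Because scores are built purely from co-read patterns, a brand-new article (an all-zero column) can never be recommended, which is the item cold-start problem noted above.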
To that end, we also know from our experiments that the contribution of previous articles appears to have a lower impact than expected. Despite the performance of the WA model consistently exceeding the CS baseline model (across validation and test), this gain was always marginal. Furthermore, our attempts at combining the current and previous article vectors in a non-linear fashion using a neural network had an impact that was also weaker than expected. These unintuitive results raise further questions that we plan to explore in the future.

Fundamentally, we believe there is scope to optimise the NN approach further so that it performs more competitively with CF. To achieve this end we have multiple strategies, which fall into three categories: model architecture, data, and training improvements.

We know that learning to rank in a pointwise framework is not optimal. Both pairwise and listwise approaches should, in theory, achieve better results (see [13, 23]). Pairwise loss functions together with triplet loss architectures have demonstrated impressive results elsewhere, but our own early experiments indicated they are difficult to train, tending towards significant underfitting.

A key reason for this may be the under-representation of negative examples in our training set. Adopting a higher proportion of negative training examples may address this, but more informed negative sampling techniques may also be required (such as weighted approximate-rank pairwise loss [44]). Even with the current pointwise architecture, there is a 5% difference in train/test performance (item-normalised NDCG) that should reduce significantly with appropriate regularisation.

Changes to our training process may also lead to significant gains. In addition to increasing compute resources for the exploration of the hyperparameter space, reducing the training/testing window from the order of days or weeks to the order of hours may provide greater scope for experimentation (as has been reported elsewhere [16]). While a smaller training window does necessitate more regular training of deployed models, it also means more manageable datasets where hyperparameter optimisation is more practical.

A further change that may prove fruitful is to expand the richness of the input to the user profile model. This may include expanding the size of the user journeys in the training set beyond 3 (a constraint which, incidentally, did not apply to the CF model at training), while also introducing contextual information about the user.
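The pairwise alternative discussed above can be sketched with a BPR-style loss over sampled negatives. The embeddings, dimensionality and negative-sample count below are toy assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: a user-state vector, the article actually read next
# (positive), and a handful of sampled unread articles (negatives).
user = rng.normal(size=16)
pos_item = rng.normal(size=16)
neg_items = rng.normal(size=(5, 16))

def bpr_loss(user, pos, negs):
    """Pairwise (BPR-style) loss: push the positive article's score above
    each sampled negative's, rather than fitting absolute scores pointwise."""
    pos_score = user @ pos
    neg_scores = negs @ user
    # -log sigmoid(s_pos - s_neg), averaged over the sampled negatives
    return float(np.mean(np.log1p(np.exp(-(pos_score - neg_scores)))))

print(bpr_loss(user, pos_item, neg_items))
```

Weighted approximate-rank pairwise (WARP) loss [44] goes further, sampling negatives until a margin violation is found and weighting the update by the estimated rank of the positive item.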
Finally, another direction to be explored in the future regards content representation. In experiments not reported in the current work, raw article text has been encoded through an LDA model. However, our system architecture affords enough flexibility to replace the current content model with alternative article embeddings and test different approaches. In particular, we are interested in taking sub-word information into consideration [8], enriching text with semantics [10, 24], and augmenting text representations with multimedia [35, 46].

Our results demonstrate the difficulty of acquiring all the desired characteristics of an ideal news recommender. Ultimately, we expect ensemble approaches may represent the best solution. Here we may take the cold-start benefits of the content-based neural approach and combine them with the less diverse but more accurate list of items generated by a collaborative filtering model.
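Such an ensemble could, for instance, blend the two ranked lists by weighted reciprocal rank so that cold-start items from the content-based list can still surface. The weighting scheme below is a hypothetical illustration, not a deployed BBC method:

```python
def blend(cf_ranking, content_ranking, cf_weight=0.7):
    """Score each item by weighted reciprocal rank across the two lists,
    then return items sorted by blended score (highest first)."""
    scores = {}
    for weight, ranking in ((cf_weight, cf_ranking),
                            (1.0 - cf_weight, content_ranking)):
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + weight / rank
    return sorted(scores, key=scores.get, reverse=True)

cf = ["a", "b", "c"]       # accurate but less diverse CF list
content = ["x", "a", "y"]  # can rank brand-new (cold-start) items
print(blend(cf, content))  # ['a', 'b', 'x', 'c', 'y']
```

Item "x" never appears in the CF list, yet still surfaces ahead of low-ranked CF items, which is exactly the cold-start behaviour the ensemble is meant to recover.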
7 CONCLUSION

In this paper we evaluated three approaches to providing news recommendations for the BBC Mundo service. The systems we have built are compatible with BBC serving infrastructure, a use case which includes millions of daily users and new content in the order of several thousand articles per week. In spite of our experiment being only the initial step of a journey that promises to be much longer, our models outperformed random, popularity-based, recency-based and content-similarity baselines. It is worth noting, though, that these results do not reflect current online performance. More work is needed to ensure that these models, when deployed, meet the quality and editorial standards of the BBC. Future challenges concern not only achieving higher accuracy, but also conforming to the principles of algorithmic fairness and impartiality. We encourage the community to collaborate in helping us create the way forward towards fair and engaging recommendations and applications with responsible machine learning.

REFERENCES
[1] Trapit Bansal, David Belanger, and Andrew McCallum. 2016. Ask the GRU: Multi-task Learning for Deep Text Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19, 2016. 107–114.
[2] BBC. 2019. The BBC's services in the UK - About the BBC. https://www.bbc.com/aboutthebbc/whatwedo/publicservices Consulted on 21 June 2019.
[3] BBC. 2019. Editorial Guidelines. https://www.bbc.co.uk/editorialguidelines Consulted on 21 June 2019.
[4] BBC. 2019. Global news services - About the BBC. https://www.bbc.com/aboutthebbc/whatwedo/worldservice Consulted on 21 June 2019.
[5] BBC. 2019. Mission, values and public purposes - About the BBC. https://www.bbc.com/aboutthebbc/governance/mission Consulted on 21 June 2019.
[6] BBC. 2019. News - Mundo. https://www.bbc.com/mundo Consulted on 21 June 2019.
[7] E. Bernhardsson. 2017. ANNOY: Approximate nearest neighbors in C++/Python optimized for memory usage and loading/saving to disk. GitHub. https://github.com/spotify/annoy (2017).
[8] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR abs/1607.04606 (2016). arXiv:1607.04606 http://arxiv.org/abs/1607.04606
[9] Clara Higuera Cabañes, Michel Schammel, Shirley Ka Kei Yu, and Ben Fields. 2019. Human-centric Evaluation of Similarity Spaces of News Articles. In 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (NewsIR'19 Third International Workshop on Recent Trends in News Information Retrieval). 51–56.
[10] Michel Capelle, Flavius Frasincar, Marnix Moerland, and Frederik Hogenboom. 2012. Semantics-based news recommendation. In 2nd International Conference on Web Intelligence, Mining and Semantics, WIMS '12, Craiova, Romania, June 6-8, 2012. 27:1–27:9.
[11] Hugo Caselles-Dupré, Florian Lesaint, and Jimena Royo-Letelier. 2018. Word2Vec Applied to Recommendation: Hyperparameters Matter. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). ACM, New York, NY, USA, 352–356. https://doi.org/10.1145/3240323.3240377
[12] P. Castells, S. Vargas, and J. Wang. 2011. Novelty and diversity metrics for recommender systems: choice, discovery and relevance. In International Workshop on Diversity in Document Retrieval (DDR 2011) at the 33rd European Conference on Information Retrieval (ECIR 2011).
[13] Ting Chen, Yizhou Sun, Yue Shi, and Liangjie Hong. 2017. On Sampling Strategies for Neural Network-based Collaborative Filtering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13-17, 2017. 767–776.
[14] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16). ACM, 191–198. https://doi.org/10.1145/2959100.2959190
[15] Abhinandan Das, Mayur Datar, Ashutosh Garg, and Shyamsundar Rajaram. 2007. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007. 271–280.
[16] Gabriel de Souza Pereira Moreira. 2018. CHAMELEON: a deep learning meta-architecture for news recommender systems. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018. 578–583.
[17] Gabriel de Souza Pereira Moreira, Dietmar Jannach, and Adilson Marques da Cunha. 2019. Contextual Hybrid Session-based News Recommendation with Recurrent Neural Networks. CoRR abs/1904.10367 (2019).
[18] Elena Viorica Epure, Benjamin Kille, Jon Espen Ingvaldsen, Rébecca Deneckère, Camille Salinesi, and Sahin Albayrak. 2017. Recommending Personalized News in Short User Sessions. In Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys 2017, Como, Italy, August 27-31, 2017. 121–129.
[19] Sahin Cem Geyik, Stuart Ambler, and Krishnaram Kenthapadi. 2019. Fairness-Aware Ranking in Search & Recommendation Systems with Application to LinkedIn Talent Search. CoRR abs/1905.01989 (2019).
[20] Carlos A. Gomez-Uribe and Neil Hunt. 2016. The Netflix Recommender System: Algorithms, Business Value, and Innovation. ACM Trans. Management Inf. Syst. 6, 4 (2016), 13:1–13:19.
[21] Asela Gunawardana and Guy Shani. 2009. A Survey of Accuracy Evaluation Metrics of Recommendation Tasks. Journal of Machine Learning Research 10 (2009), 2935–2962.
[22] Malay Haldar, Mustafa Abdool, Prashant Ramanathan, Tao Xu, Shulin Yang, Huizhong Duan, Qing Zhang, Nick Barrow-Williams, Bradley C. Turnbull, Brendan M. Collins, and Thomas Legrand. 2018. Applying Deep Learning To Airbnb Search. CoRR abs/1810.09591 (2018).
[23] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based Recommendations with Recurrent Neural Networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[24] Wouter IJntema, Frank Goossen, Flavius Frasincar, and Frederik Hogenboom. 2010. Ontology-based news recommendation. In EDBT/ICDT Workshops (ACM International Conference Proceeding Series). ACM.
[25] Dietmar Jannach and Malte Ludewig. 2017. When Recurrent Neural Networks meet the Neighborhood for Session-Based Recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys 2017, Como, Italy, August 27-31, 2017. 306–310.
[26] Tomonari Kamba, Krishna Bharat, and Michael C. Albers. 1996. The Krakatoa Chronicle: An Interactive Personalized Newspaper on the Web. World Wide Web Journal 1, 1 (1996).
[27] Mozhgan Karimi, Dietmar Jannach, and Michael Jugovac. 2018. News recommender systems - Survey and roads ahead. Inf. Process. Manage. 54, 6 (2018), 1203–1227.
[28] Romain Lerallut, Diane Gasselin, and Nicolas Le Roux. 2015. Large-Scale Real-Time Product Recommendation at Criteo. In Proceedings of the 9th ACM Conference on Recommender Systems, RecSys 2015, Vienna, Austria, September 16-20, 2015. 232.
[29] Lei Li, Dingding Wang, Tao Li, Daniel Knox, and Balaji Padmanabhan. 2011. SCENE: a scalable two-stage personalized news recommendation system. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011. 125–134.
[30] Lei Li, Li Zheng, Fan Yang, and Tao Li. 2014. Modeling and broadening temporal user interest in personalized news recommendation. Expert Syst. Appl. 41, 7 (2014), 3168–3177.
[31] Greg Linden. 2011. Eli Pariser is wrong. http://glinden.blogspot.com/2011/05/eli-pariser-is-wrong.html Consulted on 21 June 2019.
[32] Malte Ludewig and Dietmar Jannach. 2018. Evaluation of session-based recommendation algorithms. User Model. User-Adapt. Interact. 28, 4-5 (2018), 331–390.
[33] Rishabh Mehrotra, James McInerney, Hugues Bouchard, Mounia Lalmas, and Fernando Diaz. 2018. Towards a Fair Marketplace: Counterfactual Evaluation of the trade-off between Relevance, Fairness & Satisfaction in Recommendation Systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22-26, 2018. 2243–2251.
[34] Tomoko Murakami, Koichiro Mori, and Ryohei Orihara. 2007. Metrics for Evaluating the Serendipity of Recommendation Lists. In JSAI (Lecture Notes in Computer Science), Vol. 4914. Springer, 40–46.
[35] Thomas Nedelec, Elena Smirnova, and Flavian Vasile. 2017. Specializing Joint Representations for the task of Product Recommendation. CoRR abs/1706.07625 (2017). arXiv:1706.07625 http://arxiv.org/abs/1706.07625
[36] Nicholas Negroponte. 1996. Being Digital. Random House Inc., New York, NY, USA.
[37] Tien T. Nguyen, Pik-Mai Hui, F. Maxwell Harper, Loren G. Terveen, and Joseph A. Konstan. 2014. Exploring the filter bubble: the effect of using recommender systems on content diversity. In 23rd International World Wide Web Conference, WWW '14, Seoul, Republic of Korea, April 7-11, 2014. 677–686.
[38] Özlem Özgöbek, Jon Atle Gulla, and Riza Cenk Erdur. 2014. A Survey on Challenges and Methods in News Recommendation. In WEBIST 2014 - Proceedings of the 10th International Conference on Web Information Systems and Technologies, Volume 2, Barcelona, Spain, 3-5 April, 2014. 278–285.
[39] Michael J. Pazzani and Daniel Billsus. 2007. Content-Based Recommendation Systems. In The Adaptive Web (Lecture Notes in Computer Science), Vol. 4321. Springer, 325–341.
[40] Massimo Quadrana, Paolo Cremonesi, and Dietmar Jannach. 2018. Sequence-Aware Recommender Systems. ACM Comput. Surv. 51, 4 (2018), 66:1–66:36.
[41] Brent Smith and Greg Linden. 2017. Two Decades of Recommender Systems at Amazon.com. IEEE Internet Computing 21, 3 (2017), 12–18.
[42] Daniel Valcarce, Alejandro Bellogín, Javier Parapar, and Pablo Castells. 2018. On the robustness and discriminative power of information retrieval metrics for top-N recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018. 260–268.
[43] Shoujin Wang, Longbing Cao, and Yan Wang. 2019. A Survey on Session-based Recommender Systems. CoRR abs/1902.04864 (2019).
[44] Jason Weston, Hector Yee, and Ron J. Weiss. 2013. Learning to rank recommendations with the k-order statistic loss. In Seventh ACM Conference on Recommender Systems, RecSys '13, Hong Kong, China, October 12-16, 2013. 245–248.
[45] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep Learning Based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. 52, 1 (2019), 5:1–5:38.
[46] Lu Zheng, Zhao Tan, Kun Han, and Ren Mao. 2018. Collaborative Multi-modal deep learning for the personalized product retrieval in Facebook Marketplace. CoRR abs/1805.12312 (2018). arXiv:1805.12312 http://arxiv.org/abs/1805.12312