Recommendation systems for news articles at the BBC

Maria Panteli, British Broadcasting Corporation, London, United Kingdom, maria.panteli@bbc.co.uk
Alessandro Piscopo, British Broadcasting Corporation, London, United Kingdom, alessandro.piscopo@bbc.co.uk
Adam Harland, British Broadcasting Corporation, Glasgow, United Kingdom, adam.harland@bbc.co.uk
Jonathan Tutcher, British Broadcasting Corporation, Salford, United Kingdom, jon.tutcher@bbc.co.uk
Felix Mercer Moss, British Broadcasting Corporation, Bristol, United Kingdom, felix.mercermoss@bbc.co.uk

ABSTRACT
Personalised user experiences have improved engagement in many industry applications. When it comes to news recommendations, and especially for a public service broadcaster like the BBC, recommendation systems need to be in line with the editorial policy and the business values of the organisation. In this paper we describe how we develop recommendation systems for news articles at the BBC. We present three models and describe how they compare with baseline approaches such as random and popularity. We also discuss the metrics we use, the unique challenges we face, and the considerations needed to ensure the recommendations we generate uphold the trust and quality standards of the BBC.

CCS CONCEPTS
• Information systems → Recommender systems; • Computing methodologies → Machine learning approaches.

KEYWORDS
recommendations, news, neural networks

© Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). INRA'19, September 2019, Copenhagen, Denmark.

1 INTRODUCTION
The BBC is one of the world's leading public service broadcasters. Its services—television, radio, digital—reach more than 80% of the UK's adult population every week [2] and 279 million people worldwide (World Service [4]). This large audience has access to a vast and diverse amount of content, including video, audio, and text, spanning topics such as news, sport, and entertainment. In order to enable its audience to enjoy the best possible experience, it is crucial for the BBC to adopt strategies to guide users to the most relevant and engaging content.

The main approach until recently has been to manually curate content following the guidelines formally documented in an editorial tome [3]. These have been developed to ensure quality across all products, uphold the BBC values, and build audience trust. Although manual curation is an excellent way to surface quality content, it is not tailored to the user and is hard to scale—the greater the amount of content, the harder it is for curators to find relevant items for each type of content. In order to deliver an experience which is relevant, timely, and contextually useful to every single user, the BBC combines editorial curation with personalised, automated approaches. Data-driven recommendations are a key part of these approaches: they are an important tool to enhance users' ability to explore and discover content they would not be aware of otherwise (see e.g. [26, 31, 36, 37]) and have been successfully tested and deployed by several media providers (e.g. Netflix [20]) and e-commerce companies (e.g. Amazon [41]).

According to the mission of the BBC, the organisation must "act in the public interest, serving all audiences through the provision of impartial, high-quality and distinctive output and services which inform, educate, and entertain" [5]. Following this mission, the BBC must be a provider of accurate and unbiased information, and the content it produces and distributes must aim to engage diverse audiences. Amongst the diverse types of content produced by the BBC, news is the product that likely contributes most to its reputation as a trustworthy and authoritative media outlet. Besides the UK service BBC News¹, the BBC produces, broadcasts, and delivers online news in more than 40 languages. Hence, it is of utmost importance for automated recommendation approaches implemented on any BBC news service not only to be as accurate as possible, but also to conform with the principles outlined above. This paper reports early results of the experiments we carried out to that end. In particular, it describes the development of recommendation systems for BBC news articles and the challenges in building data-driven applications for a public service broadcaster. The case study adopted in the experiment was the application of recommendation systems to BBC Mundo², a Spanish-language news website and part of the BBC World Service [6].

The structure of this paper is as follows. Section 2 defines the problem addressed in the current work, and Section 3 discusses prior related work. Section 4 describes the methodology, including the data, models, and evaluation approaches. Finally, results are presented and discussed in Sections 5 and 6.

¹ Please note that 'News' capitalised refers to the UK channel, whereas lowercase refers to the type of content.
² https://www.bbc.com/mundo

2 PROBLEM DEFINITION
Our goal is to build recommendation systems for news articles. Recommendations in the news domain have been characterised distinctly in the literature [38] due to the short life-cycle of items and the vast amounts of anonymous users. Considering the reputation of the BBC and the responsibility it has to deliver trustworthy and authoritative news to its audience, we highlight the following challenges in achieving our goal.

Non-signed-in users. The majority of users on any BBC news platform are not signed in. This means that we have limited information about the user and the items they have previously interacted with. We typically work with session-based information, i.e. user-item interactions that occurred within 30 minutes of each other. This means that our recommendation models need to achieve high accuracy for cold-start user scenarios or predict the user's taste after as little as one item interaction.

Many cold-start items. The publication cycle on any news platform is rapid and unrelenting. BBC News is no different. Fresh items are regularly uploaded, and any recommendation system we implement should be able to serve an item within minutes of publication. Additionally, articles may become obsolete or gain sudden relevance following an event—consider for example the case of breaking news. Recommendation approaches must thus be able to take these characteristics into account, not being based solely on a user's history, but considering the content and context of the articles they read.

Architecture constraints. Because of the popularity of BBC news, multiple stakeholders (internal and external) rely on and set the requirements for the news platform. Any changes to the system architecture that could affect other stakeholders need to be thoroughly investigated and justified. Our recommendation models often have to adapt to the existing architecture, which means that our system architecture choices are somewhat constrained.

Mistakes are not tolerated. BBC news, and the Mundo platform in particular, are consumed by millions of users. For the majority of these users, this is the only BBC platform they visit. News is also a very sensitive domain, as it is not just entertainment but also the way in which people inform and educate themselves. Mistakes in data-driven recommendations could lead to misinformation or compromise our quality standards, which would greatly impact our audience. The bar for the performance of the system is set very high to limit the risk of unexpected behaviour.

Fairness and impartiality. The BBC has built its trust after many years of thoughtful manual curation and expert editorial guidance. It commits to delivering content in a fair, impartial and honest way, and data-driven recommendations should live up to, and advance, these standards. Algorithmic fairness and impartiality in recommendation systems are increasingly discussed in the literature [19, 33], but with no standardised solutions yet. We consider evaluation metrics that help us track the risk and bias induced by our recommendation systems.

The above challenges drive the decisions we make around which models and evaluation strategies to implement. For example, we place significant focus upon offline evaluation to avoid unexpected behaviour; we use a variety of metrics to track the quality of recommendations; we consider recency-based systems an essential baseline for news recommendations; and we adopt content-based approaches to tackle the cold-start scenarios. More details about our choices and how they relate to these challenges are provided in Sections 3 and 4.

3 RELATED WORK
Recommendation systems in the news domain have been investigated for more than a decade [27, 38], following various approaches. Collaborative filtering [15] relies on past user behaviour to formulate recommendations based on commonalities across user preferences. Content-based approaches rely on item properties (or user profiles constructed from the properties of the items they consume) to recommend related items [10, 29, 39]. Rather than considering the long user history, session-based approaches focus on user-item interactions that occur within a certain time frame or context [40, 43]. Finally, hybrid systems may put together aspects from these approaches and use a broader range of features, in order to achieve a more nuanced representation of user activity [18, 30]. Content-based, session-based, and hybrid approaches appear to be the most suitable to address some of the problems we outlined earlier, namely the large number of anonymous users and cold-start items (Section 2).

Beyond the news domain, recommendation systems have been investigated in a variety of industrial applications. Approaches vary between traditional content-based and collaborative filtering while, more recently, the advent of deep neural networks has facilitated the development of hybrid strategies [45]. These have been applied to the problem of accommodation search at Airbnb [22], product advertisement at Criteo [28], video recommendations at YouTube [14], and movie recommendations at Netflix [20]. Industry approaches using neural networks are of particular interest to us due to the scalability of the systems and the domain-agnostic capability of neural networks.

Considering the system architecture, some neural network-based approaches for recommending textual content are end-to-end (for example [1]): that is, the model takes as input the text of items related to a user, extracts features for the items and the user, and ultimately outputs a recommendation. Other approaches rely on separate modules for extracting features for the content and the user and for generating recommendations [16]. Here, we take the latter approach for a number of reasons. First, an end-to-end approach was not compatible with the current architecture of the system, over which we have limited control (Section 2).

Figure 1: Sequence length distribution in our dataset.
The graph includes 99% of sequence lengths, in order to leave out the long tail and improve readability.

Figure 2: User visits distribution in our dataset. The graph includes 99% of the number of visits, in order to leave out the long tail and improve readability.

Second, separating content representation from the generation of recommendations enables further experimentation and increases the ability of the system to retrieve new items [16].

4 METHODOLOGY

4.1 Data
The BBC collects detailed user interaction data for its digital services, providing information about users and the circumstances of their visits to BBC websites. For the purpose of this analysis, we used 15 days' worth of data from BBC Mundo, spanning from the 6th to the 20th of April 2019. We define a sequence, or visit, as any succession of user interactions (i.e. page views) within 30 minutes of each other. Page views were aggregated into sequences according to this definition. In this dataset, the average number of user interactions we collected per day was in the order of millions. As shown in Figure 1, most recorded sessions included only a single article read (i.e., a sequence of length 1), which is a common observation in news delivery platforms [16]. Users often visited BBC Mundo only once over the time-span considered (Figure 2).

Like all statistical learning models, to robustly evaluate recommender system performance the data is required to be appropriately split. In traditional machine learning problems, where the raw data takes the form of input-output pairs, this split is relatively straightforward. Assuming there is enough data, a common split might be 80%, 10%, 10% into training, validation and test sets respectively. For recommender systems, the temporal nature of the data makes the situation a little different. While we still need to perform a train/validation/test split, referred to from now on as the test split, we also need to perform an additional split, henceforth referred to as the query split. The query split describes the process of transforming a temporal sequence of consumption logs into a single or group of feature-target pairs suitable for ingestion into algorithmic learning models.

For the test split, our initial thought was to discard the temporal dimension and sample user sessions according to pre-determined train/test/validation fractions. While the simplicity of this approach is attractive, we decided that maximising the similarity between our offline testing framework and our online production environment was more important. The temporal approach we implemented is displayed in Figure 3, where we choose a thirteen-day period for training, the next day for validation, and the following day for test. As we have the capacity to train and serve fresh consumer-facing models every day, we aim for this offline approach to reflect our production environment sufficiently for inferences in the former to provide valuable information about the latter.

For the query split, we take a user session from a given period defined earlier in the current section and divide it into the maximum number of trigrams while preserving temporal order. Then, for each trigram, the first two elements (article vectors) represent the user profile, while the third and final element is the groundtruth item used as a target for our models. The length of the user profile was chosen based upon two factors: (1) our client-side serving infrastructure is currently limited to providing the current and previous article; and (2) exploratory analysis indicated that minimal gains were made from increasing the number of items that make up the user profile.

Figure 3: Two splits were performed upon the raw user logs. The test split temporally divided the dataset into train (13 days), validation (1 day) and test (1 day). Then, for the query split, each user log session was split into trigrams whereby the first two items represented a user profile and model input, while the third represented the groundtruth and model output.

4.2 System architecture and models
All recommendation models we implemented were constrained by the need for compatibility with our current system architecture. This consists of three main components. The first is responsible for generating article embeddings. The second takes user data and article embeddings as input and produces a user embedding. Finally, the outputs of the first and the second modules are combined by the third component, which ranks the recommended articles for a user based on a nearest neighbour search in the latent article space (Figure 4).

Figure 4: Overview of system architecture and how it relates to the development of user models. A given content representation module provides article embeddings—currently LDA vectors—that are fed into both a user representation module and a nearest neighbour search component. The recommended articles for a user denote the K nearest neighbours to the user vector.

The content representation module generates article embeddings. The article embeddings were derived using a Latent Dirichlet Allocation (LDA) model, found performant in related research [9]. LDA is an unsupervised topic modelling approach that represents each document by the probability of a number of topics. The number of topics is defined in advance. Prior work from another BBC team found the optimal number of topics to be 75 for a related dataset of BBC Mundo articles [9].

The user representation module generates user embeddings. The user embeddings are derived from the article embeddings and previous user interactions. Our experiments focused primarily on developing models to derive user embeddings. We explored neural network approaches that combine both content and user data, as well as models based only on user interactions (i.e. the cosine-based collaborative filtering model, Section 4.2.2).

The output of the user representation module is subsequently processed by the recommendation generation module. This component takes as input a user embedding and performs an approximate nearest neighbour search in the article latent space, returning as output the K articles with the smallest distance to the user embedding. The distance is computed using the angular metric from the Python package ANNOY [7], defined as √(2(1 − cos(a, b))) for a user embedding a and an article embedding b.

We evaluated three different models to derive the user embeddings: a) a weighted average of item embeddings (Section 4.2.1), b) a cosine-based collaborative filtering method (Section 4.2.2), and c) a rank-optimised neural network (Section 4.2.3). The sections below describe each approach in detail.

4.2.1 Weighted average of item embeddings. The first user representation model we tested derived the user embeddings from the weighted average of item embeddings, for all items consumed by a given user within a session. The most recently consumed item was weighted by a factor α, while the rest of the items in the user's session were weighted by 1 − α.

4.2.2 Cosine-based collaborative filtering. The second approach was a combination of simple user-item collaborative filtering and a session-based approach. Since users do not need to log in to view the articles, we had no explicit user profile and instead treated each session as a user. To generate the sparse user-item matrix, we took the article IDs for all user sessions within a given time window. The inputs to the model at prediction time were the IDs of the articles viewed in the current user session, and the output was the K highest scored items based on these interactions.
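The session-based scoring described here can be sketched as follows. This is a minimal illustrative example on invented toy data, not the production system (which operates on a large sparse session-item matrix over a time window); the session vectors, article indices, and `recommend` helper are all hypothetical.

```python
# Illustrative sketch of session-based cosine collaborative filtering.
# Toy data: each row is one historical session, each column an article ID.
import numpy as np

sessions = np.array([
    [1, 1, 0, 0],   # this session read articles 0 and 1
    [0, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

def recommend(current_items, k=2):
    """Score articles by the cosine similarity between the current
    session vector and all historical sessions, then rank unseen ones."""
    current = np.zeros(sessions.shape[1])
    current[current_items] = 1.0
    # Cosine similarity of the current session with every past session.
    sims = sessions @ current / (
        np.linalg.norm(sessions, axis=1) * np.linalg.norm(current) + 1e-12)
    # Articles that appear in similar sessions receive higher scores.
    scores = sims @ sessions
    scores[current_items] = -np.inf  # do not re-recommend seen items
    return np.argsort(scores)[::-1][:k]

print(recommend([0]))  # sessions containing article 0 also read article 1
```

Treating each anonymous session as a "user" in this way is what lets the approach work without sign-in data, at the cost of having to regenerate the session-item matrix frequently.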
Our metric for scoring the articles to recommend was the cosine distance between the current user session and all other user sessions.

4.2.3 Rank-optimised neural network. Motivated by the awareness that a simple linear combination of a user's current and previous article representations led to modest performance gains over using solely the current article, we sought to explore non-linear combinations of these vectors. Artificial neural networks are ideally suited to fitting such non-linear functions, and we were encouraged by the results reported by others who have successfully used deep architectures to solve information retrieval problems, e.g. [11, 14, 22, 46].

The challenge we faced was to design a neural network architecture which learned a latent representation of a user profile (current and previous article) to minimise the distance between itself and the latent representation of the most appropriately recommended article (in this case, the subsequently consumed article). One way of reflecting this problem is a pointwise architecture that behaves in a way similar to a regression problem. The model illustrated in Figure 5 takes a user profile (two concatenated 75-length vectors) and an article (a 75-length vector) as input, and passes each through a five-layer perceptron (with 1024, 512, 256, 128 and 75 hidden units, each with rectified linear activation functions). The model then minimises the binary cross-entropy between the target and the inner product of the final layer of the two perceptrons. Batch normalisation placed before the activation functions of the initial layers was found to significantly boost performance while also halving convergence time, facilitating greater experimentation. Training runs including dropout layers produced no improvement in accuracy, so dropout was not included in the final model. Negative articles were randomly over-sampled from the population of positive articles, whereby each training user profile has one positive article and five negative articles.

Once this model had been trained, two further models were derived from it for use in the prediction environment. The first, the user model, took only the user profile as input and returned the final layer of the connected five-layer perceptron. The second, the article model, took only a single article as input and returned the fifth layer of its own five-layer perceptron. The article model was then used to transform all of the raw LDA embeddings into the article model embedding space before being fed into our vector-based nearest neighbour index.

Figure 5: Pointwise neural network architecture for the learning-to-rank problem.

4.3 Evaluation
The aim of our work is not only to increase user engagement with BBC products, but also to inform, educate, and entertain—according to the mission of our organisation. We build recommendation systems taking into account these values and develop evaluation strategies that reflect our mission. This section focuses on offline evaluation metrics and the baselines we use in our experiments. Online evaluation is also a big part of our work but goes beyond the scope of this paper, which focuses on preliminary results.

4.3.1 Metrics. When developing recommendation models offline, we currently monitor and optimise performance with reference to a suite of six quantitative metrics. For all metrics (with the exception of inter-list diversity) a value can be computed for each groundtruth/recommendations list pair. The overall metric is computed as the mean value over all groundtruth/recommendations list pairs within the test period. For each metric, in addition to calculating the overall value, we also estimate the item-normalised value by first taking the mean metric value for every unique groundtruth item. This value provides an insight into the performance of an algorithm independently of the test set bias towards popular groundtruth items. All metrics were calculated upon recommendation lists of length K = 100. We use a relatively large K motivated by the finding that deeper cut-offs in offline experiments provide greater robustness and discriminative power [42], as well as by the fact that we have to exclude a lot of the recommended items a posteriori due to our extensive business rules. A brief description of each metric is provided below (for further details see [12, 21, 34]).

Normalised Discounted Cumulative Gain (NDCG). It measures the gain of a document based on its ranked position in the top-100 list, with lower ranks discounted by a logarithmic factor, and normalises the result by the maximum gain of an ideal top-100 list.

Hitrate. A recall-based metric whereby a recommended list of items is assigned 1 if it contains the groundtruth item, and 0 otherwise.

Intra-list diversity. It estimates the average distance between every pair of items in a recommendations list. For the experiments reported here, the distance between two articles is measured as the ANNOY angular distance (described formally in Section 4.2) between two article embeddings.
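As an illustration, the angular distance from Section 4.2 and the intra-list diversity built on it can be computed directly. This is a hedged sketch on invented toy vectors: the paper obtains the distance via the ANNOY package, whereas here the same formula, √(2(1 − cos(a, b))), is reproduced with NumPy, and the `items` array is hypothetical.

```python
# Sketch of intra-list diversity using the angular metric
# sqrt(2 * (1 - cos(a, b))) defined in Section 4.2.
import itertools
import numpy as np

def angular_distance(a, b):
    """Angular distance between two embeddings, in [0, 2]."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.sqrt(2.0 * (1.0 - np.clip(cos, -1.0, 1.0)))

def intra_list_diversity(embeddings):
    """Mean angular distance over every pair of items in one list."""
    pairs = list(itertools.combinations(range(len(embeddings)), 2))
    return sum(angular_distance(embeddings[i], embeddings[j])
               for i, j in pairs) / len(pairs)

# Toy article embeddings standing in for the 75-dimensional LDA vectors.
rng = np.random.default_rng(0)
items = rng.random((5, 75))
print(round(intra_list_diversity(items), 3))
```

Note that identical embeddings give a distance of 0 and orthogonal ones give √2, so a diversity score near 0 indicates a topically homogeneous list.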
It measures how diverse the recom- mended items across multiple lists are. It compares two lists of recommendations and computes the ratio of unique items 0.15 in these lists over the total number of recommended items between these lists. 0.10 Popularity-based surprisal. It measures how novel or sur- prising the items in a list are. It is formally defined as the log of the inverse popularity of an item (i.e. the probability 0.05 of observing an item in the recommendations) [12]. Recency. : Measures how recent the recommended items 0.00 Random Popularity Recency Content-based Weighted avg. collaborative neural network Cosine-based Rank-optimised are. It calculates the time difference between the recommen- embeddings similarity of article filtering dation request and the age of the recommended items using a Gaussian decay function. The mean is set to 1 and the standard deviation is chosen such that articles of 7 days old or more receive a score less than 0.5. The ideal recommendation engine would optimise all these metrics providing recommendations that are relevant to the Figure 6: Overall and item-normalised NDCG for user, but that are also diverse, recent, and avoid the popular- the four baselines described in Section 4.3.2 and the ity bias. In practice this is usually a trade-off as an algorithm three user models described in Section 4.2. that provides more accurate results is, conversely, less likely to produce diverse ones (and vice versa). In line with our values and objectives, we sometimes choose algorithms that items (the random model), all other baselines show clear per- favour diverse and recent content at the cost of a certain formance improvements for both overall and item-normalised degree of accuracy. NDCG. The popularity and recency recommenders returned higher values than the content-based similarity (CS) model 4.3.2 Baselines. 
We compare our user models to four baseline for NDCG overall; however, if the most popular items are approaches and require that each new user model outperforms factored out by looking at the item-normalised score, the the existing ones. We consider the following recommenders opposite is true. The recency recommender scored particu- as baselines: larly high NDCG overall which confirms our expectation that ∙ Random recommender : Produces 𝐾 random recom- users in a news platform prefer to consume fresh content. mendations. Of the implemented models, the cosine-based collaborative ∙ Recency-based recommender : Ranks item by recency filtering (CF) model (Section 4.2.2) outperformed all base- and returns the top 𝐾 most recent items. lines and other models by a significant margin, this being the ∙ Popularity-based recommender : Ranks items by popu- case both for overall and item-normalised NDCG and hitrate. larity and returns the top 𝐾 most popular items. However, this significant advantage in accuracy comes at a ∙ Content similarity recommender : Finds the 𝐾 nearest cost to inter-list diversity and surprisal, where both other neighbours of an item (e.g., the last item consumed by models returned higher scores. However, this effect was not a user) using the ANNOY angular distance between observed with the intra-list diversity metric, indicating that item embeddings. individual CF lists contained more diverse content while the lists of the other models were more distinct. Our offline experiments report results on the four baselines The weighted average (WA) model (described in Section 4.2.1, defined in above and the three models defined in Section 4.2. with 𝛼 optimised at 0.7) achieved accuracy scores surpassing We use the NDCG metric to comment on the accuracy of the all the baselines in item-normalised NDCG, although as ex- systems and the remaining metrics defined in Section 4.3.1 pected, this was not the case for NDCG overall. 
This suggests to comment on qualitative aspects of the recommendations. that the model consistently projects into relevant regions of the embedding space, and that the nearest neighbours are not 5 RESULTS just most popular candidates. Despite returning marginally The NDCG scores for each recommender system are shown higher NDCG scores, the WA results are salient mainly for in Figure 6. The scores from all metrics are summarised in how similar they are across the board, to the CS baseline Table 1. that lacks information from the previous article. Accuracy scores recorded for the baselines models were in The rank-optimised neural network (NN) model (Section 4.2.3) line with expectations. Compared to a random selection of returned accuracy scores that were a clear step up from both Recommendation systems for news articles at the BBC INRA’19September, 2019Copenhagen, Denmark Table 1: Benchmark results of competing models after generating 100-length lists of recommendations. For the sake of brevity, we report here only overall metrics. Recommender System Hitrate NDCG Intra-list diversity Inter-list diversity Surprisal Recency Random baseline 0.005 0.001 1.192 0.995 0.430 0.010 Recency baseline 0.695 0.163 1.175 0.000 0.000 0.975 Popularity baseline 0.315 0.049 1.170 0.000 0.000 0.495 Content similarity baseline 0.085 0.021 0.641 0.968 0.790 0.018 Weighted average of item embeddings 0.065 0.022 0.641 0.968 0.790 0.018 Cosine-based collaborative filtering 0.741 0.244 1.154 0.584 0.480 0.512 Rank-optimised neural network 0.128 0.040 0.909 0.731 0.781 0.036 other LDA-based models (CF and WA). This was the case network had an impact that was also weaker than expected. for both variants of NDCG and particularly so for hitrate, These unintuitive results raise further questions that we plan indicating that the NN model was optimised more for recall to explore in the future. 
than precision and could possibly benefit from further rerank- Fundamentally, we believe there is scope to optimise the ing procedures. The NN model also distinguished itself from NN approach further so that it will perform more competi- CF and WA models in the diversity and surprisal metrics. tively with CF. To achieve this end we have multiple strate- Results suggest the NN model produces more distinct lists gies. These fall into three categories: model architecture, data, (indicated by higher inter-list diversity) but that those lists and training improvements. are more topically homogenous (indicated by lower intra-list We know that learning to rank in a pointwise framework diversity and surprisal metrics). is not optimal. Both pairwise and listwise approaches should, in theory, achieve better results (see [13, 23]). Pairwise loss 6 DISCUSSION functions together with triplet loss architectures have demon- strated impressive results elsewhere but our own early ex- The first cycle of research in our journey to find the best news periments have indicated they are difficult to train, tending recommender for BBC Mundo is complete. In Section 2 we towards significant underfitting. have outlined the characteristics of the problem we address: a A key reason for this may be the under-representation majority of non-signed in users; a large number of cold-start of negative examples in our training set. Adopting a higher items; architectural constraints; and high quality demands, proportion of negative training examples may address this, not only in terms of accuracy, but also in what concerns but also using more informed negative sampling techniques fairness and impartiality of recommendations. may be required (such as weighted approximate-rank pair- One of the lessons we learned is that—unsurprisingly— wise loss [44]). Even with the current pointwise architecture balancing the different aspects of our problem is hard. 
One model may satisfy one of our requirements whilst failing to fulfil another. A pure collaborative filtering approach is currently our best option to maximise offline scoring accuracy, but that comes at the cost of reduced diversity (and a degree of recency, depending on how regularly we retrain). Moreover, the performance of the CF model was not entirely unexpected, as it has been shown [17, 25, 32] that such simple methods typically outperform neural approaches when only logged user items are used, and that the latter only start to perform well when the input features contain additional contextual metadata. However, as with most collaborative filtering approaches, this model suffers from the item cold-start problem, and so frequent regeneration of the sparse user-item matrix would be required. Therefore, we cannot depend upon a solution derived purely from user interactions.
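As an illustration of the kind of item-item, cosine-based collaborative filtering referred to above, the sketch below scores unseen articles by their similarity to a user's read history. The toy matrix and scoring rule are illustrative assumptions (the paper does not detail its CF formulation), and a production system would operate on the sparse user-item matrix mentioned in the text:

```python
import numpy as np

# Hypothetical toy interaction matrix: rows = users, columns = articles
# (1 = the user read the article).
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 1, 1.0]])

# Cosine similarity between item (column) vectors.
norms = np.linalg.norm(X, axis=0)
norms[norms == 0] = 1.0          # guard against items with no interactions
item_sim = (X / norms).T @ (X / norms)

def recommend(user_row, k=2):
    """Score unseen items by summed similarity to the user's read items."""
    scores = item_sim @ user_row
    scores[user_row > 0] = -np.inf  # mask already-read articles
    return np.argsort(-scores)[:k]

# A user who read articles 0 and 1 is recommended articles 2 and 3.
print(recommend(np.array([1.0, 1.0, 0.0, 0.0])).tolist())  # [2, 3]
```

Because scores are built purely from co-read patterns, a brand-new article (an all-zero column) can never be recommended, which is the item cold-start problem noted above.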
To that end, we also know from our experiments that the contribution of previous articles appears to have a lower impact than expected. Despite the performance of the WA model consistently exceeding the CS baseline model (across validation and test), this gain was always marginal. Furthermore, our attempts at combining the current and previous article vectors in a non-linear fashion using a neural network had an impact that was also weaker than expected. These unintuitive results raise further questions that we plan to explore in the future.

Fundamentally, we believe there is scope to optimise the NN approach further so that it performs more competitively with CF. To achieve this end we have multiple strategies, which fall into three categories: model architecture, data, and training improvements.

We know that learning to rank in a pointwise framework is not optimal. Both pairwise and listwise approaches should, in theory, achieve better results (see [13, 23]). Pairwise loss functions together with triplet loss architectures have demonstrated impressive results elsewhere, but our own early experiments indicated they are difficult to train, tending towards significant underfitting.

A key reason for this may be the under-representation of negative examples in our training set. Adopting a higher proportion of negative training examples may address this, but more informed negative sampling techniques may also be required (such as weighted approximate-rank pairwise loss [44]). Even with the current pointwise architecture, there is a 5% difference in train/test performance (item-normalised NDCG) that should reduce significantly with appropriate regularisation.

Changes to our training process may also lead to significant gains. In addition to increasing compute resources for the exploration of the hyperparameter space, reducing the training/testing window from the order of days or weeks to the order of hours may provide greater scope for experimentation (as has been reported elsewhere [16]). While a smaller training window does necessitate more regular training of deployed models, it also means more manageable datasets where hyperparameter optimisation is more practical.

A further change that may prove fruitful is to expand the richness of the input to the user profile model. This may include expanding the size of the user journeys in the training set beyond 3 (a constraint which, incidentally, did not apply to the CF model at training), while also introducing contextual information about the user.
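The pairwise alternative discussed above can be sketched with a BPR-style loss over sampled negatives. The embeddings, dimensionality and negative-sample count below are toy assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: a user-state vector, the article actually read next
# (positive), and a handful of sampled unread articles (negatives).
user = rng.normal(size=16)
pos_item = rng.normal(size=16)
neg_items = rng.normal(size=(5, 16))

def bpr_loss(user, pos, negs):
    """Pairwise (BPR-style) loss: push the positive article's score above
    each sampled negative's, rather than fitting absolute scores pointwise."""
    pos_score = user @ pos
    neg_scores = negs @ user
    # -log sigmoid(s_pos - s_neg), averaged over the sampled negatives
    return float(np.mean(np.log1p(np.exp(-(pos_score - neg_scores)))))

print(bpr_loss(user, pos_item, neg_items))
```

Weighted approximate-rank pairwise (WARP) loss [44] goes further, sampling negatives until a margin violation is found and weighting the update by the estimated rank of the positive item.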
Finally, another direction to be explored in the future regards content representation. In experiments not reported in the current work, raw article text has been encoded through an LDA model. However, our system architecture affords enough flexibility to replace the current content model with alternative article embeddings and test different approaches. In particular, we are interested in taking sub-word information into consideration [8], enriching text with semantics [10, 24], and augmenting text representations with multimedia [35, 46].

Our results demonstrate the difficulty of acquiring all the desired characteristics of an ideal news recommender. Ultimately, we expect ensemble approaches may represent the best solution. Here we may take the cold-start benefits of the content-based neural approach and combine them with the less diverse but more accurate list of items generated by a collaborative filtering model.
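Such an ensemble could, for instance, blend the two ranked lists by weighted reciprocal rank so that cold-start items from the content-based list can still surface. The weighting scheme below is a hypothetical illustration, not a deployed BBC method:

```python
def blend(cf_ranking, content_ranking, cf_weight=0.7):
    """Score each item by weighted reciprocal rank across the two lists,
    then return items sorted by blended score (highest first)."""
    scores = {}
    for weight, ranking in ((cf_weight, cf_ranking),
                            (1.0 - cf_weight, content_ranking)):
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + weight / rank
    return sorted(scores, key=scores.get, reverse=True)

cf = ["a", "b", "c"]       # accurate but less diverse CF list
content = ["x", "a", "y"]  # can rank brand-new (cold-start) items
print(blend(cf, content))  # ['a', 'b', 'x', 'c', 'y']
```

Item "x" never appears in the CF list, yet still surfaces ahead of low-ranked CF items, which is exactly the cold-start behaviour the ensemble is meant to recover.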
7 CONCLUSION

In this paper we evaluated three approaches to providing news recommendations for the BBC Mundo service. The systems we have built are compatible with BBC serving infrastructure, a use case which includes millions of daily users and new content in the order of several thousand articles per week. In spite of our experiment being only the initial step of a journey that promises to be much longer, our models outperformed random, popularity-based, recency-based and content-similarity baselines. It is worth noting, though, that these results do not reflect current online performance. More work is needed to ensure that these models, when deployed, meet the quality and editorial standards of the BBC. Future challenges concern not only achieving higher accuracy, but also conforming to the principles of algorithmic fairness and impartiality. We encourage the community to collaborate in helping us create the way forward towards fair and engaging recommendations and applications with responsible machine learning.

REFERENCES
[1] Trapit Bansal, David Belanger, and Andrew McCallum. 2016. Ask the GRU: Multi-task Learning for Deep Text Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19, 2016. 107–114.
[2] BBC. 2019. The BBC's services in the UK - About the BBC. https://www.bbc.com/aboutthebbc/whatwedo/publicservices Consulted on 21 June 2019.
[3] BBC. 2019. Editorial Guidelines. https://www.bbc.co.uk/editorialguidelines Consulted on 21 June 2019.
[4] BBC. 2019. Global news services - About the BBC. https://www.bbc.com/aboutthebbc/whatwedo/worldservice Consulted on 21 June 2019.
[5] BBC. 2019. Mission, values and public purposes - About the BBC. https://www.bbc.com/aboutthebbc/governance/mission Consulted on 21 June 2019.
[6] BBC. 2019. News - Mundo. https://www.bbc.com/mundo Consulted on 21 June 2019.
[7] E. Bernhardsson. 2017. ANNOY: Approximate nearest neighbors in C++/Python optimized for memory usage and loading/saving to disk. GitHub. https://github.com/spotify/annoy (2017).
[8] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR abs/1607.04606 (2016). arXiv:1607.04606 http://arxiv.org/abs/1607.04606
[9] Clara Higuera Cabañes, Michel Schammel, Shirley Ka Kei Yu, and Ben Fields. 2019. Human-centric Evaluation of Similarity Spaces of News Articles. In 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (NewsIR'19 Third International Workshop on Recent Trends in News Information Retrieval). 51–56.
[10] Michel Capelle, Flavius Frasincar, Marnix Moerland, and Frederik Hogenboom. 2012. Semantics-based news recommendation. In 2nd International Conference on Web Intelligence, Mining and Semantics, WIMS '12, Craiova, Romania, June 6-8, 2012. 27:1–27:9.
[11] Hugo Caselles-Dupré, Florian Lesaint, and Jimena Royo-Letelier. 2018. Word2Vec Applied to Recommendation: Hyperparameters Matter. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). ACM, New York, NY, USA, 352–356. https://doi.org/10.1145/3240323.3240377
[12] P. Castells, S. Vargas, and J. Wang. 2011. Novelty and diversity metrics for recommender systems: choice, discovery and relevance. In International Workshop on Diversity in Document Retrieval (DDR 2011) at the 33rd European Conference on Information Retrieval (ECIR 2011).
[13] Ting Chen, Yizhou Sun, Yue Shi, and Liangjie Hong. 2017. On Sampling Strategies for Neural Network-based Collaborative Filtering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13-17, 2017. 767–776.
[14] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16). ACM, 191–198. https://doi.org/10.1145/2959100.2959190
[15] Abhinandan Das, Mayur Datar, Ashutosh Garg, and Shyamsundar Rajaram. 2007. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007. 271–280.
[16] Gabriel de Souza Pereira Moreira. 2018. CHAMELEON: a deep learning meta-architecture for news recommender systems. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018. 578–583.
[17] Gabriel de Souza Pereira Moreira, Dietmar Jannach, and Adilson Marques da Cunha. 2019. Contextual Hybrid Session-based News Recommendation with Recurrent Neural Networks. CoRR abs/1904.10367 (2019).
[18] Elena Viorica Epure, Benjamin Kille, Jon Espen Ingvaldsen, Rébecca Deneckère, Camille Salinesi, and Sahin Albayrak. 2017. Recommending Personalized News in Short User Sessions. In Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys 2017, Como, Italy, August 27-31, 2017. 121–129.
[19] Sahin Cem Geyik, Stuart Ambler, and Krishnaram Kenthapadi. 2019. Fairness-Aware Ranking in Search & Recommendation Systems with Application to LinkedIn Talent Search. CoRR abs/1905.01989 (2019).
[20] Carlos A. Gomez-Uribe and Neil Hunt. 2016. The Netflix Recommender System: Algorithms, Business Value, and Innovation. ACM Trans. Management Inf. Syst. 6, 4 (2016), 13:1–13:19.
[21] Asela Gunawardana and Guy Shani. 2009. A Survey of Accuracy Evaluation Metrics of Recommendation Tasks. Journal of Machine Learning Research 10 (2009), 2935–2962.
[22] Malay Haldar, Mustafa Abdool, Prashant Ramanathan, Tao Xu, Shulin Yang, Huizhong Duan, Qing Zhang, Nick Barrow-Williams, Bradley C. Turnbull, Brendan M. Collins, and Thomas Legrand. 2018. Applying Deep Learning To Airbnb Search. CoRR abs/1810.09591 (2018).
[23] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based Recommendations with Recurrent Neural Networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[24] Wouter IJntema, Frank Goossen, Flavius Frasincar, and Frederik Hogenboom. 2010. Ontology-based news recommendation. In EDBT/ICDT Workshops (ACM International Conference Proceeding Series). ACM.
[25] Dietmar Jannach and Malte Ludewig. 2017. When Recurrent Neural Networks meet the Neighborhood for Session-Based Recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys 2017, Como, Italy, August 27-31, 2017. 306–310.
[26] Tomonari Kamba, Krishna Bharat, and Michael C. Albers. 1996. The Krakatoa Chronicle: An Interactive Personalized Newspaper on the Web. World Wide Web Journal 1, 1 (1996).
[27] Mozhgan Karimi, Dietmar Jannach, and Michael Jugovac. 2018. News recommender systems - Survey and roads ahead. Inf. Process. Manage. 54, 6 (2018), 1203–1227.
[28] Romain Lerallut, Diane Gasselin, and Nicolas Le Roux. 2015. Large-Scale Real-Time Product Recommendation at Criteo. In Proceedings of the 9th ACM Conference on Recommender Systems, RecSys 2015, Vienna, Austria, September 16-20, 2015. 232.
[29] Lei Li, Dingding Wang, Tao Li, Daniel Knox, and Balaji Padmanabhan. 2011. SCENE: a scalable two-stage personalized news recommendation system. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, Beijing, China, July 25-29, 2011. 125–134.
[30] Lei Li, Li Zheng, Fan Yang, and Tao Li. 2014. Modeling and broadening temporal user interest in personalized news recommendation. Expert Syst. Appl. 41, 7 (2014), 3168–3177.
[31] Greg Linden. 2011. Eli Pariser is wrong. http://glinden.blogspot.com/2011/05/eli-pariser-is-wrong.html Consulted on 21 June 2019.
[32] Malte Ludewig and Dietmar Jannach. 2018. Evaluation of session-based recommendation algorithms. User Model. User-Adapt. Interact. 28, 4-5 (2018), 331–390.
[33] Rishabh Mehrotra, James McInerney, Hugues Bouchard, Mounia Lalmas, and Fernando Diaz. 2018. Towards a Fair Marketplace: Counterfactual Evaluation of the trade-off between Relevance, Fairness & Satisfaction in Recommendation Systems. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22-26, 2018. 2243–2251.
[34] Tomoko Murakami, Koichiro Mori, and Ryohei Orihara. 2007. Metrics for Evaluating the Serendipity of Recommendation Lists. In JSAI (Lecture Notes in Computer Science), Vol. 4914. Springer, 40–46.
[35] Thomas Nedelec, Elena Smirnova, and Flavian Vasile. 2017. Specializing Joint Representations for the task of Product Recommendation. CoRR abs/1706.07625 (2017). arXiv:1706.07625 http://arxiv.org/abs/1706.07625
[36] Nicholas Negroponte. 1996. Being Digital. Random House Inc., New York, NY, USA.
[37] Tien T. Nguyen, Pik-Mai Hui, F. Maxwell Harper, Loren G. Terveen, and Joseph A. Konstan. 2014. Exploring the filter bubble: the effect of using recommender systems on content diversity. In 23rd International World Wide Web Conference, WWW '14, Seoul, Republic of Korea, April 7-11, 2014. 677–686.
[38] Özlem Özgöbek, Jon Atle Gulla, and Riza Cenk Erdur. 2014. A Survey on Challenges and Methods in News Recommendation. In WEBIST 2014 - Proceedings of the 10th International Conference on Web Information Systems and Technologies, Volume 2, Barcelona, Spain, 3-5 April, 2014. 278–285.
[39] Michael J. Pazzani and Daniel Billsus. 2007. Content-Based Recommendation Systems. In The Adaptive Web (Lecture Notes in Computer Science), Vol. 4321. Springer, 325–341.
[40] Massimo Quadrana, Paolo Cremonesi, and Dietmar Jannach. 2018. Sequence-Aware Recommender Systems. ACM Comput. Surv. 51, 4 (2018), 66:1–66:36.
[41] Brent Smith and Greg Linden. 2017. Two Decades of Recommender Systems at Amazon.com. IEEE Internet Computing 21, 3 (2017), 12–18.
[42] Daniel Valcarce, Alejandro Bellogín, Javier Parapar, and Pablo Castells. 2018. On the robustness and discriminative power of information retrieval metrics for top-N recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018. 260–268.
[43] Shoujin Wang, Longbing Cao, and Yan Wang. 2019. A Survey on Session-based Recommender Systems. CoRR abs/1902.04864 (2019).
[44] Jason Weston, Hector Yee, and Ron J. Weiss. 2013. Learning to rank recommendations with the k-order statistic loss. In Seventh ACM Conference on Recommender Systems, RecSys '13, Hong Kong, China, October 12-16, 2013. 245–248.
[45] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep Learning Based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. 52, 1 (2019), 5:1–5:38.
[46] Lu Zheng, Zhao Tan, Kun Han, and Ren Mao. 2018. Collaborative Multi-modal deep learning for the personalized product retrieval in Facebook Marketplace. CoRR abs/1805.12312 (2018). arXiv:1805.12312 http://arxiv.org/abs/1805.12312