Deploying a Cost-Effective and Production-Ready Deep News Recommender System in the Media Crisis Context

Jean-Philippe Corbeil, Le Devoir, Canada, jpcorbeil@ledevoir.com
Florent Daudens, Le Devoir, Canada, fdaudens@ledevoir.com

ABSTRACT
In the current context of the media crisis, online media companies need cost-effective technological solutions to stay competitive against the huge monopolistic software companies that massively feed content to users. News recommender systems are well-suited solutions, even if current commercial offerings are well above most online media's budget. In this paper, we present a case study of our deployed deep news recommender system at Le Devoir, an independent French Canadian newspaper in the province of Quebec. We expose the software architecture and the issues we met, along with their solutions. Furthermore, we present four qualitative and quantitative analyses done with our custom monitoring dashboard: offline performance of our models, embedding space analysis, fake-user testing and high-traffic simulations. For a tiny fraction of the price of the available commercial solutions, our simple software architecture, based on Docker, Kubernetes and open-source technologies in the cloud, has proven easy to maintain, scalable and cost-effective. It also shows excellent offline performance and generates high-quality embeddings as well as relevant recommendations.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Recommender System, News Recommendations, Sequential Recommendation, Media Crisis, Dashboard, Cloud Technology

Reference Format:
Jean-Philippe Corbeil and Florent Daudens. 2020. Deploying a Cost-Effective and Production-Ready Deep News Recommender System in the Media Crisis Context. In 3rd Workshop on Online Recommender Systems and User Modeling (ORSUM 2020), in conjunction with the 14th ACM Conference on Recommender Systems, September 25th, 2020, Virtual Event, Brazil.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 INTRODUCTION
In the last couple of decades, newspapers have seen their world changed by the digital shift in the news market [4, 21, 23]. From printed newspapers to online articles, readers' needs have also shifted from the static paper format to the fast, dynamic and well-synthesized display of online news on mobile devices [1, 13]. Moreover, we have observed the colossal impact, both financially and on readers' behaviour, of intermediary platforms such as news aggregators and social networks that massively feed news content to readers in a personalized fashion.

In this context, online media companies need digital tools to retain readers on their platforms and to support their conversion goals. We can address these issues by feeding a personalized list of contents to readers, which makes news recommender systems the perfect solution. Nevertheless, most commercial data solutions, such as Google Analytics 360 and Google Recommendations AI, are far beyond what most newspapers can afford. Can we build a production-ready and cost-effective deep news recommender system for news articles that leverages cloud and open-source technologies? At Le Devoir, an independent French Canadian newspaper in the province of Quebec, the conception and deployment of this recommender system is part of our digital shift plan. It is also part of our strategy to reach our marketing goals by offering tailored editorial content to our readers.

In this paper, our contributions are:
• The first case study on the deployment of a production-ready cloud architecture for a recommender system using Docker and a continuous integration and continuous deployment (CI/CD) production cycle.
• The design of a cost-effective and scalable deep news recommender system oriented toward short-term recommendations.
• The demonstration of considerable offline performance while meeting our constraints: cost, scalability, training time and serving time.
• The design of two qualitative experiments to assess the quality of a news recommender system before going online: the embedding quality test and the fake-user recommendation test.
• The design of a monitoring dashboard for our recommender system.

In the next section, we discuss previous work related to our news recommender system. Then, we elaborate on our system architecture by explaining all aspects of our methodology: data processing, model training, recommendation delivery and monitoring. Next, we discuss our models' offline results, two qualitative validation methods and the traffic benchmark of our system. Afterwards, we discuss our system's limitations and future work. We end with a conclusion summarizing our whole approach.

2 BACKGROUND
Garcin and Faltings (2013) [5] presented PEN recsys, a framework for personalized news recommender systems built with Java EE. They presented five models in an A/B testing setup: context-tree, most-popular, content-based, collaborative and random recommendations. They had a web-based control panel to monitor performance and modify some parameters, and they demonstrated the ability to deliver recommendations within 30 ms even at visit peaks. We followed many aspects of their framework: an A/B testing setup with several models, a monitoring dashboard and a traffic benchmark. However, several aspects of their approach are questionable given the recent state of the art in software engineering. First, current practice in software maintainability is far beyond their Java EE setup; we use cloud-based and open-source technologies with a CI/CD pipeline triggered from our GitHub repositories. Second, their web-based control panel was very minimalist and did not follow any design principles. We built our user interface on the dashboard principles of Sarikaya et al. (2018) [20], dividing it into six tabs targeting specific monitoring and decision-making goals, and we included features like complete monitoring visualizations, actionable widgets and live tests of the models. Third, the sequential nature of news recommendation is known to require deep sequential architectures [24]; our collaborative deep learning models are based on PyTorch [19] and take sequences as input. We also prepared an A/B testing setup to test our models against most-popular and random recommendations, and, as future work, we plan the conception of a content-based approach and its hybridization with our collaborative approach. Finally, we discuss the financial impact of our system in terms of costs for our newspaper company.
Karimi et al. (2018) [10] reviewed the state of the art in news recommender systems and made several suggestions. They noticed that the scalability of systems is often an issue despite the maturity of storage systems, an issue also reported for deep learning architectures by Zhang et al. (2019) [24]. Instead of following their recommendation of using continuous learning, we suppressed this issue by designing our system on a few-days basis. With this design choice, we removed the model's dependency on the continual indexation of both items and users into a stable model. Second, a major issue concerns the dynamic addition of publications throughout the day and the need to consider these articles quickly: Karimi et al. mentioned the need to incorporate new articles within minutes to benefit from their momentum and obtain a high click-through rate. We followed this recommendation with many quick training sessions within one hour. They also proposed a hybrid solution mixing efficient short-term predictions with sophisticated long-term predictions, in line with the previous work of Liu et al. [14]. We adopted this strategy in our system; in the current paper, we focus on short-term predictions. The authors also criticized the lack of reproducibility in the news recommendation domain, where many datasets are proprietary. We addressed this issue by releasing the anonymized collaborative dataset used in our experiments. Finally, they mentioned the lack of correspondence between offline and online results demonstrated by Garcin et al. (2014) [6]. In our case, we report the results on our released dataset, with further results from training the model many times on evolving data. Despite relying only on offline results, our architecture is very flexible: even if such correspondence does not hold when we start to provide recommendations online, our custom dashboard contains widgets to modify the models, their hyperparameters and the recommendation strategies in an A/B testing setup. Furthermore, we designed our system with a CI/CD pipeline enabling deeper modifications to be released within the next ten minutes without any downtime.

Mohallick and Özgöbek (2017) [16] analyzed privacy in news recommender systems. While many reported news recommender systems rely on personal data, our system is in line with privacy principles: we do not use more data than the information required by a collaborative algorithm, and we only need to activate Matomo's opt-out option on our website to be fully GDPR compliant.

3 METHODOLOGY
3.1 Architecture
We divided our system into two segments: the analytics part and the recommender system part. These two parts are illustrated in Figure 1.

Figure 1: Diagram of our recommender system architecture.

The analytics side contains two technical components: the analytics platform Matomo [15] (previously Piwik) and the website, in our case ledevoir.com. As a first step, readers come to our website, and we track their page views anonymously. We respect our readers' privacy by assigning a random visitor ID at the first visit and by keeping no specific personal information¹. Only an IP address truncated to 2 bytes, which is very general information, is recorded, and no more than three months of detailed data is available locally. At Le Devoir, the marketing department manages any other information linked to user accounts in a separate CRM system. As a second step, we added the recommender system part to the loop to propose a personalized list of five articles.

Overall, we designed our recommender system with five essential components: a serving virtual machine, a training virtual machine, a GCP bucket, a MongoDB database and a monitoring dashboard. We explain the interactions between all these components in the following sections. We also address specific design considerations.

¹ We followed Matomo's guideline on privacy: https://matomo.org/docs/privacy/.

3.2 Design considerations
Our recommender system's design must respect four significant constraints: cost, scalability, serving time and training time. Thus, we made two major design choices: split our architecture into two virtual machines and limit the number of days of data.

We chose to split our architecture into two virtual machines (VM): one serving VM and one training VM. The first VM is the master VM: it is always online, and it serves recommendations to our readers. The second VM's only job is to train the model. We made this design choice to reduce cost, since the Graphics Processing Unit (GPU) needed to train the model in a reasonable amount of time is the most expensive part of the system. With this choice, and based on the GCP Pricing Calculator, we estimated savings of nearly 80% of our total cost considering 5.5 hours of training per day. This training time is split into training sessions of about 7 to 10 minutes, with, on average, only 4.8 minutes spent training the model (taken from our dashboard's current state and model tabs between May 29th, 2020 and May 31st, 2020). Thus, we can have a maximum of 6 training sessions per hour of the day, leading to model updates at the same frequency on the serving VM. We set the number of training sessions to follow the average daily traffic on our website, shown in Figure 2, with some delay. We apply this strategy to follow the principle of training more often when more new data are available.

Figure 2: Interactive plot taken from our dashboard with the daily number of training sessions by hour of the day.

Second, we selected a maximum number of days from which we train and recommend articles. This design choice keeps the amount of input data steady and stabilizes the system's training time. Furthermore, it reduces the system's dependency on keeping the previously trained model and maintaining the indexations, leading to a scalable architecture with a simpler sequential model that can be retrained from scratch every time. We fixed this number at four days, knowing that our articles' active lifespan is at most two days, following the Nyquist rate [2, 17]. On average, we then have around 1.9 million data points for each training session (taken from our dashboard's model tab from May 29th, 2020 to May 31st, 2020).

3.3 Data analytics
To develop a news recommender system, we needed the right analytics platform to gain the necessary insight into our data. Two major drawbacks of the widely used Google Analytics 360 are its cost and its data sampling. We solved both issues by implementing the open-source Matomo Analytics Platform [15] with a MySQL database. It is freely available, has a great community and is easily deployable on any cloud computing virtual machine. Thus, we deployed the Matomo Analytics Platform on Google Cloud Platform (GCP) for our high-traffic website.
3.4 Data pipeline
From Matomo's database, we designed an intermediary MongoDB database on MongoDB Atlas. This database is a crucial piece of our design because our website's traffic heavily solicits our central database. This secondary database holds around a week of already pre-processed data, and it is fed every minute by cron jobs on the serving VM. Because Matomo records noisy data, such as 404 URLs and pages that are not articles, we had to filter our data: the pipeline keeps only the records whose URL matches the article format and validates their title before dumping the results into MongoDB. We also separated the article records (URL, title and ID) from the visit records (ID, article ID, reader ID and timestamp) to maintain a lower memory usage. Even with this architecture, downloading the whole dataset at training time was slow on the training machine, around 10 minutes. This downtime can be costly on a virtual machine with a GPU, such as our training VM. Our solution was to pre-dump the data progressively into a Google Cloud Bucket and download this single file instead, which brought the download time below one minute. We kept MongoDB in this part of the design for real-time access, to accelerate the pre-processing cron jobs and to maintain our data integrity.
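To make the filtering step concrete, here is a minimal sketch of such a cron job in Python. The collection names (`articles`, `visits`), the article-URL pattern, the connection string and the shape of the raw Matomo page views are assumptions for illustration; our production job is more involved.

```python
import re
from datetime import datetime, timedelta

from pymongo import MongoClient, UpdateOne

# Hypothetical URL pattern and connection string, for illustration only.
ARTICLE_URL = re.compile(r"^https://www\.ledevoir\.com/[a-z-]+/\d+/[\w-]+$")
db = MongoClient("mongodb+srv://<atlas-cluster>/")["recsys"]

def is_article(view: dict) -> bool:
    """Keep only page views that look like real articles and carry a valid title."""
    return bool(ARTICLE_URL.match(view["url"])) and bool(view.get("title", "").strip())

def sync_last_minute(raw_views: list):
    """Filter raw Matomo page views and store them into the two collections."""
    article_ops, visit_docs = [], []
    for view in filter(is_article, raw_views):
        article_ops.append(UpdateOne({"_id": view["article_id"]},
                                     {"$set": {"url": view["url"], "title": view["title"]}},
                                     upsert=True))
        visit_docs.append({"article_id": view["article_id"],
                           "reader_id": view["reader_id"],
                           "timestamp": view["timestamp"]})
    if article_ops:
        db.articles.bulk_write(article_ops)
    if visit_docs:
        db.visits.insert_many(visit_docs)
    # Keep the secondary database small: drop visits older than one week.
    db.visits.delete_many({"timestamp": {"$lt": datetime.utcnow() - timedelta(days=7)}})
```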
3.5 Models
At the core of our system, we implemented the Spotlight Python library by Kula [12]. It contains many state-of-the-art deep sequential recommender system architectures in PyTorch [19]. We used its sequential models, which include four neural models: 1D CNN [9, 18], LSTM [7], MixtureLSTM [11] and the Pooling model [3].

We optimized our models with actual offline data from May 7th, 2020, to May 11th, 2020, containing 1,944,719 data points. We made an anonymized version of this dataset available on our GitHub² to promote our results, and we encourage the community to improve them with better collaborative models. The data contains records with an anonymized reader identification number, an anonymized article identification number and a timestamp.

² https://github.com/LeDevoir/orsum2020_collaborative_datasets

In our experimentation, we fixed some parameter values according to both our pre-experiments and Spotlight's documentation: the number of epochs to 10, the learning rate to 1e-2, the random state to 42, and no regularization. We used the adaptive hinge loss function [22]. We ran our experiments on an NVIDIA RTX 2070 GPU.

With Spotlight's sequence parser, we parse all sequences of articles with a minimum of 3 and a maximum of 7 articles per reader; every sequence is also padded up to 7 articles. We chose these bounds, first, to ensure that the sequences contain a minimum of relevant articles and, second, to limit the length of the model's input. For the lower bound of three articles, a significant part of our traffic consults only one or two articles and does not come back; we do not aim to recommend articles to this type of reader and prefer to serve our core readers first. For the upper bound, we did pre-experimentation, and seven articles seemed a reasonable length. Spotlight's documentation suggests five items as an upper bound, but this is short given our lower bound.

Since we have a large number of sequences, we did our validation with a training/testing split of 90%/10%, which resulted in a train set of 215,200 samples and a test set of 22,625 samples. We separated these sets by user, so readers in the train set and the test set are mutually exclusive. We optimized all four models over the following set of hyperparameters with a grid-search approach (a minimal sketch of this setup follows the list):
• Batch size = { 512, 1024, 2048, 4096, 8192 }
• Embedding size = { 32, 64 }
• Number of negative samples = { 100, 200, 300 }
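For illustration, a minimal Spotlight training sketch matching the settings above could look as follows. The CSV column names, the integer-encoded identifiers and the single grid point are assumptions rather than our exact production code.

```python
import numpy as np
import pandas as pd
from spotlight.interactions import Interactions
from spotlight.cross_validation import user_based_train_test_split
from spotlight.sequence.implicit import ImplicitSequenceModel
from spotlight.evaluation import sequence_mrr_score

# Assumed layout of the anonymized dataset: reader_id, article_id, timestamp.
df = pd.read_csv("collaborative_dataset.csv")
interactions = Interactions(user_ids=df["reader_id"].values.astype(np.int32),
                            item_ids=df["article_id"].values.astype(np.int32),
                            timestamps=df["timestamp"].values)

# 90%/10% split by user, so train and test readers are mutually exclusive.
train, test = user_based_train_test_split(interactions, test_percentage=0.1,
                                          random_state=np.random.RandomState(42))
# Sequences of 3 to 7 articles per reader, padded up to length 7.
train_seq = train.to_sequence(max_sequence_length=7, min_sequence_length=3)
test_seq = test.to_sequence(max_sequence_length=7, min_sequence_length=3)

# One grid point; in practice we loop over batch size, embedding size and negatives.
model = ImplicitSequenceModel(representation="cnn", loss="adaptive_hinge",
                              embedding_dim=32, batch_size=2048,
                              num_negative_samples=300, n_iter=10,
                              learning_rate=1e-2, use_cuda=True,
                              random_state=np.random.RandomState(42))
model.fit(train_seq, verbose=True)
print("MRR:", sequence_mrr_score(model, test_seq).mean())
```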
3.6 Model dumping
Once the model is trained on the training virtual machine, we dump its weights and configuration into our Google Cloud Bucket. Then, the training machine notifies the serving machine over HTTPS to fetch the new model and make it ready to serve. Finally, the training machine can turn off until the serving machine calls the next training session.
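A minimal sketch of this hand-off is shown below. The bucket name and the serving endpoint are hypothetical, and authentication as well as error handling are omitted.

```python
import torch
import requests
from google.cloud import storage

BUCKET = "ledevoir-recsys-models"                        # hypothetical bucket name
SERVING_URL = "https://serving.example.internal/reload"  # hypothetical endpoint

def dump_and_notify(model, version: str):
    """Upload the trained model to the bucket, then ask the serving VM to reload it."""
    local_path = f"/tmp/model-{version}.pt"
    # Pickling the whole Spotlight model is the simplest option; saving only the
    # state_dict plus a configuration dictionary is a leaner alternative.
    torch.save(model, local_path)

    blob = storage.Client().bucket(BUCKET).blob(f"models/model-{version}.pt")
    blob.upload_from_filename(local_path)

    # The serving VM fetches the new blob and swaps it in without downtime.
    response = requests.post(SERVING_URL, json={"version": version}, timeout=30)
    response.raise_for_status()
```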
3.7 Continuous Integration and Continuous Deployment Pipeline
We followed the continuous integration and continuous deployment (CI/CD) principle from software engineering to ease the maintainability of our system and to ensure zero downtime. Our pipeline is illustrated in Figure 3. We host our recommender system in a private GitHub repository. A trigger watches for updates on the master branch and automatically calls our continuous integration (CI) platform, CircleCI, on this event. The CI runs the recommender system's unit tests, which cover our code at 100%, to ensure strict control of our builds. If all the tests pass, the CI builds a Docker image from our repository and registers it on the Google Container Registry (GCR). Finally, the CI deploys the new image into our Kubernetes cluster³ with zero downtime by applying a rolling update. The serving VM and the training VM are handled similarly by two different CI/CD pipelines linked to their respective GitHub master branches; the only difference is that the training image runs as a Kubernetes Job on its cluster, since we run it on demand.

Figure 3: Diagram of the CI/CD pipeline of the recommender system.

³ Kubernetes is a system for automating the deployment, scaling and management of Docker containers.

3.8 Online validation strategy
To validate the model's effectiveness once we launch it online, we prepared an A/B testing setup inside the recommender system by monitoring the Click-Through Rate (CTR) distributions. In this A/B test, we compare our model's recommendations to the actual top-5 article suggestions given to readers in a box on the website. Our model's recommendations are given in a similar box with the same disposition (see Figure 4). Half of the readers get the model's recommendations, and the other half get recommendations using the best articles of the last 30 minutes. By applying Student's T-test, we can measure the relevance of one recommendation method over the other, as sketched at the end of this subsection. Moreover, we can change the A/B testing settings with our custom dashboard.

Figure 4: Implementation on our website of the recommendation box "VOS RECOMMANDATIONS" on the right.
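The significance test behind this A/B setup can be sketched as follows with SciPy, assuming per-reader CTR arrays pulled from MongoDB for each arm; the simulated inputs at the bottom are only for illustration.

```python
import numpy as np
from scipy import stats

def ab_test(ctr_model: np.ndarray, ctr_top: np.ndarray, alpha: float = 0.05) -> dict:
    """Compare per-reader click-through rates of the two arms with a Student's T-test."""
    t_stat, p_value = stats.ttest_ind(ctr_model, ctr_top)
    better = "model" if ctr_model.mean() > ctr_top.mean() else "top-30-minutes"
    return {"t": t_stat, "p": p_value,
            "significant": p_value < alpha, "better_arm": better}

# Example with simulated per-reader CTRs for the two halves of the traffic.
rng = np.random.default_rng(42)
print(ab_test(rng.binomial(1, 0.5, 1000).astype(float),
              rng.binomial(1, 0.25, 1000).astype(float)))
```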
3.9 System monitoring
Many metrics and variables are recorded in MongoDB to monitor and control the system once online. This monitoring is done with a custom dashboard made with Dash by Plotly [8]. Our prototype has six tabs: current state, performances, model, execution settings, embedding space viewer and model testing interface; a minimal layout sketch is given at the end of this subsection. Figure 5 illustrates the current state tab. We use it to monitor the training VM status, which is useful for detecting errors during training, and to see the current A/B testing results with two histograms and their T-test values. The performances tab is designed to monitor the current model's offline performance (MRR and P@5, see Section 4.1). The model tab displays the training time and the number of input data points across time for monitoring purposes. The execution tab has many variables to manipulate and modify the system: we can shut down the recommendations, change the A/B testing settings, change the number of recommendations and interact with a plot to set the number of training sessions per hour of the day. The embeddings tab (see Figure 7) is a live 3D plot of the embedding space generated by the trained model, compressed to 3 dimensions with the TSNE algorithm. The test tab (see Figure 8) contains an interface to test the recommendations of the current model: it lists all current articles, from which we make a selection; this selection is then sent to the server as the list of articles previously consulted by a fake reader, and the model's recommendations are sent back. The last two tabs are useful tools to evaluate the model's recommendations (see Sections 4.2 and 4.3).

Figure 5: Example of our custom dashboard's user interface in French: the current state tab ("état actuel") of our recommender system. We see our six navigation tabs: current state, performances, model, execution settings, embedding space viewer and model testing interface. In the first box below the navigation, two indicators show the training status (last training status and model reloading status). At the bottom, two CTR histograms help us visualize the current A/B testing result, with their T-test values below; these are the results of our traffic benchmark (see Section 4.4).
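A minimal skeleton of this layout with Dash could look as follows. The tab labels match our prototype, while the callbacks and the data access backing each tab are omitted.

```python
from dash import Dash, dcc, html

app = Dash(__name__)

# One dcc.Tab per monitoring goal; each tab's content is rendered by callbacks
# that read metrics and settings from MongoDB.
app.layout = html.Div([
    html.H2("Recommender system dashboard"),
    dcc.Tabs(id="tabs", value="current-state", children=[
        dcc.Tab(label="Current state", value="current-state"),
        dcc.Tab(label="Performances", value="performances"),
        dcc.Tab(label="Model", value="model"),
        dcc.Tab(label="Execution settings", value="settings"),
        dcc.Tab(label="Embedding space viewer", value="embeddings"),
        dcc.Tab(label="Model testing interface", value="testing"),
    ]),
    html.Div(id="tab-content"),
])

if __name__ == "__main__":
    app.run_server(debug=False)
```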
4 RESULTS
4.1 Offline model validation
Out of 120 possible experiments in our grid search, 12 finished with an "out of GPU memory" error, leaving us with 108 results. We kept the top-5 for each model in Table 1. By fine-tuning the models' hyperparameters, we found that the 1D CNN architecture is the best one for our task, followed by the LSTM. Since it trains 38 seconds faster than the LSTM and saves memory with smaller embeddings, we selected the CNN model as our first state-of-the-art configuration. Overall, the MixtureLSTM model takes more time to train for slightly lower results than the LSTM and the CNN. The Pooling model largely under-performs on MRR but is competitive on the P@5 metric. We found that a large number of negative samples improves the results, as does a large embedding size in general. Moreover, smaller batch sizes tend to get better results: for instance, no batch size of 8192 appears in these top-5 and only one of 4096. Most of the 20 results presented here obtain a similar P@5, which means that in a list of 5 articles we find, most of the time, one relevant article for our reader. The MRR scores indicate that, except for the Pooling model, most models suggest this relevant article as the first or second item (MRR between 1 and 0.5, respectively).

Table 1: Top-5 results for each model sorted by MRR.

Model         Embedding size  Batch size  Negative samples  MRR    P@5    Training time (s)
CNN           32              2048        300               0.693  0.133  134
LSTM          64              1024        300               0.693  0.133  172
LSTM          64              2048        300               0.693  0.133  157
LSTM          64              512         300               0.691  0.133  219
CNN           64              2048        300               0.690  0.132  156
CNN           32              1024        300               0.690  0.132  154
MixtureLSTM   32              512         300               0.689  0.131  549
LSTM          64              512         200               0.687  0.131  159
LSTM          64              2048        200               0.687  0.123  108
CNN           64              2048        200               0.687  0.131  107
MixtureLSTM   64              512         200               0.687  0.122  614
CNN           64              4096        100               0.684  0.132  54
MixtureLSTM   32              1024        200               0.682  0.130  350
MixtureLSTM   32              512         200               0.680  0.128  381
MixtureLSTM   64              512         100               0.677  0.125  334
Pooling       64              512         300               0.478  0.131  211
Pooling       64              1024        300               0.478  0.133  167
Pooling       64              2048        300               0.477  0.133  155
Pooling       64              2048        200               0.476  0.132  106
Pooling       64              1024        100               0.476  0.132  65

We support the generalization of these results with the histogram of MRRs measured from 22 training sessions between May 26th, 2020 and May 28th, 2020, shown in Figure 6 and taken from our live dashboard's performances tab. The input data given to our 1D CNN model is live evolving data from a window of the previous four days. We see a steady MRR performance over time of 0.69 ± 0.03, in line with the results above, when considering an error of three standard deviations.

Figure 6: Distribution of MRR for our 1D CNN model, measured from May 26th, 2020, to May 28th, 2020, across 22 training sessions. Taken from our dashboard (performances tab) on May 28th, 2020.
4.2 Analysis of article embeddings' quality
In Figure 7, we analyze the embedding space of our 1D CNN with an embedding size of 32 dimensions, using the TSNE algorithm to project it into a 3D space. In Figure 7a, we took the data of May 20th, 2020, for five subjects: world, politics, culture, opinion and economy. We did not consider the society subject, whose scope is vast and too similar to many other subjects; it would have made the figure harder to read. We also omitted the lecture and lifestyle subjects because they usually contain only a couple of articles each. By colouring each article by subject, we notice clusters linked to the article's subject. The existence of these clusters indicates that the model has learned relevant representations for the articles. For instance, we distinguish the opinion cluster (blue) and the culture cluster (green) on the left and right of the figure, respectively. We argue that this is due to their different nature in writing style and subject, which attracts different readers. We also see the politics cluster (red) in the centre, near the world cluster (gray) and the economy cluster (orange). We argue that these three subjects are closely related and have a similar writing style, which attracts similar readers. Since we integrated this view into our dashboard, we can further confirm that similar cluster patterns emerge almost every day (see Figure 7b).

We know that the model learns the embeddings from our collaborative data; thus, they also partly integrate the influence of the articles' locations on our website. As future work, we plan to use this embedding viewer and its dynamics as a management tool for our website display.

Figure 7: Embedding projection of our article embeddings, taken from our custom dashboard (embeddings tab), projected in 3 dimensions with TSNE and coloured for 5 main sections of our website: world, politics, culture, opinion and economy. (a) May 20th, 2020. (b) May 26th, 2020.
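The projection behind this view can be sketched as follows. Reading the embedding table through Spotlight's internal `_net.item_embeddings` attribute and the `article_subjects` mapping from item index to subject are assumptions about implementation details rather than a documented API.

```python
import plotly.express as px
from sklearn.manifold import TSNE

def embedding_figure(model, article_subjects: dict):
    """Project the learned article embeddings into 3D with TSNE, coloured by subject."""
    # Assumption: the fitted Spotlight model exposes its item embedding table
    # through the internal network object `_net.item_embeddings`.
    weights = model._net.item_embeddings.weight.detach().cpu().numpy()
    coords = TSNE(n_components=3, random_state=42).fit_transform(weights)
    subjects = [article_subjects.get(i, "other") for i in range(len(coords))]
    return px.scatter_3d(x=coords[:, 0], y=coords[:, 1], z=coords[:, 2], color=subjects)
```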
4.3 Analysis of fake-reader recommendations
We further tested and analyzed our recommender system's recommendations with the user interface shown in Figure 8, taken from our custom dashboard. On this interface, we have a tool to send our model a list of selected articles as input and to receive its recommendations. We designed this experiment to assess the quality of our recommendations: we selected two articles from each of the five previously selected subjects (world, politics, culture, opinion and economy), submitted them to the system and analyzed the five recommended articles. The results are in Table 2. First, they show that our model tends to recommend articles from the same subject about half the time in our samples (13 out of 25 articles). We see strong links between the recommended and submitted articles. For instance, in the world sample, both submitted articles are about the COVID-19 pandemic and the last one is related to Trump; among the recommended articles, we received a complete list about COVID-19, with the fourth recommendation also about Trump. Moreover, in the culture sample, the last submitted article is about the TV show "Occupation Double", and both the first and fourth recommendations are about the same show ("OD"). Second, it is also interesting that not all recommended articles come from the same subject, which gives some serendipity to our recommendations. From our observations on our website, we noted that the model tends to follow a good recommendation strategy by recommending a list of articles mixing top-viewed ones and subject-related ones.

Figure 8: Example of our fake-user test on our custom dashboard, done on May 20th, 2020. We use the first dropdown list to filter articles by subject, which has a pre-assigned colour (e.g. red for politics). With the second dropdown, we can select the articles consulted by the fake user. The result is a list of five recommended articles displayed with subject colours at the bottom.

Table 2: Five fake-user tests done on May 20th, 2020. We send two submitted articles as a fake list of previously read articles to our recommender system and receive five recommended articles. The article titles are in French; the section of each recommended article is given in parentheses, and the result counts how many of the five recommendations share the submitted subject.

Economy (result: 3/5)
Submitted: 1. L'intelligence artificielle au service du transport de conteneurs; 2. Forte baisse des ventes de maisons au Québec
Recommended: 1. La COVID-19 fera fléchir les prix immobiliers (Economy); 2. Situation préoccupante pour les ménages très endettés (Economy); 3. Une deuxième vague «inévitable» au Canada, disent les experts (Society); 4. Masque sanitaire et burqa: une insulte à l'intelligence (Opinion); 5. Refroidissement immobilier (Economy)

Culture (result: 3/5)
Submitted: 1. Décès de la comédienne Michelle Rosignol; 2. Occupation double: rattrapé par la réalité des beaux sentiments
Recommended: 1. Après OD, la vie (Culture); 2. Emmener Google au théâtre, et vice versa (Culture); 3. F1: Lawrence Stroll met la barre haute pour Aston Martin dès 2021 (World); 4. Aimer résister à Occupation double (Culture); 5. Les mots de l'année (6/6): «Fake news», les vraies fausses nouvelles (Society)

World (result: 2/5)
Submitted: 1. Quel est le bilan véritable de la pandémie de COVID-19?; 2. Un président qui défie la science
Recommended: 1. Coronavirus: un «mini-Schengen» se prépare en Europe pendant que d'autres pays se referment (World); 2. Le masque non médical protège-t-il celui qui le porte? (Society); 3. Interdit ou pas avec le déconfinement? (Society); 4. L'«incompétence» de Pékin a provoqué une «tuerie de masse mondiale», selon Trump (World); 5. Les libéraux n'ont pas respecté leurs promesses, accuse Blanchet (Politics)

Opinion (result: 2/5)
Submitted: 1. L'éclatant succès de Taïwan; 2. Élèves abandonnés, parents épuisés
Recommended: 1. Quel est le bilan véritable de la pandémie de COVID-19? (World); 2. Le masque non médical protège-t-il celui qui le porte? (Society); 3. D'égal à égal, le Québec, 40 ans plus tard? (Opinion); 4. Tout est affaire de décor pendant le confinement (Society); 5. Référendum 1980 – l'étrange campagne de sécurisation (Opinion)

Politics (result: 3/5)
Submitted: 1. Une nouvelle aide fédérale pour les PME; 2. Le Québec déplore 51 nouveaux décès dus à la COVID-19
Recommended: 1. Interdit ou pas avec le déconfinement? (Society); 2. Pincez-moi, Docteur Horacio, je rêve... (Opinion); 3. Feu vert pour la réouverture des commerces à Montréal (Politics); 4. Trois artères de Rosemont-La Petite Patrie fermées aux voitures (Politics); 5. La frontière entre le Canada et les États-Unis reste fermée jusqu'au 21 juin (Politics)
4.4 Traffic benchmark
We developed a script that repeatedly sends real reader identification numbers to the serving VM to benchmark the maximum traffic supported by our current configuration; a simplified sketch of this script is given at the end of this subsection. With each received recommendation list, we apply a rule-based decision process to simulate the click rate: if the recommendations come from the model (model), we click on any recommendation with a probability of 1/2; if we recommended the top-viewed articles of the last 30 minutes (top), we click with a probability of 1/4. We fixed the number of readers per second at 3, with a pool of 2000 readers. The fake A/B testing results appear on our dashboard's current state tab in Figure 5, with highly significant T-test values, and the CTR distributions are very close to our rule-based decision process. We also measured the time elapsed between sending a request and receiving the response, displayed in the histogram of Figure 9. Our recommendations are served in three seconds on average, which is acceptable since we feed our recommendation box asynchronously: knowing that our articles' average reading time is close to one minute and that the box appears halfway down the article, we have enough time to fill it. We also looked at the correlation between the order in which we sent the requests and the time lapses, and we report a correlation of less than 0.05. Since our morning peak hour has about 2.5 readers per second, our current configuration is ready to feed our website in real time. Compared to Garcin et al. [5], who reported a response time of 30 ms with a Java EE architecture, our system is slow. We argue that this is due to our serving VM being written in Python, serving with Flask through Gunicorn (known to be slower than Java), and to our deep learning models being served on CPU, which is a deliberate design choice. Since we meet all our constraints, we leave the optimization of the service response time to future work.

Figure 9: Histogram of time lapses before receiving recommendations. This traffic simulation is done with 2000 readers and a rate of 3 readers per second.
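A simplified version of this benchmarking script is sketched below. The endpoint URL, the response fields (`strategy`, `articles`) and the synthetic reader pool are assumptions, and a real run would use a concurrent client to sustain the target rate when latencies exceed the sending interval.

```python
import random
import time

import numpy as np
import requests

SERVING_URL = "https://serving.example.internal/recommendations"  # hypothetical endpoint

def simulate(reader_pool_size: int = 2000, readers_per_second: int = 3, duration_s: int = 600):
    """Send recommendation requests for random reader IDs and simulate clicks."""
    readers = [str(i) for i in range(reader_pool_size)]  # stand-ins for real reader IDs
    latencies = []
    for _ in range(duration_s * readers_per_second):
        start = time.time()
        resp = requests.get(SERVING_URL, params={"reader_id": random.choice(readers)},
                            timeout=30).json()
        latencies.append(time.time() - start)
        # Rule-based click simulation: 1/2 for model recommendations, 1/4 for top articles.
        click_prob = 0.5 if resp.get("strategy") == "model" else 0.25
        if random.random() < click_prob:
            requests.post(SERVING_URL + "/click",
                          json={"article_id": random.choice(resp["articles"])}, timeout=30)
        # Naive pacing; kept sequential here for brevity.
        time.sleep(max(0.0, 1.0 / readers_per_second - (time.time() - start)))
    print("mean latency:", round(float(np.mean(latencies)), 2), "s")
```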
5 LIMITATIONS AND FUTURE WORKS
In our case study, many aspects have limitations and need further improvement: the restriction to short-term recommendations, the small grid-search optimization, the offline performance metrics (MRR and P@5) and the cold-start issue. We chose to work on short-term recommendations to improve the scalability of our system; in future developments, we will include long-term recommendations by adding a content-based approach. We limited ourselves to a small grid search to optimize our models' hyperparameters: we chose a small set of values for each hyperparameter, based on insights from our pre-experiments, to train the models in a reasonable amount of time. We obtained good results, and we hope that other researchers will try their approaches on our released dataset to push our state of the art. While the combination of MRR and P@5 is relevant, the first only measures the position of the first relevant item in the list, while the second is the proportion of relevant items in the whole list. Since we chose these metrics because we currently lack a specific relevance measurement, we plan to extract the reading time of articles from Matomo to compute the relevance of an article for a given reader; we will then compute NDCG@5, which is a better metric to evaluate our offline performance. Finally, we also face the cold-start issue, which we did not address directly. Nevertheless, because of our short-term recommendations, we designed the recommendation box to appear only inside articles; therefore, readers coming to our website for the first time will still get recommendations when they visit articles.

We also have limitations with our embedding space study and our fake-user test study: their main limitation is their generalization. However, we argue that both studies are complementary and insightful, indicating that the model learns both relevant embeddings and relevant recommendations. We also observed the same patterns in the embedding space for both May 20th, 2020, and May 26th, 2020.

6 CONCLUSION
To conclude, we presented a case study of our cost-effective and production-ready deep news recommender system architecture built with open-source and cloud technologies. We designed it with two virtual machines (a serving VM and a training VM) and with a limit on the number of days of data, to meet our cost, scalability, training time and serving time constraints. With a grid-search approach, we found that the optimal model was the 1D CNN, reaching an MRR of 0.693 and a P@5 of 0.133 in only 134 seconds of training. We release an anonymized version of our dataset to promote the reproducibility of our results. In our architecture, the model is trained from scratch in many training sessions distributed according to our website traffic. We estimated that this strategy saves around 80% of our total cost for the recommender system, which comes to less than 4 $US per day; compared to commercial solutions costing thousands of dollars per month, this saving rises close to 98%. We also evaluated our system with two further studies: using our custom monitoring dashboard, we observed a high relevance of our embeddings and recommendations based on two complementary qualitative studies, the embedding space study and the fake-user test study. Finally, we demonstrated the readiness of our system with a traffic simulation. We hope our affordable and robust design inspires other online media companies to consider developing their own recommender systems to stay competitive in the digital news market.

REFERENCES
[1] Kevin G. Barnhurst. 2011. The new "media affect" and the crisis of representation for political communication. The International Journal of Press/Politics 16, 4 (2011), 573–593.
[2] Harold S. Black. 1953. Modulation Theory. Van Nostrand.
[3] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. 191–198.
[4] Marc Edge. 2014. Newspapers' annual reports show chains profitable. Newspaper Research Journal 35, 4 (2014).
[5] Florent Garcin and Boi Faltings. 2013. PEN recsys: A personalized news recommender systems framework. In Proceedings of the 2013 International News Recommender Systems Workshop and Challenge. 3–9.
[6] Florent Garcin, Boi Faltings, Olivier Donatsch, Ayar Alazzawi, Christophe Bruttin, and Amr Huber. 2014. Offline and online evaluation of news recommender systems at swissinfo.ch. In Proceedings of the 8th ACM Conference on Recommender Systems. 169–176.
[7] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[8] Plotly Technologies Inc. 2015. Collaborative data science. Montreal, QC. https://plot.ly
[9] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 (2016).
[10] Mozhgan Karimi, Dietmar Jannach, and Michael Jugovac. 2018. News recommender systems – survey and roads ahead. Information Processing & Management 54, 6 (2018), 1203–1227.
[11] Maciej Kula. 2017. Mixture-of-tastes models for representing users with diverse interests. arXiv preprint arXiv:1711.08379 (2017).
[12] Maciej Kula. 2017. Spotlight. https://github.com/maciejkula/spotlight
[13] Azi Lev-On. 2012. Communication, community, crisis: Mapping uses and gratifications in the contemporary media environment. New Media & Society 14, 1 (2012), 98–116.
[14] Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. 2010. Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces. 31–40.
[15] Stephan A. Miller. 2012. Piwik Web Analytics Essentials. Packt Publishing Ltd.
[16] Itishree Mohallick and Özlem Özgöbek. 2017. Exploring privacy concerns in news recommender systems. In Proceedings of the International Conference on Web Intelligence. 1054–1061.
[17] Harry Nyquist. 1928. Certain topics in telegraph transmission theory. Transactions of the American Institute of Electrical Engineers 47, 2 (1928), 617–644.
[18] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).
[19] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS 2017 Workshop on Autodiff.
[20] Alper Sarikaya, Michael Correll, Lyn Bartram, Melanie Tory, and Danyel Fisher. 2018. What do we talk about when we talk about dashboards? IEEE Transactions on Visualization and Computer Graphics 25, 1 (2018), 682–692.
[21] Paul Starr. 2012. An unexpected crisis: The news media in postindustrial democracies. The International Journal of Press/Politics 17, 2 (2012), 234–242.
[22] Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In Twenty-Second International Joint Conference on Artificial Intelligence.
[23] Dwayne Winseck. 2010. Financialization and the "crisis of the media": The rise and fall of (some) media conglomerates in Canada. Canadian Journal of Communication 35, 3 (2010).
[24] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys 52, 1 (2019), 1–38.