Deploying a Cost-Effective and Production-Ready Deep News Recommender System in the Media Crisis Context

Jean-Philippe Corbeil, Le Devoir, Canada, jpcorbeil@ledevoir.com
Florent Daudens, Le Devoir, Canada, fdaudens@ledevoir.com

ABSTRACT
In the current context of the media crisis, online media companies need cost-effective technological solutions to stay competitive against the huge monopolistic software companies that massively feed content to users. News recommender systems are well-suited solutions, even if current commercial offerings are well above most online media's budget. In this paper, we present a case study of our deployed deep news recommender system at Le Devoir, an independent French Canadian newspaper in the province of Quebec. We expose the software architecture and the issues we met, along with their solutions. Furthermore, we present four qualitative and quantitative analyses done with our custom monitoring dashboard: offline performance of our models, embedding space analysis, fake-user testing and high-traffic simulations. For a tiny fraction of the price of the available commercial solutions, our simple software architecture, based on Docker, Kubernetes and open-source technologies in the cloud, has proven easy to maintain, scalable and cost-effective. It also shows excellent offline performance and generates high-quality embeddings as well as relevant recommendations.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
Recommender System, News Recommendations, Sequential Recommendation, Media Crisis, Dashboard, Cloud Technology

Reference Format:
Jean-Philippe Corbeil and Florent Daudens. 2020. Deploying a Cost-Effective and Production-Ready Deep News Recommender System in the Media Crisis Context. In 3rd Workshop on Online Recommender Systems and User Modeling (ORSUM 2020), in conjunction with the 14th ACM Conference on Recommender Systems, September 25th, 2020, Virtual Event, Brazil.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 INTRODUCTION
In the last couple of decades, newspapers have seen their world changed by the digital shift in the news market [4, 21, 23]. From printed newspapers to online articles, readers' needs have also shifted from the static paper format to the fast, dynamic and well-synthesized display of online news on mobile devices [1, 13]. Moreover, we have observed the colossal impact, both financially and on readers' behaviour, of intermediary platforms such as news aggregators and social networks that massively feed news content to readers in a personalized fashion.

In this context, online media companies need digital tools to retain readers on their platforms and to support their conversion goals. We can address these issues by feeding a personalized list of contents to readers, which makes news recommender systems the perfect solution. Nevertheless, most commercial data solutions, such as Google Analytics 360 and Google Recommendations AI, are far beyond what most newspapers can afford. Can we build a production-ready and cost-effective deep news recommender system for news articles that leverages cloud and open-source technologies? At Le Devoir, an independent French Canadian newspaper in the province of Quebec, the conception and deployment of this recommender system is part of our digital shift plan. It is also part of our strategy to reach our marketing goals by offering tailored editorial content to our readers.

In this paper, our contributions are:
• The first case study on the deployment of a production-ready cloud architecture for a recommender system using Docker and a continuous integration and continuous deployment (CI/CD) production cycle.
• The design of a cost-effective and scalable deep news recommender system oriented toward short-term recommendations.
• The demonstration of considerable offline performance while meeting our constraints: cost, scalability, training time and serving time.
• The design of two qualitative experiments to assess the quality of a news recommender system before going online: the embedding quality test and the fake-user recommendation test.
• The design of a monitoring dashboard for our recommender system.

In the next section, we discuss previous work related to our news recommender system. Then, we elaborate on our system architecture by explaining all aspects of our methodology: data processing, model training, recommendation delivery and monitoring. Next, we discuss our models' offline results, two qualitative validation methods and the traffic benchmark of our system. Afterwards, we discuss our system's limitations and future work. We end with a conclusion summarizing our whole approach.

2 BACKGROUND
Garcin and Faltings (2013) [5] presented PEN recsys, a framework for personalized news recommender systems built with Java EE. They presented five models in an A/B testing setup: context-tree, most-popular, content-based, collaborative and random recommendations. They had a web-based control panel to monitor performance and modify some parameters, and they demonstrated the ability to deliver recommendations within 30 ms even at visit peaks. We followed many aspects of their framework: an A/B testing setup with several models, a monitoring dashboard and a traffic benchmark. However, several aspects of their approach are questionable given the recent state of the art in software engineering. First, current practice in software maintainability is far beyond their Java EE setup; we use cloud-based and open-source technologies with a CI/CD pipeline triggered from our GitHub repositories. Second, their web-based control panel was very minimalist and did not follow any design principles. We built our user interface on the dashboard principles of Sarikaya et al. (2018) [20], dividing it into six tabs targeting specific monitoring and decision-making goals, and we included features like complete monitoring visualizations, actionable widgets and live tests of the models. Third, the sequential nature of news recommendation is known to require deep sequential architectures [24]; our collaborative deep learning models are based on PyTorch [19] and take sequences as input. We also prepared an A/B testing setup to test our models against most-popular and random recommendations, and, as future work, we plan the conception of a content-based approach and its hybridization with our collaborative approach. Finally, we discuss the financial impact of our system in terms of costs for our newspaper company.
Karimi et al. (2018) [10] reviewed the state of the art in news recommender systems and made several suggestions. They noticed that the scalability of systems is often an issue despite the maturity of storage systems, an issue also reported for deep learning architectures by Zhang et al. (2019) [24]. Instead of following their recommendation of using continuous learning, we suppressed this issue by designing our system on a few-days basis. With this design choice, we removed the model's dependency on the continual indexation of both items and users into a stable model. Second, a major issue concerns the dynamic addition of publications throughout the day and the need to consider these articles quickly: Karimi et al. mentioned the need to incorporate new articles within minutes to benefit from their momentum and obtain a high click-through rate. We followed this recommendation with many quick training sessions within one hour. They also proposed a hybrid solution mixing efficient short-term predictions with sophisticated long-term predictions, in line with the previous work of Liu et al. [14]. We adopted this strategy in our system; in the current paper, we focus on short-term predictions. The authors also criticized the lack of reproducibility in the news recommendation domain, where many datasets are proprietary. We addressed this issue by releasing the anonymized collaborative dataset used in our experiments. Finally, they mentioned the lack of correspondence between offline and online results demonstrated by Garcin et al. (2014) [6]. In our case, we report the results on our released dataset, with further results from training the model many times on evolving data. Despite relying only on offline results, our architecture is very flexible: even if such correspondence does not hold when we start to provide recommendations online, our custom dashboard contains widgets to modify the models, their hyperparameters and the recommendation strategies in an A/B testing setup. Furthermore, we designed our system with a CI/CD pipeline enabling deeper modifications to be released within the next ten minutes without any downtime.

Mohallick and Özgöbek (2017) [16] analyzed privacy in news recommender systems. While many reported news recommender systems rely on personal data, our system is in line with privacy principles: we do not use more data than the information required by a collaborative algorithm, and we only need to activate Matomo's opt-out option on our website to be fully GDPR compliant.

3 METHODOLOGY
3.1 Architecture
We divided our system into two segments: the analytics part and the recommender system part. These two parts are illustrated in Figure 1.

Figure 1: Diagram of our recommender system architecture.

The analytics side contains two technical components: the analytics platform Matomo [15] (previously Piwik) and the website, in our case ledevoir.com. As a first step, readers come to our website, and we track their page views anonymously. We respect our readers' privacy by assigning a random visitor ID at the first visit and by keeping no specific personal information¹. Only an IP address truncated to 2 bytes, which is very general information, is recorded, and no more than three months of detailed data is available locally. At Le Devoir, the marketing department manages any other information linked to user accounts in a separate CRM system. As a second step, we added the recommender system part to the loop to propose a personalized list of five articles.

Overall, we designed our recommender system with five essential components: a serving virtual machine, a training virtual machine, a GCP bucket, a MongoDB database and a monitoring dashboard. We explain the interactions between all these components in the following sections. We also address specific design considerations.

¹ We followed Matomo's guideline on privacy: https://matomo.org/docs/privacy/.

3.2 Design considerations
Our recommender system's design must respect four significant constraints: cost, scalability, serving time and training time. Thus, we made two major design choices: split our architecture into two virtual machines and limit the number of days of data.

We chose to split our architecture into two virtual machines (VM): one serving VM and one training VM. The first VM is the master VM: it is always online, and it serves recommendations to our readers. The second VM's only job is to train the model. We made this design choice to reduce cost, since the Graphics Processing Unit (GPU) needed to train the model in a reasonable amount of time is the most expensive part of the system. With this choice, and based on the GCP Pricing Calculator, we estimated savings of nearly 80% of our total cost considering 5.5 hours of training per day. This training time is split into training sessions of about 7 to 10 minutes, with, on average, only 4.8 minutes spent training the model (taken from our dashboard's current state and model tabs between May 29th, 2020 and May 31st, 2020). Thus, we can have a maximum of 6 training sessions per hour of the day, leading to model updates at the same frequency on the serving VM. We set the number of training sessions to follow the average daily traffic on our website, shown in Figure 2, with some delay. We apply this strategy to follow the principle of training more often when more new data are available.

Figure 2: Interactive plot taken from our dashboard with the daily number of training sessions by hour of the day.

Second, we selected a maximum number of days from which we train and recommend articles. This design choice keeps the amount of input data steady and stabilizes the system's training time. Furthermore, it reduces the system's dependency on keeping the previously trained model and maintaining the indexations, leading to a scalable architecture with a simpler sequential model that can be retrained from scratch every time. We fixed this number at four days, knowing that our articles' active lifespan is at most two days, following the Nyquist rate [2, 17]. On average, we then have around 1.9 million data points for each training session (taken from our dashboard's model tab from May 29th, 2020 to May 31st, 2020).

3.3 Data analytics
To develop a news recommender system, we needed the right analytics platform to gain the necessary insight into our data. Two major drawbacks of the widely used Google Analytics 360 are its cost and its data sampling. We solved both issues by implementing the open-source Matomo Analytics Platform [15] with a MySQL database. It is freely available, has a great community and is easily deployable on any cloud computing virtual machine. Thus, we deployed the Matomo Analytics Platform on Google Cloud Platform (GCP) for our high-traffic website.
3.4 Data pipeline
From Matomo's database, we designed an intermediary MongoDB database on MongoDB Atlas. This database is a crucial piece of our design because our website's traffic heavily solicits our central database. This secondary database holds around a week of already pre-processed data, and it is fed every minute by cron jobs on the serving VM. Because Matomo records noisy data, such as 404 URLs and pages that are not articles, we had to filter our data: the pipeline keeps only the records whose URL matches the article format and validates their title before dumping the results into MongoDB. We also separated the article records (URL, title and ID) from the visit records (ID, article ID, reader ID and timestamp) to maintain a lower memory usage. Even with this architecture, downloading the whole dataset at training time was slow on the training machine, around 10 minutes. This downtime can be costly on a virtual machine with a GPU, such as our training VM. Our solution was to pre-dump the data progressively into a Google Cloud Bucket and download this single file instead, which brought the download time below one minute. We kept MongoDB in this part of the design for real-time access, to accelerate the pre-processing cron jobs and to maintain our data integrity.
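To make the filtering step concrete, here is a minimal sketch of such a cron job in Python. The collection names (`articles`, `visits`), the article-URL pattern, the connection string and the shape of the raw Matomo page views are assumptions for illustration; our production job is more involved.

```python
import re
from datetime import datetime, timedelta

from pymongo import MongoClient, UpdateOne

# Hypothetical URL pattern and connection string, for illustration only.
ARTICLE_URL = re.compile(r"^https://www\.ledevoir\.com/[a-z-]+/\d+/[\w-]+$")
db = MongoClient("mongodb+srv://<atlas-cluster>/")["recsys"]

def is_article(view: dict) -> bool:
    """Keep only page views that look like real articles and carry a valid title."""
    return bool(ARTICLE_URL.match(view["url"])) and bool(view.get("title", "").strip())

def sync_last_minute(raw_views: list):
    """Filter raw Matomo page views and store them into the two collections."""
    article_ops, visit_docs = [], []
    for view in filter(is_article, raw_views):
        article_ops.append(UpdateOne({"_id": view["article_id"]},
                                     {"$set": {"url": view["url"], "title": view["title"]}},
                                     upsert=True))
        visit_docs.append({"article_id": view["article_id"],
                           "reader_id": view["reader_id"],
                           "timestamp": view["timestamp"]})
    if article_ops:
        db.articles.bulk_write(article_ops)
    if visit_docs:
        db.visits.insert_many(visit_docs)
    # Keep the secondary database small: drop visits older than one week.
    db.visits.delete_many({"timestamp": {"$lt": datetime.utcnow() - timedelta(days=7)}})
```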
3.5 Models
At the core of our system, we implemented the Spotlight Python library by Kula [12]. It contains many state-of-the-art deep sequential recommender system architectures in PyTorch [19]. We used its sequential models, which include four neural models: 1D CNN [9, 18], LSTM [7], MixtureLSTM [11] and the Pooling model [3].

We optimized our models with actual offline data from May 7th, 2020, to May 11th, 2020, containing 1,944,719 data points. We made an anonymized version of this dataset available on our GitHub² to promote our results, and we encourage the community to improve them with better collaborative models. The data contains records with an anonymized reader identification number, an anonymized article identification number and a timestamp.

² https://github.com/LeDevoir/orsum2020_collaborative_datasets

In our experimentation, we fixed some parameter values according to both our pre-experiments and Spotlight's documentation: the number of epochs to 10, the learning rate to 1e-2, the random state to 42, and no regularization. We used the adaptive hinge loss function [22]. We ran our experiments on an NVIDIA RTX 2070 GPU.

With Spotlight's sequence parser, we parse all sequences of articles with a minimum of 3 and a maximum of 7 articles per reader; every sequence is also padded up to 7 articles. We chose these bounds, first, to ensure that the sequences contain a minimum of relevant articles and, second, to limit the length of the model's input. For the lower bound of three articles, a significant part of our traffic consults only one or two articles and does not come back; we do not aim to recommend articles to this type of reader and prefer to serve our core readers first. For the upper bound, we did pre-experimentation, and seven articles seemed a reasonable length. Spotlight's documentation suggests five items as an upper bound, but this is short given our lower bound.

Since we have a large number of sequences, we did our validation with a training/testing split of 90%/10%, which resulted in a train set of 215,200 samples and a test set of 22,625 samples. We separated these sets by user, so readers in the train set and the test set are mutually exclusive. We optimized all four models over the following set of hyperparameters with a grid-search approach (a minimal sketch of this setup follows the list):
• Batch size = { 512, 1024, 2048, 4096, 8192 }
• Embedding size = { 32, 64 }
• Number of negative samples = { 100, 200, 300 }
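For illustration, a minimal Spotlight training sketch matching the settings above could look as follows. The CSV column names, the integer-encoded identifiers and the single grid point are assumptions rather than our exact production code.

```python
import numpy as np
import pandas as pd
from spotlight.interactions import Interactions
from spotlight.cross_validation import user_based_train_test_split
from spotlight.sequence.implicit import ImplicitSequenceModel
from spotlight.evaluation import sequence_mrr_score

# Assumed layout of the anonymized dataset: reader_id, article_id, timestamp.
df = pd.read_csv("collaborative_dataset.csv")
interactions = Interactions(user_ids=df["reader_id"].values.astype(np.int32),
                            item_ids=df["article_id"].values.astype(np.int32),
                            timestamps=df["timestamp"].values)

# 90%/10% split by user, so train and test readers are mutually exclusive.
train, test = user_based_train_test_split(interactions, test_percentage=0.1,
                                          random_state=np.random.RandomState(42))
# Sequences of 3 to 7 articles per reader, padded up to length 7.
train_seq = train.to_sequence(max_sequence_length=7, min_sequence_length=3)
test_seq = test.to_sequence(max_sequence_length=7, min_sequence_length=3)

# One grid point; in practice we loop over batch size, embedding size and negatives.
model = ImplicitSequenceModel(representation="cnn", loss="adaptive_hinge",
                              embedding_dim=32, batch_size=2048,
                              num_negative_samples=300, n_iter=10,
                              learning_rate=1e-2, use_cuda=True,
                              random_state=np.random.RandomState(42))
model.fit(train_seq, verbose=True)
print("MRR:", sequence_mrr_score(model, test_seq).mean())
```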
3.6 Model dumping
Once the model is trained on the training virtual machine, we dump its weights and configuration into our Google Cloud Bucket. Then, the training machine notifies the serving machine over HTTPS to fetch the new model and make it ready to serve. Finally, the training machine can turn off until the serving machine calls the next training session.
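A minimal sketch of this hand-off is shown below. The bucket name and the serving endpoint are hypothetical, and authentication as well as error handling are omitted.

```python
import torch
import requests
from google.cloud import storage

BUCKET = "ledevoir-recsys-models"                        # hypothetical bucket name
SERVING_URL = "https://serving.example.internal/reload"  # hypothetical endpoint

def dump_and_notify(model, version: str):
    """Upload the trained model to the bucket, then ask the serving VM to reload it."""
    local_path = f"/tmp/model-{version}.pt"
    # Pickling the whole Spotlight model is the simplest option; saving only the
    # state_dict plus a configuration dictionary is a leaner alternative.
    torch.save(model, local_path)

    blob = storage.Client().bucket(BUCKET).blob(f"models/model-{version}.pt")
    blob.upload_from_filename(local_path)

    # The serving VM fetches the new blob and swaps it in without downtime.
    response = requests.post(SERVING_URL, json={"version": version}, timeout=30)
    response.raise_for_status()
```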
3.7 Continuous Integration and Continuous Deployment Pipeline
We followed the continuous integration and continuous deployment (CI/CD) principle from software engineering to ease the maintainability of our system and to ensure zero downtime. Our pipeline is illustrated in Figure 3. We host our recommender system in a private GitHub repository. A trigger watches for updates on the master branch and automatically calls our continuous integration (CI) platform, CircleCI, on this event. The CI runs the recommender system's unit tests, which cover our code at 100%, to ensure strict control of our builds. If all the tests pass, the CI builds a Docker image from our repository and registers it on the Google Container Registry (GCR). Finally, the CI deploys the new image into our Kubernetes cluster³ with zero downtime by applying a rolling update. The serving VM and the training VM are handled similarly by two different CI/CD pipelines linked to their respective GitHub master branches; the only difference is that the training image runs as a Kubernetes Job on its cluster, since we run it on demand.

Figure 3: Diagram of the CI/CD pipeline of the recommender system.

³ Kubernetes is a system for automating the deployment, scaling and management of Docker containers.

3.8 Online validation strategy
To validate the model's effectiveness once we launch it online, we prepared an A/B testing setup inside the recommender system by monitoring the Click-Through Rate (CTR) distributions. In this A/B test, we compare our model's recommendations to the actual top-5 article suggestions given to readers in a box on the website. Our model's recommendations are given in a similar box with the same disposition (see Figure 4). Half of the readers get the model's recommendations, and the other half get recommendations using the best articles of the last 30 minutes. By applying Student's T-test, we can measure the relevance of one recommendation method over the other, as sketched at the end of this subsection. Moreover, we can change the A/B testing settings with our custom dashboard.

Figure 4: Implementation on our website of the recommendation box "VOS RECOMMANDATIONS" on the right.
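The significance test behind this A/B setup can be sketched as follows with SciPy, assuming per-reader CTR arrays pulled from MongoDB for each arm; the simulated inputs at the bottom are only for illustration.

```python
import numpy as np
from scipy import stats

def ab_test(ctr_model: np.ndarray, ctr_top: np.ndarray, alpha: float = 0.05) -> dict:
    """Compare per-reader click-through rates of the two arms with a Student's T-test."""
    t_stat, p_value = stats.ttest_ind(ctr_model, ctr_top)
    better = "model" if ctr_model.mean() > ctr_top.mean() else "top-30-minutes"
    return {"t": t_stat, "p": p_value,
            "significant": p_value < alpha, "better_arm": better}

# Example with simulated per-reader CTRs for the two halves of the traffic.
rng = np.random.default_rng(42)
print(ab_test(rng.binomial(1, 0.5, 1000).astype(float),
              rng.binomial(1, 0.25, 1000).astype(float)))
```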
3.9 System monitoring
Many metrics and variables are recorded in MongoDB to monitor and control the system once online. This monitoring is done with a custom dashboard made with Dash by Plotly [8]. Our prototype has six tabs: current state, performances, model, execution settings, embedding space viewer and model testing interface; a minimal layout sketch is given at the end of this subsection. Figure 5 illustrates the current state tab. We use it to monitor the training VM status, which is useful for detecting errors during training, and to see the current A/B testing results with two histograms and their T-test values. The performances tab is designed to monitor the current model's offline performance (MRR and P@5, see Section 4.1). The model tab displays the training time and the number of input data points across time for monitoring purposes. The execution tab has many variables to manipulate and modify the system: we can shut down the recommendations, change the A/B testing settings, change the number of recommendations and interact with a plot to set the number of training sessions per hour of the day. The embeddings tab (see Figure 7) is a live 3D plot of the embedding space generated by the trained model, compressed to 3 dimensions with the TSNE algorithm. The test tab (see Figure 8) contains an interface to test the recommendations of the current model: it lists all current articles, from which we make a selection; this selection is then sent to the server as the list of articles previously consulted by a fake reader, and the model's recommendations are sent back. The last two tabs are useful tools to evaluate the model's recommendations (see Sections 4.2 and 4.3).

Figure 5: Example of our custom dashboard's user interface in French: the current state tab ("état actuel") of our recommender system. We see our six navigation tabs: current state, performances, model, execution settings, embedding space viewer and model testing interface. In the first box below the navigation, two indicators show the training status (last training status and model reloading status). At the bottom, two CTR histograms help us visualize the current A/B testing result, with their T-test values below; these are the results of our traffic benchmark (see Section 4.4).
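A minimal skeleton of this layout with Dash could look as follows. The tab labels match our prototype, while the callbacks and the data access backing each tab are omitted.

```python
from dash import Dash, dcc, html

app = Dash(__name__)

# One dcc.Tab per monitoring goal; each tab's content is rendered by callbacks
# that read metrics and settings from MongoDB.
app.layout = html.Div([
    html.H2("Recommender system dashboard"),
    dcc.Tabs(id="tabs", value="current-state", children=[
        dcc.Tab(label="Current state", value="current-state"),
        dcc.Tab(label="Performances", value="performances"),
        dcc.Tab(label="Model", value="model"),
        dcc.Tab(label="Execution settings", value="settings"),
        dcc.Tab(label="Embedding space viewer", value="embeddings"),
        dcc.Tab(label="Model testing interface", value="testing"),
    ]),
    html.Div(id="tab-content"),
])

if __name__ == "__main__":
    app.run_server(debug=False)
```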
4 RESULTS
4.1 Offline model validation
Out of 120 possible experiments in our grid search, 12 finished with an "out of GPU memory" error, leaving us with 108 results. We kept the top-5 for each model in Table 1. By fine-tuning the models' hyperparameters, we found that the 1D CNN architecture is the best one for our task, followed by the LSTM. Since it trains 38 seconds faster than the LSTM and saves memory with smaller embeddings, we selected the CNN model as our first state-of-the-art configuration. Overall, the MixtureLSTM model takes more time to train for slightly lower results than the LSTM and the CNN. The Pooling model largely under-performs on MRR but is competitive on the P@5 metric. We found that a large number of negative samples improves the results, as does a large embedding size in general. Moreover, smaller batch sizes tend to get better results: for instance, no batch size of 8192 appears in these top-5 and only one of 4096. Most of the 20 results presented here obtain a similar P@5, which means that in a list of 5 articles we find, most of the time, one relevant article for our reader. The MRR scores indicate that, except for the Pooling model, most models suggest this relevant article as the first or second item (MRR between 1 and 0.5, respectively).

Table 1: Top-5 results for each model sorted by MRR.

Model         Embedding size  Batch size  Negative samples  MRR    P@5    Training time (s)
CNN           32              2048        300               0.693  0.133  134
LSTM          64              1024        300               0.693  0.133  172
LSTM          64              2048        300               0.693  0.133  157
LSTM          64              512         300               0.691  0.133  219
CNN           64              2048        300               0.690  0.132  156
CNN           32              1024        300               0.690  0.132  154
MixtureLSTM   32              512         300               0.689  0.131  549
LSTM          64              512         200               0.687  0.131  159
LSTM          64              2048        200               0.687  0.123  108
CNN           64              2048        200               0.687  0.131  107
MixtureLSTM   64              512         200               0.687  0.122  614
CNN           64              4096        100               0.684  0.132  54
MixtureLSTM   32              1024        200               0.682  0.130  350
MixtureLSTM   32              512         200               0.680  0.128  381
MixtureLSTM   64              512         100               0.677  0.125  334
Pooling       64              512         300               0.478  0.131  211
Pooling       64              1024        300               0.478  0.133  167
Pooling       64              2048        300               0.477  0.133  155
Pooling       64              2048        200               0.476  0.132  106
Pooling       64              1024        100               0.476  0.132  65

We support the generalization of these results with the histogram of MRRs measured from 22 training sessions between May 26th, 2020 and May 28th, 2020, shown in Figure 6 and taken from our live dashboard's performances tab. The input data given to our 1D CNN model is live evolving data from a window of the previous four days. We see a steady MRR performance over time of 0.69 ± 0.03, in line with the results above, when considering an error of three standard deviations.

Figure 6: Distribution of MRR for our 1D CNN model, measured from May 26th, 2020, to May 28th, 2020, across 22 training sessions. Taken from our dashboard (performances tab) on May 28th, 2020.
4.2 Analysis of article embeddings' quality
In Figure 7, we analyze the embedding space of our 1D CNN with an embedding size of 32 dimensions, using the TSNE algorithm to project it into a 3D space. In Figure 7a, we took the data of May 20th, 2020, for five subjects: world, politics, culture, opinion and economy. We did not consider the society subject, whose scope is vast and too similar to many other subjects; it would have made the figure harder to read. We also omitted the lecture and lifestyle subjects because they usually contain only a couple of articles each. By colouring each article by subject, we notice clusters linked to the article's subject. The existence of these clusters indicates that the model has learned relevant representations for the articles. For instance, we distinguish the opinion cluster (blue) and the culture cluster (green) on the left and right of the figure, respectively. We argue that this is due to their different nature in writing style and subject, which attracts different readers. We also see the politics cluster (red) in the centre, near the world cluster (gray) and the economy cluster (orange). We argue that these three subjects are closely related and have a similar writing style, which attracts similar readers. Since we integrated this view into our dashboard, we can further confirm that similar cluster patterns emerge almost every day (see Figure 7b).

We know that the model learns the embeddings from our collaborative data; thus, they also partly integrate the influence of the articles' locations on our website. As future work, we plan to use this embedding viewer and its dynamics as a management tool for our website display.

Figure 7: Embedding projection of our article embeddings, taken from our custom dashboard (embeddings tab), projected in 3 dimensions with TSNE and coloured for 5 main sections of our website: world, politics, culture, opinion and economy. (a) May 20th, 2020. (b) May 26th, 2020.
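The projection behind this view can be sketched as follows. Reading the embedding table through Spotlight's internal `_net.item_embeddings` attribute and the `article_subjects` mapping from item index to subject are assumptions about implementation details rather than a documented API.

```python
import plotly.express as px
from sklearn.manifold import TSNE

def embedding_figure(model, article_subjects: dict):
    """Project the learned article embeddings into 3D with TSNE, coloured by subject."""
    # Assumption: the fitted Spotlight model exposes its item embedding table
    # through the internal network object `_net.item_embeddings`.
    weights = model._net.item_embeddings.weight.detach().cpu().numpy()
    coords = TSNE(n_components=3, random_state=42).fit_transform(weights)
    subjects = [article_subjects.get(i, "other") for i in range(len(coords))]
    return px.scatter_3d(x=coords[:, 0], y=coords[:, 1], z=coords[:, 2], color=subjects)
```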
4.3 Analysis of fake-reader recommendations
We further tested and analyzed our recommender system's recommendations with the user interface shown in Figure 8, taken from our custom dashboard. On this interface, we have a tool to send our model a list of selected articles as input and to receive its recommendations. We designed this experiment to assess the quality of our recommendations: we selected two articles from each of the five previously selected subjects (world, politics, culture, opinion and economy), submitted them to the system and analyzed the five recommended articles. The results are in Table 2. First, they show that our model tends to recommend articles from the same subject about half the time in our samples (13 out of 25 articles). We see strong links between the recommended and submitted articles. For instance, in the world sample, both submitted articles are about the COVID-19 pandemic and the last one is related to Trump; among the recommended articles, we received a complete list about COVID-19, with the fourth recommendation also about Trump. Moreover, in the culture sample, the last submitted article is about the TV show "Occupation Double", and both the first and fourth recommendations are about the same show ("OD"). Second, it is also interesting that not all recommended articles come from the same subject, which gives some serendipity to our recommendations. From our observations on our website, we noted that the model tends to follow a good recommendation strategy by recommending a list of articles mixing top-viewed ones and subject-related ones.

Figure 8: Example of our fake-user test on our custom dashboard, done on May 20th, 2020. We use the first dropdown list to filter articles by subject, which has a pre-assigned colour (e.g. red for politics). With the second dropdown, we can select the articles consulted by the fake user. The result is a list of five recommended articles displayed with subject colours at the bottom.

Table 2: Five fake-user tests done on May 20th, 2020. We send two submitted articles as a fake list of previously read articles to our recommender system and receive five recommended articles. The article titles are in French; the section of each recommended article is given in parentheses, and the result counts how many of the five recommendations share the submitted subject.

Economy (result: 3/5)
Submitted: 1. L'intelligence artificielle au service du transport de conteneurs; 2. Forte baisse des ventes de maisons au Québec
Recommended: 1. La COVID-19 fera fléchir les prix immobiliers (Economy); 2. Situation préoccupante pour les ménages très endettés (Economy); 3. Une deuxième vague «inévitable» au Canada, disent les experts (Society); 4. Masque sanitaire et burqa: une insulte à l'intelligence (Opinion); 5. Refroidissement immobilier (Economy)

Culture (result: 3/5)
Submitted: 1. Décès de la comédienne Michelle Rosignol; 2. Occupation double: rattrapé par la réalité des beaux sentiments
Recommended: 1. Après OD, la vie (Culture); 2. Emmener Google au théâtre, et vice versa (Culture); 3. F1: Lawrence Stroll met la barre haute pour Aston Martin dès 2021 (World); 4. Aimer résister à Occupation double (Culture); 5. Les mots de l'année (6/6): «Fake news», les vraies fausses nouvelles (Society)

World (result: 2/5)
Submitted: 1. Quel est le bilan véritable de la pandémie de COVID-19?; 2. Un président qui défie la science
Recommended: 1. Coronavirus: un «mini-Schengen» se prépare en Europe pendant que d'autres pays se referment (World); 2. Le masque non médical protège-t-il celui qui le porte? (Society); 3. Interdit ou pas avec le déconfinement? (Society); 4. L'«incompétence» de Pékin a provoqué une «tuerie de masse mondiale», selon Trump (World); 5. Les libéraux n'ont pas respecté leurs promesses, accuse Blanchet (Politics)

Opinion (result: 2/5)
Submitted: 1. L'éclatant succès de Taïwan; 2. Élèves abandonnés, parents épuisés
Recommended: 1. Quel est le bilan véritable de la pandémie de COVID-19? (World); 2. Le masque non médical protège-t-il celui qui le porte? (Society); 3. D'égal à égal, le Québec, 40 ans plus tard? (Opinion); 4. Tout est affaire de décor pendant le confinement (Society); 5. Référendum 1980 – l'étrange campagne de sécurisation (Opinion)

Politics (result: 3/5)
Submitted: 1. Une nouvelle aide fédérale pour les PME; 2. Le Québec déplore 51 nouveaux décès dus à la COVID-19
Recommended: 1. Interdit ou pas avec le déconfinement? (Society); 2. Pincez-moi, Docteur Horacio, je rêve... (Opinion); 3. Feu vert pour la réouverture des commerces à Montréal (Politics); 4. Trois artères de Rosemont-La Petite Patrie fermées aux voitures (Politics); 5. La frontière entre le Canada et les États-Unis reste fermée jusqu'au 21 juin (Politics)
4.4 Traffic benchmark
We developed a script that repeatedly sends real reader identification numbers to the serving VM to benchmark the maximum traffic supported by our current configuration; a simplified sketch of this script is given at the end of this subsection. With each received recommendation list, we apply a rule-based decision process to simulate the click rate: if the recommendations come from the model (model), we click on any recommendation with a probability of 1/2; if we recommended the top-viewed articles of the last 30 minutes (top), we click with a probability of 1/4. We fixed the number of readers per second at 3, with a pool of 2000 readers. The fake A/B testing results appear on our dashboard's current state tab in Figure 5, with highly significant T-test values, and the CTR distributions are very close to our rule-based decision process. We also measured the time elapsed between sending a request and receiving the response, displayed in the histogram of Figure 9. Our recommendations are served in three seconds on average, which is acceptable since we feed our recommendation box asynchronously: knowing that our articles' average reading time is close to one minute and that the box appears halfway down the article, we have enough time to fill it. We also looked at the correlation between the order in which we sent the requests and the time lapses, and we report a correlation of less than 0.05. Since our morning peak hour has about 2.5 readers per second, our current configuration is ready to feed our website in real time. Compared to Garcin et al. [5], who reported a response time of 30 ms with a Java EE architecture, our system is slow. We argue that this is due to our serving VM being written in Python, serving with Flask through Gunicorn (known to be slower than Java), and to our deep learning models being served on CPU, which is a deliberate design choice. Since we meet all our constraints, we leave the optimization of the service response time to future work.

Figure 9: Histogram of time lapses before receiving recommendations. This traffic simulation is done with 2000 readers and a rate of 3 readers per second.
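A simplified version of this benchmarking script is sketched below. The endpoint URL, the response fields (`strategy`, `articles`) and the synthetic reader pool are assumptions, and a real run would use a concurrent client to sustain the target rate when latencies exceed the sending interval.

```python
import random
import time

import numpy as np
import requests

SERVING_URL = "https://serving.example.internal/recommendations"  # hypothetical endpoint

def simulate(reader_pool_size: int = 2000, readers_per_second: int = 3, duration_s: int = 600):
    """Send recommendation requests for random reader IDs and simulate clicks."""
    readers = [str(i) for i in range(reader_pool_size)]  # stand-ins for real reader IDs
    latencies = []
    for _ in range(duration_s * readers_per_second):
        start = time.time()
        resp = requests.get(SERVING_URL, params={"reader_id": random.choice(readers)},
                            timeout=30).json()
        latencies.append(time.time() - start)
        # Rule-based click simulation: 1/2 for model recommendations, 1/4 for top articles.
        click_prob = 0.5 if resp.get("strategy") == "model" else 0.25
        if random.random() < click_prob:
            requests.post(SERVING_URL + "/click",
                          json={"article_id": random.choice(resp["articles"])}, timeout=30)
        # Naive pacing; kept sequential here for brevity.
        time.sleep(max(0.0, 1.0 / readers_per_second - (time.time() - start)))
    print("mean latency:", round(float(np.mean(latencies)), 2), "s")
```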
5 LIMITATIONS AND FUTURE WORKS
In our case study, many aspects have limitations and need further improvement: the restriction to short-term recommendations, the small grid-search optimization, the offline performance metrics (MRR and P@5) and the cold-start issue. We chose to work on short-term recommendations to improve the scalability of our system; in future developments, we will include long-term recommendations by adding a content-based approach. We limited ourselves to a small grid search to optimize our models' hyperparameters: we chose a small set of values for each hyperparameter, based on insights from our pre-experiments, to train the models in a reasonable amount of time. We obtained good results, and we hope that other researchers will try their approaches on our released dataset to push our state of the art. While the combination of MRR and P@5 is relevant, the first only measures the position of the first relevant item in the list, while the second is the proportion of relevant items in the whole list. Since we chose these metrics because we currently lack a specific relevance measurement, we plan to extract the reading time of articles from Matomo to compute the relevance of an article for a given reader; we will then compute NDCG@5, which is a better metric to evaluate our offline performance. Finally, we also face the cold-start issue, which we did not address directly. Nevertheless, because of our short-term recommendations, we designed the recommendation box to appear only inside articles; therefore, readers coming to our website for the first time will still get recommendations when they visit articles.

We also have limitations with our embedding space study and our fake-user test study: their main limitation is their generalization. However, we argue that both studies are complementary and insightful, indicating that the model learns both relevant embeddings and relevant recommendations. We also observed the same patterns in the embedding space for both May 20th, 2020, and May 26th, 2020.

6 CONCLUSION
To conclude, we presented a case study of our cost-effective and production-ready deep news recommender system architecture built with open-source and cloud technologies. We designed it with two virtual machines (a serving VM and a training VM) and with a limit on the number of days of data, to meet our cost, scalability, training time and serving time constraints. With a grid-search approach, we found that the optimal model was the 1D CNN, reaching an MRR of 0.693 and a P@5 of 0.133 in only 134 seconds of training. We release an anonymized version of our dataset to promote the reproducibility of our results. In our architecture, the model is trained from scratch in many training sessions distributed according to our website traffic. We estimated that this strategy saves around 80% of our total cost for the recommender system, which comes to less than 4 $US per day; compared to commercial solutions costing thousands of dollars per month, this saving rises close to 98%. We also evaluated our system with two further studies: using our custom monitoring dashboard, we observed a high relevance of our embeddings and recommendations based on two complementary qualitative studies, the embedding space study and the fake-user test study. Finally, we demonstrated the readiness of our system with a traffic simulation. We hope our affordable and robust design inspires other online media companies to consider developing their own recommender systems to stay competitive in the digital news market.

REFERENCES
[1] Kevin G. Barnhurst. 2011. The new "media affect" and the crisis of representation for political communication. The International Journal of Press/Politics 16, 4 (2011), 573–593.
[2] Harold S. Black. 1953. Modulation Theory. Van Nostrand.
[3] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. 191–198.
[4] Marc Edge. 2014. Newspapers' annual reports show chains profitable. Newspaper Research Journal 35, 4 (2014).
[5] Florent Garcin and Boi Faltings. 2013. PEN recsys: A personalized news recommender systems framework. In Proceedings of the 2013 International News Recommender Systems Workshop and Challenge. 3–9.
[6] Florent Garcin, Boi Faltings, Olivier Donatsch, Ayar Alazzawi, Christophe Bruttin, and Amr Huber. 2014. Offline and online evaluation of news recommender systems at swissinfo.ch. In Proceedings of the 8th ACM Conference on Recommender Systems. 169–176.
[7] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[8] Plotly Technologies Inc. 2015. Collaborative data science. Montreal, QC. https://plot.ly
[9] Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 (2016).
[10] Mozhgan Karimi, Dietmar Jannach, and Michael Jugovac. 2018. News recommender systems – survey and roads ahead. Information Processing & Management 54, 6 (2018), 1203–1227.
[11] Maciej Kula. 2017. Mixture-of-tastes models for representing users with diverse interests. arXiv preprint arXiv:1711.08379 (2017).
[12] Maciej Kula. 2017. Spotlight. https://github.com/maciejkula/spotlight
[13] Azi Lev-On. 2012. Communication, community, crisis: Mapping uses and gratifications in the contemporary media environment. New Media & Society 14, 1 (2012), 98–116.
[14] Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. 2010. Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces. 31–40.
[15] Stephan A. Miller. 2012. Piwik Web Analytics Essentials. Packt Publishing Ltd.
[16] Itishree Mohallick and Özlem Özgöbek. 2017. Exploring privacy concerns in news recommender systems. In Proceedings of the International Conference on Web Intelligence. 1054–1061.
[17] Harry Nyquist. 1928. Certain topics in telegraph transmission theory. Transactions of the American Institute of Electrical Engineers 47, 2 (1928), 617–644.
[18] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).
[19] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS 2017 Workshop on Autodiff.
[20] Alper Sarikaya, Michael Correll, Lyn Bartram, Melanie Tory, and Danyel Fisher. 2018. What do we talk about when we talk about dashboards? IEEE Transactions on Visualization and Computer Graphics 25, 1 (2018), 682–692.
[21] Paul Starr. 2012. An unexpected crisis: The news media in postindustrial democracies. The International Journal of Press/Politics 17, 2 (2012), 234–242.
[22] Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In Twenty-Second International Joint Conference on Artificial Intelligence.
[23] Dwayne Winseck. 2010. Financialization and the "crisis of the media": The rise and fall of (some) media conglomerates in Canada. Canadian Journal of Communication 35, 3 (2010).
[24] Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys 52, 1 (2019), 1–38.