Defining a Meaningful Baseline for News Recommender Systems

Benjamin Kille, Institute of Technology Berlin, Berlin, Germany (benjamin.kille@tu-berlin.de)
Andreas Lommatzsch, Institute of Technology Berlin, Berlin, Germany (andreas.lommatzsch@dai-labor.de)
ABSTRACT

Evaluation protocols for news recommender systems typically involve comparing the performance of methods to a baseline. The difference in performance ought to tell us what benefit we can expect from using a more sophisticated method. Ultimately, there is a trade-off between performance and the effort of implementing and maintaining a system. This work explores which baselines have been used, which criteria baselines must fulfil, and evaluates a variety of baselines in a news recommender evaluation setting with multiple publishers. We find that circular buffers and trend-based predictions score highly, need little effort to implement, and require no additional data. Besides, we observe variations among publishers, suggesting that not all baselines are equally competitive in different circumstances.

CCS CONCEPTS

• Information systems → Recommender systems.

KEYWORDS

news recommender systems, evaluation, baselines

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

Readers struggle to keep up with the plethora of stories which publishers continue to release on their digital platforms. News Recommender Systems (NRS) support readers in navigating the dynamic news landscape. They deliver a subset of articles deemed interesting. Publishers' success depends, at least partially, on how much of readers' attention they obtain. Revenue strongly correlates with the number of advertisements shown to readers in an "attention economy" [2]. Consequently, publishers want to know whether an NRS recommends relevant articles to their readers. The dynamic character of the news landscape impedes the comparability of evaluation results. For instance, breaking news may shift readers' attention in a way completely unrelated to the recommendation algorithms. Publishers account for this effect by comparing recommendation algorithms to baselines [30].

The choice of baseline introduces a variable into the evaluation protocol. Ideally, evaluators would use the same baseline to arrive at comparable results. Not every baseline applies to each recommendation task. For instance, collaborative filtering requires user profiles [15], whereas content-based filtering needs meta-data about items [24]. Section 2 reviews previous research on NRS, focusing on which baselines have been used. Section 3 derives requirements which baselines must fulfil. Section 4 describes experiments conducted on a large-scale data set from three publishers. The experiments compare the performance of a variety of baselines. Section 5 summarises our findings and hints at directions for future research.

2 RELATED WORK

A consensus concerning the evaluation protocol of NRS has yet to be established. The availability of data affects which baselines evaluators can use in their experiments. Frequently, researchers use recorded interactions between users and news articles. Whenever researchers have access to the NRS directly, they can even employ counterfactual reasoning [31]. The earliest works on automated NRS relied on items' popularity as a baseline. The rationale behind the popularity baseline suggests that items relevant to many users are suitable candidates. Das et al. [7] and Garcin et al. [9] employ the popularity baseline. Lommatzsch [22] uses a circular buffer implementation as a baseline. This implementation combines the popularity of items with a recency focus. Researchers interested in content-based news recommendation devise baselines using content features. Gao et al. [8] and Zheng et al. [37] use a term-frequency model as a baseline. Cantador et al. [3] use a keyword-based baseline for their semantic news recommendation model. Okura et al. [27] define a word-based baseline for their embedding experiments. Li et al. [17] use an ε-greedy strategy as a baseline in their contextual bandit evaluation. Li et al. [20] and Lu et al. [25] use collaborative filtering and content-based filtering baselines. Some researchers invest considerable resources to replicate existing results by re-implementing proposed news recommendation algorithms. Li and Li [18] compare their results to [5, 7, 17, 19, 21]. Zheng et al. [36] contrast their findings with [4, 17, 28, 32]. Wang et al. [33] consider [4, 6, 10, 13, 28, 34, 35]. Khattar et al. [14] compare their approach with [11, 12, 16, 26, 29]. Studying the baselines, we notice variance across works. Some baselines appear frequently. Some papers describe baselines tailored to their particular use case. There is no consensus on which baseline should be used. Moreover, it remains unclear how baselines correlate or depend on one another.
3 BASELINE REQUIREMENTS

Baselines allow us to measure the relative improvement of an algorithm. Evaluators can not only report a value but express how much better the value is compared to a more straightforward method. Thereby, we learn whether it is worth investing additional effort into developing more sophisticated algorithms. Still, defining a meaningful baseline poses a technical challenge. A baseline which fails miserably does not reveal much about relative improvements. A baseline which requires too much effort or has too many design choices might render evaluation results hard to compare. Requiring additional data can make it impossible to use some baselines in experiments where these data are lacking. Based on these propositions, we introduce three requirements for news recommendation baselines:

(1) low implementation effort
(2) competitive performance
(3) no additional data required

The next section introduces a variety of candidate algorithms and analyses how well they perform in a large-scale experiment.

4 EXPERIMENT

In this experiment, we use the NewsREEL evaluation platform described in Lommatzsch et al. [23]. NewsREEL offers researchers the opportunity to evaluate their news recommendation algorithms on real users across several connected publishers. We use data recorded from 1 March 2017 to 30 June 2018 (14 months), including approximately 94 million sessions from three publishers. The system tracks readers using session cookies. The session information allows us to link reading events to a particular reader. Whenever more than 1 h passes between events, we create a new session. Note that we disregard all sessions with a single reading event, as we cannot compare predictions to future events in these cases. All three publishers operate an NRS. Empirically, these NRS produce clicks on the order of a few per thousand recommendations. Hence, their effect on the collected reading events appears negligible.
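The 1 h sessionisation rule is simple to implement. Below is a minimal sketch assuming reading events arrive sorted by time as (cookie_id, timestamp, article_id) tuples; all names and the data layout are illustrative and not taken from the NewsREEL code base.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 3600  # start a new session after more than 1 h of inactivity

def build_sessions(events):
    """Group (cookie_id, timestamp, article_id) events into sessions.

    Events are assumed to be sorted by timestamp. Sessions with a single
    reading event are dropped, as in the paper, because they offer no
    later event against which a recommendation could be validated.
    """
    open_sessions = defaultdict(list)   # cookie_id -> events of the open session
    last_seen = {}                      # cookie_id -> timestamp of previous event
    sessions = []

    for cookie_id, ts, article_id in events:
        prev = last_seen.get(cookie_id)
        if prev is not None and ts - prev > SESSION_GAP_SECONDS:
            if len(open_sessions[cookie_id]) > 1:
                sessions.append(open_sessions[cookie_id])
            open_sessions[cookie_id] = []
        open_sessions[cookie_id].append((ts, article_id))
        last_seen[cookie_id] = ts

    for evs in open_sessions.values():  # flush sessions still open at the end
        if len(evs) > 1:
            sessions.append(evs)
    return sessions
```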
                                                                  or text. Still, this would require considerable computational
Table 1: Statistics describing the interactions of readers with news articles for three publishers. The symbols refer to the number of sessions (S), the number of interactions (N), the average number of events per session (EpS), and the number (S¹) and proportion (S¹%) of sessions with only one event.

Statistic      Publisher A     Publisher B     Publisher C
S               17 019 523      22 683 047      54 272 242
N               36 859 823     175 930 128     105 998 109
EpS                   2.17            7.76            1.95
S¹              10 529 390         416 506      34 998 380
S¹%                61.19 %          1.84 %         64.49 %

Table 1 outlines the characteristics of the data set. Publishers A and C have to deal with comparatively short sessions. Table 1 also lists the number and proportion of sessions with only a single interaction. This matters as we have to disregard those sessions in the evaluation. Without a second interaction, we cannot determine whether a recommendation was successful.

4.1 Candidate Baselines

In the experiment, we consider eight baseline candidates.

Random. The random method considers all known articles and picks a suggestion at random. In addition, we consider a slightly advanced version which draws only from the set of items published in a certain time window. This accounts for the readers' desire to read more current news.

Popularity. The popularity method suggests exactly the article which has been read most often. Similarly to the random method, we consider a version of the popularity method which considers reading frequencies in a specific time window, as sketched below.
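A sliding window over recent events suffices for the windowed variants. The following minimal sketch (our own naming; timestamps assumed in seconds) illustrates the idea; the same structure covers the windowed random baseline if one samples from the window instead of taking the maximum.

```python
from collections import Counter, deque

class WindowedPopularity:
    """Recommend the article read most often within a sliding time window.

    The plain popularity baseline corresponds to an unbounded window; the
    6/12/24/48 h variants in Table 2 correspond to window_seconds of
    21600, 43200, 86400, and 172800.
    """

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()        # (timestamp, article_id), oldest first
        self.counts = Counter()

    def record(self, ts, article_id):
        self.events.append((ts, article_id))
        self.counts[article_id] += 1
        # evict events that have fallen out of the window
        while self.events and ts - self.events[0][0] > self.window:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def recommend(self):
        return self.counts.most_common(1)[0][0] if self.counts else None
```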
Recency. The recency method recommends the news articles most recently published. The method disregards any form of popularity or personalisation.

Reading Sequences. The reading sequences method monitors which articles users read in sequence. It recommends exactly the article which users read most frequently after the reader's current article.
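The reading-sequences baseline reduces to counting transitions between consecutively read articles. A minimal sketch, again with illustrative names:

```python
from collections import Counter, defaultdict

class ReadingSequences:
    """Recommend the most frequent successor of the reader's current article."""

    def __init__(self):
        self.successors = defaultdict(Counter)  # article -> counts of next reads

    def record_session(self, article_ids):
        # count transitions between consecutively read articles of one session
        for current, nxt in zip(article_ids, article_ids[1:]):
            self.successors[current][nxt] += 1

    def recommend(self, current_article):
        followers = self.successors.get(current_article)
        return followers.most_common(1)[0][0] if followers else None
```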
Collaborative filtering. Applying collaborative filtering to news recommendation is particularly challenging. Not only do publishers keep adding new items, but systems also know very little about users. We follow the suggestion of Das et al. [7] and implement a MinHash version of collaborative filtering.
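Das et al. [7] cluster users by MinHash signatures of their click histories and recommend what co-clustered users read. The sketch below only gestures at that scheme in a single process; the original uses several hash bands and runs on MapReduce, and all names here are our own assumptions.

```python
import random
from collections import Counter, defaultdict

class MinHashCF:
    """A rough, single-process sketch of MinHash-clustered collaborative
    filtering in the spirit of Das et al. [7].

    Users whose click sets produce the same MinHash signature fall into one
    cluster; we recommend the cluster's most-read article that the active
    user has not seen yet. Signature drift as click sets grow is ignored.
    """

    def __init__(self, num_hashes=3, seed=42):
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(32) for _ in range(num_hashes)]
        self.user_clicks = defaultdict(set)        # user -> clicked article ids
        self.cluster_reads = defaultdict(Counter)  # signature -> article counts

    def _signature(self, articles):
        # one min-hash per salt over the user's click set
        return tuple(min(hash((salt, a)) for a in articles) for salt in self.salts)

    def record(self, user, article):
        self.user_clicks[user].add(article)
        self.cluster_reads[self._signature(self.user_clicks[user])][article] += 1

    def recommend(self, user):
        clicks = self.user_clicks.get(user)
        if not clicks:
            return None
        for article, _ in self.cluster_reads[self._signature(clicks)].most_common():
            if article not in clicks:
                return article
        return None
```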
                                                                    maximum results among the three publishers. Sequences top
Content-based Filtering. Content-based filtering requires a way to define similarity among news articles. Generally, we could employ a string-matching approach on the title or text. Still, this would require considerable computational effort. Besides, the possibility of different languages adds another level of difficulty. We have implemented a more straightforward method. The content-based filtering takes the category of news articles as a proxy for similarity and suggests articles from the same category at random.

Circular buffer. We use the circular buffer proposed by Lommatzsch [22], which has also been used as the baseline of CLEF NewsREEL [23]. This method keeps a fixed-size list which the system updates as interactions occur. The system adds the article of each interaction. When the system arrives at the end of the list, it moves the index back to the first position and continues. The method selects recommendations by reverse lookup in the list. Thereby, it combines popularity and recency: more popular articles occur more frequently in the list, and more recently read articles occur with a higher probability as well.
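A possible reading of this method, with the reverse lookup approximated by sampling uniformly from the buffer (duplicates encode popularity, and only recently read articles are guaranteed to still be present); the original implementation [22] may differ in this step:

```python
import random

class CircularBuffer:
    """Sketch of the circular-buffer baseline following the description
    in Lommatzsch [22]."""

    def __init__(self, size=100):
        self.slots = [None] * size
        self.index = 0

    def record(self, article_id):
        # overwrite the next slot in ring order
        self.slots[self.index] = article_id
        self.index = (self.index + 1) % len(self.slots)

    def recommend(self, current_article=None):
        candidates = [a for a in self.slots if a is not None and a != current_article]
        return random.choice(candidates) if candidates else None
```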
Trending. The trending method computes the trend for each article in a given time window. More specifically, the method carries out a regression on the reading frequency binned at an hourly level. The method recommends the articles with the steepest trend.
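The trending baseline amounts to a slope over hourly counts. A minimal sketch, assuming Unix timestamps; the ordinary least-squares regression and the tie-breaking are our assumptions:

```python
from collections import defaultdict

class Trending:
    """Recommend the article whose hourly read counts rise most steeply.

    For every article we fit a least-squares line to the read counts of
    the last window_hours hourly bins and pick the largest slope.
    """

    def __init__(self, window_hours=2):
        self.window_hours = window_hours
        self.hourly = defaultdict(lambda: defaultdict(int))  # article -> {hour: count}

    def record(self, ts, article_id):
        self.hourly[article_id][int(ts // 3600)] += 1

    @staticmethod
    def _slope(points):
        # ordinary least-squares slope for (x, y) pairs
        n = len(points)
        mean_x = sum(x for x, _ in points) / n
        mean_y = sum(y for _, y in points) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in points)
        if var_x == 0:
            return 0.0
        return sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x

    def recommend(self, now):
        current_hour = int(now // 3600)
        hours = range(current_hour - self.window_hours + 1, current_hour + 1)
        best_article, best_slope = None, float("-inf")
        for article, counts in self.hourly.items():
            slope = self._slope([(h, counts.get(h, 0)) for h in hours])
            if slope > best_slope:
                best_article, best_slope = article, slope
        return best_article
```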
Implementing the baseline candidates causes little effort. Content-based filtering requires knowing the category of news articles. All remaining candidates only require interaction data with timestamps. Hence, they already fulfil two of our three requirements.
4.2 Evaluation Protocol

We present each event to all of the baseline candidates and request exactly one item as a recommendation. The evaluator stores the recommendations with reference to the session. When the session re-appears, the evaluator checks whether the article read now had been recommended previously. If that is the case, the evaluator adds one to the score (x) of that baseline. Based on the recorded scores, we compute two evaluation measures of recommendation success, on the event and on the session level:

    R_x = |S|^{-1} \sum_{s \in S} x(s)                      (1)

    R_s = |S|^{-1} \sum_{s \in S} \mathbb{1}\{x(s) > 0\}    (2)

The first score, R_x, approximates the expected number of successful recommendations per session. The second score, R_s, estimates the chance that at least one recommendation will succeed.
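Given per-session success counts x(s), both measures are one-liners. A small sketch with a worked example (session identifiers are hypothetical):

```python
def evaluate(session_scores):
    """Compute R_x (Eq. 1) and R_s (Eq. 2) from per-session success counts.

    session_scores maps each multi-event session s to x(s), the number of
    recommendations that the reader later clicked within that session.
    """
    n = len(session_scores)
    r_x = sum(session_scores.values()) / n
    r_s = sum(1 for x in session_scores.values() if x > 0) / n
    return r_x, r_s

# Three sessions with 0, 2, and 1 successful recommendations:
print(evaluate({"s1": 0, "s2": 2, "s3": 1}))  # (1.0, 0.666...)
```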
4.3 Evaluation Results

Table 2 shows our observations for all combinations of publishers and baselines. We notice that the baselines' performances vary considerably. Table 2 reports all scores in units of 10^{-5} to obtain more legible figures; thus, a value of 1000 corresponds to one per cent. The random baseline performs poorly for all publishers. The chance to randomly suggest something interesting is below 1 in 1000. Constraining the time window to shorter periods slightly improves the success rate. The performance of popularity, recency, and sequences differs between publishers. Publishers A and C show better results for popularity and sequences. Sequences even perform best for Publisher A overall. Publisher B, on the other hand, shows the best results for recency and, conversely, the worst performance for sequences. This is surprising as Publisher B sees the longest sessions on average (cf. Table 1). The presence of very specific sequences could spoil the predictions, particularly for recently published items. Content-based filtering performs poorly, especially for Publisher C. Publisher C focusses on the narrow topic of automotive news. This could lead to circumstances in which the NRS faces a large number of very specific categories. Collaborative filtering performs well for Publishers A and B but not for C. The circular buffer and the trending baseline perform well for all publishers. The circular buffers exhibit little variation depending on the parameter choice. This could be due to the highly skewed frequencies with which readers engage with articles. The distributions exhibit a strong popularity bias which supplies the same small subset of articles to the buffer. As a result, the recommendations would largely coincide among lists of varying lengths. For the trending baseline, Publishers A and B show improvements for shorter time windows, whereas Publisher C's trending performance peaks at 12 hours.

Besides, we observe different maximum results among the three publishers. Sequences top Publisher A at 18.5 % (R_x) and 14.7 % (R_s). In contrast, the circular buffer with one hundred elements scores highest for Publisher B with 11.7 % (R_x) and 3.0 % (R_s). The circular buffer with five hundred or a thousand elements performs best for Publisher C with 2.1 % (R_x) and 1.6 % (R_s). Hence, the differences with respect to R_x span 16.4 percentage points. Such substantial differences are unlikely to emerge from the baseline. Instead, we have to assume that other aspects play a vital role, such as the composition of the readership, interface design, and content.
Table 2: Evaluation results. The table shows the two performance scores for each combination of publisher and baseline. The best results for each publisher are marked with an asterisk (*).

                           Publisher A            Publisher B            Publisher C
Baseline                   R_x·10⁻⁵   R_s·10⁻⁵    R_x·10⁻⁵   R_s·10⁻⁵    R_x·10⁻⁵   R_s·10⁻⁵
random                         6.26       6.21        6.16       6.11        1.73       1.69
random (6h)                   87.69      85.59       64.30      62.43       13.72      13.45
random (12h)                  56.02      55.05       43.05      42.22        9.99       9.81
random (24h)                  37.75      37.29       32.69      32.21        8.06       7.88
random (48h)                  27.06      26.87       23.80      23.47        6.11       6.04
popular                      934.30     653.56      431.17      82.23     2025.96    1581.96
popular (6h)                1009.02     722.28      282.33      53.83     2027.88    1582.75
popular (12h)               1039.16     755.82      526.83     100.99     2031.92    1585.04
popular (24h)               1089.32     774.56     1166.36     216.55     2037.30    1587.64
popular (48h)               1140.12     781.17     1388.02     258.87     2043.20    1590.65*
recency                      708.23     647.72     1017.66     200.90       38.17      34.36
circular buffer (100)       8069.65    6583.47    11747.88    3049.56*    2069.87    1583.18
circular buffer (200)       8069.65    6583.47    11748.00*   3049.56*    2074.45    1585.22
circular buffer (500)       8069.65    6583.47    11748.00*   3049.56*    2074.76*   1585.35
circular buffer (1000)      8067.65    6583.47    11748.00*   3049.56*    2074.76*   1585.35
sequences                  18542.49*  14650.12*       0.14       0.03     1523.25    1256.07
content-based                330.10     310.96      286.01      89.93        4.27       4.22
collaborative filtering     5864.39    4861.47     4648.87     845.90      132.68     116.79
trends (2h)                 8599.78    6833.47    11237.24    2066.37     1608.16    1246.95
trends (6h)                 6437.37    4743.85     6398.07    1173.67     1784.49    1381.72
trends (12h)                5805.29    4205.00     4481.68     823.05     1795.34    1392.33
trends (24h)                4858.50    3418.07     1580.30     292.20     1723.92    1344.13


5 CONCLUSION AND FUTURE WORK

News Recommender Systems support readers in navigating the news landscape by suggesting which article to read next. Evaluation is necessary to optimise NRS. To estimate how much value a new method adds, evaluation protocols compare their results to baselines. A consensus on which baselines to use has yet to be established. Researchers have used a variety of baselines (cf. Section 2). We have formulated three criteria which baselines have to meet to be considered a viable option. Section 4.1 presents a list of candidate baselines, all of which require manageable effort to implement and need little additional data. To check the candidate baselines' competitiveness, we have devised an experiment on three publishers with data covering 14 months. The results suggest that the circular buffer and the trending baseline stably provide competitive performance for all publishers. We have observed variations among the performance of baselines for particular publishers. For instance, the sequence baseline has scored exceptionally well for Publisher A yet failed completely for Publisher B. More research is needed to explore how to transfer our findings to other publishers. We conjecture that similar publishers will confirm the order of performance for the baselines.

Researchers are keen to compare their methods to the most competitive approach. This requires considerable investment in re-implementing previous research. We argue that using the circular buffer or the trend-based method already represents a solid baseline. Both have scored higher than collaborative filtering and content-based filtering in our experiments. Researchers ought to highlight the trade-off between predictive accuracy and computational costs. Amatriain and Basilico [1] have highlighted this critical trade-off in the case of the streaming service Netflix. Their team decided not to implement an ensemble of 107 algorithms as the engineering costs surpassed the added value in prediction accuracy. The accelerated dynamics of collections of news articles impose even stricter constraints on recommender systems than in the case of movies.

We see several directions for future research. First, one could introduce a method to estimate the implementation costs in a more quantitative fashion. This would allow us to address the trade-off more rigorously. Likewise, one could measure the data footprint of different baselines to assess their space and time complexity. Second, one could apply the proposed baselines and possible additions or adaptations to data from other publishers. This would help to quantify the generality of our findings. Finally, evaluation protocols including more than a single recommendation could reveal how ranking metrics compare to our binary scheme. Still, ranking metrics might not reflect accurately whether a system performs well unless it presents recommendations in the form of a list.
REFERENCES
[1] Xavier Amatriain and Justin Basilico. 2012. Netflix recommendations: beyond the 5 stars (part 1). Netflix Tech Blog 6 (2012).
[2] Erik Brynjolfsson and JooHee Oh. 2012. The Attention Economy: Measuring the Value of Free Digital Services on the Internet. ICIS (2012).
[3] Iván Cantador, Pablo Castells, and Alejandro Bellogín. 2011. An enhanced semantic layer for hybrid recommender systems: Application to news recommendation. International Journal on Semantic Web and Information Systems (IJSWIS) 7, 1 (2011), 44–78.
[4] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
[5] Wei Chu and Seung-Taek Park. 2009. Personalized recommendation on dynamic content using predictive bilinear models. In Proceedings of the 18th International Conference on World Wide Web. ACM, 691–700.
[6] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 191–198.
[7] Abhinandan S. Das, Mayur Datar, Ashutosh Garg, and Shyam Rajaram. 2007. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web. ACM, 271–280.
[8] Qi Gao, Fabian Abel, Geert-Jan Houben, and Ke Tao. 2011. Interweaving trend and user modeling for personalized news recommendation. In 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Vol. 1. IEEE, 100–103.
[9] Florent Garcin, Boi Faltings, Olivier Donatsch, Ayar Alazzawi, Christophe Bruttin, and Amr Huber. 2014. Offline and online evaluation of news recommender systems at swissinfo.ch. In Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 169–176.
[10] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. 1725–1731. https://doi.org/10.24963/ijcai.2017/239
[11] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
[12] Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 549–558.
[13] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2333–2338.
[14] Dhruv Khattar, Vaibhav Kumar, Vasudeva Varma, and Manish Gupta. 2018. Weave&Rec: A Word Embedding based 3-D Convolutional Network for News Recommendation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 1855–1858.
[15] Yehuda Koren and Robert Bell. 2015. Advances in Collaborative Filtering. In Recommender Systems Handbook. Springer US, Boston, MA, 77–118.
[16] Vaibhav Kumar, Dhruv Khattar, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep Neural Architecture for News Recommendation. In CLEF (Working Notes).
[17] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web. ACM, 661–670.
[18] Lei Li and Tao Li. 2013. News Recommendation via Hypergraph Learning: Encapsulation of User Behavior and News Content. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM '13). ACM, New York, NY, USA, 305–314. https://doi.org/10.1145/2433396.2433436
[19] Lei Li, Dingding Wang, Tao Li, Daniel Knox, and Balaji Padmanabhan. 2011. SCENE: a scalable two-stage personalized news recommendation system. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 125–134.
[20] Lei Li, Li Zheng, Fan Yang, and Tao Li. 2014. Modeling and broadening temporal user interest in personalized news recommendation. Expert Systems with Applications 41, 7 (2014), 3168–3177.
[21] Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. 2010. Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces. ACM, 31–40.
[22] Andreas Lommatzsch. 2014. Real-time news recommendation using context-aware ensembles. In European Conference on Information Retrieval, Vol. LNCS 8416. Springer, 51–62.
[23] Andreas Lommatzsch, Benjamin Kille, Frank Hopfgartner, Martha Larson, Torben Brodt, Jonas Seiler, and Özlem Özgöbek. 2017. CLEF 2017 NewsREEL Overview: A Stream-based Recommender Task for Evaluation and Education. In International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, 239–254.
[24] Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. 2011. Content-based Recommender Systems: State of the Art and Trends. In Recommender Systems Handbook. Springer, Boston, MA, 73–105.
[25] Zhongqi Lu, Zhicheng Dou, Jianxun Lian, Xing Xie, and Qiang Yang. 2015. Content-based collaborative filtering for news topic recommendation. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[26] Cataldo Musto, Giovanni Semeraro, Marco de Gemmis, and Pasquale Lops. 2016. Learning word embeddings from wikipedia for content-based recommender systems. In European Conference on Information Retrieval. Springer, 729–734.
[27] Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-based News Recommendation for Millions of Users. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, New York, NY, USA, 1933–1942. https://doi.org/10.1145/3097983.3098108
[28] Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000.
[29] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 452–461.
[30] Guy Shani and Asela Gunawardana. 2010. Evaluating Recommendation Systems. In Recommender Systems Handbook. Springer US, Boston, MA.
[31] A. Swaminathan and T. Joachims. 2015. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. In ICML.
[32] Huazheng Wang, Qingyun Wu, and Hongning Wang. 2016. Learning hidden features for contextual bandits. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 1633–1642.
[33] Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018. DKN: Deep Knowledge-Aware Network for News Recommendation. In Proceedings of the 2018 World Wide Web Conference (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1835–1844. https://doi.org/10.1145/3178876.3186175
[34] Jin Wang, Zhongyuan Wang, Dawei Zhang, and Jun Yan. 2017. Combining Knowledge with Deep Convolutional Neural Networks for Short Text Classification. In IJCAI. 2915–2921.
[35] Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep Matrix Factorization Models for Recommender Systems. In IJCAI. 3203–3209.
[36] Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. 2018. DRN: A Deep Reinforcement Learning Framework for News Recommendation. In Proceedings of the 2018 World Wide Web Conference (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 167–176. https://doi.org/10.1145/3178876.3185994
[37] Li Zheng, Lei Li, Wenxing Hong, and Tao Li. 2013. PENETRATE: Personalized news recommendation using ensemble hierarchical clustering. Expert Systems with Applications 40, 6 (2013), 2127–2136.