Defining a Meaningful Baseline for News Recommender Systems

Benjamin Kille, Institute of Technology Berlin, Berlin, Germany, benjamin.kille@tu-berlin.de
Andreas Lommatzsch, Institute of Technology Berlin, Berlin, Germany, andreas.lommatzsch@dai-labor.de

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Evaluation protocols for news recommender systems typically involve comparing the performance of methods to a baseline. The difference in performance ought to tell us what benefit we can expect from using a more sophisticated method. Ultimately, there is a trade-off between performance and the effort of implementing and maintaining a system. This work explores which baselines have been used, which criteria baselines must fulfil, and evaluates a variety of baselines in a news recommender evaluation setting with multiple publishers. We find that circular buffers and trend-based predictions score highly, need little effort to implement, and require no additional data. In addition, we observe variations among publishers, suggesting that not all baselines are equally competitive in all circumstances.

CCS CONCEPTS
• Information systems → Recommender systems.

KEYWORDS
news recommender systems, evaluation, baselines

1 INTRODUCTION
Readers struggle to keep up with the plethora of stories which publishers continue to release on their digital platforms. News Recommender Systems (NRS) support readers in navigating the dynamic news landscape. They deliver a subset of articles deemed interesting. Publishers' success depends, at least partially, on how much of readers' attention they obtain. Revenue correlates strongly with the number of advertisements shown to readers in an "attention economy" [2]. Consequently, publishers want to know whether an NRS recommends relevant articles to their readers. The dynamic character of the news landscape impedes the comparability of evaluation results. For instance, breaking news may shift readers' attention in a way completely unrelated to the recommendation algorithms. Publishers account for this effect by comparing recommendation algorithms to baselines [30]. The choice of baseline introduces a variable into the evaluation protocol. Ideally, evaluators would use the same baseline to arrive at comparable results. However, not every baseline applies to every recommendation task. For instance, collaborative filtering requires user profiles [15], whereas content-based filtering needs meta-data about items [24]. Section 2 reviews previous research on NRS, focusing on the baselines it has used. Section 3 derives requirements which baselines must fulfil. Section 4 describes experiments conducted on a large-scale data set from three publishers. The experiments compare the performance of a variety of baselines. Section 5 summarises our findings and points to directions for future research.

2 RELATED WORK
A consensus concerning the evaluation protocol of NRS has yet to be established. The availability of data affects which baselines evaluators can use in their experiments. Frequently, researchers use recorded interactions between users and news articles. Whenever researchers have access to the NRS directly, they can even employ counterfactual reasoning [31]. The earliest works on automated NRS relied on items' popularity as a baseline. The rationale behind the popularity baseline is that items relevant to many users are suitable candidates. Das et al. [7] and Garcin et al. [9] employ the popularity baseline. Lommatzsch [22] uses a circular buffer implementation as a baseline. This implementation combines the popularity of items with a focus on recency. Researchers interested in content-based news recommendation devise baselines using content features. Gao et al. [8] and Zheng et al. [37] use a term-frequency model as a baseline. Cantador et al. [3] use a keyword-based baseline for their semantic news recommendation model. Okura et al. [27] define a word-based baseline for their embedding experiments. Li et al. [17] use an ε-greedy strategy as a baseline in their contextual bandit evaluation. Li et al. [20] and Lu et al. [25] use collaborative filtering and content-based filtering baselines. Some researchers invest considerable resources to replicate existing results by re-implementing proposed news recommendation algorithms. Li and Li [18] compare their results to [5, 7, 17, 19, 21]. Zheng et al. [36] contrast their findings with [4, 17, 28, 32]. Wang et al. [33] consider [4, 6, 10, 13, 28, 34, 35]. Khattar et al. [14] compare their approach with [11, 12, 16, 26, 29]. Studying these baselines, we notice variation across works. Some baselines appear frequently; some papers describe baselines tailored to their particular use case. There is no consensus on which baseline should be used. Moreover, it remains unclear how baselines correlate with or depend on one another.

3 BASELINE REQUIREMENTS
Baselines allow us to measure the relative improvement of an algorithm. Evaluators can not only report a value but also express how much better that value is compared to a more straightforward method. Thereby, we learn whether it is worth investing additional effort into developing more sophisticated algorithms. Still, defining a meaningful baseline poses a technical challenge. A baseline which fails miserably does not reveal much about relative improvements. A baseline which requires too much effort or involves too many design choices might render evaluation results hard to compare. Requiring additional data can make it impossible to use some baselines in experiments where these data are lacking. Based on these propositions, we introduce three requirements for news recommendation baselines:

(1) low implementation effort
(2) competitive performance
(3) no additional data required

The next section introduces a variety of candidate algorithms and analyses how well they perform in a large-scale experiment.
4 EXPERIMENT
In this experiment, we use the NewsREEL evaluation platform described in Lommatzsch et al. [23]. NewsREEL offers researchers the opportunity to evaluate their news recommendation algorithms on real users at several connected publishers. We use data recorded from 1 March 2017 to 30 June 2018 (16 months), including approximately 94 million sessions from three publishers. The system tracks readers using session cookies. The session information allows us to link reading events to a particular reader. Whenever more than 1 h passes between events, we create a new session. Note that we disregard all sessions with a single reading event, as we cannot compare predictions to future events in these cases. All three publishers operate an NRS. Empirically, these NRS produce clicks on the order of a few per thousand recommendations. Hence, their effect on the collected reading events appears negligible.

Table 1: Statistics describing the interactions of readers with news articles for three publishers. The symbols refer to the number of sessions (S), the number of interactions (N), the average number of events per session (EpS), and the number (S1) and proportion (S1%) of sessions with only one event.

Statistic    Publisher A    Publisher B    Publisher C
S            17 019 523     22 683 047     54 272 242
N            36 859 823     175 930 128    105 998 109
EpS          2.17           7.76           1.95
S1           10 529 390     416 506        34 998 380
S1%          61.19 %        1.84 %         64.49 %

Table 1 outlines the characteristics of the data set. Publishers A and C have to deal with notably short sessions. Table 1 also lists the number and proportion of sessions with only a single interaction. This matters as we have to disregard those sessions in the evaluation. Without a second interaction, we cannot determine whether a recommendation was successful.
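To make the sessionisation rule concrete, consider the following sketch. It is our own illustration, not the NewsREEL platform's implementation; the function name split_sessions and the event representation are assumptions. It groups one reader's time-ordered events into sessions separated by gaps of more than one hour and discards singleton sessions:

```python
from datetime import timedelta

GAP = timedelta(hours=1)  # more than 1 h between events starts a new session

def split_sessions(events):
    """Split one reader's time-ordered (timestamp, article_id) events into
    sessions and drop sessions with a single reading event, since without
    a second interaction a recommendation cannot be judged."""
    sessions, current = [], []
    for event in events:
        if current and event[0] - current[-1][0] > GAP:
            sessions.append(current)
            current = []
        current.append(event)
    if current:
        sessions.append(current)
    return [session for session in sessions if len(session) > 1]
```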
4.1 Candidate Baselines
In the experiment, we consider eight baseline candidates.

Random. The random method considers all known articles and picks a suggestion at random. In addition, we consider a slightly advanced version which draws only from the set of items published within a certain time window. This accounts for readers' desire to read more current news.

Popularity. The popularity method suggests exactly the article which has been read most often. Similarly to the random method, we consider a variant which counts reading frequencies only within a specific time window.

Recency. The recency method recommends the most recently published news articles. The method disregards any form of popularity or personalisation.

Reading Sequences. The reading sequences method monitors which articles users read in sequence. It recommends exactly the article which users read most frequently after the reader's current article.
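As an illustration, a minimal sketch of such a sequence model might maintain transition counts between consecutively read articles; the class name ReadingSequences and its interface are our own assumptions, not the implementation used in the experiment:

```python
from collections import Counter, defaultdict

class ReadingSequences:
    """Sketch of the reading-sequences baseline: recommend the article most
    frequently read directly after the reader's current article."""

    def __init__(self):
        self.successors = defaultdict(Counter)  # article -> Counter of successors

    def update(self, previous_article, current_article):
        # Called for every consecutive pair of reads within a session.
        self.successors[previous_article][current_article] += 1

    def recommend(self, current_article):
        counts = self.successors.get(current_article)
        if not counts:
            return None  # no observed successor for this article yet
        return counts.most_common(1)[0][0]
```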
Collaborative Filtering. Applying collaborative filtering to news recommendation is particularly challenging. Not only do publishers keep adding new items, but the systems also know very little about users. We follow the suggestion of Das et al. [7] and implement a MinHash version of collaborative filtering.

Content-based Filtering. Content-based filtering requires a way to define similarity among news articles. In general, we could employ a string-matching approach on the title or text. Still, this would require considerable computational effort. Besides, the possibility of different languages adds another level of difficulty. We have implemented a more straightforward method: it takes the category of a news article as a proxy for similarity and suggests articles from the same category at random.

Circular Buffer. We use the circular buffer proposed by Lommatzsch [22], which has also been used as the baseline of CLEF NewsREEL [23]. This method maintains a fixed-size list which the system updates as interactions occur. The system adds the article of each interaction to the list. When the system arrives at the end of the list, it moves the index back to the first position and continues. The method selects recommendations by reverse lookup in the list. Thereby, it combines popularity and recency: more popular articles occur more frequently in the list, and more recently read articles occur with a higher probability as well.
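A minimal sketch of this idea follows. The interface is our own assumption, and we read the selection step as drawing an entry from the buffer at random, which is one plausible interpretation of the reverse lookup rather than the exact implementation of Lommatzsch [22]:

```python
import random

class CircularBufferRecommender:
    """Sketch of the circular-buffer baseline: a fixed-size ring filled with
    recent interactions. Popular articles occupy more slots; old entries
    are overwritten as the write index wraps around."""

    def __init__(self, size=100):
        self.buffer = [None] * size
        self.index = 0

    def update(self, article_id):
        # Store every interaction; wrap around at the end of the list.
        self.buffer[self.index] = article_id
        self.index = (self.index + 1) % len(self.buffer)

    def recommend(self, current_article=None):
        candidates = [a for a in self.buffer
                      if a is not None and a != current_article]
        return random.choice(candidates) if candidates else None
```

Because every interaction is written into the ring, an article read often occupies many slots, so a uniform draw favours items that are both popular and recent.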
Trending. The trending method computes the trend for each article in a given time window. More specifically, the method carries out a regression on the reading frequency binned at an hourly level. The method recommends the articles with the steepest trend.
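The following sketch illustrates our reading of the trending method: reads are binned per hour, an ordinary least-squares slope is fitted to each article's counts, and the articles with the steepest slopes are recommended. The function names and data layout are assumptions:

```python
def slope(counts):
    """Ordinary least-squares slope of hourly read counts against the
    hour index 0..n-1."""
    n = len(counts)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator

def trending(hourly_counts, top_n=1):
    """Recommend the articles whose reading frequency rises most steeply.
    `hourly_counts` maps an article id to its hourly read counts within
    the current time window."""
    ranked = sorted(hourly_counts, key=lambda a: slope(hourly_counts[a]),
                    reverse=True)
    return ranked[:top_n]
```

Implementing the baseline candidates requires little effort. Content-based filtering requires knowing the category of news articles; all remaining candidates only require interaction data with timestamps. Hence, they already fulfil two of our three requirements.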
4.2 Evaluation Protocol
We present each event to all of the baseline candidates and request exactly one item as a recommendation. The evaluator stores the recommendations with reference to the session. When the session re-appears, the evaluator checks whether the article now read had been recommended previously. If that is the case, the evaluator adds one to the score (x) of that baseline. Based on the recorded scores, we compute two measures of recommendation success, on the event level and the session level:

R_x = |S|^{-1} \sum_{s \in S} x(s)    (1)

R_s = |S|^{-1} \sum_{s \in S} \mathbb{1}\{x(s) > 0\}    (2)

where S denotes the set of sessions and x(s) the number of successful recommendations in session s. The first score, R_x, approximates the expected number of successful recommendations per session. The second score, R_s, estimates the chance that at least one recommendation in a session will succeed.
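Given the per-session success counts x(s), both measures follow directly from Equations (1) and (2); a small sketch with assumed names:

```python
def evaluate(scores):
    """Compute R_x and R_s from the per-session success counts x(s),
    following Equations (1) and (2)."""
    n = len(scores)
    r_x = sum(scores) / n                       # Eq. (1): mean successes
    r_s = sum(1 for x in scores if x > 0) / n   # Eq. (2): share of sessions
    return r_x, r_s                             # with at least one success

# Example: three sessions with 0, 2, and 1 successful recommendations
# yield R_x = 1.0 and R_s = 2/3.
```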
4.3 Evaluation Results
Table 2 shows our observations for all combinations of publishers and baselines. We notice that the baselines' performances vary considerably. We report the scores in units of 10^{-5} to obtain more legible figures. Thus, a score of 1000 corresponds to one per cent. The random baseline performs poorly for all publishers. The chance of randomly suggesting something interesting is below 1 in 1000. Constraining the time window to shorter periods slightly improves the success rate. The performance of popularity, recency, and sequences differs between publishers. Publishers A and C show better results for popularity and sequences. Sequences even perform best for Publisher A overall. Publisher B, on the other hand, shows the best results for recency and, conversely, the worst performance for sequences. This is surprising, as Publisher B sees the longest sessions on average (cf. Table 1). The presence of very specific sequences could spoil the predictions, particularly for recently published items. Content-based filtering performs poorly, especially for Publisher C. Publisher C focuses on the narrow topic of automotive news. This could lead to circumstances in which the NRS faces a large number of very specific categories. Collaborative filtering performs well for Publishers A and B but not for C. The circular buffer and the trending baseline perform well for all publishers. The circular buffers exhibit little variation depending on the parameter choice. This could be due to the highly skewed frequencies with which readers engage with articles. The distributions exhibit a strong popularity bias which supplies the same small subset of articles to the buffer. As a result, the recommendations largely coincide among lists of varying lengths. Publishers A and B show improvements for shorter time windows, whereas Publisher C's trending performance peaks at 12 hours. Besides, we observe different maximum results among the three publishers. Sequences top Publisher A at 18.5 % (R_x) and 14.7 % (R_s). In contrast, the circular buffer scores highest for Publisher B with 11.7 % (R_x) and 3.0 % (R_s), almost independently of its size. The circular buffer with five hundred or a thousand elements performs best for Publisher C with 2.1 % (R_x) and 1.6 % (R_s). Hence, the differences with respect to R_x span 16.4 percentage points. Such substantial differences are unlikely to emerge from the baselines themselves. Instead, we have to assume that other aspects play a vital role, such as the composition of the readership, interface design, and content.

Table 2: Evaluation results. The table shows the two performance scores, R_x and R_s, for each combination of publisher and baseline. All values are in units of 10^{-5}.

Baseline                    Publisher A            Publisher B            Publisher C
                            R_x        R_s         R_x        R_s         R_x        R_s
random                      6.26       6.21        6.16       6.11        1.73       1.69
random (6h)                 87.69      85.59       64.30      62.43       13.72      13.45
random (12h)                56.02      55.05       43.05      42.22       9.99       9.81
random (24h)                37.75      37.29       32.69      32.21       8.06       7.88
random (48h)                27.06      26.87       23.80      23.47       6.11       6.04
popular                     934.30     653.56      431.17     82.23       2025.96    1581.96
popular (6h)                1009.02    722.28      282.33     53.83       2027.88    1582.75
popular (12h)               1039.16    755.82      526.83     100.99      2031.92    1585.04
popular (24h)               1089.32    774.56      1166.36    216.55      2037.30    1587.64
popular (48h)               1140.12    781.17      1388.02    258.87      2043.20    1590.65
recency                     708.23     647.72      1017.66    200.90      38.17      34.36
circular buffer (100)       8069.65    6583.47     11747.88   3049.56     2069.87    1583.18
circular buffer (200)       8069.65    6583.47     11748.00   3049.56     2074.45    1585.22
circular buffer (500)       8069.65    6583.47     11748.00   3049.56     2074.76    1585.35
circular buffer (1000)      8067.65    6583.47     11748.00   3049.56     2074.76    1585.35
sequences                   18542.49   14650.12    0.14       0.03        1523.25    1256.07
content-based               330.10     310.96      286.01     89.93       4.27       4.22
collaborative filtering     5864.39    4861.47     4648.87    845.90      132.68     116.79
trends (2h)                 8599.78    6833.47     11237.24   2066.37     1608.16    1246.95
trends (6h)                 6437.37    4743.85     6398.07    1173.67     1784.49    1381.72
trends (12h)                5805.29    4205.00     4481.68    823.05      1795.34    1392.33
trends (24h)                4858.50    3418.07     1580.30    292.20      1723.92    1344.13
5 CONCLUSION AND FUTURE WORK
News Recommender Systems support readers in navigating the news landscape by suggesting which article to read next. Evaluation is necessary to optimise NRS. To estimate how much value a new method adds, evaluation protocols compare its results to baselines. A consensus on which baselines to use has yet to be established. Researchers have used a variety of baselines (cf. Section 2). We have formulated three criteria which baselines have to meet to be considered a viable option. Section 4.1 presents a list of candidate baselines, all of which require manageable effort to implement and little additional data. To check the candidate baselines' competitiveness, we have devised an experiment on three publishers with data covering 16 months. The results suggest that the circular buffer and the trending baseline stably provide competitive performance for all publishers. We have observed variations in the performance of baselines for particular publishers. For instance, the sequence baseline has scored exceptionally well for Publisher A yet failed completely for Publisher B. More research is needed to explore how to transfer our findings to other publishers. We conjecture that similar publishers will confirm the order of performance for the baselines.

Researchers are keen to compare their methods to the most competitive approach. This requires considerable investment in re-implementing previous research. We argue that the circular buffer or the trend-based method already represents a solid baseline. Both have scored higher than collaborative filtering and content-based filtering in our experiments. Researchers ought to highlight the trade-off between predictive accuracy and computational costs. Amatriain and Basilico [1] have highlighted this critical trade-off in the case of the streaming service Netflix. Their team decided not to implement an ensemble of 107 algorithms as the engineering costs surpassed the added value in prediction accuracy. The accelerated dynamics of collections of news articles impose even stricter constraints on recommender systems than in the case of movies.

We see several directions for future research. First, one could introduce a method to estimate implementation costs in a more quantitative fashion. This would allow us to address the trade-off more rigorously. Likewise, one could measure the data footprint of different baselines to assess their space and time complexity. Second, one could apply the proposed baselines and possible additions or adaptations to data from other publishers. This would help to quantify the generality of our findings. Finally, evaluation protocols including more than a single recommendation could reveal how ranking metrics compare to our binary scheme. Still, ranking metrics might not accurately reflect whether a system performs well unless it presents recommendations in the form of a list.
REFERENCES
[1] Xavier Amatriain and Justin Basilico. 2012. Netflix recommendations: beyond the 5 stars (part 1). Netflix Tech Blog 6 (2012).
[2] Erik Brynjolfsson and JooHee Oh. 2012. The Attention Economy - Measuring the Value of Free Digital Services on the Internet. ICIS (2012).
[3] Iván Cantador, Pablo Castells, and Alejandro Bellogín. 2011. An enhanced semantic layer for hybrid recommender systems: Application to news recommendation. International Journal on Semantic Web and Information Systems (IJSWIS) 7, 1 (2011), 44–78.
[4] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
[5] Wei Chu and Seung-Taek Park. 2009. Personalized recommendation on dynamic content using predictive bilinear models. In Proceedings of the 18th International Conference on World Wide Web. ACM, 691–700.
[6] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 191–198.
[7] Abhinandan S. Das, Mayur Datar, Ashutosh Garg, and Shyam Rajaram. 2007. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web. ACM, 271–280.
[8] Qi Gao, Fabian Abel, Geert-Jan Houben, and Ke Tao. 2011. Interweaving trend and user modeling for personalized news recommendation. In 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Vol. 1. IEEE, 100–103.
[9] Florent Garcin, Boi Faltings, Olivier Donatsch, Ayar Alazzawi, Christophe Bruttin, and Amr Huber. 2014. Offline and online evaluation of news recommender systems at swissinfo.ch. In Proceedings of the 8th ACM Conference on Recommender Systems. ACM, 169–176.
[10] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17. 1725–1731. https://doi.org/10.24963/ijcai.2017/239
[11] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
[12] Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 549–558.
[13] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2333–2338.
[14] Dhruv Khattar, Vaibhav Kumar, Vasudeva Varma, and Manish Gupta. 2018. Weave&Rec: A Word Embedding based 3-D Convolutional Network for News Recommendation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM, 1855–1858.
[15] Yehuda Koren and Robert Bell. 2015. Advances in Collaborative Filtering. In Recommender Systems Handbook. Springer US, Boston, MA, 77–118.
[16] Vaibhav Kumar, Dhruv Khattar, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep Neural Architecture for News Recommendation. In CLEF (Working Notes).
[17] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web. ACM, 661–670.
[18] Lei Li and Tao Li. 2013. News Recommendation via Hypergraph Learning: Encapsulation of User Behavior and News Content. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM '13). ACM, New York, NY, USA, 305–314. https://doi.org/10.1145/2433396.2433436
[19] Lei Li, Dingding Wang, Tao Li, Daniel Knox, and Balaji Padmanabhan. 2011. SCENE: a scalable two-stage personalized news recommendation system. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 125–134.
[20] Lei Li, Li Zheng, Fan Yang, and Tao Li. 2014. Modeling and broadening temporal user interest in personalized news recommendation. Expert Systems with Applications 41, 7 (2014), 3168–3177.
[21] Jiahui Liu, Peter Dolan, and Elin Rønby Pedersen. 2010. Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces. ACM, 31–40.
[22] Andreas Lommatzsch. 2014. Real-time news recommendation using context-aware ensembles. In European Conference on Information Retrieval (LNCS 8416). Springer, 51–62.
[23] Andreas Lommatzsch, Benjamin Kille, Frank Hopfgartner, Martha Larson, Torben Brodt, Jonas Seiler, and Özlem Özgöbek. 2017. CLEF 2017 NewsREEL Overview: A Stream-based Recommender Task for Evaluation and Education. In International Conference of the Cross-Language Evaluation Forum for European Languages. Springer, 239–254.
[24] Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. 2011. Content-based Recommender Systems: State of the Art and Trends. In Recommender Systems Handbook. Springer, Boston, MA, 73–105.
[25] Zhongqi Lu, Zhicheng Dou, Jianxun Lian, Xing Xie, and Qiang Yang. 2015. Content-based collaborative filtering for news topic recommendation. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
[26] Cataldo Musto, Giovanni Semeraro, Marco de Gemmis, and Pasquale Lops. 2016. Learning word embeddings from Wikipedia for content-based recommender systems. In European Conference on Information Retrieval. Springer, 729–734.
[27] Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-based News Recommendation for Millions of Users. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, New York, NY, USA, 1933–1942. https://doi.org/10.1145/3097983.3098108
[28] Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000.
[29] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 452–461.
[30] Guy Shani and Asela Gunawardana. 2010. Evaluating Recommendation Systems. In Recommender Systems Handbook. Springer US, Boston, MA.
[31] Adith Swaminathan and Thorsten Joachims. 2015. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. In ICML.
[32] Huazheng Wang, Qingyun Wu, and Hongning Wang. 2016. Learning hidden features for contextual bandits. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 1633–1642.
[33] Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018. DKN: Deep Knowledge-Aware Network for News Recommendation. In Proceedings of the 2018 World Wide Web Conference (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1835–1844. https://doi.org/10.1145/3178876.3186175
[34] Jin Wang, Zhongyuan Wang, Dawei Zhang, and Jun Yan. 2017. Combining Knowledge with Deep Convolutional Neural Networks for Short Text Classification. In IJCAI. 2915–2921.
[35] Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep Matrix Factorization Models for Recommender Systems. In IJCAI. 3203–3209.
[36] Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. 2018. DRN: A Deep Reinforcement Learning Framework for News Recommendation. In Proceedings of the 2018 World Wide Web Conference (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 167–176. https://doi.org/10.1145/3178876.3185994
[37] Li Zheng, Lei Li, Wenxing Hong, and Tao Li. 2013. PENETRATE: Personalized news recommendation using ensemble hierarchical clustering. Expert Systems with Applications 40, 6 (2013), 2127–2136.