The Degree of Randomness in a Live Recommender Systems Evaluation

Gebrekirstos G. Gebremeskel and Arjen P. de Vries
Information Access, CWI, Amsterdam, Science Park 123, 1098 XG Amsterdam, Netherlands
gebre@cwi.nl, arjen@acm.org

Abstract. This report describes our participation in the CLEF NEWSREEL News Recommendation Evaluation Lab. We report and analyze experiments conducted with two goals. The first goal is to study the effect of randomness on the evaluation of algorithms. The second goal is to investigate whether geographic information can help improve the performance of news recommendation.

1 Tasks and Objectives

In this report we describe the results of experiments we conducted during our participation in the CLEF NEWSREEL News Recommendations Evaluation Lab, in the task of Benchmark News Recommendations in a Living Lab [5]. The experiments have the following goals. The first goal is to investigate the effect of system and/or user behavior randomness on recommender system evaluations. This randomness refers to the selection of algorithms for serving recommendation requests by the Plista framework, and to the click behavior of users, which can be influenced by the situation of the user. The second goal is to study whether geographic information can play a role in news recommendation systems.

The literature on recommender systems shows that offline and online recommender system evaluations do not concur with each other [3, 1, 6]: recommender systems behave differently in offline and online evaluations, both in terms of absolute and relative performance. This has serious implications for recommender system research, because the whole point of offline evaluation is the assumption that at least the relative performance of recommender systems is indicative of their relative online performance, and thus an important step in selecting algorithms that can be deployed in a live recommendation setting.

Several reasons can be given for the disparate performance in offline and online evaluations. The three most important papers that compare online and offline evaluations [3, 1, 6] use different datasets for offline and online evaluation, so it is possible that the difference in performance can partly be attributed to this difference in datasets. The literature lists further reasons for the disparate performance of recommender systems in offline and online evaluations [6, 7]. The first reason is that offline evaluations can measure only accuracy; they do not take user behavior into account. The second reason is that offline datasets are incomplete and imperfect, and thus recommender systems are evaluated on the basis of incomplete data.

The first reason, that offline evaluations can measure only accuracy and thus cannot take user behavior into account, can be understood in two ways. One way is that online evaluations can be influenced by implementations of the algorithms used in offline evaluation that have been adapted to take user behavior into account, in which case the implementation becomes a different one. The other way to understand it is that online evaluations happen in an environment that involves user behavior, and thus their performance can be influenced by system and/or user behavior "randomness". The first interpretation implies that offline and online algorithms are implemented differently.
The second interpretation implies that the difference between offline and online evaluations can be due to the system and/or user behavior randomness that online evaluations are subjected to. We attempt to explore this randomness in our participation.

The motivation for the second goal comes from a descriptive study we conducted on a Plista dataset collected during our previous participation. That study produced two findings [4]. One is that there is a substantial difference in the geographical distribution of the readerships of traditional news portals; the second is that, within the same portal, the geographical distribution of the readership of the local news category differs substantially from that of the remaining categories. In our experiments, we tried to exploit the second finding to improve news recommendation.

2 Approach

To study the effect of system and/or user behavior randomness on recommender system evaluations, we run two instances of the same news recommender algorithm, with the aim of measuring the effect of the "randomness" involved in news recommendation clicks. To exploit geographical information for news recommendation, we employ a recommendation diversification approach. Specifically, we include geographically relevant items in the recommendation list, thus diversifying the recommendations in favor of geographic relevance.

2.1 Algorithms

We experimented with five algorithms, all of them modifications of the recency algorithm. The recency algorithm takes into account the recency and popularity of an item, and it has been shown to be a strong baseline in previous online evaluations. The algorithmic variations that we experimented with are listed below.

Recency: This algorithm keeps the 100 most recently viewed items of each publisher in consideration for being recommended to the user. For any recommendation request, the items that are, at the time of the request, the most recently read (clicked) are the ones that are recommended. We run two instances of this algorithm to get a sense of the randomness involved in the selection of algorithms by the Plista framework [2] and/or the clicks on recommendations by users.

GeoRec: The geographical recommender takes the geographical region of users (states, to be specific) and the local category of news items into account when generating recommendations. We generate two sets of recommendations, one by the recency recommender and one by a purely geographical recommender. For the purely geographical recommender, we take the 100 most recently viewed items and sort them according to their geographic conditional likelihood scores, generated by Equation 1:

    r_{u_a, i_k} = P(c_{i_k} | g_{u_a})    (1)

where c_{i_k} is the local category of item i_k (an item is either in the local category or not), and g_{u_a} is the state-level geographical information of user u_a, that is, the state the user belongs to. We then take the top items, twice the number of requested recommendations, as the recommendations of the geographic recommender. For the recency recommender, we take the requested number of recommendations from the most recently viewed items. We then intersect the two sets of recommendations. If the number of elements in the intersection does not equal the requested number of recommendations, we take half - 1 items from the purely geographic recommender and half + 1 from the recency recommender. A sketch of this combination step is given at the end of this subsection.

GeoRecHistory: This is a modification of the GeoRec recommender; it excludes from the recommendation list those items that the user has already visited.

RecencyRandom: This recommender recommends items randomly selected from the circular buffer. It was started somewhat later than the others, and we used it as a baseline algorithm.

The two instances of Recency and the GeoRec algorithms were run for a consecutive period of 53 days, from 2015-04-12 to 2015-06-03. The RecencyRandom algorithm was started 12 days later, on 2015-04-24.
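To make the GeoRec combination step more concrete, the following is a minimal sketch in Python of one way to read the description above. The function name geo_rec, the parameter names, the tie-breaking by buffer order, and the handling of duplicates in the half - 1 / half + 1 fallback are our assumptions; the paper leaves these details open.

```python
def geo_rec(recent_items, geo_scores, k):
    """Sketch of the GeoRec combination of recency and geographic candidates.

    recent_items: item ids from the 100-item buffer, most recently viewed first.
    geo_scores:   item id -> P(local category | user state), i.e. Equation 1.
    k:            number of recommendations requested.
    """
    # Purely geographic recommender: rank the buffer by the Equation 1 score
    # and keep twice the requested number of items.
    geo_ranked = sorted(recent_items, key=lambda i: geo_scores.get(i, 0.0), reverse=True)
    geo_top = geo_ranked[:2 * k]

    # Recency recommender: the k most recently viewed items.
    recency_top = recent_items[:k]

    # Intersect the two candidate lists, keeping recency order.
    geo_set = set(geo_top)
    intersection = [i for i in recency_top if i in geo_set]
    if len(intersection) == k:
        return intersection

    # Fallback: half - 1 items from the geographic list, the rest (half + 1)
    # from the recency list, skipping duplicates.
    half = k // 2
    picked = list(geo_top[:max(half - 1, 0)])
    for i in recency_top:
        if len(picked) >= k:
            break
        if i not in picked:
            picked.append(i)
    return picked[:k]
```

In this reading, the geographic list contributes at most half - 1 slots whenever the two recommenders disagree, so the recency signal always dominates the final list; other readings of the paper's description are possible.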
3 Results and Discussions

We present two types of performance scores: cumulative and daily click-through rates (CTR). The cumulative CTR is presented in Table 1. We see that the performances are all close to each other. However, if we rank them, the GeoRec recommender, followed by Recency, followed by GeoRecHistory, are the three best performing algorithms. The daily performances of the algorithms are shown in Figure 1, and the cumulative CTR as a function of the number of days is shown in Figure 2.

  Algorithm        Requests   Clicks   CTR (%)
  Recency            56,350      478      0.85
  Recency2           53,863      420      0.78
  GeoRec             54,338      470      0.86
  GeoRecHistory      47,001      395      0.84
  RecencyRandom      39,616      283      0.71

Table 1. Live performance of the different algorithms

[Figure 1. The daily CTR performances of the five algorithms (x-axis: days 1-53; y-axis: daily CTR).]

[Figure 2. The cumulative CTR performances of the five algorithms as they progress on a daily basis (x-axis: days 1-53; y-axis: cumulative CTR).]

From the daily (Figure 1) and cumulative (Figure 2) plots, we see that the performance of each of the algorithms varies greatly. In the cumulative plot, we see that Recency and Recency2 performed differently for a long time, generally getting closer to each other until some point after which they seem to stabilize. If one were continuously monitoring the performance of these two algorithms, one would conclude that the Recency algorithm is better than the Recency2 algorithm. This would happen, at least some of the time, even if one employed statistical significance tests.

Let us assume that the experimenter was peeking at the experiment every day to decide which algorithm is better. How many times would the experimenter declare statistically significant differences between the algorithms? We examined this using two baselines: the random recommender (RecencyRandom) and Recency2. The results when using the RecencyRandom recommender as a baseline are given in Table 2; the results for the Recency2 baseline are given in Table 3. A sketch of one possible daily test of this kind is given after Table 3. We see that, when RecencyRandom is used as a baseline, Recency, GeoRec and GeoRecHistory achieve statistically significant improvements on close to half of the days or more. With Recency2 as baseline, statistically significant results are rare; only Recency achieves significance, and on only two of the days.

  Algorithm       Days of Significance   Percentage (%)
  Recency                           16             40.0
  GeoRec                            23             57.5
  GeoRecHistory                     24             60.0

Table 2. Statistical significance over the baseline of RecencyRandom

  Algorithm       Days of Significance   Percentage (%)
  Recency                            2              5.0
  GeoRec                             0              0.0
  GeoRecHistory                      1              2.5

Table 3. Statistical significance over the baseline of Recency2
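The report does not state which statistical test was applied in the daily comparisons. As an illustration only, the sketch below compares the CTRs of two algorithms with a two-proportion z-test on the accumulated requests and clicks; the function name, the choice of test, and the 0.05 significance level are our assumptions, not the procedure used in the experiments.

```python
import math

def two_proportion_z_test(clicks_a, requests_a, clicks_b, requests_b):
    """Two-proportion z-test on click-through rates.

    Returns the z statistic and a two-sided p-value. Illustrative choice of
    test; the report does not specify which test was actually used.
    """
    p_a = clicks_a / requests_a
    p_b = clicks_b / requests_b
    # Pooled click probability under the null hypothesis of equal CTR.
    p_pool = (clicks_a + clicks_b) / (requests_a + requests_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / requests_a + 1 / requests_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Example with the cumulative counts of Table 1: Recency vs. Recency2.
z, p = two_proportion_z_test(478, 56350, 420, 53863)
significant = p < 0.05  # assumed significance level
```

A daily peeking procedure would rerun such a test on each day's accumulated counts, which is exactly the repeated-testing setting that inflates the chance of declaring a spurious difference.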
What is truly interesting is that the same algorithm (Recency) can end up achieving statistically significant performance over another instance of itself (Recency2). The two instances of the same algorithm show such a large difference in performance that there is a real chance of concluding that one is better than the other. This raises questions about claimed performance improvements of algorithms evaluated with live users in general. Specifically, to what extent is one algorithm's improvement over another a real improvement? It seems to us that statistical significance tests alone are not sufficiently indicative of performance improvements. Given the randomness involved in users' clicks on recommendations (and in the system), a certain range of performance difference is not worth taking seriously. Studies that involve users must account for some level of randomness, in addition to statistical significance tests.

4 Conclusions and Perspectives for Future Work

We set out to study two factors in news recommendation: geographical information, and user and/or system randomness in news recommendation clicks. Although the geographical recommendation did not show any strikingly useful improvement over the recency algorithm, the observed user and system randomness indicates that recommender system evaluations that involve users in a live setting must take some degree of randomness into account, in addition to statistical significance tests. In the future, we would like to extend this work with a better statistical testing procedure that can account for this randomness. We would also like to compare offline and online evaluations; specifically, we would like to run one algorithm on the logs of another to see to what extent they can replace each other.

References

1. J. Beel, M. Genzmehr, S. Langer, A. Nürnberger, and B. Gipp. A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation. In Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation, pages 7–14. ACM, 2013.
2. T. Brodt and F. Hopfgartner. Shedding light on a living lab: The CLEF NEWSREEL open recommendation platform. In Proceedings of the 5th Information Interaction in Context Symposium, pages 223–226. ACM, 2014.
3. F. Garcin, B. Faltings, O. Donatsch, A. Alazzawi, C. Bruttin, and A. Huber. Offline and online evaluation of news recommender systems at swissinfo.ch. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 169–176. ACM, 2014.
4. G. G. Gebremeskel and A. P. de Vries. The role of geographic information in news consumption. In Proceedings of the 24th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 2015.
5. F. Hopfgartner, B. Kille, A. Lommatzsch, T. Plumbaum, T. Brodt, and T. Heintz. Benchmarking news recommendations in a living lab. In Information Access Evaluation. Multilinguality, Multimodality, and Interaction, pages 250–267. Springer, 2014.
6. E. Kirshenbaum, G. Forman, and M. Dugan. A live comparison of methods for personalized article recommendation at forbes.com. In Machine Learning and Knowledge Discovery in Databases, pages 51–66. Springer, 2012.
7. S. M. McNee, N. Kapoor, and J. A. Konstan. Don't look stupid: Avoiding pitfalls when recommending research papers.
In Proceedings of the 2006 20th Anniversary Conference on Computer Supported Cooperative Work, pages 171–180. ACM, 2006.