=Paper=
{{Paper
|id=Vol-1441/recsys2015_poster4
|storemode=property
|title=Comparing Offline and Online Recommender System Evaluations on Long-tail Distributions
|pdfUrl=https://ceur-ws.org/Vol-1441/recsys2015_poster4.pdf
|volume=Vol-1441
|dblpUrl=https://dblp.org/rec/conf/recsys/MoreiraSC15
}}
==Comparing Offline and Online Recommender System Evaluations on Long-tail Distributions==
Gabriel S. P. Moreira (CI&T, Campinas, SP, Brazil, gabrielpm@ciandt.com), Gilmar Souza (CI&T, Campinas, SP, Brazil, gilmarj@ciandt.com), Adilson M. da Cunha (ITA, São José dos Campos, SP, Brazil, cunha@ita.br)

Copyright is held by the author(s). ACM RecSys 2015 Poster Proceedings, September 16-20, 2015, Vienna, Austria.

===Abstract===
In this investigation, we conduct a comparison between offline and online accuracy evaluation of different algorithms and settings in a real-world content recommender system. By focusing on recommendations of long-tail items, which are usually more interesting for users, it was possible to reduce the bias caused by extremely popular items and to observe a better alignment of accuracy results in offline and online evaluations.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - information filtering.

Keywords: Recommender systems, offline evaluation, online evaluation, click-through rate, accuracy metrics, long-tail.

===1. Evaluation Methodology===
This investigation focuses on a comparison between offline and online evaluation results in a recommender system implemented in Smart Canvas®, a platform that delivers web and mobile user experiences through curation algorithms. Smart Canvas features a mixed hybrid recommender system, in which items recommended by all available algorithms are aggregated and presented to users.

The evaluation was conducted in one production environment, the website of a large shopping mall. The accuracy of different recommender algorithms and variations of their settings was assessed in offline evaluation and then compared to online measures with real users (A/B testing).

Three experiments were conducted, each varying only one setting at a time, in both offline and online evaluations. They involve two algorithms implemented in Smart Canvas: Content-Based Filtering (based on TF-IDF and cosine distance) and Item-Item Frequency (a model-based algorithm based on the co-frequency of item interactions in user sessions).

For all experiments, accuracy was evaluated from two perspectives, considering (1) all recommended items and (2) only long-tail items. The main reason for this two-fold analysis is that recommendations of non-popular items matching users' interests might be more relevant to them, while popular items may also bias the evaluation of recommender accuracy.

====1.1 Offline Evaluation====
Offline evaluation is usually done by recording the items users have interacted with, hiding some of these user-item interactions (test set), and training the algorithms on the remaining information (train set) to assess their accuracy.

A time-based approach [3] was used to split the train and test sets. User interactions that occurred during the period before the split date were used as the train set (20 days), and the period after composed the test set (8 days), as shown in Figure 1. This simulates the production scenario, where the user preferences known up to that date are used to produce recommendations for the near future. The test set comprised 342 users in common with the train set, with a total of 636 interactions during the test period.

Figure 1: The Offline Evaluation Dataset Split

This investigation uses an offline evaluation methodology named One-Plus-Random or RelPlusN [3], in which, for each user, the recommender is requested to rank a list containing relevant items (those the user has interacted with in the test set) and a set of N non-relevant items (random items the user has never interacted with).

The final performance is averaged as a Click-Through Rate (CTR), a common metric for recommender and advertising systems, here referred to as Offline CTR. It was calculated as the ratio between the top recommended items with which users in fact interacted in the test set and the total number of simulated recommendations.
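The paper describes the One-Plus-Random (RelPlusN) protocol and the Offline CTR only in prose, so the following is a minimal sketch of how such a computation could look. The function name, the recommender interface (a score(user, item) callable), and the default values of N and of the top-k cutoff are illustrative assumptions, not part of the authors' evaluation framework.

<syntaxhighlight lang="python">
import random

def one_plus_random_ctr(score, train_interactions, test_interactions, all_items,
                        n_random=100, top_k=5, seed=42):
    """Sketch of the One-Plus-Random (RelPlusN) offline protocol.

    score(user, item)   -> relevance score from a recommender trained on the train set
    train_interactions  -> dict: user -> set of items interacted with before the split date
    test_interactions   -> dict: user -> set of items interacted with after the split date
    all_items           -> set of all candidate items

    Offline CTR = simulated lists whose relevant item lands in the top-k
                  / total number of simulated recommendation lists.
    """
    rng = random.Random(seed)
    hits, total = 0, 0
    for user, relevant_items in test_interactions.items():
        seen = relevant_items | train_interactions.get(user, set())
        pool = list(all_items - seen)  # items the user has never interacted with
        for relevant in relevant_items:
            # One relevant item plus N random non-relevant items, ranked by the recommender.
            negatives = rng.sample(pool, min(n_random, len(pool)))
            ranked = sorted([relevant] + negatives,
                            key=lambda item: score(user, item), reverse=True)
            hits += int(relevant in ranked[:top_k])
            total += 1
    return hits / total if total else 0.0
</syntaxhighlight>

Under this protocol, a higher Offline CTR simply means the relevant test items are ranked above the random negatives more often.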
====1.2 Online Evaluation====
For the online evaluation, an engine was developed to randomly split the user traffic and assign each user to one of the experiments of the hybrid recommender system (A/B testing), each experiment varying only one setting of the two component algorithms. The online evaluation involved 402 distinct items, 45,000 users, 5,850 recommendations, and 183 interactions.

The Click-Through Rate (CTR) metric was also used to measure the online accuracy of recommendations. The Online CTR was the ratio between the interactions on recommended items and the total number of recommended items viewed by users during their sessions.

===2. Results===
Three experiments were performed in both offline and online evaluations. In Experiments #1 and #2, the Content-Based Filtering settings named MinSimilarity and ItemDaysAgeLimit were assessed individually with different values. In Experiment #3, an Item-Item Frequency setting named LastXInteractedItems was varied. Accuracy (CTR) was evaluated from two perspectives: (1) all recommended items, including the very popular ones, and (2) only long-tail items.

The ideal scenario would be offline metrics varying in the same direction as the online CTR measures. That behaviour would indicate that offline evaluation could be used to cost-effectively identify the best setting values for recommender algorithms before involving users in an online evaluation. However, Online and Offline CTR behaviour did not align in perspective (1), considering all recommended items, as can be seen in Figure 2.

Figure 2: Experiment #1 (including popular items) - CTR for Content-Based algorithm - MinSimilarity

This investigation went further to better understand the misalignment between offline and online evaluations in this context. It was assessed whether the very popular items could introduce a bias in the recommender accuracy analysis, by ignoring extremely popular items and considering only long-tail items in perspective (2). For the offline evaluation, the top 1.1% of items concentrated 22% of the interactions and were ignored. For the online experiments, the 1.5% most popular items, responsible for 41% of the interactions on the website, were also ignored.

Considering only the long-tail items in Experiment #1, the Offline and Online CTR turned out to be nicely aligned, as shown in Figure 3. The best setting value for the MinSimilarity threshold was 0.1, following the same trend for both CTR metrics.

Figure 3: Experiment #1 (long-tail) - CTR for Content-Based algorithm - MinSimilarity

In Experiment #2 for long-tail items, the metric variations were very similar to the results considering popular items, so there was no prediction gain from removing very popular items from the analysis. In Experiment #3, the CTR metrics variation was even more aligned when keeping only long-tail items (charts omitted due to space reasons).

In Experiments #1 and #3, considering only long-tail items, offline evaluation was an adequate predictor of the online accuracy as a function of the setting thresholds. The observed bias of popular items on evaluation accuracy metrics is aligned with recent studies like [1] and [2].
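The long-tail perspective above removes a small head of very popular items (the top 1.1% of items offline, the top 1.5% online) before computing CTR. A minimal sketch of such a popularity cutoff follows, assuming a simple (user, item) interaction log; the function name and the head_fraction parameter are illustrative assumptions, not the authors' implementation.

<syntaxhighlight lang="python">
from collections import Counter

def split_head_and_long_tail(interactions, head_fraction=0.011):
    """Split items into a popular 'head' and the 'long tail' by interaction counts.

    interactions  -> iterable of (user, item) pairs from the interaction log
    head_fraction -> share of distinct items treated as the popular head
                     (e.g. 0.011 offline, 0.015 online, as in the paper)
    Returns (head_items, tail_items, head_interaction_share).
    """
    counts = Counter(item for _, item in interactions)
    ranked = [item for item, _ in counts.most_common()]  # most popular first
    head_size = max(1, int(round(head_fraction * len(ranked))))
    head_items = set(ranked[:head_size])
    tail_items = set(ranked[head_size:])
    head_share = sum(counts[i] for i in head_items) / sum(counts.values())
    return head_items, tail_items, head_share
</syntaxhighlight>

The long-tail CTR is then computed exactly as before, but counting only recommendations and interactions whose items fall in the tail set.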
===3. Conclusion===
In this study, offline and online experiments were performed and compared in a real production environment of a hybrid recommender system. The results did not correlate for most experiments, but when focusing on long-tail items it was possible to observe how popular items can bias the accuracy evaluation. Two out of three experiments on long-tail items had an Offline CTR closely aligned with the Online CTR.

The evaluation of long-tail items may be a candidate for deeper investigation in future studies, aiming to increase confidence in offline evaluation results. Furthermore, by focusing on accuracy optimization for long-tail items, algorithms may give users a clearer perception of the ability of the system to recommend non-trivial relevant items.

This study is still ongoing, to provide a better understanding of the relationship between offline and online evaluation results. Besides accuracy, a similar investigation of other properties, like coverage and longer-term metrics related to user engagement, is suggested.

===4. Acknowledgements===
Our thanks to CI&T for supporting the development of the Smart Canvas® recommender system evaluation framework and to ITA for providing the research environment.

===5. References===
[1] J. Beel, M. Genzmehr, S. Langer, A. Nürnberger, and B. Gipp. A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation. In Proc. Workshop on Reproducibility and Replication in Recommender Systems Evaluation, pages 7–14. ACM, 2013.

[2] F. Garcin, B. Faltings, O. Donatsch, A. Alazzawi, C. Bruttin, and A. Huber. Offline and online evaluation of news recommender systems at swissinfo.ch. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 169–176. ACM, 2014.

[3] A. Said and A. Bellogín. Comparative recommender system evaluation: benchmarking recommendation frameworks. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 129–136. ACM, 2014.