=Paper=
{{Paper
|id=Vol-1441/recsys2015_poster4
|storemode=property
|title=Comparing Offline and Online Recommender System Evaluations on Long-tail Distributions
|pdfUrl=https://ceur-ws.org/Vol-1441/recsys2015_poster4.pdf
|volume=Vol-1441
|dblpUrl=https://dblp.org/rec/conf/recsys/MoreiraSC15
}}
==Comparing Offline and Online Recommender System Evaluations on Long-tail Distributions==
Gabriel S. P. Moreira (CI&T, Campinas, SP, Brazil, gabrielpm@ciandt.com), Gilmar Souza (CI&T, Campinas, SP, Brazil, gilmarj@ciandt.com), Adilson M. da Cunha (ITA, São José dos Campos, SP, Brazil, cunha@ita.br)

Copyright is held by the author(s). ACM RecSys 2015 Poster Proceedings, September 16-20, 2015, Vienna, Austria.

===Abstract===
In this investigation, we conduct a comparison between offline and online accuracy evaluation of different algorithms and settings in a real-world content recommender system. By focusing on recommendations of long-tail items, which are usually more interesting for users, it was possible to reduce the bias caused by extremely popular items and to observe a better alignment of accuracy results in offline and online evaluations.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - information filtering.

Keywords: Recommender systems, offline evaluation, online evaluation, click-through rate, accuracy metrics, long-tail.

===1. Evaluation Methodology===
This investigation focuses on a comparison between offline and online evaluation results in a recommender system implemented in Smart Canvas®, a platform that delivers web and mobile user experiences through curation algorithms. Smart Canvas features a mixed hybrid recommender system, in which items recommended by all available algorithms are aggregated and presented to users.

The evaluation was conducted in one production environment, the website of a large shopping mall. The accuracy of different recommender algorithms and variations of their settings was assessed in offline evaluation and then compared to online measures with real users (A/B testing).

Three experiments were conducted, each varying only one setting at a time, in both offline and online evaluations. They involve two algorithms implemented in Smart Canvas: Content-Based Filtering (based on TF-IDF and cosine distance) and Item-Item Frequency (a model-based algorithm based on the co-frequency of item interactions in user sessions).

For all experiments, accuracy was evaluated from two perspectives, considering (1) all recommended items and (2) only long-tail items. The main reason for this two-fold analysis is that recommendations of non-popular items matching users' interests might be more relevant to them, while popular items may also bias the evaluation of recommender accuracy.

====1.1 Offline Evaluation====
Offline evaluation is usually done by recording the items users have interacted with, hiding some of these user-item interactions (test set), and training the algorithms on the remaining information (train set) to assess their accuracy.

A time-based approach [3] was used to split the train and test sets. User interactions that occurred during the period before the split date were used as the train set (20 days), and the period after composed the test set (8 days), as shown in Figure 1. This simulates the production scenario, where the user preferences known up to that date are used to produce recommendations for the near future. The test set comprised 342 users in common with the train set, with a total of 636 interactions during the test period.

Figure 1: The Offline Evaluation Dataset Split

This investigation uses an offline evaluation methodology named One-Plus-Random or RelPlusN [3], in which, for each user, the recommender is requested to rank a list containing relevant items (those the user has interacted with in the test set) and a set of N non-relevant items (random items the user has never interacted with).

The final performance is averaged as a Click-Through Rate (CTR), a common metric for recommender and advertising systems, here referred to as Offline CTR. It was calculated as the ratio between the top recommended items with which users in fact interacted in the test set and the total number of simulated recommendations.
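The paper describes the One-Plus-Random (RelPlusN) protocol and the Offline CTR only in prose, so the following is a minimal sketch of how such a computation could look. The function name, the recommender interface (a score(user, item) callable), and the default values of N and of the top-k cutoff are illustrative assumptions, not part of the authors' evaluation framework.

<syntaxhighlight lang="python">
import random

def one_plus_random_ctr(score, train_interactions, test_interactions, all_items,
                        n_random=100, top_k=5, seed=42):
    """Sketch of the One-Plus-Random (RelPlusN) offline protocol.

    score(user, item)   -> relevance score from a recommender trained on the train set
    train_interactions  -> dict: user -> set of items interacted with before the split date
    test_interactions   -> dict: user -> set of items interacted with after the split date
    all_items           -> set of all candidate items

    Offline CTR = simulated lists whose relevant item lands in the top-k
                  / total number of simulated recommendation lists.
    """
    rng = random.Random(seed)
    hits, total = 0, 0
    for user, relevant_items in test_interactions.items():
        seen = relevant_items | train_interactions.get(user, set())
        pool = list(all_items - seen)  # items the user has never interacted with
        for relevant in relevant_items:
            # One relevant item plus N random non-relevant items, ranked by the recommender.
            negatives = rng.sample(pool, min(n_random, len(pool)))
            ranked = sorted([relevant] + negatives,
                            key=lambda item: score(user, item), reverse=True)
            hits += int(relevant in ranked[:top_k])
            total += 1
    return hits / total if total else 0.0
</syntaxhighlight>

Under this protocol, a higher Offline CTR simply means the relevant test items are ranked above the random negatives more often.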
====1.2 Online Evaluation====
For the online evaluation, an engine was developed to randomly split the user traffic and assign each user to one of the experiments of the hybrid recommender system (A/B testing), each experiment varying only one setting of the two component algorithms. The online evaluation involved 402 distinct items, 45,000 users, 5,850 recommendations, and 183 interactions.

The Click-Through Rate (CTR) metric was also used to measure the online accuracy of recommendations. The Online CTR was the ratio between the interactions on recommended items and the total number of recommended items viewed by users during their sessions.

===2. Results===
Three experiments were performed in both offline and online evaluations. In Experiments #1 and #2, the Content-Based Filtering settings named MinSimilarity and ItemDaysAgeLimit were assessed individually with different values. In Experiment #3, an Item-Item Frequency setting named LastXInteractedItems was varied. Accuracy (CTR) was evaluated from two perspectives: (1) all recommended items, including the very popular ones, and (2) only long-tail items.

The ideal scenario would be offline metrics varying in the same direction as the online CTR measures. That behaviour would indicate that offline evaluation could be used to cost-effectively identify the best setting values for recommender algorithms before involving users in an online evaluation. However, Online and Offline CTR behaviour did not align in perspective (1), considering all recommended items, as can be seen in Figure 2.

Figure 2: Experiment #1 (including popular items) - CTR for Content-Based algorithm - MinSimilarity

This investigation went further to better understand the misalignment between offline and online evaluations in this context. It was assessed whether the very popular items could introduce a bias in the recommender accuracy analysis, by ignoring extremely popular items and considering only long-tail items in perspective (2). For the offline evaluation, the top 1.1% of items concentrated 22% of the interactions and were ignored. For the online experiments, the 1.5% most popular items, responsible for 41% of the interactions on the website, were also ignored.

Considering only the long-tail items in Experiment #1, the Offline and Online CTR turned out to be nicely aligned, as shown in Figure 3. The best setting value for the MinSimilarity threshold was 0.1, following the same trend for both CTR metrics.

Figure 3: Experiment #1 (long-tail) - CTR for Content-Based algorithm - MinSimilarity

In Experiment #2 for long-tail items, the metric variations were very similar to the results considering popular items, so there was no prediction gain from removing very popular items from the analysis. In Experiment #3, the CTR metrics variation was even more aligned when keeping only long-tail items (charts omitted due to space reasons).

In Experiments #1 and #3, considering only long-tail items, offline evaluation was an adequate predictor of the online accuracy as a function of the setting thresholds. The observed bias of popular items on evaluation accuracy metrics is aligned with recent studies like [1] and [2].
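The long-tail perspective above removes a small head of very popular items (the top 1.1% of items offline, the top 1.5% online) before computing CTR. A minimal sketch of such a popularity cutoff follows, assuming a simple (user, item) interaction log; the function name and the head_fraction parameter are illustrative assumptions, not the authors' implementation.

<syntaxhighlight lang="python">
from collections import Counter

def split_head_and_long_tail(interactions, head_fraction=0.011):
    """Split items into a popular 'head' and the 'long tail' by interaction counts.

    interactions  -> iterable of (user, item) pairs from the interaction log
    head_fraction -> share of distinct items treated as the popular head
                     (e.g. 0.011 offline, 0.015 online, as in the paper)
    Returns (head_items, tail_items, head_interaction_share).
    """
    counts = Counter(item for _, item in interactions)
    ranked = [item for item, _ in counts.most_common()]  # most popular first
    head_size = max(1, int(round(head_fraction * len(ranked))))
    head_items = set(ranked[:head_size])
    tail_items = set(ranked[head_size:])
    head_share = sum(counts[i] for i in head_items) / sum(counts.values())
    return head_items, tail_items, head_share
</syntaxhighlight>

The long-tail CTR is then computed exactly as before, but counting only recommendations and interactions whose items fall in the tail set.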
===3. Conclusion===
In this study, offline and online experiments were performed and compared in a real production environment of a hybrid recommender system. The results did not correlate for most experiments, but when focusing on long-tail items it was possible to observe how popular items can bias the accuracy evaluation. Two out of three experiments on long-tail items had an Offline CTR closely aligned with the Online CTR.

The evaluation of long-tail items may be a candidate for deeper investigation in future studies, aiming to increase confidence in offline evaluation results. Furthermore, by focusing on accuracy optimization for long-tail items, algorithms may give users a clearer perception of the ability of the system to recommend non-trivial relevant items.

This study is still ongoing, to provide a better understanding of the relationship between offline and online evaluation results. Besides accuracy, a similar investigation of other properties, like coverage and longer-term metrics related to user engagement, is suggested.

===4. Acknowledgements===
Our thanks to CI&T for supporting the development of the Smart Canvas® recommender system evaluation framework and to ITA for providing the research environment.

===5. References===
[1] J. Beel, M. Genzmehr, S. Langer, A. Nürnberger, and B. Gipp. A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation. In Proc. Workshop on Reproducibility and Replication in Recommender Systems Evaluation, pages 7–14. ACM, 2013.

[2] F. Garcin, B. Faltings, O. Donatsch, A. Alazzawi, C. Bruttin, and A. Huber. Offline and online evaluation of news recommender systems at swissinfo.ch. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 169–176. ACM, 2014.

[3] A. Said and A. Bellogín. Comparative recommender system evaluation: benchmarking recommendation frameworks. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 129–136. ACM, 2014.