<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparing offline and online recommender system evaluations on long-tail distributions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriel S. P. Moreira</string-name>
          <email>gabrielpm@ciandt.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gilmar Souza</string-name>
          <email>gilmarj@ciandt.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adilson M. da Cunha</string-name>
          <email>cunha@ita.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CI&amp;T</institution>
          ,
          <addr-line>Campinas, SP</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ITA</institution>
          ,
          <addr-line>Sao Jose dos Campos, SP</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>In this investigation, we conduct a comparison between offline and online accuracy evaluation of different algorithms and settings in a real-world content recommender system. By focusing on recommendations of long-tail items, which are usually more interesting for users, it was possible to reduce the bias caused by extremely popular items and to observe a better alignment of accuracy results in offline and online evaluations.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender systems</kwd>
        <kwd>offline evaluation</kwd>
        <kwd>online evaluation</kwd>
        <kwd>click-through rate</kwd>
        <kwd>accuracy metrics</kwd>
        <kwd>long-tail</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CATEGORIES AND SUBJECT DESCRIPTORS</title>
      <p>H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval - information filtering.</p>
    </sec>
    <sec id="sec-2">
      <title>EVALUATION METHODOLOGY</title>
      <p>This investigation focuses on a comparison between offline
and online evaluation results in a recommender system
implemented in Smart Canvas®, a platform that delivers web
and mobile user experiences through curation algorithms.
Smart Canvas features a mixed hybrid recommender system,
in which items recommended by all available algorithms are
aggregated and presented to users.</p>
      <p>It was conducted in one production environment, which
consists of the website of a large shopping mall. The
accuracy of different recommender algorithms and variations of
their settings was assessed in offline evaluation and further
compared to online measures with real users (A/B testing).</p>
      <p>In this investigation, three experiments were conducted,
each of them varying only one setting at a time, in both
offline and online evaluations. They involve two algorithms
implemented in Smart Canvas: Content-Based Filtering (based
on TF-IDF and cosine distance) and Item-Item Frequency
(a model-based algorithm based on the co-frequency of item
interactions in user sessions).</p>
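      <p>As a minimal sketch, and not the Smart Canvas implementation, the snippet below illustrates the two component techniques named above: content-based scoring over TF-IDF vectors (ranking by cosine similarity is equivalent to ranking by cosine distance) and an item-item co-frequency model counted within user sessions. The function names and the use of scikit-learn are assumptions for illustration only.</p>
      <preformat>
# Minimal sketch (assumed names, not the production code): content-based scoring
# via TF-IDF vectors, and item-item co-frequency counted within user sessions.
from collections import Counter
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def content_based_scores(item_texts, profile_text):
    """Score each item text against a user profile text."""
    vectorizer = TfidfVectorizer()
    item_matrix = vectorizer.fit_transform(item_texts)    # items x terms
    profile_vec = vectorizer.transform([profile_text])    # 1 x terms
    return cosine_similarity(profile_vec, item_matrix).ravel()


def item_item_cofrequency(sessions):
    """Count how often each pair of items co-occurs in the same user session."""
    cofreq = Counter()
    for items in sessions:
        for a, b in combinations(sorted(set(items)), 2):
            cofreq[(a, b)] += 1
            cofreq[(b, a)] += 1
    return cofreq
</preformat>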
      <p>For all experiments, accuracy was evaluated under two
perspectives, considering (1) all recommended items and (2)
only long-tail items. The main reason for this two-fold
analysis is that recommendations of non-popular items
matching users' interests might be more relevant to them. Popular
items may also bias the evaluation of recommender
accuracy.</p>
    </sec>
    <sec id="sec-3">
      <title>Offline Evaluation</title>
      <p>Offline evaluation is usually done by recording the items
users have interacted with, hiding some of these user-item
interactions (test set), and training algorithms on the
remaining information (train set) to assess the accuracy.</p>
      <p>A time-based approach [3] was used to split train and test
sets. User interactions that occurred during the period before the
split date were used as the train set (20 days), and the period
after composed the test set (8 days), as shown in Figure 1.
This simulates the production scenario, where the user
preferences known until that date are used to produce
recommendations for the near future. The test set comprised 342 users
in common with the train set, with a total of 636 interactions
during the test period.</p>
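      <p>A minimal sketch of the time-based split described above, assuming each interaction is a dict with user, item, and timestamp fields (the field names are illustrative):</p>
      <preformat>
# Sketch of the time-based train/test split: interactions before the split date
# form the train set (20 days) and interactions after it form the test set (8 days).
from datetime import timedelta


def time_based_split(interactions, split_date, train_days=20, test_days=8):
    """interactions: iterable of dicts like {"user": ..., "item": ..., "ts": datetime}."""
    train_start = split_date - timedelta(days=train_days)
    test_end = split_date + timedelta(days=test_days)
    train = [i for i in interactions
             if i["ts"] >= train_start and split_date > i["ts"]]
    test = [i for i in interactions
            if i["ts"] >= split_date and test_end > i["ts"]]
    # Keep only test users also present in the train set, as done in this study.
    train_users = {i["user"] for i in train}
    test = [i for i in test if i["user"] in train_users]
    return train, test
</preformat>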
      <p>This investigation uses an offline evaluation methodology
named One-Plus-Random or RelPlusN [3], in which, for
each user, the recommender is requested to rank a list with
relevant items (those that the user has interacted with in the
test set) and a set of N non-relevant items (random items
which the user has never interacted with).</p>
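      <p>A sketch of one One-Plus-Random (RelPlusN) trial follows, under the assumption of a hypothetical rank(user, candidates) function that orders candidate items by the recommender's score, best first; the values of N and of the top-k cut-off are illustrative, as the paper does not report them.</p>
      <preformat>
import random


def relplusn_trial(user, relevant_item, all_items, seen_items, rank, n=100, top_k=10):
    """Rank one relevant test-set item against N random unseen items.

    Returns True (a 'hit') if the relevant item appears among the top_k positions."""
    unseen = [i for i in all_items if i not in seen_items and i != relevant_item]
    candidates = random.sample(unseen, min(n, len(unseen))) + [relevant_item]
    ranked = rank(user, candidates)          # hypothetical recommender interface
    return relevant_item in ranked[:top_k]
</preformat>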
      <p>The final performance is averaged as a Click-Through
Rate (CTR), a common metric for recommender and
advertising systems, here referred to as Offline CTR. It was
calculated as the ratio between the top recommended items that
the users in fact interacted with in the test set and the total number
of simulated recommendations.</p>
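      <p>Under the protocol above, the Offline CTR can be computed as the fraction of simulated recommendations that were hits, for example:</p>
      <preformat>
def offline_ctr(hits):
    """hits: one boolean per simulated recommendation (RelPlusN trial).

    Offline CTR = recommendations the user actually interacted with in the test set
    divided by the total number of simulated recommendations."""
    return sum(hits) / len(hits) if hits else 0.0
</preformat>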
    </sec>
    <sec id="sec-4">
      <title>Online Evaluation</title>
      <p>For the online evaluation, an engine was developed to
randomly split user traffic and assign each user to one of the experiments
of the hybrid recommender system (A/B testing), each
varying only one setting of the two component algorithms. The
online evaluation involved 402 distinct items, 45,000 users,
5,850 recommendations, and 183 interactions.</p>
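      <p>A minimal sketch of a deterministic traffic split (arm names are hypothetical): hashing the user id yields a stable, roughly uniform assignment of users to experiment variants.</p>
      <preformat>
import hashlib

# Illustrative arm names only; the actual experiment configuration is not described.
ARMS = ["variant_a", "variant_b"]


def assign_arm(user_id, arms=ARMS):
    """Deterministically map a user id to one experiment arm."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return arms[int(digest, 16) % len(arms)]
</preformat>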
      <p>The Click-Through Rate (CTR) metric was also used to
measure the online accuracy of recommendations. Online CTR
was the ratio between interactions on recommended items and the
total number of recommended items viewed by users during their
sessions.</p>
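      <p>For reference, the Online CTR definition above reduces to a simple ratio:</p>
      <preformat>
def online_ctr(clicks_on_recommended_items, recommended_items_viewed):
    """Online CTR = interactions on recommended items / recommended items viewed."""
    if recommended_items_viewed == 0:
        return 0.0
    return clicks_on_recommended_items / recommended_items_viewed
</preformat>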
      <p>Three experiments were performed in both offline and
online evaluations. In Experiments #1 and #2, the Content-Based
Filtering settings named MinSimilarity and
ItemDaysAgeLimit were assessed individually with different values. In
Experiment #3, an Item-Item Frequency setting named
LastXInteractedItems was varied.</p>
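      <p>The experiment layout can be summarized as in the sketch below; all candidate values are hypothetical, except that MinSimilarity = 0.1 is later reported as the best threshold in Experiment #1.</p>
      <preformat>
# Hypothetical sketch of the three experiments: each arm varies one setting only.
EXPERIMENTS = {
    "experiment_1": {"algorithm": "ContentBasedFiltering",
                     "setting": "MinSimilarity",
                     "values": [0.05, 0.1, 0.2]},       # illustrative candidates
    "experiment_2": {"algorithm": "ContentBasedFiltering",
                     "setting": "ItemDaysAgeLimit",
                     "values": [7, 14, 30]},            # illustrative candidates
    "experiment_3": {"algorithm": "ItemItemFrequency",
                     "setting": "LastXInteractedItems",
                     "values": [1, 3, 5]},              # illustrative candidates
}
</preformat>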
      <p>Accuracy (CTR) was evaluated under two perspectives,
considering: (1) all recommended items, including the very
popular ones, and (2) only long-tail items.</p>
      <p>The ideal scenario would be offline metrics varying in
the same direction as the online CTR measures. That behavior
would indicate that offline evaluation could be used to
cost-effectively identify the best setting values for recommender
algorithms before involving users in online evaluation.</p>
      <p>However, Online and Offline CTR behaviour did not align
in perspective (1), considering all recommended items, as
can be seen in Figure 2.</p>
      <p>This investigation went further for a better understanding
of the misalignment between offline and online evaluations in
this context. It was assessed whether the very popular items
could introduce a bias in the recommender accuracy analysis,
by ignoring extremely popular items and considering only
long-tail items in perspective (2).</p>
      <p>For the offline evaluation, the top 1.1% of items concentrated
22% of the interactions and were ignored. For the online
experiments, the 1.5% most popular items, responsible for
41% of the interactions on the website, were likewise ignored.</p>
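      <p>A sketch of this long-tail filtering step: rank items by interaction count and drop the most popular head (the cut-off fractions above were 1.1% offline and 1.5% online); the field name is illustrative.</p>
      <preformat>
from collections import Counter


def long_tail_items(interactions, head_fraction=0.011):
    """Return the items left after removing the most popular head_fraction of items."""
    popularity = Counter(i["item"] for i in interactions)
    ranked = [item for item, _ in popularity.most_common()]
    head_size = int(round(len(ranked) * head_fraction))
    return set(ranked[head_size:])
</preformat>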
      <p>Considering only the long-tail items in Experiment #1,
the Offline and Online CTR turned out to be nicely aligned,
as shown in Figure 3. The best setting value for the
MinSimilarity threshold was 0.1, following the same trend for
both CTR metrics.</p>
      <p>In Experiment #2 for long-tail items, the metric variations
were very similar to the results considering popular items,
so there was no prediction gain by removing very popular
items from the analysis.</p>
      <p>In Experiment #3, the CTR metric variations were even
more aligned when keeping only long-tail items (charts omitted
due to space reasons).</p>
      <p>In Experiments #1 and #3, considering only long-tail
items, offline evaluation was an adequate predictor of the
online accuracy as a function of their setting thresholds.</p>
      <p>The observed bias of popular items on evaluation
accuracy metrics is aligned with recent studies such as [1] and [2].</p>
    </sec>
    <sec id="sec-5">
      <title>CONCLUSION</title>
      <p>In this study, Offline and Online experiments were
performed and compared in a real production environment of a
hybrid recommender system. The results did not correlate
for most experiments, but when focusing on long-tail items,
it was possible to observe how popular items can bias the
accuracy evaluation. Two out of three experiments on
long-tail items had Offline CTR closely aligned with Online CTR.</p>
      <p>The evaluation of long-tail items may be a candidate for
deeper investigation in future studies, aiming to increase
confidence in offline evaluation results. Furthermore, by
focusing on accuracy optimization for long-tail items, algorithms
may give users a clearer perception of the ability of
the system to recommend non-trivial relevant items.</p>
      <p>This study is still ongoing, to provide a better
understanding of the relationship between offline and online evaluation
results. Besides accuracy, a similar investigation of other
properties, such as coverage and more long-term metrics
related to user engagement, is suggested.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGEMENTS</title>
      <p>Our thanks to CI&amp;T for supporting the development of
the Smart Canvas® recommender system evaluation framework
and to ITA for providing the research environment.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Beel, M. Genzmehr, S. Langer, A. Nürnberger, and B. Gipp. A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation. In Proc. Workshop on Reproducibility and Replication in Recommender Systems Evaluation, pages 7–14. ACM, 2013.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] F. Garcin, B. Faltings, O. Donatsch, A. Alazzawi, C. Bruttin, and A. Huber. Offline and online evaluation of news recommender systems at swissinfo.ch. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 169–176. ACM, 2014.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] A. Said and A. Bellogín. Comparative recommender system evaluation: benchmarking recommendation frameworks. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 129–136. ACM, 2014.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>