<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Economical Evaluation of Recommender Systems: A Proposal</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kevin Roitero</string-name>
<email>roitero.kevin@spes.uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Serra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Mizzaro</string-name>
<email>mizzaro@uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Background: IR Evaluation with Fewer Topics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Mathematics</institution>
          ,
          <addr-line>Computer Science, and Physics</addr-line>
          ,
          <institution>University of Udine</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The evaluation of information retrieval effectiveness by using fewer topics/queries has been studied for some years now: this approach potentially makes it possible to save resources without sacrificing evaluation reliability. We propose to apply it to the evaluation of recommender systems. We describe our proposal and detail what is needed to put it into practice.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        250 topics are usually acceptable, and 1,000 are usually needed, while Zobel [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
concludes that 25 topics are reasonably good at predicting the effectiveness
of systems on a different set of 25 topics. Buckley and Voorhees [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] state
that 25 topics are good, but 50 are always better. Webber et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] conclude
that 150 topics are needed. Based on statistical analysis, Sakai [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] estimates
a minimum number of topics required to preserve the ability to discriminate
system effectiveness. To reduce the number of topics, some researchers have
investigated strategies to identify the best possible choice of topic subset.
A seminal work by Mizzaro and Robertson [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] proposes to exploit Kleinberg's
well-known HITS algorithm on a matrix that represents the interactions
between systems and topics. Results show that evaluation is affected by topic ease.
Roitero et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] extend and generalize the technique and the results.
      </p>
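<p>As an illustration of the network-analysis idea, here is a minimal Python sketch (ours; [<xref ref-type="bibr" rid="ref6">6</xref>] should be consulted for the exact formulation) of a weighted HITS power iteration on a systems-by-topics AP matrix: systems play the role of hubs and topics the role of authorities, so that the authority score of a topic can be read as an indicator of its ease.</p>
<preformat>
import numpy as np


def hits_scores(ap_matrix, iters=100):
    # Weighted HITS by power iteration: a topic scores high if effective
    # systems do well on it, and a system scores high if it does well on
    # high-scoring topics. Scores are normalized at every step.
    n_systems, n_topics = ap_matrix.shape
    sys_score = np.ones(n_systems)
    topic_score = np.ones(n_topics)
    for _ in range(iters):
        topic_score = ap_matrix.T @ sys_score   # authority update
        sys_score = ap_matrix @ topic_score     # hub update
        topic_score /= np.linalg.norm(topic_score)
        sys_score /= np.linalg.norm(sys_score)
    return sys_score, topic_score


# Toy run: 3 systems evaluated on 4 topics.
ap = np.array([[0.9, 0.2, 0.5, 0.4],
               [0.7, 0.1, 0.4, 0.3],
               [0.3, 0.0, 0.2, 0.1]])
hubs, auths = hits_scores(ap)
print("system hubness:", hubs.round(3))
print("topic authority (ease):", auths.round(3))
</preformat>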
      <p>
        Guiver et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Berto et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
] aim at finding the theoretical optimum
for the topic selection strategy: they use exhaustive and heuristic search over all
possible subsets of topics of a given cardinality, and show that the theoretical
optimum subset of &#8220;a few good topics&#8221; potentially allows a correct evaluation of
systems even at rather low cardinalities. In more detail, for each cardinality they
find the topic subsets that provide the highest and lowest values of correlation,
considering MAP, with respect to the MAP of the full set of topics. Let us
consider an example. At cardinality k, they consider all possible sets of k topics
(i.e., of k columns of the topic-system matrix, see Figure 1(a)); for each such subset,
they compute the MAP value of each system, i.e., the mean of its AP values over the
k topics. Finally, they consider the subsets which maximize/minimize the
correlation between those MAP values and the MAPs obtained when considering
all topics (i.e., all columns). Figure 1(c) shows the correlation between the MAP
computed with k topics and the MAP with all the topics, using Kendall's &#964;
rank correlation. The three series in the graph represent: the best/worst topic
subset (for each cardinality, the topic subset with the highest/lowest value of
correlation), and the Average topic subset (the expected correlation that one
would get when selecting topics at random). The best series shows that it is
theoretically possible to evaluate IRSs using fewer topics. This result is extended
by Robertson [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] who shows that it is not clear if it is possible to identify subsets
of good topics that are general. Roitero and Mizzaro [
        <xref ref-type="bibr" rid="ref9">9</xref>
] study whether clustering of
topics can be exploited to find such subsets of a few good topics.
      </p>
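<p>To make the procedure concrete, the following minimal Python sketch (ours; the function and variable names are assumptions, not code from [<xref ref-type="bibr" rid="ref4">4</xref>] or [<xref ref-type="bibr" rid="ref2">2</xref>]) enumerates all topic subsets of cardinality k, scores systems by their mean AP over each subset, and correlates the result with the MAP over all topics. Exhaustive enumeration is feasible only for small topic sets, which is why heuristic search is used at larger cardinalities [<xref ref-type="bibr" rid="ref2">2</xref>].</p>
<preformat>
from itertools import combinations

import numpy as np
from scipy.stats import kendalltau


def best_worst_average(ap_matrix, k):
    # Exhaustive search over all topic subsets of cardinality k: score
    # systems by their mean AP over the subset, then compare with the
    # MAP computed over all topics using Kendall's tau.
    full_map = ap_matrix.mean(axis=1)          # MAP of each system, all topics
    n_topics = ap_matrix.shape[1]
    taus = {}
    for subset in combinations(range(n_topics), k):
        sub_map = ap_matrix[:, list(subset)].mean(axis=1)
        taus[subset] = kendalltau(sub_map, full_map)[0]
    best = max(taus, key=taus.get)             # one point of the Best series
    worst = min(taus, key=taus.get)            # one point of the Worst series
    average = float(np.mean(list(taus.values())))  # Average series point
    return best, taus[best], worst, taus[worst], average


# Example on random data; a real experiment would use a TREC AP matrix.
rng = np.random.default_rng(0)
ap = rng.random((20, 10))                      # 20 systems x 10 topics
print(best_worst_average(ap, k=3))
</preformat>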
    </sec>
    <sec id="sec-2">
      <title>Our Proposal</title>
<p>We propose to apply the above results on finding a few good topics, obtained in
IR evaluation, to RecSys. The advantages would be threefold:
1. We can use our knowledge to evaluate recommender systems while saving resources
(i.e., using fewer &#8220;topics&#8221;). Our aim is to obtain a matrix (see Figure 1(b))
having as rows the systems, and as columns the user identifiers (or the
recommended items); in each cell we have a &#8220;score&#8221; assigned by the system to
the user (e.g., the satisfaction of the user in being recommended a
particular movie, etc.). We can apply the topic reduction approach to reduce the
number of users/items required to evaluate the RecSys (a code sketch of this
point is given at the end of this section).</p>
<fig id="fig1">
  <label>Fig. 1</label>
  <caption>
    <p>(a) The topic-system matrix used in IR evaluation: rows are systems s1, ..., sm, columns are topics t1, ..., tn, the cell (i, j) holds AP(si, tj), and the last column holds MAP(si), the mean of the AP values of si over all topics. (b) The corresponding user-system matrix for RecSys evaluation: columns are users u1, ..., un, the cell (i, j) holds the score E(si, uj), and the last column holds AVG(si). (c) Kendall's &#964; correlation between the MAP computed with the best/worst/average topic subset of each cardinality k and the MAP computed with all topics.</p>
  </caption>
</fig>
<p>
2. We can use our knowledge to build recommender systems while saving
resources. Let us consider an example. When implementing a RecSys, a
common approach is to obtain a matrix having as rows the users, and as columns
the items (e.g., songs, films, etc.). In each matrix cell we have a score assigned
by the user to the item (e.g., the number of times a user listened to a song, an
explicit rating, etc.). We can apply the topic reduction approach to reduce the
number of items/users required to build the RecSys, preserving its reliability.
3. We can use IR evaluation metrics for the evaluation of recommender systems,
and we can transfer the theoretical analyses of such metrics (i.e., robustness,
soundness, evaluation effects, etc.) to the metrics currently used in the
evaluation of RecSyss. We can adapt IR metrics to RecSys purposes, for example
following the categorization of Gunawardana and Shani [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
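<p>The following minimal sketch (ours; the random data merely stands in for the evaluation results discussed in the next section) shows point 1 in code: the user-system matrix of Figure 1(b) is used to rank systems on all users and on a reduced user set. The best_worst_average procedure sketched in Section 2 applies unchanged to this matrix, with users in place of topics and AVG in place of MAP.</p>
<preformat>
import numpy as np

# E[i, j] is the (hypothetical) score of recommender system i for user j;
# AVG plays the role that MAP plays in IR evaluation.
rng = np.random.default_rng(42)
n_systems, n_users = 8, 200
E = rng.random((n_systems, n_users))

full_avg = E.mean(axis=1)                      # AVG(s_i) over all users

# Evaluate on a reduced user set, here chosen at random; a "few good
# users" subset, found as in Section 2, should correlate far better.
subset = rng.choice(n_users, size=20, replace=False)
reduced_avg = E[:, subset].mean(axis=1)

print("ranking on all users:", np.argsort(-full_avg))
print("ranking on 20 users: ", np.argsort(-reduced_avg))
</preformat>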
    </sec>
    <sec id="sec-3">
      <title>What We Need</title>
      <p>
        To apply the topic reduction approach to RecSys, we need to obtain a matrix
(see Figure 1(b)), which represents the evaluation results and corresponds to the
matrix used in IR evaluation (Figure 1(a)). We considered the datasets listed
by O zgobek et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
]: for some datasets (The Netflix Prize, MoviePilot Dataset,
ACM RecSys Challenge 2016) the data is not public or not available; for some
others (MovieLens Dataset, Million Song Dataset) the dataset contains only
the relevance results (i.e., the ground truth). For the RecSyss described by Beel
and Dinesh [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the evaluation data is not publicly available. Finally, for the
CLEF NewsReel 2017 dataset (http://www.clef-newsreel.org/), the competition does not release the
individual results of each submission, but only an aggregated list that
shows the performance of each algorithm on each competition day. Therefore,
the datasets we explored so far seem unsuitable for our purposes; furthermore,
we are not aware of any initiative for evaluating RecSyss which could be suitable.
In the absence of suitable data, we might resort to creating artificial
recommender system collections, starting from &#8220;relevance files&#8221;. We could, for example,
sample the relevance files with different relevance probability distributions to
obtain more/less effective systems; using different relevance distributions
with different parameters can provide a realistic population of RecSyss.
      </p>
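<p>One way to realize this sampling idea is sketched below (ours; the Beta parameterization and all constants are assumptions, not derived from any existing dataset): each simulated system has a &#8220;true&#8221; effectiveness level, and its per-user scores are drawn from a Beta distribution with that mean, so that different parameters yield more or less effective systems.</p>
<preformat>
import numpy as np

rng = np.random.default_rng(0)
n_systems, n_users = 50, 1000

# "True" effectiveness of each simulated system, spread over (0.2, 0.8).
true_quality = np.linspace(0.2, 0.8, n_systems)

# Per-user scores: Beta(a, b) has mean a / (a + b) = true quality; the
# concentration controls how noisy the per-user scores are.
concentration = 10.0
a = true_quality * concentration
b = (1.0 - true_quality) * concentration
E = rng.beta(a[:, None], b[:, None], size=(n_systems, n_users))

print(E.shape)              # synthetic Figure 1(b) matrix: 50 x 1000
print(E.mean(axis=1)[:5])   # empirical AVG tracks true_quality
</preformat>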
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
<p>Summarizing, we propose to: (i) evaluate recommender systems in a more
economical way; (ii) reduce the matrix used to build a RecSys; and (iii) use IR
evaluation metrics, transferring their known properties to the domain of RecSyss.
We now need suitable data to experimentally verify the usefulness of our approach.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Beel</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Dinesh</surname>
          </string-name>
          .
          <article-title>Real-world recommender systems for academia: The pain and gain in building, operating, and researching them [long version]</article-title>
          .
          <year>2017</year>
. arXiv:1704.00156.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>On using fewer topics in information retrieval evaluations</article-title>
          .
<source>In ICTIR '13</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <article-title>Evaluating evaluation measure stability</article-title>
          .
          <source>In 23rd ACM SIGIR</source>
          , pages
<fpage>33</fpage>
–
<lpage>40</lpage>
          . ACM,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guiver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>A few good topics: Experiments in topic set reduction for retrieval evaluation</article-title>
          .
          <source>ACM TOIS</source>
          ,
          <volume>27</volume>
          (
          <issue>4</issue>
          ):
          <fpage>21</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gunawardana</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Shani</surname>
          </string-name>
          .
          <article-title>A survey of accuracy evaluation metrics of recommendation tasks</article-title>
          .
          <source>J. of Machine Learning Research</source>
          ,
          <volume>10</volume>
          (Dec):
<fpage>2935</fpage>
–
<lpage>2962</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>HITS hits TREC: exploring IR evaluation results with network analysis</article-title>
          .
          <source>In 30th ACM SIGIR</source>
          , pages
<fpage>479</fpage>
–
<lpage>486</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
<string-name>
  <given-names>Ö.</given-names>
  <surname>Özgöbek</surname>
</string-name>
,
<string-name>
  <given-names>N.</given-names>
  <surname>Shabib</surname>
</string-name>
, and
<string-name>
  <given-names>J. A.</given-names>
  <surname>Gulla</surname>
</string-name>
          .
          <article-title>Data sets and news recommendation</article-title>
          .
          <source>In UMAP Workshops</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>On the contributions of topics to system evaluation</article-title>
          .
          <source>In ECIR</source>
          , pages
<fpage>129</fpage>
–
<lpage>140</lpage>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Roitero</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
<article-title>Improving the efficiency of retrieval effectiveness evaluation: Finding a few good topics with clustering?</article-title>
          <source>In Proceedings of the 7th IIR Workshop</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Roitero</surname>
          </string-name>
,
<string-name>
  <given-names>E.</given-names>
  <surname>Maddalena</surname>
</string-name>
, and
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
<article-title>Do easy topics predict effectiveness better than difficult topics?</article-title>
<source>In ECIR</source>
<year>2017</year>
, Aberdeen, Scotland, to appear.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <article-title>Designing test collections for comparing many systems</article-title>
          .
          <source>In 23rd ACM CIKM</source>
          , pages
<fpage>61</fpage>
–
<lpage>70</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <article-title>Topic set size design</article-title>
.
<source>Information Retrieval Journal</source>
,
<volume>19</volume>
(
<issue>3</issue>
):
<fpage>256</fpage>
–
<lpage>283</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
<string-name>
  <given-names>K.</given-names>
  <surname>Sparck Jones</surname>
</string-name>
and
<string-name>
  <given-names>C. J.</given-names>
  <surname>van Rijsbergen</surname>
</string-name>
.
          <article-title>Information retrieval test collections</article-title>
          .
          <source>Journal of Documentation</source>
          ,
          <volume>32</volume>
          (
          <issue>1</issue>
          ):
<fpage>59</fpage>
–
<lpage>75</lpage>
          ,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Webber</surname>
          </string-name>
          ,
<string-name>
  <given-names>A.</given-names>
  <surname>Moffat</surname>
</string-name>
, and
<string-name>
  <given-names>J.</given-names>
  <surname>Zobel</surname>
</string-name>
          .
          <article-title>Statistical power in retrieval experimentation</article-title>
          .
          <source>In 17th ACM CIKM</source>
          , pages
<fpage>571</fpage>
–
<lpage>580</lpage>
          . ACM,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
<article-title>How reliable are the results of large-scale information retrieval experiments?</article-title>
<source>In 21st ACM SIGIR</source>
, pages
<fpage>307</fpage>
–
<lpage>314</lpage>
          . ACM,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>