<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Performance Measures for Multi-Graded Relevance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christian Scheel</string-name>
          <email>christian.scheel@dai-labor.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Lommatzsch</string-name>
          <email>andreas.lommatzsch@dai-labor.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sahin Albayrak</string-name>
          <email>sahin.albayrak@dai-labor.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universität Berlin, DAI-Labor</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>54</fpage>
      <lpage>65</lpage>
      <abstract>
        <p>We extend performance measures commonly used in semantic web applications to be capable of handling multi-graded relevance data. Most of today's recommender social web applications offer the possibility to rate objects with different levels of relevance. Nevertheless, most performance measures in Information Retrieval and recommender systems are based on the assumption that retrieved objects (e.g., entities or documents) are either relevant or irrelevant. Hence, thresholds have to be applied to convert multi-graded relevance labels to binary relevance labels. With regard to the necessity of evaluating information retrieval strategies on multi-graded data, we propose an extended version of the performance measure average precision that pays attention to levels of relevance without applying thresholds, while keeping and respecting the detailed relevance information. Furthermore, we propose an improvement to the NDCG measure avoiding problems caused by different scales in different datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Semantic information retrieval systems as well as recommender systems provide documents or entities computed to be relevant according to a user profile or an explicit user query. Potentially relevant entities (e.g., users, items, or documents) are generally ranked by the assumed relevance, simplifying the user's navigation through the presented results. Performance measures evaluate computed rankings based on user-given feedback and thus allow comparing different filtering or recommendation strategies [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <p>The most frequently used performance measures in semantic web applications are the Precision (P = number of relevant items in the result set / total number of items in the result set) and the Mean Average Precision (MAP), designed to compute the Average Precision over sorted result lists (rankings). The main advantage of these measures is that they are simple and very commonly used. Their main disadvantage is that they only take binary relevance ratings into account and are not able to cope with multi-graded relevance assignments.</p>
      <p>
        One well-accepted performance measure designed for handling multi-graded relevance assignments is the Normalized Discounted Cumulative Gain (NDCG) [
        <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
        ]. From one position in the result list to the next, the NDCG focuses on the gain of information. Because the information gain of items on the same level of relevance is constant, it is possible to swap the positions of items belonging to the same relevance level without changing the performance measure. The advantage of NDCG is that it applies an information-theoretic model for considering multiple relevance levels. Unfortunately, the NDCG measure values depend on the number of reference relevance values of the dataset. Thus, NDCG values computed for different datasets cannot be directly compared with each other.
      </p>
    </sec>
    <sec id="sec-5">
      <p>
        An alternative point of view on multi-graded relevance was used in the TREC-8 competition [2]. Instead of multiple relevance levels, probabilities for measuring the relevance of entities were used. As performance measure, the Mean Scaled Utility (SCU) was suggested. Since the SCU is very sensitive to the applied scaling model and the properties of the queries, the SCU measure should not be used for comparing different datasets [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Due to the fact that the binary relevance performance measures precision and mean average precision are commonly used, a promising approach is to extend these binary measures to be capable of handling multi-graded relevance assignments. Kekäläinen et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] discuss the possibility of evaluating retrieval strategies on each level of relevance separately and then finding out whether one IR method is better than another at a particular level of relevance. Additionally, it is proposed to weight different levels of relevance according to their gain of importance. Kekäläinen et al. suggest a generalized precision and recall, which accounts for the importance of the relevance levels, but does not consider the position of an item in the retrieved result list.
      </p>
    </sec>
    <sec id="sec-9">
      <p>In our work we extend the measures Precision and MAP to be capable of handling multiple relevance levels. The idea of looking at the performance of each level of relevance separately is carried on. An extension of MAP is proposed with which strategies can be evaluated with user-given feedback independent of the number of used relevance levels. We refer to this extension of MAP as µMAP. Additionally, we introduce an adaptation of the NDCG measure taking into account the number of relevance levels present in the respective reference datasets.</p>
    </sec>
    <sec id="sec-11">
      <p>The paper is structured as follows: In the next section we explain the datasets used for benchmarking our work. We explain the performance measure Average Precision and show how data has to be transformed in order to compute the Average Precision. In Section 3 we propose an extension to Average Precision allowing us to handle multi-graded relevance assignments without changing the original ratings. After introducing our approach, we evaluate the proposed measures for several retrieval algorithms and different datasets (Section 4). Finally, we discuss the advantages and disadvantages of the new measures and give an outlook on future work.</p>
    </sec>
    <sec id="sec-12">
      <title>Settings and Methods</title>
    </sec>
    <sec id="sec-13">
      <p>
        For evaluating the performance of a computed item list, a reference ranking is needed (or the items must be rated, allowing us to derive a reference ranking). The reference ranking is expert-defined or provided by the user. It can be retrieved explicitly or implicitly [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Very popular is the 5-star rating, allowing the user to rate entities or items on a scale from 0 to 4, meaning five levels of relevance.
      </p>
      <p>For analyzing and comparing the properties of the proposed evaluation measures, we deploy an artificial dataset and a real-world dataset providing three relevance levels. We assume that the reference item ratings stand for ratings coming from human experts and that the test rankings stand for the item lists coming from different prediction strategies. We discuss several different types of reference ratings: In each evaluation setting, the optimal item list based on the reference ratings should achieve the performance value 1.0.</p>
    </sec>
    <sec id="sec-14">
      <title>Artificial Dataset</title>
      <p>We create an artificial dataset and analyze how changes in the dataset influence the measured result quality. For this purpose, we compute the performance of 100 different test item lists for each given reference ranking, considering different performance measures.</p>
    </sec>
    <sec id="sec-16">
      <title>Test Ratings</title>
      <p>We create item lists (test rankings) by pair-wise swapping the items of an optimal item list (reference ranking); see Fig. 1. Swapping means that two rated items in the ranking change their positions. The best test ranking is the one for which no items have been swapped. The performance of the obtained item list decreases with an increasing number of swapped item pairs.</p>
    </sec>
    <sec id="sec-17">
      <p>The analyzed 100 test rankings differ in the number of swapped pairs: In the first test ranking (0) we do not swap any item pair; in the last test ranking (99) we randomly swap 99 item pairs. How much the performance decreases per swap depends on the distance between the relevance levels of the swapped items. Hence, an evaluation run for each number of switches includes 100 test ranking evaluations to average the results.</p>
      <p>[Fig. 1: a reference ranking and example test rankings with 1, 2, and 5 switches]</p>
      <p>Uniformly Distributed Reference Ratings There are four different kinds of reference rankings which differ in the number of relevance levels. Each reference ranking contains 100 elements which are uniformly distributed among 2, 10, 20, or 50 levels of relevance (see Fig. 2).</p>
    </sec>
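      <p>The pair-wise swapping used above to generate test rankings can be sketched as follows. This is a minimal illustration only; the function and parameter names are our own and not taken from the paper.</p>

```python
import random

# Generate a test ranking by randomly swapping k item pairs of the
# reference ranking (a sketch of the procedure described in Sec. 2.1).
def swap_pairs(reference, k, rng=random):
    ranking = list(reference)           # keep the reference ranking intact
    for _ in range(k):
        i, j = rng.sample(range(len(ranking)), 2)  # two distinct positions
        ranking[i], ranking[j] = ranking[j], ranking[i]
    return ranking
```

      <p>With k = 0 the optimal reference ranking is returned unchanged; increasing k degrades the ranking, which is what the evaluation runs average over.</p>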
    <sec id="sec-18">
      <title>Non-Uniformly Distributed Reference Ratings</title>
      <p>In contrast to the reference rankings used in the previous paragraph, we consider reference rankings consisting of non-uniformly rated items making use of 2, 10, 20, or 50 levels of relevance (see Fig. 3). In other words, the probabilities (that a relevance level is used in the reference ranking) differ randomly between the relevance levels. Moreover, some relevance levels may not be used. Hence, this dataset is more realistic, because users do not assign relevance scores uniformly.</p>
    </sec>
    <sec id="sec-20">
      <title>OHSUMED Dataset</title>
    </sec>
    <sec id="sec-21">
      <p>
        The OHSUMED dataset provided by the Hersh team at Oregon Health Sciences University [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] consists of medical journal articles from the period of 1987-1991, rated by human experts on a scale of three levels of relevance. Our evaluation is based on the OHSUMED dataset provided in LETOR [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The items (documents) are rated on a scale of 0, 1, and 2, meaning not relevant, possibly relevant, and definitely relevant. As in the Non-Uniformly Distributed Reference Ratings, the given relevance scores are not uniformly distributed.
      </p>
    </sec>
    <sec id="sec-22">
      <title>Test Ratings</title>
      <p>The OHSUMED dataset in LETOR provides 106 queries and 25 strategies assigning relevance scores to each item in the result set for a respective query. Due to the fact that some of the provided strategies show a very similar behavior, we limit the number of evaluated strategies to eight (OHSUMED ids 1, 5, 9, 11, 15, 19, 21, 22), enabling a better visualization of the evaluation results.</p>
    </sec>
    <sec id="sec-24">
      <title>User-Given Ratings</title>
      <p>The OHSUMED dataset provides expert-labeled data based on a three-level scale. Because there is no real distance between not relevant, possibly relevant, and definitely relevant, we assume a distance of 1 between successive levels of relevance, as the assigned scores 0, 1, and 2 in the dataset imply.</p>
      <p>Approximated Virtual-User Ratings The OHSUMED dataset provides three relevance levels. Because fine-grained ratings enable a more precise evaluation, the authors believe that datasets with a higher number of relevance levels will soon be available. Until these datasets are available, a trick is applied, replacing users' ratings with relevance scores calculated by computer-controlled strategies. The computer-calculated relevance scores are treated as user-given reference ratings. In our evaluation we selected the OHSUMED strategies TF of the title (resulting in 9 different relevance levels) and TF-IDF of the title (resulting in 158 different relevance levels) as virtual reference users. Both rating strategies show a very strong correlation; the Pearson's correlation coefficient of the relevance assessments is 0.96. The availability of more than three relevance levels in the reference ranking allows us to evaluate ranking strategies with multi-graded relevance assignments. The two strategies treated as reference rating strategies are also considered in the evaluation. Thus, these strategies should reach an evaluation value of 1.</p>
    </sec>
    <sec id="sec-25">
      <title>Performance Measures</title>
    </sec>
    <sec id="sec-26">
      <p>There are several performance measures commonly used in information retrieval and recommender systems, such as precision, the area under an ROC curve, or the rank of the first relevant document (mean reciprocal rank). Additionally, the mean of each performance measure over all queries can be computed to overcome the unstable character of some performance measures.</p>
    </sec>
    <sec id="sec-27">
      <p>
        In this section we focus on the popular performance measures Average Precision (AP) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Normalized Discounted Cumulative Gain (NDCG) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Unlike AP, NDCG can handle different numbers of relevance levels, due to the fact that NDCG defines the information gain based on the relevance score assigned to a document.
      </p>
    </sec>
    <sec id="sec-30">
      <title>Average Precision</title>
      <p>The average precision of a sorted item (document) list for a query q is defined as</p>
      <p>AP_q = (1/R_q) · Σ_{p=1}^{N} P@p · rel_q(p)   (1)</p>
      <p>where N denotes the number of items in the evaluated list, P@p the precision at position p, and R_q the number of relevant items with respect to q. rel_q(p) is a binary function describing whether the element at position p is relevant (1) or not (0). A higher AP value means that more relevant items are at the head of the result list. Given a set of queries, the mean over the AP of all queries is referred to as MAP.</p>
    </sec>
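      <p>Eq. 1 can be computed with a few lines of code. This is a minimal sketch assuming binary relevance labels in ranked order; the function name is illustrative, not from the paper.</p>

```python
# Average Precision (Eq. 1) for a ranked list of binary relevance labels.
def average_precision(rels):
    """rels: list of 0/1 relevance labels, ordered by the computed ranking."""
    num_rel = sum(rels)          # R_q: number of relevant items
    if num_rel == 0:
        return 0.0
    hits = 0
    total = 0.0
    for p, rel in enumerate(rels, start=1):
        if rel == 1:
            hits += 1
            total += hits / p    # P@p, accumulated only at relevant positions
    return total / num_rel
```

      <p>For example, relevant items at ranks 1, 3, and 4 yield (1/1 + 2/3 + 3/4) / 3 ≈ 0.806.</p>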
    <sec id="sec-31">
      <p>When there are more than two relevance levels, these levels have to be assigned to either 0 or 1. A threshold must be applied, separating the relevant items from the irrelevant items. For later use, we denote AP_q^t as the AP with threshold t applied. AP_q^t is calculated by</p>
      <p>AP_q^t = (1/R_q^t) · Σ_{p=1}^{N} P@p · rel_q^t(p)   (2)</p>
      <p>with rel_q^t(p) = 1 if rel_q(p) ≥ t, and 0 if rel_q(p) &lt; t   (3)</p>
      <p>where R_q^t denotes the number of results for which rel_q^t(p) is 1.</p>
    </sec>
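      <p>The threshold conversion of Eq. 3 is a one-liner; a minimal sketch with illustrative names:</p>

```python
# Convert graded relevance scores to binary labels with threshold t (Eq. 3):
# a score of at least t counts as relevant (1), everything below as irrelevant (0).
def binarize(scores, t):
    return [1 if s >= t else 0 for s in scores]

print(binarize([0, 2, 1, 4, 3], 2))  # prints [0, 1, 0, 1, 1]
```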
    <sec id="sec-32">
      <title>Normalized Discounted Cumulative Gain</title>
      <p>For a query q, the normalized discounted cumulative gain at position n is computed as</p>
      <p>NDCG@n(q) = N_n^q · Σ_{i=1}^{n} (2^{gain_q(i)} − 1) / log(i + 1)</p>
      <p>where gain_q(i) denotes the gain of the document at position i of the (sorted) result list. N_n^q is a normalization constant, scaling the optimal DCG@n to 1. The optimal DCG@n can be retrieved by calculating the DCG@n with the correctly sorted item list.</p>
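      <p>The NDCG@n computation can be sketched as follows, assuming the exponential gain (2^gain − 1) discounted by log(i + 1) as in the formula above; the helper names are our own.</p>

```python
import math

# DCG@n with exponential gain, discounted by log(i + 1).
def dcg(gains, n):
    return sum((2 ** g - 1) / math.log(i + 1)
               for i, g in enumerate(gains[:n], start=1))

# NDCG@n: DCG normalized by the optimal DCG of the correctly sorted list.
def ndcg(gains, n):
    ideal = dcg(sorted(gains, reverse=True), n)
    return dcg(gains, n) / ideal if ideal > 0 else 0.0
```

      <p>An already optimally sorted list yields 1.0, while any inversion lowers the value.</p>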
      <sec id="sec-32-1">
        <title>Extending Performance Measures</title>
      </sec>
    </sec>
    <sec id="sec-33">
      <p>The need to apply thresholds makes the measures AP and MAP not applicable for multi-graded relevance data. NDCG supports multi-graded relevance data, but its sensitivity to the choice of relevance levels prevents the comparison of NDCG values computed based on different datasets. Hence, for a detailed evaluation based on datasets having multiple relevance levels, both MAP and NDCG have to be adapted.</p>
    </sec>
    <sec id="sec-35">
      <title>Extending Average Precision</title>
      <p>In the most commonly used evaluation scenarios, the relevance of items is a binary function (returning relevant or irrelevant). If the reference dataset provides more than two relevance levels, a threshold is applied which separates the documents into a set of relevant items and a set of irrelevant items. The example in Table 1 illustrates how levels of relevance affect the calculation of the measure AP. The example shows a sorted list of items (A ... H). The relevance of each item is denoted on a scale from 0 to 4 (5 relevance levels). For calculating the precision, a threshold t must be applied to separate relevant from irrelevant items. The threshold t = 0 implies that all documents are relevant. We refer to this threshold as the irrelevant threshold. In contrast to t = 0, applying the threshold t = 5 leads to no relevant documents. Table 1 illustrates that the threshold t strongly affects the computed AP. To cope with this problem, we propose calculating the performance on each relevance level and then computing the mean. This ensures that higher relevance levels are considered more frequently than lower relevance levels. The example visualized in Table 1 shows that item H, having a relevance score of 4, is considered relevant more often than all other items.</p>
    </sec>
    <sec id="sec-36">
      <p>We refer to this approach as µAP, and as µMAP if the mean of µAP over several result lists is calculated. To handle the case that not all relevance levels are used in every result list and that the distance between successive relevance levels is not constant, µAP has to be normalized:</p>
      <p>µAP_q = (1 / Σ_{t∈L} d_t) · Σ_{t∈L} AP_q^t · d_t   (4)</p>
      <p>where AP_q^t denotes the average precision using the threshold t, and L the set of all relevance levels (meaning all thresholds) used in the reference ranking. d_t denotes the distance between the relevance levels t_i and t_{i−1} for i &gt; 0 (and t itself for i = 0). The following example demonstrates the approach: Given a dataset based on three relevance levels (0.0, 0.3, 1.0), the threshold t = 0.3 leads to d_t = 0.3 − 0.0 = 0.3. The threshold t = 1.0 leads to d_t = 1.0 − 0.3 = 0.7.</p>
    </sec>
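    <p>The distance-weighted averaging of the thresholded AP values (Eq. 4) can be sketched as follows. This is a minimal illustration under the stated assumptions (lowest level is the irrelevant one); the names are ours, not the authors' implementation.</p>

```python
# Average Precision for a ranked list of binary labels (Eq. 1 helper).
def average_precision(rels):
    num_rel = sum(rels)
    if num_rel == 0:
        return 0.0
    hits, total = 0, 0.0
    for p, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / p
    return total / num_rel

# µAP (Eq. 4): mean of the thresholded AP values, weighted by the
# distance d_t between successive relevance levels.
def mu_average_precision(scores):
    """scores: graded relevance scores, ordered by the computed ranking."""
    levels = sorted(set(scores))
    thresholds = [t for t in levels if t > levels[0]]  # skip the irrelevant level
    weighted, norm = 0.0, 0.0
    prev = levels[0]
    for t in thresholds:
        d_t = t - prev                                 # distance to previous level
        rels = [1 if s >= t else 0 for s in scores]    # Eq. 3 binarization
        weighted += average_precision(rels) * d_t
        norm += d_t
        prev = t
    return weighted / norm if norm > 0 else 0.0
```

    <p>For the three levels (0.0, 0.3, 1.0) from the example, the weights d_t are 0.3 and 0.7, and a perfectly sorted list yields 1.0.</p>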
    <sec id="sec-37">
      <title>The Normalized Discounted Cumulative Normalized Gain</title>
      <p>In contrast to MAP, NDCG is designed for handling multiple relevance levels. Unfortunately, NDCG does not consider the scale used for the relevance scores. Thus, computed NDCG values highly depend on the number of relevance levels, making it impossible to compare NDCG values between different datasets. Table 2 illustrates this problem. In the first example the NDCG is calculated as usual. In the second example, the number of relevance levels is doubled, but the number of assigned scores as well as the number of used levels of relevance is equal to the first example. This doubling leads to a smaller NDCG compared to the first example, even though no rated element became more or less relevant relative to another element. In the third example, the gain of example one is normalized and the NDCG is calculated. It can be seen that the normalization solves the inconsistency. A normalization of the gain overcomes the problem of incomparable performance values for data with relevance assignments on a different number of relevance levels. We define the Normalized Discounted Cumulative Normalized Gain (NDCNG) at position n as follows:</p>
    </sec>
    <sec id="sec-41">
      <p>NDCNG@n(q) = N_n^q · Σ_{i=1}^{n} (2^{ngain_q(i)} − 1) / log(i + 1),  with ngain_q(i) = gain_q(i) / m_q if m_q &gt; 0, and 0 otherwise   (5)</p>
      <p>where m_q is the highest reachable gain for the query q (normalization term).</p>
      <p>If there is no relevant item, m_q is set to 0, assuming that irrelevant items are rated with 0, all ratings are ≥ 0, and relevant items have relevance scores &gt; 0. If these assumptions do not apply, the relevance scores must be shifted so that the irrelevant level is mapped to the relevance score 0.</p>
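      <p>Under these assumptions, the NDCNG of Eq. 5 can be sketched as follows: the gains are first scaled by the highest reachable gain m_q, then the usual discounted cumulation is applied. A minimal illustration with our own helper names, not the authors' reference implementation.</p>

```python
import math

# DCG@n with exponential gain, discounted by log(i + 1).
def dcg(gains, n):
    return sum((2 ** g - 1) / math.log(i + 1)
               for i, g in enumerate(gains[:n], start=1))

# NDCNG@n (Eq. 5): normalize gains by the highest reachable gain m_q,
# then compute the NDCG on the normalized gains.
def ndcng(gains, n):
    m_q = max(gains) if gains else 0       # highest reachable gain for the query
    if m_q > 0:
        ngains = [g / m_q for g in gains]  # normalized gains in [0, 1]
        ideal = dcg(sorted(ngains, reverse=True), n)
        return dcg(ngains, n) / ideal if ideal > 0 else 0.0
    return 0.0                             # no relevant item: m_q is 0
```

      <p>Because the gains are mapped to [0, 1] before the exponential gain is applied, rescaling all relevance scores (e.g., doubling the number of levels) leaves the NDCNG unchanged, which is exactly the property the plain NDCG lacks.</p>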
      <sec id="sec-41-2">
        <title>Evaluation</title>
        <p>Evaluation based on the Artificial Dataset We evaluated the proposed performance measures on the artificial dataset introduced in Section 2.1. Fig. 4 shows the mean of 100 evaluation runs with uniformly distributed relevance scores. From left to right the number of false pair-wise item preferences increases, and hence the measured performance decreases.</p>
        <p>[Fig. 4: µMAP, mean NDCG, and mean NDCNG over the number of random switches, for 2, 10, 20, and 50 levels of relevance]</p>
      </sec>
    </sec>
    <sec id="sec-42">
      <title>Evaluation based on the OHSUMED Dataset</title>
      <p>The OHSUMED dataset (introduced in Sec. 2.2) uses three different relevance levels. Fig. 6 visualizes the measured performance of selected retrieval strategies using µMAP, the mean NDCG, and the mean NDCNG. Since the OHSUMED dataset uses three relevance levels, the µMAP is the mean of the MAP values computed applying the thresholds t = 1 and t = 2.</p>
    </sec>
    <sec id="sec-44">
      <p>A virtual user, which is in fact strategy 1 (TF of title), provides the relevance assessments for the evaluation presented in Fig. 7. Strategy 1 assigns ordinal scores on 9 different levels of relevance. On this data, µMAP is the mean of 8 different MAP values.</p>
    </sec>
    <sec id="sec-45">
      <p>The measured results show that the measure µMAP evaluates the retrieval strategy 1 (as expected) with a performance value of 1.0, as do the mean NDCG and the mean NDCNG. All the considered evaluation measures agree that the retrieval strategy 9 is most similar to strategy 1, which makes sense, since strategy 9 is computed based on the TF-IDF of the title and strategy 1 is computed based on the TF of the title. The main difference between both retrieval strategies is the number of used relevance levels: Strategy 1 assigns ordinal relevance scores (using 9 different relevance levels); strategy 9 assigns real values (resulting in 158 relevance levels). The distance between these relevance levels varies a lot.</p>
    </sec>
    <sec id="sec-46">
      <p>When applying strategy 9 as the reference rating strategy, the need for taking into account the distance between the relevance levels (Eq. 4) can be seen. Several very high relevance scores are used only once; lower relevance scores are used much more frequently. Fig. 8 shows the advantages of the NDCNG compared to the standard NDCG. The comparison of the mean NDCG in Fig. 7 with the mean NDCG in Fig. 8 reveals that the NDCG is affected by the number of relevance levels. While the strategies 1 and 9 show very similar performances in both figures, the other strategies are evaluated with disproportionately lower performance values in Fig. 8, although both reference rating strategies assign similar relevance ratings. The µMAP and the proposed mean NDCNG values do not differ much between both evaluations due to the fact that these measures are almost independent of the number of relevance levels.</p>
      <sec id="sec-46-1">
        <title>Conclusion and Discussion</title>
      </sec>
    </sec>
    <sec id="sec-47">
      <p>In this paper we introduced the performance measure µAP, which is capable of handling more than two levels of relevance. The main advantage of the approach is that it extends the commonly used performance measures precision and Mean Average Precision. µAP is fully compatible with the traditional measures, since it delivers the same performance values if only two reference relevance levels exist in the dataset. The properties of the proposed measures have been analyzed on different datasets. The experimental results show that the proposed measures satisfy the defined requirements and enable the comparison of semantic filtering strategies based on datasets with multi-graded relevance levels. Since µAP is based on well-accepted measures, only a minor adaptation of these measures is required.</p>
      <p>[Figs. 6-8: µMAP, mean NDCG, and mean NDCNG for the retrieval strategies 1, 5, 9, 11, 15, 19, 21, and 22]</p>
      <p>Additionally, we showed in this paper that the NDCG measure is sensitive to the number of relevance levels in a dataset, making it impossible to compare the performance values computed for datasets with a different number of relevance levels. To overcome this problem, we suggest an additional normalization ensuring that the number of relevance levels in the dataset does not influence the computed performance values. Our evaluation shows that NDCNG assigns similar performance values to recommender strategies that are almost identical except that different numbers of relevance levels are used. In the analysis, we demonstrated that high gain values (caused by a high number of relevance levels) lead to incommensurately low NDCG values. Since the number of relevance levels typically differs between datasets, the NDCG values cannot be compared among different datasets. Thus, the gain values per level of relevance must be limited. An additional normalization solves this problem.</p>
    </sec>
    <sec id="sec-48">
      <p>Future Work As future work, we plan to use the measures µAP and NDCNG for evaluating recommender algorithms on additional datasets with multi-graded relevance assessments. We will focus on movie datasets such as EachMovie [7] (having user ratings on a discrete scale from zero to five), and movie ratings from the Internet Movie Database (IMDB) 1 (having user ratings on a scale from one to ten).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>W.</given-names>
            <surname>Hersh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Leone</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Hickam</surname>
          </string-name>
          .
          <article-title>Ohsumed: an interactive retrieval evaluation and new large test collection for research</article-title>
          .
          <source>In SIGIR '94</source>
          , pages
          <fpage>192-201</fpage>
          , New York, NY, USA,
          <year>1994</year>
          . Springer-Verlag New York, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hull</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>The TREC-8 filtering track final report</article-title>
          .
          <source>In The 8th Text Retrieval Conference (TREC-8)</source>
          ,
          <source>NIST SP 500-246, pages 35-56</source>
          ,
          <year>2000</year>
          . http://trec.nist.gov/pubs/trec8/t8_proceedings.html.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          .
          <article-title>Cumulated gain-based evaluation of ir techniques</article-title>
          .
          <source>ACM Trans. Inf. Syst.</source>
          ,
          <volume>20</volume>
          (
          <issue>4</issue>
          ):
          <fpage>422-446</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          .
          <article-title>Search engines that learn from implicit feedback</article-title>
          .
          <source>Computer</source>
          ,
          <volume>40</volume>
          (
          <issue>8</issue>
          ):
          <fpage>34-40</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          .
          <article-title>Using graded relevance assessments in IR evaluation</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>53</volume>
          (
          <issue>13</issue>
          ):
          <fpage>1120-1129</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xiong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>LETOR: Benchmark dataset for research on learning to rank for information retrieval</article-title>
          . In SIGIR Workshop:
          <article-title>Learning to Rank for Information Retrieval (LR4IR</article-title>
          <year>2007</year>
          ),
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>P.</given-names>
            <surname>McJones</surname>
          </string-name>
          .
          <article-title>EachMovie collaborative filtering data set</article-title>
          . Available from http://research.compaq.com/SRC/eachmovie/,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Najork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>Hits on the web: how does it compare</article-title>
          ? In SIGIR '07
          , pages
          <fpage>471-478</fpage>
          , New York, NY, USA,
          <year>2007</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Van Rijsbergen</surname>
          </string-name>
          .
          <article-title>Information Retrieval, 2nd edition</article-title>
          . Dept. of Computer Science, University of Glasgow,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Harman</surname>
          </string-name>
          , editors. TREC:
          <article-title>Experiment and Evaluation in Information Retrieval</article-title>
          . MIT Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>