<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Rank Accuracy based on Incomplete Pairwise Preferences</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Brian Ackerman</string-name>
          <email>backerman@asu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yi Chen</string-name>
          <email>yi@asu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Arizona State University</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>74</fpage>
      <lpage>77</lpage>
      <abstract>
        <p>Existing methods to measure the rank accuracy of a recommender system assume the ground truth is either a set of user ratings or a totally ordered list of items given by the user, with possible ties. However, in many applications we can only obtain implicit user feedback, which does not provide such comprehensive information, but only gives a set of pairwise preferences among items. Generally such pairwise preferences are incomplete, and thus a total order of items cannot always be deduced from them. In this paper, we propose a novel method to evaluate rank accuracy, expected discounted rank correlation, which addresses the unique challenges of handling incomplete pairwise preferences in the ground truth and also puts an emphasis on properly ranking the items that users most prefer.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender systems</kwd>
        <kwd>rank accuracy</kwd>
        <kwd>pairwise preference</kwd>
        <kwd>evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
Recommender systems have proven their value in many
applications where it is advantageous to personalize a user’s
experience. Currently, recommender systems power some of
the world’s most visited websites. Users visit websites like
Amazon, Netflix, and Urbanspoon when they are looking for
suggestions on items that may be of interest. These websites
can suggest relevant items from a large collection of items
and output a ranked list ordered by the predicted relevance
to the user to aid in a user’s decision making process.
Given the prevalence of various recommender systems, a
critical question becomes their evaluation. Evaluation
measures are used to compare different techniques of
recommendation and give researchers an objective for optimization.
Existing evaluation measures may take a set of user ratings
as the ground truth [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ], or a total ordered list of items
given by the user [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], with an option to support ties [
        <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
        ].
Based on the weight given to items, there are two types of
evaluation. Reference rank accuracy measures the similarity
between the ranked list output by a system and the list of
user’s preferences, such as Spearman’s ρ [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Kendall’s
τ [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. There are also variations of area under the curve
measures, such as the one used in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which are closely related
to a general loss function. Utility rank accuracy measures
give a stronger weight to the items that have a higher
relevance according to the user, such as normalized discounted
cumulative gain [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which is widely adopted [
        <xref ref-type="bibr" rid="ref1 ref5 ref6">1, 5, 6</xref>
        ].
Ground truth often has to be obtained through explicit user
feedback. In some applications, we may solicit users to give
ratings on items and use such ratings as relevance scores,
such as data obtained from Netflix and MovieLens. The
rating data provides accurate user opinions on items, and
can also provide a total order of user’s preferences on items.
However, such data requires explicit feedback, and may be
considered burdensome by some users.
      </p>
      <p>On the other hand, there is great potential for
recommender systems, as well as search engines, to leverage implicit user
feedback, which can also be used for evaluation. After a user
issues a query, a set of items will be recommended. Among
them, a user may select some to check for more details, and
finally may choose one or more of the products to buy or
places to visit. Such user interaction provides implicit user
feedback on an item’s relevance. We can consider the set
of items purchased or visited to be preferred to the set of
selected items, which in turn is preferred to the set of items
that do not receive any user interaction. Alternatively, we
may also measure the actual time that a user spends
examining an item to infer finer grained preferences.</p>
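      <p>To make the three-level grouping above concrete, one session's implicit feedback can be turned into pairwise preferences with a small helper. This is an illustrative sketch, not part of any cited system; the item names and function name are invented.</p>
      <preformat>
```python
def session_to_pairs(purchased, selected, untouched):
    """Return the set of pairwise preferences (a, b), meaning a is preferred to b."""
    pairs = set()
    # purchased or visited items are preferred to merely selected items ...
    for a in purchased:
        for b in selected:
            pairs.add((a, b))
    # ... and both groups are preferred to items with no user interaction
    for a in purchased + selected:
        for b in untouched:
            pairs.add((a, b))
    return pairs

pairs = session_to_pairs(["E"], ["C", "D"], ["A", "B"])
# ("E", "C") is derived, but no pair relates A and B: the set is incomplete
```
      </preformat>
      <p>Note that no preference is emitted within a group, which is exactly how incomplete pairwise preferences arise.</p>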
      <p>After collecting data in the form of implicit user feedback,
we must also be able to evaluate recommendation techniques
based on the data. Using the data, we may not have the
actual rating of items, but a set of pairwise preferences among
items. Note that, in general, pairwise preferences may not be
complete; that is, there may exist two items between which a
user's preference is unknown. For instance, we do not know
a user’s preference between an item that is presented to the
user in response to a query and an item that is not shown
to the user. In this case, a total order among a set of items
cannot be deduced. In this paper, we only consider
evaluating preferences that are not contradictory. We simply allow
for the aggregation of preferences for different user sessions
assuming the preferences are consistent. It is worth noting
that considering contextual factors involved with a user
session would also be advantageous, but is outside the scope
of this paper. These contextual factors could include the
set of items recommended to the user, items previously
purchased or visited by the user, and how the items were
displayed to the user.</p>
      <p>It is also possible to find cyclic
preferences (e.g. A ≻ B, B ≻ C, C ≻ A); we do
not consider these cases either.</p>
      <p>To evaluate recommender systems whose user preferences
are obtained through implicit user feedback, we study
evaluation methods for a general type of recommender system
that takes a set of possibly incomplete pairwise preferences
as input for training and for ground truth, and outputs a
ranked list of items for recommendation. After examining
why existing evaluation methods may fail, we propose a
novel rank accuracy measure, expected discounted rank
correlation (EDRC), to handle incomplete pairwise preferences.
EDRC handles two unique challenges in this setting: the lack
of a total ranking order on items, and possibly unknown
preferences between two items. EDRC also adopts a discounted
model that gives a larger penalty for wrong predictions on the
more preferred items. To the best of our knowledge, this
is the first attempt at designing an evaluation measure for which
both the ground truth and the system prediction are sets of
incomplete pairwise preferences.</p>
      <p>The remainder of this paper is outlined as follows: after we
introduce the problem setting in Section 2, we briefly
review two commonly used evaluation measures. Section 4
proposes a new method to measure rank accuracy for
incomplete pairwise preferences, and Section 5 concludes the
paper.</p>
    </sec>
    <sec id="sec-problem">
      <title>2. PROBLEM SETTING</title>
      <p>
In this work, we design an evaluation measure for a
recommender system that takes ranking data in the form of
pairwise preferences as input, trains on the ranking data to
infer relevance values for each item in a test set, and then
outputs a ranking order for the items based on their
predicted relevance values. For the test set, we consider the
ground truth to be a set of pairwise preferences collected
from implicit user feedback.</p>
      <p>We define a pairwise preference as a user's comparison
between two items. We denote a pairwise preference as A ≻ B,
meaning item A is preferred to item B. Clearly, from a
ranked list of items, we can derive a set of pairwise
preferences. For example, consider three items A, B, and
C. The system may predict the order of these items to be
A, B, C. Then the derived set of pairwise preferences is
{ A ≻ B, A ≻ C, B ≻ C }. On the other hand, we may
not always be able to derive a total order of items based
on a set of pairwise preferences. For example, the
set of pairwise preferences in the ground truth in Table 1 is
insufficient to form a total order: we do not
know the preference between A and B. We define a set of
pairwise preferences to be complete if there exists a pairwise
preference between every two items involved in the set, or such a
pairwise preference can be inferred. Otherwise, it is a set of
incomplete pairwise preferences. As discussed in Section 1,
incomplete pairwise preferences are common when user
preferences are obtained implicitly. This is true for both the
training dataset and the test dataset, which contains the ground
truth user preferences.</p>
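      <p>The derivation above can be sketched as follows; the function names are illustrative, and the completeness test here only checks direct pairs (preferences inferred by transitivity would additionally need a closure step).</p>
      <preformat>
```python
from itertools import combinations

def pairs_from_ranked_list(ranked):
    """Derive the full set of pairwise preferences implied by a total order."""
    return {(a, b) for a, b in combinations(ranked, 2)}

def is_complete(pairs, items):
    """Complete if every two items are related by some pairwise preference."""
    return all((a, b) in pairs or (b, a) in pairs
               for a, b in combinations(items, 2))

derived = pairs_from_ranked_list(["A", "B", "C"])
# derived == {("A", "B"), ("A", "C"), ("B", "C")}
```
      </preformat>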
      <p>
        Since existing evaluation metrics are inapplicable for
incomplete pairwise preferences in the ground truth, we propose
a measure for rank accuracy where the ground truth is a
set of possibly incomplete pairwise preferences. Since we
can derive a set of pairwise preferences from the ranked list
output, the core of the evaluation is to measure the
similarity between two sets of pairwise preferences, one for system
output, and the other from the user.</p>
    </sec>
    <sec id="sec-existing">
      <title>3. EXISTING EVALUATION METHODS</title>
      <p>
Before we discuss our newly proposed method of evaluation,
we look at two existing methods, nDCG and AP correlation.
For both methods we show a brief example and then discuss
cases where each does not work for our problem setting.</p>
    </sec>
    <sec id="sec-ndcg">
      <title>3.1 Normalized Discounted Cumulative Gain</title>
      <p>
Cumulated gain-based evaluation was proposed by Jarvelin
and Kekalainen in 2002 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to evaluate rank accuracy when
the ground truth is a set of user ratings. The intuition is
that it is more important to correctly predict highly relevant
items than marginally relevant ones.
nDCG is formally defined in Equation 1. We denote I as the
set of items suggested by the system; rel(i) is the relevance
of item i according to the user; index(i) is the index of
item i when the items are sorted by the user's relevance scores;
and îndex(i) is the index of item i when sorted by the system's
prediction.
      </p>
      <p>nDCG = (1/Z) · Σ_{i∈I} [ (2^rel(i) − 1) / log2(1 + îndex(i)) ] (1)</p>
      <p>Z gives the maximum possible discounted cumulative gain
if the items were correctly sorted, and is used as a
normalization to ensure the result is between 0 and 1.</p>
      <p>Z = Σ_{i∈I} [ (2^rel(i) − 1) / log2(1 + index(i)) ] (2)</p>
      <p>For example, we look at the sample data in Table 2. If we
look at item A, rel(A) = 5, index(A) = 1, and îndex(A) = 2.</p>
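      <p>A minimal sketch of Equations 1 and 2, assuming rel, the system ranking îndex, and the user-relevance ranking index are given as dictionaries keyed by item. Table 2 is not reproduced here, so the second item's values below are invented purely for illustration.</p>
      <preformat>
```python
import math

def ndcg(rel, pred_index, ideal_index):
    """nDCG per Equations 1 and 2: discount by the system rank,
    normalize by the ideal (user-relevance) ranking. Ranks start at 1."""
    dcg = sum((2 ** rel[i] - 1) / math.log2(1 + pred_index[i]) for i in rel)
    z = sum((2 ** rel[i] - 1) / math.log2(1 + ideal_index[i]) for i in rel)
    return dcg / z

# Item A matches the text: rel(A) = 5, index(A) = 1, predicted index 2.
# Item B's values are assumptions for the sake of a runnable example.
score = ndcg({"A": 5, "B": 3}, {"A": 2, "B": 1}, {"A": 1, "B": 2})
```
      </preformat>
      <p>A perfectly ordered list yields a score of 1, since the numerator then equals Z.</p>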
      <p>
nDCG assumes that we know an actual relevance for each
item. It does not directly support the case when an item
is preferred to another item, but the magnitude of the
preference is unknown. When there is a total order on item
preferences, we may assign rating scores with equal
magnitude and then use nDCG. However, the absence of total
order makes such score assignments impractical, thus showing
the inapplicability of nDCG.</p>
    </sec>
    <sec id="sec-apcorr">
      <title>3.2 AP Correlation</title>
      <p>
AP (average precision) correlation (τap) was proposed by
Yilmaz et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as a modification to Kendall’s τ which
penalizes mistakes made for highly relevant items more than
less relevant items. AP correlation finds the precision
between two total orders at each index in the list and then
takes the average of these values, as defined in Equation 3.</p>
      <p>τap = (2/(N − 1)) · Σ_{i∈I, index(i) &gt; 1} [ C(i) / (index(i) − 1) ] − 1 (3)</p>
      <p>
N is the number of ranked items in the list and C(i) is the
number of items at an index less than index(i) that are
correctly ranked according to the ground truth. Consider
the data in Table 2. For item A, C(A) = 0 because A comes
after B in the prediction, but A is preferred to B in the
ground truth. A sample calculation for the data in Table 2,
with N = 4, proceeds as τap = (2/(4 − 1)) · (0/1 + ⋯) − 1.</p>
      <p>
AP correlation is measured on a scale of −1 to +1, where −1
means the lists are in reverse order and +1 means the lists
are identical.</p>
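      <p>Equation 3 can be sketched as follows for two total orders over the same items; the function name and the list-based encoding of the orders are assumptions of this sketch.</p>
      <preformat>
```python
def tau_ap(truth_order, pred_order):
    """AP correlation for two total orders (most preferred item first)."""
    n = len(pred_order)
    truth_pos = {item: k for k, item in enumerate(truth_order)}
    total = 0.0
    for idx in range(1, n):          # items at system rank 2..N
        item = pred_order[idx]
        # C(item): items the system ranks above `item` that the ground
        # truth also ranks above `item`
        c = sum(1 for other in pred_order[:idx]
                if truth_pos[item] > truth_pos[other])
        total += c / idx             # idx equals index(item) - 1
    return 2.0 / (n - 1) * total - 1.0

# identical orders approach +1; reversed orders give -1
```
      </preformat>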
      <p>AP correlation assumes that each list, the ground truth and
the system’s prediction, gives a total order of items. There is
no simple modification of AP correlation to support partial
orders. Two challenges must be addressed when thinking
about how to evaluate rank accuracy based on incomplete
pairwise preferences. How does one assign a rank to an item
when a total list cannot be constructed? In Equation 3,
index(i) is the rank index in the list, but when a total
order is unknown, such as the sample data in Table 1, a new
method is needed to give a rank index. Second, how does one
consider the cases where a user’s preference between items
is unknown? In that example, we do not know the user's
preference between items A and B. How do we evaluate a system
that makes a prediction A ≻ B?</p>
    </sec>
    <sec id="sec-edrc">
      <title>4. EXPECTED DISCOUNTED RANK CORRELATION</title>
      <p>We propose expected discounted rank correlation (EDRC) to
measure the similarity between two sets of pairwise
preferences. Given a set of ground truth pairwise preferences from
the user, G, and a set of predicted pairwise preferences
output by the system, Ĝ, EDRC calculates the expected
correlation between the two sets. Note that we may have user preferences
on a large set of items based on many different user
sessions. However, we only consider preferences relating the items
currently recommended by the system. That is, the sets of
items in G and Ĝ are the same.
Similar to AP correlation and nDCG, EDRC emphasizes
preserving the order of the user’s most preferred items and
enforcing a smaller penalty for less preferred items.
Different from nDCG whose ground truth is a set of relevance
scores, the ground truth supported by EDRC is a set of
pairwise preferences. Different from AP correlation which
requires complete pairwise preferences, EDRC allows
incomplete pairwise preferences.</p>
      <p>As discussed in Section 3.2, the presence of incomplete
pairwise preferences entails two challenges: the lack of total
order among items in the ground truth, and unknown
preferences between two items. Next, we look at the first problem.
Assigning Rank and Computing Weight. In the spirit
of discounted gain, we want to give a different discount (weight)
to each item. If we know the rank of an item, R(v),
in the ground truth, we may set the weight linearly,
logarithmically, or exponentially. However, the problem is how
to compute the rank of an item given incomplete pairwise
preferences. From a set of pairwise preferences as ground
truth, we first derive a topological order among the items.
Consider a graph where a vertex represents an item, and
a directed edge represents a pairwise preference. The item
corresponding to the source vertex is preferred to the item
corresponding to the target vertex. In the following
discussion, we use item and vertex interchangeably.</p>
      <p>For example, the ground truth in Table 1 can be represented
by the graph in Figure 1, where the fact that item A is
preferred to item C is shown by a directed edge from vertex A
to vertex C. In this graph, an item is preferred to any item
reachable by following a directed path. For example, A is preferred
to both C and D in the ground truth of Table 1.
Now we discuss how to leverage the graph to compute item
rank. We initialize the rank of every vertex to 1 and
traverse the graph beginning from the start nodes. Whenever
we follow an edge from a vertex u to v, we update v's rank
to max(R(v), R(u) + 1). Note that here we consider every
user-specified pairwise preference equally important; each
is assigned a unit weight of 1. Thus we keep the maximum
rank of a node among the different paths that can reach the
node from a start node. The procedure is defined as
SetRanks(G) in Algorithm 1.</p>
      <p>The ranks for the items in the graph of Figure 1 are as
follows: R(A) = 1, R(B) = 1, R(C) = 2, R(D) = 3 and
R(E) = 2. As we can see, nodes A and B have the same
rank, since we do not know the user's preference between the
two.</p>
      <p>Algorithm 1 SetRanks(G)
1: R(v) ← 1 for all v ∈ V(G)
2: S ← all nodes in G without an incoming edge
3: for all v ∈ S do
4: Visit(v, 0)
5: end for
Visit(v, rankin)
1: R(v) ← max(R(v), rankin + 1)
2: for all w ∈ OUT(v, G) do
3: Visit(w, R(v))
4: end for</p>
      <p>
Then we define the discount term, D(v), which can take
various forms depending on what type of discount is desired.
For example, for a simple linear discount, D(v) = R(v).
For exponential discount, D(v) = 2R(v) and for logarithmic
discount, D(v) = log2(1 + R(v)).</p>
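      <p>Algorithm 1 can be sketched as follows, assuming the graph is an acyclic adjacency map from each item to the set of items it is directly preferred to (the function names are illustrative). The example values reproduce the ranks reported in the text for Figure 1.</p>
      <preformat>
```python
def set_ranks(adj):
    """SetRanks(G): every vertex starts at rank 1; following an edge
    u -&gt; w updates R(w) to max(R(w), R(u) + 1). Assumes an acyclic graph."""
    rank = {v: 1 for v in adj}
    def visit(v):
        for w in adj[v]:
            rank[w] = max(rank[w], rank[v] + 1)
            visit(w)
    has_incoming = {w for targets in adj.values() for w in targets}
    for v in adj:
        if v not in has_incoming:   # start nodes
            visit(v)
    return rank

def discount_linear(rank_v):
    return rank_v  # D(v) = R(v); log and exponential variants are analogous

ranks = set_ranks({"A": {"C"}, "B": {"E"}, "C": {"D"}, "D": set(), "E": set()})
# ranks == {"A": 1, "B": 1, "C": 2, "D": 3, "E": 2}, as in the text
```
      </preformat>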
      <p>Handling Unknown Preferences between Items.
Another problem is computing the score, C(v), of an item in
the system’s output in the presence of incomplete pairwise
preferences in the ground truth. We propose the following
formula to define the score of an item, where OUT(v, G+)
is the set of vertices reachable from v (the items less
preferred than v) in the transitive closure, G+, of G, and W is
the set of items that are more preferred than v or have an
unknown relationship with v, where
W = V(G) \ [{v} ∪ OUT(v, G+)]. EP gives the expected
score for the relationship between v and each item in W.</p>
      <p>C(v) = Σ_{w∈W} EP(v, w)</p>
      <p>For a pair of items (v, w), suppose v is preferred over w, that
is, v ≻ w in G. There are three cases. First, we have v ≻ w in
Ĝ; in this case, EP(v, w) = 1. In the second case, we have
w ≻ v in Ĝ. Since the system prediction contradicts the
ground truth, we have EP(v, w) = 0. In the third case, the
system cannot predict a preference between v and w. Then
we need to compute the expected score. By default we may
assume there is a 50% likelihood for v ≻ w and a 50% likelihood
for w ≻ v, and we then let EP(v, w) = .5. Alternatively, we
may obtain a more accurate likelihood estimate based on
collaborative filtering: assuming we have a set of equally
similar users, if 70% of the users have v ≻ w and 30% of the
similar users have w ≻ v, then the expected score of v ≻ w
is 0.7. For example, below are sample values of C and EP based
on Table 1.</p>
      <p>C(C) = EP (C, A) + EP (C, B) + EP (C, E) = 0 + .5 + .5 = 1
C(D) = EP (D, A) + EP (D, B) + EP (D, C) + EP (D, E) =
.5 + .5 + 1 + .5 = 2.5
C(E) = EP (E, A) + EP (E, B) + EP (E, C) + EP (E, D) =
1 + 1 + .5 + .5 = 3
Putting Things Together. Based on the discussion of
how to compute a score for an item, C(v), and the discount
(weight) of an item, D(v), we now put these together for an
evaluation measure EDRC. We denote the set of all vertices
in G without an incoming edge as S.</p>
      <p>2 "
EDRC(G, Gˆ) = ·</p>
      <p>Z</p>
      <p>!
v∈V (G)\S</p>
      <p>C(v) #
D(v)</p>
      <p>Here, Z is a normalization factor to ensure the value is
between +1 and -1.</p>
      <p>Z =</p>
      <p>!
Considering the example data in Table 1, we now show how
to put together the sample calculation for EDRC using a
linear discount method.</p>
      <p>EDRC(G, Ĝ) = (2/(29/6)) · (1/2 + 2.5/3 + 3/2) − 1 = (12/29) · (17/6) − 1 = 5/29 ≈ 0.17</p>
      <p>
When both the ground truth and the prediction are complete
sets of pairwise preferences and the discount term is D(v) =
R(v) − 1, the values of EDRC and AP correlation will be
identical.</p>
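      <p>Using the values worked out in the text (C(C) = 1, C(D) = 2.5, C(E) = 3; linear discount D(v) = R(v); |W| = 3, 4, 4 for C, D, E; start nodes A and B excluded), the final aggregation can be sketched as follows. The form of Z here, Σ |W(v)|/D(v), is a reconstruction consistent with the 29/6 appearing in the sample calculation.</p>
      <preformat>
```python
from fractions import Fraction as F

# Scores, ranks, and |W(v)| sizes taken from the worked example in the text
C = {"C": F(1), "D": F(5, 2), "E": F(3)}
R = {"C": 2, "D": 3, "E": 2}
W_size = {"C": 3, "D": 4, "E": 4}

D = {v: F(R[v]) for v in R}                  # linear discount D(v) = R(v)
Z = sum(F(W_size[v]) / D[v] for v in R)      # 3/2 + 4/3 + 4/2 = 29/6
edrc = F(2) / Z * sum(C[v] / D[v] for v in R) - 1
# Z == 29/6 and edrc == 5/29, matching the sample calculation
```
      </preformat>
      <p>Exact rational arithmetic is used so the result can be checked against the fractions in the text.</p>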
    </sec>
    <sec id="sec-2">
      <title>5. CONCLUSION</title>
      <p>We consider a problem setting where the input from the user
and the system’s prediction are sets of possibly incomplete
pairwise preferences. Based on this setting, we discuss the
limitations of two evaluation methods, nDCG and AP
correlation. Then we propose a new rank correlation metric,
expected discounted rank correlation (EDRC), that
compares two sets of pairwise preferences, one from the ground
truth and one from the system prediction. This new measure
has wide applications for evaluating recommender systems
whose input data and ground truth are obtained from
implicit user feedback.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgements</title>
      <p>This work was sponsored in part by an NSF CAREER Award
IIS-0845647 and IIS-0915438. This work was also supported
in part by an IBM faculty award.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Baltrunas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Makcinskas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricci</surname>
          </string-name>
          .
          <article-title>Group recommendations with rank aggregation and collaborative filtering</article-title>
          .
          <source>In Proceedings of the 4th ACM conference on Recommender systems</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Jarvelin</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kekalainen</surname>
          </string-name>
          .
          <article-title>Cumulated gain-based evaluation of IR techniques</article-title>
          .
          <source>ACM Transactions on Information Systems</source>
          ,
          <volume>20</volume>
          (
          <issue>4</issue>
          ):
          <fpage>422</fpage>
          -
          <lpage>446</lpage>
          ,
          <year>October 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Kendall</surname>
          </string-name>
          .
          <article-title>A new measure of rank correlation</article-title>
          .
          <source>Biometrika</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          /2):
          <fpage>81</fpage>
          -
          <lpage>93</lpage>
          ,
          <year>1938</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vassilvitskii</surname>
          </string-name>
          .
          <article-title>Generalized distances between rankings</article-title>
          .
          <source>In Proceedings of the 19th International World Wide Web Conference</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N. N.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Eigenrank: A ranking-oriented approach to collaborative filtering</article-title>
          .
          <source>In Proceedings of the 31st Annual International ACM SIGIR Conference</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.-T.</given-names>
            <surname>Park</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Chu</surname>
          </string-name>
          .
          <article-title>Pairwise preference regression for cold-start recommendation</article-title>
          .
          <source>In Proceedings of the 3rd ACM conference on Recommender systems</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Freudenthaler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gantner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt-Thieme</surname>
          </string-name>
          . BPR:
          <article-title>Bayesian personalized ranking from implicit feedback</article-title>
          .
          <source>In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Spearman</surname>
          </string-name>
          .
          <article-title>The proof and measurement of association between two things</article-title>
          .
          <source>The American Journal of Psychology</source>
          ,
          <volume>15</volume>
          (
          <issue>1</issue>
          ):
          <fpage>72</fpage>
          -
          <lpage>101</lpage>
          ,
          <year>1904</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Aslam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>A new rank correlation coefficient for information retrieval</article-title>
          .
          <source>In Proceedings of the 31st Annual International ACM SIGIR Conference</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>