Evaluating Rank Accuracy based on Incomplete Pairwise Preferences

Brian Ackerman, Yi Chen
Arizona State University
backerman@asu.edu, yi@asu.edu

ABSTRACT
Existing methods to measure the rank accuracy of a recommender system assume the ground truth is either a set of user ratings or a totally ordered list of items given by the user, with possible ties. However, in many applications we are only able to obtain implicit user feedback, which does not provide such comprehensive information, but only gives a set of pairwise preferences among items. Generally such pairwise preferences are not complete, and thus a total order of items may not be deducible. In this paper, we propose a novel method to evaluate rank accuracy, expected discounted rank correlation, which addresses the unique challenges of handling incomplete pairwise preferences in the ground truth and also puts an emphasis on properly ranking the items that users most prefer.

Categories and Subject Descriptors
H.3.4 [Storage and Retrieval]: Systems and Software - Performance evaluation (efficiency and effectiveness)

General Terms
Measurement, Performance

Keywords
Recommender systems, rank accuracy, pairwise preference, evaluation

1. INTRODUCTION
Recommender systems have proven their value in many applications where it is advantageous to personalize a user's experience, and they currently power some of the world's most visited websites. Users visit websites like Amazon, Netflix, and Urbanspoon when they are looking for suggestions on items that may be of interest. These websites can suggest relevant items from a large collection and output a ranked list, ordered by predicted relevance to the user, to aid in the user's decision making process.

Given the prevalence of recommender systems, a critical question becomes their evaluation. Evaluation measures are used to compare different recommendation techniques and give researchers an objective for optimization. Existing evaluation measures may take a set of user ratings as the ground truth [2, 4], or a totally ordered list of items given by the user [9], with an option to support ties [3, 8]. Based on the weight given to items, there are two types of evaluation. Reference rank accuracy measures the similarity between the ranked list output by a system and the list of the user's preferences, such as Spearman's ρ [8] and Kendall's τ [3]. There are also variations of area-under-the-curve measures, such as the one used in [7], which are closely related to a general loss function. Utility rank accuracy measures give a stronger weight to the items that have a higher relevance according to the user, such as normalized discounted cumulative gain [2], which is widely adopted [1, 5, 6].

Ground truth often has to be obtained through explicit user feedback. In some applications, we may solicit users to give ratings on items and use such ratings as relevance scores, as in the data obtained from Netflix and MovieLens. Rating data provides accurate user opinions on items, and can also provide a total order of the user's preferences on items. However, such data requires explicit feedback, which may be considered burdensome by some users.

On the other hand, there is great potential for recommender systems, as well as search engines, to leverage implicit user feedback, which can also be used for evaluation. After a user issues a query, a set of items is recommended. Among them, the user may select some to check for more details, and finally may choose one or more of the products to buy or places to visit. Such user interaction provides implicit feedback on an item's relevance. We can consider the set of items purchased or visited to be preferred to the set of selected items, which in turn is preferred to the set of items that do not receive any user interaction. Alternatively, we may also measure the actual time that a user spends examining an item to infer finer-grained preferences.

After collecting data in the form of implicit user feedback, we must also be able to evaluate recommendation techniques based on that data. Using such data, we may not have the actual ratings of items, but only a set of pairwise preferences among items. Note that pairwise preferences are generally not complete. That is to say, there may exist two items for which the user's preference is unknown. For instance, we do not know a user's preference between an item that is presented to the user in response to a query and an item that is not shown to the user. In this case, a total order among a set of items cannot be deduced. In this paper, we only consider evaluating preferences that are not contradictory: we simply allow the aggregation of preferences from different user sessions, assuming the preferences are consistent. It is worth noting that considering the contextual factors involved with a user session would also be advantageous, but is outside the scope of this paper. These contextual factors could include the set of items the user was recommended, the items previously purchased or visited by the user, and how the items were displayed to the user. It is also possible to find cyclic preferences (e.g. A ≻ B, B ≻ C, C ≻ A); we do not consider these cases either.

To evaluate recommender systems whose user preferences are obtained through implicit user feedback, we study evaluation methods for a general type of recommender system that takes a set of possibly incomplete pairwise preferences as input for training and for ground truth, and outputs a ranked list of items for recommendation. After examining why existing evaluation methods may fail, we propose a novel rank accuracy measure, expected discounted rank correlation (EDRC), to handle incomplete pairwise preferences. EDRC handles two unique challenges in this setting: the lack of a total ranking order on items, and possibly unknown preferences between two items. EDRC also adopts a discount model that gives a bigger penalty to wrong predictions on the more preferred items. To the best of our knowledge, this is the first attempt at designing an evaluation measure for ground truth and system predictions that are sets of incomplete pairwise preferences.

The remainder of this paper is outlined as follows: after we introduce the problem setting in Section 2, we briefly review two commonly used evaluation measures in Section 3. Section 4 proposes a new method to measure rank accuracy for incomplete pairwise preferences, and Section 5 concludes the paper.
2. PROBLEM SETTING
In this work, we design an evaluation measure for a recommender system that takes ranking data in the form of pairwise preferences as input, trains on the ranking data to infer relevance values for each item in a test set, and then outputs a ranking order for the items based on their predicted relevance values. For the test set, we consider the ground truth to be a set of pairwise preferences collected from implicit user feedback.

We define a pairwise preference as a user's comparison between two items, and denote a pairwise preference as A ≻ B, meaning item A is preferred to item B. Clearly, from a ranked list of items we can derive a set of pairwise preferences. For example, consider three items A, B, and C. If the system predicts the order of these items to be A, B, C, then the derived set of pairwise preferences is { A ≻ B, A ≻ C, B ≻ C }. On the other hand, we may not always be able to derive a total order of items from a set of pairwise preferences. For example, the set of pairwise preferences in the ground truth of Table 1 is insufficient to form a total order; for instance, we do not know the preference between A and B. We define a set of pairwise preferences to be complete if there exists a pairwise preference between every two items involved in the set, or such a preference can be inferred. Otherwise, it is a set of incomplete pairwise preferences. As discussed in Section 1, incomplete pairwise preferences are common when user preferences are obtained implicitly. This is true for both the training dataset and the test dataset, which contains the ground truth, i.e., the user's preferences.

Table 1: Example Pairwise Preference Data

               | Preferences
  Ground Truth | A ≻ C, A ≻ D, A ≻ E, C ≻ D, B ≻ E
  Prediction   | C ≻ A, C ≻ B, C ≻ D, A ≻ E, B ≻ E, D ≻ E

Since existing evaluation metrics are inapplicable to incomplete pairwise preferences in the ground truth, we propose a rank accuracy measure where the ground truth is a set of possibly incomplete pairwise preferences. Since we can derive a set of pairwise preferences from the ranked list output, the core of the evaluation is to measure the similarity between two sets of pairwise preferences, one from the system output and the other from the user, as illustrated by the sketch below.
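As a concrete illustration of this derivation, the following short Python sketch (the function name and the tuple encoding are ours, not from the paper) turns a ranked list into the set of pairwise preferences it implies, encoding a preference a ≻ b as the pair (a, b):

from itertools import combinations

def prefs_from_ranking(ranked):
    # Every item is preferred to every item that appears after it,
    # so a list of n items yields n * (n - 1) / 2 preferences.
    return {(a, b) for a, b in combinations(ranked, 2)}

# The ranking A, B, C from the example above yields
# {('A', 'B'), ('A', 'C'), ('B', 'C')}.
print(prefs_from_ranking(['A', 'B', 'C']))

Note that the evaluation problem runs in the reverse direction: the ground truth generally cannot be produced this way, because no total order exists.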
3. EXISTING EVALUATION METHODS
Before we discuss our newly proposed evaluation method, we look at two existing methods, nDCG and AP correlation. For both methods we show a brief example and then discuss why each does not work in our problem setting.

3.1 Normalized Discounted Cumulative Gain
Cumulated gain-based evaluation was proposed by Jarvelin and Kekalainen in 2002 [2] to evaluate rank accuracy when the ground truth is a set of user ratings. The intuition is that it is more important to correctly predict highly relevant items than marginally relevant ones.

nDCG is formally defined in Equation 1. We denote by I the set of items suggested by the system; rel(i) is the relevance of item i according to the user, index(i) is the index of item i when the items are sorted by the user's relevance scores, and \widehat{index}(i) is the index of item i when the items are sorted by the system's prediction.

    nDCG = \frac{1}{Z} \sum_{i \in I} \frac{2^{rel(i)} - 1}{\log_2(1 + \widehat{index}(i))}    (1)

Z gives the maximum possible discounted cumulative gain, obtained when the items are correctly sorted, and is used as a normalization to ensure the result is between 0 and 1.

    Z = \sum_{i \in I} \frac{2^{rel(i)} - 1}{\log_2(1 + index(i))}    (2)

Table 2: Example Rating and Preference Data

  I | rel(i) | index(i) | \widehat{index}(i)
  A |   5    |    1     |    2
  B |   4    |    2     |    1
  C |   3    |    3     |    3
  D |   2    |    4     |    4

For example, consider the sample data in Table 2. For item A, rel(A) = 5, index(A) = 1, and \widehat{index}(A) = 2. The full sample calculation for nDCG is:

    nDCG = \frac{1}{Z} \left( \frac{2^4 - 1}{\log_2 2} + \frac{2^5 - 1}{\log_2 3} + \frac{2^3 - 1}{\log_2 4} + \frac{2^2 - 1}{\log_2 5} \right)

    Z = \frac{2^5 - 1}{\log_2 2} + \frac{2^4 - 1}{\log_2 3} + \frac{2^3 - 1}{\log_2 4} + \frac{2^2 - 1}{\log_2 5}

This gives a value of 0.870 for the example.
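The worked example can be checked with a short Python sketch of Equations 1 and 2 (the function name and dictionary encoding are ours); pred_index maps each item to its 1-based position in the system's ranking:

import math

def ndcg(rel, pred_index):
    gain = lambda i: 2 ** rel[i] - 1
    # Equation 1: gains discounted by the predicted position.
    dcg = sum(gain(i) / math.log2(1 + pred_index[i]) for i in rel)
    # Equation 2: the ideal ordering sorts items by decreasing relevance.
    ideal = {i: r + 1
             for r, i in enumerate(sorted(rel, key=rel.get, reverse=True))}
    z = sum(gain(i) / math.log2(1 + ideal[i]) for i in rel)
    return dcg / z

# Table 2: relevance scores and the predicted order B, A, C, D.
rel = {'A': 5, 'B': 4, 'C': 3, 'D': 2}
pred_index = {'A': 2, 'B': 1, 'C': 3, 'D': 4}
print(f"{ndcg(rel, pred_index):.3f}")  # 0.870, as in the worked example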
nDCG assumes that we know an actual relevance score for each item. It does not directly support the case where an item is preferred to another item but the magnitude of the preference is unknown. When there is a total order on item preferences, we may assign rating scores with equal magnitude between adjacent items and then use nDCG. However, in the absence of a total order such score assignments become impractical, which shows the inapplicability of nDCG.

3.2 AP Correlation
AP (average precision) correlation (τ_ap) was proposed by Yilmaz et al. [9] as a modification of Kendall's τ that penalizes mistakes made on highly relevant items more than on less relevant items. AP correlation finds the precision between two total orders at each index in the list and then takes the average of these values, as defined in Equation 3.

    \tau_{ap} = \frac{2}{N - 1} \sum_{i \in I} \frac{C(i)}{\widehat{index}(i) - 1} - 1    (3)

N is the number of ranked items in the list, and C(i) is the number of items at an index less than \widehat{index}(i) that are correctly ranked according to the ground truth; the top-ranked item is excluded from the sum. Consider the data in Table 2. For item A, C(A) = 0 because A comes after B in the prediction, but A is preferred to B in the ground truth. The sample calculation is:

    \tau_{ap} = \frac{2}{4 - 1} \left( \frac{0}{1} + \frac{2}{2} + \frac{3}{3} \right) - 1 = \frac{1}{3}
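This calculation can be written directly in Python. The following minimal sketch of Equation 3 (the encoding is ours) takes both total orders as lists, most preferred first:

def ap_correlation(truth, pred):
    n = len(pred)
    truth_pos = {item: r for r, item in enumerate(truth)}
    total = 0.0
    for idx in range(1, n):  # skip the top-ranked item
        item, above = pred[idx], pred[:idx]
        # C(i): items the system ranks above this item that the user
        # indeed prefers to it.
        c = sum(1 for w in above if truth_pos[w] < truth_pos[item])
        total += c / idx     # idx equals the predicted index minus 1
    return 2.0 * total / (n - 1) - 1.0

# Table 2: ground truth order A, B, C, D; predicted order B, A, C, D.
print(ap_correlation(['A', 'B', 'C', 'D'], ['B', 'A', 'C', 'D']))  # 1/3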
AP correlation is measured on a scale of -1 to +1, where -1 means the lists are in reverse order and +1 means the lists are the same.

AP correlation assumes that each list, the ground truth and the system's prediction, gives a total order of items, and there is no simple modification of AP correlation to support partial orders. Two challenges must be addressed when thinking about how to evaluate rank accuracy based on incomplete pairwise preferences. First, how does one assign a rank to an item when a total list cannot be constructed? In Equation 3, \widehat{index}(i) is the rank index in the list, but when a total order is unknown, as in the sample data of Table 1, a new method is needed to give a rank index. Second, how does one handle the cases where the user's preference between items is unknown? In that example, we do not know the user's preference between items A and B. How should we evaluate a system that makes a prediction A ≻ B?

4. EXPECTED DISCOUNTED RANK CORRELATION
We propose expected discounted rank correlation (EDRC) to measure the similarity between two sets of pairwise preferences. Given a set of ground truth pairwise preferences from the user, G, and a set of predicted pairwise preferences output by the system, Ĝ, EDRC calculates the expected correlation between the two sets. Note that we may have user preferences on a large set of items, gathered from many different user sessions. However, we only consider preferences relating items currently recommended by the system. That is, the sets of items in G and Ĝ are the same.

Similar to AP correlation and nDCG, EDRC emphasizes preserving the order of the user's most preferred items and enforces a smaller penalty on less preferred items. Different from nDCG, whose ground truth is a set of relevance scores, the ground truth supported by EDRC is a set of pairwise preferences. Different from AP correlation, which requires complete pairwise preferences, EDRC allows incomplete ones.

As discussed in Section 3.2, the presence of incomplete pairwise preferences entails two challenges: the lack of a total order among items in the ground truth, and unknown preferences between two items. We look at the first problem next.

Assigning Rank and Computing Weight. In the spirit of discounted gain, we want to give a different discount (weight) to different items. If we know the rank of an item, R(v), in the ground truth, we may set the weight linearly, logarithmically, or exponentially. The problem, however, is how to compute the rank of an item given incomplete pairwise preferences. From a set of pairwise preferences serving as ground truth, we first derive a topological order among the items. Consider a graph where a vertex represents an item and a directed edge represents a pairwise preference: the item corresponding to the source vertex is preferred to the item corresponding to the target vertex. In the following discussion, we use item and vertex interchangeably.

For example, the ground truth in Table 1 can be represented by the graph in Figure 1, where the fact that item A is preferred to item C is shown by a directed edge from vertex A to vertex C. In this graph, an item is preferred to any item reachable by following a directed path. For example, A is preferred to both C and D in the ground truth of Table 1.

[Figure 1: Topological Order based on Ground Truth (left) and System Prediction (right) in Table 1]

Now we discuss how to leverage the graph to compute item ranks. We initialize the rank of every vertex to 1 and traverse the graph beginning from the start nodes, i.e., the nodes without incoming edges. Whenever we follow an edge from a vertex u to a vertex v, we update v's rank to max(R(v), R(u) + 1). Note that here we consider every user-specified pairwise preference as equally important, so each edge is assigned a unit weight of 1, and we keep the maximum rank of a node over the different paths that can reach it from a start node. The procedure SetRanks(G) is defined in Algorithm 1.

Algorithm 1 Setting Rank Value
SetRanks(G)
1: for all v in V(G) do R(v) ← 1
2: S ← all nodes in G without an incoming edge
3: for all v in S do Visit(v, 0)

Visit(v, rank_in)
1: R(v) ← max(R(v), rank_in + 1)
2: for all w in OUT(v, G) do
3:   Visit(w, R(v))
4: end for

The ranks for the items in the graph of Figure 1 are as follows: R(A) = 1, R(B) = 1, R(C) = 2, R(D) = 3, and R(E) = 2. As we can see, nodes A and B have the same rank, since we do not know the user's preference between the two.
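Algorithm 1 is straightforward to turn into runnable code. Below is a sketch in Python (our own naming; preferences are encoded as (preferred, less_preferred) pairs) that reproduces the ranks listed above. Instead of passing the incoming rank down the recursion, it propagates a rank increase to a node's descendants whenever one occurs, which is equivalent on the preference DAG:

def set_ranks(prefs):
    nodes = {x for pair in prefs for x in pair}
    out = {v: [w for (u, w) in prefs if u == v] for v in nodes}
    rank = {v: 1 for v in nodes}             # every vertex starts at rank 1

    def visit(v):
        for w in out[v]:
            if rank[w] < rank[v] + 1:        # R(w) <- max(R(w), R(v) + 1)
                rank[w] = rank[v] + 1
                visit(w)                     # re-propagate the larger rank

    start = nodes - {w for (_, w) in prefs}  # nodes without incoming edges
    for v in start:
        visit(v)
    return rank

# Ground truth of Table 1.
g = {('A', 'C'), ('A', 'D'), ('A', 'E'), ('C', 'D'), ('B', 'E')}
print(set_ranks(g))  # R(A)=1, R(B)=1, R(C)=2, R(D)=3, R(E)=2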
Then we define the discount term, D(v), which can take various forms depending on what type of discount is desired. For example, for a simple linear discount, D(v) = R(v); for an exponential discount, D(v) = 2^{R(v)}; and for a logarithmic discount, D(v) = log_2(1 + R(v)).

Handling Unknown Preferences between Items. The other problem is computing the score, C(v), of an item in the system's output in the presence of incomplete pairwise preferences in the ground truth. We propose the following formula to define the score of an item, where OUT(v, G+) is the set of vertices that v points to in the transitive closure G+ of G, and W is the set of items that are more preferred than v or have an unknown relationship with v, that is, W = V(G) \ [{v} ∪ OUT(v, G+)]. EP gives the expected score for the relationship between v and each item in W.

    C(v) = \sum_{w \in W} EP(v, w)

Consider a pair of items (v, w) with w ∈ W, and suppose first that w is preferred to v in the ground truth, that is, w ≻ v in G. There are three cases. In the first case, we also have w ≻ v in Ĝ, and EP(v, w) = 1. In the second case, we have v ≻ w in Ĝ; since the system prediction contradicts the ground truth, EP(v, w) = 0. In the third case, the system cannot predict a preference between v and w, so we compute an expected score: by default we may assume a 50% likelihood for v ≻ w and a 50% likelihood for w ≻ v, and let EP(v, w) = 0.5. When the user's preference between v and w is unknown in G, we likewise use the expected score of 0.5 by default. Alternatively, we may obtain a more accurate likelihood estimate based on collaborative filtering: assuming we have a set of equally similar users, if 70% of those users have v ≻ w and 30% have w ≻ v, then the expected score of v ≻ w is 0.7. For example, below are the scores C and EP based on Table 1.

    C(C) = EP(C, A) + EP(C, B) + EP(C, E) = 0 + 0.5 + 0.5 = 1
    C(D) = EP(D, A) + EP(D, B) + EP(D, C) + EP(D, E) = 0.5 + 0.5 + 1 + 0.5 = 2.5
    C(E) = EP(E, A) + EP(E, B) + EP(E, C) + EP(E, D) = 1 + 1 + 0.5 + 0.5 = 3

Putting Things Together. Based on the discussion of how to compute the score of an item, C(v), and the discount (weight) of an item, D(v), we now put these together into the evaluation measure EDRC. We denote by S the set of all vertices in G without an incoming edge.

    EDRC(G, Ĝ) = \frac{2}{Z} \sum_{v \in V(G) \setminus S} \frac{C(v)}{D(v)} - 1

Here, Z is a normalization factor that ensures the value is between -1 and +1, where W_v denotes the set W computed for item v:

    Z = \sum_{v \in V(G) \setminus S} \frac{|W_v|}{D(v)}

Considering the example data in Table 1, we now put together the sample calculation for EDRC using the linear discount D(v) = R(v), for which Z = 3/2 + 4/3 + 4/2 = 29/6:

    EDRC(G, Ĝ) = \frac{2}{29/6} \left( \frac{1}{2} + \frac{2.5}{3} + \frac{3}{2} \right) - 1 = \frac{5}{29}

When both the ground truth and the prediction are complete sets of pairwise preferences and the discount term is D(v) = R(v) - 1, the values of EDRC and AP correlation coincide.
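Finally, to make the whole computation concrete, here is an end-to-end sketch in Python (our own naming and encoding; exact arithmetic via the standard fractions module) that combines the transitive closure, EP, C(v), D(v), and Z as defined above, and reproduces the 5/29 value for the data in Table 1:

from fractions import Fraction

def transitive_closure(prefs):
    # Naive closure: add (a, d) whenever (a, b) and (b, d) are present.
    closure = set(prefs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def edrc(g, g_hat, rank, discount=lambda r: r):
    # g, g_hat: sets of (preferred, less_preferred) pairs; rank: R(v)
    # from Algorithm 1; discount: D as a function of R (linear default).
    gc, pc = transitive_closure(g), transitive_closure(g_hat)
    nodes = {x for pair in g for x in pair}
    sources = nodes - {b for (_, b) in g}

    def ep(v, w):
        if (w, v) not in gc:        # preference unknown in the ground truth
            return Fraction(1, 2)
        if (w, v) in pc:            # prediction agrees: w preferred to v
            return Fraction(1)
        if (v, w) in pc:            # prediction contradicts the ground truth
            return Fraction(0)
        return Fraction(1, 2)       # prediction makes no call

    total = z = Fraction(0)
    for v in nodes - sources:
        w_set = nodes - {v} - {b for (a, b) in gc if a == v}  # W for v
        d = Fraction(discount(rank[v]))
        total += sum(ep(v, w) for w in w_set) / d
        z += Fraction(len(w_set)) / d
    return 2 * total / z - 1

g = {('A', 'C'), ('A', 'D'), ('A', 'E'), ('C', 'D'), ('B', 'E')}
g_hat = {('C', 'A'), ('C', 'B'), ('C', 'D'),
         ('A', 'E'), ('B', 'E'), ('D', 'E')}
rank = {'A': 1, 'B': 1, 'C': 2, 'D': 3, 'E': 2}
print(edrc(g, g_hat, rank))  # 5/29

Using exact fractions keeps the arithmetic transparent, so the 5/29 result of the worked example is verifiable at a glance; passing discount=lambda r: r - 1 gives the variant related to AP correlation noted above.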
5. CONCLUSION
We consider a problem setting where the input from the user and the system's prediction are sets of possibly incomplete pairwise preferences. Based on this setting, we discuss the limitations of two existing evaluation methods, nDCG and AP correlation. We then propose a new rank correlation metric, expected discounted rank correlation (EDRC), that compares two sets of pairwise preferences, one from the ground truth and one from the system prediction. This new measure has wide applications for evaluating recommender systems whose input data and ground truth are obtained from implicit user feedback.

Acknowledgements
This work was sponsored in part by NSF CAREER Award IIS-0845647 and NSF grant IIS-0915438. This work was also supported in part by an IBM faculty award.

6. REFERENCES
[1] L. Baltrunas, T. Makcinskas, and F. Ricci. Group recommendations with rank aggregation and collaborative filtering. In Proceedings of the 4th ACM Conference on Recommender Systems, 2010.
[2] K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422-446, October 2002.
[3] M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81-93, 1938.
[4] R. Kumar and S. Vassilvitskii. Generalized distances between rankings. In Proceedings of the 19th International World Wide Web Conference, 2010.
[5] N. N. Liu and Q. Yang. EigenRank: A ranking-oriented approach to collaborative filtering. In Proceedings of the 31st Annual International ACM SIGIR Conference, 2008.
[6] S.-T. Park and W. Chu. Pairwise preference regression for cold-start recommendation. In Proceedings of the 3rd ACM Conference on Recommender Systems, 2009.
[7] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
[8] C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72-101, 1904.
[9] E. Yilmaz, J. A. Aslam, and S. Robertson. A new rank correlation coefficient for information retrieval. In Proceedings of the 31st Annual International ACM SIGIR Conference, 2008.