Evaluating Rank Accuracy based on Incomplete Pairwise Preferences

Brian Ackerman, Yi Chen
Arizona State University
backerman@asu.edu, yi@asu.edu

ABSTRACT
Existing methods to measure the rank accuracy of a recommender system assume the ground truth is either a set of user ratings or a totally ordered list of items given by the user, with possible ties. However, in many applications we are only able to obtain implicit user feedback, which does not provide such comprehensive information, but only gives a set of pairwise preferences among items. Generally such pairwise preferences are not complete, and thus a total order of items may not be deducible. In this paper, we propose a novel method to evaluate rank accuracy, expected discounted rank correlation, which addresses the unique challenges of handling incomplete pairwise preferences in the ground truth and also puts an emphasis on properly ranking the items that users most prefer.

Categories and Subject Descriptors
H.3.4 [Storage and Retrieval]: Systems and Software - Performance evaluation (efficiency and effectiveness)

General Terms
Measurement, Performance

Keywords
Recommender systems, rank accuracy, pairwise preference, evaluation

1. INTRODUCTION
Recommender systems have proven their value in many applications where it is advantageous to personalize a user's experience, and they currently power some of the world's most visited websites. Users visit websites like Amazon, Netflix, and Urbanspoon when they are looking for suggestions on items that may be of interest. These websites can suggest relevant items from a large collection and output a ranked list, ordered by predicted relevance to the user, to aid in the user's decision making process.

Given the prevalence of recommender systems, a critical question becomes their evaluation. Evaluation measures are used to compare different recommendation techniques and give researchers an objective for optimization. Existing evaluation measures may take a set of user ratings as the ground truth [2, 4], or a totally ordered list of items given by the user [9], with an option to support ties [3, 8]. Based on the weight given to items, there are two types of evaluation. Reference rank accuracy measures the similarity between the ranked list output by a system and the list of the user's preferences, such as Spearman's ρ [8] and Kendall's τ [3]. There are also variations of area-under-the-curve measures, such as the one used in [7], which are closely related to a general loss function. Utility rank accuracy measures give a stronger weight to the items that have a higher relevance according to the user, such as normalized discounted cumulative gain [2], which is widely adopted [1, 5, 6].

Ground truth often has to be obtained through explicit user feedback. In some applications, we may solicit users to give ratings on items and use such ratings as relevance scores, as in the data obtained from Netflix and MovieLens. Rating data provides accurate user opinions on items, and can also provide a total order of the user's preferences on items. However, such data requires explicit feedback, which may be considered burdensome by some users.

On the other hand, there is great potential for recommender systems, as well as search engines, to leverage implicit user feedback, which can also be used for evaluation. After a user issues a query, a set of items is recommended. Among them, the user may select some to check for more details, and finally may choose one or more of the products to buy or places to visit. Such user interaction provides implicit feedback on an item's relevance. We can consider the set of items purchased or visited to be preferred to the set of selected items, which in turn is preferred to the set of items that do not receive any user interaction. Alternatively, we may also measure the actual time that a user spends examining an item to infer finer-grained preferences.

After collecting data in the form of implicit user feedback, we must also be able to evaluate recommendation techniques based on that data. Using such data, we may not have the actual ratings of items, but only a set of pairwise preferences among items. Note that pairwise preferences are generally not complete. That is to say, there may exist two items for which the user's preference is unknown. For instance, we do not know a user's preference between an item that is presented to the user in response to a query and an item that is not shown to the user. In this case, a total order among a set of items cannot be deduced. In this paper, we only consider evaluating preferences that are not contradictory: we simply allow the aggregation of preferences from different user sessions, assuming the preferences are consistent. It is worth noting that considering the contextual factors involved with a user session would also be advantageous, but is outside the scope of this paper. These contextual factors could include the set of items the user was recommended, the items previously purchased or visited by the user, and how the items were displayed to the user. It is also possible to find cyclic preferences (e.g. A ≻ B, B ≻ C, C ≻ A); we do not consider these cases either.

To evaluate recommender systems whose user preferences are obtained through implicit user feedback, we study evaluation methods for a general type of recommender system that takes a set of possibly incomplete pairwise preferences as input for training and for ground truth, and outputs a ranked list of items for recommendation. After examining why existing evaluation methods may fail, we propose a novel rank accuracy measure, expected discounted rank correlation (EDRC), to handle incomplete pairwise preferences. EDRC handles two unique challenges in this setting: the lack of a total ranking order on items, and possibly unknown preferences between two items. EDRC also adopts a discount model that gives a bigger penalty to wrong predictions on the more preferred items. To the best of our knowledge, this is the first attempt at designing an evaluation measure for ground truth and system predictions that are sets of incomplete pairwise preferences.

The remainder of this paper is outlined as follows: after we introduce the problem setting in Section 2, we briefly review two commonly used evaluation measures in Section 3. Section 4 proposes a new method to measure rank accuracy for incomplete pairwise preferences, and Section 5 concludes the paper.
2. PROBLEM SETTING
In this work, we design an evaluation measure for a recommender system that takes ranking data in the form of pairwise preferences as input, trains on the ranking data to infer relevance values for each item in a test set, and then outputs a ranking order for the items based on their predicted relevance values. For the test set, we consider the ground truth to be a set of pairwise preferences collected from implicit user feedback.

We define a pairwise preference as a user's comparison between two items, and denote a pairwise preference as A ≻ B, meaning item A is preferred to item B. Clearly, from a ranked list of items we can derive a set of pairwise preferences. For example, consider three items A, B, and C. If the system predicts the order of these items to be A, B, C, then the derived set of pairwise preferences is { A ≻ B, A ≻ C, B ≻ C }. On the other hand, we may not always be able to derive a total order of items from a set of pairwise preferences. For example, the set of pairwise preferences in the ground truth of Table 1 is insufficient to form a total order; for instance, we do not know the preference between A and B. We define a set of pairwise preferences to be complete if there exists a pairwise preference between every two items involved in the set, or such a preference can be inferred. Otherwise, it is a set of incomplete pairwise preferences. As discussed in Section 1, incomplete pairwise preferences are common when user preferences are obtained implicitly. This is true for both the training dataset and the test dataset, which contains the ground truth, i.e., the user's preferences.

Table 1: Example Pairwise Preference Data

               | Preferences
  Ground Truth | A ≻ C, A ≻ D, A ≻ E, C ≻ D, B ≻ E
  Prediction   | C ≻ A, C ≻ B, C ≻ D, A ≻ E, B ≻ E, D ≻ E

Since existing evaluation metrics are inapplicable to incomplete pairwise preferences in the ground truth, we propose a rank accuracy measure where the ground truth is a set of possibly incomplete pairwise preferences. Since we can derive a set of pairwise preferences from the ranked list output, the core of the evaluation is to measure the similarity between two sets of pairwise preferences, one from the system output and the other from the user, as illustrated by the sketch below.
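As a concrete illustration of this derivation, the following short Python sketch (the function name and the tuple encoding are ours, not from the paper) turns a ranked list into the set of pairwise preferences it implies, encoding a preference a ≻ b as the pair (a, b):

from itertools import combinations

def prefs_from_ranking(ranked):
    # Every item is preferred to every item that appears after it,
    # so a list of n items yields n * (n - 1) / 2 preferences.
    return {(a, b) for a, b in combinations(ranked, 2)}

# The ranking A, B, C from the example above yields
# {('A', 'B'), ('A', 'C'), ('B', 'C')}.
print(prefs_from_ranking(['A', 'B', 'C']))

Note that the evaluation problem runs in the reverse direction: the ground truth generally cannot be produced this way, because no total order exists.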
3. EXISTING EVALUATION METHODS
Before we discuss our newly proposed evaluation method, we look at two existing methods, nDCG and AP correlation. For both methods we show a brief example and then discuss why each does not work in our problem setting.

3.1 Normalized Discounted Cumulative Gain
Cumulated gain-based evaluation was proposed by Jarvelin and Kekalainen in 2002 [2] to evaluate rank accuracy when the ground truth is a set of user ratings. The intuition is that it is more important to correctly predict highly relevant items than marginally relevant ones.

nDCG is formally defined in Equation 1. We denote by I the set of items suggested by the system; rel(i) is the relevance of item i according to the user, index(i) is the index of item i when the items are sorted by the user's relevance scores, and \widehat{index}(i) is the index of item i when the items are sorted by the system's prediction.

    nDCG = \frac{1}{Z} \sum_{i \in I} \frac{2^{rel(i)} - 1}{\log_2(1 + \widehat{index}(i))}    (1)

Z gives the maximum possible discounted cumulative gain, obtained when the items are correctly sorted, and is used as a normalization to ensure the result is between 0 and 1.

    Z = \sum_{i \in I} \frac{2^{rel(i)} - 1}{\log_2(1 + index(i))}    (2)

Table 2: Example Rating and Preference Data

  I | rel(i) | index(i) | \widehat{index}(i)
  A |   5    |    1     |    2
  B |   4    |    2     |    1
  C |   3    |    3     |    3
  D |   2    |    4     |    4

For example, consider the sample data in Table 2. For item A, rel(A) = 5, index(A) = 1, and \widehat{index}(A) = 2. The full sample calculation for nDCG is:

    nDCG = \frac{1}{Z} \left( \frac{2^4 - 1}{\log_2 2} + \frac{2^5 - 1}{\log_2 3} + \frac{2^3 - 1}{\log_2 4} + \frac{2^2 - 1}{\log_2 5} \right)

    Z = \frac{2^5 - 1}{\log_2 2} + \frac{2^4 - 1}{\log_2 3} + \frac{2^3 - 1}{\log_2 4} + \frac{2^2 - 1}{\log_2 5}

This gives a value of 0.870 for the example.
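The worked example can be checked with a short Python sketch of Equations 1 and 2 (the function name and dictionary encoding are ours); pred_index maps each item to its 1-based position in the system's ranking:

import math

def ndcg(rel, pred_index):
    gain = lambda i: 2 ** rel[i] - 1
    # Equation 1: gains discounted by the predicted position.
    dcg = sum(gain(i) / math.log2(1 + pred_index[i]) for i in rel)
    # Equation 2: the ideal ordering sorts items by decreasing relevance.
    ideal = {i: r + 1
             for r, i in enumerate(sorted(rel, key=rel.get, reverse=True))}
    z = sum(gain(i) / math.log2(1 + ideal[i]) for i in rel)
    return dcg / z

# Table 2: relevance scores and the predicted order B, A, C, D.
rel = {'A': 5, 'B': 4, 'C': 3, 'D': 2}
pred_index = {'A': 2, 'B': 1, 'C': 3, 'D': 4}
print(f"{ndcg(rel, pred_index):.3f}")  # 0.870, as in the worked example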
nDCG assumes that we know an actual relevance score for each item. It does not directly support the case where an item is preferred to another item but the magnitude of the preference is unknown. When there is a total order on item preferences, we may assign rating scores with equal magnitude between adjacent items and then use nDCG. However, in the absence of a total order such score assignments become impractical, which shows the inapplicability of nDCG.

3.2 AP Correlation
AP (average precision) correlation (τ_ap) was proposed by Yilmaz et al. [9] as a modification of Kendall's τ that penalizes mistakes made on highly relevant items more than on less relevant items. AP correlation finds the precision between two total orders at each index in the list and then takes the average of these values, as defined in Equation 3.

    \tau_{ap} = \frac{2}{N - 1} \sum_{i \in I} \frac{C(i)}{\widehat{index}(i) - 1} - 1    (3)

N is the number of ranked items in the list, and C(i) is the number of items at an index less than \widehat{index}(i) that are correctly ranked according to the ground truth; the top-ranked item is excluded from the sum. Consider the data in Table 2. For item A, C(A) = 0 because A comes after B in the prediction, but A is preferred to B in the ground truth. The sample calculation is:

    \tau_{ap} = \frac{2}{4 - 1} \left( \frac{0}{1} + \frac{2}{2} + \frac{3}{3} \right) - 1 = \frac{1}{3}
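This calculation can be written directly in Python. The following minimal sketch of Equation 3 (the encoding is ours) takes both total orders as lists, most preferred first:

def ap_correlation(truth, pred):
    n = len(pred)
    truth_pos = {item: r for r, item in enumerate(truth)}
    total = 0.0
    for idx in range(1, n):  # skip the top-ranked item
        item, above = pred[idx], pred[:idx]
        # C(i): items the system ranks above this item that the user
        # indeed prefers to it.
        c = sum(1 for w in above if truth_pos[w] < truth_pos[item])
        total += c / idx     # idx equals the predicted index minus 1
    return 2.0 * total / (n - 1) - 1.0

# Table 2: ground truth order A, B, C, D; predicted order B, A, C, D.
print(ap_correlation(['A', 'B', 'C', 'D'], ['B', 'A', 'C', 'D']))  # 1/3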
AP correlation is measured on a scale of -1 to +1, where -1 means the lists are in reverse order and +1 means the lists are the same.

AP correlation assumes that each list, the ground truth and the system's prediction, gives a total order of items, and there is no simple modification of AP correlation to support partial orders. Two challenges must be addressed when thinking about how to evaluate rank accuracy based on incomplete pairwise preferences. First, how does one assign a rank to an item when a total list cannot be constructed? In Equation 3, \widehat{index}(i) is the rank index in the list, but when a total order is unknown, as in the sample data of Table 1, a new method is needed to give a rank index. Second, how does one handle the cases where the user's preference between items is unknown? In that example, we do not know the user's preference between items A and B. How should we evaluate a system that makes a prediction A ≻ B?

4. EXPECTED DISCOUNTED RANK CORRELATION
We propose expected discounted rank correlation (EDRC) to measure the similarity between two sets of pairwise preferences. Given a set of ground truth pairwise preferences from the user, G, and a set of predicted pairwise preferences output by the system, Ĝ, EDRC calculates the expected correlation between the two sets. Note that we may have user preferences on a large set of items, gathered from many different user sessions. However, we only consider preferences relating items currently recommended by the system. That is, the sets of items in G and Ĝ are the same.

Similar to AP correlation and nDCG, EDRC emphasizes preserving the order of the user's most preferred items and enforces a smaller penalty on less preferred items. Different from nDCG, whose ground truth is a set of relevance scores, the ground truth supported by EDRC is a set of pairwise preferences. Different from AP correlation, which requires complete pairwise preferences, EDRC allows incomplete ones.

As discussed in Section 3.2, the presence of incomplete pairwise preferences entails two challenges: the lack of a total order among items in the ground truth, and unknown preferences between two items. We look at the first problem next.

Assigning Rank and Computing Weight. In the spirit of discounted gain, we want to give a different discount (weight) to different items. If we know the rank of an item, R(v), in the ground truth, we may set the weight linearly, logarithmically, or exponentially. The problem, however, is how to compute the rank of an item given incomplete pairwise preferences. From a set of pairwise preferences serving as ground truth, we first derive a topological order among the items. Consider a graph where a vertex represents an item and a directed edge represents a pairwise preference: the item corresponding to the source vertex is preferred to the item corresponding to the target vertex. In the following discussion, we use item and vertex interchangeably.

For example, the ground truth in Table 1 can be represented by the graph in Figure 1, where the fact that item A is preferred to item C is shown by a directed edge from vertex A to vertex C. In this graph, an item is preferred to any item reachable by following a directed path. For example, A is preferred to both C and D in the ground truth of Table 1.

[Figure 1: Topological Order based on Ground Truth (left) and System Prediction (right) in Table 1]

Now we discuss how to leverage the graph to compute item ranks. We initialize the rank of every vertex to 1 and traverse the graph beginning from the start nodes, i.e., the nodes without incoming edges. Whenever we follow an edge from a vertex u to a vertex v, we update v's rank to max(R(v), R(u) + 1). Note that here we consider every user-specified pairwise preference as equally important, so each edge is assigned a unit weight of 1, and we keep the maximum rank of a node over the different paths that can reach it from a start node. The procedure SetRanks(G) is defined in Algorithm 1.

Algorithm 1 Setting Rank Value
SetRanks(G)
1: for all v in V(G) do R(v) ← 1
2: S ← all nodes in G without an incoming edge
3: for all v in S do Visit(v, 0)

Visit(v, rank_in)
1: R(v) ← max(R(v), rank_in + 1)
2: for all w in OUT(v, G) do
3:   Visit(w, R(v))
4: end for

The ranks for the items in the graph of Figure 1 are as follows: R(A) = 1, R(B) = 1, R(C) = 2, R(D) = 3, and R(E) = 2. As we can see, nodes A and B have the same rank, since we do not know the user's preference between the two.
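Algorithm 1 is straightforward to turn into runnable code. Below is a sketch in Python (our own naming; preferences are encoded as (preferred, less_preferred) pairs) that reproduces the ranks listed above. Instead of passing the incoming rank down the recursion, it propagates a rank increase to a node's descendants whenever one occurs, which is equivalent on the preference DAG:

def set_ranks(prefs):
    nodes = {x for pair in prefs for x in pair}
    out = {v: [w for (u, w) in prefs if u == v] for v in nodes}
    rank = {v: 1 for v in nodes}             # every vertex starts at rank 1

    def visit(v):
        for w in out[v]:
            if rank[w] < rank[v] + 1:        # R(w) <- max(R(w), R(v) + 1)
                rank[w] = rank[v] + 1
                visit(w)                     # re-propagate the larger rank

    start = nodes - {w for (_, w) in prefs}  # nodes without incoming edges
    for v in start:
        visit(v)
    return rank

# Ground truth of Table 1.
g = {('A', 'C'), ('A', 'D'), ('A', 'E'), ('C', 'D'), ('B', 'E')}
print(set_ranks(g))  # R(A)=1, R(B)=1, R(C)=2, R(D)=3, R(E)=2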
Then we define the discount term, D(v), which can take various forms depending on what type of discount is desired. For example, for a simple linear discount, D(v) = R(v); for an exponential discount, D(v) = 2^{R(v)}; and for a logarithmic discount, D(v) = log_2(1 + R(v)).

Handling Unknown Preferences between Items. The other problem is computing the score, C(v), of an item in the system's output in the presence of incomplete pairwise preferences in the ground truth. We propose the following formula to define the score of an item, where OUT(v, G+) is the set of vertices that v points to in the transitive closure G+ of G, and W is the set of items that are more preferred than v or have an unknown relationship with v, that is, W = V(G) \ [{v} ∪ OUT(v, G+)]. EP gives the expected score for the relationship between v and each item in W.

    C(v) = \sum_{w \in W} EP(v, w)

Consider a pair of items (v, w) with w ∈ W, and suppose first that w is preferred to v in the ground truth, that is, w ≻ v in G. There are three cases. In the first case, we also have w ≻ v in Ĝ, and EP(v, w) = 1. In the second case, we have v ≻ w in Ĝ; since the system prediction contradicts the ground truth, EP(v, w) = 0. In the third case, the system cannot predict a preference between v and w, so we compute an expected score: by default we may assume a 50% likelihood for v ≻ w and a 50% likelihood for w ≻ v, and let EP(v, w) = 0.5. When the user's preference between v and w is unknown in G, we likewise use the expected score of 0.5 by default. Alternatively, we may obtain a more accurate likelihood estimate based on collaborative filtering: assuming we have a set of equally similar users, if 70% of those users have v ≻ w and 30% have w ≻ v, then the expected score of v ≻ w is 0.7. For example, below are the scores C and EP based on Table 1.

    C(C) = EP(C, A) + EP(C, B) + EP(C, E) = 0 + 0.5 + 0.5 = 1
    C(D) = EP(D, A) + EP(D, B) + EP(D, C) + EP(D, E) = 0.5 + 0.5 + 1 + 0.5 = 2.5
    C(E) = EP(E, A) + EP(E, B) + EP(E, C) + EP(E, D) = 1 + 1 + 0.5 + 0.5 = 3

Putting Things Together. Based on the discussion of how to compute the score of an item, C(v), and the discount (weight) of an item, D(v), we now put these together into the evaluation measure EDRC. We denote by S the set of all vertices in G without an incoming edge.

    EDRC(G, Ĝ) = \frac{2}{Z} \sum_{v \in V(G) \setminus S} \frac{C(v)}{D(v)} - 1

Here, Z is a normalization factor that ensures the value is between -1 and +1, where W_v denotes the set W computed for item v:

    Z = \sum_{v \in V(G) \setminus S} \frac{|W_v|}{D(v)}

Considering the example data in Table 1, we now put together the sample calculation for EDRC using the linear discount D(v) = R(v), for which Z = 3/2 + 4/3 + 4/2 = 29/6:

    EDRC(G, Ĝ) = \frac{2}{29/6} \left( \frac{1}{2} + \frac{2.5}{3} + \frac{3}{2} \right) - 1 = \frac{5}{29}

When both the ground truth and the prediction are complete sets of pairwise preferences and the discount term is D(v) = R(v) - 1, the values of EDRC and AP correlation coincide.
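Finally, to make the whole computation concrete, here is an end-to-end sketch in Python (our own naming and encoding; exact arithmetic via the standard fractions module) that combines the transitive closure, EP, C(v), D(v), and Z as defined above, and reproduces the 5/29 value for the data in Table 1:

from fractions import Fraction

def transitive_closure(prefs):
    # Naive closure: add (a, d) whenever (a, b) and (b, d) are present.
    closure = set(prefs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def edrc(g, g_hat, rank, discount=lambda r: r):
    # g, g_hat: sets of (preferred, less_preferred) pairs; rank: R(v)
    # from Algorithm 1; discount: D as a function of R (linear default).
    gc, pc = transitive_closure(g), transitive_closure(g_hat)
    nodes = {x for pair in g for x in pair}
    sources = nodes - {b for (_, b) in g}

    def ep(v, w):
        if (w, v) not in gc:        # preference unknown in the ground truth
            return Fraction(1, 2)
        if (w, v) in pc:            # prediction agrees: w preferred to v
            return Fraction(1)
        if (v, w) in pc:            # prediction contradicts the ground truth
            return Fraction(0)
        return Fraction(1, 2)       # prediction makes no call

    total = z = Fraction(0)
    for v in nodes - sources:
        w_set = nodes - {v} - {b for (a, b) in gc if a == v}  # W for v
        d = Fraction(discount(rank[v]))
        total += sum(ep(v, w) for w in w_set) / d
        z += Fraction(len(w_set)) / d
    return 2 * total / z - 1

g = {('A', 'C'), ('A', 'D'), ('A', 'E'), ('C', 'D'), ('B', 'E')}
g_hat = {('C', 'A'), ('C', 'B'), ('C', 'D'),
         ('A', 'E'), ('B', 'E'), ('D', 'E')}
rank = {'A': 1, 'B': 1, 'C': 2, 'D': 3, 'E': 2}
print(edrc(g, g_hat, rank))  # 5/29

Using exact fractions keeps the arithmetic transparent, so the 5/29 result of the worked example is verifiable at a glance; passing discount=lambda r: r - 1 gives the variant related to AP correlation noted above.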
5. CONCLUSION
We consider a problem setting where the input from the user and the system's prediction are sets of possibly incomplete pairwise preferences. Based on this setting, we discuss the limitations of two existing evaluation methods, nDCG and AP correlation. We then propose a new rank correlation metric, expected discounted rank correlation (EDRC), that compares two sets of pairwise preferences, one from the ground truth and one from the system prediction. This new measure has wide applications for evaluating recommender systems whose input data and ground truth are obtained from implicit user feedback.

Acknowledgements
This work was sponsored in part by NSF CAREER Award IIS-0845647 and NSF grant IIS-0915438. This work was also supported in part by an IBM faculty award.

6. REFERENCES
[1] L. Baltrunas, T. Makcinskas, and F. Ricci. Group recommendations with rank aggregation and collaborative filtering. In Proceedings of the 4th ACM Conference on Recommender Systems, 2010.
[2] K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422-446, October 2002.
[3] M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81-93, 1938.
[4] R. Kumar and S. Vassilvitskii. Generalized distances between rankings. In Proceedings of the 19th International World Wide Web Conference, 2010.
[5] N. N. Liu and Q. Yang. EigenRank: A ranking-oriented approach to collaborative filtering. In Proceedings of the 31st Annual International ACM SIGIR Conference, 2008.
[6] S.-T. Park and W. Chu. Pairwise preference regression for cold-start recommendation. In Proceedings of the 3rd ACM Conference on Recommender Systems, 2009.
[7] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
[8] C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72-101, 1904.
[9] E. Yilmaz, J. A. Aslam, and S. Robertson. A new rank correlation coefficient for information retrieval. In Proceedings of the 31st Annual International ACM SIGIR Conference, 2008.