Modeling Difficulty in Recommender Systems

Benjamin Kille, Sahin Albayrak
DAI-Lab, Technische Universität Berlin
{kille,sahin}@dai-lab.de

ABSTRACT
Recommender systems have frequently been evaluated with respect to their average performance for all users. However, optimizing recommender systems with regard to such evaluation measures might provide worse results for a subset of users. Defining a difficulty measure allows us to evaluate and optimize recommender systems in a personalized fashion. We introduce an experimental setup to evaluate the eligibility of such a difficulty score. We formulate the hypothesis that, provided a difficulty score, recommender systems can be optimized regarding costs and performance simultaneously.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering, Retrieval models, Search process, Selection process; H.3.4 [Information Technology and Systems Applications]: Decision support

General Terms
Algorithms, Design, Experimentation, Measurement, Human factors

Keywords
difficulty, recommender systems, user modeling, evaluation

1. INTRODUCTION
Evaluating a recommender system's performance represents a non-trivial task. The choice of evaluation measure and methodology depends on various factors. Modeling the recommendation task as a rating prediction problem favors measures such as root mean squared error (RMSE) and mean absolute error (MAE) [7]. In contrast, mean average precision (MAP) and normalized discounted cumulative gain (NDCG) qualify as evaluation measures for an item ranking scenario [12]. In either case, two recommendation algorithms are compared with respect to their average performance on the full set of users. This entails that all users are treated equally. However, we argue that the difficulty of recommending items to users varies. Suppose we consider a recommender system with two users, Alice and Bob. Alice has rated a large number of items. Bob has recently started using the system and has rated only a few items. The system recommends a number of items to both of them, and Alice and Bob rate the recommended items. Suppose we attempt to evaluate two recommender algorithms, denoted R1 and R2, based on those ratings. Assume that R1 predicts Alice's ratings with an error of 0.8 and Bob's ratings with an error of 0.9. On the other hand, we observe R2 to deviate by 1.0 for Alice and 0.8 for Bob, respectively. Averaging the errors, we obtain 0.85 for R1 and 0.9 for R2. Still, R2 predicts Bob's ratings better, even though his preferences exhibit higher sparsity compared to Alice's. Besides the number of ratings, there are further factors that determine how well a recommendation algorithm performs for a given user.
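The following snippet merely restates this averaging effect with the hypothetical error values from the example above; the user and algorithm names are illustrative and not tied to any particular evaluation framework.

```python
# Hypothetical per-user errors (e.g., RMSE) taken from the Alice/Bob example.
errors = {
    "R1": {"Alice": 0.8, "Bob": 0.9},
    "R2": {"Alice": 1.0, "Bob": 0.8},
}

# Macro-averaging over users favors R1 (0.85 vs. 0.90) ...
for algo, per_user in errors.items():
    average = sum(per_user.values()) / len(per_user)
    print(f"{algo}: average error = {average:.2f}")

# ... yet R2 serves the sparse user Bob better than R1 does.
best_for_bob = min(errors, key=lambda algo: errors[algo]["Bob"])
print(f"Lowest error for Bob: {best_for_bob}")
```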
We introduce the notion of difficulty in the context of recommender systems. Each user is assigned a difficulty value reflecting her expected evaluation outcome. Users with high errors or disordered item rankings receive a high difficulty value. Conversely, we assign low difficulty values to users exhibiting low errors and well-ordered item rankings. Recommender systems benefit from those difficulty values twofold. First, optimization can target the difficult users who require such effort, while users with low difficulty values can be provided with recommendations generated by more trivial methods; recommending the most popular items represents such an approach. Second, difficulty values enable the system to estimate how likely a specific user will perceive the recommendations as adequate. Thereby, the system can control interactions with users, e.g. by asking for additional ratings to provide better recommendations for particularly difficult users.

2. RELATED WORK
Adomavicius and Tuzhilin reveal possible extensions to recommender systems [1]. They mention an enhanced understanding of the user as one such extension. Our work attempts to contribute to this by defining a user's difficulty. In this way, the system estimates a user's likely perception of the recommendations. The methods performing best on the well-known Netflix Prize challenge applied ensemble techniques to further improve their rating prediction accuracy [3, 11]. Neither states an explicit model of recommendation difficulty on the user level. However, ensemble techniques involve an implicit presence of such a concept: combining the outcomes of several recommendation algorithms makes sense in case a single recommendation algorithm fails for a subset of users. Such an effect is consequently compensated by including other recommendation algorithms' outcomes. Bellogin presents an approach to predict a recommender system's performance [4]. However, the evaluation is reported on the system level, averaging evaluation measures over the full set of users. Our approach focuses on predicting the user-level difficulty. Koren and Sill introduced a recommendation algorithm that outputs confidence values [8]. Those confidence values could be regarded as difficulty. However, the confidence values are algorithm specific, whereas we require the difficulty score's validity to hold in general. The concept of evaluation considering difficulty has been introduced in other domains. Aslam and Pavlu investigated the estimation of a query's difficulty in the context of information retrieval [2]. Their approach is based on the diversity of the document lists retrieved by several IR systems. Strong agreement with respect to the document ranking indicates a low level of difficulty; in contrast, highly diverse rankings suggest a high level of difficulty. He et al. describe another approach to estimate a query's difficulty [6]. Their method compares the content of retrieved documents and computes a coherence score. They assume that a query yielding highly coherent documents is clear and thus less difficult than a query resulting in minor coherency. Genetic algorithms represent another domain where difficulty has received the research community's attention; there, however, the difficulty is determined by the fitness landscape of the respective problem [5, 6]. Kuncheva and Whitaker investigated diversity in the context of a classification scenario [9]. Concerning a query's difficulty, diversity has been found to be an appropriate measure [2]. We will adapt two diversity measures applied in the classification scenario to determine the difficulty of recommending items to a user.

3. METHOD
We strive to determine the difficulty of the recommendation task for a given user. Formally, we are looking for a map δ : U → [0, 1] that assigns each user u ∈ U a real number between 0 and 1 corresponding to the level of difficulty. For a single recommendation algorithm, the difficulty for a given user could simply be determined by a (normalized) evaluation measure, e.g. RMSE. However, such a difficulty would not be valid for other recommendation algorithms. In addition, recommender systems optimized with respect to the item ranking would likely end up with different difficulty values. We adapt the idea of measuring difficulty in terms of diversity to overcome those issues. We assume the recommender system comprises several recommendation methods, denoted A = {A_1, A_2, ..., A_n}. Each A_n generates rating predictions along with item rankings for a given user u. We choose RMSE to evaluate the rating predictions and NDCG to assess the item rankings, respectively. We measure a user's difficulty by means of the diversity of those rating predictions and item rankings. For that purpose, we adjusted two diversity metrics introduced by Kuncheva and Whitaker for classification ensembles [9]. We alter the pair-wise Q statistic to fit the item ranking scenario. Regarding the rating prediction, we adapt the difficulty measure θ.

3.1 Q statistics
Applied to the classification ensemble setting, the Q statistic is based on a confusion matrix. The confusion matrix confronts two classifiers in a binary manner - correctly classified versus incorrectly classified. We need to adjust this setting to fit the item ranking scenario. Table 1 illustrates the adjusted confusion matrix. We count all correctly ranked as well as all incorrectly ranked items for both recommendation algorithms. Subsequently, the confusion matrix displays the overlap of those item sets. Equation 1 represents the Q statistic derived from the confusion matrix. Note that the Q statistic measures diversity between two recommender algorithms. Hence, we need to average the Q statistic values obtained for all comparisons between the available algorithms. Equation 2 computes the final diversity measure.

                            A_i: rank(k) ≤ rank(l)    A_i: rank(k) > rank(l)
  A_j: rank(k) ≤ rank(l)            N^11                      N^10
  A_j: rank(k) > rank(l)            N^01                      N^00

Table 1: Adjusted confusion matrix, where rank(k) ≤ rank(l) is true.

$$Q_{ij}(u) = \frac{N^{11} N^{00} - N^{01} N^{10}}{N^{11} N^{00} + N^{10} N^{01}} \qquad (1)$$

$$Q_A(u) = \binom{|A|}{2}^{-1} \sum_{A_i, A_j \in A} Q_{ij}(u) \qquad (2)$$
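Below is a minimal sketch of how the adjusted Q statistic could be computed. It assumes that, for each algorithm, the set of held-out item pairs it ranks consistently with the ground-truth order has already been determined; the function names, the data layout, and the handling of a zero denominator are illustrative assumptions rather than details given in the text.

```python
from itertools import combinations

def pairwise_q(correct_i: set, correct_j: set, all_pairs: set) -> float:
    """Q statistic (Eq. 1) for two algorithms A_i, A_j on one user's held-out item pairs."""
    n11 = len(correct_i & correct_j)              # both algorithms rank the pair correctly
    n00 = len(all_pairs - correct_i - correct_j)  # both rank it incorrectly
    n10 = len(correct_j - correct_i)              # only A_j correct (N^10 in Table 1)
    n01 = len(correct_i - correct_j)              # only A_i correct (N^01 in Table 1)
    denominator = n11 * n00 + n10 * n01
    # Degenerate case not discussed in the text; treat it as "no measurable diversity".
    return 0.0 if denominator == 0 else (n11 * n00 - n01 * n10) / denominator

def q_a(correct_by_algo: dict, all_pairs: set) -> float:
    """Average Q statistic over all algorithm pairs (Eq. 2); needs at least two algorithms."""
    algo_pairs = list(combinations(correct_by_algo, 2))
    return sum(
        pairwise_q(correct_by_algo[a], correct_by_algo[b], all_pairs)
        for a, b in algo_pairs
    ) / len(algo_pairs)
```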
3.2 Difficulty measure θ
Kuncheva and Whitaker introduce the difficulty measure θ as a non-pairwise measure [9]. θ allows us to consider several recommendation algorithms simultaneously. We iterate over the set of withheld items for the given user. At each step, we compute the RMSE for all recommendation algorithms at hand. Thereby, we observe the variance between all recommender algorithms' RMSE values. The average variance displays how diversely the set of recommendation algorithms perceives the user. Equation 3 calculates θ. I denotes the set of items in the target user's test set. The function σ(i) computes the variance of the recommendation algorithms' RMSE values for the given item.

$$\theta(u) = \frac{1}{|I|} \sum_{i \in I} \sigma(i) \qquad (3)$$

3.3 Difficulty measure δ
We attempt to formulate a robust difficulty measure for recommender systems. Therefore, we propose a combination of the Q statistic and θ. As a result, both RMSE and NDCG are considered. The more recommendation algorithms we include, the more robust the final difficulty value will be. Equation 4 displays a linear combination of both measures. The parameter λ ∈ [0, 1] controls the weighting. λ = 1 allows us to focus entirely on the ranking task; the absence of rating values might require such a setting.

$$\delta(u \mid A) = \lambda \cdot Q_A(u) + (1 - \lambda) \cdot \theta(u) \qquad (4)$$
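The following sketch mirrors Equations 3 and 4 under the assumption that each algorithm's per-item prediction errors (the per-item analogue of RMSE) are already available for the target user; the data layout and function names are illustrative, not prescribed above.

```python
from statistics import pvariance

def theta(errors_by_algo: dict) -> float:
    """Difficulty measure theta (Eq. 3): average over the user's test items of the
    variance of the algorithms' prediction errors on each item.

    errors_by_algo maps an algorithm name to a dict of {item_id: prediction_error}.
    Assumes a non-empty test set shared by all algorithms.
    """
    items = set.intersection(*(set(errors) for errors in errors_by_algo.values()))
    return sum(
        pvariance([errors_by_algo[algo][item] for algo in errors_by_algo])
        for item in items
    ) / len(items)

def delta(q_a_value: float, theta_value: float, lam: float = 0.5) -> float:
    """Combined difficulty score delta (Eq. 4); lam = 1 focuses on the ranking task only."""
    return lam * q_a_value + (1.0 - lam) * theta_value
```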
4. EXPERIMENTAL SETTING
We will evaluate the proposed difficulty measure δ; in the following, we outline the intended experimental protocol. Several data sets, such as Movielens 10M, Netflix, and Last.fm, provide the required data. We will implement item-based and user-based neighborhood recommenders and matrix factorization recommenders, along with slope-one recommenders. None of those recommendation methods requires data other than user preferences. As a first step, we split each data set into three partitions. The first partition will be used for training, and the second partition to assign each user a difficulty score according to Equation 4. Finally, the third partition will be used as the evaluation data set. The subset of users with comparably low difficulty scores will receive recommendations based on a non-optimized recommendation method, such as the most popular recommender. The subset of users with medium difficulty scores will receive recommendations generated by slightly optimized recommender algorithms. Finally, the particularly hard users will receive recommendations generated by highly optimized recommenders. We will compare both the required effort and the recommendation quality against approaches that treat all users in the same way. Such approaches include optimizing recommenders to perform well on average over the full set of users, as well as serving all users with trivial recommenders, for instance most popular recommendations. Our hypothesis is that the difficulty score will allow us to achieve lower costs along with comparably high recommendation quality by selecting an appropriate recommendation algorithm for each user. In addition, we will observe which user characteristics determine a user's difficulty.
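To make the protocol concrete, here is a rough sketch of the intended split-and-route procedure; the 60/20/20 partitioning ratio, the difficulty thresholds, and the tier labels are assumptions for illustration and are not specified above.

```python
import random

def three_way_split(interactions: list, seed: int = 42):
    """Split a user-preference data set into training, difficulty-scoring,
    and evaluation partitions (ratios are an assumed 60/20/20)."""
    rng = random.Random(seed)
    shuffled = interactions[:]
    rng.shuffle(shuffled)
    first_cut, second_cut = int(0.6 * len(shuffled)), int(0.8 * len(shuffled))
    return shuffled[:first_cut], shuffled[first_cut:second_cut], shuffled[second_cut:]

def route_user(difficulty: float, low: float = 0.33, high: float = 0.66) -> str:
    """Map a user's difficulty score (Eq. 4) to a recommender tier; thresholds are assumed."""
    if difficulty < low:
        return "most-popular recommender (non-optimized)"
    if difficulty < high:
        return "slightly optimized recommender"
    return "highly optimized recommender"
```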
5. CONCLUSION AND FUTURE WORK
We introduced the notion of difficulty in the context of recommending items to users. Measuring that task's difficulty on the user level allows a more personalized optimization of recommender systems. Users with comparably low difficulty scores receive adequate recommendations without much effort. On the other hand, users with a comparably high difficulty score may be asked to provide additional data to improve the system's perception of them. A combination of the Q statistic and the difficulty measure θ - both adjusted to fit the recommendation scenario - allows us to measure the difficulty for a given user. The calculation requires a sufficiently large data set containing user preferences on items, along with a set of implemented recommendation algorithms. In Section 4, we introduced an experimental setting that we will use to evaluate the presented methodology. In addition, we intend to apply pattern recognition techniques to find correlations between the difficulty score and user characteristics. For instance, the number of ratings is known to correlate with the difficulty for new users ("cold-start problem") and for users with a large number of rated items ("power users") [10].

6. ACKNOWLEDGEMENTS
This work is funded by the DFG (Deutsche Forschungsgemeinschaft) in the scope of the LSR project.

7. REFERENCES
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734-749, 2005.
[2] J. A. Aslam and V. Pavlu. Query hardness estimation using Jensen-Shannon divergence among multiple scoring functions. In Proceedings of the 29th European Conference on IR Research, ECIR '07, pages 198-209, 2007.
[3] R. M. Bell and Y. Koren. Lessons from the Netflix Prize challenge. SIGKDD Explorations Newsletter, 9(2):75-79, 2007.
[4] A. Bellogin. Predicting performance in recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, pages 371-374. ACM, 2011.
[5] M. Hauschild and M. Pelikan. Advanced neighborhoods and problem difficulty measures. In Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, GECCO '11, pages 625-632, 2011.
[6] J. He, M. Larson, and M. de Rijke. Using coherence-based measures to predict query difficulty. In Proceedings of the 30th European Conference on IR Research, ECIR '08, pages 689-694, 2008.
[7] J. Herlocker, J. Konstan, L. Terveen, and J. Riedl. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1):5-53, 2004.
[8] Y. Koren and J. Sill. OrdRec: An ordinal model for predicting personalized item rating distributions. In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, pages 117-124, New York, NY, USA, 2011. ACM.
[9] L. Kuncheva and C. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51:181-207, 2003.
[10] A. Said, B. Jain, and S. Albayrak. Analyzing weighting schemes in collaborative filtering: Cold start, post cold start and power users. In 27th ACM Symposium on Applied Computing (SAC '12), New York, NY, USA, 2012. ACM.
[11] G. Takacs, I. Pilaszy, B. Nemeth, and D. Tikk. Major components of the Gravity recommendation system. SIGKDD Explorations, 9(2), 2007.
[12] S. Vargas and P. Castells. Rank and relevance in novelty and diversity metrics for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys '11, pages 109-116, New York, NY, USA, 2011. ACM.