Implicit Feedback Recommendation via Implicit-to-Explicit Ordinal Logistic Regression Mapping Denis Parra Alexandros Karatzoglou Xavier Amatriain University of Pittsburgh Telefonica Research Telefonica Research Idil Yavuz University of Pittsburgh ABSTRACT obtained from direct usage [4]. However, it is not clear that One common dichotomy faced in recommender systems is we can trust a simple one-to-one mapping between usage that explicit user feedback -in the form of ratings, tags, or and preference [5]. On the other hand, explicit feedback is user-provided personal information- is scarce, yet the most obtained by directly querying the user, who is usually pre- popular source of information in most state-of-the-art rec- sented with an integer scale where to quantify how much ommendation algorithms, and on the other side, implicit she likes the items. In principle, explicit feedback is a more user feedback - such as numbers of clicks, playcounts, or robust way to extract preference, since the user is reporting web pages visited in a session- is more frequently available, directly on this variable, removing the need of an indirect but there are fewer methods well studied to provide recom- inference. However, it is also known that this kind of feed- mendations based on this kind of information. back is affected by user inconsistencies known as natural Given the current scenario, and under a situation where noise [6]. Besides, the fact that we are introducing a user just implicit user feedback is available, it would be more ap- overhead, makes it difficult to have a complete view on the propriate either to provide recommendations using the im- user preferences [7]. plicit data and implicit-fedback-based methods, or to map None of the two existing strategies for capturing user feed- implicit user feedback to explicit feedback and then use an back clearly outperforms the other. Ideally, we would like explicit-based algorithm? On this paper, we analyze this to use implicit feedback, minimizing the impact on the user, problem in the context of music recommendation by means but having a robust and proven way to map this informa- of a well-known implicit feedback recommendation method tion to the actual user preference. In a previous work [8], we described in Hu et al. [1] by comparing the use of raw play- tested several regression models and we were able to map im- counts with the use of explicit data - user ratings - obtained plicit user feedback to explicit ratings. Our results were sat- by mapping implicit to explicit feedback with a novel mixed- isfactory, but we did not compare to state-of-the-art meth- effects logistic regression model. ods that make use of raw implicit information to provide rec- ommendations. In this paper we propose an ordinal logistic regression model that by using a few ratings is able to infer 1. INTRODUCTION a generic parametric mapping from implicit to explicit data. Recommender Systems (RS) [2] have proved their busi- Our mapping model integrates usual implicit user feedback ness value and impact on many application scenarios that (playcounts) with contextual information (how recently the go from recommending movie rentals to new contacts on a user listened to an album). We compare our approach to a social network. One of the main features of these systems state-of-the art algorithm for implicit feedback recommen- is that they rely on understanding user preferences in or- dations and discuss possible extensions. der to estimate the utility of items and decide whether they should be recommended. These user preferences are infered by taking into account direct feedback from the user, either 2. PRELIMINARIES AND RELATED WORK in explicit or implicit form. Implicit feedback is much more readily available in prac- We obtain implicit feedback [3] by measuring the interac- tical scenarios for recommender systems. However, most of tion of the user with the different items. We can use signals the research literature focuses on the use of explicit feedback such as the number of playcounts in a song, or the clicks on input since this is considered the ground truth on the user webpages as implicit feedback. This kind of data is obtained preferences and allows to reduce the recommender problem without incurring into any overhead on the user, since it is to one of predicting ratings. In one of the few papers addressing the implicit feedback recommendation problem [1], Hu et al. deal with the implicit feedback recommendation problem by binarizing it and in- troducing the idea of confidence. In our previous work [8], however, we presented an analysis of implicit and explicit feedback that challenged most of the assumptions stated in [1]. In particular: (1) There is no negative feed- back. While it is true that you cannot interpret “no implicit CARS-2011, October 23, 2011, Chicago, Illinois, USA. feedback“ as “negative feedback“ – and this is true also for Copyright is held by the author/owner(s). explicit feedback–, implicit data can include negative feed- back. You can assume that low feedback is negative feed- post-filtering [16]. Once the implicit-to-explicit mapping is back as long as the granularity of the items is comparable, performed, we can use the inferred ratings in methods for and there is enough variability. (2) Implicit feedback is explicit or implicit data. We can then compare the perfor- noisy. Implicit feedback is noisy but, as we showed in pre- mance of these models to the one by Hu et al. in several vious work [6], so is explicit feedback. (3) Preference vs. experiments. Confidence. As we showed in our work [8], the numerical value of implicit feedback can indeed be directly mapped 3. REGRESSION MODELS to preference, given the appropriate mapping. (4) Evalu- ation of implicit feedback. On the other hand, we do 3.1 Linear Regression agree that there is no appropriate evaluation approaches for In [8] we introduce a linear regression model to predict implicit feedback and this is in fact one of the motivations explicit preference of users on music albums in the form of of our work: if we find an appropriate way to map implicit ratings based on implicit user behavior variables - (1) Im- to explicit feedback we can ensure an evaluation that is as plicit Feedback (if ): playcount for a user on a given item; good as the one we have in the explicit case. (2) Global Popularity (gp): global playcount for all users Our hypothesis that there is some observable correlation on a given item; (3) Recentness (re) : time elapsed since between implicit and explicit feedback can be tracked in the user played a given item. In that article, we compare dif- literature. Already in 1994, Morita and Shinoda [9] proved ferent linear regression models based on the aforementioned that there was a correlation between reading time on on- variables and we find that the variables implicit feedback line news and self-reported preference. Konstan et al. [10] and recentness explain the largest part the variability of the did a similar experiment with the larger user base of the ratings, while global popularity explained a very small por- Grouplens project and again found this to be true. Oard tion. This result suggested us that the two former variables and Kim [11] performed experiments using not only reading would be better predictors of the user preference, and we time but also other actions like printing an article to find a supported these assumption by performing a 10-fold cross positive correlation between implicit feedback and ratings. validation experiment using the data of our online survey on Koh et al. did a thorough study of rating behavior in two music preference as a ground truth. The RMSE values were popular websites [12]. They hypothesize that the overall consistent with the previously described regression analysis. popularity or average rating of an item will influence raters and they conclude that while there is an effect, this depends 3.1.1 Limitations and shortcomings of Linear Regres- on the cultural background of the raters. sion Lee et al. [13] implement a recommender system based Although the linear regression gives good results, there are on implicit feedback by constructing “pseudo-ratings” us- some considerations that must be observed to generalize this ing temporal information. In this work, the authors intro- model to other domains and to make it able to be compared duce the idea that recent implicit feedback should contribute with other approaches. First, depending on the application more positively towards inferring the rating. The authors we may want the predicted values to fall in the range from also use the idea of distinguishing three temporal bins: old, 1 to 5, but using linear regression we cannot ensure it. Sec- middle, and recent. ond, as in most of recommender systems research, our main Two recent works approach the issue of implicit feedback evaluation metric is RMSE. When using this metric, we are in the music domain. Jawasher et. al analyze the charac- assuming that ratings form an interval scale, i.e. the dis- teristics of user implicit and explicit feedback in the context tance between any two consecutive values in the rating scale of last.fm music service [14]. However, their results are not is the same. However, in a previous study [6], we have shown conclusive due to limitations in the dataset since they only that users have a larger probabilty to be more inconsistent used explicit feedback available in the last.fm profiles, which with some ratings numbers than with others, what give us is limitted to the love/ban binary categories. This data is the clue that users do not see the rating scale as equally very sparse and, as the authors report, almost non-existant spaced. Hence, we should consider the ratings as an ordi- for some users or artists. On the other hand, Kurdomova nal variable rather than an linear or interval one. This also et. al use a Bayesian approach to learn a classifier on mul- implies that RMSE is not a good measure alone to predict tiple implicit feedback variables [15]. Using these features, user preference, it should be combined, and in some cases the authors are able to classify liked and disliked items with replaced, with other measures coming from Information Re- an accuracy of 0.75, uncovering the potential of mapping trieval such as precision, recall, or nDCG. implicit feedback directly to preferences. Given that users present individual variability in their rat- In our previous work [8], we showed that it was possible ings, a good extension of our model should include the user to create a simple parametric model for implicit feedback as a random factor. Additionally, given that ratings are by using linear regression on some available explicit ratings. actually an ordinal variable, as explained in the previous However, as we will explain, in the context of user ratings, paragraph, and the fact that are not normally distributed, it may be more appropriate to use a mixed-effects ordinal logistic regression is a proper alternative to our linear re- logistic regression model. In this context, the main contribu- gression model. Combining both considerations, our next tion of our present work is to present an ordinal logistic re- model for implicit-to-explicit behavior mapping model will gression model that allows to map implicit data into explicit be a mixed-effects logistic regression. ratings for the task of recommendation. We make our model context-aware with respect to how recently a user listened 3.2 Mixed-effects Ordinal Logistic Regression to an album by contextual modeling, i.e., using the contex- The multinomial logistic regression is the natural model tual information directly in the modelling technique, unlike for an ordinal scale variable (rating, that ranges from 1 to data-driven approaches such as contextual pre-filtering or 5) and a mixed-effects model will help us to reduce the vari- Effect Estimate SE DF t Pr > |t| intercept 1 −1.2740 0.2808 112 −4.54 <.0001 intercept 2 0.3791 0.2784 112 1.36 0.1759 intercept 3 2.0898 0.2792 112 7.49 <.0001 intercept 4 3.7355 0.2808 112 13.30 <.0001 gp −0.01589 0.05598 10000 −0.28 0.7766 if −0.5894 0.08094 10000 −7.28 <.0001 re −0.04137 0.05395 10000 −0.77 0.4432 gp*if −0.06955 0.02956 10000 −2.35 0.0187 if*re −0.1331 0.02782 10000 −4.78 <.0001 concerts −0.1912 0.07825 10000 −2.44 0.0145 Table 1: Details of the mixed-effects multinomial regression model with 4 fixed effects ability due to differences in rating among the users. Our list of albums in the user’s playlist so that users responded multinomial logistic regression, that uses cummulative logit to a personalized survey. Details of this study, such as the as link function, can be represented as: strategy to sample the items that were rated by users and the results of user demographics and user consumption, can logit(P (rui ≤ k)) = αk + Xβ + gu (1) be found in our previous article [8]. where k = {1, 2, 3, 4}, rui is the rating that user u gives to item i, P (rui ≤ k) is the probability that the rating rui is 4.1.2 Implicit Music Consumption Feedback less or equal than k, αk is the intercept for the cumulative We call Implicit Music Consumption Feedback to our Dataset2 probability that rating is less than or equal to k, X is a vector since, unlike Dataset1 that has demographic data of each with the actual values of the fixed factors (if, re and gp), β is user, it only has information about implicit behavior of the iid the vector of coefficients of the fixed factors, gu ∼ N (0, σg2 ) users: playcount of albums per each user, how recently each is the random effect of the users, and album was listened to for the last time, and the total num- ber of listeners of each album in the whole last.fm website. p The statistics of this dataset are described in Table 2. logit(p) = log( ) (2) 1−p To obtain the predicted rating of a user u on an item i, 4.2 Regression Model Selection we calculate the expected value of the rating as To select the fixed effects that would be part of our model we conducted a forward selection on the set of all the main X 5 effects and their two-way interactions. The main effects con- E[rui ] = k · P (rui = k) (3) sidered were if , re, gp (as described in section 3.1) plus ten k=1 demographic and consumption variables: gender, age, hours where of music per week, hours of internet per week, buying phys-   P (rui ≤ k) , k = 1 ical records, buying online records, interaction style (prefer- P (rui = k) = P (rui ≤ k) − P (rui ≤ k − 1) , 1 < k < 5 ence on listening to tracks or albums), number of concerts  1 − P (rui ≤ k − 1) , k = 5 per year, interest on reading specialized music blogs or mag- (4) azines, and familiarity rating music online. We have to pick two models finally because of the nature of our two datasets. In the smallest one (dataset1) we have all the variables ob- 4. EXPERIMENTAL SETUP tained by a user study, but in the second dataset (dataset2) we just have implicit information (playcounts per user, how 4.1 Data sets recently the user listened to each album, and the total num- We use two datasets in this study. The first one was col- ber of listeners of an album in the whole dataset) that can lected by an online user study among users of the last.fm mu- be reduced to if , re and gp. sic service between September and October of 2010, contain- After conducting the process of forward selection, the model ing implicit and explicit information, and also demographic obtained for dataset1 considers four fixed effects (if, re, gp and consumption data. The second one was collected using and concerts per year) and the random effect of the user. the last.fm API during May of 2011, and contains only im- The details of the model are described in Table 1. Although plicit information. The characteristics of both datasets are the main effects of global popularity (gp) and recentness (re) described in Table 2 . are not significant, we keep them in the model because their interaction with implicit feedback (if ) is significant [17]. 4.1.1 Generating Explicit Fedback For dataset2, we consider in the model if , re, and gp as We conducted an online user study among users of the fixed effects plus the random effect of the user. For the sake last.fm music service. The goal of the study was to gather of space we do not show the details of this model, but the explicit feedback on music albums to compare to the user im- coefficient and significant values are similar to those shown plicit feedback we obtained by directly crawling the last.fm in Table 1 excepting that the factor number of concerts is page related to the user taking the survey. Explicit feed- not considered in the model. As in the previous model, we back was obtained by asking users to rate albums on a 1 keep in the model gp and re although they are not significant to 5 star scale. The items to rate were obtained from the due to their interaction with if . Under this model, is also Dataset1 (Implicit Explicit) Dataset2 (Implicit) users 114 2549 albums 6037 6037 entries 10122 111815 density 1.47% 0.73% avg albums/user 88.79 43.87 avg user/album 1.71 18.52 Table 2: Details of the datasets MAP (D1) nDCG(D1) MAP(D2) nDCG(D2) HK 0.02315 0.14831 0.1014 0.2718 HKlog 0.02742 0.15447 0.1234 0.2954 logit3 0.02636 0.15319 0.1223 0.2944 logit4 0.02601 0.15268 N/A N/A popularity 0.48331 0.54378 0.0178 0.1367 Table 3: Results of MAP and nDCG after 5-fold Cross validation on dataset 1 (D1) and dataset 2 (D2) not significant the intercept for rating equal to 2, which tell where the Frobenius norm of the factor matrices is used us that this intercept is not significantly different than 0, for regularization. This minimization problem is then solved and we may dismiss it from the model. in linear time using Alternate Least Squares and utilizing a trick to avoid direct optimization over the 0 entries of the 4.3 Comparing the different approaches matrix. After we have done the implicit-to-explicit mapping, we are in condition to compare the use of impplicit data with 4.3.1 Error Measures inferred explicit data. In this article, we compare four ap- proaches using dataset 1 and three aproaches using dataset RMSE [18] is probably the most common measure to eval- 2. The methods we compare, as identified in the first column uate the performance of recommender systems and we used of Table 3, are: it to evaluate and compare our linear regression approaches • HK : the implicit feedback method introduced in Hu et al. in [8]. However, when there are no ratings to assess the [1] which uses raw playcounts, performance of the algorithms we can not use metrics like • HKlog: a variation of the HK method, also introduced in RMSE or MAE. Hence, we opt for using Mean Average Pre- [1], that makes a log-transformation of the playcounts, cision (MAP) [19] and normalized Discounted Cummulative • logit3 : the HK method, where the input values are the Gain (nDCG) [20]. The former gives us an overall sense of ratings inferred by logistic regression using 3 fixed factors how well we identify relevant items to recommend from a (if, gp, and re) set of retrieved recommendations, and the latter how well • logit4 : similar to logit3 but adding the factor number of we rank them in a list. concerts in the logistic regression model to infer the ratings. We have this information available just for dataset1. 5. RESULTS Description of the HK method. For the implicit In order to evaluate and compare the methods, we split feedback modeling we use the Matrix Factorization method each dataset into 5 groups in order to perform a 5-fold cross developed in [1]. In this Matrix Factorization method a validation. The result of each run is a list of recommended weighted least squares error loss function is minimized. To items (albums) for each user in the test set, sorted by the this end user-item interactions pij are signaled with a 1 and preference that the user would have for that item. We cal- missing interactions are marked with a 0. The counts of culate MAP and nDCG for each list recommended to a user-item interactions (e.g. playcounts Yij ) are translated user, judging an item as relevant whether it was consumed into a confidence measure wij , which in the case of the HK (played) at least once by the user. Results can be seen in method correspond to pij + αYij , and in the case of the Table 3. HKlog method a simple log transform is used where: In the case of dataset 1, the best results of MAP and  nDCG are obtained by recommending the most popular αlog(1 + Yijk ) Yijk > 0 items. This result is somewhat expected due to the spar- wij = (5) sity of the dataset that affects the methods based on matrix 1 Yijk = 0 factorization. As shown in Table 2, each album was rated in This ”confidence” is then used as a weight in the loss func- average by just 1.71 users. This situation is not repeated in tion and the objective function then becomes dataset 2, where the average number of users per album is 18.52, and then the popularity method performs the worst. X n X m We highlight two results on these initial experiments. The min [wij (pij − hUi∗ Mj∗ i)2 (6) first one is that the log transformation of raw playcounts U,M,C i j makes HKlog improve clearly over HK on both MAP and λ λ nDCG measures. The second result we higlight is that logit3 + ||Ui∗ ||2 + ||Mj∗ ||2 ] n m and logit4 perform better than HK and there is not a big difference in performance with HKlog, leading us investigate Conference on User Modeling, Adaptation, and further to confirm this difference. Personalization, 2009. [7] G. Jawaheer, M. Szomszor, and P. Kostkova. 6. CONCLUSIONS AND FUTURE WORK Characterisation of explicit feedback in an online In this paper, we continue the work that we started in [8] music recommendation service. In Proceedings of the to create a model that allows us to map implicit to explicit fourth ACM conference on Recommender systems, user behavior. Using MAP and nDCG metrics, we show RecSys ’10, pages 317–320, 2010. that our method is comparable to state of the art methods [8] D. Parra and X. Amatriain. Walk the talk: Analyzing that provides recommendations making use of implicit user the relation between implicit and explicit feedback for feedback. preference elicitation. In Proc. of the 2011 Conference The results that we have obtained, part of which we show on User Modeling, Adaptation, and Personalization, on this paper, give us some insights but they mainly open 2011. research questions that we need to analyze further. We have [9] M. Morita and Y. Shinoda. Information filtering based confirmed in our dataset the benefits of applying a log trans- on user behavior analysis and best match text formation to the raw user feedback in the Hu et al. model, retrieval. In SIGIR ’94: Proceedings of the 17th annual showing consistently better results than the unmodified ver- international ACM SIGIR conference, pages 272–281, sion. New York, NY, USA, 1994. Springer-Verlag New In terms of the questions we need to further analyze, up York, Inc. to this point, we have considered the factors implicit feed- [10] Joseph A. Konstan, Bradley N. Miller, David Maltz, back and global popularity in our logistic regression models Jonathan L. Herlocker, Lee R. Gordon, and John as ordinal variables. We coded these variables on this way Riedl. Grouplens: applying collaborative filtering to to make sure that we were doing an appropiate diverse sam- usenet news. Commun. ACM, 40(3):77–87, 1997. pling when creating the user survey described in [8]. How- [11] D. Oard and J. Kim. Modeling information content ever, there is no constraint to rather use the raw playcounts using observable behavior. In Proc. of the ASIST for both factors aforementioned, and we think that this mod- Annual Meeting, pages p481–88, 2001. ification can benefit the results of our implicit-to-explicit lo- [12] N.S. Koh, N. Hu, and E. K. Clemons. Do online gistic regression model. reviews reflect a product’s true perceived quality? - an On the experiments run on this study, since we are not investigation of online movie reviews across cultures. predicting user ratings but rather user preference, metrics Electronic Commerce Research and Applications, 2010. such as RMSE or MAE can not be used to compare the [13] T. Lee, Y. Park, and Y. Park. A time-based approach methods so we opt for IR metrics such as MAP and nDCG, to effective recommender systems using implicit which rely on how we define relevancy. We wonder if our def- feedback. Expert Syst. Appl., 34(4):3055–3062, 2008. inition of relevance might bias our results and conclusions. [14] Gawesh Jawaheer, Martin Szomszor, and Patty As we have stated it before, we think that low feedback Kostkova. Comparison of implicit and explicit might be, in fact, negative feedback. For this reason, we are feedback from an online music recommendation currently testing different user activity (implicit feedback) service. In Proceedings of the 1st International thresholds to define relevancy in order to analyze how that Workshop on Information Heterogeneity and Fusion in influences the evaluation of the different recommendation Recommender Systems, 2010. approaches. [15] S. Kordumova, I. Kostadinovska, M. Barbieri, V. Pronk, and J. Korst. Personalized implicit learning 7. REFERENCES in a music recommender system. In UMAP 2010, 2010. [1] Y. Hu, Y. Koren, and C. Volinsky. Collaborative [16] Gediminas Adomavicius and Alexander Tuzhilin. filtering for implicit feedback datasets. In Proceedings Context-aware recommender systems. In Francesco of ICDM 2008, 2008. Ricci, Lior Rokach, Bracha Shapira, and Paul B. [2] F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Kantor, editors, Recommender Systems Handbook, editors. Recommender Systems Handbook. Springer, pages 217–253. Springer US, 2011. 2011. [17] J. Neter, M. H. Kutner, C. J. Nachtsheim, and [3] Douglas Oard and Jinmook Kim. Implicit feedback for W. Wasserman. Applied Linear Statistical Models. recommender systems. In in Proceedings of the AAAI Irwin, Chicago, 1996. Workshop on Recommender Systems, pages 81–83, [18] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. 1998. Terveen, and John T. Riedl. Evaluating collaborative [4] G. Potter. Putting the collaborator back into filtering recommender systems. ACM Trans. Inf. Syst., collaborative filtering. In 2nd KDD Workshop on 22(1):5–53, 2004. Large-Scale Recommender Systems and the Netflix [19] Christopher D. Manning, Prabhakar Raghavan, and Prize Competition, 2008. Hinrich Schtze. Introduction to Information Retrieval. [5] D. M. Nichols. Implicit rating and filtering. In In Cambridge University Press, New York, NY, USA, Proceedings of the Fifth DELOS Workshop on 2008. Filtering and Collaborative Filtering, pages 31–36, [20] Kalervo Järvelin and Jaana Kekäläinen. Cumulated 1997. gain-based evaluation of ir techniques. ACM Trans. [6] X. Amatriain, J.M. Pujol, and N. Oliver. I like it... i Inf. Syst., 20:422–446, October 2002. like it not: Evaluating user ratings noise in recommender systems. In Proc. of the 2009