<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Implicit Feedback Recommendation via Implicit-to-Explicit Ordinal Logistic Regression Mapping</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Denis Parra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandros Karatzoglou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xavier Amatriain</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Idil Yavuz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Pittsburgh</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Telefonica Research</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <volume>23</volume>
      <issue>2011</issue>
      <abstract>
        <p>One common dichotomy faced in recommender systems is that explicit user feedback (in the form of ratings, tags, or user-provided personal information) is scarce, yet it is the most popular source of information in most state-of-the-art recommendation algorithms, while implicit user feedback (such as numbers of clicks, playcounts, or web pages visited in a session) is more frequently available, but fewer well-studied methods exist to provide recommendations based on this kind of information. Given this scenario, in a situation where only implicit user feedback is available, would it be more appropriate to provide recommendations using the implicit data and implicit-feedback-based methods, or to map implicit user feedback to explicit feedback and then use an explicit-based algorithm? In this paper, we analyze this problem in the context of music recommendation by means of a well-known implicit feedback recommendation method described in Hu et al. [1], comparing the use of raw playcounts with the use of explicit data (user ratings) obtained by mapping implicit to explicit feedback with a novel mixed-effects ordinal logistic regression model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Recommender Systems (RS) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have proved their
business value and impact in many application scenarios,
from recommending movie rentals to suggesting new contacts on a
social network. One of the main features of these systems
is that they rely on understanding user preferences in
order to estimate the utility of items and decide whether they
should be recommended. These user preferences are inferred
by taking into account direct feedback from the user, either
in explicit or implicit form.
      </p>
      <p>
        We obtain implicit feedback [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] by measuring the
interaction of the user with the different items. We can use signals
such as the number of playcounts of a song, or the clicks on
webpages, as implicit feedback. This kind of data is obtained
without incurring any overhead on the user, since it is
obtained from direct usage [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, it is not clear that
we can trust a simple one-to-one mapping between usage
and preference [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. On the other hand, explicit feedback is
obtained by directly querying the user, who is usually
presented with an integer scale on which to quantify how much
she likes the items. In principle, explicit feedback is a more
robust way to extract preference, since the user reports
directly on this variable, removing the need for an indirect
inference. However, it is also known that this kind of
feedback is affected by user inconsistencies known as natural
noise [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Besides, the fact that we are introducing a user
overhead makes it difficult to have a complete view of the
user preferences [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Neither of the two existing strategies for capturing user
feedback clearly outperforms the other. Ideally, we would like
to use implicit feedback, minimizing the impact on the user,
but with a robust and proven way to map this
information to the actual user preference. In previous work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we
tested several regression models and were able to map
implicit user feedback to explicit ratings. Our results were
satisfactory, but we did not compare to state-of-the-art
methods that make use of raw implicit information to provide
recommendations. In this paper we propose an ordinal logistic
regression model that, by using a few ratings, is able to infer
a generic parametric mapping from implicit to explicit data.
Our mapping model integrates usual implicit user feedback
(playcounts) with contextual information (how recently the
user listened to an album). We compare our approach to a
state-of-the-art algorithm for implicit feedback
recommendations and discuss possible extensions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. PRELIMINARIES AND RELATED WORK</title>
      <p>Implicit feedback is much more readily available in
practical scenarios for recommender systems. However, most of
the research literature focuses on the use of explicit feedback
input, since this is considered the ground truth on user
preferences and allows the recommendation problem to be reduced
to one of predicting ratings.</p>
      <p>
        In one of the few papers addressing the implicit feedback
recommendation problem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Hu et al. deal with it by binarizing the feedback and
introducing the idea of confidence. In our previous work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],
however, we presented an analysis of implicit and explicit
feedback that challenged most of the assumptions stated
in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In particular: (1) There is no negative
feedback. While it is true that you cannot interpret "no implicit
feedback" as "negative feedback" (and this is true also for
explicit feedback), implicit data can include negative
feedback. You can assume that low feedback is negative
feedback as long as the granularity of the items is comparable
and there is enough variability. (2) Implicit feedback is
noisy. Implicit feedback is noisy but, as we showed in
previous work [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], so is explicit feedback. (3) Preference vs.
confidence. As we showed in our work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the numerical
value of implicit feedback can indeed be directly mapped
to preference, given the appropriate mapping. (4)
Evaluation of implicit feedback. On the other hand, we do
agree that there are no appropriate evaluation approaches for
implicit feedback, and this is in fact one of the motivations
of our work: if we find an appropriate way to map implicit
to explicit feedback, we can ensure an evaluation that is as
good as the one we have in the explicit case.
      </p>
      <p>
        Our hypothesis that there is some observable correlation
between implicit and explicit feedback can be traced in the
literature. Already in 1994, Morita and Shinoda [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proved
that there was a correlation between reading time on
online news and self-reported preference. Konstan et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
did a similar experiment with the larger user base of the
GroupLens project and again found this to be true. Oard
and Kim [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] performed experiments using not only reading
time but also other actions, like printing an article, to find a
positive correlation between implicit feedback and ratings.
Koh et al. did a thorough study of rating behavior on two
popular websites [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. They hypothesize that the overall
popularity or average rating of an item will influence raters,
and they conclude that while there is an effect, it depends
on the cultural background of the raters.
      </p>
      <p>
        Lee et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] implement a recommender system based
on implicit feedback by constructing "pseudo-ratings"
using temporal information. In this work, the authors
introduce the idea that recent implicit feedback should contribute
more positively towards inferring the rating. The authors
also use the idea of distinguishing three temporal bins: old,
middle, and recent.
      </p>
      <p>
        Two recent works approach the issue of implicit feedback
in the music domain. Jawaheer et al. analyze the
characteristics of user implicit and explicit feedback in the context
of the last.fm music service [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. However, their results are not
conclusive due to limitations in the dataset, since they only
used explicit feedback available in the last.fm profiles, which
is limited to the love/ban binary categories. This data is
very sparse and, as the authors report, almost non-existent
for some users or artists. On the other hand, Kurdomova
et al. use a Bayesian approach to learn a classifier on
multiple implicit feedback variables [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Using these features,
the authors are able to classify liked and disliked items with
an accuracy of 0.75, uncovering the potential of mapping
implicit feedback directly to preferences.
      </p>
      <p>
        In our previous work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], we showed that it was possible
to create a simple parametric model for implicit feedback
by using linear regression on some available explicit ratings.
However, as we will explain, in the context of user ratings
it may be more appropriate to use a mixed-effects ordinal
logistic regression model. In this context, the main
contribution of our present work is an ordinal logistic
regression model that allows us to map implicit data into explicit
ratings for the task of recommendation. We make our model
context-aware with respect to how recently a user listened
to an album by contextual modeling, i.e., using the
contextual information directly in the modeling technique, unlike
data-driven approaches such as contextual pre-filtering or
post-filtering [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Once the implicit-to-explicit mapping is
performed, we can use the inferred ratings in methods for
explicit or implicit data. We can then compare the
performance of these models to the one by Hu et al. in several
experiments.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. REGRESSION MODELS</title>
    </sec>
    <sec id="sec-4">
      <title>3.1 Linear Regression</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] we introduce a linear regression model to predict
explicit preference of users for music albums, in the form of
ratings, based on implicit user behavior variables: (1)
Implicit Feedback (if): playcount for a user on a given item;
(2) Global Popularity (gp): global playcount for all users
on a given item; (3) Recentness (re): time elapsed since the
user played a given item. In that article, we compare
different linear regression models based on the aforementioned
variables and find that the variables implicit feedback
and recentness explain the largest part of the variability of the
ratings, while global popularity explains a very small
portion. This result suggested to us that the two former variables
would be better predictors of the user preference, and we
supported this assumption by performing a 10-fold cross
validation experiment using the data of our online survey on
music preference as ground truth. The RMSE values were
consistent with the previously described regression analysis.
      </p>
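      <p>The mapping described above can be sketched as ordinary least squares on the three implicit variables; the data below is synthetic and the coefficients are illustrative, not those fitted in our previous article:</p>

```python
import numpy as np

def fit_linear_mapping(X, ratings):
    """Ordinary least-squares fit of explicit ratings on the implicit
    variables (if, re, gp). Synthetic illustration, not the fitted
    model from the article."""
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, ratings, rcond=None)
    return coef

def predict_rating(coef, X):
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

# synthetic users: columns are (if, re, gp), all values illustrative
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
ratings = 1.0 + 2.5 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * X[:, 2]
coef = fit_linear_mapping(X, ratings)
rmse = np.sqrt(np.mean((predict_rating(coef, X) - ratings) ** 2))
```

      <p>Nothing here constrains the predictions to the 1 to 5 range, which is one of the shortcomings discussed next.</p>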
      <p>3.1.1 Limitations and Shortcomings of Linear Regression</p>
      <p>
        Although the linear regression gives good results, there are
some considerations that must be observed to generalize this
model to other domains and to make it comparable
with other approaches. First, depending on the application,
we may want the predicted values to fall in the range from
1 to 5, but using linear regression we cannot ensure this.
Second, as in most recommender systems research, our main
evaluation metric is RMSE. When using this metric, we are
assuming that ratings form an interval scale, i.e. the
distance between any two consecutive values in the rating scale
is the same. However, in a previous study [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we have shown
that users are more likely to be inconsistent
with some rating values than with others, which suggests
that users do not see the rating scale as equally
spaced. Hence, we should consider the ratings as an
ordinal variable rather than a linear or interval one. This also
implies that RMSE alone is not a good measure of predicted
user preference; it should be combined with, and in some cases
replaced by, other measures coming from Information
Retrieval such as precision, recall, or nDCG.
      </p>
      <p>Given that users present individual variability in their
ratings, a good extension of our model should include the user
as a random factor. Additionally, given that ratings are
actually an ordinal variable, as explained in the previous
paragraph, and the fact that they are not normally distributed,
logistic regression is a proper alternative to our linear
regression model. Combining both considerations, our next
model for implicit-to-explicit behavior mapping will
be a mixed-effects ordinal logistic regression.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Mixed-effects Ordinal Logistic Regression</title>
      <p>The multinomial logistic regression is the natural model
for an ordinal scale variable (the rating, which ranges from 1 to
5), and a mixed-effects model will help us reduce the
variability due to differences in rating among the users. Our
multinomial logistic regression, which uses the cumulative logit
as link function, can be represented as:</p>
      <p>logit(P(r_ui ≤ k)) = θ_k + Xβ + g_u    (1)</p>
      <p>where k ∈ {1, 2, 3, 4}, r_ui is the rating that user u gives to
item i, P(r_ui ≤ k) is the probability that the rating r_ui is
less than or equal to k, θ_k is the intercept for the cumulative
probability that the rating is less than or equal to k, X is a vector
with the actual values of the fixed factors (if, re and gp), β is
the vector of coefficients of the fixed factors, g_u ~ iid N(0, σ_g²)
is the random effect of the users, and
logit(p) = log(p / (1 − p)).</p>
      <p>To obtain the predicted rating of a user u on an item i,
we calculate the expected value of the rating as</p>
      <p>E[r_ui] = Σ_{k=1..5} k · P(r_ui = k)    (2)</p>
      <p>where</p>
      <p>P(r_ui = k) = P(r_ui ≤ k) for k = 1,    (3)
P(r_ui = k) = P(r_ui ≤ k) − P(r_ui ≤ k − 1) for 1 &lt; k &lt; 5,
P(r_ui = k) = 1 − P(r_ui ≤ k − 1) for k = 5.    (4)</p>
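      <p>A direct way to read this prediction step is as code; the following is a minimal sketch in which all intercepts, coefficients, and feature values are hypothetical, not the fitted ones from Table 1:</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_rating(theta, beta, x, g_u):
    """Expected rating under the cumulative-logit model of Eq. (1);
    theta holds the four intercepts, beta the fixed-factor coefficients
    for (if, re, gp), x the feature values, g_u the user random effect."""
    eta = sum(b * v for b, v in zip(beta, x)) + g_u
    # cumulative probabilities P(r_ui ≤ k) for k = 1..4; P(r_ui ≤ 5) = 1
    cum = [sigmoid(t + eta) for t in theta] + [1.0]
    # category probabilities P(r_ui = k) by differencing, Eqs. (2)-(4)
    probs = [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, 5)]
    return sum((k + 1) * p for k, p in enumerate(probs))

# hypothetical intercepts and coefficients, for illustration only
r_hat = expected_rating([-2.0, -1.0, 0.5, 2.0], [-0.8, 0.3, -0.1],
                        [1.2, 0.5, 0.3], 0.2)
```

      <p>By construction the expected rating always falls between 1 and 5, unlike the linear regression predictions.</p>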
      <!-- Table 1 (mixed-effects model for dataset1): rows for intercepts 1-4 and the effects gp, if, re, gp*if, if*re, and concerts; coefficient values not recoverable from the extraction. -->
    </sec>
    <sec id="sec-6">
      <title>4. EXPERIMENTAL SETUP</title>
    </sec>
    <sec id="sec-7">
      <title>4.1 Data sets</title>
      <p>We use two datasets in this study. The first one was
collected by an online user study among users of the last.fm
music service between September and October of 2010,
containing implicit and explicit information, as well as demographic
and consumption data. The second one was collected using
the last.fm API during May of 2011, and contains only
implicit information. The characteristics of both datasets are
described in Table 2.</p>
      <sec id="sec-7-1">
        <title>4.1.1 Generating Explicit Feedback</title>
        <p>
          We conducted an online user study among users of the
last.fm music service. The goal of the study was to gather
explicit feedback on music albums to compare with the
implicit feedback we obtained by directly crawling the last.fm
page related to the user taking the survey. Explicit
feedback was obtained by asking users to rate albums on a 1
to 5 star scale. The items to rate were obtained from the
list of albums in the user's playlist, so that users responded
to a personalized survey. Details of this study, such as the
strategy to sample the items that were rated by users and
the results on user demographics and user consumption, can
be found in our previous article [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
4.1.2 Implicit Music Consumption Feedback
        </p>
        <p>We call our Dataset2 Implicit Music Consumption Feedback
since, unlike Dataset1, which has demographic data for each
user, it only has information about the implicit behavior of the
users: playcount of albums per user, how recently each
album was listened to for the last time, and the total
number of listeners of each album on the whole last.fm website.
The statistics of this dataset are described in Table 2.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>4.2 Regression Model Selection</title>
      <p>To select the fixed effects that would be part of our model,
we conducted a forward selection on the set of all the main
effects and their two-way interactions. The main effects
considered were if, re, gp (as described in section 3.1) plus ten
demographic and consumption variables: gender, age, hours
of music per week, hours of internet per week, buying
physical records, buying online records, interaction style
(preference for listening to tracks or albums), number of concerts
per year, interest in reading specialized music blogs or
magazines, and familiarity with rating music online. We finally have
to pick two models because of the nature of our two datasets.
In the smaller one (dataset1) we have all the variables
obtained by a user study, but in the second dataset (dataset2)
we just have implicit information (playcounts per user, how
recently the user listened to each album, and the total
number of listeners of an album in the whole dataset) that can
be reduced to if, re and gp.</p>
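      <p>The forward-selection procedure itself can be sketched generically; in the sketch below, score_fn is a hypothetical stand-in for fitting the mixed-effects model on a set of terms and returning a lower-is-better criterion:</p>

```python
def forward_select(candidates, score_fn):
    """Greedy forward selection: repeatedly add the term that most
    improves the score, stopping when no remaining candidate helps.
    score_fn is a hypothetical stand-in for a model-fitting routine
    returning a lower-is-better criterion."""
    selected = []
    best = score_fn(selected)
    improved = True
    while improved:
        improved = False
        pick = None
        for term in candidates:
            if term in selected:
                continue
            s = score_fn(selected + [term])
            if best - s > 1e-9:  # strictly better (lower) score
                best, pick = s, term
                improved = True
        if pick is not None:
            selected.append(pick)
    return selected, best

# toy criterion: pretend the "true" model uses if and re,
# and every superfluous term costs 0.1
useful = {"if", "re"}
toy_score = lambda terms: len(useful - set(terms)) + 0.1 * len(set(terms) - useful)
chosen, _ = forward_select(["if", "re", "gp", "concerts"], toy_score)
```

      <p>The toy criterion is only there to make the sketch runnable; in our setting the candidate terms are the main effects and their two-way interactions.</p>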
      <p>
        After conducting the forward selection process, the model
obtained for dataset1 considers four fixed effects (if, re, gp
and concerts per year) and the random effect of the user.
The details of the model are described in Table 1. Although
the main effects of global popularity (gp) and recentness (re)
are not significant, we keep them in the model because their
interaction with implicit feedback (if) is significant [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>For dataset2, we consider in the model if, re, and gp as
fixed effects plus the random effect of the user. For the sake
of space we do not show the details of this model, but the
coefficients and significance values are similar to those shown
in Table 1, except that the factor number of concerts is
not considered in the model. As in the previous model, we
keep gp and re in the model, although they are not significant,
due to their interaction with if. Under this model, the
intercept for rating equal to 2 is also not significant, which tells
us that this intercept is not significantly different from 0,
and we may dismiss it from the model.</p>
      <!-- Table 2 (dataset statistics): rows for users, albums, entries, density, avg albums/user, and avg users/album; values not recoverable from the extraction. -->
    </sec>
    <sec id="sec-9">
      <title>4.3 Comparing the different approaches</title>
      <p>Once the implicit-to-explicit mapping is done, we
can compare the use of implicit data with
inferred explicit data. In this article, we compare four
approaches using dataset 1 and three approaches using dataset
2. The methods we compare, as identified in the first column
of Table 3, are:</p>
      <p>
        HK: the implicit feedback method introduced in Hu et al.
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which uses raw playcounts;
HKlog: a variation of the HK method, also introduced in
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], that applies a log transformation to the playcounts;
logit3: the HK method, where the input values are the
ratings inferred by logistic regression using 3 fixed factors
(if, gp, and re);
logit4: similar to logit3, but adding the factor number of
concerts to the logistic regression model used to infer the ratings.
      </p>
      <p>We have this last piece of information (concerts per year) available just for dataset1.</p>
      <p>
        Description of the HK method. For the implicit
feedback modeling we use the Matrix Factorization method
developed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In this Matrix Factorization method a
weighted least-squares error loss function is minimized. To
this end, user-item interactions p_ij are signaled with a 1 and
missing interactions are marked with a 0. The counts of
user-item interactions (e.g. playcounts Y_ij) are translated
into a confidence measure w_ij, which in the case of the HK
method corresponds to p_ij + Y_ij, while the
HKlog method uses a simple log transform:</p>
      <p>w_ij = log(1 + Y_ij) if Y_ij &gt; 0, and w_ij = 1 if Y_ij = 0.</p>
      <p>This "confidence" is then used as a weight in the loss
function, and the objective function then becomes</p>
      <p>min over U, M of Σ_i Σ_j w_ij (p_ij − ⟨U_i, M_j⟩)² + λ (Σ_i ||U_i||² + Σ_j ||M_j||²)    (5)</p>
      <p>where the Frobenius norm of the factor matrices is used
for regularization. This minimization problem is then solved
in linear time using Alternating Least Squares, utilizing a
trick to avoid direct optimization over the 0 entries of the
matrix.</p>
      <sec id="sec-9-1">
        <title>4.3.1 Error Measures</title>
        <p>
          RMSE [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] is probably the most common measure used to
evaluate the performance of recommender systems, and we used
it to evaluate and compare our linear regression approaches
in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. However, when there are no ratings to assess the
performance of the algorithms, we cannot use metrics like
RMSE or MAE. Hence, we opt for Mean Average
Precision (MAP) [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and normalized Discounted Cumulative
Gain (nDCG) [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. The former gives us an overall sense of
how well we identify relevant items to recommend from a
set of retrieved recommendations, and the latter of how well
we rank them in a list.
        </p>
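        <p>As a concrete reference for these two metrics, a minimal sketch with binary relevance (one ranked list per user; MAP is the mean of the per-user average precision) might look like:</p>

```python
import math

def average_precision(ranked, relevant):
    """Average precision for one user's ranked recommendation list."""
    hits, score = 0, 0.0
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / pos  # precision at each relevant position
    return score / max(len(relevant), 1)

def ndcg(ranked, relevant):
    """Binary-relevance nDCG with a log2 position discount."""
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, item in enumerate(ranked, start=1) if item in relevant)
    ideal = sum(1.0 / math.log2(pos + 1)
                for pos in range(1, min(len(relevant), len(ranked)) + 1))
    return dcg / ideal if ideal else 0.0
```

        <p>Both metrics depend entirely on how the relevant set is defined, which is the point we return to in the conclusions.</p>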
      </sec>
    </sec>
    <sec id="sec-10">
      <title>5. RESULTS</title>
      <p>In order to evaluate and compare the methods, we split
each dataset into 5 groups to perform a 5-fold cross
validation. The result of each run is a list of recommended
items (albums) for each user in the test set, sorted by the
preference that the user would have for each item. We
calculate MAP and nDCG for each list recommended to a
user, judging an item as relevant if it was consumed
(played) at least once by the user. Results can be seen in
Table 3.</p>
      <p>In the case of dataset 1, the best results for MAP and
nDCG are obtained by recommending the most popular
items. This result is somewhat expected due to the
sparsity of the dataset, which affects the methods based on matrix
factorization. As shown in Table 2, each album was rated on
average by just 1.71 users. This situation is not repeated in
dataset 2, where the average number of users per album is
18.52, and there the popularity method performs the worst.</p>
      <p>We highlight two results from these initial experiments. The
first one is that the log transformation of raw playcounts
makes HKlog improve clearly over HK on both the MAP and
nDCG measures. The second result we highlight is that logit3
and logit4 perform better than HK, and there is not a big
difference in performance with HKlog, leading us to investigate
further to confirm this difference.</p>
    </sec>
    <sec id="sec-11">
      <title>6. CONCLUSIONS AND FUTURE WORK</title>
      <p>
        In this paper, we continue the work that we started in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
to create a model that allows us to map implicit to explicit
user behavior. Using the MAP and nDCG metrics, we show
that our method is comparable to state-of-the-art methods
that provide recommendations making use of implicit user
feedback.
      </p>
      <p>The results that we have obtained, part of which we show
in this paper, give us some insights, but they mainly open
research questions that we need to analyze further. We have
confirmed on our dataset the benefits of applying a log
transformation to the raw user feedback in the Hu et al. model,
which shows consistently better results than the unmodified
version.</p>
      <p>
        In terms of the questions we need to analyze further: up
to this point, we have considered the factors implicit
feedback and global popularity in our logistic regression models
as ordinal variables. We coded these variables this way
to make sure that we were doing an appropriately diverse
sampling when creating the user survey described in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
However, nothing prevents us from using the raw playcounts
for both of the aforementioned factors, and we think that this
modification can benefit the results of our implicit-to-explicit
logistic regression model.
      </p>
      <p>In the experiments run in this study, since we are not
predicting user ratings but rather user preference, metrics
such as RMSE or MAE cannot be used to compare the
methods, so we opt for IR metrics such as MAP and nDCG,
which rely on how we define relevance. We wonder whether our
definition of relevance might bias our results and conclusions.
As we have stated before, we think that low feedback
might be, in fact, negative feedback. For this reason, we are
currently testing different user activity (implicit feedback)
thresholds to define relevance, in order to analyze how that
influences the evaluation of the different recommendation
approaches.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Volinsky</surname>
          </string-name>
          .
          <article-title>Collaborative filtering for implicit feedback datasets</article-title>
          .
          <source>In Proceedings of ICDM</source>
          <year>2008</year>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rokach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shapira</surname>
          </string-name>
          , and P. B. Kantor, editors.
          <source>Recommender Systems Handbook</source>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Oard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jinmook</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>Implicit feedback for recommender systems</article-title>
          .
          <source>In Proceedings of the AAAI Workshop on Recommender Systems</source>
          , pages
          <fpage>81</fpage>
          –
          <fpage>83</fpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Potter.</surname>
          </string-name>
          <article-title>Putting the collaborator back into collaborative filtering</article-title>
          .
          <source>In 2nd KDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Nichols</surname>
          </string-name>
          .
          <article-title>Implicit rating and filtering</article-title>
          .
          <source>In Proceedings of the Fifth DELOS Workshop on Filtering and Collaborative Filtering</source>
          , pages
          <fpage>31</fpage>
          –
          <fpage>36</fpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Amatriain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.M.</given-names>
            <surname>Pujol</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Oliver</surname>
          </string-name>
          .
          <article-title>I like it... I like it not: Evaluating user ratings noise in recommender systems</article-title>
          .
          <source>In Proc. of the 2009 Conference on User Modeling, Adaptation, and Personalization</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jawaheer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Szomszor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Kostkova</surname>
          </string-name>
          .
          <article-title>Characterisation of explicit feedback in an online music recommendation service</article-title>
          .
          <source>In Proceedings of the fourth ACM conference on Recommender systems, RecSys '10</source>
          , pages
          <fpage>317</fpage>
          –
          <fpage>320</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Parra</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Amatriain</surname>
          </string-name>
          .
          <article-title>Walk the talk: Analyzing the relation between implicit and explicit feedback for preference elicitation</article-title>
          .
          <source>In Proc. of the 2011 Conference on User Modeling</source>
          , Adaptation, and
          <string-name>
            <surname>Personalization</surname>
          </string-name>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Morita</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shinoda</surname>
          </string-name>
          .
          <article-title>Information filtering based on user behavior analysis and best match text retrieval</article-title>
          .
          <source>In SIGIR '94: Proceedings of the 17th annual international ACM SIGIR conference</source>
          , pages
          <fpage>272</fpage>
          -
          <lpage>281</lpage>
          , New York, NY, USA,
          <year>1994</year>
          . Springer-Verlag New York, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Joseph A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bradley N.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David</given-names>
            <surname>Maltz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jonathan L.</given-names>
            <surname>Herlocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lee R.</given-names>
            <surname>Gordon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>John</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <article-title>GroupLens: applying collaborative filtering to Usenet news</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>40</volume>
          (
          <issue>3</issue>
          ):
          <fpage>77</fpage>
          -
          <lpage>87</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Oard</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <article-title>Modeling information content using observable behavior</article-title>
          .
          <source>In Proc. of the ASIST Annual Meeting</source>
          , pages
          <fpage>481</fpage>
          -
          <lpage>488</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.S.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E. K.</given-names>
            <surname>Clemons</surname>
          </string-name>
          .
          <article-title>Do online reviews reflect a product's true perceived quality? - an investigation of online movie reviews across cultures</article-title>
          .
          <source>Electronic Commerce Research and Applications</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Park</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Park</surname>
          </string-name>
          .
          <article-title>A time-based approach to effective recommender systems using implicit feedback</article-title>
          .
          <source>Expert Syst. Appl.</source>
          ,
          <volume>34</volume>
          (
          <issue>4</issue>
          ):
          <fpage>3055</fpage>
          -
          <lpage>3062</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Gawesh</given-names>
            <surname>Jawaheer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Szomszor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Patty</given-names>
            <surname>Kostkova</surname>
          </string-name>
          .
          <article-title>Comparison of implicit and explicit feedback from an online music recommendation service</article-title>
          .
          <source>In Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kordumova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kostadinovska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pronk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Korst</surname>
          </string-name>
          .
          <article-title>Personalized implicit learning in a music recommender system</article-title>
          .
          <source>In UMAP 2010</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Gediminas</given-names>
            <surname>Adomavicius</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Tuzhilin</surname>
          </string-name>
          .
          <article-title>Context-aware recommender systems</article-title>
          . In Francesco Ricci, Lior Rokach, Bracha Shapira, and Paul B. Kantor, editors,
          <source>Recommender Systems Handbook</source>
          , pages
          <fpage>217</fpage>
          -
          <lpage>253</lpage>
          .
          Springer US
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Neter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Kutner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Nachtsheim</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Wasserman</surname>
          </string-name>
          .
          <source>Applied Linear Statistical Models</source>
          . Irwin, Chicago,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Jonathan L.</given-names>
            <surname>Herlocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joseph A.</given-names>
            <surname>Konstan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Loren G.</given-names>
            <surname>Terveen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>John T.</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <article-title>Evaluating collaborative filtering recommender systems</article-title>
          .
          <source>ACM Trans. Inf. Syst.</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>53</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Prabhakar</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hinrich</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <source>Introduction to Information Retrieval</source>
          . Cambridge University Press, New York, NY, USA,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Kalervo</given-names>
            <surname>Järvelin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jaana</given-names>
            <surname>Kekäläinen</surname>
          </string-name>
          .
          <article-title>Cumulated gain-based evaluation of IR techniques</article-title>
          .
          <source>ACM Trans. Inf. Syst.</source>
          ,
          <volume>20</volume>
          :
          <fpage>422</fpage>
          -
          <lpage>446</lpage>
          ,
          October
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>