Factor Models for Tag Recommendation in BibSonomy

Steffen Rendle

Lars Schmidt-Thieme

schmidt-thieme@ismll.uni-hildesheim.de

Machine Learning Lab, University of Hildesheim, Germany

This paper describes our approach to the ECML/PKDD Discovery Challenge 2009. Our approach is a purely statistical model that takes no content information into account. It tries to find latent interactions between users, items and tags by factorizing the observed tagging data. The factorization model is learned by the Bayesian Personalized Ranking method (BPR), which is inspired by a Bayesian analysis of personalized ranking with missing data. To prevent overfitting, we ensemble the models over several iterations and hyperparameters. Finally, we enhance the top-n lists by estimating how many tags to recommend.

Introduction

In this paper, we describe our approach to task 2 of the ECML/PKDD Discovery Challenge 2009. The setting of the challenge is personalized tag recommendation [ 1 ]. An example is a social bookmarking site where a user wants to tag one of his bookmarks and the tag recommender suggests a personalized list of tags he might want to use for this item.

Our approach to this problem is a pure statistical model using no content information. It relies on a factor model related to [ 2 ] where the model parameters are optimized for the maximum likelihood estimator for personalized pairwise ranking [ 3 ]. Furthermore, we use a smoothing method for reducing the variance in the factor models. Finally, we provide a method for estimating how many tags should be recommended for a given post. This method is model independent and can be applied to any tag recommender.

Terminology and Formalization

We follow the terminology of [ 2 ]: $U$ is the set of all users, $I$ the set of all items/resources and $T$ the set of all tags. The tagging information of the past is represented as the ternary relation $S \subseteq U \times I \times T$. A tagging triple $(u, i, t) \in S$ means that user $u$ has tagged an item $i$ with the tag $t$. The posts $P_S$ denote the set of all distinct user/item combinations in $S$:

$$P_S := \{(u, i) \mid \exists t \in T : (u, i, t) \in S\}$$

Our models calculate an estimator $\hat{Y}$ for $S$. Given such a predictor $\hat{Y}$, the list Top of the $N$ highest scoring tags for a given user $u$ and an item $i$ can be calculated by:

$$\mathrm{Top}(u, i, N) := \operatorname*{argmax}^{N}_{t \in T} \hat{y}_{u,i,t} \tag{1}$$

where the superscript $N$ denotes the number of tags to return. Besides $\hat{y}_{u,i,t}$ we also use the notation of a rank $\hat{r}_{u,i,t}$, which is the position of $t$ in a post $(u, i)$ after sorting all tags by $\hat{y}_{u,i,t}$:

$$\hat{r}_{u,i,t} := |\{t' : \hat{y}_{u,i,t'} > \hat{y}_{u,i,t}\}|$$

Factor Model

Our factorization model (FM) captures the interactions between users and tags as well as between items and tags. The model equation is given by:

$$\hat{y}_{u,i,t} = \sum_f \hat{u}_{u,f} \cdot \hat{t}^U_{t,f} + \sum_f \hat{i}_{i,f} \cdot \hat{t}^I_{t,f} \tag{2}$$

where $\hat{U}$, $\hat{I}$, $\hat{T}^U$ and $\hat{T}^I$ are feature matrices capturing the latent interactions. They have the following types:

$$\hat{U} \in \mathbb{R}^{|U| \times k}, \quad \hat{I} \in \mathbb{R}^{|I| \times k}, \quad \hat{T}^U \in \mathbb{R}^{|T| \times k}, \quad \hat{T}^I \in \mathbb{R}^{|T| \times k}$$

Note that this model differs from the factorization model in [ 2 ], where the model equation is the Tucker Decomposition.
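As an illustration of the model equation (2), the Top list (1) and the rank definition, the following minimal Python sketch scores the tags of a single post. The function names and the toy feature matrices are assumptions made for this example; they are not taken from the authors' implementation.

```python
# Toy sketch of the factor model score, the Top-N list and the rank.
# All sizes and values are made up for illustration.

def score(U, I, TU, TI, u, i, t):
    """y_hat[u,i,t] = <U[u], TU[t]> + <I[i], TI[t]>, as in equation (2)."""
    k = len(U[u])
    return sum(U[u][f] * TU[t][f] + I[i][f] * TI[t][f] for f in range(k))

def top_n(U, I, TU, TI, u, i, n):
    """Indices of the n highest scoring tags for post (u, i)."""
    tags = range(len(TU))
    return sorted(tags, key=lambda t: -score(U, I, TU, TI, u, i, t))[:n]

def rank(U, I, TU, TI, u, i, t):
    """Number of tags scoring strictly higher than t (0 means best)."""
    y_t = score(U, I, TU, TI, u, i, t)
    return sum(score(U, I, TU, TI, u, i, t2) > y_t for t2 in range(len(TU)))

# One user, one item, three tags, k = 2 latent features.
U = [[1.0, 0.0]]
I = [[0.0, 1.0]]
TU = [[0.5, 0.0], [0.25, 0.0], [1.0, 0.0]]
TI = [[0.0, 0.125], [0.0, 0.5], [0.0, 0.0]]

scores = [score(U, I, TU, TI, 0, 0, t) for t in range(3)]
best_two = top_n(U, I, TU, TI, 0, 0, 2)
```

Because the score is a sum of two dot products, the user part and the item part of a tag's score can be inspected separately, which is not possible with a full Tucker core.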

Optimization Criterion

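Our optimization criterion is an adaptation of the BPR criterion (Bayesian Personalized Ranking) [ 3 ]. The criterion presented in [ 3 ] is derived for the task of item recommendation. Adapted to tag recommendation, the optimization function for our factor model is:

$$\text{BPR-Opt} := \sum_{(u,i) \in P_S} \sum_{t^+ \in T^+_{u,i}} \sum_{t^- \in T^-_{u,i}} \ln \sigma(\hat{y}_{u,i,t^+} - \hat{y}_{u,i,t^-}) - \lambda \left( \|\hat{U}\|^2 + \|\hat{I}\|^2 + \|\hat{T}^U\|^2 + \|\hat{T}^I\|^2 \right) \tag{3}$$

Here $\sigma(x) = 1/(1 + e^{-x})$ is the logistic function, $T^+_{u,i} := \{t : (u, i, t) \in S\}$ are the tags observed for post $(u, i)$, $T^-_{u,i} := T \setminus T^+_{u,i}$ are the remaining tags, and $\lambda$ is the regularization parameter. That means BPR-Opt tries to optimize the pairwise classification accuracy within observed posts. Note that it differs from [ 2 ] by optimizing for pairwise classification (log-sigmoid) instead of AUC (sigmoid).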
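To make the criterion (3) concrete, here is a small Python sketch that evaluates a pairwise log-sigmoid objective with an L2 penalty on a toy data set. It assumes that positive tags are those observed for a post and negative tags are all others; the names `bpr_opt`, `posts` and `params` are hypothetical and chosen for this example only.

```python
import math

def ln_sigmoid(x):
    """Numerically stable ln(sigma(x)) with sigma(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bpr_opt(y_hat, posts, n_tags, params, lam):
    """Pairwise log-sigmoid objective minus an L2 penalty.

    y_hat(u, i, t) -> float : any scoring function
    posts : dict mapping (u, i) to the set of observed tag indices
    params: flat list of all model parameters (used for the penalty)
    """
    total = 0.0
    for (u, i), observed in posts.items():
        unobserved = [t for t in range(n_tags) if t not in observed]
        for t_pos in observed:
            for t_neg in unobserved:
                total += ln_sigmoid(y_hat(u, i, t_pos) - y_hat(u, i, t_neg))
    return total - lam * sum(p * p for p in params)

# One post with observed tag 0 and unobserved tag 1.
posts = {(0, 0): {0}}
indifferent = bpr_opt(lambda u, i, t: 0.0, posts, 2, [], 0.0)
separating = bpr_opt(lambda u, i, t: 2.0 if t == 0 else 0.0, posts, 2, [], 0.0)
penalized = bpr_opt(lambda u, i, t: 0.0, posts, 2, [1.0, 2.0], 0.1)
```

A scorer that places the observed tag above the unobserved one obtains a larger (less negative) value, and the penalty lowers the value for large parameters.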

Learning

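The model is learned by the LearnBPR algorithm [ 3 ], which is a stochastic gradient descent algorithm where cases are sampled by bootstrapping. In the following, we show how we apply this generic algorithm to the task of optimizing our model parameters for tag recommendation. The gradient of BPR-Opt (3) with respect to the model parameters $\Theta = \{\hat{U}, \hat{I}, \hat{T}^U, \hat{T}^I\}$ is:

$$\frac{\partial\, \text{BPR-Opt}}{\partial \Theta} \propto \sum_{(u,i,t^+,t^-)} \frac{e^{-(\hat{y}_{u,i,t^+} - \hat{y}_{u,i,t^-})}}{1 + e^{-(\hat{y}_{u,i,t^+} - \hat{y}_{u,i,t^-})}} \cdot \frac{\partial}{\partial \Theta} \left( \hat{y}_{u,i,t^+} - \hat{y}_{u,i,t^-} \right) - \lambda \Theta$$

where the sum runs over all posts $(u, i) \in P_S$, with $t^+$ a tag observed for the post and $t^-$ a tag not observed for it. That means we only have to compute the derivatives of our model equation $\hat{y}_{u,i,t}$ with respect to each model parameter from $\Theta = \{\hat{U}, \hat{I}, \hat{T}^U, \hat{T}^I\}$:

$$\frac{\partial \hat{y}_{u,i,t}}{\partial \hat{u}_{u,f}} = \hat{t}^U_{t,f}, \qquad \frac{\partial \hat{y}_{u,i,t}}{\partial \hat{i}_{i,f}} = \hat{t}^I_{t,f}, \qquad \frac{\partial \hat{y}_{u,i,t}}{\partial \hat{t}^U_{t,f}} = \hat{u}_{u,f}, \qquad \frac{\partial \hat{y}_{u,i,t}}{\partial \hat{t}^I_{t,f}} = \hat{i}_{i,f}$$

These derivatives are used in the stochastic gradient descent algorithm shown in figure 1.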
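The derivatives above translate into a single stochastic update for one sampled case $(u, i, t^+, t^-)$. The following Python sketch is one possible reading of such a step, not the authors' code: the names `sgd_step`, `alpha` (learning rate) and `lam` (regularization) are assumptions, and the step reads all old values before writing, which is a design choice of this example.

```python
import math

def y_hat(U, I, TU, TI, u, i, t):
    """Model score: <U[u], TU[t]> + <I[i], TI[t]>."""
    return sum(U[u][f] * TU[t][f] + I[i][f] * TI[t][f] for f in range(len(U[u])))

def sgd_step(U, I, TU, TI, u, i, t_pos, t_neg, alpha, lam):
    """One ascent step on ln sigma(y[u,i,t_pos] - y[u,i,t_neg]) with L2 penalty."""
    d = y_hat(U, I, TU, TI, u, i, t_pos) - y_hat(U, I, TU, TI, u, i, t_neg)
    g = 1.0 / (1.0 + math.exp(d))  # equals e^-d / (1 + e^-d)
    for f in range(len(U[u])):
        # Read the old values first so every update uses the same iterate.
        u_f, i_f = U[u][f], I[i][f]
        tu_pos, tu_neg = TU[t_pos][f], TU[t_neg][f]
        ti_pos, ti_neg = TI[t_pos][f], TI[t_neg][f]
        U[u][f] += alpha * (g * (tu_pos - tu_neg) - lam * u_f)
        I[i][f] += alpha * (g * (ti_pos - ti_neg) - lam * i_f)
        TU[t_pos][f] += alpha * (g * u_f - lam * tu_pos)
        TU[t_neg][f] += alpha * (-g * u_f - lam * tu_neg)
        TI[t_pos][f] += alpha * (g * i_f - lam * ti_pos)
        TI[t_neg][f] += alpha * (-g * i_f - lam * ti_neg)

# Toy model: one user, one item, two tags, k = 2 (values are made up).
U = [[0.1, -0.2]]
I = [[0.05, 0.1]]
TU = [[0.1, 0.0], [0.0, 0.1]]
TI = [[0.2, 0.1], [0.1, -0.1]]

margin_before = y_hat(U, I, TU, TI, 0, 0, 0) - y_hat(U, I, TU, TI, 0, 0, 1)
for _ in range(100):
    sgd_step(U, I, TU, TI, u=0, i=0, t_pos=0, t_neg=1, alpha=0.1, lam=1e-4)
margin_after = y_hat(U, I, TU, TI, 0, 0, 0) - y_hat(U, I, TU, TI, 0, 0, 1)
```

Repeating the step on the same pair widens the score margin between the observed and the unobserved tag, which is the behaviour the pairwise criterion rewards.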

The method presented so far has the following hyperparameters:

- $\alpha \in \mathbb{R}^+$: learning rate
- $\lambda \in \mathbb{R}_0^+$: regularization parameter
- $\mu \in \mathbb{R}$: mean value for initialization of model parameters
- $\sigma^2 \in \mathbb{R}_0^+$: variance for initialization of model parameters
- $k \in \mathbb{N}^+$: feature dimensionality of the factorization

Reasonable values for all parameters can be searched on a holdout set. The learning rate and the initialization parameters are only important for the learning algorithm but are not part of the optimization criterion or model equation. Usually, the values found for $\alpha$, $\mu$, $\sigma^2$ on the holdout generalize well.

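In contrast to this, the regularization and dimensionality are more important for the prediction quality. In general, when the regularization is chosen properly, the higher the dimensionality the better. In our submitted result, we use an ensemble of models with different regularization and dimensionality.

procedure LearnBPR($P_S, \hat{U}, \hat{I}, \hat{T}^U, \hat{T}^I$)
    draw $\hat{U}, \hat{I}, \hat{T}^U, \hat{T}^I$ from $N(\mu, \sigma^2)$
    repeat
        draw $(u, i, t^+, t^-)$ uniformly from $P_S$, with $t^+$ a tag observed for post $(u, i)$ and $t^-$ a tag not observed for it
        $d \leftarrow \hat{y}_{u,i,t^+} - \hat{y}_{u,i,t^-}$
        for $f \in 1, \ldots, k$ do
            $\hat{u}_{u,f} \leftarrow \hat{u}_{u,f} + \alpha \left( \frac{e^{-d}}{1+e^{-d}} \cdot (\hat{t}^U_{t^+,f} - \hat{t}^U_{t^-,f}) - \lambda \hat{u}_{u,f} \right)$
            $\hat{i}_{i,f} \leftarrow \hat{i}_{i,f} + \alpha \left( \frac{e^{-d}}{1+e^{-d}} \cdot (\hat{t}^I_{t^+,f} - \hat{t}^I_{t^-,f}) - \lambda \hat{i}_{i,f} \right)$
            $\hat{t}^U_{t^+,f} \leftarrow \hat{t}^U_{t^+,f} + \alpha \left( \frac{e^{-d}}{1+e^{-d}} \cdot \hat{u}_{u,f} - \lambda \hat{t}^U_{t^+,f} \right)$
            $\hat{t}^U_{t^-,f} \leftarrow \hat{t}^U_{t^-,f} + \alpha \left( -\frac{e^{-d}}{1+e^{-d}} \cdot \hat{u}_{u,f} - \lambda \hat{t}^U_{t^-,f} \right)$
            $\hat{t}^I_{t^+,f} \leftarrow \hat{t}^I_{t^+,f} + \alpha \left( \frac{e^{-d}}{1+e^{-d}} \cdot \hat{i}_{i,f} - \lambda \hat{t}^I_{t^+,f} \right)$
            $\hat{t}^I_{t^-,f} \leftarrow \hat{t}^I_{t^-,f} + \alpha \left( -\frac{e^{-d}}{1+e^{-d}} \cdot \hat{i}_{i,f} - \lambda \hat{t}^I_{t^-,f} \right)$
        end for
    until convergence
    return $\hat{U}, \hat{I}, \hat{T}^U, \hat{T}^I$
end procedure

Figure 1: The LearnBPR algorithm for optimizing the factor model with stochastic gradient descent.

Ensembling factor models with different regularization and dimensionality is supposed to remove variance from the ranking estimates. There are basically two simple approaches of ensembling predictions $\hat{y}^l_{u,i,t}$ of $l$ models:

1. Ensemble of the value estimates $\hat{y}^l_{u,i,t}$:

$$\hat{y}^{ev}_{u,i,t} := \sum_l w_l \, \hat{y}^l_{u,i,t} \tag{4}$$

2. Ensemble of the rank estimates $\hat{r}^l_{u,i,t}$:

$$\hat{y}^{er}_{u,i,t} := \sum_l w_l \, (|T| - \hat{r}^l_{u,i,t}) \tag{5}$$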

Here $w_l$ is the weighting parameter for each model. In the rank ensemble (5), tags with a high rank (low $\hat{r}$) get a high score $\hat{y}$.

Whereas ensembling value estimates is effective for models with predictions on the same scale, rank estimates are favorable in cases where the $\hat{y}$ values of the different models have no direct relationship.
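The difference between the two ensembling schemes (4) and (5) can be seen on a toy example. The sketch below is illustrative only: the two score lists are invented, and the function names are assumptions of this example.

```python
def ranks(scores):
    """r[t] = number of tags scoring strictly higher than tag t."""
    return [sum(s2 > s for s2 in scores) for s in scores]

def ensemble_values(models, weights):
    """Equation (4): weighted sum of the raw scores per tag."""
    n = len(models[0])
    return [sum(w * m[t] for m, w in zip(models, weights)) for t in range(n)]

def ensemble_ranks(models, weights):
    """Equation (5): weighted sum of (|T| - rank) per tag."""
    n = len(models[0])
    all_ranks = [ranks(m) for m in models]
    return [sum(w * (n - r[t]) for r, w in zip(all_ranks, weights)) for t in range(n)]

model_a = [0.9, 0.5, 0.1]     # scores on a small scale
model_b = [10.0, 30.0, 20.0]  # scores on a much larger scale

by_value = ensemble_values([model_a, model_b], [1, 1])
by_rank = ensemble_ranks([model_a, model_b], [1, 1])

order_by_value = sorted(range(3), key=lambda t: -by_value[t])
order_by_rank = sorted(range(3), key=lambda t: -by_rank[t])
```

In the value ensemble the ordering is dictated by `model_b` because of its larger scale, whereas the rank ensemble lets both models contribute equally.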

Ensembling Different Factor Models. For our factor models the scales of $\hat{y}$ depend both on the dimensionality and the regularization parameter. Thus we use the rank estimates for ensembling factor models with different dimensionality and regularization. In our approach we use a dimensionality of $k \in \{64, 128, 256\}$ and regularization of $\lambda \in \{10^{-4}, 5 \cdot 10^{-5}\}$. As the prediction quality of all of our factor models is comparable, we have chosen identical weights $w_l = 1$.

Ensembling Iterations. Within each factor model we use a second ensembling strategy to remove variance. Besides the hyperparameters, another problem is the stopping criterion of the learning algorithm (see figure 1). We stop after a predefined number of iterations (2000), where one iteration consists of $10 \cdot |S|$ single draws. In our experiments the models usually converged after about 500 iterations, but in the following iterations the ranking still fluctuates slightly. To remove this variance, we create many value estimates from different iterations and ensemble them: after the first 500 iterations, we create a value estimate every 50 iterations for each tag in all test posts and ensemble these estimates with (4). Again there is no reason to favor one iteration over another, so we use identical weights $w_l = 1$. This gives the final estimates for each model. The models with different dimensionality and regularization are ensembled as described above.
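The two ensembling levels can be sketched as a hyperparameter grid plus a snapshot schedule. The Python below is a rough illustration under stated assumptions: whether iterations 500 and 2000 themselves produce a snapshot is not specified in the text, so the inclusive schedule here is a guess, and the function names are hypothetical.

```python
from itertools import product

def snapshot_iterations(burn_in=500, step=50, total=2000):
    """Iterations at which a value estimate is stored (inclusive ends: an assumption)."""
    return list(range(burn_in, total + 1, step))

def sum_value_estimates(snapshots):
    """Equation (4) with w_l = 1: element-wise sum of per-tag scores."""
    return [sum(column) for column in zip(*snapshots)]

# All (k, lambda) combinations named in the text.
grid = list(product([64, 128, 256], [1e-4, 5e-5]))

iterations = snapshot_iterations()

# Three made-up snapshots of the scores of two tags for one test post.
combined = sum_value_estimates([[0.2, 0.9], [0.4, 0.7], [0.3, 0.8]])
```

With identical weights the sum and the mean of the snapshots induce the same ranking, so no normalization is needed before the per-model estimates are passed on to the rank ensemble.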

Baseline Models

Besides our factorization model we also consider several baseline models and ensembles of these models. The models we pick as baselines are most-popular by item (mpi), most-popular by user (mpu), item-based knn (knni) and user-based knn (knnu).

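The most-popular models are defined as follows:

$$\hat{y}^{mpi}_{u,i,t} = |\{u' \in U : (u', i, t) \in S\}|$$

$$\hat{y}^{mpu}_{u,i,t} = |\{i' \in I : (u, i', t) \in S\}|$$

The k-nearest-neighbour models (knn) are defined as follows:

$$\hat{y}^{knni}_{u,i,t} = \sum_{(u,i',t) \in S} sim_{i,i'}$$

$$\hat{y}^{knnu}_{u,i,t} = \sum_{(u',i,t) \in S} sim_{u,u'}$$

To measure $sim_{i,i'}$ and $sim_{u,u'}$ respectively, we first fold/project the observed data tensor into two two-dimensional matrices $F^U$ and $F^I$:

$$f^I_{i,t} = |\{u : (u, i, t) \in S\}|, \qquad f^U_{u,t} = |\{i : (u, i, t) \in S\}|$$

After the folding we apply cosine similarity to compare two tag vectors:

$$sim_{i,i'} = \frac{\langle f^I_i, f^I_{i'} \rangle}{\|f^I_i\| \, \|f^I_{i'}\|}, \qquad sim_{u,u'} = \frac{\langle f^U_u, f^U_{u'} \rangle}{\|f^U_u\| \, \|f^U_{u'}\|}$$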
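The baseline definitions can be sketched in a few lines of Python. This is an illustrative reading of the formulas, assuming $S$ is given as a set of `(user, item, tag)` triples; only the user-based knn variant is shown (the item-based one is symmetric), and all function names and the toy data are assumptions of this example.

```python
import math
from collections import Counter, defaultdict

def most_popular_by_item(S, i):
    """Tag -> number of users who assigned it to item i."""
    return Counter(t for _, i2, t in S if i2 == i)

def most_popular_by_user(S, u):
    """Tag -> number of items user u assigned it to."""
    return Counter(t for u2, _, t in S if u2 == u)

def fold_by_user(S):
    """F[u][t] = number of items user u tagged with t."""
    F = defaultdict(Counter)
    for u, _, t in S:
        F[u][t] += 1
    return F

def cosine(a, b):
    """Cosine similarity of two sparse tag vectors (0.0 if one is empty)."""
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_by_user(S, u, i):
    """Tag -> sum of sim(u, u') over all triples (u', i, t) in S."""
    F = fold_by_user(S)
    scores = defaultdict(float)
    for u2, i2, t in S:
        if i2 == i:
            scores[t] += cosine(F[u], F[u2])
    return dict(scores)

S = {("a", "x", "ml"), ("a", "x", "ai"), ("b", "x", "ml"),
     ("b", "y", "ml"), ("c", "y", "ml")}

mpi = most_popular_by_item(S, "x")
mpu = most_popular_by_user(S, "b")
knnu = knn_by_user(S, "c", "x")
```

For the new post `("c", "x")`, user `b` has the same tag profile direction as `c` and therefore contributes with similarity 1, while user `a` contributes with a smaller weight.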

We tried different weighted ensembles of the baseline models using the value estimate ensembling method. Even though these ensembles produce quite good results, in our experiments they did not outperform the factor models, and adding baselines to the factor models did not result in a significant improvement of the factor models. Thus our final submission only consists of the factor models.

Adaptive List Length

In contrast to the usual evaluation scheme of tag recommendation, in this challenge the recommender was free to choose the length of the list of the recommendations in a range from a length of 0 to 5. The evaluation functions are:

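$$\mathrm{Prec}(S_{test}) := \operatorname*{avg}_{(u,i) \in P_{S_{test}}} \frac{|\mathrm{Top}(u, i, \min(5, \#_{u,i})) \cap \{t \mid (u, i, t) \in S_{test}\}|}{\min(5, \#_{u,i})}$$

$$\mathrm{Recall}(S_{test}) := \operatorname*{avg}_{(u,i) \in P_{S_{test}}} \frac{|\mathrm{Top}(u, i, \min(5, \#_{u,i})) \cap \{t \mid (u, i, t) \in S_{test}\}|}{|\{t \mid (u, i, t) \in S_{test}\}|}$$

$$\mathrm{F1}(S_{test}) := \frac{2 \cdot \mathrm{Prec}(S_{test}) \cdot \mathrm{Recall}(S_{test})}{\mathrm{Prec}(S_{test}) + \mathrm{Recall}(S_{test})}$$

where $\#_{u,i}$ is the number of tags the recommender estimates for a post.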
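A minimal Python sketch of these evaluation functions follows. It assumes that each recommendation list has already been cut to its estimated length, and it treats the precision of an empty list as 0, which the formulas leave undefined; both choices and all names are assumptions of this example.

```python
def evaluate(recommendations, truth, max_len=5):
    """Macro-averaged precision, recall and the F1 of the two averages.

    recommendations: dict post -> ranked list of tags (already cut to the
                     estimated length); at most max_len tags are used
    truth          : dict post -> non-empty set of true tags
    """
    precisions, recalls = [], []
    for post, true_tags in truth.items():
        recommended = recommendations.get(post, [])[:max_len]
        hits = len(set(recommended) & true_tags)
        # Precision of an empty list is taken as 0 (an assumption).
        precisions.append(hits / len(recommended) if recommended else 0.0)
        recalls.append(hits / len(true_tags))
    prec = sum(precisions) / len(precisions)
    rec = sum(recalls) / len(recalls)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

truth = {("u1", "i1"): {"a", "b"}, ("u2", "i2"): {"c"}}

adaptive = {("u1", "i1"): ["a", "x"], ("u2", "i2"): ["c"]}
padded = {("u1", "i1"): ["a", "x", "y", "z", "w"],
          ("u2", "i2"): ["c", "x", "y", "z", "w"]}

prec_a, rec_a, f1_a = evaluate(adaptive, truth)
prec_p, rec_p, f1_p = evaluate(padded, truth)
```

Both lists contain the same hits, so recall is identical, but the padded lists pay for the extra wrong tags with lower precision and therefore lower F1.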

There are three simple ways to estimate $\#_{u,i}$:

- Global estimate:

$$\#^G_{u,i} := \frac{|S|}{|P_S|}$$

- User estimate:

$$\#^U_{u,i} := \frac{|\{(i', t) : (u, i', t) \in S\}|}{|\{i' : (u, i', t) \in S\}|}$$

- Item estimate:

$$\#^I_{u,i} := \frac{|\{(u', t) : (u', i, t) \in S\}|}{|\{u' : (u', i, t) \in S\}|}$$

Based on these simple estimators, a combined post size can be produced by a linear combination:

$$\#^E_{u,i} := \beta_0 + \beta_G \#^G_{u,i} + \beta_U \#^U_{u,i} + \beta_I \#^I_{u,i}$$

In our approach we use $\#^E_{u,i}$ and optimize the coefficients on the holdout set for maximal F1. We found that choosing an adaptive length of the recommender list significantly improved the results over a fixed number.
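The three estimators and their linear combination are easy to compute from the training triples. The Python sketch below is an illustration under assumptions: unknown users or items fall back to the global estimate, the combined value is rounded and clipped to the range 0 to 5 to obtain a list length, and the coefficient values in the example are invented rather than fitted.

```python
def global_estimate(S):
    """Average number of tags per post."""
    posts = {(u, i) for u, i, _ in S}
    return len(S) / len(posts)

def user_estimate(S, u):
    """Average number of tags per post of user u (global fallback: an assumption)."""
    pairs = {(i, t) for u2, i, t in S if u2 == u}
    items = {i for i, _ in pairs}
    return len(pairs) / len(items) if items else global_estimate(S)

def item_estimate(S, i):
    """Average number of tags per post of item i (global fallback: an assumption)."""
    pairs = {(u, t) for u, i2, t in S if i2 == i}
    users = {u for u, _ in pairs}
    return len(pairs) / len(users) if users else global_estimate(S)

def combined_estimate(S, u, i, b0, bG, bU, bI):
    """Linear combination of the three simple estimates."""
    return (b0 + bG * global_estimate(S)
            + bU * user_estimate(S, u) + bI * item_estimate(S, i))

def list_length(estimate, max_len=5):
    """Round and clip an estimate to a usable list length (an assumption)."""
    return max(0, min(max_len, round(estimate)))

S = {("a", "x", "ml"), ("a", "x", "ai"), ("a", "y", "ml"), ("b", "x", "ml")}

g = global_estimate(S)
by_user = user_estimate(S, "a")
by_item = item_estimate(S, "x")
combined = combined_estimate(S, "a", "x", b0=0.0, bG=0.0, bU=0.5, bI=0.5)
```

In practice the four coefficients would be chosen on a holdout set, for example by a grid search that maximizes F1.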

Experimental Results

Sampling of Holdout Set

As the test set of the challenge was released only two days before the submission deadline, we tried to generate representative holdout sets. We created two test sets, one following the leave-one-post-per-user-out protocol [ 1 ] and a second one by uniformly sampling posts with the constraint that the dataset should remain a 2-core after moving a post into the test set. These two sets were used as holdout sets for algorithm evaluation and hyperparameter selection. In the following, we report results for the second holdout set, because its characteristics (in terms of number of users, items and posts) are closer to the real test set.
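The second sampling scheme can be sketched as follows. This is not the authors' procedure but one plausible reading of it: posts are visited in random order and moved to the holdout set only if the remaining data is still a p-core, where a p-core is assumed to mean that every user, item and tag occurs in at least p posts. All names are hypothetical.

```python
import random
from collections import Counter

def is_p_core(S, p=2):
    """True if every user, item and tag occurs in at least p posts of S."""
    posts = {(u, i) for u, i, _ in S}
    users = Counter(u for u, _ in posts)
    items = Counter(i for _, i in posts)
    tags = Counter(t for _, _, t in S)  # triples are distinct, so one count per post
    counts = list(users.values()) + list(items.values()) + list(tags.values())
    return all(c >= p for c in counts)

def sample_holdout(S, n_posts, p=2, seed=0):
    """Move up to n_posts random posts to a test set, keeping the rest a p-core."""
    rng = random.Random(seed)
    train, test = set(S), set()
    posts = sorted({(u, i) for u, i, _ in S})
    rng.shuffle(posts)
    moved = 0
    for u, i in posts:
        if moved == n_posts:
            break
        post = {(u2, i2, t) for u2, i2, t in train if (u2, i2) == (u, i)}
        if is_p_core(train - post, p):
            train -= post
            test |= post
            moved += 1
    return train, test

# Toy data: three users tag three items with the same tag.
S = {(u, i, "t") for u in "abc" for i in "xyz"}
train, test = sample_holdout(S, n_posts=2)
```

The check is recomputed from scratch for every candidate post, which is fine for a sketch but would need incremental counters on a data set of realistic size.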

Results

The results of the methods presented so far can be found in tables 2 and 3. As you can see, the single baseline models result in low quality, but their ensembles can achieve a good quality. Our proposed factor models generate better recommendations than both. The best possible ensemble (optimized on test!) of the baselines achieves a score of 0.330 on the challenge set whereas our factor ensemble (not optimized on test) results in 0.345.

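Table 2: Results of the baseline models and their ensembles on the holdout and challenge test sets.

             mpu    mpi    mp-ens       knni   knnu   knn-ens      knn+mp-ens
holdout      0.249  0.351  -/0.423      0.401  0.371  -/0.445      -/0.473
challenge    0.098  0.288  0.290/0.317  0.209  0.295  0.293/0.320  0.299/0.330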

[Table 3: results of the factor models on the holdout and challenge sets; columns: single FM, FM-ens, FM-ens with adaptive list length. Only the entry 0.495 is recoverable.]

An interesting finding is that the results on the challenge test set differ considerably from those on both of our holdout sets. But as all methods suffer, we assume that the tagging behavior in the challenge test set is indeed different from the one in the training set. In particular, the baseline most-popular-by-user dropped sharply from 24.9% to 9.8%; this might indicate that personalization is difficult to achieve on the challenge test set using the provided training set. Non-personalized methods or content-based methods could benefit from the difference in both sets. Also methods that can handle temporal changes in the tagging behaviour might improve the scores.

Conclusion

In this paper, we have presented a factor model for the task of tag recommendation. The model tries to describe the individual tagging behavior by four low-dimensional matrices. The model parameters are optimized for the personalized ranking criterion BPR-Opt [ 3 ]. The length of the recommended lists is adapted to both the user and the item. Our evaluation indicates that our approach outperforms ensembles of baseline models which are known to give high quality recommendations [ 1 ].

Acknowledgements

The authors gratefully acknowledge the partial co-funding of their work through the European Commission FP7 project MyMedia (www.mymediaproject.org) under the grant agreement no. 215006. For your inquiries please contact info@mymediaproject.org.

References

1. Jaeschke, R., Marinho, L., Hotho, A., Schmidt-Thieme, L., Stumme, G.: Tag recommendations in social bookmarking systems. AICOM (2008)

2. Rendle, S., Marinho, L.B., Nanopoulos, A., Schmidt-Thieme, L.: Learning optimal ranking with tensor factorization for tag recommendation. In: KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, ACM (2009)

3. Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from implicit feedback. In: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009). (2009)