Comparing Prediction Models for Active Learning in Recommender Systems

Rasoul Karimi, Christoph Freudenthaler, Alexandros Nanopoulos, Lars Schmidt-Thieme
Information Systems and Machine Learning Lab (ISMLL)
Samelsonplatz 1, University of Hildesheim, D-31141 Hildesheim, Germany
{karimi,freudenthaler,nanopoulos,schmidt-thieme}@ismll.uni-hildesheim.de

Abstract. Recommender systems help web users to cope with information overload. Their performance, however, depends on the amount of information that users provide about their preferences. Since users are not willing to rate a large number of items, the quality of recommendations suffers. Active learning for recommender systems has been proposed in the past to acquire preference information from users. Early active learning methods for recommender systems used either memory-based approaches or the aspect model as the underlying prediction model. However, matrix factorization has recently been demonstrated (especially after the Netflix challenge) to be superior to memory-based approaches and the aspect model. It is therefore promising to develop active learning methods based on this prediction model. In this paper, we thoroughly compare matrix factorization with the aspect model to find out which one is more suitable for active learning in recommender systems. The results show that, besides improving the accuracy of recommendations, matrix factorization also drastically reduces user waiting times, i.e., the time a user waits before being asked a new query. It is therefore an ideal choice for applying active learning in real-world recommender systems.

1 Introduction

Recommender systems guide users in a personalized way to interesting or useful objects in a large space of possible options. There are several recommendation techniques, and collaborative filtering is one of them [1, 2]. Given a domain of items, users assign ratings to these items. The recommender system can then compare a user's ratings to those of other users, find the most similar users according to some similarity criterion, and recommend items that similar users have already liked.

Evidently, the performance of recommender systems depends on the number of ratings that users provide. This problem is amplified when ratings are missing because a user is new (the cold-start problem). There are different solutions to this problem. The first is to use the meta data of the new user. However, even a few ratings are more valuable than meta data [21]. Therefore, the new user is asked to rate some items. A well-known problem is that users are not willing to rate a large number of items [4, 5], so the queries presented to the new user have to be selected carefully. To address this situation, active learning methods have been proposed that acquire those ratings from the new user which help most in determining his/her interests [5, 4]. Another approach to the new-user problem is to use implicit feedback.
Here, the recommender system uses implicit information from the user (e.g., browsing and viewing events) to quickly adjust his/her user model to his/her real taste while he/she interacts with the system [22]. In this paper, we focus on the active learning approach and do not deal with the other solutions.

Early active learning methods for recommender systems were developed based on the Aspect Model (AM) [4, 5]. However, Matrix Factorization (MF) has been demonstrated (especially after the Netflix challenge) to be superior to other techniques. It is therefore promising to develop active learning methods based on this prediction model. In this paper, we examine AM and MF for the new-user problem in recommender systems. For this problem, not only the accuracy but also the training time of the prediction model matters, because preference elicitation for the new user is an interactive scenario and long interruptions cause new users to abandon the conversation.

This paper is organized as follows: in Section 2, related work is reviewed. In Section 3, MF and AM are explained. In Section 4, the training algorithms of MF and AM are compared. The experimental results are given in Section 5. Finally, conclusions are drawn in Section 6.

2 Related Work

Active learning in the context of the new-user problem was introduced by Kohrs and Merialdo [9]. This work suggested a method based on nearest-neighbor collaborative filtering, which uses entropy and variance as the loss function to identify the queried items. Al Mamunur et al. [6] extended this work by considering the popularity of items and by personalizing the item selection for each individual user. Boutilier et al. [10] applied the metric of expected value of utility to find the most informative item to query, i.e., the item that leads to the most significant change in the highest expected ratings.

Jin and Si [4] developed an active learning algorithm based on AM which is similar to active learning for parameter estimation in Bayesian networks [11]. This method uses the entropy of the model as the loss function. However, it does not directly minimize the entropy loss, because the current model may be far from the true model and relying only on the current model can be misleading. To overcome this problem, the authors propose a Bayesian treatment that takes the reliability of the current model into account. This Bayesian approach is, however, complex and intractable for real applications, as it demands excessive execution time. Harpale and Yang [5] extended [4] by relaxing the assumption that a new user can provide a rating for any queried item. Their approach personalizes active learning to the preferences of each new user, as it queries only those items for which the user is expected to provide a rating.

Karimi et al. [12] applied the most-popular item selection to AM. The results show that it competes in accuracy with the Bayesian approach, while its execution time is an order of magnitude faster than that of the Bayesian method. Karimi et al. [13] developed a non-myopic active learning method which capitalizes explicitly on the update procedure of the MF model. Initially, this method queries items that, if the new user's features are updated with the provided rating, change the features as much as possible; its goal is to explore the latent space in order to get closer to the optimal features. Then, it exploits the learned features and adjusts them only slightly.
Karimi et al. [14], inspired by existing optimal active learning for the regression task, exploit the characteristics of matrix factorization and develop a method that approximates the optimal solution for recommender systems. Karimi et al. [15] improved the most-popular item selection according to the characteristics of MF: it finds users similar to the new user in the latent space and then selects the item that is most popular among those similar users.

The idea of using decision trees for cold-start recommendation was proposed by Al Mamunur et al. [8]. Golbandi et al. [7] improved [8] by advocating a specialized version of decision trees that adapts the preference elicitation process to the new user's responses. Zhou et al. [20] modified [7] by incorporating matrix factorization into the decision trees. Karimi et al. [16] proposed another approach to introduce matrix factorization into decision trees which is more scalable than [20]. Karimi et al. [17] improved the decision trees by splitting the nodes of the trees in a finer-grained fashion; specifically, the nodes are split in a 6-way manner instead of a 3-way split. Karimi et al. [18] proposed an approach in which existing users are treated as (hypothetical) new users and an active learning problem is solved for each of them; in the end, all solved problems are aggregated in order to learn how to solve the active learning problem for a real new user.

3 Background

In this section, a short introduction to AM and MF is provided.

3.1 Aspect Model

The Aspect Model is a probabilistic latent space model which represents user interests as a mixture of preference factors [24, 25]. The latent class variables f \in F := \{f_1, f_2, \ldots, f_k\} are associated with each user u and each item i. Users and items are independent of each other given the latent class variable f. The probability of each observation tuple (u, i, r) is calculated as follows:

  p(r|i, u) = \sum_{f \in F} p(r|f, i) \, p(f|u)                                  (1)

where p(f|u) is a multinomial distribution and stands for the likelihood of user u belonging to latent class f, and p(r|f, i) is the likelihood of assigning rating r to item i under class f. In order to achieve better performance, the training ratings of each user are normalized to zero mean and unit variance [25]. The parameter p(r|f, i) is a Gaussian distribution N(\mu_{i,f}, \sigma_{i,f}) with latent class mean \mu_{i,f} and standard deviation \sigma_{i,f}.

3.2 Matrix Factorization

Matrix Factorization is the task of approximating the true, unobserved rating matrix R. The rows of R correspond to the users U and the columns to the items I, so the matrix has dimension |U| \times |I|. The predicted ratings \hat{R} are the product of two feature matrices W : |U| \times k and H : |I| \times k, where the u-th row w_u of W contains the k features that describe the u-th user and the i-th row h_i of H contains the k corresponding features of the i-th item. The elements of h_i indicate how strongly each factor influences the ratings of item i, with some factors having a higher effect than others; for a given user, the elements of w_u measure the influence of the factors on the user's preferences. Different applications of MF differ in the constraints that are sometimes imposed on the factorization. The common form of MF finds a low-norm approximation (regularized factorization) of the data matrix by minimizing the sum of squared differences on the observed entries. The predicted rating of user u for item i is the inner product of the user features and the item features, h_i^T w_u.
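To make the two prediction rules concrete, the following Python/NumPy sketch computes a rating under both models. It is an illustration under stated assumptions, not the authors' implementation: the AM rating is taken as the expectation of the mixture in Equation (1), which reduces to \sum_f p(f|u) \mu_{i,f} since p(r|f,i) is Gaussian with mean \mu_{i,f} (a common choice that the paper does not spell out), and all parameter arrays and names are hypothetical.

```python
import numpy as np

def predict_am(p_f_given_u, mu_i):
    """Aspect Model: expected rating of one item for one user.

    p_f_given_u : shape (k,), multinomial p(f|u) for the user
    mu_i        : shape (k,), Gaussian means mu_{i,f} of p(r|f,i) for the item
    Assumes the predicted rating is the expectation of the mixture in Eq. (1).
    """
    return float(np.dot(p_f_given_u, mu_i))

def predict_mf_plain(w_u, h_i):
    """Plain MF: inner product of user and item latent features (no biases yet)."""
    return float(np.dot(h_i, w_u))

if __name__ == "__main__":
    k = 10
    rng = np.random.default_rng(0)
    p_f_given_u = rng.dirichlet(np.ones(k))   # p(f|u), sums to 1
    mu_i = rng.normal(size=k)                 # means on the normalized-rating scale
    w_u, h_i = rng.normal(size=k), rng.normal(size=k)
    print(predict_am(p_f_given_u, mu_i), predict_mf_plain(w_u, h_i))
```

Note that the AM prediction averages the class means with the user's class distribution, whereas the plain MF prediction is a single dot product; the biased MF prediction used in the rest of the paper is introduced next.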
However, the full rating value is not explained by this interaction alone; user and item biases should also be taken into account, because part of the rating value is due to effects associated with either users or items (i.e., biases), independent of any interaction. Taking the user and item biases into account, the predicted rating is computed as follows [3]:

  \hat{r}_{ui} = \mu + b_i + b_u + h_i^T w_u                                       (2)

where \mu is the global average and b_i and b_u are the item and user bias, respectively. The major challenge is computing the mapping of each item and user to the factor vectors h_i, w_u \in R^k. The mapping is obtained by minimizing the following regularized squared error:

  Opt(S, W, H) = \sum_{(u,i) \in S} \left[ (r_{ui} - \mu - b_u - b_i - h_i^T w_u)^2
                 + \lambda (\|h_i\|^2 + \|w_u\|^2 + b_u^2 + b_i^2) \right]         (3)

where \lambda is the regularization factor and S is the set of (u, i) pairs for which r_{ui} is known, i.e., the training set. The details of the MF learning algorithm are described in [3].

When MF is applied to a specific data set, the predicted ratings should lie between the minimum and maximum rating of the data set. Sometimes this does not happen and the predictions have to be clipped explicitly. To avoid this, we use a sigmoid function that automatically truncates the predicted rating to the range between the minimum and maximum rating. The predicted ratings are then computed as follows:

  \hat{r}_{ui} = MinRating + \frac{MaxRating - MinRating}{1 + e^{-(\mu + b_i + b_u + h_i^T w_u)}}   (4)

3.3 Retraining Policy

When a new user enters the recommender system, the prediction model (AM or MF) has to be updated to learn the new user's latent features. As there are already many users in the recommender system, training the model from scratch takes a long time. Therefore, we switch to online updating, which means that after an initial training, further retraining is only done for new users. For online updating, we use the method introduced in [23]: after receiving a new rating from the new user, the user's latent features are initialized randomly and then learned using all ratings of the new user. The complexity of retraining is the same as that of training, but the size of the training data S is only the number of ratings used for online updating, i.e., just the ratings provided by the new user.

When the online updating technique is applied in MF, the learning rate should be reduced. This is because the number of training examples (the ratings provided by the new user) is small, and updating the new user's latent features should be done more precisely and carefully. In our experiments, the learning rate in the training phase is 0.01 and is reduced to 0.001 when online updating is performed.

4 Comparing AM and MF

The training algorithm for MF has the time complexity [23]:

  O(L \times |S| \times k)                                                         (5)

where L is the maximum number of iterations. The learning algorithm stops if the RMSE on the training data drops below a threshold \epsilon. The training algorithm for AM is shown in Algorithm 1; its convergence criterion is the same as that of MF. According to this algorithm, the time complexity of AM is O(L \times |S| \times k), which equals Equation (5). Therefore, MF and AM have the same time complexity.
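As a concrete illustration of the retraining policy of Section 3.3 and of the per-rating cost behind Equation (5), the following sketch performs per-rating stochastic gradient descent on the objective of Equation (3) for a single new user. This is a minimal sketch under stated assumptions, not the authors' implementation: only the new user's features and bias are updated (the item side is kept fixed during online updating), the reduced learning rate of 0.001 is used, the sigmoid of Equation (4) is omitted for brevity, and all names are illustrative.

```python
import numpy as np

def online_update_new_user(ratings, H, b_item, mu, k=10,
                           lr=0.001, lam=0.01, max_iters=100, eps=1e-3):
    """Sketch of online updating for a new user (Section 3.3), not the authors' code.

    ratings : list of (item_index, rating) pairs provided by the new user
    H       : item feature matrix, shape (|I|, k); kept fixed here
    b_item  : item biases, shape (|I|,); kept fixed here
    mu      : global rating average
    Returns the learned user features w_u and user bias b_u.
    """
    rng = np.random.default_rng(0)
    w_u = rng.normal(scale=0.1, size=k)   # random re-initialization of the user features
    b_u = 0.0
    for _ in range(max_iters):            # at most L iterations
        sq_err = 0.0
        for i, r in ratings:              # |S| is only the new user's ratings
            pred = mu + b_item[i] + b_u + H[i].dot(w_u)   # Equation (2)
            e = r - pred
            sq_err += e * e
            # gradient steps on the regularized squared error of Equation (3)
            w_u += lr * (e * H[i] - lam * w_u)
            b_u += lr * (e - lam * b_u)
        if np.sqrt(sq_err / len(ratings)) < eps:          # RMSE stopping criterion
            break
    return w_u, b_u
```

Each pass costs O(|S| x k) with |S| equal to the handful of ratings the new user has provided, which is why the interactive waiting time stays small; the reduced learning rate reflects the more careful updates described above.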
Algorithm 1. Aspect Model training algorithm, according to [25]

  loop  {repeat until convergence}
    for r_{ui} in S do
      for f = 1, ..., k do
        compute the E-step for f
      end for
      for f = 1, ..., k do
        update p(f|u), \mu_{i,f}, and \sigma_{i,f}
      end for
    end for
    for f = 1, ..., k do
      normalize p(f|u)
    end for
    check convergence
  end loop

However, although the asymptotic complexity is the same, AM needs more computation in practice, because there are two essential differences between AM and MF. First, the learning algorithm of MF uses gradient descent, whereas AM is based on expectation maximization. While in gradient descent the gradient is computed from a single training example, in expectation maximization the amount of change has to be computed using all training data; this step is called the E-step [24]. The time complexity of the E-step is O(L \times |S| \times k). Second, as AM is a probabilistic approach, the user features must be normalized so that the probabilities sum to 1, whereas MF is an algebraic approach, so normalizing the user features is not necessary. The time complexity of the normalization is O(L \times k). Finally, although the maximum number of iterations L is the same for AM and MF (100 in our experiments), the effective L in MF is lower than in AM, because MF converges faster than AM, which further cuts down the training time.

5 Experimental Results

Since the main challenge in applying active learning to recommender systems is that users are not willing to answer many queries, we evaluate AM and MF with respect to their accuracy on new users in terms of prediction error versus the number of queried items (simply denoted as the number of queries). The mean absolute error (MAE) is used to evaluate the performance of each test user u:

  MAE_u = \frac{1}{|M_u|} \sum_{i \in M_u} |r_{ui} - \hat{r}_{ui}|                 (6)

where M_u is the set of test items of user u, r_{ui} is the true rating of user u for item i, and \hat{r}_{ui} is the predicted rating. Since the test data set includes multiple users, the reported MAE is the average of the individual MAE values of the test users.

5.1 Data Set

We use the MovieLens (100K) dataset (www.grouplens.org/system/files/ml-data0.zip) in our experiments. MovieLens contains 943 users and 1682 items. The dataset was randomly split into training and test sets. The training set consists of 343 users (the same number as used in [5]) and the remaining users form the test set. Each test user is considered as a new user. The latent features of the new user are initially trained with three random ratings. For each test user, 20 rated items are set aside to compute the error; these test items are not new and already appear in the training data. The remaining items form the pool dataset, i.e., the dataset from which queries are selected. For simplicity, we assume that the new user is always able to rate the queried item. Of course, this is not a realistic assumption, because there are items that the new user has not seen before, so he/she cannot provide a rating. As the focus of this paper is on the suitable prediction model for active learning in recommender systems, we leave this issue for future work. In our experiments, 10 queries are asked of each new user. Therefore, the pool dataset has to contain at least 10 items which exist in the training data. Considering 10 queries and 20 test items, each test user has rated at least 30 items. The number of latent dimensions k is 10, following [5].
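The per-user protocol just described can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: the `model` object and its methods `online_update` and `predict` are hypothetical placeholders for the retraining step of Section 3.3 and the prediction of Equation (2)/(4), and queries are selected uniformly at random, as in the comparison reported next.

```python
import numpy as np

def evaluate_new_user(model, pool_items, pool_ratings, test_items, test_ratings,
                      n_queries=10, rng=np.random.default_rng(0)):
    """Sketch of the per-user evaluation protocol of Section 5.1 (illustrative only).

    pool_items/pool_ratings : candidate query items of this user and their ratings
    test_items/test_ratings : the 20 held-out items of this user and their ratings
    Returns the MAE of Equation (6) after each of the n_queries random queries.
    """
    mae_per_query = []
    candidates = list(pool_items)
    for _ in range(n_queries):
        # random query selection; the simulated user always answers (Section 5.1)
        item = candidates.pop(rng.integers(len(candidates)))
        model.online_update(item, pool_ratings[item])
        # MAE on the held-out test items of this user, Equation (6)
        errors = [abs(test_ratings[i] - model.predict(i)) for i in test_items]
        mae_per_query.append(sum(errors) / len(test_items))
    return mae_per_query

# The reported curves (Fig. 1) average mae_per_query over all 600 test users.
```

The same loop is run once with an AM-backed model and once with an MF-backed model, so that any difference in the curves is attributable to the prediction model rather than to the query-selection strategy.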
5.2 Results

In this section, we compare the accuracy of the active learning algorithm based on MF with that of the active learning algorithm based on AM. The objective is to show that MF is the better prediction model for developing the active learning algorithm. For a fair comparison, we therefore focus only on the prediction model and simply apply random selection of the queried items for both MF and AM.

Learning the new user's features usually starts with 3 initial ratings [5, 4]. This can be done in two different ways. The first option is to add the ratings to the training user dataset and train AM or MF with all users together; further retraining of the new user is then done using the online updating technique. The second option is to train the prediction model (AM or MF) only with the training users, and then train the new user with the three initial ratings using the online updating technique. For AM, both ways yield the same initial error, i.e., the error before asking any query. For MF, however, the error is lower when online updating is used from the beginning (i.e., the second way). This observation suggests a new way to improve the accuracy of MF: since MF cannot make accurate predictions for users with few ratings [26], the latent features of such users can be retrained with the online updating technique after training all users and items. This is an open door for further research.

We now compare MF and AM over 10 queries. Fig. 1 depicts the resulting MAE as a function of the number of queried items. MF outperforms AM, indicating its superiority as the prediction model.

Fig. 1. Active learning trends for 10 active-learning iterations (MAE vs. number of queries).

In addition to the accuracy, the time required to retrain the new user's latent features is also important, because the preference elicitation of the new user is an interactive scenario and long interruptions make new users leave the conversation. Table 1 compares the retraining time of the new user's latent features in AM and MF. Although both have the same asymptotic complexity, MF is faster than AM for the reasons discussed in Section 4.

Table 1. Retraining time of the new user's latent features in AM and MF

             Aspect Model   Matrix Factorization
  MovieLens      44.5 s            3.9 s

6 Conclusion

In this paper, we proposed to develop active learning methods based on matrix factorization. We compared the training algorithm of matrix factorization with that of the aspect model and showed that matrix factorization is both faster and more accurate.

As future work, we plan to conduct an online user study to validate the significance of our offline evaluation. To this end, it is crucial to design software with a user-friendly interface that encourages users to cooperate with the system [19].

References

1. G. Adomavicius and A. Tuzhilin, "Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734-749, 2005.
2. J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl, "GroupLens: Applying collaborative filtering to usenet news," Communications of the ACM, vol. 40, no. 3, pp. 77-87, 1997.
3. Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," Computer, vol. 42, pp. 30-37, 2009.
4. R. Jin and L. Si, "A Bayesian approach toward active learning for collaborative filtering," in Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004.
5. A. S. Harpale and Y. Yang, "Personalized active learning for collaborative filtering," in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2008, pp. 91-98.
6. A. M. Rashid, I. Albert, D. Cosley, S. K. Lam, S. M. McNee, J. A. Konstan, and J. Riedl, "Getting to know you: Learning new user preferences in recommender systems," in International Conference on Intelligent User Interfaces (IUI), ACM Press, 2002, pp. 127-134.
7. N. Golbandi, Y. Koren, and R. Lempel, "Adaptive bootstrapping of recommender systems using decision trees," in WSDM, ACM, 2011, pp. 595-604.
8. A. M. Rashid, G. Karypis, and J. Riedl, "Learning preferences of new users in recommender systems: an information theoretic approach," SIGKDD Explorations Newsletter, vol. 10, no. 2, pp. 90-100, Dec. 2008.
9. A. Kohrs and B. Merialdo, "Improving collaborative filtering for new users by smart object selection," in International Conference on Media Features (ICMF), 2001.
10. C. Boutilier, R. S. Zemel, and B. Marlin, "Active collaborative filtering," in Conference on Uncertainty in Artificial Intelligence (UAI), 2003.
11. S. Tong and D. Koller, "Active learning for parameter estimation in Bayesian networks," in Advances in Neural Information Processing Systems (NIPS), 2000.
12. R. Karimi, C. Freudenthaler, A. Nanopoulos, and L. Schmidt-Thieme, "Active learning for aspect model in recommender systems," in IEEE Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2011.
13. R. Karimi, C. Freudenthaler, A. Nanopoulos, and L. Schmidt-Thieme, "Non-myopic active learning for recommender systems based on matrix factorization," in IEEE Information Reuse and Integration (IRI), IEEE, 2011.
14. R. Karimi, C. Freudenthaler, A. Nanopoulos, and L. Schmidt-Thieme, "Towards optimal active learning for matrix factorization in recommender systems," in 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2011.
15. R. Karimi, C. Freudenthaler, A. Nanopoulos, and L. Schmidt-Thieme, "Exploiting the characteristics of matrix factorization for active learning in recommender systems," in RecSys, 2012, pp. 317-320.
16. R. Karimi, M. Wistuba, A. Nanopoulos, and L. Schmidt-Thieme, "Factorized decision trees for active learning in recommender systems," in 25th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2013.
17. R. Karimi, A. Nanopoulos, and L. Schmidt-Thieme, "Improved questionnaire trees for active learning in recommender systems," in Proceedings of the 16th LWA Workshops: KDML, IR and FGWM, Aachen, Germany, 2014.
18. R. Karimi, A. Nanopoulos, and L. Schmidt-Thieme, "A supervised active learning framework for recommender systems based on decision trees," User Modeling and User-Adapted Interaction, vol. 25, no. 1, pp. 39-64, 2015.
19. R. Karimi, "Fuzzy Model View Controller Pattern," in International Conference on Advances in Intelligent Systems, Theory and Applications (in cooperation with IEEE Computer Society), 2004.
20. K. Zhou, S.-H. Yang, and H. Zha, "Functional matrix factorizations for cold-start recommendation," in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11), ACM, 2011, pp. 315-324.
21. I. Pilászy and D. Tikk, "Recommending new movies: even a few ratings are more valuable than metadata," in RecSys, 2009, pp. 93-100.
22. L. Zhang, X. W. Meng, J. L. Chen, S. C. Xiong, and K. Duan, "Alleviating cold-start problem by using implicit feedback," in Proceedings of the 5th International Conference on Advanced Data Mining and Applications (ADMA '09), Springer-Verlag, 2009, pp. 763-771.
23. S. Rendle and L. Schmidt-Thieme, "Online-updating regularized kernel matrix factorization models for large-scale recommender systems," in ACM Conference on Recommender Systems (RecSys), ACM, 2008, pp. 251-258.
24. T. Hofmann and J. Puzicha, "Latent class models for collaborative filtering," in International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers Inc., 1999, pp. 688-693.
25. T. Hofmann, "Collaborative filtering via Gaussian probabilistic latent semantic analysis," in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2003, pp. 259-266.
26. R. Salakhutdinov and A. Mnih, "Probabilistic matrix factorization," in Advances in Neural Information Processing Systems (NIPS 2007), 2008, pp. 134-141.