Model Adaptation with Bayesian Hierarchical Modeling for Context-Aware Recommendation

Hideki Asoh, AIST, AIST Tsukuba Central 2, 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan, h.asoh@aist.go.jp
Yoichi Motomura, AIST, AIST Tokyo Waterfront, 2-3-26 Aomi, Koutou-ku, Tokyo 135-0064, Japan, y.motomura@aist.go.jp
Chihiro Ono, KDDI R&D Laboratories Inc., 2-1-15 Ohara, Fujimino, Saitama 365-8502, Japan, ono@kddilabs.jp

ABSTRACT

Model adaptation is the process of modifying a model trained with a large amount of training data from a source domain so that it fits a specific, similar target domain, using a small amount of adaptation data from the target domain. Bayesian hierarchical modeling is well known as a general tool for model adaptation and multi-task learning, and is widely used in areas such as marketing, ecology, medicine, and education to model heterogeneity in the phenomena of interest. In this work, we propose to apply Bayesian hierarchical modeling to the problem of preference modeling, where a model trained with a large amount of supposed context data is adapted to the real context using an additional small amount of real context data. The effectiveness of the proposed method is evaluated by experiments using context-aware food preference data.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous

General Terms

Experimentation, Human factors, Measurement

Keywords

Model Adaptation, Preference Modeling, Context Awareness

CARS-2011, October 23, 2011, Chicago, Illinois, USA. Copyright is held by the author/owner(s).

1. INTRODUCTION

Modeling users' preferences is an important element of recommender systems. We have constructed several context-aware attribute-based recommender systems that use Bayesian networks for modeling users' preferences [2, 16]. In the course of their construction, collecting a large amount of data about users' preferences through inquiries is necessary. In particular, to make the model context-aware, users' preference data should be collected under various contexts. However, putting the subjects of inquiries into various contexts and collecting answers from them is often difficult and costly. Hence, answers are often collected in supposed contexts, i.e. contexts that the subjects pretend or imagine to be in. Although there may be differences between preferences in real contexts and in supposed contexts, these differences are usually not taken seriously.

In our previous works, we collected users' preferences for various dishes in both real and supposed contexts and showed that the difference is statistically significant and not negligible [17]. We also analyzed the statistical nature of the differences and demonstrated that the structure of preferences in supposed contexts is simpler than that of preferences in real contexts [3]. These studies suggest that it is dangerous to construct preference models using data collected only in supposed contexts.

In this work, we pursue the possibility of constructing better preference models by combining data from supposed contexts and real contexts. Although there are differences between preferences in real contexts and in supposed contexts, they are similar to some extent, and collecting data in supposed contexts is much cheaper than in real contexts. Hence, if we can modify a model constructed from a large amount of supposed context data so that it adapts to the real contexts using only a small amount of real context data, it helps greatly in realizing better context-aware recommender systems at lower cost.

This kind of problem is known as "model adaptation", "learning to learn", "transfer learning", or "multi-task learning" in statistical machine learning, and has been studied actively in recent years [6, 13, 15, 20]. In this area, methods have been developed to obtain good learning results (statistical models of data) by combining data from different but similar domains. Typical examples are acoustic model adaptation and language model adaptation in speech recognition systems [11, 9, 19]. Collaborative filtering can also be considered a case of multi-task learning [24].

Several methods have been proposed for model adaptation. In this work, we exploit methods based on Bayesian hierarchical modeling [7, 8] because of their simple and natural formulation. We construct a hierarchical model for preference model adaptation that combines real and supposed context data, and evaluate the model using food preference data.

The rest of the paper is organized as follows. Section 2 briefly introduces Bayesian hierarchical modeling and formulates our model for model adaptation in context-aware preference modeling. Section 3 describes experiments using food preference data, and Section 4 gives conclusions and future work.
2. BAYESIAN HIERARCHICAL MODELING

Bayesian hierarchical modeling is an effective method for the simultaneous estimation of several parameters over similar domains, and is used to capture the heterogeneity of subjects in areas such as marketing and ecology [5, 12, 18].

We have already proposed applying the following simple linear Gaussian hierarchical model to the problem of constructing a context-aware preference model that can model and predict ratings r_{ucs} by user u for item c in context s [4]:

\begin{align*}
r_{ucs} &\sim \mathrm{normal}(\mu_{ucs},\, 1/\tau), \\
\mu_{ucs} &= \mu_0 + a_u + b_c + c_s, \\
\tau &\sim \mathrm{gamma}(\nu, \theta), \\
\mu_0 &\sim \mathrm{normal}(\mu, \sigma^2), \\
a_u &\sim \mathrm{normal}(0,\, 1/\tau_a), \\
b_c &\sim \mathrm{normal}(0,\, 1/\tau_b), \\
c_s &\sim \mathrm{normal}(0,\, 1/\tau_c), \\
\tau_a &\sim \mathrm{gamma}(\nu, \theta), \\
\tau_b &\sim \mathrm{gamma}(\nu, \theta), \\
\tau_c &\sim \mathrm{gamma}(\nu, \theta).
\end{align*}

Here, normal(µ, 1/τ) denotes the Gaussian distribution with mean µ and variance 1/τ, and gamma denotes the Gamma distribution.
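For concreteness, the following is a minimal sketch of how this model could be written in the BUGS language for WinBUGS, the software used in our experiments (Section 3). The variable names, the flattened data layout (one record per rating with user, item, and context indices), and the treatment of µ, 1/σ², ν, and θ as data constants are our own assumptions rather than the authors' actual code; note that dnorm in BUGS takes a precision, not a variance.

```
# Minimal BUGS sketch of the single-domain hierarchical preference model.
# N ratings; user[i], item[i], ctx[i] index the U users, C items, S contexts.
# mu, prec.mu0 (= 1/sigma^2), nu, theta are supplied as data constants.
model {
  for (i in 1:N) {
    r[i] ~ dnorm(m[i], tau)                         # rating likelihood
    m[i] <- mu0 + a[user[i]] + b[item[i]] + c[ctx[i]]
  }
  for (u in 1:U) { a[u] ~ dnorm(0, tau.a) }         # user effects
  for (k in 1:C) { b[k] ~ dnorm(0, tau.b) }         # item effects
  for (s in 1:S) { c[s] ~ dnorm(0, tau.c) }         # context effects

  mu0   ~ dnorm(mu, prec.mu0)                       # global intercept
  tau   ~ dgamma(nu, theta)                         # observation precision
  tau.a ~ dgamma(nu, theta)
  tau.b ~ dgamma(nu, theta)
  tau.c ~ dgamma(nu, theta)
}
```

In practice the data and constants would be passed from R as a list and the unknown nodes monitored, for example via the R2WinBUGS package mentioned in Section 3.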
In this paper, we extend the above model for model adaptation by combining real and supposed context data as follows:

\begin{align*}
r^{(r)}_{ucs} &\sim \mathrm{normal}(\mu^{(r)}_{ucs},\, 1/\tau), \\
r^{(s)}_{ucs} &\sim \mathrm{normal}(\mu^{(s)}_{ucs},\, 1/\tau), \\
\mu^{(r)}_{ucs} &= \mu^{(r)}_0 + a^{(r)}_u + b^{(r)}_c + c^{(r)}_s, \\
\mu^{(s)}_{ucs} &= \mu^{(s)}_0 + a^{(s)}_u + b^{(s)}_c + c^{(s)}_s, \\
\tau &\sim \mathrm{gamma}(\nu, \theta), \\
\mu^{(r)}_0 &\sim \mathrm{normal}(\mu, \sigma^2), \\
\mu^{(s)}_0 &\sim \mathrm{normal}(\mu, \sigma^2), \\
a^{(r)}_u &\sim \mathrm{normal}(0,\, 1/\tau_a), \\
b^{(r)}_c &\sim \mathrm{normal}(0,\, 1/\tau_b), \\
c^{(r)}_s &\sim \mathrm{normal}(0,\, 1/\tau_c), \\
a^{(s)}_u &\sim \mathrm{normal}(0,\, 1/\tau_a), \\
b^{(s)}_c &\sim \mathrm{normal}(0,\, 1/\tau_b), \\
c^{(s)}_s &\sim \mathrm{normal}(0,\, 1/\tau_c), \\
\tau_a &\sim \mathrm{gamma}(\nu, \theta), \\
\tau_b &\sim \mathrm{gamma}(\nu, \theta), \\
\tau_c &\sim \mathrm{gamma}(\nu, \theta).
\end{align*}

Here, r^{(r)}_{ucs} denotes a rating in a real context, and r^{(s)}_{ucs} denotes the corresponding rating in a supposed context. This model is composed of two hierarchical context-aware preference models: the generative model of the real context data and that of the supposed context data. They are connected through the common hyper-hyper-parameters τ, τ_a, τ_b, τ_c. Through these common hyper-hyper-parameters, information in the supposed-context ratings can affect the posterior probability distribution of the predicted ratings in the real context model.
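Under the same assumptions as the previous sketch, the combined model could be coded as below; the essential point is that the two sub-models keep separate intercepts and user, item, and context effects, but share the precision nodes tau, tau.a, tau.b, tau.c. This is a sketch, not the authors' actual WinBUGS program.

```
# Sketch of the adaptation model: the real-context and supposed-context
# blocks share only the hyper-hyper-parameters tau, tau.a, tau.b, tau.c.
model {
  for (i in 1:Nr) {                                 # real-context ratings
    r.real[i] ~ dnorm(m.real[i], tau)
    m.real[i] <- mu0.r + a.r[user.r[i]] + b.r[item.r[i]] + c.r[ctx.r[i]]
  }
  for (j in 1:Ns) {                                 # supposed-context ratings
    r.sup[j] ~ dnorm(m.sup[j], tau)
    m.sup[j] <- mu0.s + a.s[user.s[j]] + b.s[item.s[j]] + c.s[ctx.s[j]]
  }
  for (u in 1:U) {
    a.r[u] ~ dnorm(0, tau.a)                        # real-context user effects
    a.s[u] ~ dnorm(0, tau.a)                        # supposed-context user effects
  }
  for (k in 1:C) {
    b.r[k] ~ dnorm(0, tau.b)                        # item effects
    b.s[k] ~ dnorm(0, tau.b)
  }
  for (s in 1:S) {
    c.r[s] ~ dnorm(0, tau.c)                        # context effects
    c.s[s] ~ dnorm(0, tau.c)
  }
  mu0.r ~ dnorm(mu, prec.mu0)                       # prec.mu0 = 1/sigma^2
  mu0.s ~ dnorm(mu, prec.mu0)
  tau   ~ dgamma(nu, theta)                         # shared by both blocks
  tau.a ~ dgamma(nu, theta)
  tau.b ~ dgamma(nu, theta)
  tau.c ~ dgamma(nu, theta)
}
```

One convenient way to obtain predictions for held-out real-context ratings (not necessarily what was done in the paper) is to enter them as NA in r.real and monitor those elements; WinBUGS then samples them from their posterior predictive distribution.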
3. EXPERIMENTS

We applied the proposed model to our context-aware food preference data and evaluated the accuracy of the predicted ratings in real contexts for unknown cases.

3.1 Data acquisition and preparation

In our previous work [17], we designed an internet questionnaire survey to collect corresponding data; that is, we asked subjects the same questions about food preferences in both real and supposed contexts and collected pairs of answers. The target contents were typical dishes served in food courts.

The survey was composed of two questionnaire surveys. The first questionnaire survey was conducted from 16th to 17th December 2008. The number of subjects was 746. Each subject evaluated 5 kinds of a la carte dishes randomly selected from 20 kinds of dishes such as "chicken steak", "beef steak", "beef curry", "pasta with cod roe", "Japanese noodle", etc., using a 5-grade rating scale ranging from "I do not want to order the dish at all" to "I want to order the dish very much". At the same time, the subjects reported their current degree of hunger on 3 levels (hungry, normal, full). After that, the subjects were asked to imagine being in a degree of hunger different from the current one, and answered their preference for the same 5 dishes. In total, preferences for 5 dishes in three different contexts (degrees of hunger) were collected. Among the three contexts, one is real and two are supposed.

The second survey was conducted on other days, from 22nd to 24th December 2008. All subjects who answered the first survey were given the same questions as in the first survey, and we extracted the subjects who reported a degree of hunger different from that in the first survey. After filtering out unreliable subjects, the number of extracted subjects was 212.

By combining the results of the two surveys, we obtained corresponding preferences for 5 dishes in 2 different degrees of hunger per subject. Hence the total number of ratings was 2,120. Figure 1 shows the structure of the whole data set, and Figure 2 shows examples of answers in the two surveys and examples of the combined corresponding data.

[Figure 1: Structure of the whole data set [17]. The corresponding real context data (2,120 records) and the corresponding supposed context data (2,120 records) are merged into the combined corresponding real and supposed context data (2,120 records, each holding both a real and a supposed rating); a further 2,120 records of independent supposed context data have no real-context counterpart.]

[Figure 2: Examples of ratings in the two surveys and of the combined corresponding data [3]. For instance, subject 1 rated "Noodle" 3 in the real context "full" in the first survey and 1 in the real context "hungry" in the second survey; pairing each real rating with the supposed rating for the same subject, dish, and degree of hunger from the other survey yields the combined records.]

We divided the dataset into training data and test data. First, we randomly left one real context rating out of the 10 ratings of each subject for evaluation. The remaining 9 real context ratings per subject were available as training data. In order to evaluate the effect of the amount of real context training data on the constructed preference model, we changed the number L of real context ratings per subject used for model construction from 0 (supposed context data only) to 9. For the supposed context data, all 10 ratings per subject were used for model construction.

We repeated the experiment 10 times with different divisions of the real context data and evaluated the accuracy of the predicted ratings for the left-out test data in real contexts. We report the average and standard deviation of the mean squared error (MSE) of the predictions. We also evaluated the prediction accuracy of a model constructed with only real context data.

Experiments were conducted with the open-source statistical computing software R and the Bayesian Monte Carlo simulation software WinBUGS [14, 22]. For connecting R to WinBUGS, we used the R package R2WinBUGS. We set µ = 2.0, σ = 10, ν = 2.0, θ = 1.0; the results are robust with respect to the values of these parameters.
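Plugged into the prior lines of the BUGS sketches above, and reading gamma(ν, θ) as shape ν and rate θ (our assumption, numerically irrelevant here since θ = 1.0) while converting σ = 10 into the precision 1/σ² = 0.01 required by dnorm, these settings become:

```
# Hyper-prior lines of the BUGS sketches with the reported settings filled in:
# mu = 2.0, sigma = 10 (precision 0.01), nu = 2.0, theta = 1.0.
mu0.r ~ dnorm(2.0, 0.01)
mu0.s ~ dnorm(2.0, 0.01)
tau   ~ dgamma(2.0, 1.0)
tau.a ~ dgamma(2.0, 1.0)
tau.b ~ dgamma(2.0, 1.0)
tau.c ~ dgamma(2.0, 1.0)
```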
3.2 Result and discussion

Table 1 shows the average and standard deviation of the MSE for each value L = 0, ..., 9 of the number of real context ratings per subject used for training. Standard deviations are given in brackets. The average MSE values are also visualized in Figure 3.

Table 1: Average of mean squared error over 10 experiments (standard deviations in brackets)

  L   Supposed Context Data + Real Context Data   Only Real Context Data
  0   1.95 (0.10)                                 -
  1   1.66 (0.17)                                 1.71 (0.20)
  2   1.50 (0.14)                                 1.49 (0.14)
  3   1.51 (0.11)                                 1.51 (0.12)
  4   1.48 (0.12)                                 1.48 (0.12)
  5   1.43 (0.13)                                 1.43 (0.12)
  6   1.43 (0.13)                                 1.42 (0.11)
  7   1.42 (0.11)                                 1.42 (0.11)
  8   1.40 (0.12)                                 1.39 (0.13)
  9   1.40 (0.12)                                 1.39 (0.13)

[Figure 3: Effect of model adaptation. MSE (vertical axis, roughly 1.2 to 2.0) plotted against L (horizontal axis, 0 to 9) for "Supposed Context Data + Real Context Data" and "Real Context Data Only".]

These results demonstrate that as the number of real context ratings L increases, the MSE of the predicted ratings in real contexts decreases monotonically. Hence, model adaptation that combines a small amount of real context data with a large amount of supposed context data is verified to be effective.

In particular, the performance for L = 1, that is, constructing the model with 10 supposed context ratings plus 1 real context rating per subject, is much better than the performance for L = 0, that is, constructing the model with supposed context data only. This demonstrates that
• constructing a preference model with only supposed context data is dangerous, and
• a very small amount of real context data can improve the model.

However, the results also show that models constructed with only a small number of real context ratings perform rather well. Even using only 2 real context ratings per subject, the performance of the model is almost equal to that of the model constructed by combining supposed and real context data. This is because Bayesian hierarchical models are able to make robust predictions even when the amount of training data is very small. Thus, using the supposed context data is effective only in the L = 1 (cold start) case.

4. CONCLUSION AND FUTURE WORK

In this paper, we proposed applying Bayesian hierarchical modeling to preference model adaptation by combining real and supposed context data. The results of the experiments with food preference data demonstrate that the model adaptation is effective in particular when only a very small amount of real context data is available. This means that model adaptation provides a solution to the cold start problem in context-aware recommender systems. Note that Umyarov and Tuzhilin observed a very similar phenomenon in a different setting: they showed that a small amount of aggregated external rating data can significantly improve the performance of a Bayesian hierarchical preference model [21].

There are several directions for future work. The first is more intensive evaluation. In this paper we evaluated the method with a small-scale dataset. As the numbers of users, items, and contexts increase, more training data is necessary for constructing good preference models; hence the importance of model adaptation is expected to increase. Evaluating with data from domains other than food preference is also important.

The second is to apply the method to different base models. The model adaptation technique with Bayesian hierarchical modeling is independent of the generative model of the ratings. In this work, we used a simple linear Gaussian model of rating generation. More elaborate generative models of ratings, such as probabilistic tensor factorization models [10, 23], can be used instead. Using generative models for ordered ratings may also be effective.

The third is to investigate other model adaptation techniques. The proposed model adaptation technique is bi-directional; that is, the combined model is symmetric with respect to the source and target domains. Investigating more directional model adaptation techniques is an interesting direction for future work.

Acknowledgments.

We thank Dr. Yasuyuki Nakajima, President and CEO of KDDI R&D Laboratories Inc., for his continuous support of this study. This work was supported in part by JSPS KAKENHI 20650030.
5. REFERENCES

[1] A. Ansari, S. Essegaier, and R. Kohli. Internet recommendation systems. Journal of Marketing Research, vol. 37, no. 3, 2000.
[2] H. Asoh, C. Ono, and Y. Motomura. A movie recommendation method considering both users' personality and situation. In Proceedings of the ECAI 2006 Workshop on Recommender Systems, pp. 45-48, 2006.
[3] H. Asoh, C. Ono, and Y. Motomura. An analysis of differences between preferences in real and supposed contexts. In Proceedings of the 2nd Workshop on Context-Aware Recommender Systems (CARS-2010), 2010.
[4] H. Asoh, C. Ono, and Y. Motomura. A Bayesian hierarchical preference model for context-aware recommendations. In Adjunct Proceedings of UMAP 2010, 2010.
[5] P. Congdon. Bayesian Statistical Modelling, Second Edition. Wiley, 2006.
[6] H. Daume III and D. Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, vol. 26, pp. 101-126, 2006.
[7] H. Daume III. Bayesian multitask learning with latent hierarchies. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 2009.
[8] J. R. Finkel and C. D. Manning. Hierarchical Bayesian domain adaptation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pp. 602-610, 2009.
[9] D. Gildea and T. Hofmann. Topic-based language models using EM. In Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech 99), pp. 2167-2170, 1999.
[10] A. Karatzoglou, X. Amatriain, N. Oliver, and L. Baltrunas. Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of ACM Recommender Systems 2010, 2010.
[11] C. Leggetter and P. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, vol. 9, no. 2, pp. 171-185, 1995.
[12] M. A. McCarthy. Bayesian Methods for Ecology. Cambridge University Press, New York, 2007.
[13] NIPS 2005 Workshop "Inductive Transfer: 10 Years Later". http://iitrl.acadiau.ca/itws05/
[14] I. Ntzoufras. Bayesian Modeling Using WinBUGS. Wiley, 2009.
[15] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
[16] C. Ono, M. Kurokawa, Y. Motomura, and H. Asoh. A context-aware movie preference model using a Bayesian network for recommendation and promotion. In User Modeling 2007: 11th International Conference, UM 2007, Corfu, Greece, July 2007, Proceedings, LNCS vol. 4511, pp. 247-257, Springer-Verlag, 2007.
[17] C. Ono, Y. Takishima, Y. Motomura, and H. Asoh. Context-aware preference model based on a study of difference between real and supposed context data. In User Modeling, Adaptation, and Personalization: 17th International Conference, UMAP 2009, Proceedings, LNCS vol. 5535, pp. 102-113, 2009.
[18] P. E. Rossi, G. M. Allenby, and R. McCulloch. Bayesian Statistics and Marketing. Wiley, 2005.
[19] T. Takiguchi. Statistical Acoustic Model Adaptation for Robust Speech Recognition in Noisy Reverberant Environments. Doctoral thesis, Nara Institute of Science and Technology, 1999.
[20] S. Thrun and L. Pratt (eds.). Learning to Learn. Kluwer Academic Publishers, 1998.
[21] A. Umyarov and A. Tuzhilin. Using external aggregate ratings for improving individual recommendations. ACM Transactions on the Web, vol. 5, no. 1, article 3, 2011.
[22] WinBUGS. http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml
[23] L. Xiong, X. Chen, T.-K. Huang, J. Schneider, and J. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In Proceedings of SIAM Data Mining 2010 (SDM 10), 2010.
[24] K. Yu and V. Tresp. Learning to learn and collaborative filtering. In NIPS 2005 Workshop "Inductive Transfer: 10 Years Later", 2005.