Model Adaptation with Bayesian Hierarchical Modeling for Context-Aware Recommendation

Hideki Asoh, AIST, AIST Tsukuba Central 2, 1-1-1 Umezono, Tsukuba, Ibaraki 305-8568, Japan, h.asoh@aist.go.jp
Yoichi Motomura, AIST, AIST Tokyo Waterfront, 2-3-26 Aomi, Koutou-ku, Tokyo 135-0064, Japan, y.motomura@aist.go.jp
Chihiro Ono, KDDI R&D Laboratories Inc., 2-1-15 Ohara, Fujimino, Saitama 365-8502, Japan, ono@kddilabs.jp

ABSTRACT

Model adaptation is the process of modifying a model trained with a large amount of training data from a source domain so that it fits a specific, similar target domain, using a small amount of adaptation data from the target domain. Bayesian hierarchical modeling is well known as a general tool for model adaptation and multi-task learning, and is widely used in areas such as marketing, ecology, medicine, and education to model heterogeneity in the phenomena of interest. In this work, we propose to apply Bayesian hierarchical modeling to the problem of preference modeling, where a model trained with a large amount of supposed context data is adapted to the real context using an additional small amount of real context data. The effectiveness of the proposed method is evaluated by experiments using context-aware food preference data.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous

General Terms

Experimentation, Human factors, Measurement

Keywords

Model Adaptation, Preference Modeling, Context Awareness

CARS-2011, October 23, 2011, Chicago, Illinois, USA. Copyright is held by the author/owner(s).

1. INTRODUCTION

Modeling users' preferences is an important element of recommender systems. We have constructed several context-aware attribute-based recommender systems that use Bayesian networks for modeling users' preferences [2, 16]. In the course of their construction, collecting a large amount of data about users' preferences through inquiries is necessary. In particular, to make the model context-aware, users' preference data should be collected under various contexts. However, putting the subjects of inquiries into various contexts and collecting answers from them is often difficult and costly. Hence, answers are often collected in supposed contexts, i.e. contexts that the subjects pretend or imagine to be in. Although there may be differences between preferences in real contexts and in supposed contexts, these differences are usually not taken seriously.

In our previous works, we collected users' preferences for various dishes in both real and supposed contexts and showed that the difference is statistically significant and not negligible [17]. We also analyzed the statistical nature of the differences and demonstrated that the structure of preferences in supposed contexts is simpler than that of preferences in real contexts [3]. These studies suggest that it is dangerous to construct preference models using data collected only in supposed contexts.

In this work, we pursue the possibility of constructing better preference models by combining data from supposed contexts and real contexts. Although there are differences between preferences in real contexts and in supposed contexts, they are similar to some extent, and collecting data in supposed contexts is much cheaper than in real contexts. Hence, if we can modify a model constructed from a large amount of supposed context data so that it adapts to the real contexts using only a small amount of real context data, it helps greatly in realizing better context-aware recommender systems at lower cost.

This kind of problem is known as "model adaptation", "learning to learn", "transfer learning", or "multi-task learning" in statistical machine learning, and has been studied actively in recent years [6, 13, 15, 20]. In this area, methods have been developed to obtain good learning results (statistical models of data) by combining data from different but similar domains. Typical examples are acoustic model adaptation and language model adaptation in speech recognition systems [11, 9, 19]. Collaborative filtering can also be considered a case of multi-task learning [24].

Several methods have been proposed for model adaptation. In this work, we exploit methods based on Bayesian hierarchical modeling [7, 8] because of their simple and natural formulation. We construct a hierarchical model for preference model adaptation that combines real and supposed context data, and evaluate the model using food preference data.

The rest of the paper is organized as follows. Section 2 briefly introduces Bayesian hierarchical modeling and formulates our model for model adaptation in context-aware preference modeling. Section 3 describes experiments using food preference data, and Section 4 gives conclusions and future work.
2. BAYESIAN HIERARCHICAL MODELING

Bayesian hierarchical modeling is an effective method for the simultaneous estimation of several parameters over similar domains, and is used to capture the heterogeneity of subjects in areas such as marketing and ecology [5, 12, 18].

We have already proposed applying the following simple linear Gaussian hierarchical model to the problem of constructing a context-aware preference model that can model and predict ratings r_{ucs} by user u for item c in context s [4]:

\begin{align*}
r_{ucs} &\sim \mathrm{normal}(\mu_{ucs},\, 1/\tau), \\
\mu_{ucs} &= \mu_0 + a_u + b_c + c_s, \\
\tau &\sim \mathrm{gamma}(\nu, \theta), \\
\mu_0 &\sim \mathrm{normal}(\mu, \sigma^2), \\
a_u &\sim \mathrm{normal}(0,\, 1/\tau_a), \\
b_c &\sim \mathrm{normal}(0,\, 1/\tau_b), \\
c_s &\sim \mathrm{normal}(0,\, 1/\tau_c), \\
\tau_a &\sim \mathrm{gamma}(\nu, \theta), \\
\tau_b &\sim \mathrm{gamma}(\nu, \theta), \\
\tau_c &\sim \mathrm{gamma}(\nu, \theta).
\end{align*}

Here, normal(µ, 1/τ) denotes the Gaussian distribution with mean µ and variance 1/τ, and gamma denotes the Gamma distribution.
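For concreteness, the following is a minimal sketch of how this model could be written in the BUGS language for WinBUGS, the software used in our experiments (Section 3). The variable names, the flattened data layout (one record per rating with user, item, and context indices), and the treatment of µ, 1/σ², ν, and θ as data constants are our own assumptions rather than the authors' actual code; note that dnorm in BUGS takes a precision, not a variance.

```
# Minimal BUGS sketch of the single-domain hierarchical preference model.
# N ratings; user[i], item[i], ctx[i] index the U users, C items, S contexts.
# mu, prec.mu0 (= 1/sigma^2), nu, theta are supplied as data constants.
model {
  for (i in 1:N) {
    r[i] ~ dnorm(m[i], tau)                         # rating likelihood
    m[i] <- mu0 + a[user[i]] + b[item[i]] + c[ctx[i]]
  }
  for (u in 1:U) { a[u] ~ dnorm(0, tau.a) }         # user effects
  for (k in 1:C) { b[k] ~ dnorm(0, tau.b) }         # item effects
  for (s in 1:S) { c[s] ~ dnorm(0, tau.c) }         # context effects

  mu0   ~ dnorm(mu, prec.mu0)                       # global intercept
  tau   ~ dgamma(nu, theta)                         # observation precision
  tau.a ~ dgamma(nu, theta)
  tau.b ~ dgamma(nu, theta)
  tau.c ~ dgamma(nu, theta)
}
```

In practice the data and constants would be passed from R as a list and the unknown nodes monitored, for example via the R2WinBUGS package mentioned in Section 3.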
In this paper, we extend the above model for model adaptation by combining real and supposed context data as follows:

\begin{align*}
r^{(r)}_{ucs} &\sim \mathrm{normal}(\mu^{(r)}_{ucs},\, 1/\tau), \\
r^{(s)}_{ucs} &\sim \mathrm{normal}(\mu^{(s)}_{ucs},\, 1/\tau), \\
\mu^{(r)}_{ucs} &= \mu^{(r)}_0 + a^{(r)}_u + b^{(r)}_c + c^{(r)}_s, \\
\mu^{(s)}_{ucs} &= \mu^{(s)}_0 + a^{(s)}_u + b^{(s)}_c + c^{(s)}_s, \\
\tau &\sim \mathrm{gamma}(\nu, \theta), \\
\mu^{(r)}_0 &\sim \mathrm{normal}(\mu, \sigma^2), \\
\mu^{(s)}_0 &\sim \mathrm{normal}(\mu, \sigma^2), \\
a^{(r)}_u &\sim \mathrm{normal}(0,\, 1/\tau_a), \\
b^{(r)}_c &\sim \mathrm{normal}(0,\, 1/\tau_b), \\
c^{(r)}_s &\sim \mathrm{normal}(0,\, 1/\tau_c), \\
a^{(s)}_u &\sim \mathrm{normal}(0,\, 1/\tau_a), \\
b^{(s)}_c &\sim \mathrm{normal}(0,\, 1/\tau_b), \\
c^{(s)}_s &\sim \mathrm{normal}(0,\, 1/\tau_c), \\
\tau_a &\sim \mathrm{gamma}(\nu, \theta), \\
\tau_b &\sim \mathrm{gamma}(\nu, \theta), \\
\tau_c &\sim \mathrm{gamma}(\nu, \theta).
\end{align*}

Here, r^{(r)}_{ucs} denotes a rating in a real context, and r^{(s)}_{ucs} denotes the corresponding rating in a supposed context. This model is composed of two hierarchical context-aware preference models: the generative model of the real context data and that of the supposed context data. They are connected through the common hyper-hyper-parameters τ, τ_a, τ_b, τ_c. Through these common hyper-hyper-parameters, information in the supposed-context ratings can affect the posterior probability distribution of the predicted ratings in the real context model.
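Under the same assumptions as the previous sketch, the combined model could be coded as below; the essential point is that the two sub-models keep separate intercepts and user, item, and context effects, but share the precision nodes tau, tau.a, tau.b, tau.c. This is a sketch, not the authors' actual WinBUGS program.

```
# Sketch of the adaptation model: the real-context and supposed-context
# blocks share only the hyper-hyper-parameters tau, tau.a, tau.b, tau.c.
model {
  for (i in 1:Nr) {                                 # real-context ratings
    r.real[i] ~ dnorm(m.real[i], tau)
    m.real[i] <- mu0.r + a.r[user.r[i]] + b.r[item.r[i]] + c.r[ctx.r[i]]
  }
  for (j in 1:Ns) {                                 # supposed-context ratings
    r.sup[j] ~ dnorm(m.sup[j], tau)
    m.sup[j] <- mu0.s + a.s[user.s[j]] + b.s[item.s[j]] + c.s[ctx.s[j]]
  }
  for (u in 1:U) {
    a.r[u] ~ dnorm(0, tau.a)                        # real-context user effects
    a.s[u] ~ dnorm(0, tau.a)                        # supposed-context user effects
  }
  for (k in 1:C) {
    b.r[k] ~ dnorm(0, tau.b)                        # item effects
    b.s[k] ~ dnorm(0, tau.b)
  }
  for (s in 1:S) {
    c.r[s] ~ dnorm(0, tau.c)                        # context effects
    c.s[s] ~ dnorm(0, tau.c)
  }
  mu0.r ~ dnorm(mu, prec.mu0)                       # prec.mu0 = 1/sigma^2
  mu0.s ~ dnorm(mu, prec.mu0)
  tau   ~ dgamma(nu, theta)                         # shared by both blocks
  tau.a ~ dgamma(nu, theta)
  tau.b ~ dgamma(nu, theta)
  tau.c ~ dgamma(nu, theta)
}
```

One convenient way to obtain predictions for held-out real-context ratings (not necessarily what was done in the paper) is to enter them as NA in r.real and monitor those elements; WinBUGS then samples them from their posterior predictive distribution.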
3. EXPERIMENTS

We applied the proposed model to our context-aware food preference data and evaluated the accuracy of the predicted ratings in real contexts for unknown cases.

3.1 Data acquisition and preparation

In our previous work [17], we designed an internet questionnaire survey to collect corresponding data; that is, we asked subjects the same questions about food preferences in both real and supposed contexts and collected pairs of answers. The target contents were typical dishes served in food courts.

The survey was composed of two questionnaire surveys. The first questionnaire survey was conducted from 16th to 17th December 2008. The number of subjects was 746. Each subject evaluated 5 kinds of a la carte dishes randomly selected from 20 kinds of dishes such as "chicken steak", "beef steak", "beef curry", "pasta with cod roe", "Japanese noodle", etc., using a 5-grade rating scale ranging from "I do not want to order the dish at all" to "I want to order the dish very much". At the same time, the subjects reported their current degree of hunger on 3 levels (hungry, normal, full). After that, the subjects were asked to imagine being in a degree of hunger different from the current one, and answered their preference for the same 5 dishes. In total, preferences for 5 dishes in three different contexts (degrees of hunger) were collected. Among the three contexts, one is real and two are supposed.

The second survey was conducted on other days, from 22nd to 24th December 2008. All subjects who answered the first survey were given the same questions as in the first survey, and we extracted the subjects who reported a degree of hunger different from that in the first survey. After filtering out unreliable subjects, the number of extracted subjects was 212.

By combining the results of the two surveys, we obtained corresponding preferences for 5 dishes in 2 different degrees of hunger per subject. Hence the total number of ratings was 2,120. Figure 1 shows the structure of the whole data set, and Figure 2 shows examples of answers in the two surveys and examples of the combined corresponding data.

[Figure 1: Structure of the whole data set [17]. The corresponding real context data (2,120 records) and the corresponding supposed context data (2,120 records) are merged into the combined corresponding real and supposed context data (2,120 records, each holding both a real and a supposed rating); a further 2,120 records of independent supposed context data have no real-context counterpart.]

[Figure 2: Examples of ratings in the two surveys and of the combined corresponding data [3]. For instance, subject 1 rated "Noodle" 3 in the real context "full" in the first survey and 1 in the real context "hungry" in the second survey; pairing each real rating with the supposed rating for the same subject, dish, and degree of hunger from the other survey yields the combined records.]

We divided the dataset into training data and test data. First, we randomly left one real context rating out of the 10 ratings of each subject for evaluation. The remaining 9 real context ratings per subject were available as training data. In order to evaluate the effect of the amount of real context training data on the constructed preference model, we changed the number L of real context ratings per subject used for model construction from 0 (supposed context data only) to 9. For the supposed context data, all 10 ratings per subject were used for model construction.

We repeated the experiment 10 times with different divisions of the real context data and evaluated the accuracy of the predicted ratings for the left-out test data in real contexts. We report the average and standard deviation of the mean squared error (MSE) of the predictions. We also evaluated the prediction accuracy of a model constructed with only real context data.

Experiments were conducted with the open-source statistical computing software R and the Bayesian Monte Carlo simulation software WinBUGS [14, 22]. For connecting R to WinBUGS, we used the R package R2WinBUGS. We set µ = 2.0, σ = 10, ν = 2.0, θ = 1.0; the results are robust with respect to the values of these parameters.
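Plugged into the prior lines of the BUGS sketches above, and reading gamma(ν, θ) as shape ν and rate θ (our assumption, numerically irrelevant here since θ = 1.0) while converting σ = 10 into the precision 1/σ² = 0.01 required by dnorm, these settings become:

```
# Hyper-prior lines of the BUGS sketches with the reported settings filled in:
# mu = 2.0, sigma = 10 (precision 0.01), nu = 2.0, theta = 1.0.
mu0.r ~ dnorm(2.0, 0.01)
mu0.s ~ dnorm(2.0, 0.01)
tau   ~ dgamma(2.0, 1.0)
tau.a ~ dgamma(2.0, 1.0)
tau.b ~ dgamma(2.0, 1.0)
tau.c ~ dgamma(2.0, 1.0)
```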
3.2 Result and discussion

Table 1 shows the average and standard deviation of the MSE for each value L = 0, ..., 9 of the number of real context ratings per subject used for training. Standard deviations are given in brackets. The average MSE values are also visualized in Figure 3.

Table 1: Average of mean squared error over 10 experiments (standard deviations in brackets)

  L   Supposed Context Data + Real Context Data   Only Real Context Data
  0   1.95 (0.10)                                 -
  1   1.66 (0.17)                                 1.71 (0.20)
  2   1.50 (0.14)                                 1.49 (0.14)
  3   1.51 (0.11)                                 1.51 (0.12)
  4   1.48 (0.12)                                 1.48 (0.12)
  5   1.43 (0.13)                                 1.43 (0.12)
  6   1.43 (0.13)                                 1.42 (0.11)
  7   1.42 (0.11)                                 1.42 (0.11)
  8   1.40 (0.12)                                 1.39 (0.13)
  9   1.40 (0.12)                                 1.39 (0.13)

[Figure 3: Effect of model adaptation. MSE (vertical axis, roughly 1.2 to 2.0) plotted against L (horizontal axis, 0 to 9) for "Supposed Context Data + Real Context Data" and "Real Context Data Only".]

These results demonstrate that as the number of real context ratings L increases, the MSE of the predicted ratings in real contexts decreases monotonically. Hence, model adaptation that combines a small amount of real context data with a large amount of supposed context data is verified to be effective.

In particular, the performance for L = 1, that is, constructing the model with 10 supposed context ratings plus 1 real context rating per subject, is much better than the performance for L = 0, that is, constructing the model with supposed context data only. This demonstrates that
• constructing a preference model with only supposed context data is dangerous, and
• a very small amount of real context data can improve the model.

However, the results also show that models constructed with only a small number of real context ratings perform rather well. Even using only 2 real context ratings per subject, the performance of the model is almost equal to that of the model constructed by combining supposed and real context data. This is because Bayesian hierarchical models are able to make robust predictions even when the amount of training data is very small. Thus, using the supposed context data is effective only in the L = 1 (cold start) case.

4. CONCLUSION AND FUTURE WORK

In this paper, we proposed applying Bayesian hierarchical modeling to preference model adaptation by combining real and supposed context data. The results of the experiments with food preference data demonstrate that the model adaptation is effective in particular when only a very small amount of real context data is available. This means that model adaptation provides a solution to the cold start problem in context-aware recommender systems. Note that Umyarov and Tuzhilin observed a very similar phenomenon in a different setting: they showed that a small amount of aggregated external rating data can significantly improve the performance of a Bayesian hierarchical preference model [21].

There are several directions for future work. The first is more intensive evaluation. In this paper we evaluated the method with a small-scale dataset. As the numbers of users, items, and contexts increase, more training data is necessary for constructing good preference models; hence the importance of model adaptation is expected to increase. Evaluating with data from domains other than food preference is also important.

The second is to apply the method to different base models. The model adaptation technique with Bayesian hierarchical modeling is independent of the generative model of the ratings. In this work, we used a simple linear Gaussian model of rating generation. More elaborate generative models of ratings, such as probabilistic tensor factorization models [10, 23], can be used instead. Using generative models for ordered ratings may also be effective.

The third is to investigate other model adaptation techniques. The proposed model adaptation technique is bi-directional; that is, the combined model is symmetric with respect to the source and target domains. Investigating more directional model adaptation techniques is an interesting direction for future work.

Acknowledgments.

We thank Dr. Yasuyuki Nakajima, President and CEO of KDDI R&D Laboratories Inc., for his continuous support of this study. This work was supported in part by JSPS KAKENHI 20650030.
5. REFERENCES

[1] A. Ansari, S. Essegaier, and R. Kohli. Internet recommendation systems. Journal of Marketing Research, vol. 37, no. 3, 2000.
[2] H. Asoh, C. Ono, and Y. Motomura. A movie recommendation method considering both users' personality and situation. In Proceedings of the ECAI 2006 Workshop on Recommender Systems, pp. 45-48, 2006.
[3] H. Asoh, C. Ono, and Y. Motomura. An analysis of differences between preferences in real and supposed contexts. In Proceedings of the 2nd Workshop on Context-Aware Recommender Systems (CARS-2010), 2010.
[4] H. Asoh, C. Ono, and Y. Motomura. A Bayesian hierarchical preference model for context-aware recommendations. In Adjunct Proceedings of UMAP 2010, 2010.
[5] P. Congdon. Bayesian Statistical Modelling, Second Edition. Wiley, 2006.
[6] H. Daume III and D. Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, vol. 26, pp. 101-126, 2006.
[7] H. Daume III. Bayesian multitask learning with latent hierarchies. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 2009.
[8] J. R. Finkel and C. D. Manning. Hierarchical Bayesian domain adaptation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, pp. 602-610, 2009.
[9] D. Gildea and T. Hofmann. Topic-based language models using EM. In Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech 99), pp. 2167-2170, 1999.
[10] A. Karatzoglou, X. Amatriain, N. Oliver, and L. Baltrunas. Multiverse recommendation: N-dimensional tensor factorization for context-aware collaborative filtering. In Proceedings of ACM Recommender Systems 2010, 2010.
[11] C. Leggetter and P. Woodland. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language, vol. 9, no. 2, pp. 171-185, 1995.
[12] M. A. McCarthy. Bayesian Methods for Ecology. Cambridge University Press, New York, 2007.
[13] NIPS 2005 Workshop "Inductive Transfer: 10 Years Later". http://iitrl.acadiau.ca/itws05/
[14] I. Ntzoufras. Bayesian Modeling Using WinBUGS. Wiley, 2009.
[15] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
[16] C. Ono, M. Kurokawa, Y. Motomura, and H. Asoh. A context-aware movie preference model using a Bayesian network for recommendation and promotion. In User Modeling 2007: 11th International Conference, UM 2007, Corfu, Greece, July 2007, Proceedings, LNCS vol. 4511, pp. 247-257, Springer-Verlag, 2007.
[17] C. Ono, Y. Takishima, Y. Motomura, and H. Asoh. Context-aware preference model based on a study of difference between real and supposed context data. In User Modeling, Adaptation, and Personalization: 17th International Conference, UMAP 2009, Proceedings, LNCS vol. 5535, pp. 102-113, 2009.
[18] P. E. Rossi, G. M. Allenby, and R. McCulloch. Bayesian Statistics and Marketing. Wiley, 2005.
[19] T. Takiguchi. Statistical Acoustic Model Adaptation for Robust Speech Recognition in Noisy Reverberant Environments. Doctoral thesis, Nara Institute of Science and Technology, 1999.
[20] S. Thrun and L. Pratt (eds.). Learning to Learn. Kluwer Academic Publishers, 1998.
[21] A. Umyarov and A. Tuzhilin. Using external aggregate ratings for improving individual recommendations. ACM Transactions on the Web, vol. 5, no. 1, article 3, 2011.
[22] WinBUGS. http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml
[23] L. Xiong, X. Chen, T.-K. Huang, J. Schneider, and J. Carbonell. Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In Proceedings of SIAM Data Mining 2010 (SDM 10), 2010.
[24] K. Yu and V. Tresp. Learning to learn and collaborative filtering. In NIPS 2005 Workshop "Inductive Transfer: 10 Years Later", 2005.