An Explanatory Matrix Factorization with User Comments Data*
                              Donghyun Kim                                                       Hayong Shin
        Department of Industrial and Systems Engineering                     Department of Industrial and Systems Engineering
       Korea Advanced Institute of Science and Technology                   Korea Advanced Institute of Science and Technology
                         South Korea                                                           South Korea
                      dhk618@kaist.ac.kr                                                    hyshin@kaist.ac.kr


ABSTRACT
Matrix factorization is one of the core algorithms of recommendation systems. It
assumes that the relationship between users and contents can be explained by hidden
latent variables. However, the meaning of these hidden latent variables is not
intuitive to understand. Therefore, this study suggests a way to learn that meaning
from supplementary data such as comments and to use it in matrix factorization. The
data used in this study is user comment data from Naver, which is the largest web
platform and also the largest Webtoon (web comic) platform in South Korea. We show
that the suggested method, which uses supervised latent variables, fits users with a
distinct tendency well compared to conventional matrix factorization.

KEYWORDS
Matrix Factorization, Explanatory Analysis, Latent Dirichlet Allocation

1    INTRODUCTION
A recommendation system (RS) collects information in order to analyze a user's taste.
Numerous methods have been proposed for RS, and one of the dominant methods is matrix
factorization (MF), which predicts the missing values of a score matrix composed of
the evaluations of contents given by users [3]. MF improved the quality of RS
significantly, but some issues remain, such as the cold-start problem and insufficient
explanatory power. MF decomposes the original score matrix into low-rank matrices with
latent features, which makes it tractable, but the meaning of each latent feature is
difficult to analyze.
  Many works use user reviews to assist RS. It has also been shown that an appropriate
latent factor model using topic selection with LDA performs better than the existing
model [2, 4]. However, previous studies assume that a score matrix exists and use
reviews to improve it, whereas this study has no score matrix and focuses on the
explanatory power of MF rather than on the RMSE itself. The methodology itself is not
new as part of the research that uses user reviews to improve RS, but the two popular
methodologies, MF and Latent Dirichlet Allocation (LDA), are combined appropriately to
give explanatory power. This study also differs in using user comment data about
Webtoons, which was not used previously.

2   EXPLANATORY MATRIX FACTORIZATION
In this study, we introduce a method to utilize domain knowledge by combining LDA and
MF. LDA has explanatory power on topics, and MF explains the relationship between
users and contents with hidden latent variables. However, the meaning of each latent
variable is difficult to grasp. Therefore, we first derive explanatory power from the
supplementary data $d_j$, $j = 1, \dots, J$, such as user comments or reports, using
LDA. LDA assumes that several topics are mixed in each document, and its results can
be used to analyze the themes of documents. There are many LDA algorithms to infer
topics [5], so we do not describe conventional LDA algorithms in detail. Let $t_k$,
$k = 1, \dots, K$, be a topic from LDA; then each user $i$ can be represented as a
vector $u_i = (u_{i1}, \dots, u_{iK})$, where $u_{ik}$ indicates how much user $i$
talks about topic $t_k$, which is an indirect indicator of what the user likes.
Therefore, we can build a user-topic relationship matrix
$U = [u_1, u_2, \dots, u_N]^T$ for all $N$ users and use it directly in MF. The core
of MF is to divide a user-contents rating matrix $M$ into two low-rank matrices,
$M = UV$, which need to be estimated. However, if $U$ can be obtained sufficiently
well from supplementary data, MF turns into a simple matrix inverse problem. Thus,
$V$ can be obtained simply through the Moore-Penrose pseudoinverse, with
$V = U^{+} M$.

3   EXPERIMENT

3.1 Data Collection and Refinement
We use user comment data from Naver, which is Korea's largest web and Webtoon
platform. However, since Naver does not provide any formalized data, the data was
collected and refined through web crawling with Python. The collected data contains 4
features (Title-Episodes-userID-Comment). The raw data has over 100K users, 151
Webtoons, 21,927 episodes, and over 110 million comments. Since this raw data needs
more than 10TB of capacity, hardware limitations forced us to restrict the study to a
smaller data set. Also, unlike commonly used reviews, the comments contain a lot of
useless data, because comments can be written without any restrictions, such as


* Copyright is held by the author(s).
  RecSys 2017 Poster Proceedings, August 27-31, 2017, Como, Italy

multiple comments being allowed on the same item. Therefore, in this study, we chose
3,000 users who kept to the grammar as much as possible and consistently wrote
comments of reasonable length (more than 70 characters on average) over a certain
period (at least 12 weeks). Compared to the full data, 3,000 users is a small number,
but since the data used in this paper is very different from the user reviews usually
used in other papers, it was important to refine useful data before analysis. The
refined data contains 149 Webtoons and 1.1 million comments.

3.2 Experiment Settings and Result
The topics are modeled through LDA on the selected 1.1 million comments. In this
experiment, the number of topics was set to 10, 20, and 30, and the number of hidden
latent variables of MF was set to the same value in each case. Topic selection is a
very important task, but it is too vague to use the whole data as it is. Therefore,
some topics are collected from each Webtoon, and some topics are obtained from the
whole data. Some of the noticeable topics are listed in Table 1. Topic 3 is mainly
composed of words about the story of a comic, and Topic 7 is composed of words about
the drawing style of a comic. Although not all topics can be identified as clearly as
these, there are more topics that can be interpreted, such as the attitude of the
artist.

   Table 1: Noticeable topics from comment data (number of topics = 30)

        Topic 3                Topic 7             Topic 21
        Sick of                Beautiful
        Crazy                  Sick of             Funny
        Main character         Drawing             Best comments
        Story                  Color               Clear

  Each user vector is constructed from the topics we obtained. We use cosine
similarity, which is the most commonly used measure. Using this similarity, we
construct the user-topic matrix $U$ to be used in MF and simply obtain the other
factor matrix $V$. We compare the RMSE with that of the original MF. In this paper we
use the most basic MF algorithm [6] as the conventional MF algorithm. The score matrix
M used in this experiment is composed of 0 and 1, which indicate whether the user has
seen a certain comic, rather than score ratings. We assume that users have only seen
the comics they commented on. In order to measure the RMSE, about 15% of each user's
data was randomly deleted. We therefore trained on 85% of the data and observed the
difference between the erased 15% of actual data and the predicted data. As can be
seen from the results in Figure 1, it cannot be concluded that the suggested method
performs better than conventional MF on the overall data. However, when the comparison
is restricted to users with distinct tendencies, that is, users for whom a difference
between two topic-weight quantities exceeds a threshold α (α = 0.4 is used in this
paper), which means the user has at least one noticeable topic that can be categorized
more clearly than for other users, it can be seen that the suggested method using LDA
is slightly better than the conventional MF method. In other words, we can see that
the unsupervised latent variables of conventional MF fit better with users who judge
contents from a complex view, and the suggested EMF, which uses supervised latent
variables, fits well with users with a simple view. This result cannot be regarded as
meaningful for the RMSE itself, but it implies that a similar RMSE is reached even
though the suggested method relies on simple matrix inversion with relatively
interpretable latent variables rather than on the existing method. The RMSE is lower
than in other studies because the task is not to predict a score but to determine
whether a user sees a specific Webtoon, so we calculate the RMSE with rounded values,
which are 0 or 1.

   Figure 1: RMSE plot with the full user data

   Figure 2: RMSE plot with the distinct tendency user data

4    CONCLUSIONS
In this paper, the two popular methodologies, MF and LDA, have been combined
appropriately and show some extra synergy. We conducted experiments with comment data
of Webtoons and showed that the suggested method works about as well as conventional
MF, and in some cases works better. This study did not fully use the comment data that
is currently available. A Webtoon is content that is published one episode a week, so
we think that time will be a very influential factor. Also, this study is
domain-specific research and the proposed algorithm has been used only in this domain,
so extensive research with certified data sets is needed to generalize the algorithm.

ACKNOWLEDGMENTS
This research was supported by Basic Science Research Program through the National
Research Foundation of Korea funded by the Ministry of Science, ICT & Future Planning
(2017R1A2B4006290).

REFERENCES
 [1] Bobadilla, Jesús, et al. 2013. Recommender systems survey. Knowledge-based
     systems 46 (pp. 109-132).
 [2] Seroussi, Yanir, Fabian Bohnert, and Ingrid Zukerman. 2011. Personalised rating
     prediction for new users using latent factor models. Proceedings of the 22nd ACM
     conference on Hypertext and hypermedia (pp. 47-56).
 [3] Bobadilla, J., F. Ortega, A. Hernando, and A. Gutiérrez. 2013. Recommender
     systems survey. Knowledge-based systems 46 (pp. 109-132).
 [4] Chen, Li, Guanliang Chen, and Feng Wang. 2015. Recommender systems based on user
     reviews: the state of the art. User Modeling and User-Adapted Interaction 25(2)
     (pp. 99-154).
 [5] Alghamdi, Rubayyi, and Khalid Alfalqi. 2015. A survey of topic modeling in text
     mining. I. J. ACSA 6.1 (pp. 147-153).
 [6] Koren, Yehuda, Robert Bell, and Chris Volinsky. 2009. Matrix factorization
     techniques for recommender systems. Computer 42(8).
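
The factorization step of Section 2 and the rounded RMSE of Section 3.2 can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the authors' code: the function names (`fit_item_factors`, `rounded_rmse`, `holdout_mask`), the toy matrix sizes, the Dirichlet-sampled stand-in for the LDA user-topic weights, and the choice to treat held-out entries as 0 during training are all hypothetical.

```python
import numpy as np

def fit_item_factors(U, M):
    """Least-squares solution of M ~ U V for V, via the Moore-Penrose pseudoinverse."""
    return np.linalg.pinv(U) @ M

def rounded_rmse(M_true, M_pred, mask):
    """RMSE on the masked entries after rounding predictions to 0 or 1."""
    rounded = np.clip(np.rint(M_pred), 0.0, 1.0)
    return float(np.sqrt(np.mean((M_true[mask] - rounded[mask]) ** 2)))

def holdout_mask(shape, frac, rng):
    """Boolean mask that hides about `frac` of every user's (row's) entries."""
    n_users, n_items = shape
    n_hide = max(1, int(round(frac * n_items)))
    mask = np.zeros(shape, dtype=bool)
    for i in range(n_users):
        mask[i, rng.choice(n_items, size=n_hide, replace=False)] = True
    return mask

# Toy data with hypothetical sizes (the paper uses 3,000 users and 149 Webtoons).
rng = np.random.default_rng(0)
n_users, n_topics, n_items = 200, 10, 40
U = rng.dirichlet(np.ones(n_topics), size=n_users)  # stand-in for LDA user-topic weights
V_toy = rng.random((n_topics, n_items))             # synthetic topic-content factors
M = (U @ V_toy > 0.5).astype(float)                 # 0/1 "user saw this comic" matrix

mask = holdout_mask(M.shape, 0.15, rng)             # hide about 15% of each user's data
M_train = np.where(mask, 0.0, M)                    # assumption: hidden entries become 0
V = fit_item_factors(U, M_train)                    # V = U^+ M on the training matrix
print("rounded RMSE on held-out entries:", rounded_rmse(M, U @ V, mask))
```

Because $U$ is fixed in advance, the only unknown is $V$, so no iterative optimization is needed; the pseudoinverse gives the least-squares fit in one step. In the paper the rows of $U$ would come from the LDA topics rather than from random sampling.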