An Explanatory Matrix Factorization with User Comments Data*

Donghyun Kim
Department of Industrial and Systems Engineering
Korea Advanced Institute of Science and Technology
South Korea
dhk618@kaist.ac.kr

Hayong Shin
Department of Industrial and Systems Engineering
Korea Advanced Institute of Science and Technology
South Korea
hyshin@kaist.ac.kr

* Copyright is held by the author(s). RecSys 2017 Poster Proceedings, August 27-31, 2017, Como, Italy

ABSTRACT

Matrix factorization is one of the crucial algorithms of recommendation systems. It implies that the relationship between users and contents can be explained by hidden latent variables. However, the meaning of these hidden latent variables is not intuitive. Therefore, this study suggests a way to learn their meaning from supplementary data such as comments and to use it in matrix factorization. The data used in this study is user comment data from Naver, which is the largest web platform and also the largest Webtoon (web comic) platform in South Korea. We show that the suggested method, which uses supervised latent variables, fits users with a distinct tendency well compared to conventional matrix factorization.

KEYWORDS

Matrix Factorization, Explanatory Analysis, Latent Dirichlet Allocation

1 INTRODUCTION

A recommendation system (RS) collects information in order to analyze a user's taste. Numerous methods have been proposed for RS, and one of the dominant methods is matrix factorization (MF), which predicts the missing values of a score matrix composed of the evaluations that users give to contents [3]. MF improved the quality of RS significantly, but some issues remain, such as the cold-start problem and insufficient explanatory power. MF decomposes the original score matrix into low-rank matrices with latent features and thereby makes it tractable, but the meaning of each latent feature is difficult to analyze.

Many works use user reviews to assist RS. It has also been shown that an appropriate latent factor model using topic selection with LDA performs better than the existing model [2, 4]. However, previous studies assume that a score matrix exists and use reviews to improve it, whereas this study has no score matrix and focuses on the explanatory power of MF rather than on the RMSE itself. The methodology is not new as part of the research that uses user reviews to improve RS, but two popular methodologies, MF and Latent Dirichlet Allocation (LDA), are combined appropriately and provide explanatory power. This study also differs in that it uses user comment data about Webtoons, which has not been used previously.

2 EXPLANATORY MATRIX FACTORIZATION

In this study, we introduce a method to utilize domain knowledge by combining LDA and MF. LDA has explanatory power over topics, and MF explains the relationship between users and contents with hidden latent variables. However, the meaning of each latent variable is difficult to grasp. Therefore, we first derive explanatory power from the supplementary data $d_i$, $i = 1, \dots, N$, such as user comments or reports, using LDA. LDA assumes that several topics are mixed in each document, so its results can be used to analyze the themes of documents. There are many LDA algorithms to infer topics [5], so we do not describe conventional LDA algorithms in detail.

Let $t_k$, $k = 1, \dots, K$, be a topic from LDA. Each user can then be represented as a vector $u_i = (u_{i1}, \dots, u_{iK})$, where $u_{ik}$ indicates how much user $i$ talks about topic $t_k$, which is an indirect indicator of what the user likes. Therefore, we can build a user-topic relationship matrix $U$, whose rows are the user vectors $u_i$, for all users and use it directly in MF. The core of MF is to divide a user-contents rating matrix $M$ into two low-rank matrices, $M = UV$, which need to be estimated. However, if $U$ can be obtained sufficiently well from supplementary data, MF turns into a simple matrix inverse problem. Thus, $V$ can be obtained simply through the Moore-Penrose pseudoinverse, with $V = U^{+} M$.

3 EXPERIMENT

3.1 Data Collection and Refinement

We use user comment data from Naver, which is Korea's largest web and Webtoon platform. Since Naver does not provide any formalized data, the data was collected and refined through web crawling with Python. The collected data contains 4 features (Title-Episodes-userID-Comment). The raw data has over 100K users, 151 Webtoons, 21,927 episodes and over 110 million comments. Since this raw data needs more than 10TB of capacity, we restricted the study to a smaller data set because of hardware limitations. Also, unlike commonly used reviews, the comments contain a lot of useless data because they can be written without any restrictions; for example, multiple comments are allowed on the same item. Therefore, in this study, we chose 3,000 users who kept to correct grammar as much as possible and wrote comments of a reasonable length (more than 70 characters on average) consistently over a certain period (at least 12 weeks). Compared to all the data, 3,000 users is a small number, but since the data used in this paper is very different from the user reviews usually used in other papers, it was important to refine useful data before analysis. The refined data contains 149 Webtoons and 1.1 million comments.

3.2 Experiment Settings and Result

The topics are modeled through LDA with the selected 1.1 million comments. In this experiment, the number of topics was set to 10, 20, and 30, and the number of hidden latent variables of MF was set to the same value in each case. Topic selection is a very important task, but the topics obtained from the whole data alone are too vague to use as they are. Therefore, some topics are collected from each Webtoon, and some topics are obtained from the whole data. Some of the noticeable topics are listed in Table 1. Topic 3 is mainly composed of words about the story of a comic, and Topic 7 is composed of words about the drawing style of a comic. Although not all topics can be identified as clearly as these, there are more topics that can be interpreted, such as the attitude of the artist.

Table 1: Noticeable topics from comment data (number of topics = 30)

Topic 3          Topic 7     Topic 21
Sick of          Beautiful   (illegible)
Crazy            Sick of     Funny
Main character   Drawing     Best comments
Story            Color       Clear

Each user vector is constructed from the topics we obtained. We use cosine similarity, which is the most commonly used measure. Using this similarity, we construct the matrix $U$ to be used in MF and simply obtain the other matrix $V$. We compare the RMSE with that of the original MF. In this paper we use the most basic MF algorithm [6] as the conventional MF algorithm. The score matrix M used in this experiment is composed of 0 and 1, which indicate whether the user sees a certain comic, not a score rating. We assume that users only see the comics they commented on. In order to measure the RMSE, about 15% of each user's data was randomly deleted. We therefore learned with 85% of the data and observed the difference between the erased 15% of the actual data and the predicted data.

As can be seen from the results in Figure 1, it cannot be concluded that the performance on the overall data is better than that of conventional MF. However, when the comparison is restricted to users with distinct tendencies, that is, users satisfying a threshold condition of the form $(\cdot) - (\cdot) > \alpha$ ($\alpha = 0.4$ is used in this paper), which means that the user has at least one noticeable topic that can be categorized more clearly than for other users, the suggested method using LDA is slightly better than the conventional MF method. In other words, the unsupervised latent variables of conventional MF fit better for users who judge the contents with a complex view, and the suggested EMF, which uses supervised latent variables, fits well for users with a simple view. This result cannot be regarded as meaningful in terms of the RMSE itself, but it implies that a similar RMSE is reached even though the result is obtained by simple matrix inversion using a relatively interpretable latent variable rather than by the existing method. The RMSE is lower than in other studies because the task is not to predict a score but to determine whether a user sees a specific Webtoon, so we calculate the RMSE with a rounded value, which is 0 or 1.

Figure 1: RMSE plot with the full user data

Figure 2: RMSE plot with the distinct tendency user data

4 CONCLUSIONS

In this paper, two popular methodologies, MF and LDA, are combined appropriately and show some extra synergy. We conducted experiments with comment data of Webtoons and showed that the suggested method works about as well as conventional MF, and in some cases better. This study did not fully use the comment data that is currently available. Webtoons are published one episode per week, so we expect the time dimension of the comments to be highly informative. Also, this study is domain-specific research and the proposed algorithm has been used only in this domain, so extensive research with a certified data set is needed to generalize the algorithm.

ACKNOWLEDGMENTS

This research was supported by Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Science, ICT & Future Planning (2017R1A2B4006290).

REFERENCES

[1] Jesús Bobadilla et al. 2013. Recommender systems survey. Knowledge-Based Systems 46, 109-132.
[2] Yanir Seroussi, Fabian Bohnert, and Ingrid Zukerman. 2011. Personalised rating prediction for new users using latent factor models. Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia, 47-56.
[3] J. Bobadilla, F. Ortega, A. Hernando, and A. Gutiérrez. 2013. Recommender systems survey. Knowledge-Based Systems 46, 109-132.
[4] Li Chen, Guanliang Chen, and Feng Wang. 2015. Recommender systems based on user reviews: the state of the art. User Modeling and User-Adapted Interaction 25(2), 99-154.
[5] Rubayyi Alghamdi and Khalid Alfalqi. 2015. A survey of topic modeling in text mining. I. J. ACSA 6(1), 147-153.
[6] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42(8).
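
The closed-form step described in Section 2 and the rounded RMSE described in Section 3.2 can be illustrated with a short example. The following is a minimal sketch under stated assumptions, not the authors' implementation. The names U, V, fit_item_factors and rounded_rmse, the toy matrices, and the choice to set held-out entries to 0 while fitting are all assumptions made for illustration. Only the binary score matrix M, the Moore-Penrose pseudoinverse, and the rounding of predictions to 0 or 1 come from the text.

```python
import numpy as np


def fit_item_factors(U, M_train):
    """Least-squares solution V of U @ V ~ M_train via the pseudoinverse."""
    return np.linalg.pinv(U) @ M_train


def rounded_rmse(M_true, M_pred, mask):
    """RMSE on the masked entries after rounding predictions to 0 or 1."""
    rounded = np.clip(np.rint(M_pred), 0.0, 1.0)
    errors = (M_true - rounded)[mask]
    return float(np.sqrt(np.mean(errors ** 2)))


# Toy user-topic matrix (hypothetical): 6 users, 2 topics.
U = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0],
              [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])

# Binary score matrix M: 1 if the user commented on (saw) the Webtoon.
M = np.array([[1, 1, 0], [1, 1, 0], [1, 1, 0],
              [0, 0, 1], [0, 0, 1], [0, 0, 1]], dtype=float)

# Hold out one observed entry. Assumption: hidden entries are set to 0.
held_out = np.zeros_like(M, dtype=bool)
held_out[1, 1] = True
M_train = np.where(held_out, 0.0, M)

V = fit_item_factors(U, M_train)  # shape: (topics, Webtoons)
M_pred = U @ V                    # predicted scores for every user and Webtoon
print(rounded_rmse(M, M_pred, held_out))
```

Because the user-topic matrix is fixed in advance, the fit needs no iterative optimization; a single pseudoinverse gives the least-squares solution. For larger matrices, `np.linalg.lstsq` would be a reasonable alternative to forming the pseudoinverse explicitly.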