=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_53
|storemode=property
|title=Movie Rating Prediction Using Multimedia Content and Modeling as a Classification Problem
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_53.pdf
|volume=Vol-2283
|authors=Fatemeh Nazary,Yashar Deldjoo
|dblpUrl=https://dblp.org/rec/conf/mediaeval/NazaryD18
}}
==Movie Rating Prediction Using Multimedia Content and Modeling as a Classification Problem==
Fatemeh Nazary (University of Pavia, Italy) and Yashar Deldjoo (University of Milano-Bicocca, Italy)
fatemeh.nazary01@universitadipavia.it, deldjooy@acm.org

===ABSTRACT===
This paper presents the method proposed for the recommender system task at MediaEval 2018 on predicting the global ratings users give to movies, and their standard deviation, from the audiovisual content and the associated metadata. In the proposed work, we model the rating prediction problem as a classification problem and employ different classifiers for the prediction task. Furthermore, in order to obtain video-level representations of the features from the clip-level features, we employ statistical summarization functions. Results are promising and show the potential of leveraging audiovisual content to improve the quality of existing movie recommendation systems in service.

===1 INTRODUCTION AND CONTEXT===
Video recordings are complex audiovisual signals. When we watch a movie, a large amount of information is communicated to us through different multimedia channels, in particular the audio and visual channels. As a result, video content can be described in different manners, since its consumption is not limited to one type of perception. These multiple facets can be manifested by descriptors of the visual and audio content, but also by metadata, including information about a movie's genre, actors, or plot. The goal of movie recommendation systems (MRS) is to provide personalized suggestions about movies that users would likely find interesting. Collaborative filtering (CF) models lie at the core of most MRS in service today and generate recommendations by exploiting the items favored by other like-minded users [2, 9, 10]. Content-based filtering (CBF) methods, on the other hand, base their recommendations on the similarities between the target user's preferred or consumed items and other items in the catalog, where this similarity is computed using descriptors (features) inferred or extracted from the item content, typically by leveraging textual metadata, either editorial (e.g., genre, cast, director) or user-generated (e.g., tags, reviews) [1, 8]. For instance, the authors in [12] developed a heterogeneous social-aware MRS that uses movie-poster images and textual descriptions, as well as user ratings and social relationships, in order to generate recommendations. Another example is [11], in which a hybrid MRS using tags and ratings is proposed, where user profiles are formed based on users' interactions in a social movie network.

Regardless of the approach, metadata are prone to errors and expensive to collect. Moreover, user feedback and user-generated metadata are rare or absent for new movies, making it difficult or even impossible to provide good-quality recommendations, a scenario known as the cold-start problem. The goal of the current MediaEval task [3] is to bridge the gap between advances and perspectives in the multimedia and recommender systems communities [7]. In particular, participants are required to use the audiovisual content and metadata in order to predict the global ratings users give to movies (representing their appreciation or dis-appreciation) and the corresponding standard deviation (characterizing users' agreement and disagreement). This task is novel in two regards. First, the provided dataset uses movie clips instead of trailers [4–6], thereby covering a wider variety of a movie's aspects by showing different kinds of scenes. Second, including information about the ratings' variance makes it possible to assess users' agreement and to uncover polarizing movies [3].

===2 PROPOSED APPROACH===
The proposed framework can be divided into three phases (illustrative code sketches of each phase follow the list):

(1) Multimodal feature fusion: This step is carried out in the multimodal phase to hybridize the features. It aims to fuse two descriptors of a different nature (e.g., audio and visual) into a single fixed-length descriptor. In this work, we chose concatenation of features as a simple early-fusion approach to multimodal fusion.

(2) Video-level representation building: A novelty of this task is that it uses movie clips instead of movie trailers [3], so each movie has several associated clips. This step aggregates the clip-level representations of the features into a video-level representation that can be used in the classification stage.
In this work, we adopted aggregation methods based on statistical summarization, including mean(), min(), and max(), to obtain the video-level representations of the features.

(3) Classification: The provided target scores (global ratings and their standard deviations) are continuous values. Our approach consisted of treating the prediction problem as a classification problem, which means that prior to classification the target scores are quantized to predefined values. We chose 2-level uniform quantization for the global ratings, meaning the ratings were mapped to one of the values in the set {0.5, 1, 1.5, ..., 4.5, 5}. As for the standard deviation, we chose 10-level plus 16-level uniform quantization, where in the latter case the higher number of levels was chosen to provide finer resolution for the narrow distribution of the standard-deviation scores (std values are quite compact, around [0.5, 1.5], whereas global ratings are spread over the range [0, 5]). Finally, for classification we investigated three approaches: logistic regression (LR), k-nearest neighbors (KNN), and random forest (RF) classifiers.
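As an illustration of phase (1), the snippet below shows early fusion by simple concatenation of two clip-level descriptors. This is a minimal sketch: the function name and the descriptor dimensionalities are ours, not those of the task data.

```python
import numpy as np

def early_fusion(audio_feat: np.ndarray, visual_feat: np.ndarray) -> np.ndarray:
    """Fuse two clip-level descriptors of a different nature (e.g., audio
    and visual) into a single fixed-length multimodal descriptor by
    concatenation, i.e., simple early fusion."""
    return np.concatenate([audio_feat.ravel(), visual_feat.ravel()])

# Illustrative dimensionalities (assumptions, not taken from the paper):
audio = np.random.rand(400)    # e.g., an i-vector for one clip
visual = np.random.rand(128)   # e.g., a deep visual descriptor for the same clip
fused = early_fusion(audio, visual)
assert fused.shape == (528,)
```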
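Phase (2) can be sketched as follows: stack a movie's clip-level descriptors into a matrix and reduce it over the clip axis with mean(), min(), or max(). The text does not say whether the three summaries are used separately or combined, so the sketch exposes them as alternatives.

```python
import numpy as np

def video_level_representation(clip_feats: np.ndarray, stat: str = "mean") -> np.ndarray:
    """Aggregate clip-level features into one video-level descriptor.

    clip_feats has shape (n_clips, feat_dim); each movie in the task
    dataset has several associated clips."""
    if stat == "mean":
        return clip_feats.mean(axis=0)
    if stat == "min":
        return clip_feats.min(axis=0)
    if stat == "max":
        return clip_feats.max(axis=0)
    raise ValueError(f"unknown summarization function: {stat}")

# A movie with 5 clips, each described by a 528-dim fused descriptor:
clips = np.random.rand(5, 528)
movie_vec = video_level_representation(clips, stat="mean")
assert movie_vec.shape == (528,)
```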
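For phase (3), here is a minimal sketch using scikit-learn: targets are quantized onto the rating grid given above, the three classifiers are trained (with default hyperparameters, since the paper does not report settings), and predicted classes are mapped back to rating values to compute the RMSE reported in Table 1. All data here are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Quantization grid for the global ratings, per the set given above.
RATING_GRID = np.arange(0.5, 5.01, 0.5)          # {0.5, 1.0, ..., 5.0}
# For the std targets, an analogous grid could be used; 16 uniform levels
# over the compact [0.5, 1.5] range is our reading of the text:
# STD_GRID = np.linspace(0.5, 1.5, 16)

def quantize_to_classes(scores: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Map each continuous target score to the index of the nearest grid value."""
    return np.abs(scores[:, None] - grid[None, :]).argmin(axis=1)

# Placeholder data standing in for video-level descriptors and ratings.
rng = np.random.default_rng(0)
X = rng.random((800, 528))
ratings = rng.uniform(0.5, 5.0, size=800)
y = quantize_to_classes(ratings, RATING_GRID)    # integer class labels

X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, random_state=0)
classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "RF": RandomForestClassifier(random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    # Map predicted classes back to rating values to score with RMSE,
    # the metric used in Table 1.
    pred = RATING_GRID[clf.predict(X_dev)]
    rmse = np.sqrt(mean_squared_error(RATING_GRID[y_dev], pred))
    print(f"{name}: dev RMSE = {rmse:.2f}")
```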
{| class="wikitable"
|+ Table 1: Results of classification in terms of RMSE (SoA: state of the art). The results are calculated on the development set. The four submitted runs, selected from the best unimodal and hybrid models, are highlighted in bold.
! rowspan="3" | feature !! rowspan="3" | modality !! colspan="6" | movie clips
|-
! colspan="3" | Avg !! colspan="3" | Std
|-
! LR !! KNN !! RF !! LR !! KNN !! RF
|-
| i-vector || audio (SoA) || 0.56 || 0.68 || 0.57 || 0.15 || 0.17 || 0.15
|-
| BLF || audio (traditional) || 0.55 || 0.65 || 0.58 || 0.15 || 0.16 || 0.14
|-
| Deep || visual (SoA) || 0.57 || 0.63 || 0.56 || 0.14 || 0.15 || 0.14
|-
| AVF || visual (traditional) || 0.58 || 0.65 || 0.57 || 0.14 || 0.16 || 0.14
|-
| Tag || metadata (user generated) || 0.48 || 0.55 || 0.39 || 0.14 || 0.15 || 0.14
|-
| Genre || metadata (editorial) || 0.52 || 0.60 || 0.53 || 0.14 || 0.23 || 0.18
|-
| i-vector + BLF || audio+audio || 0.55 || 0.65 || 0.55 || 0.15 || 0.19 || 0.15
|-
| i-vector + Deep || audio+visual || 0.57 || 0.63 || 0.57 || 0.14 || 0.16 || 0.14
|-
| i-vector + AVF || audio+visual || 0.58 || 0.65 || 0.54 || 0.14 || 0.19 || 0.15
|-
| i-vector + Tag || audio+metadata || 0.48 || 0.53 || 0.39 || 0.15 || 0.19 || 0.15
|-
| i-vector + Genre || audio+metadata || 0.52 || 0.58 || 0.56 || 0.14 || 0.13 || 0.13
|-
| BLF + Deep || audio+visual || 0.55 || 0.64 || 0.54 || 0.15 || 0.19 || 0.15
|-
| BLF + AVF || audio+visual || 0.55 || 0.64 || 0.57 || 0.14 || 0.19 || 0.14
|-
| BLF + Tag || audio+metadata || 0.55 || 0.54 || 0.49 || 0.16 || 0.18 || 0.14
|-
| BLF + Genre || audio+metadata || 0.55 || 0.65 || 0.56 || 0.15 || 0.19 || 0.15
|-
| Deep + AVF || visual+visual || 0.59 || 0.69 || 0.57 || 0.14 || 0.18 || 0.14
|-
| Deep + Tag || visual+metadata || 0.40 || 0.64 || 0.38 || 0.14 || 0.25 || 0.15
|-
| Deep + Genre || visual+metadata || 0.57 || 0.63 || 0.58 || 0.14 || 0.16 || 0.14
|-
| AVF + Tag || visual+metadata || 0.44 || 0.78 || 0.38 || 0.15 || 0.42 || 0.15
|-
| AVF + Genre || visual+metadata || 0.58 || 0.67 || 0.56 || 0.14 || 0.20 || 0.12
|-
| Tag + Genre || metadata+metadata || 0.36 || 0.54 || 0.46 || 0.15 || 0.18 || 0.14
|}

===3 RESULTS AND ANALYSIS===
The results of classification using the proposed approach are presented in Table 1. Regarding the comparison of classifiers, we can note that RF is the best classifier, usually yielding the best performance for each feature or feature combination, while KNN is the worst (note that KNN is a lazy classifier). Thus, in reporting the results, we mostly base our judgment on the results obtained with RF and, in some cases, with LR. The final submitted runs were selected as the ones performing best on the development set, which are highlighted in bold in Table 1.

Predicting average ratings: From the results obtained, it can be seen that the performances of all audio and visual features, regardless of their type (i.e., traditional or state of the art), are closely similar to each other, and also, by a close margin, similar to the performance of the genre descriptor. In fact, the difference between the best audio or visual feature and genre is 6-7%, while the difference with tag can reach up to 45%. These results are interesting and confirm that user-generated tags assigned to movies carry semantics that correlate well with the ratings users give to movies, even though the users providing the tags and the ratings are not necessarily the same. For the multimodal case, one can note that simple concatenation of the features cannot substantially improve the final performance compared with unimodal audiovisual features. The best results are obtained with cross-modal fusion for i-vector + AVF (compare 0.54 vs. Genre: 0.53) and BLF + Deep (0.54). However, for metadata-based multimodal fusion, the general observation is that audiovisual features can slightly improve the performance of genre and tag (e.g., compare AVF + Tag: 0.44 vs. Tag: 0.48 for LR, and 0.38 vs. 0.39 for RF), hinting that they have a complementary nature which could be better leveraged if the right fusion strategy is adopted.

Predicting standard deviation of ratings: As for predicting the standard deviation of ratings, in the unimodal case it can be seen that, except for the genre feature with the worst performance, the audiovisual features and the tag metadata have very similar results. This indicates that genre is the weakest descriptor and, compared to the others, less capable of distinguishing differences in users' opinions. Note that under LR, the genre descriptor performs similarly to the audiovisual features. For the multimodal case, the results for the majority of combinations are quite similar regardless of the classifier type. The best-performing combinations are AVF + Genre and i-vector + Genre, with RMSEs of 0.12 and 0.13, respectively.

===4 CONCLUSION===
This paper describes our method for the "Recommending Movies Using Content: Which content is key?" MediaEval 2018 task [3]. The proposed approach consists of three main steps: (i) multimodal fusion, (ii) video-level representation building, and (iii) classification. The results of experiments using three classification approaches are promising and show the efficacy of audiovisual content in predicting users' global ratings and, to a lesser extent, in predicting rating variance.

===REFERENCES===
[1] Charu C. Aggarwal. 2016. Content-based recommender systems. In Recommender Systems. Springer, 139–166.
[2] Charu C. Aggarwal. 2016. Neighborhood-based collaborative filtering. In Recommender Systems. Springer, 29–70.
[3] Yashar Deldjoo, Mihai Gabriel Constantin, Thanasis Dritsas, Markus Schedl, and Bogdan Ionescu. 2018. The MediaEval 2018 Movie Recommendation Task: Recommending Movies Using Content. In MediaEval 2018 Workshop.
[4] Yashar Deldjoo, Mihai Gabriel Constantin, Hamid Eghbal-Zadeh, Markus Schedl, Bogdan Ionescu, and Paolo Cremonesi. 2018. Audio-Visual Encoding of Multimedia Content to Enhance Movie Recommendations. In Proceedings of the Twelfth ACM Conference on Recommender Systems. ACM. https://doi.org/10.1145/3240323.3240407
[5] Yashar Deldjoo, Mihai Gabriel Constantin, Bogdan Ionescu, Markus Schedl, and Paolo Cremonesi. 2018. MMTF-14K: A Multifaceted Movie Trailer Dataset for Recommendation and Retrieval. In Proceedings of the 9th ACM Multimedia Systems Conference (MMSys 2018). Amsterdam, the Netherlands.
[6] Yashar Deldjoo, Mehdi Elahi, Massimo Quadrana, and Paolo Cremonesi. 2018. Using Visual Features based on MPEG-7 and Deep Learning for Movie Recommendation. International Journal of Multimedia Information Retrieval (2018), 1–13.
[7] Yashar Deldjoo, Markus Schedl, Paolo Cremonesi, and Gabriella Pasi. 2018. Content-Based Multimedia Recommendation Systems: Definition and Application Domains. In Proceedings of the 9th Italian Information Retrieval Workshop (IIR 2018). Rome, Italy.
[8] Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. 2011. Content-based recommender systems: State of the art and trends. In Recommender Systems Handbook. Springer, 73–105.
[9] Francesco Ricci, Lior Rokach, and Bracha Shapira. 2015. Recommender systems: introduction and challenges. In Recommender Systems Handbook. Springer, 1–34.
[10] Yue Shi, Martha Larson, and Alan Hanjalic. 2014. Collaborative filtering beyond the user-item matrix: A survey of the state of the art and future challenges. ACM Computing Surveys (CSUR) 47, 1 (2014), 3.
[11] Shouxian Wei, Xiaolin Zheng, Deren Chen, and Chaochao Chen. 2016.
A hybrid approach for movie recommendation via tags and ratings. Electronic Commerce Research and Applications 18 (2016), 83–94.
[12] Zhou Zhao, Qifan Yang, Hanqing Lu, Tim Weninger, Deng Cai, Xiaofei He, and Yueting Zhuang. 2017. Social-Aware Movie Recommendation via Multimodal Network Learning. IEEE Transactions on Multimedia (2017).