The MediaEval 2018 Movie Recommendation Task: Recommending Movies Using Content
Yashar Deldjoo1, Mihai Gabriel Constantin2, Athanasios Dritsas3, Bogdan Ionescu2, Markus Schedl4
1 Politecnico di Milano, Italy, 2 University Politehnica of Bucharest, Romania, 3 Delft University of Technology, Netherlands, 4 Johannes Kepler University Linz, Austria
deldjooy@acm.org, mgconstantin@imag.pub.ro, a.dritsas@student.tudelft.nl, bionescu@imag.pub.ro, markus.schedl@jku.at
Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
In this paper we introduce the MediaEval 2018 task Recommending Movies Using Content. It focuses on predicting overall scores that users give to movies, i.e., the average rating (representing the overall appreciation of a movie by its viewers) and the rating variance/standard deviation (representing agreement or disagreement between users), using audio, visual and textual features derived from selected movie scenes. We release a dataset of movie clips consisting of roughly 7K clips for 800 unique movies. In the paper, we present the challenge, the dataset and ground-truth creation, the evaluation protocol and the requested runs.

KEYWORDS
movie rating prediction, movie recommender systems, multimedia features, audio, visual, textual descriptors, clips, trailers

1 INTRODUCTION
A dramatic rise in the generation of video content has been witnessed in recent years. Video recommender systems (RS) play an important role in helping users of online streaming services cope with the resulting information overload. Video recommendation systems are traditionally powered by either collaborative filtering (CF) models, which leverage the correlations between users' consumption patterns, or content-based filtering (CBF) approaches, typically based on textual metadata that is either editorial (e.g., genre, cast, director) or user-generated (e.g., tags, reviews) [1, 15].
The goal of the MediaEval Movie Recommendation Task is to use content-based audio, visual and metadata features and their multimodal combinations to predict how a movie will be received by its viewers, by predicting the global ratings of users and the standard deviation of those ratings [7]. The task uses movie clips instead of full-length movies as input, which makes it more versatile and practical, as clips are more easily available than the full movies. There are two main useful outcomes of this task: firstly, by predicting the average ratings that users give to movies, such techniques can be exploited by producers and investors to decide whether or not to pursue the production of similar movies; secondly, and more importantly, the task lays the groundwork for CBF movie recommendation, where recommendations are tailored to match the individual preferences of users on the audio-visual content and the descriptive metadata. As for the latter, the current MediaEval task looks into predicting the variance of the ratings, whose correct prediction implies the ability of the system to differentiate between the preferences of different users or groups of users, which can be exploited by current CBF movie recommender systems. In contrast to the de facto CF approach widely adopted by the RS community, the CBF approach can handle the item cold-start problem, where newly added items lack enough interactions (impeding the usability of the CF approach), and can also help systems respect user privacy [3, 4]. This paper presents an overview of the task, the features provided by the organizers, a description of the ground truth and evaluation methods, as well as the required runs.

2 TASK DESCRIPTION
Task participants must create an automatic system that can predict the average ratings that users will assign to movies (representing the overall appreciation of the movie by the audience) and also the rating variance (representing the agreement or disagreement between user ratings). Note that, in fact, it is the standard deviation of ratings that has to be predicted (cf. Section 5); for intelligibility, we use the term "variance" instead of standard deviation. The input to the system is a set of audio, visual, and text features derived from selected movie scenes (movie clips).
The novelty of this task is that it uses movie clips instead of the movie trailers chosen by most previous works in both the multimedia and recommendation fields [4, 6, 11]. Movie trailers, for the most part, are free samples of a film that are packaged to communicate a feeling of the movie's story. Their main goal is to convince the audience to come back for more when the film opens in theaters. For this reason, trailers are usually made with lots of thrills and chills. Movie clips, however, focus on a particular scene and display it at the natural pace of the movie. The two media types communicate different information to their viewers and can evoke different emotions [14], which in turn strongly affect the users' perception and appreciation, i.e., ratings, of the movie. To give an example, compare, for the movie "Beautiful Girls" (1996), the official trailer (https://www.youtube.com/watch?v=yfQ5ONwWxI8), a movie clip ("A girl named Marty", https://www.youtube.com/watch?v=4K8M2EVnoKc), and another movie clip ("Ice skating with Marty", https://www.youtube.com/watch?v=M-h1ERyxbQ0), all taken from the same movie.

3 DATA
Participants are supplied with audio and visual features extracted from movie clips, as well as associated metadata (genre and tag labels). These content features resemble the content features of our recently released movie trailer dataset MMTF-14K [4, 5]. However, unlike in MMTF-14K, in the movie clips dataset used in the MediaEval task at hand, each movie can be associated with several clips.
The complete development set (devset) provides features computed from 5562 clips corresponding to 632 unique movies, while the testset provides features for 1315 clips corresponding to 159 unique movies from the well-known MovieLens 20M dataset (ml-20m) [10]. The task makes use of the user ratings from the ml-20m dataset in order to calculate the ground truth, namely the per-movie global average rating and rating variance.
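Since the ground truth is derived from the public ml-20m ratings, it can be reproduced with a few lines of code. The following minimal sketch assumes the standard ml-20m ratings.csv file (columns userId, movieId, rating, timestamp) and uses pandas; it is an illustration of the described procedure, not the organizers' exact ground-truth script.

    import pandas as pd

    # Load the public ml-20m ratings (columns: userId, movieId, rating, timestamp).
    ratings = pd.read_csv("ml-20m/ratings.csv")

    # Per-movie global average rating and rating standard deviation
    # (the "variance" score of the task is in fact the standard deviation, cf. Section 5).
    ground_truth = (
        ratings.groupby("movieId")["rating"]
               .agg(avg_rating="mean", rating_std="std")
               .reset_index()
    )

    # Hypothetical list of task movie IDs; in practice these would be taken
    # from the released devset/testset feature files.
    movie_ids = [94]
    print(ground_truth[ground_truth["movieId"].isin(movie_ids)])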
The YouTube IDs of the clips are also encoded in the clip file names. For example, 000000094_2Vam2a4r9vo represents a clip in the dataset with the ml-ID 94 and the YouTube ID 2Vam2a4r9vo (https://www.youtube.com/watch?v=2Vam2a4r9vo). Each movie has on average about 8.5 associated clips, where this value is calculated over both the devset and the testset.
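This naming convention makes it straightforward to link a clip to its MovieLens entry and to its source video. The short sketch below, with a hypothetical parse_clip_name helper that is not part of the released tools, illustrates the convention under the assumption that the ml-ID always precedes the first underscore.

    def parse_clip_name(clip_name: str):
        """Split a clip identifier such as '000000094_2Vam2a4r9vo' into
        its zero-padded ml-ID and its YouTube video ID."""
        ml_part, youtube_id = clip_name.split("_", 1)
        return int(ml_part), youtube_id

    ml_id, yt_id = parse_clip_name("000000094_2Vam2a4r9vo")
    print(ml_id)                                        # 94
    print(f"https://www.youtube.com/watch?v={yt_id}")   # link to the source clip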
The content descriptors are organized in three categories, described next.

3.1 Metadata
The metadata descriptors (found in the folder named Metadata) are provided as two CSV files containing genre and user-generated tag features associated with each movie. The metadata features come in a pre-computed numerical format instead of the original textual format, for ease of use. The metadata descriptors are exactly the same as in our MMTF-14K trailer dataset [4, 5].

3.2 Audio features
The audio descriptors (found in the folder named Audio) are contained in two sub-folders: block-level features (BLF) [17] and i-vector features [8, 16, 17]. The BLF data includes the raw features of the 6 sub-components (sub-features) that describe various audio aspects: spectral aspects (spectral pattern, delta spectral pattern, variance delta spectral pattern), harmonic aspects (correlation pattern), rhythmic aspects (logarithmic fluctuation pattern), and tonal aspects (spectral contrast pattern). The i-vector features, describing timbre, are computed for different numbers of Gaussian mixture model (GMM) components (16, 32, 64, 256, 512) and different total variability dimensions (tvDim) (10, 20, 40, 200, 400). The block-level features folder has two sub-folders, "All" and "Component6": the former contains the super-vector created by concatenating all 6 sub-components, the latter contains the raw feature vectors of the sub-components in separate CSV files. The i-vector features folder contains individual CSV files for each of the possible combinations of the two parameters GMM and tvDim.

3.3 Visual features
The visual descriptors (found in the folder named Visual) are contained in two sub-folders: aesthetic visual features [9, 13] and deep AlexNet fc7 features [2, 12], each of them including different aggregation and fusion schemes for the two types of visual features. These two features are aggregated using four basic statistical methods, each corresponding to a different sub-folder, which compute a video-level feature vector from frame-level vectors by using: the average value across all frames (denoted "Avg"), the average value and variance ("AvgVar"), the median values ("Med"), and finally the median and median absolute deviation ("MedMad"). Each of the four aggregation sub-folders of the aesthetic visual features folder contains CSV files for three types of fusion methods: early fusion of all the components (denoted All), early fusion of components according to their type (color-based components denoted Type3Color, object-based components Type3Object, and texture Type3Texture), and finally each of the 26 individual components with no early fusion scheme (for example, the colorfulness component is denoted Feat26Colorfulness), therefore resulting in a total of 30 files in each sub-folder. Regarding the AlexNet features, in our context we use the output values extracted from the fc7 layer. For this reason, no supplementary early fusion scheme is required or possible, and only one CSV file is present inside each of the four aggregation folders.
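The four aggregation statistics listed above are simple to reproduce from frame-level descriptors. The sketch below assumes the frame-level features of one video are available as a 2-D array (frames x feature dimensions) and shows how video-level vectors in the spirit of "Avg", "AvgVar", "Med" and "MedMad" could be computed; it is an illustration of the described scheme, not the organizers' extraction code.

    import numpy as np

    def aggregate_frames(frame_feats: np.ndarray) -> dict:
        """Aggregate frame-level descriptors (shape: n_frames x n_dims)
        into video-level vectors using the four statistics of Section 3.3."""
        avg = frame_feats.mean(axis=0)
        var = frame_feats.var(axis=0)
        med = np.median(frame_feats, axis=0)
        mad = np.median(np.abs(frame_feats - med), axis=0)  # median absolute deviation
        return {
            "Avg": avg,
            "AvgVar": np.concatenate([avg, var]),
            "Med": med,
            "MedMad": np.concatenate([med, mad]),
        }

    # Example: 120 frames, each described by 26 aesthetic components.
    video_level = aggregate_frames(np.random.rand(120, 26))
    print({name: vec.shape for name, vec in video_level.items()})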
4 RUN DESCRIPTION
Every team can submit up to 4 runs: 2 runs with prediction scores for the rating average and 2 runs for the rating standard deviation. For each score type, the first run is expected to contain the prediction scores of the best uni-modal approach (using visual information, audio or metadata), and the second run those of a hybrid approach that considers all modalities. Note that in all these runs, the teams should consider how to temporally aggregate clip-level information into movie-level information (each movie is on average associated with about 8 clips). This task is novel in two regards. First, the dataset includes movie clips instead of trailers, thereby providing a wider variety of the movie's aspects by showing different kinds of scenes. Second, including information about the ratings' variance allows assessing users' agreement and uncovering polarizing movies.

5 GROUND TRUTH AND EVALUATION
The evaluation of participants' runs is realized by predicting users' overall ratings, for which we use the standard error metric root-mean-square error (RMSE) between the predicted scores and the actual scores according to the ground truth (as given in the MovieLens 20M dataset):

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (s_i - \hat{s}_i)^2}

where N is the number of scores in the test set on which the system is validated, s_i is the actual score given by users to item i, and \hat{s}_i is the predicted score. Two types of scores are considered for evaluation:
(1) average ratings
(2) standard deviation of ratings
The standard deviation of ratings is chosen to measure the agreement/disagreement between user ratings, thereby building the groundwork for personalized recommendation. It should be noted that at test data release, participants are provided only with the IDs of the test movie clips, for which they are expected to predict both of the above scores.
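To make the run format of Section 4 and the evaluation protocol above concrete, the following minimal sketch pools clip-level feature vectors into movie-level vectors by mean pooling, fits a simple regressor for the average rating (the same pattern applies to the standard-deviation run), and scores the predictions with RMSE on synthetic stand-in data. The mean-pooling choice, the ridge regressor and all variable names are illustrative assumptions, not the required approach.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)

    def pool_clips(clip_feats_per_movie):
        """Mean-pool the feature vectors of all clips of a movie into one
        movie-level vector (one of many possible temporal aggregations)."""
        ids = sorted(clip_feats_per_movie)
        X = np.vstack([clip_feats_per_movie[m].mean(axis=0) for m in ids])
        return ids, X

    # Synthetic stand-ins for the released features and the ground truth:
    # each movie has a variable number of clips, each described by a 64-d vector.
    dev_clips  = {m: rng.normal(size=(rng.integers(3, 12), 64)) for m in range(632)}
    test_clips = {m: rng.normal(size=(rng.integers(3, 12), 64)) for m in range(159)}
    y_dev_avg  = rng.uniform(1.0, 5.0, size=632)   # per-movie average ratings (devset)
    y_test_avg = rng.uniform(1.0, 5.0, size=159)   # per-movie average ratings (testset)

    _, X_dev = pool_clips(dev_clips)
    _, X_test = pool_clips(test_clips)

    # A uni-modal run for the rating average; the std run follows the same pattern.
    model = Ridge(alpha=1.0).fit(X_dev, y_dev_avg)
    pred_avg = model.predict(X_test)

    # RMSE as defined in Section 5 (computable only once the ground truth is known).
    rmse = np.sqrt(mean_squared_error(y_test_avg, pred_avg))
    print(f"RMSE (average rating): {rmse:.4f}")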
6 CONCLUSIONS
The 2018 Movie Recommendation Task provides a unified framework for evaluating participants' approaches to the prediction of movie ratings through the usage of movie clips and audio, visual and metadata features and their hybrid combinations. Details regarding the methods and results of each individual run can be found in the working notes papers of the MediaEval 2018 workshop proceedings.

REFERENCES
[1] Charu C. Aggarwal. 2016. Content-based recommender systems. In Recommender Systems. Springer, 139-166.
[2] Mihai Gabriel Constantin and Bogdan Ionescu. 2017. Content description for predicting image interestingness. In 2017 International Symposium on Signals, Circuits and Systems (ISSCS). IEEE, 1-4.
[3] Yashar Deldjoo. 2018. Video recommendation by exploiting the multimedia content. Ph.D. Dissertation. Italy.
[4] Yashar Deldjoo, Mihai Gabriel Constantin, Hamid Eghbal-Zadeh, Markus Schedl, Bogdan Ionescu, and Paolo Cremonesi. 2018. Audio-visual encoding of multimedia content to enhance movie recommendations. In Proceedings of the Twelfth ACM Conference on Recommender Systems. ACM. https://doi.org/10.1145/3240323.3240407
[5] Yashar Deldjoo, Mihai Gabriel Constantin, Bogdan Ionescu, Markus Schedl, and Paolo Cremonesi. 2018. MMTF-14K: A multifaceted movie trailer dataset for recommendation and retrieval. In Proceedings of the 9th ACM Multimedia Systems Conference (MMSys 2018). Amsterdam, the Netherlands.
[6] Yashar Deldjoo, Mehdi Elahi, Massimo Quadrana, and Paolo Cremonesi. 2018. Using visual features based on MPEG-7 and deep learning for movie recommendation. International Journal of Multimedia Information Retrieval (2018), 1-13.
[7] Yashar Deldjoo, Markus Schedl, Paolo Cremonesi, and Gabriella Pasi. 2018. Content-based multimedia recommendation systems: Definition and application domains. In Proceedings of the 9th Italian Information Retrieval Workshop (IIR 2018). Rome, Italy.
[8] Hamid Eghbal-Zadeh, Bernhard Lehner, Markus Schedl, and Gerhard Widmer. 2015. I-vectors for timbre-based music similarity and music artist classification. In ISMIR. 554-560.
[9] Andreas F. Haas, Marine Guibert, Anja Foerschner, Sandi Calhoun, Emma George, Mark Hatay, Elizabeth Dinsdale, Stuart A. Sandin, Jennifer E. Smith, Mark J. A. Vermeij, et al. 2015. Can we measure beauty? Computational evaluation of coral reef aesthetics. PeerJ 3 (2015), e1390.
[10] F. Maxwell Harper and Joseph A. Konstan. 2016. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2016), 19.
[11] Yimin Hou, Ting Xiao, Shu Zhang, Xi Jiang, Xiang Li, Xintao Hu, Junwei Han, Lei Guo, L. Stephen Miller, Richard Neupert, et al. 2016. Predicting movie trailer viewer's "like/dislike" via learned shot editing patterns. IEEE Transactions on Affective Computing 7, 1 (2016), 29-44.
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097-1105.
[13] Congcong Li and Tsuhan Chen. 2009. Aesthetic visual quality assessment of paintings. IEEE Journal of Selected Topics in Signal Processing 3, 2 (2009), 236-252.
[14] Robert Marich. 2013. Marketing to Moviegoers: A Handbook of Strategies and Tactics. SIU Press.
[15] Francesco Ricci, Lior Rokach, and Bracha Shapira. 2015. Recommender systems: Introduction and challenges. In Recommender Systems Handbook. Springer, 1-34.
[16] Markus Schedl, Hamed Zamani, Ching-Wei Chen, Yashar Deldjoo, and Mehdi Elahi. 2018. Current challenges and visions in music recommender systems research. International Journal of Multimedia Information Retrieval 7, 2 (2018), 95-116. https://doi.org/10.1007/s13735-018-0154-2
[17] Klaus Seyerlehner, Markus Schedl, Peter Knees, and Reinhard Sonnleitner. 2011. A refined block-level feature set for classification, similarity and tag prediction. In 7th Annual Music Information Retrieval Evaluation eXchange (MIREX 2011). Miami, FL, USA.