Context of Experience – MediaEval submission of ITEC / AAU

Polyxeni Sgouroglou, Tarek Markus Abdel Aziz, Mathias Lux
Klagenfurt University
Universitätsstrasse 65-67, Klagenfurt, Austria
{psgourog,tabdelaz}@edu.aau.at, mlux@itec.aau.at

ABSTRACT
People want to be entertained. However, context influences what people actually find entertaining. The MediaEval 2016 Context of Experience Task [2] focuses on automated methods to find the right content for a specific viewing situation, or more specifically, which movies are good to watch on an airplane. In this paper we present our approach to automatically suggesting movies from a list of possible ones by means of visual data as well as meta data.

1. INTRODUCTION
Movies for entertainment are big business. Worldwide TV and video revenue in 2015 is estimated at 286.17 billion USD¹. With new shows, series and movies every year, there is a huge library of content to choose from. Especially in the confined space of an airplane seat and for the duration of a long distance flight, video entertainment is well received by passengers. Many companies therefore offer on-board entertainment systems where passengers can choose from multiple videos to entertain themselves without disturbing other passengers. While there is of course no one-size-fits-all solution, the general hypothesis of the task is that some videos are better suited for watching on an airplane than others. Of course there are many different factors that can influence such a decision, e.g. whether a movie is still watchable on a small, low-contrast screen, or whether there are scenes which are potentially offensive to neighboring passengers.

While distinguishing positive from negative reviews of films is a quite easy process for humans, determining the suitability of a film for watching on an airplane is a non-trivial task even for humans. It becomes even harder when humans are asked to decide about the suitability of films while being outside the specific context. For a human who remains outside the airplane context it is rather difficult to anticipate the exact emotional impact of a film watched during a flight. Mood, differing tastes, stress levels during the flight, and anxiety problems are only some of the factors that could influence a passenger's decision. Hence, lists proposed by websites about what should or should not be watched on an airplane remain controversial. However, the majority of passengers share the same goal, which is to pleasantly pass the time. Therefore, there are some characteristics based on common sense that intuitively make a film unsuitable for watching on an airplane for the majority of passengers, including films about airplane crashes, very violent ones, those with a high level of nudity, etc.

For the MediaEval 2016 Context of Experience Task we submitted four runs, of which the first two are visual-only runs and the latter two investigate text features. For the text part of our experiments we chose to work with meta data, more specifically with plots, synopses and plot keywords of the films, since we consider these the most descriptive and concise types of meta data for deciding suitability in our watching situation. Our intuition is that other types of meta data, such as genre, language, etc., are not sufficient to characterize a film as unsuitable for watching on an airplane.

2. OUR APPROACH
2.1 Classification based on visual features
We used two different visual components, posters and trailers, for the classification of a movie. The first run – Run 1 – uses only posters, and the second run – Run 2 – uses only trailers. For both runs we used the same machine learning techniques.

Most people determine the genre or other special features of a movie by looking at its poster. For instance, red fonts are sometimes a hint that the movie includes horrifying or bloody scenes. Such derived features also play an important role in movie selection on airplanes, because next to every movie title the corresponding poster is presented. The best way to develop an algorithm that determines from a poster whether a movie is suitable or not is to use machine learning techniques.

We decided to use deep learning. It is usually employed for large datasets, but our development set was too small for training a network from scratch. Therefore, we extracted only the vectors of the last hidden layer from a pre-trained deep neural network model, which can then be used as input to a separate machine learning system [4].

We used the deep learning framework Caffe [1], developed by the Berkeley Vision and Learning Center. It can be used to load one of the pre-trained models provided by community members and researchers. We used the BVLC GoogLeNet model [3]. It has 22 layers and is trained on the ImageNet-2014 dataset to detect 1 000 different classes of images. For each poster from the development set we extracted a vector with 1 024 dimensions from the layer pool5/7x7_s1. It is the last hidden layer and contains all the high-level information created by the network.
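As an illustration, the following minimal sketch shows how such a pool5/7x7_s1 activation can be read out with pycaffe. The model file names, the mean values and the preprocessing details are assumptions made for the sketch; the paper does not prescribe an exact extraction script.

```python
# Minimal sketch of the poster feature extraction with pycaffe.
# 'deploy.prototxt' and 'bvlc_googlenet.caffemodel' are placeholders
# for the BVLC GoogLeNet definition and weight files.
import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'bvlc_googlenet.caffemodel', caffe.TEST)

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))        # HxWxC -> CxHxW
transformer.set_mean('data', np.array([104.0, 117.0, 123.0]))  # assumed ImageNet BGR means
transformer.set_raw_scale('data', 255)              # [0,1] -> [0,255]
transformer.set_channel_swap('data', (2, 1, 0))     # RGB -> BGR

def poster_vector(image_path):
    """Return the 1024-dimensional pool5/7x7_s1 activation for one poster."""
    image = caffe.io.load_image(image_path)
    net.blobs['data'].data[...] = transformer.preprocess('data', image)
    net.forward()
    return net.blobs['pool5/7x7_s1'].data[0].flatten()  # shape: (1024,)
```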
As classifier we built a Support Vector Machine (SVM) with scikit-learn², a free machine learning library. The advantages of SVMs are that they are effective in high dimensional spaces and in cases where the number of input vectors is smaller than the number of dimensions, as in our situation. We used a linear kernel function, and the soft margin parameter C was set to 10 (see the sketch below). The result is Run 1.

We proceeded similarly with the trailers. From every trailer we extracted 200 frames with ffmpeg³, a software suite for handling multimedia data and streams. We again extracted the vectors of the last hidden layer with the BVLC GoogLeNet model and concatenated all 200 vectors of a trailer. The result was a vector with 204 800 dimensions. We trained the SVM with these vectors from the development set and afterwards classified the vectors from the test set. The result is Run 2.
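The frame extraction can be sketched as follows. The fps-based sampling rate and the reuse of the hypothetical poster_vector() helper from the sketch above are illustrative choices; the paper only states that 200 frames per trailer were extracted.

```python
# Sketch of the trailer feature assembly under the assumptions above.
import glob
import subprocess
import numpy as np

def trailer_vector(trailer_path, frame_dir, n_frames=200):
    """Return one concatenated 200 x 1024 = 204 800-dimensional vector."""
    subprocess.run(['ffmpeg', '-i', trailer_path,
                    '-vf', 'fps=1',              # illustrative sampling rate
                    '-frames:v', str(n_frames),  # stop after n_frames images
                    frame_dir + '/frame_%03d.png'], check=True)
    frames = sorted(glob.glob(frame_dir + '/frame_*.png'))[:n_frames]
    # poster_vector() is the pool5/7x7_s1 extractor from the poster sketch.
    return np.concatenate([poster_vector(f) for f in frames])
```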
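The SVM stage referenced above, shared by Runs 1 and 2, amounts to a few lines with scikit-learn; a minimal sketch with the stated linear kernel and C = 10, where the argument names are placeholders:

```python
from sklearn.svm import SVC

def classify(dev_vectors, dev_labels, test_vectors):
    """Train on development vectors (one row per movie), predict the test set."""
    clf = SVC(kernel='linear', C=10)  # linear kernel, soft margin C = 10
    clf.fit(dev_vectors, dev_labels)
    return clf.predict(test_vectors)
```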
2.2 Text based classification
We first obtain the plots from the baseline text data provided by the organizers as XML files for the whole dataset of 318 films. We parse the XML files and access the title and the plot of each film. We then process the plots by casefolding, tokenization, removal of non-alphanumeric characters, stopping based on Google's stop word list to reduce computational and space complexity, and stemming with the PorterStemmer of the Natural Language Toolkit (nltk)⁴; a sketch of this chain is given at the end of this section. Finally, we store the XML plot tokens for further use. After preprocessing and feature vectorization the training corpus contains 1 972 distinct terms.

In both runs we employ a two-step classification (see the second sketch at the end of this section). In the first step the text features are used to determine with a Naive Bayes classifier whether movies are good to watch on an airplane or not. In case of a positive match, i.e. the film is classified as being good to watch on an airplane, we compare the terms of the text features to a list of – in our opinion intuitive – terms that make movies unsuitable for being watched on an airplane: {airplane crash, airplane attack, hijack, hijacking, air force one, bomb, terrorist, kidnap, abuse, fascism}. If the text features of a positive example match terms of this list twice, the classification result is changed to the negative example class. This basically allows us to focus more on precision than on recall and to reduce the number of false positive hits. Experiments based on the development data set have shown that this significantly improves precision, while reducing recall only marginally.

Our first text based run – Run 3 – contains predictions based on the baseline text features, expanded by a set of text features we obtained from relevant web pages. The number of features used is 5 000, ordered by term frequency. The first 1 972 are features extracted from the baseline tokenized XML plots, while the remaining 3 028 are features extracted from our downloaded, preprocessed and tokenized web results. For creating features the CountVectorizer – scikit-learn's bag of words tool – is used, with unigrams as analyzer setting. As classification method we use Naive Bayes for multinomial models. In the final prediction vector, if a film that is already classified as suitable contains at least two of the list's terms, we change its classification to unsuitable for watching on an airplane.

Our second text based run – Run 4 – is based on the same 1 972 baseline text features, while the remaining 3 028 features are obtained from plot keywords and synopses from the Internet Movie Database (IMDb)⁵. All other characteristics remain as in Run 3.
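A minimal sketch of the preprocessing chain described above; the small stop word set is a stand-in for Google's list used in the paper.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
STOP_WORDS = {'the', 'a', 'an', 'of', 'and', 'to', 'in'}  # stand-in for Google's list

def preprocess(plot_text):
    """Casefold, tokenize, drop non-alphanumeric material and stop words, stem."""
    # Lowercasing plus an alphanumeric-only regex covers casefolding,
    # tokenization and non-alphanumeric removal in one step.
    tokens = re.findall(r'[a-z0-9]+', plot_text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]
```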
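The two-step scheme can be sketched as follows. For brevity the sketch fits a single CountVectorizer over the training documents instead of combining the 1 972 baseline terms with the 3 028 expansion features, and it matches the term list by plain substring counting; both are simplifications of the procedure described above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

UNSUITABLE_TERMS = ['airplane crash', 'airplane attack', 'hijack', 'hijacking',
                    'air force one', 'bomb', 'terrorist', 'kidnap', 'abuse',
                    'fascism']

def two_step_classify(train_docs, train_labels, test_docs):
    """Step 1: multinomial Naive Bayes on unigram counts (capped at 5 000 features).
    Step 2: flip positives that match the unsuitable-term list at least twice.
    train_docs / test_docs: one preprocessed string per movie; labels: 1 = suitable."""
    vectorizer = CountVectorizer(ngram_range=(1, 1), max_features=5000)
    X_train = vectorizer.fit_transform(train_docs)
    clf = MultinomialNB().fit(X_train, train_labels)
    predictions = list(clf.predict(vectorizer.transform(test_docs)))

    for i, doc in enumerate(test_docs):
        if predictions[i] == 1:  # step 1 says: suitable
            hits = sum(doc.count(term) for term in UNSUITABLE_TERMS)
            if hits >= 2:
                predictions[i] = 0  # override: unsuitable for an airplane
    return predictions
```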
3. RESULTS & DISCUSSION
In the ground truth 137 out of 225 movies were classified as suitable for watching on an airplane (positive examples); the remaining 88 were negative examples. Table 1 shows the results of our runs. Classifying all movies as positive would theoretically lead to a precision of 137/225 ≈ 0.6089 and a recall of 1.0. In that sense our approach of focusing on minimizing the number of false positives was not well chosen in terms of evaluation numbers. However, the meta data based runs (3 and 4) as well as the poster based run (1) are better than this naive classifier. Surprisingly enough, the visual-only runs perform comparably well in relation to the meta data based ones. The poster itself – with the given method – seems to carry enough information for classification.

Table 1: Results of the submitted runs, giving true and false positives and negatives, precision (P), recall (R) and F1.

Run   TP   FP   TN   FN   P       R       F1
1     87   52   35   49   0.6259  0.6397  0.6327
2     92   60   27   44   0.6053  0.6765  0.6389
3     92   55   32   44   0.6259  0.6765  0.6502
4     88   54   33   48   0.6197  0.6471  0.6331

4. CONCLUSIONS
In this work we have applied convolutional neural networks as well as meta data based methods for classification. Although the ground truth leans towards positive examples and our approach focuses on minimizing false positives, we think the chosen methods provide interesting results, especially when considering that the applied methods are well-known approaches, i.e. Naive Bayes classifiers, SVMs, CNNs, TF*IDF, etc., not tuned to the use case beyond the list of inappropriate concepts, and can surely be extended to better fit the use case.

5. REFERENCES
[1] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[2] M. Riegler, C. Spampinato, M. Larson, P. Halvorsen, and C. Griwodz. The MediaEval 2016 context of experience task: Recommending videos suiting a watching situation. In Proceedings of the MediaEval 2016 Workshop, 2016.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
[4] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li. Deep learning for content-based image retrieval: A comprehensive study. In Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, pages 157–166, New York, NY, USA, 2014. ACM.

Footnotes:
¹ https://www.statista.com/statistics/259985/global-filmed-entertainment-revenue/, last visited 2016-09-27
² http://scikit-learn.org/stable/, last visited 2016-10-06
³ https://ffmpeg.org/, last visited 2016-10-06
⁴ http://www.nltk.org/, last visited 2016-10-06
⁵ http://www.imdb.com/, last visited 2016-09-29

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.