Simula @ MediaEval 2016 Context of Experience Task

Konstantin Pogorelov, Michael Riegler, Pål Halvorsen, Carsten Griwodz
Simula Research Laboratory and University of Oslo
{konstantin, michael, paalh, griff}@simula.no

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

This paper presents our approach for the Context of Multimedia Experience Task of the MediaEval 2016 Benchmark. We present different analyses of the given data using different subsets of the data sources and combinations of them. Our approach provides a baseline evaluation indicating that metadata-based approaches work well, but also that visual features can provide useful information for the given problem.

1. INTRODUCTION

In this paper we present our solutions for the Context of Experience Task: recommending videos suiting a watching situation [10], which is part of the MediaEval 2016 Benchmark. The task's main purpose is to explore multimedia content that is watched in a certain situation. This situation can be seen as the context in which the multimedia content is consumed. The use case for the task is watching movies during a flight.

The hypothesis is that watching movies in a specific context will change the preferences of the viewers. This is related to similar hypotheses in the field of recommender systems, as presented for example in [12, 13], where context is also an important influencing factor. Nevertheless, it is also closely related to the field of quality of experience [9, 8, 4], because the context during a flight, such as loud noises and other distractions, can play an important role in which movies viewers choose to watch.

Participants of the Context of Experience task are asked to classify a list of movies into two classes, namely +goodonairplane or -goodonairplane. To tackle this problem we propose three different approaches. All three methods use information extracted directly from the movies, or the metadata containing information about the movies, in combination with a machine-learning-based classifier.

The remainder of the paper is organized as follows. First, we give a detailed explanation of our three approaches and the classification algorithm that we used. This is followed by a description of the experimental setup and the results. Finally, we draw a conclusion.

2. APPROACHES

In this section we describe our three proposed runs in more detail. For all runs we use the same classification algorithm to get the final class.

The classification algorithm that we used for all three runs is the PART algorithm [2], which is based on PARTial decision trees. PART relies on decision lists and uses a separate-and-conquer approach to create them. In each iteration, PART builds a partial decision tree, finds the best leaf in that tree, and turns it into a rule. This is repeated until the best set of rules for the given data has been found. The advantage of PART is that it is very simple. The simplicity is achieved by using rule-based learning and decision finding that does not require global optimization. A possible disadvantage of the algorithm is that the rule sets are rather large compared to other decision-based algorithms such as C4.5 [7] or RIPPER [1]. Nevertheless, for our use case this is not important because the dataset is rather small [11]. For all our runs we use the WEKA machine learning library's implementation of PART with the provided (optimal) standard settings [3].
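To make the classification step concrete, the following minimal Java sketch trains and evaluates PART through the WEKA API with its default settings, as used for all our runs. The file names train.arff and test.arff are placeholders: the sketch assumes the task data has already been converted to WEKA's ARFF format, and that the class attribute (+goodonairplane / -goodonairplane) is the last attribute; neither assumption is specified in the original pipeline description.

import weka.classifiers.Evaluation;
import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PartBaseline {
    public static void main(String[] args) throws Exception {
        // Placeholder files: the task data converted to ARFF format.
        Instances train = DataSource.read("train.arff");
        Instances test = DataSource.read("test.arff");
        // The class attribute is assumed to be the last one.
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        PART part = new PART(); // default ("standard") settings, as in our runs
        part.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(part, test);
        System.out.println(eval.toSummaryString());
        // F1-score for the positive class, assuming +goodonairplane has index 0.
        System.out.println("F1 (+goodonairplane): " + eval.fMeasure(0));
    }
}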
2.1 Metadata

For the metadata-only approach we used only the metadata provided by the task dataset. We limited the metadata to the following attributes: rating, country, language, year, runtime, Rotten Tomatoes score, IMDB score, Metacritic score, and genre. We pre-processed and transformed rating, language, country and genre into numeric values for the classification. The scores from the different movie scoring pages were normalized to a scale from 1.0 to 10.0. If a value was missing in the dataset, we manually searched for the information on the Internet and replaced it with what we found. If we could not find ratings for all scoring services, we used 5.0 (the average score) as the value.

2.2 Visual Information

For the visual data we downloaded the trailers from the provided links and extracted all frames. From each frame we extracted different visual features and combined them into one feature vector for the classification (with a dimension of 3,866 values). For the visual features, we decided to use several different global features. The features that we used for this work are: joint histogram, JPEG coefficient histogram, Tamura, fuzzy opponent histogram, simple color histogram, fuzzy color histogram, rotation invariant local binary patterns, fuzzy color and texture histogram, local binary patterns and opponent histogram, PHOG, rank and opponent histogram, color layout, CEDD, Gabor, opponent histogram, edge histogram, scalable color and JCD. All the features were extracted using the LIRE open source library [5]. A detailed description of all features can be found in [6].
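As an illustration of the extraction step, the following Java sketch computes two of the global features listed above for a single frame and concatenates them into one vector. It is a sketch only: it assumes the pre-1.0 LIRE API, where the global features live in net.semanticmetadata.lire.imageanalysis and expose extract() and getDoubleHistogram(); class and method names differ in later LIRE versions. The full 3,866-dimensional vector would concatenate all of the listed features in the same manner.

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import net.semanticmetadata.lire.imageanalysis.CEDD;
import net.semanticmetadata.lire.imageanalysis.Tamura;

public class FrameFeatures {
    // Extracts two of the global features named above from one video frame
    // and concatenates them into a single vector.
    public static double[] extract(File frame) throws Exception {
        BufferedImage img = ImageIO.read(frame);

        CEDD cedd = new CEDD();
        cedd.extract(img);
        double[] ceddHist = cedd.getDoubleHistogram();

        Tamura tamura = new Tamura();
        tamura.extract(img);
        double[] tamuraHist = tamura.getDoubleHistogram();

        // Concatenate; repeating this for all listed features yields the
        // combined per-frame feature vector used for classification.
        double[] combined = new double[ceddHist.length + tamuraHist.length];
        System.arraycopy(ceddHist, 0, combined, 0, ceddHist.length);
        System.arraycopy(tamuraHist, 0, combined, ceddHist.length, tamuraHist.length);
        return combined;
    }
}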
2.3 Metadata and Visual Information Combined

For the final run we combined the metadata with the visual feature information. To combine the visual information with the metadata, we first ran the classifier on the visual information with a modification so that the output was not a binary decision but a probability for each class. These probabilities are then added to the metadata as two additional features (the probability of being negative and of being positive). The extended feature vector is then used for finding the final class. This can be seen as a kind of late fusion approach, which the literature generally regards as performing better than early fusion [14].
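A minimal sketch of this fusion step is given below, again using the WEKA API. It assumes that the visual-feature instances and the metadata instances describe the same movies in the same order, that meta's class index has not been set yet, and that a PART model has already been trained on the visual training data; the attribute names p_good and p_not_good are illustrative, not taken from the original setup.

import weka.classifiers.rules.PART;
import weka.core.Attribute;
import weka.core.Instances;

public class LateFusion {
    // Appends the visual classifier's class probabilities to the metadata
    // instances, producing the extended feature vector described above.
    public static Instances fuse(PART visualModel, Instances visual, Instances meta)
            throws Exception {
        int base = meta.numAttributes() - 1; // insert before the class attribute
        meta.insertAttributeAt(new Attribute("p_good"), base);
        meta.insertAttributeAt(new Attribute("p_not_good"), base + 1);
        meta.setClassIndex(meta.numAttributes() - 1); // class stays last

        for (int i = 0; i < meta.numInstances(); i++) {
            // Class membership probabilities from the visual-only classifier.
            double[] dist = visualModel.distributionForInstance(visual.instance(i));
            meta.instance(i).setValue(base, dist[0]);
            meta.instance(i).setValue(base + 1, dist[1]);
        }
        return meta; // train and evaluate PART on this extended vector
    }
}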
3. EXPERIMENTAL SETUP

The dataset provided by the task contains 318 movies in total, split into a training set and a test set. The test set contains 223 movies. For each run we calculated the F1-score, precision and recall. For the trailers, only links were provided, and we had to download them. Furthermore, the posters of the movies were also provided, but we did not use them in our approaches. Apart from the movies, we also used the provided metadata. We did not collect any additional data such as full-length movies, and we did not use the pre-extracted visual, text and audio features.

The goal of the task was, as mentioned before, to automatically identify whether a movie is suitable to be watched during a flight or not. We assessed three different methods executed in three runs. An overview of the conducted runs can be found in Table 1, where we provide a summarized overview and a short description of each method. The organizers also provided a baseline for comparison based on a simple random tree algorithm (last row in the tables).

Table 1: The configuration of our three submitted runs for the task. R1 combines visual and metadata, R2 uses only the metadata, and R3 uses only the visual data for classification. The last row shows the baseline provided by the organizers.

Run      | Description
R1       | Metadata and visual data combined
R2       | Metadata only
R3       | Visual data only
Baseline | All available metadata

4. RESULTS

Table 2 gives a detailed overview of the results in terms of true positives, false positives, true negatives and false negatives achieved by our runs and the baseline. Table 3 depicts the official results of the task metrics for our runs and the baseline; these metrics follow directly from the counts in Table 2 (for example, for R3: precision = 133/(133+77) ≈ 0.6333 and recall = 133/(133+3) ≈ 0.9779).

Table 2: Detailed results for each run and the baseline regarding true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).

Run      | TP  | FP | TN | FN
R1       | 101 | 65 | 22 | 35
R2       | 127 | 83 |  4 |  9
R3       | 133 | 77 | 10 |  3
Baseline |  78 | 46 | 41 | 58

Table 3: MediaEval 2016 Context of Experience Task official results.

Run      | F1-score | Precision | Recall
R1       | 0.6688   | 0.6084    | 0.7426
R2       | 0.7341   | 0.6047    | 0.9338
R3       | 0.7687   | 0.6333    | 0.9779
Baseline | 0.6000   | 0.6290    | 0.5735

All three runs outperformed the baseline significantly. R1, which used metadata and visual information at the same time, had the lowest performance of our runs. This was surprising to us, since we had expected this approach to perform best. A reason for the weak performance could be the way we combined the different features. The second best of our runs is R2, which uses metadata only. This is not surprising, since metadata is well known to perform well, and in general better than content-based classification. R3 was the best performing approach and even outperformed the metadata approach, which was not expected. It seems that for the use case of watching movies on a flight, the visual features of the movie play an important role. The reason could be that movies with brighter colors are preferred. Nevertheless, we have to investigate this in more detail before drawing a final conclusion.

5. CONCLUSION

This paper presented three approaches for the Context of Experience task, which were able to classify movies into two subsets: suitable or not suitable to be watched on an airplane. The results and insights gained by evaluating our different methods indicate that there is a difference in what people would like to watch during a flight, and that this difference is detectable to a certain extent by automatic analysis of metadata and content-based information. Nevertheless, we clearly see the need to extend this work with multiple and larger datasets. Additionally, it might be important to collect user opinions not via crowdsourcing but from actually travelling people.

6. ACKNOWLEDGMENT

This work has been funded by the FRINATEK project "Efficient Execution of Large Workloads on Elastic Heterogeneous Resources" (EONS, project number 231687) of the Norwegian Research Council (NFR).

7. REFERENCES

[1] W. W. Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115–123, 1995.
[2] E. Frank and I. H. Witten. Generating accurate rule sets without global optimization. In J. Shavlik, editor, Fifteenth International Conference on Machine Learning, pages 144–151. Morgan Kaufmann, 1998.
[3] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[4] P. Lebreton, A. Raake, M. Barkowsky, and P. Le Callet. Evaluating complex scales through subjective ranking. In Proc. of QoMEX. IEEE, 2014.
[5] M. Lux. LIRE: Open source image retrieval in Java. In Proceedings of the 21st ACM International Conference on Multimedia, pages 843–846. ACM, 2013.
[6] M. Lux and O. Marques. Visual Information Retrieval Using Java and LIRE. Synthesis Lectures on Information Concepts, Retrieval, and Services, 5(1):1–112, 2013.
[7] J. R. Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.
[8] J. A. Redi, Y. Zhu, H. de Ridder, and I. Heynderickx. How passive image viewers became active multimedia users. In Visual Signal Quality Assessment. Springer, 2015.
[9] U. Reiter, K. Brunnström, K. De Moor, M.-C. Larabi, M. Pereira, A. Pinheiro, J. You, and A. Zgank. Factors influencing quality of experience. In Quality of Experience. Springer, 2014.
[10] M. Riegler, C. Spampinato, M. Larson, P. Halvorsen, and C. Griwodz. The MediaEval 2016 context of experience task: Recommending videos suiting a watching situation. In Proceedings of the MediaEval 2016 Workshop, 2016.
[11] M. Riegler, M. Larson, C. Spampinato, P. Halvorsen, M. Lux, J. Markussen, K. Pogorelov, C. Griwodz, and H. Stensland. Right inflight? A dataset for exploring the automatic prediction of movies suitable for a watching situation. In Proc. of MMSys. ACM, 2016.
[12] A. Said, S. Berkovsky, and E. W. De Luca. Putting things in context: Challenge on context-aware movie recommendation. In Proc. of CAMRa. ACM, 2010.
[13] A. Said, S. Berkovsky, and E. W. De Luca. Group recommendation in context. In Proc. of CAMRa. ACM, 2011.
[14] C. G. Snoek, M. Worring, and A. W. Smeulders. Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 399–402. ACM, 2005.