=Paper=
{{Paper
|id=Vol-1436/Paper32
|storemode=property
|title=RECOD at MediaEval 2015: Affective Impact of Movies Task
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper32.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MoreiraAPMTVGR15
}}
==RECOD at MediaEval 2015: Affective Impact of Movies Task==
Daniel Moreira (1), Sandra Avila (2), Mauricio Perez (1), Daniel Moraes (1), Vanessa Testoni (3), Eduardo Valle (2), Siome Goldenstein (1), Anderson Rocha (1)
(1) Institute of Computing, University of Campinas, SP, Brazil
(2) School of Electrical and Computing Engineering, University of Campinas, SP, Brazil
(3) Samsung Research Institute Brazil, SP, Brazil
Corresponding author: Anderson Rocha, anderson.rocha@ic.unicamp.br

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.

ABSTRACT
This paper presents the approach used by the RECOD team to address the challenges of the MediaEval 2015 Affective Impact of Movies Task. We designed various video classifiers, which relied on bags of visual features and on bags of auditory features. We combined these classifiers using different approaches, ranging from majority voting to machine-learned techniques on the training dataset. We participated only in the Violence Detection subtask.

1. INTRODUCTION
The MediaEval 2015 Affective Impact of Movies Task challenged its participants to automatically classify video content with regard to three high-level concepts: valence, arousal, and violence [5].
The activities of classifying video valence and video arousal were grouped under the same subtask, Induced Affect Detection. The classification of violence, in turn, belonged to the Violence Detection subtask, in which participants were expected to label a video as violent or not.
For both subtasks, the same annotated video dataset was provided. It consisted of short clips extracted from 199 Creative Commons-licensed movies of various genres. A detailed overview of the two subtasks, metrics, dataset content, license, and annotation process can be found in [5].
In the following sections, we detail the classifiers we designed to solve the task. Thereafter, we explain the setup of the submitted runs and report the results, with the proper discussion.

2. SYSTEM DESCRIPTION
We designed video classifiers based on bags of visual features and on bags of auditory features. Following the typical bags-of-features approach, these classifiers implement a pipeline composed of three stages: (i) low-level video/audio description, (ii) mid-level feature extraction, and (iii) supervised classification. These classifiers are then combined either in a majority-voting fashion or in a machine-learned scheme. As we are patenting the developed approach, a few technical aspects are not reported in this manuscript.

2.1 Bags of Visual Features
First of all, similarly to Akata et al. [1], as a preprocessing step and for the sake of saving low-level description time, we reduce the resolution of all videos, keeping the original aspect ratio.
We developed two classifiers based on bags of visual features. These classifiers differ from each other mainly with respect to the employed low-level local video descriptors. We have a solution based on a static frame descriptor (Speeded-Up Robust Features, SURF [2]), and another solution based on a space-temporal video descriptor.
In the particular case of the SURF-based classifier, SURF descriptions are extracted on a dense spatial grid, at multiple scales. In the case of the space-temporal-based one, we apply a sparse description of the video space-time (i.e., we describe only the detected space-temporal interest points).
Prior to the mid-level feature extraction, for the sake of saving extraction time, we also reduce the dimensionality of the low-level descriptions.
In the mid-level feature extraction, for each descriptor type, we use a bag-of-visual-words-based representation [4].
In the high-level video classification, we employ a linear Support Vector Machine (SVM) to label the mid-level features, as suggested in [4].

2.2 Bags of Auditory Features
We developed three classifiers based on bags of auditory features. Analogously to the visual ones, these classifiers differ from each other with respect to the employed low-level audio descriptors. We use the openSMILE library [3] to extract the audio features.
Prior to the mid-level feature extraction, for the sake of saving extraction time, we also reduce the dimensionality of the low-level descriptions.
To deal with the semantic gap between the low-level audio descriptions and the high-level concept of violence, we adapt a bag-of-features-based representation [4] to quantize the auditory features.
Finally, concerning the high-level video classification, we employ a linear SVM.
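Since several technical details are deliberately withheld (see the patent remark in Section 2), the following is only a minimal sketch of the generic bags-of-features pipeline shared by the visual and auditory classifiers. It assumes PCA for the dimensionality reduction, a k-means codebook with hard-assignment histograms for the mid-level representation, and scikit-learn's LinearSVC for the high-level classification; the low-level descriptors (dense SURF, space-temporal interest points, or openSMILE audio features) are assumed to be already extracted and given as NumPy arrays. The concrete codebook size, reduction technique, and encoding used by the authors may differ.

```python
# Minimal sketch of the bags-of-features pipeline of Section 2:
# low-level description -> dimensionality reduction -> mid-level encoding
# -> linear SVM. Codebook size, PCA dimension, and hard assignment are
# illustrative assumptions; the paper does not disclose these choices.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC


def build_codebook(descriptors, n_words=256, pca_dim=32):
    """Fit PCA and a visual/auditory codebook on stacked low-level descriptors."""
    pca = PCA(n_components=pca_dim).fit(descriptors)
    codebook = MiniBatchKMeans(n_clusters=n_words, random_state=0)
    codebook.fit(pca.transform(descriptors))
    return pca, codebook


def encode_video(video_descriptors, pca, codebook):
    """Mid-level feature: L1-normalized histogram of codeword assignments."""
    words = codebook.predict(pca.transform(video_descriptors))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)


def train_classifier(descriptors_per_video, labels, pca, codebook):
    """High-level classification: linear SVM over the mid-level features."""
    X = np.vstack([encode_video(d, pca, codebook) for d in descriptors_per_video])
    return LinearSVC(C=1.0).fit(X, labels)
```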
2.3 Combination Schemes
To combine the various classifiers, we adopt two late fusion schemes. In the first one, we combine the scores returned by the various classifiers in a voting fashion. After counting the votes, we designate the video class as the most voted one. To attribute a final score, we pick the score of the classifier that presents the strongest certainty regarding the video class.
In the second combination scheme, we concatenate the positive scores of the classifiers in a predefined order and feed them to an additional classifier. (A code sketch of both schemes is given at the end of Section 2.)

2.4 External Data and Data Augmentation
In the dataset of this year, 6,144 short video clips were provided in the development (i.e., training) set [5]. From this total, only 272 video clips were from the positive class, a small number for an effective training of our techniques. Therefore, in order to augment such content and obtain a more balanced training set, we incorporated, as an external data source, the 86 YouTube web videos that were provided in the competition of last year [6].
Given that these web videos were, on average, longer than the videos of this year, we decided to segment the positively annotated chunks into parts of 10–12 seconds. That led to a total of 252 additional positive segments to augment our positive training dataset.
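For illustration, the sketch below shows one way the two late-fusion schemes of Section 2.3 could be realized. It assumes signed SVM decision values as the per-classifier scores (positive meaning violent); the use of LinearSVC as the additional second-level classifier and the exact tie-breaking rule are assumptions, not details given in the paper.

```python
# Illustrative sketch of the two late-fusion schemes of Section 2.3.
# Scores are assumed to be signed SVM decision values (positive = violent);
# the "additional classifier" of the second scheme is assumed to be another
# linear SVM, which the paper does not specify.
import numpy as np
from sklearn.svm import LinearSVC


def majority_vote_fusion(scores):
    """scores: 1-D array with one decision value per base classifier."""
    votes = (scores > 0).astype(int)           # 1 = violent, 0 = non-violent
    label = int(votes.sum() * 2 > len(votes))  # most voted class (ties -> 0)
    # Final score: the score of the most confident classifier among those
    # that voted for the winning class.
    candidates = scores[votes == label]
    score = candidates[np.argmax(np.abs(candidates))]
    return label, float(score)


def train_stacking_fusion(base_scores_train, labels):
    """Second scheme: concatenate the base scores (in a fixed order) per video
    and train an additional classifier on top of them."""
    return LinearSVC().fit(np.asarray(base_scores_train), labels)


# Usage with hypothetical decision values from three base classifiers:
label, score = majority_vote_fusion(np.array([0.8, -0.2, 1.5]))
```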
3. SUBMITTED RUNS
This year, participants were allowed to submit up to five runs for the Violence Detection subtask, with at least one requiring the use of no external training data [5]. The official evaluation metric is mean average precision (MAP), calculated with the NIST trec_eval tool (http://trec.nist.gov/trec_eval/).
Table 1 summarizes the runs submitted this year to the competition. In total, we generated five different runs. In two of them, we did not use external data, while in the remaining three we employed external data, as explained in Section 2.4.

Table 1: Official results obtained for the Violence Detection subtask.

Run | External Data | Visual Features | Auditory Features | Combination     | MAP
1   | No            | All             | All               | Majority Voting | 0.1143
2   | No            | All             | All               | Classifier      | 0.0690
3   | Yes           | All             | All               | Majority Voting | 0.1126
4   | Yes           | No              | Tone              | Majority Voting | 0.0924
5   | Yes           | Space-temporal  | No                | Majority Voting | 0.0960

4. RESULTS AND DISCUSSION
The best result (run 1) was achieved by the classifier that used a majority-voting late combination of visual and auditory features, trained with no external data (MAP = 0.1143). It performed better than the exact same solution (run 3, MAP = 0.1126), whose only difference was the use of external data in the training phase (as explained in Section 2.4).
Therefore, we failed to augment the training data. A reason for that may be the use of different types of video sources, given that this year Hollywood-like movie segments were provided [5], in contrast to the predominantly amateur web videos of last year [6].
Notwithstanding, the majority-voting late combination of visual and auditory features indeed improved the classification performance. Although trained with the same videos (with external data), runs 4 (auditory only, MAP = 0.0924) and 5 (visual only, MAP = 0.0960) achieved results below those of the combined solution (run 3, MAP = 0.1126).
Regarding our results, in general terms, we did not have enough positive samples to learn a better classifier, a mandatory requirement of the machine learning techniques that we employed.

5. CONCLUSIONS
This paper presented the video classifiers used by the RECOD team to participate in the Violence Detection subtask of the MediaEval 2015 Affective Impact of Movies Task. The reported results show that a late combination of visual- and auditory-feature-based classifiers leads to a better final classification system in the case of violence detection. Finally, given the machine learning nature of our solutions, the challenging dataset of this year did not contain enough positive video samples to learn a better classifier, which strongly impacted our results.

Acknowledgments
Part of the results presented in this paper were obtained through the project "Sensitive Media Analysis", sponsored by Samsung Eletrônica da Amazônia Ltda., in the framework of law No. 8,248/91. We also thank the financial support from CNPq, FAPESP and CAPES.

6. REFERENCES
[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Good practice in large-scale learning for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(3):507–520, 2014.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. Computer Vision and Image Understanding (CVIU), 110(3):346–359, 2008.
[3] F. Eyben, M. Wöllmer, and B. Schuller. openSMILE: The Munich versatile and fast open-source audio feature extractor. In ACM Multimedia, pages 1459–1462. ACM, 2010.
[4] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision (ECCV), pages 143–156, 2010.
[5] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 Affective Impact of Movies Task. In Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015.
[6] M. Sjöberg, B. Ionescu, Y. Jiang, V. L. Quang, M. Schedl, and C.-H. Demarty. The MediaEval 2014 Affect Task: Violent Scenes Detection. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.