RECOD at MediaEval 2014: Violent Scenes Detection Task

Sandra Avila‡, Daniel Moreira†, Mauricio Perez†, Daniel Moraes§, Isabela Cota†, Vanessa Testoni§, Eduardo Valle‡, Siome Goldenstein†, Anderson Rocha†
† Institute of Computing, University of Campinas (Unicamp), SP, Brazil
‡ School of Electrical and Computing Engineering, University of Campinas (Unicamp), SP, Brazil
§ Samsung Research Institute Brazil, SP, Brazil
sandra@dca.fee.unicamp.br, daniel.moreira@ic.unicamp.br, {mauricio.perez, daniel.moraes, isabela.cota}@students.ic.unicamp.br, vanessa.t@samsung.com, dovalle@dca.fee.unicamp.br, {siome, anderson.rocha}@ic.unicamp.br

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain

ABSTRACT
This paper presents the RECOD approaches used in the MediaEval 2014 Violent Scenes Detection task. Our system is based on the combination of visual, audio, and text features. We also evaluate the performance of a convolutional network as a feature extractor. We combined those features using a fusion scheme. We participated in the main and the generalization tasks.

1. INTRODUCTION
The objective of the MediaEval 2014 Violent Scenes Detection task is to automatically detect violent scenes in movies and web videos. The targeted violent scenes are those "one would not let an 8 years old child see in a movie because they contain physical violence".
This year, two different datasets were proposed: (i) a set of 31 Hollywood movies, for the main task, and (ii) a set of 86 short YouTube web videos, for the generalization task. The training data is the same for both subtasks. A detailed overview of the datasets and the subtasks can be found in [6].
In the following, we briefly introduce our system and discuss our results¹.

¹ There are some technical aspects which we cannot put directly in the manuscript, given that we are patenting the developed approach.

2. SYSTEM DESCRIPTION

2.1 Visual Features
In low-level visual feature extraction, we extract SURF descriptors [4]. For that, we first apply the FFmpeg software [1] to extract and resize the video frames. Low-level visual descriptors are extracted on a dense spatial grid at multiple scales. Next, they are reduced using a PCA algorithm.
Besides that, in order to incorporate temporal information, we compute dense trajectories and motion boundary descriptors, according to [7]. Again, for the sake of processing time, we decided to resize the video. Also, we reduce the dimensionality of the video descriptors.
In mid-level feature extraction, for each descriptor type, we use a bag of visual words-based representation.
Furthermore, we use a visual feature extractor based on Convolutional Networks, which were trained on the ImageNet 2012 training set [5]. It has been chosen due to its very competitive results on detection and classification tasks. Additionally, as far as we know, deep learning methods have not yet been employed in the MediaEval Violent Scenes Detection task.

2.2 Audio Features
Using the OpenSmile library [3], we extract several types of audio features. A bag of visual words-based representation is employed to quantize the audio features, and a PCA algorithm is also used to reduce the dimensionality of the features.

2.3 Text Features
To represent the movie subtitles, we apply the bag of words approach: the most common, simple, and successful document representation used so far. The bag of words vector is normalized using each term's document frequency.
Also, before creating the bag of words representation, we remove the stop words and we apply a stemming algorithm to reduce each word to its stem.
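To make the mid-level step shared by the visual (Section 2.1) and audio (Section 2.2) descriptors concrete, the sketch below shows one possible instantiation of the PCA reduction followed by bag-of-words encoding, using off-the-shelf scikit-learn components. It is only an illustration: the descriptor dimensionality, codebook size, and clustering method are placeholder choices and do not reproduce our exact pipeline, whose details we cannot disclose (cf. footnote 1).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

def learn_mid_level_model(train_descriptors, n_components=64, codebook_size=1024):
    """Fit PCA and a k-means codebook on low-level descriptors
    pooled from the training videos (sizes here are hypothetical)."""
    pca = PCA(n_components=n_components).fit(train_descriptors)
    reduced = pca.transform(train_descriptors)
    codebook = MiniBatchKMeans(n_clusters=codebook_size, random_state=0).fit(reduced)
    return pca, codebook

def encode_segment(segment_descriptors, pca, codebook):
    """Encode one video segment as an L1-normalized bag-of-words histogram."""
    words = codebook.predict(pca.transform(segment_descriptors))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Usage (train_descriptors: an (N, D) array of SURF, dense-trajectory, or
# audio descriptors; one call to encode_segment per video segment):
# pca, codebook = learn_mid_level_model(train_descriptors)
# feature_vector = encode_segment(segment_descriptors, pca, codebook)
```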
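In the same spirit, the subtitle representation of Section 2.3 can be sketched with standard tools: stop-word removal, stemming, and a bag of words weighted by document frequency. The English stop-word list, the Porter stemmer, and the TF-IDF weighting below are illustrative choices, not necessarily the ones used in our runs.

```python
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def analyze(text):
    """Lowercase, tokenize, drop stop words, and reduce words to their stems."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

# The TF-IDF weighting plays the role of the document-frequency
# normalization mentioned above.
vectorizer = TfidfVectorizer(analyzer=analyze)

# Usage (one "document" per video segment, built from its subtitle lines):
# X = vectorizer.fit_transform(segment_subtitles)  # sparse (n_segments, |vocab|)
```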
2.4 Classification
Classification is performed with Support Vector Machine (SVM) classifiers, using the LIBSVM library [2]. Moreover, classification is done separately for each descriptor. The outputs of those individual classifiers are then combined at the level of normalized scores. Our fusion strategy consists of a combination of the classification outcomes, optimized on the training set.
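The sketch below illustrates this late-fusion scheme with scikit-learn's SVC, which is built on LIBSVM: one classifier per descriptor type, min-max normalization of the scores, and a weighted sum. The RBF kernel and the equal default weights are placeholders; in our system the combination weights are optimized on the training set and are not detailed here (cf. footnote 1).

```python
from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM

def train_per_descriptor(features_by_type, labels):
    """Train one SVM per descriptor type (e.g. 'surf', 'trajectories',
    'cnn', 'audio', 'text'); labels are 1 for violent, 0 otherwise."""
    return {name: SVC(kernel="rbf", probability=True).fit(X, labels)
            for name, X in features_by_type.items()}

def fuse_scores(classifiers, test_features_by_type, weights=None):
    """Combine normalized per-descriptor scores with a weighted sum."""
    names = sorted(classifiers)
    if weights is None:  # placeholder: equal weights
        weights = {name: 1.0 / len(names) for name in names}
    fused = 0.0
    for name in names:
        scores = classifiers[name].predict_proba(test_features_by_type[name])[:, 1]
        scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
        fused = fused + weights[name] * scores
    return fused

# A segment is then flagged as violent when its fused score exceeds a
# threshold chosen on the training data (cf. footnote 2).
```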
3. RUNS SUBMITTED
In total, we generated 10 different runs: 5 runs for each subtask. For the main task (m), we have:

• m1: 3 types of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks) + text features;
• m2: 1 type of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks) + text features;
• m3: 1 type of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks);
• m4: 1 type of audio features + 2 types of visual features + text features;
• m5: 1 type of audio features.

For the generalization task (g), we have:

• g1: 3 types of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks);
• g2: 1 type of audio features + 3 types of visual features (including a visual feature extractor based on Convolutional Networks);
• g3: 1 type of audio features + 2 types of visual features;
• g4: 1 type of audio features;
• g5: 1 type of visual features.

4. RESULTS AND DISCUSSION
Tables 1 and 2 show the performance of our system for the main and the generalization tasks, respectively. We can notice that, despite the diversity of fusion strategies, the differences among most runs (m1, m2, m3 and g1, g2, g3) are quite small. We are currently investigating such results. Also, we observe that, for run m4, we selected a wrong threshold² by mistake.

² Scenes are classified as violent or non-violent based on a certain threshold.

     8mil   Brav   Desp   Ghos   Juma   Term   Vven   MAP
m1   0.204  0.477  0.337  0.567  0.188  0.479  0.378  0.376
m2   0.239  0.459  0.308  0.348  0.362  0.465  0.431  0.373
m3   0.189  0.545  0.277  0.465  0.212  0.418  0.489  0.371
m4   0.115  0.319  0.209  0.270  0.159  0.502  0.167  0.249
m5   0.373  0.301  0.307  0.423  0.175  0.308  0.317  0.315

Table 1: Official results obtained for the main task in terms of MAP2014 (per test movie and overall).

      g1     g2     g3     g4     g5
MAP   0.618  0.615  0.604  0.545  0.515

Table 2: Official results obtained for the generalization task in terms of MAP2014.

For the main task, our results are considerably below our expectations (based on our training results). By analyzing the results, we pointed out a crucial difference between training and test videos. In the Violent Scenes Detection task, the participants are instructed on how to extract the DVD data and convert it to MPEG format. For the sake of saving disk space, we opted to convert the MPEG video files to MP4 or to M4V. However, that choice introduced a set of problems.

First, with respect to the training data, we converted the MPEG video files to MP4 or to M4V, depending on the video container for which we were able to successfully synchronize the extracted frames with the numbers given by the groundtruth. Although both containers store the video stream in H.264 format, we did not notice that the M4V conversion resulted in a different video aspect ratio (718×432 pixels). Similarly, the audio encoding was also divergent: MP3 audio for MP4, while AAC audio for M4V. Next, due to frame synchronization issues, we kept the test data in its original format (MPEG-2, 720×576 pixels, with AC3 audio). Therefore, we faced the problem of dealing with different aspect ratios in training and testing data, as well as distinct audio formats.

For the generalization task, the problem is alleviated because the test data is provided in MP4.

Table 3 reports the unofficial (u) results for the main task, which we evaluated ourselves. Here, the results are obtained by using the data (training and test sets) in MPEG format. The first column indicates which input features were used: u1 for 1 type of audio features and u2 for text features. Unfortunately, due to time constraints, we were not able to prepare more runs.

It should first be mentioned that the results for run u2 are independent of the video format, since we directly extracted the movie subtitles from the DVDs. For run u1, we can notice a considerable improvement of classification performance, from 0.315 (run m5) to 0.493 (run u1), confirming the negative impact of using distinct audio formats. We are currently investigating the impact on visual features.

     8mil   Brav   Desp   Ghos   Juma   Term   Vven   MAP
u1   0.351  0.601  0.636  0.530  0.521  0.352  0.463  0.493
u2   0.402  0.237  0.407  0.345  0.232  0.277  0.188  0.298

Table 3: Unofficial results obtained for the main task in terms of MAP2014.

5. ACKNOWLEDGMENTS
This research was partially supported by FAPESP, CAPES, CNPq, and the Project "Capacitação em Tecnologia de Informação" financed by Samsung Eletrônica da Amazônia Ltda., using resources provided by the Informatics Law no. 8.248/91.

6. REFERENCES
[1] FFmpeg. http://www.ffmpeg.org/.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3):1–27, 2011.
[3] F. Eyben, M. Wöllmer, and B. Schuller. OpenSmile: The Munich versatile and fast open-source audio feature extractor. In ACM Multimedia, pages 1459–1462, 2010.
[4] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60:91–110, 2004.
[5] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575, 2014.
[6] M. Sjöberg, B. Ionescu, Y. Jiang, V. Quang, M. Schedl, and C. Demarty. The MediaEval 2014 Affect Task: Violent Scenes Detection. In MediaEval 2014 Workshop, Barcelona, Spain, October 16–17, 2014.
[7] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision (IJCV), 103:60–79, 2013.