The MediaEval 2015 Affective Impact of Movies Task

Mats Sjöberg1, Yoann Baveye2, Hanli Wang3, Vu Lam Quang4, Bogdan Ionescu5, Emmanuel Dellandréa6, Markus Schedl7, Claire-Hélène Demarty2, and Liming Chen6
1 Helsinki Institute for Information Technology HIIT, University of Helsinki, Finland, mats.sjoberg@helsinki.fi
2 Technicolor, France, [yoann.baveye,claire-helene.demarty]@technicolor.com
3 Tongji University, China, hanliwang@tongji.edu.cn
4 University of Science, VNU-HCMC, Vietnam, lamquangvu@gmail.com
5 University Politehnica of Bucharest, Romania, bionescu@imag.pub.ro
6 Ecole Centrale de Lyon, France, emmanuel.dellandrea@ec-lyon.fr, liming.chen@liris.cnrs.fr
7 Johannes Kepler University, Linz, Austria, markus.schedl@jku.at

ABSTRACT
This paper provides a description of the MediaEval 2015 "Affective Impact of Movies Task", which is running for the fifth year, previously under the name "Violent Scenes Detection". In this year's task, participants are expected to create systems that automatically detect video content that depicts violence, or predict the affective impact that video content will have on viewers. Here we provide insights on the use case, task challenges, data set and ground truth, task run requirements and evaluation metrics.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

1. INTRODUCTION
The Affective Impact of Movies Task is part of the MediaEval 2015 Benchmarking Initiative. The overall use case scenario of the task is to design a video search system that uses automatic tools to help users find videos that fit their particular mood, age or preferences. To address this, we present two subtasks:
• Induced affect detection: the emotional impact of a video or movie can be a strong indicator for search or recommendation;
• Violence detection: detecting violent content is an important aspect of filtering video content based on age.
This task builds on the experiences from previous years' editions of the Affect in Multimedia Task: Violent Scenes Detection. However, this year we introduce a completely new subtask for detecting the emotional impact of movies. In addition, we are introducing to MediaEval a newly extended data set consisting of 10,900 short video clips extracted from 199 Creative Commons-licensed movies.
In the literature, detection of violence in movies has been addressed only marginally until recently [8, 6, 1]. Similarly, in affective video content analysis it has been repeatedly claimed that the field would greatly benefit from a standardised evaluation data set [5, 9]. Most of the previously proposed methods for affective impact or violence detection suffer from a lack of consistent evaluation, which usually requires the use of a constrained and closed data set [4, 7, 10]. Hence, the task's main objective is to propose a public common evaluation framework for research in these closely related areas.
2. TASK DESCRIPTION
The task requires participants to deploy multimedia features to automatically detect violent content and the emotional impact of short movie clips. In contrast to previous years, the task no longer considers arbitrary starting and ending points of detected segments; instead, the short video clips are treated as single units for detection purposes, with a single judgement per clip. This year there are two subtasks: (i) induced affect detection, and (ii) violence detection. Both subtasks use the same videos for training and testing.
For the induced affect detection subtask, participants are expected to predict, for each video, its valence class (i.e., one of negative, neutral or positive) and its arousal class (i.e., one of calm, neutral or active). In this subtask, we focus on felt emotion, i.e., the actual emotion of the viewer when watching the video clip, rather than, for example, what the viewer believes he or she is expected to feel. Valence is defined as a continuous scale from most negative to most positive emotion, while arousal is defined continuously from most calm to most active emotion. However, to keep the two subtasks compatible and to enable participants to use similar systems for both, we have opted to discretise the two scales into three classes as follows:
• valence: negative, neutral, and positive,
• arousal: calm, neutral, and active.
For the violence detection subtask, participants are expected to classify each video as violent or non-violent. Violence is defined as content that "one would not let an 8-year-old child see in a movie because it contains physical violence".
To solve the task, participants are only allowed to use features extracted from the original video files, or metadata provided by the organisers. In addition, external data may be used for runs that are specifically marked as such; however, at least one run for each subtask must be submitted without any external data.

3. DATA DESCRIPTION
This year a single data set is provided: 10,900 short video clips extracted from 199 Creative Commons-licensed movies of various genres. The movies are split into a development set, intended for training and validation, and a test set of 100 and 99 movies respectively, resulting in 6,144 and 4,756 extracted short video clips.
The proposed data set is an extension of the LIRIS-ACCEDE data set, originally composed of 9,800 excerpts extracted from 160 movies [3]. For this task, 1,100 additional video clips have been extracted from 39 new movies and included in the test set. The selected feature films and short films range from professionally made to amateur movies, but almost all are indexed on video platforms referencing the best free-to-share movies or have been screened at film festivals. Since these movies are shared under Creative Commons licenses, the excerpts can also be shared and downloaded along with the annotations without infringing copyright. The excerpts have been extracted from the movies so that they last between 8 and 12 seconds and start and end with a cut or a fade.
Along with the video material and the annotations, features extracted from each video clip are also provided by the organisers. They correspond to the audiovisual features described in [3].

4. GROUND TRUTH
For each of the 10,900 video clips, the ground truth consists of: a binary value indicating the presence of violence, the class of the excerpt for felt arousal (calm-neutral-active), and the class for felt valence (negative-neutral-positive). Before the evaluation, participants are provided only with the annotations for the development set, while those for the test set are held back to be used for benchmarking the submitted results.
The original video clips included in the LIRIS-ACCEDE data set were already ranked along the felt valence and arousal axes using a crowdsourcing protocol [3]. Pairwise comparisons were generated using the quicksort algorithm and presented to crowdworkers, who had to select the video inducing the calmer emotion or the more positive emotion. In [2] the crowdsourced ranks were converted into absolute affective scores ranging from -1 to 1, which have been used to define the three classes for each affective axis for the MediaEval task. The negative and calm classes correspond respectively to the video clips with a valence or arousal score smaller than -0.15, the neutral class for both axes is assigned to the videos with an affective score between -0.15 and 0.15, and the positive and active classes are assigned to the videos with an affective score higher than 0.15. These limits have been defined empirically, taking into account the distribution of the data set in the valence-arousal space.
For the 2015 MediaEval evaluation the test set was extended with an additional 1,100 video clips. Due to time and resource constraints, these were annotated using a simplified scheme which takes advantage of the fact that we do not need a full ranking of the new video clips, but only to separate them into three classes for each affect axis. Two pivot videos were selected for each axis, with absolute scores very close to the -0.15 and 0.15 class boundaries. The annotation task could then be formulated as comparing each video clip to these pivot videos, and thus placing it in its correct class. In total, 17 annotators from five different countries were involved, and three judgements were collected for each pivot/affect dimension pair. Out of these three judgements the majority vote was selected.
For violence detection, the annotation process was similar to previous years' protocol. Firstly, all the videos were annotated separately by two groups of annotators from two different countries. In each group, regular annotators labelled all the videos, which were then reviewed by master annotators. Regular annotators were graduate students (typically single with no children) and master annotators were senior researchers (typically married with children). No discussions were held between annotators during the annotation process. Group 1 used 12 regular and 2 master annotators, while Group 2 used 5 regular and 2 master annotators. Within each group, each video received 2 different annotations, which were then merged by the master annotators into the final annotation for the group. Finally, the resulting annotations from the two groups were merged and reviewed once more by the task organisers.
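To make the class definitions concrete, the following minimal Python sketch (not part of the official task tools; function and variable names are illustrative) shows how an affective score in [-1, 1] would be mapped to the three discrete classes using the -0.15 and 0.15 boundaries, and how three pivot-comparison judgements could be reduced to a single label by majority vote.

from collections import Counter

def score_to_class(score, axis="valence"):
    # Map an affective score in [-1, 1] to one of the three task classes
    # using the empirically chosen class boundaries at -0.15 and 0.15.
    low, high = ("negative", "positive") if axis == "valence" else ("calm", "active")
    if score < -0.15:
        return low
    if score > 0.15:
        return high
    return "neutral"

def majority_vote(judgements):
    # Reduce the three judgements collected per clip and affect dimension
    # to a single label by majority vote.
    return Counter(judgements).most_common(1)[0][0]

# Example: score_to_class(-0.4, "arousal") returns "calm";
# majority_vote(["neutral", "positive", "positive"]) returns "positive".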
5. RUN DESCRIPTION
Participants can submit up to 5 runs for each subtask: induced affect detection and violence detection. Each subtask has a required run in which no external training data may be used; only the provided development data is allowed, together with any features that can be automatically extracted from the video. Both subtasks also allow optional runs in which any external data, such as Internet sources, can be used, as long as they are marked as "external data" runs.

6. EVALUATION CRITERIA
For the induced affect detection subtask the official evaluation measure is global accuracy, calculated separately for the valence and arousal dimensions. Global accuracy is the proportion of the returned video clips that have been assigned to the correct class (out of the three classes).
The official evaluation metric for the violence detection subtask is average precision, which is calculated using the trec_eval tool provided by NIST (http://trec.nist.gov/trec_eval/). This tool also produces a set of commonly used metrics such as precision and recall, which may be used for comparison purposes.
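As a rough illustration of the two measures (the official violence scores are produced by NIST's trec_eval tool, so the sketch below only shows what is being evaluated; names are illustrative, not part of the task tools):

def global_accuracy(predicted, ground_truth):
    # Proportion of video clips assigned to the correct class
    # (computed separately for valence and for arousal).
    correct = sum(1 for p, g in zip(predicted, ground_truth) if p == g)
    return correct / len(ground_truth)

def average_precision(ranked_clips, violent_clips):
    # Non-interpolated average precision over a ranked list of clips,
    # given the set of clips annotated as violent.
    hits, precisions = 0, []
    for rank, clip in enumerate(ranked_clips, start=1):
        if clip in violent_clips:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(violent_clips) if violent_clips else 0.0

# Example: global_accuracy(["calm", "active"], ["calm", "neutral"]) returns 0.5;
# average_precision(["c3", "c1", "c2"], {"c1", "c2"}) returns (1/2 + 2/3) / 2.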
7. CONCLUSIONS
The Affective Impact of Movies Task provides participants with a comparative and collaborative evaluation framework for violence and emotion detection in movies. The introduction of the induced affect detection subtask is a new effort for this year. In addition, we have started fresh with a data set not used in MediaEval before, consisting of short Creative Commons-licensed video clips, which enables sharing the data legally and directly with participants. Details on the methods and results of each individual team can be found in the papers of the participating teams in these proceedings.

Acknowledgments
This task is supported by the following projects: ERA-NET CHIST-ERA grant ANR-12-CHRI-0002-04, UEFISCDI SCOUTER grant 28DPST/30-08-2013, Vietnam National University Ho Chi Minh City grant B2013-26-01, Austrian Science Fund P25655, and EU FP7-ICT-2011-9 project 601166.

8. REFERENCES
[1] E. Acar, F. Hopfgartner, and S. Albayrak. Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues. In Proceedings of the 21st ACM International Conference on Multimedia, pages 717–720. ACM, 2013.
[2] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen. From crowdsourced rankings to affective ratings. In 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pages 1–6, July 2014.
[3] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing, 6(1):43–55, Jan. 2015.
[4] A. Hanjalic and L.-Q. Xu. Affective video content representation and modeling. IEEE Transactions on Multimedia, 7(1):143–154, Feb. 2005.
[5] M. Horvat, S. Popovic, and K. Cosic. Multimedia stimuli databases usage patterns: a survey report. In Proceedings of the 36th International ICT Convention MIPRO, pages 993–997, 2013.
[6] B. Ionescu, J. Schlüter, I. Mironica, and M. Schedl. A naive mid-level concept-based fusion approach to violence detection in Hollywood movies. In ICMR, pages 215–222, 2013.
[7] G. Irie, T. Satou, A. Kojima, T. Yamasaki, and K. Aizawa. Affective audio-visual words and latent topic driving model for realizing movie affective scene classification. IEEE Transactions on Multimedia, 12(6):523–535, Oct. 2010.
[8] C. Penet, C.-H. Demarty, G. Gravier, and P. Gros. Multimodal information fusion and temporal integration for violence detection in movies. In ICASSP, Kyoto, Japan, 2012.
[9] M. Soleymani, M. Larson, T. Pun, and A. Hanjalic. Corpus development for affective video indexing. IEEE Transactions on Multimedia, 16(4):1075–1089, June 2014.
[10] S. Zhang, Q. Huang, S. Jiang, W. Gao, and Q. Tian. Affective visualization and retrieval for music video. IEEE Transactions on Multimedia, 12(6):510–522, Oct. 2010.