=Paper=
{{Paper
|id=Vol-1436/Paper32
|storemode=property
|title=RECOD at MediaEval 2015: Affective Impact of Movies Task
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper32.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MoreiraAPMTVGR15
}}
==RECOD at MediaEval 2015: Affective Impact of Movies Task==
Daniel Moreira (1), Sandra Avila (2), Mauricio Perez (1), Daniel Moraes (1), Vanessa Testoni (3), Eduardo Valle (2), Siome Goldenstein (1), Anderson Rocha (1)
(1) Institute of Computing, University of Campinas, SP, Brazil
(2) School of Electrical and Computing Engineering, University of Campinas, SP, Brazil
(3) Samsung Research Institute Brazil, SP, Brazil
Corresponding author: Anderson Rocha, anderson.rocha@ic.unicamp.br

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, September 14-15, 2015, Wurzen, Germany.

ABSTRACT
This paper presents the approach used by the RECOD team to address the challenges of the MediaEval 2015 Affective Impact of Movies Task. We designed various video classifiers, which relied on bags of visual features and on bags of auditory features. We combined these classifiers using different approaches, ranging from majority voting to machine-learned techniques on the training dataset. We participated only in the Violence Detection subtask.

1. INTRODUCTION
The MediaEval 2015 Affective Impact of Movies Task challenged its participants to automatically classify video content with regard to three high-level concepts: valence, arousal, and violence [5].
The activities of classifying video valence and video arousal were grouped under the same subtask, Induced Affect Detection. The classification of violence, in turn, belonged to the Violence Detection subtask, in which participants were expected to label a video as violent or not.
For both subtasks, the same annotated video dataset was provided. It consisted of short clips extracted from 199 Creative Commons-licensed movies of various genres. A detailed overview of the two subtasks, metrics, dataset content, license, and annotation process can be found in [5].
In the following sections, we detail the classifiers we designed to solve the task. Thereafter, we explain the setup of the submitted runs and report the results, with the proper discussion.

2. SYSTEM DESCRIPTION
We designed video classifiers based on bags of visual features and on bags of auditory features. Following the typical bags-of-features approach, these classifiers implement a pipeline composed of three stages: (i) low-level video/audio description, (ii) mid-level feature extraction, and (iii) supervised classification. These classifiers are then combined either in a majority-voting fashion or in a machine-learned scheme. As we are patenting the developed approach, a few technical aspects are not reported in this manuscript.

2.1 Bags of Visual Features
First of all, similarly to Akata et al. [1], as a preprocessing step and for the sake of saving low-level description time, we reduce the resolution of all videos, keeping the original aspect ratio.
We developed two classifiers based on bags of visual features. These classifiers differ from each other mainly with respect to the employed low-level local video descriptors. We have a solution based on a static frame descriptor (Speeded-Up Robust Features, SURF [2]), and another solution based on a space-temporal video descriptor.
In the particular case of the SURF-based classifier, SURF descriptions are extracted on a dense spatial grid, at multiple scales. In the case of the space-temporal-based one, we apply a sparse description of the video space-time (i.e., we describe only the detected space-temporal interest points).
Prior to the mid-level feature extraction, for the sake of saving extraction time, we also reduce the dimensionality of the low-level descriptions.
In the mid-level feature extraction, for each descriptor type, we use a bag-of-visual-words-based representation [4].
In the high-level video classification, we employ a linear Support Vector Machine (SVM) to label the mid-level features, as suggested in [4].

2.2 Bags of Auditory Features
We developed three classifiers based on bags of auditory features. Analogously to the visual ones, these classifiers differ from each other with respect to the employed low-level audio descriptors. We use the openSMILE library [3] to extract the audio features.
Prior to the mid-level feature extraction, for the sake of saving extraction time, we also reduce the dimensionality of the low-level descriptions.
To deal with the semantic gap between the low-level audio descriptions and the high-level concept of violence, we adapt a bag-of-features-based representation [4] to quantize the auditory features.
Finally, concerning the high-level video classification, we employ a linear SVM.
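Since several technical details are deliberately withheld (see the patent remark in Section 2), the following is only a minimal sketch of the generic bags-of-features pipeline shared by the visual and auditory classifiers. It assumes PCA for the dimensionality reduction, a k-means codebook with hard-assignment histograms for the mid-level representation, and scikit-learn's LinearSVC for the high-level classification; the low-level descriptors (dense SURF, space-temporal interest points, or openSMILE audio features) are assumed to be already extracted and given as NumPy arrays. The concrete codebook size, reduction technique, and encoding used by the authors may differ.

```python
# Minimal sketch of the bags-of-features pipeline of Section 2:
# low-level description -> dimensionality reduction -> mid-level encoding
# -> linear SVM. Codebook size, PCA dimension, and hard assignment are
# illustrative assumptions; the paper does not disclose these choices.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans
from sklearn.svm import LinearSVC


def build_codebook(descriptors, n_words=256, pca_dim=32):
    """Fit PCA and a visual/auditory codebook on stacked low-level descriptors."""
    pca = PCA(n_components=pca_dim).fit(descriptors)
    codebook = MiniBatchKMeans(n_clusters=n_words, random_state=0)
    codebook.fit(pca.transform(descriptors))
    return pca, codebook


def encode_video(video_descriptors, pca, codebook):
    """Mid-level feature: L1-normalized histogram of codeword assignments."""
    words = codebook.predict(pca.transform(video_descriptors))
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)


def train_classifier(descriptors_per_video, labels, pca, codebook):
    """High-level classification: linear SVM over the mid-level features."""
    X = np.vstack([encode_video(d, pca, codebook) for d in descriptors_per_video])
    return LinearSVC(C=1.0).fit(X, labels)
```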
2.3 Combination Schemes
To combine the various classifiers, we adopt two late fusion schemes. In the first one, we combine the scores returned by the various classifiers in a voting fashion. After counting the votes, we designate the video class as the most voted one. To attribute a final score, we pick the score of the classifier that presents the strongest certainty regarding the video class.
In the second combination scheme, we concatenate the positive scores of the classifiers in a predefined order and feed them to an additional classifier. (A code sketch of both schemes is given at the end of Section 2.)

2.4 External Data and Data Augmentation
In the dataset of this year, 6,144 short video clips were provided in the development (i.e., training) set [5]. From this total, only 272 video clips were from the positive class, a small number for an effective training of our techniques. Therefore, in order to augment such content and obtain a more balanced training set, we incorporated, as an external data source, the 86 YouTube web videos that were provided in the competition of last year [6].
Given that these web videos were, on average, longer than the videos of this year, we decided to segment the positively annotated chunks into parts of 10–12 seconds. That led to a total of 252 additional positive segments to augment our positive training dataset.
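For illustration, the sketch below shows one way the two late-fusion schemes of Section 2.3 could be realized. It assumes signed SVM decision values as the per-classifier scores (positive meaning violent); the use of LinearSVC as the additional second-level classifier and the exact tie-breaking rule are assumptions, not details given in the paper.

```python
# Illustrative sketch of the two late-fusion schemes of Section 2.3.
# Scores are assumed to be signed SVM decision values (positive = violent);
# the "additional classifier" of the second scheme is assumed to be another
# linear SVM, which the paper does not specify.
import numpy as np
from sklearn.svm import LinearSVC


def majority_vote_fusion(scores):
    """scores: 1-D array with one decision value per base classifier."""
    votes = (scores > 0).astype(int)           # 1 = violent, 0 = non-violent
    label = int(votes.sum() * 2 > len(votes))  # most voted class (ties -> 0)
    # Final score: the score of the most confident classifier among those
    # that voted for the winning class.
    candidates = scores[votes == label]
    score = candidates[np.argmax(np.abs(candidates))]
    return label, float(score)


def train_stacking_fusion(base_scores_train, labels):
    """Second scheme: concatenate the base scores (in a fixed order) per video
    and train an additional classifier on top of them."""
    return LinearSVC().fit(np.asarray(base_scores_train), labels)


# Usage with hypothetical decision values from three base classifiers:
label, score = majority_vote_fusion(np.array([0.8, -0.2, 1.5]))
```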
3. SUBMITTED RUNS
This year, participants were allowed to submit up to five runs for the Violence Detection subtask, with at least one requiring the use of no external training data [5]. The official evaluation metric is mean average precision (MAP), calculated with the NIST trec_eval tool (http://trec.nist.gov/trec_eval/).
Table 1 summarizes the runs submitted this year to the competition. In total, we generated five different runs. In two of them, we did not use external data, while in the remaining three we employed external data, as explained in Section 2.4.

Table 1: Official results obtained for the Violence Detection subtask.

Run | External Data | Visual Features | Auditory Features | Combination     | MAP
1   | No            | All             | All               | Majority Voting | 0.1143
2   | No            | All             | All               | Classifier      | 0.0690
3   | Yes           | All             | All               | Majority Voting | 0.1126
4   | Yes           | No              | Tone              | Majority Voting | 0.0924
5   | Yes           | Space-temporal  | No                | Majority Voting | 0.0960

4. RESULTS AND DISCUSSION
The best result (run 1) was achieved by the classifier that used a majority-voting late combination of visual and auditory features, trained with no external data (MAP = 0.1143). It performed better than the exact same solution (run 3, MAP = 0.1126), whose only difference was the use of external data in the training phase (as explained in Section 2.4).
Therefore, we failed to augment the training data. A reason for that may be the use of different types of video sources, given that this year Hollywood-like movie segments were provided [5], in contrast to the predominantly amateur web videos of last year [6].
Notwithstanding, the majority-voting late combination of visual and auditory features indeed improved the classification performance. Although trained with the same videos (with external data), runs 4 (auditory only, MAP = 0.0924) and 5 (visual only, MAP = 0.0960) achieved results below those of the combined solution (run 3, MAP = 0.1126).
Regarding our results, in general terms, we did not have enough positive samples to learn a better classifier, a mandatory requirement of the machine learning techniques that we employed.

5. CONCLUSIONS
This paper presented the video classifiers used by the RECOD team to participate in the Violence Detection subtask of the MediaEval 2015 Affective Impact of Movies Task. The reported results show that a late combination of visual- and auditory-feature-based classifiers leads to a better final classification system in the case of violence detection. Finally, given the machine learning nature of our solutions, the challenging dataset of this year did not contain enough positive video samples to learn a better classifier, which strongly impacted our results.

Acknowledgments
Part of the results presented in this paper were obtained through the project "Sensitive Media Analysis", sponsored by Samsung Eletrônica da Amazônia Ltda., in the framework of law No. 8,248/91. We also thank the financial support from CNPq, FAPESP and CAPES.

6. REFERENCES
[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Good practice in large-scale learning for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 36(3):507–520, 2014.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. Computer Vision and Image Understanding (CVIU), 110(3):346–359, 2008.
[3] F. Eyben, M. Wöllmer, and B. Schuller. openSMILE: The Munich versatile and fast open-source audio feature extractor. In ACM Multimedia, pages 1459–1462. ACM, 2010.
[4] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In European Conference on Computer Vision (ECCV), pages 143–156, 2010.
[5] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 Affective Impact of Movies Task. In Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 14-15, 2015.
[6] M. Sjöberg, B. Ionescu, Y. Jiang, V. L. Quang, M. Schedl, and C.-H. Demarty. The MediaEval 2014 Affect Task: Violent Scenes Detection. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Spain, October 16-17, 2014.