The ICL-TUM-PASSAU Approach for the MediaEval 2015 "Affective Impact of Movies" Task

George Trigeorgis1, Eduardo Coutinho1, Fabien Ringeval2,3, Erik Marchi2, Stefanos Zafeiriou1, Björn Schuller1,3
1 Department of Computing, Imperial College London, UK
2 Machine Intelligence & Signal Processing Group, Technische Universität München, Munich, Germany
3 Chair of Complex & Intelligent Systems, University of Passau, Germany
g.trigeorgis@imperial.ac.uk

ABSTRACT
In this paper we describe the Imperial College London, Technische Universität München and University of Passau (ICL+TUM+PASSAU) team approach to the MediaEval 2015 "Affective Impact of Movies" challenge, which consists in the automatic detection of affective (arousal and valence) and violent content in movie excerpts. In addition to the baseline features, we computed spectral and energy related acoustic features, as well as the probability of various objects being present in the video. Random Forests, AdaBoost and Support Vector Machines were used as classification methods. The best results show that the dataset is highly challenging for both the affect and violence detection tasks, mainly because of issues in inter-rater agreement and data scarcity.

1. INTRODUCTION
The MediaEval 2015 Challenge "Affective Impact of Movies" comprises two subtasks using the LIRIS-ACCEDE database [2]. Subtask 1 targets the automatic categorisation of videos in terms of their affective impact: the goal is to identify the arousal (calm-neutral-excited) and valence (negative-neutral-positive) level of each video. The goal of Subtask 2 is to identify those videos that contain violent scenes. The full description of the tasks can be found in [22].

2. METHODOLOGY

2.1 Subtask 1: affect classification

Feature sets.
In our work we used both the baseline features provided by the organisers [2] and our own sets of audio-visual features, as described below.

The extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) was used to extract acoustic features with the openSMILE toolkit [6]. This feature set was designed as a standard acoustic parameter set for automatic speech emotion recognition [5, 18, 16] and has also been used successfully for other paralinguistic tasks [17]. The eGeMAPS comprises a total of 18 Low-Level Descriptors (LLDs), including frequency, energy/amplitude, and spectral related features. Various functionals were then applied to the LLDs over the whole instance, giving rise to a total of 88 features.

The emotional impact of videos can be heavily influenced by the kind of objects present in a given scene [11, 12, 15]. We therefore computed the probability of 1000 different object classes being present in each frame using a 16-layer convolutional neural network (CNN) pretrained on the ILSVRC2013 dataset [21, 4]. Let x ∈ R^{N×p} represent a video of the database with N frames and p pixels per frame, and f(·) the trained convolutional neural network with softmax activation functions in the output layer. The probability Pr(y = c | x_i; θ) of each of the 1000 classes being present in the i-th frame x_i of a video is obtained by forwarding the p pixel values through the network. By averaging these activations over all N frames of a video sequence, we obtained the probability distribution of the 1000 ILSVRC2013 classes that might be present in the video.
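The clip-level video descriptor is thus a simple average of per-frame class probabilities. The sketch below illustrates this averaging step in NumPy; the frame iterable and the predict_frame_probs callable (e.g. a forward pass through a pretrained VGG-16 with a softmax output) are illustrative assumptions, not the original pipeline code.

```python
import numpy as np

def video_object_probabilities(frames, predict_frame_probs):
    """Average per-frame object-class probabilities over a whole clip.

    frames: iterable of frames (e.g. HxWx3 arrays), one per video frame.
    predict_frame_probs: callable mapping one frame to a length-1000
        vector of softmax probabilities Pr(y = c | x_i; theta),
        e.g. a forward pass through a pretrained 16-layer CNN.
    """
    probs = np.stack([predict_frame_probs(f) for f in frames])  # (N, 1000)
    return probs.mean(axis=0)  # one 1000-dimensional descriptor per video
```

The averaging discards temporal ordering and keeps only the overall likelihood of each object class appearing somewhere in the excerpt.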
Classifiers.
To model the data we concentrated on two out-of-the-box ensemble techniques: Random Forests and AdaBoost. We chose these two techniques because, thanks to their combination of weak learners, they are less susceptible to overfitting than other learning algorithms, they are trivial to optimise as they have only one hyper-parameter, and they usually provide results close to or on par with the state of the art for a multitude of tasks [9, 10, 23, 14]. The hyper-parameters of each classifier were determined using a 5-fold cross-validation scheme on the development set. During development, the best performance was achieved with 10 trees for Random Forests and 20 trees for AdaBoost.
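The paper does not specify an implementation for these ensembles; as a minimal sketch, the single hyper-parameter (the number of trees/weak learners) could be tuned with 5-fold cross-validation using scikit-learn, with random placeholder data standing in for the development features and labels.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder development data: one feature vector and one
# arousal (or valence) label per movie excerpt.
X_dev = np.random.rand(200, 88)
y_dev = np.random.randint(0, 3, size=200)

for name, clf in [("RandomForest", RandomForestClassifier(random_state=0)),
                  ("AdaBoost", AdaBoostClassifier(random_state=0))]:
    # 5-fold cross-validation over the number of trees / weak learners.
    search = GridSearchCV(clf, {"n_estimators": [5, 10, 20, 50, 100]}, cv=5)
    search.fit(X_dev, y_dev)
    print(name, search.best_params_, round(search.best_score_, 3))
```

The same search would be run separately for the arousal and valence labels, keeping the configuration with the best cross-validated accuracy on the development set.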
Runs.
We submitted a total of five runs. Run 1 consisted of predictions made with the baseline features and the AdaBoost model. The predictions in Runs 2 and 5 were obtained using the baseline plus our audio-visual feature sets, with the Random Forest and AdaBoost classifiers, respectively. By looking at the distribution of labels in the development set, we observed that the most common combinations of labels are: 1) neutral valence (V^n) and negative arousal (A^−) (24%), and 2) positive valence (V^+) and negative arousal (A^−) (20%). Runs 3 and 4 are thus based on the hypothesis that the label distribution of the test set is similarly unbalanced: in Run 3 every clip was predicted as V^n, A^+, and in Run 4 every clip as V^+, A^−. These submissions act as a sanity check of our own models, as well as of other competitors' submissions.

2.2 Subtask 2: violence detection

Feature sets.
Following previous work [7, 13], we considered only spectral and energy based features as acoustic descriptors. Indeed, violent segments do not necessarily contain speech; voice-specific features, such as voice quality and pitch related descriptors, might thus not be a reliable source of information for violence. We extracted 22 acoustic low-level descriptors (LLDs) with the openSMILE toolkit [6]: loudness, alpha ratio, Hammarberg index, energy slope and energy proportion in the bands [0–500] Hz and [500–1500] Hz, and 14 MFCCs. All LLDs, with the exception of loudness and the measures of energy proportion, were computed separately for voiced and unvoiced segments. As the frames of a movie excerpt that contain violent scenes are unknown, we applied 5 functionals (maximum, minimum, range, arithmetic mean and standard deviation) to summarise the LLDs over the whole excerpt, which provided a total of 300 features; a sketch of this summarisation step is given below.

For the video modality, we used the same additional features as defined for Subtask 1. We also used the video genre metadata as an additional feature, owing to dependencies between movie genre and violent content.
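As a minimal illustration of the excerpt-level summarisation, the sketch below applies the five functionals to a frame-level LLD matrix; the array shapes and function name are assumptions for illustration and do not correspond to the exact openSMILE output format.

```python
import numpy as np

# The five functionals applied to each LLD contour over a movie excerpt.
FUNCTIONALS = {
    "max": np.max,
    "min": np.min,
    "range": lambda x: np.max(x) - np.min(x),
    "mean": np.mean,
    "std": np.std,
}

def summarise_llds(lld_matrix):
    """Summarise frame-level LLDs (n_frames x n_llds) into one vector
    of n_llds * 5 excerpt-level statistics."""
    return np.array([f(lld_matrix[:, j])
                     for j in range(lld_matrix.shape[1])
                     for f in FUNCTIONALS.values()])

# Example: 250 frames of 22 LLDs -> a 110-dimensional excerpt descriptor.
features = summarise_llds(np.random.rand(250, 22))
```

The 300-dimensional set used in the paper additionally separates voiced and unvoiced segments for most LLDs before applying the same functionals.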
Classifier.
Since the dataset is strongly imbalanced – only 272 excerpts out of 6,144 are labelled as violent – we up-sampled the violent instances to achieve a balanced distribution. All features were furthermore standardised with a z-score. As classifier, we used the libsvm implementation of Support Vector Machines (SVMs) [3] and optimised the complexity parameter and the γ coefficient of the radial basis function kernel in a 5-fold cross-validation framework on the development set. Because the official scoring script requires the computation of a posteriori probabilities, which is more time consuming than the straightforward classification task, we optimised the Unweighted Average Recall (UAR) to find the best hyper-parameters [19, 20], and then re-trained the SVMs with probability estimates.

Runs.
We first performed experiments with the full baseline feature set and found that adding the movie genre as a feature improved the Mean Average Precision (MAP) from 19.5 to 20.3, despite degrading the UAR from 72.3 to 72.0. Adding our own audio-visual features provided a jump in performance, with the MAP reaching 33.6 and the UAR 77.6. Because some movie excerpts contain only partly relevant acoustic information, we empirically defined a threshold on loudness, based on its histogram, to exclude frames before computing the functionals. This procedure improved the MAP to 35.9 but downgraded the UAR to 76.9. A fine tuning of the complexity parameter and γ coefficient yielded the best performance in terms of UAR, with a value of 78.0, but slightly deteriorated the MAP to 35.7.

We submitted a total of five runs. Run 1 – baseline features; Run 2 – all features mentioned above (except movie genre) with a loudness threshold of 0.038; Run 3 – same as Run 2 plus the inclusion of movie genre; Run 4 – as Run 3 but with fine tuning of the hyper-parameters; Run 5 – similar to Run 3 but with a higher loudness threshold (0.078).
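The loudness-based frame exclusion can be expressed as a simple mask applied to the LLD matrix before the functionals are computed; the sketch below assumes a per-frame loudness contour aligned with the LLD rows and uses the threshold of Run 2 only as an example value.

```python
import numpy as np

def exclude_quiet_frames(lld_matrix, loudness, threshold=0.038):
    """Keep only frames whose loudness reaches the empirical threshold.

    lld_matrix: (n_frames, n_llds) frame-level descriptors
    loudness:   (n_frames,) per-frame loudness contour
    """
    mask = loudness >= threshold
    # Fall back to all frames if the threshold would discard everything.
    return lld_matrix[mask] if mask.any() else lld_matrix
```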
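The classifier configuration described above (up-sampling of the violent class, z-score standardisation, RBF-kernel SVM with probability estimates, hyper-parameters selected by UAR) could be sketched as follows with scikit-learn, which wraps libsvm; the placeholder data and parameter grids are illustrative assumptions, not the values used for the submissions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils import resample

# Placeholder development data: excerpt-level feature vectors and
# a heavily imbalanced violent/non-violent label distribution.
X = np.random.rand(500, 300)
y = np.random.choice([0, 1], size=500, p=[0.9, 0.1])  # 1 = violent

# Up-sample the violent class to obtain a balanced training set.
Xv, Xn = X[y == 1], X[y == 0]
Xv_up = resample(Xv, replace=True, n_samples=len(Xn), random_state=0)
X_bal = np.vstack([Xn, Xv_up])
y_bal = np.concatenate([np.zeros(len(Xn)), np.ones(len(Xv_up))])

# z-score standardisation + RBF SVM with probability estimates;
# tune C and gamma by 5-fold cross-validation, scored with UAR
# (macro-averaged recall).
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(pipe, grid, cv=5, scoring="recall_macro")
search.fit(X_bal, y_bal)
print(search.best_params_, round(search.best_score_, 3))
```

Optimising UAR during the grid search and only afterwards fitting the probability estimates keeps the (slower) probability calibration out of the hyper-parameter loop.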
3. RESULTS
Our official results on the test set for both subtasks are shown in Table 1.

         Subtask 1                      Subtask 2
Run   Arousal (AC)   Valence (AC)   Violence (MAP)
1        55.72          39.99            4.9
2        54.71          41.00           13.3
3        55.55          37.87           13.5
4        55.55          29.02           14.9
5        54.46          41.48           13.9

Table 1: Submission results for the arousal, valence, and violence classification tasks on the test partition. AC stands for accuracy and MAP for mean average precision.

Subtask 1. Our results for the affective task indicate that we did not perform much better than chance for arousal classification, and only slightly better than chance for valence in Run 5; we therefore refrain from further interpretation of these results. This can be explained by the low quality of the annotations provided with the dataset: the initial annotations had a low inter-rater agreement [2], and there were multiple processing stages afterwards [1, 22] with high levels of uncertainty and unclear validity.

Subtask 2. The results show substantial overfitting in our models, as the performance drops by a factor of 2 between the development and test partitions. This is, however, not really surprising, since only 272 instances labelled as violent were available as training data. Moreover, because the labelling was performed not at the frame level but at the excerpt level, the information that is judged as violent cannot be modelled precisely, making the task highly challenging. We can nevertheless observe that the proposed audio-visual feature set brings a large improvement over the baseline feature set – the MAP is improved by a factor greater than 2 – and that the inclusion of the movie genre as an additional feature also brings a small improvement in performance.

4. CONCLUSIONS
We have presented our approach to the MediaEval 2015 "Affective Impact of Movies" challenge, which consists in the automatic detection of affective and violent content in movie excerpts. Our results for the affective task show that we did not perform much better than a chance-level classifier, although we used features and classifiers that are known to work well in the literature for arousal and valence prediction [2, 8]. We consider that this might be due to the potential noisiness of the provided annotations. As for the violence prediction subtask, the results show that we overfit considerably on the development set, which is not very surprising given the small number of instances of the minority class. The analysis of violent content at the excerpt level is also highly challenging, because only a few frames might contain violence, and such brief information is almost totally lost in the computation of functionals over the full excerpt.

5. ACKNOWLEDGEMENTS
The research leading to these results has received funding from the EC's Seventh Framework Programme through the ERC Starting Grant No. 338164 (iHEARu), and the EU's Horizon 2020 Programme through the Innovative Action No. 644632 (MixedEmotions), No. 645094 (SEWA) and the Research Innovative Action No. 645378 (ARIA-VALUSPA).

6. REFERENCES
[1] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen. From crowdsourced rankings to affective ratings. In Proceedings of the IEEE International Conference on Multimedia and Expo Workshops (ICMEW 2014), pages 1–6, 2014.
[2] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing, 6(1):43–55, January–March 2015.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), April 2011.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pages 248–255, 2009.
[5] F. Eyben, K. Scherer, B. Schuller, J. Sundberg, E. André, C. Busso, L. Devillers, J. Epps, P. Laukka, S. Narayanan, and K. Truong. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, in press, 2015.
[6] F. Eyben, F. Weninger, F. Groß, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia (MM 2013), pages 835–838, Barcelona, Spain, October 2013.
[7] F. Eyben, F. Weninger, N. Lehment, and B. Schuller. Affective video retrieval: Violence detection in Hollywood movies by large-scale segmental feature extraction. PLOS ONE, 8(12):1–12, December 2013.
[8] K. Forbes-Riley and D. J. Litman. Predicting emotion in spoken dialogue from multiple knowledge sources. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (HLT-NAACL), pages 201–208, 2004.
[9] G. Fanelli, J. Gall, and L. Van Gool. Real time head pose estimation with random regression forests. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 617–624, Providence (RI), USA, June 2011.
[10] L. Guo, N. Chehata, C. Mallet, and S. Boukir. Relevance of airborne lidar and multispectral image data for urban scene classification using Random Forests. ISPRS Journal of Photogrammetry and Remote Sensing, 66(1):56–66, January 2011.
[11] A. Hanjalic and L.-Q. Xu. Affective video content representation and modeling. IEEE Transactions on Multimedia, 7(1):143–154, 2005.
[12] W. Hu, N. Xie, L. Li, X. Zeng, and S. Maybank. A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 41(6):797–819, October 2011.
[13] B. Ionescu, J. Schlüter, I. Mironica, and M. Schedl. A naive mid-level concept-based fusion approach to violence detection in Hollywood movies. In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval (ICMR 2013), pages 215–222, Dallas (TX), USA, 2013.
[14] J.-J. Lee, P.-H. Lee, S.-W. Lee, A. Yuille, and C. Koch. AdaBoost for text detection in natural scene. In Proceedings of the 12th IEEE International Conference on Document Analysis and Recognition (ICDAR 2013), pages 429–434, Beijing, China, 2013.
[15] I. Lopatovska and I. Arapakis. Theories, methods and current research on emotions in library and information science, information retrieval and human–computer interaction. Information Processing & Management, 47(4):575–592, July 2011.
[16] F. Ringeval, S. Amiriparian, F. Eyben, K. Scherer, and B. Schuller. Emotion recognition in the wild: Incorporating voice and lip activity in multimodal decision-level fusion. In Proceedings of the 2nd Emotion Recognition In The Wild Challenge and Workshop (EmotiW 2014), pages 473–480, Istanbul, Turkey, September 2014.
[17] F. Ringeval, E. Marchi, M. Mehu, K. Scherer, and B. Schuller. Face reading from speech – predicting facial action units from audio cues. In Proceedings of INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association (ISCA), to appear, Dresden, Germany, September 2015.
[18] F. Ringeval, B. Schuller, M. Valstar, S. Jaiswal, E. Marchi, D. Lalanne, R. Cowie, and M. Pantic. AV+EC 2015 – The first affect recognition challenge bridging across audio, video, and physiological data. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge (AVEC), ACM MM, Brisbane, Australia, October 2015.
[19] B. Schuller, S. Steidl, A. Batliner, J. Epps, F. Eyben, F. Ringeval, E. Marchi, and Y. Zhang. The INTERSPEECH 2014 Computational Paralinguistics Challenge: Cognitive & physical load. In Proceedings of INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association (ISCA), pages 427–431, Singapore, September 2014.
[20] B. Schuller, S. Steidl, A. Batliner, S. Hantke, F. Hönig, J. R. Orozco-Arroyave, E. Nöth, Y. Zhang, and F. Weninger. The INTERSPEECH 2015 Computational Paralinguistics Challenge: Nativeness, Parkinson's & eating condition. In Proceedings of INTERSPEECH 2015, 16th Annual Conference of the International Speech Communication Association (ISCA), Dresden, Germany, September 2015.
[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[22] M. Sjöberg, Y. Baveye, H. Wang, V. Quand, B. Ionescu, E. Dellandréa, M. Schedl, C.-H. Demarty, and L. Chen. The MediaEval 2015 Affective Impact of Movies Task. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[23] A. Stumpf and N. Kerle. Object-oriented mapping of landslides using random forests. Remote Sensing of Environment, 115(10):2564–2577, October 2011.