Visual and audio analysis of movies video for emotion detection
      @ Emotional Impact of Movies task MediaEval 2018
                               Elissavet Batziou1 , Emmanouil Michail1 , Konstantinos Avgerinakis1 ,
                                    Stefanos Vrochidis1 , Ioannis Patras2 , Ioannis Kompatsiaris1
                               1 Information Technologies Institute, Centre for Research and Technology Hellas
                                                             2 Queen Mary University of London

                                       batziou.el@iti.gr, michem@iti.gr, koafgeri@iti.gr,
                                      stefanos@iti.gr, i.patras@qmul.ac.uk, ikom@iti.gr

ABSTRACT
This work reports the methodology that the CERTH-ITI team developed in order to recognize the emotional impact that movies have on their viewers in terms of valence/arousal and fear. More specifically, deep convolutional neural networks and several machine learning techniques are utilized to extract visual features and classify them with the trained models, while audio features are also taken into account in the fear scenario, leading to highly accurate recognition rates.

1    INTRODUCTION
Emotion-based content analysis has a large number of applications, including emotion-based personalized content delivery [2], video indexing [7], summarization [5] and the protection of children from potentially harmful video content. Another intriguing trend that has been attracting attention lately is style transfer, and more specifically recognizing the emotion of a painting or of a specific section of a movie and transferring its affect to the viewer as a style applied to a novel creation.
   The Emotional Impact of Movies Task is a challenge of MediaEval 2018 that comprises two subtasks: (a) valence/arousal prediction and (b) fear prediction from movies. The task provides a large collection of movie videos, their visual and audio features, and their annotations [1]. Both subtasks ask the participants to leverage any available technology in order to determine when and whether fear scenes occur and to estimate a valence-arousal score for each video frame in the provided test data [3].
   In this work, CERTH-ITI introduces its algorithms for the valence/arousal and fear recognition subtasks, which include the deployment of deep learning and other classification schemes to recognize the desired outcome. More specifically, a 3-layer neural network (NN) and a simple linear regression model are deployed, with and without PCA, so as to predict the correct emotion in the valence-arousal subtask, while a pre-trained VGG16 model [6] is combined with a K Nearest Neighbors (KNN) classification scheme, so as to leverage the visual and audio attributes respectively and identify the correct boundary video frames in the fear subtask.

Copyright held by the owner/author(s).
MediaEval’18, 29-31 October 2018, Sophia Antipolis, France

Figure 1: Block diagram of our approach for fear recognition

2    APPROACH
2.1    Valence-Arousal Subtask
In the valence-arousal recognition subtask, keyframe extraction is initially applied so as to extract one video frame per second and correlate the frames with the annotations provided by the MediaEval organizers, who also used the same one-second interval to record the human-annotated groundtruth. The provided visual features are then concatenated into one vector representation so as to have a common and fixed representation scheme across the different video samples.
   The first recognition approach estimates valence/arousal by adopting a linear regression model, which minimizes the residual sum of squares between the groundtruth and the predicted responses using a linear approximation (Run 1). PCA is also deployed on the final visual feature vectors so as to reduce their dimensionality and keep only the most discriminant principal components (in our case the first 2000) to represent all features (Run 2).
   A Neural Network (NN) framework has also been deployed so as to fulfil the valence/arousal recognition subtask. For this purpose, a NN with 3 hidden layers, ReLU activation functions and the Adam optimizer with learning rate 0.001 was deployed. The sizes of the hidden layers are 64, 32 and 32 respectively. We use a batch size of 10 and train for 10 epochs. The training set consists of 2/3 of the development set and the remaining 1/3 is used as validation set. The input of the NN is the set of concatenated visual feature vectors (Run 3). PCA has also been used in order to reduce the high dimensionality of the concatenated features (5367) to 2000 principal components (Run 4).
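   To make the above concrete, the sketch below outlines one possible implementation of the four valence/arousal runs, using scikit-learn for the linear regression and PCA and Keras for the 3-hidden-layer network. The file names, the choice of these particular libraries and the joint prediction of valence and arousal by a single model are illustrative assumptions rather than details reported above.

    # Hedged sketch of the four valence/arousal runs; assumes the concatenated
    # per-second visual features and annotations are available as numpy arrays.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from keras.models import Sequential
    from keras.layers import Input, Dense
    from keras.optimizers import Adam

    X = np.load("visual_features.npy")        # (n_frames, 5367), hypothetical file
    y = np.load("valence_arousal.npy")        # (n_frames, 2): valence and arousal

    # Runs 1-2: linear regression without and with PCA (2000 components).
    lr_full = LinearRegression().fit(X, y)                    # Run 1
    pca = PCA(n_components=2000).fit(X)
    lr_pca = LinearRegression().fit(pca.transform(X), y)      # Run 2

    # Runs 3-4: NN with 3 hidden layers (64/32/32, ReLU), Adam lr=0.001,
    # batch size 10, 10 epochs, 1/3 of the development set kept for validation.
    def build_nn(input_dim):
        model = Sequential([
            Input(shape=(input_dim,)),
            Dense(64, activation="relu"),
            Dense(32, activation="relu"),
            Dense(32, activation="relu"),
            Dense(2),                         # linear outputs: valence, arousal
        ])
        model.compile(optimizer=Adam(0.001), loss="mse")
        return model

    nn_full = build_nn(X.shape[1])                            # Run 3
    nn_full.fit(X, y, batch_size=10, epochs=10, validation_split=1 / 3)

    X_pca = pca.transform(X)
    nn_pca = build_nn(X_pca.shape[1])                         # Run 4
    nn_pca.fit(X_pca, y, batch_size=10, epochs=10, validation_split=1 / 3)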


2.2    Fear Subtask
For the fear recognition subtask, we initially apply keyframe extraction every one second, as in the valence-arousal subtask. The frames annotated as "fear" were significantly fewer than those of the "no-fear" class and, therefore, in order to balance our dataset we used data augmentation techniques. Firstly, we downloaded from Flickr about 10,000 images tagged "fear", and we also downloaded the emotion images of http://www.imageemotion.org/ and kept those annotated as "fear". In order to further increase the number of fear frames, we additionally applied data augmentation techniques on the provided annotated frames: we randomly rotate and translate pictures vertically or horizontally, randomly apply shearing transformations, randomly zoom inside pictures, flip half of the images horizontally, and fill in the newly created pixels that can appear after a rotation or a width/height shift. Finally, we reduced the set of "no-fear" frames. After these steps, we had about 23,000 "fear" and 30,000 "no-fear" images to train our model.
   We used transfer learning to exploit information learned from a large-scale dataset and to train our model in a realistic and efficient time. The architecture that we chose to represent our features is VGG16 pre-trained on the Places2 dataset [8], because the majority of the movies have places as background and so we assume this would be helpful. We use the Nadam optimizer with learning rate 0.0001, a batch size of 32 and 50 epochs. Finally, we set a threshold of 0.4 on the predicted fear probability (Run 1). In a different approach, we used the same architecture and additionally removed isolated predicted frames (Run 2).
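   A minimal sketch of this transfer-learning setup follows, assuming a Keras implementation. The Places2 weights file, the frozen backbone and the single sigmoid classification head are assumptions made for illustration; only the optimizer, learning rate, batch size, number of epochs and decision threshold are taken from the description above.

    # Hedged sketch: VGG16 backbone with a small binary "fear" head.
    import numpy as np
    from keras.applications import VGG16
    from keras.models import Model
    from keras.layers import Flatten, Dense
    from keras.optimizers import Nadam

    # Places2-pretrained weights are assumed to be available locally; Keras itself
    # ships only ImageNet weights, so this file name is hypothetical.
    base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
    base.load_weights("vgg16_places2_notop.h5")
    base.trainable = False              # frozen here; fine-tuning is also possible

    x = Flatten()(base.output)
    out = Dense(1, activation="sigmoid")(x)     # probability of "fear"
    model = Model(base.input, out)

    model.compile(optimizer=Nadam(0.0001), loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_flow, epochs=50)    # train_flow from the augmentation sketch

    # Run 1: a keyframe is labelled "fear" when its probability exceeds 0.4.
    test_frames = np.load("test_frames.npy")    # (n, 224, 224, 3), hypothetical file
    visual_fear = (model.predict(test_frames) > 0.4).astype(int).ravel()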
   Additionally, in order to exploit auditory information, we developed a classification method applied on the audio features already extracted by the challenge committee using the openSMILE toolbox [4]. The audio feature vectors, consisting of 1582 features extracted from the videos every second, were separated into a training (80%) and a validation (20%) set. In order to equalize the size of the two classes in the training set we randomly removed "no-fear" samples. We applied a KNN classifier with N=3 on the test set, and the results were further processed in order to remove erroneous false negatives (single "no-fear" samples surrounded by "fear" areas) and false positives (isolated small "fear" areas consisting of one or two "fear" samples).
   Results from the visual and audio analysis were submitted both separately, as different runs, and in combination, by averaging the posterior probabilities of the visual and auditory classifiers and setting a threshold of 0.7 on the average probability. The overall block diagram of this approach is depicted in Figure 1.
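   The late fusion itself reduces to a few lines of array arithmetic, as sketched below, reusing the (assumed) variable names of the previous sketches and assuming the two probability streams are aligned per second of video.

    # Hedged sketch of the late fusion of the visual and audio branches.
    p_visual = model.predict(test_frames).ravel()        # per-second fear probability
    p_audio = knn.predict_proba(X_audio_test)[:, 1]      # per-second fear probability

    fused_fear = ((p_visual + p_audio) / 2.0) > 0.7      # threshold on the average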
3    RESULTS AND ANALYSIS
We have submitted 4 runs for valence/arousal prediction and their results are presented in Table 1. Two evaluation measures are used: (a) Mean Square Error (MSE) and (b) Pearson Correlation Coefficient (r). We observe that the NN approach described in the previous section has the best performance amongst all the others. Furthermore, it is worth mentioning that the linear regression model produces some extremely high errors, probably because the original feature vectors were neither discriminative nor adequate enough to create the regression model. However, PCA projection to a lower-dimensional space with higher discriminative power appears to solve this problem, as it reduces the redundant noise and keeps the most important features. Moreover, there is a "NaN" score for the Pearson measure in the arousal prediction scores, because we accidentally set the training target to a constant value and so our model predicts the same score for all frames; this score does not characterize our model, since the issue does not appear in any other prediction within the valence/arousal prediction subtask.

                       Table 1: CERTH-ITI Predictions

                  Valence                    Arousal             Fear
   Run       MSE             r          MSE              r        IoU
    1    396901706.564     0.079    1678218552.19      0.054     0.075
    2        0.139         0.010        0.181          -0.022    0.065
    3        0.117         0.098        0.138            nan     0.053
    4        0.142         0.067        0.187          -0.029    0.063

   We have also submitted 4 runs for the fear prediction subtask; their results are also presented in Table 1 and are evaluated in terms of Intersection over Union (IoU). From Table 1 we see that the best performance for the fear recognition subtask is achieved by Run 1, which uses all predicted scores of the pre-trained VGG16 model. In addition, our intuition to remove isolated predicted frames (Run 2), as they are not associated with any duration, did not perform better than Run 1, since in this way we miss significant information (video frames that invoke fear).

4    DISCUSSION AND OUTLOOK
In this paper we report the CERTH-ITI team approach to the MediaEval 2018 Challenge "Emotional Impact of Movies" task. The results in the valence/arousal prediction subtask show that, according to MSE, the best result is obtained by Run 3 for both valence and arousal, while, regarding the Pearson Correlation Coefficient, Run 1 has the best performance for arousal and the second best performance for valence. The Pearson correlation measures the linear correlation between two variables, whereas the MSE is a sum of squared deviations between predicted and ground-truth values, regardless of whether they are linearly correlated or not.
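For reference, the two measures compared here have the standard definitions (with ŷ_i the predicted and y_i the groundtruth value of frame i):

    \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,
    \qquad
    r = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}
             {\sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}\,
              \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

Since r is invariant to the scale and offset of the predictions while MSE penalizes any absolute deviation, a run can rank differently under the two measures.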
   The results of the fear prediction subtask show that the inclusion of audio features failed to enhance the classification performance as we had expected. This could be due to several reasons, the most prominent one being the inability to perform data augmentation on the audio features in the way it was done for the visual analysis. This, together with the large imbalance between the two classes, which led us to discard many "no-fear" annotations in order to balance the training set, resulted in a very limited training set. These drawbacks could be overcome by using classification methods able to handle unbalanced training sets, such as penalized models, by enriching the training set with external annotated datasets, and by exploring more efficient fusion methods, such as performing classification on fused audiovisual features instead of combining separate classification results a posteriori.

ACKNOWLEDGMENTS
This work was funded by the EC-funded project V4Design under the contract number H2020-779962.


REFERENCES
[1] Yoann Baveye, Emmanuel Dellandréa, Christel Chamaret, and Liming Chen. 2015. LIRIS-ACCEDE: A video database for affective content analysis. IEEE Transactions on Affective Computing 6, 1 (2015), 43–55.
[2] Luca Canini, Sergio Benini, and Riccardo Leonardi. 2013. Affective
    recommendation of movies based on selected connotative features.
    IEEE Transactions on Circuits and Systems for Video Technology 23, 4
    (2013), 636–647.
[3] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, Zhongzhe Xiao, and Mats Sjöberg. 2018. The MediaEval 2018 Emotional Impact of Movies task. In Working Notes Proceedings of the MediaEval 2018 Workshop.
[4] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 2013 ACM Multimedia Conference (MM 2013). 835–838.
[5] Harish Katti, Karthik Yadati, Mohan Kankanhalli, and Tat-Seng Chua. 2011. Affective video summarization and story board generation using pupillary dilation and eye gaze. In 2011 IEEE International Symposium on Multimedia (ISM). IEEE, 319–326.
[6] Karen Simonyan and Andrew Zisserman. 2014. Very deep convo-
    lutional networks for large-scale image recognition. arXiv preprint
    arXiv:1409.1556 (2014).
[7] Shiliang Zhang, Qingming Huang, Shuqiang Jiang, Wen Gao, and Qi
    Tian. 2010. Affective visualization and retrieval for music video. IEEE
    Transactions on Multimedia 12, 6 (2010), 510–522.
[8] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2018. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2018), 1452–1464.