=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_18
|storemode=property
|title=TCNJ-CS@MediaEval 2017 Emotional Impact of Movie Task
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_18.pdf
|volume=Vol-1984
|authors=Sejong Yoon
|dblpUrl=https://dblp.org/rec/conf/mediaeval/Yoon17a
}}
==TCNJ-CS@MediaEval 2017 Emotional Impact of Movie Task==
TCNJ-CS @ MediaEval 2017 Emotional Impact of Movie Task
Sejong Yoon, The College of New Jersey, U.S.A. (yoons@tcnj.edu)

ABSTRACT

This paper presents our approaches for the MediaEval Emotional Impact of Movies Task. We employed features from image frames and the audio signal, and used support vector regression for learning and prediction. In addition, we introduce a new feature using exponential decay of the initially predicted emotion labels. The motivation behind this is to computationally model the lingering effect of emotions. Experimental results and future directions are also discussed.

1 INTRODUCTION

The MediaEval 2017 Emotional Impact of Movies Task [2] consists of two subtasks. The first is valence/arousal prediction, which predicts the expected level of two emotional states, valence and arousal, for each consecutive ten-second segment. Both valence (most negative to most positive) and arousal (least active to most active) are defined on a continuous scale within the range [−1, 1]. The second is fear prediction, which makes a binary prediction for each ten-second segment of whether it is likely to induce fear. Fear is defined as a binary integer {0, 1}, where 1 indicates that the segment will induce fear. In the following, we describe the method used in our prediction system.
2 APPROACH

First, we describe the multimodal features we employed. Next, we introduce our new feature, based on prediction labels, to model the lingering effect of induced emotions. Lastly, we describe our hierarchical regression framework for emotion prediction.

2.1 Visual and Audio Features

We employed the standard set of visual and audio features provided by the MediaEval task organizers. For image frame-based features, we used Auto Color Correlogram, Color and Edge Directivity Descriptor (CEDD), Color Layout, Edge Histogram, Fuzzy Color and Texture Histogram (FCTH), Gabor, the joint descriptor combining CEDD and FCTH in one histogram, Scalable Color, Tamura, Local Binary Patterns, and the fc6 layer feature of the VGG16 network [5]. All features were extracted frame-by-frame, with one frame extracted per second. The features, except VGG16, were computed using the LIRE library; VGG16 features were extracted using the MATLAB Neural Network toolbox.

For auditory features, we employed the audio features provided. According to the provided description, there should be 1,582 features: a base of 34 low-level descriptors with 34 corresponding differential coefficients, and 21 functionals applied to each of these 68 contours, yields 1,428 features (21 × (34 + 34)). Of the remaining 154 features, 152 were computed by applying 19 additional functionals to the 4 pitch-based low-level descriptors and their 4 differential coefficient contours (19 × (4 + 4)). The last two features are additional statistical features: the number of pitch onsets and the total duration of the input. These features were computed over ten-second segments sliding over the whole movie with a shift of 5 seconds. All audio features were computed with the openSMILE toolbox [3].
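The audio feature count above can be verified with a quick arithmetic sanity check (a trivial sketch; the variable names are ours, the counts come from the task's feature description):

```python
# Sanity check of the openSMILE audio feature count described above.
base_llds = 34          # low-level descriptors
delta_llds = 34         # their differential coefficients
functionals = 21        # functionals applied to each contour

pitch_llds = 4          # pitch-based low-level descriptors
pitch_deltas = 4        # their differential coefficient contours
extra_functionals = 19  # additional functionals for pitch contours

statistical = 2         # number of pitch onsets, total duration

contour_features = functionals * (base_llds + delta_llds)         # 1,428
pitch_features = extra_functionals * (pitch_llds + pitch_deltas)  # 152
total = contour_features + pitch_features + statistical
print(total)  # → 1582
```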
2.2 Lingering Feature

In addition to the provided features, we introduced an additional feature using the ground truth labels of emotional levels. The motivation behind this new feature is to computationally model the gradually amplifying or decaying emotional flow, typically referred to as lingering emotion. Traditional and even state-of-the-art affect prediction systems focus on predicting induced emotions as a spike-noise detection model, regardless of whether they model the temporal aspect of affect. Lingering emotions, on the other hand, are not directly induced by the stimuli; rather, they are generated from the emotional change, i.e., the response, already existing a priori. In short, we argue that what is called the climax of a movie is not only a consequence of short segment stimuli, but is also amplified (or degraded) by the emotional state change itself across segments. A similar idea was utilized to predict media interestingness [6], but with no explicit consideration of the amplifying/decaying effect. Here, we consider the change of emotion directly.

To model this lingering emotion, we use the emotion level label values. For each segment t = 1..T, where T denotes the total number of segments, assume that we are given the emotion level label y_t. So, at each time segment t, we have

x_1, x_2, ..., x_t,   (1)
y_1, y_2, ..., y_t,   (2)

where x_t denotes the vectorized visual/audio features and y_t denotes either the ground truth or the predicted emotion level (it could be a valence, arousal, or fear label). Then, we can define the lingering feature l_t as an exponential decay function of the labels:

l_{t-w} = y_{t-w},   (3)
l_{t-w+1} = (1 − α) · l_{t-w} + α · y_{t-w+1},   (4)
...
l_s = (1 − α) · l_{s-1} + α · y_s,   (5)

where s = (t − w), ..., t, and w denotes the lingering window size. The parameter α is the decay factor. Intuitively, we take a weighted accumulation of emotions over time and use it to model the lingering effect. In the training phase, we can utilize the ground truth emotion labels. In the testing phase, we devise a two-step, hierarchical regression model to obtain the emotion level feature values; we describe this model in the next section.

There are several reasons why we think this is a reasonable model of the lingering effect. First, with the exponential decay function, we can account for both the smoothness and the decay of emotional change over time. Second, one can view this as a simplified version of traditional temporal models, e.g., Hidden Markov Models (HMMs), where we fix the transition probability. If a large number of emotion labels can be obtained, one may try to learn HMM-based features instead, as in [6].

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland
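The recursion in Eqs. (3)-(5) amounts to an exponential moving average over a window of labels. A minimal sketch, assuming the labels are given as a plain list (the function and argument names are ours, not from the paper):

```python
def lingering_feature(y, t, w, alpha):
    """Exponentially decayed accumulation of the emotion labels
    y[t-w..t] (Eqs. 3-5): l starts at y[t-w], then each later label
    is blended in with weight alpha."""
    l = y[t - w]                       # Eq. (3): initialize at window start
    for s in range(t - w + 1, t + 1):  # Eqs. (4)-(5): recursive update
        l = (1 - alpha) * l + alpha * y[s]
    return l

# A constant label sequence is left unchanged by the recursion.
print(lingering_feature([0.2] * 6, t=5, w=4, alpha=0.5))  # → 0.2
```

Older labels are discounted geometrically by (1 − α) per step, which is what gives the feature its gradual amplifying/decaying behavior.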
2.3 Hierarchical Regression Framework

To combine features, we utilized a standard multiple kernel learning approach [1, 4]. We first compute a kernel for each feature, and build a combined kernel using either addition or multiplication. We use multiplication within the same modality, e.g., combining the Color Correlogram kernel and the Edge Histogram kernel, and addition between different modalities, i.e., combining the combined visual kernel and the audio kernel. The lingering feature is treated as a modality of its own, separate from the visual and auditory ones. In summary, our combined kernel was computed as

K_vis = K_acc · K_cedd · ... · K_fc6,   (6)
K_all = K_vis + K_aud + K_lin,   (7)

where each K denotes the kernel computed using the corresponding feature. We used the Radial Basis Function (RBF) kernel, with the median of the training data as the hyperparameter.
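The combination rule in Eqs. (6)-(7) can be sketched as follows. This is a simplified illustration with made-up feature matrices; reading "median of the training data" as the median-pairwise-distance bandwidth heuristic is our assumption:

```python
import numpy as np

def rbf_kernel(X):
    """RBF kernel with the median pairwise distance as bandwidth
    (median heuristic -- our assumption for the hyperparameter)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sigma = np.median(np.sqrt(d2[d2 > 0]))
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n = 8
visual_feats = [rng.normal(size=(n, 5)) for _ in range(3)]  # stand-ins for ACC, CEDD, ..., fc6
audio_feats = rng.normal(size=(n, 4))
linger_feats = rng.normal(size=(n, 1))

# Eq. (6): multiply kernels within the visual modality.
K_vis = np.ones((n, n))
for F in visual_feats:
    K_vis *= rbf_kernel(F)

# Eq. (7): add kernels across modalities.
K_all = K_vis + rbf_kernel(audio_feats) + rbf_kernel(linger_feats)
print(K_all.shape)  # → (8, 8)
```

Multiplying kernels within a modality acts like an AND over feature similarities, while adding across modalities lets any one modality contribute similarity on its own.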
Once the combined kernel is computed, we can use its rows as feature vectors. For the regression model, we used linear Support Vector Regression (SVR), via MATLAB's fitrsvm function. One important aspect of our approach is that we use emotion prediction labels to compute the lingering features. Since we do not have ground truth labels for the Testset, we designed a two-step, hierarchical regression framework. In this framework, two SVR models are trained in the training phase. One model (Model A) is trained with a kernel computed on the training data that combines only the visual and auditory features. The other model (Model B) is trained with the kernel computed using all modalities. In the testing phase, we first perform an initial emotion prediction on the test data using Model A. Then, we compute the lingering feature using the predicted affect labels. Note that this is computationally inexpensive, since the lingering feature itself is easy to compute and the labels of the training data form only a 1-dimensional vector. Finally, we perform the final emotion prediction on the test data using Model B. We applied this regression framework to all subtasks. For the fear subtask, we rescaled the output into the [0, 1] range and thresholded it at 0.75.
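The two-step test-time procedure can be outlined as follows. This is a schematic Python sketch rather than the paper's MATLAB pipeline; `model_a`, `model_b`, and the toy stand-ins at the bottom are placeholders for the trained kernel SVR models described above:

```python
def two_step_predict(model_a, model_b, test_feats, w=4, alpha=0.5,
                     fear=False, threshold=0.75):
    """Hierarchical prediction: Model A gives initial labels, which feed
    the lingering feature; Model B then makes the final prediction."""
    # Step 1: initial prediction from visual + audio features only.
    y0 = [model_a(x) for x in test_feats]

    # Lingering feature per segment via the exponential-decay recursion.
    lingering = []
    for t in range(len(y0)):
        start = max(t - w, 0)
        l = y0[start]
        for s in range(start + 1, t + 1):
            l = (1 - alpha) * l + alpha * y0[s]
        lingering.append(l)

    # Step 2: final prediction with the lingering feature added.
    y = [model_b(x, l) for x, l in zip(test_feats, lingering)]

    if fear:  # rescale to [0, 1] and threshold at 0.75
        lo, hi = min(y), max(y)
        y = [int((v - lo) / (hi - lo + 1e-12) >= threshold) for v in y]
    return y

# Toy stand-ins for the two trained SVR models (purely illustrative).
model_a = lambda x: 0.5 * x
model_b = lambda x, l: 0.4 * x + 0.2 * l
print(two_step_predict(model_a, model_b, [0.0, 1.0, 2.0, 3.0]))
```

The design avoids needing test-set ground truth: the lingering feature at test time is built entirely from Model A's own predictions.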
3 RESULTS AND ANALYSIS

For the measures, we used Mean Squared Error (MSE) and Pearson's correlation coefficient (ρ) for the valence and arousal subtasks, and accuracy, precision, recall, and F1 score for the fear subtask. We used α = 0.5 for all experiments.

Table 1: Result of All Subtasks in Devset

Subtask  Measure   w/o Linger  w/ Linger
Valence  MSE       0.13106     0.09893
         ρ         0.12826     0.20826
Arousal  MSE       0.08817     0.08415
         ρ         0.08390     0.09001
Fear     Accuracy  0.96581     0.96543

On the Devset, shown in Table 1, one can see no significant benefit from the lingering feature in this case. We used a 50-50 split to obtain these results, but readers should take them with a grain of salt (particularly the fear accuracy), since the split was not made strictly on the basis of which frames belong to which video.

Table 2: Result of All Subtasks in Testset

Subtask  Measure    w/o Linger (run 1)  w/ Linger (run 2)
Valence  MSE        0.20276             0.04640
         ρ          0.19748             0.00583
Arousal  MSE        0.12304             0.11335
         ρ          0.11340             0.21485
Fear     Accuracy   0.77862             0.72956
         Precision  0.22474             0.25530
         Recall     0.09922             0.19224
         F1         0.10113             0.17399

On the Testset, shown in Table 2, the official results are more interesting. It is clear that the lingering feature does not help (and in fact hinders) valence prediction. On the other hand, for arousal and fear, the lingering feature appears to make a positive contribution to the prediction, although the overall MSE and accuracy are sacrificed a little. A similar tendency can be observed on the Devset in Table 1. One intuitive explanation is the following: what we model with the lingering feature is how the prior, recent emotional change might affect or induce the new emotion. In the case of arousal (being active or passive toward the stimuli) or fear (feeling horror or anxiety, or not), this happens often. Valence (positive or negative), on the other hand, is rather difficult to capture with a fixed window size of the lingering feature. Moreover, a change from a positive to a negative emotional state, or vice versa, requires more contextual (or semantic) information about the stimuli to understand why that change happened.

4 DISCUSSION AND OUTLOOK

In this paper, we introduced a new feature modeling the lingering effect and presented a hierarchical regression framework for emotion prediction. We found promising applications of the new feature in arousal and fear prediction, with limitations in valence prediction. In the future, it would be interesting to investigate how to capture the lingering effect more robustly, with an in-depth understanding of the feature's impact on valence prediction.

ACKNOWLEDGMENTS

This work was supported in part by The College of New Jersey under a Support Of Scholarly Activity (SOSA) 2017-2019 grant.

REFERENCES

[1] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. 2004. Multiple Kernel Learning, Conic Duality, and the SMO Algorithm. In Proceedings of the Twenty-first International Conference on Machine Learning (ICML '04). ACM, New York, NY, USA. https://doi.org/10.1145/1015330.1015424
[2] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, and Mats Sjöberg. 2017. The MediaEval 2017 Emotional Impact of Movies Task. In MediaEval 2017 Workshop.
[3] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent Developments in openSMILE, the Munich Open-source Multimedia Feature Extractor. In Proceedings of the 21st ACM International Conference on Multimedia (MM '13). ACM, New York, NY, USA, 835-838. https://doi.org/10.1145/2502081.2502224
[4] Mehmet Gönen and Ethem Alpaydin. 2011. Multiple Kernel Learning Algorithms. Journal of Machine Learning Research (JMLR) 12 (July 2011), 2211-2268.
[5] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[6] Sejong Yoon and Vladimir Pavlovic. 2014. Sentiment Flow for Video Interestingness Prediction. In Proceedings of the 1st ACM International Workshop on Human Centered Event Understanding from Multimedia (HuEvent '14). ACM, New York, NY, USA, 29-34. https://doi.org/10.1145/2660505.2660513