<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>THUHCSI in MediaEval 2017 Emotional Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zitong Jin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuqi Yao</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ye Ma</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mingxing Xu</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Key Laboratory of Pervasive Computing, Ministry of Education, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Computer Science and Technology, Tsinghua University</institution>
          ,
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper we describe our team’s approach to the MediaEval 2017 Emotional Impact of Movies challenge. In addition to the baseline features, we use the OpenSMILE toolkit to extract the eGeMAPS audio features from the video clips. We also target the continuous flow of emotion, for which time-sequential models such as LSTM are useful and effective. Fusion methods are also considered and discussed in this paper. The evaluation results of our experiments show that our features and models are competitive in both valence/arousal and fear prediction, indicating the effectiveness of our approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The MediaEval 2017 Emotional Impact of Movies challenge
consists of two subtasks. Subtask 1 aims at valence/arousal
prediction, while Subtask 2 aims at fear prediction. Long movies are
considered in both cases, and a prediction must be given every 5
seconds for each consecutive ten-second segment. The LIRIS-ACCEDE
[
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] dataset is used for training and testing, including both
its discrete and continuous parts. For more details, please refer
to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Video affective analysis and prediction is an important and
challenging problem, which has drawn the attention of many researchers
recently. The Emotional Impact of Movies task has been held for
three years, and many participants took part in the
challenge in 2015 and 2016 [
        <xref ref-type="bibr" rid="ref4 ref9">4, 9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>In this section, we will describe the main approaches for the
subtasks, including feature extraction, pre-processing, prediction
models, fusion and post-processing methods.</p>
    </sec>
    <sec id="sec-2-1">
      <title>Subtask 1: Valence/Arousal prediction</title>
      <p>
        Feature extraction. In addition to the baseline features provided
by the organizers, the extended Geneva Minimalistic Acoustic
Parameter Set (eGeMAPS) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is extracted from the audio channel. It
contains 88 features and proved effective in the same
task last year [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In our experiments, we extract these features with the
OpenSMILE toolkit [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] from 5-second segments that are cut
from the original videos in advance.
      </p>
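      <p>As an illustration, the following is a minimal sketch of this extraction step using the opensmile Python wrapper; the package choice, function names and file paths are assumptions for illustration only (the extraction itself was done with the openSMILE toolkit), but the 88-dimensional eGeMAPS functionals per 5-second segment match the setup described above.</p>
      <preformat>
# Minimal sketch: 88 eGeMAPS functionals per pre-cut 5-second audio segment.
# The `opensmile` Python package is assumed here purely for illustration.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # 88 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_egemaps(segment_paths):
    """Return one 88-dimensional feature vector per 5-second segment."""
    return [smile.process_file(path).values.flatten() for path in segment_paths]

# Hypothetical usage:
# features = extract_egemaps(["movie01_seg000.wav", "movie01_seg001.wav"])
      </preformat>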
      <p>As for the visual features, the general-purpose visual features
provided by the organizers (except the CNN features) are merged into
one large feature vector. This is mainly because these features are
short and complementary, and combining them greatly reduces the
training workload compared to trying each of them separately.</p>
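      <p>A minimal sketch of this merging step is given below; it is plain feature concatenation with NumPy, and the individual descriptor names are hypothetical placeholders.</p>
      <preformat>
# Minimal sketch: merge the provided general-purpose visual features
# (all except the CNN features) into one vector per 5-second segment.
import numpy as np

def merge_visual_features(feature_blocks):
    """feature_blocks: list of arrays shaped (n_segments, dim_i) for one movie."""
    return np.concatenate(feature_blocks, axis=1)

# Hypothetical usage with some of the provided descriptors:
# merged = merge_visual_features([cedd, fcth, gabor, lbp, color_hist])
      </preformat>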
    </sec>
    <sec id="sec-3">
      <title>Subtask 2: Fear prediction</title>
      <p>
        Feature extraction. We use the same feature sets as in Subtask 1.
However, the main problem and the biggest challenge in Subtask 2
is that the samples are so imbalanced that simply predicting “zero”
obtains an accuracy of 84.34% on the test set (see Run 4).
Therefore, to address this imbalance, the SMOTE (Synthetic
Minority Over-sampling TEchnique [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) method is adopted after
feature extraction to re-sample the data. The main idea of the SMOTE
algorithm is to generate new samples for the minority class by
interpolation, which makes the data more balanced.
      </p>
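      <p>A minimal sketch of this re-sampling step is shown below, assuming the imbalanced-learn implementation of SMOTE; the variable names are illustrative.</p>
      <preformat>
# Minimal sketch: over-sample the minority ("one"/fear) class with SMOTE
# before training, so that the classifier sees a more balanced training set.
from imblearn.over_sampling import SMOTE

def balance_fear_samples(X_train, y_train, seed=0):
    """Generate synthetic minority samples by interpolating between neighbors."""
    smote = SMOTE(random_state=seed)
    return smote.fit_resample(X_train, y_train)
      </preformat>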
      <p>Prediction models. A Random Forest model is adopted for fear
prediction, as it may behave better than a Support Vector Machine
(SVM) on imbalanced problems. We first use the Random Forest model
to obtain the probability of predicting fear (“one”) for each video
clip. Then we set a decision threshold p and predict fear when
the probability is larger than p. The value of p is adjusted
according to the results on the validation set. Due to time constraints, we
did not try the LSTM model for Subtask 2.</p>
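      <p>A minimal sketch of this thresholded Random Forest prediction follows; the number of trees and the threshold value are illustrative, not the settings of our submitted runs.</p>
      <preformat>
# Minimal sketch: Random Forest fear prediction with a decision threshold p.
# Hyper-parameters and p are illustrative; p was tuned on the validation set.
from sklearn.ensemble import RandomForestClassifier

def predict_fear(X_train, y_train, X_test, p=0.3, n_trees=100):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    clf.fit(X_train, y_train)
    # Probability of the positive ("one"/fear) class for each video clip.
    proba = clf.predict_proba(X_test)[:, 1]
    # Predict fear whenever the probability exceeds the threshold p.
    return (proba > p).astype(int), proba
      </preformat>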
      <p>Fusion methods. Similar to Subtask 1, both early and late
fusion are used. In late fusion, the probabilities of different models are
averaged to obtain a single probability.</p>
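      <p>The late fusion used here is a plain average of the per-model probabilities, thresholded in the same way as a single model; a minimal sketch (with a hypothetical threshold) is given below.</p>
      <preformat>
# Minimal sketch: late fusion by averaging the fear probabilities of
# different models, then thresholding the averaged probability.
import numpy as np

def late_fusion(probas, p=0.3):
    """probas: list of per-model probability arrays of equal length."""
    avg = np.mean(np.stack(probas, axis=0), axis=0)
    return (avg > p).astype(int)
      </preformat>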
      <sec id="sec-3-1">
        <title>Zitong Jin, Yuqi Yao, Ye Ma, Mingxing Xu</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>EXPERIMENTS AND RESULTS</title>
      <p>In this section, we describe our specific runs in more detail and
show the results. Note that all the hyper-parameters are selected
according to the results on the validation set, and the ratio of
training data to validation data is 4:1.</p>
    </sec>
    <sec id="sec-5">
      <title>Subtask 1: Valence/Arousal prediction</title>
      <p>We submitted 5 runs for valence/arousal prediction, where the
first two use LSTM and the others use SVR and AdaBoost, as
listed below:</p>
      <p>Run 1: For valence, a 2-layer LSTM model with hidden size 500
taking eGeMAPS as input; for arousal, a 3-layer LSTM model with hidden
size 500 taking VGG as input.</p>
      <p>Run 2: For valence, late fusion of three 2-layer LSTM models with
hidden size 1000 taking eGeMAPS, VGG and other visual features
as input, respectively; for arousal, the input features are Emobase,
eGeMAPS and CEDD, respectively.</p>
      <p>Run 3: For both valence and arousal, an SVR model taking VGG as
input.</p>
      <p>Run 4: For valence, an AdaBoost model taking eGeMAPS as input;
for arousal, an AdaBoost model taking other visual features as input.</p>
      <p>Run 5: For both valence and arousal, late fusion of Run 3 and
Run 4.</p>
      <p>In detail, the “other visual features” in Runs 2 and 4 mean the
concatenation of all the visual features except the CNN features.
CEDD means the Color and Edge Directivity Descriptor, which is one
of the provided visual features. VGG means the CNN features extracted
from the fc6 layer of VGG16.</p>
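      <p>For reference, a minimal sketch of the kind of stacked LSTM regressor used in Runs 1 and 2 is shown below, written in PyTorch; the framework, training loop and any unstated settings are assumptions, and only the layer counts and hidden sizes follow the run descriptions above.</p>
      <preformat>
# Minimal sketch: a stacked LSTM regressor, e.g. 2 layers with hidden size 500
# on 88-dimensional eGeMAPS input for valence (cf. Run 1).
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    def __init__(self, input_dim, hidden_size=500, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_size, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)    # one valence (or arousal) value

    def forward(self, x):
        # x: (batch, time, input_dim) sequence of per-segment feature vectors
        h, _ = self.lstm(x)
        return self.out(h).squeeze(-1)          # one prediction per time step

# Hypothetical usage: model = EmotionLSTM(input_dim=88)
      </preformat>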
      <p>From Table 1 we can see that the best run in terms of valence MSE is
Run 2, using late fusion of LSTM models. Run 3 achieves the best
results on the other metrics, using an SVR model and the VGG features.
Notice that Run 2, the LSTM late fusion method, achieves a better MSE
than Run 1, the single LSTM model, which means that late fusion of
three models utilizes different information in different features and
enhances the performance to some extent. However, the LSTM
models perform worse on Pearson’s r compared to traditional machine
learning models. This could be because LSTM models tend to
predict similar values over time, and thus obtain a lower MSE but also
a lower Pearson’s r.</p>
      <p>Taken together, Run 3 using SVR and VGG achieves the best results,
which means the CNN features may contain useful information for
emotion analysis, and a traditional model can behave well when
trained properly.</p>
      <sec id="sec-5-1">
        <title>Runs</title>
        <p>Run 1
Run 2
Run 3
Run 4
Run 5</p>
      </sec>
      <sec id="sec-5-2">
        <title>Accuracy Precision 0.7352 0.8153</title>
        <p>We submitted 5 runs for fear prediction, all using the Random Forest
model, as listed below:</p>
        <p>Run 1: Random Forest + other visual features.</p>
        <p>Run 2: Random Forest + VGG.</p>
        <p>Run 3: Random Forest + all visual features.</p>
        <p>Run 4: Always predicting “zero” (just for reference).</p>
        <p>Run 5: Late fusion of Run 1 and Run 2.</p>
        <p>From Table 2 we can see that Run 2, using the VGG features, achieves
the best results on recall and F1, while Run 5, using late fusion, achieves
the best results on accuracy and precision. As mentioned before, the
problem in Subtask 2 is highly imbalanced, and the fear samples are
much fewer. Therefore, it is no surprise that accuracy and
precision form one pair while recall and F1 form the other. Predicting
more “zeros” leads to higher accuracy but lower recall, and
vice versa.</p>
        <p>When considering the F1 score, which is the harmonic mean of
precision and recall, Run 2 using the VGG features performs best, which
is consistent with the finding in Subtask 1 that CNN features contain
useful information for emotion analysis.</p>
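        <p>For clarity, the F1 score referred to here is the standard harmonic mean of precision and recall:
          <disp-formula>
            <tex-math>F_{1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}</tex-math>
          </disp-formula>
        </p>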
    </sec>
    <sec id="sec-6">
      <title>CONCLUSION AND DISCUSSION</title>
      <p>In this paper, we illustrate our approach to the MediaEval 2017
“Emotional Impact of Movies” task. In the valence/arousal
prediction subtask, both LSTM and SVR models are trained and
compared. In the fear prediction subtask, Random Forest models using
different features are compared. Besides, early fusion and late
fusion are adopted in the experiments, which show promising results
in some aspects.</p>
      <p>However, some problems have not been solved yet. For instance,
some of the LSTM models tend to predict similar values over time,
leading to a very low Pearson’s r, which may be caused by an
inappropriate experimental configuration. The class imbalance in Subtask 2
still exists even when the SMOTE algorithm is used, which means that changing
models or features makes little difference, and always predicting
“zero” can still obtain a very high accuracy. These problems remain
to be solved in the future.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was partially supported by the National High
Technology Research and Development Program of China (863 program)
(2015AA016305) and the National Natural Science Foundation of
China (61433018, 61171116).</p>
      <sec id="sec-7-1">
        <title>Emotional Impact of Movies Task</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Yoann</given-names>
            <surname>Baveye</surname>
          </string-name>
          , Emmanuel Dellandréa, Christel Chamaret, and
          <string-name>
            <given-names>Liming</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Deep learning vs. kernel methods: Performance for emotion prediction in videos</article-title>
          .
          <source>In Affective Computing and Intelligent Interaction (ACII)</source>
          ,
          <source>2015 International Conference on. IEEE</source>
          ,
          <fpage>77</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Yoann</given-names>
            <surname>Baveye</surname>
          </string-name>
          , Emmanuel Dellandréa, Christel Chamaret, and
          <string-name>
            <given-names>Liming</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>LIRIS-ACCEDE: A video database for affective content analysis</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          <volume>6</volume>
          ,
          <issue>1</issue>
          (
          <year>2015</year>
          ),
          <fpage>43</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Nitesh V</given-names>
            <surname>Chawla</surname>
          </string-name>
          , Kevin W Bowyer, Lawrence O Hall, and
          <string-name>
            <given-names>W Philip</given-names>
            <surname>Kegelmeyer</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>SMOTE: synthetic minority over-sampling technique</article-title>
          .
          <source>Journal of artificial intelligence research 16</source>
          (
          <year>2002</year>
          ),
          <fpage>321</fpage>
          -
          <lpage>357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Emmanuel</given-names>
            <surname>Dellandréa</surname>
          </string-name>
          , Liming Chen, Yoann Baveye, Mats Sjöberg, and
          <string-name>
            <given-names>Christel</given-names>
            <surname>Chamaret</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The MediaEval 2016 Emotional Impact of Movies Task</article-title>
          .
          <source>In Proceedings of MediaEval 2016 Workshop</source>
          . Hilversum, Netherlands.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Emmanuel</given-names>
            <surname>Dellandréa</surname>
          </string-name>
          , Martijn Huigsloot, Liming Chen, Yoann Baveye, and
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>The MediaEval 2017 Emotional Impact of Movies Task</article-title>
          .
          <source>In Proceedings of MediaEval 2017 Workshop</source>
          . Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Florian</given-names>
            <surname>Eyben</surname>
          </string-name>
          , Klaus Scherer, Khiet Truong, Bjorn Schuller, Johan Sundberg, Elisabeth Andre, Carlos Busso, Laurence Devillers, Julien Epps, and
          <string-name>
            <given-names>Petri</given-names>
            <surname>Laukka</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          <volume>7</volume>
          ,
          <issue>2</issue>
          (
          <year>2016</year>
          ),
          <fpage>190</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Florian</given-names>
            <surname>Eyben</surname>
          </string-name>
          , Felix Weninger, Florian Gross, and
          <string-name>
            <given-names>Björn</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Recent developments in openSMILE, the Munich open-source multimedia feature extractor</article-title>
          .
          <source>In Proceedings of the 21st ACM international conference on Multimedia. ACM</source>
          ,
          <fpage>835</fpage>
          -
          <lpage>838</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ye</given-names>
            <surname>Ma</surname>
          </string-name>
          , Zipeng Ye, and
          <string-name>
            <given-names>Mingxing</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>THU-HCSI at MediaEval 2016: Emotional Impact of Movies Task</article-title>
          .
          <source>In Proceedings of MediaEval 2016 Workshop</source>
          . Hilversum, Netherlands.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          , Yoann Baveye, Hanli Wang, Vu Lam Quang, Bogdan Ionescu, Emmanuel Dellandréa, Markus Schedl,
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Liming</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>The MediaEval 2015 Affective Impact of Movies Task</article-title>
          .
          <source>In Proceedings of MediaEval 2015 Workshop</source>
          . Wurzen, Germany.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>