<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RUC at MediaEval 2016 Emotional Impact of Movies Task: Fusion of Multimodal Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shizhe Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qin Jin</string-name>
          <email>qjin@ruc.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information, Renmin University of China</institution>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>In this paper, we present our approaches for the MediaEval 2016 Emotional Impact of Movies Task. We extract features from multiple modalities, including the audio, image and motion modalities. SVR and Random Forest are used as our regression models, and late fusion is applied to fuse different modalities. Experimental results show that multimodal late fusion is beneficial for predicting the global affects and continuous arousal, and that using CNN features can further boost the performance. For continuous valence prediction, however, the acoustic features are superior to the other features.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The 2016 Emotional Impact of Movies Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] involves
two subtasks: global and continuous affects prediction. The
global subtask requires participants to predict the induced
valence and arousal values for short video clips, while in the
continuous subtask the affect values should be predicted
every second for long movies. In the following sections, we
describe the multimodal features, models and experiments in
detail.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. FEATURE EXTRACTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Audio Modality</title>
      <p>
        Statistical Acoustic Features: Statistical acoustic
features have proved to be very effective in speech emotion
recognition. We use the open-source toolkit OpenSMILE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to
extract three kinds of features, IS09, IS10 and IS13, which
use the configurations of the INTERSPEECH 2009 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], 2010 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
and 2013 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] Paralinguistic Challenge, respectively. The
difference between these features is that the configurations from
later years cover more low-level features and statistical functions.
      </p>
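      <p>As an illustration, such statistical feature sets could be extracted by scripting the openSMILE command-line tool; the sketch below is not our exact pipeline, and the SMILExtract flags and config file names are assumptions that depend on the installed openSMILE version.</p>
      <preformat>
# Sketch: extract IS09/IS10/IS13 statistical feature sets with openSMILE.
# Assumes SMILExtract is on PATH and the stock INTERSPEECH configs ship
# with the local installation; adjust the paths for your setup.
import subprocess

CONFIGS = {
    "IS09": "config/IS09_emotion.conf",
    "IS10": "config/IS10_paraling.conf",
    "IS13": "config/IS13_ComParE.conf",
}

def extract_statistical_features(wav_path, out_dir="features"):
    for name, conf in CONFIGS.items():
        out_file = f"{out_dir}/{name}.arff"
        # -C selects the config, -I the input wave file, -O the output file
        # (the output format is determined by the chosen config).
        subprocess.run(["SMILExtract", "-C", conf, "-I", wav_path, "-O", out_file],
                       check=True)
</preformat>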
      <p>
        MFCC-based Features: The Mel-Frequency Cepstral
Coefficients (MFCCs) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] are the most widely used low-level
features. Therefore, we use MFCCs as our frame-level
feature and apply two encoding strategies, Bag-of-Audio-Words
(BoW) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and Fisher Vector Encoding (FV) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], to
transform the set of MFCCs to the sentence-level features. For
mfccBoW features, the acoustic codebook is trained by
K-means with 1024 clusters. For mfccFV features, we use a
GMM with 8 mixtures to train the codebook.
      </p>
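      <p>The two encodings can be sketched roughly as follows with scikit-learn; this is a simplified illustration (the Fisher Vector below keeps only the gradients with respect to the GMM means), not the exact implementation used in our experiments.</p>
      <preformat>
# Sketch: encode frame-level MFCCs into a single sentence-level vector via
# Bag-of-Audio-Words (K-means, 1024 words) or a simplified Fisher Vector
# (GMM with 8 mixtures, mean gradients only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def train_bow_codebook(train_mfccs, n_words=1024):
    # train_mfccs: (N, D) MFCC frames pooled over the training set
    return KMeans(n_clusters=n_words, random_state=0).fit(train_mfccs)

def encode_bow(codebook, mfccs):
    # Histogram of nearest codewords, L1-normalised
    words = codebook.predict(mfccs)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_fv_gmm(train_mfccs, n_mix=8):
    return GaussianMixture(n_components=n_mix, covariance_type="diag",
                           random_state=0).fit(train_mfccs)

def encode_fv(gmm, mfccs):
    # Means-only Fisher Vector: posterior-weighted, variance-normalised
    # deviations of the frames from each Gaussian mean.
    q = gmm.predict_proba(mfccs)                    # (T, K) posteriors
    diff = (mfccs[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_)[None]
    fv = (q[:, :, None] * diff).sum(axis=0)         # (K, D)
    fv = fv / (len(mfccs) * np.sqrt(gmm.weights_)[:, None])
    return fv.ravel()
</preformat>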
      <p>In the continuous subtask, the audio features are extracted
with a window of 10 s and a shift of 1 s to cover more context.</p>
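      <p>A minimal sketch of this windowing, assuming a raw waveform sampled at 16 kHz (the sampling rate is an illustrative assumption):</p>
      <preformat>
# Sketch: cut an audio signal into 10-second windows with a 1-second shift,
# so one feature vector per second is extracted with wider temporal context.
import numpy as np

def sliding_windows(signal, sr=16000, win_sec=10.0, hop_sec=1.0):
    win, hop = int(win_sec * sr), int(hop_sec * sr)
    windows = [signal[s:s + win] for s in range(0, len(signal) - win + 1, hop)]
    return np.stack(windows) if windows else np.empty((0, win))
</preformat>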
    </sec>
    <sec id="sec-4">
      <title>2.2 Image Modality</title>
      <p>Hand-crafted Visual Features: We extract the
Hue-Saturation Histogram (hsh) to describe the color
information and the Dense SIFT (DSIFT) features to represent the
visual appearance information. For hsh features, we
quantize the hue to 30 levels and the saturation to 32 levels. For
DSIFT features, we use Fisher Vector encoding to
construct the video-level features. Then kernel PCA is utilized to
reduce the dimensionality to 4096.</p>
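      <p>For illustration, the 30x32-bin hue-saturation histogram of a single frame could be computed with OpenCV as below; the normalisation step is our assumption for this sketch.</p>
      <preformat>
# Sketch: 30x32-bin hue-saturation histogram for one BGR frame with OpenCV.
import cv2

def hue_saturation_histogram(bgr_frame, hue_bins=30, sat_bins=32):
    hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
    # channels 0 (hue, range 0-180 in OpenCV) and 1 (saturation, range 0-256)
    hist = cv2.calcHist([hsv], [0, 1], None, [hue_bins, sat_bins],
                        [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()  # unit L2 norm, 960-d vector
</preformat>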
      <p>
        DCNN Features: To explore the performance of
different pre-trained CNN models, we extract multiple layers
from different CNN models including Inception-v3 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
VGG-16 and VGG-19 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Mean pooling is applied to all the CNN features
to generate video-level representations.
      </p>
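      <p>A rough sketch of extracting such frame-level CNN activations and mean pooling them into a video-level vector, using VGG-16 fc7 as an example; the torchvision API and preprocessing shown here are assumptions, not the toolkit actually used for our runs.</p>
      <preformat>
# Sketch: VGG-16 fc7 activations per frame, mean-pooled into one clip vector.
# Uses the torchvision >= 0.13 weights API; adapt for other versions.
import torch
import torchvision.models as models
import torchvision.transforms as T

vgg16 = models.vgg16(weights="IMAGENET1K_V1").eval()
# Keep the classifier up to the fc7 ReLU/Dropout, drop the final 1000-way layer
fc7_extractor = torch.nn.Sequential(*list(vgg16.classifier.children())[:-1])

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def clip_feature(pil_frames):
    # pil_frames: list of sampled frames (PIL images) from one video clip
    with torch.no_grad():
        batch = torch.stack([preprocess(f) for f in pil_frames])
        feats = vgg16.avgpool(vgg16.features(batch)).flatten(1)
        feats = fc7_extractor(feats)        # (num_frames, 4096)
        return feats.mean(dim=0)            # mean pooling over frames
</preformat>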
    </sec>
    <sec id="sec-5">
      <title>2.3 Motion Modality</title>
      <p>
        To exploit the temporal information in the video, we
extract the improved Dense Trajectory (iDT) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and the C3D
features [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. For iDT features, HOG, HOF and MBH
features are densely extracted from the video and encoded with
Fisher Vector. Then kernel PCA is used to reduce the
dimensionality to 4096. For C3D features, we extract activations
from the penultimate layer for every non-overlapping 16 frames
and use mean pooling to aggregate them into one vector.
      </p>
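      <p>The C3D aggregation can be sketched as follows; here c3d_fc6 stands for an assumed callable that maps a 16-frame chunk to its penultimate-layer (fc6) activation of a pre-trained C3D network (not shown).</p>
      <preformat>
# Sketch: split a video into non-overlapping 16-frame chunks, run each chunk
# through a pre-trained C3D network, and mean-pool the fc6 activations into
# a single video-level vector. c3d_fc6 is an assumed, user-supplied callable.
import numpy as np

def c3d_video_feature(frames, c3d_fc6, chunk_len=16):
    chunks = [frames[i:i + chunk_len]
              for i in range(0, len(frames) - chunk_len + 1, chunk_len)]
    feats = np.stack([c3d_fc6(chunk) for chunk in chunks])
    return feats.mean(axis=0)
</preformat>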
      <p>
        The challenge also provides baseline features [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for the global
subtask, which consist of acoustic and visual features.
      </p>
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENTS</title>
    </sec>
    <sec id="sec-7">
      <title>3.1 Experimental Setting</title>
      <p>In the global subtask, there are 9,800 video clips from 160
movies in the development set. We randomly select 6093,
1761 and 1946 videos as our local training, validation and
testing sets respectively. Video clips from the same movies
are kept in the same set. In the continuous subtask, the 30
movies in the development set are also divided into 3 parts
with 24 for training, 3 for validation and 3 for testing.</p>
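      <p>One way to obtain such movie-disjoint splits is scikit-learn's GroupShuffleSplit with the movie id of each clip as the group label; the sketch below is illustrative and the split fractions are assumptions.</p>
      <preformat>
# Sketch: split clips so that clips from the same movie never cross sets.
from sklearn.model_selection import GroupShuffleSplit

def movie_disjoint_split(clip_ids, movie_ids, test_frac=0.2, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_frac,
                                 random_state=seed)
    train_idx, test_idx = next(splitter.split(clip_ids, groups=movie_ids))
    return train_idx, test_idx

# Applying the split twice (train+val vs. test, then train vs. val) yields
# three movie-disjoint sets analogous to the 6093/1761/1946 split above.
</preformat>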
      <p>We train SVR and Random Forest for each kind of
features and use grid search to select the best hyper-parameters.
For SVR, we explore linear and RBF kernels and tune the
cost from 2<sup>-5</sup> to 2<sup>12</sup> and the epsilon-tube from 0.1 to 0.4. For
Random Forest, the number of trees and the depth of trees
are tuned from 100 to 1000 and from 3 to 20 respectively.
We apply late fusion to fuse different features by training a
second-layer model (linear SVR) with input of the best
predictions for each kind of features using the local validation
set. We use the Sequential Backward Selection algorithm to find
the best subset of feature types for late fusion.</p>
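      <p>A condensed sketch of this training procedure with scikit-learn; the grids mirror the ranges above, and the late-fusion step stacks the per-feature validation predictions and fits a linear SVR on them (the feature-subset selection step is omitted).</p>
      <preformat>
# Sketch: per-feature regressors selected by grid search, followed by a
# linear-SVR late fusion trained on validation-set predictions.
import numpy as np
from sklearn.svm import SVR, LinearSVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

svr_grid = {"kernel": ["linear", "rbf"],
            "C": [2.0 ** p for p in range(-5, 13)],       # 2^-5 .. 2^12
            "epsilon": [0.1, 0.2, 0.3, 0.4]}
rf_grid = {"n_estimators": [100, 300, 500, 1000],
           "max_depth": [3, 5, 10, 20]}

def best_regressor(X_train, y_train):
    candidates = [
        GridSearchCV(SVR(), svr_grid, scoring="neg_mean_squared_error"),
        GridSearchCV(RandomForestRegressor(random_state=0), rf_grid,
                     scoring="neg_mean_squared_error"),
    ]
    for cand in candidates:
        cand.fit(X_train, y_train)
    return max(candidates, key=lambda c: c.best_score_).best_estimator_

def train_late_fusion(per_feature_val_preds, y_val):
    # per_feature_val_preds: dict mapping feature name to its validation
    # predictions; the stacked predictions form the second-layer input.
    stacked = np.column_stack(list(per_feature_val_preds.values()))
    return LinearSVR().fit(stacked, y_val)
</preformat>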
    </sec>
    <sec id="sec-8">
      <title>3.2 Global Affects Prediction</title>
      <p>In the global subtask, we use the mean squared error
(MSE) as the evaluation metric. Figure 1 presents the MSE of
different features for arousal prediction. The audio modality
performs the best. Since the baseline feature contains
multimodal cues, it achieves the second best performance
following our mfccBoW feature. The run1 system is the late fusion of all the
audio features, baseline and iDT features. In the run2
system, besides the features used in run1, c3d_fc6, vgg16_fc7
and vgg19_fc6 features are also used in late fusion. The
arousal prediction performance is significantly improved by
the multimodal late fusion.</p>
      <p>The MSE of different features for global valence prediction
is shown in Figure 2. The image modality features,
especially the CNN features, are better than the other modalities for
valence prediction. The run1 system consists of baseline,
IS10, mfccBoW, mfccFV and hsh. The run2 system also
uses c3d_fc6, c3d_prob, vgg16_fc6, vgg16_fc7 and the
features in run1. Although the late fusion performance does
not outperform the unimodal performance with CNN vgg16
features on our local testing set, it might be more robust
than using a single feature.</p>
    </sec>
    <sec id="sec-9">
      <title>3.3 Continuous Affects Prediction</title>
      <p>In the continuous subtask, we use the Pearson
Correlation Coefficient (PCC) instead of MSE for performance
evaluation, because the labels in the continuous subtask have
closer temporal connections than those in the global subtask,
so the shape of the prediction curve is more important. Since
the testing set is relatively small and the performance is
quite unstable in the evaluation, we average the performance
over the validation and testing sets. Figure 3 shows the
PCC results of different features. The mfccFV feature
performs the best in both arousal and valence prediction. The
settings for the three submitted runs are as follows. In run1,
we apply late fusion over mfccFV, IS09 and IS10 for arousal
and use the mfccFV SVR for valence. In run2, mfccFV, IS09,
IS10 and inc_fc are late fused for arousal and mfccFV and
IS09 are late fused for valence. The run3 system late fuses mfccFV,
IS09 and inc_fc for arousal and uses the mfccFV Random Forest
for valence. In our experiments, late fusion is beneficial for
arousal prediction but not for valence prediction.</p>
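      <p>For reference, a PCC of this kind can be computed per movie and averaged with SciPy, as in the brief sketch below (the per-movie averaging is our assumed aggregation).</p>
      <preformat>
# Sketch: Pearson correlation between predicted and annotated per-second
# affect curves, averaged over movies.
import numpy as np
from scipy.stats import pearsonr

def mean_pcc(pred_curves, label_curves):
    # pred_curves, label_curves: lists of per-movie 1-D arrays (one value/second)
    return float(np.mean([pearsonr(p, y)[0]
                          for p, y in zip(pred_curves, label_curves)]))
</preformat>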
    </sec>
    <sec id="sec-10">
      <title>3.4 Submitted Runs</title>
      <p>In Table 1, we list our results on the challenge testing set.
For the global subtask, comparing run1 with run2 shows that
fusing CNN features can greatly improve the arousal and
valence prediction performance. For the continuous subtask,
the fusion of image and audio cues improves the arousal
prediction performance. For valence prediction, however, the mfccFV
feature alone achieves the best results.</p>
    </sec>
    <sec id="sec-11">
      <title>4. CONCLUSIONS</title>
      <p>In this paper, we present our multimodal approach to
predict global and continuous affects. The best result on the
global subtask is achieved by the late fusion of the audio,
image and motion modalities. However, for the continuous
subtask, the mfccFV feature significantly outperforms the other
features and benefits little from late fusion for valence
prediction. In future work, we will explore more features
for continuous affects prediction and use LSTMs to model
the temporal structure of the videos.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work is supported by the National Key Research and
Development Plan under Grant No. 2016YFB1001202.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Emmanuel</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          , Liming Chen, Yoann Baveye, Mats Sjoberg, and Christel Chamaret.
          <article-title>The MediaEval 2016 emotional impact of movies task</article-title>
          .
          <source>In MediaEval 2016 Workshop</source>
          , Hilversum, Netherlands, Oct.
          <volume>20</volume>
          -
          <fpage>21</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Florian</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Wöllmer</surname>
          </string-name>
          , and Bjorn Schuller.
          <article-title>Opensmile: the munich versatile and fast open-source audio feature extractor</article-title>
          .
          <source>In ACM International Conference on Multimedia (MM)</source>
          , pages
          <volume>1459</volume>
          -
          <fpage>1462</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Björn W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          , Stefan Steidl, and
          <string-name>
            <given-names>Anton</given-names>
            <surname>Batliner</surname>
          </string-name>
          .
          <article-title>The INTERSPEECH 2009 emotion challenge</article-title>
          .
          <source>In INTERSPEECH</source>
          <year>2009</year>
          ,
          <article-title>10th Annual Conference of the International Speech Communication Association</article-title>
          , Brighton,
          <source>United Kingdom, September</source>
          <volume>6</volume>
          -
          <issue>10</issue>
          ,
          <year>2009</year>
          , pages
          <fpage>312</fpage>
          -
          <fpage>315</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] Bjorn Schuller, Anton Batliner, Stefan Steidl, and
          <string-name>
            <given-names>Dino</given-names>
            <surname>Seppi</surname>
          </string-name>
          .
          <article-title>Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>53</volume>
          (
          <fpage>9</fpage>
          -10):
          <volume>1062</volume>
          -
          <fpage>1087</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Bjorn Schuller, Stefan Steidl, Anton Batliner, Alessandro Vinciarelli, Klaus Scherer, Fabien Ringeval, Mohamed Chetouani, Felix Weninger, Florian Eyben, and
          <string-name>
            <given-names>Erik</given-names>
            <surname>Marchi</surname>
          </string-name>
          .
          <article-title>The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism</article-title>
          .
          <source>Proceedings of Interspeech</source>
          , pages
          <volume>148</volume>
          -
          <fpage>152</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Steven B.</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <article-title>Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences</article-title>
          .
          <source>Readings in Speech Recognition</source>
          ,
          <volume>28</volume>
          (
          <issue>4</issue>
          ):
          <volume>65</volume>
          -
          <fpage>74</fpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Stephanie</given-names>
            <surname>Pancoast</surname>
          </string-name>
          and
          <string-name>
            <given-names>Murat</given-names>
            <surname>Akbacak</surname>
          </string-name>
          .
          <article-title>Softening quantization in bag-of-audio-words</article-title>
          .
          <source>In ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          , pages
          <volume>1370</volume>
          -
          <fpage>1374</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jorge</given-names>
            <surname>Sanchez</surname>
          </string-name>
          , Florent Perronnin, Thomas Mensink, and
          <string-name>
            <given-names>Jakob</given-names>
            <surname>Verbeek</surname>
          </string-name>
          .
          <article-title>Image classification with the Fisher vector: Theory and practice</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>105</volume>
          (
          <issue>3</issue>
          ):
          <volume>222</volume>
          -
          <fpage>245</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and
          <string-name>
            <given-names>Zbigniew</given-names>
            <surname>Wojna</surname>
          </string-name>
          .
          <article-title>Rethinking the inception architecture for computer vision</article-title>
          .
          <source>arXiv preprint arXiv:1512.00567</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>Computer Science</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Heng</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Cordelia</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Action recognition with improved trajectories</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          , pages
          <volume>3551</volume>
          -
          <fpage>3558</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Du</given-names>
            <surname>Tran</surname>
          </string-name>
          , Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and
          <string-name>
            <given-names>Manohar</given-names>
            <surname>Paluri</surname>
          </string-name>
          .
          <article-title>Learning spatiotemporal features with 3d convolutional networks</article-title>
          .
          <source>In 2015 IEEE International Conference on Computer Vision</source>
          (ICCV), pages
          <fpage>4489</fpage>
          -
          <fpage>4497</fpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Yoann</given-names>
            <surname>Baveye</surname>
          </string-name>
          , Emmanuel Dellandrea, Christel Chamaret, and
          <string-name>
            <given-names>Liming</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Deep learning vs. kernel methods: Performance for emotion prediction in videos</article-title>
          .
          <source>In ACII</source>
          , pages
          <volume>77</volume>
          -
          <fpage>83</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>