<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MIC-TJU in MediaEval 2017 Emotional Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yun Yi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanli Wang</string-name>
          <email>hanliwang@tongji.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiangchuan Wei</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Technology, Tongji University</institution>
          ,
          <addr-line>Shanghai 201804</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics and Computer Science, Gannan Normal University</institution>
          ,
          <addr-line>Ganzhou 341000</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Hanli Wang is the corresponding author.</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>To predict the emotional impact and fear of movies, we propose a framework that employs four audio-visual features. In particular, we utilize the features extracted by the methods of motion keypoint trajectory and convolutional neural networks to depict the visual information, and extract a global and a local audio feature to describe the audio cues. The early fusion strategy is employed to combine the vectors of these features. Then, linear support vector regression and support vector machine are used to learn the affective models. The experimental results show that the combination of these features achieves promising performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The 2017 Emotional Impact of Movies task is a challenging
task that contains two subtasks (i.e., valence-arousal
prediction and fear prediction). A brief introduction to
this challenge has been given in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this paper, we mainly
introduce the system architecture and algorithms used in our
framework, and discuss the evaluation results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>FRAMEWORK</title>
      <p>The key components of the proposed framework are shown in
Fig. 1, and the highlights of our framework are introduced
below.</p>
    </sec>
    <sec id="sec-3">
      <title>Feature Extraction</title>
      <p>
        In this framework, we evaluate four features, including
EmoBase10 feature [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Mel-Frequency Cepstral Coefficients
(MFCC) feature [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Motion Keypoint Trajectory (MKT)
feature [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and Convolutional Networks (ConvNets)
feature [
        <xref ref-type="bibr" rid="ref12 ref14">12, 14</xref>
        ].
      </p>
      <p>
        2.1.1 MFCC Feature. In affective content analysis, the audio
modality is essential. MFCC is a widely used local audio feature.
The time window of MFCC is set to 32 ms, with a
50% overlap between two adjacent windows. To improve
performance, we append the delta and double-delta
coefficients of the 20-dimensional vectors to the original MFCC vector,
so a 60-dimensional MFCC vector is generated. We
apply Principal Component Analysis (PCA) to reduce the
dimension of the local feature, and use the Fisher Vector
(FV) model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to represent a whole audio file via a
signature vector. The cluster number of the Gaussian Mixture
Model (GMM) is set to 512, and the signed square root
and L2 norm are utilized to normalize the vectors. In our
experiments, we use the toolbox provided by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to calculate
the vectors of MFCC.
      </p>
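      <p>
        For illustration, a minimal sketch of the MFCC step described above is given below. It uses librosa as an assumed substitute for the toolbox referenced above, with 32 ms windows, 50% overlap, and appended delta and double-delta coefficients; the subsequent PCA and FV encoding are not shown here.
      </p>
      <preformat>
# Sketch: 60-dimensional MFCC vectors (20 MFCC + delta + double-delta),
# 32 ms windows with 50% overlap. librosa is an assumed substitute for
# the MFCC toolbox used in the paper.
import librosa
import numpy as np

def extract_mfcc_60d(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    win = int(0.032 * sr)            # 32 ms window
    hop = win // 2                   # 50% overlap between adjacent windows
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=win, hop_length=hop)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    feats = np.vstack([mfcc, delta, delta2])     # shape: (60, n_frames)
    return feats.T                               # one 60-d vector per frame
      </preformat>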
      <p>
        2.1.2 EmoBase10 Feature. To depict audio information,
we extract the EmoBase10 feature [
        <xref ref-type="bibr" rid="ref11 ref5">5, 11</xref>
        ], which is a
global and high-level audio feature. As suggested by [
        <xref ref-type="bibr" rid="ref11 ref5">5, 11</xref>
        ],
the default parameters are utilized to extract the
1,582-dimensional vector of EmoBase10. The 1,582-dimensional
vector results from: (1) 21 functionals applied to 34
Low-Level Descriptors (LLD) and 34 corresponding delta
coefficients, (2) 19 functionals applied to the 4 pitch-based LLD
and their 4 delta coefficient contours, (3) the number of pitch
onsets and the total duration of the input [
        <xref ref-type="bibr" rid="ref11 ref5">5, 11</xref>
        ]. Then, the
signed square root and L2 norm are utilized to normalize the
vectors. We calculate the EmoBase10 feature by using the
openSMILE1 toolkit.
      </p>
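      <p>
        A minimal sketch of this extraction and normalization step is given below; it assumes the emobase2010 configuration shipped with openSMILE, and the command-line options and CSV output layout are illustrative.
      </p>
      <preformat>
# Sketch: extract the EmoBase10 vector with openSMILE (emobase2010 config,
# assumed to ship with the toolkit) and apply signed square root + L2 norm.
import subprocess
import numpy as np

def normalize(v):
    v = np.sign(v) * np.sqrt(np.abs(v))          # signed square root
    return v / (np.linalg.norm(v) + 1e-12)       # L2 normalization

def extract_emobase10(wav_path, conf="emobase2010.conf", out_csv="emobase10.csv"):
    # Command-line options are illustrative; consult the openSMILE docs.
    subprocess.run(["SMILExtract", "-C", conf, "-I", wav_path,
                    "-csvoutput", out_csv], check=True)
    # Assumed CSV layout: header row, then name field, 1,582 features, class field.
    vec = np.genfromtxt(out_csv, delimiter=";", skip_header=1)[1:-1]
    return normalize(vec)
      </preformat>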
      <p>
        2.1.3 MKT Feature. We utilize the MKT [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] feature to
depict the motion information. Motion keypoints are tracked
by the approach of MKT at multiple spatial scales, and an
optical flow rectification algorithm that is based on vector
field consensus [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is designed to reduce the influence of
camera motions. To depict trajectories in a video, we
calculate four local descriptors along trajectories, including
Histogram of Oriented Gradient (HOG) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Motion Boundary
Histogram (MBH) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Histogram of Optical Flow (HOF) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
and Trajectory-Based Covariance (TBC) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In general,
MBH and HOF represent the local motion information, HOG
describes the local appearance, and TBC depicts the
relationships between different motion variables. After
calculating these local vectors, we individually apply the RootSIFT
normalization (i.e., square root on each dimension after L1
normalization) to normalize these vectors.
      </p>
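      <p>
        The RootSIFT normalization mentioned above (square root on each dimension after L1 normalization) can be sketched as follows; the descriptors are assumed to be non-negative histogram vectors.
      </p>
      <preformat>
# Sketch: RootSIFT normalization applied to each local descriptor vector
# (square root on each dimension after L1 normalization).
import numpy as np

def rootsift(desc, eps=1e-12):
    # desc: array of shape (n_descriptors, dim), assumed non-negative
    desc = desc / (np.sum(np.abs(desc), axis=-1, keepdims=True) + eps)  # L1 normalize
    return np.sqrt(desc)                                                # per-dimension square root
      </preformat>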
      <p>
        In order to reduce the dimension of descriptors, we apply
PCA to the four descriptors individually. Then, the FV
model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is used to encode these local vectors. In particular,
we apply GMM to construct a codebook for each descriptor,
and set the number of GMM components to 128. Finally, the signed
square root and L2 normalization are applied to these
vectors. To combine the trajectory-based descriptors, we
concatenate the vectors of these four descriptors into a single
one.
      </p>
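      <p>
        A sketch of the PCA and FV encoding pipeline is given below, using scikit-learn's GaussianMixture as an assumed stand-in for the GMM/FV implementation; the PCA dimension is illustrative, and only the standard gradients with respect to the means and variances are encoded.
      </p>
      <preformat>
# Sketch: PCA + Fisher Vector encoding of local descriptors with a
# 128-component diagonal GMM, followed by signed square root + L2 norm.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fit_codebook(train_desc, n_pca=64, n_clusters=128):
    # n_pca is illustrative; the paper does not specify the reduced dimension.
    pca = PCA(n_components=n_pca).fit(train_desc)
    gmm = GaussianMixture(n_components=n_clusters, covariance_type="diag")
    gmm.fit(pca.transform(train_desc))
    return pca, gmm

def fisher_vector(desc, pca, gmm):
    x = pca.transform(desc)                          # (n, d)
    n, d = x.shape
    q = gmm.predict_proba(x)                         # (n, k) posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (x[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (q[:, :, None] * diff).sum(0) / (n * np.sqrt(w)[:, None])
    g_var = (q[:, :, None] * (diff**2 - 1)).sum(0) / (n * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])    # 2 * k * d dimensions
    fv = np.sign(fv) * np.sqrt(np.abs(fv))           # signed square root
    return fv / (np.linalg.norm(fv) + 1e-12)         # L2 normalization
      </preformat>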
      <p>
        2.1.4 ConvNets Feature. Convolutional Neural
Networks (CNNs) have been successfully applied in many areas. The
two-stream Convolutional Networks (ConvNets) feature
includes two streams [
        <xref ref-type="bibr" rid="ref12 ref14">12, 14</xref>
        ], i.e., the spatial stream ConvNet
and the temporal stream ConvNet. The spatial ConvNet,
operating on video frames, captures information about scenes
and objects. Meanwhile, the temporal ConvNet, stacking
optical flow fields, conveys the motion information of videos.
The two-stream ConvNets feature is calculated according to
the processes in [
        <xref ref-type="bibr" rid="ref12 ref14">12, 14</xref>
        ] based on the network architecture
of BN-Inception [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        In our experiments, the Caffe toolbox is used to calculate
the ConvNets feature. We utilize the models pretrained on
the UCF101 dataset [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and calculate the feature vectors
from the `global pool' layer. Let the sets of vectors extracted
from the spatial and temporal nets be denoted as
S = {S1, ..., Si, ..., SN} and T = {T1, ..., Ti, ..., TN},
where N is the number of frames, and Si and Ti are
1,024-dimensional vectors. To depict a video via one vector,
we utilize two strategies, including Fisher Vector (FV)
and Mean Standard Deviation (MSD). The feature vectors
calculated by the two strategies are denoted as
ConvNetsFV and ConvNets-MSD separately. For the extraction of
ConvNets-FV, we follow the processes as suggested in [
        <xref ref-type="bibr" rid="ref10 ref15 ref16">10,
15, 16</xref>
        ], and set the cluster number of GMM to 64. For
the feature calculation of ConvNets-MSD, we calculate the
means of the two sets, denoted as μ(S)
and μ(T), and their standard deviations, denoted
as σ(S) and σ(T). Then, the four vectors (i.e., μ(S), μ(T),
σ(S), and σ(T)) are concatenated to produce a (1,024 ×
4)-dimensional vector.
      </p>
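      <p>
        A minimal sketch of the ConvNets-MSD aggregation is given below, assuming the per-frame 1,024-dimensional `global pool' vectors have already been extracted.
      </p>
      <preformat>
# Sketch: Mean Standard Deviation (MSD) aggregation of per-frame ConvNets
# features. S and T are (N, 1024) arrays of spatial and temporal vectors.
import numpy as np

def convnets_msd(S, T):
    msd = np.concatenate([S.mean(axis=0), T.mean(axis=0),
                          S.std(axis=0),  T.std(axis=0)])
    return msd  # (1,024 x 4)-dimensional video-level vector
      </preformat>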
    </sec>
    <sec id="sec-4">
      <title>Regression and Classification</title>
      <p>
        In the two subtasks, we employ linear Support Vector
Regression (SVR) and Support Vector Machine (SVM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
to learn the emotional models separately. For the fear
subtask, the number of positive samples is less than that
of the negative samples. To address this imbalance, we weight
the positive and negative samples in an inverse manner. The
regularization parameter C is set by cross-validation on
the training set. The LIBLINEAR toolbox2 is utilized to
implement the L2-regularized L2-loss SVM and SVR.
      </p>
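      <p>
        The learning step can be sketched with scikit-learn's LIBLINEAR-backed estimators as an assumed stand-in for the multicore LIBLINEAR toolbox; the C grid is illustrative.
      </p>
      <preformat>
# Sketch: linear SVR for valence-arousal and class-weighted linear SVM for
# fear, with the regularization parameter C selected by cross-validation
# on the training set.
from sklearn.svm import LinearSVR, LinearSVC
from sklearn.model_selection import GridSearchCV

C_GRID = {"C": [0.01, 0.1, 1, 10, 100]}   # illustrative grid

def train_valence_arousal(X, y):
    svr = LinearSVR(loss="squared_epsilon_insensitive")        # L2-loss SVR
    return GridSearchCV(svr, C_GRID, cv=5).fit(X, y).best_estimator_

def train_fear(X, y):
    # class_weight="balanced" weights positive/negative samples inversely
    # to their frequencies in the training set.
    svm = LinearSVC(loss="squared_hinge", class_weight="balanced")  # L2-regularized L2-loss SVM
    return GridSearchCV(svm, C_GRID, cv=5).fit(X, y).best_estimator_
      </preformat>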
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND DISCUSSIONS</title>
      <p>
        In this task, we submit 5 runs, and the results are given in
Table 1 and Table 2. The main difference among these 5 runs is
the selection of features. We select MFCC, ConvNets-MSD
and EmoBase10 in Run 1, MFCC and ConvNets-MSD in
Run 2, MFCC, ConvNets-FV and EmoBase10 in Run 3,
MFCC, ConvNets-MSD, EmoBase10 and MKT in Run 4,
and MFCC, ConvNets-FV, EmoBase10 and MKT in Run
5. For the valence-arousal subtask, we report Mean Square
Error (MSE) and Pearson Correlation Coefficient (PCC) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
For the fear subtask, the performances of accuracy, precision,
recall and F1-score are considered as suggested in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Regarding the learning processes of all runs, we utilize SVR
in the valence-arousal subtask, and use SVM in the fear
subtask.
2 https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multicore-liblinear
      </p>
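      <p>
        The reported measures can be computed as in the following generic sketch with scikit-learn and SciPy; this is not the official evaluation script of the task.
      </p>
      <preformat>
# Sketch: evaluation measures used in the two subtasks.
from scipy.stats import pearsonr
from sklearn.metrics import (mean_squared_error, accuracy_score,
                             precision_score, recall_score, f1_score)

def valence_arousal_metrics(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    pcc, _ = pearsonr(y_true, y_pred)
    return mse, pcc

def fear_metrics(y_true, y_pred):
    return (accuracy_score(y_true, y_pred),
            precision_score(y_true, y_pred),
            recall_score(y_true, y_pred),
            f1_score(y_true, y_pred))
      </preformat>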
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Navneet</given-names>
            <surname>Dalal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bill</given-names>
            <surname>Triggs</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Histograms of oriented gradients for human detection</article-title>
          .
          <source>In CVPR'05</source>
          . 886–
          <fpage>893</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Navneet</given-names>
            <surname>Dalal</surname>
          </string-name>
          , Bill Triggs, and
          <string-name>
            <given-names>Cordelia</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <year>2006</year>
          .
            <article-title>Human detection using oriented histograms of flow and appearance</article-title>
          .
          <source>In ECCV'06</source>
          . 428–
          <fpage>441</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Emmanuel</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          , Martijn Huigsloot, Liming Chen, Yoann Baveye, and Mats Sjoberg.
          <year>2017</year>
          .
          <article-title>The MediaEval 2017 Emotional Impact of Movies Task</article-title>
          . In MediaEval 2017 Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Daniel</surname>
            <given-names>P. W.</given-names>
          </string-name>
          <string-name>
            <surname>Ellis</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>PLP and RASTA (and MFCC, and inversion) in Matlab</article-title>
          . http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/. (
          <year>2005</year>
          ).
          <article-title>online web resource</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Florian</given-names>
            <surname>Eyben</surname>
          </string-name>
          , Felix Weninger, Florian Gross, and Bjorn Schuller.
          <year>2013</year>
          .
          <article-title>Recent developments in openSMILE, the Munich open-source multimedia feature extractor</article-title>
          .
          <source>In ACM MM'13</source>
          . 835–
          <fpage>838</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Rong-En</surname>
            <given-names>Fan</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kai-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho-Jui</surname>
            <given-names>Hsieh</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiang-Rui</surname>
            <given-names>Wang</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Chih-Jen Lin</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>LIBLINEAR: A Library for Large Linear Classification</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          ),
          <fpage>1871</fpage>
          –
          <lpage>1874</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Ioffe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>
          .
          <source>In ICML'15</source>
          . 448–
          <fpage>456</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Laptev</surname>
          </string-name>
          , Marcin Marszalek, Cordelia Schmid, and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Rozenfeld</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Learning realistic human actions from movies</article-title>
          .
          <source>In CVPR'08</source>
          . 1–8.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jiayi</given-names>
            <surname>Ma</surname>
          </string-name>
          , Ji Zhao,
          <string-name>
            <given-names>Jinwen</given-names>
            <surname>Tian</surname>
          </string-name>
          , Alan L Yuille, and
          <string-name>
            <given-names>Zhuowen</given-names>
            <surname>Tu</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Robust point matching via vector field consensus</article-title>
          .
          <source>IEEE Trans. Image Processing 23</source>
          ,
          <issue>4</issue>
          (
          <year>2014</year>
          ),
          <fpage>1706</fpage>
          –
          <lpage>1721</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Florent</given-names>
            <surname>Perronnin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Dance</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Fisher kernels on visual vocabularies for image categorization</article-title>
          .
          <source>In CVPR'07.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] Bjorn Schuller, Stefan Steidl, Anton Batliner, Felix Burkhardt, Laurence Devillers,
          <article-title>Christian A Muller, and Shrikanth S Narayanan</article-title>
          .
          <year>2010</year>
          .
          <article-title>The INTERSPEECH 2010 paralinguistic challenge</article-title>
          .
          <source>In Interspeech'10.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Two-stream convolutional networks for action recognition in videos</article-title>
          .
          <source>In NIPS'14</source>
          . 568–
          <fpage>576</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Khurram</surname>
            <given-names>Soomro</given-names>
          </string-name>
          , Amir Roshan Zamir, and
          <string-name>
            <given-names>Mubarak</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>UCF101: A dataset of 101 human actions classes from videos in the wild</article-title>
          .
          <source>CRCV-TR-12-01</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Limin</surname>
            <given-names>Wang</given-names>
          </string-name>
          , Yuanjun Xiong,
          <string-name>
            <surname>Zhe</surname>
            <given-names>Wang</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            <given-names>Qiao</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Dahua</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          , and Luc Van Gool.
          <year>2016</year>
          .
          <article-title>Temporal segment networks: towards good practices for deep action recognition</article-title>
          .
          <source>In ECCV'16</source>
          . 20–
          <fpage>36</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Yun</given-names>
            <surname>Yi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hanli</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Motion keypoint trajectory and covariance descriptor for human action recognition</article-title>
          .
          <source>The Visual Computer</source>
          (
          <year>2017</year>
          ),
          <fpage>1</fpage>
          –
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Yun</surname>
            <given-names>Yi</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Hanli</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bowen</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Learning correlations for human action recognition in videos</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          <volume>76</volume>
          ,
          <issue>18</issue>
          (
          <year>2017</year>
          ),
          <fpage>18891</fpage>
          –
          <lpage>18913</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>