<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CNN Features for Emotional Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yun Yi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanli Wang</string-name>
          <email>wang@tongji.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qinyu Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Technology, Tongji University</institution>
          ,
          <addr-line>Shanghai 201804</addr-line>
          ,
          <country country="CN">P. R. China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Lanzhou City University</institution>
          ,
          <addr-line>Lanzhou 730070</addr-line>
          ,
          <country country="CN">P. R. China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Hanli Wang is the</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Gannan Normal University</institution>
          ,
          <addr-line>Ganzhou 341000</addr-line>
          ,
          <country country="CN">P. R. China</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>ence Foundation of China under Grants 61622115 and Grant 61472281, Shanghai Engineering Research Center of Industrial Vision Perception &amp; Intelligent Computing (17DZ2251600), and IBM Shared University Research Awards Program</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>A framework is proposed to predict the emotional impact of movies by using audio, action, object and scene features. First, four state-of-the-art features are extracted from four pre-trained convolutional neural networks to depict video contents, and an early fusion strategy is used to combine the vectors of these features. Then, linear support vector regression or a linear support vector machine is employed to learn the affective models or the fear models, respectively, and cross-validation is used to select the training parameters. Finally, a Gaussian blur function is used to smooth the scores of video segments. The experiments show that the combination of these features obtains promising results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The 2018 emotional impact of movies task consists of two
subtasks, including the valence-arousal prediction and the
fear prediction. A brief introduction about this challenge has
been given in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This paper mainly introduces the proposed
framework and discusses the experimental results.
      </p>
      <p>
        The selection of features is crucial to emotional analysis.
Intuitively, the audio, action, object and scene features can
influence emotions. Therefore, vectors of four state-of-the-art
features are calculated in this framework. Then, the
affective models or fear models are learned by using linear
support vector regression (SVR) or linear support vector
machine (SVM) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Finally, a Gaussian blur is
applied to smooth the scores of temporal segments.
      </p>
    </sec>
    <sec id="sec-2">
      <title>FRAMEWORK</title>
      <p>[Figure residue: framework diagram with the labels Video, Audio, Action, Object and Scene.]</p>
      <p>2.1 To depict a video, four features (audio, action, object and scene) are separately extracted from
four pre-trained Convolutional Neural Networks (CNNs).</p>
      <p>
        2.1.1 Audio Feature. Audio signals carry important
information about emotions. VGGish [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a widely used
audio feature extractor, so it is used to calculate the vectors
of the audio feature. First, the audio files are extracted from
the videos. Then, the pre-trained model<sup>1</sup> provided by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is
used to calculate the feature vectors of the audio files. VGGish thus
converts the audio signals into semantically meaningful,
high-level 128-dimensional feature vectors. As a
result, for the audio feature, a video is described as a
sequence of 128-dimensional vectors.
      </p>
      <p>
        2.1.2 Action Feature. The actions in a video can
influence the viewer’s emotions. The two-stream Convolutional
Network (ConvNet) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is a well-known framework for
video-based action recognition, and includes a spatial ConvNet
and a temporal ConvNet. The temporal segment network [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
models long-range temporal structure to improve
this framework, and Inception-v3 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is the basic network
architecture of the two ConvNets. The pre-trained models
provided by [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are used to calculate the vectors from
the ‘top_cls_global_pool’ layer. As a result, a frame is
described by two 1024-dimensional vectors. By concatenating the
two vectors of each frame, a video is depicted as a sequence of
2048-dimensional vectors.
      </p>
      <p>
        2.1.3 Object Feature. The objects in a video may affect
the emotions of the viewer. The Squeeze-and-Excitation
Network (SENet) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is a state-of-the-art model for object
classification. We utilize the pre-trained SENet model<sup>2</sup> to
calculate the vectors from the ‘pool5/7x7_s1’ layer. Therefore,
the dimension of the object feature is 2048.
        <fn id="fn1"><label>1</label><p>https://github.com/tensorflow/models/tree/master/research/audioset</p></fn>
      </p>
      <p>
        2.1.4 Scene Feature. The scenes of a video affect the
emotions of the audience. The Places365 dataset is a large
dataset for scene classification [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We utilize the pre-trained
ResNet-50 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] model<sup>3</sup> to calculate the vectors from the
‘avgpool’ layer. As a result, a frame is depicted by a 2048-dimensional
vector.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Emotional Prediction</title>
      <p>To combine the vectors of these features, we utilize the early
fusion strategy because of its simplicity and efficiency. As
shown in Fig. 1, we directly concatenate the vectors of these features
for each sample.</p>
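      <p>As an illustration of the early fusion step described above, the following Python sketch concatenates one vector per feature type into a single vector. It is a minimal sketch and not the implementation used in the experiments: the dimensions follow Section 2.1, while the function name, the fixed feature order and the zero-valued toy inputs are assumptions, as is the premise that the four vectors have already been aligned to the same sample.</p>
      <preformat><![CDATA[
```python
# Early fusion sketch: concatenate the per-sample feature vectors.
# Dimensions follow the text (128 + 2048 + 2048 + 2048); the function
# name, feature order and toy inputs are illustrative assumptions.

FEATURE_DIMS = {"audio": 128, "action": 2048, "object": 2048, "scene": 2048}
FEATURE_ORDER = ("audio", "action", "object", "scene")


def early_fusion(features):
    """Concatenate the feature vectors of one sample in a fixed order."""
    fused = []
    for name in FEATURE_ORDER:
        vector = features[name]
        expected = FEATURE_DIMS[name]
        if len(vector) != expected:
            raise ValueError(f"{name}: expected {expected} values, got {len(vector)}")
        fused.extend(vector)
    return fused


# Toy sample with all-zero vectors of the right size.
sample = {name: [0.0] * dim for name, dim in FEATURE_DIMS.items()}
fused_vector = early_fusion(sample)
print(len(fused_vector))  # 6272
```
]]></preformat>
      <p>With all four features, the fused vector of a sample has 128 + 3 × 2048 = 6272 dimensions; runs that use fewer features would concatenate only the selected vectors.</p>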
      <p>For the valence-arousal and fear subtasks, the linear SVR and the linear SVM
are used to learn the emotional models, respectively. In the fear subtask, the
number of positive samples is smaller than that of negative
samples. To address this imbalance, we weight
positive and negative samples in an inverse manner. The
regularization parameter <italic>C</italic> is set by
cross-validation. The LIBLINEAR toolbox<sup>4</sup> is used to implement
the L2-regularized L2-loss SVM and SVR.</p>
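      <p>The inverse weighting of positive and negative samples can be sketched as follows. The exact formula is not given in the text, so this sketch assumes that each class is weighted by the fraction of samples belonging to the other class; the function name and the toy labels are also assumptions.</p>
      <preformat><![CDATA[
```python
# Inverse class weighting sketch for an imbalanced binary task.
# Assumption: the weight of a class equals the fraction of samples in the
# OTHER class, so the rarer (positive) class receives the larger weight.


def inverse_class_weights(labels):
    """Return {label: weight} for binary labels given as 0 and 1."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    total = n_pos + n_neg
    return {1: n_neg / total, 0: n_pos / total}


labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # 8 negative and 2 positive segments
weights = inverse_class_weights(labels)
print(weights)  # {1: 0.8, 0: 0.2}
```
]]></preformat>
      <p>In LIBLINEAR, such per-class weights can be passed to the trainer through its <monospace>-wi</monospace> options, which scale the parameter <italic>C</italic> of the corresponding class.</p>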
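      <p>After obtaining the scores of video segments, we use a
Gaussian blur function to smooth these scores. Let the
score vector of a video be <inline-formula><tex-math>\mathbf{s}</tex-math></inline-formula>. Then, the Gaussian blur function
is defined as
        <disp-formula><tex-math>\mathrm{Gaussianblur}(\mathbf{s}) = \mathbf{s} \otimes \mathbf{g},</tex-math></disp-formula>
where <inline-formula><tex-math>\otimes</tex-math></inline-formula> is the convolution operator and <inline-formula><tex-math>\mathbf{g}</tex-math></inline-formula> is the specified
Gaussian kernel. In the experiments, we set the size of the Gaussian kernel
to 11 for the valence-arousal subtask and 5 for the fear
subtask.</p>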
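      <p>The smoothing step can be illustrated with a short Python sketch that convolves the score vector of a video with a normalized Gaussian kernel. Only the kernel sizes (11 and 5) come from the text; the standard deviation, the replication of border values and the function names are assumptions made for this sketch.</p>
      <preformat><![CDATA[
```python
import math


def gaussian_kernel(size, sigma):
    """Return a normalized 1-D Gaussian kernel with an odd number of taps."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    half = size // 2
    taps = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(taps)
    return [t / total for t in taps]


def gaussian_blur(scores, size, sigma=1.0):
    """Convolve scores with a Gaussian kernel, replicating the border values.

    sigma and the border handling are assumptions; the text only fixes size.
    """
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    last = len(scores) - 1
    smoothed = []
    for i in range(len(scores)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - half, 0), last)  # clamp the index at the borders
            acc += weight * scores[j]
        smoothed.append(acc)
    return smoothed


# Kernel size 5, as used for the fear subtask in the text.
spike = [0.0, 0.0, 1.0, 0.0, 0.0]
print(gaussian_blur(spike, size=5))
```
]]></preformat>
      <p>Because the kernel is symmetric and sums to one, a constant score sequence is left unchanged and an isolated peak is spread over its neighbouring segments.</p>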
    </sec>
    <sec id="sec-4">
      <title>RESULT AND DISCUSSION</title>
      <p>
        To evaluate the features described in
Section 2.1, the features provided by the task organizers are
selected as the baseline features. As required in the task, we
submit five runs for each of the two subtasks. Table 1 shows
the features used in these runs.
As the learning algorithm, SVR is employed in the valence-arousal
subtask, and SVM is used in the fear subtask. The Mean
Square Error (MSE) and Pearson Correlation
Coefficient (PCC) are reported for the valence-arousal subtask, and
the Intersection over Union (IoU) of time intervals is
considered as the evaluation metric for the fear subtask [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The
results are given in Table 2 and Table 3.
        <fn id="fn2"><label>2</label><p>https://github.com/hujie-frank/SENet</p></fn>
        <fn id="fn3"><label>3</label><p>https://github.com/CSAILVision/places365</p></fn>
        <fn id="fn4"><label>4</label><p>https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multicore-liblinear</p></fn>
      </p>
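      <p>For reference, the three metrics named above can be computed as in the following Python sketch. It is a simplified illustration and may differ from the official evaluation described in [1]: in particular, it assumes that fear annotations and predictions are given as binary labels per time unit, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def mse(pred, truth):
    """Mean Square Error between two equally long score sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def pcc(pred, truth):
    """Pearson Correlation Coefficient between two score sequences."""
    n = len(truth)
    mean_p = sum(pred) / n
    mean_t = sum(truth) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, truth))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    return cov / math.sqrt(var_p * var_t)


def interval_iou(pred, truth):
    """IoU of two binary label sequences (assumed: one label per time unit)."""
    inter = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    union = sum(1 for p, t in zip(pred, truth) if p == 1 or t == 1)
    return inter / union if union else 1.0


print(interval_iou([0, 1, 1, 1, 0, 0], [0, 0, 1, 1, 1, 0]))  # 0.5
```
]]></preformat>
      <p>Lower MSE and higher PCC indicate better valence-arousal predictions, while a higher IoU indicates a better overlap between predicted and annotated fear intervals.</p>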
      <p>As shown in Table 2, Run 2 obtains the best result in the
valence-arousal subtask. This suggests that the combination
of the audio feature and the scene feature is sufficient to predict
valence-arousal values. In the fear subtask, Run 4 achieves
the top performance, as shown in Table 3. This suggests
that the combination of audio, scene and action features is
enough to describe fear, and that using more
features does not necessarily lead to better experimental
results. Comparing the results of Run 2 and Run 3 in
Table 2 and Table 3 shows that the object feature improves
the performance in the fear subtask, but decreases the
performance in the valence-arousal subtask. This may be
because some objects, such as blood and guns, can evoke fear.
In Table 3, Run 4 obtains better
performance than Run 3. This partly suggests that
actions are more likely to cause fear than objects.</p>
    </sec>
    <sec id="sec-5">
      <title>CONCLUSION</title>
      <p>In this work, we propose a framework to predict the emotional
impact of movies. Vectors of four features are calculated by
using four pre-trained convolutional neural networks. The
affective models and the fear models are learned by
using SVR and SVM, respectively, and a Gaussian blur is
used to smooth the temporal scores. Experimental results
show that the combination of the audio feature and the scene feature
is sufficient in the valence-arousal subtask, and that the additional
action feature improves the performance in the fear subtask.</p>
      <p>The 2018 Emotional Impact of Movies Task</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Emmanuel</given-names>
            <surname>Dellandréa</surname>
          </string-name>
          , Martijn Huigsloot, Liming Chen, Yoann Baveye, Zhongzhe Xiao, and Mats Sjöberg.
          <year>2018</year>
          .
          <article-title>The MediaEval 2018 emotional impact of movies task</article-title>
          .
          <source>In MediaEval 2018 Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Rong-En</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kai-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cho-Jui</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiang-Rui</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Chih-Jen</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>LIBLINEAR: A library for large linear classification</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          ),
          <fpage>1871</fpage>
          -
          <lpage>1874</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In CVPR</source>
          .
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Shawn</given-names>
            <surname>Hershey</surname>
          </string-name>
          , Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke,
          <string-name>
            <given-names>Aren</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R Channing</given-names>
            <surname>Moore</surname>
          </string-name>
          , Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, and others.
          <year>2017</year>
          .
          <article-title>CNN architectures for large-scale audio classification</article-title>
          .
          <source>In ICASSP</source>
          .
          <fpage>131</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jie</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Li</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gang</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Squeeze-and-excitation networks</article-title>
          .
          <source>In CVPR</source>
          .
          <fpage>7132</fpage>
          -
          <lpage>7141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Two-stream convolutional networks for action recognition in videos</article-title>
          .
          <source>In NIPS</source>
          .
          <fpage>568</fpage>
          -
          <lpage>576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
          <string-name>
            <given-names>Zbigniew</given-names>
            <surname>Wojna</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Rethinking the inception architecture for computer vision</article-title>
          .
          <source>In CVPR</source>
          .
          <fpage>2818</fpage>
          -
          <lpage>2826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Limin</given-names>
            <surname>Wang</surname>
          </string-name>
          , Yuanjun Xiong,
          <string-name>
            <given-names>Zhe</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yu</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dahua</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          , and Luc Van Gool.
          <year>2016</year>
          .
          <article-title>Temporal segment networks: Towards good practices for deep action recognition</article-title>
          .
          <source>In ECCV</source>
          .
          <fpage>20</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Bolei</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba.
          <year>2018</year>
          .
          <article-title>Places: A 10 million image database for scene recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>40</volume>
          ,
          <issue>6</issue>
          (
          <year>2018</year>
          ),
          <fpage>1452</fpage>
          -
          <lpage>1464</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>