             CNN Features for Emotional Impact of Movies Task

Yun Yi 1,2, Hanli Wang 2,*, Qinyu Li 2,3
1 Key Laboratory of Jiangxi Province for Numerical Simulation and Emulation Techniques, Gannan Normal University, Ganzhou 341000, P. R. China
2 Department of Computer Science and Technology, Tongji University, Shanghai 201804, P. R. China
3 Department of Computer Science, Lanzhou City University, Lanzhou 730070, P. R. China
*Hanli Wang is the corresponding author, E-mail: hanliwang@tongji.edu.cn.
This work was supported in part by the National Natural Science Foundation of China under Grants 61622115 and 61472281, the Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing (17DZ2251600), and the IBM Shared University Research Awards Program.
Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
A framework is proposed to predict the emotional impact of movies by using audio, action, object and scene features. First, four state-of-the-art features are extracted from four pre-trained convolutional neural networks to depict video contents, and an early fusion strategy is used to combine the vectors of these features. Then, linear support vector regression or a linear support vector machine is employed to learn the affective models or the fear models separately, and cross-validation is used to select the training parameters. Finally, a Gaussian blur function is used to smooth the scores of the video segments. The experiments show that the combination of these features obtains promising results.

1 INTRODUCTION
The 2018 emotional impact of movies task consists of two subtasks: valence-arousal prediction and fear prediction. A brief introduction to this challenge is given in [1]. This paper mainly introduces the proposed framework and discusses the experimental results.
   The selection of features is crucial to emotional analysis. Intuitively, audio, action, object and scene features can all influence emotions. Therefore, vectors of four state-of-the-art features are calculated in this framework. Then, the affective models or the fear models are learned by using linear support vector regression (SVR) or a linear support vector machine (SVM) [2]. Finally, a Gaussian blur function is utilized to smooth the scores of temporal segments.

2 FRAMEWORK
Figure 1 shows the key components of the proposed framework, and its highlights are introduced below.

[Figure 1: An overview of the proposed framework. Audio, action, object and scene features are extracted from the video, fused, and fed to an SVM or SVR, followed by temporal smoothing to produce the results.]

2.1 Features
To depict a video, four features are separately extracted from four pre-trained Convolutional Neural Networks (CNNs): audio, action, object and scene features.

2.1.1 Audio Feature. Audio signals carry important information for describing emotions. VGGish [4] is a widely used audio feature extractor, so it is employed to compute the audio feature vectors. First, the audio tracks are extracted from the videos. Then, the pre-trained model provided by [4] (https://github.com/tensorflow/models/tree/master/research/audioset) is utilized to calculate the feature vectors of the audio files. In this way, the audio signals are converted by VGGish into semantically meaningful, high-level 128-dimensional feature vectors. In summary, for the audio feature, a video is described as a sequence of 128-dimensional vectors.
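As a concrete illustration, the following sketch (Python) computes the 128-dimensional VGGish embeddings with the inference utilities shipped in the repository linked above; the module names (vggish_input, vggish_params, vggish_slim) and the checkpoint name come from that repository, the audio track is assumed to have been exported to a mono WAV file beforehand, and this is a minimal sketch rather than the exact extraction script used for the submitted runs.

import tensorflow as tf                            # TensorFlow 1.x, as in the repository demo
import vggish_input, vggish_params, vggish_slim    # modules from the audioset repository

# Assumed preprocessing step: export the audio track of a movie, e.g.
#   ffmpeg -i movie.mp4 -ac 1 -ar 16000 audio.wav
examples = vggish_input.wavfile_to_examples('audio.wav')    # log-mel spectrogram patches

with tf.Graph().as_default(), tf.Session() as sess:
    vggish_slim.define_vggish_slim(training=False)
    vggish_slim.load_vggish_slim_checkpoint(sess, 'vggish_model.ckpt')
    inp = sess.graph.get_tensor_by_name(vggish_params.INPUT_TENSOR_NAME)
    out = sess.graph.get_tensor_by_name(vggish_params.OUTPUT_TENSOR_NAME)
    embeddings = sess.run(out, feed_dict={inp: examples})   # shape: (num_patches, 128)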
2.1.2 Action Feature. The actions in a video can influence the viewer's emotions. The two-stream Convolutional Network (ConvNet) [6] is a well-known framework for video-based action recognition, and it consists of a spatial ConvNet and a temporal ConvNet. The temporal segment network [8] models long-range temporal structure to improve this framework, with Inception-v3 [7] as the basic architecture of the two ConvNets. The pre-trained models provided by [8] are utilized to calculate the vectors from the 'top cls global pool' layer. As a result, a frame is described by two 1024-dimensional vectors. By concatenating the two vectors of a frame, a video is depicted as a sequence of 2048-dimensional vectors.
2.1.3 Object Feature. The objects in a video may affect the emotions of the viewer. The Squeeze-and-Excitation Network (SENet) [5] is a state-of-the-art model for object classification. We utilize the pre-trained SENet model (https://github.com/hujie-frank/SENet) to calculate the vectors from the 'pool5/7 × 7 s1' layer. Therefore, the dimension of the object feature is 2048.
2.1.4 Scene Feature. The scenes of a video also affect the emotions of the audience. Places365 [9] is a large-scale dataset for scene classification. We utilize the pre-trained ResNet-50 [3] model (https://github.com/CSAILVision/places365) to calculate the vectors from the 'avgpool' layer, so a frame is depicted by a 2048-dimensional vector.
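For the frame-level object and scene features, extraction amounts to a forward pass through the pre-trained CNN and reading the pooled activations. The sketch below (Python, PyTorch/torchvision) illustrates the scene branch, assuming the resnet50_places365.pth.tar checkpoint released in the repository linked above and following the checkpoint-loading convention of its demo script; the object branch is analogous with the pre-trained SENet model. This is a sketch, not the exact script used for the runs.

import torch
from torchvision import models, transforms
from PIL import Image

# Load ResNet-50 trained on Places365; stripping the 'module.' prefix follows
# the repository's demo script (an assumption about the checkpoint format).
model = models.resnet50(num_classes=365)
ckpt = torch.load('resnet50_places365.pth.tar', map_location='cpu')
state = {k.replace('module.', ''): v for k, v in ckpt['state_dict'].items()}
model.load_state_dict(state)
model.fc = torch.nn.Identity()     # forward() now returns the 2048-d 'avgpool' output
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

frame = Image.open('frame_000001.jpg').convert('RGB')   # a decoded video frame
with torch.no_grad():
    scene_feat = model(preprocess(frame).unsqueeze(0)).squeeze(0)   # shape: (2048,)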

2.2 Emotional Prediction
To combine the vectors of these features, we utilize an early fusion strategy because of its simplicity and efficiency. As shown in Figure 1, we directly concatenate the vectors of these features for each sample.
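A minimal sketch of this early fusion is given below (Python/NumPy). How the per-frame and per-audio-window vectors are aggregated into one vector per annotated sample is not spelled out here, so aligned per-sample vectors are assumed; the dimensionalities follow Section 2.1.

import numpy as np

def early_fusion(audio_vec, action_vec, object_vec, scene_vec):
    # Concatenate the per-sample feature vectors described in Section 2.1:
    # 128-d audio + 2048-d action + 2048-d object + 2048-d scene = 6272-d.
    return np.concatenate([audio_vec, action_vec, object_vec, scene_vec])

fused = early_fusion(np.zeros(128), np.zeros(2048), np.zeros(2048), np.zeros(2048))
print(fused.shape)   # (6272,)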
   For the two subtasks, the linear SVR and the linear SVM are used separately to learn the emotional models. In the fear subtask, the number of positive samples is smaller than the number of negative samples. To address this imbalance, we weight the positive and negative samples in an inverse manner. The regularization parameter C is selected by cross-validation. The LIBLINEAR toolbox (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multicore-liblinear) is used to implement the L2-regularized L2-loss SVM and SVR.
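The sketch below (Python) illustrates this training setup. It uses scikit-learn's LinearSVR and LinearSVC, which are also built on LIBLINEAR, as a stand-in for the multicore LIBLINEAR toolbox; the C search range, the toy data and the 'balanced' class weighting are illustrative assumptions rather than the exact settings of the submitted runs.

import numpy as np
from sklearn.svm import LinearSVC, LinearSVR
from sklearn.model_selection import GridSearchCV

# Toy stand-ins for the fused features and labels (see Sections 2.1 and 2.2).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6272))          # early-fused feature vectors
y_valence = rng.uniform(-1.0, 1.0, 200)   # continuous valence (or arousal) labels
y_fear = rng.integers(0, 2, 200)          # binary fear labels

C_grid = {'C': np.logspace(-3, 3, 7)}     # assumed cross-validation range for C

# Valence-arousal subtask: L2-regularized L2-loss support vector regression.
svr = GridSearchCV(LinearSVR(loss='squared_epsilon_insensitive'), C_grid, cv=5)
svr.fit(X, y_valence)

# Fear subtask: L2-regularized L2-loss SVM; inverse class weights counter the
# smaller number of positive (fear) samples.
svm = GridSearchCV(LinearSVC(loss='squared_hinge', class_weight='balanced'),
                   C_grid, cv=5)
svm.fit(X, y_fear)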
   After obtaining the scores of the video segments, we apply a Gaussian blur to smooth these scores. Let the score vector of a video be V. The Gaussian blur is defined as

      Gaussianblur(V) = V ⊗ K,

where ⊗ is the convolution operator and K is the specified Gaussian kernel. In the experiments, the size of the Gaussian kernel is set to 11 for the valence-arousal subtask and 5 for the fear subtask.
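A minimal sketch of this smoothing step is shown below (Python/NumPy). The paper fixes only the kernel size, so the kernel width (sigma) derived from the size is an assumption.

import numpy as np

def gaussian_blur(scores, ksize, sigma=None):
    # Smooth a 1-D score vector V by convolving it with a Gaussian kernel K.
    # ksize is 11 for the valence-arousal subtask and 5 for the fear subtask;
    # sigma is not given in the paper, so it is derived from ksize here (assumed).
    if sigma is None:
        sigma = 0.3 * ((ksize - 1) * 0.5 - 1) + 0.8
    x = np.arange(ksize) - (ksize - 1) / 2.0
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    return np.convolve(scores, kernel, mode='same')   # one smoothed score per segment

smoothed = gaussian_blur(np.array([0.2, 0.8, 0.1, 0.9, 0.3]), ksize=5)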
3 RESULTS AND DISCUSSION
To evaluate the features described in Section 2.1, the features provided by the task organizers are selected as the baseline features. As required by the task, we submit five runs for each of the two subtasks. Table 1 shows the features used in these runs.

Table 1: Features used in the five runs.

   Runs     Features
   Run 1    features provided by the task organizers
   Run 2    audio and scene features
   Run 3    audio, scene and object features
   Run 4    audio, scene and action features
   Run 5    audio, scene, action and object features

   For the sake of fair comparison, the five runs utilize the same framework and differ only in the features used. Regarding the learning algorithm, SVR is employed in the valence-arousal subtask, and SVM is used in the fear subtask. The Mean Square Error (MSE) and the Pearson Correlation Coefficient (PCC) are reported for the valence-arousal subtask, and the Intersection over Union (IoU) of time intervals is the evaluation metric for the fear subtask [1]. The results are given in Table 2 and Table 3.

Table 2: Results of the valence-arousal subtask.

   Runs     Valence MSE   Valence PCC   Arousal MSE   Arousal PCC
   Run 1    0.09142       0.27518       0.14634       0.11571
   Run 2    0.09038       0.30084       0.13598       0.15546
   Run 3    0.09163       0.26326       0.14056       0.14310
   Run 4    0.09105       0.25668       0.13624       0.17486
   Run 5    0.09243       0.24679       0.13950       0.15226

Table 3: Results of the fear subtask.

   Runs     IoU of time intervals
   Run 1    0.14360
   Run 2    0.12900
   Run 3    0.13067
   Run 4    0.15750
   Run 5    0.14969

   As shown in Table 2, Run 2 obtains the best result in the valence-arousal subtask. This suggests that the combination of the audio and scene features is sufficient to predict valence-arousal values. In the fear subtask, Run 4 achieves the top performance, as shown in Table 3. This demonstrates that the combination of the audio, scene and action features is sufficient to describe fear, and that using more features does not necessarily lead to better experimental results. Comparing Run 2 and Run 3 in Table 2 and Table 3, the object feature improves the performance in the fear subtask but decreases it in the valence-arousal subtask. This may be because some objects, such as blood and guns, can cause fear. In Table 3, Run 4 obtains better performance than Run 3, which partly demonstrates that actions are more likely to cause fear than objects.

4 CONCLUSION
In this work, we propose a framework to predict the emotional impact of movies. Vectors of four features are calculated by using four pre-trained convolutional neural networks. The affective models or the fear models are learned separately by using SVR or SVM, and a Gaussian blur function is utilized to smooth the temporal scores. Experimental results show that the combination of the audio and scene features is sufficient for the valence-arousal subtask, and that the additional action feature improves the performance in the fear subtask.


REFERENCES
 [1] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, Zhongzhe Xiao, and Mats Sjöberg. 2018. The MediaEval 2018 emotional impact of movies task. In MediaEval 2018 Workshop.
 [2] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui
     Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library
     for large linear classification. Journal of Machine Learning
     Research 9 (2008), 1871–1874.
 [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
     2016. Deep residual learning for image recognition. In CVPR.
     770–778.
 [4] Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, and others. 2017. CNN architectures for large-scale audio classification. In ICASSP. 131–135.
 [5] Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation
     networks. In CVPR. 7132–7141.
 [6] Karen Simonyan and Andrew Zisserman. 2014. Two-stream
     convolutional networks for action recognition in videos. In
     NIPS. 568–576.
 [7] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon
     Shlens, and Zbigniew Wojna. 2016. Rethinking the inception
     architecture for computer vision. In CVPR. 2818–2826.
 [8] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin,
     Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment
     networks: Towards good practices for deep action recognition.
     In ECCV. 20–36.
 [9] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2018. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2018), 1452–1464.