=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_13
|storemode=property
|title=CNN Features for Emotional Impact of Movies Task
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_13.pdf
|volume=Vol-2283
|authors=Yun Yi,Hanli Wang,Qinyu Li
|dblpUrl=https://dblp.org/rec/conf/mediaeval/YiWL18
}}
==CNN Features for Emotional Impact of Movies Task==
Yun Yi (1,2), Hanli Wang (2,*), Qinyu Li (2,3)

(1) Key Laboratory of Jiangxi Province for Numerical Simulation and Emulation Techniques, Gannan Normal University, Ganzhou 341000, P. R. China
(2) Department of Computer Science and Technology, Tongji University, Shanghai 201804, P. R. China
(3) Department of Computer Science, Lanzhou City University, Lanzhou 730070, P. R. China
(*) Hanli Wang is the corresponding author, E-mail: hanliwang@tongji.edu.cn.

This work was supported in part by the National Natural Science Foundation of China under Grants 61622115 and 61472281, the Shanghai Engineering Research Center of Industrial Vision Perception & Intelligent Computing (17DZ2251600), and the IBM Shared University Research Awards Program.

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France.

ABSTRACT

A framework is proposed to predict the emotional impact of movies by using audio, action, object and scene features. First, four state-of-the-art features are extracted from four pre-trained convolutional neural networks to depict the video content, and an early fusion strategy is used to combine the vectors of these features. Then, linear support vector regression or a linear support vector machine is employed to learn the affective models or fear models separately, and cross-validation is used to select the training parameters. Finally, a Gaussian blur function is used to smooth the scores of the video segments. The experiments show that the combination of these features obtains promising results.

[Figure 1: An overview of the proposed framework. Audio, action, object and scene features are extracted from the video, fused, and passed to an SVM or SVR; the resulting scores are temporally smoothed to produce the final results.]

1 INTRODUCTION

The 2018 Emotional Impact of Movies task consists of two subtasks: valence-arousal prediction and fear prediction. A brief introduction to this challenge is given in [1]. This paper introduces the proposed framework and discusses the experimental results.

The selection of features is crucial for emotional analysis. Intuitively, audio, action, object and scene features can all influence emotions, so the vectors of four state-of-the-art features are calculated in this framework. The affective models or fear models are then learned with linear support vector regression (SVR) or a linear support vector machine (SVM) [2]. Finally, a Gaussian blur function is used to smooth the scores of the temporal segments.

2 FRAMEWORK

Figure 1 shows the key components of the proposed framework; its highlights are introduced below.

2.1 Features

To depict a video, four features are extracted separately from four pre-trained Convolutional Neural Networks (CNNs): audio, action, object and scene features.

2.1.1 Audio Feature. Audio signals carry important information about emotions. VGGish [4] is a well-known audio feature extractor, so it is used to compute the audio feature vectors. First, the audio tracks are extracted from the videos. Then, the pre-trained model provided by [4] (https://github.com/tensorflow/models/tree/master/research/audioset) is used to compute feature vectors from the audio files. In this way, the audio signals are converted into semantically meaningful high-level 128-dimensional feature vectors, so for the audio feature a video is described as a sequence of 128-dimensional vectors.

2.1.2 Action Feature. The actions in a video can influence viewers' emotions. The two-stream Convolutional Network (ConvNet) [6] is a well-known framework for video-based action recognition and consists of a spatial ConvNet and a temporal ConvNet. The temporal segment network [8] improves this framework by modeling long-range temporal structure, with Inception-v3 [7] as the architecture of both ConvNets. The pre-trained models provided by [8] are used to compute vectors from the 'top cls global pool' layer. As a result, a frame is described by two 1024-dimensional vectors, and by concatenating the two vectors of each frame, a video is depicted as a sequence of 2048-dimensional vectors.

2.1.3 Object Feature. The objects in a video may affect the emotions of the viewer. The Squeeze-and-Excitation Network (SENet) [5] is a state-of-the-art model for object classification. We use the pre-trained SENet model (https://github.com/hujie-frank/SENet) to compute vectors from the 'pool5/7×7 s1' layer, so the dimension of the object feature is 2048.

2.1.4 Scene Feature. The scenes of a video affect the emotions of the audience. The Places365 dataset is a large dataset for scene classification [9]. We use the ResNet-50 [3] model pre-trained on Places365 (https://github.com/CSAILVision/places365) to compute vectors from the 'avgpool' layer, so a frame is depicted by a 2048-dimensional vector.
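The frame-level features in Sections 2.1.2-2.1.4 are all obtained in the same way: a frame is passed through a pre-trained CNN and the activations of a late pooling layer are kept as its descriptor. The sketch below illustrates this for the scene feature, assuming PyTorch and torchvision; the ImageNet-pretrained ResNet-50 shipped with torchvision stands in for the Places365 checkpoint used here (only the weights differ), and the frame file names are hypothetical.

```python
# Minimal sketch of per-frame CNN feature extraction (cf. Section 2.1.4),
# assuming PyTorch and torchvision >= 0.13. The ImageNet-pretrained ResNet-50
# stands in for the Places365-pretrained model; only the checkpoint differs.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
# Drop the final fully connected layer so the forward pass stops at the global
# average pooling ('avgpool') layer, giving a 2048-dimensional frame descriptor.
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_feature(image_path: str) -> torch.Tensor:
    """Return the 2048-d 'avgpool' descriptor of one video frame."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return feature_extractor(img).flatten(1).squeeze(0)  # shape: (2048,)

# A video is then described as a sequence of such vectors, one per sampled frame
# (the frame file names are hypothetical):
# video_feature = torch.stack([frame_feature(f"frames/{i:05d}.jpg") for i in range(0, 250, 25)])
```

The object feature (SENet) and the two streams of the temporal segment network are extracted analogously from their respective pooling layers; for the action feature, the 1024-dimensional spatial and temporal vectors of each frame are concatenated to form the 2048-dimensional descriptor.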
2.2 Emotional Prediction

To combine the vectors of these features, we use an early fusion strategy because of its simplicity and efficiency: as shown in Figure 1, the feature vectors of each sample are directly concatenated.

For the two subtasks, linear SVR and linear SVM are used to learn the emotional models, respectively. In the fear subtask, the number of positive samples is smaller than the number of negative samples; to address this imbalance, positive and negative samples are weighted in an inverse manner. The regularization parameter C is selected by cross-validation. The LIBLINEAR toolbox (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multicore-liblinear) is used to implement the L2-regularized L2-loss SVM and SVR.

After obtaining the scores of the video segments, a Gaussian blur is applied to smooth them. Let the score vector of a video be V. The Gaussian blur is defined as

    Gaussianblur(V) = V ⊗ K,

where ⊗ is the convolution operator and K is the specified Gaussian kernel. In the experiments, the size of the Gaussian kernel is set to 11 for the valence-arousal subtask and 5 for the fear subtask.
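To make the prediction stage concrete, the sketch below trains a linear regressor with C selected by cross-validation and then smooths the per-segment scores with a Gaussian kernel of size 11, mirroring Gaussianblur(V) = V ⊗ K above. It is only an illustration under stated assumptions: scikit-learn's LinearSVR stands in for the LIBLINEAR toolbox, X_train, y_train and X_test are placeholders for the fused feature vectors and labels, and the kernel's sigma is an assumed value since only the kernel size is fixed here.

```python
# Sketch of the prediction stage (Section 2.2), assuming NumPy and scikit-learn.
# LinearSVR stands in for LIBLINEAR's L2-regularized L2-loss SVR; X_train,
# y_train and X_test below are placeholders for the fused features and labels.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVR

def train_regressor(X_train, y_train):
    """Linear SVR with the regularization parameter C chosen by cross-validation."""
    search = GridSearchCV(
        LinearSVR(loss="squared_epsilon_insensitive", max_iter=10000),
        param_grid={"C": [2.0 ** k for k in range(-10, 6)]},  # assumed search grid
        scoring="neg_mean_squared_error",
        cv=5,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_

def gaussian_smooth(scores, kernel_size=11, sigma=None):
    """Temporal smoothing of segment scores: Gaussianblur(V) = V ⊗ K."""
    if sigma is None:
        sigma = kernel_size / 6.0  # assumed; only the kernel size is specified
    half = kernel_size // 2
    x = np.arange(-half, half + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()  # normalized Gaussian kernel K
    return np.convolve(scores, kernel, mode="same")

# Usage with placeholder data:
# model = train_regressor(X_train, y_train)
# smoothed_scores = gaussian_smooth(model.predict(X_test), kernel_size=11)
```

For the fear subtask, a linear classifier such as scikit-learn's LinearSVC with class_weight="balanced" would play the analogous role, which matches the inverse weighting of positive and negative samples described above; the smoothing kernel size is 5 instead of 11.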
3 RESULTS AND DISCUSSION

To evaluate the features described in Section 2.1, the features provided by the task organizers are selected as the baseline. As required by the task, we submit five runs for each of the two subtasks; Table 1 lists the features used in each run.

Table 1: Features used in the five runs.

Runs  | Features
Run 1 | features provided by the task organizers
Run 2 | audio and scene features
Run 3 | audio, scene and object features
Run 4 | audio, scene and action features
Run 5 | audio, scene, action and object features

For the sake of fair comparison, the five runs use the same framework and differ only in the features used. Regarding the learning algorithm, SVR is employed in the valence-arousal subtask and SVM in the fear subtask. The Mean Square Error (MSE) and the Pearson Correlation Coefficient (PCC) are reported for the valence-arousal subtask, and the Intersection over Union (IoU) of time intervals is the evaluation metric for the fear subtask [1]. The results are given in Table 2 and Table 3.

Table 2: Results of the valence-arousal subtask.

Runs  | Valence MSE | Valence PCC | Arousal MSE | Arousal PCC
Run 1 | 0.09142     | 0.27518     | 0.14634     | 0.11571
Run 2 | 0.09038     | 0.30084     | 0.13598     | 0.15546
Run 3 | 0.09163     | 0.26326     | 0.14056     | 0.14310
Run 4 | 0.09105     | 0.25668     | 0.13624     | 0.17486
Run 5 | 0.09243     | 0.24679     | 0.13950     | 0.15226

Table 3: Results of the fear subtask.

Runs  | IoU of time intervals
Run 1 | 0.14360
Run 2 | 0.12900
Run 3 | 0.13067
Run 4 | 0.15750
Run 5 | 0.14969

As shown in Table 2, Run 2 obtains the best result in the valence-arousal subtask, which suggests that the combination of the audio and scene features is sufficient to predict valence-arousal values. In the fear subtask, Run 4 achieves the top performance, as shown in Table 3. This indicates that the combination of audio, scene and action features is enough to describe fear, and that using more features does not necessarily lead to better results. Comparing Run 2 and Run 3 in Table 2 and Table 3, adding the object feature improves the performance in the fear subtask but decreases it in the valence-arousal subtask. This may be because some objects, such as blood and guns, can cause fear. In Table 3, Run 4 performs better than Run 3, which partly suggests that actions are more likely to cause fear than objects.

4 CONCLUSION

In this work, we propose a framework to predict the emotional impact of movies. The vectors of four features are calculated with four pre-trained convolutional neural networks. The affective models or fear models are learned separately with SVR or SVM, and a Gaussian blur function is used to smooth the temporal scores. Experimental results show that the combination of the audio and scene features is sufficient for the valence-arousal subtask, and that adding the action feature improves the performance in the fear subtask.

REFERENCES

[1] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, Zhongzhe Xiao, and Mats Sjöberg. 2018. The MediaEval 2018 emotional impact of movies task. In MediaEval 2018 Workshop.
[2] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9 (2008), 1871–1874.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[4] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. 2017. CNN architectures for large-scale audio classification. In ICASSP. 131–135.
[5] Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In CVPR. 7132–7141.
[6] Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In NIPS. 568–576.
[7] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In CVPR. 2818–2826.
[8] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In ECCV. 20–36.
[9] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2018. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 6 (2018), 1452–1464.