=Paper=
{{Paper
|id=Vol-2283/MediaEval_18_paper_25
|storemode=property
|title=Frame-based Evaluation with Deep Features to Predict Emotional Impact of Movies
|pdfUrl=https://ceur-ws.org/Vol-2283/MediaEval_18_paper_25.pdf
|volume=Vol-2283
|authors=Khanh-An C. Quan,Vinh-Tiep Nguyen,Minh-Triet Tran
|dblpUrl=https://dblp.org/rec/conf/mediaeval/QuanNT18
}}
==Frame-based Evaluation with Deep Features to Predict Emotional Impact of Movies==
Khanh-An C. Quan1, Vinh-Tiep Nguyen1, Minh-Triet Tran2
1 University of Information Technology, Vietnam National University - Ho Chi Minh City
2 University of Science, Vietnam National University - Ho Chi Minh City
15520006@gm.uit.edu.vn, tiepnv@uit.edu.vn, tmtriet@fit.hcmus.edu.vn

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France.

ABSTRACT

In this paper, we describe our approach for the Emotional Impact of Movies Task at the MediaEval 2018 Challenge. Specifically, we employ features extracted with ResNet-50 from image frames. A fully connected neural network is then used to learn the prediction models, and a sliding-window technique is applied to post-process the results. The experimental results show the effectiveness of our approach.

1 INTRODUCTION

Analysing the emotional impact of a video clip on viewers can be used to enhance or control the psychological effects of media on people [2, 7], to boost user engagement with media content [6], or to generate personalized media content [5].

The MediaEval 2018 Emotional Impact of Movies Task consists of two subtasks. The first subtask is to predict the scores of induced valence and induced arousal for every second of a movie. The other is fear prediction, which we have not worked on. Both subtasks are evaluated by Mean Squared Error (MSE) and Pearson's Correlation Coefficient. The dataset used for both is the LIRIS-ACCEDE dataset [1]. Full details of the challenge tasks and the database can be found in [3].

There are various sources of information that can be exploited to predict the emotional impact of a movie clip. Although visual content is an essential source for inferring viewers' emotion, audio and text are also potential components for this task. Frame-based and sequence-based approaches can be applied to analyse video frames to evaluate emotional impact.

In our method, we follow the frame-based approach to predict video emotional impact. From the training dataset, we extract deep features of each frame and train two models to predict the valence and arousal properties of a video frame. We then apply the two trained models to evaluate each frame in the test set independently. Finally, we employ a sliding-window technique to smooth the final results.

2 APPROACH

In this section, we describe how we approach the valence-arousal prediction subtask. The proposed method contains four stages: frame extraction, feature extraction, prediction models, and post-processing. The system pipeline is shown in Figure 1.

Figure 1: Overview of the frame-based prediction models.

2.1 Frame extraction

First, we extract one frame per second from all movies in the training and test sets. For frame extraction, we use the ffmpeg framework and the extraction command provided by the organizers.

2.2 Feature extraction

For feature extraction, we use a 50-layer Residual Network (ResNet-50) [4] pre-trained on ImageNet. ResNet-50 is used as a feature extractor, and a 2048-dimensional feature vector is extracted from each frame of the movies. In our experiments, we use the Keras ResNet-50 model pre-trained on the ImageNet dataset and take the feature vector from the global average pooling applied to the output of the last convolutional layer.
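As an illustration of stages 2.1 and 2.2, the sketch below extracts one frame per second with ffmpeg and computes 2048-dimensional pooled ResNet-50 features with Keras. It is a minimal reconstruction under stated assumptions: the file paths, the helper names, and the exact ffmpeg invocation are ours, not the organizers' extraction command.

```python
import glob
import subprocess

import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

# Stage 2.1: one frame per second per movie (illustrative ffmpeg call;
# the organizers' own extraction command may differ).
def extract_frames(movie_path, out_dir):
    subprocess.run(
        ["ffmpeg", "-i", movie_path, "-vf", "fps=1",
         f"{out_dir}/frame_%05d.jpg"],
        check=True)

# Stage 2.2: ResNet-50 pre-trained on ImageNet; global average pooling
# over the last convolutional output gives a 2048-dim vector per frame.
feature_extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(frame_dir):
    frames = sorted(glob.glob(f"{frame_dir}/frame_*.jpg"))
    batch = np.stack([
        preprocess_input(
            image.img_to_array(image.load_img(f, target_size=(224, 224))))
        for f in frames])
    return feature_extractor.predict(batch)   # shape: (num_frames, 2048)
```

Using the `pooling="avg"` option reproduces the global-average-pooled 2048-dimensional descriptor described above without adding any classification head.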
2.3 Prediction models

We split the training set provided by the organizers into training and validation sets with a ratio of 80:20. An overview of the prediction models is shown in Figure 2.

Figure 2: Valence/arousal prediction models.

We employ a 2-layer fully connected neural network to learn the emotion models. The models take the 2048-dimensional feature vectors extracted by ResNet-50 as input. We experimented with 128, 256, and 512 nodes in the first and second layers, and with 10, 15, and 20 training epochs. We use Root Mean Square Propagation (RMSProp) with a learning rate of 10^-4. All prediction models are trained separately for valence and arousal.

2.4 Post-processing

After obtaining the valence/arousal predictions, we apply an averaging sliding-window technique to smooth out random noise. We tested window sizes of 3, 5, and 7.
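The following sketch shows one way to realise stages 2.3 and 2.4 in Keras under our assumptions: a 2-layer fully connected regressor on the 2048-dimensional features trained with RMSProp (learning rate 10^-4), a centred moving-average smoothing of the per-second predictions, and the two official metrics used to rank models on the validation split. The layer sizes, epochs, loss, and helper names are illustrative choices drawn from the ranges reported above, not the exact submitted configuration.

```python
import numpy as np
from tensorflow.keras import layers, models, optimizers

# Stage 2.3: 2-layer fully connected regressor on 2048-dim ResNet-50
# features; 128/512 nodes is one configuration from the explored grid
# (128, 256, 512 nodes; 10, 15, 20 epochs). Valence and arousal each
# get a separate model of this form.
def build_model(n1=128, n2=512):
    model = models.Sequential([
        layers.Dense(n1, activation="relu", input_shape=(2048,)),
        layers.Dense(n2, activation="relu"),
        layers.Dense(1),                      # predicted valence or arousal
    ])
    model.compile(optimizer=optimizers.RMSprop(learning_rate=1e-4),
                  loss="mse")                 # assumed training loss
    return model

# Stage 2.4: centred moving-average smoothing of per-second predictions
# (window sizes 3, 5, 7 were tested; the submitted runs used 7).
def sliding_window_smooth(pred, window=7):
    pad = window // 2
    padded = np.pad(pred, pad, mode="edge")
    kernel = np.ones(window) / window
    return np.convolve(padded, kernel, mode="valid")

def evaluate(y_true, y_pred):
    """MSE and Pearson's r, the two official metrics of the subtask."""
    mse = float(np.mean((y_true - y_pred) ** 2))
    r = float(np.corrcoef(y_true, y_pred)[0, 1])
    return mse, r

# Example usage (x_train, y_train, x_val, y_val are assumed to come
# from the 80:20 split of the provided training set):
# model = build_model()
# model.fit(x_train, y_train, epochs=20, batch_size=64,
#           validation_data=(x_val, y_val))
# smoothed = sliding_window_smooth(model.predict(x_val).ravel())
# print(evaluate(y_val, smoothed))
```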
3 RESULTS AND ANALYSIS

In this section, we describe the experimental setup, the five runs we submitted for the valence-arousal subtask, and the results.

3.1 Experimental setup

The experiments were run on Google Compute Engine with 2 vCPUs, 7.5 GB RAM, and an Nvidia Tesla K80 GPU. Extracting the 93,406 frames of the training set took about 1 hour, extracting features with ResNet-50 took about 40 minutes, and training each model took about 3 minutes.

3.2 Submitted runs

We tested all trained models on the validation set, ranked them by mean squared error, and selected the Top-1 to Top-4 models to submit. The details of each run are listed below. All runs take ResNet-50 features as input. From Run 2 to Run 5, the predictions are post-processed with the sliding-window technique with window size 7.

• Run 1: For both valence and arousal, a 2-layer fully connected network with 128 nodes in the first layer and 512 nodes in the second layer, trained for 20 epochs.
• Run 2: The same models as Run 1, with sliding-window smoothing (window size 7) applied to reduce random noise.
• Run 3: For valence, a 2-layer fully connected network with 256 nodes in the first layer and 512 nodes in the second layer, trained for 10 epochs. For arousal, 512 nodes in the first layer and 512 nodes in the second layer, trained for 15 epochs.
• Run 4: For valence, 256 nodes in the first layer and 512 nodes in the second layer, trained for 15 epochs. For arousal, 128 nodes in the first layer and 512 nodes in the second layer, trained for 10 epochs.
• Run 5: For valence, 512 nodes in the first layer and 512 nodes in the second layer, trained for 10 epochs. For arousal, 512 nodes in the first layer and 512 nodes in the second layer, trained for 10 epochs.

3.3 Results and Analysis

Table 1: Results of the valence-arousal subtask.

Run     Valence MSE   Valence r   Arousal MSE   Arousal r
Run 1   0.11936       0.10665     0.17448       0.05282
Run 2   0.11504       0.14565     0.17055       0.07525
Run 3   0.11943       0.14513     0.17443       0.06978
Run 4   0.11731       0.14097     0.17901       0.01877
Run 5   0.11526       0.14306     0.17282       0.09123

As shown in Table 1, Run 2 obtains the best result for the valence-arousal subtask, although the differences between runs are small. Comparing Run 1 with Run 2, applying the sliding-window technique improves the results. As shown in Figure 3, there is a clear similarity between training-set and test-set frames with respect to valence.

Figure 3: Examples of frame similarity between the training set and the test set on valence.

4 CONCLUSION

We propose a simple method to evaluate the emotional impact, i.e. the valence and arousal properties, of a video frame. We study several settings of classification modules with 1 to 2 fully connected layers and different numbers of nodes in each layer to select an appropriate model for each property. Experimental results demonstrate that, although our method is simple, it achieves promising results for this task. This is an initial step towards better methods that exploit the temporal information of frame sequences and other media types, such as audio and text.

ACKNOWLEDGMENTS

We would like to express our appreciation to the Multimedia Communications Laboratory, University of Information Technology, VNU-HCM, Vietnam, and the Software Engineering Laboratory, University of Science, VNU-HCM, Vietnam.

REFERENCES

[1] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. 2015. LIRIS-ACCEDE: A Video Database for Affective Content Analysis. IEEE Transactions on Affective Computing 6, 1 (Jan.-March 2015), 43–55.
[2] L. Canini, S. Benini, and R. Leonardi. 2013. Affective Recommendation of Movies Based on Selected Connotative Features. IEEE Transactions on Circuits and Systems for Video Technology 23, 4 (April 2013), 636–647. https://doi.org/10.1109/TCSVT.2012.2211935
[3] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, Zhongzhe Xiao, and Mats Sjöberg. 2018. The MediaEval 2018 Emotional Impact of Movies Task. In MediaEval 2018 Workshop.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. IEEE Computer Society, 770–778.
[5] Rajiv Ratn Shah, Yi Yu, and Roger Zimmermann. 2014. ADVISOR: Personalized Video Soundtrack Recommendation by Late Fusion with Heuristic Rankings. In Proceedings of the 22nd ACM International Conference on Multimedia (MM '14). ACM, New York, NY, USA, 607–616. https://doi.org/10.1145/2647868.2654919
[6] K. Yadati, H. Katti, and M. Kankanhalli. 2014. CAVVA: Computational Affective Video-in-Video Advertising. IEEE Transactions on Multimedia 16, 1 (Jan 2014), 15–23. https://doi.org/10.1109/TMM.2013.2282128
[7] Sicheng Zhao, Hongxun Yao, Xiaoshuai Sun, Xiaolei Jiang, and Pengfei Xu. 2013. Flexible Presentation of Videos Based on Affective Content Analysis. In Advances in Multimedia Modeling. Springer Berlin Heidelberg, Berlin, Heidelberg, 368–379.