=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_18
|storemode=property
|title=TCNJ-CS@MediaEval 2017 Emotional Impact of Movie Task
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_18.pdf
|volume=Vol-1984
|authors=Sejong Yoon
|dblpUrl=https://dblp.org/rec/conf/mediaeval/Yoon17a
}}
==TCNJ-CS@MediaEval 2017 Emotional Impact of Movie Task==
TCNJ-CS @ MediaEval 2017 Emotional Impact of Movie Task
Sejong Yoon, The College of New Jersey, U.S.A. (yoons@tcnj.edu)

ABSTRACT

This paper presents our approaches for the MediaEval Emotional Impact of Movies Task. We employed features from image frames and the audio signal, and used support vector regression for learning and prediction. In addition, we introduce a new feature using exponential decay of the initially predicted emotion labels. The motivation behind this is to computationally model the lingering effect of emotions. Experimental results and future directions are also discussed.

1 INTRODUCTION

The MediaEval 2017 Emotional Impact of Movies Task [2] consists of two subtasks. The first is valence/arousal prediction, which predicts the expected level of two emotional states, valence and arousal, for each consecutive ten-second segment. Both valence (most negative to most positive) and arousal (least active to most active) are defined on a continuous scale within the range [−1, 1]. The second is fear prediction, which makes a binary prediction for each ten-second segment of whether it is likely to induce fear. Fear is defined as a binary integer {0, 1}, where 1 indicates that the segment will induce fear. In the following, we describe the method used in our prediction system.
2 APPROACH

First, we describe the multimodal features we employed. Next, we introduce our new feature, based on prediction labels, to model the lingering effect of induced emotions. Lastly, we describe our hierarchical regression framework for emotion prediction.

2.1 Visual and Audio Features

We employed the standard set of visual and audio features provided by the MediaEval task organizers. For image frame-based features, we used Auto Color Correlogram, Color and Edge Directivity Descriptor (CEDD), Color Layout, Edge Histogram, Fuzzy Color and Texture Histogram (FCTH), Gabor, the joint descriptor combining CEDD and FCTH in one histogram, Scalable Color, Tamura, Local Binary Patterns, and the fc6 layer feature of the VGG16 network [5]. All features were extracted frame-by-frame, with one frame extracted per second. The features, except VGG16, were computed using the LIRE library; VGG16 features were extracted using the MATLAB Neural Network toolbox.

For auditory features, we employed the audio features provided. According to the provided description, there should be 1,582 features: a base of 34 low-level descriptors with 34 corresponding differential coefficients, and 21 functionals applied to each of these 68 contours, yields 1,428 features (21 × (34 + 34)). Of the remaining 154 features, 152 were computed by applying 19 additional functionals to the 4 pitch-based low-level descriptors and their 4 differential coefficient contours (19 × (4 + 4)). The last two features are additional statistical features: the number of pitch onsets and the total duration of the input. These features were computed over ten-second segments sliding over the whole movie with a shift of 5 seconds. All audio features were computed with the openSMILE toolbox [3].
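The audio feature count above can be verified with a quick arithmetic sanity check (a trivial sketch; the variable names are ours, the counts come from the task's feature description):

```python
# Sanity check of the openSMILE audio feature count described above.
base_llds = 34          # low-level descriptors
delta_llds = 34         # their differential coefficients
functionals = 21        # functionals applied to each contour

pitch_llds = 4          # pitch-based low-level descriptors
pitch_deltas = 4        # their differential coefficient contours
extra_functionals = 19  # additional functionals for pitch contours

statistical = 2         # number of pitch onsets, total duration

contour_features = functionals * (base_llds + delta_llds)         # 1,428
pitch_features = extra_functionals * (pitch_llds + pitch_deltas)  # 152
total = contour_features + pitch_features + statistical
print(total)  # → 1582
```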
2.2 Lingering Feature

In addition to the provided features, we introduced an additional feature using the ground truth labels of emotional levels. The motivation behind this new feature is to computationally model the gradually amplifying or decaying emotional flow, typically referred to as lingering emotion. Traditional and even state-of-the-art affect prediction systems focus on predicting induced emotions as a spike-noise detection model, regardless of whether they model the temporal aspect of affect. Lingering emotions, on the other hand, are not directly induced by the stimuli; rather, they are generated from the emotional change, i.e., the response, already existing a priori. In short, we argue that what is called the climax of a movie is not only a consequence of short segment stimuli, but is also amplified (or degraded) by the emotional state change itself across segments. A similar idea was utilized to predict media interestingness [6], but with no explicit consideration of the amplifying/decaying effect. Here, we consider the change of emotion directly.

To model this lingering emotion, we use the emotion level label values. For each segment t = 1..T, where T denotes the total number of segments, assume that we are given the emotion level label y_t. So, at each time segment t, we have

x_1, x_2, ..., x_t,   (1)
y_1, y_2, ..., y_t,   (2)

where x_t denotes the vectorized visual/audio features and y_t denotes either the ground truth or the predicted emotion level (it could be a valence, arousal, or fear label). Then, we can define the lingering feature l_t as an exponential decay function of the labels:

l_{t-w} = y_{t-w},   (3)
l_{t-w+1} = (1 − α) · l_{t-w} + α · y_{t-w+1},   (4)
...
l_s = (1 − α) · l_{s-1} + α · y_s,   (5)

where s = (t − w), ..., t, and w denotes the lingering window size. The parameter α is the decay factor. Intuitively, we take a weighted accumulation of emotions over time and use it to model the lingering effect. In the training phase, we can utilize the ground truth emotion labels. In the testing phase, we devise a two-step, hierarchical regression model to obtain the emotion level feature values; we describe this model in the next section.

There are several reasons why we think this is a reasonable model of the lingering effect. First, with the exponential decay function, we can account for both the smoothness and the decay of emotional change over time. Second, one can view this as a simplified version of traditional temporal models, e.g., Hidden Markov Models (HMMs), where we fix the transition probability. If a large number of emotion labels can be obtained, one may try to learn HMM-based features instead, as in [6].

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland
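The recursion in Eqs. (3)-(5) amounts to an exponential moving average over a window of labels. A minimal sketch, assuming the labels are given as a plain list (the function and argument names are ours, not from the paper):

```python
def lingering_feature(y, t, w, alpha):
    """Exponentially decayed accumulation of the emotion labels
    y[t-w..t] (Eqs. 3-5): l starts at y[t-w], then each later label
    is blended in with weight alpha."""
    l = y[t - w]                       # Eq. (3): initialize at window start
    for s in range(t - w + 1, t + 1):  # Eqs. (4)-(5): recursive update
        l = (1 - alpha) * l + alpha * y[s]
    return l

# A constant label sequence is left unchanged by the recursion.
print(lingering_feature([0.2] * 6, t=5, w=4, alpha=0.5))  # → 0.2
```

Older labels are discounted geometrically by (1 − α) per step, which is what gives the feature its gradual amplifying/decaying behavior.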
2.3 Hierarchical Regression Framework

To combine features, we utilized a standard multiple kernel learning approach [1, 4]. We first compute a kernel for each feature, and build a combined kernel using either addition or multiplication. We use multiplication within the same modality, e.g., combining the Color Correlogram kernel and the Edge Histogram kernel, and addition between different modalities, i.e., combining the combined visual kernel and the audio kernel. The lingering feature is treated as a modality of its own, separate from the visual and auditory ones. In summary, our combined kernel was computed as

K_vis = K_acc · K_cedd · ... · K_fc6,   (6)
K_all = K_vis + K_aud + K_lin,   (7)

where each K denotes the kernel computed using the corresponding feature. We used the Radial Basis Function (RBF) kernel, with the median of the training data as the hyperparameter.
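The combination rule in Eqs. (6)-(7) can be sketched as follows. This is a simplified illustration with made-up feature matrices; reading "median of the training data" as the median-pairwise-distance bandwidth heuristic is our assumption:

```python
import numpy as np

def rbf_kernel(X):
    """RBF kernel with the median pairwise distance as bandwidth
    (median heuristic -- our assumption for the hyperparameter)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sigma = np.median(np.sqrt(d2[d2 > 0]))
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n = 8
visual_feats = [rng.normal(size=(n, 5)) for _ in range(3)]  # stand-ins for ACC, CEDD, ..., fc6
audio_feats = rng.normal(size=(n, 4))
linger_feats = rng.normal(size=(n, 1))

# Eq. (6): multiply kernels within the visual modality.
K_vis = np.ones((n, n))
for F in visual_feats:
    K_vis *= rbf_kernel(F)

# Eq. (7): add kernels across modalities.
K_all = K_vis + rbf_kernel(audio_feats) + rbf_kernel(linger_feats)
print(K_all.shape)  # → (8, 8)
```

Multiplying kernels within a modality acts like an AND over feature similarities, while adding across modalities lets any one modality contribute similarity on its own.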
Once the combined kernel is computed, we can use its rows as feature vectors. For the regression model, we used linear Support Vector Regression (SVR), via MATLAB's fitrsvm function. One important aspect of our approach is that we use emotion prediction labels to compute the lingering features. Since we do not have ground truth labels for the Testset, we designed a two-step, hierarchical regression framework. In this framework, two SVR models are trained in the training phase. One model (Model A) is trained with a kernel computed on the training data that combines only the visual and auditory features. The other model (Model B) is trained with the kernel computed using all modalities. In the testing phase, we first perform an initial emotion prediction on the test data using Model A. Then, we compute the lingering feature using the predicted affect labels. Note that this is computationally inexpensive, since the lingering feature itself is easy to compute and the labels of the training data form only a 1-dimensional vector. Finally, we perform the final emotion prediction on the test data using Model B. We applied this regression framework to all subtasks. For the fear subtask, we rescaled the output into the [0, 1] range and thresholded it at 0.75.
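The two-step test-time procedure can be outlined as follows. This is a schematic Python sketch rather than the paper's MATLAB pipeline; `model_a`, `model_b`, and the toy stand-ins at the bottom are placeholders for the trained kernel SVR models described above:

```python
def two_step_predict(model_a, model_b, test_feats, w=4, alpha=0.5,
                     fear=False, threshold=0.75):
    """Hierarchical prediction: Model A gives initial labels, which feed
    the lingering feature; Model B then makes the final prediction."""
    # Step 1: initial prediction from visual + audio features only.
    y0 = [model_a(x) for x in test_feats]

    # Lingering feature per segment via the exponential-decay recursion.
    lingering = []
    for t in range(len(y0)):
        start = max(t - w, 0)
        l = y0[start]
        for s in range(start + 1, t + 1):
            l = (1 - alpha) * l + alpha * y0[s]
        lingering.append(l)

    # Step 2: final prediction with the lingering feature added.
    y = [model_b(x, l) for x, l in zip(test_feats, lingering)]

    if fear:  # rescale to [0, 1] and threshold at 0.75
        lo, hi = min(y), max(y)
        y = [int((v - lo) / (hi - lo + 1e-12) >= threshold) for v in y]
    return y

# Toy stand-ins for the two trained SVR models (purely illustrative).
model_a = lambda x: 0.5 * x
model_b = lambda x, l: 0.4 * x + 0.2 * l
print(two_step_predict(model_a, model_b, [0.0, 1.0, 2.0, 3.0]))
```

The design avoids needing test-set ground truth: the lingering feature at test time is built entirely from Model A's own predictions.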
3 RESULTS AND ANALYSIS

For the measures, we used Mean Squared Error (MSE) and Pearson's correlation coefficient (ρ) for the valence and arousal subtasks, and accuracy, precision, recall, and F1 score for the fear subtask. We used α = 0.5 for all experiments.

Table 1: Result of All Subtasks in Devset

Subtask  Measure   w/o Linger  w/ Linger
Valence  MSE       0.13106     0.09893
         ρ         0.12826     0.20826
Arousal  MSE       0.08817     0.08415
         ρ         0.08390     0.09001
Fear     Accuracy  0.96581     0.96543

On the Devset, shown in Table 1, one can see no significant benefit from the lingering feature in this case. We used a 50-50 split to obtain these results, but readers should take them with a grain of salt (particularly the fear accuracy), since the split was not made strictly on the basis of which frames belong to which video.

Table 2: Result of All Subtasks in Testset

Subtask  Measure    w/o Linger (run 1)  w/ Linger (run 2)
Valence  MSE        0.20276             0.04640
         ρ          0.19748             0.00583
Arousal  MSE        0.12304             0.11335
         ρ          0.11340             0.21485
Fear     Accuracy   0.77862             0.72956
         Precision  0.22474             0.25530
         Recall     0.09922             0.19224
         F1         0.10113             0.17399

On the Testset, shown in Table 2, the official results are more interesting. It is clear that the lingering feature does not help (and in fact hinders) valence prediction. On the other hand, for arousal and fear, the lingering feature appears to make a positive contribution to the prediction, although the overall MSE and accuracy are sacrificed a little. A similar tendency can be observed on the Devset in Table 1. One intuitive explanation is the following: what we model with the lingering feature is how the prior, recent emotional change might affect or induce the new emotion. In the case of arousal (being active or passive toward the stimuli) or fear (feeling horror or anxiety, or not), this happens often. Valence (positive or negative), on the other hand, is rather difficult to capture with a fixed window size of the lingering feature. Moreover, a change from a positive to a negative emotional state, or vice versa, requires more contextual (or semantic) information about the stimuli to understand why that change happened.

4 DISCUSSION AND OUTLOOK

In this paper, we introduced a new feature modeling the lingering effect and presented a hierarchical regression framework for emotion prediction. We found promising applications of the new feature in arousal and fear prediction, with limitations in valence prediction. In the future, it would be interesting to investigate how to capture the lingering effect more robustly, with an in-depth understanding of the feature's impact on valence prediction.

ACKNOWLEDGMENTS

This work was supported in part by The College of New Jersey under a Support Of Scholarly Activity (SOSA) 2017-2019 grant.

REFERENCES

[1] Francis R. Bach, Gert R. G. Lanckriet, and Michael I. Jordan. 2004. Multiple Kernel Learning, Conic Duality, and the SMO Algorithm. In Proceedings of the Twenty-first International Conference on Machine Learning (ICML '04). ACM, New York, NY, USA. https://doi.org/10.1145/1015330.1015424
[2] Emmanuel Dellandréa, Martijn Huigsloot, Liming Chen, Yoann Baveye, and Mats Sjöberg. 2017. The MediaEval 2017 Emotional Impact of Movies Task. In MediaEval 2017 Workshop.
[3] Florian Eyben, Felix Weninger, Florian Gross, and Björn Schuller. 2013. Recent Developments in openSMILE, the Munich Open-source Multimedia Feature Extractor. In Proceedings of the 21st ACM International Conference on Multimedia (MM '13). ACM, New York, NY, USA, 835-838. https://doi.org/10.1145/2502081.2502224
[4] Mehmet Gönen and Ethem Alpaydin. 2011. Multiple Kernel Learning Algorithms. Journal of Machine Learning Research (JMLR) 12 (July 2011), 2211-2268.
[5] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[6] Sejong Yoon and Vladimir Pavlovic. 2014. Sentiment Flow for Video Interestingness Prediction. In Proceedings of the 1st ACM International Workshop on Human Centered Event Understanding from Multimedia (HuEvent '14). ACM, New York, NY, USA, 29-34. https://doi.org/10.1145/2660505.2660513