=Paper=
{{Paper
|id=Vol-1739/MediaEval_2016_paper_35
|storemode=property
|title=AUTH-SGP in MediaEval 2016 Emotional Impact of Movies Task
|pdfUrl=https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_35.pdf
|volume=Vol-1739
|dblpUrl=https://dblp.org/rec/conf/mediaeval/TimoleonH16
}}
==AUTH-SGP in MediaEval 2016 Emotional Impact of Movies Task==
Anastasia Timoleon and Leontios Hadjileontiadis
School of Electrical & Computer Engineering, Aristotle University of Thessaloniki, Greece
{timoanas,leontios}@auth.gr

MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands. © 2016 Copyright held by the owner/author(s).

ABSTRACT

This paper presents all the aspects expected for the MediaEval Workshop. The tested and adopted solutions are described, and the interest of using one set of features versus another is discussed. The conclusions follow state-of-the-art findings and bring new inputs to the understanding of emotion prediction.

1. INTRODUCTION

In recent years, video has become the main medium through which many people interact with each other and share information. There is therefore a growing need to evaluate the quality of this interaction in terms of emotions, and not only to analyze the video content. To serve this purpose, video affective content analysis has gained interest among researchers [12]. Many audiovisual video features can be useful for depicting emotion. For example, imagine a film whose background is full of warm colors; this can induce positive emotions in the viewers, namely emotions with high valence values. Motion is another important film element that can control a video's emotion: films with large motion intensity can cause stronger emotions, i.e., higher arousal scores. This task aims exactly at predicting the emotional feedback of users while they watch films of different genres [6].

2. SYSTEM DESCRIPTION

2.1 Feature Extraction

The key points of our system can be summarized as follows. First, we extract multi-modal features that can successfully represent emotion. These can be either local features, computed from specific patches of the video frames or from overlapping time windows of the sound signal, or global features computed directly from the entire image [9]. In the first case, a feature encoding technique must be applied in order to convert the local features into global ones; we examined the Bag-of-Words and Fisher Vector approaches [9]. Finally, the extracted features are regressed and/or combined in order to predict the emotion scores.

2.1.1 Development-data features

These features were provided by the organizers of the task. A great variety of features was given, including diverse features from the audio signals of the movies, features regarding the scene cuts, and more. These features were used almost directly; the only preprocessing step was normalization, i.e., subtracting the mean value and dividing by the standard deviation of each column.

2.1.2 Improved Dense Trajectories (IDT)

These features provide information about the motion in the videos and are calculated at different spatial and temporal scales [11]. They are extensively used to classify human actions. We resized the original videos to 320x240. Then, several descriptors were calculated for each trajectory (length of 15 frames), including the Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF), and Motion Boundary Histograms along the x and y axes (MBHx and MBHy). The total descriptor dimension for each trajectory is 426 (30+96+108+96+96) [1].

For the conversion of the local features into global ones, the Fisher Vector approach was used. A Gaussian Mixture Model (GMM) was employed to construct a codebook with k words for each descriptor (k = 64). A total of 2,500,000 points were sampled from the descriptors of the development-train set to train the GMM. The features of each descriptor are first individually projected via PCA to half of their dimensions, resulting in 213 dimensions per trajectory, and then encoded with the Fisher Kernel method. The power and L2-normalization schemes were applied to each descriptor and to the resulting vectors, which can further improve the performance of the system. Finally, an entire video is described by a vector of 27,264 features (= 2 [mean value and standard deviation of the Gaussian model] * 213 [features] * 64 [codebook size]).
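The paper does not include code, but the Fisher Vector pipeline described above (PCA to half the dimensions, a 64-word diagonal GMM codebook, first- and second-order statistics, then power and L2 normalization) can be sketched roughly as follows. This is a minimal illustration assuming NumPy and scikit-learn; the function names (train_codebook, fisher_vector) are chosen for clarity and are not taken from the authors' implementation.

```python
# Minimal Fisher Vector encoding sketch (NumPy + scikit-learn).
# Illustrative re-implementation, not the authors' original code; descriptor
# extraction and the PCA step are assumed to have been done beforehand.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_codebook(sampled_descriptors, n_words=64):
    """Fit a diagonal-covariance GMM codebook, as described for the IDT features."""
    gmm = GaussianMixture(n_components=n_words, covariance_type="diag", max_iter=100)
    gmm.fit(sampled_descriptors)          # e.g. the 2.5M sampled, PCA-reduced descriptors
    return gmm

def fisher_vector(descriptors, gmm):
    """Encode a set of local descriptors (T x D) into a 2*D*K Fisher Vector."""
    T, D = descriptors.shape
    gamma = gmm.predict_proba(descriptors)               # T x K responsibilities
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    sigma = np.sqrt(var)                                 # K x D

    # Gradients w.r.t. the GMM means and standard deviations
    diff = (descriptors[:, None, :] - mu[None, :, :]) / sigma[None, :, :]   # T x K x D
    g_mu = (gamma[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(w)[:, None])
    g_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * w)[:, None])

    fv = np.hstack([g_mu.ravel(), g_sigma.ravel()])      # length 2*D*K (2*213*64 = 27264)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))               # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)             # L2 normalization
```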
2.1.3 Deep Learning Features

Deep learning is a modern sub-field of computer vision and machine learning which uses artificial neural networks, combined with the principles of convolution on images, to describe pictures with more abstract, high-level features. We used the well-known BVLC Caffe deep learning framework and treated the BVLC Reference CaffeNet pre-trained model as a feature extractor [8]. This network contains 5 convolutional layers, 2 fully connected layers and a soft-max classifier. We extracted features from the last fully connected layer, which outputs 4096 activations.

The input frames were the keyframes of the 10-second videos, of size 256x256 [5]. Instead of averaging the results over the 10 random crops that the network produces for each image, the 4096 output activations of each of the 10 crops were kept, resulting in a 10x4096 feature representation per video. Then, the classic Bag-of-Words concept was used to encode these features. The size of the codebook was 8, and the BOWKMeansTrainer class from OpenCV [7] was used to find the clusters. Each video was finally represented by an 8-bin normalized histogram of the frequency of appearance of each codeword. These features were added to the development-data features to explore whether the performance actually improves with their presence.
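A rough sketch of this 8-word Bag-of-Words encoding of the per-crop CaffeNet activations is shown below, assuming NumPy and the OpenCV Python bindings. The Caffe feature extraction itself is assumed to have produced a (10 x 4096) array per video; the nearest-codeword assignment is a plain Euclidean-distance step and not necessarily how the authors quantized their features.

```python
# Bag-of-Words sketch for the CaffeNet crop activations (illustrative, not the authors' code).
import numpy as np
import cv2

def build_vocabulary(all_crop_features, codebook_size=8):
    """Cluster 4096-d crop activations into a small codebook with OpenCV's k-means trainer."""
    trainer = cv2.BOWKMeansTrainer(codebook_size)
    trainer.add(np.float32(all_crop_features))   # stacked crops from the training videos
    return trainer.cluster()                     # (8 x 4096) vocabulary

def encode_video(crop_features, vocabulary):
    """Represent one video (10 x 4096 crop activations) as an 8-bin normalized histogram."""
    # Assign every crop to its nearest codeword (Euclidean distance)
    dists = np.linalg.norm(crop_features[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(np.float64)
    return hist / hist.sum()                     # normalized frequency histogram
```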
2.1.4 Dense SIFT Features

The SIFT descriptor was used on the rescaled videos. A common approach when dealing with videos is to compute SIFT features densely over pixel neighborhoods, with a specific stride (counted in pixels) and a specific frame step. In our approach the neighborhood size is 10x10, and a new SIFT descriptor is calculated every 5 pixels and every 5 frames [10]. After the extraction of the dense SIFT features, PCA is applied to reduce the dimension of the descriptor from 128 to 64. Finally, Fisher Vector encoding is applied, in a similar manner to the IDT approach.

2.1.5 Hue Saturation Histogram (HSH)

As mentioned above, different colors can evoke different kinds of emotions. We converted the frames from RGB to the Hue-Saturation-Value (HSV) color space and then computed a two-dimensional histogram over the hue and saturation channels only. The number of hue bins was 15, while the number of saturation bins was 16. An HSH was calculated every 5 frames, exactly like the dense SIFT descriptor. Finally, the PCA and Fisher Vector steps were applied.

2.1.6 Audio Features

We used the Mel Frequency Cepstral Coefficients (MFCC) as the representative audio feature [3]. Each video is described by three different types of MFCCs. The first type is the short-term descriptor, where the input audio signal is divided into overlapping windows of 32 ms (with 50% overlap) and a cepstral representation is computed for each window. The other two types are the mean and standard deviation of these short-term features, resulting in a 39-dimensional (3*13) vector. Finally, PCA dimension reduction and Fisher Vector encoding were employed.

2.2 Regression

As far as regression is concerned, Support Vector Regression (SVR) [4] is employed in this project. For each task, a grid-search cross-validation scheme was used to determine the best hyper-parameters C and γ and the type of kernel for each model. We investigated radial basis function and linear kernels, while C and γ ranged over [0.01, 10] and [0.001, 1] respectively. The objective function to be maximized was the Pearson correlation coefficient between the predicted and the real output values. The cross-validation scheme we followed was simple k-fold validation with k=5. The distribution of movie genres across the train and validation sets was not taken into account, although it is a good direction for future work.
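Since SVR [4] comes from scikit-learn, the grid-search setup above can be sketched as follows. The parameter ranges are the ones stated in the paper, but the specific grid values, the scorer construction and the function name fit_emotion_regressor are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch of the grid-search SVR setup of Section 2.2 (assumed scikit-learn workflow).
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def pearson_score(y_true, y_pred):
    """Pearson correlation coefficient, the score maximized during model selection."""
    return pearsonr(y_true, y_pred)[0]

param_grid = {
    "kernel": ["rbf", "linear"],
    "C": np.logspace(-2, 1, 4),        # [0.01, 0.1, 1, 10]
    "gamma": np.logspace(-3, 0, 4),    # [0.001, 0.01, 0.1, 1]
}

def fit_emotion_regressor(X_train, y_train):
    """5-fold grid search over C, gamma and kernel; one model per target (valence or arousal)."""
    search = GridSearchCV(SVR(), param_grid, scoring=make_scorer(pearson_score), cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_
```

One such regressor would be fitted per target variable, i.e., one model for valence and one for arousal.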
3. RESULTS AND DISCUSSION

1st sub-task. We submitted a total of 5 runs for the first sub-task only. The first run used only the already-extracted features from the development data. The second run combined the above features with the deep-learning ones; the features were concatenated horizontally and then regressed. The third run included only the features from the improved dense trajectories. The fourth run contained only the HSH, MFCC and dense SIFT features as well as the IDT features. The fifth run mixed the features of the two previous runs; due to the large size of the feature space, a linear late-fusion strategy was implemented for this last run, and the scores of the two regressors were combined linearly [2].

Table 1 displays the name of each run, whether it was an external or a required run, and the Pearson correlation coefficient for the valence and arousal models separately, on both the development test-set and the release test-set. Some cells of the table do not provide scores for the release test-set because these runs were executed after the corresponding deadline. It should be pointed out that some videos had too little movement and no IDT features could be extracted; so, the models of Runs 3, 4 and 5 were trained, validated and evaluated on a slightly smaller set of videos (9786 instead of the total 9800 movie segments).

Table 1: AUTH-SGP results, Pearson correlation coefficient on the development test-set (Dev-Test) and the release test-set (Rel-Test).

Run          Arousal Dev-Test   Arousal Rel-Test   Valence Dev-Test   Valence Rel-Test
Run1              0.308              0.247              0.264              0.076
Run2 (ext)        0.303              0.265              0.290              0.11
Run3              0.264                -                0.192                -
Run4              0.244                -                0.209                -
Run5              0.307                -                0.247                -

2nd sub-task. It is also worth mentioning that an attempt was made for the second sub-task. A deep-learning model was trained from scratch for each of the two variables (valence, arousal) separately. Because there were difficulties with the convergence of these models and the results were not encouraging, we decided not to publish them.

4. CONCLUSIONS

Comparing Run1 and Run2, we can conclude that the deep-learning features do actually improve the performance of the system. From Run3 and Run4 we notice that the IDT features (Run3), which represent motion, are more important for arousal prediction (emotion intensity), while the HSH features in Run4, which capture color, have a stronger effect on the valence model (positive versus negative emotions). These conclusions are also confirmed by the findings in the bibliography [12]. Finally, combining the features of Run3 and Run4 leads to a satisfying improvement for both models.

5. REFERENCES

[1] Activity Recognition in Videos using UCF101 dataset. https://github.com/anenbergb/CS221 Project.
[2] Finding optimized weights when combining classifiers. https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13868/ensamble-weights/75870#post75870.
[3] pyAudioAnalysis: A Python library for audio feature extraction, classification, segmentation and applications. https://github.com/tyiannak/pyAudioAnalysis.
[4] Scikit-learn: Machine learning in Python. http://scikit-learn.org/stable/.
[5] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. Deep learning vs. kernel methods: Performance for emotion prediction in videos. In 2015 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), 2015.
[6] E. Dellandréa, L. Chen, Y. Baveye, M. Sjöberg, and C. Chamaret. The MediaEval 2016 Emotional Impact of Movies Task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[7] Itseez. Open source computer vision library. https://github.com/itseez/opencv, 2015.
[8] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '14, pages 1725–1732, Washington, DC, USA, 2014. IEEE Computer Society.
[9] D. Paschalidou and A. Delopoulos. Event detection on video data with topic modeling algorithms. Master's thesis, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Nov. 2015.
[10] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. In Proceedings of the 18th ACM International Conference on Multimedia, MM '10, pages 1469–1472, New York, NY, USA, 2010. ACM.
[11] H. Wang and C. Schmid. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision, Sydney, Australia, 2013.
[12] S. Wang and Q. Ji. Video affective content analysis: A survey of state-of-the-art methods. IEEE Transactions on Affective Computing, 6(4):410–430, Oct. 2015.