AUTH-SGP in MediaEval 2016 Emotional Impact of Movies Task

                             Timoleon Anastasia                               Hadjileontiadis Leontios
          School of Electrical & Computer Engineering, Aristotle University of Thessaloniki, Greece
                                {timoanas,leontios}@auth.gr


MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.
© 2016 Copyright held by the owner/author(s).


ABSTRACT
This paper presents our approach to the MediaEval 2016 Emotional
Impact of Movies Task. The tested and adopted solutions are described,
and the benefit of using one set of features over another is
discussed. The conclusions are in line with state-of-the-art findings
and bring new inputs to the understanding of emotion prediction.


1.   INTRODUCTION
   In recent years, videos have become the main medium through which
many people interact with each other and share information. There is
therefore a growing need to evaluate the quality of this interaction
in terms of emotions, not only to analyze the video content. To serve
this purpose, video affective content analysis has gained interest
among researchers [12]. Many audio-visual video features can be useful
for depicting emotion. For example, imagine a film whose background is
full of warm colors. This can induce positive emotions in the viewers,
namely emotions with high valence values. Motion is another important
film element that can control a video's emotion. Films with large
motion intensity can cause stronger emotions, i.e., higher arousal
scores. This task aims at predicting the emotional feedback of users
while watching different genres of films [6].


2.   SYSTEM DESCRIPTION

2.1    Feature Extraction
   The key points of our system can be summarized as follows: first,
we extract multi-modal features that can successfully represent
emotion. These can be either local features, from specific patches of
the video frames or from overlapping time windows of the sound
signals, or directly global features from the entire image [9]. In the
first case, a feature encoding technique must be applied in order to
convert these local features into global ones. We examined the
Bag-of-Words and Fisher Vector approaches [9]. Finally, the extracted
features are regressed and/or combined in order to predict the emotion
scores.

2.1.1    Development-data features
   These features were provided by the organizers of the task. A great
variety of features were given, including diverse features from the
audio signals of the movies, features regarding the scene cuts, and
much more. These features were used almost directly; the only
preprocessing step was their normalization, by subtracting the mean
value and dividing by the standard deviation of each column.
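   As a minimal sketch of this preprocessing step, the per-column
z-score normalization could look as follows with NumPy; the array name
and its shape are placeholders, not part of the released feature set.

    import numpy as np

    # Per-column z-score normalization of the provided development-data
    # features; `dev_features` is a hypothetical (segments x features) array.
    dev_features = np.random.rand(9800, 20)   # placeholder data

    mean = dev_features.mean(axis=0)
    std = dev_features.std(axis=0)
    std[std == 0] = 1.0                       # guard against constant columns

    dev_features_norm = (dev_features - mean) / std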
2.1.2    Improved Dense Trajectories (IDT)
   This kind of feature provides information about the motion in the
videos and is calculated at different spatial and temporal scales
[11]. Such features are extensively used to classify human actions. We
resized the original videos to 320x240. Then, several descriptors were
calculated for each trajectory (length of 15 frames), including the
Histogram of Oriented Gradients (HOG), the Histogram of Optical Flow
(HOF), and the Motion Boundary Histograms along the x and y axes
(MBHx and MBHy). The total number of descriptor dimensions for each
trajectory is 426 (30+96+108+96+96) [1].
   For the conversion of the local features into global ones, the
Fisher Vector approach was used. A Gaussian Mixture Model (GMM) was
employed to construct a codebook with k words for each descriptor
(k = 64). A total of 2,500,000 points were sampled from the
descriptors of the development-train set to train the GMM. The
features of each descriptor are then individually projected via PCA to
half of their dimensions, resulting in 213 dimensions per trajectory,
and encoded using the Fisher Kernel method. The power and
L2-normalization schemes were applied to each descriptor and to the
resulting vectors, which can improve the performance of the system.
Finally, an entire video can be described by a vector of 27264
features (= 2 [mean value and standard deviation of the Gaussian
model] x 213 [features] x 64 [codebook size]).
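   The encoding step itself is not reproduced here, but the
dimensionality and the normalization of the resulting vectors can be
sketched as follows; `fisher_vector` is a placeholder for one encoded
trajectory descriptor, not the output of an actual GMM.

    import numpy as np

    # Fisher vector size for the IDT descriptors: mean and standard-deviation
    # gradients per Gaussian, for 213 PCA-reduced dimensions and 64 codewords.
    D, K = 213, 64
    fv_dim = 2 * D * K                        # = 27264
    fisher_vector = np.random.randn(fv_dim)   # stand-in for the real encoding

    # Power normalization (signed square root) followed by L2 normalization.
    fv = np.sign(fisher_vector) * np.sqrt(np.abs(fisher_vector))
    fv /= np.linalg.norm(fv) + 1e-12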
2.1.3    Deep Learning Features
   Deep learning is a modern sub-field of computer vision and machine
learning which uses artificial neural networks, combined with the
principles of convolution on images, to describe pictures with more
abstract and high-level features. We used the well-known BVLC Caffe
deep learning framework and treated the BVLC Reference CaffeNet
pre-trained model as a feature extractor [8]. In particular, this
network contains 5 convolutional layers, 2 fully connected layers and
a soft-max classifier. We extracted features from the last fully
connected layer, which outputs 4096 activations.
   The input frames were the keyframes of the 10-second videos,
resized to 256x256 [5]. Instead of averaging the results over the 10
random crops that the network produces for each image, the 4096 output
activations of each of the 10 crops were kept, resulting in a 10x4096
feature representation for each video. Then, the classic Bag-of-Words
concept was used to encode these features. The size of the codebook
was 8, and the BOWKMeansTrainer class from OpenCV [7] was used to find
the clusters. Each video was finally represented by an 8-bin
normalized histogram of the frequency of appearance of each codeword.
These features were added to the development-data features to explore
whether the performance is actually improved by their presence.
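   A minimal sketch of this Bag-of-Words encoding with OpenCV is given
below; the activation array is random placeholder data, and in
practice the trainer would be fed the activations of the whole
training set before clustering.

    import cv2
    import numpy as np

    # Build an 8-word codebook over CaffeNet activations and encode one video
    # as a normalized 8-bin histogram of codeword occurrences.
    codebook_size = 8
    bow_trainer = cv2.BOWKMeansTrainer(codebook_size)

    crop_activations = np.random.rand(10, 4096).astype(np.float32)  # 10 crops
    bow_trainer.add(crop_activations)
    vocabulary = bow_trainer.cluster()        # shape: (8, 4096)

    # Assign each crop to its nearest codeword and build the histogram.
    dists = np.linalg.norm(
        crop_activations[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    histogram = np.bincount(words, minlength=codebook_size).astype(np.float32)
    histogram /= histogram.sum()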
2.1.4    Dense SIFT Features
   The SIFT descriptor was used on the re-scaled videos. A common
approach when dealing with videos is to densely compute SIFT features
over neighborhoods of pixels in the images, with a specific stride
(counted in pixels) and a specific frame step. In our approach, the
neighborhood size is 10x10, and a new SIFT descriptor is calculated
every 5 pixels and every 5 frames [10]. After the extraction of the
dense SIFT features, PCA is applied to reduce the dimension of the
descriptor from 128 to 64. Finally, the Fisher Vector is applied, in a
similar manner to the IDT approach.
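   A sketch of such a dense SIFT computation for a single frame is
shown below. The authors relied on VLFeat [10]; this variant with
OpenCV and scikit-learn is an assumption for illustration, and the
frame path as well as the PCA fitted on a single frame are
placeholders.

    import cv2
    from sklearn.decomposition import PCA

    # Dense SIFT: one 128-D descriptor every 5 pixels on a regular grid.
    step, patch_size = 5, 10
    frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frame

    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                 for y in range(0, frame.shape[0], step)
                 for x in range(0, frame.shape[1], step)]
    sift = cv2.SIFT_create()
    _, descriptors = sift.compute(frame, keypoints)      # (n_points, 128)

    # Dimensionality reduction from 128 to 64 (fit here only for illustration;
    # in practice the PCA is learned on the training descriptors).
    reduced = PCA(n_components=64).fit_transform(descriptors)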
2.1.5    Hue Saturation Histogram (HSH)
   As mentioned above, different colors can evoke different kinds of
emotions. We converted the frames from RGB to the Hue Saturation Value
(HSV) space and then computed a two-dimensional histogram, keeping
only the hue and saturation channels. The number of hue bins was 15,
while the number of saturation bins was 16. An HSH was calculated
every 5 frames, exactly like the dense SIFT descriptor. Finally, the
PCA and Fisher Vector approaches were applied.
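   The histogram computation for one frame can be sketched as follows;
note that OpenCV stores hue in [0, 180) and saturation in [0, 256),
and the frame path is a placeholder.

    import cv2

    # 15x16 hue-saturation histogram of a single frame, flattened and
    # normalized to sum to one.
    frame = cv2.imread("frame.png")
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    hsh = cv2.calcHist([hsv], [0, 1], None, [15, 16], [0, 180, 0, 256])
    hsh = hsh.flatten()
    hsh /= hsh.sum()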
2.1.6    Audio Features
   We used the Mel Frequency Cepstral Coefficients (MFCC) as the
representative audio feature [3]. Each video can be described by three
different types of MFCCs. The first type is the short-term descriptor,
where the input audio signal is divided into overlapping windows of
32 ms (with 50% overlap) and a cepstral representation is computed for
each of them. The other two types of descriptors are the mean and the
standard deviation of the above-mentioned features, resulting in a
39-dimensional (3x13) vector. Finally, PCA dimension reduction and
encoding with the Fisher Vector were employed.
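   The short-term extraction and the per-segment statistics can be
sketched as follows; librosa is used here instead of the
pyAudioAnalysis library cited in [3], and the audio path is a
placeholder.

    import librosa

    # 13 MFCCs per 32 ms window with 50% overlap, plus their mean and
    # standard deviation over the windows of a movie segment.
    y, sr = librosa.load("segment.wav", sr=None)
    win = int(0.032 * sr)                  # 32 ms window
    hop = win // 2                         # 50% overlap

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=win, hop_length=hop)  # (13, n_windows)
    mfcc_mean = mfcc.mean(axis=1)          # (13,)
    mfcc_std = mfcc.std(axis=1)            # (13,)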
2.2    Regression
   As far as regression is concerned, Support Vector Regression (SVR)
[4] is employed in this project. For each task, a grid-search
cross-validation scheme was used in order to determine the best
hyper-parameters C and γ and the type of kernel for each model. We
investigated radial basis function and linear kernels, while C and γ
were searched in the ranges [0.01, 10] and [0.001, 1], respectively.
The objective function to be maximized was the Pearson correlation
coefficient between the predicted and the real output values. The
cross-validation scheme we followed was simple k-fold validation with
k = 5. The distribution of the different movie genres in each set
(train and validation) was not taken into account, although doing so
is a good direction for future work.
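   A minimal sketch of this model selection with scikit-learn [4] is
given below; the feature matrix and the annotations are random
placeholders, and the exact grid values are assumptions within the
stated ranges.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import make_scorer

    # 5-fold grid search over kernel, C and gamma, maximizing the Pearson
    # correlation coefficient between predictions and annotations.
    def pearson_score(y_true, y_pred):
        return pearsonr(y_true, y_pred)[0]

    param_grid = {
        "kernel": ["rbf", "linear"],
        "C": np.logspace(-2, 1, 4),       # values within [0.01, 10]
        "gamma": np.logspace(-3, 0, 4),   # values within [0.001, 1]
    }
    search = GridSearchCV(SVR(), param_grid,
                          scoring=make_scorer(pearson_score), cv=5)

    X, y = np.random.rand(100, 20), np.random.rand(100)  # placeholder data
    search.fit(X, y)
    print(search.best_params_)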
3.   RESULTS AND DISCUSSION
   1st sub-task. We submitted a total of 5 runs for the first sub-task
only. The first run used only the already-extracted features from the
development-data. The second run combined the above features with the
deep-learning ones; the features were concatenated horizontally and
then regressed. The third run includes only the features from the
improved dense trajectories. The fourth run contains the HSH, MFCC and
DSIFT features, as well as the IDT features. The fifth run mixes the
features from the two previous runs. Due to the large size of the
feature space, for this last run a linear late-fusion strategy was
implemented and the scores of the two regressors were combined
linearly [2], as sketched below.
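   One simple way to realize such a fusion is to search for the mixing
weight that maximizes the Pearson correlation on validation data; the
exact weighting scheme is not specified here, so the following is only
an assumed illustration with placeholder arrays.

    import numpy as np
    from scipy.stats import pearsonr

    # Linear late fusion of two regressors' scores: pick the weight w that
    # maximizes the Pearson correlation on a validation set.
    pred_a = np.random.rand(200)    # scores of the first regressor
    pred_b = np.random.rand(200)    # scores of the second regressor
    y_val = np.random.rand(200)     # validation annotations

    best_w, best_r = 0.0, -np.inf
    for w in np.linspace(0.0, 1.0, 21):
        fused = w * pred_a + (1.0 - w) * pred_b
        r = pearsonr(y_val, fused)[0]
        if r > best_r:
            best_w, best_r = w, r

    print(best_w, best_r)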
   Table 1 displays the name of each run, whether it was an external
or a required run, and the Pearson correlation coefficient for the
valence and arousal models separately, on both the development
test-set and the release test-set. Some cells of the table do not
provide scores for the release test-set, because these runs were
executed after the corresponding deadline. It should be pointed out
that some videos had too little movement, so no IDT features could be
extracted from them. Therefore, the models of Runs 3, 4 and 5 were
trained, validated and evaluated on a slightly smaller set of videos
(9786 instead of the total 9800 movie segments).

Table 1: AUTH-SGP results, Pearson correlation coefficient on the
Development Test-set (Dev-Test) and the Release Test-set (Rel-Test)

    Run        Arousal                 Valence
               Dev-Test   Rel-Test     Dev-Test   Rel-Test
    Run1       0.308      0.247        0.264      0.076
    Run2 ext   0.303      0.265        0.290      0.11
    Run3       0.264      -            0.192      -
    Run4       0.244      -            0.209      -
    Run5       0.307      -            0.247      -

   2nd sub-task. It is also worth mentioning that an attempt was made
for the second sub-task. A deep-learning model was trained from
scratch for each of the two variables (valence, arousal) separately.
Because there were difficulties with the convergence of these models
and the results were not encouraging, we decided not to publish them.

4.   CONCLUSIONS
   Comparing Run1 and Run2, we can conclude that deep-learning
features do actually improve the performance of the system. From Run3
and Run4 we can notice that the IDT features (Run3), which represent
motion, are more important for the arousal prediction (emotion
intensity), while the HSH features of Run4, which capture color,
affect the performance of the valence model (positive-negative
emotions) more. These conclusions are also confirmed by the findings
reported in the bibliography [12]. Finally, combining the features
from Run3 and Run4 leads to a satisfying improvement of both models.

5.   REFERENCES
 [1] Activity Recognition in Videos using UCF101 dataset.
     https://github.com/anenbergb/CS221 Project.
 [2] Finding optimized weights when combining classifiers.
     https://www.kaggle.com/c/
     otto-group-product-classification-challenge/forums/t/
     13868/ensamble-weights/75870#post75870.
 [3] pyAudioAnalysis: A Python library for audio feature
     extraction, classification, segmentation and
     applications.
     https://github.com/tyiannak/pyAudioAnalysis.
 [4] Scikit-learn: Machine learning in Python.
     http://scikit-learn.org/stable/.
 [5] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen.
     Deep Learning vs. Kernel Methods: Performance for
     Emotion Prediction in Videos. In 2015 Humaine
     Association Conference on Affective Computing and
     Intelligent Interaction (ACII), 2015.
 [6] E. Dellandréa, L. Chen, Y. Baveye, M. Sjöberg, and
     C. Chamaret. The MediaEval 2016 Emotional Impact of
     Movies Task. In Proc. of the MediaEval 2016 Workshop,
     Hilversum, Netherlands, Oct. 20-21 2016.
 [7] Itseez. Open source computer vision library.
     https://github.com/itseez/opencv, 2015.
 [8] A. Karpathy, G. Toderici, S. Shetty, T. Leung,
     R. Sukthankar, and L. Fei-Fei. Large-scale video
     classification with convolutional neural networks. In
     Proceedings of the 2014 IEEE Conference on
     Computer Vision and Pattern Recognition, CVPR ’14,
     pages 1725–1732, Washington, DC, USA, 2014. IEEE
     Computer Society.
 [9] D. Paschalidou and A. Delopoulos. Event detection on
     video data with topic modeling algorithms. Master’s
     thesis, Department of Electrical and Computer
     Engineering, Aristotle University of Thessaloniki, Nov.
     2015.
[10] A. Vedaldi and B. Fulkerson. Vlfeat: An open and
     portable library of computer vision algorithms. In
     Proceedings of the 18th ACM International Conference
     on Multimedia, MM ’10, pages 1469–1472, New York,
     NY, USA, 2010. ACM.
[11] H. Wang and C. Schmid. Action recognition with
     improved trajectories. In IEEE International
     Conference on Computer Vision, Sydney, Australia,
     2013.
[12] S. Wang and Q. Ji. Video affective content analysis: A
     survey of state-of-the-art methods. IEEE Transactions
     on Affective Computing, 6(4):410–430, Oct. 2015.