NII-UIT at MediaEval 2015 Affective Impact of Movies Task

Vu Lam, University of Science, VNU-HCM, lqvu@fit.hcmus.edu.vn
Sang Phan, National Institute of Informatics, Japan, plsang@nii.ac.jp
Duy-Dinh Le, National Institute of Informatics, Japan, ledduy@nii.ac.jp
Shin’ichi Satoh, National Institute of Informatics, Japan, satoh@nii.ac.jp
Duc Anh Duong, University of Information Technology, VNU-HCM, ducda@uit.edu.vn

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
The Affective Impact of Movies task aims to detect violent videos and their affective impact on viewers [9]. This is a challenging task, not only because of the diversity of video content but also because of the subjectiveness of human emotion. In this paper, we present a unified framework that can be applied to both subtasks: (i) induced affect detection, and (ii) violence detection. The framework is based on our previous year's Violent Scene Detection (VSD) framework. We extended it to support affect detection by training the different valence/arousal classes independently and combining their predictions to make the final decision. Besides using internal features from three different modalities (audio, image, and motion), this year we also incorporate deep learning features into our framework. Experimental results show that our unified framework can detect violent videos and their affective impact with reasonable accuracy. Moreover, using deep features significantly improves the detection performance of both subtasks.

1. INTRODUCTION
   Detecting the affective impact of movies requires combining multimedia features. For example, a violent car-chase video can be detected by searching for evidence such as fast-moving cars or the sound of gunshots. To this end, we have developed a framework that supports combining features from multiple modalities for violent scene detection. We consider induced affect detection as a multi-class classification task; therefore, our framework can be applied to predict the valence and arousal class of a video as well. In general, our framework consists of three main components: feature extraction, feature encoding, and feature classification. An overview of our framework is shown in Fig. 1.

[Figure 1: Our framework for extracting and encoding local features.]

2. FEATURE EXTRACTION

2.1 Image Features
   At first, we scale the original video to 320x240 pixels and then sample one frame per second. We use the standard SIFT feature with the Hessian-Laplace interest point detector to extract features from each frame [6]. Each frame is represented using the Fisher Vector encoding [7]. We use an average pooling strategy to aggregate the frame-based features into the final video representation, which has 40,960 dimensions.
2.2 Motion Feature
   We use Improved Trajectories [10] to extract dense trajectories. A combination of Histograms of Oriented Gradients (HOG), Histograms of Optical Flow (HOF), and Motion Boundary Histograms (MBH) is used to describe each trajectory. We encode the HOGHOF and MBH features separately using the Fisher Vector encoding. The codebook size is 256, trained using a Gaussian Mixture Model (GMM). After applying PCA, the feature representation of each descriptor has 65,536 dimensions.
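For reference, a hypothetical codebook-training step consistent with the numbers above is sketched below: reducing each trajectory descriptor to 128 dimensions with PCA and fitting a 256-component diagonal GMM yields Fisher Vectors of 2 x 256 x 128 = 65,536 dimensions. The PCA dimensionality and the descriptor sampling are our assumptions, not values stated in the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def train_codebook(sampled_descs, pca_dim=128, num_clusters=256):
    """sampled_descs: (N, D) trajectory descriptors (HOGHOF or MBH) sampled
    from the training videos; pca_dim=128 is an assumption that reproduces
    the 65,536-dimensional Fisher Vectors mentioned above."""
    pca = PCA(n_components=pca_dim).fit(sampled_descs)
    reduced = pca.transform(sampled_descs)
    gmm = GaussianMixture(n_components=num_clusters,
                          covariance_type="diag",
                          max_iter=100).fit(reduced)
    return pca, gmm
```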
2.3 Audio Feature
   We use the popular Mel-frequency cepstral coefficients (MFCC) to extract audio features. We choose a window length of 25 ms for the audio segments and a step size of 10 ms. The 13-dimensional MFCC vectors, together with their first and second derivatives, are used to represent each audio segment. The raw MFCC features are also encoded with the Fisher Vector encoding, using a GMM codebook with 256 clusters. For the audio features, we do not use PCA. The final feature descriptor has 19,968 dimensions.
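A minimal extraction sketch with these settings is shown below. It uses librosa rather than the toolchain we originally used, so treat it as an approximation of the described setup; the sampling rate is illustrative.

```python
import librosa
import numpy as np

def extract_mfcc(audio_path, sr=16000):
    """One 39-dimensional MFCC descriptor (13 coefficients + deltas + delta-deltas)
    per 25 ms window with a 10 ms step."""
    y, sr = librosa.load(audio_path, sr=sr)
    win = int(0.025 * sr)          # 25 ms analysis window
    hop = int(0.010 * sr)          # 10 ms step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=win, hop_length=hop)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, delta, delta2]).T     # (num_segments, 39)
```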
Table 1: Submitted violence detection runs and official results.

Run     Features                                                    Validation Results (mAP)   Official Results (mAP)
1       HOGHOF+MBH+MFCC                                             0.2200                     0.2039
2       HOGHOF+MBH+SIFT+MFCC                                        0.2094                     0.2087
3 ext   HOGHOF+MBH+MFCC+VDFULL                                      0.2457                     0.2380
4 ext   HOGHOF+MBH+MFCC+VDFULL+HBM                                  0.2499                     0.2196
5 ext   HOGHOF+MBH+MFCC+VDFULL+VDFC6+VDFC7+FOHGOH+HBM+TFIS+CCFM     0.1930                     0.2684

Table 2: Submitted induced affect detection runs and official results.

                                                                 Validation Results (mAP)   Official Results (Accuracy, %)
Run     Features                                Decision Strategy   Valence   Arousal          Valence   Arousal
1       HOGHOF+MBH+SIFT+MFCC                    MAXREL              0.4148    0.3998           39.823    35.723
2       HOGHOF+MBH+SIFT+MFCC                    MAX                 0.4148    0.3998           41.653    55.908
3 ext   HOGHOF+MBH+SIFT+MFCC                    MAXREL              0.4376    0.3958           42.956    55.677
        +VDFULL+VDFC6+VDFC7
4 ext   HOGHOF+MBH+SIFT+MFCC                    MAX                 0.4376    0.3958           42.914    55.656
        +VDFULL+VDFC6+VDFC7


2.4 Deep Learning Feature
   We use the popular Caffe [3] framework to extract image features. We use the pre-trained deep model provided by Simonyan and Zisserman [8], which was trained on the 1,000 ImageNet concepts [2]. As suggested in [4], we select the neuron activations from the last three layers as the feature representation. The third-to-last and second-to-last layers have 4,096 dimensions each, while the last layer has 1,000 dimensions, corresponding to the 1,000 concept categories in the ImageNet dataset. We denote these features as VDFC6, VDFC7, and VDFULL in our experiments.
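To make the layer choice concrete, here is a hedged sketch of extracting three analogous activations from a VGG-style network with torchvision (version 0.13 or later). We used Caffe with the model of [8], so this stands in for, rather than reproduces, our pipeline; whether the 1,000-dimensional feature is taken before or after the softmax is not spelled out above, and the sketch returns the softmax probabilities.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_deep_features(image_path):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        x = model.features(x)
        x = model.avgpool(x)
        x = torch.flatten(x, 1)
        fc6 = model.classifier[:2](x)        # 4,096-d (analogous to VDFC6)
        fc7 = model.classifier[2:5](fc6)     # 4,096-d (analogous to VDFC7)
        logits = model.classifier[5:](fc7)   # 1,000-d (analogous to VDFULL)
        probs = torch.softmax(logits, dim=1)
    return fc6.squeeze(0), fc7.squeeze(0), probs.squeeze(0)
```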

2.5 Features from Past VSD Tasks
   For the violence detection subtask, we also consider using features from past VSD tasks as external features. In particular, we reuse the features that were extracted in the VSD 2014 task to train the violence detector. These features include SIFT, Dense Trajectories (HOGHOF and MBH descriptors), and audio MFCC, which achieved the runner-up performance in VSD 2014 [5]. We denote these features as FOHGHOF, HBM, TFIS, and CCFM in our experiments.
3. CLASSIFICATION
   LibSVM [1] is used for training and testing our affective impact detectors. For the features encoded with the Fisher Vector, we use a linear kernel for training and testing. For the deep learning features, a χ2 kernel is used.
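A minimal scikit-learn sketch of this kernel setup is given below; we used the LibSVM tools directly, so the sklearn wrappers and the gamma value here are stand-ins for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_fv_detector(X_train, y_train):
    # Fisher Vector features: linear-kernel SVM.
    return SVC(kernel="linear").fit(X_train, y_train)

def train_deep_detector(X_train, y_train, gamma=1.0):
    # Deep features (non-negative): SVM on a precomputed chi-square kernel.
    K = chi2_kernel(X_train, gamma=gamma)
    return SVC(kernel="precomputed").fit(K, y_train)

def score_deep_detector(clf, X_test, X_train, gamma=1.0):
    # Kernel between test samples and the training samples used for fitting.
    K = chi2_kernel(X_test, X_train, gamma=gamma)
    return clf.decision_function(K)
```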
   We divide the training videos into two subsets: the first 3,072 videos are used for training the models, while the remaining 3,072 videos are used for validation. To learn the decision threshold of each detector, we sweep the threshold over the range from 0 to 1 with a step size of 0.01 and select the value that maximizes the F1 score.
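A sketch of this per-detector threshold search is shown below, assuming the detector outputs probability-like scores in [0, 1] on the validation split; in the multi-class affect setting one such threshold is learned per class.

```python
import numpy as np
from sklearn.metrics import f1_score

def learn_threshold(val_scores, val_labels):
    """Pick the threshold in [0, 1] (step 0.01) that maximizes F1 on validation."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.arange(0.0, 1.01, 0.01):
        preds = (np.asarray(val_scores) >= t).astype(int)
        f1 = f1_score(val_labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```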
   In order to generate a decision for valence or arousal detection, we need to combine the predictions of all valence or arousal classes. To this end, we propose two strategies: (1) MAX: select the class with the highest prediction score; (2) MAXREL: select the class with the highest relative improvement over its learned threshold.
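The exact definition of "relative improvement" is not spelled out above; the sketch below uses (score - threshold) / threshold as one plausible reading, with per-class scores and the thresholds learned on the validation split.

```python
import numpy as np

def decide_max(scores):
    """MAX: pick the class with the highest prediction score."""
    return int(np.argmax(scores))

def decide_maxrel(scores, thresholds, eps=1e-12):
    """MAXREL: pick the class whose score exceeds its learned threshold
    by the largest relative margin (one plausible reading)."""
    scores, thresholds = np.asarray(scores), np.asarray(thresholds)
    rel = (scores - thresholds) / (thresholds + eps)
    return int(np.argmax(rel))
```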
4. SUBMITTED RUNS
   We first use late fusion with an average weighting scheme to combine features from the different modalities. We then select the runs with the best performance on the validation set for submission. The list of submitted runs for each subtask and their validation results can be seen in Table 1 and Table 2.
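As an illustration, a minimal average late-fusion step over per-modality detector scores might look like the following; the modality names are illustrative and the scores are assumed to already be on comparable scales.

```python
import numpy as np

def late_fusion_average(score_dict):
    """score_dict maps a modality name to a (num_videos,) array of scores."""
    stacked = np.stack(list(score_dict.values()), axis=0)
    return stacked.mean(axis=0)

# Example: fusing motion, audio, and image scores for one run.
# fused = late_fusion_average({"HOGHOF": s_hoghof, "MBH": s_mbh, "MFCC": s_mfcc})
```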
5. RESULTS AND DISCUSSIONS
   The official results for each subtask are shown in the last column of Table 1 and Table 2. For the violence detection subtask, we observe that combining multiple features gives more stable results. For example, on the validation set, the run that combines all available features has the lowest performance, yet on the test set this run achieves the best performance. This may be due to the fact that we only use a single split for validation. For both subtasks, combining with deep learning features significantly improves the detection performance. For the induced affect detection subtask, we found that the strategy using the maximum detection score tends to give more stable performance. The best valence detection performance is obtained by combining all internal features with all deep learning features using the maximum relative improvement strategy.

6. ACKNOWLEDGEMENTS
   This research is partially funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2013-26-01.
7.   REFERENCES
 [1] C.-C. Chang and C.-J. Lin. LIBSVM: A library for
     support vector machines. ACM Transactions on
     Intelligent Systems and Technology, 2:27:1–27:27,
     2011.
 [2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and
     L. Fei-Fei. Imagenet: A large-scale hierarchical image
     database. In Computer Vision and Pattern
     Recognition (CVPR), pages 248–255. IEEE, 2009.
 [3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev,
     J. Long, R. Girshick, S. Guadarrama, and T. Darrell.
     Caffe: Convolutional architecture for fast feature
     embedding. In Proceedings of the ACM International
     Conference on Multimedia, pages 675–678. ACM,
     2014.
 [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton.
     Imagenet classification with deep convolutional neural
     networks. In Advances in neural information
     processing systems, pages 1097–1105, 2012.
 [5] V. Lam, D. Le, S. Phan, S. Satoh, and D. A. Duong.
     NII-UIT at MediaEval 2014 violent scenes detection
     affect task. In Working Notes Proceedings of the
     MediaEval 2014 Workshop, Barcelona, Catalunya,
     Spain, October 16-17, 2014., 2014.
 [6] D. G. Lowe. Distinctive image features from
     scale-invariant keypoints. International journal of
     computer vision, 60(2):91–110, 2004.
 [7] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek.
     Image classification with the fisher vector: Theory and
     practice. International journal of computer vision,
     105(3):222–245, 2013.
 [8] K. Simonyan and A. Zisserman. Very deep
     convolutional networks for large-scale image
     recognition. arXiv preprint arXiv:1409.1556, 2014.
 [9] M. Sjöberg, Y. Baveye, H. Wang, V. L. Quang,
     B. Ionescu, E. Dellandrea, M. Schedl, C.-H. Demarty,
     and L. Chen. The MediaEval 2015 affective impact of
     movies task. In MediaEval 2015 Workshop, Wurzen,
     Germany, September 14-15, 2015.
[10] H. Wang and C. Schmid. Action recognition with
     improved trajectories. In International Conference on
     Computer Vision (ICCV), pages 3551–3558. IEEE,
     2013.