NII-UIT at MediaEval 2016 Predicting Media Interestingness Task

Vu Lam (University of Science, VNU-HCM) — lqvu@fit.hcmus.edu.vn
Tien Do (University of Information Technology, VNU-HCM) — tiendv@uit.edu.vn
Sang Phan (National Institute of Informatics, Japan) — plsang@nii.ac.jp
Duy-Dinh Le (National Institute of Informatics, Japan) — ledduy@nii.ac.jp
Shin’ichi Satoh (National Institute of Informatics, Japan) — satoh@nii.ac.jp
Duc Anh Duong (University of Information Technology, VNU-HCM) — ducda@uit.edu.vn

ABSTRACT
The MediaEval 2016 Predicting Media Interestingness (PMI) Task requires participants to retrieve images and video segments that are considered to be the most interesting for a common viewer. This is a challenging problem, not only because of the large complexity of the data but also due to the semantic meaning of interestingness. This paper provides an overview of our framework used in MediaEval 2016 for the PMI task and discusses the performance results for both subtasks of predicting image and video interestingness. Experimental results show that our framework gives reasonable accuracy simply by using low-level features (GIST, HOG, Dense SIFT) and incorporating deep features from pre-trained deep learning models.

1. INTRODUCTION
Following the setting of this task [3], we design a framework that consists of three main components: feature extraction and encoding, feature classification, and feature fusion. An overview of our framework is shown in Fig. 1. For the features extracted from video frames, we use a max-pooling strategy to aggregate all frame features of the same shot into the shot representation. In the training step, we train a classifier for each type of feature using a Support Vector Machine [1]. We then use these classifiers to predict a score for each shot. Finally, we adopt late fusion with an average weighting scheme to combine the prediction scores of the various features.

Figure 1: Our framework for extracting and encoding local features.
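The following is a minimal, illustrative sketch (not the authors' code) of the two aggregation steps described above: max pooling of frame-level features into a shot-level descriptor, and average-weighted late fusion of per-feature prediction scores. The array shapes, random data, and function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def shot_representation(frame_features):
    """Max-pool frame-level descriptors (n_frames x dim) into one shot-level vector."""
    return np.max(frame_features, axis=0)

def late_fusion(score_lists):
    """Average-weighted late fusion: mean of per-feature prediction scores for each shot."""
    return np.mean(np.vstack(score_lists), axis=0)

# Hypothetical usage: 30 frames with 4,096-dim deep features for one shot,
# and prediction scores from two feature-specific classifiers over 10 shots.
frames = np.random.rand(30, 4096)
shot_vec = shot_representation(frames)     # shape: (4096,)
scores_a = np.random.rand(10)              # e.g. scores from a deep-feature classifier
scores_b = np.random.rand(10)              # e.g. scores from a GIST classifier
fused = late_fusion([scores_a, scores_b])  # shape: (10,)
```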
2. FEATURE EXTRACTION
2.1 Low-level Features
We use the features provided by the organizers [6]. More specifically, the following features are exploited for the task.

• Dense SIFT descriptors are computed following the original work in [9], except that local frame patches are densely sampled instead of being selected with interest point detectors. A codebook of 300 code words is used in the quantization process, with a three-layer spatial pyramid [8];

• HOG descriptors [2] are computed over densely sampled patches. Following [12], HOG descriptors in a 2x2 neighborhood are concatenated to form a descriptor of higher dimension;

• GIST is computed from the output energy of several Gabor-like filters (8 orientations and 4 scales) over a dense frame grid, as in [10].

2.2 Audio Features
For the video interestingness subtask, we use the popular Mel-frequency Cepstral Coefficients (MFCC) as audio features. We choose a length of 25 ms for the audio segments and a step size of 10 ms. The 13-dimensional MFCC vectors, together with their first and second derivatives, are used to represent each audio segment. The raw MFCC features are then encoded with Fisher vector encoding, using a GMM codebook with 256 clusters. For audio features we do not apply PCA. The final feature descriptor has 19,968 dimensions (2 x 39 x 256).

2.3 Deep Features
We used the popular Caffe framework [5] to extract deep features from two pre-trained models, AlexNet [7] and VGGNet [11]. These models were trained on the 1,000 ImageNet concepts [4].

AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, is the work that popularized convolutional networks in computer vision. It is the winning system of the ILSVRC 2012 classification task [4], where it outperformed the other methods by a large margin in terms of accuracy. This early visual deep learning network contains only 5 convolutional layers and 3 fully-connected layers.

VGGNet refers to a deep convolutional network for object recognition developed and trained by Oxford's Visual Geometry Group [11]. The group provided two deep networks, with 16 and 19 layers respectively. In our experiments, we use the 16-layer VGGNet for feature extraction.

We selected the neuron activations of the last three layers as feature representations. The third-to-last and second-to-last layers each have 4,096 dimensions, while the last layer has 1,000 dimensions, corresponding to the 1,000 concept categories of the ImageNet dataset. We denote these features as AlexNetFC6, AlexNetFC7, AlexNetFC8, VGGFC6, VGGFC7, and VGGFC8 in our experiments.

3. CLASSIFICATION
LibSVM [1] is used for training and testing our interestingness classifiers. For features that are encoded with the Fisher vector, we use a linear kernel. For deep learning features, a χ² kernel is used. The optimal gamma and cost parameters of the SVM classifiers are found by a grid search with 5-fold cross-validation on the training dataset.
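The grid search described above can be sketched as follows. This is only an illustrative example: it uses scikit-learn's chi-square kernel and SVC as a stand-in for LibSVM, and the data, parameter grids, and dimensions are hypothetical rather than the settings used in our runs.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Hypothetical training data: non-negative descriptors (the chi-square kernel
# requires non-negative inputs) and binary interestingness labels.
X = np.abs(np.random.rand(200, 4096))
y = np.random.randint(0, 2, size=200)

best = (None, None, -np.inf)
for gamma in [0.5, 1.0, 2.0]:            # chi-square kernel width (example grid)
    K = chi2_kernel(X, gamma=gamma)      # precomputed kernel matrix
    for C in [0.1, 1.0, 10.0]:           # SVM cost parameter (example grid)
        scores = []
        for tr, va in StratifiedKFold(n_splits=5).split(X, y):
            clf = SVC(kernel="precomputed", C=C)
            clf.fit(K[np.ix_(tr, tr)], y[tr])            # train-vs-train kernel
            scores.append(clf.score(K[np.ix_(va, tr)], y[va]))  # val-vs-train kernel
        if np.mean(scores) > best[2]:
            best = (gamma, C, np.mean(scores))

print("best gamma/C:", best[0], best[1], "5-fold accuracy:", best[2])
```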
4. SUBMITTED RUNS
We first use late fusion with an average weighting scheme to combine features from different modalities. We then select the runs with the top performance on the validation set for submission. The list of submitted runs for each subtask, together with their results, is given in Table 1 and Table 2.

Table 1: Results of predicting interestingness from images
Run  Features                     Results (MAP)
FA   VGGFC8+AlexNetFC8            21.15
V1   VGGFC7+GIST+HOG+DenseSIFT    17.73

Table 2: Results of predicting interestingness from videos
Run  Features                     Results (MAP)
FA   AlexNetFC8+MFCC              16.9
F1   VGGFC7+GIST                  16.41

5. RESULTS AND DISCUSSIONS
The official results for each subtask are shown in the last column of Table 1 and Table 2, corresponding to predicting interestingness from images and from videos, respectively. These results show that predicting interestingness from images is more accurate than from videos, which may be due to the highly dynamic nature of video content. Moreover, the performance of predicting interestingness from videos could be improved by exploiting motion features, which have not yet been incorporated into our system.

Examples of top interesting images detected by our system are illustrated in Fig. 2. Interestingly, our system tends to rank images of beautiful women higher. Furthermore, we found that images from dark scenes are often considered more interesting, probably because such scenes draw more attention from the audience.

Figure 2: Top interesting images detected by our system.

6. ACKNOWLEDGEMENTS
This research is partially funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2013-26-01.

7. REFERENCES
[1] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 886–893. IEEE, 2005.
[3] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[6] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE Transactions on Multimedia, 17(8):1174–1186, 2015.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[8] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 2169–2178. IEEE, 2006.
[9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[10] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
[11] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[12] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485–3492. IEEE, 2010.