NII-UIT at MediaEval 2016 Predicting Media Interestingness Task

Vu Lam (University of Science, VNU-HCM) — lqvu@fit.hcmus.edu.vn
Tien Do (University of Information Technology, VNU-HCM) — tiendv@uit.edu.vn
Sang Phan (National Institute of Informatics, Japan) — plsang@nii.ac.jp
Duy-Dinh Le (National Institute of Informatics, Japan) — ledduy@nii.ac.jp
Shin’ichi Satoh (National Institute of Informatics, Japan) — satoh@nii.ac.jp
Duc Anh Duong (University of Information Technology, VNU-HCM) — ducda@uit.edu.vn

ABSTRACT
The MediaEval 2016 Predicting Media Interestingness (PMI) Task requires participants to retrieve images and video segments that are considered to be the most interesting for a common viewer. This is a challenging problem, not only because of the large complexity of the data but also due to the semantic meaning of interestingness. This paper provides an overview of our framework used in MediaEval 2016 for the PMI task and discusses the performance results for both subtasks of predicting image and video interestingness. Experimental results show that our framework gives reasonable accuracy simply by using low-level features (GIST, HOG, Dense SIFT) and incorporating deep features from pre-trained deep learning models.

1. INTRODUCTION
Following the setting of this task [3], we design a framework that consists of three main components: feature extraction and encoding, feature classification, and feature fusion. An overview of our framework is shown in Fig. 1. For the features extracted from video frames, we use a max-pooling strategy to aggregate all frame features of the same shot into the shot representation. In the training step, we train a classifier for each type of feature using a Support Vector Machine [1]. We then use these classifiers to predict a score for each shot. Finally, we adopt late fusion with an average weighting scheme to combine the prediction scores of the various features.

Figure 1: Our framework for extracting and encoding local features.
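The following is a minimal, illustrative sketch (not the authors' code) of the two aggregation steps described above: max pooling of frame-level features into a shot-level descriptor, and average-weighted late fusion of per-feature prediction scores. The array shapes, random data, and function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def shot_representation(frame_features):
    """Max-pool frame-level descriptors (n_frames x dim) into one shot-level vector."""
    return np.max(frame_features, axis=0)

def late_fusion(score_lists):
    """Average-weighted late fusion: mean of per-feature prediction scores for each shot."""
    return np.mean(np.vstack(score_lists), axis=0)

# Hypothetical usage: 30 frames with 4,096-dim deep features for one shot,
# and prediction scores from two feature-specific classifiers over 10 shots.
frames = np.random.rand(30, 4096)
shot_vec = shot_representation(frames)     # shape: (4096,)
scores_a = np.random.rand(10)              # e.g. scores from a deep-feature classifier
scores_b = np.random.rand(10)              # e.g. scores from a GIST classifier
fused = late_fusion([scores_a, scores_b])  # shape: (10,)
```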
2. FEATURE EXTRACTION
2.1 Low-level Features
We use the features provided by the organizers [6]. More specifically, the following features are exploited for the task.

• Dense SIFT descriptors are computed following the original work in [9], except that local frame patches are densely sampled instead of being selected with interest point detectors. A codebook of 300 code words is used in the quantization process, with a three-layer spatial pyramid [8];

• HOG descriptors [2] are computed over densely sampled patches. Following [12], HOG descriptors in a 2x2 neighborhood are concatenated to form a descriptor of higher dimension;

• GIST is computed from the output energy of several Gabor-like filters (8 orientations and 4 scales) over a dense frame grid, as in [10].

2.2 Audio Features
For the video interestingness subtask, we use the popular Mel-frequency Cepstral Coefficients (MFCC) as audio features. We choose a length of 25 ms for the audio segments and a step size of 10 ms. The 13-dimensional MFCC vectors, together with their first and second derivatives, are used to represent each audio segment. The raw MFCC features are then encoded with Fisher vector encoding, using a GMM codebook with 256 clusters. For audio features we do not apply PCA. The final feature descriptor has 19,968 dimensions (2 x 39 x 256).

2.3 Deep Features
We used the popular Caffe framework [5] to extract deep features from two pre-trained models, AlexNet [7] and VGGNet [11]. These models were trained on the 1,000 ImageNet concepts [4].

AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, is the work that popularized convolutional networks in computer vision. It is the winning system of the ILSVRC 2012 classification task [4], where it outperformed the other methods by a large margin in terms of accuracy. This early visual deep learning network contains only 5 convolutional layers and 3 fully-connected layers.

VGGNet refers to a deep convolutional network for object recognition developed and trained by Oxford's Visual Geometry Group [11]. The group provided two deep networks, with 16 and 19 layers respectively. In our experiments, we use the 16-layer VGGNet for feature extraction.

We selected the neuron activations of the last three layers as feature representations. The third-to-last and second-to-last layers each have 4,096 dimensions, while the last layer has 1,000 dimensions, corresponding to the 1,000 concept categories of the ImageNet dataset. We denote these features as AlexNetFC6, AlexNetFC7, AlexNetFC8, VGGFC6, VGGFC7, and VGGFC8 in our experiments.

3. CLASSIFICATION
LibSVM [1] is used for training and testing our interestingness classifiers. For features that are encoded with the Fisher vector, we use a linear kernel. For deep learning features, a χ² kernel is used. The optimal gamma and cost parameters of the SVM classifiers are found by a grid search with 5-fold cross-validation on the training dataset.
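The grid search described above can be sketched as follows. This is only an illustrative example: it uses scikit-learn's chi-square kernel and SVC as a stand-in for LibSVM, and the data, parameter grids, and dimensions are hypothetical rather than the settings used in our runs.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Hypothetical training data: non-negative descriptors (the chi-square kernel
# requires non-negative inputs) and binary interestingness labels.
X = np.abs(np.random.rand(200, 4096))
y = np.random.randint(0, 2, size=200)

best = (None, None, -np.inf)
for gamma in [0.5, 1.0, 2.0]:            # chi-square kernel width (example grid)
    K = chi2_kernel(X, gamma=gamma)      # precomputed kernel matrix
    for C in [0.1, 1.0, 10.0]:           # SVM cost parameter (example grid)
        scores = []
        for tr, va in StratifiedKFold(n_splits=5).split(X, y):
            clf = SVC(kernel="precomputed", C=C)
            clf.fit(K[np.ix_(tr, tr)], y[tr])            # train-vs-train kernel
            scores.append(clf.score(K[np.ix_(va, tr)], y[va]))  # val-vs-train kernel
        if np.mean(scores) > best[2]:
            best = (gamma, C, np.mean(scores))

print("best gamma/C:", best[0], best[1], "5-fold accuracy:", best[2])
```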
4. SUBMITTED RUNS
We first use late fusion with an average weighting scheme to combine features from different modalities. We then select the runs with the top performance on the validation set for submission. The list of submitted runs for each subtask, together with their results, is given in Table 1 and Table 2.

Table 1: Results of predicting interestingness from images
Run  Features                     Results (MAP)
FA   VGGFC8+AlexNetFC8            21.15
V1   VGGFC7+GIST+HOG+DenseSIFT    17.73

Table 2: Results of predicting interestingness from videos
Run  Features                     Results (MAP)
FA   AlexNetFC8+MFCC              16.9
F1   VGGFC7+GIST                  16.41

5. RESULTS AND DISCUSSIONS
The official results for each subtask are shown in the last column of Table 1 and Table 2, corresponding to predicting interestingness from images and from videos, respectively. These results show that predicting interestingness from images is more accurate than from videos, which may be due to the highly dynamic nature of video content. Moreover, the performance of predicting interestingness from videos could be improved by exploiting motion features, which have not yet been incorporated into our system.

Examples of top interesting images detected by our system are illustrated in Fig. 2. Interestingly, our system tends to rank images of beautiful women higher. Furthermore, we found that images from dark scenes are often considered more interesting, probably because such scenes draw more attention from the audience.

Figure 2: Top interesting images detected by our system.

6. ACKNOWLEDGEMENTS
This research is partially funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2013-26-01.

7. REFERENCES
[1] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.
[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 886–893. IEEE, 2005.
[3] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[6] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE Transactions on Multimedia, 17(8):1174–1186, 2015.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[8] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 2169–2178. IEEE, 2006.
[9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[10] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
[11] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[12] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485–3492. IEEE, 2010.