RUC at MediaEval 2016: Predicting Media Interestingness Task

Shizhe Chen, Yujie Dian, Qin Jin
School of Information, Renmin University of China, China
{cszhe1, dianyujie-blair, qjin}@ruc.edu.cn

ABSTRACT
Measuring media interestingness has a wide range of applications such as video recommendation. This paper presents our approach to the MediaEval 2016 Predicting Media Interestingness Task. There are two subtasks: image interestingness prediction and video interestingness prediction. For both subtasks, we utilize hand-crafted features and CNN features as our visual features. For the video subtask, we also extract acoustic features, including MFCC Fisher Vectors and statistical acoustic features. We train SVM and Random Forest classifiers, and early fusion is applied to combine different features. Experimental results show that combining semantic-level and low-level visual features is beneficial for image interestingness prediction. When predicting video interestingness, the audio modality has superior performance, and the early fusion of the visual and audio modalities can further boost the performance.

Figure 1: An Overview of the System Framework (CNN and hand-crafted visual features, plus statistical acoustic and MFCC FV audio features, are combined by early fusion and classified with SVM or Random Forest to produce an interestingness probability)

1. SYSTEM DESCRIPTION
An overview of our framework for the MediaEval 2016 Predicting Media Interestingness Task [1] is shown in Figure 1. For image interestingness prediction, we use hand-crafted visual features and CNN features. For the video subtask, we utilize both visual and audio cues in the video to predict interestingness. Early fusion is applied to combine different features. In the following subsections, we describe the feature representation and the prediction model in detail.

1.1 Feature Extraction

1.1.1 Visual Features
DCNNs are the state-of-the-art models in many visual tasks such as object detection and scene recognition. In this task, we extract activations from the penultimate layer and the last softmax layer of AlexNet and Inception-v3 [2], both pre-trained on ImageNet, as our image-level CNN features, namely alex_fc7, alex_prob, inc_fc and inc_prob respectively. The features extracted from the last layer are the probability distribution over 1000 object classes, which describes the semantic-level concepts people might show interest in. The penultimate-layer features are an abstraction of the image content and have shown great generalization ability across different tasks. We also use the hand-crafted visual features provided in [3], including Color Histogram, GIST, LBP, HOG and Dense SIFT, to cover different aspects of the images. For the video subtask, mean pooling is applied over all the image features of a video clip to generate video-level features.

1.1.2 Acoustic Features
Statistical Acoustic Features: Statistical acoustic features have proved effective in speech emotion recognition. We use the open-source toolkit OpenSMILE [4] to extract statistical acoustic features with the configuration of the INTERSPEECH 2009 Emotion Challenge [5]. Low-level acoustic features such as energy, pitch, jitter and shimmer are first extracted over a short-time window, and statistical functions such as mean and max are then applied over the set of low-level features to generate sentence-level features.

MFCC-based Features: Mel-Frequency Cepstral Coefficients (MFCCs) [6] are the most widely used low-level features and have been successfully applied in many speech tasks. We therefore use MFCCs as our frame-level features, with a window of 25ms and a shift of 10ms. Fisher Vector encoding (FV) [7] is applied to transform the variable-length MFCC sequence into a sentence-level feature. We train a Gaussian Mixture Model (GMM) with 8 mixtures as our audio word dictionary, and then compute, for each audio clip, the gradient of the log-likelihood with respect to the GMM parameters, i.e., the direction in which the model would best fit the data. L2 normalization is applied to the mfccFV features.
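For concreteness, the following is a minimal sketch of how the image-level CNN features of Sect. 1.1.1 and the video-level mean pooling could be computed. It assumes PyTorch/torchvision, which the paper does not specify, and covers only the AlexNet features (alex_fc7 and alex_prob); the Inception-v3 features follow the same pattern.

```python
# Hypothetical sketch of alex_fc7 / alex_prob extraction and video-level mean
# pooling (Sect. 1.1.1). PyTorch/torchvision are assumptions, not the paper's
# stated tooling.
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

alexnet = models.alexnet(pretrained=True).eval()   # ImageNet-pretrained AlexNet

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def alex_features(img: Image.Image):
    """Return (alex_fc7, alex_prob) for one image."""
    x = preprocess(img).unsqueeze(0)
    h = alexnet.avgpool(alexnet.features(x)).flatten(1)
    fc7 = alexnet.classifier[:6](h)                      # 4096-d penultimate activations
    prob = torch.softmax(alexnet.classifier[6](fc7), 1)  # probabilities over 1000 objects
    return fc7.squeeze(0).numpy(), prob.squeeze(0).numpy()

def video_level(frame_features):
    """Mean-pool a list of frame-level vectors into one video-level feature."""
    return np.stack(frame_features).mean(axis=0)
```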
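Similarly, the MFCC Fisher Vector pipeline of Sect. 1.1.2 can be sketched as below. The sketch assumes librosa for MFCC extraction and scikit-learn for the 8-mixture GMM, neither of which is named in the paper, and implements the standard Fisher Vector gradients with respect to the GMM means and variances [7] followed by the L2 normalization mentioned above; the sampling rate and MFCC dimensionality are illustrative assumptions.

```python
# Hypothetical sketch of the mfccFV feature (Sect. 1.1.2): 25 ms / 10 ms MFCC
# frames, an 8-mixture GMM "audio word" dictionary, and a Fisher Vector per clip.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, sr=16000, n_mfcc=13):
    """Frame-level MFCCs with a 25 ms window and a 10 ms shift."""
    y, sr = librosa.load(wav_path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return m.T                                            # (n_frames, n_mfcc)

def train_audio_dictionary(frame_lists, n_mixtures=8):
    """8-mixture diagonal-covariance GMM used as the audio word dictionary."""
    gmm = GaussianMixture(n_components=n_mixtures, covariance_type='diag',
                          max_iter=200, random_state=0)
    gmm.fit(np.vstack(frame_lists))
    return gmm

def mfcc_fisher_vector(frames, gmm):
    """Fisher Vector of one clip: gradients of the log-likelihood w.r.t. the
    GMM means and variances [7], followed by L2 normalization."""
    X = np.atleast_2d(frames)                             # (T, D)
    T, D = X.shape
    q = gmm.predict_proba(X)                              # (T, K) posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (q[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(w)[:, None])
    g_var = (q[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])         # 2 * K * D dimensions
    return fv / (np.linalg.norm(fv) + 1e-12)              # L2-norm, as in the paper
```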
1.2 Classification Model
For both the image and the video systems, we train binary SVM and Random Forest models as our interestingness classifiers. The hyper-parameters of the models are selected by grid search according to the mean average precision (MAP) on our local validation set. For the SVM, an RBF kernel is applied and the cost is searched from 2^-2 to 2^10. For the Random Forest, the number of trees is set to 100 and the tree depth is searched from 2 to 16.
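A minimal sketch of this model selection step, together with the early fusion by feature concatenation used throughout the paper, might look as follows. It assumes scikit-learn and pre-computed per-sample feature matrices, and, for brevity, it scores candidates with the average precision over the whole validation set rather than the task's official MAP evaluation.

```python
# Hypothetical sketch of early fusion (Sect. 1) and hyper-parameter grid search
# (Sect. 1.2): SVM with RBF kernel, C in 2^-2..2^10, and Random Forest with 100
# trees and depth 2..16, selected by validation average precision.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

def early_fusion(*feature_blocks):
    """Concatenate per-sample feature blocks along the feature axis.
    (Per-block scaling may be needed in practice; it is omitted here.)"""
    return np.hstack(feature_blocks)

def select_model(X_tr, y_tr, X_va, y_va):
    """Grid-search SVM and Random Forest, keeping the model with the best
    validation average precision; returns (model, score)."""
    candidates = [SVC(kernel='rbf', C=2.0 ** c, probability=True)
                  for c in range(-2, 11)]
    candidates += [RandomForestClassifier(n_estimators=100, max_depth=d)
                   for d in range(2, 17)]
    best, best_ap = None, -1.0
    for clf in candidates:
        clf.fit(X_tr, y_tr)
        scores = clf.predict_proba(X_va)[:, 1]   # interestingness probability
        ap = average_precision_score(y_va, scores)
        if ap > best_ap:
            best, best_ap = clf, ap
    return best, best_ap
```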
2. EXPERIMENTS

2.1 Experimental Setting
There are 5054 images or video clips in total in the development set of each subtask. We use the videos with ids from 0 to 40 (4014 samples) as the local training set, videos 41 to 45 (468 samples) as the local validation set, and the remaining videos (572 samples) as the local testing set. We use the whole development set to train the final submitted systems.

2.2 Experimental Results
Figure 2 shows the best MAP of the SVM and Random Forest classifiers for each kind of feature in the image subtask. The penultimate CNN features inc_fc and alex_fc7 achieve the top performance among all the visual features. However, the probability features extracted from the CNNs do not perform well on their own.

Figure 2: MAP of Single Feature for Image Subtask on Local Testing Set

We then use early fusion to concatenate different visual features. Figure 3 shows some of the fusion results. Combining alex_prob with the other visual appearance features significantly improves the classification performance, which shows that the semantic-level features and the low-level appearance features are complementary. However, concatenating alex_fc7 with the hand-crafted features does not bring any improvement.

Figure 3: MAP of Early Fusion for Image Subtask on Local Testing Set

For video interestingness prediction, Figure 4 presents the performance of each single feature. The audio modality outperforms the visual modality, and mfccFV achieves the best performance. Fusing acoustic features with the best visual feature, GIST, is beneficial: for example, AcouStats-GIST achieves a MAP of 20.80%, a 19% relative gain over the MAP of the single feature GIST.

Figure 4: MAP of Single Feature for Video Subtask on Local Testing Set

The five runs we submitted are listed in Table 1.

Table 1: MAP of Early Fusion for Image and Video Subtask on the Real Testing Set (the Official Evaluation Metric)

  subtask          features                     model   real tst
  image subtask    GIST-LBP-alex_prob           RF      0.199
  image subtask    Color-GIST-alex_prob         RF      0.204
  image subtask    Color-GIST-LBP-alex_prob     SVM     0.199
  video subtask    AcouStats-GIST               SVM     0.165
  video subtask    mfccFV-GIST                  SVM     0.170

3. CONCLUSIONS
Our results show that image interestingness prediction benefits from combining semantic-level object probability distribution features with low-level visual appearance features. For predicting video interestingness, the audio modality shows superior performance to the visual modality, and the early fusion of the two modalities can further boost the performance. In future work, we will explore ranking models for the interestingness prediction task and extract more discriminative features such as video motion features.

4. ACKNOWLEDGMENTS
This research was supported by the Research Funds of Renmin University of China (No. 14XNLQ01) and the Beijing Natural Science Foundation (No. 4142029).

5. REFERENCES
[1] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. K. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
[3] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE Transactions on Multimedia, 17(8):1174-1186, 2015.
[4] F. Eyben, M. Wöllmer, and B. Schuller. openSMILE: The Munich versatile and fast open-source audio feature extractor. In ACM International Conference on Multimedia (MM), pages 1459-1462, 2010.
[5] B. W. Schuller, S. Steidl, and A. Batliner. The INTERSPEECH 2009 Emotion Challenge. In INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009, pages 312-315, 2009.
[6] S. B. Davis. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Readings in Speech Recognition, 28(4):65-74, 1990.
[7] J. Sánchez, F. Perronnin, T. Mensink, and J. J. Verbeek. Image classification with the Fisher Vector: Theory and practice. International Journal of Computer Vision, 105(3):222-245, 2013.