RUC at MediaEval 2017: Predicting Media Interestingness Task

Shuai Wang, Shizhe Chen, Jinming Zhao, Wenxuan Wang, Qin Jin
Renmin University of China, China
{shuaiwang,cszhe1,qjin}@ruc.edu.cn, zhaojinming@bjfu.edu.cn, wangwenxuan@hust.edu.cn

ABSTRACT
Predicting the interestingness of images or videos can greatly improve user satisfaction in many applications, such as video retrieval and recommendation. In this paper, we present our methods for the 2017 Predicting Media Interestingness Task. We propose a deep ranking model based on aural and visual modalities, which simulates the human annotation procedure for more reliable interestingness prediction.

1 INTRODUCTION
The interestingness prediction task [1] aims to predict people's general preferences for images and videos, and has a wide range of applications such as video recommendation. We propose an interestingness prediction model that combines aural and visual modalities with a deep ranking model to compute an interestingness score for a given image or video clip.

2 APPROACH

2.1 Aural-Visual Features
Aural Features. We extract 39-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) from each video segment and quantize them into bag-of-audio-words (BoAW) features over a codebook of 128 codewords, one histogram describing each segment. L1 normalization turns each histogram into a probability distribution over the codebook. (A feature-extraction sketch is given at the end of this section.)

Visual Features. We utilize the officially provided features, including AlexNet fc7, AlexNet prob, ColorHist, DenseSIFT, GIST, HOG and LBP. Additionally, we consider 2048-dimensional frame-level features from the penultimate layer of an InceptionV3 network trained on the 1.2 million images of the ImageNet challenge dataset [3].

2.2 Deep Ranking Model

[Figure 1: The Network Structure of Deep Ranking Model]

2.2.1 Ranking Loss. Suppose we have a set P of video segment pairs sampled from the original video segment pool. Each pair in P contains a segment p_i with a higher interestingness score and a segment n_i with a lower one. If the function f denotes the output of the two branches, we obtain a score pair:

    (f(p_i), f(n_i)), ∀(p_i, n_i) ∈ P    (1)

We set the margin in the loss to 1 by default, following [2], and minimize the hinge loss

    minimize: Σ_{i=1}^{n} max(1 − f(p_i) + f(n_i), 0)    (2)

Using this deviation, namely the loss value, we update the network weights toward the objective of Equation 2. This training makes the network increasingly effective at recognizing video interestingness: eventually it assigns higher scores to attractive videos and lower scores to boring ones.

2.2.2 Pairwise Generation. The input data are an essential factor in the training stage, so we try different strategies for sampling data pairs; different sampling principles affect the training process and lead to large differences in results. Let x and y denote thresholds in the range of 0 to 1. Our four strategies can be presented as follows:

    f(p_i) − f(n_i) > x    (3)
    f(p_i) − f(n_i) < y    (4)
    x < f(p_i) − f(n_i) < y    (5)
    f(p_i) − f(n_i) < x  or  f(p_i) − f(n_i) > y    (6)

Our basic empirical setting is a threshold of 0.55 on the distance between the ground-truth interestingness labels of the two videos in a pair. At first we took both large and small distances into account, but this did not yield significant performance. We suppose that the network cannot learn much either from two very similar videos or from two videos with a huge gap between them, which is why such pairs lead to worse results.
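To make the aural pipeline of Section 2.1 concrete, here is a minimal sketch of the MFCC bag-of-audio-words features. It is not the authors' code: it assumes librosa for extraction (13 static coefficients plus deltas and delta-deltas, a common way of reaching 39 dimensions that the paper does not spell out) and scikit-learn's KMeans for the 128-word codebook.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

N_WORDS = 128  # codebook size from the paper

def mfcc_39(path):
    """13 MFCCs + deltas + delta-deltas -> (n_frames, 39)."""
    y, sr = librosa.load(path, sr=None)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = np.vstack([m,
                       librosa.feature.delta(m),
                       librosa.feature.delta(m, order=2)])
    return feats.T

def fit_codebook(segment_paths):
    """Cluster all training frames into 128 audio words."""
    frames = np.vstack([mfcc_39(p) for p in segment_paths])
    return KMeans(n_clusters=N_WORDS, n_init=10, random_state=0).fit(frames)

def boaw_histogram(path, codebook):
    """L1-normalized histogram of codeword assignments for one segment."""
    words = codebook.predict(mfcc_39(path))
    hist = np.bincount(words, minlength=N_WORDS).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1 norm -> probability distribution
```

fit_codebook would be run once over the training segments; boaw_histogram then yields the L1-normalized distribution used as a segment's aural feature.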
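The hinge loss of Equation 2 is a standard margin ranking loss over a shared two-branch scorer. Below is a minimal sketch assuming PyTorch; the branch architecture, hidden size, and the use of 2048-dimensional InceptionV3 features as input are illustrative assumptions, not the authors' exact network.

```python
import torch
import torch.nn as nn

class ScoreBranch(nn.Module):
    """Shared scoring branch f(.) mapping a feature vector to a scalar score."""
    def __init__(self, in_dim=2048):  # e.g. InceptionV3 penultimate features
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

f = ScoreBranch()
# margin=1 as in the paper; with target y=1 this computes
# max(0, 1 - f(p) + f(n)), averaged over the batch (Equation 2 up to scale)
rank_loss = nn.MarginRankingLoss(margin=1.0)

def training_step(p_feats, n_feats, optimizer):
    """One update on a batch of (more interesting, less interesting) pairs."""
    target = torch.ones(p_feats.size(0))  # p_i should outrank n_i
    loss = rank_loss(f(p_feats), f(n_feats), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

An optimizer such as torch.optim.Adam(f.parameters()) would be passed to training_step; averaging rather than summing the per-pair losses only rescales the gradient.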
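The sampling strategies of Equations 3-6 act as filters on candidate pairs. The paper's notation reuses f, but since sampling happens before the network is trained, we read the difference as the gap between ground-truth interestingness labels; this interpretation and the helper below are our own, with x = 0.55 taken from the paper and y an illustrative placeholder.

```python
from itertools import combinations

def sample_pairs(labels, strategy, x=0.55, y=0.9):
    """Filter (more-interesting, less-interesting) index pairs by label gap.

    labels: ground-truth interestingness scores in [0, 1].
    strategy: 3..6, matching Equations 3-6 in the paper.
    """
    keep = {
        3: lambda d: d > x,            # only well-separated pairs
        4: lambda d: d < y,            # only close pairs
        5: lambda d: x < d < y,        # medium gaps
        6: lambda d: d < x or d > y,   # extremes only
    }[strategy]
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        hi, lo = (i, j) if labels[i] >= labels[j] else (j, i)
        if keep(labels[hi] - labels[lo]):
            pairs.append((hi, lo))
    return pairs
```

Under this reading, strategy (5) keeps medium gaps, matching the observation above that pairs which are too similar or too far apart teach the network little.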
3 RESULTS AND ANALYSIS

3.1 Experimental Setting
There are 7396 images or video clips in each subtask. We use the videos with ids 0 to 61 as the local training set, ids 62 to 69 as the local validation set, and ids 70 to 77 as the local testing set.

We utilize Support Vector Regression (SVR) and Random Forest regression (RF) as baseline models for comparison with the deep ranking model. For SVR, an RBF kernel is applied and the cost is searched from 2^2 to 2^10. For Random Forest, the number of trees is searched from 100 to 1000 with a step of 100, and the depth of the trees is searched from 2 to 16.
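A minimal sketch of the baseline search above, assuming scikit-learn; the 3-fold cross-validation and the data loading are our placeholders, as the paper does not state how the grids were evaluated.

```python
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def fit_baselines(X_train, y_train):
    """Grid-search the SVR and RF baselines over the ranges in Section 3.1."""
    svr = GridSearchCV(
        SVR(kernel="rbf"),
        {"C": [2.0 ** k for k in range(2, 11)]},  # cost from 2^2 to 2^10
        cv=3,
    )
    rf = GridSearchCV(
        RandomForestRegressor(random_state=0),
        {"n_estimators": list(range(100, 1001, 100)),  # 100..1000, step 100
         "max_depth": list(range(2, 17))},             # depth 2..16
        cv=3,
    )
    return svr.fit(X_train, y_train), rf.fit(X_train, y_train)
```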
3.2 Results and Discussion
In both subtasks, we compare different prediction models and features. The results are shown in Figure 2 and Figure 3, respectively.

[Figure 2: MAP of Single Feature for Image Subtask on Local Testing Set]

In the image subtask, the pairwise ranking model generally shows better performance, and the deep neural network features are the most distinctive for interestingness prediction. It is not surprising that the deep neural network displays its state-of-the-art capability.

[Figure 3: MAP of Single Feature for Video Subtask on Local Testing Set]

In the video subtask, the MFCC BoAW feature achieves a MAP of 0.151 on the local testing set. Early fusion is applied over the various visual features and MFCC BoAW, and the results are generally consistent with the conclusions of the image subtask. With our ranking model, the best MAP on the local testing set is 0.210, which surpasses the other results. The fusion, however, does not improve performance much for any single visual feature; we suppose this is due to the low dimensionality of the MFCC BoAW feature.

Given the experimental results on the local testing set, we pick the best-performing model and feature, namely the pairwise ranking model and the InceptionV3 feature, as our final choice for the submissions. The official results for both subtasks are shown in Table 1. We use two types of input in the InceptionV3 experiments: original images and normalized images, where each pixel of a normalized image is scaled into [0, 1]. As the results show, the InceptionV3 feature from the original images performs slightly better than that from the normalized ones on the official testing set.

Table 1: Results of the official submitted runs

Run  Subtask  Input       MAP     MAP@10
1    Image    img norm    0.2655  0.0940
2    Video    img norm    0.1830  0.0589
3    Video    img origin  0.1897  0.0637

The results also show that image interestingness prediction is generally more accurate than the video subtask. We believe it is easier to extract distinctive features from static images than from videos. Firstly, audio conveys cues completely different from images, and the fusion of the two modalities may present entirely new aspects of interestingness. Secondly, dynamic properties such as scene changes make videos more informative, so we cannot capture their interestingness precisely from the static frames of a video alone.

After investigating the testing set, we find some interesting phenomena. For the image subtask, images containing varied scenes can be ranked precisely, but a series of images with dark spectacles gains a low MAP. For the video subtask, videos with unchanging audio content obtain a relatively low MAP.

4 CONCLUSIONS
We develop an interestingness prediction system based on pairwise ranking. Compared with the basic regression models, we observe the effectiveness of the ranking model, and the InceptionV3 feature proves the most distinctive for the interestingness prediction task. In the training process, optimizing the data-pair sampling strategy is a fundamental and essential point. In the future, we will also use more temporal cues to ensure that the information within the internal frames of the same video is not wasted.

ACKNOWLEDGMENTS
This work was supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202.

REFERENCES
[1] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Michael Gygli, and Ngoc Q. K. Duong. 2017. MediaEval 2017 Predicting Media Interestingness Task. In Proc. of the MediaEval 2017 Workshop, Sept. 13-15, 2017, Dublin, Ireland.
[2] Ting Yao, Tao Mei, and Yong Rui. 2016. Highlight detection with pairwise deep ranking for first-person video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 982-990.
[3] Hangjun Ye and Guangyou Xu. 2003. Hierarchical indexing scheme for fast search in a large-scale image database. 5286, 3-4 (2003), 974-979.