RUC at MediaEval 2017: Predicting Media Interestingness Task

Shuai Wang, Shizhe Chen, Jinming Zhao, Wenxuan Wang, Qin Jin
Renmin University of China, China
{shuaiwang,cszhe1,qjin}@ruc.edu.cn, zhaojinming@bjfu.edu.cn, wangwenxuan@hust.edu.cn

ABSTRACT
Predicting the interestingness of images or videos can greatly improve user satisfaction in many applications, such as video retrieval and recommendation. In this paper, we present our methods for the 2017 Predicting Media Interestingness Task. We propose a deep ranking model based on aural and visual modalities, which simulates the human annotation procedure for more reliable interestingness prediction.

1 INTRODUCTION
The interestingness prediction task [1] aims to predict people's general preferences for images and videos, and has a wide range of applications such as video recommendation. We propose an interestingness prediction model that combines aural and visual modalities with a deep ranking model to compute an interestingness score for a given image or video clip.

2 APPROACH

2.1 Aural-Visual Features
Aural Features. We extract 39-dimensional Mel-Frequency Cepstral Coefficients (MFCCs) from each video segment and quantize them into bag-of-audio-words (BoAW) features over a codebook of 128 codewords, one histogram describing each segment. L1 normalization turns each histogram into a probability distribution over the codebook. (A feature-extraction sketch is given at the end of this section.)

Visual Features. We utilize the officially provided features, including AlexNet fc7, AlexNet prob, ColorHist, DenseSIFT, GIST, HOG and LBP. Additionally, we consider 2048-dimensional frame-level features from the penultimate layer of an InceptionV3 network trained on the 1.2 million images of the ImageNet challenge dataset [3].

2.2 Deep Ranking Model

[Figure 1: The Network Structure of Deep Ranking Model]

2.2.1 Ranking Loss. Suppose we have a set P of video segment pairs sampled from the original video segment pool. Each pair in P contains a segment p_i with a higher interestingness score and a segment n_i with a lower one. If the function f denotes the output of the two branches, we obtain a score pair:

    (f(p_i), f(n_i)), ∀(p_i, n_i) ∈ P    (1)

We set the margin in the loss to 1 by default, following [2], and minimize the hinge loss

    minimize: Σ_{i=1}^{n} max(1 − f(p_i) + f(n_i), 0)    (2)

Using this deviation, namely the loss value, we update the network weights toward the objective of Equation 2. This training makes the network increasingly effective at recognizing video interestingness: eventually it assigns higher scores to attractive videos and lower scores to boring ones.

2.2.2 Pairwise Generation. The input data are an essential factor in the training stage, so we try different strategies for sampling data pairs; different sampling principles affect the training process and lead to large differences in results. Let x and y denote thresholds in the range of 0 to 1. Our four strategies can be presented as follows:

    f(p_i) − f(n_i) > x    (3)
    f(p_i) − f(n_i) < y    (4)
    x < f(p_i) − f(n_i) < y    (5)
    f(p_i) − f(n_i) < x  or  f(p_i) − f(n_i) > y    (6)

Our basic empirical setting is a threshold of 0.55 on the distance between the ground-truth interestingness labels of the two videos in a pair. At first we took both large and small distances into account, but this did not yield significant performance. We suppose that the network cannot learn much either from two very similar videos or from two videos with a huge gap between them, which is why such pairs lead to worse results.
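To make the aural pipeline of Section 2.1 concrete, here is a minimal sketch of the MFCC bag-of-audio-words features. It is not the authors' code: it assumes librosa for extraction (13 static coefficients plus deltas and delta-deltas, a common way of reaching 39 dimensions that the paper does not spell out) and scikit-learn's KMeans for the 128-word codebook.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

N_WORDS = 128  # codebook size from the paper

def mfcc_39(path):
    """13 MFCCs + deltas + delta-deltas -> (n_frames, 39)."""
    y, sr = librosa.load(path, sr=None)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    feats = np.vstack([m,
                       librosa.feature.delta(m),
                       librosa.feature.delta(m, order=2)])
    return feats.T

def fit_codebook(segment_paths):
    """Cluster all training frames into 128 audio words."""
    frames = np.vstack([mfcc_39(p) for p in segment_paths])
    return KMeans(n_clusters=N_WORDS, n_init=10, random_state=0).fit(frames)

def boaw_histogram(path, codebook):
    """L1-normalized histogram of codeword assignments for one segment."""
    words = codebook.predict(mfcc_39(path))
    hist = np.bincount(words, minlength=N_WORDS).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1 norm -> probability distribution
```

fit_codebook would be run once over the training segments; boaw_histogram then yields the L1-normalized distribution used as a segment's aural feature.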
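The hinge loss of Equation 2 is a standard margin ranking loss over a shared two-branch scorer. Below is a minimal sketch assuming PyTorch; the branch architecture, hidden size, and the use of 2048-dimensional InceptionV3 features as input are illustrative assumptions, not the authors' exact network.

```python
import torch
import torch.nn as nn

class ScoreBranch(nn.Module):
    """Shared scoring branch f(.) mapping a feature vector to a scalar score."""
    def __init__(self, in_dim=2048):  # e.g. InceptionV3 penultimate features
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

f = ScoreBranch()
# margin=1 as in the paper; with target y=1 this computes
# max(0, 1 - f(p) + f(n)), averaged over the batch (Equation 2 up to scale)
rank_loss = nn.MarginRankingLoss(margin=1.0)

def training_step(p_feats, n_feats, optimizer):
    """One update on a batch of (more interesting, less interesting) pairs."""
    target = torch.ones(p_feats.size(0))  # p_i should outrank n_i
    loss = rank_loss(f(p_feats), f(n_feats), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

An optimizer such as torch.optim.Adam(f.parameters()) would be passed to training_step; averaging rather than summing the per-pair losses only rescales the gradient.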
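The sampling strategies of Equations 3-6 act as filters on candidate pairs. The paper's notation reuses f, but since sampling happens before the network is trained, we read the difference as the gap between ground-truth interestingness labels; this interpretation and the helper below are our own, with x = 0.55 taken from the paper and y an illustrative placeholder.

```python
from itertools import combinations

def sample_pairs(labels, strategy, x=0.55, y=0.9):
    """Filter (more-interesting, less-interesting) index pairs by label gap.

    labels: ground-truth interestingness scores in [0, 1].
    strategy: 3..6, matching Equations 3-6 in the paper.
    """
    keep = {
        3: lambda d: d > x,            # only well-separated pairs
        4: lambda d: d < y,            # only close pairs
        5: lambda d: x < d < y,        # medium gaps
        6: lambda d: d < x or d > y,   # extremes only
    }[strategy]
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        hi, lo = (i, j) if labels[i] >= labels[j] else (j, i)
        if keep(labels[hi] - labels[lo]):
            pairs.append((hi, lo))
    return pairs
```

Under this reading, strategy (5) keeps medium gaps, matching the observation above that pairs which are too similar or too far apart teach the network little.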
3 RESULTS AND ANALYSIS

3.1 Experimental Setting
There are 7396 images or video clips in each subtask. We use the videos with ids 0 to 61 as the local training set, ids 62 to 69 as the local validation set, and ids 70 to 77 as the local testing set.

We utilize Support Vector Regression (SVR) and Random Forest regression (RF) as baseline models for comparison with the deep ranking model. For SVR, an RBF kernel is applied and the cost is searched from 2^2 to 2^10. For Random Forest, the number of trees is searched from 100 to 1000 with a step of 100, and the depth of the trees is searched from 2 to 16.
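A minimal sketch of the baseline search above, assuming scikit-learn; the 3-fold cross-validation and the data loading are our placeholders, as the paper does not state how the grids were evaluated.

```python
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def fit_baselines(X_train, y_train):
    """Grid-search the SVR and RF baselines over the ranges in Section 3.1."""
    svr = GridSearchCV(
        SVR(kernel="rbf"),
        {"C": [2.0 ** k for k in range(2, 11)]},  # cost from 2^2 to 2^10
        cv=3,
    )
    rf = GridSearchCV(
        RandomForestRegressor(random_state=0),
        {"n_estimators": list(range(100, 1001, 100)),  # 100..1000, step 100
         "max_depth": list(range(2, 17))},             # depth 2..16
        cv=3,
    )
    return svr.fit(X_train, y_train), rf.fit(X_train, y_train)
```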
3.2 Results and Discussion
In both subtasks, we compare different prediction models and features. The results are shown in Figure 2 and Figure 3, respectively.

[Figure 2: MAP of Single Feature for Image Subtask on Local Testing Set]

In the image subtask, the pairwise ranking model generally shows better performance, and the deep neural network features are the most distinctive for interestingness prediction. It is not surprising that the deep neural network displays its state-of-the-art capability.

[Figure 3: MAP of Single Feature for Video Subtask on Local Testing Set]

In the video subtask, the MFCC BoAW feature achieves a MAP of 0.151 on the local testing set. Early fusion is applied over the various visual features and MFCC BoAW, and the results are generally consistent with the conclusions of the image subtask. With our ranking model, the best MAP on the local testing set is 0.210, which surpasses the other results. The fusion, however, does not improve performance much for any single visual feature; we suppose this is due to the low dimensionality of the MFCC BoAW feature.

Given the experimental results on the local testing set, we pick the best-performing model and feature, namely the pairwise ranking model and the InceptionV3 feature, as our final choice for the submissions. The official results for both subtasks are shown in Table 1. We use two types of input in the InceptionV3 experiments: original images and normalized images, where each pixel of a normalized image is scaled into [0, 1]. As the results show, the InceptionV3 feature from the original images performs slightly better than that from the normalized ones on the official testing set.

Table 1: Results of the official submitted runs

Run  Subtask  Input       MAP     MAP@10
1    Image    img norm    0.2655  0.0940
2    Video    img norm    0.1830  0.0589
3    Video    img origin  0.1897  0.0637

The results also show that image interestingness prediction is generally more accurate than the video subtask. We believe it is easier to extract distinctive features from static images than from videos. Firstly, audio conveys cues completely different from images, and the fusion of the two modalities may present entirely new aspects of interestingness. Secondly, dynamic properties such as scene changes make videos more informative, so we cannot capture their interestingness precisely from the static frames of a video alone.

After investigating the testing set, we find some interesting phenomena. For the image subtask, images containing varied scenes can be ranked precisely, but a series of images with dark spectacles gains a low MAP. For the video subtask, videos with unchanging audio content obtain a relatively low MAP.

4 CONCLUSIONS
We develop an interestingness prediction system based on pairwise ranking. Compared with the basic regression models, we observe the effectiveness of the ranking model, and the InceptionV3 feature proves the most distinctive for the interestingness prediction task. In the training process, optimizing the data-pair sampling strategy is a fundamental and essential point. In the future, we will also use more temporal cues to ensure that the information within the internal frames of the same video is not wasted.

ACKNOWLEDGMENTS
This work was supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202.

REFERENCES
[1] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Michael Gygli, and Ngoc Q. K. Duong. 2017. MediaEval 2017 Predicting Media Interestingness Task. In Proc. of the MediaEval 2017 Workshop, Sept. 13-15, 2017, Dublin, Ireland.
[2] Ting Yao, Tao Mei, and Yong Rui. 2016. Highlight detection with pairwise deep ranking for first-person video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 982-990.
[3] Hangjun Ye and Guangyou Xu. 2003. Hierarchical indexing scheme for fast search in a large-scale image database. 5286, 3-4 (2003), 974-979.