RUC at MediaEval 2016: Predicting Media Interestingness Task

Shizhe Chen, Yujie Dian, Qin Jin
School of Information, Renmin University of China, China
{cszhe1, dianyujie-blair, qjin}@ruc.edu.cn

ABSTRACT
Measuring media interestingness has a wide range of applications such as video recommendation. This paper presents our approach to the MediaEval 2016 Predicting Media Interestingness Task. There are two subtasks: image interestingness prediction and video interestingness prediction. For both subtasks, we utilize hand-crafted features and CNN features as our visual features. For the video subtask, we also extract acoustic features, including MFCC Fisher Vectors and statistical acoustic features. We train SVM and Random Forest classifiers, and early fusion is applied to combine different features. Experimental results show that combining semantic-level and low-level visual features is beneficial for image interestingness prediction. When predicting video interestingness, the audio modality has superior performance, and the early fusion of the visual and audio modalities can further boost the performance.

Figure 1: An Overview of the System Framework (CNN and hand-crafted visual features, plus statistical acoustic and MFCC FV audio features, are combined by early fusion and classified with SVM or Random Forest to produce an interestingness probability)

1. SYSTEM DESCRIPTION
An overview of our framework for the MediaEval 2016 Predicting Media Interestingness Task [1] is shown in Figure 1. For image interestingness prediction, we use hand-crafted visual features and CNN features. For the video subtask, we utilize both visual and audio cues in the video to predict interestingness. Early fusion is applied to combine different features. In the following subsections, we describe the feature representation and the prediction model in detail.

1.1 Feature Extraction

1.1.1 Visual Features
DCNNs are the state-of-the-art models in many visual tasks such as object detection and scene recognition. In this task, we extract activations from the penultimate layer and the last softmax layer of AlexNet and Inception-v3 [2], both pre-trained on ImageNet, as our image-level CNN features, namely alex_fc7, alex_prob, inc_fc and inc_prob respectively. The features extracted from the last layer are the probability distribution over 1000 object classes, which describes the semantic-level concepts people might show interest in. The penultimate-layer features are an abstraction of the image content and have shown great generalization ability across different tasks. We also use the hand-crafted visual features provided in [3], including Color Histogram, GIST, LBP, HOG and Dense SIFT, to cover different aspects of the images. For the video subtask, mean pooling is applied over all the image features of a video clip to generate video-level features.

1.1.2 Acoustic Features
Statistical Acoustic Features: Statistical acoustic features have proved effective in speech emotion recognition. We use the open-source toolkit OpenSMILE [4] to extract statistical acoustic features with the configuration of the INTERSPEECH 2009 Emotion Challenge [5]. Low-level acoustic features such as energy, pitch, jitter and shimmer are first extracted over a short-time window, and statistical functions such as mean and max are then applied over the set of low-level features to generate sentence-level features.

MFCC-based Features: Mel-Frequency Cepstral Coefficients (MFCCs) [6] are the most widely used low-level features and have been successfully applied in many speech tasks. We therefore use MFCCs as our frame-level features, with a window of 25ms and a shift of 10ms. Fisher Vector encoding (FV) [7] is applied to transform the variable-length MFCC sequence into a sentence-level feature. We train a Gaussian Mixture Model (GMM) with 8 mixtures as our audio word dictionary, and then compute, for each audio clip, the gradient of the log-likelihood with respect to the GMM parameters, i.e., the direction in which the model would best fit the data. L2 normalization is applied to the mfccFV features.
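For concreteness, the following is a minimal sketch of how the image-level CNN features of Sect. 1.1.1 and the video-level mean pooling could be computed. It assumes PyTorch/torchvision, which the paper does not specify, and covers only the AlexNet features (alex_fc7 and alex_prob); the Inception-v3 features follow the same pattern.

```python
# Hypothetical sketch of alex_fc7 / alex_prob extraction and video-level mean
# pooling (Sect. 1.1.1). PyTorch/torchvision are assumptions, not the paper's
# stated tooling.
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

alexnet = models.alexnet(pretrained=True).eval()   # ImageNet-pretrained AlexNet

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def alex_features(img: Image.Image):
    """Return (alex_fc7, alex_prob) for one image."""
    x = preprocess(img).unsqueeze(0)
    h = alexnet.avgpool(alexnet.features(x)).flatten(1)
    fc7 = alexnet.classifier[:6](h)                      # 4096-d penultimate activations
    prob = torch.softmax(alexnet.classifier[6](fc7), 1)  # probabilities over 1000 objects
    return fc7.squeeze(0).numpy(), prob.squeeze(0).numpy()

def video_level(frame_features):
    """Mean-pool a list of frame-level vectors into one video-level feature."""
    return np.stack(frame_features).mean(axis=0)
```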
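Similarly, the MFCC Fisher Vector pipeline of Sect. 1.1.2 can be sketched as below. The sketch assumes librosa for MFCC extraction and scikit-learn for the 8-mixture GMM, neither of which is named in the paper, and implements the standard Fisher Vector gradients with respect to the GMM means and variances [7] followed by the L2 normalization mentioned above; the sampling rate and MFCC dimensionality are illustrative assumptions.

```python
# Hypothetical sketch of the mfccFV feature (Sect. 1.1.2): 25 ms / 10 ms MFCC
# frames, an 8-mixture GMM "audio word" dictionary, and a Fisher Vector per clip.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, sr=16000, n_mfcc=13):
    """Frame-level MFCCs with a 25 ms window and a 10 ms shift."""
    y, sr = librosa.load(wav_path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return m.T                                            # (n_frames, n_mfcc)

def train_audio_dictionary(frame_lists, n_mixtures=8):
    """8-mixture diagonal-covariance GMM used as the audio word dictionary."""
    gmm = GaussianMixture(n_components=n_mixtures, covariance_type='diag',
                          max_iter=200, random_state=0)
    gmm.fit(np.vstack(frame_lists))
    return gmm

def mfcc_fisher_vector(frames, gmm):
    """Fisher Vector of one clip: gradients of the log-likelihood w.r.t. the
    GMM means and variances [7], followed by L2 normalization."""
    X = np.atleast_2d(frames)                             # (T, D)
    T, D = X.shape
    q = gmm.predict_proba(X)                              # (T, K) posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (X[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    g_mu = (q[:, :, None] * diff).sum(axis=0) / (T * np.sqrt(w)[:, None])
    g_var = (q[:, :, None] * (diff ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])         # 2 * K * D dimensions
    return fv / (np.linalg.norm(fv) + 1e-12)              # L2-norm, as in the paper
```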
1.2 Classification Model
For both the image and the video systems, we train binary SVM and Random Forest models as our interestingness classifiers. The hyper-parameters of the models are selected by grid search according to the mean average precision (MAP) on our local validation set. For the SVM, an RBF kernel is applied and the cost is searched from 2^-2 to 2^10. For the Random Forest, the number of trees is set to 100 and the tree depth is searched from 2 to 16.
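A minimal sketch of this model selection step, together with the early fusion by feature concatenation used throughout the paper, might look as follows. It assumes scikit-learn and pre-computed per-sample feature matrices, and, for brevity, it scores candidates with the average precision over the whole validation set rather than the task's official MAP evaluation.

```python
# Hypothetical sketch of early fusion (Sect. 1) and hyper-parameter grid search
# (Sect. 1.2): SVM with RBF kernel, C in 2^-2..2^10, and Random Forest with 100
# trees and depth 2..16, selected by validation average precision.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

def early_fusion(*feature_blocks):
    """Concatenate per-sample feature blocks along the feature axis.
    (Per-block scaling may be needed in practice; it is omitted here.)"""
    return np.hstack(feature_blocks)

def select_model(X_tr, y_tr, X_va, y_va):
    """Grid-search SVM and Random Forest, keeping the model with the best
    validation average precision; returns (model, score)."""
    candidates = [SVC(kernel='rbf', C=2.0 ** c, probability=True)
                  for c in range(-2, 11)]
    candidates += [RandomForestClassifier(n_estimators=100, max_depth=d)
                   for d in range(2, 17)]
    best, best_ap = None, -1.0
    for clf in candidates:
        clf.fit(X_tr, y_tr)
        scores = clf.predict_proba(X_va)[:, 1]   # interestingness probability
        ap = average_precision_score(y_va, scores)
        if ap > best_ap:
            best, best_ap = clf, ap
    return best, best_ap
```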
2. EXPERIMENTS

2.1 Experimental Setting
There are 5054 images or video clips in total in the development set of each subtask. We use the videos with ids from 0 to 40 (4014 samples) as the local training set, videos 41 to 45 (468 samples) as the local validation set, and the remaining videos (572 samples) as the local testing set. We use the whole development set to train the final submitted systems.

2.2 Experimental Results
Figure 2 shows the best MAP of the SVM and Random Forest classifiers for each kind of feature in the image subtask. The penultimate CNN features inc_fc and alex_fc7 achieve the top performance among all the visual features. However, the probability features extracted from the CNNs do not perform well on their own.

Figure 2: MAP of Single Feature for Image Subtask on Local Testing Set

We then use early fusion to concatenate different visual features. Figure 3 shows some of the fusion results. Combining alex_prob with the other visual appearance features significantly improves the classification performance, which shows that the semantic-level features and the low-level appearance features are complementary. However, concatenating alex_fc7 with the hand-crafted features does not bring any improvement.

Figure 3: MAP of Early Fusion for Image Subtask on Local Testing Set

For video interestingness prediction, Figure 4 presents the performance of each single feature. The audio modality outperforms the visual modality, and mfccFV achieves the best performance. Fusing acoustic features with the best visual feature, GIST, is beneficial: for example, AcouStats-GIST achieves a MAP of 20.80%, a 19% relative gain over the MAP of the single feature GIST.

Figure 4: MAP of Single Feature for Video Subtask on Local Testing Set

The five runs we submitted are listed in Table 1.

Table 1: MAP of Early Fusion for Image and Video Subtask on the Real Testing Set (the Official Evaluation Metric)

  subtask          features                     model   real tst
  image subtask    GIST-LBP-alex_prob           RF      0.199
  image subtask    Color-GIST-alex_prob         RF      0.204
  image subtask    Color-GIST-LBP-alex_prob     SVM     0.199
  video subtask    AcouStats-GIST               SVM     0.165
  video subtask    mfccFV-GIST                  SVM     0.170

3. CONCLUSIONS
Our results show that image interestingness prediction benefits from combining semantic-level object probability distribution features with low-level visual appearance features. For predicting video interestingness, the audio modality shows superior performance to the visual modality, and the early fusion of the two modalities can further boost the performance. In future work, we will explore ranking models for the interestingness prediction task and extract more discriminative features such as video motion features.

4. ACKNOWLEDGMENTS
This research was supported by the Research Funds of Renmin University of China (No. 14XNLQ01) and the Beijing Natural Science Foundation (No. 4142029).

5. REFERENCES
[1] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. K. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
[3] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE Transactions on Multimedia, 17(8):1174-1186, 2015.
[4] F. Eyben, M. Wöllmer, and B. Schuller. openSMILE: The Munich versatile and fast open-source audio feature extractor. In ACM International Conference on Multimedia (MM), pages 1459-1462, 2010.
[5] B. W. Schuller, S. Steidl, and A. Batliner. The INTERSPEECH 2009 Emotion Challenge. In INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009, pages 312-315, 2009.
[6] S. B. Davis. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Readings in Speech Recognition, 28(4):65-74, 1990.
[7] J. Sánchez, F. Perronnin, T. Mensink, and J. J. Verbeek. Image classification with the Fisher Vector: Theory and practice. International Journal of Computer Vision, 105(3):222-245, 2013.