DUT-MMSR at MediaEval 2017: Predicting Media Interestingness Task

Reza Aditya Permadi, Septian Gilang Permana Putra, Helmiriawan, Cynthia C. S. Liem
Delft University of Technology, The Netherlands
{r.a.permadi,septiangilangpermanaputra,helmiriawan}@student.tudelft.nl, c.c.s.liem@tudelft.nl

ABSTRACT
This paper describes our approach for the submission to the MediaEval 2017 Predicting Media Interestingness Task, which was particularly developed for the Image subtask. An approach using a late fusion strategy is employed, combining classifiers from different features by stacking them using logistic regression (LR). As the task ground truth was based on pairwise evaluation of shots or keyframe images within the same movie, next to using the precomputed features as-is, we also include a more contextual feature, considering averaged feature values over each movie. Furthermore, we also consider evaluation outcomes for the heuristic algorithm that yielded the highest MAP score on the 2016 Image subtask. Considering results obtained for the development and test sets, our late fusion method shows consistent performance on the Image subtask, but not on the Video subtask. Furthermore, clear differences can be observed between MAP@10 and MAP scores.

1 INTRODUCTION
The main challenge of the Media Interestingness task is to rank sets of images and video shots from a movie, based on their interestingness level. The evaluation metric of interest for this task is the Mean Average Precision considering the first 10 documents (MAP@10). A complete overview of the task, along with a description of the dataset, is given in [4].

Due to the similarity of this year's task to the 2016 Predicting Media Interestingness task, we considered the strategies used in last year's submissions to inform the strategy of our submission this year. [3] and [2] both use an early fusion strategy, combining features that perform relatively well individually. A late fusion strategy with average weighting is used in [6], combining classifiers from different modalities; a Support Vector Machine (SVM) is used as the final combining classifier. [8] finds that logistic regression gives good results, using CNN features which have been transformed by PCA.

[7] proposed a heuristic approach, based on observing the clear presence of people in images. This approach performed surprisingly well, even yielding the highest MAP score in the 2016 Image subtask. While we mainly focus on a fusion-based approach this year, we also include results of the best-performing configuration from [7] in unaltered form, so that a reference point to last year's state of the art is retained.

2 APPROACH
We would like to devise a strategy which is computationally efficient and yet gives good interpretability of the results. Our approach therefore consists of a fairly straightforward machine learning pipeline, evaluating the performance of individual features first, and subsequently applying classifier stacking to find the best combinations of the best-performing classifiers on the best-performing features. For our implementation, we use sklearn [9]. As base classifier, we use logistic regression and aim to find optimal parameters. Further details of our approach are described below.

2.1 Features of interest
For the Image subtask, we initially consider all the pre-computed visual features from [5]: Color Histogram (HSV), LBP, HOG, GIST, denseSIFT, and the AlexNet-based features (fully connected (fc7) layer and probability output). In all cases, we consider the pre-computed features and their dimensionalities as-is.

Considering the way in which ground truth for this task was established, human raters were asked to perform pairwise annotations on shots or keyframe images from the same movie. Hence, it is likely that overall properties of the movie may have affected the ratings: for example, if a movie is consistently shot against a dark background, a dark shot may not stand out as much as it would in another movie. In other words, we assume that the same feature vector may be associated with different interestingness levels, depending on the context of the movie it occurs in. Therefore, apart from the features pre-computed by the organizers, we also consider a contextual feature, based on the average image feature values per movie.

Let X_i be an m × n feature matrix for a movie, where m is the number of images offered for the movie, and n the length of the feature vector describing each image. For our contextual feature, we then take the column-wise average of X_i (averaging over the m images), yielding a new vector µ_i of size 1 × n. In our subsequent discussion, we will denote the contextual feature belonging to a feature type F by 'meanF' (e.g. HSV → meanHSV). This feature is then concatenated to the original feature vector.
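As an illustration (our own sketch, not the authors' released code; array and variable names are hypothetical), the contextual feature can be obtained by averaging a movie's feature matrix over its images and appending the resulting mean vector to every image's feature vector, which doubles the feature dimensionality (e.g. the 118-dimensional 'lbp, meanlbp' representation listed in Table 2):

```python
import numpy as np

def add_contextual_mean(X_movie):
    """Append the per-movie mean feature ('meanF') to each image of one movie.

    X_movie: array of shape (m, n), the n-dimensional feature vectors of the
             m images (or keyframes) belonging to a single movie.
    Returns an array of shape (m, 2 * n): [original feature | movie mean].
    """
    mu = X_movie.mean(axis=0, keepdims=True)                  # mu_i, shape (1, n)
    return np.hstack([X_movie, np.repeat(mu, X_movie.shape[0], axis=0)])

# Hypothetical usage: features_per_movie maps each movie id to its (m, n) matrix.
# X = np.vstack([add_contextual_mean(F) for F in features_per_movie.values()])
```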
For the Video subtask, in comparison to the Image subtask, we now also have information from the audio modality in the form of Mel-Frequency Cepstral Coefficients (MFCC). We further use the pre-computed fully connected layer (fc6) of a C3D deep neural network [10]. As the pre-computed features are given at the keyframe resolution, we simply average over all the keyframes to obtain the values for a particular feature representation. Again, we consider the pre-computed features and their dimensionalities as-is.

2.2 Individual feature evaluation
For each feature type, we would like to individually find the best-performing classifier that optimizes the MAP@10 value. Before feeding the feature vector into the classifier, the values are scaled to have zero mean and unit variance, considering the overall statistics of the training set. For logistic regression, the optimal penalty parameter C is searched on a logarithmic scale from 10^-9 to 10^0.

To evaluate our model, 5-fold cross-validation is used. Each fold is created based on the movies in the dataset, rather than on the individual instances of images or videos within a movie. This way, we make sure that training or prediction always considers all instances offered for a particular movie. For each evaluation, the cross-validation procedure is run 10 times to avoid dependence on specific fold compositions, and the average MAP@10 value is considered.
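The sketch below illustrates one way such a per-feature evaluation loop could be implemented, under stated assumptions: X, y, and movies (per-image feature matrix, binary interestingness labels, and movie ids) are hypothetical, already-loaded arrays; GroupKFold stands in for the movie-based fold construction (the 10 repeated fold compositions are omitted for brevity); and a simple per-movie AP@10 proxy is used in place of the official evaluation tool.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler

def grouped_cv_map10(X, y, movies, C, n_splits=5):
    """Movie-based cross-validation of logistic regression with penalty C.

    Features are standardized with the training-fold statistics; the returned
    score is the mean of a simple AP@10 proxy computed per movie."""
    scores = []
    for train_idx, test_idx in GroupKFold(n_splits).split(X, y, groups=movies):
        scaler = StandardScaler().fit(X[train_idx])            # zero mean, unit variance
        clf = LogisticRegression(C=C).fit(scaler.transform(X[train_idx]), y[train_idx])
        prob = clf.predict_proba(scaler.transform(X[test_idx]))[:, 1]
        for m in np.unique(movies[test_idx]):
            sel = movies[test_idx] == m
            order = np.argsort(-prob[sel])[:10]                # top-10 ranked items
            rel = y[test_idx][sel][order]
            prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
            scores.append((prec_at_k * rel).sum() / max(rel.sum(), 1))
    return np.mean(scores)

# Hypothetical grid search over C on a logarithmic scale, as in Section 2.2:
# best_C = max((grouped_cv_map10(X, y, movies, 10.0 ** e), 10.0 ** e)
#              for e in range(-9, 1))[1]
```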
2.3 Classifier stacking
After identifying the best classifier configuration per feature type, we stack the outputs of those classifiers and try different combinations of them, which are then trained again with several classifiers (logistic regression, SVM, AdaBoost, Linear Discriminant Analysis, and Random Forest). Finding that logistic regression and SVM perform quite well, we apply a more intensive grid search on these classifier types to optimize parameters.
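A minimal sketch of how such a stacking step could be realized with sklearn, assuming the per-feature matrices and the best per-feature C values from Table 2 are available (variable names are hypothetical; feature scaling is omitted for brevity, and the use of out-of-fold predictions is our own choice rather than a detail stated in the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

def stack_with_lr(feature_blocks, base_Cs, y, movies, fusion_C=100.0):
    """Late fusion by stacking: the probability outputs of the per-feature
    logistic regression classifiers become the input of a second-level
    logistic regression (run 2 in Table 1 uses C=100 at this level)."""
    meta_inputs = []
    for X, C in zip(feature_blocks, base_Cs):
        # Out-of-fold probabilities, so the fusion classifier is not trained
        # on predictions the base classifier made on its own training data.
        p = cross_val_predict(LogisticRegression(C=C), X, y, groups=movies,
                              cv=GroupKFold(5), method='predict_proba')[:, 1]
        meta_inputs.append(p)
    Z = np.column_stack(meta_inputs)
    return LogisticRegression(C=fusion_C).fit(Z, y)

# Hypothetical usage for the Image subtask combination of Table 1:
# fusion = stack_with_lr([X_lbp_meanlbp, X_hog, X_fc7],
#                        base_Cs=[1e-3, 1e-8, 1e-5], y=y, movies=movies)
```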
3 RESULTS AND CONCLUSIONS
Table 2 illustrates the top performance on the development set for the individual classifiers, which are then also considered on the test set. In several cases, the addition of our contextual averaged feature yields slight improvements over the individual feature alone. This improvement is the biggest for LBP, where there is an increase of 0.68 compared to the original features.

Table 2: Best individual feature classifier using LR

| Task | Feature | Size | C | MAP@10 |
|---|---|---|---|---|
| Image | lbp, meanlbp | 118 | 10^-3 | 0.1145 |
| Image | hog | 300 | 10^-8 | 0.1092 |
| Image | fc7 | 4096 | 10^-5 | 0.1081 |
| Video | gist | 512 | 10^-4 | 0.0767 |
| Video | c3d | 4096 | 10^-5 | 0.1017 |
| Video | hog | 300 | 10^-5 | 0.0744 |

The final evaluation results for the 5 runs per subtask are shown in Table 1. As before, late fusion makes use of the output probabilities of the best classifiers trained on each feature (as in Table 2), rather than the feature values themselves. Our best-performing result on the Image development set is a MAP@10 value of 0.139 (Logistic Regression, C = 100). This is an improvement over the performance of the best individual classifier in Table 2. On the test set, our best result on the Image subtask is a MAP@10 value of 0.1385 for the same classifier configuration.

Table 1: Evaluation on the development and test sets

| Subtask | Run | Features | Late fusion classifier | Dev MAP@10 | Dev MAP | Test MAP@10 | Test MAP |
|---|---|---|---|---|---|---|---|
| Image | 1 | [(lbp, meanlbp), (hog), (fc7)] | Logistic Regression, C=1 | 0.1387 | 0.3021 | 0.1310 | 0.3002 |
| Image | 2 | [(lbp, meanlbp), (hog), (fc7)] | Logistic Regression, C=100 | 0.1390 | 0.3050 | 0.1385 | 0.3075 |
| Image | 3 | [(lbp, meanlbp), (hog), (fc7)] | SVM, polynomial kernel, degree=2, gamma=0.01 | 0.1373 | 0.3028 | 0.1349 | 0.3052 |
| Image | 4 | [(lbp, meanlbp), (hog), (fc7)] | SVM, RBF kernel, gamma=0.01, C=0.01 | 0.1322 | 0.2972 | 0.1213 | 0.2887 |
| Image | 5 | 2016 unmodified heuristic feature [7] | | 0.0508 | 0.1864 | 0.0649 | 0.2105 |
| Video | 1 | [(gist), (c3d)] | Logistic Regression, C=1000 | 0.1040 | 0.2295 | 0.0443 | 0.1734 |
| Video | 2 | [(gist), (c3d)] | Logistic Regression, C=10 | 0.1045 | 0.2299 | 0.0465 | 0.1748 |
| Video | 3 | [(hog), (c3d)] | Logistic Regression, C=1 | 0.1025 | 0.2249 | 0.0478 | 0.1770 |
| Video | 4 | [(hog), (c3d)] | Logistic Regression, C=10 | 0.1025 | 0.2253 | 0.0482 | 0.1783 |
| Video | 5 | 2016 unmodified heuristic feature [7] | | 0.0530 | 0.1494 | 0.0516 | 0.1791 |

For the Video subtask, evaluation results on the test set show considerable differences in comparison to the development set. While somewhat surprising, we did notice considerable variation in results during cross-validation, and our reported development set results are an average of several cross-validation runs. As one particularly bad cross-validation result, using late fusion of GIST and c3d features with logistic regression (C = 10), with videos 0, 3, 6, 9, 12, 22, 23, 27, 28, 30, 46, 50, 60, 69 and 71 as the evaluation fold, we only obtained a MAP@10 value of 0.0496.

Considering the results for the best-performing configuration (histface) of the heuristic approach from [7], we notice clear differences between MAP@10 and MAP as metrics. Generally speaking, the heuristic approach is especially outperformed on MAP@10, implying that the clear presence of people is not the only criterion for the top-ranked items. Comparing this approach to our proposed late fusion approach, the late fusion approach consistently outperforms the heuristic approach on the Image subtask; on the Video subtask, however, the heuristic approach still achieves reasonable scores, and outperforms the late fusion approach on the test set.

In conclusion, we employed the offered pre-computed features, included a contextual averaged feature, and proposed a late fusion strategy based on the best-performing classifier settings for the best-performing features. Using fusion shows improvements over results obtained on individual features. In future work, alternative late fusion strategies as explained in [1] may be investigated.

For the Image subtask, we notice consistent results between the development and test sets. However, on the Video subtask, we notice inconsistent results on the development set in comparison to the test set. Predicting interestingness in video likely needs a more elaborate approach, which we have not yet covered thoroughly in our method. It also might be the case that the feature distribution of the test set turned out different from that of the training set, or that, more generally, the distribution of features across a video should be taken into account in more sophisticated ways, for example by considering temporal development aspects.

REFERENCES
[1] Pradeep K. Atrey, M. Anwar Hossain, Abdulmotaleb El Saddik, and Mohan S. Kankanhalli. 2010. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16, 6 (Nov 2010), 345–379. https://doi.org/10.1007/s00530-010-0182-0
[2] Shizhe Chen, Yujie Dian, and Qin Jin. 2016. RUC at MediaEval 2016: Predicting Media Interestingness Task. In MediaEval 2016 Working Notes Proceedings.
[3] Mihai Gabriel Constantin, Bogdan Boteanu, and Bogdan Ionescu. 2016. LAPI at MediaEval 2016 Predicting Media Interestingness Task. In MediaEval 2016 Working Notes Proceedings.
[4] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Michael Gygli, and Ngoc Q. K. Duong. 2017. MediaEval 2017 Predicting Media Interestingness Task. In MediaEval 2017 Working Notes Proceedings.
[5] Yu-Gang Jiang, Qi Dai, Tao Mei, Yong Rui, and Shih-Fu Chang. 2015. Super Fast Event Recognition in Internet Videos. IEEE Transactions on Multimedia 17, 8 (Aug 2015), 1174–1186.
[6] Vu Lam, Tien Do, Sang Phan, Duy-Dinh Le, and Duc Anh Duong. 2016. NII-UIT at MediaEval 2016 Predicting Media Interestingness Task. In MediaEval 2016 Working Notes Proceedings.
[7] Cynthia C. S. Liem. 2016. TUD-MMC at MediaEval 2016: Predicting Media Interestingness Task. In MediaEval 2016 Working Notes Proceedings.
[8] Jayneel Parekh and Sanjeel Parekh. 2016. The MLPBOON Predicting Media Interestingness System for MediaEval 2016. In MediaEval 2016 Working Notes Proceedings.
[9] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[10] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.