The MLPBOON Predicting Media Interestingness System for MediaEval 2016

Jayneel Parekh                                  Sanjeel Parekh
Indian Institute of Technology, Bombay, India   Technicolor, Cesson Sévigné, France
jayneelparekh@gmail.com                         sanjeelparekh@gmail.com

ABSTRACT
This paper describes the system developed by team MLPBOON for the MediaEval 2016 Predicting Media Interestingness Image Subtask. After experimenting with various features and classifiers on the development dataset, our final system uses CNN features (fc7 layer of AlexNet) as the input representation and logistic regression as the classifier. For the proposed method, the MAP of the best run reaches 0.229.

1. INTRODUCTION
The MediaEval 2016 Predicting Media Interestingness Task [1] requires automatically selecting images and/or video segments that are considered the most interesting for a common viewer. We focus on the image interestingness subtask, which involves automatically identifying, from a given set of key-frames extracted from a certain movie, the images that viewers report to be interesting.
We use only the visual content and no additional metadata.

The solution should essentially involve encoding into features the many generic factors that humans take into account while judging the interestingness of an image [3]. However, this task presents an intrinsic difficulty that makes it extremely challenging to build reliable datasets and features: subjectivity [2]. One can observe the high level of subjectivity by realizing that a given image could be labeled as highly interesting or non-interesting depending upon the part of the world in which it is surveyed. Current methods of annotating datasets tend to reduce this factor [2], but none can eliminate it.

Therefore, while taking subjectivity into account, we wish to determine features that solve the task satisfactorily. In this context, several efforts have been made to understand the factors that affect, or cues that contribute to, the interestingness of an image, even at an individual level. Katti et al. [5] attempt to understand the effect of human cognition and perception on interestingness. Work by Gygli et al. [3] shows how interestingness is related to features capturing unusualness, aesthetics and general preferences, such as GIST, SIFT and color histograms. Further, [8] tries to learn attributes that can be used to predict interestingness at an individual level. Moreover, recent advances in the application of neural networks to tasks in image processing and computer vision make the use of convolutional neural network [4] based features very promising [6].

Our approach was inspired by the following line of thought: if the right set of features is identified, then any simple classifier should produce good results. Thus, we decided upon the proposed system after experimenting with different feature sets.

2. SYSTEM DESCRIPTION
We have opted for a traditional machine learning pipeline: feature selection and preprocessing, training of a classification model, and then prediction.

Given the training data feature matrix X ∈ R^(N×F), consisting of N examples each described by an F-dimensional vector, we first standardize it and apply principal component analysis (PCA) to reduce its dimensionality. The transformed feature matrix Z = (z_i)_i ∈ R^(N×M) is used to experiment with various classifiers. Here M depends on the number of top eigenvalues we wish to retain.

After preliminary testing (discussed in Section 3), we decided to move ahead with logistic regression as our classifier. Logistic regression minimizes the following cost function [9]:

    Cost(w) = C * Σ_{i=1}^{N} log(1 + e^(−y_i w^T z_i)) + (1/2) w^T w,    (1)

where w denotes the weight vector, C > 0 the penalty parameter, z_i the feature vector of the i-th training instance, and y_i ∈ {−1, +1} its label (−1 if non-interesting and +1 if interesting). Note that a column of ones is appended to Z to include the hyperplane intercept as a coefficient of w. Given a test data instance t, its label y is assigned according to equation (2):

    y = 1 if w^T t ≥ 0, and y = 0 otherwise.    (2)
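The pipeline above (standardize, PCA, then logistic regression with penalty parameter C) can be sketched as follows. This is a minimal illustration assuming scikit-learn, with synthetic placeholder arrays standing in for the actual feature matrices; the dimensions and C value are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))    # stand-in for the N x F feature matrix
y_train = rng.integers(0, 2, size=100)  # 1 = interesting, 0 = non-interesting
X_test = rng.normal(size=(10, 20))

# Standardize, then reduce to M components via PCA.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=5).fit(scaler.transform(X_train))  # M = 5 here
Z_train = pca.transform(scaler.transform(X_train))

# scikit-learn's regularized objective matches the form of equation (1):
# 0.5 * w^T w + C * sum_i log(1 + exp(-y_i w^T z_i)).
clf = LogisticRegression(C=0.001).fit(Z_train, y_train)

# predict() is equivalent to the thresholding rule of equation (2).
Z_test = pca.transform(scaler.transform(X_test))
y_pred = clf.predict(Z_test)
```

Note that the test data must be transformed with the scaler and PCA fitted on the training data, as in the last two lines above.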
After experimenting with various descriptors (discussed later in Section 3.1), we use CNN features extracted from the fc7 layer of AlexNet as our input feature representation for building X. In the following section we discuss the experimental results obtained by varying the parameters of the system described above.

[Figure 1: Block diagram for the proposed system: input training data X → standardize → PCA → transformed data Z → logistic regression → classify test samples using the learnt weight vector w]

3. EXPERIMENTAL VALIDATION
The training dataset consisted of 5054 images extracted from 52 movie trailers, while the test data consisted of 2342 images extracted from 26 movie trailers. [1] gives complete information about the preparation of the dataset. WEKA and scikit-learn [7] were used to implement and test various configurations.

(Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.)

Run  No. of features  C       MAP     Precision  Recall
1    780              0.001   0.2205  0.140      0.581
2    700              0.008   0.2023  0.128      0.381
3    700              0.05    0.1941  0.131      0.348
4    400              0.1     0.2170  0.137      0.427
5    2016             0.0001  0.2296  0.141      0.726

Table 1: Run submission results (MAP was the official metric).
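The official metric, mean average precision (MAP), can be illustrated with scikit-learn's average_precision_score. The per-trailer grouping and the toy labels and scores below are assumptions made purely for illustration, not the task's actual data.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy ground-truth labels (1 = interesting) and classifier scores for
# two hypothetical trailers.
labels = [np.array([1, 0, 0, 1]), np.array([0, 1, 0])]
scores = [np.array([0.9, 0.8, 0.4, 0.7]), np.array([0.5, 0.4, 0.3])]

# MAP = mean of the per-trailer average precision values.
ap_per_trailer = [average_precision_score(y, s) for y, s in zip(labels, scores)]
map_score = float(np.mean(ap_per_trailer))  # 2/3 for these toy values
```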
3.1 Results and Discussion
The run submission results are given in Table 1. The table reports the mean average precision (MAP), the official metric, along with precision and recall on the interesting images for each run, together with the corresponding penalty parameter C and the number of transformed features retained after PCA. The general strategy for the run submissions was to first fix the number of PCA features and subsequently tune C for the best MAP on the development data.

As observed, C decreases as the number of PCA features increases. This trend can possibly be explained as a way to avoid overfitting. The 5th run gives the best MAP; however, the MAP values of all runs are comparable. This points towards the utility of dimensionality reduction, which significantly reduces redundancy without much affecting the results. It was observed that 400 and 780 transformed features capture about 95% and 98% of the variance of the data, respectively. The difference between the MAP on development and test data was very small for all runs, lying between 0.01 and 0.03. The maximum MAP on the development data was 0.24, with the 1st run's system configuration.
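The relation between the number of retained PCA components and the captured variance (about 95% at 400 components and 98% at 780 in our case) can be sketched as follows. The matrix here is a small synthetic stand-in, so the resulting component counts are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))          # stand-in for the standardized features
Xs = StandardScaler().fit_transform(X)

# Fit a full PCA and accumulate the explained-variance ratios.
pca = PCA().fit(Xs)
cum = np.cumsum(pca.explained_variance_ratio_)

# Smallest M whose components capture at least 95% of the variance.
n_95 = int(np.searchsorted(cum, 0.95) + 1)
```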
System Design Decisions
We experimented with the following features provided by the task [1]: CNN (fc7 and prob layers of AlexNet), GIST and color histogram (HSV space) features [4], and trained their different combinations with various machine learning classifiers such as SVM, decision trees and logistic regression, using 4-fold or 5-fold cross-validation on the development data. In this section we give the rationale for the features and classifier selected in the proposed system.

Features: The results on the development data using the GIST (512-dimensional) and color histogram (128-dimensional) features were not very positive with any classifier. The CNN features (4096-dimensional fc7 and 1000-dimensional prob layers) did show significant improvements, with the fc7 features in particular performing better than the prob features. We also observed that combining the CNN features with GIST and color histogram features gave performance similar to using the CNN features alone. Hence we went forward with using just the CNN features, in particular from the fc7 layer.

Classifier: After selecting the CNN features, we experimented with various classifiers and parameter settings. Specifically, we tried (1) SVM with linear, polynomial and RBF kernels, (2) a ridge regression classifier, (3) a stochastic gradient descent classifier with hinge, log, modified-Huber and squared-hinge loss functions, (4) logistic regression [7], and (5) random trees (WEKA). In general, logistic regression performed better than the other classifiers, with a MAP greater than 0.2 on the training data. The performance of SVM was reasonable with the prob features, but it did not show any significant improvement with the fc7 features; it particularly did not perform well with the RBF kernel. Hence we went ahead with logistic regression.
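A classifier comparison of the kind described above can be sketched with cross-validation in scikit-learn. The candidate set below covers only a subset of the classifiers we tried, and the features and labels are synthetic stand-ins for the development data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(2)
Z = rng.normal(size=(100, 10))          # stand-in for PCA-transformed features
y = rng.integers(0, 2, size=100)

# A few of the candidate classifiers mentioned above.
candidates = {
    "logistic": LogisticRegression(C=0.001),
    "ridge": RidgeClassifier(),
    "sgd_hinge": SGDClassifier(loss="hinge", random_state=0),
    "svm_rbf": SVC(kernel="rbf"),
}

# Mean 5-fold cross-validation accuracy per classifier.
scores = {name: cross_val_score(clf, Z, y, cv=5).mean()
          for name, clf in candidates.items()}
```

In the actual experiments the comparison was made on MAP rather than accuracy; cross_val_score's scoring argument could be changed accordingly.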
4. CONCLUSIONS
In summary, we have presented a system for interestingness prediction in images. Despite its simplicity, it obtains reasonable mean average precision values, the maximum being 0.229. From an analysis of the system's development history, we think that the selection of features was more important than the selection of the classifier. We believe it would be useful to identify and incorporate high-level features describing image composition and object expressivity, such as facial expressions. Moreover, to analyze the issue of subjectivity, it would be interesting to check inter-annotator agreement over the test images.

5. REFERENCES
[1] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[2] Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, and Y. Yao. Interestingness prediction by robust learning to rank. In European Conference on Computer Vision, pages 488–503. Springer, 2014.
[3] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1633–1640, 2013.
[4] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE Transactions on Multimedia, 17(8):1–13, 2015.
[5] H. Katti, K. Y. Bin, C. T. Seng, and M. Kankanhalli. Interestingness discrimination in images.
[6] A. Khosla, A. Das Sarma, and R. Hamid. What makes an image popular? In Proceedings of the 23rd International Conference on World Wide Web, pages 867–876. ACM, 2014.
[7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[8] M. Soleymani. The quest for visual interest. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 919–922. ACM, 2015.
[9] H.-F. Yu, F.-L. Huang, and C.-J. Lin. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning, 85(1-2):41–75, 2011.