The IITB Predicting Media Interestingness System for MediaEval 2017

Jayneel Parekh (Indian Institute of Technology, Bombay, India), jayneelparekh@gmail.com
Harshvardhan Tibrewal (Indian Institute of Technology, Bombay, India), hrtibrewal@gmail.com
Sanjeel Parekh (Technicolor, Cesson Sévigné, France), sanjeelparekh@gmail.com

Copyright is held by the author/owner(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland.

ABSTRACT
This paper describes the system developed by team IITB for the MediaEval 2017 Predicting Media Interestingness Task. We propose a new training method based on pairwise comparisons between frames of a trailer. The algorithm gave very promising results on the development set but did not perform as well on the test set. Our highest MAP@10 scores on the test set are 0.0911 (image subtask) and 0.0525 (video subtask), both achieved by runs based on systems submitted last year [4, 6].

1. INTRODUCTION
The MediaEval 2017 Predicting Media Interestingness Task [2] deals with the automatic selection of images and/or video segments according to their interestingness to a common viewer. We use only the visual content and no additional metadata.

Previous systems for this task discuss several of its inherent problems in detail. They also point towards the usefulness of CNN features: in particular, they report that features from AlexNet's fc7 layer perform reasonably well with simple classifiers [4, 6]. We believe a key shortcoming of the previous approaches is that they attempt to tag images as interesting or non-interesting in a global context, whereas the task inherently expects images to be classified in a local, trailer-wise context. Our system takes this into account by training a classifier on pairwise comparisons of frames from the same trailer.

2. SYSTEM DESCRIPTION

[Figure 1: Pairwise-comparison-based training. Concatenated image features (fc7) from the same trailer are fed to the classifier, which learns to predict the more interesting image.]

2.1 Pre-processing
Given the training data feature matrix $X \in \mathbb{R}^{N \times F}$, consisting of $N$ examples each described by an $F$-dimensional vector, we first standardize it and then apply principal component analysis (PCA) to reduce its dimensionality. The transformed feature matrix $Z = (z_i)_i \in \mathbb{R}^{N \times M}$ is used to experiment with various classifiers; here $M$ is the number of top eigenvalues we wish to retain.

Our system uses the provided AlexNet fc7 features [3] for the image subtask and the provided C3D features [8] for the video subtask. Each feature vector has dimension 4096; after PCA we reduce this to 200, so in our system $Z \in \mathbb{R}^{N \times 200}$.

2.2 Training
We adopted the following two methods for training:

1. Feed every frame/video's feature vector to the classifier, which learns to predict the interestingness label of the frame, as in [4].

2. For each trailer, consider all possible pairs of its frames/videos and feed the corresponding concatenated feature vectors to the classifier. The classifier learns to predict which of the two frames/videos is more interesting.

For the second method, the pairwise comparisons are constructed as follows (a sketch is given after Eq. (1)). First, from each trailer we generate all possible pairs of frames; this ensures that only frames/videos of the same trailer are compared. With $T$ trailers containing $n_i$ frames/videos each, we get $N_1 = \sum_{i=1}^{T} \binom{n_i}{2}$ pairs. Each pair is represented by concatenating the feature vectors of its two frames/videos; with each feature vector of size $M$, the concatenated feature vector has size $2M$. This procedure yields a feature matrix $Z_{\mathrm{new}} \in \mathbb{R}^{N_1 \times 2M}$. The output label for an ordered pair of frames/videos $(I_1, I_2)$ is assigned as follows:

$$y = \begin{cases} 1, & I_1 \text{ is more interesting than } I_2 \\ 0, & I_2 \text{ is more interesting than } I_1 \end{cases} \qquad (1)$$
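As an illustration of the pre-processing and pair construction just described, here is a minimal sketch in Python with scikit-learn and NumPy. The function and variable names, and the assumption that per-frame interestingness scores are available as `labels`, are ours, not from the original system.

```python
import numpy as np
from itertools import combinations
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def preprocess(X, n_components=200):
    """Standardize the N x F feature matrix X and reduce it to N x M via PCA."""
    Z = StandardScaler().fit_transform(X)
    return PCA(n_components=n_components).fit_transform(Z)

def make_pairs(Z, labels, trailer_ids):
    """Build the N1 x 2M pairwise training set of Section 2.2.

    For every pair (I1, I2) of frames from the same trailer, the feature is
    the concatenation [z1, z2] and the target follows Eq. (1): y = 1 if I1
    is more interesting than I2, else y = 0. `labels` is assumed to hold a
    per-frame interestingness score (higher = more interesting).
    """
    pair_feats, pair_targets = [], []
    for t in np.unique(trailer_ids):
        idx = np.where(trailer_ids == t)[0]  # frames of one trailer only
        for k, l in combinations(idx, 2):
            pair_feats.append(np.concatenate([Z[k], Z[l]]))
            pair_targets.append(1 if labels[k] > labels[l] else 0)
    return np.asarray(pair_feats), np.asarray(pair_targets)
```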
2.3 Prediction
For the first two runs, which are based on [4, 6], we used two different classifiers: a support vector machine (SVM) with RBF kernel (run 1) and logistic regression with $\ell_1$ penalty (run 2). We now describe the prediction algorithm for our new approach.

The interestingness ranking of the frames/videos of a particular trailer is determined from the predictions on all pairwise comparisons by generating a penalty score $s_i$ for each frame/video and ordering the scores from lowest to highest, the lowest corresponding to the most interesting frame/video. The scores are computed by the following algorithm, referred to as P1 (a sketch follows this section):

1. Initialize the penalty score $s_i = 0$ for each $i$.

2. Iterate over the results of all pairwise comparisons: for each pair indexed by $\{k, l\}$, let $r(k, l)$ denote the classifier's prediction and perform the update $s_u = s_u + |\Pr\{r(k,l) = 1\} - \Pr\{r(k,l) = 0\}|$, where $u$ is the index of the frame/video predicted to be less interesting, $\Pr\{\cdot\}$ denotes probability and $|\cdot|$ absolute value.

This increases the penalty score of the less interesting frame/video according to the confidence the classifier has in its prediction, where the confidence for a given pair is taken to be the absolute difference between $\Pr\{r(k,l) = 1\}$ and $\Pr\{r(k,l) = 0\}$. In one of our runs we also try a variant, referred to as P2, whose update equation is simply $s_u = s_u + 1$.

Interestingness classification: We opt for a simple method for the binary classification of each image as interesting or not: we label the top 12% of the ranked images as interesting. We chose 12% because it is slightly higher than the average fraction of interesting images, which is about 9%. Note that since we generate a full ranking of the frames, the choice of the top 12% has no particular significance; the official metric is unaffected by it.
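The following is a minimal sketch of the P1/P2 ranking step for one trailer, assuming a fitted scikit-learn classifier trained on concatenated pairs as above; the function and argument names are ours.

```python
import numpy as np
from itertools import combinations

def rank_frames(clf, Z, variant="P1"):
    """Rank the frames of one trailer from most to least interesting.

    For each pair (k, l), the frame u predicted to be less interesting is
    penalized by |Pr{r=1} - Pr{r=0}| under P1, or by a constant 1 under P2.
    Assumes a binary classifier with clf.classes_ == [0, 1].
    """
    scores = np.zeros(len(Z))                    # s_i = 0 for every frame i
    for k, l in combinations(range(len(Z)), 2):
        pair = np.concatenate([Z[k], Z[l]])[None, :]
        p0, p1 = clf.predict_proba(pair)[0]      # Pr{r=0}, Pr{r=1}
        u = l if p1 >= p0 else k                 # predicted less interesting
        scores[u] += abs(p1 - p0) if variant == "P1" else 1.0
    return np.argsort(scores)                    # most interesting first
```

The top 12% of this ordering would then be labeled interesting, although, as noted above, the official metric depends only on the ranking itself.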
3. EXPERIMENTAL VALIDATION
The training dataset consists of 7,396 frames extracted from 78 movie trailers, yielding about 392,000 pairs of frames; the test data consists of 2,435 frames extracted from 30 movie trailers. Complete information about the preparation of the dataset is given in [2]. Scikit-learn [5] was used to implement and test the various configurations.

3.1 Results and Discussion
Our results on the development set for the various approaches are given in Table 1, and the run submission results in Table 2. The tables give the mean average precision (MAP) and the official metric, MAP@10, for the different runs according to the training method and classifier used.

Table 1: Results on the development set

Run  Training + Classifier  Subtask  MAP@10
1    NP + SVM-rbf           Image    0.094
2    NP + LR-l1             Image    0.144
3    P1 + LR-l1             Image    0.179
4    P2 + LR-l1             Image    0.178
5    NP + SVM-rbf           Video    0.088
6    NP + LR-l1             Video    0.092
7    P1 + LR-l1             Video    0.109
8    P2 + LR-l1             Video    0.108

Table 2: Run submissions on the test set (MAP@10 is the official metric)

Run  Training + Classifier  Subtask  MAP     MAP@10
1    NP + SVM-rbf           Image    0.1886  0.0500
2    NP + LR-l1             Image    0.2570  0.0911
3    P1 + LR-l1             Image    0.2038  0.0494
4    P2 + LR-l1             Image    0.2054  0.0521
5    NP + SVM-rbf           Video    0.1795  0.0525
6    NP + LR-l1             Video    0.1675  0.0445
7    P1 + LR-l1             Video    0.1700  0.0474
8    P2 + LR-l1             Video    0.1678  0.0445

Development set. We experimented with the provided CNN features, using PCA to bring the number of dimensions down to 200, and with both the non-pairwise (NP) and pairwise (P1, P2) strategies for training and prediction described in the previous section. These methods were used to train an SVM with RBF kernel [7] and logistic regression with $\ell_1$ penalty (LR-l1) [9]; these choices follow the findings of previous systems [4, 6]. We split the development set into a training set (62 videos) and a cross-validation set (16 videos), computed MAP@10 on the validation set, and selected the model parameters that gave the best MAP@10. The pairwise comparison strategy worked better than the non-pairwise strategy, giving a higher MAP@10, which matched our expectations; logistic regression gave better results than the SVM.

Because of the large number of pairs involved, training time made it impractical to experiment with classifiers such as the SVM in our pairwise approach (P1, P2). We experimented with (1) logistic regression with $\ell_2$ penalty, (2) random forests, and (3) logistic regression with $\ell_1$ penalty. Option (3) gave slightly better results than the other two and was the fastest to train, so we chose logistic regression with $\ell_1$ penalty as our classifier.

Test set. The results on the test set, however, were unexpected. Logistic regression with pairwise comparisons gave the best results on the development set for both subtasks, but it is unimpressive on the test set, where the best results come from non-pairwise logistic regression (image subtask) and the non-pairwise SVM with RBF kernel (video subtask).

There are several possible reasons for this discrepancy between the development and test results. (i) Viewing the classifier as a neural network, it may require more fine-tuning of the weights of AlexNet's fc7 layer, or a more complex network instead of a single neuron, to generalize better. (ii) Though improbable, it is possible that discrepancies between the sources of the development and test sets result in poor generalization.
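Since MAP@10 drives all of the model selection above, we include a simplified sketch of how it can be computed per trailer. This is an illustrative stand-in under our own assumptions (binary per-frame relevance, one common AP@k variant), not the official evaluation tool.

```python
import numpy as np

def ap_at_k(ranked_relevance, k=10):
    """Average precision over the top-k ranked frames (1 = interesting)."""
    rel = np.asarray(ranked_relevance, dtype=float)[:k]
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precisions * rel).sum() / rel.sum())

def map_at_10(per_trailer_rankings):
    """Mean AP@10 over trailers; each entry lists relevance in ranked order."""
    return float(np.mean([ap_at_k(r, 10) for r in per_trailer_rankings]))
```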
4. CONCLUSIONS
In summary, we proposed a new system for interestingness prediction in images and videos. It differs from previous systems chiefly in its training method, which is based on pairwise comparisons of images and thereby captures the interestingness of an image in a local context. Although our system gave impressive results on the development set, it failed to perform well on the test set. Possible improvements to the current system include increasing its complexity or fine-tuning the last layer of AlexNet for a better input representation. The efficiency of training could also be improved by selecting pairs more intelligently [1].

5. REFERENCES
[1] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[2] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, M. Gygli, and N. Q. Duong. MediaEval 2017 Predicting Media Interestingness Task. In Proc. of the MediaEval 2017 Workshop, Dublin, Ireland, Sept. 13-15, 2017.
[3] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE Transactions on Multimedia, 17(8):1174–1186, 2015.
[4] J. Parekh and S. Parekh. The MLPBOON Predicting Media Interestingness System for MediaEval 2016. In MediaEval, 2016.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[6] Y. Shen, C.-H. Demarty, and N. Q. Duong. Technicolor@MediaEval 2016 Predicting Media Interestingness Task. In MediaEval, 2016.
[7] J. A. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
[8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[9] H.-F. Yu, F.-L. Huang, and C.-J. Lin. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning, 85(1-2):41–75, 2011.