<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The IITB Predicting Media Interestingness System for MediaEval 2017</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jayneel Parekh</string-name>
          <email>jayneelparekh@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harshvardhan Tibrewal</string-name>
          <email>hrtibrewal@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanjeel Parekh</string-name>
          <email>sanjeelparekh@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology</institution>
          ,
          <addr-line>Bombay</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technicolor</institution>
          ,
          <addr-line>Cesson Sévigné</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the system developed by team IITB for the MediaEval 2017 Predicting Media Interestingness Task. We propose a new method of training based on pairwise comparisons between frames of a trailer. The algorithm gave very promising results on the development set but did not perform as well on the test set. Our highest MAP@10 on the test set is 0.0911 (Image subtask) and 0.0525 (Video subtask), achieved by runs based on systems submitted last year [4, 6].</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2017 Predicting Media Interestingness Task
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] deals with automatic selection of images and/or video
segments according to their interestingness to a common
viewer. We only use the visual content and no additional
metadata.
      </p>
      <p>
        Previous systems on this task discuss in detail several
relevant inherent problems. Further, they also point towards
the usefulness of CNN features: in particular, they report
features from AlexNet's fc7 layer performing reasonably well
with simple classifiers [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ]. We believe a key shortcoming
of the previous approaches is that they attempt to tag
images interesting/non-interesting in a global context, whereas
the task inherently expects images to be classified in a local
context (trailer-wise). Our system takes this aspect into
account by training a classifier on pairwise comparisons of
frames from the same trailer.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Pre-processing</title>
      <p>Given the training data feature matrix X ∈ ℝ^(N×F),
consisting of N examples each described by an F-dimensional
vector, we first standardize it and apply principal
component analysis (PCA) to reduce its dimensionality. The
transformed feature matrix Z = (z_i)_i ∈ ℝ^(N×M) is used to
experiment with various classifiers. Here M depends on the
number of top eigenvalues we wish to consider.</p>
      <p>
        For our system we use AlexNet's fc7 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] features provided
for the image subtask and C3D [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] features provided for the video
subtask. Each feature vector has a dimension of 4096. After
performing PCA we reduce the dimension to 200. Thus Z
is an ℝ^(N×200) matrix in our system.
      </p>
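      <p>As a concrete illustration, the following is a minimal sketch of this pre-processing step with scikit-learn (the helper and variable names are ours), assuming the provided fc7/C3D features have already been loaded into an N x 4096 matrix X:</p>
      <preformat>
# Minimal sketch: standardize the features, then project onto the top 200 principal components.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


def preprocess(X, n_components=200):
    """Standardize the feature matrix and reduce its dimensionality with PCA."""
    X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per dimension
    pca = PCA(n_components=n_components)        # keep the top n_components eigen-directions
    Z = pca.fit_transform(X_std)                # Z has shape (N, 200)
    return Z, pca
      </preformat>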
    </sec>
    <sec id="sec-4">
      <title>2.2 Training</title>
      <p>
        We adopted the following two methods for training:
1. Feed every frame/video's feature vector to the classifier,
which learns to predict the interestingness label of
the frame as in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
2. For each trailer we consider all possible pairs of its
frames/videos and feed the corresponding concatenated
feature vectors to the classifier. The classifier learns to
predict which one of the two frames/videos is more
interesting.
      </p>
      <p>For the second training method, pairwise comparisons are
made as follows. First, from each trailer, we generate all possible pairs
of frames. This ensures that only frames/videos of the same
trailer are being compared. Considering T trailers with n_i
frames/videos each, we get N1 = Σ_i C(n_i, 2) pairs, the sum
running over all T trailers, where C(n_i, 2) = n_i(n_i - 1)/2.
Each pair is represented by concatenating
the feature vectors of its two frames/videos. With each feature vector
of size M, the concatenation gives a final
feature vector of size 2M. This procedure yields a feature
matrix Z_new ∈ ℝ^(N1×2M). The output label for an ordered pair
of frames/videos (I1, I2) is assigned as follows:
y = 1 if I1 is more interesting than I2, and y = 0 if I2 is more interesting than I1. (1)</p>
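      <p>The pair construction can be sketched as follows (a hedged illustration with helper names of our own; interestingness stands for the ground-truth annotations of the development set and trailer_ids for the trailer index of each frame/video):</p>
      <preformat>
# Minimal sketch of building the pairwise training set: all pairs of frames/videos
# from the same trailer are concatenated and labelled according to equation (1).
import numpy as np
from itertools import combinations


def build_pairs(Z, interestingness, trailer_ids):
    pair_feats, pair_labels = [], []
    for t in np.unique(trailer_ids):
        idx = np.where(trailer_ids == t)[0]            # frames/videos of trailer t only
        for i, j in combinations(idx, 2):              # C(n_t, 2) pairs per trailer
            pair_feats.append(np.concatenate([Z[i], Z[j]]))   # 2M-dimensional representation
            # y = 1 if the first member of the pair is more interesting, else 0
            # (ties are labelled 0 here for simplicity)
            pair_labels.append(1 if interestingness[i] > interestingness[j] else 0)
    return np.array(pair_feats), np.array(pair_labels)
      </preformat>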
    </sec>
    <sec id="sec-5">
      <title>2.3 Prediction</title>
      <p>
        For the first two runs, which are based on [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we
used different classifiers: support vector machines (SVM)
with an RBF kernel (run 1) and logistic regression with an l1 penalty
(run 2), as sketched below.
      </p>
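      <p>A rough sketch of these two baseline classifiers with scikit-learn follows; it assumes Z and y hold the pre-processed features and per-frame labels, Z_test the transformed test features, and that the class-1 probability is used as the per-frame score, which is our assumption rather than something specified here:</p>
      <preformat>
# Sketch of the classifiers used for the first two (non-pairwise) runs.
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

run1_clf = SVC(kernel='rbf', probability=True)                   # run 1: SVM with RBF kernel
run2_clf = LogisticRegression(penalty='l1', solver='liblinear')  # run 2: logistic regression, l1 penalty

run1_clf.fit(Z, y)
run2_clf.fit(Z, y)

# Class-1 probability as a per-frame interestingness score (assumed ranking criterion).
scores = run2_clf.predict_proba(Z_test)[:, 1]
      </preformat>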
      <p>We now describe the prediction algorithm for our new approach.
The ranking of the frames/videos of a particular trailer according to their
interestingness is determined from the
predicted results of all the pairwise comparisons by generating
a penalty score s_i for each of them and ordering them from
lowest to highest, with the lowest corresponding to the most
interesting frame/video. The scores are determined using the
following algorithm (referred to as P1):
1. Initialize the penalty score s_i = 0 for each i.
2. Iterate over the results of all pairwise comparisons: for
each pair indexed by {k, l}, let r(k, l) denote the
prediction of the classifier. The following update is performed:
s_u = s_u + |Pr{r(k, l) = 1} - Pr{r(k, l) = 0}|,
where u denotes the index of the predicted less interesting frame/video,
Pr{.} denotes the probability and |.| the
absolute value.</p>
      <p>This essentially increases the penalty score of the less
interesting frame/video according to the confidence the classifier has in its
prediction. The confidence value of the classifier for a given
pair is taken to be the absolute difference between Pr{r(k, l) = 1}
and Pr{r(k, l) = 0}. We also try a variant of the above
algorithm in one of our runs, wherein the update equation
is s_u = s_u + 1 (referred to as P2).</p>
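      <p>A minimal sketch of this ranking procedure is given below (pair_clf denotes the trained pairwise classifier, pairs holds the index pairs {k, l} of one trailer, pair_feats their concatenated feature vectors; the names are ours):</p>
      <preformat>
# Penalty-score ranking for one trailer, implementing P1 and the simpler P2 variant.
import numpy as np


def rank_trailer(pair_clf, pairs, pair_feats, n_items, variant='P1'):
    s = np.zeros(n_items)                        # penalty score s_i for every frame/video
    proba = pair_clf.predict_proba(pair_feats)   # columns: Pr{r = 0}, Pr{r = 1}
    for (k, l), p in zip(pairs, proba):
        pred = 1 if p[1] >= p[0] else 0          # classifier prediction r(k, l)
        u = l if pred == 1 else k                # index of the predicted less interesting item
        if variant == 'P1':
            s[u] += abs(p[1] - p[0])             # penalize by the classifier's confidence
        else:
            s[u] += 1.0                          # P2: constant penalty
    ranking = np.argsort(s)                      # lowest penalty = most interesting
    return ranking, s
      </preformat>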
      <p>Interestingness classification: We opt for a simple
method for the binary classification of each image as interesting
or not: we classify the top 12% of ranked images as
interesting. We chose the top 12% as this is slightly higher than
the average proportion of interesting images, which is about
9%. It is important to note that since we generate a ranking
of the frames, choosing only the top 12% of images has no particular
significance, as the official metric remains unaffected by it.</p>
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENTAL VALIDATION</title>
      <p>
        The training dataset consisted of 7396 frames extracted
from 78 movie trailers, with about 392,000 pairs of frames,
while the test data consisted of 2435 frames extracted from
30 movie trailers. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] gives complete information about the
preparation of the dataset. Scikit-learn [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was used to
implement and test various configurations.
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.1 Results and Discussion</title>
      <p>Our results on the development set for various approaches
are given in Table 1. The run submission results are given in
Table 2. The tables give the official metric, mean average precision
at 10 (MAP@10), for the different runs, corresponding
to the method of training and the classifier used.</p>
      <sec id="sec-7-1">
        <title>Development Set</title>
        <p>
          We experimented with the CNN features provided and used
PCA to bring the number of dimensions down to 200.
Additionally, we used a non-pairwise (NP) and a pairwise strategy
(P1, P2) for training and prediction, as described in the previous
section. These methods were used to train an SVM (RBF kernel)
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and logistic regression with an l1 penalty (LR-l1) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. These
decisions were made following inferences from previous results [
          <xref ref-type="bibr" rid="ref4 ref6">4,
6</xref>
          ]. We split the development set into a training set (62
videos) and a cross-validation set (16 videos). We calculated
MAP@10 on the validation set, tested the model with several
parameter settings, and chose the model
parameters giving the best MAP@10 results. We found that the
pairwise comparison strategy worked better than the
non-pairwise strategy: it gave a better MAP@10, which
was aligned with our expectation. Logistic regression
gave better results than SVM.
        </p>
        <p>Due to the large number of pairs involved in training, we could
not experiment with classifiers such as SVM in our proposed
approach (P1, P2) because of the large training time. We
experimented with the following classifiers: (1) logistic regression
with an l2 penalty, (2) random forest, and (3) logistic regression
with an l1 penalty. (3) gave slightly better results than the
other two and was the fastest to train, hence we went
with logistic regression with an l1 penalty as our classifier.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Test Set</title>
        <p>The results on the test set, however, were unexpected.
Logistic regression using pairwise comparisons gave the best
results on the development set for both tasks, but its performance
on the test set is not impressive: there, the best results are obtained by
non-pairwise logistic regression (Image subtask) and non-pairwise
SVM with an RBF kernel (Video subtask).</p>
        <p>There could be various possible reasons for the
discrepancy between the results on the development and test sets. (i) Viewing
the classifier as a neural network, it may require more fine-tuning
of the weights of the fc7 layer of AlexNet, or a more
complex network instead of a single neuron, so that it
generalizes better. (ii) Though improbable, it is possible that there are
some discrepancies between the sources of the development and test
sets which result in poor generalization.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSIONS</title>
      <p>
        In summary, we proposed a new system for interestingness
prediction in images and videos. It differs essentially in the
method of training, which is based on pairwise comparisons of images.
This helps in capturing the interestingness of an image in a
local context. Although our system gave impressive results on the
development set, it failed to perform well on the test set.
Possible improvements to the current system include increasing its
complexity or fine-tuning the last layer of AlexNet for a better
input representation. The efficiency of the training can also
be improved by selecting pairs more intelligently ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Bradley</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Terry</surname>
          </string-name>
          .
          <article-title>Rank analysis of incomplete block designs: I. the method of paired comparisons</article-title>
          .
          <source>Biometrika</source>
          ,
          <volume>39</volume>
          (
          <issue>3/4</issue>
          ):
          <fpage>324</fpage>
          -
          <lpage>345</lpage>
          ,
          <year>1952</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjoberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-T.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gygli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. Q.</given-names>
            <surname>Duong</surname>
          </string-name>
          .
          <article-title>MediaEval 2017 Predicting Media Interestingness Task</article-title>
          .
          <source>In Proc. of the MediaEval 2017 Workshop</source>
          , Dublin, Ireland, Sept. 13-15,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Super fast event recognition in internet videos</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>17</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1174</fpage>
          -
          <lpage>1186</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parekh</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Parekh</surname>
          </string-name>
          .
          <article-title>The MLPBOON Predicting Media Interestingness System for MediaEval 2016</article-title>
          . In MediaEval,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. Q.</given-names>
            <surname>Duong</surname>
          </string-name>
          .
          <article-title>Technicolor@MediaEval 2016 Predicting Media Interestingness Task</article-title>
          .
          <source>In MediaEval</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Suykens</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Vandewalle</surname>
          </string-name>
          .
          <article-title>Least squares support vector machine classifiers</article-title>
          .
          <source>Neural processing letters</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ):
          <fpage>293</fpage>
          -
          <lpage>300</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bourdev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torresani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Paluri</surname>
          </string-name>
          .
          <article-title>Learning spatiotemporal features with 3d convolutional networks</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision</source>
          , pages
          <fpage>4489</fpage>
          -
          <lpage>4497</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.-F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-L.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Dual coordinate descent methods for logistic regression and maximum entropy models</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>85</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>41</fpage>
          -
          <lpage>75</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>