The IITB Predicting Media Interestingness System for MediaEval 2017

Jayneel Parekh (Indian Institute of Technology, Bombay, India), jayneelparekh@gmail.com
Harshvardhan Tibrewal (Indian Institute of Technology, Bombay, India), hrtibrewal@gmail.com
Sanjeel Parekh (Technicolor, Cesson Sévigné, France), sanjeelparekh@gmail.com

Copyright is held by the author/owner(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland.

ABSTRACT
This paper describes the system developed by team IITB for the MediaEval 2017 Predicting Media Interestingness Task. We propose a new training method based on pairwise comparisons between frames of a trailer. The algorithm gave very promising results on the development set but did not perform as well on the test set. Our highest MAP@10 scores on the test set are 0.0911 (image subtask) and 0.0525 (video subtask), both achieved by runs based on systems submitted last year [4, 6].

1. INTRODUCTION
The MediaEval 2017 Predicting Media Interestingness Task [2] deals with the automatic selection of images and/or video segments according to their interestingness to a common viewer. We use only the visual content and no additional metadata.

Previous systems for this task discuss several of its inherent problems in detail. They also point towards the usefulness of CNN features: in particular, they report that features from AlexNet's fc7 layer perform reasonably well with simple classifiers [4, 6]. We believe a key shortcoming of the previous approaches is that they attempt to tag images as interesting or non-interesting in a global context, whereas the task inherently expects images to be classified in a local, trailer-wise context. Our system takes this into account by training a classifier on pairwise comparisons of frames from the same trailer.

2. SYSTEM DESCRIPTION

[Figure 1: Pairwise-comparison-based training. Concatenated image features (fc7) from the same trailer are fed to the classifier, which learns to predict the more interesting image.]

2.1 Pre-processing
Given the training data feature matrix $X \in \mathbb{R}^{N \times F}$, consisting of $N$ examples each described by an $F$-dimensional vector, we first standardize it and then apply principal component analysis (PCA) to reduce its dimensionality. The transformed feature matrix $Z = (z_i)_i \in \mathbb{R}^{N \times M}$ is used to experiment with various classifiers; here $M$ is the number of top eigenvalues we wish to retain.

Our system uses the provided AlexNet fc7 features [3] for the image subtask and the provided C3D features [8] for the video subtask. Each feature vector has dimension 4096; after PCA we reduce this to 200, so in our system $Z \in \mathbb{R}^{N \times 200}$.

2.2 Training
We adopted the following two methods for training:

1. Feed every frame/video's feature vector to the classifier, which learns to predict the interestingness label of the frame, as in [4].

2. For each trailer, consider all possible pairs of its frames/videos and feed the corresponding concatenated feature vectors to the classifier. The classifier learns to predict which of the two frames/videos is more interesting.

For the second method, the pairwise comparisons are constructed as follows (a sketch is given after Eq. (1)). First, from each trailer we generate all possible pairs of frames; this ensures that only frames/videos of the same trailer are compared. With $T$ trailers containing $n_i$ frames/videos each, we get $N_1 = \sum_{i=1}^{T} \binom{n_i}{2}$ pairs. Each pair is represented by concatenating the feature vectors of its two frames/videos; with each feature vector of size $M$, the concatenated feature vector has size $2M$. This procedure yields a feature matrix $Z_{\mathrm{new}} \in \mathbb{R}^{N_1 \times 2M}$. The output label for an ordered pair of frames/videos $(I_1, I_2)$ is assigned as follows:

$$y = \begin{cases} 1, & I_1 \text{ is more interesting than } I_2 \\ 0, & I_2 \text{ is more interesting than } I_1 \end{cases} \qquad (1)$$
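As an illustration of the pre-processing and pair construction just described, here is a minimal sketch in Python with scikit-learn and NumPy. The function and variable names, and the assumption that per-frame interestingness scores are available as `labels`, are ours, not from the original system.

```python
import numpy as np
from itertools import combinations
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def preprocess(X, n_components=200):
    """Standardize the N x F feature matrix X and reduce it to N x M via PCA."""
    Z = StandardScaler().fit_transform(X)
    return PCA(n_components=n_components).fit_transform(Z)

def make_pairs(Z, labels, trailer_ids):
    """Build the N1 x 2M pairwise training set of Section 2.2.

    For every pair (I1, I2) of frames from the same trailer, the feature is
    the concatenation [z1, z2] and the target follows Eq. (1): y = 1 if I1
    is more interesting than I2, else y = 0. `labels` is assumed to hold a
    per-frame interestingness score (higher = more interesting).
    """
    pair_feats, pair_targets = [], []
    for t in np.unique(trailer_ids):
        idx = np.where(trailer_ids == t)[0]  # frames of one trailer only
        for k, l in combinations(idx, 2):
            pair_feats.append(np.concatenate([Z[k], Z[l]]))
            pair_targets.append(1 if labels[k] > labels[l] else 0)
    return np.asarray(pair_feats), np.asarray(pair_targets)
```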
2.3 Prediction
For the first two runs, which are based on [4, 6], we used two different classifiers: a support vector machine (SVM) with RBF kernel (run 1) and logistic regression with $\ell_1$ penalty (run 2). We now describe the prediction algorithm for our new approach.

The interestingness ranking of the frames/videos of a particular trailer is determined from the predictions on all pairwise comparisons by generating a penalty score $s_i$ for each frame/video and ordering the scores from lowest to highest, the lowest corresponding to the most interesting frame/video. The scores are computed by the following algorithm, referred to as P1 (a sketch follows this section):

1. Initialize the penalty score $s_i = 0$ for each $i$.

2. Iterate over the results of all pairwise comparisons: for each pair indexed by $\{k, l\}$, let $r(k, l)$ denote the classifier's prediction and perform the update $s_u = s_u + |\Pr\{r(k,l) = 1\} - \Pr\{r(k,l) = 0\}|$, where $u$ is the index of the frame/video predicted to be less interesting, $\Pr\{\cdot\}$ denotes probability and $|\cdot|$ absolute value.

This increases the penalty score of the less interesting frame/video according to the confidence the classifier has in its prediction, where the confidence for a given pair is taken to be the absolute difference between $\Pr\{r(k,l) = 1\}$ and $\Pr\{r(k,l) = 0\}$. In one of our runs we also try a variant, referred to as P2, whose update equation is simply $s_u = s_u + 1$.

Interestingness classification: We opt for a simple method for the binary classification of each image as interesting or not: we label the top 12% of the ranked images as interesting. We chose 12% because it is slightly higher than the average fraction of interesting images, which is about 9%. Note that since we generate a full ranking of the frames, the choice of the top 12% has no particular significance; the official metric is unaffected by it.
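The following is a minimal sketch of the P1/P2 ranking step for one trailer, assuming a fitted scikit-learn classifier trained on concatenated pairs as above; the function and argument names are ours.

```python
import numpy as np
from itertools import combinations

def rank_frames(clf, Z, variant="P1"):
    """Rank the frames of one trailer from most to least interesting.

    For each pair (k, l), the frame u predicted to be less interesting is
    penalized by |Pr{r=1} - Pr{r=0}| under P1, or by a constant 1 under P2.
    Assumes a binary classifier with clf.classes_ == [0, 1].
    """
    scores = np.zeros(len(Z))                    # s_i = 0 for every frame i
    for k, l in combinations(range(len(Z)), 2):
        pair = np.concatenate([Z[k], Z[l]])[None, :]
        p0, p1 = clf.predict_proba(pair)[0]      # Pr{r=0}, Pr{r=1}
        u = l if p1 >= p0 else k                 # predicted less interesting
        scores[u] += abs(p1 - p0) if variant == "P1" else 1.0
    return np.argsort(scores)                    # most interesting first
```

The top 12% of this ordering would then be labeled interesting, although, as noted above, the official metric depends only on the ranking itself.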
3. EXPERIMENTAL VALIDATION
The training dataset consists of 7,396 frames extracted from 78 movie trailers, yielding about 392,000 pairs of frames; the test data consists of 2,435 frames extracted from 30 movie trailers. Complete information about the preparation of the dataset is given in [2]. Scikit-learn [5] was used to implement and test the various configurations.

3.1 Results and Discussion
Our results on the development set for the various approaches are given in Table 1, and the run submission results in Table 2. The tables give the mean average precision (MAP) and the official metric, MAP@10, for the different runs according to the training method and classifier used.

Table 1: Results on the development set

Run  Training + Classifier  Subtask  MAP@10
1    NP + SVM-rbf           Image    0.094
2    NP + LR-l1             Image    0.144
3    P1 + LR-l1             Image    0.179
4    P2 + LR-l1             Image    0.178
5    NP + SVM-rbf           Video    0.088
6    NP + LR-l1             Video    0.092
7    P1 + LR-l1             Video    0.109
8    P2 + LR-l1             Video    0.108

Table 2: Run submissions on the test set (MAP@10 is the official metric)

Run  Training + Classifier  Subtask  MAP     MAP@10
1    NP + SVM-rbf           Image    0.1886  0.0500
2    NP + LR-l1             Image    0.2570  0.0911
3    P1 + LR-l1             Image    0.2038  0.0494
4    P2 + LR-l1             Image    0.2054  0.0521
5    NP + SVM-rbf           Video    0.1795  0.0525
6    NP + LR-l1             Video    0.1675  0.0445
7    P1 + LR-l1             Video    0.1700  0.0474
8    P2 + LR-l1             Video    0.1678  0.0445

Development set. We experimented with the provided CNN features, using PCA to bring the number of dimensions down to 200, and with both the non-pairwise (NP) and pairwise (P1, P2) strategies for training and prediction described in the previous section. These methods were used to train an SVM with RBF kernel [7] and logistic regression with $\ell_1$ penalty (LR-l1) [9]; these choices follow the findings of previous systems [4, 6]. We split the development set into a training set (62 videos) and a cross-validation set (16 videos), computed MAP@10 on the validation set, and selected the model parameters that gave the best MAP@10. The pairwise comparison strategy worked better than the non-pairwise strategy, giving a higher MAP@10, which matched our expectations; logistic regression gave better results than the SVM.

Because of the large number of pairs involved, training time made it impractical to experiment with classifiers such as the SVM in our pairwise approach (P1, P2). We experimented with (1) logistic regression with $\ell_2$ penalty, (2) random forests, and (3) logistic regression with $\ell_1$ penalty. Option (3) gave slightly better results than the other two and was the fastest to train, so we chose logistic regression with $\ell_1$ penalty as our classifier.

Test set. The results on the test set, however, were unexpected. Logistic regression with pairwise comparisons gave the best results on the development set for both subtasks, but it is unimpressive on the test set, where the best results come from non-pairwise logistic regression (image subtask) and the non-pairwise SVM with RBF kernel (video subtask).

There are several possible reasons for this discrepancy between the development and test results. (i) Viewing the classifier as a neural network, it may require more fine-tuning of the weights of AlexNet's fc7 layer, or a more complex network instead of a single neuron, to generalize better. (ii) Though improbable, it is possible that discrepancies between the sources of the development and test sets result in poor generalization.
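Since MAP@10 drives all of the model selection above, we include a simplified sketch of how it can be computed per trailer. This is an illustrative stand-in under our own assumptions (binary per-frame relevance, one common AP@k variant), not the official evaluation tool.

```python
import numpy as np

def ap_at_k(ranked_relevance, k=10):
    """Average precision over the top-k ranked frames (1 = interesting)."""
    rel = np.asarray(ranked_relevance, dtype=float)[:k]
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precisions * rel).sum() / rel.sum())

def map_at_10(per_trailer_rankings):
    """Mean AP@10 over trailers; each entry lists relevance in ranked order."""
    return float(np.mean([ap_at_k(r, 10) for r in per_trailer_rankings]))
```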
4. CONCLUSIONS
In summary, we proposed a new system for interestingness prediction in images and videos. It differs from previous systems chiefly in its training method, which is based on pairwise comparisons of images and thereby captures the interestingness of an image in a local context. Although our system gave impressive results on the development set, it failed to perform well on the test set. Possible improvements to the current system include increasing its complexity or fine-tuning the last layer of AlexNet for a better input representation. The efficiency of training could also be improved by selecting pairs more intelligently [1].

5. REFERENCES
[1] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[2] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, M. Gygli, and N. Q. Duong. MediaEval 2017 Predicting Media Interestingness Task. In Proc. of the MediaEval 2017 Workshop, Dublin, Ireland, Sept. 13-15, 2017.
[3] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE Transactions on Multimedia, 17(8):1174–1186, 2015.
[4] J. Parekh and S. Parekh. The MLPBOON Predicting Media Interestingness System for MediaEval 2016. In MediaEval, 2016.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[6] Y. Shen, C.-H. Demarty, and N. Q. Duong. Technicolor@MediaEval 2016 Predicting Media Interestingness Task. In MediaEval, 2016.
[7] J. A. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999.
[8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[9] H.-F. Yu, F.-L. Huang, and C.-J. Lin. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning, 85(1-2):41–75, 2011.