The MLPBOON Predicting Media Interestingness System for MediaEval 2016

Jayneel Parekh                                  Sanjeel Parekh
Indian Institute of Technology, Bombay, India   Technicolor, Cesson Sévigné, France
jayneelparekh@gmail.com                         sanjeelparekh@gmail.com

ABSTRACT
This paper describes the system developed by team MLPBOON for the MediaEval 2016 Predicting Media Interestingness Image Subtask. After experimenting with various features and classifiers on the development dataset, our final system uses CNN features (fc7 layer of AlexNet) as the input representation and logistic regression as the classifier. For the proposed method, the MAP of the best run reaches 0.229.

1. INTRODUCTION
The MediaEval 2016 Predicting Media Interestingness Task [1] requires automatically selecting images and/or video segments that are considered the most interesting for a common viewer. We focus on the image interestingness subtask, which involves automatically identifying, from a given set of key-frames extracted from a certain movie, the images that viewers report to be interesting.
We use only the visual content and no additional metadata.

The solution should essentially involve encoding into features the many generic factors that humans take into account while judging the interestingness of an image [3]. However, this task presents an intrinsic difficulty that makes it extremely challenging to build reliable datasets and features: subjectivity [2]. One can observe the high level of subjectivity by realizing that a given image could be labeled as highly interesting or non-interesting depending upon the part of the world in which it is surveyed. Current methods of annotating datasets tend to reduce this factor [2], but none can eliminate it.

Therefore, while taking subjectivity into account, we wish to determine features that solve the task satisfactorily. In this context, several efforts have been made to understand the factors that affect, or cues that contribute to, the interestingness of an image, even at an individual level. Katti et al. [5] attempt to understand the effect of human cognition and perception on interestingness. Work by Gygli et al. [3] shows how interestingness is related to features capturing unusualness, aesthetics and general preferences, such as GIST, SIFT and color histograms. Further, [8] tries to learn attributes that can be used to predict interestingness at an individual level. Moreover, recent advances in the application of neural networks to tasks in image processing and computer vision make the use of convolutional neural network [4] based features very promising [6].

Our approach was inspired by the following line of thought: if the right set of features is identified, then any simple classifier should produce good results. Thus, we decided upon the proposed system after experimenting with different feature sets.

2. SYSTEM DESCRIPTION
We have opted for a traditional machine learning pipeline: feature selection and preprocessing, training of a classification model, and then prediction.

Given the training data feature matrix X ∈ R^(N×F), consisting of N examples each described by an F-dimensional vector, we first standardize it and apply principal component analysis (PCA) to reduce its dimensionality. The transformed feature matrix Z = (z_i)_i ∈ R^(N×M) is used to experiment with various classifiers. Here M depends on the number of top eigenvalues we wish to retain.

After preliminary testing (discussed in Section 3), we decided to move ahead with logistic regression as our classifier. Logistic regression minimizes the following cost function [9]:

    Cost(w) = C * Σ_{i=1}^{N} log(1 + e^(−y_i w^T z_i)) + (1/2) w^T w,    (1)

where w denotes the weight vector, C > 0 the penalty parameter, z_i the feature vector of the i-th training instance, and y_i ∈ {−1, +1} its label (−1 if non-interesting and +1 if interesting). Note that a column of ones is appended to Z to include the hyperplane intercept as a coefficient of w. Given a test data instance t, its label y is assigned according to equation (2):

    y = 1 if w^T t ≥ 0, and y = 0 otherwise.    (2)
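The pipeline above (standardize, PCA, then logistic regression with penalty parameter C) can be sketched as follows. This is a minimal illustration assuming scikit-learn, with synthetic placeholder arrays standing in for the actual feature matrices; the dimensions and C value are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))    # stand-in for the N x F feature matrix
y_train = rng.integers(0, 2, size=100)  # 1 = interesting, 0 = non-interesting
X_test = rng.normal(size=(10, 20))

# Standardize, then reduce to M components via PCA.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=5).fit(scaler.transform(X_train))  # M = 5 here
Z_train = pca.transform(scaler.transform(X_train))

# scikit-learn's regularized objective matches the form of equation (1):
# 0.5 * w^T w + C * sum_i log(1 + exp(-y_i w^T z_i)).
clf = LogisticRegression(C=0.001).fit(Z_train, y_train)

# predict() is equivalent to the thresholding rule of equation (2).
Z_test = pca.transform(scaler.transform(X_test))
y_pred = clf.predict(Z_test)
```

Note that the test data must be transformed with the scaler and PCA fitted on the training data, as in the last two lines above.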
After experimenting with various descriptors (discussed later in Section 3.1), we use CNN features extracted from the fc7 layer of AlexNet as our input feature representation for building X. In the following section we discuss the experimental results obtained by varying the parameters of the system described above.

[Figure 1: Block diagram for the proposed system: input training data X → standardize → PCA → transformed data Z → logistic regression → classify test samples using the learnt weight vector w]

3. EXPERIMENTAL VALIDATION
The training dataset consisted of 5054 images extracted from 52 movie trailers, while the test data consisted of 2342 images extracted from 26 movie trailers. [1] gives complete information about the preparation of the dataset. WEKA and scikit-learn [7] were used to implement and test various configurations.

(Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.)

Run  No. of features  C       MAP     Precision  Recall
1    780              0.001   0.2205  0.140      0.581
2    700              0.008   0.2023  0.128      0.381
3    700              0.05    0.1941  0.131      0.348
4    400              0.1     0.2170  0.137      0.427
5    2016             0.0001  0.2296  0.141      0.726

Table 1: Run submission results (MAP was the official metric).
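The official metric, mean average precision (MAP), can be illustrated with scikit-learn's average_precision_score. The per-trailer grouping and the toy labels and scores below are assumptions made purely for illustration, not the task's actual data.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy ground-truth labels (1 = interesting) and classifier scores for
# two hypothetical trailers.
labels = [np.array([1, 0, 0, 1]), np.array([0, 1, 0])]
scores = [np.array([0.9, 0.8, 0.4, 0.7]), np.array([0.5, 0.4, 0.3])]

# MAP = mean of the per-trailer average precision values.
ap_per_trailer = [average_precision_score(y, s) for y, s in zip(labels, scores)]
map_score = float(np.mean(ap_per_trailer))  # 2/3 for these toy values
```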
3.1 Results and Discussion
The run submission results are given in Table 1. The table reports the mean average precision (MAP), the official metric, along with precision and recall on the interesting images for each run, together with the corresponding penalty parameter C and the number of transformed features retained after PCA. The general strategy for the run submissions was to first fix the number of PCA features and subsequently tune C for the best MAP on the development data.

As observed, C decreases as the number of PCA features increases. This trend can possibly be explained as a way to avoid overfitting. The 5th run gives the best MAP; however, the MAP values of all runs are comparable. This points towards the utility of dimensionality reduction, which significantly reduces redundancy without much affecting the results. It was observed that 400 and 780 transformed features capture about 95% and 98% of the variance of the data, respectively. The difference between the MAP on development and test data was very small for all runs, lying between 0.01 and 0.03. The maximum MAP on the development data was 0.24, with the 1st run's system configuration.
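The relation between the number of retained PCA components and the captured variance (about 95% at 400 components and 98% at 780 in our case) can be sketched as follows. The matrix here is a small synthetic stand-in, so the resulting component counts are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))          # stand-in for the standardized features
Xs = StandardScaler().fit_transform(X)

# Fit a full PCA and accumulate the explained-variance ratios.
pca = PCA().fit(Xs)
cum = np.cumsum(pca.explained_variance_ratio_)

# Smallest M whose components capture at least 95% of the variance.
n_95 = int(np.searchsorted(cum, 0.95) + 1)
```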
System Design Decisions
We experimented with the following features provided by the task [1]: CNN (fc7 and prob layers of AlexNet), GIST and color histogram (HSV space) features [4], and trained their different combinations with various machine learning classifiers such as SVM, decision trees and logistic regression, using 4-fold or 5-fold cross-validation on the development data. In this section we give the rationale for the features and classifier selected in the proposed system.

Features: The results on the development data using the GIST (512-dimensional) and color histogram (128-dimensional) features were not very positive with any classifier. The CNN features (4096-dimensional fc7 and 1000-dimensional prob layers) did show significant improvements, with the fc7 features in particular performing better than the prob features. We also observed that combining the CNN features with GIST and color histogram features gave performance similar to using the CNN features alone. Hence we went forward with using just the CNN features, in particular from the fc7 layer.

Classifier: After selecting the CNN features, we experimented with various classifiers and parameter settings. Specifically, we tried (1) SVM with linear, polynomial and RBF kernels, (2) a ridge regression classifier, (3) a stochastic gradient descent classifier with hinge, log, modified-Huber and squared-hinge loss functions, (4) logistic regression [7], and (5) random trees (WEKA). In general, logistic regression performed better than the other classifiers, with a MAP greater than 0.2 on the training data. The performance of SVM was reasonable with the prob features, but it did not show any significant improvement with the fc7 features; it particularly did not perform well with the RBF kernel. Hence we went ahead with logistic regression.
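A classifier comparison of the kind described above can be sketched with cross-validation in scikit-learn. The candidate set below covers only a subset of the classifiers we tried, and the features and labels are synthetic stand-ins for the development data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(2)
Z = rng.normal(size=(100, 10))          # stand-in for PCA-transformed features
y = rng.integers(0, 2, size=100)

# A few of the candidate classifiers mentioned above.
candidates = {
    "logistic": LogisticRegression(C=0.001),
    "ridge": RidgeClassifier(),
    "sgd_hinge": SGDClassifier(loss="hinge", random_state=0),
    "svm_rbf": SVC(kernel="rbf"),
}

# Mean 5-fold cross-validation accuracy per classifier.
scores = {name: cross_val_score(clf, Z, y, cv=5).mean()
          for name, clf in candidates.items()}
```

In the actual experiments the comparison was made on MAP rather than accuracy; cross_val_score's scoring argument could be changed accordingly.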
4. CONCLUSIONS
In summary, we have presented a system for interestingness prediction in images. Despite its simplicity, it obtains reasonable mean average precision values, the maximum being 0.229. From an analysis of the system's development history, we think that the selection of features was more important than the selection of the classifier. We believe it would be useful to identify and incorporate high-level features describing image composition and object expressivity, such as facial expressions. Moreover, to analyze the issue of subjectivity, it would be interesting to check inter-annotator agreement over the test images.

5. REFERENCES
[1] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[2] Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, and Y. Yao. Interestingness prediction by robust learning to rank. In European Conference on Computer Vision, pages 488–503. Springer, 2014.
[3] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1633–1640, 2013.
[4] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE Transactions on Multimedia, 17(8):1–13, 2015.
[5] H. Katti, K. Y. Bin, C. T. Seng, and M. Kankanhalli. Interestingness discrimination in images.
[6] A. Khosla, A. Das Sarma, and R. Hamid. What makes an image popular? In Proceedings of the 23rd International Conference on World Wide Web, pages 867–876. ACM, 2014.
[7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[8] M. Soleymani. The quest for visual interest. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 919–922. ACM, 2015.
[9] H.-F. Yu, F.-L. Huang, and C.-J. Lin. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning, 85(1-2):41–75, 2011.