DUT-MMSR at MediaEval 2017: Predicting Media Interestingness Task

Reza Aditya Permadi, Septian Gilang Permana Putra, Helmiriawan, Cynthia C. S. Liem
Delft University of Technology, The Netherlands
{r.a.permadi,septiangilangpermanaputra,helmiriawan}@student.tudelft.nl, c.c.s.liem@tudelft.nl

ABSTRACT
This paper describes our approach for the submission to the MediaEval 2017 Predicting Media Interestingness Task, which was particularly developed for the Image subtask. An approach using a late fusion strategy is employed, combining classifiers from different features by stacking them using logistic regression (LR). As the task ground truth was based on pairwise evaluation of shots or keyframe images within the same movie, next to using the precomputed features as-is, we also include a more contextual feature, considering averaged feature values over each movie. Furthermore, we also consider evaluation outcomes for the heuristic algorithm that yielded the highest MAP score on the 2016 Image subtask. Considering results obtained for the development and test sets, our late fusion method shows consistent performance on the Image subtask, but not on the Video subtask. Furthermore, clear differences can be observed between MAP@10 and MAP scores.

1 INTRODUCTION
The main challenge of the Media Interestingness task is to rank sets of images and video shots from a movie, based on their interestingness level. The evaluation metric of interest for this task is the Mean Average Precision considering the first 10 documents (MAP@10). A complete overview of the task, along with a description of the dataset, is given in [4].

Due to the similarity of this year's task to the 2016 Predicting Media Interestingness task, we considered the strategies used in last year's submissions to inform the strategy of our submission this year. [3] and [2] both use an early fusion strategy, combining features that perform relatively well individually. A late fusion strategy with average weighting is used in [6], combining classifiers from different modalities; a Support Vector Machine (SVM) is used as the final combining classifier. [8] finds that logistic regression gives good results, using CNN features which have been transformed by PCA.

[7] proposed a heuristic approach, based on observing the clear presence of people in images. This approach performed surprisingly well, even yielding the highest MAP score in the 2016 Image subtask. While we mainly focus on a fusion-based approach this year, we also include results of the best-performing configuration from [7] in unaltered form, so that a reference point to last year's state of the art is retained.

2 APPROACH
We would like to devise a strategy which is computationally efficient and yet gives good interpretability of the results. Our approach therefore consists of a fairly straightforward machine learning pipeline, evaluating the performance of individual features first, and subsequently applying classifier stacking to find the best combinations of the best-performing classifiers on the best-performing features. For our implementation, we use sklearn [9]. As base classifier, we use logistic regression and aim to find optimal parameters. Further details of our approach are described below.

2.1 Features of interest
For the Image subtask, we initially consider all the pre-computed visual features from [5]: Color Histogram (HSV), LBP, HOG, GIST, denseSIFT, and the AlexNet-based features (fully connected (fc7) layer and probability output). In all cases, we consider the pre-computed features and their dimensionalities as-is.

Considering the way in which ground truth for this task was established, human raters were asked to perform pairwise annotations on shots or keyframe images from the same movie. Hence, it is likely that overall properties of the movie may have affected the ratings: for example, if a movie is consistently shot against a dark background, a dark shot may not stand out as much as it would in another movie. In other words, we assume that the same feature vector may be associated with different interestingness levels, depending on the context of the movie it occurs in. Therefore, apart from the features pre-computed by the organizers, we also consider a contextual feature, based on the average image feature values per movie.

Let X_i be an m × n feature matrix for a movie, where m is the number of images offered for the movie, and n the length of the feature vector describing each image. For our contextual feature, we then take the column-wise average of X_i (averaging over the m images), yielding a new vector µ_i of size 1 × n. In our subsequent discussion, we will denote the contextual feature belonging to a feature type F by 'meanF' (e.g. HSV → meanHSV). This feature is then concatenated to the original feature vector.
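As an illustration (our own sketch, not the authors' released code; array and variable names are hypothetical), the contextual feature can be obtained by averaging a movie's feature matrix over its images and appending the resulting mean vector to every image's feature vector, which doubles the feature dimensionality (e.g. the 118-dimensional 'lbp, meanlbp' representation listed in Table 2):

```python
import numpy as np

def add_contextual_mean(X_movie):
    """Append the per-movie mean feature ('meanF') to each image of one movie.

    X_movie: array of shape (m, n), the n-dimensional feature vectors of the
             m images (or keyframes) belonging to a single movie.
    Returns an array of shape (m, 2 * n): [original feature | movie mean].
    """
    mu = X_movie.mean(axis=0, keepdims=True)                  # mu_i, shape (1, n)
    return np.hstack([X_movie, np.repeat(mu, X_movie.shape[0], axis=0)])

# Hypothetical usage: features_per_movie maps each movie id to its (m, n) matrix.
# X = np.vstack([add_contextual_mean(F) for F in features_per_movie.values()])
```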
For the Video subtask, in comparison to the Image subtask, we now also have information from the audio modality in the form of Mel-Frequency Cepstral Coefficients (MFCC). We further use the pre-computed fully connected layer (fc6) of a C3D deep neural network [10]. As the pre-computed features are given at the keyframe resolution, we simply average over all the keyframes to obtain the values for a particular feature representation. Again, we consider the pre-computed features and their dimensionalities as-is.

2.2 Individual feature evaluation
For each feature type, we would like to individually find the best-performing classifier that optimizes the MAP@10 value. Before feeding the feature vector into the classifier, the values are scaled to have zero mean and unit variance, considering the overall statistics of the training set. For logistic regression, the optimal penalty parameter C is searched on a logarithmic scale from 10^-9 to 10^0.

To evaluate our model, 5-fold cross-validation is used. Each fold is created based on the movies in the dataset, rather than on the individual instances of images or videos within a movie. This way, we make sure that training or prediction always considers all instances offered for a particular movie. For each evaluation, the cross-validation procedure is run 10 times to avoid dependence on specific fold compositions, and the average MAP@10 value is considered.
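The sketch below illustrates one way such a per-feature evaluation loop could be implemented, under stated assumptions: X, y, and movies (per-image feature matrix, binary interestingness labels, and movie ids) are hypothetical, already-loaded arrays; GroupKFold stands in for the movie-based fold construction (the 10 repeated fold compositions are omitted for brevity); and a simple per-movie AP@10 proxy is used in place of the official evaluation tool.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import StandardScaler

def grouped_cv_map10(X, y, movies, C, n_splits=5):
    """Movie-based cross-validation of logistic regression with penalty C.

    Features are standardized with the training-fold statistics; the returned
    score is the mean of a simple AP@10 proxy computed per movie."""
    scores = []
    for train_idx, test_idx in GroupKFold(n_splits).split(X, y, groups=movies):
        scaler = StandardScaler().fit(X[train_idx])            # zero mean, unit variance
        clf = LogisticRegression(C=C).fit(scaler.transform(X[train_idx]), y[train_idx])
        prob = clf.predict_proba(scaler.transform(X[test_idx]))[:, 1]
        for m in np.unique(movies[test_idx]):
            sel = movies[test_idx] == m
            order = np.argsort(-prob[sel])[:10]                # top-10 ranked items
            rel = y[test_idx][sel][order]
            prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
            scores.append((prec_at_k * rel).sum() / max(rel.sum(), 1))
    return np.mean(scores)

# Hypothetical grid search over C on a logarithmic scale, as in Section 2.2:
# best_C = max((grouped_cv_map10(X, y, movies, 10.0 ** e), 10.0 ** e)
#              for e in range(-9, 1))[1]
```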
2.3 Classifier stacking
After identifying the best classifier configuration per feature type, we stack the outputs of those classifiers and try different combinations of them, which are then trained again with several classifiers (logistic regression, SVM, AdaBoost, Linear Discriminant Analysis, and Random Forest). Finding that logistic regression and SVM perform quite well, we apply a more intensive grid search on these classifier types to optimize parameters.
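A minimal sketch of how such a stacking step could be realized with sklearn, assuming the per-feature matrices and the best per-feature C values from Table 2 are available (variable names are hypothetical; feature scaling is omitted for brevity, and the use of out-of-fold predictions is our own choice rather than a detail stated in the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_predict

def stack_with_lr(feature_blocks, base_Cs, y, movies, fusion_C=100.0):
    """Late fusion by stacking: the probability outputs of the per-feature
    logistic regression classifiers become the input of a second-level
    logistic regression (run 2 in Table 1 uses C=100 at this level)."""
    meta_inputs = []
    for X, C in zip(feature_blocks, base_Cs):
        # Out-of-fold probabilities, so the fusion classifier is not trained
        # on predictions the base classifier made on its own training data.
        p = cross_val_predict(LogisticRegression(C=C), X, y, groups=movies,
                              cv=GroupKFold(5), method='predict_proba')[:, 1]
        meta_inputs.append(p)
    Z = np.column_stack(meta_inputs)
    return LogisticRegression(C=fusion_C).fit(Z, y)

# Hypothetical usage for the Image subtask combination of Table 1:
# fusion = stack_with_lr([X_lbp_meanlbp, X_hog, X_fc7],
#                        base_Cs=[1e-3, 1e-8, 1e-5], y=y, movies=movies)
```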
3 RESULTS AND CONCLUSIONS
Table 2 illustrates the top performance on the development set for the individual classifiers, which are then also considered on the test set. In several cases, the addition of our contextual averaged feature yields slight improvements over the individual feature alone. This improvement is the biggest for LBP, where there is an increase of 0.68 compared to the original features.

Table 2: Best individual feature classifier using LR

| Task | Feature | Size | C | MAP@10 |
|---|---|---|---|---|
| Image | lbp, meanlbp | 118 | 10^-3 | 0.1145 |
| Image | hog | 300 | 10^-8 | 0.1092 |
| Image | fc7 | 4096 | 10^-5 | 0.1081 |
| Video | gist | 512 | 10^-4 | 0.0767 |
| Video | c3d | 4096 | 10^-5 | 0.1017 |
| Video | hog | 300 | 10^-5 | 0.0744 |

The final evaluation results for the 5 runs per subtask are shown in Table 1. As before, late fusion makes use of the output probabilities of the best classifiers trained on each feature (as in Table 2), rather than the feature values themselves. Our best-performing result on the Image development set is a MAP@10 value of 0.139 (Logistic Regression, C = 100). This is an improvement over the performance of the best individual classifier in Table 2. On the test set, our best result on the Image subtask is a MAP@10 value of 0.1385 for the same classifier configuration.

Table 1: Evaluation on the development and test sets

| Subtask | Run | Features | Late fusion classifier | Dev MAP@10 | Dev MAP | Test MAP@10 | Test MAP |
|---|---|---|---|---|---|---|---|
| Image | 1 | [(lbp, meanlbp), (hog), (fc7)] | Logistic Regression, C=1 | 0.1387 | 0.3021 | 0.1310 | 0.3002 |
| Image | 2 | [(lbp, meanlbp), (hog), (fc7)] | Logistic Regression, C=100 | 0.1390 | 0.3050 | 0.1385 | 0.3075 |
| Image | 3 | [(lbp, meanlbp), (hog), (fc7)] | SVM, polynomial kernel, degree=2, gamma=0.01 | 0.1373 | 0.3028 | 0.1349 | 0.3052 |
| Image | 4 | [(lbp, meanlbp), (hog), (fc7)] | SVM, RBF kernel, gamma=0.01, C=0.01 | 0.1322 | 0.2972 | 0.1213 | 0.2887 |
| Image | 5 | 2016 unmodified heuristic feature [7] | | 0.0508 | 0.1864 | 0.0649 | 0.2105 |
| Video | 1 | [(gist), (c3d)] | Logistic Regression, C=1000 | 0.1040 | 0.2295 | 0.0443 | 0.1734 |
| Video | 2 | [(gist), (c3d)] | Logistic Regression, C=10 | 0.1045 | 0.2299 | 0.0465 | 0.1748 |
| Video | 3 | [(hog), (c3d)] | Logistic Regression, C=1 | 0.1025 | 0.2249 | 0.0478 | 0.1770 |
| Video | 4 | [(hog), (c3d)] | Logistic Regression, C=10 | 0.1025 | 0.2253 | 0.0482 | 0.1783 |
| Video | 5 | 2016 unmodified heuristic feature [7] | | 0.0530 | 0.1494 | 0.0516 | 0.1791 |

For the Video subtask, evaluation results on the test set show considerable differences in comparison to the development set. While somewhat surprising, we did notice considerable variation in results during cross-validation, and our reported development set results are an average of several cross-validation runs. As one particularly bad cross-validation result, using late fusion of GIST and c3d features with logistic regression (C = 10), with videos 0, 3, 6, 9, 12, 22, 23, 27, 28, 30, 46, 50, 60, 69 and 71 as the evaluation fold, we only obtained a MAP@10 value of 0.0496.

Considering the results for the best-performing configuration (histface) of the heuristic approach from [7], we notice clear differences between MAP@10 and MAP as metrics. Generally speaking, the heuristic approach is especially outperformed on MAP@10, implying that the clear presence of people is not the only criterion for the top-ranked items. Comparing this approach to our proposed late fusion approach, the late fusion approach consistently outperforms the heuristic approach on the Image subtask; on the Video subtask, however, the heuristic approach still achieves reasonable scores, and outperforms the late fusion approach on the test set.

In conclusion, we employed the offered pre-computed features, included a contextual averaged feature, and proposed a late fusion strategy based on the best-performing classifier settings for the best-performing features. Using fusion shows improvements over results obtained on individual features. In future work, alternative late fusion strategies as explained in [1] may be investigated.

For the Image subtask, we notice consistent results between the development and test sets. However, on the Video subtask, we notice inconsistent results on the development set in comparison to the test set. Predicting interestingness in video likely needs a more elaborate approach, which we have not yet covered thoroughly in our method. It also might be the case that the feature distribution of the test set turned out different from that of the training set, or that, more generally, the distribution of features across a video should be taken into account in more sophisticated ways, for example by considering temporal development aspects.

REFERENCES
[1] Pradeep K. Atrey, M. Anwar Hossain, Abdulmotaleb El Saddik, and Mohan S. Kankanhalli. 2010. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16, 6 (Nov 2010), 345–379. https://doi.org/10.1007/s00530-010-0182-0
[2] Shizhe Chen, Yujie Dian, and Qin Jin. 2016. RUC at MediaEval 2016: Predicting Media Interestingness Task. In MediaEval 2016 Working Notes Proceedings.
[3] Mihai Gabriel Constantin, Bogdan Boteanu, and Bogdan Ionescu. 2016. LAPI at MediaEval 2016 Predicting Media Interestingness Task. In MediaEval 2016 Working Notes Proceedings.
[4] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Michael Gygli, and Ngoc Q. K. Duong. 2017. MediaEval 2017 Predicting Media Interestingness Task. In MediaEval 2017 Working Notes Proceedings.
[5] Yu-Gang Jiang, Qi Dai, Tao Mei, Yong Rui, and Shih-Fu Chang. 2015. Super Fast Event Recognition in Internet Videos. IEEE Transactions on Multimedia 17, 8 (Aug 2015), 1174–1186.
[6] Vu Lam, Tien Do, Sang Phan, Duy-Dinh Le, and Duc Anh Duong. 2016. NII-UIT at MediaEval 2016 Predicting Media Interestingness Task. In MediaEval 2016 Working Notes Proceedings.
[7] Cynthia C. S. Liem. 2016. TUD-MMC at MediaEval 2016: Predicting Media Interestingness Task. In MediaEval 2016 Working Notes Proceedings.
[8] Jayneel Parekh and Sanjeel Parekh. 2016. The MLPBOON Predicting Media Interestingness System for MediaEval 2016. In MediaEval 2016 Working Notes Proceedings.
[9] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[10] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.