LAPI at MediaEval 2017 - Predicting Media Interestingness

Mihai Gabriel Constantin, Bogdan Boteanu, Bogdan Ionescu
LAPI, University "Politehnica" Bucharest, Romania
{mgconstantin, bboteanu, bionescu}@imag.pub.ro

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
In this paper we present our contribution, approach and results for the MediaEval 2017 Predicting Media Interestingness task. We studied several visual descriptors and created several early and late fusion approaches in our machine learning system, optimized for the best results in this benchmarking competition.

1 INTRODUCTION
Multimedia interestingness has been studied increasingly in recent years, from several perspectives including psychology and computer vision. From a psychological perspective, user studies have described a correlation between human interest and several other concepts including, but not limited to, aesthetics, enjoyment, complexity and novelty [1, 8], while computer vision approaches have studied various sets of features and machine learning techniques that are able to predict the interestingness of multimedia shots, based on low-level attributes such as color histograms, SIFT or edge distributions [8], or high-level attributes like composition rules or the presence of certain objects [7].

The MediaEval 2017 Predicting Media Interestingness task [6] is a benchmarking competition where participants are tasked with creating a system that can predict the interestingness of images and video segments annotated by a team of viewers, according to a Video on Demand scenario in which a set of the most interesting frames or video shots has to be presented to a certain user. This paper describes our approach for this task.

2 APPROACH
The approach presented in this paper is a continuation of our work described in [3], with the addition of a video interestingness prediction system. The first step in our machine learning system is the extraction of the content descriptors, followed by the learning stage for these descriptors and their early and late fusion combinations, executed on the annotated development dataset. In the final stage we evaluate the best performing combinations on the unlabeled testing dataset. The features used here are presented, along with a detailed description, in [3] and are based on the works of [5, 9-11]. These features have been used in several domains connected with interestingness, such as aesthetics, photographic compositional rules and color theory. For the machine learning algorithm we used Support Vector Machines (SVM) [4] with different parameters and kernels.

2.1 Features
The features used in this system are as follows: Hue, Saturation and Value computed from HSV space (denoted HSV), Hue, Saturation and Lightness extracted from HSL space (HSL), Colorfulness [5, 9], Hue descriptors (HueDesc) [9, 11], Hue models (HueModel) [11], Brightness [10, 11], Edge [9-11], Texture [9], RGB entropy (RGBEntropy) [9], HSV wavelet (HSVwavelet) and average value for the HSV wavelet (aHSVwavelet) [5], average HSV values based on the Rule of Thirds (aHSVRot) [5], average HSL values for the focus region (aHSLFocus) [11], size analysis for the largest five segments (LargSegm) [5], centroid placement (Centroids) [5], Hue, Saturation, Value and Brightness for the largest segments (HueSegm, SatSegm, ValSegm, BrightSegm) [5, 11], color model for the largest segments (ColorSegm) [5], coordinates of the segments (CoordSegm) [11], mass variance, skewness and contrast between the segments (MassVarSegm, SkewSegm, ContrastSegm) [11] and, finally, a depth of field indicator (DoF) calculated according to the method presented in [5].

While for the image subtask each image generated one set of the presented descriptors, for the video subtask we generated two sets of descriptors for each of the individual segments. These two sets were obtained by extracting the feature set for each frame and then calculating the average value and the median value over all the frames in a video segment, as illustrated in the sketch below.
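To make this aggregation step concrete, the following minimal sketch (our illustration, not the original implementation) assumes frames arrive as NumPy arrays and uses a placeholder extract_frame_descriptor in place of the real per-frame extraction detailed in [3]:

```python
import numpy as np

def extract_frame_descriptor(frame):
    # Placeholder for the real per-frame extraction (HSV/HSL statistics,
    # segment descriptors, etc., detailed in [3]); per-channel means are
    # used here only so the sketch runs end to end.
    return frame.reshape(-1, frame.shape[-1]).mean(axis=0)

def video_descriptors(frames):
    # The two video-level descriptor sets for a segment: the element-wise
    # average (AVG) and median (MED) of the per-frame feature vectors.
    per_frame = np.stack([extract_frame_descriptor(f) for f in frames])
    return per_frame.mean(axis=0), np.median(per_frame, axis=0)

# Usage on a dummy segment of 30 random RGB frames:
segment = [np.random.rand(120, 160, 3) for _ in range(30)]
avg_desc, med_desc = video_descriptors(segment)
```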
2.2 Data fusion
In both subtasks we used early and late fusion techniques to maximize our final results. Early fusion consisted of concatenating several features and using the newly created feature vector as the input for a new training run, while for late fusion we took the confidence output values of several runs and combined them according to several strategies, thus generating new confidence outputs.

For the late fusion trials we used four strategies, sketched below: CombMax and CombMin, where we took the maximum and minimum confidence value, respectively, for each media sample and used them as the new outputs; CombSum, where we added up the individual confidence values of the runs; and CombMean, where the added confidence values were also multiplied with weights distributed according to the rank of the initial system. This weight was calculated as w = 1/2^r, where the rank r had the value 0 for the best component output classifier, 1 for the second and so on.
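The four strategies can be summarized in a short sketch (ours, under stated assumptions): each run is a vector of per-sample confidence scores, rows are ordered by devset rank (rank 0 = best), and, since the text does not say whether CombMean normalizes the weighted sum, the plain weighted sum is used:

```python
import numpy as np

def late_fusion(confidences, strategy):
    # confidences: shape (n_runs, n_samples), rows ordered by devset
    # rank, row 0 being the best component run.
    c = np.asarray(confidences, dtype=float)
    if strategy == "CombMax":
        return c.max(axis=0)   # per-sample maximum confidence
    if strategy == "CombMin":
        return c.min(axis=0)   # per-sample minimum confidence
    if strategy == "CombSum":
        return c.sum(axis=0)   # sum of the runs' confidences
    if strategy == "CombMean":
        # Rank-based weights w = 1/2**r, with r = 0 for the best run.
        w = 1.0 / 2.0 ** np.arange(c.shape[0])
        return w @ c
    raise ValueError("unknown strategy: " + strategy)

# Example: fuse three runs over five media samples.
fused = late_fusion(np.random.rand(3, 5), "CombMean")
```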
2.3 Learning system
The learning system we used was an SVM, implemented with the LibSVM library [2], with linear, polynomial and RBF kernels. For the degree, gamma and cost coefficients we used combinations of values 2^k, where k ∈ {-6, ..., 6}; a sketch of this parameter sweep follows.
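As an illustration, the sketch below reproduces this sweep with scikit-learn's SVC, which wraps the same LibSVM library [2]. The data matrix X and labels y are random stand-ins, the degree grid is restricted to small positive integer powers of two (our assumption, since LibSVM expects an integer degree), and the default accuracy scoring stands in for the MAP@10-based selection described in Section 3.1:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Parameter values 2**k for k in {-6, ..., 6}, as used for the
# gamma and cost (C) coefficients.
pow2 = [2.0 ** k for k in range(-6, 7)]
param_grid = [
    {"kernel": ["linear"], "C": pow2},
    {"kernel": ["poly"], "C": pow2, "gamma": pow2, "degree": [1, 2, 4]},
    {"kernel": ["rbf"], "C": pow2, "gamma": pow2},
]

X = np.random.rand(200, 40)       # stand-in for the descriptor matrix
y = np.random.randint(0, 2, 200)  # stand-in interestingness labels

# 10-fold cross-validation over the grid (cf. Section 3.1).
search = GridSearchCV(SVC(), param_grid, cv=10).fit(X, y)

# Per-sample confidence: the signed margin to the decision hyperplane,
# as used for the MAP@10 computation in Section 3.1.
confidence = search.best_estimator_.decision_function(X)
```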
3 EXPERIMENTAL RESULTS
As presented in the task overview paper [6], the development dataset consisted of 7396 frames for the image subtask and 7396 video segments for the video subtask, while the test dataset had 2435 frames for the image subtask and 2435 video segments for the video subtask. The official metric was mean average precision at 10 (MAP@10), and the organisers also calculated the mean average precision (MAP) for each submitted run. A large number of experiments with different early and late fusion strategies and different SVM configurations were carried out, and the best performing combinations were, in the last phase, run on the testset.

3.1 Experiments on the devset
Our SVM training system used a 10-fold cross-validation approach for choosing the best SVM-feature set combination. Generally, taking into account the MAP@10 metric, the best performing SVM kernel was the RBF kernel. Another general observation is that the late fusion approaches, especially CombMax and CombMean, outperformed the early fusion combinations, while early fusion outperformed learning systems with single descriptors. On the other hand, the CombMin and CombSum strategies performed worse than their components for many combinations. Regarding the two descriptor sets for the video subtask (average and median), the results were mixed: some early fusion combinations or single descriptors performed better with the median approach, while others performed better when we used the average calculation.

The interestingness confidence scores for each shot, used for the MAP@10 calculation, were extracted as the margin to the decision hyperplane.

Table 1 shows the best results registered on both the image and the video subtasks and, as mentioned earlier, the best results were achieved with the late fusion approaches. For the video subtask we used the notation AVG for features obtained using the average and MED for features obtained using the median. All the components in Table 1 were trained using the best performing SVM RBF kernel.

Table 1: Best results on devset for the image and video subtasks and their final result on testset (best testset results are marked with an asterisk)

| Subtask | Run | Approach | MAP@10 devset | MAP testset | MAP@10 testset |
|---|---|---|---|---|---|
| image | run1 | CombMax (HSV + HSL + aHSLFocus and aHSVRot + aHSLFocus and HSV + MassVarSegm + LargSegm) | 0.0821 | 0.1791 | 0.0463 |
| image | run2 | CombMax (HSV + HSL + aHSLFocus and aHSVRot + aHSLFocus) | 0.0803 | 0.1789 | 0.0442 |
| image | run3 | CombMean (aHSVRot + aHSLFocus and HSV + MassVarSegm + LargSegm) | 0.0793 | 0.1873* | 0.0555* |
| image | run4 | CombMean (HSVWavelet + aHSVWavelet + aHSLFocus and HSV + HSL + aHSLFocus and HSV + MassVarSegm) | 0.0793 | 0.1851 | 0.0529 |
| video | run1 | CombMax (LargSegmMED + ValSegmMED and TextureMED + MassVarSegmMED) | 0.0753 | 0.1937 | 0.0619 |
| video | run2 | CombMax (LargSegmMED + ValSegmMED and TextureMED + MassVarSegmMED and EdgeAVG + TextureAVG) | 0.0737 | 0.1819 | 0.0564 |
| video | run3 | CombMax (EdgeAVG + TextureAVG and HSVAVG + MassVarSegmAVG) | 0.0732 | 0.1937 | 0.0619 |
| video | run4 | CombMean (LargSegmMED + ValSegmMED and TextureMED + MassVarSegmMED and EdgeAVG + TextureAVG) | 0.0725 | 0.2028* | 0.0732* |
| video | run5 | CombMax (EdgeAVG + TextureAVG and HSVAVG + MassVarSegmAVG and HSLAVG + ColorfulnessAVG) | 0.0723 | 0.1843 | 0.0571 |

For the image subtask, the best result on the devset was obtained with a CombMax strategy combining the early fusion outputs of HSV + HSL + aHSLFocus, aHSVRot + aHSLFocus and HSV + MassVarSegm + LargSegm, with a MAP@10 score on the devset of 0.0821. For the video subtask, the best result was a CombMax strategy containing the LargSegmMED + ValSegmMED and TextureMED + MassVarSegmMED early fusion outputs, with a MAP@10 score of 0.0753.

3.2 Official results on testset
For the final submission we trained the systems on the entire devset, using the optimal parameters found in the previous experiments, and tested the resulting systems on the testset. Table 1 also presents the official results on the testset for the runs we submitted, as returned by the task organisers, with the MAP and MAP@10 scores for each of the runs. For the image subtask we obtained a best MAP@10 score of 0.0555 by using a CombMean strategy with the outputs of aHSVRot + aHSLFocus and HSV + MassVarSegm + LargSegm; the same system also had the best MAP score, 0.1873. For the video subtask it was again a single system that achieved both the best MAP@10 and the best MAP score: a CombMean strategy using the early fusion outputs of LargSegmMED + ValSegmMED, TextureMED + MassVarSegmMED and EdgeAVG + TextureAVG, with a MAP@10 value of 0.0732 and a MAP value of 0.2028.

4 CONCLUSIONS
In this paper we presented several systems that predict media interestingness using content descriptors and early and late fusion approaches. We tested these systems on the MediaEval 2017 Predicting Media Interestingness task, and our best testset results were a MAP@10 of 0.0555 for the image subtask and 0.0732 for the video subtask.

ACKNOWLEDGMENTS
Part of this work was funded by UEFISCDI under research grant PNIII-P2-2.1-PED-2016-1065, agreement 30PED/2017, project SPOTTER.

REFERENCES
[1] Daniel E. Berlyne. 1960. Conflict, Arousal, and Curiosity. McGraw-Hill.
[2] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2, 3 (2011), 27.
[3] Mihai Gabriel Constantin and Bogdan Ionescu. 2017. Content Description for Predicting Image Interestingness. In International Symposium on Signals, Circuits and Systems (ISSCS 2017).
[4] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20, 3 (1995), 273-297.
[5] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. 2006. Studying aesthetics in photographic images using a computational approach. In European Conference on Computer Vision. Springer, 288-301.
[6] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Michael Gygli, and Ngoc Q. K. Duong. 2017. MediaEval 2017 Predicting Media Interestingness Task. In Working Notes Proceedings of the MediaEval 2017 Workshop.
[7] Sagnik Dhar, Vicente Ordonez, and Tamara L. Berg. 2011. High level describable attributes for predicting aesthetics and interestingness. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011). IEEE, 1657-1664.
[8] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool. 2013. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision. 1633-1640.
[9] Andreas F. Haas, Marine Guibert, Anja Foerschner, Sandi Calhoun, Emma George, Mark Hatay, Elizabeth Dinsdale, Stuart A. Sandin, Jennifer E. Smith, Mark J. A. Vermeij, and others. 2015. Can we measure beauty? Computational evaluation of coral reef aesthetics. PeerJ 3 (2015), e1390.
[10] Yan Ke, Xiaoou Tang, and Feng Jing. 2006. The design of high-level features for photo quality assessment. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), Vol. 1. IEEE, 419-426.
[11] Congcong Li and Tsuhan Chen. 2009. Aesthetic visual quality assessment of paintings. IEEE Journal of Selected Topics in Signal Processing 3, 2 (2009), 236-252.