LAPI at MediaEval 2016 Predicting Media Interestingness Task

Mihai Gabriel Constantin, Bogdan Boteanu, Bogdan Ionescu
LAPI, University "Politehnica" of Bucharest, Romania
{mgconstantin, bboteanu, bionescu}@alpha.imag.pub.ro

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

This paper presents our results for the MediaEval 2016 Predicting Media Interestingness task. We propose an approach based on video descriptors and study several machine learning models in order to find the optimal configuration and combination of the descriptors and algorithms that compose our system.

1. INTRODUCTION

Interestingness is the ability to attract and hold human attention. This concept is gaining importance in the field of computer vision, especially given the growing importance and market value of social media and advertising. Even though interest might seem to be the result of a subjective viewer judgment, important progress has been made towards both an objective and a context-based model of interest. In the field of computer vision, two directions have generally emerged on this topic: pure visual interestingness (based on multimedia features and ideas [5, 6, 7]) and social interestingness (based on the degree of social media interest shown for certain visual data [5, 8]). Some researchers [8] focused on the similarities and differences between these two directions. Studies have also examined the psychological and physiological connections with novelty, enjoyment and challenge [1, 3], appraisal structures [10, 11] and computer vision concepts [5, 7, 6].

In this context, the MediaEval 2016 Predicting Media Interestingness Task [4] challenges the participants to automatically select the images and/or video segments which are considered to be the most interesting for a common viewer. The concept of interestingness is defined in a particular use case scenario, i.e., helping professionals to illustrate a Video on Demand (VOD) web site by selecting some interesting frames and/or video excerpts for the movies. In this working note paper, we present our machine learning based approach to the task.

2. PROPOSED APPROACH

As previously stated, to determine the interestingness of images and videos, we experimented with a classic machine learning approach. First, the raw data is converted to content descriptors which should capture, as well as possible, the visual interestingness features of the data. Then, a supervised classifier is learned on these features using the labeled examples. Finally, the actual evaluation is carried out by feeding the classifier the unlabeled data. Regarding the content descriptors, we used the ones provided by the task organizers [4], with some additions. They served as input to a learning system based on SVM, where we tested different combinations of SVM kernel types and coefficients using the LibSVM library [2].

2.1 Used features

Several visual features were used as descriptors, many of them already used in the literature for computer vision tasks. The provided precomputed features were: a color histogram in the Hue-Saturation-Value (HSV) space (denoted histo), Histogram of Oriented Gradients (HoG) descriptors computed over densely sampled patches, dense Scale Invariant Feature Transform (SIFT) with a codebook of 300 codewords and a three-layered spatial pyramid (denoted dsift), Local Binary Patterns (LBP), GIST computed with the output of Gabor-like features (denoted gist), and the fc7 and prob layers of AlexNet (denoted cnnfc7 and cnnprob). All these features are presented and detailed in [4] and [9]. We also extracted and used the color naming histogram feature (denoted colornames) based on the work in [12], as we wanted a color descriptor with fewer dimensions for our learning algorithms that could better represent a human-centered understanding of the colors in each image or video.

For the image subtask, each image is represented with a content descriptor. For the video subtask, each video contains a certain number of frames. To determine the final descriptor we simply average the frame descriptors, which leads to one global descriptor per video.

2.2 Learning system

The learning is achieved using a Support Vector Machine (SVM) binary classifier. Across the trained SVM models we used polynomial, RBF and linear kernels. For the polynomial kernels we used all the combinations of the following degrees: 1, 2 and 3*k where k ∈ [1, ..., 10], with gamma coefficients set to 2^k where k ∈ [0, ..., 6]. For the RBF kernels we combined values of the cost parameter of 2^k where k ∈ [−4, ..., 8] with gamma coefficients of 2^k where k ∈ [−4, ..., 8]. We also tried different class weights, since the devset data, both for images and for videos, was unbalanced: the ratio of uninteresting to interesting samples was almost 10 to 1.
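To make the size of the search space in Section 2.2 and the averaging step in Section 2.1 concrete, the following is a minimal sketch in plain Python. It is not the authors' code and does not call LibSVM; the function names (poly_grid, rbf_grid, average_frame_descriptors) and the dictionary layout are hypothetical, chosen only for illustration. It enumerates the parameter combinations exactly as stated in the text and averages equally sized frame descriptors element-wise.

```python
from itertools import product


def poly_grid():
    """Polynomial-kernel settings: degrees 1, 2, 3*k (k = 1..10), gamma = 2^k (k = 0..6)."""
    degrees = [1, 2] + [3 * k for k in range(1, 11)]
    gammas = [2 ** k for k in range(0, 7)]
    return [
        {"kernel": "poly", "degree": d, "gamma": g}
        for d, g in product(degrees, gammas)
    ]


def rbf_grid():
    """RBF-kernel settings: cost = 2^k and gamma = 2^k, both with k = -4..8."""
    powers = [2.0 ** k for k in range(-4, 9)]
    return [
        {"kernel": "rbf", "cost": c, "gamma": g}
        for c, g in product(powers, powers)
    ]


def average_frame_descriptors(frames):
    """Element-wise mean of per-frame descriptors, giving one descriptor per video.

    `frames` is a non-empty list of equally long lists of floats
    (a hypothetical in-memory layout, assumed for this sketch).
    """
    if not frames:
        raise ValueError("a video needs at least one frame descriptor")
    length = len(frames[0])
    if any(len(f) != length for f in frames):
        raise ValueError("all frame descriptors must have the same length")
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


if __name__ == "__main__":
    print(len(poly_grid()), "polynomial settings")
    print(len(rbf_grid()), "RBF settings")
    print(average_frame_descriptors([[1.0, 2.0], [3.0, 4.0]]))
```

Under these assumptions the polynomial grid has 12 degrees times 7 gamma values and the RBF grid has 13 cost values times 13 gamma values; each setting would then be passed to an SVM trainer and scored by cross-validation.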
Table 1: Best results on devset for the image and video subtasks (best results are marked in bold)

Subtask  Feature           SVM type  Degree  Gamma  TP   FP    Precision  Recall  MAP
image    histo+gist        poly      18      2      22   76    0.224      0.05    0.214
image    dsift+gist        poly      3       32     63   330   0.16       0.144   0.211
image    histo+dsift+gist  poly      9       2      15   35    0.3        0.034   0.197
image    colornames+any    poly      3       2      56   334   0.143      0.128   0.195
image    colornames        poly      2       8      226  1892  0.107      0.517   0.195
video    gist+cnnprob      poly      9       4      35   305   0.103      0.083   0.179
video    cnnfc7+any        poly      3       4      40   364   0.099      0.095   0.172
video    dsift+cnnprob     poly      24      64     81   846   0.087      0.192   0.159
video    gist              poly      6       8      49   359   0.121      0.116   0.148
video    dsift             poly      3       64     25   204   0.109      0.059   0.147

Table 2: Final results on testset (best results are marked in bold)

Run   Subtask  Feature        SVM type  Degree  Gamma  MAP     P@5     P@10    P@20    P@100
run1  image    histo+gist     poly      18      2      0.1714  0.1077  0.1346  0.1423  0.0869
run2  image    dsift+gist     poly      3       32     0.1398  0.0462  0.0808  0.1000  0.0862
run3  video    gist+cnnprob   poly      9       4      0.1574  0.0923  0.1269  0.1212  0.0812
run4  video    cnnfc7+histo   poly      3       4      0.1572  0.1231  0.1000  0.1077  0.0815
run5  video    dsift+cnnprob  poly      24      64     0.1629  0.1154  0.1500  0.1192  0.0819

3. EXPERIMENTAL RESULTS

The task data consists of a development set (devset) intended for training the approaches and a test set (testset) for the actual benchmarking. The devset was extracted from 52 trailers, manually segmented, thus obtaining 5054 segments. For the image subtask one key-frame was used from each segment, while for the video subtask the whole segment was used. Annotating all the data yielded a total of 473 interesting images and 420 interesting videos, with a provided interestingness score for calculating the mean average precision (MAP). The testset consisted of 26 trailers divided into 2342 segments. We performed a number of experiments on the devset and selected the best combinations to be run on the testset.

3.1 Experiments on devset

Using 10-fold cross-validation, we chose the best descriptor-classifier combinations based on precision, subject to a recall better than 0.03. For those best combinations we calculated the mean average precision. We experimented with many different combinations of descriptors and SVM kernels. The best performing kernel was generally the polynomial one. A high number of training runs, especially with the RBF or linear kernels, tended to classify all or almost all of the samples as non-interesting (low recall). In the case of weight-based training for the RBF kernel the recall tended to grow, but the precision was below that of the polynomial SVMs.

Table 1 lists the best five results for each of the two subtasks, giving details regarding the best coefficient combination used. As shown, the estimated MAP on the devset was better for the image subtask than for the video subtask. The MAP scores were calculated using LibSVM's decision values or probability estimates as the interestingness score of each sample [2]. The values for true positives, false positives, precision and recall are also listed. The best results were achieved with a descriptor composed of HSV Histogram and GIST and a polynomial SVM with degree 18 and gamma 2 for the image subtask, and with a descriptor composed of GIST and the CNNProb layer and a polynomial SVM with degree 9 and gamma 4 for the video subtask.

3.2 Official results on testset

The teams were allowed to submit 5 runs, so we chose the best 2 descriptor-classifier combinations for the image subtask and the best 3 combinations for the video subtask. This time the SVM learning systems were trained on the entire devset, using the optimal degree and gamma parameters obtained in our previous experiments. The submitted runs were the following: run1, image subtask with HSV Histogram + GIST, SVM with degree = 18 and gamma = 2; run2, image subtask with DSIFT + GIST, SVM with degree = 3 and gamma = 32; run3, video subtask with GIST + CNNProb, SVM with degree = 9 and gamma = 4; run4, video subtask with CNNFc7 + HSV Histogram, SVM with degree = 3 and gamma = 4; and run5, video subtask with DSIFT + CNNProb, SVM with degree = 24 and gamma = 64.

The final results, as returned by the task organizers, are presented in Table 2. The best results were a MAP of 0.1714 on run1 for the image subtask and a MAP of 0.1629 on run5 for the video subtask. With the single exception of run5, the MAP results on the testset were below the estimated MAP on the devset.

4. CONCLUSIONS

In this paper we presented several models for predicting and scoring multimedia interestingness. Our best MAP results on the testset were 0.1714 for the image subtask and 0.1629 for the video subtask. These results seem to indicate that the task is very challenging, one possible reason being the subjective nature of this field of study.

5. REFERENCES

[1] D. E. Berlyne. Conflict, arousal, and curiosity. 1960.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[3] A. Chen, P. W. Darst, and R. P. Pangrazi. An examination of situational interest and its sources. British Journal of Educational Psychology, 71(3):383–400, 2001.
[4] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. K. Duong, and F. Lefèbvre. MediaEval 2016 predicting media interestingness task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20-21, 2016.
[5] S. Dhar, V. Ordonez, and T. L. Berg. High level describable attributes for predicting aesthetics and interestingness. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1657–1664. IEEE, 2011.
[6] H. Grabner, F. Nater, M. Druey, and L. Van Gool. Visual interestingness in image sequences. In Proceedings of the 21st ACM international conference on Multimedia, pages 1017–1026. ACM, 2013.
[7] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1633–1640, 2013.
[8] L.-C. Hsieh, W. H. Hsu, and H.-C. Wang. Investigating and predicting social and visual image interestingness on social media by crowdsourcing. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 4309–4313. IEEE, 2014.
[9] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE Transactions on Multimedia, 17(8):1174–1186, 2015.
[10] P. J. Silvia. What is interesting? Exploring the appraisal structure of interest. Emotion, 5(1):89, 2005.
[11] S. A. Turner and P. J. Silvia. Must interesting things be pleasant? A test of competing appraisal structures. Emotion, 6(4):670, 2006.
[12] J. van de Weijer, C. Schmid, J. Verbeek, and D. Larlus. Learning color names for real-world applications. IEEE Transactions on Image Processing, 18(7):1512–1523, 2009.