Ranking Images and Videos on Visual Interestingness by Visual Sentiment Features

Soheil Rayatdoost
Swiss Center for Affective Sciences, University of Geneva, Switzerland
soheil.rayatdoost@unige.ch

Mohammad Soleymani
Swiss Center for Affective Sciences, University of Geneva, Switzerland
mohammad.soleymani@unige.ch

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
Today, users generate and consume millions of videos online. Automatic identification of the most interesting moments of these videos has many applications, such as video retrieval. Although the most interesting excerpts are person-dependent, existing work demonstrates that there are some common features among these segments. The media interestingness task at MediaEval 2016 focuses on ranking the shots and key-frames of a movie trailer based on their interestingness. The dataset consists of a set of commercial movie trailers from which the participants are required to automatically identify the most interesting shots and frames. We approach the problem as a regression task and test several algorithms. In particular, we use mid-level semantic visual sentiment features. These features are related to the emotional content of images and have been shown to be effective for recognizing interestingness in GIFs. We found that our suggested features outperform the baseline for the task at hand.

1. INTRODUCTION
Interestingness is the capability of catching and holding human attention [1]. Research in psychology suggests that interest is related to novelty, uncertainty, conflict and complexity [2, 14]. These attributes determine whether a person finds an item interesting, and they contribute to interestingness differently for different people; for example, one person might find a more complex stimulus more interesting than another does. Developing a computational model that automatically performs such a task is useful for applications such as video retrieval, recommendation and summarization [1, 15].

A number of works address the problem of predicting visual interestingness from the content. Gygli et al. and Grabner et al. [7, 6] used visual content features related to unusualness, aesthetics and general preference to predict visual interestingness. Soleymani [15] built a model for personalized interest prediction for images and found that affective content, quality, coping potential and complexity have a significant effect on visual interest in images. In more recent work, Gygli and Soleymani [8] predicted GIF interestingness from the content and found visual sentiment descriptors [11] to be more effective than features that capture temporal information and motion.

The "media interestingness task" is organized at MediaEval 2016. In this task, a development-set and an evaluation-set consisting of Creative Commons licensed trailers of commercial movies, together with their interestingness labels, are provided. For the details of the task description, dataset development and evaluation, we refer the reader to the task overview paper [3]. There are two subtasks in this challenge: the first involves automatic prediction of the interestingness ranking of the different shots in a trailer; the second involves predicting the ranking of the most interesting key-frames. Visual and audio (only for shots) modalities are available to the interestingness prediction methods [3]. The designed algorithms are evaluated on evaluation data that include 2342 shots from 26 trailers. Examples of top-ranking key-frames are shown in Figure 1.

Figure 1: Examples of hit (top row) and miss (bottom row) top-ranking key-frames.

The organizers provided a set of baseline visual and audio features. For the visual modality, we additionally extracted mid-level semantic visual sentiment descriptors [11] and deep learning features. Sentiment-related features are effective in capturing the emotional content of images and have been shown to be useful for recognizing interestingness in GIFs [8]. For the audio modality, we extracted the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [4]. We tested multiple regression models for interestingness ranking and compare our results with those obtained from the baseline features, using mean average precision (MAP) over the top N best ranked images or shots. According to our results on the evaluation-set, our feature-set outperforms the baseline features for predicting interestingness. In the next section, we present our features and describe our methodology in detail.

2. METHOD

2.1 Features
We opt for a set of hand-crafted features and transfer learning, in addition to regression models, with the goal of interestingness ranking. The task organizers provided a set of baseline low-level features. These include a number of low-level audiovisual features that are typically used in computer vision and speech analysis: dense SIFT, Histogram of Gradients (HoG), Local Binary Patterns (LBP), GIST, Color Histogram and deep learning features [10] for the visual modality, and Mel-Frequency Cepstral Coefficients (MFCC) and cepstral vectors for the audio modality.

Interestingness is highly correlated with emotional content [15]. Therefore, we opted for extracting eGeMAPS features from the audio [4]. eGeMAPS features are acoustic features hand-picked by experts for speech and music emotion recognition; 88 eGeMAPS features were extracted with openSMILE [5]. For the video subtask, we extracted all the key-frames from each shot. We then applied the visual sentiment adjective-noun-pair (ANP) detectors [11] to each key-frame. The activations of the fully connected layer 7 (fc7) and the output of the final layer were extracted for each frame. We then pooled the resulting values by mean and variance to form one feature vector per shot.
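To make the shot-level feature construction concrete, the following is a minimal sketch of the mean/variance pooling step described above, assuming the per-frame descriptors (e.g., fc7 activations or ANP detector outputs) of one shot are stacked in a NumPy array; the names and dimensions are illustrative, not the exact pipeline used for our runs.

    import numpy as np

    def pool_shot_features(frame_features):
        """Pool per-frame descriptors into one shot-level feature vector.

        frame_features: array of shape (n_frames, feature_dim), e.g. the fc7
        activations of every key-frame in one shot. Returns a vector of
        length 2 * feature_dim: per-dimension means followed by variances.
        """
        mean = frame_features.mean(axis=0)
        var = frame_features.var(axis=0)
        return np.concatenate([mean, var])

    # Illustrative usage: a shot with 5 key-frames and 4096-dimensional fc7 features.
    shot = np.random.rand(5, 4096)
    shot_vector = pool_shot_features(shot)  # shape: (8192,)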
2.2 Regression models
We used three different regression models to predict the interestingness level: linear regression (LR), support vector regression (SVR) with a linear kernel, and sparse approximation weighted regression (SPARROW) [13].

We used the LIBLINEAR library [9, 12] implementation of SVR with the L2-regularized logistic regression option to predict the interestingness score. We also used regression with sparse approximation (SPARROW), which approximates the prediction based on local information: it is similar to k-nearest-neighbors regression (k-NNR), but the neighbor weights are calculated through a sparse approximation of the query sample [13]. Linear regression with least-squares optimization is used as a baseline method.

In all cases, except for the eGeMAPS audio features, we used principal component analysis (PCA) to reduce the dimensionality of the features. For SVR and SPARROW, we kept the principal components containing 99% of the variance; for linear regression, we only kept the principal components that added up to 50% of the total variance.
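The sketch below illustrates the idea behind SPARROW-style prediction as described above: a query sample is sparsely approximated as a combination of training samples, and the resulting coefficients weight the training targets. It uses orthogonal matching pursuit from scikit-learn as the sparse approximation step and a simple normalized absolute-coefficient weighting; the exact formulation in [13] may differ, so this is an illustration of the concept rather than a re-implementation of our submitted runs.

    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    def sparrow_like_predict(X_train, y_train, x_query, n_nonzero=10):
        """Predict a score for x_query as a weighted average of training targets.

        The query vector is sparsely approximated as a combination of training
        samples (the columns of the dictionary); the absolute coefficients act
        as neighbor weights, similar in spirit to k-NN regression.
        """
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
        omp.fit(X_train.T, x_query)          # dictionary: one column per training sample
        weights = np.abs(omp.coef_)          # shape: (n_train,)
        if weights.sum() == 0:               # degenerate case: fall back to the global mean
            return float(y_train.mean())
        return float(np.dot(weights, y_train) / weights.sum())

    # Illustrative usage with random data.
    rng = np.random.default_rng(0)
    X_tr, y_tr = rng.random((100, 64)), rng.random(100)
    print(sparrow_like_predict(X_tr, y_tr, rng.random(64)))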
3. EXPERIMENTS
After extracting all the feature-sets, we evaluated the performance of different combinations of feature-sets and regression models. We evaluated the different approaches using five-fold cross-validation on the development-set: in each iteration, one fifth of the development-set was held out and the rest was used to train the regression model. When training the SVR, we optimized the hyper-parameter C using a grid-search on the training-set. The best performing approaches, as measured by MAP on the ranked results, were selected for the submitted runs (see Table 1).
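As an illustration of this protocol, the sketch below chains standardization, PCA retaining 99% of the variance and a linear SVR, and grid-searches the regularization parameter C with five-fold cross-validation using scikit-learn. The data, parameter grid and scoring function are placeholders; our submitted runs relied on the LIBLINEAR implementation and the settings described in Section 2.2, so this is only a sketch of the procedure, not the exact code behind the reported numbers.

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.svm import LinearSVR
    from sklearn.model_selection import GridSearchCV

    # Placeholder development-set: 200 shots, 500-dimensional features,
    # one interestingness score per shot (randomly generated here).
    rng = np.random.default_rng(0)
    X_dev, y_dev = rng.random((200, 500)), rng.random(200)

    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("pca", PCA(n_components=0.99, svd_solver="full")),  # keep 99% of the variance
        ("svr", LinearSVR(max_iter=10000)),
    ])

    # Grid-search the SVR hyper-parameter C with five-fold cross-validation.
    grid = GridSearchCV(pipe, {"svr__C": [0.01, 0.1, 1, 10, 100]},
                        cv=5, scoring="neg_mean_squared_error")
    grid.fit(X_dev, y_dev)

    # Rank unseen shots by the predicted interestingness score (descending).
    X_eval = rng.random((50, 500))
    ranking = np.argsort(-grid.predict(X_eval))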
4. RESULTS AND DISCUSSION
Following the task evaluation procedure, we report MAP over the N best ranked images or shots. We report the results of the cross-validation on the development-set and of our four submitted runs on the evaluation-set. For our submitted runs, we trained the selected features and regression methods on all the available data in the development-set. The results for interestingness prediction with the best pairs of regression methods and feature-sets are summarized in Table 1. The best MAP on the development-set, 0.262, is achieved by combining multilingual visual sentiment ontology (MVSO) descriptors and deep learning features with SPARROW regression. We used the baseline video features with SPARROW regression as our baseline. To check the performance of the audio features, we ranked the videos by the output of an SVR trained on the audio features only. The best result for the image subtask is achieved by the sentiment descriptors and deep learning features in combination with linear regression.

Table 1: Evaluation results on interestingness ranking.

Set         Task    Method    Features    MAP ↑
Dev. set    Image   LR        MVSO+fc7    0.1710
Dev. set    Video   SPARROW   MVSO+fc7    0.2617
Dev. set    Video   SPARROW   Baseline    0.2414
Dev. set    Video   SVR       eGeMAPS     0.1987
Eval. set   Image   LR        MVSO+fc7    0.1704
Eval. set   Video   SPARROW   MVSO+fc7    0.1710
Eval. set   Video   SPARROW   Baseline    0.1497
Eval. set   Video   SVR       eGeMAPS     0.1367

Overall, the evaluation-set results demonstrate that the mid-level semantic visual descriptors are more effective for predicting interestingness than the baseline low-level features. The results from a set of relatively simple audio features show the significance of the audio modality for such a task. For the image subtask, the evaluation-set results are very similar to those of the video subtask, since the sentiment features lack temporal information. The drop in performance on the evaluation-set shows that our models were over-fitting to the development-set, and it is likely that an ensemble learning regression would have performed better.

5. CONCLUSION
In this work, we explored different strategies for predicting visual interestingness in videos. We found the mid-level visual descriptors that are related to sentiment to be more effective for this task than the low-level visual features. This is due to the affective nature of interestingness, i.e., interest is an emotion by some accounts. Our features are all static and frame-based; we did not try extracting movement-related features that can capture temporal information, due to the small size of the dataset. Hence, the frame-based results are not any different from the shot-based ones; essentially, they address very similar tasks. The observed performance of the proposed method is rather low; however, given the sample size and the dimensionality of the descriptors, the results still show promising potential. In the future, larger-scale datasets should ideally be developed and annotated to enable more sophisticated methods, such as transfer learning using deep neural networks. Even though the audio features are not as effective, they showed significant performance, deserving more in-depth analysis in the future.

6. REFERENCES
[1] X. Amengual, A. Bosch, and J. L. de la Rosa. Review of methods to predict social image interestingness and memorability. In G. Azzopardi and N. Petkov, editors, Computer Analysis of Images and Patterns: 16th International Conference, CAIP 2015, Valletta, Malta, September 2-4, 2015, Proceedings, Part I, pages 64–76. Springer International Publishing, Cham, 2015.
[2] D. Berlyne. Conflict, Arousal, and Curiosity. McGraw-Hill, 1960.
[3] C. Demarty, M. Sjöberg, B. Ionescu, T. Do, H. Wang, N. Duong, and F. Lefebvre. MediaEval 2016 predicting media interestingness task. In MediaEval 2016 Workshop, Amsterdam, Netherlands, 2016.
[4] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan, and K. P. Truong. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2):190–202, April 2016.
[5] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia, MM '13, pages 835–838, New York, NY, USA, 2013. ACM.
[6] H. Grabner, F. Nater, M. Druey, and L. Van Gool. Visual interestingness in image sequences. In Proceedings of the 21st Annual ACM Conference on Multimedia, 2013.
[7] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The interestingness of images. In The IEEE International Conference on Computer Vision (ICCV), 2013.
[8] M. Gygli and M. Soleymani. Analyzing and predicting GIF interestingness. In ACM Multimedia, 2016.
[9] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), 2008.
[10] Y. G. Jiang, Q. Dai, T. Mei, Y. Rui, and S. F. Chang. Super fast event recognition in internet videos. IEEE Transactions on Multimedia, 17(8):1174–1186, Aug 2015.
[11] B. Jou, T. Chen, N. Pappas, M. Redi, M. Topkara, and S.-F. Chang. Visual affect around the world: A large-scale multilingual visual sentiment ontology. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 159–168, New York, NY, USA, 2015. ACM.
[12] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale logistic regression. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.
[13] P. Noorzad and B. L. Sturm. Regression with sparse approximations of data. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pages 674–678, Aug 2012.
[14] P. J. Silvia, R. A. Henson, and J. L. Templin. Are the sources of interest the same for everyone? Using multilevel mixture models to explore individual differences in appraisal structures. Cognition and Emotion, 23(7):1389–1406, 2009.
[15] M. Soleymani. The quest for visual interest. In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15, pages 919–922, New York, NY, USA, 2015. ACM.