=Paper=
{{Paper
|id=Vol-1739/MediaEval_2016_paper_1
|storemode=property
|title=MediaEval 2016 Predicting Media Interestingness Task
|pdfUrl=https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_1.pdf
|volume=Vol-1739
|dblpUrl=https://dblp.org/rec/conf/mediaeval/DemartySIDWDL16
}}
==MediaEval 2016 Predicting Media Interestingness Task==
MediaEval 2016 Predicting Media Interestingness Task

Claire-Hélène Demarty (1), Mats Sjöberg (2), Bogdan Ionescu (3), Thanh-Toan Do (4), Hanli Wang (5), Ngoc Q. K. Duong (1), Frédéric Lefebvre (1)

(1) Technicolor, Rennes, France
(2) HIIT, University of Helsinki, Finland
(3) LAPI, University Politehnica of Bucharest, Romania
(4) Singapore University of Technology and Design, Singapore & University of Science, Vietnam
(5) Tongji University, China

{claire-helene.demarty, quang-khanh-ngoc.duong, frederic.lefebvre}@technicolor.com, mats.sjoberg@helsinki.fi, bionescu@imag.pub.ro, thanhtoan_do@sutd.edu.sg, hanliwang@tongji.edu.cn

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

This paper provides an overview of the Predicting Media Interestingness task that is organized as part of the MediaEval 2016 Benchmarking Initiative for Multimedia Evaluation. The task, which is running for the first year, expects participants to create systems that automatically select the images and video segments that are considered to be the most interesting for a common viewer. In this paper, we present the task use case and challenges, the proposed data set and ground truth, the required participant runs and the evaluation metrics.

1. INTRODUCTION

The ability of multimedia data to attract and keep people's interest for long periods of time is gaining more and more importance in the field of multimedia, where concepts such as memorability [4], aesthetics [9], interestingness [13, 11], attractiveness [14] and affective value [25] are intensely studied, especially in the context of the ever-growing market value of social media and advertising. Although interestingness has been studied for a long time in the psychology community [21, 2, 22], and more recently but actively in the image processing community [23, 5, 1, 6, 10, 18], no common definition exists in the literature. Moreover, only a few datasets are publicly available, and no benchmark exists for evaluating what makes a piece of media interesting.

In this paper we introduce the 2016 MediaEval (http://multimediaeval.org/) Predicting Media Interestingness Task, a pioneer benchmarking initiative for the automatic prediction of image and video interestingness. The task, which is in its first year, derives from a practical use case at Technicolor (http://www.technicolor.com/). It involves helping professionals to illustrate a Video on Demand (VOD) web site by selecting some interesting frames and/or video excerpts for the posted movies. The frames and excerpts should help a user decide whether he/she is interested in watching the underlying movie. The data in this task is therefore adapted to this particular context, which provides a more focused definition of interestingness.

2. TASK DESCRIPTION

The task requires participants to deploy algorithms that automatically select the images and video segments of Hollywood-like movies which are considered to be the most interesting for a common viewer. Interestingness of the media is judged based on the visual appearance, the audio information and the text accompanying the data. Therefore, the multimodal facet of interestingness prediction can be investigated.

Two different subtasks are provided, corresponding to the two types of available media content, namely:

• predicting image interestingness — given a set of key-frames extracted from a movie, the task requires participants to automatically identify the images of the given movie that viewers report to be the most interesting. To solve the task, participants can make use of visual content as well as external metadata, e.g., Internet data about the movie, social media information, etc.;

• predicting video interestingness — given the video shots of a movie, the task requires participants to automatically identify the shots that viewers report to be the most interesting in the given movie. To solve the task, participants can make use of visual and audio data as well as external data, e.g., subtitles, Internet data, etc.

In both cases, the task is a binary classification task: participants are expected to label the provided data as interesting or not (note that prediction is carried out on a per-movie basis). In addition, a confidence value is required for each provided prediction.
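The exact submission format is defined in the instructions distributed to participants and is not reproduced here. The sketch below only illustrates, under assumed names, how per-item classifier scores could be turned into the per-movie binary labels and confidence values that each run must contain; make_predictions, its threshold and the score source are illustrative placeholders, not part of the task specification.

```python
# Minimal sketch (not part of the task definition): turning per-item classifier
# scores into binary labels plus confidence values, grouped per movie.
from collections import defaultdict

def make_predictions(items, scores, threshold=0.5):
    """items: list of (movie_id, shot_or_keyframe_id); scores: interestingness
    scores in [0, 1] from any model. Returns per-movie decisions."""
    per_movie = defaultdict(list)
    for (movie_id, item_id), score in zip(items, scores):
        label = 1 if score >= threshold else 0          # binary decision
        per_movie[movie_id].append((item_id, label, score))  # score kept as confidence
    # within each movie, rank by confidence so the output doubles as a ranked list
    for movie_id in per_movie:
        per_movie[movie_id].sort(key=lambda x: x[2], reverse=True)
    return per_movie
```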
3. DATA DESCRIPTION

The 2016 data is extracted from Creative Commons licensed trailers of Hollywood-like movies. It consists of development data intended for designing and training the methods (extracted from 52 trailers) and test data used for the final benchmarking (extracted from 26 trailers). The choice of using trailers instead of full movies is driven by the need for data that is both freely distributable and still representative, in content and quality, of Hollywood movies. Trailers result from a manual filtering of the movies that keeps interesting scenes, but also less attractive shots to balance their content. We therefore believe they remain representative for the task.

For the predicting video interestingness subtask, the data consists of the video shots obtained after manual segmentation of the videos (video shots are the continuous frame sequences recorded between a camera turning on and off), i.e., 5,054 shots for the development data and 2,342 shots for the test data.

For the predicting image interestingness subtask, the data consists of collections of key-frames extracted from the video shots used for the video subtask. A single key-frame is extracted per shot, leading to 5,054 key-frames for the development set and 2,342 for the test set. This key-frame is chosen as the middle frame of the shot, as it is highly likely to capture the most important information of the shot.

To facilitate participation from various communities, we also provide some pre-computed content descriptors, namely:

• low level features — dense SIFT (Scale Invariant Feature Transform), computed following the original work in [17], except that the local frame patches are densely sampled instead of being located with interest point detectors; a codebook of 300 codewords is used in the quantization process with a spatial pyramid of three layers [15]; HoG descriptors (Histograms of Oriented Gradients) [7], computed over densely sampled patches, where, following [24], HoG descriptors in a 2 × 2 neighborhood are concatenated to form a descriptor of higher dimension; LBP (Local Binary Patterns) [19]; GIST, computed from the output energy of several Gabor-like filters (8 orientations and 4 scales) over a dense frame grid as in [20]; a color histogram computed in the HSV (Hue-Saturation-Value) space; MFCC (Mel-Frequency Cepstral Coefficients), computed over 32 ms time windows with 50% overlap, where the cepstral vectors are concatenated with their first and second derivatives; the fc7 layer (4,096 dimensions) and prob layer (1,000 dimensions) of AlexNet [12];

• mid level features — face detection and tracking related features (http://multimediaeval.org/mediaeval2016/persondiscovery/), obtained by face tracking-by-detection in each video shot with a HoG detector [7] and the correlation tracker proposed in [8].

4. GROUND TRUTH

All data was manually annotated in terms of interestingness by human assessors. A dedicated web-based tool was developed to assist the annotation process. Overall, more than 312 annotators participated in the annotation of the video data and 100 in the annotation of the images. The cultural distribution spans 29 different countries.

We use a pair-wise comparison protocol [3] in which annotators are shown a pair of images/shots at a time and asked to tag which of the two is more interesting for them. The process is repeated over the whole dataset. As an exhaustive comparison of all possible pairs is practically impossible due to the required human resources, a boosting selection was used instead, i.e., a modified version of the adaptive square design method [16], in which several annotators participate in each iteration.

To obtain the final ground truth, the pair-based annotations are aggregated with the Bradley-Terry-Luce (BTL) model [3], resulting in an interestingness degree for each image/shot. The final binary decisions are obtained after the following processing steps: (i) the interestingness values are ranked in increasing order and normalized between 0 and 1; (ii) the resulting curve is smoothed with a short averaging window, and the second derivative is computed; (iii) for both shots and images, and for all videos, a threshold empirically set to 0.01 is applied to the second derivative to find the first point whose value is above the threshold. This position corresponds to the boundary between non interesting and interesting shots/images. The underlying motivation for this empirical rule is the following: the non interesting population has rather similar interestingness values which increase slowly, while a gap appears when one switches from this non interesting population to the population of more interesting samples. The second derivative was preferred to the first derivative as it allowed these gaps to be located more precisely.

Ground truth is provided in binary format, i.e., 1 for interesting and 0 for non interesting, for each image and video in the two subtasks.
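Since the binarization rule above is described only in prose, the following sketch restates steps (i)-(iii) in code. It is an illustrative reconstruction, not the organizers' implementation: the smoothing window length and the derivative estimator are assumptions, and only the 0.01 threshold on the second derivative is taken from the text.

```python
import numpy as np

def binarize_interestingness(btl_scores, window=5, threshold=0.01):
    """Illustrative reimplementation of steps (i)-(iii); the smoothing
    window length is a placeholder, not a task parameter."""
    scores = np.sort(np.asarray(btl_scores, dtype=float))              # (i) rank in increasing order
    scores = (scores - scores.min()) / (scores.max() - scores.min())   # (i) normalize to [0, 1]
    kernel = np.ones(window) / window
    smoothed = np.convolve(scores, kernel, mode="same")                # (ii) short averaging window
    second_deriv = np.gradient(np.gradient(smoothed))                  # (ii) second derivative
    above = np.flatnonzero(second_deriv > threshold)                   # (iii) first point above 0.01
    cut = above[0] if above.size else len(scores)
    # samples ranked below the cut point are labeled 0 (non interesting), the rest 1
    order = np.argsort(btl_scores)
    labels = np.zeros(len(btl_scores), dtype=int)
    labels[order[cut:]] = 1
    return labels
```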
5. RUN DESCRIPTION

Each participating team is expected to submit up to 5 runs for both subtasks altogether. Among these 5 runs, two runs are required, one per subtask: for the predicting image interestingness subtask, the required run must be built on visual information only and no external data is allowed; for the predicting video interestingness subtask, only audio and visual information is allowed (no external data) for the required run.

Note that, in this context, external data is understood as: (i) additional datasets and annotations dedicated to interestingness classification; (ii) pre-trained models, features or detectors obtained from such dedicated additional datasets; and (iii) additional metadata about the provided content that could be found on the Internet (e.g., from IMDB, http://www.imdb.com/).

On the contrary, CNN features trained on generic datasets such as ImageNet (typically the provided CNN features) are allowed in the required runs. By generic datasets, we mean datasets that were not designed to support research in the task area, i.e., for the classification/study of image and video interestingness.

6. EVALUATION

The official evaluation metric is the mean average precision (MAP) over the interesting class, i.e., the mean of the average precision scores computed for each trailer. This metric, adapted to retrieval tasks, fits the chosen use case, in which we want to help a user choose between different samples by providing a list of suggestions ranked according to interestingness. For assessing the performance, we use the trec_eval tool provided by NIST (http://trec.nist.gov/trec_eval/). In addition to MAP, other commonly used metrics such as precision and recall will be provided to participants.
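The official scores are produced with trec_eval; the snippet below is only meant to illustrate how MAP over the interesting class decomposes into per-trailer average precision values. It relies on scikit-learn's average_precision_score, whose interpolation may differ slightly from trec_eval's, so it is an approximation for sanity checks rather than the official scorer.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(per_trailer):
    """per_trailer: dict mapping a trailer id to (ground_truth, confidences),
    where ground_truth is 0/1 per shot or key-frame and confidences are the
    submitted interestingness scores. Returns MAP over the interesting class."""
    ap_scores = [average_precision_score(y_true, y_score)
                 for y_true, y_score in per_trailer.values()]
    return float(np.mean(ap_scores))   # mean of the per-trailer AP values
```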
7. CONCLUSIONS

The 2016 Predicting Media Interestingness task provides participants with a comparative and collaborative evaluation framework for predicting content interestingness, with an explicit focus on multimedia approaches. Details on the methods and results of each individual participant team can be found in the working note papers of the MediaEval 2016 workshop proceedings.

8. ACKNOWLEDGMENTS

We would like to thank Yu-Gang Jiang and Baohan Xu from Fudan University, China, and Hervé Bredin from LIMSI, France, for providing the features that accompany the released data, and Alexey Ozerov and Vincent Demoulin for their valuable input to the task definition.

9. REFERENCES
[1] X. Amengual, A. Bosch, and J. L. de la Rosa. Review of Methods to Predict Social Image Interestingness and Memorability, pages 64–76. Springer, 2015.
[2] D. E. Berlyne. Conflict, Arousal and Curiosity. McGraw-Hill, 1960.
[3] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: the method of paired comparisons. Biometrika, 39(3-4):324–345, 1952.
[4] Z. Bylinskii, P. Isola, C. Bainbridge, A. Torralba, and A. Oliva. Intrinsic and extrinsic effects on image memorability. Vision Research, 116:165–178, 2015.
[5] C. Chamaret, C.-H. Demarty, V. Demoulin, and G. Marquant. Experiencing the interestingness concept within and between pictures. In Proceedings of SPIE, Human Vision and Electronic Imaging, 2016.
[6] S. L. Chu, E. Fedorovskaya, F. Quek, and J. Snyder. The effect of familiarity on perceived interestingness of images. Volume 8651, pages 86511C–86511C–12, 2013.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE CVPR Conference on Computer Vision and Pattern Recognition, 2005.
[8] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference, 2014.
[9] S. Dhar, V. Ordonez, and T. L. Berg. High level describable attributes for predicting aesthetics and interestingness. In IEEE International Conference on Computer Vision and Pattern Recognition, 2011.
[10] H. Grabner, F. Nater, M. Druey, and L. Van Gool. Visual interestingness in image sequences. In Proceedings of the 21st ACM International Conference on Multimedia, pages 1017–1026, New York, NY, USA, 2013.
[11] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. van Gool. The interestingness of images. In ICCV International Conference on Computer Vision, 2013.
[12] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE Transactions on Multimedia, 17(8):1–13, 2015.
[13] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yan. Understanding and predicting interestingness of videos. In AAAI Conference on Artificial Intelligence, 2013.
[14] S. Kalayci, H. K. Ekenel, and H. Gunes. Automatic analysis of facial attractiveness from video. In IEEE ICIP International Conference on Image Processing, pages 4191–4195, 2014.
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE CVPR Conference on Computer Vision and Pattern Recognition, pages 2169–2178, 2006.
[16] J. Li, M. Barkowsky, and P. Le Callet. Boosting paired comparison methodology in measuring visual discomfort of 3DTV: performances of three different designs. In SPIE Electronic Imaging, Stereoscopic Displays and Applications, volume 8648, 2013.
[17] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
[18] J. Machajdik and A. Hanbury. Affective image classification using features inspired by psychology and art theory. In Proceedings of the 18th ACM International Conference on Multimedia, pages 83–92, New York, NY, USA, 2010.
[19] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971–987, 2002.
[20] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001.
[21] P. J. Silvia. Exploring the Psychology of Interest. Oxford University Press, 2006.
[22] C. Smith and P. Ellsworth. Patterns of cognitive appraisal in emotion. Journal of Personality and Social Psychology, 48(4):813–838, 1985.
[23] M. Soleymani. The quest for visual interest. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 919–922, New York, NY, USA, 2015.
[24] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE CVPR Conference on Computer Vision and Pattern Recognition, pages 3485–3492, 2010.
[25] A. Yazdani, E. Skodras, N. Fakotakis, and T. Ebrahimi. Multimedia content analysis for emotional characterization of music video clips. EURASIP Journal on Image and Video Processing, (26), 2013.