=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_4
|storemode=property
|title=MediaEval 2017 Predicting Media Interestingness Task
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_4.pdf
|volume=Vol-1984
|authors=Claire-Hélène Demarty,Mats Sjöberg,Bogdan Ionescu,Thanh-Toan Do,Michael Gygli,Ngoc Q.K. Duong
|dblpUrl=https://dblp.org/rec/conf/mediaeval/DemartySIDGD17
}}
==MediaEval 2017 Predicting Media Interestingness Task==
Claire-Hélène Demarty (Technicolor, Rennes, France), Mats Sjöberg (Dept. of Computer Science and Helsinki Institute for Information Technology HIIT, University of Helsinki, Finland), Bogdan Ionescu (LAPI, University Politehnica of Bucharest, Romania), Thanh-Toan Do (University of Adelaide, Australia), Michael Gygli (ETH Zurich, Switzerland & Gifs.com, US), Ngoc Q. K. Duong (Technicolor, Rennes, France)

===ABSTRACT===
In this paper, the Predicting Media Interestingness task, which is running for the second year as part of the MediaEval 2017 Benchmarking Initiative for Multimedia Evaluation, is presented. For the task, participants are expected to create systems that automatically select images and video segments that are considered to be the most interesting for a common viewer. All task characteristics are described, namely the task use case and challenges, the released data set and ground truth, the required participant runs and the evaluation metrics.

===1 INTRODUCTION===
Predicting the interestingness of media content has been an active area of research in the computer vision community for several years now [1, 7, 8, 10], and it has been studied even earlier in the psychological community [2, 16, 17]. However, there were multiple competing definitions of interestingness, only a few publicly available datasets, and, until last year, no public benchmark to assess the interestingness of content. In 2016, a task for the Prediction of Media Interestingness was proposed in the MediaEval 2016 Benchmarking Initiative for Multimedia Evaluation. This task was also an opportunity to propose a clear definition of interestingness, compatible with a real-world industry use case at Technicolor (http://www.technicolor.com). The 2017 edition of the MediaEval benchmark includes a follow-up of the Predicting Media Interestingness Task. This paper gives an overview of the task description in its second year, together with a description of the data and ground truth. The required runs and chosen evaluation metrics are also detailed. In all cases, changes in this year's edition are highlighted compared to last year's edition.

===2 TASK DESCRIPTION===
The Predicting Media Interestingness Task was proposed for the first time last year. This year's edition is a follow-up which builds incrementally upon the previous experience. The task requires participants to automatically select images and/or video segments that are considered to be the most interesting for a common viewer. Interestingness of media is to be judged based on visual appearance, audio information and text accompanying the data, including movie metadata. To solve the task, participants are strongly encouraged to deploy multimodal approaches.

As in 2016, interestingness should be assessed according to a practical use case at Technicolor, which involves helping professionals to illustrate a Video on Demand (VOD) web site by selecting some interesting frames and/or video excerpts for the movies. The frames and excerpts should be suitable in terms of helping a user decide whether he/she is interested in watching the whole movie. Once again, two subtasks are offered to participants, which correspond to the two types of available media content, namely images and videos. Participants are encouraged to submit to both subtasks. In both cases, the task will be considered as a binary classification and a ranking task. Prediction will be carried out on a per movie basis. The two subtasks are:

'''Predicting Image Interestingness''' Given a set of key-frames extracted from a certain movie, the task involves automatically (1) identifying those images that viewers report to be interesting and (2) ranking all images according to their level of interestingness. To solve the task, participants can make use of visual content as well as accompanying metadata, e.g., Internet data about the movie, social media information, etc.

'''Predicting Video Interestingness''' Given a set of video segments extracted from a certain movie, the task involves automatically (1) identifying the segments that viewers report to be interesting and (2) ranking all segments according to their level of interestingness. To solve the task, participants can make use of visual and audio data as well as accompanying metadata, e.g., subtitles, Internet data about the movie, etc.
===3 DATA DESCRIPTION===
The data is extracted from Creative Commons licensed Hollywood-like videos: 103 movie trailers and 4 continuous extracts of ca. 15 min from full-length movies. For the video interestingness subtask, the data consists of video segments obtained after a manual segmentation. These segments correspond to shots (video shots are the continuous frame sequences recorded between the camera being turned on and being turned off) for all videos but four. Their average duration is one second. The four last videos, which correspond to the full-length movie extracts cited above, were manually segmented into longer segments (243 in total) with an average duration of 11.4 s, to better take into account a certain unity of meaning and the audio information of the resulting segments. For the image subtask, the data consists of collections of key-frames extracted from the video segments used for the video subtask (one key-frame per segment). This will allow the comparison of results from both subtasks. The extracted key-frame corresponds to the frame in the middle of each video segment. In total, 7,396 video segments and 7,396 key-frames are released in the development set, whereas the test set consists of 2,435 video segments and the same number of key-frames.

To facilitate participation from various communities, we also provide some pre-computed content descriptors, namely:
* low level features — dense SIFT (Scale Invariant Feature Transform), computed following the original work in [13], except that the local frame patches are densely sampled instead of using interest point detectors; a codebook of 300 codewords is used in the quantization process, with a spatial pyramid of three layers [11]; HoG descriptors (Histograms of Oriented Gradients) [4], computed over densely sampled patches; following [19], HoG descriptors in a 2 × 2 neighborhood are concatenated to form a descriptor of higher dimension; LBP (Local Binary Patterns) [14]; GIST, computed based on the output energy of several Gabor-like filters (8 orientations and 4 scales) over a dense frame grid, as in [15]; a color histogram computed in the HSV space (Hue-Saturation-Value); MFCC (Mel-Frequency Cepstral Coefficients), computed over 32 ms time-windows with 50% overlap, where the cepstral vectors are concatenated with their first and second derivatives; the fc7 layer (4,096 dimensions) and prob layer (1,000 dimensions) of AlexNet [9];
* mid level face detection and tracking related features (see http://multimediaeval.org/mediaeval2016/persondiscovery/), obtained by face tracking-by-detection in each video shot with a HoG detector [4] and the correlation tracker proposed in [5].
In addition to these frame-based features, we provide C3D [18] features, which were extracted from the fc6 layer (4,096 dimensions) and averaged on a segment level.
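As an illustration only, the sketch below shows how two of the simpler released descriptor types could be reproduced: an HSV color histogram for a key-frame and MFCCs with first and second derivatives for a segment's audio track. This is not the organizers' extraction code; the libraries (OpenCV, librosa), file paths, bin counts and MFCC dimensionality are assumptions, and the audio is assumed to already be extracted to a standalone file.

<pre>
# Minimal sketch (assumed tooling, not the official feature extraction).
import cv2            # pip install opencv-python
import librosa        # pip install librosa
import numpy as np

def hsv_histogram(keyframe_path, bins=(8, 8, 8)):
    """Color histogram in HSV space, normalized and flattened."""
    img = cv2.imread(keyframe_path)                      # BGR image
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    return (hist / hist.sum()).flatten()

def mfcc_with_deltas(audio_path, n_mfcc=13):
    """MFCCs over 32 ms windows with 50% overlap, stacked with 1st/2nd deltas."""
    y, sr = librosa.load(audio_path, sr=None)
    n_fft = int(0.032 * sr)                              # 32 ms analysis window
    hop = n_fft // 2                                     # 50% overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2])                     # (3 * n_mfcc, n_frames)
</pre>

Since all of these descriptors are released pre-computed with the data set, participants do not need to run such an extraction themselves; the sketch is only meant to make the descriptor definitions concrete.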
===4 GROUND TRUTH===
Both video and image data were manually and independently annotated in terms of interestingness by human assessors, to make it possible to study the correlation between the two subtasks. A dedicated web-based annotation tool was developed by the organising team for the previous edition of the task [6]. This year some incremental improvements were added, and the tool was released as free and open source software (https://github.com/mvsjober/pair-annotate). Overall, more than 252 annotators participated in the annotation of the video data and 189 in that of the images. The annotators are distributed over 22 different countries in the world.

As in last year's setup, we use a pair-wise comparison protocol [3], where annotators are shown a pair of images/shots at a time and asked to tag which one of the pair is the more interesting for them. As a change from last year, we now phrase the question in a way more directly connected to the commercial application: "Which image/video makes you more interested in watching the whole movie?", with the intent to make the decision criterion clearer to the annotators. As an exhaustive annotation of all possible pairs is practically impossible due to the required human resources, a boosting selection was used instead. In particular, we used a modified version of the adaptive square design method [12], in which several annotators participate in each iteration. In this method, the number of comparisons per iteration is reduced from all possible pairs, n(n − 1)/2 ∼ O(n^2), to a subset of n(√n − 1) ∼ O(n^(3/2)) pairs, where n is the number of segments or images. For the development set, we started from iteration 6, as we could reuse the annotations done last year.

To obtain the ranking used as the basis for the next round, the pair-based annotations are aggregated with a Bradley-Terry-Luce (BTL) model computation [3], resulting in an interestingness degree for each image/shot. Previously, the same procedure was also used to obtain the final interestingness values. This year we used an alternative method, which takes all pair comparisons from all of this year's rounds into account in a single large BTL calculation. This was done mainly because we discovered afterwards that some annotations from earlier rounds had to be discarded, because some annotators did not take the task seriously. These annotators occasionally switched to cheating, where they simply always selected the first, or always the second, item as the most interesting one without actually assessing the media content. In the development set, as many as 10% of the annotations were marked as invalid and not included in the final BTL calculation. We added some heuristic anti-cheating measures to the system, although it is not possible to perfectly detect all cheating. Unfortunately, in the iterative approach, we could only discard annotations from the most recent round, as each round is based on the previous round's BTL output, which is why we developed another solution to compute the final BTL ranking. The final binary decisions are obtained using a thresholding scheme that tries to detect the boundary where interestingness values make the "jump" between the underlying distributions of the non-interesting and interesting populations. See last year's overview paper for a more detailed description [6].
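To make the aggregation step concrete, the following is a minimal sketch of fitting a Bradley-Terry(-Luce) model to pair-wise wins, in the spirit of the procedure described above. It is not the organizers' implementation (the released tool is at https://github.com/mvsjober/pair-annotate); the input format, the minorization-maximization update and the small regularization floor are illustrative assumptions.

<pre>
# Hedged sketch of BTL score aggregation from pair-wise annotations.
from collections import defaultdict
import numpy as np

def btl_scores(pairs, n_items, n_iter=200, tol=1e-8):
    """pairs: list of (winner, loser) index tuples from the annotations.
    Returns one strength per item; higher means judged more interesting."""
    wins = np.zeros(n_items)                       # W_i: total wins of item i
    n_ij = defaultdict(int)                        # comparison counts per pair
    for winner, loser in pairs:
        wins[winner] += 1
        n_ij[(min(winner, loser), max(winner, loser))] += 1

    p = np.ones(n_items)                           # initial strengths
    for _ in range(n_iter):
        denom = np.zeros(n_items)
        for (i, j), n in n_ij.items():             # MM update denominator
            d = n / (p[i] + p[j])
            denom[i] += d
            denom[j] += d
        new_p = np.where(denom > 0, wins / np.maximum(denom, 1e-12), p)
        new_p = np.maximum(new_p, 1e-6)            # floor so zero-win items stay usable
        new_p /= new_p.sum()                       # fix the overall scale
        if np.abs(new_p - p).max() < tol:
            break
        p = new_p
    return p

# Toy usage: item 0 wins most of its comparisons, so it is ranked first.
scores = btl_scores([(0, 1), (0, 2), (1, 2), (0, 1)], n_items=3)
print(np.argsort(-scores))    # ranking from most to least interesting
</pre>

The resulting per-item strengths play the role of the interestingness degrees mentioned above; the binary labels are then obtained separately by thresholding those degrees.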
===5 RUN DESCRIPTION===
Every team can submit up to 10 runs, 5 per subtask. For each subtask, a required run is defined. Image subtask, required run: classification is to be carried out with the use of the visual information; external data is allowed. Video subtask, required run: classification is to be achieved with the use of both audio and visual information; external data is allowed. Apart from these required runs, any additional run for each subtask will be considered as a general run, i.e., anything is allowed, both from the method point of view and regarding the information sources.

===6 EVALUATION===
For both subtasks, the official evaluation metric will be the mean average precision at 10 (MAP@10), computed over all videos and over the top 10 best ranked images/video shots. MAP@10 is selected because it reflects the VOD use case, where the goal is to select a small set of the most interesting images or video segments for each movie. To provide a broad overview of the systems' performances, other common metrics will also be provided. All metrics will be computed using the trec_eval tool from NIST (http://trec.nist.gov/trec_eval/).
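For intuition only, the sketch below computes one common formulation of MAP@10: for each movie, average precision over the top-10 ranked items, then the mean over movies. The official scores come from trec_eval, whose map_cut measure may differ in normalization details from this sketch; the cutoff-capped normalization used here is an assumption, and the toy input labels are invented.

<pre>
# Illustrative MAP@10 sketch; the official metric is computed with trec_eval.

def average_precision_at_k(ranked_labels, k=10):
    """ranked_labels: interestingness labels (1/0) in predicted rank order."""
    top = ranked_labels[:k]
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(top, start=1):
        if rel:
            hits += 1
            precision_sum += hits / i          # precision at each relevant rank
    n_rel = min(sum(ranked_labels), k)         # normalization (assumed convention)
    return precision_sum / n_rel if n_rel else 0.0

def map_at_k(per_movie_ranked_labels, k=10):
    """Mean of AP@k over all movies."""
    aps = [average_precision_at_k(labels, k) for labels in per_movie_ranked_labels]
    return sum(aps) / len(aps)

# Toy example: two movies, labels listed in the system's ranked order.
print(map_at_k([[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]]))
</pre>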
===7 CONCLUSIONS===
In the 2017 Predicting Media Interestingness task, a complete and comparative framework for the evaluation of content interestingness is proposed. Details on the methods and results of each individual participant team can be found in the working note papers of the MediaEval 2017 workshop proceedings.

===ACKNOWLEDGMENTS===
We would like to thank Yu-Gang Jiang and Baohan Xu from Fudan University, China, Hervé Bredin from LIMSI, France, and Michael Gygli for providing the features that accompany the released data. Part of the task was funded under research grant PN-III-P2-2.1-PED-2016-1065, agreement 30PED/2017, project SPOTTER.

===REFERENCES===
* [1] Xesca Amengual, Anna Bosch, and Josep Lluís de la Rosa. 2015. Review of Methods to Predict Social Image Interestingness and Memorability. Springer, 64–76. https://doi.org/10.1007/978-3-319-23192-1_6
* [2] Daniel E. Berlyne. 1960. Conflict, Arousal and Curiosity. McGraw-Hill.
* [3] R. A. Bradley and M. E. Terry. 1952. Rank Analysis of Incomplete Block Designs: The Method of Paired Comparisons. Biometrika 39(3-4), 324–345.
* [4] N. Dalal and B. Triggs. 2005. Histograms of Oriented Gradients for Human Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
* [5] Martin Danelljan, Gustav Häger, Fahad Shahbaz Khan, and Michael Felsberg. 2014. Accurate Scale Estimation for Robust Visual Tracking. In British Machine Vision Conference (BMVC).
* [6] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Hanli Wang, Ngoc Q. K. Duong, and Frédéric Lefebvre. 2016. MediaEval 2016 Predicting Media Interestingness Task. In Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands.
* [7] Sagnik Dhar, Vicente Ordonez, and Tamara L. Berg. 2011. High Level Describable Attributes for Predicting Aesthetics and Interestingness. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
* [8] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. 2013. The Interestingness of Images. In IEEE International Conference on Computer Vision (ICCV).
* [9] Yu-Gang Jiang, Qi Dai, Tao Mei, Yong Rui, and Shih-Fu Chang. 2015. Super Fast Event Recognition in Internet Videos. IEEE Transactions on Multimedia 17(8), 1–13.
* [10] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yan. 2013. Understanding and Predicting Interestingness of Videos. In AAAI Conference on Artificial Intelligence.
* [11] S. Lazebnik, C. Schmid, and J. Ponce. 2006. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2169–2178.
* [12] Jing Li, Marcus Barkowsky, and Patrick Le Callet. 2013. Boosting Paired Comparison Methodology in Measuring Visual Discomfort of 3DTV: Performances of Three Different Designs. In SPIE Electronic Imaging, Stereoscopic Displays and Applications, Vol. 8648.
* [13] D. Lowe. 2004. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60, 91–110.
* [14] T. Ojala, M. Pietikäinen, and T. Mäenpää. 2002. Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(7), 971–987.
* [15] A. Oliva and A. Torralba. 2001. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision 42, 145–175.
* [16] Paul J. Silvia. 2006. Exploring the Psychology of Interest. Oxford University Press.
* [17] Craig Smith and Phoebe Ellsworth. 1985. Patterns of Cognitive Appraisal in Emotion. Journal of Personality and Social Psychology 48(4), 813–838.
* [18] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
* [19] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. 2010. SUN Database: Large-Scale Scene Recognition from Abbey to Zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3485–3492.