              MediaEval 2017 Predicting Media Interestingness Task
                        Claire-Hélène Demarty1, Mats Sjöberg2, Bogdan Ionescu3, Thanh-Toan Do4,
                                          Michael Gygli5, Ngoc Q. K. Duong1
                                                                 1 Technicolor, Rennes, France
      2 Dept. of Computer Science and Helsinki Institute for Information Technology HIIT, University of Helsinki, Finland
                                                      3 LAPI, University Politehnica of Bucharest, Romania
                                                               4 University of Adelaide, Australia
                                                           5 ETH Zurich, Switzerland & Gifs.com, US


Copyright held by the owner/author(s).
MediaEval’17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
In this paper, the Predicting Media Interestingness task, running for the second year as part of the MediaEval 2017 Benchmarking Initiative for Multimedia Evaluation, is presented. For the task, participants are expected to create systems that automatically select the images and video segments that are considered to be the most interesting for a common viewer. All task characteristics are described, namely the use case and challenges, the released data set and ground truth, the required participant runs and the evaluation metrics.

1    INTRODUCTION
Predicting the interestingness of media content has been an active area of research in the computer vision community for several years now [1, 7, 8, 10], and it was studied even earlier in the psychology community [2, 16, 17]. However, there were multiple competing definitions of interestingness, only a few publicly available datasets and, until last year, no public benchmark for assessing the interestingness of content. In 2016, a task for the Prediction of Media Interestingness was proposed in the MediaEval 2016 Benchmarking Initiative for Multimedia Evaluation. This task was also an opportunity to propose a clear definition of interestingness, compatible with a real-world industry use case at Technicolor1. The 2017 edition of the MediaEval benchmark includes a follow-up of the Predicting Media Interestingness Task. This paper gives an overview of the task in its second year, together with a description of the data and ground truth. The required runs and the chosen evaluation metrics are also detailed. In all cases, changes with respect to last year’s edition are highlighted.
1 http://www.technicolor.com

2    TASK DESCRIPTION
The Predicting Media Interestingness Task was proposed for the first time last year. This year’s edition is a follow-up which builds incrementally upon the previous experience. The task requires participants to automatically select the images and/or video segments that are considered to be the most interesting for a common viewer. Interestingness is to be judged based on visual appearance, audio information and text accompanying the data, including movie metadata. To solve the task, participants are strongly encouraged to deploy multimodal approaches.
   As in 2016, interestingness should be assessed according to a practical use case at Technicolor, which involves helping professionals to illustrate a Video on Demand (VOD) web site by selecting some interesting frames and/or video excerpts for the movies. The frames and excerpts should be suitable in terms of helping a user decide whether he/she is interested in watching the whole movie. Once again, two subtasks are offered to participants, corresponding to the two types of available media content, namely images and videos. Participants are encouraged to submit to both subtasks. In both cases, the task is considered as both a binary classification and a ranking task, and prediction is carried out on a per-movie basis. The two subtasks are:
   Predicting Image Interestingness Given a set of key-frames extracted from a certain movie, the task involves automatically (1) identifying those images that viewers report to be interesting and (2) ranking all images according to their level of interestingness. To solve the task, participants can make use of visual content as well as accompanying metadata, e.g., Internet data about the movie, social media information, etc.
   Predicting Video Interestingness Given a set of video segments extracted from a certain movie, the task involves automatically (1) identifying the segments that viewers report to be interesting and (2) ranking all segments according to their level of interestingness. To solve the task, participants can make use of visual and audio data as well as accompanying metadata, e.g., subtitles, Internet data about the movie, etc.

3    DATA DESCRIPTION
The data is extracted from Creative Commons licensed Hollywood-like videos: 103 movie trailers and 4 continuous extracts of ca. 15 min from full-length movies. For the video interestingness subtask, the data consists of video segments obtained after a manual segmentation. For all videos but four, these segments correspond to shots (a video shot is the continuous frame sequence recorded between the camera being turned on and being turned off); their average duration is one second. The four remaining videos, which correspond to the full-length movie extracts cited above, were manually segmented into 243 longer segments with an average duration of 11.4 s, to better preserve a certain unity of meaning and the audio information of the resulting segments. For the image subtask, the data consists of collections of key-frames extracted from the video segments used for the video subtask (one key-frame per segment), which allows the comparison of results between the two subtasks. The extracted key-frame corresponds to the frame in the middle of each video segment. In total, 7,396 video segments and 7,396 key-frames are released in the development set, whereas the test set consists of 2,435 video segments and the same number of key-frames.
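The key-frames themselves are part of the released data; purely as an illustration of the "middle frame" convention, the following sketch grabs such a frame with OpenCV. The video path and frame boundaries are hypothetical and not part of the official release.

```python
# Minimal sketch (not part of the official release): extract the middle
# frame of a segment with OpenCV, given hypothetical frame boundaries.
import cv2

def middle_keyframe(video_path: str, start_frame: int, end_frame: int):
    """Return the frame at the midpoint of [start_frame, end_frame]."""
    cap = cv2.VideoCapture(video_path)
    mid = (start_frame + end_frame) // 2
    cap.set(cv2.CAP_PROP_POS_FRAMES, mid)   # seek to the middle frame
    ok, frame = cap.read()                  # BGR image as a NumPy array
    cap.release()
    return frame if ok else None

# Example with hypothetical values:
# key = middle_keyframe("trailer_042.mp4", start_frame=1200, end_frame=1230)
# cv2.imwrite("trailer_042_seg017.jpg", key)
```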


   To facilitate participation from various communities, we also provide several pre-computed content descriptors. Low-level features: dense SIFT (Scale-Invariant Feature Transform), computed following the original work in [13], except that the local frame patches are densely sampled instead of being selected by interest point detectors; a codebook of 300 codewords is used in the quantization process, with a spatial pyramid of three layers [11]. HoG (Histograms of Oriented Gradients) descriptors [4], computed over densely sampled patches; following [19], HoG descriptors in a 2 × 2 neighborhood are concatenated to form a descriptor of higher dimension. LBP (Local Binary Patterns) [14]. GIST, computed from the output energy of several Gabor-like filters (8 orientations and 4 scales) over a dense frame grid, as in [15]. Color histograms computed in the HSV (Hue-Saturation-Value) space. MFCC (Mel-Frequency Cepstral Coefficients), computed over 32 ms time windows with 50% overlap; the cepstral vectors are concatenated with their first and second derivatives. The fc7 layer (4,096 dimensions) and prob layer (1,000 dimensions) of AlexNet [9]. Mid-level face detection and tracking related features2, obtained by face tracking-by-detection in each video shot with a HoG detector [4] and the correlation tracker proposed in [5]. In addition to these frame-based features, we provide C3D features [18], extracted from the fc6 layer (4,096 dimensions) and averaged at the segment level.
2 http://multimediaeval.org/mediaeval2016/persondiscovery/
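The exact storage format of the released descriptors is documented with the data. As an illustration only, assuming each descriptor were stored as one NumPy array per key-frame under hypothetical paths, a simple early-fusion baseline could concatenate a few descriptors and train an off-the-shelf classifier:

```python
# Illustrative early-fusion sketch; the .npy layout, paths and ids below are
# assumptions, not the actual format of the released descriptors.
import numpy as np
from sklearn.linear_model import LogisticRegression

def load_features(image_id: str, feature_names=("hog", "lbp", "fc7")):
    """Concatenate several pre-computed descriptors for one key-frame."""
    parts = [np.load(f"features/{name}/{image_id}.npy").ravel()
             for name in feature_names]
    return np.concatenate(parts)

# ids and labels would come from the development-set annotations
ids = ["movie01_seg001", "movie01_seg002"]          # hypothetical ids
y = np.array([1, 0])                                # 1 = interesting
X = np.stack([load_features(i) for i in ids])

clf = LogisticRegression(max_iter=1000).fit(X, y)
scores = clf.predict_proba(X)[:, 1]                 # interestingness scores
```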
                                                                         a required run is defined: Image subtask - required run: classification
4    GROUND TRUTH
Both the video and the image data were manually and independently annotated in terms of interestingness by human assessors, which makes it possible to study the correlation between the two subtasks. A dedicated web-based annotation tool was developed by the organising team for the previous edition of the task [6]. This year some incremental improvements were added, and the tool was released as free and open source software3. Overall, more than 252 annotators participated in the annotation of the video data and 189 in that of the images. The annotators come from 22 different countries around the world.
3 https://github.com/mvsjober/pair-annotate
   As in last year’s setup, we use a pair-wise comparison protocol [3] in which annotators are shown a pair of images/shots at a time and asked to tag which of the two is the more interesting for them. As a change from last year, we now phrase the question in a way more directly connected to the commercial application: “Which image/video makes you more interested in watching the whole movie?”, with the intent of making the decision criterion clearer to the annotators. As an exhaustive annotation of all possible pairs is practically impossible due to the required human resources, a boosting selection was used instead. In particular, we used a modified version of the adaptive square design method [12], in which several annotators participate in each iteration. In this method, the number of comparisons per iteration is reduced from all n(n − 1)/2 ∼ O(n^2) possible pairs to a subset of n(√n − 1) ∼ O(n^(3/2)) pairs, where n is the number of segments or images (e.g., roughly 900 instead of 4,950 comparisons for n = 100). For the development set, we started from iteration 6, as we could reuse the annotations done last year. To obtain the ranking used as the basis for the next round, the pair-based annotations are aggregated with the Bradley-Terry-Luce (BTL) model [3], resulting in an interestingness degree for each image/shot.
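The BTL aggregation itself is a standard procedure; as a rough sketch (not the organisers’ exact implementation), the following fits BTL strengths to a set of pairwise outcomes with a simple iterative maximum-likelihood update:

```python
# Rough illustration of Bradley-Terry-Luce aggregation via an iterative
# maximum-likelihood (minorization-maximization) update; this is a sketch,
# not the task organisers' implementation.
import numpy as np
from collections import Counter

def btl_strengths(pairs, n_items, n_iter=100):
    """pairs: list of (winner, loser) index tuples from the annotations."""
    wins = Counter(w for w, _ in pairs)                 # total wins per item
    comps = Counter()                                   # comparison counts per unordered pair
    for w, l in pairs:
        comps[frozenset((w, l))] += 1
    p = np.ones(n_items)
    for _ in range(n_iter):
        new_p = np.zeros(n_items)
        for i in range(n_items):
            denom = sum(c / (p[i] + p[j])
                        for key, c in comps.items() if i in key
                        for j in key if j != i)
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        p = new_p / new_p.sum()                         # normalized strengths
    return p   # higher value = judged more interesting overall

# Example: strengths = btl_strengths([(0, 1), (0, 2), (2, 1)], n_items=3)
```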
   Previously, the same procedure was also used to obtain the final interestingness values. This year we used an alternative method, which pooled the pairwise comparisons from all of this year’s rounds into a single large BTL calculation. This was done mainly because we discovered afterwards that some annotations from earlier rounds had to be discarded due to unreliable annotators. These annotators occasionally resorted to cheating, simply always selecting the first (or always the second) item as the more interesting one without actually assessing the media content. In the development set, as many as 10% of the annotations were marked as invalid and excluded from the final BTL calculation. We added some heuristic anti-cheating measures to the annotation system, although it is not possible to detect all cheating perfectly. Unfortunately, in the iterative approach we could only have discarded annotations from the most recent round, since each round is based on the previous round’s BTL output; this is why we developed the alternative way of computing the final BTL ranking. The final binary decisions are obtained with a thresholding scheme that tries to detect the boundary where the interestingness values make the “jump” between the underlying distributions of the non-interesting and interesting populations. See last year’s overview paper for a more detailed description [6].
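The exact thresholding scheme is described in [6]; as a simplified stand-in, one could place the decision boundary at the largest gap in the sorted interestingness values of a movie, as sketched below. This heuristic is an assumption for illustration, not the organisers’ scheme.

```python
# Simplified stand-in for the binarisation step: put the threshold at the
# largest "jump" in the sorted interestingness values of one movie.
# Illustrative heuristic only; the actual scheme is described in [6].
import numpy as np

def binarize_by_largest_gap(values):
    """values: 1-D array of BTL interestingness degrees for one movie."""
    sorted_vals = np.sort(values)
    gaps = np.diff(sorted_vals)                  # jump between neighbours
    cut = np.argmax(gaps)                        # index of the largest jump
    threshold = (sorted_vals[cut] + sorted_vals[cut + 1]) / 2
    return (values > threshold).astype(int)      # 1 = interesting

# labels = binarize_by_largest_gap(np.array([0.02, 0.03, 0.04, 0.55, 0.60]))
# -> array([0, 0, 0, 1, 1])
```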
5    RUN DESCRIPTION
Every team can submit up to 10 runs, 5 per subtask. For each subtask, a required run is defined. Image subtask, required run: classification is to be carried out using the visual information; external data is allowed. Video subtask, required run: classification is to be carried out using both audio and visual information; external data is allowed. Apart from these required runs, any additional run for each subtask will be considered a general run, i.e., anything is allowed, both in terms of methods and of information sources.
6    EVALUATION
For both subtasks, the official evaluation metric is the mean average precision at 10 (MAP@10), computed over all videos and over the top 10 best-ranked images/video shots. MAP@10 is selected because it reflects the VOD use case, where the goal is to select a small set of the most interesting images or video segments for each movie. To provide a broader overview of the systems’ performances, other common metrics will also be reported. All metrics are computed with the trec_eval tool from NIST4.
4 http://trec.nist.gov/trec_eval/
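For reference, a minimal sketch of average precision at a cut-off of 10 for a single movie is given below; official scores are produced with trec_eval, so this only illustrates the metric (the per-movie denominator convention is an assumption).

```python
# Minimal illustration of average precision at cut-off 10 for one movie;
# official results are computed with trec_eval, not with this sketch.
def average_precision_at_10(ranked_ids, relevant_ids, k=10):
    """ranked_ids: system ranking (best first); relevant_ids: ground-truth set."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank          # precision at this rank
    denom = min(len(relevant_ids), k)
    return precision_sum / denom if denom else 0.0

# MAP@10 is then the mean of this value over all movies in the test set.
```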
7    CONCLUSIONS
In the 2017 Predicting Media Interestingness task, a complete and comparative framework for the evaluation of content interestingness is proposed. Details on the methods and results of each individual participating team can be found in the working note papers of the MediaEval 2017 workshop proceedings.

ACKNOWLEDGMENTS
We would like to thank Yu-Gang Jiang and Baohan Xu from Fudan University, China, Hervé Bredin from LIMSI, France, and Michael Gygli for providing the features that accompany the released data. Part of the task was funded under research grant PN-III-P2-2.1-PED-2016-1065, agreement 30PED/2017, project SPOTTER.


REFERENCES
 [1] Xesca Amengual, Anna Bosch, and Josep Lluís de la Rosa. 2015. Review
     of Methods to Predict Social Image Interestingness and Memorability.
     Springer, 64–76. https://doi.org/10.1007/978-3-319-23192-1_6
 [2] Daniel E. Berlyne. 1960. Conflict, arousal and curiosity. Mc-Graw-Hill.
 [3] R. A. Bradley and M. E. Terry. 1952. Rank Analysis of Incomplete
     Block Designs: the method of paired comparisons. Biometrika 39 (3-4)
     (1952), 324–345.
 [4] N. Dalal and B. Triggs. 2005. Histograms of oriented gradients for
     human detection. In IEEE CVPR Conference on Computer Vision and
     Pattern Recognition.
 [5] Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, and Michael
     Felsberg. 2014. Accurate scale estimation for robust visual tracking.
     In British Machine Vision Conference.
 [6] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan
     Do, Hanli Wang, Ngoc Q.K. Duong, and Frédéric Lefebvre. 2016. Media-
     Eval 2016 Predicting Media Interestingness Task. In Proceedings of the
     MediaEval 2016 Workshop. Hilversum, Netherlands.
 [7] Sagnik Dhar, Vicente Ordonez, and Tamara L Berg. 2011. High level
     describable attributes for predicting aesthetics and interestingness. In
     IEEE International Conference on Computer Vision and Pattern Recogni-
     tion.
 [8] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. van Gool.
     2013. The Interestingness of Images. In ICCV International Conference
     on Computer Vision.
 [9] Yu-Gang Jiang, Qi Dai, Tao Mei, Yong Rui, and Shih-Fu Chang. 2015.
     Super Fast Event Recognition in Internet Videos. IEEE Transactions
     on Multimedia 17, 8 (2015), 1–13.
[10] Y-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yan. 2013.
     Understanding and Predicting Interestingness of Videos. In AAAI
     Conference on Artificial Intelligence.
[11] S. Lazebnik, C. Schmid, and J. Ponce. 2006. Beyond bags of features:
     Spatial pyramid matching for recognizing natural scene categories.
     In IEEE CVPR Conference on Computer Vision and Pattern Recognition.
     2169–2178.
[12] Jing Li, Marcus Barkowsky, and Patrick Le Callet. 2013. Boosting paired
     comparison methodology in measuring visual discomfort of 3DTV:
     performances of three different designs. In SPIE Electronic Imaging,
     Stereoscopic Displays and Applications, Vol. 8648.
[13] D. Lowe. 2004. Distinctive image features from scale-invariant key-
     points. International Journal of Computer Vision 60 (2004), 91–110.
[14] T. Ojala, M. Pietikainen, and T. Maenpaa. 2002. Multiresolution gray-
     scale and rotation invariant texture classification with local binary
     patterns. IEEE Transactions on Pattern Analysis and Machine Intelli-
     gence 24(7) (2002), 971–987.
[15] A. Oliva and A. Torralba. 2001. Modeling the shape of the scene: a
     holistic representation of the spatial envelope. International Journal
     of Computer Vision 42 (2001), 145–175.
[16] Paul J. Silvia. 2006. Exploring the psychology of interest. Oxford Uni-
     versity Press.
[17] Craig Smith and Phoebe Ellsworth. 1985. Patterns of cognitive ap-
     praisal in emotion. Journal of Personality and Social Psychology 48, 4
     (1985), 813–838.
[18] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and
     Manohar Paluri. 2015. Learning spatiotemporal features with 3d con-
     volutional networks. In Proceedings of the IEEE International Conference
     on Computer Vision.
[19] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. 2010. SUN
     database: Large-scale scene recognition from abbey to zoo. In IEEE
     CVPR Conference on Computer Vision and Pattern Recognition. 3485–
     3492.