       MediaEval 2016 Predicting Media Interestingness Task

Claire-Hélène Demarty (1), Mats Sjöberg (2), Bogdan Ionescu (3), Thanh-Toan Do (4),
Hanli Wang (5), Ngoc Q. K. Duong (1), Frédéric Lefebvre (1)

(1) Technicolor, Rennes, France
(2) HIIT, University of Helsinki, Finland
(3) LAPI, University Politehnica of Bucharest, Romania
(4) Singapore University of Technology and Design, Singapore & University of Science, Vietnam
(5) Tongji University, China

{claire-helene.demarty, quang-khanh-ngoc.duong, frederic.lefebvre}@technicolor.com
mats.sjoberg@helsinki.fi, bionescu@imag.pub.ro, thanhtoan_do@sutd.edu.sg, hanliwang@tongji.edu.cn


ABSTRACT

This paper provides an overview of the Predicting Media Interestingness task that is organized as part of the MediaEval 2016 Benchmarking Initiative for Multimedia Evaluation. The task, which is running for the first year, expects participants to create systems that automatically select the images and video segments that are considered to be the most interesting for a common viewer. In this paper, we present the task use case and challenges, the proposed data set and ground truth, the required participant runs and the evaluation metrics.

1. INTRODUCTION

The ability of multimedia data to attract and keep people's interest over long periods of time is gaining more and more importance in the field of multimedia, where concepts such as memorability [4], aesthetics [9], interestingness [13, 11], attractiveness [14] and affective value [25] are intensely studied, especially in the context of the ever-growing market value of social media and advertising. In particular, although interestingness has long been studied in the psychology community [21, 2, 22], and more recently but actively in the image processing community [23, 5, 1, 6, 10, 18], no common definition exists in the literature. Moreover, only a few datasets are publicly available, and no benchmark exists for evaluating what makes a media item interesting.

In this paper we introduce the 2016 MediaEval^1 Predicting Media Interestingness Task, a pioneering benchmarking initiative for the automatic prediction of image and video interestingness. The task, which is in its first year, derives from a practical use case at Technicolor^2: helping professionals illustrate a Video on Demand (VOD) web site by selecting interesting frames and/or video excerpts for the posted movies. The frames and excerpts should help a user decide whether he/she is interested in watching the underlying movie. The data in this task is therefore adapted to this particular context, which provides a more focused definition of interestingness.

^1 http://multimediaeval.org/
^2 http://www.technicolor.com/

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

2. TASK DESCRIPTION

The task requires participants to deploy algorithms that automatically select the images and video segments of Hollywood-like movies which are considered to be the most interesting for a common viewer. Interestingness is judged based on the visual appearance, the audio information and the text accompanying the data. Therefore, the multimodal facet of interestingness prediction can be investigated.

Two subtasks are provided, which correspond to the two types of available media content, namely:

• predicting image interestingness — given a set of key-frames extracted from a movie, the task requires participants to automatically identify those images of the given movie that viewers report to be the most interesting. To solve the task, participants can make use of visual content as well as external metadata, e.g., Internet data about the movie, social media information, etc.;

• predicting video interestingness — given the video shots of a movie, the task requires participants to automatically identify those shots that viewers report to be the most interesting in the given movie. To solve the task, participants can make use of visual and audio data as well as external data, e.g., subtitles, Internet data, etc.

In both cases, the task is a binary classification task: participants are expected to label the provided data as interesting or not, with prediction carried out on a per-movie basis. In addition, a confidence value is required for each prediction.

3. DATA DESCRIPTION

The 2016 data is extracted from Creative Commons licensed trailers of Hollywood-like movies. It consists of development data, intended for designing and training the methods (extracted from 52 trailers), and test data, used for the final benchmarking (extracted from 26 trailers). The choice of using trailers instead of full movies is driven by the need for data that is both freely distributable and still representative of the content and quality of Hollywood movies. Trailers result from a manual filtering of movies that keeps interesting scenes, but also less attractive shots to balance the content. We therefore believe they remain representative for the task.

For the predicting video interestingness subtask, the data consists of the video shots obtained after manual segmentation of the videos (video shots are the continuous frame sequences recorded between a camera turn-on and turn-off), i.e., 5,054 shots for the development data and 2,342 shots for the test data.

For the predicting image interestingness subtask, the data consists of collections of key-frames extracted from the video shots used for the video subtask. A single key-frame is extracted per shot, leading to 5,054 key-frames for the development set and 2,342 for the test set. This key-frame is chosen as the middle frame of the shot, as it is highly likely to capture the most important information of the shot.
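As a minimal sketch of this selection rule, assuming OpenCV and shots already cut into individual video files (the file name below is hypothetical):

```python
import cv2

def middle_keyframe(shot_path: str):
    """Return the middle frame of a video shot, mirroring the task's
    key-frame selection rule (middle frame of each shot)."""
    cap = cv2.VideoCapture(shot_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Seek to the middle frame index and decode it.
    cap.set(cv2.CAP_PROP_POS_FRAMES, n_frames // 2)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise IOError(f"Could not read middle frame of {shot_path}")
    return frame

# Hypothetical usage on one development shot:
# frame = middle_keyframe("devset/movie_01/shot_0042.mp4")
```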
To facilitate participation from various communities, we also provide some pre-computed content descriptors, namely:

• low-level features — dense SIFT (Scale Invariant Feature Transform), computed following the original work in [17], except that the local frame patches are densely sampled instead of being selected by interest point detectors; a codebook of 300 codewords is used in the quantization process, with a three-layer spatial pyramid [15];

• HoG descriptors (Histograms of Oriented Gradients) [7], computed over densely sampled patches; following [24], HoG descriptors in a 2 × 2 neighborhood are concatenated to form a descriptor of higher dimension;

• LBP (Local Binary Patterns) [19];

• GIST, computed based on the output energy of several Gabor-like filters (8 orientations and 4 scales) over a dense frame grid, as in [20];

• color histograms, computed in the HSV (Hue-Saturation-Value) color space;

• MFCC (Mel-Frequency Cepstral Coefficients), computed over 32 ms time windows with 50% overlap; the cepstral vectors are concatenated with their first and second derivatives (a minimal extraction sketch is given after this list);

• the fc7 layer (4,096 dimensions) and prob layer (1,000 dimensions) of AlexNet [12];

• mid-level face detection and tracking related features^3, obtained by face tracking-by-detection in each video shot, using a HoG detector [7] and the correlation tracker proposed in [8].

^3 http://multimediaeval.org/mediaeval2016/persondiscovery/
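As an example of reproducing one of these descriptors, the sketch below computes MFCCs over 32 ms windows with 50% overlap and stacks them with their first and second derivatives. It assumes the librosa library and a hypothetical audio file extracted from a shot; the number of cepstral coefficients is not stated in this paper, so 13 is an assumption.

```python
import librosa
import numpy as np

# Minimal sketch, not the organizers' exact extraction code.
y, sr = librosa.load("shot_0042.wav", sr=None)  # hypothetical audio file
win = int(0.032 * sr)   # 32 ms analysis window
hop = win // 2          # 50% overlap
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=win, hop_length=hop)
# Concatenate the cepstral vectors with their 1st and 2nd derivatives.
d1 = librosa.feature.delta(mfcc, order=1)
d2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, d1, d2])  # shape: (39, n_frames)
```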
4. GROUND TRUTH

All data was manually annotated in terms of interestingness by human assessors, with the help of a dedicated web-based annotation tool developed for this purpose. Overall, more than 312 annotators participated in the annotation of the video data and 100 in that of the images. The annotators come from 29 different countries around the world.

We use a pair-wise comparison protocol [3], in which annotators are shown a pair of images/shots at a time and asked to tag which of the two is more interesting to them. The process is repeated so as to scan the whole dataset. As an exhaustive comparison of all possible pairs is practically impossible due to the required human resources, a boosting selection was used instead, i.e., a modified version of the adaptive square design method [16], in which several annotators participate in each iteration.

To obtain the final ground truth, the pair-based annotations are aggregated with the Bradley-Terry-Luce (BTL) model [3], resulting in an interestingness degree for each image/shot.
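For illustration, the following is a minimal sketch of BTL aggregation using the standard minorization-maximization update, assuming a matrix of pairwise win counts; it is a generic fitting scheme, not necessarily the organizers' exact implementation.

```python
import numpy as np

def btl_scores(wins: np.ndarray, n_iter: int = 100) -> np.ndarray:
    """Estimate BTL interestingness degrees from pairwise annotations.
    wins[i, j] counts how often item i was judged more interesting
    than item j (diagonal assumed zero)."""
    n_ij = wins + wins.T            # comparisons made between i and j
    w = wins.sum(axis=1)            # total wins of each item
    p = np.ones(wins.shape[0])      # uniform initial scores
    for _ in range(n_iter):         # minorization-maximization update
        denom = (n_ij / (p[:, None] + p[None, :])).sum(axis=1)
        p = w / np.maximum(denom, 1e-12)  # guard against empty rows
        p /= p.sum()                # fix the scale (identifiability)
    return p
```

Ranking the resulting scores gives the per-item interestingness degrees that are then binarized as described next.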
The final binary decisions are obtained with the following processing steps: (i) the interestingness values are ranked in increasing order and normalized between 0 and 1; (ii) the resulting curve is smoothed with a short averaging window, and its second derivative is computed; (iii) for both shots and images, and for all videos, a threshold empirically set to 0.01 is applied to the second derivative to find the first point whose value is above this threshold. This position corresponds to the boundary between non-interesting and interesting shots/images. The underlying motivation for this empirical rule is the following: the non-interesting population has rather similar interestingness values which increase slowly, whereas a gap appears when switching from this non-interesting population to the population of more interesting samples. The second derivative was preferred to the first derivative, as it allowed those gaps to be selected more precisely.

Ground truth is provided in binary format, i.e., 1 for interesting and 0 for non-interesting, for each image and video in the two subtasks.
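The three steps above can be summarized in a few lines of NumPy. This is a sketch under stated assumptions: the paper only mentions "a short averaging window", so the window length of 5 is arbitrary.

```python
import numpy as np

def binarize(scores: np.ndarray, win: int = 5, thr: float = 0.01):
    """Turn per-item BTL interestingness degrees into binary labels,
    following steps (i)-(iii) above."""
    order = np.argsort(scores)                    # (i) rank increasing
    v = scores[order]
    v = (v - v.min()) / (v.max() - v.min())       # (i) normalize to [0, 1]
    v = np.convolve(v, np.ones(win) / win, mode="same")  # (ii) smooth
    d2 = np.gradient(np.gradient(v))              # (ii) second derivative
    above = np.nonzero(d2 > thr)[0]               # (iii) threshold at 0.01
    cut = above[0] if above.size else len(v)      # first crossing = boundary
    labels = np.zeros(len(v), dtype=int)
    labels[order[cut:]] = 1                       # beyond the gap -> interesting
    return labels
```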
5. RUN DESCRIPTION

Each participating team is expected to submit up to 5 runs for both subtasks altogether. Among these 5 runs, two are required, one per subtask: for the predicting image interestingness subtask, the required run is built on visual information only and no external data is allowed; for the predicting video interestingness subtask, only audio and visual information is allowed (no external data) for the required run.

Note that in this context, external data is understood as: (i) additional datasets and annotations dedicated to interestingness classification; (ii) pre-trained models, features and detectors obtained from such dedicated additional datasets; and (iii) additional metadata about the provided content that could be found on the Internet (e.g., from IMDB^4).

On the contrary, CNN features trained on generic datasets such as ImageNet (typically the provided CNN features) are allowed in the required runs. By generic datasets, we mean datasets that were not designed to support research in the task area, i.e., in the classification/study of image and video interestingness.

^4 http://www.imdb.com/

6. EVALUATION

The official evaluation metric is the mean average precision (MAP) over the interesting class, i.e., the mean of the average precision scores computed for each trailer. This metric, adapted to retrieval tasks, fits the chosen use case, in which we want to help a user choose between different samples by providing a list of suggestions ranked according to interestingness. For assessing the performance, we use the trec_eval tool provided by NIST^5. In addition to MAP, other commonly used metrics such as precision and recall will be provided to participants.

^5 http://trec.nist.gov/trec_eval/
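For a quick local check before running trec_eval, MAP over the interesting class can be computed per trailer and then averaged, as sketched below; scikit-learn's average_precision_score is used here as a stand-in for trec_eval's AP computation, so minor numerical differences are possible.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(per_trailer):
    """per_trailer: list of (y_true, y_score) pairs, one per trailer,
    where y_true holds binary labels (1 = interesting) and y_score the
    submitted confidence values. Returns MAP over the interesting class."""
    aps = [average_precision_score(y_true, y_score)
           for y_true, y_score in per_trailer]
    return float(np.mean(aps))

# Hypothetical usage with two trailers:
# map_score = mean_average_precision([
#     (np.array([1, 0, 0, 1]), np.array([0.9, 0.2, 0.4, 0.7])),
#     (np.array([0, 1, 0]),    np.array([0.1, 0.8, 0.3])),
# ])
```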
7. CONCLUSIONS

The 2016 Predicting Media Interestingness task provides participants with a comparative and collaborative evaluation framework for predicting content interestingness, with an explicit focus on multimedia approaches. Details on the methods and results of each individual participating team can be found in the working note papers of the MediaEval 2016 workshop proceedings.

8. ACKNOWLEDGMENTS

We would like to thank Yu-Gang Jiang and Baohan Xu from Fudan University, China, and Hervé Bredin from LIMSI, France, for providing the features that accompany the released data, and Alexey Ozerov and Vincent Demoulin for their valuable input on the task definition.

9. REFERENCES

[1] X. Amengual, A. Bosch, and J. L. de la Rosa. Review of Methods to Predict Social Image Interestingness and Memorability, pages 64–76. Springer, 2015.
[2] D. E. Berlyne. Conflict, Arousal and Curiosity. McGraw-Hill, 1960.
[3] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: the method of paired comparisons. Biometrika, 39(3-4):324–345, 1952.
[4] Z. Bylinskii, P. Isola, C. Bainbridge, A. Torralba, and A. Oliva. Intrinsic and extrinsic effects on image memorability. Vision Research, 116:165–178, 2015.
[5] C. Chamaret, C.-H. Demarty, V. Demoulin, and G. Marquant. Experiencing the interestingness concept within and between pictures. In Proceedings of SPIE, Human Vision and Electronic Imaging, 2016.
[6] S. L. Chu, E. Fedorovskaya, F. Quek, and J. Snyder. The effect of familiarity on perceived interestingness of images. In Proceedings of SPIE, volume 8651, pages 86511C–86511C-12, 2013.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[8] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference (BMVC), 2014.
[9] S. Dhar, V. Ordonez, and T. L. Berg. High level describable attributes for predicting aesthetics and interestingness. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[10] H. Grabner, F. Nater, M. Druey, and L. Van Gool. Visual interestingness in image sequences. In Proceedings of the 21st ACM International Conference on Multimedia, pages 1017–1026, New York, NY, USA, 2013.
[11] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The interestingness of images. In IEEE International Conference on Computer Vision (ICCV), 2013.
[12] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE Transactions on Multimedia, 17(8):1–13, 2015.
[13] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yan. Understanding and predicting interestingness of videos. In AAAI Conference on Artificial Intelligence, 2013.
[14] S. Kalayci, H. K. Ekenel, and H. Gunes. Automatic analysis of facial attractiveness from video. In IEEE International Conference on Image Processing (ICIP), pages 4191–4195, 2014.
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2169–2178, 2006.
[16] J. Li, M. Barkowsky, and P. Le Callet. Boosting paired comparison methodology in measuring visual discomfort of 3DTV: performances of three different designs. In SPIE Electronic Imaging, Stereoscopic Displays and Applications, volume 8648, 2013.
[17] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
[18] J. Machajdik and A. Hanbury. Affective image classification using features inspired by psychology and art theory. In Proceedings of the 18th ACM International Conference on Multimedia, pages 83–92, New York, NY, USA, 2010.
[19] T. Ojala, M. Pietikainen, and T. Maenpaa. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):971–987, 2002.
[20] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001.
[21] P. J. Silvia. Exploring the Psychology of Interest. Oxford University Press, 2006.
[22] C. Smith and P. Ellsworth. Patterns of cognitive appraisal in emotion. Journal of Personality and Social Psychology, 48(4):813–838, 1985.
[23] M. Soleymani. The quest for visual interest. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 919–922, New York, NY, USA, 2015.
[24] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3485–3492, 2010.
[25] A. Yazdani, E. Skodras, N. Fakotakis, and T. Ebrahimi. Multimedia content analysis for emotional characterization of music video clips. EURASIP Journal on Image and Video Processing, (26), 2013.