               TUD-MMC at MediaEval 2016: Predicting Media
                          Interestingness Task

                                                           Cynthia C. S. Liem
                                   Multimedia Computing Group, Delft University of Technology
                                                   Delft, The Netherlands
                                                          c.c.s.liem@tudelft.nl


ABSTRACT
This working notes paper describes the TUD-MMC entry to the MediaEval 2016 Predicting Media Interestingness Task. Noting that the nature of movie trailer shots is different from that of preceding tasks on image and video interestingness, we propose two baseline heuristic approaches based on the clear occurrence of people. MAP scores obtained on the development set and test set suggest that our approaches cover a limited but non-marginal subset of the interestingness spectrum. Most strikingly, our obtained scores on the Image and Video Subtasks are comparable to or better than those obtained when evaluating the ground truth annotations of the Image Subtask against the Video Subtask and vice versa.

1.  INTRODUCTION
   The MediaEval 2016 Predicting Media Interestingness Task [3] considers interestingness of shots and frames in Hollywood-like trailer videos. The intended use case for this task would be to automatically select interesting frames and/or video excerpts for movie previewing on Video on Demand web sites.
   Movie trailers are intended to raise a viewer's interest in a movie. As a consequence, they will not be a topical summary of the video, and they are likely to consist of 'teaser material' that should make a viewer curious to watch more.
   In our approach to this problem, we originally were interested in assessing whether 'interestingness' could relate to salient narrative elements in a trailer. In particular, we wondered whether criteria for connecting production music fragments to storylines [5] would also be relevant factors in rater assessment of interestingness.
   However, the rating acquisition procedure for the task did not involve full trailer watching by the raters, but rather the rating of isolated pairs of clips or frames. As such, while the ideas in [5] largely considered the dynamic unfolding of a story, a sense of overall storyline and longer temporal dynamics could not be assumed in the current task.
   We ultimately decided to pursue a simpler strategy: the currently presented approaches investigate to what extent the clear presence of people, as approximated by automated face detection results, indicates visual environments which are more interesting to a human rater. The underlying assumption is that close-ups should attract a viewer's attention, and as such may cause larger empathy with the shown subject or its environment. It will be interesting to consider how this currently proposed heuristic method compares against more agnostic direct machine learning techniques on the provided labels.

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

    Data                               MAP
    video ground truth on image set    0.1747
    image ground truth on video set    0.1457

Table 1: MAP values obtained on the development set by swapping the ground truth annotations of image and video.

2.  CONSIDERATIONS
   In designing our current method, several considerations arising from the task setup and the provided data were taken into account.
   First of all, interestingness assessments only considered pairs of items originating from the same trailer. Therefore, given our current data, scored preference between items can only meaningfully be assessed within the context of a certain trailer. As a consequence, we chose to focus only on ranking mechanisms restricted to a given input trailer, rather than ranking mechanisms that can meaningfully rank input from multiple trailers.
   Secondly, the use case behind the currently offered task considered helping professionals to illustrate a Video on Demand (VOD) web site by selecting interesting frames and/or video excerpts of movies. The frames and excerpts should be suitable in terms of helping a user to decide whether or not to watch a movie. As a consequence, we assume that selected frames or excerpts should not only be interesting, but also representative with respect to the movie's content.
   Thirdly, the trailer is expected to contain groups of shots (which may or may not be sequentially presented) originating from the same scenes.
   Finally, binary relevance labels were not an integral part of the rating procedure, but were added afterwards. As a consequence, finding an appropriate ranking order will be more important in relation to the input data than providing a correct binary relevance prediction.
   When manually inspecting the ground truth annotations, we were struck by the inconsistency between the ground truth rankings of the Image Subtask and those obtained for the Video Subtask. To quantify this inconsistency, given that annotations were always provided considering video shots as individual units (so there were as many items considered per trailer in the Image Subtask as in the Video Subtask), we mimicked the evaluation procedure for the case in which the ground truth would be swapped. In other words, we computed the MAP value for the Image Subtask in case the ground truth of the Video Subtask (including confidence values and binary relevance indications) would have been a system outcome, and vice versa. Results are shown in Table 1: it can be noted that the MAP values are indeed not high. As we will discuss at the end of the paper, this phenomenon will be interesting to investigate further in future continuations of the task.
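
To make the swap experiment concrete, the following Python sketch (not the official task evaluation script) shows one way to compute such a MAP value: average precision is computed per trailer, using one Subtask's binary relevance labels as ground truth and the other Subtask's confidence values as if they were a system output, and the per-trailer values are then averaged. The data layout and function names are our own illustrative assumptions.

# Minimal sketch of the ground-truth swap experiment behind Table 1.
# Data structures and names below are illustrative assumptions, not the
# official evaluation tooling.
import numpy as np
from sklearn.metrics import average_precision_score

def map_over_trailers(ground_truth, system_output):
    """ground_truth: {trailer_id: {shot_id: binary_label}}
       system_output: {trailer_id: {shot_id: confidence}}"""
    ap_values = []
    for trailer_id, labels in ground_truth.items():
        shot_ids = sorted(labels)
        y_true = np.array([labels[s] for s in shot_ids])
        y_score = np.array([system_output[trailer_id][s] for s in shot_ids])
        if y_true.any():  # AP is undefined for trailers without relevant items
            ap_values.append(average_precision_score(y_true, y_score))
    return float(np.mean(ap_values))

# 'video ground truth on image set' row of Table 1 (illustrative call):
# map_over_trailers(image_binary_labels, video_confidences)
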
3.  METHOD
   As mentioned, we assess interestingness on the basis of (clearly) visible people. We do this for both Subtasks, and simplify the notion of 'visible people' by employing face detection techniques. While these techniques are not perfect (and false negatives, or missed faces, are prevalent), it can safely be assumed that when a face is detected, the face will be clearly recognizable to a human rater.
   Both for the Image and Video Subtask, we follow a similar strategy, which can be described as follows (a code sketch of steps 3-7 is given after this list):

  1. Employ face detectors to identify those image frames that feature people. For each of these, store bounding boxes for all positive face detections.

  2. In practice, the number of frames with detected faces is relatively low. Assuming that frames in which detected faces occur are part of scene(s) in the trailer which are important (and therefore may contain representative content of interest), we consider the set of all frames with detected faces, and calculate the mean HSV histogram H_f over it.

  3. For each shot s in the trailer, we consider its HSV histogram H_s and calculate the histogram intersection between H_s and H_f as similarity value:

         sim(H_s, H_f) = \sum_{i=0}^{|H_f|-1} \min(H_s(i), H_f(i)).

  4. Normalize the similarity scoring range to the [0, 1] interval to obtain confidence scores. The ranking of shots according to these scores will be denoted as hist.

  5. Next to considering histogram intersection scores, for each shot, we consider the bounding box area of detected faces. If multiple faces are detected within a shot, we simply sum the areas.

  6. The range of calculated face areas is also scaled to the [0, 1] interval.

  7. For each shot, we take the average of the normalized histogram-based confidence score and the normalized face area score. These averages are again scaled to the [0, 1] interval, establishing an alternative confidence score which is boosted by larger detected face areas. The ranking of shots according to these scores will be denoted as histface.
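
As a minimal illustration of steps 3-7, the Python sketch below computes the hist and histface confidence scores for the shots of one trailer, assuming per-shot HSV histograms and face bounding boxes are already available; all variable and function names are illustrative assumptions rather than part of the task's released tooling.

# Minimal sketch of the hist and histface confidence scores (steps 3-7 above),
# assuming per-shot HSV histograms and face bounding boxes have already been
# extracted; all names are illustrative.
import numpy as np

def min_max_scale(x):
    """Scale an array of scores to the [0, 1] interval (steps 4, 6 and 7)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span

def confidence_scores(shot_histograms, shot_face_boxes):
    """shot_histograms: list of 1-D HSV histograms, one per shot.
       shot_face_boxes: list of lists of (w, h) face bounding box sizes per shot."""
    # Mean histogram H_f over all shots containing at least one detected face (step 2).
    with_faces = [h for h, boxes in zip(shot_histograms, shot_face_boxes) if boxes]
    h_f = np.mean(with_faces, axis=0)

    # Histogram intersection of each shot histogram with H_f (step 3),
    # normalized to [0, 1] to obtain the 'hist' confidences (step 4).
    intersections = [np.minimum(h_s, h_f).sum() for h_s in shot_histograms]
    hist_conf = min_max_scale(intersections)

    # Summed face bounding box area per shot, scaled to [0, 1] (steps 5-6).
    face_areas = [sum(w * h for w, h in boxes) for boxes in shot_face_boxes]
    face_conf = min_max_scale(face_areas)

    # 'histface': rescaled average of the two normalized scores (step 7).
    histface_conf = min_max_scale((hist_conf + face_conf) / 2.0)
    return hist_conf, histface_conf
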
   Both for the Image and Video Subtask, we submitted a hist and a histface run. Below, we give further details on which feature detectors and implementation details were used per subtask.

3.1  Image Subtask
   For the Image Subtask, each shot is represented by a single frame. The HSV color histograms for each frame are taken from the precomputed features for the image dataset [4].
   No face detector data was available as part of the provided dataset. Therefore, we computed detector outcomes ourselves, using the head detector as proposed by [7], and employing a detection model as refined in [6]. The features were computed employing the code released by the authors (http://www.robots.ox.ac.uk/~vgg/software/headmview/). This head detector does not require frontal faces, but is also designed to detect profile faces and the back of heads, making it both flexible and robust.
   We sort the obtained confidence values, and apply an (empirical) threshold to set binary relevance. For the hist run, all items with a confidence value higher than 0.75 are deemed interesting; for the histface run, the threshold is set at 0.6.

3.2  Video Subtask
   For the Video Subtask, in parallel to our approach for the Image Subtask, we consider HSV color histograms and face detections. For this, we can make use of the released precomputed features. However, in contrast to the Image Subtask, these features now are based on multiple frames per shot.
   In the case of the HSV color histograms [4], we take the average histogram per shot as representation. For face detection, we use the face tracking results based on [1] and [2], and consider the sum of all detected face bounding box areas per shot.
   The binary relevance threshold is set at 0.75 for the hist run, and at 0.55 for the histface run.
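
The following Python sketch illustrates, under our own assumptions about the data layout, the per-shot aggregation used for the Video Subtask and the subsequent binary relevance thresholding; it is not the released feature extraction code, and all names are illustrative.

# Minimal sketch of the Video Subtask aggregation in Section 3.2: averaging
# the provided per-frame HSV histograms into one histogram per shot, summing
# the tracked face bounding box areas per shot, and applying the empirical
# binary relevance thresholds to the confidence scores.
import numpy as np

def shot_histogram(frame_histograms):
    """Average the HSV histograms of all frames belonging to one shot."""
    return np.mean(np.asarray(frame_histograms, dtype=float), axis=0)

def shot_face_area(face_boxes):
    """Sum the areas of all detected face bounding boxes (w, h) in one shot."""
    return sum(w * h for w, h in face_boxes)

def binary_relevance(confidences, threshold):
    """Label shots as interesting when their confidence exceeds the threshold,
    e.g. 0.75 for the video hist run and 0.55 for the video histface run."""
    return [1 if c > threshold else 0 for c in confidences]
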
4.  RESULTS AND DISCUSSION
   Results of our runs as obtained on the development and test set are presented in Tables 2 and 3, respectively. The results on the test set constitute the official evaluation results of the task.

    Run name          MAP
    image_hist        0.1867
    image_histface    0.1831
    video_hist        0.1370
    video_histface    0.1332

Table 2: MAP values obtained on the development set.

    Run name          MAP
    image_hist        0.2202
    image_histface    0.2336
    video_hist        0.1557
    video_histface    0.1558

Table 3: Official task evaluation results: MAP values obtained on the test set.

   Generally, it can be noted that MAP scores are considerably lower for the Video Subtask than for the Image Subtask. Also looking back at the results in Table 1, it may be hypothesized that the Video Subtask generally is more difficult than the Image Subtask. We would expect temporal dynamics and non-visual modalities to play a larger role in the Video Subtask; aspects we are not yet considering in our current approach.
   When comparing the obtained MAP values against the scores in Table 1, we notice that our scores are comparable, or even better. Furthermore, comparing results for the test set vs. the development set, we see that scores slightly improve for the test set, suggesting that our modeling criteria were indeed of certain relevance to the ratings in the test set.
   For future work, it will be worthwhile to further investigate how universal the concept of 'interestingness' is, both across trailers, and when comparing the Image Subtask to the Video Subtask. The surprisingly low MAP scores when exchanging ground truth between Subtasks may indicate that human rater stability is not optimal, and/or that the two Subtasks are fundamentally different from one another. Furthermore, as part of the quest for a more specific definition of 'interestingness', a continued discussion on how interestingness can be leveraged for a previewing-oriented use case will also be useful.
5.   REFERENCES
[1] N. Dalal and B. Triggs. Histograms of oriented gradients for
    human detection. In Proc. of IEEE Conference on Computer
    Vision and Pattern Recognition, 2005.
[2] M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg.
    Accurate scale estimation for robust visual tracking. In
    Proceedings of the British Machine Vision Conference.
    BMVA Press, 2014.
[3] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang,
    N. Q. Duong, and F. Lefebvre. MediaEval 2016 Predicting
    Media Interestingness Task. In Proc. of the MediaEval 2016
    Workshop, Hilversum, The Netherlands, October 2016.
[4] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super
    Fast Event Recognition in Internet Videos. IEEE Transactions
    on Multimedia, 17(7):1–13, 2015.
[5] C. C. S. Liem, M. A. Larson, and A. Hanjalic. When Music
    Makes a Scene — Characterizing Music in Multimedia
    Contexts via User Scene Descriptions. International Journal
    of Multimedia Information Retrieval, 2:15–30, 2013.
[6] M. Marin-Jimenez, A. Zisserman, M. Eichner, and V. Ferrari.
    Detecting People Looking at Each Other in Videos.
    International Journal of Computer Vision, 106(3):282–296,
    February 2014.
[7] M. Marin-Jimenez, A. Zisserman, and V. Ferrari. "Here's
    looking at you, kid." Detecting people looking at each other in
    videos. In British Machine Vision Conference, 2011.