<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TUD-MMC at MediaEval 2016: Predicting Media Interestingness Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cynthia C. S. Liem</string-name>
          <email>c.c.s.liem@tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Multimedia Computing Group, Delft University of Technology Delft</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This working notes paper describes the TUD-MMC entry to the MediaEval 2016 Predicting Media Interestingness Task. Noting that the nature of movie trailer shots is different from that of preceding tasks on image and video interestingness, we propose two baseline heuristic approaches based on the clear occurrence of people. MAP scores obtained on the development set and test set suggest that our approaches cover a limited but non-marginal subset of the interestingness spectrum. Most strikingly, our obtained scores on the Image and Video Subtasks are comparable or better than those obtained when evaluating the ground truth annotations of the Image Subtask against the Video Subtask and vice versa.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The MediaEval 2016 Predicting Media Interestingness Task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
considers interestingness of shots and frames in Hollywood-like
trailer videos. The intended use case for this task would be to
automatically select interesting frames and/or video excerpts for movie
previewing on Video on Demand web sites.
      </p>
      <p>Movie trailers are intended to raise a viewer’s interest in a movie.
As a consequence, they will not be a topical summary of the video,
and they are likely to consist of ‘teaser material’ that should
make a viewer curious to watch more.</p>
      <p>
        In our approach to this problem, we were originally interested in
assessing whether ‘interestingness’ could relate to salient narrative
elements in a trailer. In particular, we wondered whether criteria
for connecting production music fragments to storylines [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] would
also be relevant factors in rater assessment of interestingness.
      </p>
      <p>
        However, the rating acquisition procedure for the task did not
involve full trailer watching by the raters, but rather the rating of
isolated pairs of clips or frames. As such, while ideas in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] largely
considered the dynamic unfolding of a story, a sense of overall
storyline and longer temporal dynamics could not be assumed in the
current task.
      </p>
      <p>We ultimately decided on pursuing a simpler strategy: the
currently presented approaches investigate to what extent the clear
presence of people, as approximated by automated face detection
results, indicates visual environments that are more interesting to
a human rater. The underlying assumption is that close-ups should
attract a viewer’s attention, and as such may create greater empathy
with the shown subject or its environment. It will be interesting
to consider how this heuristic method compares against more
agnostic machine learning techniques trained directly on the
provided labels.
</p>
      <p>[Table 1. MAP scores obtained when the ground truth annotations of one Subtask are evaluated as a system outcome against the other Subtask: 0.1747 and 0.1457.]</p>
    </sec>
    <sec id="sec-2">
      <title>CONSIDERATIONS</title>
      <p>In designing our current method, several considerations arising
from the task setup and the provided data were taken into account.</p>
      <p>First of all, interestingness assessments only considered pairs of
items originating from the same trailer. Therefore, given our
current data, scored preference between items can only meaningfully
be assessed within the context of a certain trailer. As a
consequence, we chose to focus only on ranking mechanisms restricted
to a given input trailer, rather than ranking mechanisms that
can meaningfully rank input from multiple trailers.</p>
      <p>Secondly, the use case behind the currently offered task
considered helping professionals to illustrate a Video on Demand (VOD)
web site by selecting interesting frames and/or video excerpts of
movies. The frames and excerpts should be suitable in terms of
helping a user to make a decision on whether to watch a movie or
not. As a consequence, we assume that selected frames or excerpts
should not only be interesting, but also representative with respect
to the movie’s content.</p>
      <p>Thirdly, the trailer is expected to contain groups of shots (which
may or may not be sequentially presented) originating from the
same scenes.</p>
      <p>Finally, binary relevance labels were not an integral part of the rating
procedure, but were added afterwards. As a consequence, finding an
appropriate ranking order will be more important in relation to the
input data than providing a correct binary relevance prediction.</p>
      <p>When manually inspecting the ground truth annotations, we
were struck by the inconsistency between the ground truth rankings
of the Image Subtask and those obtained for the Video Subtask. To
quantify this inconsistency, given that annotations were always
provided considering video shots as individual units (so there were as
many items considered per trailer in the Image Subtask as in the
Video Subtask), we mimicked the evaluation procedure for the case
in which the ground truth would be swapped. In other words, we computed the
MAP value for the Image Subtask in case the ground truth of the
Video Subtask (including confidence values and binary relevance
indications) would have been a system outcome, and vice versa.
Results are shown in Table 1: it can be noted that the MAP values are
indeed not high. As we will discuss at the end of the paper, this
phenomenon will be interesting to investigate further in future
continuations of the task.</p>
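      <p>The swap check described above can be sketched as follows. This is an illustrative sketch, not the official evaluation code: for each trailer, the confidence values of one Subtask’s ground truth are treated as a system outcome and scored against the binary relevance labels of the other Subtask. All function names are ours.</p>
      <preformat>
```python
def average_precision(confidences, relevant):
    """AP of the ranking induced by descending confidence scores.

    Since every item is ranked, the number of retrieved relevant items
    equals the total number of relevant items.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def mean_average_precision(per_trailer):
    """MAP over (confidences, relevant) pairs, one pair per trailer."""
    return sum(average_precision(c, r) for c, r in per_trailer) / len(per_trailer)
```
      </preformat>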
    </sec>
    <sec id="sec-3">
      <title>METHOD</title>
      <p>As mentioned, we assess interestingness on the basis of (clearly)
visible people. We do this for both Subtasks, and simplify the
notion of ‘visible people’ by employing face detection techniques.
While these techniques are not perfect (and false negatives, or
missed faces, are prevalent), it can safely be assumed that when
a face is detected, the face will be clearly recognizable to a human
rater.</p>
      <p>Both for the Image and Video Subtask, we follow a similar
strategy, which can be described as follows:
1. Employ face detectors to identify those image frames that
feature people. For each of these, store bounding boxes for
all positive face detections.
2. In practice, the number of frames with detected faces is
relatively low. Assuming that frames in which detected faces
occur are part of scene(s) in the trailer which are important
(and therefore may contain representative content of
interest), we consider the set of all frames with detected faces,
and calculate the mean HSV histogram Hf over it.
3. For each shot s in the trailer, we consider its HSV histogram
Hs and calculate the histogram intersection between Hs and
Hf as similarity value:
sim(Hs, Hf) = Σ_{i=0}^{|Hf|−1} min(Hs(i), Hf(i)).</p>
      <p>4. Normalize the similarity scoring range to the [0, 1] interval
to obtain confidence scores. The ranking of shots according
to these scores will be denoted as hist.
5. Next to considering histogram intersection scores, for each
shot, we consider the bounding box area of detected faces.
If multiple faces are detected within a shot, we simply sum
areas.
6. The range of calculated face areas is also scaled to the [0, 1]
interval.
7. For each shot, we take the average of the normalized
histogram-based confidence score and the normalized face
area score. These averages are again scaled to the [0, 1]
interval, establishing an alternative confidence score which is
boosted by larger detected face areas. The ranking of shots
according to these scores will be denoted as histface.</p>
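      <p>As an illustration, the steps above can be sketched in Python as follows. This is a minimal sketch under assumed inputs: per-shot HSV histograms, the histograms of all frames with positive face detections, and the summed face bounding-box area per shot (0 if no face is detected); all function names are ours.</p>
      <preformat>
```python
def normalize(scores):
    """Scale a list of scores to the [0, 1] interval (steps 4 and 6)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def histogram_intersection(h_s, h_f):
    """sim(Hs, Hf) = sum over i of min(Hs(i), Hf(i)) (step 3)."""
    return sum(min(a, b) for a, b in zip(h_s, h_f))

def rank_shots(shot_histograms, face_frame_histograms, face_areas):
    # Step 2: mean HSV histogram H_f over all frames with detected faces.
    n = len(face_frame_histograms)
    h_f = [sum(col) / n for col in zip(*face_frame_histograms)]
    # Steps 3-4: normalized histogram intersections give the 'hist' confidences.
    hist = normalize([histogram_intersection(h_s, h_f) for h_s in shot_histograms])
    # Steps 5-7: average with normalized summed face areas, then rescale,
    # giving the 'histface' confidences.
    areas = normalize(face_areas)
    histface = normalize([(h + a) / 2.0 for h, a in zip(hist, areas)])
    return hist, histface
```
      </preformat>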
      <p>Both for the Image and Video Subtask, we submitted a hist
and a histface run. Below, we give further details on the feature
detectors and implementation choices used per subtask.</p>
    </sec>
    <sec id="sec-4">
      <title>Image Subtask</title>
      <p>
        For the Image Subtask, each shot is represented by a single
frame. The HSV color histograms for each frame are taken out
of the precomputed features for the image dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        No face detector data was available as part of the provided
dataset. Therefore, we computed detector outcomes ourselves,
using the head detector as proposed by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and employing a detection
model as refined in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The features were computed using the
code released by the authors (http://www.robots.ox.ac.uk/~vgg/software/headmview/). This head detector does not require
frontal faces, but is also designed to detect profile faces and the
back of heads, making it both flexible and robust.
      </p>
      <p>[Tables 2 and 3. MAP scores for the runs image_hist, image_histface, video_hist and video_histface on the development set and the test set, respectively; the values recovered from one of these tables are, in run order, 0.2202, 0.2336, 0.1557 and 0.1558.]</p>
      <p>We sort the obtained confidence values, and apply an empirical
threshold to set binary relevance. For the hist run, all items with
a confidence value higher than 0.75 are deemed interesting; for the
histface run, the threshold is set at 0.6.</p>
    </sec>
    <sec id="sec-5">
      <title>Video Subtask</title>
      <p>For the Video Subtask, in parallel to our approach for the Image
Subtask, we consider HSV color histograms and face detections.
For this, we can make use of released precomputed features.
However, in contrast to the Image Subtask, these features are now
based on multiple frames per shot.</p>
      <p>
        In case of the HSV color histograms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we take the average
histogram per shot as representation. For face detection, we use the
face tracking results based on [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and consider the sum of
all detected face bounding box areas per shot.
      </p>
      <p>The binary relevance threshold is set at 0.75 for the hist run,
and at 0.55 for the histface run.</p>
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND DISCUSSION</title>
      <p>Results of our runs as obtained on the development and test set
are presented in Tables 2 and 3, respectively. The results on the test
set constitute the official evaluation results of the task.</p>
      <p>Generally, it can be noted that MAP scores are considerably
lower for the Video Subtask than for the Image Subtask. Looking
back at the results in Table 1 as well, it may be hypothesized that the
Video Subtask generally is more difficult than the Image Subtask.
We would expect temporal dynamics and non-visual modalities
to play a larger role in the Video Subtask; aspects we are not yet
considering in our current approach.</p>
      <p>When comparing the obtained MAP against the scores seen in
Table 1, we notice that our scores are comparable, or even better.
Furthermore, comparing results for the test set vs. the development
set, we see that scores slightly improve for the test set,
suggesting that our modeling criteria were indeed of some relevance to
ratings in the test set.</p>
      <p>For future work, it will be worthwhile to further investigate how
universal the concept of ‘interestingness’ is, both across trailers,
and when comparing the Image Subtask to the Video Subtask. The
surprisingly low MAP scores when exchanging ground truth
between Subtasks may indicate that human rater stability is not
optimal, and/or that the two Subtasks are fundamentally different from
one another. Furthermore, as part of the quest for a more specific
definition of ‘interestingness’, a continued discussion on how
interestingness can be leveraged for a previewing-oriented use case will
also be useful.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dalal</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Triggs</surname>
          </string-name>
          .
          <article-title>Histograms of oriented gradients for human detection</article-title>
          .
          <source>In Proc. of IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Danelljan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Häger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shahbaz Khan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Felsberg</surname>
          </string-name>
          .
          <article-title>Accurate scale estimation for robust visual tracking</article-title>
          .
          <source>In Proceedings of the British Machine Vision Conference</source>
          . BMVA Press,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-T.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Q.</given-names>
            <surname>Duong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Lefebvre</surname>
          </string-name>
          .
          <article-title>MediaEval 2016 Predicting Media Interestingness Task</article-title>
          .
          <source>In Proc. of the MediaEval 2016 Workshop</source>
          , Hilversum, The Netherlands,
          <year>October 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Super Fast Event Recognition in Internet Videos</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>177</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C. C. S.</given-names>
            <surname>Liem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          .
          <article-title>When Music Makes a Scene - Characterizing Music in Multimedia Contexts via User Scene Descriptions</article-title>
          .
          <source>International Journal of Multimedia Information Retrieval</source>
          ,
          <volume>2</volume>
          :
          <fpage>15</fpage>
          -
          <lpage>30</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Marin-Jimenez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eichner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Ferrari</surname>
          </string-name>
          .
          <article-title>Detecting People Looking at Each Other in Videos</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>106</volume>
          (
          <issue>3</issue>
          ):
          <fpage>282</fpage>
          -
          <lpage>296</lpage>
          ,
          <year>February 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Marin-Jimenez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Ferrari</surname>
          </string-name>
          .
          <article-title>"Here's looking at you, kid." Detecting people looking at each other in videos</article-title>
          .
          <source>In British Machine Vision Conference</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>