<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The MediaEval 2015 Affective Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mats Sjöberg</string-name>
<email>mats.sjoberg@helsinki.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yoann Baveye</string-name>
          <email>yoann.baveye@technicolor.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanli Wang</string-name>
          <email>hanliwang@tongji.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vu Lam Quang</string-name>
          <email>lamquangvu@gmail.com</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <email>bionescu@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emmanuel Dellandréa</string-name>
          <email>emmanuel.dellandrea@ec-lyon.fr</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Schedl</string-name>
          <email>markus.schedl@jku.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claire-Hélène Demarty</string-name>
          <email>claire-helene.demarty@technicolor.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liming Chen</string-name>
          <email>liming.chen@liris.cnrs.fr</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Helsinki Institute for Information Technology HIIT, University of Helsinki</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Johannes Kepler University</institution>
          ,
          <addr-line>Linz</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Tongji University</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCMC</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Ecole Centrale de Lyon</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>This paper provides a description of the MediaEval 2015 "Affective Impact of Movies Task", which is running for the fifth year, previously under the name "Violent Scenes Detection". In this year's task, participants are expected to create systems that automatically detect video content that depicts violence, or predict the affective impact that video content will have on viewers. Here we provide insights on the use case, task challenges, data set and ground truth, task run requirements and evaluation metrics.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
<p>The Affective Impact of Movies Task is part of the
MediaEval 2015 Benchmarking Initiative. The overall use case
scenario of the task is to design a video search system that
uses automatic tools to help users find videos that fit their
particular mood, age or preferences. To address this, we
present two subtasks:</p>
      <list list-type="bullet">
        <list-item>
          <p>Induced affect detection: the emotional impact of a video or movie can be a strong indicator for search or recommendation;</p>
        </list-item>
        <list-item>
          <p>Violence detection: detecting violent content is an important aspect of filtering video content based on age.</p>
        </list-item>
      </list>
      <p>This task builds on the experiences from previous years'
editions of the Affect in Multimedia Task: Violent Scenes
Detection. However, this year, we introduce a completely
new subtask for detecting the emotional impact of movies.
In addition, we are introducing to MediaEval a newly
extended data set consisting of 10,900 short video clips
extracted from 199 Creative Commons-licensed movies.</p>
      <p>
        In the literature, detection of violence in movies has been
only marginally addressed until recently [
        <xref ref-type="bibr" rid="ref1 ref6 ref8">1, 6, 8</xref>
        ]. Similarly, in
affective video content analysis it has been repeatedly claimed
that the field would highly benefit from a standardised
evaluation data set [
        <xref ref-type="bibr" rid="ref5 ref9">5, 9</xref>
        ]. Most of the previously proposed
methods for affective impact or violence detection suffer from a
lack of consistent evaluation, as they usually rely
on a constrained and closed data set [
        <xref ref-type="bibr" rid="ref10 ref4 ref7">4, 7, 10</xref>
        ]. Hence, the
task's main objective is to propose a public common
evaluation framework for research in these closely-related
areas.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. TASK DESCRIPTION</title>
      <p>The task requires participants to deploy multimedia
features to automatically detect violent content and the emotional
impact of short movie clips. In contrast to previous years,
the task no longer considers arbitrary starting and ending
points of detected segments; instead, the short video clips
are treated as single units for detection purposes, with a
single judgement per clip. This year, there are two subtasks:
(i) induced affect detection, and (ii) violence detection. Both
subtasks use the same videos for training and testing.</p>
      <p>For the induced affect detection subtask, participants are
expected to predict, for each video, its valence class (i.e., one
of negative, neutral or positive) and its arousal class (i.e., one
of calm, neutral or active). In this subtask we focus on felt
emotion, i.e., the actual emotion of the viewer when watching
the video clip, rather than, for example, what the viewer
believes he or she is expected to feel. Valence is defined as a
continuous scale from most negative to most positive emotion,
while arousal is defined continuously from most calm to most
active emotion. However, to keep the two subtasks compatible
and enable participants to use similar systems for both, we
have opted to discretise the two scales into three classes each:</p>
      <list list-type="bullet">
        <list-item>
          <p>valence: negative, neutral, and positive;</p>
        </list-item>
        <list-item>
          <p>arousal: calm, neutral, and active.</p>
        </list-item>
      </list>
      <p>For the violence detection subtask, participants are expected
to classify each video as violent or non-violent. Violence is
defined as content that "one would not let an 8-year-old
child see in a movie because it contains physical violence".</p>
      <p>To solve the task, participants are only allowed to use
features extracted from the original video files, or metadata
provided by the organisers. In addition, it is possible
to use external data for runs that are specifically marked as such;
however, at least one run for each subtask must be submitted without
any external data.</p>
    </sec>
    <sec id="sec-3">
      <title>3. DATA DESCRIPTION</title>
      <p>This year a single data set is proposed: 10,900 short video
clips extracted from 199 Creative Commons-licensed movies
of various genres. The movies are split into a development
set, intended for training and validation, and a test set,
containing 100 and 99 movies respectively, resulting in 6,144
and 4,756 extracted short video clips.</p>
      <p>
        The proposed data set is an extension of the
LIRIS-ACCEDE data set, originally composed of 9,800
excerpts extracted from 160 movies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For this task, 1,100
additional video clips have been extracted from 39 new movies
and included in the test set. The selected feature films and
short films range from professionally made to
amateur movies, but almost all are indexed on video platforms
referencing the best free-to-share movies or have been screened
during film festivals. Since these movies are shared under
Creative Commons licenses, the excerpts can be shared
and downloaded along with the annotations without
infringing copyright. The excerpts have been extracted from the
movies so that they last between 8 and 12 seconds and start
and end with a cut or a fade.
      </p>
      <p>
        Along with the video material and the annotations,
features extracted from each video clip are also provided by
the organisers. They correspond to the audiovisual features
described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. GROUND TRUTH</title>
      <p>For each of the 10,900 video clips, the ground truth
consists of: a binary value indicating the presence of violence,
the class of the excerpt for felt arousal (calm, neutral or active),
and the class for felt valence (negative, neutral or positive).
Before the evaluation, participants are provided only with the
annotations for the development set, while those for the test
set are held back for benchmarking the submitted
results.</p>
      <p>
        The original video clips included in the LIRIS-ACCEDE
data set were already ranked along the felt valence and
arousal axes using a crowdsourcing protocol [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Pairwise
comparisons were generated using the quicksort algorithm
and presented to crowdworkers, who had to select the video
inducing the calmer emotion or the more positive emotion.
In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] the crowdsourced ranks were converted into absolute
affective scores ranging from -1 to 1, which have been used to
define the three classes for each affective axis for the
MediaEval task. The negative and calm classes correspond
to the video clips with a valence or arousal score
smaller than -0.15, the neutral class for both axes is assigned
to the videos with an affective score between -0.15 and 0.15,
and the positive and active classes are assigned to the videos
with an affective score higher than 0.15. These limits have
been defined empirically, taking into account the distribution
of the data set in the valence-arousal space.
      </p>
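      <p>As an illustration, the following Python sketch implements this discretisation; the function name and the treatment of scores falling exactly on a boundary are our own assumptions, not part of the official protocol.</p>
      <preformat>
def score_to_class(score, axis="valence"):
    """Discretise an affective score in [-1, 1] into one of three classes
    using the +/-0.15 boundaries described above. Boundary handling
    (&lt;= vs &lt;) is an assumption; the protocol does not specify it."""
    if score &lt; -0.15:
        return "negative" if axis == "valence" else "calm"
    elif score &lt;= 0.15:
        return "neutral"
    return "positive" if axis == "valence" else "active"

assert score_to_class(-0.4) == "negative"
assert score_to_class(0.0, axis="arousal") == "neutral"
assert score_to_class(0.3, axis="arousal") == "active"
      </preformat>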
      <p>For the 2015 MediaEval evaluation, the test set was
extended with an additional 1,100 video clips. Due to time
and resource constraints, these were annotated using a
simplified scheme, which takes advantage of the fact that we do
not need a full ranking of the new video clips, but only to
separate them into three classes for each affect axis. Two
pivot videos were selected for each axis, with absolute
scores very close to the -0.15 and 0.15 class boundaries. The
annotation task could then be formulated as comparing each
video clip to these pivot videos, and thus placing it in its
correct class. In total, 17 annotators from five
different countries were involved, and three judgements were collected for
each pivot/affect dimension pair. Out of these three
judgements, the majority vote was selected.</p>
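      <p>A minimal sketch of such a majority vote is shown below; the input format (three binary judgements of whether a clip falls below or above a pivot) is assumed for illustration.</p>
      <preformat>
from collections import Counter

def majority_vote(judgements):
    """Return the majority label among an odd number of binary judgements.
    With three votes over two possible labels, a strict majority always
    exists, so no tie-breaking is needed. (Data layout is assumed.)"""
    label, _count = Counter(judgements).most_common(1)[0]
    return label

print(majority_vote(["below", "above", "above"]))  # prints "above"
      </preformat>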
      <p>For the violence detection subtask, the annotation process was
similar to previous years' protocol. Firstly, all the videos were
annotated separately by two groups of annotators from two
different countries. In each group, regular annotators
labelled all the videos, which were then reviewed by master
annotators. Regular annotators were graduate students
(typically single with no children) and master annotators were
senior researchers (typically married with children). No
discussions were held between annotators during the
annotation process. Group 1 used 12 regular and 2 master
annotators, while Group 2 used 5 regular and 2 master annotators.
Within each group, each video received 2 different
annotations, which were then merged by the master annotators into
the final annotation for the group. Finally, the
annotations from the two groups were merged and reviewed
once more by the task organisers.</p>
    </sec>
    <sec id="sec-5">
      <title>5. RUN DESCRIPTION</title>
      <p>Participants can submit up to 5 runs for each subtask:
induced affect detection and violence detection. Each subtask
has a required run in which no external training data may be used;
only the provided development data is allowed, together with any
features that can be automatically extracted from the videos.
Both subtasks also allow optional runs
in which any external data can be used, such as Internet
sources, as long as they are marked as "external data" runs.</p>
    </sec>
    <sec id="sec-6">
      <title>6. EVALUATION CRITERIA</title>
      <p>For the induced affect detection subtask, the official
evaluation measure is global accuracy, calculated separately for
the valence and arousal dimensions. Global accuracy is the
proportion of the returned video clips that have been assigned
to the correct class (out of the three classes).</p>
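      <p>In other words, given assumed parallel lists of ground-truth and predicted labels, global accuracy could be computed as in the following sketch.</p>
      <preformat>
def global_accuracy(truth, predicted):
    """Fraction of clips assigned to their correct class."""
    correct = sum(t == p for t, p in zip(truth, predicted))
    return correct / len(truth)

truth     = ["calm", "neutral", "active", "active"]
predicted = ["calm", "active",  "active", "neutral"]
print(global_accuracy(truth, predicted))  # 0.5
      </preformat>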
      <p>The official evaluation metric for the violence detection
subtask is average precision, which is calculated using the
trec_eval tool provided by NIST (http://trec.nist.gov/trec_eval/).
This tool also produces a set of commonly used metrics, such as
precision and recall, which may be used for comparison purposes.</p>
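      <p>For intuition, a standalone sketch of average precision over a ranked list of clips is given below; the official scoring uses trec_eval itself, and the input format here is our own assumption.</p>
      <preformat>
def average_precision(relevant, ranked_ids):
    """Mean of the precision values at each rank where a relevant
    (violent) clip is retrieved, over a ranked result list."""
    hits, precision_sum = 0, 0.0
    for rank, clip_id in enumerate(ranked_ids, start=1):
        if clip_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

print(average_precision({"v1", "v3"}, ["v1", "v2", "v3", "v4"]))  # ~0.833
      </preformat>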
    </sec>
    <sec id="sec-7">
      <title>7. CONCLUSIONS</title>
      <p>The Affective Impact of Movies Task provides participants
with a comparative and collaborative evaluation framework
for violence and emotion detection in movies. The
introduction of the induced affect detection subtask is a new effort for
this year. In addition, we have started fresh with a data set
not used in MediaEval before, consisting of short
Creative Commons-licensed video clips, which enables legally
sharing the data directly with participants. Details on the
methods and results of each individual team can be found in
the papers of the participating teams in these proceedings.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This task is supported by the following projects:
ERANET CHIST-ERA grant ANR-12-CHRI-0002-04,
UEFISCDI SCOUTER grant 28DPST/30-08-2013, Vietnam
National University Ho Chi Minh City grant B2013-26-01,
Austrian Science Fund P25655, and EU FP7-ICT-2011-9 project
601166.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>E.</given-names> <surname>Acar</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Hopfgartner</surname></string-name>, and
          <string-name><given-names>S.</given-names> <surname>Albayrak</surname></string-name>.
          <article-title>Violence detection in Hollywood movies by the fusion of visual and mid-level audio cues</article-title>.
          In <source>Proceedings of the 21st ACM International Conference on Multimedia</source>,
          pages <fpage>717</fpage>–<lpage>720</lpage>. ACM, <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Y.</given-names> <surname>Baveye</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Dellandrea</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Chamaret</surname></string-name>, and
          <string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>.
          <article-title>From crowdsourced rankings to affective ratings</article-title>.
          In <source>2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)</source>,
          pages <fpage>1</fpage>–<lpage>6</lpage>, July <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Y.</given-names> <surname>Baveye</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Dellandrea</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Chamaret</surname></string-name>, and
          <string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>.
          <article-title>LIRIS-ACCEDE: A video database for affective content analysis</article-title>.
          <source>IEEE Transactions on Affective Computing</source>,
          <volume>6</volume>(<issue>1</issue>):<fpage>43</fpage>–<lpage>55</lpage>, Jan. <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>A.</given-names> <surname>Hanjalic</surname></string-name> and
          <string-name><given-names>L.-Q.</given-names> <surname>Xu</surname></string-name>.
          <article-title>Affective video content representation and modeling</article-title>.
          <source>IEEE Transactions on Multimedia</source>,
          <volume>7</volume>(<issue>1</issue>):<fpage>143</fpage>–<lpage>154</lpage>, Feb. <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>M.</given-names> <surname>Horvat</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Popovic</surname></string-name>, and
          <string-name><given-names>K.</given-names> <surname>Cosic</surname></string-name>.
          <article-title>Multimedia stimuli databases usage patterns: a survey report</article-title>.
          In <source>Proceedings of the 36th International ICT Convention MIPRO</source>,
          pages <fpage>993</fpage>–<lpage>997</lpage>, <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Schluter</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Mironica</surname></string-name>, and
          <string-name><given-names>M.</given-names> <surname>Schedl</surname></string-name>.
          <article-title>A naive mid-level concept-based fusion approach to violence detection in Hollywood movies</article-title>.
          In <source>ICMR</source>, pages <fpage>215</fpage>–<lpage>222</lpage>, <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>G.</given-names> <surname>Irie</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Satou</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Kojima</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Yamasaki</surname></string-name>, and
          <string-name><given-names>K.</given-names> <surname>Aizawa</surname></string-name>.
          <article-title>Affective audio-visual words and latent topic driving model for realizing movie affective scene classification</article-title>.
          <source>IEEE Transactions on Multimedia</source>,
          <volume>12</volume>(<issue>6</issue>):<fpage>523</fpage>–<lpage>535</lpage>, Oct. <year>2010</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>C.</given-names> <surname>Penet</surname></string-name>,
          <string-name><given-names>C.-H.</given-names> <surname>Demarty</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Gravier</surname></string-name>, and
          <string-name><given-names>P.</given-names> <surname>Gros</surname></string-name>.
          <article-title>Multimodal information fusion and temporal integration for violence detection in movies</article-title>.
          In <source>ICASSP</source>, Kyoto, Japan, <year>2012</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>M.</given-names> <surname>Soleymani</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Larson</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Pun</surname></string-name>, and
          <string-name><given-names>A.</given-names> <surname>Hanjalic</surname></string-name>.
          <article-title>Corpus development for affective video indexing</article-title>.
          <source>IEEE Transactions on Multimedia</source>,
          <volume>16</volume>(<issue>4</issue>):<fpage>1075</fpage>–<lpage>1089</lpage>, June <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Huang</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Jiang</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Gao</surname></string-name>, and
          <string-name><given-names>Q.</given-names> <surname>Tian</surname></string-name>.
          <article-title>Affective visualization and retrieval for music video</article-title>.
          <source>IEEE Transactions on Multimedia</source>,
          <volume>12</volume>(<issue>6</issue>):<fpage>510</fpage>–<lpage>522</lpage>, Oct. <year>2010</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>