<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The MediaEval 2016 Emotional Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Ecole Centrale de Lyon</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Emmanuel Dellandréa</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>HIIT, University of Helsinki</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Technicolor</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Université de Nantes</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This paper provides a description of the MediaEval 2016 "Emotional Impact of Movies" task. It builds on previous years' editions of the Affect in Multimedia Task: Violent Scenes Detection. However, in this year's task, participants are expected to create systems that automatically predict the emotional impact that video content will have on viewers, in terms of valence and arousal scores. Here we provide insights on the use case, task challenges, dataset and ground truth, task run requirements and evaluation metrics.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Affective video content analysis aims at the automatic
recognition of emotions elicited by videos. It has a large
number of applications, including mood based personalized
content recommendation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or video indexing [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and
efficient movie visualization and browsing [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Beyond the
analysis of existing video material, affective computing
techniques can also be used to generate new content, e.g., movie
summarization [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], or personalized soundtrack
recommendation to make user-generated videos more attractive [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Affective techniques can also be used to enhance the user
engagement with advertising content by optimizing the way
ads are inserted inside videos [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        While major progress has been achieved in computer
vision for visual object detection, scene understanding and
high-level concept recognition, a natural further step is the
modeling and recognition of affective concepts. This has
recently received increasing interest from research
communities, e.g., computer vision and machine learning, with an overall
goal of endowing computers with human-like perception
capabilities. Thus, this task is proposed to offer researchers a
place to compare their approaches for the prediction of the
emotional impact of movies. It builds on previous
years' editions of the Affect in Multimedia Task: Violent
Scenes Detection [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. TASK DESCRIPTION</title>
      <p>
        The task requires participants to deploy multimedia
features to automatically predict the emotional impact of movies.
We are focusing on felt emotion, i.e., the actual emotion of
the viewer when watching the video, rather than for
example what the viewer believes that he or she is expected
to feel. The emotion is considered in terms of valence and
arousal [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Valence is defined as a continuous scale from
most negative to most positive emotions, while arousal is
defined continuously from calmest to most active emotions.
Two subtasks are considered:
1. Global emotion prediction: given a short video clip
(around 10 seconds), participants' systems are expected
to predict a score of induced valence (negative-positive)
and induced arousal (calm-excited) for the whole clip;
2. Continuous emotion prediction: as an emotion felt
during a scene may be influenced by the emotions felt
during the previous ones, the purpose here is to consider
longer videos, and to predict the valence and arousal
continuously along the video. Thus, a score of induced
valence and arousal should be provided for each
1s segment of the video.
      </p>
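      <p>As a purely illustrative sketch (not an official submission format), the following Python fragment shows the granularity expected from participants' systems: one (valence, arousal) pair per clip for the first subtask, and one pair per 1s segment for the second. The function names, data layout and placeholder scores are hypothetical.</p>
      <preformat>
# Hypothetical illustration of the prediction granularity of the two
# subtasks; names and data layout are assumptions, not an official format.
from typing import Dict, List, Tuple

Score = Tuple[float, float]  # (valence, arousal)

def predict_global(clip_ids: List[str]) -> Dict[str, Score]:
    """Subtask 1: one (valence, arousal) score per ~10 s clip."""
    # A real system would extract audiovisual features and apply a trained
    # regression model; the neutral score 0.5 is only a placeholder here.
    return {clip_id: (0.5, 0.5) for clip_id in clip_ids}

def predict_continuous(movie_id: str, duration_s: int) -> List[Score]:
    """Subtask 2: one (valence, arousal) score per 1 s segment of the movie."""
    return [(0.5, 0.5) for _ in range(duration_s)]
      </preformat>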
    </sec>
    <sec id="sec-3">
      <title>3. DATA DESCRIPTION</title>
      <p>
        The development dataset used in this task is the
LIRIS-ACCEDE dataset (liris-accede.ec-lyon.fr) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It is composed
of two subsets. The first one, used for the first subtask
(global emotion prediction), contains 9,800 video clips
extracted from 160 professionally made and amateur movies
of different genres, shared under Creative Commons
licenses that allow the videos to be freely used and distributed
without copyright issues as long as the original creator is
credited. The segmented video clips last between 8 and 12
seconds and are representative enough to conduct experiments.
Indeed, the length of the extracted segments is large enough
to get consistent excerpts allowing the viewer to feel emotions,
and small enough for the viewer to feel only one
emotion per excerpt. A robust shot and fade in/out
detection has been implemented to make sure that each
extracted video clip starts and ends with a shot or a fade.
Several movie genres are represented in this collection of
movies, such as horror, comedy, drama, action and so on.
Languages are mainly English, with a small set of Italian,
Spanish, French and other languages subtitled in English.
      </p>
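      <p>The exact shot and fade detection method is not detailed here. The following is a minimal sketch of one common heuristic (frame-to-frame colour-histogram differencing with OpenCV), given only to illustrate the kind of processing involved; the library choice, threshold and parameters are assumptions, not the organizers' implementation.</p>
      <preformat>
import cv2

def detect_shot_cuts(video_path: str, threshold: float = 0.5) -> list:
    """Return frame indices where the colour-histogram distance between
    consecutive frames exceeds `threshold` (a rough shot-cut heuristic)."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # Bhattacharyya distance: 0 = identical, 1 = very different.
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if dist > threshold:
                cuts.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts
      </preformat>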
      <p>The second part of the LIRIS-ACCEDE dataset is used for the
second subtask (continuous emotion prediction). It consists
of a selection of movies among the 160 used to extract
the 9,800 video clips mentioned previously. The total length
of the selected movies was the only constraint. It had to be
smaller than eight hours to create an experiment of
acceptable duration. The selection process ended with the choice
of 30 movies so that their genre, content, language and
duration are diverse enough to be representative of the original
LIRIS-ACCEDE dataset. The selected videos are between
117 and 4,566 seconds long (mean = 884.2 s, SD = 766.7 s).
The total length of the 30 selected movies is 7 hours,
22 minutes and 5 seconds.</p>
      <p>In addition to the development set, a test set is also
provided to assess the performance of participants' methods. 49 new
movies under Creative Commons licenses have been
considered. With the same protocol as the one used for the
development set, 1,200 additional short video clips have been
extracted for the first subtask (between 8 and 12 seconds),
and 10 long movies (from 25 minutes to 1 hour and 35
minutes) have been selected for the second subtask (for a total
duration of 11.48 hours).</p>
      <p>In solving the task, participants are expected to exploit
the provided resources. Use of external resources (e.g.,
Internet data) will, however, be allowed as specific runs.</p>
      <p>
        Along with the video material and the annotations,
features extracted from each video clip are also provided by
the organizers for the first subtask. They correspond to the
audiovisual features described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. GROUND TRUTH</title>
    </sec>
    <sec id="sec-5">
      <title>4.1 Ground Truth for the first subtask</title>
      <p>
        The 9,800 video clips included in the first part of the
LIRIS-ACCEDE dataset are ranked along the felt valence
and arousal axes by using a crowdsourcing protocol [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To
make the annotation process as simple and reliable as possible, pairwise
comparisons were generated using the quicksort algorithm
and presented to crowdworkers, who had to select the video
inducing the calmest emotion or the most positive emotion.
      </p>
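      <p>A minimal sketch of this ranking-by-comparison idea is given below: a quicksort in which the comparator is a crowd judgement. The ask_crowd function is a hypothetical stand-in (here a random choice) for a crowdworker picking, e.g., the calmer of two clips; the actual crowdsourcing protocol is described in [3].</p>
      <preformat>
import random

def ask_crowd(clip_a: str, clip_b: str) -> str:
    """Placeholder for a crowdworker choosing the clip inducing the calmer
    emotion; a real deployment would query annotators, not random.choice."""
    return random.choice([clip_a, clip_b])

def rank_clips(clips: list) -> list:
    """Quicksort in which each comparison is a (simulated) crowd judgement."""
    if len(clips) in (0, 1):
        return list(clips)
    pivot, rest = clips[0], clips[1:]
    calmer = [c for c in rest if ask_crowd(c, pivot) == c]
    other = [c for c in rest if c not in calmer]
    return rank_clips(calmer) + [pivot] + rank_clips(other)
      </preformat>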
      <p>
        To cross-validate the annotations gathered from various
uncontrolled environments using crowdsourcing, another
experiment has been created to collect ratings for a subset of
the database in a controlled environment. In this controlled
experiment, 28 volunteers were asked to rate a carefully
selected subset of the database using the 5-point discrete
Self-Assessment Manikin scales for valence and arousal [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. 20
excerpts per axis, regularly distributed along that axis, have been
selected in order to get enough excerpts to represent the
whole database while remaining few enough to keep the
experiment to an acceptable duration.
      </p>
      <p>
        From the original ranks and these ratings, absolute
affective scores for valence and arousal have been estimated for
each of the 9,800 video clips using Gaussian process
regression models as described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
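      <p>The exact regression setup is described in [1]; the sketch below only illustrates the idea of mapping crowdsourced ranks to absolute scores with a Gaussian process, using scikit-learn and made-up rating data. The kernel, the rating values and the use of GaussianProcessRegressor are assumptions made for the example.</p>
      <preformat>
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical data: ranks of the clips rated in the controlled experiment
# and their mean 5-point Self-Assessment Manikin ratings.
rated_ranks = np.array([[100], [1200], [3500], [6000], [9500]], dtype=float)
sam_ratings = np.array([1.2, 2.0, 3.1, 3.9, 4.7])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1000.0) + WhiteKernel(),
                              normalize_y=True)
gp.fit(rated_ranks, sam_ratings)

# Predict an absolute affective score for every rank in the dataset.
all_ranks = np.arange(1, 9801, dtype=float).reshape(-1, 1)
scores = gp.predict(all_ranks)
      </preformat>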
      <p>To obtain ground truth for the test subset, each of the
1,200 additional video clips has first been ranked with respect
to the 9,800 video clips from the original dataset. Then,
its valence and arousal ranks have been converted into a
valence and arousal score using the regression models
mentioned previously.</p>
    </sec>
    <sec id="sec-6">
      <title>4.2 Ground Truth for the second subtask</title>
      <p>
        In order to collect continuous valence and arousal
annotations, 16 French participants had to continuously indicate
their level of arousal while watching the movies using a
modified version of the GTrace annotation tool [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and a joystick
(10 participants for the development set and 6 for the test
set). Movies have been divided into two subsets. Each
annotator continuously annotated one subset along induced
valence and the other along induced arousal. Thus, each
movie has been continuously annotated by five annotators
for the development set, and three for the test set.
      </p>
      <p>
        Then, the continuous valence and arousal annotations from
the participants have been down-sampled by averaging the
annotations over windows of 10 seconds with 1 second
overlap (i.e., 1 value per second) in order to remove the noise
due to unintended moves of the joystick. Finally, these
postprocessed continuous annotations have been averaged in
order to create a continuous mean signal of the valence and
arousal self-assessments. The details of this processing are
given in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
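      <p>A sketch of this post-processing is given below, assuming (as stated above) 10-second averaging windows producing one value per second, followed by an average across annotators. The joystick sampling rate and the array layout are assumptions made for the example.</p>
      <preformat>
import numpy as np

def smooth_trace(raw_trace, fs=10, window_s=10):
    """Average a raw joystick trace sampled at `fs` Hz over 10 s windows,
    producing one smoothed value per second of the movie."""
    n_seconds = len(raw_trace) // fs
    out = []
    for sec in range(n_seconds):
        start = sec * fs
        window = raw_trace[start:start + window_s * fs]
        out.append(float(np.mean(window)))
    return np.array(out)

def mean_signal(annotator_traces, fs=10):
    """Average the smoothed 1 Hz traces of all annotators for one movie."""
    smoothed = [smooth_trace(t, fs=fs) for t in annotator_traces]
    n = min(len(s) for s in smoothed)  # align trace lengths
    return np.mean([s[:n] for s in smoothed], axis=0)
      </preformat>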
    </sec>
    <sec id="sec-7">
      <title>5. RUN DESCRIPTION</title>
      <p>Participants can submit up to 5 runs for the first subtask
(global emotion prediction). For the second subtask
(continuous emotion prediction), there can be 2 types of run
submissions: full runs that concern the whole test set (the
10 movies, total duration: 11.48 hours) and light runs that
concern a subset of the test set (5 movies, total duration:
4.82 hours). In each case (light and full), up to 5 runs can
be submitted. Moreover, each subtask has a required run
which uses no external training data: only the provided
development data is allowed. Also, any features that can be
automatically extracted from the video are allowed. Both
subtasks also allow optional runs in which
any external data can be used, such as Internet sources, as
long as they are marked as "external data" runs.</p>
    </sec>
    <sec id="sec-8">
      <title>6. EVALUATION CRITERIA</title>
      <p>Standard evaluation metrics (Mean Square Error and
Pearson's Correlation Coefficient) are used to assess system
performance. Indeed, the common measure generally used to
evaluate regression models is the Mean Square Error (MSE).
However, this measure is not always sufficient to analyze
model efficiency, and the correlation may be required to
obtain a deeper performance analysis. As an example, if a
large portion of the data is neutral (i.e., its valence score is
close to 0.5) or is distributed around the neutral score, a
uniform model that always outputs 0.5 will result in good MSE
performance (low MSE). In this case, the lack of accuracy
of the model will be brought to the fore by the correlation
between the predicted values and the ground truth, which will
also be very low.</p>
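      <p>A short sketch of both metrics and of the pitfall described above is given below, using numpy and scipy on synthetic data (the values are made up for illustration). A near-constant "neutral" predictor obtains a low MSE on mostly neutral data but an almost zero correlation, while a predictor that tracks the ground truth scores well on both.</p>
      <preformat>
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
truth = 0.5 + 0.05 * rng.standard_normal(1000)        # mostly neutral data

informed = truth + 0.02 * rng.standard_normal(1000)   # tracks the ground truth
neutral = 0.5 + 0.01 * rng.standard_normal(1000)      # essentially always 0.5

def mse(pred, ref):
    return float(np.mean((pred - ref) ** 2))

for name, pred in [("neutral", neutral), ("informed", informed)]:
    r = pearsonr(pred, truth)[0]
    print(name, "MSE =", round(mse(pred, truth), 4), "r =", round(r, 3))
      </preformat>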
    </sec>
    <sec id="sec-9">
      <title>7. CONCLUSIONS</title>
      <p>The Emotional Impact of Movies Task provides
participants with a comparative and collaborative evaluation
framework for emotion detection in movies, in terms of valence
and arousal scores. The LIRIS-ACCEDE dataset 1 has been
used as the development set, and additional movies under
Creative Commons licenses and ground truth annotations have
been provided as test set. Details on the methods and
results of each individual team can be found in the papers
of the participating teams in the MediaEval 2016 workshop
proceedings.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>This task is supported by the CHIST-ERA Visen project
ANR-12-CHRI-0002-04.
1http://liris-accede.ec-lyon.fr</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chamaret</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>From crowdsourced rankings to affective ratings</article-title>
          .
          <source>In IEEE International Conference on Multimedia and Expo Workshops (ICMEW)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chamaret</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Deep learning vs. kernel methods: Performance for emotion prediction in videos</article-title>
          .
          <source>In Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chamaret</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>LIRIS-ACCEDE: A video database for affective content analysis</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Bradley</surname>
          </string-name>
          and
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Lang</surname>
          </string-name>
          .
          <article-title>Measuring emotion: the self-assessment manikin and the semantic differential</article-title>
          .
          <source>Journal of behavior therapy and experimental psychiatry</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Canini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Benini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Leonardi</surname>
          </string-name>
          .
          <article-title>Affective recommendation of movies based on selected connotative features</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cowie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sawey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doherty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jaimovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fyans</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Stapleton</surname>
          </string-name>
          . Gtrace:
          <article-title>General trace program compatible with emotionml</article-title>
          .
          <source>In Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Katti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yadati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kankanhalli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>TatSeng</surname>
          </string-name>
          .
          <article-title>Affective video summarization and story board generation using pupillary dilation and eye gaze</article-title>
          .
          <source>In IEEE International Symposium on Multimedia (ISM)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Russell</surname>
          </string-name>
          .
          <article-title>Core affect and the psychological construction of emotion</article-title>
          .
          <source>Psychological Review</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          . Advisor:
          <article-title>Personalized video soundtrack recommendation by late fusion with heuristic rankings</article-title>
          .
          <source>In ACM International Conference on Multimedia</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Quang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>The MediaEval 2015 affective impact of movies task</article-title>
          .
          <source>In MediaEval 2015 Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yadati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Katti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kankanhalli</surname>
          </string-name>
          . Cavva:
          <article-title>Computational affective video-in-video advertising</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          .
          <article-title>Affective visualization and retrieval for music video</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>Flexible presentation of videos based on affective content analysis</article-title>
          .
          <source>Advances in Multimedia Modeling</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>