<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The MediaEval 2017 Emotional Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Ecole Centrale de Lyon</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Emmanuel Dellandréa</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>HIIT, University of Helsinki</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>NICAM</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Université de Nantes</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper provides a description of the MediaEval 2017 “Emotional Impact of Movies Task”. It continues to build on previous years' editions. In this year's task, participants are expected to create systems that automatically predict the emotional impact that video content will have on viewers, in terms of valence, arousal and fear. Here we provide a description of the use case, task challenges, dataset and ground truth, task run requirements and evaluation metrics.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Affective video content analysis aims at the automatic recognition
of emotions elicited by videos. It has a large number of applications,
including mood-based personalized content recommendation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or
video indexing [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and efficient movie visualization and browsing
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Beyond the analysis of existing video material, affective
computing techniques can also be used to generate new content, e.g.,
movie summarization [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], or personalized soundtrack
recommendation to make user-generated videos more attractive [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Affective
techniques can also be used to enhance the user engagement with
advertising content by optimizing the way ads are inserted inside
videos [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        While major progress has been achieved in computer vision for
visual object detection, scene understanding and high-level concept
recognition, a natural further step is the modeling and
recognition of affective concepts. This has recently received increasing
interest from research communities, e.g., computer vision and machine
learning, with an overall goal of endowing computers with
human-like perception capabilities. Thus, this task is proposed to offer
researchers a place to compare their approaches for the prediction
of the emotional impact of movies. It continues to build on previous
years’ editions [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] with a first subtask, which is a mix of last year’s
tasks related to valence and arousal prediction, and a new subtask
dedicated to fear prediction.
      </p>
    </sec>
    <sec id="sec-2">
      <title>TASK DESCRIPTION</title>
      <p>
        The task requires participants to deploy multimedia features and
models to automatically predict the emotional impact of movies.
This emotional impact is considered here to be the prediction of
the expected emotion. The expected emotion is the emotion that
the majority of the audience feels in response to the same
content. In other words, the expected emotion is the expected value
of experienced (i.e., induced) emotion in a population. While the
induced emotion is subjective and context-dependent, the expected
emotion can be considered objective, as it reflects the more-or-less
unanimous response of a general audience to a given stimulus [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        This year, two new scenarios are proposed as subtasks. In both
cases, long movies are considered and the emotional impact has to
be predicted for consecutive 10-second segments sliding over the
whole movie with a shift of 5 seconds:
        <list list-type="order">
          <list-item>
            <p>Valence/Arousal prediction: participants’ systems are
supposed to predict a score of expected valence and arousal
for each consecutive 10-second segment. Valence is
defined as a continuous scale from most negative to most
positive emotions, while arousal is defined continuously
from calmest to most active emotions [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ];</p>
          </list-item>
          <list-item>
            <p>Fear prediction: the purpose here is to predict, for each
consecutive 10-second segment, whether it is likely to
induce fear or not. The targeted use case is the prediction
of frightening scenes to help systems protect children
from potentially harmful video content. This subtask is
complementary to the valence/arousal prediction task in
the sense that discrete emotions often overlap when mapped
into the 2D valence/arousal space (for instance,
fear, disgust and anger overlap since they are all
characterized by very negative valence and high arousal)
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].</p>
          </list-item>
        </list>
      </p>
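      <p>As an illustration of the segmentation shared by both subtasks, the following minimal Python sketch enumerates 10-second windows with a 5-second shift. The function name <monospace>sliding_segments</monospace> is hypothetical, and dropping a trailing window that would run past the end of the movie is an assumption, since the handling of the last seconds of a movie is not specified here.</p>
      <preformat><![CDATA[
```python
def sliding_segments(duration_s, window_s=10, shift_s=5):
    """Return (start, end) pairs, in seconds, of full windows inside the movie.

    Assumption: a trailing window that would extend past the end of the
    movie is dropped; the task description does not cover this case.
    """
    segments = []
    start = 0
    while start + window_s <= duration_s:
        segments.append((start, start + window_s))
        start += shift_s
    return segments


# A 30-second clip gives windows starting at 0, 5, 10, 15 and 20 seconds.
print(sliding_segments(30))
```
]]></preformat>
      <p>Under these assumptions, each second of the movie (apart from the first and last 5 seconds) is covered by exactly two segments, which is why consecutive predictions are not independent.</p>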
    </sec>
    <sec id="sec-3">
      <title>DATA DESCRIPTION</title>
      <p>
        The dataset used in this task is the LIRIS-ACCEDE
dataset. It contains videos from a set of 160 professionally made
and amateur movies, shared under Creative Commons licenses that
allow redistribution [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Several movie genres are represented in
this collection, such as horror, comedy, drama and action.
The movies are mainly in English, with a small set in Italian,
Spanish, French and other languages subtitled in English.
      </p>
      <p>
        The continuous part of LIRIS-ACCEDE [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is used as the
development set for both subtasks. It consists of a selection of 30 movies.
The selected videos are between 117 and 4,566 seconds long (mean
= 884.2 s, SD = 766.7 s). The total length of the 30 selected movies
is 7 hours, 22 minutes and 5 seconds.
      </p>
      <p>The test set consists of 14 movies that are not among the
160 original movies. They are between 210 and 6,260
seconds long (mean = 2045.2 s, SD = 2450.1 s). The total length
of the 14 selected movies is 7 hours, 57 minutes and 13 seconds.</p>
      <p>
        In addition to the video data, participants are also provided with
general-purpose audio and visual content features. To compute
audio features, movies have first been processed to extract
consecutive 10-second segments sliding over the whole movie with a shift
of 5 seconds. Then, audio features have been extracted from these
segments using the openSMILE toolbox<sup>2</sup> [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The default configuration
named “emobase2010.conf” was used. It allows the computation of
1,582 features, which result from a base of 34 low-level descriptors
(LLD) with 34 corresponding delta coefficients appended, and 21
functionals applied to each of these 68 LLD contours (1,428
features). In addition, 19 functionals are applied to the 4 pitch-based
LLD and their four delta coefficient contours (152 features). Finally,
the number of pitch onsets (pseudo syllables) and the total duration
of the input are appended (2 features).
      </p>
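      <p>The feature count quoted above can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in the paragraph; the variable names are illustrative and do not correspond to openSMILE identifiers.</p>
      <preformat><![CDATA[
```python
# Counts taken from the description of the emobase2010 configuration above.
lld = 34                     # low-level descriptors
lld_contours = lld * 2       # each LLD plus its delta coefficients: 68
functionals = 21             # functionals applied to each of the 68 contours
pitch_contours = 4 * 2       # 4 pitch-based LLD plus their deltas: 8
pitch_functionals = 19       # functionals applied to the pitch contours
extras = 2                   # pitch-onset count and total input duration

total = (lld_contours * functionals
         + pitch_contours * pitch_functionals
         + extras)
print(total)  # 1428 + 152 + 2 = 1582
```
]]></preformat>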
      <p>Beyond audio features, for each movie, image frames were
extracted every second. For each of these images, several general-purpose
visual features have been provided. They have been
computed using the LIRE library<sup>3</sup>, except the CNN features (VGG16 fc6 layer),
which have been extracted using the Matlab Neural Network Toolbox<sup>4</sup>.
The visual features are the following: Auto Color Correlogram,
Color and Edge Directivity Descriptor (CEDD), Color Layout, Edge
Histogram, Fuzzy Color and Texture Histogram (FCTH), Gabor, Joint descriptor
joining CEDD and FCTH in one histogram, Scalable Color, Tamura,
Local Binary Patterns, and VGG16 fc6 layer.</p>
    </sec>
    <sec id="sec-4">
      <title>GROUND TRUTH</title>
      <p>Annotations are provided to participants for the 30 movies from the
development set. Thus, for each movie, a first file contains valence
and arousal values for consecutive 10-second segments sliding over
the whole movie with a shift of 5 seconds, and a second file contains
the indication whether these segments are supposed to induce fear
(value 1) or not (value 0).
</p>
    </sec>
    <sec id="sec-5">
      <title>Ground Truth for the first subtask</title>
      <p>
        In order to collect continuous valence and arousal annotations,
16 French participants had to continuously indicate their level of
valence or arousal while watching the movies using a modified version of the
GTrace annotation tool [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and a joystick (10 participants for the
development set and 6 for the test set). Movies have been divided into
two subsets. Each annotator continuously annotated one subset
considering the induced valence and the other subset considering
the induced arousal. Thus, each movie has been continuously
annotated by five annotators for the development set, and three for
the test set.
      </p>
      <p>
        Then, the continuous valence and arousal annotations from the
participants have been down-sampled by averaging the annotations
over windows of 10 seconds with a shift of 1 second (i.e., 1
value per second) in order to remove the noise due to unintended
movements of the joystick. Finally, these post-processed continuous
annotations have been averaged in order to create a continuous
mean signal of the valence and arousal self-assessments. The details
of this processing are given in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For the purpose of the first
subtask, these values have been averaged to obtain a single value
of valence and a single value of arousal for each consecutive
10-second segment sliding over the whole movie with a shift of 5
seconds.
        <fn><label>2</label><p>http://audeering.com/technology/opensmile/</p></fn>
        <fn><label>3</label><p>http://www.lire-project.net/</p></fn>
        <fn><label>4</label><p>https://www.mathworks.com/products/neural-network.html</p></fn>
      </p>
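      <p>The post-processing chain described above (smoothing each annotator's trace, averaging across annotators, then averaging per segment) can be sketched as follows. This is a simplified illustration and not the organizers' implementation: it assumes traces already sampled at one value per second, forward-looking windows, and equally long traces, and all function names and sample values are hypothetical.</p>
      <preformat><![CDATA[
```python
from statistics import mean


def smooth(trace, window=10):
    """Average a 1 Hz trace over full windows of `window` samples, shifted by 1."""
    return [mean(trace[t:t + window]) for t in range(len(trace) - window + 1)]


def mean_signal(traces):
    """Average several equally long per-annotator traces sample by sample."""
    return [mean(values) for values in zip(*traces)]


def segment_scores(signal, window=10, shift=5):
    """Average the mean signal over each full window of 10 samples, shifted by 5."""
    return [mean(signal[s:s + window])
            for s in range(0, len(signal) - window + 1, shift)]


# Two hypothetical annotators, 30 seconds each; values are made up.
annotator_a = [0.0] * 15 + [1.0] * 15
annotator_b = [0.5] * 30
smoothed = [smooth(annotator_a), smooth(annotator_b)]
scores = segment_scores(mean_signal(smoothed))
print(len(smoothed[0]), len(scores))
```
]]></preformat>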
    </sec>
    <sec id="sec-6">
      <title>Ground Truth for the second subtask</title>
      <p>Fear annotations for the second subtask were generated using a
tool specifically designed for the classification of audio-visual
media, which allows annotation to be performed while watching the movie.
The annotations were made by two experienced
team members of NICAM<sup>5</sup>, both trained in the
classification of media. Each movie was annotated by one
annotator, who reported the start and stop times of each sequence in the movie
expected to induce fear. From this information, the 10-second
segments sliding over the whole movie with a shift of 5 seconds
have been labeled as fear (value 1) if they intersect one of the fear
sequences and as not fear (value 0) otherwise.</p>
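      <p>The labeling rule above amounts to an interval-overlap test. The following minimal sketch assumes half-open intervals in seconds, so that a segment merely touching the boundary of a fear sequence is not counted as intersecting it; this convention, the function name <monospace>fear_labels</monospace> and the example times are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
def fear_labels(duration_s, fear_sequences, window_s=10, shift_s=5):
    """Label each sliding segment 1 if it overlaps any (start, stop) fear sequence."""
    labels = []
    start = 0
    while start + window_s <= duration_s:
        end = start + window_s
        # Assumption: half-open intervals, so touching endpoints do not overlap.
        hit = any(start < seq_stop and seq_start < end
                  for seq_start, seq_stop in fear_sequences)
        labels.append(1 if hit else 0)
        start += shift_s
    return labels


# Hypothetical 40-second movie with one fear sequence from 12 s to 18 s.
print(fear_labels(40, [(12, 18)]))  # [0, 1, 1, 1, 0, 0, 0]
```
]]></preformat>
      <p>Because segments overlap, a single 6-second fear sequence marks three consecutive segments as fear in this example.</p>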
    </sec>
    <sec id="sec-7">
      <title>RUN DESCRIPTION</title>
      <p>Participants can submit up to 5 runs for each of the two subtasks,
so 10 runs in total. Models can rely on the features provided by the
organizers or any other external data.</p>
    </sec>
    <sec id="sec-8">
      <title>EVALUATION CRITERIA</title>
      <p>Standard evaluation metrics are used to assess systems’ performance.
The first subtask can be considered as a regression problem
(estimation of expected valence and arousal scores) while the second
subtask can be seen as a binary classification problem (the video
segment is supposed to induce/not induce fear).</p>
      <p>For the first subtask, the official metric is the Mean Square Error
(MSE), which is the common measure generally used to evaluate
regression models. However, to allow a deeper understanding of
systems’ performance, we also consider Pearson’s Correlation
Coefficient. Indeed, MSE is not always sufficient to analyze models’
efficiency, and the correlation may be required to obtain a deeper
performance analysis. As an example, if a large portion of the data
is neutral (i.e., its valence score is close to 0.5) or is distributed
around the neutral score, a uniform model that always outputs 0.5
will result in good MSE performance (low MSE). In this case, the
model’s lack of accuracy will be revealed by the
correlation between the predicted values and the ground truth, which
will also be very low.</p>
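      <p>The uniform-model example can be reproduced with a short sketch. The valence values below are made up for illustration. Note that Pearson's correlation is strictly undefined for a constant predictor (zero variance); returning 0.0 in that case is a convention chosen for this sketch, not part of the official evaluation.</p>
      <preformat><![CDATA[
```python
from math import sqrt


def mse(y_true, y_pred):
    """Mean square error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(x, y):
    """Pearson's r; returns 0.0 when either input is constant (r is undefined)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen for this sketch
    return cov / sqrt(var_x * var_y)


# Made-up valence scores clustered around the neutral value 0.5.
truth = [0.45, 0.50, 0.55, 0.50, 0.48, 0.52]
constant = [0.5] * len(truth)
print(mse(truth, constant), pearson(truth, constant))
```
]]></preformat>
      <p>Here the constant model obtains an MSE below 0.001 while carrying no information about the variations in the ground truth, which is what the correlation is meant to expose.</p>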
      <p>For the second subtask, the official metric is the Mean
Average Precision (MAP). Moreover, Accuracy, Precision, Recall and
F1-score are also considered to provide insights into systems’
behaviours.</p>
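      <p>For reference, a common non-interpolated formulation of average precision is sketched below. How the official evaluation aggregates the scores is not detailed here, so taking the mean of per-movie average precision is an assumption, and the function names, labels and scores are hypothetical.</p>
      <preformat><![CDATA[
```python
def average_precision(labels, scores):
    """Non-interpolated AP of binary labels ranked by decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_movie):
    """Mean of AP over movies; per_movie is a list of (labels, scores) pairs."""
    return sum(average_precision(labels, scores)
               for labels, scores in per_movie) / len(per_movie)


# Hypothetical fear labels and predicted scores for two short movies.
movie_a = ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
movie_b = ([0, 1], [0.2, 0.6])
print(mean_average_precision([movie_a, movie_b]))
```
]]></preformat>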
    </sec>
    <sec id="sec-9">
      <title>CONCLUSIONS</title>
      <p>The Emotional Impact of Movies Task provides participants with a
comparative and collaborative evaluation framework for emotion
detection in movies, in terms of valence, arousal and fear. The
LIRIS-ACCEDE dataset has been used as development and test sets.
Details on the methods and results of each individual team can be
found in the papers of the participating teams in the MediaEval
2017 workshop proceedings.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>This task is supported by the CHIST-ERA Visen project
ANR-12CHRI-0002-04.
<fn><label>5</label><p>http://www.kijkwijzer.nl/nicam</p></fn>
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandréa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chamaret</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Deep Learning vs. Kernel Methods: Performance for Emotion Prediction in Videos</article-title>
          .
          <source>In Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII).</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandréa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chamaret</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>LIRIS-ACCEDE: A Video Database for Affective Content Analysis</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          <volume>6</volume>
          ,
          <issue>1</issue>
          (
          <year>2015</year>
          ),
          <fpage>43</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Canini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Benini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Leonardi</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Affective recommendation of movies based on selected connotative features</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>23</volume>
          ,
          <issue>4</issue>
          (
          <year>2013</year>
          ),
          <fpage>636</fpage>
          -
          <lpage>647</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cowie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sawey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doherty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jaimovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fyans</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Stapleton</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Gtrace: General trace program compatible with EmotionML</article-title>
          .
          <source>In Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII).</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandréa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Chamaret</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The MediaEval 2016 Emotional Impact of Movies Task</article-title>
          . In MediaEval 2016 Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gross</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor</article-title>
          .
          <source>In ACM Multimedia (MM)</source>
          , Barcelona, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.-A.</given-names>
            <surname>Feldman</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Valence focus and arousal focus: Individual differences in the structure of affective experience</article-title>
          .
          <volume>69</volume>
          (
          <year>1995</year>
          ),
          <fpage>153</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Extracting moods from pictures and sounds: Towards truly personalized TV</article-title>
          .
          <source>IEEE Signal Processing Magazine</source>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Katti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yadati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kankanhalli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>TatSeng</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Affective video summarization and story board generation using pupillary dilation and eye gaze</article-title>
          .
          <source>In IEEE International Symposium on Multimedia (ISM).</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Russell</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Core affect and the psychological construction of emotion</article-title>
          .
          <source>Psychological Review</source>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Advisor: Personalized video soundtrack recommendation by late fusion with heuristic rankings</article-title>
          .
          <source>In ACM International Conference on Multimedia.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yadati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Katti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kankanhalli</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Cavva: Computational affective video-in-video advertising</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>16</volume>
          ,
          <issue>1</issue>
          (
          <year>2014</year>
          ),
          <fpage>15</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Affective visualization and retrieval for music video</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>12</volume>
          ,
          <issue>6</issue>
          (
          <year>2010</year>
          ),
          <fpage>510</fpage>
          -
          <lpage>522</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Flexible presentation of videos based on affective content analysis</article-title>
          .
          <source>Advances in Multimedia Modeling</source>
          <volume>7732</volume>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>