<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The ICL-TUM-PASSAU Approach for the MediaEval 2015 “Affective Impact of Movies” Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>George Trigeorgis</string-name>
          <email>g.trigeorgis@imperial.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduardo Coutinho</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabien Ringeval</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Marchi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefanos Zafeiriou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Björn Schuller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chair of Complex &amp; Intelligent Systems, University of Passau</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computing, Imperial College London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Machine Intelligence &amp; Signal Processing Group, Technische Universität München</institution>
          ,
          <addr-line>Munich</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper we describe the Imperial College London, Technische Universität München and University of Passau (ICL+TUM+PASSAU) team approach to the MediaEval 2015 “Affective Impact of Movies” challenge, which consists of the automatic detection of affective (arousal and valence) and violent content in movie excerpts. In addition to the baseline features, we computed spectral and energy related acoustic features, as well as the probability of various objects being present in the video. Random Forests, AdaBoost and Support Vector Machines were used as classification methods. The best results show that the dataset is highly challenging for both the affect and violence detection tasks, mainly because of issues in inter-rater agreement and data scarcity.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2015 Challenge “Affective Impact of
Movies” comprises two subtasks using the LIRIS-ACCEDE
database [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Subtask 1 targets the automatic
categorisation of videos in terms of their affective impact: the goal
is to identify the arousal (calm-neutral-excited) and valence
(negative-neutral-positive) level of each video. The goal
of Subtask 2 is to identify the videos that contain violent
scenes. The full description of the tasks can be found in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Subtask 1: affect classification</title>
      <sec id="sec-3-1">
        <title>Feature sets.</title>
        <p>
          In our work we used both the baseline features
provided by the organisers [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and our own sets of
audiovisual features, as described below.
        </p>
        <p>
          The extended Geneva Minimalistic Acoustic Parameter
Set (eGeMAPS) was used to extract acoustic features with
the openSMILE toolkit [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]; this feature set was designed as a
standard acoustic parameter set for automatic speech
emotion recognition [
          <xref ref-type="bibr" rid="ref16 ref18 ref5">5, 16, 18</xref>
          ] and has also been successfully
used for other paralinguistic tasks [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The eGeMAPS
comprises a total of 18 Low-Level Descriptors (LLDs),
including frequency, energy/amplitude, and spectral related
features. Various functionals were then applied to the LLDs
over the whole instance, giving rise to a total of 88 features.
        </p>
        <p>
          The emotional impact of videos can be heavily influenced
by the kind of objects present in a given scene [
          <xref ref-type="bibr" rid="ref11 ref12 ref15">11, 12, 15</xref>
          ]. We thus computed the probability of 1000 different
objects being present in a frame using a 16-layer
convolutional neural network (CNN) pretrained on the ILSVRC2013
dataset [
          <xref ref-type="bibr" rid="ref21 ref4">21, 4</xref>
          ]. Let x ∈ ℝ^(N×p) represent a video of the
database with N frames and p pixels per frame, and f(·)
the trained convolutional neural net with softmax
activation functions in the output layer. The probability Pr(y = c | xᵢ; θ)
of each of the 1000 classes being present in the i-th frame xᵢ
of a video x, where θ denotes the network parameters, is obtained
by forwarding the p pixel values through the network. By averaging the
activations over all N frames of a video sequence, we obtained
the probability distribution of the 1000 ILSVRC2013 classes
that might be present in the video.
        </p>
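        <p>For illustration, a minimal sketch of this frame-averaging step is given below, assuming a generic image classifier with a softmax output; the function and variable names are ours, and the stand-in network is not the pretrained 16-layer CNN used in our experiments.</p>
        <preformat>
import numpy as np

def video_class_probabilities(frames, cnn_softmax):
    """Average per-frame softmax outputs over a whole video.

    frames      -- array of shape (N, p): N frames, p pixel values each
    cnn_softmax -- callable mapping one frame to a length-1000
                   probability vector (the CNN's softmax layer)
    """
    probs = np.stack([cnn_softmax(frame) for frame in frames])  # (N, 1000)
    return probs.mean(axis=0)                                   # (1000,)

# toy usage with a stand-in "network" returning uniform probabilities
dummy_cnn = lambda frame: np.full(1000, 1.0 / 1000)
video = np.random.rand(12, 224 * 224 * 3)   # 12 frames, flattened pixels
p = video_class_probabilities(video, dummy_cnn)
assert np.isclose(p.sum(), 1.0)
        </preformat>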
      </sec>
      <sec id="sec-3-2">
        <title>Classifiers.</title>
        <p>
          For modelling the data we concentrated on two
out-of-the-box ensemble techniques: Random Forests and AdaBoost.
We used these two techniques as they are less susceptible
to overfitting than other learning algorithms
due to the combination of weak learners, they are trivial to
optimise as they have only one hyper-parameter, and they
usually provide results close to or on par with the
state of the art for a multitude of tasks [
          <xref ref-type="bibr" rid="ref10 ref14 ref23 ref9">9, 10, 14, 23</xref>
          ]. The
hyper-parameters for each classifier were determined using a 5-fold
cross-validation scheme on the development set. During
development, the best performance was achieved with 10 trees
for Random Forests and 20 trees for AdaBoost.
        </p>
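        <p>The sketch below shows how such a tuning scheme can be set up; scikit-learn is used here as a stand-in, since the toolkit is not named above, and the data arrays are purely illustrative.</p>
        <preformat>
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# illustrative stand-ins for the per-clip feature matrix and arousal labels
X = np.random.rand(200, 88)          # e.g. 88 eGeMAPS functionals per clip
y = np.random.randint(0, 3, 200)     # calm / neutral / excited

# tune the single hyper-parameter (number of trees) with 5-fold CV
for name, clf in [("RandomForest", RandomForestClassifier()),
                  ("AdaBoost", AdaBoostClassifier())]:
    search = GridSearchCV(clf, {"n_estimators": [10, 20, 50, 100]}, cv=5)
    search.fit(X, y)
    print(name, search.best_params_)
        </preformat>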
      </sec>
      <sec id="sec-3-3">
        <title>Runs.</title>
        <p>We submitted a total of five runs. Run 1 consisted of
predictions using the baseline features and the AdaBoost model.
The predictions in runs 2 and 5 were obtained using the
baseline plus our audio-visual feature sets and the Random
Forest and AdaBoost classifiers, respectively. By looking
at the distribution of labels in the development set, we
observed that the most common combinations of labels are: 1)
neutral valence (Vn) and negative arousal (A−) (24%), and
2) positive valence (V+) and negative arousal (A−) (20%).
Runs 3 and 4 are thus based on the hypothesis that the label
distribution of the test set will be similarly unbalanced. In
Run 3 every clip was predicted to be Vn, A+ and in Run
4 every one was V+, A−. These submissions act as a
sanity check for our own models, as well as for other competitors'
submissions in this competition.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 Subtask 2: violence detection</title>
      <sec id="sec-4-1">
        <title>Feature sets.</title>
        <p>
          Following previous work [
          <xref ref-type="bibr" rid="ref13 ref7">7, 13</xref>
          ], we only considered
spectral and energy based features as acoustic descriptors.
Indeed, violent segments do not necessarily contain speech;
voice-specific features, such as voice quality and pitch
related descriptors, might thus not be a reliable source of
information for violence. We extracted 22 acoustic Low-Level
Descriptors (LLDs): loudness, alpha ratio, Hammarberg index,
energy slope and proportion in the bands [0–500] Hz
and [500–1500] Hz, and 14 MFCCs, using the openSMILE
toolkit [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. All LLDs, with the exception of loudness and
the measures of energy proportion, were computed
separately for voiced and unvoiced segments. As the frames
of the movie that contain violent scenes are unknown, we
computed 5 functionals (max, min, range, arithmetic mean
and standard deviation) to summarise the LLDs over the
movie excerpt, which provided a total of 300 features. For
the video modality, we used the same additional features
defined in Subtask 1. We also used the metadata information
of the video genre as an additional feature, due to
dependencies between movie genre and violent content.
        </p>
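        <p>A minimal sketch of this functional-based summarisation is given below; the array shapes are illustrative (e.g. 60 LLD contours yield 60 × 5 = 300 features) and the names are not taken from our implementation.</p>
        <preformat>
import numpy as np

FUNCTIONALS = {
    "max": np.max, "min": np.min,
    "range": lambda x: np.max(x) - np.min(x),
    "mean": np.mean, "std": np.std,
}

def summarise_llds(lld_matrix):
    """Apply the five functionals to each LLD contour.

    lld_matrix -- array of shape (num_frames, num_llds), one column per LLD
    returns    -- 1-D feature vector of length num_llds * 5
    """
    return np.concatenate([
        [f(lld_matrix[:, j]) for f in FUNCTIONALS.values()]
        for j in range(lld_matrix.shape[1])
    ])

# illustrative excerpt: 500 frames of 60 LLD contours gives 300 features
features = summarise_llds(np.random.rand(500, 60))
assert features.shape == (300,)
        </preformat>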
      </sec>
      <sec id="sec-4-2">
        <title>Classifier.</title>
        <p>
          Since the dataset is strongly imbalanced (only 272
excerpts out of 6,144 are labelled as violent), we up-sampled
the violent instances to achieve a balanced distribution. All
features were furthermore standardised with a z-score. As
classifier, we used the libsvm implementation of Support
Vector Machines (SVMs) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and optimised the complexity
parameter and the coefficient of the radial basis kernel
in a 5-fold cross-validation framework on the development
set. Because the official scoring script requires the
computation of a posteriori probabilities, which is more time
consuming than the straightforward classification task, we
optimised the Unweighted Average Recall (UAR) to find
the best hyper-parameters [
          <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
          ], and then re-trained the
SVMs with probability estimates enabled.
        </p>
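        <p>A compact sketch of this training procedure is shown below; scikit-learn's SVC (which wraps libsvm) is used as a stand-in, and the feature matrix, labels and hyper-parameter grids are illustrative rather than those of our submission.</p>
        <preformat>
import numpy as np
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# illustrative stand-ins for excerpt features and violent / non-violent labels
X = np.random.rand(600, 300)
y = np.array([1] * 30 + [0] * 570)      # strongly imbalanced, as in the task

# up-sample the minority (violent) class to a balanced distribution
minority = np.where(y == 1)[0]
extra = np.random.choice(minority, size=(y == 0).sum() - len(minority))
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# z-score standardisation + RBF SVM; tune C and gamma on UAR
# (unweighted average recall = macro-averaged recall) with 5-fold CV
uar = make_scorer(recall_score, average="macro")
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(model, grid, scoring=uar, cv=5)
search.fit(X_bal, y_bal)
print(search.best_params_, search.best_score_)
        </preformat>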
      </sec>
      <sec id="sec-4-3">
        <title>Runs.</title>
        <p>We first performed experiments with the full baseline
feature set and found that adding the movie genre
as a feature improved the Mean Average Precision (MAP)
from 19.5 to 20.3, despite degrading the UAR from 72.3 to
72.0. Adding our own audio-visual features provided a jump
in performance, with the MAP reaching 33.6 and the UAR
77.6. Because some movie excerpts contain only partly relevant
acoustic information, we empirically defined a threshold on
loudness, based on its histogram, to exclude frames before
computing the functionals. This procedure improved
the MAP to 35.9 but lowered the UAR to 76.9. Fine-tuning
of the complexity parameter and kernel coefficient yielded
the best performance in terms of UAR with a value of 78.0,
but slightly deteriorated the MAP to 35.7.</p>
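        <p>The loudness gating can be sketched as follows; the threshold value is the one used in Run 2 below, while the array names and shapes are illustrative.</p>
        <preformat>
import numpy as np

def gated_functionals(llds, loudness, threshold=0.038):
    """Exclude low-loudness frames, then apply the five functionals.

    llds      -- array (num_frames, num_llds) of per-frame descriptors
    loudness  -- array (num_frames,) of per-frame loudness values
    threshold -- chosen empirically from the loudness histogram
    """
    kept = llds[loudness >= threshold]
    if kept.shape[0] == 0:      # guard: keep all frames if the excerpt is quiet
        kept = llds
    return np.concatenate([kept.max(0), kept.min(0),
                           kept.max(0) - kept.min(0),   # range
                           kept.mean(0), kept.std(0)])

# illustrative usage: 500 frames, 60 LLD contours, 300 summarised features
feats = gated_functionals(np.random.rand(500, 60), np.random.rand(500))
        </preformat>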
        <p>We submitted a total of five runs. Run 1: baseline
features; Run 2: all features mentioned above (except movie
genre) with the loudness threshold (0.038); Run 3: same as Run
2 plus the inclusion of movie genre; Run 4: as Run 3 but
with fine-tuned hyper-parameters; Run 5: similar
to Run 3 but with a higher loudness threshold (0.078).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS</title>
      <p>Our official results on the test set for both subtasks are
shown in Table 1.</p>
      <p>
        Subtask 1. Our results for the affective task indicate that
we did not perform much better than chance for
arousal classification, and only slightly better than chance
for valence in Run 5; we thus refrain from further
interpretation of these results. This can be explained by the low quality of
the annotations provided with the dataset: the initial
annotations had a low inter-rater agreement [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and there were
multiple processing stages afterwards [
        <xref ref-type="bibr" rid="ref1 ref22">1, 22</xref>
        ] with high levels
of uncertainty and unclear validity.
      </p>
      <p>Subtask 2. The results show that there is considerable
overfitting in our models, as the performance drops by a
factor of two between the development and test partitions. This is,
however, not really surprising, since only 272 instances
labelled as violent were available as training data. Moreover,
because the labelling was performed not at the frame level
but at the excerpt level, the information that is judged as
violent cannot be modelled precisely, which makes the
task highly challenging. We can nevertheless observe that
the proposed audio-visual feature set brings a large
improvement over the baseline feature set (the MAP improves
by more than a factor of two), and that the inclusion of the movie
genre as an additional feature also yields a small improvement
in performance.</p>
      <sec id="sec-5-1">
        <title>Subtask 1</title>
      </sec>
      <sec id="sec-5-2">
        <title>Subtask 2 Run</title>
      </sec>
      <sec id="sec-5-3">
        <title>Arousal (AC)</title>
      </sec>
      <sec id="sec-5-4">
        <title>Valence (AC) Violence (MAP) 1 2</title>
      <p>
        We have presented our approach to the MediaEval 2015
“Affective Impact of Movies” challenge, which consists of the
automatic detection of affective and violent content in movie
excerpts. Our results for the affective task have shown that
we did not do much better than a classifier based
on chance, although we used features and classifiers that are
known in the literature to work well for arousal and valence
prediction [
        <xref ref-type="bibr" rid="ref2 ref8">2, 8</xref>
        ]. We consider that this might be owed to the
potential noisiness of the provided annotations. As for the
violence prediction subtask, the results show that we
overfit considerably on the development set, which is not surprising
given the small number of instances of the minority class.
The analysis of violent content at the excerpt level is also
highly challenging, because only a few frames might contain
violence, and such brief information is almost entirely lost in
the computation of functionals at the full excerpt level.
      </p>
    </sec>
    <sec id="sec-6">
      <title>5. ACKNOWLEDGEMENTS</title>
      <p>The research leading to these results has received funding
from the EC's Seventh Framework Programme through the
ERC Starting Grant No. 338164 (iHEARu), and the EU's
Horizon 2020 Programme through the Innovative Action
No. 644632 (MixedEmotions), No. 645094 (SEWA) and the
Research Innovative Action No. 645378 (ARIA-VALUSPA).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chamaret</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>From crowdsourced rankings to affective ratings</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Multimedia and Expo Workshops (ICMEW 2014)</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>6</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chamaret</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>LIRIS-ACCEDE: A Video Database for Affective Content Analysis</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ):
          <fpage>43</fpage>
          –
          <lpage>55</lpage>
          , January–March
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBSVM: A library for support vector machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ),
          <year>April 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR</source>
          <year>2009</year>
          ), pages
          <fpage>248</fpage>
          –
          <lpage>255</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Scherer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>André</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Busso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Devillers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Epps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Laukka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Truong</surname>
          </string-name>
          .
          <article-title>The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          , In press,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gross</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <article-title>Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor</article-title>
          .
          <source>In Proceedings of the 21st ACM International Conference on Multimedia (MM</source>
          <year>2013</year>
          ), pages
          <fpage>835</fpage>
          –
          <lpage>838</lpage>
          , Barcelona, Spain,
          <year>October 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lehment</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <article-title>Affective Video Retrieval: Violence Detection in Hollywood Movies by Large-Scale Segmental Feature Extraction</article-title>
          .
          <source>PLOS ONE</source>
          ,
          <volume>8</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>12</lpage>
          ,
          <year>December 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Forbes-Riley</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Litman</surname>
          </string-name>
          .
          <article-title>Predicting emotion in spoken dialogue from multiple knowledge sources</article-title>
          .
          <source>In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (HLT-NAACL)</source>
          , pages
          <fpage>201</fpage>
          –
          <lpage>208</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Fanelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gall</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Van Gool</surname>
          </string-name>
          .
          <article-title>Real time head pose estimation with random regression forests</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pages
          <fpage>617</fpage>
          –
          <lpage>624</lpage>
          , Providence (RI), USA,
          <year>June 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chehata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mallet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Boukir</surname>
          </string-name>
          .
          <article-title>Relevance of airborne lidar and multispectral image data for urban scene classification using Random Forests</article-title>
          .
          <source>ISPRS Journal of Photogrammetry and Remote Sensing</source>
          ,
          <volume>66</volume>
          (
          <issue>1</issue>
          ):
          <fpage>56</fpage>
          –
          <lpage>66</lpage>
          ,
          <year>January 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.-Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <article-title>Affective video content representation and modeling</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>143</fpage>
          –
          <lpage>154</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zeng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Maybank</surname>
          </string-name>
          .
          <article-title>A survey on visual content-based video indexing and retrieval</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ):
          <fpage>797</fpage>
          –
          <lpage>819</lpage>
          ,
          <year>October 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , J. Schluter, I. Mironica, and
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          .
          <article-title>A naive mid-level concept-based fusion approach to violence detection in Hollywood movies</article-title>
          .
          <source>In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval (ICMR</source>
          <year>2013</year>
          ), pages
          <fpage>215</fpage>
          –
          <lpage>222</lpage>
          , Dallas (TX), USA,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.-J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yuille</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Koch</surname>
          </string-name>
          .
          <article-title>AdaBoost for text detection in natural scene</article-title>
          .
          <source>In Proceedings of the IEEE 12th International Conference on Document Analysis and Recognition (ICDAR</source>
          <year>2013</year>
          ), pages
          <fpage>429</fpage>
          –
          <lpage>434</lpage>
          , Beijing, China,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>I.</given-names>
            <surname>Lopatovska</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Arapakis</surname>
          </string-name>
          .
          <article-title>Theories, methods and current research on emotions in library and information science, information retrieval and human–computer interaction</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>47</volume>
          (
          <issue>4</issue>
          ):
          <fpage>575</fpage>
          –
          <lpage>592</lpage>
          ,
          <year>July 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ringeval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amiriparian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Scherer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <article-title>Emotion recognition in the wild: Incorporating voice and lip activity in multimodal decision-level fusion</article-title>
          .
          <source>In Proceedings of the 2nd Emotion Recognition In The Wild Challenge and Workshop (EmotiW</source>
          <year>2014</year>
          ), pages
          <fpage>473</fpage>
          –
          <lpage>480</lpage>
          , Istanbul, Turkey,
          <year>September 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ringeval</surname>
          </string-name>
          , E. Marchi,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mehu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Scherer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <article-title>Face reading from speech – predicting facial action units from audio cues</article-title>
          .
          <source>In Proceedings of INTERSPEECH</source>
          <year>2015</year>
          ,
          <article-title>16th Annual Conference of the International Speech Communication Association (ISCA)</article-title>
          , to appear, Dresden, Germany,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ringeval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Valstar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Marchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lalanne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cowie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Pantic</surname>
          </string-name>
          .
          <article-title>AV+EC 2015 – The First Affect Recognition Challenge Bridging Across Audio, Video, and Physiological Data</article-title>
          .
          <source>In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge (AVEC)</source>
          , ACM MM, Brisbane, Australia,
          <year>October 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Steidl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Batliner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Epps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ringeval</surname>
          </string-name>
          , E. Marchi, and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>The INTERSPEECH 2014 Computational Paralinguistics Challenge: Cognitive &amp; Physical Load</article-title>
          .
          <source>In Proceedings of INTERSPEECH</source>
          <year>2014</year>
          ,
          <article-title>15th Annual Conference of the International Speech Communication Association (ISCA)</article-title>
          , pages
          <fpage>427</fpage>
          –
          <lpage>431</lpage>
          , Singapore, Republic of Singapore,
          <year>September 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Steidl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Batliner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hantke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hönig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Orozco-Arroyave</surname>
          </string-name>
          , E. Noth,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          .
          <article-title>The INTERSPEECH 2015 Computational Paralinguistics Challenge: Nativeness, Parkinson's &amp; Eating Condition</article-title>
          .
          <source>In Proceedings of INTERSPEECH</source>
          <year>2015</year>
          ,
          <article-title>16th Annual Conference of the International Speech Communication Association (ISCA)</article-title>
          , Dresden, Germany,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          , Y. Baveye,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Quang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , E. Dellandrea,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>The MediaEval 2015 Affective Impact of Movies Task</article-title>
          .
          <source>In Proceedings of the MediaEval 2015 Workshop</source>
          , Wurzen, Germany,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Stumpf</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Kerle</surname>
          </string-name>
          .
          <article-title>Object-oriented mapping of landslides using random forests</article-title>
          .
          <source>Remote Sensing of Environment</source>
          ,
          <volume>115</volume>
          (
          <issue>10</issue>
          ):
          <fpage>2564</fpage>
          –
          <lpage>2577</lpage>
          ,
          <year>October 2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>