<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RUCMM at MediaEval 2015 Affective Impact of Movies Task: Fusion of Audio and Visual Cues</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Qin Jin</string-name>
          <email>qjin@ruc.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xirong Li</string-name>
          <email>xirong@ruc.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haibing Cao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yujia Huo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shuai Liao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gang Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jieping Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Multimedia Computing Lab, School of Information, Renmin University of China; Key Lab of Data Engineering and Knowledge Engineering, Renmin University of China</institution>
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <author-notes>
        <fn>
          <p>Equal contribution and corresponding authors.</p>
        </fn>
      </author-notes>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper summarizes our efforts in our first-time participation in the Violent Scene Detection subtask of the MediaEval 2015 Affective Impact of Movies Task. We build violent scene detectors using both audio and visual cues. In particular, the audio cue is represented by bag-of-audio-words with Fisher vector encoding. The visual cue is exploited by extracting CNN features from video frames. The detectors are implemented using two-class linear SVM classifiers. Evaluation shows that the audio detectors and the visual detectors are comparable and complementary to each other. Among our submissions, multi-modal late fusion leads to the best performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The 2015 Affective Impact of Movies Task consists of two
subtasks: Induced Affect Detection and Violence Detection,
in which we participated for the first time. Violent scene
detection (VSD), which automatically detects violent scenes
in videos, is a challenging task due to the large variations in
video quality, content, and broad semantic meaning.
Violence is defined as "violent videos are those one would not let
an 8-year-old child see because of their physical violence".
MediaEval provides a common corpus and evaluation
platform that encourages and enables competition and
comparison among research teams. In this paper, we describe our
VSD system for our first-time participation in MediaEval
2015 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We focus on utilizing both audio and visual cues
in the video for violent scene detection. Our audio-based
system uses bag-of-audio-words with Fisher vector encoding,
while our visual-based system uses deep features extracted
by pretrained Convolutional Neural Networks (CNN)
models. We combine both modalities via late fusion, and
investigate two weighting strategies. One is equal weights, and
the other is non-equal weights learned on a held-out subset
of the development dataset.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
      <p>In this task, we build audio-only subsystems and
visual-only subsystems. We also fuse the two modality subsystems
via late fusion. The detailed description of the feature
representation and prediction model of each subsystem is presented
in the following subsections.</p>
      <sec id="sec-2-1">
        <title>Equal contribution and corresponding authors.</title>
        <p>2.1</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2.1 Audio Feature Representation</title>
      <p>We chunk the audio stream into small segments with some
overlap (for example, a 3-sec segment with a 1-sec shift leads to
2 sec of overlap between adjacent segments), and empirically
find that a 2-sec segment length with a 1-sec shift achieves the best
detection accuracy. We therefore use this setup.</p>
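      <p>As a rough illustration, a minimal Python sketch of this segmentation is given below; the 2-sec length and 1-sec shift come from the text, while the mono waveform array and the 16 kHz sampling rate are assumptions made for the example.</p>
      <preformat>
# Minimal sketch of the 2 s / 1 s-shift segmentation described above.
# 'signal' is assumed to be a mono waveform as a NumPy array; the
# sampling rate is an illustrative assumption.
import numpy as np

def chunk_audio(signal, sample_rate, seg_len=2.0, shift=1.0):
    """Yield overlapping segments of seg_len seconds every shift seconds."""
    seg = int(seg_len * sample_rate)
    hop = int(shift * sample_rate)
    for start in range(0, max(len(signal) - seg + 1, 1), hop):
        yield signal[start:start + seg]

# Example: 10 s of dummy audio at 16 kHz gives 9 overlapping 2 s segments.
sr = 16000
audio = np.zeros(10 * sr)
segments = list(chunk_audio(audio, sr))
      </preformat>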
      <p>
        We use the Mel-frequency Cepstral Coefficients (MFCCs)
as our fundamental frame-level feature. The MFCCs are
computed over a sliding short-time window of 25ms with
a 10ms shift [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Each 25ms frame of an audio segment
is then represented as a 39-dimensional MFCC feature
vector (13-dimensional MFCC + delta + delta-delta). An
audio segment is then represented by a set of MFCC feature
vectors. Finally, we use two encoding strategies to
transform this set of MFCC frames into a single fixed-dimension
segment-level feature vector: Bag-of-Audio-Words (BoAW)
and Fisher Vector (FV) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
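      <p>For illustration, a minimal sketch of the 39-dimensional frame-level feature is shown below; librosa is an assumed toolkit (the paper does not name one), and only the 25 ms window, 10 ms shift, and 13 MFCCs plus delta and delta-delta come from the text.</p>
      <preformat>
# Sketch of the 39-dimensional MFCC (+delta +delta-delta) frame features,
# using librosa purely for illustration; the paper does not name a toolkit.
import numpy as np
import librosa

def mfcc_39(segment, sr):
    win = int(0.025 * sr)   # 25 ms analysis window
    hop = int(0.010 * sr)   # 10 ms shift
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13,
                                n_fft=win, win_length=win, hop_length=hop)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    # (num_frames, 39): one 39-dim vector per analysis frame
    return np.concatenate([mfcc, delta, delta2], axis=0).T
      </preformat>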
      <p>
        Bag-of-Audio-Words: We first use an acoustic
codebook to generate the segment-level feature vector. The
codebook model is a common technique used in document
classification (bag-of-words) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and image
classification (bag-of-visual-words) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] fields. We use the
bag-of-audio-words model to represent each audio segment by
assigning its low-level acoustic features (MFCCs) to a discrete
set of codewords in the vocabulary (codebook), thus
providing a histogram of codeword counts. The vocabulary of
BoAW is learned by applying the k-means clustering algorithm
with K=4096 on the whole training dataset.
      </p>
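      <p>A minimal sketch of the BoAW encoding is shown below; the K=4096 codebook learned by k-means follows the text, whereas the scikit-learn implementation and the histogram normalisation are illustrative assumptions.</p>
      <preformat>
# Sketch of the bag-of-audio-words encoding: learn a K=4096 codebook on
# training MFCC frames, then represent a segment as a codeword histogram.
# scikit-learn is an illustrative choice, not necessarily what was used.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

K = 4096

def learn_codebook(train_frames):
    """train_frames: (num_frames, 39) MFCCs pooled over the training set."""
    return MiniBatchKMeans(n_clusters=K, batch_size=10000,
                           random_state=0).fit(train_frames)

def boaw_encode(codebook, segment_frames):
    """Hard-assign each frame to its nearest codeword and count."""
    words = codebook.predict(segment_frames)
    hist = np.bincount(words, minlength=K).astype(np.float32)
    return hist / max(hist.sum(), 1.0)   # L1-normalised histogram (assumption)
      </preformat>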
      <p>
        Fisher Vector: The Fisher Vector (FV) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
representation can be seen as an extension of bag-of-words
representation. Both the FV and BoAW are based on an intermediate
representation, the audio vocabulary built in the low level
feature space. The Fisher encoding uses Gaussian Mixture
Models (GMM) to construct an audio word dictionary. We
compute the gradient of the log likelihood with respect to the
parameters of the model to represent an audio segment. The
Fisher Vector is the concatenation of these partial
derivatives and describes in which direction the parameters of the
model should be modified to best fit the data. A GMM
with 256 mixtures is used in our experiments to generate the
FV representation.
      </p>
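      <p>The following sketch illustrates Fisher vector encoding with a 256-component diagonal GMM; restricting the encoding to the mean and variance gradients and applying power and L2 normalisation are common choices and assumptions here, as the paper does not spell out the exact variant.</p>
      <preformat>
# Sketch of Fisher vector encoding with a 256-component diagonal GMM.
# Only the mean and variance gradients are used, which is a common
# simplification; the exact variant used in the paper is not specified.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(train_frames, n_components=256):
    return GaussianMixture(n_components=n_components,
                           covariance_type='diag', max_iter=100,
                           random_state=0).fit(train_frames)

def fisher_vector(gmm, frames):
    """frames: (T, D) MFCCs of one segment; returns a 2*K*D vector."""
    T = frames.shape[0]
    post = gmm.predict_proba(frames)                        # (T, K) posteriors
    sigma = np.sqrt(gmm.covariances_)                       # (K, D) std devs
    diff = (frames[:, None, :] - gmm.means_[None]) / sigma  # (T, K, D)
    g_mu = (post[:, :, None] * diff).sum(0)
    g_sig = (post[:, :, None] * (diff ** 2 - 1.0)).sum(0)
    g_mu /= T * np.sqrt(gmm.weights_)[:, None]
    g_sig /= T * np.sqrt(2.0 * gmm.weights_)[:, None]
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                  # power normalisation
    return fv / max(np.linalg.norm(fv), 1e-12)              # L2 normalisation
      </preformat>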
    </sec>
    <sec id="sec-4">
      <title>2.2 Visual Feature Representation</title>
      <p>
        We consider both frame-level and video-level
representations. Given a video, we uniformly extract its frames with
an interval of 0.5 seconds. Subsequently, we extract CNN
features from these frames. In particular, we employ two
existing CNN models, i.e., the 16-layer VGGNet [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and
GoogLeNet [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The feature vectors are taken from the last fully
connected layer of VGGNet and the pool5 layer of GoogLeNet,
respectively.
      </p>
      <p>A video's feature vector is obtained by mean pooling the
feature vectors of its frames.</p>
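      <p>As an illustration of the visual pipeline, the sketch below samples one frame every 0.5 seconds, extracts a CNN feature per frame, and mean-pools the frame features into a video-level vector; torchvision's VGG16 stands in for the 16-layer VGGNet, and the preprocessing, layer choice, and OpenCV frame decoding are assumptions.</p>
      <preformat>
# Sketch of the visual pipeline: sample a frame every 0.5 s, run a
# pretrained CNN on each frame, and mean-pool the frame features into a
# video-level vector. torchvision's VGG16 stands in for the 16-layer
# VGGNet; preprocessing and layer choice here are assumptions.
import cv2
import numpy as np
import torch
from torchvision import models, transforms

cnn = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
cnn.classifier = cnn.classifier[:-1]   # drop the final layer, keep 4096-d output
cnn.eval()

prep = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def video_feature(path, step_sec=0.5):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(fps * step_sec)), 1)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            with torch.no_grad():
                feats.append(cnn(prep(rgb).unsqueeze(0)).squeeze(0).numpy())
        idx += 1
    cap.release()
    return np.mean(feats, axis=0)   # mean pooling over frames
      </preformat>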
    </sec>
    <sec id="sec-5">
      <title>2.3 Classification Model</title>
      <p>
        For both the audio and visual systems, we train two-class
linear SVM classifiers as violent scene detectors. A frame is
considered a positive training example if its video is
labelled as positive with respect to the violent class. To learn
from many training examples, we employ the Negative
Bootstrap algorithm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The algorithm takes a fixed number N
of positive examples and iteratively selects the negative
examples that are misclassified the most by the current
classifiers. At each iteration, it randomly samples 10N
negative examples from the remaining negative examples
as candidates. An ensemble of the classifiers
trained in the previous iterations is used to classify each of
the negative candidate examples. The top N most
misclassified candidates are selected and used together with the N
positive examples to train a new classifier. The algorithm
takes several bags of positive examples and performs the
training independently on each of the positive bags,
resulting in multiple ensembles. These are compressed into a single
vector [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], making prediction very fast.
      </p>
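      <p>A minimal sketch of one bag of the Negative Bootstrap procedure is given below; LinearSVC is an illustrative base learner, and the number of iterations, the candidate scoring by the averaged ensemble, and the weight-averaging compression are assumptions guided by the description above.</p>
      <preformat>
# Sketch of one bag of the Negative Bootstrap procedure described above.
# LinearSVC is an illustrative base learner; selection and compression
# details follow the text, not an exact implementation of [2] and [3].
import numpy as np
from sklearn.svm import LinearSVC

def negative_bootstrap(pos, neg_pool, iters=10, seed=0):
    """pos: (N, D) positives; neg_pool: (M, D) remaining negatives."""
    rng = np.random.default_rng(seed)
    n = len(pos)
    ensemble = []
    for _ in range(iters):
        cand = neg_pool[rng.choice(len(neg_pool), 10 * n, replace=False)]
        if ensemble:
            # Average decision scores of the classifiers trained so far;
            # the highest-scoring negatives are the most misclassified.
            scores = np.mean([c.decision_function(cand) for c in ensemble],
                             axis=0)
            hard = cand[np.argsort(-scores)[:n]]
        else:
            hard = cand[:n]                   # first round: random negatives
        X = np.vstack([pos, hard])
        y = np.hstack([np.ones(n), np.zeros(n)])
        ensemble.append(LinearSVC(C=1.0).fit(X, y))
    # Compress the ensemble of linear models into a single weight vector,
    # in the spirit of [2], so prediction is a single dot product.
    w = np.mean([c.coef_[0] for c in ensemble], axis=0)
    b = np.mean([c.intercept_[0] for c in ensemble])
    return w, b
      </preformat>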
    </sec>
    <sec id="sec-6">
      <title>2.4 Prediction at Video Level</title>
      <p>Detectors trained on the frame-level
representations also make predictions at the frame level. In order to
aggregate the frame-level scores to the video level, we first
apply temporal smoothing to refine the per-frame scores. For
the visual-based system, we take the maximum response over
the frames as the video score, while for the audio-based
system, the video score is obtained by averaging over its
frames.</p>
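      <p>The aggregation step can be sketched as follows; the moving-average smoothing window is an assumption, while the max pooling for the visual system and the averaging for the audio system follow the text.</p>
      <preformat>
# Sketch of frame-to-video score aggregation: smooth the per-frame scores
# with a small moving-average window (window size is an assumption), then
# take the maximum for the visual system and the mean for the audio system.
import numpy as np

def smooth(scores, win=5):
    kernel = np.ones(win) / win
    return np.convolve(scores, kernel, mode='same')

def video_score(frame_scores, modality):
    s = smooth(np.asarray(frame_scores, dtype=float))
    return s.max() if modality == 'visual' else s.mean()
      </preformat>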
      <p>
        We fuse the audio and visual modalities via simple
linear fusion at the decision score level. We experiment with two
fusion strategies: 1) simply assigning equal fusion weights
to each modality and 2) learning the optimal fusion weights
via coordinate ascent [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
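      <p>A minimal sketch of the late fusion step is given below; the equal-weight baseline and the coordinate-ascent weight search follow the text, whereas the weight grid, the average-precision objective, and the stopping rule are assumptions.</p>
      <preformat>
# Sketch of late fusion: a weighted sum of the audio and visual scores,
# with weights either fixed to 0.5/0.5 or tuned by coordinate ascent on a
# held-out set. The grid, objective, and stopping rule are assumptions.
import numpy as np
from sklearn.metrics import average_precision_score

def fuse(score_list, weights):
    return sum(w * s for w, s in zip(weights, score_list))

def coordinate_ascent(score_list, labels, steps=20):
    weights = np.ones(len(score_list)) / len(score_list)   # equal weights
    grid = np.linspace(0.0, 1.0, 21)
    for _ in range(steps):
        for i in range(len(weights)):
            best_w, best_ap = weights[i], -1.0
            for w in grid:                   # optimise one weight at a time
                trial = weights.copy()
                trial[i] = w
                ap = average_precision_score(labels, fuse(score_list, trial))
                if ap > best_ap:
                    best_w, best_ap = w, ap
            weights[i] = best_w
    return weights / max(weights.sum(), 1e-12)   # renormalise to sum to one
      </preformat>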
    </sec>
    <sec id="sec-7">
      <title>3. EXPERIMENTS</title>
    </sec>
    <sec id="sec-8">
      <title>3.1 Dataset</title>
      <p>There are in total 6,144 labelled videos for development in
this year's task. We split the development set randomly into
two partitions: 1) dev-train, consisting of 4,300 videos
among which 190 are labelled as violent, and 2)
dev-val, consisting of 1,844 videos among which 82 are labelled
as violent. The detectors are trained on dev-train,
with hyper-parameters tuned on dev-val.</p>
    </sec>
    <sec id="sec-9">
      <title>3.2 Submitted Runs</title>
      <p>All the runs use the previously described subsystems or
the fused system. We use the feature name to indicate a specific
system. For instance, BoAW refers to the system using
the BoAW feature. Frame-level VGGNet-CNN means the
system is learned from frames represented by
VGGNet-CNN, while Video-level VGGNet-CNN means
learning directly from video vectors. We submitted 5 runs:</p>
      <sec id="sec-9-1">
        <title>Run1: Learned fusion of BoAW and FV.</title>
        <p>The performance of our VSD system with varied settings
is summarized in Table 1. We observe that fusion is always
helpful. For the audio-only runs, fusion of BoAW and FV
brings additional gain. Fusion of the audio and visual runs
results in the best performance. While Run2
(Frame-level VGGNet-CNN) outperforms Run3 (Video-level
VGGNet-CNN) on dev-val, the latter is better on the test
set, probably due to the divergence between the dev-val set
and the test set. Consequently, fusion with learned weights does not
yield improvement.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>4. CONCLUSIONS</title>
      <p>Our results show that both the audio and visual modalities
can perform violence detection well, that the two
modalities are complementary to each other, and that simple late
fusion of the two modalities leads to a performance gain.
The CNN features, although not engineered with
domain-specific information, generalize well for the VSD task. In
future work, we will explore more effective fusion
strategies to further improve detection performance.</p>
    </sec>
    <sec id="sec-11">
      <title>Acknowledgements</title>
      <p>This research was supported by the Fundamental Research
Funds for the Central Universities and the Research Funds of
Renmin University of China (No. 14XNLQ01), the National
Science Foundation of China (No. 61303184), the Beijing
Natural Science Foundation (No. 4142029), the Specialized
Research Fund for the Doctoral Program of Higher
Education (No. 20130004120006), and the Scientific Research
Foundation for the Returned Overseas Chinese Scholars,
State Education Ministry.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Semantic concept annotation for user generated videos using soundtracks</article-title>
          .
          <source>In ICMR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Snoek</surname>
          </string-name>
          .
          <article-title>Classifying tag relevance with relevant positive and negative examples</article-title>
          .
          <source>In ACM MM</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Snoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Koelma</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Smeulders</surname>
          </string-name>
          .
          <article-title>Bootstrapping visual categorization with relevant negatives</article-title>
          .
          <source>TMM</source>
          ,
          <volume>15</volume>
          (
          <issue>4</issue>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Snoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Smeulders</surname>
          </string-name>
          .
          <article-title>Fusing concept detection and geo context for visual search</article-title>
          .
          <source>In ICMR</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Philbin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Chum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Isard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sivic</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Object retrieval with large vocabularies and fast spatial matching</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sanchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mensink</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          .
          <article-title>Image classification with the Fisher vector: Theory and practice</article-title>
          .
          <source>IJCV</source>
          ,
          <volume>105</volume>
          (
          <issue>3</issue>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>CoRR, abs/1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          , Y. Baveye,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. L.</given-names>
            <surname>Quang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , E. Dellandrea,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>The MediaEval 2015 Affective Impact of Movies Task</article-title>
          .
          <source>In MediaEval 2015 Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          .
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>CoRR, abs/1409.4842</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Xue</surname>
          </string-name>
          and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Distributional features for text categorization</article-title>
          .
          <source>TKDE</source>
          ,
          <volume>21</volume>
          (
          <issue>3</issue>
          ),
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>