<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks</article-title>
      </title-group>
      <contrib-group />
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>The Violent Scenes Detection task aims at evaluating algorithms that automatically localize violent segments in both Hollywood movies and short web videos. The definition of violence is subjective: "the segments that one would not let an 8 years old child see in a movie because they contain physical violence". This is a highly challenging problem because of the strong content variations among the positive instances. In this year's evaluation, we adopted our recently proposed classification method, named regularized DNN, which fuses multiple features using Deep Neural Networks (DNN). We extracted a set of visual and audio features that have been observed to be useful. We then applied the regularized DNN for feature fusion and classification. Results indicate that using multiple features is still very helpful and, more importantly, that our proposed regularized DNN offers significantly better results than the popular SVM. We achieved a mean average precision of 0.63 for the main task and 0.60 for the generalization task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-2">
      <title>Features</title>
      <p>Three kinds of audio-visual features were extracted, all of
which were observed to be useful in the 2013 evaluation.</p>
      <p>
        We extracted trajectory-based motion features according
to our previous work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A main difference is that the
improved dense trajectories (IDT) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] were used as the
basis, replacing the original dense trajectories. Four
baseline features, histograms of oriented gradients (HOG),
histograms of optical flow (HOF), motion boundary histograms
(MBH) and trajectory shape (TrajShape) descriptors were
computed. These features were encoded using Fisher
vectors (FV) with a codebook of 256 codewords. We further
computed our proposed TrajMF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] based on HOG, HOF
and MBH, by considering the motion relationships of the
trajectories. As the dimension of the original TrajMF is very
high, we employed the expectation-maximization principal
component analysis (EM-PCA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for dimension reduction,
generating a 1500-dimensional representation for each
feature.</p>
      <p>[Figure 1: Configurations of the five submitted runs: feature extraction (FV of HOG, HOF, MBH and TrajShape; TrajMF of HOG, HOF and MBH), SVM and DNN classification, feature fusion, score smoothing, and clip merging.]</p>
      <p>In total, there are seven trajectory-based features,
including four baseline FV and three dimension-reduced
TrajMF features. See [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] for more details.
        </p>
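      <p>To make the dimension-reduction step concrete, below is a minimal NumPy sketch of the EM-PCA iterations of [<xref ref-type="bibr" rid="ref3">3</xref>]; the matrix layout, the iteration budget, and the variable names are illustrative assumptions, not the settings used in our experiments. Each iteration alternates a projection of the data onto the current subspace (E-step) with an update of the subspace (M-step), avoiding an eigendecomposition of the very large covariance matrix.</p>
      <preformat>
import numpy as np

def em_pca(X, k, n_iter=50, seed=0):
    """EM algorithm for PCA (Roweis, NIPS 1998).

    X: (d, n) data matrix whose columns are zero-mean descriptors.
    k: target dimensionality (1500 for TrajMF in this paper).
    Returns the subspace W (d, k) and the k-dim codes Z (k, n).
    """
    d, n = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, k))
    for _ in range(n_iter):
        # E-step: least-squares codes under the current subspace.
        Z = np.linalg.solve(W.T @ W, W.T @ X)
        # M-step: best subspace for the current codes.
        W = X @ Z.T @ np.linalg.inv(Z @ Z.T)
    Z = np.linalg.solve(W.T @ W, W.T @ X)
    return W, Z

# Hypothetical usage on a centered TrajMF matrix (one column per video):
# W, Z = em_pca(trajmf - trajmf.mean(axis=1, keepdims=True), k=1500)
      </preformat>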
        <p>
          The other two kinds of features include Space-Time
Interest Points (STIP) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and Mel-Frequency Cepstral
Coefficients (MFCC). The STIP describes the texture and motion
features around local interest points, which were encoded
using the bag-of-words framework with 4000 codewords. Here
we randomly sampled 300k features and used k-means to
generate the codebook. The MFCC is a very popular
audio feature. It was extracted from every 32 ms time window
with 50% overlap. Bag-of-words encoding was also adopted to
quantize the MFCC descriptors, using 4000 codewords.
        </p>
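      <p>The bag-of-words quantization can be sketched as follows; librosa is an assumed library choice (the toolkit we used is not specified above), and the 13 coefficients and the sampling rate are illustrative defaults.</p>
      <preformat>
import numpy as np
import librosa                          # assumed audio library
from sklearn.cluster import KMeans

def mfcc_bow(wav_path, codebook, sr=22050):
    """L1-normalized bag-of-words histogram of MFCC descriptors.

    Uses 32 ms windows with 50% overlap, as described above.
    """
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.032 * sr)             # 32 ms analysis window
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=n_fft // 2)
    words = codebook.predict(mfcc.T)    # nearest codeword per frame
    hist = np.bincount(words, minlength=codebook.n_clusters)
    return hist / max(hist.sum(), 1)

# Codebook training on a random sample of descriptors (the paper
# samples 300k descriptors and uses 4000 codewords):
# sample = ...                          # (300000, 13) MFCC frames
# codebook = KMeans(n_clusters=4000, n_init=1).fit(sample)
      </preformat>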
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Classifiers</title>
      <p>We adopted both SVM and deep neural networks (DNN)
for classification.</p>
      <p>SVM: The χ2 kernel was adopted for the bag-of-words
features (STIP and MFCC), and a linear kernel was used for the
others. For feature fusion, kernel-level average fusion was
used for the trajectory-based features, while score-level
average late fusion was adopted to combine trajectory features
with STIP and MFCC.</p>
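      <p>The kernel choices and the two fusion strategies can be sketched with scikit-learn as below; this is an illustrative version rather than our exact implementation, and the gamma value is an assumption.</p>
      <preformat>
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_chi2_svm(X_train, y_train, gamma=1.0):
    """Chi-square kernel SVM for a bag-of-words feature (STIP or MFCC)."""
    K = chi2_kernel(X_train, gamma=gamma)    # (n, n) training Gram matrix
    return SVC(kernel="precomputed", probability=True).fit(K, y_train)

def chi2_scores(clf, X_test, X_train, gamma=1.0):
    """Violence scores from a precomputed-kernel SVC."""
    return clf.predict_proba(chi2_kernel(X_test, X_train, gamma=gamma))[:, 1]

def average_kernel(feature_mats):
    """Kernel-level average fusion: mean of the per-feature linear
    Gram matrices, used for the trajectory-based features."""
    return sum(X @ X.T for X in feature_mats) / len(feature_mats)

def late_fusion(score_lists):
    """Score-level average fusion across classifiers."""
    return np.mean(np.stack(score_lists, axis=0), axis=0)
      </preformat>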
      <p>[Figure 2: Structure of the regularized DNN. Input features x_{n,1}, x_{n,2}, x_{n,3} feed separate abstraction layers, then a fusion layer, then the final classification layer (layers l = E, ..., F, ..., L, with weights W^{L-1}).]</p>
      <p>
        DNN: We also adopted a new DNN-based classifier
proposed in our recent work [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. The fusion methods used for the SVM classifiers
neglect the hidden patterns shared among the different
features. To capture the relationships among distinct features,
we constructed a regularized DNN for video classification.
Specifically, as shown in Figure 2, a layer of neurons was
first used to perform feature abstraction separately for each
input feature. After that, another layer performed feature
fusion with a carefully designed structural-norm regularization
on the network weights, which can identify feature
relationships. Finally, the fused representation was used to build a
classification model in the last layer. With this special
network, we are able to fuse features by considering both feature
correlation and feature diversity, while performing
classification simultaneously. See [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] for more details.
      </p>
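      <p>A minimal sketch of such a network, assuming PyTorch, is given below; the layer sizes are arbitrary, and the block-wise l2,1 penalty is only one plausible reading of the structural-norm regularization, so [<xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>] should be consulted for the exact formulation.</p>
      <preformat>
import torch
import torch.nn as nn

class RegularizedFusionDNN(nn.Module):
    """Per-feature abstraction, regularized fusion, then classification."""

    def __init__(self, feat_dims, hidden=256, fused=256, n_classes=2):
        super().__init__()
        # One abstraction layer per input feature.
        self.abstract = nn.ModuleList([nn.Linear(d, hidden) for d in feat_dims])
        # Fusion layer over the concatenated abstractions.
        self.fuse = nn.Linear(hidden * len(feat_dims), fused)
        # Final classification layer.
        self.classify = nn.Linear(fused, n_classes)

    def forward(self, feats):            # feats: list of (B, d_i) tensors
        hs = [torch.relu(f(x)) for f, x in zip(self.abstract, feats)]
        fused = torch.relu(self.fuse(torch.cat(hs, dim=1)))
        return self.classify(fused)

    def structural_penalty(self):
        # Block-wise l2,1 norm on the fusion weights: each block of
        # columns corresponds to one input feature, so whole features
        # can be kept or suppressed together (an assumed stand-in for
        # the regularizer actually used in [6, 7]).
        blocks = self.fuse.weight.chunk(len(self.abstract), dim=1)
        return sum(torch.linalg.norm(b) for b in blocks)

# Hypothetical training step: add the penalty to the loss.
# logits = model([x_stip, x_mfcc, x_traj])
# loss = nn.functional.cross_entropy(logits, y) + lam * model.structural_penalty()
      </preformat>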
    </sec>
    <sec id="sec-4">
      <title>Score Smoothing and Clip Merging</title>
      <p>Temporal score smoothing has proved effective, as incorrect
predictions on a short clip may be corrected by considering the
predictions on nearby clips. All the videos were first
partitioned uniformly into 3-second clips. The smoothed
prediction score of a clip is simply the average of the scores
in a three-clip window.</p>
      <p>As we need to output segment-level predictions (not
fixed-length clip-level ones), consecutive clips were merged when
they were all determined to contain violence or all determined
to contain no violence, i.e., when their violence scores were
all above or all below a threshold. The score of the merged
segment was set to the average of its clips' scores.</p>
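      <p>Both post-processing steps are simple to state in NumPy; the sketch below is illustrative, and the 0.5 threshold is an assumption, as the value used is not specified above.</p>
      <preformat>
import numpy as np

def smooth_scores(scores):
    """Average each 3-second clip's score over a three-clip window."""
    s = np.asarray(scores, dtype=float)
    p = np.pad(s, 1, mode="edge")
    return (p[:-2] + p[1:-1] + p[2:]) / 3.0

def merge_clips(scores, clip_len=3.0, thresh=0.5):
    """Merge consecutive clips on the same side of the threshold.

    Returns (start_sec, end_sec, mean_score) segments whose score is
    the average of the merged clips' scores.
    """
    s = np.asarray(scores, dtype=float)
    flags = s >= thresh
    segments, start = [], 0
    for i in range(1, len(s) + 1):
        if i == len(s) or flags[i] != flags[start]:
            segments.append((start * clip_len, i * clip_len,
                             float(s[start:i].mean())))
            start = i
    return segments

# Hypothetical usage: segments = merge_clips(smooth_scores(clip_scores))
      </preformat>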
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND DISCUSSIONS</title>
      <p>We submitted 5 runs for official evaluation. As shown in
Figure 1, Run 1 and Run 2 used SVM and DNN respectively.
Run 2 did not use FV encoding of the HOG, HOF and MBH
features, as the dimensionality of these three features is too
high, which would jeopardize the performance of the DNN when
there is insufficient training data. Run 3 is the score fusion
of Run 1 and Run 2. Run 4 is the score-smoothed version
of Run 3 (smoothing was performed before merging), while
Run 5 is the direct fusion of SVM and DNN without any
smoothing or merging.</p>
      <p>The official results are summarized in Figure 3. We see
that, although some features were not used in the DNN, the
performance of the DNN (Run 2) is still significantly better
than that of the SVM. This clearly confirms the effectiveness
of deep networks. Directly fusing DNN and SVM incurs a small
performance drop (Run 3), which may be due to the sub-optimal
parameters used in the fusion process. Another fusion setting
(Run 5), without score merging, improves the main task
performance but still hurts the result of the generalization
task, indicating that the DNN has better generalization
capability than the SVM, so fusing the SVM with the DNN
degrades the performance of the generalization task.
Finally, the results of Run 4 indicate that both smoothing
and merging are useful for the main task. It is not
surprising that smoothing does not work for the generalization task,
because, compared with the long movies used in the main
task, the test clips are short and temporally more
consistent.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. L.</given-names>
            <surname>Quang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          .
          <article-title>The MediaEval 2014 Affect Task: Violent Scenes Detection</article-title>
          .
          <source>In MediaEval 2014 Workshop</source>
          , Barcelona, Spain, Oct 16-17,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xue</surname>
          </string-name>
          , W. Liu, and
          <string-name>
            <given-names>C.-W.</given-names>
            <surname>Ngo</surname>
          </string-name>
          .
          <article-title>Trajectory-based modeling of human actions with motion reference points</article-title>
          .
          <source>In ECCV</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Roweis</surname>
          </string-name>
          .
          <article-title>EM Algorithms for PCA and SPCA</article-title>
          .
          <source>In NIPS</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Action Recognition With Improved Trajectories</article-title>
          .
          <source>In ICCV</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Laptev</surname>
          </string-name>
          .
          <article-title>On space-time interest points</article-title>
          .
          <source>IJCV</source>
          ,
          <volume>64</volume>
          :
          <fpage>107</fpage>
          -
          <lpage>123</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xue</surname>
          </string-name>
          .
          <article-title>Exploring Inter-feature and Inter-class Relationships with Deep Neural Networks for Video Classification</article-title>
          .
          <source>In ACM MM</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Xue</surname>
          </string-name>
          .
          <article-title>Challenge Huawei Challenge: Fusing Multimodal Features with Deep Neural Networks for Mobile Video Annotation</article-title>
          .
          <source>In ICME</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>