<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TUDCL at MediaEval 2013 Violent Scenes Detection: Training with Multi-modal Features by MKL</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shinichi Goto</string-name>
          <email>s-goto@riec.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Terumasa Aoki</string-name>
          <email>aoki@riec.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate School of Information</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>New Industry Creation Hatchery Center, Tohoku University</institution>
          ,
          <addr-line>Miyagi</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>The purpose of this paper is to describe the work carried out for the Violent Scenes Detection task at MediaEval 2013 by team TUDCL. Our work is based on the combination of visual, temporal and audio features with machine learning at segment level. Block-saliency-map based dense trajectory is proposed for visual and temporal features, and MFCC and delta-MFCC are used for audio features. For the classification, Multiple Kernel Learning is applied, which is effective when multi-modal features exist.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2013 Affect Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is intended to detect
violent scenes in movies. Although two different definitions
of violent events are provided this year, our algorithm is
developed only to solve the task for the objective definition,
which is "physical violence or accident resulting in human
injury or pain."
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH</title>
      <p>Rather than focusing on video shots from the beginning,
our approach first handles fixed-length segments, each of
which has 20 frames (0.8 seconds if the FPS is 25). After
segment-based scores are calculated from the extracted feature vectors by
machine learning, shot-based scores are generated.</p>
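      <p>The fixed-length segmentation above can be sketched as follows (a minimal illustration; the helper name and the handling of a short final segment are our assumptions, not the authors' implementation):</p>

```python
SEGMENT_LEN = 20  # frames per segment; 0.8 s at 25 fps

def segment_bounds(n_frames, seg_len=SEGMENT_LEN):
    """Split a video's frame range into fixed-length segments.

    Returns (start, end) frame-index pairs; the last segment may be short.
    """
    return [(s, min(s + seg_len, n_frames))
            for s in range(0, n_frames, seg_len)]
```

      <p>For example, a 100-frame clip yields five 20-frame segments, each of which is then scored independently before shot-based scores are derived.</p>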
      <p>For our runs only violent and non-violent ground truth
are used, and neither a high-level concept nor external data
is used.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Visual and Temporal Features</title>
      <p>
        Both visual and temporal features based on dense
trajectory [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] are calculated at every frame. Although the original
dense trajectory algorithm samples points densely
except in homogeneous image areas, we
additionally apply the saliency maps proposed by Itti et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to increase
precision, on the assumption that events concerned with violence are
located in the areas people tend to pay attention to.
      </p>
      <p>In our algorithm, first a normal saliency map is
generated, and then it is transformed into a block-based map by
averaging the saliency values within a fixed block area so
that dense sampling can be applied, changing the sampling
step size and maximum spatial scale level according to the
saliency level. For instance, the most salient area in an
image is densely sampled with the smallest step size, which
guarantees that the more salient a block is, the more points are
obtained there. Figure 1 shows an example of our dense
sampling and of normal dense sampling. Our
algorithm samples more points in salient regions and fewer
points in non-salient regions, whereas normal dense sampling
takes points more uniformly over the whole
frame. Note that points in the homogeneous areas have already
been deleted.</p>
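      <p>A minimal sketch of the block-based saliency pooling described above, assuming a precomputed pixel-level saliency map with values in [0, 1]; the block size and the saliency-to-step mapping are illustrative choices of ours, not the paper's parameters:</p>

```python
import numpy as np

def block_saliency_map(sal, block=16):
    """Average pixel saliency over non-overlapping block x block areas."""
    h, w = sal.shape
    hb, wb = h // block, w // block
    return (sal[:hb * block, :wb * block]
            .reshape(hb, block, wb, block)
            .mean(axis=(1, 3)))

def sampling_step(block_sal, min_step=5, max_step=20):
    """More salient blocks get a smaller dense-sampling step size."""
    return int(round(max_step - block_sal * (max_step - min_step)))
```

      <p>With this mapping, a fully salient block is sampled at the smallest step size and a non-salient block at the largest, which yields more trajectory points in salient regions.</p>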
      <p>
        Trajectories, MBH, and additionally RGB histogram
around trajectories are extracted for visual and temporal
information, though in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] HOG and HOF are also proposed. This
is because those features contributed little
in our test runs.
      </p>
      <p>All features are converted to Bag-of-Words form in each
segment to obtain a 200-d trajectory histogram, 200-d MBH-x, 200-d
MBH-y, and a 400-d RGB histogram. In total, a 1000-d feature vector
is used as the visual and temporal feature for classification.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Audio Features</title>
      <p>MFCC, delta-MFCC and audio energy are calculated
every 20 ms with a 10 ms overlap to create a 200-d
Bag-of-Audio-Words histogram in each segment, which spans 0.8 seconds.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Classifier Learning</title>
      <p>
        Although the conventional way of tackling this classification
problem is to use a Support Vector Machine (SVM), we
apply Multiple Kernel Learning (MKL), which aims at finding
optimized weights when multiple SVM kernels are applied
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This suits our case well, since multiple feature spaces
exist. The whole kernel is composed of multiple base kernels, and
is computed according to the following equation:

K(x_i, x_j) = Σ_k d_k K_k(x_i, x_j)    (1)

where the K_k are base kernels, and d_k is the weight for each
kernel. In our case, kernels for trajectory, x-direction MBH,
y-direction MBH, RGB histogram and audio features are
prepared. For the kernel function, the Histogram Intersection Kernel
(HIK) is used, since all of our features are histogram-based.
      </p>
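      <p>Equation (1) with HIK base kernels can be sketched as follows (an illustration under our naming; in the actual system the weights d_k are learned by MKL, whereas here they are given as inputs):</p>

```python
import numpy as np

def hik(X, Y):
    """Histogram Intersection Kernel Gram matrix between rows of X and Y.

    K[i, j] = sum_b min(X[i, b], Y[j, b])
    """
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

def combined_kernel(feature_sets, weights):
    """K = sum_k d_k * K_k, one base kernel per feature modality (Eq. 1)."""
    return sum(d * hik(X, X) for d, X in zip(weights, feature_sets))
```

      <p>Each modality (trajectory, MBH-x, MBH-y, RGB histogram, audio) contributes one base kernel; the weighted sum is then used by a single SVM.</p>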
      <p>Although MKL can find optimal weights, we found that these
values differ depending on the movie. Table 1 shows
the difference between the weights learned from three different
movies. Therefore, classifiers for the training movies are first
learned separately to give a binary classification for each
segment, and they are finally integrated in the following way.</p>
    </sec>
    <sec id="sec-6">
      <title>2.4 Integration</title>
      <p>The first step here is to calculate a pre-final violence score
for each segment. To do so, for each segment in the test movies,
we simply count the number of classifiers that classify
that segment as violent. Thus for each test movie, the
score s_i for the i-th segment is:

s_i = (1/M) Σ_{m=0}^{M-1} c_i(m),   c_i(m) ∈ {0, 1}    (2)

where c_i(m) is the result of the binary classification by the m-th
classifier, with 0 for non-violence and 1 for violence. Note that M is
the total number of classifiers, which is equal to the number
of training movies.</p>
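      <p>Equation (2) is a simple vote-averaging step; a one-line sketch (the helper name is ours):</p>

```python
def segment_score(votes):
    """Pre-final score: fraction of per-movie classifiers voting 'violent'.

    votes -- list of 0/1 outputs c_i(m), one per training-movie classifier.
    """
    return sum(votes) / len(votes)
```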
      <p>Finally, a moving average is calculated as a smoothing method
for each test movie in order to decide the final scores s'_i for all
segments, as follows:

s'_i = ( s_i + Σ_{n=1}^{N} α^n (s_{i-n} + s_{i+n}) ) / (2N + 1),   (0 &lt; α &lt; 1)    (3)

where α is a smoothing coefficient and N is the neighbor range
around a segment. We used 0.5 for α and 2 for N.</p>
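      <p>The moving average of Eq. (3) can be sketched as follows (our implementation choice: neighbors that fall outside the movie are treated as 0):</p>

```python
def smooth_scores(scores, alpha=0.5, n_range=2):
    """Weighted moving average of segment scores (Eq. 3)."""
    out = []
    for i, s in enumerate(scores):
        acc = s
        for k in range(1, n_range + 1):
            left = scores[i - k] if i - k >= 0 else 0.0
            right = scores[i + k] if i + k < len(scores) else 0.0
            acc += alpha ** k * (left + right)
        out.append(acc / (2 * n_range + 1))
    return out
```

      <p>With α = 0.5 and N = 2, an isolated violent segment spreads some of its score to its four nearest neighbors, which reflects the continuity of segments discussed below.</p>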
      <p>The reason this integration process is needed is to
take the continuity of segments into account. Besides, since
each of our classifiers learns from a single training movie, the
violence concepts that a training movie does not contain can
easily be missed. Scores for shots are calculated by
converting segment-based scores after calculating a score per frame.
If this score is higher than a threshold, that segment or shot
is classified as violent. We chose 0.1 for the segment
threshold, and 0.03 and 0.06 for the shot thresholds.</p>
    </sec>
    <sec id="sec-7">
      <title>3. RESULTS AND DISCUSSION</title>
      <p>[Table: run scores 0.470, 0.470, 0.343, 0.214, 0.0473.]
The two shot-based runs differ only in the choice
of the scoring threshold (0.03 for the former, 0.06 for the
latter), and this therefore does not affect MAP@100. In
addition to our main runs, results obtained by a normal SVM with an RBF
kernel are displayed for comparison, although there is no
MAP@100 score for the SVM, since only binary classification results are
produced and no score is calculated.</p>
      <p>Our results show that the Multiple Kernel
Learning approach with the HIK kernel is effective for violent scenes detection,
though its F-score is still not high enough. We investigated
this and came to the presumption that segments with
frequent camera motion, multiple people and loud sound
tend to be misclassified as violent.</p>
      <p>On the other hand, commonly missed violent segments are
violent scenes without sound, such as a scene in which a man
is wringing another man's neck. It is reasonable to
suppose that segments in which multi-modality cannot be
exploited are likely to be missed.</p>
      <p>Although MBH, which was proposed as being robust to camera
motion, is extracted, the trajectories themselves are easily
affected by camera motion, making them unreliable.
Therefore some countermeasure to this problem is imperative.</p>
      <p>It should also be added that, since the classifiers have learned each
training movie separately, the feature vectors might not be sufficient
compared to the case in which classifiers learn all movies
simultaneously. Since not enough comparison with other
methods has been done, we will continue our investigation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Penet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.L.</given-names>
            <surname>Quang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <article-title>The MediaEval 2013 Affect Task: Violent Scenes Detection</article-title>
          . In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Heng</given-names>
            <surname>Wang</surname>
          </string-name>
          , Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu.
          <article-title>Action Recognition by Dense Trajectories</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>3169</fpage>
          -
          <lpage>3176</lpage>
          ,
          Colorado Springs, United States,
          <year>June 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Laurent</given-names>
            <surname>Itti</surname>
          </string-name>
          , Christof Koch, and
          <string-name>
            <given-names>Ernst</given-names>
            <surname>Niebur</surname>
          </string-name>
          .
          <article-title>A Model of Saliency-based Visual Attention for Rapid Scene Analysis</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11)</source>
          , pages
          <fpage>1254</fpage>
          -
          <lpage>1259</lpage>
          , IEEE Computer Society, Washington, DC, USA,
          <year>November 1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.R.G.</given-names>
            <surname>Lanckriet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cristianini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bartlett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.E.</given-names>
            <surname>Ghaoui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Learning the Kernel Matrix with Semidefinite Programming</article-title>
          .
          <source>Journal of Machine Learning Research, 5</source>
          , pages
          <fpage>27</fpage>
          -
          <lpage>72</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>