<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NII-UIT at MediaEval 2015 Affective Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vu Lam</string-name>
          <email>lqvu@fit.hcmus.edu.vn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sang Phan</string-name>
          <email>plsang@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duy-Dinh Le</string-name>
          <email>ledduy@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shin'ichi Satoh</string-name>
          <email>satoh@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Duc Anh Duong</string-name>
          <email>ducda@uit.edu.vn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Informatics</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Information Technology</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The Affective Impact of Movies task aims to detect violent videos and the affective impact of those videos on viewers [9]. This is a challenging task not only because of the diversity of video content but also due to the subjectiveness of human emotion. In this paper, we present a unified framework that can be applied to both subtasks: (i) induced affect detection, and (ii) violence detection. This framework is based on our previous year's Violent Scene Detection (VSD) framework. We extended it to support affect detection by training different valence/arousal classes independently and combining them to make the final decision. Besides using internal features from three different modalities (audio, image, and motion), this year we also incorporate deep learning features into our framework. Experimental results show that our unified framework can detect violent videos and their affective impact with reasonable accuracy. Moreover, using deep features can significantly improve the detection performance of both subtasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Detecting the affective impact of movies requires combining
multimedia features. For example, a violent car-chase video
can be detected by searching for evidence such as
fast-moving cars or the sound of gunshots.
To this end, we have developed a framework that supports
combining features from multiple modalities for violent scene
detection. We consider induced affect detection as a
multi-class classification task, so our framework can
be applied to predict the valence and arousal class of a video
as well. In general, our framework consists of three main
components: feature extraction, feature encoding, and feature
classification. An overview of our framework is shown in
Fig. 1.</p>
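      <p>As a rough illustration of this structure (hypothetical function names; a sketch rather than the actual implementation), a single-modality pipeline can be written as follows.</p>
      <preformat><![CDATA[
# Hypothetical sketch of the three-component pipeline:
# feature extraction -> feature encoding -> feature classification.
def score_video(video, extract_descriptors, encode_fisher_vector, svm_model):
    descriptors = extract_descriptors(video)           # e.g. SIFT, MFCC, or trajectory descriptors
    video_vector = encode_fisher_vector(descriptors)   # fixed-length vector per video
    return svm_model.decision_function([video_vector])[0]  # detector score for this video
]]></preformat>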
    </sec>
    <sec id="sec-2">
      <title>2. FEATURE EXTRACTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Image Features</title>
      <p>
        First, we scale the original video to 320×240 pixels
and sample frames from the video every second. We
use the standard SIFT feature with the Hessian-Laplace interest
point detector to extract features from each frame [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Each
frame is represented using the Fisher Vector encoding [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
We use the average pooling strategy to aggregate the
frame-based features into the final video representation, which has
40,960 dimensions.
      </p>
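      <p>A minimal sketch of the average pooling step follows; the signed square-root and L2 normalization shown is common Fisher-vector post-processing and is an assumption, not stated above.</p>
      <preformat><![CDATA[
import numpy as np

def video_representation(frame_fisher_vectors):
    """Average-pool per-frame SIFT Fisher vectors (one sampled frame per second)
    into a single 40,960-dimensional video-level vector."""
    fv = np.asarray(frame_fisher_vectors, dtype=np.float32)  # shape: (n_frames, 40960)
    video_vec = fv.mean(axis=0)                               # average pooling over frames
    # Assumed post-processing: signed square-root and L2 normalization.
    video_vec = np.sign(video_vec) * np.sqrt(np.abs(video_vec))
    video_vec /= np.linalg.norm(video_vec) + 1e-12
    return video_vec
]]></preformat>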
    </sec>
    <sec id="sec-4">
      <title>2.2 Motion Feature</title>
      <p>
        We use Improved Trajectories [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to extract dense
trajectories. A combination of Histogram of Oriented
Gradients (HOG), Histogram of Optical Flow (HOF), and
Motion Boundary Histogram (MBH) is used to describe each
trajectory. We encode the HOGHOF and MBH features
separately using the Fisher Vector encoding. The codebook size
is 256, trained using a Gaussian Mixture Model (GMM).
The feature representation of each descriptor after applying
PCA has 65,536 dimensions.
      </p>
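      <p>The 65,536-dimensional figure is consistent with the standard Fisher vector length of 2DK for mean and variance deviations: with K = 256 Gaussians it implies descriptors reduced to D = 128 dimensions by PCA, as the small check below shows.</p>
      <preformat><![CDATA[
def fisher_vector_dim(descriptor_dim, num_gaussians):
    """Fisher vector length when encoding mean and variance deviations."""
    return 2 * descriptor_dim * num_gaussians

# 2 * 128 * 256 = 65,536: a 256-component GMM over 128-dimensional
# (PCA-reduced) HOGHOF or MBH descriptors yields the dimensionality above.
assert fisher_vector_dim(128, 256) == 65536
]]></preformat>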
    </sec>
    <sec id="sec-5">
      <title>2.3 Audio Feature</title>
      <p>We use the popular Mel-frequency Cepstral Coefficients
(MFCC) for extracting audio features. We choose a length
of 25ms for audio segments and a step size of 10ms. The
13-dimensional MFCC vectors along with their first and second
derivatives are used to represent each audio segment.
Raw MFCC features are also encoded using the Fisher vector
encoding. We use a GMM with 256 components to train the codebook.</p>
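      <p>A minimal sketch of this extraction step, assuming librosa as the audio toolkit (the text does not name one) and a 16 kHz sampling rate:</p>
      <preformat><![CDATA[
import librosa
import numpy as np

def mfcc_descriptors(wav_path, sr=16000):
    """39-D MFCC descriptors: 13 coefficients plus first and second derivatives,
    over 25 ms windows with a 10 ms step."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(0.025 * sr)   # 25 ms analysis window
    hop = int(0.010 * sr)     # 10 ms step size
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)             # first derivative
    d2 = librosa.feature.delta(mfcc, order=2)    # second derivative
    return np.vstack([mfcc, d1, d2]).T           # shape: (n_segments, 39)
]]></preformat>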
    </sec>
    <sec id="sec-6">
      <title>2.4 Deep Learning Feature</title>
      <p>
        We use the popular Caffe [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] framework to extract
image features. We use the pre-trained deep model
provided by Simonyan and Zisserman [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This model was
trained on the 1,000 ImageNet concepts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As suggested in
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we select the neuron activations of the last three
layers as the feature representation. The third- and
second-to-last layers have 4,096 dimensions each, while the last layer has
1,000 dimensions corresponding to the 1,000 concept
categories in the ImageNet dataset. We denote these features as
VDFC6, VDFC7, and VDFULL in our experiments.
      </p>
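      <p>A sketch of the activation extraction, using torchvision's VGG-16 as a stand-in for the pre-trained Caffe model (an assumption for illustration; only the choice of layers matches the description above):</p>
      <preformat><![CDATA[
import torch
from PIL import Image
from torchvision import models, transforms

# Stand-in for the pre-trained very deep model of Simonyan and Zisserman.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def deep_features(image_path):
    """Return activations analogous to VDFC6 (4,096-D), VDFC7 (4,096-D), VDFULL (1,000-D)."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        flat = torch.flatten(vgg.avgpool(vgg.features(x)), 1)
        fc6 = vgg.classifier[0](flat)               # third-to-last layer
        fc7 = vgg.classifier[3](torch.relu(fc6))    # second-to-last layer
        full = vgg.classifier[6](torch.relu(fc7))   # last layer (1,000 concept scores)
    return fc6.squeeze(0), fc7.squeeze(0), full.squeeze(0)
]]></preformat>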
    </sec>
    <sec id="sec-7">
      <title>2.5 Features from Past VSD Tasks</title>
      <p>
        For the violence detection subtask, we also consider using
features from past VSD tasks as external features. In
particular, we use the features that were extracted in the VSD
2014 task for training the violence detector. These features
include SIFT, Dense Trajectories (HOGHOF and MBH
descriptors), and audio MFCC, which achieved the runner-up
performance in VSD 2014 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We denote these features as
HOGHOF, MBH, SIFT, and MFCC in our experiments.
      </p>
    </sec>
    <sec id="sec-8">
      <title>3. CLASSIFICATION</title>
      <p>
        LibSVM [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is used for training and testing our affective
impact detectors. For features that are encoded using the
Fisher vector, we use a linear kernel for training and testing.
For the deep learning features, a χ² kernel is used.
      </p>
      <p>We divide the training videos into two subsets. The first
3,072 videos are used for training the model, while the
remaining 3,072 videos are used for validation. To learn the
decision threshold of each detector, we sample this threshold
in the range from 0 to 1 with a step size of 0.01 and select
the value that maximizes the F1 score.</p>
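      <p>A sketch of this threshold search, assuming probability-like detector scores in [0, 1]:</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.metrics import f1_score

def learn_threshold(val_scores, val_labels):
    """Sweep the decision threshold from 0 to 1 in steps of 0.01 and keep the
    value that maximizes F1 on the validation split."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.arange(0.0, 1.01, 0.01):
        preds = (np.asarray(val_scores) >= t).astype(int)
        f1 = f1_score(val_labels, preds, zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
]]></preformat>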
      <p>In order to generate the decision for valence or arousal
detection, we need to make the decision from the predictions
of all valence or arousal classes. To this end, we propose
using two strategies: (1) MAX: select the class that has the
highest prediction; (2) MAXREL: select the class that has
the highest relative improvement over the learned threshold.</p>
      <p>[Table 1 and Table 2: submitted runs for each subtask, with validation results (mAP) and official results (accuracy).]</p>
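      <p>A sketch of the two strategies; the exact form of the relative improvement is not spelled out above, and (score - threshold) / threshold is one plausible reading.</p>
      <preformat><![CDATA[
def decide_max(class_scores):
    """MAX: select the class with the highest prediction score.
    class_scores maps each valence/arousal class to its detector score."""
    return max(class_scores, key=class_scores.get)

def decide_maxrel(class_scores, class_thresholds):
    """MAXREL: select the class with the highest improvement relative to its
    learned threshold (assumed here to be (score - threshold) / threshold)."""
    rel = {c: (class_scores[c] - class_thresholds[c]) / max(class_thresholds[c], 1e-12)
           for c in class_scores}
    return max(rel, key=rel.get)
]]></preformat>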
    </sec>
    <sec id="sec-9">
      <title>4. SUBMITTED RUNS</title>
      <p>First, we use late fusion with an average weighting
scheme to combine features from different modalities. After
that, we select the runs that have the top performance on
the validation set to submit. The list of submitted runs for
each subtask and their validation results can be seen in Table
1 and Table 2.</p>
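      <p>A sketch of the late fusion step, assuming each run supplies one score array per modality over the same test videos:</p>
      <preformat><![CDATA[
import numpy as np

def late_fusion_average(per_modality_scores):
    """Late fusion with average weighting: equally weighted mean of the
    scores produced by each modality's classifier for the same videos."""
    return np.mean(np.asarray(per_modality_scores, dtype=float), axis=0)

# Hypothetical usage with score arrays from individual feature detectors:
# fused = late_fusion_average([sift_scores, mfcc_scores, mbh_scores, vdfc7_scores])
]]></preformat>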
    </sec>
    <sec id="sec-10">
      <title>5. RESULTS AND DISCUSSIONS</title>
      <p>The official results for each subtask are shown in the last
column of Table 1 and Table 2. For the violence detection
subtask, we observe that the results of combining multiple
features are more stable. For example, on the validation set,
the run that combines all available features has the lowest
performance; however, on the test set, this run achieves the
best performance. This may be due to the fact that we only
select one split for validation. For both subtasks, combining
with deep learning features can significantly improve the
detection performance. For the induced affect detection subtask,
we found that the strategy using the maximum detection score
tends to have more stable performance. The best valence
detection performance is obtained by combining all internal
features with all deep learning features using the maximum relative
improvement strategy.</p>
    </sec>
    <sec id="sec-11">
      <title>6. ACKNOWLEDGEMENTS</title>
      <p>This research is partially funded by Vietnam National
University Ho Chi Minh City (VNU-HCM) under grant
number B2013-26-01.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBSVM: A library for support vector machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          ,
          <volume>2</volume>
          :27:1–27:27,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          .
          <source>In Computer Vision and Pattern Recognition (CVPR)</source>
          , pages
          <fpage>248</fpage>
          –
          <fpage>255</fpage>
          . IEEE,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guadarrama</surname>
          </string-name>
          , and T. Darrell. Caffe:
          <article-title>Convolutional architecture for fast feature embedding</article-title>
          .
          <source>In Proceedings of the ACM International Conference on Multimedia</source>
          , pages
          <volume>675</volume>
          –
          <fpage>678</fpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <volume>1097</volume>
          –
          <fpage>1105</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Phan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satoh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Duong</surname>
          </string-name>
          .
          <article-title>NII-UIT at MediaEval 2014 violent scenes detection affect task</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2014 Workshop</source>
          , Barcelona, Catalunya, Spain, October 16-17,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Lowe</surname>
          </string-name>
          .
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          .
          <source>International journal of computer vision</source>
          ,
          <volume>60</volume>
          (
          <issue>2</issue>
          ):
          <volume>91</volume>
          –
          <fpage>110</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sanchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mensink</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          .
          <article-title>Image classification with the Fisher vector: Theory and practice</article-title>
          .
          <source>International journal of computer vision</source>
          ,
          <volume>105</volume>
          (
          <issue>3</issue>
          ):
          <volume>222</volume>
          –
          <fpage>245</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          , Y. Baveye,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. L.</given-names>
            <surname>Quang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , E. Dellandrea,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-H. Demarty</surname>
            , and
            <given-names>L. Chen.</given-names>
          </string-name>
          <article-title>The MediaEval 2015 affective impact of movies task</article-title>
          .
          <source>In MediaEval 2015 Workshop</source>
          , Wurzen, Germany, September 14-15,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Action recognition with improved trajectories</article-title>
          .
          <source>In International Conference on Computer Vision</source>
          (ICCV), pages
          <fpage>3551</fpage>
          –
          <fpage>3558</fpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>