<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MIC-TJU in MediaEval 2015 Affective Impact of Movies Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yun Yi</string-name>
          <email>13yiyun@tongji.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanli Wang</string-name>
          <email>hanliwang@tongji.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bowen Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Yu</string-name>
          <email>yujian@tongji.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Technology, Tongji University</institution>
          ,
          <addr-line>Shanghai 201804</addr-line>
          ,
          <country country="CN">P. R. China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The MediaEval 2015 Affective Impact of Movies task challenged participants to automatically detect video content that depicts violence, or to predict the affective impact that video content will have on viewers. In this paper, we describe our system and discuss the performance results obtained in this task. We adopt our recently proposed Trajectory Based Covariance (TBC) descriptor to depict motion information. In addition, other features including audio, scene, color and appearance are utilized in our system. To combine these features, a late fusion strategy is employed. Our results show that the trajectory based motion feature achieves very competitive performance; furthermore, combining it with the audio, scene, color and appearance features improves the overall performance. This work was supported in part by the “Shu Guang” project of Shanghai Municipal Education Commission and Shanghai Education Development Foundation under Grant 12SG23 and the Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (No. GZ2015005).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The Affective Impact of Movies task is a challenging
task that requires building a high-performance system to
automatically detect video content that depicts violence, or
to predict the affective impact that video content will have on
viewers. This task contains two subtasks: Induced Affect
Detection and Violence Detection. A brief introduction to
the dataset for training and testing as well as evaluation
metrics of these two subtasks has been given in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In
this paper, we mainly discuss the techniques and algorithms
employed by our system, as well as the related system
architecture and evaluation results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
      <p>The key components of the proposed system are shown in
Fig. 1. The highlights of our system are introduced below.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Feature Extraction</title>
      <p>In the feature extraction part, five kinds of features are
used: the Mel-Frequency Cepstral Coefficients (MFCC)
feature, the Improved Dense Trajectory (IDT) based feature,
the Dense Scale Invariant Feature Transform (Dense SIFT)
feature, the Hue-Saturation Histogram (HSH) feature and
the Convolutional Neural Network (CNN) based feature.</p>
      <p>H. Wang is the corresponding author.</p>
      <sec id="sec-3-1">
        <title>2.1.1 MFCC Feature</title>
        <p>
          We adopt the well-known MFCC algorithm for the audio
stream [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The time window for MFCC is 32 ms and there is
a 50% overlap between two adjacent windows. To fully utilize
the discriminative power of MFCC, we append the delta and
double-delta coefficients of the 20-dimensional MFCC vector
to the original vector, yielding a 60-dimensional MFCC vector. To
represent a whole audio file as a single vector, we adopt
the classic Bag-of-Words (BoW) framework, where the Fisher
Vector and Gaussian Mixture Model (GMM) are used [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
The cluster number of the GMM is set to 512 in our system.
        </p>
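        <p>As a rough sketch of the delta-stacking step described above (the actual MFCC coefficients would come from a dedicated implementation such as the toolbox of [2]; here a random matrix stands in for them), appending first- and second-order temporal differences triples the feature dimension:</p>

```python
import numpy as np

def append_deltas(mfcc):
    """Stack MFCC, delta and double-delta along the feature axis.

    mfcc: array of shape (20, T), one 20-dim vector per time window.
    Returns an array of shape (60, T).
    """
    # Simple finite-difference deltas; np.gradient keeps the length.
    delta = np.gradient(mfcc, axis=1)
    double_delta = np.gradient(delta, axis=1)
    return np.vstack([mfcc, delta, double_delta])

# Stand-in for real MFCCs: 20 coefficients over 100 windows.
mfcc = np.random.randn(20, 100)
feat = append_deltas(mfcc)
print(feat.shape)  # (60, 100)
```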
      </sec>
      <sec id="sec-3-2">
        <title>2.1.2 IDT Based Feature</title>
        <p>
          The Improved Dense Trajectory (IDT) approach [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is an
efficient method to track human actions. The trajectory
based descriptors, including the Histogram of Oriented
Gradient (HOG), Histogram of Optical Flow (HOF) and
Motion Boundary Histogram (MBH), are employed in our
system to depict the motion information of video content.
In addition, our recently proposed TBC [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] descriptor is also
utilized.
        </p>
        <p>
          After the extraction of descriptors, these feature vectors
are normalized with L1 and signed square root
normalization. To reduce the dimension of the descriptors,
Principal Component Analysis (PCA) is applied individually to
the three descriptors (i.e., HOG, HOF, MBH) as in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], and
Logarithm Principal Component Analysis (LogPCA) is
applied to the TBC descriptor.
        </p>
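        <p>The two normalization steps and the PCA projection can be sketched as follows (a minimal NumPy version; the LogPCA variant applied to TBC is not shown, and the descriptor sizes are toy values):</p>

```python
import numpy as np

def l1_signed_sqrt(x):
    """L1-normalize a descriptor, then apply the signed square root."""
    x = x / (np.abs(x).sum() + 1e-12)
    return np.sign(x) * np.sqrt(np.abs(x))

def pca_fit_transform(X, n_components):
    """Plain PCA via SVD: center the data and project onto the
    top principal directions."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Toy HOG-like descriptors: 200 samples, 96 dims, halved by PCA.
X = np.random.randn(200, 96)
X = np.apply_along_axis(l1_signed_sqrt, 1, X)
X_red = pca_fit_transform(X, 48)
print(X_red.shape)  # (200, 48)
```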
        <p>
          To encode the feature vectors, the Fisher Vector model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] is
utilized. Specifically, GMM is applied to construct a
codebook for each descriptor, and we compute one Fisher Vector
over an entire video followed by the signed square
root and L2 normalization, which significantly
improves the performance in combination with a linear SVM. To
combine the IDT based descriptors, early fusion is performed
to generate the final feature vector by concatenating the
aforementioned four feature vectors into a single one.
        </p>
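        <p>A compact sketch of this encoding pipeline, restricted to the first-order (mean-gradient) part of the Fisher Vector of [3] and using toy codebook and descriptor sizes (the full encoding also uses the second-order statistics):</p>

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """First-order (mean-gradient) part of the Fisher Vector of the
    descriptor set X under a diagonal-covariance GMM codebook."""
    N, D = X.shape
    q = gmm.predict_proba(X)              # (N, K) soft assignments
    mu = gmm.means_                       # (K, D)
    sigma = np.sqrt(gmm.covariances_)     # (K, D) for the diagonal model
    w = gmm.weights_                      # (K,)
    parts = []
    for k in range(len(w)):
        diff = (X - mu[k]) / sigma[k]     # whitened residuals
        parts.append((q[:, [k]] * diff).sum(axis=0) / (N * np.sqrt(w[k])))
    fv = np.concatenate(parts)            # (K * D,)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))   # signed square root
    return fv / (np.linalg.norm(fv) + 1e-12)  # L2 normalization

# Toy codebook (K=8) on random 16-dim descriptors; the paper uses
# much larger codebooks (512 clusters) and real IDT descriptors.
X = np.random.randn(500, 16)
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(X)
fv = fisher_vector(X, gmm)
print(fv.shape)  # (128,)
```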
      </sec>
      <sec id="sec-3-3">
        <title>2.1.3 Dense SIFT Feature</title>
        <p>Scene information is an important cue for video content
analysis. The Dense SIFT approach is utilized to depict
the scene information of video clips. We densely compute SIFT
descriptors at multiple scales on a dense grid every 30
frames. After the SIFT descriptors are extracted, PCA is
utilized to reduce their dimension, and
GMM is applied to construct a codebook. Unlike the IDT based
descriptors, we compute one Fisher Vector per frame,
and then average the per-frame Fisher Vectors at each temporal
scale. In our system, we set the cluster number of GMM to
512 and the temporal scale number to 2.</p>
      </sec>
      <sec id="sec-3-4">
        <title>2.1.4 Hue-Saturation Histogram Feature</title>
        <p>The color of videos can affect the viewer’s psychology, so
the Hue-Saturation Histogram (HSH) is used to describe
the color information of each frame. We quantize the hue
to 30 levels and the saturation to 32 levels; therefore, the
dimension of HSH is 960. Similar to the IDT based feature,
PCA is utilized to reduce the dimension of the HSH feature. To
encode the color information of a video, GMM is applied
to construct a codebook. We compute one Fisher Vector
over the current temporal scale, and set the cluster number
of GMM and the temporal scale number to 512 and 2,
respectively.</p>
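        <p>The 30 × 32 quantization can be sketched as a 2-D histogram over the per-pixel hue and saturation channels (the HSV conversion itself is assumed done beforehand; the value ranges below are one common convention):</p>

```python
import numpy as np

def hue_saturation_histogram(hue, sat):
    """Hue-Saturation Histogram of one frame: hue quantized to 30
    levels, saturation to 32 levels, flattened to a 960-dim vector.

    hue: per-pixel hue values in [0, 360)
    sat: per-pixel saturation values in [0, 1]
    """
    hist, _, _ = np.histogram2d(hue.ravel(), sat.ravel(),
                                bins=[30, 32],
                                range=[[0.0, 360.0], [0.0, 1.0]])
    return hist.ravel()  # 30 * 32 = 960 bins

# Toy frame: 64x48 pixels with random hue/saturation.
rng = np.random.default_rng(0)
hue = rng.uniform(0, 360, size=(48, 64))
sat = rng.uniform(0, 1, size=(48, 64))
h = hue_saturation_histogram(hue, sat)
print(h.shape, h.sum())  # (960,) 3072.0
```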
      </sec>
      <sec id="sec-3-5">
        <title>2.1.5 CNN Based Feature</title>
        <p>
          In the MediaEval Violence Detection subtask, we also
train a Convolutional Neural Network (CNN) to extract
appearance features. The CNN includes five convolutional and
pooling layers to extract appearance features and three
fully connected layers for classification. CNN is well known for
its powerful ability in feature extraction. However, a CNN’s
generalization ability will be limited if there are not enough
samples for training. Therefore, the images from the
ImageNet dataset are used to pre-train the CNN. Frames from
the Violence Detection subtask are used to fine-tune the first
five layers and retrain the last three fully connected layers.
The architecture of the CNN is the same as the CNN_M_2048
model [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
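        <p>The fine-tune/retrain split above can be sketched as follows. This is not the CNN_M_2048 model of [1], which is not available in standard libraries; a tiny stand-in network with the same 5-conv + 3-FC layout shows the pattern of fine-tuning pre-trained convolutional layers with a smaller learning rate while retraining the classification head for the two-class violence label:</p>

```python
import torch
import torch.nn as nn

# Tiny stand-in for the CNN_M_2048 layout of [1]: five conv blocks
# for features, three fully connected layers for classification.
class SmallCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            *[nn.Sequential(nn.Conv2d(3 if i == 0 else 16, 16, 3, padding=1),
                            nn.ReLU(), nn.MaxPool2d(2)) for i in range(5)])
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 7 * 7, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN(n_classes=2)
# (Pre-trained ImageNet weights would be loaded here.)

# Fine-tune the conv layers with a 10x smaller learning rate than the
# freshly initialized classification head (rates are illustrative).
optimizer = torch.optim.SGD([
    {"params": model.features.parameters(), "lr": 1e-4},
    {"params": model.classifier.parameters(), "lr": 1e-3},
], momentum=0.9)

out = model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 2])
```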
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.2 Classification</title>
      <p>As far as classification is concerned, a linear SVM
is employed in this work. In addition, the
One-Against-Rest approach is used for multi-label classification. In
order to balance the training samples for each
of the multiple classes, we weight the positive and negative
samples in an inverse manner. In our system, the standard
linear LIBSVM is used with the penalty parameter C equal
to 100. To combine different types of features, a late
fusion strategy is utilized to linearly combine the classifier scores
computed for each feature.</p>
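      <p>A minimal sketch of this classification and late-fusion scheme, using scikit-learn's LinearSVC in place of LIBSVM and synthetic two-feature data (the equal fusion weights and the "balanced" inverse class weighting here are illustrative stand-ins for the system's actual settings):</p>

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)                         # toy binary labels
feat_a = rng.normal(size=(300, 20)) + y[:, None] * 0.5   # "motion" feature
feat_b = rng.normal(size=(300, 10)) + y[:, None] * 0.3   # "color" feature

scores = []
for X in (feat_a, feat_b):
    # C=100 as in the paper; class_weight="balanced" reweights the
    # positive/negative samples inversely to their frequencies.
    clf = LinearSVC(C=100, class_weight="balanced").fit(X, y)
    scores.append(clf.decision_function(X))

# Late fusion: linear combination of the per-feature classifier scores.
fused = 0.5 * scores[0] + 0.5 * scores[1]
pred = (fused > 0).astype(int)
print(pred.shape)  # (300,)
```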
    </sec>
    <sec id="sec-5">
      <title>3. RESULTS AND DISCUSSIONS</title>
      <p>We submitted 5 runs with the results given in Table 1.
The Violence Detection subtask of Run 1 used the IDT based
feature, Dense SIFT feature, MFCC feature, HSH feature
and CNN based feature. Run 2 and the Induced Affect
Detection subtask of Run 1 used the IDT based feature,
Dense SIFT feature, MFCC feature and HSH feature. Run 3
used the IDT based feature, Dense SIFT feature and MFCC
feature. The Violence Detection subtask of Run 4 just
used the CNN based feature. The Induced Affect Detection
subtask of Run 4 used the IDT based feature and Dense SIFT
feature. The Violence Detection subtask of Run 5 used the
IDT based feature, Dense SIFT feature, HSH feature and
CNN based feature. The Induced Affect Detection subtask
of Run 5 used the IDT based feature, Dense SIFT feature
and HSH feature.</p>
      <p>
        From the comparison, we can see that the motion cue is
important for both subtasks. For the Violence Detection
subtask, the comparison of Run 1 and Run 2 shows that the
CNN based feature helps to improve the performance.
For the Induced Affect Detection subtask, the comparison
of Run 4 and Run 5 shows that the color information has a
significant impact. We report average precision for Violence
and global accuracy for Arousal and Valence [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Evaluation results of the five submitted runs (global accuracy, %).</p>
        </caption>
        <table>
          <thead>
            <tr><th>Run</th><th>Arousal</th><th>Valence</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>55.93</td><td>41.95</td></tr>
            <tr><td>2</td><td>55.93</td><td>41.95</td></tr>
            <tr><td>3</td><td>55.61</td><td>40.81</td></tr>
            <tr><td>4</td><td>53.70</td><td>40.90</td></tr>
            <tr><td>5</td><td>55.32</td><td>40.92</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chatfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Return of the devil in the details: Delving deep into convolutional nets</article-title>
          .
          <source>In BMVC'14</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. P. W.</given-names>
            <surname>Ellis</surname>
          </string-name>
          .
          <article-title>PLP and RASTA (and MFCC, and inversion) in Matlab</article-title>
          . http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Perronnin</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Dance</surname>
          </string-name>
          .
          <article-title>Fisher kernels on visual vocabularies for image categorization</article-title>
          .
          <source>In CVPR'07</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Baveye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. L.</given-names>
            <surname>Quang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dellandréa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>The MediaEval 2015 Affective Impact of Movies Task</article-title>
          .
          <source>In MediaEval 2015 Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Action recognition with improved trajectories</article-title>
          .
          <source>In ICCV'13</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Human action recognition with trajectory based covariance descriptor in unconstrained videos</article-title>
          .
          <source>In ACM MM'15</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>