<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RUC at MediaEval 2016: Predicting Media Interestingness Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shizhe Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yujie Dian</string-name>
          <email>dianyujie-blair@ruc.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qin Jin</string-name>
          <email>qjin@ruc.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information, Renmin University of China</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Visual Features CNN Features Handcrafted Features</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>Measuring media interestingness has a wide range of applications such as video recommendation. This paper presents our approach in the MediaEval 2016 Predicting Media Interestingness Task. There are two subtasks: image interestingness prediction and video interestingness prediction. For both subtasks, we utilize hand-crafted features and CNN features as our visual features. For the video subtask, we also extract acoustic features, including MFCC Fisher Vector and statistical acoustic features. We train SVM and Random Forest classifiers, and early fusion is applied to combine different features. Experimental results show that combining semantic-level and low-level visual features is beneficial for image interestingness prediction. When predicting video interestingness, the audio modality has superior performance, and the early fusion of the visual and audio modalities can further boost the performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>SYSTEM DESCRIPTION</title>
      <p>
        An overview of our framework in the MediaEval 2016
Predicting Media Interestingness Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is shown in Figure 1.
For image interestingness prediction, we use hand-crafted
visual features and CNN features. For the video subtask,
we utilize both visual and audio cues in the video to
predict the interestingness. Early fusion is applied to combine
different features. In the following subsections, we describe
the feature representations and prediction models in detail.
      </p>
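      <p>As a minimal illustration (not our exact code), early fusion here simply means concatenating the per-sample feature vectors produced by different extractors before classification; the array names and dimensions below are only placeholders.</p>
      <preformat>
import numpy as np

def early_fusion(feature_blocks):
    """Concatenate a list of (n_samples, dim_i) feature matrices column-wise."""
    return np.concatenate(feature_blocks, axis=1)

# Placeholder feature matrices for 100 samples (dimensions are illustrative).
gist = np.random.rand(100, 512)          # e.g. GIST descriptors
alex_prob = np.random.rand(100, 1000)    # e.g. AlexNet softmax probabilities
fused = early_fusion([gist, alex_prob])  # shape: (100, 1512)
</preformat>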
    </sec>
    <sec id="sec-2">
      <title>Feature Extraction</title>
      <sec id="sec-2-1">
        <title>Visual Features</title>
        <p>
          DCNN is the state-of-the-art model in many visual tasks
such as object detection and scene recognition. In this task,
we extract activations from the penultimate and the last
softmax layers of AlexNet and Inception-v3 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
pretrained on ImageNet as our image-level CNN features,
namely alex_fc7, alex_prob, inc_fc, and inc_prob respectively. The
features extracted from the last layers are the probability
distributions over 1000 different object classes, which describe the
semantic-level concepts people might show interest in.
The penultimate-layer features are abstractions of the
image content and have shown great generalization ability
in different tasks. We also use hand-crafted visual features,
including Color Histogram, GIST, LBP, HOG, and Dense SIFT
provided in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], to cover different aspects of the images. For
the video subtask, mean pooling is applied over all the image
features of the video clip to generate video-level features.
        </p>
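        <p>A minimal sketch of this feature extraction, assuming a torchvision AlexNet pretrained on ImageNet (the Inception-v3 features are obtained analogously); the function and path names are illustrative rather than our exact pipeline.</p>
        <preformat>
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.alexnet(pretrained=True).eval()
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract(image_path):
    """Return (alex_fc7, alex_prob)-style features for a single image."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        h = model.avgpool(model.features(x)).flatten(1)
        fc7 = model.classifier[:6](h)                     # penultimate activations
        prob = torch.softmax(model.classifier(h), dim=1)  # 1000-way probabilities
    return fc7.squeeze(0).numpy(), prob.squeeze(0).numpy()

def video_feature(frame_paths):
    """Video-level feature: mean pooling of per-frame penultimate features."""
    return np.stack([extract(p)[0] for p in frame_paths]).mean(axis=0)
</preformat>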
      </sec>
      <sec id="sec-2-2">
        <title>Acoustic Features</title>
        <p>
          Statistical Acoustic Features: Statistical acoustic
features have proved to be effective in speech emotion
recognition. We use the open-source toolkit OpenSMILE [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] to
extract the statistical acoustic features, using the
configuration from the INTERSPEECH 2009 Emotion Challenge [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
Low-level acoustic features such as energy, pitch, jitter and
shimmer are first extracted over a short-time window, and
then statistical functionals such as mean and max are applied over the
set of low-level features to generate sentence-level features.
        </p>
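        <p>A sketch of this step using the SMILExtract command-line tool; the configuration file name corresponds to the INTERSPEECH 2009 set shipped with openSMILE, but its exact path depends on the local installation and is only an assumption here.</p>
        <preformat>
import subprocess

def extract_is09(wav_path, out_csv,
                 config="opensmile/config/is09-13/IS09_emotion.conf"):  # assumed install path
    """Run openSMILE with the INTERSPEECH 2009 configuration on one audio file."""
    subprocess.run(
        ["SMILExtract", "-C", config, "-I", wav_path, "-O", out_csv],
        check=True,
    )

extract_is09("clip_audio.wav", "is09_features.csv")  # illustrative file names
</preformat>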
        <p>
          MFCC based Features: The Mel-Frequency Cepstral
Coefficients (MFCCs) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] are the most widely used low-level
features which have been successfully applied in many speech
tasks. Therefore, we use MFCCs as our frame-level features,
with a window of 25 ms and a shift of 10 ms. The Fisher Vector
Encoding (FV) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is applied to transform the variable-length
sequence of MFCCs into sentence-level features. We train a
Gaussian Mixture Model (GMM) with 8 mixtures as our audio
word dictionary. Then we compute the gradient of the log
likelihood with respect to the parameters of the GMM for
each audio clip to maximize the probability that the model can
fit the data. L2 normalization is applied to the mfccFV features.
        </p>
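        <p>The following is a minimal sketch of the mfccFV pipeline described above (25 ms / 10 ms MFCCs, an 8-component GMM, Fisher Vector gradients with respect to the GMM means and variances, and L2 normalization); the use of librosa and scikit-learn is our choice for illustration, and the file paths are placeholders.</p>
        <preformat>
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(wav_path, sr=16000, n_mfcc=13):
    """Frame-level MFCCs with a 25 ms window and a 10 ms shift."""
    y, sr = librosa.load(wav_path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                             n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return m.T  # (n_frames, n_mfcc)

def fisher_vector(frames, gmm):
    """Gradients of the GMM log-likelihood w.r.t. means and variances, L2-normalized."""
    post = gmm.predict_proba(frames)                  # (T, K) posteriors
    T, K = post.shape
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (frames[:, None, :] - mu[None]) / np.sqrt(var)[None]   # (T, K, D)
    g_mu = (post[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_var = (post[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    return fv / (np.linalg.norm(fv) + 1e-12)

# Train the 8-mixture "audio word dictionary" on pooled training MFCCs (paths illustrative).
train_frames = np.vstack([mfcc_frames(p) for p in ["train_a.wav", "train_b.wav"]])
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(train_frames)
mfcc_fv = fisher_vector(mfcc_frames("test_clip.wav"), gmm)
</preformat>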
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Classification Model</title>
      <p>For both the image and video systems, we train binary
SVM and Random Forest classifiers as our interestingness
classification models. Hyperparameters of the models are selected
according to the mean average precision (MAP) on our
local validation set using grid search. For the SVM, an RBF kernel
is applied and the cost is searched from 2<sup>-2</sup> to 2<sup>10</sup>. For the
Random Forest, the number of trees is set to 100 and the
depth of the trees is searched from 2 to 16.</p>
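      <p>A possible sketch of this model selection with scikit-learn (our choice for illustration): average precision on a fixed validation split stands in for MAP of a single binary task, and the placeholder arrays would be replaced by the fused features and interestingness labels.</p>
      <preformat>
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Placeholder data: fused features and binary interestingness labels.
X, y = np.random.rand(200, 64), np.random.randint(0, 2, 200)

# Fixed train/validation split: -1 marks training samples, 0 marks validation samples.
test_fold = np.full(200, -1)
test_fold[150:] = 0
split = PredefinedSplit(test_fold=test_fold)

# RBF-kernel SVM, cost searched from 2^-2 to 2^10.
svm = GridSearchCV(SVC(kernel="rbf", probability=True),
                   {"C": [2.0 ** p for p in range(-2, 11)]},
                   scoring="average_precision", cv=split).fit(X, y)

# Random Forest with 100 trees, depth searched from 2 to 16.
rf = GridSearchCV(RandomForestClassifier(n_estimators=100),
                  {"max_depth": list(range(2, 17))},
                  scoring="average_precision", cv=split).fit(X, y)
</preformat>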
    </sec>
    <sec id="sec-4">
      <title>EXPERIMENTS</title>
    <sec id="sec-5">
      <title>Experimental Setting</title>
      <p>There are 5054 images or videos in total for development
in each subtask. We use the videos with ids from 0 to 40 (4014
samples) as the local training set, 41 to 45 (468 samples) as
the local validation set, and the remaining videos (572 samples)
as the local testing set. We use the whole development set
to train the final submitted systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Experimental Results</title>
      <p>Figure 2 shows the best MAP performance of SVM and
Random Forest classifiers for each kind of feature in the
image subtask. The penultimate CNN features inc_fc and
alex_fc7 achieve the top performance among all the visual
features. However, the probability features extracted from
CNN do not perform well alone.</p>
      <p>We then use early fusion to concatenate different visual
features. Figure 3 shows some of the fusion results. We
can see that combining alex_prob with other visual
appearance features can significantly improve the classification
performance, which shows that the semantic-level features
and low-level appearance features are complementary.
However, concatenating alex_fc7 with hand-crafted features does
not bring any improvement.</p>
      <p>For video interestingness prediction, Figure 4 presents the
performance of each single feature. The audio modality
outperforms the visual modality and mfccFV achieves the best
performance. Fusing acoustic features with the best
visual feature, GIST, is beneficial; for example, AcouStats-GIST
achieves a MAP of 20.80%, a 19% relative gain
over the MAP of the single feature GIST.</p>
      <p>The five runs we submitted in total are listed in Table 1.</p>
    </sec>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSIONS</title>
      <p>Our results show that image interestingness prediction can
benefit from combining semantic-level object probability
distribution features and low-level visual appearance
features. For predicting video interestingness, the audio modality
shows superior performance to the visual modality, and the
early fusion of the two modalities can further boost the
performance. In future work, we will explore ranking models
for the interestingness prediction task and extract more
discriminative features such as video motion features.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research was supported by the Research Funds of
Renmin University of China (No. 14XNLQ01) and the
Beijing Natural Science Foundation (No. 4142029).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          , M. Sjöberg,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , T.-T. Do,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Q. K.</given-names>
            <surname>Duong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Lefebvre</surname>
          </string-name>
          .
          <article-title>MediaEval 2016 predicting media interestingness task</article-title>
          .
          <source>In Proc. of the MediaEval 2016 Workshop</source>
          , Hilversum, Netherlands, Oct. 20-21,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and
          <string-name>
            <given-names>Zbigniew</given-names>
            <surname>Wojna</surname>
          </string-name>
          .
          <article-title>Rethinking the inception architecture for computer vision</article-title>
          .
          <source>arXiv preprint arXiv:1512.00567</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Yu-Gang</given-names>
            <surname>Jiang</surname>
          </string-name>
          , Qi Dai, Tao Mei, Yong Rui, and
          <string-name>
            <given-names>Shih-Fu</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Super fast event recognition in internet videos</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>17</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1174</fpage>
          -
          <lpage>1186</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Florian</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Wöllmer</surname>
          </string-name>
          , and Björn Schuller.
          <article-title>Opensmile: the munich versatile and fast open-source audio feature extractor</article-title>
          .
          <source>In ACM International Conference on Multimedia (MM)</source>
          , pages
          <fpage>1459</fpage>
          -
          <lpage>1462</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Björn W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          , Stefan Steidl, and
          <string-name>
            <given-names>Anton</given-names>
            <surname>Batliner</surname>
          </string-name>
          .
          <article-title>The INTERSPEECH 2009 emotion challenge</article-title>
          .
          <source>In INTERSPEECH</source>
          <year>2009</year>
          ,
          <article-title>10th Annual Conference of the International Speech Communication Association</article-title>
          , Brighton,
          United Kingdom, September 6-10,
          <year>2009</year>
          , pages
          <fpage>312</fpage>
          -
          <lpage>315</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Steven B.</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <article-title>Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences</article-title>
          .
          <source>Readings in Speech Recognition</source>
          ,
          <volume>28</volume>
          (
          <issue>4</issue>
          ):
          <fpage>65</fpage>
          -
          <lpage>74</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jorge</given-names>
            <surname>Sanchez</surname>
          </string-name>
          , Florent Perronnin, Thomas Mensink, and
          <string-name>
            <given-names>Jakob J.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          .
          <article-title>Image classification with the Fisher vector: Theory and practice</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>105</volume>
          (
          <issue>3</issue>
          ):
          <fpage>222</fpage>
          -
          <lpage>245</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>