<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DUT-MMSR at MediaEval 2017: Predicting Media Interestingness Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Reza Aditya Permadi</string-name>
          <email>r.a.permadi@student.tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Septian Gilang Permana Putra</string-name>
          <email>septiangilangpermanaputra@student.tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helmiriawan</string-name>
          <email>helmiriawan@student.tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cynthia C. S. Liem</string-name>
          <email>c.c.s.liem@tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Delft University of Technology</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
<p>This paper describes our approach for the submission to the MediaEval 2017 Predicting Media Interestingness Task, which was particularly developed for the Image subtask. A late fusion strategy is employed, combining classifiers trained on different features by stacking them using logistic regression (LR). As the task ground truth was based on pairwise evaluation of shots or keyframe images within the same movie, in addition to using precomputed features as-is, we also include a more contextual feature, considering averaged feature values over each movie. Furthermore, we also consider evaluation outcomes for the heuristic algorithm that yielded the highest MAP score on the 2016 Image subtask. Considering results obtained for the development and test sets, our late fusion method shows consistent performance on the Image subtask, but not on the Video subtask. Furthermore, clear differences can be observed between MAP@10 and MAP scores.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The main challenge of the Media Interestingness task is to rank sets
of images and video shots from a movie, based on their
interestingness level. The evaluation metric of interest for this task is the Mean
Average Precision considering the first 10 documents (MAP@10).
A complete overview of the task, along with the description of the
dataset, is given in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
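      <p>As an illustration of the metric, the following is a minimal sketch of
an AP@10/MAP@10 computation in Python, assuming binary interestingness
labels and real-valued predicted scores per movie; it follows one common
normalization convention and may differ in details from the official
evaluation tool.</p>
      <preformat>
import numpy as np

def average_precision_at_k(labels, scores, k=10):
    """AP@k for one movie: binary ground-truth labels, predicted
    interestingness scores (higher means more interesting)."""
    labels = np.asarray(labels)
    if labels.sum() == 0:
        return 0.0
    top = np.argsort(scores)[::-1][:k]      # indices of the top-k ranked items
    hits = labels[top]
    precisions = np.cumsum(hits) / (np.arange(hits.size) + 1)
    return float((precisions * hits).sum() / min(labels.sum(), k))

def map_at_k(per_movie, k=10):
    """Mean of AP@k over movies; per_movie is a list of (labels, scores) pairs."""
    return float(np.mean([average_precision_at_k(l, s, k) for l, s in per_movie]))
      </preformat>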
      <p>
        Due to the similarity of this year’s task to the 2016 Predicting
Media Interestingness task, we considered the strategies used in
submissions to last year’s task to inform the strategy of our
submission to this year’s task. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] both use an early fusion strategy
by combining features that perform relatively well individually. A
late fusion strategy with average weighting is used in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
combining classifiers from different modalities. A Support Vector Machine
(SVM) is used as the final combining classifier. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] finds that logistic
regression gives good results, using CNN features which have been
transformed by PCA.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] proposed a heuristic approach, based on observing clear
presence of people in images. This approach performed surprisingly
well, even yielding the highest MAP score in the 2016 Image subtask.
While we will mainly focus on a fusion-based approach this year,
we will also include results of the best-performing configuration
from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] in unaltered form, so that a reference point to last year's
state-of-the-art is retained.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>
        We would like to devise a strategy which is computationally efficient
and yet gives good interpretability of the results. Our approach
therefore consists of a fairly straightforward machine learning
pipeline, evaluating performance of individual features first, and
subsequently applying classifier stacking to find the best
combinations of the best-performing classifiers on the best-performing
features. For our implementation, we use sklearn [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. As base
classifier, we use logistic regression and aim to find optimal parameters.
Further details of our approach are described below.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Features of interest</title>
      <p>
        For the Image subtask, we initially consider all the pre-computed
visual features from [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: Color Histogram (HSV), LBP, HOG, GIST,
denseSIFT, and the AlexNet-based features (fully connected (fc7)
layer and probability output). In all cases, we consider pre-computed
features and their dimensionalities as-is.
      </p>
      <p>Ground truth for this task was established by asking human raters
to perform pairwise annotations on shots or keyframe images from
the same movie. Hence, it is likely that overall properties of the
movie have affected the ratings: for example, if a movie is
consistently shot against a dark background, a dark shot may not
stand out as much as it would in another movie. In other words, we
assume that the same feature vector may be associated with different
interestingness levels, depending on the context of the movie it
occurs in. Therefore, apart from the features precomputed by the
organizers, we also consider a contextual feature, based on the
average image feature values per movie.</p>
      <p>Let X<sub>i</sub> be an m × n feature matrix for a movie, where m is the
number of images offered for the movie, and n the length of the
feature vector describing each image. For our contextual feature,
we then take the column-wise average of X<sub>i</sub>, yielding
a new vector µ<sub>i</sub> of size 1 × n. In our subsequent discussion, we
will denote the contextual feature belonging to a feature type F by
‘meanF’ (e.g. HSV → meanHSV). This feature is then concatenated
to the original feature vector.</p>
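      <p>A minimal sketch of this construction, assuming the features of one
movie are available as a NumPy matrix; the function name is illustrative.</p>
      <preformat>
import numpy as np

def add_contextual_feature(X):
    """X is the m x n feature matrix of one movie (one row per image).
    Appends the column-wise movie average to every row, so that
    e.g. HSV becomes [HSV, meanHSV] with m x 2n values."""
    mu = X.mean(axis=0, keepdims=True)              # 1 x n context vector
    return np.hstack([X, np.repeat(mu, X.shape[0], axis=0)])
      </preformat>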
      <p>
        For the Video subtask, in comparison to the Image subtask, we
now also have information from the audio modality in the form
of Mel-Frequency Cepstral Coefficients (MFCC). We further use
the pre-computed fully-connected layer (fc6) of a C3D deep neural
network [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. As pre-computed features are given at the keyframe
resolution, we simply average over all the keyframes to obtain the
values for a particular feature representation. Again, we consider
pre-computed features and their dimensionalities as-is.
      </p>
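      <p>A minimal sketch of this aggregation step, assuming the per-keyframe
features of one shot are stacked row-wise in a matrix:</p>
      <preformat>
import numpy as np

def shot_descriptor(keyframe_features):
    """Collapse a k x n matrix of per-keyframe features of one shot
    (e.g. C3D fc6 activations or MFCC frames) into a single 1 x n
    vector by averaging over the keyframes."""
    return np.asarray(keyframe_features).mean(axis=0, keepdims=True)
      </preformat>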
    </sec>
    <sec id="sec-4">
      <title>Individual feature evaluation</title>
      <p>For each feature type, we would like to individually find the
best-performing classifier that optimizes the MAP@10 value. Before
feeding the feature vector into the classifier, the values are scaled to
have zero mean and unit variance, considering the overall statistics
of the training set. For logistic regression, the optimal penalty
parameter C is searched on a logarithmic scale from 10<sup>−9</sup> to 100.</p>
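      <p>A minimal sketch of this per-feature setup with sklearn, assuming X
and y hold one feature type and the corresponding labels; the grid search
shown uses sklearn's built-in average-precision scorer as a stand-in,
whereas our actual selection criterion is MAP@10 under the movie-based
folds described below.</p>
      <preformat>
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Scale every feature dimension to zero mean and unit variance using
# training-set statistics, then search C on a logarithmic scale.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
param_grid = {"logisticregression__C": np.logspace(-9, 2, num=12)}
search = GridSearchCV(pipeline, param_grid, scoring="average_precision", cv=5)
# search.fit(X, y)  # one feature type at a time
      </preformat>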
      <p>To evaluate our model, 5-fold cross-validation is used. Folds
are created at the level of movies in the dataset, rather
than at the level of individual image or video instances
within a movie. This way, we make sure that training or prediction
always considers all instances offered for a particular movie. For
each evaluation, the cross validation procedure is run 10 times to
avoid dependence on specific fold compositions, and the average
MAP@10 value is considered.
</p>
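      <p>A minimal sketch of this evaluation procedure; the evaluate callback
is an assumption, standing in for training a classifier on the training
movies and returning MAP@10 on the held-out movies.</p>
      <preformat>
import numpy as np

def repeated_movie_cv(evaluate, movie_ids, n_splits=5, n_repeats=10, seed=0):
    """Repeated movie-level K-fold cross-validation: folds partition
    movies, so all instances of a movie stay on one side of a split."""
    rng = np.random.RandomState(seed)
    movies = np.unique(movie_ids)
    scores = []
    for _ in range(n_repeats):
        for fold in np.array_split(rng.permutation(movies), n_splits):
            test_mask = np.isin(movie_ids, fold)  # all items of held-out movies
            scores.append(evaluate(~test_mask, test_mask))
    return float(np.mean(scores))
      </preformat>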
    </sec>
    <sec id="sec-5">
      <title>Classifier stacking</title>
      <p>After identifying the best classifier configuration per feature type,
we stack the output probabilities of those classifiers and try different
combinations of them, which are then used to train several second-stage classifiers
(logistic regression, SVM, AdaBoost, Linear Discriminant
Analysis, and Random Forest). Finding that logistic regression and SVM
perform quite well, we apply a more intensive grid search on these
classifier types to optimize parameters.
</p>
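      <p>A minimal sketch of the stacking step, assuming base classifiers have
already been fitted per feature type; in practice, out-of-fold predictions
would be used to train the second-stage classifier, so that training labels
do not leak into its inputs.</p>
      <preformat>
import numpy as np
from sklearn.linear_model import LogisticRegression

def stacked_probabilities(base_classifiers, features):
    """base_classifiers: feature name to fitted classifier;
    features: feature name to matrix for the same items. Returns an
    items x features matrix of positive-class probabilities."""
    names = sorted(base_classifiers)
    return np.column_stack(
        [base_classifiers[n].predict_proba(features[n])[:, 1] for n in names])

# Second-stage classifier trained on the stacked probabilities of a
# candidate feature combination; logistic regression and SVM worked best.
meta = LogisticRegression()
# meta.fit(stacked_probabilities(base, train_features), y_train)
      </preformat>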
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND CONCLUSIONS</title>
      <p>Table 2 illustrates the top performance on the development set for
individual classifiers, which were then also considered for the test
set. In several cases, the addition of our contextual averaged feature
yields slight improvements over the individual feature alone. This
improvement is biggest for LBP, with an increase of
0.68 compared to the original features.</p>
      <p>The final evaluation results for 5 different runs per subtask are
shown in Table 1. As before, late fusion makes use of the output
probability of the best classifiers trained on each feature (as in Table
2), rather than the feature values themselves. Our best-performing
result on the Image development set is a MAP@10 value of 0.139
(Logistic Regression, C = 100). This is an improvement over the
performance of the best individual classifier in Table 2. On the test
set, our best result on the Image subtask is a MAP@10 value of
0.1385 for the same classifier configuration.</p>
      <p>For the Video subtask, evaluation results on the test set show
considerable differences in comparison to the development set.
While somewhat surprising, we did notice considerable variation
in results during cross-validation, and our reported development
set results are an average of several cross-validation runs. As one
particularly bad cross-validation result, using late fusion of GIST
and C3D features with logistic regression (C = 10), with videos 0, 3,
6, 9, 12, 22, 23, 27, 28, 30, 46, 50, 60, 69, 71 as the evaluation fold, we
only obtained a MAP@10 value of 0.0496.</p>
      <p>
        Considering the results for the best-performing configuration
(histface) of the heuristic approach from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we notice clear
differences between MAP@10 and MAP as a metric. Generally speaking,
the heuristic approach is especially outperformed on MAP@10,
implying that clear presence of people is not the only criterion for
the top-ranked items. Comparing results for this approach to our
proposed late fusion approach, the late fusion approach consistently
outperforms a heuristic approach on the Image subtask, but in the
Video subtask, the heuristic approach still has reasonable scores,
and outperforms the late fusion approach on the test set.
      </p>
      <p>
        In conclusion, we employed the offered pre-computed features
and included a contextual averaged feature, and then proposed a
late fusion strategy based on the best-performing classifier settings
for the best-performing features. Using fusion shows improvements
over results obtained on individual features. In future work,
alternative late fusion strategies as explained in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] may be investigated.
      </p>
      <p>For the Image subtask, we notice consistent results between the
development and test set. However, on the Video subtask, we notice
inconsistent results on the development set in comparison to the
test set. Predicting interestingness in video likely needs a more
elaborate approach than we have yet covered thoroughly in
our method. It also might be the case that the feature distribution
of the test set turned out different from that of the training set, or
that generally, the distribution of features across a video should
be taken into account in more sophisticated ways, for example by
taking into account temporal development aspects.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Pradeep</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Atrey</surname>
            ,
            <given-names>M. Anwar</given-names>
          </string-name>
          <string-name>
            <surname>Hossain</surname>
          </string-name>
          , Abdulmotaleb El Saddik, and
          <string-name>
            <surname>Mohan</surname>
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kankanhalli</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Multimodal fusion for multimedia analysis: a survey</article-title>
          .
          <source>Multimedia Systems</source>
          <volume>16</volume>
          ,
          <issue>6</issue>
          (
          <issue>01</issue>
          <year>Nov 2010</year>
          ),
          <fpage>345</fpage>
          -
          <lpage>379</lpage>
          . https://doi.org/10.1007/s00530-010-0182-0
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Shizhe</given-names>
            <surname>Chen</surname>
          </string-name>
          , Yujie Dian, and
          <string-name>
            <given-names>Qin</given-names>
            <surname>Jin</surname>
          </string-name>
          .
          <year>2016</year>
          . RUC at MediaEval 2016:
          <article-title>Predicting Media Interestingness Task</article-title>
          .
          <source>In MediaEval 2016 Working Notes Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Mihai</given-names>
            <surname>Gabriel</surname>
          </string-name>
          <string-name>
            <surname>Constantin</surname>
          </string-name>
          , Bogdan Boteanu, and
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>LAPI at MediaEval 2016 Predicting Media Interestingness Task</article-title>
          .
          <source>In MediaEval 2016 Working Notes Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Claire-Hélène</surname>
            <given-names>Demarty</given-names>
          </string-name>
          , Mats Sjöberg, Bogdan Ionescu,
          <string-name>
            <surname>Thanh-Toan</surname>
            <given-names>Do</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Gygli</surname>
          </string-name>
          , and Ngoc QK Duong.
          <article-title>Mediaeval 2017 Predicting Media Interestingness Task</article-title>
          .
          <source>In MediaEval 2017 Working Notes Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Yu-Gang</surname>
            <given-names>Jiang</given-names>
          </string-name>
          , Qi Dai, Tao Mei, Yong Rui, and
          <string-name>
            <surname>Shih-Fu Chang</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Super Fast Event Recognition in Internet Videos</article-title>
          .
          <source>IEEE Transactions on Multimedia 17, 8 (Aug</source>
          <year>2015</year>
          ),
          <fpage>1174</fpage>
          -
          <lpage>1186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Vu</given-names>
            <surname>Lam</surname>
          </string-name>
          , Tien Do, Sang Phan,
          <string-name>
            <surname>Duy-Dinh Le</surname>
          </string-name>
          , and Duc Anh Duong.
          <year>2016</year>
          .
          <article-title>NII-UIT at MediaEval 2016 Predicting Media Interestingness Task.</article-title>
          .
          <source>In MediaEval 2016 Working Notes Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Cynthia</surname>
            <given-names>C. S.</given-names>
          </string-name>
          <string-name>
            <surname>Liem</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>TUD-MMC at MediaEval 2016: Predicting Media Interestingness Task.</article-title>
          .
          <source>In MediaEval 2016 Working Notes Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jayneel</given-names>
            <surname>Parekh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sanjeel</given-names>
            <surname>Parekh</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The MLPBOON Predicting Media Interestingness System for MediaEval 2016.</article-title>
          . In MediaEval 2016 Working Notes Proceedings.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          , Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          , Ron Weiss, Vincent Dubourg, and others.
          <source>2011</source>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <string-name>
            <surname>Oct</surname>
          </string-name>
          (
          <year>2011</year>
          ),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Du</surname>
            <given-names>Tran</given-names>
          </string-name>
          , Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and
          <string-name>
            <given-names>Manohar</given-names>
            <surname>Paluri</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Learning spatiotemporal features with 3d convolutional networks</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision</source>
          . 4489-
          <fpage>4497</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>