<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The IITB Predicting Media Interestingness System for MediaEval 2017</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jayneel Parekh</string-name>
          <email>jayneelparekh@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harshvardhan Tibrewal</string-name>
          <email>hrtibrewal@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sanjeel Parekh</string-name>
          <email>sanjeelparekh@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology</institution>
          ,
          <addr-line>Bombay</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technicolor</institution>
          ,
          <addr-line>Cesson Sévigné</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes the system developed by team IITB for the MediaEval 2017 Predicting Media Interestingness Task. We propose a new method of training based on pairwise comparisons between frames of a trailer. The algorithm gave very promising results on the development set but did not perform as well on the test set. Our highest MAP@10 on the test set is 0.0911 (Image subtask) and 0.0525 (Video subtask), achieved by runs based on systems submitted last year [4, 6].</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2017 Predicting Media Interestingness Task
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] deals with automatic selection of images and/or video
segments according to their interestingness to a common
viewer. We only use the visual content and no additional
metadata.
      </p>
      <p>
        Previous systems on this task discuss in detail several
relevant inherent problems. Further, they also point towards
the usefulness of CNN features: in particular, they report
features from AlexNet's fc7 layer performing reasonably well
with simple classifiers [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ]. We believe a key shortcoming
of the previous approaches is that they attempt to tag
images interesting/non-interesting in a global context, whereas
the task inherently expects images to be classified in a local
context (trailer-wise). Our system takes this aspect into
account by training a classifier on pairwise comparisons of
frames from the same trailer.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM DESCRIPTION</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Pre-processing</title>
      <p>Given the training data feature matrix X ∈ ℝ^(N×F),
consisting of N examples each described by an F-dimensional
vector, we first standardize it and apply principal
component analysis (PCA) to reduce its dimensionality. The
transformed feature matrix Z = (z_i)_i ∈ ℝ^(N×M) is used to
experiment with various classifiers. Here M depends on the
number of top eigenvalues we wish to consider.</p>
      <p>
        For our system we use AlexNet's fc7 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] features provided
for the image subtask and C3D [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] features provided for the video
subtask. Each feature vector has a dimension of 4096. After
performing PCA we reduce the dimension to 200. Thus Z
is an ℝ^(N×200) matrix in our system.
      </p>
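      <p>As a concrete illustration, the following is a minimal sketch of this pre-processing step with scikit-learn (the helper and variable names are ours), assuming the provided fc7/C3D features have already been loaded into an N x 4096 matrix X:</p>
      <preformat>
# Minimal sketch: standardize the features, then project onto the top 200 principal components.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


def preprocess(X, n_components=200):
    """Standardize the feature matrix and reduce its dimensionality with PCA."""
    X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per dimension
    pca = PCA(n_components=n_components)        # keep the top n_components eigen-directions
    Z = pca.fit_transform(X_std)                # Z has shape (N, 200)
    return Z, pca
      </preformat>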
    </sec>
    <sec id="sec-4">
      <title>2.2 Training</title>
      <p>
        We adopted the following two methods for training:
1. Feed every frame/video's feature vector to the classifier,
which learns to predict the interestingness label of
the frame as in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
2. For each trailer we consider all possible pairs of its
frames/videos and feed the corresponding concatenated
feature vectors to the classifier. The classifier learns to
predict which one of the two frames/videos is more
interesting.
      </p>
      <p>For the second training method, pairwise comparisons are
made as follows. First, from each trailer, we generate all possible pairs
of frames. This ensures that only frames/videos of the same
trailer are being compared. Considering T trailers with n_i
frames/videos each, we get N1 = Σ_i C(n_i, 2) pairs, the sum
running over all T trailers, where C(n_i, 2) = n_i(n_i - 1)/2.
Each pair is represented by concatenating
the feature vectors of its two frames/videos. With each feature vector
of size M, the concatenation gives a final
feature vector of size 2M. This procedure yields a feature
matrix Z_new ∈ ℝ^(N1×2M). The output label for an ordered pair
of frames/videos (I1, I2) is assigned as follows:
y = 1 if I1 is more interesting than I2, and y = 0 if I2 is more interesting than I1. (1)</p>
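      <p>The pair construction can be sketched as follows (a hedged illustration with helper names of our own; interestingness stands for the ground-truth annotations of the development set and trailer_ids for the trailer index of each frame/video):</p>
      <preformat>
# Minimal sketch of building the pairwise training set: all pairs of frames/videos
# from the same trailer are concatenated and labelled according to equation (1).
import numpy as np
from itertools import combinations


def build_pairs(Z, interestingness, trailer_ids):
    pair_feats, pair_labels = [], []
    for t in np.unique(trailer_ids):
        idx = np.where(trailer_ids == t)[0]            # frames/videos of trailer t only
        for i, j in combinations(idx, 2):              # C(n_t, 2) pairs per trailer
            pair_feats.append(np.concatenate([Z[i], Z[j]]))   # 2M-dimensional representation
            # y = 1 if the first member of the pair is more interesting, else 0
            # (ties are labelled 0 here for simplicity)
            pair_labels.append(1 if interestingness[i] > interestingness[j] else 0)
    return np.array(pair_feats), np.array(pair_labels)
      </preformat>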
    </sec>
    <sec id="sec-5">
      <title>2.3 Prediction</title>
      <p>
        For the first two runs, which are based on [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we
used different classifiers: support vector machines (SVM)
with an RBF kernel (run 1) and logistic regression with an l1 penalty
(run 2), as sketched below.
      </p>
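      <p>A rough sketch of these two baseline classifiers with scikit-learn follows; it assumes Z and y hold the pre-processed features and per-frame labels, Z_test the transformed test features, and that the class-1 probability is used as the per-frame score, which is our assumption rather than something specified here:</p>
      <preformat>
# Sketch of the classifiers used for the first two (non-pairwise) runs.
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

run1_clf = SVC(kernel='rbf', probability=True)                   # run 1: SVM with RBF kernel
run2_clf = LogisticRegression(penalty='l1', solver='liblinear')  # run 2: logistic regression, l1 penalty

run1_clf.fit(Z, y)
run2_clf.fit(Z, y)

# Class-1 probability as a per-frame interestingness score (assumed ranking criterion).
scores = run2_clf.predict_proba(Z_test)[:, 1]
      </preformat>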
      <p>We now describe the prediction algorithm for our new approach.
The ranking of the frames/videos of a particular trailer according to their
interestingness is determined from the
predicted results of all the pairwise comparisons by generating
a penalty score s_i for each of them and ordering them from
lowest to highest, with the lowest corresponding to the most
interesting frame/video. The scores are determined using the
following algorithm (referred to as P1):
1. Initialize the penalty score s_i = 0 for each i.
2. Iterate over the results of all pairwise comparisons: for
each pair indexed by {k, l}, let r(k, l) denote the
prediction of the classifier. The following update is performed:
s_u = s_u + |Pr{r(k, l) = 1} - Pr{r(k, l) = 0}|,
where u denotes the index of the predicted less interesting frame/video,
Pr{.} denotes the probability and |.| the
absolute value.</p>
      <p>This essentially increases the penalty score of the less
interesting frame/video according to the confidence the classifier has in its
prediction. The confidence value of the classifier for a given
pair is taken to be the absolute difference between Pr{r(k, l) = 1}
and Pr{r(k, l) = 0}. We also try a variant of the above
algorithm in one of our runs, wherein the update equation
is s_u = s_u + 1 (referred to as P2).</p>
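      <p>A minimal sketch of this ranking procedure is given below (pair_clf denotes the trained pairwise classifier, pairs holds the index pairs {k, l} of one trailer, pair_feats their concatenated feature vectors; the names are ours):</p>
      <preformat>
# Penalty-score ranking for one trailer, implementing P1 and the simpler P2 variant.
import numpy as np


def rank_trailer(pair_clf, pairs, pair_feats, n_items, variant='P1'):
    s = np.zeros(n_items)                        # penalty score s_i for every frame/video
    proba = pair_clf.predict_proba(pair_feats)   # columns: Pr{r = 0}, Pr{r = 1}
    for (k, l), p in zip(pairs, proba):
        pred = 1 if p[1] >= p[0] else 0          # classifier prediction r(k, l)
        u = l if pred == 1 else k                # index of the predicted less interesting item
        if variant == 'P1':
            s[u] += abs(p[1] - p[0])             # penalize by the classifier's confidence
        else:
            s[u] += 1.0                          # P2: constant penalty
    ranking = np.argsort(s)                      # lowest penalty = most interesting
    return ranking, s
      </preformat>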
      <p>Interestingness classification: We opt for a simple
method for the binary classification of each image as interesting
or not: we classify the top 12% of ranked images as
interesting. We chose the top 12% as this is slightly higher than
the average proportion of interesting images, which is about
9%. It is important to note that since we generate a ranking
of the frames, choosing only the top 12% of images has no particular
significance, as the official metric remains unaffected by it.</p>
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENTAL VALIDATION</title>
      <p>
        The training dataset consisted of 7396 frames extracted
from 78 movie trailers, with about 392,000 pairs of frames,
while the test data consisted of 2435 frames extracted from
30 movie trailers. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] gives complete information about the
preparation of the dataset. Scikit-learn [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was used to
implement and test various configurations.
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.1 Results and Discussion</title>
      <p>Our results on the development set for various approaches
are given in Table 1. The run submission results are given in
Table 2. The tables give the official metric, mean average precision
at 10 (MAP@10), for the different runs, corresponding
to the method of training and the classifier used.</p>
      <sec id="sec-7-1">
        <title>Development Set</title>
        <p>
          We experimented with the CNN features provided and used
PCA to bring the number of dimensions down to 200.
Additionally, we used a non-pairwise (NP) and a pairwise strategy
(P1, P2) for training and prediction, as described in the previous
section. These methods were used to train an SVM (RBF kernel)
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and logistic regression with an l1 penalty (LR-l1) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. These
decisions were made following inferences from previous results [
          <xref ref-type="bibr" rid="ref4 ref6">4,
6</xref>
          ]. We split the development set into a training set (62
videos) and a cross-validation set (16 videos). We calculated
MAP@10 on the validation set, tested the model with several
parameter settings, and chose the model
parameters giving the best MAP@10 results. We found that the
pairwise comparison strategy worked better than the
non-pairwise strategy: it gave a better MAP@10, which
was aligned with our expectation. Logistic regression
gave better results than SVM.
        </p>
        <p>Due to the large number of pairs involved in training, we could
not experiment with classifiers such as SVM in our proposed
approach (P1, P2) because of the large training time. We
experimented with the following classifiers: (1) logistic regression
with an l2 penalty, (2) random forest, and (3) logistic regression
with an l1 penalty. (3) gave slightly better results than the
other two and was the fastest to train, hence we went
with logistic regression with an l1 penalty as our classifier.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Test Set</title>
        <p>The results on the test set, however, were unexpected.
Logistic regression using pairwise comparisons gave the best
results on the development set for both tasks, but its performance
on the test set is not impressive: there, the best results are obtained by
non-pairwise logistic regression (Image subtask) and non-pairwise
SVM with an RBF kernel (Video subtask).</p>
        <p>There could be various possible reasons for the
discrepancy between the results on the development and test sets. (i) Viewing
the classifier as a neural network, it may require more fine-tuning
of the weights of the fc7 layer of AlexNet, or a more
complex network instead of a single neuron, so that it
generalizes better. (ii) Though improbable, it is possible that there are
some discrepancies between the sources of the development and test
sets which result in poor generalization.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>4. CONCLUSIONS</title>
      <p>
        In summary, we proposed a new system for interestingness
prediction in images and videos. It differs essentially in the
method of training, which is based on pairwise comparisons of images.
This helps in capturing the interestingness of an image in a
local context. Although our system gave impressive results on the
development set, it failed to perform well on the test set.
Possible improvements to the current system include increasing its
complexity or fine-tuning the last layer of AlexNet for a better
input representation. The efficiency of the training can also
be improved by selecting pairs more intelligently ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Bradley</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Terry</surname>
          </string-name>
          .
          <article-title>Rank analysis of incomplete block designs: I. the method of paired comparisons</article-title>
          .
          <source>Biometrika</source>
          ,
          <volume>39</volume>
          (
          <issue>3/4</issue>
          ):
          <fpage>324</fpage>
          -
          <lpage>345</lpage>
          ,
          <year>1952</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjoberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-T.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gygli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. Q.</given-names>
            <surname>Duong</surname>
          </string-name>
          .
          <article-title>MediaEval 2017 Predicting Media Interestingness Task</article-title>
          .
          <source>In Proc. of the MediaEval 2017 Workshop</source>
          , Dublin, Ireland, Sept. 13-15,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Super fast event recognition in internet videos</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>17</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1174</fpage>
          -
          <lpage>1186</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parekh</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Parekh</surname>
          </string-name>
          .
          <article-title>The MLPBOON Predicting Media Interestingness System for MediaEval 2016</article-title>
          . In MediaEval,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. Q.</given-names>
            <surname>Duong</surname>
          </string-name>
          .
          <article-title>Technicolor@MediaEval 2016 Predicting Media Interestingness Task</article-title>
          .
          <source>In MediaEval</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Suykens</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Vandewalle</surname>
          </string-name>
          .
          <article-title>Least squares support vector machine classifiers</article-title>
          .
          <source>Neural processing letters</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ):
          <fpage>293</fpage>
          -
          <lpage>300</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bourdev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torresani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Paluri</surname>
          </string-name>
          .
          <article-title>Learning spatiotemporal features with 3d convolutional networks</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision</source>
          , pages
          <fpage>4489</fpage>
          -
          <lpage>4497</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.-F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-L.</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>Dual coordinate descent methods for logistic regression and maximum entropy models</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>85</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>41</fpage>
          -
          <lpage>75</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>