<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Technicolor@MediaEval 2016 Predicting Media Interestingness Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yuesong Shen</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claire-Hélène Demarty</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ngoc Q. K. Duong</string-name>
        </contrib>
        <aff>École polytechnique, France</aff>
        <aff>Technicolor</aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This paper presents the work done at Technicolor regarding the MediaEval 2016 Predicting Media Interestingness Task, which aims at predicting the interestingness of individual images and video segments extracted from Hollywood movies. We participated in both the image and video subtasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The MediaEval 2016 Predicting Media Interestingness Task
aims at predicting the level of interestingness of multimedia
content, i.e., frames and/or video excerpts from
Hollywood-type movies. The task is divided into two subtasks depending
on the type of content, i.e., images or video segments. A
complete description of the task can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>For the image subtask, Technicolor's contribution is
twofold: a Support Vector Machine (SVM)-based system is
compared with several deep neural network (DNN) structures,
namely multi-layer perceptrons (MLP), residual networks (ResNet)
and highway networks. For the video subtask, we compare
different DNN-based systems, including an existing structure that combines
Long Short-Term Memory (LSTM) with ResNet blocks, and
a proposed architecture named Circular State-Passing
Recurrent Neural Network (CSP-RNN).</p>
      <p>The paper is divided into two main parts corresponding to
the two subtasks. Before this, Section 2 describes
the features used. In each subtask's section, we present the
systems built, then give some details on the derived runs.
The results for the two subtasks are discussed in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. FEATURES</title>
      <p>
        For both subtasks, input features for the visual modality
are the CNN features extracted from the fc7 layer of the
pre-trained CaffeNet model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. They were computed for
each image of the dataset, after the image was resized so that
its smaller dimension equals 227 pixels (preserving the aspect
ratio) and then center-cropped. The mean image that comes with the
Caffe model was also subtracted for normalization purposes.
The final feature dimension is 4096 per image, or per video
frame for the video subtask.
      </p>
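      <p>The geometry of this preprocessing can be sketched as follows. This is only an illustration: the function names are hypothetical, the 1920x1080 frame size is a placeholder, and the 227x227 crop size is assumed from the stated resize target. Pixel resampling and mean-image subtraction are left out.</p>
      <preformat>
```python
def resized_size(width: int, height: int, target: int = 227) -> tuple:
    """Scale (width, height) so that the smaller side equals `target`."""
    scale = target / min(width, height)
    # Round to the nearest pixel; max() guards against rounding below `target`.
    return max(target, round(width * scale)), max(target, round(height * scale))


def center_crop_box(width: int, height: int, crop: int = 227) -> tuple:
    """Return (left, top, right, bottom) of a centered crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


# Example with a hypothetical 1920x1080 frame.
w, h = resized_size(1920, 1080)
box = center_crop_box(w, h)
print((w, h), box)  # (404, 227) (88, 0, 315, 227)
```
      </preformat>
      <p>Resizing before cropping keeps the aspect ratio intact; only the borders along the longer side are discarded.</p>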
      <p>For the audio modality, 60 Mel Frequency Cepstral
Coefficients (MFCCs) concatenated with their first and second
derivatives were extracted from windows of size 80 ms
centered around each frame, resulting in an audio feature vector
of size 180 per video frame. The mean value of the feature
vector over the whole dataset was then subtracted for
normalization purposes.</p>
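      <p>A minimal sketch of how such a 180-dimensional vector can be assembled is given below. It assumes that the MFCCs are already available as a NumPy array and approximates the derivatives with finite differences (np.gradient); the paper does not state how the derivatives were computed. The array contents are random placeholders, and the mean is taken over this toy array rather than over a whole dataset.</p>
      <preformat>
```python
import numpy as np


def stack_with_deltas(mfcc: np.ndarray) -> np.ndarray:
    """Concatenate MFCCs with first and second temporal derivatives.

    mfcc has shape (n_frames, n_coeffs); the result has shape
    (n_frames, 3 * n_coeffs).
    """
    delta = np.gradient(mfcc, axis=0)    # first derivative over time
    delta2 = np.gradient(delta, axis=0)  # second derivative over time
    return np.concatenate([mfcc, delta, delta2], axis=1)


rng = np.random.default_rng(0)
mfcc = rng.normal(size=(50, 60))   # placeholder MFCCs: 50 frames, 60 coefficients
features = stack_with_deltas(mfcc)
features = features - features.mean(axis=0)  # subtract the mean feature vector
print(features.shape)  # (50, 180)
```
      </preformat>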
    </sec>
    <sec id="sec-3">
      <title>3. IMAGE SUBTASK</title>
      <p>For the image subtask, the philosophy was to experiment
with several DNN structures and to compare them with an
SVM-based baseline. For all system types and both subtasks
(image and video), the best parameter configurations were
chosen by splitting the development set (either MediaEval data
or external data) into a training set (80%) and a validation
set (20%). As the MediaEval development dataset is not
very large, we performed a final training on the whole
development set when building the final models, except for the
CSP-based runs, for which this final training was omitted.</p>
      <p>SVM-based system. We tested SVMs with different
kernels (linear, RBF and polynomial) and with different
parameter settings on the development dataset. We observed
that the validation accuracy varies from one run to another
(even for the same parameter setting) as the training
samples change due to the random partition into training
and validation sets. This may suggest that the dataset is
not large enough. Also, because of the class imbalance, the
validation accuracy used during training makes it difficult to
choose the best SVM configuration during grid search, which
targets the optimization of the official MAP metric.</p>
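      <p>The mismatch between accuracy and a ranking metric under class imbalance can be illustrated with a toy example. The labels and scores below are made up, and the average precision function is a generic textbook definition, not the official evaluation tool of the task.</p>
      <preformat>
```python
def average_precision(labels, scores):
    """Average precision of a ranking: mean of the precision at each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy data: 2 interesting samples out of 10 (hypothetical values).
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
scores = [0.40, 0.45, 0.30, 0.20, 0.35, 0.10, 0.15, 0.05, 0.25, 0.12]

# Thresholding at 0.5 predicts "not interesting" for every sample.
predictions = [1 if s >= 0.5 else 0 for s in scores]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(accuracy)                           # 0.8
print(average_precision(labels, scores))  # (1/2 + 2/3) / 2
```
      </preformat>
      <p>Here a classifier that never predicts the interesting class still reaches 80% accuracy, while the average precision depends only on where the interesting samples fall in the ranking.</p>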
      <p>
        MLP-based system. Several variations of network
structures were tested, with different numbers of layers, layer
sizes, activation functions and topologies, among which
simple MLP, residual [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and highway [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] networks. These
structures were first trained on a balanced dataset of 200,000
images, extracted using the Flickr interestingness API [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
As this API uses some social metadata associated with the
content, it may lead to a social definition of interestingness,
instead of a more content-driven one, which may
bias the system performance on the official dataset. The
best performance in terms of accuracy on the Flickr dataset
was obtained by a simple structure: a first dense layer of
size 1000 with rectified linear unit (ReLU) activation and
a dropout of 0.5, followed by a final softmax layer. This
structure was then re-trained on the MediaEval image
development dataset, with the addition of resampling
or upsampling steps on the training data to cope with the
imbalance of the two classes in the official dataset.
During resampling, a training sample is selected randomly from
one class or the other according to a probability fixed
beforehand. Upsampling consists of putting multiple occurrences
of each interesting sample into the list of training data,
so that interesting samples may be used multiple
times during training. Different resampling probabilities
(0.3 to 0.6 for the interesting class) and upsampling
proportions (5 to 13 times more interesting samples) were tested.
      </p>
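      <p>The two balancing strategies can be sketched as follows. This is one possible reading of the description above: the function names, the sample identifiers and the number of draws are hypothetical, and the upsampling factor is interpreted as the number of copies of each interesting sample.</p>
      <preformat>
```python
import random


def resample(interesting, non_interesting, p_interesting, n_draws, seed=0):
    """Draw training samples, picking the interesting class with a fixed probability."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n_draws):
        pool = interesting if p_interesting > rng.random() else non_interesting
        batch.append(rng.choice(pool))
    return batch


def upsample(interesting, non_interesting, factor):
    """Put `factor` occurrences of every interesting sample in the training list."""
    return interesting * factor + non_interesting


interesting = ["i0", "i1"]                       # placeholder sample identifiers
non_interesting = ["n%d" % k for k in range(18)]

resampled = resample(interesting, non_interesting, p_interesting=0.6, n_draws=1000)
upsampled = upsample(interesting, non_interesting, factor=9)
print(len(upsampled))  # 2 * 9 + 18 = 36
```
      </preformat>
      <p>Resampling controls the class ratio seen during training directly, whereas upsampling changes it only through the number of repeated copies.</p>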
      <p>Run #1: SVM-based. The SVM implementation of the Python Scikit-learn
package<sup>1</sup> is used with an RBF kernel, gamma = 1/n_features, C = 0.1 and
default parameter settings elsewhere. An upsampling
strategy that enlarges the set of interesting samples by a factor of 11 is
applied.</p>
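      <p>A configuration of this kind might look as follows in Scikit-learn. The feature matrices are random placeholders, the class sizes are arbitrary, and ranking test samples by the decision function is an assumption, as the paper does not say how ranking scores were derived.</p>
      <preformat>
```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_features = 4096                                      # fc7 feature dimension
X_pos = rng.normal(0.5, 1.0, size=(10, n_features))    # placeholder "interesting" features
X_neg = rng.normal(0.0, 1.0, size=(90, n_features))    # placeholder "non-interesting" features

# Upsample the interesting class by a factor of 11.
X = np.vstack([np.repeat(X_pos, 11, axis=0), X_neg])
y = np.array([1] * (10 * 11) + [0] * 90)

# gamma="auto" corresponds to 1 / n_features in scikit-learn.
clf = SVC(kernel="rbf", gamma="auto", C=0.1)
clf.fit(X, y)

# Signed distances to the decision boundary can serve as ranking scores.
scores = clf.decision_function(X_neg[:5])
print(scores.shape)  # (5,)
```
      </preformat>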
      <p>Run #2: MLP-based. A simple structure with 2 layers
of sizes (1000, 2) (described above) was selected for its
performance on the Flickr dataset. The best performance with this
structure was obtained with a learning rate of 0.1, a decay
rate of 0.1, the ReLU activation function, the Adadelta optimization
method and a maximum of 10 epochs. Resampling with a
probability of 0.6 for the interesting class gives the best MAP
value on the MediaEval development set.</p>
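      <p>The forward pass of such a two-layer structure can be written in a few lines of NumPy. This is a sketch only: the weights are random placeholders rather than trained parameters, training with Adadelta is not shown, and the inverted-dropout formulation is an assumption, since the framework used is not stated.</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized weights stand in for trained parameters.
W1 = rng.normal(0.0, 0.01, size=(4096, 1000))
b1 = np.zeros(1000)
W2 = rng.normal(0.0, 0.01, size=(1000, 2))
b2 = np.zeros(2)


def forward(x, train=False, drop_rate=0.5):
    """Dense layer of size 1000 with ReLU, dropout, then a 2-way softmax."""
    h = np.maximum(x @ W1 + b1, 0.0)                 # ReLU activation
    if train:
        # Inverted dropout: zero units at random and rescale the others.
        keep = rng.random(h.shape) >= drop_rate
        h = h * keep / (1.0 - drop_rate)
    logits = h @ W2 + b2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


x = rng.normal(size=(4, 4096))   # four placeholder fc7 feature vectors
probs = forward(x)
print(probs.shape)  # (4, 2)
```
      </preformat>
      <p>The probability assigned to the interesting class can then be used directly as a ranking score.</p>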
    </sec>
    <sec id="sec-4">
      <title>4. VIDEO SUBTASK</title>
      <p>Different DNN structures capable of handling the
temporal evolution of the data were tested, with varying size
and depth. We also investigated the performance of the
different modalities taken separately versus in a multimodal approach.</p>
      <sec id="sec-4-1">
        <title>Systems based on existing structures</title>
        <p>
          Different simple RNN and LSTM [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] structures were tested
by varying the number and size of their layers, as they are
well known to be able to handle the temporal aspect of the data.
We also experimented with the idea of ResNet (recently proposed
for CNNs) in our implementation with RNN and LSTM.
Monomodal systems (audio-only, visual-only) were also
compared to multimodal (audio+visual) ones. For the
latter, a mid-level multimodal fusion was implemented, i.e.,
each modality was first processed independently through one
or more layers before merging and processing by some
additional layers, as illustrated in Figure 1. The best structures
and sets of parameters were chosen while training on the
Flickr part of the Jiang dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Contrary to the
MediaEval dataset, this dataset contains 1200 longer videos, equally
balanced between the two classes. Once the structures and
parameters were chosen, an upsampling or resampling
process was applied while re-training on the MediaEval dataset.
        </p>
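        <p>The mid-level fusion pattern, together with a residual connection on the joint part, can be sketched as below. For brevity the sketch uses plain dense layers applied frame by frame instead of the recurrent layers used in the paper; all layer sizes and weights are illustrative placeholders.</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)


def dense(n_in, n_out):
    """Return a ReLU dense layer with random placeholder weights."""
    W = rng.normal(0.0, 0.01, size=(n_in, n_out))
    b = np.zeros(n_out)
    return lambda x: np.maximum(x @ W + b, 0.0)


# Layer sizes below are illustrative, not the values used in the paper.
visual_branch = dense(4096, 256)   # processes the fc7 features alone
audio_branch = dense(180, 64)      # processes the audio features alone
joint_layer = dense(256 + 64, 320)


def fuse(visual, audio):
    """Mid-level fusion: per-modality layers, concatenation, joint layer."""
    merged = np.concatenate([visual_branch(visual), audio_branch(audio)], axis=1)
    # Residual connection around the joint layer: output = input + F(input).
    return merged + joint_layer(merged)


frames_visual = rng.normal(size=(8, 4096))  # 8 frames of placeholder features
frames_audio = rng.normal(size=(8, 180))
print(fuse(frames_visual, frames_audio).shape)  # (8, 320)
```
        </preformat>
        <p>The residual sum requires the joint layer to preserve the dimension of the merged representation, which is why its output size equals the concatenated size here.</p>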
      </sec>
      <sec id="sec-4-2">
        <title>Run #3: LSTM-ResNet-based</title>
        <p>The best structure obtained while validating on the Jiang dataset corresponds
to Figure 1, with a multimodal processing part composed of
a residual block built upon 2 LSTM layers of size 436.
After re-training on the MediaEval dataset, upsampling with
a factor of 9 was applied to the input samples.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Systems based on the proposed CSP</title>
        <p>Figure 2 illustrates the philosophy of this new structure,
which can be seen as a generalization of the traditional RNN,
in which, at a time instance t, N samples of the input
sequence go through N recurrent nodes arranged in a circular
structure (which takes into account both the past and
the future over a temporal window of size N) to produce N
internal outputs. These outputs are then combined to form a
final output at t. This architecture targets a better modeling</p>
        <p><sup>1</sup> http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC</p>
      </sec>
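      <sec id="sec-4-7">
        <title>5. RESULTS AND DISCUSSION</title>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <caption>
            <p>Results in terms of MAP values.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Runs - image</th><th>MAP</th></tr>
            </thead>
            <tbody>
              <tr><td>Run #1</td><td></td></tr>
              <tr><td>Run #2</td><td></td></tr>
              <tr><td>Random baseline</td><td>0.1656</td></tr>
              <tr><th>Runs - video</th><th>MAP</th></tr>
              <tr><td>Run #3</td><td></td></tr>
              <tr><td>Run #4</td><td></td></tr>
              <tr><td>Run #5</td><td></td></tr>
              <tr><td>Random baseline</td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>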
        <p>The obtained results are reported in terms of MAP in
Table 1, together with baseline values computed from a random
ranking of the test data. At least on the image subtask, our
systems perform significantly better than the baseline. For
the video subtask, MAP values are lower, and we may wonder
whether this comes in part from the difficulty
of the task itself or from the dataset, which contains a significant
number of very short shots that were certainly difficult to
annotate.</p>
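        <p>A random-ranking baseline of this kind can be estimated by simulation, as in the sketch below. The class ratio, the number of trials and the function names are placeholders, and the official metric may aggregate average precision differently (for instance per movie), so the printed value is not comparable to Table 1.</p>
        <preformat>
```python
import random


def average_precision(ranked_labels):
    """Average precision of a list of 0/1 labels given in ranked order."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def random_baseline(labels, n_trials=2000, seed=0):
    """Mean average precision over random rankings of the labels."""
    rng = random.Random(seed)
    labels = list(labels)
    total = 0.0
    for _ in range(n_trials):
        rng.shuffle(labels)
        total += average_precision(labels)
    return total / n_trials


# Placeholder test set with 10% interesting samples (not the real class ratio).
labels = [1] * 10 + [0] * 90
print(random_baseline(labels))
```
        </preformat>
        <p>For a random ranking, the expected average precision stays close to the proportion of interesting samples, which gives a reference point for reading the MAP of a submitted run.</p>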
        <p>For the image subtask, we observed that simple SVM
systems perform similarly to (development set) and even slightly
better than (test set) more sophisticated DNNs, leading to
the conclusion that the dataset was probably not
large enough for DNN training. This is also supported by
our test on the external Flickr dataset containing 200,000
images, on which the DNNs reached more than 80% accuracy.</p>
        <p>For the video subtask, several conclusions may be drawn.
First, multimodality seems to benefit the task (this
was confirmed by some additional runs that were not submitted).
Second, the new CSP structure seems to be able to capture the
temporal evolution of the videos better than classic RNN
and more sophisticated LSTM-ResNet structures,
independently of the monomodal branches, which were the
same in both cases. These very first results support the need
for further testing of this new structure in the future.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>Flickr interestingness API</article-title>
          . https://www.flickr.com/services/api/flickr.interestingness.getList.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjoberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-T.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Q.</given-names>
            <surname>Duong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Lefebvre</surname>
          </string-name>
          .
          <article-title>MediaEval 2016 predicting media interestingness task</article-title>
          .
          <source>In Proc. of the MediaEval 2016 Workshop</source>
          , Hilversum, Netherlands, Oct. 20-21,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1512.03385</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Comput.</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          , Nov.
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guadarrama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <article-title>Caffe: Convolutional architecture for fast feature embedding</article-title>
          .
          <source>arXiv preprint arXiv:1408.5093</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Understanding and predicting interestingness of videos</article-title>
          .
          <source>In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence</source>
          ,
          <source>AAAI'13</source>
          , pages
          <fpage>1113</fpage>
          -
          <lpage>1119</lpage>
          . AAAI Press,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Greff</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Highway networks</article-title>
          .
          <source>CoRR, abs/1505.00387</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>