<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodality and Deep Learning when predicting Media Interestingness</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eloïse Berson</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claire-Hélène Demarty</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ngoc Q. K. Duong</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper summarizes the computational models that Technicolor proposes for predicting the interestingness of images and videos within the MediaEval 2017 Predicting Media Interestingness Task. Our systems are based on deep learning architectures and exploit both semantic and multimodal features. Based on the obtained results, we discuss our findings and draw some scientific perspectives for the task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Understanding the interestingness of media content, such as images
and videos, has recently gained significant attention from the research
community, as it offers numerous practical applications in,
e.g., content selection or recommendation [
        <xref ref-type="bibr" rid="ref1 ref2 ref5">1, 2, 5</xref>
        ]. Following the
success of the 2016 edition [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the MediaEval 2017 Predicting Media
Interestingness Task extends the benchmark to larger datasets,
annotated with a greater human annotation effort. A complete
description of the task can be found in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        For both subtasks, Technicolor’s motivation was to build
incrementally from last year’s systems [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], i.e., to re-use similar features
and DNN architectures while adding some contextual information
about the content. To this end, two new features were added to
capture additional semantic information related to the content,
following an idea similar to that of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These new features (Section 2) were
expected to bring contextual information about the content. In
a second step (Section 3), and for the video subtask only, we
experimented with embedding this semantic information at different
network levels, so as to investigate how the embedding level
influences the temporal modeling of this new information.
      </p>
    </sec>
    <sec id="sec-2">
      <title>MULTIMODALITY AND CONTEXTUAL FEATURES</title>
      <p>
        As in 2016, CNN features from the fc7 layer of the pre-trained
CaffeNet model (image modality, both subtasks) and MFCCs
concatenated with their first and second derivatives (audio modality, video
subtask) were extracted following the protocol described in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
The dimensions of these features are 4096 and 180, respectively.
      </p>
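      <p>
        For illustration, the concatenation of MFCCs with their derivatives can
be sketched as follows. This is not the exact extraction protocol of [11]: it
assumes 60 cepstral coefficients per frame and uses crude frame-to-frame
differences instead of regression-based deltas, which is enough to see why
the dimension triples to 180.
      </p>
      <preformat>
```python
import numpy as np

def add_deltas(mfcc):
    """Concatenate MFCCs with first- and second-order deltas.

    mfcc: array of shape (n_frames, n_coeffs). The deltas here are
    simple frame-to-frame differences, a crude stand-in for the
    regression-based deltas of standard audio toolkits.
    """
    delta = np.diff(mfcc, axis=0, prepend=mfcc[:1])
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])
    return np.concatenate([mfcc, delta, delta2], axis=1)

feats = add_deltas(np.random.randn(100, 60))
print(feats.shape)  # (100, 180)
```
      </preformat>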
      <p>
        To capture some additional semantic information, Image-Captioning
Based (ICB) features [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] were computed for each image or frame,
depending on the subtask. These features correspond to the projection
of an image in a visual-semantic embedding space [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], obtained
from a jointly-trained model for images and captions dedicated
to automatic captioning. In this embedded space, where
semantic distances between projected image and captioning features are
minimized, the resulting representation features are more likely to
contain semantic information than the CNN features alone.
Dimension of the ICB feature is 1024.
      </p>
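      <p>
        The idea of comparing image and caption representations in a shared
space can be sketched as follows. The projection matrices here are random
placeholders, not the jointly trained model of [7]; only the dimensions
(4096-d CNN input, 300-d text input, 1024-d joint space) follow the text.
      </p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder linear projections into a shared 1024-d space;
# in the real ICB model these are learned jointly from images
# and their captions.
W_img = rng.standard_normal((1024, 4096)) * 0.01
W_txt = rng.standard_normal((1024, 300)) * 0.01

def embed(x, W):
    v = W @ x
    return v / np.linalg.norm(v)      # unit-normalize

img_feat = rng.standard_normal(4096)  # e.g. a CNN fc7 vector
txt_feat = rng.standard_normal(300)   # e.g. a caption vector

# Cosine similarity in the joint space: training pushes matching
# image/caption pairs toward similarity 1.
sim = float(embed(img_feat, W_img) @ embed(txt_feat, W_txt))
print(embed(img_feat, W_img).shape)  # (1024,)
```
      </preformat>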
      <p>
        To go further in this vein of adding semantic and contextual
information, textual metadata was extracted directly from IMDb
(http://www.imdb.com), exploiting the fact that the MediaEval 2017
dataset was built from Hollywood-like movie extracts. Except for 3
movies (for 2 of them, a short summary was built from descriptions
found on the internet; for the last one, the description was left empty),
IMDb information was available: each movie description and/or
storyline was fed to the RAKE algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for keyword extraction.
Several keywords were thus extracted per movie, from which we
derived a textual feature of dimension 300, obtained classically by
averaging the Word2Vec [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] representations (pretrained on the GoogleNews dataset)
of these keywords.
      </p>
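      <p>
        The derivation of the textual feature can be sketched as follows. The
toy vocabulary stands in for the pretrained 300-d GoogleNews Word2Vec
model, and the RAKE keyword-extraction step itself is omitted.
      </p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for pretrained GoogleNews Word2Vec (300-d vectors);
# a real system would load the pretrained model instead.
vocab = {w: rng.standard_normal(300) for w in
         ["heist", "bank", "detective", "betrayal"]}

def movie_feature(keywords, vectors, dim=300):
    """Average the word vectors of the extracted keywords."""
    hits = [vectors[w] for w in keywords if w in vectors]
    if not hits:
        return np.zeros(dim)          # empty-description fallback
    return np.mean(hits, axis=0)

feat = movie_feature(["heist", "bank", "unknown"], vocab)
print(feat.shape)  # (300,)
```
      </preformat>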
    </sec>
    <sec id="sec-4">
      <title>DNN ARCHITECTURES</title>
      <p>
        Global workflows for all submitted runs and for both subtasks are
shown in Figure 1. As stated in the introduction, most components
used to build the systems’ architectures for both subtasks were the
same as in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Thus, to cope with the class imbalance of the dataset,
some resampling of the data was applied during training. Several
parameter configurations were investigated by splitting the dataset
into 80% for training and 20% for validation. A final retraining of the
best model was then applied on the complete development set.
      </p>
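      <p>
        The resampling step can be sketched as a simple random oversampling
of the minority class; the exact resampling scheme used is not detailed
here, so this balancing strategy is an illustrative assumption.
      </p>
      <preformat>
```python
import numpy as np

def oversample(X, y, seed=0):
    """Balance a binary dataset by resampling the minority class
    with replacement until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    if len(pos) > len(neg):
        minority, majority = neg, pos
    else:
        minority, majority = pos, neg
    extra = rng.choice(minority, size=len(majority) - len(minority),
                       replace=True)
    idx = np.concatenate([majority, minority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

X = np.random.randn(100, 8)
y = np.array([1] * 10 + [0] * 90)   # heavily imbalanced
Xb, yb = oversample(X, y)
print(int(yb.sum()), len(yb))       # 90 180
```
      </preformat>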
      <p>For the image subtask, different concatenations of the features
were investigated, to understand the contribution of each modality
and to assess the benefit of adding contextual information to the
task. Each submitted run thus differs from the others by its input
features, and by the adaptation of the layer sizes, while the DNN
architecture remains the same: a single MLP layer, with rectified
linear unit (ReLU) activation and a dropout of 0.5. All submitted
runs are summarized in Figure 1a, with different colors depending
on the feature concatenation; Run#1, corresponding to the 2016 best
system, serves as a baseline.</p>
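      <p>
        A minimal forward-pass sketch of such a single MLP layer with ReLU
activation and dropout 0.5; the hidden size of 256 and the particular
CNN plus ICB concatenation shown are illustrative assumptions, not the
submitted configuration.
      </p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_layer(x, W, b, drop=0.5, train=True):
    """One dense layer with ReLU then (inverted) dropout."""
    h = np.maximum(W @ x + b, 0.0)            # ReLU
    if train:
        mask = rng.random(h.shape) > drop     # keep with prob 1-drop
        h = h * mask / (1.0 - drop)
    return h

# Hypothetical input: CNN fc7 (4096) and ICB (1024) concatenated.
x = np.concatenate([rng.standard_normal(4096),
                    rng.standard_normal(1024)])
W = rng.standard_normal((256, 5120)) * 0.01   # 256 hidden units (assumed)
b = np.zeros(256)
h = mlp_layer(x, W, b)
print(h.shape)  # (256,)
```
      </preformat>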
      <p>
        For the video subtask, three levels of embedding for the W2V
features were investigated (see Figure 1b), except for Run#1, which
re-uses one of last year’s systems (Run#3 in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], see Figure 1 for the
layers used, each with a ReLU activation function followed
by a dropout of 0.5). Run#1’s architecture is kept for the other
runs, with some adaptation of the multimodal block depending on
the input feature sizes (one or two LSTM layers, with a residual
block). In Run#1, our baseline, only the audio and video modalities are
used. For the image channel, a first modality-specific learning step is
implemented with an MLP layer followed by an LSTM layer. For the
audio channel, a single LSTM layer is used. After merging, both channels
serve as input to two LSTM layers, with a residual part (ResNet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]).
      </p>
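      <p>
        The residual part can be sketched as follows; the transform f here
stands in for the stacked LSTM layers of the multimodal block, and the
dimension of 128 is an illustrative assumption.
      </p>
      <preformat>
```python
import numpy as np

def residual_block(x, f):
    """Residual connection around a transform f, as in ResNet [6]:
    the block outputs x + f(x), so f only has to learn a residual.
    Shapes of x and f(x) must match for the addition."""
    return x + f(x)

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128)) * 0.01

# f is a simple nonlinear map standing in for the LSTM layers.
out = residual_block(rng.standard_normal(128),
                     lambda v: np.tanh(W @ v))
print(out.shape)  # (128,)
```
      </preformat>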
      <p>In Run#2, the W2V features were simply merged with the result of the
temporal modeling of the other modalities, whereas for Run#3
and Run#4, they were duplicated for each frame and merged into
the workflow either in parallel to the audio and video channels
(Run#4) or after a first merging of these two modalities (Run#3)
(see Figure 1). For each run, some processing steps were
added to realize the merge with the other modalities, through
either additional LSTM-ResNet layers when temporal modeling was
possible (Runs#3 and 4), or simple MLP layers otherwise (Run#2).
These steps were followed by a simple concatenation of the features
from the different modalities. Run#5 is similar to Run#4 except that
the Time Domain Average and Softmax steps were swapped. The
motivation for this last run was to test whether the location of the
decision step (softmax) had an influence on the performance.</p>
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND DISCUSSION</title>
      <p>Results are summarized in Table 1. The first runs for both subtasks
show slightly improved MAP values compared to last year’s results.
As the systems remain the same for these runs over the two years,
this tends to show that the increase in dataset size and/or the refinement
of the annotations had an effect on the modeling performance.
Unexpectedly, the MAP@10 values are very low (lower than during
the validation process, when MAP@10 values were in the same
range as, or slightly lower than, the MAP values for both tasks). Another
unexpected result is that the contextual features, either ICB or W2V,
did not bring any improvement to the image subtask, although we
had reached the opposite conclusion during validation on the development
set, with MAP values of resp. 0.36 and 0.38 for those features (for
comparison, we obtained 0.31 with CNN features). This suggests
that using more features might have led to over-fitting, probably
because of the small size of the dataset during training. This
over-fitting might also have been reinforced by the fact that, because of a
lack of computation resources, cross-validation was done with one fold
only. In the future, a cross-validation process with more folds
might lead to a better system. However, once the test set is released,
further analysis of the differences between the development and
test sets should be done to better understand this observation.</p>
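      <p>
        For reference, the per-query component of MAP@10 can be sketched
as follows; MAP@10 averages this quantity over queries, and definitional
variants exist (e.g., in how the denominator is normalized), so this is
one common formulation rather than necessarily the official scorer.
      </p>
      <preformat>
```python
import numpy as np

def average_precision_at_k(ranked_labels, k=10):
    """AP@k for one query: average of precision@i over the
    relevant items appearing in the top-k ranked list."""
    labels = np.asarray(ranked_labels[:k], dtype=float)
    if labels.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(labels) / (np.arange(len(labels)) + 1)
    return float((precision_at_i * labels).sum() / labels.sum())

# Relevant items at ranks 1 and 3 of the top-10 list:
print(average_precision_at_k([1, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # 5/6
```
      </preformat>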
      <p>For the video subtask, as expected, the W2V features slightly
improved MAP and MAP@10 when considered as a frame-based
feature. Although they are simply repeated for each frame, i.e., each
frame of a given video shares the same textual feature, the
concatenation of this new information did bring some useful information
for the video subtask. This difference between the two subtasks
reinforces the difference between image and video interestingness
that was already stated last year. Run#5 of the video subtask
suggests that keeping the classification for the final step of the
system maximizes the performance, which is understandable
as it allows keeping continuous values as long as possible before
switching to a binary classification. It also corresponds better to the
annotation protocol, where the annotation is done for each video
segment as a whole; thus the softmax prediction should also be
done for the whole segment and not for every single frame.</p>
      <p>As a conclusion, many of our findings on the validation step
differed from those on the test set. We definitely need to
understand what differs between these two sets that is responsible for
the differences in performance. For example, some new, significantly longer
and thus more meaningful segments (243 out of 2435) were added
to the test set only, representing a duration of 46 min out of a total
duration of 87 min, i.e., more than half of the test set.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Sharon Lynn</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Elena A.</given-names>
            <surname>Fedorovskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Francis K. H.</given-names>
            <surname>Quek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Snyder</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>The effect of familiarity on perceived interestingness of images</article-title>
          . In
          <source>Human Vision and Electronic Imaging</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gabriel</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ngoc Q. K.</given-names>
            <surname>Duong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thanh-Toan</given-names>
            <surname>Do</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hanli</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Predicting Interestingness of Visual Content</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thanh-Toan</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Gygli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ngoc Q. K.</given-names>
            <surname>Duong</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>MediaEval 2017 Predicting Media Interestingness Task</article-title>
          .
          <source>MediaEval 2017 Workshop</source>
          (September 2017).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thanh-Toan</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hanli</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ngoc Q. K.</given-names>
            <surname>Duong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Frederic</given-names>
            <surname>Lefebvre</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>MediaEval 2016 Predicting Media Interestingness Task</article-title>
          .
          <source>MediaEval 2016 Workshop</source>
          (October 2016).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Helmut</given-names>
            <surname>Grabner</surname>
          </string-name>
          , Fabian Nater, Michel Druey, and Luc Van Gool.
          <year>2013</year>
          .
          <article-title>Visual Interestingness in Image Sequences</article-title>
          .
          <source>In Proceedings of the 21st ACM International Conference on Multimedia (MM '13)</source>
          . ACM, New York, NY, USA,
          <fpage>1017</fpage>
          -
          <lpage>1026</lpage>
          . https://doi.org/10.1145/2502081.2502109
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiangyu</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shaoqing</given-names>
            <surname>Ren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <source>arXiv preprint arXiv:1512.03385</source>
          (2015).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Ryan</given-names>
            <surname>Kiros</surname>
          </string-name>
          , Ruslan Salakhutdinov, and Richard S Zemel.
          <year>2014</year>
          .
          <article-title>Unifying visual-semantic embeddings with multimodal neural language models</article-title>
          .
          <source>arXiv preprint arXiv:1411.2539</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Bryan A.</given-names>
            <surname>Plummer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Brown</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Svetlana</given-names>
            <surname>Lazebnik</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Enhancing Video Summarization via Vision-Language Embedding</article-title>
          . In
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017)</source>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Xin</given-names>
            <surname>Rong</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>word2vec parameter learning explained</article-title>
          .
          <source>arXiv preprint arXiv:1411.2738</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Stuart</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dave</given-names>
            <surname>Engel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nick</given-names>
            <surname>Cramer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Wendy</given-names>
            <surname>Cowley</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Automatic keyword extraction from individual documents</article-title>
          .
          <source>Text Mining: Applications and Theory</source>
          (2010),
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Yuesong</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ngoc Q. K.</given-names>
            <surname>Duong</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Technicolor@MediaEval 2016 Predicting Media Interestingness Task</article-title>
          . In
          <source>MediaEval 2016 Workshop</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>