<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>GIBIS at MediaEval 2019: Predicting Media Memorability Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Samuel Felipe dos Santos</string-name>
          <email>felipe.samuel@unifesp.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jurandy Almeida</string-name>
          <email>jurandy.almeida@unifesp.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GIBIS Lab, Instituto de Ciência e Tecnologia, Universidade Federal de São Paulo - UNIFESP</institution>
          ,
          <addr-line>12247-014, São José dos Campos, SP</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>This paper presents the GIBIS team's experience in the Predicting Media Memorability Task at MediaEval 2019. In this task, teams were asked to develop an approach that predicts a score reflecting how memorable a video is, considering both short-term and long-term memorability. Our proposal relies on late fusion of multiple regression models learned with both hand-crafted and data-driven features and with different regression algorithms.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        People’s experience in watching a video is essential to whether it is
remembered or forgotten after a while. Due to this subjectivity,
the challenging task of automatically predicting whether a video
is memorable or not has attracted a lot of attention. Since 2018,
the Predicting Media Memorability Task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] at MediaEval has been
challenging participants to assign each video a memorability score
reflecting its probability of being remembered. For this purpose, the task provides
a dataset composed of 10,000 short, soundless videos, which are
split into 8,000 videos for the development set and 2,000 videos
for the test set. For more details about this task, please refer to [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In this paper, we describe the work developed by the GIBIS team
in the context of the MediaEval 2019 Predicting Media
Memorability Task. Our starting point was the approach we proposed last
year [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Roughly speaking, it relies on regression models learned
with hand-crafted and data-driven features and with different
regression algorithms. This year we focused on improving our previous
approach by exploiting new features, regressors, and late fusion.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>
        Both short-term and long-term memorability subtasks were
approached with the same strategies. The starting point for our
proposal is the work of Savii et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], where visual features were
extracted from videos and then used to train regression models.
      </p>
      <p>
        Different visual features were evaluated in our approach: (1)
hand-crafted motion features extracted with HMP<sup>1</sup> (Histogram of
Motion Patterns) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and (2) data-driven features learned with I3D<sup>2</sup>
(Inflated 3D ConvNet) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. One limitation of I3D is its restricted capacity to
capture subtle but long-term motion dynamics, as it requires
breaking a video into small clips. Unlike I3D, HMP captures the motion
dynamics of a video as a whole, and not just of its parts.
      </p>
      <p>
        HMP [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] characterizes the motion in a video through the transitions
between frames. For each frame, motion features are extracted from
the video stream. After that, each feature is encoded as a unique
pattern, representing its spatio-temporal configuration. Finally, those
patterns are accumulated to form a normalized histogram.
      </p>
      <sec id="sec-2-1">
        <title>1 https://github.com/jurandy-almeida/hmp (as of September 2019); 2 https://github.com/deepmind/kinetics-i3d (as of September 2019)</title>
        <p>
          I3D [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] generalizes a 2D ConvNet into a 3D ConvNet. For that, the 2D
convolutional filters of the Inception-V1 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] architecture are inflated
into 3D convolutions, thus adding a temporal dimension. The I3D
model was first initialized by repeating and rescaling the weights of
the Inception-V1 model pre-trained on ImageNet and then trained
on the Kinetics Human Action Video Dataset<sup>3</sup> [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. To extract the
I3D features, the classification layers of this pre-trained model were
replaced by a global average pooling layer. Next, each video was
resized to a resolution of 256×256 and then split into 64-frame clips
with an overlap of 32 frames between two consecutive clips. After
that, a single center crop of size 224×224 was extracted from each
of those clips and passed through the network, producing multiple
I3D features for each video. Finally, different strategies were used
to combine clip-based features into a single video representation:
(1) average, where the multiple I3D features are averaged; and (2)
concatenation, where they are concatenated together.
        </p>
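        <p>The clip sampling and the two aggregation strategies described above can be sketched as follows. This is only an illustrative sketch, not the code used in our experiments: the helper names clip_starts and aggregate are hypothetical, the feature size of 1024 is an assumption, random vectors stand in for the network outputs, and frames left over after the last full clip are assumed to be dropped.</p>
        <preformat>
```python
import numpy as np

def clip_starts(num_frames, clip_len=64, stride=32):
    """Start indices of clip_len-frame clips taken every `stride` frames.

    A stride of 32 with 64-frame clips gives the 32-frame overlap between
    consecutive clips. Trailing frames that do not fill a clip are dropped
    (an assumption of this sketch).
    """
    if clip_len > num_frames:
        return []
    return list(range(0, num_frames - clip_len + 1, stride))

def aggregate(clip_features, mode="average"):
    """Combine per-clip feature vectors into a single video descriptor."""
    feats = np.asarray(clip_features, dtype=float)
    if mode == "average":
        return feats.mean(axis=0)      # one vector, same size as a clip feature
    if mode == "concatenation":
        return feats.reshape(-1)       # clip features placed side by side
    raise ValueError("unknown mode: " + mode)

starts = clip_starts(num_frames=160)
# Random stand-ins for the I3D outputs (1024 is an assumed feature size).
clip_features = np.random.default_rng(0).normal(size=(len(starts), 1024))
video_avg = aggregate(clip_features, "average")
video_cat = aggregate(clip_features, "concatenation")
```
        </preformat>
        <p>Note that concatenation yields descriptors of equal length only when all videos produce the same number of clips; averaging has no such requirement.</p>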
        <p>
          Each of the above features was used as input to train different
regression algorithms: (1) KNR (k-Nearest Neighbor Regressor) and
(2) SVR (Support Vector Regression) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The KNR and SVR
implementations from the scikit-learn Python package<sup>4</sup> [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] were used
for easy reproducibility. To train such regressors, we first
divided the development set into training and validation sets, with
an 80%-20% split. Then, we randomly split the training set into
n equal-size subsets and trained one regression model for each
subset, thus obtaining n different regression models. Next, they were
combined as an ensemble model to predict memorability scores
for the videos in both the validation and test sets. For that, the final
score was computed by averaging their individual scores, and we
used the 95% confidence interval as the output confidence. In our
experiments, the values tested for n were 1, 5, and 10. For KNR, the
values tested for the parameter k were 1, 3, and 5. For SVR, we used
an RBF kernel with the parameter ϵ set to 0.1, and values ranging from
0.5 to 16 with a step of 0.5 were tested for the C parameter.
        </p>
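        <p>A minimal sketch of this subset-based ensemble with scikit-learn is given below. It is illustrative only: the helper names train_ensemble and predict_ensemble are hypothetical, the data are synthetic stand-ins for video features and memorability scores, and the 95% interval is computed with a normal approximation over the n individual predictions, which is an assumption since the exact formula is not detailed here.</p>
        <preformat>
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

def train_ensemble(X, y, make_model, n, seed=0):
    """Train one regressor per random, (near) equal-size subset of the data."""
    order = np.random.default_rng(seed).permutation(len(X))
    return [make_model().fit(X[idx], y[idx]) for idx in np.array_split(order, n)]

def predict_ensemble(models, X):
    """Mean score and half-width of a normal-approximation 95% interval."""
    preds = np.stack([m.predict(X) for m in models])   # (n_models, n_videos)
    mean = preds.mean(axis=0)
    if len(models) == 1:
        return mean, np.zeros_like(mean)               # no spread with one model
    sem = preds.std(axis=0, ddof=1) / np.sqrt(len(models))
    return mean, 1.96 * sem

# Synthetic stand-ins for video features and memorability scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 16))
y = rng.uniform(0.0, 1.0, size=500)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)               # 80%-20% holdout

knr = train_ensemble(X_train, y_train,
                     lambda: KNeighborsRegressor(n_neighbors=5), n=10)
svr = train_ensemble(X_train, y_train,
                     lambda: SVR(kernel="rbf", C=1.0, epsilon=0.1), n=5)
knr_score, knr_conf = predict_ensemble(knr, X_val)
svr_score, svr_conf = predict_ensemble(svr, X_val)
```
        </preformat>
        <p>In a hyperparameter search, the values of n, k, and C would be varied over the ranges listed above and each configuration scored on the validation set.</p>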
        <p>3 In this work, we used the I3D model pre-trained on Kinetics with RGB data only.
4 https://scikit-learn.org/ (as of September 2019)</p>
        <p>
          Besides individual predictions provided by different
combinations of features and regressors, we also explored late fusion for
combining the top performing regression models learned with
different features, by different regression algorithms, and using
different hyperparameter settings. For that, we adopted the strategy
proposed by Almeida et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. First, the individual regression models
obtained by all the different configurations (i.e., combinations of
features, regressors, and hyperparameter settings) were sorted in
decreasing order of their performance on the validation set
according to the official metric for the task. Then, each of those individual
regression models was selected according to its rank, i.e., the best
was the first, the second best was the second, and so on. At each
step, the next model was combined with all the previous ones by
averaging their individual scores. This process was repeated until
the performance degraded. At the end, the best set of regression
models for the validation set was selected by this procedure and
then used to predict memorability scores for videos in the test set.
        </p>
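        <p>The rank-based fusion procedure can be sketched as below. This is a simplified illustration under stated assumptions rather than the original implementation: the function names greedy_late_fusion and fuse are hypothetical, Spearman's rank correlation is used as the validation metric, the loop stops at the first step where the fused score drops, and noisy copies of synthetic ground truth stand in for model predictions.</p>
        <preformat>
```python
import numpy as np
from scipy.stats import spearmanr

def greedy_late_fusion(val_preds, y_val):
    """Return the indices of the selected models and the fused validation score."""
    def score(p):
        return spearmanr(p, y_val)[0]

    # Rank models by individual validation performance, best first.
    ranked = sorted(range(len(val_preds)),
                    key=lambda i: score(val_preds[i]), reverse=True)
    selected = [ranked[0]]
    best = score(val_preds[ranked[0]])
    for i in ranked[1:]:
        # Average the candidate with all previously selected models.
        fused = np.mean([val_preds[j] for j in selected + [i]], axis=0)
        s = score(fused)
        if best > s:          # performance degraded: stop
            break
        selected.append(i)
        best = s
    return selected, best

def fuse(preds, selected):
    """Average the predictions of the selected models."""
    return np.mean([preds[i] for i in selected], axis=0)

# Synthetic validation data: three models with increasing noise levels.
rng = np.random.default_rng(0)
y_val = rng.uniform(size=200)
val_preds = [y_val + rng.normal(scale=s, size=200) for s in (0.2, 0.3, 1.0)]
selected, fused_score = greedy_late_fusion(val_preds, y_val)
fused_val = fuse(val_preds, selected)
```
        </preformat>
        <p>The same selected indices would then be passed to fuse together with the test-set predictions of each model.</p>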
        <p>
          Finally, we evaluated the use of the I3D model as a quantile
regressor instead of a feature extractor. For that, we changed its output
layer to have only 3 neurons representing the quantiles τ of 0.1, 0.5,
and 0.9. The 0.5 quantile corresponds to the median and was taken
as the memorability score, whereas the other two were used to
calculate the output confidence. The resulting model was initialized
with weights pre-trained on the Kinetics dataset and fine-tuned on
the training set for 10 epochs with stochastic gradient descent using
a learning rate of 0.1, a batch size of 20, and the quantile loss function [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
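        <p>For reference, the quantile (pinball) loss over the three output neurons can be written in a few lines of NumPy. The sketch below is framework-independent and is not the training code itself: the function name quantile_loss is hypothetical, the numbers are made up for illustration, and using the gap between the 0.9 and 0.1 quantiles as a confidence proxy is an assumption.</p>
        <preformat>
```python
import numpy as np

QUANTILES = np.array([0.1, 0.5, 0.9])

def quantile_loss(y_true, y_pred, quantiles=QUANTILES):
    """Mean pinball loss; y_pred has one column per quantile.

    For an error e = y - q_hat, the loss is tau * e when e is positive
    and (tau - 1) * e otherwise, i.e. max(tau * e, (tau - 1) * e).
    """
    err = np.asarray(y_true, dtype=float)[:, None] - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(quantiles * err, (quantiles - 1.0) * err)))

# Made-up targets and three-quantile predictions for three videos.
y_true = np.array([0.80, 0.90, 0.70])
y_pred = np.array([[0.70, 0.80, 0.90],
                   [0.85, 0.90, 0.95],
                   [0.60, 0.75, 0.80]])
loss = quantile_loss(y_true, y_pred)
score = y_pred[:, 1]                   # median taken as the memorability score
spread = y_pred[:, 2] - y_pred[:, 0]   # one possible confidence proxy (assumption)
```
        </preformat>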
      </sec>
    </sec>
    <sec id="sec-3">
      <title>RESULTS AND ANALYSIS</title>
      <p>Five different runs were submitted for each subtask. They were
configured as shown in Table 1. The first three runs refer to the best
parameter setting for each combination of feature &amp; regressor in
isolation, the fourth run refers to the late fusion of the top performing
feature &amp; regressor combinations, and the last run refers to the
deep quantile regression with the I3D model. All the evaluated
approaches were calibrated on the development set using a holdout
method (80% train/20% test). The evaluation metrics are Spearman’s
rank correlation, the Pearson correlation coefficient, and MSE (Mean
Squared Error). The first is the official metric for the task.</p>
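      <p>The three evaluation metrics can be computed with SciPy and NumPy as in the short sketch below. The function name evaluate and the example scores are hypothetical and serve only to illustrate the calls; they are not results from our runs.</p>
      <preformat>
```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(y_true, y_pred):
    """Spearman's rank correlation, Pearson correlation, and MSE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "spearman": float(spearmanr(y_true, y_pred)[0]),
        "pearson": float(pearsonr(y_true, y_pred)[0]),
        "mse": float(np.mean((y_true - y_pred) ** 2)),
    }

# Made-up ground-truth and predicted memorability scores.
metrics = evaluate([0.9, 0.8, 0.7, 0.6], [0.85, 0.80, 0.60, 0.65])
```
      </preformat>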
      <sec id="sec-3-1">
        <title>5 Run 4 of the long-term memorability subtask was not submitted, since combining the best individual model with the other ones yielded no performance gain, making it identical to run 2.</title>
        <p>The concatenated I3D features with an ensemble of n = 10 KNR (k = 5) regressors achieved
the best result on the test set, yielding a Spearman value of 0.199.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research was supported by the São Paulo Research
Foundation - FAPESP (grant #2018/21837-0), the FAPESP-Microsoft
Research Virtual Institute (grant #2017/25908-6), and the Brazilian
National Council for Scientific and Technological Development -
CNPq (grants #423228/2016-1 and #313122/2017-2).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Leite</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Torres</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Comparison of Video Sequences with Histograms of Motion Patterns</article-title>
          .
          <source>In IEEE International Conference on Image Processing (ICIP'11)</source>
          . Brussels, Belgium,
          <fpage>3673</fpage>
          -
          <lpage>3676</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C. G.</given-names>
            <surname>Pedronette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Alberton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. P. C.</given-names>
            <surname>Morellato</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Torres</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Unsupervised Distance Learning for Plant Species Identification</article-title>
          .
          <source>IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing</source>
          <volume>9</volume>
          ,
          <issue>12</issue>
          (
          <year>2016</year>
          ),
          <fpage>5325</fpage>
          -
          <lpage>5338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition (CVPR'17)</source>
          . Honolulu, HI, USA,
          <fpage>4724</fpage>
          -
          <lpage>4733</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Q. K.</given-names>
            <surname>Duong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Alameda-Pineda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The Predicting Media Memorability Task at MediaEval 2019</article-title>
          .
          <source>In Proc. of the MediaEval 2019 Workshop</source>
          . Sophia Antipolis, France.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ioffe</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift</article-title>
          .
          <source>In International Conference on Machine Learning (ICML'15)</source>
          . Lille, France,
          <fpage>448</fpage>
          -
          <lpage>456</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Koenker</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <source>Quantile Regression</source>
          . Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>VanderPlas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          ),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Savii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. F.</given-names>
            <surname>dos Santos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Almeida</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>GIBIS at MediaEval 2018: Predicting Media Memorability Task</article-title>
          .
          <source>In Proc. of the MediaEval 2018 Workshop</source>
          . Sophia Antipolis, France.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>