    GIBIS at MediaEval 2019: Predicting Media Memorability Task
                                                 Samuel Felipe dos Santos and Jurandy Almeida
                       GIBIS Lab, Instituto de Ciência e Tecnologia, Universidade Federal de São Paulo – UNIFESP
                                               12247-014, São José dos Campos, SP – Brazil
                                               {felipe.samuel,jurandy.almeida}@unifesp.br

ABSTRACT
This paper presents the GIBIS team's experience in the Predicting Media Memorability Task at MediaEval 2019. In this task, teams were asked to develop an approach for predicting a score that reflects whether videos are memorable or not, considering both short-term and long-term memorability. Our proposal relies on late fusion of multiple regression models learned with both hand-crafted and data-driven features and by different regression algorithms.

1    INTRODUCTION
People's experience of watching a video is essential to whether it is remembered or forgotten after a while. Due to this subjectivity, the challenging task of automatically predicting whether a video is memorable or not has attracted a lot of attention. Since 2018, the Predicting Media Memorability Task [4] at MediaEval has been challenging participants to assign a memorability score to a video reflecting its probability of being remembered. For this, a dataset of 10,000 short, soundless videos is provided, split into 8,000 videos for the development set and 2,000 videos for the test set. For more details about this task, please refer to [4].

In this paper, we describe the work developed by the GIBIS team in the context of the MediaEval 2019 Predicting Media Memorability Task. Our starting point was the approach we proposed last year [8]. Roughly speaking, it relies on regression models learned with hand-crafted and data-driven features and by different regression algorithms. This year, we focused on improving our previous approach by exploiting new features, regressors, and late fusion.

2    APPROACH
Both the short-term and long-term memorability subtasks were approached with the same strategies. The starting point for our proposal is the work of Savii et al. [8], where visual features were extracted from videos and then used to train regression models.

Different visual features were evaluated by our approach: (1) hand-crafted motion features extracted with HMP^1 (Histogram of Motion Patterns) [1] and (2) data-driven features learned with I3D^2 (Inflated 3D ConvNet) [3]. One limitation of I3D is its reduced capacity to capture subtle but long-term motion dynamics, as it requires breaking a video into small clips. Unlike I3D, HMP captures the motion dynamics of a video as a whole, not just of its parts.

HMP [1] captures video movement through the transitions between frames. For each frame, motion features are extracted from the video stream. After that, each feature is encoded as a unique pattern representing its spatio-temporal configuration. Finally, those patterns are accumulated to form a normalized histogram.

I3D [3] generalizes a 2D ConvNet into a 3D ConvNet. For that, the 2D convolutional filters of the Inception-V1 [5] architecture are inflated into 3D convolutions, thus adding a temporal dimension. The I3D model was first initialized by repeating and rescaling the weights of the Inception-V1 model pre-trained on ImageNet and then trained on the Kinetics Human Action Video Dataset^3 [3]. To extract the I3D features, the classification layers of this pre-trained model were replaced by a global average pooling layer. Next, each video was resized to 256×256 resolution and then split into 64-frame clips with an overlap of 32 frames between consecutive clips. After that, a single 224×224 center crop was extracted from each of those clips and passed through the network, producing multiple I3D features per video. Finally, two strategies were used to combine the clip-based features into a single video representation: (1) average, where the multiple I3D features are averaged; and (2) concatenation, where they are concatenated together.

Each of the above features was used as input to train different regression algorithms: (1) KNR (k-Nearest Neighbor Regressor) and (2) SVR (Support Vector Regression) [7]. The KNR and SVR implementations from the scikit-learn Python package^4 [7] were used for easy reproducibility. For training such regressors, we first divided the development set into training and validation sets with an 80%-20% split. Then, we randomly split the training set into n equal-size subsets and trained one regression model on each subset, thus obtaining n different regression models. Next, they were combined as an ensemble model to predict memorability scores for the videos in both the validation and test sets. For that, the final score was computed by averaging their individual scores, and the 95% confidence interval was used as the output confidence. In our experiments, the values tested for n were 1, 5, and 10. For KNR, the values tested for the parameter k were 1, 3, and 5. For SVR, we used an RBF kernel with the parameter ϵ set to 0.1, and values ranging from 0.5 to 16 with a step of 0.5 were tested for the C parameter.

Besides the individual predictions provided by different combinations of features and regressors, we also explored late fusion to combine the top-performing regression models learned with different features, by different regression algorithms, and using different hyperparameter settings. For that, we adopted the strategy proposed by Almeida et al. [2]. First, the individual regression models obtained with all the different configurations (i.e., combinations of features, regressors, and hyperparameter settings) were sorted in decreasing order of their performance on the validation set according to the official metric for the task.

^1 https://github.com/jurandy-almeida/hmp (As of September, 2019)
^2 https://github.com/deepmind/kinetics-i3d (As of September, 2019)
^3 In this work, we used the I3D model pre-trained on Kinetics with RGB data only.
^4 https://scikit-learn.org/ (As of September, 2019)

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

MediaEval’19, 27-29 October 2019, Sophia Antipolis, France
Then, each of those individual regression models was selected according to its rank, i.e., the best model first, the second best second, and so on. At each step, the next model was combined with all the previous ones by averaging their individual scores. This process was repeated until the performance degraded. At the end, the best set of regression models for the validation set was selected by this procedure and then used to predict memorability scores for the videos in the test set.

Finally, we evaluated the use of the I3D model as a quantile regressor instead of a feature extractor. For that, we changed its output layer to have only 3 neurons, representing the quantiles τ of 0.1, 0.5, and 0.9. The 0.5 quantile corresponds to the median and was taken as the memorability score, whereas the other two were used to calculate the output confidence. The resulting model was initialized with weights pre-trained on the Kinetics dataset and fine-tuned on the training set for 10 epochs with stochastic gradient descent using a learning rate of 0.1, a batch size of 20, and the quantile loss function [6].

3    RESULTS AND ANALYSIS
Five different runs were submitted for each subtask. They were configured as shown in Table 1. The first three runs refer to the best parameter setting for each combination of feature & regressor in isolation, the fourth run refers to late fusion of the top-performing feature & regressor combinations, and the last run refers to deep quantile regression with the I3D model. All the evaluated approaches were calibrated on the development set using a holdout method (80% train / 20% test). The evaluation metrics are Spearman's rank correlation, the Pearson correlation coefficient, and MSE (Mean Squared Error). The former is the official metric for the task.

Table 1: Configuration of the submitted runs.

   Subtask        Run   Configuration
   Long-term       1    HMP & KNR(k = 1) with n = 1
   memorability    2    I3D^average_feature & KNR(k = 5) with n = 10
                   3    I3D^concatenation_feature & KNR(k = 5) with n = 10
                   4    Late Fusion^5 (same as run 2)
                   5    I3D_regressor
   Short-term      1    HMP & KNR(k = 3) with n = 10
   memorability    2    I3D^average_feature & KNR(k = 5) with n = 5
                   3    I3D^concatenation_feature & SVR(C = 16) with n = 1
                   4    Late Fusion (of the six best combinations):
                          I3D^average_feature & KNR(k = 5) with n = 5
                          I3D^average_feature & KNR(k = 3) with n = 5
                          I3D^average_feature & KNR(k = 3) with n = 10
                          I3D^average_feature & SVR(C = 1) with n = 1
                          I3D^average_feature & SVR(C = 0.5) with n = 1
                          I3D^average_feature & SVR(C = 10) with n = 10
                   5    I3D_regressor

^5 Run 4 of the long-term memorability subtask was not submitted, since no performance gain was obtained by combining the best model in isolation with the other ones; it is therefore identical to run 2.

Table 2 presents the results for the development and test sets in the long-term memorability subtask. Our best result on the development set was obtained by I3D^average_feature using an ensemble of n = 10 KNR(k = 5), achieving a Spearman value of 0.213. In contrast, I3D^concatenation_feature with an ensemble of n = 10 KNR(k = 5) achieved the best result on the test set, yielding a Spearman value of 0.199.

Table 2: Long-term memorability results.

   Set        Run   Spearman   Pearson   MSE
   Dev. Set    1      0.091     0.101    0.04
               2      0.213     0.219    0.02
               3      0.189     0.203    0.02
               4      0.213     0.219    0.02
               5      0.071     0.077    0.02
   Test Set    1      0.015     0.019    0.04
               2      0.197     0.214    0.02
               3      0.199     0.214    0.02
               4      0.197     0.214    0.02
               5      0.111     0.137    0.02

Table 3 presents the results for the development and test sets in the short-term memorability subtask. Our best result on both sets was obtained by the late fusion of the six best models among all the combinations of features & regressors, achieving a Spearman value of 0.453 on the development set and 0.438 on the test set.

Table 3: Short-term memorability results.

   Set        Run   Spearman   Pearson   MSE
   Dev. Set    1      0.215     0.256    0.01
               2      0.434     0.474    0.01
               3      0.416     0.454    0.01
               4      0.453     0.491    0.01
               5      0.262     0.281    0.01
   Test Set    1      0.249     0.259    0.01
               2      0.417     0.460    0.01
               3      0.398     0.443    0.01
               4      0.438     0.477    0.01
               5      0.247     0.250    0.01

4    DISCUSSION AND OUTLOOK
In general, I3D performed better than HMP as a feature extractor. The results for I3D^average_feature and I3D^concatenation_feature were similar, with a small advantage to the former. One direction for future work is to analyze smarter strategies for combining the clip-based features extracted with the I3D model, such as RNNs (Recurrent Neural Networks).

Late fusion of the top-performing models achieved our best results in the short-term subtask, but for the long-term subtask it did not lead to a performance gain. As future work, we also plan to evaluate different fusion strategies, for instance, using an SVM to learn how to combine features and regressors effectively.

The performance of using the I3D model as a regressor was lower than expected. One of the reasons might be the small volume of data available for training and/or our choices of hyperparameters, since they were chosen arbitrarily. We intend to conduct a deeper investigation of strategies to overcome those issues in the future.

ACKNOWLEDGMENTS
This research was supported by the São Paulo Research Foundation - FAPESP (grant #2018/21837-0), the FAPESP-Microsoft Research Virtual Institute (grant #2017/25908-6), and the Brazilian National Council for Scientific and Technological Development - CNPq (grants #423228/2016-1 and #313122/2017-2).
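For reproducibility, the subset-ensemble scheme of Section 2 (split the training set into n equal-size subsets, train one KNR per subset, average the n scores, report a 95% confidence interval) can be sketched as follows. This is an illustrative sketch, not the authors' released code: the random features stand in for the real HMP/I3D descriptors, and the helper names are ours.

```python
# Sketch of the subset-ensemble training from Section 2 (illustrative,
# with random stand-ins for the real HMP/I3D features).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Stand-ins for real features and memorability scores in [0, 1].
X_train = rng.normal(size=(800, 32))
y_train = rng.uniform(size=800)
X_test = rng.normal(size=(20, 32))

def train_subset_ensemble(X, y, n=10, k=5):
    """Train one KNR(k) on each of n random equal-size subsets."""
    idx = rng.permutation(len(X))
    return [KNeighborsRegressor(n_neighbors=k).fit(X[s], y[s])
            for s in np.array_split(idx, n)]

def predict_ensemble(models, X):
    """Average the n individual scores; also return the half-width of a
    95% confidence interval over the n predictions as the confidence."""
    preds = np.stack([m.predict(X) for m in models])  # shape (n, len(X))
    mean = preds.mean(axis=0)
    ci95 = 1.96 * preds.std(axis=0, ddof=1) / np.sqrt(len(models))
    return mean, ci95

models = train_subset_ensemble(X_train, y_train, n=10, k=5)
scores, conf = predict_ensemble(models, X_test)
print(scores.shape, conf.shape)  # (20,) (20,)
```

Averaging over models trained on disjoint subsets both regularizes the prediction and yields a natural per-video confidence estimate from the spread of the n scores.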
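The rank-based greedy late fusion of Section 2 can likewise be sketched in a few lines. This is an illustrative reconstruction of the strategy (sort models by validation Spearman, add one model at a time to a running average, stop once the validation score no longer improves), not the authors' code; the synthetic validation predictions below are ours.

```python
# Sketch of the rank-based greedy late fusion from Section 2
# (illustrative; synthetic validation predictions).
import numpy as np
from scipy.stats import spearmanr

def greedy_late_fusion(val_preds, val_true):
    """val_preds: dict name -> 1-D validation predictions.
    Returns the names of the selected models, best-ranked first."""
    ranked = sorted(val_preds,
                    key=lambda m: spearmanr(val_true, val_preds[m])[0],
                    reverse=True)
    selected = [ranked[0]]
    best = spearmanr(val_true, val_preds[ranked[0]])[0]
    for name in ranked[1:]:
        fused = np.mean([val_preds[m] for m in selected + [name]], axis=0)
        score = spearmanr(val_true, fused)[0]
        if score <= best:  # performance stopped improving: halt
            break
        best, selected = score, selected + [name]
    return selected

# Tiny demonstration: two noisy-but-complementary models and one bad one.
y = np.linspace(0.1, 1.0, 10)
noise = 0.05 * np.resize([1, -1], 10)
preds = {"A": y + noise, "B": y - noise, "C": -y}
selected = greedy_late_fusion(preds, y)
print(selected)  # the bad model "C" is never selected
```

Because the two complementary models average out each other's noise, fusing them improves the validation Spearman, while adding the bad model does not, so the greedy procedure stops before including it.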


REFERENCES
[1] J. Almeida, N. J. Leite, and R. S. Torres. 2011. Comparison of Video
    Sequences with Histograms of Motion Patterns. In IEEE International
    Conference on Image Processing (ICIP’11). Brussels, Belgium, 3673–
    3676.
[2] J. Almeida, D. C. G. Pedronette, B. C. Alberton, L. P. C. Morellato,
    and R. S. Torres. 2016. Unsupervised Distance Learning for Plant
    Species Identification. IEEE Journal of Selected Topics in Applied Earth
    Observations and Remote Sensing 9, 12 (2016), 5325–5338.
[3] J. Carreira and A. Zisserman. 2017. Quo Vadis, Action Recognition? A
    New Model and the Kinetics Dataset. In IEEE Conference on Computer
    Vision and Pattern Recognition (CVPR’17). Honolulu, HI, USA, 4724–
    4733.
[4] M. G. Constantin, B. Ionescu, C-H. Demarty, N. Q. K. Duong, X.
    Alameda-Pineda, and M. Sjöberg. 2019. The Predicting Media Memora-
    bility Task at MediaEval 2019. In Proc. of the MediaEval 2019 Workshop.
    Sophia Antipolis, France.
[5] S. Ioffe and C. Szegedy. 2015. Batch Normalization: Accelerating
    Deep Network Training by Reducing Internal Covariate Shift. In In-
    ternational Conference on Machine Learning (ICML’15). Lille, France,
    448–456.
[6] R. Koenker. 2005. Quantile Regression. Cambridge University Press.
[7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O.
    Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. VanderPlas,
    A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay.
    2011. Scikit-learn: Machine Learning in Python. Journal of Machine
    Learning Research 12 (2011), 2825–2830.
[8] R. M. Savii, S. F. dos Santos, and J. Almeida. 2018. GIBIS at MediaEval
    2018: Predicting Media Memorability Task. In Proc. of the MediaEval
    2018 Workshop. Sophia Antipolis, France.