                 Transfer learning for video memorability prediction
                                 Romain Cohendet, Claire-Hélène Demarty and Ngoc Q. K. Duong
                                                       Technicolor
      romain.cohendet@laposte.net,claire-helene.demarty@technicolor.com,quang-khanh-ngoc.duong@technicolor.com

ABSTRACT
This paper summarizes Technicolor's computational models to predict the memorability of videos within the MediaEval 2018 Predicting Media Memorability Task. Our systems are based on deep learning features and architectures, and exploit both semantic and multimodal features. Based on the obtained results, we discuss our findings and some scientific perspectives for the task.
1    INTRODUCTION
Understanding and predicting the memorability of media such as images and videos has recently gained significant attention from the research community. To facilitate the expansion of this research field, the Predicting Media Memorability Task was proposed at MediaEval 2018; it releases a large dataset of 10,000 videos manually annotated with memorability scores. A complete description of the task can be found in [4].
   In order to automatically predict "short-term" and "long-term" memorability (as referred to in the two proposed subtasks), we investigated different approaches, summarized in Figure 1. Our first two approaches were intended to serve as baselines for video memorability prediction systems: we re-used available high-performance models for image memorability (IM) prediction and applied them directly to video memorability (VM) prediction (Section 2). Our second set of approaches (Section 3) investigated different features, including multimodal ones. In a final approach (Section 4), instead of using an existing model as a fixed feature extractor, we fine-tuned an entire state-of-the-art ResNet model to adapt it to the task of memorability prediction.

Figure 1: Summary of our approaches for VM prediction. The input video is sampled into 7 images (one frame per second), which feed (i) the MemNet model [10] (Run #1), (ii) Houssaini's model [11] (Run #2), (iii) a deep visual embedding based model [6] (Run #3), (iv) a deep visual + text embedding based model [6] that additionally uses the video tags (Run #4), and (v) a fine-tuned ResNet model (Run #5).

   All the above models are frame-based. As input, we extracted seven frames (one per second) from each video, each frame being assigned the ground-truth score of its corresponding video. We then assessed the VM score of a given video by simply averaging the seven predicted frame-based scores.
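To make this frame-based pipeline concrete, the sketch below samples one frame per second and averages the frame-level predictions into a video-level score. It is only an illustration of the procedure described above: predict_frame_score stands for any of the frame-level models of the following sections, and OpenCV is assumed for video decoding.

```python
import cv2
import numpy as np

def sample_frames(video_path, n_frames=7):
    """Grab one frame per second (seven frames for the task videos)."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for sec in range(n_frames):
        cap.set(cv2.CAP_PROP_POS_MSEC, sec * 1000.0)  # seek to t = sec seconds
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def predict_video_memorability(video_path, predict_frame_score):
    """Average the frame-level scores to obtain the video-level (VM) score."""
    frames = sample_frames(video_path)
    scores = [predict_frame_score(f) for f in frames]
    return float(np.mean(scores))
```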
   When possible, we trained the models on the short-term or long-term memorability ground-truth scores, to build specific runs for the two subtasks. We also split the development set into 80% for training and 20% for validation. This random split was done at the video level, to ensure that all frames from a single video were kept in the same partition.
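Such a video-level split can be obtained, for example, with scikit-learn's GroupShuffleSplit, using the video identifier as the grouping key. This is a minimal sketch; the flat frame-level arrays are assumed to be precomputed.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# frame_features: (n_frames_total, feat_dim), frame_scores: (n_frames_total,)
# video_ids: one identifier per frame, repeated for the 7 frames of the same video
def video_level_split(frame_features, frame_scores, video_ids, val_size=0.2, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=val_size, random_state=seed)
    train_idx, val_idx = next(splitter.split(frame_features, frame_scores, groups=video_ids))
    # all frames of a given video end up on the same side of the split
    return train_idx, val_idx
```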
2    PRE-TRAINED IMAGE MEMORABILITY BASED APPROACHES
To construct a performance baseline for VM prediction, we tested two high-performance models available in the literature for IM prediction. Both were trained on the LaMem dataset [10], the largest dataset for IM to date (ca. 60,000 images from diverse sources).
   MemNet-based system. The first network for large-scale IM prediction was presented in [10]. Based on the assumption that memorability depends on both scenes and objects, the authors fine-tuned a convolutional neural network (CNN) pre-trained on both the ImageNet and Places databases. They showed that fine-tuned deep features outperform other features by a large margin. We used this model as is to generate memorability scores for our video frames, and averaged them to obtain the VM scores proposed in Run#1, identical for the two subtasks.
   CNN and Image captioning based system. A more recent model, which to our knowledge has obtained the best performance to date for IM prediction, was presented in [11]. It exploits both CNN-based and semantic Image captioning (IC)-based image features. The authors used the pre-trained VGG16 network for their CNN feature (extracted from the last layer), and a pre-trained IC model as an extractor for a more semantic image feature. The IC model builds an encoder consisting of a CNN and a long short-term memory recurrent network (LSTM) that learns a joint image-text embedding by projecting the CNN image feature and the word2vec representation of the image caption onto a 2D embedding space. Finally, the authors merged the two features using a Multilayer Perceptron (MLP). We also re-used this model as is, as a second baseline producing scores at frame level. Again, Run#2 is identical for both subtasks.

Copyright held by the owner/author(s).
MediaEval'18, 29-31 October 2018, Sophia Antipolis, France


3    DEEP SEMANTICS EMBEDDING-BASED MULTIMODAL APPROACHES
We tested different features for VM prediction, including video-dedicated and frame-based features. Video-dedicated features included C3D [13] and HMP [1]. Frame-based features were extracted on three key-frames per video and included color histograms, InceptionV3 features [12], LBP [8] and a set of aesthetic visual features [7]. Please refer to [4] for more details on these features, as provided by the task's organizers.
   Motivated by the finding that IC features perform well on both IM [11] and VM [5] prediction, we used the model proposed in [6] to extract additional IC features from the frames. We also took advantage of this model to extract an additional text embedding feature from the title provided with each video. As such a feature corresponds to a mapping of natural language words, i.e., a video description in our case, we expected it to improve our system's capacity to capture semantics. We generated a new multimodal (image-text) feature by simply concatenating the two previous IC-based image and text features.
   We then trained a simple MLP (with one hidden layer of 100 neurons) on top of each single feature, as well as on a concatenation of the three best non-IC features. Again, these are frame-based models. Each time, two versions of the network were trained, on the short-term and long-term scores respectively. Table 1 shows the performance of each individual system on the validation data. Based on these results, we kept the IC-image feature as input for Run#3 and the multimodal IC-(image+text) feature as input for Run#4, as the best performing features.
    Features                     short-term   long-term
    C3D                             .28          .126
    HMP                             .275         .114
    ColorHist                       .134         .05
    InceptionV3                     .16          .058
    LBP                             .267         .128
    Aesthetics                      .283         .127
    C3D+LBP+Aesthetics              .347         .128
    IC-image (Run#3)                .492         .22
    IC-(image+text) (Run#4)         .436         .222

Table 1: Results in terms of Spearman's correlation obtained by a simple MLP for different video-dedicated and frame-based features, on the validation dataset.
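For illustration, the per-feature setup behind Table 1 can be written with scikit-learn as follows. This is a minimal sketch rather than our exact implementation: MLPRegressor stands in for the one-hidden-layer MLP, the feature arrays and the video-level split are assumed to be precomputed, and max_iter is an arbitrary choice.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.neural_network import MLPRegressor

def train_and_evaluate(features, scores, train_idx, val_idx):
    """Fit an MLP with one hidden layer of 100 neurons on frame-level features
    and report Spearman's correlation on the validation part."""
    mlp = MLPRegressor(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
    mlp.fit(features[train_idx], scores[train_idx])
    val_pred = mlp.predict(features[val_idx])
    rho, _ = spearmanr(val_pred, scores[val_idx])
    return mlp, rho

# Multimodal variant (Run#4): concatenate the IC image and text embeddings,
# e.g. multimodal_feat = np.concatenate([ic_image_feat, ic_text_feat], axis=1),
# where ic_image_feat and ic_text_feat are (n_frames, d_img) and (n_frames, d_txt) arrays.
```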
4    FINE-TUNED RESNET101
As in [3, 10], where fine-tuned DNNs outperformed classical approaches, we tried a transfer learning approach, fine-tuning a state-of-the-art ResNet model for the problem of IM prediction.
   For this, we classically replaced the last fully connected layer of ResNet with a new one dedicated to our regression task of memorability prediction. This last layer was first trained alone for a few epochs (5), before re-training the complete network for more epochs. The following parameters were used: optimizer, Adam; batch size, 32. We used the mean squared error as the loss function, in line with our regression task. Some data augmentation was conducted: random center cropping of 224x224 after resizing of the original images, horizontal flipping, and a mean normalization computed on ImageNet. We trained on an augmented dataset composed of the 80% training split of the development set and LaMem (because of the latter, we normalized the scores from the two datasets).
   We fine-tuned two variants of ResNet: ResNet18 and ResNet101. We kept ResNet101 to generate the scores for Run#5, as it gave the best performance on the validation set. We did not train separate models for the short-term and long-term subtasks, due to time constraints. Note that, as LaMem images are provided with short-term memorability scores only, the network would still have been biased towards short-term memorability for long-term prediction had we done so, but at least we could have improved the performance by using the long-term memorability scores of the task dataset. Run#5 is therefore identical for both subtasks.
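A condensed PyTorch sketch of this two-stage fine-tuning is given below for clarity. It is not our exact training script: the data loader (batch size 32), the learning rate, the number of epochs in the second stage, and the LaMem/task score normalization are assumptions or simplifications.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Augmentation roughly matching the description above (the "random center
# cropping" policy is approximated here by a random crop after resizing).
train_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet101(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 1)   # regression head for memorability

criterion = nn.MSELoss()

def run_epochs(model, loader, params, epochs, lr=1e-4):
    # lr is an assumption; the paper only specifies Adam and batch size 32.
    optimizer = torch.optim.Adam(params, lr=lr)
    model.train()
    for _ in range(epochs):
        for images, scores in loader:          # loader yields (image batch, float score batch)
            optimizer.zero_grad()
            preds = model(images).squeeze(1)
            loss = criterion(preds, scores)
            loss.backward()
            optimizer.step()

# Stage 1: train the new last layer alone for a few epochs (5).
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
# run_epochs(model, train_loader, model.fc.parameters(), epochs=5)

# Stage 2: unfreeze and re-train the complete network for more epochs.
for p in model.parameters():
    p.requires_grad = True
# run_epochs(model, train_loader, model.parameters(), epochs=10)
# (train_loader is a DataLoader over the augmented LaMem + development-set frames.)
```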
5    RESULTS AND DISCUSSION
Results are summarized in Table 2. The results of the first two runs for the short-term subtask show that it is possible to achieve quite good results in VM prediction using models designed for IM prediction. This means that the memorability of a video is correlated, to some extent, with the memorability of its constituent frames. We may also note the poor performance of all models for the long-term subtask, compared to the short-term subtask. For runs #1, #2 and #5, this may be explained by the fact that training relied on LaMem, for which only short-term scores are available. It may also stem from the significantly lower number of annotations for the long-term scores in the task's dataset [4]. It may further indicate that there is a significant difference between short-term and long-term memorability, and that the latter might be more difficult to predict. However, these results also show that long-term memorability is correlated, though not perfectly, with short-term memorability. In accordance with the literature, the model of [11] performed a little better than the model of [10] for memorability prediction.

                    short-term mem.               long-term mem.
    Runs        Spearman        Pearson        Spearman        Pearson
               val.   test    val.   test    val.   test    val.   test
    1-MemNet   .397   .385    .414   .406    .195   .168    .188   .184
    2-CNN&IC   .401   .398    .405   .402    .201   .182    .199   .191
    3-IC       .492   .442    .501   .493    .22    .201    .233   .216
    4-Multi    .452   .418    .48    .451    .212   .208    .23    .228
    5-ResNet   .498   .46     .512   .491    .198   .219    .217   .217

Table 2: Official results on the test set, and results on the validation set. (Official metric: Spearman's correlation.)

   Runs #3 and #4 perform better than runs #1 and #2. As in [11] and [5], IC features performed well for memorability prediction, especially when fine-tuned on the new dataset (Run#3 can be seen as a fine-tuned version of Run#2). Indeed, IC features convey high-level semantics: high-level visual attributes and scene semantics (actions, movements, appearance of objects, emotions, etc.) have been found to be linked to memorability [9, 10]. The results also show that training on long-term scores helps improve the performance for long-term memorability. The multimodal approach gave slightly worse results than the IC features alone. However, due to time constraints, we did not perform any parameter optimization to deal with the possible redundancy between the IC image and text embedding features.
   The most accurate memorability predictions were obtained by the fine-tuned ResNet101, which confirms that transfer learning from an image classification problem to another task such as memorability prediction works well. It also validates the quality of the dataset, at least for the short-term annotations. As perspectives, it will be interesting to test systems that incorporate the temporal evolution of the videos, e.g., motion information, or recent architectures such as TCN [2], to see how they improve the performance.
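For reference, the two correlation measures reported in Table 2 can be computed directly with SciPy; in this minimal sketch, pred and truth stand for the predicted and ground-truth memorability scores of one run on one subtask.

```python
from scipy.stats import pearsonr, spearmanr

def correlation_metrics(pred, truth):
    """Official metric (Spearman's rank correlation) and complementary Pearson correlation."""
    rho, _ = spearmanr(pred, truth)   # rank correlation (official metric)
    r, _ = pearsonr(pred, truth)      # linear correlation
    return rho, r
```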


REFERENCES
 [1] Jurandy Almeida, Neucimar J Leite, and Ricardo da S Torres. 2011.
     Comparison of video sequences with histograms of motion patterns.
     In Proc. of the IEEE International Conference on Image Processing (ICIP).
     3673–3676.
 [2] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic
     convolutional and recurrent networks for sequence modeling. Technical
     Report. arXiv preprint arXiv:1803.01271.
 [3] Yoann Baveye, Romain Cohendet, Matthieu Perreira Da Silva, and
     Patrick Le Callet. 2016. Deep Learning for Image Memorability Pre-
     diction: the Emotional Bias. In Proc. ACM International Conference on
     Multimedia (ACM MM). 491–495.
 [4] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Mats
     Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. 2018. MediaEval 2018:
     Predicting Media Memorability Task. In Proc. of the MediaEval Work-
     shop.
 [5] Romain Cohendet, Karthik Yadati, Ngoc Q. K. Duong, and Claire-
     Hélène Demarty. 2018. Annotating, understanding, and predicting
     long-term video memorability. In Proc. of the ICMR 2018 Workshop,
     Yokohama, Japan, June 11-14.
 [6] Martin Engilberge, Louis Chevallier, Patrick Pérez, and Matthieu Cord.
     2018. Finding beans in burgers: Deep semantic-visual embedding with
     localization. In Proc. IEEE International Conference on Computer Vision
     and Pattern Recognition (CVPR). 3984–3993.
 [7] Andreas F Haas, Marine Guibert, Anja Foerschner, Sandi Calhoun,
     Emma George, Mark Hatay, Elizabeth Dinsdale, Stuart A Sandin, Jen-
     nifer E Smith, Mark JA Vermeij, and others. 2015. Can we measure
     beauty? Computational evaluation of coral reef aesthetics. PeerJ 3
     (2015), e1390.
 [8] Dong-Chen He and Li Wang. 1990. Texture unit, texture spectrum, and
     texture analysis. IEEE Transactions on Geoscience and Remote Sensing
     28, 4 (1990), 509–512.
 [9] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude
     Oliva. 2014. What makes a photograph memorable? IEEE Transactions
     on Pattern Analysis and Machine Intelligence 36, 7 (2014), 1469–1482.
[10] Aditya Khosla, Akhil S Raju, Antonio Torralba, and Aude Oliva. 2015.
     Understanding and predicting image memorability at a large scale. In
     Proc. IEEE International Conference on Computer Vision (ICCV). 2390–
     2398.
[11] Hammad Squalli-Houssaini, Ngoc Q. K. Duong, Gwenaëlle Marquant,
     and Claire-Hélène Demarty. 2018. Deep learning for predicting image
     memorability. In Proc. IEEE International Conference on Acoustics,
     Speech and Signal Processing (ICASSP).
[12] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and
     Zbigniew Wojna. 2016. Rethinking the inception architecture for
     computer vision. In Proc. of the IEEE Conference on Computer Vision
     and Pattern Recognition. 2818–2826.
[13] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and
     Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D
     Convolutional Networks. In Proc. IEEE International Conference on
     Computer Vision (ICCV). 4489–4497.