Investigating Memorability of Dynamic Media

Phuc H. Le-Khac, Ayush K. Rai, Graham Healy, Alan F. Smeaton, Noel E. O'Connor
Dublin City University, Ireland
{khac.le2,ayush.rai3}@mail.dcu.ie
{graham.healy,alan.smeaton,noel.oconnor}@dcu.ie

ABSTRACT
The Predicting Media Memorability task in MediaEval'20 has some challenging aspects compared to previous years. In this paper we identify the highly dynamic content of the videos and the limited size of the dataset as the core challenges of the task, propose directions to overcome some of these challenges, and present our initial results in these directions.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, 14-15 December 2020, Online.

1 INTRODUCTION
Since the seminal paper on image memorability by Isola et al. [10], there has been growing interest in building computational models to understand and predict the intrinsic memorability of media, as well as other subjective perceptions of media [7]. The Predicting Media Memorability task at MediaEval is designed to facilitate and promote research in this area by requiring task participants to develop computational approaches that generate measures of media memorability, which in turn also helps to better understand the subjective memorability of human cognition. Having a model that predicts a media item's memorability also enables interesting applications, such as techniques to enhance the memorability of images through neural style transfer [14].

Continuing from the success of the previous two years [4, 5], the MediaEval'20 Predicting Media Memorability challenge [8] remains the same: teams need to build models that predict memorability scores from a dataset of videos with captions. There are two types of memorability scores: short-term scores, for videos shown a second time within a short timespan of the initial viewing, and long-term scores, for videos shown again two to three days later. This year's training dataset contains 590 videos with sound from the TRECVid 2019 Video-to-Text dataset [1], compared to 8,000 soundless videos in the VideoMem [3] dataset used previously.

2 RELATED WORK
In previous challenges, a variety of methods utilising different data modalities were explored [2, 11, 17]. As previous work has shown [6, 13], high-level semantic features, whether extracted by deep networks or provided by human annotators via text captions, are among the most effective for predicting memorability. However, the modalities and features that are most predictive of memorability remain unclear; for example, the best approach in last year's challenge [2] used an ensemble of models trained on a variety of modalities.

3 APPROACH

3.1 Motivation
High dynamic video. The short videos used in previous years' memorability challenges were extracted from raw professional footage of the kind used to create high-quality films and commercials [3]. Consequently, the majority of those videos contain only one scene and are mostly static. In contrast, the data for this year's 2020 challenge is extracted from the TRECVid 2019 Video-to-Text dataset [1]. These videos were collected from social media posts and TV shows; they contain multiple scene changes within a single video and are therefore more dynamic, with more complex movement and higher activity levels, as illustrated in Fig. 1.

[Figure 1: Histogram showing distributional differences in Mean Magnitude of Optical Flow (amount of motion) between the VideoMem dataset used in previous years and the MediaEval 2020 dataset. The Mean Magnitude of Optical Flow of a video was obtained by averaging the optical flow features over all pixels in a frame and across all frames.]
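As a rough illustration of the statistic plotted in Figure 1, the sketch below estimates the mean optical-flow magnitude of a video with OpenCV's dense Farnebäck optical flow. This is a minimal reconstruction under our own assumptions: the function name and the flow parameters are our choices and may differ from the exact pipeline used to produce the figure.

```python
import cv2
import numpy as np

def mean_flow_magnitude(video_path: str) -> float:
    """Average dense optical-flow magnitude over all pixels and all frames of a video."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise IOError(f"cannot read {video_path}")
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    magnitudes = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow between consecutive frames (Farnebäck parameters are defaults).
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        magnitudes.append(mag.mean())   # average over all pixels in the frame
        prev = gray
    cap.release()
    return float(np.mean(magnitudes))   # average across all frames
```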
While previous approaches to memorability prediction already use high-level semantic features, such as those extracted by deep neural networks, text captions provided by human annotators, and human-centric features such as emotions or aesthetics, most of these features are extracted with frame-based models. In contrast, this year's dataset poses an unexplored challenge for methods that aim to capture semantic features from highly dynamic videos.

Small dataset size. Another challenge of this year's dataset is the limited number of videos annotated with memorability scores. Compared to previous years it is an order of magnitude smaller, with 590 videos in the training set and 500 videos in the test set. A development set with an additional 500 videos was released at a later stage of the benchmark.

Challenging benchmark. These changes make this year's task considerably more challenging: the videos are more dynamic, which makes extracting high-level semantic features more difficult, and this is further compounded by the limited size of the dataset. Motivated by these insights, we focus our work on video-level features that capture the dynamics in the videos, and we attempt to pre-train a memorability model on a larger dataset before fine-tuning it on this year's MediaEval data.

3.2 Spatiotemporal baseline
Although the caption provided for each video could be used as a semantically rich, high-level feature for predicting media memorability, in our approach we focus on extracting high-level features from the raw video input, without any additional annotation. Motivated by the fact that this year's videos are highly dynamic, as discussed above, we chose to use features extracted with C3D [16], the only video-based (rather than frame-based) model among the provided features. We used the pre-extracted C3D features supplied by the task organisers as the input to our memorability regression model.

C3D learns spatio-temporal representations. C3D [16], short for Convolutional 3D, is among the earliest approaches to learning generic representations from videos. Extending the 2D convolution operations common in most image-processing deep learning models, C3D uses a homogeneous architecture composed of small 3 × 3 × 3 convolution kernels that span the height, width and time dimensions of a video. Trained on a generic action-recognition dataset, C3D maps each video segment to a 4096-dimensional feature vector that captures its high-level semantic content.

Memorability regression model. Given the extracted C3D features, we used a multi-layer perceptron (MLP) with two hidden layers of 512 units, each followed by a Rectified Linear Unit (ReLU) non-linearity. To mitigate overfitting we used Dropout [15] with a probability of 0.1, and Batch Normalisation [9] was applied to normalise the features in each batch after each layer to aid optimisation. The final model is a sequential stack of three building blocks, each composed of BatchNorm, Dropout, fully-connected and activation layers; in the output layer the ReLU activation is replaced with a Sigmoid so that the predicted memorability score lies in the range 0 to 1.

The model has a total of more than 2 million parameters, most of them in the first fully-connected layer due to C3D's large feature size of 4096. To reduce the effect of outliers in the ranking scores, we used an L1 loss instead of the standard L2 Mean-Squared Error (MSE) and observed a slight improvement in convergence speed. The model was trained for 100 epochs on the training set, and the final model was selected based on its performance on the validation set.
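The following is a minimal PyTorch sketch of the regression model described above: three BatchNorm-Dropout-Linear-activation blocks over the 4096-dimensional C3D features, a Sigmoid output, and an L1 loss. The exact layer ordering within a block, the optimiser and the learning rate are our assumptions rather than the exact training configuration.

```python
import torch
import torch.nn as nn

def block(in_dim: int, out_dim: int, activation: nn.Module) -> nn.Sequential:
    """One building block: BatchNorm -> Dropout -> fully-connected -> activation."""
    return nn.Sequential(
        nn.BatchNorm1d(in_dim),
        nn.Dropout(p=0.1),
        nn.Linear(in_dim, out_dim),
        activation,
    )

# Three blocks, 4096 -> 512 -> 512 -> 1, with a Sigmoid so scores lie in [0, 1].
model = nn.Sequential(
    block(4096, 512, nn.ReLU()),
    block(512, 512, nn.ReLU()),
    block(512, 1, nn.Sigmoid()),
)

criterion = nn.L1Loss()                       # L1 instead of MSE to limit the effect of outliers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # optimiser choice is our assumption

def train_step(features: torch.Tensor, scores: torch.Tensor) -> float:
    """features: (batch, 4096) pre-extracted C3D vectors; scores: (batch,) targets in [0, 1]."""
    model.train()
    optimizer.zero_grad()
    pred = model(features).squeeze(1)
    loss = criterion(pred, scores)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Most of the roughly 2.4 million parameters in this configuration sit in the first 4096-to-512 linear layer, which matches the parameter count reported above.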
3.3 Large scale Memorability Pre-training
To overcome the limited size of this year's dataset, we used the recently released Memento10K dataset [12] to pre-train our model before fine-tuning it on the MediaEval challenge's dataset. Containing 10,000 videos, Memento10K is roughly the same size as the VideoMem [3] dataset. Moreover, the videos in Memento10K contain more action and are therefore more similar to the new data in this year's challenge. We consequently decided to focus our large-scale pre-training approach on Memento10K and to replicate the accompanying SemNet model [12], instead of pre-training on the VideoMem [3] data from previous years. The SemNet model contains three separate sub-networks that process three different input streams: image, optical flow and video. After pre-training SemNet on Memento10K, we retain only the video-stream sub-network for fine-tuning on the MediaEval dataset and discard all other components.
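A minimal sketch of this transfer step, assuming the pre-trained video-stream sub-network is available as a PyTorch module, is given below. The class name SemNetVideoStream, its input and output dimensions, and the checkpoint filename are hypothetical placeholders and not the released Memento10K/SemNet code.

```python
import torch
import torch.nn as nn

class SemNetVideoStream(nn.Module):
    """Hypothetical stand-in for SemNet's video-stream sub-network pre-trained on
    Memento10K; the real architecture comes from the released Memento10K code."""
    def __init__(self, in_dim: int = 4096, out_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        return self.backbone(clip_features)

video_stream = SemNetVideoStream()
# Load weights obtained from pre-training on Memento10K (path is a placeholder):
# video_stream.load_state_dict(torch.load("semnet_video_stream_memento10k.pt"))

# Discard the image and optical-flow sub-networks; keep only the video stream
# and attach a fresh regression head for the MediaEval memorability scores.
model = nn.Sequential(video_stream, nn.Linear(512, 1), nn.Sigmoid())

# Fine-tune the whole stack on the MediaEval training set with a small learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()
```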
4 RESULTS AND DISCUSSION
The results of our runs, together with this year's mean and variance across all submissions, are reported in Table 1. Due to technical issues and time constraints, we only submitted the baseline regression model based on C3D features and did not obtain results for fine-tuning SemNet on the Memento10K [12] dataset in this year's challenge.

Table 1: Results of our approach compared with the challenge's statistics

  Run         Short-term                     Long-term
              Spearman   Pearson   MSE       Spearman   Pearson   MSE
  Mean        0.058      0.066     0.013     0.036      0.043     0.051
  Variance    0.002      0.002     0.000     0.002      0.001     0.000
  C3D         0.034      0.078     0.1      -0.01       0.022     0.09
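For reference, the sketch below computes the metrics reported in Table 1 (Spearman and Pearson correlations and MSE between predicted and ground-truth memorability scores) with SciPy and NumPy. It follows the standard definitions of these metrics and is not the official evaluation script.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def memorability_metrics(pred: np.ndarray, target: np.ndarray) -> dict:
    """Rank correlation, linear correlation and mean-squared error of predicted scores."""
    return {
        "spearman": spearmanr(pred, target)[0],
        "pearson": pearsonr(pred, target)[0],
        "mse": float(np.mean((pred - target) ** 2)),
    }
```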
Even with the simplest possible method, the result of our regression baseline on pre-extracted C3D features is not far from the mean. This supports our hypothesis that spatio-temporal representations are important for this year's dataset and that performance can be improved with other, complementary high-level features. Future work can explore this hypothesis further by fine-tuning the entire C3D model instead of using only the pre-extracted features, or by using other models that also capture spatio-temporal features.

On the other hand, compared to the state-of-the-art result of 0.528 [2] in last year's challenge, the mean Spearman rank correlation over all runs this year (0.058) is an order of magnitude lower. This highlights the importance of large-scale pre-training and transfer learning, and of methods that can effectively extract high-level features from dynamic videos.

We believe that the direction of our solution, given the difficulties of this year's challenge, not only shows promise but also indicates the importance of spatio-temporal models for capturing the high-level semantics of videos in memorability prediction.

ACKNOWLEDGMENTS
This work was co-funded by Science Foundation Ireland through the SFI Centre for Research Training in Machine Learning (18/CRT/6183) and the Insight Centre for Data Analytics (SFI/12/RC/2289_P2), co-funded by the European Regional Development Fund. A.K. Rai also acknowledges support from FotoNation Ltd.

REFERENCES
[1] George Awad, Asad A. Butt, Keith Curtis, Yooyoung Lee, Jonathan Fiscus, Afzal Godil, Andrew Delgado, Jesse Zhang, Eliot Godard, Lukas Diduch, Alan F. Smeaton, Yvette Graham, Wessel Kraaij, and Georges Quenot. 2020. TRECVID 2019: An Evaluation Campaign to Benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & Retrieval. In TREC Video Retrieval Evaluation Notebook Papers and Slides. https://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.19.org.html
[2] David Azcona, Enric Moreu, Feiyan Hu, Tomás E. Ward, and Alan F. Smeaton. 2019. Predicting Media Memorability Using Ensemble Models. In Proceedings of MediaEval 2019, Sophia Antipolis, France. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2670/
[3] Romain Cohendet, Claire-Hélène Demarty, Ngoc Duong, and Martin Engilberge. 2019. VideoMem: Constructing, Analyzing, Predicting Short-Term and Long-Term Video Memorability. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2531–2540. https://doi.org/10.1109/ICCV.2019.00262
[4] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. 2018. MediaEval 2018: Predicting Media Memorability. In Proceedings of MediaEval 2018. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2283/
[5] Mihai Gabriel Constantin, Bogdan Ionescu, Claire-Hélène Demarty, Ngoc Q. K. Duong, Xavier Alameda-Pineda, and Mats Sjöberg. 2019. The Predicting Media Memorability Task at MediaEval 2019. In Proceedings of MediaEval 2019, Sophia Antipolis, France. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2670/
[6] Mihai Gabriel Constantin, Chen Kang, G. Dinu, Frédéric Dufaux, Giuseppe Valenzise, and B. Ionescu. 2019. Using Aesthetics and Action Recognition-Based Networks for the Prediction of Media Memorability. In Proceedings of MediaEval 2019, Sophia Antipolis, France. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2670/
[7] Mihai Gabriel Constantin, Miriam Redi, Gloria Zen, and Bogdan Ionescu. 2019. Computational Understanding of Visual Interestingness Beyond Semantics: Literature Survey and Analysis of Covariates. ACM Computing Surveys 52, 2, Article 25 (2019), 37 pages. https://doi.org/10.1145/3301299
[8] Alba García Seco de Herrera, Rukiye Savran Kiziltepe, Jon Chamberlain, Mihai Gabriel Constantin, Claire-Hélène Demarty, Faiyaz Doctor, Bogdan Ionescu, and Alan F. Smeaton. 2020. Overview of MediaEval 2020 Predicting Media Memorability Task: What Makes a Video Memorable?. In Working Notes Proceedings of the MediaEval 2020 Workshop.
[9] Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37 (ICML'15). JMLR.org, 448–456.
[10] Phillip Isola, Devi Parikh, Antonio Torralba, and Aude Oliva. 2011. Understanding the Intrinsic Memorability of Images. In Advances in Neural Information Processing Systems, Vol. 24. Curran Associates, Inc., 2429–2437. https://proceedings.neurips.cc/paper/2011/file/286674e3082feb7e5afb92777e48821f-Paper.pdf
[11] Roberto Leyva and Faiyaz Doctor. 2019. Multimodal Deep Features Fusion For Video Memorability Prediction. In Proceedings of MediaEval 2019, Sophia Antipolis, France. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2670/
[12] Anelise Newman, Camilo Fosco, Vincent Casser, Allen Lee, Barry McNamara, and Aude Oliva. 2020. Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability. In Computer Vision – ECCV 2020, Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm (Eds.). Springer International Publishing, Cham, 223–240.
[13] Alison Reboud, Ismail Harrando, Jorma T. Laaksonen, D. Francis, Raphaël Troncy, and H. Mantecón. 2019. Combining Textual and Visual Modeling for Predicting Media Memorability. In Proceedings of MediaEval 2019, Sophia Antipolis, France. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2670/
[14] Aliaksandr Siarohin, Gloria Zen, Cveta Majtanovic, Xavier Alameda-Pineda, Elisa Ricci, and Nicu Sebe. 2019. Increasing Image Memorability with Neural Style Transfer. ACM Transactions on Multimedia Computing, Communications, and Applications 15, 2 (2019), 1–22. https://doi.org/10.1145/3311781
[15] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, 56 (2014), 1929–1958. http://jmlr.org/papers/v15/srivastava14a.html
[16] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In 2015 IEEE International Conference on Computer Vision (ICCV). 4489–4497. https://doi.org/10.1109/ICCV.2015.510
[17] Le-Vu Tran, Vinh-Loc Huynh, and Minh-Triet Tran. 2019. Predicting Media Memorability Using Deep Features with Attention and Recurrent Network. In Proceedings of MediaEval 2019, Sophia Antipolis, France. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-2670/