                                  Predicting Media Memorability
                            with Audio, Video, and Text representations

                        Alison Reboud*, Ismail Harrando*, Jorma Laaksonen+ and Raphaël Troncy*
                                                              * EURECOM, Sophia Antipolis, France
                                                                 + Aalto University, Espoo, Finland

                                              {alison.reboud,ismail.harrando,raphael.troncy}@eurecom.fr
                                                               jorma.laaksonen@aalto.fi
Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval’20, 14-15 December 2020, Online.

ABSTRACT
This paper describes a multimodal approach proposed by the MeMAD team for the MediaEval 2020 “Predicting Media Memorability” task. Our best approach is a weighted average method combining predictions made separately from visual, audio, textual and visiolinguistic representations of videos. Our best model achieves Spearman scores of 0.101 and 0.078, respectively, for the short and long term prediction tasks.
1 INTRODUCTION
Considering video memorability as a useful tool for digital content retrieval as well as for sorting and recommending an ever-growing number of videos, the Predicting Media Memorability task aims at fostering research in the field by asking its participants to automatically predict both a short term and a long term memorability score for a given set of annotated videos. The full description of this task is provided in [5]. Last year’s best approaches for both the long term [10] and short term [2] tasks rely on multimodal features. Our method is inspired by last year’s best approaches but also acknowledges the specifics of the 2020 edition’s dataset. More specifically, because the TRECVid videos contain more actions than last year’s set of videos, our model uses video features as well as image features from multiple frames. In addition, because sound was included in the videos this year, our model also includes audio features. Finally, a key contribution of our approach is to test the relevance of visiolinguistic representations for the Media Memorability task. Our final model1 is a multimodal weighted average combining visual and audio deep features extracted from the videos, textual features from the provided captions, and visiolinguistic features.
1 https://github.com/MeMAD-project/media-memorability
2 APPROACH
We trained separate models for the short and long term predictions, originally using a 6-fold cross-validation of the training set, which means that we typically had 492 samples for training and 98 samples for testing each model.
2.1 Audio-Visual Approach
Our audio-visual memorability prediction scores are based on a feed-forward neural network that takes as input the concatenation of video and audio features and has one hidden layer of units and one unit in the output layer. The best performance was obtained with 2575-dimensional features consisting of the concatenation of 2048-dimensional I3D [3] video features and 527-dimensional audio features. The audio features encode the occurrence probabilities of the 527 classes of the Google AudioSet Ontology [6] in each video clip. The hidden layer uses ReLU activations and dropout during the training phase, while the output unit is sigmoidal. The network was trained with the Adam optimizer. The features, the number of training epochs and the number of units in the hidden layer were selected with the 6-fold cross-validation. For short term memorability prediction, the optimal number of epochs was 750 and the optimal hidden layer size was 80 units, whereas for long term prediction these figures were 260 and 160, respectively.
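
For illustration, the following is a minimal PyTorch sketch of this predictor. It assumes that the 2048-dimensional I3D features and the 527-dimensional AudioSet class probabilities have already been extracted for each clip; the module name, dropout rate and loss function are illustrative assumptions rather than our exact training code.

import torch
import torch.nn as nn

class AudioVisualMemorabilityNet(nn.Module):
    """Feed-forward regressor on concatenated I3D video and AudioSet audio features."""

    def __init__(self, video_dim=2048, audio_dim=527, hidden_units=80, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(video_dim + audio_dim, hidden_units),  # single hidden layer (2575 -> 80)
            nn.ReLU(),
            nn.Dropout(dropout),          # dropout rate not reported above, illustrative value
            nn.Linear(hidden_units, 1),
            nn.Sigmoid(),                 # sigmoidal output unit
        )

    def forward(self, video_feat, audio_feat):
        x = torch.cat([video_feat, audio_feat], dim=-1)      # 2048 + 527 = 2575 dimensions
        return self.net(x).squeeze(-1)

# Short term setting: 80 hidden units, 750 epochs; long term setting: 160 units, 260 epochs.
model = AudioVisualMemorabilityNet(hidden_units=80)
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()  # assumption: a standard regression loss is used for the sketch
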
We also experimented with other types of features and their combinations. These include the ResNet [7] features extracted just from the middle frames of the clips, as this approach worked very well last year. The contents of this year’s videos are, however, such that genuine video features such as I3D and C3D [13] work better than still image features. When I3D and AudioSet features are used, C3D features do not bring any additional advantage.

2.2 Textual Approach
Our textual approach leverages the video descriptions provided by the organizers. First, all the provided descriptions are concatenated by video identifier to get one string per video. To generate the textual representation of the video content, we used the following methods:
• Computing TF-IDF vectors, removing stopwords and rare words (less than 4 occurrences) and accounting for frequent 2-grams.
• Averaging GloVe embeddings over all non-stopword words, using the pre-trained 300d version [9].
• Averaging BERT [4] token representations (keeping all the words in the descriptions, up to 250 words per sentence).
• Using Sentence-BERT [11] sentence representations. We use the distilled version that is fine-tuned for the STS Textual Similarity Benchmark2.
2 https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-mean-tokens
For each representation, we experimented with multiple regression models and fine-tuned the hyper-parameters for each model using the 6-fold cross-validation on the training set. For our submission, we used the averaged GloVe embeddings with a Support Vector Regressor with an RBF kernel and a regularization parameter C = 1e-5.
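
As an illustration of this submitted textual pipeline, the sketch below averages pre-trained 300-dimensional GloVe vectors over the non-stopword tokens of each concatenated description and fits an RBF-kernel support vector regressor with C = 1e-5. The GloVe file path, the tokenization and the gensim/NLTK loading utilities are assumptions made for the example, not a prescribed setup.

import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from sklearn.svm import SVR

def embed_description(description, glove, stop_words, dim=300):
    """Average GloVe embeddings over the non-stopword tokens of one concatenated description."""
    tokens = [t for t in description.lower().split()
              if t not in stop_words and t in glove]
    if not tokens:
        return np.zeros(dim)
    return np.mean([glove[t] for t in tokens], axis=0)

def fit_textual_regressor(descriptions, scores):
    """Fit the submitted textual model: averaged GloVe features + RBF-kernel SVR with C = 1e-5."""
    # Hypothetical path to the pre-trained 300-d GloVe vectors in word2vec text format.
    glove = KeyedVectors.load_word2vec_format("glove.6B.300d.w2v.txt")
    stop_words = set(stopwords.words("english"))
    X = np.stack([embed_description(d, glove, stop_words) for d in descriptions])
    return SVR(kernel="rbf", C=1e-5).fit(X, scores)
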
We also attempted to enhance the provided descriptions with additional captions automatically generated using the DeepCaption3 software. We did not see an improvement in the results, which is probably due to the nature of the clips provided for this year’s edition (as DeepCaption is trained on static stock images from the MS COCO and TGIF datasets).
3 https://github.com/aalto-cbir/DeepCaption

2.3 Visiolinguistic Approach
ViLBERT [8] is a task-agnostic extension of BERT that aims to learn the associations and links between the visual and linguistic properties of a concept. It has a two-stream architecture, first modelling each modality (i.e. visual and textual) separately, and then fusing them through a set of attention-based interactions (co-attention). ViLBERT is pre-trained on the Conceptual Captions dataset (3.3M image-caption pairs) [12] with masked multi-modal learning and multi-modal alignment prediction objectives. We used a frozen pre-trained model which was fine-tuned twice, first on the task of Visual Question Answering (VQA) [1] and then on the 2019 MediaEval Memorability task and dataset.
The 1024-dimensional features extracted for the two modalities can be combined in different ways. In our experiments, multiplying the textual and visual feature vectors performed best for short term memorability prediction, whereas using the visual feature vectors alone worked better for long term memorability prediction. Averaging the features extracted from 6 frames performed better than using only the middle frame. We experimented with the same set of regression models as for the textual approach. In our submission, we used a Support Vector Regressor with a regularization parameter C = 1e-5 and an RBF or polynomial kernel for short and long term score prediction, respectively.
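
This combination step can be summarised with the short sketch below, which assumes that the 1024-dimensional visual and textual ViLBERT vectors have already been extracted for 6 frames per video; the function and array names are placeholders for illustration.

import numpy as np
from sklearn.svm import SVR

def fit_vilbert_regressors(visual_feats, text_feats, st_scores, lt_scores):
    """visual_feats, text_feats: arrays of shape (n_videos, 6, 1024) with ViLBERT features."""
    visual = visual_feats.mean(axis=1)    # average the features of the 6 sampled frames
    textual = text_feats.mean(axis=1)

    combined_st = visual * textual        # element-wise product worked best for short term
    combined_lt = visual                  # visual features alone worked better for long term

    svr_st = SVR(kernel="rbf", C=1e-5).fit(combined_st, st_scores)
    svr_lt = SVR(kernel="poly", C=1e-5).fit(combined_lt, lt_scores)
    return svr_st, svr_lt
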
3 RESULTS AND ANALYSIS
We prepared 5 different runs following the task description, defined as follows:
• run1 = Audio-Visual Score
• run2 = Visiolinguistic Score
• run3 = Textual Score
• run4 = 0.5 * run1 + 0.2 * run2 + 0.3 * run3
• run5 = run4 with LT scores for the LT task
For the Long Term task, all models except run5 use exclusively short-term scores. For runs 4 and 5, we normalise the scores obtained from runs 1, 2 and 3 before combining them, as sketched below.
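
A minimal sketch of this fusion is given below; the min-max rescaling is an assumption made for illustration, as the exact normalisation scheme is not detailed here.

import numpy as np

def min_max(x):
    """Rescale a vector of per-run predictions to [0, 1] (assumed normalisation)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def fuse_run4(run1, run2, run3, weights=(0.5, 0.2, 0.3)):
    """run4 = 0.5 * audio-visual + 0.2 * visiolinguistic + 0.3 * textual, after normalisation."""
    w1, w2, w3 = weights
    return w1 * min_max(run1) + w2 * min_max(run2) + w3 * min_max(run3)
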
Table 1 provides the Spearman score obtained for each run when performing a 6-fold cross-validation on the training set. Note that our models use only the training set, as the annotations on the later-provided development set did not yield better results. We hypothesize that this is due to the smaller number of annotations available per video: many videos had a score of 1, for instance, which we do not observe on the training set.

Table 1: Average Spearman score obtained with a 6-fold cross-validation on the training set

Method    Short Term    Long Term
run1      0.2899        0.179
run2      0.214         0.1309
run3      0.2506        0.1372
run4      0.3104        0.2038
run5      0.067         0.1700
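
For reference, the per-run numbers in Table 1 are the mean Spearman correlations over the 6 folds; a minimal sketch of this evaluation loop is given below (the regression model and the feature matrix are placeholders).

import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import KFold

def cross_validated_spearman(model, X, y, n_splits=6):
    """Mean Spearman correlation between predicted and annotated memorability over the folds."""
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        model.fit(X[train_idx], y[train_idx])
        rho, _ = spearmanr(model.predict(X[test_idx]), y[test_idx])
        fold_scores.append(rho)
    return float(np.mean(fold_scores))
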
We present in Table 2 the final results obtained on the test set, using models trained on the full training set composed of 590 videos. We observe that the weighted average method which uses short term scores works best for both short and long term prediction, obtaining results which are approximately double the mean Spearman score obtained across teams. Our best results (Spearman scores) on the test set are, however, significantly worse than the ones we obtained on average over the 6 folds of the training set, suggesting that the test set is quite different from the training set. The results for Long Term prediction are always worse than the ones for Short Term prediction. Finally, both our scores and the mean scores across teams are below the ones obtained for the 2018 and 2019 videos.

Table 2: Results on the test set for Short Term (ST) and Long Term (LT) memorability

Method      SpearmanST    PearsonST    SpearmanLT    PearsonLT
run1        0.099         0.09         0.077         0.0855
run2        0.098         0.085        -0.017        0.011
run3        0.073         0.091        0.019         0.049
run4        0.101         0.09         0.078         0.085
run5        0.101         0.09         0.067         0.066
AvgTeams    0.058         0.066        0.036         0.043

4 DISCUSSION AND OUTLOOK
This paper describes a multimodal weighted average method proposed for the 2020 Predicting Media Memorability task of MediaEval. One of the key contributions of this paper is to have shown that, based on our experiments during the model construction and testing phases, video features performed best in comparison to image, audio and text features. Similarly to last year, short term score predictions correlated better with long term scores than the predictions made when training directly on long term scores. Finally, considering the difference in results between the training and test sets, it would be interesting to further investigate the differences between these datasets in terms of content (video, audio and text) and annotation. We conclude that generalizing this type of task to different video genres and characteristics remains a scientific challenge.

Acknowledgements
This work has been partially supported by the European Union’s Horizon 2020 research and innovation programme via the project MeMAD (GA 780069).


REFERENCES
 [1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell,
     Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual
     Question Answering. In IEEE International Conference on Computer
     Vision (ICCV). IEEE, Santiago, Chile.
 [2] David Azcona, Enric Moreu, Feiyan Hu, Tomás E Ward, and Alan F
     Smeaton. 2019. Predicting media memorability using ensemble models.
     In MediaEval 2019: Multimedia Benchmark Workshop. Sophia Antipolis,
     France.
 [3] João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recog-
     nition? A New Model and the Kinetics Dataset. In IEEE Conference on
     Computer Vision and Pattern Recognition (CVPR). IEEE, 4724–4733.
 [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
     2019. BERT: Pre-training of Deep Bidirectional Transformers for
     Language Understanding. In Conference of the North American Chap-
     ter of the Association for Computational Linguistics (NAACL). ACL,
     Minneapolis, Minnesota, USA, 4171–4186.
 [5] Alba García Seco de Herrera, Rukiye Savran Kiziltepe, Jon Chamber-
     lain, Mihai Gabriel Constantin, Claire-Hélène Demarty, Faiyaz Doctor,
     Bogdan Ionescu, and Alan F. Smeaton. 2020. Overview of MediaEval
     2020 Predicting Media Memorability task: What Makes a Video Memo-
     rable?. In Working Notes Proceedings of the MediaEval 2020 Workshop.
 [6] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade
     Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. 2017.
     Audio set: An ontology and human-labeled dataset for audio events. In
     IEEE International Conference on Acoustics, Speech and Signal Processing
     (ICASSP). New Orleans, Louisiana, USA, 776–780.
 [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep
     residual learning for image recognition. In IEEE Conference on Com-
     puter Vision and Pattern Recognition (CVPR). IEEE, Las Vegas, Nevada,
     USA, 770–778.
 [8] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT:
     Pretraining Task-Agnostic Visiolinguistic Representations for Vision-
     and-Language Tasks. In 33rd Conference on Neural Information Pro-
     cessing Systems (NeurIPS). Vancouver, Canada.
 [9] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014.
     Glove: Global vectors for word representation. In International Con-
     ference on Empirical Methods in Natural Language Processing (EMNLP).
     ACL, Doha, Qatar, 1532–1543.
[10] Alison Reboud, Ismail Harrando, Jorma Laaksonen, Danny Francis,
     Raphaël Troncy, and Héctor Laria Mantecón. 2019. Combining Textual
     and Visual Modeling for Predicting Media Memorability. In MediaEval
     2019: Multimedia Benchmark Workshop. Sophia Antipolis, France.
[11] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence
     Embeddings using Siamese BERT-Networks. In International Confer-
     ence on Empirical Methods in Natural Language Processing (EMNLP).
     ACL, Hong Kong, China, 3982–3992.
[12] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut.
     2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text
     Dataset For Automatic Image Captioning. In 56th Annual Meeting of
     the Association for Computational Linguistics (Volume 1: Long Papers).
     ACL, Melbourne, Australia, 2556–2565.
[13] Du Tran, Lubomir D. Bourdev, Rob Fergus, Lorenzo Torresani, and
     Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D
     Convolutional Networks. In International Conference on Computer
     Vision (ICCV). IEEE, Santiago, Chile, 4489–4497.