Using Aesthetics and Action Recognition-based Networks for the Prediction of Media Memorability

Mihai Gabriel Constantin1, Chen Kang2, Gabriela Dinu1, Frédéric Dufaux2, Giuseppe Valenzise2, Bogdan Ionescu1
1 CAMPUS, University Politehnica of Bucharest, Romania
2 Laboratoire des Signaux et Systèmes, Université Paris-Sud-CNRS-CentraleSupélec, Université Paris-Saclay, France
mgconstantin@imag.pub.ro

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'19, 27-29 October 2019, Sophia Antipolis, France

ABSTRACT
In this working note paper we present the contribution and results of the participation of the UPB-L2S team in the MediaEval 2019 Predicting Media Memorability Task. The task requires participants to develop machine learning systems able to automatically predict whether a video will be memorable for the viewer, and for how long (e.g., hours, or days). To solve the task, we investigated several aesthetics and action recognition-based deep neural networks, either by fine-tuning models or by using them as pre-trained feature extractors. Results from different systems were aggregated in various fusion schemes. Experimental results are positive, showing the potential of transfer learning for this task.

Figure 1: The diagram of the proposed solution. The pipeline comprises a fine-tuned ResNet-101 aesthetics branch (Run 1), I3D/TSN/C3D feature extraction followed by PCA and SVR (Runs 2 and 3), and late fusion (LF) of these outputs (Runs 4 and 5).

1 INTRODUCTION
Media memorability has been studied extensively in recent years, playing an important role in the analysis of human perception and understanding of media content. This domain has been approached by numerous scientists from different perspectives and fields of study, including psychology [1, 13] and computer vision [3, 12], while several works analyzed the correlation between memorability and other visual perception concepts like interestingness and aesthetics [6, 8]. In this context, the MediaEval 2019 Predicting Media Memorability task requires participants to create systems that can predict the short-term and long-term memorability of a set of soundless videos. The dataset, annotation protocol, precomputed features, and ground truth data are described in the task overview paper [5].

2 APPROACH
For our approach, we used several deep neural network models based on image aesthetics and action recognition. For the first category, we fine-tuned the aesthetic deep model presented in [9], which is based on the ResNet-101 architecture [7]. For the action recognition networks, we used features extracted from the I3D [2] and TSN [15] networks and attempted to augment these features with the C3D features provided by the task organizers. Finally, we performed several late fusion experiments to further improve the results of these individual runs. Figure 1 summarizes these approaches; they are detailed in the following.

2.1 Aesthetics networks
The aesthetic-based approach modifies the ResNet-101 architecture [7], trained on the AVA dataset [11] for the prediction of image aesthetic value, following the approach described in [9]. This approach generates a deep neural model that processes single-image aesthetics and must be fine-tuned to predict the short- and long-term memorability of videos. To generate a training dataset that will support the fine-tuning process, we extracted key-frames in two ways: (i) key-frames from the 4th, 5th, and 6th second of each sample; (ii) one key-frame every two seconds, to test multi-frame training. In the retraining stage of the network for the memorability task, the provided devset is randomly split into three parts, with 65% of the samples representing the training set, 25% the test set, and 10% the validation set. We adapted the last layer for this task by creating a fully connected layer with 2,048 inputs and 1 output. During the fine-tuning process, we used mean squared error as the loss function, with an initial learning rate of 0.0001. We ran the training process for 15 epochs, with a batch size of 32.
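As an illustration of this fine-tuning step, the sketch below replaces the last layer of a ResNet-101 backbone with a single-output regression head and trains it with the hyper-parameters listed above. It is a minimal PyTorch sketch, not our exact implementation: the optimizer choice and the hypothetical keyframe_loader (yielding batches of key-frames and memorability scores) are assumptions, and the backbone shown here starts from ImageNet weights rather than the AVA-trained aesthetics model.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal sketch: adapt a ResNet-101 backbone to regress a single
# memorability score per key-frame (2,048 inputs -> 1 output).
model = models.resnet101(pretrained=True)   # stand-in for the AVA-trained aesthetics model
model.fc = nn.Linear(2048, 1)

criterion = nn.MSELoss()                    # mean squared error loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer choice is an assumption

def fine_tune(model, keyframe_loader, epochs=15):
    # keyframe_loader is a hypothetical DataLoader yielding (frames, scores)
    # batches of size 32, built from the extracted key-frames.
    model.train()
    for _ in range(epochs):
        for frames, scores in keyframe_loader:
            optimizer.zero_grad()
            preds = model(frames).squeeze(1)
            loss = criterion(preds, scores)
            loss.backward()
            optimizer.step()
    return model
```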
2.2 Action recognition networks
Apart from the precomputed C3D features, we extracted the "Mixed_5" layer from the I3D network [2], trained on the Kinetics dataset [10], and the "Inception_5" layer of the TSN network [15], trained on the UCF101 dataset [14]. These features were used as inputs for a Support Vector Regression (SVR) algorithm that generates the final memorability scores. We conducted preliminary early fusion tests with combinations of these features in order to select the best possible combinations, testing each feature vector individually as well as all possible combinations of two feature vectors. We also employed PCA dimensionality reduction, reducing the size of each vector to 128 elements. Finally, to train the SVR system, we used a random 4-fold approach, with 75% of the data representing the training set and 25% representing the validation set. We tuned the parameters of the SVR model, using an RBF kernel and performing a grid search over two parameters, C and gamma (taking values 10^k, where k ∈ [−4, ..., 4]).
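A minimal scikit-learn sketch of this feature-regression stage is given below, under the assumption that the deep descriptors have already been aggregated into one fixed-length vector per video; the arrays are random placeholders, and cv=4 is only a stand-in for the random 4-fold 75%/25% protocol described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Placeholder data: one aggregated feature vector and one score per video.
X_train = np.random.rand(200, 4096)   # stand-in for I3D/TSN/C3D descriptors
y_train = np.random.rand(200)         # stand-in for memorability scores

# PCA to 128 dimensions followed by an RBF-kernel SVR, with a grid search
# over C and gamma taking values 10^k for k in [-4, ..., 4].
pipeline = Pipeline([
    ("pca", PCA(n_components=128)),
    ("svr", SVR(kernel="rbf")),
])
param_grid = {
    "svr__C": [10.0 ** k for k in range(-4, 5)],
    "svr__gamma": [10.0 ** k for k in range(-4, 5)],
}
search = GridSearchCV(pipeline, param_grid, cv=4,
                      scoring="neg_mean_squared_error")  # scoring choice is an assumption
search.fit(X_train, y_train)
predicted_scores = search.predict(X_train)
```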
2.3 Late fusion
We employed several late fusion schemes on the best performing systems, trying to benefit from their combined strengths. We used three different strategies for combining their scores, namely: (i) LFMax, where we took the maximum score for each media sample; (ii) LFMin, where we took the minimum score; (iii) LFWeight, where the score produced by each system was multiplied by a weight w. The weights were assigned according to the formula w = 1 − r/c, where the rank r takes the value 0 for the best performing system, 1 for the second best, and so on, and c is a coefficient that dictates the influence of the rank on the weights.
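The short numpy sketch below illustrates the three fusion strategies on a matrix of per-system scores; it assumes the rows are already ordered by system rank (best performing system first), and normalizing the weighted sum by the total weight is an assumption, since the exact aggregation of the weighted scores is not detailed above.

```python
import numpy as np

def late_fusion(scores, strategy="LFWeight", c=5.0):
    """Fuse memorability scores from several systems.

    scores: array of shape (n_systems, n_samples); rows are assumed to be
            sorted by system rank, best performing system first.
    """
    if strategy == "LFMax":
        return scores.max(axis=0)
    if strategy == "LFMin":
        return scores.min(axis=0)
    # LFWeight: weight w = 1 - r / c, with rank r = 0 for the best system.
    ranks = np.arange(scores.shape[0])
    weights = 1.0 - ranks / c
    # Normalization by the weight sum is an assumption of this sketch.
    return (weights[:, None] * scores).sum(axis=0) / weights.sum()

# Example: fuse two systems (e.g., run2 and run1) with c = 5.
fused = late_fusion(np.array([[0.61, 0.44], [0.58, 0.40]]), "LFWeight", c=5.0)
```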
3 EXPERIMENTAL RESULTS
The development dataset consists of 8,000 videos annotated with short- and long-term memorability scores, while the test dataset consists of 2,000 videos. The official metric used in the task is Spearman's rank correlation (ρ). The best performing systems in the development phase were selected, retrained on the whole devset using the optimal parameters, and finally run on the testset data.

3.1 Results on the devset
During the tests performed on the devset, several systems and combinations of parameters stood out as best performers. Table 1 shows the performances recorded by the best performing aesthetic, action-based, and late fusion systems.

Table 1: Results of the proposed runs (preliminary experiments on devset, and official results on testset). All values are Spearman's ρ.

Run    System description                               Devset                   Testset
                                                         Short-term  Long-term   Short-term  Long-term
run1   Aesthetic-based                                   0.448       0.230       0.401       0.203
run2   Action-based (TSN+I3D)                            0.473       0.259       0.45        0.228
run3   Action-based (C3D+I3D)                            0.433       0.204       0.386       0.184
run4   Late Fusion Action-based (run2 + run3)            0.466       0.200       0.439       0.218
run5   Late Fusion Aesthetic and Action (run1 + run2)    0.494       0.265       0.477       0.232

We used several dataset variations in retraining the aesthetic-based deep network. More precisely, we found that, for short-term memorability, the best performing systems were the ones trained with key-frames extracted from the 5th second and the ones trained with the multi-frame approach, both with a similar Spearman's ρ of 0.45. On the other hand, for the long-term memorability subtask, the best performing systems were the ones trained with key-frames from the 5th second. Although this may seem somewhat surprising, given that bigger datasets usually yield better results, we believe the reason is that each video contains only one scene; therefore, not much additional information is given to the system when more frames are extracted, because the frames are very similar. However, we would also like to point out that the results for the other frame extraction schemes were not much lower than these.

Regarding the 3D action-recognition based systems, we noticed that individual systems based on only one feature vector (TSN, I3D, or C3D) had low performance, with Spearman's ρ scores under 0.42. Performance dropped further when we used the original vectors without applying PCA reduction, demonstrating the positive influence that dimensionality reduction has on the final results. We therefore decided to apply an early fusion scheme, where we tested all the possible combinations of the feature vectors by concatenating them. The best performing combinations were TSN + I3D and C3D + I3D.

Finally, in the late fusion part of the experiments, we decided to test late fusion schemes between the two action-recognition based systems, and between the best performing action-recognition system (TSN + I3D) and the aesthetic-based system. In general, the LFMin systems underperformed, while the LFMax systems were better than their components, but without bringing a significant increase in results. The best performing late fusion schemes proved to be based on LFWeight, more precisely using a c value of 5. This was an expected result, as it confirms some of our previous work in other MediaEval tasks [4].
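For completeness, the sketch below illustrates the early fusion step (concatenating two descriptors per video before the PCA + SVR stage of Section 2.2) and the official metric, Spearman's ρ, computed with scipy. All arrays are random placeholders standing in for the actual features and predictions.

```python
import numpy as np
from scipy.stats import spearmanr

# Early fusion: concatenate two descriptors per video (e.g., TSN and I3D).
tsn_features = np.random.rand(100, 1024)   # placeholder TSN descriptors
i3d_features = np.random.rand(100, 1024)   # placeholder I3D descriptors
fused_features = np.concatenate([tsn_features, i3d_features], axis=1)

# Official metric: Spearman's rank correlation between predicted and
# ground-truth memorability scores.
predicted = np.random.rand(100)            # placeholder predictions
ground_truth = np.random.rand(100)         # placeholder annotations
rho, _ = spearmanr(predicted, ground_truth)
print(f"Spearman's rho: {rho:.3f}")
```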
3.2 Results on the testset
For the final phase, we retrained all the systems on the entire set of videos from the devset, using the parameters computed in the previous phases, and tested them on the videos from the testset. Table 1 also presents the results for this phase.

As expected, the best performance comes from the late fusion system using both aesthetic and action-based components (short-term ρ = 0.477 and long-term ρ = 0.232). Generally, we observe that the ranking of the submitted systems is consistent with the one we observed during the development phase; however, the results are lower than those predicted then, with significant drops in performance for the aesthetic-based system and the action-based (C3D + I3D) approach. In terms of single-system performance, the action-based TSN + I3D system performs best, followed by the aesthetic-based system.

4 CONCLUSIONS
In this paper we presented the UPB-L2S approach for predicting media memorability at MediaEval 2019. We created a framework that uses aesthetic and action recognition-based systems, as well as late fusion combinations of these systems, to predict short-term and long-term memorability scores for soundless video samples. The results show that these systems are able to predict these scores individually, while the best results are achieved via weighted late fusion schemes. This reinforces the idea of further exploiting transfer learning for tasks where labeled data are particularly hard to obtain.

ACKNOWLEDGMENTS
This work was partially supported by the Romanian Ministry of Innovation and Research (UEFISCDI, project SPIA-VA, agreement 2SOL/2017, grant PN-III-P2-2.1-SOL-2016-02-0002).

REFERENCES
[1] Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva. 2008. Visual long-term memory has a massive storage capacity for object details. Proceedings of the National Academy of Sciences 105, 38 (2008), 14325–14329.
[2] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
[3] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, and Martin Engilberge. 2019. VideoMem: Constructing, Analyzing, Predicting Short-term and Long-term Video Memorability. In International Conference on Computer Vision (ICCV).
[4] Mihai Gabriel Constantin, Bogdan Andrei Boteanu, and Bogdan Ionescu. 2017. LAPI at MediaEval 2017 - Predicting Media Interestingness. In MediaEval.
[5] Mihai Gabriel Constantin, Bogdan Ionescu, Claire-Hélène Demarty, Ngoc Q. K. Duong, Xavier Alameda-Pineda, and Mats Sjöberg. 2019. Predicting Media Memorability Task at MediaEval 2019. In Proc. of MediaEval 2019 Workshop, Sophia Antipolis, France, Oct. 27-29, 2019.
[6] Mihai Gabriel Constantin, Miriam Redi, Gloria Zen, and Bogdan Ionescu. 2019. Computational understanding of visual interestingness beyond semantics: literature survey and analysis of covariates. ACM Computing Surveys (CSUR) 52, 2 (2019), 25.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[8] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude Oliva. 2013. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2013), 1469–1482.
[9] Chen Kang, Giuseppe Valenzise, and Frédéric Dufaux. 2019. Predicting Subjectivity in Image Aesthetics Assessment. In IEEE 21st International Workshop on Multimedia Signal Processing, 27-29 Sept 2019, Kuala Lumpur, Malaysia.
[10] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, and others. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).
[11] Naila Murray, Luca Marchesotti, and Florent Perronnin. 2012. AVA: A large-scale database for aesthetic visual analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2408–2415.
[12] Sumit Shekhar, Dhruv Singal, Harvineet Singh, Manav Kedia, and Akhil Shetty. 2017. Show and recall: Learning what makes videos memorable. In Proceedings of the IEEE International Conference on Computer Vision. 2730–2739.
[13] Roger N. Shepard. 1967. Recognition memory for words, sentences, and pictures. Journal of Verbal Learning and Verbal Behavior 6, 1 (1967), 156–163.
[14] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
[15] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision. Springer, 20–36.