Using Aesthetics and Action Recognition-based Networks for the Prediction of Media Memorability

Mihai Gabriel Constantin1, Chen Kang2, Gabriela Dinu1, Frédéric Dufaux2, Giuseppe Valenzise2, Bogdan Ionescu1
1 CAMPUS, University Politehnica of Bucharest, Romania
2 Laboratoire des Signaux et Systèmes, Université Paris-Sud-CNRS-CentraleSupélec, Université Paris-Saclay, France
mgconstantin@imag.pub.ro

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'19, 27-29 October 2019, Sophia Antipolis, France

ABSTRACT
In this working note paper we present the contribution and results of the participation of the UPB-L2S team in the MediaEval 2019 Predicting Media Memorability Task. The task requires participants to develop machine learning systems able to automatically predict whether a video will be memorable for the viewer, and for how long (e.g., hours, or days). To solve the task, we investigated several aesthetics and action recognition-based deep neural networks, either by fine-tuning models or by using them as pre-trained feature extractors. Results from different systems were aggregated in various fusion schemes. Experimental results are positive, showing the potential of transfer learning for this task.

Figure 1: The diagram of the proposed solution. The pipeline comprises a fine-tuned ResNet-101 aesthetics branch (Run 1), I3D/TSN/C3D feature extraction followed by PCA and SVR (Runs 2 and 3), and late fusion (LF) of these outputs (Runs 4 and 5).

1 INTRODUCTION
Media memorability has been studied extensively in recent years, playing an important role in the analysis of human perception and understanding of media content. This domain has been approached by numerous scientists from different perspectives and fields of study, including psychology [1, 13] and computer vision [3, 12], while several works analyzed the correlation between memorability and other visual perception concepts like interestingness and aesthetics [6, 8]. In this context, the MediaEval 2019 Predicting Media Memorability task requires participants to create systems that can predict the short-term and long-term memorability of a set of soundless videos. The dataset, annotation protocol, precomputed features, and ground truth data are described in the task overview paper [5].

2 APPROACH
For our approach, we used several deep neural network models based on image aesthetics and action recognition. For the first category, we fine-tuned the aesthetic deep model presented in [9], which is based on the ResNet-101 architecture [7]. For the action recognition networks, we used features extracted from the I3D [2] and TSN [15] networks and attempted to augment these features with the C3D features provided by the task organizers. Finally, we performed several late fusion experiments to further improve the results of these individual runs. Figure 1 summarizes these approaches; they are detailed in the following.

2.1 Aesthetics networks
The aesthetic-based approach modifies the ResNet-101 architecture [7], trained on the AVA dataset [11] for the prediction of image aesthetic value, following the approach described in [9]. This approach generates a deep neural model that processes single-image aesthetics and must be fine-tuned to predict the short- and long-term memorability of videos. To generate a training dataset that will support the fine-tuning process, we extracted key-frames in two ways: (i) key-frames from the 4th, 5th, and 6th second of each sample; (ii) one key-frame every two seconds, to test multi-frame training. In the retraining stage of the network for the memorability task, the provided devset is randomly split into three parts, with 65% of the samples representing the training set, 25% the test set, and 10% the validation set. We adapted the last layer for this task by creating a fully connected layer with 2,048 inputs and 1 output. During the fine-tuning process, we used mean squared error as the loss function, with an initial learning rate of 0.0001. We ran the training process for 15 epochs, with a batch size of 32.
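As an illustration of this fine-tuning step, the sketch below replaces the last layer of a ResNet-101 backbone with a single-output regression head and trains it with the hyper-parameters listed above. It is a minimal PyTorch sketch, not our exact implementation: the optimizer choice and the hypothetical keyframe_loader (yielding batches of key-frames and memorability scores) are assumptions, and the backbone shown here starts from ImageNet weights rather than the AVA-trained aesthetics model.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal sketch: adapt a ResNet-101 backbone to regress a single
# memorability score per key-frame (2,048 inputs -> 1 output).
model = models.resnet101(pretrained=True)   # stand-in for the AVA-trained aesthetics model
model.fc = nn.Linear(2048, 1)

criterion = nn.MSELoss()                    # mean squared error loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer choice is an assumption

def fine_tune(model, keyframe_loader, epochs=15):
    # keyframe_loader is a hypothetical DataLoader yielding (frames, scores)
    # batches of size 32, built from the extracted key-frames.
    model.train()
    for _ in range(epochs):
        for frames, scores in keyframe_loader:
            optimizer.zero_grad()
            preds = model(frames).squeeze(1)
            loss = criterion(preds, scores)
            loss.backward()
            optimizer.step()
    return model
```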
2.2 Action recognition networks
Apart from the precomputed C3D features, we extracted the "Mixed_5" layer from the I3D network [2], trained on the Kinetics dataset [10], and the "Inception_5" layer of the TSN network [15], trained on the UCF101 dataset [14]. These features were used as inputs for a Support Vector Regression (SVR) algorithm that generates the final memorability scores. We conducted preliminary early fusion tests with combinations of these features in order to select the best possible combinations, testing each feature vector individually as well as all possible combinations of two feature vectors. We also employed PCA dimensionality reduction, reducing the size of each vector to 128 elements. Finally, to train the SVR system, we used a random 4-fold approach, with 75% of the data representing the training set and 25% representing the validation set. We tuned the parameters of the SVR model, using an RBF kernel and performing a grid search over two parameters, C and gamma (taking values 10^k, where k ∈ [−4, ..., 4]).
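A minimal scikit-learn sketch of this feature-regression stage is given below, under the assumption that the deep descriptors have already been aggregated into one fixed-length vector per video; the arrays are random placeholders, and cv=4 is only a stand-in for the random 4-fold 75%/25% protocol described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Placeholder data: one aggregated feature vector and one score per video.
X_train = np.random.rand(200, 4096)   # stand-in for I3D/TSN/C3D descriptors
y_train = np.random.rand(200)         # stand-in for memorability scores

# PCA to 128 dimensions followed by an RBF-kernel SVR, with a grid search
# over C and gamma taking values 10^k for k in [-4, ..., 4].
pipeline = Pipeline([
    ("pca", PCA(n_components=128)),
    ("svr", SVR(kernel="rbf")),
])
param_grid = {
    "svr__C": [10.0 ** k for k in range(-4, 5)],
    "svr__gamma": [10.0 ** k for k in range(-4, 5)],
}
search = GridSearchCV(pipeline, param_grid, cv=4,
                      scoring="neg_mean_squared_error")  # scoring choice is an assumption
search.fit(X_train, y_train)
predicted_scores = search.predict(X_train)
```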
2.3 Late fusion
We employed several late fusion schemes on the best performing systems, trying to benefit from their combined strengths. We used three different strategies for combining their scores, namely: (i) LFMax, where we took the maximum score for each media sample; (ii) LFMin, where we took the minimum score; (iii) LFWeight, where the score produced by each system was multiplied by a weight w. The weights were assigned according to the formula w = 1 − r/c, where the rank r takes the value 0 for the best performing system, 1 for the second best, and so on, and c is a coefficient that dictates the influence of the rank on the weights.
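The short numpy sketch below illustrates the three fusion strategies on a matrix of per-system scores; it assumes the rows are already ordered by system rank (best performing system first), and normalizing the weighted sum by the total weight is an assumption, since the exact aggregation of the weighted scores is not detailed above.

```python
import numpy as np

def late_fusion(scores, strategy="LFWeight", c=5.0):
    """Fuse memorability scores from several systems.

    scores: array of shape (n_systems, n_samples); rows are assumed to be
            sorted by system rank, best performing system first.
    """
    if strategy == "LFMax":
        return scores.max(axis=0)
    if strategy == "LFMin":
        return scores.min(axis=0)
    # LFWeight: weight w = 1 - r / c, with rank r = 0 for the best system.
    ranks = np.arange(scores.shape[0])
    weights = 1.0 - ranks / c
    # Normalization by the weight sum is an assumption of this sketch.
    return (weights[:, None] * scores).sum(axis=0) / weights.sum()

# Example: fuse two systems (e.g., run2 and run1) with c = 5.
fused = late_fusion(np.array([[0.61, 0.44], [0.58, 0.40]]), "LFWeight", c=5.0)
```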
3 EXPERIMENTAL RESULTS
The development dataset consists of 8,000 videos annotated with short- and long-term memorability scores, while the test dataset consists of 2,000 videos. The official metric used in the task is Spearman's rank correlation (ρ). The best performing systems in the development phase were selected, retrained on the whole devset using the optimal parameters, and finally run on the testset data.

3.1 Results on the devset
During the tests performed on the devset, several systems and combinations of parameters stood out as best performers. Table 1 shows the performances recorded by the best performing aesthetic, action-based, and late fusion systems.

Table 1: Results of the proposed runs (preliminary experiments on devset, and official results on testset). All values are Spearman's ρ.

Run    System description                               Devset                   Testset
                                                         Short-term  Long-term   Short-term  Long-term
run1   Aesthetic-based                                   0.448       0.230       0.401       0.203
run2   Action-based (TSN+I3D)                            0.473       0.259       0.45        0.228
run3   Action-based (C3D+I3D)                            0.433       0.204       0.386       0.184
run4   Late Fusion Action-based (run2 + run3)            0.466       0.200       0.439       0.218
run5   Late Fusion Aesthetic and Action (run1 + run2)    0.494       0.265       0.477       0.232

We used several dataset variations in retraining the aesthetic-based deep network. More precisely, we found that, for short-term memorability, the best performing systems were the ones trained with key-frames extracted from the 5th second and the ones trained with the multi-frame approach, both with a similar Spearman's ρ of 0.45. On the other hand, for the long-term memorability subtask, the best performing systems were the ones trained with key-frames from the 5th second. Although this may seem somewhat surprising, given that bigger datasets usually yield better results, we believe the reason is that each video contains only one scene; therefore, not much additional information is given to the system when more frames are extracted, because the frames are very similar. However, we would also like to point out that the results for the other frame extraction schemes were not much lower than these.

Regarding the 3D action-recognition based systems, we noticed that individual systems based on only one feature vector (TSN, I3D, or C3D) had low performance, with Spearman's ρ scores under 0.42. Performance dropped further when we used the original vectors without applying PCA reduction, demonstrating the positive influence that dimensionality reduction has on the final results. We therefore decided to apply an early fusion scheme, where we tested all the possible combinations of the feature vectors by concatenating them. The best performing combinations were TSN + I3D and C3D + I3D.

Finally, in the late fusion part of the experiments, we decided to test late fusion schemes between the two action-recognition based systems, and between the best performing action-recognition system (TSN + I3D) and the aesthetic-based system. In general, the LFMin systems underperformed, while the LFMax systems were better than their components, but without bringing a significant increase in results. The best performing late fusion schemes proved to be based on LFWeight, more precisely using a c value of 5. This was an expected result, as it confirms some of our previous work in other MediaEval tasks [4].
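For completeness, the sketch below illustrates the early fusion step (concatenating two descriptors per video before the PCA + SVR stage of Section 2.2) and the official metric, Spearman's ρ, computed with scipy. All arrays are random placeholders standing in for the actual features and predictions.

```python
import numpy as np
from scipy.stats import spearmanr

# Early fusion: concatenate two descriptors per video (e.g., TSN and I3D).
tsn_features = np.random.rand(100, 1024)   # placeholder TSN descriptors
i3d_features = np.random.rand(100, 1024)   # placeholder I3D descriptors
fused_features = np.concatenate([tsn_features, i3d_features], axis=1)

# Official metric: Spearman's rank correlation between predicted and
# ground-truth memorability scores.
predicted = np.random.rand(100)            # placeholder predictions
ground_truth = np.random.rand(100)         # placeholder annotations
rho, _ = spearmanr(predicted, ground_truth)
print(f"Spearman's rho: {rho:.3f}")
```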
3.2 Results on the testset
For the final phase, we retrained all the systems on the entire set of videos from the devset, using the parameters computed in the previous phases, and tested them on the videos from the testset. Table 1 also presents the results for this phase.

As expected, the best performance comes from the late fusion system using both aesthetic and action-based components (short-term ρ = 0.477 and long-term ρ = 0.232). Generally, we observe that the ranking of the submitted systems is consistent with the one we observed during the development phase; however, the results are lower than those predicted then, with significant drops in performance for the aesthetic-based system and the action-based (C3D + I3D) approach. In terms of single-system performance, the action-based TSN + I3D system performs best, followed by the aesthetic-based system.

4 CONCLUSIONS
In this paper we presented the UPB-L2S approach for predicting media memorability at MediaEval 2019. We created a framework that uses aesthetic and action recognition-based systems, as well as late fusion combinations of these systems, to predict short-term and long-term memorability scores for soundless video samples. The results show that these systems are able to predict these scores individually, while the best results are achieved via weighted late fusion schemes. This reinforces the idea of further exploiting transfer learning for tasks where labeled data are particularly hard to obtain.

ACKNOWLEDGMENTS
This work was partially supported by the Romanian Ministry of Innovation and Research (UEFISCDI, project SPIA-VA, agreement 2SOL/2017, grant PN-III-P2-2.1-SOL-2016-02-0002).

REFERENCES
[1] Timothy F. Brady, Talia Konkle, George A. Alvarez, and Aude Oliva. 2008. Visual long-term memory has a massive storage capacity for object details. Proceedings of the National Academy of Sciences 105, 38 (2008), 14325–14329.
[2] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6299–6308.
[3] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, and Martin Engilberge. 2019. VideoMem: Constructing, Analyzing, Predicting Short-term and Long-term Video Memorability. In International Conference on Computer Vision (ICCV).
[4] Mihai Gabriel Constantin, Bogdan Andrei Boteanu, and Bogdan Ionescu. 2017. LAPI at MediaEval 2017 - Predicting Media Interestingness. In MediaEval.
[5] Mihai Gabriel Constantin, Bogdan Ionescu, Claire-Hélène Demarty, Ngoc Q. K. Duong, Xavier Alameda-Pineda, and Mats Sjöberg. 2019. Predicting Media Memorability Task at MediaEval 2019. In Proc. of MediaEval 2019 Workshop, Sophia Antipolis, France, Oct. 27-29, 2019.
[6] Mihai Gabriel Constantin, Miriam Redi, Gloria Zen, and Bogdan Ionescu. 2019. Computational understanding of visual interestingness beyond semantics: literature survey and analysis of covariates. ACM Computing Surveys (CSUR) 52, 2 (2019), 25.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[8] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude Oliva. 2013. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2013), 1469–1482.
[9] Chen Kang, Giuseppe Valenzise, and Frédéric Dufaux. 2019. Predicting Subjectivity in Image Aesthetics Assessment. In IEEE 21st International Workshop on Multimedia Signal Processing, 27-29 Sept 2019, Kuala Lumpur, Malaysia.
[10] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, and others. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017).
[11] Naila Murray, Luca Marchesotti, and Florent Perronnin. 2012. AVA: A large-scale database for aesthetic visual analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2408–2415.
[12] Sumit Shekhar, Dhruv Singal, Harvineet Singh, Manav Kedia, and Akhil Shetty. 2017. Show and recall: Learning what makes videos memorable. In Proceedings of the IEEE International Conference on Computer Vision. 2730–2739.
[13] Roger N. Shepard. 1967. Recognition memory for words, sentences, and pictures. Journal of Verbal Learning and Verbal Behavior 6, 1 (1967), 156–163.
[14] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012).
[15] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision. Springer, 20–36.