Dublin’s Participation in the Predicting Media Memorability Task at MediaEval 2018

Alan F. Smeaton 1, Owen Corrigan 1, Paul Dockree 2, Cathal Gurrin 1, Graham Healy 1, Feiyan Hu 1, Kevin McGuinness 1, Eva Mohedano 1, Tomás Ward 1
1 Insight Centre for Data Analytics, Dublin City University
2 School of Psychology and Trinity College Institute of Neuroscience, Trinity College Dublin, Ireland
alan.smeaton@dcu.ie

ABSTRACT
This paper outlines six approaches taken to computing video memorability for the MediaEval Predicting Media Memorability Task. The approaches are based on pre-computed video features, an end-to-end system, saliency, aesthetics, neural feedback, and an ensemble of all approaches.

1 INTRODUCTION
In our work we seek to explore theories from psychology and neuroaesthetics which may guide predictors for the memorability of visual media. Two caveats apply. The first is that most of the ideas from neuroaesthetics come from the perception of visual art or artificial experimental stimuli rather than real-life scenes, so these ideas might not translate. The second is that, over and above the aesthetics of the video or its keyframes, we cannot control for the semantic content or the emotional salience of the imagery for the viewer, just as we cannot control for the viewer's attention or concentration while initially viewing or subsequently trying to remember the video.

Our first principle is the idea that aesthetically pleasing features are driven by Gestalt principles [10], including grouping, symmetry and lines of good continuation. In each case, items in a scene are bound together into coherent groups or continuous unbroken forms by our visual system. According to Ramachandran [7], these Gestalt principles are driven by neural mechanisms in our perceptual system that trigger the brain's reward system, so that our attention is reflexively drawn to these features. There is also some evidence that grouping of visual features not only increases attention but also benefits visual working memory [6].

Our second principle, in opposition to processing a coherent whole, is that images showing distinctive figure/ground arrangements may also capture attention and thus promote memorability. Accordingly, another of Ramachandran's laws of neuroaesthetics is "isolation", in which a key visual feature has exaggerated importance and stands out from the surrounding information [8].

Although these aesthetic features are intrinsic qualities of images that capture attention, it is less clear how they affect memorability. However, superior attention based on these qualities should increase encoding of the videos and hence improve memorability. Thus a key prediction based on these principles is that a U-shaped relationship should emerge, in which the most globally coherent video images and the most locally distinctive images should both be more memorable than the video frames that fall in between these extremes, i.e. those that are neither particularly globally coherent nor locally distinctive.

The work in this paper was carried out in the context of the 2018 MediaEval Predicting Media Memorability task, and we refer the reader to the task description for prior art [1].

2 RUNS SUBMITTED

2.1 Machine Learning with Pre-Computed Features
In this run, we evaluated the performance of a neural network trained on the pre-computed features provided by the task organisers. These features include C3D features, HMP, HOG descriptors and more; the complete list can be found in [1]. To merge these different features, we simply flattened them into one long vector. Using this as input, we trained a multi-layer perceptron which outputs a probability. We tested a number of architectures and found in testing that using 3 layers was optimal.

2.2 An End-to-end System
For our end-to-end system we used 3 keyframe images from the raw videos as inputs. At each epoch, we selected one frame randomly from the video as a form of data augmentation. For the architecture, we tried two standard models: VGG16 [9] and ResNet18 [2]. We modified these networks by changing the output to target a single variable, memorability, instead of a vector of class probabilities. We also investigated using different numbers of dense layers after the convolutional layers. Surprisingly, we found that using a single layer with VGG16 gave the best results. Our loss function was mean squared error, and we used a gradient descent optimizer.
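To make this configuration concrete, the following is a minimal sketch assuming PyTorch and torchvision, not the exact training code used: the 224×224 input size, the single dense layer replacing VGG16's classifier and the optimiser hyper-parameters are illustrative assumptions, while the random keyframe selection, the mean squared error loss and the gradient descent optimizer follow the description above.

```python
# Illustrative sketch only (assumed PyTorch/torchvision), not the exact training code used.
import random
import torch
import torch.nn as nn
from torchvision import models

# VGG16 backbone with its classifier replaced by a single dense layer that
# regresses one value: the memorability score.
model = models.vgg16(pretrained=True)
model.classifier = nn.Linear(512 * 7 * 7, 1)   # assumed 224x224 inputs -> 7x7x512 conv output

criterion = nn.MSELoss()                        # mean squared error, as described above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # assumed settings

def train_step(keyframes, memorability):
    """keyframes: list of three preprocessed frame tensors (3x224x224) for one video.
    memorability: ground-truth memorability score for that video."""
    frame = random.choice(keyframes)            # one randomly chosen frame per epoch (augmentation)
    pred = model(frame.unsqueeze(0)).squeeze()
    loss = criterion(pred, torch.tensor(memorability, dtype=torch.float32))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```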
2.3 Using Video and Image Saliency
Visual saliency models generate a probability map highlighting the image regions that most attract human attention. Here, this information is explored for the task of predicting media memorability. More precisely, a saliency map for each frame of video is computed with the SalGAN model [5].

The maps are used to spatially weight the activations of the last convolutional layer of Inception-v3 pre-trained on ImageNet. For that, video frames are resized to 300×300 resolution and forwarded through Inception-v3 to generate convolutional volumes of 7 × 7 × 2048 (the first two dimensions correspond to the spatial resolution, and the last one to the number of channels, or depth, of the layer). Saliency maps are downsized to 7 × 7, normalised to contain values between 0 and 1, and element-wise multiplied with the convolutional activations. Global average pooling is applied over the spatial dimensions to obtain a final representation of 2048 dimensions. The hypothesis here is that the denser the saliency map, the more human attention the image draws and, consequently, the more memorable it may be.

This 2048-dimensional vector was then fed into a neural network, similar to how the pre-computed features were used in Section 2.1.
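The spatial weighting and pooling step can be summarised in a few lines. The sketch below assumes PyTorch tensors and that the SalGAN saliency map and the Inception-v3 convolutional volume for a frame have already been computed; the function name and the min-max normalisation are illustrative assumptions.

```python
# Illustrative sketch (assumed PyTorch); SalGAN and Inception-v3 feature extraction not shown.
import torch
import torch.nn.functional as F

def saliency_weighted_descriptor(conv_volume, saliency_map):
    """conv_volume: tensor of shape (2048, 7, 7) from Inception-v3's last convolutional layer.
    saliency_map: tensor of shape (H, W) produced by SalGAN for the same frame."""
    # Downsize the saliency map to the 7x7 spatial resolution of the activations.
    sal = F.interpolate(saliency_map[None, None], size=conv_volume.shape[1:],
                        mode='bilinear', align_corners=False)[0, 0]
    # Normalise the map to the 0-1 range (min-max normalisation assumed here).
    sal = (sal - sal.min()) / (sal.max() - sal.min() + 1e-8)
    # Element-wise multiply each channel by the saliency weights, then global average pool
    # over the spatial dimensions to obtain the final 2048-dimensional descriptor.
    weighted = conv_volume * sal                 # broadcasts over the 2048 channels
    return weighted.mean(dim=(1, 2))             # shape: (2048,)
```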
2.4 Using a Neural Approach
In this approach we used the human reaction to a second viewing of a video keyframe to train a classifier for memorability, a true human-in-the-loop experiment. The middle frame was extracted from each video clip in the test set and a participant was shown these images at high speed (4 Hz) on a computer screen while their EEG (electroencephalography) signals were simultaneously recorded. Each of the 2,000 extracted test-set images was presented twice. Following completion of the first viewing, EEG signals were band-passed between 0.5 Hz and 10 Hz, re-referenced to a common average reference, and the mean voltage between 300 ms and 600 ms following each image presentation was calculated for the Pz channel (baselined to -250 ms to 0 ms prior to image presentation). The participant then viewed the images a second time, with similar EEG data recording and processing, and the values were averaged over the two presentations of each image; these averages formed the submission scores. These parameters were selected as they are known to correspond to the time region and electrode location in which a P300 event-related potential is typically observed in this type of task where attention is elicited [3]. The rationale is that high-amplitude P300 responses correspond to imagery which is visually attended to and thus potentially more memorable, which should also stimulate visual working memory [6]. We then computed the Pearson correlation between the P300 signals and the memorability scores to evaluate the performance of this feature.

2.5 Computing Visual Aesthetics
A final technique we incorporated was to use our own version of an image aesthetics classifier, as described in [4], instead of the values provided by the task organisers. This maps back to our guiding principles driven by neuroaesthetics, described earlier.

2.6 An Ensemble of All Techniques
In each of the approaches above we made predictions for the entire training set, as well as for the entire test set after training had completed. One limitation to note is that, due to the time-consuming nature of the EEG labelling in Section 2.4, only a subset of the training dataset (2,000 videos) was used in this ensemble run. We used predictions from each of the above approaches and trained a linear model on this subset of the training data to identify which were the most important predictors. We then used these weights to combine the values on the test set, which generated this run.
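The combination step can be illustrated as follows. This is a minimal sketch assuming scikit-learn and an ordinary least-squares linear model; the function and array names are hypothetical, and the exact form of linear model and any regularisation used are not specified in the description above.

```python
# Illustrative sketch only (assumed scikit-learn); names are hypothetical.
from sklearn.linear_model import LinearRegression

def combine_runs(train_preds, train_scores, test_preds):
    """train_preds: (2000, n_runs) predictions from each approach on the EEG-labelled
    training subset; train_scores: (2000,) ground-truth memorability for those videos;
    test_preds: (n_test, n_runs) predictions from the same approaches on the test set."""
    model = LinearRegression().fit(train_preds, train_scores)
    # The fitted coefficients indicate which approaches are the most important predictors.
    print("per-approach weights:", model.coef_)
    # Apply the learned weights to the test-set predictions to produce the ensemble run.
    return model.predict(test_preds)
```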
3 RESULTS, CONCLUSIONS AND FUTURE PLANS
The performance results of our submissions are shown in Table 1 and illustrated in Figure 1.

Run type                  Ensemble   Features   End-to-end   Saliency    Neural
Short-term memorability
  Spearman                 -0.018     0.051      0.055       -0.015     -0.027
  Pearson                  -0.019     0.026      0.085       -0.015     -0.031
  MSE                       0.0089    0.0069     0.0069       0.0073     0.0089
Long-term memorability
  Spearman                  0.039     0.037      0.017        0.007     -0.024
  Pearson                   0.021     0.016      0.032        0.006     -0.024
  MSE                       0.0207    0.0205     0.0207       0.0208     0.0207
Table 1: Results of our submitted runs

Figure 1: Performance for memorability classification

The results show that the run based on direct neural/EEG feedback from the human participant was the worst, as expected; part of the reason might be that training was done with only 2,000 images and with only one participant. It is definitely worth scaling up this approach to see its performance with more data. The run based on saliency was somewhat better than the neural run, especially for long-term memorability. The ordering by performance of the provided-features, ensemble and end-to-end submissions shows contradictions across runs, across long- vs. short-term memorability, and across the metric used, but the end-to-end system seems to have performed best, which is surprising. Overall, our results seem poor for the above reason, or because of insufficient tuning of parameter settings in our experiments.

ACKNOWLEDGMENTS
This work was partially supported by Science Foundation Ireland under the SFI Research Centres Programme, grant number SFI/12/RC/2289.

REFERENCES
[1] R. Cohendet, C.-H. Demarty, N.Q. Duong, M. Sjöberg, B. Ionescu, and T.-T. Do. 2018. MediaEval 2018: Predicting Media Memorability. In Proceedings of the MediaEval 2018 Workshop, Sophia Antipolis, France, 29–31 October 2018. CEUR-WS.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Las Vegas, United States, 770–778.
[3] Graham Healy, Tomas Ward, Cathal Gurrin, and Alan F. Smeaton. 2017. Overview of NTCIR-13 NAILS Task. In Proceedings of NTCIR-13 NAILS (Neurally Augmented Image Labelling Strategies). National Institute of Informatics, Tokyo, Japan, 380–383.
[4] Feiyan Hu and Alan F. Smeaton. 2018. Image Aesthetics and Content in Selecting Memorable Keyframes from Lifelogs. In MultiMedia Modeling: 24th International Conference, MMM 2018, Bangkok, Thailand, February 5-7, 2018, Proceedings, Part I. Springer, 608–619.
[5] Junting Pan, Cristian Canton-Ferrer, Kevin McGuinness, Noel E. O'Connor, Jordi Torres, Elisa Sayrol, and Xavier Giró-i-Nieto. 2017. SalGAN: Visual Saliency Prediction with Generative Adversarial Networks. CoRR abs/1701.01081 (2017). arXiv:1701.01081 http://arxiv.org/abs/1701.01081
[6] Dwight J. Peterson and Marian E. Berryhill. 2013. The Gestalt Principle of Similarity Benefits Visual Working Memory. Psychonomic Bulletin & Review 20, 6 (Dec 2013), 1282–1289.
[7] Vilayanur S. Ramachandran. 2012. The Tell-Tale Brain: A Neuroscientist's Quest for What Makes Us Human. W.W. Norton & Company, New York, NY.
[8] Vilayanur S. Ramachandran and Diane Rogers-Ramachandran. 2010. Reading between the Lines. Scientific American Mind 21, 4 (2010), 18–20.
[9] Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014). arXiv:1409.1556 http://arxiv.org/abs/1409.1556
[10] D. Todorovic. 2008. Gestalt Principles. Scholarpedia 3, 12 (2008), 5345, revision #91314.