
Understanding Media Memorability From Event-Related Potential Records And Visual Semantics

Ricardo Kleinlein

Enrique R. Sebastián

Fernando Fernández-Martínez

Grupo de Tecnología del Habla y Aprendizaje Automático, Information Processing and Telecommunications Center, E.T.S.I. de Telecomunicación, Universidad Politécnica de Madrid, 28040 Madrid, Spain

Instituto Cajal, CSIC, Madrid, Spain

2010

The memorability of a video has been defined in the literature as an intrinsic property of its visual features, expressed as the proportion of an audience that successfully remembers having watched that video on a subsequent viewing. Hence our brains must not only cope with information about pixel statistics and scene semantics, but also encode whether it is worth keeping that information in memory for future retrieval. These are the hypotheses behind the 5th edition of the Predicting Media Memorability challenge, which we tackle from a two-fold perspective. First, we pursue a semantics-based approach, using both pre-trained and fine-tuned visual and textual Transformers. Second, we process Event-Related Potential (ERP) data with two feature extraction methods to obtain a representation compatible with cross-subject predictive models of media memorability: (1) extracting sample-level functionals and feeding them as input features to a random forest classifier, and (2) computing coherence maps between sensor recordings at four frequency bands and training a shallow neural classifier on them. Ultimately, we seek to understand why ERP-based approaches pose a far more complex challenge, whereas some of our visual models rival the current state-of-the-art predictive systems.

1. Introduction

A detailed scientific model of the factors by which some visual memories stay with us for a long time while others fade shortly afterwards has eluded mathematical formulation for decades. Recent studies suggest that all visual information reaching our eyes carries a measure of its likelihood of being remembered in subsequent viewings, i.e., its intrinsic memorability [1, 2, 3]. With the rise of social media, an automatic system able to classify a video in these terms is of great interest from both a commercial and a scientific perspective. In this paper, we report on our experience during the 5th edition of the Predicting Media Memorability Challenge [4]. The availability of Electroencephalography (EEG) data enables us not only to study the link between visual features and memorability but also to explore possible mechanisms by which the human brain stores that information, and to build predictive models of media memorability accordingly.

2. Related Work

Although studies on the issue date back to Shepard (1967) and Standing (1973) [5, 6], it was not until the work of Isola et al. [3] that researchers began to think of media memorability as a deterministic function of fundamental visual properties (such as image colour or brightness) and/or the high-level semantic features of a multimedia clip [7, 8, 9]. We use Transformers, highly successful in an array of different tasks [10, 11], either as visual and textual feature extractors or, after fine-tuning, as predictive models of media memorability (Section 3.1).

EEG data open the path to a better understanding of the mechanisms underpinning the encoding of media memorability by the human brain. Much of the difficulty lies in the entanglement between different brain regions operating simultaneously during the process [12, 13, 14]. However, coherence between different brain areas (a measure of the strength of the coupling between the signals recorded by two sensors at specific frequency bands) has been found to relate to memory impairment in Alzheimer's disease [15, 16] and other dementia-related disorders [17]. Furthermore, similar measures of functional connectivity between EEG channels have been shown to correlate with long-term semantic memory [18]. Therefore, in Section 3.2 we propose two alternative preprocessing methods for ERP data, both outlined in Figure 1.
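For reference, coherence between two sensor signals x and y is commonly written as follows. This is the standard spectral definition rather than a formula taken from the cited works, and the symbols S_xy (cross-spectrum) and S_xx, S_yy (auto-spectra) are notation assumed here for illustration.

```latex
C_{xy}(f) = \frac{\lvert S_{xy}(f) \rvert}{\sqrt{S_{xx}(f)\, S_{yy}(f)}},
\qquad 0 \le C_{xy}(f) \le 1
```

Values close to 1 indicate that the two signals are strongly coupled at frequency f, while values close to 0 indicate little linear relationship at that frequency.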

3. Experimental setup and results

A detailed description of the requirements and the data resources available for each subtask can be found in [4]. During the experimental phase we placed special emphasis not only on accurately predicting memorability but also on explaining the decisions made by our models.

System description                        Val.*
Statistical functionals                   0.529
Delta channel only                        0.490
Beta channel only                         0.514
Late-fusion of all channels (Median)      0.534
Late-fusion of all channels (Max.)        0.529

3.1. Subtask 1: Predicting memorability rates from visual features

Our fundamental hypothesis, supported by previous experience [9, 19], is that video-level semantic features are robust indicators of video memorability, given the strong correlation found between certain topics and the average memorability rates of videos depicting them. Here we elaborate on this idea in two ways: either we keep a pre-trained CLIP Vision Transformer (ViT), applied frame-wise to frames extracted at 1 FPS, as a feature extractor on which a linear regressor is trained to predict media memorability (run #4), or we fine-tune a ViT and its textual counterpart on Memento10K data [8] (run #1 and run #2, respectively). We also investigate the degree to which both modalities can help each other in making a prediction: the output of run #3 is the average of the predictions made by run #1 and run #2, while run #5 is the analogous average for run #2 and run #4. In all cases, fine-tuning minimises the mean squared error between predicted and ground-truth memorability scores for 10 epochs. Prediction rates on both the validation and test sets are shown in Table 1.
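The feature-extractor route and the late fusion of two runs can be sketched as below. This is a minimal illustration under stated assumptions, not the code used for the submitted runs: random vectors stand in for CLIP ViT frame embeddings, mean pooling over frames is an assumed aggregation step (the text does not specify one), and the helper names `pool_frames`, `fit_linear_regressor`, `predict`, `late_fusion` and `mse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: random vectors stand in for frame-wise CLIP ViT embeddings.
# The sizes (videos, frames sampled at 1 FPS, embedding dim) are assumptions.
n_videos, n_frames, dim = 200, 3, 16
frame_features = rng.normal(size=(n_videos, n_frames, dim))
scores = rng.uniform(0.4, 1.0, size=n_videos)  # synthetic memorability scores


def pool_frames(features):
    """Average frame embeddings into one vector per video (assumed pooling)."""
    return features.mean(axis=1)


def add_bias(x):
    """Append a constant column so the regressor can learn an intercept."""
    return np.hstack([x, np.ones((x.shape[0], 1))])


def fit_linear_regressor(x, y):
    """Ordinary least squares; returns weights with the bias as last entry."""
    weights, *_ = np.linalg.lstsq(add_bias(x), y, rcond=None)
    return weights


def predict(weights, x):
    return add_bias(x) @ weights


def late_fusion(pred_a, pred_b):
    """Average the predictions of two systems (as in runs #3 and #5)."""
    return (pred_a + pred_b) / 2.0


def mse(pred, target):
    """Mean squared error between predicted and ground-truth scores."""
    return float(np.mean((pred - target) ** 2))


video_features = pool_frames(frame_features)
train, val = slice(0, 150), slice(150, None)
weights = fit_linear_regressor(video_features[train], scores[train])
visual_pred = predict(weights, video_features[val])

# The second system's output is faked with noisy targets for illustration only.
text_pred = scores[val] + rng.normal(scale=0.1, size=visual_pred.shape)
fused_pred = late_fusion(visual_pred, text_pred)

print("visual MSE:", mse(visual_pred, scores[val]))
print("fused MSE:", mse(fused_pred, scores[val]))
```

Averaging is the simplest form of late fusion: it needs no extra training data and only requires the two systems to output scores on the same scale.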

3.2. Subtask 3: Memorability classification from ERP

We propose two different processing pipelines, illustrated in Figure 1, both aimed at computing useful numerical representations for the final task of predicting whether a video will be remembered, irrespective of which subject the data comes from. This is an inherently complex scenario, since two subjects can respond very differently to the same video. Validation and test classification Area Under the Curve (AUC) rates are shown in Table 2. Our first approach consists of concatenating statistical functionals (mean, standard deviation, median, maximum, minimum, kurtosis and the first three quartiles of a sample) to describe each trial (subject-video pair). As the predictive algorithm, we train a random forest model. In our second approach, for each subject and video we compute the coherency between every pair of ERP channels. We used the function “coherencyc” from Chronux, a third-party toolbox for MATLAB®, to compute the mean coherency value in four frequency bands: delta (0.5-4 Hz), theta (4-8 Hz), alpha (8-14 Hz) and beta (14-30 Hz). This yields a 28x28x4 matrix of coherencies between channels in specific spectral bands. These values, flattened into a single vector embedding, form the input features of a shallow neural network whose hidden layer has 256 neurons with a ReLU activation function [20], trained with the Adam optimizer [21] and an adaptive learning rate.
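Both feature extraction methods can be sketched in a few lines of NumPy. This is a rough illustration under stated assumptions, not a port of the original pipeline: it uses Welch-style segment averaging with a Hann window instead of the multitaper estimate computed by Chronux, the functionals are assumed to be computed per channel, and the sampling rate (128 Hz), trial length, segment size and function names are hypothetical. The 28 channels are inferred from the 28x28x4 matrix mentioned above.

```python
import numpy as np

# Frequency bands in Hz, as listed in the text.
BANDS = {"delta": (0.5, 4.0), "theta": (4.0, 8.0),
         "alpha": (8.0, 14.0), "beta": (14.0, 30.0)}


def statistical_functionals(trial):
    """Concatenate per-channel functionals of one trial (channels x samples)."""
    mean = trial.mean(axis=1)
    std = trial.std(axis=1)
    centered = trial - mean[:, None]
    # Kurtosis as the fourth standardised moment (non-excess form, assumed).
    kurt = (centered ** 4).mean(axis=1) / np.maximum(std, 1e-12) ** 4
    q1, q2, q3 = np.percentile(trial, [25, 50, 75], axis=1)
    feats = [mean, std, np.median(trial, axis=1), trial.max(axis=1),
             trial.min(axis=1), kurt, q1, q2, q3]
    return np.concatenate(feats)


def band_coherence(trial, fs, nperseg=128):
    """Mean coherence magnitude per channel pair and band: (ch, ch, bands)."""
    n_samples = trial.shape[1]
    step = nperseg // 2  # 50% overlap between consecutive segments
    starts = range(0, n_samples - nperseg + 1, step)
    segs = np.stack([trial[:, s:s + nperseg] for s in starts], axis=1)
    segs = segs - segs.mean(axis=2, keepdims=True)  # remove per-segment offset
    spectra = np.fft.rfft(segs * np.hanning(nperseg), axis=2)  # (ch, seg, freq)
    # Cross-spectra averaged over segments for every channel pair.
    cross = np.einsum("isf,jsf->ijf", spectra, spectra.conj()) / spectra.shape[1]
    power = np.real(np.einsum("iif->if", cross))  # auto-spectra (diagonal)
    coh = np.abs(cross) / np.sqrt(power[:, None, :] * power[None, :, :] + 1e-20)
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    maps = [coh[:, :, (freqs >= lo) & (freqs < hi)].mean(axis=2)
            for lo, hi in BANDS.values()]
    return np.stack(maps, axis=2)


rng = np.random.default_rng(0)
trial = rng.normal(size=(28, 512))  # synthetic stand-in for one ERP trial
functionals = statistical_functionals(trial)   # input to a random forest
coherence_maps = band_coherence(trial, fs=128.0)
embedding = coherence_maps.reshape(-1)         # input to a shallow network
print(functionals.shape, coherence_maps.shape, embedding.shape)
```

With these assumed settings the frequency resolution is 1 Hz, so every band contains at least three frequency bins. Shorter segments would give smoother estimates but could leave the delta band without any bin to average.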

4. Discussion and outlook

Interestingly, a fine-tuned ViT performs worse than a simpler linear regressor trained on features from the pre-trained version of the full model, whereas the same does not seem to happen in the case of text. Computing explanations using a custom version of LIME [22], a popular post-hoc local surrogate method [23], we notice that while the text-based model bases its predictions on concepts that we know correlate well with memorability [9], our fine-tuned ViT (run #1) might be generalising worse due to overfitting (Figure 2). As illustrated in Figure 3, it is hard to find a clear pattern of neural activity across subjects when using ERP data to predict memorability. A few subjects show high memorability rates (subjects 4 and 9), yet the rest fail about 80% of the time, leaving an extremely unbalanced dataset that adds to the overall complexity of the task. As a future line of research, we believe it would be particularly interesting to explore multimodal EEG-visual-textual models, in order to further develop scientific knowledge about what information from a video clip actually leaves a lasting footprint on our brains.

Acknowledgements

Our work has been supported by the Spanish Ministry of Science and Innovation: projects GOMINOLA (PID2020-118112RB-C21, PID2020-118112RB-C22, funded by MCIN/AEI/10.13039/501100011033), AMIC-PoC (PDC2021-120846-C42, funded by MCIN/AEI/10.13039/501100011033 and by the European Union “NextGenerationEU/PRTR”), and the Spanish Ministry of Education (FPI grant PRE2018-083225).

[1] Isola, Parikh, Torralba, Oliva, Understanding the intrinsic memorability of images, in: Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS'11), Curran Associates Inc., Red Hook, NY, USA, 2011, pp. 2429-2437.

[2] Isola, Xiao, Torralba, Oliva, What makes an image memorable?, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 145-152.

[3] Isola, Xiao, Parikh, Torralba, Oliva, What makes a photograph memorable?, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2014) 1469-1482.

[4] Sweeney, M. G. Constantin, C.-H. Demarty, C. Fosco, A. García Seco de Herrera, S. Halder, G. Healy, B. Ionescu, A. Matran-Fernandez, A. F. Smeaton, M. Sultana, Overview of the MediaEval 2022 predicting video memorability task, in: MediaEval Multimedia Benchmark Workshop Working Notes, 2023.

[5] R. N. Shepard, Recognition memory for words, sentences, and pictures, Journal of Verbal Learning and Verbal Behavior 6 (1967) 156-163.

[6] Standing, Learning 10000 pictures, Quarterly Journal of Experimental Psychology 25 (1973) 207-222.

[7] Bylinskii, Goetschalckx, Newman, Oliva, Memorability: An image-computable measure of information utility, 2021. arXiv:2104.00805.

[8] Newman, Fosco, Casser, Lee, McNamara, Oliva, Multimodal memorability: Modeling effects of semantics and decay on video memorability, 2020. arXiv:2009.02568.

[9] Kleinlein, Luna-Jiménez, Arias-Cuadrado, Ferreiros, Fernández-Martínez, Topic-oriented text features can match visual deep models of video memorability, Applied Sciences 11 (2021).

[10] Vaswani, Shazeer, Parmar, Uszkoreit, Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, Bengio, Wallach, Fergus, Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.

[11] Radford, J. W. Kim, Hallacy, Ramesh, G. Goh, Agarwal, Sastry, Askell, Mishkin, Clark, Krueger, I. Sutskever, Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning (ICML), 2021.

[12] Han, Chen, Shao, Hu, J. Han, T. Liu, Learning computational models of video memorability from fMRI brain imaging, IEEE Transactions on Cybernetics 45 (2015) 1692-1703.

[13] R. F. Thompson, J. J. Kim, Memory systems in the brain and localization of a memory, Proceedings of the National Academy of Sciences 93 (1996) 13438-13444.

[14] Jaegle, Mehrpour, Mohsenzadeh, T. Meyer, A. Oliva, Rust, Population response magnitude variation in inferotemporal cortex predicts image memorability, eLife 8 (2019).

[15] Adler, Brassen, Jajcevic, EEG coherence in Alzheimer's dementia, Journal of Neural Transmission 110 (2003) 1051-1058.

[16] M. J. Hogan, G. Swanwick, J. Kaiser, M. Rowan, B. Lawlor, Memory-related EEG power and coherence reductions in mild Alzheimer's disease, International Journal of Psychophysiology 49 (2003) 147-163.

[17] Laptinskaya, Fissler, O. C. Küster, Wischniowski, Thurm, Elbert, C. A. F. von Arnim, I.-T. Kolassa, Global EEG coherence as a marker for cognition in older adults at risk for dementia, Psychophysiology 57 (2020).

[18] Hanouneh, H. U. Amin, N. M. Saad, A. S. Malik, EEG power and functional connectivity correlates with semantic long-term memory retrieval, IEEE Access 6 (2018) 8695-8703.

[19] Kleinlein, Luna-Jiménez, Fernández-Martínez, THAU-UPM at MediaEval 2021: From video semantics to memorability using pretrained transformers, in: MediaEval'21 Online, 2021.

[20] A. F. Agarap, Deep learning using rectified linear units (ReLU), ArXiv abs/1803.08375 (2018).

[21] D. P. Kingma, Ba, Adam: A method for stochastic optimization, CoRR abs/1412.6980 (2014).

[22] Kleinlein, Hepburn, Santos-Rodríguez, Fernández-Martínez, Sampling based on natural image statistics improve local surrogate explainers, in: The 33rd British Machine Vision Conference, 2022.

[23] M. T. Ribeiro, Singh, Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM International Conference on Knowledge Discovery and