Video Memorability Prediction with Recurrent Neural Networks and Video Titles at the 2018 MediaEval Predicting Media Memorability Task

Wensheng Sun, Michigan Technological University, Houghton, USA, wsun3@mtu.edu
Xu Zhang, Saginaw Valley State University, Saginaw, USA, xzhang21@svsu.edu

Copyright held by the owner/author(s).
MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
This paper describes the approach developed to predict short-term and long-term video memorability at the 2018 MediaEval Predicting Media Memorability Task [1]. The approach uses scene semantics derived from the video titles with natural language processing (NLP) techniques and a recurrent neural network (RNN). Compared to video-based features, this approach has a low computational cost for feature extraction. The performance of the semantic-based methods is compared with that of aesthetic feature-based methods using support vector regression (ϵ-SVR) and artificial neural network (ANN) models, and the possibility of predicting the highly subjective media memorability with simple features is explored.

1 INTRODUCTION
Knowledge of the memorability of a video has potential applications in advertisement and content recommendation. Although highly subjective, media memorability has been shown to be measurable and predictable. As with most other machine learning problems, finding the most relevant features and the right model is the key to successful prediction of media memorability. In [2], the authors investigate features that correlate with image memorability. They show that simple image features such as color and the number of objects have negligible correlation with image memorability, whereas semantics are significantly correlated with it.

Even though images reportedly differ from videos in many respects [3], the similarity and connection between images and videos motivate this work to explore the possible connection between the semantics of a video and its memorability at the 2018 MediaEval Predicting Media Memorability Task [1]. This hypothesis is confirmed in [4], where the authors show that visual semantic features provide the best prediction among the audio and visual features considered. Different from [4], this work uses an RNN to extract the semantics from the video titles and to predict video memorability.

Compared to video-based features, extracting the semantics of a video from its title has a relatively low feature extraction cost. Moreover, the authors in [5] demonstrate a strong connection between aesthetic features and image interestingness. Therefore, models that predict video memorability from the precomputed aesthetic features [6] provided by the organizers are also developed and compared with the semantic-based models.

2 APPROACH

2.1 Semantic-based Models

Table 1: Official test results (Spearman's rank correlation)

Run  Method           Short-term  Long-term
1    SVR+AF(Median)   0.315299    0.083562
2    SVR+AF(Mean)     0.347227    0.091239
3    ANN+AF(Mean)     0.121194    0.057660
4    RNN+Captions     0.356349    0.213220
5    SVR+Captions     0.230784    0.111450

The main model, corresponding to run 4, is a three-layer neural network with a recurrent layer; its structure is depicted in Fig. 1. After the titles are imported, punctuation and whitespace are removed. The texts are then tokenized into integer sequences of length 20: longer titles are truncated and shorter titles are padded with zeros. After this preprocessing, 80% of the training dataset is randomly chosen to train the model, and the remaining 20% is used for model evaluation.

The tokenized titles are fed to an embedding layer with an output dimension of 15. The embedding matrix is initialized from a uniform distribution, and no embedding regularizer is used. The semantics are extracted by a fully connected recurrent layer with 10 units placed after the embedding layer. The activation function of the recurrent layer is the hyperbolic tangent, and the layer uses a bias vector initialized to zeros. The initializer for the kernel weight matrix used in the linear transformation of the inputs is "glorot uniform", and the initializer for the recurrent kernel weight matrix used in the linear transformation of the recurrent state is "orthogonal". A 10-node fully connected dense layer with the rectified linear unit (ReLU) activation function follows; its kernel regularizer is l1-l2 regularization with λ1 = 0.001 and λ2 = 0.004, and its initialization scheme is the same as that of the RNN layer. The last layer is a 2-node dense layer with a linear activation function that predicts the short-term and long-term memorability simultaneously. The model is trained with the RMSprop optimizer against the mean absolute error (MAE) for 10 epochs with a batch size of 20.
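For concreteness, the run-4 architecture described above can be expressed as a minimal Keras-style sketch. The sequence length, layer sizes, activations, initializers, regularization constants, optimizer, loss, and training schedule follow the description above; the vocabulary size, the tokenizer, and the data-loading helper are assumptions, since the paper does not specify them.

```python
# Minimal Keras-style sketch of the run-4 model (title RNN).
# VOCAB_SIZE and the Tokenizer are assumptions; the paper specifies only the
# sequence length, layer sizes, initializers, regularizers, optimizer, and loss.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, SimpleRNN, Dense
from tensorflow.keras.regularizers import l1_l2

MAX_LEN = 20        # titles truncated or zero-padded to 20 tokens
EMBED_DIM = 15      # embedding output dimension
VOCAB_SIZE = 5000   # assumption: vocabulary size is not stated in the paper

def build_title_rnn():
    model = Sequential([
        Input(shape=(MAX_LEN,)),
        # Embedding matrix initialized from a uniform distribution, no regularizer.
        Embedding(VOCAB_SIZE, EMBED_DIM, embeddings_initializer="uniform"),
        # Fully connected recurrent layer: 10 units, tanh activation,
        # glorot-uniform kernel, orthogonal recurrent kernel, zero bias.
        SimpleRNN(10, activation="tanh",
                  kernel_initializer="glorot_uniform",
                  recurrent_initializer="orthogonal",
                  bias_initializer="zeros"),
        # 10-node ReLU layer with combined l1-l2 kernel regularization.
        Dense(10, activation="relu",
              kernel_initializer="glorot_uniform",
              kernel_regularizer=l1_l2(l1=0.001, l2=0.004)),
        # 2-node linear output: short-term and long-term memorability scores.
        Dense(2, activation="linear"),
    ])
    model.compile(optimizer="rmsprop", loss="mae")
    return model

# Usage sketch (load_dev_titles is hypothetical):
# titles, targets = load_dev_titles()
# tok = Tokenizer(num_words=VOCAB_SIZE)
# tok.fit_on_texts(titles)
# X = pad_sequences(tok.texts_to_sequences(titles), maxlen=MAX_LEN)
# build_title_rnn().fit(X, targets, epochs=10, batch_size=20, validation_split=0.2)
```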
Similar to the model in [4], the extracted semantics are also combined with a support vector regression (ϵ-SVR) model to generate run 5, whose structure is likewise shown in Fig. 1. After the preprocessing stage, the dimensionality of the tokenized titles is reduced through principal component analysis (PCA) so that 90% of the variance is explained, and the output is fed into an ϵ-SVR model. The penalty parameter C of the error term is set to 0.1, and ϵ, which defines a tube within which no penalty is associated with the training loss, is set to 0.01. A radial basis function is used as the kernel. These hyperparameters are obtained through grid-search cross-validation using Spearman's rank correlation as the scoring metric.

Figure 1: Semantic-based models; the recurrent neural network and ϵ-SVR models correspond to runs 4 and 5, respectively. [Diagram: titles → punctuation removal → vectorization → embedding layer → RNN layer (tanh, 10) → dense layer (ReLU, 10) → dense layer (linear, 2) for run 4; preprocessing → PCA (90%) → ϵ-SVR for run 5.]
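A scikit-learn-style sketch of the run-5 pipeline is given below, under stated assumptions: the padded token sequences are treated as fixed-length numeric vectors before PCA, and the parameter grid is illustrative, since the paper reports only the selected values C = 0.1 and ϵ = 0.01 and the use of Spearman's rank correlation as the cross-validation score. Runs 1 and 2 in Section 2.2 reuse the same pattern with standardized aesthetic features and 95% retained variance.

```python
# Sketch of the run-5 pipeline: PCA to 90% explained variance followed by an
# RBF-kernel epsilon-SVR, tuned by grid-search cross-validation scored with
# Spearman's rank correlation. The parameter grid is an assumption; the paper
# reports only the selected values C = 0.1 and epsilon = 0.01.
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

def spearman_score(y_true, y_pred):
    """Spearman's rank correlation, the official task metric."""
    return spearmanr(y_true, y_pred).correlation

pipeline = Pipeline([
    ("pca", PCA(n_components=0.90)),   # keep components explaining 90% of the variance
    ("svr", SVR(kernel="rbf")),
])

param_grid = {                          # assumed grid around the reported values
    "svr__C": [0.01, 0.1, 1.0],
    "svr__epsilon": [0.001, 0.01, 0.1],
}

search = GridSearchCV(pipeline, param_grid,
                      scoring=make_scorer(spearman_score), cv=5)

# Usage sketch: SVR is single-output, so one model is fitted per target.
# X = pad_sequences(...)               # padded token sequences as numeric vectors
# search.fit(X, y_short_term)          # repeat for y_long_term
# print(search.best_params_)           # e.g. {"svr__C": 0.1, "svr__epsilon": 0.01}
```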
2.2 Aesthetic Feature-based Models
This section describes the models that use the precomputed aesthetic features [6]. As shown in Fig. 2, runs 1 and 2 are generated by ϵ-SVR models using aesthetic visual features aggregated at the video level by the median and the mean, respectively. In both runs, the input features are first standardized, and PCA is applied to reduce the dimensionality of the data while retaining 95% of the variance. A radial basis function kernel is used in both runs. The grid-search cross-validated best parameters for the ϵ-SVR model are C = 0.01 and ϵ = 0.1.

The evaluation results show that the mean aesthetic features are more relevant to video memorability. Run 3 is therefore generated using an ANN and the mean aesthetic features, as illustrated in Fig. 2. The ANN model consists of three dense layers. The first two are fully connected layers with 50 nodes each, using the ReLU activation function and l2 regularization with a penalty constant of 0.001; their dropout rates are 0.1 and 0.5, respectively. The output layer has two nodes and uses a linear activation function. Mean squared error (MSE) is used as the loss function during training, and the validation data is randomly chosen from the training data within each epoch. The model is trained for 20 epochs with a batch size of 32.

Figure 2: Aesthetic feature-based models; ϵ-SVR models with median and mean aesthetic features correspond to runs 1 and 2, respectively, and the ANN with mean aesthetic features generates run 3. [Diagram: AF (median/mean) → standardization → PCA (95%) → ϵ-SVR for runs 1 and 2; AF (mean) → dense layer (ReLU, 50, DR=0.1) → dense layer (ReLU, 50, DR=0.5) → dense layer (linear, 2) for run 3.]
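A minimal Keras-style sketch of the run-3 ANN is given below. The layer widths, regularization, dropout rates, loss, epochs, and batch size follow the description above; the optimizer and the input dimensionality are not stated in the paper, so RMSprop and the input_dim argument are assumptions, and validation_split only approximates the per-epoch validation sampling described above.

```python
# Minimal Keras-style sketch of the run-3 ANN on mean aesthetic features.
# The optimizer and input dimensionality are assumptions (not given in the paper).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.regularizers import l2

def build_aesthetic_ann(input_dim):
    model = Sequential([
        Input(shape=(input_dim,)),
        # Two fully connected ReLU layers, 50 nodes each, l2 penalty 0.001,
        # with dropout rates of 0.1 and 0.5, respectively.
        Dense(50, activation="relu", kernel_regularizer=l2(0.001)),
        Dropout(0.1),
        Dense(50, activation="relu", kernel_regularizer=l2(0.001)),
        Dropout(0.5),
        # 2-node linear output: short-term and long-term memorability scores.
        Dense(2, activation="linear"),
    ])
    model.compile(optimizer="rmsprop", loss="mse")   # optimizer assumed
    return model

# Usage sketch:
# model = build_aesthetic_ann(X_mean_af.shape[1])    # X_mean_af: mean aesthetic features
# model.fit(X_mean_af, y, epochs=20, batch_size=32, validation_split=0.2)
```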
3 RESULTS AND ANALYSIS
From the returned evaluation results in Table 1, the following conclusions can be drawn. 1) The model using the RNN and semantics is the best among the five models, which confirms that the semantics of a video are more relevant to both its short-term and long-term memorability than aesthetic features; for long-term memorability in particular, the semantic-based models outperform the aesthetic feature-based models unanimously. 2) Without the recurrent layer, the performance decreases. It can therefore be inferred that the interaction between objects in a video has more impact on its long-term and short-term memorability than knowledge of the objects alone. 3) Even though there is some correlation between short-term and long-term memorability, as depicted in Fig. 3, the results show that short-term memorability is more predictable than long-term memorability, since all models score higher on the short-term task. As illustrated in Fig. 3, the long-term scores range from 0.2 to 1 and exhibit higher variance than the short-term scores, which range from 0.4 to 1. One possible reason is that long-term memorability is more subjective and depends more on the individual's memory.

Figure 3: Correlation between the two types of memorability.

The SVR models using median and mean aesthetic features perform close to run 4 in terms of short-term memorability prediction; however, their long-term performance is far worse than that of run 4, and further investigation is needed to clarify this. The performance of run 3 is worse than that of run 2, even though both use mean aesthetic features; possible reasons are over-fitting and the missing standardization procedure in run 3. In the future, ensemble methods are expected to further enhance the prediction accuracy.

REFERENCES
[1] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. 2018. MediaEval 2018: Predicting Media Memorability Task. In Proc. of the MediaEval 2018 Workshop, Sophia Antipolis, France, 29-31 October 2018. arXiv:1807.01052.
[2] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude Oliva. 2014. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2014), 1469–1482.
[3] S. Shekhar, D. Singal, H. Singh, M. Kedia, and A. Shetty. 2017. Show and Recall: Learning What Makes Videos Memorable. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), 2730–2739. https://doi.org/10.1109/ICCVW.2017.321
[4] Romain Cohendet, Karthik Yadati, Ngoc Q. K. Duong, and Claire-Hélène Demarty. 2018. Annotating, understanding, and predicting long-term video memorability. In Proc. of the ICMR 2018 Workshop, Yokohama, Japan, June 11-14, 2018.
[5] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool. 2013. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision, 1633–1640.
[6] Andreas F. Haas, Marine Guibert, Anja Foerschner, Sandi Calhoun, Emma George, Mark Hatay, Elizabeth Dinsdale, Stuart A. Sandin, Jennifer E. Smith, Mark J. A. Vermeij, and others. 2015. Can we measure beauty? Computational evaluation of coral reef aesthetics. PeerJ 3 (2015), e1390.