Video Memorability Prediction with Recurrent Neural Networks and Video Titles at the 2018 MediaEval Predicting Media Memorability Task

Wensheng Sun, Michigan Technological University, Houghton, USA, wsun3@mtu.edu
Xu Zhang, Saginaw Valley State University, Saginaw, USA, xzhang21@svsu.edu

Copyright held by the owner/author(s).
MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

ABSTRACT
This paper describes the approach developed to predict short-term and long-term video memorability at the 2018 MediaEval Predicting Media Memorability Task [1]. The approach uses scene semantics derived from the video titles with natural language processing (NLP) techniques and a recurrent neural network (RNN). Compared to video-based features, this approach has a low computational cost for feature extraction. The performance of the semantic-based methods is compared with that of aesthetic feature-based methods using support vector regression (ϵ-SVR) and artificial neural network (ANN) models, and the possibility of predicting the highly subjective media memorability with simple features is explored.

1 INTRODUCTION
Knowledge of the memorability of a video has potential applications in advertisement and content recommendation. Although highly subjective, media memorability has been shown to be measurable and predictable. As with most other machine learning problems, finding the most relevant features and the right model is the key to successful prediction of media memorability. In [2], the authors investigate features that correlate with image memorability. They show that simple image features such as color and the number of objects have negligible correlation with image memorability, whereas semantics are significantly correlated with it.

Even though images reportedly differ from videos in many respects [3], the similarity and connection between images and videos motivate this work to explore the possible connection between the semantics of a video and its memorability at the 2018 MediaEval Predicting Media Memorability Task [1]. This hypothesis is confirmed in [4], where the authors show that visual semantic features provide the best prediction among the audio and visual features considered. Different from [4], this work uses an RNN to extract the semantics from the video titles and to predict video memorability.

Compared to video-based features, extracting the semantics of a video from its title has a relatively low feature extraction cost. Moreover, the authors in [5] demonstrate a strong connection between aesthetic features and image interestingness. Therefore, models that predict video memorability from the precomputed aesthetic features [6] provided by the organizers are also developed and compared with the semantic-based models.

2 APPROACH

2.1 Semantic-based Models

Table 1: Official test results (Spearman's rank correlation)

Run  Method           Short-term  Long-term
1    SVR+AF(Median)   0.315299    0.083562
2    SVR+AF(Mean)     0.347227    0.091239
3    ANN+AF(Mean)     0.121194    0.057660
4    RNN+Captions     0.356349    0.213220
5    SVR+Captions     0.230784    0.111450

The main model, corresponding to run 4, is a three-layer neural network with a recurrent layer; its structure is depicted in Fig. 1. After the titles are imported, punctuation and whitespace are removed. The texts are then tokenized into integer sequences of length 20: longer titles are truncated and shorter titles are padded with zeros. After this preprocessing, 80% of the training dataset is randomly chosen to train the model, and the remaining 20% is used for model evaluation.

The tokenized titles are fed to an embedding layer with an output dimension of 15. The embedding matrix is initialized from a uniform distribution, and no embedding regularizer is used. The semantics are extracted by a fully connected recurrent layer with 10 units placed after the embedding layer. The activation function of the recurrent layer is the hyperbolic tangent, and the layer uses a bias vector initialized to zeros. The initializer for the kernel weight matrix used in the linear transformation of the inputs is "glorot uniform", and the initializer for the recurrent kernel weight matrix used in the linear transformation of the recurrent state is "orthogonal". A 10-node fully connected dense layer with the rectified linear unit (ReLU) activation function follows; its kernel regularizer is l1-l2 regularization with λ1 = 0.001 and λ2 = 0.004, and its initialization scheme is the same as that of the RNN layer. The last layer is a 2-node dense layer with a linear activation function that predicts the short-term and long-term memorability simultaneously. The model is trained with the RMSprop optimizer against the mean absolute error (MAE) for 10 epochs with a batch size of 20.
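For concreteness, the run-4 architecture described above can be expressed as a minimal Keras-style sketch. The sequence length, layer sizes, activations, initializers, regularization constants, optimizer, loss, and training schedule follow the description above; the vocabulary size, the tokenizer, and the data-loading helper are assumptions, since the paper does not specify them.

```python
# Minimal Keras-style sketch of the run-4 model (title RNN).
# VOCAB_SIZE and the Tokenizer are assumptions; the paper specifies only the
# sequence length, layer sizes, initializers, regularizers, optimizer, and loss.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, SimpleRNN, Dense
from tensorflow.keras.regularizers import l1_l2

MAX_LEN = 20        # titles truncated or zero-padded to 20 tokens
EMBED_DIM = 15      # embedding output dimension
VOCAB_SIZE = 5000   # assumption: vocabulary size is not stated in the paper

def build_title_rnn():
    model = Sequential([
        Input(shape=(MAX_LEN,)),
        # Embedding matrix initialized from a uniform distribution, no regularizer.
        Embedding(VOCAB_SIZE, EMBED_DIM, embeddings_initializer="uniform"),
        # Fully connected recurrent layer: 10 units, tanh activation,
        # glorot-uniform kernel, orthogonal recurrent kernel, zero bias.
        SimpleRNN(10, activation="tanh",
                  kernel_initializer="glorot_uniform",
                  recurrent_initializer="orthogonal",
                  bias_initializer="zeros"),
        # 10-node ReLU layer with combined l1-l2 kernel regularization.
        Dense(10, activation="relu",
              kernel_initializer="glorot_uniform",
              kernel_regularizer=l1_l2(l1=0.001, l2=0.004)),
        # 2-node linear output: short-term and long-term memorability scores.
        Dense(2, activation="linear"),
    ])
    model.compile(optimizer="rmsprop", loss="mae")
    return model

# Usage sketch (load_dev_titles is hypothetical):
# titles, targets = load_dev_titles()
# tok = Tokenizer(num_words=VOCAB_SIZE)
# tok.fit_on_texts(titles)
# X = pad_sequences(tok.texts_to_sequences(titles), maxlen=MAX_LEN)
# build_title_rnn().fit(X, targets, epochs=10, batch_size=20, validation_split=0.2)
```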
Similar to the model in [4], the extracted semantics are also combined with a support vector regression (ϵ-SVR) model to generate run 5, whose structure is likewise shown in Fig. 1. After the preprocessing stage, the dimensionality of the tokenized titles is reduced through principal component analysis (PCA) so that 90% of the variance is explained, and the output is fed into an ϵ-SVR model. The penalty parameter C of the error term is set to 0.1, and ϵ, which defines a tube within which no penalty is associated with the training loss, is set to 0.01. A radial basis function is used as the kernel. These hyperparameters are obtained through grid-search cross-validation using Spearman's rank correlation as the scoring metric.

Figure 1: Semantic-based models; the recurrent neural network and ϵ-SVR models correspond to runs 4 and 5, respectively. [Diagram: titles → punctuation removal → vectorization → embedding layer → RNN layer (tanh, 10) → dense layer (ReLU, 10) → dense layer (linear, 2) for run 4; preprocessing → PCA (90%) → ϵ-SVR for run 5.]
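A scikit-learn-style sketch of the run-5 pipeline is given below, under stated assumptions: the padded token sequences are treated as fixed-length numeric vectors before PCA, and the parameter grid is illustrative, since the paper reports only the selected values C = 0.1 and ϵ = 0.01 and the use of Spearman's rank correlation as the cross-validation score. Runs 1 and 2 in Section 2.2 reuse the same pattern with standardized aesthetic features and 95% retained variance.

```python
# Sketch of the run-5 pipeline: PCA to 90% explained variance followed by an
# RBF-kernel epsilon-SVR, tuned by grid-search cross-validation scored with
# Spearman's rank correlation. The parameter grid is an assumption; the paper
# reports only the selected values C = 0.1 and epsilon = 0.01.
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

def spearman_score(y_true, y_pred):
    """Spearman's rank correlation, the official task metric."""
    return spearmanr(y_true, y_pred).correlation

pipeline = Pipeline([
    ("pca", PCA(n_components=0.90)),   # keep components explaining 90% of the variance
    ("svr", SVR(kernel="rbf")),
])

param_grid = {                          # assumed grid around the reported values
    "svr__C": [0.01, 0.1, 1.0],
    "svr__epsilon": [0.001, 0.01, 0.1],
}

search = GridSearchCV(pipeline, param_grid,
                      scoring=make_scorer(spearman_score), cv=5)

# Usage sketch: SVR is single-output, so one model is fitted per target.
# X = pad_sequences(...)               # padded token sequences as numeric vectors
# search.fit(X, y_short_term)          # repeat for y_long_term
# print(search.best_params_)           # e.g. {"svr__C": 0.1, "svr__epsilon": 0.01}
```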
2.2 Aesthetic Feature-based Models
This section describes the models that use the precomputed aesthetic features [6]. As shown in Fig. 2, runs 1 and 2 are generated by ϵ-SVR models using aesthetic visual features aggregated at the video level by the median and the mean, respectively. In both runs, the input features are first standardized, and PCA is applied to reduce the dimensionality of the data while retaining 95% of the variance. A radial basis function kernel is used in both runs. The grid-search cross-validated best parameters for the ϵ-SVR model are C = 0.01 and ϵ = 0.1.

The evaluation results show that the mean aesthetic features are more relevant to video memorability. Run 3 is therefore generated using an ANN and the mean aesthetic features, as illustrated in Fig. 2. The ANN model consists of three dense layers. The first two are fully connected layers with 50 nodes each, using the ReLU activation function and l2 regularization with a penalty constant of 0.001; their dropout rates are 0.1 and 0.5, respectively. The output layer has two nodes and uses a linear activation function. Mean squared error (MSE) is used as the loss function during training, and the validation data is randomly chosen from the training data within each epoch. The model is trained for 20 epochs with a batch size of 32.

Figure 2: Aesthetic feature-based models; ϵ-SVR models with median and mean aesthetic features correspond to runs 1 and 2, respectively, and the ANN with mean aesthetic features generates run 3. [Diagram: AF (median/mean) → standardization → PCA (95%) → ϵ-SVR for runs 1 and 2; AF (mean) → dense layer (ReLU, 50, DR=0.1) → dense layer (ReLU, 50, DR=0.5) → dense layer (linear, 2) for run 3.]
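A minimal Keras-style sketch of the run-3 ANN is given below. The layer widths, regularization, dropout rates, loss, epochs, and batch size follow the description above; the optimizer and the input dimensionality are not stated in the paper, so RMSprop and the input_dim argument are assumptions, and validation_split only approximates the per-epoch validation sampling described above.

```python
# Minimal Keras-style sketch of the run-3 ANN on mean aesthetic features.
# The optimizer and input dimensionality are assumptions (not given in the paper).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.regularizers import l2

def build_aesthetic_ann(input_dim):
    model = Sequential([
        Input(shape=(input_dim,)),
        # Two fully connected ReLU layers, 50 nodes each, l2 penalty 0.001,
        # with dropout rates of 0.1 and 0.5, respectively.
        Dense(50, activation="relu", kernel_regularizer=l2(0.001)),
        Dropout(0.1),
        Dense(50, activation="relu", kernel_regularizer=l2(0.001)),
        Dropout(0.5),
        # 2-node linear output: short-term and long-term memorability scores.
        Dense(2, activation="linear"),
    ])
    model.compile(optimizer="rmsprop", loss="mse")   # optimizer assumed
    return model

# Usage sketch:
# model = build_aesthetic_ann(X_mean_af.shape[1])    # X_mean_af: mean aesthetic features
# model.fit(X_mean_af, y, epochs=20, batch_size=32, validation_split=0.2)
```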
3 RESULTS AND ANALYSIS
From the returned evaluation results in Table 1, the following conclusions can be drawn. 1) The model using the RNN and semantics is the best among the five models, which confirms that the semantics of a video are more relevant to both its short-term and long-term memorability than aesthetic features; for long-term memorability in particular, the semantic-based models outperform the aesthetic feature-based models unanimously. 2) Without the recurrent layer, the performance decreases. It can therefore be inferred that the interaction between objects in a video has more impact on its long-term and short-term memorability than knowledge of the objects alone. 3) Even though there is some correlation between short-term and long-term memorability, as depicted in Fig. 3, the results show that short-term memorability is more predictable than long-term memorability, since all models score higher on the short-term task. As illustrated in Fig. 3, the long-term scores range from 0.2 to 1 and exhibit higher variance than the short-term scores, which range from 0.4 to 1. One possible reason is that long-term memorability is more subjective and depends more on the individual's memory.

Figure 3: Correlation between the two types of memorability.

The SVR models using median and mean aesthetic features perform close to run 4 in terms of short-term memorability prediction; however, their long-term performance is far worse than that of run 4, and further investigation is needed to clarify this. The performance of run 3 is worse than that of run 2, even though both use mean aesthetic features; possible reasons are over-fitting and the missing standardization procedure in run 3. In the future, ensemble methods are expected to further enhance the prediction accuracy.

REFERENCES
[1] Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. 2018. MediaEval 2018: Predicting Media Memorability Task. In Proc. of the MediaEval 2018 Workshop, Sophia Antipolis, France, 29-31 October 2018. arXiv:1807.01052.
[2] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude Oliva. 2014. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2014), 1469–1482.
[3] S. Shekhar, D. Singal, H. Singh, M. Kedia, and A. Shetty. 2017. Show and Recall: Learning What Makes Videos Memorable. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), 2730–2739. https://doi.org/10.1109/ICCVW.2017.321
[4] Romain Cohendet, Karthik Yadati, Ngoc Q. K. Duong, and Claire-Hélène Demarty. 2018. Annotating, understanding, and predicting long-term video memorability. In Proc. of the ICMR 2018 Workshop, Yokohama, Japan, June 11-14, 2018.
[5] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool. 2013. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision, 1633–1640.
[6] Andreas F. Haas, Marine Guibert, Anja Foerschner, Sandi Calhoun, Emma George, Mark Hatay, Elizabeth Dinsdale, Stuart A. Sandin, Jennifer E. Smith, Mark J. A. Vermeij, and others. 2015. Can we measure beauty? Computational evaluation of coral reef aesthetics. PeerJ 3 (2015), e1390.