Learning Memorability Preserving Subspace for Predicting Media Memorability

Yang Liu1,2, Zhonglei Gu1, Tobey H. Ko3
1 Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, P.R. China
2 HKBU Institute of Research and Continuing Education, Shenzhen, P.R. China
3 Department of Industrial and Manufacturing Systems Engineering, The University of Hong Kong, Hong Kong SAR, P.R. China
csygliu@comp.hkbu.edu.hk, cszlgu@comp.hkbu.edu.hk, tobeyko@hku.hk

ABSTRACT
This paper describes our approach to the MediaEval 2018 Predicting Media Memorability Task. First, a subspace learning method called Memorability Preserving Embedding (MPE) is proposed to learn a discriminative subspace from the original feature space according to the memorability scores. Then a Support Vector Regressor (SVR) is applied in the learned subspace for memorability prediction. The prediction performance demonstrates that SVR can achieve good results even in a very low-dimensional subspace, which implies that the subspace learned by MPE is capable of preserving important memorability information. Moreover, the results indicate that short-term memorability is more predictable than long-term memorability.

1 INTRODUCTION
Predicting media memorability plays a key role in many real-world applications such as media retrieval and recommendation, and has attracted much attention recently [1, 4, 6, 9–12, 14]. The MediaEval 2018 Predicting Media Memorability Task seeks solutions to the problem of predicting how memorable a video will be [3]. Specifically, given a set of training videos, each associated with its visual features and the corresponding memorability score, participants are asked to build a model on the training data and use the trained model to predict the memorability scores of the test data.

Images and videos often have very high dimensionality, which brings computational challenges to analysis tasks. To solve the memorability prediction task efficiently, in this paper we propose a supervised subspace learning method called Memorability Preserving Embedding (MPE). The motivation for designing a subspace learning method for the task, rather than performing the prediction directly, is our belief that most of the discriminative information of high-dimensional media data is actually embedded in a relatively low-dimensional subspace, and that discovering such a subspace could enhance prediction performance. Therefore, the proposed MPE learns a transformation matrix that projects the high-dimensional training data to a low-dimensional subspace in which the memorability information and the manifold structure of the dataset are well preserved. In the test stage, we use the learned transformation matrix to map the test data to the subspace, and apply a Support Vector Regressor (SVR) [13] in the subspace for the final memorability prediction.

2 MEMORABILITY PRESERVING EMBEDDING
Given the training set X = {(x_1, l_1), (x_2, l_2), ..., (x_n, l_n)}, with x_i ∈ R^D (i = 1, ..., n) being the visual feature vector of the i-th video and l_i ∈ [0, 1] being the corresponding memorability score, MPE aims to learn a D × d transformation matrix W that maps x_i (i = 1, ..., n) to a low-dimensional subspace in which the memorability information and the manifold structure of the dataset are well preserved. To achieve this goal, MPE optimizes the following objective function:

    W = arg min_W Σ_{i,j=1}^{n} ||W^T (x_i − x_j)||^2 · (α S_ij + (1 − α) N_ij),    (1)

where S_ij = exp(−(l_i − l_j)^2 / 2σ^2) measures the similarity between the memorability scores of x_i and x_j, N_ij = exp(−||x_i − x_j||^2 / 2σ^2) measures the closeness between x_i and x_j, and α ∈ [0, 1] is the parameter balancing the memorability information and the manifold structure.

Eq. (1) can be equivalently rewritten as:

    W = arg min_W tr(W^T X L X^T W),    (2)

where X = [x_1, x_2, ..., x_n] ∈ R^{D×n} is the data matrix, L = D − A is the n × n Laplacian matrix [7], and D is a diagonal matrix defined as D_ii = Σ_{j=1}^{n} A_ij (i = 1, ..., n), with A_ij = α S_ij + (1 − α) N_ij. The optimal W can then be obtained by finding the eigenvectors corresponding to the d smallest eigenvalues of the eigen-decomposition problem:

    X L X^T w = λ w.    (3)

After obtaining W, for each high-dimensional data sample x_i in the development and test sets, we obtain its low-dimensional representation as y_i = W^T x_i. We then apply SVR to y_i for memorability prediction.

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

Table 1: The performance (in terms of Spearman correlation and MSE) of our approach on the test set of the MediaEval 2018 Predicting Media Memorability Task.

                    Run 1 (d=4)   Run 2 (d=5)   Run 3 (d=9)   Run 4 (d=10)
    Spearman Long     0.0774        0.0962        0.0647        0.0634
             Short    0.1332        0.1268        0.0656        0.0717
    MSE      Long     0.0214        0.0214        0.0213        0.0213
             Short    0.0082        0.0080        0.0078        0.0079

Table 2: The performance (in terms of Spearman correlation and MSE) of our approach on the development set of the MediaEval 2018 Predicting Media Memorability Task.

                    d=4      d=5      d=9      d=10     D (2771)
    Spearman Long   0.1422   0.1514   0.1654   0.1675   0.1414
             Short  0.3047   0.3059   0.3065   0.3070   0.2946
    MSE      Long   0.0212   0.0212   0.0211   0.0210   0.0211
             Short  0.0061   0.0061   0.0061   0.0061   0.0062

3 RESULTS AND ANALYSIS
In this section, we report our experimental results on the MediaEval 2018 Predicting Media Memorability Task [3].
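As a concrete illustration of the approach described in Section 2, the following is a minimal sketch in NumPy and scikit-learn. It is not the authors' implementation: the toy data, the dimensions (D = 20, n = 60), and the helper name mpe_fit are illustrative; only the formulas for S, N, A, L, Eq. (3), and the ν-SVR settings (RBF kernel, ν = 0.5, γ = 1/D) come from the paper.

```python
# Minimal sketch of Memorability Preserving Embedding (MPE), Eqs. (1)-(3),
# followed by nu-SVR in the learned subspace. Toy data and names are
# illustrative, not the authors' code.
import numpy as np
from sklearn.svm import NuSVR

def mpe_fit(X, l, d, alpha=0.5, sigma=1.0):
    """Learn the D x d transformation matrix W.

    X: (D, n) data matrix, one column per video; l: (n,) memorability scores.
    """
    # Pairwise affinities: S from score similarity, N from feature closeness.
    S = np.exp(-(l[:, None] - l[None, :]) ** 2 / (2 * sigma ** 2))
    sq_dists = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    N = np.exp(-sq_dists / (2 * sigma ** 2))
    A = alpha * S + (1 - alpha) * N          # combined adjacency A_ij
    L = np.diag(A.sum(axis=1)) - A           # graph Laplacian L = D - A
    # Eq. (3): eigenvectors of X L X^T with the d smallest eigenvalues.
    M = X @ L @ X.T
    eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return eigvecs[:, :d]                    # W has shape (D, d)

# Toy example: n = 60 samples in D = 20 dimensions, scores in [0, 1].
rng = np.random.default_rng(0)
X = rng.random((20, 60))
l = rng.random(60)
W = mpe_fit(X, l, d=4)
Y = W.T @ X                                  # 4-D embeddings, shape (4, 60)
# nu-SVR with RBF kernel in the learned subspace (gamma = 1/D, here 1/20).
svr = NuSVR(nu=0.5, gamma=1.0 / 20).fit(Y.T, l)
pred = svr.predict(Y.T)                      # predicted memorability scores
```

Note that Eq. (3) is a plain (not generalized) eigenproblem, so the sketch simply takes the eigenvectors of X L X^T with the smallest eigenvalues, which numpy.linalg.eigh returns first.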
Specifically, we participate in two subtasks: 1) the short-term memorability subtask and 2) the long-term memorability subtask.

We use both video-specific features and image features, which are provided by the task, to construct the original feature space. For the video features, we use the 101-D C3D feature vector. For the image features, we use the 122-D local binary pattern (LBP) feature vector and the 768-D color histogram feature vector. We select these features because they have demonstrated good performance in visual analysis tasks [5, 8, 15]. For each video, the first, the median, and the last frames are selected as representatives of the video, so the total dimension of the original feature space is D = 101 + 3 × (122 + 768) = 2771.

We use all 8000 video samples in the development set for training. Before subspace learning, we normalize the values of different features to [0, 1]. For the MPE method, we set α = 0.5 and σ = 1.

  - For Run 1, we set the reduced dimension d = 4. We then learn the D × d (i.e., 2771 × 4 in this case) transformation matrix W via MPE using the development set, and use W to map both the development and test data onto the 4-D subspace. Finally, we train the ν-SVR [13] on the development set in the 4-D subspace and employ the trained ν-SVR model to predict the memorability scores of the test data in the same subspace. We use the RBF kernel and set ν = 0.5 and γ = 1/D [2].
  - For Run 2, we set the reduced dimension d = 5.
  - For Run 3, we set the reduced dimension d = 9.
  - For Run 4, we set the reduced dimension d = 10.
The remaining procedure and the parameter settings in Runs 2, 3, and 4 are the same as those in Run 1.

Table 1 shows the performance (in terms of Spearman correlation and MSE) of our approach. From the results, we have several observations. First, the results (both Spearman and MSE) on the short-term subtask are better than those on the long-term subtask, which indicates that short-term memorability is more predictable than long-term memorability. Second, by comparing Runs 1 and 2 (d = 4, 5) with Runs 3 and 4 (d = 9, 10), we notice that Runs 1 and 2 are better in terms of Spearman correlation and comparable in terms of MSE. This may imply that most of the discriminative information is embedded in a very low-dimensional subspace, and that adding more dimensions does not necessarily improve the performance.

To further validate the effectiveness of subspace learning, we compare the performance of SVR on the learned subspaces with that on the original 2771-D space using the development set. We use 5-fold cross-validation and average the results. The Spearman coefficients and MSE in Table 2 show that the performance on the original space is slightly worse than that on the learned subspaces, supporting our assumption that the original high-dimensional space may contain redundant or even noisy information, and that reducing the dimensionality with supervised information can improve the subsequent learning performance. However, the results in terms of the Spearman coefficient are still far from satisfactory. The reason might be that MPE is a linear mapping method, which is not sufficient to capture the complex discriminant information embedded in the high-dimensional feature space. This motivates us to consider extending our method to the nonlinear case to improve the performance.

4 CONCLUSION
This paper describes our approach to memorability prediction. A subspace learning method, MPE, is proposed to learn a subspace that preserves the memorability information. SVR is then applied for memorability prediction in the learned subspace. The results on the MediaEval 2018 Predicting Media Memorability Task validate the effectiveness of our approach. Our future work will focus on exploring the physical meaning of the learned subspace, as this could improve the interpretability of our approach. Moreover, we plan to generalize our method to nonlinear scenarios to enhance its data representation ability.

ACKNOWLEDGMENTS
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61503317 and in part by the General Research Fund (GRF) from the Research Grants Council (RGC) of Hong Kong SAR under Project HKBU12202417.

REFERENCES
[1] Y. Baveye, R. Cohendet, M. Perreira Da Silva, and P. Le Callet. 2016. Deep Learning for Image Memorability Prediction: The Emotional Bias. In Proceedings of the 24th ACM International Conference on Multimedia (MM '16). ACM, New York, NY, USA, 491–495.
[2] C.-C. Chang and C.-J. Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (2011), 27:1–27:27.
[3] R. Cohendet, C.-H. Demarty, N. Q. K. Duong, M. Sjoberg, B. Ionescu, and T.-T. Do. 2018. MediaEval 2018: Predicting Media Memorability. In Proceedings of the MediaEval 2018 Workshop. CEUR-WS, Sophia Antipolis, France, 29–31 October 2018.
[4] R. Cohendet, K. Yadati, N. Q. K. Duong, and C.-H. Demarty. 2018. Annotating, Understanding, and Predicting Long-term Video Memorability. In Proceedings of the 2018 ACM International Conference on Multimedia Retrieval (ICMR '18). ACM, New York, NY, USA, 178–186.
[5] A. M. Ferman, A. M. Tekalp, and R. Mehrotra. 2002. Robust color histogram descriptors for video segment retrieval and identification. IEEE Transactions on Image Processing 11, 5 (2002), 497–508.
[6] J. Han, C. Chen, L. Shao, X. Hu, J. Han, and T. Liu. 2015.
Learning Computational Models of Video Memorability from fMRI Brain Imaging. IEEE Transactions on Cybernetics 45, 8 (Aug 2015), 1692–1703.
[7] X. He and P. Niyogi. 2003. Locality Preserving Projections. In Advances in Neural Information Processing Systems 16 (NIPS). 153–160.
[8] D. Huang, C. Shan, M. Ardabilian, Y. Wang, and L. Chen. 2011. Local Binary Patterns and Its Application to Facial Image Analysis: A Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 41, 6 (2011), 765–781.
[9] P. Isola, D. Parikh, A. Torralba, and A. Oliva. 2011. Understanding the Intrinsic Memorability of Images. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2429–2437.
[10] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva. 2014. What Makes a Photograph Memorable? IEEE Trans. Pattern Anal. Mach. Intell. 36, 7 (July 2014), 1469–1482.
[11] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva. 2015. Understanding and Predicting Image Memorability at a Large Scale. In 2015 IEEE International Conference on Computer Vision (ICCV). 2390–2398.
[12] H. Peng, K. Li, B. Li, H. Ling, W. Xiong, and W. Hu. 2015. Predicting Image Memorability by Multi-view Adaptive Regression. In Proceedings of the 23rd ACM International Conference on Multimedia (MM '15). ACM, New York, NY, USA, 1147–1150.
[13] B. Scholkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. 2000. New Support Vector Algorithms. Neural Computation 12, 5 (2000), 1207–1245.
[14] S. Shekhar, D. Singal, H. Singh, M. Kedia, and A. Shetty. 2017. Show and Recall: Learning What Makes Videos Memorable. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW). 2730–2739.
[15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). 4489–4497.