
Learning Memorability Preserving Subspace for Predicting Media Memorability

Yang Liu (1, 3), csygliu@comp.hkbu.edu.hk

Zhonglei Gu

Tobey H. Ko (2), tobeyko@hku.hk

(1) Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, P.R. China
(2) Department of Industrial and Manufacturing Systems Engineering, The University of Hong Kong, Hong Kong SAR, P.R. China
(3) HKBU Institute of Research and Continuing Education, Shenzhen, P.R. China

MediaEval 2018 Workshop, 29-31 October 2018, Sophia Antipolis, France

This paper describes our approach to the MediaEval 2018 Predicting Media Memorability Task. First, a subspace learning method called Memorability Preserving Embedding (MPE) is proposed to learn a discriminative subspace from the original feature space according to the memorability scores. Then a Support Vector Regressor (SVR) is applied in the learned subspace for memorability prediction. The results demonstrate that SVR can achieve good performance even in a very low-dimensional subspace, which implies that the subspace learned by MPE preserves important memorability information. Moreover, the results indicate that short-term memorability is more predictable than long-term memorability.

INTRODUCTION

Predicting media memorability plays a key role in many real-world applications such as media retrieval and recommendation, and has attracted much attention recently [1, 4, 6, 9–12, 14]. The MediaEval 2018 Predicting Media Memorability Task seeks solutions to the problem of predicting how memorable a video will be [3]. Specifically, given a set of training videos (each sample is associated with its visual features and the corresponding memorability score), participants are asked to build a model on the training data and use the trained model to predict the memorability scores of the test data.

Images and videos often have very high dimensionality, which brings computational challenges to analysis tasks. To solve the memorability prediction task efficiently, in this paper we propose a supervised subspace learning method called Memorability Preserving Embedding (MPE). The motivation for designing such a subspace learning method, rather than performing the prediction directly, is our belief that most of the discriminative information in the high-dimensional media data is actually embedded in a relatively low-dimensional subspace, and that discovering such a subspace could enhance prediction performance. Therefore, the proposed MPE aims to learn a transformation matrix that projects the high-dimensional training data to a low-dimensional subspace in which the memorability information and manifold structure of the dataset are well preserved. In the test stage, we use the learned transformation matrix to map the test data to the subspace, and apply a Support Vector Regressor (SVR) [13] in the subspace for final memorability prediction.

MEMORABILITY PRESERVING EMBEDDING

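Given the training set $\mathcal{X} = \{(\mathbf{x}_1, l_1), (\mathbf{x}_2, l_2), \ldots, (\mathbf{x}_n, l_n)\}$, with $\mathbf{x}_i \in \mathbb{R}^D$ ($i = 1, \cdots, n$) being the visual feature vector of the $i$-th video and $l_i \in [0, 1]$ being the corresponding memorability score, MPE aims to learn a $D \times d$ transformation matrix $\mathbf{W}$ to map $\mathbf{x}_i$ ($i = 1, \cdots, n$) to a low-dimensional subspace, where the memorability information and manifold structure of the dataset can be well preserved. To achieve this goal, MPE optimizes the following objective function:

$$\mathbf{W} = \arg\min_{\mathbf{W}} \sum_{i,j=1}^{n} \|\mathbf{W}^T(\mathbf{x}_i - \mathbf{x}_j)\|^2 \cdot \left(\alpha A^l_{ij} + (1-\alpha) A^x_{ij}\right), \tag{1}$$

where $A^l_{ij} = \exp(-(l_i - l_j)^2 / 2\sigma^2)$ measures the similarity between the memorability score of $\mathbf{x}_i$ and that of $\mathbf{x}_j$, $A^x_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2)$ measures the closeness between $\mathbf{x}_i$ and $\mathbf{x}_j$, and $\alpha \in [0, 1]$ is the parameter balancing the memorability information and the manifold structure.

Eq. (1) could be equivalently rewritten as follows:

$$\mathbf{W} = \arg\min_{\mathbf{W}} \mathrm{tr}(\mathbf{W}^T \mathbf{X} \mathbf{L} \mathbf{X}^T \mathbf{W}), \tag{2}$$

where $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \in \mathbb{R}^{D \times n}$ is the data matrix, $\mathbf{L} = \mathbf{D} - \mathbf{A}$ is the $n \times n$ Laplacian matrix [7], and $\mathbf{D}$ is a diagonal matrix defined as $D_{ii} = \sum_{j=1}^{n} A_{ij}$ ($i = 1, \ldots, n$), where $A_{ij} = \alpha A^l_{ij} + (1 - \alpha) A^x_{ij}$. Then the optimal $\mathbf{W}$ can be obtained by finding the eigenvectors corresponding to the $d$ smallest eigenvalues of the following eigen-decomposition problem:

$$\mathbf{X} \mathbf{L} \mathbf{X}^T \mathbf{w} = \lambda \mathbf{w}. \tag{3}$$

After obtaining $\mathbf{W}$, for each high-dimensional data sample $\mathbf{x}$ in the development and test sets, we can obtain its low-dimensional representation by $\mathbf{y} = \mathbf{W}^T \mathbf{x}$. Then we apply SVR to $\mathbf{y}$ for memorability prediction.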
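The procedure above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `mpe_fit` and `mpe_transform` are hypothetical, samples are stored as rows (so the paper's $\mathbf{X}\mathbf{L}\mathbf{X}^T$ becomes `X.T @ L @ X`), and the orthonormality of the columns of $\mathbf{W}$ is an assumption that follows from using a symmetric eigensolver.

```python
import numpy as np

def mpe_fit(X, l, d, alpha=0.5, sigma=1.0):
    """Sketch of MPE: X is (n, D) with one sample per row, l is (n,) scores.

    Returns a (D, d) matrix whose columns are the eigenvectors of
    X^T L X (row-sample convention) with the d smallest eigenvalues.
    """
    X = np.asarray(X, dtype=float)
    l = np.asarray(l, dtype=float)
    # Pairwise squared differences between memorability scores.
    score_sq = (l[:, None] - l[None, :]) ** 2
    # Pairwise squared Euclidean distances between feature vectors.
    sq_norms = np.sum(X ** 2, axis=1)
    feat_sq = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    # Combined affinity: alpha * score similarity + (1 - alpha) * feature closeness.
    A = (alpha * np.exp(-score_sq / (2.0 * sigma ** 2))
         + (1.0 - alpha) * np.exp(-feat_sq / (2.0 * sigma ** 2)))
    # Graph Laplacian: diagonal matrix of row sums minus the affinity matrix.
    L = np.diag(A.sum(axis=1)) - A
    M = X.T @ L @ X
    M = (M + M.T) / 2.0  # symmetrize against round-off
    # eigh returns eigenvalues in ascending order; keep the first d eigenvectors.
    _, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, :d]

def mpe_transform(X, W):
    """Project row-wise samples: each row y equals W^T x."""
    return np.asarray(X, dtype=float) @ W
```

Note that the affinity matrix is dense and of size $n \times n$, so memory grows quadratically with the number of training samples.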

RESULTS AND ANALYSIS

In this section, we report our experimental results on the MediaEval 2018 Predicting Media Memorability Task [3]. Specifically, we participated in two subtasks: 1) the short-term memorability subtask and 2) the long-term memorability subtask.

We use both video-specialized features and image features, which are provided by the task, to construct the original feature space. For the video features, we use the 101-D C3D feature vector. For the image features, we use the 122-D local binary pattern (LBP) feature vector and the 768-D color histogram feature vector. We select these features as they have demonstrated good performance in visual analysis tasks [5, 8, 15]. For each video, the first, the middle, and the last frames are selected as the representatives of the video, so the total dimension of the original feature space is $D = 101 + 3 \times (122 + 768) = 2771$.
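As a quick check of the dimension count, the following sketch concatenates placeholder vectors of the stated sizes. The helper name `build_feature_vector` and the concatenation order are assumptions made for illustration; the text does not specify how the blocks are ordered.

```python
import numpy as np

C3D_DIM, LBP_DIM, HIST_DIM, N_FRAMES = 101, 122, 768, 3

def build_feature_vector(c3d, frame_lbp, frame_hist):
    """Concatenate one C3D vector with the LBP and color-histogram
    vectors of the three representative frames (order is an assumption)."""
    per_frame = [np.concatenate([lbp, hist])
                 for lbp, hist in zip(frame_lbp, frame_hist)]
    return np.concatenate([c3d] + per_frame)

# Placeholder inputs with the dimensions stated in the text.
x = build_feature_vector(
    np.zeros(C3D_DIM),
    [np.zeros(LBP_DIM) for _ in range(N_FRAMES)],
    [np.zeros(HIST_DIM) for _ in range(N_FRAMES)],
)
total_dim = x.shape[0]  # 101 + 3 * (122 + 768)
```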

We use all 8000 video samples in the development set for training. Before subspace learning, we normalize the values of different features to [0, 1]. For the MPE method, we set the balancing parameter $\alpha = 0.5$ and the kernel width $\sigma = 1$.
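One common way to bring features to [0, 1] is per-feature min-max scaling. The sketch below assumes that this is the normalization meant, and that the minimum and range are estimated on the development set and reused for the test set; neither detail is stated in the text, and the function names are hypothetical.

```python
import numpy as np

def fit_minmax(X_train):
    """Per-feature minimum and range, estimated on the training rows."""
    X_train = np.asarray(X_train, dtype=float)
    lo = X_train.min(axis=0)
    span = X_train.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features are mapped to 0 instead of dividing by 0
    return lo, span

def apply_minmax(X, lo, span):
    """Scale rows of X with previously fitted statistics."""
    return (np.asarray(X, dtype=float) - lo) / span
```

With statistics taken from the development set only, test values can fall slightly outside [0, 1].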

∙ For Run 1, we set the reduced dimension $d = 4$. We then learn the $D \times d$ (i.e., $2771 \times 4$ in this case) transformation matrix $\mathbf{W}$ via MPE using the development set, and use $\mathbf{W}$ to map both development and test data onto the 4-D subspace. Finally, we train the $\nu$-SVR [13] on the development set in the 4-D subspace and use the trained $\nu$-SVR model to predict the memorability scores of the test data in the same subspace. We use the RBF kernel and set $\nu = 0.5$ and the kernel parameter $\gamma$ to the reciprocal of the feature dimension [2].
∙ For Run 2, we set the reduced dimension $d = 5$.
∙ For Run 3, we set the reduced dimension $d = 9$.
∙ For Run 4, we set the reduced dimension $d = 10$.

The remaining procedure and the parameter setting in Runs 2, 3, and 4 are the same as those in Run 1.
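The regression step of each run can be sketched with scikit-learn's `NuSVR`, which wraps LIBSVM. This is an illustrative stand-in, not the authors' code: the helper name `train_and_predict` is hypothetical, and `gamma="auto"` (one over the number of input features seen by the regressor) is an assumption about which dimension the paper's $\gamma$ refers to.

```python
import numpy as np
from sklearn.svm import NuSVR

def train_and_predict(Y_train, scores_train, Y_test, nu=0.5):
    """Fit a nu-SVR with an RBF kernel on projected training data
    and predict memorability scores for projected test data."""
    # gamma="auto" sets gamma = 1 / n_features of the input given to fit().
    model = NuSVR(kernel="rbf", nu=nu, gamma="auto")
    model.fit(Y_train, scores_train)
    return model.predict(Y_test)
```

Here `Y_train` and `Y_test` stand for the development and test data after projection onto the learned subspace, so changing the run only changes the number of columns.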

Table 1 shows the performance (in terms of Spearman correlation and MSE) of our approach. We make two observations. First, the results (both Spearman and MSE) on the short-term subtask are better than those on the long-term subtask, which indicates that short-term memorability is more predictable than long-term memorability. Second, comparing Runs 1 and 2 ($d = 4, 5$) with Runs 3 and 4 ($d = 9, 10$), we notice that Runs 1 and 2 are better than Runs 3 and 4 in terms of Spearman, and comparable in terms of MSE. This may imply that most of the discriminative information is embedded in a very low-dimensional subspace, and that adding more dimensions does not necessarily improve performance.

Table 1: Spearman correlation and MSE of each run ($d = 4, 5, 9, 10$) on the short-term and long-term subtasks.
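For reference, the two metrics used above can be computed as in the following sketch. The function name `evaluate` is hypothetical, and `scipy.stats.spearmanr` is used here as a convenient implementation; the text does not say which tool produced the reported numbers.

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate(pred, truth):
    """Return (Spearman rank correlation, mean squared error)."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    rho, _ = spearmanr(pred, truth)  # second value is the p-value, unused here
    mse = float(np.mean((pred - truth) ** 2))
    return float(rho), mse
```

Spearman correlation depends only on the ranking of the predictions, while MSE depends on their actual values, which is why two runs can differ on one metric and be comparable on the other.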

To further validate the effectiveness of subspace learning, we compare the performance of SVR on the learned subspace with that on the original 2771-D space using the development set. We use 5-fold cross-validation and average the results. The Spearman coefficient and MSE in Table 2 show that the performance on the original space is slightly worse than that on the learned subspaces, supporting our assumption that the original high-dimensional space may contain redundant or even noisy information, and that reducing the dimensionality with supervised information could improve the subsequent learning performance. However, the results in terms of the Spearman coefficient are far from satisfactory. The reason might be that MPE is a linear mapping method, which is not sufficient to capture the complex discriminant information embedded in the high-dimensional feature space. This motivates us to consider extending our method to the nonlinear case to improve the performance.
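The cross-validation protocol can be sketched as follows. This is a hedged illustration with scikit-learn, not the authors' setup: the function name `cross_validate_svr`, the shuffling, and the random seed are assumptions, and the text does not state whether the projection is re-learned inside each fold. The function accepts any representation, so it can be called once on the original features and once on the projected ones.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import KFold
from sklearn.svm import NuSVR

def cross_validate_svr(X, scores, n_splits=5, seed=0):
    """Average Spearman correlation and MSE of a nu-SVR over k folds."""
    X = np.asarray(X, dtype=float)
    scores = np.asarray(scores, dtype=float)
    rhos, mses = [], []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in folds.split(X):
        model = NuSVR(kernel="rbf", nu=0.5, gamma="auto")
        model.fit(X[train_idx], scores[train_idx])
        pred = model.predict(X[val_idx])
        rho, _ = spearmanr(pred, scores[val_idx])
        rhos.append(rho)
        mses.append(np.mean((pred - scores[val_idx]) ** 2))
    return float(np.mean(rhos)), float(np.mean(mses))
```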

CONCLUSION

This paper describes our approach to memorability prediction. A subspace learning method, MPE, is proposed to learn a subspace that preserves the memorability information. SVR is then used for memorability prediction in the learned subspace. The results on the MediaEval 2018 Predicting Media Memorability Task validate the effectiveness of our approach. Our future work will focus on exploring the physical meaning of the learned subspace, as this could improve the interpretability of our approach. Moreover, we plan to generalize our method to the nonlinear scenario to enhance its data representation ability.

ACKNOWLEDGMENTS

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61503317 and in part by the General Research Fund (GRF) from the Research Grants Council (RGC) of Hong Kong SAR under Project HKBU12202417.

[1] Baveye, Cohendet, M. Perreira Da Silva, and P. Le Callet. 2016. Deep Learning for Image Memorability Prediction: The Emotional Bias. In Proceedings of the 24th ACM International Conference on Multimedia (MM '16). ACM, New York, NY, USA, 491-495.

[2] C.-C. Chang and C.-J. Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (2011), 27:1-27:27.

[3] Cohendet, C.-H. Demarty, N. Q. K. Duong, M. Sjoberg, B. Ionescu, and T.-T. Do. 2018. MediaEval 2018: Predicting Media Memorability. In Proceedings of the MediaEval 2018 Workshop. CEUR-WS, Sophia Antipolis, France, 29-31 October 2018.

[4] Cohendet, Yadati, N. Q. K. Duong, and C.-H. Demarty. 2018. Annotating, Understanding, and Predicting Long-term Video Memorability. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval (ICMR '18). ACM, New York, NY, USA, 178-186.

[5] A. M. Ferman, A. M. Tekalp, and Mehrotra. 2002. Robust color histogram descriptors for video segment retrieval and identification. IEEE Transactions on Image Processing 11, 5 (2002), 497-508.

[6] Han, Chen, Shao, Hu, J. Han, and Liu. 2015. Learning Computational Models of Video Memorability from fMRI Brain Imaging. IEEE Transactions on Cybernetics 45, 8 (Aug 2015), 1692-1703.

[7] He and Niyogi. 2003. Locality Preserving Projections. In Advances in Neural Information Processing Systems 16 (NIPS). 153-160.

[8] Huang, Shan, Ardabilian, Wang, and Chen. 2011. Local Binary Patterns and Its Application to Facial Image Analysis: A Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 41, 6 (2011), 765-781.

[9] Isola, Parikh, Torralba, and Oliva. 2011. Understanding the Intrinsic Memorability of Images. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2429-2437.

[10] Isola, Xiao, Parikh, Torralba, and Oliva. 2014. What Makes a Photograph Memorable? IEEE Trans. Pattern Anal. Mach. Intell. 36, 7 (July 2014), 1469-1482.

[11] Khosla, A. S. Raju, Torralba, and Oliva. 2015. Understanding and Predicting Image Memorability at a Large Scale. In 2015 IEEE International Conference on Computer Vision (ICCV). 2390-2398.

[12] Peng, Li, Ling, Xiong, and Hu. 2015. Predicting Image Memorability by Multi-view Adaptive Regression. In Proceedings of the 23rd ACM International Conference on Multimedia (MM '15). ACM, New York, NY, USA, 1147-1150.

[13] Scholkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. 2000. New Support Vector Algorithms. Neural Comput. 12, 5 (2000), 1207-1245.

[14] Shekhar, Singal, Singh, Kedia, and Shetty. 2017. Show and Recall: Learning What Makes Videos Memorable. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW). 2730-2739.

[15] Tran, Bourdev, Fergus, Torresani, and Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). 4489-4497.