Learning Memorability Preserving Subspace for Predicting Media Memorability

Yang Liu1,2, Zhonglei Gu1, Tobey H. Ko3
1 Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, P.R. China
2 HKBU Institute of Research and Continuing Education, Shenzhen, P.R. China
3 Department of Industrial and Manufacturing Systems Engineering, The University of Hong Kong, Hong Kong SAR, P.R. China
csygliu@comp.hkbu.edu.hk, cszlgu@comp.hkbu.edu.hk, tobeyko@hku.hk

ABSTRACT
This paper describes our approach to the MediaEval 2018 Predicting Media Memorability Task. First, a subspace learning method called Memorability Preserving Embedding (MPE) is proposed to learn a discriminative subspace from the original feature space according to the memorability scores. Then a Support Vector Regressor (SVR) is applied in the learned subspace for memorability prediction. The prediction performance demonstrates that SVR can achieve good results even in a very low-dimensional subspace, which implies that the subspace learned by MPE is capable of preserving important memorability information. Moreover, the results indicate that short-term memorability is more predictable than long-term memorability.

1 INTRODUCTION
Predicting media memorability plays a key role in many real-world applications such as media retrieval and recommendation, and has attracted much attention recently [1, 4, 6, 9–12, 14]. The MediaEval 2018 Predicting Media Memorability Task seeks solutions to the problem of predicting how memorable a video will be [3]. Specifically, given a set of training videos, each associated with its visual features and the corresponding memorability score, participants are asked to build a model on the training data and use the trained model to predict the memorability scores of the test data.

Images and videos often have very high dimensionality, which brings computational challenges to analysis tasks. To solve the memorability prediction task efficiently, in this paper we propose a supervised subspace learning method called Memorability Preserving Embedding (MPE). The motivation for designing a subspace learning method for the task, rather than performing the prediction directly, is our belief that most of the discriminative information of high-dimensional media data is actually embedded in a relatively low-dimensional subspace, and that discovering such a subspace could enhance prediction performance. Therefore, the proposed MPE learns a transformation matrix that projects the high-dimensional training data to a low-dimensional subspace in which the memorability information and the manifold structure of the dataset are well preserved. In the test stage, we use the learned transformation matrix to map the test data to the subspace, and apply a Support Vector Regressor (SVR) [13] in the subspace for the final memorability prediction.

2 MEMORABILITY PRESERVING EMBEDDING
Given the training set X = {(x_1, l_1), (x_2, l_2), ..., (x_n, l_n)}, with x_i ∈ R^D (i = 1, ..., n) being the visual feature vector of the i-th video and l_i ∈ [0, 1] being the corresponding memorability score, MPE aims to learn a D × d transformation matrix W that maps x_i (i = 1, ..., n) to a low-dimensional subspace in which the memorability information and the manifold structure of the dataset are well preserved. To achieve this goal, MPE optimizes the following objective function:

    W = arg min_W Σ_{i,j=1}^{n} ||W^T (x_i − x_j)||^2 · (α S_ij + (1 − α) N_ij),    (1)

where S_ij = exp(−(l_i − l_j)^2 / 2σ^2) measures the similarity between the memorability scores of x_i and x_j, N_ij = exp(−||x_i − x_j||^2 / 2σ^2) measures the closeness between x_i and x_j, and α ∈ [0, 1] is the parameter balancing the memorability information and the manifold structure.

Eq. (1) can be equivalently rewritten as:

    W = arg min_W tr(W^T X L X^T W),    (2)

where X = [x_1, x_2, ..., x_n] ∈ R^{D×n} is the data matrix, L = D − A is the n × n Laplacian matrix [7], and D is a diagonal matrix defined as D_ii = Σ_{j=1}^{n} A_ij (i = 1, ..., n), with A_ij = α S_ij + (1 − α) N_ij. The optimal W can then be obtained by finding the eigenvectors corresponding to the d smallest eigenvalues of the eigen-decomposition problem:

    X L X^T w = λ w.    (3)

After obtaining W, for each high-dimensional data sample x_i in the development and test sets, we obtain its low-dimensional representation as y_i = W^T x_i. We then apply SVR to y_i for memorability prediction.

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, Sophia Antipolis, France

Table 1: The performance (in terms of Spearman correlation and MSE) of our approach on the test set of the MediaEval 2018 Predicting Media Memorability Task.

                    Run 1 (d=4)   Run 2 (d=5)   Run 3 (d=9)   Run 4 (d=10)
    Spearman Long     0.0774        0.0962        0.0647        0.0634
             Short    0.1332        0.1268        0.0656        0.0717
    MSE      Long     0.0214        0.0214        0.0213        0.0213
             Short    0.0082        0.0080        0.0078        0.0079

Table 2: The performance (in terms of Spearman correlation and MSE) of our approach on the development set of the MediaEval 2018 Predicting Media Memorability Task.

                    d=4      d=5      d=9      d=10     D (2771)
    Spearman Long   0.1422   0.1514   0.1654   0.1675   0.1414
             Short  0.3047   0.3059   0.3065   0.3070   0.2946
    MSE      Long   0.0212   0.0212   0.0211   0.0210   0.0211
             Short  0.0061   0.0061   0.0061   0.0061   0.0062

3 RESULTS AND ANALYSIS
In this section, we report our experimental results on the MediaEval 2018 Predicting Media Memorability Task [3].
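As a concrete illustration of the approach described in Section 2, the following is a minimal sketch in NumPy and scikit-learn. It is not the authors' implementation: the toy data, the dimensions (D = 20, n = 60), and the helper name mpe_fit are illustrative; only the formulas for S, N, A, L, Eq. (3), and the ν-SVR settings (RBF kernel, ν = 0.5, γ = 1/D) come from the paper.

```python
# Minimal sketch of Memorability Preserving Embedding (MPE), Eqs. (1)-(3),
# followed by nu-SVR in the learned subspace. Toy data and names are
# illustrative, not the authors' code.
import numpy as np
from sklearn.svm import NuSVR

def mpe_fit(X, l, d, alpha=0.5, sigma=1.0):
    """Learn the D x d transformation matrix W.

    X: (D, n) data matrix, one column per video; l: (n,) memorability scores.
    """
    # Pairwise affinities: S from score similarity, N from feature closeness.
    S = np.exp(-(l[:, None] - l[None, :]) ** 2 / (2 * sigma ** 2))
    sq_dists = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
    N = np.exp(-sq_dists / (2 * sigma ** 2))
    A = alpha * S + (1 - alpha) * N          # combined adjacency A_ij
    L = np.diag(A.sum(axis=1)) - A           # graph Laplacian L = D - A
    # Eq. (3): eigenvectors of X L X^T with the d smallest eigenvalues.
    M = X @ L @ X.T
    eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return eigvecs[:, :d]                    # W has shape (D, d)

# Toy example: n = 60 samples in D = 20 dimensions, scores in [0, 1].
rng = np.random.default_rng(0)
X = rng.random((20, 60))
l = rng.random(60)
W = mpe_fit(X, l, d=4)
Y = W.T @ X                                  # 4-D embeddings, shape (4, 60)
# nu-SVR with RBF kernel in the learned subspace (gamma = 1/D, here 1/20).
svr = NuSVR(nu=0.5, gamma=1.0 / 20).fit(Y.T, l)
pred = svr.predict(Y.T)                      # predicted memorability scores
```

Note that Eq. (3) is a plain (not generalized) eigenproblem, so the sketch simply takes the eigenvectors of X L X^T with the smallest eigenvalues, which numpy.linalg.eigh returns first.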
Specifically, we participate in two subtasks: 1) the short-term memorability subtask and 2) the long-term memorability subtask.

We use both video-specific features and image features, which are provided by the task, to construct the original feature space. For the video features, we use the 101-D C3D feature vector. For the image features, we use the 122-D local binary pattern (LBP) feature vector and the 768-D color histogram feature vector. We select these features because they have demonstrated good performance in visual analysis tasks [5, 8, 15]. For each video, the first, the median, and the last frames are selected as representatives of the video, so the total dimension of the original feature space is D = 101 + 3 × (122 + 768) = 2771.

We use all 8000 video samples in the development set for training. Before subspace learning, we normalize the values of different features to [0, 1]. For the MPE method, we set α = 0.5 and σ = 1.

  - For Run 1, we set the reduced dimension d = 4. We then learn the D × d (i.e., 2771 × 4 in this case) transformation matrix W via MPE using the development set, and use W to map both the development and test data onto the 4-D subspace. Finally, we train the ν-SVR [13] on the development set in the 4-D subspace and employ the trained ν-SVR model to predict the memorability scores of the test data in the same subspace. We use the RBF kernel and set ν = 0.5 and γ = 1/D [2].
  - For Run 2, we set the reduced dimension d = 5.
  - For Run 3, we set the reduced dimension d = 9.
  - For Run 4, we set the reduced dimension d = 10.
The remaining procedure and the parameter settings in Runs 2, 3, and 4 are the same as those in Run 1.

Table 1 shows the performance (in terms of Spearman correlation and MSE) of our approach. From the results, we have several observations. First, the results (both Spearman and MSE) on the short-term subtask are better than those on the long-term subtask, which indicates that short-term memorability is more predictable than long-term memorability. Second, by comparing Runs 1 and 2 (d = 4, 5) with Runs 3 and 4 (d = 9, 10), we notice that Runs 1 and 2 are better in terms of Spearman correlation and comparable in terms of MSE. This may imply that most of the discriminative information is embedded in a very low-dimensional subspace, and that adding more dimensions does not necessarily improve the performance.

To further validate the effectiveness of subspace learning, we compare the performance of SVR on the learned subspaces with that on the original 2771-D space using the development set. We use 5-fold cross-validation and average the results. The Spearman coefficients and MSE in Table 2 show that the performance on the original space is slightly worse than that on the learned subspaces, supporting our assumption that the original high-dimensional space may contain redundant or even noisy information, and that reducing the dimensionality with supervised information can improve the subsequent learning performance. However, the results in terms of the Spearman coefficient are still far from satisfactory. The reason might be that MPE is a linear mapping method, which is not sufficient to capture the complex discriminant information embedded in the high-dimensional feature space. This motivates us to consider extending our method to the nonlinear case to improve the performance.

4 CONCLUSION
This paper describes our approach to memorability prediction. A subspace learning method, MPE, is proposed to learn a subspace that preserves the memorability information. SVR is then applied for memorability prediction in the learned subspace. The results on the MediaEval 2018 Predicting Media Memorability Task validate the effectiveness of our approach. Our future work will focus on exploring the physical meaning of the learned subspace, as this could improve the interpretability of our approach. Moreover, we plan to generalize our method to nonlinear scenarios to enhance its data representation ability.

ACKNOWLEDGMENTS
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61503317 and in part by the General Research Fund (GRF) from the Research Grants Council (RGC) of Hong Kong SAR under Project HKBU12202417.

REFERENCES
[1] Y. Baveye, R. Cohendet, M. Perreira Da Silva, and P. Le Callet. 2016. Deep Learning for Image Memorability Prediction: The Emotional Bias. In Proceedings of the 24th ACM International Conference on Multimedia (MM '16). ACM, New York, NY, USA, 491–495.
[2] C.-C. Chang and C.-J. Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (2011), 27:1–27:27.
[3] R. Cohendet, C.-H. Demarty, N. Q. K. Duong, M. Sjoberg, B. Ionescu, and T.-T. Do. 2018. MediaEval 2018: Predicting Media Memorability. In Proceedings of the MediaEval 2018 Workshop. CEUR-WS, Sophia Antipolis, France, 29–31 October 2018.
[4] R. Cohendet, K. Yadati, N. Q. K. Duong, and C.-H. Demarty. 2018. Annotating, Understanding, and Predicting Long-term Video Memorability. In Proceedings of the 2018 ACM International Conference on Multimedia Retrieval (ICMR '18). ACM, New York, NY, USA, 178–186.
[5] A. M. Ferman, A. M. Tekalp, and R. Mehrotra. 2002. Robust color histogram descriptors for video segment retrieval and identification. IEEE Transactions on Image Processing 11, 5 (2002), 497–508.
[6] J. Han, C. Chen, L. Shao, X. Hu, J. Han, and T. Liu. 2015.
Learning Computational Models of Video Memorability from fMRI Brain Imaging. IEEE Transactions on Cybernetics 45, 8 (Aug 2015), 1692–1703.
[7] X. He and P. Niyogi. 2003. Locality Preserving Projections. In Advances in Neural Information Processing Systems 16 (NIPS). 153–160.
[8] D. Huang, C. Shan, M. Ardabilian, Y. Wang, and L. Chen. 2011. Local Binary Patterns and Its Application to Facial Image Analysis: A Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 41, 6 (2011), 765–781.
[9] P. Isola, D. Parikh, A. Torralba, and A. Oliva. 2011. Understanding the Intrinsic Memorability of Images. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2429–2437.
[10] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva. 2014. What Makes a Photograph Memorable? IEEE Trans. Pattern Anal. Mach. Intell. 36, 7 (July 2014), 1469–1482.
[11] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva. 2015. Understanding and Predicting Image Memorability at a Large Scale. In 2015 IEEE International Conference on Computer Vision (ICCV). 2390–2398.
[12] H. Peng, K. Li, B. Li, H. Ling, W. Xiong, and W. Hu. 2015. Predicting Image Memorability by Multi-view Adaptive Regression. In Proceedings of the 23rd ACM International Conference on Multimedia (MM '15). ACM, New York, NY, USA, 1147–1150.
[13] B. Scholkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. 2000. New Support Vector Algorithms. Neural Computation 12, 5 (2000), 1207–1245.
[14] S. Shekhar, D. Singal, H. Singh, M. Kedia, and A. Shetty. 2017. Show and Recall: Learning What Makes Videos Memorable. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW). 2730–2739.
[15] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV). 4489–4497.