Predicting Media Interestingness via Biased Discriminant Embedding and Supervised Manifold Regression

Yang Liu1,2, Zhonglei Gu1, Tobey H. Ko3
1 Department of Computer Science, Hong Kong Baptist University, HKSAR, China
2 Institute of Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
3 Department of Industrial and Manufacturing Systems Engineering, University of Hong Kong, HKSAR, China
csygliu@comp.hkbu.edu.hk, cszlgu@comp.hkbu.edu.hk, tobeyko@hku.hk

ABSTRACT
In this paper, we describe our model for the automatic prediction of media interestingness. Specifically, we propose a two-stage learning framework. In the first stage, supervised dimensionality reduction is employed to discover the key discriminant information embedded in the original feature space: we present a new algorithm, dubbed biased discriminant embedding (BDE), to extract discriminant features with discrete labels, and employ supervised manifold regression (SMR) to extract discriminant features with continuous labels. In the second stage, an SVM is utilized for prediction. Experimental results validate the effectiveness of our approaches.

1 INTRODUCTION
Predicting the interestingness of multimedia content has long been studied in the psychology community [1, 6, 7]. More recently, with the explosion of multimedia content brought about by the accessibility of low-cost creation tools, the automatic prediction of media interestingness has started to attract attention in the computer science community, owing to its many useful applications for content providers, marketers, and managerial decision-makers.

In this paper, we propose to use dimensionality reduction to extract low-dimensional features for the MediaEval 2017 Predicting Media Interestingness Task. Specifically, we propose a new algorithm called biased discriminant embedding (BDE) for discrete labels and utilize supervised manifold regression (SMR) [4] for continuous labels.

2 DIMENSIONALITY REDUCTION

2.1 Biased Discriminant Embedding
Given the data matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$, where $\mathbf{x}_i \in \mathbb{R}^D$ denotes the feature vector of the $i$-th image or video, and the label vector $\mathbf{l} = [l_1, l_2, \ldots, l_n]$, where $l_i \in \{0, 1\}$ denotes the corresponding label of $\mathbf{x}_i$ (1 for interesting, 0 for non-interesting), biased discriminant embedding (BDE) aims to learn a $D \times d$ transformation matrix $\mathbf{W}$ that maximizes the biased discriminant information in the reduced subspace. The motivation for the biased discrimination is that, in media interestingness prediction, we are typically more interested in the interesting class than in the non-interesting one. The objective function of BDE is given as follows:

$$\mathbf{W} = \arg\max_{\mathbf{W}} \mathrm{tr}\left(\frac{\mathbf{W}^T \mathbf{S}_b \mathbf{W}}{\mathbf{W}^T \mathbf{S}_w \mathbf{W}}\right), \quad (1)$$

where $\mathbf{S}_w = \sum_{i,j=1}^{n} (N_{ij} \cdot l_i \cdot l_j)(\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T$ denotes the biased within-class scatter, $\mathbf{S}_b = \sum_{i,j=1}^{n} (N_{ij} \cdot |l_i - l_j|)(\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T$ denotes the biased between-class scatter, and $N_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma)$ measures the closeness between two data samples $\mathbf{x}_i$ and $\mathbf{x}_j$. The optimization problem can be solved by generalized eigen-decomposition.

2.2 Supervised Manifold Regression
Supervised manifold regression (SMR) [4] aims to find a latent subspace in which two data points are close to each other if they possess similar interestingness levels. The objective function of SMR is given as follows:

$$\mathbf{W} = \arg\min_{\mathbf{W}} \sum_{i,j=1}^{n} \|\mathbf{W}^T(\mathbf{x}_i - \mathbf{x}_j)\|^2 \cdot \left(\alpha S^{l}_{ij} + (1 - \alpha) N_{ij}\right), \quad (2)$$

where $\alpha \in [0, 1]$ balances the two terms and $S^{l}_{ij} = |l_i - l_j|$ measures the similarity between the interestingness level of $\mathbf{x}_i$ and that of $\mathbf{x}_j$.

For each high-dimensional data point $\mathbf{x}_i$, we obtain its low-dimensional representation as $\mathbf{y}_i = \mathbf{W}^T \mathbf{x}_i$, and then apply an SVM to $\mathbf{y}_i$ for interestingness prediction.
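As a concrete illustration of the BDE step, the following is a minimal NumPy/SciPy sketch of solving Eq. (1) by generalized eigen-decomposition. The function name, the default $\sigma$, and the small ridge added to keep $\mathbf{S}_w$ invertible are our own assumptions for the sketch, not part of the published method.

```python
import numpy as np
from scipy.linalg import eigh

def bde(X, l, d, sigma=1.0):
    """Sketch of biased discriminant embedding, Eq. (1).

    X: (n, D) data matrix, one sample per row.
    l: (n,) binary labels, 1 = interesting, 0 = non-interesting.
    d: target dimensionality. Returns the D x d matrix W.
    """
    D = X.shape[1]
    l = np.asarray(l, dtype=float)
    # Pairwise differences x_i - x_j and closeness weights N_ij.
    diff = X[:, None, :] - X[None, :, :]                  # (n, n, D)
    N = np.exp(-(diff ** 2).sum(axis=2) / (2.0 * sigma))
    # Biased pair weights: the within-class term keeps only pairs of
    # interesting samples (l_i * l_j); the between-class term keeps
    # only mixed-label pairs (|l_i - l_j|).
    w_within = N * np.outer(l, l)
    w_between = N * np.abs(l[:, None] - l[None, :])
    # Scatter matrices as weighted sums of (x_i - x_j)(x_i - x_j)^T.
    Sw = np.einsum('ij,ijk,ijl->kl', w_within, diff, diff)
    Sb = np.einsum('ij,ijk,ijl->kl', w_between, diff, diff)
    # Generalized eigenproblem S_b v = lambda S_w v; the ridge term
    # (our assumption) keeps S_w positive definite.
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(D))
    return vecs[:, np.argsort(vals)[::-1][:d]]

# Usage sketch: W = bde(X_train, labels, d=25); then Y = X @ W
# stacks the projections y_i = W^T x_i as rows.
```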
3 EXPERIMENTS
For each image sample, we construct a 1299-D feature vector by selecting features from the feature set provided by the task organizers: 128-D color histogram features, 300-D denseSIFT features, 512-D GIST features, 300-D hog2×2 features, and 59-D LBP features. For the video data, we treat each frame as a separate image and compute the average and the standard deviation of the frame features over all frames in a shot, which yields a 2598-D feature vector for each video. We normalize each dimension of the training data to the range $[0, 1]$ by $\hat{x} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ before dimensionality reduction, where $x_{\min}$ and $x_{\max}$ denote the minimum and maximum values of the corresponding dimension, respectively. A detailed description of the dataset can be found in [3].

For Run 1 of the image data, we use the normalized 1299-D feature vector as the input to the SVM. For Runs 2-5 of the image data, we reduce the original data to 23-D, 25-D, 26-D, and 27-D subspaces via BDE (for discrete labels) and SMR (for continuous labels), respectively. For Run 1 of the video data, we use the normalized 2598-D feature vector as the input to the SVM. For Runs 2-5 of the video data, we likewise reduce the original data to 23-D, 25-D, 26-D, and 27-D subspaces via BDE (for discrete labels) and SMR (for continuous labels), respectively.

To predict the binary interestingness labels, we use ν-SVC [5] with an RBF kernel, setting ν = 0.1 and γ = 100 for the image data and γ = 64 for the video data. To predict the continuous interestingness levels, we use ε-SVR [2] with an RBF kernel, setting cost C = 1, ε = 0.01, and γ = 1/D.
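For reference, the Run 1 pipeline can be sketched with scikit-learn's NuSVC and SVR classes, which wrap the LIBSVM implementations of ν-SVC [5] and ε-SVR [2], using the hyperparameters reported above. The helper name and the reuse of the training min-max statistics on the test set are our assumptions; for Runs 2-5, the normalized data would first be projected with the BDE/SMR matrix W.

```python
import numpy as np
from sklearn.svm import NuSVC, SVR

def run1_pipeline(X_train, y_binary, y_level, X_test, gamma_cls=100):
    """Sketch of the Run 1 pipeline; pass gamma_cls=64 for video data."""
    # Min-max normalize each dimension to [0, 1] using training-set
    # statistics (reusing them on the test data is our assumption).
    x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
    scale = np.where(x_max > x_min, x_max - x_min, 1.0)
    X_train = (X_train - x_min) / scale
    X_test = (X_test - x_min) / scale
    # Binary interestingness labels: nu-SVC, RBF kernel, nu = 0.1.
    clf = NuSVC(nu=0.1, kernel='rbf', gamma=gamma_cls)
    labels = clf.fit(X_train, y_binary).predict(X_test)
    # Continuous interestingness levels: epsilon-SVR, RBF kernel,
    # cost C = 1, epsilon = 0.01, gamma = 1/D.
    reg = SVR(kernel='rbf', C=1.0, epsilon=0.01,
              gamma=1.0 / X_train.shape[1])
    levels = reg.fit(X_train, y_level).predict(X_test)
    return labels, levels
```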
Table 1 reports the evaluation results of the proposed model, as provided by the task organizers.

Table 1: MAP@10 and MAP of the proposed model.

           Images              Videos
         MAP@10     MAP     MAP@10     MAP
Run 1    0.1184    0.2812   0.0556    0.1813
Run 2    0.1320    0.2916   0.0468    0.1761
Run 3    0.1332    0.2898   0.0468    0.1761
Run 4    0.1315    0.2884   0.0463    0.1742
Run 5    0.1369    0.2910   0.0445    0.1746

For the image data, the reduced features perform better than the original ones, which indicates that the subspaces learned by BDE and SMR capture important information about media interestingness. For the video data, the performance of the reduced features is slightly worse than that of the original ones. The reason might be that video data are more complex than image data, so such a low-dimensional representation cannot fully capture the key discriminant information embedded in the original space.

We further analyze the contribution of each dimension of the original feature space. The contribution of the $i$-th dimension is defined as $\mathrm{Contribution}_i = \sum_j \lambda_j |w_{ij}|$, where $\lambda_j$ denotes the $j$-th eigenvalue, $w_{ij}$ denotes the $(i, j)$-th element of $\mathbf{W}$, and $|\cdot|$ denotes the absolute value operator. From Figures 1(a) and 1(c), we observe that the color histogram and LBP features contribute more than the others, while the GIST features contribute the least in the discrete prediction task. In the continuous prediction task (Figures 1(b) and 1(d)), the color histogram and GIST features contribute the most among the five feature sets.

[Figure 1: Contribution of each individual feature in the image/video discrete/continuous prediction tasks. (a) BDE on image data; (b) SMR on image data; (c) BDE on video data; (d) SMR on video data.]

4 DISCUSSION AND OUTLOOK
This paper introduces our model for media interestingness prediction. In future work, we aim to improve the performance of video interestingness prediction by incorporating temporal information from the videos. Moreover, as the interestingness ground-truth labels are provided by human annotators, they vary from individual to individual and are somewhat subjective. We are therefore particularly interested in refining the human-labeled ground truth (especially in the continuous case) via machine learning techniques.

ACKNOWLEDGMENTS
This work was supported in part by the National Natural Science Foundation of China under Grant 61503317, and in part by the Faculty Research Grant of Hong Kong Baptist University (HKBU) under Project FRG2/16-17/032.

REFERENCES
[1] Daniel E. Berlyne. 1960. Conflict, Arousal and Curiosity. McGraw-Hill.
[2] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (2011), 27:1–27:27.
[3] C.-H. Demarty, M. Sjoberg, B. Ionescu, T.-T. Do, M. Gygli, and N. Q. K. Duong. 2017. MediaEval 2017 Predicting Media Interestingness Task. In Proc. of the MediaEval 2017 Workshop, Dublin, Ireland, Sept. 13–15, 2017.
[4] Y. Liu, Z. Gu, and Y.-M. Cheung. 2016. Supervised Manifold Learning for Media Interestingness Prediction. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20–21, 2016.
[5] Bernhard Schölkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett. 2000. New Support Vector Algorithms. Neural Computation 12, 5 (2000), 1207–1245.
[6] Paul J. Silvia. 2006. Exploring the Psychology of Interest. Oxford University Press.
[7] Craig Smith and Phoebe Ellsworth. 1985. Patterns of cognitive appraisal in emotion. Journal of Personality and Social Psychology 48, 4 (1985), 813–838.