Supervised Manifold Learning for Media Interestingness Prediction

Yang Liu^{1,2}, Zhonglei Gu^{3}, Yiu-ming Cheung^{1,2,4}
1 Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong SAR, China
2 Institute of Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
3 AAOO Tech Limited, Shatin, Hong Kong SAR, China
4 United International College, Beijing Normal University-Hong Kong Baptist University, Zhuhai, China
{csygliu,ymc}@comp.hkbu.edu.hk, allen.koo@aaoo-tech.com

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands

ABSTRACT
In this paper, we describe the models designed for automatically selecting multimedia data, e.g., image and video segments, which are considered to be interesting for a common viewer. Specifically, we utilize an existing dimensionality reduction method called Neighborhood MinMax Projections (NMMP) to extract the low-dimensional features for predicting the discrete interestingness labels. Meanwhile, we introduce a new dimensionality reduction method dubbed Supervised Manifold Regression (SMR) to learn the compact representations for predicting the continuous interestingness levels. Finally, we use the nearest neighbor classifier and support vector regressor for classification and regression, respectively. Experimental results demonstrate the effectiveness of the low-dimensional features learned by NMMP and SMR.

1. INTRODUCTION
Effective prediction of media interestingness plays an important role in many real-world applications such as image/video search, retrieval, and recommendation [5-9, 12]. The MediaEval 2016 Predicting Media Interestingness Task requires participants to automatically select images and/or video segments which are considered to be the most interesting for a common viewer. The data used in this task are extracted from ca. 75 movie trailers of Hollywood-like movies. More details about the task requirements as well as the dataset description can be found in [3].

Supervised manifold learning, which aims to discover the data-label mapping relation while capturing the manifold structure of the dataset, plays an important role in many multimedia content analysis tasks such as face recognition [4] and video classification [10]. In this paper, we aim to solve both image and video interestingness prediction via supervised manifold learning. There are two kinds of interestingness labels in the given task, i.e., discrete and continuous. For the case of discrete labels, we utilize an existing competitive dimensionality reduction method called Neighborhood MinMax Projections (NMMP) to extract the low-dimensional features from the original high-dimensional space. For the case of continuous labels, we propose a new dimensionality reduction method dubbed Supervised Manifold Regression (SMR) to learn the compact representations of the original data. Finally, we use the nearest neighbor classifier and support vector regressor to predict the discrete and continuous labels of the given images/videos, respectively.

2. METHOD

2.1 Feature Extraction via NMMP and SMR

2.1.1 Neighborhood MinMax Projections
Given the data matrix X = [x_1, x_2, ..., x_n], where x_i \in R^D denotes the feature vector of the i-th image or video, and the label vector l = [l_1, l_2, ..., l_n], where l_i \in \{0, 1\} denotes the corresponding label of x_i (1 for interesting and 0 for non-interesting), Neighborhood MinMax Projections (NMMP) aims to find a linear transformation after which nearby points within the same class are as close as possible, while nearby points from different classes are as far apart as possible [11]. The objective function of NMMP is given as follows:

    W = \arg\max_{W^\top W = I} \frac{\mathrm{tr}(W^\top \tilde{S}_b W)}{\mathrm{tr}(W^\top \tilde{S}_w W)},    (1)

where tr(·) denotes the matrix trace operator, W denotes the transformation matrix to be learned, \tilde{S}_b denotes the between-class scatter matrix defined on nearby data points, and \tilde{S}_w denotes the within-class scatter matrix defined on nearby data points. The optimization problem in Eq. (1) can be effectively solved by eigen-decomposition. More details of NMMP can be found in [11].
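To make this step concrete, the following sketch (ours, not from [11]) approximates Eq. (1) with the common ratio-trace relaxation, i.e., a generalized eigen-decomposition of the pair (\tilde{S}_b, \tilde{S}_w) built from k-nearest-neighbor pairs only; the exact iterative trace-ratio algorithm of NMMP is given in [11]. All function and parameter names are our own.

```python
import numpy as np
from scipy.linalg import eigh

def nmmp_like_projection(X, labels, d, k=5):
    """Approximation of NMMP (Eq. (1)): build within-/between-class scatter
    matrices from nearby pairs only, then solve the ratio-trace relaxation
    via a generalized eigen-decomposition.
    X: (D, n) data matrix; labels: (n,) class labels; d: target dimension."""
    D, n = X.shape
    S_w = np.zeros((D, D))  # within-class scatter on nearby pairs
    S_b = np.zeros((D, D))  # between-class scatter on nearby pairs
    for i in range(n):
        dist = ((X - X[:, [i]]) ** 2).sum(axis=0)  # squared distances to x_i
        same = np.flatnonzero(labels == labels[i])
        other = np.flatnonzero(labels != labels[i])
        # k nearest neighbors of x_i within its own class (skip x_i itself) ...
        for j in same[np.argsort(dist[same])[1:k + 1]]:
            v = (X[:, i] - X[:, j])[:, None]
            S_w += v @ v.T
        # ... and its k nearest neighbors among the other classes
        for j in other[np.argsort(dist[other])[:k]]:
            v = (X[:, i] - X[:, j])[:, None]
            S_b += v @ v.T
    # generalized eigenproblem S_b w = lambda * S_w w; a small ridge keeps
    # S_w positive definite, as eigh requires
    evals, evecs = eigh(S_b, S_w + 1e-6 * np.eye(D))
    return evecs[:, np.argsort(evals)[::-1][:d]]  # (D, d) projection W
```

Note that NMMP optimizes the trace ratio directly, which in general yields a different (and better) solution than this relaxation; the sketch only illustrates the scatter construction and the eigen-decomposition step.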
2.1.2 Supervised Manifold Regression
Different from the binary labels in the discrete case, the continuous interestingness label is a real number, i.e., l_i \in [0, 1]. The idea behind Supervised Manifold Regression (SMR) is simple: the more similar the interestingness levels of two media items, the closer their feature vectors should be in the learned subspace. Meanwhile, we aim to preserve the manifold structure of the dataset in the original feature space. The objective function of SMR is formulated as follows:

    W = \arg\min_{W} \sum_{i,j=1}^{n} \|W^\top x_i - W^\top x_j\|^2 \big(\alpha S^{l}_{ij} + (1 - \alpha) S^{m}_{ij}\big),    (2)

where S^{l}_{ij} = |l_i - l_j| measures the similarity between the interestingness level of x_i and that of x_j (i, j = 1, ..., n), S^{m}_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2)) denotes the similarity between x_i and x_j in the original space, and \alpha \in [0, 1] is a balancing parameter, which is empirically set to 0.5 in our experiments. Following standard operations in linear algebra, the above optimization problem can be reduced to the following one:

    W = \arg\min_{W} \mathrm{tr}(W^\top X L X^\top W),    (3)

where X = [x_1, x_2, ..., x_n] \in R^{D \times n} is the data matrix, L = D - S is the n \times n Laplacian matrix [1], and D is a diagonal matrix defined as D_{ii} = \sum_{j=1}^{n} S_{ij} (i = 1, ..., n), where S_{ij} = \alpha S^{l}_{ij} + (1 - \alpha) S^{m}_{ij}. By transforming (2) into (3), the optimal W can be easily obtained by employing the standard eigen-decomposition.
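Since SMR is introduced here for the first time, a minimal sketch may help. It follows Eqs. (2)-(3) directly and assumes an orthonormality constraint W^\top W = I, which the eigen-decomposition solution of Eq. (3) implicitly requires to avoid the trivial solution; the function and parameter names are ours, and \sigma is a free parameter not specified in the text.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def smr_projection(X, levels, d, alpha=0.5, sigma=1.0):
    """Supervised Manifold Regression, Eqs. (2)-(3).
    X: (D, n) data matrix; levels: (n,) interestingness levels in [0, 1];
    d: target dimension; alpha: balancing parameter (0.5 in the paper)."""
    S_l = np.abs(levels[:, None] - levels[None, :])  # label term S^l of Eq. (2)
    sq = squareform(pdist(X.T, 'sqeuclidean'))       # pairwise ||x_i - x_j||^2
    S_m = np.exp(-sq / (2.0 * sigma ** 2))           # manifold term S^m
    S = alpha * S_l + (1.0 - alpha) * S_m            # combined weights S_ij
    L = np.diag(S.sum(axis=1)) - S                   # graph Laplacian L = D - S
    evals, evecs = eigh(X @ L @ X.T)                 # D x D symmetric eigenproblem
    return evecs[:, :d]                              # d smallest-eigenvalue directions
```

The reduced representation is then W^\top X; since eigh returns eigenvalues in ascending order, the first d eigenvectors minimize the trace in Eq. (3).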
2.2 Prediction via NN and SVR

2.2.1 Nearest Neighbor Classifier
Given the feature matrix X = [x_1, x_2, ..., x_n] and the label vector l = [l_1, l_2, ..., l_n], for a new test data sample x, its label l is decided by l = l_{i^*}, where

    i^* = \arg\min_{i} \|x - x_i\|^2.    (4)

2.2.2 Support Vector Regressor
To predict the continuous interestingness level, we use the \varepsilon-SVR [2]. The final optimization problem, i.e., the dual problem that \varepsilon-SVR aims to solve, is:

    \min_{\alpha, \alpha^*} \ \frac{1}{2} (\alpha - \alpha^*)^\top K (\alpha - \alpha^*) + \varepsilon\, e^\top (\alpha + \alpha^*) + l^\top (\alpha - \alpha^*)
    \text{s.t.} \quad e^\top (\alpha - \alpha^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C, \quad i = 1, ..., n,    (5)

where \alpha_i, \alpha_i^* are the Lagrange multipliers, K is a positive semidefinite matrix in which K_{ij} = K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j) is the kernel function, e = [1, ..., 1]^\top is the n-dimensional vector of all ones, and C > 0 is the regularization parameter. The level of a new sample x is predicted by:

    l = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) K(x_i, x) + b.    (6)

3. EVALUATION RESULTS
In this section, we report the experimental settings and the evaluation results. For the image data, we construct a 1299-D feature set, including 128-D color histogram features, 300-D denseSIFT features, 512-D gist features, 300-D hog2x2 features, and 59-D LBP features. For the video data, we treat each frame as a separate image and calculate the average and standard deviation of the frame features over all frames in the shot; thus we have a 2598-D feature set for each video.

- For Run 1, we use the 1299-D image feature vector as the input of each data sample.

- For Run 2, we first learn the 100-D subspaces of the original feature space via NMMP (for discrete labels) and SMR (for continuous labels), respectively. After obtaining the transformation matrix W \in R^{1299 \times 100}, we define the contribution of the i-th dimension (i = 1, ..., 1299) of the original feature vector as

    \mathrm{Contribution}_i = \sum_{j} |w_{ij}|,    (7)

where w_{ij} is the element in row i and column j of W, and |·| denotes the absolute value operator. We then select the features with Contribution_i >= 4 to form the reduced feature space, the dimension of which is 117, and use this 117-D feature vector as the input of each data sample.

- For Run 3, we use the 2598-D video feature vector as the input of each data sample.

- For Run 4, we apply the same procedure as in Run 2 to select the most contributing video features, the dimension of which is 140. We use this 140-D feature vector as the input of each data sample.

For each run, the NN classifier and \varepsilon-SVR are used to predict the discrete and continuous labels, respectively. For the \varepsilon-SVR, we use the RBF kernel with the default parameter settings from LIBSVM: C = 1, \varepsilon = 0.1, and \gamma = 1/D.

Table 1 reports the performance of the proposed system on several standard evaluation criteria, as provided by the organizers. For Precision, Recall, and F-score, the results follow the label order [non-interesting, interesting].

Table 1: Performance of the proposed system (provided by the organizers)

Run                                          MAP     Accuracy  Precision       Recall          F-score
Run 1: Original image features (D = 1299)   0.1835  0.838     [0.900, 0.139]  [0.922, 0.110]  [0.911, 0.123]
Run 2: Reduced image features (d = 100)     0.1806  0.802     [0.902, 0.134]  [0.874, 0.169]  [0.888, 0.149]
Run 3: Original video features (D = 2598)   0.1552  0.828     [0.901, 0.084]  [0.910, 0.076]  [0.905, 0.080]
Run 4: Reduced video features (d = 100)     0.1733  0.834     [0.902, 0.098]  [0.916, 0.084]  [0.909, 0.091]

After dimensionality reduction, the performance of the reduced features is comparable to that of the original features, which indicates that the reduced features capture most of the discriminant information of the dataset. Furthermore, we can observe that the performance on interesting data is not as good as that on non-interesting data. This might be caused by the imbalance between non-interesting (majority) and interesting (minority) data. Sampling techniques and cost-sensitive measures could therefore be utilized to further improve the performance.
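For concreteness, the following sketch shows how the feature selection of Eq. (7) and the per-run prediction could be assembled with scikit-learn's LIBSVM-backed estimators; the paper used LIBSVM directly, and the helper names and 1-NN/SVR wiring here are our reading of Sections 2.2 and 3, not released code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVR

def select_by_contribution(W, threshold=4.0):
    """Eq. (7): score each original dimension by the row-wise L1 norm of the
    transformation matrix W and keep dimensions scoring at least `threshold`."""
    contribution = np.abs(W).sum(axis=1)
    return np.flatnonzero(contribution >= threshold)  # indices of kept features

def predict_run(X_train, y_discrete, y_continuous, X_test):
    """1-NN for discrete labels (Eq. (4)) and epsilon-SVR for continuous
    levels, using the LIBSVM defaults quoted above: C = 1, epsilon = 0.1,
    RBF kernel with gamma = 1/D. Samples are rows here, sklearn-style."""
    D = X_train.shape[1]
    nn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_discrete)
    svr = SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma=1.0 / D)
    svr.fit(X_train, y_continuous)
    return nn.predict(X_test), svr.predict(X_test)
```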
4. CONCLUSIONS
In this paper, we have introduced our system for media interestingness prediction. The results show that the features extracted by NMMP and SMR are informative. Our future work will focus on improving the system by considering the dynamic nature of the video data as well as exploring the technologies for learning from imbalanced data.

Acknowledgments
The authors would like to thank the reviewers for the helpful comments. This work was supported in part by the National Natural Science Foundation of China under Grant 61503317.

5. REFERENCES
[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373-1396, 2003.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.
[3] C.-H. Demarty, M. Sjoberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. K. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In Working Notes Proceedings of the MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.
[4] B. Ge, Y. Shao, and Y. Shu. Uncorrelated discriminant isometric projection for face recognition. In Information Computing and Applications, volume 307, pages 138-145, 2012.
[5] L. Geng and H. J. Hamilton. Interestingness measures for data mining: A survey. ACM Comput. Surv., 38(3), 2006.
[6] H. Grabner, F. Nater, M. Druey, and L. Van Gool. Visual interestingness in image sequences. In Proceedings of the 21st ACM International Conference on Multimedia, pages 1017-1026, 2013.
[7] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1633-1640, 2013.
[8] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1469-1482, 2014.
[9] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yang. Understanding and predicting interestingness of videos. In Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI), 2013.
[10] Y. Liu, Y. Liu, and K. C. Chan. Supervised manifold learning for image and video classification. In Proceedings of the 18th ACM International Conference on Multimedia, pages 859-862, 2010.
[11] F. Nie, S. Xiang, and C. Zhang. Neighborhood MinMax Projections. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 993-998, 2007.
[12] M. Soleymani. The quest for visual interest. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 919-922, 2015.