
Supervised Manifold Learning for Media Interestingness Prediction

Yang Liu

csygliu@comp.hkbu.edu.hk 1 2

Zhonglei Gu

Yiu-ming Cheung

0 AAOO Tech Limited, Shatin, Hong Kong SAR, China

1 Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong SAR, China

2 Institute of Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China

3 United International College, Beijing Normal University-Hong Kong Baptist University, Zhuhai, China

2016


In this paper, we describe the models designed for automatically selecting multimedia data, e.g., image and video segments, which are considered to be interesting for a common viewer. Specifically, we utilize an existing dimensionality reduction method called Neighborhood MinMax Projections (NMMP) to extract the low-dimensional features for predicting the discrete interestingness labels. Meanwhile, we introduce a new dimensionality reduction method dubbed Supervised Manifold Regression (SMR) to learn the compact representations for predicting the continuous interestingness levels. Finally, we use the nearest neighbor classifier and support vector regressor for classification and regression, respectively. Experimental results demonstrate the effectiveness of the low-dimensional features learned by NMMP and SMR.

1. INTRODUCTION

Effective prediction of media interestingness plays an important role in many real-world applications such as image/video search, retrieval, and recommendation [5–9, 12]. The MediaEval 2016 Predicting Media Interestingness Task requires participants to automatically select images and/or video segments which are considered to be the most interesting for a common viewer. The data used in this task are extracted from ca. 75 movie trailers of Hollywood-like movies. More details about the task requirements as well as the dataset description can be found in [3].

Supervised manifold learning, which aims to discover the data-label mapping relation while capturing the manifold structure of the dataset, plays an important role in many multimedia content analysis tasks such as face recognition [4] and video classification [10]. In this paper, we aim to solve both image and video interestingness prediction via supervised manifold learning. There are two kinds of interestingness labels in the given task, i.e., discrete and continuous. For the case of discrete labels, we utilize an existing competitive dimensionality reduction method called Neighborhood MinMax Projections (NMMP) to extract the low-dimensional features from the original high-dimensional space. For the case of continuous labels, we propose a new dimensionality reduction method dubbed Supervised Manifold Regression (SMR) to learn the compact representations of the original data. Finally, we use the nearest neighbor classifier and support vector regressor to predict the discrete and continuous labels of the given images/videos, respectively.

2. METHOD

2.1 Feature Extraction via NMMP and SMR

2.1.1 Neighborhood MinMax Projections

Given the data matrix $X = [x_1, x_2, \ldots, x_n]$, where $x_i \in \mathbb{R}^D$ denotes the feature vector of the $i$-th image or video, and the label vector $l = [l_1, l_2, \ldots, l_n]$, where $l_i \in \{0, 1\}$ denotes the corresponding label of $x_i$ (1 for interesting and 0 for non-interesting), Neighborhood MinMax Projections (NMMP) aims to find a linear transformation after which the nearby points within the same class are as close as possible, while those between different classes are as far apart as possible [11]. The objective function of NMMP is given as follows:

$$W = \arg\max_{W^T W = I} \frac{\operatorname{tr}(W^T \tilde{S}_b W)}{\operatorname{tr}(W^T \tilde{S}_w W)}, \tag{1}$$

where $\operatorname{tr}(\cdot)$ denotes the matrix trace operator, $W$ denotes the transformation matrix to be learned, $\tilde{S}_b$ denotes the between-class scatter matrix defined on nearby data points, and $\tilde{S}_w$ denotes the within-class scatter matrix defined on nearby data points. The optimization problem in Eq. (1) can be effectively solved by eigen-decomposition. More details of NMMP can be found in [11].
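To illustrate how a trace-ratio objective such as Eq. (1) can be handled with eigen-decompositions, the following is a minimal sketch of one common iterative scheme, not necessarily the exact procedure of [11]. It assumes the scatter matrices are already available (the neighborhood construction from [11] is not reproduced) and that $\tilde{S}_w$ is positive definite; the function name `trace_ratio_projection` and the parameters `n_iter` and `tol` are hypothetical.

```python
import numpy as np

def trace_ratio_projection(S_b, S_w, d, n_iter=50, tol=1e-10):
    """Sketch: maximize tr(W^T S_b W) / tr(W^T S_w W) subject to W^T W = I."""
    D = S_b.shape[0]
    W = np.eye(D)[:, :d]  # any orthonormal starting point works
    ratio = np.trace(W.T @ S_b @ W) / np.trace(W.T @ S_w @ W)
    for _ in range(n_iter):
        # Keep the d eigenvectors of S_b - ratio * S_w with the largest
        # eigenvalues (eigh returns eigenvalues in ascending order).
        _, vecs = np.linalg.eigh(S_b - ratio * S_w)
        W = vecs[:, -d:]
        new_ratio = np.trace(W.T @ S_b @ W) / np.trace(W.T @ S_w @ W)
        converged = abs(new_ratio - ratio) < tol
        ratio = new_ratio
        if converged:
            break
    return W, ratio
```

Each pass cannot decrease the ratio, so the loop can be stopped as soon as the value stabilizes.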

2.1.2 Supervised Manifold Regression

Different from the binary form in the discrete case, the continuous interestingness label is a real number, i.e., $l_i \in [0, 1]$. The idea behind Supervised Manifold Regression (SMR) is simple: the more similar the interestingness levels of two media data, the closer the two feature vectors should be in the learned subspace. Meanwhile, we aim to preserve the manifold structure of the dataset in the original feature space. The objective function of SMR is formulated as follows:

$$W = \arg\min_{W} \sum_{i,j=1}^{n} \|W^T x_i - W^T x_j\|^2 \left(\alpha S^l_{ij} + (1 - \alpha) S^m_{ij}\right), \tag{2}$$

where $S^l_{ij} = |l_i - l_j|$ measures the similarity between the interestingness level of $x_i$ and that of $x_j$ ($i, j = 1, \ldots, n$), $S^m_{ij} = \exp\left(-\|x_i - x_j\|_2^2 / (2\sigma^2)\right)$ denotes the similarity between $x_i$ and $x_j$ in the original space, and $\alpha \in [0, 1]$ denotes the balancing parameter, which is empirically set to 0.5 in our experiments. Following some standard operations in linear algebra, the above optimization problem can be reduced to the following one:

$$W = \arg\min_{W} \operatorname{tr}(W^T X L X^T W), \tag{3}$$

where $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{D \times n}$ is the data matrix, $L = D - S$ is the $n \times n$ Laplacian matrix [1], and $D$ is a diagonal matrix defined as $D_{ii} = \sum_{j=1}^{n} S_{ij}$ ($i = 1, \ldots, n$), where $S_{ij} = \alpha S^l_{ij} + (1 - \alpha) S^m_{ij}$. By transforming (2) to (3), the optimal $W$ can be easily obtained by employing the standard eigen-decomposition.
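The reduction from (2) to (3) rests on the identity $\sum_{i,j} \|W^T x_i - W^T x_j\|^2 S_{ij} = 2\operatorname{tr}(W^T X L X^T W)$ for symmetric $S$, so both problems share the same minimizer. A minimal sketch of the resulting procedure follows. It makes two assumptions that the text does not state: the constraint $W^T W = I$ (to rule out the trivial solution $W = 0$) and a kernel width `sigma` of 1.0. The function names `smr_similarity` and `smr_projection` are hypothetical.

```python
import numpy as np

def smr_similarity(X, labels, alpha=0.5, sigma=1.0):
    """Combined weights S_ij = alpha * S^l_ij + (1 - alpha) * S^m_ij."""
    labels = np.asarray(labels, dtype=float)
    # Label term S^l_ij = |l_i - l_j|, following the definition in the text.
    S_l = np.abs(labels[:, None] - labels[None, :])
    # Feature term S^m_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); columns of X
    # are samples. sigma is an assumed kernel width.
    sq = np.sum(X ** 2, axis=0)
    sq_dist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    S_m = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return alpha * S_l + (1.0 - alpha) * S_m

def smr_projection(X, labels, d, alpha=0.5, sigma=1.0):
    """Sketch of Eq. (3) for a D x n data matrix X, returning a D x d matrix W."""
    S = smr_similarity(X, labels, alpha, sigma)
    L = np.diag(S.sum(axis=1)) - S        # Laplacian L = D - S
    M = X @ L @ X.T
    M = (M + M.T) / 2.0                   # symmetrize against round-off
    # Assumption: W^T W = I, so the minimizer consists of the d eigenvectors
    # with the smallest eigenvalues (eigh sorts eigenvalues in ascending order).
    _, vecs = np.linalg.eigh(M)
    return vecs[:, :d]
```

A call such as `smr_projection(X, labels, d=100)` would then produce the 100-D subspace used later in the experiments, under the assumptions above.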

2.2 Prediction via NN and SVR

2.2.1 Nearest Neighbor Classifier

Given the feature matrix $X = [x_1, x_2, \ldots, x_n]$ and the label vector $l = [l_1, l_2, \ldots, l_n]$, for a new test data sample $x$, its label $l$ is decided by $l = l_{i^*}$, where $i^* = \arg\min_i \|x - x_i\|_2$.
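The decision rule above amounts to a single distance computation followed by an arg-min. A minimal sketch, assuming the training samples are stored as the columns of `X_train`; the function name `nn_predict` is hypothetical:

```python
import numpy as np

def nn_predict(X_train, labels, x):
    """Return the label of the training sample closest to x (Euclidean distance).

    X_train is a D x n array whose columns are samples; x is a length-D vector.
    """
    dists = np.linalg.norm(X_train - x[:, None], axis=0)
    return labels[int(np.argmin(dists))]
```

In the setting of this paper, `X_train` and `x` would hold the low-dimensional features obtained after projection rather than the raw descriptors.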

2.2.2 Support Vector Regressor

To predict the continuous interestingness level, we use the $\epsilon$-SVR [2]. The final optimization problem, i.e., the dual problem that $\epsilon$-SVR aims to solve, is:

$$\min_{\alpha, \alpha^*} \; \frac{1}{2} (\alpha - \alpha^*)^T K (\alpha - \alpha^*) + \epsilon e^T (\alpha + \alpha^*) + l (\alpha - \alpha^*) \tag{4}$$

$$\text{s.t.} \quad e^T (\alpha - \alpha^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C, \quad i = 1, \ldots, n, \tag{5}$$

where $\alpha_i, \alpha_i^*$ are the Lagrange multipliers, $K$ is a positive semidefinite matrix, in which $K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel function, $e = [1, \ldots, 1]^T$ is the $n$-dimensional vector of all ones, and $C > 0$ is the regularization parameter. The level of a new sample $x$ is predicted by:

$$l = \sum_{i=1}^{n} (-\alpha_i + \alpha_i^*) K(x_i, x) + b. \tag{6}$$
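Once the dual variables are known, evaluating Eq. (6) is a weighted sum of kernel values. The sketch below assumes an RBF kernel of the form $K(x, y) = \exp(-\gamma \|x - y\|^2)$ and takes the multipliers and the offset `b` as given inputs; it does not solve the dual problem. The names `rbf_kernel` and `svr_predict` are hypothetical.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def svr_predict(X_train, alpha, alpha_star, b, x, gamma):
    """Evaluate Eq. (6): sum_i (alpha_star_i - alpha_i) * K(x_i, x) + b.

    X_train is a D x n array whose columns are the training samples.
    """
    coef = np.asarray(alpha_star, dtype=float) - np.asarray(alpha, dtype=float)
    k = np.array([rbf_kernel(x_i, x, gamma) for x_i in X_train.T])
    return float(coef @ k + b)
```

Only samples with $\alpha_i \neq \alpha_i^*$ contribute to the sum, which is why the prediction depends on the support vectors alone.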

3. EVALUATION RESULTS

In this section, we report the experimental settings and the evaluation results. For the image data, we construct a 1299-D feature set, including 128-D color histogram features, 300-D denseSIFT features, 512-D gist features, 300-D HOG2×2 features, and 59-D LBP features. For the video data, we treat each frame as a separate image and calculate the average and standard deviation of its features over all frames in the shot, which gives a 2598-D feature set for each video.
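The video descriptor described above can be sketched as follows. The order of concatenation (mean first, then standard deviation) and the use of the population standard deviation are assumptions, since the text specifies neither; the function name `video_descriptor` is hypothetical.

```python
import numpy as np

def video_descriptor(frame_features):
    """Pool per-frame features into one shot-level vector.

    frame_features has shape (num_frames, 1299); the result has
    2 * 1299 = 2598 entries: per-dimension mean, then standard deviation.
    """
    F = np.asarray(frame_features, dtype=float)
    return np.concatenate([F.mean(axis=0), F.std(axis=0)])
```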

For Run 1, we use the 1299-D image feature vector as the input of each data sample.

For Run 2, we first learn the 100-D subspaces of the original feature vector via NMMP (for discrete labels) and SMR (for continuous labels), respectively. After we obtain the transformation matrix $W \in \mathbb{R}^{1299 \times 100}$, we define the contribution of the $i$-th dimension ($i = 1, \ldots, 1299$) of the original feature vector:

$$\mathrm{Contribution}_i = \sum_{j} |w_{ij}|, \tag{7}$$

where $w_{ij}$ is the element in row $i$ and column $j$ of $W$, and $|\cdot|$ denotes the absolute value operator. Then we select the features with $\mathrm{Contribution}_i \ge 4$ to form the reduced feature space, the dimension of which is 117. We use this 117-D feature vector as the input of each data sample.
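The selection step in Eq. (7) reduces to a row-wise sum of absolute values followed by a threshold. A minimal sketch, with the hypothetical function name `select_by_contribution`:

```python
import numpy as np

def select_by_contribution(W, threshold=4.0):
    """Return the indices of original features whose contribution passes the threshold.

    W has one row per original feature and one column per learned dimension.
    """
    contribution = np.abs(W).sum(axis=1)   # Contribution_i = sum_j |w_ij|
    selected = np.flatnonzero(contribution >= threshold)
    return selected, contribution
```

The reduced input for a sample `x` is then `x[selected]`, i.e., a subset of the original features rather than a projection onto the learned subspace.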

For Run 3, we use the 2598-D video feature vector as the input of each data sample.

For Run 4, we apply the same procedure as in Run 2 to select the most contributing features, which yields 140 dimensions. We use this 140-D feature vector as the input of each data sample.

For each run, the NN classifier and $\epsilon$-SVR are used to predict the discrete and continuous labels, respectively. For $\epsilon$-SVR, we use the RBF kernel with the default parameter settings from LIBSVM: cost = 1, $\epsilon = 0.1$, and $\gamma = 1/D$.

Table 1 reports the performance of the proposed system, as evaluated by the organizers, on several standard evaluation criteria. For Precision, Recall, and F-score, the results follow the label order [non-interesting, interesting]. After dimensionality reduction, the performance of the reduced features is comparable to that of the original features, which indicates that the reduced features capture most of the discriminant information of the dataset. Furthermore, we observe that the performance on interesting data is not as good as that on non-interesting data. This might be caused by the imbalance between non-interesting (majority) and interesting (minority) data. Sampling techniques and cost-sensitive measures could therefore be utilized to further improve the performance.

4. CONCLUSIONS

In this paper, we have introduced our system for media interestingness prediction. The results show that the features extracted by NMMP and SMR are informative. Our future work will focus on improving the system by considering the dynamic nature of the video data as well as exploring techniques for learning from imbalanced data.

Acknowledgments

The authors would like to thank the reviewer for the helpful comments. This work was supported in part by the National Natural Science Foundation of China under Grant 61503317.

REFERENCES

[1] Belkin and Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373–1396, 2003.

[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

[3] C.-H. Demarty, M. Sjoberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. K. Duong, and F. Lefebvre. MediaEval 2016 predicting media interestingness task. In Working Notes Proceedings of the MediaEval 2016 Workshop, Oct. 20–21, 2016, Hilversum, Netherlands.

[4] Ge, Shao, and Shu. Uncorrelated discriminant isometric projection for face recognition. In Information Computing and Applications, volume 307, pages 138–145. 2012.

[5] Geng and H. J. Hamilton. Interestingness measures for data mining: A survey. ACM Comput. Surv., 38(3), 2006.

[6] Grabner, Nater, Druey, and L. Van Gool. Visual interestingness in image sequences. In Proceedings of the 21st ACM International Conference on Multimedia, pages 1017–1026, 2013.

[7] Gygli, Grabner, Riemenschneider, Nater, and L. V. Gool. The interestingness of images. In Proceedings of IEEE International Conference on Computer Vision, pages 1633–1640, 2013.

[8] Isola, Xiao, Parikh, Torralba, and Oliva. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1469–1482, 2014.

[9] Y.-G. Jiang, Wang, Feng, Xue, Zheng, and Yang. Understanding and predicting interestingness of videos. In Proceedings of The 27th AAAI Conference on Artificial Intelligence (AAAI), 2013.

[10] Liu, Liu, and K. C. Chan. Supervised manifold learning for image and video classification. In Proceedings of the 18th ACM International Conference on Multimedia, pages 859–862, 2010.

[11] Nie, Xiang, and C. Zhang. Neighborhood minmax projections. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 993–998, 2007.

[12] Soleymani. The quest for visual interest. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 919–922, 2015.