Supervised Manifold Learning for Media Interestingness Prediction

Yang Liu^{1,2}, Zhonglei Gu^{3}, Yiu-ming Cheung^{1,2,4}
1 Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong SAR, China
2 Institute of Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
3 AAOO Tech Limited, Shatin, Hong Kong SAR, China
4 United International College, Beijing Normal University-Hong Kong Baptist University, Zhuhai, China
{csygliu,ymc}@comp.hkbu.edu.hk, allen.koo@aaoo-tech.com

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands

ABSTRACT
In this paper, we describe the models designed for automatically selecting multimedia data, e.g., image and video segments, which are considered to be interesting for a common viewer. Specifically, we utilize an existing dimensionality reduction method called Neighborhood MinMax Projections (NMMP) to extract the low-dimensional features for predicting the discrete interestingness labels. Meanwhile, we introduce a new dimensionality reduction method dubbed Supervised Manifold Regression (SMR) to learn the compact representations for predicting the continuous interestingness levels. Finally, we use the nearest neighbor classifier and support vector regressor for classification and regression, respectively. Experimental results demonstrate the effectiveness of the low-dimensional features learned by NMMP and SMR.

1. INTRODUCTION
Effective prediction of media interestingness plays an important role in many real-world applications such as image/video search, retrieval, and recommendation [5-9, 12]. The MediaEval 2016 Predicting Media Interestingness Task requires participants to automatically select images and/or video segments which are considered to be the most interesting for a common viewer. The data used in this task are extracted from ca. 75 movie trailers of Hollywood-like movies. More details about the task requirements as well as the dataset description can be found in [3].

Supervised manifold learning, which aims to discover the data-label mapping relation while capturing the manifold structure of the dataset, plays an important role in many multimedia content analysis tasks such as face recognition [4] and video classification [10]. In this paper, we aim to solve both image and video interestingness prediction via supervised manifold learning. There are two kinds of interestingness labels in the given task, i.e., discrete and continuous. For the case of discrete labels, we utilize an existing competitive dimensionality reduction method called Neighborhood MinMax Projections (NMMP) to extract the low-dimensional features from the original high-dimensional space. For the case of continuous labels, we propose a new dimensionality reduction method dubbed Supervised Manifold Regression (SMR) to learn the compact representations of the original data. Finally, we use the nearest neighbor classifier and support vector regressor to predict the discrete and continuous labels of the given images/videos, respectively.

2. METHOD

2.1 Feature Extraction via NMMP and SMR

2.1.1 Neighborhood MinMax Projections
Given the data matrix X = [x_1, x_2, ..., x_n], where x_i \in R^D denotes the feature vector of the i-th image or video, and the label vector l = [l_1, l_2, ..., l_n], where l_i \in \{0, 1\} denotes the corresponding label of x_i (1 for interesting and 0 for non-interesting), Neighborhood MinMax Projections (NMMP) aims to find a linear transformation after which nearby points within the same class are as close as possible, while nearby points from different classes are as far apart as possible [11]. The objective function of NMMP is given as follows:

    W = \arg\max_{W^\top W = I} \frac{\mathrm{tr}(W^\top \tilde{S}_b W)}{\mathrm{tr}(W^\top \tilde{S}_w W)},    (1)

where tr(·) denotes the matrix trace operator, W denotes the transformation matrix to be learned, \tilde{S}_b denotes the between-class scatter matrix defined on nearby data points, and \tilde{S}_w denotes the within-class scatter matrix defined on nearby data points. The optimization problem in Eq. (1) can be effectively solved by eigen-decomposition. More details of NMMP can be found in [11].
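To make this step concrete, the following sketch (ours, not from [11]) approximates Eq. (1) with the common ratio-trace relaxation, i.e., a generalized eigen-decomposition of the pair (\tilde{S}_b, \tilde{S}_w) built from k-nearest-neighbor pairs only; the exact iterative trace-ratio algorithm of NMMP is given in [11]. All function and parameter names are our own.

```python
import numpy as np
from scipy.linalg import eigh

def nmmp_like_projection(X, labels, d, k=5):
    """Approximation of NMMP (Eq. (1)): build within-/between-class scatter
    matrices from nearby pairs only, then solve the ratio-trace relaxation
    via a generalized eigen-decomposition.
    X: (D, n) data matrix; labels: (n,) class labels; d: target dimension."""
    D, n = X.shape
    S_w = np.zeros((D, D))  # within-class scatter on nearby pairs
    S_b = np.zeros((D, D))  # between-class scatter on nearby pairs
    for i in range(n):
        dist = ((X - X[:, [i]]) ** 2).sum(axis=0)  # squared distances to x_i
        same = np.flatnonzero(labels == labels[i])
        other = np.flatnonzero(labels != labels[i])
        # k nearest neighbors of x_i within its own class (skip x_i itself) ...
        for j in same[np.argsort(dist[same])[1:k + 1]]:
            v = (X[:, i] - X[:, j])[:, None]
            S_w += v @ v.T
        # ... and its k nearest neighbors among the other classes
        for j in other[np.argsort(dist[other])[:k]]:
            v = (X[:, i] - X[:, j])[:, None]
            S_b += v @ v.T
    # generalized eigenproblem S_b w = lambda * S_w w; a small ridge keeps
    # S_w positive definite, as eigh requires
    evals, evecs = eigh(S_b, S_w + 1e-6 * np.eye(D))
    return evecs[:, np.argsort(evals)[::-1][:d]]  # (D, d) projection W
```

Note that NMMP optimizes the trace ratio directly, which in general yields a different (and better) solution than this relaxation; the sketch only illustrates the scatter construction and the eigen-decomposition step.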
2.1.2 Supervised Manifold Regression
Different from the binary labels in the discrete case, the continuous interestingness label is a real number, i.e., l_i \in [0, 1]. The idea behind Supervised Manifold Regression (SMR) is simple: the more similar the interestingness levels of two media items, the closer their feature vectors should be in the learned subspace. Meanwhile, we aim to preserve the manifold structure of the dataset in the original feature space. The objective function of SMR is formulated as follows:

    W = \arg\min_{W} \sum_{i,j=1}^{n} \|W^\top x_i - W^\top x_j\|^2 \big(\alpha S^{l}_{ij} + (1 - \alpha) S^{m}_{ij}\big),    (2)

where S^{l}_{ij} = |l_i - l_j| measures the similarity between the interestingness level of x_i and that of x_j (i, j = 1, ..., n), S^{m}_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2)) denotes the similarity between x_i and x_j in the original space, and \alpha \in [0, 1] is a balancing parameter, which is empirically set to 0.5 in our experiments. Following standard operations in linear algebra, the above optimization problem can be reduced to the following one:

    W = \arg\min_{W} \mathrm{tr}(W^\top X L X^\top W),    (3)

where X = [x_1, x_2, ..., x_n] \in R^{D \times n} is the data matrix, L = D - S is the n \times n Laplacian matrix [1], and D is a diagonal matrix defined as D_{ii} = \sum_{j=1}^{n} S_{ij} (i = 1, ..., n), where S_{ij} = \alpha S^{l}_{ij} + (1 - \alpha) S^{m}_{ij}. By transforming (2) into (3), the optimal W can be easily obtained by employing the standard eigen-decomposition.
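Since SMR is introduced here for the first time, a minimal sketch may help. It follows Eqs. (2)-(3) directly and assumes an orthonormality constraint W^\top W = I, which the eigen-decomposition solution of Eq. (3) implicitly requires to avoid the trivial solution; the function and parameter names are ours, and \sigma is a free parameter not specified in the text.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def smr_projection(X, levels, d, alpha=0.5, sigma=1.0):
    """Supervised Manifold Regression, Eqs. (2)-(3).
    X: (D, n) data matrix; levels: (n,) interestingness levels in [0, 1];
    d: target dimension; alpha: balancing parameter (0.5 in the paper)."""
    S_l = np.abs(levels[:, None] - levels[None, :])  # label term S^l of Eq. (2)
    sq = squareform(pdist(X.T, 'sqeuclidean'))       # pairwise ||x_i - x_j||^2
    S_m = np.exp(-sq / (2.0 * sigma ** 2))           # manifold term S^m
    S = alpha * S_l + (1.0 - alpha) * S_m            # combined weights S_ij
    L = np.diag(S.sum(axis=1)) - S                   # graph Laplacian L = D - S
    evals, evecs = eigh(X @ L @ X.T)                 # D x D symmetric eigenproblem
    return evecs[:, :d]                              # d smallest-eigenvalue directions
```

The reduced representation is then W^\top X; since eigh returns eigenvalues in ascending order, the first d eigenvectors minimize the trace in Eq. (3).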
2.2 Prediction via NN and SVR

2.2.1 Nearest Neighbor Classifier
Given the feature matrix X = [x_1, x_2, ..., x_n] and the label vector l = [l_1, l_2, ..., l_n], for a new test data sample x, its label l is decided by l = l_{i^*}, where

    i^* = \arg\min_{i} \|x - x_i\|^2.    (4)

2.2.2 Support Vector Regressor
To predict the continuous interestingness level, we use the \varepsilon-SVR [2]. The final optimization problem, i.e., the dual problem that \varepsilon-SVR aims to solve, is:

    \min_{\alpha, \alpha^*} \ \frac{1}{2} (\alpha - \alpha^*)^\top K (\alpha - \alpha^*) + \varepsilon\, e^\top (\alpha + \alpha^*) + l^\top (\alpha - \alpha^*)
    \text{s.t.} \quad e^\top (\alpha - \alpha^*) = 0, \quad 0 \le \alpha_i, \alpha_i^* \le C, \quad i = 1, ..., n,    (5)

where \alpha_i, \alpha_i^* are the Lagrange multipliers, K is a positive semidefinite matrix in which K_{ij} = K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j) is the kernel function, e = [1, ..., 1]^\top is the n-dimensional vector of all ones, and C > 0 is the regularization parameter. The level of a new sample x is predicted by:

    l = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) K(x_i, x) + b.    (6)

3. EVALUATION RESULTS
In this section, we report the experimental settings and the evaluation results. For the image data, we construct a 1299-D feature set, including 128-D color histogram features, 300-D denseSIFT features, 512-D gist features, 300-D hog2x2 features, and 59-D LBP features. For the video data, we treat each frame as a separate image and calculate the average and standard deviation of the frame features over all frames in the shot; thus we have a 2598-D feature set for each video.

- For Run 1, we use the 1299-D image feature vector as the input of each data sample.

- For Run 2, we first learn the 100-D subspaces of the original feature space via NMMP (for discrete labels) and SMR (for continuous labels), respectively. After obtaining the transformation matrix W \in R^{1299 \times 100}, we define the contribution of the i-th dimension (i = 1, ..., 1299) of the original feature vector as

    \mathrm{Contribution}_i = \sum_{j} |w_{ij}|,    (7)

where w_{ij} is the element in row i and column j of W, and |·| denotes the absolute value operator. We then select the features with Contribution_i >= 4 to form the reduced feature space, the dimension of which is 117, and use this 117-D feature vector as the input of each data sample.

- For Run 3, we use the 2598-D video feature vector as the input of each data sample.

- For Run 4, we apply the same procedure as in Run 2 to select the most contributing video features, the dimension of which is 140. We use this 140-D feature vector as the input of each data sample.

For each run, the NN classifier and \varepsilon-SVR are used to predict the discrete and continuous labels, respectively. For the \varepsilon-SVR, we use the RBF kernel with the default parameter settings from LIBSVM: C = 1, \varepsilon = 0.1, and \gamma = 1/D.

Table 1 reports the performance of the proposed system on several standard evaluation criteria, as provided by the organizers. For Precision, Recall, and F-score, the results follow the label order [non-interesting, interesting].

Table 1: Performance of the proposed system (provided by the organizers)

Run                                          MAP     Accuracy  Precision       Recall          F-score
Run 1: Original image features (D = 1299)   0.1835  0.838     [0.900, 0.139]  [0.922, 0.110]  [0.911, 0.123]
Run 2: Reduced image features (d = 100)     0.1806  0.802     [0.902, 0.134]  [0.874, 0.169]  [0.888, 0.149]
Run 3: Original video features (D = 2598)   0.1552  0.828     [0.901, 0.084]  [0.910, 0.076]  [0.905, 0.080]
Run 4: Reduced video features (d = 100)     0.1733  0.834     [0.902, 0.098]  [0.916, 0.084]  [0.909, 0.091]

After dimensionality reduction, the performance of the reduced features is comparable to that of the original features, which indicates that the reduced features capture most of the discriminant information of the dataset. Furthermore, we can observe that the performance on interesting data is not as good as that on non-interesting data. This might be caused by the imbalance between non-interesting (majority) and interesting (minority) data. Sampling techniques and cost-sensitive measures could therefore be utilized to further improve the performance.
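For concreteness, the following sketch shows how the feature selection of Eq. (7) and the per-run prediction could be assembled with scikit-learn's LIBSVM-backed estimators; the paper used LIBSVM directly, and the helper names and 1-NN/SVR wiring here are our reading of Sections 2.2 and 3, not released code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVR

def select_by_contribution(W, threshold=4.0):
    """Eq. (7): score each original dimension by the row-wise L1 norm of the
    transformation matrix W and keep dimensions scoring at least `threshold`."""
    contribution = np.abs(W).sum(axis=1)
    return np.flatnonzero(contribution >= threshold)  # indices of kept features

def predict_run(X_train, y_discrete, y_continuous, X_test):
    """1-NN for discrete labels (Eq. (4)) and epsilon-SVR for continuous
    levels, using the LIBSVM defaults quoted above: C = 1, epsilon = 0.1,
    RBF kernel with gamma = 1/D. Samples are rows here, sklearn-style."""
    D = X_train.shape[1]
    nn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_discrete)
    svr = SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma=1.0 / D)
    svr.fit(X_train, y_continuous)
    return nn.predict(X_test), svr.predict(X_test)
```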
4. CONCLUSIONS
In this paper, we have introduced our system for media interestingness prediction. The results show that the features extracted by NMMP and SMR are informative. Our future work will focus on improving the system by considering the dynamic nature of the video data as well as exploring the technologies for learning from imbalanced data.

Acknowledgments
The authors would like to thank the reviewers for the helpful comments. This work was supported in part by the National Natural Science Foundation of China under Grant 61503317.

5. REFERENCES
[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373-1396, 2003.
[2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.
[3] C.-H. Demarty, M. Sjoberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. K. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In Working Notes Proceedings of the MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.
[4] B. Ge, Y. Shao, and Y. Shu. Uncorrelated discriminant isometric projection for face recognition. In Information Computing and Applications, volume 307, pages 138-145, 2012.
[5] L. Geng and H. J. Hamilton. Interestingness measures for data mining: A survey. ACM Comput. Surv., 38(3), 2006.
[6] H. Grabner, F. Nater, M. Druey, and L. Van Gool. Visual interestingness in image sequences. In Proceedings of the 21st ACM International Conference on Multimedia, pages 1017-1026, 2013.
[7] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1633-1640, 2013.
[8] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1469-1482, 2014.
[9] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yang. Understanding and predicting interestingness of videos. In Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI), 2013.
[10] Y. Liu, Y. Liu, and K. C. Chan. Supervised manifold learning for image and video classification. In Proceedings of the 18th ACM International Conference on Multimedia, pages 859-862, 2010.
[11] F. Nie, S. Xiang, and C. Zhang. Neighborhood MinMax Projections. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pages 993-998, 2007.
[12] M. Soleymani. The quest for visual interest. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 919-922, 2015.