Predicting Media Interestingness via Biased Discriminant Embedding and Supervised Manifold Regression

Yang Liu1,2, Zhonglei Gu1, Tobey H. Ko3
1 Department of Computer Science, Hong Kong Baptist University, HKSAR, China
2 Institute of Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
3 Department of Industrial and Manufacturing Systems Engineering, University of Hong Kong, HKSAR, China
csygliu@comp.hkbu.edu.hk, cszlgu@comp.hkbu.edu.hk, tobeyko@hku.hk

ABSTRACT
In this paper, we describe our model for the automatic prediction of media interestingness. Specifically, we propose a two-stage learning framework. In the first stage, supervised dimensionality reduction is employed to discover the key discriminant information embedded in the original feature space: we present a new algorithm, dubbed biased discriminant embedding (BDE), to extract discriminant features with discrete labels, and employ supervised manifold regression (SMR) to extract discriminant features with continuous labels. In the second stage, an SVM is utilized for prediction. Experimental results validate the effectiveness of our approaches.

1 INTRODUCTION
Predicting the interestingness of multimedia content has long been studied in the psychology community [1, 6, 7]. More recently, with the explosion of multimedia content brought about by the accessibility of low-cost creation tools, the automatic prediction of media interestingness has started to attract attention in the computer science community, owing to its many useful applications for content providers, marketers, and managerial decision-makers.

In this paper, we propose to use dimensionality reduction to extract low-dimensional features for the MediaEval 2017 Predicting Media Interestingness Task. Specifically, we propose a new algorithm called biased discriminant embedding (BDE) for discrete labels and utilize supervised manifold regression (SMR) [4] for continuous labels.

2 DIMENSIONALITY REDUCTION

2.1 Biased Discriminant Embedding
Given the data matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$, where $\mathbf{x}_i \in \mathbb{R}^D$ denotes the feature vector of the $i$-th image or video, and the label vector $\mathbf{l} = [l_1, l_2, \ldots, l_n]$, where $l_i \in \{0, 1\}$ denotes the corresponding label of $\mathbf{x}_i$ (1 for interesting, 0 for non-interesting), biased discriminant embedding (BDE) aims to learn a $D \times d$ transformation matrix $\mathbf{W}$ that maximizes the biased discriminant information in the reduced subspace. The motivation for the biased discrimination is that, in media interestingness prediction, we are typically more interested in the interesting class than in the non-interesting one. The objective function of BDE is given as follows:

$$\mathbf{W} = \arg\max_{\mathbf{W}} \mathrm{tr}\left(\frac{\mathbf{W}^T \mathbf{S}_b \mathbf{W}}{\mathbf{W}^T \mathbf{S}_w \mathbf{W}}\right), \quad (1)$$

where $\mathbf{S}_w = \sum_{i,j=1}^{n} (N_{ij} \cdot l_i \cdot l_j)(\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T$ denotes the biased within-class scatter, $\mathbf{S}_b = \sum_{i,j=1}^{n} (N_{ij} \cdot |l_i - l_j|)(\mathbf{x}_i - \mathbf{x}_j)(\mathbf{x}_i - \mathbf{x}_j)^T$ denotes the biased between-class scatter, and $N_{ij} = \exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma)$ measures the closeness between two data samples $\mathbf{x}_i$ and $\mathbf{x}_j$. The optimization problem can be solved by generalized eigen-decomposition.

2.2 Supervised Manifold Regression
Supervised manifold regression (SMR) [4] aims to find a latent subspace in which two data points are close to each other if they possess similar interestingness levels. The objective function of SMR is given as follows:

$$\mathbf{W} = \arg\min_{\mathbf{W}} \sum_{i,j=1}^{n} \|\mathbf{W}^T(\mathbf{x}_i - \mathbf{x}_j)\|^2 \cdot \left(\alpha S^{l}_{ij} + (1 - \alpha) N_{ij}\right), \quad (2)$$

where $\alpha \in [0, 1]$ balances the two terms and $S^{l}_{ij} = |l_i - l_j|$ measures the similarity between the interestingness level of $\mathbf{x}_i$ and that of $\mathbf{x}_j$.

For each high-dimensional data point $\mathbf{x}_i$, we obtain its low-dimensional representation as $\mathbf{y}_i = \mathbf{W}^T \mathbf{x}_i$, and then apply an SVM to $\mathbf{y}_i$ for interestingness prediction.
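As a concrete illustration of the BDE step, the following is a minimal NumPy/SciPy sketch of solving Eq. (1) by generalized eigen-decomposition. The function name, the default $\sigma$, and the small ridge added to keep $\mathbf{S}_w$ invertible are our own assumptions for the sketch, not part of the published method.

```python
import numpy as np
from scipy.linalg import eigh

def bde(X, l, d, sigma=1.0):
    """Sketch of biased discriminant embedding, Eq. (1).

    X: (n, D) data matrix, one sample per row.
    l: (n,) binary labels, 1 = interesting, 0 = non-interesting.
    d: target dimensionality. Returns the D x d matrix W.
    """
    D = X.shape[1]
    l = np.asarray(l, dtype=float)
    # Pairwise differences x_i - x_j and closeness weights N_ij.
    diff = X[:, None, :] - X[None, :, :]                  # (n, n, D)
    N = np.exp(-(diff ** 2).sum(axis=2) / (2.0 * sigma))
    # Biased pair weights: the within-class term keeps only pairs of
    # interesting samples (l_i * l_j); the between-class term keeps
    # only mixed-label pairs (|l_i - l_j|).
    w_within = N * np.outer(l, l)
    w_between = N * np.abs(l[:, None] - l[None, :])
    # Scatter matrices as weighted sums of (x_i - x_j)(x_i - x_j)^T.
    Sw = np.einsum('ij,ijk,ijl->kl', w_within, diff, diff)
    Sb = np.einsum('ij,ijk,ijl->kl', w_between, diff, diff)
    # Generalized eigenproblem S_b v = lambda S_w v; the ridge term
    # (our assumption) keeps S_w positive definite.
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(D))
    return vecs[:, np.argsort(vals)[::-1][:d]]

# Usage sketch: W = bde(X_train, labels, d=25); then Y = X @ W
# stacks the projections y_i = W^T x_i as rows.
```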
3 EXPERIMENTS
For each image sample, we construct a 1299-D feature vector by selecting features from the feature set provided by the task organizers: 128-D color histogram features, 300-D denseSIFT features, 512-D GIST features, 300-D hog2×2 features, and 59-D LBP features. For the video data, we treat each frame as a separate image and compute the average and the standard deviation of the frame features over all frames in a shot, which yields a 2598-D feature vector for each video. We normalize each dimension of the training data to the range $[0, 1]$ by $\hat{x} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ before dimensionality reduction, where $x_{\min}$ and $x_{\max}$ denote the minimum and maximum values of the corresponding dimension, respectively. A detailed description of the dataset can be found in [3].

For Run 1 of the image data, we use the normalized 1299-D feature vector as the input to the SVM. For Runs 2-5 of the image data, we reduce the original data to 23-D, 25-D, 26-D, and 27-D subspaces via BDE (for discrete labels) and SMR (for continuous labels), respectively. For Run 1 of the video data, we use the normalized 2598-D feature vector as the input to the SVM. For Runs 2-5 of the video data, we likewise reduce the original data to 23-D, 25-D, 26-D, and 27-D subspaces via BDE (for discrete labels) and SMR (for continuous labels), respectively.

To predict the binary interestingness labels, we use ν-SVC [5] with an RBF kernel, setting ν = 0.1 and γ = 100 for the image data and γ = 64 for the video data. To predict the continuous interestingness levels, we use ε-SVR [2] with an RBF kernel, setting cost C = 1, ε = 0.01, and γ = 1/D.
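For reference, the Run 1 pipeline can be sketched with scikit-learn's NuSVC and SVR classes, which wrap the LIBSVM implementations of ν-SVC [5] and ε-SVR [2], using the hyperparameters reported above. The helper name and the reuse of the training min-max statistics on the test set are our assumptions; for Runs 2-5, the normalized data would first be projected with the BDE/SMR matrix W.

```python
import numpy as np
from sklearn.svm import NuSVC, SVR

def run1_pipeline(X_train, y_binary, y_level, X_test, gamma_cls=100):
    """Sketch of the Run 1 pipeline; pass gamma_cls=64 for video data."""
    # Min-max normalize each dimension to [0, 1] using training-set
    # statistics (reusing them on the test data is our assumption).
    x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
    scale = np.where(x_max > x_min, x_max - x_min, 1.0)
    X_train = (X_train - x_min) / scale
    X_test = (X_test - x_min) / scale
    # Binary interestingness labels: nu-SVC, RBF kernel, nu = 0.1.
    clf = NuSVC(nu=0.1, kernel='rbf', gamma=gamma_cls)
    labels = clf.fit(X_train, y_binary).predict(X_test)
    # Continuous interestingness levels: epsilon-SVR, RBF kernel,
    # cost C = 1, epsilon = 0.01, gamma = 1/D.
    reg = SVR(kernel='rbf', C=1.0, epsilon=0.01,
              gamma=1.0 / X_train.shape[1])
    levels = reg.fit(X_train, y_level).predict(X_test)
    return labels, levels
```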
Table 1 reports the evaluation results of the proposed model, as provided by the task organizers.

Table 1: MAP@10 and MAP of the proposed model.

           Images              Videos
         MAP@10     MAP     MAP@10     MAP
Run 1    0.1184    0.2812   0.0556    0.1813
Run 2    0.1320    0.2916   0.0468    0.1761
Run 3    0.1332    0.2898   0.0468    0.1761
Run 4    0.1315    0.2884   0.0463    0.1742
Run 5    0.1369    0.2910   0.0445    0.1746

For the image data, the reduced features perform better than the original ones, which indicates that the subspaces learned by BDE and SMR capture important information about media interestingness. For the video data, the performance of the reduced features is slightly worse than that of the original ones. The reason might be that video data are more complex than image data, so such a low-dimensional representation cannot fully capture the key discriminant information embedded in the original space.

We further analyze the contribution of each dimension of the original feature space. The contribution of the $i$-th dimension is defined as $\mathrm{Contribution}_i = \sum_j \lambda_j |w_{ij}|$, where $\lambda_j$ denotes the $j$-th eigenvalue, $w_{ij}$ denotes the $(i, j)$-th element of $\mathbf{W}$, and $|\cdot|$ denotes the absolute value operator. From Figures 1(a) and 1(c), we observe that the color histogram and LBP features contribute more than the others, while the GIST features contribute the least in the discrete prediction task. In the continuous prediction task (Figures 1(b) and 1(d)), the color histogram and GIST features contribute the most among the five feature sets.

[Figure 1: Contribution of each individual feature in the image/video discrete/continuous prediction tasks. (a) BDE on image data; (b) SMR on image data; (c) BDE on video data; (d) SMR on video data.]

4 DISCUSSION AND OUTLOOK
This paper introduces our model for media interestingness prediction. In future work, we aim to improve the performance of video interestingness prediction by incorporating temporal information from the videos. Moreover, as the interestingness ground-truth labels are provided by human annotators, they vary from individual to individual and are somewhat subjective. We are therefore particularly interested in refining the human-labeled ground truth (especially in the continuous case) via machine learning techniques.

ACKNOWLEDGMENTS
This work was supported in part by the National Natural Science Foundation of China under Grant 61503317, and in part by the Faculty Research Grant of Hong Kong Baptist University (HKBU) under Project FRG2/16-17/032.

REFERENCES
[1] Daniel E. Berlyne. 1960. Conflict, Arousal and Curiosity. McGraw-Hill.
[2] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3 (2011), 27:1–27:27.
[3] C.-H. Demarty, M. Sjoberg, B. Ionescu, T.-T. Do, M. Gygli, and N. Q. K. Duong. 2017. MediaEval 2017 Predicting Media Interestingness Task. In Proc. of the MediaEval 2017 Workshop, Dublin, Ireland, Sept. 13–15, 2017.
[4] Y. Liu, Z. Gu, and Y.-M. Cheung. 2016. Supervised Manifold Learning for Media Interestingness Prediction. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20–21, 2016.
[5] Bernhard Schölkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett. 2000. New Support Vector Algorithms. Neural Computation 12, 5 (2000), 1207–1245.
[6] Paul J. Silvia. 2006. Exploring the Psychology of Interest. Oxford University Press.
[7] Craig Smith and Phoebe Ellsworth. 1985. Patterns of cognitive appraisal in emotion. Journal of Personality and Social Psychology 48, 4 (1985), 813–838.