<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Supervised Manifold Learning for Media Interestingness Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yang Liu</string-name>
          <email>csygliu@comp.hkbu.edu.hk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhonglei Gu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yiu-ming Cheung</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AAOO Tech Limited</institution>
          ,
          <addr-line>Shatin, Hong Kong SAR</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Hong Kong Baptist University</institution>
          ,
          <addr-line>Kowloon Tong, Hong Kong SAR</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Research and Continuing Education, Hong Kong Baptist University</institution>
          ,
          <addr-line>Shenzhen</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>United International College, Beijing Normal University-Hong Kong Baptist University</institution>
          ,
          <addr-line>Zhuhai</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>In this paper, we describe the models designed for automatically selecting multimedia data, e.g., image and video segments, which are considered to be interesting for a common viewer. Specifically, we utilize an existing dimensionality reduction method called Neighborhood MinMax Projections (NMMP) to extract the low-dimensional features for predicting the discrete interestingness labels. Meanwhile, we introduce a new dimensionality reduction method dubbed Supervised Manifold Regression (SMR) to learn the compact representations for predicting the continuous interestingness levels. Finally, we use the nearest neighbor classifier and support vector regressor for classification and regression, respectively. Experimental results demonstrate the effectiveness of the low-dimensional features learned by NMMP and SMR.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Effective prediction of media interestingness plays an
important role in many real-world applications such as
image/video search, retrieval, and recommendation [5–9, 12].
The MediaEval 2016 Predicting Media Interestingness Task
requires participants to automatically select images and/or
video segments which are considered to be the most
interesting for a common viewer. The data used in this task
are extracted from approximately 75 trailers of Hollywood-like
movies. More details about the task requirements as well as
the dataset description can be found in [3].</p>
      <p>Supervised manifold learning, which aims to discover the
data-label mapping relation while capturing the manifold
structure of the dataset, plays an important role in many
multimedia content analysis tasks such as face recognition [4]
and video classification [10]. In this paper, we address
both image and video interestingness prediction via
supervised manifold learning. The task provides two kinds of
interestingness labels: discrete and
continuous. For discrete labels, we use an existing
dimensionality reduction method called
Neighborhood MinMax Projections (NMMP) to extract
low-dimensional features from the original high-dimensional
space. For continuous labels, we propose a new
dimensionality reduction method dubbed Supervised
Manifold Regression (SMR) to learn compact representations
of the original data. Finally, we use a nearest neighbor
classifier and a support vector regressor to predict the discrete and
continuous labels of the given images/videos, respectively.</p>
    </sec>
    <sec id="sec-2">
      <title>2. METHOD</title>
      <p>2.1 Feature Extraction via NMMP and SMR</p>
      <sec id="sec-2-1">
        <title>2.1.1 Neighborhood MinMax Projections</title>
        <p>Given the data matrix $X = [x_1, x_2, \ldots, x_n]$, where $x_i \in \mathbb{R}^D$
denotes the feature vector of the $i$-th image or video, and
the label vector $l = [l_1, l_2, \ldots, l_n]$, where $l_i \in \{0, 1\}$ denotes
the corresponding label of $x_i$ (1 for interesting and 0 for
non-interesting), Neighborhood MinMax Projections
(NMMP) aims to find a linear transformation after which
nearby points within the same class are as close as possible, while
nearby points from different classes are as far apart as possible [11].
The objective function of NMMP is given as follows:
          <disp-formula id="eq1"><label>(1)</label><tex-math>W = \arg\max_{W^T W = I} \frac{\mathrm{tr}(W^T \tilde{S}_b W)}{\mathrm{tr}(W^T \tilde{S}_w W)},</tex-math></disp-formula>
where $\mathrm{tr}(\cdot)$ denotes the matrix trace operator, $W$ denotes
the transformation matrix to be learned, $\tilde{S}_b$ denotes the
between-class scatter matrix defined on nearby data points,
and $\tilde{S}_w$ denotes the within-class scatter matrix defined on
nearby data points. The optimization problem in Eq. (1) can
be effectively solved by eigen-decomposition. More details
of NMMP can be found in [11].
        </p>
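        <p>As an illustration only, the following NumPy sketch shows one way the trace-ratio objective in Eq. (1) might be optimized. It is not taken from [11]: the function names, the neighborhood size k, the iteration count, and the simple k-nearest-neighbor construction of the two scatter matrices are all assumptions made for this example. Each iteration updates the current ratio value and then keeps the leading eigenvectors of the between-class scatter minus that ratio times the within-class scatter.</p>
        <preformatted>
```python
import numpy as np


def neighborhood_scatter(X, labels, k):
    """Build within- and between-class scatter matrices from k nearest neighbors.

    X has shape (D, n), one sample per column, matching the text.
    The neighbor selection rule here is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D, n = X.shape
    # Pairwise squared Euclidean distances between the columns of X.
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    S_w = np.zeros((D, D))
    S_b = np.zeros((D, D))
    for i in range(n):
        # Same-class candidates (excluding i itself) and other-class candidates.
        same = np.flatnonzero((labels == labels[i]) * (np.arange(n) != i))
        diff = np.flatnonzero(labels != labels[i])
        for idx, target in ((same, S_w), (diff, S_b)):
            nearest = idx[np.argsort(dist[i, idx])[:k]]
            for j in nearest:
                d = (X[:, i] - X[:, j])[:, None]
                target += d @ d.T  # accumulates in place into S_w or S_b
    return S_w, S_b


def nmmp(X, labels, dim, k=5, n_iter=20):
    """Trace-ratio iteration: alternate between W and the ratio value lam."""
    X = np.asarray(X, dtype=float)
    S_w, S_b = neighborhood_scatter(X, labels, k)
    W = np.eye(X.shape[0])[:, :dim]  # orthonormal starting point
    for _ in range(n_iter):
        lam = np.trace(W.T @ S_b @ W) / np.trace(W.T @ S_w @ W)
        # eigh returns eigenvalues of the symmetric matrix in ascending order.
        _, vecs = np.linalg.eigh(S_b - lam * S_w)
        W = vecs[:, -dim:]  # eigenvectors of the dim largest eigenvalues
    return W
```
        </preformatted>
        <p>Under these assumptions, a call such as nmmp(X, labels, dim=100) would return a matrix with D rows and 100 orthonormal columns, so the constraint in Eq. (1) holds by construction.</p>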
      </sec>
      <sec id="sec-2-2">
        <title>2.1.2 Supervised Manifold Regression</title>
        <p>Unlike the binary labels of the discrete case, the
continuous interestingness label is a real number, i.e., $l_i \in [0, 1]$.
The idea behind Supervised Manifold Regression (SMR) is
simple: the more similar the interestingness levels of two
media items, the closer their feature vectors should be in the
learned subspace. Meanwhile, we aim to preserve the
manifold structure of the dataset in the original feature space.
The objective function of SMR is formulated as follows:
          <disp-formula id="eq2"><label>(2)</label><tex-math>W = \arg\min_{W} \sum_{i,j=1}^{n} \| W^T x_i - W^T x_j \|^2 \left( \alpha S^l_{ij} + (1 - \alpha) S^m_{ij} \right),</tex-math></disp-formula>
where $S^l_{ij} = |l_i - l_j|$ measures the similarity between the
interestingness level of $x_i$ and that of $x_j$ ($i, j = 1, \ldots, n$),
$S^m_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$ denotes the similarity between $x_i$
and $x_j$ in the original space, and $\alpha \in [0, 1]$ denotes the
balancing parameter, which is empirically set to 0.5 in our
experiments. Following some standard operations in linear
algebra, the above optimization problem can be reduced
to the following one:
          <disp-formula id="eq3"><label>(3)</label><tex-math>W = \arg\min_{W} \mathrm{tr}(W^T X L X^T W),</tex-math></disp-formula>
where $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{D \times n}$ is the data matrix, $L = D - S$
is the $n \times n$ Laplacian matrix [1], and $D$ is a diagonal
matrix defined as $D_{ii} = \sum_{j=1}^{n} S_{ij}$ ($i = 1, \ldots, n$), where
$S_{ij} = \alpha S^l_{ij} + (1 - \alpha) S^m_{ij}$. By transforming (2) to (3), the optimal
$W$ can be easily obtained by standard
eigen-decomposition.
        </p>
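        <p>The reduction from (2) to (3) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the implementation used in our runs: the function name smr and the kernel width sigma are hypothetical, the label term is coded exactly as printed above, and an orthonormality constraint on the columns of the projection is assumed so that the minimizer is given by the eigenvectors with the smallest eigenvalues.</p>
        <preformatted>
```python
import numpy as np


def smr(X, levels, dim, alpha=0.5, sigma=1.0):
    """Return a (D, dim) projection minimizing tr(W^T X L X^T W).

    X has shape (D, n); levels holds the n continuous labels in [0, 1].
    Assumes W^T W = I, which the text does not state explicitly.
    """
    X = np.asarray(X, dtype=float)
    levels = np.asarray(levels, dtype=float)
    # Label term as printed in the text: S^l_ij = |l_i - l_j|.
    S_l = np.abs(levels[:, None] - levels[None, :])
    # Heat-kernel term on the original features (sigma is an assumed width).
    sq = np.sum(X ** 2, axis=0)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    S_m = np.exp(-dist2 / (2.0 * sigma ** 2))
    S = alpha * S_l + (1.0 - alpha) * S_m
    L = np.diag(S.sum(axis=1)) - S  # graph Laplacian L = D - S
    M = X @ L @ X.T
    M = (M + M.T) / 2.0  # symmetrize against round-off
    _, vecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    return vecs[:, :dim]  # eigenvectors of the dim smallest eigenvalues
```
        </preformatted>
        <p>The low-dimensional representation of a sample would then be obtained by multiplying the transpose of the returned matrix with its feature vector.</p>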
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2.2 Prediction via NN and SVR</title>
      <sec id="sec-3-1">
        <title>2.2.1 Nearest Neighbor Classifier</title>
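        <p>Given the feature matrix $X = [x_1, x_2, \ldots, x_n]$ and the label
vector $l = [l_1, l_2, \ldots, l_n]$, for a new test sample $x$, its label
$l$ is decided by $l = l_{i^{*}}$, where
          <disp-formula id="eq4"><label>(4)</label><tex-math>i^{*} = \arg\min_{i} \| x - x_i \|_2 .</tex-math></disp-formula></p>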
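        <p>The nearest neighbor rule above can be written in a few lines of plain Python. The function name and the toy data are made up for this illustration; the squared Euclidean distance is used, which selects the same neighbor as the unsquared one.</p>
        <preformatted>
```python
def nearest_neighbor_label(train_features, train_labels, x):
    """Return the label of the training sample closest to x."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    best = min(
        range(len(train_features)),
        key=lambda i: sq_dist(train_features[i], x),
    )
    return train_labels[best]


# Toy example with hypothetical 2-D features.
features = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
labels = [0, 0, 1]
print(nearest_neighbor_label(features, labels, [3.5, 3.0]))  # prints 1
```
        </preformatted>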
      </sec>
      <sec id="sec-3-2">
        <title>2.2.2 Support Vector Regressor</title>
        <p>To predict the continuous interestingness level, we use
$\epsilon$-SVR [2]. The final optimization problem, i.e., the dual
problem that $\epsilon$-SVR aims to solve, is:
          <disp-formula id="eq5"><label>(5)</label><tex-math>\begin{gathered} \min_{\alpha, \alpha^{*}} \ \frac{1}{2} (\alpha - \alpha^{*})^T K (\alpha - \alpha^{*}) + \epsilon e^T (\alpha + \alpha^{*}) + l^T (\alpha - \alpha^{*}) \\ \text{s.t.} \ e^T (\alpha - \alpha^{*}) = 0, \quad 0 \le \alpha_i, \alpha_i^{*} \le C, \quad i = 1, \ldots, n, \end{gathered}</tex-math></disp-formula>
where $\alpha_i, \alpha_i^{*}$ are the Lagrange multipliers, $K$ is a positive
semidefinite matrix, in which $K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
is the kernel function, $e = [1, \ldots, 1]^T$ is the $n$-dimensional
vector of all ones, and $C &gt; 0$ is the regularization
parameter. The level of a new sample $x$ is predicted by:
          <disp-formula id="eq6"><label>(6)</label><tex-math>l = \sum_{i=1}^{n} (\alpha_i^{*} - \alpha_i) K(x_i, x) + b.</tex-math></disp-formula>
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. EVALUATION RESULTS</title>
      <p>In this section, we report the experimental settings and
the evaluation results. For the image data, we construct a
1299-D feature set consisting of 128-D color histogram features,
300-D denseSIFT features, 512-D GIST features, 300-D hog2x2 features,
and 59-D LBP features. For the video data, we treat each
frame as a separate image and compute the average and
standard deviation of the frame features over all frames in the shot, which gives
a 2598-D feature set for each video.</p>
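      <p>The video descriptor described above could be computed along the following lines. This is a sketch only: the function name is hypothetical, and the population standard deviation (NumPy's default) is an assumption, since the text does not say which estimator is used.</p>
      <preformatted>
```python
import numpy as np


def video_feature(frame_features):
    """Concatenate the per-dimension mean and standard deviation over frames.

    frame_features has shape (num_frames, 1299); the result has 2598 entries.
    """
    frames = np.asarray(frame_features, dtype=float)
    # np.std uses the population formula (ddof=0) unless told otherwise.
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
```
      </preformatted>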
      <p>For Run 1, we use the 1299-D image feature vector as
the input of each data sample.</p>
      <p>For Run 2, we first learn 100-D subspaces of the
original feature vector via NMMP (for discrete labels)
and SMR (for continuous labels), respectively. After
obtaining the transformation matrix $W \in \mathbb{R}^{1299 \times 100}$,
we define the contribution of the $i$-th dimension ($i = 1, \ldots, 1299$)
of the original feature vector as
        <disp-formula id="eq7"><label>(7)</label><tex-math>\mathrm{Contribution}_i = \sum_{j} |w_{ij}|,</tex-math></disp-formula>
where $w_{ij}$ is the element in row $i$ and column $j$ of $W$,
and $|\cdot|$ denotes the absolute value operator. We then
select the features with $\mathrm{Contribution}_i \geq 4$ to form the
reduced feature space, whose dimension is 117.
We use this 117-D feature vector as the input of each
data sample.
      </p>
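      <p>The contribution-based selection used in Run 2 can be sketched as follows. The function name and the toy matrix are hypothetical, and the comparison is written as "at least the threshold", which is an assumption about how the boundary case is handled.</p>
      <preformatted>
```python
import numpy as np


def select_by_contribution(W, threshold=4.0):
    """Return indices of original dimensions whose summed |w_ij| reaches threshold."""
    contribution = np.abs(np.asarray(W, dtype=float)).sum(axis=1)
    return np.flatnonzero(contribution >= threshold)


# Toy 3-by-2 projection matrix; row sums of absolute values are 5.0, 1.0, 4.1.
W_toy = np.array([[3.0, -2.0], [0.5, 0.5], [-4.0, 0.1]])
selected = select_by_contribution(W_toy)
print(selected.tolist())  # prints [0, 2]
```
      </preformatted>
      <p>With samples stored as columns of a data matrix X, the reduced features would then be the rows X[selected, :].</p>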
      <p>For Run 3, we use the 2598-D video feature vector as
the input of each data sample.</p>
      <p>For Run 4, we apply the same procedure as in Run 2 to
select the most contributing features, which yields a
140-D feature vector. We use this 140-D feature vector as the
input of each data sample.</p>
      <p>For each run, the NN classifier and $\epsilon$-SVR are used to
predict the discrete and continuous labels, respectively. For
$\epsilon$-SVR, we use the RBF kernel with the default parameter
settings of LIBSVM: cost = 1, $\epsilon = 0.1$, and $\gamma = 1/D$.</p>
      <p>Table 1 reports the performance of the proposed system,
as provided by the organizers, on several standard
evaluation criteria. For Precision, Recall, and F-score, the
results follow the label order [non-interesting, interesting].
After dimensionality reduction, the performance of the
reduced features is comparable to that of the original features,
which indicates that the reduced features capture most of
the discriminant information of the dataset. Furthermore,
we observe that the performance on interesting data is
not as good as that on non-interesting data. This might be
caused by the imbalance between non-interesting (majority)
and interesting (minority) data. Sampling techniques and
cost-sensitive measures could therefore be used to further
improve the performance.</p>
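      <p>To make the per-class reporting order concrete, the sketch below computes Precision, Recall, and F-score separately for each label using the standard definitions. It is an illustration with made-up predictions, not the organizers' evaluation tool, and the function name is hypothetical.</p>
      <preformatted>
```python
def per_class_scores(y_true, y_pred, classes=(0, 1)):
    """Return {label: (precision, recall, f_score)} for each class label."""
    scores = {}
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        scores[c] = (precision, recall, f_score)
    return scores


# Made-up labels: 0 = non-interesting (majority), 1 = interesting (minority).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
result = per_class_scores(y_true, y_pred)
print(result[0], result[1])  # prints (0.75, 0.75, 0.75) (0.5, 0.5, 0.5)
```
      </preformatted>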
    </sec>
    <sec id="sec-5">
      <title>4. CONCLUSIONS</title>
      <p>In this paper, we have introduced our system for media
interestingness prediction. The results show that the
features extracted by NMMP and SMR are informative. Our
future work will focus on improving the system by
considering the dynamic nature of the video data as well as exploring
techniques for learning from imbalanced data.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors would like to thank the reviewer for the helpful
comments. This work was supported in part by the National
Natural Science Foundation of China under Grant 61503317.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Belkin</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Niyogi</surname>
          </string-name>
          .
          <article-title>Laplacian eigenmaps for dimensionality reduction and data representation</article-title>
          .
          <source>Neural Comput.</source>
          ,
          <volume>15</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1373</fpage>
          –
          <lpage>1396</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBSVM: A library for support vector machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          ,
          <volume>2</volume>
          :
          <issue>27</issue>
          :1–
          <fpage>27</fpage>
          :
          <fpage>27</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-T.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Q. K.</given-names>
            <surname>Duong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Lefebvre</surname>
          </string-name>
          .
          <article-title>MediaEval 2016 predicting media interestingness task</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2016 Workshop</source>
          , Oct.
          <volume>20</volume>
          -
          <fpage>21</fpage>
          ,
          <year>2016</year>
          , Hilversum, Netherlands.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shu</surname>
          </string-name>
          .
          <article-title>Uncorrelated discriminant isometric projection for face recognition</article-title>
          .
          <source>In Information Computing and Applications</source>
          , volume
          <volume>307</volume>
          , pages
          <fpage>138</fpage>
          –
          <lpage>145</lpage>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Geng</surname>
          </string-name>
          and
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          .
          <article-title>Interestingness measures for data mining: A survey</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>38</volume>
          (
          <issue>3</issue>
          ),
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Grabner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nater</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Druey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. Van</given-names>
            <surname>Gool</surname>
          </string-name>
          .
          <article-title>Visual interestingness in image sequences</article-title>
          .
          <source>In Proceedings of the 21st ACM International Conference on Multimedia</source>
          , pages
          <fpage>1017</fpage>
          –
          <lpage>1026</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gygli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Grabner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Riemenschneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nater</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Gool</surname>
          </string-name>
          .
          <article-title>The interestingness of images</article-title>
          .
          <source>In Proceedings of IEEE International Conference on Computer Vision</source>
          , pages
          <fpage>1633</fpage>
          –
          <lpage>1640</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          .
          <article-title>What makes a photograph memorable?</article-title>
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>36</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1469</fpage>
          –
          <lpage>1482</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.-G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Understanding and predicting interestingness of videos</article-title>
          .
          <source>In Proceedings of The 27th AAAI Conference on Artificial Intelligence (AAAI)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Chan</surname>
          </string-name>
          .
          <article-title>Supervised manifold learning for image and video classification</article-title>
          .
          <source>In Proceedings of the 18th ACM International Conference on Multimedia</source>
          , pages
          <fpage>859</fpage>
          –
          <lpage>862</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Neighborhood minmax projections</article-title>
          .
          <source>In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI)</source>
          , pages
          <fpage>993</fpage>
          –
          <lpage>998</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>The quest for visual interest</article-title>
          .
          <source>In Proceedings of the 23rd ACM International Conference on Multimedia</source>
          , pages
          <fpage>919</fpage>
          –
          <lpage>922</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>