<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ranking Images and Videos on Visual Interestingness by Visual Sentiment Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Soheil Rayatdoost</string-name>
          <email>soheil.rayatdoost@unige.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Soleymani</string-name>
          <email>mohammad.soleymani@unige.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Swiss Center for Affective Sciences, University of Geneva</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>Today, users generate and consume millions of videos online. Automatic identification of the most interesting moments in these videos has many applications, such as video retrieval. Although the most interesting excerpts are person-dependent, existing work demonstrates that there are some common features among these segments. The media interestingness task at MediaEval 2016 focuses on ranking the shots and keyframes of a movie trailer by their interestingness. The dataset consists of a set of commercial movie trailers from which the participants are required to automatically identify the most interesting shots and frames. We approach the problem as a regression task and test several algorithms. In particular, we use mid-level semantic visual sentiment features. These features are related to the emotional content of images and have been shown to be effective in recognizing interestingness in GIFs. We found that our suggested features outperform the baseline for the task at hand.</p>
      </abstract>
      <kwd-group>
        <kwd>Features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Interestingness is the capability of catching and holding
human attention [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Research in psychology suggests that
interest is related to novelty, uncertainty, conflict and
complexity [
        <xref ref-type="bibr" rid="ref14 ref2">2, 14</xref>
        ]. These attributes determine whether a person
finds an item interesting, and they contribute to
interestingness differently for different people; for example, one
person might find a more complex stimulus more interesting than
another. Developing a computational model that
automatically performs such a task is useful for different applications
such as video retrieval, recommendation and summarization
[
        <xref ref-type="bibr" rid="ref1 ref15">1, 15</xref>
        ].
      </p>
      <p>
        A number of works address the problem
of predicting visual interestingness from the content. Gygli
et al. and Grabner et al. [
        <xref ref-type="bibr" rid="ref6 ref7">7, 6</xref>
        ] used visual content
features related to unusualness, aesthetics and general
preference for predicting visual interestingness. Soleymani [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
built a model for personalized interest prediction for images.
He found that affective content, quality, coping potential
and complexity have a significant effect on visual interest
in images. In more recent work, Gygli and Soleymani [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
attempted to predict GIF interestingness from the content.
They found visual sentiment descriptors [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to be more
effective for predicting GIF interestingness than
features that capture temporal information and motion.
      </p>
      <p>
        The "Media Interestingness Task" is organized at
MediaEval 2016. In this task, a development-set and an evaluation-set
consisting of Creative Commons licensed trailers of
commercial movies, together with their interestingness labels, are provided.
For the details of the task description, dataset development
and evaluation, we refer the reader to the task overview
paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. There are two subtasks in this challenge: the first
involves automatic prediction of the interestingness
ranking of the different shots in a trailer; the second involves
predicting the ranking of the most interesting key frames.
Visual and audio (only for shots) modalities are available
to the interestingness prediction methods [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The designed
algorithms are evaluated on evaluation data which include
2342 shots from 26 trailers. Examples of top-ranking
keyframes are shown in Figure 1.
      </p>
      <p>
        The organizers provided a set of baseline visual and audio
features. For the visual modality, we additionally extracted
mid-level semantic visual descriptors [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and deep learning
features. Sentiment-related features are effective in
capturing the emotional content of images and have been shown to be useful
in recognizing interestingness in GIFs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. For the audio
modality, we extracted the extended Geneva Minimalistic
Acoustic Parameter Set (eGeMAPS) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We tested multiple
regression models for interestingness ranking. We compare
our results with those from the baseline features based
on mean average precision (MAP) over the top N best ranked
images or shots. According to our results on the
evaluation-set, our feature-set outperforms the baseline features for
predicting interestingness. In the next section, we present our
features and describe our methodology in detail.
      </p>
    </sec>
    <sec id="sec-2">
      <title>METHOD Features</title>
      <p>
        We opt to use a set of hand-crafted features and
transfer learning, in addition to regression models, with the goal of
interestingness ranking. The task organizers provided a set
of baseline low-level features. These include a
number of low-level audiovisual features that are typically used
in computer vision and speech analysis, including dense
SIFT, Histograms of Oriented Gradients (HoG), Local Binary Patterns
(LBP), GIST, Color Histogram and deep learning features for
the visual modality [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and Mel-Frequency Cepstral
Coefficients (MFCC) and the cepstral vectors for audio.
      </p>
      <p>
        Interestingness is highly correlated with the emotional
content of images [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Therefore, we opted to extract the eGeMAPS
features from the audio [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. eGeMAPS features are acoustic features
hand-picked by experts for speech and music
emotion recognition. The 88 eGeMAPS features were extracted
with openSMILE [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For the video sub-challenge, we extracted
all the key-frames from each shot. We then applied the
visual sentiment adjective-noun-pair (ANP) detectors [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to
each key-frame. The activations of the fully connected layer
7 (fc7) and the output of the final layer were extracted for
each frame. We then pooled the resulting values by mean
and variance to form one feature vector for each shot.
      </p>
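      <p>As an illustration, a minimal sketch of the shot-level pooling described above is given below, assuming the per-keyframe fc7 activations or ANP detector outputs are already available as NumPy arrays; the function name and array shapes are illustrative, not part of the original pipeline.</p>
      <preformat>
import numpy as np

def pool_shot_features(frame_features):
    """Pool per-keyframe descriptors into a single shot-level vector.

    frame_features: array of shape (n_keyframes, n_dims), e.g. stacked
    fc7 activations or ANP detector outputs for all keyframes of a shot.
    Returns a vector of length 2 * n_dims (mean concatenated with variance).
    """
    frame_features = np.asarray(frame_features, dtype=np.float64)
    mean = frame_features.mean(axis=0)
    var = frame_features.var(axis=0)
    return np.concatenate([mean, var])

# Example: a shot with 5 keyframes and 4096-dimensional fc7 activations
shot_vector = pool_shot_features(np.random.rand(5, 4096))
print(shot_vector.shape)  # (8192,)
      </preformat>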
    </sec>
    <sec id="sec-3">
      <title>2.2 Regression models</title>
      <p>
        We used three different regression models to predict the
interestingness level: linear regression (LR), support vector
regression (SVR) with a linear kernel, and sparse
approximation weighted regression (SPARROW) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        We used the LIBLINEAR library [
        <xref ref-type="bibr" rid="ref12 ref9">9, 12</xref>
        ] implementation of
SVR with the L2-regularized logistic regression option to
predict the interestingness score. We also used regression with
sparse approximation. Regression with sparse
approximation estimates the prediction from
local information. It is similar to a k-nearest
neighbors regression (k-NNR) whose weights are calculated
through sparse approximation [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Linear regression with
least-squares optimization is used as a baseline
method.
      </p>
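      <p>To make the idea concrete, the following is a minimal sketch of SPARROW-style regression as described in [13]: the query is expressed as a sparse combination of the training samples, and those sparse weights combine the training targets. The use of scikit-learn's Lasso as the sparse approximation solver and the parameter values are assumptions for illustration, not the exact setup used in our runs.</p>
      <preformat>
import numpy as np
from sklearn.linear_model import Lasso

def sparrow_predict(X_train, y_train, x_query, alpha=0.01):
    """Predict a target for x_query from a sparse approximation of x_query
    in terms of the training samples (SPARROW-style, local weighting).

    X_train: (n_samples, n_dims) training features
    y_train: (n_samples,) training targets (interestingness scores)
    x_query: (n_dims,) query feature vector
    """
    # Express the query as a sparse combination of training samples:
    # the dictionary columns are the training feature vectors.
    lasso = Lasso(alpha=alpha, positive=True, max_iter=10000)
    lasso.fit(X_train.T, x_query)
    w = lasso.coef_                   # sparse weights over training samples
    if w.sum() == 0:
        return float(y_train.mean())  # fall back to the global mean
    w = w / w.sum()                   # normalize to a weighted average
    return float(w @ y_train)

# Example with random placeholder data
rng = np.random.default_rng(0)
X, y = rng.random((50, 20)), rng.random(50)
print(sparrow_predict(X, y, rng.random(20)))
      </preformat>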
      <p>In all cases, except for the eGeMAPS audio features, we used
principal component analysis (PCA) to reduce the
dimensionality of the features. For SVR and SPARROW, we kept the
principal components containing 99% of the variance. In the case
of linear regression, we kept only the principal components
that accounted for 50% of the total variance.</p>
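      <p>The dimensionality reduction step can be sketched as follows, using scikit-learn as a stand-in for the toolchain; the variance thresholds match those stated above, while the regressors and their settings are placeholders rather than our exact LIBLINEAR configuration.</p>
      <preformat>
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVR

# A fractional n_components keeps the smallest number of principal
# components whose cumulative explained variance reaches that fraction.
svr_model = make_pipeline(PCA(n_components=0.99), LinearSVR(C=1.0, max_iter=10000))
lr_model = make_pipeline(PCA(n_components=0.50), LinearRegression())

# Example: fit on random placeholder data shaped like a feature matrix
rng = np.random.default_rng(0)
X, y = rng.random((200, 300)), rng.random(200)
svr_model.fit(X, y)
print(svr_model.predict(X[:3]))
      </preformat>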
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. EXPERIMENTS</title>
      <p>After extracting all the feature-sets, we evaluated the
performance of different combinations of feature-sets and
regression models. We evaluated the different approaches using
five-fold cross-validation on the development-set. In each
iteration, one-fifth of the development-set was held out and
the rest was used to train the regression model. When
training the SVR, we optimized the hyper-parameter C using a
grid-search on the training-set.</p>
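      <p>A minimal sketch of this protocol with scikit-learn is shown below; the C grid and the LinearSVR stand-in are assumptions for illustration, not the exact values and solver we searched.</p>
      <preformat>
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import LinearSVR

# Five-fold cross-validation: each fold holds out one fifth of the
# development-set and trains on the remaining four fifths.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Grid-search over the SVR hyper-parameter C on the training folds.
search = GridSearchCV(
    LinearSVR(max_iter=10000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=cv,
    scoring="neg_mean_squared_error",
)

rng = np.random.default_rng(0)
X, y = rng.random((200, 50)), rng.random(200)  # placeholder features and scores
search.fit(X, y)
print(search.best_params_)
      </preformat>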
      <p>The best performing approaches, as measured by MAP on the
ranked results, were selected for the submitted runs (see Table 1).</p>
    </sec>
    <sec id="sec-5">
      <title>4. RESULTS AND DISCUSSION</title>
      <p>Following the task evaluation procedure, we report MAP
on the N best ranked images or shots. We report the results of
cross-validation on the development-set and of our four
submitted runs on the evaluation-set. For our submitted
runs, we trained the selected feature-set and regression-method
pairs on all the available data in the development-set. The results
for interestingness prediction with the best pairs of
regression methods and feature-sets are summarized in Table 1.
The best MAP on the development-set, 0.262, is achieved by
combining multilingual visual sentiment ontology (MVSO)
descriptors and deep learning features with
SPARROW regression. We used the baseline video
features with SPARROW regression as our baseline. To assess
the performance of the audio features, we ranked the videos by
the output of an SVR trained on the audio features
only. The best result for the image sub-task is achieved by
sentiment descriptors and deep learning features in combination
with linear regression.</p>
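      <p>For reference, a minimal sketch of mean average precision over the top N ranked items is given below, using one common AP@N variant and assuming binary ground-truth interestingness labels per trailer; the data layout is illustrative and this is not the official evaluation script.</p>
      <preformat>
import numpy as np

def average_precision_at_n(ranked_labels, n):
    """Average precision over the top-n items of one ranked list.

    ranked_labels: binary relevance (1 = interesting) in predicted rank order.
    """
    ranked_labels = np.asarray(ranked_labels[:n])
    if ranked_labels.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(ranked_labels) / (np.arange(len(ranked_labels)) + 1)
    return float((precision_at_k * ranked_labels).sum() / ranked_labels.sum())

def mean_average_precision(ranked_lists, n):
    """MAP: mean of per-trailer average precision over the top-n ranks."""
    return float(np.mean([average_precision_at_n(r, n) for r in ranked_lists]))

# Example: two trailers with shots ranked by predicted interestingness
print(mean_average_precision([[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]], n=5))
      </preformat>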
      <p>Overall, the evaluation-set results demonstrate that
mid-level semantic visual descriptors are more effective for
predicting interestingness than the baseline low-level
features. The results from a set of relatively simple audio
features show the significance of the audio modality for such
a task. In the image sub-task, the evaluation-set results are
very similar to those of the video sub-task, since the sentiment features lack
temporal information. The drop in performance on
the evaluation-set demonstrates that our models were
overfitting to the development-set, and it is likely that an
ensemble learning regression would have performed better.</p>
    </sec>
    <sec id="sec-6">
      <title>5. CONCLUSION</title>
      <p>In this work, we explored different strategies for
predicting visual interestingness in videos. We found the mid-level
visual descriptors that are related to sentiment to be more
effective for such a task than the low-level visual
features. This is due to the affective nature of interestingness;
i.e., interest is an emotion by some accounts. Our features are
all static and frame-based; we did not try extracting features
related to movement that can capture temporal information,
due to the small size of the dataset. Hence, the frame-based
results are not substantially different from the shot-based ones;
essentially they address very similar tasks. The observed performance
of the proposed method is rather low. However, given the
sample size and the dimensionality of the descriptors, the results
still show promising potential. In the future, larger
scale datasets should ideally be developed and annotated to enable
more sophisticated methods such as transfer learning
with deep neural networks. Even though the audio features
are not as effective, they showed significant performance
that deserves more in-depth analysis in the future.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Amengual</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bosch</surname>
          </string-name>
          , and
          <string-name>
            <surname>J. L. de la Rosa</surname>
          </string-name>
          .
          <article-title>Review of methods to predict social image interestingness and memorability</article-title>
          . In G. Azzopardi and N. Petkov, editors,
          <source>Computer Analysis of Images and Patterns: 16th International Conference, CAIP 2015, Valletta, Malta, September 2-4, 2015, Proceedings, Part I</source>
          , pages
          <fpage>64</fpage>
          –
          <lpage>76</lpage>
          . Springer International Publishing, Cham,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Berlyne</surname>
          </string-name>
          .
          <article-title>Conflict, arousal, and curiosity</article-title>
          .
          <source>McGraw-Hill</source>
          ,
          <year>1960</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Demarty</surname>
          </string-name>
          , M. Sjoberg,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Lefebvre</surname>
          </string-name>
          .
          <article-title>MediaEval 2016 Predicting Media Interestingness Task</article-title>
          .
          <source>In MediaEval 2016 Workshop</source>
          , Amsterdam, Netherlands,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Scherer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Andre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Busso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Y.</given-names>
            <surname>Devillers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Epps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Laukka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Truong</surname>
          </string-name>
          .
          <article-title>The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          ,
          <volume>7</volume>
          (
          <issue>2</issue>
          ):
          <fpage>190</fpage>
          –
          <lpage>202</lpage>
          , April
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gross</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <article-title>Recent developments in openSMILE, the Munich open-source multimedia feature extractor</article-title>
          .
          <source>In Proceedings of the 21st ACM International Conference on Multimedia, MM '13</source>
          , pages
          <fpage>835</fpage>
          –
          <lpage>838</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Grabner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nater</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Druey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. Van</given-names>
            <surname>Gool</surname>
          </string-name>
          .
          <article-title>Visual interestingness in image sequences</article-title>
          .
          <source>In Proceedings of the 21st Annual ACM Conference on Multimedia</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gygli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Grabner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Riemenschneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nater</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. Van</given-names>
            <surname>Gool</surname>
          </string-name>
          .
          <article-title>The interestingness of images</article-title>
          .
          <source>In The IEEE International Conference on Computer Vision (ICCV)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gygli</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>Analyzing and predicting GIF interestingness</article-title>
          .
          <source>In ACM Multimedia</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Keerthi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Sundararajan</surname>
          </string-name>
          .
          <article-title>A dual coordinate descent method for large-scale linear SVM</article-title>
          .
          <source>In Proceedings of the Twenty Fifth International Conference on Machine Learning (ICML)</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y. G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. F.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Super fast event recognition in internet videos</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>17</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1174</fpage>
          –
          <lpage>1186</lpage>
          , Aug.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pappas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Redi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Topkara</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Visual affect around the world: A large-scale multilingual visual sentiment ontology</article-title>
          .
          <source>In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15</source>
          , pages
          <fpage>159</fpage>
          –
          <lpage>168</lpage>
          , New York, NY, USA,
          <year>2015</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Weng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Keerthi</surname>
          </string-name>
          .
          <article-title>Trust region Newton method for large-scale logistic regression</article-title>
          .
          <source>In Proceedings of the 24th International Conference on Machine Learning (ICML)</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Noorzad</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. L.</given-names>
            <surname>Sturm</surname>
          </string-name>
          .
          <article-title>Regression with sparse approximations of data</article-title>
          .
          <source>In Proceedings of the 20th European Signal Processing Conference (EUSIPCO 2012)</source>
          , pages
          <fpage>674</fpage>
          –
          <lpage>678</lpage>
          , Aug.
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Silvia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Henson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Templin</surname>
          </string-name>
          .
          <article-title>Are the sources of interest the same for everyone? Using multilevel mixture models to explore individual differences in appraisal structures</article-title>
          .
          <source>Cognition and Emotion</source>
          ,
          <volume>23</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1389</fpage>
          –
          <lpage>1406</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Soleymani</surname>
          </string-name>
          .
          <article-title>The quest for visual interest</article-title>
          .
          <source>In Proceedings of the 23rd ACM International Conference on Multimedia, MM '15</source>
          , pages
          <fpage>919</fpage>
          –
          <lpage>922</lpage>
          , New York, NY, USA,
          <year>2015</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>