<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EURECOM @MediaEval 2017: Media Genre Inference for Predicting Media Interestingness</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olfa Ben-Ahmed</string-name>
          <email>olfa.ben-ahmed@eurecom.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonas Wacker</string-name>
          <email>jonas.wacker@eurecom.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Gaballo</string-name>
          <email>alessandro.gaballo@eurecom.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benoit Huet</string-name>
          <email>benoit.huet@eurecom.fr</email>
        </contrib>
        <aff>EURECOM, Sophia Antipolis, France</aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we present EURECOM's approach to the MediaEval 2017 Predicting Media Interestingness Task. We developed models for both the image and video subtasks. In particular, we investigate the use of media genre information (e.g., drama, horror) to predict interestingness. Our approach relates to the affective impact of media content and is shown to be effective in predicting interestingness for both video shots and key-frames.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Multimedia interestingness prediction aims to automatically analyze media data and identify the most attractive content. Previous work has focused on predicting media interestingness directly from the multimedia content [
        <xref ref-type="bibr" rid="ref3 ref6 ref7 ref8">3, 6–8</xref>
        ]. However, media interestingness prediction remains an open challenge in the computer vision community [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] due to the gap between low-level perceptual features and high-level human perception of the data.
      </p>
      <p>
        Recent research has shown that perceived interestingness is highly correlated with the emotional content of the data [
        <xref ref-type="bibr" rid="ref14 ref9">9, 14</xref>
        ]. Indeed, humans may prefer "affective decisions" when looking for interesting content, because emotional factors directly reflect the viewer's attention. Hence, an affective representation of video content is useful for identifying the most important parts of a movie. In this work, we hypothesize that the emotional impact of the movie genre can be a factor in the perceived interestingness of a video for a given viewer. We therefore adopt a mid-level representation based on video genre recognition and propose to represent each sample as a distribution over genres (action, drama, horror, romance, sci-fi). For instance, a high confidence for the horror label in a shot's genre distribution suggests strong emotional content (scary in this case); such a shot is likely more characteristic, and therefore more interesting, than one with a neutral genre distribution.
      </p>
      <p>
        The Predicting Media Interestingness challenge is organized at MediaEval 2017. The task consists of two subtasks for the prediction of image and video interestingness respectively. The first involves predicting the most interesting key-frames; the second involves the automatic prediction of interestingness for the different shots of a trailer. For more details about the task description, the related dataset and the experimental setting, we refer the reader to the task overview paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The rest of the paper is organized as follows: Section 2 describes our proposed method, Section 3 presents experiments and results, and Section 4 concludes the work and gives some perspectives.
      </p>
    </sec>
    <sec id="sec-2">
      <title>METHOD</title>
      <p>Extracting genre information from movie scenes yields an intermediate representation that can be quite useful for further classification tasks. In this section, we briefly present our method for media interestingness prediction. Figure 1 gives a brief overview of the entire framework. First, we extract deep visual and acoustic features for each shot. We then obtain a genre prediction for each modality and finally use these predictions to train an interestingness classifier.</p>
    </sec>
    <sec id="sec-3">
      <title>Media Genre Representation</title>
      <p>The genre prediction model is based on audio-visual deep features.
Using these features, we trained two genre classifiers, a Deep
Neural Network (DNN) on deep visual features and an SVM on deep
acoustic features.</p>
      <p>
        The dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] used to train our genre model originally contains four different movie genres: action, drama, horror and romance. We extended the dataset with an additional genre to obtain a more sophisticated genre representation for each movie trailer shot. Our final dataset comprises 415 movie trailers of 5 genres (69 trailers for action, 95 for drama, 99 for horror, 80 for romance and 72 for sci-fi). Each movie trailer is segmented into visual shots using the PySceneDetect tool1. The visual shots are obtained automatically by comparing HSV histograms of consecutive video frames (a high histogram distance indicates a shot boundary). We also segment each video into audio shots using the OpenSmile Voice Activity Detection tool2. This tool automatically detects speaker cues in the audio stream, which we use as acoustic shot boundaries. In total, we trained our two genre predictor models on 29151 visual and 26144 audio shots. Each visual shot is represented by a key-frame, for which we select the middle frame of the shot. Visual features are extracted from these key-frames using a pretrained VGG-16 network [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Removing the last two layers yields a 4096-dimensional feature vector for each key-frame. This single feature vector represents the visual information obtained for each shot/key-frame.
1: http://pyscenedetect.readthedocs.io/en/latest/
2: https://github.com/naxingyu/opensmile/tree/master/scripts/vad
      </p>
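      <p>The histogram-based shot segmentation described above can be sketched in a few lines of Python. This is a minimal illustration assuming frames are already decoded into HSV arrays (as produced by e.g. OpenCV's cvtColor); the bin count and the distance threshold are hypothetical choices, not the exact parameters used by PySceneDetect.</p>
      <preformat>
```python
import numpy as np

def hsv_histogram(frame_hsv, bins=16):
    """Concatenated, normalized per-channel histogram of an HSV frame."""
    hists = []
    for channel in range(3):
        h, _ = np.histogram(frame_hsv[..., channel], bins=bins, range=(0, 256))
        hists.append(h / h.sum())
    return np.concatenate(hists)

def shot_boundaries(frames_hsv, threshold=0.5):
    """Frame indices where consecutive histograms differ strongly."""
    boundaries = []
    prev = hsv_histogram(frames_hsv[0])
    for i in range(1, len(frames_hsv)):
        cur = hsv_histogram(frames_hsv[i])
        # L1 distance between normalized histograms, in [0, 2] per channel
        if np.abs(cur - prev).sum() > threshold:
            boundaries.append(i)
        prev = cur
    return boundaries

# Two synthetic "shots": dark frames followed by bright frames
dark = np.zeros((4, 8, 8, 3), dtype=np.uint8)
bright = np.full((4, 8, 8, 3), 200, dtype=np.uint8)
frames = np.concatenate([dark, bright])
print(shot_boundaries(frames))  # boundary detected at frame 4
```
      </preformat>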
      <p>
        2.1.1 Visual feature learning. We use the DNN architecture proposed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to make genre predictions from visual features. The architecture is shown in Figure 2. Dropout regularization is used to avoid overfitting and to improve training performance. The output is squashed into a probability vector over the 5 genres using Softmax. We train the network with mini-batch stochastic gradient descent, using a batch size of 32 and categorical cross-entropy as the loss function, for 50 epochs.
      </p>
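      <p>The output stage described above can be illustrated with a short numpy sketch: raw network scores for the five genres are squashed into a probability vector by softmax and scored against a one-hot genre label with categorical cross-entropy. The logit values below are made up for illustration.</p>
      <preformat>
```python
import numpy as np

GENRES = ["action", "drama", "horror", "romance", "sci-fi"]

def softmax(logits):
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(probs, one_hot):
    # loss is the negative log-probability assigned to the true genre
    return -np.sum(one_hot * np.log(probs))

logits = np.array([0.5, 1.2, 3.0, 0.1, 0.4])   # hypothetical DNN outputs
probs = softmax(logits)                        # probability vector over 5 genres
label = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # true genre: horror

print(GENRES[int(np.argmax(probs))])           # prints "horror"
print(float(categorical_cross_entropy(probs, label)))
```
      </preformat>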
      <p>
        2.1.2 Acoustic feature learning. Audio information plays an important role in video content analysis. Most related approaches focus on hand-crafted audio features such as the Mel Frequency Cepstrum Coefficients (MFCC) or spectrograms, combined with either traditional or deep classifiers. However, those audio features are rather low-level representations and are not designed for semantic video analysis. Instead of using such classical audio features, we extract deep audio features from a pretrained model called SoundNet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The latter has been learned by transferring knowledge from vision to sound in order to recognize objects and scenes from sound data. According to the work of Aytar et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], an audio feature representation based on SoundNet reaches state-of-the-art accuracy on three standard acoustic scene classification datasets. In our work, features are extracted from the fifth convolutional layer of the 8-layer version of the SoundNet model. For training on the audio features, we used a probabilistic SVM with a linear kernel and a regularization value of C = 1.0.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Interestingness classification</title>
      <p>Our genre model can be used for both the image and video subtasks. We train two separate genre classifiers, one based on audio and one based on visual features, and thus obtain two probability vectors for the visual and audio inputs respectively. To obtain the final genre distribution for a video shot, we simply take the mean of the two probability vectors. This probabilistic genre distribution is our mid-level representation and serves as the input to the actual interestingness classifier. A binary Support Vector Machine (SVM) classifier is then trained on these features to predict, with a confidence score, whether a shot/image is considered interesting or not. For the video subtask, we also performed experiments using only the visual information of the video shots; for these we used the genre prediction model based on the VGG features extracted from the video key-frames. To evaluate the performance of our interestingness model, we tested several SVM kernels (linear, RBF and sigmoid) with different parameters on the development dataset. Grid searches over the kernel parameters tended to yield models that classify almost all samples as non-interesting, which may be due to the imbalanced labels of the training data. Hence, we opted for a weighted version of SVM classification in which the minority class receives a higher misclassification penalty. We also take the confidence scores of the development set samples into account during training, giving a larger penalty to samples with high confidence scores and a smaller penalty to samples with low confidence scores.</p>
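      <p>A minimal sketch of this fusion and weighting scheme, using scikit-learn's SVC: the genre distributions, labels and confidence scores below are synthetic stand-ins for the challenge data, and the toy labeling rule is purely illustrative.</p>
      <preformat>
```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200
visual_probs = rng.dirichlet(np.ones(5), size=n)   # per-shot visual genre distribution
audio_probs = rng.dirichlet(np.ones(5), size=n)    # per-shot acoustic genre distribution
features = (visual_probs + audio_probs) / 2.0      # late fusion by averaging

# Imbalanced labels, as in the challenge data (few "interesting" shots);
# the thresholding rule is a made-up substitute for human annotations.
labels = (features[:, 2] > 0.3).astype(int)
confidence = rng.uniform(0.5, 1.0, size=n)         # annotator confidence scores

# class_weight='balanced' raises the misclassification penalty of the
# minority class; sample_weight penalizes errors on high-confidence
# samples more than on low-confidence ones.
clf = SVC(kernel="sigmoid", gamma=0.5, C=100,
          class_weight="balanced", probability=True)
clf.fit(features, labels, sample_weight=confidence)

scores = clf.predict_proba(features)[:, 1]         # interestingness confidence per shot
print(scores.shape)
```
      </preformat>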
    </sec>
    <sec id="sec-5">
      <title>EXPERIMENTS AND RESULTS</title>
      <p>The evaluation results of our models on the test data provided by the organizers are shown in Table 1. We submitted two runs for the image subtask and five for the video subtask. The table reports the MAP and MAP@10 scores, as returned by the task organizers, for our various model configurations.</p>
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Official evaluation results on test data.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Task</th>
              <th>Run</th>
              <th>Classifier</th>
              <th>MAP</th>
              <th>MAP@10</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>Image</td><td>1</td><td>SVM, sigmoid kernel</td><td>0.2029</td><td>0.0587</td></tr>
            <tr><td>Image</td><td>2</td><td>SVM, linear kernel</td><td>0.2016</td><td>0.0579</td></tr>
            <tr><td>Video</td><td>1</td><td>SVM, sigmoid kernel (gamma=0.5, C=100)</td><td>0.2034</td><td/></tr>
            <tr><td>Video</td><td>2</td><td>SVM, polynomial kernel (degree=3)</td><td>0.1960</td><td/></tr>
            <tr><td>Video</td><td>3</td><td>SVM, polynomial kernel (degree=2)</td><td>0.1964</td><td/></tr>
            <tr><td>Video</td><td>4</td><td>SVM, sigmoid kernel (gamma=0.2, C=100)</td><td>0.2094</td><td/></tr>
            <tr><td>Video</td><td>5</td><td>SVM, sigmoid kernel (gamma=0.3, C=100)</td><td>0.2002</td><td/></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        For the image subtask, the MAP values are quite similar for the linear and sigmoid SVM kernels. For the video subtask, decent MAP values are already achieved with visual key-frame classification alone (runs 2 and 3). Using both modalities (runs 1, 4 and 5), i.e., averaging the audio and visual genre predictions, yields a slight performance gain. The improvement is larger for the MAP@10 scores, where employing both modalities clearly outperforms pure key-frame classification. Overall, an SVM with a sigmoid kernel appears more effective for the audio-visual submission than a linear or polynomial kernel. So far we have only considered SVM models in our experiments; further improvements could come from trying different models, as has been done in related work [
        <xref ref-type="bibr" rid="ref10 ref13 ref15">10, 13, 15</xref>
        ]. It would also be interesting to apply genre prediction to multiple or all frames of a shot instead of a single key-frame. In general, we have shown that our approach is capable of making useful scene suggestions, even if we do not consider it ready for commercial use yet.
      </p>
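      <p>For reference, the MAP and MAP@10 metrics reported above can be computed as follows: average precision (AP) per query from a ranked list of binary interestingness labels, then the mean over queries. The ranked relevance lists below are made up for illustration.</p>
      <preformat>
```python
def average_precision(ranked_relevance, cutoff=None):
    """AP of a ranked list of 0/1 relevance labels; cutoff gives AP@k."""
    if cutoff is not None:
        ranked_relevance = ranked_relevance[:cutoff]
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # precision at each relevant rank
    return precision_sum / hits if hits else 0.0

# Two toy "trailers": ranked shots with ground-truth interestingness labels
runs = [[1, 0, 1, 0], [0, 1, 1]]
aps = [average_precision(r) for r in runs]
print(round(sum(aps) / len(aps), 4))           # prints 0.7083 (the MAP)
```
      </preformat>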
    </sec>
    <sec id="sec-6">
      <title>CONCLUSION</title>
      <p>In this paper, we presented a framework for predicting image and video interestingness that uses a genre recognition system as a mid-level representation of the data. Our best results on the test set were MAP scores of 0.2029 and 0.2094 for the image and video subtasks respectively. The obtained results are promising, especially for the video subtask. Future work includes the joint learning of audio-visual features and the integration of temporal information to describe the evolution of audio-visual features over video frames.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>The research leading to this paper was partially supported by
Bpifrance within the NexGenTV Project (F1504054U). The Titan
Xp used for this research was donated by the NVIDIA Corporation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. 2016. SoundNet: Learning Sound Representations from Unlabeled Video. In Proceedings of Advances in Neural Information Processing Systems. 892-900.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Hanli Wang, Ngoc Q. K. Duong, Frédéric Lefebvre, and others. 2017. Media Interestingness at MediaEval 2017. In Proceedings of the MediaEval 2017 Workshop, Dublin, Ireland, September 13-15, 2017.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Yanwei Fu, Timothy M. Hospedales, Tao Xiang, Shaogang Gong, and Yuan Yao. 2014. Interestingness Prediction by Robust Learning to Rank. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, September 6-12, 2014. 488-503.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool. 2013. The Interestingness of Images. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, December 1-8, 2013. 1633-1640.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Michael Gygli, Helmut Grabner, and Luc Van Gool. 2015. Video Summarization by Learning Submodular Mixtures of Objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Santiago, Chile, December 11-18, 2015. 3090-3098.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Michael Gygli and Mohammad Soleymani. 2016. Analyzing and Predicting GIF Interestingness. In Proceedings of ACM Multimedia, Amsterdam, The Netherlands, October 15-19, 2016. New York, NY, USA, 122-126.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Yu-Gang Jiang, Yanran Wang, Rui Feng, Xiangyang Xue, Yingbin Zheng, and Hanfang Yang. 2013. Understanding and Predicting Interestingness of Videos. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, Bellevue, Washington, July 14-18, 2013.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Yang Liu, Zhonglei Gu, Yiu-ming Cheung, and Kien A. Hua. 2017. Multi-view Manifold Learning for Media Interestingness Prediction. In Proceedings of the ACM International Conference on Multimedia Retrieval, Bucharest, Romania, June 6-9, 2017. New York, NY, USA, 308-314.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Soheil Rayatdoost and Mohammad Soleymani. 2016. Ranking Images and Videos on Visual Interestingness by Visual Sentiment Features. In Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, 2016.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] G. S. Simoes, J. Wehrmann, R. C. Barros, and D. D. Ruiz. 2016. Movie Genre Classification with Convolutional Neural Networks. In 2016 International Joint Conference on Neural Networks (IJCNN). 259-266.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. Technical report, CoRR abs/1409.1556 (2014).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] K. S. Sivaraman and Gautam Somappa. 2016. MovieScope: Movie Trailer Classification Using Deep Neural Networks. University of Virginia (2016).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] John R. Smith, Dhiraj Joshi, Benoit Huet, Winston Hsu, and Jozef Cota. 2017. Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer Creation. In Proceedings of ACM Multimedia, October 23-27, 2017, Mountain View, CA, USA.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Mohammad Soleymani. 2015. The Quest for Visual Interest. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, October 26-30, 2015. New York, NY, USA, 919-922.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Sejong Yoon and Vladimir Pavlovic. 2014. Sentiment Flow for Video Interestingness Prediction. In Proceedings of the 1st ACM International Workshop on Human Centered Event Understanding from Multimedia (HuEvent '14). New York, NY, USA, 29-34.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>