<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LAPI at MediaEval 2017 - Predicting Media Interestingness</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mihai Gabriel Constantin</string-name>
          <email>mgconstantin@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Boteanu</string-name>
          <email>bboteanu@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <email>bionescu@imag.pub.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LAPI, University "Politehnica" Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper presents our contribution, approach and results for the MediaEval 2017 Predicting Media Interestingness task. We studied several visual descriptors and combined them through early and late fusion approaches in our machine learning system, optimized for the best results in this benchmarking competition.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Multimedia interestingness has been studied increasingly in
recent years, from several perspectives including
psychology and computer vision. From a psychological perspective, user
studies have described a correlation between human interest and several
other concepts including, but not limited to, aesthetics, enjoyment,
complexity and novelty [
        <xref ref-type="bibr" rid="ref1 ref8">1, 8</xref>
        ], while computer vision approaches
have studied various sets of features and machine learning techniques that
are able to predict the interestingness of multimedia shots, based
on low-level attributes such as color histograms, SIFT or edge
distributions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or high-level attributes such as composition rules or the
presence of certain objects [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        The MediaEval 2017 Predicting Media Interestingness task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
creates a benchmarking competition where participants are tasked
with building a system that can predict the interestingness
of images and video segments annotated by a team of viewers,
according to a Video on Demand scenario where a set of the most
interesting frames or video shots has to be presented to a certain
user. This paper describes our approach to this task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>
        The approach presented in this paper is a continuation of our work
described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], with the addition of a video interestingness
prediction system. The first step in our machine learning system is
the extraction of the content descriptors, followed by the learning
stage for these content descriptors and their early and late fusion
combinations executed on the annotated development dataset. In
the final stage we evaluate the best performing combinations on
the unlabeled testing dataset. The features used here are presented,
along with a detailed description, in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and are based on the works
of [
        <xref ref-type="bibr" rid="ref10 ref11 ref5 ref9">5, 9–11</xref>
        ]. These features have been used in several domains
connected with interestingness, such as aesthetics, photographic
compositional rules and color theory. For the machine learning
algorithm we used Support Vector Machines (SVM) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with different
parameters and kernels.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Features</title>
      <p>
        The features used in this system are as follows: Hue, Saturation
and Value computed from the HSV space (denoted HSV), Hue, Saturation
and Lightness extracted from the HSL space (HSL), Colorfulness [
        <xref ref-type="bibr" rid="ref5 ref9">5, 9</xref>
        ],
Hue descriptors (HueDesc) [
        <xref ref-type="bibr" rid="ref11 ref9">9, 11</xref>
        ], Hue models (HueModel) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
Brightness [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], Edge [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9–11</xref>
        ], Texture [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], RGB entropy
(RGBEntropy) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], HSV wavelet (HSVwavelet) and average value for
the HSV wavelet (aHSVwavelet) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], average HSV values based
on the Rule of Thirds (aHSVRot) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], average HSL values for the
focus region (aHSLFocus) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], size analysis for the largest five
segments (LargSegm) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], centroid placement (Centroids) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Hue,
Saturation, Value and Brightness for the largest segments
(HueSegm, SatSegm, ValSegm, BrightSegm) [
        <xref ref-type="bibr" rid="ref11 ref5">5, 11</xref>
        ], color model for the
largest segments (ColorSegm) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], coordinates of the segments
(CoordSegm) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], mass variance, skewness and contrast between the
segments (MassVarSegm, SkewSegm, ContrastSegm) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and, finally,
a depth of field indicator (DoF) calculated according to the method
presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>While for the image subtask each image generated one set of the
presented descriptors, for the video subtask we generated two sets
of descriptors for each individual segment. These two sets
of descriptors were generated by extracting the feature set for each
frame and then calculating the average value and the median value
over all the frames in a video segment.</p>
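      <p>As a minimal sketch of this aggregation step (assuming a hypothetical extract_features routine that maps one frame to a one-dimensional descriptor vector, standing in for the descriptors listed above):</p>
      <preformat>
import numpy as np

def aggregate_video_descriptors(frames, extract_features):
    """Build the two per-segment descriptor sets (AVG and MED).

    frames: list of decoded frames from one video segment.
    extract_features: hypothetical routine returning a 1-D feature
    vector per frame.
    """
    per_frame = np.stack([extract_features(f) for f in frames])
    avg_descriptor = per_frame.mean(axis=0)        # AVG descriptor set
    med_descriptor = np.median(per_frame, axis=0)  # MED descriptor set
    return avg_descriptor, med_descriptor
      </preformat>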
    </sec>
    <sec id="sec-4">
      <title>Data fusion</title>
      <p>In both subtasks we used early and late fusion techniques to
maximize our final results. Early fusion consisted of
concatenating several features and using the newly created
feature vector as input for a new training run, while for the late
fusion approach we used the confidence output values of several
runs and combined them according to several strategies, thus generating new
confidence outputs.</p>
      <p>For the late fusion trials we used four strategies: CombMax and
CombMin, where we took the maximum and, respectively, the minimum confidence
value for each media sample and used them as new outputs;
CombSum, where we added up the individual confidence values of the
runs; and CombMean, where the added confidence values were also
multiplied with weights distributed according to the rank of the
initial system. This weight was calculated as w = 1/2<sup>r</sup>, where the
rank r had the value 0 for the best component output classifier, 1
for the second and so on.</p>
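      <p>A minimal sketch of these four strategies, assuming the component confidence values for one media sample are already sorted by devset rank (index 0 holds the best classifier, so r = 0), could look as follows:</p>
      <preformat>
import numpy as np

def late_fusion(confidences, strategy):
    """Fuse per-run confidence scores for a single media sample.

    confidences: values ordered by component rank, best first (r = 0).
    """
    scores = np.asarray(confidences, dtype=float)
    if strategy == "CombMax":
        return float(scores.max())
    if strategy == "CombMin":
        return float(scores.min())
    if strategy == "CombSum":
        return float(scores.sum())
    if strategy == "CombMean":
        # weight w = 1/2**r for the component of rank r
        weights = np.array([1.0 / 2**r for r in range(len(scores))])
        return float(np.dot(weights, scores))
    raise ValueError("unknown strategy: " + strategy)
      </preformat>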
    </sec>
    <sec id="sec-5">
      <title>Learning system</title>
      <p>
        The learning system we used was SVM, implemented with the
LibSVM library [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], with linear, polynomial and RBF kernels. For
the degree, gamma and cost coefficients we used combinations
of values 2<sup>k</sup>, where k ∈ {−6, ..., 6}.
      </p>
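      <p>As an illustrative sketch of this grid search (using scikit-learn's LibSVM-backed SVC in place of the LibSVM command-line tools, and restricting the polynomial degree to a few integer values, since fractional degrees are not meaningful there):</p>
      <preformat>
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# 2**k for k in {-6, ..., 6}
grid = [2.0**k for k in range(-6, 7)]

param_grid = [
    {"kernel": ["linear"], "C": grid},
    {"kernel": ["rbf"], "C": grid, "gamma": grid},
    {"kernel": ["poly"], "C": grid, "gamma": grid, "degree": [1, 2, 4]},
]

search = GridSearchCV(SVC(), param_grid, cv=10)  # 10-fold cross-validation
# search.fit(X_dev, y_dev)  # X_dev, y_dev: devset descriptors and labels
      </preformat>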
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTAL RESULTS</title>
      <p>
        As presented in the task overview paper [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the development
dataset consisted of 7396 frames for the image subtask and 7396
video segments for the video subtask, while the test dataset had
2435 frames for the image subtask and 2435 video segments for the
video subtask. The official metric was mean average precision at
10 (MAP@10), and the organisers also calculated the mean
average precision (MAP) for each submitted run. A large number of
experiments with different early and late fusion strategies and with
different SVM systems were carried out, and the best performing
combinations were run on the testset in the final phase.
      </p>
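      <p>For reference, one common formulation of MAP@10 is sketched below (the organisers' exact normalization may differ); each ranked list holds the binary interestingness labels of one video's shots, sorted by decreasing confidence:</p>
      <preformat>
import numpy as np

def average_precision_at_k(ranked_labels, k=10):
    """AP@k for one list of binary labels sorted by confidence."""
    labels = np.asarray(ranked_labels[:k], dtype=float)
    if labels.sum() == 0.0:
        return 0.0
    precision = np.cumsum(labels) / (np.arange(len(labels)) + 1)
    return float((precision * labels).sum() / labels.sum())

def map_at_10(rankings):
    """MAP@10: mean AP@10 over all ranked lists (one per video)."""
    return float(np.mean([average_precision_at_k(r) for r in rankings]))
      </preformat>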
    </sec>
    <sec id="sec-7">
      <title>Experiments on the devset</title>
      <p>Our SVM training system used a 10-fold cross-validation approach
for choosing the best SVM-feature set combination. Generally,
taking into account the MAP@10 metric, the best performing SVM
kernel was the RBF kernel. Another general observation is that
the late fusion approaches, especially CombMax and CombMean,
outperformed the early fusion combinations, while early fusion
outperformed learning systems with single descriptors. On the other
hand, the CombMin and CombSum strategies performed worse than
their components in many combinations. Regarding the two
descriptor sets for the video subtask (average and median), the results
were mixed, with some early fusion or single descriptors performing
better with the median approach, while others performed better
when we used the average calculation.</p>
      <p>The interestingness confidence scores for each shot, used for the
MAP@10 calculation, were extracted as the margin to the decision
hyperplane.</p>
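      <p>With the LibSVM-backed SVC from scikit-learn, this corresponds to reading the signed decision values; a self-contained sketch on synthetic data (real runs would use the devset descriptors instead):</p>
      <preformat>
import numpy as np
from sklearn.svm import SVC

# Synthetic stand-ins for the devset/testset descriptor matrices
rng = np.random.default_rng(0)
X_dev = rng.normal(size=(100, 8))
y_dev = rng.integers(0, 2, size=100)   # binary interestingness labels
X_test = rng.normal(size=(10, 8))

clf = SVC(kernel="rbf").fit(X_dev, y_dev)
# Signed margin to the decision hyperplane, used as the
# interestingness confidence score when ranking the shots
confidence = clf.decision_function(X_test)
      </preformat>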
      <p>Table 1 shows the best results registered on both the image and
the video subtasks; as mentioned earlier, the best results were
achieved by the late fusion approaches. For the video subtask we
used the notation AVG for features obtained using the average
and MED for features obtained using the median. All the
components in Table 1 were trained using the best performing SVM
RBF kernel.</p>
      <p>For the image subtask the best result on the devset was obtained
with a CombMax strategy combining the early fusion outputs of
HSV + HSL + aHSLFocus, aHSVRot + aHSLFocus and HSV +
MassVarSegm + LargSegm, with a MAP@10 score of
0.0821. For the video subtask the best result was a CombMax
strategy containing the LargSegmMED + ValSegmMED and TextureMED +
MassVarSegmMED early fusion outputs, with a MAP@10 score of
0.0753.</p>
    </sec>
    <sec id="sec-8">
      <title>Official results on the testset</title>
      <p>For the final submission we trained the systems on the entire
devset, using the optimal parameters that we found in the previous
experiments and tested the resulting systems on the testset.</p>
      <p>Table 1 also presents the official results on the testset for
the combinations we submitted, as returned by the task organisers,
with the MAP and MAP@10 scores for each of the runs. For the
image subtask we obtained a best MAP@10 score of 0.0555
by using a CombMean strategy with the outputs of aHSVRot +
aHSLFocus and HSV + MassVarSegm + LargSegm. The same system
also had the best MAP score, 0.1873. For the video subtask it
was again a single system that achieved both the best MAP@10 and the best
MAP score: a CombMean strategy using the early fusion outputs
of LargSegmMED + ValSegmMED, TextureMED +
MassVarSegmMED and EdgeAVG + TextureAVG, with a MAP@10 value of
0.0732 and a MAP value of 0.2028.</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSIONS</title>
      <p>In this paper we presented several systems that predict media
interestingness using content descriptors and early and late fusion
approaches. We tested these systems on the MediaEval 2017
Predicting Media Interestingness task, and our best testset results were MAP@10
scores of 0.0555 for the image subtask and 0.0732 for the video subtask.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGMENTS</title>
      <p>Part of this work was funded by UEFISCDI under research grant
PNIII-P2-2.1-PED-2016-1065, agreement 30PED/2017, project
SPOTTER.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Daniel E.</given-names>
            <surname>Berlyne</surname>
          </string-name>
          .
          <year>1960</year>
          .
          <source>Conflict, arousal, and curiosity</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Chih-Chung</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chih-Jen</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>LIBSVM: a library for support vector machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology (TIST)</source>
          <volume>2</volume>
          ,
          <issue>3</issue>
          (
          <year>2011</year>
          ),
          <fpage>27</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Mihai Gabriel</given-names>
            <surname>Constantin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Content Description for Predicting Image Interestingness</article-title>
          .
          <source>In International Symposium on Signals, Circuits and Systems - ISSCS 2017</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Corinna</given-names>
            <surname>Cortes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vladimir</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine Learning</source>
          <volume>20</volume>
          ,
          <issue>3</issue>
          (
          <year>1995</year>
          ),
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ritendra</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dhiraj</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jia</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>James Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Studying aesthetics in photographic images using a computational approach</article-title>
          .
          <source>In European Conference on Computer Vision</source>
          . Springer,
          <fpage>288</fpage>
          -
          <lpage>301</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thanh-Toan</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Gygli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ngoc Q. K.</given-names>
            <surname>Duong</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>MediaEval 2017 Predicting Media Interestingness Task</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2017 Workshop</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sagnik</given-names>
            <surname>Dhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Vicente</given-names>
            <surname>Ordonez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tamara L.</given-names>
            <surname>Berg</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>High level describable attributes for predicting aesthetics and interestingness</article-title>
          .
          <source>In Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <source>2011 IEEE Conference on. IEEE</source>
          ,
          <fpage>1657</fpage>
          -
          <lpage>1664</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Gygli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Helmut</given-names>
            <surname>Grabner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hayko</given-names>
            <surname>Riemenschneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Nater</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luc</given-names>
            <surname>Van Gool</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>The interestingness of images</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          .
          <fpage>1633</fpage>
          -
          <lpage>1640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Andreas F.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marine</given-names>
            <surname>Guibert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Anja</given-names>
            <surname>Foerschner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sandi</given-names>
            <surname>Calhoun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Emma</given-names>
            <surname>George</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Hatay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Elizabeth</given-names>
            <surname>Dinsdale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stuart A.</given-names>
            <surname>Sandin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jennifer E.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mark J. A.</given-names>
            <surname>Vermeij</surname>
          </string-name>
          , and others.
          <year>2015</year>
          .
          <article-title>Can we measure beauty? Computational evaluation of coral reef aesthetics</article-title>
          .
          <source>PeerJ</source>
          <volume>3</volume>
          (
          <year>2015</year>
          ),
          <fpage>e1390</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Yan</surname>
            <given-names>Ke</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaoou</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Feng</given-names>
            <surname>Jing</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>The design of high-level features for photo quality assessment</article-title>
          .
          <source>In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on</source>
          , Vol.
          <volume>1</volume>
          . IEEE,
          <fpage>419</fpage>
          -
          <lpage>426</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Congcong</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tsuhan</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Aesthetic visual quality assessment of paintings</article-title>
          .
          <source>IEEE Journal of Selected Topics in Signal Processing</source>
          <volume>3</volume>
          ,
          <issue>2</issue>
          (
          <year>2009</year>
          ),
          <fpage>236</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>