<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DA-IICT at MediaEval 2017: Objective prediction of media interestingness</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rashi Gupta</string-name>
          <email>rashi.8496@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manish Narwaria</string-name>
          <email>manish_narwaria@daiict.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dhirubhai Ambani Institute of Information and Communication Technology</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Interestingness is defined as the power of engaging and holding curiosity. While humans can almost effortlessly rank and judge the interestingness of a scene, automated prediction of interestingness for an arbitrary scene is a challenging problem. In this work, we attempt to develop a computational model for this problem. Our approach is based on identifying and extracting context-specific features from video clips. These features are subsequently used in a predictor model to provide continuous scores that can be related to the interestingness of the scene in question. Such computational models can be useful in the automated analysis of videos (e.g., a movie, CCTV footage or a clip from an advertisement).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The aim of the task is to select the content (images and video clips)
considered most interesting to a common viewer. This is challenging
because the interestingness of media is highly subjective and can
depend on multiple aspects, including personal preferences, emotional
state and the content itself. Therefore, as a first step, our goal in
this task is to understand and extract signal-related features which
may, for instance, quantify visual appearance and audio information.
Such features can then be mapped to an interestingness score via
machine learning. Further details about the task and dataset can be
found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The task falls under the broad areas of multimedia signal
processing (image, video and audio) and machine learning. The
former focuses on the analysis and extraction of context-specific
features from the signal, such as color, contrast, complexity and
audio characteristics. The primary goal of feature extraction is to
obtain a signal representation that captures information relevant to
media interestingness. These features are subsequently used as input
to a regressor (e.g., linear regression or a multilayer perceptron).
As the target value of the regression is known (the interestingness
score given by a panel of human subjects), this is a supervised
learning problem.</p>
      <p>
        We note that a similar approach has been used in previous works
such as [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, the key difference lies in the features
used, and this is one of the contributions of this work. The results
also shed light on some aspects of interestingness that may not be
fully captured by the current set of features. These observations can
be used to improve feature extraction and, in the process, to predict
objective media interestingness scores that are closer to human
judgments.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>Interestingness Features Computation</title>
      <p>The following features are extracted for the image subtask:
colorfulness, contrast, complexity and visual attention. For the
video subtask, an audio feature based on Mel-frequency cepstral
coefficients (MFCCs) is computed in addition to these, to take the
audio of the clip into account.</p>
      <p>
        Colorfulness: We measure colorfulness as proposed by [
        <xref ref-type="bibr" rid="ref6">6</xref>
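        ].
The red-green and yellow-blue opponent color spaces are used, with
<inline-formula><tex-math>\alpha = R - G</tex-math></inline-formula> and
<inline-formula><tex-math>\beta = 0.5(R + G) - B</tex-math></inline-formula>.
The quantities <inline-formula><tex-math>\sigma_\alpha^2</tex-math></inline-formula>,
<inline-formula><tex-math>\sigma_\beta^2</tex-math></inline-formula>,
<inline-formula><tex-math>\mu_\alpha</tex-math></inline-formula> and
<inline-formula><tex-math>\mu_\beta</tex-math></inline-formula> represent the variance
and mean values along these two opponent color axes, defined (for
<inline-formula><tex-math>\alpha</tex-math></inline-formula>, and analogously for
<inline-formula><tex-math>\beta</tex-math></inline-formula>) as:
<disp-formula><tex-math>\mu_\alpha = \frac{1}{N} \sum_{p=1}^{N} \alpha_p, \qquad \sigma_\alpha^2 = \frac{1}{N} \sum_{p=1}^{N} \left( \alpha_p^2 - \mu_\alpha^2 \right)</tex-math></disp-formula>
where <inline-formula><tex-math>N</tex-math></inline-formula> is the number of pixels.
The colorfulness measure combines, for each opponent component, the
ratio of the variance to the average chrominance:
<disp-formula><tex-math>\mathrm{colorfulness} = 0.02 \times \log\left( \frac{\sigma_\alpha^2}{|\mu_\alpha|^{0.2}} \right) \times \log\left( \frac{\sigma_\beta^2}{|\mu_\beta|^{0.2}} \right)</tex-math></disp-formula>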
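      </p>
      <p>The colorfulness formula above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name, the representation of the image as a list of (R, G, B) tuples and the use of the natural logarithm are assumptions made for illustration only.</p>
      <preformat>
```python
import math


def colorfulness(pixels):
    """Colorfulness of an image given as a list of (R, G, B) tuples."""
    # Opponent color axes: red-green (alpha) and yellow-blue (beta).
    alpha = [r - g for r, g, b in pixels]
    beta = [0.5 * (r + g) - b for r, g, b in pixels]

    def mean_and_variance(values):
        n = len(values)
        mu = sum(values) / n
        # sigma^2 = (1/N) * sum(v^2 - mu^2), following the formula in the text.
        var = sum(v * v - mu * mu for v in values) / n
        return mu, var

    mu_a, var_a = mean_and_variance(alpha)
    mu_b, var_b = mean_and_variance(beta)
    # Assumption: natural logarithm; the text does not state the base.
    # The measure is undefined when a mean or a variance is zero.
    return (0.02
            * math.log(var_a / abs(mu_a) ** 0.2)
            * math.log(var_b / abs(mu_b) ** 0.2))
```
      </preformat>
      <p>In this sketch an image with no variation along one opponent axis raises an error rather than returning a score, so a practical implementation would need to decide how to handle such degenerate cases.</p>
      <p>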
Contrast: We measure contrast as proposed by [
        <xref ref-type="bibr" rid="ref5">5</xref>
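        ]. The main idea
is to compute local contrast factors at various resolutions and then
to build a weighted average in order to obtain the global contrast
factor. Let <inline-formula><tex-math>k</tex-math></inline-formula> denote the
original pixel value, with <inline-formula><tex-math>k = 0, 1, \ldots, 255</tex-math></inline-formula>.
The first step is to apply gamma correction with
<inline-formula><tex-math>\gamma = 2.2</tex-math></inline-formula> and scale the
input values to the [0, 1] range. The corrected value, the linear
luminance, is
<disp-formula><tex-math>l = \left( \frac{k}{255} \right)^{\gamma}</tex-math></disp-formula>
The perceptual luminance <inline-formula><tex-math>L</tex-math></inline-formula> can be
approximated with the square root of the linear luminance:
<disp-formula><tex-math>L = 100 \times \sqrt{l}</tex-math></disp-formula>
Once the perceptual luminances are computed, the local contrast is
computed. For each pixel <inline-formula><tex-math>i</tex-math></inline-formula> we
compute the average difference of <inline-formula><tex-math>L</tex-math></inline-formula>
between the pixel and its four neighboring pixels:
<disp-formula><tex-math>lc_i = \frac{|L_i - L_{i-1}| + |L_i - L_{i+1}| + |L_i - L_{i-w}| + |L_i - L_{i+w}|}{4}</tex-math></disp-formula>
      </p>
      <p>The average local contrast for the current resolution,
<inline-formula><tex-math>C_i</tex-math></inline-formula>, is computed as the average
of the local contrasts over the whole image, where the image is
<inline-formula><tex-math>w</tex-math></inline-formula> pixels wide and
<inline-formula><tex-math>h</tex-math></inline-formula> pixels high. Writing
<inline-formula><tex-math>j</tex-math></inline-formula> for the pixel index, to keep it
distinct from the resolution index <inline-formula><tex-math>i</tex-math></inline-formula>:
<disp-formula><tex-math>C_i = \frac{1}{w \times h} \sum_{j=1}^{w \times h} lc_j</tex-math></disp-formula>
<inline-formula><tex-math>C_i</tex-math></inline-formula> has to be computed for various
resolutions. Once <inline-formula><tex-math>C_i</tex-math></inline-formula> for the
original image is computed, we build a lower-resolution image by
combining four original pixels into one superpixel, so that the image
width and height are each half of the original. The
<inline-formula><tex-math>C_i</tex-math></inline-formula> for this resolution is
computed in the same way, and the process continues until only a few
large superpixels remain in the image. Now that we have computed the
average local contrasts <inline-formula><tex-math>C_i</tex-math></inline-formula>,
we can compute the global contrast factor.</p>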
    </sec>
    <sec id="sec-4">
      <title>Analysis</title>
      <p>
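        The global contrast factor (GCF) is the weighted sum of the average
local contrasts over the <inline-formula><tex-math>N</tex-math></inline-formula>
resolutions, where <inline-formula><tex-math>w_i</tex-math></inline-formula> is the
weight of resolution <inline-formula><tex-math>i</tex-math></inline-formula>:
<disp-formula><tex-math>\mathrm{GCF} = \sum_{i=1}^{N} w_i \times C_i</tex-math></disp-formula>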
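      </p>
      <p>The multi-resolution contrast computation can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: all function names are hypothetical, border pixels are averaged over the neighbors that exist, superpixels are formed by averaging perceptual luminance, the loop stops when either image dimension drops below two pixels, and uniform weights are used when none are supplied (the cited reference defines its own weights).</p>
      <preformat>
```python
def perceptual_luminance(k, gamma=2.2):
    """Map an 8-bit pixel value k to L = 100 * sqrt((k / 255) ** gamma)."""
    return 100.0 * ((k / 255.0) ** gamma) ** 0.5


def average_local_contrast(lum):
    """Mean over all pixels of the average absolute difference to the 4-neighbors."""
    h, w = len(lum), len(lum[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            # Assumption: border pixels use only the neighbors that exist.
            diffs = [abs(lum[y][x] - lum[ny][nx])
                     for ny, nx in ((y, x - 1), (y, x + 1), (y - 1, x), (y + 1, x))
                     if ny in range(h) and nx in range(w)]
            total += sum(diffs) / len(diffs) if diffs else 0.0
    return total / (w * h)


def halve(lum):
    """Combine each 2x2 block of pixels into one superpixel (assumed: plain average)."""
    h, w = len(lum) // 2, len(lum[0]) // 2
    return [[(lum[2 * y][2 * x] + lum[2 * y][2 * x + 1]
              + lum[2 * y + 1][2 * x] + lum[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]


def global_contrast_factor(image, weights=None):
    """GCF of a gray-scale image given as a list of rows of 8-bit values."""
    lum = [[perceptual_luminance(k) for k in row] for row in image]
    contrasts = []
    # Assumed stopping rule: continue while the image is at least 2x2.
    while len(lum) >= 2 and len(lum[0]) >= 2:
        contrasts.append(average_local_contrast(lum))
        lum = halve(lum)
    if not contrasts:
        return 0.0
    if weights is None:
        # Assumption: uniform weights stand in for the reference's weights.
        weights = [1.0 / len(contrasts)] * len(contrasts)
    return sum(w * c for w, c in zip(weights, contrasts))
```
      </preformat>
      <p>Passing the weights as a parameter keeps the sketch independent of any particular weighting scheme, so the weights of the cited reference can be supplied without changing the rest of the code.</p>
      <p>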
Complexity: We measure complexity by calculating the spatial
information (SI) as proposed by [
        <xref ref-type="bibr" rid="ref7">7</xref>
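        ]. Let <inline-formula><tex-math>s_h</tex-math></inline-formula> and
<inline-formula><tex-math>s_v</tex-math></inline-formula> denote the gray-scale
image filtered with horizontal and vertical Sobel kernels,
respectively. Then
<disp-formula><tex-math>SI_r = \sqrt{s_h^2 + s_v^2}</tex-math></disp-formula>
represents the magnitude of spatial information at every pixel. The
mean and standard deviation of <inline-formula><tex-math>SI_r</tex-math></inline-formula>
are used to measure the complexity of an image:
<disp-formula><tex-math>SI_{mean} = \frac{1}{P} \sum SI_r, \qquad SI_{stdev} = \sqrt{\frac{1}{P} \sum \left( SI_r^2 - SI_{mean}^2 \right)}</tex-math></disp-formula>
where <inline-formula><tex-math>P</tex-math></inline-formula> is the number of pixels in the image.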
      </p>
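      <p>As an illustration of the spatial information features, the following sketch applies the two 3x3 Sobel kernels to a gray-scale image stored as a list of rows. The function name is hypothetical, and restricting the computation to interior pixels (so that no border padding is needed) is an assumption rather than a detail taken from the text.</p>
      <preformat>
```python
import math

# 3x3 Sobel kernels; one responds to changes along x, the other along y.
SOBEL_H = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_V = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def spatial_information(gray):
    """Return (SI_mean, SI_stdev) for a gray-scale image of at least 3x3 pixels."""
    h, w = len(gray), len(gray[0])
    si = []
    # Assumption: only interior pixels are used, so no padding is required.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sh = sum(SOBEL_H[j][i] * gray[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            sv = sum(SOBEL_V[j][i] * gray[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            si.append(math.sqrt(sh * sh + sv * sv))
    p = len(si)
    si_mean = sum(si) / p
    si_var = sum(v * v - si_mean * si_mean for v in si) / p
    # max() guards against a tiny negative value caused by round-off.
    return si_mean, math.sqrt(max(si_var, 0.0))
```
      </preformat>
      <p>A constant image yields a mean and standard deviation of zero, whereas an image with edges yields positive values, which matches the intent of using SI as a complexity measure.</p>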
      <p>
        Visual attention: We calculate the attention of an image by
computing a saliency map for that image. The implementation of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is used for saliency map computation.
The mean of this saliency map over all pixels is the attention value.
      </p>
      <p>
Audio extraction: For the audio features, Mel-frequency cepstral
coefficients (MFCCs) are computed, and their mean and standard
deviation over a time frame are calculated. These statistics form the
audio feature vector.
      </p>
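      <p>The pooling step for the audio features can be sketched as below. The sketch assumes that the MFCCs have already been extracted by an audio library and are available as a list of frames, each holding the same number of coefficients. Taking the mean and standard deviation of each coefficient across frames is one plausible reading of the description, not a confirmed detail, and the function name is hypothetical.</p>
      <preformat>
```python
import math


def mfcc_statistics(mfcc_frames):
    """Pool per-frame MFCC vectors into one feature vector [means..., stdevs...]."""
    n_frames = len(mfcc_frames)
    n_coeffs = len(mfcc_frames[0])
    means, stdevs = [], []
    for c in range(n_coeffs):
        # Values of coefficient c across all frames (assumed pooling axis).
        column = [frame[c] for frame in mfcc_frames]
        mu = sum(column) / n_frames
        var = sum((v - mu) ** 2 for v in column) / n_frames  # population variance
        means.append(mu)
        stdevs.append(math.sqrt(var))
    return means + stdevs
```
      </preformat>
      <p>With this pooling, the length of the audio feature vector is twice the number of MFCC coefficients, regardless of the duration of the clip.</p>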
      <p>Novelty: We propose a method to calculate novelty by first
computing saliency maps for the images. An 8 × 8 average filter is
convolved with each of two consecutive saliency maps to obtain block
means. If the average is below the threshold (0.1) for both
corresponding blocks, the block is skipped. Otherwise, the mean
squared error (MSE) between the two consecutive saliency maps is
calculated for that block. If the MSE is below the threshold, the
block is also ignored; otherwise it is retained. Finally, the mean of
the retained MSE values is calculated. The higher the value, the more
action has taken place between the two consecutive frames.</p>
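      <p>The novelty computation can be sketched as follows, given two consecutive saliency maps as lists of rows with values in the [0, 1] range. Several details are assumptions made for illustration: the blocks are taken as non-overlapping, the MSE is computed per block, the threshold applied to the MSE is exposed as a separate parameter whose default simply reuses 0.1, and a novelty of zero is returned when no block is retained. The function names are hypothetical.</p>
      <preformat>
```python
def block_means(sal, size=8):
    """Mean of each non-overlapping size x size block, keyed by its top-left corner."""
    h, w = len(sal), len(sal[0])
    return {(by, bx): sum(sal[y][x]
                          for y in range(by, by + size)
                          for x in range(bx, bx + size)) / (size * size)
            for by in range(0, h - size + 1, size)
            for bx in range(0, w - size + 1, size)}


def novelty(sal_a, sal_b, size=8, mean_thresh=0.1, mse_thresh=0.1):
    """Mean block-wise MSE between two saliency maps over the retained blocks."""
    means_a = block_means(sal_a, size)
    means_b = block_means(sal_b, size)
    kept = []
    for (by, bx), mean_a in means_a.items():
        mean_b = means_b[(by, bx)]
        # Skip blocks that are not salient in either frame.
        if mean_thresh > mean_a and mean_thresh > mean_b:
            continue
        mse = sum((sal_a[y][x] - sal_b[y][x]) ** 2
                  for y in range(by, by + size)
                  for x in range(bx, bx + size)) / (size * size)
        # Skip blocks whose saliency barely changed between the frames.
        if mse_thresh > mse:
            continue
        kept.append(mse)
    # Assumption: novelty is zero when every block was skipped.
    return sum(kept) / len(kept) if kept else 0.0
```
      </preformat>
      <p>Both saliency maps are assumed to have the same dimensions. Two identical maps give a novelty of zero under this sketch, since no block passes the MSE test.</p>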
    </sec>
    <sec id="sec-5">
      <title>Interestingness Prediction</title>
      <p>For the image subtask, we used five features: colorfulness,
contrast, complexity (mean and standard deviation) and attention.
With these features, we learn a model for interestingness using
linear regression.</p>
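      <p>A minimal sketch of fitting such a linear model is given below. It uses ordinary least squares with an intercept, solved through the normal equations, which is one standard way to fit a linear regression and not necessarily the solver used in this work. The function names are hypothetical, and the sketch assumes the feature matrix is not rank-deficient.</p>
      <preformat>
```python
def fit_linear_regression(features, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(f) for f in features]  # prepend a bias term
    d = len(rows[0])
    # Augmented system [X^T X | X^T y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(d)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(d)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(d):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][d] / a[i][i] for i in range(d)]  # [intercept, w_1, ..., w_n]


def predict(weights, feature_vector):
    """Predicted interestingness score for one feature vector."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_vector))
```
      </preformat>
      <p>For the image subtask, each feature vector would hold the five values listed above and each target would be the corresponding annotated interestingness score.</p>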
      <p>For the video subtask, we added the audio feature vector to these
five features. As there are more features in this case, we used a
multilayer perceptron. Because computing the features for all frames
of a video was not feasible given the time constraints, we used the
mean image provided in the image subtask to compute the visual part
of the feature vector for the video subtask.</p>
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTAL RESULTS</title>
    </sec>
    <sec id="sec-7">
      <title>Evaluation</title>
      <p>For the image subtask, using the five features with linear
regression as the learning algorithm, the MAP@10 on the test set is
0.0406. For the video subtask, using these features together with the
audio feature vector and a multilayer perceptron as the learning
algorithm, the MAP@10 on the test set is 0.0636.</p>
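      <p>For reference, average precision at 10 and its mean over videos can be sketched as follows. The normalization used here, dividing by the smaller of k and the number of relevant items, is a common convention and an assumption on our part. The official evaluation tool of the task may normalize differently, so this sketch is not guaranteed to reproduce the scores reported above. The function names are hypothetical.</p>
      <preformat>
```python
def average_precision_at_k(ranked_relevance, k=10):
    """AP@k for one video.

    ranked_relevance holds 0/1 ground-truth labels of the items,
    sorted by predicted interestingness score (best first).
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumed normalization: min(k, total number of relevant items).
    denom = min(k, sum(ranked_relevance))
    return precision_sum / denom if denom else 0.0


def mean_average_precision_at_k(per_video_rankings, k=10):
    """MAP@k: the mean of AP@k over all videos."""
    scores = [average_precision_at_k(r, k) for r in per_video_rankings]
    return sum(scores) / len(scores)
```
      </preformat>
      <p>Because the measure only looks at the top of each ranking, a model scores well on a video when the few truly interesting items are ranked first, which is the behavior examined in the analysis that follows.</p>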
      <p>For the image subtask, the average precision@10 is highest for
those videos in which the top 10 images judged interesting by common
viewers are more colorful and have high variation and contrast. These
also include images that are eye-catching, as captured by the visual
attention feature. Examples are images showing many people gathered,
as in a rebellion or a meeting, or people wearing clothes in a
vibrant setting, say at a party.</p>
      <p>Similarly, for the video subtask, the average precision@10 is
highest for those videos in which the top 10 clips judged interesting
by common viewers are more colorful, have high complexity and attract
attention. A clip with loud audio seems to attract more viewers. For
example, a clip which shows a blast or people screaming is of more
significance than a silent clip.</p>
      <p>In contrast, the average precision@10 is lowest for those videos
in which the most interesting images are less colorful and have very
little variation. An example of such a scene is one showing a dark
wall on which some strange symbols are painted. This may be because
these symbols have some backstory in the movie and hence are
interesting to common viewers. Another example is a scene where
explicit content is shown; it is usually shown in the dark with very
little variation, yet is arousing for humans. In such cases, the model
tends to predict the following kinds of images as more interesting: a
crowded place of no great significance, complex buildings and roads of
no great importance, or just a lit empty room.</p>
      <p>For the video subtask, the average precision@10 is lowest
because, in these scenes, the audio is also negligible, be it a moment
of suspicion or any of the examples mentioned for the low values in
the image subtask. The model instead predicts as interesting those
clips that have louder audio along with the other features, as in the
image subtask.</p>
    </sec>
    <sec id="sec-8">
      <title>CONCLUSION</title>
      <p>The interestingness of a scene is a subjective aspect and one
that involves complex cognitive processes. However, certain features
such as contrast, colorfulness and novelty of the scene can be assumed
to play a part in the way humans quantify interestingness,
irrespective of the type of scene. Therefore, in this work, we
extracted and used such audio and visual features to develop a model
for predicting interestingness. Such an approach is, of course, an
initial step towards building a more comprehensive model. The novelty
feature proposed in this paper was not used for the current task due
to time constraints and can be exploited in future work.</p>
    </sec>
    <sec id="sec-9">
      <title>ACKNOWLEDGMENTS</title>
      <p>We thank Karan Thakkar for his fruitful help.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Demarty</surname>
          </string-name>
          et al.
          <year>2017</year>
          .
          <article-title>Mediaeval 2017 predicting media interestingness task</article-title>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Helmut</given-names>
            <surname>Grabner</surname>
          </string-name>
          , Fabian Nater, Michel Druey, and Luc Van Gool.
          <year>2013</year>
          .
          <article-title>Visual interestingness in image sequences</article-title>
          .
          <source>In Proceedings of the 21st ACM international conference on Multimedia. ACM</source>
          ,
          <fpage>1017</fpage>
          -
          <lpage>1026</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Gygli</surname>
          </string-name>
          , Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool.
          <year>2013</year>
          .
          <article-title>The interestingness of images</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          .
          <fpage>1633</fpage>
          -
          <lpage>1640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Harel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C</given-names>
            <surname>Koch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P</given-names>
            <surname>Perona</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>A saliency implementation in matlab</article-title>
          . URL: http://www.klab.caltech.edu/~harel/share/gbvs.php (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Kresimir</given-names>
            <surname>Matkovic</surname>
          </string-name>
          , László Neumann, Attila Neumann, Thomas Psik, and
          <string-name>
            <given-names>Werner</given-names>
            <surname>Purgathofer</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Global Contrast Factor-a New Approach to Image Contrast</article-title>
          .
          <source>Computational Aesthetics</source>
          <year>2005</year>
          (
          <year>2005</year>
          ),
          <fpage>159</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Panetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chen</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sos</given-names>
            <surname>Agaian</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>No reference color image contrast and quality measures</article-title>
          .
          <source>IEEE transactions on Consumer Electronics</source>
          <volume>59</volume>
          ,
          <issue>3</issue>
          (
          <year>2013</year>
          ),
          <fpage>643</fpage>
          -
          <lpage>651</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Honghai</given-names>
            <surname>Yu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Winkler</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Image complexity and spatial information</article-title>
          .
          <source>In Quality of Multimedia Experience (QoMEX)</source>
          ,
          <source>2013 Fifth International Workshop on. IEEE</source>
          ,
          <fpage>12</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>