=Paper= {{Paper |id=Vol-1984/Mediaeval_2017_paper_17 |storemode=property |title=DA-IICT at MediaEval 2017: Objective Prediction of Media Interestingness |pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_17.pdf |volume=Vol-1984 |authors=Rashi Gupta,Manish Narwaria |dblpUrl=https://dblp.org/rec/conf/mediaeval/GuptaN17 }} ==DA-IICT at MediaEval 2017: Objective Prediction of Media Interestingness== https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_17.pdf
        DA-IICT at MediaEval 2017: Objective prediction of media
                           interestingness
                                                      Rashi Gupta, Manish Narwaria
                                  Dhirubhai Ambani Institute of Information and Communication Technology
                                            rashi.8496@gmail.com,manish_narwaria@daiict.ac.in

ABSTRACT
Interestingness is defined as the power of engaging and holding curiosity. While humans can almost effortlessly rank and judge the interestingness of a scene, automated prediction of interestingness for an arbitrary scene is a challenging problem. In this work, we attempt to develop a computational model for this problem. Our approach is based on identifying and extracting context-specific features from video clips. These features are subsequently utilized in a predictor model to provide continuous scores that can be related to the interestingness of the scene in question. Such computational models can be useful in the automated analysis of videos (e.g., a movie, CCTV footage or an advertisement clip).

Copyright held by the owner/author(s).
MediaEval'17, 13-15 September 2017, Dublin, Ireland

1 INTRODUCTION
The aim of the task is to select content (images and video clips) considered to be the most interesting for a common viewer. This is a challenging task because the interestingness of media is highly subjective, and can depend on multiple aspects including personal preferences, emotional state and the content itself. Therefore, as a first step, our goal in this task is to understand and extract signal-related features which may, for instance, quantify visual appearance and audio information. Such features can then be mapped to an interestingness score via machine learning. Further details about the task and dataset can be found in [1].

The said task falls under the broad areas of multimedia signal processing (image, video and audio) and machine learning. The former focuses on the analysis and extraction of context-specific features from the signal. These may include color, contrast, complexity, audio characteristics, etc. The primary goal of feature extraction is to obtain a more meaningful signal representation from the viewpoint of capturing useful information pertaining to media interestingness. In the task, these features are subsequently used as input to a regressor (e.g., linear regression or a multilayer perceptron). As the target value of this regression problem is known (equal to the interestingness score given by a panel of human subjects), this is a supervised learning problem.

We note that a similar approach has been used in previous works such as [2, 3]. However, the key difference lies in the features used, and this is one of the contributions of this work. Also, the results shed light on some aspects of interestingness that may not be fully captured by the current set of features. This can be used to improve feature extraction, and in the process predict objective media interestingness scores that are closer to human judgments.

2 APPROACH
2.1 Interestingness Features Computation
The following features are extracted from the image for the image subtask: colorfulness, contrast, complexity and visual attention. For the video subtask, along with these features, the audio feature Mel-frequency cepstral coefficients (MFCCs) is also computed, to take the audio of the clip into account.

Colorfulness: We measure colorfulness as proposed by [6]. The red-green and yellow-blue opponent color spaces are used, where α = R − G and β = 0.5(R + G) − B. Let σ_α², σ_β², μ_α, μ_β denote the variances and mean values along these two opponent color axes, defined as (and analogously for β, with N the number of pixels):

μ_α = (1/N) Σ_{p=1}^{N} α_p and σ_α² = (1/N) Σ_{p=1}^{N} (α_p² − μ_α²)

Colorfulness is then formulated from the ratio of the variance to the average chrominance in each opponent component:

colorfulness = 0.02 × log(σ_α² / |μ_α|^0.2) × log(σ_β² / |μ_β|^0.2)

Contrast: We measure contrast as proposed by [5]. The main idea is to compute local contrast factors at various resolutions, and then build a weighted average in order to obtain the global contrast factor. Let k denote the original pixel value, k = 0, 1, ..., 255. The first step is to apply gamma correction with γ = 2.2 and scale the input values to the [0, 1] range. The gamma-corrected linear luminance is l = (k/255)^γ. The perceptual luminance L can be approximated by the square root of the linear luminance: L = 100 × √l. Once the perceptual luminances are computed, we compute the local contrast: for each pixel, the average absolute difference of L between the pixel and its four neighboring pixels,

lc_i = (|L_i − L_{i−1}| + |L_i − L_{i+1}| + |L_i − L_{i−w}| + |L_i − L_{i+w}|) / 4

The average local contrast C_i for the current resolution is computed as the average of lc_i over the whole image, where the image is w pixels wide and h pixels high:

C_i = (1 / (w × h)) × Σ_{i=1}^{w×h} lc_i

We compute C_i at various resolutions. Once C_i for the original image is computed, we build a smaller-resolution image by combining 4 original pixels into one superpixel, so that the image width and height are each halved. The C_i for the various resolutions can thus easily be computed, and the process continues until only a few large superpixels remain in the image. Now that we have computed the average local
contrasts C_i, we can compute the global contrast factor as the weighted sum

GCF = Σ_{i=1}^{N} w_i × C_i

where N is the number of resolutions and w_i are the resolution weights.

Complexity: We measure complexity by calculating the Spatial Information as proposed by [7]. Let s_h and s_v denote the gray-scale image filtered with the horizontal and vertical Sobel kernels, respectively. SI_r = √(s_h² + s_v²) represents the magnitude of spatial information at every pixel. The mean and standard deviation of SI_r are used to characterize the complexity of an image:

SI_mean = (1/P) Σ SI_r and SI_stdev = √( (1/P) Σ (SI_r² − SI_mean²) )

where P is the number of pixels in the image.

Visual attention: We propose a method to calculate the attention of an image by computing a saliency map for the corresponding image. The implementation of [4] is used for saliency map computation. The mean of this saliency map over all pixels is the attention value.

Audio extraction: For the audio features, Mel-frequency cepstral coefficients (MFCCs) are computed, and their mean and standard deviation over the time frames are calculated. This forms the feature vector for audio.

Novelty: We propose a method to calculate novelty by first computing saliency maps for the images. An 8 × 8 averaging filter is convolved with two consecutive saliency maps and the mean of each block is calculated. If the block average is less than a threshold (0.1) in both maps, the block is skipped. Otherwise, the mean squared error (MSE) between the corresponding blocks of the two consecutive saliency maps is calculated. If the MSE is less than a threshold, this block is also ignored; otherwise it is considered. Finally, the mean of the MSE over the considered blocks is calculated. The higher this value, the more action has happened between the two consecutive frames.

2.2 Interestingness Prediction
For the image subtask, we use five features, namely colorfulness, contrast, complexity (mean and standard deviation) and attention. Using these features, we learn a model for interestingness with linear regression.

For the video subtask, along with these five features, we also add the audio feature vector. As there are more features in this case, we use a multilayer perceptron. We use the mean image provided in the image subtask to compute the feature vector for the video subtask, as computation over all frames of the video was not feasible given the time constraints.

3 EXPERIMENTAL RESULTS
3.1 Evaluation
For the image subtask, using the five features with linear regression as the learning algorithm, the MAP@10 on the test set is 0.0406. For the video subtask, using these features along with the audio feature vector and a multilayer perceptron as the learning algorithm, the MAP@10 on the test set is 0.0636.

3.2 Analysis
For the image subtask, the maximum average precision@10 is obtained for those videos in which the top 10 images that are interesting in the common view are more colorful and have high variation and contrast. This also includes images that are eye-catching, owing to the visual attention feature, for example images showing many people gathered, as in a rebellion or a meeting, or people wearing vibrant clothes, say at a party.

Similarly, for the video subtask, the maximum average precision@10 is obtained for those videos in which the top 10 clips that are interesting in the common view are more colorful, have high complexity and capture attention. A clip with loud audio seems to attract more viewers; for example, a clip showing a blast or people screaming is of more significance than a silent clip.

On the contrary, the lowest average precision@10 is obtained for those videos in which the most interesting images are less colorful and have very little variation. An example of such a scene is one in which, say, a dark wall painted with strange symbols is shown; these symbols may have some backstory in the movie and hence be interesting in the common view. Another is a scene where explicit content is shown: it is usually dark, with very little variation, yet arousing for humans. In such cases, the model tends to predict the following kinds of images as more interesting: a crowded place with no greater significance, complex buildings and roads of no greater importance, or just a lit empty room.

For the video subtask, the lowest average precision@10 arises because in these scenes the audio is also negligible, be it a moment of suspicion or any of the examples mentioned for the low values in the image subtask; the model instead predicts as interesting those clips which have louder audio along with the other features, as in the image subtask.

4 CONCLUSION
The interestingness of a scene is a subjective aspect and one that involves complex cognitive processes. However, certain features such as contrast, colorfulness and novelty of the scene can be assumed to play a part in the way humans quantify interestingness, irrespective of the type of scene. Therefore, in this work, we extracted and used such audio and visual features to develop a model for predicting interestingness. Such an approach is, of course, an initial step towards building a more comprehensive model. The novelty feature proposed in this paper was not used for the current task due to time constraints, and can be exploited in future work.

ACKNOWLEDGMENTS
We thank Karan Thakkar for his fruitful help.
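For concreteness, the colorfulness computation described in Section 2.1 can be sketched as follows. This is a minimal NumPy sketch: the function name and the float RGB input convention are our own, and an image with zero mean chrominance on either axis would need an extra guard against division by zero.

```python
import numpy as np

def colorfulness(rgb):
    """Colorfulness in the opponent spaces of [6] (sketch).

    rgb: H x W x 3 array with R, G, B channels; chrominance means
    are assumed non-zero.
    """
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    alpha = r - g                    # red-green opponent axis
    beta = 0.5 * (r + g) - b         # yellow-blue opponent axis
    mu_a, mu_b = alpha.mean(), beta.mean()
    var_a = (alpha ** 2).mean() - mu_a ** 2   # sigma_alpha^2
    var_b = (beta ** 2).mean() - mu_b ** 2    # sigma_beta^2
    # ratio of variance to average chrominance in each opponent component
    return 0.02 * np.log(var_a / abs(mu_a) ** 0.2) \
                * np.log(var_b / abs(mu_b) ** 0.2)
```

A vivid, high-variance image yields a larger value than a near-monochrome one.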
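The multi-resolution contrast computation of Section 2.1 can be sketched as below. This is a simplified reading of [5], under stated assumptions: the resolution weights w_i use a commonly quoted polynomial approximation, superpixels are formed by averaging the perceptual luminance directly, and border pixels use only their available neighbours.

```python
import numpy as np

def _avg_local_contrast(L):
    # average absolute difference of perceptual luminance between each
    # pixel and its available 4-neighbours, averaged over the image
    d = np.zeros_like(L)
    c = np.zeros_like(L)
    d[:, 1:] += np.abs(L[:, 1:] - L[:, :-1]); c[:, 1:] += 1    # left
    d[:, :-1] += np.abs(L[:, :-1] - L[:, 1:]); c[:, :-1] += 1  # right
    d[1:, :] += np.abs(L[1:, :] - L[:-1, :]); c[1:, :] += 1    # top
    d[:-1, :] += np.abs(L[:-1, :] - L[1:, :]); c[:-1, :] += 1  # bottom
    return (d / c).mean()

def _halve(L):
    # merge 2x2 pixel blocks into one superpixel (odd edges trimmed);
    # we average the perceptual luminance directly for simplicity
    h, w = L.shape
    L = L[:h // 2 * 2, :w // 2 * 2]
    return (L[0::2, 0::2] + L[0::2, 1::2] + L[1::2, 0::2] + L[1::2, 1::2]) / 4.0

def global_contrast_factor(gray):
    """GCF of [5] (sketch). gray: H x W array of 8-bit pixel values."""
    lin = (gray.astype(float) / 255.0) ** 2.2   # gamma-corrected linear luminance
    L = 100.0 * np.sqrt(lin)                    # perceptual luminance
    gcf = 0.0
    for i in range(1, 10):                      # up to 9 resolution levels
        if min(L.shape) < 2:
            break
        x = i / 9.0
        w_i = (-0.406385 * x + 0.334573) * x + 0.0877526  # assumed weights
        gcf += w_i * _avg_local_contrast(L)
        L = _halve(L)
    return gcf
```

A uniform image gives a GCF of zero, while a high-frequency pattern such as a checkerboard gives a strictly positive value.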
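The spatial-information complexity features can be sketched as follows, assuming a gray-scale input; in this sketch the Sobel responses are computed on the interior only, with borders excluded for simplicity.

```python
import numpy as np

def spatial_information(gray):
    """SI mean and stdev complexity features per [7] (sketch)."""
    p = gray.astype(float)
    # horizontal Sobel response s_h (kernel [[-1,0,1],[-2,0,2],[-1,0,1]])
    sh = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    # vertical Sobel response s_v (kernel [[-1,-2,-1],[0,0,0],[1,2,1]])
    sv = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    si = np.sqrt(sh ** 2 + sv ** 2)        # SI_r at every interior pixel
    si_mean = si.mean()
    si_std = np.sqrt(max((si ** 2).mean() - si_mean ** 2, 0.0))
    return si_mean, si_std
```

A flat image has zero spatial information, while any luminance edge raises the SI mean.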

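The block-wise novelty computation might look as follows. The threshold on the block MSE and the details of the block iteration are assumptions, since the text only states that thresholds are used; saliency maps are assumed to lie in [0, 1].

```python
import numpy as np

def novelty(sal_prev, sal_next, block=8, act_thr=0.1, mse_thr=5e-4):
    """Novelty between two consecutive saliency maps (sketch).

    Blocks whose mean saliency is below act_thr in BOTH maps are
    skipped; of the rest, blocks whose MSE is below mse_thr (an
    assumed value) are ignored; novelty is the mean MSE of the
    remaining blocks (0.0 if none survive).
    """
    h, w = sal_prev.shape
    mses = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            a = sal_prev[y:y+block, x:x+block]
            b = sal_next[y:y+block, x:x+block]
            if a.mean() < act_thr and b.mean() < act_thr:
                continue                  # low-saliency block: skip
            mse = ((a - b) ** 2).mean()
            if mse < mse_thr:
                continue                  # no appreciable change
            mses.append(mse)
    return float(np.mean(mses)) if mses else 0.0
```

Identical frames score zero; a saliency map that changes completely between frames scores high.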

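The evaluation metric of Section 3.1 can be illustrated with a small sketch of average precision at 10. The normalization by min(number of relevant items, k) is one common definition of AP@k and is our assumption; the official task evaluation may differ in such details.

```python
def average_precision_at_k(ranked_labels, k=10):
    """AP@k for one video: ranked_labels is the model's ranking,
    with 1 for an annotated-interesting item and 0 otherwise."""
    hits, score = 0, 0.0
    for i, rel in enumerate(ranked_labels[:k], start=1):
        if rel:
            hits += 1
            score += hits / i          # precision at each relevant rank
    total_rel = min(sum(ranked_labels), k)
    return score / total_rel if total_rel else 0.0

def map_at_k(all_rankings, k=10):
    """MAP@k: mean of AP@k over all videos."""
    return sum(average_precision_at_k(r, k) for r in all_rankings) / len(all_rankings)
```

Ranking all interesting items first gives an AP@10 of 1.0; pushing them down the list lowers it.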
REFERENCES
[1] Demarty et al. 2017. MediaEval 2017 Predicting Media Interestingness
    Task. (2017).
[2] Helmut Grabner, Fabian Nater, Michel Druey, and Luc Van Gool. 2013.
    Visual interestingness in image sequences. In Proceedings of the 21st
    ACM international conference on Multimedia. ACM, 1017–1026.
[3] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian
    Nater, and Luc Van Gool. 2013. The interestingness of images. In
    Proceedings of the IEEE International Conference on Computer Vision.
    1633–1640.
[4] Jonathan Harel, C Koch, and P Perona. 2006. A saliency implementa-
    tion in matlab. URL: http://www.klab.caltech.edu/~harel/share/gbvs.php
    (2006).
[5] Kresimir Matkovic, László Neumann, Attila Neumann, Thomas Psik,
    and Werner Purgathofer. 2005. Global Contrast Factor-a New Ap-
    proach to Image Contrast. Computational Aesthetics 2005 (2005), 159–
    168.
[6] Karen Panetta, Chen Gao, and Sos Agaian. 2013. No reference color
    image contrast and quality measures. IEEE transactions on Consumer
    Electronics 59, 3 (2013), 643–651.
[7] Honghai Yu and Stefan Winkler. 2013. Image complexity and spatial
    information. In Quality of Multimedia Experience (QoMEX), 2013 Fifth
    International Workshop on. IEEE, 12–17.