LAPI at MediaEval 2017 - Predicting Media Interestingness

Mihai Gabriel Constantin, Bogdan Boteanu, Bogdan Ionescu
LAPI, University "Politehnica" Bucharest, Romania
{mgconstantin, bboteanu, bionescu}@imag.pub.ro

Copyright held by the owner/author(s). MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
In this paper we present our contribution, approach and results for the MediaEval 2017 Predicting Media Interestingness task. We studied several visual descriptors and created several early and late fusion approaches in our machine learning system, optimized for the best results in this benchmarking competition.

1 INTRODUCTION
Multimedia interestingness has been studied increasingly in recent years, from several perspectives including psychology and computer vision. From a psychological perspective, user studies have described a correlation between human interest and several other concepts including, but not limited to, aesthetics, enjoyment, complexity and novelty [1, 8], while computer vision approaches have studied various sets of features and machine learning techniques that are able to predict the interestingness of multimedia shots, based on low-level attributes such as color histograms, SIFT or edge distributions [8], or high-level attributes like composition rules or the presence of certain objects [7].

The MediaEval 2017 Predicting Media Interestingness task [6] is a benchmarking competition where participants are tasked with creating a system that can predict the interestingness of images and video segments annotated by a team of viewers, according to a Video on Demand scenario in which a set of the most interesting frames or video shots has to be presented to a certain user. This paper describes our approach for this task.

2 APPROACH
The approach presented in this paper is a continuation of our work described in [3], with the addition of a video interestingness prediction system. The first step in our machine learning system is the extraction of the content descriptors, followed by the learning stage for these descriptors and their early and late fusion combinations, executed on the annotated development dataset. In the final stage we evaluate the best performing combinations on the unlabeled testing dataset. The features used here are presented, along with a detailed description, in [3] and are based on the works of [5, 9-11]. These features have been used in several domains connected with interestingness, such as aesthetics, photographic compositional rules and color theory. For the machine learning algorithm we used Support Vector Machines (SVM) [4] with different parameters and kernels.

2.1 Features
The features used in this system are as follows: Hue, Saturation and Value computed from HSV space (denoted HSV), Hue, Saturation and Lightness extracted from HSL space (HSL), Colorfulness [5, 9], Hue descriptors (HueDesc) [9, 11], Hue models (HueModel) [11], Brightness [10, 11], Edge [9-11], Texture [9], RGB entropy (RGBEntropy) [9], HSV wavelet (HSVwavelet) and average value for the HSV wavelet (aHSVwavelet) [5], average HSV values based on the Rule of Thirds (aHSVRot) [5], average HSL values for the focus region (aHSLFocus) [11], size analysis for the largest five segments (LargSegm) [5], centroid placement (Centroids) [5], Hue, Saturation, Value and Brightness for the largest segments (HueSegm, SatSegm, ValSegm, BrightSegm) [5, 11], color model for the largest segments (ColorSegm) [5], coordinates of the segments (CoordSegm) [11], mass variance, skewness and contrast between the segments (MassVarSegm, SkewSegm, ContrastSegm) [11] and, finally, a depth of field indicator (DoF) calculated according to the method presented in [5].

While for the image subtask each image generated one set of the presented descriptors, for the video subtask we generated two sets of descriptors for each of the individual segments. These two sets were obtained by extracting the feature set for each frame and then calculating the average value and the median value over all the frames in a video segment, as illustrated in the sketch below.
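To make this aggregation step concrete, the following minimal sketch (our illustration, not the original implementation) assumes frames arrive as NumPy arrays and uses a placeholder extract_frame_descriptor in place of the real per-frame extraction detailed in [3]:

```python
import numpy as np

def extract_frame_descriptor(frame):
    # Placeholder for the real per-frame extraction (HSV/HSL statistics,
    # segment descriptors, etc., detailed in [3]); per-channel means are
    # used here only so the sketch runs end to end.
    return frame.reshape(-1, frame.shape[-1]).mean(axis=0)

def video_descriptors(frames):
    # The two video-level descriptor sets for a segment: the element-wise
    # average (AVG) and median (MED) of the per-frame feature vectors.
    per_frame = np.stack([extract_frame_descriptor(f) for f in frames])
    return per_frame.mean(axis=0), np.median(per_frame, axis=0)

# Usage on a dummy segment of 30 random RGB frames:
segment = [np.random.rand(120, 160, 3) for _ in range(30)]
avg_desc, med_desc = video_descriptors(segment)
```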
2.2 Data fusion
In both subtasks we used early and late fusion techniques to maximize our final results. Early fusion consisted of concatenating several features and using the newly created feature vector as the input for a new training run, while for late fusion we took the confidence output values of several runs and combined them according to several strategies, thus generating new confidence outputs.

For the late fusion trials we used four strategies, sketched below: CombMax and CombMin, where we took the maximum and minimum confidence value, respectively, for each media sample and used them as the new outputs; CombSum, where we added up the individual confidence values of the runs; and CombMean, where the added confidence values were also multiplied with weights distributed according to the rank of the initial system. This weight was calculated as w = 1/2^r, where the rank r had the value 0 for the best component output classifier, 1 for the second and so on.
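The four strategies can be summarized in a short sketch (ours, under stated assumptions): each run is a vector of per-sample confidence scores, rows are ordered by devset rank (rank 0 = best), and, since the text does not say whether CombMean normalizes the weighted sum, the plain weighted sum is used:

```python
import numpy as np

def late_fusion(confidences, strategy):
    # confidences: shape (n_runs, n_samples), rows ordered by devset
    # rank, row 0 being the best component run.
    c = np.asarray(confidences, dtype=float)
    if strategy == "CombMax":
        return c.max(axis=0)   # per-sample maximum confidence
    if strategy == "CombMin":
        return c.min(axis=0)   # per-sample minimum confidence
    if strategy == "CombSum":
        return c.sum(axis=0)   # sum of the runs' confidences
    if strategy == "CombMean":
        # Rank-based weights w = 1/2**r, with r = 0 for the best run.
        w = 1.0 / 2.0 ** np.arange(c.shape[0])
        return w @ c
    raise ValueError("unknown strategy: " + strategy)

# Example: fuse three runs over five media samples.
fused = late_fusion(np.random.rand(3, 5), "CombMean")
```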
2.3 Learning system
The learning system we used was an SVM, implemented with the LibSVM library [2], with linear, polynomial and RBF kernels. For the degree, gamma and cost coefficients we used combinations of values 2^k, where k ∈ {-6, ..., 6}; a sketch of this parameter sweep follows.
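As an illustration, the sketch below reproduces this sweep with scikit-learn's SVC, which wraps the same LibSVM library [2]. The data matrix X and labels y are random stand-ins, the degree grid is restricted to small positive integer powers of two (our assumption, since LibSVM expects an integer degree), and the default accuracy scoring stands in for the MAP@10-based selection described in Section 3.1:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Parameter values 2**k for k in {-6, ..., 6}, as used for the
# gamma and cost (C) coefficients.
pow2 = [2.0 ** k for k in range(-6, 7)]
param_grid = [
    {"kernel": ["linear"], "C": pow2},
    {"kernel": ["poly"], "C": pow2, "gamma": pow2, "degree": [1, 2, 4]},
    {"kernel": ["rbf"], "C": pow2, "gamma": pow2},
]

X = np.random.rand(200, 40)       # stand-in for the descriptor matrix
y = np.random.randint(0, 2, 200)  # stand-in interestingness labels

# 10-fold cross-validation over the grid (cf. Section 3.1).
search = GridSearchCV(SVC(), param_grid, cv=10).fit(X, y)

# Per-sample confidence: the signed margin to the decision hyperplane,
# as used for the MAP@10 computation in Section 3.1.
confidence = search.best_estimator_.decision_function(X)
```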
3 EXPERIMENTAL RESULTS
As presented in the task overview paper [6], the development dataset consisted of 7396 frames for the image subtask and 7396 video segments for the video subtask, while the test dataset had 2435 frames for the image subtask and 2435 video segments for the video subtask. The official metric was mean average precision at 10 (MAP@10), and the organisers also calculated the mean average precision (MAP) for each submitted run. A large number of experiments with different early and late fusion strategies and different SVM configurations were carried out, and the best performing combinations were, in the last phase, run on the testset.

3.1 Experiments on the devset
Our SVM training system used a 10-fold cross-validation approach for choosing the best SVM-feature set combination. Generally, taking into account the MAP@10 metric, the best performing SVM kernel was the RBF kernel. Another general observation is that the late fusion approaches, especially CombMax and CombMean, outperformed the early fusion combinations, while early fusion outperformed learning systems with single descriptors. On the other hand, the CombMin and CombSum strategies performed worse than their components for many combinations. Regarding the two descriptor sets for the video subtask (average and median), the results were mixed: some early fusion combinations or single descriptors performed better with the median approach, while others performed better when we used the average calculation.

The interestingness confidence scores for each shot, used for the MAP@10 calculation, were extracted as the margin to the decision hyperplane.

Table 1 shows the best results registered on both the image and the video subtasks and, as mentioned earlier, the best results were achieved with the late fusion approaches. For the video subtask we used the notation AVG for features obtained using the average and MED for features obtained using the median. All the components in Table 1 were trained using the best performing SVM RBF kernel.

Table 1: Best results on devset for the image and video subtasks and their final result on testset (best testset results are marked with an asterisk)

| Subtask | Run | Approach | MAP@10 devset | MAP testset | MAP@10 testset |
|---|---|---|---|---|---|
| image | run1 | CombMax (HSV + HSL + aHSLFocus and aHSVRot + aHSLFocus and HSV + MassVarSegm + LargSegm) | 0.0821 | 0.1791 | 0.0463 |
| image | run2 | CombMax (HSV + HSL + aHSLFocus and aHSVRot + aHSLFocus) | 0.0803 | 0.1789 | 0.0442 |
| image | run3 | CombMean (aHSVRot + aHSLFocus and HSV + MassVarSegm + LargSegm) | 0.0793 | 0.1873* | 0.0555* |
| image | run4 | CombMean (HSVWavelet + aHSVWavelet + aHSLFocus and HSV + HSL + aHSLFocus and HSV + MassVarSegm) | 0.0793 | 0.1851 | 0.0529 |
| video | run1 | CombMax (LargSegmMED + ValSegmMED and TextureMED + MassVarSegmMED) | 0.0753 | 0.1937 | 0.0619 |
| video | run2 | CombMax (LargSegmMED + ValSegmMED and TextureMED + MassVarSegmMED and EdgeAVG + TextureAVG) | 0.0737 | 0.1819 | 0.0564 |
| video | run3 | CombMax (EdgeAVG + TextureAVG and HSVAVG + MassVarSegmAVG) | 0.0732 | 0.1937 | 0.0619 |
| video | run4 | CombMean (LargSegmMED + ValSegmMED and TextureMED + MassVarSegmMED and EdgeAVG + TextureAVG) | 0.0725 | 0.2028* | 0.0732* |
| video | run5 | CombMax (EdgeAVG + TextureAVG and HSVAVG + MassVarSegmAVG and HSLAVG + ColorfulnessAVG) | 0.0723 | 0.1843 | 0.0571 |

For the image subtask, the best result on the devset was obtained with a CombMax strategy combining the early fusion outputs of HSV + HSL + aHSLFocus, aHSVRot + aHSLFocus and HSV + MassVarSegm + LargSegm, with a MAP@10 score on the devset of 0.0821. For the video subtask, the best result was a CombMax strategy containing the LargSegmMED + ValSegmMED and TextureMED + MassVarSegmMED early fusion outputs, with a MAP@10 score of 0.0753.

3.2 Official results on testset
For the final submission we trained the systems on the entire devset, using the optimal parameters found in the previous experiments, and tested the resulting systems on the testset. Table 1 also presents the official results on the testset for the runs we submitted, as returned by the task organisers, with the MAP and MAP@10 scores for each of the runs. For the image subtask we obtained a best MAP@10 score of 0.0555 by using a CombMean strategy with the outputs of aHSVRot + aHSLFocus and HSV + MassVarSegm + LargSegm; the same system also had the best MAP score, 0.1873. For the video subtask it was again a single system that achieved both the best MAP@10 and the best MAP score: a CombMean strategy using the early fusion outputs of LargSegmMED + ValSegmMED, TextureMED + MassVarSegmMED and EdgeAVG + TextureAVG, with a MAP@10 value of 0.0732 and a MAP value of 0.2028.

4 CONCLUSIONS
In this paper we presented several systems that predict media interestingness using content descriptors and early and late fusion approaches. We tested these systems on the MediaEval 2017 Predicting Media Interestingness task, and our best testset results were a MAP@10 of 0.0555 for the image subtask and 0.0732 for the video subtask.

ACKNOWLEDGMENTS
Part of this work was funded by UEFISCDI under research grant PNIII-P2-2.1-PED-2016-1065, agreement 30PED/2017, project SPOTTER.

REFERENCES
[1] Daniel E. Berlyne. 1960. Conflict, Arousal, and Curiosity. McGraw-Hill.
[2] Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2, 3 (2011), 27.
[3] Mihai Gabriel Constantin and Bogdan Ionescu. 2017. Content Description for Predicting Image Interestingness. In International Symposium on Signals, Circuits and Systems (ISSCS 2017).
[4] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20, 3 (1995), 273-297.
[5] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang. 2006. Studying aesthetics in photographic images using a computational approach. In European Conference on Computer Vision. Springer, 288-301.
[6] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan Do, Michael Gygli, and Ngoc Q. K. Duong. 2017. MediaEval 2017 Predicting Media Interestingness Task. In Working Notes Proceedings of the MediaEval 2017 Workshop.
[7] Sagnik Dhar, Vicente Ordonez, and Tamara L. Berg. 2011. High level describable attributes for predicting aesthetics and interestingness. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011). IEEE, 1657-1664.
[8] Michael Gygli, Helmut Grabner, Hayko Riemenschneider, Fabian Nater, and Luc Van Gool. 2013. The interestingness of images. In Proceedings of the IEEE International Conference on Computer Vision. 1633-1640.
[9] Andreas F. Haas, Marine Guibert, Anja Foerschner, Sandi Calhoun, Emma George, Mark Hatay, Elizabeth Dinsdale, Stuart A. Sandin, Jennifer E. Smith, Mark J. A. Vermeij, and others. 2015. Can we measure beauty? Computational evaluation of coral reef aesthetics. PeerJ 3 (2015), e1390.
[10] Yan Ke, Xiaoou Tang, and Feng Jing. 2006. The design of high-level features for photo quality assessment. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), Vol. 1. IEEE, 419-426.
[11] Congcong Li and Tsuhan Chen. 2009. Aesthetic visual quality assessment of paintings. IEEE Journal of Selected Topics in Signal Processing 3, 2 (2009), 236-252.