LAPI at MediaEval 2016
                       Predicting Media Interestingness Task

                        Mihai Gabriel Constantin, Bogdan Boteanu, Bogdan Ionescu
                                   LAPI, University "Politehnica" of Bucharest, Romania
                                  {mgconstantin, bboteanu, bionescu}@alpha.imag.pub.ro


Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
This paper presents our results for the MediaEval 2016 Predicting Media Interestingness task. We propose an approach based on video descriptors and study several machine learning models in order to find the optimal configuration and combination of the descriptors and algorithms that compose our system.

1. INTRODUCTION
Interestingness is the ability to attract and hold human attention. This concept is gaining importance in the field of computer vision, especially given the growing importance and market value of social media and advertising. Even though interest might seem to be the result of a subjective viewer judgment, important progress has been made towards both an objective and a context-based model of interest. Generally, two directions have arisen in the field of computer vision regarding this topic: pure visual interestingness (based on multimedia features and ideas [5, 6, 7]) and social interestingness (based on the degree of social media interest shown for certain visual data [5, 8]). Some researchers [8] focused on the similarities and differences between these two directions. Studies have been made regarding the psychological and physiological connections with novelty, enjoyment, challenge [1, 3], appraisal structures [10, 11] and computer vision concepts [5, 7, 6].

In this context, the MediaEval 2016 Predicting Media Interestingness Task [4] challenges the participants to automatically select images and/or video segments which are considered to be the most interesting for a common viewer. The concept of interestingness is defined in a particular use case scenario, i.e., helping professionals to illustrate a Video on Demand (VOD) web site by selecting some interesting frames and/or video excerpts for the movies. In this working note paper, we present our machine learning based approach to the task.

2. PROPOSED APPROACH
As previously stated, to determine the interestingness of images and video, we have experimented with a classic machine learning approach. First, the raw data is converted to content descriptors which should capture as well as possible the visual interestingness features of the data. Then, a supervised classifier is learned on these features using the labeled examples. Finally, the actual evaluation is carried out by feeding the classifier the unlabeled data. Regarding the content descriptors, we used the ones provided by the task organizers [4] with some additions. They were used as input for a learning system based on SVM, where we tested different combinations of SVM kernel types and coefficients by using the LibSVM library [2].

2.1 Used features
Several visual features were used as descriptors, many of which have been used in the literature for computer vision tasks. The provided precomputed features were: a color histogram in the Hue-Saturation-Value (HSV) space (denoted histo), Histogram of Oriented Gradients (HoG) descriptors computed over densely sampled patches, dense Scale Invariant Feature Transform (SIFT) with a codebook of 300 codewords and a three-layered spatial pyramid (denoted dsift), Local Binary Patterns (LBP), GIST computed with the output of Gabor-like features (denoted gist) and the fc7 and prob layers of AlexNet (denoted cnnfc7 and cnnprob). All these features are presented and detailed in [4] and [9]. We also extracted and used the color naming histogram feature (denoted colornames) based on the work in [12], as we wanted to obtain a color descriptor with fewer dimensions for our learning algorithms that could better represent a human-centered understanding of the colors in each image or video.

For the image subtask, each image is represented with a content descriptor. For the video subtask, each video contains a certain number of frames. To determine the final descriptor, we simply average the descriptors of the frames, which leads to one global descriptor per video.

2.2 Learning system
The learning is achieved using a Support Vector Machine (SVM) binary classifier. For all trained SVM models we used polynomial, RBF and linear kernels. For the polynomial kernels we used all combinations of the degrees 1, 2 and 3k, where k ∈ {1, ..., 10}, with gamma coefficients set to 2^k, where k ∈ {0, ..., 6}. For the RBF kernel we combined values of 2^k, where k ∈ {−4, ..., 8}, for the cost parameter with values of 2^k, where k ∈ {−4, ..., 8}, for the gamma coefficient. We also tried different class weights, considering the fact that the devset data, both for images and for videos, was unbalanced, the ratio of uninteresting to interesting samples being almost 10 to 1.
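The parameter ranges given in Section 2.2 can be enumerated as in the following minimal sketch. It only builds the search grids; training with LibSVM is not shown. The function names are illustrative, and the inverse-frequency class weighting is an assumption, since the text only states that different weights were tried.

```python
from itertools import product

# Polynomial kernel: degrees 1, 2 and 3*k for k = 1..10; gamma = 2^k for k = 0..6.
POLY_DEGREES = [1, 2] + [3 * k for k in range(1, 11)]
POLY_GAMMAS = [2 ** k for k in range(0, 7)]

# RBF kernel: cost = 2^k and gamma = 2^k, both for k = -4..8.
RBF_COSTS = [2.0 ** k for k in range(-4, 9)]
RBF_GAMMAS = [2.0 ** k for k in range(-4, 9)]


def poly_grid():
    """All (degree, gamma) combinations for the polynomial kernel."""
    return [
        {"kernel": "poly", "degree": d, "gamma": g}
        for d, g in product(POLY_DEGREES, POLY_GAMMAS)
    ]


def rbf_grid():
    """All (cost, gamma) combinations for the RBF kernel."""
    return [
        {"kernel": "rbf", "cost": c, "gamma": g}
        for c, g in product(RBF_COSTS, RBF_GAMMAS)
    ]


def inverse_frequency_weights(n_negative, n_positive):
    """Assumed weighting scheme: each class is weighted inversely to its size."""
    total = n_negative + n_positive
    return {0: total / (2 * n_negative), 1: total / (2 * n_positive)}


if __name__ == "__main__":
    print(len(poly_grid()), "polynomial configurations")
    print(len(rbf_grid()), "RBF configurations")
    # Image devset: 5054 segments, 473 of them interesting.
    print(inverse_frequency_weights(5054 - 473, 473))
```

Under this weighting, the interesting class receives a weight roughly ten times that of the uninteresting class on the image devset, which mirrors the class ratio reported in Section 2.2.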
Table 1: Best results on devset for the image and video subtasks (best results are marked in bold)
Subtask  Feature           SVM type  Degree  Gamma  TP   FP    Precision  Recall  MAP
image    histo+gist        poly      18      2      22   76    0.224      0.05    0.214
image    dsift+gist        poly      3       32     63   330   0.16       0.144   0.211
image    histo+dsift+gist  poly      9       2      15   35    0.3        0.034   0.197
image    colornames+any    poly      3       2      56   334   0.143      0.128   0.195
image    colornames        poly      2       8      226  1892  0.107      0.517   0.195
video    gist+cnnprob      poly      9       4      35   305   0.103      0.083   0.179
video    cnnfc7+any        poly      3       4      40   364   0.099      0.095   0.172
video    dsift+cnnprob     poly      24      64     81   846   0.087      0.192   0.159
video    gist              poly      6       8      49   359   0.121      0.116   0.148
video    dsift             poly      3       64     25   204   0.109      0.059   0.147
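
The Precision column of Table 1 follows from the TP and FP counts as TP / (TP + FP). The sketch below recomputes it for the image rows and applies one possible reading of the selection rule from Section 3.1: keep only combinations whose recall exceeds 0.03, then rank by precision. The function names and the dictionary layout are illustrative assumptions; the counts and recall values are copied from Table 1.

```python
def precision(tp, fp):
    """Fraction of samples predicted as interesting that really are interesting."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def select_candidates(candidates, min_recall=0.03):
    """Drop combinations with recall at or below min_recall, then sort by precision."""
    kept = [c for c in candidates if c["recall"] > min_recall]
    return sorted(kept, key=lambda c: precision(c["tp"], c["fp"]), reverse=True)


# Image subtask rows of Table 1 (TP, FP and recall as reported).
IMAGE_ROWS = [
    {"feature": "histo+gist", "tp": 22, "fp": 76, "recall": 0.05},
    {"feature": "dsift+gist", "tp": 63, "fp": 330, "recall": 0.144},
    {"feature": "histo+dsift+gist", "tp": 15, "fp": 35, "recall": 0.034},
    {"feature": "colornames+any", "tp": 56, "fp": 334, "recall": 0.128},
    {"feature": "colornames", "tp": 226, "fp": 1892, "recall": 0.517},
]

if __name__ == "__main__":
    for row in select_candidates(IMAGE_ROWS):
        print(row["feature"], round(precision(row["tp"], row["fp"]), 3))
```

Note that Table 1 itself is ordered by MAP, not by precision, so the ordering printed here differs from the row order of the table.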


Table 2: Final results on testset (best results are marked in bold)
Run   Subtask  Feature        SVM type  Degree  Gamma  MAP     P@5     P@10    P@20    P@100
run1  image    histo+gist     poly      18      2      0.1714  0.1077  0.1346  0.1423  0.0869
run2  image    dsift+gist     poly      3       32     0.1398  0.0462  0.0808  0.1000  0.0862
run3  video    gist+cnnprob   poly      9       4      0.1574  0.0923  0.1269  0.1212  0.0812
run4  video    cnnfc7+histo   poly      3       4      0.1572  0.1231  0.1000  0.1077  0.0815
run5  video    dsift+cnnprob  poly      24      64     0.1629  0.1154  0.1500  0.1192  0.0819

3. EXPERIMENTAL RESULTS
The task data consists of a development set (devset), intended for training the approaches, and a test set (testset) for the actual benchmarking. The devset was extracted from 52 trailers, manually segmented, thus obtaining 5054 segments. For the image subtask one key-frame was used from each segment, while for the video subtask the whole segment was used. By annotating all the data, a total of 473 interesting images and 420 interesting videos were obtained, with a provided interestingness score for calculating the mean average precision (MAP). The testset consisted of 26 trailers divided into 2342 segments. We performed a number of experiments on the devset and selected the best combinations to be run on the testset.

3.1 Experiments on devset
Using 10-fold cross-validation, we chose the best descriptor-classifier combinations based on precision, requiring a recall better than 0.03. For those best combinations we calculated the mean average precision. We have experimented with many different combinations of descriptors and SVM kernels. The polynomial SVM was generally the best performing kernel. A high number of training runs, especially with the RBF or linear kernels, tended to classify all or almost all of the samples as non-interesting, resulting in a low recall. In the case of weight-based training for the RBF kernel the recall tended to grow, but the precision was below that of the polynomial SVMs.

Table 1 lists the best five results for each of the two subtasks, giving details regarding the best coefficient combination used. As shown, the estimated MAP on the devset was better for the image subtask than for the video subtask. The MAP scores were calculated by using the decision values or probability estimates output by LibSVM [2] as the interestingness score of each sample. The values for true positives (TP), false positives (FP), precision and recall are also listed. The best results were achieved with a descriptor composed of HSV Histogram and GIST, with a polynomial SVM with degree 18 and gamma 2, for the image subtask, and with a descriptor composed of GIST and the CNNProb layer, with a polynomial SVM with degree 9 and gamma 4, for the video subtask.

3.2 Official results on testset
The teams were allowed to submit 5 runs, so we chose the best 2 descriptor-classifier combinations for the image subtask and the best 3 combinations for the video subtask. This time the training of the SVM learning systems was done on the entire devset, using the optimal degree and gamma parameters obtained in our previous experiments. The submitted runs were the following: run1, image subtask with HSV Histogram + GIST, SVM with degree = 18 and gamma = 2; run2, image subtask with DSIFT + GIST, SVM with degree = 3 and gamma = 32; run3, video subtask with GIST + CNNProb, SVM with degree = 9 and gamma = 4; run4, video subtask with CNNFc7 + HSV Histogram, SVM with degree = 3 and gamma = 4; and run5, video subtask with DSIFT + CNNProb, SVM with degree = 24 and gamma = 64.

The final results, as returned by the task organizers, are presented in Table 2. The best results were a MAP of 0.1714 on run1 for the image subtask and a MAP of 0.1629 on run5 for the video subtask. With the single exception of run5, the MAP results on the testset were below the estimated MAP on the devset.

4. CONCLUSIONS
In this paper we presented several models for predicting and scoring multimedia interestingness. Our best MAP results on the testset were 0.1714 for the image subtask and 0.1629 for the video subtask. These results seem to indicate that the task is very challenging, one possible reason for this being the subjective nature of this field of study.

5. REFERENCES
 [1] D. E. Berlyne. Conflict, arousal, and curiosity. 1960.
 [2] C.-C. Chang and C.-J. Lin. LIBSVM: a library for
     support vector machines. ACM Transactions on
     Intelligent Systems and Technology (TIST), 2(3):27,
     2011.
 [3] A. Chen, P. W. Darst, and R. P. Pangrazi. An
     examination of situational interest and its sources.
     British Journal of Educational Psychology,
     71(3):383–400, 2001.
 [4] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do,
     H. Wang, N. Q. K. Duong, and F. Lefèbvre. MediaEval
     2016 predicting media interestingness task. In Proc. of
     the MediaEval 2016 Workshop, Hilversum,
     Netherlands, Oct. 20-21, 2016.
 [5] S. Dhar, V. Ordonez, and T. L. Berg. High level
     describable attributes for predicting aesthetics and
     interestingness. In Computer Vision and Pattern
     Recognition (CVPR), 2011 IEEE Conference on,
     pages 1657–1664. IEEE, 2011.
 [6] H. Grabner, F. Nater, M. Druey, and L. Van Gool.
     Visual interestingness in image sequences. In
     Proceedings of the 21st ACM international conference
     on Multimedia, pages 1017–1026. ACM, 2013.
 [7] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater,
     and L. Van Gool. The interestingness of images. In
     Proceedings of the IEEE International Conference on
     Computer Vision, pages 1633–1640, 2013.
 [8] L.-C. Hsieh, W. H. Hsu, and H.-C. Wang.
     Investigating and predicting social and visual image
     interestingness on social media by crowdsourcing. In
     Acoustics, Speech and Signal Processing (ICASSP),
     2014 IEEE International Conference on, pages
     4309–4313. IEEE, 2014.
 [9] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang.
     Super fast event recognition in internet videos. IEEE
     Transactions on Multimedia, 17(8):1174–1186, 2015.
[10] P. J. Silvia. What is interesting? Exploring the
     appraisal structure of interest. Emotion, 5(1):89, 2005.
[11] S. A. Turner and P. J. Silvia. Must interesting things
     be pleasant? A test of competing appraisal structures.
     Emotion, 6(4):670, 2006.
[12] J. V. D. Weijer, C. Schmid, J. Verbeek, and D. Larlus.
     Learning color names for real-world applications.
     IEEE Transactions on Image Processing,
     18(7):1512–1523, 2009.