Ranking Images and Videos on Visual Interestingness by Visual Sentiment Features

Soheil Rayatdoost, Swiss Center for Affective Sciences, University of Geneva, Switzerland
soheil.rayatdoost@unige.ch

Mohammad Soleymani, Swiss Center for Affective Sciences, University of Geneva, Switzerland
mohammad.soleymani@unige.ch

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
Today, users generate and consume millions of videos online. Automatic identification of the most interesting moments of these videos has many applications, such as video retrieval. Although the most interesting excerpts are person-dependent, existing work demonstrates that there are some common features among these segments. The media interestingness task at MediaEval 2016 focuses on ranking the shots and key-frames in a movie trailer based on their interestingness. The dataset consists of a set of commercial movie trailers from which the participants are required to automatically identify the most interesting shots and frames. We approach the problem as a regression task and test several algorithms. In particular, we use mid-level semantic visual sentiment features. These features are related to the emotional content of images and have been shown to be effective in recognizing interestingness in GIFs. We found that our suggested features outperform the baseline for the task at hand.

1. INTRODUCTION
Interestingness is the capability of catching and holding human attention [1]. Research in psychology suggests that interest is related to novelty, uncertainty, conflict and complexity [2, 14]. These attributes determine whether a person finds an item interesting, and they contribute to interestingness differently for different people; for example, one person might find a more complex stimulus more interesting than another does. Developing a computational model that automatically performs such a task is useful for different applications such as video retrieval, recommendation and summarization [1, 15].

A number of works address the problem of predicting visual interestingness from the content. Gygli et al. and Grabner et al. [7, 6] used visual content features related to unusualness, aesthetics and general preference for predicting visual interestingness. Soleymani [15] built a model for personalized interest prediction for images. He found that affective content, quality, coping potential and complexity have a significant effect on visual interest in images. In a more recent work, Gygli and Soleymani [8] attempted to predict GIF interestingness from the content. They found visual sentiment descriptors [11] to be more effective for predicting GIF interestingness than features that capture temporal information and motion.

The "media interestingness task" is organized at MediaEval 2016. In this task, a development-set and an evaluation-set consisting of Creative Commons licensed trailers of commercial movies with their interestingness labels are provided. For the details of the task description, dataset development and evaluation, we refer the reader to the task overview paper [3]. There are two subtasks in this challenge: the first involves automatic prediction of the interestingness ranking for the different shots in a trailer; the second involves predicting the ranking of the most interesting key-frames. Visual and audio (only for shots) modalities are available to the interestingness prediction methods [3]. The designed algorithms are evaluated on evaluation data which include 2342 shots from 26 trailers. Examples of top-ranking key-frames are shown in Figure 1.

Figure 1: Examples of hit (top row) and miss (bottom row) top-ranking key-frames.

The organizers provided a set of baseline visual and audio features. For the visual modality, we additionally extracted mid-level semantic visual descriptors [11] and deep learning features. Sentiment-related features are effective in capturing the emotional content of images and have been shown to be useful for recognizing interestingness in GIFs [8]. For the audio modality, we extracted the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [4]. We tested multiple regression models for interestingness ranking. We compare our results with those of the baseline features using mean average precision (MAP) over the top N best ranked images or shots. According to our results on the evaluation-set, our feature-set outperforms the baseline features for predicting interestingness. In the next section, we present our features and describe our methodology in detail.

2. METHOD

2.1 Features
We opt for using a set of hand-crafted features and transfer learning, in addition to regression models, with the goal of interestingness ranking.
The task organizers provided a set of baseline low-level features. These include a number of low-level audiovisual features that are typically used in computer vision and speech analysis: dense SIFT, Histograms of Oriented Gradients (HoG), Local Binary Patterns (LBP), GIST, Color Histograms and deep learning features for the visual modality [10], and Mel-Frequency Cepstral Coefficients (MFCC) and cepstral vectors for the audio modality.

Interestingness is highly correlated with the emotional content of images [15]. Therefore, we opted for extracting eGeMAPS features from the audio [4]. The eGeMAPS features are acoustic features hand-picked by experts for speech and music emotion recognition; the 88 eGeMAPS features were extracted with openSMILE [5]. For the video sub-challenge, we extracted all the key-frames from each shot. We then applied the visual sentiment adjective-noun-pair (ANP) detectors [11] to each key-frame. The activations of the fully connected layer 7 (fc7) and the output of the final layer were extracted for each frame. We then pooled the resulting values by mean and variance to form one feature vector per shot.
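For concreteness, the following is a minimal sketch of this per-shot pooling step in Python; the array names, shapes and the use of NumPy are our own assumptions rather than part of the original pipeline.

    import numpy as np

    def pool_shot(frame_features):
        """Pool frame-level descriptors (n_frames x dim) into a single
        shot-level vector by concatenating per-dimension mean and variance."""
        frame_features = np.asarray(frame_features)
        return np.concatenate([frame_features.mean(axis=0),
                               frame_features.var(axis=0)])

    # Hypothetical example: fc7 activations for the key-frames of one shot.
    fc7_per_frame = np.random.rand(12, 4096)   # 12 key-frames, 4096-d fc7
    shot_vector = pool_shot(fc7_per_frame)     # 8192-d shot descriptor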
2.2 Regression models
We used three different regression models to predict the interestingness level: linear regression (LR), support vector regression (SVR) with a linear kernel, and sparse approximation weighted regression (SPARROW) [13].

We used the LIBLINEAR library [9, 12] implementation of SVR with the L2-regularized logistic regression option to predict the interestingness score. We also used regression with sparse approximation, a regression model that approximates the prediction from local information. It is similar to k-nearest neighbors regression (k-NNR), except that its weights are calculated by sparse approximation [13]. Linear regression with least-squares optimization is used as the baseline method.
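To illustrate the idea (not the exact formulation of [13]), the sketch below weights the training targets by a sparse approximation of the query point over the training samples; the use of orthogonal matching pursuit from scikit-learn and all names here are our own assumptions.

    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    def sparrow_like_predict(X_train, y_train, x_query, n_nonzero=10):
        """k-NN-style prediction whose weights come from a sparse
        approximation of the query as a combination of training points."""
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero)
        omp.fit(X_train.T, x_query)    # dictionary columns = training samples
        w = np.abs(omp.coef_)          # one weight per training sample
        if w.sum() == 0:               # degenerate case: fall back to the mean
            return float(y_train.mean())
        return float(np.dot(w, y_train) / w.sum())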
In all cases except the eGeMAPS audio features, we used principal component analysis (PCA) to reduce the dimensionality of the features. For SVR and SPARROW, we kept the principal components containing 99% of the variance. For linear regression, we only kept the principal components that add up to 50% of the total variance.
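A minimal sketch of this variance-based reduction, assuming scikit-learn (a tooling choice of ours; the paper does not name its implementation):

    from sklearn.decomposition import PCA

    def reduce_dimensionality(X_train, X_test, retained_variance=0.99):
        """Keep the principal components whose cumulative explained variance
        reaches the requested fraction (0.99 for SVR/SPARROW, 0.50 for LR)."""
        pca = PCA(n_components=retained_variance)  # float selects by variance
        return pca.fit_transform(X_train), pca.transform(X_test)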

3. EXPERIMENTS
After extracting all the feature-sets, we evaluated the performance of different combinations of feature-sets and regression models. We evaluated the different approaches using five-fold cross-validation on the development-set. In each iteration, one fifth of the development-set was held out and the rest was used to train the regression model. When training the SVR, we optimized the hyper-parameter C with a grid-search on the training-set.
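The sketch below illustrates this protocol, assuming the feature matrix X and the interestingness scores y are already loaded and using scikit-learn's LinearSVR as a stand-in for the LIBLINEAR SVR; the candidate C grid is our own assumption.

    import numpy as np
    from sklearn.model_selection import GridSearchCV, KFold
    from sklearn.svm import LinearSVR

    def cross_validate_svr(X, y):
        """Five-fold cross-validation; C is tuned by grid-search per training split."""
        folds = KFold(n_splits=5, shuffle=True, random_state=0)
        scores = []
        for train_idx, test_idx in folds.split(X):
            grid = GridSearchCV(LinearSVR(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=3)
            grid.fit(X[train_idx], y[train_idx])
            # GridSearchCV scores regressors with R^2; the paper ranks runs by MAP.
            scores.append(grid.best_estimator_.score(X[test_idx], y[test_idx]))
        return float(np.mean(scores))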
The best performing approaches, based on their performance measured by MAP over the ranked results, were selected for the submitted runs (see Table 1).
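Since both the run selection above and the evaluation in the next section use MAP over the top N ranked items, a small sketch of that metric follows; the value of N and the binary per-item interestingness labels are assumptions on our part.

    import numpy as np

    def average_precision_at_n(labels_ranked, n):
        """Average precision over the top-n items of one ranked list
        (labels_ranked: 1 for interesting, 0 otherwise, in predicted order)."""
        labels = np.asarray(labels_ranked[:n])
        if labels.sum() == 0:
            return 0.0
        precisions = np.cumsum(labels) / (np.arange(len(labels)) + 1)
        return float((precisions * labels).sum() / labels.sum())

    def mean_average_precision(ranked_lists, n):
        """MAP: mean of the per-trailer average precision at n."""
        return float(np.mean([average_precision_at_n(l, n) for l in ranked_lists]))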
4. RESULTS AND DISCUSSION
Following the task evaluation procedure, we report MAP over the N best ranked images or shots. We report results for the cross-validation on the development-set and for our four submitted runs on the evaluation-set. For the submitted runs, we trained the selected features and regression methods on all the available data in the development-set. The results for interestingness prediction with the best pairs of regression methods and feature-sets are summarized in Table 1. The best MAP on the development-set, 0.262, is achieved by combining multilingual visual sentiment ontology (MVSO) descriptors and deep learning features with SPARROW regression. We used the baseline video features with SPARROW regression as our baseline. To check the performance of the audio features, we ranked the videos by the output of an SVR trained on audio features only. The best result for the image sub-task is achieved by the sentiment descriptors and deep learning features in combination with linear regression.

Table 1: Evaluation results on interestingness ranking.

              Task    Method    Features    MAP ↑
  Dev. Set    Image   LR        MVSO+fc7    0.1710
              Video   SPARROW   MVSO+fc7    0.2617
              Video   SPARROW   Baseline    0.2414
              Video   SVR       eGeMAPS     0.1987
  Eval. Set   Image   LR        MVSO+fc7    0.1704
              Video   SPARROW   MVSO+fc7    0.1710
              Video   SPARROW   Baseline    0.1497
              Video   SVR       eGeMAPS     0.1367

Overall, the evaluation-set results demonstrate that mid-level semantic visual descriptors are more effective at predicting interestingness than the baseline low-level features. The results from a set of relatively simple audio features show the significance of the audio modality for this task. For the image sub-task, the evaluation-set results are very similar to those of the video sub-task, since sentiment features lack temporal information. The drop in performance on the evaluation-set indicates that our models were over-fitting to the development-set; an ensemble learning regression would likely have performed better.

5. CONCLUSION
In this work, we explored different strategies for predicting visual interestingness in videos. We found the mid-level visual descriptors that are related to sentiment to be more effective for this task than the low-level visual features. This is due to the affective nature of interestingness, i.e., interest is an emotion by some accounts. Our features are all static and frame-based; due to the small size of the dataset, we did not extract movement-related features that could capture temporal information. Hence, the frame-based results are not any different from the shot-based ones; essentially the two sub-tasks end up doing very similar things. The observed performance of the proposed method is rather low; however, given the sample size and the dimensionality of the descriptors, the results still show promising potential. In the future, larger-scale datasets should be developed and annotated to enable more sophisticated methods such as transfer learning with deep neural networks. Even though the audio features are not as effective, they showed significant performance, deserving more in-depth analysis in the future.
6. REFERENCES
 [1] X. Amengual, A. Bosch, and J. L. de la Rosa. Review
     of methods to predict social image interestingness and
     memorability. In G. Azzopardi and N. Petkov, editors,
     Computer Analysis of Images and Patterns: 16th
     International Conference, CAIP 2015, Valletta, Malta,
     September 2-4, 2015, Proceedings, Part I, pages 64–76.
     Springer International Publishing, Cham, 2015.
 [2] D. Berlyne. Conflict, arousal, and curiosity.
     McGraw-Hill, 1960.
 [3] C. Demarty, M. Sjöberg, B. Ionescu, T. Do, H. Wang,
     N. Duong, and F. Lefebvre. MediaEval 2016 predicting
     media interestingness task. In MediaEval 2016
     Workshop, Hilversum, Netherlands, 2016.
 [4] F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg,
     E. André, C. Busso, L. Y. Devillers, J. Epps,
     P. Laukka, S. S. Narayanan, and K. P. Truong. The
     Geneva minimalistic acoustic parameter set (GeMAPS)
     for voice research and affective computing. IEEE
     Transactions on Affective Computing, 7(2):190–202,
     April 2016.
 [5] F. Eyben, F. Weninger, F. Gross, and B. Schuller.
     Recent developments in openSMILE, the Munich
     open-source multimedia feature extractor. In
     Proceedings of the 21st ACM International Conference
     on Multimedia, MM ’13, pages 835–838, New York,
     NY, USA, 2013. ACM.
 [6] H. Grabner, F. Nater, M. Druey, and L. Van Gool.
     Visual interestingness in image sequences. In
     Proceedings of the 21st Annual ACM Conference on
     Multimedia, 2013.
 [7] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater,
     and L. Van Gool. The Interestingness of Images. In
     The IEEE International Conference on Computer
     Vision (ICCV), 2013.
 [8] M. Gygli and M. Soleymani. Analyzing and predicting
     GIF interestingness. In ACM Multimedia, 2016.
 [9] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi,
     and S. Sundararajan. A dual coordinate descent
     method for large-scale linear SVM. In Proceedings of
     the Twenty Fifth International Conference on
     Machine Learning (ICML), 2008.
[10] Y. G. Jiang, Q. Dai, T. Mei, Y. Rui, and S. F. Chang.
     Super fast event recognition in internet videos. IEEE
     Transactions on Multimedia, 17(8):1174–1186, Aug
     2015.
[11] B. Jou, T. Chen, N. Pappas, M. Redi, M. Topkara,
     and S.-F. Chang. Visual affect around the world: A
     large-scale multilingual visual sentiment ontology. In
     Proceedings of the 23rd ACM International Conference
     on Multimedia, MM ’15, pages 159–168, New York,
     NY, USA, 2015. ACM.
[12] C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region
     Newton method for large-scale logistic regression. In
     Proceedings of the 24th International Conference on
     Machine Learning (ICML), 2007.
[13] P. Noorzad and B. L. Sturm. Regression with sparse
     approximations of data. In Signal Processing
     Conference (EUSIPCO), 2012 Proceedings of the 20th
     European, pages 674–678, Aug 2012.
[14] P. J. Silvia, R. A. Henson, and J. L. Templin. Are the
     sources of interest the same for everyone? Using
     multilevel mixture models to explore individual
     differences in appraisal structures. Cognition and
     Emotion, 23(7):1389–1406, 2009.
[15] M. Soleymani. The quest for visual interest. In
     Proceedings of the 23rd ACM International Conference
     on Multimedia, MM '15, pages 919–922, New York,
     NY, USA, 2015. ACM.