=Paper= {{Paper |id=Vol-1739/MediaEval_2016_paper_19 |storemode=property |title=BigVid at MediaEval 2016: Predicting Interestingness in Images and Videos |pdfUrl=https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_19.pdf |volume=Vol-1739 |dblpUrl=https://dblp.org/rec/conf/mediaeval/XuFJ16 }} ==BigVid at MediaEval 2016: Predicting Interestingness in Images and Videos== https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_19.pdf
BigVid at MediaEval 2016: Predicting Interestingness in Images and Videos

Baohan Xu (1,3), Yanwei Fu (2,3), Yu-Gang Jiang (1,3)
(1) School of Computer Science, Fudan University, China
(2) School of Data Science, Fudan University, China
(3) Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, China
{bhxu14, yanweifu, ygj}@fudan.edu.cn

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, October 20-21, 2016, Hilversum, Netherlands.


ABSTRACT
Despite growing research interest, the tasks of predicting the interestingness of images and videos remain an open challenge. The main obstacles come from the diversity and complexity of video content, as well as the highly subjective and varying judgements of interestingness across different persons. In the MediaEval 2016 Predicting Media Interestingness Task, our team BigVid@Fudan submitted five runs, exploring various methods of extracting, modeling and fusing low-level features (from the visual and audio modalities) and hundreds of high-level semantic attributes for classification. We investigated not only the SVM (Support Vector Machine) model but also recent deep learning methods: Run1, Run3 and Run4 used SVM/Ranking-SVM, while Run2 and Run5 used Deep Neural Networks. We achieved a mean average precision of 0.23 on the image subtask and 0.15 on the video subtask. Furthermore, our experiments revealed some interesting and potentially useful insights into this task. For example, our results show that the visual features and high-level attributes are complementary to each other.

1. INTRODUCTION
The problem of automatically predicting the interestingness of images and videos has started to receive increasing attention. Interestingness prediction has a number of real-world applications, such as interestingness-based video recommendation systems for social media platforms.

MediaEval introduced the "2016 Predicting Media Interestingness Task". This task requires participants to automatically select the images and/or video segments that are considered the most interesting for a common viewer. The interestingness of the media is to be judged based on visual appearance, audio information and the text accompanying the data. To solve the task, participants are strongly encouraged to deploy multimodal approaches. For the definitions, dataset and evaluation protocol of the task, please refer to the official task description [2].

This paper describes the first participation of the BigVid@Fudan team in MediaEval 2016. For this task we developed an approach to investigate how features and classifiers affect interestingness prediction in images and videos. Both visual features and high-level attributes were explored in our framework. We also compared SVM with deep neural networks to further study the relations between different features.
2. SYSTEM DESCRIPTION
Figure 1 gives an overview of our system, which is composed of two key components: feature extraction and classifiers.

2.1 Feature Extraction
Several pre-computed features are provided by the organizers, such as denseSIFT [3], fc7-layer features from a CNN pre-trained on ImageNet, and face features. To enlarge the useful information in the data, we also consider two other types of high-level features, which have been shown to be very useful for aesthetics and interestingness prediction in [5] and [1]. For each feature modality, the video-level representation is formed by average-pooling the descriptors of all sampled frames.
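As a concrete illustration, the following is a minimal Python sketch of this average pooling; the function name and the toy dimensions are ours, not part of any released task code.

import numpy as np

def video_level_descriptor(frame_descriptors):
    """Average-pool per-frame descriptors into one video-level vector.

    frame_descriptors: array of shape (num_frames, dim), one row per
    sampled frame, for a single feature modality (e.g., CNN fc7).
    """
    frames = np.asarray(frame_descriptors, dtype=np.float64)
    return frames.mean(axis=0)

# Example: 30 sampled frames with 4096-dim fc7 descriptors
# are pooled into a single 4096-dim video-level vector.
pooled = video_level_descriptor(np.random.rand(30, 4096))
assert pooled.shape == (4096,)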
Style Attributes: We consider the photographic style attributes of [5] as high-level descriptors; these attributes have been shown to be highly related to aesthetics and interestingness. The descriptor is formed by concatenating the classification outputs of 14 photographic style classifiers (e.g., Complementary Colors, Duotones, Rule of Thirds, Vanishing Point).

SentiBank: SentiBank contains 1,200 concepts, each defined as an adjective-noun pair, e.g., "crazy cat" and "lovely girl", where the adjective is strongly related to emotions and the noun corresponds to objects and scenes that are expected to be automatically detectable. Models for detecting the concepts were trained on Flickr images [1]. This set of attributes is intuitively effective for emotion-related objects and scenes. Since interesting images/videos are often related to strong emotions, these attributes are expected to be a very helpful cue for predicting interestingness.

2.2 Classifiers
Several classifiers are investigated in order to be robust to the diversity and complexity of similar visual content. In particular, we use SVM, Ranking-SVM and Deep Neural Networks (DNN) for feature fusion and classification, as explained below.

SVM: A χ2 kernel was adopted for the bag-of-words features (denseSIFT), and a Gaussian RBF kernel was used for the others. For feature fusion, kernel-level average fusion was used, which linearly combines the kernels computed on the different features.
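The sketch below illustrates kernel-level average fusion with scikit-learn; the toy data, dimensions and default kernel parameters are our assumptions here, not the exact settings used in our runs.

import numpy as np
from sklearn.metrics.pairwise import chi2_kernel, rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_train, n_test = 100, 20

# Toy stand-ins for the real features: denseSIFT BoW histograms
# (non-negative, as required by the chi2 kernel) and a dense
# descriptor such as CNN fc7.
sift_tr, sift_te = rng.random((n_train, 500)), rng.random((n_test, 500))
cnn_tr, cnn_te = rng.standard_normal((n_train, 4096)), rng.standard_normal((n_test, 4096))
y_tr = rng.integers(0, 2, n_train)  # toy interesting / not-interesting labels

# Chi-squared kernel for the bag-of-words feature, RBF for the dense one;
# kernel-level average fusion linearly combines the two kernel matrices.
K_tr = (chi2_kernel(sift_tr) + rbf_kernel(cnn_tr)) / 2.0
K_te = (chi2_kernel(sift_te, sift_tr) + rbf_kernel(cnn_te, cnn_tr)) / 2.0

# The fused kernel is fed to an SVM with a precomputed kernel.
clf = SVC(kernel="precomputed").fit(K_tr, y_tr)
scores = clf.decision_function(K_te)  # per-item interestingness confidence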

Ranking-SVM: As the interestingness level also affects the classification result, we train a model to compare the interestingness of different images/videos, adopting Joachims' Ranking SVM [4] to enhance the final results. To fully use the training data, we organize it into pairs, with a ground-truth label indicating which item of each pair is more interesting. Score-level average late fusion was adopted to combine the results of the SVM and the Ranking-SVM.
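Below is a schematic Python sketch of the pairwise reduction behind a Ranking-SVM and of the score-level late fusion. We approximate Joachims' SVMrank with a linear SVM trained on difference vectors, and the min-max normalization before averaging is our assumption; the stand-in scores are random toy data.

import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, interestingness):
    """Build ranking pairs: for items i, j with different ground-truth
    interestingness, the difference vector x_i - x_j gets label +1 if
    item i is the more interesting one, else -1."""
    diffs, labels = [], []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if interestingness[i] == interestingness[j]:
                continue  # ties carry no ranking information
            diffs.append(X[i] - X[j])
            labels.append(1 if interestingness[i] > interestingness[j] else -1)
    return np.array(diffs), np.array(labels)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 128))   # toy fused descriptors
y = rng.random(50)                   # toy interestingness levels
P, L = pairwise_transform(X, y)
ranker = LinearSVC().fit(P, L)       # linear Ranking-SVM surrogate

def minmax(s):
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

# Score-level average late fusion of SVM and Ranking-SVM outputs.
svm_scores = rng.random(50)                   # stand-in for SVC decision values
rank_scores = X @ ranker.coef_.ravel()        # higher = more interesting
final_scores = (minmax(svm_scores) + minmax(rank_scores)) / 2.0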
DNN: We also adopted the DNN-based classifier proposed in our recent work [6]. The fusion methods used with SVM classifiers can take advantage of different features; however, they often neglect the hidden relations shared among those features. We therefore proposed a regularized DNN that explores the relationships between distinct features, which has been found useful for image/video classification. Specifically, for each input feature, a layer of neurons first performs feature abstraction. Feature fusion is then performed by another layer with a carefully designed structural-norm regularization on the network weights, so that the feature relationships are taken into account. The fused representation is finally used to build a classification model in the last layer. With this network, we fuse features by considering both feature correlation and feature diversity while performing classification simultaneously. Please see [6] for more details.
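The PyTorch sketch below illustrates only the overall shape of such a network; the layer sizes are arbitrary and the L2,1 group norm is a simplified stand-in for the structural-norm regularizer, so please consult [6] for the actual architecture and objective.

import torch
import torch.nn as nn

class RegularizedFusionDNN(nn.Module):
    """Schematic fusion network: one abstraction layer per input feature,
    a fusion layer whose weights carry a structural-norm penalty, and a
    final classification layer."""

    def __init__(self, feature_dims, hidden=256, fused=128):
        super().__init__()
        self.abstraction = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in feature_dims]
        )
        self.fusion = nn.Linear(hidden * len(feature_dims), fused)
        self.classifier = nn.Linear(fused, 2)  # interesting vs. not

    def forward(self, features):
        hidden = [layer(x) for layer, x in zip(self.abstraction, features)]
        fused = torch.relu(self.fusion(torch.cat(hidden, dim=1)))
        return self.classifier(fused)

    def structural_penalty(self):
        # L2,1 (group-sparse) norm over the fusion weights: the sum, over
        # input units, of the L2 norm of their outgoing weight columns.
        # A simplified stand-in for the regularizer of [6].
        return self.fusion.weight.norm(p=2, dim=0).sum()

# Toy dims: fc7 (4096), denseSIFT BoW (500), style + SentiBank (1214).
model = RegularizedFusionDNN([4096, 500, 1214])
logits = model([torch.randn(8, 4096), torch.randn(8, 500), torch.randn(8, 1214)])
# Training objective: loss = cross_entropy(logits, y) + lam * model.structural_penalty()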
[Figure 1 (diagram): images and video frames pass through feature extraction (SIFT, CNN, Style, SentiBank, Face) and are then classified by the DNN, the SVM, and the SVM fused with the Ranking-SVM.]
Figure 1: An overview of the key components in our proposed methods. We use DNN and SVM for both subtasks, while Ranking-SVM is only used in the video subtask. The face feature, computed from the movement of faces within a video shot, is also only used in the video subtask.

3. SUBMITTED RUNS AND RESULTS
There are two subtasks in this year's evaluation, namely predicting video interestingness and predicting image interestingness. We submitted 5 runs for official evaluation: 2 runs for the image subtask and 3 runs for the video subtask. Run1 and Run4 used SVM for the video and image subtasks respectively, Run2 and Run5 used DNN for the video and image subtasks respectively, and Run3 used SVM fused with Ranking-SVM for the video subtask.

Figure 2 summarizes the results of all submissions. The official performance measure is MAP for both the video and image subtasks. On the image subtask, the DNN (Run5) significantly outperforms the SVM classifier (Run4), since feature correlation plays an important role in feature fusion for the interestingness task. This clearly confirms the effectiveness of our proposed deep network. Our experiments also verify that the high-level attributes are complementary to the visual and CNN features.

[Figure 2 (bar chart): MAP of 0.148, 0.151 and 0.154 for VideoTask-Run1, Run2 and Run3, and 0.179 and 0.229 for ImageTask-Run4 and Run5.]
Figure 2: Performance of our 5 submitted runs on both the video and image subtasks. AP is computed on a per-trailer basis over the top N best-ranked images/video shots, and MAP is averaged over all trailers.
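For reference, a minimal sketch of this measure, assuming binary relevance labels; the exact official cut-off N and the evaluation tooling are specified by the organizers [2].

import numpy as np

def average_precision(ranked_labels):
    """AP for one trailer: ranked_labels holds the ground-truth labels
    (1 = interesting, 0 = not) of that trailer's shots, sorted by our
    predicted score, best first."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

def mean_average_precision(per_trailer_rankings):
    """MAP: AP averaged over all trailers."""
    return float(np.mean([average_precision(r) for r in per_trailer_rankings]))

# Example with two toy trailers:
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1]]))  # ~0.708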
For the video subtask, besides the visual and high-level features, we also combined the face features. The experiments show that these features are complementary to each other: the visual features and the high-level attributes both contribute to determining whether a video clip is interesting or not. We found that adding audio features such as MFCCs (Mel-Frequency Cepstral Coefficients) may worsen the results, possibly because the video shots are very short and cannot provide continuous and useful audio information. We also tried adding ranking information for the video subtask (Run3); it shows a slight improvement over SVM (Run1) and DNN (Run2), which indicates that the interestingness level may further improve the results.

It is also worth mentioning that the results on the image subtask are better than those on the video subtask. This may be because averaging the frame features weakens the weight of the interesting information. How to fully exploit the effective information in video clips is a future direction.

4. CONCLUSIONS
We have explored both the SVM model and a DNN to achieve better classification of image and video interestingness. Our experiments have shown that the DNN-based method outperforms the SVM model by considering feature correlation. Additionally, the high-level attributes are complementary to the visual and CNN features for predicting interestingness. Nevertheless, our experimental results indicate that the visual and audio features may lack discriminative power for interestingness. As future work on predicting interestingness, we will therefore consider extracting from the images and videos the accompanying text, which may contain textual descriptions of interestingness (from a linguistic perspective).
5.   REFERENCES
[1] D. Borth, T. Chen, R. Ji, and S.-F. Chang. SentiBank:
    large-scale ontology and classifiers for detecting
    sentiment and emotions in visual content. In ACM
    MM, 2013.
[2] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do,
    H. Wang, N. Q. Duong, and F. Lefebvre. MediaEval
    2016 Predicting Media Interestingness Task. In Proc. of
    the MediaEval 2016 Workshop, Hilversum, Netherlands,
    Oct. 20-21, 2016.
[3] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang.
    Super fast event recognition in internet videos. IEEE
    TMM, 17(8):1–13, 2015.
[4] T. Joachims. Optimizing search engines using
    clickthrough data. In ACM SIGKDD, 2002.
[5] N. Murray, L. Marchesotti, and F. Perronnin. AVA: A
    large-scale database for aesthetic visual analysis. In
    CVPR, 2012.
[6] Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue.
    Exploring inter-feature and inter-class relationships
    with deep neural networks for video classification. In
    ACM MM, 2014.