                                         TCNJ-CS @ MediaEval 2017
                                    Predicting Media Interestingness Task
                                                                Sejong Yoon
                                                      The College of New Jersey, USA
                                                             yoons@tcnj.edu

Copyright held by the owner/author(s).
MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
In this paper, we present our approach to, and investigation of, the MediaEval 2017 Predicting Media Interestingness Task. We used most of the visual and auditory features provided. The standard kernel fusion technique was applied to combine features, and we used a ranking support vector machine to learn the classification model. No extra data was introduced to train the model. Official results, as well as our investigation of the task data, are provided at the end.

1 INTRODUCTION
MediaEval 2017 Predicting Media Interestingness [2] consists of two subtasks. In the first subtask, the system should predict whether common viewers will consider a given image interesting. In the second subtask, the same prediction should be made for a video segment. In both subtasks, the system should predict both the binary decision of whether the media item is interesting and the ranking of the image frame/video segment among all image frames/video segments within the same movie. The data consists of 108 video clips. In total, 7,396 key-frames and the same number of video segments are provided in the development set, and 2,436 key-frames and the same number of video segments are reserved for the test set. In this work, we used most of the features provided by the task organizers and did not introduce any external data, e.g., meta-data, ratings, or reviews of the movies.

2 APPROACH
In this section, we first describe the features we employed and then present our classification method.

2.1 Features
We used features from different modalities. All features were provided by the task organizers.
   Visual Features. We used nearly all of the provided features, including the color histogram in HSV space, GIST [9], Dense SIFT [7], HOG 2x2 [1], Local Binary Patterns (LBP) [8], the prob layer (fc8, the probabilities of the predicted labels of 1,000 objects) of AlexNet [5], and C3D [10].
   Audio Features. We used the provided Mel-frequency Cepstral Coefficients (MFCC) features. An MFCC descriptor (60 dimensions) is computed over every 32ms temporal window with a 16ms shift. The first and second derivatives of the cepstral vectors are also included in the MFCC descriptors.
   For the image prediction task, we vectorized each feature per frame. For the video prediction task, we took the mean of the raw feature values of all frames in the segment. Given the original feature f_{t,n} for the n-th frame in the t-th segment, we compute the summarized feature for segment t as

      x_t = \frac{1}{N} \sum_{n=1}^{N} f_{t,n},                (1)

where N denotes the total number of frames in the segment.
   We used the prob (fc8) layer to incorporate semantic information of the training data that can be extracted from the deep neural network.
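   As a concrete illustration of Eq. (1), the following minimal sketch (Python/NumPy; the array shapes, names, and toy values are our own and not part of the task release) mean-pools per-frame feature vectors into a single segment-level descriptor.

      import numpy as np

      def summarize_segment(frame_features):
          """Eq. (1): mean-pool per-frame features into one segment descriptor.

          frame_features: array of shape (N, D) holding the raw feature vector
          f_{t,n} of each of the N frames in segment t.
          Returns x_t, an array of shape (D,).
          """
          frame_features = np.asarray(frame_features, dtype=np.float64)
          return frame_features.mean(axis=0)

      # Hypothetical usage: a segment with N = 4 frames and D = 3 dimensions.
      segment = np.array([[0.1, 0.2, 0.3],
                          [0.3, 0.2, 0.1],
                          [0.2, 0.2, 0.2],
                          [0.4, 0.0, 0.2]])
      x_t = summarize_segment(segment)   # -> [0.25, 0.15, 0.2]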
2.2 Classification
We applied the standard kernel fusion approach: we compute a kernel for each type of feature and combine the kernels either by addition or by multiplication. We used multiplication within the same modality and addition across different modalities. For the image prediction subtask, we used the following combination of kernels:

      K_1 = K_{chist} \cdot K_{gist},                          (2)
      K_2 = K_{dhist} \cdot K_{hog} \cdot K_{lbp},             (3)
      K_{all} = K_1 + K_2 + K_{prob}.                          (4)

The rationale behind this choice was to treat features based on global histograms and features based on spatial pyramids [6] as different modalities. We present results for different kernel combinations on the development set in the following section. The CNN probability layer, K_{prob}, is also considered a separate modality, since it conveys semantic information (the objects in the images). For the video prediction subtask, we used the following combination of kernels:

      K_{all} = K_1 + K_2 + K_{prob} + K_{c3d} + K_{mfcc}.     (5)

Since the C3D and MFCC features model the temporal aspect of the input, we consider them modalities distinct from the visual features. For the kernels themselves, we used the RBF kernel, with the hyper-parameter set from the median of the training data.
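   The following sketch illustrates the fusion scheme of Eqs. (2)-(5). It is an illustration only, not the exact code of our system: the median-distance bandwidth is our reading of "the median of the training data", and the feature names in the dictionary are shorthand for the provided features.

      import numpy as np
      from scipy.spatial.distance import cdist

      def rbf_kernel_median(X, Y=None):
          """RBF kernel whose bandwidth is set from the median pairwise
          distance of the training data (a common heuristic; assumed here)."""
          Y = X if Y is None else Y
          d = cdist(X, Y)                    # Euclidean distances
          sigma = np.median(cdist(X, X))     # median heuristic on training data
          return np.exp(-d ** 2 / (2 * sigma ** 2))

      def fuse_image_kernels(feats):
          """feats: dict mapping feature name -> (n_samples, dim) array.
          Eqs. (2)-(4): multiply within a modality, add across modalities."""
          K = {name: rbf_kernel_median(X) for name, X in feats.items()}
          K1 = K["chist"] * K["gist"]              # Eq. (2): global histograms
          K2 = K["dhist"] * K["hog"] * K["lbp"]    # Eq. (3): spatial-pyramid features
          return K1 + K2 + K["prob"]               # Eq. (4)

      def fuse_video_kernels(feats):
          """Eq. (5): add the temporal-modality kernels (C3D, MFCC)."""
          return (fuse_image_kernels(feats)
                  + rbf_kernel_median(feats["c3d"])
                  + rbf_kernel_median(feats["mfcc"]))

Element-wise products and sums of valid kernel matrices remain valid kernels, which is what makes this simple combination rule well defined.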
   For the classification model, we used a ranking support vector machine. We used SVM^rank [4] to learn pair-wise ranking patterns from the development set data, following prior work [3].
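   To make the pair-wise ranking idea explicit, the sketch below shows the standard reduction from within-movie rankings to pairwise difference examples trained with a linear SVM. This is only an illustration of the principle that SVM^rank optimizes far more efficiently; it is not the tool itself, and the function and variable names are our own.

      import numpy as np
      from sklearn.svm import LinearSVC

      def pairwise_examples(X, ranks, groups):
          """Turn per-movie rankings into pairwise difference examples.

          X: (n, d) feature matrix; ranks: higher = more interesting;
          groups: movie id of each sample. Pairs are formed only within a
          movie, mirroring the per-movie grouping described in the text.
          """
          X, ranks, groups = map(np.asarray, (X, ranks, groups))
          diffs, labels = [], []
          for g in np.unique(groups):
              idx = np.where(groups == g)[0]
              for a in idx:
                  for b in idx:
                      if ranks[a] > ranks[b]:      # a should rank above b
                          diffs.append(X[a] - X[b]); labels.append(1)
                          diffs.append(X[b] - X[a]); labels.append(-1)
          return np.array(diffs), np.array(labels)

      # Hypothetical usage: fit on the differences, then score test items with
      # the learned weight vector to obtain a ranking within each movie.
      # X_diff, y_diff = pairwise_examples(X_train, ranks_train, movie_ids)
      # model = LinearSVC(C=1.0).fit(X_diff, y_diff)
      # scores = X_test @ model.coef_.ravel()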
3 RESULTS AND ANALYSIS
The official evaluation metric is the mean average precision at 10 (MAP@10), computed over all videos using the top 10 best-ranked images/video segments. First, we present the different kernel combinations we tested on the development set. Table 1 describes the different kernel fusion formulas used in the experiments. We report both MAP and MAP@10 results in Table 2.
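   For completeness, the sketch below shows how we read the MAP@10 metric: average precision truncated at rank 10 within each movie, then averaged over movies. The organizers' official evaluation script is authoritative; normalization conventions for AP@k vary, and the one below is an assumption.

      import numpy as np

      def average_precision_at_k(ranked_labels, k=10):
          """AP@k for one movie. ranked_labels holds the ground-truth 0/1
          interestingness of items sorted by predicted score (best first)."""
          ranked_labels = np.asarray(ranked_labels[:k], dtype=float)
          if ranked_labels.sum() == 0:
              return 0.0
          precisions = np.cumsum(ranked_labels) / (np.arange(len(ranked_labels)) + 1)
          # Normalized by the number of relevant items in the top k
          # (other conventions divide by min(total relevant, k)).
          return float((precisions * ranked_labels).sum() / ranked_labels.sum())

      def map_at_10(per_movie_ranked_labels):
          """MAP@10: mean of the per-movie AP@10 values."""
          return float(np.mean([average_precision_at_k(r, 10)
                                for r in per_movie_ranked_labels]))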

Table 1: Different visual feature combinations

      Combined kernel    Fusion formula
      V1                 K_1 · K_2 · K_{prob}
      V2                 K_1 · K_2 + K_{prob}
      V3                 K_1 + K_2 + K_{prob}

Table 2: Results of all subtasks on the development set

      Subtask    Measure    Result    Kernel
      Image      MAP        0.3065    V1
                 MAP@10     0.0123    V1
      Image      MAP        0.3013    V2
                 MAP@10     0.0094    V2
      Image      MAP        0.3003    V3
                 MAP@10     0.0074    V3
      Video      MAP        0.3052    V1 + K_{c3d} + K_{mfcc}
                 MAP@10     0.0084    V1 + K_{c3d} + K_{mfcc}
      Video      MAP        0.3055    V2 + K_{c3d} + K_{mfcc}
                 MAP@10     0.0082    V2 + K_{c3d} + K_{mfcc}
      Video      MAP        0.3038    V3 + K_{c3d} + K_{mfcc}
                 MAP@10     0.0082    V3 + K_{c3d} + K_{mfcc}

Table 3: Results of all subtasks on the test set

      Subtask    Measure    Result    Kernel
      Image      MAP        0.1331    V3
                 MAP@10     0.0126    V3
      Video      MAP        0.1774    V3 + K_{c3d} + K_{mfcc}
                 MAP@10     0.0524    V3 + K_{c3d} + K_{mfcc}


As one can see, there are no significant differences among the kernel fusion choices. We used a 50-50 split, i.e., 39 movies each for the training and testing splits of the development set.
   We also report both MAP and MAP@10 results on the test set, as provided by the task organizers, in Table 3. As described in the previous section, we used the visual feature combination of Eq. (4) for the image prediction task and the multi-modal combination of Eq. (5) for the video prediction task. SVM^rank takes the ranking information as the label of the input data and generates pairwise constraints. All ranking information provided in the development set was used for training the SVM^rank model, with the image snapshots and video segments of each movie grouped together.
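   To make this per-movie grouping concrete, the sketch below writes items in the SVMlight-style ranking format that SVM^rank reads, using the movie id as the query id (qid) so that ranking constraints are only generated within a movie. It assumes an explicit (concatenated) feature representation purely for illustration; our actual runs used the fused kernels of Section 2.2, and the tool's documentation remains authoritative for the exact file layout.

      def write_svmrank_file(path, features, ranks, movie_ids):
          """Write one line per item: "<rank> qid:<movie> <idx>:<value> ...".

          features: iterable of 1-D feature vectors,
          ranks: target ranking values (higher = more interesting),
          movie_ids: integer id of the movie each item belongs to; items are
          only compared against others sharing the same qid (same movie).
          """
          with open(path, "w") as f:
              for x, rank, movie in zip(features, ranks, movie_ids):
                  feats = " ".join(f"{i + 1}:{v:.6f}" for i, v in enumerate(x))
                  f.write(f"{rank} qid:{movie} {feats}\n")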
   As can be seen, the system shows low performance in both the image and video subtasks. This is not surprising given the very simple nature of the approach we applied to the task. What was not expected is that the video prediction result is much better (although still not at the level of good performance) than the image prediction result, which was not observed on the development set. This is interesting because we used the same set of features for the image and video prediction subtasks, and the only differences are the two additional features modeling the temporal aspect of the data (C3D, MFCC). We believe this reiterates a known understanding of the task: we must somehow incorporate temporal information to improve video interestingness prediction.

4 DISCUSSION AND OUTLOOK
One of the major challenges in video interestingness prediction is filling the semantic gap. Initially, we intended to fill this gap by capturing the expected emotional state of viewers and connecting it to the notion of interestingness. Table 4 shows our categorization of the most interesting segments in each movie clip, which we gathered while working on the task.

Table 4: Key-frames of the most interesting segments in some development set movies, categorized into types of interest stimuli. (The key-frame images are omitted here; the categories are: Violence; Nudity; Horror / Surprise; Romantic mood; Facial expression; Joyful, Fun, Humor; Open view / scenery; Others (context).)

As can be seen, many of the categories are closely related to key emotional states that existing affect prediction methods can already predict. This is particularly true for violence, horror, and joy, which account for a large proportion of the most interesting video segments. On the other hand, there are many other video segments for which one cannot readily identify the root of the interest stimulus; these typically require a higher-level understanding of the context. The best example is the third movie in the Others category, which requires the fusion of all modalities plus reading a sentence shown in the image frame.
   In the future, we hope to tackle the media interestingness prediction problem in this direction. Perhaps the most promising approach at this point is to understand human activities and link them to emotions and interestingness.

ACKNOWLEDGMENTS
This work was supported in part by The College of New Jersey under the Support Of Scholarly Activity (SOSA) 2017-2019 grant.


REFERENCES
 [1] N. Dalal and B. Triggs. 2005. Histograms of oriented gradients for
     human detection. In 2005 IEEE Computer Society Conference on Com-
     puter Vision and Pattern Recognition (CVPR’05), Vol. 1. 886–893 vol. 1.
     https://doi.org/10.1109/CVPR.2005.177
 [2] Claire-Hélène Demarty, Mats Sjöberg, Bogdan Ionescu, Thanh-Toan
     Do, Michael Gygli, and Ngoc Q. K. Duong. 2017. Predicting Media
     Interestingness Task at MediaEval 2017. In Proc. of the MediaEval 2017
     Workshop, Dublin, Ireland, Sept. 13-15, 2017.
 [3] Yu-Gang Jiang, Yanran Wang, Rui Feng, Xiangyang Xue, Yingbin
     Zheng, and Hanfang Yang. 2013. Understanding and Predicting Inter-
     estingness of Videos. In AAAI.
 [4] Thorsten Joachims. 2006. Training Linear SVMs in Linear Time. In
     Proceedings of the ACM Conference on Knowledge Discovery and Data
     Mining (KDD).
 [5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Ima-
     geNet Classification with Deep Convolutional Neural Networks. In
     Advances in Neural Information Processing Systems 25 (NIPS).
 [6] S. Lazebnik, C. Schmid, and J. Ponce. 2006. Beyond Bags of Features:
     Spatial Pyramid Matching for Recognizing Natural Scene Categories.
     In 2006 IEEE Computer Society Conference on Computer Vision and
     Pattern Recognition (CVPR’06), Vol. 2. 2169–2178. https://doi.org/10.
     1109/CVPR.2006.68
 [7] David G. Lowe. 2004. Distinctive Image Features from Scale-Invariant
     Keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.
 [8] Timo Ojala, Matti Pietikäinen, and Topi Mäenpää. 2002. Multireso-
     lution Gray-Scale and Rotation Invariant Texture Classification with
     Local Binary Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24, 7
     (July 2002), 971–987. https://doi.org/10.1109/TPAMI.2002.1017623
 [9] Aude Oliva and Antonio Torralba. 2001. Modeling the Shape of
     the Scene: A Holistic Representation of the Spatial Envelope. Int.
     J. Comput. Vision 42, 3 (May 2001), 145–175. https://doi.org/10.1023/A:
     1011139631724
[10] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and
     Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D
     Convolutional Networks. In ICCV.