 GIBIS at MediaEval 2017: Predicting Media Interestingness Task
                                                      Jurandy Almeida and Ricardo M. Savii
                       GIBIS Lab, Institute of Science and Technology, Federal University of São Paulo – UNIFESP
                                                12247-014, São José dos Campos, SP – Brazil
                                              {jurandy.almeida,ricardo.manhaes}@unifesp.br

Copyright held by the owner/author(s).
MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
This paper describes the GIBIS team experience in the Predicting Media Interestingness Task at MediaEval 2017. In this task, the teams were required to develop an approach to predict whether images or videos are interesting or not. Our proposal relies on late fusion with rank aggregation methods for combining ranking models learned with different features and by different learning-to-rank algorithms.
1    INTRODUCTION
In this paper, we explore the use of rank aggregation methods for predicting the interestingness of images and videos. For that, content-based representations of images and videos are obtained with different features, which are used to train different learning-to-rank algorithms, creating rankers capable of predicting the interestingness degree of images and videos. Then, the information provided by the different feature-ranker pairs is combined by rank aggregation methods, yielding more effective predictions [3].
   This work is developed in the context of the MediaEval 2017 Predicting Media Interestingness Task, whose goal is to automatically select the most interesting frames or portions of videos according to a common viewer by using features derived from audio-visual content or associated textual information. Details about the data, task, and evaluation are described in [7].
2    PROPOSED APPROACH
The starting point for our proposal is the work of Almeida [1], where motion features were extracted from videos and then used to train four different ranking models, which were combined with a majority voting strategy [13]. The key idea exploited in Almeida's work was the use of multiple learning-to-rank algorithms, and their combination was pointed out as promising.
   Here, we extend the work of Almeida [1] by exploring rank aggregation methods for combining ranking models learned with different features and by different learning-to-rank algorithms.
2.1    Features
Images. For the image subtask, we used only the pre-computed features provided by the task organizers [7]. Five low-level features were considered: Dense SIFT, Histogram of Gradients (HoG), Local Binary Patterns (LBP), GIST, and Color Histogram. Also, two deep learning features were used, which refer to Convolutional Neural Network (CNN) features extracted from the last layers (i.e., fc7 and prob) of the pre-trained AlexNet model [11].
Videos. For the video subtask, we used nine pre-computed features provided by the task organizers [7]. One of them represents audio information: Mel-Frequency Cepstral Coefficients (MFCC). Seven features are the same as those used for images and encode visual content: five low-level features (Dense SIFT, HoG, LBP, GIST, and Color Histogram) and two deep learning features (CNN-fc7 and CNN-prob). These eight features are frame-based representations [11]. To obtain a single video representation, we built a Bag-of-Features (BoF) [4] model for each feature. In the BoF framework, visual words [15] are obtained by quantizing a feature space according to a pre-learned dictionary. Thus, a video is represented as a normalized frequency histogram of the visual words associated with each feature. In this work, we construct a codebook of 4000 visual words using a random selection. In addition, we considered three video-based representations. One of them is also a pre-computed feature provided by the task organizers, denoted C3D [16]. The two others refer to additional visual features we extracted from the videos: Histogram of Motion Patterns (HMP) [2] and Bag-of-Attributes (BoA) [8].
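To make the BoF step concrete, the sketch below (Python/NumPy, our own illustration rather than the task-provided code) builds a random codebook from pooled frame descriptors and encodes a video as a normalized word-frequency histogram. The function names, the Euclidean nearest-word assignment, and the variables in the usage comment are our assumptions.

import numpy as np

def build_codebook(descriptors, n_words=4000, seed=0):
    """Randomly pick n_words frame descriptors to act as visual words."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(descriptors), size=n_words, replace=False)
    return descriptors[idx]

def bof_histogram(frame_descriptors, codebook):
    """Assign each frame descriptor to its nearest visual word and return
    the L1-normalized frequency histogram used as the video representation."""
    # Squared Euclidean distance between every descriptor and every word.
    d2 = ((frame_descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Hypothetical usage for one feature space (e.g., Dense SIFT):
# pool = np.vstack(training_frame_descriptors)            # assumed variable
# codebook = build_codebook(pool, n_words=4000)
# video_vector = bof_histogram(video_frame_descriptors, codebook)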
2.2    Learning-to-Rank Algorithms
Each of the above features was used as input to train four different learning-to-rank algorithms, which are the same as those used in [1]. The first three are based on pairwise comparisons: Ranking SVM [12], RankNet [5], and RankBoost [10]. The fourth considers lists of objects by using ListNet [6].
   The SVMrank package [12] (https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html, as of August 2017) was used for running Ranking SVM. The RankLib package (https://sourceforge.net/p/lemur/wiki/RankLib/, as of August 2017) was used for running RankNet, RankBoost, and ListNet. Ranking SVM was configured with a linear kernel; the others were configured with their default parameter settings.
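Both packages read training data in the SVMlight/LETOR text format, one object per line with a relevance label, a query identifier, and indexed feature values. The helper below is a minimal sketch of how the feature vectors could be serialized for either tool; grouping objects by movie trailer as the ranking "query" and the helper name write_letor_file are our assumptions, not something prescribed by the task or the paper.

def write_letor_file(path, labels, qids, features):
    """Write objects in SVMlight/LETOR format:
    <label> qid:<qid> 1:<v1> 2:<v2> ... (one line per object),
    the input format expected by SVMrank and RankLib."""
    with open(path, "w") as out:
        for label, qid, vec in zip(labels, qids, features):
            feats = " ".join(f"{i + 1}:{v:.6f}" for i, v in enumerate(vec))
            out.write(f"{label} qid:{qid} {feats}\n")

# Hypothetical usage: one training file per feature, trailers as ranking groups.
# write_letor_file("cnn_fc7.train.txt", interestingness, trailer_ids, cnn_fc7_vectors)
# Training could then be launched, e.g., with svm_rank_learn (SVMrank) or
# java -jar RankLib.jar -ranker 1 (RankNet in RankLib); exact flags depend on the tool version.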
2.3    Rank Aggregation Models
Let C = {o_1, o_2, ..., o_n} be a collection of n objects (i.e., images or videos). Let R = {r_1, r_2, ..., r_m} be a set of m feature-ranker pairs. Let ρ_j(i) be the interestingness degree assigned by the feature-ranker pair r_j ∈ R to the object o_i ∈ C. Based on the score ρ_j, a ranked list τ_j can be computed. The ranked list τ_j can be defined as a permutation of the collection C, which contains the most interesting objects according to the feature-ranker pair r_j. A permutation τ_j is a bijection from the set C onto the set [n] = {1, 2, ..., n}. For a permutation τ_j, we interpret τ_j(i) as the position (or rank) of the object o_i in the ranked list τ_j. We can say that, if o_i is ranked before o_k in the ranked list τ_j, that is, τ_j(i) < τ_j(k), then ρ_j(i) ≤ ρ_j(k) [3].
   Given the different scores ρ_j and their respective ranked lists τ_j computed by the distinct pairs r_j ∈ R, a rank aggregation method aims to compute a fused score F(i) for each object o_i [3]. In this work, we used three different methods based on score and rank information:
   (1) Borda Method [17]: F(i) = Σ_{j=1}^{m} τ_j(i),
   (2) Multiplicative Approach [14]: F(i) = Π_{j=1}^{m} (1 + ρ_j(i)),
   (3) Weighted Sum Model [9]: F(i) = Σ_{j=1}^{m} τ_j(i) × ρ_j(i).
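For reference, the three fusion rules can be implemented directly from the scores and the ranked lists they induce. The sketch below is a minimal NumPy illustration under the paper's convention that positions and scores are oriented the same way (objects ranked earlier have smaller τ_j(i) and ρ_j(i)); the function names are ours.

import numpy as np

def positions_from_scores(scores):
    """scores: (m, n) array with one row of ρ_j values per feature-ranker pair.
    Returns the (m, n) array of 1-based positions τ_j(i), obtained by sorting
    each row in ascending order of score."""
    order = scores.argsort(axis=1)
    tau = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    tau[rows, order] = np.arange(1, scores.shape[1] + 1)
    return tau

def borda(scores):           # F(i) = sum_j tau_j(i)
    return positions_from_scores(scores).sum(axis=0)

def multiplicative(scores):  # F(i) = prod_j (1 + rho_j(i))
    return np.prod(1.0 + scores, axis=0)

def weighted_sum(scores):    # F(i) = sum_j tau_j(i) * rho_j(i)
    return (positions_from_scores(scores) * scores).sum(axis=0)

# Example with m = 2 feature-ranker pairs and n = 3 objects:
# scores = np.array([[0.2, 0.9, 0.5],
#                    [0.1, 0.4, 0.8]])
# borda(scores) -> array([2, 5, 5]); object 0 is ranked first under every rule.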
3    EXPERIMENTS & RESULTS
Five different runs were submitted for each subtask, configured as shown in Table 1 (run 1 of the image subtask and run 2 of the video subtask were the required runs for the task, while the other runs were optional). For both subtasks, the first run is the best feature-ranker pair in isolation, and the others refer to the fusion of the top performing feature-ranker pairs with rank aggregation methods. All the evaluated approaches were calibrated through a 3-fold cross-validation on the development data.

Table 1: Configuration of the submitted runs.

Subtask   Run   Fusion           Feature-Ranker Pairs
Image     1     -                CNN-fc7 & RankBoost
Image     2     Weighted Sum     CNN-fc7 & RankBoost, CNN-fc7 & RankNet
Image     3     Multiplicative   CNN-fc7 & RankBoost, CNN-fc7 & RankNet, CNN-fc7 & RankSVM, CNN-prob & RankSVM
Image     4     Borda            CNN-fc7 & RankBoost, CNN-fc7 & RankNet, CNN-fc7 & RankSVM, CNN-prob & RankSVM
Image     5     Weighted Sum     CNN-fc7 & RankBoost, CNN-fc7 & RankNet, CNN-fc7 & RankSVM, CNN-prob & RankSVM
Video     1     -                HMP & RankSVM
Video     2     Multiplicative   HMP & RankSVM, MFCC & RankSVM
Video     3     Multiplicative   HMP & RankSVM, HMP & RankBoost, C3D & RankNet, HoG & RankSVM, Dense SIFT & RankBoost
Video     4     Borda            HMP & RankSVM, HMP & RankBoost, C3D & RankNet, HoG & RankSVM, Dense SIFT & RankBoost
Video     5     Weighted Sum     HMP & RankSVM, HMP & RankBoost, C3D & RankNet, HoG & RankSVM, Dense SIFT & RankBoost
   The development data is composed of 7,396 videos from 78 movie trailers. For the image subtask, the middle keyframe of each video was extracted, forming a dataset with 7,396 images. Each of the features (Section 2.1) was used as input to train each of the learning-to-rank algorithms (Section 2.2). In this way, we obtained 28 feature-ranker pairs (i.e., 7 features × 4 rankers) for the image subtask and 44 feature-ranker pairs (i.e., 11 features × 4 rankers) for the video subtask. Next, each of the feature-ranker pairs was used to predict the interestingness degree of test images and videos. Finally, the prediction scores of the top performing feature-ranker pairs in isolation were combined using rank aggregation methods (Section 2.3), producing fused prediction scores.
   To assess the effectiveness of each approach, we computed the Mean Average Precision (MAP). For that, we transformed prediction scores into binary decisions using the strategy proposed in [1]. First, the prediction scores associated with images and videos of a same movie trailer were normalized using a z-score normalization. Then, an empirical threshold of 0.7 was applied to the normalized prediction scores, producing binary decisions.
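The sketch below illustrates this protocol, assuming that larger prediction scores mean more interesting (consistent with the thresholding step) and that predictions and ground-truth labels are grouped per movie trailer; the tie handling of the official evaluation tool may differ.

import numpy as np

def binarize(scores, threshold=0.7):
    """z-score normalize the predictions of one movie trailer and mark as
    interesting those exceeding the empirical threshold of 0.7."""
    z = (scores - scores.mean()) / (scores.std() + 1e-12)
    return (z > threshold).astype(int)

def average_precision(relevant, scores):
    """Average precision for one trailer: precision accumulated at the rank of
    each relevant (ground-truth interesting) item, items sorted by decreasing score."""
    order = np.argsort(-np.asarray(scores))
    rel = np.asarray(relevant)[order]
    hits = np.cumsum(rel)
    ranks = np.arange(1, len(rel) + 1)
    if hits[-1] == 0:
        return 0.0
    return float(((hits / ranks) * rel).sum() / rel.sum())

# Hypothetical usage with predictions and labels keyed by trailer id:
# decisions = {t: binarize(pred[t]) for t in pred}                              # submitted decisions
# mean_ap = float(np.mean([average_precision(gt[t], pred[t]) for t in gt]))     # MAP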
   Table 2 presents the MAP scores obtained for each run on the development data. For both subtasks, the fusion of the top performing feature-ranker pairs (runs 2 to 5) performed better than the best feature-ranker pair in isolation (run 1). The only exception was run 2 of the video subtask, which was a required run for the task where the use of audio features (i.e., MFCC) was mandatory. All the machine-learned rankers using MFCC achieved poor results. By analyzing the confidence intervals, it can be noticed that the results achieved by the rank aggregation methods seem promising.

Table 2: MAP results obtained on the development data.

Subtask   Run   Avg. MAP   95% Confidence Interval (min., max.)
Image     1     27.78      22.78, 32.77
Image     2     28.95      23.17, 34.72
Image     3     29.36      25.18, 33.53
Image     4     28.98      24.53, 33.43
Image     5     29.74      25.28, 34.19
Video     1     22.41      21.48, 23.34
Video     2     21.85      20.65, 23.05
Video     3     23.43      22.77, 24.09
Video     4     23.19      21.68, 24.70
Video     5     23.07      21.88, 24.27

   Table 3 presents the official results reported for 2,435 videos and images from 30 movie trailers of the test data. MAP is a good indication of the effectiveness considering all the results (i.e., images or videos) of the same movie trailer. MAP@10, in turn, focuses on the effectiveness considering only the 10 results classified as the most interesting ones. On one hand, for the image subtask, the best results were achieved by a feature-ranker pair in isolation (run 1). On the other hand, for the video subtask, the use of rank aggregation methods (runs 2 to 5) improved the overall performance. One of the reasons is the strategy used for selecting the feature-ranker pairs to be combined by the rank aggregation methods. For that, we sorted all the pairs in an increasing order of MAP. We believe the ordering obtained on the development data and on the test data may not be consistent.

Table 3: Official results reported for the test data.

Subtask   Run   MAP     MAP@10
Image     1     27.10   11.29
Image     2     26.45   10.29
Image     3     25.02   09.24
Image     4     25.25   09.16
Image     5     25.31   09.39
Video     1     16.67   03.96
Video     2     18.07   05.30
Video     3     18.77   06.14
Video     4     18.36   06.24
Video     5     18.30   06.28

4    CONCLUSIONS
Our approach has explored rank aggregation methods for combining feature-ranker pairs. The obtained results demonstrate that the proposed approach is promising. Future work includes the investigation of a smarter strategy for selecting the pairs to be combined.

ACKNOWLEDGMENTS
We thank the São Paulo Research Foundation - FAPESP (grant 2016/06441-7) and the Brazilian National Council for Scientific and Technological Development - CNPq (grant 423228/2016-1) for funding. This work has also benefited from the support of the Association for the Advancement of Affective Computing (AAAC) and the ACM Special Interest Group on Information Retrieval (SIGIR).


REFERENCES
 [1] J. Almeida. 2016. UNIFESP at MediaEval 2016: Predicting Media In-
     terestingness Task. In Proc. of the MediaEval 2016 Workshop.
     http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_28.pdf
 [2] J. Almeida, N. J. Leite, and R. S. Torres. 2011. Comparison of Video
     Sequences with Histograms of Motion Patterns. In IEEE Intl. Conf.
     Image Processing (ICIP’11). 3673–3676.
 [3] J. Almeida, L. P. Valem, and D. C. G. Pedronette. 2017. A Rank Aggre-
     gation Framework for Video Interestingness Prediction. In Intl. Conf.
     Image Analysis and Processing (ICIAP’17). 1–11.
 [4] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. 2010. Learning Mid-
     Level Features for Recognition. In IEEE Intl. Conf. Computer Vision and
     Pattern Recognition (CVPR’10). 2559–2566.
 [5] C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton,
     and G. N. Hullender. 2005. Learning to rank using gradient descent.
     In Intl. Conf. Machine Learning (ICML’05). 89–96.
 [6] Z. Cao, T. Qin, T-Y. Liu, M-F. Tsai, and H. Li. 2007. Learning to rank:
     from pairwise approach to listwise approach. In Intl. Conf. Machine
     Learning (ICML’07). 129–136.
 [7] C-H. Demarty, M. Sjöberg, B. Ionescu, T-T. Do, M. Gygli, and N. Q. K.
     Duong. 2017. MediaEval 2017 Predicting Media Interestingness Task.
     In Proc. of the MediaEval 2017 Workshop. Dublin, Ireland.
 [8] L. A. Duarte, O. A. B. Penatti, and J. Almeida. 2016. Bag of Attributes
     for Video Event Retrieval. CoRR abs/1607.05208 (2016).
     http://arxiv.org/abs/1607.05208
 [9] P. C. Fishburn. 1967. Additive Utilities with Incomplete Product Set:
     Applications to Priorities and Assignments. Operations Research Society
     of America (ORSA).
[10] Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. 2003. An Efficient
     Boosting Algorithm for Combining Preferences. Journal of Machine
     Learning Research 4 (2003), 933–969.
[11] Y-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S-F. Chang. 2015. Super Fast
     Event Recognition in Internet Videos. IEEE Transactions on Multimedia
     17, 8 (2015), 1174–1186.
[12] T. Joachims. 2006. Training linear SVMs in linear time. In ACM SIGKDD
     Intl. Conf. Knowledge Discovery and Data Mining (ACM SIGKDD’06).
     217–226.
[13] L. Lam and C. Y. Suen. 1997. Application of majority voting to pattern
     recognition: an analysis of its behavior and performance. IEEE Trans.
     Systems, Man, and Cybernetics, Part A 27, 5 (1997), 553–568.
[14] D. C. G. Pedronette and R. S. Torres. 2013. Image Re-Ranking and Rank
     Aggregation based on Similarity of Ranked Lists. Pattern Recognition
     46, 8 (2013), 2350–2360.
[15] J. Sivic and A. Zisserman. 2003. Video Google: A Text Retrieval Ap-
     proach to Object Matching in Videos. In IEEE Intl. Conf. Computer
     Vision (ICCV’03). 1470–1477.
[16] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. 2015. Learn-
     ing Spatiotemporal Features with 3D Convolutional Networks. In IEEE
     Intl. Conf. Computer Vision (ICCV’15). 4489–4497.
[17] H. P. Young. 1974. An axiomatization of Borda’s rule. Journal of
     Economic Theory 9, 1 (1974), 43–52.