=Paper=
{{Paper
|id=Vol-1984/Mediaeval_2017_paper_25
|storemode=property
|title=GIBIS at MediaEval 2017: Predicting Media Interestingness Task
|pdfUrl=https://ceur-ws.org/Vol-1984/Mediaeval_2017_paper_25.pdf
|volume=Vol-1984
|authors=Jurandy Almeida,Ricardo Savii
|dblpUrl=https://dblp.org/rec/conf/mediaeval/AlmeidaS17
}}
==GIBIS at MediaEval 2017: Predicting Media Interestingness Task==
Jurandy Almeida and Ricardo M. Savii
GIBIS Lab, Institute of Science and Technology, Federal University of São Paulo – UNIFESP
12247-014, São José dos Campos, SP – Brazil
{jurandy.almeida,ricardo.manhaes}@unifesp.br

ABSTRACT
This paper describes the GIBIS team experience in the Predicting Media Interestingness Task at MediaEval 2017. In this task, the teams were required to develop an approach to predict whether images or videos are interesting or not. Our proposal relies on late fusion with rank aggregation methods for combining ranking models learned with different features and by different learning-to-rank algorithms.

1 INTRODUCTION
In this paper, we explore the use of rank aggregation methods for predicting the interestingness of images and videos. For that, content-based representations for images and videos are obtained by different features, which are used to train different learning-to-rank algorithms, creating rankers capable of predicting the interestingness degree of images and videos. Then, the information provided by the different feature-ranker pairs is combined by rank aggregation methods, yielding more effective predictions [3].
This work is developed in the context of the MediaEval 2017 Predicting Media Interestingness Task, whose goal is to automatically select the most interesting frames or portions of videos according to a common viewer by using features derived from audio-visual content or associated textual information. Details about the data, task, and evaluation are described in [7].

2 PROPOSED APPROACH
The starting point for our proposal is the work of Almeida [1], in which motion features were extracted from videos and then used to train four different ranking models, which were combined with a majority voting strategy [13]. The key idea exploited in Almeida's work was the use of multiple learning-to-rank algorithms, and their combination was pointed out as promising. Here, we extend the work of Almeida [1] by exploring rank aggregation methods for combining ranking models learned with different features and by different learning-to-rank algorithms.

2.1 Features
Images. For the image subtask, we used only the pre-computed features provided by the task organizers [7]. Five low-level features were considered: Dense SIFT, Histogram of Gradients (HoG), Local Binary Patterns (LBP), GIST, and Color Histogram. Also, two deep learning features were used; they refer to Convolutional Neural Network (CNN) features extracted from the last layers (i.e., fc7 and prob) of the pre-trained AlexNet model [11].
Videos. For the video subtask, we used nine pre-computed features provided by the task organizers [7]. One of them represents audio information: Mel-Frequency Cepstral Coefficients (MFCC). Seven features are the same used for images and encode visual content: five low-level features (Dense SIFT, HoG, LBP, GIST, and Color Histogram) and two deep learning features (CNN-fc7 and CNN-prob). These eight features are frame-based representations [11]. To obtain a single video representation, we built a Bag-of-Features (BoF) [4] model for each feature. In the BoF framework, visual words [15] are obtained by quantizing a feature space according to a pre-learned dictionary. Thus, a video is represented as a normalized frequency histogram of visual words associated with each feature. In this work, we construct a codebook of 4000 visual words using a random selection. In addition, we considered three video-based representations. One of them is also a pre-computed feature provided by the task organizers, denoted C3D [16]. The other two refer to additional visual features we extracted from the videos: Histogram of Motion Patterns (HMP) [2] and Bag-of-Attributes (BoA) [8].
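To make the BoF video representation concrete, the following is a minimal sketch of the quantization step described above, assuming frame-level descriptors are already extracted and a codebook has been learned beforehand; the function name, array shapes, the hard nearest-word assignment, and the toy codebook size are our own illustration, not details taken from the original implementation.

```python
import numpy as np

def build_bof_histogram(frame_descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Quantize frame-level descriptors against a pre-learned codebook and
    return a normalized frequency histogram of visual words for one video."""
    # Squared Euclidean distance between every frame descriptor and every codeword.
    dists = ((frame_descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = dists.argmin(axis=1)  # nearest visual word for each frame descriptor
    hist = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1-normalized frequency histogram

# Toy usage: 120 frames with 64-d descriptors and a small randomly selected codebook
# (the paper uses a codebook of 4000 visual words chosen by random selection).
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(120, 64))
codebook = descriptors[rng.choice(len(descriptors), size=50, replace=False)]
video_vector = build_bof_histogram(descriptors, codebook)
print(video_vector.shape, video_vector.sum())
```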
2.2 Learning-to-Rank Algorithms
Each of the above features was used as input to train four different learning-to-rank algorithms, which are the same used in [1]. The first three are based on pairwise comparisons: Ranking SVM [12], RankNet [5], and RankBoost [10]. The fourth approach considers lists of objects by using ListNet [6].
The SVMrank package [12] (https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html, as of August 2017) was used for running Ranking SVM. The RankLib package (https://sourceforge.net/p/lemur/wiki/RankLib/, as of August 2017) was used for running RankNet, RankBoost, and ListNet. Ranking SVM was configured with a linear kernel. The others were configured with their default parameter settings.

2.3 Rank Aggregation Models
Let C = {o_1, o_2, ..., o_n} be a collection of n objects (i.e., images or videos). Let R = {r_1, r_2, ..., r_m} be a set of m feature-ranker pairs. Let ρ_j(i) be the interestingness degree assigned by the feature-ranker pair r_j ∈ R to the object o_i ∈ C. Based on the score ρ_j, a ranked list τ_j can be computed. The ranked list τ_j can be defined as a permutation of the collection C, which contains the most interesting objects according to the feature-ranker pair r_j. A permutation τ_j is a bijection from the set C onto the set [n] = {1, 2, ..., n}. For a permutation τ_j, we interpret τ_j(i) as the position (or rank) of the object o_i in the ranked list τ_j. We can say that, if o_i is ranked before o_k in the ranked list τ_j, that is, τ_j(i) < τ_j(k), then ρ_j(i) ≤ ρ_j(k) [3].
Given the different scores ρ_j and their respective ranked lists τ_j computed by distinct pairs r_j ∈ R, a rank aggregation method aims to compute a fused score F(i) for each object o_i [3]. In this work, we used three different methods based on score and rank information:
(1) Borda Method [17]: F(i) = Σ_{j=1}^{m} τ_j(i)
(2) Multiplicative Approach [14]: F(i) = Π_{j=1}^{m} (1 + ρ_j(i))
(3) Weighted Sum Model [9]: F(i) = Σ_{j=1}^{m} (τ_j(i) × ρ_j(i))
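As an illustration of how these three fusion rules operate, here is a minimal sketch assuming each feature-ranker pair produces one interestingness score per object and that higher scores mean more interesting; the function names, the rank convention, and the toy data are ours, not part of the paper's code.

```python
import numpy as np

def ranks_from_scores(scores: np.ndarray) -> np.ndarray:
    """Convert one feature-ranker pair's scores into rank positions (1 = most interesting)."""
    order = np.argsort(-scores)               # object indices sorted by decreasing score
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def borda(scores: np.ndarray) -> np.ndarray:
    """Borda Method: sum of rank positions over all m pairs."""
    return np.vstack([ranks_from_scores(s) for s in scores]).sum(axis=0)

def multiplicative(scores: np.ndarray) -> np.ndarray:
    """Multiplicative Approach: product of (1 + score) over all m pairs."""
    return np.prod(1.0 + scores, axis=0)

def weighted_sum(scores: np.ndarray) -> np.ndarray:
    """Weighted Sum Model: sum of rank position times score over all m pairs."""
    ranks = np.vstack([ranks_from_scores(s) for s in scores])
    return (ranks * scores).sum(axis=0)

# Toy usage: m = 3 feature-ranker pairs scoring n = 5 objects (rho_j(i), one row per pair).
rng = np.random.default_rng(1)
scores = rng.random((3, 5))
print(borda(scores), multiplicative(scores), weighted_sum(scores))
```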
3 EXPERIMENTS & RESULTS
Five different runs were submitted for each subtask, configured as shown in Table 1 (run 1 of the image subtask and run 2 of the video subtask were the required runs for the task, while the other runs were optional). For both subtasks, the first run is the best feature-ranker pair in isolation and the others refer to the fusion of the top performing feature-ranker pairs with rank aggregation methods. All the evaluated approaches were calibrated through a 3-fold cross validation on the development data.

Table 1: Configuration of the submitted runs.
Image subtask:
  Run 1: no fusion; CNN-fc7 & RankBoost
  Run 2: Weighted Sum; CNN-fc7 & RankBoost, CNN-fc7 & RankNet
  Runs 3 (Multiplicative), 4 (Borda), 5 (Weighted Sum): CNN-fc7 & RankBoost, CNN-fc7 & RankNet, CNN-fc7 & RankSVM, CNN-prob & RankSVM
Video subtask:
  Run 1: no fusion; HMP & RankSVM
  Run 2: Multiplicative; HMP & RankSVM, MFCC & RankSVM
  Runs 3 (Multiplicative), 4 (Borda), 5 (Weighted Sum): HMP & RankSVM, HMP & RankBoost, C3D & RankNet, HoG & RankSVM, Dense SIFT & RankBoost

The development data is composed of 7,396 videos from 78 movie trailers. For the image subtask, the middle keyframe of each video was extracted, forming a dataset with 7,396 images. Each of the features (Section 2.1) was used as input to train each of the learning-to-rank algorithms (Section 2.2). In this way, we obtained 28 feature-ranker pairs (i.e., 7 features × 4 rankers) for the image subtask and 44 feature-ranker pairs (i.e., 11 features × 4 rankers) for the video subtask. Next, each of the feature-ranker pairs was used to predict the interestingness degree of test images and videos. Finally, the prediction scores of the top performing feature-ranker pairs in isolation were combined using rank aggregation methods (Section 2.3), producing fused prediction scores.
To assess the effectiveness of each approach, we computed the Mean Average Precision (MAP). For that, we transformed prediction scores into binary decisions using the strategy proposed in [1]. First, the prediction scores associated with the images and videos of a same movie trailer were normalized using a z-score normalization. Then, an empirical threshold of 0.7 was applied to the normalized prediction scores, producing binary decisions.
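This binarization step can be sketched as follows, assuming the fused prediction scores are grouped per movie trailer in a dictionary; the function name and data layout are ours, introduced only for illustration.

```python
import numpy as np

def binarize_predictions(scores_by_trailer: dict, threshold: float = 0.7) -> dict:
    """Z-score normalize fused prediction scores within each movie trailer and flag as
    interesting (1) every image/video whose normalized score exceeds the threshold."""
    decisions = {}
    for trailer, scores in scores_by_trailer.items():
        std = scores.std()
        z = (scores - scores.mean()) / (std if std > 0 else 1.0)  # per-trailer z-score
        decisions[trailer] = (z > threshold).astype(int)
    return decisions

# Toy usage: fused prediction scores for the shots of two hypothetical trailers.
rng = np.random.default_rng(2)
preds = {"trailer_01": rng.random(8), "trailer_02": rng.random(5)}
print(binarize_predictions(preds))
```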
Table 2 presents the MAP scores obtained for each run on the development data. For both subtasks, the fusion of the top performing feature-ranker pairs (runs 2 to 5) performed better than the best feature-ranker pair in isolation (run 1). The only exception was run 2 of the video subtask, which was a required run for the task where the use of audio features (i.e., MFCC) was mandatory. All the machine-learned rankers using MFCC achieved poor results. By analyzing the confidence intervals, it can be noticed that the results achieved by the rank aggregation methods seem promising.

Table 2: MAP results obtained on the development data.
Subtask  Run  Avg. MAP  95% Confidence Interval (min.–max.)
Image    1    27.78     22.78–32.77
Image    2    28.95     23.17–34.72
Image    3    29.36     25.18–33.53
Image    4    28.98     24.53–33.43
Image    5    29.74     25.28–34.19
Video    1    22.41     21.48–23.34
Video    2    21.85     20.65–23.05
Video    3    23.43     22.77–24.09
Video    4    23.19     21.68–24.70
Video    5    23.07     21.88–24.27

Table 3 presents the official results reported for 2,435 videos and images from 30 movie trailers of the test data. MAP is a good indication of the effectiveness considering all the results (i.e., images or videos) of the same movie trailer. MAP@10, in turn, focuses on the effectiveness considering only the 10 results classified as the most interesting ones. On one hand, for the image subtask, the best results were achieved by a feature-ranker pair in isolation (run 1). On the other hand, for the video subtask, the use of rank aggregation methods (runs 2 to 5) improved the overall performance. One of the reasons for this discrepancy is the strategy used for selecting the feature-ranker pairs to be combined by the rank aggregation methods. For that, we sorted all the pairs in an increasing order of MAP. We believe the ordering obtained on the development and test data may not be consistent.

Table 3: Official results reported for the test data.
Subtask  Run  MAP    MAP@10
Image    1    27.10  11.29
Image    2    26.45  10.29
Image    3    25.02  09.24
Image    4    25.25  09.16
Image    5    25.31  09.39
Video    1    16.67  03.96
Video    2    18.07  05.30
Video    3    18.77  06.14
Video    4    18.36  06.24
Video    5    18.30  06.28

4 CONCLUSIONS
Our approach has explored rank aggregation methods for combining feature-ranker pairs. The obtained results demonstrate that the proposed approach is promising. Future work includes the investigation of a smarter strategy for selecting the pairs to be combined.

ACKNOWLEDGMENTS
We thank the São Paulo Research Foundation - FAPESP (grant 2016/06441-7) and the Brazilian National Council for Scientific and Technological Development - CNPq (grant 423228/2016-1) for funding. This work has also benefited from the support of the Association for the Advancement of Affective Computing (AAAC) and the ACM Special Interest Group on Information Retrieval (SIGIR).

REFERENCES
[1] J. Almeida. 2016. UNIFESP at MediaEval 2016: Predicting Media Interestingness Task. In Proc. of the MediaEval 2016 Workshop. http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_28.pdf
[2] J. Almeida, N. J. Leite, and R. S. Torres. 2011. Comparison of Video Sequences with Histograms of Motion Patterns. In IEEE Intl. Conf. Image Processing (ICIP’11). 3673–3676.
[3] J. Almeida, L. P. Valem, and D. C. G. Pedronette. 2017. A Rank Aggregation Framework for Video Interestingness Prediction. In Intl. Conf. Image Analysis and Processing (ICIAP’17). 1–11.
[4] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. 2010. Learning Mid-Level Features for Recognition. In IEEE Intl. Conf. Computer Vision and Pattern Recognition (CVPR’10). 2559–2566.
[5] C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. N. Hullender. 2005. Learning to rank using gradient descent. In Intl. Conf. Machine Learning (ICML’05). 89–96.
[6] Z. Cao, T. Qin, T-Y. Liu, M-F. Tsai, and H. Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Intl. Conf. Machine Learning (ICML’07). 129–136.
[7] C-H. Demarty, M. Sjöberg, B. Ionescu, T-T. Do, M. Gygli, and N. Q. K. Duong. 2017. MediaEval 2017 Predicting Media Interestingness Task. In Proc. of the MediaEval 2017 Workshop. Dublin, Ireland.
[8] L. A. Duarte, O. A. B. Penatti, and J. Almeida. 2016. Bag of Attributes for Video Event Retrieval. CoRR abs/1607.05208 (2016). http://arxiv.org/abs/1607.05208
[9] P. C. Fishburn. 1967. Additive Utilities with Incomplete Product Set: Applications to Priorities and Assignments. Operations Research Society of America (ORSA).
[10] Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. 2003. An Efficient Boosting Algorithm for Combining Preferences. Journal of Machine Learning Research 4 (2003), 933–969.
[11] Y-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S-F. Chang. 2015. Super Fast Event Recognition in Internet Videos. IEEE Transactions on Multimedia 17, 8 (2015), 1174–1186.
[12] T. Joachims. 2006. Training linear SVMs in linear time. In ACM SIGKDD Intl. Conf. Knowledge Discovery and Data Mining (ACM SIGKDD’06). 217–226.
[13] L. Lam and C. Y. Suen. 1997. Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Trans. Systems, Man, and Cybernetics, Part A 27, 5 (1997), 553–568.
[14] D. C. G. Pedronette and R. S. Torres. 2013. Image Re-Ranking and Rank Aggregation based on Similarity of Ranked Lists. Pattern Recognition 46, 8 (2013), 2350–2360.
[15] J. Sivic and A. Zisserman. 2003. Video Google: A Text Retrieval Approach to Object Matching in Videos. In IEEE Intl. Conf. Computer Vision (ICCV’03). 1470–1477.
[16] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In IEEE Intl. Conf. Computer Vision (ICCV’15). 4489–4497.
[17] H. P. Young. 1974. An axiomatization of Borda’s rule. Journal of Economic Theory 9, 1 (1974), 43–52.