 GIBIS at MediaEval 2017: Predicting Media Interestingness Task
                                                      Jurandy Almeida and Ricardo M. Savii
                       GIBIS Lab, Institute of Science and Technology, Federal University of São Paulo – UNIFESP
                                                12247-014, São José dos Campos, SP – Brazil
                                              {jurandy.almeida,ricardo.manhaes}@unifesp.br

Copyright held by the owner/author(s).
MediaEval'17, 13-15 September 2017, Dublin, Ireland

ABSTRACT
This paper describes the GIBIS team experience in the Predicting Media Interestingness Task at MediaEval 2017. In this task, the teams were required to develop an approach to predict whether images or videos are interesting or not. Our proposal relies on late fusion with rank aggregation methods for combining ranking models learned with different features and by different learning-to-rank algorithms.
1    INTRODUCTION
In this paper, we explore the use of rank aggregation methods for predicting the interestingness of images and videos. For that, content-based representations of images and videos are obtained with different features, which are used to train different learning-to-rank algorithms, creating rankers capable of predicting the interestingness degree of images and videos. Then, the information provided by the different feature-ranker pairs is combined by rank aggregation methods, yielding more effective predictions [3].
   This work is developed in the context of the MediaEval 2017 Predicting Media Interestingness Task, whose goal is to automatically select the most interesting frames or portions of videos according to a common viewer by using features derived from audio-visual content or associated textual information. Details about the data, task, and evaluation are described in [7].
2    PROPOSED APPROACH
The starting point for our proposal is the work of Almeida [1], where motion features were extracted from videos and then used to train four different ranking models, which were combined with a majority voting strategy [13]. The key idea exploited in Almeida's work was the use of multiple learning-to-rank algorithms, and their combination was pointed out as promising.
   Here, we extend the work of Almeida [1] by exploring rank aggregation methods for combining ranking models learned with different features and by different learning-to-rank algorithms.
2.1    Features
Images. For the image subtask, we used only the pre-computed features provided by the task organizers [7]. Five low-level features were considered: Dense SIFT, Histogram of Gradients (HoG), Local Binary Patterns (LBP), GIST, and Color Histogram. Also, two deep learning features were used, which refer to Convolutional Neural Network (CNN) features extracted from the last layers (i.e., fc7 and prob) of the pre-trained AlexNet model [11].
Videos. For the video subtask, we used nine pre-computed features provided by the task organizers [7]. One of them represents audio information: Mel-Frequency Cepstral Coefficients (MFCC). Seven features are the same as those used for images and encode visual content: five low-level features (Dense SIFT, HoG, LBP, GIST, and Color Histogram) and two deep learning features (CNN-fc7 and CNN-prob). These eight features are frame-based representations [11]. To obtain a single video representation, we built a Bag-of-Features (BoF) [4] model for each feature. In the BoF framework, visual words [15] are obtained by quantizing a feature space according to a pre-learned dictionary. Thus, a video is represented as a normalized frequency histogram of the visual words associated with each feature. In this work, we construct a codebook of 4000 visual words using a random selection. In addition, we considered three video-based representations. One of them is also a pre-computed feature provided by the task organizers, denoted C3D [16]. The two others refer to additional visual features we extracted from the videos: Histogram of Motion Patterns (HMP) [2] and Bag-of-Attributes (BoA) [8].
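To make the BoF step concrete, the sketch below (Python/NumPy, our own illustration rather than the task-provided code) builds a random codebook from pooled frame descriptors and encodes a video as a normalized word-frequency histogram. The function names, the Euclidean nearest-word assignment, and the variables in the usage comment are our assumptions.

import numpy as np

def build_codebook(descriptors, n_words=4000, seed=0):
    """Randomly pick n_words frame descriptors to act as visual words."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(descriptors), size=n_words, replace=False)
    return descriptors[idx]

def bof_histogram(frame_descriptors, codebook):
    """Assign each frame descriptor to its nearest visual word and return
    the L1-normalized frequency histogram used as the video representation."""
    # Squared Euclidean distance between every descriptor and every word.
    d2 = ((frame_descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Hypothetical usage for one feature space (e.g., Dense SIFT):
# pool = np.vstack(training_frame_descriptors)            # assumed variable
# codebook = build_codebook(pool, n_words=4000)
# video_vector = bof_histogram(video_frame_descriptors, codebook)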
2.2    Learning-to-Rank Algorithms
Each of the above features was used as input to train four different learning-to-rank algorithms, which are the same as those used in [1]. The first three are based on pairwise comparisons: Ranking SVM [12], RankNet [5], and RankBoost [10]. The fourth considers lists of objects by using ListNet [6].
   The SVMrank package [12] (https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html, as of August 2017) was used for running Ranking SVM. The RankLib package (https://sourceforge.net/p/lemur/wiki/RankLib/, as of August 2017) was used for running RankNet, RankBoost, and ListNet. Ranking SVM was configured with a linear kernel; the others were configured with their default parameter settings.
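Both packages read training data in the SVMlight/LETOR text format, one object per line with a relevance label, a query identifier, and indexed feature values. The helper below is a minimal sketch of how the feature vectors could be serialized for either tool; grouping objects by movie trailer as the ranking "query" and the helper name write_letor_file are our assumptions, not something prescribed by the task or the paper.

def write_letor_file(path, labels, qids, features):
    """Write objects in SVMlight/LETOR format:
    <label> qid:<qid> 1:<v1> 2:<v2> ... (one line per object),
    the input format expected by SVMrank and RankLib."""
    with open(path, "w") as out:
        for label, qid, vec in zip(labels, qids, features):
            feats = " ".join(f"{i + 1}:{v:.6f}" for i, v in enumerate(vec))
            out.write(f"{label} qid:{qid} {feats}\n")

# Hypothetical usage: one training file per feature, trailers as ranking groups.
# write_letor_file("cnn_fc7.train.txt", interestingness, trailer_ids, cnn_fc7_vectors)
# Training could then be launched, e.g., with svm_rank_learn (SVMrank) or
# java -jar RankLib.jar -ranker 1 (RankNet in RankLib); exact flags depend on the tool version.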
2.3    Rank Aggregation Models
Let C = {o_1, o_2, ..., o_n} be a collection of n objects (i.e., images or videos). Let R = {r_1, r_2, ..., r_m} be a set of m feature-ranker pairs. Let ρ_j(i) be the interestingness degree assigned by the feature-ranker pair r_j ∈ R to the object o_i ∈ C. Based on the score ρ_j, a ranked list τ_j can be computed. The ranked list τ_j can be defined as a permutation of the collection C, which contains the most interesting objects according to the feature-ranker pair r_j. A permutation τ_j is a bijection from the set C onto the set [n] = {1, 2, ..., n}. For a permutation τ_j, we interpret τ_j(i) as the position (or rank) of the object o_i in the ranked list τ_j. We can say that, if o_i is ranked before o_k in the ranked list τ_j, that is, τ_j(i) < τ_j(k), then ρ_j(i) ≤ ρ_j(k) [3].
   Given the different scores ρ_j and their respective ranked lists τ_j computed by the distinct pairs r_j ∈ R, a rank aggregation method aims to compute a fused score F(i) for each object o_i [3]. In this work, we used three different methods based on score and rank information:
   (1) Borda Method [17]: F(i) = Σ_{j=1}^{m} τ_j(i),
   (2) Multiplicative Approach [14]: F(i) = Π_{j=1}^{m} (1 + ρ_j(i)),
   (3) Weighted Sum Model [9]: F(i) = Σ_{j=1}^{m} τ_j(i) × ρ_j(i).
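For reference, the three fusion rules can be implemented directly from the scores and the ranked lists they induce. The sketch below is a minimal NumPy illustration under the paper's convention that positions and scores are oriented the same way (objects ranked earlier have smaller τ_j(i) and ρ_j(i)); the function names are ours.

import numpy as np

def positions_from_scores(scores):
    """scores: (m, n) array with one row of ρ_j values per feature-ranker pair.
    Returns the (m, n) array of 1-based positions τ_j(i), obtained by sorting
    each row in ascending order of score."""
    order = scores.argsort(axis=1)
    tau = np.empty_like(order)
    rows = np.arange(scores.shape[0])[:, None]
    tau[rows, order] = np.arange(1, scores.shape[1] + 1)
    return tau

def borda(scores):           # F(i) = sum_j tau_j(i)
    return positions_from_scores(scores).sum(axis=0)

def multiplicative(scores):  # F(i) = prod_j (1 + rho_j(i))
    return np.prod(1.0 + scores, axis=0)

def weighted_sum(scores):    # F(i) = sum_j tau_j(i) * rho_j(i)
    return (positions_from_scores(scores) * scores).sum(axis=0)

# Example with m = 2 feature-ranker pairs and n = 3 objects:
# scores = np.array([[0.2, 0.9, 0.5],
#                    [0.1, 0.4, 0.8]])
# borda(scores) -> array([2, 5, 5]); object 0 is ranked first under every rule.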
3    EXPERIMENTS & RESULTS
Five different runs were submitted for each subtask, configured as shown in Table 1 (run 1 of the image subtask and run 2 of the video subtask were the required runs for the task, while the other runs were optional). For both subtasks, the first run is the best feature-ranker pair in isolation, and the others refer to the fusion of the top performing feature-ranker pairs with rank aggregation methods. All the evaluated approaches were calibrated through a 3-fold cross-validation on the development data.

Table 1: Configuration of the submitted runs.

Subtask   Run   Fusion           Feature-Ranker Pairs
Image     1     -                CNN-fc7 & RankBoost
Image     2     Weighted Sum     CNN-fc7 & RankBoost, CNN-fc7 & RankNet
Image     3     Multiplicative   CNN-fc7 & RankBoost, CNN-fc7 & RankNet, CNN-fc7 & RankSVM, CNN-prob & RankSVM
Image     4     Borda            CNN-fc7 & RankBoost, CNN-fc7 & RankNet, CNN-fc7 & RankSVM, CNN-prob & RankSVM
Image     5     Weighted Sum     CNN-fc7 & RankBoost, CNN-fc7 & RankNet, CNN-fc7 & RankSVM, CNN-prob & RankSVM
Video     1     -                HMP & RankSVM
Video     2     Multiplicative   HMP & RankSVM, MFCC & RankSVM
Video     3     Multiplicative   HMP & RankSVM, HMP & RankBoost, C3D & RankNet, HoG & RankSVM, Dense SIFT & RankBoost
Video     4     Borda            HMP & RankSVM, HMP & RankBoost, C3D & RankNet, HoG & RankSVM, Dense SIFT & RankBoost
Video     5     Weighted Sum     HMP & RankSVM, HMP & RankBoost, C3D & RankNet, HoG & RankSVM, Dense SIFT & RankBoost
   The development data is composed of 7,396 videos from 78 movie trailers. For the image subtask, the middle keyframe of each video was extracted, forming a dataset with 7,396 images. Each of the features (Section 2.1) was used as input to train each of the learning-to-rank algorithms (Section 2.2). In this way, we obtained 28 feature-ranker pairs (i.e., 7 features × 4 rankers) for the image subtask and 44 feature-ranker pairs (i.e., 11 features × 4 rankers) for the video subtask. Next, each of the feature-ranker pairs was used to predict the interestingness degree of test images and videos. Finally, the prediction scores of the top performing feature-ranker pairs in isolation were combined using rank aggregation methods (Section 2.3), producing fused prediction scores.
   To assess the effectiveness of each approach, we computed the Mean Average Precision (MAP). For that, we transformed prediction scores into binary decisions using the strategy proposed in [1]. First, the prediction scores associated with images and videos of a same movie trailer were normalized using a z-score normalization. Then, an empirical threshold of 0.7 was applied to the normalized prediction scores, producing binary decisions.
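The sketch below illustrates this protocol, assuming that larger prediction scores mean more interesting (consistent with the thresholding step) and that predictions and ground-truth labels are grouped per movie trailer; the tie handling of the official evaluation tool may differ.

import numpy as np

def binarize(scores, threshold=0.7):
    """z-score normalize the predictions of one movie trailer and mark as
    interesting those exceeding the empirical threshold of 0.7."""
    z = (scores - scores.mean()) / (scores.std() + 1e-12)
    return (z > threshold).astype(int)

def average_precision(relevant, scores):
    """Average precision for one trailer: precision accumulated at the rank of
    each relevant (ground-truth interesting) item, items sorted by decreasing score."""
    order = np.argsort(-np.asarray(scores))
    rel = np.asarray(relevant)[order]
    hits = np.cumsum(rel)
    ranks = np.arange(1, len(rel) + 1)
    if hits[-1] == 0:
        return 0.0
    return float(((hits / ranks) * rel).sum() / rel.sum())

# Hypothetical usage with predictions and labels keyed by trailer id:
# decisions = {t: binarize(pred[t]) for t in pred}                              # submitted decisions
# mean_ap = float(np.mean([average_precision(gt[t], pred[t]) for t in gt]))     # MAP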
   Table 2 presents the MAP scores obtained for each run on the development data. For both subtasks, the fusion of the top performing feature-ranker pairs (runs 2 to 5) performed better than the best feature-ranker pair in isolation (run 1). The only exception was run 2 of the video subtask, which was a required run for the task where the use of audio features (i.e., MFCC) was mandatory. All the machine-learned rankers using MFCC achieved poor results. By analyzing the confidence intervals, it can be noticed that the results achieved by the rank aggregation methods seem promising.

Table 2: MAP results obtained on the development data.

Subtask   Run   Avg. MAP   95% Confidence Interval (min., max.)
Image     1     27.78      22.78, 32.77
Image     2     28.95      23.17, 34.72
Image     3     29.36      25.18, 33.53
Image     4     28.98      24.53, 33.43
Image     5     29.74      25.28, 34.19
Video     1     22.41      21.48, 23.34
Video     2     21.85      20.65, 23.05
Video     3     23.43      22.77, 24.09
Video     4     23.19      21.68, 24.70
Video     5     23.07      21.88, 24.27

   Table 3 presents the official results reported for 2,435 videos and images from 30 movie trailers of the test data. MAP is a good indication of the effectiveness considering all the results (i.e., images or videos) of the same movie trailer. MAP@10, in turn, focuses on the effectiveness considering only the 10 results classified as the most interesting ones. On one hand, for the image subtask, the best results were achieved by a feature-ranker pair in isolation (run 1). On the other hand, for the video subtask, the use of rank aggregation methods (runs 2 to 5) improved the overall performance. One of the reasons is the strategy used for selecting the feature-ranker pairs to be combined by the rank aggregation methods. For that, we sorted all the pairs in an increasing order of MAP. We believe the ordering obtained on the development data and on the test data may not be consistent.

Table 3: Official results reported for the test data.

Subtask   Run   MAP     MAP@10
Image     1     27.10   11.29
Image     2     26.45   10.29
Image     3     25.02   09.24
Image     4     25.25   09.16
Image     5     25.31   09.39
Video     1     16.67   03.96
Video     2     18.07   05.30
Video     3     18.77   06.14
Video     4     18.36   06.24
Video     5     18.30   06.28

4    CONCLUSIONS
Our approach has explored rank aggregation methods for combining feature-ranker pairs. The obtained results demonstrate that the proposed approach is promising. Future work includes the investigation of a smarter strategy for selecting the pairs to be combined.

ACKNOWLEDGMENTS
We thank the São Paulo Research Foundation - FAPESP (grant 2016/06441-7) and the Brazilian National Council for Scientific and Technological Development - CNPq (grant 423228/2016-1) for funding. This work has also benefited from the support of the Association for the Advancement of Affective Computing (AAAC) and the ACM Special Interest Group on Information Retrieval (SIGIR).


REFERENCES
 [1] J. Almeida. 2016. UNIFESP at MediaEval 2016: Predicting Media In-
     terestingness Task. In Proc. of the MediaEval 2016 Workshop.
     http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_28.pdf
 [2] J. Almeida, N. J. Leite, and R. S. Torres. 2011. Comparison of Video
     Sequences with Histograms of Motion Patterns. In IEEE Intl. Conf.
     Image Processing (ICIP’11). 3673–3676.
 [3] J. Almeida, L. P. Valem, and D. C. G. Pedronette. 2017. A Rank Aggre-
     gation Framework for Video Interestingness Prediction. In Intl. Conf.
     Image Analysis and Processing (ICIAP’17). 1–11.
 [4] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. 2010. Learning Mid-
     Level Features for Recognition. In IEEE Intl. Conf. Computer Vision and
     Pattern Recognition (CVPR’10). 2559–2566.
 [5] C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton,
     and G. N. Hullender. 2005. Learning to rank using gradient descent.
     In Intl. Conf. Machine Learning (ICML’05). 89–96.
 [6] Z. Cao, T. Qin, T-Y. Liu, M-F. Tsai, and H. Li. 2007. Learning to rank:
     from pairwise approach to listwise approach. In Intl. Conf. Machine
     Learning (ICML’07). 129–136.
 [7] C-H. Demarty, M. Sjöberg, B. Ionescu, T-T. Do, M. Gygli, and N. Q. K.
     Duong. 2017. MediaEval 2017 Predicting Media Interestingness Task.
     In Proc. of the MediaEval 2017 Workshop. Dublin, Ireland.
 [8] L. A. Duarte, O. A. B. Penatti, and J. Almeida. 2016. Bag of Attributes
     for Video Event Retrieval. CoRR abs/1607.05208 (2016).
     http://arxiv.org/abs/1607.05208
 [9] P. C. Fishburn. 1967. Additive Utilities with Incomplete Product Set:
     Applications to Priorities and Assignments. Operations Research Society
     of America (ORSA).
[10] Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. 2003. An Efficient
     Boosting Algorithm for Combining Preferences. Journal of Machine
     Learning Research 4 (2003), 933–969.
[11] Y-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S-F. Chang. 2015. Super Fast
     Event Recognition in Internet Videos. IEEE Transactions on Multimedia
     17, 8 (2015), 1174–1186.
[12] T. Joachims. 2006. Training linear SVMs in linear time. In ACM SIGKDD
     Intl. Conf. Knowledge Discovery and Data Mining (ACM SIGKDD’06).
     217–226.
[13] L. Lam and C. Y. Suen. 1997. Application of majority voting to pattern
     recognition: an analysis of its behavior and performance. IEEE Trans.
     Systems, Man, and Cybernetics, Part A 27, 5 (1997), 553–568.
[14] D. C. G. Pedronette and R. S. Torres. 2013. Image Re-Ranking and Rank
     Aggregation based on Similarity of Ranked Lists. Pattern Recognition
     46, 8 (2013), 2350–2360.
[15] J. Sivic and A. Zisserman. 2003. Video Google: A Text Retrieval Ap-
     proach to Object Matching in Videos. In IEEE Intl. Conf. Computer
     Vision (ICCV’03). 1470–1477.
[16] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. 2015. Learn-
     ing Spatiotemporal Features with 3D Convolutional Networks. In IEEE
     Intl. Conf. Computer Vision (ICCV’15). 4489–4497.
[17] H. P. Young. 1974. An axiomatization of Borda’s rule. Journal of
     Economic Theory 9, 1 (1974), 43–52.