UNIFESP at MediaEval 2016: Predicting Media Interestingness Task

Jurandy Almeida
GIBIS Lab, Institute of Science and Technology, Federal University of São Paulo – UNIFESP
12247-014, São José dos Campos, SP – Brazil
jurandy.almeida@unifesp.br

ABSTRACT
This paper describes the approach proposed by UNIFESP for the MediaEval 2016 Predicting Media Interestingness Task, video subtask only. The proposed approach is based on combining learning-to-rank algorithms to predict the interestingness of videos from their visual content.

1. INTRODUCTION
Current solutions for predicting the interestingness of video data are usually based on learning-to-rank strategies. Most of this research has focused on using a single machine-learned ranker. Recently, combining the individual predictions of a set of machine-learned rankers has been established as an effective way to improve classification performance [9].

This paper presents an approach for predicting the interestingness of videos that relies on different learning-to-rank strategies for processing visual content. For that, a simple, yet effective, histogram of motion patterns (HMP) [1] is used for processing visual information. Then, a simple majority voting scheme [8] is used for combining machine-learned rankers and predicting the interestingness of videos. This work was developed for the MediaEval 2016 Predicting Media Interestingness Task, video subtask only, whose goal is to automatically select the video segments that are most interesting to a common viewer, using features derived from the audio-visual content or associated textual information. Details about the data, task, and evaluation are described in [4].

2. PROPOSED APPROACH
Measuring the degree of interestingness of a video is a challenging task. For that, the strategy proposed by Jiang et al. [6] was adopted. It relies on training a model to compare the interestingness of video pairs: given two videos, the system indicates the more interesting one.

Roughly speaking, the basic idea is to use machine learning algorithms to learn a ranking function based on features extracted from training data, and then apply it to features extracted from testing data. The proposed approach predicts the interestingness of videos by combining learning-to-rank algorithms and exploiting only visual information.

2.1 Visual Features
Instead of using the keyframe visual features provided by the organizers, a simple and fast algorithm known as the histogram of motion patterns (HMP) [1] was used to encode visual properties. It captures video movement through the transitions between frames. For each frame of an input sequence, motion features are extracted from the video stream. After that, each feature is encoded as a unique pattern representing its spatio-temporal configuration. Finally, those patterns are accumulated to form a normalized histogram.

2.2 Learning to Rank Strategies
In this work, the extracted features were classified with the following methods:

Ranking SVM [7]. A pairwise ranking method that uses the traditional SVM classifier to learn a ranking function. For that, each query and its possible results are mapped to a feature space. Next, a rank is associated with each point in this space. Finally, an SVM classifier is used to find an optimal separating hyperplane between those points based on their ranks.

RankNet [2]. A pairwise ranking method that relies on a probabilistic model. Pairwise rankings are transformed into probability distributions, enabling the use of probability distribution metrics as cost functions, which can then be minimized by standard optimization algorithms. The authors formulate this cost function using a neural network whose learning rate is controlled with gradient descent steps.

RankBoost [5]. A pairwise ranking method that relies on boosting. Initially, each possible result for a given query is mapped to a feature space in which each dimension indicates the relative ranking of an individual pair of results, i.e., whether one result is ranked below or above the other. The ranking problem is thus formulated as a binary classification problem. Next, a set of weak rankers is trained iteratively. At each iteration, the pairs are re-weighted so that the weight of wrongly ranked pairs is increased whereas the weight of correctly ranked pairs is decreased. Finally, all the weak rankers are combined into a final ranking function.
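The pairwise reduction that these rankers share can be made concrete. The sketch below is purely illustrative (it is not the SVMrank or RankLib implementation): per-query relevance grades are turned into labeled feature-difference vectors that any binary classifier can consume.

```python
import numpy as np

def pairwise_transform(X, y):
    """Turn per-query feature vectors X (n x d) with relevance grades y (n,)
    into labeled difference vectors for a binary classifier:
    label +1 means the first item of the pair should rank above the second."""
    diffs, labels = [], []
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue  # equal grades carry no ordering information
            diffs.append(X[i] - X[j])
            labels.append(1 if y[i] > y[j] else -1)
    return np.array(diffs), np.array(labels)

# Toy example: 3 results for one query, 2-dimensional features.
X = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
y = np.array([2, 1, 0])  # higher grade = more interesting
D, L = pairwise_transform(X, y)
print(D.shape)  # (3, 2): pairs (0,1), (0,2), (1,2)
print(L)        # [1 1 1]
```

A linear classifier trained on such differences yields a weight vector that scores unseen items, which is essentially how a linear Ranking SVM produces rankings.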
ListNet [3]. An extension of RankNet that, instead of using pairwise rankings, considers all possible results for a given query as a single instance, making it possible to capture and exploit the intrinsic structure of the data.

Majority Voting [8]. The simplest method for combining the output of a set of classifiers. It assigns to a given result the most common class among those assigned by all the classifiers.

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

3. EXPERIMENTS & RESULTS
Five different runs were submitted for the video subtask, configured as shown in Table 1. As the proposed approach relies on combining different learning-to-rank algorithms, one of the runs considers a fusion of machine-learned rankers. For comparison purposes, each machine-learned ranker in isolation was evaluated in the other runs.
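Run 5 fuses the rankers' binary decisions with the majority voting rule of Section 2.2. A minimal sketch follows; the tie-breaking toward the negative class is an assumption for illustration, as the paper does not specify tie handling.

```python
from collections import Counter

def majority_vote(decisions):
    """Combine the binary decisions (one per classifier) for a single
    video segment; ties are broken toward 'not interesting' (class 0),
    an arbitrary choice made here for illustration."""
    counts = Counter(decisions)
    return 1 if counts[1] > counts[0] else 0

# Five rankers voting on one video segment.
print(majority_vote([1, 0, 1, 1, 0]))  # 1 (three of five say interesting)
print(majority_vote([0, 0, 1, 1, 0]))  # 0
```

Because every ranker gets equal weight, a strong ranker can be outvoted by several weak ones, which is consistent with the fusion results reported below.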
All these approaches were calibrated through 4-fold cross-validation on the development data.

Table 1: Configurations of runs.
Run   Learning-to-Rank Strategy
1     Ranking SVM
2     RankNet
3     RankBoost
4     ListNet
5     Majority Voting

The development data was used for training and is composed of 5,054 video segments from 52 movie trailers. Each video segment was represented by an HMP. Notice that only the visual content was considered, ignoring audio information and textual metadata. The extracted features were then used as input to train the aforementioned machine-learned rankers. The SVMrank package1 [7] was used for running Ranking SVM. The RankLib package2 was used for running RankNet, RankBoost, and ListNet. Ranking SVM was configured with a linear kernel; RankNet, RankBoost, and ListNet were configured with their default parameter settings. Next, the trained rankers were used to predict the rankings of the test video segments.
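For reference, SVMrank consumes training data in the SVMlight ranking format, one line per example. The helper below is an illustrative sketch of emitting such a line; the feature values and query id are hypothetical.

```python
def to_svmrank_line(rank, qid, features):
    """Format one training example in the SVMlight/SVMrank input format:
    '<target> qid:<qid> <index>:<value> ...' with 1-based feature indices."""
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{rank} qid:{qid} {feats}"

# Hypothetical segment of trailer (query) 1 with a 3-dimensional feature vector.
print(to_svmrank_line(2, 1, [0.1, 0.0, 0.9]))
# 2 qid:1 1:0.1 2:0.0 3:0.9
```

Pairwise constraints are then formed by SVMrank only between examples sharing the same qid, which matches the per-trailer grouping used here.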
The rankings associated with the video segments of a same movie trailer were normalized using z-score normalization. After that, a thresholding method was applied to transform the normalized rankings into binary decisions. It was found empirically that better results were obtained when a video segment is classified as interesting if its normalized rank is greater than 0.7; otherwise, it is classified as non-interesting. Finally, the binary decisions of all the rankers are combined using a majority voting scheme, producing the final classification. The effectiveness of each strategy was assessed using Mean Average Precision (MAP).

Table 2 presents the results obtained on the development data. Observe that the performance of the different learning-to-rank algorithms in isolation is similar, with a small advantage for Ranking SVM. By analyzing the confidence intervals, it can be noticed that the results achieved by the fusion of all the machine-learned rankers seem promising.

Table 2: Results obtained on the development data.
             Conf. Interval (95%)
Run   Avg.   min.    max.
1     15.19  13.99   16.38
2     13.82  12.09   15.55
3     14.67  12.93   16.42
4     13.32  12.06   14.57
5     14.71  12.69   16.73

Table 3 presents the results obtained by the official submitted runs on 2,342 video segments from 26 movie trailers of the test data. Observe that the best results were achieved by a learning-to-rank algorithm in isolation, more specifically Ranking SVM. It can be noticed that the fusion of all the machine-learned rankers did not improve the overall performance. One of the reasons for those results is the strategy adopted for combining learning-to-rank algorithms, in which all of them are treated equally.

Table 3: Results of the official submitted runs.
Run   Learning-to-Rank Strategy   MAP (%)
1     Ranking SVM                 18.15
2     RankNet                     16.17
3     RankBoost                   16.17
4     ListNet                     16.56
5     Majority Voting             16.53

Figure 1 presents the Average Precision (AP) per movie trailer achieved in each of the submitted runs. Although the MAP obtained for the fusion of all the machine-learned rankers is not superior to each of them in isolation, the obtained results show the potential of the idea. Notice that Ranking SVM provides the best results for 8 movie trailers, RankNet was the best for 8 movie trailers, and RankBoost performs better than both of them for 7 movie trailers. This clearly indicates that those learning-to-rank algorithms provide complementary information that can be combined by fusion techniques to produce better results.

[Figure 1: AP per movie trailer (video-52 through video-77) achieved in each run, comparing Ranking SVM, RankNet, RankBoost, ListNet, and Majority Voting.]
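The per-trailer normalization and thresholding step described in Section 3 can be sketched as follows; this is an illustrative reading of the procedure (population z-score per trailer, fixed 0.7 cutoff), not the authors' exact code.

```python
import numpy as np

def binarize_rankings(scores, threshold=0.7):
    """Z-score normalize the ranking scores of the segments of one
    movie trailer, then flag as interesting (1) the segments whose
    normalized score exceeds the threshold (0.7, found empirically
    in the paper)."""
    scores = np.asarray(scores, dtype=float)
    z = (scores - scores.mean()) / scores.std()  # population std (ddof=0)
    return (z > threshold).astype(int)

# Hypothetical ranking scores for five segments of one trailer.
scores = [0.2, 0.5, 0.9, 1.8, 0.3]
print(binarize_rankings(scores))  # [0 0 0 1 0]
```

Normalizing within each trailer makes the 0.7 cutoff comparable across trailers whose raw ranking scores live on different scales.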
4. CONCLUSIONS
The proposed approach explored only visual properties. Different learning-to-rank strategies were considered, including a fusion of all of them. The obtained results demonstrate that the proposed approach is promising. Future work includes investigating a smarter strategy for combining learning-to-rank algorithms and considering other information sources to include more features semantically related to the visual content.

5. ACKNOWLEDGMENTS
The author would like to thank CAPES, CNPq, and FAPESP (grant #2016/06441-7) for funding.

1 https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html (As of September 2016)
2 https://sourceforge.net/p/lemur/wiki/RankLib/ (As of September 2016)

6. REFERENCES
[1] J. Almeida, N. J. Leite, and R. S. Torres. Comparison of video sequences with histograms of motion patterns. In ICIP, pages 3673–3676, 2011.
[2] C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. N. Hullender. Learning to rank using gradient descent. In ICML, pages 89–96, 2005.
[3] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In ICML, pages 129–136, 2007.
[4] C.-H. Demarty, M. Sjöberg, B. Ionescu, T.-T. Do, H. Wang, N. Q. K. Duong, and F. Lefebvre. MediaEval 2016 Predicting Media Interestingness Task. In Proc. of the MediaEval 2016 Workshop, Hilversum, Netherlands, Oct. 20–21, 2016.
[5] Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
[6] Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yang. Understanding and predicting interestingness of videos. In AAAI, pages 1113–1119, 2013.
[7] T. Joachims. Training linear SVMs in linear time. In ACM SIGKDD, pages 217–226, 2006.
[8] L. Lam and C. Y. Suen. Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Trans. Systems, Man, and Cybernetics, Part A, 27(5):553–568, 1997.
[9] L. T. Li, D. C. G. Pedronette, J. Almeida, O. A. B. Penatti, R. T. Calumby, and R. S. Torres.
A rank aggregation framework for video multimodal geocoding. Multimedia Tools and Applications, 73(3):1323–1359, 2014.