RECOD @ Placing Task of MediaEval 2016: A Ranking Fusion Approach for Geographic-Location Prediction of Multimedia Objects

Javier A. V. Muñoz(1), Lin Tzy Li(1), Ícaro C. Dourado(1), Keiller Nogueira(5), Samuel G. Fadel(1), Otávio A. B. Penatti(1,4), Jurandy Almeida(1,2), Luís A. M. Pereira(1), Rodrigo T. Calumby(1,3), Jefersson A. dos Santos(5), Ricardo da S. Torres(1)

(1) RECOD Lab, Institute of Computing, University of Campinas (UNICAMP); (2) GIBIS Lab, Institute of Science and Technology, Federal University of São Paulo (UNIFESP); (3) University of Feira de Santana; (4) SAMSUNG Research Institute Brazil; (5) Department of Computer Science, Universidade Federal de Minas Gerais (UFMG) – Brazil

jalvarm.acm@gmail.com, {lintzyli, samuel.fadel, luis.pereira, rtorres}@ic.unicamp.br, icaro.dourado@students.ic.unicamp.br, o.penatti@samsung.com, jurandy.almeida@unifesp.br, rtcalumby@ecomp.uefs.br, {keiller.nogueira, jefersson}@dcc.ufmg.br

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
We describe the approach proposed by the RECOD team for the estimation-based sub-task of the Placing Task at MediaEval 2016. Our approach uses genetic programming (GP) to combine ranked lists defined in terms of textual and visual descriptors to automatically assign geographic locations to images and videos.

1. INTRODUCTION
By having multimedia content annotated with geographic information, we can provide richer services for users, such as placing information on maps and providing geographic searches. Since 2011, the Placing Task [3] at MediaEval has challenged participants to automatically assign geographic locations to images and videos.

Here we present our approach for the estimation-based sub-task of the Placing Task 2016. It combines textual, audio, and/or visual descriptors by applying rank aggregation and ranked list density analysis to fuse multimodal information encoded in ranked lists. We evaluated new features and a genetic programming (GP) [5] approach for multimodal geocoding. GP provides a good framework for modeling optimization problems even when the variables are functions. We applied combinations of rank aggregation methods defined by a GP framework. The idea is to automatically select the set of features and rank aggregation functions that yields the best result according to a given fitness function. Previous work [8, 16] has shown that combining aggregated ranked lists and rank aggregation functions [15] yields very effective results.

2. PROPOSED APPROACH
Our approach estimates location based on rank aggregation of a multitude of ranked lists and their top-K density analysis [8]. We extracted a large set of features from the data, derived their ranked lists, and combined them using rank aggregation methods, which in turn are selected and fused by the GP-based framework proposed in [15] (GP-Agg).

For evaluation purposes in the training phase (as in 2015 [6]), we split the whole training set into two parts: (i) a validation set and (ii) a sub-training set. The validation set has 4,674 images and 903 videos, while the sub-training set has 12,935 videos and 4,188,484 images.

2.1 Features
Textual. The title, description, and tags of photos/videos were concatenated into a single field. The text was stemmed and stopwords were removed. We used BM25, TF-IDF (cosine), information-based similarity (IBSimilarity, IBS), and language-modelling similarity (LMDirichletSimilarity, LMD), which are similarity measures implemented in the Lucene package [9].

Audio/Visual. For visual place recognition of images, we used the provided features: edgehistogram (EHD), scalablecolor (SCD), GIST (static feature), cedd, col, jhist, and tamura. We also extracted BIC [12] and deep-learning-based features (GoogleNet) [13]. For video data, due to time and infrastructure constraints on extracting features for the new videos in the test set, we were only able to use histograms of motion patterns (HMP) [1].
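For concreteness, the sketch below shows how one textual ranked list could be generated for a test item against the training set. It is a minimal, self-contained BM25 scorer, not the Lucene implementation we actually used; the corpus format and all identifiers (bm25_ranked_list, corpus) are illustrative only.

```python
# Minimal sketch of generating one textual ranked list for a test item.
# NOT the Lucene BM25 used in the paper -- a self-contained illustration.
import math
from collections import Counter

K1, B = 1.2, 0.75  # common BM25 defaults

def bm25_ranked_list(query_tokens, corpus):
    """corpus: dict mapping training-item id -> list of stemmed tokens
    (title + description + tags concatenated, as in Section 2.1).
    Returns [(item_id, score), ...], best match first."""
    n = len(corpus)
    avgdl = sum(len(toks) for toks in corpus.values()) / n
    qset = set(query_tokens)
    # document frequency of each query term over the corpus
    df = Counter()
    for toks in corpus.values():
        for term in qset & set(toks):
            df[term] += 1
    scores = {}
    for item_id, toks in corpus.items():
        tf = Counter(toks)
        score = 0.0
        for term in qset:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = (tf[term] * (K1 + 1)) / (
                tf[term] + K1 * (1 - B + B * len(toks) / avgdl))
            score += idf * norm
        scores[item_id] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])

# One such list is built per feature/similarity, always against the full
# training set ("geo-profiles"); the ids below are made up.
corpus = {"img1": ["eiffel", "tower", "paris"], "img2": ["beach", "rio"]}
print(bm25_ranked_list(["paris", "tower"], corpus))
```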
2.2 GP-based Rank Aggregation & Geocoding
We used the full training set as geo-profiles, and each test item was compared to the whole training set for each feature independently. For a given test item, a ranked list was thus generated for each feature. These ranked lists were then aggregated through the GP-Agg framework [15]. Given the improvements obtained last year by applying ranked list density analysis (RLDA) over the final combined ranked list [6], we explored including the RLDA function in the GP-Agg framework itself: both in the fitness-function evaluation and in the tree structure of GP individuals (as a unary and binary operator). In this way, the GP-Agg framework was able to apply the RLDA density function at earlier steps of the combination, which improved the results. Including RLDA in the set of rank aggregation functions makes it the only function that exploits geographic location during the combination, whereas the classic approaches use only similarity scores or rank positions.

The GP-Agg method uses genetic programming to combine a set of rank aggregation methods in an agglomerative way, in order to improve on the results of the isolated methods [15]. We used this method to combine the textual and visual ranked lists generated for the various descriptors. It was chosen because the authors of [15] showed that GP-Agg produces results better than or equal to the best of a wide range of rank aggregation techniques, both supervised and unsupervised. Moreover, it requires a reasonable training time (a couple of hours), and applying the best individual (the discovered function) to the test set is relatively fast.
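The aggregation operators listed in Table 1 and nested in Table 2 are classic unsupervised fusion functions. The sketch below gives simplified textbook formulations of a few of them under our own representation (a ranked list as a dict from item id to similarity score); it is not the GP-Agg implementation of [15], and the example ids are made up.

```python
# Simplified textbook versions of some rank aggregation operators used
# as GP-Agg node functions (Table 1). A ranked list is represented as a
# dict: item id -> similarity score. Not the implementation of [15].

def normalize(lst):
    """Min-max normalize scores into [0, 1]."""
    lo, hi = min(lst.values()), max(lst.values())
    span = (hi - lo) or 1.0
    return {i: (s - lo) / span for i, s in lst.items()}

def ranking(lst):
    """Item ids ordered by decreasing score."""
    return sorted(lst, key=lst.get, reverse=True)

def comb_sum(*lists):
    """CombSUM: sum of normalized scores across the input lists."""
    out = {}
    for lst in map(normalize, lists):
        for item, s in lst.items():
            out[item] = out.get(item, 0.0) + s
    return out

def comb_mnz(*lists):
    """CombMNZ: CombSUM weighted by how many lists retrieved the item."""
    summed = comb_sum(*lists)
    hits = {i: sum(i in lst for lst in lists) for i in summed}
    return {i: summed[i] * hits[i] for i in summed}

def rrf(*lists, k=60):
    """Reciprocal Rank Fusion: sum of 1 / (k + rank) over the lists."""
    out = {}
    for lst in lists:
        for rank, item in enumerate(ranking(lst), start=1):
            out[item] = out.get(item, 0.0) + 1.0 / (k + rank)
    return out

def borda_count(*lists):
    """BordaCount: an item gets (N - rank) points from each list of size N."""
    out = {}
    for lst in lists:
        order = ranking(lst)
        n = len(order)
        for rank, item in enumerate(order):
            out[item] = out.get(item, 0.0) + (n - rank)
    return out

# A GP individual is an expression tree whose internal nodes are such
# operators and whose leaves are per-feature ranked lists, e.g.:
bm25 = {"a": 9.1, "b": 4.2, "c": 1.3}
ibs = {"a": 0.4, "c": 0.9, "d": 0.7}
fused = comb_sum(rrf(ibs, bm25), comb_mnz(bm25, ibs))
print(ranking(fused))
```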
The GP-Agg method was trained using 400 queries randomly chosen from the validation set, together with their ranked lists. We stopped the evolution process at the 20th generation. We used the fitness functions, genetic operators, and rank aggregation techniques that yielded the best results in [15]. The GP-Agg parameters are shown in Table 1.

Table 1: GP-Agg parameters [15].
  Parameter               Value
  Number of generations   20
  Genetic operators       Reproduction, Mutation, Crossover
  Fitness functions       FFP1, WAS*, MAP, NDCG
  Rank agg. methods       CombMAX, CombMIN, CombSUM, CombMED, CombANZ,
                          CombMNZ, RLSim, BordaCount, RRF, MRA, RLDA
  * WAS (Weighted Average Score) as defined in [7].

For the training phase of GP-Agg, an element of a ranked list was considered relevant if it is located no farther than 1 km from the ground-truth location of the query element. Among the fitness functions tested, the best (most precise) results were achieved with WAS [7] and FFP1 [4]. The best individuals discovered in the training phase were then applied to combine the ranked lists of the test set. The predicted lat/long for a test-set element is obtained by picking the lat/long of the first element of its combined ranked list (which could be the single result of RLDA).

3. OUR SUBMISSIONS & RESULTS
Based on the parameters of our best results in the evaluation phase, our submissions were configured as shown in Table 2. For each run, the table shows the combination function applied to the test set; some were discovered by the GP-Agg framework and others were chosen based on experimental results, as explained in the next paragraphs. Runs 1 and 4 were based on textual-only descriptors, Run 2 was visual-only, and Run 3 was our multimodal submission. For the textual and multimodal runs, we set the top-K parameter of RLDA to 5, and for the visual ones to 100. No extra crawled material or gazetteers were used in our submissions.

Table 2: Configurations of runs.
  Run 1
    Photo: RLDA(RLDA(CombSUM(CombMED(BM25, IBS), RRF(IBS, BM25)), RLDA(CombSUM(BM25, BM25), RLDA(LMD))), CombANZ(IBS, BordaCount(CombMNZ(LMD, BM25), CombMNZ(LMD, BM25))))
    Video: CombSUM(BordaCount(RLDA(RLDA(BM25, TF-IDF)), BordaCount(CombMAX(BM25, BM25), CombSUM(BM25, BM25))), CombMNZ(CombMAX(RLDA(BM25, TF-IDF), CombMIN(LMD, IBS)), CombMED(CombMIN(BM25, TF-IDF), TF-IDF)))
  Run 2
    Photo: RRF(RRF(CombMNZ(CombANZ(GoogleNet, jhist), RRF(tamura, cedd)), CombMAX(GoogleNet, tamura)), CombMED(MRA(col, GIST), RLDA(BIC, RLDA(EHD, GoogleNet))))
    Video: RLDA(HMP)
  Run 3
    Photo: RLDA(CombMNZ(CombMED(BM25, CombMNZ(GoogleNet, IBS)), BM25), MRA(CombMNZ(LMD, LMD), CombMNZ(CombSUM(TF-IDF, BM25), TF-IDF)))
    Video: BordaCount(CombANZ(CombMAX(BM25, RLSim(TF-IDF, LMD)), BordaCount(CombANZ(HMP, TF-IDF), CombMED(LMD, BM25))), CombSUM(RLDA(BM25, RLDA(BM25, LMD)), CombMAX(RRF(IBS, TF-IDF), RLSim(LMD, LMD))))
  Run 4
    Photo: RLDA(IBS, BM25, LMD, TF-IDF)
    Video: LMD

In the case of photos, for Runs 1-3, we used the GP-Agg framework to discover a semi-optimal combination of rank aggregation functions and ranked lists. For Run 4, we used the configuration that gave our best results in the previous year. The results in Table 3 show slight improvements from including RLDA in the GP-Agg framework (Run 1 vs. Run 4). As shown in Table 3, most of our best results came from Run 1, where GP-Agg applied rank aggregation to textual descriptors. For the visual run (Run 2), combining rank aggregation functions and different visual features, including GoogleNet, improved our results over last year's.

Table 3: % of photos predicted correctly in the test set; the last two rows report the average and median distance error (km).
  Precision     Run 1     Run 2     Run 3     Run 4
  10m            0.59      0.09      0.56      0.56
  100m           6.07      0.87      5.97      5.94
  1km           21.06      2.36     20.83     20.73
  10km          38.00      4.47     37.72     37.47
  100km         46.23      5.88     46.04     45.71
  1000km        59.69     21.46     59.89     59.28
  Avg (km)    2872.49   5664.14   2797.58    2919.3
  Median (km)  254.67   5821.07    261.66    279.37

The results for videos are presented in Table 4. As with images, the best video results were obtained by applying GP-Agg over textual ranked lists. For Runs 1 and 3, we combined the ranked lists using the individual found by GP-Agg. We were unable to use GP-Agg for Run 2 (visual) because we had only the HMP descriptor, so we applied RLDA over it. In Run 4 we used only the best textual descriptor, since the best configuration from the previous year decreased the precision of the video results. Table 4 shows significant improvements from combining textual ranked lists through the GP-Agg framework over the best single textual descriptor (Run 1 vs. Run 4).

Table 4: % of videos predicted correctly in the test set; the last two rows report the average and median distance error (km).
  Precision     Run 1     Run 2     Run 3     Run 4
  10m            0.45      0.00      0.51      0.37
  100m           5.74      0.03      5.82      4.03
  1km           18.69      0.15     18.46     13.51
  10km          33.57      1.15     33.38     25.76
  100km         41.56      2.46     41.20     33.02
  1000km        54.51     13.54     54.77     47.67
  Avg (km)     3204.8   6085.23   3123.38   3739.79
  Median (km)  566.96   6085.63    571.24   1236.51

In both cases, photos and videos, the results show no gain from combining textual and visual information (Run 3) through GP-Agg. This is explained by the fact that the visual ranked lists have significantly lower precision than the textual ones, and it is hard to find complementarity between these types of lists by applying only classical rank aggregation methods.
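The geographic side of the pipeline above (the 1 km relevance criterion in training, RLDA's top-K density analysis, and the distance thresholds of Tables 3 and 4) all rest on great-circle distances. The sketch below is a minimal density analysis in the spirit of RLDA [8], not the exact formulation used in our runs: among the top-K items of a combined ranked list, it predicts the lat/long of the candidate with the most geographic neighbors. The radius_km parameter and all identifiers are our own illustrative choices.

```python
# Minimal sketch in the spirit of ranked list density analysis (RLDA [8]):
# pick, among the top-K candidates of a combined ranked list, the location
# whose neighborhood is densest. A simplification, not the paper's method.
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def rlda_predict(ranked_list, geo, k=5, radius_km=10.0):
    """ranked_list: item ids, best first; geo: id -> (lat, lon).
    k=5 was our setting for textual/multimodal runs, k=100 for visual
    ones; radius_km is a hypothetical density radius."""
    top = ranked_list[:k]
    def density(i):
        return sum(haversine_km(geo[i], geo[j]) <= radius_km
                   for j in top if j != i)
    # break density ties by preferring the better-ranked candidate
    best = max(top, key=lambda i: (density(i), -top.index(i)))
    return geo[best]

# The same distance underlies the training relevance test (<= 1 km from
# the query's ground truth) and the thresholds of Tables 3 and 4.
geo = {"a": (48.858, 2.294), "b": (48.861, 2.296), "c": (-22.95, -43.21)}
print(rlda_predict(["c", "a", "b"], geo, k=3))
```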
4. FUTURE WORK
We plan to evaluate more textual and visual descriptors and give them as input to GP-Agg for selecting descriptors and rank aggregation methods. For example: (a) a textual descriptor that combines a graph representation [10] with a framework for graph-to-vector synthesis [11]; (b) applying results from work that tackles visual place recognition [14] and geolocation with convolutional neural networks [2, 17]; (c) extracting visual features with GoogleNet and BIC from video frames.

Acknowledgments
We thank FAPESP, CNPq, CAPES, and Samsung.

5. REFERENCES
[1] J. Almeida, N. J. Leite, and R. da Silva Torres. Comparison of video sequences with histograms of motion patterns. In ICIP, pages 3673-3676, 2011.
[2] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.
[3] J. Choi, C. Hauff, O. V. Laere, and B. Thomee. The Placing Task at MediaEval 2016. In Working Notes Proc. MediaEval Workshop, Hilversum, Netherlands, Oct. 2016.
[4] W. Fan, E. A. Fox, P. Pathak, and H. Wu. The effects of fitness functions on genetic programming-based ranking discovery for web search. Journal of the American Society for Information Science and Technology, 55(7):628-636, 2004.
[5] J. R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA, 1992.
[6] L. T. Li, J. A. Muñoz, J. Almeida, R. T. Calumby, O. A. Penatti, Í. C. Dourado, K. Nogueira, P. R. M. Júnior, L. A. Pereira, D. C. G. Pedronette, et al. RECOD @ Placing Task of MediaEval 2015. In Working Notes Proc. MediaEval Workshop, volume 15, page 2, 2015.
[7] L. T. Li, D. C. G. Pedronette, J. Almeida, O. A. B. Penatti, R. T. Calumby, and R. da Silva Torres. A rank aggregation framework for video multimodal geocoding. Multimedia Tools and Applications, pages 1-37, 2013. http://dx.doi.org/10.1007/s11042-013-1588-4.
[8] L. T. Li, O. A. B. Penatti, J. Almeida, G. Chiachia, R. T. Calumby, P. R. M. Júnior, D. C. G. Pedronette, and R. da S. Torres. Multimedia geocoding: The RECOD 2014 approach. In Working Notes Proc. MediaEval Workshop, volume 1263, page 2, 2014.
[9] Apache Lucene Core. Web site. http://lucene.apache.org/core/. As of Sept. 2015.
[10] A. Schenker, H. Bunke, M. Last, and A. Kandel. Graph-Theoretic Techniques for Web Content Mining. World Scientific Publishing Co., Inc., NJ, USA, 2005.
[11] F. B. Silva, S. Tabbone, and R. da S. Torres. BoG: A new approach for graph matching. In ICPR, pages 82-87. IEEE, Aug. 2014.
[12] R. d. O. Stehling, M. Nascimento, and A. Falcão. A compact and efficient image retrieval approach based on border/interior pixel classification. In Proceedings of the 11th International Conference on Information and Knowledge Management, CIKM '02, pages 102-109, 2002.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.
[14] A. Torii, R. Arandjelović, J. Sivic, M. Okutomi, and T. Pajdla. 24/7 place recognition by view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR '15, pages 1808-1817, 2015.
[15] J. A. Vargas Muñoz, R. da Silva Torres, and M. A. Gonçalves. A soft computing approach for learning to aggregate rankings. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM '15, pages 83-92, New York, NY, USA, 2015. ACM.
[16] M. N. Volkovs and R. S. Zemel. CRF framework for supervised preference aggregation. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM '13, pages 89-98, New York, NY, USA, 2013.
[17] T. Weyand, I. Kostrikov, and J. Philbin. PlaNet - photo geolocation with convolutional neural networks. In European Conference on Computer Vision (ECCV), 2016.