Rank Fusion and Multimodal Per-topic Adaptiveness for Diverse Image Retrieval Rodrigo T. Calumby1 , Iago B. A. C. Araujo1 , Felipe S. Cordeiro1 , Fabiana C. Bertoni1 , Sérgio Canuto2 , Fabiano Belém2 , Marcos A. Gonçalves2 , Icaro Dourado3 , Javier A. V. Munoz3 , Lin Tzy Li3 , Ricardo da S. Torres3 1 University of Feira de Santana, Brazil, [rtcalumby,ibacaraujo,fscordeiro,fabianabertoni]@ecomp.uefs.br 2 Federal University of Minas Gerais, Brazil, [sergiocomputacao,famube,magoncalv]@gmail.com 3 University of Campinas, Brazil, [icaro.dourado,javier.munoz,lintzyli,rtorres]@ic.unicamp.br ABSTRACT 2.1 Relevance Enhancement and Filtering This paper presents the MultiBrasil team experience in the Re- For improving the original ranking of the images from each topic, trieving Diverse Social Images Task at MediaEval 2017. The teams we explored textual and credibility-based ranking. For text ranking, were required to develop a diversification approach for social image we used the original topic terms as query and for each flickr image retrieval, enhanced with visual summarization. Our proposal for rel- obtained for that topic, the title, description, and tag data were evance improvement relies on text and credibility-based reranking concatenated before preprocessing (see Section 3.2). For credibility- and rank aggregation. For diversification, we use diversity-oriented based ranking, the user credibility scores were used as relevance of reranking and also propose a clustering-based query-adaptive di- her uploaded images (see Section 3.3). versity promotion approach. We applied Genetic Programming and Additionally, for the aggregation of multiple rankings, we applied Genetic Algorithm-based approaches for combining textual, visual, the Genetic Programming approach from GPAgg [13], which com- and user credibility information. bines several well-known rank aggregation methods. This method was trained using the development data and integrated order-based (MRA [8], RRF [7], and BordaCount [14]) and score-based (Comb- MIN, CombMAX, CombSUM, ComMED, CombANZ [12], and RL- 1 INTRODUCTION Sim [9]) rank fusion methods. Beyond relevance, for many search tasks the coverage of different As a relevance-based filtering, from the final aggregated list, the query aspects/intents in the retrieved set has great impact on ful- top-ranked images were selected as input for the diversification step. filling user needs [5, 11]. Promoting diversity has been shown to We evaluated multiple cutoff points with best results achieved by positively impact the user search experience specially for ambigu- using only the 200-top images (run 1) and the 50-top images (runs 2 ous, underspecified, and visual summarization queries [1–4]. to 5). Considering the diversification approaches, keeping more than However, tackling the balance between relevance and diversity 50 images degraded the final ranking by pushing more non-relevant is still a great challenge. The Retrieving Diverse Social Images Task images to be reranked and consequently also degrading diversity. 2017 [15] models it into a general ad-hoc image retrieval challenge Since we did not visually reranked the images for run 1, a deeper in which systems are supposed to handle complex and general- reranking had to be considered to allow better diversification, which purpose multi-concept queries. This paper describes the MultiBrasil in turn may negatively impact final relevance. team proposals based on reranking and query-adaptive diversity promotion boosted by multimodal rank fusion. 2.2 Diversification For diversification, we tested two methods. First, a reranking method 2 PROPOSED APPROACH following the traditional Maximal Marginal Relevance [6] approach The proposed approach consists in improving the original Flickr considering multiple features. For this method, the feature combi- ranking for relevance-based filtering followed by a diversification nation is performed by averaging the individual similarity scores. step with diversity-oriented reranking or query-adaptive clustering- The relevance-diversity trade-off adjustment was selected based on based summarization. By improving the original ranking and keep- the best results on the development set. ing only the top-ranked items, we intend to construct a more rele- Alternatively, our query-adaptive clustering method seeks to vant subset, which may reduce data noise for the subsequent diver- construct a more suitable clustering structure based on an evo- sity promotion step. In turn, for diversification, we have evaluated lutionary weight adjustment metric learning method guided by two approaches: (i) relevance-diversity balancing via reranking; and intrinsic clustering fitness evaluation. Here, we used a Genetic (ii) representative image selection via query-adaptive clustering Algorithm (GA) for per-query feature weight learning. For combi- metric learning. nation compatibility, min-max normalization was applied for all distance matrices. As an unsupervised optimization criteria, we evaluated the clus- Copyright held by the owner/author(s). tering quality of the discovered functions by clustering the 50- MediaEval’17, 13-15 September 2017, Dublin, Ireland top images using agglomerative hierarchical clustering (average- linkage). For metric learning and final diversification, we used 25 MediaEval’17, 13-15 September 2017, Dublin, Ireland R. T. Calumby et al. Table 1: Runs Configurations. Table 2: Development Set Results. Run Reranking Cutoff Diversification Diversification Development Set Function Method Features Run P@20 CR@20 F1@20 ERR-IA@20 α -NDCG@20 1 - 200 MMR ACC 1 0.5832 0.4057 0.4595 0.5452 0.5020 2 BM25 50 MMR BM25+Jaccard 2 0.6977 0.4896 0.5503 0.6344 0.5922 3 BM25 50 MMR BM25+Jaccard+Phog 3 0.7073 0.4968 0.5587 0.6413 0.6008 4 GPAgg 50 MMR BM25+Jaccard+Phog 4 0.6914 0.4957 0.5527 0.6308 0.5930 5 BM25 50 GA BM25+Jaccard+Phog 5 0.7009 0.4825 0.5447 0.6269 0.5868 clusters. The clusters were ranked according to their sizes in de- Table 3: Test Set Results. scending order and intra-cluster sorting was applied using the im- ages original ranking positions. The final ranking was constructed Test Set in a round-robin fashion from the final clusters. Run P@20 CR@20 F1@20 ERR-IA@20 α -NDCG@20 1 0.5976 0.5758 0.5657 0.5908 0.5618 2 0.7083 0.6524 0.6559 0.6692 0.6391 3 FEATURES 3 0.7208 0.6482 0.6634 0.6778 0.6461 3.1 Visual features 4 0.7202 0.6498 0.6614 0.6806 0.6479 We evaluated only the provided visual descriptors for run 1. Addi- 5 0.7173 0.6363 0.6512 0.6355 0.6202 tionally, the combination of each visual feature with the best text similarity measures was evaluated for the remaining runs. For the Regarding the test set, the text-only run achieved superior effec- diversification step, the features were combined by averaging the tiveness than the visual-only run, which is a direct consequence of respective similarity scores. the textual reranking and filtering of the input list. The visual-only run handled more non-relevant images, which impacted the final 3.2 Text similarity relevance and diversity. For text-only and multimodal reranking (runs 2 to 5), the text-based In general, all multimodal runs (3, 4, and 5) achieved similar scores were computed as the similarity between the text vectors effectiveness, with run 3 being slightly superior. Nevertheless, con- associated with the images and the original query terms. As text sidering F1@20 for the test set, although run 3 outperformed run 5, preprocessing, we applied stopwords removal1 and stemming [10]. in a per-query analysis, we noticed that run 5 outperforms for ∼40% We evaluated several similarity scores: BM25, Cosine, Dice, Jaccard, of the topics. Moreover, the absolute difference between runs 3 and 5 and TF-IDF. These scores were also evaluated for the diversification was 0.0714 in terms of F1@20. Furthermore, even though runs 3 procedure. and 4 rely on the same diversification method, the average F1@20 difference was 0.0571, with run 4 outperforming for roughly 43% of 3.3 Credibility the topics. Hence, we highlight the opportunity for further improve- The user credibility scores were individually used for ranking. In ment with per-query adaptiveness, for instance, by selecting the this step, we first evaluated the ranking quality of each score in- most suitable diversification model or even dynamically combining dividually and finally aggregated the ranking for: bulkProportion, them. meanImageTagClarity, meanTagRank, meanTagsPerPhoto, mean- TitleWordCounts, photoCount, uniqueTags and uploadFrequency. 6 CONCLUSIONS Additionally, we have also created a ranking considering a linear In our experiments, we have combined traditional reranking and combination of such scores with the weights empirically adjusted, clustering-based diversification methods along with ranking fusion which here we name linearCred. and per-query adaptive feature fusion for clustering. Even though traditional methods slightly outperformed our more complex pro- 4 RUN CONFIGURATIONS posals, we found the results to be satisfactory. In this case, the small We submitted 5 runs. In Table 1, GPAgg (in run 4) is the GP-based training corpus is considered a challenging factor for the learning rank aggregation of BM25, Jaccard, DICE, and linearCred rankings. strategies. Moreover, our results have shown the importance of For all runs, we have used only the data provided by the task, and improving the original ranking for allowing better results, both in the parameters and features used were chosen according to the best terms of relevance and diversity. We have also shown that properly results yielded from development set. selecting the most suitable diversification approach or integrating alternative methods may lead to further improvements. 5 RESULTS AND DISCUSSION Tables 2 and 3 present the effectiveness results for the five runs, ACKNOWLEDGMENTS respectively, for development set and test set, considering all official The authors thank FAPESP, FAPEMIG, CAPES, CNPq, and FAPESB measures. (grant #4098/2016) for partially funding this work. 1 http://code.google.com/p/stop-words/ (version 2011.11.21 – As of Aug. 2017). Retrieving Diverse Social Images Task MediaEval’17, 13-15 September 2017, Dublin, Ireland REFERENCES [1] Fabiano M. Belém, Carolina S. Batista, Rodrygo L. T. Santos, Jussara M. Almeida, and Marcos A. Gonçalves. 2016. Beyond Relevance: Explicitly Promoting Novelty and Diversity in Tag Recommendation. ACM Transactions on Intelligent Systems and Technology 7, 3 (2016), 26:1– 26:34. [2] Jingwen Bian, Yang Yang, Hanwang Zhang, and Tat-Seng Chua. 2015. Multimedia Summarization for Social Events in Microblog Stream. IEEE Transactions on Multimedia 17, 2 (2015), 216–228. [3] R. T. Calumby, R. da S. Torres, and M. A. Gonçalves. 2014. Diversity- driven Learning for Multimodal Image Retrieval with Relevance Feed- back. In Proceedings of the 21st IEEE International Conference on Image Processing. 2197–2201. [4] R. T. Calumby, M. A. Gonçalves, and R. da S. Torres. 2016. On Interac- tive Learning-to-rank for IR: Overview, recent advances, challenges, and directions. Neurocomputing 208 (2016), 3–24. [5] R. T. Calumby, M. A. Gonçalves, and R. da S. Torres. 2017. Diversity- based interactive learning meets multimodality. Neurocomputing 259 (2017), 159–175. [6] J. Carbonell and J. Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Anual International ACM SIGIR Conference on Research and Development in Information Retrieval. 335–336. [7] Gordon V. Cormack, Charles L A Clarke, and Stefan Buettcher. 2009. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 758– 759. [8] Ronald Fagin, Ravi Kumar, and D. Sivakumar. 2003. Efficient Similarity Search and Classification via Rank Aggregation. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. 301–312. [9] Daniel Carlos Guimarães Pedronette and R. da S. Torres. 2011. Image Re-ranking and Rank Aggregation Based on Similarity of Ranked Lists. In Proceedings of the 14th International Conference on Computer Analysis of Images and Patterns - Volume Part I. 369–376. [10] M. F. Porter. 1980. An algorithm for suffix stripping. Program 14, 3 (1980), 130–137. [11] Rodrygo L. T. Santos, Craig Macdonald, and Iadh Ounis. 2015. Search Result Diversification. Foundations and Trends in Information Retrieval 9, 1 (2015), 1–90. [12] Joseph A. Shaw, Edward A. Fox, Joseph A. Shaw, and Edward A. Fox. 1994. Combination of Multiple Searches. In The Second Text REtrieval Conference (TREC-2). 243–252. [13] Javier A. Vargas, R. da S. Torres, and Marcos A. Gonçalves. 2015. A Soft Computing Approach for Learning to Aggregate Rankings. In Proceedings of the 24th ACM International Conference on Conference on Information and Knowledge Management. [14] H. P. Young. 1974. An axiomatization of Borda’s rule. Journal of Economic Theory 9, 1 (1974), 43–52. [15] Maia Zaharieva, Bogdan Ionescu, Alexandru Lucian Gînscă, Rodrygo L.T. Santos, and Henning Müller. 2017. Retrieving Diverse Social Images at MediaEval 2017: Challenges, Dataset and Evaluation. In Proceedings of the MediaEval 2017 Workshop. Dublin, Ireland, Sept. 13-15.