UPC-UB-STP @ MediaEval 2015 Diversity Task: Iterative Reranking of Relevant Images

Aniol Lidon, Xavier Giró-i-Nieto
Universitat Politècnica de Catalunya, Barcelona, Catalonia/Spain
xavier.giro@upc.edu

Marc Bolaños, Petia Radeva
Universitat de Barcelona, Barcelona, Catalonia/Spain
marc.bolanos@ub.edu

Markus Seidl, Matthias Zeppelzauer
St. Pölten University of Applied Sciences, St. Pölten, Austria
m.zeppelzauer@fhstp.ac.at

ABSTRACT
This paper presents the results of the UPC-UB-STP team in the 2015 MediaEval Retrieving Diverse Images Task. The goal of the challenge is to provide a ranked list of Flickr photos for a predefined set of queries. Our approach first generates a ranking of images based on a query-independent estimation of their relevance. Only the top results are kept and iteratively re-ranked based on their intra-similarity to introduce diversity.

1. INTRODUCTION
The diversification of search results is an important factor to improve the usability of visual retrieval engines. This motivates the 2015 MediaEval Retrieving Diverse Images Task [8], which defines the scientific benchmark targeted in this paper. The proposed methodology solves the trade-off between relevance and diversity by first filtering results based on a learned relevance classifier, and then building a diverse reranked list following an iterative scheme.

The first challenge in our system is filtering irrelevant images, as suggested in [2]. Relevance is a very abstract concept with a high degree of subjectivity involved. Similar problems have been addressed in the visual domain, such as memorability [10] or interestingness [16]. In both cases, a crowdsourced task was organised to collect a large amount of human annotations used to train a classifier based on visual features.

The second challenge to address is the diversity of the ranked list. A seminal work from 1998 [1] introduced diversity in addition to relevance for text retrieval, a concept that was later ported to image [17, 4, 19] and video retrieval [7, 6]. Different features have been used for this purpose, whether textual (e.g. tags [20]), visual (e.g. convolutional neural networks [18]), or multimodal fusion [5].

2. METHODOLOGY
A generic and easily extensible methodology of four steps has been applied in all our submitted runs. While steps 2 and 4 apply to all runs, steps 1 and 3 contain particularities for visual and textual processing.

1) Ranking by relevance: A relevance score for each image is estimated using either visual or textual information (see details in Sections 2.1 and 2.2, respectively).

2) Filtering of irrelevant images: Only a percentage of the top images ranked by relevance is considered in later steps. In the multimodal runs, the relevance scores for the visual and textual modalities are linearly normalized and fused by averaging.

3) Feature and distance computation: Visual and/or textual features are extracted for each image, and the similarity between each pair is computed.

4) Reranking by diversity: An iterative algorithm selects the most different image with respect to all previously selected ones; the similarity is always assessed by averaging the considered visual and textual features. Iterations start by adding the most relevant image as the first element of the reranked list (see the sketch below).
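To make the reranking of step 4 concrete, the following minimal sketch greedily builds the diverse list from pre-computed pairwise similarity matrices of the filtered images. The function and variable names are ours, and the max-min selection rule is only one plausible reading of "most different with respect to all previously selected ones"; it is an illustrative sketch under these assumptions, not the exact implementation.

    import numpy as np

    def rerank_by_diversity(relevance_order, sim_visual, sim_textual):
        # relevance_order: image indices sorted by decreasing relevance
        #                  (only the images kept after the filtering of step 2)
        # sim_visual, sim_textual: (n, n) pairwise similarity matrices over the
        #                  same images, fused by averaging as in the multimodal runs
        sim = 0.5 * (sim_visual + sim_textual)
        selected = [relevance_order[0]]          # start from the most relevant image
        candidates = list(relevance_order[1:])
        while candidates:
            # pick the candidate whose strongest similarity to the already
            # selected images is the smallest, i.e., the most different one
            best = min(candidates, key=lambda c: sim[c, selected].max())
            selected.append(best)
            candidates.remove(best)
        return selected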
2.1 Visual data
The visual information was analyzed with Convolutional Neural Networks (CNN) [13, 12], used at two different steps of the pipeline:

1) Ranking by relevance: A Relevance CNN was created based on HybridNet [22], a CNN trained with objects from the ImageNet dataset [3] and locations from the Places dataset [22]. HybridNet was fine-tuned on two classes, relevant and irrelevant, as labeled by human annotators.

3) Feature and distance computation: The fully connected layer fc7 from a CNN trained on ImageNet [11] and the fully connected layer fc8 from HybridNet [22] were used as feature vectors [14].

2.2 Textual data
The textual information is used at the same two steps:

1) Ranking by relevance: For each query, we generate a textual term model in an unsupervised manner from all images returned for this query. We first remove stopwords, words with numeric and special characters, and words of length ≤ 4. Next, we select the most representative terms by retaining only those terms whose term frequency (TF_q) is higher than their document frequency (DF_q) for the query q. For each term in the model we store TF_q as a weight. Once this model has been established, we map the textual descriptions of the images to the model of the query. For each image, only terms that also appear in the query model are retained. For each remaining term we retrieve the TF_i of the corresponding i-th image and build a feature vector. To compute a relevance score s_i for an image, we compute the cosine similarity sim_i between the query model and the image feature vector. Additionally, we add the inverse of the original Flickr rank r_i of the image to the score, yielding a final textual relevance score of s_i = sim_i + 1/r_i for image i. This computation is inspired by that of [21], with the difference that we use TF instead of TFIDF in the scoring function, which proved to be more expressive in our experiments.

3) Feature and distance computation: Diversity reranking requires the similarity comparison of all relevant images for a query. For a given image, we first align its terms to the query model. Next, we compute their TFIDF weights (TF_i / DF_i) [15, 23]. Terms from the query model that do not occur in the image's description get a weight of zero. The resulting feature vectors are compared with the cosine metric in diversity reranking.
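The textual relevance scoring described above can be summarized with the short sketch below. The stopword list and tokenization are simplified placeholders for the filtering rules of Section 2.2, and all helper names are ours; the score follows s_i = sim_i + 1/r_i.

    import re
    from collections import Counter
    import numpy as np

    STOPWORDS = {"the", "and", "with", "from", "photo"}   # placeholder list

    def tokenize(text):
        # drop stopwords, words with non-alphabetic characters and words of length <= 4
        words = re.findall(r"[a-z]+", text.lower())
        return [w for w in words if len(w) > 4 and w not in STOPWORDS]

    def build_query_model(descriptions):
        # keep terms whose term frequency over all descriptions (TF_q)
        # exceeds their document frequency (DF_q), weighted by TF_q
        tf_q, df_q = Counter(), Counter()
        for text in descriptions:
            tokens = tokenize(text)
            tf_q.update(tokens)
            df_q.update(set(tokens))
        return {t: tf_q[t] for t in tf_q if tf_q[t] > df_q[t]}

    def textual_relevance_scores(descriptions, flickr_ranks):
        # s_i = cosine(query model, image TF vector) + 1 / r_i
        model = build_query_model(descriptions)
        terms = sorted(model)
        q = np.array([model[t] for t in terms], dtype=float)
        scores = []
        for text, rank in zip(descriptions, flickr_ranks):
            tf_i = Counter(tokenize(text))
            v = np.array([tf_i.get(t, 0) for t in terms], dtype=float)
            denom = np.linalg.norm(q) * np.linalg.norm(v)
            sim = float(q @ v / denom) if denom else 0.0
            scores.append(sim + 1.0 / rank)
        return scores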
3. EXPERIMENTAL SETUP
The experimental setup is mostly defined by the 2015 MediaEval Retrieving Diverse Images Task, which provides a dataset partitioned into development (devset) and test (testset) parts, two types of queries (single- and multi-topic), and standardized and complementary evaluation metrics: Precision at 20 (P@20), Cluster Recall at 20 (CR@20) and F1-score at 20 (F1@20). The reader is referred to the task overview paper [8] for the details of the problem.

The Relevance CNN described in Section 2.1 was trained with a 2-fold cross-validation, each split containing one half of the devset queries. For both splits we stopped after 2,000 iterations, when the validation accuracy was at its highest (76% and 75%, respectively). When applying the best parameters to the testset, we used all the devset data and fine-tuned the network, stopping after 4,500 iterations, when the training loss was at its minimum.

The portion of images to be filtered in Step 2 was learned by measuring the evolution of the final F1-score for different percentages. For Runs 1 to 3 the best results were obtained by keeping the top 20% of the images, while for Run 5 the best value was 15%. A sketch of this fusion and filtering step is given below.
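As an illustration of step 2 and of the percentage selection just described, the following sketch normalizes and averages the modality scores and keeps the top fraction of images. All names are ours, and f1_at_20 and rerank_pipeline in the commented sweep are hypothetical helpers standing in for the full pipeline.

    import numpy as np

    def minmax(scores):
        # linear (min-max) normalization to [0, 1]
        s = np.asarray(scores, dtype=float)
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)

    def filter_by_relevance(visual_scores, textual_scores, keep_fraction=0.20):
        # fuse the normalized modality scores by averaging and keep only the
        # top fraction (0.20 was the best devset value for Runs 1-3, 0.15 for Run 5)
        fused = 0.5 * (minmax(visual_scores) + minmax(textual_scores))
        order = np.argsort(-fused)                    # indices by decreasing fused score
        n_keep = max(1, int(round(keep_fraction * len(fused))))
        return order[:n_keep]

    # Hypothetical devset sweep used to pick keep_fraction: run the full pipeline
    # for each candidate value and keep the one with the highest final F1@20.
    # best = max((0.10, 0.15, 0.20, 0.25),
    #            key=lambda p: f1_at_20(rerank_pipeline(devset, keep_fraction=p)))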
4. RESULTS
Table 1 presents the results obtained with four different configurations: using visual information only (Run 1), using textual data only (Run 2), and using the best combination of textual and visual data (Run 3). An additional Run 5 considers multimodal information only for the relevance filtering (Step 2) and purely visual information for the diversity reranking (Step 4). The first block of the table reports results on the devset for single-topic queries, the next two blocks report results on the testset for the single-topic and multi-topic queries, and the last block contains the overall testset results.

    -----------------------------------------------------------------
    Modality             Visual     Text       Multi      Multi
                         Run 1      Run 2      Run 3      Run 5
    -----------------------------------------------------------------
    devset
      P@20               0.756      0.802      0.836      0.847
      CR@20              0.416      0.419      0.452      0.447
      F1@20              0.530      0.543      0.578      0.577
    testset (single)
      P@20               0.705      0.682      0.749      0.733
      CR@20              0.423      0.383      0.431      0.412
      F1@20              0.519      0.478      0.533      0.513
    testset (multi)
      P@20               0.593      0.724      0.627      0.621
      CR@20              0.403      0.372      0.414      0.397
      F1@20              0.463      0.470      0.482      0.464
    testset (overall)
      P@20               0.649      0.703      0.688      0.677
      CR@20              0.413      0.378      0.422      0.405
      F1@20              0.491      0.474      0.508      0.489
    -----------------------------------------------------------------

Table 1: Precision, Cluster Recall and F1-scores obtained for each run with N = 20 on the devset and on the testset (single-topic, multi-topic and overall).

Figure 1 plots the Precision, Cluster Recall and F1-score curves depending on the number N of top ranked images considered in the evaluation, averaged over all queries for our best run (Run 3).

[Figure 1: Overall Precision, Recall and F1-score curves for different cutoffs N of top ranked images on all testset queries.]
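For reference, the cutoff metrics reported in Table 1 and Figure 1 can be computed per query as in the sketch below, assuming the usual benchmark definitions (precision over the top N, and cluster recall as the fraction of ground-truth clusters covered by the relevant images in the top N); parameter names and the ground-truth format are our own assumptions.

    def metrics_at_n(ranked_ids, relevant_ids, cluster_of, n_clusters, n=20):
        # ranked_ids:   photo ids in the order returned by the system
        # relevant_ids: set of photo ids judged relevant for the query
        # cluster_of:   mapping from relevant photo id to ground-truth cluster label
        # n_clusters:   total number of ground-truth clusters for the query
        top = ranked_ids[:n]
        hits = [pid for pid in top if pid in relevant_ids]
        p = len(hits) / n
        cr = len({cluster_of[pid] for pid in hits}) / n_clusters
        f1 = 2 * p * cr / (p + cr) if (p + cr) > 0 else 0.0
        return p, cr, f1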
It is remarkable that increasing the number N of retrieved images increases both recall and precision (and not only recall, as one would expect in a typical retrieval scenario), as shown in Figure 1. This indicates that the relevance ranking obtained by our method is accurate (at least for N ≤ 50).

There is no clear winner between textual and visual information (Runs 1 and 2). The multimodal combination, however, clearly improves performance (Runs 3 and 5). Additionally, the results indicate that using multimodal processing at all stages (Run 3) is better than using multimodal processing only during the relevance ranking (Run 5).

Multi-topic queries seem to be more difficult to diversify than single-topic queries. A reason may be that multi-topic queries are more general and contain more heterogeneous content. Considering that our method was trained on single-topic queries only, the results for the multi-topic queries are nevertheless still promising.

5. CONCLUSIONS
The trade-off between relevance and diversity has been targeted in this work with relevance-based filtering and a posterior iterative process to introduce diversity. The final results, presented in Table 1, are comparable to the state of the art on the devset [9], and achieve up to an F1@20 of 0.508 on the testset.

6. REFERENCES
[1] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336. ACM, 1998.
[2] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. G. De Natale. A hybrid approach for retrieving diverse social images of landmarks. In Multimedia and Expo (ICME), 2015 IEEE International Conference on, pages 1–6. IEEE, 2015.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), 2009 IEEE Conference on, pages 248–255. IEEE, 2009.
[4] T. Deselaers, T. Gass, P. Dreuw, and H. Ney. Jointly optimising relevance and diversity in image retrieval. In Proceedings of the ACM International Conference on Image and Video Retrieval, page 39. ACM, 2009.
[5] Y. Gao, M. Wang, Z.-J. Zha, J. Shen, X. Li, and X. Wu. Visual-textual joint relevance learning for tag-based social image search. Image Processing, IEEE Transactions on, 22(1):363–376, 2013.
[6] X. Giro-i Nieto, M. Alfaro, and F. Marques. Diversity ranking for video retrieval from a broadcaster archive. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, page 56. ACM, 2011.
[7] M. Halvey, P. Punitha, D. Hannah, R. Villa, F. Hopfgartner, A. Goyal, and J. M. Jose. Diversity, assortment, dissimilarity, variety: A study of diversity measures using low level features for video retrieval. In Advances in Information Retrieval, pages 126–137. Springer, 2009.
[8] B. Ionescu, A. L. Gînscă, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation. In MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[9] B. Ionescu, A. Popescu, M. Lupu, A. L. Gînscă, B. Boteanu, and H. Müller. Div150Cred: A social image retrieval result diversification with user tagging credibility dataset. In ACM Multimedia Systems Conference (MMSys), Portland, Oregon, USA, 2015.
[10] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva. What makes a photograph memorable? Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(7):1469–1482, 2014.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[14] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pages 512–519. IEEE, 2014.
[15] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
[16] M. Soleymani. The quest for visual interest. In Proceedings of the ACM International Conference on Multimedia. ACM, 2015.
[17] K. Song, Y. Tian, W. Gao, and T. Huang. Diversifying the image retrieval results. In Proceedings of the 14th Annual ACM International Conference on Multimedia, pages 707–710. ACM, 2006.
[18] E. Spyromitros-Xioufis, S. Papadopoulos, A. L. Gînscă, A. Popescu, Y. Kompatsiaris, and I. Vlahavas. Improving diversity in image search via supervised relevance scoring. In Proceedings of the 5th ACM International Conference on Multimedia Retrieval, pages 323–330. ACM, 2015.
[19] R. H. van Leuken, L. Garcia, X. Olivares, and R. van Zwol. Visual diversification of image search results. In Proceedings of the 18th International Conference on World Wide Web, pages 341–350. ACM, 2009.
[20] R. Van Zwol, V. Murdock, L. Garcia Pueyo, and G. Ramirez. Diversifying image search with user generated content. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, pages 67–74. ACM, 2008.
[21] B. Vandersmissen, A. Tomar, F. Godin, W. De Neve, and R. Van de Walle. Ghent University-iMinds at MediaEval 2014 Diverse Images: Adaptive clustering with deep features. In MediaEval 2014 Workshop, 2014.
[22] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
[23] J. Zobel and A. Moffat. Exploring the similarity space. ACM SIGIR Forum, 32(1):18–34, 1998.