
UPC-UB-STP @ MediaEval 2015 Diversity Task: Iterative Reranking of Relevant Images

Aniol Lidon, Xavier Giró-i-Nieto (xavier.giro@upc.edu)
Universitat Politècnica de Catalunya, Barcelona, Catalonia/Spain

Marc Bolaños (marc.bolanos@ub.edu), Petia Radeva
Universitat de Barcelona, Barcelona, Catalonia/Spain

Markus Seidl, Matthias Zeppelzauer (m.zeppelzauer@fhstp.ac.at)
St. Pölten University of Applied Sciences, St. Pölten, Austria

2015

This paper presents the results of the UPC-UB-STP team in the 2015 MediaEval Retrieving Diverse Images Task. The goal of the challenge is to provide a ranked list of Flickr photos for a predefined set of queries. Our approach first generates a ranking of images based on a query-independent estimation of their relevance. Only the top results are kept and iteratively re-ranked based on their intra-similarity to introduce diversity.

1. INTRODUCTION

The diversification of search results is an important factor in improving the usability of visual retrieval engines. This motivates the 2015 MediaEval Retrieving Diverse Images Task [8], which defines the scientific benchmark targeted in this paper. The proposed methodology addresses the trade-off between relevance and diversity by first filtering results with a learned relevance classifier, and then building a diverse reranked list following an iterative scheme.

The first challenge in our system is filtering irrelevant images, as suggested in [2]. Relevance is a very abstract concept that involves a high degree of subjectivity. Similar problems have been addressed in the visual domain, such as memorability [10] or interestingness [16]. In both cases, a crowdsourced task was organised to collect a large amount of human annotations, which were used to train a classifier based on visual features.

The second challenge is the diversity of the ranked list. A seminal work from 1998 [1] introduced diversity in addition to relevance for text retrieval, a concept that was later ported to image [17, 4, 19] and video retrieval [7, 6]. Different features have been used for this purpose: textual (e.g. tags [20]), visual (e.g. convolutional neural networks [18]), or a multimodal fusion of both [5].

2. METHODOLOGY

A generic and easily extensible methodology of four steps has been applied in all our submitted runs. While steps 2 and 4 apply to all runs, steps 1 and 3 contain particularities for visual and textual processing.

1) Ranking by relevance: A relevance score for each image is estimated using either visual or textual information (see details in Sections 2.1 and 2.2, respectively).

2) Filtering of irrelevant images: Only a percentage of the images top-ranked by relevance is considered in later steps. In the multimodal runs, the relevance scores for the visual and textual modalities are linearly normalized and fused by averaging.

3) Feature and distance computation: Visual and/or textual features are extracted for each image, and the similarity between each pair of images is computed.

4) Reranking by diversity: An iterative algorithm selects the image that is most different from all previously selected ones. The similarity is always assessed by averaging the considered visual and textual features. Iterations start by adding the most relevant image as the first element of the reranked list.

2.1 Visual data

The visual information was analyzed with Convolutional Neural Networks (CNN) [13, 12] following two different approaches:

1) Ranking by relevance: A Relevance CNN was created based on HybridNet [22], a CNN trained with objects from the ImageNet dataset [3] and locations from the Places dataset [22]. HybridNet was fine-tuned for two classes, relevant and irrelevant, as labeled by human annotators.

3) Feature and distance computation: The fully connected layer fc7 from a CNN trained on ImageNet [11] and the fully connected layer fc8 from HybridNet [22] were used as feature vectors [14].

2.2 Textual data

1) Ranking by relevance: For each query, we generate a textual term model in an unsupervised manner from all images returned for this query. We first remove stopwords, words with numeric and special characters and words of length 4. Next, we select the most representative terms by retaining only those terms whose term frequency ($TF_q$) is higher than their document frequency ($DF_q$) for the query $q$. For each term in the model we store its $TF_q$ as a weight. Once this model has been established, we map the textual descriptions of the images to the model of the query. For each image, only terms that also appear in the query model are retained. For each remaining term we retrieve the $TF_i$ for the corresponding $i$-th image and build a feature vector. To compute a relevance score $s_i$ for an image, we compute the cosine similarity $\mathrm{sim}_i$ between the query model and the feature vector of the image. Additionally, we add the inverse of the original Flickr rank $r_i$ of the image to the score, yielding a final textual relevance score of $s_i = \mathrm{sim}_i + (1/r_i)$ for image $i$. This computation is inspired by that of [21], with the difference that we use TF instead of TFIDF in the scoring function, which proved to be more expressive in our experiments.
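The textual relevance scoring of this step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the submitted runs: the function names, the whitespace tokenization, the stopword list and the `min_len` threshold are hypothetical, and the term frequency is assumed to count all occurrences of a term over the query's images while the document frequency counts the images that contain it.

```python
import math
from collections import Counter

# Assumption: a tiny illustrative stopword list; the original list is not given.
STOPWORDS = {"the", "and", "with", "from", "this"}

def tokenize(text, min_len=4):
    """Lowercase and split on whitespace; drop stopwords, tokens with digits or
    special characters, and short tokens (min_len is an assumed threshold)."""
    return [t for t in text.lower().split()
            if t.isalpha() and t not in STOPWORDS and len(t) >= min_len]

def build_query_model(descriptions):
    """Keep the terms whose term frequency exceeds their document frequency,
    weighted by their term frequency over all images of the query."""
    docs = [tokenize(d) for d in descriptions]
    tf = Counter(t for doc in docs for t in doc)
    df = Counter(t for doc in docs for t in set(doc))
    return {t: tf[t] for t in tf if tf[t] > df[t]}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def relevance_score(description, query_model, flickr_rank):
    """Cosine similarity to the query model plus the inverse Flickr rank.
    flickr_rank is assumed to be 1-based."""
    tf_i = Counter(t for t in tokenize(description) if t in query_model)
    return cosine(query_model, tf_i) + 1.0 / flickr_rank
```

For example, `relevance_score(text, build_query_model(all_texts), rank)` would be called once per image of a query, with `all_texts` holding the descriptions of every image returned for that query.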

3) Feature and distance computation: Diversity reranking requires comparing the similarity of all relevant images for a query. For a given image, we first align its terms to the query model. Next, we compute their TFIDF weights ($TF_i / DF_i$) [15, 23]. Terms from the query model that do not occur in the image's descriptions get a weight of zero. The resulting feature vectors are compared with the cosine metric during diversity reranking.
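Step 4 of the methodology (reranking by diversity) can be sketched over such feature vectors as follows. The text does not fully specify how the "most different" image is determined, so this sketch assumes it is the candidate with the lowest average cosine similarity to the images already selected, with ties broken by relevance. The function names are hypothetical, and a single feature vector per image is used for brevity; in a multimodal setting the per-modality similarities would be averaged.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rerank_by_diversity(features, relevance):
    """Return image indices in reranked order.

    features:  list of feature vectors, one per image kept after filtering
    relevance: list of relevance scores, aligned with features
    """
    n = len(features)
    if n == 0:
        return []
    remaining = set(range(n))
    # The most relevant image opens the reranked list.
    first = max(remaining, key=lambda i: (relevance[i], -i))
    ranked = [first]
    remaining.remove(first)
    # Running sum of similarities between each candidate and the selected images.
    sim_sum = {i: cosine_similarity(features[i], features[first]) for i in remaining}
    while remaining:
        # Assumption: "most different" means lowest average similarity so far.
        nxt = min(remaining,
                  key=lambda i: (sim_sum[i] / len(ranked), -relevance[i], i))
        ranked.append(nxt)
        remaining.remove(nxt)
        for i in remaining:
            sim_sum[i] += cosine_similarity(features[i], features[nxt])
    return ranked
```

Keeping a running sum of similarities avoids recomputing all pairwise comparisons at every iteration, so each step only costs one similarity evaluation per remaining candidate.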

3. EXPERIMENTAL SETUP

The experimental setup is mostly defined by the 2015 MediaEval Retrieving Diverse Images Task, which provides a dataset partitioned into development (devset) and test (testset) sets, two types of queries (single-topic and multi-topic), and standardized, complementary evaluation metrics: Precision at 20 (P@20), Cluster Recall at 20 (CR@20) and F1-score at 20 (F1@20). The reader is referred to the task overview paper [8] for the details of the problem.
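The three metrics can be illustrated with the short sketch below. It follows their usual definitions (precision as the fraction of relevant images among the top N, cluster recall as the fraction of ground-truth clusters represented in the top N, and F1 as their harmonic mean); the official definitions are given in the task overview [8]. The function names and the ground-truth layout (a dictionary mapping each relevant image to its cluster) are assumptions of this example.

```python
def precision_at(ranked, relevant, n=20):
    """Fraction of the top-n results that are relevant."""
    return sum(1 for img in ranked[:n] if img in relevant) / n

def cluster_recall_at(ranked, cluster_of, n=20):
    """Fraction of ground-truth clusters represented among the top-n results.

    cluster_of maps every relevant image to its cluster identifier.
    """
    total = len(set(cluster_of.values()))
    if total == 0:
        return 0.0
    found = {cluster_of[img] for img in ranked[:n] if img in cluster_of}
    return len(found) / total

def f1_at(ranked, cluster_of, n=20):
    """Harmonic mean of precision and cluster recall at cutoff n."""
    p = precision_at(ranked, set(cluster_of), n)
    cr = cluster_recall_at(ranked, cluster_of, n)
    return 2 * p * cr / (p + cr) if (p + cr) > 0 else 0.0
```

Because F1@20 combines both quantities, a list that is precise but drawn from a single cluster is penalised as much as a diverse list full of irrelevant images.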

The Relevance CNN described in Section 2.1 was trained with 2-fold cross-validation, each split containing one half of the devset queries. For both splits we stopped after 2,000 iterations, when the validation accuracy was highest (76% and 75%, respectively). When applying the best methods' parameters to the testset, we used all the devset data and fine-tuned the network, stopping after 4,500 iterations, when the training loss was at its minimum.

The portion of images to be filtered in Step 2 was determined by measuring the evolution of the final F1-score for different percentages. For Runs 1 to 3 the best results were obtained by keeping the top 20% of images, while for Run 5 the best value was 15%.
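The filtering of Step 2, together with the score fusion used in the multimodal runs, can be sketched as follows. This is an illustration under assumptions: "linearly normalized" is taken to mean min-max scaling to [0, 1], the number of kept images is rounded up, and all function names are hypothetical.

```python
import math

def min_max_normalize(scores):
    """Linearly rescale scores to [0, 1] (assumed normalization)."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse_relevance(visual, textual):
    """Average the normalized visual and textual relevance scores per image."""
    v = min_max_normalize(visual)
    t = min_max_normalize(textual)
    return [(a + b) / 2.0 for a, b in zip(v, t)]

def keep_top_fraction(scores, fraction=0.20):
    """Return the indices of the top-scoring fraction of images, best first."""
    k = max(1, math.ceil(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

The fraction would be tuned on the devset by sweeping candidate values and keeping the one that maximises the final F1-score, as described above.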

4. RESULTS

5. CONCLUSIONS

The trade-off between relevance and diversity has been targeted in this work with relevance-based filtering followed by an iterative process to introduce diversity. The final results, presented in Table 1, are comparable to the state of the art on the devset [9], and achieve an F1@20 of up to 0.508 on the testset.

Multi-topic queries seem to be more difficult to diversify than single-topic queries. One reason may be that multi-topic queries are more general and contain more heterogeneous content. Considering that our method was trained on single-topic queries only, the results for the multi-topic queries are nevertheless promising.

It is remarkable that increasing the number N of retrieved images increases both recall and precision (and not only recall, as one would expect in a typical retrieval scenario), as shown in Figure 1. This indicates that the relevance ranking obtained by our method is accurate (at least for N ≤ 50).

There is no clear winner between textual and visual information (Runs 1 and 2). The multimodal combination, however, clearly improves performance (Runs 3 and 5). Additionally, the results indicate that using multimodal processing at all stages (Run 3) is better than using it only during the relevance ranking (Run 5).

[1] Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335-336. ACM, 1998.

[2] .-T. Dang-Nguyen, Piras, G. Giacinto, G. Boato, and F. G. De Natale. A hybrid approach for retrieving diverse social images of landmarks. In Multimedia and Expo (ICME), 2015 IEEE International Conference on, pages 1-6. IEEE, 2015.

[3] Deng, Dong, Socher, L.-J. Li, Li, and Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248-255. IEEE, 2009.

[4] Deselaers, Gass, Dreuw, and Ney. Jointly optimising relevance and diversity in image retrieval. In Proceedings of the ACM international conference on image and video retrieval, page 39. ACM, 2009.

[5] Gao, Wang, Z.-J. Zha, Shen, Li, and X. Wu. Visual-textual joint relevance learning for tag-based social image search. Image Processing, IEEE Transactions on, 22(1):363-376, 2013.

[6] Giro-i-Nieto, Alfaro, and Marques. Diversity ranking for video retrieval from a broadcaster archive. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, page 56. ACM, 2011.

[7] Halvey, Punitha, Hannah, Villa, Hopfgartner, Goyal, and J. M. Jose. Diversity, assortment, dissimilarity, variety: A study of diversity measures using low level features for video retrieval. In Advances in Information Retrieval, pages 126-137. Springer, 2009.

[8] Ionescu, A. L. Gînscă, B. Boteanu, Popescu, Lupu, and Müller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation. In MediaEval 2015 Workshop, Wurzen, Germany, 2015.

[9] Ionescu, Popescu, Lupu, A. L. Gînscă, Boteanu, and Müller. Div150Cred: A social image retrieval result diversification with user tagging credibility dataset. ACM Multimedia Systems-MMSys, Portland, Oregon, USA, 2015.

[10] Isola, Xiao, Parikh, Torralba, and Oliva. What makes a photograph memorable? Pattern Analysis and Machine Intelligence, IEEE Transactions on, 36(7):1469-1482, 2014.

[11] Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675-678. ACM, 2014.

[12] Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097-1105, 2012.

[13] LeCun, L. Bottou, Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[14] A. S. Razavian, Azizpour, Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pages 512-519. IEEE, 2014.

[15] Salton and Buckley. Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5):513-523, 1988.

[16] Soleymani. The quest for visual interest. In Proceedings of the ACM International Conference on Multimedia. ACM, 2015.

[17] Song, Tian, Gao, and Huang. Diversifying the image retrieval results. In Proceedings of the 14th annual ACM international conference on Multimedia, pages 707-710. ACM, 2006.

[18] Spyromitros-Xioufis, S. Papadopoulos, A. L. Ginsca, Popescu, Kompatsiaris, and I. Vlahavas. Improving diversity in image search via supervised relevance scoring. In Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, pages 323-330. ACM, 2015.

[19] R. H. van Leuken, L. Garcia, X. Olivares, and R. van Zwol. Visual diversification of image search results. In Proceedings of the 18th international conference on World wide web, pages 341-350. ACM, 2009.

[20] Van Zwol, Murdock, L. Garcia Pueyo, and Ramirez. Diversifying image search with user generated content. In Proceedings of the 1st ACM international conference on Multimedia information retrieval, pages 67-74. ACM, 2008.

[21] Vandersmissen, Tomar, Godin, W. De Neve, and R. Van de Walle. Ghent University-iMinds at MediaEval 2014 diverse images: Adaptive clustering with deep features. In MediaEval 2014 Workshop, 2014.

[22] Zhou, Lapedriza, Xiao, Torralba, and Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487-495, 2014.

[23] Zobel and A. Moffat. Exploring the similarity space. ACM SIGIR Forum, 32(1):18-34, 1998.