USEMP: Finding Diverse Images at MediaEval 2015

E. Spyromitros-Xioufis1, A. Popescu2, S. Papadopoulos1, I. Kompatsiaris1
1 CERTH-ITI, Thermi-Thessaloniki, Greece, {espyromi,papadop,ikom}@iti.gr
2 CEA, LIST, 91190 Gif-sur-Yvette, France, adrian.popescu@cea.fr

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT
We describe the participation of the USEMP team in the Retrieving Diverse Social Images Task of MediaEval 2015. Our runs are produced with a supervised diversification method that jointly optimizes relevance and diversity. All runs are automated and use only resources provided by the task organizers. On the one-topic part of the test set, our best results in terms of the official ranking metric were obtained by the runs that combine visual and textual information, while the textual-only run performed better on the multi-topic part.

1. INTRODUCTION
The Retrieving Diverse Social Images task of MediaEval 2015 [7] deals with the problem of result diversification in social image retrieval. This year there are two notable differences from previous editions: (1) a larger development set is available and (2) "multi-concept" queries, which are mainly related to events rather than specific places or landmarks, were introduced.
We deal with the task using supervised Maximal Marginal Relevance (sMMR) [10], a refined version of the supervised diversification method that we developed in [12]. sMMR and earlier versions of the method are discussed in Section 2. Section 3 gives further details about our methodology and describes modifications compared to [10]. Section 4 provides descriptions of the employed features and Section 5 describes the submitted runs. Finally, Section 6 presents and discusses the experimental results.

2. PREVIOUS WORK
In the 2013 edition of the task, the SocialSensor team developed a supervised diversification method [3] and applied it to produce the visual-only run that achieved the best performance among runs of this type. Similarly to previous diversification methods [2, 5], that method greedily optimized a utility function that jointly accounts for relevance and diversity. The main difference compared to earlier approaches was the replacement of the unsupervised definition of relevance with a task-specific definition that is learned directly from the ground truth. More specifically, instead of computing an image's relevance score by measuring its similarity to a reference image (e.g., the Wikipedia image of a query topic), the approach exploited the available relevance annotations (for the images of the development set) to train a classifier which was then used to assign relevance scores to images of unseen queries. Although that classifier was not adapted to any particular query, it could accurately estimate relevance by capturing its task-specific notion.
In 2014, the SocialSensor team refined their approach [12], topping the scoreboard in several categories: best visual-only run, third textual run and best visual+textual run, which was also ranked second overall, slightly surpassed by a run that used specialized filters (face/blur detectors) and user credibility information [4]. The main addition compared to [3] was that a different relevance classifier was trained for each query, using the query's Wikipedia images as additional positive examples. These examples were assigned a larger weight to have increased influence on the learned model. Thus, the query-specific notion of relevance was captured in addition to the task-specific notion captured in [3].
The approach was evaluated in more detail in [10], identifying a link between relevance detection accuracy and diversification performance. Furthermore, a multimodal ensemble classifier, called Multi-Modal Stacking (MMS), was proposed for combining different types of features for relevance detection in a principled manner. Due to the addition of this multimodal scheme and to the use of state-of-the-art convolutional neural network features for relevance detection, [10] managed to achieve a 5.7% relative increase over the best result obtained in the 2014 edition of the task [4].

3. METHOD
Given the effectiveness of sMMR in previous editions of the task, we opted for applying it in this year's task as well. In particular, we applied the sMMRaq variant: the relevance detection model for each query was trained using relevant and irrelevant examples from other queries, combined with representative examples of the query itself (in the form of either the corresponding Wikipedia images or the Wikipedia page). For multi-topic queries, which were not accompanied by representative Wikipedia images, the visual relevance models were trained using only examples from other queries (sMMRa variant).
To further improve the relevance detection models compared to [10], we performed careful tuning of two parameters: a) the number no of examples from other queries employed by each model and b) the ratio r = ne/no, defined1 as the number of examples of this query divided by the number of examples from other queries, by performing a grid search over the values no = {1K, 5K, 10K, 20K} and r = {0.0, 0.1, ..., 1.0}. Model selection was based on area under ROC (AUC), computed using a modified leave-one(-query)-out cross-validation procedure. For each query of the development set, no examples were randomly selected from other queries and combined (if possible) with ne representative (Wikipedia) examples of that query to build a model that was evaluated on the remaining (Flickr) examples of that query. The per-query AUC scores were then averaged to obtain a single estimate. As in [10], an L2-regularized Logistic Regression classifier was used [6] with appropriate tuning of the c parameter. Besides the parameters of the relevance model, we also tuned the w and N parameters of the sMMR approach, so as to maximize F1@20 on the development set, as done in [12].

1 ne is modified by repeating each representative example ne/n*e times, where n*e is the actual number of representative examples.
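For illustration, the greedy optimization underlying sMMR can be sketched as follows. This is a minimal MMR-style re-ranker in the spirit of [2]: at each step it selects the image that maximizes a weighted combination of relevance and dissimilarity to the images already selected. Function and variable names are ours and the utility is a simplified stand-in, not the exact formulation used in our runs.

```python
def greedy_diversify(relevance, similarity, w=0.5, k=20):
    """Greedy MMR-style selection (illustrative sketch, not the authors' code).

    relevance: dict mapping image id -> relevance score in [0, 1]
    similarity: dict mapping frozenset({a, b}) -> pairwise similarity in [0, 1]
    w: trade-off weight between relevance and diversity
    k: number of images to return
    """
    candidates = set(relevance)
    selected = []
    while candidates and len(selected) < k:
        def utility(img):
            # Redundancy = similarity to the closest already-selected image.
            redundancy = max((similarity[frozenset((img, s))] for s in selected),
                             default=0.0)
            return w * relevance[img] - (1 - w) * redundancy
        best = max(candidates, key=utility)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate relevant images and one distinct image, this sketch places the distinct image in second position even though the duplicate has a higher relevance score, which is the intended relevance/diversity trade-off.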
4. FEATURES
VLAD: VLAD+CSURF [11] vectors were computed from a 128-dimensional visual vocabulary and projected to 128 dimensions with PCA and whitening. Both the visual vocabulary and the PCA projection matrix are learned using the images of the development set.
CNN: Convolutional neural network features were adapted to the tourism use case using ≈1,000 Points Of Interest (POIs) instead of ImageNet classes. These features were computed by fine-tuning the VGG model proposed in [9]. Approximately 1,200 images were collected for each POI and fed directly to Caffe [8] for training. This change of training classes was inspired by recent domain adaptation work presented in [1], which shows that feature transfer is more efficient when the training classes are conceptually close to the target dataset. The features consist of the outputs of the fc7 layer and have 4,096 dimensions.
BOW: To generate textual features, we transformed each query and each Flickr image into a text document. For queries, we used a parsed version of the corresponding Wikipedia page and for Flickr images we used a concatenation of the words in their titles, descriptions and tags. Bag-of-words (BOW) features were then computed for each document, using all terms that appear at least twice in the collection to form the dictionary, and word frequencies as term weights. This led to an 80K-dimensional representation.
META: The following one-dimensional features were also computed from textual metadata and used as additional features in the meta input space of the MMS algorithm: distance from POI (only for one-topic queries) and Flickr rank.
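As an illustration of the BOW representation described above, the following sketch builds the dictionary from terms occurring at least twice across the collection and weights terms by raw frequency. It is a simplified re-implementation for clarity; the tokenization and dictionary-building details of our actual system differ.

```python
from collections import Counter

def build_bow(documents):
    """Bag-of-words sketch (illustrative, not the authors' code).

    documents: list of strings (one text document per query/image).
    Returns (vocab, vectors): the dictionary, restricted to terms that occur
    at least twice in the whole collection, and one frequency vector per
    document.
    """
    # Count term occurrences over the entire collection.
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    # Keep only terms appearing at least twice, in a fixed (sorted) order.
    vocab = sorted(t for t, c in counts.items() if c >= 2)
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for doc in documents:
        vec = [0] * len(vocab)
        for tok in doc.lower().split():
            if tok in index:
                vec[index[tok]] += 1  # word frequency as term weight
        vectors.append(vec)
    return vocab, vectors
```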
5. RUNS
Run 1: CNN features were used for relevance and VLAD features for diversity. no was set to 20K in both instantiations and r was set to 0.5 for the one-topic instantiation. {w = 0.55, N = 170} and {w = 0.00, N = 120} were used for the one-topic and the multi-topic instantiation respectively.
Run 2: BOW features were used for both relevance and diversity. The following parameters were used: no = 20K, r = 0.4, N = 80 and w = 0.8.
Run 3: A different instantiation was used for each part of the collection. MMS was used to combine the outputs of relevance detection models built using CNN and BOW features with one-dimensional META features, and VLAD features were used for diversity in both instantiations. {w = 0.50, N = 220} and {w = 0.55, N = 170} were used for the one-topic and the multi-topic instantiation respectively.
Run 5: This is a variation of run 3 where we use the same relevance detection models but also tune the M parameter of the approach in addition to N and w, as done in [12]. This resulted in setting {w = 0.4, N = 210, M = 2} for the one-topic instantiation and {w = 0.4, N = 300, M = 5} for the multi-topic instantiation.
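The MMS combination used in runs 3 and 5 is, in essence, a second-level classifier trained on the outputs of per-modality relevance models together with the one-dimensional META features. The sketch below illustrates this kind of stacked late fusion with a plain logistic regression trained by batch gradient descent; this stacker, its hyperparameters and all names are illustrative assumptions, not the implementation described in [10] (which uses L2-regularized logistic regression via LIBLINEAR [6]).

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_stacker(meta_inputs, labels, lr=0.5, epochs=2000):
    """Train an (unregularized) logistic-regression stacker.

    meta_inputs: rows of base-model scores + META features, e.g.
                 [cnn_score, bow_score, distance_from_poi, flickr_rank]
    labels: 1 for relevant, 0 for irrelevant.
    Illustrative sketch only, not the authors' MMS implementation.
    """
    n_dim = len(meta_inputs[0])
    w, b = [0.0] * n_dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_dim, 0.0
        for x, y in zip(meta_inputs, labels):
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        # Averaged-gradient update.
        w = [wi - lr * gi / len(labels) for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / len(labels)
    return w, b

def stacked_score(w, b, x):
    """Fused relevance score produced by the second-level model."""
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```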
Table 1: Estimated (one-topic/multi-topic) and final performance of the submitted runs.

                    Development Set                         | Test Set (One-Topic)  | Test Set (Multi-Topic) | Test Set (Overall)
Run  auc          p@20         cr@20        f1@20          | p@20   cr@20  f1@20   | p@20   cr@20  f1@20    | p@20   cr@20  f1@20
1    0.821/0.773  0.860/0.763  0.489/0.468  0.616/0.573    | 0.805  0.478  0.587   | 0.598  0.453  0.499    | 0.701  0.465  0.542
2    0.688        0.836        0.459        0.586          | 0.824  0.455  0.569   | 0.734  0.442  0.530    | 0.779  0.448  0.549
3    0.857/0.816  0.893/0.840  0.515/0.489  0.646/0.609    | 0.833  0.504  0.618   | 0.617  0.408  0.471    | 0.724  0.456  0.544
5    0.857/0.816  0.877/0.823  0.526/0.499  0.650/0.613    | 0.802  0.509  0.611   | 0.608  0.417  0.474    | 0.704  0.462  0.542

6. RESULTS AND DISCUSSION
Table 1 shows the performance of the submitted runs on each part of the test collection, together with estimates of their performance obtained from the development set. The best overall performance on the test set is obtained by run 2. We observe that performance is much better on the one-topic part. This was expected, given that model and parameter tuning was performed on the development set, which did not contain examples of multi-topic queries. The best performance in terms of F1@20 on the one-topic part is obtained by runs 3 and 5, followed by run 1 and then run 2. Although slightly over-optimistic (≈5% on average), our F1@20 estimates for the one-topic part are strongly correlated with the final results and are indicative of the relative run strength.
On the multi-topic part, the best performance is obtained by run 2, followed by run 1 and then runs 3 and 5. The superiority of run 2 over run 1 on this part of the collection is attributed to the fact that representative examples of multi-topic queries were available only in textual form. Comparing the final results on this part with our estimates, we see very poor correlation2. This suggests that this part of the collection has significantly different characteristics from the development set and that performing model selection and parameter tuning on the development set was not helpful. We expect that better results could have been achieved on the multi-topic part, had the development set contained queries of this type.

2 Nevertheless, these estimates could serve as an indication of the performance of a system that has no access to the Wikipedia images of the one-topic queries.
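For reference, the metrics reported above can be computed as follows. This is a sketch of the standard definitions of precision at cutoff k, cluster recall at cutoff k and their harmonic mean (F1@k); the official task evaluation tooling remains authoritative.

```python
def precision_at_k(ranked, relevant, k=20):
    """Fraction of the top-k ranked images that are relevant."""
    return sum(1 for img in ranked[:k] if img in relevant) / k

def cluster_recall_at_k(ranked, clusters, k=20):
    """Fraction of ground-truth clusters represented in the top-k results.

    clusters: dict mapping each relevant image to its cluster id.
    """
    total = len(set(clusters.values()))
    found = {clusters[img] for img in ranked[:k] if img in clusters}
    return len(found) / total

def f1_at_k(ranked, relevant, clusters, k=20):
    """Harmonic mean of precision@k and cluster recall@k."""
    p = precision_at_k(ranked, relevant, k)
    cr = cluster_recall_at_k(ranked, clusters, k)
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)
```

A ranking can thus score high on precision while scoring low on cluster recall (e.g., by returning many relevant images from a single cluster), which is exactly the behavior the diversification step is meant to avoid.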
7. ACKNOWLEDGEMENTS
This work is supported by the USEMP FP7 project, partially funded by the EC under contract number 611596.

8. REFERENCES
[1] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky. Neural codes for image retrieval. In ECCV, 2014.
[2] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In ACM SIGIR, 1998.
[3] D. Corney, C. Martin, A. Göker, E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, L. Aiello, and B. Thomee. SocialSensor: Finding diverse images at MediaEval 2013. In MediaEval, 2013.
[4] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. De Natale. Retrieval of diverse images by pre-filtering and hierarchical clustering. In MediaEval, 2014.
[5] T. Deselaers, T. Gass, P. Dreuw, and H. Ney. Jointly optimising relevance and diversity in image retrieval. In ACM CIVR, 2009.
[6] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[7] B. Ionescu, A. Gînsca, B. Boteanu, A. Popescu, M. Lupu, and H. Müller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation. In MediaEval, 2015.
[8] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[9] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[10] E. Spyromitros-Xioufis, A. Ginsca, A. Popescu, S. Papadopoulos, Y. Kompatsiaris, and I. Vlahavas. Improving diversity in image search via supervised relevance scoring. In International Conference on Multimedia Retrieval (ICMR), 2015.
[11] E. Spyromitros-Xioufis, S. Papadopoulos, I. Kompatsiaris, G. Tsoumakas, and I. Vlahavas. A comprehensive study over VLAD and product quantization in large-scale image retrieval. IEEE Transactions on Multimedia, 2014.
[12] E. Spyromitros-Xioufis, S. Papadopoulos, Y. Kompatsiaris, and I. Vlahavas. SocialSensor: Finding diverse images at MediaEval 2014. In MediaEval, 2014.