LAPI @ 2015 Retrieving Diverse Social Images Task: A Pseudo-Relevance Feedback Diversification Perspective

Bogdan Boteanu (1)(∗), Ionuţ Mironică (1)(†), Bogdan Ionescu (1)(‡)
(1) LAPI, University "Politehnica" of Bucharest, Romania
{bboteanu,imironica,bionescu}@alpha.imag.pub.ro

(∗) This work has been funded by the Ministry of European Funds through the Financial Agreement POSDRU 187/1.5/S/155420.
(†) The work was funded by the ESF POSDRU/159/1.5/S/132395 InnoRESEARCH programme.
(‡) This work is supported by the European Science Foundation, activity on "Evaluating Information Access Systems".

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany

ABSTRACT

In this paper we present the results achieved during the 2015 MediaEval Retrieving Diverse Social Images Task, using an approach based on pseudo-relevance feedback, in which human feedback is replaced by an automatic selection of images. The proposed approach is designed to prioritize the diversification of the results, in contrast to most existing techniques, which address only relevance. Diversification is achieved by exploiting a hierarchical clustering scheme followed by a diversification strategy. Methods are tested on the benchmarking data and results are analyzed. Insights for future work conclude the paper.

1. INTRODUCTION

An efficient information retrieval system should be able to provide search results which are at the same time relevant for the query and cover different aspects of it, i.e., diverse. The 2015 Retrieving Diverse Social Images Task [1] addresses this issue in the context of a real-world tourism usage scenario. Given a ranked list of location photos retrieved from Flickr (http://flickr.com/), participating systems are expected to refine the results by providing up to 50 images that are at the same time relevant and provide a diversified summary of the location. These results will help potential tourists in selecting the locations they visit. The refinement and diversification process is based on the social metadata associated with the images and/or on the visual characteristics. A complete overview of the task is presented in [1].

Despite the current advances of machine intelligence techniques used in the area of information retrieval and multimedia, in the search for high performance and adaptation to user needs, more and more research is now turning towards the concept of "human in the loop" [2]. The idea is to bring human expertise into the processing chain, thus combining the accuracy of human judgements with the computational power of machines.

In this work we propose a novel perspective that exploits the concept of pseudo-relevance feedback. Relevance feedback (RF) techniques attempt to introduce the user in the loop by harvesting feedback about the relevance of the search results. This information is used as ground truth for re-computing a better representation of the data needed. Relevance feedback has proved efficient in improving the precision of the results [3], but its potential has not been fully exploited for diversification. The main contribution of our approach is in proposing a pseudo-relevance feedback technique which substitutes the user needed in traditional RF and in proposing several diversity-adapted relevance feedback schemes.

2. PROPOSED APPROACH

In traditional RF techniques, recording actual user feedback is inefficient in terms of time and human resources. The proposed approach, denoted in the following HC-RF, attempts to replace user input with machine generated ground truth. It exploits the concept of pseudo-relevance feedback, which is based on the assumption that the top k ranked documents are relevant; the feedback is then learned as in traditional RF under this assumption [6]. A general diagram of the approach is depicted in Figure 1.

Figure 1: General scheme of the proposed approach

The algorithm is as follows. Firstly, we remove non-relevant images using three filters. The first one is the Viola-Jones [4] face detector, which filters out images with persons as the main subject. The second one is an image blur detector based on the aggregation of 10 state-of-the-art blur indicators as implemented by Said Pertuz (http://www.mathworks.com/matlabcentral/fileexchange/27314-focus-measure/content/fmeasure/fmeasure.m). The last one is a GPS distance-based filter, which rejects the images that are positioned too far away from the query location and therefore cannot be relevant shots for that location.

In the next step we propose a pseudo-relevance feedback scheme based on the selection of the images assessed in an automated manner. We consider that most of the first returned results are relevant (i.e., positive examples). For instance, on devset [1], on average, 40 out of 50 returned images are relevant, which supports our assumption. In contrast, the very last of the results are more likely to be non-relevant and are considered accordingly (i.e., negative examples). The positive and negative examples are fed to a Hierarchical Clustering (HC) scheme (http://www.mathworks.com/help/stats/hierarchical-clustering.html) which yields a dendrogram of classes. For a certain cutting point (i.e., number of classes), a class is declared non-relevant if it contains only negative examples or if its number of negative examples is higher than its number of positive ones.

The final step is the actual diversification scheme. We select from each of the relevant classes the image which has the highest rank according to the initial ranking of the system. Then we proceed by selecting the second image of each class in the same manner, and the process is repeated until a maximum number of images is reached. The resulting images represent the output of the proposed system.

3. EXPERIMENTAL RESULTS

This section presents the experimental results achieved on devset, which consists of 153 queries and 45,375 images, and on testset, which consists of 139 queries (69 one-concept and 70 multi-concept) and 41,394 images. For devset, we first optimized the parameters of the filters in order to obtain the best precision. Based on this configuration we then applied the proposed approach. Ground truth was also provided with the data for this set for preliminary validation of the approaches. The final benchmarking is, however, conducted on testset.

In our approaches, images are represented with the content descriptors that were provided with the task data, i.e., visual (e.g., color, feature descriptors), text (e.g., term frequency - inverse document frequency representations of metadata) and user annotation credibility (e.g., face proportions, upload frequency) information. Detailed information about the provided content descriptors is available in [1]. Performance is assessed with Precision at X images (P@X), Cluster Recall at X (CR@X) and F1-measure at X (F1@X).

3.1 Results on devset

Several tests were performed with different descriptor combinations and various cutoff points. Descriptors are combined with an early fusion approach. We varied the number of initial images considered as positive examples from 80 to 160 with a step of 10 images, the number of last images considered as negative examples from 0 to 21 with a step of 3, and the inconsistency coefficient threshold for which HC naturally divides the data into well-separated clusters from 0.1 to 0.95 with a step of 0.05. We select the combinations yielding the highest F1@20, which is the official metric.

While experimenting, we observed that, by increasing the number of analyzed images, precision tends to slightly decrease as the probability of obtaining non-relevant images increases; at the same time, diversity increases, as a larger set of images is more likely to contain more diverse representations. For brevity, in the following we focus on presenting only the results at a cutoff of 20 images, which is the official cutoff point. These results are presented in Table 1. To serve as a baseline for the evaluation, we also present the Flickr initial retrieval results. From the modality point of view, the text descriptor (TF) led to the highest results (F1@20=0.5839), followed closely by the combination of all visual and all text descriptors (F1@20=0.5735), then visual (LBP) (F1@20=0.5655), all credibility information (F1@20=0.5426) and all convolutional neural network (CNN) based descriptors (F1@20=0.5356).

Table 1: Best pseudo-relevance feedback results for each modality or combination of modalities on devset (best result per row is marked with *).

metric    HC-RF     HC-RF     HC-RF     HC-RF     HC-RF     Flickr
          visual    text      vis-text  cred.     CNN       init. res.
P@20      0.8199    0.8346*   0.8281    0.7281    0.7546    0.8118
CR@20     0.4423    0.4588*   0.4484    0.4415    0.4234    0.3432
F1@20     0.5655    0.5839*   0.5735    0.5426    0.5356    0.4713

3.2 Official results on testset

Following the previous experiments, the final runs were determined for the best modality/parameter combinations obtained on devset (see Table 1). We submitted five runs, computed as follows:

Run1 - automated using visual information only: HC-RF visual LBP
Run2 - automated using text information only: HC-RF text TF
Run3 - automated using visual-text information: HC-RF all visual-all text
Run4 - automated using credibility information only: HC-RF all cred.
Run5 - everything allowed: HC-RF all CNN

Results are presented in Table 2.

Table 2: Results for the official runs on testset (best result per row is marked with *).

set           metric   Run1      Run2      Run3      Run4      Run5
Overall       P@20     0.7241    0.709     0.7306*   0.7126    0.7227
              CR@20    0.4156    0.4306    0.4062    0.449*    0.3999
              F1@20    0.5164    0.5231    0.5056    0.5336*   0.4994
One-topic     P@20     0.7319    0.7391    0.7341    0.7442*   0.7123
              CR@20    0.4153    0.4392*   0.4211    0.4294    0.3934
              F1@20    0.5222    0.5402*   0.5219    0.5308    0.4958
Multi-topics  P@20     0.7164    0.6793    0.7271    0.6814    0.7329*
              CR@20    0.416     0.4222    0.3915    0.4684*   0.4063
              F1@20    0.5108    0.5063    0.4895    0.5364*   0.503

It is interesting to observe that the highest precision is achieved on the one-topic set using credibility information (Run4 - P@20 = 0.7442), whereas maximum diversification is achieved on the multi-topics set using the same type of information (Run4 - CR@20 = 0.4684). Another interesting observation is that credibility information was useful in the context of overall diversification. Credibility information gives an automatic estimation of the quality of tag-image content relationships, telling which users are most likely to share relevant images on Flickr. The best diversification, CR@20 = 0.4684, is achieved due to the high probability that different relevant images belong to different users with a good credibility score. In terms of the overall F1 score, the use of credibility information (Run4 - F1@20 = 0.5336) allows for better performance than the text descriptor (TF, Run2) by about 1 percentage point and than the visual descriptor (LBP, Run1) by 1.7 percentage points.

4. CONCLUSIONS

We approached the image search result diversification issue from the perspective of relevance feedback techniques, where user feedback is substituted with an automatic pseudo-feedback approach. Results show that, in general, the automatic techniques improve both precision and diversification, which indicates the potential of relevance feedback for diversification. Future developments will mainly address a more efficient exploitation of the different modalities (visual-text-credibility), e.g., via late fusion techniques, as well as the exploitation of adaptive face detectors that are able to filter out only a certain category of images, e.g., with people in focus, and pass other categories of images, e.g., with crowds that are naturally present at a target location.

5. REFERENCES

[1] B. Ionescu, A.L. Gînscă, B. Boteanu, A. Popescu, M. Lupu, H. Müller, "Retrieving Diverse Social Images at MediaEval 2015: Challenge, Dataset and Evaluation", MediaEval 2015 Workshop, September 14-15, Wurzen, Germany, 2015.
[2] B. Emond, "Multimedia and Human-in-the-loop: Interaction as Content Enrichment", ACM Int. Workshop on Human-Centered Multimedia, pp. 77-84, 2007.
[3] J. Li, N.M. Allinson, "Relevance Feedback in Content-Based Image Retrieval: A Survey", Handbook on Neural Information Processing, 49, pp. 433-469, Springer, 2013.
[4] P. Viola, M.J. Jones, "Robust Real-Time Face Detection", International Journal of Computer Vision, 57(2), pp. 137-154, 2004.
[5] B. Boteanu, I. Mironică, B. Ionescu, "A Relevance Feedback Perspective to Image Search Result Diversification", IEEE ICCP, September 4-6, Cluj-Napoca, Romania, 2014.
[6] B. Boteanu, I. Mironică, B. Ionescu, "Hierarchical Clustering Pseudo-Relevance Feedback for Social Image Search Result Diversification", IEEE CBMI, June 10-12, Prague, Czech Republic, 2015.
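
ILLUSTRATIVE SKETCH OF THE SELECTION STEP

The class pruning and rank-ordered selection described in Section 2 can be illustrated with a minimal Python sketch. It is not the implementation used for the submitted runs: the function names (relevant_classes, diversify) and their parameters are hypothetical, the class labels are assumed to come from a hierarchical clustering computed elsewhere, and the order in which classes are visited within one round (by their best-ranked member) is an assumption, since the text does not specify it.

```python
from collections import defaultdict


def relevant_classes(labels, positives, negatives):
    """Return the ids of the classes kept as relevant.

    labels:    dict image_id -> class id obtained by cutting the dendrogram
    positives: set of image ids taken from the top of the initial ranking
    negatives: set of image ids taken from the bottom of the initial ranking
    A class is dropped when it holds only negative examples or more
    negative than positive examples. An image listed in both sets is
    counted as positive (an assumption for this sketch).
    """
    pos_count = defaultdict(int)
    neg_count = defaultdict(int)
    for image_id, cls in labels.items():
        if image_id in positives:
            pos_count[cls] += 1
        elif image_id in negatives:
            neg_count[cls] += 1
    return {cls for cls in set(labels.values())
            if neg_count[cls] <= pos_count[cls]}


def diversify(ranked_ids, labels, keep, max_images):
    """Pick the best-ranked image of each relevant class, then the second
    best of each class, and so on, until max_images are selected."""
    queues = defaultdict(list)
    for image_id in ranked_ids:          # ranked_ids follows the initial ranking
        cls = labels.get(image_id)
        if cls in keep:
            queues[cls].append(image_id)
    # Dict insertion order means classes are visited in the order of their
    # best-ranked member (assumed tie-breaking rule, not stated in the paper).
    order = list(queues)
    selected = []
    depth = 0
    while len(selected) < max_images and any(depth < len(queues[c]) for c in order):
        for cls in order:
            if depth < len(queues[cls]) and len(selected) < max_images:
                selected.append(queues[cls][depth])
        depth += 1
    return selected


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e", "f", "g", "h"]   # toy initial ranking
    labels = {"a": 0, "b": 0, "c": 1, "d": 0, "e": 1, "f": 2, "g": 2, "h": 2}
    keep = relevant_classes(labels, {"a", "b", "c", "d", "e"}, {"f", "g", "h"})
    print(diversify(ranked, labels, keep, 4))
```

In this toy example class 2 holds only negative examples and is discarded, so the selection alternates between classes 0 and 1 in rank order.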