LAPI @ 2015 Retrieving Diverse Social Images Task:
A Pseudo-Relevance Feedback Diversification Perspective

                                 Bogdan Boteanu1∗, Ionuţ Mironică1†, Bogdan Ionescu1‡
                                 1 LAPI, University “Politehnica” of Bucharest, Romania
                                  {bboteanu,imironica,bionescu}@alpha.imag.pub.ro


ABSTRACT
In this paper we present the results achieved during the 2015 MediaEval Retrieving Diverse Social Images Task, using an approach based on pseudo-relevance feedback, in which human feedback is replaced by an automatic selection of images. The proposed approach is designed to prioritize the diversification of the results, in contrast to most of the existing techniques, which address only relevance. Diversification is achieved by exploiting a hierarchical clustering scheme followed by a diversification strategy. Methods are tested on the benchmarking data and results are analyzed. Insights for future work conclude the paper.
1.    INTRODUCTION
   An efficient information retrieval system should be able to provide search results which are at the same time relevant for the query and cover different aspects of it, i.e., diverse. The 2015 Retrieving Diverse Social Images Task [1] addresses this issue in the context of a real-world tourism usage scenario. Given a ranked list of location photos retrieved from Flickr1, participating systems are expected to refine the results by providing up to 50 images that are at the same time relevant and provide a diversified summary of the location. These results will help potential tourists in selecting their visiting locations. The refinement and diversification process is based on the social metadata associated with the images and/or on the visual characteristics. A complete overview of the task is presented in [1].
   Despite the current advances of machine intelligence techniques used in the area of information retrieval and multimedia, in the search for high performance and adaptation to user needs, more and more research is now turning towards the concept of “human in the loop” [2]. The idea is to bring human expertise into the processing chain, thus combining the accuracy of human judgements with the computational power of machines.
   In this work we propose a novel perspective that exploits the concept of pseudo-relevance feedback. Relevance feedback (RF) techniques attempt to introduce the user in the loop by harvesting feedback about the relevance of the search results. This information is used as ground truth for re-computing a better representation of the data needed. Relevance feedback proved efficient in improving the precision of the results [3], but its potential was not fully exploited for diversification. The main contribution of our approach is in proposing a pseudo-relevance feedback technique which substitutes the user needed in traditional RF and in proposing several diversity-adapted relevance feedback schemes.

2.    PROPOSED APPROACH
   In traditional RF techniques, recording actual user feedback is inefficient in terms of time and human resources. The proposed approach, denoted in the following as HC-RF, attempts to replace user input with machine-generated ground truth. It exploits the concept of pseudo-relevance feedback, which is based on the assumption that the top k ranked documents are relevant; the feedback is then learned as in traditional RF under this assumption [6]. A general diagram of the approach is depicted in Figure 1.

Figure 1: General scheme of the proposed approach

   The algorithm is as follows. Firstly, we remove non-relevant images using three filters. The first one is the Viola-Jones [4] face detector, which filters out images with persons as the main subject. The second one is an image blur detector based on the aggregation of 10 state-of-the-art blur indicators as implemented by Said Pertuz2. The last one is a GPS distance-based filter, which rejects the images that are positioned too far away from the query location and which therefore cannot be relevant shots for that location.
   In the next step we propose a pseudo-relevance feedback scheme based on the selection of the images assessed in an automated manner. We consider that most of the first returned results are relevant (i.e., positive examples). For instance, on devset [1], on average, 40 out of 50 returned images are relevant, which supports our assumption. In contrast, the very last of the results are more likely non-relevant and are considered accordingly (i.e., negative examples).

∗ This work has been funded by the Ministry of European Funds through the Financial Agreement POSDRU 187/1.5/S/155420.
† The work was funded by the ESF POSDRU/159/1.5/S/132395 InnoRESEARCH programme.
‡ This work is supported by the European Science Foundation, activity on “Evaluating Information Access Systems”.
1 http://flickr.com/.
2 http://www.mathworks.com/matlabcentral/fileexchange/27314-focus-measure/content/fmeasure/fmeasure.m

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany
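The GPS distance-based filter and the automatic selection of positive and negative examples described in Section 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the 5 km threshold, the choice to keep images without GPS tags, and the sample coordinates are all assumptions made for illustration only.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def gps_filter(ranked, query_latlon, max_km):
    """Drop images tagged farther than max_km from the query location.

    `ranked` is a list of (image_id, lat, lon) in initial rank order.
    Images without a GPS tag are kept (an assumption, not from the paper).
    """
    qlat, qlon = query_latlon
    kept = []
    for image_id, lat, lon in ranked:
        if lat is None or lon is None:
            kept.append(image_id)
        elif haversine_km(lat, lon, qlat, qlon) <= max_km:
            kept.append(image_id)
    return kept


def pseudo_labels(ranked_ids, n_pos, n_neg):
    """Treat the first n_pos results as positives and the last n_neg as negatives."""
    n_pos = min(n_pos, len(ranked_ids))
    n_neg = min(n_neg, len(ranked_ids) - n_pos)  # never overlap with positives
    positives = ranked_ids[:n_pos]
    # Guard n_neg == 0 explicitly: ranked_ids[-0:] would return the whole list.
    negatives = ranked_ids[len(ranked_ids) - n_neg:] if n_neg else []
    return positives, negatives


# Hypothetical ranked list for a query located at (48.8584, 2.2945).
ranked = [
    ("a", 48.8584, 2.2945),
    ("b", 48.8606, 2.3376),
    ("c", 40.7128, -74.0060),  # far from the query location, rejected
    ("d", None, None),         # no GPS tag, kept by assumption
]
kept_ids = gps_filter(ranked, (48.8584, 2.2945), max_km=5.0)
positives, negatives = pseudo_labels(kept_ids, n_pos=2, n_neg=1)
```

In the experiments reported in Section 3.1, the number of positive examples is varied from 80 to 160 and the number of negative examples from 0 to 21; the small values above only keep the example readable.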
   The positive and negative examples are fed to a Hierarchical Clustering3 scheme which yields a dendrogram of classes. For a certain cutting point (i.e., number of classes), a class is declared non-relevant if it contains only negative examples or if the number of negative examples is higher than that of positive ones. The final step is the actual diversification scheme. We select from each of the relevant classes the image which has the highest rank according to the initial ranking of the system. Then we proceed by selecting the second image in the same manner, and the process is repeated until a maximum number of images is reached. The resulting images represent the output of the proposed system.

Table 1: Best pseudo-relevance feedback results for each modality or combination of modalities on devset (best results are depicted in bold).

 metric/    HC-RF    HC-RF    HC-RF      HC-RF    HC-RF    Flickr
 method     visual   text     vis-text   cred.    CNN      init. res.
 P@20       0.8199   0.8346   0.8281     0.7281   0.7546   0.8118
 CR@20      0.4423   0.4588   0.4484     0.4415   0.4234   0.3432
 F1@20      0.5655   0.5839   0.5735     0.5426   0.5356   0.4713

Table 2: Results for the official runs on testset (best results are depicted in bold).

 set            metric   Run1     Run2     Run3     Run4     Run5
 Overall        P@20     0.7241   0.709    0.7306   0.7126   0.7227
                CR@20    0.4156   0.4306   0.4062   0.449    0.3999
                F1@20    0.5164   0.5231   0.5056   0.5336   0.4994
 One-topic      P@20     0.7319   0.7391   0.7341   0.7442   0.7123
                CR@20    0.4153   0.4392   0.4211   0.4294   0.3934
                F1@20    0.5222   0.5402   0.5219   0.5308   0.4958
 Multi-topics   P@20     0.7164   0.6793   0.7271   0.6814   0.7329
                CR@20    0.416    0.4222   0.3915   0.4684   0.4063
                F1@20    0.5108   0.5063   0.4895   0.5364   0.503

3.    EXPERIMENTAL RESULTS
   This section presents the experimental results achieved on devset, which consists of 153 queries and 45,375 images, and on testset, which consists of 139 queries (69 one-concept and 70 multi-concept) and 41,394 images. For devset, we first optimized the parameters of the filters in order to obtain the best precision. Based on this configuration we then applied the proposed approach. Ground truth was also provided with the data for this set for preliminary validation of the approaches. The final benchmarking is, however, conducted on testset.
   In our approaches, images are represented with the content descriptors that were provided with the task data, i.e., visual (e.g., color, feature descriptors), text (e.g., term frequency - inverse document frequency representations of metadata) and user annotation credibility (e.g., face proportions, upload frequency) information. Detailed information about the provided content descriptors is available in [1]. Performance is assessed with Precision at X images (P@X), Cluster Recall at X (CR@X) and F1-measure at X (F1@X).

3.1    Results on devset
   Several tests were performed with different descriptor combinations and various cutoff points. Descriptors are combined with an early fusion approach. We varied the number of initial images considered as positive examples, from 80 to 160 with a step of 10 images, the number of last images considered as negative examples, from 0 to 21 with a step of 3, and the inconsistency coefficient threshold for which HC naturally divides the data into well-separated clusters, from 0.1 to 0.95 with a step of 0.05. We select the combinations yielding the highest F1@20, which is the official metric.
   While experimenting, we observed that, by increasing the number of analyzed images, precision tends to slightly decrease as the probability of obtaining non-relevant images increases; at the same time, diversity increases, as having more images makes it more likely to get more diverse representations. For brevity, in the following we focus on presenting only the results at a cutoff of 20 images, which is the official cutoff point. These results are presented in Table 1. To serve as baseline for the evaluation, we also present the Flickr initial retrieval results. From the modality point of view, the text descriptor (TF) leads to the highest results (F1@20=0.5839), followed closely by the combination of all visual and all text descriptors (F1@20=0.5735) and then by visual (LBP) (F1@20=0.5655), all credibility information (F1@20=0.5426) and all convolutional neural network (CNN) based descriptors (F1@20=0.5356).

3.2    Official results on testset
   Following the previous experiments, the final runs were determined for the best modality/parameter combinations obtained on devset (see Table 1). We submitted five runs, computed as follows: Run1 - automated using visual information only: HC-RF visual LBP, Run2 - automated using text information only: HC-RF text TF, Run3 - automated using visual-text information: HC-RF all visual-all text, Run4 - automated using credibility information only: HC-RF all cred., and Run5 - everything allowed: HC-RF all CNN. Results are presented in Table 2.
   What is interesting to observe is the fact that the highest precision is achieved on the one-topic set, using credibility information (Run4 - P@20 = 0.7442), whereas maximum diversification is achieved on the multi-topics set, using the same type of information (Run4 - CR@20 = 0.4684). Another interesting observation is that credibility information was useful in the context of overall diversification. Credibility information gives an automatic estimation of the quality of tag-image content relationships, telling which users are most likely to share relevant images in Flickr. Best diversification is achieved, CR@20 = 0.4684, due to the high probability that different relevant images belong to different users with a good credibility score. In terms of overall F1 metric score, the use of credibility information (Run4 - F1@20 = 0.5336) allows for better performance than the text descriptor (TF, Run2) by about 1 percentage point and than the visual descriptor (LBP, Run1) by 1.7 percentage points.

4.     CONCLUSIONS
   We approached the image search result diversification issue from the perspective of relevance feedback techniques, where user feedback is substituted with an automatic pseudo-feedback approach. Results show that, in general, the automatic techniques improve both precision and diversification, which shows the real potential of relevance feedback for diversification. Future developments will mainly address a more efficient exploitation of the different modalities (visual-text-credibility), e.g., via late fusion techniques, as well as the exploitation of adaptive face detectors that are able to filter out only a certain category of images, e.g., with people in focus, and pass other categories of images, e.g., with crowds that are naturally present at a target location.

3 http://www.mathworks.com/help/stats/hierarchical-clustering.html
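As a recap of the class filtering and round-robin selection described in Section 2, together with the metrics named in Section 3, the following Python sketch may be useful. It is a hedged illustration rather than the authors' code: class labels are assumed to come from an already cut dendrogram, the order in which classes are visited within a pass (by rank of their best image) is an assumption, and the metric functions follow the usual definitions of P@X, CR@X and F1@X; the official evaluation tool described in [1] remains authoritative. All names and toy data are hypothetical.

```python
from collections import defaultdict


def relevant_classes(class_of, positives, negatives):
    """Keep a class unless it holds only negatives or more negatives than positives."""
    pos, neg = defaultdict(int), defaultdict(int)
    for i in positives:
        pos[class_of[i]] += 1
    for i in negatives:
        neg[class_of[i]] += 1
    labelled = set(pos) | set(neg)
    # Every labelled class has at least one example, so neg <= pos implies pos >= 1.
    return {c for c in labelled if neg[c] <= pos[c]}


def diversify(ranked_ids, class_of, keep, max_images):
    """Round-robin: each pass takes the next best-ranked image of every kept class."""
    queues = defaultdict(list)
    for i in ranked_ids:  # ranked_ids is in initial rank order
        if class_of.get(i) in keep:
            queues[class_of[i]].append(i)
    out, depth = [], 0
    while len(out) < max_images:
        # Classes are visited in order of their best-ranked image (assumption).
        picked = [q[depth] for q in queues.values() if depth < len(q)]
        if not picked:
            break
        out.extend(picked[: max_images - len(out)])
        depth += 1
    return out


def precision_at(result, gt_cluster, x):
    """P@X: fraction of the top X results that are relevant."""
    return sum(1 for i in result[:x] if i in gt_cluster) / x


def cluster_recall_at(result, gt_cluster, x):
    """CR@X: fraction of ground-truth clusters represented in the top X results."""
    total = len(set(gt_cluster.values()))
    found = {gt_cluster[i] for i in result[:x] if i in gt_cluster}
    return len(found) / total if total else 0.0


def f1_at(p, cr):
    """F1@X: harmonic mean of P@X and CR@X."""
    return 2 * p * cr / (p + cr) if p + cr else 0.0


# Toy example with hypothetical ids, classes and ground truth.
ranked = ["a", "b", "c", "d", "e", "f", "g"]
class_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2, "g": 2}
keep = relevant_classes(class_of, positives=["a", "b", "c"], negatives=["f", "g"])
result = diversify(ranked, class_of, keep, max_images=3)

gt_cluster = {"a": "x", "b": "x", "c": "y", "e": "z"}  # relevant image -> cluster
p = precision_at(result, gt_cluster, 3)
cr = cluster_recall_at(result, gt_cluster, 3)
f1 = f1_at(p, cr)
```

In this toy run, class 2 is discarded because it contains only negative examples, so the selection alternates between classes 0 and 1 until the image budget is reached.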
5.   REFERENCES
[1] B. Ionescu, A.L. Gînscă, B. Boteanu, A. Popescu, M. Lupu,
    H. Müller, “Retrieving Diverse Social Images at MediaEval
    2015: Challenge, Dataset and Evaluation”, MediaEval 2015
    Workshop, September 14-15, Wurzen, Germany, 2015.
[2] B. Emond, “Multimedia and Human-in-the-loop: Interaction
    as Content Enrichment”, ACM Int. Workshop on
    Human-Centered Multimedia, pp. 77-84, 2007.
[3] J. Li, N.M. Allinson, “Relevance Feedback in Content-Based
    Image Retrieval: A Survey”, Handbook on Neural
    Information Processing, 49, pp. 433-469, Springer 2013.
[4] P. Viola, M. J. Jones, “Robust Real-Time Face Detection”,
    International Journal of Computer Vision, 57(2), pp.
    137-154, 2004.
[5] B. Boteanu, I. Mironică, B. Ionescu, “A Relevance Feedback
    Perspective to Image Search Result Diversification”, IEEE
    ICCP, September 4-6, Cluj-Napoca, Romania, 2014.
[6] B. Boteanu, I. Mironică, B. Ionescu, “Hierarchical
    Clustering Pseudo-Relevance Feedback for Social Image
    Search Result Diversification”, IEEE CBMI, June 10-12,
    Prague, Czech Republic, 2015.