LAPI @ 2017 Retrieving Diverse Social Images Task:
A Pseudo-Relevance Feedback Diversification Perspective
Bogdan Boteanu, Mihai Gabriel Constantin, Bogdan Ionescu
LAPI, University “Politehnica” of Bucharest, Romania
{bboteanu,mgconstantin,bionescu}@alpha.imag.pub.ro

ABSTRACT
In this paper we present the results achieved during the 2017 MediaEval
Retrieving Diverse Social Images Task, using an approach based on
pseudo-relevance feedback (RF), in which human feedback is replaced by an
automatic selection of images. The proposed approach is designed to prioritize
the diversification of the results, in contrast to most existing techniques,
which address only relevance. Diversification is achieved by exploiting a
hierarchical clustering (HC) scheme followed by a diversification strategy.
Methods are tested on the benchmarking data and results are analyzed. Insights
for future work conclude the paper.

Figure 1: General scheme of the proposed approach

Copyright held by the owner/author(s).
MediaEval’17, 13-15 September 2017, Dublin, Ireland

1     INTRODUCTION
An efficient information retrieval system should be able to provide search
results that are at the same time relevant for the query and cover different
aspects of it, i.e., diverse. The 2017 Retrieving Diverse Social Images Task [7]
addresses this issue in the context of a general ad-hoc image retrieval system,
which provides the user with diverse representations of the queries. The system
should be able to tackle complex and general-purpose multi-concept queries.
Given a ranked list of photos retrieved from Flickr (http://flickr.com/),
participating systems are expected to refine the results by providing up to 50
images that are at the same time relevant and provide a diversified summary of
the query. The process is based on the social metadata associated with the
images and/or on the visual characteristics. A complete overview of the task is
presented in [7].
   Despite the current advances of machine intelligence techniques used in the
area of information retrieval and multimedia, in the search for high
performance and adaptation to user needs, more and more research is now turning
towards the concept of “human in the loop” [5]. The idea is to bring human
expertise into the processing chain, thus combining the accuracy of human
judgements with the computational power of machines.
   Due to the good performance achieved in [3], this year we decided to follow
the same approach, which is an adapted version of the work in [2] that exploits
the concept of RF. RF techniques attempt to introduce the user in the loop by
harvesting feedback about the relevance of the search results. This information
is used as ground truth for re-computing a better representation of the data
needed. Relevance feedback proved efficient in improving the precision of the
results [6], but its potential was not fully exploited for diversification. The
main contribution of our approach is in proposing a pseudo-relevance feedback
technique which substitutes the user needed in traditional RF and in proposing
several diversity-adapted relevance feedback schemes.

2    APPROACH
In traditional RF techniques, recording actual user feedback is inefficient in
terms of time and human resources. The proposed approach, denoted in the
following HC-RF, attempts to replace user input with machine generated ground
truth. It exploits the concept of pseudo-relevance feedback. The concept is
based on the assumption that the top k ranked documents are relevant and the
feedback is learned as in traditional RF under this assumption [1]. A general
diagram of the approach is depicted in Figure 1.
   Similarly to [3], we did not use a pre-processing step, i.e., filters for
the non-relevant images. The motivation is based on the specificity of the
dataset proposed for this year [7], i.e., the use of multi-topic queries in the
development and evaluation sets. An image containing people, or depicting a
location or a place which is geographically far away from the query, can be
considered relevant as long as it is a common photo representation of the query
topics (all at once). Also, we noticed in an extensive study [4] that the blur
filter does not significantly improve the overall performance, thus we decided
to remove it to reduce complexity. The algorithm is as follows.
   First, we employ a pseudo-relevance feedback scheme based on an automatic
selection of the images. We consider that the first returned results are
relevant (i.e., positive examples). For instance, on devset [7], on average, 26
out of 50 returned images are relevant, which supports our assumption. In
contrast, the very last of the results are more likely non-relevant and
considered accordingly (i.e., negative examples). The positive and negative
examples are fed to an HC scheme
(http://www.mathworks.com/help/stats/hierarchical-clustering.html), which yields
a dendrogram of classes. For a certain cutting point (i.e., number of classes),
a class is declared non-relevant if it contains only negative examples or the
number of negative examples is higher than the positive ones. The final
step is the actual diversification scheme, which is a round robin approach. We
select from each of the relevant classes one image which has the highest rank
according to the initial ranking of the system. Then, we remove the selected
images from the clusters and proceed by selecting the remaining ones in the
same manner. The process is repeated until a maximum number of images is
reached. The resulting images represent the output of the proposed system.

3     EXPERIMENTAL RESULTS
This section presents the experimental results achieved on devset, which
consists of 110 multi-topic queries and 32,487 images, and on testset, which
consists of 84 multi-topic queries and 24,986 images. We optimized the
parameters of the proposed approach on devset to obtain the best precision and
diversity. The final benchmarking is, however, conducted on testset.
   In our approaches, images are represented with the content descriptors that
were provided with the task data, i.e., visual (e.g., convolutional neural
network based descriptors), text (e.g., term frequency - inverse document
frequency representations of metadata) and user annotation credibility (e.g.,
upload frequency) information. Detailed information about the provided content
descriptors is available in [7]. Performance is assessed with Precision at X
images (P@X), Cluster Recall at X (CR@X) and F1-measure at X (F1@X).

3.1      Results on devset
Several tests were performed with different descriptor combinations and various
cutoff points. Descriptors are combined with an early fusion approach
(normalization and concatenation). We varied the number of initial images
considered as positive examples (Np) from 100 to 280 with a step of 20 images,
the number of last images considered as negative examples (Nn) from 0 to 20
with a step of 10, and the inconsistency coefficient threshold for which HC
divides the data into well-separated clusters (Nc) from 0.5 to 1.3 with a step
of 0.2. We select the combinations yielding the highest F1@20, which is the
official metric.
   While experimenting, we observed that, by increasing the number of analyzed
images, precision tends to decrease as the probability of obtaining
non-relevant images increases; at the same time, diversity increases, as having
more images makes it more likely to get more diverse representations. For
brevity, in the following we focus on presenting only the results at a cutoff
of 20 images, which is the official cutoff point. These results are presented
in Table 1. We also present the Flickr initial retrieval results to serve as a
baseline for the evaluation. From the modality point of view, visual-text
information (visual-all textual-all) with the parameter setup
(Np-Nn-Nc)=(180-0-1.1) led to the highest results (F1@20=0.4773). This
performance was followed by visual (visual-all without CNN), textual
(textual-all), CNN and credibility descriptors (F1@20=0.4473), all with the
(180-20-1.1) parameter setup.

Table 1: Best RF results for each modality or combination of modalities on
devset (best results are marked with *).

metric/run   HC-RF     HC-RF     HC-RF      HC-RF     HC-RF     Flickr
             visual    text      vis-text   CNN       cred.     init. res.
P@20         0.575     0.575     0.6136*    0.575     0.575     0.5864
CR@20        0.3969    0.3969    0.4234*    0.3969    0.3969    0.3646
F1@20        0.4473    0.4473    0.4773*    0.4473    0.4473    0.4277

3.2    Official results on testset
Following the previous experiments, the final runs were determined for the best
modality/parameter combinations obtained on devset (see Table 1). We submitted
five runs, computed as follows: Run1 - automated using visual information only:
HC-RF all visual; Run2 - automated using text information only: HC-RF all text;
Run3 - automated using visual-text information: HC-RF all visual-all text;
Run4 - everything allowed: HC-RF CNN; and Run5 - everything allowed: HC-RF
cred. Results are presented in Table 2.

Table 2: Results for the official runs on testset (best results are marked
with *).

metric/run   Run1      Run2      Run3      Run4      Run5
P@20         0.6333*   0.6214    0.6196    0.5845    0.6018
CR@20        0.5791    0.5794    0.5729    0.5216    0.6045*
F1@20        0.5753    0.5733    0.5741    0.5253    0.5777*

   It is interesting to observe that the highest precision is achieved using
visual information (Run1 - P@20 = 0.6333), whereas maximum diversification is
achieved using credibility information (Run5 - CR@20 = 0.6045), more than 2
percentage points above the other types of descriptors. Relevance was also
preserved, which leads to the conclusion that credibility information was
useful in the context of overall diversification. Credibility information
estimates the quality of tag-image content relationships, telling which users
are most likely to share relevant images in Flickr. Best diversification is
achieved in this case due to a high probability that different and relevant
images belong to different users with a good credibility score. In terms of the
F1 metric, the use of credibility information (Run5 - F1@20 = 0.5777) allows
for the best performance, followed closely by visual descriptors (Run1 -
F1@20 = 0.5753). Visual-textual information also achieved good performance
(Run3 - F1@20 = 0.5741), followed by textual information (Run2 - F1@20 =
0.5733). The CNN descriptors had the lowest performance (Run4 - F1@20 =
0.5253), more than 5 percentage points below the credibility information.

4     CONCLUSIONS
We approached the image search result diversification issue from the
perspective of relevance feedback techniques, where user feedback is
substituted with an automatic pseudo-relevance feedback approach. Results show
that, in general, the automatic techniques improve precision and
diversification, which shows the real potential of relevance feedback for
diversification. Future developments will mainly address different efficient
exploitations of re-ranking approaches, e.g., relevance-score estimation
techniques, to improve the relevance and consequently the overall
diversification. Another perspective is to also exploit the advantages of deep
neural networks and use them in the context of automatic
relevance-feedback-based diversification scenarios, by classifying the selected
positive and negative examples using unsupervised deep-learning-based
classifiers.
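
The selection steps of HC-RF described in Section 2 (automatic choice of positive and negative examples, rejection of non-relevant classes, round robin diversification) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the hierarchical clustering itself is assumed to have been run beforehand (its output is passed in as a plain image-to-class mapping), and two details that the text leaves open are filled with assumptions, namely that classes are visited in the order of their best-ranked image and that every member of a relevant class may be selected.

```python
from collections import defaultdict


def select_examples(ranked_ids, n_pos, n_neg):
    """Pseudo-relevance feedback: the first n_pos results are taken as
    positive examples, the last n_neg results as negative examples."""
    positives = ranked_ids[:n_pos]
    # Guard n_neg == 0: ranked_ids[-0:] would return the whole list.
    negatives = ranked_ids[len(ranked_ids) - n_neg:] if n_neg > 0 else []
    pos_set = set(positives)
    negatives = [i for i in negatives if i not in pos_set]
    return positives, negatives


def relevant_classes(class_of, positives, negatives):
    """Drop every class in which negatives outnumber positives.
    A class holding only negatives is covered by the same test."""
    pos_set, neg_set = set(positives), set(negatives)
    members = defaultdict(list)
    for image_id, label in class_of.items():
        members[label].append(image_id)
    kept = {}
    for label, ids in members.items():
        n_pos = sum(1 for i in ids if i in pos_set)
        n_neg = sum(1 for i in ids if i in neg_set)
        if n_neg <= n_pos:
            kept[label] = ids
    return kept


def round_robin(kept, rank, max_images):
    """Repeatedly take the best-ranked remaining image of each class."""
    queues = [sorted(ids, key=lambda i: rank[i]) for ids in kept.values()]
    # Assumption: classes are visited in order of their best-ranked image.
    queues.sort(key=lambda q: rank[q[0]])
    output = []
    while queues and len(output) < max_images:
        for q in queues:
            output.append(q.pop(0))
            if len(output) == max_images:
                break
        queues = [q for q in queues if q]
    return output


# Toy example: ten images ranked 0..9, three classes from a prior HC step.
ranked = list(range(10))
rank = {image_id: pos for pos, image_id in enumerate(ranked)}
positives, negatives = select_examples(ranked, n_pos=6, n_neg=2)
class_of = {0: "a", 1: "a", 5: "a", 2: "b", 3: "b", 4: "c", 8: "c", 9: "c"}
kept = relevant_classes(class_of, positives, negatives)
diversified = round_robin(kept, rank, max_images=4)
```

In the toy example, class "c" holds one positive and two negative examples and is therefore rejected, and the output alternates between the two remaining classes.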
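
The three metrics named in Section 3 can also be sketched for a single query. The definitions below follow the usual reading of these measures (fraction of relevant images among the top X, fraction of ground-truth clusters represented among the top X, and their harmonic mean). The exact evaluation protocol is the one given in [7], so this is an illustration under those assumptions, with hypothetical function and variable names.

```python
def precision_at(ranked_ids, relevant, x):
    """P@X: share of the first x returned images that are relevant."""
    return sum(1 for i in ranked_ids[:x] if i in relevant) / x


def cluster_recall_at(ranked_ids, cluster_of, n_clusters, x):
    """CR@X: share of ground-truth clusters represented in the first x images.
    cluster_of maps each relevant image to its ground-truth cluster."""
    found = {cluster_of[i] for i in ranked_ids[:x] if i in cluster_of}
    return len(found) / n_clusters


def f1_at(p, cr):
    """F1@X: harmonic mean of P@X and CR@X."""
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)


# Toy query: three ground-truth clusters, one of them never retrieved.
ranked = [0, 2, 1, 3, 5]
cluster_of = {0: "x", 1: "x", 2: "y", 7: "z"}
p = precision_at(ranked, set(cluster_of), 4)
cr = cluster_recall_at(ranked, cluster_of, 3, 4)
f1 = f1_at(p, cr)
```

The F1@20 values in Tables 1 and 2 do not equal the harmonic mean of the reported P@20 and CR@20 averages (for Run1, that harmonic mean would be about 0.605 rather than 0.5753). One possible explanation is that F1 is computed per query and then averaged, but the text does not state this.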


REFERENCES
[1] Bogdan Boteanu, Ionuţ Mironică, and Bogdan Ionescu. 2015. Hierarchical
    Clustering Pseudo-Relevance Feedback for Social Image Search Result
    Diversification. Content-Based Multimedia Indexing (CBMI), 2015 13th
    International Workshop on (September 2015), 1–6.
[2] Bogdan Boteanu, Ionuţ Mironică, and Bogdan Ionescu. 2015. LAPI @ 2015
    Retrieving Diverse Social Images Task: A Pseudo-Relevance Feedback
    Diversification Perspective. MediaEval 2015 Workshop (September 2015).
[3] Bogdan Boteanu, Ionuţ Mironică, and Bogdan Ionescu. 2016. LAPI @ 2016
    Retrieving Diverse Social Images Task: A Pseudo-Relevance Feedback
    Diversification Perspective. MediaEval 2016 Workshop (October 2016).
[4] Bogdan Boteanu, Ionuţ Mironică, and Bogdan Ionescu. 2016. Pseudo-Relevance
    Feedback Diversification of Social Image Retrieval Results. Multimedia
    Tools and Applications 76, 9 (2016), 11889–11916.
[5] Bruno Emond. 2007. Multimedia and Human-in-the-loop: Interaction as
    Content Enrichment. ACM International Workshop on Human-Centered
    Multimedia (2007), 77–84.
[6] Jing Li and Nigel M Allinson. 2013. Relevance Feedback in Content-Based
    Image Retrieval: A Survey. Handbook on Neural Information Processing 49
    (2013), 433–469.
[7] Maia Zaharieva, Bogdan Ionescu, Alexandru Lucian Gînscă, Rodrygo L.T.
    Santos, and Henning H. Müller. 2017. Retrieving Diverse Social Images at
    MediaEval 2017: Challenges, Dataset and Evaluation. MediaEval 2017
    Workshop (September 2017).