<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LAPI @ 2015 Retrieving Diverse Social Images Task: A Pseudo-Relevance Feedback Diversification Perspective</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bogdan Boteanu</string-name>
          <email>bboteanu@alpha.imag.pub.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ionuț Mironică</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <email>bionescu@alpha.imag.pub.ro</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LAPI, University “Politehnica” of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper we present the results achieved during the 2015 MediaEval Retrieving Diverse Social Images Task, using an approach based on pseudo-relevance feedback, in which human feedback is replaced by an automatic selection of images. The proposed approach is designed to prioritize the diversification of the results, in contrast to most existing techniques, which address only relevance. Diversification is achieved by exploiting a hierarchical clustering scheme followed by a diversification strategy. Methods are tested on the benchmarking data and results are analyzed. Insights for future work conclude the paper.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        An efficient information retrieval system should be able to
provide search results that are at the same time relevant to the query
and diverse, i.e., that cover different aspects of it. The 2015 Retrieving
Diverse Social Images Task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] addresses this issue in the context of
a real-world tourism usage scenario. Given a ranked list of location
photos retrieved from Flickr<sup>1</sup>, participating systems are expected to
refine the results by providing up to 50 images that are at the same
time relevant and provide a diversified summary of the location.
These results are intended to help potential tourists select the locations
to visit. The refinement and diversification process is based on
the social metadata associated with the images and/or on their visual
characteristics. A complete overview of the task is presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Despite the current advances in machine intelligence techniques
for information retrieval and multimedia, research aiming at
high performance and adaptation to user needs is increasingly
turning towards the concept of “human in
the loop” [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The idea is to bring human expertise into the
processing chain, thus combining the accuracy of human judgements
with the computational power of machines.
      </p>
      <p>
        In this work we propose a novel perspective that exploits the
concept of pseudo-relevance feedback. Relevance feedback (RF) techniques attempt to
bring the user into the loop by harvesting feedback about the
relevance of the search results. This information is used as ground
truth for re-computing a better representation of the data needed.
Relevance feedback has proved efficient in improving the precision of
the results [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], but its potential has not been fully exploited for
diversification. The main contribution of our approach is in proposing
a pseudo-relevance feedback technique that substitutes the user
needed in traditional RF and in proposing several diversity-adapted
relevance feedback schemes.
        <fn><p>This work has been funded by the Ministry of European Funds
through the Financial Agreement POSDRU 187/1.5/S/155420.</p></fn>
        <fn><label>†</label><p>The work was funded by the ESF POSDRU/159/1.5/S/132395
InnoRESEARCH programme.</p></fn>
        <fn><label>‡</label><p>This work is supported by the European Science Foundation,
activity on “Evaluating Information Access Systems”.</p></fn>
        <fn><label>1</label><p>http://flickr.com/</p></fn>
      </p>
    </sec>
    <sec id="sec-2">
      <title>PROPOSED APPROACH</title>
      <p>
        In traditional RF techniques, recording actual user feedback is
inefficient in terms of time and human resources. The proposed
approach, denoted HC-RF in the following, attempts to replace user
input with machine-generated ground truth. It exploits the concept
of pseudo-relevance feedback, which is based on the
assumption that the top k ranked documents are relevant; the feedback is
then learned as in traditional RF under this assumption [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A general
diagram of the approach is depicted in Figure 1.
      </p>
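      <p>The pseudo-relevance assumption can be illustrated with a minimal sketch. The function name and the parameters n_pos and n_neg are hypothetical and are not taken from the original implementation; the sketch only shows how a ranked list might be split into automatically labeled examples.</p>

```python
def pseudo_labels(ranked_ids, n_pos, n_neg):
    """Split a ranked list into pseudo-positive and pseudo-negative examples.

    ranked_ids: image identifiers ordered from best to worst initial rank.
    n_pos: number of top-ranked images assumed to be relevant (assumption).
    n_neg: number of bottom-ranked images assumed to be non-relevant (assumption).
    """
    if n_pos + n_neg > len(ranked_ids):
        raise ValueError("n_pos + n_neg exceeds the length of the ranked list")
    positives = ranked_ids[:n_pos]
    # Slicing from an explicit start index also works for n_neg == 0 (empty list).
    negatives = ranked_ids[len(ranked_ids) - n_neg:]
    return positives, negatives
```

      <p>No user is consulted at any point: the labels follow only from the initial ranking, which is what distinguishes this setting from traditional RF.</p>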
      <p>
        The algorithm is as follows. First, we remove non-relevant
images using three filters. The first is the Viola-Jones [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] face
detector, which filters out images with persons as the main subject.
The second is an image blur detector based on the aggregation of
10 state-of-the-art blur indicators as implemented by Said Pertuz<sup>2</sup>.
The last is a GPS distance-based filter, which rejects
images positioned too far from the query location, which
therefore cannot be relevant shots of that location.
      </p>
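      <p>As an illustration of the third filter, the sketch below rejects images whose GPS position is farther than a given radius from the query location. The haversine formula, the dictionary keys, the radius parameter max_km and the choice to keep images without coordinates are all assumptions made for this example; the text does not specify how the distance or the threshold were defined.</p>

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def gps_filter(images, query_lat, query_lon, max_km):
    """Keep images located within max_km of the query location.

    Images without coordinates are kept (an assumption of this sketch).
    """
    kept = []
    for img in images:
        lat, lon = img.get("lat"), img.get("lon")
        if lat is None or lon is None:
            kept.append(img)
        elif max_km >= haversine_km(lat, lon, query_lat, query_lon):
            kept.append(img)
    return kept
```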
      <p>
        In the next step we propose a pseudo-relevance feedback scheme
based on an automated selection of the example images.
We assume that most of the first returned results are relevant
(i.e., positive examples). For instance, on devset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], on average,
40 out of 50 returned images are relevant, which supports our
assumption. In contrast, the very last results are more likely to be
non-relevant and are treated accordingly (i.e., as negative examples).
The positive and negative examples are fed to a Hierarchical
Clustering<sup>3</sup> (HC) scheme, which yields a dendrogram of classes. For a
certain cutting point (i.e., number of classes), a class is declared
non-relevant if it contains only negative examples or if the number of
negative examples is higher than that of positive ones. The final step
is the actual diversification scheme. We select from each of the
relevant classes the image with the highest rank according to
the initial ranking of the system. We then select the
second image from each class in the same manner, and the process is repeated until
a maximum number of images is reached. The resulting images
represent the output of the proposed system.
        <fn><label>2</label><p>http://www.mathworks.com/matlabcentral/fileexchange/27314-focus-measure/content/fmeasure/fmeasure.m</p></fn>
      </p>
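      <p>The class pruning and the round-robin selection described above can be sketched as follows. The cluster labels are taken as given (for instance, obtained by cutting a dendrogram), and all names are hypothetical. Two details are assumptions of this sketch, since the text does not specify them: pseudo-negative images that fall in a relevant class remain selectable, and the images picked in one round are ordered by their initial rank.</p>

```python
from collections import defaultdict


def diversify(ranked_ids, cluster_of, negatives, max_images=50):
    """Round-robin selection over the relevant classes.

    ranked_ids: clustered image ids, best initial rank first.
    cluster_of: dict mapping image id to its class label.
    negatives: ids used as pseudo-negative examples; all others count as positive.
    """
    negatives = set(negatives)
    members = defaultdict(list)
    for img in ranked_ids:  # keeps the initial rank order inside each class
        members[cluster_of[img]].append(img)

    relevant = []
    for imgs in members.values():
        n_neg = sum(1 for i in imgs if i in negatives)
        n_pos = len(imgs) - n_neg
        # Non-relevant: only negatives, or more negatives than positives.
        if n_pos > 0 and n_pos >= n_neg:
            relevant.append(imgs)

    rank = {img: pos for pos, img in enumerate(ranked_ids)}
    selected = []
    depth = 0
    while max_images > len(selected):
        # Take the next best-ranked image of every class that still has one.
        round_picks = [imgs[depth] for imgs in relevant if len(imgs) > depth]
        if not round_picks:
            break
        round_picks.sort(key=rank.__getitem__)
        selected.extend(round_picks)
        depth += 1
    return selected[:max_images]
```

      <p>For example, with classes {a, b}, {c, d} and {e, f}, where e and f are pseudo-negatives, the third class is discarded and the output order is a, c, b, d.</p>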
    </sec>
    <sec id="sec-3">
      <title>EXPERIMENTAL RESULTS</title>
      <p>This section presents the experimental results achieved on
devset, which consists of 153 queries and 45,375 images, and on
testset, which consists of 139 queries (69 one-concept and
70 multi-concept) and 41,394 images. For devset, we first
optimized the parameters of the filters to obtain the best precision.
Based on this configuration we then applied the proposed approach.
Ground truth was also provided with the data for this set for
preliminary validation of the approaches. The final benchmarking is,
however, conducted on testset.</p>
      <p>
        In our approaches, images are represented with the content
descriptors that were provided with the task data, i.e., visual (e.g.,
color, feature descriptors), text (e.g., term frequency - inverse
document frequency representations of metadata) and user annotation
credibility (e.g., face proportions, upload frequency) information.
Detailed information about provided content descriptors is
available in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Performance is assessed with Precision at X images
(P@X), Cluster Recall at X (CR@X) and F1-measure at X (F1@X).
      </p>
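      <p>For reference, the three metrics can be sketched as below, assuming the usual definitions: P@X is the fraction of relevant images among the first X results, CR@X is the fraction of ground-truth clusters represented among the first X results, and F1@X is their harmonic mean. The function names and data layout are hypothetical and do not reproduce the official evaluation tool.</p>

```python
def precision_at(ranked_ids, relevant_ids, x):
    """Fraction of the first x results that are relevant."""
    return sum(1 for i in ranked_ids[:x] if i in relevant_ids) / x


def cluster_recall_at(ranked_ids, gt_cluster_of, n_clusters, x):
    """Fraction of ground-truth clusters covered by the first x results.

    gt_cluster_of maps each relevant image to its ground-truth cluster;
    n_clusters is the total number of ground-truth clusters for the query.
    """
    found = {gt_cluster_of[i] for i in ranked_ids[:x] if i in gt_cluster_of}
    return len(found) / n_clusters


def f1_at(p, cr):
    """Harmonic mean of precision and cluster recall."""
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)
```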
    </sec>
    <sec id="sec-4">
      <title>Results on devset</title>
      <p>Several tests were performed with different descriptor
combinations and various cutoff points. Descriptors are combined with
an early fusion approach. We varied the number of initial images
considered as positive examples from 80 to 160 with a step of 10
images, the number of last images considered as negative
examples from 0 to 21 with a step of 3, and the inconsistency
coefficient threshold at which HC naturally divides the data into
well-separated clusters from 0.1 to 0.95 with a step of 0.05. We select
the combinations yielding the highest F1@20, which is the official
metric.</p>
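      <p>The parameter sweep can be pictured as an exhaustive grid search over the three ranges given above. In the sketch below, score_fn is a hypothetical callback standing in for a full run of the method followed by the computation of F1@20 on devset; it is not part of the original implementation.</p>

```python
from itertools import product


def parameter_grid():
    """All (n_pos, n_neg, threshold) combinations in the ranges stated in the text."""
    n_pos_values = range(80, 161, 10)  # 80, 90, ..., 160
    n_neg_values = range(0, 22, 3)     # 0, 3, ..., 21
    # Built from integers to avoid floating-point drift: 0.10, 0.15, ..., 0.95
    thresholds = [t / 100 for t in range(10, 96, 5)]
    return list(product(n_pos_values, n_neg_values, thresholds))


def best_configuration(score_fn):
    """Return the configuration that maximizes score_fn(n_pos, n_neg, threshold)."""
    return max(parameter_grid(), key=lambda cfg: score_fn(*cfg))
```

      <p>With these ranges the grid contains 9 x 8 x 18 = 1296 configurations per descriptor combination.</p>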
      <p>While experimenting, we observed that, as the
number of analyzed images increases, precision tends to decrease slightly because the
probability of obtaining non-relevant images increases; at the same
time, diversity increases, as a larger set of images is more likely to
contain more diverse representations. For brevity, in the
following we focus only on the results at a cutoff of 20 images,
which is the official cutoff point. These results are presented in
Table 1. To serve as a baseline for the evaluation, we also present the
initial Flickr retrieval results. From the modality point of view, the text
descriptor (TF) leads to the highest results (F1@20=0.5839),
followed closely by the combination of all visual and all text
descriptors (F1@20=0.5735), then the visual descriptor (LBP) (F1@20=0.5655),
all credibility information (F1@20=0.5426) and all convolutional
neural network (CNN) based descriptors (F1@20=0.5356).
        <fn><label>3</label><p>http://www.mathworks.com/help/stats/hierarchical-clustering.html</p></fn></p>
    </sec>
    <sec id="sec-5">
      <title>Official results on testset</title>
      <p>Following the previous experiments, the final runs were
determined from the best modality/parameter combinations obtained on
devset (see Table 1). We submitted five runs, computed as
follows: Run1 - automated using visual information only: HC-RF
visual LBP; Run2 - automated using text information only: HC-RF
text TF; Run3 - automated using visual-text information: HC-RF
all visual-all text; Run4 - automated using credibility information
only: HC-RF all cred.; and Run5 - everything allowed: HC-RF all
CNN. Results are presented in Table 2.</p>
      <p>It is interesting to observe that the highest
precision is achieved on the one-topic set using credibility information
(Run4 - P@20 = 0.7442), whereas maximum diversification is
achieved on the multi-topic set using the same type of information
(Run4 - CR@20 = 0.4684). Another interesting observation is
that credibility information was useful in the context of overall
diversification. Credibility information gives an automatic
estimation of the quality of tag-image content relationships, indicating which
users are most likely to share relevant images on Flickr. The best
diversification, CR@20 = 0.4684, is achieved due to the high
probability that different relevant images belong to different users with
a good credibility score. In terms of the F1 metric, the use of
credibility information (Run4 - F1@20 = 0.5336) allows for better
performance than the text descriptor (TF) by almost 1% and than the
visual descriptor (LBP) by 1.7%.</p>
    </sec>
    <sec id="sec-6">
      <title>CONCLUSIONS</title>
      <p>We approached the image search result diversification problem from
the perspective of relevance feedback techniques, where user
feedback is substituted with an automatic pseudo-feedback approach.
Results show that, in general, the automatic techniques improve both
precision and diversification, which demonstrates the potential of
relevance feedback for diversification. Future developments will
mainly address a more efficient exploitation of the different
modalities (visual-text-credibility), e.g., via late fusion techniques, as well
as the use of adaptive face detectors that are able to filter out
only a certain category of images, e.g., with people in focus, while
passing other categories of images, e.g., with crowds that are naturally
present at a target location.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.L.</given-names>
            <surname>Gînscă</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , H. Müller, “
          <article-title>Retrieving Diverse Social Images at MediaEval 2015: Challenge, Dataset and Evaluation”</article-title>
          ,
          <source>MediaEval 2015 Workshop</source>
          , September 14-15, Wurzen, Germany,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Emond</surname>
          </string-name>
          , “
          <article-title>Multimedia and Human-in-the-loop: Interaction as Content Enrichment”</article-title>
          ,
          <source>ACM Int. Workshop on Human-Centered Multimedia</source>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>84</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.M.</given-names>
            <surname>Allinson</surname>
          </string-name>
          , “
          <article-title>Relevance Feedback in Content-Based Image Retrieval: A Survey”</article-title>
          ,
          <source>Handbook on Neural Information Processing</source>
          ,
          <volume>49</volume>
          , pp.
          <fpage>433</fpage>
          -
          <lpage>469</lpage>
          ,
          Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Viola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Jones</surname>
          </string-name>
          , “
          <article-title>Robust Real-Time Face Detection”</article-title>
          , in
          <source>International Journal of Computer Vision</source>
          ,
          <volume>57</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>137</fpage>
          -
          <lpage>154</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          , I. Mironică,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , “
          <article-title>A Relevance Feedback Perspective to Image Search Result Diversification”</article-title>
          ,
          <source>IEEE ICCP</source>
          , September 4-6, Cluj-Napoca, Romania,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          , I. Mironică,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , “
          <article-title>Hierarchical Clustering Pseudo-Relevance Feedback for Social Image Search Result Diversification”</article-title>
          ,
          <source>IEEE CBMI</source>
          , June 10-12, Prague, Czech Republic,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>