<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal-based Diversified Summarization in Social Image Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Duc-Tien Dang-Nguyen</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Boato</string-name>
          <email>boato@disi.unitn.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco G. B. De Natale</string-name>
          <email>denatale@ing.unitn.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Piras</string-name>
          <email>luca.piras@diee.unica.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Giacinto</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franco Tuveri</string-name>
          <email>tuveri@crs4.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuela Angioni</string-name>
          <email>angioni@crs4.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Advanced Studies, Research and Development in Sardinia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DIEE - University of Cagliari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>DISI - University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we describe our approach and its results for the MediaEval 2015 Retrieving Diverse Social Images task. The main strength of the proposed approach is its flexibility, which makes it possible to filter out irrelevant images and to obtain a reliable set of diverse and relevant images. This is done by first clustering similar images according to their textual descriptions and their visual content, and then extracting images from different clusters according to a measure of user credibility. Experimental results show that the approach is stable, with little fluctuation across both single-concept and multi-concept queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In the MediaEval 2015 Retrieving Diverse Social Images
task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], participants are provided with sets of images
retrieved from Flickr, where each set is related to a location.
However, these sets are typically noisy and redundant; the goal
of the task is therefore to refine the initial results by choosing
a subset of images that are relevant to the queried location
under different views, times, and other conditions.
      </p>
      <p>
        We propose here an improved method based on our
previous approaches in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The basic idea is to filter
out non-relevant images at the beginning of the process
according to the task rules, and then to exploit textual and
visual features, together with user credibility information,
in a multi-modal retrieval framework to obtain a diversified
summary of the queried images.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHODOLOGY</title>
      <p>
        The proposed method comprises three steps (see Fig. 1):
Filtering: The goal of this step is to filter out outliers by
removing images that are considered non-relevant. We
consider an image non-relevant according to the following
rules: (i) it contains people as the main subject; (ii) it was
shot far away from the queried location; (iii) it received very
few views on Flickr; or (iv) it is out-of-focus or blurred.
Condition (i) is detected from the proportion of the human
face size with respect to the size of the image; our method
uses the Luxand FaceSDK (luxand.com) as a face detector.
Conditions (ii) and (iii) are computed by exploiting the
provided user credibility information. To detect blurred
images (rule iv), we estimate the focus by computing the
sum of wavelet coefficients and decide whether the image is
out-of-focus following the method in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. After this step, all
remaining images are considered relevant and are passed to
the next step.
      </p>
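      <p>As an illustration, the filtering rules above can be sketched as follows. The focus score uses a one-level Haar transform as a simple stand-in for the wavelet measure of [3], and the thresholds shown are the ones reported for Run 5; all function and parameter names here are ours, for illustration only.</p>
      <preformat>
```python
import numpy as np

def haar_detail_energy(img):
    """One-level 2-D Haar transform; return the mean absolute energy of the
    three detail sub-bands (HL, LH, HH), used as a focus score. Sharp images
    have strong high-frequency content; blurred images do not."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]                       # crop to even dimensions
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    hl = (a - b + c - d) / 4.0              # horizontal details
    lh = (a + b - c - d) / 4.0              # vertical details
    hh = (a - b - c + d) / 4.0              # diagonal details
    return float(np.mean(np.abs(hl)) + np.mean(np.abs(lh)) + np.mean(np.abs(hh)))

def is_relevant(face_ratio, distance_km, views, focus_score,
                max_face=0.10, max_dist=15.0, min_views=25, min_focus=20.0):
    """Apply the four outlier rules; default thresholds are those of Run 5."""
    return (face_ratio <= max_face and distance_km <= max_dist
            and views >= min_views and focus_score >= min_focus)
```
</preformat>
      <p>An image failing any single rule is discarded, so the rules act as a conjunction of filters applied before clustering.</p>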
      <p>
        Clustering: We propose to cluster similar images by
constructing a particular clustering feature tree (CF tree)
built on a combination of textual and visual
information. To this end, we exploit the characteristic of
the BIRCH algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] of performing clustering in two main
phases, namely the Global Clustering phase and the
Refining phase. While these two phases are normally intended
to produce a high-quality clustering using the same set of
features, we instead used textual features for the first phase and
refined the clusters using visual features. We
computed a different set of textual features by analysing
the provided textual data in order to reduce
the noise introduced by irrelevant words. After this step, all images
that are visually similar and share the same context (i.e., the
textual information) are grouped into the same branch of
the tree.
      </p>
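      <p>The two-phase idea can be illustrated with a deliberately simplified sketch: a sequential leader-style clustering stands in for the CF-tree insertion of BIRCH, run first on textual features and then, within each coarse cluster, on visual features. The functions and thresholds below are illustrative assumptions, not the actual implementation.</p>
      <preformat>
```python
import numpy as np

def leader_cluster(feats, threshold):
    """Sequential leader clustering: assign each item to the nearest existing
    centroid if within `threshold`, otherwise open a new cluster. This loosely
    mimics how BIRCH inserts points into CF-tree leaves."""
    centroids, members = [], []
    for i, x in enumerate(feats):
        if centroids:
            d = [np.linalg.norm(x - c) for c in centroids]
            j = int(np.argmin(d))
            if d[j] <= threshold:
                members[j].append(i)
                # incremental centroid update, as a CF entry would do
                centroids[j] = centroids[j] + (x - centroids[j]) / len(members[j])
                continue
        centroids.append(np.array(x, dtype=float))
        members.append([i])
    return members

def two_phase_cluster(text_feats, visual_feats, t_text, t_visual):
    """Phase 1: coarse clusters from textual features.
    Phase 2: refine each coarse cluster using visual features."""
    refined = []
    for group in leader_cluster(text_feats, t_text):
        sub = leader_cluster(visual_feats[group], t_visual)
        refined.extend([[group[i] for i in s] for s in sub])
    return refined
```
</preformat>
      <p>Images that are textually similar but visually dissimilar end up in different refined clusters, which is the behaviour the two-phase design aims for.</p>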
      <p>Summarization: Starting from the CF tree, the
clusters are obtained by applying the agglomerative
hierarchical clustering algorithm on the CF leaves.
To choose the best images for summarizing the
landmark, the clusters are first sorted by the number
of images they contain, i.e., clusters containing more images are ranked
higher. Then, we extract images from each cluster until the
maximum number of required images is reached (e.g., 20
images). In each cluster, the image uploaded by the user with
the highest visual score is selected as the first image; if there
is more than one image from that user, the image closest
to the centroid is selected. If more than one image has to
be extracted from a cluster to reach the exact number of
images required to build the visual summary, we select as
second image the one with the largest distance from
the first image, as third image the one with the largest
distance to both of the first two images, and so on.</p>
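      <p>The greedy intra-cluster selection described above, where each new image maximizes its distance to the images already selected, can be sketched as follows; the function name is illustrative and the first image is assumed to be given.</p>
      <preformat>
```python
import numpy as np

def diversify(feats, k, first=0):
    """Greedy max-min selection: starting from image `first`, repeatedly add
    the image whose smallest distance to the already-selected set is largest."""
    feats = np.asarray(feats, dtype=float)
    selected = [first]
    while len(selected) < min(k, len(feats)):
        # distance of every image to its nearest already-selected image
        d = np.min([np.linalg.norm(feats - feats[s], axis=1)
                    for s in selected], axis=0)
        d[selected] = -1.0  # never re-pick a selected image
        selected.append(int(np.argmax(d)))
    return selected
```
</preformat>
      <p>This is the classical farthest-point heuristic: each step costs one distance evaluation per image, so selecting k images from n candidates is O(nk).</p>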
    </sec>
    <sec id="sec-3">
      <title>3. RUN DESCRIPTION</title>
      <p>We ran our model on the development set (devset,
containing 153 location queries with 45,375 Flickr photos).
According to these results, we chose the best features and
tuned parameters for each run and applied them to the test set
(containing 69 single-concept queries and 70 multi-concept
queries with 41,394 Flickr images) as follows:</p>
      <p>Run 1: Color naming (CNM), color descriptor (GCD),
histogram of oriented gradients (HOG), and local binary
pattern (GLBP) features are used. In the Summarization step,
since user credibility information is not available in this run,
the centroid of each cluster is selected as the first image.</p>
      <p>Run 2: In this run, we refined the text features by
normalizing the terms and removing stop-words, HTML tags, and
special characters from the given TF-IDF representation. Cosine
similarity was used as the distance metric. The parameters are
chosen as in Run 1.</p>
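      <p>The text refinement and distance computation of this run can be sketched as below; the stop-word list, the tag-stripping regular expression, and the dictionary-based TF-IDF representation are illustrative assumptions, not the exact pipeline.</p>
      <preformat>
```python
import math
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "at"}  # tiny illustrative list

def normalize(text):
    """Strip HTML tags and special characters, lowercase, drop stop-words."""
    text = re.sub(r"<[^>]+>", " ", text)             # remove html tags
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # keep alphanumerics only
    return [t for t in tokens if t not in STOP_WORDS]

def cosine_distance(u, v):
    """Cosine distance between two sparse TF-IDF vectors (term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 1.0
    return 1.0 - dot / (nu * nv)
```
</preformat>
      <p>Two images with no shared terms are at the maximum distance of 1.0, while identical term distributions are at distance 0.0 regardless of vector length.</p>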
      <p>Run 3: The proposed method is applied to the combined
features from Runs 1 and 2: TF-IDF is used first,
and then the visual features with Euclidean distance are
applied.</p>
      <p>Run 4: In this run, we clustered the images by user. The
clusters are ranked first by visual score (i.e.,
the cluster belonging to the user with the highest visual score is
selected first), then by face proportion, and so on through
all the user credibility information. Within each cluster, images
are selected by number of views, i.e., the image
with the highest number of views is selected as the first image.</p>
      <p>
        Run 5: In the first four runs, we applied the same method
to both single-concept and multi-concept queries. In this run,
instead, we used two different methods for the two
cases. In the Filtering step for single-concept queries,
outliers are detected as follows: rule (i), the face size is
larger than 10% of the size of the image; (ii), the
image was shot farther than 15 km from the location; (iii), the
image has fewer than 25 views; and (iv), the image has an f-score
(focus measure) smaller than 20. For multi-concept queries,
only rules (iii) and (iv) were applied, since many
queries require images belonging to multiple locations. We also
removed images whose title and description do not contain
any word from the query. In the Clustering step, a
clustering similar to Run 3 is applied for both types of query,
with the extra visual features Dense SIFT and HOG2x2,
extracted as in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Text features were refined as
described in Run 2. Finally, in the Summarization step, the
same method as described in Section 2 was applied.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. RESULTS AND CONCLUSION</title>
      <p>With the selected features and parameters described above, we
obtained the highest F1@20, the official metric of the task,
with Run 5 on both the development and test sets, with values
of 0.61 and 0.55, respectively. These results confirm that
removing outliers and combining textual, visual, and user
credibility information as in Run 5 significantly improves the
performance with respect to the other runs (see in Table 1
and Table 2 the results on the test set and development set,
respectively).</p>
      <p>According to the results on the test set, we can state
that the performance is stable, with little fluctuation
across both single-concept (F1@20 = 0.529) and multi-concept
(F1@20 = 0.567) queries.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. G. B. De Natale. Retrieval of diverse images by pre-filtering and hierarchical clustering. In <source>MediaEval</source>, <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] D.-T. Dang-Nguyen, L. Piras, G. Giacinto, G. Boato, and F. G. B. De Natale. A Hybrid Approach for Retrieving Diverse Social Images of Landmarks. In <source>IEEE International Conference on Multimedia and Expo</source>, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] J.-T. Huang, C.-H. Shen, S.-M. Phoong, and H. Chen. Robust measure of image focus in the wavelet domain. In <source>Intelligent Signal Processing and Communication Systems</source>, pages <fpage>157</fpage>&#8211;<lpage>160</lpage>, Dec. <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] B. Ionescu, A. L. G&#238;nsc&#259;, B. Boteanu, A. Popescu, M. Lupu, and H. M&#252;ller. Retrieving Diverse Social Images at MediaEval 2015: Challenge, Dataset and Evaluation. In <source>Working Notes Proceedings of the MediaEval 2015 Workshop</source>, <year>September 2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In <source>IEEE Conference on Computer Vision and Pattern Recognition</source>, pages <fpage>3485</fpage>&#8211;<lpage>3492</lpage>. IEEE, <year>2010</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In <source>Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data</source>, pages <fpage>103</fpage>&#8211;<lpage>114</lpage>, <year>1996</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>