<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Unified Framework for Retrieving Diverse Social Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maia Zaharieva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Schwab</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Interactive Media Systems, Institute of Software Technology and Interactive Systems, Vienna University of Technology</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Multimedia Information Systems, Faculty of Computer Science, University of Vienna</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
<p>In this paper we explore the performance of a generic, unified framework for the retrieval of relevant and diverse images from social photo collections. The approach allows for the easy evaluation of different visual and textual image descriptions, clustering algorithms, and similarity metrics. Preliminary results show a strong dependence between the choice of underlying technology and similarity metric and the achieved performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The immense daily growth of publicly shared media bears
both potential and challenges for automated media analysis
and understanding. Currently, image access and retrieval are
usually based on user-provided and camera-generated
metadata. Although increasingly available, such metadata
often suffers from limitations such as imprecise capture time and
GPS information or misleading and ambiguous textual
descriptions. Additionally, the large number of shared items
often results in many visually highly similar images.
This challenge is the focus of the MediaEval 2014
Retrieving Diverse Social Images Task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The aim of the task is
the refinement of location images retrieved from Flickr while
taking into consideration both their relevance and diversity.
      </p>
      <p>
        Previous work in the context of this task shows a broad
field of possible approaches ranging from re-ranking and
clustering to greedy optimization and graph representations
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Several authors propose different systems for different
feature types (e.g., [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]), which impedes reasoning about
the selection of an approach or of particular features.
Furthermore, some methods build upon assumptions that hold
true in a limited setting only (e.g., that the relevance of an image is
related to the number of views or the length of the
description [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). While most of the presented approaches employ a
combination of a re-ranking (for relevance improvement) and
a clustering (for ensuring diversification) method, we build
a unified framework that allows for a thorough evaluation
of various textual and visual features, clustering algorithms,
and similarity metrics.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH</title>
      <p>We employ a multi-stage approach for the retrieval of
diverse social images. The workflow passes through three main stages:
1) relevance ranking of the input images, 2) image clustering for
diversification, and 3) final image selection. The initial set
of input images may optionally be pre-processed in order to
filter out potentially irrelevant images, such as images with a
human as the main subject.</p>
      <p>
        In the first stage, relevance ranking, each image of the
input set is first represented by a feature vector ~v, where ~v
is the concatenation of the standardized z-scores of the feature
descriptors d1, ..., dn:
~v = (z(d1), ..., z(dn))   (1)
Since the provided Wikipedia photos are by definition
representative [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we additionally compute a representative feature
vector ~vr for each referenced Wikipedia image. The
relevance score s of an image is then defined as the smallest
distance between its feature vector ~v and all ~vr from the set
of representative feature vectors W:
s = min_{~vr in W} distance(~v, ~vr)   (2)
      </p>
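The two formulas above can be sketched in a few lines (a minimal sketch using NumPy; the array shapes and the choice of Euclidean distance are illustrative assumptions, since the framework deliberately leaves the metric open):

```python
import numpy as np

def zscore_concat(descriptors):
    """Build ~v: concatenate the standardized z-scores of the
    feature descriptors d1, ..., dn (Eq. 1).

    descriptors: list of (n_images, dim_i) arrays, one per descriptor.
    """
    parts = []
    for d in descriptors:
        mu, sigma = d.mean(axis=0), d.std(axis=0)
        sigma[sigma == 0] = 1.0  # guard against constant features
        parts.append((d - mu) / sigma)
    return np.hstack(parts)

def relevance_scores(v, v_repr):
    """Relevance score s: smallest distance between an image vector
    and all representative Wikipedia vectors ~vr in W (Eq. 2)."""
    # pairwise Euclidean distances, shape (n_images, n_representatives)
    dists = np.linalg.norm(v[:, None, :] - v_repr[None, :, :], axis=2)
    return dists.min(axis=1)
```

A smaller score means the image sits closer to some representative photo, i.e. it is considered more relevant.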
      <p>The aim of the second stage, image clustering, is to find
groups of similar images that can be used to diversify the
final retrieval results. Note that the distance measures and
image features at this step are not necessarily the same ones
employed for relevance ranking.</p>
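As one concrete instance of this stage, agglomerative hierarchical clustering can be sketched with SciPy (a sketch only; average linkage and the distance cut-off are assumptions for illustration):

```python
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_images(features, cut_distance):
    """Group images into clusters of similar items.

    features: (n_images, dim) array-like of image feature vectors;
    cut_distance: distance at which the dendrogram is cut.
    Returns one integer cluster label per image.
    """
    z = linkage(features, method="average", metric="euclidean")
    return fcluster(z, t=cut_distance, criterion="distance")
```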
      <p>The third and last stage of the approach, final image
selection, combines the results of the previous steps to retrieve
images that are both relevant and diverse with respect to the
initial image set. For this stage we use a round-robin
algorithm. We start by selecting the image with the best
relevance score from each cluster. These images, sorted in
ascending order of their relevance score, constitute the m highest ranked results,
where m is the number of detected clusters. The selected
images are removed from their corresponding clusters and
the selection process is repeated until the required number
of retrieved results is reached.</p>
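The round-robin selection described above is straightforward to state in code (a sketch; the image ids and the score-dictionary layout are illustrative assumptions):

```python
def round_robin_select(clusters, relevance, k):
    """Pick k images that are both relevant and diverse: repeatedly take
    the best remaining image (smallest relevance score) from each
    cluster in turn until k images are selected.

    clusters: list of lists of image ids; relevance: id -> score, where
    lower is better (e.g. distance to a representative Wikipedia image).
    """
    # each cluster sorted by ascending relevance score, best image first
    pools = [sorted(c, key=lambda i: relevance[i]) for c in clusters]
    selected = []
    while len(selected) < k and any(pools):
        for pool in pools:
            if pool and len(selected) < k:
                selected.append(pool.pop(0))
    return selected
```

The first pass over the pools yields exactly the m highest ranked results (one image per cluster) described above.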
      <p>
        In general, the clustering algorithm, the metric used to
compare the feature vectors, and the underlying image
features (for both image ranking and image clustering) are
up to choice. In our experiments we tested different
clustering algorithms: k-means, Adaptive Hierarchical
Clustering (AHC), MeanShift, and Lingo; several comparison
metrics: Euclidean, city-block, χ², cosine, correlation,
Mahalanobis, Spearman, Hamming, and Jaccard; and all visual
and textual features provided by the organizers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: term
frequency-inverse document frequency (TF-IDF), Color
Naming Histogram (CN), Histogram of Oriented Gradients
(HOG), Color Moments (CM), Local Binary Patterns (LBP),
Statistics of Gray Level Run Length Matrix (GLRLM), and
the corresponding spatial pyramid representations (3x3), in
addition to Bag-of-Visual-Words (BoVW) of dense SIFT
descriptors.
      </p>
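Because features and metrics are interchangeable in the framework, the comparison step reduces to a single parameterized call (a sketch; SciPy's metric names cover most of the list above):

```python
import numpy as np
from scipy.spatial.distance import cdist

def pairwise_distances(features, metric="euclidean"):
    """All-pairs distance matrix between image feature vectors under a
    chosen metric, e.g. 'cityblock', 'cosine', 'correlation',
    'mahalanobis', 'hamming', or 'jaccard'."""
    features = np.asarray(features, dtype=float)
    return cdist(features, features, metric=metric)
```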
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTS AND RESULTS</title>
      <p>In our first experiments we compared the performance of
the different clustering algorithms. Results on the
development dataset showed that AHC significantly outperforms
k-means, MeanShift, and Lingo for all explored features
(significance t-test, p &lt; 0.001). Thus, we employed AHC in all
follow-up experiments.</p>
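The pairwise comparison of clustering algorithms can be reproduced along these lines (a sketch with hypothetical per-query scores; the paper does not state the exact test variant, so the paired two-sided form is an assumption):

```python
from scipy.stats import ttest_rel

def significantly_better(scores_a, scores_b, alpha=0.001):
    """Paired t-test over per-query scores of two methods: True when
    method A's mean score is higher and the two-sided p-value falls
    below alpha."""
    t, p = ttest_rel(scores_a, scores_b)
    return bool(t > 0 and p < alpha)
```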
      <p>We conducted a thorough evaluation of the performance
of the employed features at the two main stages of our
approach: relevance ranking and image clustering. Table 1
summarizes the results by means of ranked feature lists. The
reported feature rankings and the selection of the
corresponding best performing distance measures are the product of
significance t-tests with overall p &lt; 0.003. While the Color
Naming Histograms (CN and CN3x3) are usually
outperformed by any other feature, the Color Moments (CM and
CM3x3) show robust performance in both the ranking and
the clustering tasks. In contrast to the ranking, which is
clearly dominated by the performance of CM3x3, TF-IDF,
and SIFT, image clustering using AHC is more robust, and
the difference in the performance of global and local features
decreases to a large extent.</p>
      <p>Eventually, we submitted four runs for the final
evaluation (see Table 2 for the configurations). Table 3 shows the
results of the submitted runs for both the development and test
datasets. The best performance is achieved by the
combination of textual and visual information (run3). However, on
the test dataset, the differences between the
performances of the different runs vanish. Overall, cluster
recall (CR) remains relatively low due to the large number
of irrelevant images forming noisy clusters. In general, the
achieved results outline the limitations of the available
textual (and visual) information in assessing image relevance.
This is mainly due to the fact that user-provided textual
descriptions on social media sites often contain ambiguous or
irrelevant information. A possible approach to improving the
results may consider the occasionally available GPS data and
employ external resources as an additional source of
information.</p>
      <table-wrap id="tab3">
        <label>Table 3</label>
        <caption>
          <p>Results of the four submitted runs on the development and test datasets.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Run</th>
              <th>Dev CR@20</th>
              <th>Dev P@20</th>
              <th>Dev F1@20</th>
              <th>Test CR@20</th>
              <th>Test P@20</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>run1</td><td>0.4426</td><td>0.7600</td><td>0.5552</td><td>0.3901</td><td>0.6646</td></tr>
            <tr><td>run2</td><td>0.4132</td><td>0.7250</td><td>0.5188</td><td>0.3909</td><td>0.6809</td></tr>
            <tr><td>run3</td><td>0.4484</td><td>0.7567</td><td>0.5559</td><td>0.3982</td><td>0.6732</td></tr>
            <tr><td>run4</td><td>0.4369</td><td>0.7617</td><td>0.5499</td><td>0.3915</td><td>0.6752</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>4. CONCLUSION</title>
      <p>In this paper we presented a generic, unsupervised
framework for the evaluation of various visual and textual
features, similarity metrics, and clustering approaches for the
retrieval of diverse social images. The performed experiments
aim at evaluating the potential and limitations of
the provided visual and textual descriptions, and thus we
refrain from employing any assumptions or external sources
of information. Although there are significant differences in
the performance of single features, the top performing
features prove to be highly interchangeable. The achieved results
indicate that - for the given datasets - the crucial part of
the process is not so much the diversification but rather the
assessment of image relevance.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This work has been partly funded by the Vienna Science and
Technology Fund (WWTF) through project ICT12-010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>D.</given-names> <surname>Corney</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Martin</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Goker</surname></string-name>,
          <string-name><given-names>E. S.</given-names> <surname>Xioufis</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Papadopoulos</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Kompatsiaris</surname></string-name>,
          <string-name><given-names>L. M.</given-names> <surname>Aiello</surname></string-name>, and
          <string-name><given-names>B.</given-names> <surname>Thomee</surname></string-name>.
          <article-title>SocialSensor: Finding diverse images at MediaEval 2013</article-title>.
          <source>In MediaEval 2013 Workshop</source>,
          <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Popescu</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lupu</surname></string-name>,
          <string-name><given-names>A. L.</given-names> <surname>Gînscă</surname></string-name>, and
          <string-name><given-names>H.</given-names> <surname>Müller</surname></string-name>.
          <article-title>Retrieving diverse social images at MediaEval 2014: Challenge, dataset, and evaluation</article-title>.
          <source>In MediaEval 2014 Workshop</source>,
          <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>B.</given-names> <surname>Ionescu</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Popescu</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Müller</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Menendez</surname></string-name>, and
          <string-name><given-names>A.-L.</given-names> <surname>Radu</surname></string-name>.
          <article-title>Benchmarking Result Diversification in Social Image Retrieval</article-title>.
          <source>IEEE International Conference on Image Processing</source>,
          <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Samangooei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Preston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dupplaw</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Lewis</surname>
          </string-name>
          .
          <article-title>Experiments in diversifying Flickr result sets</article-title>
          .
          <source>In MediaEval 2013 Multimedia Benchmark Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vandersmissen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Godin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. D.</given-names>
            <surname>Neve</surname>
          </string-name>
          , and
          <string-name><given-names>R. V.</given-names> <surname>de Walle</surname></string-name>.
          <article-title>Ghent University-iMinds at MediaEval 2013 Diverse Images: Relevance-Based Hierarchical Clustering</article-title>
          .
          <source>In MediaEval 2013 Workshop</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>