<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maia Zaharieva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukas Diem</string-name>
          <email>l.diem@univie.ac.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Interactive Media Systems Group, Vienna University of Technology</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Multimedia Information Systems Group, University of Vienna</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>In this paper, we describe our approach for the MediaEval 2015 Retrieving Diverse Social Images Task. The proposed approach exploits available user-generated textual descriptions and the visual content of the images, in combination with common, unsupervised clustering techniques, in order to increase the diversification of retrieval results. Preliminary experiments indicate that the approach generalizes well to different datasets and achieves comparable results for single- and multi-topic queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
Manual assessment of the relevance of publicly available images to a particular query is not feasible due to the immense amount of data captured and shared daily on social media platforms. As a result, the automated optimization of image retrieval results is constantly gaining importance. Next to relevance, the diversification of retrieval results plays a crucial role in reducing the redundancy in the retrieved images and, thus, in increasing the efficiency of overviewing the underlying data. The MediaEval 2015 Retrieving Diverse Social Images Task [<xref ref-type="bibr" rid="ref4">4</xref>] addresses these challenges in the form of a tourist-oriented retrieval task, where the topics of interest represent sightseeing spots around the world. The aim of the task is to refine the set of images retrieved from Flickr while taking into account both the relevance and the diversity of the selected images.
      </p>
      <p>
Previous work in this context shows a broad range of possible approaches. The original Flickr ranking is commonly improved by a direct comparison with the corresponding Wikipedia images [<xref ref-type="bibr" rid="ref5">5</xref>][<xref ref-type="bibr" rid="ref8">8</xref>]. Other methods employ training with support vector machines (SVMs) [<xref ref-type="bibr" rid="ref6">6</xref>] or regression models [<xref ref-type="bibr" rid="ref3">3</xref>]. The diversification of retrieval results is usually approached by means of conventional clustering algorithms, such as k-means [<xref ref-type="bibr" rid="ref3">3</xref>][<xref ref-type="bibr" rid="ref6">6</xref>], hierarchical clustering [<xref ref-type="bibr" rid="ref1">1</xref>][<xref ref-type="bibr" rid="ref2">2</xref>], and random forests [<xref ref-type="bibr" rid="ref8">8</xref>], or by an ensemble of clustering approaches [<xref ref-type="bibr" rid="ref5">5</xref>].
      </p>
      <p>In this paper, we address relevance reranking by means of a similarity score to a reference set of images. This reference set is given by Wikipedia images (if available) or by the top-ranked images provided by Flickr. To increase diversification, we employ a hierarchical clustering algorithm and compare the performance of recently introduced, powerful visual features with text-based approaches, which are well established in the context of web mining and retrieval.</p>
    </sec>
    <sec id="sec-2">
      <title>2. APPROACH</title>
      <p>We employ a multi-stage workflow for the retrieval of diverse social images, which comprises the following steps: 1) data preprocessing, 2) relevance reranking, and 3) image clustering and final image selection.</p>
      <p>
In the first step, data preprocessing, we filter potentially irrelevant images, i.e., images with humans as the main subjects and images that are captured far away from the topic of interest. We employ the OpenCV face detector and remove images in which the detected faces exceed 5% of the total image area. Additionally, if GPS data is available, we measure the distance between the topic of interest and the corresponding images and remove those with a haversine distance [<xref ref-type="bibr" rid="ref7">7</xref>] greater than 100 km. The reason for this generous threshold is the underlying tourist application scenario, where the precision of a location's specification varies strongly, from a particular spot (e.g., the Tower Bridge in London) to large-scale locations such as national parks or entire cities.
      </p>
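      <p>To make these filters concrete, the following sketch shows one possible implementation, assuming OpenCV's pretrained frontal-face cascade and simple (latitude, longitude) tuples for the GPS metadata; the 5% face-area and 100 km thresholds are the values stated above, and the haversine formula follows [<xref ref-type="bibr" rid="ref7">7</xref>].</p>
      <preformat><![CDATA[
import math
import cv2  # OpenCV, used here only for its pretrained face detector

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two WGS84 coordinates, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Assumption: any OpenCV frontal-face cascade serves as the detector.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def keep_image(img_bgr, img_gps, topic_gps):
    """Reject images dominated by faces or captured too far from the topic."""
    h, w = img_bgr.shape[:2]
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray)
    if sum(fw * fh for (_, _, fw, fh) in faces) > 0.05 * w * h:
        return False  # detected faces cover more than 5% of the image area
    if img_gps is not None:  # GPS filter applies only when metadata exists
        if haversine_km(*img_gps, *topic_gps) > 100.0:
            return False  # farther than 100 km from the topic of interest
    return True
]]></preformat>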
      <p>
The aim of the second stage, relevance reranking, is to improve the original Flickr ranking. Since the provided Wikipedia images are by definition representative [<xref ref-type="bibr" rid="ref4">4</xref>], we measure the visual similarity between the images of a set and the associated Wikipedia images by means of the Euclidean distance between the corresponding adapted convolutional neural network (CNN) descriptors. If no Wikipedia images are provided for a given query, we consider the top 10 images from the original Flickr ranking as reference images. Subsequently, all images are reranked according to the achieved similarity score.
      </p>
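      <p>A minimal sketch of this reranking step, assuming the adapted CNN descriptors are available as row vectors of a NumPy matrix; since the paper does not state how distances to multiple reference images are aggregated, the minimum distance used below is our assumption.</p>
      <preformat><![CDATA[
import numpy as np

def rerank_by_reference(descriptors, reference_descriptors):
    """Rerank candidate images by visual similarity to a reference set.

    descriptors:           (n, d) CNN features of the candidate images.
    reference_descriptors: (m, d) features of the Wikipedia images or,
                           if none exist, of the top-10 Flickr results.
    Returns candidate indices sorted from most to least relevant.
    """
    # Euclidean distance from every candidate to every reference image.
    diffs = descriptors[:, None, :] - reference_descriptors[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)  # shape (n, m)
    # Score each candidate by its distance to the closest reference image
    # (aggregation by minimum is an assumption, see the text above).
    scores = dists.min(axis=1)
    return np.argsort(scores)  # ascending distance = descending relevance
]]></preformat>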
      <p>
In the third step, image clustering, we aim at finding groups of similar images that can be used to diversify the final image results. For the visual-based runs, preliminary experiments with the provided visual descriptors [<xref ref-type="bibr" rid="ref4">4</xref>] and different clustering algorithms (k-means, k-medoids, X-means, and agglomerative hierarchical clustering (AHC)) showed that the best-performing method for the development data combines the CNN visual feature with AHC. The final selection of images from the clusters follows a round-robin approach. We start by selecting the image with the best relevance score from each cluster. These images, sorted in ascending order of their relevance scores, constitute the m highest-ranked results, where m is the number of detected clusters. The selected images are removed from their corresponding clusters, and the selection process is repeated until the required number of retrieved results is reached. We employ Ward's aggregation method and limit the number of final clusters to 50, based on preliminary experiments.
      </p>
      <p>[Table 1: Preliminary runs on the development data, configured by modality (V, T, V,T), preprocessing (Flickr baseline, GPS filter, Face filter, Face+GPS filter), and feature (CNN, TF-IDF, LDA).]</p>
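      <p>The round-robin selection described above can be sketched as follows; the cluster labels and the relevance-sorted image ids are assumed to come from the clustering and reranking stages, and the helper is our illustration rather than the exact implementation.</p>
      <preformat><![CDATA[
def round_robin_select(ranked_ids, labels, k):
    """Select k images by interleaving clusters.

    ranked_ids: image ids sorted by relevance score (best first).
    labels:     dict mapping image id -> cluster label.
    """
    # Group images by cluster; within each cluster the relevance
    # order of ranked_ids is preserved.
    clusters = {}
    for img in ranked_ids:
        clusters.setdefault(labels[img], []).append(img)
    queues = list(clusters.values())
    selected = []
    while len(selected) < k and queues:
        # One pass picks the best remaining image of every cluster,
        # so the first m selections are the m cluster representatives.
        for q in list(queues):
            selected.append(q.pop(0))
            if not q:
                queues.remove(q)  # cluster exhausted
            if len(selected) == k:
                return selected
    return selected
]]></preformat>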
      <p>For the text-based runs, we consider two approaches. First, we perform topic modeling on the textual descriptions of each image (title and tags) using Latent Dirichlet Allocation (LDA) and the MALLET toolbox (http://mallet.cs.umass.edu) and extract T topics for the employed dataset. For each image, we estimate the likelihoods l1 and l2 of the first- and second-best matching topics. If the two likelihoods differ sufficiently, i.e., l2/l1 &lt; θ, the most likely topic is assigned to the photo; otherwise, no topic is assigned. We set T = 50 and θ = 0.8 for all experiments.</p>
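      <p>A sketch of this topic-assignment rule, with scikit-learn's LDA standing in for the MALLET toolbox so that the example is self-contained; T = 50 and θ = 0.8 as above, and the input documents are the concatenated titles and tags.</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def assign_topics(docs, n_topics=50, theta=0.8):
    """Assign each document its most likely topic only when the
    second-best topic is clearly less likely (l2 / l1 < theta)."""
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(counts)  # per-document topic distribution
    assignments = []
    for dist in doc_topic:
        second, best = np.argsort(dist)[-2:]  # two best-matching topics
        l1, l2 = dist[best], dist[second]
        assignments.append(best if l2 / l1 < theta else None)
    return assignments
]]></preformat>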
      <p>The second text-based approach considers the well-established term frequency-inverse document frequency (TF-IDF). We compute the TF-IDF vector for each image using the complete textual description (title, tags, and description). The textual descriptions are first preprocessed to increase their expressiveness, i.e., we remove potential occurrences of the corresponding user name, web links, and stopwords, and we additionally stem all remaining terms. Furthermore, we account for images with missing textual descriptions. In such a case, we search for the temporally closest image with a description that is either captured within a predefined radius (10 meters in our experiments) or by the same user within a predefined short time span (e.g., 5 minutes). Subsequently, we cluster the resulting TF-IDF vectors, again using the AHC method, where the similarity between the TF-IDF vectors is measured by the cosine similarity. The selection of the final image set follows the round-robin approach described for the visual-based runs.</p>
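      <p>A condensed sketch of this text-based pipeline, assuming the descriptions are already cleaned as described; since Ward's aggregation requires Euclidean distances, the cosine-based variant below uses average linkage, which is a substitution on our part.</p>
      <preformat><![CDATA[
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_descriptions(descriptions, n_clusters=50):
    """Cluster images by the TF-IDF vectors of their textual descriptions."""
    vectors = TfidfVectorizer().fit_transform(descriptions).toarray()
    # AHC over cosine similarity between the TF-IDF vectors; average
    # linkage stands in for Ward's method, which needs Euclidean distances.
    ahc = AgglomerativeClustering(
        n_clusters=n_clusters, metric="cosine", linkage="average")
    return ahc.fit_predict(vectors)
]]></preformat>
      <p>The resulting cluster labels feed directly into the round-robin selection sketched above.</p>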
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTAL RESULTS</title>
      <p>Table 1 presents a selection of our preliminary experiments on the development dataset. The results show that the preprocessing step (face and GPS filters) only marginally improves the performance for the top 20 retrieved images in comparison to the Flickr baseline results. Nevertheless, 95% of the rejected images are irrelevant with respect to the underlying search query. Experiments with the text-based runs show only minor differences in the performance of the TF-IDF- and LDA-based methods. While the achieved precision (P@20) is comparable to that of the Flickr baseline, the cluster recall (CR@20) improves notably, e.g., from 0.34 to 0.46 using the TF-IDF approach.</p>
      <p>For the visual-based runs, the consideration of the relevance reranking step using the CNN features demonstrates a significant increase in relevance (P@20 score of 0.94). However, the drop in the cluster recall indicates an increase of redundancy in the retrieved images as a side effect. Overall, the best-performing text-based and visual-based runs are comparable in terms of F1@20, with the computational costs for the text-based runs being significantly lower. The multimodal runs additionally improve both the cluster recall and the F1-scores slightly, by approximately 1%. Surprisingly, the consideration of the reranking step in combination with the text-based image clustering and selection cannot compensate for the drop in the cluster recall.</p>
      <p>Following our preliminary experiments, we submitted four runs corresponding to the best configuration for the respective modality (see Table 2). Table 3 summarizes the results of the official runs on the test dataset. In contrast to the development data, which contains the retrieval results of single-topic queries only, the test data differentiates between single-topic (e.g., Niagara Falls) and multi-topic queries (e.g., Academy Awards in Hollywood). Overall, there is no significant difference in the performance for the two subsets. While the (predominantly) visual-driven runs (runs 1 and 3) show a slight decrease in the cluster recall for the multi-topic queries, the text-driven runs (runs 2 and 4) indicate the opposite trend. Furthermore, in contrast to the results on the development data, the test runs show a notable difference between the performance of the text-based and the visual-based runs. This reveals the better generalization ability of the visual-based runs to different datasets. Overall, the best performance in terms of an F1-score of 0.55 is achieved by the visual-based run that additionally considers the face and GPS filters to reject irrelevant images (run 3).</p>
    </sec>
    <sec id="sec-4">
      <title>4. CONCLUSION</title>
      <p>In this paper, we investigated both text- and visual-driven approaches for the diversification of Flickr image retrieval results. The achieved performance indicates that the visual-based approach copes well with different data and varying query types. Overall, the relevance reranking shows promising results in terms of precision. However, the diversification, as measured by the cluster recall, improves only slightly. Our future work will exploit the potential of combining features of different modalities in the clustering process, e.g., by means of a late fusion approach.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This work has been partly funded by the Vienna Science and
Technology Fund (WWTF) through project ICT12-010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Serrano</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <surname>J. Cigarran.</surname>
          </string-name>
          <article-title>UNED @ retrieving diverse social images task</article-title>
          .
          <source>In MediaEval Benchmark Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D</given-names>
            <surname>.-T.</surname>
          </string-name>
          Dang-Nguyen,
          <string-name>
            <given-names>L.</given-names>
            <surname>Piras</surname>
          </string-name>
          , G. Giacinto, G. Boato, and
          <string-name>
            <given-names>F. D.</given-names>
            <surname>Natale</surname>
          </string-name>
          .
          <article-title>Retrieval of diverse images by pre- ltering and hierarchical clustering</article-title>
          .
          <source>In MediaEval Benchmark Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ginsca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          , and
          <string-name>
            <surname>N. Rekabsaz. CEA</surname>
          </string-name>
          <article-title>LIST's participation at the MediaEval 2014 retrieving diverse social images task</article-title>
          .
          <source>In MediaEval Benchmark Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. L. G</surname>
          </string-name>
          ^nsca^,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Mu</surname>
          </string-name>
          <article-title>ller. Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation</article-title>
          .
          <source>In MediaEval Benchmark Workshop</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. R. M.</given-names>
            <surname>Palotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rekabsaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          .
          <article-title>TUW @ retrieving diverse social images task 2014</article-title>
          . In MediaEval Benchmark Workshop,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Sarac</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Duygulu</surname>
          </string-name>
          .
          <article-title>Bilkent-RETINA at retrieving diverse social images task of MediaEval 2014</article-title>
          . In MediaEval Benchmark Workshop,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Sinnott</surname>
          </string-name>
          .
          <article-title>Virtues of the haversine</article-title>
          .
          <source>Sky and Telescope</source>
          ,
          <volume>68</volume>
          (
          <issue>2</issue>
          ):
          <fpage>159</fpage>
          ,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Spampinato</surname>
          </string-name>
          and
          <string-name>
            <surname>S. Palazzo.</surname>
          </string-name>
          <article-title>PeRCeiVe@UNICT at MediaEval 2014 diverse images: Random forests for diversity-based clustering</article-title>
          .
          <source>In MediaEval Benchmark Workshop</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>