<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Imcube @ MediaEval 2015 Retrieving Diverse Social Images Task: Multimodal Filtering and Re-ranking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Schmiedeke</string-name>
          <email>schmiedeke@imcube.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pascal Kelm</string-name>
          <email>kelm@imcube.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lutz Goldmann</string-name>
          <email>goldmann@imcube.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>imcube labs GmbH Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper summarizes the participation of Imcube in the Retrieving Diverse Social Images Task at MediaEval 2015. This task addresses the problem of result diversification in the context of social photo retrieval, where the results of a query should contain relevant but diverse items. To this end, we propose a multimodal approach for filtering and re-ranking in order to improve the relevancy and diversity of the returned list of ranked images.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The Retrieving Diverse Social Images Task of MediaEval 2015 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
requires participants to develop a system that automatically refines
a list of images returned by a Flickr query in such a way that the
most relevant and diverse images are returned in a ranked list of up
to 50 images.
      </p>
      <p>A photo is considered relevant if it is a common representation
of the overall query concept in good visual quality (sharpness,
contrast, colours) and without people as main subjects except for
queries dealing with people as part of the topic. The results are
considered diverse if they depict different visual aspects (time, location,
view, style, etc) of the target concept with a certain degree of
complementarity.</p>
      <p>The refinement and diversification process can be based on the
social metadata associated with the collected photos in the data set
and/or on the visual characteristics of the images. Furthermore,
the task provides information about user annotation credibility as
an automatic estimate of the quality of a particular user’s tags.</p>
    </sec>
    <sec id="sec-2">
      <title>SYSTEM DESCRIPTION</title>
      <p>In this section, we present our approach that combines textual,
visual and credibility information to filter and re-rank the initial
results. Our approach consists of two steps – relevancy improvement
and diversification – as depicted in Figure 1.</p>
      <p>The goal of the first step is to improve the relevancy of the ranked
image list by re-ranking the images based on more reliable textual
and visual criteria and filtering them in order to remove images
which are irrelevant for the given application scenario. The goal of
the second step is to improve the diversity of the ranked image list
through textual filtering and visual clustering and re-ranking. The
individual modules will be described in the following sections.</p>
    </sec>
    <sec id="sec-3">
      <title>Textual relevancy improvement</title>
      <p>This step exploits additional information extracted from the
corresponding Wikipedia article that is provided together with the query.
</p>
      <p>[Figure 1: system overview. Textual, visual and credibility relevancy scores are fused, followed by textual and visual diversification.]</p>
      <p>To improve the ranking of the images, the query is expanded with
the most frequent words from the Wikipedia article and the images
are re-ranked using a bag-of-words representation. The relevancy is
further improved by removing images that do not match the original
query and the location information from the Wikipedia article. The
location information is extracted by analysing the original query or
Wikipedia title (e.g., “Great Sphinx of Giza”), taking into account
typical prepositions for locations (e.g., “in”, “at”, “on”, “of”, “de”).
In the case that no location information can be extracted (e.g.
“Niagara Falls”), toponyms are not considered for relevancy filtering.</p>
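      <p>A minimal sketch of the toponym extraction described above; the preposition list comes from the text, while the right-to-left splitting heuristic is our illustrative assumption:</p>
      <preformat>
```python
# Hypothetical sketch: split a query title into topic and location
# at a typical location preposition.
PREPOSITIONS = {"in", "at", "on", "of", "de"}

def extract_location(query_title):
    """Return (topic, location), or (query_title, None) when no
    location preposition is found, e.g. for "Niagara Falls"."""
    words = query_title.lower().split()
    # Scan from the right so the trailing location phrase is captured.
    for i in range(len(words) - 1, 0, -1):
        if words[i] in PREPOSITIONS and i + 1 != len(words):
            return " ".join(words[:i]), " ".join(words[i + 1:])
    return query_title, None

print(extract_location("Great Sphinx of Giza"))  # ('great sphinx', 'giza')
print(extract_location("Niagara Falls"))         # ('Niagara Falls', None)
```
      </preformat>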
    </sec>
    <sec id="sec-4">
      <title>Visual relevancy improvement</title>
      <p>Visual information is also used to improve the relevancy by
re-ranking the images according to different criteria. For each visual feature
a ranked image list is derived based on the computed relevancy
scores.</p>
      <p>
        Since images with persons as main subjects are considered
irrelevant, we employ a face detector [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] trained for frontal and profile
faces to determine the size of facial regions. The inverse relative
size of the detected faces determines the relevancy. Hence, the
smaller the area covered by faces the more relevant is the image.
      </p>
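      <p>The inverse-relative-size scoring can be sketched as follows; the detector itself (a Viola-Jones cascade in our system) is left out and only the scoring of its output boxes is shown:</p>
      <preformat>
```python
def face_relevancy(face_boxes, image_width, image_height):
    """Relevancy from detected faces: the smaller the area covered by
    faces, the more relevant the image (1.0 if no face is detected).
    face_boxes: list of (x, y, w, h) tuples from a face detector."""
    image_area = float(image_width * image_height)
    face_area = sum(w * h for (_, _, w, h) in face_boxes)
    # Inverse relative size of the detected facial regions.
    return max(0.0, 1.0 - face_area / image_area)

print(face_relevancy([], 640, 480))                  # 1.0
print(face_relevancy([(0, 0, 320, 480)], 640, 480))  # 0.5
```
      </preformat>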
      <p>
        Additionally, photos taken from the target location but not
displaying it are considered irrelevant. We model that relevancy by
computing the visual similarity between the retrieved images and
the images which are available from the associated Wikipedia
article. We use histogram of oriented gradients (HOG) features and a
clusterless BoW approach [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] based on speeded up robust features (SURF) to generate histograms for each of the images. The
similarity between the retrieved images and the Wikipedia images
is computed through histogram intersection. The retrieved images
are re-ranked by considering the maximum score across the set of
Wikipedia images.
      </p>
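      <p>A sketch of the histogram-intersection relevancy described above, assuming L1-normalized histograms:</p>
      <preformat>
```python
def histogram_intersection(h1, h2):
    """Similarity of two normalized histograms: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def wikipedia_relevancy(image_hist, wiki_hists):
    """Relevancy of a retrieved image: the maximum intersection score
    across the set of Wikipedia images."""
    return max(histogram_intersection(image_hist, w) for w in wiki_hists)

h = [0.5, 0.3, 0.2]
wiki = [[0.2, 0.3, 0.5], [0.5, 0.25, 0.25]]
print(wikipedia_relevancy(h, wiki))  # ~0.95 (second Wikipedia image matches best)
```
      </preformat>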
      <p>
        We further incorporate aesthetic aspects to emphasize more
visually appealing images, since sharper and more salient images are
usually considered more relevant. The sharpness is calculated as
the ratio of magnitude of image gradients between different blurred
versions of the original image. The larger that ratio, the more
relevant the image. Saliency is measured using a spectral residual
approach [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Considering the different criteria described above we
obtain 5 ranked image lists (Face, HOG, BoW, Sharpness, Saliency)
which are fused using weighted rank fusion.
      </p>
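      <p>The weighted rank fusion over the five criterion lists can be sketched as follows; the reciprocal-rank scoring and the example weights are illustrative assumptions, not the exact scheme used in our system:</p>
      <preformat>
```python
def weighted_rank_fusion(ranked_lists, weights):
    """Fuse several ranked lists of image ids: each image accumulates
    weight / (rank + 1) from every list; a higher fused score ranks first."""
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, image_id in enumerate(ranking):
            scores[image_id] = scores.get(image_id, 0.0) + weight / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

face  = ["a", "b", "c"]   # ranking by inverse face area
sharp = ["b", "a", "c"]   # ranking by sharpness
print(weighted_rank_fusion([face, sharp], [1.0, 0.5]))  # ['a', 'b', 'c']
```
      </preformat>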
    </sec>
    <sec id="sec-5">
      <title>Credibility based relevance improvement</title>
      <p>
        This step is intended to be the baseline approach for improving
the relevance. It re-ranks the image list according to the
credibility of the owner of an image. The re-ranking is based on 3 scores
which describe the user credibility [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: the use of correct tags
(visualScore), specific tags (tagSpecificity) and their preference for
photographing faces (faceProportion). Following the application
scenario, the combined credibility score for an image is high if the user
has a high visualScore, a high tagSpecificity and a low
faceProportion.
      </p>
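      <p>One plausible combination of the three credibility descriptors is sketched below; the multiplicative form is our illustrative assumption, as the text above only fixes the direction of each score:</p>
      <preformat>
```python
def credibility_score(visual_score, tag_specificity, face_proportion):
    """Combined user credibility: a high visualScore and tagSpecificity
    together with a low faceProportion yield a high score (the
    multiplicative form is an illustrative assumption)."""
    return visual_score * tag_specificity * (1.0 - face_proportion)

print(credibility_score(0.8, 0.5, 0.1))  # ~0.36
```
      </preformat>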
    </sec>
    <sec id="sec-6">
      <title>Textual diversification</title>
      <p>The final image list should not only contain relevant images but
also diverse ones, i.e., depicting different aspects of the topic. With
the assumption that images which have an identical textual
description often depict very similar content, the images are clustered
based on their textual similarity. The ranked image list is then
obtained by ranking the clusters in descending order according to their
relevancy and iteratively selecting the most relevant image from
each cluster.
</p>
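      <p>The textual diversification step can be sketched as follows, assuming a per-image relevancy score is already available and that images with identical descriptions form one cluster:</p>
      <preformat>
```python
def diversify_by_text(images, relevancy):
    """Cluster images with identical textual descriptions, order the
    clusters by their most relevant image, then iteratively select the
    most relevant remaining image from each cluster in turn.
    images: list of (image_id, description) pairs."""
    clusters = {}
    for image_id, description in images:
        clusters.setdefault(description, []).append(image_id)
    # Within each cluster, most relevant image first.
    for ids in clusters.values():
        ids.sort(key=relevancy.get, reverse=True)
    # Clusters in descending order of their top image's relevancy.
    ordered = sorted(clusters.values(), key=lambda ids: relevancy[ids[0]], reverse=True)
    result = []
    while any(ordered):
        for ids in ordered:
            if ids:
                result.append(ids.pop(0))
    return result

rel = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6}
imgs = [("a", "tower"), ("b", "bridge"), ("c", "tower"), ("d", "bridge")]
print(diversify_by_text(imgs, rel))  # ['a', 'b', 'c', 'd']
```
      </preformat>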
    </sec>
    <sec id="sec-7">
      <title>Visual diversification</title>
      <p>
        The visual diversification considers multiple visual
characteristics including colour (ColorMoment), structure (HOG, clusterless
BoW approach [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) and texture (local binary patterns (LBP)). For
each feature the normalized distances between the retrieved images
are combined using weighted summation and then projected in a
lower dimensional space by applying the FastMap [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] algorithm.
On the resulting 5-dimensional feature space, kMeans++
clustering [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is applied.
      </p>
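      <p>The weighted summation of the per-feature distances can be sketched with NumPy; the min-max normalization and the example weights are illustrative assumptions, and FastMap plus k-means++ would be applied to the combined distances afterwards:</p>
      <preformat>
```python
import numpy as np

def combine_distances(distance_matrices, weights):
    """Normalize each feature's pairwise distance matrix to [0, 1]
    and combine them by weighted summation."""
    combined = np.zeros_like(distance_matrices[0], dtype=float)
    for dist, weight in zip(distance_matrices, weights):
        span = dist.max() - dist.min()
        normalized = (dist - dist.min()) / span if span > 0 else dist * 0.0
        combined += weight * normalized
    return combined

color = np.array([[0.0, 2.0], [2.0, 0.0]])  # e.g. ColorMoment distances
lbp   = np.array([[0.0, 4.0], [4.0, 0.0]])  # e.g. LBP distances
print(combine_distances([color, lbp], [0.5, 0.5]))  # [[0. 1.] [1. 0.]]
```
      </preformat>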
      <p>
        The number of clusters is estimated by Hartigan’s Leader
clustering algorithm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], but the number is restricted to be between 5
and 21. Clusters with a low mean relevancy or clusters containing
only a few images are discarded, since these small clusters are very
likely to contain outliers. The remaining clusters are ordered in
descending order according to their maximum relevancy, and the ranked
image list is obtained by iteratively selecting the best image from
each of them.
      </p>
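      <p>Estimating the cluster count with leader clustering can be sketched as follows; the distance threshold is an illustrative parameter, and the resulting leader count is clamped to the range stated above:</p>
      <preformat>
```python
def estimate_cluster_count(points, threshold, low=5, high=21):
    """Leader clustering sketch: a point whose distance to every
    existing leader is at least the threshold becomes a new leader;
    the number of leaders, clamped to [low, high], estimates k."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    leaders = []
    for p in points:
        if all(dist(p, leader) >= threshold for leader in leaders):
            leaders.append(p)
    return min(high, max(low, len(leaders)))

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (9.0, 0.0)]
print(estimate_cluster_count(points, threshold=1.0))  # 5 (3 leaders, clamped up to 5)
```
      </preformat>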
    </sec>
    <sec id="sec-8">
      <title>EXPERIMENTS &amp; RESULTS</title>
      <p>The following experiments were performed based on the system
and the individual modules described above, following the
guidelines of the task.</p>
      <p>Run1 is an approach using only visual information (visual
relevancy improvement and visual diversification). Run2 is based
purely on textual information (textual relevancy improvement and
textual diversification). Run3 combines textual and visual information
(textual and visual relevancy improvement with visual diversification).
Run5 uses credibility based relevancy and visual diversification.</p>
      <p>These experiments are performed on the provided test set, which
contains 69 one-concept location queries and 70 multi-concept
queries related to events.</p>
      <p>Table 1 shows the results on the test set for all the runs defined
above. Since we want to evaluate our filters for different conditions,
scores for the one-concept and multi-concept queries are also
provided.</p>
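      <p>For reference, the F1@X metric reported below is the harmonic mean of precision (P@X) and cluster recall (CR@X); a minimal helper illustrates the computation for run2's scores:</p>
      <preformat>
```python
def f1_at_cutoff(precision, cluster_recall):
    """Harmonic mean of precision (P@X) and cluster recall (CR@X)."""
    if precision + cluster_recall == 0:
        return 0.0
    return 2 * precision * cluster_recall / (precision + cluster_recall)

# Run2's P@20 and CR@20 from the results below:
print(round(f1_at_cutoff(0.748, 0.418), 4))  # ~0.5363
```
      </preformat>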
      <p>In general, the textual run (run2) achieves the best results. It
achieves a higher precision (P@20 = 0.748) and also a slightly
better recall (CR@20 = 0.418) compared to the visual run (run1).
The advantage is more significant for one-concept queries than for
multi-concept queries. The textual run fails for queries whose main
topic is not correlated to a location (e.g. “chinese new year in
Beijing”, whose main topic is fireworks, “paragliding in the mountains”
or “tropical rain”). For these cases, the visual run reaches
considerably higher F1 scores. Generally, the purely visual run achieves
a better recall (CR@20 = 0.4209) and thus a slightly better F1
metric (F1@20 = 0.5027) for multi-concept queries.</p>
      <p>Since the combination of visual and textual features (run3)
consistently achieves lower scores than the individual modalities, we
analyse the cases where improvements were made. For example,
the previously mentioned query (“chinese new year in Beijing”)
benefits from visual information with a considerable increase of
the F1 measure (F1@20 = 0.18). In comparison to run2, run3
achieves a lower precision (P@20 = 0.015) and a similar
recall (CR@20 = 0.001), leading to a slightly lower F1 score
(F1@20 = 0.006). However, it is interesting to note that the
results differ for one-concept and multi-concept queries. The recall
of run3 is actually higher than that of run2 for multi-concept
queries (CR@20 = 0.008) while it is lower for one-concept queries
(CR@20 = 0.009).</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSION</title>
      <p>The results of the different runs show that overall the best results
can be achieved with textual information only and that the fusion
of visual and textual information leads to slightly worse results.
Analysing the results in more detail shows that visual information
provides better results for multi-concept queries and queries where
the main topic is not correlated to a location while textual
information achieves better performance for one-concept queries. This
shows that a more advanced fusion approach for combining textual
and visual information may improve the results further.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Arthur</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vassilvitskii</surname>
          </string-name>
          .
          <article-title>K-means++: The advantages of careful seeding</article-title>
          .
          <source>In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms</source>
          , SODA '
          <volume>07</volume>
          , pages
          <fpage>1027</fpage>
          -
          <lpage>1035</lpage>
          , Philadelphia, PA, USA,
          <year>2007</year>
          . Society for Industrial and Applied Mathematics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets</article-title>
          .
          <source>In Proceedings of the 1995 ACM SIGMOD international conference on Management of data</source>
          , pages
          <fpage>163</fpage>
          -
          <lpage>174</lpage>
          . ACM New York, NY, USA,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Hartigan</surname>
          </string-name>
          .
          <source>Clustering Algorithms</source>
          . John Wiley &amp; Sons, Inc., New York, NY, USA,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Hou</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Saliency detection: A spectral residual approach</article-title>
          .
          <source>In Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>June 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gînscă</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          . Retrieving Diverse Social Images at MediaEval 2015:
          <article-title>Challenge, Dataset and Evaluation</article-title>
          . MediaEval 2015 Workshop, Wurzen, Germany,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmiedeke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kelm</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          .
          <article-title>DCT-based features for categorisation of social media in compressed domain</article-title>
          .
          <source>In Multimedia Signal Processing (MMSP)</source>
          ,
          <source>2013 IEEE 15th International Workshop on</source>
          , pages
          <fpage>295</fpage>
          -
          <lpage>300</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Viola</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Rapid object detection using a boosted cascade of simple features</article-title>
          .
          <source>In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>I-511</fpage>
          -
          <lpage>I-518</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>