<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DWS at MediaEval 2014 Retrieving Diverse Social Images: Approaches for Visual and Textual Runs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lydia Weiland</string-name>
          <email>lydia@informatik.uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Paolo Ponzetto</string-name>
          <email>simone@informatik.uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data and Web Science Research Group, University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
<p>We present an overview of our two runs submitted to the MediaEval 2014 Retrieving Diverse Social Images task, one relying only on visual features and the other only on textual features. While the textual run is based on a standard tf-idf bag-of-words approach, for the visual run we developed a more elaborate contribution that clusters the images and diversifies the result list based on visual features. At its heart, our method relies on images collected for each location from Wikipedia. These images are used as cluster centroids, and the images collected from Flickr are grouped into these clusters according to their similarity to the Wikipedia images. Both runs, using either visual or textual information only, achieve precision-oriented results (i.e., precision more than twice as high as recall).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        This year’s dataset for the MediaEval 2014 Retrieving Diverse
Social Images task contains images and their textual descriptions
for 30 locations in the development set and 123 locations in the
test set [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Several visual and textual features are provided
for each image and its corresponding textual description. These
features can be used to create a ranked result list of images.
The images in the result list must satisfy two requirements: they
have to be relevant with respect to the query, and, among the
groups of relevant images, the most diverse ones should be
returned [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Filtering out images involves two main sub-tasks.
The first is to identify images that are actually wrong because
the location is not shown. These images are hard to find with
visual features only, because they often have similar visual
characteristics; e.g., for the location Obelisco, an image of the Berlin
Victory Column would bring noise into the data (cf. Figure 1).
The second sub-task instead focuses on identifying images where the
location is not the central aspect of the image (e.g., photographs
showing a person in focus and just a tiny piece of the location in
the background). These images are in turn hard to find with textual
features only, because they use the name of the location in at
least one of the textual descriptors (title, description, or tag) and fit
the textual query used for collecting the initial data (e.g., an image for
the location Leaning Tower showing a woman next to a green lawn
taking a photograph with the caption "Mary taking a picture up the
tower of Dave taking a picture down from the tower", cf. Figure 1).
As a result, textual and visual features can both be expected to be
beneficial for estimating the relevance of image results.</p>
      <p>
        Diversification is defined for this task in terms of different
visual compositions, e.g., images showing the location in daylight
or at night, or from a close-up or a bird’s-eye perspective. Although the
key goal is similar to last year’s task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the number of images in
the devset and testset changed and new features were added, e.g.,
user annotation credibility [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The official ranking metric of this
year reflects the two requirements of relevance and diversification,
namely a balanced F-measure computed over the first 20 images.
      </p>
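      <p>
        As a sketch of how we read this metric (the precise definition is given in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), the official score combines precision at a cutoff of 20 (P@20) and cluster recall at a cutoff of 20 (CR@20) as their harmonic mean:
        <disp-formula><tex-math><![CDATA[F1@20 = \frac{2 \cdot P@20 \cdot CR@20}{P@20 + CR@20}]]></tex-math></disp-formula>
      </p>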
    </sec>
    <sec id="sec-2">
      <title>METHODOLOGY</title>
      <p>
        The methods we developed are based on the provided features
and data. Besides the features of the images, additional external
information for each location was provided. As an exception, the
participants were allowed to use this additional information for the
first runs, e.g., images from Wikipedia with their associated image
features (Color Names, Histogram of Oriented Gradients (HOG),
etc.). No external information other than the provided one was
allowed in the final runs. Wikipedia images were given without
textual descriptions; accordingly, we developed two different
process chains for the two runs.
      </p>
      <p>
        Run 1: Visual Information Only. We start from the assumption
that the images from Wikipedia show the location from different
perspectives, i.e., they provide good examples of diverse images.
We then developed the following pipeline to filter, cluster, and
diversify the images. First, each image from Wikipedia is taken as
the centroid of a cluster. Images crawled from Flickr are then
compared against these centroids, and each candidate image is
assigned to the cluster with the lowest distance (we do not use
soft clustering, to avoid duplicates in the ranked result list).
Distances are calculated as the Euclidean distance of the HOG
values for each patch (an image has 9 patches). The method uses
HOG features as they have been shown to outperform other features [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
We then use inverse distances as relevance scores for the
ranking. Sorting the images in descending order of their best similarity
value results in a list of ranked images (we store the top 50 as final
output). Filtering images before clustering using face detection
algorithms could potentially lead to slightly better results; however,
it also showed a negative impact on the CR values
of some of the locations of the devset. Thus, we decided not to use
a filtering method for our final run.
      </p>
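      <p>
        A minimal sketch of the Run 1 assignment and ranking step is given below
(Python/NumPy). The function name, the data-loading interface, and the way
per-patch distances are aggregated into a single distance are our own
illustrative assumptions, not part of the task framework.
      </p>
      <preformat>
# Sketch of the Run 1 clustering and ranking step (illustrative only).
import numpy as np

def rank_by_visual_similarity(wiki_hog, flickr_hog, top_k=50):
    """wiki_hog: dict image_id -> (9, d) per-patch HOG array (cluster centroids).
       flickr_hog: dict image_id -> (9, d) per-patch HOG array (candidates).
       Returns up to top_k Flickr image ids ranked by inverse distance."""
    centroids = list(wiki_hog.values())
    scored = []
    for img_id, patches in flickr_hog.items():
        # Distance to a Wikipedia centroid: Euclidean distance of the HOG
        # values of each of the 9 patches, summed over patches (the exact
        # aggregation over patches is our assumption).
        dists = [np.linalg.norm(patches - c, axis=1).sum() for c in centroids]
        best = min(dists)                              # hard assignment to the nearest cluster
        scored.append((1.0 / (best + 1e-9), img_id))   # inverse distance as relevance score
    scored.sort(reverse=True)                          # closest (most relevant) images first
    return [img_id for _, img_id in scored[:top_k]]
      </preformat>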
      <p>Run 2: Textual Information Only. For each image, tf-idf
weights are given with respect to different reference data (image,
location, and user). In our run we decided to use only the
location-related tf-idf weights. Storing these values in a vector for each
image allows us to calculate the cosine similarity between two
images. For each location, the cosine similarities are calculated for
each pair of images. We remove pairs with maximum similarity,
i.e., 1, based on two assumptions: i) either we have the very same
image, or ii) two images have exactly the same textual description
– in both cases, the pair cannot positively impact result diversification.
We finally return the top 50 images with the highest cosine similarity
as the output ranked result list.</p>
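      <p>
        The Run 2 ranking can likewise be sketched as follows (Python/NumPy). The
text above does not fully specify how pairwise similarities are turned into a
per-image score; the sketch uses each image's highest cosine similarity to
another retained image of the same location, which is our assumption.
      </p>
      <preformat>
# Sketch of the Run 2 textual ranking (illustrative only).
import numpy as np

def rank_by_textual_similarity(tfidf, top_k=50):
    """tfidf: dict image_id -> 1-D array of location-related tf-idf weights
       for the images of one location. Returns up to top_k image ids."""
    ids = list(tfidf)
    X = np.stack([tfidf[i] for i in ids])
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)  # unit-normalize
    sim = X @ X.T                      # pairwise cosine similarities
    np.fill_diagonal(sim, -1.0)        # ignore self-similarity

    keep = []
    for row in range(len(ids)):
        # Drop one image of every maximum-similarity pair (identical image
        # or identical textual description).
        if any(np.isclose(sim[row, kept], 1.0) for kept in keep):
            continue
        keep.append(row)

    scored = []
    for row in keep:
        others = [k for k in keep if k != row]
        score = sim[row, others].max() if others else 0.0  # best similarity to another image
        scored.append((score, ids[row]))
    scored.sort(reverse=True)
    return [img_id for _, img_id in scored[:top_k]]
      </preformat>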
    </sec>
    <sec id="sec-3">
      <title>RESULTS</title>
    </sec>
    <sec id="sec-4">
      <title>CONCLUSIONS</title>
      <p>
        Both our runs show weaknesses on the diversification side (CR@20),
whilst precision (P@20) is more than twice as high.
The results indicate that the methodologies are basically able to
spot relevant data in a coarse way, whereas fine-grained
diversification and ranking still need to be improved. In the future, we plan to
collect Wikipedia articles for each location in order to build a topic
model, as opposed to merely using tf-idf weights of the Flickr data
only. From the visual processing perspective, we plan to improve
result diversification either by using other clustering approaches
(e.g., allowing Flickr images to form their own clusters) or by
crawling more than the five Wikipedia images representing each location,
requiring them at the same time to be diverse from the Wikipedia images
already collected. Related work in the field of ranking and
retrieving multimedia data has shown that the combination of visual and
textual features in a multimodal model outperforms single-modality
models. Thus, we also plan to conduct experiments where both
types of features are used jointly within one single model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Deselaers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dreuw</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          .
          <article-title>Jointly optimising relevance and diversity in image retrieval</article-title>
          .
          <source>In Proceedings of the ACM International Conference on Image and Video Retrieval</source>
          ,
          <source>CIVR '09</source>
          , pages
          <fpage>39:1</fpage>
          -
          <lpage>39:8</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Menéndez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at MediaEval 2013: Objectives, dataset and evaluation</article-title>
          .
          <source>In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop</source>
          , October 18-19, Barcelona, Spain,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ginsca</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at MediaEval 2014: Challenge, dataset and evaluation</article-title>
          .
          <source>In MediaEval 2014 Workshop</source>
          , October 16-17, Barcelona, Spain,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          .
          <article-title>CEA LIST's participation at MediaEval 2013 retrieving diverse social images task</article-title>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Rasiwasia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Costa</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Coviello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Doyle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Lanckriet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasconcelos</surname>
          </string-name>
          .
          <article-title>A new approach to cross-modal multimedia retrieval</article-title>
          .
          <source>In Proc. of MM '10</source>
          , pages
          <fpage>251</fpage>
          -
          <lpage>260</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>