<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the wikipediaMM task at ImageCLEF 2009</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Theodora Tsikrika</string-name>
          <email>Theodora.Tsikrika@cwi.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jana Kludas</string-name>
          <email>jana.kludas@cui.unige.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CUI, University of Geneva</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>CWI</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>ImageCLEF's wikipediaMM task provides a testbed for the system-oriented evaluation of multimedia information retrieval from a collection of Wikipedia images. The aim is to investigate retrieval approaches in the context of a large and heterogeneous collection of images (similar to those encountered on the Web) that are searched for by users with diverse information needs. This paper presents an overview of the resources, topics, and assessments of the wikipediaMM task at ImageCLEF 2009, summarises the retrieval approaches employed by the participating groups, and provides a first analysis of the main evaluation results.</p>
      </abstract>
      <kwd-group>
        <kwd>ImageCLEF</kwd>
        <kwd>Wikipedia image collection</kwd>
        <kwd>image retrieval</kwd>
        <kwd>evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The wikipediaMM task is an ad-hoc image retrieval task. The evaluation scenario is similar
to the classic TREC ad-hoc retrieval task and the ImageCLEF photo retrieval task: it simulates
the situation in which a system knows the set of documents to be searched, but cannot anticipate
the particular topic that will be investigated (i.e., topics are not known to the system in advance).
Given a multimedia query that consists of a title and one or more sample images describing a
user’s multimedia information need, the aim is to find as many relevant images as possible from
the (INEX MM) wikipedia image collection. A multi-modal retrieval approach should in that case
be able to combine the relevance of different media types into a single ranking that is presented
to the user.</p>
      <p>The wikipediaMM task differs from other benchmarks in multimedia information retrieval, like
TRECVID, in the sense that the textual modality in the wikipedia image collection contains less
noise than the speech transcripts in TRECVID. This may be one of the reasons why, both in
last year’s task and in INEX Multimedia 2006-2007 (where this image collection was also used), it
has proven challenging to outperform the text-only approaches. This year, the aim is to bring
the investigation of multi-modal approaches to the forefront of this task by providing a number of
resources that support the participants in this research direction.</p>
      <p>The paper is organised as follows. First, we introduce the task’s resources: the wikipedia image
collection and additional resources, the topics, and the assessments (Sections 2–4). Section 5
presents the approaches employed by the participating groups and Section 6 summarises their
main results. Section 7 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Task resources</title>
      <p>
        The resources used for the wikipediaMM task are based on Wikipedia data. The collection is
the (INEX MM) wikipedia image collection, which consists of approximately 150,000 JPEG
and PNG Wikipedia images provided by Wikipedia users. Each image is associated with
user-generated alphanumeric, unstructured metadata in English. These metadata usually contain a
brief caption or description of the image, the Wikipedia user who uploaded the image, and the
copyright information. These descriptions are highly heterogeneous and of varying length. Further
information about the image collection can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Additional resources were also provided to support the participants in their investigations of
multi-modal approaches. These resources are:
      </p>
      <p>
        Image similarity matrix: The similarity matrix for the images in the collection has been
constructed by the IMEDIA group at INRIA. For each image in the collection, this matrix
contains the list of the top K = 1000 most similar images in the collection together with
their similarity scores. The same is given for each image in the topics. The similarity scores
are based on the distance between images; therefore, the lower the score, the more similar
the images. Further details on the features and distance metric used can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Image classification scores: For each image, the classification scores for the 101 MediaMill
concepts have been provided by UvA [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The UvA classifier is trained on manually annotated
TRECVID video data for concepts selected for the broadcast news domain.
      </p>
      <p>
        Image features: For each image, the set of the 120D feature vectors that has been used to
derive the above image classification scores [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has also been made available. Participants
can use these feature vectors to custom-build a content-based image retrieval (CBIR) system,
without having to pre-process the image collection.
      </p>
      <p>The additional resources are beneficial to researchers who wish to exploit visual evidence without
performing image analysis. Of course, participants could also extract their own image features.</p>
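      <p>One way to combine textual evidence with the precomputed image distances into a single ranking is a weighted linear fusion, sketched below. This is only an illustration under stated assumptions: the min-max normalisation, the weight, and all names are hypothetical and do not describe any participant's method. The only property taken from the text is that a lower distance means a more similar image.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def _min_max(values):
    """Scale a dict of scores to [0, 1]; constant inputs map to 0.5."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {key: 0.5 for key in values}
    return {key: (val - lo) / (hi - lo) for key, val in values.items()}


def fuse_scores(text_scores, visual_distances, weight=0.5):
    """Rank images by a weighted sum of text and visual evidence.

    text_scores: image id -> text retrieval score (higher is better).
    visual_distances: image id -> distance to the example image
        (lower is more similar, as in the similarity matrix).
    weight: share of the fused score given to the text evidence.
    """
    text_norm = _min_max(text_scores) if text_scores else {}
    dist_norm = _min_max(visual_distances) if visual_distances else {}
    # Flip the distances so that 1.0 means "most similar".
    visual_norm = {key: 1.0 - val for key, val in dist_norm.items()}

    fused = {}
    for image_id in set(text_norm) | set(visual_norm):
        fused[image_id] = (weight * text_norm.get(image_id, 0.0)
                           + (1.0 - weight) * visual_norm.get(image_id, 0.0))
    # Best fused score first; ties are broken by image id.
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))


ranking = fuse_scores(
    {"a": 2.0, "b": 1.0, "c": 0.0},   # toy text scores
    {"a": 0.8, "b": 0.2, "d": 0.5},   # toy distances to the example image
)
```
]]></preformat>
      <p>Images missing from one modality simply receive no contribution from it, which is a design choice of this sketch rather than a recommendation.</p>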
    </sec>
    <sec id="sec-3">
      <title>Topics</title>
      <sec id="sec-3-1">
        <title>Topic Format</title>
        <p>The topics are descriptions of multimedia information needs that contain textual and visual hints.
These multimedia queries consist of a textual part, the query title, and a visual part, one or several
example images. Each topic has the following fields:
&lt;title&gt;: query by keywords;
&lt;image&gt;: query by image content (one or several);
&lt;narrative&gt;: description of the query in which the definitive definition of relevance and irrelevance
is given.</p>
        <p>&lt;title&gt;: The topic &lt;title&gt; simulates a user who does not have (or want to use) example images or other
visual constraints. The query expressed in the topic &lt;title&gt; is therefore a text-only query. This
profile is likely to fit most users searching digital libraries.</p>
        <p>Upon discovering that a text-only query does not produce many relevant hits, a user might
decide to add visual hints and formulate a multimedia query.</p>
        <p>&lt;image&gt;: The visual hints are example images, which can be taken from outside or inside the wikipedia
image collection and can be of any common format. Each topic has at least one example image,
but it can have several, e.g., to describe the visual diversity of the topic.</p>
        <p>&lt;narrative&gt;: A clear and precise description of the information need is required in order to unambiguously
determine whether or not a given document fulfils the given information need. In a test collection
this description is known as the narrative. It is the only true and accurate interpretation of a user’s
needs. Precise recording of the narrative is important for scientific repeatability: there must exist,
somewhere, a definitive description of what is and is not relevant to the user. To aid this, the
&lt;narrative&gt; should explain not only what information is being sought, but also the context and
motivation of the information need, i.e., why the information is being sought and what work-task
it might help to solve.</p>
        <p>These different types of information sources (textual terms and visual examples) can be used
in any combination. It is up to the systems how to use, combine or ignore this information; the
relevance of a result does not directly depend on these constraints, but it is decided by manual
assessments based on the &lt;narrative&gt;.</p>
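        <p>To make the topic structure concrete, the sketch below parses one topic and derives a text-only and a multimedia query from it. The three field names follow the description above; the wrapper element, the id attribute, the file names, and the narrative text are invented for illustration and may differ from the files actually distributed.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical topic markup: only the three field names come from the
# task description; everything else here is invented for the example.
SAMPLE_TOPIC = """
<topic id="example-1">
  <title>bikes</title>
  <image>example1.jpg</image>
  <image>example2.jpg</image>
  <narrative>Relevant images show one or more bicycles.</narrative>
</topic>
"""


def parse_topic(xml_text):
    """Return the fields of one topic as a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    images = [el.text.strip() for el in root.findall("image") if el.text]
    if not images:
        raise ValueError("each topic needs at least one example image")
    return {
        "id": root.get("id"),
        "title": (root.findtext("title") or "").strip(),
        "images": images,
        "narrative": (root.findtext("narrative") or "").strip(),
    }


topic = parse_topic(SAMPLE_TOPIC)
text_query = topic["title"]                            # text-only query
multimedia_query = (topic["title"], topic["images"])   # title plus examples
```
]]></preformat>
        <p>The narrative is kept separate from both queries because it is meant for the assessors and is not part of the query itself.</p>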
      </sec>
      <sec id="sec-3-2">
        <title>Topic Development</title>
        <p>The topics in the ImageCLEF 2009 wikipediaMM task have been partly developed by the
participants and partly by the organisers. This year the participation in the topic development process
was not obligatory, so only 2 of the participating groups submitted a total of 11 candidate topics.
The rest of the candidate topics were created by the organisers with the help of the log of an image
search engine. After a selection process performed by the organisers, a final list of 45 topics was
created.</p>
        <p>These final topics range from simple, and thus relatively easy (e.g., “bikes”), to semantic,
and hence highly difficult (e.g., “aerial photos of non-artificial landscapes”), with the latter
forming the bulk of the topics. Semantic topics typically have a complex set of constraints, need
world knowledge, and/or contain ambiguous terms, so they are expected to be challenging for
current state-of-the-art retrieval algorithms. We encouraged the participants to use multi-modal
approaches since they are more appropriate for dealing with semantic information needs. On
average, the 45 topics contain 1.7 images and 2.7 words.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Assessments</title>
      <p>The wikipediaMM task is an image retrieval task, where an image with its metadata is either
relevant or not (binary relevance). We adopted TREC-style pooling of the retrieved images with
a pool depth of 50, resulting in pools of between 299 and 802 images with a mean and median
both around 545. The evaluation was performed by the participants of the task within a period
of 4 weeks after the submission of runs. The 7 groups that participated in the evaluation process
used the web-based interface that was used last year and which has also been previously employed
in the INEX Multimedia and TREC Enterprise tracks.</p>
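      <p>TREC-style pooling can be sketched as taking, for every topic, the union of the top-ranked images of all submitted runs. The data layout and names below are assumptions made for the example; the depth parameter corresponds to the pool depth of 50 mentioned above.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def build_pools(runs, depth=50):
    """Union of the top `depth` images of every run, per topic.

    runs: run id -> {topic id -> ranked list of image ids}.
    Returns topic id -> set of image ids to be judged.
    """
    pools = {}
    for ranking_by_topic in runs.values():
        for topic_id, ranked_images in ranking_by_topic.items():
            pools.setdefault(topic_id, set()).update(ranked_images[:depth])
    return pools


# Two toy runs for a single topic, pooled at depth 2.
toy_runs = {
    "run_txt": {"t1": ["a", "b", "c", "d"]},
    "run_img": {"t1": ["c", "e", "a", "f"]},
}
pools = build_pools(toy_runs, depth=2)
```
]]></preformat>
      <p>If all 57 runs contribute to a pool at depth 50, the pool can hold at most 57 x 50 = 2,850 images; the reported pool sizes of 299 to 802 therefore indicate considerable overlap between the runs.</p>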
    </sec>
    <sec id="sec-5">
      <title>Participants</title>
      <p>A total of 8 groups submitted 57 runs: CEA (LIC2M-CEA, Centre CEA de Saclay, France),
DCU (Dublin City University, School of Computing, Ireland), DEU (Dokuz Eylul University,
Department of Computer Engineering, Turkey), IIIT-Hyderabad (Search and Info Extraction Lab,
India), LaHC (Laboratoire Hubert Curien, UMR CNRS, France), SZTAKI (Hungarian Academy
of Sciences, Hungary), SINAI (Intelligent Systems, University of Jaen, Spain) and UALICANTE
(Software and Computer Systems, University of Alicante, Spain).</p>
      <p>DEU (6 runs) Their research interests focussed on 1) the expansion of native documents and
queries, term phrase selection based on WordNet, word sense disambiguation (WSD) and WordNet
similarity functions, and 2) a new reranking approach with Boolean retrieval and C3M based clustering.</p>
      <p>IIIT-H (1 run) Their system automatically ranks the most similar images to a given textual
query using a combination of the Vector Space Model and the Boolean model. The system
preprocesses the data set in order to remove the non-informative terms.</p>
      <p>LaHC (13 runs) In this second participation, they extended their approach (a multimedia
document model defined as a vector of textual and visual terms weighted using tf.idf) by using
1) additional information for the textual part (legend and image bounding text extracted
from the original documents), 2) different image detectors and descriptors, and 3) a new
text/image combination approach.</p>
      <p>SINAI (4 runs) Their approach focussed on query and document expansion techniques based
on WordNet. They used the LEMUR toolkit as their retrieval system.</p>
      <p>SZTAKI (7 runs) They used both textual and visual features and employed image
segmentation, SIFT keypoints, Okapi BM25 based text retrieval, and query expansion by an online
thesaurus. They preprocessed the annotation text to remove author and copyright
information and biased retrieval towards images with filenames containing relevant terms.</p>
      <p>UALICANTE (9 runs) They used IR-n, a retrieval system based on passages, and applied two
different term selection strategies for query expansion: Probabilistic Relevance Feedback and
Local Context Analysis, and their multi-modal versions. They also used the same technique
for Camel Case decompounding of image filenames that they used in last year’s participation.</p>
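      <p>Camel Case decompounding of image filenames, mentioned above, can be illustrated with a short sketch that splits a filename into lower-case terms. This is not the UALICANTE implementation: the regular expression, the handling of underscores and hyphens, and the function name are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# A boundary is either lower-case/digit followed by upper-case
# ("GoldenGate") or the end of an acronym ("HTMLParser").
_CAMEL_BOUNDARY = re.compile(
    r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"
)


def decompound_filename(filename):
    """Split an image filename into lower-case index terms."""
    stem = filename.rsplit(".", 1)[0]            # drop the extension, if any
    spaced = _CAMEL_BOUNDARY.sub(" ", stem)      # break at case changes
    parts = re.split(r"[\s_\-]+", spaced)        # also split on separators
    return [part.lower() for part in parts if part]


terms = decompound_filename("GoldenGateBridge.jpg")
```
]]></preformat>
      <p>The resulting terms could then be indexed alongside the image metadata, so that a query term can match a word that only occurs inside a compound filename.</p>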
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>Next, we analyse the evaluation results. In our analysis, we use only the top 90% of the runs
to exclude noisy and buggy results. Furthermore, we excluded 3 runs that we considered to be
redundant, i.e., they were produced by the same group and achieved the exact same result, so as
to reduce the bias of the analysis.</p>
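      <p>The run selection described above could be sketched as follows. The text does not state how the runs were ranked or in which order the two filters were applied, so ranking by MAP, truncating the cut-off with int(), and treating runs of the same group with an identical MAP as redundant are all assumptions of this sketch; the function and argument names are hypothetical, and the sketch is not meant to reproduce the run counts reported in this section.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def select_runs(map_by_run, group_by_run, keep_fraction=0.9):
    """Keep the best-scoring fraction of runs, then drop redundant ones.

    map_by_run: run id -> MAP; group_by_run: run id -> group name.
    """
    # Best MAP first; ties are broken by run id for a stable order.
    ranked = sorted(map_by_run, key=lambda run: (-map_by_run[run], run))
    kept = ranked[:int(len(ranked) * keep_fraction)]

    seen, selected = set(), []
    for run_id in kept:
        key = (group_by_run[run_id], map_by_run[run_id])
        if key in seen:
            continue  # same group, identical score: treated as redundant
        seen.add(key)
        selected.append(run_id)
    return selected


toy_map = {"r1": 0.30, "r2": 0.25, "r3": 0.25, "r4": 0.20, "r5": 0.18,
           "r6": 0.15, "r7": 0.12, "r8": 0.10, "r9": 0.05, "r10": 0.01}
toy_group = {"r1": "A", "r2": "B", "r3": "B", "r4": "A", "r5": "C",
             "r6": "C", "r7": "A", "r8": "B", "r9": "C", "r10": "A"}
selected = select_runs(toy_map, toy_group)
```
]]></preformat>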
      <sec id="sec-6-1">
        <title>Performance per modality for all topics</title>
        <p>examined evaluation metrics (MAP, precision at 20, and precision after R (= number of relevant)
documents are retrieved).</p>
        <p>Modality (rows of the results table): All top 90% runs (46 runs); TXT in top 90% runs (23 runs);
TXTIMG in top 90% runs (23 runs).</p>
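        <p>For reference, the three metrics named above can be computed from a ranked list and a set of relevant images as in the following sketch. It follows the standard definitions with binary relevance; the function names and the data layout are illustrative, and the sketch is not the evaluation software used for the task.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, image_id in enumerate(ranked, start=1):
        if image_id in relevant:
            hits += 1
            total += hits / rank   # precision at this relevant result
    return total / len(relevant)


def precision_at(ranked, relevant, k):
    """Fraction of the top k results that are relevant (e.g. k = 20)."""
    return sum(1 for image_id in ranked[:k] if image_id in relevant) / k


def r_precision(ranked, relevant):
    """Precision after R results, where R is the number of relevant images."""
    if not relevant:
        return 0.0
    return precision_at(ranked, relevant, len(relevant))


def mean_average_precision(run, qrels):
    """MAP over all judged topics; run and qrels are keyed by topic id."""
    return sum(average_precision(run.get(topic, []), rel)
               for topic, rel in qrels.items()) / len(qrels)


toy_ranked = ["a", "x", "b", "y"]
toy_relevant = {"a", "b", "c"}
toy_ap = average_precision(toy_ranked, toy_relevant)
```
]]></preformat>
        <p>In the toy example, the relevant images are found at ranks 1 and 3 and one relevant image is never retrieved, so the average precision is (1/1 + 2/3) / 3.</p>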
      </sec>
      <sec id="sec-6-2">
        <title>Performance per topic and per modality</title>
        <p>To analyse the average difficulty of the topics, we classify the topics based on the average MAP
values per topic (aMAP) as follows:
easy: aMAP &gt; 0.3;
medium: 0.2 &lt; aMAP &lt;= 0.3;
hard: 0.1 &lt; aMAP &lt;= 0.2;
very hard: aMAP &lt; 0.1.</p>
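        <p>The difficulty classes above translate directly into a small function, sketched here with hypothetical names. The thresholds are taken from the text; since the text leaves an average MAP of exactly 0.1 unassigned, this sketch places that value in the very hard class, which is an assumption.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from statistics import mean


def average_map(ap_by_run):
    """Average of the per-run scores obtained on one topic."""
    return mean(ap_by_run.values())


def difficulty(amap):
    """Map a topic's average MAP to one of the four difficulty classes."""
    if amap > 0.3:
        return "easy"
    if amap > 0.2:
        return "medium"
    if amap > 0.1:
        return "hard"
    # Covers amap < 0.1 and, by assumption, amap == 0.1 as well.
    return "very hard"


# Toy scores of three runs on one topic.
toy_topic = {"run_1": 0.10, "run_2": 0.20, "run_3": 0.15}
toy_class = difficulty(average_map(toy_topic))
```
]]></preformat>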
        <p>We also analysed the performance of runs that use only text (TXT) versus text and visual
resources (TXTIMG). Figure 2 shows the average performance on each topic for all runs, for the
text-only runs, and for the text-visual ones. The text-based runs outperform the text-visual ones in 22
out of the 45 topics. So, slightly more than half of the topics profit from a multi-modal approach.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Visuality of topics</title>
        <p>The “visuality” of topics can be deduced from the performance of text-only and text-visual
approaches that we presented in the previous section. We consider that if, for a topic, the text-visual
approaches significantly improve the MAP over all runs (e.g., by diff(MAP) &gt;= 0.01), then we
can consider that to be a visual topic. In the same way, we can define topics as textual, if the
text-only approaches significantly improve the MAP over all runs of a topic. Based on this, 15 of
the topics can be characterised as textual and 14 as visual. The remaining 16 topics, where no
clear improvements are observed, are considered to be neutral.</p>
        <p>Finally, we analyse the effect of the application of query expansion (QE) and relevance feedback
(FB) techniques. Similarly to the visuality analysis, we consider the techniques to
be useful for a topic, if they significantly improve the MAP over all runs. Table 6 presents the
top 10 best performing topics for these techniques and some statistics. Query expansion is useful
for 17 topics and relevance feedback for 11. The statistics show that these techniques can help
improve the retrieval results for topics defined without too much detail, e.g., topics having a short
title (#words/topic) and/or a small number of example images (#images/topic).</p>
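        <p>The labelling of topics as visual, textual or neutral described in this section can be sketched as follows. The threshold of 0.01 comes from the text; taking "all runs" to be the union of the text-only and text-visual runs, and all names used here, are assumptions of this sketch.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from statistics import mean


def label_topic(ap_txt, ap_txtimg, threshold=0.01):
    """Label one topic from the per-run scores of both modalities.

    ap_txt / ap_txtimg: scores of the text-only and text-visual runs
    on this topic. The baseline is the mean over both groups together.
    """
    baseline = mean(list(ap_txt) + list(ap_txtimg))
    if mean(ap_txtimg) - baseline >= threshold:
        return "visual"
    if mean(ap_txt) - baseline >= threshold:
        return "textual"
    return "neutral"


def summarise(topics, threshold=0.01):
    """Count labels over topics given as topic id -> (ap_txt, ap_txtimg)."""
    return Counter(label_topic(txt, txtimg, threshold)
                   for txt, txtimg in topics.values())


toy_topics = {
    "t1": ([0.20, 0.22], [0.30, 0.32]),   # text-visual runs clearly better
    "t2": ([0.30, 0.32], [0.20, 0.22]),   # text-only runs clearly better
    "t3": ([0.20, 0.20], [0.20, 0.21]),   # no clear difference
}
counts = summarise(toy_topics)
```
]]></preformat>
        <p>The same comparison against the mean over all runs could be reused for query expansion and relevance feedback by replacing the two modality groups with the runs that do and do not apply the technique.</p>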
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>This year (similarly to 2008), a text-based approach performed best in the wikipediaMM task, even
though highly semantic multimedia topics were developed with the aim of encouraging and showing the
potential of multi-modal approaches. It is worth noting, though, that all of the participants that
submitted both mono-media and multi-modal runs achieved their best results with their
multi-modal runs. Additionally, we as organisers are pleased to see that more than half of the submitted
runs were multi-modal.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>Theodora Tsikrika was supported by the European Union via the European Commission project
VITALAS (contract no. 045389).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Marin</given-names>
            <surname>Ferecatu</surname>
          </string-name>
          .
          <article-title>Image retrieval with active relevance feedback using both visual and keyword-based descriptors</article-title>
          .
          <source>Ph.D. thesis</source>
          , Université de Versailles, France,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jan C.</given-names>
            <surname>van Gemert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan-Mark</given-names>
            <surname>Geusebroek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cor J.</given-names>
            <surname>Veenman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cees G. M.</given-names>
            <surname>Snoek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arnold W. M.</given-names>
            <surname>Smeulders</surname>
          </string-name>
          .
          <article-title>Robust scene categorization by learning image statistics in context</article-title>
          .
          <source>In Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop</source>
          , page 105, Washington, DC, USA,
          <year>2006</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Cees G. M.</given-names>
            <surname>Snoek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marcel</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan C.</given-names>
            <surname>van Gemert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan-Mark</given-names>
            <surname>Geusebroek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arnold W. M.</given-names>
            <surname>Smeulders</surname>
          </string-name>
          .
          <article-title>The challenge problem for automated detection of 101 semantic concepts in multimedia</article-title>
          .
          <source>In Proceedings of the 14th annual ACM international conference on Multimedia</source>
          , pages
          <fpage>421</fpage>
          -
          <lpage>430</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Thijs</given-names>
            <surname>Westerveld</surname>
          </string-name>
          and
          <string-name>
            <given-names>Roelof</given-names>
            <surname>van Zwol</surname>
          </string-name>
          .
          <article-title>The INEX 2006 multimedia track</article-title>
          . In Norbert Fuhr, Mounia Lalmas, and Andrew Trotman, editors,
          <source>Advances in XML Information Retrieval: 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2006, Revised Selected Papers</source>
          , volume
          <volume>4518</volume>
          , pages
          <fpage>331</fpage>
          -
          <lpage>344</lpage>
          . Springer,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>