<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the ImageCLEF 2006 photographic retrieval and object annotation tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paul Clough</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Grubinger</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Deselaers</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Allan Hanbury</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henning Müller</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, RWTH Aachen University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sheffield University</institution>
          ,
          <addr-line>Sheffield</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University and Hospitals of Geneva</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Victoria University</institution>
          ,
          <addr-line>Melbourne</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Vienna University</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the general photographic retrieval and object annotation tasks of the ImageCLEF 2006 evaluation campaign. These tasks provide both the resources and the framework necessary to perform comparative laboratory-style evaluation of visual information systems for image retrieval and automatic image annotation. Both tasks offered something new for 2006 and attracted a large number of submissions: 12 groups participated in ImageCLEFphoto and 3 in the automatic annotation task. This paper summarises the components used in the benchmark, including the collections, the search and annotation tasks, the submissions from participating groups, and the results. The general photographic retrieval task, ImageCLEFphoto, used a new collection - the IAPR-TC12 Benchmark - of 20,000 colour photographs with semi-structured captions in English and German. This new collection replaces the St Andrews collection of historic photographs used for the previous three years. For ImageCLEFphoto, groups submitted mainly text-only runs; however, 31% of runs involved some kind of visual retrieval technique, typically combined with text through the merging of image and text retrieval results. Bilingual text retrieval was performed using two target languages, English and German, with 59% of runs bilingual. The highest bilingual-to-English run (Portuguese-English) reached 74% of the best English monolingual result, and the highest bilingual-to-German run (English-German) reached 39% of the best German monolingual result. Combined text and image retrieval approaches were seen to give, on average, higher retrieval results (+54%) than using text (or image) retrieval alone. Similar to previous years, the use of relevance feedback (most commonly in the form of pseudo relevance feedback) to enable query expansion was seen to improve the text-based submissions by an average of 39%. Topics have been categorised and analysed with respect to various attributes including an estimation of their “visualness” and linguistic complexity. The general automatic object annotation task used a hand-collected dataset of 81,211 images from 268 classes provided by LTUtech. Given training data, participants were required to classify previously unseen images. The error rate of submissions for this task was high (ranging from 77.3% to 93.2%), with a large proportion of the test images not classified correctly by any of the proposed classification methods. The task can therefore be said to have been very challenging for participants.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.2.3 [Database Management]: Languages—Query Languages</p>
    </sec>
    <sec id="sec-2">
      <title>General Terms</title>
      <p>Measurement, Performance, Experimentation</p>
      <p>Keywords: Image retrieval, image classification, performance evaluation</p>
      <sec id="sec-2-1">
        <title>Introduction</title>
        <p>The evaluation of text information retrieval has benefited from the use of standardised benchmarks
and evaluation events, performed since the 1960s [2]. With TREC1 (Text REtrieval Conference
[11]) a standard was set that has been used as the model for evaluation events in related fields. One
such event is CLEF2 (Cross Language Evaluation Forum) and within CLEF, the retrieval of images
from multilingual collections: ImageCLEF. Over the past 2-3 years, ImageCLEF has expanded to
deal with multiple domains (most noticeably the retrieval of medical images) and aspects of
retrieval such as the automatic annotation of images with text descriptors. In this paper, we describe
three tasks at ImageCLEF 2006: the general photographic retrieval task (ImageCLEFphoto), a
general visual retrieval task, and a general image annotation (or classification) task. Section 2
describes the first general retrieval task, section 3 the visual retrieval task aimed more specifically
at evaluating purely visual retrieval systems, and section 4 describes the automatic annotation
task.</p>
      </sec>
      <sec id="sec-2-2">
        <title>The ImageCLEFphoto photographic retrieval task</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>General Overview</title>
      <p>
        This task is similar to the classic TREC ad-hoc retrieval task: simulation of the situation in
which a system knows the set of documents to be searched, but cannot anticipate the particular
topic that will be investigated (i.e. topics are not known to the system in advance). The goal
of ImageCLEFphoto 2006 is: given a multilingual statement describing a user information need,
find as many relevant images as possible from the given document collection. After three years
of image retrieval evaluation using the St. Andrews database [3], a new database was used in
this year’s task: the IAPR TC-12 Benchmark [5], created under Technical Committee 12
(TC12) of the International Association of Pattern Recognition (IAPR3). This collection differs from
the St Andrews collection used in previous campaigns in two major ways: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) it contains mainly
colour photographs (the St Andrews collection was primarily black and white) and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) it contains
semi-structured captions in English and German (the St Andrews collection used only English).
      </p>
    </sec>
    <sec id="sec-4">
      <title>Document Collection</title>
      <p>The IAPR TC-12 Benchmark contains 20,000 photos taken from locations around the world and
comprises a varying cross-section of still natural images. Figure 1 illustrates a number of sample
images from a selection of categories. The majority of images have been provided by viventura4,
an independent travel company that organises adventure and language trips to South-America.
Travel guides accompany the tourists and maintain a daily online diary including photographs of
trips made and general pictures of each location including accommodation, facilities and
ongoing social projects. The collection contains many different images of similar visual content, but
varying illumination, viewing angle and background. This makes it a challenge for the successful
application of visual analysis techniques.
1 http://trec.nist.gov/
2 http://www-clef-campaign.org/
3 http://www.iapr.org/
4 http://www.viventura.de</p>
      <p>
        Each image in the collection has a corresponding semi-structured caption consisting of the
following seven fields: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) a unique identifier, (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) a title, (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) a free-text description of the semantic
and visual contents of the image, (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ) notes for additional information, (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ) the provider of the photo
and fields describing (
        <xref ref-type="bibr" rid="ref6">6</xref>
        ) where and (
        <xref ref-type="bibr" rid="ref7">7</xref>
        ) when the photo was taken. These fields exist in English
and German, with a Spanish version currently being verified. Figure 2 shows a sample image with
its corresponding English annotation.
      </p>
      <p>These annotations are stored in a database allowing subsets of the collection to be created
for benchmarking based on specifying particular parameters (e.g. which caption fields to use).
One of these parameters is annotation quality: in order to provide a more realistic scenario, the
annotation files have been generated with a varying degree of annotation “completeness”:
• 70% of the annotations contain title, description, notes, location and date.
• 10% of the annotations contain title, location and date.
• 10% of the annotations contain location and date.</p>
      <p>• 10% of the images are not annotated (or have empty tags respectively).</p>
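      <p>For illustration only, the annotation-completeness split described above can be reproduced with a simple sampling procedure; the field names and the random assignment in the sketch below are assumptions, not the actual generation tool.</p>
      <preformat>
import random

# Hypothetical field names; the real captions use the seven fields listed above.
LEVELS = {
    "full": ["title", "description", "notes", "location", "date"],
    "title_location_date": ["title", "location", "date"],
    "location_date": ["location", "date"],
    "empty": [],
}
WEIGHTS = [0.70, 0.10, 0.10, 0.10]  # the 70% / 10% / 10% / 10% split described above

def assign_fields(image_ids, seed=42):
    """Randomly assign each image a completeness level and return the caption
    fields that remain in its annotation file."""
    rng = random.Random(seed)
    names = list(LEVELS)
    return {img: LEVELS[rng.choices(names, weights=WEIGHTS)[0]] for img in image_ids}

# Example: which fields survive for three sample image identifiers.
print(assign_fields(["img_00001", "img_00002", "img_00003"]))
      </preformat>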
    </sec>
    <sec id="sec-5">
      <title>Query Topics</title>
      <p>Participants were given 60 topics, created using a custom-built topic creation and administration
system to “achieve a natural, balanced topic set accurately reflecting real world user statements of
information needs” [9] (p. 1069). The following information was considered in the topic creation
process:</p>
      <p>Number of topics. In order to increase the reliability of results, a total of 60 topics was
provided to participants.</p>
      <p>Log file Analysis. To make the task realistic, topics were derived from analysing a log
file5 from a web-based interface to the IAPR TC-12 collection which is used by employees and
customers of viventura. A total of 40 topics were taken directly from the log file (semantically
equivalent but perhaps with slight syntactic modification, e.g. “lighthouse sea” to “lighthouses
at the sea”) and 10 topics derived from entries in the log file (e.g. “straight roads in Argentina”
changed to “straight roads in the USA”). The remaining 10 topics were not taken directly from
the log file but created to test various aspects of text and image retrieval (e.g. “black and white
photos of Russia”).</p>
      <p>Geographic Constraints. Corresponding to the findings from previous log file analyses (see,
e.g. [12]), many search requests exhibit geographic constraints and this was found to be similar
with the IAPR TC-12 collection. Thus, 24 of the topics were created with a geographic constraint
(e.g. “tourist accommodation near Lake Titicaca” specifies a location and spatial operator near);
20 of the topics specifying a geographic feature or a permanent man-made object (e.g. “group
standing in salt pan”) and the remaining topics having no geography (e.g. “photos of female
guides”).</p>
      <p>Visual Features. All topics were classified according to how “visual” they were considered
to be. An average rating between 1 and 5 (see footnote 6) was obtained for each topic from three experts in the field
of image analysis, and the retrieval score from a baseline content-based image retrieval (CBIR)
system7. A total of 30 topics are classed as “semantic” (levels 1 and 2) for which visual approaches
are highly unlikely to improve results; 20 topics are “neutral” (level 3) for which visual approaches
may or may not improve results and 10 are “visual” topics for which content-based approaches
are most likely to improve retrieval results.</p>
      <p>Topic Difficulty. A topic complexity measure was used to categorise topics according to their
linguistic complexity [6]. A total of 31 topics were chosen to be rather easy topics (levels 1 and
2), 25 topics were medium–hard topics (level 3), and 4 topics were difficult (levels 4 and 5).</p>
      <p>Size of Target Set. Topic creators aimed for a target set size between 20 and 100 relevant
images and thus had to further modify some of the topics (broadening or narrowing the concepts).
The minimum was chosen in order to be able to use P(20) as a performance measure, whereas the
upper limit should limit the retrieval of relevant images by chance and keep
the relevance judgment pools to a manageable size.</p>
      <p>Annotation Quality. Another dimension considered was the distribution of the topics with
regard to the level of annotation quality of the relevant images for the particular queries. In other
words, 18 topics were provided in which all relevant images have complete annotations, 10 topics
with 80% - 100% of the relevant images having complete annotations, a further 19 topics with 60%
- 80% of the relevant images having complete annotations, and 13 topics with less than 60% of the
relevant images having complete annotations.
5Log file taken between 1st February and 15th April 2006 containing 980 unique queries.</p>
      <p>
        6We asked experts in the field to rate these topics according to the following scheme: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) CBIR will produce
very bad or random results, (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) bad results, (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) average results, (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ) good results and (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ) very good results.
      </p>
      <p>7 The FIRE system was used, based on all query images.</p>
      <p>Attributes of Text Retrieval. Various aspects of text retrieval on a more semantic level
were considered too, concentrating on vocabulary mismatches, general versus specific concepts,
word disambiguation and abbreviations.</p>
      <p>Participant Feedback. In last year’s break-out session, participants suggested we provide
groups of similar topics in order to facilitate the analysis of weakly performing queries. This year
saw groups of up to five similar topics (e.g. “tourist groups / destinations / Machu Picchu in bad
weather”).</p>
      <p>Each original topic comprised a title (a short sentence or phrase describing the search request
in a few words), and a narrative (a description of what constitutes a relevant or non-relevant image
for each request). In addition, three image examples were provided with each topic in order to test
relevance feedback (both manual and automatic) and query-by-example searches. The topic titles
were then translated into 15 languages including German, French, Spanish, Italian, Portuguese,
Dutch, Russian, Japanese, and Simplified and Traditional Chinese. All translations were provided
by at least one native speaker and verified by at least one other native speaker. Unlike in past
campaigns, however, the topic narratives were neither translated nor evaluated this year. A list
of all topics can be found in Table 5.</p>
      <p>In addition, 30 purely visual topics were provided in a visual subtask to attract more visual
groups. These visual topics are, in fact, a modified subset of the 60 original topics in which
non-visual features like geographic constraints or proper names were removed. Only three example
images and no textual information like topic titles or narrative descriptions were provided. Section
3 provides more details about this task.
</p>
    </sec>
    <sec id="sec-6">
      <title>Relevance Assessments</title>
      <p>
        Relevance assessments were carried out by the two topic creators8 using a custom-built online
tool. The top 40 results from all submitted runs were used to create image pools giving an average
of 1,045 images (max: 1468; min: 575) to judge per topic. The topic creators judged all images
in the topic pools and also used interactive search and judge (ISJ) to supplement the pools with
further relevant images. The ISJ was based on purely text searches. The assessments were based
on a ternary classification scheme: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) relevant, (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) partially relevant, and (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ) not relevant. Based
on these judgments, only those images judged relevant by both assessors were considered for the
set of relevant images (qrels).
      </p>
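      <p>For illustration, the pooling and qrel construction described above can be sketched as follows; the run and judgement data structures are assumptions, not those of the custom-built online assessment tool.</p>
      <preformat>
def build_pools(runs, depth=40):
    """runs: {run_id: {topic_id: ranked list of image ids}}.
    Returns, per topic, the union of the top `depth` images over all runs."""
    pools = {}
    for ranking_by_topic in runs.values():
        for topic, ranked in ranking_by_topic.items():
            pools.setdefault(topic, set()).update(ranked[:depth])
    return pools

def build_qrels(judgements_a, judgements_b):
    """judgements_*: {(topic_id, image_id): label} with the ternary labels
    'relevant', 'partial' or 'not relevant'. Only images judged 'relevant'
    by both assessors enter the final qrels."""
    qrels = {}
    for key, label_a in judgements_a.items():
        if label_a == "relevant" and judgements_b.get(key) == "relevant":
            topic, image = key
            qrels.setdefault(topic, set()).add(image)
    return qrels
      </preformat>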
    </sec>
    <sec id="sec-7">
      <title>Participating Groups and Methods</title>
      <p>A record number of 36 groups registered for ImageCLEFphoto this year, with exactly one third
of them submitting a total of 157 runs (all of which were evaluated). This is similar to last
year (11 groups in 2005), although fewer runs were submitted (349 in 2005). Table 2 shows an overview of these
participating groups and the number of runs submitted. New groups submitting in 2006 include
Berkeley, RWTH, CINDI, TUC and CELI. All groups (with the exception of RWTH) submitted
a monolingual English run with the most popular languages appearing as Italian, Japanese and
Simplified Chinese.</p>
      <p>Table 2: participating groups and the number of runs submitted.
Group ID      Runs
Berkeley      7
CEA-LIC2M     5
CELI          9
CINDI         3
DCU           40
IPAL          9(+4)
NII           6
Miracle       30
NTU           30
RWTH          2(+2)
SINAI         12
TUC           4</p>
      <p>A brief description of the methods of the submitted runs is provided for each group (listed
alphabetically by their group ID). Participants were also asked to categorise their submissions
according to the following: query language, annotation language (English or German), type
(automatic or manual), use of feedback or automatic query expansion, and modality (text only, image
only or combined). Table 4 shows the overall results according to runs categorised by these
dimensions. Most submissions made use of the image metadata, with 8 groups submitting bilingual runs
and 11 groups monolingual runs. For many participants, the main focus of their submission was
combining visual and text features (11 groups text-only and 7 groups combined text and image)
and/or using some kind of relevance feedback to provide query expansion (8 groups using some
kind of feedback).</p>
      <p>Berkeley. The School of Information Management and Systems of the University of California
in Berkeley, USA, submitted seven runs. All runs were text only: 4 monolingual English, 2
monolingual German, one bilingual English-German. Berkeley submitted 3 runs using feedback
and 3 runs using title + narrative. The retrieval algorithm used was a form of logistic regression
as used in TREC2, with a blind relevance feedback method (the 10 highest-weighted terms from the top
10 documents). Translation used Babelfish, and expanding queries using the metadata of
relevant images was found to work well. An interesting result was that using query expansion
without any translation of terms worked surprisingly well for the bilingual run.
8 One of the topic creators is part of the viventura travel company.
CEA-LIC2M. The CEA-LIC2M group from Fontenay aux Roses Cedex in France submitted
five runs without using feedback or query expansion techniques. The group submitted 2 visual, 2
text, 1 mixed, 2 monolingual English and 1 bilingual French-English run. Separate initial queries
were performed using the text and visual components of the topics, and the results then merged a posteriori.
Documents and queries are processed using a linguistic analyser to extract “concepts”. Performing
visual retrieval on each query image and merging results appeared to provide better results than
visual retrieval with all three example images simultaneously.</p>
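      <p>As a rough sketch of the blind relevance feedback scheme described above for Berkeley (the 10 highest-weighted terms from the top 10 documents), the following illustrates one possible term-selection step; the tf-idf style weighting shown is an assumption, not the exact formula used by the group.</p>
      <preformat>
import math
from collections import Counter

def expansion_terms(initial_ranking, captions, n_docs=10, n_terms=10):
    """Select expansion terms from the captions of the top-ranked images.

    initial_ranking: list of image ids from the first-pass text run.
    captions: {image_id: caption text}.
    Terms are scored by their frequency in the feedback captions, weighted by
    a crude idf-like factor over the feedback set."""
    feedback = [captions[i].lower().split() for i in initial_ranking[:n_docs]]
    tf = Counter(term for doc in feedback for term in doc)
    df = Counter(term for doc in feedback for term in set(doc))
    weight = {term: tf[term] * math.log(1.0 + n_docs / df[term]) for term in tf}
    ranked = sorted(weight, key=weight.get, reverse=True)
    return ranked[:n_terms]

def expand(query, initial_ranking, captions):
    """The expanded query is the original terms plus the selected feedback terms."""
    return query.split() + expansion_terms(initial_ranking, captions)
      </preformat>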
      <p>CELI. The participants from CELI srl of Torino, Italy, submitted 9 text-only, automatic runs
without feedback: 1 monolingual English and 8 bilingual Italian-English runs, 6 of them with different
query expansion techniques. Translation was achieved using bilingual dictionaries, and a
disambiguation approach based on Latent Semantic Analysis was implemented. Combining the translations with a Boolean AND
operator was found to provide better results than using an OR operator.
P10 and P20 were similar across runs, compared with a more variable
MAP result. The use of query expansion was shown to increase retrieval effectiveness, bridging the
gap between the uncontrolled language of the query and the controlled language of the metadata.
CINDI. The CINDI group from Concordia University in Montreal, Canada, submitted 3
monolingual English runs, 2 text only, 1 mixed, 2 automatic, 1 manual, 2 with feedback (manual), 1
without feedback, 2 with query expansion and 1 without query expansion. The use of manual
relevance feedback and the integration of text and image achieved the best performance for this
group.</p>
      <p>DCU. Dublin City University in Dublin, Ireland, submitted 40 automatic runs, 14 mixed, 26
text-only, 27 with feedback and 13 without feedback. DCU submitted 6 monolingual and 34
bilingual runs exploring 10 different query languages and both annotation languages. Text retrieval
is performed using the BM25 weighting scheme, and visual features are matched using the Jeffrey
Divergence function. Image retrieval on individual images was performed and merged using the
CombMAX operator. Text and visual runs were fused using the weighted CombSUM operator.
The results showed that fused text and image retrieval consistently outperformed text-only
methods. The use of pseudo relevance feedback was also shown to improve the effectiveness of the text
retrieval model.</p>
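      <p>The CombMAX and weighted CombSUM fusion operators mentioned above can be sketched as follows; the per-run min-max normalisation is a common but assumed choice.</p>
      <preformat>
def normalise(scores):
    """Min-max normalise a {doc_id: score} dictionary into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def comb_max(result_lists):
    """CombMAX: keep each document's maximum normalised score over the lists."""
    fused = {}
    for scores in map(normalise, result_lists):
        for doc, s in scores.items():
            fused[doc] = max(fused.get(doc, 0.0), s)
    return fused

def comb_sum(result_lists, weights):
    """Weighted CombSUM: weighted sum of normalised scores per document."""
    fused = {}
    for w, scores in zip(weights, map(normalise, result_lists)):
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return fused

# e.g. fuse a text run with an image run; the weights here are illustrative only:
# fused = comb_sum([text_scores, image_scores], weights=[2.0, 1.0])
      </preformat>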
      <p>IPAL. IPAL Singapore submitted 13 automatic runs (monolingual only): 6 visual, 4 mixed and
3 text only. Various indexing methods were tested and the XIOTA system used for text retrieval.
The group used pseudo relevance feedback and an interesting feature of this was using feedback
from one modality to influence the other (e.g. the result of image ranking used to drive query
expansion through documents). Results indicate that the combination of text and image retrieval
leads to better performance. They submitted a further 4 runs to the visual-only subtask.
NII. The National Institute of Informatics from Tokyo, Japan, submitted 6 text-only, automatic
runs without feedback or query expansion, covering all combinations of three languages:
English, German and Japanese: 1 monolingual English, 1 monolingual German and four bilingual
runs. NII used the Lemur toolkit for text retrieval (unigram language modelling algorithm),
Babelfish for translation, and a visual feature-based micro-clustering algorithm was trialled for
the linking of near identical images annotated in different languages. This clustering approach did
not improve retrieval effectiveness.</p>
      <p>Miracle. The Miracle group of the Daedalus University in Madrid, Spain, submitted 30
automatic runs: 28 text only, 2 mixed and 10 runs involving query expansion based on WordNet.
The group used only the English annotations and generated 18 monolingual English runs and 12
bilingual runs (Russian, Polish, Japanese and simplified Chinese). A total of 8 runs used narrative
descriptions only, 9 runs used both titles and narratives, and the remaining runs used the titles only.
The most effective approach was shown to be the indexing of nouns from the image captions with
no other processing.</p>
      <p>NTU. The National Taiwan University from Taipei, Taiwan, submitted 30 automatic runs: 10
text only, 20 mixed, 12 with feedback and 18 without feedback. A total of 2 monolingual English, 2
monolingual German, 1 visual run and 25 bilingual runs (using English annotations only) exploring
10 different languages were submitted. NTU showed that the use of visual features could improve
text-only retrieval based on the image annotations. A novel word-image ontology approach did
not perform as well as retrieval with the image captions. Systran was used to provide translation
and the initial query images were found to improve ad-hoc retrieval.</p>
      <p>RWTH. The Human Language Technology and Pattern Recognition Group from the RWTH
Aachen University in Aachen, Germany, submitted a total number of 4 entirely visual runs: 2
for the standard ad-hoc task, and 2 to the visual retrieval sub-task. Visual-only retrieval did not
perform well in either task.</p>
      <p>SINAI. The University of Jaén, Spain, submitted 12 automatic text-only runs, 8 runs with
query expansion, using English annotations only. The group submitted 4 monolingual runs and 8
bilingual runs (Dutch, French, German, Italian, Portuguese and Spanish). A number of different
MT systems were used for translation, and the Lemur toolkit implementation of Okapi was used as the
retrieval model.</p>
      <p>TUC. Technische Universität Chemnitz from Germany submitted four automatic monolingual
English runs: 3 text only and 1 mixed; 3 with feedback (and query expansion) and 1 without.
Combining/merging independent visual and text runs appears to give the highest retrieval effectiveness,
together with the use of text-based query expansion.</p>
    </sec>
    <sec id="sec-8">
      <title>Results and Discussion</title>
      <p>Analysis of System Runs
Results for submitted runs were computed using the latest version of TREC EVAL9. Submissions
were evaluated using uninterpolated (arithmetic) Mean Average Precision (MAP) and Precision
at rank 20 (P20) because most online image retrieval engines like Google, Yahoo and Altavista
display 20 images by default. Further measures considered include Geometric Mean Average
Precision (GMAP) to test robustness [10], and the Binary Preference (bpref) measure which is a
good indicator of the completeness of relevance judgments [1]. Using Kendall’s Tau to compare
system rankings between measures, we found correlations above 0.74 between all measures, significant
at the 0.001 level. This requires further investigation, but it would appear that the measure
used to rank systems does affect the system ranking.</p>
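      <p>For reference, these measures can be computed directly from a ranked list and the qrels; the sketch below follows the standard definitions of uninterpolated average precision, precision at a fixed cut-off and geometric MAP, and uses scipy for Kendall’s Tau. It is illustrative only and not the trec_eval implementation.</p>
      <preformat>
import math
from scipy.stats import kendalltau

def average_precision(ranked, relevant):
    """Uninterpolated AP: mean of the precision at each relevant document."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, k=20):
    """P(k): fraction of relevant images among the top k of the ranking."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def mean_ap(run, qrels):
    """run: {topic: ranked list}; qrels: {topic: set of relevant image ids}."""
    aps = [average_precision(run[t], qrels[t]) for t in qrels]
    return sum(aps) / len(aps)

def gmap(run, qrels, eps=1e-5):
    """Geometric MAP; the small epsilon avoids log(0) on failed topics."""
    aps = [average_precision(run[t], qrels[t]) for t in qrels]
    return math.exp(sum(math.log(ap + eps) for ap in aps) / len(aps))

# Agreement between two system orderings, e.g. systems ranked by MAP vs. by P20:
# tau, p_value = kendalltau(order_by_map, order_by_p20)
      </preformat>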
      <p>Table 3 shows the runs which achieved the highest MAP for each language pair. Of these runs,
83% use feedback of some kind (typically pseudo relevance feedback) and a similar proportion use
both visual and textual features for retrieval. It is noticeable that submissions from NTU and
DCU dominate the results (see participant’s workshop papers for further information about their
runs). It is interesting to note that English monolingual outperforms the German monolingual
(19% lower), and the highest bilingual-to-English run was Portuguese-English, which performed at 74%
of monolingual, while the highest bilingual-to-German run was English-German, which performed
at only 39% of monolingual. Also, unlike previous years, the top-performing bilingual runs
have involved Portuguese, traditional Chinese and Russian as the source language showing an
improvement of the retrieval methods using these languages.</p>
      <p>9http://trec.nist.gov/trec eval/trec eval.7.3.tar.gz
• Log file Analysis. For topics taken from the log file MAP=0.1296 (0.0928); topics derived
from the log file MAP=0.1155 (0.0625) and topics not taken from the log file MAP=0.2191
(0.1604). It is likely that most topics not derived from the log file are more “visual” and
perhaps therefore simpler to execute.
• Geographic Constraints. Topics specifying specific locations and spatial operators
MAP=0.1146 (0.0872); topics specifying general locations or man-made objects MAP=0.1785
(0.1111) and topics with no geography MAP=0.1313 (0.1219). Most groups did not use
geographic retrieval methods.
• Visual Features. For topics where it is estimated visual techniques will not improve results
(levels 1 and 2) MAP=0.1179 (0.1041); for topics where visual retrieval could improve results
(level 3) MAP=0.1318 (0.0940) and topics where visual techniques are expected to improve
results (levels 4 and 5) MAP=0.2250 (0.1094). More visual topics are likely to perform better
given many participants made use of combined visual and textual approaches.
• Topic Difficulty. Topics rated as linguistically easy (complexity levels 1 and 2) MAP=0.1794
(0.1191); topics rated as challenging MAP=0.1107 (0.0728) and topics rated as difficult
MAP=0.0234 (0.0240).
• Annotation Quality. Topics with all relevant images having annotations MAP=0.1668
(0.1356); topics with 80-99% of relevant images having annotations MAP=0.1290 (0.0653);
topics with 60-79% of relevant images having annotations MAP=0.1353 (0.1002) and topics
with 0-59% of relevant images having complete annotations MAP=0.1198 (0.1027). The
use of non-text approaches is the likely cause of successful retrieval for topics with relevant
images containing incomplete annotations.</p>
      <p>We are currently investigating the effects of various retrieval strategies (e.g. use of visual and
textual features) on results for different topics which will be reported in further work. We expect
that the use of visual techniques will improve topics which can be considered “more visual” (e.g.
“sunset over water” is more visual than “pictures of female guides” which one could consider more
semantic) and that topics which are considered “more difficult” linguistically (e.g. “bird flying”
is linguistically simpler than “pictures taken on Ayers Rock”) will require more complex language
processing techniques.
</p>
      <sec id="sec-8-1">
        <title>The ImageCLEFphoto visual retrieval sub-task</title>
        <p>The ImageCLEFphoto visual retrieval sub-task offers a challenge that is similar to the general
ImageCLEFphoto task: given a user information need described by three sample images, find as
many relevant images as possible from a given document collection using content-based image
retrieval only.</p>
        <p>The main goal of this task is to investigate the current status quo of CBIR as regards general
photographic collections, or in other words, how well CBIR techniques can, at this stage of research,
handle realistic user queries on general still-natural images (in contrast to very specific tasks); it
was created to further attract more visually orientated groups to ImageCLEFphoto, which was
predominated by participating groups using text-orientated approaches in previous years.
The same document collection was used as with the ImageCLEFphoto task, namely the 20,000
colour photos of the IAPR TC-12 collection, without the corresponding image captions.</p>
        <p>The topic creators selected 30 topics (also from the ImageCLEFphoto task) that were as
collection-independent as possible, removing geographic constraints (e.g. “black and white
photos” instead of “black and white photos from Russia”) and other, non-visual constraints (e.g.
“child wearing baseball cap” instead of “godson wearing baseball cap”) in order to make them
more visual (the narrative descriptions for the relevance assessments were adjusted accordingly). Yet,
the participants were only allowed to use three images representative of the textual description
of each topic10. These 30 topics were further classified into three evenly sized groups according to
how visual they were estimated to be (the same approach as described in the Visual Features
paragraph of section 2.3).</p>
        <p>Based on these findings, the topics were categorized into 10 easy topics that should do well
with CBIR techniques (level &gt; 3), 10 hard topics that will be quite difficult for CBIR (level ≤ 2),
and 10 medium topics that should lie in between these two categories (2 &lt; level ≤ 3). Table 6
displays the title of the visual queries together with the average value of the individual expert
judgments and the aforementioned categorisation.
</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Participating Groups and Methods</title>
      <p>Two out of 12 groups that participated in the general ImageCLEFphoto task also submitted a
total of six runs for the visual subtask.</p>
      <p>IPAL. The IPAL group from Singapore submitted four slightly different runs in which only
visual similarities are used: the query images and all the images of the collection were indexed
with feature reduction using Latent Semantic Indexing, and the images were then ranked according
to their distances to the query images.</p>
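      <p>A rough sketch of such a visual-only run, assuming feature vectors have already been extracted for all images: features are reduced with a truncated SVD (Latent Semantic Indexing) and collection images are ranked by their distance to the closest query image. The concrete features and dimensionality below are assumptions, not IPAL's exact configuration.</p>
      <preformat>
import numpy as np
from sklearn.decomposition import TruncatedSVD

def lsi_rank(collection_feats, query_feats, n_components=64):
    """collection_feats: (N, D) feature matrix of the collection photos;
    query_feats: (3, D) features of the three example images.
    Returns collection indices sorted by ascending distance to the
    nearest query image in the reduced (LSI) space."""
    svd = TruncatedSVD(n_components=n_components)
    reduced = svd.fit_transform(collection_feats)
    queries = svd.transform(query_feats)
    # distance of every collection image to its closest query image
    dists = np.linalg.norm(reduced[:, None, :] - queries[None, :, :], axis=2).min(axis=1)
    return np.argsort(dists)
      </preformat>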
      <p>10The same three sample images as in the ImageCLEFphoto retrieval task were used.</p>
      <p>RWTHi6. The RWTHi6 group from RWTH Aachen University, Germany, submitted two
runs to the visual sub-task: one using invariant feature histograms (IFH) and Tamura texture feature histograms
compared using JSD, weighting the IFH twice as strongly as the texture features on the assumption
that colour information is more important than texture information for databases of general
photographs; the other using 2048-bin histograms of colour image patches, compared
according to their colour and texture using JSD.
The relevance judgments were performed as described in Section 2.4: the top 40 results from
the six submitted runs were used to create image pools giving an average of 171 images (max:
190; min: 83) to judge per topic. The topic creators judged all images in the topic pools and
also heavily used interactive search and judge (ISJ) to supplement the pools with further relevant
images.</p>
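      <p>The histogram comparison in the first RWTHi6 run described above can be sketched as follows; the Jeffrey-divergence form below (a symmetrised, smoothed Kullback-Leibler divergence common in CBIR) is one reading of JSD and, together with the dictionary-based feature layout, is an assumption; the 2:1 weighting follows the run description.</p>
      <preformat>
import numpy as np

def jeffrey_divergence(h, k, eps=1e-10):
    """Jeffrey divergence between two histograms h and k (lower is more similar)."""
    h = np.asarray(h, dtype=float) + eps
    k = np.asarray(k, dtype=float) + eps
    h, k = h / h.sum(), k / k.sum()
    m = 0.5 * (h + k)
    return float(np.sum(h * np.log(h / m) + k * np.log(k / m)))

def combined_distance(query, image, w_colour=2.0, w_texture=1.0):
    """query/image: dicts holding an invariant feature histogram ('ifh') and a
    Tamura texture histogram ('texture'); the 2:1 weighting follows the run
    description above, while the dictionary layout is an assumption."""
    return (w_colour * jeffrey_divergence(query["ifh"], image["ifh"])
            + w_texture * jeffrey_divergence(query["texture"], image["texture"]))
      </preformat>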
      <p>Most runs had quite promising results for precision values at a low cut-off (P20 = 0.285 for
the best run, compare the results shown in Table 7). However, it is felt that this is due to the fact
that some relevant images in the database are visually very similar to the query images, rather
than algorithms really understanding what one is searching for. The retrieved images at higher
ranks seemed to be quite random and further relevant images were only found by chance, which
is also reflected by the quite low MAP scores (0.101 for the best run) and further backs up the
aforementioned assumption.
Many image retrieval systems have recently achieved decent results in retrieval tasks of quite
specific domains or in tasks which are purely tailored to the current level of CBIR. The low results
of the visual sub-task, however, show that content-based image retrieval is a far cry from actually
bridging the semantic gap for visual information retrieval from databases of general, real-life
photographs.</p>
      <p>It has to be further investigated with the participants why only two (out of 36 registered) groups
actually submitted their results. On the one hand, some groups mentioned in their feedback that
they couldn’t submit due to lack of time; the generally low results for this task might have also
discouraged several groups from submitting their results. On the other hand, there were twice as
many groups that submitted purely content-based runs to the main ImageCLEFphoto task; the
question arises whether this visual task was promoted sufficiently; this should
be discussed further with participants.</p>
      <sec id="sec-9-1">
        <title>The Object Annotation Task</title>
        <p>After the big success of the automatic medical annotation task from last year [7], which clearly
showed the need for evaluation challenges in computer vision, and several demands for a similar
task in a less specific domain by participants, a plan for a non-medical automatic image
classification or annotation task was created. In contrast to the medical task, the images to be labeled are
of everyday objects and hence do not require domain-specific expert knowledge. The aim of this newly created image annotation
task is to identify objects shown in images and label the image accordingly. In contrast to the
PASCAL visual object classes challenge11 [4] where several two-class experiments are performed,
i.e. independent prediction of presence or absence of various object classes, here several object
classes are tackled jointly.</p>
        <p>11http://www.pascal-network.org/challenges/VOC/
</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Database &amp; Task Description</title>
      <p>LTUtech12 kindly provided their hand-collected dataset of images from 268 classes. Each image
of this dataset shows one object in a rather clean environment, i.e. the images show the object
and some mostly homogeneous background.</p>
      <p>To facilitate participation in the first year, the number of classes taken into account was
considerably reduced to 21. The classes 1) “ashtrays”, 2) “backpacks”, 3) “balls”, 4) “banknotes”,
5) “benches”, 6) “books”, 7) “bottles”, 8) “cans”, 9) “calculators”, 10) “chairs”, 11) “clocks”,
12) “coins”, 13) “computer equipment ”, 14) “cups and mugs”, 15) “hifi equipment ”, 16)
“cutlery(knives, forks and spoons)”, 17) “plates”, 18) “sofas”, 19) “tables”, 20) “mobile phones”, and
21) “wallets” are used. Removing all images that do not belong to one of these classes leads to
a database of 81,211 images. To create a new set of test data, 1,100 new images of objects from
these classes were taken. In these images, the objects are in a more “natural setting”, i.e. there
is more background clutter than in the training images. To simplify the classification task, it is
specified in advance that each test image belongs to only one of the 21 classes. Multiple objects
of the same class may appear in an image. Objects not belonging to any of the 21 classes may
appear as background clutter.</p>
      <p>The training data was released together with 100 randomly sampled test images with known
classification to allow for tuning of the systems. At a later date, the remaining 1000 test images
were published without their classification as test data.</p>
      <p>The distribution of the classes is not uniform in either of these datasets. An overview of the
distribution of the classes is given in Table 8 and Figure 4 gives an example from the training data
and from the test data for each of the classes. From these images it can be seen that the task is
hard, as the test data contains far more clutter than the training data.
</p>
    </sec>
    <sec id="sec-11">
      <title>Participating Groups &amp; Methods</title>
      <p>In total, 20 groups registered and 3 of these submitted a total of 8 runs. A
very short description of the methods of the submitted runs is provided for each group. The groups are listed
alphabetically by their group id, which is later used in the results section to refer to the groups.
CINDI. The CINDI group from Concordia University in Montreal, Canada submitted 4 runs.
For their experiments they use MPEG7 edge direction histograms and MPEG7 color layout
descriptors which are classified by a nearest neighbor classifier and by different combinations of
support vector machines. They expect their run SVM-Product to be their best submission.
DEU. This group from the Department of Computer Engineering of the Dokuz Eylul University
in Tinaztepe, Turkey submitted 2 runs. For their experiments they use MPEG7 edge direction
histograms and MPEG7 color layout descriptors respectively. For classification, a nearest prototype
approach is taken.</p>
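      <p>A nearest-prototype classifier of the kind DEU describes can be sketched in a few lines: each class is represented by the mean of its training descriptors, and a test descriptor is assigned to the class of the closest prototype. The Euclidean distance used below is an assumption.</p>
      <preformat>
import numpy as np

def fit_prototypes(features, labels):
    """features: (N, D) matrix of training descriptors; labels: length-N class ids.
    Each class is represented by the mean of its training descriptors."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(prototypes, descriptor):
    """Assign a test descriptor to the class whose prototype is closest."""
    return min(prototypes, key=lambda c: np.linalg.norm(descriptor - prototypes[c]))
      </preformat>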
      <p>RWTHi6. The Human Language Technology and Pattern Recognition Group from the RWTH
Aachen University in Aachen, Germany submitted 2 runs. For image representation they use a
bag-of-features approach and for classification a discriminatively trained maximum entropy
(loglinear) model is used. The runs differ with respect to the histogram bins and vector quantization
methods chosen.</p>
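      <p>A compact sketch of such a bag-of-features pipeline, assuming local patch descriptors have already been extracted per image: descriptors are vector-quantised into visual words with k-means, each image becomes a histogram of visual words, and a multinomial logistic regression (itself a maximum-entropy log-linear model) serves as a stand-in for the discriminative classifier. The vocabulary size and the clustering method are assumptions, not the group's exact setup.</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def build_vocabulary(training_patch_descriptors, n_words=2048, seed=0):
    """Cluster local patch descriptors from the training images into visual words."""
    return KMeans(n_clusters=n_words, random_state=seed).fit(training_patch_descriptors)

def bof_histogram(vocabulary, patch_descriptors):
    """L1-normalised histogram of visual-word assignments for one image."""
    words = vocabulary.predict(patch_descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()

def train_classifier(histograms, labels):
    """Multinomial logistic regression, i.e. a discriminative log-linear model."""
    return LogisticRegression(max_iter=1000).fit(histograms, labels)
      </preformat>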
      <p>MedGIFT. The medGIFT group of the University and Hospitals of Geneva submitted three
runs to the medical automatic annotation task. One was entirely based on tf/idf weighting of the
GNU Image Finding Tool (GIFT) and thus acted as a baseline using only collection frequencies
of features with no learning on the training data supplied. The other submission is a combination
of several separate runs by voting. The single results were quite different, so the combination-run
1 – Ashtrays</p>
      <p>2 – Backpacks
4 – Banknotes</p>
      <p>5 – Bench
7 – Bottles
10 – Chairs
13 – Computer
16 – Knives
6 – Bookshelves</p>
      <p>9 – Cans
8 – Calculators
11 – Clocks</p>
      <p>12 – Coins
14 – Cups</p>
      <p>15 – HiFi
17 – MobilePhones
18 – Plates
19 – Sofas
20 – Tables
21 – Wallets
is expected to be the best submission. The runs were submitted after the evaluation ended and
are thus not ranked.
The results of the evaluation are given in Table 9: the runs are sorted by the error rate. Overall,
the error rates are very high due to the very hard task: they range from 77.3% to 93.2%, i.e.
a large part of the test images could not be classified correctly by any of the methods. Table 9
gives details of how many images could be classified correctly by how many classifiers. There is no
test image that was classified correctly by all classifiers, but 411 images were misclassified by all
submitted runs and 301 images could be classified correctly by only one classifier.</p>
      <p>Here too, a combination of classifiers can improve the results: Combining the first two methods
by summing up normalized confidences leads to an error rate of 76.7%. Combining the three best
submissions leads to an error rate of 75.8%. Adding further submissions could not improve the
performance further, and combining all submissions leads to an error rate of 78.8%.
Considering that the error rates of the submitted runs are high and that nearly half of the test images
could not be classified correctly by any of the submitted methods, it can be said that the task
was very challenging. One aspect that certainly contributes to this outcome is that the training
images mainly contain very little clutter, whereas the test images show the objects in
their “natural” environment. None of the groups specifically addressed this issue, although doing so would
be expected to lead to improvements. Furthermore, the results show that discriminatively trained
methods outperform other methods, as in the medical automatic annotation task (although the
improvement is small and probably not statistically significant).</p>
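      <p>The classifier combination reported above (summing normalised confidences) and the error-rate computation can be sketched as follows; the per-run confidence matrices are an assumed input format.</p>
      <preformat>
import numpy as np

def combine_runs(confidence_matrices):
    """Each matrix has shape (n_test_images, n_classes) and holds one run's
    per-class confidences. Rows are normalised to sum to one, the runs are
    added, and the arg-max class is taken per test image."""
    total = np.zeros_like(np.asarray(confidence_matrices[0], dtype=float))
    for conf in confidence_matrices:
        conf = np.asarray(conf, dtype=float)
        total += conf / conf.sum(axis=1, keepdims=True)
    return total.argmax(axis=1)

def error_rate(predicted, true_labels):
    """Fraction of test images whose predicted class differs from the true class."""
    predicted, true_labels = np.asarray(predicted), np.asarray(true_labels)
    return float(np.mean(predicted != true_labels))
      </preformat>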
      <p>The object annotation task and the medical automatic annotation task of ImageCLEF 06 [8]
are very similar, but differ in some critical aspects:
• Both tasks provide a relatively large training set and a disjoint test set. Thus, in both
cases it is possible to learn a relatively reliable model for the training data (this is somewhat
proven for the medical annotation task)
• Both tasks are multi-class/one object per image classification tasks. Here they differ from
the PASCAL visual object classes challenge, which addresses a set of object vs. non-object tasks
where several objects (of equal or unequal type) may be contained in an image.
• The medical annotation task has only grey-scale images, whereas the object task has mainly
colour images. This is probably most relevant for the selection of descriptors.
• The images from the test and the training set are from the same distribution for the medical
task, whereas for the object task, the training images are rather clutter-free and the test
images contain a significant amount of clutter. This is probably relevant and should be
addressed when developing methods for the non-medical task. Unfortunately, the participating
methods did not address this issue which probably has a significant impact on the results.
</p>
      <sec id="sec-12-1">
        <title>Conclusions</title>
        <p>ImageCLEF continues to provide resources to the retrieval and computational vision communities
to facilitate standardised laboratory-style testing of (predominantly text-based) image retrieval
systems. The main division of effort thus far in ImageCLEF has been between medical and
non-medical information systems. These fields have helped to attract different groups to ImageCLEF
(and CLEF) over the past 2-3 years and thereby broaden the audience of this evaluation campaign.
For the retrieval task, the first 2 evaluation events were based on cross-language retrieval from
a cultural heritage collection: the St Andrews historic collection of photographic images. This
provided certain challenges for both the text and visual retrieval communities, most noticeably the
style of language used in the captions and the types of pictures in the collection: mainly
black-and-white images of varying quality and visual degradation. For the automatic annotation/object
classification task the addition of the LTU dataset has provided a more general challenge to
researchers than medical images.</p>
        <p>For 2006, the retrieval task moved to a new collection based on feedback from ImageCLEF
participants in 2005-2006 and the availability of the IAPR-TC12 Benchmark13. Designed
specifically as a benchmark collection, it is well-suited for use in ImageCLEF with captions in multiple
languages and high-quality colour photographs covering a range of topics. This type of collection
- personal photographs - is likely to become of increasing interest to researchers with the growth
of the desktop search market and popularity of tools such as FlickR14.</p>
        <p>As in previous years, the ImageCLEFphoto task has shown the usefulness of combining
visual and textual features derived from the images themselves and associated image captions.
It is noticeable that, although some topics are more “visual” than others and likely to benefit
more from visual techniques, the majority of topics seem to benefit from a combination of text
and visual approaches and participants continue to deal with issues involved in combining this
evidence. In addition, the use of relevance feedback to facilitate, for example, query expansion
in text retrieval continues to improve the results of many topics in collections used so far, likely
due to the nature of the text associated with images: typically a controlled vocabulary that lends
itself to blind relevance feedback.</p>
        <p>The object annotation task has shown that current approaches to image classification and/or
annotation have problems with test data that is not from the same distribution as the provided
training data. Given the current high interest in object recognition and annotation in the computer
vision community it is to be expected that big improvements are achievable in the area of automatic
image annotation in the near future. It is planned to use image annotation techniques as a
preprocessing step for a multi-modal information retrieval system: given an image, create an
annotation and use the image and the generated annotation to query a multi-modal information
retrieval system, which is likely to improve the results given the much better performance of
combined runs in the photographic retrieval task.</p>
      </sec>
      <sec id="sec-12-2">
        <title>Acknowledgements</title>
        <p>We would like to thank the CLEF campaign for supporting the ImageCLEF initiative.
Furthermore, special thanks go to viventura, the IAPR and LTUtech for providing their image databases
for this year’s tasks, and to Tobias Weyand for creating the web interface for submissions.</p>
        <p>This work was partially funded by the DFG (Deutsche Forschungsgemeinschaft) under
contracts NE-572/6 and Le-1108/4, the Swiss National Science Foundation (FNS) under contract
205321-109304/1, the American National Science Foundation (NSF) with grant ITR–0325160, an
International Postgraduate Research Scholarship (IPRS) by Victoria University, and the EU Sixth
Framework Program with the SemanticMining project (IST NoE 507505) and the MUSCLE NoE.
13One of the biggest factors influencing what collections are used and provided by ImageCLEF is copyright.
14http://www.flickr.com</p>
        <p>[Table 7: results of the six visual sub-task runs: RWTHi6-IFHTAM, RWTHi6-PatchHisto,
IPAL-LSA3-VisualTopics, IPAL-LSA2-VisualTopics, IPAL-LSA1-VisualTopics and IPAL-MF-VisualTopics.]</p>
        <p>[Table 8: distribution of the images over the 21 classes (Ashtrays, Backpacks, Balls, Banknotes,
Bench, Books, Bottles, Calculators, Cans, Chairs, Clocks, Coins, Computing equipment, Cups, HiFi,
Cutlery, Mobile Phones, Plates, Sofas, Tables, Wallets) in the training and test data.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Chris</given-names>
            <surname>Buckley and Ellen M. Voorhees</surname>
          </string-name>
          .
          <article-title>Retrieval evaluation with incomplete information</article-title>
          .
          <source>In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          , New York, NY, USA,
          <year>2004</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Cyril</surname>
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Cleverdon</surname>
          </string-name>
          .
          <article-title>Report on the testing and analysis of an investigation into the comparative efficiency of indexing systems</article-title>
          .
          <source>Technical report</source>
          , Aslib Cranfield Research Project, Cranfield, UK,
          <year>September 1962</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Paul</given-names>
            <surname>Clough</surname>
          </string-name>
          , Henning Müller, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Sanderson</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF cross-language image retrieval track (ImageCLEF) 2004</article-title>
          . In Carol Peters,
          <string-name>
            <given-names>Paul D.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gareth J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          , Julio Gonzalo,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kluck</surname>
          </string-name>
          , and B. Magnini, editors,
          <source>Multilingual Information Access for Text</source>
          ,
          <article-title>Speech and Images: Result of the fifth CLEF evaluation campaign</article-title>
          , Lecture Notes in Computer Science, Bath, England,
          <year>2005</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Everingham</surname>
          </string-name>
          , Andrew Zisserman,
          <string-name>
            <surname>Christopher K. I. Williams</surname>
          </string-name>
          , Luc van Gool,
          <string-name>
            <surname>Moray Allan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Christopher M. Bishop</surname>
            , Olivier Chapelle, Navneet Dalal, Thomas Deselaers, Gyuri Dorko, Stefan Duffner, Jan Eichhorn,
            <given-names>Jason D. R.</given-names>
          </string-name>
          <string-name>
            <surname>Farquhar</surname>
            , Mario Fritz, Christophe Garcia, Tom Griffiths, Frederic Jurie, Daniel Keysers, Markus Koskela, Jorma Laaksonen, Diane Larlus, Bastian Leibe, Hongying Meng, Hermann Ney, Bernt Schiele, Cordelia Schmid, Edgar Seemann, John Shawe-Taylor, Amos Storkey, Sandor Szedmak, Bill Triggs, Ilkay Ulusoy, Ville Viitaniemi,
            <given-names>and Jianguo Zhang.</given-names>
          </string-name>
          <article-title>The 2005 pascal visual object classes challenge</article-title>
          .
          <source>In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment (PASCAL Workshop 05), number 3944 in Lecture Notes in Artificial Intelligence</source>
          , pages
          <fpage>117</fpage>
          -
          <lpage>176</lpage>
          , Southampton, UK,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Grubinger</surname>
          </string-name>
          , Paul Clough, Henning Müller, and Thomas Deselaers.
          <article-title>The IAPR-TC12 benchmark: A new evaluation resource for visual information systems</article-title>
          . In International Workshop OntoImage'
          <year>2006</year>
          <article-title>Language Resources for Content-Based Image Retrieval, held in conjunction with LREC'06</article-title>
          , pages
          <fpage>13</fpage>
          -
          <lpage>23</lpage>
          , Genoa, Italy, May
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Grubinger</surname>
          </string-name>
          , Clement Leung, and
          <string-name>
            <given-names>Paul</given-names>
            <surname>Clough</surname>
          </string-name>
          .
          <article-title>Linguistic estimation of topic difficulty in cross-language image retrieval</article-title>
          .
          <source>In CLEF 2005: Overview of the Cross Language Evaluation Forum</source>
          <year>2005</year>
          , page to appear,
          <year>September 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Henning</given-names>
            <surname>Müller</surname>
          </string-name>
          , Antoine Geissbuhler, Johan Marty,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Lovis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Ruch</surname>
          </string-name>
          .
          <article-title>The Use of medGIFT and easyIR for ImageCLEF 2005</article-title>
          .
          <source>In Proceedings of the Cross Language Evaluation Forum</source>
          <year>2005</year>
          , LNCS, page in press, Vienna, Austria,
          <year>September 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Henning</given-names>
            <surname>Müller</surname>
          </string-name>
          , Thomas Deselaers, Thomas Lehmann, Paul Clough, and
          <string-name>
            <given-names>William</given-names>
            <surname>Hersh</surname>
          </string-name>
          .
          <article-title>Overview of the ImageCLEFmed 2006 medical retrieval and annotation tasks</article-title>
          .
          <source>In CLEF working notes</source>
          , Alicante, Spain,
          <year>September 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Peters</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Braschler</surname>
          </string-name>
          .
          <article-title>Cross-language system evaluation: The CLEF campaigns</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>52</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1067</fpage>
          -
          <lpage>1072</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Ellen</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Voorhees</surname>
          </string-name>
          .
          <article-title>The TREC robust retrieval track</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>39</volume>
          (
          <issue>1</issue>
          ):
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Ellen</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Voorhees</surname>
            and
            <given-names>Donna</given-names>
          </string-name>
          <string-name>
            <surname>Harman</surname>
          </string-name>
          .
          <article-title>Overview of the seventh Text REtrieval Conference (TREC-7)</article-title>
          .
          <source>In The Seventh Text Retrieval Conference</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          , Gaithersburg,
          <string-name>
            <surname>MD</surname>
          </string-name>
          , USA,
          <year>November 1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Vivian</surname>
            <given-names>Zhang</given-names>
          </string-name>
          , Benjamin Rey, Eugene Stipp, and
          <string-name>
            <given-names>Rosie</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Geomodification in query rewriting</article-title>
          .
          <source>In GIR '06: Proceedings of the Workshop on Geographic Information Retrieval</source>
          ,
          <string-name>
            <surname>SIGIR</surname>
          </string-name>
          <year>2006</year>
          , page to appear, August 10 2006.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>