<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF 2009 medical image retrieval track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Henning Müller</string-name>
          <email>henning.mueller@sim.hcuge.ch</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jayashree Kalpathy-Cramer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Eggel</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steven Bedrick</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saïd Radhouani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brian Bakke</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charles E. Kahn Jr.</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>William Hersh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Radiology, Medical College of Wisconsin</institution>
          ,
          <addr-line>Milwaukee, WI</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Oregon Health and Science University (OHSU)</institution>
          ,
          <addr-line>Portland, OR</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University Hospitals and University of Geneva</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Applied Sciences Western Switzerland (HES-SO)</institution>
          ,
          <addr-line>Sierre</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract</title>
      <p>2009 was the sixth year for the ImageCLEF medical retrieval task. Participation
was strong again with 38 registered research groups. 17 groups submitted runs and
thus participated actively in the tasks. The database in 2009 was similar to the one
used in 2008, containing scientific articles from two radiology journals, Radiology and
Radiographics. The size of the database was increased to a total of 74,902 images.
For each image, captions and access to the full text article through the Medline PMID
(PubMed Identifier) were provided. An article’s PMID could be used to obtain the
officially assigned MeSH (Medical Subject Headings) terms. The collection was entirely
in English. However, the topics were, as in previous years, supplied in German, French,
and English. Twenty–five image–based topics were provided, of which ten each were
visual and mixed and five were textual. In addition, for the first time, 5 case–based
topics were provided as an exploratory task. Here the unit of retrieval was intended
to be the article and not the image. Case–based topics are designed to be a step
closer to the clinical workflow. Clinicians often seek information about patient cases
with incomplete information consisting of symptoms, findings, and a set of images.
Supplying cases to a clinician from the scientific literature that are similar to the case
(s)he is treating can be an important application of image retrieval in the future.</p>
      <p>As in previous years, most groups concentrated on fully automatic retrieval.
However, four groups submitted a total of seven manual or interactive runs. The interactive
runs submitted this year performed quite well compared to previous years but did not
show a substantial increase in performance over the automatic approaches. In
previous years, multimodal combinations were the most frequent submissions. However,
this year, as in 2008, only about half as many mixed runs as purely textual runs were
submitted. Very few fully visual runs were submitted, and again, the ones submitted
performed poorly. The best mean average precisions (MAP) were obtained using
automatic textual methods. There were mixed feedback runs that had high MAP. The best
early precision was also obtained using automatic textual methods, with a few mixed
automatic runs also doing well. We had the opportunity to perform multiple
judgments on some topics. The kappas used as the metric for inter–rater agreement were
mostly quite high (&gt;0.7). However, one of our judges consistently had low kappas as he
was significantly more lenient than his colleagues. We evaluated the overall performance of
groups using strict and lenient judges and found that there was high correlation even
though the absolute values for the metrics were different.</p>
      <p>We also introduced a lung nodule detection task in 2009. This task used the CT
slices from the Lung Image Database Consortium (LIDC) which included ground truth
in the form of manual annotations. The goal of the task was to create algorithms to
automatically detect lung nodules. Although there seemed to be significant interest
in the task, as evidenced by the substantial number of registrations, only two groups
submitted results, with proprietary software from an industry participant achieving
impressive results.</p>
    </sec>
    <sec id="sec-2">
      <title>Categories and Subject Descriptors</title>
      <p>H.3 [Information Storage and Retrieval] : H.3.1 Content Analysis and Indexing; H.3.3
Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries</p>
    </sec>
    <sec id="sec-3">
      <title>General Terms</title>
      <sec id="sec-3-1">
        <title>Measurement, Performance, Experimentation</title>
      </sec>
      <sec id="sec-3-2">
        <title>Medical image retrieval, image retrieval, multimodal retrieval</title>
        <sec id="sec-3-2-1">
          <title>Introduction</title>
          <p>
            ImageCLEF<sup>1</sup> [
            <xref ref-type="bibr" rid="ref1 ref2 ref4">1, 2, 4</xref>
            ] started in 2003 as part of the Cross Language Evaluation Forum (CLEF<sup>2</sup>,
[9]). A medical image retrieval task was added in 2004 and has been held every year since [
            <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
            ].
The main goal of ImageCLEF in the past has been to promote multi–modal information retrieval
by combining a variety of media including text and images for more effective information retrieval.
As such, it has always contained visual, textual and mixed tasks and sub–tracks. The medical
image retrieval track began in 2004 as a primarily visual information retrieval task with a teaching
database of 8,000 images. Since then, it progressed to a collection of over 66,000 images from
several teaching collections with topics that were best suited for textual, visual and mixed methods.
In 2008, images from the medical literature were used for the first time, moving the task one step
closer towards applications that can be of interest in clinical scenarios. Several user studies have
been performed to study the image searching behaviour of clinicians [
            <xref ref-type="bibr" rid="ref3 ref5 ref6">5, 6, 3</xref>
            ]. These studies have
been used to create the task and the topics over the years. This year, for the first time, we
introduced a case–based retrieval task as we continue to strive for scenarios that more closely
resemble actual clinical work–flows.
          </p>
          <p>This paper reports on the medical retrieval task. Additionally, other papers within ImageCLEF
describe the other five tasks of ImageCLEF 2009. More information on the tasks and on how to
participate in CLEF can also be found on the ImageCLEF web pages.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Participation, Data Sets, Tasks, Ground Truth</title>
          <p>This section describes the details concerning the set–up and the participation in the medical
retrieval task in 2009. A new management system for participation in ImageCLEF was created
to better manage the increasing number of registrations and submissions to the ImageCLEF
benchmark in a fully electronic fashion. The interface allowed registrations for particular tasks,
provided the links to the description of the data sets available, allowed submission of the results
and enabled the final evaluation.</p>
          <p><sup>1</sup> http://www.imageclef.org/
<sup>2</sup> http://www.clef-campaign.org/</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Participation</title>
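      <p>In 2009, a new record of 85 research groups registered for the seven sub–tasks of ImageCLEF.
For the medical retrieval task the participation remained similar to the previous year with 37
registrations. 17 of the participants submitted results to the tasks, a slight increase from 15 in
2008. The following groups submitted at least one run:</p>
      <list list-type="bullet">
        <list-item><p>NIH (USA);</p></list-item>
        <list-item><p>Liris (France);</p></list-item>
        <list-item><p>ISSR (Egypt)∗;</p></list-item>
        <list-item><p>UIIP Minsk (Belarus)∗;</p></list-item>
        <list-item><p>MedGIFT (Switzerland);</p></list-item>
        <list-item><p>Sierre (Switzerland)∗;</p></list-item>
        <list-item><p>SINAI (Spain);</p></list-item>
        <list-item><p>Miracle (Spain);</p></list-item>
        <list-item><p>BiTeM (Switzerland);</p></list-item>
        <list-item><p>York University (Canada)∗;</p></list-item>
        <list-item><p>AUEB (Greece);</p></list-item>
        <list-item><p>University of Milwaukee (USA)∗;</p></list-item>
        <list-item><p>University of Alicante (Spain);</p></list-item>
        <list-item><p>University of North Texas (USA)∗;</p></list-item>
        <list-item><p>OHSU (USA);</p></list-item>
        <list-item><p>University of Fresno (USA);</p></list-item>
        <list-item><p>DEU (Turkey).</p></list-item>
      </list>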
      <p>Participants marked with a star had not previously participated in the medical retrieval task.
The number of first–time participants is thus fairly high, with six among the 17
participants.</p>
      <p>A total of 124 valid runs were submitted, 106 for the image–based topics and 18 for the
case–based topics, an increase compared to the 111 runs submitted last year. The number of runs
per group was limited to ten per subtask, with case–based and image–based topics treated as
separate subtasks.</p>
    </sec>
    <sec id="sec-5">
      <title>Datasets</title>
      <p>The database in 2009 was again made accessible by the Radiological Society of North America
(RSNA<sup>3</sup>). The database contained a total of 74,902 images, the largest collection yet. All images
are taken from the journals Radiology and Radiographics of the RSNA. A similar database is
also available via the Goldminer<sup>4</sup> interface. This collection constitutes an important body of
medical knowledge from the peer–reviewed scientific literature including high quality images with
annotations. Images are associated with journal articles and can be part of a figure. Figure
captions are made available to participants, as is the part of the caption concerning a particular
subfigure where available. This creates high–quality textual annotations enabling textual searching in addition to
content–based retrieval. As the PubMed IDs were also made available, participants could access
the MeSH (Medical Subject Headings) terms created by the National Library of Medicine for
PubMed<sup>5</sup>.</p>
      <p><sup>3</sup> http://www.rsna.org/
<sup>4</sup> http://goldminer.arrs.org/</p>
    </sec>
    <sec id="sec-6">
      <title>Image–Based Topics</title>
      <p>
        The image-based topics were created using methods similar to previous years where realistic search
topics were identified by surveying actual user needs. The starting point for this year’s topics was
a user study [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] conducted at Oregon Health &amp; Science University (OHSU) during early 2009.
Based on qualitative methods, this study was conducted with 37 medical practitioners in order
to understand the needs, both met and unmet, in medical image retrieval. The first part of the
study was dedicated to the investigation of the characteristics of a large portion of the population
served by medical image retrieval systems (e.g., their background, searching habits, etc.). After
a demonstration of state–of–the–art image retrieval systems, the second part of the study was
devoted to learning about the motivation and tasks for which the intended audience uses medical
image retrieval systems (e.g., contexts in which they seek medical images, types of useful images,
numbers of desired answers, etc.). In the third and last part, the participants were asked to use
the demonstrated systems, trying to solve challenging queries, and provide responses to them in
terms of how likely they would be to use them, which aspects they did and did not like, and which
missing features they would like to see added. In total, the 37 participants used the demonstrated
systems to perform a total of 95 searches using textual queries in English. We randomly selected
25 candidate queries from the 95 searches to create the topics for ImageCLEFmed 2009. We added
to each candidate query 2 to 4 sample images from the previous collections of ImageCLEFmed.
Then, for each topic, we provided a French and a German translation of the original textual
description provided by the participants. Finally, the resulting set of the topics was categorized
into three groups: 10 visual topics, 10 mixed topics, and 5 semantic topics. The entire set of topics
was finally approved by a physician.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Case–Based Topics</title>
      <p>Case–based topics were made available for the first time in 2009. The goal was to move image
retrieval potentially closer to clinical routine by simulating the use case of a clinician who is in the
process of diagnosing a difficult case. Providing this clinician with articles from the literature that
treat cases similar to the case (s)he is working on (“similar” based on images and other clinical
data on the patient) can be a valuable aid in choosing a good diagnosis or treatment.</p>
      <p>The topics were created based on cases from the teaching file Casimage. This teaching file
contains cases including images from radiological practice. 10 cases were pre–selected and a search
with the diagnosis was performed in the ImageCLEF data set to make sure that there were at
least a few matching articles. Five topics were finally chosen. The diagnosis and all information
on the chosen treatment were then removed from the cases to simulate the situation of a clinician
who has to diagnose the patient. In order to make the judging more consistent, the relevance
judges were provided with the original diagnosis for each case.</p>
    </sec>
    <sec id="sec-8">
      <title>Relevance Judgements</title>
      <p>The relevance judgements were performed with the same on–line system as in 2008 for the image–
based topics. The system was adapted for the case–based topics showing the article title and
several images appearing in the text (currently the first six, but this can be configured). Besides a
short description for the judgements, a full document was prepared to describe the judging process,
including what should be regarded as relevant versus non–relevant. A ternary judgement scheme
was used again, wherein each image in each pool was judged to be “relevant”, “partly relevant”,
or “non–relevant”. Images clearly corresponding to all criteria were judged as “relevant”, images
whose relevance could not be safely confirmed but could still be possible were marked as “partly
relevant”, and images for which one or more criteria of the topic were not met were marked as
“non–relevant”. Judges were instructed in these criteria and results were manually verified during
the judgement process.</p>
      <p>We had the opportunity to perform multiple judgements on many topics, both image–based
and case–based. Inter–rater agreement was assessed using the kappa metric, given as
<disp-formula><label>(1)</label><tex-math><![CDATA[\kappa = \frac{P(A) - P(E)}{1 - P(E)}]]></tex-math></disp-formula>
where P(A) is the observed agreement between judges and P(E) is the expected (random)
agreement. These are calculated using a 2×2 table for the relevances of images or articles. Kappa was
calculated for both lenient judgments, where “partly relevant” is considered relevant, and strict judgments,
where “partly relevant” is considered non–relevant. It is generally accepted that a kappa &gt; 0.7 is
good and sufficient for an evaluation. In general the agreement between the judges was fairly high,
with few exceptions, and the overall average κ is similar to that of other evaluation campaigns. Judging
the case–based topics appears to take the judges longer, but we did not receive much
feedback, so all judges seemed to be satisfied with the written description of the judgments that
was supplied.</p>
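      <p>The kappa computation described above can be illustrated with a minimal sketch. It assumes two judges labelled the same items with the ternary scheme, collapses the labels to binary under the lenient or strict rule, and estimates P(E) from each judge's marginal rates (the usual Cohen's kappa). The function names and the toy labels are illustrative assumptions, not the scripts used by the organisers.</p>
      <preformat preformat-type="code">
```python
from typing import Sequence


def binarize(labels: Sequence[str], lenient: bool) -> list:
    """Collapse ternary judgements to binary relevance.

    Lenient: "partly relevant" counts as relevant. Strict: it does not.
    """
    relevant = {"relevant", "partly relevant"} if lenient else {"relevant"}
    return [label in relevant for label in labels]


def cohen_kappa(judge_a: Sequence[bool], judge_b: Sequence[bool]) -> float:
    """Kappa = (P(A) - P(E)) / (1 - P(E)) from the 2x2 agreement table."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("both judges must label the same non-empty item set")
    n = len(judge_a)
    # Diagonal cells of the 2x2 table: both relevant, both non-relevant.
    both_rel = sum(1 for a, b in zip(judge_a, judge_b) if a and b)
    both_non = sum(1 for a, b in zip(judge_a, judge_b) if not a and not b)
    p_observed = (both_rel + both_non) / n
    # Chance agreement estimated from each judge's marginal relevance rate.
    rate_a = sum(1 for a in judge_a if a) / n
    rate_b = sum(1 for b in judge_b if b) / n
    p_expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_expected == 1.0:
        return 1.0  # both judges gave every item the same single label
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical judgements for four images (invented for illustration).
judge_1 = ["relevant", "partly relevant", "non-relevant", "non-relevant"]
judge_2 = ["relevant", "non-relevant", "non-relevant", "non-relevant"]
for lenient in (True, False):
    kappa = cohen_kappa(binarize(judge_1, lenient), binarize(judge_2, lenient))
    print("lenient" if lenient else "strict", round(kappa, 3))
```
      </preformat>
      <p>In this toy example the two judges disagree only on the “partly relevant” image, so kappa is lower under the lenient rule than under the strict one.</p>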
      <sec id="sec-8-1">
        <title>Results</title>
        <p>This section describes the results of ImageCLEF 2009. Runs are ordered based on the techniques
used (visual, textual, mixed) and the interaction used (automatic, manual). Case–based topics
and image–based topics are separated but compared in the same sections.</p>
        <p>A more detailed evaluation of the techniques will follow in the final proceedings when more
details on the techniques used for the submissions will be known. Unfortunately, information on
the techniques used in the submissions is not always made available by the participants well ahead
of time and in sufficient detail.</p>
        <p>The trec_eval tool was used for the evaluation process, and we made use of most of its
performance measures.</p>
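        <p>As an illustration of the main measure reported below, the following is a minimal sketch of mean average precision (MAP). It is not the trec_eval implementation: it assumes binary relevance, averages over every topic present in the judgements, and uses invented topic and image identifiers.</p>
        <preformat preformat-type="code">
```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    seen = set()
    for rank, doc_id in enumerate(ranked_ids, start=1):
        # Count each relevant id once, at the rank where it first appears.
        if doc_id in relevant and doc_id not in seen:
            hits += 1
            precision_sum += hits / rank
        seen.add(doc_id)
    # Relevant items that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(run, qrels):
    """Mean of per-topic average precision over every topic in qrels."""
    if not qrels:
        raise ValueError("qrels must contain at least one topic")
    scores = [
        average_precision(run.get(topic, []), relevant)
        for topic, relevant in qrels.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two topics with invented image ids.
run = {"1": ["a", "x", "b"], "2": ["y", "c"]}
qrels = {"1": {"a", "b"}, "2": {"c"}}
print(round(mean_average_precision(run, qrels), 4))
```
        </preformat>
        <p>A run that omits a topic receives an average precision of zero for it in this sketch, which is one way a run covering only a subset of the topics ends up with a low average.</p>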
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Submissions</title>
      <p>The number of submitting teams was slightly higher in 2009 than in 2008, at 17 compared with 15.
The number of runs increased from 111 to 124. This was partly due to the fact that, with the case–based topics
and the image–based topics, there were two more run categories.</p>
      <p>A total of 124 runs were submitted via the electronic submission system. Scripts to check the
validity of the runs were made available to participants ahead of the submission phase, so only
a few runs contained errors in either content or format and required changes. Common mistakes
included a wrong trec_eval format, use of only a subset of the topics, and incorrect image identifiers.
In collaboration with the participants, a large number of runs were quickly repaired, resulting in
122 valid runs taken into account for the pools.</p>
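      <p>The three common mistakes listed above can be checked mechanically. The sketch below is not the organisers' validation script. It assumes the standard six-column trec_eval run format (topic, the literal Q0, document id, rank, score, run tag), and the function name, topic ids and image ids are hypothetical.</p>
      <preformat preformat-type="code">
```python
def check_run(lines, topic_ids, image_ids):
    """Return a list of human-readable problems found in a run file."""
    problems = []
    seen_topics = set()
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != 6:
            problems.append(f"line {number}: expected 6 fields, got {len(fields)}")
            continue
        topic, _q0, image, rank, score, _tag = fields
        if topic in topic_ids:
            seen_topics.add(topic)
        else:
            problems.append(f"line {number}: unknown topic {topic}")
        if image not in image_ids:
            problems.append(f"line {number}: unknown image identifier {image}")
        if not rank.isdigit():
            problems.append(f"line {number}: rank {rank} is not an integer")
        try:
            float(score)
        except ValueError:
            problems.append(f"line {number}: score {score} is not a number")
    # A run should cover every topic, not only a subset.
    missing = sorted(set(topic_ids) - seen_topics)
    if missing:
        problems.append("no results for topics: " + ", ".join(missing))
    return problems


# Hypothetical run with one bad identifier, one malformed line, one missing topic.
example = ["1 Q0 img_a 1 0.9 demo", "2 Q0 img_x 1 0.8 demo", "bad line"]
for problem in check_run(example, {"1", "2", "3"}, {"img_a", "img_b"}):
    print(problem)
```
      </preformat>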
      <p>In total, only 13 runs were “manual” or “interactive”. There were only 16 “visual–only” runs. The
large majority were “text–only” runs, with 59 submissions. There were 30 mixed runs.</p>
      <p>Groups subsequently had the chance to evaluate additional runs themselves as the qrels were
made available to participants 2 weeks ahead of the submission deadline for the working notes.</p>
    </sec>
    <sec id="sec-10">
      <title>Image–Based Results</title>
      <p>The number of visual runs in 2009 was small, and the results are not improving as fast as
with textual retrieval techniques. 5 groups submitted a total of 16 runs in 2009, one of which was
a feedback run. Performance as measured in MAP is very low for all these runs, reaching a maximum
of 0.0136 for the best run. Both early precision and recall were quite low for the visual runs when
compared to the textual runs, but there were significant differences between the visual runs. The
University of Minsk only submitted runs to a subset of the topics and this made their average
performance look much worse than the other runs. A more detailed per–topic analysis seems
necessary to really compare the systems.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Run</th></tr>
          </thead>
          <tbody>
            <tr><td>CBIR FUSION MERGE</td></tr>
            <tr><td>medGIFT sep max</td></tr>
            <tr><td>medGIFT sum withAR</td></tr>
            <tr><td>CBIR FUSION CV MERGE</td></tr>
            <tr><td>medGIFT sep sum</td></tr>
            <tr><td>medGIFT sep max withAR</td></tr>
            <tr><td>CBIR FUSION CATEGORY</td></tr>
            <tr><td>medGIFT sum withNegImg</td></tr>
            <tr><td>medGIFT max withNegImg.txt</td></tr>
            <tr><td>clef2009</td></tr>
            <tr><td>UIIPMinsk visual 1</td></tr>
            <tr><td>UIIPMinsk visual 2</td></tr>
            <tr><td>CSUFresno visual CEDD</td></tr>
            <tr><td>CSUFresno visual CEDD</td></tr>
            <tr><td>CSUFresno visual CEDD</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Table 1 shows the results and particularly the large differences between the runs. Runs retrieved
between 13 and 315 of 2362 possible relevant images, which is substantially fewer than even the
poorest–performing textual runs.</p>
      <p>Part of the poor performance can be explained by the extremely well annotated database, which
created a much larger gap between visual and textual results. The topics in ImageCLEFmed
also became harder, making even the visual topics more semantic than before. This corresponds
clearly to user needs. The small number of submitted visual runs also biases the pools towards
the textual runs, widening the gap even further.</p>
      <p><bold>Textual Retrieval</bold></p>
      <p>Purely automatic textual retrieval had by far the largest number of runs in 2009 with 52, more
than 46% of all submitted runs. Table 2 shows the results for all submitted automatic text runs,
ordered by MAP. Most performance measures, such as bpref and early precision, give a similar
order; early precision alone sometimes differs significantly from the ranking by MAP.</p>
      <p>Runs from LIRIS obtained the best results, with 8 of the top 10 runs. These used conceptual
language modelling with the additional use of the UMLS (Unified Medical Language System)
metathesaurus. They had many runs with MAP between 0.41 and 0.43. A more detailed analysis
of the exact techniques applied in each of the runs is required.</p>
      <p><bold>Multimodal Retrieval</bold></p>
      <p>The promotion of mixed–media retrieval has always been one of the main goals of ImageCLEF. In
past years, mixed–media retrieval had the highest submission rate. In 2009 as in 2008, however,
only about half as many mixed runs as purely textual runs were submitted.</p>
      <p>Table 3 shows the results for all submitted runs. It is clear that, for a large number of the runs,
the MAP results for the mixed retrieval submissions were very similar to those from the purely
textual retrieval systems. An interesting observation is that, for some groups, the mixed–media
submissions often have higher early precision than the purely textual retrieval submissions.</p>
      <p>All runs exhibited relatively high correlation between MAP and bpref.</p>
      <p>From examining mixed–media runs which had corresponding text–only runs, it is particularly
clear that combining good textual retrieval techniques with questionable visual retrieval techniques
can negatively affect system performance. This demonstrates the difficulty of usefully integrating
[…] for the textual runs is at around 0.3.</p>
      <p><bold>Interactive Retrieval</bold></p>
      <p>This year, as in previous years, interactive retrieval was only used by a very small number of
participants. However, the manual and interactive runs submitted this year performed relatively
well, with one of the runs achieving the highest overall early precision (P5 and P10). Table 4 shows
the results of all manual and interactive runs submitted.</p>
      <p>There is definitely a need to promote interactive and manual retrieval further, as their potential
does not seem to have been exploited well so far.</p>
    </sec>
    <sec id="sec-11">
      <title>Case–based results</title>
      <p>A total of six groups participated in this introductory task, submitting a total of 18 runs. The
results were quite promising, with one group achieving a relatively high MAP of 0.33. As with
the image–based retrieval, automatic textual runs achieved the best results, with poor results
[…] 0.0025 to 0.335.</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th>Run</th></tr>
          </thead>
          <tbody>
            <tr><td>ceb-cases-essie2-automatic</td></tr>
            <tr><td>sinai TA cbt</td></tr>
            <tr><td>sinai TA cbtM</td></tr>
            <tr><td>clef2009</td></tr>
            <tr><td>HES-SO-VS txt case</td></tr>
            <tr><td>Alicante-CaseBased-Run5</td></tr>
            <tr><td>Alicante-CaseBased-Run2</td></tr>
            <tr><td>Alicante-CaseBased-Run4</td></tr>
            <tr><td>Alicante-CaseBased-Run3</td></tr>
            <tr><td>Alicante-CaseBased-Run1</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>This is not entirely surprising, as the sample images provided for each topic were quite varied
in visual appearance, and it remains to be explored how this information can be used well.</p>
      <p><bold>Textual Retrieval</bold></p>
      <p>Textual methods were more effective in retrieving relevant articles, as seen in Table 6.
Interestingly, the early precision for the best runs was not significantly higher than the MAP,
unlike in the image–based topics where the early precision was substantially higher than the MAP
for many runs.</p>
      <p><bold>Multimodal Retrieval</bold></p>
      <p>Unlike the image–based topics, here the multimodal runs performed quite poorly, as seen in Table 7.
This could be due to a variety of reasons, including the diversity of sample images, the poor visual
performance of runs for case–based topics, and the fact that the two best groups did not submit
runs in this category.</p>
    </sec>
    <sec id="sec-12">
      <title>Relevance Judgement Analysis</title>
      <p>A number of topics, both image–based and case–based, were judged by two or even three judges.
There were significant variations in the kappa metric used to evaluate the inter–rater agreement.
The kappas for the image–based topics are given below in Table 8. The kappas are usually
reasonably high except when involving some judges. As seen in the table, judge 12 was extremely
lenient compared to all other judges leading to extremely low kappas for any pairwise comparison
involving this judge. For instance, on topic 13, judge 12 evaluated 342 images as being relevant
while judge 7 (our most strict judge) only evaluated 7 images as being relevant. We discovered
this during the judging process and did not use the judgements from judge 12 in creating the
official qrels for any of the topics. We performed an extensive evaluation of the effect of the judges’
strictness in establishing relevance and found that overall, the results obtained using strict judges
and those obtained using lenient judges correlated well. However, results for a particular topic could be
affected by the judge’s parsimony in the evaluation of relevance.</p>
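      <p>One way to quantify how well results under strict and lenient judgements correlate is a rank correlation between the system scores obtained with each set of qrels. The sketch below uses Kendall's tau (the tau-a variant, with ties counted as neither concordant nor discordant). The source does not state which correlation measure was used, so this choice, the function name and the MAP values are all assumptions for illustration.</p>
      <preformat preformat-type="code">
```python
from itertools import combinations


def kendall_tau(scores_x, scores_y):
    """Kendall's tau-a between two score dictionaries keyed by run name."""
    runs = sorted(set(scores_x).intersection(scores_y))
    if len(runs) in (0, 1):
        raise ValueError("need at least two runs scored under both conditions")
    concordant = 0
    discordant = 0
    for first, second in combinations(runs, 2):
        # A pair is concordant when both conditions order the two runs alike.
        product = (scores_x[first] - scores_x[second]) * (
            scores_y[first] - scores_y[second]
        )
        if product > 0:
            concordant += 1
        elif product != 0:
            discordant += 1
    pairs = len(runs) * (len(runs) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical MAP values for three runs under strict and lenient qrels.
strict_map = {"run_a": 0.10, "run_b": 0.20, "run_c": 0.30}
lenient_map = {"run_a": 0.25, "run_b": 0.35, "run_c": 0.30}
print(round(kendall_tau(strict_map, lenient_map), 3))
```
      </preformat>
      <p>A value near 1 means the two sets of judgements rank the systems almost identically even when the absolute scores differ.</p>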
      <p>For the case–based topics, the kappa values were generally lower, as seen in Table 9. The 2×2
tables indicated that judge 4 was the most lenient and judge 7 was the strictest.</p>
    </sec>
    <sec id="sec-13">
      <title>Lung Nodule Detection Task</title>
      <p>We also introduced a lung nodule detection task in 2009. This task used the CT (Computed
Tomography) slices from the Lung Image Database Consortium (LIDC). This collection consisted
of 100–200 slices per study, manually annotated by 4 clinicians. Although more than
25 groups had registered for the task and more than a dozen had downloaded the datasets, only
two groups submitted runs. A commercial proprietary software package performed quite well in
the task of detecting the nodules.</p>
      <sec id="sec-13-1">
        <title>Conclusions</title>
        <p>The focus of many participants in this year’s ImageCLEF has been text–based retrieval. The
increasingly semantic topics combined with a database containing high–quality annotations in
2009 may have resulted in less impact of using visual techniques as compared to previous years.
Visual runs were rare and generally poor in performance. Mixed–media runs were very similar
in performance to textual runs when looking at MAP. The analysis also shows that several topics
with very few relevant images have a very low average performance, whereas topics with a larger
number seem to perform better.</p>
        <p>Case–based topics were introduced for the first time and only a few groups participated with
results being slightly lower than for the image–based topics.</p>
        <p>A kappa analysis between several relevance judgements for the same topics shows that there
are differences between judges but that agreement is generally high. A few judges can nevertheless
have disagreeing results with all other judges, something that we need to investigate further.</p>
        <p>For future campaigns it seems important that more research on visual techniques, including
massive learning, be done, as current techniques do not perform well. Interactive and
manual retrieval also seem to have room for improvement and should be promoted to
participants, who generally prefer automatic text–based approaches.</p>
      </sec>
      <sec id="sec-13-2">
        <title>Acknowledgements</title>
        <p>We would like to thank the CLEF campaign for supporting the ImageCLEF initiative. This work
was partially funded by the Swiss National Science Foundation (FNS) under contracts 205321–
109304/1 and PBGE22–121204, the American National Science Foundation (NSF) with grant
ITR–0325160, the TrebleCLEF project and Google. We would like to thank the RSNA for
supplying the images of their journals Radiology and Radiographics for the ImageCLEF campaign.</p>
        <p>[9] Jacques Savoy. Report on CLEF–2001 experiments. In Report on the CLEF Conference 2001
(Cross Language Evaluation Forum), pages 27–43, Darmstadt, Germany, 2002. Springer LNCS
2406.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] Paul Clough, Henning Müller, Thomas Deselaers, Michael Grubinger, Thomas M. Lehmann, Jeffery Jensen, and William Hersh.
          <article-title>The CLEF 2005 cross-language image retrieval track</article-title>.
          In <source>Cross Language Evaluation Forum (CLEF 2005)</source>, Springer Lecture Notes in Computer Science, pages
          <fpage>535</fpage>-<lpage>557</lpage>, September <year>2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Paul Clough, Henning Müller, and Mark Sanderson.
          <article-title>The CLEF cross-language image retrieval track (ImageCLEF) 2004</article-title>.
          In Carol Peters, Paul Clough, Julio Gonzalo, Gareth J. F. Jones, Michael Kluck, and Bernardo Magnini, editors,
          <source>Multilingual Information Access for Text, Speech and Images: Result of the fifth CLEF evaluation campaign</source>,
          volume <volume>3491</volume> of Lecture Notes in Computer Science (LNCS), pages
          <fpage>597</fpage>-<lpage>613</lpage>, Bath, UK, <year>2005</year>. Springer.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>William</given-names>
            <surname>Hersh</surname>
          </string-name>
          , Jeffery Jensen, Henning Müller, Paul Gorman, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Ruch</surname>
          </string-name>
          .
          <article-title>A qualitative task analysis for developing an image retrieval test collection</article-title>
          .
          In
          <source>ImageCLEF/MUSCLE workshop on image retrieval evaluation</source>
          , pages
          <fpage>11</fpage>
          -
          <lpage>16</lpage>
          , Vienna, Austria,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Henning</given-names>
            <surname>Müller</surname>
          </string-name>
          , Thomas Deselaers, Eugene Kim, Jayashree Kalpathy-Cramer,
          <string-name>
            <given-names>Thomas M.</given-names>
            <surname>Deserno</surname>
          </string-name>
          , Paul Clough, and
          <string-name>
            <given-names>William</given-names>
            <surname>Hersh</surname>
          </string-name>
          .
          <article-title>Overview of the ImageCLEFmed 2007 medical retrieval and annotation tasks</article-title>
          .
          In
          <source>CLEF 2007 Proceedings</source>
          , volume
          <volume>5152</volume>
          of Lecture Notes in Computer Science (LNCS)
          , pages
          <fpage>473</fpage>
          -
          <lpage>491</lpage>
          , Budapest, Hungary,
          <year>2008</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Henning</given-names>
            <surname>Müller</surname>
          </string-name>
          , Christelle Despont-Gros,
          <string-name>
            <given-names>William</given-names>
            <surname>Hersh</surname>
          </string-name>
          , Jeffery Jensen,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Lovis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Geissbuhler</surname>
          </string-name>
          .
          <article-title>Health care professionals' image use and search behaviour</article-title>
          .
          In
          <source>Proceedings of the Medical Informatics Europe Conference (MIE 2006)</source>
          , IOS Press,
          <source>Studies in Health Technology and Informatics</source>
          , pages
          <fpage>24</fpage>
          -
          <lpage>32</lpage>
          , Maastricht, The Netherlands,
          August
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Henning</given-names>
            <surname>Müller</surname>
          </string-name>
          , Jayashree Kalpathy-Cramer,
          <string-name>
            <given-names>William</given-names>
            <surname>Hersh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Geissbuhler</surname>
          </string-name>
          .
          <article-title>Using Medline queries to generate image retrieval tasks for benchmarking</article-title>
          .
          In
          <source>Medical Informatics Europe (MIE2008)</source>
          , pages
          <fpage>523</fpage>
          -
          <lpage>528</lpage>
          , Gothenburg, Sweden, May
          <year>2008</year>
          . IOS Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Henning</given-names>
            <surname>Müller</surname>
          </string-name>
          , Antoine Rosset, Jean-Paul Vallée, Francois Terrier, and
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Geissbuhler</surname>
          </string-name>
          .
          <article-title>A reference data set for the evaluation of medical image retrieval systems</article-title>
          .
          <source>Computerized Medical Imaging and Graphics</source>
          ,
          <volume>28</volume>
          (
          <issue>6</issue>
          ):
          <fpage>295</fpage>
          -
          <lpage>305</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Saïd</given-names>
            <surname>Radhouani</surname>
          </string-name>
          , William Hersh, Jayashree Kalpathy-Cramer, and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bedrick</surname>
          </string-name>
          .
          <article-title>Understanding and improving image retrieval in medicine</article-title>
          .
          <source>Technical report, Oregon Health and Science University</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>