<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Medical Case-based Retrieval by using a language model: MIRACL at ImageCLEF 2012</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jihen Majdoubi</string-name>
          <email>Jihen.Majdoubi@isims.rnu.tn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hatem Loukil</string-name>
          <email>Hatem.Loukil@isims.rnu.tn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohamed Tmar</string-name>
          <email>Mohamed.Tmar@isims.rnu.tn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Faiez Gargouri</string-name>
          <email>Faiez.Gargouri@isims.rnu.tn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Multimedia InfoRmation system and Advanced Computing Laboratory, Higher Institute of Information Technology and Multimedia, University of Sfax</institution>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper reports the experimental results of the MIRACL team in the medical case-based retrieval task of ImageCLEF 2012. We present our approach to the conceptual indexing of medical articles, which uses a language model to select the most representative descriptors for each article.</p>
      </abstract>
      <kwd-group>
        <kwd>conceptual indexing</kwd>
        <kwd>medical article</kwd>
        <kwd>language model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Running since 2004, ImageCLEFmed (the medical retrieval task) aims at
evaluating the performance of medical information systems that retrieve medical
information from a mono- or multilingual image collection. The medical retrieval
task of ImageCLEF 2012 uses a subset of PubMed Central containing 305,000
images. The task consists of three subtasks: modality classification, ad-hoc
retrieval and case-based retrieval. In our work, we are particularly interested in
the case-based retrieval task, which was first introduced in 2009. This is a
more complex task, but one that is closer to the clinical workflow. In this task,
a case description with patient demographics, limited symptoms and test
results including imaging studies is provided (but not the final diagnosis). The
goal is to retrieve cases, including images, that might best suit the provided case
description. Unlike the ad-hoc task, the unit of retrieval here is a case, not an
image. For the purposes of this task, a "case" is a PubMed ID corresponding to
the journal article [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>This paper describes the contribution of the MIRACL team (Multimedia
InfoRmation systems and Advanced Computing Laboratory, http://www.miracl.rnu.tn/)
to the medical retrieval track.</p>
      <p>Our proposed conceptual indexing approach consists of three main steps. In the
first step (term extraction), given an article, the Medical Subject Headings
(MeSH) thesaurus and NLP tools, our indexing system extracts two sets:
the first is the set of lemmas of the article, and the second is the list of lemmas existing in
the MeSH thesaurus. These sets are then used to extract the MeSH
terms occurring in the document. In step 2, the extracted terms are weighted
using the CSW and SW measures, which exploit MeSH conceptual
information to calculate term importance. Step 3 identifies the
MeSH descriptors that represent the document by using a language model.</p>
      <p>The rest of this paper is organized as follows. Section 2 describes our conceptual
indexing approach. The submitted results are presented and discussed in Section
3. We conclude the paper in Section 4 by outlining some perspectives for future
work.</p>
    </sec>
    <sec id="sec-2">
      <title>Our conceptual indexing approach</title>
      <p>Our indexing methodology, as schematized in Figure 1, consists of four main
steps: (a) pretreatment, (b) term extraction, (c) term weighting and (d) selection
of descriptors. In the following, we describe the structure of the MeSH vocabulary
and then detail the steps of our indexing method.</p>
      <sec id="sec-2-1">
        <title>MeSH thesaurus</title>
        <p>The structure of MeSH is centred on descriptors, concepts and terms.</p>
        <list list-type="bullet">
          <list-item>
            <p>Each term can be either a simple or a composed term.</p>
          </list-item>
          <list-item>
            <p>A concept is viewed as a class of synonymous terms. The preferred term gives
its name to the concept.</p>
          </list-item>
          <list-item>
            <p>A descriptor class consists of one or more concepts that are closely
related to each other in meaning. Each descriptor has a preferred concept, and
the descriptor's name is the name of the preferred concept. Each of the
subordinate concepts is related to the preferred concept by a relationship
(broader, narrower).</p>
          </list-item>
        </list>
      </sec>
      <sec id="sec-2-2">
        <title>Pretreatment</title>
        <p>
          The first step is to split the text into a set of sentences. We use the Tokeniser
module of GATE [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] in order to split the document into tokens, such as numbers,
punctuation, characters and words. Then, TreeTagger [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] processes these tokens
to assign a grammatical category (noun, verb, ...) and a lemma to each token.
Finally, our system prunes the stop words from each medical article of the corpus.
This process is also carried out on the MeSH thesaurus. Thus, the output of this
stage consists of two sets: the first is the set of lemmas of the article, and the second
is the list of lemmas existing in the MeSH thesaurus.
        </p>
        <p>Figure 2 outlines the basic steps of the pretreatment phase.</p>
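        <p>The pretreatment steps described in this section can be pictured with the following minimal Python sketch. It is only an illustration under stated assumptions: the real system relies on GATE and TreeTagger, whereas the regular expressions, the tiny <monospace>STOP_WORDS</monospace> set, the <monospace>LEMMAS</monospace> lookup table and all function names below are hypothetical stand-ins.</p>
        <preformat><![CDATA[
```python
import re

# Toy resources (assumptions): a real system would use GATE and TreeTagger.
STOP_WORDS = {"the", "a", "of", "in", "was", "were"}
LEMMAS = {"tumors": "tumor", "lungs": "lung", "found": "find", "patients": "patient"}


def split_sentences(text):
    """Naive sentence splitter: cut on runs of '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(sentence):
    """Split a sentence into lowercase word and number tokens."""
    return re.findall(r"[a-z]+|\d+", sentence.lower())


def pretreat(text):
    """Return the lemmas of the text, with stop words pruned."""
    lemmas = []
    for sentence in split_sentences(text):
        for token in tokenize(sentence):
            lemma = LEMMAS.get(token, token)  # fall back to the token itself
            if lemma not in STOP_WORDS:
                lemmas.append(lemma)
    return lemmas


print(pretreat("Tumors were found in the lungs. The patients recovered."))
# ['tumor', 'find', 'lung', 'patient', 'recovered']
```
]]></preformat>
        <p>Applying the same function to the entries of the thesaurus would yield the second set, the list of MeSH lemmas.</p>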
        <p>The MeSH thesaurus is available at http://www.nlm.nih.gov/mesh.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Term extraction</title>
        <p>
          This step consists of finding the different MeSH terms existing in the set of terms
generated by the pretreatment step. As mentioned above, a MeSH term can be
either simple or composed. To extract the simple terms, we project the MeSH
thesaurus onto the document by applying simple matching. More precisely, each
lemmatized term in the document is matched against the canonical form (lemma)
of the MeSH terms. To recognize the composed terms, we have chosen to use YateA
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. YateA (Yet Another Term ExtrActor) is a hybrid term extractor developed
in the ALVIS project. After text processing, YateA generates a file composed of
two columns: the inflected form of the term and its frequency. For instance, as
shown in Figure 3, which presents the result of the term extraction process
using YateA, the term "exercice physique" occurs 6 times.
Given the set of terms produced by the term extraction step, we
calculate the weight of each term by using two measures: the Content Structure Weight
(CSW) and the Semantic Weight (SW) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
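        <p>Before any weighting takes place, the simple-term matching of the term extraction step amounts to keeping the document lemmas that also appear among the MeSH lemmas. The sketch below illustrates this idea, together with reading a two-column term/frequency listing. It is not the authors' code: the function names and the toy data are hypothetical, and the tab separator assumed for the two-column file is a guess about the layout, not a documented YateA format.</p>
        <preformat><![CDATA[
```python
def extract_simple_terms(document_lemmas, mesh_lemmas):
    """Count the document lemmas that match a MeSH lemma (simple matching)."""
    mesh_set = set(mesh_lemmas)
    counts = {}
    for lemma in document_lemmas:
        if lemma in mesh_set:
            counts[lemma] = counts.get(lemma, 0) + 1
    return counts


def parse_term_frequencies(lines, sep="\t"):
    """Parse 'term SEP frequency' lines; the separator is an assumption."""
    table = {}
    for line in lines:
        term, _, freq = line.strip().rpartition(sep)
        if term:  # lines without a separator are skipped
            table[term] = int(freq)
    return table


# Toy data: hypothetical outputs of the pretreatment step.
doc = ["tumeur", "poumon", "patient", "tumeur", "traitement"]
mesh = ["tumeur", "poumon", "coeur"]
print(extract_simple_terms(doc, mesh))  # {'tumeur': 2, 'poumon': 1}

# One line in the two-column layout described in the text.
print(parse_term_frequencies(["exercice physique\t6"]))  # {'exercice physique': 6}
```
]]></preformat>
        <p>Together, the simple and composed terms found this way form the set of candidate terms of the document.</p>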
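        <p>
          Content Structure Weight (CSW). Raw frequency is not the main
criterion used to calculate the CSW of a term. Instead, the CSW takes into account
the term frequency in each part of the document rather than in the whole document.
For example, a term in the Title receives a higher importance (×10) than a
term that appears in the Paragraphs (×2). Table 2 shows the various coefficients
used to weight the term locations. These coefficients were determined
experimentally in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
The CSW of the term <italic>t<sub>i</sub></italic> in a document <italic>d</italic> is given as follows:
        </p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>\mathrm{CSW}(t_i, d) = \frac{\sum_{A} W_A \times f(t_i, d, A)}{\sum_{A} f(t_i, d, A)}</tex-math>
        </disp-formula>
        <p>where:</p>
        <list list-type="bullet">
          <list-item>
            <p><italic>W<sub>A</sub></italic> is the weight of the location <italic>A</italic> (see Table 2),</p>
          </list-item>
          <list-item>
            <p><italic>f</italic>(<italic>t<sub>i</sub></italic>, <italic>d</italic>, <italic>A</italic>) is the occurrence frequency of the term <italic>t<sub>i</sub></italic> in the document <italic>d</italic> at
location <italic>A</italic>.</p>
          </list-item>
        </list>
        <p>For example, the term "tumeur" occurs in the document <italic>d</italic><sub>1683</sub> once in the title,
twice in the abstract and 9 times in the paragraphs, so:</p>
        <disp-formula>
          <tex-math>\mathrm{CSW}(\mathit{tumeur}, d_{1683}) = \frac{1 \times 10 + 2 \times 8 + 9 \times 2}{1 + 2 + 9} = \frac{44}{12} \approx 3.67</tex-math>
        </disp-formula>
        <p>Semantic Weight (SW). The Semantic Weight of the term <italic>t<sub>i</sub></italic> in the document
<italic>d</italic> depends on its synonyms existing in the set of Candidate Terms CT(<italic>d</italic>)
generated by the term extraction step. To compute it, we use the Synof function, which
associates with a given term <italic>t<sub>i</sub></italic> its synonyms among CT(<italic>d</italic>).
Formally, the measure SW is defined as follows:</p>
        <disp-formula>
          <label>(2)</label>
          <tex-math>\mathrm{SW}(t_i, d) = \frac{\sum_{g \in \mathrm{Synof}(t_i, CT(d))} f(g, d)}{|\mathrm{Synof}(t_i, CT(d))|}</tex-math>
        </disp-formula>
        <p>For a given term <italic>t<sub>i</sub></italic>, we have on the one hand its Content Structure Weight
CSW(<italic>t<sub>i</sub></italic>, <italic>d</italic>) and on the other its Semantic Weight SW(<italic>t<sub>i</sub></italic>, <italic>d</italic>). Its Local Weight
LW(<italic>t<sub>i</sub></italic>, <italic>d</italic>) is determined as follows:</p>
        <disp-formula>
          <label>(3)</label>
          <tex-math>\mathrm{LW}(t_i, d) = \frac{\mathrm{CSW}(t_i, d) + \mathrm{SW}(t_i, d)}{2}</tex-math>
        </disp-formula>
        <p>
          By examining equation 3, we can notice that all terms (simple or
composed) are weighted in the same way. Despite several works dealing with
the weighting of composed terms, there is so far no weighting technique shared by
the community [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. In our approach, we applied the weighting method proposed
by [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. According to [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], for a term <italic>t</italic> composed of <italic>n</italic> words, its frequency in a
document depends on the frequency of the term itself and on the frequency of each
sub-term. For this purpose, the measure <italic>cf</italic> is defined as follows:
        </p>
        <disp-formula>
          <label>(4)</label>
          <tex-math>cf(t, d) = f(t, d) + \sum_{st \in \mathrm{subterms}(t)} \frac{\mathrm{length}(st)}{\mathrm{length}(t)} \cdot f(st, d)</tex-math>
        </disp-formula>
        <p>where:</p>
        <list list-type="bullet">
          <list-item>
            <p><italic>f</italic>(<italic>t</italic>, <italic>d</italic>) is the number of occurrences of <italic>t</italic> in the document <italic>d</italic>,</p>
          </list-item>
          <list-item>
            <p>length(<italic>t</italic>) is the number of words in the term <italic>t</italic>,</p>
          </list-item>
          <list-item>
            <p>subterms(<italic>t</italic>) is the set of all possible MeSH terms which can be derived from <italic>t</italic>,</p>
          </list-item>
          <list-item>
            <p><italic>f</italic>(<italic>st</italic>, <italic>d</italic>) is the number of occurrences of the sub-term <italic>st</italic> in the document <italic>d</italic>.</p>
          </list-item>
        </list>
        <p>For example, if we consider the term "cancer of blood", knowing that "cancer" is
itself also a MeSH term, its frequency is computed as:</p>
        <disp-formula>
          <tex-math>cf(\text{cancer of blood}) = f(\text{cancer of blood}) + \frac{1}{3} \cdot f(\text{cancer})</tex-math>
        </disp-formula>
        <p>Consequently, in order to take composed terms into account,
we calculate the CSW measure as follows:</p>
        <disp-formula>
          <label>(5)</label>
          <tex-math>\mathrm{CSW}(t_i, d) = \frac{\sum_{A} W_A \times cf(t_i, d, A)}{\sum_{A} cf(t_i, d, A)}</tex-math>
        </disp-formula>
        <p>where <italic>cf</italic>(<italic>t<sub>i</sub></italic>, <italic>d</italic>, <italic>A</italic>) is the measure of equation 4 computed from the occurrences of <italic>t<sub>i</sub></italic> and of its sub-terms at location <italic>A</italic>.
It is important to note that in the case of a simple term, subterms(<italic>t<sub>i</sub></italic>) = ∅.
Consequently, the formulas given by equations 5 and 1 are equivalent.</p>
        <p>Finally, the weight of a term <italic>t<sub>i</sub></italic> in a document <italic>d<sub>j</sub></italic>, Weight(<italic>t<sub>i</sub></italic>, <italic>d<sub>j</sub></italic>), is calculated
as follows:</p>
        <disp-formula>
          <label>(6)</label>
          <tex-math>\mathrm{Weight}(t_i, d_j) = \mathrm{LW}(t_i, d_j) \cdot \ln\left(\frac{N}{df}\right)</tex-math>
        </disp-formula>
        <p>where:</p>
        <list list-type="bullet">
          <list-item>
            <p><italic>N</italic> is the total number of documents,</p>
          </list-item>
          <list-item>
            <p><italic>df</italic> (document frequency) is the number of documents in which the term <italic>t<sub>i</sub></italic> occurs.</p>
          </list-item>
        </list>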
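        <p>The weighting measures of this section can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the function names and dictionary keys are hypothetical, only the location coefficients that appear in the text's example (title 10, abstract 8, paragraphs 2) are included, and the frequencies in the composed-term example are made up.</p>
        <preformat><![CDATA[
```python
import math

# Location coefficients taken from the worked example in the text;
# the key names are hypothetical.
LOCATION_WEIGHTS = {"title": 10, "abstract": 8, "paragraphs": 2}


def csw(freq_by_location, weights=LOCATION_WEIGHTS):
    """Equation 1: location-weighted mean of the per-location frequencies."""
    total = sum(freq_by_location.values())
    if total == 0:
        return 0.0
    weighted = sum(weights[loc] * f for loc, f in freq_by_location.items())
    return weighted / total


def sw(synonym_freqs):
    """Equation 2: mean frequency of the term's synonyms found in CT(d)."""
    if not synonym_freqs:
        return 0.0
    return sum(synonym_freqs) / len(synonym_freqs)


def lw(csw_value, sw_value):
    """Equation 3: local weight, the mean of CSW and SW."""
    return (csw_value + sw_value) / 2


def cf(term, freq, subterms):
    """Equation 4: frequency of a composed term plus a share of its sub-terms."""
    n_words = len(term.split())
    bonus = sum(len(st.split()) / n_words * freq.get(st, 0) for st in subterms)
    return freq.get(term, 0) + bonus


def weight(lw_value, n_docs, doc_freq):
    """Equation 6: local weight scaled by ln(N / df)."""
    return lw_value * math.log(n_docs / doc_freq)


# Worked example from the text: "tumeur" in document d1683.
print(round(csw({"title": 1, "abstract": 2, "paragraphs": 9}), 2))  # 3.67

# "cancer of blood" with the MeSH sub-term "cancer" (made-up frequencies).
print(cf("cancer of blood", {"cancer of blood": 2, "cancer": 3}, ["cancer"]))
```
]]></preformat>
        <p>For a simple term the list of sub-terms is empty, so <monospace>cf</monospace> reduces to the raw frequency, which mirrors the equivalence of equations 5 and 1.</p>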
      </sec>
      <sec id="sec-2-4">
        <title>Selection of descriptors</title>
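        <p>
          A MeSH term may be located in different hierarchies at various levels of
specificity, which reflects its ambiguity. In recent years, due to the number of
ambiguous terms and their various senses used in biomedical texts, term ambiguity
resolution has become a challenge for several researchers [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ][
          <xref ref-type="bibr" rid="ref10">10</xref>
          ][
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Unlike
the works proposed in the literature, our method assigns the appropriate
descriptor to a given term by using the language model approach.
To determine the best descriptor for an ambiguous term, we
have adapted the language model of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] by substituting the MeSH
descriptor for the query. Thus, we infer a language model for each document and rank the MeSH
descriptors according to their probability of being generated by this model.
We would like to estimate <italic>P</italic>(<italic>des</italic>|<italic>d</italic>), the probability of generating a MeSH
descriptor <italic>des</italic> given the language model of document <italic>d</italic>. For a collection <italic>D</italic>, a document
<italic>d</italic> and a MeSH descriptor <italic>des</italic> composed of <italic>n</italic> concepts, the probability <italic>P</italic>(<italic>des</italic>|<italic>d</italic>)
is given by:
        </p>
        <disp-formula>
          <label>(7)</label>
          <tex-math>P(des \mid d) = P(d) \cdot \prod_{c_j \in \mathrm{RelatedtoDes}(des, d)} \left[ (1 - \lambda) \cdot P(c_j \mid d) + \lambda \cdot P(c_j \mid D) \right]</tex-math>
        </disp-formula>
        <p>where λ is the smoothing parameter and
RelatedtoDes (respectively RelatedtoCon) is the function that associates with a
given descriptor <italic>des</italic> (respectively concept <italic>con</italic>) and a document <italic>d</italic> the MeSH concepts
(respectively terms) which are related to <italic>des</italic> (respectively <italic>con</italic>) in <italic>d</italic>. In
equation 7, we need to estimate two probabilities:</p>
        <list list-type="order">
          <list-item>
            <p><italic>P</italic>(<italic>c</italic>|<italic>D</italic>), the probability of observing the concept <italic>c</italic> in the collection <italic>D</italic>:
              <disp-formula>
                <tex-math>P(c \mid D) = \frac{f(c, D)}{\sum_{c' \in D} f(c', D)}</tex-math>
              </disp-formula>
where <italic>f</italic>(<italic>c</italic>, <italic>D</italic>) is the frequency of the concept <italic>c</italic> in the collection <italic>D</italic>.</p>
          </list-item>
          <list-item>
            <p><italic>P</italic>(<italic>c</italic>|<italic>d</italic>), the probability of observing a concept <italic>c</italic> in a document <italic>d</italic>:
              <disp-formula>
                <label>(8)</label>
                <tex-math>P(c \mid d) = \frac{f(c, d)}{|\mathrm{concepts}(d)|}</tex-math>
              </disp-formula>
where
              <disp-formula>
                <label>(9)</label>
                <tex-math>f(c, d) = \prod_{t_j \in \mathrm{RelatedtoCon}(c, d)} \mathrm{LW}(t_j, d)</tex-math>
              </disp-formula>
            </p>
          </list-item>
        </list>
        <p>Finally, to assign the appropriate sense (Best Descriptor, BD) to an
ambiguous term <italic>t<sub>i</sub></italic> in the context of a document <italic>d<sub>j</sub></italic>, we retain the descriptor
which maximizes <italic>P</italic>(<italic>des</italic>|<italic>d<sub>j</sub></italic>).</p>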
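        <p>The descriptor selection step can be sketched in Python as follows. This is an illustrative sketch only, under several assumptions that are not stated in the text: the smoothing parameter is set to 0.5, the document prior <italic>P</italic>(<italic>d</italic>) is taken as 1 (uniform), and the function names, data layout and toy values are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def p_concept_collection(concept, collection_freq):
    """P(c|D): relative frequency of the concept in the collection."""
    return collection_freq[concept] / sum(collection_freq.values())


def p_concept_document(term_weights, n_concepts):
    """P(c|d): product of the LW of the concept's terms in d, over |concepts(d)|."""
    return math.prod(term_weights) / n_concepts


def p_descriptor(concepts, doc_terms, n_concepts, collection_freq, lam=0.5, p_doc=1.0):
    """Equation 7: smoothed product over the descriptor's concepts found in d.

    lam (smoothing) and p_doc (document prior) are assumed values.
    """
    score = p_doc
    for c in concepts:
        p_d = p_concept_document(doc_terms[c], n_concepts)
        p_coll = p_concept_collection(c, collection_freq)
        score *= (1 - lam) * p_d + lam * p_coll
    return score


def best_descriptor(candidates, doc_terms, n_concepts, collection_freq, lam=0.5):
    """Keep the candidate descriptor that maximizes P(des|d)."""
    return max(
        candidates,
        key=lambda des: p_descriptor(
            candidates[des], doc_terms, n_concepts, collection_freq, lam
        ),
    )


# Toy example: one ambiguous term with two candidate descriptors.
candidates = {"des_A": ["c_a"], "des_B": ["c_b"]}      # descriptor -> concepts in d
doc_terms = {"c_a": [3.0, 2.0], "c_b": [0.5]}          # concept -> LW of its terms in d
collection_freq = {"c_a": 40, "c_b": 60}               # concept -> frequency in D
print(best_descriptor(candidates, doc_terms, 10, collection_freq))  # des_A
```
]]></preformat>
        <p>In this toy example the concept of des_A is better supported by the document than that of des_B, which outweighs the higher collection frequency of the latter.</p>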
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results and discussion</title>
      <p>The goal of our experiments is to answer the following question: can our
conceptual indexing approach improve the information retrieval process? These
experiments are performed on the case-based 2012 collection. This collection is
based on a dataset containing over 300,000 images from 75,000 articles of the
biomedical open access literature. 26 case-based topics are also provided, where
the retrieval unit is a case, not an image.</p>
      <p>To present these experiments clearly, we first describe the experimental
process and the techniques used for validation. We then discuss the obtained
results.</p>
      <sec id="sec-3-1">
        <title>Experimental process</title>
        <p>Our experimental process is undertaken as follows:</p>
        <list list-type="bullet">
          <list-item>
            <p>
              Our process starts by dividing each article into a set of sentences. After
tokenisation, lemmatisation of the corpus and of the MeSH terms is performed by
TreeTagger [
              <xref ref-type="bibr" rid="ref3">3</xref>
              ]. Finally, a filtering step is performed to remove the stop words.
            </p>
          </list-item>
          <list-item>
            <p>For each document <italic>d<sub>j</sub></italic> of the test corpus, we determine the set of Candidate
Terms CT(<italic>d<sub>j</sub></italic>). Each term of this set is then weighted to determine
its importance in <italic>d<sub>j</sub></italic>.</p>
          </list-item>
          <list-item>
            <p>For each document <italic>d<sub>j</sub></italic>, we select the set of Best Descriptors BD(<italic>d<sub>j</sub></italic>).</p>
          </list-item>
        </list>
        <p>Thus, each document <italic>d</italic> is represented as a vector <inline-formula><tex-math>\vec{d} = (d_1, d_2, \ldots, d_n)</tex-math></inline-formula>,
where <italic>d<sub>i</sub></italic> is the probability of descriptor <italic>i</italic> in the document (see equation 7).
This indexing process is also performed on queries: after
extracting the pertinent descriptors, the query is represented as a vector <inline-formula><tex-math>\vec{q} = (q_1, q_2, \ldots, q_n)</tex-math></inline-formula>,
where <italic>q<sub>i</sub></italic> is the weight of descriptor <italic>i</italic> in the query (1 if the descriptor belongs
to the query, 0 otherwise).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Experimental results</title>
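        <p>To determine the relevance of a document <italic>d<sub>j</sub></italic> to a query <italic>q</italic>, we apply six RSV
(Retrieval Status Value) measures:</p>
        <list list-type="order">
          <list-item>
            <p>Okapi BM25, whose standard form is:
              <disp-formula>
                <tex-math>rsv(d, q) = \sum_{j=1}^{n} \ln\left(\frac{N - n(q_j) + 0.5}{n(q_j) + 0.5}\right) \cdot \frac{f(q_j, d) \cdot (k_1 + 1)}{f(q_j, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{avgdl}\right)}</tex-math>
              </disp-formula>
where:</p>
            <list list-type="bullet">
              <list-item>
                <p><italic>N</italic>: total number of documents in the collection,</p>
              </list-item>
              <list-item>
                <p><italic>n</italic>(<italic>q<sub>j</sub></italic>): number of documents containing the descriptor <italic>j</italic>,</p>
              </list-item>
              <list-item>
                <p><italic>f</italic>(<italic>q<sub>j</sub></italic>, <italic>d</italic>): frequency of descriptor <italic>j</italic> in document <italic>d</italic>,</p>
              </list-item>
              <list-item>
                <p><italic>k</italic><sub>1</sub> and <italic>b</italic>: experimental parameters (in this experiment, <italic>b</italic> = 0.75 and <italic>k</italic><sub>1</sub> was fixed at 1.6),</p>
              </list-item>
              <list-item>
                <p>|<italic>d</italic>|: length of document <italic>d</italic>, and <italic>avgdl</italic>: average length of documents.</p>
              </list-item>
            </list>
          </list-item>
          <list-item>
            <p>Cosine measure:
              <disp-formula>
                <tex-math>rsv(\vec{d}, \vec{q}) = \cos(\vec{d}, \vec{q}) = \frac{\vec{d} \cdot \vec{q}}{|\vec{d}| \cdot |\vec{q}|} = \frac{\sum_{k=1}^{n} d_k q_k}{\sqrt{\sum_{k=1}^{n} d_k^2 \cdot \sum_{k=1}^{n} q_k^2}}</tex-math>
              </disp-formula>
            </p>
          </list-item>
          <list-item>
            <p>Dice coefficient.</p>
          </list-item>
          <list-item>
            <p>Jelinek measure.</p>
          </list-item>
          <list-item>
            <p>Jaccard measure.</p>
          </list-item>
          <list-item>
            <p>Overlap measure.</p>
          </list-item>
        </list>
        <p>The symbols used in these measures are:</p>
        <list list-type="bullet">
          <list-item>
            <p><italic>Des</italic> is the set of MeSH descriptors of <italic>d</italic>,</p>
          </list-item>
          <list-item>
            <p><italic>w<sub>ij</sub></italic> is the weight of the descriptor <italic>des<sub>i</sub></italic> in the document <italic>d<sub>j</sub></italic>,</p>
          </list-item>
          <list-item>
            <p><italic>f<sub>i</sub></italic> is the frequency of the descriptor <italic>des<sub>i</sub></italic> in the query <italic>q</italic>.</p>
          </list-item>
        </list>
        <p>As shown in Table 2, the results generated by the runs "R3 MIRACL" and
"R6 MIRACL" are very similar.</p>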
        <p>R5 MIRACL performs worse than R4 MIRACL on all metrics. For example,
the MAP obtained by R4 MIRACL is 0.0196, whereas
R5 MIRACL obtains a MAP of 0.0024.</p>
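        <p>Two of the RSV measures listed in this section, the cosine measure and Okapi BM25, can be sketched in Python as follows. This is a hedged illustration, not the code behind the submitted runs: the function names, argument layout and toy vectors are hypothetical, the BM25 function follows the standard textbook form of the measure, and only the parameter values <italic>k</italic><sub>1</sub> = 1.6 and <italic>b</italic> = 0.75 come from the text.</p>
        <preformat><![CDATA[
```python
import math


def cosine_rsv(doc_vec, query_vec):
    """Cosine of the angle between the document and query descriptor vectors."""
    dot = sum(d * q for d, q in zip(doc_vec, query_vec))
    norm = math.sqrt(sum(d * d for d in doc_vec) * sum(q * q for q in query_vec))
    return dot / norm if norm else 0.0


def bm25_rsv(query, doc_freqs, doc_len, avgdl, n_docs, containing, k1=1.6, b=0.75):
    """Standard Okapi BM25 score of one document for a list of query descriptors.

    doc_freqs:  descriptor -> frequency in the document
    containing: descriptor -> number of documents containing it
    """
    score = 0.0
    for des in query:
        f = doc_freqs.get(des, 0)
        n_q = containing.get(des, 0)
        idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5))
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl))
    return score


# Document vector of descriptor probabilities and a binary query vector (toy values).
doc_vec = [0.5, 0.0, 0.2]
query_vec = [1, 0, 1]
print(round(cosine_rsv(doc_vec, query_vec), 3))  # 0.919
```
]]></preformat>
        <p>The binary query vector matches the query representation of the experimental process, where each component is 1 or 0 depending on whether the descriptor belongs to the query.</p>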
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>This article describes the conceptual retrieval approach of the MIRACL team for
the ImageCLEF 2012 medical retrieval track, in particular the case-based retrieval
task. The results obtained by our submitted runs suggest that our indexing method
helps to enhance the semantics of the document, which could be useful
evidence for improving the retrieval effectiveness of medical retrieval systems. Our
future work aims at incorporating a form of semantic smoothing into the
language modeling approach. We also plan to use several semantic resources in the
indexing process. We believe that a multi-terminology based indexing approach
can enhance IR performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Muller, H.,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalpathy-Cramer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fushman</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eggel</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2012 medical image retrieval and classification tasks</article-title>
          .
          <source>In: CLEF</source>
          . (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maynard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tablan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>GATE: A framework and graphical development environment for robust NLP tools and applications</article-title>
          .
          <source>ACL</source>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Schmid</surname>
          </string-name>
          , H.:
          <article-title>Probabilistic part-of-speech tagging using decision trees</article-title>
          .
          <source>International Conference on New Methods in Language Processing</source>
          . Manchester (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Aubin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Improving term extraction with terminological resources</article-title>
          .
          <source>In: Advances in Natural Language Processing. Volume 4139 of Lecture Notes in Computer Science</source>
          . Springer Berlin / Heidelberg (
          <year>2006</year>
          )
          <fpage>380</fpage>
          -
          <lpage>387</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Majdoubi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tmar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gargouri</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Using the mesh thesaurus to index a medical article: Combination of content, structure and semantics</article-title>
          .
          <source>In: KES (1)</source>
          . (
          <year>2009</year>
          )
          <fpage>277</fpage>
          -
          <lpage>284</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gamet</surname>
          </string-name>
          , J.:
          <article-title>Indexation de pages web</article-title>
          .
          <source>DEA report</source>
          , Université de Nantes (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Baziz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boughanem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aussenac-Gilles</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chrisment</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Semantic cores for representing documents in IR</article-title>
          .
          <source>In: Proceedings of the 2005 ACM symposium on Applied computing. SAC '05</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2005</year>
          )
          <fpage>1011</fpage>
          -
          <lpage>1017</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Baziz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Indexation conceptuelle guidée par ontologie pour la recherche d'information</article-title>
          .
          <source>PhD thesis</source>
          , Université Paul Sabatier (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Andreopoulos</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alexopoulou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schroeder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Word sense disambiguation in biomedical ontologies with term co-occurrence analysis and document clustering</article-title>
          .
          <source>IJDMB</source>
          <volume>2</volume>
          (
          <issue>3</issue>
          ) (
          <year>2008</year>
          )
          <fpage>193</fpage>
          -
          <lpage>215</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Stevenson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaizauskas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martinez</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Knowledge sources for word sense disambiguation of biomedical text</article-title>
          .
          <source>In: BioNLP '08: Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing</source>
          , Association for Computational Linguistics (
          <year>2008</year>
          )
          <fpage>80</fpage>
          -
          <lpage>87</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Duy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lynda</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Sense-based biomedical indexing and retrieval</article-title>
          . In: NLDB. (
          <year>2010</year>
          )
          <fpage>24</fpage>
          -
          <lpage>35</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hiemstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Using Language Models for Information Retrieval</article-title>
          .
          <source>PhD thesis</source>
          , University of Twente (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>