=Paper= {{Paper |id=Vol-1172/CLEF2006wn-ImageCLEF-DaumkeEt2006 |storemode=property |title=Morphosaurus in ImageCLEF 2006: The Effect of Subwords on Biomedical IR |pdfUrl=https://ceur-ws.org/Vol-1172/CLEF2006wn-ImageCLEF-DaumkeEt2006.pdf |volume=Vol-1172 |dblpUrl=https://dblp.org/rec/conf/clef/DaumkePM06a }} ==Morphosaurus in ImageCLEF 2006: The Effect of Subwords on Biomedical IR== https://ceur-ws.org/Vol-1172/CLEF2006wn-ImageCLEF-DaumkeEt2006.pdf
          Morphosaurus in ImageCLEF 2006:
         The effect of subwords on biomedical IR
                         Philipp Daumke, Jan Paetzold, Kornél Markó
                                 University Hospital Freiburg
                         Philipp.Daumke@klinikum.uni-freiburg.de


                                            Abstract
     We here describe the subword approach we used in the 2006 ImageCLEF Medical
     Image Retrieval task. It is based on the assumption that neither fully inflected nor
     automatically stemmed words constitute the appropriate granularity for lexicalized
     content description. We therefore introduce subwords as morphologically meaningful
     word units. Subwords are organized in language specific lexica that were partly man-
     ually and partly automatically generated and currently cover six European languages.
     They are linked together via a multilingual thesaurus. The use of subwords instead
     of full words significantly reduces the number of lexical entries that are needed to
     sufficiently cover a specific language and domain. A further benefit of the approach
     is its independence from the underlying retrieval system, thus making it usable by
     any search engine. In this year’s test runs we combined MorphoSaurus with the
     open-source search engine Lucene and achieved precision gains of up to 25% over the
     baseline for a monolingual setting and promising results in a multilingual scenario.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing [Dictionaries,
Thesauruses]; H.3.3 Information Search and Retrieval [Query formulation, Retrieval models, Search
process]; H.3.4 Systems and Software; H.2.3 [Database Management]: Languages—Query Languages

General Terms
Algorithms, Experimentation, Languages

Keywords
Biomedical Information Retrieval, Cross Language Information Retrieval, Stemming, Subwords,
MorphoSaurus


1    Introduction
The conventional view on human language is word-centered, at least for written language where
words are clearly delimited by spaces. It builds on the hypothesis that words are the basic build-
ing blocks of phrases and sentences. In syntactic theories, words constitute the terminal symbols.
Therefore, it appears straightforward to break down natural language to the word level. However,
looking at the sense of natural language expressions, evidence can be found that semantic atom-
icity frequently does not coincide with the word level, which bears methodical challenges even
for pretended ‘simple’ tasks such as tokenization of natural language input [3]. As an example,
considering the English noun phrase “high blood pressure”, the word limits reflect quite well the
semantic composition, whereas this is not the case in its literal translations “verhoogde bloed-
druk” (Dutch), “högt blodtryck” (Swedish) or “Bluthochdruck” (German). Especially in technical
sublanguages such as the medical one, atomic senses are encountered at different levels of frag-
mentation or granularity. An atomic sense may correspond to word stems (e.g., “hepat” referring
to “liver”), prefixes (e.g., “anti-”, “hyper-”), suffixes (e.g., “-logy”, “-itis”), larger word fragments
(“hypophys”), words (“spleen”, “liver”) or even multi-word terms (“yellow fever”). The possible
combinations of these word-forming elements are immense and ad-hoc term formation is common.
    Extracting atomic sense units from texts as a basis for the semantic interpretation of natural
language is therefore an important goal in various areas of research dealing with natural language
processing, such as information retrieval, text mining or speech recognition. Especially in mor-
phologically rich languages (German, Finnish, Swedish, etc.), where classical rule-based stemmers
such as the Porter Stemmer [9] are not particularly effective, a deeper morphological analysis is
widely acknowledged to improve the retrieval performance in IR systems [4], [8], [1]. However,
the lack of accuracy of current unsupervised approaches for morphological analysis hampers the
automatic or semi-automatic creation of an underlying knowledge base.
    We developed such a lexical resource and introduced the notion of subwords [5], i.e., self-
contained, semantically minimal units. Language-specific subwords are linked by intralingual as
well as interlingual synonymy and grouped together in terms of concept-like language independent
equivalence classes.
    For ImageCLEFmed 2006, we demonstrate the positive effect of using subwords on the monolingual
and multilingual biomedical text retrieval tasks.


2     Morpho-Semantic Indexing
Morpho-semantic indexing translates source documents and queries into an interlingual represen-
tation in which their content is represented by language-independent semantic descriptors. This
procedure is based upon the MorphoSaurus document pre-processing engine, which consists of
orthographic normalization rules, a morphological component for word segmentation, language-
specific subword lexicons for each of the natural languages to be analyzed, as well as a language-
independent thesaurus.

2.1    Subwords as Document Description Units
In many NLP applications evidence can be found that neither fully inflected nor automatically
stemmed words, as is common in many text retrieval systems, constitute the appropriate granularity
level for lexicalized content description. Especially in scientific and technical sublanguages,
semantically atomic, i.e., non-decomposable entities are chained in complex word forms such as in
’pseudo⊕hypo⊕para⊕thyroid⊕ism’, ’gluco⊕corticoid⊕s’ or ’pancreat⊕itis’. Domain-specific suf-
fixes (e.g., ’-itis’ ), and single-word compounding are even more accentuated in morphologically
richer languages than English, such as German, Finnish, Dutch, Hungarian or Swedish. We refer
to these self-contained, semantically minimal units as subwords and motivate their existence by
their usefulness for document retrieval rather than by linguistic arguments.
    The minimality criterion is difficult to define in a general way, and so far is based more on
empirical examinations rather than on formal criteria. Considering, e.g., the text token ’diaph-
ysis’, a linguistically plausible morpheme-style segmentation would possibly lead to ’dia⊕phys⊕is’.
From a medical perspective, a segmentation into ’diaphys⊕is’ seems much more reasonable, be-
cause the canonical linguistic decomposition is far too fine-grained and likely to create many sub-
word ambiguities. Comparable ’low-level’ segmentations of semantically unrelated tokens such as
’dia⊕lyt⊕ic’ and ’phys⊕io⊕logy’ also lead to morpheme-style subwords ’dia’ and ’phys’, and, thus,
unwarrantedly match segmentations such as ’dia⊕phys⊕is’, too. The (semantic) self-containedness
of the chosen subword is often supported by the existence of a synonym, e.g., ’shaft’ for ’diaphys’.
   Subwords are assembled in language specific lexicons and a multilingual thesaurus, which
together contain subword entries, special subword attributes and semantic relations between sub-
words. The lexicons and the thesaurus are both constructed manually, with the following consid-
erations in mind:

   • Subwords are collected, together with their attributes such as language (English, German),
     subword type (stem, prefix, suffix, invariant), etc. Each lexicon entry is then assigned to a
     unique identifier representing one synonymy class, the MorphoSaurus identifier (MID).
   • Intralingual synonyms and interlingual translations of subwords are grouped together by the
     same equivalence class.
   • Semantic links between synonymy classes are added. We subscribe to a shallow approach in
     which semantic relations are restricted to:
        1. a paradigmatic relation has-sense, which relates one ambiguous class to its specific
           readings, e.g.: {head}→({kopf,zephal,cephal}OR{leader, boss}).
        2. a syntagmatic relation expands-to, which consists of predefined segmentations in case
           of utterly short subwords such as {myalg}→{muskel,muscle}⊕{schmerz,pain}.

    We refrain from introducing hierarchical relations between MIDs, because such links can be
acquired from domain-specific vocabularies, e.g., the Medical Subject Headings (MeSH [8], cf. also
[9] for the mapping of MIDs to appropriate MeSH terms).
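As an illustration, the lexicon and thesaurus structures described above can be sketched as follows. All entries, MIDs and relation instances here are made-up examples modeled on the ones in the text, not actual MorphoSaurus content.

```python
# Language-specific lexicon: (language, subword) -> (subword type, MID).
# Entries and MIDs are illustrative only.
lexicon = {
    ("en", "head"):  ("stem", "#head"),
    ("de", "kopf"):  ("stem", "#head_anat"),
    ("en", "myalg"): ("stem", "#myalg"),
}

# Paradigmatic has-sense relation: an ambiguous class -> its specific readings,
# e.g. {head} -> ({kopf, zephal, cephal} OR {leader, boss}).
has_sense = {
    "#head": [{"#head_anat"}, {"#leader"}],
}

# Syntagmatic expands-to relation: predefined segmentations for utterly short
# subwords, e.g. {myalg} -> {muskel, muscle} + {schmerz, pain}.
expands_to = {
    "#myalg": ["#muscle", "#pain"],
}

def normalize(mid):
    """Replace a composed class by its component MIDs, as expands-to does."""
    return expands_to.get(mid, [mid])

print(normalize("#myalg"))  # ['#muscle', '#pain']
print(normalize("#head"))   # ['#head']
```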

2.2    Morpho-Semantic Normalization
Figure 1 depicts how source documents (top left) are converted into an interlingual representation
by a three-step procedure. The first step deals with orthographic normalization (cf. Figure 1, top
right). A preprocessor reduces all capitalized characters from the input documents to lower-case
characters and, additionally, performs language-specific character substitutions in order to ease
the matching of (parts of) text tokens and entries in the lexicons.
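The orthographic step can be sketched as lowercasing plus a per-language substitution table; the German rules shown in Figure 1 (e.g. "Erhöhte" becoming "erhoehte") suggest umlaut rewriting, though the actual rule set is not published here and the table below is our assumption.

```python
# Illustrative per-language character substitutions (assumed, not the real rules).
SUBST = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
    "en": {},
}

def ortho_normalize(text, lang):
    """Lowercase the input and apply language-specific character rewrites."""
    text = text.lower()
    for src, dst in SUBST.get(lang, {}).items():
        text = text.replace(src, dst)
    return text

print(ortho_normalize("Erhöhte TSH-Werte", "de"))  # erhoehte tsh-werte
```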
    The next step in the pipeline is concerned with morphological segmentation. The system
decomposes the orthographically normalized input stream into a sequence of sublexical items, the
content-bearing ones correspond to subwords as listed in the lexicon (cf. Figure 1, bottom right).
The segmentation proceeds as follows: Each document token t of length n, defined as a sequence
of characters c1, c2, ..., cn, is processed, in parallel, by a forward and a backward matching process.
The forward matching process starts at positions 1 and k=n and decrements k iteratively by
one until the sequence c1, c2, ..., ck is found in the subword lexicon. Analogously, the backward
matching process starts at positions k=1 and n and increments k iteratively by one until the
sequence ck, ck+1, ..., cn is found in the lexicon. Substrings recognized this way are entered into a
chart. Unless the remaining sequences are empty, ck+1, ck+2, ..., cn, as well as c1, c2, ..., ck−1, are
tested recursively in the same manner, by forward and backward matching, respectively.
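For illustration, the forward half of this matching procedure can be sketched as below. The parallel backward pass, the chart over alternative readings, and the plausibility checks are omitted, and the toy lexicon is ours, not the real resource.

```python
# Tiny illustrative subword lexicon (examples taken from the paper's text).
LEXICON = {"pancreat", "itis", "gluco", "corticoid", "s"}

def forward_match(token, chart):
    """Find the longest lexicon prefix, record it, then recurse on the rest."""
    for k in range(len(token), 0, -1):
        if token[:k] in LEXICON:
            chart.append(token[:k])
            if token[k:]:
                forward_match(token[k:], chart)
            return
    chart.append(token)  # no prefix matched: keep the fragment as-is

def segment(token):
    chart = []
    forward_match(token, chart)
    return chart

print(segment("pancreatitis"))     # ['pancreat', 'itis']
print(segment("glucocorticoids"))  # ['gluco', 'corticoid', 's']
```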
    The segmentation results that are stored in the chart are checked for morphological plausibility
using a finite-state automaton in order to reject invalid segmentations (e.g., those without stems
or beginning with a suffix). If there are ambiguous valid readings or incomplete segmentations
(due to missing entries in the lexicon), a series of heuristic rules are applied, which prefer those
segmentations with the longest match from the left, the lowest number of unspecified segments,
etc. Whenever the segmentation algorithm fails to detect a valid reading, all extracted stems of
four characters or longer - if available - are preserved and the remaining fragments are discarded.
Otherwise, if no stem longer than four characters can be determined during the segmentation,
the original word is restored. This method proved useful for the preservation of proper names,
although a dedicated name recognizer is still a desideratum for our system.
    In the final step, semantic normalization, each subword recognized is then substituted by its
corresponding MID. After that step, all synonyms within a language and all translations of seman-
tically equivalent subwords from different languages are represented by the same descriptor in that
target representation. Composed terms (such as ’myalg⊕y’ ), which are linked to their components
by the expands-to relation, are substituted by the MIDs of their components. Ambiguous classes,
i.e., those related by a has-sense link to two or more classes, are disambiguated by an in-house
developed disambiguation tool [6]. The final result is a morpho-semantically normalized document
in a language-independent, interlingual representation (cf. Figure 1, bottom left).

[Figure 1: Morpho-Semantic Indexing Pipeline. The parallel inputs “High TSH values suggest the
diagnosis of primary hypothyroidism ...” (English) and “Erhöhte TSH-Werte erlauben die Diagnose
einer primären Hypothyreose ...” (German) are orthographically normalized, segmented by the
morphosyntactic parser using the subword lexicon (“hypo thyroid ism” / “hypo thyre ose”), and
mapped via the subword thesaurus to the near-identical MID representations “#up tsh #value
#suggest #diagnost #primar #small #thyre” and “#up tsh #value #permit #diagnost #primar
#small #thyre”.]

2.3    Subword Lexicons and Subword Thesaurus
The process of lexicon construction is a challenging task which requires in-depth knowledge of
biomedical terminology. During the development workflow, the effect of lexical changes are im-
mediately fed back to the developers using word lists to test both the segmentation and the
assignment of MIDs. Furthermore, a collection of parallel texts (e.g., abstracts of medical pub-
lications in English and German) are used to detect errors in the assignment of MIDs. In order
to impose common policies on the lexicon builders, we developed a maintenance manual which
contains over 30 rules for tasks such as:
   • The proper delimitation of subwords (e.g., ’compat⊕ibility’ vs. ’compatib⊕ility’ )
   • The decision whether an affix introduces a new meaning which would justify a new entry
     (e.g., ’neuros⊕is’ instead of ’neur⊕osis’ )
   • Data-driven decisions, such as to add ’-otomy’ as a synonym of ’-tomy’ in order to block
     erroneous segmentations (e.g., ’nephrotomy’ into ’nephr⊕oto⊕my’ )
   • The decision to exclude short stems from segmentation (such as ’ov’ in ’ovum’ ) in order
     to block false segmentations; short stems are then treated as invariants so that they cannot
     be subject to a segmentation - therefore, possible combinations have to be considered as
     separate lexical units
   • The decision to locate the appropriate level of semantic abstraction when defining equivalence
     classes, e.g., by grouping hyper-, high, elevated into the same equivalence class
   • The decision which function words and affixes are excluded from indexing, such as ’and’,
     ’-ation’, ’-able’, and those which are not, e.g., ’dys-’, ’anti-’, ’-itis’.


2.4    Lexicon Statistics
The manual construction of our lexical resources was done over the last five years with a changing
amount of man power. The English, German and Portuguese subword lexicons were created fully
Table 1: Total number of subwords in different languages (Subwords_all: number of all subwords;
Subwords_auto: number of automatically acquired subwords; EqClasses: number of equivalence
classes; eT: number of expands-to relations; hS: number of has-sense relations)

                 Lang      Subwords_all      Subwords_auto      EqClasses         eT       hS
                 EN            22,706                   -          16,668        261      438
                 GE            24,178                   -          16,713        306      448
                 PT            14,997                   -          10,523        286      343
                 SP            13,060               7,795           9,096        215      254
                 FR             8,751               3,735           6,006        122      250
                 SE            13,557               5,946           7,908        182      478
                 All           97,249              17,476          21,679        497    1,369


manually, while for French, Spanish and Swedish, additional machine learning techniques were
applied in order to bootstrap the lexical resources for these languages [7].
    Overall, the lexicons contain 97,249 entries1 (see Table 1), corresponding to 21,679 equivalence
classes. In terms of relations between equivalence classes, there are currently 497 distinct expands-
To and 1,369 distinct has-Sense relations defined in the thesaurus. When medical corpora are
indexed by the morpho-semantic indexing routine, an average number of 1.62 MIDs per word is
obtained, considering all languages.


3      Experiments
3.1     Lucene Search Engine
In this year’s ImageCLEFmed we combined the MorphoSaurus-System with the open-source
search engine Lucene,2 a high performance, scalable, cross-platform retrieval system. It offers a
high degree of flexibility by implementing a well-scaling indexing approach and has desirable I/O
characteristics for both merging and searching. Lucene supports batch indexing and incremental
indexing and a wide range of query features, including full Boolean queries. Its rich query language
includes more than ten different query operators and allows multi-field search. Lucene handles
adjacency queries and searches multiple indexes at once merging the results to give a meaningful
relevance score based on TF-IDF [10]. Its ranking model achieves results that can even outperform
advanced vector retrieval systems [11].
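As background, the textbook TF-IDF weight at the core of this ranking can be sketched as follows. Lucene's actual scoring formula of that era adds further factors (length normalization, coordination), so this is only the classical core, not Lucene's implementation.

```python
import math

def tfidf(term, doc, docs):
    """Classical TF-IDF: term frequency times log inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)          # document frequency
    idf = math.log(len(docs) / df) if df else 0.0   # rarer terms weigh more
    return tf * idf

# Toy collection: "thyroid" is rarer than "tsh", so it scores higher.
docs = [["tsh", "value"], ["tsh", "thyroid"], ["liver"]]
print(round(tfidf("thyroid", docs[1], docs), 3))  # 1.099
```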

3.2     Indexing and Query Preparation Process
In the preparation phase, all annotations containing textual information were extracted from the
Casimage, MIR, PEIR, and PathoPIC datasets. For all image annotations, we
created four Lucene fields which can be queried separately. The first field contained headlines,
keywords and additional concise XML tags in the original representation (original t). The second
field contained all other free text information that was extracted from the dataset (original d).
The third and fourth field contained the Morphosaurus representation of the corresponding first
and second field (mid t and mid d). In order to compare the subword approach with traditional
automated stemming routines we added two further fields (stem t and stem d) containing the
annotations processed by the Porter Stemmer3. In addition to these six query fields, a language
flag was added for each image annotation (language).
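The per-image index record described above can be sketched as a simple mapping; the field names follow the paper, while the helper function and the stand-ins for the stemmer and the MorphoSaurus engine are ours.

```python
def make_record(headline, free_text, lang, stem, to_mids):
    """Build the seven-field record indexed per image annotation."""
    return {
        "original_t": headline,        # headlines, keywords, concise XML tags
        "original_d": free_text,       # all other free-text information
        "mid_t": to_mids(headline),    # MID representation of original_t
        "mid_d": to_mids(free_text),   # MID representation of original_d
        "stem_t": stem(headline),      # Porter-stemmed original_t
        "stem_d": stem(free_text),     # Porter-stemmed original_d
        "language": lang,              # language flag
    }

# Trivial stand-ins, for illustration only:
rec = make_record("Oral cavity", "Images of teeth ...", "en",
                  stem=lambda t: t.lower(),
                  to_mids=lambda t: ["#stom"])
print(sorted(rec))
```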
   1 Just for comparison, the size of WordNet [2] assembling the lexemes of general English in the 2.0 version is
on the order of 152,000 entries (http://www.cogsci.princeton.edu/~wn/). Linguistically speaking, the entries
are basic forms of verbs, nouns, adjectives and adverbs.
   2 http://jakarta.apache.org/lucene/docs/index.html
   3 We used the stemmer available at http://www.snowball.tartarus.org (last visited in July 2006).
    In the topics collection, ”Show me”, ”Zeige mir” and ”Montre-moi des” and language specific
stopwords4 were removed. The queries were subsequently transformed into the Morphosaurus
interlingua resulting in 30 “original/mid” query pairs, each represented in all three languages.
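The topic preparation can be sketched as below: strip the language-specific "Show me" prefix and stopwords, leaving the content terms for interlingual conversion. The stopword sets here are tiny stand-ins for the Snowball lists mentioned in the footnote, and the function itself is our assumption about the workflow.

```python
PREFIXES = {"en": "show me", "de": "zeige mir", "fr": "montre-moi des"}
STOPWORDS = {"en": {"of", "the", "and", "including"},
             "de": {"der", "die", "das"}}

def prepare_query(topic, lang):
    """Strip the topic prefix and stopwords; return the remaining terms."""
    t = topic.lower()
    prefix = PREFIXES.get(lang, "")
    if prefix and t.startswith(prefix):
        t = t[len(prefix):]
    return [w for w in t.split() if w not in STOPWORDS.get(lang, set())]

print(prepare_query("Show me images of the oral cavity", "en"))
# ['images', 'oral', 'cavity']
```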

3.3    Monolingual Scenario
One of our goals was to show the effectiveness of the subword approach for monolingual biomedical
text retrieval. We therefore decided to create a scenario which makes use of the English subset
only. Four different runs were prepared:
  1. Orig-En-En: As a baseline we took the original English query and searched in the corre-
     sponding two Lucene fields which contained the original English annotations (original t,
     original d, language : en).
  2. Stem-En-En: The Orig-En-En queries were processed by the Porter Stemmer and sent to
     the corresponding (stem t, stem d)-fields of Lucene.
  3. Mids-En-En: In this run we took the morpho-semantic normalized form of each English
     query and searched in the two fields that contained the MID representation of the document
     annotations (mid t, mid d, language : en).
  4. Both-En-En: Here, we combined both the original query and the morpho-semantic normal-
     ized query in a simple disjunct fashion. The original words were queried in the original fields
     (original t and original d, language : en) and the normalized queries in the MID fields
     (mid t, mid d, language : en).
   Example Box 1 lists the query syntax of Query 1 in all four test scenarios.


      Query 1: “Show me images of the oral cavity including teeth and gum tissue”

  Example 1 – Run: Orig-En-En (Baseline), Query: 1
  (+language:en +original t:images) (+language:en +original t:oral) (+language:en +orig-
  inal t:cavity) (+language:en +original t:including) (+language:en +original t:teeth)
  (+language:en +original t:gum) (+language:en +original t:tissue) (+language:en +orig-
  inal d:images) (+language:en +original d:oral) (+language:en +original d:cavity) (+lan-
  guage:en +original d:including) (+language:en +original d:teeth) (+language:en +origi-
  nal d:gum) (+language:en +original d:tissue)

  Example 2 – Run: Stem-En-En, Query: 1
  (+language:en +stem t:imag) (+language:en +stem t:oral) (+language:en +stem t:caviti)
  (+language:en +stem t:includ) (+language:en +stem t:teeth) (+ language:en +stem t:gum)
  (+language:en +stem t:tissu) (+language:en +stem d:imag) (+language:en +stem d:oral)
  (+language:en +stem d:caviti) (+language:en +stem d:includ) (+language:en +stem d:teeth)
  (+language:en +stem d:gum ) (+language:en +stem d:tissu)
  Example 3 – Run: Mids-En-En, Query: 1
  (+language:en     +mid t:#imag)      (+language:en     +mid t:#stom)      (+language:en
  +mid t:#excav) (+language:en +mid t:#enfold) (+language:en +mid t:#tusk) (+lan-
  guage:en +mid t:#gum) (+language:en mid t:#histio) (+language:en +mid d:#imag) (+lan-
  guage:en +mid d:#stom) (+language:en +mid d:#excav) (+language:en +mid d:#enfold)
  (+language:en +mid d:#tusk) (+language:en +mid d:#gum)(+language:en +mid d:#histio)


  4 The Snowball Stemmer incorporates stop word lists containing 172 English, 232 German and 155 French entries.
    Example 4 – Run: Both-En-En, Query: 1
    (+language:en +original t:images) (+language:en +original t:oral) (+language:en +orig-
    inal t:cavity) (+language:en +original t:including) (+language:en +original t:teeth)
    (+language:en +original t:gum) (+language:en +original t:tissue) (+language:en +orig-
    inal d:images) (+language:en +original d:oral) (+language:en +original d:cavity) (+lan-
    guage:en +original d:including) (+language:en +original d:teeth) (+language:en +origi-
    nal d:gum) (+language:en +original d:tissue) (+language:en +mid t:#imag) (+language:en
    +mid t:#stom) (+language:en +mid t:#excav) (+language:en +mid t:#enfold) (+lan-
    guage:en +mid t:#tusk) (+language:en +mid t:#gum) (+language:en mid t:#histio) (+lan-
    guage:en +mid d:#imag) (+language:en +mid d:#stom) (+language:en +mid d:#excav)
    (+language:en     +mid d:#enfold)     (+language:en     +mid d:#tusk)     (+language:en
    +mid d:#gum)(+language:en +mid d:#histio)


Example Box 1: Query syntax of the first query for Orig-En-En, Stem-En-En, Mids-En-En
     and Both-En-En. Instead of using nested query terms we prefer the more flexible but longer
disjunctive normal form which consists of disjuncts, each of which is a conjunction of one or more query
 terms. Disjuncts (OR) do not have to be marked separately, conjunctions (AND) can be expressed by
                                             the + symbol.
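The DNF expansion shown in Example Box 1 can be generated mechanically: one parenthesized conjunct per (field, term) pair, each also requiring the language flag. The field names follow the paper; the helper itself is our sketch.

```python
def build_query(terms, fields, lang):
    """Emit the disjunctive normal form: one conjunct per field/term pair."""
    conjuncts = [f"(+language:{lang} +{field}:{term})"
                 for field in fields      # e.g. original_t, then original_d
                 for term in terms]
    return " ".join(conjuncts)            # disjuncts need no explicit OR

q = build_query(["images", "oral"], ["original_t", "original_d"], "en")
print(q)
# (+language:en +original_t:images) (+language:en +original_t:oral) ...
```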


3.4     Multilingual Scenario / Textual
In the multilingual scenario, only the morpho-semantic representation of the topics in English,
German and French were used to search in both MID fields (mid t and mid d). No restrictions
to the language flag of the document collections were made. The resulting runs were named
Mids-En-All, Mids-De-All and Mids-Fr-All, respectively. As a baseline, the original queries in all
languages were used to separately search in the language corresponding original document fields
(i.e., English queries in English documents, German queries in German documents and French
queries in French documents). The baseline is referred to as Orig-All-All. As a second baseline, we
also consider the monolingual baseline Orig-En-En as a solid reference value, taking into account
that the English subset represents a comprehensive part of the annotations (almost 93% of all
annotations are available in English).

3.5     Multilingual Scenario / Mixed
In the mixed multilingual scenario we combined the retrieval results of the multilingual textual
scenario with the GIFT visual retrieval results. In addition, a test run with the best textual run
Both-En-En in combination with the GIFT results was carried out, as we expected this run to
be the best of all runs. As the textual retrieval results were expected to perform better than
the visual ones, we ranked at the top those documents which were found among the first 50 hits
of both retrieval systems, followed by the remaining textual retrieval results. Hits found only in
the visual retrieval system were discarded. The runs are named Orig-All-All-Comb, Both-En-En-Comb,
Mids-En-All-Comb, Mids-De-All-Comb and Mids-Fr-All-Comb.
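One reading of this merging rule is sketched below; we interpret "the first 50 hits" as the top 50 of the visual result list, and both the function name and that cutoff semantics are our assumptions.

```python
def merge(textual, visual, cutoff=50):
    """Rank textual hits that also occur in the visual top-`cutoff` first,
    then the remaining textual hits; visual-only hits are discarded."""
    visual_top = set(visual[:cutoff])
    both = [d for d in textual if d in visual_top]
    text_only = [d for d in textual if d not in visual_top]
    return both + text_only

textual = ["a", "b", "c", "d"]   # textual ranking (best first)
visual = ["c", "x", "a", "y"]    # visual (GIFT) ranking
print(merge(textual, visual))    # ['a', 'c', 'b', 'd']
```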


4      Results
The average precision values at all eleven standard recall points (0.0, 0.1, 0.2, ..., 1.0) are depicted
in Figure 3 for the monolingual and in Figure 5 for the multilingual scenario.5 We also depict
the precision values between P5 and P100 in Figure 2 (monolingual) and Figure 4 (multilingual
scenario). Interesting from a realistic retrieval perspective, at least to our view, is the average
   5 We here present results from a detailed re-evaluation after the results were officially published. They slightly
diverge from the original results.
[Figures 2–7, combined in Fig. 8: Precision/Recall graphs. Fig. 2: precision at cut-offs 5–100 for
the monolingual runs (Mids+Orig, Mids, Orig, Stem); Fig. 3: interpolated recall/precision,
monolingual. Fig. 4: precision at cut-offs, multilingual/textual (English, German, French,
baseline); Fig. 5: interpolated recall/precision, multilingual/textual. Fig. 6: precision at cut-offs,
multilingual/mixed; Fig. 7: interpolated recall/precision, multilingual/mixed. Fig. 8: Precision/
Recall graphs for the monolingual (first), multilingual/textual (second) and multilingual/mixed
scenarios.]
                                       Monolingual / Textual
             Scenario    Orig-En-En     Stem-En-En         Mids-En-En         Both-En-En
               map         0.1625      0.1482    91%      0.1297    80%      0.1792  110%
             top2avg       0.3778      0.3669    97%      0.3175    84%      0.4525  120%
                P5         0.4867      0.4667    96%      0.4867   100%      0.5933  122%
                P20        0.3600      0.4000   111%      0.3550    99%      0.4500  125%

                  Table 3: Standard Precision/Recall Table for the Monolingual Scenario

                                       Multilingual / Textual
    Scenario    Orig-All-All    Orig-En-En      Mids-En-All     Mids-De-All     Mids-Fr-All
      map          0.1068      0.1625  152%    0.1366  128%    0.1439  135%    0.0734   69%
    top2avg        0.2881      0.3778  131%    0.3970  138%    0.3751  130%    0.2028   70%
       P5          0.3200      0.4867  152%    0.4200  131%    0.4000  125%    0.3103   97%
      P20          0.2600      0.3600  138%    0.3467  133%    0.3383  130%    0.2293   88%

            Table 4: Standard Precision/Recall Table for the Multilingual Scenario (Textual)
                                       Multilingual / Mixed
  Measure    Orig-All-All    Both-En-En      Mids-En-All     Mids-De-All     Mids-Fr-All
  map           0.1079       0.1791 166%     0.1383 128%     0.1441 134%     0.0746  69%
  top2avg       0.2942       0.4516 154%     0.4015 136%     0.3726 127%     0.2020  69%
  P5            0.3333       0.6000 180%     0.4400 132%     0.4000 120%     0.3241  97%
  P20           0.2633       0.4500 171%     0.3517 134%     0.3350 127%     0.2259  86%

Table 5: Standard precision/recall table for the multilingual scenario (mixed). Comb suffixes of the
scenario identifiers were stripped for lack of space.


gain on the top two recall points as well as the P5 and P20 values. Together with the overall map
value, these figures can be found in Table 3 for the monolingual and in Table 4 for the multilingual
scenario.
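    The measures reported in these tables are standard ranked-retrieval notions. As a minimal
sketch (the function names are ours, not part of any evaluation toolkit), precision at cut-off k and
average precision can be computed from a ranked list of binary relevance judgments as follows:

```python
def precision_at_k(rels, k):
    """Fraction of relevant documents among the top k retrieved.
    rels is a ranked list of 0/1 relevance judgments."""
    return sum(rels[:k]) / k

def average_precision(rels, num_relevant):
    """Mean of the precision values at the ranks where relevant
    documents occur, divided by the total number of relevant docs."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

# Toy ranked list: 1 = relevant, 0 = not relevant
rels = [1, 0, 1, 1, 0]
print(precision_at_k(rels, 5))      # 0.6
print(average_precision(rels, 3))   # (1/1 + 2/3 + 3/4) / 3
```

The map values in the tables are averages of such per-topic average-precision values over all
topics of the task.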
    The first observation in the monolingual scenario was that the simple stemming approach
(Stem-En-En) does not perform better than the baseline Orig-En-En, with values between 91%
(map) and 111% (P20) of the baseline. Mids-En-En also does not perform as well as the baseline
Orig-En-En with respect to map (0.13 vs. 0.16) and top2avg (0.32 vs. 0.38), and achieves the same
figures as Orig-En-En for the P5 (0.49) and P20 (0.36) precision values.
    The Both-En-En run outperforms all other runs on all relevant measures. Comparing the P5 and
P20 values of Both-En-En with those of the baseline run Orig-En-En (P5: 0.60 vs. 0.48, P20: 0.45
vs. 0.36), Both-En-En exceeds the baseline by up to 25%. Its top2avg (0.45 vs. 0.38) is 20% higher,
and its map value (0.18 vs. 0.16) is still 10% higher. The P5 value of Both-En-En is the second
best of all (automatic/textual) runs at ImageCLEFmed 2006.
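    The Both-style combination can be pictured as issuing the original query terms together with
their interlingual identifiers (MIDs) in one disjunctive Lucene query. The sketch below is purely
illustrative: the mid_lexicon contents and the MID strings are toy stand-ins of our own invention,
not actual MorphoSaurus output.

```python
# Toy stand-in for the subword lexicon mapping a term to its MIDs
# (the entries and MID names here are hypothetical).
mid_lexicon = {
    "gastroenteritis": ["#gastr", "#enter", "#inflamm"],
    "chronic": ["#chron"],
}

def both_query(terms):
    """Build a disjunctive query string from the original terms
    plus any MIDs the lexicon yields for them."""
    parts = list(terms)
    for term in terms:
        parts.extend(mid_lexicon.get(term.lower(), []))
    return " OR ".join(parts)

print(both_query(["chronic", "gastroenteritis"]))
# chronic OR gastroenteritis OR #chron OR #gastr OR #enter OR #inflamm
```

Because the disjunction retains the surface forms, documents matching only the original words
still score, which is consistent with Both-En-En never falling below the baseline.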
    Regarding the multilingual runs, the first observation was that the multilingual baseline
Orig-All-All was notably lower than the monolingual baseline (the monolingual results of Orig-
En-En are included in Table 4 for comparison). Since English annotations make up the bulk of the
collection (93% of the annotations are available in English), querying the English subset achieves
better results than querying the German and French subsets. Among the multilingual runs, we
observe similar results for the German and the English test scenarios, with P5 values between
0.40 (Mids-De-All) and 0.42 (Mids-En-All) and top2avg values between 0.38 (Mids-De-All) and
0.40 (Mids-En-All). Both runs exceeded the multilingual baseline on all relevant measures by
about 25% to 38%.
    Considering the low number of German and French annotations compared to English, the
Mids-De-All run can almost be considered a fully translingual run (German queries on English
documents). It is particularly promising that this run performed as well as the English run (Mids-
En-All) and the monolingual baseline (Orig-En-En), and clearly better than the multilingual base-
line (Orig-All-All).
    The French run Mids-Fr-All performed relatively poorly compared to the others. This reflects
the fact that we have only just started to build the French subword lexicon.
    The mixed multilingual runs (Table 5) performed, on average, no better than the multilingual
textual runs. Evidently, better merging algorithms are needed to exploit the synergies between the
textual and the visual runs; this is left to future work.
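    One standard family of such merging algorithms is linear score fusion, for instance CombSUM
over min–max-normalized scores. The following sketch illustrates that idea under this assumption;
it is not the merging scheme used in our submitted runs.

```python
def minmax(scores):
    """Rescale a {doc: score} map linearly into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo or 1.0  # guard against identical scores
    return {d: (s - lo) / span for d, s in scores.items()}

def combsum(*runs):
    """CombSUM: sum the normalized scores per document over all runs
    and return the documents ranked by the fused score."""
    fused = {}
    for run in runs:
        for doc, s in minmax(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused, key=fused.get, reverse=True)

textual = {"img1": 9.0, "img2": 4.0, "img3": 1.0}  # toy textual run
visual  = {"img2": 0.8, "img3": 0.9, "img4": 0.1}  # toy visual run
print(combsum(textual, visual))  # img2 fuses best (strong in both runs)
```

Normalization matters here because textual and visual retrieval scores live on incomparable
scales; summing raw scores would let one modality dominate.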
    While the P5, P20 and top2avg values of our best runs are quite high compared with other
groups, the overall map value is only average. This is due to an overrepresentation of certain
unspecific MIDs in our subword approach, which causes precision to decline beginning roughly at
the cut-off point 100. However, as the P100–P1000 values are of only limited value to a user in a
real-life scenario, these findings are not too worrying. Still, increasing the overall map value
remains an important goal of our future work.


5    Conclusion and Future Work
We introduced a subword-based approach to biomedical text retrieval that addresses many of
the general and domain-specific challenges in current CLIR research. In our monolingual
test runs, we showed that a combination of original and morpho-semantically normalized queries
boosts precision by up to 25% (P5, P20) compared to the baseline. The best of our runs was
the second best of all (automatic/textual) test runs with respect to the P5 value. In the
multilingual runs we achieved similar results for the English and German test runs. In all
scenarios, the P5, P20 and top2avg values are distinctly higher than the multilingual baseline.
    A detailed analysis of the test runs is now due in order to determine which parts of our system
contributed to what extent to the precision gain. We also aim to increase the overall map value,
even though we consider P5 and P20 the most important measures in terms of user-friendliness.
A medium-term goal is to show the usefulness of subwords in other (technical) domains.

