<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <pub-date>
        <year>1864</year>
      </pub-date>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Stemming is an NLP technique that is frequently used and successfully applied in IR systems. A standard tool is the Porter stemmer [
        <xref ref-type="bibr" rid="ref2">7</xref>
        ], which achieves a normalisation by simply chopping off suffixes. Such stemmers have serious deficiencies: for instance, general is mapped to gener and distribute to distribut, neither of which is a lexical base form, and both mappings lead to improper conflations. To overcome them, advanced stemmers have been developed and combined with a lexicon [4] to verify the identified stem. This approach produces far better results and avoids errors like those shown above, but others, such as the mapping of distributed to distribut, still occur; in this case, the word distributed cannot be found in the dictionary. Irregular plurals (media/medium) and irregular inflected forms (went/go) also cause errors. The main drawback of this approach thus lies in the coverage of the lexicon.</p>
      <p>For languages with a rich declensional morphology such as French or German, the results of such stemming are rather unsatisfying because considering only inflection (or even only suffix reduction) is not enough (cf. [6], [12]). For instance, stemming the German past participle gegangen (gone) to gang yields a wrong form (the correct one is gehen, to go). German verbs, as well as French verbs such as aller (to go) or recevoir (to get), have numerous forms, which makes it almost impossible to stem them with suffix algorithms. For German, compound formation often leads to failures because of the underlying highly productive morphological process (cf. [3]).</p>
      <p>The Linguistic Processing</p>
      <p>In Mpro-IR, the Mpro programme package [5] developed at the IAI is used for the linguistic processing; its major features are described in the following. Mpro was primarily developed to process German but is now available for different languages (including Eastern European languages). However, the same level of functionality as in the German module is not available for all language modules. Mpro performs a morpho-syntactic analysis consisting of lemmatisation, part-of-speech tagging and, for German, a compound analysis as well as, optionally, an additional syntactic and semantic disambiguation that mainly evaluates context information. For the reduction of syntactic ambiguities there is also a shallow parsing component available for each language.</p>
      <p>The morpho-syntactic analysis is combined with a look-up in a word-form dictionary. In a first step, the word-forms are looked up in a special tagging dictionary, for which an entry looks as follows:
        {string=Word-form,c=w,sc=CAT,lu=Citation-form,...}
        where CAT is the category. Nouns, verbs, adjectives, and derived adverbs are looked up in a morpheme lexicon. This morphological dictionary contains allomorphs but also some irregular word-forms which cannot be identified in another way, as well as a variety of toponyms and other names. Each entry shows how the associated stems behave morphologically, as shown in the example.</p>
      <p>The feature ds contains the morphological derivation, and ls the respective normalised form. The features s and ss (for compounds) contain semantic information. In the example above, all three words have the same derivation. For German words, a compound analysis is additionally performed (cf. the example below); the result is given in the feature ts and its normalised form in the feature t. These features are also assigned in English analyses but always correspond to the lu feature.</p>
      <p>Due to a special treatment, some defective noun constructions in German, such as those occurring in coordinations like Informations- und Kommunikationsdienst (information and communications services), are recognised. Mpro assigns the missing head information by using a lookahead algorithm.
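        The head completion for truncated German compounds such as Informations- can be illustrated with a minimal sketch. It is not Mpro's actual algorithm: the function name complete_elliptic_heads, the DECOMPOSITION table (standing in for the compound analysis) and the token list are assumptions made for illustration only.
        <preformat>
```python
def complete_elliptic_heads(tokens, decompose):
    """Expand truncated compounds such as 'Informations-' with the head of
    the next full compound found by looking ahead in the token list."""
    completed = []
    for i, token in enumerate(tokens):
        if token.endswith("-") and len(token) > 1:
            head = None
            for later in tokens[i + 1:]:  # lookahead over the following tokens
                parts = decompose(later)
                if parts and len(parts) > 1:
                    head = parts[-1]  # last element of the compound is its head
                    break
            completed.append(token[:-1] + head if head else token)
        else:
            completed.append(token)
    return completed


# Hypothetical decomposition table standing in for a real compound analysis.
DECOMPOSITION = {"Kommunikationsdienst": ["Kommunikations", "dienst"]}
decompose = DECOMPOSITION.get  # returns None for unknown words

tokens = ["Informations-", "und", "Kommunikationsdienst"]
completed = complete_elliptic_heads(tokens, decompose)
print(completed)  # ['Informationsdienst', 'und', 'Kommunikationsdienst']
```
        </preformat>
        The sketch simply copies the head of the next decomposable word; a real analysis would also have to check that the two conjuncts are compatible.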
      </p>
      <p>[...] prefix verbs such as mitteilen, fixed expressions such as in Bezug auf and de facto, abbreviations like etc. and i.e., as well as proper names such as Bill and Berlin.</p>
      <p>After this analysis, the German output can be further disambiguated by evaluating context information: for example, if the first letter of a word-form is capitalised and the word is not the first in a sentence, it must be a noun. In a final step, shallow parsing can be applied to reduce other syntactic ambiguities such as verb/noun readings. This parsing process can also be performed on the English and French output of the morphological analysis to get an almost unambiguous representation. Mpro does not reduce ambiguity where the correctness of the decision is doubtful.</p>
      <p>The Retrieval</p>
      <p>The remainder of this section describes how these results of the morpho-syntactic analysis are applied at the various stages of the IR process. For all three stages (indexing, query expansion, and the search together with the document ranking), the information provided by the features lu, ls and t (the latter currently for German only) is exploited.</p>
      <p>Based on the analyses of the documents, several indices are built: one using the information about the lexical unit (i.e. the normalised form), one using the derivational information, and, for German, a third one built from the decomposition information. Although English and French nouns have a t-feature, we have not exploited this kind of information because it is subject to an ongoing revision of the English and French morpheme lexicons (see above). With each key, the document identification number, the sentence number (snr), the word number (wnr), and the word-form (the form of the word as it occurs in the text) are stored. Function words (entries with c=w) are discarded from the indexing. This process is done within a preparation phase.</p>
      <p>For the cross-language retrieval, we decided to translate the queries and to carry out a monolingual search afterwards. This approach seems more appropriate because legal information is highly dependent on the original wording, and machine translation systems provide only poor quality [2]. The input to the translation component is the complete morphological analysis of the query. Mpro-IR uses a shallow translation tool which performs a lexical transfer based on huge transfer lexicons (the English-German lexicon covers about 400,000 entries) comprising single words, abbreviations, compound terms, and also fixed phrases. For multiword units, the MT component first checks whether the dictionary contains a translation for the whole phrase. If no translation exists, the phrase is translated compositionally, with the translation guided by the part of speech, i.e. for verbs only verb translations are assigned. The translation output undergoes shallow parsing based on a phrase grammar to obtain only one possible translation, taking the syntactic representation of the source into account. For German as target language, the syntactic variants of a term are additionally sorted out. For example, there are two entries in the English-German dictionary for human dignity, Menschenwürde and Würde des Menschen. In these cases the compound is preferred: due to the query expansion, all occurrences of the syntactic variant Würde des Menschen are found as well, but the search for a compound is much faster than that for a phrase.</p>
      <p>At search time, the queries are processed by the same morpho-syntactic analysis as the documents. For the monolingual search, the function words are removed from the analysis output, and for the meaning-bearing words the values of the lu-, ls- and, for German queries, the t-feature are extracted to construct a set of search patterns. For the input query Competitiveness of European industry, the set of search terms consists of competitiveness, compete, european, europe, industry.</p>
      <p>The search itself consists of several look-ups in the different indices. For each content-bearing term the following look-ups are done:
        1. Looking up the index built over the lexical base forms (lu-index) with the value of the lu-feature
        2. [...]
        3. Looking up the index built over the derivations (ls-index) with the value of the ls-feature</p>
      <p>For phrases, the topmost result list consists of documents which contain the elements of the phrase exactly (excluding function words). The next list contains documents in which at least one phrase element occurs only as part of a compound. All further result lists are calculated analogously.</p>
      <p>For compounds, the different formation in English and French compared to German leads to a different search strategy. Bearing in mind that open compound terms in English and French have an almost fixed word order, we defined a distance factor to decide whether the occurrence of two or more words represents an open compound or not. Based on statistical data, the longest distance between the meaning-bearing words of a phrase is fixed to 3. This allows occurrences of advertising in UK’s television to be classified as an exact hit for television advertising. For English as well as for French compounds, the occurrences of each word within a phrase are evaluated against this distance factor using the word number provided by the index, and sorted into the following three lists:
        1. The lu-values looked up in the lu-index of each element occur within the determined distance.
        2. At least for one element only the derivation occurs within this distance.
        3. All other occurrences.
        We apply this distance measure also to German to find syntactic variants of compound terms.</p>
      <p>Usually the rank of a retrieved document is computed by tf*idf. A weight based on frequency does not seem adequate in a legal domain, in which a term may occur only once in a document that is nevertheless much more relevant than a document in which the term occurs several times. Thus, in Mpro-IR, the documents are ranked by the information used to retrieve them, in the order of the lists described above. This ranking mirrors the relevance in terms of the reliability of the linguistic information used to retrieve a document: a document retrieved by stem information is more relevant to the query than a document retrieved by derivational information. At the same time it expresses the degree of precision of the retrieval. The results of the first list have a higher precision than those of the lower lists, because further down the probability that mismatched documents are retrieved increases.
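        The interplay of the indices, the distance factor of 3 and the three result lists can be sketched as follows. This is a minimal illustration under stated assumptions, not the Mpro-IR implementation: the names Posting, build_index, within_distance and classify are hypothetical, the word number is assumed to count words within a sentence, the distance is measured as the span over all phrase elements, and the toy analyses (including their ls values) are invented for the example.
        <preformat>
```python
from collections import defaultdict
from itertools import product
from typing import NamedTuple

MAX_DISTANCE = 3  # distance factor given in the text


class Posting(NamedTuple):
    doc: str  # document identification number
    snr: int  # sentence number
    wnr: int  # word number, assumed here to count words within a sentence


def build_index(analyses, feature):
    """Index the value of one feature ('lu' or 'ls') of every content word."""
    index = defaultdict(list)
    for doc, snr, wnr, feats in analyses:
        if feats.get("c") == "w":  # function words are discarded
            continue
        if feature in feats:
            index[feats[feature]].append(Posting(doc, snr, wnr))
    return index


def within_distance(hit_lists, limit=MAX_DISTANCE):
    """True if one hit per phrase element lies in one sentence within `limit`."""
    for combo in product(*hit_lists):
        sentences = {(p.doc, p.snr) for p in combo}
        span = max(p.wnr for p in combo) - min(p.wnr for p in combo)
        if len(sentences) == 1 and limit >= span:
            return True
    return False


def classify(query, lu_index, ls_index):
    """Assign each candidate document to list 1, 2 or 3.

    `query` is a list of (lu, ls) pairs, one per meaning-bearing word.
    """
    docs = set()
    for lu, ls in query:
        docs.update(p.doc for p in lu_index.get(lu, []))
        docs.update(p.doc for p in ls_index.get(ls, []))
    tiers = {}
    for doc in docs:
        lu_hits = [[p for p in lu_index.get(lu, []) if p.doc == doc] for lu, _ in query]
        ls_hits = [[p for p in ls_index.get(ls, []) if p.doc == doc] for _, ls in query]
        any_hits = [a + b for a, b in zip(lu_hits, ls_hits)]
        if within_distance(lu_hits):
            tiers[doc] = 1  # every element matched by its lu value
        elif within_distance(any_hits):
            tiers[doc] = 2  # at least one element matched only by its derivation
        else:
            tiers[doc] = 3  # all other occurrences
    return tiers


def word(doc, snr, wnr, lu, ls=None, c="n"):
    """Build one toy analysis record; ls defaults to the lu value."""
    return (doc, snr, wnr, {"c": c, "lu": lu, "ls": ls or lu})


# Invented toy analyses: (document, sentence number, word number, features).
analyses = [
    word("d1", 1, 1, "television"), word("d1", 1, 2, "advertising", "advertise"),
    word("d2", 1, 1, "advertising", "advertise"), word("d2", 1, 2, "in", c="w"),
    word("d2", 1, 3, "UK"), word("d2", 1, 4, "television"),
    word("d3", 1, 1, "advertise", c="v"), word("d3", 1, 2, "on", c="w"),
    word("d3", 1, 3, "television"),
    word("d4", 1, 1, "television"), word("d4", 2, 1, "advertising", "advertise"),
]
lu_index = build_index(analyses, "lu")
ls_index = build_index(analyses, "ls")
query = [("television", "television"), ("advertising", "advertise")]
tiers = classify(query, lu_index, ls_index)
ranked = sorted(tiers, key=lambda d: (tiers[d], d))  # rank by list, not by tf*idf
print(ranked)  # ['d1', 'd2', 'd3', 'd4']
```
        </preformat>
        In this toy run, d2 (advertising in UK's television) lands in the first list together with d1, d3 is found only through the derivation advertise, and d4 falls into the last list because its two words occur in different sentences.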
      </p>
      <p>3 Mpro-IR in CLEF</p>
      <p>We participated for the first time in a CLEF/TREC evaluation to investigate how Mpro-IR, developed for a special domain, fares on unrestricted documents with respect to recall and precision.</p>
      <p>Currently the Mpro-IR system covers only German, English, and French. To perform the CLIR task, which additionally comprises the search in Italian documents, we integrated a small Italian component into Mpro-IR. To provide sufficient coverage for this module, we analysed the complete Italian topics (titles, descriptions, and narratives) and added unknown words (morphemes) to our monolingual lexicon. For the translation component we added only translations for the words occurring in the title sections of the topics. Thus the Italian morpheme lexicon now has 27,800 entries, compared to the English morpheme lexicon with about 48,300 entries. We used English topics and retrieved documents in English, French, German, and Italian; therefore we added missing translations for the terms of the topic titles to the respective transfer dictionaries.</p>
      <p>Setting up the Experiment</p>
      <p>Due to time and space restrictions we could perform and submit only one run. We therefore decided to perform a phrase search only over the title sections of the topics, although we noticed that the type of queries was not always adequate for this kind of search. To build up the indices, the texts underwent a normalisation, i.e. we discarded all formatting information (including the title sections), which in some cases led to a lower performance. This was mainly done due to space limitations: the Mpro tool is able to identify SGML tags, but the analyses are unnecessarily blown up.</p>
      <p>Retrieval Performance</p>
      <p>Given that all meaning-bearing words have to occur in the same sentence and only one translation is used, the outcome is not too bad. The results show more or less what we expected: for topics which are more or less incomplete sentences, such as French conscientious objector or supermarket ceiling in Nice collapses, we got no or only a few results (cf. the figure below). For topics such as European Economic Area or World Trade Organisation the results are better, though not satisfying.</p>
      <p>Fig. 1. CLEF results</p>
      <p>Our main objective was to evaluate the use of derivational and decompositional information to improve the recall. We conclude that most of the documents are retrieved by using the information of the lexical base form. Only a few others are retrieved on the basis of derivational information. The contribution of decomposition information, which is only used for retrieving German documents, depends on the type of compound and in a few cases also on the type of the single words forming a compound. No relevant occurrences of syntactic variants were found in the corpus.</p>
      <p>We also got only a few results on the basis of the productive use of decomposition information, i.e. documents containing semantically similar terms. The main reason is certainly the restricted search space; furthermore, the German compounds occurring in the queries (such as Kriegsdienstverweigerer, Krebsgenetik, Golfskriegssyndrom, Nobelpreis, Alkoholkonsum, ...) consist of words which are not frequently used in compound formation within the context of the respective query. Another reason is that only one translation is used (e.g. Methane deposit is translated into German as Methanlagerstätte, whereas the documents often use the synonym Methanlager).</p>
      <p>To get an impression of the degree to which the restriction to a sentence as search space is too strong, we performed a Boolean search for some sample topics (T6, T9, T10, T14, T21). For queries such as French conscientious objectors (T6), Methane deposits (T9), and Tourism in the US (T14), a five times better recall is achieved, and a 30% improvement for the query War and radio (T10), compared to the run submitted to CLEF. We got the same result for the query European Economic Area (T21), because here we got no results in the monolingual retrieval.</p>
      <p>Conclusion</p>
      <p>As the results show, the phrase search as implemented in Mpro-IR is useful in retrieval systems developed for a special type of domain where the search for complex phrases is necessary, as in the legal domain. In retrieval systems dealing with unrestricted texts, a Boolean search achieves much better recall. With a Boolean search we could certainly get a better insight into the usefulness of derivational and compositional information in the retrieval process, due to the higher recall.</p>
      <p>The results of the CLEF evaluation coincide with those we got from the evaluation of the retrieval algorithm within the EMIS system [10]. There, too, most hits could be retrieved by using precise lexical units and derivational information. Compositional information was also valuable for detecting syntactic variants of German compounds. The improvement of the recall by so-called semantically similar terms is very poor. Because this approach is also very time-consuming, we will give it up in favour of a better morpho-syntactic analysis. This will then provide the grounds for better indexing using a term recognition component, and for a better translation component.</p>
      <p>For the query expansion on the monolingual side, we are currently experimenting with a method to add synonyms, which will be computed automatically by translating the translations back to the source language. The search itself could be improved by taking advantage of the part of speech together with the semantic information already provided by the morpho-syntactic analyzer [9].
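        The synonym expansion by back-translation mentioned in the conclusion can be illustrated with a minimal sketch. The function back_translation_synonyms and the two toy transfer lexicons are hypothetical; only the word pair Methanlagerstätte/Methanlager is taken from the discussion of the results, and the method actually under experiment may differ.
        <preformat>
```python
def back_translation_synonyms(term, to_target, to_source):
    """Return source-language terms reached by translating `term` and back."""
    candidates = set()
    for translation in to_target.get(term, []):
        candidates.update(to_source.get(translation, []))
    candidates.discard(term)  # the term itself is not a synonym
    return sorted(candidates)


# Hypothetical toy transfer lexicons; the entries are illustrative only.
de_en = {"Methanlagerstätte": ["methane deposit"]}
en_de = {"methane deposit": ["Methanlagerstätte", "Methanlager"]}

synonyms = back_translation_synonyms("Methanlagerstätte", de_en, en_de)
print(synonyms)  # ['Methanlager']
```
        </preformat>
        Without further filtering, such a round trip can also return unrelated senses of ambiguous translations, so the candidates would presumably need to be constrained, for instance by part of speech.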
      </p>
      <p>The approach we pursue in Mpro-IR, using a sophisticated morpho-syntactic analysis, has shown that the recall can be improved by a more precise identification of the lexical base units and an almost unambiguous representation of the documents and the queries. The possible impact of derivational and decompositional information has to be evaluated further. Results from the CLEF experiment have no significance so far. However, part of speech, currently exploited only for translation purposes, together with semantic information can be expected to contribute to a better retrieval performance, which still has to be proven.</p>
      <p>References</p>
      <p>[1] Brill, E. A simple rule-based part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy, 1992.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>8</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>M. F.</given-names>
          </string-name>
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <source>Program</source>
          ,
          <volume>14</volume>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>