
Stemming is the NLP technique which is frequently used and successfully applied in IR systems. A standard tool is the Porter stemmer [7], which achieves a normalisation by simply chopping off suffixes. To overcome some of the serious deficiencies of such stemmers, for instance that a word is mapped to gener and distribute to distribut, neither of which is a lexical base form, which leads to improper conflations, advanced stemmers have been developed and combined with a lexicon [4] to verify the identified stem. This approach produces far better results: it avoids errors such as those shown above, but others, such as the mapping of distributed to distribut, still occur. In this case, the word distributed cannot be found in the dictionary. Irregular plurals (media/medium) or inflected forms (went/go) also cause errors. The main drawback of this approach thus lies in the coverage of the general lexicon.

For languages with a rich declensional morphology such as French or German, the results of such a stemming are rather unsatisfying because considering only inflection (or even suffix reduction) is not enough (cf. [6], [12]). For instance, stemming the German past participle gegangen (gone) to gang results in a wrong form (the correct one is gehen/to go). German verbs as well as French verbs such as aller (to go) or recevoir (to get) have numerous forms, which makes it almost impossible to stem them by using suffix algorithms. For German, compound formation often leads to failures because of the underlying highly productive morphological process (cf. [3]).

The Linguistic Processing

In Mpro-IR, the Mpro programme package [5] developed at the IAI is used for the linguistic processing, and its major features are described in the following. Mpro was primarily developed to process German but is now available for different languages (including Eastern European languages). However, the same level of functionality as in the German module is not available for all language modules. Mpro performs a morpho-syntactic analysis consisting of a lemmatisation, a part-of-speech tagging and, for German, a compound analysis as well as, optionally, an additional syntactic and semantic disambiguation evaluating mainly context information. For the reduction of syntactic ambiguities there is also a shallow parsing component available for each language.

The morpho-syntactic analysis is combined with a look-up in a word-form dictionary. In a first step, the word-forms are looked up in a special tagging dictionary, for which an entry looks as follows:

{string=Word-form,c=w,sc=CAT,lu=Citation-form,...}

where CAT is the category. This covers, for example, prefix verbs (mitteilen), fixed expressions such as in Bezug auf and de facto, abbreviations like etc., as well as proper names such as Bill, Berlin. Nouns, verbs, adjectives, and derived adverbs are looked up in a morpheme lexicon. This morphological dictionary contains allomorphs but also some irregular word-forms which cannot be identified in another way, as well as a variety of toponyms and other names. Each entry shows how the associated stems behave morphologically, as shown in the following example:

[Example]

The feature ds contains the morphological derivation, and ls the respective normalised form. The features s and ss (for compounds) contain semantic information. In the example above, all three words have the same derivation. For German words, a compound analysis is performed additionally (cf. example below), and the result is given in the feature ts and its normalised form in the feature t. (These features are also assigned for English analyses but always correspond to the lu feature.)

Due to a special treatment, some defective noun constructions in German, such as those occurring in coordinations like Informations- und Kommunikationsdienst (Information and Communications services), are recognised. Mpro assigns the missing head information by using a lookahead algorithm:

[Example]

After this analysis, for German the output can be further disambiguated by evaluating context information, i.e. if the first letter of a word-form is capitalised and the word is not the first in a sentence, it must be a noun. In a final step, a shallow parsing can be applied to reduce other syntactic ambiguities such as verb/noun readings. This parsing process can also be performed on the English and French output of the morphological analysis to get an almost unambiguous representation. Mpro does not reduce ambiguity where the correctness of the decision is doubtful.

In the remainder of the section, it is described how the results of the morpho-syntactic analysis are applied at the various stages of the IR process.

The Retrieval

For all three stages, indexing, query expansion, and the search together with a document ranking, the information provided by the features lu, ls as well as t (currently for German only) is exploited.

Based on the analyses of the documents, several indices are built up: one using the information about the lexical unit (i.e. the normalised form), one using the derivational information, and, for German, a third index constructed with the decomposition information. Though English and French nouns have a t-feature, we have not exploited this kind of information because it is subject to an ongoing revision of the English and French morpheme lexicon (see above). With each key, the document identification number, the sentence number (snr), the word number (wnr), as well as the word-form (the form of the word as occurring in the text) are stored. Function words (entries with c=w) are discarded from the indexing. This process is done within a preparation phase.

At search time, the queries are processed by the same morpho-syntactic analysis as the documents. For the monolingual search, the function words are removed from the analysis output, and for the meaning-bearing words the values of the lu-, ls- and, for German queries, the t-feature are extracted to construct a set of search patterns. For the input query Competitiveness of European industry, the set of search terms consists of competitiveness, compete, european, europe, industry.

For the cross-language retrieval, we decided to translate the queries and to carry out a monolingual search afterwards. This approach seems more appropriate because legal information is highly related to the original wording, and machine translation systems provide only poor quality [2]. The input to the translation component is the complete morphological analysis of the query. Mpro-IR uses a shallow translation tool which performs a lexical transfer based on huge transfer lexicons (the coverage of the English-German lexicon is about 400,000 entries) comprising single words, abbreviations, compound terms but also fixed phrases. For multiword units, the MT component first looks up whether the dictionary contains a translation for the whole phrase. If no translation exists, the phrase is translated compositionally, whereby the translation is guided by the part-of-speech, i.e. for verbs only the translations for verbs are assigned. The translation output undergoes a shallow parsing based on a phrase grammar to get only one possible translation, whereby the syntactic representation of the source is taken into account. For German as target language, the syntactic variants of a term are additionally sorted out. For example, there are two entries in the English-German dictionary for human dignity, Menschenwürde and Würde des Menschen. In these cases, the compound is preferred: due to the query expansion, all occurrences of the syntactic variant Würde des Menschen are equally found, but the search for a compound is much faster than that for a phrase.

The search itself consists of several look-ups in the different indices. For each content-bearing term the following look-ups are done:

1. Looking up the index built over the lexical base forms (lu-index) with the value of the lu-feature
3. Looking up the index built over the derivations (ls-index) with the value of the ls-feature

For compounds, the different formation in English and French compared to German leads to a different search strategy: having in mind that open compound terms in English and French have an almost fixed word order, we defined a distance factor to decide whether the occurrence of two or more words represents an open compound or not. Based on statistical data, the longest distance between each meaning-bearing word of a phrase is fixed to 3. This allows occurrences of advertising in UK's television to be classified as an exact hit for television advertising. For English as well as for French compounds, the occurrences of each word within a phrase are evaluated against this distance factor using the word number provided by the index, and sorted into the following three lists:

1. The lu-values looked up in the lu-index of each element occur within the determined distance.
2. At least for one element only the derivation occurs within this distance.
3. All other occurrences.

We apply this distance measure also to German to find syntactic variants of compound terms:

[Example]

For phrases, the topmost result list consists of documents which contain the elements of the phrase exactly (excluding function words). The next list contains documents in which at least one phrase element occurs only as part of a compound. All further result lists are calculated analogously.

Usually the rank of a retrieved document is computed by tf*idf. Using a weight based on frequency does not seem to be adequate in the environment of a legal domain, in which a document where some term occurs only once can be much more relevant than a document in which the term occurs several times. Thus, in Mpro-IR, the documents are ranked by the information used to retrieve them, in the order of the lists described above. This ranking mirrors the relevance related to the reliability of the linguistic information used to retrieve a document: a document retrieved by stem information is more relevant to the query than a document retrieved by derivational information. It expresses at the same time the degree of precision of the retrieval. The results of the first list have a higher precision than those of the lower lists because the probability that mismatched documents are retrieved increases.
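The sorting of open-compound occurrences into three lists can be illustrated with a minimal Python sketch. It is not the Mpro-IR implementation: the token dictionaries, the helper names (`build_index`, `near_docs`, `rank_phrase`), the category value "noun", and the ls values of the sample words are assumptions made for illustration. It further assumes that word numbers count every token within a sentence, that the phrase has two elements, and that "all other occurrences" means documents containing both elements but not within the distance factor.

```python
from collections import defaultdict
from typing import NamedTuple

MAX_DISTANCE = 3  # distance factor given in the text


class Posting(NamedTuple):
    doc: str  # document identifier
    snr: int  # sentence number
    wnr: int  # word number (assumed to count every token of the sentence)


def build_index(tokens, feature):
    """Map each value of `feature` ('lu' or 'ls') to its postings."""
    index = defaultdict(list)
    for t in tokens:
        if t["c"] == "w":  # function words are discarded from the indexing
            continue
        index[t[feature]].append(Posting(t["doc"], t["snr"], t["wnr"]))
    return index


def near_docs(postings_a, postings_b):
    """Documents where a and b share a sentence within MAX_DISTANCE words."""
    return {
        a.doc
        for a in postings_a
        for b in postings_b
        if a.doc == b.doc and a.snr == b.snr
        and 0 < abs(a.wnr - b.wnr) <= MAX_DISTANCE
    }


def rank_phrase(first, second, lu_index, ls_index):
    """Sort documents into the three lists for a two-element phrase."""
    lu_a = lu_index.get(first["lu"], [])
    lu_b = lu_index.get(second["lu"], [])
    any_a = lu_a + ls_index.get(first["ls"], [])
    any_b = lu_b + ls_index.get(second["ls"], [])

    exact = near_docs(lu_a, lu_b)               # list 1: lu-values only
    derived = near_docs(any_a, any_b) - exact   # list 2: a derivation is needed
    both = {p.doc for p in any_a} & {p.doc for p in any_b}
    others = both - exact - derived             # list 3 (assumed reading)
    return sorted(exact), sorted(derived), sorted(others)


def tok(doc, wnr, c, lu, ls, snr=1):
    return {"doc": doc, "snr": snr, "wnr": wnr, "c": c, "lu": lu, "ls": ls}


# Hypothetical analyses; the ls values are illustrative, not Mpro output.
TOKENS = [
    # d1: "advertising in UK's television"
    tok("d1", 1, "noun", "advertising", "advertise"),
    tok("d1", 2, "w", "in", "in"),
    tok("d1", 3, "noun", "uk", "uk"),
    tok("d1", 4, "noun", "television", "television"),
    # d2: "advertisers on television"
    tok("d2", 1, "noun", "advertiser", "advertise"),
    tok("d2", 2, "w", "on", "on"),
    tok("d2", 3, "noun", "television", "television"),
    # d3: both words in one sentence, but seven words apart
    tok("d3", 1, "noun", "television", "television"),
    tok("d3", 8, "noun", "advertising", "advertise"),
]

lu_index = build_index(TOKENS, "lu")
ls_index = build_index(TOKENS, "ls")
lists = rank_phrase(
    {"lu": "television", "ls": "television"},
    {"lu": "advertising", "ls": "advertise"},
    lu_index,
    ls_index,
)
print(lists)  # (['d1'], ['d2'], ['d3'])
```

Returning the three lists in this order also gives the ranking described above: documents matched by lexical base forms come first, documents matched only through a derivation second, and the remaining occurrences last.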
3 Mpro-IR in CLEF

We participated for the first time in a CLEF/TREC evaluation to investigate how Mpro-IR, developed for a special domain, fares with unrestricted documents with respect to recall and precision.

Setting up the Experiment

Currently the Mpro-IR system covers only the languages German, English, and French. To perform the CLIR task, which additionally comprises the search in Italian documents, we integrated a small Italian component into Mpro-IR. To provide sufficient coverage for this module, we analysed the complete Italian topics (titles, descriptions, and narratives) and added unknown words (morphemes) to our monolingual lexicon. For the translation component we added only translations for the words occurring in the title sections of the topics. Thus the Italian morpheme lexicon now has 27,800 entries compared to the English morpheme lexicon with about 48,300 entries. We used English topics and retrieved documents in English, French, German, and Italian; therefore we added missing translations for the terms of the topic titles to the respective transfer dictionaries.

To build up the indices, the texts underwent a normalisation, i.e. we discarded all formatting information (including the title sections), which led in some cases to a lower performance. This was mainly done due to space limitations: the Mpro tool is able to identify SGML tags, but the analyses are unnecessarily blown up.

Due to time and space restrictions we could perform and submit only one run. Therefore we decided to perform a phrase search only over the title sections of the topics, although we noticed that the type of queries was not always adequate for this kind of search.

Retrieval Performance

Given that all meaning-bearing words have to occur in the same sentence and that only one translation is used, the outcome is not too bad. The results show more or less what we expected: for topics which are more or less incomplete sentences, such as French conscientious objector, supermarket ceiling in Nice collapses, etc., we got none or only a few results (cf. Figure below).

Fig. 1. CLEF Results

For topics such as European Economic Area, World Trade Organisation etc., the results are better though not satisfying. We also got only a few results on the basis of the productive use of decomposition information, i.e. documents containing semantically similar terms. The main reason is certainly the restricted search space; furthermore, the German compounds occurring in the queries (such as Kriegsdienstverweigerer, Krebsgenetik, Golfskriegssyndrom, Nobelpreis, Alkoholkonsum, ...) consist of words which are not frequently used in compound formation within the context of the respective query. Another reason is that only one translation is used (e.g. Methane deposit is translated into German as Methanlagerstätte, whereas in the documents the synonym Methanlager is often used).

To get an impression of the degree to which the restriction to a sentence as search space is too strong, we performed a Boolean search for some sample topics (T6, T9, T10, T14, T21). For queries such as French conscientious objectors (T6), Methane deposits (T9), and Tourism in the US (T14), a five times better recall is achieved, and a 30% improvement for the query War and radio (T10), compared to the run submitted to CLEF. We got the same result for the query European Economic Area (T21) because here we got no results in the monolingual retrieval.

Our main objective was to evaluate the use of derivational and decompositional information to improve the recall. We could conclude that most of the documents are retrieved by using the information of the lexical base form. Only a few others are retrieved on the basis of derivational information. The contribution of decomposition information, which is only used for retrieving German documents, depends on the type of compounds, and in a few cases also on the type of the single words forming a compound. No relevant occurrences of syntactic variants are found in the corpus.

The results of the CLEF evaluation coincide with those we got from the evaluation of the retrieval algorithm within the EMIS system [10]. Here, too, most hits could be retrieved by using precise lexical units and derivational information. Compositional information was also valuable to detect syntactic variants of German compounds. The improvement of the recall by so-called semantically similar terms is very poor. Because this approach is also very time consuming, we will forgo it in favour of a better morpho-syntactic analysis. This will then provide the grounds for a better indexing by using a term recognition component, and a better translation component.

As the results here show, the phrase search as implemented in Mpro-IR is useful in retrieval systems developed for a special type of domain where the search for complex phrases is necessary, as in the legal domain. In retrieval systems dealing with unrestricted texts, a Boolean search achieves much better recall. With a Boolean search we could certainly get a better insight into the usefulness of derivational and compositional information in the retrieval process, due to the higher recall.

Conclusion

The approach we pursue in Mpro-IR, using a sophisticated morpho-syntactic analysis, has shown that the recall can be improved by a more precise identification of the lexical base units and the almost unambiguous representation of the documents and the queries. The possible impact of derivational and decompositional information has to be further evaluated. Results from the CLEF experiment have no significance so far. However, part-of-speech, currently exploited only for translation purposes, together with semantic information can be expected to contribute to a better retrieval performance, which still has to be proven.

For the query expansion on the monolingual side, we currently experiment with a method to add synonyms which will be automatically computed by translating the translations back to the source language, whilst the search itself could be improved by taking advantage of the part-of-speech together with the semantic information already provided by the morpho-syntactic analyzer [9].

References

[1] Brill, E. A simple rule-based part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy, 1992.

[7] Porter, M. F. An algorithm for suffix stripping. In Program, 14, 1980.