-

LCN

Retrieval Experiments at Morpho Challenge 2008

0 Paul McNamee JHU Human Language Technology Center of Excellence

2008

5 0

Morpho Challenge 2008 hosted an extrinsic evaluation of morphological analysis that explored whether unsupervised morphology induction could bene t information retrieval. This paper presents results in alternative methods for word normalization using test sets from the Cross-Language Evaluation Forum (CLEF) ad-hoc collections. Preliminary results for the Morpho Challenge 2008 evaluation are consistent with these data. We found that: (1) rule-based stemming is e ective in less morphologically complicated languages; (2) alternative methods for stemming such as unsupervised learning of morphemes and least common n-gram stemming are helpful; and, (3) full character n-gram indexing is the most e ective form of tokenization in more morphologically complex languages.

Multilingual text retrieval Character n-grams Morphological normalization

Experimentation

Introduction

Over the past year we conducted experiments using alternative methods of lexical normalization. Using test sets in 13 languages used in Cross-Language Evaluation Forum (CLEF) evaluations between 2002 and 2007, we compared the use of a rule-based stemmer (Snowball ), an unsupervised morphological segmenter (Morfessor ), a synthetic form of stemming based on selecting a single character n-gram from each word, and the use of all n-grams for a given word. Character n-grams of length n = 5 were the most e ective technique, performing 18% better than unnormalized words, averaged across the set of languages.

N-grams possess many advantages. They work in every language, require no training, and are more e ective than plain words. Their main shortcoming is increased disk space requirements and slower query response times. To address the performance penalty we introduced a method of n-gram stemming that selects for each word its least common n-gram. By selecting only one n-gram per word, the inverted le becomes no larger than when words or stemmed words are used. And query responses times also improve since the number of query terms becomes equal to the number of words, not a function of the number of letters in the query string.

While our n-gram stems perform better than unlemmatized words, it was clear from the Morpho Challenge 2007 evaluation [ 2 ] that other techniques were even more e ective. This year we wanted to compare full (i.e., traditional) n-gram indexing against the more linguistically sophisticated approaches of other participants. For the Information Retrieval subtask at Morpho Challenge 2008 we provided o cial submissions using three techniques in each of English, Finnish, and German. The methods were:

Character n-grams with length n = 5 Character n-grams with length n = 4

The rarest 5-gram from each word (measured using document frequency statistics) In Section 2 we describe our tokenization experiments on the CLEF data sets. In Section 3 we analyze our Morpho Challenge 2008 results. 2

Analysis on Past CLEF Collections

We compared plain words, stems, induced morphemes, n-gram stems, and character n-grams using test sets from the CLEF ad hoc tasks between 2002 and 2007. In each of the 13 languages we used two years worth of data except for Czech where only one year was available. The number of test queries per language varied from 50 (Czech) to 107 (Spanish). The document collections for each language were indexed using the various methods of tokenization. Common to each method was conversion to lower case letters, removal of punctuation, and truncation of long numbers to 6 digits. The HAIRCUT retrieval engine was used with a statistical language model retrieval metric. The similarity calculation combines document term frequencies and corpus frequencies (for smoothing) using linear interpolation with a smoothing constant of 0.5 [ 6 ]. The methods of tokenization examined were: words: space-delimited tokens. snow: output of the Snowball stemmer1, if available in the given language. morf : the set of morphemes produced by Morfessor 2 for each input word. A model was trained using the document collection's lexicon with digit-containing tokens omitted. The default parameters for the Morfessor algorithm were used [ 1 ]. lcn4: the least common word-internal character 4-gram for each input word [ 3 ]. lcn5: the least common word-internal character 5-gram for each input word. 4-gram: overlapping, word-spanning character 4-grams produced from the stream of words encountered in the document or query.

5-gram: length n = 5 n-grams created in the same fashion as the character 4-grams. In Table 1 results are presented using mean average precision to compare performance. The score for the highest performing technique in each language is emboldened. 2.1

Unnormalized words

Not attempting to control for morphological processes can have harmful e ects. In Bulgarian, Czech, Finnish, and Hungarian, more than a 30% loss is observed compared to the use of 4-grams as indexing terms.

1Available from http://snowball.tartarus.org/ 2Available from http://www.cis.hut. /projects/morpho/ 5-gram 0.2916 0.3223 0.4443 0.4612 0.4960 0.4399 0.4321 0.3438 0.4220 0.3515 0.3330 0.4376 0.4271 0.3979 0.3956

Word authored authorized authorship

afoot footballs footloose footprint feet juggle juggled jugglers

Snowball author author authorship afoot footbal footloos footprint feet juggl juggl juggler As it may be di cult to nd a rule-based stemmer for every language, a language-independent approach can be quite attractive. The Morfessor algorithm only requires a lexicon (i.e., wordlist) for a language to learn to identify morpheme boundaries, even for previously unseen words. Such automatically detected segments can be an e ective form of tokenization [ 5 ]. Examples of the algorithm's output are presented in Table 2, along with results for Snowball and character 5grams.

Compared to plain words the induced morphemes produced by Morfessor led to gains in 9 of 13 languages; 8 of these were signi cant improvements with p < 0.05 (Wilcoxon test). The languages where words outperformed segments were English (dramatically), French, Italian, and Spanish { each is relatively low in morphological complexity. The di erences in French and Spanish were less than 0.004 in absolute terms. Segments achieved more than a 20% relative improvement in Bulgarian, Finnish, and Russian, and over 40% in Czech and Hungarian. 2.4

Least Common N-gram Stems

Another language-neutral approach to stemming is to select for each word, its least common ngram. This requires advance knowledge of n-gram frequencies, but this is easily obtainable by constructing a regular n-gram index, or even by scanning a corpus and counting. Lengths of n = 4 and n = 5 appear about equally e ective with a slight advantage for lcn4, but this is in uenced primarily by the languages with greater morphological complexity, which see larger changes. An 8% relative improvement in mean average precision over words is obtained. As can be seen from Table 1, in languages where rule-based stemming is available its use is preferable. N-gram stemming achieves comparable performance with Morfessor segments.. 2.5

Overlapping Character N-grams

N-grams achieve morphological regularization indirectly due to the fact that subsequences that touch on word roots will match. For example, \juggling" and \juggler" will share the 5-grams jugg and juggl. While n-gram's redundancy enables useful matches, other matches are less valuable, for example, every word ending in `tion' will share 5-gram tion with all of the others. In practice these morphological false alarms are almost completely discounted because term weighting deemphasizes them. In fact, such a xes can be so common, that ignoring them entirely by treating them as \stop n-grams" is a reasonable thing to do.

Character n-grams are the most e ective technique studied here, giving a relative improvement of 18%. Consistent with earlier work [ 4 ] lengths of n = 4 and n = 5 are equally e ective averaged across the 13 languages; however there are some noticeable di erences in particular languages. The data is suggestive of a trend that the most morphologically variable languages (i.e., Bulgarian, Czech, Hungarian, and Russian) gain more from 4-grams than 5-grams, while 5-grams have a slight advantage in medium complexity languages.

Snowball stems are roughly as e ective as n-grams, on average, but only available in certain languages (i.e., 8 of 13 in this study). The other \alternative" stemming approaches, segments and least common n-grams, appear to gain about half of the bene t that full n-gram indexing sees compared to unnormalized word forms. 3

Morpho Challenge 2008 Submissions

This year's evaluation was conducted in English, Finnish, and German, as was the case in Morpho Challenge 2007. The contest was conducted by having participants submit an analysis for each word form, which would be used to create replacement indexing terms in an IR system. According to the contest website the LEMUR IR engine was used with Okapi BM 25 term weighting. Preliminary results reported by the organizers are presented in Table 3 using mean average precision.

An examination of the preliminary results shows the same basic trend identi ed in Section 2. Rule-based stemming is most e ective in English, and in a morphologically rich language such as Finnish, n-grams have a strong advantage. 4

Conclusions

We examined a variety of methods for lexical normalization, nding that the most e ective technique was character n-gram indexing which achieved a relative gain of 18% in mean average precision over unlemmatized words. In Czech, Bulgarian, Finnish, and Hungarian gains of over 40% were observed. While rule-based stemming can be quite e ective, such tools are not available in every language and even when present, require additional work to integrate with an IR system. When language-neutral methods are able to achieve the same, or better performance, their use should be seriously considered.

[1]

Mathias

Creutz and

Krista

Lagus . Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0 . Technical

report

, Helsinki University of Technology, 2005 .

[2]

Mikko

Kurimo , Mathias Creutz, and

Ville

Turunen . Overview of Morpho Challenge in CLEF 2007 . In Alessandro Nardi and Carol Peters, editors, Working Notes of the CLEF 2007 Workshop , 2007 .

[3] James May eld and Paul McNamee . Single n-gram stemming . In SIGIR , pages 415 { 416 . ACM, 2003 .

[4]

Paul

McNamee and James May eld. Character N-gram tokenization for european language text retrieval . Information Retrieval , 7 ( 1 -2): 73 { 97 , 2004 .

[5]

Paul

McNamee ,

Charles

Nicholas , and James May eld . Don' t have a stemmer?: un+concern+ed . In SIGIR '08 , pages 813 { 814 , New York, NY, USA, 2008 . ACM.

[6] Jay

Ponte and W. Bruce

Croft . A language modeling approach to information retrieval . In SIGIR , pages 275 { 281 . ACM, 1998 .