CLEF Experiments at the University of Maryland: Statistical stemming and backo translation strategies

CLEF Experiments at the University of Maryland: Statistical stemming and backo translation strategies DouglasWOard oard@glue.umd.edu Gina-AnneLevow College of Information Studies Institute for Advanced Computer Studies Institute for Advanced Computer Studies Clara I. Cabezas Institute for Advanced Computer Studies University of Maryland

20742 College Park MD

CLEF Experiments at the University of Maryland: Statistical stemming and backo translation strategies 667307B22C3C61A66CDA23AAB40A5091 GROBID - A machine learning software for extracting information from scholarly documents

The University of Maryland participated in the CLEF 2000 multilingual task, submitting three o cial runs that explored the impact of applying language-independent stemming techniques to dictionary-based cross-language information retrieval. The paper begins by describing a cross-language information retrieval architecture based on balanced document translation. A four-stage backo strategy for improving the coverage of dictionary-based translation techniques is then introduced, and an implementation based on automatically trained statistical stemming is presented. Results indicate that competitive performance can be achieved using these techniques in conjunction with freely available bilingual dictionaries.

Introduction

One important goal of our research i s t o d e v elop cross-language information retrieval (CLIR) techniques that can be applied to new language pairs with minimal language-speci c tuning. So-called \dictionarybased" techniques o er promise in this regard because bilingual dictionaries have p r o ven to be a useful basis for CLIR 6] and because simple bilingual dictionaries are becoming widely available on the Internet. Although bilingual dictionaries sometimes include useful information such as part-of-speech, morphology and translation preference, it is far more common to nd a simple list of translation equivalent term pairs|what we refer to as a \bilingual term list." The objective of our participation in the Cross-Language Evaluation Forum (CLEF) was to explore techniques for dictionary-based CLIR using bilingual term lists between English and other European languages. We applied techniques that we h a ve used before (balanced document translation), and chose to focus our contrastive r u n s o n i m p r o ving translation coverage using unsupervised morphological analysis, an approach that we refer to as \statistical stemming." In the next section we describe our balanced document translation architecture and then explain how statistical stemming can be used to improve translation coverage without additional language-speci c resources. The following section presents our CLEF results, which demonstrate that the additional coverage achieved by statistical stemming has a substantial bene cial e ect on retrieval e ectiveness as measured by m e a n a verage precision. In the nal section we draw some conclusions regarding the broader utility of our techniques and suggest some additional research directions.

Experiment Design

We c hose to participate in the multilingual task of CLEF 2000 because the structure of the task (English queries, documents in other languages) was well matched to a CLIR architecture based on document translation that we h a ve been developing. Document translation is an attractive approach i n i n teractive applications if all queries are in a single language because the pre-translated documents that are retrieved can immediately be examined by the user. Although storage overhead is doubled (if the documents are also stored in their original language), that may be of little consequence in an era of rapidly falling disk prices. The principal challenge in a document translation architecture is to balance the speed and accuracy of the translation. In our initial experiments with document translation, we found that a commercial machine translation system required about 10 machine-months to translate approximately 250,000 documents|a clearly impractical approach 5 ]. With simpler techniques, such a s l o o k i n g u p e a c h w ord in a bilingual term list, we can translate a similar number of documents in only three machine-hours|a period of time comparable to that required to build an inverted index. In our CLEF experiments we h a ve t h us chosen to focus on improving the retrieval e ectiveness of dictionary-based CLIR without introducing a signi cant adverse e ect on translation e ciency.

Figure 1 illustrates our overall CLIR system design. Each non-English collection was processed separately using the appropriate bilingual term list. We grouped the articles from Der Spiegel and Frankfurter Rundschau into a single German collection and formed a French collection from the Le Monde articles and an Italian collection from the La Stampa articles. The documents were normalized by mapped all characters to lower case 7-bit ASCII through removal of accents. Term-by-term translation was then performed, optionally applying a four-stage backo statistical stemming approach to enhance translation coverage. For translation, we t o k enized source-language terms at white space or terminal punctuation (which had the e ect of ignoring all source-language multiword expressions in our bilingual term lists). When no translation was known for a clitic contraction, automatic expansion was performed (e.g. l'heur ! le heur and the resulting words were translated separately. 1 Other wo r d s w i t h n o k n o wn translation were retained unchanged, which is often appropriate for proper names. We produced exactly two English terms for each source-language term. For terms with no known translation, the untranslated term was generated twice. For terms with one known translation, that translation was generated twice. Terms with two or more known translations resulted in generation of each of the \best" two translations once. In prior experiments we h a ve found that this strategy, known as \balanced translation," signi cantly outperforms the usual (unbalanced) technique of including all known translations because it avoids overweighting terms that have m a n y translations (which are often quite common, and hence less useful as search terms) 4].

Each of the four resulting English collections (the fourth consisting of Los Angeles Times articles, which did not require translation) was then indexed using Inquery (version 3.1p1), with Inquery's kstem stemmer and default English stopword list selected. Queries were produced by enclosing each w ord in the title, description, and narrative elds (except for stop-structure) in Inquery's #sum operator. In our o cial runs, two t ypes of stop-structure were removed by hand: \ nd documents" was removed at the beginning of any description eld in which it appeared, and \relevant d o c u m e n ts report" was removed at the beginning of any narrative e l d i n w h i c h it appeared. Because this stop structure was removed manually after examining the queries, our runs should o cially be classi ed as being in the \manual" category. 2 We generated separate ranked lists for each collection and then used the weighted round-robin merging technique that we h a d developed for the TREC CLIR track to construct a single ranked list of the top 1000 documents retrieved for each query 7]. We expected our (monolingual) English system to outperform our French and German systems, and we expected our Italian system to be adversely a ected by the small size of the bilingual term list for that language pair. We t h us chose a 10:5:5:3 ratio as the relative w eights for each language.

We used the same bilingual term lists for CLEF 2000 that we had employed in the TREC-8 CLIR track 7 ]. Table 1 shows the source and summary statistics for each dictionary. Source language terms in the bilingual term lists were normalized in a manner similar to that used for the documents, although clitic contractions were not split because they were not common in the bilingual term lists. Balanced document translation becomes unwieldy beyond two translations, so the number of translations for any t e r m w as limited to the two Table 1: Sources and summary statistics for bilingual dictionaries.

that most commonly occurred in written English. All single word translations were ordered by decreasing unigram frequency in the Brown corpus (which c o n tains many genres of written English), followed by a l l multi-word translations (in no particular order), and nally by a n y single word entries that did not appear at all in the Brown corpus. Translations beyond the second for any English term were then deleted this had the e ect of minimizing the e ect of infrequent w ords in non-standard usages or misspellings that might appear in the bilingual term list.

Four-Stage Backo

The coverage problem in CLIR arises when the object being translated (in this case, a document), contains a term that is not known to the translation resource (in this case, the bilingual term list). Bilingual term lists found on the web often contain an eclectic mix of root forms and their morphological variants, and our experience with the TREC-8 CLIR track suggested that morphological analysis of terms contained in documents and bilingual term lists could discover plausible translations when no exact match is found. We thus developed a four-stage backo strategy that was designed to maximize coverage while limiting the introduction of spurious translations:

1. Match the surface form of a document t e r m t o surface forms of source language terms in the bilingual term list.

2. Match t h e morphological root of a document term to surface forms of source language terms in the bilingual term list.

3. Match t h e surface form o f a d o c u m e n t term to morphological roots of source language terms in the bilingual term list.

4. Match the morphological root of a document term to morphological roots of source language terms in the bilingual term list. The process terminates as soon as a match is found at any stage, and the known translations for that match are generated. Although this process may result in generation of an inappropriate morphological variant for a correct English translation, the use of English stemming in Inquery should minimize the e ect of that factor on retrieval e ectiveness.

Statistical Stemming

The four-stage backo strategy described above poses two k ey challenges. First, it would require that an e cient morphological analysis system be available for every document language that must be processed. And second, the morphological analysis systems would need to produce accurate results on words presented out of context, as they are in the bilingual term list. This is a tall order, so we elected to explore a simpli cation of this idea in which morphological analysis was replaced by stemming. Stemmers are freely available for French and German, 3 and stemming has proven to be about as e ective as more sophisticated morphology in information retrieval applications where (as is the case in our application) matching is the principal objective 3 ]. This represents only a partial solution, however, since we a r e n o t a ware of a freely available stemmer for Italian. In TREC-4, Buckley, et al. demonstrated that a simple stemmer could be easily constructed for Spanish without knowledge of the language by examining lexicographically similar words to discover common su xes 1]. We decided to try to push that idea further, automating the process so that it could be applied to new languages without additional e ort. We call this approach \statistical stemming," since the stemmer is learned from the statistics of a text collection, in our case the collection that was ultimately to be searched.

Statistical stemming is a special case of unsupervised acquisition of morphology, a specialized topic in computational linguistics. Of this work, the closest in spirit to our objectives that we k n o w of is a program by Goldsmith known as Linguistica 2]. Linguistica examines each token in a collection, observing the frequency of stems and su xes that would result from every possible breakpoint. An optimal breakpoint for each t o k en is then selected by applying as a constraint t h a t e v ery instance of a token must have the same breakpoint and then choosing breakpoints for each unique token that minimize the number of bits needed to encode the collection. This \minimum description length" criterion captures the intuition that breakpoints should be chosen in such a w ay that each t o k en is partitioned into a relatively common stem and a relatively common su x. Linguistica is freely available, 4 but the present implementation can process only about 200,000 words on a 128 MB Windows NT machine. This is certainly large enough to ensure that breakpoints will be discovered for most common words, but breakpoints might not be discovered for less common terms|quite possibly the terms that would prove most useful in a search. We therefore augmented Linguistica with a simple rule induction technique to handle words that were outside Linguistica's training set.

We implemented rule induction as follows. We rst counted the frequency of every one, two, three and four-character su x that would result in a stem of three or more characters for the rst 500,000 words of the collection. Each instance of every word was used to compute the su x frequencies. These statistics alone would overstate the frequency of partial su xes|for example, \-ng" is a common ending in English, but in almost every case it is part of \-ing". We t h us subtracted the frequency of the most common subsuming su x of the next longer length from each su x. 5 The adjusted frequencies we r e t h e n u s e d t o s o r t a l l t wo, three and four-character su xes in decreasing order of frequency. W e observed that the count vs. rank plot for an English training case was convex, so we selected the rank at which the second derivative of the count vs. rank plot was maximized as the limit for how m a n y su xes to generate for each length. In tuning experiments with English, this approach did not work well for single-character su xes because the distribution of character frequency (regardless of location) is highly skewed. We t h us sorted single characters by the ratio between their word-nal likelihood and their unconditioned likelihood, and again used the maximum of the second derivative as a stopping point. 6 For each w ord, the rst matching su x (if any, from the top of the list) was then removed to produce the stemmed form. The heuristics we c hose were motivated by our intuition of what constituted a likely su x, but the details were settled only after a good deal of tweaking with a training collection. Of note, the training collection contained only English documents and the tweaking was done by the rst author, who has no useful knowledge of French, German or Italian. Table 2.2 shows the su x removal rules for those languages that were automaticallyproduced with no further tuning. Many of the postulated su xes in that table accord well with our intuition, as in the case the French adverbial su x ment or third-person plural in ectional su x ent. H o wever, some others suggest insu cient generalization. Consider the suggested German su xes: ngen,nden,sen,nen,gen,den, a n d ten. The more appropriate su x would be en h o wever, the preference for longer subsuming strings selects the less general su xes. A large number of single character su xes are suggested for Italian, including letters such a s k and w which d o n o t t ypically appear in word-nal position in this language. This somewhat counterintuitive set suggests that further optimization of threshold setting is necessary.

Three o cial runs were submitted. In our baseline run (\unstemmed"), we used no pre-translation stemming (i.e., step one alone). In our Linguistica run (\backo 4Ling"), we implemented the complete four-stage backo strategy using Linguistica for terms with known breakpoints, and added a fth stage that replicated stage four using the rule induction stemmer in place of Linguistica that would be invoked if none of the rst four stages found a translation. The rule induction process is considerably faster than Linguistica (less than 5 minutes, compared with 30-40 minutes for Linguistica) so we also submitted a third run in which which w e implemented four-stage backo with rule induction alone. Table 2.2 summarizes these conditions.

Results

Our backo 4 run was judged, and all three runs were scored o cially. T able 3 summarizes the results. Overall, a four-stage backo document translation strategy using statistical stemming achieved a dramatic improvement in retrieval e ectiveness over the unstemmed approach that was found to be statistically signi cant b y a paired two-tailed t-test (p < 0:002 in both cases) (Figure 2). Surprisingly, o u r ad hoc rule induction technique produced results that were statistically indistinguishable from those obtained using the more sophisticated Linguistica software (p 0:38).(Figure 3) The backo 4Ling run achieved at-or-abovemedian average precision on 24 of 40 queries, and the backo 4 run achieved at-or-above-median average precision on 27 or 40 queries, although in both cases the median was computed for automatic queries (Figure 4). Since the e ect of our limited manual stop-structure removal was likely quite small, we i n terpret these results as indicating that we h a ve a c hieved a credible degree of retrieval e ectiveness using only freely available linguistic resources.

Although we can conclude that four-stage backo resulted in improved retrieval e ectiveness and that statistical stemming appears to be a viable substitute for more sophisticated morphological analysis in this application, further analysis is needed if we are to optimize the design of our techniques. The multilingual task design can easily mask single-language e ects, so we plan to perform uno cial monolingual runs using the same language pairs. We d o n o t y et know which stages in our four-stage backo strategy produce the greatest bene cial e ects, or whether reversing the second and third stages might i m p r o ve r e t r i e v al e ectiveness. We plan to explore those questions using uno cial contrastive runs. Finally, w e plan to explore the di erences between the Linguistica and rule induction results on a query-by-query basis as we seek to understand whether some other way o f c o m bining the two might result in improved retrieval e ectiveness.

Conclusion

We h a ve i n troduced two new techniques, four-stage backo and statistical stemming, and shown how they can be used together to improve retrieval e ectiveness in a document translation architecture. When coupled with other language-independent techniques such as blind relevance feedback for query expansion and for , developers now h a ve a robust toolkit with which to design e ective dictionary-based CLIR systems using only a bilingual term list and some modest query-language resources (speci cally, a comparable collection from which to obtain term statistics). The CLEF evaluation has proven to be a suitable venue for exploring these questions, and we look forward to continued participation in future years.

Figure 1 :1Figure 1: Information Retrieval Process.

Figure 2 :Figure 3 :23Figure 2: Improvement of 4-stage statistical stemming backo over unstemmed translation: Bars above x-axis indicate improvement, below indicate decrease

Figure 4 :4Figure 4: Comparison of 4-stage statistical stemming backo to median: Bars above the x-axis indicate statistical stemming (\Backo 4") outperforms median

Table 2 :2Candidate stems, in order of removal.French German Italian ment chen ione tion ngen ente ique nden ioni ions sche ento ent rung enti res lich ato tes sten are es ten to re ung ta x den re s gen ti nen no ter la sen y en o er e te a y k t i x w

Table 4 :4Multilingual evaluation results, uninterpolated mean average precision over 40 topics.unstemmed Stage Document Term List Document backo 4Ling Term ListDocumentbacko 4 Term List1 None 2 3 4 5NoneNone Linguistica None Linguistica Rule Induction Rule Induction None None Linguistica LinguisticaNone Rule Induction None None None Rule Induction Rule Induction Rule InductionTable 3: Summary of o cial runsRunAverage Precisionunstemmed backo 4Ling backo 40.1012 0.1938 0.1952

Clitic contractions are not common in German, so we did not run the splitting process in that case. Our o cial runs were originally inadvertently submitted in the automatic category, b u t h a ve since been reclassi ed as manual. French and German stemmers are available as part of the PRISE information retrieval system, which is freely available from the U.S. National Institute of Standards and Technology. Linguistica is available at http://humanities.uchicago.edu/faculty/goldsmith/index.html We did not adjust the frequencies of four-character su xes since we did not count t h e v e-character su xes. If a more precise speci cation of the process is desired, the source code for the rule induction software is available at http://www.glue.umd.edu/ oard/research.html

Acknowledgments

The authors wish to thank Patrick S c hone, Philip Resnik and David Yarowsky for helpful discussions of the unsupervised morphology acquisition and Jianqiang Wang for his help with Inquery and the bilingual term lists. This work was work was supported in part by D ARPA c o n tract N6600197C8540 and DARPA cooperative agreement N660010028910.

Automatic query expansion using SMART: TREC 3 ChrisBuckley GerardSalton JamesAllan AmitSinghal Overview of the Third T ext REtrieval Conference (TREC-3), p a g e s 69{80 DKHarman NIST November 1994 Unsupervised learning of the morphology of a natural language JohnGoldsmith 2000 Stemming algorithms -A case study for detailed evaluation DavidAHull Journal of the American Society for Information Science 47 1 1996 Translingual topic tracking with PRISE Gina-AnneLevow DouglasWOard Working Notes of the Third T opic Detection and Tracking Workshop

ebruary 2000 A comparative study of query and document translation for cross-lan guage information retrieval DouglasWOard Proceedings of the Third Conference o f t h e A s s o ciation for Machine Translation in the Americas the Third Conference o f t h e A s s o ciation for Machine Translation in the Americas October 1998 Cross-language information retrieval DouglasWOard AnneRDiekema Annual Review of Information Science and Technology American Society for Information Science 1998 33 TREC-8 experiments at Maryland: CLIR, QA, and routing DouglasWOard JianqiangWang DekangLin IanSoboro The Eighth Text Retrieval Conference (TREC-8) vember 1999