Report on CLEF-2002 Experiments: Combining Multiple Sources of Evidence

Jacques Savoy
Institut interfacultaire d'informatique, Université de Neuchâtel, Switzerland
Jacques.Savoy@unine.ch   Web site: www.unine.ch/info/clef/

Abstract. For our second participation in the CLEF retrieval tasks, our first objective was to propose better and more general stopword lists for various European languages (namely French, Italian, German, Spanish and Finnish), along with improved, simpler and efficient stemming procedures. Our second goal was to propose a combined query-translation approach that could cross language barriers, as well as an effective merging strategy based on logistic regression for accessing the multilingual collection. Finally, within the Amaryllis experiment, we wanted to analyze how a specialized thesaurus might improve retrieval effectiveness.

Introduction

Based on our experiments of last year [Savoy 2002b], we participated in the French, Italian, Spanish, German, Dutch and Finnish monolingual tasks, in which our information retrieval approaches can work without having to rely on a dictionary. In Section 1, we improve our stopword lists and simple stemmers for the French, Italian, Spanish and German languages. For German, we also propose a new decompounding algorithm. For Dutch, we use the available stoplist and stemmer, and for the Finnish language we design a new stemmer and stopword list. In order to obtain a better overview, we evaluate our propositions using ten different retrieval schemes. In Section 2, for the various bilingual tracks, we choose to express the submitted requests in English; these requests are in turn automatically translated using five different machine translation (MT) systems and one bilingual dictionary. We study these various translations and, based on the relative merit of each translation device, we investigate various combinations of them. In Section 3, we carry out multilingual information retrieval, investigating various merging strategies based on the results obtained during our bilingual tasks. Finally, in the last section, we present various experiments done using the Amaryllis corpus, for which a specialized thesaurus is made available in order to improve the retrieval effectiveness of the information retrieval system.

1. Monolingual indexing and search

Most of the European languages considered here belong to the Indo-European language family (French, Italian, Spanish, German and Dutch) and can be viewed as inflectional languages, in which polymorphous suffixes are added at the end of an inflected root. On the other hand, the Finnish language, a member of the Uralic language family, is based (like Turkish) on a concatenative morphology in which more or less invariable suffixes are added to roots that are themselves generally invariable. Any adaptation of the indexing or search strategies available for the English language requires that general stopword lists and fast stemming procedures be developed for the other target languages. Stopword lists contain non-significant words that are removed from a document or a request before the indexing process begins. Stemming procedures try to remove inflectional and derivational suffixes in order to conflate word variants into the same stem or root.
This first section deals with these issues and is organized as follows: Section 1.1 contains an overview of our eight test-collections, while Section 1.2 describes our general approach to building stopword lists and stemmers for use with languages other than English. In order to decompound German words, we try a simple decompounding algorithm, as described in Section 1.3. Section 1.4 depicts the Okapi probabilistic model together with various vector-space models, and we evaluate them using eight test-collections written in seven different languages (monolingual track).

1.1. Overview of the test-collections

The corpora used in our experiments included newspapers such as the Los Angeles Times (1994, English), Le Monde (1994, French), La Stampa (1994, Italian), Der Spiegel (1994/95, German), Frankfurter Rundschau (1994, German), NRC Handelsblad (1994/95, Dutch), Algemeen Dagblad (1994/95, Dutch) and Tidningarnas Telegrambyrå (1994/95, Finnish). As a second source of information, we also used various articles edited by news agencies such as EFE (1994, Spanish) and the Swiss news agency (1994, available in French, German and Italian, but without parallel translation). As shown in Tables 1a and 1b, these corpora are of various sizes, with the English, German, Spanish and Dutch collections being twice the volume of the French, Italian and Finnish sources. On the other hand, the mean number of distinct indexing terms per document is relatively similar across the corpora (around 120); this number is a little higher for the English collection (167.33). The Amaryllis collection contains abstracts of scientific papers written mainly in French, and this corpus contains fewer distinct indexing terms per article (70.418).

                                        English           French        Italian         German        Spanish
Size (in MB)                            425 MB            243 MB        278 MB          527 MB        509 MB
# of documents                          113,005           87,191        108,578         225,371       215,738
# of distinct terms                     330,753           320,526       503,550         1,507,806     528,382
Number of distinct indexing terms / document
  Mean                                  167.33            130.213       129.908         119.072       111.803
  Standard deviation                    126.315           109.151       97.602          109.727       55.397
  Median                                138               95            92              89            99
  Maximum                               1,812             1,622         1,394           2,420         642
  Minimum                               2                 3             1               1             5
  Max df                                69,082            42,983        48,805          82,909        215,151
Number of indexing terms / document
  Mean                                  273.846           181.559       165.238         152.004       156.931
  Standard deviation                    246.878           164.347       130.728         155.336       82.133
  Median                                212               129           115             111           137
  Maximum                               6,087             3,923         3,763           6,407         1,003
  Minimum                               2                 3             2               1             5
Number of queries                       42                50            49              50            50
Number rel. items                       821               1,383         1,072           1,938         2,854
Mean rel./request                       19.548            27.66         21.878          38.76         57.08
  Standard deviation                    20.832            34.293        19.897          31.744        67.066
  Median                                11.5              13.5          16              28            27
  Maximum                               96 (#q:95)        177 (#q:95)   86 (#q:103)     119 (#q:103)  321 (#q:95)
  Minimum                               1 (#q:97,98,136)  1 (#q:121)    3 (#q:121,132)  1 (#q:137)    3 (#q:111)
Table 1a: Test-collection statistics

When examining the number of relevant documents per request, Tables 1a and 1b show that the mean number is always greater than the median (e.g., for the English collection, there is an average of 19.548 relevant documents per query and the corresponding median is 11.5). These findings indicate that each collection contains numerous queries with a rather small number of relevant items. For each collection, we encounter 50 queries, except for the Italian corpus (for which Query #120 does not have any relevant items) and the English collection (for which Queries #93, #96, #101, #110, #117, #118, #127 and #132 do not have any relevant items).
The Finnish corpus contains only 30 available requests, while only 25 queries are included in the Amaryllis collection.

From the original documents and during the indexing process, we retained only the following logical sections in our automatic runs: <HEADLINE>, <TEXT>, <LEAD>, <LEAD1>, <TX>, <LD>, <TI> and <ST>. On the other hand, we did conduct two experiments (indicated as manual runs), one with the French collection and one with the German corpus, in which we also retained the following tags. For the French collection: <DE>, <KW>, <TB>, <CHA1>, <SUBJECTS>, <NAMES>, <NOM1>, <NOTE>, <GENRE>, <PEOPLE>, <SU11>, <SU21>, <GO11>, <GO12>, <GO13>, <GO14>, <GO24>, <TI01>, <TI02>, <TI03>, <TI04>, <TI05>, <TI06>, <TI07>, <TI08>, <TI09>, <ORT1>, <SOT1>, <SYE1> and <SYF1>; for the German corpus and for one experiment, we also used the tags <KW> and <TB>. From the topic descriptions we automatically removed certain phrases such as "Relevant document report …", "Find documents …", "Trouver des documents qui parlent …", "Sono valide le discussioni e le decisioni …", "Relevante Dokumente berichten …" or "Los documentos relevantes proporcionan información …".

To evaluate our approaches, we used the SMART system as a test bed for implementing the Okapi probabilistic model [Robertson 2000] as well as other vector-space models. This year our experiments were conducted on an Intel Pentium III/600 (memory: 1 GB, swap: 2 GB, disk: 6 x 35 GB).

                                        Dutch          Finnish        Amaryllis
Size (in MB)                            540 MB         137 MB         195 MB
# of documents                          190,604        55,344         148,688
# of distinct terms                     883,953        1,483,354      413,262
Number of distinct indexing terms / document
  Mean                                  110.013        114.01         70.418
  Standard deviation                    107.037        91.349         31.9
  Median                                77             87             64
  Maximum                               2,297          1,946          263
  Minimum                               1              1              5
  Max df                                325,188        20,803         61,544
Number of indexing terms / document
  Mean                                  151.22         153.73         104.617
  Standard deviation                    162.027        128.783        54.089
  Median                                101            123            91
  Maximum                               4,510          6,117          496
  Minimum                               1              1              6
Number of queries                       50             30             25
Number rel. items                       1,862          502            2,018
Mean rel./request                       37.24          16.733         80.72
  Standard deviation                    49.873         14.92          46.0675
  Median                                21             8.5            67
  Maximum                               301 (#q:95)    62 (#q:124)    180 (#q:25)
  Minimum                               4 (#q:110)     1 (#q:114)     18 (#q:23)
Table 1b: Test-collection statistics

1.2. Stopword lists and stemming procedures

In order to define general stopword lists, we used those lists already available for the English and French languages [Fox 1990], [Savoy 1999], while for the other languages we established a general stopword list by following the guidelines described in [Fox 1990]. These lists mainly contain the top 200 most frequent words found in the various collections, together with articles, pronouns, prepositions, conjunctions and very frequently occurring verb forms (e.g., to be, is, has, etc.). The stopword lists used during our previous participation [Savoy 2002b] were often extended. For English we used the list provided by the SMART system (571 words); for the other languages we used 431 Italian words (no change from last year), 462 French words (previously 217), 603 German words (previously 294), 351 Spanish terms (previously 272), 1,315 Dutch terms (available at the CLEF Web site) and 1,134 Finnish words (these stopword lists are available at www.unine.ch/info/clef/). After removing high-frequency words, an indexing procedure uses a stemming algorithm that attempts to conflate word variants into the same stem or root.
In developing this procedure for the French, Italian, German and Spanish languages, it is important to remember that these languages have more complex morphologies than does English [Sproat 1992]. As a first approach, our intention was to remove only inflectional suffixes, such that singular and plural word forms, or feminine and masculine forms, conflate to the same root. More sophisticated schemes have already been proposed for the removal of derivational suffixes (e.g., "-ize", "-ably", "-ship" in English): the stemmer developed by Lovins [1968] is based on a list of over 260 suffixes, while that of Porter [1980] looks for about 60 suffixes. Figuerola [2002], for example, described two different stemmers for the Spanish language, and the results show that removing only inflectional suffixes (88 different inflectional suffixes were defined) seemed to provide better retrieval levels than did removing both inflectional and derivational suffixes (this extended stemmer included 230 suffixes). Our various stemming procedures can be found at www.unine.ch/info/clef/. This year we improved our stemming algorithm for French, in which some derivational suffixes are also removed. For the Dutch language, we use Kraaij & Pohlmann's stemmer (ruulst.let.ruu.nl:2000/uplift/ulift.html) [Kraaij 1996]. For the Finnish language, our stemmer tries to conflate the various word declensions into the same stem. The Finnish language also makes a distinction between a partial object and a whole object (e.g., "syön leipää" for "I'm eating (some) bread" and "syön leivän" for "I'm eating the whole bread"); this aspect is currently not taken into consideration. Finally, diacritic characters are usually not present in English collections (with some exceptions, such as "à la carte" or "résumé"); in the Italian, Dutch, Finnish, German and Spanish collections, such characters are replaced by their corresponding non-accented letter.
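To illustrate the kind of light, inflectional-only stemming adopted here, the following Python sketch removes a few plural and feminine endings from (unaccented) French word forms. The suffix rules shown are illustrative assumptions only, not the actual rules of our stemmers, which are available at www.unine.ch/info/clef/.

def light_stem_fr(word, min_len=5):
    # remove a few French inflectional (plural / feminine) endings only;
    # accents are assumed to have been removed beforehand
    if len(word) < min_len:
        return word                       # very short forms are left untouched
    if word.endswith("aux") and len(word) > 5:
        return word[:-3] + "al"           # chevaux -> cheval, journaux -> journal
    for suffix in ("ees", "ee", "es", "e", "s", "x"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

for w in ["maisons", "chanteuse", "journaux", "chevaux"]:
    print(w, "->", light_stem_fr(w))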
1.3. Decompounding German words

Most European languages manifest other morphological characteristics that must also be considered, with compound word construction being just one example (e.g., handgun, worldwide). In German, compound words are widely used, and this causes more difficulties than in English. For example, a life insurance company employee would be "Lebensversicherungsgesellschaftsangestellter" (Leben + s + versicherung + s + gesellschaft + s + angestellter, for life + insurance + company + employee). Moreover, the morphological marker ("s") is not always present (e.g., "Bankangestelltenlohn", built as Bank + angestellter + lohn (salary)). In Finnish, we also encounter similar constructions, such as "rakkauskirje" (rakkaus + kirje, for love + letter) or "työviikko" (työ + viikko, for work + week).

Each group below gives a string sequence, the end of the previous word and, when present, the beginning of the next word (four groups per row):
schaften schaft     | tion tion       | ern er       | schg sch g
weisen weise        | ling ling       | tät tät      | schl sch l
lischen lisch       | igkeit igkeit   | net net      | schh sch h
lichkeit lichkeit   | lingen ling     | ens en       | scht sch t
igkeiten igkeit     | keit keit       | ers er       | dtt dt t
lichkeit lichkeit   | erheit erheit   | ems em       | dtp dt p
keiten keit         | enheit enheit   | ts t         | dtm dt m
erheiten erheit     | heit heit       | ions ion     | dtb dt b
enheiten enheit     | lein lein       | isch isch    | dtw dt w
heiten heit         | chen chen       | rm rm        | ldan ld an
haften haft         | haft haft       | rw rw        | ldg ld g
halben halb         | halb halb       | nbr n br     | ldm ld m
langen lang         | lang lang       | nb n b       | ldq ld q
erlichen erlich     | erlich erlich   | nfl n f l    | ldp ld p
enlichen enlich     | enlich enlich   | nfr n f r    | ldv ld v
lichen lich         | lich lich       | nf n f       | ldw ld w
baren bar           | bar bar         | nh n h       | tst t t
igenden igend       | igend igend     | nk n k       | rg r g
igungen igung       | igung igung     | ntr n tr     | rk r k
igen ig             | ig ig           | fff ff f     | rm r m
enden end           | end end         | ffs ff       | rr r r
isten ist           | ist ist         | fk f k       | rs r s
anten ant           | ant ant         | fm f m       | rt r t
ungen ung           | tum tum         | fp f p       | rw r w
schaft schaft       | age age         | fv f v       | rz r z
weise weise         | ung ung         | fw f w       | fp f p
lisch lisch         | enden end       | schb sch b   | fsf f f
ismus ismus         | eren er         | schf sch f   | gss g s
Table 2: Decompounding patterns for German

According to Monz & de Rijke [2002] and Chen [2002], including both compounds and their composite parts (only noun-noun decompositions in [Monz 2002]) in queries and documents can result in better performance, while according to Molina-Salgado [2002] the decomposition of German words seems to reduce average precision. Our approach seeks to break up those words having an initial length greater than or equal to eight characters. Moreover, decomposition cannot take place before an initial sequence [V]C, meaning that a word might begin with a series of vowels that must be followed by at least one consonant. The algorithm then seeks the occurrence of one of the models described in Table 2. For example, the last model, "gss g s", indicates that when we encounter the character string "gss", the computer is allowed to cut the compound term, ending the first word with "g" and beginning the second with "s". The models depicted in Table 2 often involve letter sequences that are impossible within a simple German word, such as "dtt", "fff" or "ldm". Once it has detected such a pattern, the computer makes sure that the right part consists of at least four characters, potentially beginning with a series of vowels (criterion noted [V]), followed by a CV sequence. If decomposition proves to be possible, the algorithm then works on the right part of the decomposed word.

As an example, take the compound word "Betreuungsstelle" (meaning "care center", and made up of "Betreuung" (care) and "Stelle" (center, place)). This word is definitely more than seven characters long. Once this has been verified, the computer begins searching for substitution models starting from the third character. It will find a match with the last model described in Table 2 and form the words "Betreuung" and "Stelle". This break is validated because the second word has a length of at least four characters. This term also meets the [V]CV criterion and finally, given that the term "Stelle" has fewer than eight letters, the computer will not attempt to continue decomposing this term.
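The following Python sketch illustrates, under simplifying assumptions, the pattern-based decompounding procedure described above. The PATTERNS list contains only a handful of the entries of Table 2, the [V]C constraint on the position of the first split is reduced to a simple offset, and words are assumed to be lowercased; it is not the production implementation.

import re

PATTERNS = [
    ("gss", "g", "s"),        # e.g. betreuungsstelle -> betreuung + stelle
    ("dtt", "dt", "t"),
    ("ldan", "ld", "an"),
    ("schg", "sch", "g"),
]
VOWELS = "aeiouäöüy"

def acceptable_right_part(part):
    # right part: at least four characters, optional leading vowels ([V]),
    # then at least one consonant followed by a vowel
    return len(part) >= 4 and re.match(
        rf"^[{VOWELS}]*[^{VOWELS}]+[{VOWELS}]", part) is not None

def decompound(word):
    if len(word) < 8:                     # only words of eight or more characters
        return [word]
    for seq, end, begin in PATTERNS:
        pos = word.find(seq, 1)           # simplified stand-in for the [V]C constraint
        if pos == -1:
            continue
        left = word[:pos] + end
        right = begin + word[pos + len(seq):]
        if acceptable_right_part(right):
            return [left] + decompound(right)   # keep decomposing the right part
    return [word]

print(decompound("betreuungsstelle"))     # ['betreuung', 'stelle']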
1.4. Indexing and searching strategy

In order to obtain a broader view of the relative merit of various retrieval models, we first adopted a binary indexing scheme in which each document (or request) is represented by a set of keywords, without any weight. To measure the similarity between documents and requests, we count the number of common terms, computed according to the inner product (retrieval model denoted "doc=bnn, query=bnn" or "bnn-bnn"). For document and query indexing, however, such binary logical restrictions are often too limiting. In order to weight the presence of each indexing term in a document surrogate (or in a query), we may take account of the term occurrence frequency, which allows for better term distinction and increases indexing flexibility (retrieval model notation: "doc=nnn, query=nnn" or "nnn-nnn").

bnn    w_ij = 1
nnn    w_ij = tf_ij
ltn    w_ij = (ln(tf_ij) + 1) · idf_j
atn    w_ij = idf_j · [0.5 + 0.5 · tf_ij / max tf_i.]
npn    w_ij = tf_ij · ln[(n - df_j) / df_j]
nfn    w_ij = tf_ij · ln[n / df_j]
Okapi  w_ij = [(k1 + 1) · tf_ij] / (K + tf_ij)
Lnu    w_ij = [(1 + ln(tf_ij)) / (1 + ln(mean tf_i.))] / [(1 - slope) · pivot + slope · nt_i]
lnc    w_ij = (ln(tf_ij) + 1) / sqrt( Σ_k (ln(tf_ik) + 1)^2 )
ntc    w_ij = (tf_ij · idf_j) / sqrt( Σ_k (tf_ik · idf_k)^2 )
ltc    w_ij = [(ln(tf_ij) + 1) · idf_j] / sqrt( Σ_k [(ln(tf_ik) + 1) · idf_k]^2 )
dtc    w_ij = [(ln(ln(tf_ij) + 1) + 1) · idf_j] / sqrt( Σ_k [(ln(ln(tf_ik) + 1) + 1) · idf_k]^2 )
dtu    w_ij = [(ln(ln(tf_ij) + 1) + 1) · idf_j] / [(1 - slope) · pivot + slope · nt_i]
Table 3: Weighting schemes

Those terms that occur very frequently in the collection are not considered very helpful in distinguishing between relevant and non-relevant items. Thus we might count their frequency in the collection, or more precisely the inverse document frequency (denoted idf), resulting in more weight for rare words and less weight for more frequent ones. Moreover, a cosine normalization can prove beneficial, so that each indexing weight varies within the range 0 to 1 (retrieval model notation: "ntc-ntc"; Table 3 depicts the exact weighting formulations). Other variants may also be created, especially if we consider that the occurrence of a given term in a document is a rare event. Thus, it may be good practice to give more importance to the first occurrence of a word than to any successive or repeated occurrences. Therefore, the tf component may be computed as 0.5 + 0.5 · [tf / max tf in a document] (retrieval model denoted "doc=atn"). Finally, we should consider that a term's presence in a shorter document provides stronger evidence than it does in a longer document. To account for this, we integrate document length within the weighting formula, leading to more complex IR models, for example those denoted "doc=Lnu" [Buckley 1996] and "doc=dtu" [Singhal 1999]. Finally, for CLEF-2002 we also conducted various experiments using the Okapi probabilistic model [Robertson 2000], in which K = k1 · [(1 - b) + b · (l_i / avdl)], where l_i (the sum of the tf_ij) measures the length of document D_i and avdl denotes the mean document length in the collection. In our experiments, the constants b, k1, avdl, pivot and slope are fixed according to the values listed in Table 4. To evaluate the retrieval performance of these various IR models, we adopted non-interpolated average precision (computed on the basis of 1,000 retrieved items per request by the TREC_EVAL program), which accounts for both precision and recall with a single number.

Language     b      k1     avdl    pivot   slope
English      0.8    2      900     100     0.1
French       0.7    2      750     100     0.1
Italian      0.6    1.5    800     100     0.1
Spanish      0.5    1.2    300     100     0.1
German       0.55   1.5    600     125     0.1
Dutch        0.9    3.0    600     125     0.1
Finnish      0.75   1.2    900     125     0.1
Amaryllis    0.7    2      160     30      0.2
Table 4: Parameter settings for the various test-collections
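As an illustration, the following Python sketch computes the Okapi document weight (with K as defined above) and the "npn" query weight, using the French parameter setting of Table 4 (b = 0.7, k1 = 2, avdl = 750); the term statistics in the example are invented.

from math import log

def okapi_weight(tf_ij, doc_len, b=0.7, k1=2.0, avdl=750.0):
    K = k1 * ((1 - b) + b * (doc_len / avdl))
    return ((k1 + 1) * tf_ij) / (K + tf_ij)

def npn_query_weight(tf_qj, df_j, n):
    return tf_qj * log((n - df_j) / df_j)

# a term occurring 3 times in a 500-term document and appearing in 1,200 of
# the n = 87,191 French documents
w_doc = okapi_weight(tf_ij=3, doc_len=500)
w_query = npn_query_weight(tf_qj=1, df_j=1200, n=87191)
print(round(w_doc, 3), round(w_query, 3), round(w_doc * w_query, 3))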
Given that French, Italian and Spanish morphology is comparable to that of English, we decided to index French, Italian and Spanish documents based on word stems. For the German, Dutch and Finnish languages and their more complex compounding morphology, we decided to also use a 5-gram approach [McNamee 2002]. However, contrary to [McNamee 2002], our generation of 5-gram indexing terms does not span word boundaries. This value of 5 was chosen because it performed better with the CLEF-2000 corpora [Savoy 2001a]. Using this indexing scheme, the compound «das Hausdach» (the roof of the house) will generate the following indexing terms: «das», «hausd», «ausda», «usdac» and «sdach».

Our evaluation results, as reported in Table 5a, show that the Okapi probabilistic model performs best for the five different languages. In second position we usually find the vector-space model "doc=Lnu, query=ltc", and in third "doc=dtu, query=dtc". Finally, the traditional tf-idf weighting scheme ("doc=ntc, query=ntc") does not exhibit very satisfactory results, and the simple term-frequency weighting scheme ("doc=nnn, query=nnn") or the simple coordinate match ("doc=bnn, query=bnn") results in poor retrieval performance.

Average precision (queries T-D)
Model                     English       French        Italian       Spanish
                          42 queries    50 queries    49 queries    50 queries
doc=Okapi, query=npn      50.08         48.41         41.05         51.71
doc=Lnu, query=ltc        48.91         46.97         39.93         49.27
doc=dtu, query=dtc        43.03         45.38         39.53         47.29
doc=atn, query=ntc        42.50         42.42         39.08         46.01
doc=ltn, query=ntc        39.69         44.19         37.03         46.90
doc=ntc, query=ntc        27.47         31.41         29.32         33.05
doc=ltc, query=ltc        28.43         32.94         31.78         36.61
doc=lnc, query=ltc        29.89         33.49         32.79         38.78
doc=bnn, query=bnn        19.61         18.59         18.53         25.12
doc=nnn, query=nnn        9.59          14.97         15.63         22.22
Table 5a: Average precision of various indexing and searching strategies (monolingual)

For the German language, we considered the word-based, decompounded and 5-gram document representations to be distinct and independent sources of evidence about German document content. We therefore decided to combine these three indexing schemes; to do so, we normalized the similarity values obtained by each document extracted from these three separate retrieval models according to Equation 1 (see Section 3). The resulting average precision for these four approaches is shown in Table 5b, demonstrating that the combined model usually results in better retrieval performance.

Average precision (German collection, queries T-D, 50 queries)
Model                     words     decompounded    5-gram    combined (Eq. 1)
doc=Okapi, query=npn      37.39     37.75           39.83     41.25
doc=Lnu, query=ltc        36.41     36.77           36.91     39.79
doc=dtu, query=dtc        35.55     35.08           36.03     38.21
doc=atn, query=ntc        34.48     33.46           37.90     37.93
doc=ltn, query=ntc        34.68     33.67           34.79     36.37
doc=ntc, query=ntc        29.57     31.16           32.52     32.88
doc=ltc, query=ltc        28.69     29.26           30.05     31.08
doc=lnc, query=ltc        29.33     29.14           29.95     31.24
doc=bnn, query=bnn        17.65     16.88           16.91     21.30
doc=nnn, query=nnn        14.87     12.52           8.94      13.49
Table 5b: Average precision of various indexing and searching strategies (German collection)

It has been observed that pseudo-relevance feedback (blind query expansion) can be a useful technique for enhancing retrieval effectiveness. In this study, we adopted Rocchio's approach [Buckley 1996] with α = 0.75, β = 0.75, whereby the system is allowed to add m terms extracted from the n best-ranked documents to the original query. To evaluate this proposition, we used the Okapi probabilistic model and enlarged the query by 10 to 20 terms provided by the 5 or 10 best-retrieved articles.
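The following Python sketch shows, on toy vectors, the kind of Rocchio blind-query expansion used here (α = β = 0.75): the m best terms taken from the n top-ranked documents are added to the original query. It is an illustration of the principle only, not the SMART implementation used in our runs.

from collections import defaultdict

def rocchio_expand(query_vec, top_doc_vecs, m=10, alpha=0.75, beta=0.75):
    expanded = defaultdict(float)
    for term, weight in query_vec.items():
        expanded[term] += alpha * weight
    n = len(top_doc_vecs)
    centroid = defaultdict(float)
    for doc_vec in top_doc_vecs:
        for term, weight in doc_vec.items():
            centroid[term] += weight / n
    # keep only the m best new terms, ranked by their mean weight
    new_terms = sorted((t for t in centroid if t not in query_vec),
                       key=lambda t: centroid[t], reverse=True)[:m]
    for term in new_terms:
        expanded[term] += beta * centroid[term]
    return dict(expanded)

query = {"coupe": 1.0, "europe": 1.0}
top_docs = [{"coupe": 0.8, "football": 0.6, "uefa": 0.4},
            {"europe": 0.5, "football": 0.7, "finale": 0.3}]
print(rocchio_expand(query, top_docs, m=2))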
The results depicted in Tables 6a and 6b indicate that the optimal parameter setting seems to be collection-dependent. Moreover, the performance improvement also seems to be collection- (or language-) dependent, with no improvement for the English corpus yet an increase of 8.55% for the Spanish corpus (from an average precision of 51.71 to 56.13), 9.85% for the French corpus (from 48.41 to 53.18), 12.91% for the Italian language (from 41.05 to 46.35) and 13.26% for the German collection (from 41.25 to 46.72, combined model, Table 6b).

Average precision (queries T-D)
Model                     English       French        Italian       Spanish
                          42 queries    50 queries    49 queries    50 queries
doc=Okapi, query=npn      50.08         48.41         41.05         51.71
5 docs / 10 best terms    49.54         53.10         45.14         55.16
5 docs / 15 best terms    48.68         53.18         46.07         54.95
5 docs / 20 best terms    48.62         53.13         46.35         54.41
10 docs / 10 best terms   47.77         52.03         45.37         55.94
10 docs / 15 best terms   46.92         52.75         46.18         56.00
10 docs / 20 best terms   47.42         52.78         45.87         56.13
Table 6a: Average precision using blind-query expansion

Average precision (German collection, queries T-D, 50 queries)
Model                     words           decompounded    5-gram           combined (Eq. 1)
doc=Okapi, query=npn      37.39           37.75           39.83            41.25
# docs / # terms          5 / 40   42.90  5 / 40   42.19  10 / 200  45.45  46.72
# docs / # terms          5 / 40   42.90  5 / 40   42.19  5 / 300   45.82  46.27
Table 6b: Average precision using blind-query expansion

This year we also participated in the Dutch and Finnish monolingual tasks; the results are depicted in Table 7, and the average precision of the Okapi model using blind-query expansion is given in Table 8. For these two languages, we also combined an indexing model based on 5-grams with a word-based document representation. While for the Dutch language our combined model seems to enhance retrieval effectiveness, for the Finnish language it does not. This was however a first trial for our proposed Finnish stemmer, and it seemed to improve the average precision over a baseline run without any stemming procedure (Okapi model, unstemmed 23.04, with stemming 30.45, an improvement of +32.16%).

Average precision (queries T-D)
                          Dutch         Dutch         Dutch         Finnish       Finnish       Finnish
Model                     word          5-gram        combined      word          5-gram        combined
                          50 queries    50 queries    50 queries    30 queries    30 queries    30 queries
doc=Okapi, query=npn      42.37         41.75         44.56         30.45         38.25         37.51
doc=Lnu, query=ltc        42.57         40.73         44.50         27.58         36.07         36.83
doc=dtu, query=dtc        41.26         40.59         43.00         30.70         36.79         36.47
doc=atn, query=ntc        40.29         40.34         41.89         29.22         37.26         36.51
doc=ltn, query=ntc        38.33         38.72         40.24         29.14         35.28         35.31
doc=ntc, query=ntc        33.35         34.94         36.41         25.21         30.68         31.93
doc=ltc, query=ltc        32.81         31.24         34.46         26.53         30.85         33.47
doc=lnc, query=ltc        31.91         29.67         34.18         24.86         30.43         31.39
doc=bnn, query=bnn        18.91         20.87         23.52         12.46         14.55         18.64
doc=nnn, query=nnn        13.75         10.48         12.86         11.43         14.69         15.56
Table 7: Average precision of various indexing and searching strategies (Dutch and Finnish corpora)

Average precision (queries T-D)
                          Dutch          Dutch           Dutch      Finnish        Finnish        Finnish
Model                     word           5-gram          combined   word           5-gram         combined
doc=Okapi, query=npn      42.37          41.75           44.56      30.45          38.25          37.51
# docs / # terms          5/60   47.86   5/75    45.09   48.78      5/60   31.89   5/75   40.90   39.33
# docs / # terms          5/100  48.84   10/150  46.29   49.28      5/15   32.36   5/175  41.67   40.11
Table 8: Average precision using blind-query expansion

In the monolingual track, we submitted the ten official runs described in Table 9.
Seven of them were fully automatic runs using the requests' Title and Descriptive logical sections, while the last three were based on the requests' Title, Descriptive and Narrative sections. Among these last three runs, two were labeled "manual" because we used logical sections containing manually assigned index terms; for all other runs, we did not use any manual intervention during the indexing and retrieval procedures.

Run name     Language   Query   Form        Model      Query expansion             Average precision
UniNEfr      French     T-D     automatic   Okapi      no expansion                48.41
UniNEit      Italian    T-D     automatic   Okapi      10 best docs / 15 terms     46.18
UniNEes      Spanish    T-D     automatic   Okapi      5 best docs / 20 terms      54.41
UniNEde      German     T-D     automatic   combined   5/40 word, 10/200 5-gram    46.72
UniNEnl      Dutch      T-D     automatic   combined   5/60 word, 5/75 5-gram      48.78
UniNEfi1     Finnish    T-D     automatic   Okapi      5 best docs / 75 terms      40.90
UniNEfi2     Finnish    T-D     automatic   combined   5/60 word, 5/75 5-gram      39.33
UniNEfrtdn   French     T-D-N   manual      Okapi      5 best docs / 10 terms      59.19
UniNEestdn   Spanish    T-D-N   automatic   Okapi      5 best docs / 40 terms      60.51
UniNEdetdn   German     T-D-N   manual      combined   5/50 word, 10/300 5-gram    49.11
Table 9: Official monolingual run descriptions

2. Bilingual information retrieval

In order to overcome language barriers, we based our approach on freely and readily available translation resources that automatically translate queries into the desired target language. More precisely, the original queries were written in English, and we used no parallel or aligned corpora to derive statistically or semantically related words in the target language. Section 2.1 describes our combined strategy for cross-lingual retrieval, while Section 2.2 provides some examples of translation errors. This year, we used five machine translation systems, namely SYSTRAN™ (babel.altavista.com/translate.dyn), GOOGLE.COM (www.google.com/language_tools), FREETRANSLATION.COM (www.freetranslation.com), INTERTRAN (www.tranexp.com:2000/InterTran) and REVERSO ONLINE (translation2.paralink.com). As bilingual dictionary, we used the BABYLON system (www.babylon.com).

2.1. Query automatic translation

In order to develop a fully automatic approach, we chose to translate the requests using five different machine translation (MT) systems. We also translated query terms word by word using the BABYLON bilingual dictionary, which provides not one but several translations for each word submitted. In our experiments, we decided to pick the first translation available (labeled "baby1"), the first two translations (labeled "baby2") or the first three available translations (labeled "baby3").
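The combined translation strategy itself is a simple concatenation, as the following Python sketch illustrates: the query sent to the target-language collection is the MT output strings plus the first k dictionary translations of each query word ("baby1", "baby2", "baby3"). The translation strings in the example are placeholders, not actual Systran or Babylon output.

def combine_translations(mt_outputs, dictionary_lookup, k=1):
    parts = list(mt_outputs)
    for word, candidates in dictionary_lookup.items():
        parts.extend(candidates[:k])          # first k dictionary translations
    return " ".join(parts)

mt_outputs = ["coupe européenne", "tasse européenne"]      # two MT systems
babylon = {"european": ["européen"], "cup": ["tasse", "coupe", "verre"]}
print(combine_translations(mt_outputs, babylon, k=2))
# -> coupe européenne tasse européenne européen tasse coupe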
Average precision (queries T-D, Okapi model)
Translation tool     French    Italian   Spanish   German    German    German
                                                   word      decomp.   5-gram
Original queries     48.41     41.05     51.71     37.39     37.75     39.83
Systran              42.70     32.30     38.49     28.75     28.66     27.74
Google               42.70     32.30     38.35     28.07     26.05     27.19
FreeTranslation      40.58     32.71     40.55     28.85     31.42     27.47
InterTran            33.89     30.28     37.36     21.32     21.61     19.21
Reverso              39.02     N/A       43.28     30.71     30.33     28.71
Babylon 1            43.24     27.65     39.62     26.17     27.66     28.10
Babylon 2            37.58     23.92     34.82     26.78     27.74     25.41
Babylon 3            35.69     21.65     32.89     25.34     26.03     23.66
Comb 1               46.77     33.31     44.57     34.32     34.66     32.75
Comb 2               48.02     34.70     45.63     35.26     34.92     32.95
Comb 2b              48.02               45.53     35.09     34.51     32.76
Comb 3               48.56     34.98     45.34     34.43     34.37     33.34
Comb 3b              48.49     35.02     45.34     34.58     34.43     32.76
Comb 3b2                                           35.41     35.13     33.25
MT 2                           35.82
MT 3                 44.54     35.57     44.32     33.53     33.05     31.96
All                  47.94     35.29     44.25     34.52     34.31     32.79
MT all               46.83     35.68     44.25     33.80     33.51     31.66

Translation resources used in the combined runs:
Comb 1:   French: Rever-baby1; Italian: Free-baby1; Spanish: Rever-baby1; German: Reverso-baby1
Comb 2:   French: Reverso-systran-baby1; Italian: Free-google-baby1; Spanish: Rever-systran-baby1; German: Reverso-systran-baby1
Comb 2b:  French: Reverso-google-baby1; Spanish: Rever-google-baby1; German: Reverso-google-baby1
Comb 3:   French: Reverso-free-google-baby1; Italian: Free-google-inter-baby1; Spanish: Free-google-rever-baby1; German: Reverso-systran-inter-baby1
Comb 3b:  French: Reverso-inter-google-baby1; Italian: Free-google-systran-baby1; Spanish: Free-google-rever-baby2; German: Reverso-google-inter-baby1
Comb 3b2: German: Reverso-google-inter-baby2
MT 2:     Italian: Free-google
MT 3:     French: Reverso-systran-google; Italian: Free-google-inter; Spanish: Free-google-reverso; German: Reverso-inter-systran
Table 10: Average precision of various query translation strategies (Okapi model)

The first part of Table 10 lists the average precision for each translation device used, along with the performance achieved by the manually translated requests. For German, we also report the retrieval effectiveness achieved by the three different indexing approaches, namely using words as indexing terms, decompounding the German words according to our approach, and the 5-gram model. While the REVERSO system seems to be the better choice for German and Spanish, FREETRANSLATION is the best choice for Italian and BABYLON 1 the best for French. In order to improve search performance, we tried combining different machine translation systems with the bilingual dictionary approach. In this case, we formed the translated query by concatenating the different translations provided by the various approaches. Thus, under the column header "Comb 1", we combined one machine translation system with the bilingual dictionary ("baby1"). Similarly, under columns "Comb 2" and "Comb 2b" we list the results of combining two machine translation approaches, and under column headings "Comb 3", "Comb 3b" and "Comb 3b2" three machine translation systems. With the exception of the run under "Comb 3b2", we also included terms provided by the "baby1" dictionary look-up in the translated requests. In columns "MT 2" and "MT 3", we evaluate the combination of two and three machine translation systems respectively, without any dictionary. Finally, we can also combine all translation sources (under the heading "All") or all machine translation approaches (under the heading "MT all"). Since the performance of each translation device depends on the target language, the lower part of Table 10 gives the exact specification of each combined run. For the German language, the same combination of translation resources was used for each of the three indexing models. From an examination of the retrieval effectiveness of our various combined approaches listed in the middle part of Table 10, a clear recommendation cannot be made.
Overall, it seems better to combine two or three machine translation systems with the bilingual dictionary approach ("baby1"). However, combining the five machine translation systems (heading "MT all") or all translation tools (heading "All") does not result in a very effective performance.

Average precision (queries T-D, Okapi model)
Run          Language   Combined         Expansion (# docs / # terms)   Corrected   Official
UniNEfrBi    French     Comb 3b          5 / 20                         51.64       49.35
UniNEfrBi2   French     MT all + baby2   5 / 40                         50.79       48.47
UniNEfrBi3   French     MT all           10 / 15                        48.49       46.20
UniNEitBi    Italian    Comb 2           10 / 60                        38.50       37.36
UniNEitBi2   Italian    Comb 3           10 / 100                       38.62       37.56
UniNEesBi    Spanish    MT 3             10 / 75                        50.67       47.63
UniNEesBi2   Spanish    Comb 3b          10 / 100                       50.95       47.86
UniNEesBi3   Spanish    Comb 2           10 / 75                        50.93       47.84
UniNEdeBi    German     Comb 3b2 &       5 / 100 &                      42.89       41.29
UniNEdeBi2   German     Comb 3           5 / 300                        42.11       40.42
Table 11: Average precision and description of our official bilingual runs (Okapi model)

Table 11 lists the exact specifications of our various bilingual runs. However, when submitting our official results, we used the wrong numbers for Queries #130 and #131 (we switched these two query numbers). Thus, both requests have an average precision of 0.00 in our official results, and we report the corrected performance in Table 11 and in Table 13 (multilingual runs).

2.2. Examples of failures

In order to obtain a preliminary picture of the difficulties underlying the automatic translation approach, we analyzed some queries by comparing the translations produced by our six automatic tools with the request formulations written by a human being (examples are given in Table 12). As a first example, the title of Query #113 is "European Cup". In this case, the term "cup" was interpreted as a teacup by all automatic translation tools, resulting in the French translations "tasse" or "verre" (or "tazza" in Italian, "Schale" in German ("Pokal" can be viewed as a correct translation alternative), and "taza" or "Jícara" (small teacup) in Spanish). In Query #118 ("Finland's first EU Commissioner"), the machine translation systems failed to give the appropriate Spanish term "comisario" for "Commissioner" but returned "comisión" (commission) or "Comisionado" (a term related to commission). For this same request, the manually translated Italian query seems to contain a spelling error ("commisario" instead of "commissario"). Again for the same request, the translation given in German, "Beauftragter" (delegate), does not correspond to the appropriate term "Kommissar" (note also the missing hyphen in the translation "EUBEAUFTRAGTER"). Other examples: for Query #94 ("Return of Solzhenitsyn"), which is translated manually into German as "Rückkehr Solschenizyns", our automatic translation systems fail to transliterate the proper noun (returning "Solzhenitsyn" instead of "Solschenizyns"). Query #109 ("Computer Security") is translated manually into Spanish as "Seguridad Informática", and our various translation devices return different terms for "Computer" (e.g., "Computadora", "Computador" or "ordenador") but not the word "Informática".
<num> C113 (query translations failed in French, Italian, German and Spanish)
<EN-title> European Cup
<FR-title manually translated> Coupe d'Europe de football
<FR-title FREETRANSLATION> Tasse européenne
<FR-title BABYLON 1> Européen verre
<FR-title BABYLON 2> Européen résident de verre tasse
<FR-title BABYLON 3> Européen résident de l'Europe verre tasse coupe
<IT-title manually translated> Campionati europei
<IT-title SYSTRAN> Tazza Europea
<IT-title GOOGLE> Tazza Europea
<GE-title manually translated> Fussballeuropameisterschaft
<GE-title SYSTRAN> Europäische Schale
<GE-title REVERSO> Europäischer Pokal
<ES-title manually translated> Eurocopa
<ES-title INTERTRAN> Europea Jícara
<ES-title REVERSO> Taza europea

<num> C118 (query translations failed in Italian, German and Spanish)
<EN-title> Finland's first EU Commissioner.
<IT-title manually translated> Primo commisario europeo per la Finlandia
<IT-title GOOGLE> Primo commissario dell'Eu della Finlandia.
<IT-title FREETRANSLATION> Finlandia primo Commissario di EU.
<GE-title manually translated> Erster EU-Kommissar aus Finnland
<GE-title GOOGLE> Finnlands erster EUBEAUFTRAGTER.
<GE-title REVERSO> Finlands erster EG-Beauftragter
<ES-title manually translated> Primer comisario finlandés de la UE
<ES-title GOOGLE> Primera comisión del EU de Finlandia.
<ES-title REVERSO> El primer Comisionado de Unión Europea de Finlandia.
Table 12: Examples of unsuccessful query translations

3. Multilingual information retrieval

Using our combined approach to automatically translate a query, we were able to search a document collection for a request written in English. This stage however represents only the first step towards a multi-language information retrieval system. We also need to investigate situations where users write a request in English in order to retrieve pertinent documents written in English, French, Italian, German and Spanish. To deal with this multi-language barrier, we divided our document sources according to language and thus formed five different collections. After searching these corpora and obtaining five result lists, we needed to merge them in order to provide users with a single list of retrieved articles.

Recent work has suggested various solutions for merging the separate result lists obtained from different collections or distributed information services. As a first approach, we may assume that each collection contains approximately the same number of pertinent items and that the distribution of the relevant documents is similar across the result lists. Based solely on the rank of the retrieved records, we can then interleave the results in a round-robin fashion. According to previous studies [Voorhees 1995], the retrieval effectiveness of such an interleaving scheme is around 40% below that achieved by a single retrieval scheme working with a single huge collection representing the entire set of documents. To take account of the document score computed for each retrieved item (or the similarity value between the retrieved record and the request, denoted rsv_j), we might formulate the hypothesis that each collection is searched by the same or a very similar search engine and that the similarity values are therefore directly comparable [Kwok 1995]. Such a strategy, called raw-score merging, produces a final list sorted by the document scores computed by each collection.
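These two simple strategies can be sketched in a few lines of Python; each result list is assumed to be a ranked sequence of (document identifier, score) pairs, and the lists in the example are toy data.

from itertools import zip_longest

def round_robin(result_lists):
    merged = []
    for rank_slice in zip_longest(*result_lists):
        merged.extend(hit for hit in rank_slice if hit is not None)
    return merged

def raw_score_merge(result_lists):
    return sorted((hit for lst in result_lists for hit in lst),
                  key=lambda hit: hit[1], reverse=True)

french = [("LEMONDE-001", 12.4), ("LEMONDE-002", 9.1)]
german = [("SPIEGEL-007", 31.0), ("RUNDSCHAU-003", 2.2)]
print(round_robin([french, german]))
print(raw_score_merge([french, german]))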
However, collection-dependent statistics in document or query weights may vary widely among collections, and this phenomenon may invalidate the raw-score merging hypothesis. To account for this fact, we might normalize the document scores within each collection by dividing them by the maximum score (i.e. the document score of the retrieved record in the first position). As a variant of this normalized score merging scheme, Powell et al. [2000] suggest normalizing each document score rsv_j according to the following formula:

rsv'_j = (rsv_j - rsv_min) / (rsv_max - rsv_min)    (1)

in which rsv_j is the original retrieval status value (or document score), and rsv_max and rsv_min are the maximum and minimum document score values that a collection could achieve for the current request. In this study, rsv_max is given by the document score achieved by the first retrieved item, and the retrieval status value obtained by the 1000th retrieved record gives the value of rsv_min.

As a fourth strategy, we might use logistic regression [Flury 1997, Chapter 7] to predict the probability of a binary outcome variable according to a set of explanatory variables. Based on this statistical approach, Le Calvé and Savoy [2000] and Savoy [2002a] described how to predict the probability of relevance of the documents retrieved by different retrieval schemes or collections. The estimated probabilities are predicted according to both the original document score rsv_i and the logarithm of the rank_i attributed to the corresponding document D_i. Based on these estimated relevance probabilities, we sort the records retrieved from the separate collections in order to obtain a single ranked list. However, in order to estimate the underlying parameters, this approach requires a training set, in this case the CLEF-2001 topics and their relevance assessments.

Prob[D_i is rel | rank_i, rsv_i] = e^(α + β1·ln(rank_i) + β2·rsv_i) / (1 + e^(α + β1·ln(rank_i) + β2·rsv_i))    (2)

in which rank_i denotes the rank of the retrieved document D_i, ln() is the natural logarithm, and rsv_i is the retrieval status value (or document score) of the document D_i. In this equation, the coefficients α, β1 and β2 are unknown parameters that are estimated according to the maximum likelihood method (the required computations were done with the S language).
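The following Python sketch illustrates both the score normalization of Equation 1 and the logistic merging of Equation 2; the coefficients α, β1 and β2 given below are arbitrary placeholders, whereas in our experiments they are estimated by maximum likelihood on the CLEF-2001 data.

from math import exp, log

def normalized_score(rsv, rsv_min, rsv_max):
    return (rsv - rsv_min) / (rsv_max - rsv_min)               # Equation 1

def relevance_probability(rank, rsv, alpha=-3.0, beta1=-0.5, beta2=0.1):
    x = alpha + beta1 * log(rank) + beta2 * rsv                # Equation 2
    return exp(x) / (1.0 + exp(x))

def merge_by_probability(result_lists):
    # result_lists: one ranked list of (document identifier, rsv) per language
    merged = []
    for hits in result_lists:
        for rank, (doc_id, rsv) in enumerate(hits, start=1):
            merged.append((doc_id, relevance_probability(rank, rsv)))
    return sorted(merged, key=lambda item: item[1], reverse=True)

french = [("LEMONDE-001", 12.4), ("LEMONDE-002", 9.1)]
spanish = [("EFE-009", 30.3), ("EFE-004", 22.7)]
print(normalized_score(9.1, rsv_min=1.3, rsv_max=12.4))
print(merge_by_probability([french, spanish]))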
Bilingual runs used (average precision, queries T-D):
  English (42 queries): 50.08    French UniNEfrBi (50 queries): 51.64    Italian UniNEitBi (49 queries): 38.50    Spanish UniNEesBi (50 queries): 50.67    German UniNEdeBi (50 queries): 42.89
Multilingual merging (50 queries):
  Round-robin: 34.27    Raw-score: 33.83    Eq. 1: 36.62    Log ln(rank_i): 36.10    Log reg Eq. 2: 39.49

Bilingual runs used (average precision, queries T-D):
  English (42 queries): 50.08    French UniNEfrBi2 (50 queries): 50.79    Italian UniNEitBi2 (49 queries): 38.62    Spanish UniNEesBi2 (50 queries): 50.95    German UniNEdeBi2 (50 queries): 42.11
Multilingual merging (50 queries):
  Round-robin: 33.97    Raw-score: 33.99    Eq. 1: 36.90    Log ln(rank_i): 35.59    Log reg Eq. 2: 39.25
Table 13: Average precision using various merging strategies based on automatically translated queries

When searching the multilingual corpus using Okapi, the round-robin scheme and the raw-score merging strategy provide very similar retrieval performances (see Table 13). The normalized score merging based on Equation 1 shows an enhancement over the round-robin approach (36.62 vs. 34.27, an improvement of +6.86%, in our first experiment, and 36.90 vs. 33.97, +8.63%, in our second run). Using our logistic model with only the rank as explanatory variable (or more precisely ln(rank_i); performance depicted under the label "Log ln(rank_i)"), the resulting average precision is lower than that of the normalized score merging. Merging the result lists based on the logistic regression approach (using both the rank and the document score as explanatory variables) presents the best average precision.

Query T-D     UniNEm1      UniNEm2         UniNEm3      UniNEm4         UniNEm5
Merging       Equation 1   Log reg Eq. 2   Equation 1   Log reg Eq. 2   Equation 1
Corrected     36.62        39.49           36.90        39.25           35.97
Official      34.88        37.83           35.12        37.56           35.52
Table 14: Average precision obtained with our official multilingual runs

Our official and corrected results are shown in Table 14, while some statistics about the number of documents provided by each collection are given in Table 15. From these data, we can see that the normalized score merging (UniNEm1) extracts more documents from the English corpus (on average 24.94 items) than the logistic regression model (UniNEm2, where on average 11.44 documents come from the English collection). Moreover, the logistic regression scheme takes more documents from the Spanish and German collections. Finally, we can see that the percentage of relevant items is relatively similar when comparing the CLEF-2001 and CLEF-2002 test-collections.

Statistics \ Language      English          French           Italian          Spanish          German
UniNEm1, based on the top 100 retrieved documents for each query
  Mean                     24.94            16.68            19.12            23.8             15.46
  Median                   23.5             15               18               22               15
  Maximum                  60 (q#:101)      54 (q#:110)      45 (q#:136)      70 (q#:121)      54 (q#:116)
  Minimum                  4 (q#:108)       5 (q#:97,123)    5 (q#:93,114)    6 (q#:98,110)    2 (q#:139)
  Standard deviation       13.14            9.26             9.17             14.15            9.79
UniNEm2, based on the top 100 retrieved documents for each query
  Mean                     11.44            15.58            16.18            34.3             22.5
  Median                   9                14               16               34.5             19
  Maximum                  33 (q#:92)       38 (q#:110)      28 (q#:108)      62 (q#:91)       59 (q#:116)
  Minimum                  1 (q#:135)       6 (q#:102,123)   8 (q#:114)       10 (q#:116)      4 (q#:91)
  Standard deviation       6.71             7.49             5.18             10.90            11.90
% relevant items CLEF02    10.18%           17.14%           13.29%           35.37%           24.02%
% relevant items CLEF01    10.52%           14.89%           15.31%           33.10%           26.17%
Table 15: Statistics about the merging schemes based on the top 100 retrieved documents for each query

4. Amaryllis experiments

For the Amaryllis experiments, we wanted to determine whether a specialized thesaurus might improve retrieval effectiveness over a baseline ignoring term relationships. From the original documents and during the indexing process, we retained only the following logical sections in our runs: <text>, <ti>, <ab>, <mc>, <kw>.
<RECORD> <TERMFR> Analyse de poste <TRADENG> Station Analysis
<RECORD> <TERMFR> La Poste <TRADENG> Postal services
<RECORD> <TERMFR> Bureau poste <TRADENG> Post offices
<RECORD> <TERMFR> Bureau poste <TRADENG> Post office
<RECORD> <TERMFR> Poste conduite <TRADENG> Operation platform <SYNOFRE1> Cabine conduite
<RECORD> <TERMFR> POSTE DE TRAVAIL <TRADENG> WORK STATION
<RECORD> <TERMFR> Poste de travail <TRADENG> Work Station
<RECORD> <TERMFR> Poste de travail <TRADENG> Work station
<RECORD> <TERMFR> Poste de travail <TRADENG> workstations <SYNOFRE1> Poste travail
<RECORD> <TERMFR> Isolation poste électrique <TRADENG> Substation insulation
<RECORD> <TERMFR> Caserne pompier <TRADENG> Fire houses <SYNOFRE1> Poste incendie
<RECORD> <TERMFR> Habitacle aéronef <TRADENG> Cockpits (aircraft) <SYNOFRE1> Poste pilotage
Table 16: Sample of various entries under the word "poste" in the Amaryllis thesaurus

From the given thesaurus, we extracted 126,902 terms having a relationship with one or more other terms (the thesaurus contains 173,946 entries delimited by the tags <RECORD> … </RECORD>, but only 149,207 entries have at least one relationship with another term). From these 149,207 entries, we found 22,305 multiple entries, which were removed (for example, the terms "Poste de travail" or "Bureau poste" in Table 16). In building our thesaurus, we removed the accents, wrote all terms in lowercase, and ignored numbers and terms given between parentheses. For example, the word "poste" appears in 49 records (usually as part of a compound entry in the <TERMFR> field). From our 126,902 entries, we counted 107,038 TRADENG relationships, 14,590 SYNOFRE1 relationships, 26,772 AUTOP1 relationships and 1,071 VAUSSI1 relationships (see the examples given in Table 16).

In a first set of experiments, we did not use this thesaurus, and we used either the Title and Descriptive logical sections of the requests (second column of Table 17a) or the Title, Descriptive and Narrative parts of the queries (last column of Table 17a). In a second set of experiments, we included all related words that could be found in the thesaurus using only the search keywords (average precision depicted under the label "Qthes"). In a third experiment, we enlarged only the document representatives using our thesaurus (performance shown under column heading "Dthes"). In a last experiment, we took account of related words found in the thesaurus for the document surrogates only, under the additional condition that such a relationship be found with an entry of at least three terms (e.g., "moteur à combustion" is a valid candidate but not a single term like "moteur"); on the query side, we also included all relationships that could be found using the search keywords (performance shown under the column heading "Dthes3Qthes").
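The following Python sketch illustrates the principle of the "Qthes" strategy: each search keyword is expanded with the related terms found in the thesaurus. The small mapping shown here is built by hand from the Table 16 excerpt (accents removed, lowercased) and merely stands in for the parsed Amaryllis thesaurus.

THESAURUS = {
    "poste de travail": ["work station", "poste travail"],
    "caserne pompier": ["fire houses", "poste incendie"],
    "habitacle aeronef": ["cockpits", "poste pilotage"],
}

def expand_query(query_terms, thesaurus=THESAURUS):
    expanded = list(query_terms)
    for term in query_terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

print(expand_query(["poste de travail", "ergonomie"]))
# -> ['poste de travail', 'ergonomie', 'work station', 'poste travail']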
Average precision (Amaryllis, 25 queries)
                          T-D       T-D       T-D       T-D            T-D-N
Model                               Qthes     Dthes     Dthes3Qthes
doc=Okapi, query=npn      45.75     45.45     44.28     44.85          53.65
doc=Lnu, query=ltc        43.07     44.28     41.75     43.45          49.87
doc=dtu, query=dtc        39.09     41.12     40.25     42.81          47.97
doc=atn, query=ntc        42.19     43.83     40.78     43.46          51.44
doc=ltn, query=ntc        39.60     41.14     39.01     40.13          47.50
doc=ntc, query=ntc        28.62     26.87     25.57     26.26          33.89
doc=ltc, query=ltc        33.59     34.09     33.42     33.78          42.47
doc=lnc, query=ltc        37.30     36.77     35.82     36.10          46.09
doc=bnn, query=bnn        20.17     23.97     19.78     23.51          24.72
doc=nnn, query=nnn        13.59     13.05     10.18     12.07          15.94
Table 17a: Average precision of various indexing and searching strategies (Amaryllis)

Average precision (Amaryllis, 25 queries)
                          T-D       T-D       T-D       T-D            T-D-N
Model                               Qthes     Dthes     Dthes3Qthes
doc=Okapi, query=npn      45.75     45.45     44.28     44.85          53.65
5 docs / 10 terms         47.75     47.29     46.41     46.73          55.80
5 docs / 50 terms         49.33     48.27     47.84     47.61          56.72
5 docs / 100 terms        49.28     48.53     47.78     47.83          56.71
10 docs / 10 terms        47.71     47.43     46.28     47.21          55.58
10 docs / 50 terms        49.04     48.46     48.49     48.12          56.34
10 docs / 100 terms       48.96     48.60     48.56     48.29          56.34
25 docs / 10 terms        47.07     46.63     45.79     46.77          55.31
25 docs / 50 terms        48.02     47.64     47.23     47.85          55.82
25 docs / 100 terms       48.03     47.78     47.38     47.83          55.80
Table 17b: Average precision using blind-query expansion (Amaryllis)

From the average precision figures depicted in Tables 17a and 17b, we cannot infer that the available thesaurus is really helpful in improving retrieval effectiveness, at least as implemented in this study.

Run name      Query   Form        Model   Thesaurus           Query expansion       Av. precision
UniNEama1     T-D     automatic   Okapi   no                  25 docs / 50 terms    48.02
UniNEama2     T-D     automatic   Okapi   with query terms    25 docs / 25 terms    47.34
UniNEama3     T-D     automatic   Okapi   with documents      25 docs / 50 terms    47.23
UniNEama4     T-D     automatic   Okapi   both query & doc    10 docs / 15 terms    47.78
UniNEamaN1    T-D-N   automatic   Okapi   no                  25 docs / 50 terms    55.82
Table 18: Official Amaryllis run descriptions

Conclusion

For our second participation in the CLEF retrieval tasks, we suggested a general stopword list and stemming procedure for the French, Italian, German, Spanish and Finnish languages. We also suggested a simple decompounding approach for the German language. For the Dutch, Finnish and German languages, we considered 5-gram indexing and word-based (and, for German, decompounding-based) document representations to be distinct and independent sources of evidence on document content, and combining these two (or three) indexing schemes proved to be good practice. To improve bilingual information retrieval, we suggest using not one but two or three different translation sources to translate the query into the target languages; such a combination seems to improve retrieval effectiveness. In the multilingual environment, we demonstrated that a learning scheme such as logistic regression can perform effectively. As a second-best solution, we suggested a simple normalization procedure based on the document score. Finally, in the Amaryllis experiments, we studied various possible ways of using a specialized thesaurus to improve average precision; however, the various strategies used in this paper do not demonstrate a clear enhancement over a baseline that ignores the term relationships stored in the thesaurus.
Acknowledgments

The author would like to thank C. Buckley from SabIR for giving us the opportunity to use the SMART system, without which this study could not have been conducted. This research was supported in part by the SNSF (Swiss National Science Foundation) under grants 21-58 813.99 and 21-66 742.01.

References

[Buckley 1996] Buckley, C., Singhal, A., Mitra, M. & Salton, G. (1996). New retrieval approaches using SMART. In Proceedings of TREC-4 (pp. 25-48). Gaithersburg: NIST Publication #500-236.
[Chen 2002] Chen, A. (2002). Multilingual information retrieval using English and Chinese queries. In C. Peters, M. Braschler, J. Gonzalo & M. Kluck (Eds.), Evaluation of Cross-Language Information Retrieval Systems. Lecture Notes in Computer Science #2409. Berlin: Springer-Verlag.
[Figuerola 2002] Figuerola, C.G., Gómez, R. & Zazo Rodríguez, A.F. (2002). Stemming in Spanish: A first approach to its impact on information retrieval. In C. Peters, M. Braschler, J. Gonzalo & M. Kluck (Eds.), Evaluation of Cross-Language Information Retrieval Systems. Lecture Notes in Computer Science #2409. Berlin: Springer-Verlag.
[Flury 1997] Flury, B. (1997). A First Course in Multivariate Statistics. New York: Springer.
[Fox 1990] Fox, C. (1990). A stop list for general text. ACM-SIGIR Forum, 24, 19-35.
[Kraaij 1996] Kraaij, W. & Pohlmann, R. (1996). Viewing stemming as recall enhancement. In Proceedings of the 19th International Conference of the ACM-SIGIR'96 (pp. 40-48). New York: The ACM Press.
[Kwok 1995] Kwok, K.L., Grunfeld, L. & Lewis, D.D. (1995). TREC-3 ad-hoc, routing retrieval and thresholding experiments using PIRCS. In Proceedings of TREC-3 (pp. 247-255). Gaithersburg: NIST Publication #500-225.
[Le Calvé 2000] Le Calvé, A. & Savoy, J. (2000). Database merging strategy based on logistic regression. Information Processing & Management, 36(3), 341-359.
[Lovins 1968] Lovins, J.B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1), 22-31.
[McNamee 2002] McNamee, P. & Mayfield, J. (2002). JHU/APL experiments at CLEF: Translation resources and score normalization. In C. Peters, M. Braschler, J. Gonzalo & M. Kluck (Eds.), Evaluation of Cross-Language Information Retrieval Systems. Lecture Notes in Computer Science #2409. Berlin: Springer-Verlag.
[Molina-Salgado 2002] Molina-Salgado, H., Moulinier, I., Knutson, M., Lund, E. & Sekhon, K. (2002). Thomson Legal and Regulatory at CLEF 2001: Monolingual and bilingual experiments. In C. Peters, M. Braschler, J. Gonzalo & M. Kluck (Eds.), Evaluation of Cross-Language Information Retrieval Systems. Lecture Notes in Computer Science #2409. Berlin: Springer-Verlag.
[Monz 2002] Monz, C. & de Rijke, M. (2002). The University of Amsterdam at CLEF 2001. In C. Peters, M. Braschler, J. Gonzalo & M. Kluck (Eds.), Evaluation of Cross-Language Information Retrieval Systems. Lecture Notes in Computer Science #2409. Berlin: Springer-Verlag.
[Porter 1980] Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14, 130-137.
[Powell 2000] Powell, A.L., French, J.C., Callan, J., Connell, M. & Viles, C.L. (2000). The impact of database selection on distributed searching. In Proceedings of the 23rd International Conference of the ACM-SIGIR'2000 (pp. 232-239). New York: The ACM Press.
[Robertson 2000] Robertson, S.E., Walker, S. & Beaulieu, M. (2000). Experimentation as a way of life: Okapi at TREC. Information Processing & Management, 36(1), 95-108.
[Savoy 1999] Savoy, J. (1999). A stemming procedure and stopword list for general French corpora. Journal of the American Society for Information Science, 50(10), 944-952.
[Savoy 2002a] Savoy, J. (2002). Cross-language information retrieval: Experiments based on CLEF-2000 corpora. Information Processing & Management, to appear.
[Savoy 2002b] Savoy, J. (2002). Report on CLEF-2001 experiments: Effective combined query-translation approach. In C. Peters, M. Braschler, J. Gonzalo & M. Kluck (Eds.), Evaluation of Cross-Language Information Retrieval Systems. Lecture Notes in Computer Science #2409. Berlin: Springer-Verlag.
[Savoy 2002c] Savoy, J. (2002). Recherche d'informations dans des corpus en langue française : Utilisation du référentiel Amaryllis. TSI, Technique et Science Informatiques, 21(3), 345-373.
[Singhal 1999] Singhal, A., Choi, J., Hindle, D., Lewis, D.D. & Pereira, F. (1999). AT&T at TREC-7. In Proceedings of TREC-7 (pp. 239-251). Gaithersburg: NIST Publication #500-242.
[Sproat 1992] Sproat, R. (1992). Morphology and Computation. Cambridge: The MIT Press.
[Voorhees 1995] Voorhees, E.M., Gupta, N.K. & Johnson-Laird, B. (1995). The collection fusion problem. In Proceedings of TREC-3 (pp. 95-104). Gaithersburg: NIST Publication #500-225.