Report on CLEF-2001 Experiments
Jacques Savoy
Institut interfacultaire d'informatique, Université de Neuchâtel, Switzerland
Jacques.Savoy@unine.ch
Web site: www.unine.ch/info/
Abstract. For our first participation in the CLEF retrieval tasks, our first objective was to define
general stopword lists for various European languages (namely French, Italian, German and
Spanish) and to suggest simple and efficient stemming procedures for them. Our second aim
was to suggest a combined approach that might be implemented in order to facilitate effective
access to multilingual collections.
1. Monolingual indexing and search
Most European languages (including French, Italian, Spanish, German) share many of the same characteristics as
does the language of Shakespeare (e.g., word boundaries marked in a conventional manner, variant word forms
generated by adding suffixes to the end of a root, etc.). Adapting indexing or search strategies to these languages
thus requires elaborating general stopword lists and fast stemming procedures. Stopword lists contain
non-significant words that are removed from a document or a request before the indexing process is begun. Stemming
procedures try to remove inflectional and derivational suffixes in order to conflate word variants into the same
stem or root.
This first chapter will deal with these issues and is organized as follows: Section 1.1 contains an overview of
our five test collections while Section 1.2 describes our general approach to building stopword lists and
stemmers for use with languages other than English. Section 1.3 describes the Okapi probabilistic model together
with the runs we submitted in the monolingual track.
1.1. Overview of the test collections
The corpora used in our experiments included newspapers such as the Los Angeles Times, Le Monde (French),
La Stampa (Italian), Der Spiegel and Frankfurter Rundschau (German) and EFE (Spanish) and various news items
edited by the Swiss news agency (available in French, German and Italian but without parallel translation). As
shown in Table 1, these corpora are of various sizes, with the English, German and Spanish collections being
twice the volume of the French and Italian sources. On the other hand, the mean number of distinct indexing
terms per document is relatively similar across the corpora (around 130); this number is a little higher for
the English collection (167.33) and clearly higher for the German corpora (509.131).
From the original documents, we retained during the indexing process only the following logical sections in our
automatic runs: <HEADLINE>, <TEXT>, <LEAD>, <LEAD1>, <TX>, <LD>, <TI>, …. On the other hand, we
conducted two experiments (indicated as manual runs), one with the French collections and one with the Italian
corpora, within which we retained the following additional tags. For the French collections: <KW>, <TB>,
<CHA1>, <SUBJECTS>, <NAMES>, <NOM1>, <NOTE>, <GENRE>, <PEOPLE>, <SU11>, <SU21>, <TI07>,
<SOT1>, <SYE1>, …. In the Italian corpora, and for one experiment, we used the following tags: <DE>, <KW>,
<TB>, <ARGUMENTS>, <NAMES>, <LOCATIONS>, <TABLE>, <PEOPLE>, <ORGANISATIONS>, ….
From topic descriptions, we automatically removed certain phrases such as "Relevant document report …", "Find
documents that give …", "Trouver des documents qui parlent …", "Sono valide le discussioni e le decisioni …",
"Relevante Dokumente berichten …" or "Los documentos relevantes proporcionan información …".
To evaluate our approaches, we used the SMART system as a test bed for implementing the OKAPI probabilistic
model [Robertson 2000]. This year our experiments were conducted on an Intel Pentium III/600 (memory:
1 GB, swap: 2 GB, disk: 6 x 35 GB).
                              English     French      Italian     German      Spanish
Size (in MB)                  425 MB      243 MB      278 MB      527 MB      509 MB
# of documents                113,005     87,191      108,578     225,371     215,738
Number of distinct indexing terms / document
  mean                        167.33      140.476     129.908     509.131     120.245
  standard error              126.315     118.605     97.602      431.527     60.148
  median                      138         102         92          396         107
  maximum                     1,812       1,723       1,394       8,136       682
  minimum                     2           3           1           1           5
  max df                      69,082      42,983      48,805      129,562     215,151
Number of indexing terms / document
  mean                        273.846     208.709     173.477     703.068     183.658
  standard error              246.878     178.907     130.746     712.416     87.873
  median                      212         152         125         516         163
  maximum                     6,087       3,946       3,775       17,213      1,073
  minimum                     2           8           2           1           13
Number of queries             47          48          47          49          49
No rel. doc. for queries      #54, #57,   #64, #87    #43, #52,   #44         #61
                              #60                     #64
Number of relevant items      856         1,193       1,246       2,238       2,694
Mean relevant / request       18.21       24.85       26.51       42.04       54.97
  standard error              22.56       24.57       24.37       47.77       63.68
  median                      10          17          18          27          26
  maximum                     107 (#50)   90 (#60)    95 (#50)    212 (#42)   261 (#42)
  minimum                     1 (#59)     1 (#43)     2 (#44)     1 (#64)     1 (#64)
Table 1: Test collection statistics
1.2. Stopword lists and stemming procedures
To define general stopword lists, we took advantage of the lists already available for the English and
French languages [Fox 1990], [Savoy 1999]. For the three other languages, we established a general stopword
list following the guidelines described in [Fox 1990]. Firstly, we sorted all word forms appearing in our
corpora according to their frequency of occurrence and extracted the 200 most frequently occurring words.
Secondly, we inspected this list to remove all numbers (e.g., "1994", "1"), plus all nouns and adjectives more or
less directly related to the main subjects of the underlying collections. For example, the German word "Prozent"
(ranking 69), the Italian noun "Italia" (ranking 87) or the term "política" (ranking 131) from the Spanish corpora
were removed from the final list. From our point of view, such words can be useful as indexing terms in other
circumstances. Thirdly, we included some non-information-bearing words, even if they did not appear in the first
200 most frequent words. For example, we added various personal or possessive pronouns (such as "meine",
"my" in German), prepositions ("nello", "in the" in Italian), conjunctions ("où", "where" in French) or verbs
("estar", "to be" in Spanish). The presence of homographs represents another debatable issue, and to some
extent, we had to make arbitrary decisions concerning their inclusion in stopword lists. For example, the French
word "son" can be translated as "sound" or "his".
The resulting stopword lists thus contained a large number of pronouns, articles, prepositions and conjunctions.
As in various English stopword lists, there were also some verbal forms ("sein", "to be" in German; "essere", "to
be" in Italian; "sono", "I am" in Italian). In our experiments we used the stoplist provided by the SMART system
(571 English words), and our 217 French words, 431 Italian words, 294 German words and 272 Spanish terms
(these stopword lists are available at http://www.unine.ch/info/clef/).
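The first step of this procedure (ranking word forms by corpus frequency and extracting the top 200 candidates) can be sketched as follows; the tokenization rule and the toy corpus are illustrative assumptions, not the exact procedure used:

```python
from collections import Counter
import re

def stopword_candidates(texts, top_n=200):
    """Rank word forms by frequency of occurrence; the top entries are
    candidates for a stopword list (step 1 of [Fox 1990]'s guidelines).
    Numbers are excluded by the token pattern; topical nouns and
    adjectives must still be pruned by hand (step 2)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]

corpus = ["le chat et le chien", "le chien dort et le chat dort"]
print(stopword_candidates(corpus, top_n=3))  # → ['le', 'chat', 'et']
```

Steps 2 and 3 (removing corpus-specific content words such as "Prozent" or "Italia", then adding pronouns, prepositions, conjunctions and frequent verb forms) remain manual inspection tasks.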
After removing high frequency words, an indexing procedure tries to conflate word variants into the same stem or
root using a stemming algorithm. In developing this procedure for the French, Italian, German and Spanish
languages, it is important to remember that these languages have more complex morphologies than does the
English language [Sproat 1992]. As a first approach, we intended to remove only inflectional suffixes such that
singular and plural word forms or feminine and masculine forms conflate to the same root. More sophisticated
schemes have already been proposed for the removal of derivational suffixes (e.g., «-ize», «-ably», «-ship» in the
English language), such as the stemmer developed by Lovins [1968], which is based on a list of over 260
suffixes, while that of Porter [1980] looks for about 60 suffixes.
A "quick and dirty" stemming procedure has already been developed for the French language [Savoy 1999]. Based
on the same concept, we have implemented a stemming algorithm for the Italian, Spanish and German languages
(the C code for these stemmers can be found at http://www.unine.ch/info/clef/). In Italian, the main inflectional
rule is to modify the final character (e.g., «-o», «-a» or «-e») into another (e.g., «-i», «-e»). As a second rule,
Italian morphology may also alter the final two letters (e.g., «-io» in «-o», «-co» in «-chi», «-ga» in «-ghe»). In
Spanish, the main inflectional rule is to add one or two characters to denote the plural form of nouns or
adjectives (e.g., «-s», «-es» like in "amigo" and "amigos" (friend) or "rey" and "reyes" (king)) or to modify the
final character (e.g., «-z» in «-ces» in "voz" and "voces" (voice)). In German, a few rules may be applied to
obtain the plural form of words (e.g., "Sängerin" into "Sängerinnen" (singer), "Boot" into "Boote" (boat), "Gott"
into "Götter" (god)). However, the suggested algorithms do not account for person and tense variations used by
verbs or other derivational constructions.
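The two Italian rules above could be sketched as follows. This is a hypothetical simplification for illustration only, not the distributed C implementation; the minimum-length guard and the exact suffix inventory are assumptions:

```python
def stem_italian(word):
    """Minimal inflectional stemmer sketch for Italian: conflate
    singular/plural and masculine/feminine forms to one stem."""
    w = word.lower()
    if len(w) <= 3:          # too short to carry a removable suffix
        return w
    # Rule 2 first: two-letter alternations where the written 'h'
    # preserves a hard consonant (amico/amici, paga/paghe).
    for plural, stem_tail in (("chi", "c"), ("che", "c"),
                              ("ghi", "g"), ("ghe", "g")):
        if w.endswith(plural):
            return w[:-3] + stem_tail
    # Rule 1: strip a final inflectional vowel (-o, -a, -e, -i).
    if w[-1] in "oaei":
        return w[:-1]
    return w

print(stem_italian("amici"))  # → amic
```

With these two rules, "amico", "amici", "amica" and "amiche" all conflate to the stem "amic".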
Finally, the morphology of most European languages manifests other aspects that are not taken into account by
our approach, compound word constructions being just one example (e.g., handgun, worldwide). In German,
compound words are widely used, and this causes more difficulties than in English. For example, a life
insurance company employee would be "Lebensversicherungsgesellschaftsangestellter" (Leben + s + Versicherung
+ s + Gesellschaft + s + Angestellter, for life + insurance + company + employee). Also, the morphological marker
(«s») is not always present (e.g., "Bankangestelltenlohn", built as Bank + Angestellten + Lohn (salary)). Finally,
diacritic characters are usually not present in an English collection (with some exceptions, such as "à la carte" or
"résumé"); such characters are replaced by their corresponding non-accented letters.
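One simple way to implement this diacritic replacement is through Unicode decomposition; this is a sketch of the idea, not necessarily the mapping used in the original system:

```python
import unicodedata

def strip_diacritics(text):
    """Replace accented letters with their base form: decompose each
    character (NFD), then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_diacritics("résumé à la carte"))  # → resume a la carte
```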
Given that French, Italian and Spanish morphology is comparable to that of English, we decided to index French,
Italian and Spanish documents based on word stems. For the German language and its more complex
compounding morphology, we decided to use a 5-gram approach [McNamee 2000], [Mayfield 2001]. The value
of 5 was chosen for two reasons: it returned better performance on the CLEF-2000 corpora [Savoy 2001a], and it
is close to the mean word length in our German corpora (mean word length: 5.87; standard error: 3.7).
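Such character 5-gram indexing units can be generated per word as sketched below; details such as word-boundary padding or grams spanning word boundaries may differ from the cited approach:

```python
def char_ngrams(text, n=5):
    """Split a text into overlapping character n-grams, word by word.
    Words shorter than n are kept whole."""
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

print(char_ngrams("Lebensversicherung", n=5)[:3])
# → ['leben', 'ebens', 'bensv']
```

Because the grams overlap, the components of a German compound ("versicherung", "gesellschaft", …) still share indexing units with their free-standing forms, without any dictionary-based segmentation.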
1.3. Indexing and searching strategy
For the CLEF-2001 experiments, we conducted various runs using the OKAPI probabilistic model
[Robertson 2000], in which the weight wij assigned to a given term tj in a document Di was computed according
to the following formula:

    wij = ((k1 + 1) . tfij) / (K + tfij),   with   K = k1 . ((1 - b) + b . (li / avdl))

where tfij indicates the within-document term frequency, and b and k1 are constants (fixed at b = 0.75 and
k1 = 1.2). K depends on the ratio between the length of Di, measured by li (the sum of the tfij), and the mean
document length, denoted avdl (fixed at 900).
To index a keyword contained in a request Q, the following formula was used:
wqj = tfqj . ln[(n - dfj ) / dfj]
where tfqj indicates the search term frequency, dfj the number of documents in the collection containing the term,
and n the number of documents in the collection.
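The two weighting formulas above translate directly into code; the example values below are illustrative:

```python
import math

def okapi_doc_weight(tf_ij, doc_len, avdl=900.0, k1=1.2, b=0.75):
    """Okapi weight w_ij of a term in document D_i (formula above)."""
    K = k1 * ((1 - b) + b * (doc_len / avdl))
    return ((k1 + 1) * tf_ij) / (K + tf_ij)

def query_weight(tf_qj, df_j, n):
    """Query-term weight wqj = tfqj * ln((n - dfj) / dfj)."""
    return tf_qj * math.log((n - df_j) / df_j)

# A term occurring twice in a document of average length (l_i = avdl):
print(okapi_doc_weight(tf_ij=2, doc_len=900))  # → 1.375
```

Note that the saturating form (k1 + 1) . tf / (K + tf) grows quickly for the first occurrences of a term and then flattens, while K penalizes documents longer than the mean length avdl.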
It has been observed that pseudo-relevance feedback (blind expansion) seems to be a useful technique for
enhancing retrieval effectiveness. In this study, we adopted Rocchio's approach [Buckley 1996] with α = 0.75
and β = 0.75, whereby the system was generally allowed to add 10 search keywords, extracted from the 5
best-ranked documents, to the original query.
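The expansion step can be sketched as follows. This is a simplified illustration: it adds terms by their mean frequency in the top-ranked documents, whereas the actual system derives the added weights from the Okapi weighting scheme:

```python
from collections import Counter

def rocchio_expand(query, top_docs, alpha=0.75, beta=0.75, n_terms=10):
    """Blind (pseudo-relevance) expansion in the spirit of Rocchio:
    original query terms keep alpha times their weight; the n_terms
    strongest terms from the top-ranked documents are added with a
    beta-weighted mean frequency."""
    expanded = {t: alpha * w for t, w in query.items()}
    centroid = Counter()
    for doc in top_docs:           # doc: mapping term -> frequency
        centroid.update(doc)
    added = 0
    for term, freq in centroid.most_common():
        weight = beta * freq / len(top_docs)
        if term in expanded:       # reinforce an original query term
            expanded[term] += weight
        elif added < n_terms:      # add a new expansion term
            expanded[term] = weight
            added += 1
    return expanded
```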
In the monolingual track, we submitted six runs along with their corresponding descriptions as listed in Table 2.
Four of them were fully automatic using the request's Title and Descriptive logical sections while the last two
used more logical sections from the documents and were based on the request's Title, Descriptive and Narrative
sections. These last two runs were labeled "manual" because they used logical sections containing manually
assigned index terms; however, no manual intervention took place during the indexing and retrieval procedures
themselves.
As a retrieval effectiveness indicator, we adopted non-interpolated average precision (computed by the
TREC-EVAL program on the basis of 1,000 retrieved items per request), which accounts for both precision and
recall in a single number. These (unofficial) values are depicted in the last column of Table 2.
Run name      Language   Query   Form        Query expansion              Average precision
UniNEmofr     French     T-D     automatic   10 terms from 5 best docs    (50.00)
UniNEmoit     Italian    T-D     automatic   10 terms from 5 best docs    (48.65)
UniNEmoge     German     T-D     automatic   30 terms from 5 best docs    (42.32)
UniNEmoes     Spanish    T-D     automatic   10 terms from 5 best docs    (58.00)
UniNEmofrM    French     T-D-N   manual      no expansion                 (51.84)
UniNEmoitM    Italian    T-D-N   manual      10 terms from 5 best docs    (54.18)
Table 2: Monolingual run descriptions
2. Multilingual information retrieval
In order to overcome language barriers [Oard 1996], [Grefenstette 1998], we based our approach on freely and
readily available translation resources that automatically translate queries into the desired target languages. More
precisely, the original queries were written in English, and we did not use any parallel or
aligned corpora to derive statistically or semantically related words in the target language. The first section of
this chapter describes our combined strategy for cross-lingual retrieval while Section 2.2 provides some examples
of translation errors. Finally, Section 2.3 presents our merging strategy and a description of our runs submitted
in the multilingual track.
2.1. Automatic query translation
In order to develop a fully automatic approach, we chose to translate the requests using the SYSTRAN®
system [Gachot 1998] (available for free at http://www.systran.com) and to translate query terms word-by-word
using the BABYLON bilingual dictionary (available at http://www.babylon.com) [Hull 1996]. In the latter case,
the bilingual dictionary may suggest not one but several terms as the translation of each word. In our
experiments, we decided to pick the first translation available (under the heading "babylon1") or the first two
terms (indicated under the label "babylon2").
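The babylon1/babylon2 selection amounts to the sketch below; the dictionary data structure and the decision to keep untranslatable words unchanged are illustrative assumptions:

```python
def translate_word_by_word(query_terms, bilingual_dict, n_alternatives=1):
    """Word-by-word query translation keeping the first one or two
    dictionary alternatives (the 'babylon1' / 'babylon2' settings).
    bilingual_dict maps a source word to an ordered list of
    translations; words with no entry are kept as-is (assumption)."""
    translated = []
    for term in query_terms:
        alternatives = bilingual_dict.get(term, [term])
        translated.extend(alternatives[:n_alternatives])
    return translated

# Hypothetical English-to-French dictionary fragment:
en_fr = {"death": ["mort", "décès"], "results": ["résultats", "issues"]}
print(translate_word_by_word(["death", "results"], en_fr, n_alternatives=2))
# → ['mort', 'décès', 'résultats', 'issues']
```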
Figure 1: Distribution of the number of translation alternatives per English keyword (x-axis: number of
alternatives, from 0 to 15 and more; y-axis: number of keywords; one curve per target language: German,
Spanish, Italian and French)
In order to obtain a quantitative picture of term ambiguity, we analyzed the number of translation alternatives
generated by BABYLON's bilingual dictionaries. For this study, we did not take into account determiners
(e.g., "the"), conjunctions and prepositions (e.g., "and", "in", "of") or words appearing in our English stopword
list (e.g., "new", "use"), since such terms generally have a larger number of translations. Based on the Title
section of the English requests, we found 137 search keywords to be translated.
From the data depicted in Table 3, we can see that the mean number of translations provided by BABYLON
dictionaries varies according to language, from 2.94 for German to 5.64 for Spanish. We found the maximum
number of translation alternatives for the word "fall" in French and German (the word "fall" can be viewed as a
noun or a verb), for the term "court" in Italian and for the word "attacks" in Spanish. The median values of these
distributions are rather small, varying from 2 for German to 4 for Spanish. Thus, when considering the first two
translation alternatives, we covered around 54% of the keywords to be translated in German, 40.9% in French,
42.3% in Italian and 36.5% in Spanish. Figure 1 shows more clearly how the number of translation alternatives
is concentrated around one.
In order to improve search performance, we tried combining the machine translation given by the SYSTRAN
system with the bilingual dictionary approaches. In this case for the translated query using the SYSTRAN system
and for each English search term, we would add the first or the first two translated words obtained from a
bilingual dictionary look-up.
Number of translation alternatives (query Title only)
                               French     Italian    German     Spanish
mean number of translations    3.63       5.48       2.94       5.64
standard deviation             3.15       5.48       2.41       5.69
median                         3          3          2          4
maximum                        17         19         12         24
  (with word)                  "fall"     "court"    "fall"     "attacks"
no translation                 8          9          9          8
only one alternative           27         36         40         28
two alternatives               21         13         25         14
three alternatives             31         15         21         15
Table 3: Number of translations given by the Babylon system for the English keywords
appearing in the Title section of our queries
2.2. Examples of failures
In order to obtain a preliminary picture of the relative merits of each query translation strategy, we
analyzed some queries by comparing the translations produced by our two machine-based tools with the request
formulation written by a human being (examples are given in Table 4). As a first example, the title of query
#70 is "Death of Kim Il Sung" (the name "Il" is written as the letter "i" followed by the letter "l", not the Roman
numeral "II"). This pair of letters, "Il", was analyzed as the chemical symbol of illinium (chemical element #61,
"found" by two researchers at the University of Illinois in 1926; this discovery was not confirmed, however, and
element #61 was finally found in 1947 and named promethium). Moreover, the proper name "Sung" was
analyzed as the past participle of the verb "to sing".
As another example, we analyzed query #54 "Final four results" translated as "demi-finales" in French or
"Halbfinale" in German. This request resulted in the incorrect identification of a multi-word concept (namely
"final four") both by our two automatic translation tools and by the manual translation given in Italian and
Spanish (where a more appropriate translation might be "mezzi finali" in Italian or "semifinales" in Spanish).
In query #48 "Peace-keeping forces in Bosnia" or in the request #57 "Tainted-blood trial", our automatic system
was unable to decipher compound word constructions using the "-" symbol and failed to translate the term "peace-
keeping" or "tainted-blood".
In query #74 "Inauguration of Channel Tunnel", the term "Channel Tunnel" was translated into French as
"Eurotunnel". In the Spanish news texts there were various translations for this proper name, including
"Eurotúnel" (which appears in the manually translated request), as well as the term "Eurotunel" or "Eurotunnel".
2.3. Merging strategies
Using our combined approach to automatically translate a query, we were able to search a document collection for
a request written in English. However, this stage represents only the first step in proposing cross-language
information retrieval systems. We also need to investigate situations where users write a request in English in
order to retrieve pertinent documents in English, French, Italian, German and Spanish. To deal with this multi-
language barrier, we divided our document sources according to language and thus formed five different
collections. After searching in these corpora and obtaining five results lists, we needed to merge them in order to
provide users with a single list of retrieved articles.
Recent works have suggested various solutions for merging the separate result lists obtained from separate
collections or distributed information services. As a first approach, we will assume that each collection contains
approximately the same number of pertinent items and that the distribution of the relevant documents is similar
across the result lists. Based solely on the rank of the retrieved records, we can then interleave the results in a
round-robin fashion. According to previous studies [Voorhees 1995], [Callan 1995], the retrieval effectiveness of
such an interleaving scheme is around 40% below that achieved by a single retrieval scheme working with one
huge collection representing the entire set of documents. However, this decrease may diminish (to around -20%)
with other collections [Savoy 2001b].
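Round-robin interleaving takes one document from each ranked list per round, as in this sketch:

```python
def round_robin_merge(result_lists, limit=1000):
    """Interleave the per-collection ranked lists: rank 1 of every
    list first, then rank 2 of every list, and so on."""
    merged, rank = [], 0
    while len(merged) < limit and any(rank < len(lst) for lst in result_lists):
        for lst in result_lists:
            if rank < len(lst) and len(merged) < limit:
                merged.append(lst[rank])
        rank += 1
    return merged

print(round_robin_merge([["f1", "f2"], ["g1"], ["i1", "i2"]]))
# → ['f1', 'g1', 'i1', 'f2', 'i2']
```

The scheme ignores document scores entirely, which is exactly why it degrades when the collections contribute unequal numbers of relevant documents.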
C070 (both query translations failed in French, Italian, German and Spanish)
Death of Kim Il Sung
Mort de Kim Il Sung
La mort de Kim Il chantée
mort de Kim Il chanter
Morte di Kim Il Sung
Morte di Kim Il cantata
morte di Kim ilinio cantare
Tod von Kim Il Sung
Tod von Kim Il gesungen
Tod von Kim Ilinium singen
Muerte de Kim Il Sung
Muerte de Kim Il cantada
muerte de Kim ilinio cantar
C047 (both query translations failed in French)
Russian Intervention in Chechnya
L'intervention russe en Tchéchénie
Interposition russe dans Chechnya
Russe intervention dans Chechnya
C054 (both query translations failed in French, Italian, German and Spanish)
Final Four Results
Résultats des demi-finales
Résultats De la Finale Quatre
final quatre résultat
Risultati della "Final Four"
Risultati Di Finale Quattro
ultimo quattro risultato
Ergebnisse im Halbfinale
Resultate Der Endrunde Vier
abschliessend Vier Ergebnis
Resultados de la Final Four
Resultados Del Final Cuatro
final cuatro resultado
Table 4: Examples of unsuccessful query translations
To take account of the document score computed for each retrieved item (i.e., the similarity value between the
retrieved record and the request, denoted rsvj), we might formulate the hypothesis that each collection is
searched by the same or a very similar search engine and that the similarity values are therefore directly
comparable [Kwok 1995], [Moffat 1995]. Such a strategy, called raw-score merging, produces a final list sorted
by the document score computed within each collection. However, as demonstrated by Dumais [1994],
collection-dependent statistics in document or query weights may vary widely among collections, and this
phenomenon may therefore invalidate the raw-score merging hypothesis.
To account for this fact, we might normalize the document scores within each collection by dividing them by the
maximum score (i.e., the document score of the retrieved record in the first position). As a variant of this
normalized score merging scheme, Powell et al. [2000] suggest normalizing the document score rsvj according
to the following formula:

    rsv′j = (rsvj - rsvmin) / (rsvmax - rsvmin)

in which rsvj is the original retrieval status value (or document score), and rsvmax and rsvmin are the maximum and
minimum document score values that a collection could achieve for the current request. In this study, rsvmax is
given by the document score of the first retrieved item, and the retrieval status value of the 1000th retrieved
record gives rsvmin.
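The resulting merging strategy can be sketched as follows, taking rsvmax and rsvmin from the first and last retrieved records of each list as described above:

```python
def normalized_merge(result_lists, limit=1000):
    """Merge per-collection result lists after min-max normalizing
    each list's retrieval status values (Powell et al.'s scheme).
    Each list holds (doc_id, rsv) pairs sorted by decreasing score;
    rsv_max/rsv_min come from the first and last retrieved records."""
    merged = []
    for results in result_lists:
        rsv_max, rsv_min = results[0][1], results[-1][1]
        span = rsv_max - rsv_min or 1.0   # guard against a constant list
        merged.extend((doc, (rsv - rsv_min) / span) for doc, rsv in results)
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:limit]

# Two collections whose raw scores live on very different scales:
lists = [[("fr1", 8.0), ("fr2", 2.0)], [("de1", 120.0), ("de2", 30.0)]]
print([doc for doc, _ in normalized_merge(lists)])
# → ['fr1', 'de1', 'fr2', 'de2']
```

After normalization, every list's scores fall in [0, 1], so the merged ranking is no longer dominated by the collection that happens to produce the largest raw scores.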
This merging strategy was used for our four runs that formed a part of the multilingual track. As a baseline for
comparison, we used the manually translated requests in the "UniNEmum" and "UniNEmuLm" runs. In order to
retrieve more relevant items from the various corpora, the "UniNEmuL" and "UniNEmuLm" runs were based on
long requests (using the Title, Descriptive and Narrative sections) while the "UniNEmu" and "UniNEmum" runs
were based on queries built with the Title and Descriptive logical sections.
Run name     English            French              Italian             German               Spanish
UniNEmum     original           original            original            original             original
  expansion  5 docs | 10 terms  5 docs | 10 terms   5 docs | 10 terms   5 docs | 30 terms    5 docs | 10 terms
UniNEmu      original           systran+babylon1    systran+babylon2    systran+babylon2     systran+babylon2
  expansion  5 docs | 10 terms  10 docs | 15 terms  5 docs | 50 terms   10 docs | 40 terms   10 docs | 15 terms
UniNEmuLm    original           original            original            original             original
  expansion  5 docs | 10 terms  no expansion        10 docs | 15 terms  10 docs | 100 terms  5 docs | 10 terms
UniNEmuL     original           systran+babylon1    systran+babylon2    systran+babylon1     systran+babylon1
  expansion  5 docs | 10 terms  10 docs | 10 terms  5 docs | 50 terms   10 docs | 30 terms   10 docs | 15 terms
Table 5: Descriptions of our multilingual runs
As indicated in Table 5, our automatic "UniNEmu" and "UniNEmuL" runs used both the query translation
furnished by the SYSTRAN system and one or two translation alternatives given by the BABYLON bilingual
dictionary. The average precision values (unofficial) achieved by these runs are depicted in Table 6.
Run name average precision % change Prec@5 Prec@10 Prec@20
UniNEmum 40.21 - 65.60 61.20 59.30
UniNEmu 33.28 -17.23% 60.40 59.80 55.10
UniNEmuLm 41.77 - 70.80 66.60 60.10
UniNEmuL 36.85 -11.78% 69.20 63.00 58.60
Table 6: Average precision (unofficial) of our multilingual runs
Conclusion
In this, our first participation in the CLEF retrieval tasks, we suggested general stopword lists for the Italian,
German and Spanish languages and, based on our experience with the French language [Savoy 1999], simple and
efficient stemming procedures for these three languages. Although we are convinced that these stopword lists and
stemming procedures are not perfect, the relevance assessments of the CLEF-2001 corpora should allow us to
improve upon these two retrieval tools.
For the German language and its high frequency of compound word constructions, it could still be worthwhile to
find out whether n-gram indexing approaches might produce higher levels of retrieval performance relative to an
enhanced word segmentation heuristic, without requiring a German dictionary.
Moreover, we could consider additional sources of evidence when translating a request (e.g., based on the
EuroWordNet [Vossen 1998]) or logical approaches that would appropriately weight translation alternatives.
Finally, when searching in multiple collections containing documents written in various languages, it might be
worthwhile to look into better results merging strategies or include intelligent selection procedures in order to
avoid searching in a collection or in a language that does not contain any relevant documents.
Acknowledgments
The author would like to thank C. Buckley from SabIR for giving us the opportunity to use the SMART
system, without which this study could not have been conducted. This research was supported by the SNSF
(Swiss National Science Foundation) under grant 21-58 813.99.
Appendix 1. Queries
C041 Pesticides in Baby Food C042 U.N./US Invasion of Haiti
C043 El Niño and the Weather C044 Indurain Wins Tour
C045 Israel/Jordan Peace Treaty C046 Embargo on Iraq
C047 Russian Intervention in Chechnya C048 Peace-Keeping Forces in Bosnia
C049 Fall in Japanese Car Exports C050 Revolt in Chiapas
C051 World Soccer Championship C052 Chinese Currency Devaluation
C053 Genes and Diseases C054 Final Four Results
C055 Swiss Initiative for the Alps C056 European Campaigns against Racism
C057 Tainted-Blood Trial C058 Euthanasia
C059 Computer Viruses C060 Corruption in French Politics
C061 Siberian Oil Catastrophe C062 Northern Japan Earthquake
C063 Whale Reserve C064 Computer Mouse RSI
C065 Treasure Hunting C066 Russian Withdrawal from Latvia
C067 Ship Collisions C068 Attacks on European Synagogues
C069 Cloning and Ethics C070 Death of Kim Il Sung
C071 Vegetables, Fruit and Cancer C072 G7 Summit in Naples
C073 Norwegian Referendum on EU C074 Inauguration of Channel Tunnel
C075 Euskirchen Court Massacre C076 Solar Energy
C077 Teenage Suicides C078 Venice Film Festival
C079 Ulysses Space Probe C080 Hunger Strikes
C081 French Airbus Hijacking C082 IRA Attacks in Airports
C083 Auction of Lennon Memorabilia C084 Shark Attacks
C085 Turquoise Program in Rwanda C086 Renewable Power
C087 Inflation and Brazilian Elections C088 Mad Cow in Europe
C089 Schneider Bankruptcy C090 Vegetable Exporters
References
[Buckley 1996] Buckley, C., Singhal, A., Mitra, M. & Salton, G. (1996). New retrieval approaches using
SMART. In Proceedings of TREC'4, (pp. 25-48). Gaithersburg: NIST Publication #500-
236.
[Callan 1995] Callan, J. P., Lu, Z. & Croft, W. B. (1995). Searching distributed collections with
inference networks. In Proceedings of the 18th International Conference of the ACM-
SIGIR'95 (pp. 21-28). New York: The ACM Press.
[Dumais 1994] Dumais, S. T. (1994). Latent semantic indexing (LSI) and TREC-2. In Proceedings of
TREC'2, (pp. 105-115). Gaithersburg: NIST Publication #500-215.
[Fox 1990] Fox C. (1990). A stop list for general text. ACM-SIGIR Forum, 24, 19-35.
[Gachot 1998] Gachot, D. A., Lange, E. & Yang, J. (1998). The SYSTRAN NLP browser: An
application of machine translation technology. In Grefenstette G. (Ed.), Cross-language
information retrieval, (pp. 105-118). Boston: Kluwer.
[Grefenstette 1998] Grefenstette, G. (Ed.) (1998). Cross-language information retrieval. Amsterdam: Kluwer.
[Hull 1996] Hull, D. & Grefenstette, G. (1996). Querying across languages: A dictionary-based
approach to multilingual information retrieval. In Proceedings of the 19th International
Conference of the ACM-SIGIR'96, (pp. 49-57). New York: The ACM Press.
[Kwok 1995] Kwok, K. L., Grunfeld L. & Lewis, D. D. (1995). TREC-3 ad-hoc, routing retrieval and
thresholding experiments using PIRCS. In Proceedings of TREC'3, (pp. 247-255).
Gaithersburg: NIST Publication #500-225.
[Lovins 1968] Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and
Computational Linguistics, 11(1), 22-31.
[Mayfield 2001] Mayfield, J., McNamee, P. & Piatko, J. (2001). The JHU/APL HAIRCUT system at
TREC-8. In Proceedings of TREC-8, (pp. 445-452). Gaithersburg: NIST Publication #500-
246.
[McNamee 2000] McNamee, P. & Mayfield, J. (2000). A language-independent approach to European text
retrieval. In Proceedings CLEF-2000, http://www.iei.pi.cnr.it/DELOS/CLEF/apl.doc.
[Moffat 1995] Moffat, A. & Zobel, J. (1995). Information retrieval systems for large document
collections. In Proceedings of TREC'3, (pp. 85-93). Gaithersburg: NIST Publication
#500-225.
[Oard 1996] Oard, D. & Dorr, B. J. (1996). A survey of multilingual text retrieval. Institute for
advanced computer studies and computer science department, University of Maryland,
http://www.clis.umd.edu/dlrg/filter/papers/mlir.ps.
[Porter 1980] Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130-137.
[Powell 2000] Powell, A. L., French, J. C., Callan, J., Connell, M. & Viles, C. L. (2000). The impact
of database selection on distributed searching. In Proceedings of the 23rd International
Conference of the ACM-SIGIR'2000, (pp. 232-239). New York: The ACM Press.
[Robertson 2000] Robertson, S. E., Walker, S. & Beaulieu, M. (2000). Experimentation as a way of life:
Okapi at TREC. Information Processing & Management, 36(1), 95-108.
[Savoy 1999] Savoy, J. (1999). A stemming procedure and stopword list for general French corpora.
Journal of the American Society for Information Science, 50(10), 944-952.
[Savoy 2001a] Savoy, J. (2001). Bilingual information retrieval: CLEF-2000 experiments. In
Proceedings ECSQARU-2001 Workshop. Toulouse, France: to appear.
[Savoy 2001b] Savoy, J. & Rasolofo, Y. (2001). Report on the TREC-9 experiment: Link-based retrieval
and distributed collections. In Proceedings TREC-9. Gaithersburg, MD: to appear.
[Sproat 1992] Sproat, R. (1992). Morphology and computation. Cambridge: The MIT Press.
[Voorhees 1995] Voorhees, E. M., Gupta, N. K. & Johnson-Laird, B. (1995). The collection fusion
problem. In Proceedings of TREC'3, (pp. 95-104). Gaithersburg: NIST Publication #500-
225.
[Vossen 1998] Vossen, P. (1998). EuroWordNet: A multilingual database with lexical semantic networks.
Dordrecht: Kluwer.