Report on CLEF-2002 Experiments:
Combining Multiple Sources of Evidence
Jacques Savoy
Institut interfacultaire d'informatique, Université de Neuchâtel, Switzerland
Jacques.Savoy@unine.ch Web site: www.unine.ch/info/clef/
Abstract. For our second participation in the CLEF retrieval tasks, our first objective was to
propose better and more general stopword lists for various European languages (namely, French,
Italian, German, Spanish and Finnish) along with improved, simpler and efficient stemming
procedures. Our second goal was to propose a combined query-translation approach that could
cross language barriers and also an effective merging strategy based on logistic regression for
accessing the multilingual collection. Finally, within the Amaryllis experiment, we wanted to
analyze how a specialized thesaurus might improve retrieval effectiveness.
Introduction
Based on our experiments of last year [Savoy 2002b], we participated in the French, Italian, Spanish, German, Dutch
and Finnish monolingual tasks, in which our information retrieval approaches could work without having to rely
on a dictionary. In Section 1, we improve our stopword lists and simple stemmers for the French, Italian,
Spanish and German languages. For German, we also propose a new decompounding algorithm. For Dutch, we
use the available stoplist and stemmer, and for the Finnish language we design a new stemmer and stopword list.
In order to obtain a better overview, we evaluate our propositions using ten different retrieval schemes.
In Section 2, for the various bilingual tracks we choose to express the submitted requests in the English
language, which are in turn automatically translated using five different machine translation (MT) systems and
one bilingual dictionary. We study these various translations, and based on the relative merit of each translation
device we investigate various combinations of them.
In Section 3, we carry out a multilingual information retrieval, investigating various merging strategies based
on the results obtained during our bilingual tasks. Finally, in the last section, we present various experiments
done using the Amaryllis corpus, within which a specialized thesaurus is made available in order to improve the
retrieval effectiveness of the information retrieval system.
1. Monolingual indexing and search
Most European languages belonging to the Indo-European family (including French, Italian, Spanish, German and
Dutch) can be viewed as inflectional languages, in which variable suffixes are added at the end of an inflected root.
On the other hand, the Finnish language, a member of the Uralic family, is based (as is Turkish) on a concatenative
morphology in which more or less invariable suffixes are added to roots that are themselves generally invariable.
Any adaptation of those indexing or search strategies available for the English language requires that general
stopword lists and fast stemming procedures be developed for the other target languages. Stopword lists contain
non-significant words that are removed from a document or a request before the indexing process is begun.
Stemming procedures try to remove inflectional and derivational suffixes in order to conflate word variants into
the same stem or root.
This first section will deal with these issues and is organized as follows: Section 1.1 contains an overview of
our eight test-collections while Section 1.2 describes our general approach to building stopword lists and
stemmers for use with languages other than English. In order to decompound German words, we try a simple
decompounding algorithm as described in Section 1.3. Section 1.4 depicts the Okapi probabilistic model
together with various vector-space models and we evaluate them using eight test-collections written in seven
different languages (monolingual track).
1.1. Overview of the test-collections
The corpora used in our experiments included newspapers such as the Los Angeles Times (1994, English), Le
Monde (1994, French), La Stampa (1994, Italian), Der Spiegel (1994/95, German), Frankfurter Rundschau
(1994, German), NRC Handelsblad (1994/95, Dutch), Algemeen Dagblad (1994/95, Dutch) and Aamulehti
(1994/95, Finnish). As a second source of information, we also used various articles edited by
news agencies such as EFE (1994, Spanish), and the Swiss news agency (1994, available in French, German and
Italian but without parallel translation). As shown in Table 1a and 1b, these corpora are of various sizes, with
the English, German, Spanish and Dutch collections being twice the volume of the French, Italian and Finnish
sources. On the other hand, the mean number of distinct indexing terms per document is relatively similar across
the corpora (around 120), and this number is a little bit higher for the English collection (167.33). The
Amaryllis collection contains abstracts of scientific papers written mainly in French and this corpus contains
fewer distinct indexing terms per article (70.418).
English French Italian German Spanish
Size (in MB) 425 MB 243 MB 278 MB 527 MB 509 MB
# of documents 113,005 87,191 108,578 225,371 215,738
# of distinct terms 330,753 320,526 503,550 1,507,806 528,382
Number of distinct indexing terms / document
Mean 167.33 130.213 129.908 119.072 111.803
Standard deviation 126.315 109.151 97.602 109.727 55.397
Median 138 95 92 89 99
Maximum 1,812 1,622 1,394 2,420 642
Minimum 2 3 1 1 5
Max df 69,082 42,983 48,805 82,909 215,151
Number of indexing terms / document
Mean 273.846 181.559 165.238 152.004 156.931
Standard deviation 246.878 164.347 130.728 155.336 82.133
Median 212 129 115 111 137
Maximum 6,087 3,923 3,763 6,407 1,003
Minimum 2 3 2 1 5
Number of queries 42 50 49 50 50
Number rel. items 821 1,383 1,072 1,938 2,854
Mean rel./request 19.548 27.66 21.878 38.76 57.08
Standard deviation 20.832 34.293 19.897 31.744 67.066
Median 11.5 13.5 16 28 27
Maximum 96 (#q:95) 177 (#q:95) 86 (#q:103) 119 (#q:103) 321 (#q:95)
Minimum 1 (#q:97,98,136) 1 (#q:121) 3 (#q:121, 132) 1 (#q:137) 3 (#q:111)
Table 1a: Test-collection statistics
When examining the number of relevant documents per request, Tables 1a and 1b show that the mean number is
always greater than the median (e.g., for the English collection, there is an average of 19.548 relevant documents
per query and the corresponding median is 11.5). These findings indicate that each collection contains numerous
queries with a rather small number of relevant items. For each collection, we encounter 50 queries except for the
Italian corpus (for which Query #120 does not have any relevant items) and the English collection (for which
Queries #93, #96, #101, #110, #117, #118, #127 and #132 do not have any relevant items). The Finnish corpus
contains only 30 available requests while only 25 queries are included in the Amaryllis collection.
From the original documents and during the indexing process, we retained only the following logical sections in
our automatic runs: <HEADLINE>, <TEXT>, <LEAD>, <LEAD1>, <TX>, <LD>, <TI> and <ST>. On
the other hand, we did conduct two experiments (indicated as manual runs), one with the French collection and
one with the German corpus, within which we retained additional tags. For the French collection these included
<KW>, <TB>, <CHA1>, <SUBJECTS>, <NAMES>, <NOM1>, <NOTE>, <GENRE>, <PEOPLE>, <SU11>,
<SU21>, <TI06>, <TI07>, <TI08>, <TI09>, <ORT1>, <SOT1> and <SYE1>, while for the German corpus and
for one experiment we also used additional tags containing manually assigned index terms.
From the topic descriptions we automatically removed certain phrases such as "Relevant document report …",
"Find documents …", "Trouver des documents qui parlent …", "Sono valide le discussioni e le decisioni …",
"Relevante Dokumente berichten …" or "Los documentos relevantes proporcionan información …".
To evaluate our approaches, we used the SMART system as a test bed for implementing the Okapi probabilistic
model [Robertson 2000] as well as other vector-space models. This year our experiments were conducted on an
Intel Pentium III/600 (memory: 1 GB, swap: 2 GB, disk: 6 x 35 GB).
Dutch Finnish Amaryllis
Size (in MB) 540 MB 137 MB 195 MB
# of documents 190,604 55,344 148,688
# of distinct terms 883,953 1,483,354 413,262
Number of distinct indexing terms / document
Mean 110.013 114.01 70.418
Standard deviation 107.037 91.349 31.9
Median 77 87 64
Maximum 2,297 1,946 263
Minimum 1 1 5
Max df 325,188 20,803 61,544
Number of indexing terms / document
Mean 151.22 153.73 104.617
Standard deviation 162.027 128.783 54.089
Median 101 123 91
Maximum 4,510 6,117 496
Minimum 1 1 6
Number of queries 50 30 25
Number rel. items 1,862 502 2,018
Mean rel./request 37.24 16.733 80.72
Standard deviation 49.873 14.92 46.0675
Median 21 8.5 67
Maximum 301 (#q:95) 62 (#q:124) 180 (#q:25)
Minimum 4 (#q:110) 1 (#q:114) 18 (#q:23)
Table 1b: Test-collection statistics
1.2. Stopword lists and stemming procedures
In order to define general stopword lists, we used those lists already available for the English and French
languages [Fox 1990], [Savoy 1999], while for the other languages we established a general stopword list by
following the guidelines described in [Fox 1990]. These lists mainly contain the top 200 most frequent words
included in the various collections together with articles, pronouns, prepositions, conjunctions or very frequently
occurring verb forms (e.g., to be, is, has, etc.). Stopword lists used during our previous participation [Savoy
2002b] were often extended. For English we used the list provided by the SMART system (571 words); our other
lists contain 431 Italian words (no change from last year), 462 French words (previously 217), 603 German words
(previously 294), 351 Spanish terms (previously 272), 1,315 Dutch terms (available at the CLEF Web site) and
1,134 Finnish words (these stopword lists are available at www.unine.ch/info/clef/).
After removing high frequency words, an indexing procedure uses a stemming algorithm that attempts to conflate
word variants into the same stem or root. In developing this procedure for the French, Italian, German and
Spanish languages, it is important to remember that these languages have more complex morphologies than does
the English language [Sproat 1992]. As a first approach, our intention was to remove only inflectional suffixes
such that singular and plural word forms or feminine and masculine forms conflate to the same root. More
sophisticated schemes have already been proposed for the removal of derivational suffixes (e.g., "-ize", "-ably", "-
ship" in the English language), such as the stemmer developed by Lovins [1968] is based on a list of over 260
suffixes, while that of Porter [1980] looks for about 60 suffixes. Figuerola [2002] for example described two
different stemmers for the Spanish language, and the results show that removing only inflectional suffixes (88
different inflectional suffixes were defined) seemed to provide better retrieval levels than did removing both
inflectional and derivational suffixes (this extended stemmer included 230 suffixes).
Our various stemming procedures can be found at www.unine.ch/info/clef/. This year we improved our
stemming algorithms for French, within which some derivational suffixes were also removed. For the Dutch
language, we use the Kraaij & Pohlmann's stemmer (ruulst.let.ruu.nl:2000/uplift/ulift.html) [Kraaij 1996]. For
the Finnish language, our stemmer tries to conflate the various declensions of a word into the same stem. The
Finnish language also makes a distinction between a partial and a whole object (e.g., "syön leipää" for "I'm eating
bread" and "syön leivän" for "I'm eating the whole bread"). This distinction is currently not taken into consideration.
Finally, diacritic characters are usually not present in English collections (with some exceptions, such as "à la
carte" or "résumé"); and such characters are replaced by their corresponding non-accentuated letter in the Italian,
Dutch, Finnish, German and Spanish language.
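As a rough illustration of this light, inflection-only stemming combined with diacritic replacement, a minimal Python sketch could look as follows (the suffix list is purely hypothetical and is not the rule set actually used, which is available at www.unine.ch/info/clef/).

# Minimal illustrative sketch of a light, inflection-only stemmer combined with
# diacritic replacement.  The suffix list is purely hypothetical; the actual
# rule sets are language-specific and available at www.unine.ch/info/clef/.
import unicodedata

INFLECTIONAL_SUFFIXES = ["aux", "ales", "ale", "es", "s", "e"]   # illustrative only, longest first

def remove_diacritics(word):
    """Replace accented letters by their unaccented counterparts."""
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if unicodedata.category(c) != "Mn")

def light_stem(word, min_stem_len=3):
    """Strip at most one inflectional suffix, keeping a stem of at least min_stem_len characters."""
    for suffix in INFLECTIONAL_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for w in ["maisons", "nationale", "présidentes"]:
    print(w, "->", light_stem(remove_diacritics(w.lower())))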
1.3. Decompounding German words
Most European languages manifest other morphological characteristics beyond those considered so far, with
compound word construction being just one example (e.g., handgun, worldwide). In German, compound words are
widely used, and this causes more difficulties than in English. For example, a life insurance company employee
would be "Lebensversicherungsgesellschaftsangestellter" (Leben + S + versicherung + S + gesellschaft + S +
angestellter for life + insurance + company + employee). Moreover, the morphological marker ("S") is not always
present (e.g., "Bankangestelltenlohn" built as Bank + angestellter + lohn (salary)). In Finnish, we also encounter
similar constructions, such as "rakkauskirje" (rakkaus + kirje for love + letter) or "työviikko" (työ + viikko for
work + week).
String sequence → End of previous word | Beginning of next word
schaften → schaft      tion → tion        ern → er        schg → sch | g
weisen → weise         ling → ling        tät → tät       schl → sch | l
lischen → lisch        igkeit → igkeit    net → net       schh → sch | h
lichkeit → lichkeit
lingen → ling          ens → en                           scht → sch | t
igkeiten → igkeit      keit → keit        ers → er        dtt → dt | t
lichkeit → lichkeit    erheit → erheit    ems → em        dtp → dt | p
keiten → keit          enheit → enheit    ts → t          dtm → dt | m
erheiten → erheit      heit → heit        ions → ion      dtb → dt | b
enheiten → enheit      lein → lein        isch → isch     dtw → dt | w
heiten → heit          chen → chen        rm → rm         ldan → ld | an
haften → haft          haft → haft        rw → rw         ldg → ld | g
halben → halb          halb → halb        nbr → n | br    ldm → ld | m
langen → lang          lang → lang        nb → n | b      ldq → ld | q
erlichen → erlich      erlich → erlich    nfl → n | fl    ldp → ld | p
enlichen → enlich      enlich → enlich    nfr → n | fr    ldv → ld | v
lichen → lich          lich → lich        nf → n | f      ldw → ld | w
baren → bar            bar → bar          nh → n | h      tst → t | t
igenden → igend        igend → igend      nk → n | k      rg → r | g
igungen → igung        igung → igung      ntr → n | tr    rk → r | k
igen → ig              ig → ig            fff → ff | f    rm → r | m
enden → end            end → end          ffs → ff        rr → r | r
isten → ist            ist → ist          fk → f | k      rs → r | s
anten → ant            ant → ant          fm → f | m      rt → r | t
ungen → ung            tum → tum          fp → f | p      rw → r | w
schaft → schaft        age → age          fv → f | v      rz → r | z
weise → weise          ung → ung          fw → f | w      fp → f | p
lisch → lisch          enden → end        schb → sch | b  fsf → f | f
ismus → ismus          eren → er          schf → sch | f  gss → g | s
Table 2: Decompounding patterns for German
According to Monz & de Rijke [2002] or [Chen 2002], including both compounds and their composite parts
(only noun-noun decompositions in [Monz 2002]) in queries and documents can result in better performance
while according to Molina-Salgado [2002], the decomposition of German words seems to reduce average
precision.
Our approach seeks to break up those words having an initial length greater than or equal to eight characters.
Moreover, decomposition cannot take place before an initial sequence [V]C, meaning that a word might begin
with a series of vowels that must be followed by at least one consonant. The algorithm then seeks the
occurrence of one of the models described in Table 2. For example, the last model "gss g s" indicates that when
we encounter the character string "gss" the computer is allowed to cut the compound term, ending the first word
with "g" and beginning the second with "s". All the models depicted in Table 2 often include letters sequences
impossible to find in a simple German word such as "dtt," "fff," or "ldm". Once it has detected this pattern, the
computer makes sure that the right part consists of at least four characters, potentially beginning with a series of
vowels (criterion noted as [V]), followed by a CV sequence. If decomposition proves to be possible, the
algorithm begins working on the right part of the decomposed word.
As an example, take the compound word "Betreuungsstelle" (meaning "care center" and made up of "Betreuung"
(care) and "Stelle" (center, place)). This word is more than seven characters long. Once this has been verified, the
computer begins searching for matching patterns, starting from the third character. It will find a match with the
last pattern described in Table 2 ("gss g s") and form the words "Betreuung" and "Stelle". This break is validated
because the second word is at least four characters long and also meets the [V]CV criterion. Finally, given that
the term "Stelle" has fewer than eight letters, the computer will not attempt to continue decomposing this term.
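The whole procedure can be summarized by the following Python sketch; the PATTERNS dictionary reproduces only a handful of the Table 2 entries, and the length, [V]C and [V]CV tests follow the criteria described above.

# Sketch of the decompounding procedure described above.  PATTERNS reproduces only
# a handful of the Table 2 entries: detected string -> (end of previous word,
# beginning of next word).
PATTERNS = {"gss": ("g", "s"), "scht": ("sch", "t"), "nb": ("n", "b"), "ldan": ("ld", "an")}
VOWELS = set("aeiouäöüy")

def meets_vcv(s):
    """Check the [V]CV criterion: optional leading vowels, at least one consonant, then a vowel."""
    i = 0
    while i < len(s) and s[i] in VOWELS:
        i += 1
    j = i
    while j < len(s) and s[j] not in VOWELS:
        j += 1
    return j > i and j < len(s)

def decompound(word):
    """Return the component words of a (lowercased) German compound, or the word itself."""
    if len(word) < 8:                      # only words of eight or more characters are split
        return [word]
    start = 0                              # no split before the initial [V]C sequence
    while start < len(word) and word[start] in VOWELS:
        start += 1
    start += 1
    for i in range(start, len(word)):
        for pattern, (left_end, right_start) in PATTERNS.items():
            if word.startswith(pattern, i):
                left = word[:i] + left_end
                right = right_start + word[i + len(pattern):]
                if len(right) >= 4 and meets_vcv(right):
                    return [left] + decompound(right)   # recurse on the right part
    return [word]

print(decompound("betreuungsstelle"))      # -> ['betreuung', 'stelle']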
1.4. Indexing and searching strategy
In order to obtain a broader view of the relative merit of various retrieval models, we first adopted a binary
indexing scheme within which each document (or request) is represented by a set of keywords, without any
weight. To measure the similarity between documents and requests, we count the number of common terms,
computed according to the inner product (retrieval model denoted "doc=bnn, query=bnn" or "bnn-bnn"). For
document and query indexing, however, such binary logical restrictions are often too limiting. In order to
weight the presence of each indexing term in a document surrogate (or in a query), we may take account of the
term occurrence frequency which allows for better term distinction and increases indexing flexibility (retrieval
model notation: "doc=nnn, query=nnn" or "nnn-nnn").
bnn:   wij = 1
nnn:   wij = tfij
ltn:   wij = (ln(tfij) + 1) · idfj
atn:   wij = idfj · [0.5 + 0.5 · tfij / max tfi.]
nfn:   wij = tfij · ln(n / dfj)
npn:   wij = tfij · ln[(n − dfj) / dfj]
Okapi: wij = ((k1 + 1) · tfij) / (K + tfij)
Lnu:   wij = [(ln(tfij) + 1) / (ln(mean tf) + 1)] / [(1 − slope) · pivot + slope · nti]
lnc:   wij = (ln(tfij) + 1) / sqrt( Σk (ln(tfik) + 1)² )
ntc:   wij = (tfij · idfj) / sqrt( Σk (tfik · idfk)² )
ltc:   wij = ((ln(tfij) + 1) · idfj) / sqrt( Σk ((ln(tfik) + 1) · idfk)² )
dtc:   wij = ((ln(ln(tfij) + 1) + 1) · idfj) / sqrt( Σk ((ln(ln(tfik) + 1) + 1) · idfk)² )
dtu:   wij = ((ln(ln(tfij) + 1) + 1) · idfj) / [(1 − slope) · pivot + slope · nti]
Table 3: Weighting schemes (the sums Σk run over the t indexing terms of document Di; nti denotes the number of distinct indexing terms in Di; K is defined in the text below)
Those terms however that do occur very frequently in the collection are not considered very helpful in
distinguishing between relevant and non-relevant items. Thus we might count their frequency in the collection,
or more precisely the inverse document frequency (denoted by idf), resulting in more weight for sparse words and
less weight for more frequent ones. Moreover, a cosine normalization could prove beneficial and each indexing
weight could vary within the range of 0 to 1 (retrieval model notation: "ntc-ntc", Table 3 depicts the exact
weighting formulation).
Other variants may also be created, especially if we consider that the occurrence of a given term in a document is a
rare event. Thus, it may be good practice to give more importance to the first occurrence of this word as
compared to any successive or repeating occurrences. Therefore, the tf component may be computed as 0.5 + 0.5
· [tf / max tf in a document] (retrieval model denoted "doc=atn").
Finally, we should consider that a term's presence in a shorter document provides stronger evidence than it does
in a longer document. To account for this, we integrate document length within the weighting formula, leading
to more complex IR models, for example the IR models denoted "doc=Lnu" [Buckley 1996] and "doc=dtu"
[Singhal 1999]. Finally, for CLEF-2002 we also conducted various experiments using the Okapi probabilistic
model [Robertson 2000], within which K = k1 · [(1 - b) + b · (li / avdl)], representing the ratio between the
length of Di measured by li (the sum of tfij) and the collection mean denoted by avdl.
In our experiments, the constants b, k1, avdl, pivot and slope are fixed according to the values listed in Table 4. To
evaluate the retrieval performance of these various IR models, we adopted the non-interpolated average precision
(computed on the basis of 1,000 retrieved items per request by the TREC-EVAL program), which accounts for
both precision and recall in a single number.
Language b k1 avdl pivot slope
English 0.8 2 900 100 0.1
French 0.7 2 750 100 0.1
Italian 0.6 1.5 800 100 0.1
Spanish 0.5 1.2 300 100 0.1
German 0.55 1.5 600 125 0.1
Dutch 0.9 3.0 600 125 0.1
Finnish 0.75 1.2 900 125 0.1
Amaryllis 0.7 2 160 30 0.2
Table 4: Parameter setting for the various test-collections
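As an illustration of the Okapi weighting described above, the following sketch (using the French parameter values of Table 4 and invented document statistics) computes the Okapi document-term weight and the npn query-term weight.

# Okapi document-term weight and npn query-term weight (Table 3), with the French
# parameter values of Table 4.  The statistics in the example below are invented.
from math import log

b, k1, avdl = 0.7, 2.0, 750.0              # French setting (Table 4)

def okapi_weight(tf_ij, doc_length):
    """w_ij = ((k1 + 1) * tf) / (K + tf), with K = k1 * ((1 - b) + b * (l_i / avdl))."""
    K = k1 * ((1.0 - b) + b * (doc_length / avdl))
    return ((k1 + 1.0) * tf_ij) / (K + tf_ij)

def npn_weight(tf_qj, n, df_j):
    """w_qj = tf * ln((n - df) / df)."""
    return tf_qj * log((n - df_j) / df_j)

# a term occurring 3 times in a 500-term document and in 1,200 of the 87,191 French documents
print(okapi_weight(3, 500), npn_weight(1, 87191, 1200))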
Given that French, Italian and Spanish morphology is comparable to that of English, we decided to index French,
Italian and Spanish documents based on word stems. For the German, Dutch and Finnish languages and their
more complex compounding morphology, we decided to use a 5-gram approach [McNamee 2002]. However,
contrary to [McNamee 2002], our generation of 5-gram indexing terms does not span word boundaries. This
value of 5 was chosen because it performed better with the CLEF-2000 corpora [Savoy 2001a]. Using this
indexing scheme, the phrase «das Hausdach» (the roof of the house) will generate the following indexing
terms: «das», «hausd», «ausda», «usdac» and «sdach».
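A minimal sketch of this 5-gram indexing scheme, in which no n-gram spans a word boundary and shorter words are kept unchanged, could look as follows.

# 5-gram indexing that does not span word boundaries; shorter words are kept unchanged.
def five_grams(text, n=5):
    terms = []
    for word in text.lower().split():
        if len(word) <= n:
            terms.append(word)
        else:
            terms.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return terms

print(five_grams("das Hausdach"))          # -> ['das', 'hausd', 'ausda', 'usdac', 'sdach']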
Our evaluation results, as reported in Tables 5a and 5b, show that the Okapi probabilistic model performs best for
all five languages. In second position, we usually find the vector-space model "doc=Lnu,
query=ltc" and in the third "doc=dtu, query=dtc". Finally, the traditional tf-idf weighting scheme ("doc=ntc,
query=ntc") does not exhibit very satisfactory results, and the simple term-frequency weighting scheme
("doc=nnn, query=nnn") or the simple coordinate match ("doc=bnn, query=bnn") results in poor retrieval
performance.
Average precision
Query T-D English French Italian Spanish
Model 42 queries 50 queries 49 queries 50 queries
doc=Okapi, query=npn 50.08 48.41 41.05 51.71
doc=Lnu, query=ltc 48.91 46.97 39.93 49.27
doc=dtu, query=dtc 43.03 45.38 39.53 47.29
doc=atn, query=ntc 42.50 42.42 39.08 46.01
doc=ltn, query=ntc 39.69 44.19 37.03 46.90
doc=ntc, query=ntc 27.47 31.41 29.32 33.05
doc=ltc, query=ltc 28.43 32.94 31.78 36.61
doc=lnc, query=ltc 29.89 33.49 32.79 38.78
doc=bnn, query=bnn 19.61 18.59 18.53 25.12
doc=nnn, query=nnn 9.59 14.97 15.63 22.22
Table 5a: Average precision of various indexing and searching strategies (monolingual)
For the German language, we considered 5-gram indexing, decompounded indexing and word-based document
representations to be distinct and independent sources of evidence about German document content. We therefore
decided to combine these three indexing schemes; to do so, we normalized the similarity values obtained by each
document under these three separate retrieval models according to Equation 1 (see Section 3). The resulting
average precision for these four approaches is shown in Table 5b, demonstrating that the combined model usually
results in better retrieval performance.
Average precision
Query T-D German German German German
words decompounded 5-gram combined (Eq. 1)
Model 50 queries 50 queries 50 queries 50 queries
doc=Okapi, query=npn 37.39 37.75 39.83 41.25
doc=Lnu, query=ltc 36.41 36.77 36.91 39.79
doc=dtu, query=dtc 35.55 35.08 36.03 38.21
doc=atn, query=ntc 34.48 33.46 37.90 37.93
doc=ltn, query=ntc 34.68 33.67 34.79 36.37
doc=ntc, query=ntc 29.57 31.16 32.52 32.88
doc=ltc, query=ltc 28.69 29.26 30.05 31.08
doc=lnc, query=ltc 29.33 29.14 29.95 31.24
doc=bnn, query=bnn 17.65 16.88 16.91 21.30
doc=nnn, query=nnn 14.87 12.52 8.94 13.49
Table 5b: Average precision of various indexing and searching strategies (German collection)
It has been observed that pseudo-relevance feedback (blind-query expansion) can be a useful technique for
enhancing retrieval effectiveness. In this study, we adopted Rocchio's approach [Buckley 1996] with α = 0.75,
β = 0.75, whereby the system was allowed to add m terms, extracted from the n best-ranked documents, to the
original query. To evaluate this proposition, we used the Okapi probabilistic model and enlarged the query by
10 to 20 terms provided by the 5 or 10 best-retrieved articles. The results depicted in Tables 6a and 6b indicate
that the optimal parameter setting seems to be collection-dependent. Moreover, the performance improvement also
seems to be collection-dependent (or language-dependent), with no improvement for the English corpus yet an
increase of 8.55% for the Spanish corpus (from an average precision of 51.71 to 56.13), 9.85% for the French
corpus (from 48.41 to 53.18), 12.91% for the Italian language (from 41.05 to 46.35) and 13.26% for the German
collection (from 41.25 to 46.72, combined model, Table 6b).
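A sketch of this blind-query expansion step is given below, under the simplifying assumption that queries and documents are plain weighted term vectors; it is meant only to illustrate the Rocchio combination, not the exact weighting of our runs.

# Blind-query expansion following Rocchio's formula with alpha = beta = 0.75.
# Queries and documents are represented here as simple {term: weight} dictionaries;
# the actual term weighting of our runs (Okapi / npn) is omitted for brevity.
from collections import defaultdict

def rocchio_expand(query, top_docs, alpha=0.75, beta=0.75, m=10):
    """Add the m best terms of the top-ranked documents to the original query."""
    centroid = defaultdict(float)
    for doc in top_docs:                               # mean vector of the best-ranked documents
        for term, weight in doc.items():
            centroid[term] += weight / len(top_docs)
    expanded = defaultdict(float)
    for term, weight in query.items():
        expanded[term] += alpha * weight               # alpha * original query
    best_terms = sorted(centroid, key=centroid.get, reverse=True)[:m]
    for term in best_terms:
        expanded[term] += beta * centroid[term]        # beta * expansion terms
    return dict(expanded)

query = {"european": 1.0, "cup": 1.0}
top_docs = [{"european": 2.0, "football": 1.5, "cup": 1.0}, {"championship": 2.0, "football": 1.0}]
print(rocchio_expand(query, top_docs, m=2))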
Average precision
Query T-D English French Italian Spanish
Model 42 queries 50 queries 49 queries 50 queries
doc=Okapi, query=npn 50.08 48.41 41.05 51.71
5 docs / 10 best terms 49.54 53.10 45.14 55.16
5 docs / 15 best terms 48.68 53.18 46.07 54.95
5 docs / 20 best terms 48.62 53.13 46.35 54.41
10 docs / 10 best terms 47.77 52.03 45.37 55.94
10 docs / 15 best terms 46.92 52.75 46.18 56.00
10 docs / 20 best terms 47.42 52.78 45.87 56.13
Table 6a: Average precision using blind-query expansion
Average precision
Query T-D German German German German
words decompounded 5-gram combined (Eq. 1)
Model 50 queries 50 queries 50 queries 50 queries
doc=Okapi, query=npn 37.39 37.75 39.83 41.25
# docs / # terms 5 / 40 42.90 5 / 40 42.19 10 / 200 45.45 46.72
# docs / # terms 5 / 40 42.90 5 / 40 42.19 5 / 300 45.82 46.27
Table 6b: Average precision using blind-query expansion
This year, we also participated in the Dutch and Finnish monolingual tasks, the results of which are depicted in
Table 7, and the average precision of the Okapi model using blind-query expansion is given in Table 8. For
these two languages, we also applied and combined indexing schemes based on 5-grams and on word-based
document representations. While for the Dutch language our combined model seems to enhance retrieval
effectiveness, for the Finnish language it does not. This was however a first trial for our proposed Finnish stemmer,
and it seemed to improve the average precision over a baseline run without any stemming procedure (Okapi model,
unstemmed 23.04, with stemming 30.45, an improvement of +32.16%).
Average precision
Query T-D Dutch Dutch Dutch Finnish Finnish Finnish
word 5-gram combined word 5-gram combined
Model 50 queries 50 queries 50 queries 30 queries 30 queries 30 queries
doc=Okapi, query=npn 42.37 41.75 44.56 30.45 38.25 37.51
doc=Lnu, query=ltc 42.57 40.73 44.50 27.58 36.07 36.83
doc=dtu, query=dtc 41.26 40.59 43.00 30.70 36.79 36.47
doc=atn, query=ntc 40.29 40.34 41.89 29.22 37.26 36.51
doc=ltn, query=ntc 38.33 38.72 40.24 29.14 35.28 35.31
doc=ntc, query=ntc 33.35 34.94 36.41 25.21 30.68 31.93
doc=ltc, query=ltc 32.81 31.24 34.46 26.53 30.85 33.47
doc=lnc, query=ltc 31.91 29.67 34.18 24.86 30.43 31.39
doc=bnn, query=bnn 18.91 20.87 23.52 12.46 14.55 18.64
doc=nnn, query=nnn 13.75 10.48 12.86 11.43 14.69 15.56
Table 7: Average precision of various indexing and searching strategies (Dutch and Finnish corpora)
Average precision
Query T-D Dutch Dutch Dutch Finnish Finnish Finnish
word 5-gram combined word 5-gram combined
Model 50 queries 50 queries 50 queries 30 queries 30 queries 30 queries
doc=Okapi, query=npn 42.37 41.75 44.56 30.45 38.25 37.51
# docs / # terms 5/60 47.86 5/75 45.09 48.78 5/60 31.89 5/75 40.90 39.33
# docs / # terms 5/100 48.84 10/150 46.29 49.28 5/15 32.36 5/175 41.67 40.11
Table 8: Average precision using blind-query expansion
In the monolingual track, we submitted the runs listed in Table 9 along with their corresponding descriptions.
Most of them were fully automatic, using the request's Title and Descriptive logical sections, while the last three
were based on the request's Title, Descriptive and Narrative sections. Among these last three runs, two were
labeled "manual" because we used logical sections containing manually assigned index terms. For all other runs,
we did not use any manual intervention during the indexing and retrieval procedures.
Run name Language Query Form Model Query expansion average
UniNEfr French T-D automatic Okapi no expansion 48.41
UniNEit Italian T-D automatic Okapi 10 best docs / 15 terms 46.18
UniNEes Spanish T-D automatic Okapi 5 best docs / 20 terms 54.41
UniNEde German T-D automatic combined 5/40 word, 10/200 5-gram 46.72
UniNEnl Dutch T-D automatic combined 5/60 word, 5/75 5-gram 48.78
UniNEfi1 Finnish T-D automatic Okapi 5 best docs / 75 terms 40.90
UniNEfi2 Finnish T-D automatic combined 5/60 word, 5/75 5-gram 39.33
UniNEfrtdn French T-D-N manual Okapi 5 best docs / 10 terms 59.19
UniNEestdn Spanish T-D-N automatic Okapi 5 best docs / 40 terms 60.51
UniNEdetdn German T-D-N manual combined 5/50 word, 10/300 5-gram 49.11
Table 9: Official monolingual run descriptions
2. Bilingual information retrieval
In order to overcome language barriers, we based our approach on free and readily available translation resources
that automatically translate queries into the desired target language. More precisely, the original queries were
written in English and we used no parallel or aligned corpora to derive statistically or semantically related words
in the target language. Section 2.1 describes our combined strategy for cross-lingual retrieval while Section 2.2
provides some examples of translation errors.
This year, we used five machine translation systems, namely SYSTRAN (babel.altavista.com/translate.dyn),
GOOGLE.COM (www.google.com/language_tools), FREETRANSLATION.COM (www.freetranslation.com),
INTERTRAN (www.tranexp.com:2000/InterTran) and REVERSO ONLINE (translation2.paralink.com). As a
bilingual dictionary, we used the BABYLON system (www.babylon.com).
2.1. Query automatic translation
In order to develop a fully automatic approach, we chose to translate the requests using five different machine
translation (MT) systems. We also translated query terms word by word using the BABYLON bilingual
dictionary, which provides not just one but several candidate translations for each word submitted. In our
experiments, we decided to pick the first translation available (labeled "baby1"), the first two terms (labeled
"baby2") or the first three available translations (labeled "baby3").
Average precision
Query T-D \ Language French Italian Spanish German German German
Translation tools word decomp. 5-gram
Original queries 48.41 41.05 51.71 37.39 37.75 39.83
Systran 42.70 32.30 38.49 28.75 28.66 27.74
Google 42.70 32.30 38.35 28.07 26.05 27.19
FreeTranslation 40.58 32.71 40.55 28.85 31.42 27.47
InterTran 33.89 30.28 37.36 21.32 21.61 19.21
Reverso 39.02 N/A 43.28 30.71 30.33 28.71
Babylon 1 43.24 27.65 39.62 26.17 27.66 28.10
Babylon 2 37.58 23.92 34.82 26.78 27.74 25.41
Babylon 3 35.69 21.65 32.89 25.34 26.03 23.66
Comb 1 46.77 33.31 44.57 34.32 34.66 32.75
Comb 2 48.02 34.70 45.63 35.26 34.92 32.95
Comb 2b 48.02 45.53 35.09 34.51 32.76
Comb 3 48.56 34.98 45.34 34.43 34.37 33.34
Comb 3b 48.49 35.02 45.34 34.58 34.43 32.76
Comb 3b2 35.41 35.13 33.25
MT 2 35.82
MT 3 44.54 35.57 44.32 33.53 33.05 31.96
All 47.94 35.29 44.25 34.52 34.31 32.79
MT all 46.83 35.68 44.25 33.80 33.51 31.66
Combination specifications (per language):
Comb 1: French: Reverso + baby1; Italian: Free + baby1; Spanish: Reverso + baby1; German: Reverso + baby1
Comb 2: French: Reverso + Systran + baby1; Italian: Free + Google + baby1; Spanish: Reverso + Systran + baby1; German: Reverso + Systran + baby1
Comb 2b: French: Reverso + Google + baby1; Spanish: Reverso + Google + baby1; German: Reverso + Google + baby1
Comb 3: French: Reverso + Free + Google + baby1; Italian: Free + Google + Inter + baby1; Spanish: Free + Google + Reverso + baby1; German: Reverso + Systran + Inter + baby1
Comb 3b: French: Reverso + Inter + Google + baby1; Italian: Free + Google + Systran + baby1; Spanish: Free + Google + Reverso + baby2; German: Reverso + Google + Inter + baby1
Comb 3b2: German: Reverso + Google + Inter + baby2
MT 2: Italian: Free + Google
MT 3: French: Reverso + Systran + Google; Italian: Free + Google + Inter; Spanish: Free + Google + Reverso; German: Reverso + Inter + Systran
Table 10: Average precision of various query translation strategies (Okapi model)
The first part of Table 10 lists the average precision for each translation device used, along with the performance
achieved by the manually translated requests. For German, we also report the retrieval effectiveness achieved by
the three different indexing approaches, namely using words as indexing terms, decompounding the German words
according to our approach, and the 5-gram model. While the REVERSO system seems to be the better choice for
German and Spanish, FREETRANSLATION is the best choice for Italian and BABYLON 1 the best for French.
In order to improve search performance, we tried combining different machine translation systems with the
bilingual dictionary approach. In this case, we formed the translated query by concatenating the different
translations provided by the various approaches. Thus, under the column header "Comb 1", we combined one
machine translation system with the bilingual dictionary ("baby1"). Similarly, under the columns "Comb 2" or
"Comb 2b" we list the results obtained with two machine translation approaches, and under the column headings
"Comb 3", "Comb 3b" or "Comb 3b2" those obtained with three machine translation systems. With the exception
of the run under "Comb 3b2",
we also included terms provided by the "baby1" dictionary look-up in the translated requests. In columns
"MT 2" and "MT 3," we evaluated the combination of two and three machine translation systems respectively.
Finally, we could also combine all translation sources (under heading "All") or all machine translation
approaches under the heading "MT all."
Since the performance of each translation device depends on the target language, in the lower part of Table 10 we
included the exact specification for each of the combined runs. For the German language, for each of the three
indexing models, we used the same combination of translation resources. From an examination of the retrieval
effectiveness of our various combined approaches listed in the middle part of Table 10, a clear recommendation
cannot be made. Overall, it seems better to combine two or three machine translation systems with the bilingual
dictionary approach ("baby1"). However, combining the five machine translation systems (heading "MT all") or
all translation tools (heading "All") does not result in a very effective performance.
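The combined translation itself is straightforward, as the following sketch illustrates; the translation functions shown are placeholders, not the actual interfaces of the on-line systems.

# The combined translation is simply the concatenation of the outputs of the selected
# resources.  The functions below are placeholders standing in for the on-line MT
# systems and the BABYLON dictionary look-up ("baby1" corresponds to k = 1).
def combine_translations(query, mt_systems, dictionary_lookup=None, k=1):
    parts = [mt(query) for mt in mt_systems]
    if dictionary_lookup is not None:
        for word in query.split():
            parts.append(" ".join(dictionary_lookup(word)[:k]))
    return " ".join(parts)

fake_reverso = lambda q: "coupe europeenne"
fake_systran = lambda q: "tasse europeenne"
fake_babylon = lambda w: {"european": ["europeen"], "cup": ["tasse", "coupe"]}.get(w.lower(), [w])
print(combine_translations("European Cup", [fake_reverso, fake_systran], fake_babylon, k=1))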
Average precision
Query T-D French French French Italian Italian
UniNEfrBi UniNEfrBi2 UniNEfrBi3 UniNEitBi UniNEitBi2
Combined Comb 3b MTall+baby2 MT all Comb 2 Comb 3
Expand # docs / # terms 5 / 20 5 / 40 10 / 15 10 / 60 10 / 100
Corrected 51.64 50.79 48.49 38.50 38.62
Official 49.35 48.47 46.20 37.36 37.56
Query T-D Spanish Spanish Spanish German German
UniNEesBi UniNEesBi2 UniNEesBi3 UniNEdeBi UniNEdeBi2
Combined MT 3 Comb 3b Comb 2 Comb 3b2 & Comb 3
Expand # docs / # terms 10 / 75 10 / 100 10 / 75 5 / 100 & 5 / 300
Corrected 50.67 50.95 50.93 42.89 42.11
Official 47.63 47.86 47.84 41.29 40.42
Table 11: Average precision and description of our official bilingual runs (Okapi model)
Table 11 lists the exact specifications of our various bilingual runs. However, when submitting our official
results, we used the wrong numbers for Query #130 and Query #131 (we switched these two query numbers). Thus,
both requests have an average precision of 0.00 in our official results, and we report the corrected performance in
Tables 11 and 13 (multilingual runs).
2.2. Examples of failures
In order to obtain a preliminary picture of the difficulties underlying the automatic translation approach, we
analyzed some queries by comparing the translations produced by our six translation tools with the request
formulations written by a human being (examples are given in Table 12). As a first example, the title of Query
#113 is "European Cup". In this case, the term "cup" was interpreted as a teacup by all automatic translation
tools, resulting in the French translations "tasse" or "verre" (or "tazza" in Italian, "Schale" in German ("Pokal"
can be viewed as a correct translation alternative) and "taza" or "Jícara" (small teacup) in Spanish).
In Query #118 ("Finland's first EU Commissioner"), the machine translation systems failed to give the
appropriate Spanish term "comisario" for "Commissioner", returning instead "comisión" (commission) or
"Comisionado" (related to commission). For this same request, the manually translated Italian query seems to
contain a spelling error ("commisario" instead of "commissario"). For the same request, the German translation
"Beauftragter" (delegate) does not correspond to the appropriate term "Kommissar" (note also the missing hyphen
in the translation "EUBEAUFTRAGTER").
Other examples: Query #94 ("Return of Solzhenitsyn") is manually translated into German as "Rückkehr
Solschenizyns", yet our automatic translation systems fail to translate the proper noun (returning "Solzhenitsyn"
instead of "Solschenizyn"). Query #109 ("Computer Security") is manually translated into Spanish as "Seguridad
Informática", while our various translation devices return different terms for "Computer" (e.g., "Computadora",
"Computador" or "ordenador") but not the word "Informática".
C113 (query translations failed in French, Italian, German and Spanish)
European Cup
Coupe d'Europe de football
Tasse européenne
Européen verre
Européen résident de verre tasse
Européen résident de l'Europe verre tasse coupe
Campionati europei
Tazza Europea
Tazza Europea
Fussballeuropameisterschaft
Europäische Schale
Europäischer Pokal
Eurocopa
Europea Jícara
Taza europea
C118 (query translations failed in Italian, German and Spanish)
Finland's first EU Commissioner.
Primo commisario europeo per la Finlandia
Primo commissario dell'Eu della Finlandia.
Finlandia primo Commissario di EU.
Erster EU-Kommissar aus Finnland
Finnlands erster EUBEAUFTRAGTER.
Finlands erster EG-Beauftragter
Primer comisario finlandés de la UE
Primera comisión del EU de Finlandia.
El primer Comisionado de Unión Europea de Finlandia.
Table 12: Examples of unsuccessful query translations
3. Multilingual information retrieval
Using our combined approach to automatically translate a query, we were able to search a document collection for
a request written in English. This stage however represents only the first step in a proposal for multi-language
information retrieval systems. We also need to investigate situations where users write a request in English in
order to retrieve pertinent documents in English, French, Italian, German and Spanish. To deal with this multi-
language barrier, we divided our document sources according to language and thus formed five different
collections. After searching in these corpora and obtaining five results lists, we needed to merge them in order to
provide users with a single list of retrieved articles.
Recent works have suggested various solutions for merging the separate result lists obtained from different
collections or distributed information services. As a first approach, we will assume that each collection contains
approximately the same number of pertinent items and that the distribution of the relevant documents is similar
across the result lists. Based solely on the rank of the retrieved records, we can interleave the results in a round-
robin fashion. According to previous studies [Voorhees 1995], the retrieval effectiveness of such an interleaving
scheme is around 40% below that achieved from a single retrieval scheme working with a single huge collection,
representing the entire set of documents.
To take account of the document score computed for each retrieved item (or the similarity value between the
retrieved record and the request, denoted score rsvj ), we might formulate the hypothesis that each collection is
searched by the same or a very similar search engine and that the similarity values are therefore directly
comparable [Kwok 1995]. Such a strategy, called raw-score merging, produces a final list sorted by the
document score computed by each collection. However, collection-dependent statistics in document or query
weights may vary widely among collections, and therefore this phenomenon may invalidate the raw-score
merging hypothesis.
To account for this fact, we might normalize the document scores within each collection by dividing them by the
maximum score (i.e. the document score of the retrieved record in the first position). As a variant of this
normalized score merging scheme, Powell et al. [2000] suggest normalizing the document score rsvj according to
the following formula:
rsv′j = (rsvj − rsvmin) / (rsvmax − rsvmin)      (1)
in which rsv j is the original retrieval status value (or document score), and rsvmax and rsvmin are the maximum and
minimum document score values that a collection could achieve for the current request. In this study, the rsvmax
is given by the document score achieved by the first retrieved item and the retrieval status value obtained by the
1000th retrieved record gives the value of rsvmin .
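A sketch of this normalized score merging applied to several result lists is given below; each result list is assumed to be a ranked list of (document identifier, score) pairs.

# Normalized score merging (Equation 1): each result list is a ranked list of
# (document id, retrieval status value) pairs, one list per collection.
def merge_normalized(result_lists):
    merged = []
    for results in result_lists:
        rsv_max, rsv_min = results[0][1], results[-1][1]   # first and last (e.g. 1000th) scores
        for doc_id, rsv in results:
            norm = (rsv - rsv_min) / (rsv_max - rsv_min) if rsv_max > rsv_min else 0.0
            merged.append((doc_id, norm))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

french = [("FR-12", 12.5), ("FR-7", 9.0), ("FR-3", 4.0)]
german = [("DE-2", 31.0), ("DE-9", 15.0), ("DE-4", 6.0)]
print(merge_normalized([french, german])[:4])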
As a fourth strategy, we might use the logistic regression [Flury 1997, Chapter 7] to predict the probability of a
binary outcome variable, according to a set of explanatory variables. Based on this statistical approach, Le Calvé
and Savoy [2000] and Savoy [2002a] described how to predict the probability of relevance of those documents
retrieved by different retrieval schemes or collections. The resulting estimated probabilities would be predicted
according to both the original document score rsvi and the logarithm of the ranki attributed to the corresponding
document Di . Based on these estimated relevance probabilities, we sorted the records retrieved from separate
collections in order to obtain a single ranked list. However, in order to estimate the underlying parameters, this
approach requires a training set, in this case the CLEF-2001 topics and their relevance assessments.
Prob[Di is rel | ranki, rsvi] = exp(α + β1·ln(ranki) + β2·rsvi) / (1 + exp(α + β1·ln(ranki) + β2·rsvi))      (2)
within which ranki denotes the rank of the retrieved document Di, ln() is the natural logarithm, and rsvi is the
retrieval status value (or document score) of the document Di. In this equation, the coefficients α, β1 and β2 are
unknown parameters that are estimated according to the maximum likelihood method (the required
computations were done with the S language).
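Once the coefficients have been estimated on the training data, applying Equation 2 at merging time reduces to a simple computation per retrieved document, as sketched below with purely illustrative coefficient values.

# Applying Equation 2 at merging time, once alpha, beta1 and beta2 have been estimated
# on the training data.  The coefficient values below are purely illustrative.
from math import exp, log

def prob_relevant(rank, rsv, alpha=-2.0, beta1=-0.5, beta2=0.1):
    z = alpha + beta1 * log(rank) + beta2 * rsv
    return exp(z) / (1.0 + exp(z))

def merge_logistic(result_lists):
    merged = []
    for results in result_lists:                        # one ranked list per collection
        for rank, (doc_id, rsv) in enumerate(results, start=1):
            merged.append((doc_id, prob_relevant(rank, rsv)))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

print(merge_logistic([[("FR-12", 12.5), ("FR-7", 9.0)], [("DE-2", 31.0), ("DE-9", 15.0)]]))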
Average precision
Query T-D English French Italian Spanish German
42 queries 50 queries 49 queries 50 queries 50 queries
UniNEfrBi UniNEitBi UniNEesBi UniNEdeBi
50.08 51.64 38.50 50.67 42.89
Multilingual Round-robin Raw-score Eq. 1 Log ln(ranki ) Log reg Eq.2
50 queries 34.27 33.83 36.62 36.10 39.49
English French Italian Spanish German
42 queries 50 queries 49 queries 50 queries 50 queries
UniNEfrBi2 UniNEitBi2 UniNEesBi2 UniNEdeBi2
50.08 50.79 38.62 50.95 42.11
Multilingual Round-robin Raw-score Eq. 1 Log ln(ranki ) Log reg Eq.2
50 queries 33.97 33.99 36.90 35.59 39.25
Table 13: Average precision using various merging strategies based on automatically translated queries
When searching the multilingual corpora using Okapi, the round-robin scheme and the raw-score merging strategy
provide very similar retrieval performances (see Table 13). The normalized score merging based on Equation 1
shows an enhancement over the round-robin approach (36.62 vs. 34.27, an improvement of +6.86% in our first
experiment, and 36.90 vs. 33.97, +8.63% in our second run). Using our logistic model with only the rank as
explanatory variable (or more precisely ln(ranki), with performance depicted under the label "Log ln(ranki)"), the
resulting average precision is lower than that of the normalized score merging. Merging the result lists based on
the logistic regression approach (using both the rank and the document score as explanatory variables) produces
the best average precision.
Query T-D UniNEm1 UniNEm2 UniNEm3 UniNEm4 UniNEm5
Equation 1 Log reg Eq.2 Equation 1 Log reg Eq.2 Equation 1
Corrected 36.62 39.49 36.90 39.25 35.97
Official 34.88 37.83 35.12 37.56 35.52
Table 14: Average precision obtained with our official multilingual runs
Our official and corrected results are shown in Table 14, while some statistics about the number of documents
provided by each collection are given in Table 15. From these data, we can see that the normalized score merging
(UniNEm1) extracts more documents from the English corpus (on average 24.94 items) than the logistic regression
model (UniNEm2, where on average 11.44 documents come from the English collection). Moreover, the logistic
regression scheme takes more documents from the Spanish and German collections. Finally, we can see that the
percentage of relevant items is relatively similar when comparing the CLEF-2001 and CLEF-2002 test-collections.
Statistics \ Language English French Italian Spanish German
UniNEm1, based on the top 100 retrieved documents for each query
Mean 24.94 16.68 19.12 23.8 15.46
Median 23.5 15 18 22 15
Maximum 60 (q#:101) 54 (q#:110) 45 (q#:136) 70 (q#:121) 54 (q#:116)
Minimum 4 (q#:108) 5 (q#:97,123) 5 (q#:93,114) 6 (q#:98,110) 2 (q#:139)
Standard deviation 13.14 9.26 9.17 14.15 9.79
UniNEm2, based on the top 100 retrieved documents for each query
Mean 11.44 15.58 16.18 34.3 22.5
Median 9 14 16 34.5 19
Maximum 33 (q#:92) 38 (q#:110) 28 (q#:108) 62 (q#:91) 59 (q#:116)
Minimum 1 (q#:135) 6 (q#:102,123) 8 (q#:114) 10 (q#:116) 4 (q#:91)
Standard deviation 6.71 7.49 5.18 10.90 11.90
% relevant items CLEF02 10.18% 17.14% 13.29% 35.37% 24.02%
% relevant items CLEF01 10.52% 14.89% 15.31% 33.10% 26.17%
Table 15: Statistics about the merging schemes based on the top 100 retrieved documents for each query
4. Amaryllis experiments
For the Amaryllis experiments, we wanted to determine whether a specialized thesaurus might improve retrieval
effectiveness over a baseline that ignores term relationships. From the original documents and during the
indexing process, we retained only a few selected logical sections in our runs.
<RECORD>
  <TERMFR> Analyse de poste
  <TRADENG> Station Analysis
…
<RECORD>
  <TERMFR> Bureau poste
  <TRADENG> Post offices
<RECORD>
  <TERMFR> Bureau poste
  <TRADENG> Post office
…
<RECORD>
  <TERMFR> Isolation poste électrique
  <TRADENG> Substation insulation
…
<RECORD>
  <TERMFR> Caserne pompier
  <TRADENG> Fire houses
  <SYNOFRE1> Poste incendie
…
<RECORD>
  <TERMFR> Habitacle aéronef
  <TRADENG> Cockpits (aircraft)
  <SYNOFRE1> Poste pilotage
…
<RECORD>
  <TERMFR> La Poste
  <TRADENG> Postal services
…
<RECORD>
  <TERMFR> Poste conduite
  <TRADENG> Operation platform
  <SYNOFRE1> Cabine conduite
…
<RECORD>
  <TERMFR> POSTE DE TRAVAIL
  <TRADENG> WORK STATION
<RECORD>
  <TERMFR> Poste de travail
  <TRADENG> Work Station
<RECORD>
  <TERMFR> Poste de travail
  <TRADENG> Work station
<RECORD>
  <TERMFR> Poste de travail
  <TRADENG> workstations
  <SYNOFRE1> Poste travail
…
Table 16: Sample of various entries under the word "poste" in the Amaryllis thesaurus
From the given thesaurus, we extracted 126,902 terms having a relationship with one or more other terms (the
thesaurus contains 173,946 entries delimited by the tags <RECORD> … </RECORD>, of which only 149,207
have at least one relationship with another term). From these 149,207 entries, we found 22,305 duplicate entries
(which were removed; see, for example, the terms "Poste de travail" or "Bureau poste" in Table 16). In building
our thesaurus, we removed the accents, wrote all terms in lowercase, and ignored numbers and terms given between
parentheses. For example, the word "poste" appears in 49 records (usually as part of a compound entry in the
<TERMFR> field).
From our 126,902 entries, we counted 107,038 TRADENG relationships, 14,590 SYNOFRE1 relationships,
26,772 AUTOP1 relationships and 1,071 VAUSSI1 relationships (see the examples given in Table 16). In a first
set of experiments, we did not use this thesaurus and relied on the Title and Descriptive logical sections of the
requests (second column of Table 17a) or on the Title, Descriptive and Narrative parts of the queries (last column
of Table 17a). In a second set of experiments, we included in the queries all related words that could be found in
the thesaurus using the search keywords (average precision depicted under the label "Qthes"). In a third
experiment, we enlarged only the document representatives using our thesaurus (performance shown under the
column heading "Dthes"). In a last experiment, we took into account related words found in the thesaurus only
for document surrogates, and only for thesaurus entries composed of at least three words (e.g., "moteur à
combustion" is a valid candidate but a single term like "moteur" is not); on the other hand, we also included in
the query all relationships that could be found using the search keywords (performance shown under the column
heading "Dthes3Qthes").
Average precision
Amaryllis Amaryllis Amaryllis Amaryllis Amaryllis
Query T-D T-D T-D T-D T-D-N
Qthes Dthes Dthes3Qthes
Model 25 queries 25 queries 25 queries 25 queries 25 queries
doc=Okapi, query=npn 45.75 45.45 44.28 44.85 53.65
doc=Lnu, query=ltc 43.07 44.28 41.75 43.45 49.87
doc=dtu, query=dtc 39.09 41.12 40.25 42.81 47.97
doc=atn, query=ntc 42.19 43.83 40.78 43.46 51.44
doc=ltn, query=ntc 39.60 41.14 39.01 40.13 47.50
doc=ntc, query=ntc 28.62 26.87 25.57 26.26 33.89
doc=ltc, query=ltc 33.59 34.09 33.42 33.78 42.47
doc=lnc, query=ltc 37.30 36.77 35.82 36.10 46.09
doc=bnn, query=bnn 20.17 23.97 19.78 23.51 24.72
doc=nnn, query=nnn 13.59 13.05 10.18 12.07 15.94
Table 17a: Average precision of various indexing and searching strategies (Amaryllis)
Average precision
Amaryllis Amaryllis Amaryllis Amaryllis Amaryllis
Query T-D T-D T-D T-D T-D-N
Qthes Dthes Dthes3Qthes
Model 25 queries 25 queries 25 queries 25 queries 25 queries
doc=Okapi, query=npn 45.75 45.45 44.28 44.85 53.65
5 docs / 10 terms 47.75 47.29 46.41 46.73 55.80
5 docs / 50 terms 49.33 48.27 47.84 47.61 56.72
5 docs / 100 terms 49.28 48.53 47.78 47.83 56.71
10 docs / 10 terms 47.71 47.43 46.28 47.21 55.58
10 docs / 50 terms 49.04 48.46 48.49 48.12 56.34
10 docs / 100 terms 48.96 48.60 48.56 48.29 56.34
25 docs / 10 terms 47.07 46.63 45.79 46.77 55.31
25 docs / 50 terms 48.02 47.64 47.23 47.85 55.82
25 docs / 100 terms 48.03 47.78 47.38 47.83 55.80
Table 17b: Average precision using blind-query expansion (Amaryllis)
From the average precision results depicted in Tables 17a and 17b, we cannot infer that the available thesaurus
is really helpful in improving retrieval effectiveness, at least as implemented in this study.
Run name Query Form Model Thesaurus Query expansion Av. precision
UniNEama1 T-D automatic Okapi no 25 docs / 50 terms 48.02
UniNEama2 T-D automatic Okapi with query terms 25 docs / 25 terms 47.34
UniNEama3 T-D automatic Okapi with documents 25 docs / 50 terms 47.23
UniNEama4 T-D automatic Okapi both query & doc 10 docs / 15 terms 47.78
UniNEamaN1 T-D-N automatic Okapi no 25 docs / 50 terms 55.82
Table 18: Official Amaryllis run descriptions
Conclusion
For our second participation in CLEF retrieval tasks, we suggested a general stopword list and stemming
procedure for the French, Italian, German, Spanish and Finnish languages. We also suggested a simple
decompounding approach for the German language. For the Dutch, Finnish and German languages, we considered
5-gram indexing and word-based (and decompounding-based) document representations to be distinct and
independent sources of evidence about document content, and found it good practice to combine these two (or
three) indexing schemes.
To improve bilingual information retrieval, we suggest using not only one but two or three different translation
sources to translate the query into the target languages. Such a combination seems to improve the retrieval
effectiveness. In the multilingual environment, we demonstrated that a learning scheme such as logistic
regression could perform effectively. As a second best solution, we suggested using a simple normalization
procedure based on the document score.
Finally, in the Amaryllis experiments, we studied various possible ways we could use a specialized thesaurus to
improve average precision. However, the various strategies used in this paper do not demonstrate clear
enhancement over a baseline that ignores the term relationships stored in the thesaurus.
Acknowledgments
The author would like to thank C. Buckley from SabIR for giving us the opportunity to use the SMART
system, without which this study could not have been conducted. This research was supported in part by the
SNSF (Swiss National Science Foundation) under grants 21-58 813.99 and 21-66 742.01.
References
[Buckley 1996] Buckley, C., Singhal, A., Mitra, M. & Salton, G. (1996). New retrieval approaches using
SMART. In Proceedings of TREC'4, (pp. 25-48). Gaithersburg: NIST Publication #500-
236.
[Chen 2002] Chen, A. (2002). Multilingual information retrieval using English and Chinese queries. In
C. Peters, M. Braschler, J. Gonzalo & M. Kluck (Eds.), Evaluation of cross-language
information retrieval systems. Lecture Notes in Computer Science #2409. Berlin:
Springer-Verlag.
[Figuerola 2002] Figuerola, C.G., Gómez, R. & Zazo Rodríguez, A.F. (2002). Stemming in Spanish: A
first approach to its impact on information retrieval. In C. Peters, M. Braschler, J.
Gonzalo & M. Kluck (Eds.), Evaluation of cross-language information retrieval systems.
Lecture Notes in Computer Science #2409. Berlin: Springer-Verlag.
[Flury 1997] Flury, B. (1997). A first course in multivariate statistics. New York: Springer.
[Fox 1990] Fox, C. (1990). A stop list for general text. ACM-SIGIR Forum, 24, 19-35.
[Kraaij 1996] Kraaij, W. & Pohlmann, R. (1996). Viewing stemming as recall enhancement. In
Proceedings of the 19th International Conference of the ACM-SIGIR'96, (pp. 40-48). New
York: The ACM Press.
[Kwok 1995] Kwok, K.L., Grunfeld, L. & Lewis, D.D. (1995). TREC-3 ad-hoc, routing retrieval and
thresholding experiments using PIRCS. In Proceedings of TREC'3, (pp. 247-255).
Gaithersburg: NIST Publication #500-225.
[Le Calvé 2000] Le Calvé, A., Savoy, J. (2000). Database merging strategy based on logistic regression.
Information Processing & Management, 36(3), 341-359.
[Lovins 1968] Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and
Computational Linguistics, 11(1), 22-31.
[McNamee 2002] McNamee, P. & Mayfield, J. (2002). JHU/APL Experiments at CLEF: Translation
Resources and Score Normalization. In C. Peters, M. Braschler, J. Gonzalo & M. Kluck
(Eds.), Evaluation of Cross-Language Information Retrieval Systems. Lecture Notes in
Computer Science #2409. Berlin: Springer-Verlag.
[Molina-Salgado 2002] Molina-Salgado, H., Moulinier, I., Knutson, M., Lund, E. & Sekhon, K. (2002).
Thomson legal and regulatory at CLEF 2001: Monolingual and bilingual experiments. In
C. Peters, M. Braschler, J. Gonzalo & M. Kluck (Eds.), Evaluation of cross-language
information retrieval systems. Lecture Notes in Computer Science #2409. Berlin:
Springer-Verlag.
[Monz 2002] Monz, C. & de Rijke, M. (2002). The University of Amsterdam at CLEF 2001. In C.
Peters, M. Braschler, J. Gonzalo & M. Kluck (Eds.), Evaluation of cross-language
information retrieval systems. Lecture Notes in Computer Science #2409. Berlin:
Springer-Verlag.
[Porter 1980] Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14, 130-137.
[Powell 2000] Powell, A.L., French, J. C., Callan, J., Connell, M. & Viles, C.L. (2000). The impact of
database selection on distributed searching. In Proceedings of the 23rd International
Conference of the ACM-SIGIR'2000, (pp. 232-239). New York: The ACM Press.
[Robertson 2000] Robertson, S.E., Walker, S. & Beaulieu, M. (2000). Experimentation as a way of life:
Okapi at TREC. Information Processing & Management, 36(1), 95-108.
[Savoy 1999] Savoy, J. (1999). A stemming procedure and stopword list for general French corpora.
Journal of the American Society for Information Science, 50(10), 944-952.
[Savoy 2002a] Savoy, J. (2002). Cross-language information retrieval: Experiments based on CLEF-2000
corpora. Information Processing & Management, to appear.
[Savoy 2002b] Savoy, J. (2002). Report on CLEF-2001 Experiments: Effective Combined Query-
Translation Approach. In C. Peters, M. Braschler, J. Gonzalo & M. Kluck (Eds.),
Evaluation of cross-language information retrieval systems. Lecture Notes in Computer
Science #2409. Berlin: Springer-Verlag.
[Savoy 2002c] Savoy, J. (2002). Recherche d'informations dans des corpus en langue française :
Utilisation du référentiel Amaryllis. TSI, Technique et Science Informatiques, 21(3), 345-
373.
[Singhal 1999] Singhal, A., Choi, J., Hindle, D., Lewis, D.D. & Pereira, F. (1999). AT&T at TREC-7. In
Proceedings TREC-7, (pp. 239-251). Gaithersburg: NIST Publication #500-242.
[Sproat 1992] Sproat, R. (1992). Morphology and computation. Cambridge: The MIT Press.
[Voorhees 1995] Voorhees, E.M., Gupta, N.K. & Johnson-Laird, B. (1995). The collection fusion problem.
In Proceedings of TREC'3, (pp. 95-104). Gaithersburg: NIST Publication #500-225.