           Cross-language Retrieval Experiments at CLEF-2002
                                               Aitao Chen
                           School of Information Management and Systems
                          University of California at Berkeley, CA 94720, USA
                                        aitao@sims.berkeley.edu


                                                          Abstract
           This paper describes monolingual, cross-language, and multilingual retrieval experiments using CLEF-2002
      test collection. The paper presents a technique for incorporating blind relevance feedback into a document ranking
      formula based on logistic regression analysis, and a procedure for decomposing German or Dutch compounds into
      their component words.


1    Introduction
Multilingual text retrieval is the task of searching for relevant documents in a collection of documents in more
than one language in response to a query, and presenting a unified ranked list of documents regardless of language.
Multilingual retrieval is an extension of bilingual retrieval, in which the collection consists of documents in a single
language that is different from the query language. Recent developments in multilingual retrieval were reported
in CLEF-2000 [12] and CLEF-2001 [13]. Most of the multilingual retrieval methods fall into one of three groups.
The first approach translates the source topics separately into all the document languages in the document collec-
tion. Then monolingual retrieval is carried out separately for each document language, resulting in one ranked
list of documents for each document language. Finally the intermediate ranked lists of retrieved documents, one
for each language, are merged to yield a combined ranked list of documents regardless of language. The second
approach translates a multilingual document collection into the topic language. Then the topics are used to search
against the translated document collection. The third approach also translates the topics into all document languages, as in the
first approach. The source topics and the translated topics are concatenated to form a set of multilingual topics.
The multilingual topics are then searched directly against the multilingual document collection, which directly
produces a ranked list of documents in all languages. The latter two approaches do not involve merging two or
more ranked lists of documents, one for each document language, to form a combined ranked list of documents
in all document languages. The merging task is hard and challenging; to the best of our knowledge, no effective
technique has been developed yet. It appears that most groups participating in the multilingual retrieval tasks at the
TREC or CLEF evaluation conferences applied the first approach. Translating large collections of documents in
multiple languages into the topic language requires machine translation systems that support the
necessary language pairs, which is sometimes problematic. For example, suppose the document collection consists of
documents in English, French, German, Italian, and Spanish, and the topics are in English. To perform the mul-
tilingual retrieval task using English topics, one would have to translate the French, German, Italian, and Spanish
documents into English. In this case, there exist translators, such as Babelfish, that can do the job. However, if
the topics are in Chinese or Japanese, it may be difficult or even impossible to find translators to do the
work. The availability of translation resources and the need for extensive computation are factors that limit the
applicability of the second approach. The third approach is appealing in that it does not require translating the
documents, and it circumvents the difficult merging problem. However, there is some empirical evidence showing
that the third approach is less effective than the first one [3].
    We believe that three of the core components of the first approach are monolingual retrieval, topic translation,
and merging. Performing multilingual retrieval requires many language resources, such as stopword lists, stemmers,
bilingual dictionaries, machine translation systems, and parallel or comparable corpora. At the same time, more
and better language resources are becoming publicly available on the Internet. The end performance of multilingual retrieval can
be affected by many factors, such as the monolingual retrieval performance of the document ranking algorithm, the
quality and coverage of the translation resources, the availability of language-dependent stemmers and stopword lists,
and the effectiveness of the merging algorithm. Since merging ranked lists of documents is a challenging task, we
seek to improve multilingual retrieval performance by improving monolingual retrieval performance and exploiting
translation resources publicly available on the Internet.
    At CLEF 2002, we participated in the monolingual, cross-language, and multilingual retrieval tasks. For the mono-
lingual task, we submitted retrieval runs for Dutch, French, German, Italian, and Spanish. For the cross-language task,
we submitted cross-language retrieval runs from English topics to the document languages Dutch, French, German,
Italian, and Spanish, one French-to-German run, and one German-to-French run. For the multilingual task, we
submitted two runs using English topics. All of our runs used only the title and desc fields in the topics. The
document collection for the multilingual task consists of documents in English, French, German, Italian and Spanish.
More details on the document collections are presented below in section 5. Realizing the difficulty of merging mul-
tiple disjoint ranked lists of retrieved documents in multilingual retrieval, we put little effort into the merging
problem. We mainly worked on improving the performance of monolingual retrieval and cross-language retrieval,
since we believe that improved performance in monolingual and cross-language retrieval should ultimately lead to
better performance in multilingual retrieval. For all of our runs in the cross-language and multilingual tasks, the topics
were translated into the document languages. The main translation resources we used are the SYSTRAN-based online
machine translation system Babelfish and L&H Power Translator Pro Version 7.0. We also used paral-
lel English/French texts in one of the English-to-French retrieval runs. The Babylon English-Dutch dictionary was
used in cross-language retrieval from English to Dutch.
    The same document ranking formula developed at Berkeley [4] back in 1993 was used for all retrieval runs
reported in this paper. It was also used in our participation in the previous CLEF workshops. It has been shown
that query expansion via blind relevance feedback can be effective in monolingual and cross-language retrieval.
The Berkeley formula based on logistic regression has been used for years without blind relevance feedback. We
developed a blind relevance feedback procedure for the Berkeley document ranking formula. All of our official
runs were produced with blind relevance feedback. We will present a brief overview of the Berkeley document
ranking formula in section 2. We will describe the blind relevance feedback procedure in section 3.
    At CLEF 2001, we presented a German decompounding procedure that was hastily developed. The decom-
pounding procedure uses a German base dictionary consisting of words that should not be further decomposed
into smaller components. When a compound can be split into component words found in the base dictionary in
more than one way, we choose to split up the compound so that the number of component words is the smallest.
However, if there are two or more decompositions with the smallest number of component words, we choose the de-
composition that is most likely. The probability of a decomposition of a compound is computed based on the
relative frequencies of the component words in a German collection. We reported a slight decrease in German
monolingual performance with German decompounding [3] at CLEF 2001. The slight decline in performance
may be attributed to the fact that we kept both the original compounds and the component words resulting from
decompounding in the topic index. When we re-ran the same German monolingual retrieval retaining only the component
words of compounds in the topics, the average precision with decompounding was 8.88% higher than without
it [3]. Further improvements in performance brought by German decompounding were reported
in [3] when a different method was used to compute the relative frequencies of component words.
    At CLEF 2002, we used the improved version of the German decompounding procedure first described in [3]. A
slightly different presentation of the same decompounding procedure is given in section 4. Two small changes were
made in performing German retrieval with decompounding. Firstly, in both the topic and document indexes, only the
component words resulting from decompounding were kept. When a compound was split into component words,
the compound itself was not indexed. Secondly, additional non-German words in the German base dictionary were
removed. Our current base dictionary still has 762,342 words, some being non-German words and some being
German compounds that should be excluded. It would take a major effort to clean up the base dictionary so that it
contains only the German words that should not be further decomposed. The decompounding procedure initially
developed for splitting up German compounds was also used to decompose Dutch compounds with a Dutch base
dictionary.
    Of the two submitted official multilingual runs, one used the unnormalized raw score to re-rank the documents
from the intermediate runs to produce the unified ranked list of documents; the other run used the normalized score
in the same way to produce the final list. To measure the effectiveness of different mergers, we developed an
algorithm for computing the best performance that could possibly be achieved by merging multiple ranked lists of
documents under the conditions that the relevances of the documents are known and that the relative ranking of
the documents in the individual ranked lists is preserved in the unified ranked list. That is, if one document is ranked
higher than another document in some ranked list, then the first document should also be ranked higher than the
second in the unified ranked list. The simple mergers based on unnormalized raw score, normalized raw score, or rank all
preserve the relative ranking order. This procedure cannot be used to carry out merging itself, but it should be useful
for measuring the performance of merging algorithms. The procedure for producing the optimal performance given
the document relevances is presented in section 6.3.
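    As an illustration of the two simple score-based mergers mentioned above, the following Python sketch pools per-language ranked lists either by raw score or by score normalized to the top score of each list; it is an illustrative sketch with invented function and variable names, not the implementation used for the runs reported here. Both variants preserve the within-list ranking order assumed by the optimal-merging procedure of section 6.3.

    def merge_ranked_lists(runs, normalize=False):
        # runs: one ranked list per document language, each a list of
        # (doc_id, score) pairs sorted by decreasing score.
        # If normalize is True, each score is divided by the top score of
        # its own list before pooling; otherwise raw scores are compared.
        pooled = []
        for run in runs:
            top = run[0][1] if (normalize and run) else 1.0
            pooled.extend((doc_id, score / top) for doc_id, score in run)
        # Sorting by the (possibly normalized) score preserves the relative
        # order of documents within each original list.
        return sorted(pooled, key=lambda pair: pair[1], reverse=True)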


2       Document Ranking
All of our retrieval runs used the same document ranking formula developed at Berkeley [4] to rank documents in
response to a query. The log odds of relevance of document D with respect to query Q, denoted by log O(R|D,Q),
is given by

$$\log O(R|D,Q) = \log \frac{P(R|D,Q)}{P(\bar{R}|D,Q)} = -3.51 + 37.4\,\Phi_1 + 0.330\,\Phi_2 - 0.1937\,\Phi_3 + 0.0929\,\Phi_4$$

where P(R|D,Q) is the probability that document D is relevant to query Q, and P(R̄|D,Q) the probability that
document D is irrelevant to query Q, which is 1.0 - P(R|D,Q). The four composite variables Φ1, Φ2, Φ3, and Φ4
are defined as follows:

$$\Phi_1 = \frac{1}{\sqrt{n+1}} \sum_{i=1}^{n} \frac{qtf_i}{ql + 35}, \qquad
  \Phi_2 = \frac{1}{\sqrt{n+1}} \sum_{i=1}^{n} \log \frac{dtf_i}{dl + 80}, \qquad
  \Phi_3 = \frac{1}{\sqrt{n+1}} \sum_{i=1}^{n} \log \frac{ctf_i}{cl}, \qquad
  \Phi_4 = n,$$

where n is the number of matching terms between a document and a query, qtf_i is the within-query frequency
of the i-th matching term, dtf_i is the within-document frequency of the i-th matching term, ctf_i is the occurrence
frequency in a collection of the i-th matching term, ql is the query length (i.e., number of terms in a query), dl is the
document length (i.e., number of terms in a document), and cl is the collection length (i.e., number of terms in a test
collection). If stopwords are removed from indexing, then ql, dl, and cl are the query length, document length, and
collection length, respectively, after removing stopwords. If the query terms are re-weighted, then qtf_i is no longer
the original term frequency, but the new weight, and ql is the sum of the new weight values for the query terms. In
the original training matrix, qtf_i is the within-query term frequency, and ql is the query length. Note that, unlike
Φ2 and Φ3, the variable Φ1 sums the “optimized” relative frequency without first taking the log over the matching
terms. The relevance probability of document D with respect to query Q can be written as follows, given the log
odds of relevance:

$$P(R|D,Q) = \frac{1}{1 + e^{-\log O(R|D,Q)}}$$

The documents are ranked in decreasing order by their relevance probability P(R|D,Q) with respect to a query.
The coefficients were determined by fitting the logistic regression model specified above to training
data using a statistical software package. We refer readers to reference [4] for more details.
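   As an illustration of how the formula above is evaluated, the following Python sketch computes the log odds and the relevance probability for one document-query pair. It is a minimal sketch assuming the per-term statistics have already been gathered, and the function names are invented for the example rather than taken from the system described here.

    import math

    def berkeley_log_odds(qtf, dtf, ctf, ql, dl, cl):
        # qtf, dtf, ctf are lists over the n matching terms: within-query,
        # within-document, and collection frequencies of each matching term.
        # ql, dl, cl are the query, document, and collection lengths.
        n = len(qtf)
        if n == 0:
            return float("-inf")      # no matching terms
        norm = 1.0 / math.sqrt(n + 1)
        phi1 = norm * sum(q / (ql + 35.0) for q in qtf)
        phi2 = norm * sum(math.log(d / (dl + 80.0)) for d in dtf)
        phi3 = norm * sum(math.log(c / cl) for c in ctf)
        return -3.51 + 37.4 * phi1 + 0.330 * phi2 - 0.1937 * phi3 + 0.0929 * n

    def relevance_probability(log_odds):
        # Convert the log odds into the probability used to rank documents.
        return 1.0 / (1.0 + math.exp(-log_odds))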


3       Relevance Feedback
It is well known that blind (also called pseudo) relevance feedback can substantially improve retrieval effectiveness.
It is commonly implemented in research text retrieval systems. For example, see the papers of the groups who
participated in the Ad Hoc tasks in TREC-7 [15] and TREC-8 [16]. Blind relevance feedback is typically performed
in two stages. First, an initial search using the original queries is performed, after which a number of terms are
selected from the „ top-ranked documents that are presumed relevant. The selected terms are merged with the
original query to formulate a new query. Finally the new query is searched against the document collection to
produce a final ranked list of documents. The techniques for deciding the number of terms to select, the
number of top-ranked documents from which to extract terms, and how to rank the terms vary.
    The Berkeley document ranking formula has been in use for many years without blind relevance feedback. In
this paper we present a technique for incorporating blind relevance feedback into the logistic regression-based
document ranking framework. Some of the issues involved in implementing blind relevance feedback include
determining the number of top ranked documents that will be presumed relevant and from which new terms will be
extracted, ranking the selected terms and determining the number of terms that should be selected, and assigning
weight to the selected terms. We refer readers to [9] for a survey of relevance feedback techniques.
    Two factors are important in relevance feedback. The first is how to select terms from the top-ranked documents
after the initial search; the second is how to assign weights to the selected terms relative to the terms in the
initial query. For term selection, we assume the m top-ranked documents in the initial search are relevant, and the
rest of the documents in the collection are irrelevant. For the terms in the documents that are presumed relevant,
we compute the odds ratio of seeing a term in the set of relevant documents and in the set of irrelevant documents.
This is the term relevance weighting formula proposed by Robertson and Sparck Jones in [14]. Table 1 presents
a word contingency table.
                                                      relevant               irrelevant
                                  indexed             m_t                    n_t - m_t                  n_t
                                  non-indexed         m - m_t                N - n_t - m + m_t          N - n_t
                                                      m                      N - m                      N

                                       Table 1: A contingency table for a word t.


                                 Initial Query           Selected Terms                    Expanded Query
                                 t1 (1.0)                                                  t1 (1.0)
                                 t2 (2.0)                t2 (2*0.5)                        t2 (3.0)
                                 t3 (1.0)                t3 (1*0.5)                        t3 (1.5)
                                                         t4 (0.5)                          t4 (0.5)


                                                  Table 2: Query expansion.



In Table 1, N is the number of documents in the collection, m the number of top-ranked
documents after the initial search that are presumed relevant, m_t the number of documents among the m top-
ranked documents that contain the term t, and n_t the number of documents in the collection that contain the term
t. We see from the above contingency table that the probability of finding the term t in a relevant document
is m_t/m, because m_t documents out of the m relevant documents contain the term t. Likewise, the probability of not
finding the term t in a relevant document is 1 - m_t/m. The odds of finding the term t in a relevant document is therefore
m_t/(m - m_t). Likewise, the odds of finding the term t in an irrelevant document is (n_t - m_t)/(N - n_t - m + m_t). The terms
extracted from the m top-ranked documents are ranked by their odds ratio, which is given by

$$w_t = \log \frac{m_t\,(N - n_t - m + m_t)}{(n_t - m_t)\,(m - m_t)} \qquad (1)$$
For every term t, except for stopwords, found in the m top-ranked documents, we compute its weight w_t according
to the above formula. Then all the terms are ranked in decreasing order by their weight w_t. The top-ranked k
terms, including the ones that are in the initial query, are added to the initial query to create a new query. For the
selected top-ranked terms that are not in the initial query, the weight is set to 0.5. For those top-ranked terms that
are in the initial query, the weight is set to 0.5*qtf, where qtf is the occurrence frequency of the term in the initial query.
The weights are unchanged for the initial query terms that are not in the set of selected terms. The selected terms
are merged with the initial query to formulate an expanded query. When a selected term is one of the query terms
in the initial query, its weight in the expanded query is the sum of its weight in the initial query and the weight
assigned in the term selection process. For a selected term that is not in the initial query, its weight in the final
query is the same as the weight assigned in the term selection process, which is 0.5. The weights for the initial
query terms that are not in the list of selected terms remain unchanged. Table 2 presents an example to illustrate
how the expanded query is created from the initial query and the selected terms. The numbers in parentheses are
term weights. For example, the weight for term t2 in the expanded query is 3.0, since it is in the initial query with
a weight value of 2.0 and it is one of the selected terms, assigned the weight of 2*0.5.
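    The weight-merging step illustrated in Table 2 can be sketched in Python as follows; this is a hypothetical fragment whose names are chosen for the example, not code from the system used for the official runs.

    def expand_query(initial_query, selected_terms):
        # initial_query: dict mapping term -> within-query frequency.
        # selected_terms: terms chosen from the top-ranked documents.
        expanded = dict(initial_query)
        for term in selected_terms:
            if term in initial_query:
                # a selected term already in the query adds 0.5 * its frequency
                expanded[term] += 0.5 * initial_query[term]
            else:
                # a newly selected term enters with weight 0.5
                expanded[term] = 0.5
        return expanded

    # Reproducing Table 2: t1, t2, t3 in the initial query; t2, t3, t4 selected.
    # expand_query({"t1": 1.0, "t2": 2.0, "t3": 1.0}, ["t2", "t3", "t4"])
    # returns {"t1": 1.0, "t2": 3.0, "t3": 1.5, "t4": 0.5}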
    Three minor changes are made to the blind relevance feedback procedure described above. First, a constant of
0.5 was added to every item in the formula used to compute the weight. Second, the selected terms must occur
in at least 3 of the m top-ranked documents that are presumed relevant. Third, the top-ranked two documents in
the initial search remain the top-ranked two documents in the final search. That is, the final search does not
affect the ranking of the first two documents from the initial search. The rationale for not changing the top-ranked
two documents is that when a query has only one or two relevant documents in the entire collection and they are
not ranked at the top in the initial search, it is unlikely that these few relevant documents would rise to the top in
the second search. On the other hand, if these few relevant documents are ranked at the top in the initial search,
after expansion they are likely to be ranked lower in the final search. We believe a good strategy is to not change
the ranking of the top two documents.
    Note that in computing the relevance probability of a document with respect to a query in the initial search,
ql is the number of terms in the initial query, and qtf_i is the number of times that the i-th term occurs in the initial query.
After query expansion, qtf_i is no longer the raw term frequency in the initial query; instead it is the weight of
the term in the expanded query, and ql is the sum of the weight values of all the terms in the expanded query. For
the example presented in Table 2, qtf for term t3 is 1.5, and ql is 6.0 (i.e., 1.0 + 3.0 + 1.5 + 0.5). The relevance clues related
to documents and the collection are the same in computing the relevance probability using the expanded query as in
computing the relevance probability using the initial query. For all the experiments reported below, we selected the
top 10 terms ranked by w_t from the 10 top-ranked documents in the initial search.
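    Putting the pieces together, a minimal sketch of the term-selection step is given below, including the 0.5 smoothing constant and the requirement that a selected term occur in at least three of the presumed-relevant documents. The data structures (each top-ranked document given as a set of terms, a document-frequency table) and the function names are assumptions made for the example.

    import math

    def feedback_term_weight(mt, m, nt, N):
        # Smoothed odds ratio of equation (1), with 0.5 added to each factor.
        # mt: presumed-relevant documents containing the term
        # m : number of presumed-relevant (top-ranked) documents
        # nt: documents in the collection containing the term
        # N : number of documents in the collection
        return math.log(((mt + 0.5) * (N - nt - m + mt + 0.5)) /
                        ((nt - mt + 0.5) * (m - mt + 0.5)))

    def select_feedback_terms(top_docs, doc_freq, N, stopwords, k=10, min_docs=3):
        # top_docs: the presumed-relevant documents, each given as a set of terms.
        # doc_freq: term -> number of documents in the collection containing it.
        m = len(top_docs)
        counts = {}
        for doc in top_docs:
            for term in doc:
                if term not in stopwords:
                    counts[term] = counts.get(term, 0) + 1
        candidates = [(feedback_term_weight(mt, m, doc_freq[term], N), term)
                      for term, mt in counts.items() if mt >= min_docs]
        candidates.sort(reverse=True)
        return [term for _, term in candidates[:k]]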



4     Decompounding
It appears that most German compounds are formed by directly joining two or more words. Examples are Comput-
erviren (computer viruses), which is the concatenation of Computer and Viren, and Sonnenenergie (solar energy),
which is formed by joining Sonnen and Energie together. Sometimes a linking element such as s or e is inserted
between two words. For example, the compound Schönheitskönigin (beauty queen) is derived from Schönheit and
Königin with s inserted between them. There are also cases where compounds are formed with the final letter e of
the first word elided. For example, the compound Erdbeben (earthquake) is derived from Erde (earth) and Beben
(trembling). When the word Erde is combined with the word Atmosphäre to create a compound, the compound is
not Erdeatmosphäre, but Erdatmosphäre. The final letter e of the word Erde is elided from the compound. We re-
fer readers to, for example, [6] for discussions of German compound formation. The example earthquake shows that
compounds also occur in English, just not nearly as commonly as in German.
    We present a German decompounding procedure in this section that addresses only the cases where
compounds are directly formed by joining words and the cases where the linking element s is inserted. The
procedure is described as follows:
    1. Create a German base dictionary consisting of German words in various forms, but not compounds.
    2. Decompose a German compound with respect to the base dictionary. That is, find all possible ways to break
       up a compound with respect to the base dictionary.
    3. Choose the decomposition with the minimum number of component words.
    4. If there is more than one decomposition with the smallest number of component words, choose the
       one with the highest probability of decomposition. The probability of a decomposition is estimated by the
       product of the relative frequencies of the component words. More details are presented below.
For example, when the German base dictionary contains ball, europa, fuss, fussball, meisterschaft and others,
the German compound fussballeuropameisterschaft can be decomposed into component words with respect to the
base dictionary in two different ways as shown in Table 3. The last decomposition has the smallest number of

                                       Decompositions
                             1       fuss     ball        europa          meisterschaft
                             2       fussball europa      meisterschaft

                       Table 3: Decompositions of compound fussballeuropameisterschaft.


component words, so the German compound fussballeuropameisterschaft is split into fussball, europa and meis-
terschaft. Table 4 presents another example which shows the decompositions of German compound wintersports
with respect to a base dictionary containing port, ports, s, sport, sports, winter, winters and other words. The

                                         Decompositions                   log p(D)
                                 1      winter  s           ports         -43.7002
                                 2      winter  sports                    -20.0786
                                 3      winters ports                     -28.3584

                                 Table 4: Decompositions of compound wintersports.


compound wintersports has three decompositions with respect to the base dictionary. Because two decompositions
have the smallest number of component words, the rule of selecting the decomposition with the smallest number
of component words cannot be applied here. We have to compute the probability of the decomposition for the
decompositions with the smallest number of component words. The last column in Table 4 shows the log of the
decomposition probability for all three decompositions, computed using the relative frequencies of the com-
ponent words in the German test collection. According to the rule of selecting the decomposition with the highest
probability, the second decomposition should be chosen as the decomposition of the compound wintersports. That
is, the compound wintersports should be split into winter and sports. Consider the decomposition D of a compound c
into n component words, c = w_1 w_2 ... w_n. The probability of the decomposition is computed as follows:

$$P(D) = P(w_1)\,P(w_2)\cdots P(w_n) = \prod_{i=1}^{n} P(w_i)$$

where the probability of a component word w is computed as follows:

$$P(w) = \frac{freq(w)}{\sum_{j=1}^{M} freq(w_j)}$$

where freq(w) is the number of occurrences of word w in a collection, and M is the number of unique words, including
compounds, in the collection. The occurrence frequency of a word is the number of times the word occurs alone in
the collection. The frequency count of a word does not include the cases where the word is a component word in a
larger compound. Also, the base dictionary does not contain any words that are three letters long or shorter except
for the letter s. We created a German base dictionary of about 762,000 words by combining a lexicon extracted
from Morphy, a German morphological analyzer [10], German wordlists found on the Internet, and German words
in the CLEF-2001 German collection. In our implementation, we considered only the case where a compound is
the concatenation of component words, and the case where the linking element s is present. Note that the number
of possible decompositions of a compound is determined by what is in the base dictionary. For example, when the
word mittagessen (lunch) is not in the base dictionary, the compound mittagessenzeit (lunch time) would be split
into three component words mittag (noon), essen (meal), and zeit (time).
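   A compact sketch of steps 2-4 of the decompounding procedure is given below, restricted as in our implementation to direct concatenation, with the linking element s handled simply by keeping s in the base dictionary. The dictionary, the frequency table, and the smoothing of unseen frequencies are assumptions made for illustration, not the exact code used for the runs reported here.

    import math

    def decompositions(word, lexicon):
        # All ways of writing word as a concatenation of base-dictionary entries.
        if word == "":
            return [[]]
        splits = []
        for i in range(1, len(word) + 1):
            prefix = word[:i]
            if prefix in lexicon:
                splits.extend([prefix] + tail
                              for tail in decompositions(word[i:], lexicon))
        return splits

    def decompound(word, lexicon, freq, total_freq):
        # Words already in the base dictionary are never split.
        if word in lexicon:
            return [word]
        candidates = decompositions(word, lexicon)
        if not candidates:
            return [word]
        def log_prob(parts):
            # log of the product of relative frequencies; unseen words get count 1
            return sum(math.log(freq.get(p, 1) / total_freq) for p in parts)
        # Fewest components first; ties broken by the highest probability.
        return min(candidates, key=lambda parts: (len(parts), -log_prob(parts)))

Applied to the wintersports example above, the decompositions with two component words survive the length criterion, and the more probable decomposition winter + sports is chosen.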
   It is not always desirable to split up German compounds into their component words. Consider again the
compound Erdbeben. In this case, it is probably better not to split up the compound. But in other cases like
Gemüseexporteure (vegetable exporters) and Fußballweltmeisterschaft (World Soccer Championship), splitting up
the compounds is probably desirable, since the use of the component words might retrieve additional relevant
documents which are otherwise likely to be missed if only the compounds are used. In fact, we noticed that the
compound Gemüseexporteure does not occur in the CLEF-2001 German document collection.
    In general, it is conceivable that breaking up compounds is helpful. The same phrase may be spelled out in
separate words sometimes, but as one compound other times. When a user formulates a German query, the user may not
know whether a phrase should appear as a multi-word phrase or as one compound. An example is the German equiva-
lent of the English phrase “European Football Cup”: in the title of topic 113, the German equivalent is spelled as
one compound, Fussballeuropameisterschaft, but in the description field it is Europameisterschaft im Fußball, and
in the narrative field it is Fußballeuropameisterschaft. This example brings out two points in indexing German
texts. First, it should be helpful to split compounds into component words. Second, normalizing the spelling of
ss and ß should be helpful. Two more such examples are Scheidungsstatistiken and Präsidentschaftskandidaten.
The German equivalent of “divorce statistics” is Scheidungsstatistiken in the title field of topic 115, but Statis-
tiken über die Scheidungsraten in the description field. The German equivalent of “presidency candidates” is
Präsidentschaftskandidaten in the title field of topic 135, but Kandidat für das Präsidentenamt in the description field
of the same topic. The German equivalent of “Nobel prize winner for literature” is Literaturnobelpreisträger; in
the “Der Spiegel” German collection, we find the variants Literatur-Nobelpreisträger and Literaturnobelpreis-Trägerin.
Literaturnobelpreis sometimes appears as “Nobelpreis für Literatur”.


5    Test Collection
The document collection for the multilingual IR task consists of documents in five languages: English, French,
German, Italian, and Spanish. The collection has about 750,000 documents which are newspaper articles published
in 1994, except that part of Der Spiegel was published in 1995. The distribution of documents among the five
document languages is presented in Table 5. A set of 50 topics was developed and released in more than 10 lan-
guages, including Dutch, English, French, German, Italian, and Spanish. A topic has three parts: 1) title, a short
description of information need; 2) description, a sentence-long description of information need; and 3) narrative,
specifying document relevance criteria. More details about the test collection are presented in [13]. The multilin-
gual IR task at CLEF 2002 was concerned with searching the collection consisting of English, French, German,
Italian, and Spanish documents for relevant documents, and returning a combined, ranked list of documents in any
document language in response to a query.
                            Language      Name                       No. of         Size
                                                                     documents      (MB)
                            English       Los Angeles Times          113,005        425
                            French        Le Monde                   44,013         157
                                          SDA French                 43,178         86
                            German        Frankfurter Rundschau      139,715        320
                                          Der Spiegel                13,979         63
                                          SDA German                 71,677         144
                            Italian       La Stampa                  58,051         193
                                          SDA Italian                50,527         85
                            Spanish       EFE                        215,738        509
                            Dutch         NRC Handelsblad            84,121         299
                                          Algemeen Dagblad           106,483        241

                                  Table 5: Part of the CLEF 2002 document sets.



6    Experimental Results
All retrieval runs reported in this paper used only the title and description fields in the topics. The ids and average
precision values of the official runs are presented in bold face; the other runs are unofficial.

6.1 Monolingual retrieval experiments
In this section we present the results of monolingual retrieval. We created a stopword list for each document
language. In indexing, the stopwords were removed from both documents and topics. Additional words such as
relevant and document were removed from the topics. The words in all six languages were stemmed using Muscat
stemmers downloaded from http://open.muscat.com. For automatic query expansion, the top-ranked 10 terms from
the top-ranked 10 documents after the initial search were combined with the original query to create the expanded
query. For Dutch and German monolingual runs, the compounds were split into their component words, and only
their component words were retained in document and topic indexing. All the monolingual runs included automatic
query expansion via the relevance feedback procedure described in section 3. Table 6 presents the monolingual
retrieval results for six document languages. The last column, labeled change, shows the improvement in average
precision with blind relevance feedback over without it. As Table 6 shows, query expansion increased the average
precision of the monolingual runs for all six languages, with the improvement ranging from 6.42% for Spanish to
19.42% for French. There are no relevant Italian documents for topic 120, and no relevant English documents for

                                            without expansion           with expansion
              run id         language       recall     precision       recall    precision       change
              bky2moen       English       765/821      0.5084        793/821     0.5602        10.19%
              bky2monl       Dutch        1633/1862     0.4446       1734/1862    0.4847          9.02%
              bky2mofr       French       1277/1383     0.4347       1354/1383    0.5191        19.42%
              bky2mode       German       1696/1938     0.4393       1807/1938    0.5234        19.14%
              bky2moit       Italian      994/1072      0.4169       1024/1072    0.4750        13.94%
              bky2moes       Spanish      2531/2854     0.5016       2673/2854    0.5338          6.42%

                                        Table 6: Monolingual IR performance.


topics 93, 96, 101, 110, 117, 118, 127 and 132.
   For the German monolingual runs, compounds were decomposed into their component words by applying the
decompounding procedure described above. Only component words of the decomposed compounds were kept
in document and topic indexing. Table 7 presents the performance of German monolingual retrieval with three
different features which are decompounding, stemming, and query expansion. The features are implemented in
the order of decompounding, stemming, and query expansion. For example, when decompounding and stemming
are present, the compounds are split into component words first, then the components are stemmed. The table
shows that when any one of the three features is present, the average precision improves by 4.94% to 19.73% over
                1          2         3          4            5               6                 7             8
  features      none       decomp    stem       expan        decomp+stem     decomp+expan      stem+expan    decomp+stem+expan
  avg prec      0.3462     0.3859    0.3633     0.4145       0.4393          0.4517            0.4393        0.5234
  recall        1359       1577      1500       1575         1696            1752              1702          1807
  change        baseline   +11.47%   +4.94%     +19.73%      +26.89%         +30.47%           +26.89%       +51.18%

Table 7: German monolingual retrieval performance. The total number of German relevant documents for 50
topics is 1938.



the baseline performance when none of the features is present. When two of the three features are included in
retrieval, the improvement in precision ranges from 26.89% to 30.47%. And when all three features are present,
the average precision is 51.18% better than the baseline performance. It is interesting to see that the three features are
complementary. That is, the improvement brought by each individual feature is not diminished by the presence
of the other two features. Without decompounding, stemming alone improved the average precision by 4.94%.
However with decompounding, stemming improved the average precision from 0.3859 to 0.4393, an increase
of 13.84%. Stemming became more effective because of decompounding. Decompounding alone improved the
average precision by 11.47% for German monolingual retrieval.

       compounds                     component            words                   compounds            component     words
  1    absatzkrise                   absatz               krise            2      atemwege             atem          wege
  3    autoindustrie                 auto                 industrie        4      automobilindustrie   automobil     industrie
  5    bandleaders                   band                 leaders          6      bronchialasthma      bronchial     asthma
  7    bürgerkrieg                  bürger              krieg            8      computeranimation    computer      animation
  9    computeranimationen           computer             animationen      10 computersicherheit       computer      sicherheit
  11   durchbrüche                  durch                brüche          12 eigentumsrechte          eigentums     rechte
  13   eurofighter                   euro                 fighter          14 europameisterschaft      europa        meisterschaft
  15   filmfestspielen               film                 festspielen      16 filmindustrie            film          industrie
  17   fischereiquoten               fischerei            quoten           18 fremdsprachigen          fremd         sprachigen
  19   fremdwörter                  fremd                wörter          20 goldmedaille             gold          medaille
  21   handynutzung                  handy                nutzung          22 interessenkonflikt       interessen    konflikt
  23   interessenkonflikts           interessen           konflikts        24 menschenrechte           menschen      rechte
  25   mobiltelefone                 mobil                telefone         26 nahrungskette            nahrungs      kette
  27   netzwerken                    netz                 werken           28 nordamerika              nord          amerika
  29   nordamerikanische             nord                 amerikanische    30 pelzindustrie            pelz          industrie
  31   präsidentschaftskandidaten   präsidentschafts    kandidaten       32 premierministers         premier       ministers
  33   scheidungsraten               scheidungs           raten            34 scheidungsstatistiken    scheidungs    statistiken
  35   sicherheitspolitik            sicherheits          politik          36 spionagefall             spionage      fall
  37   spionagefalles                spionage             falles           38 sternensystemen          sternen       systemen
  39   verkaufszahlen                verkaufs             zahlen           40 volksbefragung           volks         befragung
  41   winterspielen                 winter               spielen          42 wintersports             winter        sports
  43   wirtschaftsembargos           wirtschafts          embargos         44 wirtschaftspolitik       wirtschafts   politik
  45   zeitplan                      zeit                 plan             46 zurücktreten            zurück       treten
  47   einheitswährung              einheit              s                währung
  48   fischfangquoten               fisch                fang             quoten
  49   fussballeuropameisterschaft   fussball             europa           meisterschaft
  50   geographos                    geog                 rapho            s
  51   literaturnobelpreisträgers   literatur            nobel            preisträgers
  52   schönheitswettbewerbe        schönheit           s                wettbewerbe
  53   schönheitswettbewerben       schönheit           s                wettbewerben

             Table 8: German words in title or desc fields of the topics that are split into component words.


   Table 8 presents the German words in the title or desc fields of the topics that were split into component words
using the decompounding procedure described in section 4. The column labeled component words shows the
component words of the decomposed compounds. The German word eurofighter was split into euro and fighter
since both component words are in the base dictionary, and the word eurofighter is not. Including the word
eurofighter in the base dictionary would prevent it from being split into component words. The word geographos was
decomposed into geog, rapho, and s for the same reason: the component words are in the base dictionary, but the word itself is not. Two
topic words, lateinamerika and zivilbevölkerung, were not split into component words because both are present in
our base dictionary, which is far from being perfect. For the same reason, the word preisträgers was not decomposed into
preis and trägers. An ideal base dictionary should contain all and only the words that should not be further split
into smaller component words. Our current decompounding procedure does not split words in the base dictionary
into smaller component words. When the two compounds, lateinamerika and zivilbevölkerung, are removed from
the base dictionary, lateinamerika is split into latein and amerika, and zivilbevölkerung into zivil and bevölkerung.
The topic word südjemen was not split into süd and jemen because our base dictionary does not contain words that
are three letters long or shorter. The majority of the errors in decompounding are caused by the incompleteness of
the base dictionary or the presence of compound words in the base dictionary.
   We used a Dutch stopword list of 1326 words downloaded from http://clef.iei.pi.cnr.it:2002/ for Dutch mono-
lingual retrieval. After removing stopwords, the Dutch words were stemmed using the Muscat Dutch stemmer.
For Dutch decompounding, we used a Dutch wordlist of 223,557 words 1 . From this wordlist we created a Dutch
base dictionary of 210,639 words by manually breaking up the long words that appear to be compounds. It appears that
many Dutch compound words remain in the base dictionary. Like the German base dictionary, an ideal Dutch base
dictionary should include all and only the words that should not be further decomposed into smaller component
words. The Dutch words in the title or desc fields of the topics were split into component words using the same
procedure as for German decompounding. As with German decompounding, the words in the Dutch base dictionary
are not decomposed. The source wordlist file contains a list of country names, which should have been added to the
                 compounds           component     words            compounds                component        words
         1       rijkspolitie        rijks         politie     2    belangenverstrengeling   belangen         verstrengeling
         3       sterrenstelsels     sterren       stelsels    4    bontsector               bont             sector
         5       nobelprijs          nobel         prijs       6    verkoopaantallen         verkoop          aantallen
         7       grungerock          grunge        rock        8    spionagezaak             spionage         zaak
         9       frankrijk           frank         rijk        10   echtscheidingscijfers    echtscheidings   cijfers
         11      oproepkaart         oproep        kaart       12   autofabrikanten          auto             fabrikanten
         13      handelsembargo      handels       embargo     14   internationale           inter            nationale
         15      duitsland           duit          s land      16   computerbeveiliging      computer         beveiliging
         17      filmindustrie       film          industrie   18   veiligheidsbeleid        veiligheids      beleid
         19      netwerktoegang      netwerk       toegang      20   filmfestival             film             festival
         21      omzetcrisis         omzet         crisis      22   computeranimatie         computer         animatie
         23      tijdschema          tijd          schema

              Table 9: Dutch words in title or desc fields of the topics that are split into component words.


Dutch base dictionary. The Dutch words frankrijk and duitsland were split into component words because they are
not in the base dictionary. For the same reason, the word internationale was decomposed. It appears that compound
words in Dutch are not as common as in German. As in German indexing, when a compound was split into
component words, only the component words were retained in the index. Table 10 presents the performance of
                1          2           3           4           5              6                7              8
   features     none       decomp      stem        expan       decomp+stem    decomp+expan     stem+expan     decomp+stem+expan
   avg prec     0.4021     0.4186      0.4281      0.4669      0.4446         0.4721           0.4770         0.4847
   recall       1562       1623        1584        1614        1633           1727             1702           1734
   change       baseline   +4.10%      +6.47%      +16.12%     +10.57%        +17.41%          +18.63%        +20.54%

Table 10: Dutch monolingual retrieval performance on CLEF-2002 test set. The total number of Dutch relevant
documents for the 50 topics of CLEF 2002 is 1862.


Dutch monolingual retrieval under various conditions. With no stemming and expansion, Dutch decompounding
improved the average precision by 4.10%. Together the three features improved the average precision by 20.54%
over the base performance when none of the features is implemented.
                1          2           3            4          5              6                 7             8
  features      none       decomp      stem         expan      decomp+stem    decomp+expan      stem+expan    decomp+stem+expan
  avg prec      0.3239     0.3676      0.3587       0.3471     0.4165         0.3822            0.3887        0.4372
  change        baseline   +13.49%     +10.74%      +7.16%     +28.59%        +18.00%           +20.01%       +34.98%

Table 11: Dutch monolingual retrieval performance on CLEF-2001 test set. The total number of Dutch relevant
documents for the 50 topics of CLEF 2001 is 1224.


  For comparison, table 11 presents the Dutch monolingual performance on the CLEF 2001 test set. Decom-
pounding alone improved the average precision by 13.49%. Topic 88 of CLEF 2001 is about mad cow disease in
  1 downloaded from ftp://archive.cs.ruu.nl/pub/UNIX/ispell/words.dutch.gz
Europe. The Dutch equivalent of mad cow disease is gekkekoeienziekte in the topic, but it never occurs in the Dutch
collection. Without decompounding, the precision for this topic is 0.1625, and with decompounding, the precision
increased to 0.3216. The precision for topic 90, which is about vegetable exporters, is 0.0128 without decompounding. This
topic contains two compound words, Groentenexporteurs and diepvriesgroenten. The former, which is perhaps
the most important term for this topic, never occurs in the Dutch document collection. After decompounding, the
precision for this topic increased to 0.3443. Topic 55 contains two important compound words, Alpenverkeersplan
and Alpeninitiatief. Both never occur in the Dutch document collection. The precision for this topic is 0.0746
without decompounding, and increased to 0.2137 after decompounding.

6.2 Cross-language Retrieval Experiments
A major factor affecting the end performance of cross-language retrieval and multilingual retrieval is the quality
of translation resources. In this section, we evaluate the effectiveness of three different translation resources:
automatic machine translation systems, parallel corpora, and bilingual dictionaries. Two of the issues in translating
topics are 1) determining the number of translations to retain when multiple candidate translations are available;
and 2) assigning weights to the selected translations [8]. When machine translation systems are used to translate
topics, these two issues are resolved automatically by the machine translation systems, since they provide only
one translation for each word. However, when bilingual dictionaries or parallel corpora are used to translate topics,
often for a source word, there may be several alternative translations.

6.2.1 CLIR Using MT
In this section, we evaluate two machine translation systems, online Babelfish translation 2 and L&H Power Trans-
lator Pro, version 7.0, for translating topics in CLIR. We used both translators to translate the 50 English topics
into French, Italian, German, and Spanish. For each language, both sets of translations were preprocessed in the
same way. Table 12 presents the CLIR retrieval performances for all the official runs and additional runs. The

                                                                              without     with
                                                                              expansion   expansion
          run id               topic          document      resources         precision   precision    change
          bky2bienfr           English        French        Babelfish + L&H   0.4118      0.4773      +15.91%
          bky2bienfr2          English        French        Systran + L&H +   0.4223      0.4744      +12.34%
                                                            Parallel Texts
          bky2bienfr3          English        French        Babelfish         0.3731      0.4583      +22.84%
          bky2bienfr4          English        French        L&H               0.3951      0.4652      +17.74%
          bky2bienfr5          English        French        Parallel texts    0.3835      0.4529      +18.10%
          bky2bidefr           German         French        Babelfish         0.3437      0.4124      +19.99%
          bky2biende           English        German        Babelfish + L&H   0.3561      0.4479      +25.78%
          bky2biende1          English        German        Babelfish         0.3229      0.4091      +26.70%
          bky2biende2          English        German        L&H               0.3555      0.4449      +25.15%
          bky2bifrde           French         German        Babelfish         0.3679      0.4759      +29.36%
          bky2bienit           English        Italian       Babelfish + L&H   0.3608      0.4090      +13.36%
          bky2bienit1          English        Italian       Babelfish         0.3239      0.3634      +12.20%
          bky2bienit2          English        Italian       L&H               0.3412      0.3974      +16.47%
          bky2bienes           English        Spanish       Babelfish + L&H   0.4090      0.4567      +11.66%
          bky2bienes1          English        Spanish       Babelfish         0.3649      0.4108      +12.58%
          bky2bienes2          English        Spanish       L&H               0.4111      0.4557      +10.85%
          bky2biennl           English        Dutch         Babylon           0.2564      0.3199      +24.77%

Table 12: Performance of cross-language retrieval runs. The ids and average precision values for the official runs
are in bold face.


ids and average precision values for the official runs are in bold face. The last column in Table 12 shows the im-
provement in average precision with query expansion over without it. When both L&H Translator and Babelfish
  2 publicly available at http://babelfish.altavista.com/
were used in cross-language retrieval from English to French, German, Italian and Spanish, the translation from
L&H Translator and the translation from Babelfish were combined by topic. The term frequencies in the com-
bined topics were reduced by half so that the combined topics were comparable in length to the source English
topics. Then the combined translations were used to search the document collection for relevant documents as in
monolingual retrieval. For example, for the English-to-Italian run bky2bienit, we first translated the source English
topics into Italian using L&H Translator and Babelfish. The Italian translations produced by L&H Translator and
the Italian translations produced by Babelfish were combined by topic. Then the combined, translated Italian top-
ics, with term frequencies reduced by half, were used to search the Italian document collection. The bky2bienfr, bky2biende, and bky2bienes CLIR runs from English were all produced in the same way as the bky2bienit run. For the English-to-German and French-to-German cross-language retrieval runs, the words in the title and desc fields of the translated German topics were decompounded. For all cross-language runs, words were stemmed after removing stopwords, as in monolingual retrieval. The English-to-French run bky2bienfr2 was produced by merging the bky2bienfr run and the bky2bienfr5 run, which used parallel corpora as the sole translation resource. The use of parallel corpora is discussed further below.
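The combination step can be sketched as follows. This is a minimal illustration in Python, assuming the translated topics are already tokenized into bags of words; the function name and the example tokens are ours, not code from the actual system.

    from collections import Counter

    def combine_translations(translation_a, translation_b):
        """Combine two translated topics (token lists) and halve the term
        frequencies so that the combined topic stays comparable in length
        to the source topic."""
        combined = Counter(translation_a) + Counter(translation_b)
        # Halve each term frequency, keeping at least one occurrence.
        return {term: max(1, freq // 2) for term, freq in combined.items()}

    # Hypothetical Italian translations of one English topic from two translators.
    lh_translation = ["animazione", "computer", "film", "animazione"]
    babelfish_translation = ["animazione", "calcolatore", "film", "pellicola"]
    print(combine_translations(lh_translation, babelfish_translation))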
   All the cross-language runs applied blind relevance feedback. The 10 top-ranked terms from the 10 top-ranked documents after the initial search were combined with the initial query to formulate an expanded query. The results presented in table 12 show that query expansion improved the average precision for the official runs by 10.85% to 29.36%. The L&H Translator performed better than Babelfish for cross-language retrieval from English to French, German, Italian, and Spanish. Combining the translations from L&H Translator and Babelfish performed slightly better than using only the translations from L&H Translator.
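A rough sketch of the expansion step is given below. The term scorer used here (the number of feedback documents a term occurs in) is only a stand-in for the relevance feedback term weighting described earlier in the paper; all names and the example data are illustrative.

    from collections import Counter

    def expand_query(query_terms, ranked_docs, n_docs=10, n_terms=10):
        """Blind relevance feedback: add the n_terms best terms from the
        top n_docs documents of the initial search to the initial query.
        Each document is a list of tokens; a term's score here is simply
        the number of feedback documents it occurs in."""
        doc_freq = Counter()
        for doc in ranked_docs[:n_docs]:
            doc_freq.update(set(doc))
        # Keep the best terms that are not already in the query.
        candidates = [(t, df) for t, df in doc_freq.items() if t not in query_terms]
        candidates.sort(key=lambda x: -x[1])
        return list(query_terms) + [t for t, _ in candidates[:n_terms]]

    expanded = expand_query(["asthma", "pollution"],
                            [["asthma", "air", "smog"], ["smog", "health"]],
                            n_docs=2, n_terms=2)
    print(expanded)  # ['asthma', 'pollution', 'smog', ...]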
   We noticed a number of errors in the Babelfish translations of the English topics. For example, in the English-to-Italian translation, the English text Super G was translated into Superg, and U.N. and U.S.-Russian were not translated. In the English-to-German translation, the phrase Southern Yemen in the desc field was correctly translated into Südyemen, but the same phrase in the title field became SüdcYemen.
Decompounding is helpful not only in monolingual retrieval but also in cross-language retrieval to German from

                                                      no decompounding       decompounding
           source     target      translator          average precision      average precision    change
           English    German      L&H Translator      0.2776                 0.3009               8.4%
           English    German      Babelfish           0.2554                 0.2906               13.78%
           French     German      Babelfish           0.2774                 0.3092               11.46%

Table 13: Effectiveness of decompounding in cross-language retrieval to German. All runs were performed without
stemming and query expansion.


other languages such as English. An English phrase of two words may be translated into a German phrase of two words, or into a compound. For example, in topic 111, the English phrase computer animation in the title became ComputerAnimation, and Computer Animation in the desc. In topic 109, the English phrase Computer Security became Computer-Sicherheit in the title, but the same phrase in lower case in the desc became Computersicherheit. Table 13 shows the performance of three cross-language retrieval runs to German with and without decompounding. The improvement in average precision ranges from 8.4% to 13.78%.
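For illustration only, a greedy dictionary-based split in the spirit of our decompounding procedure (but not the exact algorithm used in the runs, and with a hypothetical mini-lexicon) might look like this:

    def decompound(word, lexicon, min_len=3):
        """Greedily split a compound into known component words;
        return [word] unchanged if no full decomposition is found."""
        word = word.lower().replace("-", "")

        def split(w):
            if w in lexicon:
                return [w]
            for i in range(len(w) - min_len, min_len - 1, -1):
                head, tail = w[:i], w[i:]
                if head in lexicon:
                    rest = split(tail)
                    if rest:
                        return [head] + rest
            return None

        return split(word) or [word]

    # Hypothetical mini-lexicon covering the examples from topics 109 and 111.
    lexicon = {"computer", "sicherheit", "animation"}
    print(decompound("Computersicherheit", lexicon))   # ['computer', 'sicherheit']
    print(decompound("Computer-Sicherheit", lexicon))  # ['computer', 'sicherheit']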

6.2.2 English-French CLIR Using Parallel Corpora
We created a French-English bilingual lexicon from the Canadian Hansard (the recordings of the debates of the
House for the period of 1994 to 2001). The texts are in English and French. We first aligned the Hansard corpus at
the sentence level using the length-based algorithm proposed by Gale and Church [7], resulting in about two million
aligned French/English sentence pairs. To speed up the training (i.e., estimating word translation probabilities), we extracted and used only the sentence pairs that contain at least one English topic word from the CLEF-2001 topics. A number of preprocessing steps were carried out prior to the training. First, we removed the English stopwords from the English sentences and the French stopwords from the French sentences. Second, we reduced the variants of a word to its base form. For English, we used the morphological analyzer described in [5]; for French, we used a French morphological analyzer named DICO. Each of the packages contains a list of words together with their morphological analyses. Third, we discarded the sentence pairs in which one of the sentences has 40 or more words after removing stopwords, as well as the sentence pairs in which the length ratio of the English sentence over the French sentence is below 0.7 or above 1.5. The average length ratio of English text over French text is approximately 1.0. Since sentence alignment is not perfect, some misalignments are unavoidable; hence there may be sentence pairs whose length ratios deviate far from the average. After this preprocessing, only 706,210
pairs of aligned sentences remained. The remaining aligned sentence pairs were fed to GIZA++ for estimating English-to-French word translation probabilities. The GIZA++ toolkit is an extension of the EGYPT toolkit [1], which is based on the statistical machine translation models described in [2]. Readers are referred to [11] for more details on GIZA++. The whole training phase took about 24 hours on a Sun Microsystems Sparc server. Table 14 shows the first three French translations produced by GIZA++ for some of the words in the English topics.
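The length-based filtering of aligned sentence pairs can be sketched as follows, assuming each pair has already been stopword-filtered and lemmatized; the function name and the example pairs are illustrative.

    def keep_pair(en_tokens, fr_tokens, max_len=40, min_ratio=0.7, max_ratio=1.5):
        """Return True if an aligned English/French sentence pair passes the
        filters used before training: neither side may have 40 or more
        tokens, and the English/French length ratio must lie between
        0.7 and 1.5."""
        if len(en_tokens) >= max_len or len(fr_tokens) >= max_len or not fr_tokens:
            return False
        ratio = len(en_tokens) / len(fr_tokens)
        return min_ratio <= ratio <= max_ratio

    pairs = [(["amnesty", "law"], ["loi", "amnistie"]),
             (["short"], ["une", "phrase", "beaucoup", "trop", "longue"])]
    print(sum(1 for en, fr in pairs if keep_pair(en, fr)))  # 1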

 English     French translations     Translation probability    English        French translations    Translation probability
 amnesty     amnistier               0.960881                   independence   indépendance           0.762385
             amnistie                0.032554                                  autonomie              0.142249
             amnesty                 0.006565                                  indépendant            0.032079
 asthma      asthme                  0.902453                   lead           mener                  0.128457
             asthma                  0.053307                                  conduire               0.076652
             atteindre               0.044240                                  amener                 0.066278
 car         voiturer                0.251941                   phone          téléphoner             0.419111
             automobile              0.214175                                  téléphonique           0.194628
             voiture                 0.160644                                  appeler                0.123656
 computer    informatique            0.438414                   prime          ministre               0.898780
             ordinateur              0.434168                                  chrétien               0.049399
             informatiser            0.021902                                  principal              0.003372
 conflict    conflit                 0.873595                   race           race                   0.598102
             guerre                  0.016048                                  courser                0.182982
             contradictoire          0.012773                                  racial                 0.053363
 currency    monnayer                0.455730                   rock           rock                   0.889616
             deviser                 0.108036                                  rocher                 0.015060
             devise                  0.106799                                  pierre                 0.010083
 fall        automne                 0.323739                   right          droit                  0.973834
             tomber                  0.081521                                  rights                 0.002897
             relever                 0.064848                                  charte                 0.001411
 film        film                    0.722327                   sanction       sanction               0.600641
             cinématographique       0.105770                                  sanctionner            0.147880
             cinéma                  0.058341                                  approuver              0.076667
 economic    économique              0.830063                   star           star                   0.624963
             économie                0.104932                                  étoile                 0.130342
             financier               0.010520                                  étoiler                0.077801


Table 14: English to French word translation probabilities estimated from parallel corpora using a statistical ma-
chine translation toolkit.

The French translations are ranked in descending order of the probability of translating the English word into the French word. In translating an English word into French, we selected only one French word, the one with the highest translation probability, as the translation. The English topics were translated into French word by word, and the translated French topics were then used to produce the English-to-French run labeled bky2bienfr5 in table 12. Without query expansion, the performance of the parallel-corpus-based English-French CLIR run was slightly better than that obtained using Babelfish, but slightly lower than that obtained using the L&H Translator.
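Given a probability table such as table 14, the word-by-word translation used for bky2bienfr5 amounts to the sketch below; the probabilities in the example are taken from table 14, while the function itself is an illustration.

    def translate_topic(en_tokens, translation_table):
        """Translate an English topic word by word, keeping for each word
        only the French translation with the highest probability; words
        without an entry are passed through unchanged."""
        translated = []
        for token in en_tokens:
            candidates = translation_table.get(token)
            translated.append(max(candidates, key=candidates.get) if candidates else token)
        return translated

    table = {
        "amnesty": {"amnistier": 0.960881, "amnistie": 0.032554},
        "fall": {"automne": 0.323739, "tomber": 0.081521},
    }
    print(translate_topic(["amnesty", "fall"], table))  # ['amnistier', 'automne']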
   The CLEF 2002 English topics contain a number of polysemous words, such as cup, fall, interest, lead, race, right, rock, star, and the like. The word fall in the context of fall in sale of cars in topic 106 has the meaning of declining. However, the most likely French translation for fall, as table 14 shows, is automne, meaning autumn in English. The word race in ski race in topic 102 or in race car in topic 121 has the meaning of a contest or competition of speed. Again, the French word with the highest translation probability is race, meaning human race in English. The correct French translation for the sense of race in ski race or car race should be course. The word star in topic 129 means a celestial body, while in topic 123, in the context of pop star, it means a famous performer. The correct translation for star in topic 129 should be étoile, instead of the most likely translation star, which is the correct French word for the sense of star in pop star. The word rock in topic 130 has the same sense as rock in rock music, not the sense of stone. The correct translation for rock in topic 130 should be rock. In the same topic, the word lead in lead singer means someone in the leading role, not the metal. These examples show that taking the French word with the highest translation probability as the translation for an English word is overly simplistic. Choosing the right French translations would require word sense disambiguation.

6.2.3 CLIR Using a Bilingual Dictionary
For the only English-to-Dutch run, bky2biennl, the English topics were translated into Dutch by looking up each English topic word, excluding stopwords, in the online English-Dutch dictionary Babylon (available at http://www.babylon.com). All the Dutch words in the dictionary lookup results were retained except for Dutch stopwords. The Dutch compound words were split into their component words. If translating an English topic word resulted in n Dutch words, then each of the n translated Dutch words received the same weight of 1/n; that is, the translated Dutch words were weighted uniformly. The average precision of the English-to-Dutch run is 0.3199, which is much lower than the 0.4847 achieved by Dutch monolingual retrieval.
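The uniform weighting can be sketched as follows; the dictionary entries in the example are hypothetical stand-ins for the Babylon lookup results.

    def translate_uniform(en_tokens, en_nl_dict, nl_stopwords=frozenset()):
        """Translate each English topic word via a bilingual dictionary and
        weight the resulting Dutch words uniformly: a word with n Dutch
        translations contributes weight 1/n to each of them."""
        weighted = {}
        for token in en_tokens:
            translations = [w for w in en_nl_dict.get(token, []) if w not in nl_stopwords]
            if not translations:
                continue
            weight = 1.0 / len(translations)
            for nl_word in translations:
                weighted[nl_word] = weighted.get(nl_word, 0.0) + weight
        return weighted

    # Hypothetical dictionary entries.
    en_nl = {"security": ["veiligheid", "beveiliging"], "computer": ["computer"]}
    print(translate_uniform(["computer", "security"], en_nl))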

6.3 Multilingual Retrieval Experiments
In this section, we describe our multilingual retrieval experiments using the English topics (only title and descrip-
tion fields were indexed). As mentioned in the cross-language experiments section above, we translated the English
topics into the other four document languages, namely French, German, Italian, and Spanish, using Babelfish and the L&H Translator. A separate index was created for each of the five document languages. For the multilingual retrieval runs, we merged five ranked lists of documents, one resulting from English monolingual retrieval and four resulting from cross-language retrieval from English to each of the other four document languages, to produce a unified ranked list of documents regardless of language.
   A fundamental difference between merging in monolingual retrieval or cross-language retrieval and merging
in multilingual retrieval is that in monolingual or cross-language retrieval, documents for individual ranked lists
are from the same collection, while in multilingual retrieval, the documents for individual ranked lists are from
different collections. For monolingual or cross-language retrieval, if we assume that documents appearing on more
than one ranked list are more likely to be relevant than the ones appearing on a single ranked list, then we should rank the documents appearing on multiple ranked lists at higher positions in the merged ranked list of documents. A simple way to accomplish this is to sum the probabilities of relevance for the documents appearing on multiple ranked lists, while the probabilities of relevance for the documents appearing on a single list remain unchanged.
After summing up the probabilities, the documents are re-ranked in descending order by combined probability of
relevance. In multilingual retrieval merging, since the documents on the individual ranked lists are all different,
we cannot use multiple appearances of a document in the ranked lists as evidence to promote its rank in the final
ranked list. The problem of merging multiple ranked lists of documents in multilingual retrieval is closely linked
to estimating probability of relevance. If the estimates of probability of relevance are accurate and well calibrated,
then one can simply combine the individual ranked lists and then re-rank the combined list by the raw probability
of relevance. In practice, estimating relevance probabilities is a hard problem.
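For the monolingual or cross-language case, this fusion can be sketched as below, assuming each run is a ranked list of (document id, estimated probability of relevance) pairs; the names are illustrative.

    def merge_same_collection(runs, depth=1000):
        """Merge ranked lists over the same collection by summing the
        estimated probabilities of relevance of documents appearing in
        more than one list, then re-ranking by the combined score."""
        combined = {}
        for run in runs:
            for doc_id, prob in run:
                combined[doc_id] = combined.get(doc_id, 0.0) + prob
        return sorted(combined.items(), key=lambda x: -x[1])[:depth]

    run1 = [("doc1", 0.9), ("doc2", 0.6)]
    run2 = [("doc2", 0.5), ("doc3", 0.4)]
    print(merge_same_collection([run1, run2]))  # doc2 (1.1) now outranks doc1 (0.9)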
   We looked at the estimated probabilities of relevance produced using the ranking formula described in section 2
for the CLEF 2001 topics to see if there is a linear relationship between the number of relevant documents and the
number of documents whose estimated probabilities of relevance are above some threshold. Figure 1 shows a scatter plot of the number of retrieved documents whose estimated relevance probabilities are above 0.37 versus the number of relevant documents for the same topic. Each dot in the figure represents one French topic. The ranked lists of documents were produced using the 50 French topics of CLEF 2001 to search the French collection with query expansion. The 10 top-ranked terms from the 10 top-ranked documents in the initial search were merged with the initial query to create the expanded query. The threshold of 0.37 was chosen so that the total number of documents
for all 50 topics whose estimated relevance probabilities are above the threshold is close to the total number of
relevant documents for the same set of topics. If the estimated probabilities are good, the dots in the figure would
appear along the diagonal line. The figure shows there is no linear relationship between the number of retrieved
documents whose relevance probabilities are above the threshold and the number of relevant documents for the
same topic. This implies one cannot use the raw relevance probabilities to directly estimate the number of relevant
documents for a topic in a test document collection.
   There are a few simple ways to merge ranked lists of documents from different collections. Here we will
evaluate two of them. The first method is to combine all ranked lists, sort the combined list by the raw relevance
score, then take the top 1000 documents per topic. The second method is to first normalize the relevance scores for each topic by dividing all relevance scores by the relevance score of the top-ranked document for the same topic. Table 15 presents the multilingual retrieval performance with the different merging strategies. The multilingual runs were produced by merging five runs: bky2moen (English-English, 0.5602), bky2bienfr (English-French,
0.4773), bky2biende (English-German, 0.4479), bky2bienit (English-Italian, 0.4090), and bky2bienes (English-
Spanish, 0.4567). The run bky2muen1 was produced by ranking the documents by the unnormalized relevance
Figure 1: Number of retrieved documents with estimated relevance probability above 0.37 versus the number of relevant documents for the same topic.

           run id           topic language     topic fields    merging strategy          recall         precision
           bky2muen1        English            title,desc      Direct merging            5880/8068      0.3762
           bky2muen2        English            title,desc      Normalized merging        5765/8068      0.3570

Table 15: Multilingual retrieval performance for different merging strategies. The five runs from which the multilingual runs were produced are bky2moen, bky2bienfr, bky2biende, bky2bienit, and bky2bienes.



probabilities after combining the individual runs. The run bky2muen2 was produced in the same way except that the relevance probabilities were normalized before merging: for each topic, the relevance probabilities of the documents were divided by the relevance probability of the highest-ranked document for the same topic. The simple direct merging outperformed the score-normalizing merging strategy. We did two things to make the relevance probabilities of documents from different language collections comparable to each other. First, as mentioned in the cross-language retrieval section above, after combining the topic translations from the two translators, we reduced the term frequencies by half so that the translated topics are close in length to the source English topics. Second, in query expansion, we took the same number of terms (i.e., 10) from the same number of top-ranked documents (i.e., 10) after the initial search for all five individual runs that were used to produce the multilingual runs.
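The two merging strategies compared in table 15 can be sketched as follows; the data layout (a per-topic dictionary of ranked lists for each language) and all names are assumptions made for the illustration.

    def merge_multilingual(runs_per_language, topic, normalize=False, depth=1000):
        """Merge per-language ranked lists for one topic. With
        normalize=False this is direct merging by raw relevance
        probability; with normalize=True each list is first divided by
        the probability of its own top-ranked document."""
        pooled = []
        for run in runs_per_language:
            ranked = run.get(topic, [])
            if not ranked:
                continue
            top = ranked[0][1] if normalize else 1.0
            if top <= 0:
                top = 1.0  # guard against degenerate scores
            # Document ids are unique across collections, so simple pooling is safe.
            pooled.extend((doc_id, prob / top) for doc_id, prob in ranked)
        pooled.sort(key=lambda x: -x[1])
        return pooled[:depth]

    # Hypothetical per-language runs for one topic.
    fr_run = {"C091": [("FR-1", 0.82), ("FR-2", 0.40)]}
    de_run = {"C091": [("DE-7", 0.55), ("DE-3", 0.30)]}
    print(merge_multilingual([fr_run, de_run], "C091"))
    print(merge_multilingual([fr_run, de_run], "C091", normalize=True))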
   In the remainder of this section, we present a procedure for computing the optimal performance that could possibly be achieved under the constraint that the relative ranking of the documents within each individual ranked list is preserved. The procedure assumes that the relevance of each document is known; thus it cannot be used to predict the ranks of documents in the final ranked list for multilingual retrieval. However, knowing the upper-bound performance for a set of ranked lists of documents and the associated document relevances is useful in measuring the
performance of different merging strategies. We will use an example to explain the procedure. Let us assume we are going to merge three runs labeled A, B, and C, as shown in table 16. The relevant documents are marked with an '*'. We want to find a combined ranked list such that the average precision is maximized without changing the relative rank order of the documents within each individual ranked list. First we transform the individual runs shown in table 16 into the form shown in table 17 by grouping consecutive irrelevant and relevant documents. Each entry in table 17 has the form (i, r) {dm, dm+1, ..., dn}, where dk is the id of the document ranked at the kth position in the original ranking, and {dm, dm+1, ..., dn} denotes a set of consecutive irrelevant and relevant documents
                                                              Run A    Run B    Run C
                                                     1        a1 *     b1       c1
                                                     2        a2       b2       c2 *
                                                     3        a3 *     b3 *     c3 *
                                                     4        a4       b4       c4 *


                        Table 16: Individual ranks. Relevant documents are marked with *.


ranked in positions m through n, inclusive; i is the number of irrelevant documents in the set, and r is the number of relevant documents in the set. For example, the entry (2,1) {b1, b2, b3} means that the set {b1, b2, b3} has two irrelevant documents, b1 and b2, and one relevant document, b3. After the transformation, the procedure can be

                          set   Run A                Run B                  Run C
                          1     (0,1) {a1}           (2,1) {b1, b2, b3}     (1,3) {c1, c2, c3, c4}
                          2     (1,1) {a2, a3}       (1,0) {b4}
                          3     (1,0) {a4}


                                     Table 17: Ranked lists after transformation.


implemented in four steps.
    Step 1: Let the active set consist of, for each individual ranked list, the first set that contains at least one relevant document. For the example presented in table 17, the initial active set is { (0,1) {a1}, (2,1) {b1, b2, b3}, (1,3) {c1, c2, c3, c4} }.

    Step 2: Choose the element in the active set with the smallest number of irrelevant documents. If there are two
or more elements with the smallest number of irrelevant documents, then choose the element that also contains
the largest number of relevant documents. If there are two or more elements with the same smallest number of
irrelevant documents and the same largest number of relevant documents in the current active set, randomly choose
one of them. Append the selected element to the final ranked list. If the set appearing immediately after the selected element in its original list contains at least one relevant document, then add that set to the current active set. In other words, sort the active set by i (the number of irrelevant documents) in increasing order as the primary key and by r (the number of relevant documents) in decreasing order as the secondary key, then remove the first element and append it to the final ranked list.
    Step 3: Repeat step 2 until the current active set is empty.
    Step 4: If the final ranked list has fewer than 1000 documents, append additional irrelevant documents drawn from any individual list to the final ranked list.
    The optimal ranking after reordering the sets is presented in table 18.

                                                                 set                Optimal ranking
                                                                 1                  (0,1) {a1}
                                                                 2                  (1,3) {c1, c2, c3, c4}
                                                                 3                  (1,1) {a2, a3}
                                                                 4                  (2,1) {b1, b2, b3}
                                                                 5                  (1,0) {a4}
                                                                 6                  (1,0) {b4}

                                                                 Table 18: Optimal ranking.


   The upper-bound average precision for the set of runs used to produce our official multilingual runs is 0.5177, with an overall recall of 6392/8068. The performance of both the direct merging and the score-normalizing merging strategies is far below the upper-bound performance that could possibly be achieved.
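The four-step procedure can be sketched in Python as follows, assuming the relevance of every document in every run is known; the function and variable names are ours. Applied to the example of tables 16 to 18, it reproduces the ranking in table 18.

    def group_blocks(doc_ids, relevance):
        """Group a ranked list into blocks of consecutive irrelevant
        documents followed by consecutive relevant documents; each block
        is a (n_irrelevant, n_relevant, docs) triple."""
        blocks, docs, n_irr, n_rel = [], [], 0, 0
        for doc_id, rel in zip(doc_ids, relevance):
            if not rel and n_rel:
                # An irrelevant document after relevant ones closes the block.
                blocks.append((n_irr, n_rel, docs))
                docs, n_irr, n_rel = [], 0, 0
            if rel:
                n_rel += 1
            else:
                n_irr += 1
            docs.append(doc_id)
        if docs:
            blocks.append((n_irr, n_rel, docs))
        return blocks

    def optimal_merge(runs, max_docs=1000):
        """Merge runs (lists of (doc_id, is_relevant) pairs) into the ranking
        with the highest attainable average precision that preserves the
        internal order of each run."""
        block_lists = [group_blocks([d for d, _ in run], [r for _, r in run])
                       for run in runs]
        positions = [0] * len(block_lists)   # index of the next unused block per run
        active = []                          # entries: (n_irr, -n_rel, run index)
        for i, blocks in enumerate(block_lists):
            if blocks and blocks[0][1] > 0:  # first block contains a relevant document
                active.append((blocks[0][0], -blocks[0][1], i))
                positions[i] = 1
        merged = []
        while active:
            active.sort()                    # fewest irrelevant, then most relevant
            _, _, i = active.pop(0)
            merged.extend(block_lists[i][positions[i] - 1][2])
            if positions[i] < len(block_lists[i]) and block_lists[i][positions[i]][1] > 0:
                nxt = block_lists[i][positions[i]]
                active.append((nxt[0], -nxt[1], i))
                positions[i] += 1
        # Step 4: pad with the remaining (irrelevant-only) blocks.
        for i, blocks in enumerate(block_lists):
            for blk in blocks[positions[i]:]:
                merged.extend(blk[2])
        return merged[:max_docs]

    # The example of tables 16 to 18.
    run_a = [("a1", True), ("a2", False), ("a3", True), ("a4", False)]
    run_b = [("b1", False), ("b2", False), ("b3", True), ("b4", False)]
    run_c = [("c1", False), ("c2", True), ("c3", True), ("c4", True)]
    print(optimal_merge([run_a, run_b, run_c]))
    # ['a1', 'c1', 'c2', 'c3', 'c4', 'a2', 'a3', 'b1', 'b2', 'b3', 'a4', 'b4']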


7    Conclusions
We have presented a technique for incorporating blind relevance feedback into a document ranking formula based
on logistic regression analysis. The improvement in average precision brought by query expansion via blind rel-
evance feedback ranges from 6.42% to 19.42% for monolingual retrieval runs, and from 10.85% to 29.36% for
cross-language retrieval runs. We have also presented a procedure to decompose German compounds and Dutch
compounds. German decompounding improved the average precision of German monolingual retrieval by 11.47%. Decompounding also increased the average precision for cross-language retrieval to German from English or French; the increase ranges from 8.4% to 13.78%. For Dutch monolingual retrieval, decompounding increased the average precision by 4.10%, which is much lower than the improvement of 13.49% on the CLEF 2001 test set. In sum-
mary, both blind relevance feedback and decompounding in German or Dutch have been shown to be effective in
monolingual and cross-language retrieval. The amount of improvement from decompounding varies from one set of topics to another. Three types of translation resources, namely machine translation systems, parallel corpora, and bilingual dictionaries, were evaluated for cross-language retrieval. We found that the English-French CLIR performance using parallel corpora was competitive with that of the commercial machine translation systems. Two merging strategies for multilingual retrieval were evaluated. The simplest strategy, which merges the individual ranked lists of documents by unnormalized relevance score, worked better than the strategy that first normalizes the relevance
score. To make the relevance scores of the documents from different collections as closely comparable as possible,
we selected the same number of terms from the same number of top-ranked documents after the initial search for
query expansion in all the runs that were combined to produce the unified ranked lists of documents in multiple
languages. We used two machine translation systems to translate the English topics into French, German, Italian, and Spanish, and combined the translations from the two systems by topic. We reduced the term frequencies in the combined translated topics by half so that the combined translated topics are close in length to the source English topics.
We presented an algorithm for generating the optimal ranked list of documents when the document relevances are
known. The optimal performance can then be used to measure the performances of different merging strategies.


8    Acknowledgments
We would like to thank Vivien Petras for improving the German base dictionary. This research was supported
by DARPA under research grant N66001-00-1-8911 (PI: Michael Buckland) as part of the DARPA Translingual
Information Detection, Extraction, and Summarization Program (TIDES).


References
 [1] Y. Al-Onaizan et al. Statistical machine translation, final report, JHU workshop, 1999.
 [2] P. F. Brown, S. A. D. Pietra, V. J. D. Pietra, and R. L. Mercer. The mathematics of statistical machine translation:
     Parameter estimation. Computational linguistics, 19:263–312, June 1993.
 [3] A. Chen. Multilingual information retrieval using English and Chinese queries. In C. Peters, M. Braschler, J. Gonzalo,
     and M. Kluck, editors, Evaluation of Cross-Language Information Retrieval Systems: Second Workshop of the Cross-
     Language Evaluation Forum, CLEF-2001, Darmstadt, Germany, September 2001, pages 44–58. Springer Computer
     Science Series LNCS 2406, 2002.
 [4] W. S. Cooper, A. Chen, and F. C. Gey. Full text retrieval based on probabilistic equations with coefficients fitted by
     logistic regression. In D. K. Harman, editor, The Second Text REtrieval Conference (TREC-2), pages 57–66, March 1994.
 [5] D. Karp, Y. Schabes, M. Zaidel, and D. Egedi. A freely available wide coverage morphological analyzer for English. In
     Proceedings of COLING, 1992.
 [6] A. Fox. The Structure of German. Clarendon Press, Oxford, 1990.
 [7] W. A. Gale and K. W. Church. A program for aligning sentences in bilingual corpora. Computational linguistics, 19:75–
     102, March 1993.
 [8] G. Grefenstette, editor. Cross-language information retrieval. Kluwer Academic Publishers, Boston, MA, 1998.
 [9] D. Harman. Relevance feedback and other query modification techniques. In W. Frakes and R. Baeza-Yates, editors,
     Information Retrieval: Data Structures & Algorithms, pages 241–263. Prentice Hall, 1992.
[10] W. Lezius, R. Rapp, and M. Wettler. A freely available morphological analyzer, disambiguator and context sensitive
     lemmatizer for German. In COLING-ACL’98, pages 743–748, 1998.
[11] F. J. Och and H. Ney. A comparison of alignment models for statistical machine translation. In COLING00, pages
     1086–1090, Saarbrücken, Germany, August 2000.
[12] C. Peters, editor. Cross Language Information Retrieval and Evaluation: Proceedings of the CLEF 2000 Workshop.
     Springer Computer Science Series LNCS 2069, 2001.
[13] C. Peters, editor. Working Notes of the CLEF 2001 Workshop, 3 September 2001, Darmstadt, Germany.
[14] S. E. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information
     Science, pages 129–146, May–June 1976.
[15] E. Voorhees and D. Harman, editors. The Seventh Text Retrieval Conference (TREC-7). NIST, 1998.
[16] E. Voorhees and D. Harman, editors. The Eighth Text Retrieval Conference (TREC-8). NIST, 1999.