Resolving Translation Ambiguity using Monolingual Corpora
A Report on Clairvoyance CLEF-2002 Experiments

Yan Qu, Greg Grefenstette, David A. Evans
Clairvoyance Corporation
5001 Baum Boulevard, Suite 700, Pittsburgh, PA 15213-1854
{y.qu, g.grefenstette, dae}@clairvoyancecorp.com

Abstract

Choosing the correct target words is a difficult problem for machine translation. In cross-language information retrieval, this problem of choice is mitigated, since more than one translation alternative can be retained in the target query. Between choosing just one word as a translation and keeping all the possible translations for each source word, one can apply a range of filtering techniques for eliminating some words and keeping others. In the bilingual track of CLEF 2002, focusing on word translation ambiguity, we experimented with several techniques for choosing the best target translation for each source query word by using co-occurrence statistics in a reference corpus consisting of documents in the target language. One of two distinct corpora was used: the target-language test corpus or the World Wide Web. Our techniques give one best translation per source query word. We also experimented with combining these word choice results (providing up to three translations for each word) in the final translated query. The source query languages were Spanish and Chinese; the target language documents were in English. We submitted four automatic runs for each language pair. When the methods were combined, mixing results obtained with different reference corpora, the recall and average precision of Spanish-to-English retrieval reached 95% and 97%, respectively, of the recall and average precision of an English monolingual retrieval run. For Chinese-to-English text retrieval, the recall and average precision reached 89% and 60%, respectively, of the English run.

1. Introduction

Choosing among lexical or translation variants to find correct target words is a difficult problem for machine translation. In cross-language information retrieval (CLIR), the problem of choosing a single best translation is, in theory, less critical, since more than one translation alternative can frequently be retained in the target query with only minimal harm to retrieval performance. However, there is still a range of possible performance tradeoffs between choosing just one word as a translation and keeping all the possible translations for each source word (Grefenstette 1998). To exploit the possible advantages of limiting the alternative translations to "a few (or one) best", one can apply a range of filtering techniques for eliminating some words and keeping others.

In the bilingual track of CLEF 2002, we focused specifically on this problem of word translation ambiguity and experimented with several techniques for choosing the best target translation for each source query word, using co-occurrence statistics over target-language reference corpora. In particular, we derived statistics from two different corpora: the target-language (English) test corpus and the World Wide Web. Our techniques give one "best" translation per source query word per reference corpus. We also experimented with combining these word choice results (providing up to three translations for each word) in the final translated query. The source query languages were Spanish and Chinese; the target language documents were in English. We submitted four automatic runs for each language pair.
In this report, we describe our translation disambiguation methods and present their performance results.

2. CLARIT Cross-Language Information Retrieval

For cross-language information retrieval, both the documents and the queries must be represented in the same language at some point in the process (Oard & Dorr 1996). In our experiments, we adopted the query translation approach. First, the query terms in the source language were translated into all possible terms in the target language using translation lexicons. Some of these translations were then retained according to the methods described below. The retained terms in the target language (English, here) were used for retrieving documents from the target collection. For all document processing (including query and document indexing and retrieval) we used the CLARIT system (Evans & Lefferts 1995), in particular, the functions for NLP (morphological analysis and phrase recognition), IR (term weighting and phrasal decomposition), and "thesaurus extraction" (for effecting pseudo-relevance feedback).

2.1 Spanish Topic Processing and Translation

Spanish queries were processed as follows. The text of the Spanish query was tokenized and morphologically analyzed using a Spanish version of CLARIT. Only nouns, verbs, adjectives, and adverbs were retained for further treatment. Some additional words were removed via a stop list containing a total of 400 words. This list includes all prepositions, pronouns, and articles (already removed by the morphological analyzer); common stop words such as "es" and "cada"; and query meta-language from previous CLEF queries, such as "describir" and "discutir".

As an example of our processing, consider the Spanish query on the Leaning Tower of Pisa (Topic 136):

  Torre inclinada de Pisa. ¿En qué estado se encuentra la torre inclinada de Pisa?

After morphological analysis and stop word removal, this query becomes: torre, inclinar, pisa, estado, torre, inclinar, pisa. Each of these words is then looked up in a Spanish–English word-to-word dictionary, which contains the following translations:

| estado | inclinar  | pisa | torre  |
|--------|-----------|------|--------|
| state  | apt       | pisa | high   |
| states | bow       |      | tower  |
| statis | drooping  |      | towers |
|        | incline   |      |        |
|        | inclined  |      |        |
|        | inclining |      |        |
|        | sloping   |      |        |
|        | stooping  |      |        |
|        | titling   |      |        |
|        | verging   |      |        |

We created our Spanish–English gloss lexicon by combining various lexicons available on the Web. The final collation was not manually edited; stop words were automatically removed from the English translations (unless the only translation was a stop word), and the Spanish side of the dictionary was lemmatized (e.g., an original gloss such as "inclinado—apt to" was reduced to "inclinar—apt" in our experimental version). If a source word was not found in the dictionary, the original source word was retained in lieu of a translation. The resulting translations formed the basis of the English queries generated by the methods described below.

We note two limitations of our technique: (1) the dictionaries are neither clean nor complete; notice, for instance, that "leaning" is missing from the above translations of "inclinar". (2) We restricted ourselves to word-to-word translations for engineering reasons, even though we know that phrasal translation yields superior performance in CLIR (Hull & Grefenstette 1996). If we were to go beyond a research version of our system, investments in reducing these limitations would need to be made.
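To make the lookup step concrete, the following is a minimal sketch of dictionary lookup with pass-through for unknown words, as described above. The lexicon shown is a toy fragment built from the example; the function name translate_word is ours, not a CLARIT API.

```python
# Toy fragment of a Spanish-English gloss lexicon (illustrative only).
ES_EN_GLOSSES = {
    "torre": ["high", "tower", "towers"],
    "inclinar": ["apt", "bow", "drooping", "incline", "sloping"],
    "estado": ["state", "states"],
}

def translate_word(source_word, lexicon):
    """Return all candidate translations for a source word; if the word
    is not in the lexicon, retain the source word itself (pass-through)."""
    return lexicon.get(source_word, [source_word])

query = ["torre", "inclinar", "pisa", "estado"]
candidates = {w: translate_word(w, ES_EN_GLOSSES) for w in query}
# "pisa" is absent from this toy lexicon, so it passes through unchanged.
print(candidates)
```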
2.2 Chinese Topic Processing and Translation

Since Chinese text does not use spaces to mark word boundaries, we first segmented the Chinese text into individual words. We used the longest-match method, which greedily recognizes an initial string of characters as a word if the string matches a word in the segmentation dictionary. We obtained a Chinese-to-English wordlist from the Linguistic Data Consortium (LDC; http://www.ldc.upenn.edu/Projects/Chinese/LDC_ch.htm#e2cdict). This bilingual wordlist contains Chinese words together with their possible English translations, a total of 188,474 entries. We did not edit or further "clean" the lexicon. Our run-time segmentation dictionary consisted of the Chinese words from this wordlist, augmented with all possible single Chinese characters and symbols. By using the words from the bilingual wordlist, we ensured that the words identified during segmentation would have glosses during translation.

Once we obtained the segmented Chinese words, we first removed stop words automatically via a stop word list. The list contains a total of 3,894 entries, including closed-class words (e.g., symbols, prepositions, pronouns, particles), numerals, and query-specific terms collected from CLEF 2001 topics (the Chinese examples are not reproduced here). Then we translated the remaining words into English using the LDC bilingual wordlist. Lastly, we used the translation disambiguation methods described below to select the best translation for each query word.

As an example of our processing, consider the Chinese version of the Leaning Tower of Pisa query (Topic 136); the Chinese characters in this and the following examples are not reproduced here. The query is first segmented into words; after stop-word removal, it contains five unique words, corresponding to "Pisa", "leaning", "tower", "health", and "state". Each of these words is then looked up in the Chinese–English bilingual wordlist, which yields the following translations (Chinese headwords omitted):

| "Pisa" | "leaning" | "tower" | "health"    | "state"           |
|--------|-----------|---------|-------------|-------------------|
| pisa   | askant    | ter     | hygeia      | circumstance      |
|        | slanting  | pagoda  | exuberance  | circumstantiality |
|        |           | tower   | health      | situation         |
|        |           |         | healthiness | state             |
|        |           |         |             | affairs           |

We should note several problems with the longest-match method that can cause segmentation errors. First, the greedy algorithm may break words in the wrong places when word boundaries are ambiguous. Generally, a wrong segmentation will trigger further segmentation errors downstream. Second, the coverage of the dictionary affects segmentation quality. Missing dictionary words result in single characters being generated during segmentation. For bilingual retrieval, this not only reduces term quality but also increases ambiguity, since single characters are generally more ambiguous than multi-character words. This is especially a problem with proper names, where the meanings of the single characters in a name bear no relation to the meaning of the name. In the pre-processing of topics, we eliminated any sequence of more than three consecutive single characters. This helped reduce translation noise, but without proper handling of names, we still lost the specific information associated with the deleted characters, which tended to render the topics too general.
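The longest-match segmentation we used can be sketched as follows. This is a simplified illustration rather than our production code; the max_len bound is an assumption, and the single-character fallback mirrors the fact that our run-time dictionary contained all single characters.

```python
def longest_match_segment(text, dictionary, max_len=8):
    """Greedy longest-match segmentation: at each position, take the
    longest dictionary word starting there; if none matches, emit a
    single character (always possible, since the run-time dictionary
    includes all single characters)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Toy illustration with Latin letters standing in for Chinese characters:
# with "ab" and "abc" both in the dictionary, "abcd" segments greedily.
print(longest_match_segment("abcd", {"ab", "abc"}))  # ['abc', 'd']
```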
3. Translation Disambiguation Methods

Given that a source language term can have more than one possible target translation, we want to find the best, or the best few, translations for each non-stop word in the source topic.

Our methods for disambiguating among alternative possible translations are based on two observations:

• The correct translation for a query term, given its potential translations in the target language, is generally not ambiguous when context (i.e., the other terms in the query) is considered.
• The Web and reference corpora can be used as practical resources for estimating the coherence of the translated terms. Each provides a language model of how words co-occur. In particular, we expect that words that are found to co-occur are lexically-semantically cohesive.

For example, suppose the query terms s1, ..., s5 in the source language have the following translations in the target language:

| s1  | s2  | s3  | s4  | s5  |
|-----|-----|-----|-----|-----|
| t11 | t21 | t31 | t41 | t51 |
| t12 | t22 | t32 |     | t52 |
| t13 |     | t33 |     |     |
|     |     | t34 |     |     |

Term s1 has three possible translations: t11, t12, and t13. A context for t11 can be constructed as one of the possible sequences including the other translations in the target language, such as (t11, t21, t31, t41, t51). In this example, there are a total of 3 × 2 × 4 × 1 × 2 (i.e., 48) possible sequences, or paths, through the translation space. Each path establishes a context for the translated terms with respect to their neighbors. We assume that the best path of all the combinations will demonstrate the best coherence among the translated terms. We have developed several practical methods to measure the quality (or coherence) of translation paths based on evidence of actual word co-occurrence in reference corpora. One of these methods takes advantage of the World Wide Web (WWW), and two use the actual target test collection, as described in the following sections.

3.1 Web Method

The Web method is an elaboration of the ideas explored by Grefenstette (1999), namely, using the WWW as the language model for choosing translations. The idea of using the Web to acquire general language models is becoming more popular (see also Zhu & Rosenfeld 2001 for an application to speech). To exploit the Web, we first create sequences of possible translations. Each sequence is sent to a popular Web portal (here, AltaVista) to discover how often the combination of translations appears. The number of occurrences of a translation sequence is used as the score for the sequence. The complete algorithm is as follows:

1. Get translations for each term in the source language query.
2. Construct a hypothesis space of translated sequences (overlapping n-grams, n=3 in our experiments) by obtaining all possible combinations of the translations in a source sequence of n query words.
3. For each translated sequence in the hypothesis space:
   3.1 Send it to a Web portal (e.g., AltaVista);
   3.2 Take the number of pages on which that translated sequence occurs as the coherence score for the sequence;
   3.3 Select for each source word the translation with the best coherence score.
4. Collate the selected translations into a new target language query.

Since word order is not preserved from one language to another, the query sent to the Web uses an operator that enforces the presence of the translated sequence but not its order. AltaVista's advanced search supports the operator NEAR, which ensures that the words so linked appear within ten words of each other, in any order. We use this operator to calculate the score of a sequence. Since Web searches do not stem search terms, we expanded each translation by linking all surface forms of the search term with OR. For example, if the translated sequence being scored was "big black dogs", then the following advanced AltaVista query was generated:

  (big OR bigger OR biggest) NEAR (black OR blacks OR blacked OR blacking OR blackest) NEAR (dog OR dogs)
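A minimal sketch of this query construction is given below. The surface_forms helper is hypothetical; in our system the surface forms came from morphological generation, and the small table here merely reproduces the example above.

```python
def surface_forms(word):
    """Hypothetical stand-in for morphological generation of surface forms."""
    forms = {
        "big": ["big", "bigger", "biggest"],
        "black": ["black", "blacks", "blacked", "blacking", "blackest"],
        "dog": ["dog", "dogs"],
    }
    return forms.get(word, [word])

def altavista_near_query(translations):
    """OR together the surface forms of each translation, then link the
    groups with NEAR so the sequence must co-occur in any order."""
    groups = ["(" + " OR ".join(surface_forms(w)) + ")" for w in translations]
    return " NEAR ".join(groups)

print(altavista_near_query(["big", "black", "dog"]))
# (big OR bigger OR biggest) NEAR (black OR ... ) NEAR (dog OR dogs)
```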
The Web method has the advantage of a massive reference corpus: most candidate translations (paths) will return some "hits", and the number of hits for meaningful combinations of words will typically be much greater than for meaningless ones. It has the potential disadvantage that the texts on the Web may bear little or no direct relation to the texts (or domain) of the target search. In theory, a narrowly focused target search (e.g., in a technical domain reflected by many documents concentrated in the database to be searched) might be under-represented in the Web corpus compared to alternative, more common documents. To mitigate this possibility, we also explored two methods that use only the target texts for co-occurrence evidence.

3.2 Target Corpus Methods

An alternative reference corpus for language modeling, particularly for modeling the coherence of combinations of translation alternatives, is the target corpus itself. We use the target corpus as the basis for choosing a "best" translation of a query, exploiting approaches developed by Evans (2000; 2001). We implemented two target-corpus methods ("Corpus1" and "Corpus2").

3.2.1 Corpus1

The Corpus1 method has the following steps:

1. Get translations for each term in the query.
2. Construct a hypothesis space of translated queries by obtaining all possible combinations of the translations.
3. For each translated sequence in the hypothesis space:
   3.1 Send it to the target database;
   3.2 Compute the sum of the similarity scores of the top N retrieved documents as the coherence score of the sequence.
4. Select the sequence (or sequences) with the best coherence score.

The Corpus1 method computes the coherence score for every path in the hypothesis space. This can be computationally expensive when the query terms have many possible translations. In our experiments, we reduced the hypothesis space by using a maximum of three translations per query term. In cases where there were more than three alternative translations, we chose the three terms with the smallest distribution scores in the target corpus. For summing similarity scores, we set N to 100.

3.2.2 Corpus2

The Corpus2 method makes use of the mutual information of term pairs, based on corpus statistics. The method works as follows:

1. Get translations for each term in the source language query.
2. Construct a hypothesis space of translated sequences (overlapping n-grams, n=3 in our experiments) by obtaining all possible combinations of the translations in a source sequence of n query words.
3. For each translated sequence in the hypothesis space:
   3.1 Compute mutual information (MI) scores for all term pairs in the sequence;
   3.2 Sum the scores from step 3.1 to give a coherence score for the sequence;
   3.3 Select as the translation of the first source word in the sequence the alternative that gives the best coherence score.
4. Collate the selected translations into a new target language query.

The mutual information between two terms t1 and t2 is defined as:

  $\mathrm{MI}(t_1, t_2) = \log \frac{p(t_1, t_2)}{p(t_1)\, p(t_2)}$

3.3 An Illustrative Example

For Topic 136 (as presented in Section 2), there are 3 × 10 × 1 × 3 (i.e., 90) possible ways the Spanish-to-English translations can be combined. For Chinese, the number is 1 × 2 × 3 × 4 × 5 (i.e., 120).
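As an illustration, the following sketch enumerates such a translation space and scores each path by summed pairwise MI. For brevity, it scores whole paths rather than the overlapping trigrams we actually used, and the add-one smoothing of the document-level counts is an assumption, not our exact estimator.

```python
import math
from itertools import product

def mi(t1, t2, pair_df, term_df, n_docs):
    """Pointwise MI from document frequencies:
    MI(t1, t2) = log [ p(t1, t2) / (p(t1) p(t2)) ], with add-one smoothing."""
    p12 = (pair_df.get((t1, t2), pair_df.get((t2, t1), 0)) + 1) / n_docs
    p1 = (term_df.get(t1, 0) + 1) / n_docs
    p2 = (term_df.get(t2, 0) + 1) / n_docs
    return math.log(p12 / (p1 * p2))

def best_path(candidates, pair_df, term_df, n_docs):
    """Enumerate every path through the translation space (90 paths for the
    Spanish example, 120 for the Chinese) and keep the most coherent one."""
    best, best_score = None, float("-inf")
    for path in product(*candidates):
        score = sum(mi(a, b, pair_df, term_df, n_docs)
                    for i, a in enumerate(path) for b in path[i + 1:])
        if score > best_score:
            best, best_score = path, score
    return best

# Toy counts (illustrative): "pisa" co-occurring with "tower" should beat
# the alternatives "pagoda" and "ter".
cands = [["pisa"], ["slant", "askant"], ["tower", "pagoda", "ter"]]
pairs = {("pisa", "tower"): 40, ("slant", "tower"): 25, ("pisa", "slant"): 10}
terms = {"pisa": 50, "slant": 60, "askant": 2,
         "tower": 300, "pagoda": 40, "ter": 5}
print(best_path(cands, pairs, terms, n_docs=100000))
```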
Here, we give the target translations for the source query terms as determined by the above three translation disambiguation methods. (This reflects a "combined" method in which all the "best" terms of each method are retained for the final translated query.) In cases where the methods produced different "best" translations, we separate the candidate translations with commas:

Topic 136
  English: Leaning Tower of Pisa. What is the state of health of the Leaning Tower of Pisa?
  English terms: lean tower; pisa; lean; tower; health; state.
  Spanish: Torre inclinada de Pisa. ¿En qué estado se encuentra la torre inclinada de Pisa?
  English translations of the Spanish query terms: pisa; high, tower; state; droop, tilt, bow.
  Chinese: (characters not reproduced here)
  English translations of the Chinese query terms: pisa; ter, tower; slant; health; affair, situation; health situation.

For our actual submissions reflecting a particular method (e.g., Web), we naturally used only those "best" terms that the method itself nominated. In addition to our official runs, our experiments included a combined method that uses the full set of terms (as given above) for the final target query. We describe the performance results of the combined method along with our official and baseline results in the next section.

4. Experiments

Our CLIR experiment labels follow the convention Cl<source>2<target><method>; thus, "Cles2enw" denotes our Spanish-to-English Web-method run. The actual experiments involved separate runs for each of the three translation disambiguation methods described in Section 3 (the "w", "t1", and "t2" runs), as well as runs with combined methods ("c1", based on a combination of the Web and Corpus1 methods, and "c2", based on all three methods). In all cases, combination involved a simple concatenation of the translations nominated by each participating method. To establish a baseline for evaluating the quality of translation disambiguation, we used all possible translations in a default run ("all"). We also ran English monolingual experiments to obtain the baseline with ideal translations.

All the experiments were run with post-translation pseudo-relevance feedback, as we have observed that post-translation pseudo-relevance feedback produces the best overall performance boost (Qu et al. 2000). The feedback-related parameters were based on calibration runs using CLEF 2001 topics. The settings for Spanish-to-English retrieval were: extract T=50 terms from the top N=25 retrieved documents, with an additional term cutoff percentage set to P=0.1. For Chinese-to-English retrieval: T=50, N=25, P=0.01. For English monolingual retrieval: T=75, N=50, P=0.8. We used a variation of Rocchio weighting to identify terms for selection.
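To make the feedback step concrete, here is a rough sketch of pseudo-relevance feedback term selection under these parameters. The weighting shown is plain pooled frequency standing in for our Rocchio variant, and the interpretation of the cutoff percentage P (as a fraction of the top term's weight) is an assumption for illustration.

```python
from collections import Counter

def feedback_terms(ranked_docs, T=50, N=25, P=0.1):
    """Pool terms from the top N retrieved documents, rank them by weight
    (raw pooled frequency here; our system used a Rocchio variant), keep
    at most T, and drop terms whose weight falls below the fraction P of
    the top-ranked term's weight."""
    pool = Counter()
    for doc_terms in ranked_docs[:N]:   # each doc is a list of its terms
        pool.update(doc_terms)
    ranked = pool.most_common(T)
    if not ranked:
        return []
    threshold = P * ranked[0][1]
    return [term for term, weight in ranked if weight >= threshold]
```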
All runs were automatic. All the queries used the title and description fields (Title + Description) of the topics provided by CLEF 2002. The results presented below are based on relevance judgments for 42 topics. The topics not evaluated include 93, 96, 101, 110, 117, 118, 127, and 132, as these were not listed among the official results (presumably because they have no relevant documents in the target corpus).

Table 1 and Table 2 give the results for our submitted runs, together with our other experimental runs for comparative analysis. For Spanish-to-English cross-language retrieval, compared with the baseline of keeping all possible translations, both the Web-based and the corpus-based methods improve average precision, by 2.5% to 14.4%. The Web method and the Corpus2 method improve exact precision by 17.2% and 5.0%, respectively. Overall recall decreases for the corpus-based methods, while it improves slightly (0.6%) for the Web run. By combining the methods, overall recall, average precision, and exact precision all improve over the baseline. The combination of all three disambiguation methods produces the best average precision and exact precision, achieving 97.3% and 97.9% of the average precision and exact precision of the English monolingual retrieval results. The best recall is achieved by combining the Web method and the Corpus1 method, reaching 95.7% of that of English monolingual retrieval.

| Run ID | Method | Recall (over baseline) | AP (over baseline) | EP (over baseline) |
|--------|--------|------------------------|--------------------|--------------------|
| Cles2enw | Web | 720/821 (+0.6%) | 0.3502 (+14.4%) | 0.3399 (+17.2%) |
| Cles2ent1 | Corpus1 | 664/821 (-7.3%) | 0.3137 (+2.5%) | 0.2871 (-1.0%) |
| Cles2ent2 | Corpus2 | 706/821 (-1.4%) | 0.3310 (+8.2%) | 0.3046 (+5.0%) |
| Cles2enc1 | Web, Corpus1 | 750/821 (+4.8%) | 0.3478 (+13.7%) | 0.3276 (+12.9%) |
| Cles2enc2 | Web, Corpus1, Corpus2 | 744/821 (+3.9%) | 0.3583 (+17.1%) | 0.3441 (+18.6%) |
| Cles2enall (baseline) | All possible translations | 716/821 | 0.3060 | 0.2901 |
| English | Original English topics | 784/821 | 0.3682 | 0.3514 |

Table 1: Spanish-to-English retrieval performance with post-translation pseudo-relevance feedback. The runs in boldface in the original report are our submitted runs.

For Chinese-to-English cross-language retrieval, compared with the bilingual baseline of keeping all possible translations, both the Web-based method and the Corpus2 method improve average precision and exact precision, while only the Corpus2 method improves recall. The Corpus1 method did not perform well compared with the baseline. Again, when the methods are combined, we observe improvements in overall recall, average precision, and exact precision. The best run, with all three translation disambiguation methods combined, reached 89.0%, 59.7%, and 62.9% of the recall, average precision, and exact precision, respectively, of the English monolingual run.

| Run ID | Method | Recall (over baseline) | AP (over baseline) | EP (over baseline) |
|--------|--------|------------------------|--------------------|--------------------|
| Clch2enw | Web [2] | 591/821 (-9.2%) | 0.1795 (+2.8%) | 0.1752 (+3.2%) |
| Clch2ent1 | Corpus1 | 558/821 (-14.3%) | 0.1322 (-24.3%) | 0.1262 (-25.6%) |
| Clch2ent2 | Corpus2 | 655/821 (+0.6%) | 0.1936 (+10.9%) | 0.1853 (+9.2%) |
| Clch2enc1 | Web, Corpus1 [3] | 653/821 (+0.3%) | 0.1858 (+6.4%) | 0.1841 (+8.5%) |
| Clch2enc2 | Web, Corpus1, Corpus2 | 698/821 (+7.2%) | 0.2199 (+25.9%) | 0.2209 (+30.2%) |
| Clch2enall (baseline) | All possible translations | 651/821 | 0.1746 | 0.1697 |
| English | Original English topics | 784/821 | 0.3682 | 0.3514 |

Table 2: Chinese-to-English retrieval performance with post-translation pseudo-relevance feedback. The runs in boldface in the original report are our submitted runs.

[2] The statistics reported here are higher than the official evaluation statistics; in the official submission, we did not filter out the consecutive single Chinese characters (>3) for this run.
[3] The statistics reported here are higher than the official evaluation statistics, since we fixed a formatting bug in the official submission.

Since pseudo-relevance feedback can affect performance differently depending on the original query terms, in our follow-up experiments we re-computed the results without pseudo-relevance feedback, to better estimate the quality of the selected translations against the baseline. Table 3 and Table 4 give the results without feedback for both language pairs.
| Run ID | Method | Recall (over baseline) | AP (over baseline) | EP (over baseline) |
|--------|--------|------------------------|--------------------|--------------------|
| Cles2enw-nf | Web | 684/821 (+1.3%) | 0.2940 (+27.5%) | 0.2776 (+19.2%) |
| Cles2ent1-nf | Corpus1 | 620/821 (-8.2%) | 0.2608 (+13.1%) | 0.2417 (+3.8%) |
| Cles2ent2-nf | Corpus2 | 679/821 (+0.6%) | 0.3035 (+31.6%) | 0.3087 (+32.6%) |
| Cles2enc1-nf | Web, Corpus1 | 702/821 (+4.0%) | 0.3110 (+34.9%) | 0.2948 (+26.6%) |
| Cles2enc2-nf | Web, Corpus1, Corpus2 | 695/821 (+3.0%) | 0.3079 (+33.5%) | 0.2955 (+26.9%) |
| Cles2enall-nf (baseline) | All possible translations | 675/821 | 0.2306 | 0.2328 |
| English-nf | Original English topics | 770/821 | 0.3331 | 0.3156 |

Table 3: Spanish-to-English retrieval performance without pseudo-relevance feedback.

| Run ID | Method | Recall (over baseline) | AP (over baseline) | EP (over baseline) |
|--------|--------|------------------------|--------------------|--------------------|
| Clch2enw-nf | Web | 521/821 (-11.5%) | 0.1547 (+20.8%) | 0.1630 (+27.5%) |
| Clch2ent1-nf | Corpus1 | 534/821 (-9.3%) | 0.1094 (-14.6%) | 0.1026 (-19.7%) |
| Clch2ent2-nf | Corpus2 | 575/821 (-2.4%) | 0.1510 (+17.9%) | 0.1460 (+14.2%) |
| Clch2enc1-nf | Web, Corpus1 | 588/821 (-0.2%) | 0.1556 (+21.5%) | 0.1608 (+25.8%) |
| Clch2enc2-nf | Web, Corpus1, Corpus2 | 613/821 (+4.1%) | 0.1761 (+37.5%) | 0.1858 (+45.4%) |
| Clch2enall-nf (baseline) | All possible translations | 589/821 | 0.1281 | 0.1278 |
| English-nf | Original English topics | 770/821 | 0.3331 | 0.3156 |

Table 4: Chinese-to-English retrieval performance without pseudo-relevance feedback.

Regardless of whether pseudo-relevance feedback is used, Chinese–English retrieval is poor overall. Besides translation ambiguity, the bilingual translation lexicon is very noisy, with many wrong word choices, occasional misspellings, and interfering descriptive text. In addition, incomplete segmentation dictionary coverage and the greedy longest-match segmentation algorithm produced wrong word segmentations.

For Spanish–English cross-language retrieval, all three translation disambiguation methods outperform the baseline in terms of recall, average precision, and exact precision (except recall with the Corpus1 method). The combinations of the methods outperform any individual method participating in the combinations. For Chinese–English cross-language retrieval, both the Web method and the Corpus2 method improve average precision and exact precision, while the Corpus1 method performs less well on these measures compared with the baseline. All three methods resulted in a decrease in recall. The combinations generally outperform any individual participating method, except for the recall of the Clch2enc1-nf run.

Figure 1: Comparative analysis of average precision, with the English run as the monolingual retrieval baseline and the Cl*2enall runs as the cross-language retrieval baseline. The best results are achieved by the combination of all three translation disambiguation methods, Cl*2enc2. (Chart not reproduced.)

A summary of our results, focusing on average precision, is given schematically in Figure 1. In general, our best results for Spanish-to-English CLIR are virtually indistinguishable from English monolingual performance. Our best results for Chinese-to-English CLIR, suffering from the effects of poor resources, are at about 60% of the monolingual baseline.
5. Error Analysis

Comparing the English baseline run to the Spanish-to-English Web run (Cles2enw), we find that 16 of the translated Spanish-to-English queries actually give better results in terms of average precision than the English queries, and 26 are worse (of which 12 are much worse, with less than half the average precision of the English queries); see Figure 2. Some of the reasons for improved results are:

• Different word choice, e.g.:
  o "Population" in the English version of Topic 95 (Conflict in Palestine) becomes "town" in the Spanish-to-English translation of "poblacion".
  o "Ski races" in the English version of Topic 102 becomes "ski competition" in the translations.
  o The English version of Topic 106 (European car industry) contains "countermeasures", whereas the Spanish contains "medidas de recuperacion", which the methods described above translate as "recovery measures". These are more common words than "countermeasures", which is usually found in political rather than business contexts in the CLEF documents.
  o In Topic 140 (Mobile phones), the English version contains "perspectives" and the Spanish version contains "perspectiva", which our dictionary can translate as "outlook", "perspective", "prospect", or "vista". Our method chooses "outlook", a more common word, which might well account for why this topic scores better after translation.
• Different word ordering: CLARIT recognizes noun phrases in the English text; in the Spanish-to-English text, a query is repeated backwards and forwards, so that different phrases may be recognized in the reconstituted topic, e.g.:
  o "Weapon destruction" appears in the translation of Topic 119, whereas "weapon" and "destruction" are not found in the same simple noun phrase in the English topic.
  o "Grunge rock" appears in the translation of Topic 130, while the English contains "grunge group".
• Different formulations of the same topic in English and Spanish, e.g.:
  o Topic 121 (Successes of Ayrton Senna) contains "success" and "sporting achievements", while the Spanish version contains just the word "palmares", which is not in the Spanish–English dictionary and thus remains untranslated in the resulting English version of the Spanish topic. It may be that the English words "successes" and "achievements" are only distractors from the relevant documents for Ayrton Senna.
  o Topic 138 (Foreign words in French) contains "lengua", which translates to "language", a word not present in the English version.

Figure 2: Query-by-query comparison of average precision between the best Spanish–English retrieval run (Cles2enc2) and English monolingual retrieval. (Chart not reproduced.)

Some of the reasons for worse results are:

• Proper names written differently and missing from the dictionary, e.g.:
  o "Solzhenitsyn" (Topic 94) is written "Solzhenitsin" in the Spanish topic and is retained with that spelling in the translated queries, since it is not found in the dictionary.
  o "European Cup" (Topic 113) is written "Eurocopa" in Spanish; it is not in the dictionary, so it passes through as-is, but is not found as a string in the English documents.
• Dictionary divergences, e.g.:
  o In Topic 107 (Genetic Engineering), "food chain" appears as "cadena alimentaria" in Spanish, but "alimentaria" does not have "food" among its translations.
Comparing the English baseline run to the Chinese-to-English runs, we observe that most of the translated queries are worse in average precision than the corresponding English queries. As in Spanish-to-English retrieval, the combination of the three methods (Clch2enc2) produces the best overall performance: 12 of the translated queries give better average precision than the English queries and 30 are worse (Figure 3). The better performance is due to:

• Reduced ambiguity in translation, e.g.:
  o In Topic 113 (European Cup), the English version uses the word "football", while the target translations include the word "soccer". Since "football" can mean either soccer or American football, the translation makes the query more relevant to the topic.
• Difference in word choice, e.g.:
  o In Topic 133 (German Armed Forces Out-of-Area), the English version uses the word "area", while the Chinese version uses "border". As our system throws out "out" and "of" as stop words, the word "area" becomes too general, while "border" implies country boundaries.

Figure 3: Query-by-query comparison of average precision between the best Chinese–English retrieval run (Clch2enc2) and English monolingual retrieval. (Chart not reproduced.)

Some of the reasons for the poor performance are:

• Improper segmentation of words, e.g.:
  o In Topic 111, the Chinese word for "animation" was segmented as two single characters, one with translations such as "act", "arouse", "get moving", "move", "stir", "change", "use", and "touch", the other with the translations "draw", "painting", and "picture". As a result, our system produced translations such as "use picture" instead of the correct translation "animation".
  o Similar segmentation mistakes include the word for "galaxy" in Topic 129, split into characters glossed as "silver" and "river", and the word for "beauty contest" in Topic 137, split into characters glossed as "to choose, to elect, to pick, to select" and "America, beautiful, pretty".
  o The term for "EU fishing" in Topic 139 was rendered as four single characters, which were consequently filtered out by the system.
• Improper segmentation of transliterated names, e.g.:
  o In Topic 102 (Victories of Alberto Tomba), the transliterated name "Alberto Tomba" was incorrectly segmented into seven single characters that bear no direct semantic relation to the name. Since we filtered out consecutive single characters during pre-processing of the topics, the query contained only general terms such as "victory", "ski", and "competition". The best precision for this topic was achieved by the Web run, at 0.1078. In contrast, in the English query, the name "Alberto Tomba" as a term makes the query very specific, resulting in high precision (0.7491).
  o Other topics that suffered similarly include Topic 94 (Return of Solzhenitsyn), 98 (Films by the Kaurismäkis), 103 (Conflict of interests in Italy), 104 (Super G Gold medal), 120 (Edouard Balladur), 121 (Successes of Ayrton Senna), and 123 (Marriage Jackson-Presley).
• Different word choice, e.g.:
  o The translations of the relevant Chinese word in Topic 107 (Genetic Engineering) are "transmit", "transmittal", and "hereditary", instead of the desired translation "genetic".
  o In Topic 99, "Holocaust" is represented in Chinese by a word glossed as "butchery" and "massacre".
6. Conclusions

We have explored three methods for selecting "best" target translations that take advantage of co-occurrence statistics among alternative translations of the source query words. We have demonstrated that, given the wide variety of possible translations that might be generated from a bilingual dictionary, the use of the Web or a large local corpus as a language model can provide a good basis for lexical choice, provided the gloss dictionary covers the source vocabulary. Combining the target words obtained from the different translation disambiguation methods can produce better cross-language retrieval performance than keeping all possible translations. For Spanish-to-English retrieval, our combined method achieved 95% and 97%, respectively, of the recall and average precision of our English monolingual run. For Chinese-to-English text retrieval, the recall and average precision reached 89% and 60%, respectively, of the English monolingual results.

Our experiments have shown that the quality of the translation resources has a significant impact on the performance of cross-language retrieval. The poor translation quality and poor coverage of the Chinese-to-English bilingual lexicon we used resulted in relatively poor Chinese-to-English retrieval. Names (transliterated names in CLEF topics, in particular) that are not covered in a translation lexicon need to be recognized and translated correctly for better retrieval performance.

With respect to the translation disambiguation methods, we believe our methods can be further improved by addressing the following issues:

• Identifying the optimal context span for the Web method and the Corpus1 method. Currently, we use a 3-word context, which may limit access to contextual information.
• Exploring ways to prune paths in the translation hypothesis space when the space is large. This includes the problem of how to rank the translations of a query term when many translations are possible (currently, for the Corpus1 method, we use the idf scores of the translations in the target corpus), and how to rank the combinations of the translations of a sequence of source terms.
• Identifying the optimal retrieval cutoff point for the Corpus1 method, instead of the arbitrary number of documents (i.e., 100) used in our experiments.
• Incorporating phrasal translations. Phrasal translations can give important performance gains in cross-language retrieval. We have begun to explore techniques for automatically translating phrasal terms.

In general, we feel our results demonstrate that it is possible to achieve remarkably high CLIR performance by exploiting relatively simple and readily available resources. Such approaches hold promise for cross-language retrieval in cases where machine translation, parallel corpora, or other knowledge resources may be difficult or impossible to obtain. We believe that incremental, straightforward refinements of our approach will give better and more consistent results. We plan to test this hypothesis in future work on other language pairs.

References

[Evans 2000] Evans, D.A. Method and Apparatus for Cross Linguistic Document Retrieval. U.S. Patent No. 6,055,528, April 25, 2000.

[Evans 2001] Evans, D.A. Method and Apparatus for Cross Linguistic Database Retrieval. U.S. Patent No. 6,263,329 (a division of U.S. Patent No. 6,055,528), July 17, 2001.

[Evans & Lefferts 1995] Evans, D.A., and R.G. Lefferts. CLARIT–TREC Experiments. Information Processing and Management, Vol. 31, No. 3, 1995, pp. 385–395.
[Grefenstette 1998] Grefenstette, G. The Problem of Cross-Language Information Retrieval. In G. Grefenstette, editor, Cross-Language Information Retrieval, chapter 1, pp. 1–9. Kluwer Academic Publishers, Boston, 1998.

[Grefenstette 1999] Grefenstette, G. The WWW as a Resource for Example-Based MT Tasks. In Proceedings of the ASLIB Translating and the Computer 21 Conference, London, 1999.

[Hull & Grefenstette 1996] Hull, D.A., and G. Grefenstette. Experiments in Multilingual Information Retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, 1996.

[Oard & Dorr 1996] Oard, D.W., and B.J. Dorr. A Survey of Multilingual Text Retrieval. Technical Report UMIACS-TR-96-19, University of Maryland, Institute for Advanced Computer Studies, 1996.

[Qu et al. 2000] Qu, Y., A.N. Eilerman, H. Jin, and D.A. Evans. The Effect of Pseudo-Relevance Feedback on MT-Based CLIR. In Proceedings of Recherche d'Informations Assistée par Ordinateur (RIAO 2000), 2000.

[Zhu & Rosenfeld 2001] Zhu, X., and R. Rosenfeld. Improving Trigram Language Modeling with the World Wide Web. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001.