=Paper= {{Paper |id=Vol-1166/CLEF2000wn-adhoc-NieEt2000 |storemode=property |title=Using Parallel Web Pages for Multi-lingual IR |pdfUrl=https://ceur-ws.org/Vol-1166/CLEF2000wn-adhoc-NieEt2000.pdf |volume=Vol-1166 |dblpUrl=https://dblp.org/rec/conf/clef/NieS00 }} ==Using Parallel Web Pages for Multi-lingual IR== https://ceur-ws.org/Vol-1166/CLEF2000wn-adhoc-NieEt2000.pdf
                 Using Parallel Web Pages for Multi-lingual IR

                         Jian-Yun Nie, Michel Simard, George Foster
                                        Laboratoire RALI,
                      Département d'Informatique et Recherche opérationnelle,
                                      Université de Montréal
                                C.P. 6128, succursale Centre-ville
                               Montréal, Québec, H3C 3J7 Canada
                             {nie, simardm, foster}@iro.umontreal.ca

In this report, we describe the approach we used in the CLEF Cross-Language IR (CLIR) tasks.
In our experiments, we used statistical translation models estimated from parallel texts
automatically mined from the Web. In our previous experiments, we tested CLIR for
English-French and English-Chinese. Our goal in this series of experiments is to see whether
the approach can be extended to multi-lingual IR (with other languages). In particular,
we compare models trained from Web documents with models that also combine
other resources such as dictionaries.

1. Introduction
    In the past two years, we have tested the possibility of using parallel texts mined from the Web for
CLIR [Nie99]. We were able to build models for French-English and Chinese-English translation. In
TREC8, the results we obtained were difficult to compare with other systems, as the bilingual French-
English runs were unofficial and there was no report on similar experiments in the TREC8 runs. This
year, we successfully mined several sets of parallel Web pages for the following language pairs:
English-Italian and English-German, in addition to the English-French corpus found previously. Our
goal in this year's CLEF is therefore to see whether parallel Web documents can also be applied to
multi-lingual IR.
    In our previous experiments, we observed that a certain combination of the translation models
with a dictionary could improve IR effectiveness. However, the combination remained empirical: each
dictionary translation was assigned a certain "default probability", which was added to the probability
from the translation model. This year, we combined several resources in the model-training step: the
documents from the Web, the Hansard parallel corpus, and two bilingual dictionaries. The
combination is set so as to maximize the translation probability of a held-out set of texts.
    In this report, we first describe the mining system used to gather parallel texts from the Web. We
then briefly describe the training process. Finally, the CLIR experimental results are summarized.

2. Mining parallel texts from the Web
   Statistical models have often been used in computational linguistics for building MT systems or
constructing translation assistance tools. The problem we often have is the unavailability of parallel
texts for many language pairs. The Hansard corpus is one of the few existing corpora for English and
French. For other languages, such corpora are scarce or not available at all. To address this
problem, we conducted a text-mining project on the Web in order to find parallel texts automatically.
The first experiments with the mined documents are described in [Nie99]. They were carried out with
a subset (5,000) of the mined documents, but they showed that the approach is feasible. Later on, we
trained another translation model with all the Web documents found, and the CLIR effectiveness
obtained with it is close to that of a good MT system (Systran).
   The mining process is divided into the following three steps:




         - selection of candidate Web sites
         - finding all the documents from the candidate sites
         - pairing the texts using simple heuristic criteria
    The first step aims to determine possible web sites where there may be parallel texts for the
given language pair. To do this, we send requests to search engines, asking for French documents
containing an anchor text such as "English version", "english", and so on, and similarly for English
documents. The idea is that if a French document contains such an anchor text, the link with which
the anchor is associated usually points to the parallel text in English.
    From the set of documents returned by the search engines, we extract the addresses of the web
sites, which are considered as candidate sites.
    The second step also uses the search engines: a series of requests is sent to the search engines to
obtain the URLs of all the documents on each candidate site.
    The last step consists of pairing up the URLs. We use some heuristic rules to determine quickly
whether one URL may be parallel to another:
         - First, parallel texts usually have similar URLs. The only difference between them is often
             a segment denoting the language of the document, for example, "-en", "-e", and so on for
             English documents; the corresponding segments for French are "-fr", "-f", and so on.
             Therefore, by examining the URLs of the documents, we can quickly determine which
             files may form a pair.
         - We then use other criteria, such as file length, to further confirm or reject a pair.
         - The above criteria do not require downloading the files. Once a set of possible
             pairs is determined, the paired files are downloaded. We can then perform some checks
             on the document contents. For example, are their HTML structures similar? Do they
             contain enough text? Can we align them into parallel sentences?
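    The URL-pairing heuristics above can be sketched as follows. This is a hypothetical
simplification, not the actual PTMiner code: the list of language-marking segments and the length
ratio of 1.5 are illustrative assumptions.

```python
# Hypothetical sketch of the URL-pairing heuristics (not the actual
# PTMiner implementation). Language-marking segments such as "-en"/"-fr"
# are swapped to guess the partner URL; the segment list is illustrative.
LANG_SEGMENTS = [("-en", "-fr"), ("-e", "-f"), ("_en", "_fr"), ("/en/", "/fr/")]

def candidate_pair(url_en, french_urls):
    """Return a French URL that likely parallels url_en, or None."""
    for en_seg, fr_seg in LANG_SEGMENTS:
        guess = url_en.replace(en_seg, fr_seg)
        if guess != url_en and guess in french_urls:
            return guess
    return None

def lengths_compatible(len_a, len_b, ratio=1.5):
    """Cheap pre-download filter: parallel files usually have similar sizes."""
    small, large = min(len_a, len_b), max(len_a, len_b)
    return large <= ratio * small
```

Only pairs that pass both filters would then be downloaded for the content-based checks.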
    The French-English parallel corpus was constructed last year at the RALI laboratory. This year,
we cooperated with Twenty-One (W. Kraaij) to construct English-Italian and English-German parallel
corpora, using the same mining system, PTMiner. The following table shows the number of text pairs
as well as the volume of the corpora for the different language pairs.

                                          E-F         E-G        E-I
                         Pairs            18 807      10 200     8 504
                         Volume (MB)      174/198     77/100     50/68
                                      Table 1. Training corpora

  The corpora found on the Web will be called the WAC corpora (Web Aligned Corpora), and the
models trained with these corpora the WAC models.

3. Principle of building a probabilistic translation model
   Given a set of parallel texts in two languages, the texts are first aligned into parallel sentences. The
criteria used in sentence alignment are the position of a sentence in the text (parallel sentences have
similar positions in two parallel texts), the length of the sentence (parallel sentences are also similar in
length), and so on [Gale93]. In [Simard92], it is proposed that cognates may be used as an additional
criterion. Cognates are words (e.g. proper names) or symbols (e.g. numbers) that are identical (or very
similar in form) in the two languages. If two sentences contain such cognates, this provides additional
evidence that they are parallel. It has been shown that the approach using cognates performs better
than the one without cognates. Before training the models, each corpus is aligned into parallel
sentences using a cognate-based alignment algorithm.
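    A crude version of the cognate criterion can be sketched as follows. The prefix-matching rule is a
simplification of the method in [Simard92], and the 4-letter threshold is an illustrative assumption.

```python
def is_cognate(w1, w2, min_prefix=4):
    """Crude cognate test: identical tokens (numbers, names), or a
    shared prefix of at least min_prefix letters."""
    if w1 == w2:
        return True
    return (len(w1) >= min_prefix and len(w2) >= min_prefix
            and w1[:min_prefix] == w2[:min_prefix])

def cognate_score(sent_a, sent_b):
    """Fraction of words in sent_a that have a cognate in sent_b;
    a higher score is additional evidence that the pair is parallel."""
    words_a, words_b = sent_a.lower().split(), sent_b.lower().split()
    if not words_a:
        return 0.0
    hits = sum(1 for a in words_a if any(is_cognate(a, b) for b in words_b))
    return hits / len(words_a)
```

An aligner can combine this score with the position and length criteria when ranking candidate
sentence pairs.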




    Once a set of parallel sentences is obtained, word translation relations are estimated. First, it is
assumed that every word in a sentence may be the translation of every word in its parallel sentence.
Therefore, the more often two words appear in parallel sentences, the more likely they are to be
translations of one another. In this way, we obtain the initial word translation probabilities.
    In the second step, the probabilities are refined by an Expectation Maximization (EM) process
that maximizes the probability of the given parallel sentences. The EM algorithm is described in
[Brown93]. The final result is a probability function P(f|e), which gives the probability that f is the
translation of e. Using this function, we can determine a set of probable word translations in the target
language for each source word, or for a complete query in the source language.
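    The two-step estimation just described corresponds to the EM training of IBM Model 1
[Brown93]. A minimal sketch on a toy corpus (not the actual training code used on the aligned Web
sentences):

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """Toy IBM Model 1 EM training (after [Brown93]).

    pairs: list of (source_words, target_words) sentence pairs.
    Returns a dict t with t[(f, e)] approximating P(f | e).
    """
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Step 1: assume every word may translate every word in the
    # parallel sentence (uniform initial probabilities).
    t = defaultdict(lambda: 1.0 / max(len(f_vocab), 1))
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # normalisers per target word
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm  # E-step: fractional count
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate the translation probabilities
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t
```

With a few iterations the probability mass concentrates on the consistently co-occurring pairs, e.g.
P(la|the) grows while P(maison|the) shrinks.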

4. The training of multiple models and their combination
    For English and French, we also have other resources: the Hansard corpus (a set of parallel French
and English texts from the Canadian parliamentary debates), a large terminology database (Termium)
and a small bilingual dictionary (Ergane). A translation model is trained from the Hansard data, in the
same way as for the Web documents (WAC).
    Both the terminology database and the bilingual dictionary contain English words/terms and their
French translations (words/terms). In some way, we can regard these two resources as two sets of
special parallel "sentences", so the translation probabilities between words can be estimated with the
same statistical training process, giving two additional translation models. In total, we obtain four
different translation models between English and French from four different resources (in each
direction). The question now is how to combine them in a reasonable way.
    We chose a linear combination of the models. Each model is assigned a coefficient denoting our
confidence in it. The coefficients are tuned on a set of "held-out" data, a set of parallel sentences
(about 100K words). This set is selected from various resources (different from the resources used for
model training) so that it gives a good balance of different kinds of text. The following coefficients
were assigned to the models:

        Ergane          0.0413916
        Hansard         0.300517
        Termium         0.413493
        Wac             0.244598

   As we can see, the combination seems to favor the models that cover more vocabulary. Termium
receives the highest coefficient because it contains about 1 million words/terms in each language. The
Hansard corpus and the WAC corpus contain about the same volume of text, so their coefficients are
comparable. The Ergane dictionary is small, containing only about 9,000 words in each language, so
its coefficient is very low. Although these coefficients are the best for the held-out data, they may not
be suitable for the CLEF data.
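   The linear combination can be sketched as follows, using the coefficients reported above; the tiny
probability tables in the usage example are toy stand-ins for the actual trained models.

```python
# Linear interpolation of the four translation models, using the
# coefficients tuned on the held-out data (as reported above).
WEIGHTS = {"ergane": 0.0413916, "hansard": 0.300517,
           "termium": 0.413493, "wac": 0.244598}

def mixed_probability(models, f, e):
    """P_mixed(f|e) = sum over models m of lambda_m * P_m(f|e).

    models maps a model name to a dict of (f, e) -> P(f|e);
    a pair missing from a model contributes probability 0.
    """
    return sum(WEIGHTS[name] * model.get((f, e), 0.0)
               for name, model in models.items())
```

For example, a word pair known to three of the four models receives a Mixed probability that is the
coefficient-weighted sum of the three individual probabilities.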

5. Experiments
   We used a modified version of the SMART system [Buckley85] for monolingual document
indexing and retrieval. The ltn weighting scheme is used for documents. For queries, we use the
probabilities provided by the probabilistic model, multiplied by an idf factor. From the translation
words obtained, we retain the top 50 words for each query. The value of 50 seemed to be a reasonable
number on TREC6 and TREC7 data.
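   The query-translation weighting can be sketched as follows. This is a hypothetical simplification:
the idf formula and the summing of contributions from several query words are assumptions, not the
exact SMART computation.

```python
import math

def translate_query(query_words, t_model, df, n_docs, k=50):
    """Rank translation candidates by P(f|e) * idf(f); keep the top k.

    t_model maps (target_word, source_word) to P(f|e); df maps a target
    word to its document frequency in the target collection of n_docs.
    """
    weights = {}
    for e in query_words:
        for (f, e2), p in t_model.items():
            if e2 != e:
                continue
            idf = math.log(n_docs / df.get(f, 1))  # assumed idf form
            weights[f] = weights.get(f, 0.0) + p * idf
    return sorted(weights.items(), key=lambda x: -x[1])[:k]
```

Rarer target words thus get a boost relative to their raw translation probability, which keeps common
but unspecific translations from dominating the translated query.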




5.1. Monolingual IR
   Monolingual IR results were submitted for the following languages: French, Italian and German.
This series of experiments uses the SMART ltn weighting scheme for queries as well. In addition,
pseudo-relevance feedback is applied, using the 100 most important terms among the top 30 retrieved
documents to revise the original queries. The parameters used for this process are α = 0.75 and
β = 0.25. The results obtained are shown below:

                                             French      Italian     German
                      ≥ median               18          18          12
                      < median               16          16          25
                      Average precision      0.4026      0.4334      0.2301
                                          Table 2. Monolingual IR

    As we can see, there are great differences in effectiveness among the above runs. Several factors
contributed to this.
1. The use of stoplists
    For French, the set of stopwords was carefully compiled by French-speaking people. For Italian
and German, we used two stoplists found on the Web [Stoplists]. In addition, a small set of additional
stopwords was added for Italian.

2. The use of a lemmatizer or a stemmer
    For French, we used a lemmatizer developed in the RALI laboratory that first applies a statistical
tagger, then transforms each word to its citation form according to its part-of-speech category. For
Italian and German, two simple stemmers obtained from the Web [Stemmers] are used. There is no
particular processing of compound words in German. This may be an important factor affecting the
effectiveness of German IR.

    Overall, the French and Italian monolingual runs seem comparable to the median performance of
the participants, but the German run is well below the median. We think the main reason is the lack of
special processing for German (e.g. compound words).

5.2. Tests on bilingual IR
    The bilingual task consists of finding documents in a language different from that of the queries.
We tested the following bilingual IR tasks: E-F (i.e. English queries for French documents), E-I and
E-G. For this series of tests, we first used the translation models to obtain a set of 50 weighted
translation words for each query. Unknown words are not translated; they are added to the translation
words with a default probability of 0.05. The same pseudo-relevance feedback process is used.
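    The feedback step can be sketched as a Rocchio-style reweighting with the parameters given
earlier (α = 0.75, β = 0.25, top 100 terms of the top 30 documents). The dict-based query and
document vectors are a simplification of what SMART actually manipulates.

```python
def rocchio_feedback(query, top_docs, alpha=0.75, beta=0.25, n_terms=100):
    """Pseudo-relevance feedback: mix the original query vector with the
    centroid of the top retrieved documents (dicts of term -> weight)."""
    centroid = {}
    for doc in top_docs:
        for term, w in doc.items():
            centroid[term] = centroid.get(term, 0.0) + w / len(top_docs)
    # keep only the most important expansion terms
    best = dict(sorted(centroid.items(), key=lambda x: -x[1])[:n_terms])
    expanded = {t: alpha * w for t, w in query.items()}
    for t, w in best.items():
        expanded[t] = expanded.get(t, 0.0) + beta * w
    return expanded
```

Terms absent from the original query enter it with weight β times their centroid weight, while the
original terms keep most of their weight through α.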
    Between English and Italian, and English and German, we only have the parallel Web documents
to train our translation models. For French and English, we have multiple translation resources: the
Web documents, the Hansard corpus, and two bilingual dictionaries. We therefore compare the model
trained with only the Web documents (the WAC model) and the model with all the resources
combined (the Mixed model). The following table summarizes the results we obtained for the official
submissions in bilingual IR.




                                                F-E                   I-E (WAC)   G-E (WAC)
                                                WAC        Mixed
                     ≥ median                   20         16         21          13
                     < median                   13         17         13          21
                     Average precision          0.2197     0.1722     0.2032      0.1437
                               Table 3. Bilingual IR with different models

   For the F-E and I-E cases, the WAC models lead to an effectiveness that is better than the median.
The Mixed model for F-E gives a median performance. The comparison between the two translation
models for French to English is particularly interesting. We expected the Mixed model to perform
better because it is trained with more data from different sources. Surprisingly, its effectiveness is
worse than that of the WAC model. There may be several reasons for this:
    - The combination of the different resources is tuned on a set of held-out data that does not
        come from the CLEF document set, so there may be a bias in the combination.
    - During the combination, we observed that the results tend to favor dictionary translations: a
        high priority is attributed to dictionary translations. This may also be due to the biased
        tuning of the combination.

    The unsuitability of the combination is further illustrated by the following examination of
bilingual IR effectiveness with each individual translation model. The following table shows the
effectiveness over all 40 queries for bilingual IR between French and English:

                     Model             WAC        Hansard    Termium    Mixed
                     Avg. precision    0.1989     0.2367     0.1800     0.1426
                  Table 4. Comparison of bilingual IR with different individual models.

   The value 0.1426 over 40 queries corresponds to the 0.1722 over 33 queries in the previous table
(for the official evaluations). We can see that the Mixed model performed worse than any of the
individual models. This clearly indicates that the combination of the models is not suitable for the
CLEF data.

5.3. Multilingual runs
   In our case, the multilingual runs are only possible from English to all the languages (English,
French, Italian and German). In these experiments, the following three steps are performed:
        1. Translate the English queries into French, Italian and German, respectively;
        2. Retrieve documents from the different document sets;
        3. Merge the retrieval results.

    The translation of English queries into German and Italian was done with the WAC translation
models (trained from the Web documents). For English to French, we also have the alternative of
using the Mixed model. The translation words are submitted to the mtc transformation of SMART.
This scheme is chosen because it yields comparable similarity values across the different data sets and
therefore makes result merging easier. The merging is done according to the similarity scores. The top
1000 retrieved documents are selected as the final results and submitted for evaluation.
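    The raw-score merging step can be sketched as follows; it simply pools the per-collection result
lists, which presumes the mtc scheme makes similarity values comparable across collections.

```python
import heapq

def merge_runs(runs, k=1000):
    """Pool per-collection result lists of (doc_id, score) and keep the
    k highest-scoring documents overall as the multilingual result."""
    pooled = [item for run in runs for item in run]
    return heapq.nlargest(k, pooled, key=lambda x: x[1])
```

Any systematic score difference between collections (e.g. from different weighting schemes) directly
biases such a merge, which is why mtc had to be used for the English run as well.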
    The following table describes the results of the different runs. In the WAC column, all the models
used to translate the English queries are WAC models. In the Mixed column, only the English-to-
French translation uses the Mixed model, whereas the other translations still use the WAC models.



                                                      WAC        Mixed
                                ≥ median              14         12
                                < median              26         28
                                Average precision     0.1531     0.1293
                                        Table 5. Multilingual IR

   As we can see, these performances are all below the median. One of the main reasons may be that
the German monolingual retrieval does not use any linguistic preprocessing and has very poor
effectiveness; this may have greatly affected the multilingual runs. Another possible reason may be
the over-simplified merging method we used. In fact, in order to make the English monolingual run
compatible (in terms of similarity values) with the bilingual runs, we had to choose the mtc weighting
scheme, as for the other cases. In our tests, we observe that this weighting scheme is not as good as
ltc. Therefore, the ease of result merging was obtained to the detriment of English effectiveness.
   We observe again the negative impact of the Mixed model in this task. When the WAC model for
English-French is replaced by the Mixed model, the effectiveness decreases. This shows once again
that the coefficients we set for the different models are not suitable for the CLEF data.

6. Final remarks
    In this CLEF, we successfully used parallel Web pages to train translation models for language
pairs other than English and French.
    For monolingual IR, the effectiveness for French and Italian is similar to the median performance.
The German monolingual run is well below the median. We think the main reason is that we did not
carry out any particular processing of German morphology, which is an important problem for
German IR.
    For bilingual IR between English and French, and between English and Italian, the effectiveness
seems reasonable: it is better than the median. Between English and German, however, the
effectiveness is well below the median. The reason may be the same as for the German monolingual
run.
    For the multilingual runs, the performance is below the median. We believe the reason is once
again the low effectiveness for German. In addition, result merging may also have affected the global
effectiveness.
    Between English and French, we also tried to combine different resources in our translation
models. We used a linear combination of the models trained with the different data sets, with the
coefficients determined on a small set of held-out data. However, to our surprise, the Mixed model
performed worse than the model trained with the Web documents only. This clearly indicates that the
combination is not suitable for the CLEF data.
    In our future work, we will try to determine a better way to combine the different translation
models between English and French. For German, we will use more linguistic processing, in
particular a more sophisticated stemmer.
    Overall, we are still encouraged by this CLEF because we have shown for the first time that
parallel Web documents can be used for multilingual IR.

References
[Brown93] P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer, The mathematics of
    statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), pp. 263-312
    (1993).



[Gale93] W. A. Gale, K. W. Church, A program for aligning sentences in bilingual corpora,
    Computational Linguistics, 19(1), pp. 75-102 (1993).
[Franz98] M. Franz, J.S. McCarley, S. Roukos, Ad hoc and multilingual information retrieval at IBM,
    The Seventh Text Retrieval Conference (TREC-7), NIST SP 500-242, pp. 157-168 (1998)
[Nie98] J.Y. Nie, TREC-7 CLIR using a probabilistic translation model, The Seventh Text Retrieval
    Conference (TREC-7), NIST SP 500-242, pp. 547-553 (1998).
[Nie99] J. Y. Nie, P. Isabelle, M. Simard, R. Durand, Cross-language information retrieval based on
    parallel texts and automatic mining of parallel texts from the Web, ACM-SIGIR conference,
    Berkeley, CA, pp. 74-81 (1999).
[Simard92] M. Simard, G. Foster, P. Isabelle, Using Cognates to Align Sentences in Parallel Corpora,
    Proceedings of the 4th International Conference on Theoretical and Methodological Issues in
    Machine Translation, Montreal (1992).
[Stoplists] http://www.cs.ualberta.ca/~oracle8/oradoc/DOC/cartridg.804/a58165/appa.htm
[Stemmers] http://www.muscat.com



