=Paper=
{{Paper
|id=Vol-1169/CLEF2003wn-adhoc-RogatiEt2003
|storemode=property
|title=CONTROL: CLEF-2003 with Open, Transparent Resources Off-Line
|pdfUrl=https://ceur-ws.org/Vol-1169/CLEF2003wn-adhoc-RogatiEt2003.pdf
|volume=Vol-1169
|dblpUrl=https://dblp.org/rec/conf/clef/RogatiY03a
}}
==CONTROL: CLEF-2003 with Open, Transparent Resources Off-Line==
Monica Rogati and Yiming Yang
Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania
{mrogati, yiming}@cs.cmu.edu

Abstract: Corpus-based approaches to CLIR have been studied for many years. However, using commercial MT systems for CLEF has been considered easier and better performing. Our goal is to be among the CLEF participants who show that the hypothetical performance drop is not large enough to justify the loss of control and transparency, especially for research systems. We participated in two bilingual runs and the small multilingual run using software and data that are free to obtain, transparent and modifiable.

===1 Introduction===

Over the past years, a necessary condition for good cross-lingual or multilingual performance in CLEF appeared to be the use of commercial MT systems, whether purchased or freely available online (Systran, etc.) [1,3,11]. While using black boxes to cross the language barrier has allowed researchers to concentrate on important issues such as stemming, query pre- and post-processing, combining black-box outputs, and multilingual merging [1,3,11], we believe that query translation plays an essential role in CLIR, and that understanding, control and transparency are crucial in a research system. Online MT systems can be upgraded, lose their free status, or change parameters at will, making past experiments irreproducible. If such a dependence is permitted, research in IR in general could similarly be reduced to pre- and post-processing of Google's input and output.

Our goal is to move away from basing the core of our CLIR research system on a module that cannot be fully understood and modified, to which future access might not be guaranteed, and in which external changes are allowed and sometimes not even detected. The main challenge, however, is to do so while sacrificing as little performance as possible. Our initial attempt to reach this goal (CLEF 2001) was disappointing in this respect, mainly because we disallowed using translation resources entirely and relied on the temporal correspondence between CLEF documents to produce a “parallel” corpus. In CLEF 2003 we relaxed the independence requirement to using transparent data and code, freely available or available for a modest one-time fee, which we can store locally, easily modify, recompile and process, and which cannot change in uncontrollable or undetectable ways.

We participated in two bilingual tasks (DE->IT, IT->ES) and the small multilingual task, which involved four languages. Our general approach was to rely on parallel corpora and GIZA++ [8] for query translation, and on Lemur [9] for retrieval. All these resources (as well as the stemmers we used, where applicable) fulfill the criteria outlined above. Moreover, with the exception of the LDC data, which we did not use in the official runs but did use in preliminary experiments, all these resources are free of charge and publicly available. In Section 2 we discuss the parallel data and preprocessing (stemming, stopping, etc.). In Section 3 we discuss our approach to bilingual retrieval in general, as well as approaches for situations where a parallel corpus between the two languages does not exist.

===2 Data Description and Preprocessing===

We have used the European Parliament proceedings 1996-2001 [6]. The corpus includes versions in 11 European languages: Romance (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish. Sentence-aligned parallel corpora (English-X) have been prepared by the author of [6]. We have also prepared German-Italian and Italian-Spanish versions for the two bilingual CLEF tasks we participated in, by detecting almost identical English sentences and aligning the corresponding non-English sentences.
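To make this intersection procedure concrete, the sketch below pairs German and Italian sentences whose English sides match after light normalization. It is a minimal illustration only: the file names, the one-sentence-per-line layout, and the exact-match-after-normalization criterion are our assumptions, and the actual runs used a looser notion of "almost identical" English sentences.

<syntaxhighlight lang="python">
# Minimal sketch: build a DE-IT parallel corpus by pairing sentences whose
# English sides are (nearly) identical. File names, layout, and the matching
# criterion are illustrative assumptions, not the exact procedure used.

def normalize(sentence: str) -> str:
    """Light normalization so trivially different English sentences still match."""
    return " ".join(sentence.lower().split())

def read_pairs(foreign_path: str, english_path: str):
    """Read a sentence-aligned corpus as (normalized English, foreign) pairs."""
    with open(foreign_path, encoding="utf-8") as f, open(english_path, encoding="utf-8") as e:
        for foreign, english in zip(f, e):
            yield normalize(english), foreign.strip()

def intersect(de_en, it_en):
    """Pair German and Italian sentences that share the same English side."""
    english_to_italian = dict(it_en)
    for english, german in de_en:
        italian = english_to_italian.get(english)
        if italian is not None:
            yield german, italian

if __name__ == "__main__":
    pairs = intersect(read_pairs("europarl.de", "europarl.de-en.en"),
                      read_pairs("europarl.it", "europarl.it-en.en"))
    with open("europarl.de-it", "w", encoding="utf-8") as out:
        for german, italian in pairs:
            out.write(german + "\t" + italian + "\n")
</syntaxhighlight>

Any intersection of this kind keeps only sentences whose English sides line up, which is why the resulting corpora (Table 1 below) are much smaller than the English-pivoted originals.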
Table 1 shows the size of the relevant parallel corpora after preprocessing (stopping and stemming); some sentence pairs were eliminated because they became empty after preprocessing. Note that our quick intersection of the X-EN and Y-EN parallel corpora, taking only sentences where the English versions were close, resulted in losing about three quarters of the corpus. A much better approach would have been to follow the procedure of [6], which would most likely have resulted in corpora of a size comparable to the English versions.

Table 1: Size of the European Parliament parallel corpora (in sentences)
{| class="wikitable"
|-
! DE-IT !! IT-ES !! DE-EN !! FR-EN !! IT-EN !! ES-EN
|-
| 128,505 || 150,910 || 659,773 || 674,770 || 687,890 || 738,772
|}

We have also experimented with several other corpora, including Hansard set A for French (available from the Linguistic Data Consortium). Although the sentence-aligned version was much larger (2.7M sentences), preliminary experiments on the CLEF 2001 and 2002 datasets showed a consistent performance drop (usually around 10%). As a result, Hansard was not used for CLEF 2003.

We preprocessed the parallel corpora and the CLEF documents by eliminating punctuation, stopwords, and document sections disallowed in the task description. We used the Porter stemmer for English and the rule-based stemmers and stopword lists kindly provided by J. Savoy [10]. After stemming, we used 5-grams as a substitute for German word decompounding.

===3 Bilingual Retrieval===

Our main focus in bilingual retrieval has been query translation without the use of commercial MT systems, including Systran. In this section we discuss our bilingual retrieval system using a parallel corpus, as well as the challenge of handling language pairs for which parallel corpora do not exist. Conceptually, our approach consists of several steps:

1. Preprocessing of the parallel corpora and test documents
2. Dictionary generation from the parallel corpora
3. Pseudo-relevance feedback in the source language
4. Query translation
5. Pseudo-relevance feedback in the target language
6. Retrieval

====3.1 Dictionary Generation and Query Translation====

We have used GIZA++ [8] as an implementation of IBM Model 1 [2]. GIZA++ takes a parallel corpus and generates a translation probability matrix; the number of training iterations was 10. Although GIZA++ implements the more sophisticated translation models discussed in [2], we did not use them, both for efficiency reasons and because word order is not a factor during retrieval.

Query translation was done on a word-by-word basis. A significant difference from MT- or online-dictionary-based approaches is that instead of using a rank-based cutoff (i.e. the first one or two variants for each word), we use all translations, weighted by their translation probability:

<math>q_t = q_s \cdot M_{s \rightarrow t}</math>

where q_t is the query in the target language, q_s is the query in the source language, and M_{s→t} is the translation matrix. M_{s→t} was pruned to 50 translations per word for efficiency reasons. This is similar to the IBM and BBN CLIR approaches [4,5], except that the translation is not integrated into the retrieval model; only the query is translated. This approach has the welcome side effect of a very focused query expansion.
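As a concrete illustration of this weighted word-by-word translation, the sketch below applies a pruned translation matrix to a weighted query vector. The nested-dictionary matrix format, the pruning helper, and the toy probabilities are our own illustrative assumptions; they are not the GIZA++ output format.

<syntaxhighlight lang="python">
# Sketch of q_t = q_s . M_{s->t}: every source word contributes all of its
# translations, weighted by translation probability times the source weight.
from collections import defaultdict

def prune(matrix: dict, k: int = 50) -> dict:
    """Keep only the k most probable translations for each source word."""
    return {word: dict(sorted(trans.items(), key=lambda kv: kv[1], reverse=True)[:k])
            for word, trans in matrix.items()}

def translate_query(query: dict, matrix: dict) -> dict:
    """Translate a weighted query vector through a translation probability matrix."""
    target = defaultdict(float)
    for word, weight in query.items():
        for translation, prob in matrix.get(word, {}).items():
            target[translation] += weight * prob
    return dict(target)

# Toy example: a two-word Italian query translated into Spanish.
M_it_es = prune({"petrolio": {"petróleo": 0.8, "crudo": 0.2},
                 "prezzo":   {"precio": 0.9, "coste": 0.1}})
q_it = {"petrolio": 1.0, "prezzo": 0.5}
print(translate_query(q_it, M_it_es))
# {'petróleo': 0.8, 'crudo': 0.2, 'precio': 0.45, 'coste': 0.05}
</syntaxhighlight>

Because every surviving translation keeps a weight, the translated query behaves like a focused expansion of the original rather than a hard one-best substitution.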
====3.2 Pseudo-Relevance Feedback and Retrieval====

We have used the Lemur toolkit [9] to implement weighted query expansion, and we modified the retrieval interface to accept weighted queries as input. After query expansion in Lemur, the resulting query vector (q_s: words plus weights) is extracted for subsequent translation. After translation, q_t is loaded into Lemur for a new round of query expansion in the target language, followed by retrieval.

The PRF and retrieval parameters we tuned include the number of documents considered relevant, the number of new query terms added, the relative weight of the added terms (usually 0.5), and the term weighting method. There is one such parameter set for each of the pre- and post-translation query expansions, and for each language pair. However, experiments on CLEF 2001 and 2002 indicated that post-translation query expansion hurts performance in some languages by diluting the query, so the second parameter set was set to 0 for the bilingual runs.

====3.3 Handling language pairs with no available parallel corpora====

The bilingual task this year was more challenging, in that we were not aware of any Italian-Spanish or German-Italian parallel corpora. However, since most parallel corpora have English as one of the languages, we had the option of using English as a pivot language in two ways:

1. To create a new parallel corpus if there is significant overlap (as described in Section 2). This is the least likely situation, but it does occur when there is an underlying text translated into multiple languages, as with the European Parliament corpus.

2. To translate first into English, then from English. This is where keeping and using translation probabilities is very useful. In traditional MT approaches, where the query is translated as a sentence twice, the (binary) mistakes accumulate, making the original meaning difficult to preserve. We believe the original meaning is easier to preserve when the entire query vector is translated, taking the translation probabilities into account:

<math>q_t = q_s \cdot M_{s \rightarrow EN} \cdot M_{EN \rightarrow t}</math>

where q_t is the query in the target language, q_s is the query in the source language, and M_{X→Y} is the translation matrix from language X to language Y.
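The sketch below illustrates this pivoting step under the same assumptions as the earlier translation example (translation matrices stored as nested dictionaries of probabilities); it simply applies the source-to-English and English-to-target matrices in sequence, so the result is a weighted query vector rather than a one-best double translation.

<syntaxhighlight lang="python">
# Sketch of pivot translation q_t = q_s . M_{s->EN} . M_{EN->t}: translate the
# weighted query vector into English, then onward into the target language,
# keeping translation probabilities at every step. Matrices are nested dicts
# {source_word: {target_word: probability}}, as in the earlier sketch.
from collections import defaultdict

def apply_matrix(query: dict, matrix: dict) -> dict:
    """One translation step: weighted query vector times a translation matrix."""
    out = defaultdict(float)
    for word, weight in query.items():
        for translation, prob in matrix.get(word, {}).items():
            out[translation] += weight * prob
    return dict(out)

def pivot_translate(query: dict, m_src_en: dict, m_en_tgt: dict) -> dict:
    """q_t = (q_s . M_{s->EN}) . M_{EN->t}."""
    return apply_matrix(apply_matrix(query, m_src_en), m_en_tgt)
</syntaxhighlight>

Because the intermediate English query is itself a probability-weighted vector, no single wrong word choice can wipe out the original meaning, which is the advantage claimed over translating the query as a sentence twice.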
====3.4 Official Runs (German-Italian and Italian-Spanish)====

All our official runs use the Title and Description fields. The relevant parameters are the number of pre-translation feedback documents and terms, and whether a new parallel corpus was created or English was used as a pivot language during translation.

Table 2: Official bilingual runs
{| class="wikitable"
|-
! Run Name !! Task !! Feedback docs/terms !! Parallel/Pivot !! Avg. Precision
|-
| cmuG2Icombfb || G2I || 10/150 || Pivot || 0.3439
|-
| cmuG2Icomb || G2I || 0/0 || Pivot || 0.3124
|-
| cmuG2Iparafb || G2I || 10/150 || Parallel || 0.4117
|-
| cmuG2Ipara || G2I || 0/0 || Parallel || 0.3669
|-
| cmuI2Scombfb || I2S || 15/80 || Pivot || 0.4269
|-
| cmuI2Scomb || I2S || 0/0 || Pivot || 0.4114
|-
| cmuI2Sparafb || I2S || 15/80 || Parallel || 0.2921
|-
| cmuI2Spara || I2S || 0/0 || Parallel || 0.4154
|}

It is hard to draw conclusions from the official runs without more extensive experimentation on the CLEF 2003 data (to be completed in the final version of the working notes). In particular, we are seeking an explanation for the extremely low relative performance of cmuI2Sparafb. If the run is not buggy, feedback performance is very unstable from one translation method to another and from language to language. This would not be a complete surprise, since feedback performance varied dramatically for French and Spanish on our system for CLEF 2001 and 2002; this instability is the main reason why our runs are duplicated with a “low feedback” alternative in both the bilingual and multilingual tasks.

===4 Multilingual Retrieval===

By using English as the query language we were able to leverage the parallel corpora that have English as one of the languages. We experimented with several parallel corpora, but chose the European Parliament proceedings as the corpus for our CLEF submission. We performed bilingual retrieval as described in Section 3, and we used Lemur for English monolingual retrieval. We then merged the results using the two methods described in Section 4.1. The number of feedback documents and words was tuned for each language.

====4.1 Merging strategies====

We examined two simple merging strategies: normalizing the individual scores, and two-step RSV [7]. The first strategy consists of normalizing the first N document scores to fit in the [0,1] interval, then using the normalized scores to produce the final ranked document list. This strategy is simple and requires no training, but it has been shown to be inferior to regression-based models and to two-step RSV.

Two-step RSV is a reindexing-based method: the top-ranked documents from each collection are translated into the topic language and then reindexed. Note that this is fundamentally different from translating the test collection, which we would like to avoid: only the top documents are translated, instead of a large test collection. The disadvantage of this method, however, is that translation and reindexing need to be done online. Document caching can somewhat alleviate this problem when there are many queries. Translation is done on a word-by-word basis, using the translation matrix built from the parallel corpus. We use only the first two translations for efficiency; however, we allocate S slots to each untranslated word and distribute the translated words proportionally to their normalized translation probabilities. Due to time constraints, the official runs used S=3.
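A minimal sketch of the first, normalization-based strategy is given below, assuming each per-language run is available as a ranked list of (document id, score) pairs. The min-max scaling over the top N documents follows the [0,1] normalization described above, while the list format and the default N are our own illustrative choices.

<syntaxhighlight lang="python">
# Sketch of the normalization-based merging strategy: scale each language's
# top-N scores into [0, 1], pool the lists, and re-rank by normalized score.

def normalize_scores(ranked, n=1000):
    """Min-max normalize the top-n (doc_id, score) pairs into [0, 1]."""
    top = ranked[:n]
    scores = [score for _, score in top]
    low, high = min(scores), max(scores)
    span = (high - low) or 1.0          # guard against identical scores
    return [(doc, (score - low) / span) for doc, score in top]

def merge(runs, n=1000):
    """Merge per-language runs into a single list ranked by normalized score."""
    pooled = []
    for ranked in runs:
        pooled.extend(normalize_scores(ranked, n))
    return sorted(pooled, key=lambda pair: pair[1], reverse=True)

# Toy example: merging tiny English and German result lists whose raw score
# ranges are not directly comparable.
merged = merge([[("en_doc_1", 14.2), ("en_doc_2", 9.8)],
                [("de_doc_7", 0.71), ("de_doc_3", 0.55)]])
</syntaxhighlight>

Two-step RSV avoids comparing scores across collections altogether, at the cost of translating and reindexing the top documents online.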
====4.2 Official Runs (Small Multilingual)====

All our official runs use the Title and Description fields.

Table 3: Official multilingual runs
{| class="wikitable"
|-
! Run Name !! Feedback docs/terms (pre- and post-translation) !! Merging (Norm/2-step) !! Avg. Precision
|-
| cmuM4fb || EN: 5/30; FR: 10/20-5/20; ES: 5/20-10/20; DE: 15/20-10/30 || Norm || 0.2921
|-
| cmuM4fbre || EN: 5/30; FR: 10/20-5/20; ES: 5/20-10/20; DE: 15/20-10/30 || 2-step || 0.3710
|-
| cmuM4lowfb || EN: 5/30; FR: 0/0-5/20; ES: 0/0-10/20; DE: 5/20-10/30 || Norm || 0.3398
|-
| cmuM4lowfbre || EN: 5/30; FR: 0/0-5/20; ES: 0/0-10/20; DE: 5/20-10/30 || 2-step || 0.3773
|}

Note that in this case feedback made little difference among the best runs. The merging strategy had a significant impact, with two-step RSV being better, as expected.

===5 Conclusion and Future Work===

Our main goal in participating in this year's CLEF was to show that freedom from opaque, uncontrollable commercial systems does not have to mean poor CLIR performance for European languages. Many conceptual and implementation-related improvements can still be made, including better solutions for using a pivot language (especially when the domains do not match), better morphological processing, and pseudo-relevance regression for merging.

===6 References===

[1] Braschler, M., Göhring, A. and Schäuble, P. 2002. Eurospider at CLEF 2002. In C. Peters (Ed.), Results of the CLEF 2002 Cross-Language Evaluation Forum (to appear).
[2] Brown, P.F., Della Pietra, S.A., Della Pietra, V.J. and Mercer, R.L. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2): 263-311.
[3] Chen, A. 2002. Cross-Language Retrieval Experiments at CLEF-2002. In C. Peters (Ed.), Results of the CLEF 2002 Cross-Language Evaluation Forum (to appear).
[4] Franz, M. and McCarley, J.S. 2002. Arabic Information Retrieval at IBM. In Proceedings of TREC 2002.
[5] Fraser, A., Xu, J. and Weischedel, R. 2002. TREC 2002 Cross-lingual Retrieval at BBN. In Proceedings of TREC 2002.
[6] Koehn, P. Europarl: A Multilingual Corpus for Evaluation of Machine Translation. Draft, unpublished.
[7] Martínez-Santiago, F., Martín, M. and Ureña, A. 2002. SINAI at CLEF 2002: Experiments with merging strategies. In C. Peters (Ed.), Results of the CLEF 2002 Cross-Language Evaluation Forum (to appear).
[8] Och, F.J. and Ney, H. 2000. Improved Statistical Alignment Models. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, pp. 440-447, Hong Kong, China.
[9] Ogilvie, P. and Callan, J. 2001. Experiments using the Lemur toolkit. In Proceedings of the Tenth Text REtrieval Conference (TREC-10).
[10] Savoy, J. 1999. A stemming procedure and stopword list for general French corpora. Journal of the American Society for Information Science, 50(10): 944-952.
[11] Savoy, J. 2002. Report on CLEF-2002 Experiments: Combining multiple sources of evidence. In C. Peters (Ed.), Results of the CLEF 2002 Cross-Language Evaluation Forum (to appear).