Introduction

Miracle's 2005 Approach to Cross-lingual Information Retrieval

José C. González

José Miguel Goñi-Menoyo

josemiguel.goni@upm.es

Julio Villena-Román

julio.villena@uc3m.es

Universidad Politécnica de Madrid

Universidad Carlos III de Madrid

DAEDALUS - Data, Decisions and Language, S.A.

Experiments, Ad-Hoc Automatic Translation


This paper presents the 2005 MIRACLE team's approach to Bilingual and Multilingual Information Retrieval. In the multilingual track, we have concentrated our work on the process of merging the results of monolingual runs to obtain the overall multilingual result, relying on available translations. In the bilingual and multilingual tracks, we have used available translation resources, and in some cases we have used a combining approach.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.2 Information Storage; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software. E.1 [Data Structures]; E.2 [Data Storage Representations]. H.2 [Database Management]

The MIRACLE team is made up of three university research groups located in Madrid (UPM, UC3M and UAM) along with DAEDALUS, a company founded in 1998 as a spin-off of two of these groups. DAEDALUS is a leading company in linguistic technologies in Spain and is the coordinator of the MIRACLE team. This is the team's third participation in CLEF, after 2003 and 2004 [7], [8], [10], [11], [12], [14], [15], [25], [26]. In addition to the bilingual, monolingual and cross-lingual tasks, the team has participated in the ImageCLEF, Q&A, WebCLEF and GeoCLEF tracks.

The starting point was the same set of basic components that we are using for the monolingual track: stemming, transformation (transliteration, elimination of diacritics and conversion to lowercase), filtering (elimination of stop words and frequent words), proper noun extraction, paragraph extraction, and pseudo-relevance feedback. Some of these basic components are used in different combinations and orders of application for document indexing and for query processing. Second-order combinations were also tested, mainly by averaging or by selective combination of the documents retrieved by different approaches for a particular query. When there is evidence that one system has better precision at one extreme of the recall level (i.e. 1), complemented by the better precision of another system at the other recall extreme (i.e. 0), both are combined to benefit from their complementary results. The indexing and retrieval engine used in this track is also our own trie-based engine.

For the 2005 bilingual track, runs were submitted for the following language pairs: English to Bulgarian, French, Hungarian and Portuguese; and Spanish to French and Portuguese. For the multilingual track, runs were submitted using English, French, and Spanish as source languages.


Description of the MIRACLE Toolbox

A more in-depth description of the MIRACLE toolbox used for pre-processing and indexing the document collections required for each of the tracks can be found in the paper “Miracle’s 2005 Approach to Monolingual Information Retrieval”, which is available in this on-line documentation. In this section we describe only the multilingual aspects of these tracks, since the indexing and retrieval processes are exactly the same as in the monolingual track. This means that our approach to cross-lingual information retrieval relies on translating the topic queries into the target languages.

Experiments submitted to Ad Hoc Bilingual track

For ease of reference, we describe the letters that make up the run identifiers we used for the different experiments. The first two characters of the run identifier denote the source language (the target language is not encoded):

EN: for English.

ES: for Spanish.

The third character in bilingual runs encodes the engine used to translate the topic queries:

A: ATRANS [1] was used for the pairs EsFr and EsPt.

B: Bultra [16] for EnBg.

M: MoBiCAT [13] for EnHu.

S: SYSTRAN [21] was used for the following language pairs: EnFr, EsFr, and EnPt.

X: Used for English to Bulgarian, see below on combining experiments.

W: Webtrance [27] for EnBg.

The rest of the characters encode the basic processes, or the variations of the extraction process for topic queries:

S: Standard or baseline treatment: tokenization, filtering, stemming, and transformation¹.

N: Non-stemming treatment: tokenization, filtering, and transformation.

R: Use the narrative field in the topics.

T: Do not use narrative field.

So, the basic experiments are denoted SR, ST, NR, or NT. We designed additional experiments:

P: Treatment that extracts proper nouns from topic queries, even using the narrative field. Since no stemming is applied to the proper nouns detected, a non-stemmed index is needed. Thus, the only possible experiment is denoted NP.

H: The standard treatment (S) is applied to the index built from the paragraphs of the documents in the collection. The combination of the paragraphs retrieved is done as stated in the previous section.
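The identifier scheme listed above can be illustrated with a small decoder. This is only a sketch for reading run names: the function `decode_bilingual_run` and the dictionary names are hypothetical and are not part of the MIRACLE toolbox, and identifiers of combined experiments (those whose remainder starts with `x`) are returned without further decoding.

```python
# Hypothetical helper for reading bilingual run identifiers such as ENBSR.
# The letter meanings come from the lists in this section; the names below
# are illustrative assumptions only.
SOURCE_LANGUAGES = {"EN": "English", "ES": "Spanish"}

TRANSLATORS = {
    "A": "ATRANS",
    "B": "Bultra",
    "M": "MoBiCAT",
    "S": "SYSTRAN",
    "W": "Webtrance",
    "X": "average of Bultra and Webtrance results",
}

BASIC_PROCESS = {
    "S": "standard treatment",
    "N": "non-stemming treatment",
    "H": "standard treatment on the paragraph index",
}

TOPIC_FIELDS = {
    "R": "narrative field used",
    "T": "narrative field not used",
    "P": "proper nouns extracted from the topic",
}


def decode_bilingual_run(run_id):
    """Split a run identifier into source language, translator and treatment."""
    source = SOURCE_LANGUAGES[run_id[:2]]
    translator = TRANSLATORS[run_id[2]]
    rest = run_id[3:]
    if rest.startswith("x"):
        # Combined experiments are left undecoded in this sketch.
        treatment = "combined experiment " + rest
    else:
        treatment = BASIC_PROCESS[rest[0]] + ", " + TOPIC_FIELDS[rest[1]]
    return {"source": source, "translator": translator, "treatment": treatment}


if __name__ == "__main__":
    print(decode_bilingual_run("ENBSR"))
    print(decode_bilingual_run("ESAxNP01SR1"))
```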

Depending on the fields selected in the topic queries, we can denote these experiments as HR or HT.

Finally, we describe the combining experiments we used:

x<run1>WD<run2>X: This denotes an asymmetric DWX combination of experiments <run1> and <run2>. For example, we ran the experiment xNP01HR1, which takes the first document retrieved in the NP run, with its original relevance value, and all the documents retrieved in the HR experiment, also with their original relevance values, and then re-sorts the results by these relevance figures.

Average: this is used to average the results from other experiments. This is done by adding the relevance measure values obtained for each document in the combined experiments. In the bilingual runs, we have used this to combine the results from the translations into Bulgarian produced by the Webtrance and Bultra systems, but we have encoded this using X in the third character of the bilingual run identifier.

¹ As noted before, in some cases these two steps are applied in reverse order.
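The two combining experiments described above can be sketched as follows. This is an illustration under stated assumptions, not the toolbox code: a run is represented as a list of (document identifier, relevance) pairs for a single query, sorted by decreasing relevance; the function names are hypothetical; and when the first document of the first run also appears in the second run, the sketch keeps the value from the first run, which the text does not specify.

```python
from collections import defaultdict


def asymmetric_combination(run1, run2, top_d=1):
    """Keep the first top_d documents of run1 and every document of run2,
    all with their original relevance values, then re-sort by relevance.

    Assumption: if a document is present in both inputs, the value coming
    from run1 is kept.
    """
    merged = dict(run2)
    for doc_id, relevance in run1[:top_d]:
        merged[doc_id] = relevance
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


def average_combination(*runs):
    """Add the relevance values obtained for each document in the given runs,
    as described for the Average combination, and sort by the summed value."""
    totals = defaultdict(float)
    for run in runs:
        for doc_id, relevance in run:
            totals[doc_id] += relevance
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented relevance values, for illustration only.
    np_run = [("d1", 9.0), ("d2", 5.0)]
    hr_run = [("d3", 7.0), ("d2", 4.0)]
    print(asymmetric_combination(np_run, hr_run))

    bultra_run = [("d1", 3.0), ("d2", 1.0)]
    webtrance_run = [("d2", 2.5), ("d3", 1.0)]
    print(average_combination(bultra_run, webtrance_run))
```

Summing instead of dividing by the number of runs does not change the ranking, since every document total would be divided by the same constant.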

Experiments submitted to Multi-8 Two Years On track

For multilingual runs, we have avoided working on the translation problem and have mainly used the topic query translations provided [4], although we have made some experiments using translations that we obtained ourselves (for Spanish and French as source languages).

We tested Savoy's [20] approach of concatenating translations for the case in which English is the source language. Two cases were considered: all available translations are concatenated, or only selected translations are concatenated. The following table shows which translations were used:

Translation

ALT BA1 BA2 BA3 FRE GOO INT LIN REV SYS DE AH A A AH AH A AH AH EN ES AH A A AH AH A AH A

Topic language

FI FR

A AH

A AH A A AH

AH AH A A AH A

IT AH A A AH AH A A

NL AH A A AH A

SV AH A A AH

For translation systems the following codes are used: ALT for Babelfish Altavista [2], BA1, BA2, and BA3² for Babylon [3], FRE for FreeTranslation [6], GOO for Google Language Tools [9], INT for InterTrans [22], LIN for WorldLingo [28], REV for Reverso [18], and SYS for Systran [21]. The entries in the table contain A (for ALL) if a translation is available from English to the topic language shown in the heading row of a column, and this is used for the concatenation of all available translations; and H if a translation is selected for concatenation in the second case described above. These same letters are used near the end of the run identifiers, just after the letter p, which denotes test runs (runs that use topic queries numbered from 161 to 200).

In the case of Spanish and French as topic languages, the following translators were used: ATRANS [1] for the pairs EsFr and EsPt; WorldLingo [28] for EsDe, EsIt, and EsNl; InterTrans [22] for EsFi, EsSv, FrFi, and FrSv; and SYSTRAN [21] for all the other language pairs. Only one translator was used for each pair. The runs that use the direct machine translation of topics in Spanish and French are denoted “nte” and “ntf” respectively.
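The two concatenation cases (all available translations, or only the selected ones) can be sketched as below. The function name, the alphabetical ordering of the systems and the example strings are assumptions made for illustration; the text does not state how the concatenated query was assembled, and this sketch simply keeps repeated words.

```python
def concatenate_translations(translations, selected=None):
    """Build one query text from several machine translations of a topic.

    translations: dict mapping a system code (for example "ALT" or "SYS")
                  to the translated topic text.
    selected:     optional set of system codes; when given, only those
                  translations are concatenated (the "H" case), otherwise
                  every available translation is used (the "A" case).
    """
    codes = sorted(translations)
    if selected is not None:
        codes = [code for code in codes if code in selected]
    return " ".join(translations[code] for code in codes)


if __name__ == "__main__":
    # Invented strings standing in for translated topic text.
    available = {
        "ALT": "elecciones europeas",
        "GOO": "elecciones de europa",
        "SYS": "eleccion europea",
    }
    print(concatenate_translations(available))
    print(concatenate_translations(available, selected={"ALT", "SYS"}))
```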

Another approach was used in the case of Spanish as topic language. It consists in pre-processing the topics with a high-quality morphological analysis tool, STILUS³. STILUS recognizes not only closed words, but also expressions (prepositional, adverbial, etc.). In this case, STILUS is simply used to discard closed words and expressions from the topics and to obtain the main form of their component words (in most cases, singular masculine or feminine for nouns and adjectives, and infinitive for verbs). The queries are thus transformed into a simple list of words that is passed to the automatic translators (one word per line). The runs using this approach are denoted “sti”.

The first four characters in the run identifier denote the source language:⁴

enml: English.

esml: Spanish.

frml: French.

² The digit after BA shows how many words are used from the translation of a word, provided that it returns more than one.

³ STILUS® is a trademark of DAEDALUS-Data, Decisions and Language, S.A. It is the core of the company's Spanish-processing tools, which include spelling, grammar and style checkers, fuzzy search engines, semantic processing, etc.

⁴ The last two characters are ml for “multilingual”. We have not used consistent criteria for bilingual and multilingual runs.

The approach used for multilingual information retrieval, which translates each topic query for each of the document collections, is very sensitive to the way the relevance measures are merged. The probabilistic BM25 [19] approach used for monolingual retrieval gives relevance measures that depend heavily on parameters that are themselves too dependent on the monolingual collection, so it is not well suited to this type of multilingual merging, since relevance measures are not comparable among collections. In spite of this, we made merging experiments using the relevance figures obtained from each monolingual retrieval process, considering three cases:⁵

Using original relevance measures for each document as obtained from the monolingual retrieval process. The results are composed of the documents with greater relevance measures.

Normalizing relevance measures with respect to the maximum relevance measure obtained for each topic query i (normal normalization):

$$\mathit{rel}_i^{\mathrm{norm}} = \frac{\mathit{rel}_i}{\mathit{rel}_i^{\max}}$$

The results are composed of the documents with greater normalized relevance measures.

Normalizing relevance measures with respect to the maximum and minimum relevance measure obtained for each topic query i (alternate normalization):

$$\mathit{rel}_i^{\mathrm{alt}} = \frac{\mathit{rel}_i - \mathit{rel}_i^{\min}}{\mathit{rel}_i^{\max} - \mathit{rel}_i^{\min}}$$

The results are composed of the documents with greater alternate normalized relevance measures. We denote if normalization is done in the run identifier using the last character: N means normal normalization whereas L denotes alternate normalization. When neither L nor N is present, no normalization has been made for that run.
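The three merging cases above can be sketched in a few lines. This is a minimal illustration, assuming that each monolingual result list for a topic query is a list of (document identifier, relevance) pairs; the function names and the guards against division by zero are assumptions added for the sketch and are not described in the paper.

```python
def normalize_normal(scores):
    """Normal normalization: divide each relevance value by the maximum."""
    top = max(scores)
    if top == 0:
        return [0.0 for _ in scores]  # assumption: avoid dividing by zero
    return [score / top for score in scores]


def normalize_alternate(scores):
    """Alternate normalization: (rel - min) / (max - min)."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]  # assumption: avoid dividing by zero
    return [(score - low) / (high - low) for score in scores]


def merge_runs(runs_by_language, normalizer=None, limit=1000):
    """Merge the monolingual results of one topic query by relevance value.

    runs_by_language: dict mapping a language code to a list of
                      (doc_id, relevance) pairs for that query.
    normalizer:       None (original values), normalize_normal or
                      normalize_alternate, applied per language.
    """
    merged = []
    for language, results in runs_by_language.items():
        scores = [relevance for _, relevance in results]
        if normalizer is not None and scores:
            scores = normalizer(scores)
        for (doc_id, _), score in zip(results, scores):
            merged.append((language, doc_id, score))
    merged.sort(key=lambda row: row[2], reverse=True)
    return merged[:limit]


if __name__ == "__main__":
    # Invented relevance values, for illustration only.
    runs = {
        "en": [("e1", 20.0), ("e2", 10.0)],
        "fr": [("f1", 4.0), ("f2", 3.0), ("f3", 2.0)],
    }
    print(merge_runs(runs))                       # original values
    print(merge_runs(runs, normalize_normal))     # suffix N
    print(merge_runs(runs, normalize_alternate))  # suffix L
```

With the original values, the collection with the larger raw figures fills the top of the merged list, which is the lack of comparability among collections that motivates the two normalizations.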

In addition to all this, we tried a different approach to merging. Considering that the most relevant documents for each topic are usually the first ones in the results list, we select from each monolingual results file a variable number of documents, proportional to the average relevance of its first N documents. Thus, if we need 1,000 documents for a given topic query, we take more documents from the languages where the average relevance of the first N documents is greater. We did all this both from non-normalized runs, also normalizing after the merging process is done (with normal and alternate normalization), and from runs normalized with alternate normalization. We tested several cases that are encoded in the run identifier, when appropriate, from the fifth character. This is done for SR or ST runs, and the following table shows the parameters used:

⁵ Round-robin merging of the results of each monolingual collection has not been used.

For example, the average combining approach gives better results when combining the results from the translations into Bulgarian than the Bultra or Webtrance systems alone. In multilingual experiments, combining (concatenating) translations gives better results, as was reported previously [20], when good translations are available. This seems to explain why H concatenations are better than A ones. Regarding the merging aspects, our approach has obtained better results than standard merging, whether normalized or not. Alternate normalization seems to behave better than normal normalization, whereas the latter behaves better than no normalization. This also occurs when normalization is used in our own approach to merging. Regarding the approach of preprocessing queries in the source topic language with high-quality tools for extracting content words before translation, the results have been good in the case of Spanish (with our tool STILUS). This approach has obtained the best precision figures at the 0 and 1 recall extremes, although worse average precision than other runs.
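The merging approach that selects a number of documents per language proportional to the average relevance of the first N documents can be sketched as follows. The function names are hypothetical, the rounding of the quotas is an assumption (so the quotas may not add up exactly to the requested total), and sorting the selected documents by relevance at the end is also an assumption, since the text does not detail that step.

```python
def allocate_quotas(runs_by_language, first_n, total=1000):
    """Give each language a share of `total` proportional to the average
    relevance of its first `first_n` retrieved documents."""
    averages = {}
    for language, results in runs_by_language.items():
        head = [relevance for _, relevance in results[:first_n]]
        averages[language] = sum(head) / len(head) if head else 0.0
    mass = sum(averages.values())
    if mass == 0:
        return {language: 0 for language in runs_by_language}
    # Assumption: plain rounding; quotas may differ slightly from `total`.
    return {
        language: round(total * average / mass)
        for language, average in averages.items()
    }


def proportional_merge(runs_by_language, first_n, total=1000):
    """Take the allotted number of top documents from each language and
    sort the selection by relevance (the final sort is an assumption)."""
    quotas = allocate_quotas(runs_by_language, first_n, total)
    merged = []
    for language, results in runs_by_language.items():
        for doc_id, relevance in results[:quotas[language]]:
            merged.append((language, doc_id, relevance))
    merged.sort(key=lambda row: row[2], reverse=True)
    return merged


if __name__ == "__main__":
    # Invented relevance values, for illustration only.
    runs = {
        "en": [("e1", 6.0), ("e2", 4.0), ("e3", 2.0), ("e4", 1.0)],
        "fr": [("f1", 3.0), ("f2", 2.0), ("f3", 1.0), ("f4", 0.5)],
    }
    print(allocate_quotas(runs, first_n=2, total=3))
    print(proportional_merge(runs, first_n=2, total=3))
```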

Future work

The future work of the MIRACLE team regarding cross-lingual tasks will be centered on the merging of the monolingual results. The translation aspects of this process are not of interest to us, since they fall outside our research interests: we will only use available translation resources, and we will try to combine them to get better results.

On the other hand, the process of merging the monolingual results is very sensitive to the way it is done, and there are some techniques still to explore. In addition, a different way of measuring relevance may be needed for monolingual retrieval when multilingual merging has to be done. Such a measure should be independent of the collection, so that monolingual relevance measures would be comparable.

Acknowledgements

This work has been partially supported by the Spanish R+D National Plan, by means of the project RIMMEL (Multilingual and Multimedia Information Retrieval, and its Evaluation), TIN2004-07588-C03-01. Special mention should be made of our colleagues in the MIRACLE team (in alphabetical order): Ana María García-Serrano, Ana González-Ledesma, José Mª Guirao-Miras, Sara Lana-Serrano, José Luis Martínez-Fernández, Paloma Martínez-Fernández, Ángel Martínez-González, Antonio Moreno-Sandoval and César de Pablo-Sánchez.

References

[1] Automatic Trans SL, Spain. Automatic translation server. On line http://www.automatictrans.es [Visited 28/07/2005].

[2] BabelFish translation resources. On line http://babelfish.altavista.com [Visited 20/07/2005].

[3] Babylon.com, Ltd, Israel. On line http://www.babylon.com [Visited 11/08/2005].

[4] CLEF 2005 Multilingual Information Retrieval resources page. On line http://www.computing.dcu.ie/~gjones/CLEF2005/Multi-8/ [Visited 11/08/2005].

[5] Ergane multilingual translation dictionary. On line http://download.travlang.com [Visited 20/07/2005].

[6] Free2Translation. Free text translator. On line http://www.freetranslation.com [Visited 20/07/2005].

[7] Goñi-Menoyo, José M.; González, José C.; Martínez-Fernández, José L.; and Villena, J. Miracle’s Hybrid Approach to Bilingual and Monolingual Information Retrieval. CLEF 2004 proceedings (Peters, C. et al., Eds.). Lecture Notes in Computer Science, vol. 3491, pp. 188-199. Springer, 2005 (to appear).

[8] Goñi-Menoyo, José M.; González, José C.; Martínez-Fernández, José L.; Villena-Román, Julio; García-Serrano, Ana; Martínez-Fernández, Paloma; de Pablo-Sánchez, César; and Alonso-Sánchez, Javier. Miracle’s hybrid approach to bilingual and monolingual Information Retrieval. Working Notes for the CLEF 2004 Workshop (Carol Peters and Francesca Borri, Eds.), pp. 141-150. Bath, United Kingdom, 2004.

[9] Google language tools. On line http://www.google.com/language_tools [Visited 20/07/2005].

[10] [11] [12]

Martínez, J.L.; Villena-Román, J.; Fombella, J.; García-Serrano, A.; Ruiz, A.; Martínez, P.; Goñi, J.M.; and González, J.C. (Carol Peters, Ed.): Evaluation of MIRACLE approach results for CLEF 2003. Working Notes for the CLEF 2003 Workshop, 21-22 August, Trondheim, Norway.

[13] MorphoLogic, Hungary. MoBiCAT translation resources. On line http://www.morphologic.hu [Visited 28/07/2005].

[14] de Pablo, C.; Martínez-Fernández, J. L.; Martínez, P.; Villena, J.; García-Serrano, A. M.; Goñi, J. M.; and González, J. C. miraQA: Initial experiments in Question Answering. Working Notes for the CLEF 2004 Workshop, pp. 405-411 (Carol Peters and Francesca Borri, Eds.), pgs. 371-376. Bath, United Kingdom, 2004.

[15] de Pablo, C.; Martínez-Fernández, J. L.; Martínez, P.; Villena, J.; García-Serrano, A. M.; Goñi, J. M.; and González, J. C. miraQA: Initial experiments in Question Answering. CLEF 2004 proceedings (Peters, C. et al., Eds.). Lecture Notes in Computer Science, vol. 3491. Springer, 2005 (to appear).

[16] Pro Langs Ltd., Bulgaria. BULTRA translation resources. On line http://www.bultra.com [Visited 28/07/2005].

[17] Prompt-Online free automatic translation service. On line http://translation2.paralink.com [Visited 20/07/2005].

[18] Reverso translation resources. On line http://www.reverso.net/text_translation.asp [Visited 20/07/2005].

[19] Robertson, S.E. et al. Okapi at TREC-3. In Overview of the Third Text REtrieval Conference (TREC-3). D.K. Harman (Ed.). Gaithersburg, MD: NIST, April 1995.

[20] Savoy, Jacques. Report on CLEF-2003 Multilingual Tracks. Comparative Evaluation of Multilingual Information Access Systems (Peters, C; Gonzalo, J.; Brascher, M.; and Kluck, M., Eds.). Lecture Notes in Computer Science, vol. 3237, pp. 64-73. Springer, 2004.

[21] SYSTRAN Software Inc., USA. SYSTRAN 5.0 translation resources. On line http://www.systransoft.com [Visited 13/07/2005].

[22] Translation Experts Ltd. InterTrans translation resources. On line http://www.tranexp.com [Visited 28/07/2005].

[23] Travlang translating dictionaries. On line http://diction.travlang.com/dictionaries/diction.php [Visited 20/07/2005].

[24] University of Neuchatel. Page of resources for CLEF (Stopwords, transliteration, stemmers …). On line http://www.unine.ch/info/clef [Visited 13/07/2005].

[25] Villena, Julio; Martínez, José L.; Fombella, Jorge; G. Serrano, Ana; Ruiz, Alberto; Martínez, Paloma; Goñi, José M.; and González, José C. Image Retrieval: The MIRACLE Approach. Comparative Evaluation of Multilingual Information Access Systems (Peters, C; Gonzalo, J.; Brascher, M.; and Kluck, M., Eds.). Lecture Notes in Computer Science, vol. 3237, pp. 621-630. Springer, 2004.

[26] Villena-Román, J.; Martínez, J.L.; Fombella, J.; García-Serrano, A.; Ruiz, A.; Martínez, P.; Goñi, J.M.; and González, J.C. (Carol Peters, Ed.): MIRACLE results for ImageCLEF 2003. Working Notes for the CLEF 2003 Workshop, 21-22 August, Trondheim, Norway.

[27] Skycode Ltd., Bulgaria. Webtrance translation program. On line http://webtrance.skycode.com/?current=&lang=en [Visited 09/08/2005].

[28] WorldLingo Translations LLC, USA. WorldLingo free online translator. On line http://www.worldlingo.com/en/products_services/worldlingo_translator.html [Visited 28/07/2005].

Appendix

For each of the cross-lingual tracks, we show a table with the precision at the 0 and 1 points of recall, the average precision, the percentage deviation (in average precision) from the best one obtained, and the run identifier. The results are sorted in ascending order of average precision, and an asterisk marks the best precision value in each column. The runs submitted to CLEF 2005 are shown in boldface, and the figures show the precision-recall graphs for the submitted runs as well as for our best⁶ runs, provided that these were not submitted.

⁶ That is, the best run in average precision, or in precision at the 0 or 1 points of recall.

Results for bilingual English to Bulgarian

At 0       At 1       Avgp       %          Run
0.3737     0.0191     0.1405     -40.34%    ENBHT
0.3794     0.0192     0.1489     -36.77%    ENWHT
0.4292     0.0208     0.1635     -30.57%    ENXHT
0.4617     0.0470     0.1926     -18.22%    ENBSR
0.4985     0.0420     0.2014     -14.48%    ENBST
0.4997     0.0382     0.2112     -10.32%    ENWSR
0.5321     0.0389     0.2132     -9.47%     ENWST
0.4924     0.0501*    0.2194     -6.84%     ENXSR
0.5801*    0.0432     0.2355*    -0.00%     ENXST

[Figure: Bilingual runs: English to Bulgarian (precision-recall graph)]

Results for bilingual X to French

[Figure: Bilingual runs: Spanish to French (precision-recall graph). Runs shown: ESSSR, ESASR, ESSxNP01SR1, ESSST]

Results for bilingual English to Hungarian

Results for bilingual X to Portuguese

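[Figure: Bilingual runs: English to Portuguese (precision-recall graph). Runs shown: ENSSR, ENSxNP01SR1, ENSST, ENSNP]

[Figure: Bilingual runs: Spanish to Portuguese (precision-recall graph). Runs shown: ESASR, ESAxNP01SR1, ESAST]

[Figure: Multilingual runs from English (precision-recall graph). Runs shown: enml0XSRpHL, enmlSTpHL, enmlXSRpA, enmlSTpH]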

At 0 0.6860 0.6860 0.6860 0.6860 0.6860 0.6860 0.6901 0.6897 0.6895 0.6901 0.6896 0.6899 0.7828* 0.7828* 0.7828* 0.7828* 0.7828* 0.7828*