=Paper=
{{Paper
|id=Vol-1169/CLEF2003wn-adhoc-BertoldiEt2003
|storemode=property
|title=ITC-irst at CLEF 2003: Monolingual, Bilingual and Multilingual Information Retrieval
|pdfUrl=https://ceur-ws.org/Vol-1169/CLEF2003wn-adhoc-BertoldiEt2003.pdf
|volume=Vol-1169
|dblpUrl=https://dblp.org/rec/conf/clef/BertoldiF03b
}}
==ITC-irst at CLEF 2003: Monolingual, Bilingual and Multilingual Information Retrieval==
Nicola Bertoldi and Marcello Federico
ITC-irst - Centro per la Ricerca Scientifica e Tecnologica, I-38050 Povo, Trento, Italy

Abstract

This paper reports on the participation of ITC-irst in the Cross Language Evaluation Forum (CLEF) 2003, in particular in the monolingual, bilingual, small multilingual, and spoken document retrieval tracks. The languages considered were English, French, German, Italian, and Spanish. With respect to our CLEF 2002 system, the statistical models for bilingual document retrieval have been improved, more languages have been considered, and a novel multilingual information retrieval system has been developed, which combines several bilingual retrieval models within a statistical framework. As in the previous CLEF, the bilingual models integrate retrieval and translation scores over the set of N-best translations of the source query.

1. Introduction

This paper reports on the participation of ITC-irst in the Cross Language Evaluation Forum (CLEF) 2003. Several tracks were faced: monolingual document retrieval in Italian, French, German, and Spanish; bilingual document retrieval from German to Italian and from Italian to Spanish; small multilingual document retrieval from English to English, German, French, and Spanish; and, finally, cross-language spoken document retrieval from French, German, Italian, and Spanish to English.

The statistical cross-language information retrieval (CLIR) model presented at the 2002 CLEF evaluation (Federico and Bertoldi, 2002) was extended in order to cope with a multilingual target collection. Moreover, better query-translation probabilities were obtained by exploiting bilingual dictionaries and statistics from monolingual corpora. Basically, the ITC-irst system presented at the 2002 CLEF evaluation was expanded with a module for merging the document rankings that different bilingual systems generate over different document collections. Each bilingual system features a statistical model, which generates a list of the N-best query translations, and a basic IR engine, which integrates scores, computed by a standard Okapi model and a statistical language model, over the multiple translations. Remarkably, training the system's parameters just requires a bilingual dictionary, the target document collection, and a document collection in the source language.

This paper is organized as follows. Section 2 introduces the statistical approach to multilingual IR. Section 3 briefly summarizes the main features of our system and describes the retrieval procedure. Sections 4 and 5 present experimental results for each track we participated in. Section 6 closes the paper.

2. Multilingual Information Retrieval: Statistical approach

Multilingual information retrieval can be defined as the task of finding and ranking documents, relevant for a given topic, within a collection of texts in several languages. As we know the language of each document, we may view the multilingual target collection as the union of distinct monolingual collections.

2.1. Multilingual retrieval model

Let a multilingual collection D contain documents in L different languages, where D results from the union of L monolingual sub-collections D_1, ..., D_L. Let f be a query in a given source language, possibly different from any of the L languages. One would like to rank the documents d within the multilingual collection D according to the posterior probability:

\[ \Pr(d \mid \mathbf{f}) \propto \Pr(\mathbf{f}, d) \tag{1} \]

where the right term of formula (1) follows from the constancy of Pr(f) with respect to the ranking of documents.

A hidden variable l is introduced, which represents the language of either a sub-collection or a document:

\[ \Pr(\mathbf{f}, d) = \sum_{l} \Pr(l, \mathbf{f}, d) = \sum_{l} \Pr(l) \Pr(\mathbf{f}, d \mid l) \tag{2} \]

where Pr(l) is an a-priori distribution over languages, which can be estimated from the multilingual collection or taken to be uniform. Formula (2) is a weighted mixture of bilingual IR models, one per sub-collection. However, given that we know the language each document is written in, we can assume that the probability Pr(f, d | l) is larger than zero only if d belongs to the sub-collection D_l.

Next, a hidden variable e is introduced, which represents a (term-by-term) translation of f into one of the L languages. Hence, we derive the following decomposition:

\[ \Pr(\mathbf{f}, d \mid l) = \sum_{\mathbf{e}} \Pr(\mathbf{f}, \mathbf{e}, d \mid l) \approx \sum_{\mathbf{e}} \Pr(\mathbf{f}, \mathbf{e} \mid l) \Pr(d \mid \mathbf{e}, l) \tag{3} \]

In deriving formula (3), we make the assumption (or approximation) that the probability of document d, given query f, translation e, and language l, does not depend on f. Formula (3) puts in evidence a language-dependent query-translation model, Pr(f, e | l), and a collection-dependent query-document model, Pr(d | e, l).

The language-dependent query-translation model is defined as follows:

\[ \Pr(\mathbf{f}, \mathbf{e} \mid l) = \Pr(\mathbf{f} \mid l)\,\Pr\nolimits_l(\mathbf{e} \mid \mathbf{f}) \propto \begin{cases} \dfrac{\Pr_l(\mathbf{f}, \mathbf{e})}{\sum_{\mathbf{e}' \in T_l(\mathbf{f})} \Pr_l(\mathbf{f}, \mathbf{e}')} & \text{if } \mathbf{e} \in T_l(\mathbf{f}) \\ 0 & \text{otherwise} \end{cases} \]

where T_l(f) is the set of all translations of f into language l. For practical reasons, this set is approximated with the set of the N most probable translations computed by the basic query-translation model Pr_l(f, e). The term Pr(f | l) can be considered independent of l and hence be discarded. The normalization introduced in the formula above is needed in order to obtain ranking scores which are comparable among different languages.

The collection-dependent query-document model is derived from a basic query-document model Pr_l(d | e) as follows:

\[ \Pr(d \mid \mathbf{e}, l) = \begin{cases} \dfrac{\Pr_l(d, \mathbf{e})}{\sum_{d' \in I(\mathbf{e}, l)} \Pr_l(d', \mathbf{e})} & \text{if } d \in I(\mathbf{e}, l) \\ 0 & \text{otherwise} \end{cases} \]

where I(e, l) is the set of documents in D_l containing at least one word of e.

The basic query-document and query-translation models are briefly described below; more details can be found in (Bertoldi and Federico, 2002). The subscript l, which refers to the specific language or collection the models are estimated on, will be omitted without loss of generality.
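To make the decomposition concrete, the following sketch (ours, in Python; not part of the paper) shows how formulas (1)-(3) could be turned into a scoring loop. All interfaces are hypothetical stand-ins: the per-language N-best lists play the role of the query-translation model, and the query-document model Pr(d | e, l) is passed in as a function.

```python
from collections import defaultdict

def rank_multilingual(collections, n_best, prior, doc_given_translation):
    """Rank a multilingual collection by Pr(f, d), formulas (2)-(3):
    a mixture over languages l of query-translation scores and
    query-document scores. Assumed (hypothetical) interfaces:
      collections: dict lang -> iterable of (doc_id, doc)
      n_best:      dict lang -> list of (translation, joint_prob),
                   the N most probable translations of the query
      prior:       dict lang -> Pr(l), e.g. uniform
      doc_given_translation(doc, translation, lang) -> Pr(d | e, l)
    """
    scores = defaultdict(float)
    for lang, docs in collections.items():
        # Normalize the joint translation probabilities over the
        # N-best set T_l(f), so scores are comparable across languages.
        z = sum(p for _, p in n_best[lang]) or 1.0
        for translation, p_fe in n_best[lang]:
            for doc_id, doc in docs:
                p_d = doc_given_translation(doc, translation, lang)
                scores[doc_id] += prior[lang] * (p_fe / z) * p_d
    # Higher Pr(f, d) means a higher rank, by formula (1).
    return sorted(scores, key=scores.get, reverse=True)
```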
2.2. Basic Query-Document Model

The query-document model computes the joint probability of a query e and a document d written in the same language. The query-document model considered in the experiments results from the combination of two different models: a language model and an Okapi-based scoring function.

Language Model. The joint probability can be factored as follows:

\[ \Pr(\mathbf{e}, d) = \Pr(\mathbf{e} \mid d) \Pr(d) \tag{4} \]

where the a-priori probability of d, Pr(d), is assumed to be uniform, and the probability of e given d is an order-free multinomial (bag-of-words) model:

\[ \Pr(\mathbf{e} = e_1, \ldots, e_n \mid d) = \prod_{k=1}^{n} p(e_k \mid d) \tag{5} \]

Okapi. The joint probability can be obtained through the normalization, over queries and documents, of a generic scoring function s(e, d):

\[ \Pr(\mathbf{e}, d) = \frac{s(\mathbf{e}, d)}{\sum_{\mathbf{e}', d'} s(\mathbf{e}', d')} \tag{6} \]

The denominator is considered only for the sake of normalization, and can be disregarded in the computation of equation (3). A scoring function derived from the standard Okapi formula is used:

\[ s(\mathbf{e} = e_1, \ldots, e_n, d) = \prod_{k=1}^{n} idf(e_k)\, W_d(e_k) \tag{7} \]

Combination. Previous work (Bertoldi and Federico, 2001) showed that the two models rank documents almost independently. Hence, information about the relevant documents can be gained by integrating the scores of both methods. Combination of the two models is implemented by simply taking the sum of their scores, after a suitable normalization.
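As an illustration of this combination step, here is a minimal sketch; the paper only specifies "a suitable normalization" before summing, so the min-max normalization used below is our assumption.

```python
def combine_scores(lm_scores, okapi_scores):
    """Sum language-model and Okapi scores per document after
    normalizing each score set to [0, 1].
    lm_scores, okapi_scores: dict doc_id -> raw score.
    NOTE: min-max scaling is an assumption; the paper only states
    that a suitable normalization is applied before summing.
    """
    def minmax(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}

    lm, ok = minmax(lm_scores), minmax(okapi_scores)
    combined = {d: lm.get(d, 0.0) + ok.get(d, 0.0)
                for d in set(lm) | set(ok)}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```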
2.3. Basic Query-Translation Model

The query-translation model computes the probability of any query-translation pair. This probability is modeled by a hidden Markov model (HMM) (Rabiner, 1990), in which the observable variable is the query f in the source language, and the hidden variable is its translation e in the target language. According to the HMM, the joint probability of a pair (f, e) is decomposed as follows:

\[ \Pr(\mathbf{f} = f_1, \ldots, f_n, \mathbf{e} = e_1, \ldots, e_n) = p(e_1) \prod_{k=2}^{n} p(e_k \mid e_{k-1}) \prod_{k=1}^{n} p(f_k \mid e_k) \tag{8} \]

The term-translation probabilities p(f | e) are estimated from a bilingual dictionary as follows:

\[ p(f \mid e) = \frac{\delta(f, e)}{\sum_{f'} \delta(f', e)} \tag{9} \]

where δ(f, e) = 1 if the term e is one of the translations of term f, and δ(f, e) = 0 otherwise. This flat distribution can be refined through the EM algorithm (Dempster et al., 1977) by exploiting a large corpus in the source language.

The target LM probabilities p(e | e') are estimated on the target document collection through an order-free bigram LM, which tries to compensate for the different word positions induced by the source and target languages:

\[ p(e \mid e') = \frac{p(e, e')}{\sum_{e''} p(e'', e')} \tag{10} \]

where p(e, e') is the probability of e co-occurring with e', regardless of their order, within a text window of fixed size. Smoothing of this probability is performed through absolute discounting and interpolation.
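The following sketch shows a Viterbi-style search for the single best translation under the HMM of formula (8). The system actually extracts N-best translations, which would require a standard N-best extension (e.g. a tree-trellis or beam search) not shown here; the dictionary and LM interfaces are our assumptions, and a uniform p(e_1) is treated as a constant and dropped.

```python
import math

def viterbi_translation(query, candidates, bigram_lm):
    """Best term-by-term translation of `query` under the HMM of
    formula (8): hidden states are target terms, observations are
    source terms. Assumed interfaces (not from the paper):
      query:      list of source terms [f1, ..., fn]
      candidates: dict f -> dict {e: p(f | e)}, the dictionary
                  translations of f (formula 9, possibly EM-refined)
      bigram_lm:  function (e, e_prev) -> p(e | e_prev), the smoothed
                  order-free bigram LM of formula (10)
    """
    # Initialization: a uniform p(e1) is constant and can be dropped.
    delta = {e: math.log(p) for e, p in candidates[query[0]].items()}
    backptr = []
    for f in query[1:]:
        new_delta, ptr = {}, {}
        for e, p_emit in candidates[f].items():
            # Best predecessor maximizes score + log transition prob.
            prev = max(delta,
                       key=lambda ep: delta[ep] + math.log(bigram_lm(e, ep)))
            new_delta[e] = (delta[prev] + math.log(bigram_lm(e, prev))
                            + math.log(p_emit))
            ptr[e] = prev
        delta, backptr = new_delta, backptr + [ptr]
    # Backtrack from the best final state.
    e = max(delta, key=delta.get)
    best = [e]
    for ptr in reversed(backptr):
        e = ptr[e]
        best.append(e)
    return list(reversed(best))
```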
[Figure 1: Architecture of the multilingual IR system. The preprocessed source query is submitted to several bilingual IR modules, each with its own bilingual dictionary and document collection; their outputs are merged into a single list of ranked documents.]

3. System architecture

As shown in Section 2, the ITC-irst multilingual IR system features several independent bilingual retrieval systems, which return collection-dependent rankings, and a module for merging these results into a global ranking with respect to the whole multilingual collection. Moreover, language-dependent text preprocessing modules have been implemented to process documents and queries. Figure 1 shows the architecture of the system.

Two merging criteria were developed. The first, which we call the stat method, implements the statistical model introduced in Section 2: for each language, the language-dependent relevance scores computed by the bilingual IR systems are normalized in order to obtain language-independent scores, from which a global ranking is created. The second, which we call the rank method, exploits the document rank positions only: all the collection-dependent rank lists are joined and the documents are globally sorted according to the inverse of their original rank position.

Monolingual and bilingual versions of the system trivially follow by omitting the query-translation model and by limiting the collection to one language, respectively.

3.1. Preprocessing

In order to homogenize the preparation of data and, hence, to reduce the workload, a standard procedure was defined. More specifically, the following preprocessing steps were applied to both documents and queries in every language:

• Tokenization was performed to separate words from punctuation marks, to recognize abbreviations and acronyms, to correct possible word splits across lines, and to discriminate between accents and quotation marks.

• Stemming was performed by using a language-dependent Porter-like algorithm (Frakes and Baeza-Yates, 1992), freely available at snowball.tartarus.org.

• Stop-term removal was applied to the documents by removing terms included in a language-dependent public list (www.unine.ch/info/clef).

• Proper names and numbers in queries were recognized in order to improve the coverage of the dictionary.

• Out-of-dictionary terms which had not been recognized as proper names or numbers were removed.

3.2. Blind Relevance Feedback

After document ranking, the following Blind Relevance Feedback (BRF) technique was applied: first, the documents matching the source query e are ranked; then the B best-ranked documents are taken, the R most relevant terms in them are added to the query, and the retrieval phase is repeated. In the CLIR framework, R terms are added to each single translation of the N-best list and the retrieval algorithm is repeated once again. In this work, 15 new search terms are selected from the top 5 documents according to the Offer Weight proposed in (Johnson et al., 1999).
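A compact sketch of this expansion step follows. The Offer Weight below is the standard relevance-weight form used in (Johnson et al., 1999); the data-structure choices (term sets, document-frequency table) are our assumptions.

```python
import math

def offer_weight(r, n, R, N):
    """Offer Weight OW(t) = r * RW(t), with RW the standard relevance
    weight estimated from pseudo-relevant documents:
      r: pseudo-relevant documents containing the term
      n: documents in the collection containing the term
      R: number of pseudo-relevant documents
      N: number of documents in the collection
    """
    rw = math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                  ((n - r + 0.5) * (R - r + 0.5)))
    return r * rw

def expand_query(query, ranked_docs, doc_freq, num_docs,
                 top_docs=5, new_terms=15):
    """Blind Relevance Feedback: add the `new_terms` highest-scoring
    terms from the `top_docs` best-ranked documents to the query
    (15 terms from the top 5 documents in this work); retrieval is
    then repeated with the expanded query.
    ranked_docs: list of term sets, best-ranked document first.
    doc_freq:    dict term -> document frequency in the collection.
    """
    pseudo_rel = ranked_docs[:top_docs]
    candidates = set().union(*pseudo_rel) - set(query)
    scored = sorted(
        ((offer_weight(sum(1 for d in pseudo_rel if t in d),
                       doc_freq[t], len(pseudo_rel), num_docs), t)
         for t in candidates),
        reverse=True)
    return list(query) + [t for _, t in scored[:new_terms]]
```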
4. Experimental Evaluation

ITC-irst submitted 4 monolingual runs in French, German, Italian, and Spanish, 4 Italian-Spanish bilingual runs, 2 German-Italian bilingual runs, and 4 small multilingual runs using queries in English to search documents in English, French, German, and Spanish. Moreover, some unofficial experiments were performed for the sake of comparison.

4.1. Data

Table 1 reports statistics about the target collections for the five considered languages.

  Language   #docs       #words
  English    166,754     100,971,969
  French     129,809     52,275,689
  German     294,809     99,461,570
  Italian    153,208     54,434,345
  Spanish    454,045     171,971,487
  Multi-4    1,045,417   424,680,715

Table 1: Statistics about target collections.

Table 2 reports statistics about the topics and the corresponding relevant documents in each collection (topics with no relevant document are not considered).

  Language   #queries   #rel.docs
  English    54         1006
  French     52         946
  German     56         1825
  Italian    51         809
  Spanish    57         2368
  Multi-4    60         6145

Table 2: Statistics about queries.

Bilingual dictionaries from English to the other languages were gathered from publicly available resources. Unfortunately, no German-Italian or Italian-Spanish dictionaries were available; hence, the missing dictionaries were built from the other available dictionaries using English as a pivot language. For example, the Italian-Spanish dictionary was derived by exploiting the Spanish-English and Italian-English dictionaries as follows: the translation alternatives of an Italian term are all the Spanish translations of all the English translations of that term (see the sketch below). Table 3 reports some statistics of the bilingual dictionaries. It is worth noticing that, for the generated dictionaries, the average number of translation alternatives is about twice that of the original dictionaries; this would suggest that they contain, on average, two wrong translations per entry.

  Dictionary        #entries   avg. #translations
  English-French    44728      1.97
  English-German    131429     1.88
  English-Italian   44195      1.95
  English-Spanish   47305      1.83
  Italian-Spanish   66059      3.94
  German-Italian    103618     3.91

Table 3: Statistics about dictionaries.

Moreover, all term-translation probabilities, except the German-Italian ones, were estimated through the EM algorithm by using the corresponding document collections.
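The pivot construction just described is simple enough to state directly in code; the following sketch reproduces it, with the dictionary-as-dict interface being our assumption.

```python
def pivot_dictionary(src2en, en2tgt):
    """Build a missing bilingual dictionary through English as pivot:
    the translations of a source term are all target-language
    translations of all its English translations (as done here for
    Italian-Spanish and German-Italian).
    src2en, en2tgt: dict term -> list of translations.
    """
    src2tgt = {}
    for src, en_terms in src2en.items():
        # Union of the target translations of every English pivot term.
        tgt = {t for en in en_terms for t in en2tgt.get(en, [])}
        if tgt:
            src2tgt[src] = sorted(tgt)
    return src2tgt
```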
4.2. Results

Table 4 reports the main settings and the official mAvPr scores of each run. In particular, the number of N-best translations (1 vs. 10), the type of bilingual dictionary (flat vs. estimated through the EM algorithm), and the merging policy (rank vs. stat) are indicated. Source and target languages are indicated in the run name.

  Official Run   Setting              mAvPr   #q<mdn  #q=mdn  #q>mdn  #q=bst
  IRSTfr 1       -                    .5339   15      10      27      11
  IRSTde 1       -                    .5173   16      5       35      6
  IRSTit 1       -                    .5397   11      8       32      10
  IRSTes 1       -                    .5375   17      3       37      5
  IRSTit2es 1    10-best, EM          .4262   31      1       25      2
  IRSTit2es 2    10-best, flat        .4006   36      1       20      2
  IRSTit2es 3    1-best, EM           .4053   33      1       23      2
  IRSTit2es 4    1-best, flat         .4009   35      1       21      2
  IRSTde2it 1    10-best, flat        .2291   38      0       18      0
  IRSTde2it 2    1-best, flat         .2437   36      0       20      0
  IRSTen2xx 1    10-best, EM, rank    .3147   23      1       36      0
  IRSTen2xx 2    10-best, EM, stat    .3089   22      2       36      1
  IRSTen2xx 3    10-best, flat, rank  .3084   25      2       33      0
  IRSTen2xx 4    10-best, flat, stat  .3036   25      1       34      1

Table 4: Main settings and results of the official runs; comparison against the median and best values (number of queries below, equal to, and above the median, and equal to the best).

Monolingual results. As shown in Table 4, our monolingual retrieval system achieves good results for all languages: more than 70% of the queries have mAvPr greater than or equal to the median values. It is worth noticing that the mAvPr values are nearly the same for all languages.

Bilingual results. The Italian-Spanish results show that the estimation of translation probabilities through the EM algorithm is quite effective, especially in combination with the 10-best translations.

  Language   monolingual   bilingual from English
  French     .5339         .4297
  German     .5173         .4378
  Italian    .5397         .4184
  Spanish    .5375         .4298

Table 5: Comparison of monolingual and bilingual performance.

Table 5 reports the mAvPr of the monolingual and bilingual runs for every language; the 10-best translations were obtained with EM-estimated translation probabilities. A relative degradation between 15% and 22% is always observed. This means that the translation process causes almost equal losses in performance for each language pair.

Multilingual results. As shown in Table 4, about 60% of the queries have mAvPr greater than or equal to the median values. The merging method based on the rank is a little more effective, but the differences are very small. Again, the EM estimation of term probabilities slightly improves performance.

The merging criteria were also applied to the monolingual runs, in order to obtain an upper bound for our multilingual retrieval system. The mAvPrs achieved in this virtual experiment were .3754 and .3667 for the rank and stat criteria, respectively. The relative degradation is very similar to that observed in the bilingual experiments.

5. Cross-Language Spoken Document Retrieval

ITC-irst also participated in the Cross-Language Spoken Document Retrieval (CLSDR) track, which consists in searching for relevant stories within a collection of automatically transcribed English broadcast news. Topics consist of 50 short queries manually translated from English into French, German, Italian, and Spanish. For the CLSDR track, the bilingual version of the ITC-irst IR system was applied, with small changes in the BRF expansion of queries. Moreover, German texts were also processed to split compound words, by means of a dynamic programming (DP) based algorithm.

5.1. Query expansion on parallel corpora

As the number of stories in the SDR target collection was quite small, a double query expansion policy was chosen: new terms are added which are extracted not only from the target collection, but also from a large corpus of written texts, consisting of newspapers and news wires.

As a parallel corpus for query expansion, newspaper articles from the North American News Text corpus were used (www.nist.gov/speech/tests/sdr). In particular, 313K documents were extracted from the Los Angeles Times, Washington Post, New York Times, and Associated Press Worldstream, issued between September 1997 and April 1998. Unfortunately, the available texts do not entirely cover the test period. The following strategy was chosen: query expansion was performed first on the parallel texts, and then on the target collection.

5.2. Results

Table 6 reports the official submitted runs and some unofficial runs used for comparison.

  Official run             Query   mAvPr
  mono-brf                 EN      .3944
  mono-brf-brf             EN      .4244
  fr-en-1bst-brf-bfr       FR      .2281
  fr-en-sys-brf-bfr        FR      .3064
  de-en-dec-1bst-brf-bfr   DE      .2676
  de-en-1bst-brf-bfr       DE      .2523
  de-en-sys-brf-bfr        DE      .2880
  it-en-1bst-brf-bfr       IT      .2347
  it-en-sys-brf-bfr        IT      .3218
  es-en-1bst-brf-bfr       ES      .2746
  es-en-sys-brf-bfr        ES      .3555

Table 6: mAvPr results of the CLSDR track at CLEF 2003.

The official English monolingual run was performed in order to evaluate the quality of the retrieval system; ITC-irst performance is about 10% above the other participants. For this experiment, query expansion on the parallel corpus was not applied; when it is applied, a relative improvement of 7% is observed. As the double query expansion policy is quite effective, it was applied in all the other experiments.

In the bilingual experiments, queries were translated either through our 1-best translation approach or by the Babelfish translation service, powered by Systran, which is available on the Internet (world.altavista.com); run names indicate this with 1bst and sys, respectively. The commercial translations outperform our approach.

German word decompounding seems to be slightly effective, as shown by comparing the run without decompounding (de-en-1bst-brf-bfr) with the run using it (de-en-dec-1bst-brf-bfr).
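The paper does not detail its DP-based decompounding algorithm, so the following sketch is only a plausible reconstruction: a dynamic program that splits a compound into the fewest in-vocabulary parts. The vocabulary lookup and minimum part length are our assumptions, and German linking elements (e.g. an inserted "s") are ignored for brevity.

```python
def split_compound(word, vocab, min_len=3):
    """Split a compound into known parts via dynamic programming.
    best[i] holds (number of parts, parts list) for the optimal
    segmentation of word[:i], or None if word[:i] cannot be covered.
    NOTE: the minimal-parts criterion and `vocab` membership test are
    assumptions; the paper only mentions a DP-based algorithm.
    """
    n = len(word)
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(1, n + 1):
        # Try every admissible start j of a final part word[j:i]
        # (parts are at least min_len and at most 25 characters long).
        for j in range(max(0, i - 25), i - min_len + 1):
            part = word[j:i]
            if best[j] is not None and part in vocab:
                cand = (best[j][0] + 1, best[j][1] + [part])
                if best[i] is None or cand[0] < best[i][0]:
                    best[i] = cand
    # Fall back to the unsplit word if no full segmentation exists.
    return best[n][1] if best[n] else [word]
```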
6. Conclusion

This paper presented the multilingual IR system developed at ITC-irst. A complete statistical model was defined, which combines several bilingual retrieval models. The system was evaluated in the CLEF 2003 campaign in the monolingual, bilingual, and multilingual tracks. The basic monolingual IR model proved very competitive for every language, and the multilingual IR system also achieves performance above the median. Experiments in the Cross-Language Spoken Document Retrieval task, which uses very short queries, showed that significantly better results are still achieved by using translations produced by a commercial system.

7. References

Bertoldi, N. and M. Federico, 2001. ITC-irst at CLEF 2000: Italian monolingual track. In Carol Peters (ed.), Cross-Language Information Retrieval and Evaluation, volume 2069 of Lecture Notes in Computer Science. Heidelberg, Germany: Springer Verlag.

Bertoldi, N. and M. Federico, 2002. ITC-irst at CLEF 2001: Monolingual and bilingual tracks. In Carol Peters, Martin Braschler, Julio Gonzalo, and Michael Kluck (eds.), Cross-Language Information Retrieval and Evaluation, volume 2406 of Lecture Notes in Computer Science. Heidelberg, Germany: Springer Verlag.

Dempster, A. P., N. M. Laird, and D. B. Rubin, 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38.

Federico, Marcello and Nicola Bertoldi, 2002. Statistical cross-language information retrieval using n-best query translations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Tampere, Finland.

Frakes, William B. and Ricardo Baeza-Yates (eds.), 1992. Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice Hall.

Johnson, S. E., P. Jourlin, K. Spärck Jones, and P. C. Woodland, 1999. Spoken document retrieval for TREC-8 at Cambridge University. In Proceedings of the 8th Text REtrieval Conference. Gaithersburg, MD.

Rabiner, Lawrence R., 1990. A tutorial on hidden Markov models and selected applications in speech recognition. In Alex Waibel and Kai-Fu Lee (eds.), Readings in Speech Recognition. Los Altos, CA: Morgan Kaufmann, pages 267-296.