ITC-irst at CLEF 2001: Monolingual and Bilingual Tracks

Nicola Bertoldi

Marcello Federico

ITC-irst - Centro per la Ricerca Scientifica e Tecnologica, I-38050 Povo, Trento, Italy


This paper reports on the participation of ITC-irst in the Cross Language Evaluation Forum (CLEF) of 2001. ITC-irst took part in two tracks: the monolingual retrieval task and the bilingual retrieval task. In both cases, Italian was chosen as the query language, while English was chosen as the document language of the bilingual task. The retrieval engine combines scores computed by an Okapi model and a statistical language model. The cross-language system employs a statistical query translation model, which is estimated on the target document collection and on a translation dictionary.

Introduction

This paper reports on the participation of ITC-irst in two Information Retrieval (IR) tracks of the Cross Language Evaluation Forum (CLEF) 2001: the monolingual retrieval task and the bilingual retrieval task. The language of the queries was always Italian, and English documents were searched in the bilingual task. With respect to the 2000 CLEF evaluation (Bertoldi and Federico, 2000), the monolingual IR system was only slightly refined, while most of the effort was dedicated to developing an original cross-language IR system.

The basic IR engine, used for both evaluations, combines the scores of a standard Okapi model and of a statistical language model. For cross-language IR, a light-weight statistical model for translating queries was developed, which does not need any parallel or comparable corpora for training, but just the target document collection and a bilingual dictionary.

This paper is organized as follows. Section 2 presents the text pre-processing modules. Section 3 describes the IR models, and Section 4 introduces the cross-language specific models, namely the query translation model and the retrieval model. Section 5 presents the official evaluation results. Finally, Section 6 gives some conclusions.

Text Pre-processing

Text pre-processing is performed in several stages, which may differ according to the task and language. The modules used to pre-process documents and queries are listed below, together with the languages to which they apply.

2.1. Tokenization - IT+EN

Text tokenization is performed in order to isolate words from punctuation marks, recognize abbreviations and acronyms, correct possible word splits across lines, and discriminate between accents and quotation marks.

2.2. Morphological analysis - IT

A morphological analyzer decomposes each Italian inflected word into its morphemes, and suggests all possible POSs and base forms of each valid decomposition. By base forms we mean the usual uninflected entries of a dictionary.

2.3. POS tagging - IT

Words in a text are tagged with parts-of-speech (POS) by computing the best text-POS alignment through a statistical model. The employed tagger works with 57 tag classes and has an accuracy around 96%.

2.4. Base form extraction - IT

Once the POS and the morphological analysis of each word in the text are computed, a base form can be assigned to each word.
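The interplay of modules 2.2 to 2.4 can be illustrated with a minimal sketch. It assumes that the analyzer returns, for each word form, a list of (POS, base form) readings, and that the tagger has already chosen one POS per word. The tag names, the toy lexicon and the fallback to the lower-cased surface form are all hypothetical; the actual tagger works with 57 tag classes that are not listed here.

```python
# Hypothetical analyzer output: every (POS, base form) reading of a word form.
ANALYSES = {
    "porta": [("NOUN", "porta"), ("VERB", "portare")],
    "chiavi": [("NOUN", "chiave")],
}


def base_forms(tagged_words, analyses):
    """For each (word, POS) pair, pick the base form of the reading that matches the tag."""
    result = []
    for word, pos in tagged_words:
        readings = analyses.get(word.lower(), [])
        match = next((base for tag, base in readings if tag == pos), None)
        # Assumption: unknown words and tag mismatches keep their lower-cased form.
        result.append(match if match is not None else word.lower())
    return result


# "porta" is ambiguous (door / carries); the tag chosen by the tagger resolves it.
TAGGED = [("Maria", "PROPN"), ("porta", "VERB"), ("chiavi", "NOUN")]
print(base_forms(TAGGED, ANALYSES))
```

The point of the sketch is only that the base form is not a property of the word form alone: it depends on the POS selected by the tagger.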

2.5. Stemming - EN

Word stemming is performed only on English words, by using Porter's algorithm.

2.6. Stop-terms removal - IT+EN

Words that are not considered relevant for IR are discarded in order to save index space. Words are filtered out on the basis of either their POS (if available) or their inverted document frequency.

2.7. Multi-word recognition - EN

Multi-words are used only for the sake of query translation. Hence, the statistics used by the translation models do contain multi-words. After translation, multi-words are split into single words.

Information Retrieval Models

The following notation is used throughout:

$Q, T, D$: random variables of query, translation, and document
$q, t, d$: instances of query, query translation, and document
$w, i, e$: generic term, Italian term, English term
$\mathcal{D}$: collection of documents
$\mathcal{V}, \mathcal{V}(d)$: set of terms occurring in $\mathcal{D}$, and in document $d$
$N, N(d)$: number of term occurrences in $\mathcal{D}$, and in a document $d$
$N(w), N(d,w), N(q,w)$: frequency of term $w$ in $\mathcal{D}$, in document $d$, and in query $q$
$N_w$: number of documents in $\mathcal{D}$ which contain term $w$
$|\cdot|$: size of a set

3.1. Okapi Model

To score the relevance of a document $d$ versus a query $q$, the following Okapi weighting function is applied:

$$ s(d) = \sum_{w \in q} N(q,w) \, W_d(w) \, \log W_{\mathcal{D}}(w) \qquad (1) $$

where:

$$ W_d(w) = \frac{N(d,w)\,(k_1 + 1)}{k_1 (1 - b) + k_1 b \, N(d)/\bar{l} + N(d,w)} \qquad (2) $$

scores the relevance of $w$ in $d$, and the inverted document frequency:

$$ W_{\mathcal{D}}(w) = \frac{|\mathcal{D}| - N_w + 0.5}{N_w + 0.5} \qquad (3) $$

evaluates the relevance of term $w$ inside the collection. Here $\bar{l} = N / |\mathcal{D}|$ is the average document length, and $k_1$ and $b$ are two parameters to be empirically estimated over a development sample; fixed settings of both were used. An explanation of the involved terms can be found in (Robertson et al., 1994) and other papers referred to in it.

3.2. Language Model

According to this model, the match between a query random variable $Q$ and a document random variable $D$ is expressed through the following conditional probability distribution:

$$ \Pr(D \mid Q) = \frac{\Pr(Q \mid D) \, \Pr(D)}{\Pr(Q)} \qquad (4) $$

where $\Pr(Q \mid D)$ represents the likelihood of $Q$ given $D$, $\Pr(D)$ is the a-priori probability of $D$, and $\Pr(Q)$ is a normalization factor. By assuming a uniform a-priori probability distribution about the documents, and disregarding the normalization factor, documents can be ranked, with respect to $Q$, just by the likelihood term. By assuming an order-free multinomial model, the likelihood is:

$$ \Pr(Q = q \mid D = d) = \prod_{w \in q} \Pr(w \mid d)^{N(q,w)} \qquad (5) $$

The probability that a term $w$ is generated by $d$ can be estimated by a statistical language model (LM). Previous work on statistical information retrieval (Miller et al., 1998; Ng, 1999) proposed to interpolate relative frequencies of each document with those of the whole collection, with interpolation weights empirically estimated from the data. In this work we use an interpolation formula which applies the smoothing method proposed by (Witten and Bell, 1991). This method linearly smoothes word frequencies of a document, and the amount of probability assigned to never observed terms is proportional to the number of different words contained in the document, i.e.:

$$ \Pr(w \mid d) = \frac{N(d,w)}{N(d) + |\mathcal{V}(d)|} + \frac{|\mathcal{V}(d)|}{N(d) + |\mathcal{V}(d)|} \, \Pr(w) \qquad (6) $$

where $\Pr(w)$, the word probability over the collection, is estimated by interpolating the smoothed relative frequency with the uniform distribution over the vocabulary $\mathcal{V}$:

$$ \Pr(w) = \frac{N(w)}{N + |\mathcal{V}|} + \frac{|\mathcal{V}|}{N + |\mathcal{V}|} \cdot \frac{1}{|\mathcal{V}|} \qquad (7) $$

3.3. Combined model

Previous work (Bertoldi and Federico, 2000) showed that Okapi and the statistical model rank documents almost independently. Hence, information about the relevant documents can be gained by integrating the scores of both methods.

Combination of the two models is implemented by just taking the sum of scores. In order to adjust scale differences, the scores of each model are first normalized to a common range.

3.4. Blind Relevance Feedback

Blind relevance feedback (BRF) is a well-known technique for improving retrieval performance. The basic idea is to perform retrieval in two steps. First, the documents matching the original query $q$ are ranked; then the $B$ best ranked documents are taken, and the $R$ most relevant terms in them are added to the query. Hence, the retrieval phase is repeated with the augmented query. In this work, new search terms are extracted by sorting all the terms of the $B$ top documents according to (Johnson et al., 1999):

$$ r_w \log \frac{(r_w + 0.5)\,(|\mathcal{D}| - N_w - B + r_w + 0.5)}{(N_w - r_w + 0.5)\,(B - r_w + 0.5)} \qquad (8) $$

where $r_w$ is the number of documents, among the top $B$ documents, which contain word $w$. In all the performed experiments, fixed values of $B$ and $R$ were used.

Cross-language IR Model

4.1. Query Translation Model

Query translation is based on a hidden Markov model (HMM) (Rabiner, 1990), in which the observable part is the query in the source language (Italian), and the hidden part is the corresponding query in the target language (English). Hence, the joint probability of a pair $(q, t)$, with $q = i_1, \ldots, i_n$ and $t = e_1, \ldots, e_n$, can be decomposed as follows:

$$ \Pr(q, t) = p(e_1) \prod_{k=2}^{n} p(e_k \mid e_{k-1}) \prod_{k=1}^{n} p(i_k \mid e_k) \qquad (9) $$

The term translation probabilities $p(i \mid e)$ are estimated from the bilingual dictionary, while the target language model probabilities are estimated on the target document collection as $p(e \mid e') = p(e, e') / \sum_{e''} p(e'', e')$, where $p(e, e')$ is the probability of $e$ co-occurring with $e'$, regardless of the order, within a text window of fixed size. Smoothing of the probability $p(e, e')$ is performed through absolute discounting and interpolation with the single-term probabilities, which are estimated according to Equation (7). The absolute discounting term $\beta$ is equal to the estimate proposed in (Ney et al., 1994):

$$ \beta = \frac{n_1}{n_1 + 2 n_2} \qquad (13) $$

where $n_k$ is the number of term pairs that co-occur exactly $k$ times in the collection.

4.2. Cross-Language IR Model

As a first method to perform cross-language retrieval, a simple plug-in method was devised, which decouples the translation and retrieval phases. Hence, given a query in the source language, the Viterbi decoding algorithm is applied to compute the most probable translation in the target language, according to the statistical query translation model explained above. Then, the document collection is searched by applying a conventional monolingual IR method. The procedure thus consists of two steps:

1. Find the most probable translation of the query by Viterbi decoding.
2. Order documents by using the translation.
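The translation step of the plug-in method can be sketched as a standard Viterbi search over the dictionary candidates of each query term. The sketch below is only illustrative: the toy dictionary, the uniform initial probability, the assumption that $p(i \mid e)$ is uniform over the Italian entries listed for $e$, and the floor value for unseen bigrams are all hypothetical choices, not the estimates used in the evaluated system. It also assumes a non-empty query whose terms all have at least one dictionary entry.

```python
import math

# Hypothetical toy dictionary: Italian term -> candidate English translations.
CANDIDATES = {"borsa": ["bag", "exchange"], "valori": ["values", "stock"]}
# Hypothetical order-free bigram probabilities p(e | e_prev); unseen pairs get a floor.
BIGRAM = {("stock", "exchange"): 0.5}
FLOOR = 0.01


def p_init(e):
    return 0.25  # assumption: uniform probability for the first English term


def p_bigram(e, e_prev):
    return BIGRAM.get((e, e_prev), FLOOR)


def p_trans(i, e):
    # Assumption: uniform over the Italian terms that list e as a translation.
    sources = [it for it, ens in CANDIDATES.items() if e in ens]
    return 1.0 / len(sources) if i in sources else 0.0


def viterbi_translate(query):
    """Return the most probable English term sequence for a list of Italian terms."""
    # delta[e]: best log-probability of a partial translation ending in e.
    delta = {e: math.log(p_init(e)) + math.log(p_trans(query[0], e))
             for e in CANDIDATES[query[0]]}
    backpointers = []
    for i in query[1:]:
        new_delta, pointers = {}, {}
        for e in CANDIDATES[i]:
            prev, score = max(
                ((e_prev, s + math.log(p_bigram(e, e_prev)))
                 for e_prev, s in delta.items()),
                key=lambda pair: pair[1])
            new_delta[e] = score + math.log(p_trans(i, e))
            pointers[e] = prev
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(delta, key=delta.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))


print(viterbi_translate(["borsa", "valori"]))
```

In this toy setting the co-occurrence statistics, rather than the dictionary alone, decide between the candidate translations; the resulting English terms would then be passed to the monolingual engine as an ordinary query.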

5.1. Monolingual Track

Two monolingual runs were submitted to the Italian monolingual track. The first run used all the information available for the topics, while the second one used just the title and description parts. The track consisted of 47 topics, for a total of 1,246 documents to be retrieved from a collection of 108,578 documents. A detailed description of the system follows:

Document/query pre-processing: tokenization, POS tagging, base form extraction, stop-term removal.

Retrieval step 1: separate Okapi and LM runs.

BRF: performed on each model output.

5.2. Bilingual IR Evaluation

Two runs were submitted to the Italian-to-English bilingual track, with the same modalities as in the monolingual track. The bilingual track consisted of 47 topics, for a total of 856 documents to be retrieved from a collection of 110,282 documents. A detailed description of the system follows:

Document pre-processing: tokenization, stemming, stop-term removal.

Query pre-processing: tokenization, POS tagging, base form extraction, stop-term removal, translation, multi-word splitting, stemming.

Retrieval step 1: separate Okapi and LM runs.

BRF: performed on each model output.

Retrieval step 2: same as step 1 with the expanded query.

Final rank: sum of Okapi and LM normalized scores.
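The retrieval steps listed above can be sketched end to end on already pre-processed text. This is a minimal illustration, not the evaluated system: the Okapi weights follow the usual form of (Robertson et al., 1994), the LM uses Witten-Bell smoothing, term selection uses the offer weight of (Johnson et al., 1999), and the values of `k1`, `b`, `n_docs` and `n_terms`, the min-max normalization to [0, 1], and the exclusion of terms already in the query are assumptions made for the example.

```python
import math
from collections import Counter


def okapi_scores(query, docs, k1=1.2, b=0.75):
    """Okapi score of each document (k1 and b are illustrative defaults)."""
    avg_len = sum(len(d) for d in docs) / len(docs)
    df = Counter(w for d in docs for w in set(d))
    scores = []
    for d in docs:
        tf, s = Counter(d), 0.0
        for w, qf in Counter(query).items():
            if tf[w]:
                w_d = tf[w] * (k1 + 1) / (k1 * (1 - b) + k1 * b * len(d) / avg_len + tf[w])
                s += qf * w_d * math.log((len(docs) - df[w] + 0.5) / (df[w] + 0.5))
        scores.append(s)
    return scores


def lm_scores(query, docs):
    """Query log-likelihood of each document under a Witten-Bell smoothed unigram LM."""
    coll = Counter(w for d in docs for w in d)
    n, v = sum(coll.values()), len(coll)
    scores = []
    for d in docs:
        tf, n_d, v_d = Counter(d), len(d), len(set(d))
        ll = 0.0
        for w, qf in Counter(query).items():
            p_w = coll[w] / (n + v) + (v / (n + v)) / v  # collection-level probability
            ll += qf * math.log(tf[w] / (n_d + v_d) + v_d / (n_d + v_d) * p_w)
        scores.append(ll)
    return scores


def normalize(scores):
    """Min-max normalization to [0, 1] (the target range is an assumption)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]


def expand_query(query, docs, ranking, n_docs, n_terms):
    """Add the n_terms best new terms found in the n_docs top-ranked documents."""
    top = [set(docs[j]) for j in ranking[:n_docs]]
    df = Counter(w for d in docs for w in set(d))
    n, n_top = len(docs), len(top)

    def offer_weight(w):
        r = sum(w in d for d in top)
        return r * math.log((r + 0.5) * (n - df[w] - n_top + r + 0.5)
                            / ((df[w] - r + 0.5) * (n_top - r + 0.5)))

    new_terms = sorted({w for d in top for w in d} - set(query),
                       key=lambda w: (-offer_weight(w), w))
    return list(query) + new_terms[:n_terms]


def retrieve(query, docs, n_docs=2, n_terms=1):
    total = [0.0] * len(docs)
    for model in (okapi_scores, lm_scores):
        first = model(query, docs)                                   # retrieval step 1
        ranking = sorted(range(len(docs)), key=lambda j: -first[j])
        expanded = expand_query(query, docs, ranking, n_docs, n_terms)  # BRF
        second = model(expanded, docs)                               # retrieval step 2
        total = [t + s for t, s in zip(total, normalize(second))]
    return sorted(range(len(docs)), key=lambda j: -total[j])         # final rank


DOCS = [["election", "vote", "parliament", "vote"], ["football", "match", "goal"],
        ["parliament", "law", "vote", "election"], ["weather", "rain", "forecast"],
        ["football", "league", "goal", "goal"], ["recipe", "pasta", "tomato"]]
QUERY = ["election", "vote"]
print(retrieve(QUERY, DOCS))
```

Normalizing before summing keeps the bounded, non-negative Okapi scores and the negative LM log-likelihoods on a comparable scale, which is the reason given for the normalization step in the combined model.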

An important issue concerns the use of multi-words. Multi-words were only used for the target language, i.e. English, and just for the translation process. After translation, multi-words in the query are split again into single words.

As a term of comparison, our statistical query translation model was replaced with the Babelfish text translation service, powered by Systran and available on the Internet. Cross-language retrieval performance was measured by keeping all the other components of the system fixed. Results obtained by the submitted runs and by the Babelfish translator are shown in Table 4. The mean average precision achieved with the commercial translation system is about 5%-10% better, depending on the retrieval mode.
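For reference, the figure being compared can be computed as in the following sketch, which assumes the usual non-interpolated definition of average precision. The topic identifiers, document identifiers and relevance judgments are invented for the example and are not evaluation data.

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision of one ranked list."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of each relevant document
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs, qrels):
    """Mean of the per-topic average precisions over all judged topics."""
    return sum(average_precision(runs[t], qrels[t]) for t in qrels) / len(qrels)


# Invented toy data: ranked lists and relevant documents for two topics.
RUNS = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6"]}
QRELS = {"t1": {"d1", "d3"}, "t2": {"d6"}}
print(mean_average_precision(RUNS, QRELS))
```

Relevant documents that are never retrieved still count in the denominator, so a translation error that removes a key term from the query lowers the score of that topic directly.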

Conclusion

In this work we presented the monolingual and cross-language information retrieval systems developed at ITC-irst and evaluated at CLEF 2001. In particular, the cross-language system uses a statistical query translation algorithm that requires minimal language resources: a bilingual dictionary and the target document collection. Results on the CLEF 2001 evaluation data show that satisfactory performance can be achieved with this simple translation model. However, experience gained from the many experiments performed suggests that a fair comparison between different systems would require a much larger number of queries. The retrieval performance proves in fact to be very sensitive to the translation step.

Current work is in the direction of further developing the statistical model for cross-language IR proposed here. In particular, significant improvements have been achieved by closely integrating the translation and retrieval models.

Acknowledgements

The authors would like to thank their colleagues at ITC-irst, Bernardo Magnini and Emanuele Pianta, for making available an electronic Italian-English dictionary.

References

Bertoldi, N. and M. Federico, 2000. Italian text retrieval for CLEF 2000 at ITC-irst. In Working notes of CLEF 2000. Lisbon, Portugal.

Johnson, S.E., P. Jourlin, K. Sparck Jones, and P.C. Woodland, 1999. Spoken document retrieval for TREC-8 at Cambridge University. In Proc. of 8th TREC. Gaithersburg, MD.

Miller, David R. H., Tim Leek, and Richard Schwartz, 1998. BBN at TREC-7: Using hidden Markov models for information retrieval. In Proc. of 7th TREC. Gaithersburg, MD.

Ney, Hermann, Ute Essen, and Reinhard Kneser, 1994. On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language, 8:1-38.

Ng, Kenney, 1999. A maximum likelihood ratio information retrieval model. In Proc. of 8th TREC. Gaithersburg, MD.

Rabiner, Lawrence R., 1990. A tutorial on hidden Markov models and selected applications in speech recognition. In Alex Waibel and Kai-Fu Lee (eds.), Readings in Speech Recognition. Los Altos, CA: Morgan Kaufmann, pages 267-296.

Robertson, S. E., S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford, 1994. Okapi at TREC-3. In Proc. of 3rd TREC. Gaithersburg, MD.

Witten, Ian H. and Timothy Bell, 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Trans. Inform. Theory, IT-37(4):1085-1094.