<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Twenty-One at CLEF-2000: Translation resources, merging strategies and relevance feedback</article-title>
      </title-group>
      <abstract>
        <p>This paper describes the official runs of the Twenty-One group for CLEF-2000. The Twenty-One group participated in the monolingual, bilingual and multilingual tasks. The following new techniques are introduced in this paper. In the bilingual task we experimented with different methods to estimate translation probabilities. In the multilingual task we experimented with refinements on raw-score merging techniques and with a new relevance feedback algorithm that re-estimates both the model's translation probabilities and the relevance weights. Finally, we performed preliminary experiments to exploit the web to generate translation probabilities and bilingual dictionaries, notably for English-Italian and English-Dutch.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <sec id="sec-1-1">
        <title>Translation in practice</title>
        <p>Formula 1 shows the basic idea of this approach to information retrieval, where the document-based language model is interpolated with a background language model to compensate for sparseness:
P(D_k, T_1, …, T_n) = P(D_k) · ∏_{i=1..n} ((1−λ_i) · P(T_i) + λ_i · P(T_i|D_k))   (1)
In the formula, T_i is a random variable for the query term on position i in the query (1 ≤ i ≤ n, where n is the query length), whose sample space is the set {t(0), t(1), …, t(m)} of all terms in the collection. The probability measure P(T_i) defines the probability of drawing a term at random from the collection, P(T_i|D_k) defines the probability of drawing a term at random from the document D_k, and λ_i defines the importance of each query term. The marginal probability of relevance P(D_k) might be assumed uniformly distributed over the documents, in which case it may be ignored in the above formula.</p>
        <p>
          Information retrieval models and statistical translation models can be integrated into one unifying model for cross-language information retrieval [
          <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
          ]. Let S_i be a random variable for the source language query term on position i. Each document gets a score defined by the following formula:
P(D_k, S_1, …, S_n) = P(D_k) · ∏_{i=1..n} Σ_{j=1..m} P(S_i|T_i=t(j)) · ((1−λ_i) · P(T_i=t(j)) + λ_i · P(T_i=t(j)|D_k))   (2)
        </p>
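As a concrete illustration, the scoring of one document under formula 2 can be sketched as follows (a minimal Python sketch; the function and variable names are ours, not part of the system):

```python
import math

def clir_score(query, doc_tf, doc_len, df, df_total, lam=0.3):
    """Score one document under formula 2. `query` holds, for every source
    language position i, a dict mapping candidate translations t(j) to their
    translation probabilities tau_i(j). The document and collection language
    models are interpolated with weight lam (the system's default is 0.3)."""
    log_score = 0.0
    for translations in query:
        pos_prob = 0.0
        for term, tau in translations.items():
            p_coll = df.get(term, 0) / df_total    # P(T_i = t(j)), from document frequencies
            p_doc = doc_tf.get(term, 0) / doc_len  # P(T_i = t(j)|D_k), from term frequencies
            pos_prob += tau * ((1.0 - lam) * p_coll + lam * p_doc)
        if pos_prob == 0.0:
            return float("-inf")  # no translation of this source term occurs anywhere
        log_score += math.log(pos_prob)  # work in log space to avoid underflow
    return log_score
```

A document containing a likely translation of every source term receives a higher score than one that matches none.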
      </sec>
      <sec id="sec-1-2">
        <title>A model of cross-language information retrieval</title>
      </sec>
      <sec id="sec-1-3">
        <title>Probability estimation</title>
        <p>The probability measures of the model are estimated as follows. P(T_i = t(j)|D_k) is defined by the relative term frequency in the document:
P(T_i = t(j)|D_k) = tf(t(j), d_k) / Σ_t tf(t, d_k)   (3)
Division by the document length in equation 3 is the model’s form of document length normalisation, cf. the “no document length normalisation” runs reported in section 5. P(T_i = t(j)) is defined by the relative document frequency in the collection:
P(T_i = t(j)) = df(t(j)) / Σ_t df(t)   (4)
Finally, the translation probability is replaced by an estimate:
P(S_i = s_i|T_i = t(j)) = τ_i(j)   (5)</p>
      </sec>
      <sec id="sec-1-4">
        <title>Implementation</title>
        <p>The translation probabilities P(S_i|T_i) and the value of λ_i, however, are unknown. The collection that is searched was not translated, or if it was translated, the translations are not available. The translation probabilities should therefore be estimated from other data, for instance from a parallel corpus. The value of λ_i determines the importance of the source language query term. If λ_i = 1 then the system will assign zero probability to documents that do not contain any of the possible translations of the original query term on position i. In this case, a possible translation of the source language term is mandatory in the retrieved documents. If λ_i = 0 then the possible translations of the original query term on position i will not affect the final ranking. In this case, the source language query term is treated as if it were a stop word. For ad-hoc queries, it is not known which of the original query terms are important and which are not, and a constant value for each λ_i is taken. The system’s default value is 0.3.</p>
        <p>Equation 2 is not implemented as is, but instead it is rewritten into a weighting algorithm that assigns zero weight to terms that do not occur in the document. Filling in the definitions of equations 3, 4 and 5 in equation 2, the probability measure P(S_i|T_i = t(j)) is replaced by the translation probability estimates τ_i(j). The translation probabilities can be moved into the inner sum. As summing is associative and commutative, it is not necessary to calculate each probability separately before adding them. Instead, the document frequencies and the term frequencies of the disjuncts can be added beforehand, properly multiplied by the translation probabilities. Only λ_i in the big sum is constant for every addition and can therefore be moved outside the sum, resulting in:</p>
        <p>
          P(D_k, S_1, …, S_n) = P(D_k) · ∏_{i=1..n} ((1−λ_i) · (Σ_{j=1..m} τ_i(j) df(t(j))) / Σ_t df(t) + λ_i · (Σ_{j=1..m} τ_i(j) tf(t(j), d_k)) / Σ_t tf(t, d_k))
Using simple calculus (see e.g. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]), the probability measure can now be rewritten into a term weighting algorithm that assigns zero weight to non-matching terms, resulting in equation 6:
score(D_k) = Σ_{i: matching} log(1 + (λ_i · Σ_{j=1..m} τ_i(j) tf(t(j), d_k) · Σ_t df(t)) / ((1−λ_i) · Σ_{j=1..m} τ_i(j) df(t(j)) · Σ_t tf(t, d_k)))   (6)
The formula ranks documents in exactly the same order as equation 2.
        </p>
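The pre-summing trick can be sketched as follows (a sketch of the equation 6 style weighting under the estimates of equations 3 to 5; the names are ours):

```python
import math

def eq6_score(query, doc_tf, doc_len, df, df_total, lam=0.3):
    """Equation 6 style weighting: pre-sum the tau-weighted document
    frequencies and term frequencies of the disjuncts per source position,
    then add a log weight for matching positions only, so documents can be
    scored from an inverted file."""
    score = 0.0
    for translations in query:
        wdf = sum(tau * df.get(t, 0) for t, tau in translations.items())
        wtf = sum(tau * doc_tf.get(t, 0) for t, tau in translations.items())
        if wtf != 0.0 and wdf != 0.0:  # non-matching positions get zero weight
            score += math.log(1.0 + (lam * wtf * df_total) / ((1.0 - lam) * wdf * doc_len))
    return score
```

Because the document-independent part of the product is dropped, the resulting ranking is the same as that of formula 2.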
        <p>
          Equation 6 is the algorithm implemented in the TNO retrieval engine. It contains a weighted sum of respectively the term frequencies and the document frequencies, where the weights are determined by the translation probabilities τ_i(j). Unweighted summing of frequencies was used before for on-line stemming in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] in a vector space model retrieval system.
        </p>
        <p>The model does not require the translation probabilities τ_i(j) to sum up to one for each i, since they are conditioned on the target language query term and not on the source language query term. Interestingly, for the final ranking it does not matter what the actual sum of the translation probabilities is. Only the relative proportions of the translations define the final ranking of documents. This can be seen from τ_i(j), which occurs both in the numerator and in the denominator of the big fraction in equation 6.</p>
        <p>This paper introduces a new relevance feedback method for cross-language information retrieval. If there were some known relevant documents, then the values of τ_i(j) and λ_i could be re-estimated from that data. The idea is the following. Suppose there are three known relevant English documents to the French query “déchets dangereux”. If two out of three documents contain the term “waste” and none contain the terms “litter” and “garbage”, then this is an indication that “waste” is the correct translation and should be assigned a higher translation probability than “litter” and “garbage”. If only one of the three known relevant documents contains one or more possible translations of “dangereux”, then this is an indication that the original query term “déchets” is more important (its possible translations occur in more relevant documents) than the original query term “dangereux”, and the value of λ_i should be higher for “déchets” than for “dangereux”.</p>
        <p>The actual re-estimation of τ_i(j) and λ_i was done by iteratively applying the EM-algorithm defined by the formulas in equation 7. In the algorithm, τ_i(j)^(p) and λ_i^(p) denote the values on the pth iteration. The values are initialised with the translation probabilities from the dictionary and with λ_i^(0) = 0.3. The re-estimation formulas should be applied simultaneously for each p until the values do not change significantly anymore.</p>
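Equation 7 itself is not reproduced here; the following sketch shows, for a single query position, the general shape of such an EM re-estimation from known relevant documents. It is our illustrative reading of the update, not the paper's exact formulas, and all names are ours:

```python
import math

def em_feedback(tau0, rel_docs, df, df_total, lam0=0.3, iters=25):
    """Illustrative EM re-estimation of tau_i(j) and lambda_i for one query
    position. `tau0` maps candidate translations to dictionary probabilities;
    `rel_docs` is a list of (tf_dict, doc_len) pairs for the known relevant
    documents. E-step: per document, compute the posterior probability that
    each translation (and that the document model rather than the background
    model) generated the position. M-step: average the posteriors."""
    tau, lam = dict(tau0), lam0
    for _ in range(iters):
        post_tau = {t: 0.0 for t in tau}
        post_lam = 0.0
        for tf, doc_len in rel_docs:
            parts = {}
            for t, w in tau.items():
                p_doc = lam * tf.get(t, 0) / doc_len          # document model part
                p_bg = (1.0 - lam) * df.get(t, 0) / df_total  # background part
                parts[t] = (w * (p_doc + p_bg), w * p_doc)
            total = sum(p for p, _ in parts.values()) or 1.0
            for t, (p, p_doc_part) in parts.items():
                post_tau[t] += p / total   # E-step posterior per translation
                post_lam += p_doc_part / total
        n = len(rel_docs)
        tau = {t: v / n for t, v in post_tau.items()}  # M-step averages
        lam = post_lam / n
    return tau, lam
```

On the “déchets dangereux” example above, a translation that occurs in most relevant documents accumulates posterior mass and ends up with a higher τ, while λ grows with the fraction of positions explained by the document models.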
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Translation resources</title>
      <p>
        As in previous years we applied a dictionary based query translation approach. The translations were based
on the VLIS lexical database of Van Dale publishers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Because VLIS currently lacks translations into
Italian, we used two other resources: i) the Systran web-based MT engine, and ii) a probabilistic lexicon based
on a parallel web corpus. The next section describes the construction of this new resource in more detail.
      </p>
      <sec id="sec-2-1">
        <title>Parallel web corpora</title>
        <p>
          We developed three parallel corpora based on web pages in close cooperation with RALI, Université de
Montréal. RALI already had developed an English-French parallel corpus of web pages, so it seemed
interesting to investigate the feasibility of a full multilingual system based on web derived lexical resources
only. We used the PTMiner tool [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] to find web pages which have a high probability to be translations of
each other. The mining process consists of the following steps:
1. Query a web search engine for web pages with a hyperlink anchor text “English version” and
respective variants.
2. (For each web site) Query a web search engine for all web pages on a particular site.
3. (For each web site) Try to find pairs of path names that match certain patterns, e.g.:
/department/tt/english/home.html and /department/tt/italian.html.
4. (For each pair) download web pages, perform a language check using a probabilistic language
classifier, remove pages which are not positively identified as being written in a particular language.
        </p>
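Step 3, the pairing of path names, can be sketched as follows (a simplified sketch; the marker list and names are ours, not PTMiner's actual patterns):

```python
import re

# Hypothetical language markers; PTMiner's real pattern lists are richer.
MARKERS = [("english", "italian"), ("en", "it"), ("eng", "ita")]

def candidate_pairs(paths):
    """Given all path names of one web site, propose pairs of paths that
    differ only in a language marker, e.g. .../english/... vs .../italian/..."""
    path_set = set(paths)
    pairs = []
    for path in paths:
        for src, tgt in MARKERS:
            # swap the marker where it occurs as a whole path component or word
            guess = re.sub(r"\b%s\b" % src, tgt, path)
            if guess != path and guess in path_set:
                pairs.append((path, guess))
    return pairs
```

Each candidate pair is then downloaded and passed through the language classifier of step 4.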
        <p>The mining process was run for three language pairs and resulted in three modest-size parallel corpora.
Table 1 lists the sizes of the corpus during intermediate steps. Due to the dynamic nature of the web, a lot of
pages that have been indexed do not exist anymore. Sometimes a site is down for maintenance. Finally, a
lot of pages are simply place holders for images and are discarded by the language identification step.</p>
        <p>
          These parallel corpora have been used in different ways: i) to refine the estimates of translation
probabilities of a dictionary based translation system (corpus based probability estimation) ii) to construct simple
statistical translation models (IBM model 1) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The former application is described in more detail
in Section 5.2, the latter in Section 5.3. The translation models for English-Italian and English-German,
complemented with an already existing model for English-French, also formed the basis for a full corpus-based
translation multilingual run which is described elsewhere in this volume [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Merging intermediate runs</title>
      <p>Our strategy for multilingual retrieval is to translate the query into the document languages, perform separate
language-specific runs and merge the results into a single result file. In previous CLIR evaluations, we
compared different merging strategies:
round robin: Here the idea is that document scores are not comparable across collections, because we
are basically ignorant about the distribution of relevant documents in the retrieved lists. Round robin
assumes that these distributions are similar across languages.
raw score: This type of merging assumes that document scores are comparable across collections.
rank based: It has been observed that the relationship between the probability of relevance and the log of
the rank of a document can be approximated by a linear function, at least for a certain class of IR
systems. If a training collection is available, one can estimate the parameters of this relationship by
applying regression. Merging can subsequently be based on the estimated probability of relevance.</p>
      <p>Note that the actual score of a document is only used to rank documents, but that merging is based
on the rank, not on the score.</p>
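The first two strategies can be sketched as follows (runs are ranked lists of (document, score) pairs; the names are ours):

```python
from itertools import chain, zip_longest

def raw_score_merge(runs, k=1000):
    """Raw score merging: pool all results and sort on the raw scores,
    assuming the scores are comparable across collections."""
    pooled = [item for run in runs for item in run]
    pooled.sort(key=lambda pair: pair[1], reverse=True)
    return pooled[:k]

def round_robin_merge(runs, k=1000):
    """Round robin merging: interleave the runs rank by rank, ignoring the
    scores, assuming similar distributions of relevance across languages."""
    interleaved = chain.from_iterable(zip_longest(*runs))
    return [item for item in interleaved if item is not None][:k]
```

Rank based merging would replace the sort key by a probability of relevance estimated from the log of the rank.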
      <p>The new CLEF multilingual task is based on a new document collection which makes it hard to
compute reliable estimates for the linear parameters; a training set is not available. A second disadvantage
of the rank based merging strategy is that the linear function generalises across topics. Unfortunately in
the multilingual task, the distribution of relevant documents over the subcollections is quite skewed. All
collections have several (differing) topics without relevant documents, so applying a rank based merging
strategy would hurt the performance for these topics, because the proportion of retrieved documents in
every collection is the same for every topic.</p>
      <p>The raw score merging strategy (which proved successful last year) does not need training data and
also does not suffer from the equal proportions problem. Unfortunately, scores are usually not fully
comparable across collections. We have tried to identify the factors which cause these differences, and we have
applied two normalisation techniques. First of all, we treat the term translations as a weighted concept vector
(cf. section 2). That means that we can normalise scores across topics by dividing the score by the query
length, which amounts to computing the geometric average of the probabilities per query concept. Secondly,
we have observed that collection size has a large influence on the occurrence probability estimates P(T_i|C),
because the probability of rare terms is inversely proportional to the collection size.</p>
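The first normalisation step amounts to the following (a small numeric illustration; the probabilities are made up):

```python
import math

# per-concept probabilities contributed by one topic's query concepts
probs = [0.02, 0.5, 0.1]
log_score = sum(math.log(p) for p in probs)  # score of the model in log space
norm = log_score / len(probs)                # divide by the query length
# exp of the normalised score is the geometric average probability per concept
geo_mean = math.prod(probs) ** (1.0 / len(probs))
assert math.isclose(math.exp(norm), geo_mean)
```

Dividing the log score by the query length thus makes topics with different numbers of concepts comparable.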
      <p>Figure 4 shows the probability estimates of a sample of words of one document when we add more
documents to the collection. The occurrence probability of common words stabilises quickly as the collection
size increases. The rarer a word is, however, the higher the degree of overestimation of its occurrence
probability. This effect is a consequence of the sparse data problem. In fact, a small collection will never
yield correct term occurrence probability estimates.</p>
      <p>The collection-size dependency of collection-frequency (or global term frequency) estimates has a
direct influence on the distribution of document scores for a particular query. When the collection is
small, the scores will be lower than the scores on a large collection. This is due to the fact that the score
we study is based on the maximum likelihood ratio. So the median of the distribution of document scores
for a particular topic (set) is inversely related with the collection size. Thus when we use the raw scores of
different subcollections as a basis for merging, large collections will be favoured.</p>
      <p>We hypothesised that we could improve the merging process if we could correct the estimates for their
dependence on the collection size. Suppose we have just two collections with a different size (and a different
language): C1 and C2, with vocabulary sizes V1 and V2 and numbers of tokens T1 and T2 respectively, with
T1 &lt;&lt; T2. Now we could either try to extrapolate the term occurrence probability estimates on collection C1 to a
hypothetical collection with T2 tokens, or try to ‘downscale’ the term occurrence probability estimates of
collection C2 to the scale of C1.</p>
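The ‘downscale’ option can be sketched as follows, under one reading of the mapping-function requirements used in this section (a probability of 1/T2 must map to 1/T1, and frequent terms should barely change), using the polynomial f(x) = x − a·x² on x = log(p); the exact constant a is our reconstruction:

```python
import math

def downscale(p, t1, t2):
    """Map a term occurrence probability p estimated on the large collection
    C2 (t2 tokens) to the scale of the small collection C1 (t1 tokens) using
    f(x) = x - a*x**2 on x = log(p). The constant a is fixed here by the
    requirement that a probability of 1/t2 is mapped to 1/t1."""
    x1 = math.log(1.0 / t1)
    x2 = math.log(1.0 / t2)
    a = (x2 - x1) / (x2 * x2)
    x = math.log(p)
    return math.exp(x - a * x * x)
```

Rare terms are boosted towards the estimates a collection of t1 tokens would give, while the estimates of frequent terms are left almost untouched.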
      <p>The first option seems cumbersome, because we have hardly any information to guide the extrapolation
process. The second option, trying to adapt the estimates of the large collection to the small collection,
seems more viable. As shown in figure 4, the estimates of frequent terms stabilise soon. The idea is therefore to
adapt the probability estimates of rare terms in such a way that they will become ‘compatible’ with the
estimates on the small collection. Our idea is to construct a mapping function which maps the probability
estimates to the small collection domain. The mapping function has the following requirements: a probability
of 1/T2 has to be mapped to 1/T1, so it is multiplied by the factor T2/T1, and probabilities
larger than 1/T2 will be multiplied by a factor which decreases for larger p. In fact, we only want very small
probabilities to change; the function should hardly change probabilities p &gt; 10^-3. A function which meets
these properties is the polynomial f(x) = x − a·x^2, where x = log(p) and the constant a is fixed by the
requirement that log(1/T2) is mapped to log(1/T1). Because we have re-estimated the probabilities, one would
expect that the probabilities have to be re-normalised (p′(t_i) = p(t_i) / Σ_{t ∈ V2} p(t)). However, this has the
result that all global probabilities (also those of relatively frequent words) are increased, which will increase
the score of all documents, i.e. it will have the opposite effect of what we want. So we decided not to
re-normalise, because a smaller corpus would also have a smaller vocabulary, which would compensate for the
increase in probability mass which is a result of the transformation.</p>
      <p>
        We indexed the collections in the four languages separately. All documents were lemmatised using the Xelda
morphological toolkit from Xerox XRCE and stopped with language-specific stoplists. For German, we
split compounds and added both the full compound and its parts to the index. This strategy is motivated
by our experience with a Dutch corpus (Dutch is also a compounding language) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and tests on the TREC
CLIR test collection. Table 2 shows the results of the monolingual runs; runs in bold are judged runs, runs
in italic font are unofficial runs (mostly post-hoc). The table also lists the proportion of documents which
has been judged. The standard runs include fuzzy lookup of unknown words. The expand option adds
close orthographical variants for every query term. The official runs were done without the document length
normalisation defined by equation 3.
      </p>
      <p>[Table 2: monolingual run results for the runs tnoutdd1, tnoutdd2, tnoutdd2l, tnoutff1, tnoutff2, tnoutff2l, tnoutii1, tnoutii2, tnoutii2l, tnoutee01i, tnoutee01 and tnoutee01l.]</p>
      <p>The first thing that strikes us is that the pool depth is 50, contrary to what has been the practice in TREC, in
which the top 100 documents are judged for relevance. Section 5.4 analyses the CLEF collection further.
Length normalisation usually gives a modest improvement in average precision. The ‘expand’ option was
especially effective for German. The reason is probably that compound parts are not always properly
lemmatised by the German morphology. The German run performs especially well, with 28 out of 37
topics above average. This relatively good performance is probably due to the morphology, which includes
compound splitting.</p>
      <p>
        The pseudo relevance feedback runs were done with the experimental language models retrieval engine
at the University of Twente, using an index based on the Porter stemming algorithm. The run tagged with
tnoutne3-stem is the baseline run for this system. The official pseudo relevance feedback run used the
top 10 documents retrieved to re-estimate relevance weights and translation probabilities, but turned out to
contain a bug. The unofficial fixed run tnoutne4-fix performs a little bit worse than the baseline. The run
tnoutne4-retro uses the relevant documents to re-estimate the probabilities retrospectively (see e.g. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]).
This run reaches an impressive performance of 0.4695 average precision, much higher even than the best
monolingual English run. This indicates that the algorithm might be helpful in an interactive setting where
the user’s feedback is used to retrieve a new, improved set of documents. Apparently, the top 10 retrieved
documents contain too much noise to be useful for the re-estimation of the model’s parameters.
      </p>
      <sec id="sec-3-1">
        <title>Multilingual runs</title>
        <p>This section reports on some of the statistics of the CLEF collection and compares it to the TREC
crosslanguage collection. Table 5 lists the size, number of judged documents, number of relevant documents
and the judged fraction, which is the part of the collection that is judged per topic.</p>
        <p>[Table 5: judgement statistics for the English, French, German and Italian collections, and in total.]</p>
        <p>This year’s evaluation has confirmed that cross-language retrieval based on structured queries, no
matter what the translation resources are, is a powerful technique. Re-estimating model parameters based
on pseudo-relevant documents does not result in an improvement of retrieval performance. However, the
relevance weighting algorithm shows an impressive performance gain if the relevant documents are used
retrospectively. This indicates that the algorithm might in fact be a valuable tool for processing user
feedback in an interactive setting. Finally, merging based on the collection size re-estimation technique proved
unsuccessful. Further analysis is needed to determine why the technique did not work on this collection, as it
was quite successful on the TREC-8 collection.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>We would like to thank the DRUID project for sponsoring the translation of the topic set into Dutch. We
thank Xerox XRCE for making the Xelda morphological toolkit available to us. Furthermore we would
like to thank Jiang Chen (RALI, Université de Montréal) and Jian-Yun Nie (RALI) for help with the PTMiner
web mining tools, and Michel Simard (RALI) for helping with the construction of aligned corpora and
building translation models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.M.G.</given-names>
            <surname>de Jong</surname>
          </string-name>
          .
          <article-title>Disambiguation strategies for cross-language information retrieval</article-title>
          .
          <source>In Proceedings of the third European Conference on Research and Advanced Technology for Digital Libraries</source>
          , pages
          <fpage>274</fpage>
          -
          <lpage>293</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          .
          <article-title>Twenty-One at TREC-7: Ad-hoc and cross-language track</article-title>
          .
          <source>In Proceedings of the seventh Text Retrieval Conference TREC-7, NIST Special Publication 500-242</source>
          , pages
          <fpage>227</fpage>
          -
          <lpage>238</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          .
          <article-title>A probabilistic justification for using tf.idf term weighting in information retrieval</article-title>
          .
          <source>International Journal on Digital Libraries</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ):
          <fpage>131</fpage>
          -
          <lpage>139</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pohlmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          .
          <article-title>Twenty-one at TREC-8: using language technology for information retrieval</article-title>
          .
          <source>In Proceedings of the eighth Text Retrieval Conference TREC-8</source>
          , NIST Special Publications,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Pohlmann</surname>
          </string-name>
          .
          <article-title>Viewing stemming as recall enhancement</article-title>
          . In H.P. Frei,
          <string-name>
            <given-names>D.</given-names>
            <surname>Harman</surname>
          </string-name>
          , P. Schäuble, and R. Wilkinson, editors,
          <source>Proceedings of the 19th ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR96)</source>
          , pages
          <fpage>40</fpage>
          -
          <lpage>48</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          .
          <article-title>Parallel web corpora for CLIR?</article-title>
          <source>In Proceedings of CLEF</source>
          <year>2000</year>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Isabelle</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Durand</surname>
          </string-name>
          .
          <article-title>Cross-language information retrieval based on parallel texts and automatic mining of parallel texts in the web</article-title>
          .
          <source>In ACM-SIGIR'99</source>
          , pages
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pohlmann</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          .
          <article-title>The effect of syntactic phrase indexing on retrieval performance for Dutch texts</article-title>
          . In L. Devroye and C. Chrisment, editors,
          <source>Proceedings of RIAO'97</source>
          , pages
          <fpage>176</fpage>
          -
          <lpage>187</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Sparck Jones</surname>
          </string-name>
          .
          <article-title>Relevance weighting of search terms</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>27</volume>
          :
          <fpage>129</fpage>
          -
          <lpage>146</lpage>
          ,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>