<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Twenty-One at CLEF-2000: Translation resources, merging strategies and relevance feedback</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Djoerd</forename><surname>Hiemstra</surname></persName>
							<email>hiemstra@cs.utwente.nl</email>
							<affiliation key="aff1">
								<orgName type="laboratory">TNO-TPD</orgName>
								<address>
									<postBox>P.O. Box 155</postBox>
									<postCode>2600 AD</postCode>
									<settlement>Delft</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wessel</forename><surname>Kraaij</surname></persName>
							<email>kraaij@tpd.tno.nl</email>
							<affiliation key="aff1">
								<orgName type="laboratory">TNO-TPD</orgName>
								<address>
									<postBox>P.O. Box 155</postBox>
									<postCode>2600 AD</postCode>
									<settlement>Delft</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Renée</forename><surname>Pohlmann</surname></persName>
							<email>pohlmann@tpd.tno.nl</email>
						</author>
						<author>
							<persName><forename type="first">Thijs</forename><surname>Westerveld</surname></persName>
						</author>
						<author>
							<affiliation key="aff0">
<orgName type="institution">University of Twente</orgName>
								<address>
									<postBox>CTIT, P.O. Box 217</postBox>
									<postCode>7500 AE</postCode>
									<settlement>Enschede</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Twenty-One at CLEF-2000: Translation resources, merging strategies and relevance feedback</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D184EF799AE976D24A6D88DC11CCA6C4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T03:10+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes the official runs of the Twenty-One group for CLEF-2000. The Twenty-One group participated in the monolingual, bilingual and multilingual tasks. The following new techniques are introduced in this paper. In the bilingual task we experimented with different methods to estimate translation probabilities. In the multilingual task we experimented with refinements on raw-score merging techniques and with a new relevance feedback algorithm that re-estimates both the model's translation probabilities and the relevance weights. Finally, we performed preliminary experiments to exploit the web to generate translation probabilities and bilingual dictionaries, notably for English-Italian and English-Dutch.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Twenty-One is a project funded by the EU Telematics Applications programme, sector Information Engineering. The project subtitle is "Development of a Multimedia Information Transaction and Dissemination Tool". Twenty-One started in early 1996 and was completed in June 1999. Because the TREC ad-hoc and cross-language information retrieval (CLIR) tasks fitted our needs to evaluate the system on monolingual and cross-language retrieval performance, TNO-TPD and the University of Twente participated under the flag of "Twenty-One" in TREC-6, TREC-7 and TREC-8. Since the cooperation continues in other projects, Olive and Druid, we decided to continue our participation in CLEF as "Twenty-One". 1 For all tasks, we used the TNO vector retrieval engine. The engine supports several term weighting schemes. The principal term weighting scheme we used is the "linguistically motivated probabilistic model of information retrieval" <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b3">4]</ref> explained below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">The approach</head><p>All runs were carried out with an information retrieval system based on a simple unigram language model. The basic idea is that documents can be represented by simple statistical language models. Now, if a query is more probable given a language model based on document d_1 than given, e.g., a language model based on document d_2, then we hypothesise that document d_1 is more relevant to the query than document d_2. Thus the probability of generating a certain query given a document-based language model can serve as a score to rank documents with respect to relevance. 1 Information about Twenty-One, Olive and Druid is available at http://dis.tpd.tno.nl/ Formula 1 shows the basic idea of this approach to information retrieval, where the document-based language model is interpolated with a background language model to compensate for sparseness. In the formula, T_i is a random variable for the query term on position i in the query (1 ≤ i ≤ n, where n is the query length), whose sample space is the set {t(1), t(2), …, t(m)} of all terms in the collection. The probability measure P(T_i) defines the probability of drawing a term at random from the collection, P(T_i|D) defines the probability of drawing a term at random from the document, and λ_i defines the importance of each query term. The marginal probability of relevance P(D) might be assumed uniformly distributed over the documents, in which case it may be ignored in the above formula.</p><formula xml:id="formula_0">P(T_1, T_2, \cdots, T_n | D)\, P(D) = P(D) \prod_{i=1}^{n} \big( (1-\lambda_i)\, P(T_i) + \lambda_i\, P(T_i | D) \big) \quad (1)</formula></div>
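The ranking idea behind equation 1 can be sketched in a few lines of Python. This is a toy illustration, not the TNO engine: the dictionary-of-counts layout, the maximum-likelihood background estimate and the function name `score` are our own assumptions.

```python
import math

def score(query, doc_tf, bg_tf, lam=0.3):
    """Log-probability of generating the query from a document model
    interpolated with a background model (equation 1, uniform P(D))."""
    doc_len = sum(doc_tf.values())          # tokens in this document
    bg_len = sum(bg_tf.values())            # tokens in the whole collection
    s = 0.0
    for t in query:
        p_bg = bg_tf.get(t, 0) / bg_len     # background estimate of P(T_i)
        p_doc = doc_tf.get(t, 0) / doc_len  # document estimate of P(T_i|D)
        s += math.log((1 - lam) * p_bg + lam * p_doc)
    return s
```

A document containing the query terms scores higher than one that does not, while the background model keeps the score finite for query terms the document lacks.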
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">A model of cross-language information retrieval</head><p>Information retrieval models and statistical translation models can be integrated into one unifying model for cross-language information retrieval <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b3">4]</ref>. Let S_i be a random variable for the source language query term on position i. Each document gets a score defined by the following formula.</p><formula xml:id="formula_1">P(S_1, S_2, \cdots, S_n | D)\, P(D) = P(D) \prod_{i=1}^{n} \sum_{j=1}^{m_i} P(S_i | T_i = t_i(j)) \big( (1-\lambda_i)\, P(T_i = t_i(j)) + \lambda_i\, P(T_i = t_i(j) | D) \big)<label>(2)</label></formula><p>In the formula, the probability measure P(S_i | T_i = t_i(j)) defines the translation probabilities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Translation in practice</head><p>In practice, the statistical translation model is used as follows. The automatic query formulation process translates the query S_1, S_2, …, S_n using a probabilistic dictionary. The probabilistic dictionary is a dictionary that lists pairs (s, t) together with their probability of occurrence, where s is from the sample space of S_i and t is from the sample space of T_i. For each S_i there will be one or more realisations t of T_i for which P(S_i | T_i = t) > 0, which will be called the possible translations of S_i. The possible translations should be grouped for each i to search the document collection, resulting in a structured query. For instance, suppose the original French query on an English collection is "déchets dangereux"; then possible translations of "déchets" might be "waste", "litter" or "garbage", possible translations of "dangereux" might be "dangerous" or "hazardous", and the structured query can be presented as follows.</p><p>((waste | litter | garbage), (dangerous | hazardous))</p><p>The product from i = 1 to n (in this case n = 2) of equation 2 is represented above by using the comma, as is done in the representation of a query of length 2 as T_1, T_2. The sum from j = 1 to m_i of equation 2 is represented by displaying only the realisations of T_i for which P(S_i | T_i) > 0 and by separating those by '|'. So, in practice, translation takes place during automatic query formulation (query translation), resulting in a structured query like the one displayed above that is matched against each document in the collection. Unless stated otherwise, whenever this paper mentions 'query terms', it will denote the target language query terms: realisations of T_i. Realisations of S_i, the source language query terms, will usually be left implicit. The combination of the structured query representation and the translation probabilities implicitly defines the sequence of the source language query terms S_1, S_2, …, S_n, but the actual realisation of the sequence is not important to the system.</p></div>
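As a sketch, the query-translation step that produces such a structured query might look as follows. The dictionary entries, data layout and function name here are invented for illustration; any real probabilistic dictionary would of course be far larger.

```python
# Hypothetical probabilistic dictionary: source term -> [(target, prob), ...]
dictionary = {
    "déchets": [("waste", 0.5), ("litter", 0.3), ("garbage", 0.2)],
    "dangereux": [("dangerous", 0.6), ("hazardous", 0.4)],
}

def translate(query, dictionary):
    """Build a structured query: one group of possible translations
    (with their probabilities) per source-language query position."""
    # Unknown terms are passed through untranslated with probability 1.
    return [dictionary.get(s, [(s, 1.0)]) for s in query]
```

Each group corresponds to one disjunct of the structured query, e.g. (waste | litter | garbage).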
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Probability estimation</head><p>The prior probability of relevance P(D), the probability of term occurrence in the collection P(T_i) and the probability of term occurrence in the relevant document P(T_i|D) are defined by the collection that is searched. For the evaluations reported in this paper, the following definitions were used, where tf(t, d) denotes the number of occurrences of the term t in the document d, and df(t) denotes the number of documents in which the term t occurs. Equation 3 is the definition used for the unofficial "document length normalisation" runs reported in section 5.</p><formula xml:id="formula_2">P(D = d) = \frac{\sum_t tf(t, d)}{\sum_{t, d} tf(t, d)} \quad (3) \qquad P(T_i = t_i | D = d) = \frac{tf(t_i, d)}{\sum_t tf(t, d)} \quad (4) \qquad P(T_i = t_i) = \frac{df(t_i)}{\sum_t df(t)}<label>(5)</label></formula><p>The translation probabilities P(S_i|T_i) and the value of λ_i, however, are unknown. The collection that is searched was not translated, or if it was translated, the translations are not available. Translation probabilities should therefore be estimated from other data, for instance from a parallel corpus. The value of λ_i determines the importance of the source language query term on position i. If λ_i = 1, then the system will assign zero probability to documents that do not contain any of the possible translations of the original query term on position i. In this case, a possible translation of the source language term is mandatory in the retrieved documents. If λ_i = 0, then the possible translations of the original query term on position i will not affect the final ranking. In this case, the source language query term is treated as if it were a stop word. For ad-hoc queries, it is not known which of the original query terms are important and which are not, and a constant value for each λ_i is taken. The system's default value is λ = 0.3.</p></div>
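Equations 3–5 amount to simple maximum-likelihood counts over the collection. A minimal sketch over a toy corpus (the helper names and the dict-of-dicts corpus layout are our own):

```python
def estimators(docs):
    """Return the estimators of equations 3-5 for a toy corpus.
    docs: dict doc_id -> dict term -> term frequency tf(t, d)."""
    total_tf = sum(tf for d in docs.values() for tf in d.values())
    df = {}                                   # document frequencies df(t)
    for d in docs.values():
        for t in d:
            df[t] = df.get(t, 0) + 1
    total_df = sum(df.values())
    # Equation 3: document prior proportional to document length
    prior = {i: sum(d.values()) / total_tf for i, d in docs.items()}
    # Equation 4: term probability within one document
    def p_t_given_d(t, i):
        return docs[i].get(t, 0) / sum(docs[i].values())
    # Equation 5: background term probability from document frequencies
    def p_t(t):
        return df.get(t, 0) / total_df
    return prior, p_t_given_d, p_t
```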
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Implementation</head><p>Equation 2 is not implemented as is, but instead it is rewritten into a weighting algorithm that assigns zero weight to terms that do not occur in the document. Filling in the definitions of equations 3, 4 and 5 in equation 2 results in the following formula. The probability measure P(S_i | T_i = t_i(j)) is replaced by the translation probability estimates τ_i(j).</p><formula xml:id="formula_3">P(d, S_1, S_2, \cdots, S_n) = \frac{\sum_t tf(t, d)}{\sum_{t, d} tf(t, d)} \prod_{i=1}^{n} \sum_{j=1}^{m_i} \tau_i(j) \Big( (1-\lambda_i) \frac{df(t_i(j))}{\sum_t df(t)} + \lambda_i \frac{tf(t_i(j), d)}{\sum_t tf(t, d)} \Big)</formula><p>The translation probabilities can be moved into the inner sum. As summing is associative and commutative, it is not necessary to calculate each probability separately before adding them. Instead, the document frequencies and the term frequencies of the disjuncts can respectively be added beforehand, properly multiplied by the translation probabilities. Only λ_i is constant for every addition in the inner sum and can therefore be moved outside the sum, resulting in:</p><formula xml:id="formula_4">P(d, S_1, S_2, \cdots, S_n) = \frac{\sum_t tf(t, d)}{\sum_{t, d} tf(t, d)} \prod_{i=1}^{n} \Big( (1-\lambda_i) \frac{\sum_{j=1}^{m_i} \tau_i(j)\, df(t_i(j))}{\sum_t df(t)} + \lambda_i \frac{\sum_{j=1}^{m_i} \tau_i(j)\, tf(t_i(j), d)}{\sum_t tf(t, d)} \Big)</formula><p>Using simple calculus (see e.g. <ref type="bibr" target="#b2">[3]</ref>), the probability measure can now be rewritten into a term weighting algorithm that assigns zero weight to non-matching terms, resulting in equation 6. The formula ranks documents in exactly the same order as equation 2.</p><formula xml:id="formula_5">P(d \mid S_1, S_2, \cdots, S_n) \propto \log\Big(\sum_t tf(t, d)\Big) + \sum_{i=1}^{n} \log\Bigg( 1 + \frac{\lambda_i \big(\sum_{j=1}^{m_i} \tau_i(j)\, tf(t_i(j), d)\big) \sum_t df(t)}{(1-\lambda_i) \big(\sum_{j=1}^{m_i} \tau_i(j)\, df(t_i(j))\big) \sum_t tf(t, d)} \Bigg)<label>(6)</label></formula><p>Equation 6 is the algorithm implemented in the TNO retrieval engine. It contains a weighted sum of respectively the term frequencies and the document frequencies, where the weights are determined by the translation probabilities τ_i(j). Unweighted summing of frequencies was used before for on-line stemming in <ref type="bibr" target="#b4">[5]</ref> in a vector space model retrieval system.</p><p>The model does not require the translation probabilities τ_i(j) to sum to one for each i, since they are conditioned on the target language query term and not on the source language query term. Interestingly, for the final ranking it does not matter what the actual sum of the translation probabilities is. Only the relative proportions of the translations define the final ranking of documents. This can be seen from τ_i(j), which occurs in both the numerator and the denominator of the big fraction in equation 6.</p></div>
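A direct transcription of equation 6, assuming the structured query is given as groups of (target term, τ) pairs as in section 2.2. The data-structure choices and the function name are ours; the TNO engine's actual implementation is not published here.

```python
import math

def score_eq6(structured_query, doc_tf, df, lam=0.3):
    """Equation 6: weighting over matching terms only.
    structured_query: list of groups [(target_term, tau), ...],
    doc_tf: term frequencies of one document, df: document frequencies."""
    doclen = sum(doc_tf.values())        # sum_t tf(t, d)
    sum_df = sum(df.values())            # sum_t df(t)
    s = math.log(doclen)
    for group in structured_query:       # one group per source position i
        wtf = sum(tau * doc_tf.get(t, 0) for t, tau in group)  # weighted tf
        wdf = sum(tau * df.get(t, 0) for t, tau in group)      # weighted df
        if wtf > 0:                      # non-matching groups add zero
            s += math.log(1 + (lam * wtf * sum_df) / ((1 - lam) * wdf * doclen))
    return s
```

Scaling all τ_i(j) within a group by a constant leaves the ranking unchanged, as noted above: the constant cancels between numerator and denominator.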
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5">A relevance feedback method for cross-language retrieval</head><p>This paper introduces a new relevance feedback method for cross-language information retrieval. If there were some known relevant documents, then the values of τ_i(j) and λ_i could be re-estimated from that data. The idea is the following. Suppose there are three known relevant English documents for the French query "déchets dangereux". If two out of three documents contain the term "waste" and none contain the terms "litter" and "garbage", then this is an indication that "waste" is the correct translation and should be assigned a higher translation probability than "litter" and "garbage". If only one of the three known relevant documents contains one or more possible translations of "dangereux", then this is an indication that the original query term "déchets" is more important (its possible translations occur in more relevant documents) than the original query term "dangereux", and the value of λ_i should be higher for "déchets" than for "dangereux".</p><p>The actual re-estimation of τ_i(j) and λ_i was done by iteratively applying the EM-algorithm defined by the formulas in equation 7. In the algorithm, τ_i(j)^(p) and λ_i^(p) denote the values on the p-th iteration, and r denotes the number of known relevant documents D_1, …, D_r. The values are initialised with the translation probabilities from the dictionary and with λ_i^(0) set to the default value 0.3.</p><formula xml:id="formula_7">\tau_i(j)^{(p+1)} = \frac{1}{r} \sum_{k=1}^{r} \frac{\tau_i(j)^{(p)} \big( (1-\lambda_i^{(p)})\, P(T_i = t_i(j)) + \lambda_i^{(p)}\, P(T_i = t_i(j) | D_k) \big)}{\sum_{l=1}^{m_i} \tau_i(l)^{(p)} \big( (1-\lambda_i^{(p)})\, P(T_i = t_i(l)) + \lambda_i^{(p)}\, P(T_i = t_i(l) | D_k) \big)} \qquad \lambda_i^{(p+1)} = \frac{1}{r} \sum_{k=1}^{r} \frac{\lambda_i^{(p)} \sum_{l=1}^{m_i} \tau_i(l)^{(p)}\, P(T_i = t_i(l) | D_k)}{\sum_{l=1}^{m_i} \tau_i(l)^{(p)} \big( (1-\lambda_i^{(p)})\, P(T_i = t_i(l)) + \lambda_i^{(p)}\, P(T_i = t_i(l) | D_k) \big)}<label>(7)</label></formula><p>The re-estimation of τ_i(j) and λ_i was done from 'pseudo-relevant' documents. First the top 10 documents were retrieved using the default values of τ_i(j) and λ_i, and then the feedback algorithm was used on these documents to find the new values. The actual algorithm implemented was a variation of equation 7 of the form (1/(r+1)) · (default value + Σ_{k=1}^{r} ·), to avoid that e.g. λ_i = 1 after re-estimation.</p></div>
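One EM iteration of equation 7, for a single query position i, can be sketched as follows. The array layout and names are our assumptions: `p_t[j]` stands for P(T_i = t_i(j)) and `p_t_d[k][j]` for P(T_i = t_i(j) | D_k).

```python
def em_step(taus, lam, p_t, p_t_d):
    """Re-estimate tau_i(j) and lambda_i from r (pseudo-)relevant documents
    (one iteration of equation 7)."""
    r, m = len(p_t_d), len(taus)
    new_taus = [0.0] * m
    new_lam = 0.0
    for k in range(r):
        # Mixture term per translation j; its sum is the shared denominator.
        mix = [taus[j] * ((1 - lam) * p_t[j] + lam * p_t_d[k][j])
               for j in range(m)]
        denom = sum(mix)
        for j in range(m):
            new_taus[j] += mix[j] / (denom * r)
        # Document part of the mixture drives the new lambda.
        doc_part = lam * sum(taus[j] * p_t_d[k][j] for j in range(m))
        new_lam += doc_part / (denom * r)
    return new_taus, new_lam
```

Iterating `em_step` shifts weight towards translations that actually occur in the relevant documents, mirroring the "waste" example above.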
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Translation resources</head><p>As in previous years we applied a dictionary based query translation approach. The translations were based on the VLIS lexical database of Van Dale publishers <ref type="bibr" target="#b1">[2]</ref>. Because VLIS currently lacks translations into Italian, we used two other resources: i) the Systran web based MT engine and ii) a probabilistic lexicon based on a parallel web corpus. The next section describes the construction of this new resource in more detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Parallel web corpora</head><p>We developed three parallel corpora based on web pages in close cooperation with RALI, Université de Montréal. RALI had already developed an English-French parallel corpus of web pages, so it seemed interesting to investigate the feasibility of a full multilingual system based on web derived lexical resources only. We used the PTMiner tool <ref type="bibr" target="#b6">[7]</ref> to find web pages which have a high probability of being translations of each other. The mining process consists of the following steps:</p><p>1. Query a web search engine for web pages with a hyperlink anchor text "English version" and respective variants.</p><p>2. (For each web site) Query a web search engine for all web pages on that particular site.</p><p>3. (For each web site) Try to find pairs of path names that match certain patterns, e.g.: /department/tt/english/home.html and /department/tt/italian.html.</p><p>4. (For each pair) Download the web pages, perform a language check using a probabilistic language classifier, and remove pages which are not positively identified as being written in the expected language.</p><p>The mining process was run for three language pairs and resulted in three modest-size parallel corpora. Table <ref type="table" target="#tab_0">1</ref> lists the sizes of the corpora during intermediate steps. Due to the dynamic nature of the web, a lot of pages that have been indexed do not exist anymore. Sometimes a site is down for maintenance. Finally, a lot of pages are simply place holders for images and are discarded by the language identification step. These parallel corpora have been used in different ways: i) to refine the estimates of translation probabilities of a dictionary based translation system (corpus based probability estimation) and ii) to construct simple statistical translation models (IBM model 1) <ref type="bibr" target="#b6">[7]</ref>. The former application is described in more detail in Section 5.2, the latter in Section 5.3. The translation models for English-Italian and English-German, complemented with an already existing model for English-French, also formed the basis for a fully corpus based multilingual translation run which is described elsewhere in this volume <ref type="bibr" target="#b5">[6]</ref>.</p></div>
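Step 3 of the mining process can be sketched as follows for the simplest case, where two paths differ only in a language marker. The marker list and function names are our own, and the paper's real patterns are more general (they also match pairs of different shape, such as /english/home.html against /italian.html).

```python
import re

# Hypothetical language markers as they might appear in URL path components.
MARKERS = {"english": "en", "en": "en", "italian": "it", "it": "it"}

def candidate_pairs(paths):
    """Pair path names that are identical except for a language marker."""
    keyed = {}
    for p in paths:
        # Split while keeping separators, so positions line up across paths.
        parts = re.split(r"([/._-])", p.lower())
        for i, part in enumerate(parts):
            if part in MARKERS:
                key = "".join(parts[:i] + ["<LANG>"] + parts[i + 1:])
                keyed.setdefault(key, {})[MARKERS[part]] = p
    return [(d["en"], d["it"]) for d in keyed.values()
            if "en" in d and "it" in d]
```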
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Merging intermediate runs</head><p>Our strategy for multilingual retrieval is to translate the query into the document languages, perform separate language specific runs and merge the results into a single result file. In previous CLIR evaluations, we compared different merging strategies:</p><p>round robin: The idea here is that document scores are not comparable across collections, because we are basically ignorant about the distribution of relevant documents in the retrieved lists; round robin assumes that these distributions are similar across languages.</p><p>raw score: This type of merging assumes that document scores are comparable across collections.</p><p>rank based: It has been observed that the relationship between the probability of relevance and the log of the rank of a document can be approximated by a linear function, at least for a certain class of IR systems. If a training collection is available, one can estimate the parameters of this relationship by applying regression. Merging can subsequently be based on the estimated probability of relevance. Note that the actual score of a document is then only used to rank documents: merging is based on the rank, not on the score.</p><p>The new CLEF multilingual task is based on a new document collection, which makes it hard to compute reliable estimates for the linear parameters; a training set is not available. A second disadvantage of the rank based merging strategy is that the linear function generalises across topics. Unfortunately, in the multilingual task the distribution of relevant documents over the subcollections is quite skewed. 
All collections have several (differing) topics without relevant documents, so applying a rank based merging strategy would hurt the performance for these topics, because the proportion of retrieved documents in every collection is the same for every topic.</p><p>The raw score merging strategy (which proved successful last year) does not need training data and also does not suffer from the equal proportions problem. Unfortunately, scores are usually not fully compatible across collections. We have tried to identify factors which cause these differences, and we have applied two normalisation techniques. First of all, we treat term translations as a weighted concept vector (cf. section 2). That means that we can normalise scores across topics by dividing the score by the query length, which amounts to computing the geometric average of probabilities per query concept. Secondly, we have observed that collection size has a large influence on the occurrence probability estimates P(T_i), because the estimated probability of rare terms is inversely proportional to the collection size. Figure <ref type="figure" target="#fig_1">1</ref> shows the probability estimates of a sample of words of 1 document when we add more documents to the collection. The occurrence probability of common words stabilises fast when the collection size increases. The rarer a word is, however, the more its occurrence probability is overestimated. This effect is a consequence of the sparse data problem: a small collection will never yield correct term occurrence probability estimates.</p><p>The collection-size dependency of collection-frequency (or global term frequency) estimates has a direct influence on the distribution of document scores for a particular query. When the collection is small, the scores will be lower than the scores on a large collection. This is due to the fact that the score we study is based on the maximum likelihood ratio. 
So the median of the distribution of document scores for a particular topic (set) is inversely related to the collection size. Thus when we use the raw scores of different subcollections as a basis for merging, large collections will be favoured.</p><p>We hypothesised that we could improve the merging process if we could correct the estimates for their dependence on the collection size. Suppose we have just two collections of different size (and different language), C_1 and C_2, with vocabulary sizes V_1 and V_2 and numbers of tokens T_1 and T_2 respectively, T_1 being the smaller. Now we could try either to extrapolate the term occurrence probability estimates on collection C_1 to a hypothetical collection with T_2 tokens, or to 'downscale' the term occurrence probability estimates of a term from C_2 to vocabulary size V_1.</p><p>The first option seems cumbersome, because we have hardly any information to guide the extrapolation process. The second option, adapting the estimates of the large collection to the small collection, seems more viable. The idea is to adapt the probability estimates of rare terms in such a way that they become 'compatible' with the estimates on the small collection. As shown in figure <ref type="figure" target="#fig_1">1</ref>, the estimates of frequent terms stabilise soon. Our idea is to construct a mapping function which maps the probability estimates to the small collection domain. The mapping function has the following requirements: a probability of 1/T_2 has to be mapped to 1/T_1, so this probability is multiplied by the factor T_2/T_1, and probabilities p larger than 1/T_2 are multiplied by a factor which decreases for larger p. In fact we only want very small changes for p in the order of 10^-3 and larger. A function which meets these properties is the polynomial f(x) = x + αx², where x = log(p) and</p><formula xml:id="formula_8">\alpha = \frac{\log(T_2 / T_1)}{(\log T_2)^2}.</formula><p>Because we have re-estimated the probabilities, one would expect that the probabilities have to be re-normalised (p'(t_i) = p(t_i) / Σ_{j=1}^{V_2} p(t_j)). However, this has the result that all global probabilities (also those of relatively frequent words) are increased, which would increase the scores of all documents, i.e. have the opposite effect of what we want. So we decided not to re-normalise, because a smaller corpus would also have a smaller vocabulary, which would compensate for the increase in probability mass resulting from the transformation.</p></div>
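The downscaling can be sketched as follows. The logarithm base is not recoverable from the text, so base 10 is assumed here, and the function name is ours.

```python
import math

def downscale(p, t_small, t_large):
    """Map a term probability estimated on a collection of t_large tokens
    onto the scale of a collection of t_small tokens: f(x) = x + a*x**2
    with x = log10(p), with a chosen so that 1/t_large maps to 1/t_small."""
    a = math.log10(t_large / t_small) / (math.log10(t_large) ** 2)
    x = math.log10(p)
    return 10 ** (x + a * x * x)
```

The rarest probability 1/T_2 is thus multiplied by exactly T_2/T_1, while more frequent terms are changed by a factor that shrinks as p grows.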
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Monolingual runs</head><p>We indexed the collections in the 4 languages separately. All documents were lemmatised using the Xelda morphological toolkit from Xerox XRCE and stopped with language specific stoplists. For German, we split compounds and added both the full compound and its parts to the index. This strategy is motivated by our experience with a Dutch corpus (Dutch is also a compounding language) <ref type="bibr" target="#b7">[8]</ref> and tests on the TREC CLIR test collection. Table <ref type="table" target="#tab_2">2</ref> shows the results of the monolingual runs; runs in bold are judged runs, runs in italic font are unofficial runs (mostly post-hoc). The table also lists the proportion of documents which have been judged. The standard runs include fuzzy lookup of unknown words. The expand option adds close orthographical variants for every query term. The official runs were done without the document length normalisation defined by equation 3.</p><p>The first thing that strikes us is that the pool depth is 50, contrary to the practice at TREC, where the top 100 documents are judged for relevance. Section 5.4 analyses the CLEF collection further. Length normalisation usually gives a modest improvement in average precision. The 'expand' option was especially effective for German. The reason is probably that compound parts are not always properly lemmatised by the German morphology. The German run in particular performs well, with 28 out of 37 topics above average. This relatively good performance is probably due to the morphology, which includes compound splitting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Bilingual runs</head><p>Table <ref type="table" target="#tab_3">3</ref> lists the results of the bilingual runs. All runs use Dutch as the query language. The base run of 0.3069 can be improved by several techniques: a higher lambda, document length normalisation, or Porter stemming instead of dictionary based stemming. The latter can be explained by the fact that Porter's algorithm is an aggressive stemmer that also removes most derivational affixes, which is usually beneficial to retrieval performance. The experiment with corpus based frequencies yielded disappointing results. We first generated topic translations in a standard fashion based on VLIS. Subsequently we replaced the translation probabilities P(w_NL | w_EN) by rough corpus based estimates. We simply looked up all English sentences which contained the translation and determined the proportion of the corresponding (aligned) Dutch sentences that contained the original Dutch query word. If the pair was not found, the original probability was left unchanged. Unfortunately, a lot of the query terms and translations were not found in the aligned corpus, because they were lemmatised whereas the corpus was not. This mismatch clearly hurt the estimates: the procedure resulted in high translation probabilities for words that did not occur in the corpus and low probabilities for words that did occur. The pseudo relevance feedback runs were done with the experimental language models retrieval engine at the University of Twente, using an index based on the Porter stemming algorithm. The run tagged tnoutne3-stem is the baseline run for this system. The official pseudo relevance feedback run used the top 10 documents retrieved to re-estimate relevance weights and translation probabilities, but turned out to contain a bug. The unofficial fixed run tnoutne4-fix performs a little bit worse than the baseline. 
The run tnoutne4-retro uses the relevant documents to re-estimate the probabilities retrospectively (see e.g. <ref type="bibr" target="#b8">[9]</ref>). This run reaches an impressive performance of 0.4695 average precision, much higher even than the best monolingual English run. This indicates that the algorithm might be helpful in an interactive setting where the user's feedback is used to retrieve a new, improved set of documents. Apparently, the top 10 retrieved documents contain too much noise to be useful for re-estimation of the model's parameters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Multilingual runs</head><p>Table <ref type="table" target="#tab_4">4</ref> shows that our best multilingual run was a run with Dutch as the query language. On the one hand this is surprising, because this run is composed of 4 bilingual runs instead of 3 for the EN→X run. But the translation is based on the VLIS lexical database, which is built on lexical relations with Dutch as the source language. Thus the translations in the NL→X case are much cleaner than in the EN→X case, where Dutch serves as a pivot language. On the other hand, the NL→IT translation is quite cumbersome: we first used Xelda to translate the Dutch queries into stopped and lemmatised English files, and these files were subsequently translated by Systran.</p><p>Another interesting point is that the intermediate bilingual run based on the parallel web corpus performed quite well, with an average precision of 0.2750 versus 0.3203 for Systran. The translation in this run is based on a translation model trained on the parallel web corpus. The English topics were simply stopped and translated by the translation model. We took the most probable translation of each term and used that as the Italian query. We plan to experiment with a more refined approach in which we import the translation probabilities into structured queries. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">The CLEF collection</head><p>This section reports on some statistics of the CLEF collection and compares it to the TREC cross-language collection. Table <ref type="table" target="#tab_6">6</ref> lists the same information for the TREC collection. The collections are actually quite different. First of all, the CLEF collection is almost half the size of the TREC collection and heavily biased towards German and English documents. Although the CLEF organisation decided to judge only the top 50 documents retrieved and not the top 100 as in TREC, the number of documents judged per topic is only a little lower for the CLEF collection: about 814 documents per topic versus 834 for TREC. Given the fact that the 56 TREC topics were developed over a period of two years and the CLEF collection has 40 topics already, the organisation actually did more work this year compared to previous years. Another striking difference is the number of relevant documents per topic: only 57 for CLEF and 116 for TREC. This might actually make the decision to judge only the top 50 of runs not that harmful for the usefulness of the CLEF evaluation results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>This year's evaluation has confirmed that cross-language retrieval based on structured queries, no matter what the translation resources are, is a powerful technique. Re-estimating model parameters based on pseudo-relevant documents does not result in an improvement of retrieval performance. However, the relevance weighting algorithm shows an impressive performance gain if the relevant documents are used retrospectively. This indicates that the algorithm might in fact be a valuable tool for processing user feedback in an interactive setting. Finally, merging based on the collection size re-estimation technique proved not successful. Further analysis is needed to determine why the technique did not work on this collection, as it was quite successful on the TREC-8 collection.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>λ_i^(0) = 0.3. The re-estimation formulas should be applied simultaneously for each p until the values do not change significantly anymore.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1:</head><label>1</label><figDesc>Probability estimates vs. collection size</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Intermediate sizes during corpus construction</figDesc><table><row><cell>language</cell><cell>nr of web sites</cell><cell>nr of candidate pages</cell><cell>nr of candidate pairs</cell><cell>retrieved + cleaned pairs</cell></row><row><cell>EN-IT</cell><cell>3651</cell><cell>1053649</cell><cell>23447</cell><cell>4768</cell></row><row><cell>EN-DE</cell><cell>3817</cell><cell>1828906</cell><cell>33577</cell><cell>5743</cell></row><row><cell>EN-NL</cell><cell>3004</cell><cell>1170082</cell><cell>24738</cell><cell>2907</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 :</head><label>2</label><figDesc>Results of the monolingual runs</figDesc><table><row><cell>tnoutdd1</cell><cell>0.3760</cell><cell></cell><cell>standard</cell><cell>18.64</cell><cell>79.05</cell><cell>100</cell></row><row><cell>tnoutdd2</cell><cell>0.3961</cell><cell>28/37</cell><cell>+expand</cell><cell>18.72</cell><cell>81.22</cell><cell>100</cell></row><row><cell>tnoutdd2l</cell><cell>0.3968</cell><cell></cell><cell>+length normalisation</cell><cell>18.58</cell><cell>78.22</cell><cell>97.50</cell></row><row><cell>tnoutff1</cell><cell>0.4551</cell><cell></cell><cell>standard</cell><cell>16.13</cell><cell>79.42</cell><cell>100</cell></row><row><cell>tnoutff2</cell><cell>0.4471</cell><cell>18/34</cell><cell>+expand</cell><cell>16.21</cell><cell>80.88</cell><cell>100</cell></row><row><cell>tnoutff2l</cell><cell>0.4529</cell><cell></cell><cell>+length normalisation</cell><cell>16.00</cell><cell>77.88</cell><cell>97.50</cell></row><row><cell>tnoutii1</cell><cell>0.4677</cell><cell></cell><cell>standard</cell><cell>16.59</cell><cell>78.92</cell><cell>100</cell></row><row><cell>tnoutii2</cell><cell>0.4709</cell><cell>18/34</cell><cell>+expand</cell><cell>16.67</cell><cell>80.33</cell><cell>100</cell></row><row><cell>tnoutii2l</cell><cell>0.4808</cell><cell></cell><cell>+length normalisation</cell><cell>16.66</cell><cell>77.25</cell><cell>98</cell></row><row><cell>tnoutee01i</cell><cell>0.4200</cell><cell></cell><cell>standard</cell><cell>17.81</cell><cell>71.10</cell><cell>100</cell></row><row><cell>tnoutee01</cell><cell>0.4169</cell><cell></cell><cell>+expand</cell><cell>17.84</cell><cell>70.75</cell><cell>99.75</cell></row><row><cell>tnoutee01l</cell><cell>0.4273</cell><cell></cell><cell>+length normalisation</cell><cell>17.82</cell><cell>69.30</cell><cell>98.00</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 :</head><label>3</label><figDesc>Results of the bilingual runs</figDesc><table><row><cell>run name</cell><cell>avp</cell><cell>above median</cell><cell cols="2">description</cell></row><row><cell>tnoutne1</cell><cell>0.3069</cell><cell>27/33</cell><cell cols="2">standard</cell></row><row><cell>tnoutne1l</cell><cell>0.3278</cell><cell>-</cell><cell cols="2">+ doclen norm</cell></row><row><cell>tnoutne1p</cell><cell>0.3442</cell><cell>-</cell><cell>+</cell><cell>¼</cell></row><row><cell>tnoutne2</cell><cell>0.2762</cell><cell>25/33</cell><cell cols="2">corpus frequencies</cell></row><row><cell>tnoutne3-stem</cell><cell>0.3366</cell><cell>-</cell><cell cols="2">Porter stemmer +doclen norm</cell></row><row><cell>tnoutne4</cell><cell>0.2946</cell><cell>20/33</cell><cell cols="2">pseudo relevance feedback (PRF)</cell></row><row><cell>tnoutne4-fix</cell><cell>0.3266</cell><cell>-</cell><cell cols="2">PRF bugfix +doclen norm, Porter</cell></row><row><cell>tnoutne4-retro</cell><cell>0.4695</cell><cell>-</cell><cell cols="2">retrospective relevance feedback</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4 :</head><label>4</label><figDesc>Results of the EN-IT runs</figDesc><table><row><cell>run name</cell><cell>avp</cell><cell>above median</cell><cell>description</cell></row><row><cell>tnoutex1</cell><cell>0.2214</cell><cell>25/40</cell><cell>baseline run</cell></row><row><cell>tnoutex2</cell><cell>0.2165</cell><cell>26/40</cell><cell>merged</cell></row><row><cell>tnoutex2f</cell><cell>0.2219</cell><cell></cell><cell>fixed</cell></row><row><cell>tnoutex3</cell><cell>0.1960</cell><cell>25/40</cell><cell>Web based EN-IT lexicon</cell></row><row><cell>tnoutnx1</cell><cell>0.2256</cell><cell>23/40</cell><cell>query language is Dutch</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 5 :</head><label>5</label><figDesc>CLEF collection statistics, 40 topics: size, number of judged documents, number of relevant documents and judged fraction (the part of the collection judged per topic)</figDesc><table><row><cell>collection</cell><cell>total docs.</cell><cell>judged docs.</cell><cell>relevant docs.</cell><cell>no hits in topic</cell><cell>judged fraction</cell></row><row><cell>english</cell><cell>110,250</cell><cell>14,737</cell><cell>579</cell><cell>2, 6, 8, 23, 25, 27, 35</cell><cell>0.0033</cell></row><row><cell>french</cell><cell>44,013</cell><cell>8,434</cell><cell>528</cell><cell>2, 4, 14, 27, 28, 36</cell><cell>0.0048</cell></row><row><cell>german</cell><cell>153,694</cell><cell>12,283</cell><cell>821</cell><cell>2, 28, 36</cell><cell>0.0020</cell></row><row><cell>italian</cell><cell>58,051</cell><cell>8,112</cell><cell>338</cell><cell>3, 6, 14, 27, 28, 40</cell><cell>0.0035</cell></row><row><cell>total</cell><cell>366,008</cell><cell>43,566</cell><cell>2,266</cell><cell></cell><cell>0.0022</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 6 :</head><label>6</label><figDesc>TREC collection statistics, 56 topics (26-81)</figDesc><table><row><cell>collection</cell><cell>total docs.</cell><cell>judged docs.</cell><cell>relevant docs.</cell><cell>no hits in topic</cell><cell>judged fraction</cell></row><row><cell>english</cell><cell>242,866</cell><cell>18,783</cell><cell>2,645</cell><cell>26, 46, 59, 63, 66, 75</cell><cell>0.0014</cell></row><row><cell>french</cell><cell>141,637</cell><cell>11,881</cell><cell>1,569</cell><cell>76</cell><cell>0.0015</cell></row><row><cell>german</cell><cell>185,099</cell><cell>8,656</cell><cell>1,634</cell><cell>26, 60, 75, 76</cell><cell>0.0008</cell></row><row><cell>italian</cell><cell>62,359</cell><cell>7,396</cell><cell>671</cell><cell>26, 44, 51, 60, 63, 75, 80</cell><cell>0.0021</cell></row><row><cell>total</cell><cell>631,961</cell><cell>46,716</cell><cell>6,519</cell><cell></cell><cell>0.0013</cell></row></table></figure>
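The retrospective relevance weighting discussed in the conclusions builds on the relevance weighting of search terms by Robertson and Sparck Jones [8]. As an illustration only (not the exact implementation used in the runs), the standard relevance weight of a term can be sketched as:

```python
import math

def relevance_weight(r, n, R, N):
    """Robertson/Sparck Jones (1976) relevance weight of a term.

    r: relevant documents containing the term
    n: documents containing the term
    R: known relevant documents
    N: documents in the collection
    The 0.5 offsets are the usual smoothing against zero counts.
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# A term concentrated in the relevant documents gets a positive weight...
print(relevance_weight(r=8, n=20, R=10, N=100))
# ...while a term absent from all relevant documents is weighted down.
print(relevance_weight(r=0, n=20, R=10, N=100))
```

Used retrospectively, the counts r and R come from the full relevance judgements rather than from a pseudo-relevant top ranking, which is what produces the large gain reported for run tnoutne4-retro.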
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>We would like to thank the DRUID project for sponsoring the translation of the topic set into Dutch. We thank Xerox XRCE for making the Xelda morphological toolkit available to us. Furthermore we would like to thank Jiang Chen (RALI, Université de Montréal), Jian-Yun Nie (RALI) for help with the PTMiner web mining tools, and Michel Simard (RALI) for helping with the construction of aligned corpora and building translation models.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Disambiguation strategies for cross-language information retrieval</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hiemstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">M G</forename><surname>De Jong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the third European Conference on Research and Advanced Technology for Digital Libraries</title>
				<meeting>the third European Conference on Research and Advanced Technology for Digital Libraries</meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="274" to="293" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Twenty-One at TREC-7: Ad-hoc and cross-language track</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hiemstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kraaij</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the seventh Text Retrieval Conference TREC-7, NIST Special Publication</title>
				<meeting>the seventh Text Retrieval Conference TREC-7, NIST Special Publication</meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="227" to="238" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A probabilistic justification for using tf.idf term weighting in information retrieval</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hiemstra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal on Digital Libraries</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="131" to="139" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Twenty-one at TREC-8: using language technology for information retrieval</title>
		<author>
			<persName><forename type="first">W</forename><surname>Kraaij</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pohlmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hiemstra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the eighth Text Retrieval Conference TREC-8</title>
				<meeting>the eighth Text Retrieval Conference TREC-8</meeting>
		<imprint>
			<publisher>NIST Special Publications</publisher>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Viewing stemming as recall enhancement</title>
		<author>
			<persName><forename type="first">W</forename><surname>Kraaij</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pohlmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR96)</title>
				<editor>
			<persName><forename type="first">H</forename><forename type="middle">P</forename><surname>Frei</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Harman</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Schäuble</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Wilkinson</surname></persName>
		</editor>
		<meeting>the 19th ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR96)</meeting>
		<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="40" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Parallel web corpora for CLIR?</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Y</forename><surname>Nie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CLEF 2000</title>
				<meeting>CLEF 2000</meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Cross-language information retrieval based on parallel texts and automatic mining of parallel texts in the web</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Y</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Simard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Isabelle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Durand</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM-SIGIR&apos;99</title>
				<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The effect of syntactic phrase indexing on retrieval performance for Dutch texts</title>
		<author>
			<persName><forename type="first">R</forename><surname>Pohlmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kraaij</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of RIAO&apos;97</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Devroye</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Chrisment</surname></persName>
		</editor>
		<meeting>RIAO&apos;97</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="176" to="187" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Relevance weighting of search terms</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sparck Jones</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for Information Science</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="page" from="129" to="146" />
			<date type="published" when="1976">1976</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
