<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Twenty-One at CLEF-2000: Translation resources, merging strategies and relevance feedback</article-title>
      </title-group>
      <abstract>
        <p>This paper describes the official runs of the Twenty-One group for CLEF-2000. The Twenty-One group participated in the monolingual, bilingual and multilingual tasks. The following new techniques are introduced in this paper. In the bilingual task we experimented with different methods to estimate translation probabilities. In the multilingual task we experimented with refinements on raw-score merging techniques and with a new relevance feedback algorithm that re-estimates both the model's translation probabilities and the relevance weights. Finally, we performed preliminary experiments to exploit the web to generate translation probabilities and bilingual dictionaries, notably for English-Italian and English-Dutch.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <sec id="sec-1-1">
        <title>Translation in practice</title>
        <p>Formula 1 shows the basic idea of this approach to information retrieval, where the document-based language model is interpolated with a background language model to compensate for sparseness:
P(D_k, T_1, …, T_n) = P(D_k) · ∏_{i=1..n} ((1−λ_i) · P(T_i) + λ_i · P(T_i|D_k))   (1)
In the formula, T_i is a random variable for the query term on position i in the query (1 ≤ i ≤ n, where n is the query length), whose sample space is the set {t(0), t(1), …, t(m)} of all terms in the collection. The probability measure P(T_i) defines the probability of drawing a term at random from the collection, P(T_i|D_k) defines the probability of drawing a term at random from the document D_k, and λ_i defines the importance of each query term. The marginal probability of relevance P(D_k) might be assumed uniformly distributed over the documents, in which case it may be ignored in the above formula.</p>
        <p>
          Information retrieval models and statistical translation models can be integrated into one unifying model for cross-language information retrieval [
          <xref ref-type="bibr" rid="ref1 ref4">1, 4</xref>
          ]. Let S_i be a random variable for the source language query term on position i. Each document gets a score defined by the following formula:
P(D_k, S_1, …, S_n) = P(D_k) · ∏_{i=1..n} Σ_{j=1..m} P(S_i|T_i=t(j)) · ((1−λ_i) · P(T_i=t(j)) + λ_i · P(T_i=t(j)|D_k))   (2)
        </p>
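As a concrete illustration, the scoring of one document under formula 2 can be sketched as follows (a minimal Python sketch; the function and variable names are ours, not part of the system):

```python
import math

def clir_score(query, doc_tf, doc_len, df, df_total, lam=0.3):
    """Score one document under formula 2. `query` holds, for every source
    language position i, a dict mapping candidate translations t(j) to their
    translation probabilities tau_i(j). The document and collection language
    models are interpolated with weight lam (the system's default is 0.3)."""
    log_score = 0.0
    for translations in query:
        pos_prob = 0.0
        for term, tau in translations.items():
            p_coll = df.get(term, 0) / df_total    # P(T_i = t(j)), from document frequencies
            p_doc = doc_tf.get(term, 0) / doc_len  # P(T_i = t(j)|D_k), from term frequencies
            pos_prob += tau * ((1.0 - lam) * p_coll + lam * p_doc)
        if pos_prob == 0.0:
            return float("-inf")  # no translation of this source term occurs anywhere
        log_score += math.log(pos_prob)  # work in log space to avoid underflow
    return log_score
```

A document containing a likely translation of every source term receives a higher score than one that matches none.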
      </sec>
      <sec id="sec-1-2">
        <title>A model of cross-language information retrieval</title>
      </sec>
      <sec id="sec-1-3">
        <title>Probability estimation</title>
        <p>The probability measures of the model are estimated as follows. P(T_i = t(j)|D_k) is defined by the relative term frequency in the document:
P(T_i = t(j)|D_k) = tf(t(j), d_k) / Σ_t tf(t, d_k)   (3)
Division by the document length in equation 3 is the model’s form of document length normalisation, cf. the “no document length normalisation” runs reported in section 5. P(T_i = t(j)) is defined by the relative document frequency in the collection:
P(T_i = t(j)) = df(t(j)) / Σ_t df(t)   (4)
Finally, the translation probability is replaced by an estimate:
P(S_i = s_i|T_i = t(j)) = τ_i(j)   (5)</p>
      </sec>
      <sec id="sec-1-4">
        <title>Implementation</title>
        <p>The translation probabilities P(S_i|T_i) and the value of λ_i, however, are unknown. The collection that is searched was not translated, or if it was translated, the translations are not available. The translation probabilities should therefore be estimated from other data, for instance from a parallel corpus. The value of λ_i determines the importance of the source language query term. If λ_i = 1 then the system will assign zero probability to documents that do not contain any of the possible translations of the original query term on position i. In this case, a possible translation of the source language term is mandatory in the retrieved documents. If λ_i = 0 then the possible translations of the original query term on position i will not affect the final ranking. In this case, the source language query term is treated as if it were a stop word. For ad-hoc queries, it is not known which of the original query terms are important and which are not, and a constant value for each λ_i is taken. The system’s default value is 0.3.</p>
        <p>Equation 2 is not implemented as is, but instead it is rewritten into a weighting algorithm that assigns zero weight to terms that do not occur in the document. Filling in the definitions of equations 3, 4 and 5 in equation 2, the probability measure P(S_i|T_i = t(j)) is replaced by the translation probability estimates τ_i(j). The translation probabilities can be moved into the inner sum. As summing is associative and commutative, it is not necessary to calculate each probability separately before adding them. Instead, the document frequencies and the term frequencies of the disjuncts can be added beforehand, properly multiplied by the translation probabilities. Only λ_i in the big sum is constant for every addition and can therefore be moved outside the sum, resulting in:</p>
        <p>
          P(D_k, S_1, …, S_n) = P(D_k) · ∏_{i=1..n} ((1−λ_i) · (Σ_{j=1..m} τ_i(j) df(t(j))) / Σ_t df(t) + λ_i · (Σ_{j=1..m} τ_i(j) tf(t(j), d_k)) / Σ_t tf(t, d_k))
Using simple calculus (see e.g. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]), the probability measure can now be rewritten into a term weighting algorithm that assigns zero weight to non-matching terms, resulting in equation 6:
score(D_k) = Σ_{i: matching} log(1 + (λ_i · Σ_{j=1..m} τ_i(j) tf(t(j), d_k) · Σ_t df(t)) / ((1−λ_i) · Σ_{j=1..m} τ_i(j) df(t(j)) · Σ_t tf(t, d_k)))   (6)
The formula ranks documents in exactly the same order as equation 2.
        </p>
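The pre-summing trick can be sketched as follows (a sketch of the equation 6 style weighting under the estimates of equations 3 to 5; the names are ours):

```python
import math

def eq6_score(query, doc_tf, doc_len, df, df_total, lam=0.3):
    """Equation 6 style weighting: pre-sum the tau-weighted document
    frequencies and term frequencies of the disjuncts per source position,
    then add a log weight for matching positions only, so documents can be
    scored from an inverted file."""
    score = 0.0
    for translations in query:
        wdf = sum(tau * df.get(t, 0) for t, tau in translations.items())
        wtf = sum(tau * doc_tf.get(t, 0) for t, tau in translations.items())
        if wtf != 0.0 and wdf != 0.0:  # non-matching positions get zero weight
            score += math.log(1.0 + (lam * wtf * df_total) / ((1.0 - lam) * wdf * doc_len))
    return score
```

Because the document-independent part of the product is dropped, the resulting ranking is the same as that of formula 2.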
        <p>
          Equation 6 is the algorithm implemented in the TNO retrieval engine. It contains a weighted sum of respectively the term frequencies and the document frequencies, where the weights are determined by the translation probabilities τ_i(j). Unweighted summing of frequencies was used before for on-line stemming in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] in a vector space model retrieval system.
        </p>
        <p>The model does not require the translation probabilities τ_i(j) to sum up to one for each i, since they are conditioned on the target language query term and not on the source language query term. Interestingly, for the final ranking it does not matter what the actual sum of the translation probabilities is. Only the relative proportions of the translations define the final ranking of documents. This can be seen from τ_i(j), which occurs both in the numerator and in the denominator of the big fraction in equation 6.</p>
        <p>This paper introduces a new relevance feedback method for cross-language information retrieval. If there were some known relevant documents, then the values of τ_i(j) and λ_i could be re-estimated from that data. The idea is the following. Suppose there are three known relevant English documents to the French query “déchets dangereux”. If two out of three documents contain the term “waste” and none contain the terms “litter” and “garbage”, then this is an indication that “waste” is the correct translation and should be assigned a higher translation probability than “litter” and “garbage”. If only one of the three known relevant documents contains one or more possible translations of “dangereux”, then this is an indication that the original query term “déchets” is more important (its possible translations occur in more relevant documents) than the original query term “dangereux”, and the value of λ_i should be higher for “déchets” than for “dangereux”.</p>
        <p>The actual re-estimation of τ_i(j) and λ_i was done by iteratively applying the EM-algorithm defined by the formulas in equation 7. In the algorithm, τ_i(j)^(p) and λ_i^(p) denote the values on the pth iteration. The values are initialised with the translation probabilities from the dictionary and with λ_i^(0) = 0.3. The re-estimation formulas should be applied simultaneously for each p until the values do not change significantly anymore.</p>
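Equation 7 itself is not reproduced here; the following sketch shows, for a single query position, the general shape of such an EM re-estimation from known relevant documents. It is our illustrative reading of the update, not the paper's exact formulas, and all names are ours:

```python
import math

def em_feedback(tau0, rel_docs, df, df_total, lam0=0.3, iters=25):
    """Illustrative EM re-estimation of tau_i(j) and lambda_i for one query
    position. `tau0` maps candidate translations to dictionary probabilities;
    `rel_docs` is a list of (tf_dict, doc_len) pairs for the known relevant
    documents. E-step: per document, compute the posterior probability that
    each translation (and that the document model rather than the background
    model) generated the position. M-step: average the posteriors."""
    tau, lam = dict(tau0), lam0
    for _ in range(iters):
        post_tau = {t: 0.0 for t in tau}
        post_lam = 0.0
        for tf, doc_len in rel_docs:
            parts = {}
            for t, w in tau.items():
                p_doc = lam * tf.get(t, 0) / doc_len          # document model part
                p_bg = (1.0 - lam) * df.get(t, 0) / df_total  # background part
                parts[t] = (w * (p_doc + p_bg), w * p_doc)
            total = sum(p for p, _ in parts.values()) or 1.0
            for t, (p, p_doc_part) in parts.items():
                post_tau[t] += p / total   # E-step posterior per translation
                post_lam += p_doc_part / total
        n = len(rel_docs)
        tau = {t: v / n for t, v in post_tau.items()}  # M-step averages
        lam = post_lam / n
    return tau, lam
```

On the “déchets dangereux” example above, a translation that occurs in most relevant documents accumulates posterior mass and ends up with a higher τ, while λ grows with the fraction of positions explained by the document models.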
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Translation resources</title>
      <p>
        As in previous years we applied a dictionary based query translation approach. The translations were based
on the VLIS lexical database of Van Dale publishers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Because VLIS currently lacks translations into
Italian, we used two other resources: i) the Systran web-based MT engine, and ii) a probabilistic lexicon based
on a parallel web corpus. The next section describes the construction of this new resource in more detail.
      </p>
      <sec id="sec-2-1">
        <title>Parallel web corpora</title>
        <p>
          We developed three parallel corpora based on web pages in close cooperation with RALI, Université de
Montréal. RALI already had developed an English-French parallel corpus of web pages, so it seemed
interesting to investigate the feasibility of a full multilingual system based on web derived lexical resources
only. We used the PTMiner tool [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] to find web pages which have a high probability to be translations of
each other. The mining process consists of the following steps:
1. Query a web search engine for web pages with a hyperlink anchor text “English version” and
respective variants.
2. (For each web site) Query a web search engine for all web pages on a particular site.
3. (For each web site) Try to find pairs of path names that match certain patterns, e.g.:
/department/tt/english/home.html and /department/tt/italian.html.
4. (For each pair) download web pages, perform a language check using a probabilistic language
classifier, remove pages which are not positively identified as being written in a particular language.
        </p>
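Step 3, the pairing of path names, can be sketched as follows (a simplified sketch; the marker list and names are ours, not PTMiner's actual patterns):

```python
import re

# Hypothetical language markers; PTMiner's real pattern lists are richer.
MARKERS = [("english", "italian"), ("en", "it"), ("eng", "ita")]

def candidate_pairs(paths):
    """Given all path names of one web site, propose pairs of paths that
    differ only in a language marker, e.g. .../english/... vs .../italian/..."""
    path_set = set(paths)
    pairs = []
    for path in paths:
        for src, tgt in MARKERS:
            # swap the marker where it occurs as a whole path component or word
            guess = re.sub(r"\b%s\b" % src, tgt, path)
            if guess != path and guess in path_set:
                pairs.append((path, guess))
    return pairs
```

Each candidate pair is then downloaded and passed through the language classifier of step 4.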
        <p>The mining process was run for three language pairs and resulted in three modest-size parallel corpora.
Table 1 lists the sizes of the corpus during intermediate steps. Due to the dynamic nature of the web, a lot of
pages that have been indexed do not exist anymore. Sometimes a site is down for maintenance. Finally, a
lot of pages are simply place holders for images and are discarded by the language identification step.</p>
        <p>
          These parallel corpora have been used in different ways: i) to refine the estimates of translation
probabilities of a dictionary based translation system (corpus based probability estimation) ii) to construct simple
statistical translation models (IBM model 1) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The former application is described in more detail
in Section 5.2, the latter in Section 5.3. The translation models for English-Italian and English-German,
complemented with an already existing model for English-French, also formed the basis for a full corpus-based
translation multilingual run which is described elsewhere in this volume [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Merging intermediate runs</title>
      <p>Our strategy for multilingual retrieval is to translate the query into the document languages, perform separate
language-specific runs and merge the results into a single result file. In previous CLIR evaluations, we
compared different merging strategies:
round robin: Here the idea is that document scores are not comparable across collections, because we
are basically ignorant about the distribution of relevant documents in the retrieved lists. Round robin
assumes that these distributions are similar across languages.
raw score: This type of merging assumes that document scores are comparable across collections.
rank based: It has been observed that the relationship between the probability of relevance and the log of
the rank of a document can be approximated by a linear function, at least for a certain class of IR
systems. If a training collection is available, one can estimate the parameters of this relationship by
applying regression. Merging can subsequently be based on the estimated probability of relevance.</p>
      <p>Note that the actual score of a document is only used to rank documents, but that merging is based
on the rank, not on the score.</p>
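The first two strategies can be sketched as follows (runs are ranked lists of (document, score) pairs; the names are ours):

```python
from itertools import chain, zip_longest

def raw_score_merge(runs, k=1000):
    """Raw score merging: pool all results and sort on the raw scores,
    assuming the scores are comparable across collections."""
    pooled = [item for run in runs for item in run]
    pooled.sort(key=lambda pair: pair[1], reverse=True)
    return pooled[:k]

def round_robin_merge(runs, k=1000):
    """Round robin merging: interleave the runs rank by rank, ignoring the
    scores, assuming similar distributions of relevance across languages."""
    interleaved = chain.from_iterable(zip_longest(*runs))
    return [item for item in interleaved if item is not None][:k]
```

Rank based merging would replace the sort key by a probability of relevance estimated from the log of the rank.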
      <p>The new CLEF multilingual task is based on a new document collection which makes it hard to
compute reliable estimates for the linear parameters; a training set is not available. A second disadvantage
of the rank based merging strategy is that the linear function generalises across topics. Unfortunately in
the multilingual task, the distribution of relevant documents over the subcollections is quite skewed. All
collections have several (differing) topics without relevant documents, so applying a rank based merging
strategy would hurt the performance for these topics, because the proportion of retrieved documents in
every collection is the same for every topic.</p>
      <p>The raw score merging strategy (which proved successful last year) does not need training data and
also does not suffer from the equal proportions problem. Unfortunately, scores are usually not fully
comparable across collections. We have tried to identify the factors which cause these differences, and we have
applied two normalisation techniques. First of all, we treat the term translations as a weighted concept vector
(cf. section 2). That means that we can normalise scores across topics by dividing the score by the query
length, which amounts to computing the geometric average of the probabilities per query concept. Secondly,
we have observed that collection size has a large influence on the occurrence probability estimates P(T_i|C),
because the probability of rare terms is inversely proportional to the collection size.</p>
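The first normalisation step amounts to the following (a small numeric illustration; the probabilities are made up):

```python
import math

# per-concept probabilities contributed by one topic's query concepts
probs = [0.02, 0.5, 0.1]
log_score = sum(math.log(p) for p in probs)  # score of the model in log space
norm = log_score / len(probs)                # divide by the query length
# exp of the normalised score is the geometric average probability per concept
geo_mean = math.prod(probs) ** (1.0 / len(probs))
assert math.isclose(math.exp(norm), geo_mean)
```

Dividing the log score by the query length thus makes topics with different numbers of concepts comparable.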
      <p>Figure 4 shows the probability estimates of a sample of words of one document when we add more
documents to the collection. The occurrence probability of common words stabilises quickly as the collection
size increases. The rarer a word is, however, the higher the degree of overestimation of its occurrence
probability. This effect is a consequence of the sparse data problem. In fact, a small collection will never
yield correct term occurrence probability estimates.</p>
      <p>The collection-size dependency of collection-frequency (or global term frequency) estimates has a
direct influence on the distribution of document scores for a particular query. When the collection is
small, the scores will be lower than the scores on a large collection. This is due to the fact that the score
we study is based on the maximum likelihood ratio. So the median of the distribution of document scores
for a particular topic (set) is inversely related with the collection size. Thus when we use the raw scores of
different subcollections as a basis for merging, large collections will be favoured.</p>
      <p>We hypothesised that we could improve the merging process if we could correct the estimates for their
dependence on the collection size. Suppose we have just two collections with a different size (and a different
language): C1 and C2, with vocabulary sizes V1 and V2 and numbers of tokens T1 and T2 respectively, with
T1 &lt;&lt; T2. Now we could either try to extrapolate the term occurrence probability estimates on collection C1 to a
hypothetical collection with T2 tokens, or try to ‘downscale’ the term occurrence probability estimates of
collection C2 to the scale of C1.</p>
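The ‘downscale’ option can be sketched as follows, under one reading of the mapping-function requirements used in this section (a probability of 1/T2 must map to 1/T1, and frequent terms should barely change), using the polynomial f(x) = x − a·x² on x = log(p); the exact constant a is our reconstruction:

```python
import math

def downscale(p, t1, t2):
    """Map a term occurrence probability p estimated on the large collection
    C2 (t2 tokens) to the scale of the small collection C1 (t1 tokens) using
    f(x) = x - a*x**2 on x = log(p). The constant a is fixed here by the
    requirement that a probability of 1/t2 is mapped to 1/t1."""
    x1 = math.log(1.0 / t1)
    x2 = math.log(1.0 / t2)
    a = (x2 - x1) / (x2 * x2)
    x = math.log(p)
    return math.exp(x - a * x * x)
```

Rare terms are boosted towards the estimates a collection of t1 tokens would give, while the estimates of frequent terms are left almost untouched.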
      <p>The first option seems cumbersome, because we have hardly any information to guide the extrapolation
process. The second option, trying to adapt the estimates of the large collection to the small collection,
seems more viable. As shown in figure 4, the estimates of frequent terms stabilise soon. The idea is therefore to
adapt the probability estimates of rare terms in such a way that they will become ‘compatible’ with the
estimates on the small collection. Our idea is to construct a mapping function which maps the probability
estimates to the small collection domain. The mapping function has the following requirements: a probability
of 1/T2 has to be mapped to 1/T1, so it is multiplied by the factor T2/T1, and probabilities
larger than 1/T2 will be multiplied by a factor which decreases for larger p. In fact, we only want very small
probabilities to change; the function should hardly change probabilities p &gt; 10^-3. A function which meets
these properties is the polynomial f(x) = x − a·x^2, where x = log(p) and the constant a is fixed by the
requirement that log(1/T2) is mapped to log(1/T1). Because we have re-estimated the probabilities, one would
expect that the probabilities have to be re-normalised (p′(t_i) = p(t_i) / Σ_{t ∈ V2} p(t)). However, this has the
result that all global probabilities (also those of relatively frequent words) are increased, which will increase
the score of all documents, i.e. it will have the opposite effect of what we want. So we decided not to
re-normalise, because a smaller corpus would also have a smaller vocabulary, which would compensate for the
increase in probability mass which is a result of the transformation.</p>
      <p>
        We indexed the collections in the four languages separately. All documents were lemmatised using the Xelda
morphological toolkit from Xerox XRCE and stopped with language-specific stoplists. For German, we
split compounds and added both the full compound and its parts to the index. This strategy is motivated
by our experience with a Dutch corpus (Dutch is also a compounding language) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and tests on the TREC
CLIR test collection. Table 2 shows the results of the monolingual runs; runs in bold are judged runs, runs
in italic font are unofficial runs (mostly post-hoc). The table also lists the proportion of documents which
has been judged. The standard runs include fuzzy lookup of unknown words. The expand option adds
close orthographical variants for every query term. The official runs were done without the document length
normalisation defined by equation 3.
      </p>
      <p>[Table 2: monolingual run results for the runs tnoutdd1, tnoutdd2, tnoutdd2l, tnoutff1, tnoutff2, tnoutff2l, tnoutii1, tnoutii2, tnoutii2l, tnoutee01i, tnoutee01 and tnoutee01l.]</p>
      <p>The first thing that strikes us is that the pool depth is 50, contrary to what has been the practice in TREC, in
which the top 100 documents are judged for relevance. Section 5.4 analyses the CLEF collection further.
Length normalisation usually gives a modest improvement in average precision. The ‘expand’ option was
especially effective for German. The reason is probably that compound parts are not always properly
lemmatised by the German morphology. The German run performs especially well, with 28 out of 37
topics above average. This relatively good performance is probably due to the morphology, which includes
compound splitting.</p>
      <p>
        The pseudo relevance feedback runs were done with the experimental language models retrieval engine
at the University of Twente, using an index based on the Porter stemming algorithm. The run tagged with
tnoutne3-stem is the baseline run for this system. The official pseudo relevance feedback run used the
top 10 documents retrieved to re-estimate relevance weights and translation probabilities, but turned out to
contain a bug. The unofficial fixed run tnoutne4-fix performs a little bit worse than the baseline. The run
tnoutne4-retro uses the relevant documents to re-estimate the probabilities retrospectively (see e.g. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]).
This run reaches an impressive performance of 0.4695 average precision, much higher even than the best
monolingual English run. This indicates that the algorithm might be helpful in an interactive setting where
the user’s feedback is used to retrieve a new, improved set of documents. Apparently, the top 10 retrieved
documents contain too much noise to be useful for the re-estimation of the model’s parameters.
      </p>
      <sec id="sec-3-1">
        <title>Multilingual runs</title>
        <p>This section reports on some of the statistics of the CLEF collection and compares it to the TREC
crosslanguage collection. Table 5 lists the size, number of judged documents, number of relevant documents
and the judged fraction, which is the part of the collection that is judged per topic.</p>
        <p>[Table 5: judgement statistics for the English, French, German and Italian collections, and in total.]</p>
        <p>This year’s evaluation has confirmed that cross-language retrieval based on structured queries, no
matter what the translation resources are, is a powerful technique. Re-estimating model parameters based
on pseudo-relevant documents does not result in an improvement of retrieval performance. However, the
relevance weighting algorithm shows an impressive performance gain if the relevant documents are used
retrospectively. This indicates that the algorithm might in fact be a valuable tool for processing user
feedback in an interactive setting. Finally, merging based on the collection size re-estimation technique proved
unsuccessful. Further analysis is needed to determine why the technique did not work on this collection, as it
was quite successful on the TREC-8 collection.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>We would like to thank the DRUID project for sponsoring the translation of the topic set into Dutch. We
thank Xerox XRCE for making the Xelda morphological toolkit available to us. Furthermore we would
like to thank Jiang Chen (RALI, Université de Montréal) and Jian-Yun Nie (RALI) for help with the PTMiner
web mining tools, and Michel Simard (RALI) for helping with the construction of aligned corpora and
building translation models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.M.G.</given-names>
            <surname>de Jong</surname>
          </string-name>
          .
          <article-title>Disambiguation strategies for cross-language information retrieval</article-title>
          .
          <source>In Proceedings of the third European Conference on Research and Advanced Technology for Digital Libraries</source>
          , pages
          <fpage>274</fpage>
          -
          <lpage>293</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          .
          <article-title>Twenty-One at TREC-7: Ad-hoc and cross-language track</article-title>
          .
          <source>In Proceedings of the seventh Text Retrieval Conference TREC-7, NIST Special Publication 500-242</source>
          , pages
          <fpage>227</fpage>
          -
          <lpage>238</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          .
          <article-title>A probabilistic justification for using tf.idf term weighting in information retrieval</article-title>
          .
          <source>International Journal on Digital Libraries</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ):
          <fpage>131</fpage>
          -
          <lpage>139</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pohlmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          .
          <article-title>Twenty-one at TREC-8: using language technology for information retrieval</article-title>
          .
          <source>In Proceedings of the eighth Text Retrieval Conference TREC-8</source>
          , NIST Special Publications,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Pohlmann</surname>
          </string-name>
          .
          <article-title>Viewing stemming as recall enhancement</article-title>
          . In H.P. Frei,
          <string-name>
            <given-names>D.</given-names>
            <surname>Harman</surname>
          </string-name>
          , P. Schäuble, and R. Wilkinson, editors,
          <source>Proceedings of the 19th ACM-SIGIR Conference on Research and Development in Information Retrieval (SIGIR96)</source>
          , pages
          <fpage>40</fpage>
          -
          <lpage>48</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          .
          <article-title>Parallel web corpora for CLIR?</article-title>
          <source>In Proceedings of CLEF</source>
          <year>2000</year>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Isabelle</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Durand</surname>
          </string-name>
          .
          <article-title>Cross-language information retrieval based on parallel texts and automatic mining of parallel texts in the web</article-title>
          .
          <source>In ACM-SIGIR'99</source>
          , pages
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Pohlmann</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Kraaij</surname>
          </string-name>
          .
          <article-title>The effect of syntactic phrase indexing on retrieval performance for Dutch texts</article-title>
          . In L. Devroye and C. Chrisment, editors,
          <source>Proceedings of RIAO'97</source>
          , pages
          <fpage>176</fpage>
          -
          <lpage>187</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Sparck Jones</surname>
          </string-name>
          .
          <article-title>Relevance weighting of search terms</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>27</volume>
          :
          <fpage>129</fpage>
          -
          <lpage>146</lpage>
          ,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>