JHU/APL Experiments at CLEF: Translation Resources and Score
Normalization
Paul McNamee and James Mayfield
Johns Hopkins University Applied Physics Lab
11100 Johns Hopkins Road
Laurel, MD 20723-6099 USA
{mcnamee, mayfield}@jhuapl.edu
The Johns Hopkins University Applied Physics Laboratory participated in three of the five
tasks of the CLEF-2001 evaluation: monolingual retrieval, bilingual retrieval, and multilingual
retrieval. In this paper we describe the fundamental methods we used and we present initial
results from three experiments. The first investigation examines whether residual inverse
document frequency can improve the term weighting methods used with a linguistically-
motivated probabilistic model. The second experiment attempts to assess the benefit of various
translation resources for cross-language retrieval. Our last effort is to improve cross-collection
score normalization, a task essential for the multilingual problem.
Introduction
The Hopkins Automated Information Retriever for Combing Unstructured Text (HAIRCUT) is a research
retrieval system developed at the Johns Hopkins University Applied Physics Laboratory (APL). The design
of HAIRCUT was influenced by a desire to compare various methods for lexical analysis and tokenization;
thus the system has no commitment to any particular method. With western European languages we typically
use both unstemmed words and overlapping character n-grams as indexing terms, and previous
experimentation has led us to believe that a combination of both approaches enhances performance [7].
We participated in three tasks at this year’s workshop: monolingual, cross-language, and multilingual
retrieval. All of our official submissions were automated runs and our official cross-language runs relied on
query translation using one of two machine translation systems. In the sections that follow, we first describe
our standard methodology and we then present initial results from three experiments. The first investigation
examines whether residual inverse document frequency can improve the term weighting methods used with a
linguistically-motivated probabilistic model. The second experiment attempts to assess the benefit of various
translation resources for cross-language retrieval. Our last effort is to improve cross-collection score
normalization, a task essential for the multilingual problem.
Methodology
For the monolingual tasks we used twelve indices, a word and an n-gram (n=6) index for each of the six
languages. For the bilingual and multilingual tasks we used the same indices with translated topic statements.
Information about each index is provided in Table 1.
Language    # docs     collection size   index      # terms      index size
                       (MB, gzipped)                             (MB)
Dutch       190,604    203               words        692,745     162
                                         6-grams    4,154,405    1144
English     110,282    163               words        235,710      99
                                         6-grams    3,118,973     901
French       87,191     93               words        479,682      84
                                         6-grams    2,966,390     554
German      225,371    207               words      1,670,316     254
                                         6-grams    5,028,002    1387
Italian     108,578    108               words      1,323,283     146
                                         6-grams    3,333,537     694
Spanish     215,737    185               words        382,664     150
                                         6-grams    3,339,343    1101
Table 1. Index statistics for the CLEF-2001 test collection
Index Construction
Documents were processed using only the permitted tags specified in the workshop guidelines. First, SGML
macros were expanded to their appropriate Unicode characters. Then punctuation was eliminated, letters were
downcased, and only the first four of a sequence of digits were preserved (e.g., 010394 became 0103##).
Diacritical marks were preserved. The result is a stream of words separated by spaces. Exceedingly long
words were truncated; the limit was 35 characters in the Dutch and German languages and 20 otherwise.
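The steps above amount to a small normalization pipeline. The following Python sketch illustrates one way it might look; the function name, the digit-masking regular expression, and the length-limit argument are illustrative assumptions rather than HAIRCUT's actual code.

```python
import re

def normalize(text, max_word_len=20):
    """Reduce raw document text to a stream of space-separated words:
    downcase, drop punctuation (keeping diacritics), mask all but the first
    four digits of a long digit run, and truncate overly long words."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = ''.join(c if c.isalnum() or c.isspace() else ' ' for c in text)
    # e.g. 010394 becomes 0103##
    text = re.sub(r'\d{5,}', lambda m: m.group(0)[:4] + '#' * (len(m.group(0)) - 4), text)
    # 35-character limit for Dutch and German, 20 otherwise.
    return [w[:max_word_len] for w in text.split()]

print(normalize('Pestizide in Babynahrung, 010394.'))   # ['pestizide', 'in', 'babynahrung', '0103##']
```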
When using n-grams we extract indexing terms from the same stream of words; thus, the n-grams may span
word boundaries, but sentence boundaries are noted so that n-grams spanning sentence boundaries are not
recorded. N-grams with leading, central, or trailing spaces are formed at word boundaries. For example,
given the phrase, “the prime minister,” the following 6-grams are produced.
Term      Document    Collection    IDF     RIDF
          Frequency   Frequency
-the-p    72,489      241,648       0.605   0.434
the-pr    41,729       86,923       1.402   0.527
he-pri     8,701       11,812       3.663   0.364
e-prim     2,827        3,441       5.286   0.261
-prime     3,685        5,635       4.903   0.576
prime-     3,515        5,452       4.971   0.597
rime-m     1,835        2,992       5.910   0.689
ime-mi     1,731        2,871       5.993   0.711
me-min     1,764        2,919       5.966   0.707
e-mini     3,797        5,975       4.860   0.615
-minis     4,243        8,863       4.699   1.005
minist    15,428       33,731       2.838   0.914
iniste     4,525        8,299       4.607   0.821
nister     4,686        8,577       4.557   0.816
ister-     7,727       12,860       3.835   0.651
Table 2. Example 6-grams produced for the input “the prime minister.” Term statistics are based on the LA
Times subset of the CLEF-2001 collection. Dashes indicate whitespace characters.
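A minimal sketch of the n-gram generation described above, reproducing the fifteen 6-grams of Table 2 (with real spaces where the table prints dashes); the function name and the per-sentence calling convention are assumptions on our part.

```python
def char_ngrams(words, n=6):
    """Overlapping character n-grams over a space-joined word stream.
    A single leading and trailing space lets n-grams mark word boundaries;
    the caller is assumed to pass one sentence at a time so that n-grams
    never span sentence boundaries."""
    padded = ' ' + ' '.join(words) + ' '
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Yields 15 terms: ' the p', 'the pr', 'he pri', ..., 'nister', 'ister '
print(char_ngrams(['the', 'prime', 'minister']))
```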
The use of overlapping character n-grams provides a surrogate form of morphological normalization. For
example, in Table 2 above, the n-gram “minist” could have been generated from several different forms like
administer, administrative, minister, ministers, ministerial, or ministry. It could also come from an unrelated
word like feminist. Another advantage of n-gram indexing comes from the fact that n-grams containing
spaces can convey phrasal information. In the table above, 6-grams such as “rime-m”, “ime-mi”, and “me-
min” may act much like the phrase “prime minister” in a word-based index using multiple word phrases.
At last year’s workshop we explored language-neutral retrieval and avoided the use of stopword lists,
lexicons, decompounders, stemmers, lists of phrases, or manually-built thesauri [6]. Such resources are
seldom in a standard format, may be of varying quality, and worst of all, necessitate additional software
development to utilize. Although we are open to the possibility that such linguistic resources may improve
retrieval performance, we are interested in how far we can push performance without them. We followed the
same approach this year.
We conducted our work on four Sun Microsystems workstations that are shared with about 30 other
researchers. Each machine has at least 1GB of physical memory and we have access to dedicated disk space
of about 200GB. The use of character n-grams increases the size of both dictionaries and inverted files,
typically by a factor of five or six, over those of comparable word-based indices. Furthermore, when we use
pseudo-relevance feedback we use a large number of expansion n-grams. As a consequence, runtime
performance became an issue that we needed to address. Over the last year we made a number of
improvements to HAIRCUT to reduce the impact of large data structures, and to allow the system to run in
less memory-rich environments.
To minimize the memory consumption needed for a dictionary in a large term-space, we developed a multi-
tiered cache backed by a B-tree. If sufficient memory is available, term/term-id pairs are stored in a hash
table; if the hash table grows too large, entries are removed from the table, but still stored in memory as
compressed B-tree nodes; if the system then runs out of memory, data are written to disk.
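The tiered dictionary can be pictured roughly as follows. This Python sketch only illustrates the idea (a hot hash table, a compressed in-memory tier, and a disk spill); the tier sizes, the zlib/pickle encoding, and the dbm backing store are assumptions, not HAIRCUT's actual B-tree implementation.

```python
import dbm
import pickle
import zlib

class TieredTermDictionary:
    """Term -> term-id map with three tiers: a plain hash table, compressed
    in-memory entries, and a disk-backed store for overflow."""

    def __init__(self, path, hot_limit=100_000, warm_limit=1_000_000):
        self.hot = {}                     # tier 1: uncompressed hash table
        self.warm = {}                    # tier 2: zlib-compressed entries
        self.cold = dbm.open(path, 'c')   # tier 3: on-disk store
        self.hot_limit, self.warm_limit = hot_limit, warm_limit

    def put(self, term, term_id):
        self.hot[term] = term_id
        if len(self.hot) > self.hot_limit:
            # Demote half of the hot entries to the compressed tier.
            for t in list(self.hot)[: self.hot_limit // 2]:
                self.warm[t] = zlib.compress(pickle.dumps(self.hot.pop(t)))
        if len(self.warm) > self.warm_limit:
            # Spill compressed entries to disk under continued memory pressure.
            for t in list(self.warm)[: self.warm_limit // 2]:
                self.cold[t] = self.warm.pop(t)

    def get(self, term):
        if term in self.hot:
            return self.hot[term]
        blob = self.warm.get(term)
        if blob is None:
            blob = self.cold.get(term)
        return None if blob is None else pickle.loads(zlib.decompress(blob))
```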
To reduce the size of our inverted files we applied gamma compression [9] and saw our disk usage shrink to
about 1/3 of its former size. HAIRCUT also generates dual files, an analogous structure to inverted files that
are document-referenced vectors of terms; the dual files also compressed rather nicely.
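For illustration, here is a toy Elias gamma coder over d-gaps, following the scheme described in [9]; the real system packs actual bits, whereas this sketch builds a character string of '0's and '1's for readability.

```python
def gamma_code(n):
    """Elias gamma code: (b - 1) zero bits followed by the b-bit binary form
    of n, where b is the number of bits in n.  Small gaps get short codes."""
    assert n >= 1
    binary = bin(n)[2:]
    return '0' * (len(binary) - 1) + binary

def encode_postings(doc_ids):
    """Gamma-code a sorted postings list as d-gaps (differences of document ids)."""
    out, prev = [], 0
    for doc_id in doc_ids:
        out.append(gamma_code(doc_id - prev))
        prev = doc_id
    return ''.join(out)

# Gaps 3, 4, 1, 12 encode as '011', '00100', '1', '0001100'
print(encode_postings([3, 7, 8, 20]))   # 0110010010001100
```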
Query Processing
HAIRCUT performs rudimentary preprocessing on topic statements to remove stop structure, e.g., phrases
such as “… would be relevant” or “relevant documents should…”. We have constructed a list of about 1000
such English phrases from previous topic sets (mainly TREC topics) and these have been translated into
other languages using commercial machine translation. Other than this preprocessing, queries are parsed in
the same fashion as documents in the collection.
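The stop-structure removal is a simple phrase-deletion step; a hedged sketch follows, where the phrase list shown is a tiny, hypothetical sample of the roughly 1000 English phrases mentioned above.

```python
# A few hypothetical entries; the actual list holds roughly 1,000 phrases
# harvested from earlier topic sets (mainly TREC) and their translations.
STOP_STRUCTURE = [
    'relevant documents should',
    'would be relevant',
    'find documents that',
]

def strip_stop_structure(topic_text):
    """Delete boilerplate query phrasing before the topic is parsed like a document."""
    lowered = topic_text.lower()
    for phrase in STOP_STRUCTURE:
        lowered = lowered.replace(phrase, ' ')
    return ' '.join(lowered.split())
```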
In all of our experiments we used a linguistically motivated probabilistic model for retrieval. Our official
runs all used blind relevance feedback, though it did not improve retrieval performance in every instance. To
perform relevance feedback we first retrieved the top 1000 documents. We then used the top 20 documents
for positive feedback and the bottom 75 documents for negative feedback; however, we removed any
duplicate or near-duplicate documents from these sets. We then selected terms for the expanded query based on
three factors: a term’s initial query term frequency (if any); the cube root of its Rocchio score
(α=3, β=2, γ=2); and a term similarity metric that incorporates IDF weighting. The 60 top-ranked terms were then used
as the revised query when words were the indexing terms; 400 terms were used with 6-grams. In previous work we
penalized documents containing only a fraction of the query terms; we are no longer convinced that this
technique adds much benefit and have discontinued its use. As a general trend we observe a decrease in
precision at very low recall levels when blind relevance feedback is used, but both overall recall and mean
average precision are improved.
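The term-selection step can be sketched as follows. The exact way the three factors are combined is not spelled out above, so the combination below (and the use of a plain IDF weight in place of the term similarity metric) is an illustrative assumption, not HAIRCUT's precise computation.

```python
from collections import Counter

def expand_query(query_tf, ranked_docs, doc_terms, idf,
                 num_terms=60, n_pos=20, n_neg=75, a=3.0, b=2.0, g=2.0):
    """Blind-feedback expansion: the top n_pos of the 1000 retrieved documents
    are positive evidence, the bottom n_neg negative.  Terms are scored from
    the initial query term frequency, the cube root of a Rocchio score
    (alpha=3, beta=2, gamma=2), and an IDF weight; the top num_terms survive
    (60 for words, 400 for 6-grams)."""
    positive, negative = ranked_docs[:n_pos], ranked_docs[-n_neg:]
    pos_tf, neg_tf = Counter(), Counter()
    for d in positive:
        pos_tf.update(doc_terms[d])
    for d in negative:
        neg_tf.update(doc_terms[d])

    scores = {}
    for term in set(pos_tf) | set(query_tf):
        rocchio = (a * query_tf.get(term, 0)
                   + b * pos_tf[term] / len(positive)
                   - g * neg_tf[term] / len(negative))
        if rocchio > 0:
            scores[term] = (rocchio ** (1.0 / 3.0)) * idf.get(term, 0.0)
    return sorted(scores, key=scores.get, reverse=True)[:num_terms]
```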
Monolingual Experiments
Once again our approach to monolingual retrieval focused on language-independent methods. We submitted
two official runs for each target language: one using the mandated title and description fields (TD runs) and
one that also added the narrative field (TDN runs), for a total of 10 submissions. These official runs were
automated runs formed by combining results from two base runs, one using words and one using n-grams.
In all our experiments we used a linguistically motivated probabilistic model. This model has been described
in a report by Hiemstra and de Vries [5], which compares the method to traditional models. This is
essentially the same approach that was used by BBN in TREC-7 [8] which was billed as a Hidden Markov
Model. The similarity calculation that is performed is:
    Sim(q, d) = \prod_{t \in terms} ( \alpha \cdot f(t, d) + (1 - \alpha) \cdot mrdf(t) )^{f(t, q)}

Equation 1. Similarity calculation.
where f(t,d) is the relative frequency of term t in document d (or query q) and mrdf(t) denotes the mean
relative document frequency of t. The parameter α is a tunable parameter that can be used to ascribe a degree
of importance to a term. For our baseline system we simply fix the value of α at 0.3 when words are used as
indexing terms. Since individual n-grams tend to have a lower semantic value than words, a lower α is
indicated; we use a value of 0.15 for 6-grams. In training experiments using the TREC-8 test collection we
found performance remained acceptable across a wide range of values. When blind relevance feedback is
applied we do not adjust this importance value, and instead just expand the initial query.
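In log space, Equation 1 reduces to a sum over query terms. Below is a minimal sketch, assuming the relative frequencies f(t,d) and mrdf(t) have already been computed; it is an illustration of the model, not the actual HAIRCUT scoring code.

```python
import math

def similarity(query_tf, doc_rel_tf, mean_rel_df, alpha=0.3):
    """Log-space form of Equation 1: each query term t contributes
    f(t,q) * log(alpha * f(t,d) + (1 - alpha) * mrdf(t)).
    alpha = 0.3 for words, 0.15 for 6-grams; scores come out negative."""
    score = 0.0
    for term, qf in query_tf.items():
        smoothed = alpha * doc_rel_tf.get(term, 0.0) + (1 - alpha) * mean_rel_df.get(term, 0.0)
        if smoothed > 0.0:   # terms unseen in the whole collection contribute nothing
            score += qf * math.log(smoothed)
    return score
```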
Run        topic    average     recall        # topics   # ≥ median   # ≥ best   # = worst
           fields   precision
aplmodea   TDN      0.4596      2086 / 2130   49         36           11          0
aplmodeb   TD       0.4116      2060 / 2130   49         40            6          0
aplmoena   TDN      0.4896       838 /  856   47         unofficial English run
aplmoenb   TD       0.4471       840 /  856   47         unofficial English run
aplmoesa   TDN      0.5518      2618 / 2694   49         36           16          1
aplmoesb   TD       0.5176      2597 / 2694   49         31            6          0
aplmofra   TDN      0.4210      1202 / 1212   49         24           11          0
aplmofrb   TD       0.3919      1195 / 1212   49         19            4          2
aplmoita   TDN      0.4346      1213 / 1246   47         32            8          1
aplmoitb   TD       0.4049      1210 / 1246   47         26            6          1
aplmonla   TDN      0.4002      1167 / 1224   50         40           12          0
aplmonlb   TD       0.3497      1149 / 1224   50         37            3          0
Table 3. Official results for the monolingual task. The aplmoen* rows contain results for comparable, unofficial
English runs.
[Figure 1 appears here: Monolingual Performance, a bar chart of mean average precision by language (DE, EN, ES, FR, IT, NL) for TD and TDN queries using words, 6-grams, and their combination.]
Figure 1. Comparison of retrieval performance across target languages. For each language, results using both
TD and TDN queries are shown when words, 6-grams, or a combination of the two is used. Unsurprisingly,
longer queries were more effective. 6-gram runs most often had better performance than words, but this was
not the case in French or Italian. Combination of the two methods yielded a slight improvement.
We were interested in performing an experiment to see if baseline performance could be improved by
adjusting the importance parameter α for each query term. Residual inverse document frequency (RIDF) [2]
is a statistic that represents the burstiness of a term in the documents in which it occurs (see Equation 2
is a statistic that represents the burstiness of a term in the documents in which it occurs (see Equation 2
below). Terms with high RIDF tend to be distinctive, so when they are present, they occur more frequently
within a document than might otherwise be expected; terms with low RIDF tend to occur indiscriminately.
Numerals and adverbs, and to some extent adjectives, all tend to have low RIDF. For example, the English
words briefly and computer both occur in just over 5000 LA Times articles, yet computer appears 2.18 times
per occurrence, on average, while briefly almost always appears just once (1.01 times on average). By taking
this into account, we hope to minimize the influence that a word like briefly has on document scores (aside:
Yamamoto and Church have recently published an efficient method for computing RIDF for all substrings in
a collection [10]).
    RIDF(t) = IDF(t) - \log \frac{1}{1 - e^{-cf(t)}}
Equation 2. Computing residual inverse document frequency for a term. The log term in the
equation represents the expected IDF if the term had a Poisson distribution.
Our approach was as follows. For each query term we adjust the importance value α depending on its RIDF:
we linearly interpolate the term's RIDF between the minimum and maximum values in the collection and
multiply by a constant k to determine the adjusted α. For these initial experiments we only considered k=0.2.
    \alpha(t) = \alpha_{baseline} + k \cdot \frac{RIDF(t) - RIDF_{min}}{RIDF_{max} - RIDF_{min}}

Equation 3. Computing a term-specific value for α.
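Equations 2 and 3 translate directly into code. In this sketch, cf(t) is taken to be the term's mean frequency per document; that reading, and the function names, are our assumptions.

```python
import math

def ridf(idf, mean_cf):
    """Equation 2: residual IDF is the observed IDF minus the IDF a Poisson
    model with mean per-document frequency mean_cf would predict."""
    return idf - math.log(1.0 / (1.0 - math.exp(-mean_cf)))

def term_alpha(term_ridf, ridf_min, ridf_max, alpha_baseline=0.15, k=0.2):
    """Equation 3: interpolate RIDF into [0, 1] over the collection's range and
    shift alpha upward by k times that amount, so burstier terms weigh more."""
    return alpha_baseline + k * (term_ridf - ridf_min) / (ridf_max - ridf_min)
```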
We are still analyzing these results, however the preliminary indications are promising. Figure 2 shows the
change in average precision when applying this rudimentary method.
[Figure 2 appears here: Effect of Term-Specific Adjustment Using RIDF, a bar chart of the absolute difference in mean average precision by language.]
Figure 2. Impact on mean average precision when term-specific adjustments are made. 6-gram indexing is
shown for six query types (different topic fields and use of pseudo-relevance feedback) in each language.
We observe a small positive effect, particularly with intermediate-length queries. One possible explanation
for why the improvement does not occur with very short queries (e.g., title-only) is that such queries are
unlikely to contain low-RIDF terms (being short and to the point), so the adjustment in importance value is
unwarranted. As yet, we have no explanation for why long queries (TDN or those with expanded queries) do
not seem to gain much with this method. As time permits an analysis of individual topics may reveal what is
happening.
Bilingual Experiments
Our goal for the bilingual task was to assess retrieval performance using four approaches to query translation:
commercial machine translation software; publicly available bilingual wordlists; parallel corpora
mined from the Web; and untranslated queries. The last is only likely to succeed when languages share word
roots. We wanted to attempt as many of the topic languages as possible, and managed to use all but Thai.
In the past we observed good performance when commercial machine translation is used, and so all of our
official runs used MT. Since only four official runs were permitted, we had a hard time choosing which
topic languages to use. We attempted the Dutch bilingual task as well as the English task and ended up
submitting runs using French, German, and Japanese topics against English documents, and using English
topics for the Dutch documents.
Run         topic    average     % mono   recall        # topics   # ≥ median   # ≥ best   # = worst
            fields   precision
aplbifren   TD       0.3519      78.7%     778 /  856   47         36            6          0
aplbideen   TD       0.4195      93.8%     835 /  856   47         31            4          2
aplbijpen   TD       0.3285      73.5%     782 /  856   47         30            3          1
aplmoenb    TD       0.4471      --        840 /  856   47         monolingual baseline
aplbiennl   TD       0.2707      77.4%     963 / 1224   50         38           14         13
aplmonlb    TD       0.3497      --       1149 / 1224   50         monolingual baseline
Table 4. Official results for the bilingual task.
At the time of this writing we are still working on our dictionary and corpus-based methods, and will present
results from these experiments in a revised version of this manuscript. We now discuss some experiments on
the English bilingual collection using MT-translated and untranslated queries. Systran supports translation
from Chinese, French, German, Italian, Japanese, Russian, and Spanish to (American) English; to translate
Dutch, Finnish, and Swedish topics we used the on-line translator at http://www.tranexp.com/. High quality
machine translation can result in excellent cross-language retrieval; our official bilingual runs achieve 81%
of the performance (on average) of a comparable monolingual baseline.
Although we generally use relevance feedback and are accustomed to seeing a roughly 25% boost in
performance from its use, we observed that it was not always beneficial. This was especially the case with
longer queries (TDN vs. Title-only) and when the translation quality was very high for the language pair in
question. In Figure 3 (below), we compare retrieval performance using words as indexing terms when
relevance feedback is applied. When 6-grams were used the results were similar.
[Figure 3 appears here: Relevance Feedback Not Always Helpful, a bar chart of mean average precision by source language (EN*, DE, ES, FI, FR, IT, JP, NL, RU, SV, ZH) for T, TD, and TDN topics, with and without relevance feedback.]
Figure 3. Bilingual performance using words as indexing terms, examining the effect of relevance feedback.
Untranslated English topics are shown at the left.
Translations of Topic 41 into English

German:    pestizide in baby food
           reports on pestizide in baby food are looked for.
English:   Pesticides in Baby Food
           Find reports on pesticides in baby food.
Spanish:   Pesticidas in foods for you drink
           Encontrar the news on pesticidas in foods stops you drink.
Finnish:   Suppression-compositions lasten valmisruuassa
           Etsi raportteja suppression-aineista lasten valmisruuassa.
French:    Of the pesticides in food for babies
           To seek documents on the pesticides in food for babies.
Italian:   Pesticidi in the alimony for children
           Trova documents that they speak about the pesticidi in the alimony for children.
Japanese:  Damage by disease and pest pest control medicine in baby hood
           The article regarding the damage by disease and pest pest control medicine in the baby hood was searched to be.
Dutch:     Pesticide within babyvoeding
           Missing unpleasant documents via pesticide within babyvoeding.
Russian:   pesticides in the children's nourishment of
           to find articles about the pesticides in the children's nourishment of
Swedish:   Bekdmpningsmedel a baby
           Svk report a bekdmpningsmedel a baby.
Chinese:   In baby food includes report which in pesticide
           inquiry concerned baby food includes pesticide.
The Finnish translations are poor in quality, which explains the rather low relative performance when those
topics were used. However, looking over the translated topics we observe that many untranslated terms are
near cognates to the proper English word. For example, pestizide (German), pesticidas (Spanish), and
pesticidi (Italian) are easily recognizable. Similarly, ‘baby hood’ is phonetically similar to ‘baby food’, an
easy to understand mistake when Japanese phonetic characters are used to transliterate a term.
In TREC-6, Buckley et al. explored cross-language English to French retrieval using cognate matches [1].
They took an ‘English is misspelled French’ approach and attempted to ‘correct’ English terms into their
proper French equivalents, projecting that 30% or so of non-stopwords could be transformed automatically.
Their results were unexpectedly good: they reported bilingual performance at 60% of their monolingual
baseline. Although this approach is non-intuitive, it can serve as a worst-case strategy when few or no
translation resources are available, so long as the source and target languages are compatible, and it
establishes a lower bound on CLIR performance against which the added benefit of additional translation
resources can be assessed.
While Buckley et al. manually developed rules to spell-correct English into French, this work may be
entirely unnecessary when n-gram indexing is used, since n-grams provide a form of morphological
normalization. Thus we consider a more radical hypothesis than ‘English is misspelled French’, namely,
‘other languages are English.’ We now examine more closely the relative performance observed when words
and 6-grams are used without spelling correction.
Figure 4 is a plot that compares the efficacy of machine-translated queries to untranslated queries for the
English bilingual task. Since we have argued that relevance feedback does not have a large effect, we will
only compare runs that do not use it. The data in the leftmost column is a monolingual English baseline, the
unstarred columns in the central region are runs using machine translation for various source languages, and
the rightmost area contains runs that used untranslated source language queries against the English
collection. For each combination of translation method and source language six runs are shown using title-
only, TD, or TDN topic statements and either words or 6-grams.
Several observations can be made from this plot. First, we observe that longer topic statements tend to do
better than shorter ones; roughly speaking, TDN runs are about 0.05 higher than corresponding TD runs, and
TD runs are about the same amount better than title-only runs. Secondly we note that 6-grams tend to
outperform words; the mean relative difference among comparable MT runs is 5.95%. Looking at the various
source languages we note that as a group, the Systran translated runs (DE, ES, FR, IT, JP, RU, and ZH)
outperform the InterTran-translated queries (FI, NL, and SV); this may reveal an underlying difference in
product quality; however, a better comparison would be to use languages they translate in common.
Translation quality is rather poor for the Finnish and Swedish topics (InterTran) and also with the Chinese
topics (Systran). Averaging across all source languages, the translated runs have performance between 41-
63% of the top monolingual English run when words are used, and 41-70% when 6-grams are used.
The untranslated queries plotted on the right clearly do worse than their translated equivalents. Averaging
across the seven languages encoded in ISO-8859-1, word runs achieve performance between 9-15% of the
top monolingual English run, but 6-gram runs do much better and get performance between 22-34%
depending on the topic fields used. The mean relative advantage when n-grams are used on these topics is
183%, almost a doubling in efficacy over words. The 6-grams achieve 54% of the performance of the
machine-translated runs. Though not shown in the plot, relevance feedback actually does enhance these
untranslated 6-gram runs even though we have shown that relevance feedback did not significantly affect
translated topics. One final observation is that shorter queries are actually better when words are used; we
suspect that this is because longer topics may contain more matching words, but not necessarily the key
words for the topic.
One concern we have with this analysis is that we are comparing an aggregate measure, mean average
precision. For untranslated topics, we imagine that the variance in performance is greater over many topics
since some topics will have almost no cognate matches. We hope to examine individual topic behavior in the
future.
We looked for this effect in other measures besides average precision. Recall at 1000 documents was
effectively doubled when 6-grams were used instead of words; roughly 70% of the monolingual recall was
observed. Averaged across languages, precision at 5 documents was 0.1921 when 6-grams were used with
TDN topics and blind relevance feedback. Thus even this rudimentary approach can be expected to find one
relevant document, on average, in the top five documents.
[Figure 4 appears here: Mean Average Precision by Language, Tokenization, and Query Type, a bar chart of word and 6-gram runs with T, TD, and TDN topics for each topic language (EN*, DE, ES, FI, FR, IT, JP, NL, RU, SV, ZH, and untranslated DE*, ES*, FI*, FR*, IT*, NL*, SV*).]
Figure 4. Comparing word and n-gram indexing on machine-translated, and untranslated topics. Untranslated
topics are indicated with a star.
Multilingual Experiments
When combining several runs, one must either use document rank as a measure of the importance of a
document, or try to make some sense out of system-generated scores. Using rank is problematic when the
two runs cover different documents. For example, if a different index is built for each language in a
multilingual retrieval task, there is no way to distinguish a language that has many relevant documents from
one that has few or no relevant documents using rank alone. On the other hand, raw scores are not typically
comparable. For example, the scores produced by our statistical language model are products, with one factor
per query term. Even if the individual factors were somehow comparable (which they are not), there is no
guarantee that a query will have the same number of terms when translated into two or more different
languages. Other similarity metrics suffer from similar difficulties. Thus, score normalization is crucial if
scores are to be used for run combination.
We tried a new score normalization technique this year. We viewed scores as masses, and normalized by
dividing each individual score by the sum of the masses of the top 1000 documents. (Because our
probabilistic calculations are typically performed in log space, and scores are therefore negative, we achieved
the desired effect by using the reciprocal of a document's score as its mass.) Our previous method of score
normalization was to interpolate scores for a topic within a run onto [0,1]. We were concerned that this
would cause documents in languages with few or no relevant documents for a topic to appear comparable to
top-ranked documents in a language with many relevant documents. While there was no appreciable
difference between the two methods in this year's multilingual task (at least in average precision) we did see
an eight percent improvement in precision at five documents using the new normalization (compare
aplmuena with aplmuend).
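A sketch of the mass-based normalization follows, assuming each run is a list of (document id, negative log-space score) pairs sorted best first; using 1/|score| as the mass is our interpretation of "the reciprocal of a document's score", and the merge helper is an illustrative addition.

```python
def normalize_by_mass(run, top_k=1000):
    """Convert one run's raw scores into masses that sum to 1 over the top_k
    documents, so runs over different collections can be merged by score."""
    top = run[:top_k]
    masses = [(doc_id, 1.0 / abs(score)) for doc_id, score in top]
    total = sum(m for _, m in masses)
    return [(doc_id, m / total) for doc_id, m in masses]

def merge_runs(runs, top_k=1000):
    """Merge several normalized per-language runs into one multilingual ranking."""
    pooled = [pair for run in runs for pair in normalize_by_mass(run, top_k)]
    return sorted(pooled, key=lambda pair: pair[1], reverse=True)[:top_k]
```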
We are still investigating rank-based combination as well, though we submitted no official runs using this
technique. Our preliminary findings show little difference compared to score-based combination.
We were intrigued by a method that the U.C. Berkeley team used for multilingual merging in TREC-7 [4]
and in last year’s CLEF workshop [3], where documents from all languages were indexed as a common
collection. Queries were translated into all target languages and the resulting collective query was run against
the collection. Berkeley’s results using this approach in last year’s multilingual task (run BKMUEAA1)
were comparable to runs that used a merging strategy. We were inspired to try this method ourselves and
built two unified indices, one using words and one using 5-grams. Using unstemmed words as indexing
terms, our performance with this method was poor (run aplmuenc); however, we did see a significant
improvement using 5-grams instead (see Table 4). Still, our attempts using a unified term space have not
resulted in better scores than approaches combining separate retrievals in each target language. We will
continue to examine this method because of its desirable property of not requiring cross-collection score
normalization.
Run        topic    index             normalization          average     recall      Prec. @ 5   # ≥ median   # ≥ best   # = worst
           fields   type(s)           method                 precision   (of 8138)
aplmuena   TD       words + 6-grams   mass contribution      0.2979      5739        0.5600      25           2          0
aplmuenb   TDN      words             mass contribution      0.3033      5707        0.5800      31           3          0
aplmuenc   TD       unified words     NA                     0.1688      2395        0.5600       9           1          9
aplmuend   TD       words + 6-grams   linear interpolation   0.3025      5897        0.5240      32           1          0
aplmuene   TD       unified 5-grams   NA                     0.2593      4079        0.5960      unofficial run
Table 5. Multilingual results.
Conclusions
The second Cross-Language Evaluation Forum workshop has offered a unique opportunity to investigate
multilingual retrieval issues for European languages. We participated in three of the five tasks and were able
to conduct several interesting experiments. Our first investigation into the use of term-specific adjustments
using a statistical language model showed that a small improvement can be obtained when residual inverse
document frequency is utilized. However, this conclusion is preliminary and we do not feel that we
completely understand the mechanism involved.
Our second experiment is only partially completed; we compared bilingual retrieval performance when two
query translation methods are used. The first method, using extant commercial machine translation, gives very
good results that approach a monolingual baseline. We also showed that reasonable performance can be
obtained when no attempt whatsoever is made at query translation, and we have demonstrated that
overlapping character n-grams have a strong advantage over word-based retrieval in this scenario. The
method is of course only practicable when related languages are involved. We think this result is significant
for several reasons. First, it quantifies a lower bound for bilingual performance that other approaches may be
measured against. Secondly, it implies that translation to a related language, when translation to the target
language of interest is infeasible, may form the basis of a rudimentary retrieval system. We hope to augment
this work by also comparing the use of parallel corpora and publicly available bilingual dictionaries in the
near future.
Multilingual retrieval, where a single source language query is used to search for documents in multiple
target languages, remains a critical challenge. Our attempt to improve cross-collection score normalization
was not successful. We will continue to investigate this problem, which will only grow more difficult as a
greater number of target languages is considered.
References
[1] C. Buckley, M. Mitra, J. Walz, and C. Cardie, ‘Using Clustering and Super Concepts within SMART: TREC-6’. In
E. Voorhees and D. Harman (eds.), Proceedings of the Sixth Text REtrieval Conference (TREC-6), NIST Special
Publication 500-240, 1998.
[2] K. W. Church, ‘One Term or Two?’. In the Proceedings of the 18th International Conference on Research and
Development in Information Retrieval (SIGIR-95), pp. 310-318, 1995.
[3] F. Gey, H. Jiang, V. Petras, and A. Chen, ‘Cross-Language Retrieval for the CLEF Collections – Comparing
Multiple Methods of Retrieval’. In Working Notes of the CLEF-2000 Workshop, pp. 29-38, 2000.
[4] F. Gey, H. Jiang, A. Chen, and R. Larson, ‘Manual Queries and Machine Translation in Cross-language Retrieval
and Interactive Retrieval with Cheshire II at TREC-7’. In E. M. Voorhees and D. K. Harman, eds., Proceedings of the
Seventh Text REtrieval Conference (TREC-7), pp. 527-540, 1999.
[5] D. Hiemstra and A. de Vries, ‘Relating the new language models of information retrieval to the traditional retrieval
models.’ CTIT Technical Report TR-CTIT-00-09, May 2000.
[6] P. McNamee, J. Mayfield, and C. Piatko, ‘A Language-Independent Approach to European Text Retrieval’. In Carol
Peters (ed.), Cross-Language Information Retrieval and Evaluation: Proceedings of the CLEF 2000 Workshop, Lecture
Notes in Computer Science 2069, Springer, 2001, pp. 129-139, forthcoming.
[7] J. Mayfield, P. McNamee, and C. Piatko, ‘The JHU/APL HAIRCUT System at TREC-8.’ In E. M. Voorhees and D.
K. Harman, eds., Proceedings of the Eighth Text REtrieval Conference (TREC-8), pp. 445-451, 2000.
[8] D. R. H. Miller, T. Leek, and R. M. Schwartz, ‘A Hidden Markov Model Information Retrieval System.’ In the
Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR-99),
pp. 214-221, August 1999.
[9] I. Witten, A. Moffat, and T. Bell, ‘Managing Gigabytes’, Chapter 3, Morgan Kaufmann, 1999.
[10] M. Yamamoto and K. Church, ‘Using Suffix Arrays to Compute Term Frequency and Document Frequency for all
Substrings in a Corpus’. In Computational Linguistics, vol 27(1), pp. 1-30, 2001.