<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <pub-date>
        <year>2001</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>10</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Successful cross-language information retrieval (CLIR) combines linguistic
techniques (phrase discovery, machine translation, bilingual dictionary lookup) with
robust monolingual information retrieval. For monolingual retrieval the Berkeley
group has used the technique of logistic regression from the beginning of the TREC
series of conferences. In TREC-2 [1] we derived a statistical formula for predicting
probability of relevance based upon statistical clues contained within documents,
queries and collections as a whole. This formula was used for document retrieval in
Chinese [3] and Spanish in TREC-4 through TREC-6. We utilized the identical formula
for English and German queries against the English/French/German/Italian document
collections in the CLEF 2000 evaluation [10]. During the past two years, the formula
has proven well-suited for Japanese and Japanese-English cross-language information
retrieval as well as English-Chinese CLIR [6], even when only trained on English
document collections.</p>
      <p>Participation in the NTCIR Workshops in Tokyo
(http://research.nii.ac.jp/n~tcadm/workshop/work-en.html)
led to different techniques for cross-language retrieval, ones which utilized the
power of human indexing of documents to improve retrieval. Alignments of parallel
texts were used to create large-scale bilingual lexicons between English and Japanese
and between English and Chinese. Such lexicons were well-suited to the technical
nature of the NTCIR collections of scientific and engineering articles.</p>
      <p>3.1 Bilingual Retrieval of the CLEF collections</p>
      <p>Bilingual retrieval is performed by running queries in another language against
the English collection of CLEF. We chose to focus on Russian but to do German for a
baseline comparison. The run BKBIGEM1 was obtained by translating the German queries
to English using the L&amp;H Power Translator and then manually adjusting the resulting
queries by searching for the untranslated terms in our own special association
dictionary created from a library catalog. An example of words not found comes from
Topic 88 about 'mad cow disease'. In the German version, the words Spongiformer and
Enzephalopathie were not translated by the commercial system, but our association
dictionary obtained the words 'hepatic encephalopathy' associated with the query term
Enzephalopathie. Further details about our methodology can be found in [8].</p>
      <p>The run BKBIREA1 was obtained by using the PROMPT web-based translator
(http://www.translate.ru/). As with the German translation, certain words were not
translated. Our methodology to deal with this was twofold: first we transliterated the
Russian queries to their romanized alphabetic equivalent, and then we added
untranslated terms to the English query in their transliterated form. For example, in
Topic 50 on 'uprising of Indians in Chiapas', the Russian word чиапас was not
translated; it can, however, be transliterated as 'chiapas'.</p>
      <p>3.2 Bilingual Performance</p>
      <p>Our bilingual performance can be found in Table 2. The final line of the table,
labeled "CLEF Prec.", is computed as an average of each CLEF median precision among all
submitted runs. The average is performed over the 47 queries for which the English
collection had relevant documents. While an average of medians cannot be considered a
statistic from which rigorous inference can be made, we have found it useful to average
the medians of all queries as sent by CLEF organizers. Comparing our overall precision
to this average of medians yields some fuzzy gauge of whether our performance is
better, poorer, or about the same as the median performance. Using this measure we find
that all our bilingual runs performed significantly better than the median for CLEF
bilingual runs.
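<table-wrap>
<label>Table 2.</label>
<caption><p>Results of three official Berkeley CLEF bilingual runs.</p></caption>
<table>
<thead>
<tr><th>Run ID</th><th>BKBIGEM1</th><th>BKBIREM1</th><th>BKBIREA1</th></tr>
</thead>
<tbody>
<tr><td>Retrieved</td><td>47000</td><td>47000</td><td>47000</td></tr>
<tr><td>Relevant</td><td>856</td><td>856</td><td>856</td></tr>
<tr><td>Rel. Ret</td><td>812</td><td>737</td><td>733</td></tr>
<tr><td>Precision at 0.00</td><td>0.7797</td><td>0.6545</td><td>0.6420</td></tr>
<tr><td>Precision at 0.10</td><td>0.7390</td><td>0.6451</td><td>0.6303</td></tr>
<tr><td>Precision at 0.20</td><td>0.6912</td><td>0.5877</td><td>0.5691</td></tr>
<tr><td>Precision at 0.30</td><td>0.6306</td><td>0.5354</td><td>0.5187</td></tr>
<tr><td>Precision at 0.40</td><td>0.5944</td><td>0.4987</td><td>0.4806</td></tr>
<tr><td>Precision at 0.50</td><td>0.5529</td><td>0.4397</td><td>0.4167</td></tr>
<tr><td>Precision at 0.60</td><td>0.4693</td><td>0.3695</td><td>0.3605</td></tr>
<tr><td>Precision at 0.70</td><td>0.3952</td><td>0.3331</td><td>0.3218</td></tr>
<tr><td>Precision at 0.80</td><td>0.3494</td><td>0.2881</td><td>0.2747</td></tr>
<tr><td>Precision at 0.90</td><td>0.2869</td><td>0.2398</td><td>0.2339</td></tr>
<tr><td>Precision at 1.00</td><td>0.2375</td><td>0.1762</td><td>0.1743</td></tr>
<tr><td>Brk. Prec.</td><td>0.5088</td><td>0.4204</td><td>0.4077</td></tr>
<tr><td>CLEF Prec.</td><td>0.2423</td><td>0.2423</td><td>0.2423</td></tr>
</tbody>
</table>
</table-wrap>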
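</p>
      <p>The "CLEF Prec." baseline can be illustrated with a minimal sketch. It assumes
that, for each query, the precision of every submitted run is available; the function
name, the topic labels and all numbers below are hypothetical and are not CLEF data.

```python
from statistics import mean, median


def average_of_medians(per_query_precisions):
    """Average, over queries, of the median precision across submitted runs."""
    return mean(median(scores) for scores in per_query_precisions.values())


# Hypothetical precision values of all submitted runs, keyed by query.
# Only queries that have relevant documents should be included.
submitted = {
    "topic_A": [0.10, 0.30, 0.50],
    "topic_B": [0.20, 0.40, 0.60, 0.80],
    "topic_C": [0.05, 0.15, 0.25],
}

baseline = average_of_medians(submitted)
our_precision = 0.45  # hypothetical overall precision of a single run

# A run is "above the median" in this loose sense when it beats the baseline.
print(round(baseline, 4), our_precision > baseline)  # prints: 0.3167 True
```

As the text notes, this comparison is only a rough gauge and not a statistic that
supports rigorous inference.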
</p>
      <p>Our non-English multilingual retrieval runs were based upon our bilingual
experiments, extended to French/Italian/Spanish using English as a pivot language and
(again) the L&amp;H Power Translator as the MT system to translate queries from one
language to another. Run BKMURAA1 takes the English translated queries of the bilingual
run BKBIREA1 and again translates them to French/German/Italian/Spanish. Run BKMUGAM1
takes the German queries of bilingual run BKBIGEM1 as well as the translation of their
English equivalents into French/Italian/Spanish. For comparison we did direct
translation from the English queries in runs BKMUEAA1 and BKMUEAA2. The difference
between these two runs is that BKMUEAA1 used Title and Description fields only.</p>
      <p>The results show that with Russian queries we are about one third lower in
average precision than with either English or German queries. We are currently studying
why this is so. In addition our overall performance seems only slightly above the
CLEF-2001 median performance. This seemed puzzling when compared to our excellent
bilingual performance and our above average performance at CLEF-2000. For comparison we
also inserted the average of median precisions for last year (row "2000 Prec." at the
bottom of Table 3). As can be seen, the median performance in terms of query precision
for CLEF-2001 of 0.2749 is about 50 percent better than the median multilingual
performance of 0.1843 of CLEF-2000. This argues that significant progress has been made
by the CLEF community in terms of European cross-language retrieval performance.
<table-wrap>
<label>Table 3.</label>
<caption><p>Results of four official CLEF-2001 multilingual runs.</p></caption>
<table>
<thead>
<tr><th>Run ID</th><th>BKMUEAA2</th><th>BKMUEAA1</th><th>BKMUGAM1</th><th>BKMURAA1</th></tr>
</thead>
<tbody>
<tr><td>Retrieved</td><td>50000</td><td>50000</td><td>50000</td><td>50000</td></tr>
<tr><td>Relevant</td><td>8138</td><td>8138</td><td>8138</td><td>8138</td></tr>
<tr><td>Rel. Ret</td><td>5520</td><td>5190</td><td>5223</td><td>4202</td></tr>
<tr><td>Precision at 0.00</td><td>0.8890</td><td>0.8198</td><td>0.8522</td><td>0.7698</td></tr>
<tr><td>Precision at 0.10</td><td>0.6315</td><td>0.5708</td><td>0.6058</td><td>0.4525</td></tr>
<tr><td>Precision at 0.20</td><td>0.5141</td><td>0.4703</td><td>0.5143</td><td>0.3381</td></tr>
<tr><td>Precision at 0.30</td><td>0.4441</td><td>0.3892</td><td>0.4137</td><td>0.2643</td></tr>
<tr><td>Precision at 0.40</td><td>0.3653</td><td>0.3061</td><td>0.3324</td><td>0.1796</td></tr>
<tr><td>Precision at 0.50</td><td>0.2950</td><td>0.2476</td><td>0.2697</td><td>0.1443</td></tr>
<tr><td>Precision at 0.60</td><td>0.2244</td><td>0.1736</td><td>0.2033</td><td>0.0933</td></tr>
<tr><td>Precision at 0.70</td><td>0.1502</td><td>0.1110</td><td>0.1281</td><td>0.0556</td></tr>
<tr><td>Precision at 0.80</td><td>0.0894</td><td>0.0620</td><td>0.0806</td><td>0.0319</td></tr>
<tr><td>Precision at 0.90</td><td>0.0457</td><td>0.0440</td><td>0.0315</td><td>0.0058</td></tr>
<tr><td>Precision at 1.00</td><td>0.0022</td><td>0.0029</td><td>0.0026</td><td>0.0005</td></tr>
<tr><td>Brk. Prec.</td><td>0.3101</td><td>0.2674</td><td>0.2902</td><td>0.1838</td></tr>
<tr><td>CLEF Prec.</td><td>0.2749</td><td>0.2749</td><td>0.2749</td><td>0.2749</td></tr>
<tr><td>2000 Prec.</td><td>0.1843</td><td>0.1843</td><td>0.1843</td><td>0.1843</td></tr>
</tbody>
</table>
</table-wrap></p>
      <p>4 GIRT retrieval</p>
      <p>The special emphasis of our current funding has focused upon retrieval of
specialized domain documents which have been assigned individual classification
identifiers by human indexers. These classification identifiers can come from thesauri.
Since many millions of dollars are expended on developing such classification schemes
and using them to index documents, it is natural to attempt to exploit the resources to
the fullest extent possible to improve retrieval. In some cases such thesauri are
developed with identifiers translated (or provided) in multiple languages, and can thus
be used to transfer words across the language barrier.</p>
      <p>The GIRT collection consists of reports and papers (grey literature) in the
social science domain. The collection is managed and indexed by the GESIS organization
(http://www.social-science-gesis.de). GIRT is an excellent example of a collection
indexed by a multilingual thesaurus, originally German-English, recently translated
into Russian. The GIRT multilingual thesaurus (German-English), which is based on the
Thesaurus for the Social Sciences [2], provides the vocabulary source for the indexing
terms within the GIRT collection of CLEF. Further information about GIRT can be found
in [7]. There are 76,128 German documents in the GIRT subtask collection. Almost all the
documents contain manually assigned thesaurus terms. On average, there are about 10
thesaurus terms assigned to each document. Figure 1 is an example of a thesaurus entry.
Since transliteration of the Cyrillic alphabet is a key part of our retrieval strategy,
we have transliterated all Russian thesaurus entries.
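</p>
      <p>Transliteration of Russian terms can be sketched as a character-by-character
mapping. The table below is an assumption: a simplified, diacritic-free romanization
chosen because it reproduces the transliterated forms quoted in this paper (for example
'migratsiia' and 'tekhnologiia'). The exact scheme used for the runs is not specified
here, and the names TRANSLIT and transliterate are hypothetical.

```python
# Assumed romanization table (simplified, no diacritics); not taken from the paper.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": '"', "ы": "y", "ь": "'", "э": "e", "ю": "iu", "я": "ia",
}


def transliterate(text):
    """Lowercase the text and romanize Cyrillic letters; keep everything else."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text.lower())


print(transliterate("миграция"))    # prints: migratsiia
print(transliterate("технология"))  # prints: tekhnologiia
```

Applying the same function to both the queries and the thesaurus entries keeps the two
sides comparable, whichever romanization is actually chosen.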
</p>
      <p>Fig. 1. GIRT German-Russian Thesaurus Entry with Transliteration</p>
      <p>In our experiments, we indexed the TITLE and TEXT sections in each document
(not the E-TITLE or E-TEXT). The CLEF rules specified that indexing any other field
would need to be declared a manual run. For our CLEF runs this year we again used the
Muscat stemmer, which is similar to the Porter stemmer but for the German language.</p>
      <p>4.1 Query translation from Russian to German</p>
      <p>In order to prepare for query translation, we first extracted all the single
words and bigrams from the Russian topic fields. Since we do not have a Russian POS
tagger, we treated any two adjacent words (overlapping word bigrams) as a potential
phrase. The single words and bigrams in each Russian query were then compared against
the Russian-German thesaurus. If a word or bigram was found in the thesaurus, its
German translation was added to the new German query being created. The resulting
German query was then run against the German collection to retrieve relevant
documents.</p>
      <p>For comparison, we also used an online MT system
(Promt-Reverso: http://translation2.paralink.com/)
to translate the Russian queries to German.</p>
      <p>Fuzzy Matching for the Thesaurus. The first approach to thesaurus-based
translation was exact matching for thesaurus lookup. From the 25 GIRT topics we obtained
about 1300 Russian query terms (words and bigrams). Only 50 of them were directly found
in the thesaurus, and these were all single words. Two problems contribute to the low
matching rate.</p>
      <p>First, a Russian word may have several forms or variations. Usually only the
base form or general form appears in the thesaurus. For example, "evropa" (Europe in
English) is in the thesaurus, but "evrope" and "evropu" are not. In this case, a
Russian morphological analyzer would be helpful. Since we do not have a Russian
morphological analyzer, we used fuzzy matching to address this problem.</p>
      <p>There are different kinds of algorithms for fuzzy matching, such as Levenshtein
distance, common n-grams, longest common subsequence, etc. [9]. We found the simple
common bigram algorithm to be very efficient and effective for matching different word
forms. The two strings are divided into their constituent bigrams and Dice's
coefficient is used to compute the similarity between the two strings.
<table-wrap>
<table>
<thead>
<tr><th>Original Russian word</th><th>Russian word in the thesaurus</th><th>German translation</th></tr>
</thead>
<tbody>
<tr><td>migratsiiu</td><td>migratsiia</td><td>wanderung</td></tr>
<tr><td>migratsii</td><td>migratsiia</td><td>wanderung</td></tr>
<tr><td>kul’turoi</td><td>kul’tura</td><td>kultur</td></tr>
<tr><td>kul’turu</td><td>kul’tura</td><td>kultur</td></tr>
<tr><td>tekhnologii</td><td>tekhnologiia</td><td>technologie</td></tr>
<tr><td>tekhnologiei</td><td>tekhnologiia</td><td>technologie</td></tr>
<tr><td>bezrabotitsei</td><td>bezrabotitsa</td><td>arbeitslosigkeit</td></tr>
<tr><td>televideniia</td><td>televidenie</td><td>fernsehen</td></tr>
</tbody>
</table>
</table-wrap>
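</p>
      <p>The common-bigram matching described in this section can be sketched as follows.
This is a minimal illustration rather than the code used for the runs: it assumes
character bigrams are compared as sets, and the function names and the 0.6 acceptance
threshold are hypothetical.

```python
def char_bigrams(s):
    """Return the adjacent character pairs of s."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def dice(a, b):
    """Dice's coefficient: 2 * shared bigrams / (bigrams in a + bigrams in b)."""
    x, y = set(char_bigrams(a)), set(char_bigrams(b))
    if not x or not y:
        return 0.0
    return 2 * len(x.intersection(y)) / (len(x) + len(y))


def best_match(term, entries, threshold=0.6):
    """Return the most similar thesaurus entry, or None if below the threshold."""
    entries = list(entries)
    if not entries:
        return None
    score, entry = max((dice(term, e), e) for e in entries)
    return entry if score >= threshold else None


# Transliterated thesaurus entries quoted in the text.
entries = ["evropa", "migratsiia", "kul'tura", "televidenie"]

print(dice("evropu", "evropa"))          # prints: 0.8
print(best_match("migratsii", entries))  # prints: migratsiia
```

Because only shared bigrams are counted, inflectional endings lower the score slightly
while the common stem keeps it high.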
</p>
      <p>The word forms tabulated above are examples of Russian words that do not occur
in the thesaurus but whose different forms were found in the thesaurus by fuzzy
matching (the Russian characters in the examples are transliterated for easy
reading).</p>
      <p>The second problem lies in finding query bigrams which do not match exactly to
thesaurus entries. Fuzzy matching was also useful for finding different forms of
bigrams, even in cases where word order is changed. Examples are:
<table-wrap>
<table>
<thead>
<tr><th>Original Russian bigram</th><th>Bigram found in the thesaurus</th><th>German translation</th></tr>
</thead>
<tbody>
<tr><td>rukovodiashchikh rabotnikov</td><td>rukovodiashchie rabotniki</td><td>fuhrungskraft</td></tr>
<tr><td>razvitie i organizatsiia</td><td>organizatsionnoe razvitie</td><td>organisationsentwicklung</td></tr>
<tr><td>upravlenie organizatsiia</td><td>organizatsiia upravleniia</td><td>verwaltungsorganisation</td></tr>
<tr><td>rabochem meste</td><td>rabochee mesto</td><td>arbeitsplatz</td></tr>
<tr><td>tekhnologicheskogo razvitiia</td><td>tekhnologicheskoe razvitie</td><td>technologische entwicklung</td></tr>
</tbody>
</table>
</table-wrap></p>
      <p>The way bigram 'phrases' were created had two problems: first, many of the
bigrams were simply not meaningful; second, even though most genuine phrases contain
two words (bigrams), approximately 25 percent of Russian terms in the thesaurus contain
3 or more words. A Russian POS tagger would be very helpful for finding meaningful or
long phrases.
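</p>
      <p>The extraction of single words and overlapping word bigrams, followed by exact
thesaurus lookup, can be sketched as below. This is an illustration under assumptions:
whitespace tokenization, a dictionary as the thesaurus, and the hypothetical names
candidate_terms and translate_query. The toy entries are taken from the examples quoted
in this section.

```python
def candidate_terms(text):
    """Return the single words followed by all overlapping adjacent-word bigrams."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def translate_query(russian_text, thesaurus):
    """Collect the German translation of every candidate term found in the thesaurus."""
    german = []
    for term in candidate_terms(russian_text):
        translation = thesaurus.get(term)  # exact match only
        if translation is not None and translation not in german:
            german.append(translation)
    return german


# Toy transliterated Russian-to-German thesaurus.
thesaurus = {
    "migratsiia": "wanderung",
    "kul'tura": "kultur",
    "rabochee mesto": "arbeitsplatz",
}

print(translate_query("migratsiia i rabochee mesto", thesaurus))
# prints: ['wanderung', 'arbeitsplatz']
```

Every adjacent pair is generated, including meaningless ones such as 'migratsiia i',
which illustrates the first problem noted above; terms of three or more words are never
produced, which illustrates the second.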
</p>
      <p>Our GIRT results are summarized in Table 4. The runs can be described as
follows: BKGRGGA is a monolingual run using the German version of the topics run
against the German GIRT document collection. BKGRRGA1 is a Russian-German bilingual run
using an MT system for query translation. BKGRRGA2 is a Russian-German bilingual run
using thesaurus lookup and fuzzy matching. BKGRRGA3 is a Russian-German bilingual run
which is identical in methodology to BKGRRGA2 except that only title and description
sections of the topic were used for matching. As we can see, our Russian runs achieve
only about 1/3 of the precision of the German-German monolingual run, with a
significant edge to the machine translation version. While the full narrative run
BKGRRGA3 retrieved more relevant documents, this did not translate into higher overall
precision.
<table-wrap>
<label>Table 4.</label>
<caption><p>Results of official GIRT Russian-German runs.</p></caption>
<table>
<thead>
<tr><th>Run ID</th><th>BKGRGGA</th><th>BKGRRGA1</th><th>BKGRRGA2</th><th>BKGRRGA3</th></tr>
</thead>
<tbody>
<tr><td>Retrieved</td><td>25000</td><td>25000</td><td>25000</td><td>25000</td></tr>
<tr><td>Relevant</td><td>1111</td><td>1111</td><td>1111</td><td>1111</td></tr>
<tr><td>Rel. Ret</td><td>1054</td><td>774</td><td>781</td><td>813</td></tr>
<tr><td>Precision at 0.00</td><td>0.9390</td><td>0.4408</td><td>0.4032</td><td>0.3793</td></tr>
<tr><td>Precision at 0.10</td><td>0.8225</td><td>0.3461</td><td>0.3018</td><td>0.2640</td></tr>
<tr><td>Precision at 0.20</td><td>0.7501</td><td>0.2993</td><td>0.2441</td><td>0.2381</td></tr>
<tr><td>Precision at 0.30</td><td>0.6282</td><td>0.2671</td><td>0.2257</td><td>0.1984</td></tr>
<tr><td>Precision at 0.40</td><td>0.5676</td><td>0.2381</td><td>0.1790</td><td>0.1782</td></tr>
<tr><td>Precision at 0.50</td><td>0.5166</td><td>0.1999</td><td>0.1341</td><td>0.1609</td></tr>
<tr><td>Precision at 0.60</td><td>0.4604</td><td>0.1400</td><td>0.0993</td><td>0.1323</td></tr>
<tr><td>Precision at 0.70</td><td>0.4002</td><td>0.1045</td><td>0.0712</td><td>0.1125</td></tr>
<tr><td>Precision at 0.80</td><td>0.3038</td><td>0.0693</td><td>0.0502</td><td>0.0765</td></tr>
<tr><td>Precision at 0.90</td><td>0.2097</td><td>0.0454</td><td>0.0232</td><td>0.0273</td></tr>
<tr><td>Precision at 1.00</td><td>0.0620</td><td>0.0051</td><td>0.0013</td><td>0.0000</td></tr>
<tr><td>Brk. Prec.</td><td>0.5002</td><td>0.1845</td><td>0.1448</td><td>0.1461</td></tr>
</tbody>
</table>
</table-wrap></p>
      <p>5 Summary and Acknowledgments</p>
      <p>The participation of Berkeley's Group One in CLEF-2001 has enabled us to
explore the difficulties in extending cross-language information retrieval to a
non-Roman alphabet language, Russian, for which limited resources are available.
Specifically we have explored a comparison of bilingual and multilingual retrieval
where original queries were in Russian when compared against German as a query language
or English as a query language for multilingual retrieval. For the GIRT task we
compared various forms of Russian→German retrieval. We have determined that there is
significant work to be done before cross-language information retrieval from the
Russian language will become competitive to other European languages.</p>
      <p>This research was supported by DARPA (Department of Defense Advanced Research
Projects Agency) under contract N66001-97-8541; AO# F477: Search Support for Unfamiliar
Metadata Vocabularies within the DARPA Information Technology Office. We thank Aitao
Chen for indexing the main CLEF collections.</p>
      <p>References</p>
      <p>1. W. Cooper, A. Chen, and F. Gey. Full text retrieval based on probabilistic
equations with coefficients fitted by logistic regression. In D. K. Harman, editor, The
Second Text REtrieval Conference (TREC-2), pages 57–66, March 1994.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>