
2001


Successful cross-language information retrieval (CLIR) combines linguistic techniques (phrase discovery, machine translation, bilingual dictionary lookup) with robust monolingual information retrieval. For monolingual retrieval the Berkeley group has used the technique of logistic regression from the beginning of the TREC series of conferences. In TREC-2 [1] we derived a statistical formula for predicting probability of relevance based upon statistical clues contained within documents, queries and collections as a whole. This formula was used for document retrieval in Chinese [3] and Spanish in TREC-4 through TREC-6. We utilized the identical formula for English and German queries against the English/French/German/Italian document collections in the CLEF 2000 evaluation [10]. During the past two years, the formula has proven well-suited for Japanese and Japanese-English cross-language information retrieval as well as English-Chinese CLIR [6], even when only trained on English document collections. Participation in the NTCIR Workshops in Tokyo (http://research.nii.ac.jp/~ntcadm/workshop/work-en.html) led to different techniques for cross-language retrieval, ones which utilised the power of human indexing of documents to improve retrieval. Alignments of parallel texts were used to create large-scale bilingual lexicons between English and Japanese and between English and Chinese. Such lexicons were well-suited to the technical nature of the NTCIR collections of scientific and engineering articles.

3.1 Bilingual Retrieval

Bilingual retrieval is performed by running queries in another language against the English collection of CLEF. We chose to focus on Russian but to do German for a baseline comparison. The run BKBIGEM1 was obtained by translating the German queries to English using the L&H Power Translator and then manually adjusting the resulting queries by searching for the untranslated terms in our own special association dictionary created from a library catalog. An example of words not found comes from Topic 88 about 'mad cow disease'. In the German version, the words Spongiformer and Enzephalopathie were not translated by the commercial system, but our association dictionary obtained the words 'hepatic encephalopathy' associated with the query term Enzephalopathie. Further details about our methodology can be found in [8]. The run BKBIREA1 was obtained by using the PROMT web-based translator (http://www.translate.ru/). As with the German translation, certain words were not translated. Our methodology to deal with this was twofold: first we transliterated the Russian queries to their romanized alphabetic equivalent, and then we added untranslated terms to the English query in their transliterated form. For example, in Topic 50 on 'uprising of Indians in Chiapas', the Russian word чиапас was not translated. It can, however, be transliterated as 'chiapas'.

3.2 Bilingual Performance

Our bilingual performance can be found in Table 2. The final line of the table, labeled "CLEF Prec.", is computed as an average of each CLEF median precision of the CLEF collections among all submitted runs. The average is performed over the 47 queries for which the English collection had relevant documents. While an average of medians cannot be considered a statistic from which rigorous inference can be made, we have found it useful to average the medians of all queries as sent by CLEF organizers. Comparing our overall precision to this average of medians yields some fuzzy gauge of whether our performance is better, poorer, or about the same as the median performance. Using this measure we find that all our bilingual runs performed significantly better than the median for CLEF bilingual runs.

Table 2. Results of three official Berkeley CLEF bilingual runs.

Run ID        BKBIGEM1  BKBIREM1  BKBIREA1
Retrieved        47000     47000     47000
Relevant           856       856       856
Rel. Ret           812       737       733
Precision
  at 0.00       0.7797    0.6545    0.6420
  at 0.10       0.7390    0.6451    0.6303
  at 0.20       0.6912    0.5877    0.5691
  at 0.30       0.6306    0.5354    0.5187
  at 0.40       0.5944    0.4987    0.4806
  at 0.50       0.5529    0.4397    0.4167
  at 0.60       0.4693    0.3695    0.3605
  at 0.70       0.3952    0.3331    0.3218
  at 0.80       0.3494    0.2881    0.2747
  at 0.90       0.2869    0.2398    0.2339
  at 1.00       0.2375    0.1762    0.1743
Brk. Prec.      0.5088    0.4204    0.4077
CLEF Prec.      0.2423    0.2423    0.2423

Our non-English multilingual retrieval runs were based upon our bilingual experiments, extended to French/Italian/Spanish using English as a pivot language and (again) the L&H Power Translator as the MT system to translate queries from one language to another. Run BKMURAA1 takes the English translated queries of the bilingual run BKBIREA1 and again translates them to French/German/Italian/Spanish. Run BKMUGAM1 takes the German queries of bilingual run BKBIGEM1 as well as the translation of their English equivalents into French/Italian/Spanish. For comparison we did direct translation from the English queries in runs BKMUEAA1 and BKMUEAA2. The difference between these two runs is that BKMUEAA1 used Title and Description fields only.

Table 3. Results of four official CLEF-2001 multilingual runs.

Run ID        BKMUEAA2  BKMUEAA1  BKMUGAM1  BKMURAA1
Retrieved        50000     50000     50000     50000
Relevant          8138      8138      8138      8138
Rel. Ret          5520      5190      5223      4202
Precision
  at 0.00       0.8890    0.8198    0.8522    0.7698
  at 0.10       0.6315    0.5708    0.6058    0.4525
  at 0.20       0.5141    0.4703    0.5143    0.3381
  at 0.30       0.4441    0.3892    0.4137    0.2643
  at 0.40       0.3653    0.3061    0.3324    0.1796
  at 0.50       0.2950    0.2476    0.2697    0.1443
  at 0.60       0.2244    0.1736    0.2033    0.0933
  at 0.70       0.1502    0.1110    0.1281    0.0556
  at 0.80       0.0894    0.0620    0.0806    0.0319
  at 0.90       0.0457    0.0440    0.0315    0.0058
  at 1.00       0.0022    0.0029    0.0026    0.0005
Brk. Prec.      0.3101    0.2674    0.2902    0.1838
CLEF Prec.      0.2749    0.2749    0.2749    0.2749
2000 Prec.      0.1843    0.1843    0.1843    0.1843

The results show that with Russian queries we are about one third lower in average precision than with either English or German queries. We are currently studying why this is so. In addition our overall performance seems only slightly above the CLEF-2001 median performance. This seemed puzzling when compared to our excellent bilingual performance and our above average performance at CLEF-2000. For comparison we also inserted the average of median precisions for last year (Row 2000 Prec. at the bottom of Table 3). As can be seen, the median performance in terms of query precision for CLEF-2001 of 0.2749 is about 50 percent better than the median multilingual performance of 0.1843 of CLEF-2000. This argues that significant progress has been made by the CLEF community in terms of European cross-language retrieval performance.

4 GIRT retrieval

The special emphasis of our current funding has focussed upon retrieval of specialized domain documents which have been assigned individual classification identifiers by human indexers. These classification identifiers can come from thesauri. Since many millions of dollars are expended on developing such classification schemes and using them to index documents, it is natural to attempt to exploit the resources to the fullest extent possible to improve retrieval. In some cases such thesauri are developed with identifiers translated (or provided) in multiple languages, and can thus be used to transfer words across the language barrier.

The GIRT collection consists of reports and papers (grey literature) in the social science domain. The collection is managed and indexed by the GESIS organization (http://www.social-science-gesis.de). GIRT is an excellent example of a collection indexed by a multilingual thesaurus, originally German-English, recently translated into Russian. The GIRT multilingual thesaurus (German-English), which is based on the Thesaurus for the Social Sciences [2], provides the vocabulary source for the indexing terms within the GIRT collection of CLEF. Further information about GIRT can be found in [7].

There are 76,128 German documents in the GIRT subtask collection. Almost all the documents contain manually assigned thesaurus terms. On average, there are about 10 thesaurus terms assigned to each document. Figure 1 is an example of a thesaurus entry. Since transliteration of the Cyrillic alphabet is a key part of our retrieval strategy, we have transliterated all Russian thesaurus entries.

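As an illustration of the transliteration step, the sketch below romanizes Cyrillic text with a simplified Library of Congress-style table without diacritics. The mapping and the function name `transliterate` are assumptions inferred from the transliterated forms that appear in this paper (for example 'migratsiia' and 'chiapas'); the table actually used for the runs is not given here.

```python
# Assumed mapping: simplified Library of Congress-style romanization with
# diacritics dropped, inferred from the transliterated examples in this paper.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": '"', "ы": "y", "ь": "'", "э": "e", "ю": "iu", "я": "ia",
}


def transliterate(text: str) -> str:
    """Lowercase the text and romanize it one character at a time.

    Characters without an entry in the table (spaces, digits, Latin
    letters) are passed through unchanged.
    """
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())


if __name__ == "__main__":
    for word in ["миграция", "культура", "технология", "чиапас"]:
        print(word, "->", transliterate(word))
```

Applying one table to topics and thesaurus entries alike keeps both sides in the same alphabet, so later matching can work on plain ASCII strings.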
In our experiments, we indexed the TITLE and TEXT sections in each document (not the E-TITLE or E-TEXT). The CLEF rules specified that indexing any other field would need to be declared a manual run. For our CLEF runs this year we again used the Muscat stemmer, which is similar to the Porter stemmer but for the German language.

4.1 Query translation from Russian to German

In order to prepare for query translation, we first extracted all the single words and bigrams from the Russian topic fields. Since we do not have a Russian POS tagger, we took any two adjacent words (overlapping word bigram) to be considered a potential phrase. The single words and bigrams in each Russian query were then compared against the Russian-German thesaurus. If a word or bigram was found in the thesaurus, its German translation was added to the new German query being created. The resulting German query was then run against the German collection to retrieve relevant documents.

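The lookup procedure just described might be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the runs: the function names, the whitespace tokenization, and the three-entry toy thesaurus are hypothetical, and the thesaurus is modelled as a plain dictionary from transliterated Russian terms to German terms.

```python
from __future__ import annotations


def candidate_terms(query: str) -> list[str]:
    """Return all single words followed by all overlapping word bigrams."""
    words = query.lower().split()  # assumption: simple whitespace tokenization
    bigrams = [f"{first} {second}" for first, second in zip(words, words[1:])]
    return words + bigrams


def translate_query(query: str, thesaurus: dict[str, str]) -> list[str]:
    """Build the German query from exact thesaurus matches only."""
    german_terms: list[str] = []
    for term in candidate_terms(query):
        german = thesaurus.get(term)
        if german is not None and german not in german_terms:
            german_terms.append(german)
    return german_terms


# Hypothetical toy thesaurus holding three transliterated entries.
TOY_THESAURUS = {
    "migratsiia": "wanderung",
    "rabochee mesto": "arbeitsplatz",
    "kul'tura": "kultur",
}

if __name__ == "__main__":
    print(translate_query("rabochee mesto migratsiia", TOY_THESAURUS))
```

Because every adjacent word pair is treated as a potential phrase, most generated bigrams never match anything; the dictionary lookup simply ignores them.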
For comparison, we also used an online MT system (Promt-Reverso: http://translation2.paralink.com/) to translate the Russian queries to German.

Fig. 1. GIRT German-Russian Thesaurus Entry with Transliteration

Fuzzy Matching for the Thesaurus. The first approach to thesaurus-based translation was exact matching for thesaurus lookup. From the 25 GIRT topics we obtain about 1300 Russian query terms (words and bigrams). Only 50 of them were directly found in the thesaurus, and these were all single words. Two problems contribute to the low matching rate:

First, a Russian word may have several forms or variations. Usually only the base form or general form appears in the thesaurus. For example, "evropa" (Europe in English) is in the thesaurus, but "evrope" and "evropu" are not. In this case, a Russian morphological analyzer would be helpful. Since we do not have a Russian morphological analyzer, we used fuzzy matching to address this problem. There are different kinds of algorithms for fuzzy matching, such as Levenshtein distance, common n-grams, longest common subsequence, etc. [9]. We found the simple common bigram algorithm to be very efficient and effective for matching different word forms. The two strings are divided into their constituent bigrams and Dice's coefficient is used to compute the similarity between the two strings.

Original Russian word   Russian word in the thesaurus   German translation
migratsii               migratsiia                      wanderung
migratsiiu              migratsiia                      wanderung
kul'turoi               kul'tura                        kultur
kul'turu                kul'tura                        kultur
tekhnologii             tekhnologiia                    technologie
tekhnologiei            tekhnologiia                    technologie
bezrabotitsei           bezrabotitsa                    arbeitslosigkeit
televideniia            televidenie                     fernsehen

Above are some examples of Russian words that do not occur in the thesaurus but whose different forms were found in the thesaurus by fuzzy matching (the Russian characters in the examples are transliterated for easy reading).

The second problem lies in finding query bigrams which do not match exactly to thesaurus entries. Fuzzy matching was also useful for finding different forms of bigrams, even in cases where word order is changed. Examples are:

Original Russian bigram        Bigram found in the thesaurus   German translation
rabochem meste                 rabochee mesto                  arbeitsplatz
tekhnologicheskogo razvitiia   tekhnologicheskoe razvitie      technologische entwicklung
rukovodiashchikh rabotnikov    rukovodiashchie rabotniki       fuhrungskraft
razvitie i organizatsiia       organizatsionnoe razvitie       organisationsentwicklung
upravlenie organizatsiia       organizatsiia upravleniia       verwaltungsorganisation

The way bigram 'phrases' were created had two problems: first, many of the bigrams were simply not meaningful; second, even though most genuine phrases contain two words (bigrams), approximately 25 percent of Russian terms in the thesaurus contain 3 or more words. A Russian POS tagger would be very helpful for finding meaningful or long phrases.

Our GIRT results are summarized in Table 4. The runs can be described as follows: BKGRGGA is a monolingual run using the German version of the topics run against the German GIRT document collection. BKGRRGA1 is a Russian-German bilingual run using an MT system for query translation. BKGRRGA2 is a Russian-German bilingual run using thesaurus lookup and fuzzy matching. BKGRRGA3 is a Russian-German bilingual run which is identical in methodology to BKGRRGA2 except that only title and description sections of the topic were used for matching. As we can see, our Russian runs achieve only about 1/3 of the precision of the German-German monolingual run, with a significant edge to the machine translation version. While the full narrative run BKGRRGA3 retrieved more relevant documents, this did not translate into higher overall precision.
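The common-bigram fuzzy matching used for the thesaurus lookup can be sketched as below. This is an assumption-laden illustration rather than the implementation behind the runs: the function names are hypothetical, bigrams are compared as sets of distinct character pairs, and the acceptance threshold of 0.6 is an invented value, since no threshold is stated in the text.

```python
from __future__ import annotations


def char_bigrams(text: str) -> set[str]:
    """Split a string into its set of distinct adjacent character pairs."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def dice(a: str, b: str) -> float:
    """Dice's coefficient: 2 * |A & B| / (|A| + |B|) over character bigrams."""
    bigrams_a, bigrams_b = char_bigrams(a), char_bigrams(b)
    if not bigrams_a or not bigrams_b:
        return 0.0  # strings shorter than two characters have no bigrams
    return 2 * len(bigrams_a & bigrams_b) / (len(bigrams_a) + len(bigrams_b))


def best_match(term: str, entries: list[str],
               threshold: float = 0.6) -> tuple[str, float] | None:
    """Return the most similar thesaurus entry, or None below the threshold.

    The threshold value is an assumption made for this sketch.
    """
    if not entries:
        return None
    best = max(entries, key=lambda entry: dice(term, entry))
    score = dice(term, best)
    return (best, score) if score >= threshold else None


if __name__ == "__main__":
    entries = ["evropa", "kul'tura", "migratsiia"]
    for form in ["evrope", "evropu", "migratsiiu", "kul'turoi"]:
        print(form, "->", best_match(form, entries))
```

Because the comparison works on an unordered set of character pairs, two-word terms whose words appear in a different order still share most of their bigrams, which is consistent with the reordered bigram matches reported in this section.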
Table 4. Results of official GIRT Russian-German runs.

Run ID        BKGRGGA   BKGRRGA1  BKGRRGA2  BKGRRGA3
Retrieved        25000     25000     25000     25000
Relevant          1111      1111      1111      1111
Rel. Ret          1054       774       781       813
Precision
  at 0.00       0.9390    0.4408    0.4032    0.3793
  at 0.10       0.8225    0.3461    0.3018    0.2640
  at 0.20       0.7501    0.2993    0.2441    0.2381
  at 0.30       0.6282    0.2671    0.2257    0.1984
  at 0.40       0.5676    0.2381    0.1790    0.1782
  at 0.50       0.5166    0.1999    0.1341    0.1609
  at 0.60       0.4604    0.1400    0.0993    0.1323
  at 0.70       0.4002    0.1045    0.0712    0.1125
  at 0.80       0.3038    0.0693    0.0502    0.0765
  at 0.90       0.2097    0.0454    0.0232    0.0273
  at 1.00       0.0620    0.0051    0.0013    0.0000
Brk. Prec.      0.5002    0.1845    0.1448    0.1461

5 Summary and Acknowledgments

The participation of Berkeley's Group One in CLEF-2001 has enabled us to explore the difficulties in extending cross-language information retrieval to a non-Roman alphabet language, Russian, for which limited resources are available. Specifically we have explored a comparison of bilingual and multilingual retrieval where original queries were in Russian when compared against German as a query language or English as a query language for multilingual retrieval. For the GIRT task we compared various forms of Russian→German retrieval. We have determined that there is significant work to be done before cross-language information retrieval from the Russian language will become competitive to other European languages.

This research was supported by DARPA (Department of Defense Advanced Research Projects Agency) under contract N66001-97-8541; AO# F477: Search Support for Unfamiliar Metadata Vocabularies within the DARPA Information Technology Office. We thank Aitao Chen for indexing the main CLEF collections.

References

1. W. Cooper, A. Chen, and F. Gey. Full text retrieval based on probabilistic equations with coefficients fitted by logistic regression. In D. K. Harman, editor, The Second Text REtrieval Conference (TREC-2), pages 57-66, March 1994.