Two Stages Refinement of Query Translation for Pivot Language Approach to Cross Lingual Information Retrieval: A Trial at CLEF 2003

Kazuaki KISHIDA (Surugadai University / National Institute of Informatics, Japan) kishida@surugadai.ac.jp
Noriko KANDO (National Institute of Informatics, Japan) kando@nii.ac.jp

Abstract

This paper reports experimental results of cross-lingual information retrieval (CLIR) from German to Italian. The authors are concerned with CLIR in situations where the available language resources are very limited. Transitive translation of queries using English as a pivot language was therefore used to search Italian document collections for German queries, without any direct bilingual dictionary or MT system between the two languages. In order to remove irrelevant translations produced by the transitive translation, we propose a disambiguation technique in which the query translation is refined in two stages. This refinement is essentially based on the idea of pseudo relevance feedback. In the first stage, for each source query term, we select the translation candidate that appears most frequently in the set of top-ranked documents retrieved for the full set of terms provided by transitive translation of the source query. In the second stage, a standard query expansion based on pseudo relevance feedback is conducted. Our experimental results show that the two-stage refinement method significantly improves the search performance of bilingual IR using a pivot language. However, it also turned out that the performance of this method is inferior to that of a machine translation based method.

1 Introduction

1.1 Purpose

This paper reports our experiment on cross-language IR (CLIR) from German to Italian at CLEF 2003. Our fundamental interest is CLIR between languages with very limited translation resources, and we therefore attempt to explore a new approach to transitive query translation using English as a pivot. Translation resources between English and each individual language are often easy to obtain, and this holds not only in the European environment but also in the East Asian environment. Although East Asian languages are completely different from English, the pivot language approach using English is very important for them because of the limited availability of resources for direct translation among these languages.

Thus the basic premise we made for the experiment in CLEF 2003 is that only very limited language resources are available for executing CLIR runs. For example, it was supposed that there is
- no bilingual dictionary,
- no machine translation (MT) system,
- no parallel corpus between German and Italian directly, and
- no corpus written in the language of the query (i.e., no German corpus).

We decided to employ only two relatively small dictionaries, German to English (G to E) and English to Italian (E to I), which are easily available through the Internet. As mentioned above, our research purpose is fundamentally to develop an effective method for enhancing the performance of CLIR in situations where language resources are very poor.
1.2 Basic idea

Under this presupposition, our method for CLIR is characterized as
- a dictionary-based approach (query translation), and
- a pivot language approach (English is the pivot).

As is well known, the dictionary-based approach may produce extraneous or irrelevant translations [1]. Particularly, in the case of the pivot language approach, the two consecutive translation steps (e.g., German to English and English to Italian) often yield many more extraneous translation candidates because each word is replaced twice. Therefore, a term disambiguation technique is indispensable.

In the presupposition of our study, the only resource available for translation disambiguation is the target document collection itself (i.e., the Italian document sets included in the CLEF test collection). We propose a disambiguation technique in which pseudo relevance feedback is applied repeatedly to refine the query translation. The basic procedure is as follows (a minimal code sketch is given at the end of this subsection):
- Initial search: the target document collection is searched for all translation candidates produced by dictionary-based replacement via the pivot language.
- First feedback (disambiguation stage): the set of translation candidates is reduced by using term occurrence statistics within a set of top-ranked documents obtained by the initial search.
- Second search: the target document collection is searched for the reduced set of translations.
- Second feedback (query expansion stage): the reduced set of translations is expanded by a standard pseudo relevance feedback technique.
- Final search: the target document collection is searched for the expanded set of search terms.

The two stages of refinement of the translation candidates should enable us to obtain better CLIR performance when the available language resources are poor. The purpose of this paper is to verify experimentally the effectiveness of the two-stage refinement technique using the CLEF 2003 test collection.
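To make the overall flow concrete, the following is a minimal sketch of the procedure written as glue code in Python. The callables `translate`, `search`, `disambiguate`, and `expand` are supplied by the caller and stand for the components described in Sections 3 and 4; all names here are illustrative assumptions, not identifiers from our actual system.

```python
def two_stage_retrieval(source_terms, collection, translate, search,
                        disambiguate, expand, top_k=100):
    """Glue code only: the four callables implement the components
    described in Sections 3 and 4."""
    per_term = translate(source_terms)                 # per-term candidate sets T'_1..T'_m
    T = set().union(*per_term)                         # Equation (1): union of candidates
    ranked = search(collection, T)                     # initial search
    T_tilde = disambiguate(per_term, ranked[:top_k])   # first feedback (Section 3.1)
    ranked = search(collection, T_tilde)               # second search
    T_tilde_prime = expand(T_tilde, ranked[:top_k])    # second feedback, PRF (Section 3.2)
    return search(collection, T_tilde_prime)           # final search
```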
This paper is organized as follows. In Section 2, we review previous work on translation disambiguation techniques and the pivot language approach. In Section 3, the technique of two-stage refinement of query translation is introduced. Section 4 describes the system used in our CLEF 2003 experiment. In Section 5, the results are reported.

2 Previous works

2.1 Translation disambiguation techniques

In the CLIR field, various ideas and techniques for translation disambiguation have been proposed. Among them, some researchers have explored methods that employ the target document collection for identifying extraneous or irrelevant translations. The typical approach is to use co-occurrence statistics of translation candidates, based on the assumption that "the correct translations of query terms should co-occur in target language documents and incorrect translations should tend not to co-occur" (Ballesteros and Croft [2]). Many works have followed this basic idea [3-9].

The fundamental procedure is as follows (a small illustration is sketched below):
- computing similarity degrees for all pairs of translation candidates, based on their co-occurrence frequencies in the target document collection, and
- selecting 'correct' pairs of translations according to these similarity degrees.

One of the difficulties in implementing this procedure is that the computational complexity of selecting the correct translations grows as the number of translations becomes large. To alleviate the problem, Gao et al. [4] proposed an approximation algorithm for choosing optimal translations.
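As a rough illustration of this family of methods (not a reimplementation of any particular cited system), the sketch below scores each candidate of one source term by its co-occurrence with the candidates of the other source terms and keeps the best-scoring one. The representation of documents as term sets and the simple count-based similarity are our own simplifying assumptions.

```python
def cooccurrence(t1, t2, documents):
    # documents: list of documents, each represented as a set of target-language terms
    return sum(1 for doc in documents if t1 in doc and t2 in doc)

def select_by_cooccurrence(candidate_sets, documents):
    """candidate_sets: one list of translation candidates per source query term.
    For each source term, keep the candidate that co-occurs most often with the
    candidates of the other source terms (a greedy simplification of pairwise selection)."""
    selected = []
    for i, candidates in enumerate(candidate_sets):
        others = [c for j, cs in enumerate(candidate_sets) if j != i for c in cs]
        best = max(candidates,
                   key=lambda t: sum(cooccurrence(t, o, documents) for o in others))
        selected.append(best)
    return selected
```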
2.2 Pivot language approach

Many languages are spoken in the world, while bilingual resources are limited. There is no guarantee that useful resources are available for every pair of languages that we need in a real situation. For example, it may be difficult to find bilingual resources in machine-readable form between Dutch and Japanese. One solution is to employ English as an intermediate (pivot) language, since English is an international language and bilingual dictionaries or MT systems involving English can reasonably be expected to exist for many languages.

The basic approach is transitive translation of the query using two bilingual resources (see Ballesteros [10]). If two MT systems or two bilingual dictionaries, Dutch to English and English to Japanese, are available, we can translate a Dutch query into Japanese without any direct Dutch-Japanese dictionary. This approach has already been attempted by several researchers [11-15].

When two bilingual dictionaries are used successively for query translation, it is crucial to resolve translation ambiguity, because very many extraneous or irrelevant search terms may be generated by the two translation steps. Suppose that an English term obtained from the bilingual dictionary from the source language to English is an irrelevant translation. Inevitably, all terms listed under that English term in the bilingual dictionary from English to the target language will also be irrelevant. Therefore, many more extraneous translations are generated in the pivot language approach than in a standard single-step translation process.

For disambiguation, Ballesteros [10] applied a co-occurrence frequency-based method, query expansion, and other techniques. Meanwhile, Gollins and Sanderson [16] proposed a technique of "lexical triangulation," in which two pivot languages are used independently and erroneous translations are removed by keeping only the translations shared by the two transitive translation paths.

3 Two Stages Refinement of Translations

3.1 Translation disambiguation stage

A translation disambiguation technique based on term co-occurrence statistics may be useful in the situation our study presupposes, since it makes use of only the target document collection as the source of disambiguation. However, as already mentioned, its computational complexity is fairly high. Also, it should be noted that term co-occurrence frequencies are macro-level statistics over the entire document collection. This means that disambiguation based on these statistics may lead to false combinations of translation candidates (see Yamabana et al. [3]). Even if two terms A and B are statistically associated in general (i.e., in the entire collection), the association is not always valid for a given query.

Therefore, in this study, the authors decided to use an alternative disambiguation technique that is not based on term co-occurrence statistics. First, we define some notation:
- $s_j$: terms in the source query ($j = 1, 2, \dots, m$),
- $T_j$: the set of translations in the pivot language for the j-th term $s_j$,
- $T_j'$: the set of translations in the target language for all terms included in the set $T_j$.

By the transitive translation process using two bilingual dictionaries, it is easy to obtain a set of translated query terms in the target language with no disambiguation,

$T = T_1' \cup T_2' \cup \dots \cup T_m'$.  (1)

The disambiguation procedure we propose is to search the target document collection for the set of terms $T$, and then to select, from each set $T_j'$ respectively, the term that appears most frequently in the top-ranked documents (see Figure 1). The basic assumption is that the 'correct' combination of translations of the distinct original search terms tends to occur together in a single document in the target collection. If so, such documents are expected to be ranked higher in the result of the search for the set $T$.

[Figure 1. Outline of translation disambiguation: source terms are translated into the pivot language, then into the target language, and a selected term is chosen for each source term.]

Suppose that we have three sets of translations in the target language as follows:
$T_1'$: term A, term B, term C,
$T_2'$: term D, term E,
$T_3'$: term F, term G.
Also, it is assumed that the combination of terms A, D and F is correct and the other terms are irrelevant. In such a situation, we can reasonably expect that the irrelevant terms do not appear together in each document, because the probability that such irrelevant terms are related to each other is low. Meanwhile, the 'correct' combination of terms A, D and F should tend to appear within documents more often than any combination including an irrelevant translation. Therefore, the documents containing the 'correct' combination are likely to receive higher ranking scores.

For detecting such a combination from the result of the initial search for the set $T$, it is enough to use the document frequency of each translation within the set of top-ranked documents. That is, we can choose a term $\tilde{t}_j$ for each $T_j'$ ($j = 1, 2, \dots, m$) such that

$\tilde{t}_j = \arg\max_{t \in T_j'} r_t$,  (2)

where $r_t$ is the number of top-ranked documents including the term $t$. Finally, we obtain a set of $m$ translations through the disambiguation process,

$\tilde{T} = \{\tilde{t}_1, \tilde{t}_2, \dots, \tilde{t}_m\}$.  (3)

Ideally, we should make use of the co-occurrence frequencies of all combinations of translation candidates in the set of top-ranked documents. However, the computational cost would be fairly high, since the statistics would have to be compiled dynamically for each search run. A solution for avoiding this complexity is to count only simple frequencies instead of co-occurrences. That is, if the 'correct' combination of translations appears often, the simple frequency of each of its translations will naturally also be high. Equation (2) is based on this hypothesis.
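A minimal sketch of this disambiguation stage, under the assumption that each top-ranked document is represented simply as a set of terms (the function and variable names are ours, not the system's):

```python
def disambiguate(candidate_sets, top_docs):
    """candidate_sets: list of sets T'_j, one per source query term.
    top_docs: top-ranked documents from the initial search, each a set of
    target-language terms.
    Returns the set T~ of Equation (3): one term per T'_j, chosen by
    Equation (2), i.e. the candidate contained in the most top-ranked documents."""
    selected = []
    for candidates in candidate_sets:
        r = {t: sum(1 for doc in top_docs if t in doc) for t in candidates}  # r_t
        selected.append(max(candidates, key=lambda t: r[t]))
    return selected
```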
3.2 Query expansion stage

In the previous stage, the translation ambiguity was resolved, and the final $m$ search terms in the target language remain. That stage can be regarded as a process for improving the precision of the search. In the next stage, an enhancement of recall should be attempted, since some synonyms or related terms may have been removed in the previous stage.

Following Ballesteros and Croft [1,2], we execute a standard post-translation query expansion using a pseudo relevance feedback (PRF) technique, in which the new terms to be added to the query are selected by their weight $w_t$, calculated by a formula of the standard probabilistic model,

$w_t = r_t \times \log \dfrac{(r_t + 0.5)(N - R - n_t + r_t + 0.5)}{(N - n_t + 0.5)(R - r_t + 0.5)}$,  (4)

where $N$ is the total number of documents, $R$ is the number of relevant documents, and $n_t$ is the number of documents including term $t$. It should be noted that, in PRF, the set of relevant documents is assumed to be the set of top-ranked documents from the initial search. Therefore, $r_t$ is defined as before (see Equation (2)). We denote the term set expanded by this method as $\tilde{T}'$.

[Figure 2. Two stages refinement of translation: the original query is replaced with translation candidates via the translation dictionaries; disambiguation against the target document collection yields the selected terms, and expansion yields the final set of search terms.]

To sum up, the method we propose for refining the result of query translation consists of two stages: (a) translation disambiguation and (b) post-translation query expansion. The detailed procedure is as follows (see also Figure 2):
- obtaining a set of translations $T$ (see Equation (1)) by transitive translation,
- searching the target document collection for the set $T$ (i.e., initial search),
- selecting a single translation from each $T_j'$ respectively, according to the document frequency in the top-ranked documents of the initial search, and obtaining a new set $\tilde{T}$ (see Equation (3)) (i.e., disambiguation),
- searching the target document collection for the set $\tilde{T}$ (i.e., second search),
- adding terms according to the weight shown in Equation (4) (i.e., query expansion), and
- finally searching the target document collection for the expanded set of terms $\tilde{T}'$ (i.e., third search).
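A minimal sketch of the expansion weighting of Equation (4), assuming that the pseudo-relevant set is the top-ranked documents of the preceding search and that documents are again represented as term sets; the function name and the parameter `n_add` (the number of terms to add; Section 4.4 uses 30) are our own:

```python
import math

def expansion_terms(top_docs, all_docs, n_add=30):
    """Rank candidate expansion terms by the weight w_t of Equation (4).
    top_docs: pseudo-relevant documents (top-ranked by the previous search);
    all_docs: the whole target collection; both as lists of term sets."""
    N = len(all_docs)          # total number of documents
    R = len(top_docs)          # 'relevant' documents under the PRF assumption
    candidates = set().union(*top_docs)
    weights = {}
    for t in candidates:
        r_t = sum(1 for d in top_docs if t in d)   # as in Equation (2)
        n_t = sum(1 for d in all_docs if t in d)
        weights[t] = r_t * math.log(((r_t + 0.5) * (N - R - n_t + r_t + 0.5)) /
                                    ((N - n_t + 0.5) * (R - r_t + 0.5)))
    return sorted(weights, key=weights.get, reverse=True)[:n_add]
```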
4. System Description

4.1 Purpose of the system

The system enables us to search an Italian document collection for a German query automatically, i.e., it is an automatic CLIR system from German to Italian.

4.2 Text processing

Both the German and the Italian texts (in documents and queries) were basically processed by the following steps: (1) identifying tokens, (2) removing stopwords, (3) lemmatization, and (4) stemming. In addition, for German text, decomposition of compound words was attempted, based on an algorithm of longest matching against the headwords included in the machine-readable German to English dictionary.

4.3 Language resources

We downloaded free dictionaries (German to English and English to Italian) from the Internet (http://www.freelang.net/). Also, stemmers and stopword lists for German and Italian were available through the Snowball project (http://snowball.tartarus.org/). Stemming for English was conducted with the original Porter algorithm [17].

Furthermore, in order to evaluate the performance of our two-stage refinement method comparatively, we decided to use commercial MT software produced by a Japanese company.

4.4 Transitive translation procedure

Before executing the transitive translation with the two bilingual dictionaries, all terms included in the dictionaries were normalized through the same stemming and lemmatization steps that were applied to the texts of documents and queries. The actual translation process is a simple replacement (sketched below), i.e., each normalized German term in a query was replaced with the set of corresponding normalized English words, and similarly, each English word was replaced with the corresponding Italian words. As a result, for each query, a set of normalized Italian words, i.e., $T$ in Equation (1), was obtained. If no corresponding headword was included in the dictionaries (German-English or English-Italian), the unknown word was sent on to the next step without any change.

Next, the refinement of the set $T$ through the two stages described in the previous section was executed. The number of top-ranked documents was set to 100 in both stages, and in the query expansion stage, the top-ranked 30 terms in decreasing order of term weight (Equation (4)) were added. If a top-ranked term was already included in the set of search terms $\tilde{T}$, its term frequency in the query was changed to $1.5 \times y_t$. If not, the term frequency was set to 0.5 (i.e., $y_t = 0.5$).

On the other hand, in the case of using the MT software, the original German query was first input to the software. The software we used automatically executes German to English translation and then English to Italian translation (i.e., a kind of transitive translation). The resulting Italian text from the MT system was processed according to the procedure described in Section 4.2, and finally a set of normalized Italian words was obtained for each query. In the case of MT translation, only post-translation query expansion was executed, with the same procedure and parameters as in the case of dictionary-based translation.
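Returning to the dictionary-based route, the following is a minimal sketch of the transitive replacement step, assuming both dictionaries are simple mappings from a normalized headword to a list of normalized translations; the data structures and the function name are our own illustration:

```python
def translate_transitively(german_terms, de_en, en_it):
    """de_en, en_it: dicts mapping a normalized headword to a list of
    normalized translations. Returns the undisambiguated set T of Equation (1)."""
    italian_terms = set()
    for g in german_terms:
        english = de_en.get(g)
        if english is None:
            italian_terms.add(g)          # unknown word passed through unchanged
            continue
        for e in english:
            italian = en_it.get(e)
            if italian is None:
                italian_terms.add(e)      # unknown pivot word passed through unchanged
            else:
                italian_terms.update(italian)
    return italian_terms
```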
4.5 Search algorithm

The well-known Okapi formula [18] was used for computing each document score in all searches of this study, i.e.,

$z = \sum_{t \in \Omega} \dfrac{3.0\, x_t}{(0.5 + 1.5\, l / \bar{l}) + x_t} \times y_t \times \log \dfrac{N - n_t + 0.5}{n_t + 0.5}$,

where $z$ is the score of a particular document, $x_t$ is the frequency of occurrence of term $t$ in the document, $l$ is the document length, $\bar{l}$ is the average document length over the entire document collection, and $y_t$ is the frequency of occurrence of term $t$ in the query. It should be noted that the value of each $y_t$ is always fixed at the frequency of the corresponding original search term in the source query. Also, $\Omega$ is the set of query terms, i.e., in the first search $\Omega = T$, in the second search $\Omega = \tilde{T}$, and in the third search $\Omega = \tilde{T}'$. Finally, the documents were ranked in decreasing order of the values of $z$.
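A minimal sketch of this scoring formula, assuming a document is given as a term-frequency dictionary and that the collection statistics (N, document frequencies, average length) are precomputed; all names are illustrative only:

```python
import math

def okapi_score(doc_tf, doc_len, avg_len, query_weights, df, n_docs):
    """doc_tf: {term: frequency x_t in the document}; query_weights: {term: y_t};
    df: {term: n_t, document frequency in the collection}; n_docs: N.
    Returns the document score z of Section 4.5."""
    z = 0.0
    for t, y_t in query_weights.items():            # Omega: the current query term set
        x_t = doc_tf.get(t, 0)
        if x_t == 0:
            continue
        n_t = df.get(t, 0)
        tf_part = (3.0 * x_t) / ((0.5 + 1.5 * doc_len / avg_len) + x_t)
        idf_part = math.log((n_docs - n_t + 0.5) / (n_t + 0.5))
        z += tf_part * y_t * idf_part
    return z
```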
4.6 Type of runs executed

In CLEF 2003, we executed three runs (see Table 1), in which only a single field in each query was used.

Table 1  Runs submitted in CLEF 2003

ID        Translation method   Disambiguation   Expansion
NiiMt01   MT                   -                done
NiiDic01  Dictionary           done             done
NiiDic02  Dictionary           -                done

5. Results

5.1 Basic statistics

The Italian collections include 157,558 documents in total. The average document length is 181.86.

5.2 System error

Unfortunately, a non-trivial system error was detected after the submission of results: owing to a bug in our source code, only the last term within the set of search terms contributed to the calculation of the document scores. Inevitably, the search performance of all runs shown in Table 1 was very low.

5.3 Results of runs conducted after submission

Therefore, the authors corrected the source code and performed some search runs again after the submission of results to the CLEF organizers. Six types of runs were conducted, as shown in Table 2, which also indicates the mean average precision of each run, calculated using the relevance judgment file. Furthermore, the recall-precision curves of the six runs are presented in Figure 3. It should be noted that each value reported in Table 2 and Figure 3 was calculated over the 51 topics for which one or more relevant documents are included in the Italian collections.

Table 2  Mean average precision of runs executed after submission (51 topics)

Translation method                    With expansion by PRF   Without expansion
MT                                    .301                    .281
Dictionary, with disambiguation       .207                    .181
Dictionary, without disambiguation    .190                    .143

As shown in Table 2, MT significantly outperforms dictionary-based translation. Also, it turns out that the disambiguation technique based on term frequency moderately improves the effectiveness of the dictionary-based translation method, i.e., the mean average precision with disambiguation is .207, in comparison with .190 in the case of no disambiguation. In particular, Table 2 indicates that our two-stage refinement technique has a large effect on search performance, since the mean average precision of the search with no disambiguation and no expansion by PRF is only .143, which is significantly lower than the .207 obtained by searches with the two-stage refinement.

However, we can also point out that there is a large difference in performance between MT and the two-stage refinement. The reason may be the difference in quality and coverage between the commercial MT software and the free dictionaries downloaded from the Internet. Even if that is true, we need to modify the two-stage refinement method so that its performance level approaches that of the MT system.

For example, in Figure 3, at recall levels over 0.7, the searches with no disambiguation are conversely superior to those with disambiguation. This may be because our disambiguation method selects only one translation and consequently may remove some useful synonyms or related terms. A simple solution would be to choose two or more translations instead of applying Equation (2) directly. Although it is difficult to determine the optimal number of translations to be selected, keeping multiple translations for each source term may improve the recall of searches.

[Figure 3. Recall-precision curves of the six runs: MT with and without expansion; dictionary with disambiguation and expansion; dictionary with disambiguation and no expansion; dictionary with expansion and no disambiguation; dictionary with no expansion and no disambiguation.]

6. Concluding Remarks

This paper reported the results of our experiment on CLIR from German to Italian, in which English was used as a pivot language. In particular, a two-stage refinement of the query translation was employed for removing irrelevant terms in the target language produced by transitive translation using two bilingual dictionaries successively.

As a result, it turned out that
- our two-stage refinement method significantly improves the retrieval performance of bilingual IR using a pivot language, and
- its performance is nevertheless inferior to that of MT-based searches.

By choosing two or more search terms in the disambiguation stage, our method may become more effective.
References

[1] Ballesteros, L. A. & Croft, W. B. (1997). Phrasal translation and query expansion techniques for cross-language information retrieval. In Proceedings of the 20th ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 84-91).
[2] Ballesteros, L. & Croft, W. B. (1998). Resolving ambiguity for cross-language retrieval. In Proceedings of the 21st ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 64-71).
[3] Yamabana, K., Muraki, K., Doi, S. & Kamei, S. (1998). A language conversion front-end for cross-language information retrieval. In G. Grefenstette (Ed.), Cross-Language Information Retrieval (pp. 93-104). Boston, MA: Kluwer.
[4] Gao, J., Nie, J. Y., Xun, E. X., Zhang, J., Zhou, M. & Huang, C. (2001). Improving query translation for cross-language information retrieval using statistical models. In Proceedings of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 96-104).
[5] Lin, C. J., Lin, W. C., Bian, G. W. & Chen, H. H. (1999). Description of the NTU Japanese-English cross-lingual information retrieval system used for NTCIR workshop. In Proceedings of the First NTCIR Workshop on Research in Japanese Text Retrieval and Term Recognition. Tokyo: National Institute of Informatics. http://research.nii.ac.jp/ntcir/workshop/
[6] Sadat, F., Maeda, A., Yoshikawa, M. & Uemura, S. (2002). Query expansion techniques for the CLEF bilingual track. In C. Peters et al. (Eds.), Evaluation of Cross-Language Information Retrieval Systems: LNCS 2406 (pp. 177-184). Berlin: Springer.
[7] Adriani, M. (2002). English-Dutch CLIR using query translation techniques. In C. Peters et al. (Eds.), Evaluation of Cross-Language Information Retrieval Systems: LNCS 2406 (pp. 219-225). Berlin: Springer.
[8] Qu, Y., Grefenstette, G. & Evans, D. A. (2002). Resolving translation ambiguity using monolingual corpora: a report on Clairvoyance CLEF-2002 experiments. In Working Notes for the CLEF-2002 Workshop (pp. 115-126).
[9] Seo, H. C., Kim, S. B., Kim, B. I., Rim, H. C. & Lee, S. Z. (2003). KUNLP system for NTCIR-3 English-Korean cross-language information retrieval. In Proceedings of the Third NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering. Tokyo: National Institute of Informatics. http://research.nii.ac.jp/ntcir/workshop/
[10] Ballesteros, L. A. (2000). Cross-language retrieval via transitive translation. In W. B. Croft (Ed.), Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval (pp. 203-234). Boston, MA: Kluwer.
[11] Franz, M., McCarley, J. S. & Roukos, S. (1999). Ad hoc and multilingual information retrieval at IBM. In Proceedings of TREC-7. Gaithersburg, MD: National Institute of Standards and Technology. http://trec.nist.gov/pubs/
[12] Gey, F. C., Jiang, H., Chen, A. & Larson, R. R. (1999). Manual queries and machine translation in cross-language retrieval at TREC-7. In Proceedings of TREC-7. Gaithersburg, MD: National Institute of Standards and Technology. http://trec.nist.gov/pubs/
[13] Hiemstra, D. & Kraaij, W. (1999). Twenty-One at TREC-7: ad-hoc and cross-language track. In Proceedings of TREC-7. Gaithersburg, MD: National Institute of Standards and Technology. http://trec.nist.gov/pubs/
[14] Chen, A. & Gey, F. C. (2003). Experiments on cross-language and patent retrieval at NTCIR-3 workshop. In Proceedings of the Third NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering. Tokyo: National Institute of Informatics. http://research.nii.ac.jp/ntcir/workshop/
[15] Lin, W. C. & Chen, H. H. (2003). Description of NTU approach to multilingual information retrieval. In Proceedings of the Third NTCIR Workshop on Research in Information Retrieval, Automatic Text Summarization and Question Answering. Tokyo: National Institute of Informatics. http://research.nii.ac.jp/ntcir/workshop/
[16] Gollins, T. & Sanderson, M. (2001). Improving cross language information retrieval with triangulated translation. In Proceedings of the 24th ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 90-95).
[17] Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
[18] Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M. & Gatford, M. (1995). Okapi at TREC-3. In Proceedings of TREC-3. Gaithersburg, MD: National Institute of Standards and Technology. http://trec.nist.gov/pubs/