Merging Mechanisms in Multilingual Information Retrieval

Wen-Cheng Lin and Hsin-Hsi Chen
Department of Computer Science and Information Engineering
National Taiwan University
Taipei, TAIWAN, R.O.C.
E-mail: denislin@nlg.csie.ntu.edu.tw; hh_chen@csie.ntu.edu.tw

Abstract

The Natural Language Processing Laboratory (NLPL) of National Taiwan University (NTU) participated in the MLIR task of CLEF 2002 and submitted five official multilingual runs. In this paper, we address the collection fusion problem and experiment with several strategies for merging the results of the intermediate runs.

1. Introduction

Multilingual Information Retrieval (MLIR) [4] uses a query in one language to retrieve documents in different languages. A multilingual data collection is a set of documents written in different languages. There are two types of multilingual data collection. The first contains several monolingual document collections. The second consists of multilingual-documents, i.e., documents written in two or more languages. Some multilingual-documents have a major language, i.e., most of the document is written in one language. For example, a document may be written in Chinese while its abstract is in English; this document is a multilingual-document and Chinese is its major language. The different languages in a multilingual-document may differ in significance. For example, the English translation of a Chinese proper noun is a useful clue when English queries are used to retrieve Chinese documents, so the English translation should receive a higher weight. Figure 1 shows these two types of multilingual data collections.

In MLIR, queries and documents are in different languages. We can translate the queries, translate the documents, or translate both queries and documents into an intermediate language to unify the languages of queries and documents. Figure 2 shows some MLIR architectures when query translation is adopted. The front-end controller processes queries, translates them, submits the translated queries to the monolingual IR systems, collects the relevant document lists reported by the IR systems, and merges them. Figure 3 shows document translation architectures.

[Fig 1. Multilingual Data Collections: (a) a set of monolingual document collections; (b) some types of multilingual-documents]

[Fig 2. MLIR Architectures when Query Translation is Adopted]

[Fig 3. Document Translation Architectures]

In addition to the language barrier, how to construct a single ranked list that contains documents in different languages drawn from several text collections is also a critical issue. There are two architectures in MLIR: centralized and distributed. The first two architectures in Figure 2 and the first architecture in Figure 3 are distributed architectures; the remaining architectures in Figures 2 and 3 are centralized architectures. In a centralized architecture, one huge collection that contains documents in different languages is used.
In a distributed architecture, documents in different languages are indexed and retrieved separately, and the results of the individual runs are merged into one multilingual ranked list. Several merging strategies have been proposed. Raw-score merging selects documents based on their original similarity scores. Normalized-score merging normalizes the similarity score of each document and sorts all documents by the normalized score; for each topic, the similarity score of each document is divided by the maximum score for that topic. Round-robin merging interleaves the results of the intermediate runs. In this experiment, we adopted the distributed architecture and proposed a merging strategy to merge the result lists.

The rest of this paper is organized as follows. Section 2 describes the indexing method. Section 3 shows the query translation process. Section 4 describes our merging strategies. Section 5 shows the experimental results. Section 6 gives concluding remarks.

2. Indexing

The document set used in the CLEF 2002 MLIR task consists of English, French, German, Spanish and Italian documents. The numbers of documents in the English, French, German, Spanish and Italian document sets are 113,005, 87,191, 225,371, 215,738 and 108,578, respectively. The IR model we used is the basic vector space model. Documents and queries are represented as term vectors, and the cosine vector similarity formula is used to measure the similarity between a query and a document. The term weighting function is tf*idf. Appropriate terms are extracted from each document in the indexing stage. In the experiment, the <HEADLINE> and <TEXT> sections of the English documents were used for indexing. For the Spanish documents, the <TITLE> and <TEXT> sections were used. When indexing the French, German and Italian documents, the <TITLE>, <TEXT>, <TI>, <LD> and <TX> sections were used. The words in these sections were stemmed, and stopwords were removed. The stopword lists and stemmers were developed by the University of Neuchatel (they are available at http://www.unine.ch/info/clef/) [5].

3. Query translation

In the experiment, the English queries were used as source queries and translated into the target languages, i.e., French, German, Spanish and Italian. In our previous experiments [1, 3], we used the CO model to translate queries. The CO model uses word co-occurrence information trained from a target language text collection to disambiguate the translations of query terms. In this experiment, we did not have enough time to train word co-occurrence information for the languages used in the CLEF 2002 MLIR task, so we used a simpler, dictionary-based approach. For each English query term, we found its translation equivalents by looking it up in a dictionary and selected the first two translation equivalents as target language query terms. The dictionaries we used are the Ergane English-French, English-German, English-Spanish and English-Italian dictionaries (available at http://www.travlang.com/Ergane). There are 8,839, 9,046, 16,936 and 4,595 terms in the Ergane English-French, English-German, English-Spanish and English-Italian dictionaries, respectively.
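As a minimal illustration of this dictionary-based approach, the following sketch selects the first two translation equivalents of each query term. The function name, the data layout (a dictionary mapping an English term to an ordered list of equivalents) and the toy entries are assumptions for the example, not part of our system.

```python
from typing import Dict, List

def translate_query(english_terms: List[str],
                    bilingual_dict: Dict[str, List[str]],
                    max_equivalents: int = 2) -> List[str]:
    """Translate an English query term by term.

    For each term, look up its translation equivalents and keep only the
    first max_equivalents of them (two in the experiments reported here).
    Terms not found in the dictionary are unknown words and yield nothing.
    """
    target_terms: List[str] = []
    for term in english_terms:
        equivalents = bilingual_dict.get(term.lower(), [])
        target_terms.extend(equivalents[:max_equivalents])
    return target_terms

# Toy English-French dictionary fragment, made up purely for illustration.
toy_dict = {"bank": ["banque", "rive", "talus"], "river": ["fleuve", "riviere"]}
print(translate_query(["river", "bank"], toy_dict))
# -> ['fleuve', 'riviere', 'banque', 'rive']
```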
4. Merging strategies

There are two architectures in MLIR, i.e., centralized and distributed. In a centralized architecture, the document collections in different languages are viewed as a single collection and indexed in one huge index file. The advantage of a centralized architecture is that it avoids the merging problem: only one retrieval phase is needed to produce a result list that contains documents in different languages. One problem of the centralized architecture is that the weights of index terms are inflated. The total number of documents increases, but the number of documents containing a given term does not, so the idf of the term, and hence its weight, increases. This phenomenon is more acute for small text collections. For example, N in the idf formula is 87,191 when only the French collection is used, but it increases to 749,883, i.e., about 8.6 times larger, when the five document collections are merged. Comparatively, N for German index terms increases only 3.33 times. The increases in weights are thus unbalanced across document collections of different sizes, which biases the retrieval results toward documents from small collections.

The second architecture is distributed MLIR. Documents in different languages are indexed and retrieved separately, and the ranked lists of all monolingual and cross-lingual runs are merged into one multilingual ranked list. How to merge the result lists is the key problem, and recent work has proposed various approaches to deal with it. A simple method is raw-score merging, which sorts all results by their original similarity scores and then selects the top ranked documents. Raw-score merging is based on the assumption that similarity scores across collections are comparable; however, collection-dependent statistics in document or query weights invalidate this assumption [2, 6]. Another approach, round-robin merging, interleaves the results based on rank. It assumes that each collection contains approximately the same number of relevant documents and that the distribution of relevant documents is similar across the result lists. In reality, different collections do not contain equal numbers of relevant documents, so the performance of round-robin merging may be poor. The third approach is normalized-score merging. For each topic, the similarity score of each document is divided by the maximum score for that topic; after this adjustment, all results are put into a pool and sorted by the normalized score. This approach maps the similarity scores of the different result lists into the same range, from 0 to 1, and makes the scores more comparable. But it has a problem: if the maximum score in a result list is much higher than the second one, the normalized score of the document at rank 2 will be low even if its original score is high. The final rank of this document may then be lower than that of top ranked documents with very low but similar original scores in another result list.
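For reference, the three baseline merging schemes discussed above can be sketched as follows. This is a minimal sketch, assuming each intermediate run is a list of (document id, score) pairs already sorted by decreasing similarity score; the function and variable names are illustrative, not taken from any existing system.

```python
from typing import List, Tuple

RunList = List[Tuple[str, float]]  # (doc_id, similarity score), sorted by score

def raw_score_merge(runs: List[RunList], top_n: int = 1000) -> RunList:
    """Pool all results and sort by the original similarity scores."""
    pool = [entry for run in runs for entry in run]
    return sorted(pool, key=lambda e: e[1], reverse=True)[:top_n]

def round_robin_merge(runs: List[RunList], top_n: int = 1000) -> RunList:
    """Interleave the runs by rank: all rank-1 documents first, then rank-2, ..."""
    merged: RunList = []
    rank = 0
    while len(merged) < top_n and any(rank < len(run) for run in runs):
        for run in runs:
            if rank < len(run):
                merged.append(run[rank])
        rank += 1
    return merged[:top_n]

def normalized_score_merge(runs: List[RunList], top_n: int = 1000) -> RunList:
    """Divide each score by the maximum score of its run, then pool and sort."""
    pool: RunList = []
    for run in runs:
        if not run or run[0][1] <= 0:
            continue
        max_score = run[0][1]  # runs are sorted, so rank 1 holds the maximum
        pool.extend((doc, score / max_score) for doc, score in run)
    return sorted(pool, key=lambda e: e[1], reverse=True)[:top_n]
```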
A similarity score reflects the degree of similarity between a document and a query, and a document with a higher similarity score appears more relevant to the query. However, if the query is not formulated well, e.g., because of an inappropriate translation, a document with a high score may still fail to meet the user's information need. When merging results, such high-scoring documents should not be included in the final result list. Thus, we have to consider the effectiveness of each individual run in the merging stage. The basic idea of our merging strategy is to adjust the similarity scores of the documents in each result list so that they become more comparable and reflect the confidence in each run. The similarity scores are adjusted by the following formula:

    \hat{S}_{ij} = S_{ij} \times \frac{1}{\bar{S}_k} \times W_i    (1)

where S_ij is the original similarity score of the document at rank j in the ranked list of topic i, Ŝ_ij is the adjusted similarity score of that document, S̄_k is the average similarity score of the top k documents, and W_i is the weight of query i in a cross-lingual run.

We divide the score adjusting process into two steps. First, we use a modified score normalization method: the original score of each document is divided by the average score of the top k documents instead of the maximum score. We call this normalized-by-top-k. Second, the normalized score is multiplied by a weight that reflects the retrieval effectiveness of the topic in each text collection. Since the retrieval performance is not known in advance, we have to estimate the performance of each run. For each language pair, the queries are translated into the target language and the target language documents are then retrieved. A better translation should lead to better retrieval performance, so we can predict the retrieval performance from the translation quality. Two factors affect the translation quality: the degree of translation ambiguity and the number of unknown words. For each query, we compute the average number of translation equivalents of the query terms and the number of unknown words in each language pair, and use them to compute the weight of each cross-lingual run. The weight is determined by the following formula:

    W_i = c_1 + c_2 \times \frac{1}{T_i} + c_3 \times \left(1 - \frac{U_i}{n_i}\right)    (2)

where W_i is the weight of query i in a cross-lingual run, T_i is the average number of translation equivalents of the query terms in query i, U_i is the number of unknown words in query i, n_i is the number of query terms in query i, and c_1, c_2 and c_3 are tunable parameters with c_1 + c_2 + c_3 = 1.
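The following sketch puts formulas (1) and (2) together under the same assumptions as the previous sketch (sorted (doc_id, score) lists; illustrative names and data layout). It is a sketch of the procedure described above, not our actual implementation; the translation statistics for formula (2) are passed in explicitly.

```python
from typing import List, Tuple

RunList = List[Tuple[str, float]]  # (doc_id, similarity score), sorted by score

def run_weight(avg_translations: float, unknown: int, n_terms: int,
               c1: float, c2: float, c3: float) -> float:
    """Formula (2): W_i = c1 + c2 * (1 / T_i) + c3 * (1 - U_i / n_i)."""
    return c1 + c2 * (1.0 / avg_translations) + c3 * (1.0 - unknown / n_terms)

def adjust_scores(run: RunList, weight: float, k: int = 100) -> RunList:
    """Formula (1): divide each score by the average of the top-k scores
    (normalized-by-top-k), then multiply by the run weight W_i."""
    top_k = [score for _, score in run[:k]]
    avg_top_k = sum(top_k) / len(top_k) or 1.0  # guard against an all-zero prefix
    return [(doc, score / avg_top_k * weight) for doc, score in run]

def merge_weighted(runs: List[RunList], weights: List[float],
                   k: int = 100, top_n: int = 1000) -> RunList:
    """Adjust every run, pool the results and keep the top_n by adjusted score."""
    pool: RunList = []
    for run, w in zip(runs, weights):
        if run:
            pool.extend(adjust_scores(run, w, k))
    return sorted(pool, key=lambda e: e[1], reverse=True)[:top_n]
```

Setting every weight W_i to 1 reduces this to plain normalized-by-top-k merging; the official runs described in the next section correspond to different choices of normalization and weights.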
5. Results

We submitted five multilingual runs, all using the title and description fields. The five multilingual runs use English topics as source queries. The English topics were translated into French, German, Spanish and Italian, and the source English topics together with the translated French, German, Spanish and Italian topics were used to retrieve the corresponding document collections. We then merged the five result lists, using a different merging strategy for each multilingual run:

1. NTUmulti01: The result lists were merged with the normalized-score merging strategy. The maximum similarity score was used for normalization. After normalization, all results were put into a pool and sorted by the adjusted score, and the top 1000 documents were selected as the final results.

2. NTUmulti02: In this run, we used the modified normalized-score merging method. The average similarity score of the top 100 documents was used for normalization. We did not consider the performance drop caused by query translation, i.e., the weight Wi in formula (1) was 1 for every intermediate run.

3. NTUmulti03: First, the similarity scores of the documents were normalized using the maximum similarity score. Then we assigned a weight Wi to each intermediate run, determined by formula (2) with c1, c2 and c3 set to 0, 0.4 and 0.6, respectively.

4. NTUmulti04: We used formula (1) to adjust the similarity score of each document, with the average similarity score of the top 100 documents used for normalization. The weight Wi was determined by formula (2) with c1, c2 and c3 set to 0, 0.4 and 0.6, respectively.

5. NTUmulti05: The merging strategy of this run is similar to that of run NTUmulti04, except that each intermediate run was assigned a constant weight. The weights assigned to the English-English, English-French, English-German, English-Italian and English-Spanish intermediate runs were 1, 0.7, 0.4, 0.6 and 0.6, respectively.

The results of our official runs are shown in Table 1. The performance of normalized-score merging is poor: the average precision of run NTUmulti01 is 0.0173. With our modified normalized-score merging strategy, the performance is better, and the average precision increases to 0.0266. Runs NTUmulti03 and NTUmulti04 take into account the performance drop caused by query translation. Table 2 shows the unofficial evaluation of the intermediate monolingual and cross-lingual runs. The performance of the English monolingual run is much better than that of the cross-lingual runs, so the cross-lingual runs should receive lower weights when merging results. The results show that the performance is improved by decreasing the importance of ineffective cross-lingual runs: the average precisions of runs NTUmulti03 and NTUmulti04 are 0.0336 and 0.0373, which are better than those of runs NTUmulti01 and NTUmulti02. Run NTUmulti05 assigned constant weights to the intermediate runs; its performance is slightly worse than that of run NTUmulti04.

None of our official runs performed well, because the performance of the cross-lingual runs is very poor. In the experiments, we did not disambiguate the senses of query terms when translating the queries, and the bilingual dictionaries we used contain too few words, so the queries were not translated well and the cross-lingual runs performed poorly. With larger dictionaries and word sense disambiguation, the performance should be better.

In order to compare the effectiveness of different merging strategies, we also conducted several unofficial runs:

1. ntu-multi-raw-score: We used raw-score merging to merge the result lists.

2. ntu-multi-round-robin: We used round-robin merging to merge the result lists.

3. ntu-multi-centralized: This run adopted the centralized architecture. All document collections were indexed in one index file. The topics contained the source English query terms together with the translated query terms.

Table 1. The Results of Official Runs

Run          Average Precision   Recall
NTUmulti01   0.0173              1083 / 8068
NTUmulti02   0.0266              1135 / 8068
NTUmulti03   0.0336              1145 / 8068
NTUmulti04   0.0373              1195 / 8068
NTUmulti05   0.0361              1209 / 8068

Table 2. The Results of Intermediate Runs

Run               # Topics   Average Precision   Recall
English-English   42         0.2722              741 / 821
English-French    50         0.0497              490 / 1383
English-German    50         0.0066              201 / 1938
English-Italian   49         0.0540              426 / 1072
English-Spanish   50         0.0073              223 / 2854

Table 3. The Results of Unofficial Runs

Run                     Average Precision   Recall
ntu-multi-raw-score     0.0381              1180 / 8068
ntu-multi-round-robin   0.0224              1165 / 8068
ntu-multi-centralized   0.0398              1413 / 8068

The results of the unofficial runs are shown in Table 3. The performance of raw-score merging is good, probably because we use the same IR model and term weighting scheme for all text collections. The performance of round-robin merging is poor. The best run is ntu-multi-centralized, which indexes all documents in the different languages together. In this run, most of the top ranked documents are English, French or Italian for most topics. The performance of English monolingual retrieval is much better than that of the other runs.
The average precisions of the English-German and English-Spanish cross-lingual runs are quite low, so the result list should not contain too many German and Spanish documents. The over-weighting phenomenon in the centralized architecture increases the scores of the French, Italian and English documents, so more French, Italian and English documents are included in the result list of run ntu-multi-centralized, which improves its performance. If German or Spanish queries were used as source queries, the performance of the centralized architecture might not be as good.

6. Concluding Remarks

In this experiment, we proposed several merging strategies to integrate the result lists of collections in different languages. The results showed that the performance of our merging strategies was similar to that of raw-score merging and better than those of normalized-score and round-robin merging. The performance of run NTUmulti05 was good, but the weights of the English-German and English-Spanish runs were not low enough; if we decreased the weights of these two runs, the performance should improve further. The results showed that better performance can be gained by adjusting the similarity scores appropriately. Similar results were also observed in the NTCIR multilingual IR task. How to determine appropriate weights is an important issue. Taking the degree of translation ambiguity into account, i.e., lowering the weights of runs whose query terms are translated more ambiguously, improves performance to some extent. The centralized approach performed well. We will conduct more experiments to find out whether the centralized approach still works when German or Spanish queries are used as source queries.

References

[1] Chen, H.H., Bian, G.W., and Lin, W.C., 1999. Resolving translation ambiguity and target polysemy in cross-language information retrieval. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, Maryland, June 1999. Association for Computational Linguistics, 215-222.

[2] Dumais, S.T., 1992. LSI meets TREC: A Status Report. In Proceedings of the First Text REtrieval Conference (TREC-1), Gaithersburg, Maryland, November 1992. NIST Publication, 137-152.

[3] Lin, W.C. and Chen, H.H., 2002. NTU at NTCIR3 MLIR Task. In Working Notes for the NTCIR-3 Workshop, 2002.

[4] Oard, D.W. and Dorr, B.J., 1996. A Survey of Multilingual Text Retrieval. Technical Report UMIACS-TR-96-19, University of Maryland, Institute for Advanced Computer Studies.

[5] Savoy, J., 2001. Report on CLEF-2001 Experiments: Effective Combined Query-Translation Approach. In Evaluation of Cross-Language Information Retrieval Systems, Lecture Notes in Computer Science, Vol. 2406, Darmstadt, Germany, September 2001. Springer, 27-43.

[6] Voorhees, E.M., Gupta, N.K., and Johnson-Laird, B., 1995. The Collection Fusion Problem. In Proceedings of the Third Text REtrieval Conference (TREC-3), Gaithersburg, Maryland, November 1994. NIST Publication, 95-104.