Merging Different Languages in a Single Document Collection

Jian-Yun Nie, Fuman Jin
Laboratoire RALI, Département d'Informatique et Recherche opérationnelle, Université de Montréal
C.P. 6128, succursale Centre-ville, Montréal, Québec, H3C 3J7 Canada
{nie, jinf}@iro.umontreal.ca

Abstract. Multilingual IR is usually carried out on separate collections, one per language. Once a set of answers has been found in each language, all the sets have to be merged to produce a single answer list. In our CLEF 2002 experiments, we try a different approach, in which the documents in all the languages are mixed in the same collection. Index terms are associated with a language tag so as to distinguish homographs across languages. Indexing and retrieval can then be done once for all the documents, and no result merging is required. This report describes our first tests in CLEF 2002.

1. Introduction

Most current approaches to CLIR make a clear separation between languages. For example, the following schema has been used in most previous studies:

1. Query translation: translate the query from the source language into each target language;
2. Document retrieval: use each translated query to retrieve documents in the same language as the translation;
3. Result merging: merge the results produced in the different languages into a single result list.

We can note two facts about this approach:

- Different languages are processed separately. It is assumed that the documents in each language form a separate document collection, which makes a merging step necessary.
- The clear separation between languages makes it difficult to compare the results obtained in different languages.

Previous studies on result merging [Rasolofo et al. 2001] clearly showed that it is difficult for a retrieval-then-merging approach to reach the level of effectiveness obtained by retrieval on a single unified collection. A better approach is therefore to treat all the documents, whatever their language, as one document collection. In so doing, we avoid the result-merging step, which seems to generate additional problems.

2. Our approach

As retrieval on a single document collection usually performs better than separate retrieval followed by merging, we can consider putting all the documents into one collection (provided the whole volume can be handled by a centralized IR system). The difference between languages can be marked with a language tag attached to each index term; for example, the French index term "chaise" becomes "chaise_f". When a query is translated into the different languages, one large query "translation" is created that contains index terms in all the languages; a single query may thus contain "chaise_f", "chaire_f", "chair_e", and so on. The CLIR problem is then no longer different from a monolingual IR problem.

One advantage of this approach is that the weights of index terms in different languages may be more comparable, because they are determined in the same way (although the weights may still be unbalanced because of the unbalanced occurrences of index terms in the document collection). Another advantage is the removal of the problematic merging process: the retrieval result naturally contains answers in different languages, and one may expect higher effectiveness than in the previous experiments on result merging.
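To make the tagging concrete, the following is a minimal sketch of language-tagged term construction. It is an illustration only: it assumes NLTK's Snowball stemmers and invents its own helper names, whereas our actual implementation relies on SMART with a per-language stemmer.

# Minimal sketch of language-tagged term construction (illustration only).
from nltk.stem.snowball import SnowballStemmer

# One-letter language tags, as used in our experiments.
TAGS = {"english": "_e", "french": "_f", "italian": "_i",
        "german": "_g", "spanish": "_s"}
STEMMERS = {lang: SnowballStemmer(lang) for lang in TAGS}

def tag_terms(tokens, language):
    """Stem each token and append the language tag to the stem."""
    stemmer, tag = STEMMERS[language], TAGS[language]
    return [stemmer.stem(token.lower()) + tag for token in tokens]

print(tag_terms(["chaise"], "french"))   # e.g. ['chais_f']
print(tag_terms(["chair"], "english"))   # e.g. ['chair_e']

Because the French "chaise" and the English "chair" yield distinct tagged terms, homographs across languages no longer collide in the mixed index.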
Finally, we believe that this approach contributes to lowering the barrier between languages. Documents in different languages often co-exist in the same collection; by separating them, we artificially amplify the difference between them. In fact, the difference between languages is no greater (and likely smaller) than the difference between subject areas, and in monolingual IR, documents from different areas are routinely grouped into the same collection. Why not, then, group documents in different languages into the same collection? In some cases (e.g. on the Web), they naturally appear together.

The approach we implemented follows these steps:

- Language identification: In the CLEF experiments, the language of each sub-collection is clearly identified, so we do not need an automatic language identifier; the language is indicated manually.
- Language-dependent preprocessing: Stop words in each language are removed separately; each word is stemmed with the appropriate stemmer for its language; and the stems are associated with the language tags _f, _e, _i, _g and _s, respectively (as in the tagging sketch above).
- Indexing of the mixed document collection: All the documents are indexed with the SMART system. Index terms are weighted according to the usual tf*idf scheme (see the first sketch at the end of this section):

  tf(t, d) = log(freq(t, d) + 1); idf(t) = log(N / n(t)),

  where N is the total number of documents in the mixed collection and n(t) is the number of documents containing term t.

- Query translation: The original (English) query is translated separately into French, Italian, German and Spanish. The translation words are stemmed and associated with the appropriate language tag, as for document indexing. All the translation words are then put together, with the original query, to form a single multilingual translation. Each translation word is associated with its translation probability, which is taken as the weight of that word. The problem we encountered is how to weight the original query words with respect to the translation words. We tried several alternatives: giving each original word the weight 1, or the weight 1/n (where n is the number of words in the query). We also tried a third solution: give the weight 1 to the original query words while normalizing the weights of the translation words so that the maximal translation probability for each language is 1 (see the second sketch at the end of this section).
- Retrieval: Retrieval is performed exactly as in monolingual retrieval; the output is a single list of documents in different languages.

Query translation is performed by statistical translation models trained on parallel Web pages mined with PTMiner [Chen and Nie 2000]. For English-French, English-Italian and English-German, we use the same models as last year [Nie and Simard 2002]; we added an English-Spanish model this year.
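The first sketch below illustrates the tf*idf weighting over the mixed, language-tagged collection. It is a simplified illustration, not the SMART implementation we actually use; the function and document layout are invented for this example.

# Sketch of tf*idf over the mixed, language-tagged collection (illustration only).
import math
from collections import Counter

def index_collection(docs):
    """docs: one list of tagged terms per document, in any mix of languages."""
    N = len(docs)  # size of the whole mixed collection
    df = Counter(term for doc in docs for term in set(doc))  # n(t)
    index = []
    for doc in docs:
        freq = Counter(doc)
        index.append({term: math.log(freq[term] + 1) * math.log(N / df[term])
                      for term in freq})
    return index

# Documents in different languages share a single idf computation.
mixed = [["chais_f", "tabl_f"], ["chair_e", "tabl_e"], ["stuhl_g"]]
print(index_collection(mixed))

Because idf is computed over the whole mixed collection, a term's weight reflects its rarity among all documents, not only those of its own language.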
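The second sketch illustrates the three query-weighting alternatives described in the query-translation step. Again the names and data layout are ours, and the translation probabilities in the example are invented.

# Sketch of the three query-weighting schemes (illustration only).
def weight_query(original_terms, translations, scheme):
    """original_terms: tagged original query terms, e.g. ['chair_e'];
    translations: {language: {tagged_term: translation probability}};
    scheme: 'one' | 'one_over_n' | 'normalized'."""
    if scheme == "one_over_n":
        source_weight = 1.0 / len(original_terms)   # original words get 1/n
    else:
        source_weight = 1.0                         # original words get 1
    weights = {term: source_weight for term in original_terms}
    for lang, probs in translations.items():
        # Under 'normalized', rescale each language so that its maximal
        # translation probability becomes 1.
        scale = 1.0 / max(probs.values()) if scheme == "normalized" else 1.0
        for term, prob in probs.items():
            weights[term] = prob * scale
    return weights

query = weight_query(["chair_e"],
                     {"french": {"chaise_f": 0.6, "chaire_f": 0.2},
                      "spanish": {"silla_s": 0.5}},
                     scheme="normalized")
# -> {'chair_e': 1.0, 'chaise_f': 1.0, 'chaire_f': 0.33..., 'silla_s': 1.0}

These weights serve as the initial query-term weights and are combined with the idf factor at retrieval time.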
3. Results of our preliminary experiments

None of the weighting methods we tried works well. With all three solutions, we observed that one language very often dominates the mixed result list: the first 100 documents retrieved are almost all in that language. This shows that we did not reach the balance between languages that we expected. Our intention in mixing the documents into a single collection was to create a better balance between languages through the use of a common weighting scheme; the result, however, is disappointing. Several reasons are possible:

- The translation models are trained on different parallel corpora, whose size and coverage differ. This may result in translations of different quality in different languages: the translation of an English word may be concentrated on a few words in one language but spread more widely in another, which often makes the translation probabilities incomparable.
- The weights we attributed to the original query words are not reasonable. We only tested a few simple solutions, which may not be the most appropriate ones.
- Finally, in our current approach, query translation is still done independently of the retrieval process. We believe that translation and retrieval should be considered together, as suggested in [Nie 2002]: both the translation and the retrieval steps are uncertain, and when the two steps are separate, their uncertainties have to be integrated in a principled way. Here, we only used the translation probabilities as the initial weights of the query words, which are then combined with the idf factor to arrive at the final weights. This simplistic method could be greatly improved to better integrate the two steps, as in [Gao et al. 2001] and [Xu et al. 2001].

Among the three weighting schemes tested, the one giving the original query words the weight 1/n seems to produce the best results. These results, however, are still far below the average performance of the CLEF participants, as Figure 1 shows.

Fig. 1. The best performance we obtained in CLEF 2002.

4. Remarks

Despite the disappointing results of these tests, we still believe that the basic idea of mixing documents in the same collection is reasonable; our current implementation is simply too crude to show the true potential of the method. In future experiments, we will implement the idea more carefully, and the translation probabilities will be integrated more tightly with the retrieval process.

References

[Chen and Nie 2000] Chen, J., Nie, J.-Y.: Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. Proc. ANLP, Seattle (2000), pp. 21-28.
[Gao et al. 2001] Gao, J., Nie, J.-Y., Xun, E., Zhang, J., Zhou, M., Huang, C.: Improving query translation for cross-language information retrieval using statistical models. SIGIR 2001, pp. 96-104.
[Nie and Simard 2002] Nie, J.-Y., Simard, M.: Using statistical translation models for bilingual IR. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds.): Evaluation of Cross-Language Information Retrieval Systems, CLEF 2001. LNCS 2406, Springer (2002), pp. 137-150.
[Nie 2002] Nie, J.-Y.: Towards a unified approach to CLIR and multilingual IR. Workshop on Cross-Language Information Retrieval: A Research Roadmap, 25th ACM-SIGIR, Tampere, Finland (2002), pp. 7-14.
[Rasolofo et al. 2001] Rasolofo, Y., Abbaci, F., Savoy, J.: Approaches to collection selection and results merging for distributed information retrieval. CIKM 2001, Atlanta (2001), pp. 191-198.
[Xu et al. 2001] Xu, J., Weischedel, R., Nguyen, C.: Evaluating a probabilistic model for cross-lingual information retrieval. SIGIR 2001, pp. 105-110.