<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Chemnitz at CLEF 2009 Ad-Hoc TEL Task: Combining Different Retrieval Models and Addressing the Multilinguality</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jens Kursten</string-name>
          <email>jens.kuersten@cs.tu-chemnitz.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>09107 Chemnitz</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Chemnitz University of Technology</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Faculty of Computer Science, Dept. Computer Science and Media</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we report on our participation in the CLEF 2009 Ad-Hoc TEL task. In our second participation we were able to test and evaluate a new feature of the Xtrieval framework, namely the accessibility of the three core retrieval engines Lucene, Lemur and Terrier. This year we submitted 24 experiments in total, 12 each for the monolingual and bilingual subtasks. We compared our baseline experiments to combined runs, where we used two different retrieval models, namely the vector space model (VSM) used in Lucene and the Bose-Einstein model for randomness (BB2) available in the Terrier framework. We found that an almost constant improvement in terms of mean average precision over all provided collections is achievable. Furthermore, we tried to benefit from the multilingual contents of the collections by running combined multilingual experiments for both subtasks. The evaluation showed that this approach achieves small improvements in the monolingual setting of the task. Unfortunately, we were not able to confirm this finding in the bilingual setting, where the multilingual experiments were outperformed by the standard bilingual runs, especially on the English target collection.</p>
      </abstract>
      <kwd-group>
        <kwd>Evaluation</kwd>
        <kwd>Experimentation</kwd>
        <kwd>Cross-Language Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction and outline</title>
      <p>
        The Xtrieval framework [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] was used to prepare and run this year's retrieval experiments in
the Ad-Hoc track TEL setting. The core retrieval functionality was provided by Lucene1 and
the Terrier framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For the TEL task three different multilingual corpora with content
mainly in German, English and French were provided by The European Library. Each collection
consists of approximately one million library records. These library records only contain sparse
information and have descriptions in multiple languages.
We conducted monolingual experiments on each of the collections and also submitted experiments
for the bilingual task. For the translation of the topics the Google AJAX language API2 was
accessed through a JSON3 programming interface.
      </p>
      <p>The remainder of the paper is organized as follows. Section 2 describes the general setup of our
system. The individual configurations and the results of our submitted experiments are presented
in section 3. In sections 4 and 5 we summarize the results and conclude our observations.</p>
    </sec>
    <sec id="sec-2">
      <title>Experimental setup</title>
      <p>
        This year we were able to choose from various retrieval models and combine the results in the
retrieval stage by applying our implementation of the Z-Score operator [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We also used a standard
top-k pseudo-relevance feedback algorithm in the retrieval stage, where the number of most
frequent terms obtained from the top-ranked documents differed according to the language and
retrieval model used. We used the vector space model (VSM) shipped with Lucene and the
Bose-Einstein model for randomness (BB2) available in the Terrier framework. We submitted two
monolingual baseline runs for all provided collections. Additionally we submitted one monolingual
merged experiment and another one in which we tried to benefit from the multilingual character
of the collections. The merged monolingual experiments for each collection formed the baseline for
two bilingual experiments, where the topics were translated from two different source languages
to the corresponding target collection. For two additional bilingual experiments on each target
collection we also tried to access the multilingual content of the collections.
      </p>
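The Z-Score combination of result lists from different retrieval models can be sketched as follows. This is a minimal illustration of the technique described by Savoy [5], not the Xtrieval implementation; the run names, document IDs and scores are invented. Each run's raw scores are z-score normalised (so VSM and BB2 scores become comparable) and then summed per document:

```python
from statistics import mean, stdev

def z_score_fusion(runs, weights=None):
    # runs: {run_name: {doc_id: raw_score}} -- one score dict per retrieval model
    weights = weights or {name: 1.0 for name in runs}
    combined = {}
    for name, scores in runs.items():
        values = list(scores.values())
        mu = mean(values)
        sigma = stdev(values) if len(values) > 1 else 1.0
        sigma = sigma or 1.0  # guard against all-equal scores
        for doc, s in scores.items():
            # normalise each run's score before the weighted sum
            combined[doc] = combined.get(doc, 0.0) + weights[name] * (s - mu) / sigma
    # final ranking: highest combined z-score first
    return sorted(combined, key=combined.get, reverse=True)

# invented example scores from two hypothetical runs
lucene = {"d1": 7.2, "d2": 5.1, "d3": 2.4}
terrier = {"d2": 13.0, "d3": 9.5, "d4": 1.2}
ranking = z_score_fusion({"vsm": lucene, "bb2": terrier})
```

Because the normalisation removes each run's scale, a document that ranks well in both runs (d2 here) can overtake documents that score highly in only one.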
      <p>We submitted 9 experiments in which we tried to benefit from the multilingual character of
the collections. Therefore we created multiple indexes for each target collection using appropriate
stemming and stopword removal for the four most frequent languages. During the retrieval we
queried these four indexes and combined the results into one final result list. We needed to translate
the topics for all of those experiments into the language of the corresponding index, which gives
these experiments a multilingual character. In table 1 we denote the experiments that had multilingual
character and present the boost values for the combination in the multilingual result set for each of
the experiments in column 'IDs'. These values were chosen according to the occurrence frequency
of the language in the corresponding target collection. All runs in the column 'IDs' correspond to
an experiment in column 'refer ID' and are directly comparable to this experiment, because we
used identical system configurations except for the translation component and the multilingual
indexes.
2http://code.google.com/apis/ajaxlanguage/documentation
3http://json.org</p>
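The boosted merge over the four language-specific indexes can be sketched as follows. The boost values and scores here are invented for illustration (the actual boosts appear in table 1); the idea is that each language's result list contributes in proportion to how frequent that language is in the target collection:

```python
def multilingual_merge(results_per_language, boosts):
    # results_per_language: {lang: {doc_id: score}} -- one result list per index
    # boosts: per-language weights reflecting the language's share of the collection
    merged = {}
    for lang, scores in results_per_language.items():
        for doc, s in scores.items():
            merged[doc] = merged.get(doc, 0.0) + boosts[lang] * s
    return sorted(merged, key=merged.get, reverse=True)

boosts = {"en": 1.0, "de": 0.4, "fr": 0.3, "pl": 0.1}  # invented example weights
results = {
    "en": {"d1": 0.9, "d2": 0.1},
    "de": {"d1": 0.2, "d2": 0.8},
    "fr": {"d3": 0.7},
    "pl": {},
}
ranking = multilingual_merge(results, boosts)
```

A record described in several languages (d1 here) accumulates evidence from several indexes, while a match only in a rare language is damped by its small boost.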
    </sec>
    <sec id="sec-3">
      <title>Configurations and Results</title>
      <p>The detailed setup of our experiments and their evaluation results are presented in the following
subsections.</p>
      <sec id="sec-3-1">
        <title>Monolingual Experiments</title>
        <p>We submitted 12 monolingual experiments in total, whereof 4 were submitted for each target
collection in German, English and French. For all experiments a language-specific stopword list
was applied4. We used the stemmers from Snowball5 for English and French and applied a special
n-gram stemmer6 for German.</p>
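The two preprocessing steps can be sketched as a small pipeline. This is a toy illustration only: the actual runs used the Savoy stopword lists and Snowball respectively n-gram stemmers, whereas the stopword set and the crude suffix stripper below are invented stand-ins that merely show where the two steps sit:

```python
STOPWORDS = {"the", "a", "of", "and", "in"}  # tiny invented list, not the Savoy lists

def toy_stem(token):
    # crude suffix stripper standing in for a real Snowball stemmer
    for suffix in ("ing", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # lowercase, drop stopwords, then stem -- applied identically at indexing
    # and query time so that index terms and query terms match
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return [toy_stem(t) for t in tokens]

terms = preprocess("Indexing the libraries of Europe")
```

Applying the same pipeline on both sides is what makes, e.g., a query containing "libraries" match a record containing "library".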
        <p>In table 2 the retrieval performance of our experiments is reported in terms of mean average
precision (MAP) and the absolute rank of the experiment in the evaluation. We compare the
two baseline runs to one combined experiment per target collection. Furthermore we compare
the performance of the first baseline run per collection (cut1, cut9, cut17) to the corresponding
multilingual experiment (cut4++, cut12++, cut20++).
The evaluation of our experiments allows us to draw some interesting conclusions. First, the overall
performance in terms of MAP on the German and French collections was quite similar, while
the experiments on the English collection achieved much better results. Interestingly, this did
not seem to be a flaw in our configuration, since we achieved identical positions in the ranking over
all submitted experiments. Another important observation was that our combined experiments
(where different retrieval models were used) always performed better than the baseline run on
each of the target collections. However, the overall gain was not very large. Furthermore, one can
conclude that our multilingual approach also worked consistently well by slightly improving MAP
(compare cut1 to cut4++, cut9 to cut12++ and cut17 to cut20++).</p>
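Mean average precision, the measure reported throughout, can be computed as follows; the ranked lists and relevance judgments below are invented for illustration:

```python
def average_precision(ranked, relevant):
    # mean of the precision values at each rank where a relevant doc appears
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: list of (ranked_list, relevant_set) pairs, one per topic
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# relevant docs d1 and d3 found at ranks 1 and 3: AP = (1/1 + 2/3) / 2
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Because precision is sampled only at the ranks of relevant documents, MAP rewards runs that place relevant records early, which is why the small but consistent gains of the combined runs show up directly in this measure.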
      </sec>
      <sec id="sec-3-2">
        <title>Cross-lingual Experiments</title>
        <p>We submitted 12 experiments for the bilingual subtask, whereof 4 were submitted for each target
collection. Two experiments per target collection correspond to the combined monolingual run on
that collection. However, two different source topic languages were translated in those experiments.
The remaining two runs per target collection had again multilingual character. We translated
the topics from the source language to the four most common languages in the target collections,
queried the four indexes and combined the results in a multilingual result set. Again the general
4http://members.unine.ch/jacques.savoy/clef/index.html
5http://snowball.tartarus.org
6http://www-user.tu-chemnitz.de/w~ags/cv/clr.pdf
configuration was equal to the corresponding monolingual reference run for comparability. In table
3 we report the evaluation results for each of the bilingual experiments in terms of MAP and the
rank over all submitted experiments. Additionally we report our best monolingual experiment for
each target collection as baseline for comparison.
The evaluation results of our bilingual experiments were very strong. The retrieval performance of
our best bilingual runs compared to our best monolingual runs decreased only by about 0.6% on the
English collection, about 1% on the French collection and about 7.5% on the German collection.
We attribute those results to the quality of the Google translation service. Another finding
was that the experiments in which we tried to benefit from the multilinguality of the collections
also performed quite well in the bilingual setting. In fact, one of those experiments performed
best on the French collection, and on the German collection it performed almost as well as the
best experiment. Only on the English collection we could not benefit from the multilinguality;
there those two experiments were clearly outperformed by the standard bilingual runs.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Result Analysis - Summary</title>
      <p>The following list provides a summary of the analysis of our retrieval experiments for the Ad-Hoc
TEL task at CLEF 2009:</p>
      <p>Combining retrieval models: Our experiments showed that combining different retrieval
models results in a small but consistent gain in terms of MAP over all target collections.</p>
      <p>Monolingual task: The submitted monolingual experiments achieved strong performance on
all target collections. Interestingly, the MAP on the French and German collections is almost
the same, while the performance is much better on the English collection.</p>
      <p>Bilingual task: Probably due to the translation service used, our bilingual experiments
performed very well and achieved top results on each target collection. The gap to our best
corresponding monolingual runs ranged between 0.6% and 7.5%.</p>
      <p>
        Addressing the multilinguality of the collections: We experimented with multilingual
configurations and compared them to a baseline experiment. We found that our approach of
combining multiple indexed collections works quite well, except for the bilingual configurations
on the English target collection.
      </p>
      <p>
        In our second participation in the CLEF Ad-Hoc TEL task we were able to choose from a wide
selection of retrieval models. The Xtrieval framework now supports three different retrieval cores,
namely Lucene, Lemur and Terrier. By combining results from Lucene and Terrier we
achieved consistent gains in terms of mean average precision on all collections over our baseline
runs. Again we found that the translation service provided by Google seems to be clearly
superior to any other approach or system. We used this service for translating our bilingual and
multilingual experiments and obtained very strong retrieval performance for all of those runs. In the
future we will further investigate the numerous retrieval models and try to help develop an
open-source retrieval framework for information retrieval evaluation, as it was proposed by Ferro
and Harman [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We would like to thank Jacques Savoy and his co-workers for providing numerous resources for
language processing. Also, we would like to thank Giorgio M. di Nunzio and Nicola Ferro for
developing and operating the DIRECT system7.</p>
      <p>This work was partially accomplished in conjunction with the project sachsMedia, which is
funded by the Entrepreneurial Regions 8 program of the German Federal Ministry of Education
and Research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <given-names>Donna</given-names>
            <surname>Harman</surname>
          </string-name>
          .
          <article-title>Dealing with multilingual information access: Grid experiments at trebleclef</article-title>
          .
          <source>Post-proceedings of the Fourth Italian Research Conference on Digital Library Systems (IRCDL 2008)</source>
          , pages
          <fpage>29</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jens</given-names>
            <surname>Kürsten</surname>
          </string-name>
          , Thomas Wilhelm, and
          <string-name>
            <given-names>Maximilian</given-names>
            <surname>Eibl</surname>
          </string-name>
          .
          <article-title>Extensible retrieval and evaluation framework: Xtrieval</article-title>
          . LWA 2008: Lernen - Wissen - Adaption, Würzburg, Workshop Proceedings,
          <year>October 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jens</given-names>
            <surname>Kürsten</surname>
          </string-name>
          , Thomas Wilhelm, and
          <string-name>
            <given-names>Maximilian</given-names>
            <surname>Eibl</surname>
          </string-name>
          .
          <article-title>The Xtrieval framework at CLEF 2007: Domain-specific track</article-title>
          . In C. Peters,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jijkoun</surname>
          </string-name>
          , Th. Mandl, H. Muller,
          <string-name>
            <given-names>D.W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Peñas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Petras</surname>
          </string-name>
          , and D. Santos, editors,
          <source>LNCS - Advances in Multilingual and Multimodal Information Retrieval</source>
          , volume
          <volume>5152</volume>
          , pages
          <fpage>174</fpage>
          -
          <lpage>181</lpage>
          , Berlin,
          <year>2008</year>
          . Springer Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Iadh</given-names>
            <surname>Ounis</surname>
          </string-name>
          , Christina Lioma, Craig Macdonald, and Vassilis Plachouras.
          <article-title>Research directions in Terrier: a search engine for advanced retrieval on the Web</article-title>
          .
          <source>Novatica/UPGRADE Special Issue on Next Generation Web Search</source>
          , pages
          <fpage>49</fpage>
          -
          <lpage>56</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jacques</given-names>
            <surname>Savoy</surname>
          </string-name>
          .
          <article-title>Data fusion for effective European monolingual information retrieval</article-title>
          .
          <source>Working Notes for the CLEF 2004 Workshop</source>
          , Bath, UK,
          <year>September 2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>