REINA at CLEF 2007 Robust Task

Angel F. Zazo, Carlos G. Figuerola, and José L. Alonso Berrocal
REINA Research Group - University of Salamanca
C/ Francisco Vitoria 6-16, 37008 Salamanca, Spain
http://reina.usal.es

Abstract

This paper describes our work at the CLEF 2007 Robust Task. We participated in the monolingual (English, French and Portuguese) subtasks and in the bilingual (English to French) subtask. At CLEF 2006 our research group obtained very good results in the robust task by applying local query expansion using windows of terms. This year we used the same expansion technique, but taking into account several criteria of robustness: MAP, GMAP, MRR, GS@10, P@10, number of failed topics, number of topics below 0.1 MAP, and number of topics with P@10=0. In the bilingual retrieval experiments three machine translation programs were used to translate the topics. The translations into the target language were merged before performing a monolingual retrieval, and the same local expansion technique was applied. This year the results were disappointing. We think the reason is the difficulty of selecting the best measurement for robustness. Perhaps the problem is that all measurements are averages over all topics, while hard topics are inherently hard and must be analyzed separately. Nevertheless, all our runs, both base and expanded, again obtained good rankings. We think the reason is that we used a good information retrieval system and that the expansion technique is robust, since it does not significantly deteriorate retrieval performance.

Categories and Subject Descriptors

H.3.1 [Content Analysis and Indexing]: Indexing methods, Thesauruses; H.3.3 [Information Search and Retrieval]: Query formulation, Relevance feedback; H.3.4 [Systems and Software]: Performance evaluation; I.2.7 [Natural Language Processing]: Machine Translation

General Terms

Measurement, Performance, Experimentation

Keywords

Robust Retrieval, Query Expansion, Term Windows, Association Thesauri, CLIR, Machine Translation

1 Introduction

Robust retrieval tries to obtain stable performance over all topics by focusing on poorly performing topics. Robust tracks were carried out at TREC 2003, 2004 and 2005 for monolingual retrieval [3, 4, 5], and at CLEF 2006, including monolingual, bilingual and multilingual retrieval [1]. This year only the monolingual (English, French and Portuguese) and bilingual (English to French) subtasks were carried out. Our research group participated in all of them. For a complete description of this task, please see the CLEF 2007 Ad-hoc Track Overview, also published in this volume.

A system's robustness ensures that all topics obtain minimum effectiveness levels. In information retrieval the mean of the average precision (MAP) is used to measure systems' performance, but poorly performing topics have little influence on MAP. At TREC, the geometric average (GMAP), rather than MAP, turned out to be the most stable evaluation method for robustness [4]. GMAP has the desired effect of emphasizing scores close to 0.0 (the poor performers) while minimizing differences between higher scores. Nevertheless, at the CLEF 2006 Workshop the submitted runs showed high correlations between MAP and GMAP, so at CLEF 2007 other criteria of robustness have been suggested: MAP, GMAP, P@10, number of failed topics, number of topics below 0.1 MAP, and number of topics with P@10=0.
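To make the difference between these averaged measures concrete, the following minimal sketch (Python; not part of our system, and the 1e-5 floor for zero average precision values is an assumption borrowed from common TREC practice) computes MAP and GMAP from per-topic average precision scores. A single near-failed topic barely moves MAP but pulls GMAP down sharply, which is exactly the emphasis on poor performers described above.

```python
import math

def map_score(ap_per_topic):
    """Arithmetic mean of per-topic average precision (MAP)."""
    return sum(ap_per_topic) / len(ap_per_topic)

def gmap_score(ap_per_topic, eps=1e-5):
    """Geometric mean of per-topic average precision (GMAP).

    Zero (or near-zero) scores are floored at eps so the product does not
    collapse to 0; topics with very low AP still dominate the result.
    """
    return math.exp(sum(math.log(max(ap, eps)) for ap in ap_per_topic)
                    / len(ap_per_topic))

# Two hypothetical systems with almost the same MAP:
system_a = [0.40, 0.35, 0.30, 0.001]   # one near-failed topic
system_b = [0.30, 0.28, 0.26, 0.221]   # no failures
print(map_score(system_a), gmap_score(system_a))   # MAP ~0.263, GMAP ~0.08
print(map_score(system_b), gmap_score(system_b))   # MAP ~0.265, GMAP ~0.26
```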
In our experiments we have also considered two other user-related measurements: the Generalized Success@10 (GS@10) [2] and the mean reciprocal rank (MRR). Both indicate the rank of the first relevant document retrieved.

Our main focus was monolingual retrieval; the steps followed are explained below. For the bilingual retrieval experiments we used machine translation (MT) programs to translate the topics into the document language, and then performed a monolingual retrieval.

2 Experiments

For the monolingual experiments we used the well-known vector space model with the dnu-ntc term weighting scheme. For documents, the letter u stands for the pivoted document normalization: the pivot was adjusted to the average document length and the slope was set to 0.1 for all the collections. We decided to remove the terms present in more than 25 percent of the documents. For English and French we verified that stemming improves retrieval. Last year we saw that stemming does not deteriorate the retrieval performance of hard topics, so we also decided to apply stemming for Portuguese. For English we used the Porter stemmer, and for French and Portuguese the stemmers from the University of Neuchatel available at http://www.unine.ch/info/clef/. From the descriptions and narratives of the topics we automatically removed certain phrases such as "Find documents that ...", "Les documents pertinents relatent ..." or "Encontrar documentos sobre ...".

At the CLEF 2006 Robust Task our research group obtained very good results applying local query expansion using windows of terms [6]. This year we have used the same expansion technique, but taking into account the new criteria. The technique uses co-occurrence relations in windows of terms from the first retrieved documents to build a thesaurus with which the original query is expanded. Our interest was to use short and long queries in our experiments, i.e., the title field of the topics for short queries, and the title and description fields for long ones. Many tests were carried out on the training collections to obtain the best performance, but we found no settings that improved retrieval for all measurements. We therefore selected the settings that improved the greatest number of measurements for both short and long queries. For English the highest improvement with this expansion technique was achieved using a distance value of 1, taking the first 15 retrieved documents to build the thesauri, and adding about 10 terms to the original query. For French the highest improvement was achieved using a distance value of 1, taking the first 20 retrieved documents, and adding 40 terms to the original query. For Portuguese we decided to use the best combination obtained last year for the Spanish experiments, for two reasons: first, Portuguese is more similar to Spanish than English or French are; second, the average number of terms per sentence in the Portuguese collection is very similar to that of the Spanish one. We used a distance value of 2, taking the first 10 documents, and adding 30 terms to the original query.
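As an illustration of this expansion step, the sketch below (Python; a simplified reconstruction, not the code actually used: the tokenization, the use of raw within-window co-occurrence counts as the association measure, and the function names are our assumptions) builds a small association thesaurus from term co-occurrences within windows of the top retrieved documents and adds the most strongly related terms to the query.

```python
from collections import Counter

def expand_query(query_terms, top_docs, distance=1, n_terms=10):
    """Local query expansion with a co-occurrence thesaurus built from
    term windows of the top retrieved documents (simplified sketch).

    query_terms : list of (stemmed) query terms
    top_docs    : list of documents, each a list of (stemmed) tokens
    distance    : maximum separation between two terms for them to count
                  as co-occurring in the same window
    n_terms     : number of expansion terms added to the query
    """
    query = set(query_terms)
    scores = Counter()
    for tokens in top_docs:
        for i, term in enumerate(tokens):
            if term not in query:
                continue
            # Terms within +/- distance positions of a query term co-occur with it.
            window = tokens[max(0, i - distance):i] + tokens[i + 1:i + 1 + distance]
            for other in window:
                if other not in query:
                    scores[other] += 1
    expansion = [t for t, _ in scores.most_common(n_terms)]
    return query_terms + expansion

# Hypothetical usage with the English settings of our runs
# (distance=1, first 15 retrieved documents, about 10 added terms):
# expanded = expand_query(title_terms, first_15_docs, distance=1, n_terms=10)
```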
For the bilingual experiments the CLIR system was the same as that used in monolingual retrieval. A previous step was carried out before searching, to translate the English topics into French. We used three MT programs: L&H Power Translator Pro 7.0, Systran (http://www.systransoft.com) and Reverso (http://www.reverso.net). For each topic we combined the terms of the translations into a single topic; this is another expansion process, although in most cases the three translations were identical. Finally, a monolingual retrieval was performed. The local query expansion using co-occurrence-based thesauri built with term windows was also applied.

For each subtask and topic language five runs were submitted for the test and training topics. The run names begin with "reina", followed by the abbreviation of the language (EN, FR or PT for the monolingual runs, and E2F for the English to French bilingual runs), followed by the topic fields used in the run (t: title; td: title and description; tdn: title, description and narrative), and finally the letter "e" if term expansion was used and/or the letter "T" if the run is a test run. For example, the run "reinaENtdeT" is the test run submitted for the English collection using the title and description fields of the topics and applying term expansion. We submitted the "tdn" runs only for internal testing purposes.

3 Results

We only analyze the results of our test runs, i.e., those obtained for the test topics of the robust task. Table 1 shows the results of the runs. We can see that term expansion does not improve performance for all measurements.

Table 1: Results of the runs submitted at the CLEF 2007 Robust Task. Columns correspond to the topic fields used (t, td, tdn); "Expansion" columns are the runs with local query expansion, using the settings indicated for each language.

English (expansion settings: distance=1, docs=15, terms=10)
               Basis t   Expansion t   Basis td   Expansion td   Basis tdn
  MAP           0.3226       0.3205      0.3897        0.3855      0.3897
  GMAP          0.1190       0.1045      0.1850        0.1762      0.1850
  MRR           0.5602       0.5379      0.6922        0.6792      0.6922
  GS@10         0.7613       0.7219      0.8506        0.8422      0.8506
  P@10          0.3200       0.3240      0.3620        0.3640      0.3620
  # failed           5            5           5             5           5
  # <0.1 MAP        16           20           7             8           7
  # P@10=0          16           23          10            11          10

French (expansion settings: distance=1, docs=20, terms=40)
               Basis t   Expansion t   Basis td   Expansion td   Basis tdn
  MAP           0.3382       0.3481      0.3773        0.3804      0.3773
  GMAP          0.0940       0.0947      0.1289        0.1218      0.1289
  MRR           0.5749       0.5972      0.6564        0.6564      0.6564
  GS@10         0.7555       0.7445      0.7940        0.7959      0.7940
  P@10          0.3710       0.3740      0.4140        0.4280      0.4140
  # failed           9            9           8             9           8
  # <0.1 MAP        18           19          12            12          12
  # P@10=0          23           24          19            18          19

Portuguese (expansion settings: distance=2, docs=10, terms=30)
               Basis t   Expansion t   Basis td   Expansion td   Basis tdn
  MAP           0.3387       0.3533      0.4083        0.4121      0.4140
  GMAP          0.0825       0.0911      0.1369        0.1301      0.1287
  MRR           0.5711       0.5950      0.6286        0.6273      0.6419
  GS@10         0.7307       0.7277      0.7855        0.7718      0.7787
  P@10          0.3013       0.3027      0.3320        0.3347      0.3360
  # failed          15           12          10            10          11
  # <0.1 MAP        28           29          22            26          23
  # P@10=0          36           39          29            30          30

EN → FR (expansion settings: distance=1, docs=20, terms=40)
               Basis t   Expansion t   Basis td   Expansion td   Basis tdn
  MAP           0.3035       0.3278      0.3385        0.3455      0.3583
  GMAP          0.0821       0.0872      0.1005        0.0997      0.1228
  MRR           0.5819       0.6084      0.6219        0.6164      0.6794
  GS@10         0.7555       0.7580      0.7833        0.7769      0.8096
  P@10          0.3242       0.3535      0.3770        0.3870      0.3830
  # failed           9            9           9             9           8
  # <0.1 MAP        16           16          15            14          11
  # P@10=0          22           20          19            18          16

4 Conclusions

At the CLEF 2006 Robust Task our research group obtained very good results applying local query expansion using windows of terms. This year at CLEF 2007 the results were disappointing. We think the reason is the difficulty of selecting the best measurement for robustness. Perhaps the problem is that all measurements are averages over all topics, while hard topics are inherently hard and must be analyzed separately. Whether a topic is hard depends on the document collection, the topic collection, the information retrieval system and the topic itself. Therefore, general directives for improving the performance of hard topics are difficult to suggest.
Nevertheless, all our runs, both the base runs and the expanded ones, again ended up with good rankings this year. We think the reason is that we used a good information retrieval system and that the expansion technique is robust, since it does not significantly deteriorate retrieval performance.

References

[1] G. M. Di Nunzio, N. Ferro, T. Mandl, and C. Peters. CLEF 2006: Ad hoc track overview. CLEF 2006, LNCS, 4730, 2007.

[2] S. Tomlinson. Comparing the robustness of expansion techniques and retrieval measures. In A. Nardi, C. Peters, and J. Vicedo, editors, ABSTRACTS CLEF 2006 Workshop, 20-22 September, Alicante, Spain. Results of the CLEF 2006 Cross-Language System Evaluation Campaign, 2006.

[3] E. M. Voorhees. Overview of the TREC 2003 robust retrieval track. In The Twelfth Text REtrieval Conference (TREC 2003), pages 69-77. NIST Special Publication 500-255, 2003.

[4] E. M. Voorhees. Overview of the TREC 2004 robust retrieval track. In The Thirteenth Text REtrieval Conference (TREC 2004), Gaithersburg, Maryland, November 16-19. NIST Special Publication 500-261, 2004.

[5] E. M. Voorhees. Overview of the TREC 2005 robust retrieval track. In The Fourteenth Text REtrieval Conference (TREC 2005), Gaithersburg, Maryland, November 15-18. NIST, 2005.

[6] A. F. Zazo, J. L. Alonso Berrocal, and C. G. Figuerola. Local query expansion using terms windows for robust retrieval. CLEF 2006, LNCS, 4730:145-152, 2007.