Comparing weighting models for monolingual information retrieval

Gianni Amati, Claudio Carpineto, and Giovanni Romano
Fondazione Ugo Bordoni, via B. Castiglione 59, 00142 Rome, Italy
{gba, carpinet, romano}@fub.it

1 Introduction

Although the choice of the weighting model may crucially affect the performance of any information retrieval system, there has been little work on evaluating the relative merits and drawbacks of different weighting models in the CLEF environment. The main goal of our participation in CLEF 2003 was to help fill this gap. We consider three weighting models with different theoretical backgrounds that have proved effective on a number of tasks and collections: Okapi [8], statistical language modeling [10], and deviation from randomness [3].

We study the retrieval performance of the rankings produced by each weighting model, without and with retrieval feedback, on three monolingual test collections: French, Italian, and Spanish. The collections are indexed with standard techniques and the retrieval feedback stage is performed using the method described in [4]. In the following we first describe the three weighting models, the method used for retrieval feedback, and the experimental setting. Then we report the results and the main conclusions of our experiments.

2 The three weighting models

For ease of clarity and comparison, the document ranking produced by each weighting model is represented using the same general expression, namely as the product of a document-based term weight by a query-based term weight:

sim(q, d) = \sum_{t \in q \wedge d} w_{t,d} \cdot w_{t,q}

This formalism also allows a uniform application of the subsequent retrieval feedback stage to the first-pass ranking produced by each weighting model, as we will see in the next section. Before giving the expressions for w_{t,d} and w_{t,q} for each weighting model, we report the complete list of variables that will be used:

f_t        the number of occurrences of term t in the collection
f_{t,d}    the number of occurrences of term t in document d
f_{t,q}    the number of occurrences of term t in query q
n_t        the number of documents in which term t occurs
D          the number of documents in the collection
T          the number of terms in the collection
λ_t        the ratio between f_t and T
W_d        the length of document d
W_q        the length of query q
avg_W_d    the average length of the documents in the collection

2.1 Okapi

To describe Okapi, we use the expression given in [8]. This formula has been used by most participants in TREC and CLEF over the last years.

w_{t,d} = \frac{(k_1 + 1) \cdot f_{t,d}}{k_1 \cdot \left( (1 - b) + b \cdot \frac{W_d}{avg\_W_d} \right) + f_{t,d}}

w_{t,q} = \frac{(k_3 + 1) \cdot f_{t,q}}{k_3 + f_{t,q}} \cdot \log_2 \frac{D - n_t + 0.5}{n_t + 0.5}

2.2 Statistical language modeling (SLM)

The statistical language modeling approach has been presented in several papers (e.g., [5], [6]). Here we use the expression given in [10], with Dirichlet smoothing:

sim(q, d) = \sum_{t \in q \wedge d} f_{t,q} \cdot \left( \log_2(f_{t,d} + \mu \lambda_t) - \log_2(W_d + \mu) - \log_2 \lambda_t \right) + W_q \cdot \log_2 \frac{\mu}{W_d + \mu}

with w_{t,q} = f_{t,q}.

2.3 Deviation from randomness (DFR)

Deviation from randomness has been successfully used at the Web Track of TREC-10 [1] and at CLEF 2002, for the Italian monolingual task [2]. It is best described in [3].

w_{t,d} = \left( \log_2(1 + \lambda_t) + f^*_{t,d} \cdot \log_2 \frac{1 + \lambda_t}{\lambda_t} \right) \cdot \frac{f_t + 1}{n_t \cdot (f^*_{t,d} + 1)}

w_{t,q} = f_{t,q}, with f^*_{t,d} = f_{t,d} \cdot \log_2 \left( 1 + c \cdot \frac{avg\_W_d}{W_d} \right)
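To make the three ranking formulas concrete, the following is a minimal Python sketch of how the document-side and query-side weights could be computed from raw collection statistics. It is not the original systems' code: the function and variable names are our own, and the expressions simply follow the formulas given above (Okapi, Dirichlet-smoothed SLM, and the DFR variant with length-normalised term frequency).

```python
import math

def okapi_w_td(f_td, W_d, avg_W_d, k1=1.2, b=0.75):
    """Okapi document-side weight w_{t,d} (term-frequency component)."""
    return ((k1 + 1.0) * f_td) / (k1 * ((1.0 - b) + b * W_d / avg_W_d) + f_td)

def okapi_w_tq(f_tq, n_t, D, k3=1000.0):
    """Okapi query-side weight w_{t,q}, including the idf-like factor."""
    idf = math.log2((D - n_t + 0.5) / (n_t + 0.5))
    return ((k3 + 1.0) * f_tq) / (k3 + f_tq) * idf

def slm_sim(query_tf, doc_tf, W_d, W_q, lam, mu=1000.0):
    """Dirichlet-smoothed language-model score sim(q, d).

    query_tf and doc_tf map terms to raw frequencies; lam maps each term t
    to lambda_t = f_t / T (its relative frequency in the collection).
    """
    s = sum(f_tq * (math.log2(doc_tf[t] + mu * lam[t])
                    - math.log2(W_d + mu)
                    - math.log2(lam[t]))
            for t, f_tq in query_tf.items() if t in doc_tf)
    # Document-length term added once per document.
    return s + W_q * math.log2(mu / (W_d + mu))

def dfr_w_td(f_td, f_t, n_t, lam_t, W_d, avg_W_d, c=2.0):
    """DFR document-side weight w_{t,d}."""
    f_star = f_td * math.log2(1.0 + c * avg_W_d / W_d)   # normalised term frequency
    info = math.log2(1.0 + lam_t) + f_star * math.log2((1.0 + lam_t) / lam_t)
    return info * (f_t + 1.0) / (n_t * (f_star + 1.0))
```

For Okapi and DFR, the document score is obtained by summing w_{t,d} · w_{t,q} over the query terms occurring in the document (with w_{t,q} = f_{t,q} for DFR); slm_sim returns the document score directly because of the additive document-length term.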
3 Retrieval feedback

As retrieval feedback has been incorporated in most recent systems participating in CLEF, it is interesting to evaluate also the performance of the different weighting models when they are enriched with retrieval feedback. To perform the experiments, we used a technique called information-theoretic query expansion [4]. At the end of the first-pass ranking, each term in the top retrieved documents was assigned a score based on the Kullback-Leibler distance between the distribution of the term in those documents and the distribution of the same term in the whole collection:

KLD_t = f_{t,d} \cdot \log_2 \frac{f_{t,d}}{\lambda_t}

where f_{t,d} is computed over the set of top retrieved documents. The terms with the highest scores were selected for expansion. The KLD scores were then also used to reweight the terms in the expanded query. As the weights of the unexpanded query (i.e., the SLM, Okapi, or DFR weights) and the KLD scores had different scales, we normalized both the weights of the original query terms and the scores of the expansion terms by the corresponding maximum value; the normalized values were then linearly combined. The new expression for computing the similarity between an expanded query q_exp and a document d becomes:

sim(q_{exp}, d) = \sum_{t \in q_{exp} \wedge d} w_{t,d} \cdot \left( \alpha \cdot \frac{w_{t,q}}{\max_{q} w_{t,q}} + \beta \cdot \frac{KLD_t}{\max_{q_{exp}} KLD_t} \right)
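As a concrete illustration of this feedback step, here is a minimal Python sketch of the term scoring and reweighting described above. The names kld_scores and expanded_query are our own, the treatment of frequencies in the pooled top-retrieved documents is a simplifying assumption, and the default parameter values are those listed in Section 4.3; the normalisation and linear combination follow the expression for sim(q_exp, d).

```python
import math

def kld_scores(top_docs_tf, lam):
    """Score candidate expansion terms: frequency of each term in the pooled
    top-retrieved documents against its collection-wide frequency lambda_t."""
    return {t: f * math.log2(f / lam[t]) for t, f in top_docs_tf.items() if f > 0}

def expanded_query(original_w_tq, kld, n_terms=40, alpha=1.0, beta=0.5):
    """Keep the n_terms highest-scoring expansion terms, normalise both the
    original query weights and the KLD scores by their maxima, and combine
    them linearly into the weights of the expanded query."""
    selected = dict(sorted(kld.items(), key=lambda kv: kv[1], reverse=True)[:n_terms])
    max_w, max_kld = max(original_w_tq.values()), max(selected.values())
    return {t: alpha * original_w_tq.get(t, 0.0) / max_w
               + beta * selected.get(t, 0.0) / max_kld
            for t in set(original_w_tq) | set(selected)}
```

The second-pass ranking then scores each document as the sum, over the expanded-query terms it contains, of w_{t,d} times the combined weight returned by expanded_query.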
4 Experimental setting

4.1 Test collections

The experiments were performed using the three CLEF 2003 monolingual test collections for French, Spanish, and Italian. For all collections, the title+description topic statement was used.

4.2 Document and query indexing

We identified the individual words occurring in the documents, considering only the admissible sections and ignoring punctuation and case. The system then performed word stemming and stop-word removal. For stemming, we used the French, Italian, and Spanish versions of the Porter stemming algorithm [7], which are available on the Snowball web site (http://snowball.tartarus.org). To remove common words, we used the stop lists provided by Savoy [9]. Thus, we performed strict single-word indexing; furthermore, we did not apply any ad hoc linguistic manipulation such as expanding or removing certain words from the query text or using lists of proper nouns.

4.3 Choice of experimental parameters

The final document ranking is affected by a number of parameters. To perform the experiments, we set the parameters to values that have been reported in the literature. Here is the complete list of parameter values:

Okapi                k1 = 1.2, k3 = 1000, b = 0.75
SLM                  μ = 1000
DFR                  c = 2
Retrieval feedback   10 documents, 40 expansion terms, α = 1, β = 0.5

5 Results

For each collection, we computed six runs: two for each of the three weighting models, one without and one with retrieval feedback (RF). Table 1, Table 2, and Table 3 show the retrieval performance of each method on the French, Italian, and Spanish collections, respectively. Performance was measured using average precision (AV-PREC), precision at 5 retrieved documents (PREC-AT-5), and precision at 10 retrieved documents (PREC-AT-10). For each collection, the best result without retrieval feedback and the best result with retrieval feedback are shown in bold. Note that for the French and Italian collections the average precision was greater than the early precisions; this is because, for these collections, the number of relevant documents per query is small on average and many queries have very few relevant documents.

The first main finding of our experiments is that the best absolute result for each collection and for each evaluation measure was always obtained by DFR with retrieval feedback, with notable improvements on several data points. The excellent performance of the DFR model is confirmed also when comparing the weighting models without query expansion, although in the latter case DFR did not always achieve the best results (i.e., for PREC-AT-5 and PREC-AT-10 on Italian, and for PREC-AT-5 on Spanish).

             AV-PREC  PREC-AT-5  PREC-AT-10
Okapi         0.5030     0.4385      0.3654
Okapi + RF    0.5054     0.4769      0.3942
SLM           0.4753     0.4538      0.3635
SLM + RF      0.4372     0.4192      0.3462
DFR           0.5116     0.4577      0.3654
DFR + RF      0.5238     0.4885      0.3981

Table 1. Retrieval performance on the French collection

             AV-PREC  PREC-AT-5  PREC-AT-10
Okapi         0.4762     0.4588      0.3510
Okapi + RF    0.5238     0.4824      0.3902
SLM           0.5027     0.4941      0.3824
SLM + RF      0.5095     0.4824      0.3863
DFR           0.5046     0.4824      0.3725
DFR + RF      0.5364     0.5255      0.4137

Table 2. Retrieval performance on the Italian collection

             AV-PREC  PREC-AT-5  PREC-AT-10
Okapi         0.4606     0.5684      0.5175
Okapi + RF    0.5093     0.6105      0.5491
SLM           0.4720     0.6140      0.5157
SLM + RF      0.5112     0.5825      0.5316
DFR           0.4907     0.6035      0.5386
DFR + RF      0.5510     0.6140      0.5825

Table 3. Retrieval performance on the Spanish collection

Of the other two models (Okapi and SLM), neither was clearly superior to the other. They achieved comparable results on Spanish, while Okapi was slightly better than SLM on French and slightly worse on Italian. However, when considering the first retrieved documents, the performance of SLM was usually very good and sometimes even better than that of DFR.

The results in Table 1, Table 2, and Table 3 also show that retrieval feedback improved the Okapi and DFR runs and mostly hurt the SLM runs. In particular, retrieval feedback improved the retrieval performance of Okapi and DFR for all evaluation measures and across all collections, whereas it usually decreased the early precision of SLM and in one case (on French) it even hurt the average precision of SLM. The unsatisfying performance of SLM + RF may be explained by the fact that the experiments were performed using long queries.

We would also like to emphasize that the DFR runs shown here correspond to actually submitted runs, although they were not our best runs. In fact, our best submitted runs used language-specific optimal parameters; in addition, we submitted for each language a run with the same experimental parameters for all languages, obtained by averaging the best values.

We also performed a query-by-query analysis. For each query, we computed the difference between the best and the worst retrieval result, using average precision as the performance measure. Figure 1, Figure 2, and Figure 3 show the results for French, Italian, and Spanish, respectively. The length of each bar depicts the range of performance variation attainable by the three methods (with retrieval feedback) on each query.

Fig. 1. Performance variation on individual queries for French

Fig. 2. Performance variation on individual queries for Italian

Fig. 3. Performance variation on individual queries for Spanish

The results show that the inter-method variation on single queries was ample, but they do not tell us which method performed best. To get a more complete picture, we counted, for each collection, the number of queries for which each method achieved the best, median, or worst performance.
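The tally behind Table 4 can be reproduced with a few lines of code; the sketch below, with illustrative names of our own, ranks the three methods on each query by average precision and counts how often each method comes first, second, or third.

```python
def rank_counts(ap_per_query):
    """ap_per_query maps each method name to a list of per-query average
    precision values (same query order for all methods). Returns, for each
    method, how many times it ranked 1st, 2nd, and 3rd."""
    methods = list(ap_per_query)
    n_queries = len(next(iter(ap_per_query.values())))
    counts = {m: [0, 0, 0] for m in methods}
    for i in range(n_queries):
        ranked = sorted(methods, key=lambda m: ap_per_query[m][i], reverse=True)
        for rank, m in enumerate(ranked):
            counts[m][rank] += 1
    return counts
```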
The results, shown in Table 4, confirm the better retrieval effectiveness of DFR over the other two models. The superiority of DFR over Okapi and SLM was clear for Spanish, while DFR and Okapi obtained more comparable results on the other two test collections. For French and Italian, the number of best results obtained by DFR and Okapi was similar but, on the whole, DFR was ranked ahead of Okapi for a much larger number of queries.

          French               Italian              Spanish
       SLM  Okapi  DFR      SLM  Okapi  DFR      SLM  Okapi  DFR
1st     11     20   21       10     21   20       16     16   25
2nd     11     17   24        9     16   26       10     22   25
3rd     30     15    7       32     14    5       31     19    7

Table 4. Ranked performance

6 Conclusions

The main conclusion of our experiments is that the DFR model was more effective than both Okapi and SLM, which achieved comparable retrieval performance. In particular, DFR with query expansion obtained the best absolute results for every evaluation measure and across all test collections.

The second conclusion is that retrieval feedback always improved the performance of Okapi and DFR, whereas it was often detrimental to the retrieval effectiveness of SLM, although the latter finding may have been influenced by the length of the queries used in the experiments.

These results seem to suggest that the retrieval performance of a weighting model is only moderately affected by the choice of the language, but this hypothesis should be taken with caution, because our results were obtained under specific experimental conditions. Although there are reasons to believe that similar results might hold across different experimental situations, given that we chose simple, untuned parameter values and made typical indexing assumptions, the issue needs more investigation. The next step of this research is to experiment with a wider range of factors, such as the length of the queries, the values of each weighting model's parameters, and the combination of parameter values for retrieval feedback. It would also be useful to experiment with other languages, to see whether the hypothesis that the retrieval performance of a weighting model is independent of the language receives further support.

References

1. G. Amati, C. Carpineto, G. Romano. FUB at TREC-10 web track: a probabilistic framework for topic relevance term weighting. Proceedings of TREC-10, 182-191, 2001.
2. G. Amati, C. Carpineto, G. Romano. Italian monolingual information retrieval with PROSIT. Working Notes of CLEF 2002, 145-152, 2002.
3. G. Amati, C.J. van Rijsbergen. Probabilistic models of information retrieval based on measuring divergence from randomness. ACM Transactions on Information Systems, 20(4):357-389, 2002.
4. C. Carpineto, R. De Mori, G. Romano, B. Bigi. An information theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1):1-27, 2001.
5. D. Hiemstra, W. Kraaij. Twenty-One at TREC-7: Ad-hoc and cross-language track. Proceedings of TREC-7, 227-238, 1998.
6. J. Ponte, W.B. Croft. A language modeling approach to information retrieval. Proceedings of SIGIR-98, 275-281, 1998.
7. M.F. Porter. Implementing a probabilistic information retrieval system. Inf. Tech. Res. Dev., 1(2):131-156, 1982.
8. S.E. Robertson, S. Walker, M. Beaulieu. Okapi at TREC-7: Automatic ad hoc, filtering, VLC, and interactive track. Proceedings of TREC-7, 253-264, 1998.
9. J. Savoy. Report on CLEF-2001 experiments. Working Notes of CLEF 2001, Darmstadt, 2001.
10. C. Zhai, J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. Proceedings of SIGIR-01, 334-342, 2001.