Comparing weighting models for monolingual information retrieval

Gianni Amati, Claudio Carpineto, and Giovanni Romano
Fondazione Ugo Bordoni, via B. Castiglione 59, 00142 Rome, Italy
{gba, carpinet, romano}@fub.it

1 Introduction

Although the choice of the weighting model may crucially affect the performance of any information retrieval system, there has been little work on evaluating the relative merits and drawbacks of different weighting models in the CLEF environment. The main goal of our participation in CLEF 2003 was to help fill this gap. We consider three weighting models with different theoretical backgrounds that have proved effective on a number of tasks and collections: Okapi [8], statistical language modeling [10], and deviation from randomness [3].

We study the retrieval performance of the rankings produced by each weighting model, without and with retrieval feedback, on three monolingual test collections: French, Italian, and Spanish. The collections are indexed with standard techniques and the retrieval feedback stage is performed using the method described in [4]. In the following we first describe the three weighting models, the method used for retrieval feedback, and the experimental setting. Then we report the results and the main conclusions of our experiments.

2 The three weighting models

For ease of clarity and comparison, the document ranking produced by each weighting model is represented using the same general expression, namely as the product of a document-based term weight by a query-based term weight:

sim(q, d) = \sum_{t \in q \wedge d} w_{t,d} \cdot w_{t,q}

This formalism also allows a uniform application of the subsequent retrieval feedback stage to the first-pass ranking produced by each weighting model, as we will see in the next section. Before giving the expressions for w_{t,d} and w_{t,q} for each weighting model, we report the complete list of variables that will be used:

f_t        the number of occurrences of term t in the collection
f_{t,d}    the number of occurrences of term t in document d
f_{t,q}    the number of occurrences of term t in query q
n_t        the number of documents in which term t occurs
D          the number of documents in the collection
T          the number of terms in the collection
λ_t        the ratio between f_t and T
W_d        the length of document d
W_q        the length of query q
avg_W_d    the average length of the documents in the collection

2.1 Okapi

To describe Okapi, we use the expression given in [8]. This formula has been used by most participants in TREC and CLEF over the last years.

w_{t,d} = \frac{(k_1 + 1) \cdot f_{t,d}}{k_1 \cdot \left( (1 - b) + b \cdot \frac{W_d}{avg\_W_d} \right) + f_{t,d}}

w_{t,q} = \frac{(k_3 + 1) \cdot f_{t,q}}{k_3 + f_{t,q}} \cdot \log_2 \frac{D - n_t + 0.5}{n_t + 0.5}

2.2 Statistical language modeling (SLM)

The statistical language modeling approach has been presented in several papers (e.g., [5], [6]). Here we use the expression given in [10], with Dirichlet smoothing:

sim(q, d) = \sum_{t \in q \wedge d} f_{t,q} \cdot \left( \log_2(f_{t,d} + \mu \lambda_t) - \log_2(W_d + \mu) - \log_2 \lambda_t \right) + W_q \cdot \log_2 \frac{\mu}{W_d + \mu}

with w_{t,q} = f_{t,q}.

2.3 Deviation from randomness (DFR)

Deviation from randomness has been successfully used at the Web Track of TREC-10 [1] and at CLEF 2002, for the Italian monolingual task [2]. It is best described in [3].

w_{t,d} = \left( \log_2(1 + \lambda_t) + f^*_{t,d} \cdot \log_2 \frac{1 + \lambda_t}{\lambda_t} \right) \cdot \frac{f_t + 1}{n_t \cdot (f^*_{t,d} + 1)}

w_{t,q} = f_{t,q}, with f^*_{t,d} = f_{t,d} \cdot \log_2 \left( 1 + c \cdot \frac{avg\_W_d}{W_d} \right)
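To make the three ranking formulas concrete, the following is a minimal Python sketch of how the document-side and query-side weights could be computed from raw collection statistics. It is not the original systems' code: the function and variable names are our own, and the expressions simply follow the formulas given above (Okapi, Dirichlet-smoothed SLM, and the DFR variant with length-normalised term frequency).

```python
import math

def okapi_w_td(f_td, W_d, avg_W_d, k1=1.2, b=0.75):
    """Okapi document-side weight w_{t,d} (term-frequency component)."""
    return ((k1 + 1.0) * f_td) / (k1 * ((1.0 - b) + b * W_d / avg_W_d) + f_td)

def okapi_w_tq(f_tq, n_t, D, k3=1000.0):
    """Okapi query-side weight w_{t,q}, including the idf-like factor."""
    idf = math.log2((D - n_t + 0.5) / (n_t + 0.5))
    return ((k3 + 1.0) * f_tq) / (k3 + f_tq) * idf

def slm_sim(query_tf, doc_tf, W_d, W_q, lam, mu=1000.0):
    """Dirichlet-smoothed language-model score sim(q, d).

    query_tf and doc_tf map terms to raw frequencies; lam maps each term t
    to lambda_t = f_t / T (its relative frequency in the collection).
    """
    s = sum(f_tq * (math.log2(doc_tf[t] + mu * lam[t])
                    - math.log2(W_d + mu)
                    - math.log2(lam[t]))
            for t, f_tq in query_tf.items() if t in doc_tf)
    # Document-length term added once per document.
    return s + W_q * math.log2(mu / (W_d + mu))

def dfr_w_td(f_td, f_t, n_t, lam_t, W_d, avg_W_d, c=2.0):
    """DFR document-side weight w_{t,d}."""
    f_star = f_td * math.log2(1.0 + c * avg_W_d / W_d)   # normalised term frequency
    info = math.log2(1.0 + lam_t) + f_star * math.log2((1.0 + lam_t) / lam_t)
    return info * (f_t + 1.0) / (n_t * (f_star + 1.0))
```

For Okapi and DFR, the document score is obtained by summing w_{t,d} · w_{t,q} over the query terms occurring in the document (with w_{t,q} = f_{t,q} for DFR); slm_sim returns the document score directly because of the additive document-length term.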
3 Retrieval feedback

As retrieval feedback has been incorporated in most recent systems participating in CLEF, it is interesting to evaluate also the performance of the different weighting models when they are enriched with retrieval feedback. To perform the experiments, we used a technique called information-theoretic query expansion [4]. At the end of the first-pass ranking, each term in the top retrieved documents was assigned a score based on the Kullback-Leibler distance between the distribution of the term in those documents and the distribution of the same term in the whole collection:

KLD_t = f_{t,d} \cdot \log_2 \frac{f_{t,d}}{\lambda_t}

where f_{t,d} is computed over the set of top retrieved documents. The terms with the highest scores were selected for expansion. The KLD scores were then also used to reweight the terms in the expanded query. As the weights of the unexpanded query (i.e., the SLM, Okapi, or DFR weights) and the KLD scores had different scales, we normalized both the weights of the original query terms and the scores of the expansion terms by the corresponding maximum value; the normalized values were then linearly combined. The new expression for computing the similarity between an expanded query q_exp and a document d becomes:

sim(q_{exp}, d) = \sum_{t \in q_{exp} \wedge d} w_{t,d} \cdot \left( \alpha \cdot \frac{w_{t,q}}{\max_{q} w_{t,q}} + \beta \cdot \frac{KLD_t}{\max_{q_{exp}} KLD_t} \right)
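As a concrete illustration of this feedback step, here is a minimal Python sketch of the term scoring and reweighting described above. The names kld_scores and expanded_query are our own, the treatment of frequencies in the pooled top-retrieved documents is a simplifying assumption, and the default parameter values are those listed in Section 4.3; the normalisation and linear combination follow the expression for sim(q_exp, d).

```python
import math

def kld_scores(top_docs_tf, lam):
    """Score candidate expansion terms: frequency of each term in the pooled
    top-retrieved documents against its collection-wide frequency lambda_t."""
    return {t: f * math.log2(f / lam[t]) for t, f in top_docs_tf.items() if f > 0}

def expanded_query(original_w_tq, kld, n_terms=40, alpha=1.0, beta=0.5):
    """Keep the n_terms highest-scoring expansion terms, normalise both the
    original query weights and the KLD scores by their maxima, and combine
    them linearly into the weights of the expanded query."""
    selected = dict(sorted(kld.items(), key=lambda kv: kv[1], reverse=True)[:n_terms])
    max_w, max_kld = max(original_w_tq.values()), max(selected.values())
    return {t: alpha * original_w_tq.get(t, 0.0) / max_w
               + beta * selected.get(t, 0.0) / max_kld
            for t in set(original_w_tq) | set(selected)}
```

The second-pass ranking then scores each document as the sum, over the expanded-query terms it contains, of w_{t,d} times the combined weight returned by expanded_query.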
4 Experimental setting

4.1 Test collections

The experiments were performed using the three CLEF 2003 monolingual test collections for French, Spanish, and Italian. For all collections, the title+description topic statement was used.

4.2 Document and query indexing

We identified the individual words occurring in the documents, considering only the admissible sections and ignoring punctuation and case. The system then performed word stemming and stop-word removal. For stemming, we used the French, Italian, and Spanish versions of the Porter stemming algorithm [7], which are available on the Snowball web site (http://snowball.tartarus.org). To remove common words, we used the stop lists provided by Savoy [9]. Thus, we performed strict single-word indexing; furthermore, we did not apply any ad hoc linguistic manipulation such as expanding or removing certain words from the query text or using lists of proper nouns.

4.3 Choice of experimental parameters

The final document ranking is affected by a number of parameters. To perform the experiments, we set the parameters to values that have been reported in the literature. Here is the complete list of parameter values:

Okapi                k1 = 1.2, k3 = 1000, b = 0.75
SLM                  μ = 1000
DFR                  c = 2
Retrieval feedback   10 documents, 40 expansion terms, α = 1, β = 0.5

5 Results

For each collection, we computed six runs: two for each of the three weighting models, one without and one with retrieval feedback (RF). Table 1, Table 2, and Table 3 show the retrieval performance of each method on the French, Italian, and Spanish collections, respectively. Performance was measured using average precision (AV-PREC), precision at 5 retrieved documents (PREC-AT-5), and precision at 10 retrieved documents (PREC-AT-10). For each collection, the best result without retrieval feedback and the best result with retrieval feedback are shown in bold. Note that for the French and Italian collections the average precision was greater than the early precisions; this is because, for these collections, the number of relevant documents per query is small on average and many queries have very few relevant documents.

The first main finding of our experiments is that the best absolute result for each collection and for each evaluation measure was always obtained by DFR with retrieval feedback, with notable improvements on several data points. The excellent performance of the DFR model is confirmed also when comparing the weighting models without query expansion, although in the latter case DFR did not always achieve the best results (i.e., for PREC-AT-5 and PREC-AT-10 on Italian, and for PREC-AT-5 on Spanish).

             AV-PREC  PREC-AT-5  PREC-AT-10
Okapi         0.5030     0.4385      0.3654
Okapi + RF    0.5054     0.4769      0.3942
SLM           0.4753     0.4538      0.3635
SLM + RF      0.4372     0.4192      0.3462
DFR           0.5116     0.4577      0.3654
DFR + RF      0.5238     0.4885      0.3981

Table 1. Retrieval performance on the French collection

             AV-PREC  PREC-AT-5  PREC-AT-10
Okapi         0.4762     0.4588      0.3510
Okapi + RF    0.5238     0.4824      0.3902
SLM           0.5027     0.4941      0.3824
SLM + RF      0.5095     0.4824      0.3863
DFR           0.5046     0.4824      0.3725
DFR + RF      0.5364     0.5255      0.4137

Table 2. Retrieval performance on the Italian collection

             AV-PREC  PREC-AT-5  PREC-AT-10
Okapi         0.4606     0.5684      0.5175
Okapi + RF    0.5093     0.6105      0.5491
SLM           0.4720     0.6140      0.5157
SLM + RF      0.5112     0.5825      0.5316
DFR           0.4907     0.6035      0.5386
DFR + RF      0.5510     0.6140      0.5825

Table 3. Retrieval performance on the Spanish collection

Of the other two models (Okapi and SLM), neither was clearly superior to the other. They achieved comparable results on Spanish, while Okapi was slightly better than SLM on French and slightly worse on Italian. However, when considering the first retrieved documents, the performance of SLM was usually very good and sometimes even better than that of DFR.

The results in Table 1, Table 2, and Table 3 also show that retrieval feedback improved the Okapi and DFR runs and mostly hurt the SLM runs. In particular, retrieval feedback improved the retrieval performance of Okapi and DFR for all evaluation measures and across all collections, whereas it usually decreased the early precision of SLM and in one case (on French) it even hurt the average precision of SLM. The unsatisfying performance of SLM + RF may be explained by the fact that the experiments were performed using long queries.

We would also like to emphasize that the DFR runs shown here correspond to actually submitted runs, although they were not our best runs. In fact, our best submitted runs used language-specific optimal parameters; in addition, we submitted for each language a run with the same experimental parameters for all languages, obtained by averaging the best values.

We also performed a query-by-query analysis. For each query, we computed the difference between the best and the worst retrieval result, using average precision as the performance measure. Figure 1, Figure 2, and Figure 3 show the results for French, Italian, and Spanish, respectively. The length of each bar depicts the range of performance variation attainable by the three methods (with retrieval feedback) on each query.

Fig. 1. Performance variation on individual queries for French

Fig. 2. Performance variation on individual queries for Italian

Fig. 3. Performance variation on individual queries for Spanish

The results show that the inter-method variation on single queries was ample, but they do not tell us which method performed best. To get a more complete picture, we counted, for each collection, the number of queries for which each method achieved the best, median, or worst performance.
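The tally behind Table 4 can be reproduced with a few lines of code; the sketch below, with illustrative names of our own, ranks the three methods on each query by average precision and counts how often each method comes first, second, or third.

```python
def rank_counts(ap_per_query):
    """ap_per_query maps each method name to a list of per-query average
    precision values (same query order for all methods). Returns, for each
    method, how many times it ranked 1st, 2nd, and 3rd."""
    methods = list(ap_per_query)
    n_queries = len(next(iter(ap_per_query.values())))
    counts = {m: [0, 0, 0] for m in methods}
    for i in range(n_queries):
        ranked = sorted(methods, key=lambda m: ap_per_query[m][i], reverse=True)
        for rank, m in enumerate(ranked):
            counts[m][rank] += 1
    return counts
```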
The results, shown in Table 4, confirm the better retrieval effectiveness of DFR over the other two models. The superiority of DFR over Okapi and SLM was clear for Spanish, while DFR and Okapi obtained more comparable results on the other two test collections. For French and Italian, the number of best results obtained by DFR and Okapi was similar but, on the whole, DFR was ranked ahead of Okapi for a much larger number of queries.

          French               Italian              Spanish
       SLM  Okapi  DFR      SLM  Okapi  DFR      SLM  Okapi  DFR
1st     11     20   21       10     21   20       16     16   25
2nd     11     17   24        9     16   26       10     22   25
3rd     30     15    7       32     14    5       31     19    7

Table 4. Ranked performance

6 Conclusions

The main conclusion of our experiments is that the DFR model was more effective than both Okapi and SLM, which achieved comparable retrieval performance. In particular, DFR with query expansion obtained the best absolute results for every evaluation measure and across all test collections.

The second conclusion is that retrieval feedback always improved the performance of Okapi and DFR, whereas it was often detrimental to the retrieval effectiveness of SLM, although the latter finding may have been influenced by the length of the queries used in the experiments.

These results seem to suggest that the retrieval performance of a weighting model is only moderately affected by the choice of the language, but this hypothesis should be taken with caution, because our results were obtained under specific experimental conditions. Although there are reasons to believe that similar results might hold across different experimental situations, given that we chose simple, untuned parameter values and made typical indexing assumptions, the issue needs more investigation. The next step of this research is to experiment with a wider range of factors, such as the length of the queries, the values of each weighting model's parameters, and the combination of parameter values for retrieval feedback. It would also be useful to experiment with other languages, to see whether the hypothesis that the retrieval performance of a weighting model is independent of the language receives further support.

References

1. G. Amati, C. Carpineto, G. Romano. FUB at TREC-10 web track: a probabilistic framework for topic relevance term weighting. Proceedings of TREC-10, 182-191, 2001.
2. G. Amati, C. Carpineto, G. Romano. Italian monolingual information retrieval with PROSIT. Working Notes of CLEF 2002, 145-152, 2002.
3. G. Amati, C.J. van Rijsbergen. Probabilistic models of information retrieval based on measuring divergence from randomness. ACM Transactions on Information Systems, 20(4):357-389, 2002.
4. C. Carpineto, R. De Mori, G. Romano, B. Bigi. An information theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1):1-27, 2001.
5. D. Hiemstra, W. Kraaij. Twenty-One at TREC-7: Ad-hoc and cross-language track. Proceedings of TREC-7, 227-238, 1998.
6. J. Ponte, W.B. Croft. A language modeling approach to information retrieval. Proceedings of SIGIR-98, 275-281, 1998.
7. M.F. Porter. Implementing a probabilistic information retrieval system. Inf. Tech. Res. Dev., 1(2):131-156, 1982.
8. S.E. Robertson, S. Walker, M. Beaulieu. Okapi at TREC-7: Automatic ad hoc, filtering, VLC, and interactive track. Proceedings of TREC-7, 253-264, 1998.
9. J. Savoy. Report on CLEF-2001 experiments. Working Notes of CLEF 2001, Darmstadt, 2001.
10. C. Zhai, J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. Proceedings of SIGIR-01, 334-342, 2001.