<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Are There New BM25 “Expectations”?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Emanuele Di Buccio</string-name>
          <email>dibuccio@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <email>dinunzio@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Information Engineering</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present some ideas about possible directions for a new interpretation of the Okapi BM25 ranking formula. In particular, we focus on a full Bayesian approach for deriving a smoothed formula that takes into account a-priori knowledge on the probability of terms. Most of the efforts in improving BM25 have gone into capturing the language model (frequencies, length, etc.), but they missed the fact that the constant 0.5 used as a correction factor is itself a parameter that can be modelled in a better way. This approach has been tested on a visual data mining tool, and the initial results are encouraging.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The relevance weighting model, also known as RSJ after the names of its creators
(Robertson and Sparck Jones), has been one of the most influential models in the
history of Information Retrieval [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is a probabilistic model of retrieval that
tries to answer the following question:
      </p>
      <p>
        What is the probability that this document is relevant to this query?
`Query' is a particular instance of an information need, and `document' a
particular content description. The purpose of this question is to rank the documents
in order of their probability of relevance, according to the Probability Ranking
Principle [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]:
      </p>
      <p>If retrieved documents are ordered by decreasing probability of relevance
on the data available, then the system's effectiveness is the best to be
gotten for the data.</p>
      <p>
        The probability of relevance is achieved by assigning weights to terms, the RSJ
weight hereafter named $w_i$, according to the following formula:
$$w_i = \log \frac{p_i \, (1 - q_i)}{q_i \, (1 - p_i)} \; , \quad (1)$$
where $p_i$ is the probability that the document contains the term $t_i$ given that
the document is relevant, and $q_i$ is the probability that the document contains
the term $t_i$ given that the document is not relevant. If the estimates of these
probabilities are computed by means of a maximum likelihood estimation, we
obtain the following results:
$$p_i = \frac{r_i}{R} \; , \quad (2)$$
$$q_i = \frac{n_i - r_i}{N - R} \; , \quad (3)$$
where $r_i$ is the number of relevant documents that contain term $t_i$, $n_i$ the number
of documents that contain term $t_i$, and $R$ and $N$ the number of relevant documents
and the total number of documents, respectively. However, this estimation leads
to arithmetical anomalies; for example, if a term is not present in the set of
relevant documents, its probability $p_i$ is equal to zero and the logarithm of zero
returns minus infinity. In order to avoid this situation, a kind of smoothing
is applied to the probabilities. By substituting Equations 2 and 3 into Equation 1
and adding a constant to smooth the probabilities, we obtain:
$$w_i = \log \frac{(r_i + 0.5)\,(N - R - n_i + r_i + 0.5)}{(n_i - r_i + 0.5)\,(R - r_i + 0.5)} \; , \quad (4)$$
which is the actual RSJ score for a term. The choice of the constant 0.5 may
resemble some Bayesian justification related to the binary independence model.<sup>1</sup>
This idea is wrong, as Robertson and Sparck Jones explained in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and the real
justification can be traced back to the work of Cox [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
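      <p>As a concrete illustration (a sketch of our own, not code from the Okapi system), the smoothed RSJ weight of Equation 4 can be computed directly from the four collection statistics:</p>
      <preformat>
# Smoothed RSJ weight (Equation 4): r relevant documents containing the term,
# n documents containing the term, R relevant documents, N documents in total.
rsj_weight &lt;- function(r, n, R, N, k = 0.5) {
  log((r + k) * (N - R - n + r + k) / ((n - r + k) * (R - r + k)))
}

# Example with the statistics used later in the paper (Section 3.1):
rsj_weight(r = 3, n = 20, R = 10, N = 1000)
      </preformat>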
      <p>
        The Okapi BM25 weighting schema takes a step further and introduces the
property of eliteness [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]:
      </p>
      <p>Assume that each term represents a concept, and that a given document
is about that concept or not. A term is `elite' in the document or not.</p>
      <p>BM25 estimates the full eliteness weight for a term from the RSJ score, then
approximates the term frequency behaviour with a single global parameter
controlling the rate of approach. Finally, it makes a correction for document length.
For a full explanation of how to interpret eliteness and integrate it into the BM25
formula, read [
        <xref ref-type="bibr" rid="ref6">6</xref>
        –
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The resulting formula is summarised in the following way:
$$w'_i = f(tf_i) \, w_i \; , \quad (5)$$
where $w_i$ is the RSJ weight, and $f(tf_i)$ is a function of the frequency of the term
$t_i$ parametrized by global parameters.</p>
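      <p>For concreteness, a widely used instantiation of $f(tf_i)$ is the saturating function of BM25 with the usual $k_1$ and $b$ parameters (a sketch following the standard formulation, not a definition given in this paper):</p>
      <preformat>
# BM25 term-frequency saturation with document-length correction:
# tf = term frequency, dl = document length, avdl = average document length.
bm25_tf &lt;- function(tf, dl, avdl, k1 = 1.2, b = 0.75) {
  tf * (k1 + 1) / (k1 * ((1 - b) + b * dl / avdl) + tf)
}

# w'_i = f(tf_i) * w_i, combining it with the RSJ weight sketched above:
bm25_tf(tf = 3, dl = 120, avdl = 100) * rsj_weight(r = 3, n = 20, R = 10, N = 1000)
      </preformat>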
      <p>In this paper, we concentrate on the RSJ weight, and in particular on a full
Bayesian approach for smoothing the probabilities and on a visual data analysis
to assess the effectiveness of these new smoothed probabilities. In Section 2, we
present the Bayesian framework; in Section 3, we describe the visualisation
approach; in Section 4, we describe the initial experiments on this approach.
Some final remarks are given in Section 5.
<sup>1</sup> In this model, documents are represented as binary vectors: a term may be either
present or not present in a document and has a `natural' a priori probability of 0.5.</p>
    </sec>
    <sec id="sec-2">
      <title>Bayesian Framework</title>
      <p>
        In Bayesian inference, a problem is described by a mathematical model $M$ with
parameters $\theta$ and, when we have observed some data $D$, we use Bayes' rule to
determine our beliefs across different parameter values [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]:
      </p>
      <p>$$P(\theta \mid D, M) = \frac{P(D \mid \theta, M) \, P(\theta \mid M)}{P(D \mid M)} \; ; \quad (6)$$</p>
      <p>
        the posterior distribution of our belief on $\theta$ is equal to a likelihood function
$P(D \mid \theta, M)$, the mathematical model of our problem, multiplied by a prior
distribution $P(\theta \mid M)$, our belief in the values of the parameters of the model, and
normalized by the probability of the data $P(D \mid M)$. We control the prior by
choosing its distributional form along with its parameters, usually called
hyper-parameters. Since the product between $P(D \mid \theta, M)$ and $P(\theta \mid M)$ can be hard
to calculate, one solution is to find a “conjugate” prior of the likelihood
function [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
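      <p>To make the conjugacy argument concrete: for a Bernoulli likelihood with a beta prior, the posterior is again a beta distribution, so the update reduces to adding counts to the hyper-parameters. A minimal sketch of our own (not taken from [10]):</p>
      <preformat>
# Beta-Bernoulli conjugacy: observing k successes in n Bernoulli trials turns
# a beta(alpha, beta) prior into a beta(alpha + k, beta + n - k) posterior.
alpha &lt;- 0.5; beta &lt;- 0.5   # hyper-parameters of the prior
k &lt;- 3; n &lt;- 10             # observed data

posterior_mean &lt;- (alpha + k) / (alpha + beta + n)
posterior_mean              # 3.5 / 11: the form of the estimate in Equation 10
      </preformat>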
      <p>In the case of a likelihood function which belongs to the exponential family,
there always exists a conjugate prior. Naïve Bayes (NB) models have a
likelihood of this type and, since the RSJ weight is related to the Binary
Independence Model, which is a multi-variate Bernoulli NB model, we can easily derive
a formula to estimate the parameter $\theta$. The multi-variate Bernoulli NB model
represents a document $d$ as a vector of $V$ (the number of words in the vocabulary)
Bernoulli random variables $d = (t_1, \ldots, t_i, \ldots, t_V)$ such that:
$$t_i \sim \mathrm{Bern}(\theta_{t_i}) \; . \quad (7)$$
We can write the probability of a document by using the NB assumption as:
$$P(d \mid \theta) = \prod_{i=1}^{V} \theta_i^{x_i} \, (1 - \theta_i)^{1 - x_i} \; , \quad (8)$$
where $x_i$ is a binary value that is equal to 1 when the term $t_i$ is present
in the document and to 0 otherwise. With a Maximum Likelihood estimation, we
would end up with the results shown in Equations 2 and 3; instead, we want to
integrate the conjugate prior, which in the case of a Bernoulli random variable
is the beta function:
$$\mathrm{beta}(\theta_i) \propto \theta_i^{\alpha - 1} \, (1 - \theta_i)^{\beta - 1} \; , \quad (9)$$
where $\theta_i$ refers to the $i$-th random variable $t_i$. Therefore, the new estimate of the
probability of a term $t_i$ that takes into account the prior knowledge is given
by the posterior mean of Eq. 6 (see [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for the details of this result). For the
relevant documents we obtain:
$$\hat{\theta}_{t_i \mid rel} = \frac{r_i + \alpha}{R + \alpha + \beta} = \hat{p}_i \; , \quad (10)$$
where $\hat{p}_i$ is the new estimate of the probability $p_i$. Accordingly, the probability
of a term in the non-relevant documents is:
$$\hat{\theta}_{t_i \mid \overline{rel}} = \frac{n_i - r_i + \alpha}{N - R + \alpha + \beta} = \hat{q}_i \; . \quad (11)$$
With these formulas, we can recover different smoothing approaches; for example,
with $\alpha = 0$ and $\beta = 0$ we obtain the Maximum Likelihood estimation, and with
$\alpha = 1$, $\beta = 1$ the Laplace smoothing. We can even recover the RSJ score by
assigning $\alpha = 0.5$ and $\beta = 0.5$.</p>
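      <p>The two estimators of Equations 10 and 11 are immediate to implement; a small sketch, where the choice of alpha and beta recovers the smoothing variants just listed:</p>
      <preformat>
# Posterior-mean estimates of p_i (Equation 10) and q_i (Equation 11).
p_hat &lt;- function(r, R, alpha, beta) (r + alpha) / (R + alpha + beta)
q_hat &lt;- function(r, n, R, N, alpha, beta) (n - r + alpha) / (N - R + alpha + beta)

p_hat(r = 3, R = 10, alpha = 0,   beta = 0)    # Maximum Likelihood (Equation 2)
p_hat(r = 3, R = 10, alpha = 1,   beta = 1)    # Laplace smoothing
p_hat(r = 3, R = 10, alpha = 0.5, beta = 0.5)  # the 0.5 constant of Equation 4
      </preformat>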
    </sec>
    <sec id="sec-3">
      <title>Probabilistic Visual Data Mining</title>
      <p>Now that we have new estimates for the probabilities $p_i$ and $q_i$, we need a way
to assess how the parameters $\alpha$ and $\beta$ influence the effectiveness of the retrieval
system. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ,
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], we presented a visual data mining tool for analyzing the
behavior of various smoothing methods, to suggest possible directions for finding
the most suitable smoothing parameters, and to shed light on new methods
of automatic hyper-parameter estimation. Here, we use the same approach for
analyzing a simplified version of BM25 (that is, Equation 5 ignoring the term
frequency function).</p>
      <p>In order to explain the visual approach, we present the problem of retrieval
in terms of a classification problem: classify the documents as relevant or
non-relevant. Given a document $d$ and a query $q$, we consider $d$ relevant if:
$$P(rel \mid d, q) &gt; P(\overline{rel} \mid d, q) \; , \quad (12)$$
that is, when the probability of being relevant is higher than the probability
of not being relevant. By using Bayes' rule, we can invert the problem and
decide that $d$ is relevant when:</p>
      <p>
$$P(d \mid rel, q) \, P(rel \mid q) &gt; P(d \mid \overline{rel}, q) \, P(\overline{rel} \mid q) \; . \quad (13)$$
Note that we are exactly in the same situation as Equation (2.2) of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where:
$$P(rel \mid d, q) \propto P(d \mid rel, q) \, P(rel \mid q) \; . \quad (14)$$
In fact, if we divide both members of Equation 13 by $P(d \mid \overline{rel}, q) \, P(\overline{rel} \mid q)$ (we
assume that this quantity is strictly greater than zero), we obtain:
$$\frac{P(d \mid rel, q) \, P(rel \mid q)}{P(d \mid \overline{rel}, q) \, P(\overline{rel} \mid q)} &gt; 1 \; , \quad (15)$$
where the ranking of the documents is given by the value of the ratio on the left
(as in BM25); moreover, we can classify a document as `relevant' if this ratio
is greater than one.</p>
      <p>The main idea of the two-dimensional visualization of probabilistic models
is to keep the two probabilities separated and use the two numbers as two
coordinates, X and Y, on the Cartesian plane:
$$\underbrace{P(d \mid rel, q) \, P(rel \mid q)}_{X} &gt; \underbrace{P(d \mid \overline{rel}, q) \, P(\overline{rel} \mid q)}_{Y} \; . \quad (16)$$
If we take the logs, a monotonic transformation that maintains the order, and if
we model the document as a multivariate binomial (as in the Binary
Independence Model [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), we obtain for the coordinate X:</p>
      <p>
$$X = \underbrace{\sum_{i \in V} x_i \log \frac{\hat{p}_i}{1 - \hat{p}_i} + \sum_{i \in V} \log(1 - \hat{p}_i)}_{P(d \mid rel, q)} + \underbrace{\log P(rel \mid q)}_{P(rel \mid q)} \; .$$
Since we are using the Bayesian estimate $\hat{p}_i$, we can modulate it by adjusting
the hyper-parameters $\alpha$ and $\beta$ of Equation 10. If we want to consider only the terms
that appear in the query, the first sum is computed over the terms $i \in q$, which
corresponds to Equation (2.6) of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
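      <p>Under this formulation, the coordinates of a document are simple sums over the vocabulary; a minimal sketch, where the binary vector x, the vectors p and q of estimates, and the prior probability prior_rel are placeholders of our own (the Y coordinate is obtained symmetrically with $\hat{q}_i$):</p>
      <preformat>
# Coordinates of one document on the plane (logs of the two sides of Eq. 16).
# x: binary term-occurrence vector; p, q: vectors of p_hat and q_hat values.
doc_coordinates &lt;- function(x, p, q, prior_rel = 0.5) {
  X &lt;- sum(x * log(p / (1 - p))) + sum(log(1 - p)) + log(prior_rel)
  Y &lt;- sum(x * log(q / (1 - q))) + sum(log(1 - q)) + log(1 - prior_rel)
  c(X = X, Y = Y)
}
      </preformat>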
      <p>We intentionally kept explicit the two addends that are independent
of the document, respectively $\sum_{i \in V} \log(1 - \hat{p}_i)$ and $\log(P(rel \mid q))$. These two
addends do not influence the ordering among documents (they are constant factors
independent of the document) but they can (and actually do) affect the
classification performance. If we rewrite the complete inequality and substitute
these addends with constants, we obtain:<sup>2</sup>
$$\sum_{i \in q} x_i \log \frac{\hat{p}_i}{1 - \hat{p}_i} + c_1 &gt; \sum_{i \in q} x_i \log \frac{\hat{q}_i}{1 - \hat{q}_i} + c_2 \quad (17)$$
$$\sum_{i \in q} x_i \log \frac{\hat{p}_i}{1 - \hat{p}_i} - \sum_{i \in q} x_i \log \frac{\hat{q}_i}{1 - \hat{q}_i} &gt; c_2 - c_1 \quad (18)$$
$$\sum_{i \in q} x_i \left( \log \frac{\hat{p}_i}{1 - \hat{p}_i} - \log \frac{\hat{q}_i}{1 - \hat{q}_i} \right) &gt; c_2 - c_1 \quad (19)$$
$$\underbrace{\sum_{i \in q} x_i \log \frac{\hat{p}_i \, (1 - \hat{q}_i)}{(1 - \hat{p}_i) \, \hat{q}_i}}_{RSJ} &gt; c_2 - c_1 \; , \quad (20)$$
which is exactly the same formulation of the RSJ weight with new estimates for $p_i$
and $q_i$, plus some indication about whether we classify a document as relevant
or not.
<sup>2</sup> Note that we need to investigate how this reformulation is related to Cooper's linked
dependence assumption [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].</p>
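      <p>Rewritten this way, ranking and classification rely on the same quantity: the RSJ-style sum on the left of Equation 20, compared against the threshold $c_2 - c_1$. A small sketch:</p>
      <preformat>
# RSJ score of Equation 20 restricted to query terms, and the resulting
# classification rule: relevant if the score exceeds c2 - c1.
rsj_score &lt;- function(x, p, q) sum(x * log((p * (1 - q)) / ((1 - p) * q)))

classify &lt;- function(x, p, q, c1 = 0, c2 = 0) rsj_score(x, p, q) &gt; c2 - c1
      </preformat>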
        <sec id="sec-2-2-1">
          <title>A simple example</title>
          <p>Let us consider a collection of 1,000 documents, and suppose that we have a query
with two terms, $q = \{t_1, t_2\}$, and the following estimates:
– 10 relevant documents ($R = 10$) for this query;
– 20 documents that contain term $t_1$ ($n_1 = 20$), three of which are known
to be relevant ($r_1 = 3$);
– 17 documents that contain term $t_2$ ($n_2 = 17$), two of which are known to
be relevant ($r_2 = 2$).</p>
          <p>For the log odds, we have:
$$\omega_1 = \log \frac{\hat{p}_1}{1 - \hat{p}_1} = \log \frac{3 + \alpha}{7 + \beta} \; , \qquad
\bar{\omega}_1 = \log \frac{\hat{q}_1}{1 - \hat{q}_1} = \log \frac{17 + \alpha}{973 + \beta} \; ,$$
$$\omega_2 = \log \frac{\hat{p}_2}{1 - \hat{p}_2} = \log \frac{2 + \alpha}{8 + \beta} \; , \qquad
\bar{\omega}_2 = \log \frac{\hat{q}_2}{1 - \hat{q}_2} = \log \frac{15 + \alpha}{975 + \beta} \; ,$$
where $\omega_i$ and $\bar{\omega}_i$ denote the log odds in the relevant and non-relevant class,
respectively. Suppose that we want to rank two documents $d_1$ and $d_2$, where $d_1$ contains both
terms $t_1$ and $t_2$, while $d_2$ contains only term $t_1$. Let us draw the points in the
two-dimensional space, assuming the two constants $c_1$ and $c_2$ equal to zero:
$$X_{d_1} = x_{1,d_1} \, \omega_1 + x_{2,d_1} \, \omega_2 = 1 \, \omega_1 + 1 \, \omega_2 \simeq 2.86 \; ,$$
$$Y_{d_1} = x_{1,d_1} \, \bar{\omega}_1 + x_{2,d_1} \, \bar{\omega}_2 = 1 \, \bar{\omega}_1 + 1 \, \bar{\omega}_2 \simeq 11.77 \; ,$$
$$X_{d_2} = x_{1,d_2} \, \omega_1 + x_{2,d_2} \, \omega_2 = 1 \, \omega_1 + 0 \, \omega_2 \simeq 1.10 \; ,$$
$$Y_{d_2} = x_{1,d_2} \, \bar{\omega}_1 + x_{2,d_2} \, \bar{\omega}_2 = 1 \, \bar{\omega}_1 + 0 \, \bar{\omega}_2 \simeq 5.80 \; ,$$
where $x_{i,d_j} = 1$ if term $t_i$ occurs in document $d_j$, and $x_{i,d_j} = 0$ otherwise.</p>
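          <p>The example can be reproduced with the estimators sketched above (a sketch; the exact values plotted in Figure 1 also depend on the formulation of the document-independent addends, so the output is only indicative):</p>
          <preformat>
# Worked example of Section 3.1 with alpha = beta = 0.5, R = 10, N = 1000.
alpha &lt;- 0.5; beta &lt;- 0.5; R &lt;- 10; N &lt;- 1000

p &lt;- c(p_hat(3, R, alpha, beta),         p_hat(2, R, alpha, beta))
q &lt;- c(q_hat(3, 20, R, N, alpha, beta),  q_hat(2, 17, R, N, alpha, beta))

omega     &lt;- log(p / (1 - p))   # log odds in the relevant class
omega_bar &lt;- log(q / (1 - q))   # log odds in the non-relevant class

# d1 contains both terms, d2 contains only t1:
X &lt;- c(d1 = sum(c(1, 1) * omega),     d2 = sum(c(1, 0) * omega))
Y &lt;- c(d1 = sum(c(1, 1) * omega_bar), d2 = sum(c(1, 0) * omega_bar))
          </preformat>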
          <p>In Figure 1, the two points $(X_{d_1}, Y_{d_1})$ and $(X_{d_2}, Y_{d_2})$ are shown. The line
is a graphical help to indicate which point is ranked first: the closer the point,
the higher the document in the ranking. The justification of this statement is not
presented in this paper for space reasons; refer to [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] for further details. What
is important here is the possibility to assess the influence of the parameters $\alpha$ and
$\beta$ on the RSJ score. The objective is to study whether these two parameters can
drastically change the ranking of the documents or not; in graphical terms, whether
we can “rotate” the points such that the point closest to the line becomes the furthest.</p>
          <p>Moreover, there are some considerations we want to address:
– when the number of terms in the query is small, it is very difficult to note
any change in the ranking list. Remember that with $n$ query terms, we can
only have $2^n$ points (or RSJ scores). In the event of a query consisting of a
single term, all the documents that contain that query term collapse into one
point;
– the Okapi BM25 weight `scatters' the documents that are collapsed into one
point in the space by multiplying the RSJ score with a scaling factor $f(tf_i)$
proportional to the frequency of the term in the document. Therefore, we
expect this Bayesian approach to be more effective on BM25 rather than
on the simple RSJ score.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Visualization Tool</title>
          <p>The visualisation tool was designed and developed in R [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. It consists of three
panels:</p>
          <p>– View Panel: this displays the two-dimensional plot of the dataset according
to the choices of the user;
– Interaction Panel: this allows for the interaction between the user and the
parameters of the probabilistic models;
– Performance Panel: this displays the performance measures of the model.</p>
          <p>Figure 2 shows the main window with the three panels. In the centre-right,
there is the main view panel, the actual two-dimensional view of the documents
as points, blue and red for relevant and non-relevant, respectively. The green
line represents the ranking line: the closer the point, the higher the rank in the
retrieval list. At the top and on the left, there is the interaction panel where
the user can choose different options: the type of model (Bernoulli in our
case), the type of smoothing (conjugate prior), and the values of the parameters
$\alpha$ and $\beta$. The bottom of the window is dedicated to the performance in terms of
classification (not used in this experiment).</p>
        </sec>
      </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>Preliminary experiments were carried out on some topics of the TREC 2001
Ad-hoc Web Track test collection.<sup>3</sup> The content of each document was processed
during indexing, except for the text contained inside the &lt;script&gt;&lt;/script&gt;
and the &lt;style&gt;&lt;/style&gt; tags. When parsing, the title of the document was
extracted and considered as the beginning of the document content. Stop words
were removed during indexing.<sup>4</sup> For each topic we considered the set of
documents in the pool, therefore those for which explicit assessments are available.
<sup>3</sup> http://trec.nist.gov/data/t10.web.html</p>
      <p>We considered two different experimental settings: (i) a query-term based
representation and (ii) a collection vocabulary-based representation of the documents.
In the former case, each document was represented by means of the descriptors
extracted from the titles of the TREC topics, used as queries: therefore $V$
consisted of the query terms; in the latter case, $V$ consisted of the entire collection
vocabulary. Neither setting considered stop words as part of $V$.
<sup>4</sup> The stop word list is the one available at the URL
http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words</p>
      <p>In this paper, we report the experiments on topic 528. We selected this query
because it contains five terms, which makes it easier to show the effect of the
hyper-parameters. In Figure 2, the cloud of points generated by the two-dimensional
approach is shown. Parameters $\alpha$ and $\beta$ are set to the standard RSJ score
constant 0.5. The line corresponds to the decision line of a classifier, and it also
corresponds to the `ranking' line: imagine this line spanning the plane from right
to left; each time the line touches a document, the document is added to the list
of retrieved documents.</p>
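      <p>The effect of the hyper-parameters can also be explored outside the interactive tool by recomputing the coordinates over a grid of values; a sketch reusing the functions defined earlier, which shows the drift of the points described next:</p>
      <preformat>
# Recompute the X coordinate of the two example documents for increasing
# alpha, keeping beta fixed at 0.5, to observe the rotation of the points.
for (alpha in c(0.5, 2, 10)) {
  p &lt;- c(p_hat(3, 10, alpha, 0.5), p_hat(2, 10, alpha, 0.5))
  omega &lt;- log(p / (1 - p))
  cat("alpha =", alpha, " X_d1 =", sum(omega), " X_d2 =", omega[1], "\n")
}
      </preformat>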
      <p>In Figure 3, the hyper-parameter $\alpha$ was increased while $\beta$ was left equal to
0.5. When we increase $\alpha$, the probability $\hat{p}_i$ tends to one, and the effect, in terms
of the two-dimensional plot, is that the points rotate anti-clockwise. In Figure 4,
the opposite effect is obtained by increasing $\beta$ and leaving $\alpha$ equal to 0.5. In
both situations, the list of ranked documents was significantly different from the
original list produced by using the classical RSJ score.</p>
    </sec>
    <sec id="sec-5">
      <title>Final Remarks</title>
      <p>This paper presents a new direction for the study of the Okapi BM25 model. In
particular, we have focused on a full Bayesian approach for deriving a smoothed
formula that takes into account our a-priori knowledge on the probability of
terms. In fact, we think that many of the efforts in improving BM25 went
mostly into capturing the language model (frequencies, length, etc.) but
missed the fact that the 0.5 correction factor could be one of the parameters
that can be modelled in a better way.</p>
      <p>By starting from a slightly different approach, the classification of documents
into relevant and non-relevant classes, we derived the exact same formula of the
RSJ weight but with more degrees of interaction. The two-dimensional
visualization approach helped in understanding why some of the constant factors can
be taken into account for the case of classification and, more importantly, how
the hyper-parameters can be tuned to obtain a better ranking.</p>
      <p>After this preliminary experiment, we can draw some considerations: for the
first time, it was possible to visualize the clusters of points that are generated by
the RSJ scores; it was clear that very short queries tend to create a very small
number of points, making it hard to perform a good retrieval; hyper-parameters
do make a difference in both classification and retrieval.</p>
      <p>There are still many open research questions we want to investigate in the
future:
– so far, we have assumed that all the beta priors associated to each term
use exactly the same values for the hyper-parameters $\alpha$ and $\beta$. A more selective
approach may be more effective;
– the coordinates of the points in the two-dimensional plot take into account
the two constants of Equation 17. In particular, the addend $\sum_{i \in V} \log(1 - \hat{p}_i)$
may be the cause of the `rotation' of the points, hence of the radical change of
the ranking list;
– the current approach assumes that the values of $R$ and $r_i$ are known for
each term in the query: indeed, these values are adopted to estimate the
coordinates of each document. A further research question is the effect of
estimation based on feedback data on the capability of the probabilistic
visual data mining approach adopted in this paper.</p>
      <p>Acknowledgments. This work has been partially supported by the
QONTEXT project under grant agreement N. 247590 (FP7/2007-2013).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sparck Jones</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Relevance weighting of search terms</article-title>
          . In Willett, P., ed.:
          <article-title>Document retrieval systems</article-title>
          . Taylor Graham Publishing, London, UK (
          <year>1988</year>
          )
          <fpage>143</fpage>
          –
          <lpage>160</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          :
          <article-title>The Probability Ranking Principle in IR</article-title>
          .
          <source>Journal of Documentation</source>
          <volume>33</volume>
          (
          <year>1977</year>
          )
          <fpage>294</fpage>
          –
          <lpage>304</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          :
          <article-title>A probabilistic model of information retrieval: development and comparative experiments</article-title>
          .
          <source>Inf. Process. Manage</source>
          .
          <volume>36</volume>
          (
          <year>2000</year>
          )
          <fpage>779</fpage>
          –
          <lpage>808</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cox</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The Analysis of Binary Data</article-title>
          .
          <source>Monographs on Statistics and Applied Probability Series. Chapman &amp; Hall</source>
          (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Understanding inverse document frequency: On theoretical arguments for IDF</article-title>
          .
          <source>Journal of Documentation</source>
          . Volume
          <volume>60</volume>
          . (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval</article-title>
          . In Croft, W.B.,
          <string-name>
            <surname>van Rijsbergen</surname>
          </string-name>
          , C.J., eds.: SIGIR, ACM/Springer (
          <year>1994</year>
          )
          <fpage>232</fpage>
          –
          <lpage>241</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hancock-Beaulieu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gatford</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Okapi at TREC-3</article-title>
          . In:
          <article-title>Proceedings of the Third Text REtrieval Conference (TREC), Gaithesburg</article-title>
          , USA (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>On relevance weights with little relevance information</article-title>
          .
          <source>SIGIR Forum</source>
          <volume>31</volume>
          (
          <year>1997</year>
          )
          <fpage>16</fpage>
          –
          <lpage>24</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaragoza</surname>
          </string-name>
          , H.:
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>333</fpage>
          –
          <lpage>389</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kruschke</surname>
            ,
            <given-names>J.K.</given-names>
          </string-name>
          :
          <article-title>Doing Bayesian Data Analysis: A Tutorial with R and BUGS</article-title>
          . 1st edn. Academic Press/Elsevier (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><surname>Di Nunzio</surname>, <given-names>G.</given-names></string-name>
          ,
          <string-name><surname>Sordoni</surname>, <given-names>A.</given-names></string-name>
          :
          <article-title>How well do we know Bernoulli?</article-title>
          In: IIR. Volume 835 of CEUR Workshop Proceedings, CEUR-WS.org (
          <year>2012</year>
          )
          <fpage>38</fpage>
          –
          <lpage>44</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name><surname>Di Nunzio</surname>, <given-names>G.</given-names></string-name>
          ,
          <string-name><surname>Sordoni</surname>, <given-names>A.</given-names></string-name>
          :
          <article-title>A visual tool for Bayesian data analysis: The impact of smoothing on naive Bayes text classifiers</article-title>
          . In: Proceedings of the 35th International ACM SIGIR 2012. Volume 1002, Portland, Oregon, USA (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name><surname>Cooper</surname>, <given-names>W.S.</given-names></string-name>
          :
          <article-title>Some inconsistencies and misnomers in probabilistic information retrieval</article-title>
          . In: Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '91, New York, NY, USA, ACM (
          <year>1991</year>
          )
          <fpage>57</fpage>
          –
          <lpage>61</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name><surname>Di Nunzio</surname>, <given-names>G.</given-names></string-name>
          :
          <article-title>Using scatterplots to understand and improve probabilistic models for text categorization and retrieval</article-title>
          .
          <source>Int. J. Approx. Reasoning</source>
          <volume>50</volume>
          (
          <year>2009</year>
          )
          <fpage>945</fpage>
          –
          <lpage>956</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name><surname>Di Nunzio</surname>, <given-names>G.</given-names></string-name>
          ,
          <string-name><surname>Sordoni</surname>, <given-names>A.</given-names></string-name>
          :
          <article-title>A Visual Data Mining Approach to Parameters Optimization</article-title>
          . In Zhao, Y., Cen, Y., eds.: Data Mining Applications in R. Elsevier (
          <year>2013</year>
          , In Press)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>