Geometric Perspectives of the BM25

Giorgio Maria Di Nunzio

Department of Information Engineering, University of Padua
dinunzio@dei.unipd.it

In this paper, we present the initial findings on a possible geometric interpretation of the BM25 model and a comparison of the BM25 with the Binary Independence Model (BIM) on a two-dimensional space. A Web application was developed in R to show an example of this geometric view on a standard TREC collection. The application is accessible at the following link: http://gmdn.shinyapps.io/shinyRF04

Introduction

The Binary Independence Model (BIM) [4] is a probabilistic retrieval model that represents documents as binary vectors and ranks them according to their probability of relevance given a query. The BIM assigns a weight $w_i$ to each term $t_i$ that appears in both the query and the document:

$$w_i = \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \qquad (1)$$

where $p_i$ (or $q_i$) is the probability that a relevant (or non-relevant) document contains the term $t_i$. The estimates of these probabilities are:

$$p_i = \frac{r_i + \alpha}{R + \alpha + \beta}, \qquad q_i = \frac{n_i - r_i + \alpha}{N - R + \alpha + \beta} \qquad (2)$$

where $r_i$ is the number of relevant documents that contain $t_i$, $n_i$ is the number of documents that contain $t_i$, and $R$ and $N$ are the number of relevant documents and the total number of documents, respectively. The parameters $\alpha$ and $\beta$ smooth $p_i$ and $q_i$ in order to avoid arithmetical anomalies (in [4], $\alpha = \beta = 0.5$).
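As a rough illustration of Eqs. 1 and 2, the following Python sketch computes the smoothed estimates and the resulting term weight. The function names and the toy counts are assumptions made for this example only; they are not taken from the Web application, which is written in R.

```python
import math

def bim_estimates(r_i, n_i, R, N, alpha=0.5, beta=0.5):
    """Smoothed estimates of p_i and q_i (Eq. 2).

    r_i: relevant documents containing the term
    n_i: documents containing the term
    R:   relevant documents, N: documents in the collection
    """
    p_i = (r_i + alpha) / (R + alpha + beta)
    q_i = (n_i - r_i + alpha) / (N - R + alpha + beta)
    return p_i, q_i

def bim_weight(p_i, q_i):
    """Term weight w_i = log[p_i (1 - q_i) / (q_i (1 - p_i))] (Eq. 1)."""
    return math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))

# Toy counts, invented for illustration: the term occurs in 8 of the
# 10 relevant documents and in 50 of the 1000 documents overall.
p, q = bim_estimates(r_i=8, n_i=50, R=10, N=1000)
w = bim_weight(p, q)
```

With the smoothing in place, a term that occurs in no relevant document (r_i = 0) still yields a finite weight, which is the arithmetical anomaly the two parameters are meant to avoid.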

The BM25 model goes one step further by introducing the frequency of the term and the length of the document in the weight of the term $t_i$ [5]:

$$w_i' = w_i \cdot \frac{tf_i}{tf_i + K} \qquad (3)$$

where $tf_i$ is the frequency of the term $t_i$ in the document, and $K$ is a function of some parameters about the global statistics of the collection of documents:

$$K = k_1 \left( (1 - b) + b \, \frac{dl}{avdl} \right) \qquad (4)$$

where $k_1$ and $b$ are two parameters (usually set to 1.2 and 0.75, respectively), $dl$ is the length of document $d$, and $avdl$ is the average document length.

* This work is an extended abstract of [3].
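A minimal sketch of the BM25 weighting follows, assuming the term weight takes the saturating form described in [5], where the BIM weight is multiplied by tf / (tf + K). The names `bm25_K`, `bm25_weight` and `avdl` (average document length) are chosen for this example.

```python
def bm25_K(dl, avdl, k1=1.2, b=0.75):
    """Length normalisation K = k1 * ((1 - b) + b * dl / avdl)."""
    return k1 * ((1 - b) + b * dl / avdl)

def bm25_weight(w_i, tf_i, dl, avdl, k1=1.2, b=0.75):
    """BM25 term weight: the BIM weight w_i scaled by tf_i / (tf_i + K)."""
    K = bm25_K(dl, avdl, k1, b)
    return w_i * tf_i / (tf_i + K)

# Toy values, invented for illustration: the same term and frequency
# in a document of average length and in one four times longer.
w_avg = bm25_weight(w_i=2.0, tf_i=3, dl=100, avdl=100)
w_long = bm25_weight(w_i=2.0, tf_i=3, dl=400, avdl=100)
```

The factor tf / (tf + K) is always below 1, so a positive BM25 weight never exceeds the BIM weight, and for b > 0 a longer document gets a larger K and therefore a smaller weight for the same term frequency.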

In this paper, we want to study the problem of the optimisation of the parameters of two models, the BIM and the BM25, and to show a direct comparison of the two models by means of a visual interpretation of probabilities based on the idea of Likelihood Spaces [6, 1]. For this purpose, we have developed a Web application which allows users to be directly involved in the optimisation of the retrieval function and to study the effect of the variation of the parameters by means of visual inspection.¹ As a showcase, the test collection used for this application is based on the TREC2004 Robust collection.²

Mathematical Background

The BIM ranks documents according to the probability of relevance ($R = 1$) given a document $d$ and a query $q$, $P(R = 1 \mid d, q)$. This probability can be approximated by the sum of the weights $w_i$ defined in Eq. 1 (see [5]):

$$P(R = 1 \mid d, q) \approx \sum_{t_i \in d \cap q} w_i \qquad (5)$$

The BM25 ranking formula can be expressed with the same sum over the terms that appear in the document and the query, replacing $w_i$ with $w_i'$ of Eq. 3.

In the two-dimensional representation of probabilities, we keep $P(R = 1 \mid d, q)$ distinct from the probability of a document being not relevant, $P(R = 0 \mid d, q)$. With some algebraic manipulation (see [2]), we obtain the following decision (or ranking) function:

$$\underbrace{\sum_{t_i} \log \frac{p_i}{1 - p_i}}_{x} - \underbrace{\sum_{t_i} \log \frac{q_i}{1 - q_i}}_{y} > 0 \qquad (6)$$

which is an alternative interpretation of the relevance weight of a document of the original work [4]. The two sums, $x$ and $y$, can be interpreted as two coordinates of a two-dimensional space, and documents are ranked according to the value of the difference of the two sums. With a more general approach (described in [2]) which involves Bayesian Decision Theory, we can add two more parameters $M$ and $Q$:

$$M \underbrace{\sum_{t_i} \log \frac{p_i}{1 - p_i}}_{x} + Q - \underbrace{\sum_{t_i} \log \frac{q_i}{1 - q_i}}_{y} > 0 \qquad (7)$$

which can be interpreted as a ranking line (or a decision line) $y < M x + Q$. This formulation allows us to study the problem on a two-dimensional space where documents are represented by the two coordinates $x$ and $y$, and ranking can be optimised according to the parameters of the decision line.

¹ http://gmdn.shinyapps.io/shinyRF04
² http://trec.nist.gov/data/t13_robust.html
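The two coordinates and the decision function can be sketched in Python as follows. This is an illustrative reading of the formulas, not the code of the Web application: the dictionaries `p` and `q`, the term names and the probability values are all invented for the example.

```python
import math

def coordinates(terms, p, q):
    """Map a document to (x, y): the sums of the log-odds of p_i and q_i."""
    x = sum(math.log(p[t] / (1 - p[t])) for t in terms)
    y = sum(math.log(q[t] / (1 - q[t])) for t in terms)
    return x, y

def decision_score(x, y, M=1.0, Q=0.0):
    """Signed score M*x + Q - y; positive means the point lies below the line."""
    return M * x + Q - y

# Toy term probabilities, invented for illustration.
p = {"retrieval": 0.7, "model": 0.6}
q = {"retrieval": 0.1, "model": 0.3}
terms = ["retrieval", "model"]

x, y = coordinates(terms, p, q)
score = decision_score(x, y)  # M = 1, Q = 0

# Sum of the BIM weights w_i of Eq. 1 over the same terms.
w_sum = sum(math.log(p[t] * (1 - q[t]) / (q[t] * (1 - p[t]))) for t in terms)
```

With M = 1 and Q = 0 the score equals the sum of the weights of Eq. 1, since each weight splits into the log-odds of p_i minus the log-odds of q_i. This is why the line y = x reproduces the original BIM ranking.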

Description of the Interface

The Web application that we developed takes into account many important parameters which characterise the two retrieval models:

- we can change the smoothing parameters $\alpha$ and $\beta$ and see how the probabilities $p_i$ and $q_i$ change (for both BIM and BM25);
- we can decide whether the ranking is computed over the terms of the document or the terms of the query, and study the differences when pseudo-relevance feedback is available (for both BIM and BM25);
- we can change the BM25 parameters $k_1$ and $b$;
- we can change the proportion of training documents used to estimate $p_i$ and $q_i$ and change the number of terms of the vocabulary (for both BIM and BM25);
- we can adjust the decision line by changing the angular coefficient $M$ (and the intercept $Q$, which does not affect ranking).
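The remark that the intercept does not affect ranking can be checked with a small Python sketch. The document identifiers and coordinates below are invented for illustration, and `rank` is a hypothetical helper, not a function of the application.

```python
def rank(points, M=1.0, Q=0.0):
    """Return document ids sorted by decreasing score M*x + Q - y."""
    return sorted(
        points,
        key=lambda d: M * points[d][0] + Q - points[d][1],
        reverse=True,
    )

# Toy (x, y) coordinates for three documents.
points = {"d1": (2.0, -1.0), "d2": (0.5, -3.0), "d3": (3.0, 1.5)}

base = rank(points)                # M = 1, Q = 0
shifted = rank(points, Q=5.0)      # same slope, different intercept
steeper = rank(points, M=3.0)      # different slope
```

Changing Q adds the same constant to every score, so `shifted` has the same order as `base`. Changing M reweights x against y and can reorder the documents, as it does for `steeper` in this toy example.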

The main window is split into two parts: the sidebar with the interaction widgets on the left and the main panel with the output on the right (in Fig. 1 we show only half of the interface for space limits).³

Interaction. The user can interact with the following widgets:

1. Select the topic of interest from the drop-down menu.
2. Select the retrieval model (if BM25 is not selected, the BIM is on).
3. Change the value of the parameters $\alpha$ and $\beta$.
4. Choose the number of folds used to compute the probabilities of the terms of relevant and non-relevant documents.
5. Change the parameters $M$ and $Q$ of the ranking line.

Visualization. The main panel is divided into two columns: the first column shows the results on the validation set, the second column (not shown in the figure) the results on the test set. Both columns contain the following pieces of information. The text box shows the total number of objects used for validation and the number of positive examples (red points, the pseudo-relevant documents of the chosen topic). The table shows performance measures in terms of precision-at-j (j = 5, 10, 20, 100, 500, 1000). The two-dimensional plot shows in red the relevant documents of the chosen topic (pseudo-relevant for validation, true relevant for test) and in black all the other documents of the collection.
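For reference, precision-at-j can be computed as in the Python sketch below. It assumes the common convention of dividing by j even when fewer than j documents are ranked; the text does not say which convention the application uses, and the identifiers are invented.

```python
def precision_at(ranked_ids, relevant_ids, j):
    """Fraction of the top-j ranked documents that are relevant."""
    top = ranked_ids[:j]
    hits = sum(1 for d in top if d in relevant_ids)
    return hits / j

# Toy ranking and relevance judgements, invented for illustration.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d4"}

cutoffs = [1, 2, 5]
table = {j: precision_at(ranking, relevant, j) for j in cutoffs}
```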

Conclusions

In this paper, we have presented a Web application developed in R which allows users to interact with two retrieval models, the BIM and the BM25, on a standard TREC collection. The two models are projected on a two-dimensional space based on the idea of Likelihood Spaces. The interactive application shows, in a real machine learning setting, how human pattern recognition capabilities can immediately detect whether the model is close to the optimal solution or not. We believe that this interactive approach may be a crucial step in setting the initial parameters of an automatic procedure that optimises these parameters.

³ http://gmdn.shinyapps.io/shinyRF04

References

1. Giorgio Maria Di Nunzio. Using scatterplots to understand and improve probabilistic models for text categorization and retrieval. Int. J. Approx. Reasoning, 50(7):945–956, 2009.
2. Giorgio Maria Di Nunzio. A new decision to take for cost-sensitive naïve Bayes classifiers. Information Processing & Management, 50(5):653–674, 2014.
3. Giorgio Maria Di Nunzio. Shiny on your crazy diagonal. In Proceedings of the SIGIR 2015, pages in press, http://dx.doi.org/10.1145/2766462.2767867, 2015.
4. Stephen Robertson and Karen Sparck Jones. Relevance weighting of search terms. In Peter Willett, editor, Document retrieval systems, chapter Relevance weighting of search terms, pages 143–160. Taylor Graham Publishing, London, UK, 1988.
5. Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
6. Rita Singh and Bhiksha Raj. Classification in likelihood spaces. Technometrics, 46(3):318–329, 2004.