<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Geometric Perspectives of the BM25</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <email>dinunzio@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present the initial ndings about a possible geometric interpretation of the BM25 model and a comparison of the BM25 with the Binary Independence Model (BIM) on a two-dimensional space. A Web application was developed in R to show an example of this geometric view on a standard TREC collection. The application is accessible at the following link: http://gmdn.shinyapps.io/shinyRF04 wi0 =</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Binary Independence Model (BIM) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a probabilistic retrieval model
that considers documents as binary vectors and ranks the documents according
to their probability of relevance given a query. The BIM assigns a weight wi to
each term ti that appears in both the query and the document:
wi = log
pi (1
qi (1
pi =
      </p>
      <p>ri +
R +
+
; qi =</p>
      <p>
        N
qi)
pi)
ni
;
ri +
R +
+
where pi (or qi) is the probability that a relevant (or non-relevant) document
contains the term ti. The estimates of these probabilities are:
where ri is the number of relevant documents that contain ti, ni the number of
documents that contain term ti, R and N the number of relevant documents and
the total number of documents, respectively. The parameters and are used to
smooth pi and qi in order to avoid arithmetical anomalies (in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], = = 0:5).
      </p>
      <p>
        The BM25 model goes one step further by introducing the frequency of the
term and the length of the document in the weight of the term ti [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]:
      </p>
      <p>
        tfi
tfi + K
where tfi the frequency of the term ti in the document, and K is a function of
some parameters about the global statistics of the collection of documents:
K = k1 ((1
b) + b dl= )
? This work is an extended abstract of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
(1)
(2)
(3)
(4)
where k1 and b are two parameters (usually set to 1.2 and 0.75, respectively), dl
is the length of document d, and is the average document length.
      </p>
      <p>
        In this paper, we want to study the problem of the optimisation of the
parameters of two models, the BIM and the BM25, and to show a direct comparison
of the two models by means of a visual interpretation of probabilities based on
the idea of Likelihood Spaces [
        <xref ref-type="bibr" rid="ref1 ref6">6, 1</xref>
        ]. For this purpose, we have developed a Web
application which allows users to be directly involved in the optimisation of the
retrieval function and to study the e ect of the variation of the parameters by
means of visual inspection. 1 As a showcase, the test collection used for this
application is based on the TREC2004 Robust collection. 2
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Mathematical Background</title>
      <p>
        The BIM ranks documents according to the probability of relevance (R = 1)
given a document d and a query q, P (R = 1jd; q). This probability can be
approximated by the sum of the weights wi de ned in Eq. 1 (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]):
X log
ti
|
      </p>
      <p>1
{z
x
pi
pi</p>
      <p>X log
ti
} |</p>
      <p>qi
1 qi
{z
y
}
&gt; 0
(5)
(6)
(7)
which can be interpreted as a ranking line (or a decision line) y &lt; M x + Q.
This formulation allows us to study the problem on a two-dimensional space
where documents are represented by two coordinates shown and ranking can be
optimised according to the parameters of the decision line.
1 http://gmdn.shinyapps.io/shinyRF04
2 http://trec.nist.gov/data/t13_robust.html</p>
      <p>P (R = 1jd; q)</p>
      <p>X
ti2d\q
wi
The BM25 ranking formula can be expressed with the same sum over the terms
that appear in the document and the query, replacing wi with wi0 of Eq. 3.</p>
      <p>
        In the two-dimensional representation of probabilities, we keep P (R = 1jd; q)
distinct from the probability of a document being not relevant P (R = 0jd; q).
With some algebraic manipulation (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), we obtain the following decision (or
ranking) function:
which is an alternative interpretation of the relevance weight of a document of the
original work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The two sums, x and y, can be interpreted as two coordinates
of a two-dimensional space, and documents are ranked according to the value of
the di erence of the two sums. With a more general approach (described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ])
which involves Bayesian Decision Theory, we can add two more parameters M
and Q:
      </p>
      <p>M X
ti
|</p>
      <p>pi
1
{z
x
pi
}
+Q</p>
      <p>X
ti
|</p>
      <p>qi
1 qi
{z
y
}
&gt; 0</p>
      <p>Description of the Interface
The Web application that we developed takes into account many important
parameters which characterise the two retrieval models:
{ we can change the smoothing parameters and and see how the
probabilities pi and qi change (for both BIM and BM25);
{ we can decide whether the ranking is computed over the terms of the
document or the terms of the query, and study the di erences when
pseudorelevance feedback is available (both for BIM and BM25).
{ we can change the BM25 parameters k1 and b;
{ we can change the proportion of training documents to estimate pi and qi
and change the number of terms of the vocabulary (both BIM and BM25);
{ we can adjust the decision line by changing the angular coe cient M (and
the intercept Q which does not a ect ranking).</p>
      <p>The main window is split into two parts: the sidebar with the interaction widgets
on the left and the main panel with the output on the right (in Fig. 1 we show
only half of the interface for space limits). 3
Interaction The user can interact with the following widgets
1. Select the topic of interest from the drop-down menu.
2. Select the retrieval model (if BM25 is not selected, the BIM is on).
3. Change the value of the parameters and .
4. Choose the number of folds that we need to compute the probabilities of the
terms of relevant and non relevant documents.
5. Change the parameters M and Q of the ranking line.</p>
      <p>Visualization The main panel is divided into two columns: the rst column
shows the results on the validation set, the second column (not shown in the
gure) the results on the test set. Both columns contain the following pieces of
information: the text box shows the total number of objects used for validation
and the number of positive examples (red points, the pseudo-relevant documents
of the chosen topic). The table shows performance measures in terms of
precisionat-j (j = 5, 10, 20, 100, 500, 1000). The two-dimensional plot shows in red
the relevant documents of the chosen topic (pseudo-relevant for validation, true
relevant for test) and in black all the other documents of the collection.
4</p>
      <p>Description of the Interface
In this paper, we have presented a Web application developed in R which allows
users to interact with two retrieval models, the BIM and the BM25 models, on a
standard TREC collection. The two models are projected on a two-dimensional
space based on the idea of Likelihood spaces. The interactive application shows,
in a real machine learning setting, how the human pattern recognition
capabilities can immediately detect whether the model is close to the optimal solution or
not. We believe that this interactive approach may be a crucial step in setting the
initial parameters of an automatic procedure that optimises these parameters.
3 http://gmdn.shinyapps.io/shinyRF04</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          .
          <article-title>Using scatterplots to understand and improve probabilistic models for text categorization and retrieval</article-title>
          .
          <source>Int. J. Approx. Reasoning</source>
          ,
          <volume>50</volume>
          (
          <issue>7</issue>
          ):
          <volume>945</volume>
          {
          <fpage>956</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          .
          <article-title>A new decision to take for cost-sensitive nave bayes classi ers</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>50</volume>
          (
          <issue>5</issue>
          ):
          <volume>653</volume>
          {
          <fpage>674</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Giorgio</given-names>
            <surname>Maria Di Nunzio</surname>
          </string-name>
          .
          <article-title>Shiny on your crazy diagonal</article-title>
          .
          <source>In Proceedings of the SIGIR</source>
          <year>2015</year>
          , pages in press, http://dx.doi.org/10.1145/2766462.2767867,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Stephen</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Robertson</surname>
          </string-name>
          and Karen Sparck Jones.
          <article-title>Relevance weighting of search terms</article-title>
          . In Peter Willett, editor,
          <source>Document retrieval systems, chapter Relevance weighting of search terms</source>
          , pages
          <volume>143</volume>
          {
          <fpage>160</fpage>
          . Taylor Graham Publishing, London, UK, UK,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Stephen</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Robertson</surname>
            and
            <given-names>Hugo</given-names>
          </string-name>
          <string-name>
            <surname>Zaragoza</surname>
          </string-name>
          .
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          ,
          <volume>3</volume>
          (
          <issue>4</issue>
          ):
          <volume>333</volume>
          {
          <fpage>389</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Rita</given-names>
            <surname>Singh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bhiksha</given-names>
            <surname>Raj</surname>
          </string-name>
          .
          <article-title>Classi cation in likelihood spaces</article-title>
          .
          <source>Technometrics</source>
          ,
          <volume>46</volume>
          (
          <issue>3</issue>
          ):
          <volume>318</volume>
          {
          <fpage>329</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>