<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BM25-FIC: Information Content-based Field Weighting for BM25F</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tuomas Ketola</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Roelleke</string-name>
          <email>t.roellekeg@qmul.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Queen Mary, University of London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <fpage>79</fpage>
      <lpage>85</lpage>
      <abstract>
<p>BM25F has been shown to perform well on many multi-field and multi-modal retrieval tasks. However, one of its key challenges is finding appropriate field weights. This paper tackles the challenge by introducing a new analytical method for the automatic estimation of these weights. The method, denoted BM25-FIC, is based on field information content (FIC), calculated from term, collection and field statistics. The field weights are applied to each document separately rather than to the entire field, as is normally done by BM25F, where the field weights are constant across documents. BM25-FIC outperforms BM25F in terms of P@10, MAP and NDCG on a small test collection. The paper then introduces an interactive information discovery model based on the field weights. The weights are used to compute a similarity score between a seed document and the retrieved documents. Overall, the BM25-FIC approach is an enhanced BM25F method that combines information-oriented search and parameter estimation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Formal retrieval models for multi-modal and heterogeneous data are
becoming more necessary as the complexity of data collections and information needs
grows. Formality is required to keep the models interpretable, a quality often
expected in fields such as law and medicine. Most of the data collections searched
these days, whether websites, product catalogues or multi-media data,
consist of objects with more than one feature and more than one feature type.</p>
      <p>
        BM25F has been shown to be effective for multi-modal and multi-field
retrieval [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One of its main challenges is the choice of field weights. The main
contribution of this paper is to introduce a new method for automatically
determining these weights: BM25-FIC (BM25 Field Information Content).
      </p>
      <p>The proposed method calculates the field weights based on field information
content, estimated from term, collection and field statistics. As the weights are
calculated directly, no learning or heuristics are needed to determine appropriate
field weights, as is the case with BM25F. This makes BM25-FIC much easier to
implement. Furthermore, the field weights are determined for each document
field separately, rather than for the entire field of the collection, as is done by
the normal BM25F. This means that BM25-FIC is able to capture more
complicated relationships between the query and the different fields.</p>
      <p>Our experiments confirm that BM25-FIC outperforms BM25F. However, it
needs to be noted that this result is obtained on a small test collection and
without training of the BM25F weights. Training was not performed on the
benchmarks as BM25-FIC itself requires no training.</p>
      <p>The second contribution of the paper is to introduce an interactive
information discovery model that benefits from the obtained field weights. It uses a seed
document as a reference point to help the user better define their query intent.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Multi-modal retrieval has received much attention in the IR community.
Multi-modal approaches closely relate to multi-field and multi-model approaches. Here
the terms multi-modal and multi-field are used interchangeably, as the fields are
assumed to represent different data types. However, our approach is not a
multi-model one, as BM25 is used for all fields. Multi-modal data can be fully text
based, rather than audio and text for example, as different feature types can be
represented in text, e.g. author lists, abstracts or geographic information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        It has been shown that BM25F generalizes well to multi-modal data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
As with normal textual data, this involves setting the field weights. He and Ounis
have examined the setting of the field weights and other field-level
hyper-parameters extensively [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Outside of BM25F, various probabilistic and learning-based
models have been considered for multi-field ad-hoc retrieval. These models will
not be explained in detail here, as the focus is on BM25F. Instead, the reader
is advised to see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for a summary of the different approaches.
      </p>
      <p>
        The principle of polyrepresentation relates to the concept of relevance
dimensionality discussed in this paper. According to the principle, relevance consists
of multiple overlapping cognitive representations of documents. The most
relevant documents are most likely found where these representations overlap [
        <xref ref-type="bibr" rid="ref10 ref3">3,10</xref>
        ].
The different dimensions of relevance, represented by the different document
fields, can be seen as forms of cognitive representations when they communicate
different types of information.
      </p>
      <p>There are two common ways in which multiple fields are considered in the BM25
context. The first option is to compute the BM25 score for each field and calculate a
weighted sum over these scores. In this paper this approach is denoted BM25F-macro:</p>
      <p>RSV_{BM25F-macro,b,k1}(q, d, c) := Σ_{f ∈ F_c} w_f · RSV_{BM25,b,k1}(q, f_d, c)  (1)</p>
      <p>where q is a query, d a document, c a collection, f a document field, w_f the
field weight, and F_c the set of fields. Note that f is the field type (e.g. title, abstract,
body), whereas f_d denotes an instance (e.g. the title of document d). b and k1
are BM25 hyper-parameters.</p>
      <p>RSV_{BM25,b,k1}(q, f_d, c) := Σ_{t ∈ f_d} TF_{BM25,b,k1}(t, f_d, c) · w_RSJ(t, c)  (2)</p>
      <p>where the TF component is the BM25 term-frequency quantification:</p>
      <p>TF_{BM25,b,k1}(t, f_d, c) := ((k1 + 1) · n(t, f_d)) / (n(t, f_d) + k1 · (b · len(f_d)/avgl(f, c) + (1 − b)))  (3)</p>
      <p>
        where n(t, f_d) is the raw term frequency and the average field length avgl is
determined globally for the collection. w_RSJ(t, c) can be defined based on documents, or can consider
field-based frequencies. For example, w_RSJ(t, c) := (N_F(c) + 0.5) / (df(t, F_c) + 0.5) is a field-based rather
than document-based weight, as described by Robertson et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
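      <p>To make Equations (1)-(3) concrete, the following is a minimal Python sketch of BM25F-macro. All names (tf_bm25, rsv_bm25, and the idf dictionary standing in for w_RSJ(t, c)) are illustrative assumptions, not the authors' implementation.</p>

```python
def tf_bm25(n, field_len, avg_len, k1=1.2, b=0.75):
    # Saturating BM25 term-frequency component (cf. Eq. 3).
    return ((k1 + 1) * n) / (n + k1 * (b * field_len / avg_len + (1 - b)))

def rsv_bm25(query, field_tokens, idf, avg_len, k1=1.2, b=0.75):
    # BM25 score of one field instance f_d against query q (cf. Eq. 2);
    # idf[t] plays the role of w_RSJ(t, c).
    score = 0.0
    for t in set(query):
        n = field_tokens.count(t)
        if n > 0:
            score += tf_bm25(n, len(field_tokens), avg_len, k1, b) * idf[t]
    return score

def rsv_bm25f_macro(query, doc_fields, weights, idf, avg_lens):
    # Eq. 1: weighted sum of per-field BM25 scores with constant field weights w_f.
    return sum(weights[f] * rsv_bm25(query, toks, idf, avg_lens[f])
               for f, toks in doc_fields.items())
```

With k1 = 1.2 and b = 0.75, a term occurring once in a field of average length yields a TF component of exactly 1.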
      <p>
        The second option for multi-field BM25 retrieval was introduced by
Robertson et al., who noted that approaches using weighted sums of field-based
retrieval scores give too much weight to some query terms [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Their approach
differs from BM25F-macro in that the constant field weights are applied to the
raw term frequencies n(t, f_d) and the BM25 score is calculated over the summed
term frequencies from all the fields. This model is commonly known as BM25F;
here it is denoted BM25F-micro for clarity.
      </p>
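      <p>A corresponding sketch of the BM25F-micro idea, under the same hypothetical naming: field weights scale the raw term frequencies, and BM25 saturation is applied once over the pooled pseudo-frequencies. This is a simplification of the Robertson et al. scheme (per-field length normalization, for instance, is omitted).</p>

```python
def rsv_bm25f_micro(query, doc_fields, weights, idf, avg_doc_len, k1=1.2, b=0.75):
    # BM25F-micro: field weights scale the raw term frequencies n(t, f_d);
    # saturation is then applied once over the pooled pseudo-frequencies,
    # so no query term can dominate via a single heavily weighted field.
    doc_len = sum(weights[f] * len(toks) for f, toks in doc_fields.items())
    score = 0.0
    for t in set(query):
        n = sum(weights[f] * toks.count(t) for f, toks in doc_fields.items())
        if n > 0:
            norm = b * doc_len / avg_doc_len + (1 - b)
            score += ((k1 + 1) * n) / (n + k1 * norm) * idf[t]
    return score
```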
      <p>In both BM25F-macro and BM25F-micro the field weights are set as constants
and are applied in the same manner to each document in the corpus. Therefore,
the BM25F-macro and BM25F-micro models assume that a given field always
affects relevance in the same manner. Furthermore, field weights are defined
through learning or heuristics, both costly tasks.</p>
      <p>To counteract these two issues, we propose BM25-FIC, which does not
assume the field weights to be constant. The ranking score between a query and
a document is defined as the weighted sum of the document-field BM25 scores,
where each weight is calculated from the information content of a document field.</p>
      <sec id="sec-2-1">
        <title>Definition 1 (BM25-FIC Ranking Score).</title>
        <p>RSV_{BM25-FIC,Inf,b,k1}(q, d, c) := Σ_{f ∈ F_c} w_f(q, c, Inf) · RSV_{BM25,b,k1}(q, f_d, c)</p>
        <p>where Inf is the chosen information content model.</p>
        <p>Comparing Definition 1 to BM25F-macro (or -micro), it is clear that they are
closely related. The difference is that instead of having constant field weights,
in BM25-FIC the weight depends on the query q, the document field f, the
collection field F, the collection c and the information content model Inf.</p>
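        <p>As a minimal illustration of Definition 1 (names hypothetical), the BM25-FIC score is a weighted sum in which the weights are recomputed per query, and via the Inf model can differ across documents, rather than being corpus-wide constants:</p>

```python
def rsv_bm25_fic(field_scores, field_weights):
    # Definition 1: RSV(q, d, c) = sum_f w_f(q, c, Inf) * RSV_BM25(q, f_d, c).
    # field_scores[f] holds the per-field BM25 score; field_weights[f] holds
    # the query-dependent information-content weight w_f(q, c, Inf).
    return sum(field_weights[f] * field_scores[f] for f in field_scores)
```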
      </sec>
      <sec id="sec-2-2">
        <title>Rationale for the BM25-FIC Score and the Field Weights</title>
        <p>The main research question of this paper is the rationale and estimation of the
field weights w_f(q, c, Inf). Before we propose the estimates, we consider the wider
picture of probability and information theory to build the rationale for the
estimation of w_f.</p>
        <p>An aggregation of values (scores), as in BM25F, is inherently related to the
first moment (the expected value):</p>
        <sec id="sec-2-2-1">
          <title>Mean:</title>
          <p>E[X] = Σ_{x ∈ X} x · P(x)</p>
          <p>Regarding BM25F, x is a score for a field, and P(x) is a probability associated
with the field.</p>
          <p>Regarding an information-theoretic approach, the entropy is the expected
value (EV) of the negated logarithm of the probability:</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Entropy:</title>
          <p>E[−log(P(X))] = −Σ_{x ∈ X} P(x) · log(P(x))</p>
          <p>Entropy and related concepts such as log-likelihood are commonly used for
justifying estimates. These probabilistic and information-theoretic rationales justify
the field weights.</p>
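          <p>For instance, the entropy of a discrete distribution can be computed directly from the formula above (a small illustrative sketch):</p>

```python
import math

def entropy(p):
    # E[-log P(X)] = -sum_x P(x) * log(P(x)); terms with P(x) = 0 contribute 0.
    return -sum(px * math.log(px) for px in p if px > 0)
```

As expected, a uniform distribution maximizes entropy, while a point mass has entropy zero.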
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Estimates for Field Weights w_f(Inf)</title>
        <p>Following the framework of probabilistic and information-theoretic expectation
values, the candidates for w_f are derived from the concept of information content:</p>
        <p>w_f(q, c, Inf) = Inf(q, f, c) := −Σ_{t ∈ q ∩ f} log(P(t|f, c))  (4)</p>
        <p>P(t|f, c) is defined via the maximum-likelihood method as the number of
document fields in which term t occurs (df(t, f, c)), divided by the number of potential
document fields in which t could appear: P(t|f, c) := df(t, f, c) / N_P(c).</p>
        <p>Three different definitions of the number of potential documents N_P(c) are
used to create the three candidate models for information content Inf (Inf1, Inf2
and Inf3).</p>
        <p>Estimate P1 The first model defines N_P as the total number of documents in
the collection: N_P1(c) := N_D(c).</p>
        <p>Estimate P2 The second model defines N_P as the number of documents in the
collection for which the field in question is not empty, that is, contains at least
one term: N_P2(c) := |{d | f ∈ F_c ∧ ∃ t, f_d : n(t, f_d) &gt; 0}|, where f_d is the instance
of the field in document d.</p>
        <p>P2 ensures that fields which are empty for many documents are given less
weight than they would be otherwise. This makes sense, as fields are often empty
for a reason, such as data redundancies.</p>
        <p>Estimate P3 The third model normalizes N_P for each field according to its
average field length (avgl): N_P3(c) := N_P2(c) · avgl(c)/avgl(f), where avgl(c) is the
average field length over all fields, and avgl(f) is the average for a specific field
(e.g. title).</p>
        <p>
          P3 ensures that short fields get more weight. Adding weight to shorter fields
has been shown to be beneficial in previous research [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
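        <p>The three estimates can be sketched as follows; the function and parameter names (df, n_nonempty, avg_lens, and the statistics they carry) are hypothetical stand-ins for the collection statistics described above:</p>

```python
import math

def inf_weight(query, field, df, n_docs, n_nonempty, avg_len_all, avg_lens,
               variant="P3"):
    # Eq. 4: Inf(q, f, c) = -sum_{t in q ∩ f} log P(t|f, c), with
    # P(t|f, c) = df(t, f, c) / N_P(c) and three choices of N_P:
    if variant == "P1":
        n_p = n_docs                                  # N_P1: all documents
    elif variant == "P2":
        n_p = n_nonempty[field]                       # N_P2: non-empty fields only
    else:                                             # N_P3: length-normalized
        n_p = n_nonempty[field] * avg_len_all / avg_lens[field]
    total = 0.0
    for t in set(query):
        d = df[field].get(t, 0)
        if d > 0:                                     # t occurs in field f
            total += -math.log(d / n_p)
    return total
```

Under P3, a field whose average length is below the collection-wide average receives a larger N_P, hence a larger information-content weight, matching the stated rationale that short fields get more weight.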
      </sec>
      <sec id="sec-2-4">
        <title>Evaluation Data and Results</title>
        <p>
          For evaluation we use the Kaggle Home Depot product catalogue data set1,
also used by [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Their results on the data were similar to those obtained on
more established test collections such as TREC-GOV2. The data set contains
55k products with name, description and attribute fields. The attribute field
contains additional information, such as notes, and can also be empty.
        </p>
        <p>We considered the 1000 queries with the most relevance judgements available.
The documents were judged by humans on a scale from 1 to 3. We defined
3 as relevant and anything below as non-relevant. Altogether there are 12093
judgements: 10260 relevant and 1833 non-relevant.</p>
        <p>The table below shows the results of the experimentation, with a clear
indication that BM25-FIC outperforms BM25F. As mentioned earlier, the
field weights for the benchmarks are uniform, as no learning or heuristics are
used for BM25-FIC either. The relative improvements in the table are
calculated against the better-performing baseline, BM25F-micro.</p>
        <p>Model                   MAP
Baseline BM25F-macro    0.218
Baseline BM25F-micro    0.232
BM25-FIC P1             0.288
BM25-FIC P2             0.290
BM25-FIC P3             0.300
(The P@10 and NDCG columns of the original table are not recoverable here.)</p>
        <p>Of the three candidate models, BM25-FIC P3 is the most accurate, consistently
outperforming BM25F-micro and BM25F-macro on all metrics.</p>
        <p>Field Weights for Defining User Query Intent</p>
        <p>
          To better reflect the query intent, we enhance the BM25-FIC model by using a
seed document as a reference point. The user picks a document from the initial results
that corresponds to a type of query intent. This is a simplification of the usual approaches
to relevance feedback [
          <xref ref-type="bibr" rid="ref2 ref6 ref8 ref9">6,9,8,2</xref>
          ]. Specific to our model is the usage of the field weights.
        </p>
        <p>The field weights reflect query intent, as each field contributes to relevance in a
different way. By analysing the similarity between the seed document's field weights
and those of other documents, the model allows the user to prioritize or de-prioritize
documents whose query intent is similar to that of the seed document.
1 https://www.kaggle.com/c/home-depot-product-search-relevance/data</p>
        <p>Definition 2 (Interactive Model).</p>
        <p>RSV_{BM25-FIC,Inf,inter,a}(q, d, c, d_sd) := RSV_{BM25-FIC}(q, d, c) + a · S(q, d, c, d_sd, Inf)</p>
        <p>where S is a similarity measure. This could be a retrieval model or based on the Euclidean
distance: S := 1 − ||w_d(q, c, Inf) − w_{d_sd}(q, c, Inf)||_2, where the normalized field-weight
vector for document d is defined as
w_d(q, c, Inf) := (w_{f1}(q, c, Inf)/Σ_f w_f(q, c, Inf), …, w_{fn}(q, c, Inf)/Σ_f w_f(q, c, Inf)),
with f_i ∈ F_c.</p>
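        <p>A minimal sketch of Definition 2 with the Euclidean variant of S; the function names and weight dictionaries are illustrative assumptions:</p>

```python
import math

def normalized_weights(w):
    # w_d: field-weight vector normalized to sum to 1 (Definition 2).
    s = sum(w.values())
    return {f: v / s for f, v in w.items()}

def interactive_rsv(base_rsv, w_doc, w_seed, a):
    # RSV + a * S, with S = 1 - ||w_d - w_{d_sd}||_2.
    wd, ws = normalized_weights(w_doc), normalized_weights(w_seed)
    dist = math.sqrt(sum((wd[f] - ws[f]) ** 2 for f in wd))
    return base_rsv + a * (1.0 - dist)
```

A document whose weight profile matches the seed's exactly has S = 1, so a positive a boosts it and a negative a demotes it.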
        <p>The parameter a can be adjusted by the user. Positive values result in higher scores
for documents that are relevant in a way similar to d_sd; negative values do the opposite.</p>
        <p>Using the seed document as a reference point and adjusting a, the user can
define their query intent in an intuitive manner. The idea is to make use of the fact
that when presented with the initial search results, a user can often understand why
some documents are high in the ranking, even though they might not correspond to
their query intent. By using these documents as reference points, the user can easily
refine and fine-tune their query intent.</p>
        <p>As a more concrete example, consider the query "Wizards and magic J.K. Rowling
1995" used to search a book catalogue with the document fields plot, author and
publication year. The fact that the first Harry Potter book was only published in 1997
raises questions about the user's query intent. Are they looking for Harry Potter books,
but got the year wrong? Or for books about wizards from 1995 similar to J.K. Rowling's
work? There are many possible ways, such as the two above, in which documents
can be relevant. What the model from Definition 2 does is help navigate these
different dimensions of relevance. Say the user is not looking for Harry Potter books,
but some of them come up in the search results. They can choose one of them as the
seed document, and by setting a to a negative value, other Harry Potter books will disappear
from the top results. This is because the remaining results will be those with higher
weights for publication year and lower ones for author. With a high positive a, they
would see all Harry Potter books in the top results.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>The main contribution of this paper is a new method for automatic field
weighting in the BM25F retrieval model. We denote the model BM25-FIC, as it uses
field information content (FIC) to calculate the document-level field weights.
Compared to basic BM25F (macro or micro), there is a relative improvement of between 10%
(NDCG) and 30% (MAP and P@10) on the Kaggle Home Depot data set.</p>
      <p>Moreover, the paper introduces a new re-ranking retrieval model based on the
BM25-FIC weights, which is a good candidate for interactive retrieval. The aim of
the model is to give the user better tools for defining their query intent. This is done
by using a seed document as a positive or negative reference point for finding the desired
dimensionality of relevance.</p>
      <p>Overall, the research confirms that there is unexpected potential in the refinement
of BM25F field weights, and that the "weighted sum" is also useful for
multidimensional measures other than document fields.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Balaneshinkordan, S., Kotov, A., Nikolaev, F.: Attentive Neural Architecture for Ad-hoc Structured Document Retrieval. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18), pp. 1173-1182. Association for Computing Machinery, Torino, Italy (Oct 2018). https://doi.org/10.1145/3269206.3271801</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Buckley, C.: Automatic Query Expansion Using SMART: TREC 3. In: Proceedings of the Third Text REtrieval Conference (TREC-3), pp. 69-80 (1993)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Frommholz, I., Larsen, B., Piwowarski, B., Lalmas, M., Ingwersen, P., van Rijsbergen, K.: Supporting polyrepresentation in a quantum-inspired geometrical retrieval framework. In: Proceedings of the Third Symposium on Information Interaction in Context (IIiX '10), pp. 115-124. Association for Computing Machinery, New Brunswick, New Jersey, USA (Aug 2010). https://doi.org/10.1145/1840784.1840802</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. He, B., Ounis, I.: Setting Per-field Normalisation Hyper-parameters for the Named-Page Finding Search Task. In: Amati, G., Carpineto, C., Romano, G. (eds.) Advances in Information Retrieval, pp. 468-480. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg (2007)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Imhof, M., Braschler, M.: A study of untrained models for multimodal information retrieval. Information Retrieval Journal 21(1), 81-106 (Feb 2018). https://doi.org/10.1007/s10791-017-9322-x</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3, 333-389 (Jan 2009). https://doi.org/10.1561/1500000019</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Robertson, S., Zaragoza, H., Taylor, M.: Simple BM25 extension to multiple weighted fields. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management (CIKM '04), pp. 42-49. Association for Computing Machinery, Washington, D.C., USA (Nov 2004)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Rocchio: Relevance feedback in information retrieval. In: The SMART Retrieval System: Experiments in Automatic Document Processing (1971)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Roelleke, T.: Information Retrieval Models: Foundations and Relationships. Morgan &amp; Claypool Publishers (2013)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Zellhofer, D., Schmitt, I.: A user interaction model based on the principle of polyrepresentation. In: Proceedings of the 4th Workshop for Ph.D. Students in Information &amp; Knowledge Management (PIKM '11), pp. 3-10. Association for Computing Machinery, Glasgow, Scotland, UK (Oct 2011). https://doi.org/10.1145/2065003.2065007</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>