<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BM25-FIC: Information Content-based Field Weighting for BM25F</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tuomas Ketola</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Roelleke</string-name>
          <email>t.roellekeg@qmul.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Queen Mary, University of London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <fpage>79</fpage>
      <lpage>85</lpage>
      <abstract>
<p>BM25F has been shown to perform well on many multi-field and multi-modal retrieval tasks. However, one of its key challenges is finding appropriate field weights. This paper tackles the challenge by introducing a new analytical method for the automatic estimation of these weights. The method, denoted BM25-FIC, is based on field information content (FIC), calculated from term, collection and field statistics. The field weights are applied to each document separately rather than to the entire field, as is normally done by BM25F, where the field weights are constant across documents. BM25-FIC outperforms BM25F in terms of P@10, MAP and NDCG on a small test collection. The paper then introduces an interactive information discovery model based on the field weights. The weights are used to compute a similarity score between a seed document and the retrieved documents. Overall, the BM25-FIC approach is an enhanced BM25F method that combines information-oriented search and parameter estimation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Formal retrieval models for multi-modal and heterogeneous data are
becoming more necessary as the complexity of data collections and information needs
grows. Formality is required to keep the models interpretable, a quality often
expected in fields such as law and medicine. Most of the data collections searched
these days, whether websites, product catalogues or multi-media data,
consist of objects with more than one feature and more than one feature type.</p>
      <p>
        BM25F has been shown to be effective for multi-modal and multi-field
retrieval [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One of its main challenges is the choice of field weights. The main
contribution of this paper is to introduce a new method for automatically
determining these weights: BM25-FIC (BM25 Field Information Content).
      </p>
      <p>The proposed method calculates the field weights based on field information
content, estimated from term, collection and field statistics. As the weights are
calculated directly, no learning or heuristics are needed to determine appropriate
field weights, as is the case with BM25F. This makes BM25-FIC much easier to
implement. Furthermore, the field weights are determined for each document
field separately, rather than for the entire field of the collection, as is done by
the normal BM25F. This means that BM25-FIC is able to capture more
complicated relationships between the query and the different fields.</p>
      <p>Our experiments confirm that BM25-FIC outperforms BM25F. However, it
needs to be noted that this result is obtained on a small test collection and
without training of the BM25F weights. Training was not performed on the
benchmarks as BM25-FIC itself requires no training.</p>
      <p>The second contribution of the paper is to introduce an interactive
information discovery model that benefits from the obtained field weights. It uses a seed
document as a reference point to help the user better define their query intent.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Multi-modal retrieval has received much attention in the IR community.
Multi-modal approaches closely relate to multi-field and multi-model approaches. Here
the terms multi-modal and multi-field are used interchangeably, as the fields are
assumed to represent different data types. However, our approach is not a
multi-model one, as BM25 is used for all fields. Multi-modal data can be fully text
based, rather than audio and text for example, as different feature types can be
represented in text, e.g. author lists, abstracts or geographic information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        It has been shown that BM25F generalizes well to multi-modal data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
As with normal textual data, this involves setting the field weights. He and Ounis
have examined the setting of the field weights and other field-level
hyper-parameters extensively [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Outside of BM25F, various probabilistic and learning-based
models have been considered for multi-field ad-hoc retrieval. These models will
not be explained in detail here, as the focus is on BM25F. Instead, the reader
is advised to see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for a summary of the different approaches.
      </p>
      <p>
        The principle of polyrepresentation relates to the concept of relevance
dimensionality discussed in this paper. According to the principle, relevance consists
of multiple overlapping cognitive representations of documents. The most
relevant documents are most likely found where these representations overlap [
        <xref ref-type="bibr" rid="ref10 ref3">3,10</xref>
        ].
The different dimensions of relevance, represented by the different document
fields, can be seen as forms of cognitive representations when they communicate
different types of information.
      </p>
      <p>There are two common ways in which multiple fields are considered in the BM25
context. The first option is to compute the BM25 score for each field and calculate a
weighted sum over these scores. In this paper this approach is denoted BM25F-macro:</p>
      <p>RSV_{BM25F-macro,b,k1}(q, d, c) := Σ_{f ∈ F_c} w_f · RSV_{BM25,b,k1}(q, f_d, c)  (1)</p>
      <p>where q is a query, d a document, c a collection, f a document field, w_f the
field weight, and F_c the set of fields. Note that f is the field type (e.g. title, abstract,
body), whereas f_d denotes an instance (e.g. the title of document d). b and k1
are BM25 hyper-parameters.</p>
      <p>RSV_{BM25,b,k1}(q, f_d, c) := Σ_{t ∈ f_d} TF_{BM25,b,k1}(t, f_d, c) · w_RSJ(t, c)  (2)</p>
      <p>where the TF component is the BM25 term-frequency quantification:</p>
      <p>TF_{BM25,b,k1}(t, f_d, c) := ((k1 + 1) · n(t, f_d)) / (n(t, f_d) + k1 · (b · len(f_d)/avgl(f, c) + (1 − b)))  (3)</p>
      <p>
        where n(t, f_d) is the raw term frequency and the average field length avgl is
determined globally for the collection. w_RSJ(t, c) can be defined based on documents, or can consider
field-based frequencies. For example, w_RSJ(t, c) := (N_F(c) + 0.5) / (df(t, F_c) + 0.5) is a field-based rather
than document-based weight, as described by Robertson et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
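      <p>To make Equations (1)-(3) concrete, the following is a minimal Python sketch of BM25F-macro. All names (tf_bm25, rsv_bm25, and the idf dictionary standing in for w_RSJ(t, c)) are illustrative assumptions, not the authors' implementation.</p>

```python
def tf_bm25(n, field_len, avg_len, k1=1.2, b=0.75):
    # Saturating BM25 term-frequency component (cf. Eq. 3).
    return ((k1 + 1) * n) / (n + k1 * (b * field_len / avg_len + (1 - b)))

def rsv_bm25(query, field_tokens, idf, avg_len, k1=1.2, b=0.75):
    # BM25 score of one field instance f_d against query q (cf. Eq. 2);
    # idf[t] plays the role of w_RSJ(t, c).
    score = 0.0
    for t in set(query):
        n = field_tokens.count(t)
        if n > 0:
            score += tf_bm25(n, len(field_tokens), avg_len, k1, b) * idf[t]
    return score

def rsv_bm25f_macro(query, doc_fields, weights, idf, avg_lens):
    # Eq. 1: weighted sum of per-field BM25 scores with constant field weights w_f.
    return sum(weights[f] * rsv_bm25(query, toks, idf, avg_lens[f])
               for f, toks in doc_fields.items())
```

With k1 = 1.2 and b = 0.75, a term occurring once in a field of average length yields a TF component of exactly 1.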
      <p>
        The second option for multi-field BM25 retrieval was introduced by
Robertson et al., who noted that approaches using weighted sums of field-based
retrieval scores give too much weight to some query terms [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Their approach
differs from BM25F-macro in that the constant field weights are applied to the
raw term frequencies n(t, f_d) and the BM25 score is calculated over the summed
term frequencies from all the fields. This model is commonly known as BM25F;
here it is denoted BM25F-micro for clarity.
      </p>
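      <p>A corresponding sketch of the BM25F-micro idea, under the same hypothetical naming: field weights scale the raw term frequencies, and BM25 saturation is applied once over the pooled pseudo-frequencies. This is a simplification of the Robertson et al. scheme (per-field length normalization, for instance, is omitted).</p>

```python
def rsv_bm25f_micro(query, doc_fields, weights, idf, avg_doc_len, k1=1.2, b=0.75):
    # BM25F-micro: field weights scale the raw term frequencies n(t, f_d);
    # saturation is then applied once over the pooled pseudo-frequencies,
    # so no query term can dominate via a single heavily weighted field.
    doc_len = sum(weights[f] * len(toks) for f, toks in doc_fields.items())
    score = 0.0
    for t in set(query):
        n = sum(weights[f] * toks.count(t) for f, toks in doc_fields.items())
        if n > 0:
            norm = b * doc_len / avg_doc_len + (1 - b)
            score += ((k1 + 1) * n) / (n + k1 * norm) * idf[t]
    return score
```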
      <p>In both BM25F-macro and BM25F-micro the field weights are set as constants
and are applied in the same manner to each document in the corpus. Therefore,
the BM25F-macro and BM25F-micro models assume that a given field always
affects relevance in the same manner. Furthermore, field weights are defined
through learning or heuristics, both costly tasks.</p>
      <p>To counteract these two issues, we propose BM25-FIC, which does not
assume the field weights to be constant. The ranking score between a query and
a document is defined as the weighted sum of the document-field BM25 scores,
where each weight is calculated from the information content of a document field.</p>
      <sec id="sec-2-1">
        <title>Definition 1 (BM25-FIC Ranking Score).</title>
        <p>RSV_{BM25-FIC,Inf,b,k1}(q, d, c) := Σ_{f ∈ F_c} w_f(q, c, Inf) · RSV_{BM25,b,k1}(q, f_d, c)</p>
        <p>where Inf is the chosen information content model.</p>
        <p>Comparing Definition 1 to BM25F-macro (or -micro), it is clear that they are
closely related. The difference is that instead of having constant field weights,
in BM25-FIC the weight depends on the query q, the document field f, the
collection field F, the collection c and the information content model Inf.</p>
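        <p>As a minimal illustration of Definition 1 (names hypothetical), the BM25-FIC score is a weighted sum in which the weights are recomputed per query, and via the Inf model can differ across documents, rather than being corpus-wide constants:</p>

```python
def rsv_bm25_fic(field_scores, field_weights):
    # Definition 1: RSV(q, d, c) = sum_f w_f(q, c, Inf) * RSV_BM25(q, f_d, c).
    # field_scores[f] holds the per-field BM25 score; field_weights[f] holds
    # the query-dependent information-content weight w_f(q, c, Inf).
    return sum(field_weights[f] * field_scores[f] for f in field_scores)
```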
      </sec>
      <sec id="sec-2-2">
        <title>Rationale for the BM25-FIC Score and the Field Weights</title>
        <p>The main research question of this paper is the rationale and estimation of the
field weights w_f(q, c, Inf). Before we propose the estimates, we consider the wider
picture of probability and information theory to build the rationale for the
estimation of w_f.</p>
        <p>An aggregation of values (scores), as in BM25F, is inherently related to the
first moment (the expected value):</p>
        <sec id="sec-2-2-1">
          <title>Mean:</title>
          <p>E[X] = Σ_{x ∈ X} x · P(x)</p>
          <p>Regarding BM25F, x is a score for a field, and P(x) is a probability associated
with the field.</p>
          <p>Regarding an information-theoretic approach, the entropy is the expected
value (EV) of the negated logarithm of the probability:</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Entropy:</title>
          <p>E[−log(P(X))] = −Σ_{x ∈ X} P(x) · log(P(x))</p>
          <p>Entropy and related concepts such as log-likelihood are commonly used for
justifying estimates. These probabilistic and information-theoretic rationales justify
the field weights.</p>
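          <p>For instance, the entropy of a discrete distribution can be computed directly from the formula above (a small illustrative sketch):</p>

```python
import math

def entropy(p):
    # E[-log P(X)] = -sum_x P(x) * log(P(x)); terms with P(x) = 0 contribute 0.
    return -sum(px * math.log(px) for px in p if px > 0)
```

As expected, a uniform distribution maximizes entropy, while a point mass has entropy zero.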
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Estimates for Field Weights w_f(Inf)</title>
        <p>Following the framework of probabilistic and information-theoretic expectation
values, the candidates for w_f are derived from the concept of information content:</p>
        <p>w_f(q, c, Inf) = Inf(q, f, c) := −Σ_{t ∈ q ∩ f} log(P(t|f, c))  (4)</p>
        <p>P(t|f, c) is defined via the maximum-likelihood method as the number of
document fields in which term t occurs (df(t, f, c)), divided by the number of potential
document fields in which t could appear: P(t|f, c) := df(t, f, c) / N_P(c).</p>
        <p>Three different definitions of the number of potential documents N_P(c) are
used to create the three candidate models for information content Inf (Inf1, Inf2
and Inf3).</p>
        <p>Estimate P1 The first model defines N_P as the total number of documents in
the collection: N_P1(c) := N_D(c).</p>
        <p>Estimate P2 The second model defines N_P as the number of documents in the
collection for which the field in question is not empty, that is, contains at least
one term: N_P2(c) := |{d | f ∈ F_c ∧ ∃ t, f_d : n(t, f_d) &gt; 0}|, where f_d is the instance
of the field in document d.</p>
        <p>P2 ensures that fields which are empty for many documents are given less
weight than they would be otherwise. This makes sense, as fields are often empty
for a reason, such as data redundancies.</p>
        <p>Estimate P3 The third model normalizes N_P for each field according to its
average field length (avgl): N_P3(c) := N_P2(c) · avgl(c)/avgl(f), where avgl(c) is the
average field length over all fields, and avgl(f) is the average for a specific field
(e.g. title).</p>
        <p>
          P3 ensures that short fields get more weight. Adding weight to shorter fields
has been shown to be beneficial in previous research [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
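        <p>The three estimates can be sketched as follows; the function and parameter names (df, n_nonempty, avg_lens, and the statistics they carry) are hypothetical stand-ins for the collection statistics described above:</p>

```python
import math

def inf_weight(query, field, df, n_docs, n_nonempty, avg_len_all, avg_lens,
               variant="P3"):
    # Eq. 4: Inf(q, f, c) = -sum_{t in q ∩ f} log P(t|f, c), with
    # P(t|f, c) = df(t, f, c) / N_P(c) and three choices of N_P:
    if variant == "P1":
        n_p = n_docs                                  # N_P1: all documents
    elif variant == "P2":
        n_p = n_nonempty[field]                       # N_P2: non-empty fields only
    else:                                             # N_P3: length-normalized
        n_p = n_nonempty[field] * avg_len_all / avg_lens[field]
    total = 0.0
    for t in set(query):
        d = df[field].get(t, 0)
        if d > 0:                                     # t occurs in field f
            total += -math.log(d / n_p)
    return total
```

Under P3, a field whose average length is below the collection-wide average receives a larger N_P, hence a larger information-content weight, matching the stated rationale that short fields get more weight.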
      </sec>
      <sec id="sec-2-4">
        <title>Evaluation Data and Results</title>
        <p>
          For evaluation we use the Kaggle Home Depot product catalogue data set1,
also used by [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Their results on the data were similar to those obtained on
more established test collections such as TREC-GOV2. The data set contains
55k products with name, description and attribute fields. The attribute field
contains additional information, such as notes, and can also be empty.
        </p>
        <p>We considered the 1000 queries with the most relevance judgements available.
The documents were judged by humans on a scale from 1 to 3. We defined
3 as relevant and anything below as non-relevant. Altogether there are 12093
judgements: 10260 relevant and 1833 non-relevant.</p>
        <p>The table below shows the results of the experimentation, with a clear
indication that BM25-FIC outperforms BM25F. As mentioned earlier, the
field weights for the benchmarks are uniform, as no learning or heuristics are
used for BM25-FIC either. The relative improvements in the table are
calculated against the better-performing baseline, BM25F-micro.</p>
        <p>Model                   MAP
Baseline BM25F-macro    0.218
Baseline BM25F-micro    0.232
BM25-FIC P1             0.288
BM25-FIC P2             0.290
BM25-FIC P3             0.300
(The P@10 and NDCG columns of the original table are not recoverable here.)</p>
        <p>Of the three candidate models, BM25-FIC P3 is the most accurate, consistently
outperforming BM25F-micro and BM25F-macro on all metrics.</p>
        <p>Field Weights for Defining User Query Intent</p>
        <p>
          To better reflect the query intent, we enhance the BM25-FIC model by using a
seed document as a reference point. The user picks a document from the initial results
that corresponds to a type of query intent. This is a simplification of the usual approaches
to relevance feedback [
          <xref ref-type="bibr" rid="ref2 ref6 ref8 ref9">6,9,8,2</xref>
          ]. Specific to our model is the usage of the field weights.
        </p>
        <p>The field weights reflect query intent, as each field contributes to relevance in a
different way. By analysing the similarity between the seed document's field weights
and those of other documents, the model allows the user to prioritize or de-prioritize
documents whose query intent is similar to that of the seed document.
1 https://www.kaggle.com/c/home-depot-product-search-relevance/data</p>
        <p>Definition 2 (Interactive Model).</p>
        <p>RSV_{BM25-FIC,Inf,inter,a}(q, d, c, d_sd) := RSV_{BM25-FIC}(q, d, c) + a · S(q, d, c, d_sd, Inf)</p>
        <p>where S is a similarity measure. This could be a retrieval model or based on the Euclidean
distance: S := 1 − ||w_d(q, c, Inf) − w_{d_sd}(q, c, Inf)||_2, where the normalized field-weight
vector for document d is defined as
w_d(q, c, Inf) := (w_{f1}(q, c, Inf)/Σ_f w_f(q, c, Inf), …, w_{fn}(q, c, Inf)/Σ_f w_f(q, c, Inf)),
with f_i ∈ F_c.</p>
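        <p>A minimal sketch of Definition 2 with the Euclidean variant of S; the function names and weight dictionaries are illustrative assumptions:</p>

```python
import math

def normalized_weights(w):
    # w_d: field-weight vector normalized to sum to 1 (Definition 2).
    s = sum(w.values())
    return {f: v / s for f, v in w.items()}

def interactive_rsv(base_rsv, w_doc, w_seed, a):
    # RSV + a * S, with S = 1 - ||w_d - w_{d_sd}||_2.
    wd, ws = normalized_weights(w_doc), normalized_weights(w_seed)
    dist = math.sqrt(sum((wd[f] - ws[f]) ** 2 for f in wd))
    return base_rsv + a * (1.0 - dist)
```

A document whose weight profile matches the seed's exactly has S = 1, so a positive a boosts it and a negative a demotes it.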
        <p>The parameter a can be adjusted by the user. Positive values result in higher scores
for documents that are relevant in a way similar to d_sd; negative values do the opposite.</p>
        <p>Using the seed document as a reference point and adjusting a, the user can
define their query intent in an intuitive manner. The idea is to make use of the fact
that when presented with the initial search results, a user can often understand why
some documents are high in the ranking, even though they might not correspond to
their query intent. By using these documents as reference points, the user can easily
refine and fine-tune their query intent.</p>
        <p>As a more concrete example, consider the query "Wizards and magic J.K. Rowling
1995" used to search a book catalogue with the document fields plot, author and
publication year. The fact that the first Harry Potter book was only published in 1997
raises questions about the user's query intent. Are they looking for Harry Potter books,
but got the year wrong? Or for books about wizards from 1995 similar to J.K. Rowling's
work? There are many possible ways, such as the two above, in which documents
can be relevant. What the model from Definition 2 does is help navigate these
different dimensions of relevance. Say the user is not looking for Harry Potter books,
but some of them come up in the search results. They can choose one of them as the
seed document, and by setting a to a negative value, other Harry Potter books will disappear
from the top results. This is because the remaining results will be those with higher
weights for publication year and lower ones for author. With a high positive a, they
would see all Harry Potter books in the top results.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>The main contribution of this paper is a new method for automatic field
weighting in the BM25F retrieval model. We denote the model BM25-FIC, as it uses
field information content (FIC) to calculate the document-level field weights.
Compared to basic BM25F (macro or micro), there is a relative improvement of between 10%
(NDCG) and 30% (MAP and P@10) on the Kaggle Home Depot data set.</p>
      <p>Moreover, the paper introduces a new re-ranking retrieval model based on the
BM25-FIC weights, which is a good candidate for interactive retrieval. The aim of
the model is to give the user better tools for defining their query intent. This is done
by using a seed document as a positive or negative reference point for finding the desired
dimensionality of relevance.</p>
      <p>Overall, the research confirms that there is unexpected potential in the refinement
of BM25F field weights, and that the "weighted sum" is also useful for
multidimensional measures other than document fields.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Balaneshinkordan, S., Kotov, A., Nikolaev, F.: Attentive Neural Architecture for Ad-hoc Structured Document Retrieval. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM '18), pp. 1173-1182. Association for Computing Machinery, Torino, Italy (Oct 2018). https://doi.org/10.1145/3269206.3271801</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Buckley, C.: Automatic Query Expansion Using SMART: TREC 3. In: Proceedings of the Third Text REtrieval Conference (TREC-3), pp. 69-80 (1993)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Frommholz, I., Larsen, B., Piwowarski, B., Lalmas, M., Ingwersen, P., van Rijsbergen, K.: Supporting polyrepresentation in a quantum-inspired geometrical retrieval framework. In: Proceedings of the Third Symposium on Information Interaction in Context (IIiX '10), pp. 115-124. Association for Computing Machinery, New Brunswick, New Jersey, USA (Aug 2010). https://doi.org/10.1145/1840784.1840802</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. He, B., Ounis, I.: Setting Per-field Normalisation Hyper-parameters for the Named-Page Finding Search Task. In: Amati, G., Carpineto, C., Romano, G. (eds.) Advances in Information Retrieval, pp. 468-480. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg (2007)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Imhof, M., Braschler, M.: A study of untrained models for multimodal information retrieval. Information Retrieval Journal 21(1), 81-106 (Feb 2018). https://doi.org/10.1007/s10791-017-9322-x</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3, 333-389 (Jan 2009). https://doi.org/10.1561/1500000019</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Robertson, S., Zaragoza, H., Taylor, M.: Simple BM25 extension to multiple weighted fields. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management (CIKM '04), pp. 42-49. Association for Computing Machinery, Washington, D.C., USA (Nov 2004)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Rocchio: Relevance feedback in information retrieval. In: The SMART Retrieval System: Experiments in Automatic Document Processing (1971)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Roelleke, T.: Information Retrieval Models: Foundations and Relationships. Morgan &amp; Claypool Publishers (2013)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Zellhofer, D., Schmitt, I.: A user interaction model based on the principle of polyrepresentation. In: Proceedings of the 4th Workshop for Ph.D. Students in Information &amp; Knowledge Management (PIKM '11), pp. 3-10. Association for Computing Machinery, Glasgow, Scotland, UK (Oct 2011). https://doi.org/10.1145/2065003.2065007</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>