=Paper=
{{Paper
|id=Vol-2741/paper-11
|storemode=property
|title=BM25-FIC: Information Content-based Field Weighting for BM25F
|pdfUrl=https://ceur-ws.org/Vol-2741/paper-11.pdf
|volume=Vol-2741
|authors=Tuomas Ketola,Thomas Roelleke
|dblpUrl=https://dblp.org/rec/conf/sigir/KetolaR20
}}
==BM25-FIC: Information Content-based Field Weighting for BM25F==
Tuomas Ketola and Thomas Roelleke

Queen Mary, University of London, UK

{t.j.h.ketola,t.roelleke}@qmul.ac.uk

Abstract. BM25F has been shown to perform well on many multi-field and multi-modal retrieval tasks. However, one of its key challenges is finding appropriate field weights. This paper tackles the challenge by introducing a new analytical method for the automatic estimation of these weights. The method, denoted BM25-FIC, is based on field information content (FIC), calculated from term, collection and field statistics. The field weights are applied to each document separately rather than to the entire field, as is normally done by BM25F, where the field weights are constant across documents. BM25-FIC outperforms BM25F in terms of P@10, MAP and NDCG on a small test collection. The paper then introduces an interactive information discovery model based on the field weights, which are used to compute a similarity score between a seed document and the retrieved documents. Overall, the BM25-FIC approach is an enhanced BM25F method that combines information-oriented search and parameter estimation.

===1 Introduction===
Formal retrieval models for multi-modal and heterogeneous data are becoming more necessary as the complexity of data collections and information needs grows. Formality is required to keep the models interpretable, a quality often expected in fields such as law and medicine. Most of the data collections searched these days, whether websites, product catalogues or multi-media data, consist of objects with more than one feature and more than one feature type.

BM25F has been shown to be effective for multi-modal and multi-field retrieval [5]. One of its main challenges is the choice of field weights. The main contribution of this paper is a new method for automatically determining these weights: BM25-FIC (BM25 Field Information Content).
The proposed method calculates the field weights based on field information content, estimated from term, collection and field statistics. As the weights are calculated directly, no learning or heuristics are needed to determine appropriate field weights, as is the case with BM25F. This makes BM25-FIC much easier to implement. Furthermore, the field weights are determined for each document field separately, rather than for the entire field of the collection, as is done by the normal BM25F. This means that BM25-FIC is able to capture more complicated relationships between the query and the different fields.

In our experiments BM25-FIC outperforms BM25F. However, it needs to be noted that this result is obtained on a small test collection and without training the BM25F weights. The benchmarks were not trained because BM25-FIC itself requires no training.

The second contribution of the paper is an interactive information discovery model that benefits from the obtained field weights. It uses a seed document as a reference point to help the user better define their query intent.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). BIRDS 2020, 30 July 2020, Xi’an, China (online).

===2 Related Work===
Multi-modal retrieval has received much attention in the IR community. Multi-modal approaches closely relate to multi-field and multi-model approaches. Here the terms multi-modal and multi-field are used interchangeably, as the fields are assumed to represent different data types. However, our approach is not a multi-model one, as BM25 is used for all fields. Multi-modal data can be fully text based, rather than audio and text for example, as different feature types can be represented in text, e.g. author lists, abstracts or geographic information [5]. It has been shown that BM25F generalizes well to multi-modal data [5].
As with normal textual data, this involves setting the field weights. He and Ounis have examined the setting of the field weights and other field-level hyperparameters extensively [4]. Outside BM25F, various probabilistic and learning-based models have been considered for multi-field ad-hoc retrieval. These models are not explained in detail here, as the focus is on BM25F; the reader is referred to [1] for a summary of the different approaches.

The principle of polyrepresentation relates to the concept of relevance dimensionality discussed in this paper. According to the principle, relevance consists of multiple overlapping cognitive representations of documents. The most relevant documents are most likely found where these representations overlap [3,10]. The different dimensions of relevance, represented by the different document fields, can be seen as forms of cognitive representations when they communicate different types of information.

===3 BM25-FIC - Information Content based BM25F===
There are two common ways in which multiple fields are considered in the BM25 context. The first option is to compute the BM25 score for each field and calculate a weighted sum over them. In this paper this approach is denoted BM25F-macro:

:<math>\mathrm{RSV}_{\text{BM25F-macro},b,k_1}(q, d, c) := \sum_{f \in F_c} w_f \, \mathrm{RSV}_{\text{BM25},b,k_1}(q, f_d, c) \qquad (1)</math>

where <math>q</math> is a query, <math>d</math> a document, <math>c</math> a collection, <math>f</math> a document field, and <math>w_f</math> is the field weight. <math>F_c</math> is the set of fields. Note that <math>f</math> is the type (e.g. title, abstract, body) whereas <math>f_d</math> denotes an instance (e.g. the title of document <math>d</math>). <math>b</math> and <math>k_1</math> are BM25 hyper-parameters. The per-field BM25 score is

:<math>\mathrm{RSV}_{\text{BM25},b,k_1}(q, f_d, c) := \sum_{t \in f_d} \mathrm{TF}_{\text{BM25},b,k_1}(t, f_d, c) \cdot w_{\text{RSJ}}(t, c) \qquad (2)</math>

where the TF component is the BM25 term frequency quantification:

:<math>\mathrm{TF}_{\text{BM25},b,k_1}(t, f_d, c) := \frac{(k_1 + 1) \, n(t, f_d)}{n(t, f_d) + k_1 \left( b \, \frac{\mathrm{len}(f_d)}{\mathrm{avgfl}(c)} + (1 - b) \right)} \qquad (3)</math>

where <math>n(t, f_d)</math> is the raw term frequency, and avgfl is determined globally for the collection.
<math>w_{\text{RSJ}}(t, c)</math> can be defined based on documents, or can consider field-based frequencies. For example, <math>w_{\text{RSJ}}(t, c) := \frac{N_F(c) + 0.5}{\mathrm{df}(t, F_c) + 0.5}</math> is a field-based rather than document-based weight, as described by Robertson et al. [6].

The second option for multi-field BM25 retrieval was introduced by Robertson et al., who noted that approaches using weighted sums of field-based retrieval scores give too much weight to some query terms [7]. Their approach differs from BM25F-macro in that the constant field weights are applied to the raw term frequencies <math>n(t, f)</math>, and the BM25 score is calculated over the summed term frequencies from all the fields. This model is commonly known as BM25F; here it is denoted BM25F-micro for clarity.

In both BM25F-macro and BM25F-micro the field weights are constant and are applied in the same manner to each document in the corpus. Therefore, both models assume that a given field always affects relevance in the same manner. Furthermore, the field weights are defined through learning or heuristics, both of which are costly tasks. To counteract these two issues, we propose BM25-FIC, which does not assume the field weights to be constant. The ranking score between a query and a document is defined as the weighted sum of the document field BM25 scores, where the weight is calculated from the information content of a document field.

Definition 1 (BM25-FIC Ranking Score).

:<math>\mathrm{RSV}_{\text{BM25-FIC},\mathrm{Inf},b,k_1}(q, d, c) := \sum_{f \in F_c} w_f(q, c, \mathrm{Inf}) \, \mathrm{RSV}_{\text{BM25},b,k_1}(q, f_d, c)</math>

where Inf is the chosen information content model.

Comparing Definition 1 to BM25F-macro (or micro), it is clear that they are closely related. The difference is that instead of having constant field weights, in BM25-FIC the weight depends on the query <math>q</math>, the document field <math>f</math>, the collection field <math>F</math>, the collection <math>c</math> and the information content model Inf.
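The relationship between Equations (1) to (3) and Definition 1 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the dictionary layout of `field_stats`, the defaults `k1=1.2` and `b=0.75`, and the toy data are all hypothetical. The logarithm in `rsj_weight` and the restriction of the sum in Equation (2) to query terms that occur in the field are also assumptions of the sketch. The query-dependent weight of Definition 1 is passed in as a function so that any information content model can be plugged in.

```python
import math
from collections import Counter


def bm25_tf(n, field_len, avg_field_len, k1=1.2, b=0.75):
    """BM25 term-frequency quantification of Eq. (3)."""
    norm = b * field_len / avg_field_len + (1.0 - b)
    return (k1 + 1.0) * n / (n + k1 * norm)


def rsj_weight(df, n_fields):
    """Field-based term weight; taking the logarithm is an assumption of this sketch."""
    return math.log((n_fields + 0.5) / (df + 0.5))


def bm25_field_score(query, tokens, stats, k1=1.2, b=0.75):
    """BM25 score of one document field (Eq. 2), summed over query terms in the field."""
    counts = Counter(tokens)
    score = 0.0
    for t in set(query):
        if counts[t] > 0:
            tf = bm25_tf(counts[t], len(tokens), stats["avg_len"], k1, b)
            score += tf * rsj_weight(stats["df"].get(t, 0), stats["n"])
    return score


def rsv_macro(query, doc, field_stats, weights):
    """BM25F-macro (Eq. 1): one constant weight per field type."""
    return sum(weights[f] * bm25_field_score(query, doc.get(f, []), field_stats[f])
               for f in field_stats)


def rsv_fic(query, doc, field_stats, weight_fn):
    """Definition 1: the weight is recomputed for each query and document field."""
    return sum(weight_fn(query, doc.get(f, []), field_stats[f])
               * bm25_field_score(query, doc.get(f, []), field_stats[f])
               for f in field_stats)


# Hypothetical toy statistics: n = number of field instances, df = field-level document frequency.
field_stats = {
    "title": {"n": 4, "avg_len": 3.0, "df": {"drill": 1, "cordless": 2}},
    "description": {"n": 4, "avg_len": 8.0, "df": {"drill": 3, "cordless": 1}},
}
doc = {"title": ["cordless", "drill"],
       "description": ["a", "drill", "for", "home", "use"]}
query = ["cordless", "drill"]

macro = rsv_macro(query, doc, field_stats, {"title": 1.0, "description": 1.0})
fic = rsv_fic(query, doc, field_stats, lambda q, tokens, stats: 1.0)
```

With a weight function that always returns 1.0, `rsv_fic` reduces to `rsv_macro` with uniform weights, which mirrors the close relationship between the two models noted above.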
====3.1 Rationale for the BM25-FIC Score and the Field Weights====
The main research question in this paper is the rationale and estimation of the field weights <math>w_f(q, c, \mathrm{Inf})</math>. Before we propose the estimates, we consider the wider picture of probability and information theory to create the rationale for the estimation of <math>w_f</math>.

An aggregation of values (scores), as in BM25F, is inherently related to the first moment (expected value):

:Mean: <math>E[X] = \sum_{x \in X} x \, P(x)</math>

Regarding BM25F, <math>x</math> is a score for a field, and <math>P(x)</math> is a probability associated with the field. Regarding an information-theoretic approach, the entropy is the expected value (EV) of the negated logarithm of the probability:

:Entropy: <math>E[-\log(P(X))] = -\sum_{x \in X} P(x) \log(P(x))</math>

Entropy and related concepts such as log-likelihood are commonly used for justifying estimates. These probabilistic and information-theoretic rationales justify the field weights.

====3.2 Estimates for Field Weights <math>w_f(\mathrm{Inf})</math>====
Following the framework of probabilistic and information-theoretic expectation values, the candidates for <math>w_f</math> are derived from the concept of information content:

:<math>w_f(q, c, \mathrm{Inf}) = \mathrm{Inf}(q, f, c) := -\sum_{t \in q \cap f} \log(P(t \mid f, c)) \qquad (4)</math>

<math>P(t \mid f, c)</math> is defined via the maximum-likelihood method as the number of document fields where term <math>t</math> occurs, <math>\mathrm{df}(t, f, c)</math>, divided by the number of potential document fields where <math>t</math> could appear: <math>P(t \mid f, c) := \frac{\mathrm{df}(t, f, c)}{N_P(c)}</math>. Three different definitions of the number of potential documents <math>N_P(c)</math> are used to create the three candidate models for information content Inf (<math>\mathrm{Inf}_1</math>, <math>\mathrm{Inf}_2</math> and <math>\mathrm{Inf}_3</math>).

Estimate P1: The first model defines <math>N_P</math> as the total number of documents in the collection: <math>N_{P1}(c) := N_D(c)</math>.

Estimate P2: The second model defines <math>N_P</math> as the number of documents in the collection for which the field in question is not empty, that is, contains at least one term: <math>N_{P2}(c) := \left| \{ d \mid f \in F_c \wedge \exists t, f_d : n(t, f_d) > 0 \} \right|</math>, where <math>f_d</math> is the instance of a field in document <math>d</math>.
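Equation (4) with the estimates P1 and P2 can be illustrated as follows. This is a minimal sketch, not the authors' code: the function names and the toy collection are hypothetical, documents are represented as dictionaries mapping field names to token lists, and the set q ∩ f is read as the query terms that occur in the field instance of the scored document. The scored document is assumed to be part of the collection, so every matching term has a document frequency of at least one.

```python
import math


def n_potential_p1(collection, field):
    """Estimate P1: every document in the collection counts as a potential field."""
    return len(collection)


def n_potential_p2(collection, field):
    """Estimate P2: only documents whose field is non-empty are counted."""
    return sum(1 for d in collection if d.get(field))


def doc_freq(collection, field, term):
    """Number of documents whose instance of `field` contains `term`."""
    return sum(1 for d in collection if term in d.get(field, []))


def field_weight(query, doc, field, collection, n_potential):
    """Information content of Eq. (4) for one field of one document."""
    n_p = n_potential(collection, field)
    weight = 0.0
    for t in set(query) & set(doc.get(field, [])):
        weight += -math.log(doc_freq(collection, field, t) / n_p)
    return weight


# Hypothetical product catalogue with a sparsely filled "attributes" field.
collection = [
    {"name": ["cordless", "drill"], "attributes": ["18v"]},
    {"name": ["hammer", "drill"], "attributes": []},
    {"name": ["drill", "bit"], "attributes": []},
    {"name": ["paint", "brush"], "attributes": ["18v", "blue"]},
]
query = ["cordless", "drill", "18v"]
doc = collection[0]

w_name_p1 = field_weight(query, doc, "name", collection, n_potential_p1)
w_attr_p1 = field_weight(query, doc, "attributes", collection, n_potential_p1)
w_attr_p2 = field_weight(query, doc, "attributes", collection, n_potential_p2)
```

In this toy data the term "18v" occurs in two of four documents but in both of the non-empty attribute fields, so the attribute weight is log 2 under P1 and drops to zero under P2.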
P2 ensures that fields which are empty for many documents are given less weight than they would otherwise. This makes sense, as fields are often empty for reasons such as data redundancy.

Estimate P3: The third model normalizes <math>N_P</math> for each field according to its average field length (avgfl): <math>N_{P3}(c) := N_{P2}(c) \, \frac{\mathrm{avgfl}(c)}{\mathrm{avgfl}(f)}</math>, where <math>\mathrm{avgfl}(c)</math> is the average field length over all fields, and <math>\mathrm{avgfl}(f)</math> is the average for a specific field (e.g. title). P3 ensures that short fields get more weight. Adding weight to shorter fields has been shown to be beneficial in previous research [7].

====3.3 Evaluation Data and Results====
For evaluation we use the Kaggle Home Depot product catalogue data set<sup>1</sup>, also used by [1]. Their results on the data were similar to ones obtained on more established test collections such as TREC-GOV2. The data set contains 55k products with name, description and attribute fields. The attribute field contains additional information, such as notes, and can also be empty. We considered the 1000 queries with the most relevance judgements available. The documents were judged by humans on a scale between 1 and 3. We defined 3 as relevant and anything under as non-relevant. Altogether there are 12093 judgements, 10260 relevant and 1833 non-relevant.

The table below shows the results of the experimentation, with a clear indication that BM25-FIC outperforms BM25F. As mentioned earlier, the field weights for the benchmarks are uniform, as no learning or heuristics are used for BM25-FIC either. The relative improvements in the table are calculated against the better performing baseline, BM25F-micro.
{| class="wikitable"
! Model !! MAP !! ∆ MAP !! P@10 !! ∆ P@10 !! NDCG !! ∆ NDCG
|-
| Baseline BM25F-macro || 0.218 || - || 0.219 || - || 0.496 || -
|-
| Baseline BM25F-micro || 0.232 || - || 0.220 || - || 0.499 || -
|-
| BM25-FIC<sub>P1</sub> || 0.288 || +24% || 0.278 || +27% || 0.546 || +10%
|-
| BM25-FIC<sub>P2</sub> || 0.290 || +25% || 0.281 || +28% || 0.547 || +10%
|-
| BM25-FIC<sub>P3</sub> || 0.300 || +29% || 0.291 || +33% || 0.554 || +11%
|}

Out of the three candidate models, BM25-FIC<sub>P3</sub> is the most accurate, consistently outperforming BM25F-micro and BM25F-macro for all metrics.

<sup>1</sup> https://www.kaggle.com/c/home-depot-product-search-relevance/data

===4 Field Weights for Defining User Query Intent===
To better reflect the query intent, we enhance the BM25-FIC model by using a seed document as a reference point. The user picks a document from the initial results that corresponds to a type of query intent. This is a simplification of usual approaches to relevance feedback [6,9,8,2]. Specific to our model is the usage of the field weights. The field weights reflect query intent, as each field contributes to relevance in a different way. By analysing the similarity between the seed document field weights and those of other documents, the model allows the user to prioritize or de-prioritize documents with similar query intent to that of the seed document.

Definition 2 (Interactive Model).

:<math>\mathrm{RSV}_{\text{BM25-FIC},\mathrm{Inf},\text{inter},a}(q, d, c, d_{sd}) := \mathrm{RSV}_{\text{BM25-FIC}}(q, d, c) + a \cdot S(q, d, c, d_{sd}, \mathrm{Inf})</math>

where <math>S</math> is a similarity measure. This could be a retrieval model or the Euclidean distance:

:<math>S := 1 - \left\| \bar{w}_d(q, c, \mathrm{Inf}) - \bar{w}_{d_{sd}}(q, c, \mathrm{Inf}) \right\|_2</math>

and the normalized field weight vector for document <math>d</math> is defined as

:<math>\bar{w}_d(q, c, \mathrm{Inf}) := \left( \frac{w_{f_1}(q, c, \mathrm{Inf})}{\sum_f w_f(q, c, \mathrm{Inf})}, \ldots, \frac{w_{f_n}(q, c, \mathrm{Inf})}{\sum_f w_f(q, c, \mathrm{Inf})} \right)</math>

where <math>f_i \in F_c</math>.

The parameter <math>a</math> can be adjusted by the user. Positive values result in higher scores for documents that are relevant in a similar way to <math>d_{sd}</math>. Negative values do the opposite. Using the seed document as a reference point and by adjusting <math>a</math>, the user can define their query intent in an intuitive manner.
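The re-ranking step of Definition 2 can be sketched as follows, using the Euclidean variant of the similarity measure. This is only an illustrative sketch: the function names, the field order (plot, author, publication year), the weight values and the base score of 5.0 are hypothetical, and returning a zero vector when all field weights are zero is an assumption made here to avoid a division by zero.

```python
import math


def normalise(weights):
    """Normalized field weight vector: each weight divided by the sum over all fields."""
    total = sum(weights)
    if total == 0:
        # Assumption of this sketch: a document with no matching field gets a zero vector.
        return [0.0] * len(weights)
    return [w / total for w in weights]


def similarity(w_doc, w_seed):
    """S = 1 - Euclidean distance between the normalized weight vectors."""
    nd, ns = normalise(w_doc), normalise(w_seed)
    return 1.0 - math.sqrt(sum((x - y) ** 2 for x, y in zip(nd, ns)))


def rsv_interactive(base_score, w_doc, w_seed, a):
    """Definition 2: the BM25-FIC score shifted by a times the similarity to the seed."""
    return base_score + a * similarity(w_doc, w_seed)


# Hypothetical field weights in the order (plot, author, publication year).
seed = [1.0, 3.0, 0.0]    # seed document matched mainly through the author field
doc_a = [2.0, 6.0, 0.0]   # same weight profile as the seed
doc_b = [1.0, 0.0, 3.0]   # matched mainly through the publication year

promoted = rsv_interactive(5.0, doc_a, seed, 2.0)
demoted = rsv_interactive(5.0, doc_a, seed, -2.0)
```

Because the vectors are normalized, `doc_a` has similarity 1 to the seed even though its raw weights are twice as large. A positive `a` therefore moves it up relative to `doc_b`, and a negative `a` moves it down.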
The idea is to make use of the fact that, when presented with the initial search results, a user can often understand why some documents are high in the ranking, even though they might not correspond to the user's query intent. By using these documents as reference points, the user can easily refine and fine-tune their query intent.

As a more concrete example, consider the query “Wizards and magic J.K. Rowling 1995” used to search a book catalogue with the document fields plot, author and publication year. The fact that the first Harry Potter book was only published in 1997 raises questions about the user's query intent. Are they looking for Harry Potter books, but got the year wrong? Or for books about wizards from 1995 that are similar to J.K. Rowling's work? There are many possible ways, such as the two above, in which documents can be relevant. The model from Definition 2 helps navigate these different dimensions of relevance. Say the user is not looking for Harry Potter books, but some of them come up in the search results. They can choose one of them as the seed document and, by setting the parameter a to a negative value, make other Harry Potter books disappear from the top results. This is because the remaining results would be those with higher weights for publication year and lower ones for author. With a high positive a, they would see all Harry Potter books in the top results.

===5 Conclusion===
The main contribution of this paper is a new method for automatic field weighting in the BM25F retrieval model. We denote the model BM25-FIC, as it uses field information content (FIC) to calculate the document-level field weight. Compared to basic BM25F (macro or micro), there is a relative improvement of between 10% (NDCG) and 30% (MAP and P@10) on the Kaggle Home Depot data set.

Moreover, the paper introduces a new re-ranking retrieval model based on the BM25-FIC weights, which is a good candidate for interactive retrieval.
The aim of the model is to give a user better tools for defining their query intent. This is done using a seed document as a positive or negative reference point for finding the desired dimensionality of relevance.

Overall, the research confirms that there is unexpected potential in the refinement of BM25F field weights, and that the “weighted sum” is also useful for other multi-dimensional measures than just document fields.

===References===
1. Balaneshinkordan, S., Kotov, A., Nikolaev, F.: Attentive Neural Architecture for Ad-hoc Structured Document Retrieval. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. pp. 1173–1182. CIKM ’18, Association for Computing Machinery, Torino, Italy (Oct 2018). https://doi.org/10.1145/3269206.3271801

2. Buckley, C.: Automatic Query Expansion Using SMART: TREC 3. In: Proceedings of the Third Text REtrieval Conference (TREC-3). pp. 69–80 (1993)

3. Frommholz, I., Larsen, B., Piwowarski, B., Lalmas, M., Ingwersen, P., van Rijsbergen, K.: Supporting polyrepresentation in a quantum-inspired geometrical retrieval framework. In: Proceedings of the third symposium on Information interaction in context. pp. 115–124. IIiX ’10, Association for Computing Machinery, New Brunswick, New Jersey, USA (Aug 2010). https://doi.org/10.1145/1840784.1840802

4. He, B., Ounis, I.: Setting Per-field Normalisation Hyper-parameters for the Named-Page Finding Search Task. In: Amati, G., Carpineto, C., Romano, G. (eds.) Advances in Information Retrieval. pp. 468–480. Lecture Notes in Computer Science, Springer, Berlin, Heidelberg (2007)

5. Imhof, M., Braschler, M.: A study of untrained models for multimodal information retrieval. Information Retrieval Journal 21(1), 81–106 (Feb 2018). https://doi.org/10.1007/s10791-017-9322-x

6. Robertson, S., Zaragoza, H.: The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3, 333–389 (Jan 2009). https://doi.org/10.1561/1500000019

7. Robertson, S., Zaragoza, H., Taylor, M.: Simple BM25 extension to multiple weighted fields. In: Proceedings of the thirteenth ACM international conference on Information and knowledge management. pp. 42–49. CIKM ’04, Association for Computing Machinery, Washington, D.C., USA (Nov 2004)

8. Rocchio: Relevance feedback in information retrieval. The SMART Retrieval System: Experiments in Automatic Document Processing (1971)

9. Roelleke, T.: Information Retrieval Models: Foundations and Relationships. Morgan & Claypool Publishers (2013)

10. Zellhöfer, D., Schmitt, I.: A user interaction model based on the principle of polyrepresentation. In: Proceedings of the 4th workshop on Workshop for Ph.D. students in information & knowledge management. pp. 3–10. PIKM ’11, Association for Computing Machinery, Glasgow, Scotland, UK (Oct 2011). https://doi.org/10.1145/2065003.2065007