<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Bayesian Neural Model for Documents' Relevance Estimation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberto Purpura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gian Antonio Susto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padova</institution>
          ,
          <addr-line>Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We propose QLFusion, an approach based on Quantification Learning (QL) to improve rank fusion performance in Information Retrieval. We first introduce a QL model based on a Bayesian Neural Network to estimate the proportion of relevant documents in a ranked list. The proposed model is trained using a probabilistic loss function formulated specifically for this QL task. Next, we describe a rank fusion algorithm which leverages this information to merge multiple ranked lists. We compare our approach to several popular rank fusion baselines on multiple collections, showing how the proposed approach outperforms the baselines on several evaluation measures.</p>
      </abstract>
      <kwd-group>
        <kwd>bayesian neural models</kwd>
        <kwd>quantification learning</kwd>
        <kwd>information retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Quantification Learning (QL) is a Machine Learning task that can be defined as follows: given a labeled training set, induce a quantifier that takes an unlabelled test set as input and returns its best estimate of the class distribution [1]. QL has been studied for prevalence estimation in the medical domain (i.e., to estimate the number of cases of a condition at a particular point in time) in [2], in text classification in [1], and for sentiment analysis in [3] (more in Section 2).</p>
      <p>However, the application of QL to traditional Information Retrieval (IR) tasks is still very limited [4]. This is likely due to the formulation of the ranking problem in IR, and to the lack of large annotated training datasets, which are starting to become available only recently with the increasing interest towards Neural IR. In this paper, we propose a QL Bayesian Neural Model (BNM) called QLFusion for the rank fusion task in IR. This task consists in combining items from multiple ranked lists into one, maximizing the relevance of the documents at the top. Over the years, other rank fusion strategies explored different probabilistic solutions to the problem; however, these solutions generally employ limited modeling strategies [5] or methods not flexible enough to adapt to the large variance introduced by the different query topics [6]. To overcome these limitations, we propose to employ a shallow BNM trained with a probabilistic loss function which maximizes the representation and generalization power of the model, limiting the number of parameters to be learnt and the need for large training datasets compared to an equivalent traditional neural model. BNMs are similar to traditional neural networks, but the model weights are probability distributions instead of scalar values [7]. For this reason, a BNM can be compared to an ensemble of conventional neural networks in which the weights are sampled from the weight distributions that our model learns [8].</p>
      <p>We interpret the output of QLFusion as a probability value p indicating the likelihood of sampling a relevant document from a ranked list after n attempts. We then identify a Binomial distribution with parameters p and n, and train a QL model to minimize the Kullback-Leibler (KL) divergence between the distribution identified by the QLFusion model output and the correct probability value computed using the relevance judgments, i.e., the ground truth. To minimize the need for training data, our model employs a Bayesian Neural Network (BNN) only as its output layer, i.e., a neural network with a prior distribution on its weights [7]. The use of these models in real-world scenarios has recently been made possible thanks to theoretical advancements in the approximation of the sampling process from the learned weight distributions, which is necessary to train them [9, 10]. We then describe (Section 3) a rank fusion algorithm that leverages the information provided by the proposed model. In our evaluation (Section 4), we show how a BNM is superior to a deterministic one and to other more straightforward linear and non-linear machine learning models for the QL task at hand. Finally, we evaluate the quality of the proposed QL approach, training objective, and rank fusion algorithm in the rank fusion task, comparing it to other popular rank fusion algorithms such as CombSum, CombMNZ [11], META [5], and variants of our model on three different datasets: TREC-3 [12] and TREC-5 [13], widely used for testing rank fusion methods, and CLEF-2018 [14], a recent dataset whose goal is to maximize recall in the context of medical reviews.</p>
      <p>DESIRES 2021 - 2nd International Conference on Design of Experimental Search Information REtrieval Systems, September 15-18, 2021, Padua, Italy. Contact: purpuraa@dei.unipd.it (A. Purpura); sustogia@dei.unipd.it (G. A. Susto). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).</p>
      <p>The paper is organized as follows: Section 2 presents related works, Section 3 describes the proposed QL model and our rank fusion algorithm, Section 4 reports the experimental results, and Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>QL addresses the problem of estimating the number of positives in a set of data points, given a training set from which to learn to distinguish positives from negatives [4, 1]. However, this estimate is not computed with a classic classification-and-count approach, but relies on a model trained to predict the count directly, skipping the classification step. This solution is particularly useful when: (i) the distribution of positives in the training set differs from the one in the test set, (ii) acquiring training data is expensive, or (iii) there are too few positive examples to learn an accurate classification model [1]. This is often the case in ad-hoc retrieval, where the sparsity of the training data often limits the performance of machine learning approaches [15, 16]. For this reason, we investigate how we can benefit from learning a model able to only quantify the number of documents relevant to a particular query in a collection, without learning an accurate relevance model. The IR task that we choose is rank fusion, where the goal is to merge multiple ranked lists into one, maximizing the relevance of the documents at the top. We compare our approach to two classic and still competitive baselines: CombSum and CombMNZ [11]. Under CombSum, a document's score is calculated by adding the normalized scores returned by the individual input models. Under CombMNZ, a document's score is found by multiplying its CombSum score by the number of non-zero relevance scores that it received in the ranked lists to fuse. Our QLFusion approach is inspired by previous works in the rank fusion domain, such as the linear combination model proposed by [17], where document scores are rescaled before being combined, using different coefficients for each query. Our probabilistic view of the problem is also shared by previous works in the rank fusion domain. For instance, [5] shows how the relevance scores that documents receive for a query can be modeled with known distributions; their main contribution is to show how Gaussian and Exponential distributions can fit the distribution of relevance scores of relevant and non-relevant documents, respectively. They then use this information to compute the relevance probability of each item in a ranked list and combine the documents through a rank fusion algorithm (META). With a similar goal in mind, but employing the quantification learning paradigm, the proposed QLFusion model directly computes these coefficients and then uses them to fuse multiple ranked lists into one.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>In this section, we describe the BNM, the training objective, and the rank fusion algorithm that we propose. We first present the proposed QL model, and then show how it can be employed to perform rank fusion through our QLFusion algorithm.</p>
      <p>Quantification Learning Model. Given a user query and a document ranking returned by a retrieval model, the proposed QLFusion algorithm predicts the proportion of relevant documents in the ranking from its relevance scores. Indeed, the sequence of scores assigned to different documents by a retrieval model can be a proxy for the number/proportion of relevant documents retrieved. In Figure 1, we report the sequence of normalized relevance scores associated to the top 25 documents for three random topics by one of the runs submitted to TREC-3 achieving the best MAP. As observed by [6] and [5], the distribution of relevant documents in a ranked list can be estimated from the entire set of relevance scores. However, the number of retrieved relevant documents varies widely from topic to topic. For this reason, to estimate the proportion of relevant documents in different ranked lists, we need a model able to adapt to each retrieval model that computed them and to the different characteristics of each topic. Following these observations, we developed a QL model to estimate the quality of a ranked list in terms of the proportion of relevant documents contained in its top-n positions.</p>
      <p>Our QL approach consists of a shallow neural model with one feed-forward hidden layer and a Bayesian neural output layer. Before feeding the data into the hidden layer of our model, we first rescale them within each batch to normalize the feature values independently [18]. Following [18], the transformation that we apply to each feature s of the input is s' = g (s - m) / d + b where, for each item in the input batch, g, d, m, and b are respectively the gain term, standard deviation, mean, and bias of the layer, of which g and b are parameters learned by the model. After feeding the inputs to the hidden layer, we apply a sigmoid activation function to the layer outputs.</p>
      <p>The output layer that we employ is a Bayesian Neural Layer (BNL) [19]. A BNL is similar to a traditional neural network layer, but its weights are probability distributions instead of scalar values. We realized this layer employing the flipout technique, an efficient method for decorrelating the gradients within a mini-batch by implicitly sampling pseudo-independent weight perturbations for each example [20]. The employed weight priors are Gaussian. During inference and training, we sample a value for each of the network weights and use it to compute the final output of the network. More precisely, if we represent the last layer of our model as the function f(x, W), which receives an input x and is parametrized by the weights in W, our model samples the values of W from a distribution q parametrized by θ. The network is trained to minimize the expected loss E_{(x,y) ~ D, W ~ q_θ}[L(f(x, W), y)], where D indicates the data distribution and L our loss function.</p>
      <p>The distribution of W can be represented in terms of perturbations: W = W̄ + ΔW, where W̄ indicates the mean of the layer weights and ΔW is a stochastic perturbation that, in our case, follows a Gaussian distribution [20]. Thanks to these characteristics, a BNM can be considered equivalent to an ensemble of neural models where the weights of each model are sampled from our learned distributions. Hence, we leverage the generalization power of ensemble approaches while at the same time improving training time and efficiency. We also apply a sigmoid activation function to the outputs of this layer to keep them in the [0, 1] range.</p>
      <p>We train the proposed model in a supervised fashion by minimizing the KL-divergence between the Binomial distribution with probability equal to the model output p̂ and the "true" proportion of relevant documents in the input ranked list, interpreted as a Binomial distribution as well. The loss function formulation originates from the idea of interpreting the output of a QL model as the proportion of successful attempts at sampling a relevant document from an input ranked list of length n. Formally, if we consider the true probability p of sampling a relevant document from the input ranked list of length n, and our model prediction p̂, we can identify two Binomial distributions: P ~ B(p, n) and Q ~ B(p̂, n), respectively. The formulation of the loss function follows from the definition of the KL-divergence between two discrete probability distributions: L = KL(P || Q) = sum_{k=0..n} P(k) log(P(k) / Q(k)), where P(k) = C(n, k) p^k (1 - p)^(n - k) and Q(k) = C(n, k) p̂^k (1 - p̂)^(n - k). Since both distributions share the same n, the sum simplifies to L = n [p log(p / p̂) + (1 - p) log((1 - p) / (1 - p̂))].</p>
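      <p>As a numerical sanity check of this objective, the following sketch (plain NumPy, with hypothetical helper names) compares the full sum over k with the closed form that holds when both Binomials share the same n:</p>

```python
import numpy as np
from math import lgamma

def binom_pmf(k, n, p):
    # exp( log C(n, k) + k log p + (n - k) log(1 - p) )
    log_c = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return np.exp(log_c + k * np.log(p) + (n - k) * np.log(1 - p))

def binomial_kl(p, p_hat, n):
    """KL(B(p, n) || B(p_hat, n)) as the explicit sum over k = 0..n."""
    ks = np.arange(n + 1)
    P = np.array([binom_pmf(k, n, p) for k in ks])
    Q = np.array([binom_pmf(k, n, p_hat) for k in ks])
    return float(np.sum(P * np.log(P / Q)))

def binomial_kl_closed(p, p_hat, n):
    """Closed form, valid because both distributions share the same n."""
    return n * (p * np.log(p / p_hat) + (1 - p) * np.log((1 - p) / (1 - p_hat)))
```

      <p>The closed form is what makes the loss cheap to evaluate during training: it needs only the true proportion p, the prediction p̂, and the list length n.</p>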
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <sec id="sec-4-1">
        <p>We conduct our experiments on two TREC collections widely used in the rank fusion domain, TREC-3 [12] and TREC-5 [13], and on the CLEF-2018 Technologically Assisted Reviews in Empirical Medicine (TAR) collection [14]. In particular, the CLEF-2018 TAR task focused on maximizing recall. We consider the 6 runs with the highest MAP submitted to each track and all of their topics. The evaluation measures that we report in our experiments are P@{5, 10, 20}, Recall@{5, 10, 20}, and nDCG@{5, 10, 20}. Also, since most of the relevant documents in a ranked list will likely be at the top of it, as we can also observe from the charts in Figure 1, we only consider the top 100 documents in each ranked list. We employ 10-fold cross-validation to train and evaluate our model, reporting the results averaged over the 10 folds: we divide the set of topics into 10 groups, train our model on 8 of them, use one group for validation, and evaluate our approach on the remaining one, repeating this process 10 times. All the measures we report are averaged over the 10 folds, after training a different model for each collection and retrieval model. The learning rate and hidden layer size used to train our model were 0.0005 and 32 for all collections. The batch size was optimized automatically according to the average performance of the model on the validation set of each collection, considering values in the range [1, 2, 4, 8, 16]. We train our model for a maximum of 500 epochs with early stopping with patience 20, and evaluate it after each epoch on the validation set. For each fold, we keep the model with the minimum mean absolute error obtained on the validation set. Our implementation of the model is publicly available online.1</p>
        <p>Rank Fusion. For this task, we merge the documents from different ranked lists employing QLFusion, which rescales the relevance scores of documents according to the proportion of relevant documents predicted for each ranking. The baselines that we consider are CombSum and CombMNZ, two popular and still highly competitive rank fusion approaches; META, a similar probabilistic approach for rank fusion that we reimplemented; and three variants of our QLFusion approach relying on different regression models: a Linear Regressor (LR), a Random Forest (RF), and a standard neural model with the same architecture and feature normalization strategy as the proposed Bayesian neural one, but using only deterministic neural layers and the mean squared error (MSE) as loss function (det.). From the results in Table 1, we observe that the proposed QLFusion algorithm combined with the probabilistic QL model outperforms all the other baselines in the majority of the evaluation measures on the TREC experimental collections, while the deterministic variant of our QLFusion approach is the best one overall on the CLEF-2018 dataset. Interestingly, the performance of all QLFusion variants on the CLEF-2018 collection is much higher than on the TREC ones, especially considering the measures focusing on the top 5 and 10 items. We also observe that the META algorithm is more competitive on CLEF-2018 than on the other collections. This is likely due to the different characteristics of the runs to merge, whose goal was to maximize recall instead of other ranking metrics, which makes them more similar to our QLFusion model.</p>
      </sec>
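      <p>The two classic baselines and the proportion-weighted rescaling idea behind QLFusion can be sketched as follows. The CombSum and CombMNZ definitions follow [11]; the ql_fusion function is only an illustration of weighting each run by its predicted proportion of relevant documents, not the authors' exact algorithm (which is available in the linked repository).</p>

```python
from collections import defaultdict

def comb_sum(runs):
    """CombSum: sum each document's (normalized) scores across runs [11]."""
    fused = defaultdict(float)
    for run in runs:                      # run: {doc_id: normalized score}
        for doc, score in run.items():
            fused[doc] += score
    return dict(fused)

def comb_mnz(runs):
    """CombMNZ: CombSum score times the number of runs retrieving the doc."""
    fused, hits = comb_sum(runs), defaultdict(int)
    for run in runs:
        for doc in run:
            hits[doc] += 1
    return {doc: s * hits[doc] for doc, s in fused.items()}

def ql_fusion(runs, predicted_proportions):
    """Sketch of QLFusion-style fusion: rescale each run's scores by the
    proportion of relevant documents predicted for that run, then sum."""
    fused = defaultdict(float)
    for run, p_hat in zip(runs, predicted_proportions):
        for doc, score in run.items():
            fused[doc] += p_hat * score
    return dict(fused)

# toy usage: two runs over three documents
runs = [{"d1": 1.0, "d2": 0.5}, {"d2": 1.0, "d3": 0.8}]
scores = comb_mnz(runs)
ranking = sorted(scores, key=scores.get, reverse=True)  # ['d2', 'd1', 'd3']
```

      <p>In this sketch, a run whose predicted proportion of relevant documents is low contributes proportionally less to the fused scores, which is the intuition stated above.</p>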
      <sec id="sec-4-2">
        <p>1: Link to the online repository: https://github.com/albpurpura/QLFusion.</p>
        <p>Table 1: Rank fusion results of the probabilistic (QLFusion prob.) and deterministic (QLFusion det.) model variants, each trained with the proposed loss and with the MSE loss.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <p>Model Optimization. Table 2 reports the evaluation of the QLFusion model with the probabilistic and deterministic loss functions. From these results, we observe how the probabilistic neural model (QLFusion prob.) outperforms, in most cases, its deterministic variant (QLFusion det.) regardless of the chosen loss function. However, the proposed probabilistic loss gives the probabilistic model a sizeable advantage over its variant trained with the MSE loss, especially on the TREC collections. Also, the effect of the proposed loss function on the deterministic QLFusion model is not as large as on its probabilistic variant. Differently from the other experiments, on the CLEF-2018 dataset we observe that the effect of the loss function used is negligible across the model variants. If we consider the variance of the models' predictions, an indicator of how much our models overfit or underfit the data, we notice that the probabilistic loss favors a larger variance in the model's predictions, closer to the variance of the true proportions of relevant documents across queries, decreasing the tendency to underfit the training data that we observed with the deterministic model and/or the MSE loss function.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions</title>
      <p>In this work, we proposed a Quantification Learning-based approach for rank fusion (QLFusion). The model relies on a Bayesian Neural Network (BNN) to predict the number of relevant documents in a ranked list, information that is then used to rescale the relevance scores of documents in different ranked lists and fuse them. We trained the proposed BNN using a custom probabilistic loss function which allowed us to improve the generalization power of the model. The proposed approach outperformed the considered baselines on three experimental collections. In the future, we plan to investigate new features to add to our QLFusion approach to further improve its performance.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Work supported by the ExaMode project, as part of the European Union Horizon 2020 program under Grant Agreement no. 825292.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[1] G. Forman, Counting positives accurately despite inaccurate classification, in: Proc. of ECIR 2005, Springer, 2005, pp. 564-575.</p>
      <p>[2] A. Moreo, F. Sebastiani, Tutorial: Supervised Learning for Prevalence Estimation, in: Proc. of the 13th International Conference on Flexible Query Answering Systems, FQAS 2019, volume 11529 of Lecture Notes in Computer Science, Springer, 2019, pp. 13-17.</p>
      <p>[3] W. Gao, F. Sebastiani, Tweet Sentiment: From Classification to Quantification, in: Proc. of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM '15, ACM, New York, NY, USA, 2015, pp. 97-104.</p>
      <p>[4] P. González, A. Castaño, N. Chawla, J. Coz, A review on quantification learning, ACM Computing Surveys (CSUR) 50 (2017) 74.</p>
      <p>[5] R. Manmatha, T. Rath, F. Feng, Modeling score distributions for combining the outputs of search engines, in: Proc. of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001, pp. 267-275.</p>
      <p>[6] D. Lillis, F. Toolan, R. Collier, J. Dunnion, ProbFuse: a probabilistic approach to data fusion, in: Proc. of SIGIR 2006, 2006, pp. 139-146.</p>
      <p>[7] R. Neal, Bayesian learning for neural networks, volume 118, Springer Science &amp; Business Media, 2012.</p>
      <p>[8] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural networks, arXiv preprint arXiv:1505.05424 (2015).</p>
      <p>[9] A. Balan, V. Rathod, K. Murphy, M. Welling, Bayesian dark knowledge, in: Advances in Neural Information Processing Systems, 2015, pp. 3438-3446.</p>
      <p>[10] A. Graves, Practical variational inference for neural networks, in: Advances in Neural Information Processing Systems, 2011, pp. 2348-2356.</p>
      <p>[11] J. Shaw, E. Fox, Combination of multiple searches, in: Proc. of TREC 1994, 1994, pp. 105-108.</p>
      <p>[12] D. Harman, Overview of the third text retrieval conference (TREC-3), volume 500, DIANE Publishing, 1995.</p>
      <p>[13] D. Harman, Overview of the fifth text retrieval conference (TREC-5), 1997.</p>
      <p>[14] E. Kanoulas, D. Li, L. Azzopardi, R. Spijker, CLEF 2018 technologically assisted reviews in empirical medicine, CEUR Workshop Proceedings, volume 2125, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[15] J. Guo, Y. Fan, L. Pang, L. Yang, Q. Ai, H. Zamani, et al., A deep look into neural ranking models for information retrieval, Information Processing &amp; Management (2019) 102067.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[16] S. Marchesin, A. Purpura, G. Silvello, Focal elements of neural information retrieval models. An outlook through a reproducibility study, Information Processing &amp; Management (2019) 102109.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[17] L. Si, J. Callan, Using sampled data and regression to merge search engine results, in: Proc. of SIGIR 2002, 2002, pp. 19-26.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[18] J. Lei Ba, J. Kiros, G. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450 (2016).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[19] Y. Gal, Z. Ghahramani, Bayesian convolutional neural networks with Bernoulli approximate variational inference, arXiv preprint arXiv:1506.02158 (2015).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[20] Y. Wen, P. Vicol, J. Ba, D. Tran, R. Grosse, Flipout: efficient pseudo-independent weight perturbations on mini-batches, arXiv preprint arXiv:1803.04386 (2018).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>