<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Bayesian Neural Model for Documents' Relevance Estimation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberto Purpura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gian Antonio Susto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padova</institution>
          ,
          <addr-line>Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We propose QLFusion, an approach based on Quantification Learning (QL) to improve rank fusion performance in Information Retrieval. We first introduce a QL model based on a Bayesian Neural Network to estimate the proportion of relevant documents in a ranked list. The proposed model is trained using a probabilistic loss function formulated specifically for this QL task. Next, we describe a rank fusion algorithm which leverages this information to merge multiple ranked lists. We compare our approach to several popular rank fusion baselines on multiple collections, showing how the proposed approach outperforms the baselines on several evaluation measures.</p>
      </abstract>
      <kwd-group>
        <kwd>bayesian neural models</kwd>
        <kwd>quantification learning</kwd>
        <kwd>information retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Quantification Learning (QL) is a Machine Learning task that can be defined as follows: given a labeled training set, induce a quantifier that takes an unlabelled test set as input and returns its best estimate of the class distribution [1]. QL has been studied for prevalence estimation in the medical domain (i.e., to estimate the number of cases of a condition at a particular point in time) in [2], in text classification in [1], and for sentiment analysis in [3] (more in Section 2).</p>
      <p>However, the application of QL to traditional Information Retrieval (IR) tasks is still very limited [4]. This is likely due to the formulation of the ranking problem in IR, and to the lack of large annotated training datasets, which are starting to become available only recently with the increasing interest towards Neural IR. In this paper, we propose a QL Bayesian Neural Model (BNM) called QLFusion for the rank fusion task in IR. This task consists in combining items from multiple ranked lists into one, maximizing the relevance of the documents at the top. Over the years, other rank fusion strategies explored different probabilistic solutions to the problem; however, these solutions generally employ limited modeling strategies [5] or methods not flexible enough to adapt to the large variance introduced by the different query topics [6]. To overcome these limitations, we propose to employ a shallow BNM trained with a probabilistic loss function which maximizes the representation and generalization power of the model, limiting the number of parameters to be learnt and the need for large training datasets compared to an equivalent traditional neural model. BNMs are similar to traditional neural networks, but the model weights are probability distributions instead of scalar values [7]. For this reason, a BNM can be compared to an ensemble of conventional neural networks in which the weights are sampled from the weight distributions that our model learns [8].</p>
      <p>We interpret the output of QLFusion as a probability value p indicating the likelihood of sampling a relevant document from a ranked list after n attempts. We then identify a Binomial distribution with parameters p and n, and train a QL model to minimize the Kullback-Leibler (KL) divergence between the distribution identified by the QLFusion model output and the correct probability value computed using the relevance judgments, i.e., the ground truth. To minimize the need for training data, our model employs a Bayesian Neural Network (BNN) only as its output layer, i.e., a neural network with a prior distribution on its weights [7]. The use of these models in real-world scenarios has recently been made possible thanks to theoretical advancements in the approximation of the sampling process from the learned weight distributions, which is necessary to train them [9, 10]. We then describe (Section 3) a rank fusion algorithm that leverages the information provided by the proposed model. In our evaluation (Section 4), we show how a BNM is superior to a deterministic one and to other more straightforward linear and non-linear machine learning models for the QL task at hand. Finally, we evaluate the quality of the proposed QL approach, training objective, and rank fusion algorithm in the rank fusion task, comparing it to other popular rank fusion algorithms such as CombSum, CombMNZ [11], META [5], and variants of our model on three different datasets: TREC-3 [12] and TREC-5 [13], widely used for testing rank fusion methods, and CLEF-2018 [14], a recent dataset whose goal is to maximize recall in the context of medical reviews.</p>
      <p>DESIRES 2021 - 2nd International Conference on Design of Experimental Search Information REtrieval Systems, September 15-18, 2021, Padua, Italy. Contact: purpuraa@dei.unipd.it (A. Purpura); sustogia@dei.unipd.it (G. A. Susto). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).</p>
      <p>The paper is organized as follows: Section 2 presents related works, Section 3 describes the proposed QL model and our rank fusion algorithm, Section 4 reports the experimental results, and Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>QL addresses the problem of estimating the number of positives in a set of data points, given a training set from which to learn to distinguish positives from negatives [4, 1]. However, this estimate is not computed with a classic classification-and-count approach, but relies on a model trained to predict the count directly, skipping the classification step. This solution is particularly useful when: (i) the distribution of positives in the training set differs from the one in the test set, (ii) acquiring training data is expensive, or (iii) there are too few positive examples to learn an accurate classification model [1]. This is often the case in ad-hoc retrieval, where the sparsity of the training data often limits the performance of machine learning approaches [15, 16]. For this reason, we investigate how we can benefit from learning a model able to only quantify the number of documents relevant to a particular query in a collection, without learning an accurate relevance model. The IR task that we choose is rank fusion, where the goal is to merge multiple ranked lists into one, maximizing the relevance of the documents at the top. We compare our approach to two classic and still competitive baselines: CombSum and CombMNZ [11]. Under CombSum, a document's score is calculated by adding the normalized scores returned by the individual input models. Under CombMNZ, a document's score is found by multiplying its CombSum score by the number of non-zero relevance scores that it received in the ranked lists to fuse. Our QLFusion approach is inspired by previous works in the rank fusion domain, such as the linear combination model proposed by [17], where document scores are rescaled before being combined, using different coefficients for each query. Our probabilistic view of the problem is also shared by previous works in the rank fusion domain. For instance, [5] shows how the relevance scores that documents receive for a query can be modeled with known distributions; their main contribution is to show how Gaussian and Exponential distributions can fit the distribution of relevance scores of relevant and non-relevant documents, respectively. They then use this information to compute the relevance probability of each item in a ranked list and combine the documents through a rank fusion algorithm (META). With a similar goal in mind, but employing the quantification learning paradigm, the proposed QLFusion model directly computes these coefficients and then uses them to fuse multiple ranked lists into one.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>In this section, we describe the BNM, the training objective, and the rank fusion algorithm that we propose. We first present the proposed QL model, and then show how it can be employed to perform rank fusion through our QLFusion algorithm.</p>
      <p>Quantification Learning Model. Given a user query and a document ranking returned by a retrieval model, the proposed QLFusion algorithm predicts the proportion of relevant documents in the ranking from its relevance scores. Indeed, the sequence of scores assigned to different documents by a retrieval model can be a proxy for the number/proportion of relevant documents retrieved. In Figure 1, we report the sequence of normalized relevance scores associated to the top 25 documents for three random topics by one of the runs submitted to TREC-3 achieving the best MAP. As observed by [6] and [5], the distribution of relevant documents in a ranked list can be estimated from the entire set of relevance scores. However, the number of retrieved relevant documents varies widely from topic to topic. For this reason, to estimate the proportion of relevant documents in different ranked lists, we need a model able to adapt to each retrieval model that computed them and to the different characteristics of each topic. Following these observations, we developed a QL model to estimate the quality of a ranked list in terms of the proportion of relevant documents contained in its top-n positions.</p>
      <p>Our QL approach consists of a shallow neural model with one feed-forward hidden layer and a Bayesian neural output layer. Before feeding the data into the hidden layer of our model, we first rescale them within each batch to normalize the feature values independently [18]. Following [18], the transformation that we apply to each feature s of the input is s' = g (s - m) / d + b where, for each item in the input batch, g, d, m, and b are respectively the gain term, standard deviation, mean, and bias of the layer, of which g and b are parameters learned by the model. After feeding the inputs to the hidden layer, we apply a sigmoid activation function to the layer outputs.</p>
      <p>The output layer that we employ is a Bayesian Neural Layer (BNL) [19]. A BNL is similar to a traditional neural network layer, but its weights are probability distributions instead of scalar values. We realized this layer employing the flipout technique, an efficient method for decorrelating the gradients within a mini-batch by implicitly sampling pseudo-independent weight perturbations for each example [20]. The employed weight priors are Gaussian. During inference and training, we sample a value for each of the network weights and use it to compute the final output of the network. More precisely, if we represent the last layer of our model as the function f(x, W), which receives an input x and is parametrized by the weights in W, our model samples the values of W from a distribution q parametrized by θ. The network is trained to minimize the expected loss E_{(x,y) ~ D, W ~ q_θ}[L(f(x, W), y)], where D indicates the data distribution and L our loss function.</p>
      <p>The distribution of W can be represented in terms of perturbations: W = W̄ + ΔW, where W̄ indicates the mean of the layer weights and ΔW is a stochastic perturbation that, in our case, follows a Gaussian distribution [20]. Thanks to these characteristics, a BNM can be considered equivalent to an ensemble of neural models where the weights of each model are sampled from our learned distributions. Hence, we leverage the generalization power of ensemble approaches while at the same time improving training time and efficiency. We also apply a sigmoid activation function to the outputs of this layer to keep them in the [0, 1] range.</p>
      <p>We train the proposed model in a supervised fashion by minimizing the KL-divergence between the Binomial distribution with probability equal to the model output p̂ and the "true" proportion of relevant documents in the input ranked list, interpreted as a Binomial distribution as well. The loss function formulation originates from the idea of interpreting the output of a QL model as the proportion of successful attempts at sampling a relevant document from an input ranked list of length n. Formally, if we consider the true probability p of sampling a relevant document from the input ranked list of length n, and our model prediction p̂, we can identify two Binomial distributions: P ~ B(p, n) and Q ~ B(p̂, n), respectively. The formulation of the loss function follows from the definition of the KL-divergence between two discrete probability distributions: L = KL(P || Q) = sum_{k=0..n} P(k) log(P(k) / Q(k)), where P(k) = C(n, k) p^k (1 - p)^(n - k) and Q(k) = C(n, k) p̂^k (1 - p̂)^(n - k). Since both distributions share the same n, the sum simplifies to L = n [p log(p / p̂) + (1 - p) log((1 - p) / (1 - p̂))].</p>
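      <p>As a numerical sanity check of this objective, the following sketch (plain NumPy, with hypothetical helper names) compares the full sum over k with the closed form that holds when both Binomials share the same n:</p>

```python
import numpy as np
from math import lgamma

def binom_pmf(k, n, p):
    # exp( log C(n, k) + k log p + (n - k) log(1 - p) )
    log_c = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return np.exp(log_c + k * np.log(p) + (n - k) * np.log(1 - p))

def binomial_kl(p, p_hat, n):
    """KL(B(p, n) || B(p_hat, n)) as the explicit sum over k = 0..n."""
    ks = np.arange(n + 1)
    P = np.array([binom_pmf(k, n, p) for k in ks])
    Q = np.array([binom_pmf(k, n, p_hat) for k in ks])
    return float(np.sum(P * np.log(P / Q)))

def binomial_kl_closed(p, p_hat, n):
    """Closed form, valid because both distributions share the same n."""
    return n * (p * np.log(p / p_hat) + (1 - p) * np.log((1 - p) / (1 - p_hat)))
```

      <p>The closed form is what makes the loss cheap to evaluate during training: it needs only the true proportion p, the prediction p̂, and the list length n.</p>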
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <sec id="sec-4-1">
        <p>We conduct our experiments on two TREC collections widely used in the rank fusion domain, TREC-3 [12] and TREC-5 [13], and on the CLEF-2018 Technologically Assisted Reviews in Empirical Medicine (TAR) collection [14]. In particular, the CLEF-2018 TAR task focused on maximizing recall. We consider the 6 runs with the highest MAP submitted to each track and all of their topics. The evaluation measures that we report in our experiments are P@{5, 10, 20}, Recall@{5, 10, 20}, and nDCG@{5, 10, 20}. Also, since most of the relevant documents in a ranked list will likely be at the top of it, as we can also observe from the charts in Figure 1, we only consider the top 100 documents in each ranked list. We employ 10-fold cross-validation to train and evaluate our model, reporting the results averaged over the 10 folds: we divide the set of topics into 10 groups, train our model on 8 of them, use one group for validation, and evaluate our approach on the remaining one, repeating this process 10 times. All the measures we report are averaged over the 10 folds, after training a different model for each collection and retrieval model. The learning rate and hidden layer size used to train our model were 0.0005 and 32 for all collections. The batch size was optimized automatically according to the average performance of the model on the validation set of each collection, considering values in the range [1, 2, 4, 8, 16]. We train our model for a maximum of 500 epochs with early stopping with patience 20, and evaluate it after each epoch on the validation set. For each fold, we keep the model with the minimum mean absolute error obtained on the validation set. Our implementation of the model is publicly available online.1</p>
        <p>Rank Fusion. For this task, we merge the documents from different ranked lists employing QLFusion, which rescales the relevance scores of documents according to the proportion of relevant documents predicted for each ranking. The baselines that we consider are CombSum and CombMNZ, two popular and still highly competitive rank fusion approaches; META, a similar probabilistic approach for rank fusion that we reimplemented; and three variants of our QLFusion approach relying on different regression models: a Linear Regressor (LR), a Random Forest (RF), and a standard neural model with the same architecture and feature normalization strategy as the proposed Bayesian neural one, but using only deterministic neural layers and the mean squared error (MSE) as loss function (det.). From the results in Table 1, we observe that the proposed QLFusion algorithm combined with the probabilistic QL model outperforms all the other baselines in the majority of the evaluation measures on the TREC experimental collections, while the deterministic variant of our QLFusion approach is the best one overall on the CLEF-2018 dataset. Interestingly, the performance of all QLFusion variants on the CLEF-2018 collection is much higher than on the TREC ones, especially considering the measures focusing on the top 5 and 10 items. We also observe that the META algorithm is more competitive on CLEF-2018 than on the other collections. This is likely due to the different characteristics of the runs to merge, whose goal was to maximize recall instead of other ranking metrics, which makes them more similar to our QLFusion model.</p>
      </sec>
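      <p>The two classic baselines and the proportion-weighted rescaling idea behind QLFusion can be sketched as follows. The CombSum and CombMNZ definitions follow [11]; the ql_fusion function is only an illustration of weighting each run by its predicted proportion of relevant documents, not the authors' exact algorithm (which is available in the linked repository).</p>

```python
from collections import defaultdict

def comb_sum(runs):
    """CombSum: sum each document's (normalized) scores across runs [11]."""
    fused = defaultdict(float)
    for run in runs:                      # run: {doc_id: normalized score}
        for doc, score in run.items():
            fused[doc] += score
    return dict(fused)

def comb_mnz(runs):
    """CombMNZ: CombSum score times the number of runs retrieving the doc."""
    fused, hits = comb_sum(runs), defaultdict(int)
    for run in runs:
        for doc in run:
            hits[doc] += 1
    return {doc: s * hits[doc] for doc, s in fused.items()}

def ql_fusion(runs, predicted_proportions):
    """Sketch of QLFusion-style fusion: rescale each run's scores by the
    proportion of relevant documents predicted for that run, then sum."""
    fused = defaultdict(float)
    for run, p_hat in zip(runs, predicted_proportions):
        for doc, score in run.items():
            fused[doc] += p_hat * score
    return dict(fused)

# toy usage: two runs over three documents
runs = [{"d1": 1.0, "d2": 0.5}, {"d2": 1.0, "d3": 0.8}]
scores = comb_mnz(runs)
ranking = sorted(scores, key=scores.get, reverse=True)  # ['d2', 'd1', 'd3']
```

      <p>In this sketch, a run whose predicted proportion of relevant documents is low contributes proportionally less to the fused scores, which is the intuition stated above.</p>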
      <sec id="sec-4-2">
        <p>1: Link to the online repository: https://github.com/albpurpura/QLFusion.</p>
        <p>Table 1: Rank fusion results of the probabilistic (QLFusion prob.) and deterministic (QLFusion det.) model variants, each trained with the proposed loss and with the MSE loss.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <p>Model Optimization. Table 2 reports the evaluation of the QLFusion model with the probabilistic and deterministic loss functions. From these results, we observe how the probabilistic neural model (QLFusion prob.) outperforms, in most cases, its deterministic variant (QLFusion det.) regardless of the chosen loss function. However, the proposed probabilistic loss gives the probabilistic model a sizeable advantage over its variant trained with the MSE loss, especially on the TREC collections. Also, the effect of the proposed loss function on the deterministic QLFusion model is not as large as on its probabilistic variant. Differently from the other experiments, on the CLEF-2018 dataset we observe that the effect of the loss function used is negligible across the model variants. If we consider the variance of the models' predictions, an indicator of how much our models overfit or underfit the data, we notice that the probabilistic loss favors a larger variance in the model's predictions, closer to the variance of the true proportions of relevant documents across queries, decreasing the tendency to underfit the training data that we observed with the deterministic model and/or the MSE loss function.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions</title>
      <p>In this work, we proposed a Quantification Learning-based approach for rank fusion (QLFusion). The model relies on a Bayesian Neural Network (BNN) to predict the number of relevant documents in a ranked list, information that is then used to rescale the relevance scores of documents in different ranked lists and fuse them. We trained the proposed BNN using a custom probabilistic loss function which allowed us to improve the generalization power of the model. The proposed approach outperformed the considered baselines on three experimental collections. In the future, we plan to investigate new features to add to our QLFusion approach to further improve its performance.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>Work supported by the ExaMode project, as part of the European Union Horizon 2020 program under Grant Agreement no. 825292.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[1] G. Forman, Counting positives accurately despite inaccurate classification, in: Proc. of ECIR 2005, Springer, 2005, pp. 564-575.</p>
      <p>[2] A. Moreo, F. Sebastiani, Tutorial: Supervised Learning for Prevalence Estimation, in: Proc. of the 13th International Conference on Flexible Query Answering Systems, FQAS 2019, volume 11529 of Lecture Notes in Computer Science, Springer, 2019, pp. 13-17.</p>
      <p>[3] W. Gao, F. Sebastiani, Tweet Sentiment: From Classification to Quantification, in: Proc. of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM '15, ACM, New York, NY, USA, 2015, pp. 97-104.</p>
      <p>[4] P. González, A. Castaño, N. Chawla, J. Coz, A review on quantification learning, ACM Computing Surveys (CSUR) 50 (2017) 74.</p>
      <p>[5] R. Manmatha, T. Rath, F. Feng, Modeling score distributions for combining the outputs of search engines, in: Proc. of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001, pp. 267-275.</p>
      <p>[6] D. Lillis, F. Toolan, R. Collier, J. Dunnion, ProbFuse: a probabilistic approach to data fusion, in: Proc. of SIGIR 2006, 2006, pp. 139-146.</p>
      <p>[7] R. Neal, Bayesian learning for neural networks, volume 118, Springer Science &amp; Business Media, 2012.</p>
      <p>[8] C. Blundell, J. Cornebise, K. Kavukcuoglu, D. Wierstra, Weight uncertainty in neural networks, arXiv preprint arXiv:1505.05424 (2015).</p>
      <p>[9] A. Balan, V. Rathod, K. Murphy, M. Welling, Bayesian dark knowledge, in: Advances in Neural Information Processing Systems, 2015, pp. 3438-3446.</p>
      <p>[10] A. Graves, Practical variational inference for neural networks, in: Advances in Neural Information Processing Systems, 2011, pp. 2348-2356.</p>
      <p>[11] J. Shaw, E. Fox, Combination of multiple searches, in: Proc. of TREC 1994, 1994, pp. 105-108.</p>
      <p>[12] D. Harman, Overview of the third text retrieval conference (TREC-3), volume 500, DIANE Publishing, 1995.</p>
      <p>[13] D. Harman, Overview of the fifth text retrieval conference (TREC-5), 1997.</p>
      <p>[14] E. Kanoulas, D. Li, L. Azzopardi, R. Spijker, CLEF 2018 technologically assisted reviews in empirical medicine, CEUR Workshop Proceedings, volume 2125, 2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[15] J. Guo, Y. Fan, L. Pang, L. Yang, Q. Ai, H. Zamani, et al., A deep look into neural ranking models for information retrieval, Information Processing &amp; Management (2019) 102067.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[16] S. Marchesin, A. Purpura, G. Silvello, Focal elements of neural information retrieval models. An outlook through a reproducibility study, Information Processing &amp; Management (2019) 102109.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[17] L. Si, J. Callan, Using sampled data and regression to merge search engine results, in: Proc. of SIGIR 2002, 2002, pp. 19-26.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[18] J. Lei Ba, J. Kiros, G. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450 (2016).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[19] Y. Gal, Z. Ghahramani, Bayesian convolutional neural networks with Bernoulli approximate variational inference, arXiv preprint arXiv:1506.02158 (2015).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[20] Y. Wen, P. Vicol, J. Ba, D. Tran, R. Grosse, Flipout: efficient pseudo-independent weight perturbations on mini-batches, arXiv preprint arXiv:1803.04386 (2018).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>