<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEBD</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Query Performance Prediction for Neural IR</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guglielmo Faggioli</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thibault Formal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Marchesin</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stéphane Clinchant</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Piwowarski</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Naver Labs Europe</institution>
          ,
          <addr-line>Meylan</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sorbonne Université, ISIR</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Padova</institution>
          ,
          <addr-line>Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>31</volume>
      <fpage>02</fpage>
      <lpage>05</lpage>
      <abstract>
        <p>The evaluation of Information Retrieval (IR) relies on human-made relevance assessments whose collection is time-consuming and expensive. To alleviate this limitation, Query Performance Prediction (QPP) models have been developed to estimate system performance without relying on human-made relevance judgements. QPP models have been applied to traditional IR methods with varying success. The shift towards semantic signals thanks to Neural IR (NIR) models has changed the retrieval paradigm. In this study, we investigate the ability of current QPP models to predict the performance of NIR systems. We evaluate seven traditional IR systems and seven NIR (BERT-based) approaches, as well as nineteen QPPs, on two collections: Deep Learning '19 and Robust '04. Our results highlight that QPPs perform significantly worse on NIR systems. When semantic signals are prevalent, such as in passage retrieval, their performance on neural models decreases by up to 10% compared to bag-of-words approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
QPP models can be used, for instance, to select the best-performing system for a given query, help users reformulate their needs,
or identify pathological queries that require manual intervention from the system administrators.
In other words, the need for QPP still holds for NIR methods. Among the plethora of available
QPP methods, most rely on lexical aspects of the query and the collection. Such
approaches have been devised, tested, and evaluated in predicting the performance of lexical
bag-of-words IR systems – from now on referred to as Traditional IR (TIR) – with various
degrees of success. Recent advances in Natural Language Processing (NLP) led to the advent
of Pre-trained Language Model (PLM)-based IR systems, which shifted the retrieval paradigm from traditional approaches
based on lexical matching to exploiting contextualized semantic signals – thus alleviating the
semantic gap problem. To ease the readability throughout the rest of the manuscript, with an
abuse of notation, we use the more general term NIR to explicitly refer to first-stage IR systems
based on BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. At the current time, no large-scale work has been devoted to assessing
whether traditional QPP models can be used for NIR systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To address such a gap, we
compare the performance of nineteen QPP methods applied to seven TIR systems,
with those achieved on seven state-of-the-art first-stage NIR approaches based on PLMs. We
consider both pre- and post-retrieval QPPs, and include in our analyses post-retrieval QPP
models that exploit lexical or semantic signals to compute their predictions. To instantiate
our analyses on different scenarios, we consider two widely adopted experimental collections:
Robust ‘04 and Deep Learning ‘19. As contributions:
• we apply and evaluate several state-of-the-art QPP approaches to multiple NIR retrievers
based on BERT, on Robust ‘04 and Deep Learning ‘19;
• we show that currently available QPPs perform reasonably well when applied to TIR
systems, while they fail to properly predict the performance of NIR systems, even on
NIR-oriented collections;
• we highlight how such a decrease in QPP performance is particularly prominent on
queries where TIR and NIR performances differ the most.
      </p>
      <p>The remainder of this paper is organized as follows: Section 2 outlines the main related
endeavours. Section 3 details our methodology and experimental setting. Empirical results are
reported in Section 4. Section 5 draws the conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Large PLMs like BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have given birth to a new generation of NIR systems. Indeed,
dense representations based on contextualized embeddings, combined with approximate
nearest neighbors algorithms, have proven to be effective and efficient first-stage retrieval
approaches [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7 ref8 ref9">4, 5, 6, 7, 8, 9</xref>
        ]. In the meantime, another research branch improved lexical models by
learning contextualized term weights [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13">10, 11, 12, 13</xref>
        ], query or document expansion [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], or
both mechanisms jointly [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]. This new wave of NIR systems demonstrates state-of-the-art
results on several datasets [
        <xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>
        ].
      </p>
      <p>
        A well-known problem linked to IR evaluation is the variation in performance achieved by
different IR systems, even on a single query [
        <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
        ]. To account for it, a large body of work
has focused on predicting the performance that a system would achieve for a given query,
using QPP models. Such models are typically divided into pre- and post-retrieval predictors.
Traditional pre-retrieval QPPs leverage statistics on the query terms occurrences [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. For
example, SCQ [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], VAR [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and IDF [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ] combine query tokens’ occurrence indicators,
such as Collection Frequency (CF) and Inverse Document Frequency (IDF), to compute their
performance prediction score. Post-retrieval QPPs exploit the results of IR models for the given
query [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Among them, Clarity [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] compares the language model of the top retrieved
documents with the one of the entire corpus. NQC [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], WIG [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] and SMV [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] exploit the
retrieval scores distribution for the top-ranked documents to compute their predictive score.
Finally, Utility Estimation Framework (UEF) [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] serves as a general framework that can be
instantiated with many of the mentioned predictors, pre-retrieval ones included. We further
divide QPP models into traditional and neural approaches. Among neural predictors, one of the
first approaches is NeuralQPP [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] which computes its predictions by combining semantic and
lexical signals using a feed-forward neural network. A similar approach for Question Answering
is NQA-QPP [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], which also relies on three neural components but, unlike NeuralQPP, exploits
BERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to embed tokens semantics. Similarly, BERT-QPP [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] encodes semantics via BERT,
but directly fine-tunes it to predict query performance based on the first retrieved document.
Only a little work has been done to apply traditional QPP on NIR models [
        <xref ref-type="bibr" rid="ref32 ref34">32, 34</xref>
        ]. Similarly,
neural QPP methods – which model the semantic interactions between query and document
terms – have been mostly designed for and evaluated on TIR models. Hence, there is an urgent
need to deepen the evaluation of QPP on state-of-the-art NIR models to understand where we
are, what the challenges are, and which directions are most promising.
      </p>
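<p>To make the score-based intuition behind post-retrieval predictors concrete, the following is a minimal sketch of an NQC-style predictor (the function name, the plain standard deviation, and the normalization by the query's corpus-level score are our assumptions; see the original formulation cited above for the exact definition):</p>

```python
import math

def nqc(top_k_scores, corpus_score):
    """NQC-style predictor: standard deviation of the top-k retrieval
    scores, normalized by the query's corpus-level score. Higher values
    are read as a more 'committed' (better-performing) ranking."""
    mean = sum(top_k_scores) / len(top_k_scores)
    var = sum((s - mean) ** 2 for s in top_k_scores) / len(top_k_scores)
    return math.sqrt(var) / abs(corpus_score)
```

<p>For instance, a flat list of retrieval scores yields 0, signaling low query commitment, while a spread-out score distribution yields a higher prediction.</p>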
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Methodology and Setup</title>
      <p>
To assess the effect induced by NIR systems on QPP performance, we employ the following
ANalysis Of VAriance (ANOVA) models, using sARE [
        <xref ref-type="bibr" rid="ref35 ref36">35, 36</xref>
        ] as a performance measure. The
first model, dubbed MD1, aims at explaining the sARE performance given the predictor, the
type of IR model, and the collection. Therefore, we define it as follows:

y_ijk = μ + π_i + ρ_j + κ_k + (ρκ)_jk + ε_ijk, (MD1)

where μ is the grand mean, π_i is the effect of the i-th predictor, ρ_j represents the type of IR
model (either TIR or NIR), κ_k stands for the effect of the k-th collection on QPP’s performance,
(ρκ)_jk describes how much the type of run and the collection interact, and ε_ijk is the associated
error. Secondly, since we are interested in determining the effect of different predictors in
interaction with each query, we define a second model, dubbed MD2, that is formulated as follows:

y_ijk = μ + π_i + ρ_j + τ_k + (πρ)_ij + (πτ)_ik + (ρτ)_jk + ε_ijk. (MD2)

Differently from MD1, we apply MD2 to each collection separately. Therefore, having a single
collection, we replace the effect of the collection with τ_k, the effect of the k-th topic. Furthermore,
the model also includes all the first-order interactions. The Strength of Association (SOA) [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]
is assessed using ω². As a rule-of-thumb, ω² &lt; 6% indicates a small SOA, 6% ≤ ω² &lt; 14% is a
medium-sized effect, while ω² ≥ 14% represents a large-sized effect.
      </p>
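<p>To illustrate how ω² is obtained from the ANOVA decomposition, the following one-way sketch computes it from sums of squares (a deliberate simplification of the multi-factor models MD1 and MD2 above; the formula itself is the standard one for ω²):</p>

```python
def omega_squared(groups):
    """Omega-squared SOA for a one-way ANOVA layout: groups is a list of
    lists of observations, one inner list per factor level."""
    all_obs = [x for g in groups for x in g]
    n = len(all_obs)
    grand_mean = sum(all_obs) / n
    ss_total = sum((x - grand_mean) ** 2 for x in all_obs)
    ss_between = sum(
        len(g) * ((sum(g) / len(g)) - grand_mean) ** 2 for g in groups
    )
    ss_error = ss_total - ss_between
    df_between = len(groups) - 1
    ms_error = ss_error / (n - len(groups))
    # omega^2 = (SS_factor - df_factor * MS_error) / (SS_total + MS_error)
    return (ss_between - df_between * ms_error) / (ss_total + ms_error)
```

<p>Perfectly separated factor levels give ω² close to 1 (a large effect), while levels that explain no variance give a value at or below 0.</p>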
      <p>
        Our analyses focus on Robust ‘04 [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], and TREC Deep Learning 2019 Track (Deep Learning
‘19) [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] collections. These collections have 249 and 43 topics, respectively, and are based
on the TIPSTER and MS MARCO passages corpora. Robust ‘04 is one of the most used collections
to test lexical approaches, while providing a reliable benchmark for NIR models [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ]. Deep
Learning ‘19 concerns passage retrieval from natural questions, making retrieval harder for
TIR approaches, while NIR systems tend to have an edge in retrieving relevant documents.
      </p>
      <p>
        As reference points, we consider seven TIR methods: Language Model with Dirichlet (LMD) and
Jelinek–Mercer (LMJM) smoothing [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ], BM25, vector space model [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ] (TFIDF), InExpB2 [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ]
(InEB2), Axiomatic F1-EXP [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ] (AxF1e), and Divergence From Independence (DFI) [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ]. For
the NIR methods, we focus on BERT-based first-stage models. We consider state-of-the-art
models from the three main families of NIR models. Among dense models, we consider: i) a “standard”
bi-encoder (bi) trained with negative log-likelihood; ii) TAS-B [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (bi-tasb), whose training
relies on topic sampling and knowledge distillation; and iii) CoCondenser [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (bi-cc)
and Contriever [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (bi-ct), which are based on contrastive pre-training. We also consider
two models from the sparse family: SPLADE [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] (sp) with default training strategy, and its
improved version SPLADE++ [
        <xref ref-type="bibr" rid="ref16 ref46">16, 46</xref>
        ] (sp++) based on distillation, hard-negative mining and
pre-training. We finally consider the late-interaction ColBERTv2 [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] (colb2). Models are
fine-tuned on the MS MARCO passage dataset and applied in a zero-shot manner on Robust
‘04 [
        <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
        ]. We focus our analyses on Normalized Discounted Cumulated Gain (nDCG) with
cutoff 10, as it is consistently employed across NIR benchmarks.
      </p>
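<p>For reference, nDCG@10 can be sketched as follows (we use linear gains with the standard log2 discount; the choice of gain function is an assumption, as some evaluation tools use exponential gains):</p>

```python
import math

def ndcg_at_k(ranked_rels, all_rels, k=10):
    """nDCG@k: ranked_rels are graded relevance labels of the returned
    documents in rank order; all_rels are the labels of the relevant
    documents, used to build the ideal ranking."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(all_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0
```

<p>A ranking that places the most relevant documents first scores 1.0; retrieving nothing relevant in the top k scores 0.0.</p>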
      <p>
        Concerning QPP models, we consider nine pre-retrieval predictors: Simplified query Clarity Score
(SCS) [
        <xref ref-type="bibr" rid="ref47">47</xref>
        ], Similarity Collection-Query (SCQ) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], VAR [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], IDF and Inverse Collection Term
Frequency (ICTF) [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ]. For SCS, we use the sum aggregation, while for the others we use max
and mean, which empirically produce the best results. In terms of post-retrieval QPP models,
our experiments are based on Clarity [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], Normalized Query Commitment (NQC) [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], Score
Magnitude and Variance (SMV) [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], Weighted Information Gain (WIG) [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] and their UEF [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]
counterparts. Among post-retrieval predictors, we also include a supervised approach,
BERT-QPP [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], using both bi-encoder (bi) and cross-encoder (ce) formulations. We train BERT-QPP
for each IR system on the MS MARCO training set, as proposed in [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. Similarly to what is
done for NIR models, we apply BERT-QPP models on Robust ‘04 queries in a zero-shot manner.
      </p>
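<p>As an illustration of the pre-retrieval setup, the sketch below computes an IDF-based prediction with the max and mean aggregations used above (the doc_freq dictionary, the log formulation, and the fallback for unseen terms are our assumptions):</p>

```python
import math

def idf_predictor(query_terms, doc_freq, num_docs, agg="max"):
    """IDF pre-retrieval predictor: aggregate per-term IDF values.
    doc_freq maps a term to its document frequency; unseen terms are
    treated as appearing in one document (an arbitrary choice)."""
    idfs = [math.log(num_docs / doc_freq.get(t, 1)) for t in query_terms]
    return max(idfs) if agg == "max" else sum(idfs) / len(idfs)
```

<p>Queries dominated by rare (high-IDF) terms receive higher predicted performance, reflecting the lexical intuition behind these predictors.</p>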
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <p>Figures 1a and 1b refer, respectively, to the Robust ‘04 and Deep Learning ‘19 collections and report
the Pearson’s correlation between prediction scores and nDCG@10, for both TIR and NIR
runs. For Robust ‘04, we notice that pre-retrieval (top) predictors (mean correlation: 15.9%)
tend to perform 52.3% worse than post-retrieval ones (bottom) (mean correlation: 30.2%). The
phenomenon is more evident (darker colors) for NIR runs (right) than TIR ones (left). Pre-retrieval
predictors fail in predicting the performance of NIR systems (mean correlation: 6.2% vs
25.6% for TIR), while in general we notice that post-retrieval predictors tend to perform similarly
on TIR and NIR (34.5% vs 32.3%) – with some exceptions. For instance, for bi, post-retrieval
predictors either perform extremely well or completely fail. This happens particularly on Clarity,
NQC, and their UEF counterparts. Note that bi is the worst performing approach on Robust ‘04.
(We use the BERT-QPP implementation provided at https://github.com/Narabzad/BERTQPP; other models achieve highly similar results.)</p>
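<p>The correlations discussed above can be reproduced, given per-query prediction scores and per-query nDCG@10 values, with a plain Pearson computation (a standard formula; variable names are illustrative):</p>

```python
import math

def pearson_r(pred_scores, ndcg_values):
    """Pearson correlation between QPP prediction scores and per-query
    effectiveness values (here, nDCG@10)."""
    n = len(pred_scores)
    mx = sum(pred_scores) / n
    my = sum(ndcg_values) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(pred_scores, ndcg_values))
    sx = math.sqrt(sum((x - mx) ** 2 for x in pred_scores))
    sy = math.sqrt(sum((y - my) ** 2 for y in ndcg_values))
    return cov / (sx * sy)
```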
      <p>
        The patterns observed for Robust ‘04 hold only partially on Deep Learning ‘19. For example,
we notice again that pre-retrieval predictors (mean correlation: 14.7%) perform 58.3% worse than
post-retrieval ones (mean correlation: 35.3%). On the contrary, the difference in performance is
far more evident between NIR and TIR. On TIR runs, almost all predictors perform particularly
well (mean correlation: 38.1%) – even better than on Robust ‘04 collection. Conversely, on NIR
the performance is overall lower (13.1%) and relatively more uniform between pre- (5.4%) and
post-retrieval (19.9%) models. The maximum correlation achieved by pre-retrieval predictors
for NIR on Deep Learning ‘19 is much higher than the one achieved on Robust ‘04, especially
for bi-ct, sp, and bi-tasb runs. On the other hand, post-retrieval predictors perform worse
than on Robust ‘04. The only exception to this pattern is again represented by bi, on
which some post-retrieval predictors, namely WIG, UEFWIG, and UEFClarity work surprisingly
well. Interestingly, on Robust ‘04, post-retrieval QPPs achieve, on average, top performance
on the late interaction model (colb2), followed by sparse approaches (sp and sp++). Finally,
dense approaches are those where QPPs perform the worst. In this sense, the performance that
QPP methods achieve on NIR systems seems to correlate with the importance these systems
give to lexical signals. BERT-QPP shows a trend similar to other post-retrieval predictors
on Deep Learning ‘19 (42.3% mean correlation against 52.9%, respectively) as far as TIR is concerned,
with performance in line with the one reported in [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. This is exactly the setting where
BERT-QPP has been devised and tested. If we focus on Deep Learning ‘19 and NIR systems, its
performance (mean correlation: 4.5%) is far lower than those of other post-retrieval predictors
(mean correlation without BERT-QPP: 23.8%). Finally, its performance on Robust ‘04 – applied
in zero-shot – is considerably lower compared to other post-retrieval approaches. To further
statistically quantify the phenomena observed in the previous subsection, we apply MD1 to our
data, considering both collections at once. From a quantitative standpoint, we notice that all the
factors included in the model are statistically significant (p-value &lt; 10⁻⁴). In terms of SOA,
the collection factor has a small effect (0.02%). The run type, on the other hand, impacts for
ω² = 0.48%. Finally, the interaction between the collection and run type, although statistically
significant, has a small impact on the performance (ω² = 0.05%): in both collections QPPs
perform better on TIR models. All factors are significant but have small-size effects. This is in
contrast with what was observed for the performance of IR systems [
        <xref ref-type="bibr" rid="ref21 ref48">21, 48</xref>
        ], where most of
the SOA range between medium to large. Nevertheless, it is in line with what was observed by
Faggioli et al. [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] for the performance of QPP methods, who showed that all the factors besides
the topic have small-to-medium effects.
      </p>
      <p>
        We are now interested in breaking down the performance of the predictors according to the
collection and type of run. Figure 2 reports the average performance (measured with sMARE,
the lower the better) for QPPs applied on NIR or TIR runs over different collections. The
performance achieved by predictors on NIR models is on average worse than the one achieved
on TIR runs. QPP models perform better on TIR than NIR on both collections: this explains the
small interaction effect between collections and run types. There is no statistical difference
between the performance achieved by QPPs applied to TIR models when considering Deep
Learning ‘19 and Robust ‘04 – the confidence intervals are overlapping. This is in contrast
with what happens when considering NIR models: QPP approaches applied on Deep Learning
‘19 perform far worse than on Robust ‘04. While on average we will be less
satisfied by QPP predictors applied to NIR regardless of the type of collection, there might be
good-performing predictors also for NIR systems. To verify this hypothesis, we apply MD2 to
each collection separately, and measure what happens to each predictor individually. Table 1
reports the p-values and ω² SOA for the factors included in MD2, while Figure 3 depicts the
phenomena visually. We observe that concerning Deep Learning ‘19, the run type (TIR or NIR)
is significant, while the interaction between the predictor and the run type is small: indeed
predictors always perform better on TIR runs than on NIR ones. The only model that behaves
slightly differently is Clarity, with close performance for both classes of runs – this can be
explained by the fact that Clarity is overall the worst-performing predictor. Notice that the
best predictor on TIR runs – NQC – performs almost 10% worse on NIR ones. Finally, we notice
a large-size interaction between topics and QPP models. This indicates that whether a model
will be better than another depends on the topic considered. An almost identical pattern was
observed also in [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ]. Therefore, to improve QPP’s generalizability, it is important to address
challenges caused by differences between NIR and TIR, as well as to account for the large variance
introduced by topics.
      </p>
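<p>The sMARE values reported in Figure 2 can be sketched as follows (tie handling via a fixed sort order is our simplification; the papers cited in Section 3 give the formal definition of sARE/sMARE):</p>

```python
def smare(pred_scores, measure_values):
    """sMARE: mean over queries of the scaled Absolute Rank Error, i.e.
    |rank induced by the predictor - rank induced by the IR measure| / |Q|.
    Lower is better; 0 means the predictor orders the queries exactly as
    the effectiveness measure does."""
    n = len(pred_scores)
    def ranks(values):
        order = sorted(range(n), key=lambda i: values[i], reverse=True)
        r = [0] * n
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rp = ranks(pred_scores)
    rm = ranks(measure_values)
    return sum(abs(a - b) for a, b in zip(rp, rm)) / (n * n)
```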
      <p>If we consider Robust ‘04, Figure 3 shows that predictors’ performances are much more similar
for TIR and NIR runs compared to Deep Learning ‘19. This is further highlighted by the far
smaller ω² for run type on Robust ‘04 in Table 1 – 0.11% against 4.35%. The widely different
pattern between Deep Learning ‘19 and Robust ‘04 suggests that current QPPs are doomed
to fail when used to predict the performance of IR approaches that learned the semantics of
a collection – which is the case for Deep Learning ‘19 that was used to fine-tune the models.
Current QPPs work better on IR approaches that rely on lexical clues. Such approaches include
both TIR models and NIR models applied in a zero-shot fashion, as it is the case for Robust ‘04.
Thus, QPP models are expected to fail where NIR models behave diferently from the TIR ones.
This puts at stake one of the major opportunities provided by QPP: if we fail in predicting the
performance of NIR models where they behave diferently from TIR ones, then a QPP cannot
be safely used to carry out model selection. To further investigate this, we select from Robust
‘04 the 25% of queries that are most “semantically defined” and rerun MD2 on the new set of
topics. We call “semantically defined” those queries where NIR systems behave, on average, oppositely
w.r.t. the TIR ones, either failing or succeeding at retrieving relevant documents.</p>
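<p>One way to operationalize this selection can be sketched as follows (a simplified criterion based on the absolute NIR-TIR effectiveness gap; the exact notion of "opposite" behavior used in the paper may be stricter, and all names are illustrative):</p>

```python
def semantically_defined(queries, ndcg_tir, ndcg_nir, fraction=0.25):
    """Return the given fraction of queries where average NIR and TIR
    effectiveness (e.g. nDCG@10) differ the most, in either direction."""
    gaps = {q: abs(ndcg_nir[q] - ndcg_tir[q]) for q in queries}
    k = max(1, int(len(queries) * fraction))
    return sorted(queries, key=lambda q: gaps[q], reverse=True)[:k]
```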
      <p>
        Figure 4a shows the performance on topics that maximize the difference between TIR and
NIR and can be considered as more “semantically defined” [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ]. If we consider the results of
applying MD2 on this set of topics (we focus on post-retrieval predictors; similar observations
hold for pre-retrieval ones), we notice that compared to Robust ‘04 (Table 1, last column)
the effect of the different QPPs increases to 2.29%: on these topics, there is more difference
between different predictors. The interaction between predictors and run types grows from
0.30% to 0.91%. Furthermore, the effect of the run type grows from 0.11% to 0.67% – 6 times
bigger. On the selected topics, arguably those where a QPP is the most useful to help select the
right model, using NIR systems has a negative impact (6 times bigger) on the performance of
QPPs. Figure 4b, compared to Figure 3b, is more similar to Figure 3a – using only topics that
are highly semantically defined, we get similar patterns as those observed for Deep Learning
‘19 in Figure 3a. The only methods that behave differently are the BERT-QPP approaches, whose
performance is better on NIR runs than on TIR ones, but which are the worst approaches in terms of
predictive capabilities for both run types. In this sense, even though the contribution of
semantic signals appears to be highly important to define new models with improved performance
in the NIR setting, it does not suffice to compensate for current QPPs’ limitations.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>This study examined the applicability of current QPPs on first-stage NIR models based on PLMs.
The study evaluated 19 diverse QPP models on seven TIR and seven first-stage NIR methods
based on BERT, applied to the Robust ‘04 and Deep Learning ‘19 collections. We observe that
indeed QPPs are effective in predicting TIR systems’ performance but fail in dealing with NIR
ones. Moreover, the study found that QPPs tend to fail on those topics where NIR and TIR
models differ the most, which impairs the possibility of using QPP models to choose between
NIR and TIR approaches where it is most needed. Furthermore, semantic QPP approaches such
as BERT-QPP do not solve the problem and work properly only on lexical IR systems. These
results highlight the need for QPPs specifically tailored to neural IR.</p>
      <p>Future work will consider query variations, to understand the impact on QPP of changing how a topic
is formulated, and the development of QPP methods explicitly designed for NIR models
that take into account the large variance introduced by topics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Formal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          <article-title>Query Performance Prediction for Neural IR: Are We There Yet?</article-title>
          , in: J.
          <string-name>
            <surname>Kamps</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Maistro</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Joho</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <string-name>
            <surname>Kruschwitz</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Caputo (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland,
          <year>2023</year>
          , pp.
          <fpage>232</fpage>
          -
          <lpage>248</lpage>
          . URL: https://doi.org/10.1007/978-3-031-28244-7_15. doi:10.1007/978-3-031-28244-7_15.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis</article-title>
          , MN, USA, June 2-7,
          <year>2019</year>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Raiber</surname>
          </string-name>
          , QPP++ 2023:
          <article-title>Query-Performance Prediction and Its Evaluation in New Tasks</article-title>
          , in:
          <string-name><given-names>J.</given-names> <surname>Kamps</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Goeuriot</surname></string-name>
          ,
          <string-name><given-names>F.</given-names> <surname>Crestani</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Maistro</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Joho</surname></string-name>
          ,
          <string-name><given-names>B.</given-names> <surname>Davis</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Gurrin</surname></string-name>
          ,
          <string-name><given-names>U.</given-names> <surname>Kruschwitz</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Caputo</surname></string-name>
          (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland,
          <year>2023</year>
          , pp.
          <fpage>388</fpage>
          -
          <lpage>391</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          , Sentence-BERT:
          <article-title>Sentence Embeddings using Siamese BERT-Networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019</source>
          , Hong Kong, China, November 3-7,
          <year>2019</year>
          , pp.
          <fpage>3980</fpage>
          -
          <lpage>3990</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oguz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <article-title>Dense Passage Retrieval for Open-Domain Question Answering</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6769</fpage>
          -
          <lpage>6781</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Overwijk</surname>
          </string-name>
          ,
          <article-title>Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval</article-title>
          ,
          <source>in: 9th International Conference on Learning Representations, ICLR 2021</source>
          , Virtual Event, Austria, May 3-7,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <article-title>Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling</article-title>
          ,
          <source>in: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Virtual Event, Canada, July 11-15,
          <year>2021</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval</article-title>
          ,
          <source>in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022</source>
          , Dublin, Ireland, May 22-27,
          <year>2022</year>
          , pp.
          <fpage>2843</fpage>
          -
          <lpage>2853</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Caron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <article-title>Towards Unsupervised Dense Information Retrieval with Contrastive Learning</article-title>
          ,
          <source>CoRR abs/2112.09118</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Context-Aware Term Weighting For First Stage Passage Retrieval</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020</source>
          , Virtual Event, China, July 25-30,
          <year>2020</year>
          , pp.
          <fpage>1533</fpage>
          -
          <lpage>1536</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <article-title>Learning Passage Impacts for Inverted Indexes</article-title>
          ,
          <source>in: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Virtual Event, Canada, July 11-15,
          <year>2021</year>
          , pp.
          <fpage>1723</fpage>
          -
          <lpage>1727</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuccon</surname>
          </string-name>
          , TILDE:
          <article-title>Term Independent Likelihood moDEl for Passage Reranking</article-title>
          ,
          <source>in: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Virtual Event, Canada, July 11-15,
          <year>2021</year>
          , pp.
          <fpage>1483</fpage>
          -
          <lpage>1492</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques</article-title>
          ,
          <source>CoRR abs/2106.14807</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Document Expansion by Query Prediction</article-title>
          ,
          <source>CoRR abs/1904.08375</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Formal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          , SPLADE:
          <article-title>Sparse Lexical and Expansion Model for First Stage Ranking</article-title>
          ,
          <source>in: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Virtual Event, Canada, July 11-15,
          <year>2021</year>
          , pp.
          <fpage>2288</fpage>
          -
          <lpage>2292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Formal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lassance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          ,
          <article-title>From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective</article-title>
          ,
          <source>in: SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Madrid, Spain, July 11-15,
          <year>2022</year>
          , pp.
          <fpage>2353</fpage>
          -
          <lpage>2359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bajaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>MS MARCO: A Human Generated MAchine Reading COmprehension Dataset</article-title>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rücklé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models</article-title>
          ,
          <source>in: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021</source>
          , December
          <year>2021</year>
          , virtual
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Saad-Falcon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022</source>
          , Seattle, WA, United States, July 10-15,
          <year>2022</year>
          , pp.
          <fpage>3715</fpage>
          -
          <lpage>3734</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yom-Tov</surname>
          </string-name>
          ,
          <source>Estimating the Query Difficulty for Information Retrieval</source>
          , Morgan &amp; Claypool Publishers,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurland</surname>
          </string-name>
          ,
          <article-title>Topic Difficulty: Collection and Query Formulation Effects</article-title>
          ,
          <source>ACM Trans. Inf. Syst.</source>
          <volume>40</volume>
          (
          <year>2022</year>
          )
          <fpage>19:1</fpage>
          -
          <lpage>19:36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>de Jong</surname>
          </string-name>
          ,
          <article-title>A survey of pre-retrieval query performance predictors</article-title>
          ,
          <source>in: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008</source>
          , Napa Valley, California, USA, October 26-30,
          <year>2008</year>
          , pp.
          <fpage>1419</fpage>
          -
          <lpage>1420</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsegay</surname>
          </string-name>
          ,
          <article-title>Effective Pre-retrieval Query Performance Prediction Using Similarity and Variability Evidence</article-title>
          ,
          <source>in: Advances in Information Retrieval, 30th European Conference on IR Research, ECIR 2008</source>
          , Glasgow, UK, March 30-April 3,
          <year>2008</year>
          , Proceedings, volume
          <volume>4956</volume>
          , pp.
          <fpage>52</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cronen-Townsend</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>A Language Modeling Framework for Selective Query Expansion</article-title>
          ,
          <source>Technical Report</source>
          , CIIR, UMass,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Turpin</surname>
          </string-name>
          ,
          <article-title>Query association surrogates for Web search</article-title>
          ,
          <source>J. Assoc. Inf. Sci. Technol.</source>
          <volume>55</volume>
          (
          <year>2004</year>
          )
          <fpage>637</fpage>
          -
          <lpage>650</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cronen-Townsend</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Predicting query performance</article-title>
          ,
          <source>in: SIGIR 2002: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 11-15</source>
          ,
          <year>2002</year>
          , Tampere, Finland,
          <year>2002</year>
          , pp.
          <fpage>299</fpage>
          -
          <lpage>306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shtok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <article-title>Predicting Query Performance by Query-Drift Estimation</article-title>
          ,
          <source>in: Advances in Information Retrieval Theory, Second International Conference on the Theory of Information Retrieval, ICTIR 2009</source>
          , Cambridge, UK, September 10-12,
          <year>2009</year>
          , Proceedings, volume
          <volume>5766</volume>
          , pp.
          <fpage>305</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Query performance prediction in web search environments</article-title>
          ,
          <source>in: SIGIR 2007: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , Amsterdam, The Netherlands, July 23-27,
          <year>2007</year>
          , pp.
          <fpage>543</fpage>
          -
          <lpage>550</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Query Performance Prediction By Considering Score Magnitude and Variance Together</article-title>
          ,
          <source>in: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM 2014</source>
          , Shanghai, China, November 3-7,
          <year>2014</year>
          , pp.
          <fpage>1891</fpage>
          -
          <lpage>1894</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shtok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <article-title>Using statistical decision theory and relevance models for query-performance prediction</article-title>
          ,
          <source>in: Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2010</source>
          , Geneva, Switzerland, July 19-23,
          <year>2010</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <article-title>Neural Query Performance Prediction using Weak Supervision from Multiple Signals</article-title>
          ,
          <source>in: The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hashemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Performance Prediction for Non-Factoid Question Answering</article-title>
          ,
          <source>in: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR 2019, Santa Clara, CA, USA, October 2-5, 2019</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>N.</given-names>
            <surname>Arabzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khodabakhsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bagheri</surname>
          </string-name>
          ,
          <article-title>BERT-QPP: Contextualized Pre-trained transformers for Query Performance Prediction</article-title>
          ,
          <source>in: CIKM '21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1-5, 2021</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2857</fpage>
          -
          <lpage>2861</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>S.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Greene</surname>
          </string-name>
          ,
          <article-title>A Relative Information Gain-Based Query Performance Prediction Framework with Generated Query Variants</article-title>
          ,
          <source>ACM Trans. Inf. Syst.</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Zendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          ,
          <article-title>An Enhanced Evaluation Framework for Query Performance Prediction</article-title>
          ,
          <source>in: Advances in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Proceedings, Part I</source>
          , volume
          <volume>12656</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Zendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          ,
          <article-title>sMARE: a new paradigm to evaluate and understand query performance prediction methods</article-title>
          ,
          <source>Inf. Retr. J.</source>
          <volume>25</volume>
          (
          <year>2022</year>
          )
          <fpage>94</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          ,
          <source>ANOVA and ANCOVA: A GLM Approach</source>
          , John Wiley &amp; Sons,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <article-title>The TREC robust retrieval track</article-title>
          ,
          <source>SIGIR Forum</source>
          <volume>39</volume>
          (
          <year>2005</year>
          )
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <article-title>TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime</article-title>
          ,
          <source>in: SIGIR '21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2369</fpage>
          -
          <lpage>2375</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Can Old TREC Collections Reliably Evaluate Modern Neural Retrieval Models?</article-title>
          ,
          <source>CoRR abs/2201.11086</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <article-title>Statistical Language Models for Information Retrieval: A Critical Review</article-title>
          ,
          <source>Found. Trends Inf. Retr.</source>
          <volume>2</volume>
          (
          <year>2008</year>
          )
          <fpage>137</fpage>
          -
          <lpage>213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <article-title>Term-Weighting Approaches in Automatic Text Retrieval</article-title>
          ,
          <source>Inf. Process. Manag.</source>
          <volume>24</volume>
          (
          <year>1988</year>
          )
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          ,
          <article-title>Probabilistic Models of Information Retrieval based on measuring the Divergence From Randomness</article-title>
          ,
          <source>ACM Trans. Inf. Syst.</source>
          <volume>20</volume>
          (
          <year>2002</year>
          )
          <fpage>357</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <article-title>An exploration of axiomatic approaches to information retrieval</article-title>
          ,
          <source>in: SIGIR 2005: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 15-19, 2005</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>480</fpage>
          -
          <lpage>487</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kocabas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. T.</given-names>
            <surname>Dinçer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Karaoglan</surname>
          </string-name>
          ,
          <article-title>A nonparametric term weighting method for information retrieval based on measuring the divergence from independence</article-title>
          ,
          <source>Inf. Retr.</source>
          <volume>17</volume>
          (
          <year>2014</year>
          )
          <fpage>153</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>T.</given-names>
            <surname>Formal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lassance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Piwowarski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clinchant</surname>
          </string-name>
          ,
          <article-title>SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval</article-title>
          ,
          <source>CoRR abs/2109.10086</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          ,
          <article-title>Using Coherence-Based Measures to Predict Query Difficulty</article-title>
          ,
          <source>in: Advances in Information Retrieval, 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30-April 3, 2008, Proceedings</source>
          , volume
          <volume>4956</volume>
          ,
          <year>2008</year>
          , pp.
          <fpage>689</fpage>
          -
          <lpage>694</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <article-title>Toward an anatomy of IR system component performances</article-title>
          ,
          <source>J. Assoc. Inf. Sci. Technol.</source>
          <volume>69</volume>
          (
          <year>2018</year>
          )
          <fpage>187</fpage>
          -
          <lpage>200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <article-title>What makes a query semantically hard?</article-title>
          ,
          <source>in: Proceedings of the Second International Conference on Design of Experimental Search &amp; Information REtrieval Systems, Padova, Italy, September 15-18, 2021</source>
          , volume
          <volume>2950</volume>
          of
          <source>CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>69</lpage>
          . URL: http://ceur-ws.org/Vol-2950/paper-06.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>