<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title/>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>On the Feasibility and Robustness of Pointwise Evaluation of Query Performance Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Suchana Datta</string-name>
          <email>suchana.datta@ucdconnect.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Debasis Ganguly</string-name>
          <email>debasis.ganguly@glasgow.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Derek Greene</string-name>
          <email>derek.greene@ucd.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mandar Mitra</string-name>
          <email>mandar@isical.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Statistical Institute</institution>
          ,
          <addr-line>Kolkata</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University College Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Glasgow</institution>
          ,
          <country country="GB">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Despite the retrieval effectiveness of queries being mutually independent of one another, the evaluation of query performance prediction (QPP) systems has been carried out by measuring rank correlation over an entire set of queries. Such a listwise approach has a number of disadvantages, notably that it does not support the common requirement of assessing QPP for individual queries. In this paper, we propose a pointwise QPP framework that allows us to evaluate the quality of a QPP system for individual queries by measuring the deviations between each prediction versus the corresponding true value, and then aggregating the results over a set of queries. Our experiments demonstrate that this new approach leads to smaller variances in QPP evaluations across a range of different target metrics and retrieval models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A major disadvantage of listwise QPP approaches is that evaluation is conducted in a relative
manner, so the performance of one query is measured relative to the others. However, a
downstream performance estimate of an individual query also needs to be evaluated independently
of the other queries. In contrast, a pointwise approach measures the effectiveness on individual
queries, and then, if required, aggregates the results over a complete set. This is analogous to
measuring the retrieval effectiveness metric MAP by computing the average precision values
for individual queries and then aggregating them. Pointwise evaluation also allows us to carry
out a per-query analysis of a method, often leading to useful insights. For instance, Buckley [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
found, by performing an extensive per-topic retrieval analysis, that it was possible to identify
queries where most IR systems fail to retrieve relevant documents. However, a listwise evaluation
methodology is not conducive to performing this kind of detailed per-query analysis.
      </p>
      <p>
        Another drawback of listwise methods is that they can be overly sensitive to the configuration
setup used for evaluation. The two most important such configurations are: i) the target retrieval
evaluation metric that induces a ground-truth ordering over the set of queries; ii) the retrieval
model used to obtain the top-k set of documents for QPP estimation. Indeed, variations in
these configurations can lead to both large standard deviations in the reported rank correlation
measures and significant differences in the relative ranks of various QPP systems [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. To address
the limitations of listwise methods, we propose a new QPP evaluation framework, Aggregated
Pointwise Absolute Errors (APAE), which is shown to not only be consistent with the existing
listwise approaches, but also to be more robust to changes in QPP experimental setup.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. A Framework for Pointwise QPP Evaluation</title>
      <p>Correlation with listwise ground-truth Before describing our new QPP evaluation
framework APAE, we begin by introducing the required notation. Formally, a QPP estimate is a
function of the form φ(Q, M_k(Q)) ↦ ℝ, where M_k(Q) is the set of top-k ranked documents
retrieved by an IR model M for a query Q ∈ 𝒬, a benchmark set of queries.</p>
      <p>
        For the purpose of listwise evaluation, for each Q ∈ 𝒬, we first compute the value of a target
IR evaluation metric, μ(Q), that reflects the quality of the retrieved list M_k(Q). The next step
uses these μ(Q) scores to induce a ground-truth ranking of the set 𝒬, or in other words, arrange
the queries by their decreasing (or increasing) μ(Q) values, i.e.,
𝒬_μ = {Q_i ∈ 𝒬 : μ(Q_i) &gt; μ(Q_{i+1}), ∀i = 1, . . . , |𝒬| − 1}
(1)
Similarly, the evaluation framework also yields a predicted ranking of the queries, where this
time the queries are sorted by the QPP estimated scores, i.e.,
𝒬_φ = {Q_i ∈ 𝒬 : φ(Q_i) &gt; φ(Q_{i+1}), ∀i = 1, . . . , |𝒬| − 1}
(2)
A listwise evaluation framework then computes the rank correlation between these two ordered
sets, σ(𝒬_μ, 𝒬_φ), where σ : ℝ^|𝒬| × ℝ^|𝒬| ↦ [0, 1] is a correlation measure, such as Kendall's τ.
Individual ground-truth In contrast to listwise evaluations, where the ground-truth takes the
form of an ordered set of queries, pointwise QPP evaluation involves making |𝒬| independent
comparisons. Each comparison is made between a query Q's predicted QPP score φ(Q) and its
retrieval effectiveness measure μ(Q), i.e.,
ρ(𝒬, φ, μ) =def (1/|𝒬|) ∑_{Q ∈ 𝒬} ρ(μ(Q), φ(Q))
(3)
      </p>
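<p>To make the listwise protocol concrete, the following sketch computes Kendall's τ between the orderings induced by μ and by φ. The query identifiers and score values are invented for illustration, and the helper is a minimal tie-free implementation rather than a library routine.</p>

```python
from itertools import combinations

def kendall_tau(x, y):
    """Minimal Kendall's tau over paired score lists (no tie handling)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s != 0:  # opposite signs: the pair is discordant
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# mu(Q): target metric values (e.g., AP@100); phi(Q): raw QPP estimates.
mu  = {"q1": 0.42, "q2": 0.17, "q3": 0.65, "q4": 0.30}
phi = {"q1": 1.10, "q2": 0.40, "q3": 2.30, "q4": 0.95}

queries = sorted(mu)
tau = kendall_tau([mu[q] for q in queries], [phi[q] for q in queries])
print(tau)  # 1.0 here: phi orders the queries exactly as mu does
```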
      <p>
        Unlike the rank correlation σ, here ρ is a pointwise correlation function of the form ρ : ℝ × ℝ ↦ ℝ.
It is often convenient to think of ρ as the inverse of a distance function that measures the extent
to which a predicted value deviates from the corresponding true value. In contrast to ground-truth
evaluation metrics, most QPP estimates (e.g., NQC, WIG etc.) are not bounded within [0, 1].
Therefore, to employ a distance measure, each QPP estimate φ(Q) must be normalized to the unit
interval. Subsequently, ρ can be defined as ρ(μ(Q), φ(Q)) =def 1 − |μ(Q) − φ(Q)/ℵ|, where ℵ is
a normalization constant chosen to be sufficiently large so that φ(Q)/ℵ lies within the unit interval.
      </p>
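<p>The pointwise score for a single query and a single metric can be sketched as below; the numeric values and the particular choice of ℵ are invented for illustration.</p>

```python
def pointwise_rho(mu_q, phi_q, aleph):
    """rho(mu(Q), phi(Q)) = 1 - |mu(Q) - phi(Q)/aleph|, per the definition above.

    aleph is a normalization constant chosen large enough that phi(Q)/aleph
    falls within the unit interval for every query in the benchmark.
    """
    return 1.0 - abs(mu_q - phi_q / aleph)

# Invented example: AP@100 of 0.5 for a query, a raw NQC-style score of 2.5,
# and aleph = 5.0 as an assumed upper bound on the raw QPP scores.
print(pointwise_rho(0.5, 2.5, 5.0))   # 1.0: the prediction matches perfectly
print(pointwise_rho(0.25, 2.5, 5.0))  # 0.75: a deviation of 0.25 is penalized
```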
      <sec id="sec-2-1">
        <title>Selecting an IR metric for pointwise QPP evaluation</title>
        <p>
          In general, an unsupervised QPP estimator will be agnostic with respect to the target IR metric μ. For instance, NQC scores can be
seen as being approximations of AP@100 values, but can also be interpreted as approximating
any other metric, such as nDCG@20 or P@10. Therefore, a question arises around which metric
should be used to compute the individual correlations in Equation 3. Of course, the results can
differ substantially for different choices of μ, e.g., AP or nDCG. This is also the case for listwise
QPP evaluation, as reported in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. To reduce the effect of such variations, we now propose a
simple yet effective solution.
        </p>
        <p>Metric-agnostic pointwise QPP evaluation For a set of evaluation functions μ ∈ ℳ (e.g.,
ℳ = {AP@100, nDCG@20, . . .}), we employ an aggregation function to compute the overall
pointwise correlation (Equation 3) of a QPP estimate with respect to each metric. Formally,
ρ(Q, ℳ, φ) = Σ_{μ ∈ ℳ} (1 − |μ(Q) − φ(Q)/ℵ|),
(4)
where Σ denotes an aggregation function (it does not indicate summation). In particular, we use
the most commonly-used such functions as choices for Σ: ‘minimum’, ‘maximum’, and ‘average’
– i.e., Σ ∈ {avg, min, max}. Next, we find the average over these values computed for a given set
of queries 𝒬, i.e., we substitute ρ(Q, ℳ, φ) from Equation 4 into the summation of Equation 3.</p>
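<p>Putting Equations 3 and 4 together, a minimal sketch of the metric-agnostic computation follows. The toy metric values, QPP scores and the value of ℵ are invented, and the helper name is ours, not code from the paper.</p>

```python
def apae(queries, metrics, phi, aleph, agg=lambda v: sum(v) / len(v)):
    """Aggregated Pointwise Absolute Errors (Equations 3 and 4).

    metrics maps each metric name to its per-query mu(Q) values; phi maps
    each query to its raw (unbounded) QPP score; agg is the aggregator Sigma
    (avg, min or max) applied over the per-metric pointwise scores.
    """
    per_query = []
    for q in queries:
        scores = [1.0 - abs(m[q] - phi[q] / aleph) for m in metrics.values()]
        per_query.append(agg(scores))          # Equation 4: aggregate over metrics
    return sum(per_query) / len(per_query)     # Equation 3: mean over queries

# Invented toy data: two target metrics, two queries, aleph = 5.0.
metrics = {
    "AP@100":  {"q1": 0.5, "q2": 0.25},
    "nDCG@20": {"q1": 0.5, "q2": 0.625},
}
phi = {"q1": 2.5, "q2": 2.5}
queries = ["q1", "q2"]

rho_avg = apae(queries, metrics, phi, 5.0)           # average aggregation
rho_min = apae(queries, metrics, phi, 5.0, agg=min)  # pessimistic aggregation
rho_max = apae(queries, metrics, phi, 5.0, agg=max)  # optimistic aggregation
print(rho_avg, rho_min, rho_max)  # 0.90625 0.875 0.9375
```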
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        A QPP experiment context [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] involves three configuration choices: i) the QPP method itself
that is used to predict the relative performance of queries; ii) the IR metric that is used to obtain
a ground-truth ordering of the query performances as measured on a set of top-k (k = 100 in
our experiments) documents retrieved by iii) a specific IR model. Table 1 summarizes the IR
models and metrics used in our experiments, along with the relevant hyper-parameter values. The
objective of our experiments is to investigate the following two key research questions:
• RQ1: Does APAE agree with the standard listwise correlation metrics?
• RQ2: How robust is APAE with respect to changes in the QPP experiment context?
      </p>
      <p>An affirmative answer to RQ1 would indicate that our proposed metric APAE is consistent
with existing metrics used for QPP evaluation, while an affirmative answer to RQ2 would suggest
that APAE is preferable to existing methods due to its higher stability with respect to different
experimental settings.</p>
      <sec id="sec-3-1">
        <title>Experimental setup</title>
        <p>
          We conduct our QPP experiments on the TREC Robust dataset, which consists of 249 topics.
Following the standard practice for QPP experiments [
          <xref ref-type="bibr" rid="ref12 ref5">5, 12</xref>
          ], we report results aggregated over a
total of 30 randomly chosen equal-sized train-test splits of the data. The training split of each
partition was used for tuning the hyper-parameters for the QPP method.
        </p>
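<p>The split protocol above can be sketched as follows; the topic labels and the helper name are ours for illustration.</p>

```python
import random

def make_splits(topics, n_splits=30, seed=0):
    """Draw random equal-sized train/test partitions of the topic set."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    splits = []
    for _ in range(n_splits):
        shuffled = list(topics)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        splits.append((shuffled[:half], shuffled[half:]))
    return splits

topics = [f"topic-{i}" for i in range(1, 250)]  # 249 TREC Robust topics
splits = make_splits(topics)
train, test = splits[0]
# Hyper-parameters are tuned on `train`; results are reported on `test`.
```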
      </sec>
      <sec id="sec-3-2">
        <title>Agreement between listwise and pointwise evaluation</title>
        <p>
          Firstly, we investigate the
          consistency of APAE with respect to three standard listwise QPP evaluation metrics: Pearson's r,
Spearman's ρ and Kendall's τ; and a pointwise approach, scaled Absolute Rank Error (sARE)
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Since sARE is an error measure, we measure correlations of APAE with 1 − sARE measures
(which for the sake of simplicity, we refer to as sARE in Table 2). We experiment with three
different instances of APAE obtained by substituting the aggregation functions – avg, min and
max as Σ in Equation 4, denoted respectively as ρ_avg(ℳ), ρ_min(ℳ) and ρ_max(ℳ).
        </p>
        <p>The results presented in Table 2 answer RQ1 in the affirmative. Each reported value here
corresponds to the rank correlation (Kendall's τ) between the relative ranks of the QPP systems
ordered by their effectiveness as computed via one of the standard metrics (one of r, ρ, τ or
sARE) and APAE, i.e., one of ρ_avg(ℳ), ρ_min(ℳ) and ρ_max(ℳ). The high correlation values
between the standard listwise and the proposed pointwise metrics show that APAE can be used
as a substitute for the standard listwise evaluation. Notably, we see that the average aggregate
function yields the best results, and hence for the subsequent experiments we use ρ_avg(ℳ) as the
pointwise evaluation metric.</p>
        <p>Table 2: The correlation of our proposed pointwise evaluation metric APAE with the standard listwise metrics
Pearson's r, Spearman's ρ, Kendall's τ and sARE. The rank correlations between each pair of QPP
system ranks (evaluated with a listwise measure and a pointwise measure) were computed with Kendall's τ.
The high values indicate that the pointwise measurement can effectively substitute a standard list-based
measure, since they lead to a fairly similar relative ordering between the effectiveness of different QPP
systems. [Table of correlation values for ρ_avg(ℳ), ρ_min(ℳ) and ρ_max(ℳ) against each listwise metric,
per retrieval model (BM25, LMDir, LMJM); the individual cell values are not recoverable from the source.]</p>
        <p>Table 3: Rank correlations between the relative ranks of QPP systems across different
pairs of IR metrics and IR models. Red cells indicate the lowest value in each group, while the lowest values
along each column are bold-faced.
(a) Correlations between the relative ranks of 7 different QPP systems across different pairs of IR target
metrics. QPP systems were evaluated with the baseline listwise metric – Kendall's τ.
(b) Similar to Table 3a, except QPP performance was evaluated with the pointwise approach APAE. A
comparison with Table 3a indicates a better consistency in the relative ranks of QPP systems for variations in
the IR metrics.
(c) Here rank correlations between the relative ranks of QPP systems are measured across IR model pairs.
As in Table 3a, QPP systems were evaluated with τ. The numbers alongside the IR models denote their
respective parameters.
(d) Unlike Table 3c, here the QPP outcomes were evaluated by APAE (instead of τ).
[Table rows cover BM25 (0.7, 0.3), BM25 (0.3, 0.7), LMDir (500), LMDir (1000) and LMJM (0.6);
the individual cell values are not recoverable from the source.]</p>
      </sec>
      <sec id="sec-3-3">
        <title>Variances in relative effectiveness of QPP methods</title>
        <p>To investigate RQ2, we consider
the relative stability of QPP system ranks for variations in QPP contexts (i.e., different IR models
and target metrics), comparing both listwise and pointwise approaches (see Table 3). To clarify
with an example, if working with three QPP methods, say AvgIDF, NQC, WIG, we observe
that τ(NQC) &gt; τ(WIG) &gt; τ(AvgIDF) for LMDir as measured relative to AP@100. We expect
to observe a similar ordering for a different choice of the IR model and target IR metric, say
BM25 with nDCG@100. As in our previous experiments, here we measure the rank correlations
between a total of seven QPP systems (see Table 1) via Kendall’s  .</p>
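<p>The notion of rank stability can be illustrated with a small sketch; the effectiveness values below are invented, and only three of the seven QPP systems are shown.</p>

```python
# QPP system effectiveness under two different experiment contexts
# (invented values for illustration).
ctx_lmdir_ap  = {"NQC": 0.41, "WIG": 0.35, "AvgIDF": 0.22}  # LMDir, AP@100
ctx_bm25_ndcg = {"NQC": 0.38, "WIG": 0.33, "AvgIDF": 0.25}  # BM25, nDCG@100

rank_a = sorted(ctx_lmdir_ap,  key=ctx_lmdir_ap.get,  reverse=True)
rank_b = sorted(ctx_bm25_ndcg, key=ctx_bm25_ndcg.get, reverse=True)
print(rank_a == rank_b)  # True: the system ordering survives the context change
```

A robust evaluation approach is one for which this agreement holds across many such context pairs, which is what Table 3 quantifies via Kendall's τ over the seven systems.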
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Concluding Remarks</title>
      <p>Unlike the standard listwise QPP evaluation mechanism of measuring an overall rank
correlation with respect to a reference ranking of the queries (in terms of retrieval effectiveness), we
have proposed a pointwise evaluation method that computes the relative difference between a
normalized QPP score and a true IR evaluation measure (e.g., AP@100 or nDCG@20). Our
experiments demonstrated that the proposed metric exhibits a high correlation with standard
listwise approaches and is more robust to changes in QPP experimental setup than listwise
evaluation measures. Using this metric, it should thus be possible to evaluate the effectiveness of
different QPP methods on downstream tasks on a per-query basis.</p>
      <p>Acknowledgement. The first and the third authors were supported by the Science Foundation Ireland
(SFI) grant number SFI/12/RC/2289_P2.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Ranking robustness: A novel framework to predict query performance</article-title>
          ,
          <source>in: Proc. of CIKM '06</source>
          ,
          <year>2006</year>
          , p.
          <fpage>567</fpage>
          -
          <lpage>574</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shtok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <article-title>Using statistical decision theory and relevance models for query-performance prediction</article-title>
          ,
          <source>in: Proc. of SIGIR '10</source>
          ,
          <year>2010</year>
          , p.
          <fpage>259</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <article-title>Adaptive relevance feedback in information retrieval</article-title>
          ,
          <source>in: Proc. of CIKM '09</source>
          ,
          <year>2009</year>
          , p.
          <fpage>255</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cummins</surname>
          </string-name>
          ,
          <article-title>Document score distribution models for query performance inference and prediction</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>32</volume>
          (
          <year>2014</year>
          )
          <fpage>2:1</fpage>
          -
          <lpage>2:28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <article-title>Neural query performance prediction using weak supervision from multiple signals</article-title>
          ,
          <source>in: Proc. of SIGIR '18</source>
          ,
          <year>2018</year>
          , p.
          <fpage>105</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <article-title>Why current IR engines fail</article-title>
          ,
          <source>in: Proc. of SIGIR'04</source>
          ,
          <year>2004</year>
          , p.
          <fpage>584</fpage>
          -
          <lpage>585</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Greene</surname>
          </string-name>
          ,
          <article-title>An analysis of variations in the effectiveness of query performance prediction</article-title>
          ,
          <source>in: Proc. of ECIR'22</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          , F. de Jong,
          <article-title>A survey of pre-retrieval query performance predictors</article-title>
          ,
          <source>in: Proc. of CIKM '08</source>
          ,
          <year>2008</year>
          , p.
          <fpage>1419</fpage>
          -
          <lpage>1420</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cronen-Townsend</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Predicting query performance</article-title>
          ,
          <source>in: Proc. of SIGIR '02</source>
          ,
          <year>2002</year>
          , p.
          <fpage>299</fpage>
          -
          <lpage>306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shtok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Raiber</surname>
          </string-name>
          , G. Markovits,
          <article-title>Predicting query performance by query-drift estimation</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>30</volume>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Query performance prediction in web search environments</article-title>
          ,
          <source>in: Proc. of SIGIR '07</source>
          ,
          <year>2007</year>
          , p.
          <fpage>543</fpage>
          -
          <lpage>550</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shtok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Raiber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <article-title>Information needs, queries, and query performance prediction</article-title>
          ,
          <source>in: Proc. of SIGIR '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>395</fpage>
          -
          <lpage>404</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Zendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          ,
          <article-title>An enhanced evaluation framework for query performance prediction</article-title>
          ,
          <source>in: Advances in Information Retrieval</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>