<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">On the Feasibility and Robustness of Pointwise Evaluation of Query Performance Prediction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Suchana</forename><surname>Datta</surname></persName>
							<email>suchana.datta@ucdconnect.ie</email>
							<affiliation key="aff0">
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Debasis</forename><surname>Ganguly</surname></persName>
							<email>debasis.ganguly@glasgow.ac.uk</email>
							<affiliation key="aff1">
								<orgName type="institution">University of Glasgow</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Derek</forename><surname>Greene</surname></persName>
							<email>derek.greene@ucd.ie</email>
							<affiliation key="aff2">
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mandar</forename><surname>Mitra</surname></persName>
							<email>mandar@isical.ac.in</email>
							<affiliation key="aff3">
								<orgName type="institution">Indian Statistical Institute</orgName>
								<address>
									<settlement>Kolkata</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">On the Feasibility and Robustness of Pointwise Evaluation of Query Performance Prediction</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">7216E510C36329DA9E921E2AE31C6850</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-04-29T07:04+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Although the retrieval effectiveness of each query is independent of that of the other queries, the evaluation of query performance prediction (QPP) systems has typically been carried out by measuring rank correlation over an entire set of queries. Such a listwise approach has a number of disadvantages, notably that it does not support the common requirement of assessing QPP for individual queries. In this paper, we propose a pointwise QPP framework that allows us to evaluate the quality of a QPP system for individual queries by measuring the deviation of each prediction from the corresponding true value, and then aggregating the results over a set of queries. Our experiments demonstrate that this new approach leads to smaller variances in QPP evaluations across a range of different target metrics and retrieval models.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Query performance prediction (QPP) methods have been proposed to automatically estimate the retrieval effectiveness for queries without making use of any true relevance information (e.g. <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>). In practice, a QPP method allows us to dynamically adjust the processing steps for a query, depending on its initial performance estimate. Although estimating the performance of individual queries independently is a common requirement in many downstream tasks (e.g., adaptive query processing <ref type="bibr" target="#b2">[3]</ref>), the standard QPP evaluation methodology adopted by the IR research community has previously involved a listwise approach, rather than a pointwise one. This is despite the fact that the latter represents a more appropriate strategy for use in downstream applications. To elaborate, a listwise approach operates on a set of queries 𝒬 by first converting it into an ordered set as induced by the QPP estimated scores 𝜑(𝑄) ∀𝑄 ∈ 𝒬. It then computes a rank correlation measure, such as Kendall's 𝜏, between this predicted ordering and the ground-truth ordering of the queries, as induced by their average precision (AP) values <ref type="bibr" target="#b3">[4]</ref> or by any other IR metric, such as nDCG <ref type="bibr" target="#b4">[5]</ref>.</p><p>A major disadvantage of listwise QPP approaches is that evaluation is conducted in a relative manner, i.e., the performance of one query is measured only relative to the others. However, a downstream performance estimate of an individual query also needs to be evaluated independently of the other queries. In contrast, a pointwise approach measures the effectiveness on individual queries, and then, if required, aggregates the results over a complete set. 
This is analogous to measuring the retrieval effectiveness metric MAP by computing the average precision values for individual queries and then aggregating them. Pointwise evaluation also allows us to carry out a per-query analysis of a method, often leading to useful insights. For instance, Buckley <ref type="bibr" target="#b5">[6]</ref> found, by performing an extensive per-topic retrieval analysis, that there are queries for which most IR systems fail to retrieve relevant documents. However, a listwise evaluation methodology is not conducive to this kind of detailed per-query analysis.</p><p>Another drawback of listwise methods is that they can be overly sensitive to the configuration used for evaluation. The two most important such configuration choices are: i) the target retrieval evaluation metric that induces a ground-truth ordering over the set of queries; ii) the retrieval model used to obtain the top-𝑘 set of documents for QPP estimation. Indeed, variations in these configurations can lead to both large standard deviations in the reported rank correlation measures and significant differences in the relative ranks of various QPP systems <ref type="bibr" target="#b6">[7]</ref>. To address the limitations of listwise methods, we propose a new QPP evaluation framework, Aggregated Pointwise Absolute Errors (APAE), which we show to be not only consistent with existing listwise approaches, but also more robust to changes in the QPP experimental setup.</p></div>
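To make the listwise protocol described above concrete, the following sketch correlates the query ordering induced by per-query AP values with the ordering induced by a QPP estimator's scores. This is an illustrative sketch, not the authors' code: the helper `kendall_tau` (a tau-a without tie handling), the function names, and the toy scores are all assumptions.

```python
# Listwise QPP evaluation sketch (illustrative, not the paper's code):
# correlate the ordering induced by per-query AP values with the ordering
# induced by unbounded QPP scores, using Kendall's tau-a (no tie handling).
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a over two paired score lists (assumes no ties)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(xs) * (len(xs) - 1) / 2)

def listwise_qpp_eval(per_query_ap, qpp_scores):
    queries = sorted(per_query_ap)             # fix one common query order
    truth = [per_query_ap[q] for q in queries]
    preds = [qpp_scores[q] for q in queries]
    return kendall_tau(truth, preds)

# Toy values: one swapped pair (q2 vs. q4) out of six -> tau = (5 - 1) / 6.
per_query_ap = {"q1": 0.62, "q2": 0.15, "q3": 0.48, "q4": 0.05}
qpp_scores   = {"q1": 1.90, "q2": 0.40, "q3": 1.10, "q4": 0.70}
print(round(listwise_qpp_eval(per_query_ap, qpp_scores), 4))  # 0.6667
```

Note that the raw QPP scores here are unbounded; only their relative order matters to a listwise measure, which is precisely the limitation the pointwise framework below addresses.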
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">A Framework for Pointwise QPP Evaluation</head><p>Correlation with listwise ground-truth Before describing our new QPP evaluation framework APAE, we first introduce the required notation. Formally, a QPP estimate is a function of the form 𝜑(𝑄, 𝑀 𝑘 (𝑄)) ↦→ R, where 𝑀 𝑘 (𝑄) is the set of top-𝑘 ranked documents retrieved by an IR model 𝑀 for a query 𝑄 ∈ 𝒬, a benchmark set of queries.</p><p>For the purpose of listwise evaluation, for each 𝑄 ∈ 𝒬, we first compute the value of a target IR evaluation metric 𝜇(𝑄) that reflects the quality of the retrieved list 𝑀 𝑘 (𝑄). The next step uses these 𝜇(𝑄) scores to induce a ground-truth ranking of the set 𝒬; in other words, it arranges the queries by their decreasing (or increasing) 𝜇(𝑄) values, i.e.,</p><formula xml:id="formula_0">𝒬 𝜇 = {𝑄 𝑖 ∈ 𝒬 : 𝜇(𝑄 𝑖 ) &gt; 𝜇(𝑄 𝑖+1 ), ∀𝑖 = 1, . . . , |𝒬| − 1}<label>(1)</label></formula><p>Similarly, the evaluation framework also yields a predicted ranking of the queries, where this time the queries are sorted by the QPP estimated scores, i.e.,</p><formula xml:id="formula_1">𝒬 𝜑 = {𝑄 𝑖 ∈ 𝒬 : 𝜑(𝑄 𝑖 ) &gt; 𝜑(𝑄 𝑖+1 ), ∀𝑖 = 1, . . . , |𝒬| − 1}<label>(2)</label></formula><p>A listwise evaluation framework then computes the rank correlation 𝛾(𝒬 𝜇 , 𝒬 𝜑 ) between these two ordered sets, where 𝛾 : R |𝒬| × R |𝒬| ↦→ [0, 1] is a correlation measure, such as Kendall's 𝜏 .</p><p>Individual ground-truth In contrast to listwise evaluations, where the ground-truth takes the form of an ordered set of queries, pointwise QPP evaluation involves making |𝒬| independent comparisons. Each comparison is made between a query 𝑄's predicted QPP score 𝜑(𝑄) and its retrieval effectiveness measure 𝜇(𝑄), i.e.,</p><formula xml:id="formula_3">𝜂(𝒬, 𝜇, 𝜑) def = 1 |𝒬| ∑︁ 𝑄∈𝒬 𝜂(𝜇(𝑄), 𝜑(𝑄))<label>(3)</label></formula><p>Unlike the rank correlation 𝛾, here 𝜂 is a pointwise correlation function of the form 𝜂 : R × R ↦→ R. It is often convenient to think of 𝜂 as the inverse of a distance function that measures the extent to which a predicted value deviates from the corresponding true value. In contrast to ground-truth evaluation metrics, most QPP estimates (e.g., NQC, WIG) are not bounded within [0, 1]. Therefore, to employ a distance measure, each QPP estimate 𝜑(𝑄) must first be normalized to the unit interval. Subsequently, 𝜂 can be defined as</p><formula xml:id="formula_4">𝜂(𝜇(𝑄), 𝜑(𝑄)) def = 1 − |𝜇(𝑄) − 𝜑(𝑄)/ℵ|</formula><p>where ℵ is a normalization constant chosen to be sufficiently large that 𝜑(𝑄)/ℵ lies within [0, 1].</p><p>Selecting an IR metric for pointwise QPP evaluation In general, an unsupervised QPP estimator is agnostic with respect to the target IR metric 𝜇. For instance, NQC scores can be seen as approximations of AP@100 values, but they can equally be interpreted as approximating any other metric, such as nDCG@20 or P@10. Therefore, a question arises as to which metric should be used to compute the individual correlations in Equation <ref type="formula" target="#formula_3">3</ref>. Of course, the results can differ substantially for different choices of 𝜇, e.g., AP or nDCG. This is also the case for listwise QPP evaluation, as reported in <ref type="bibr" target="#b6">[7]</ref>. To reduce the effect of such variations, we now propose a simple yet effective solution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>QPP Methods: AvgIDF <ref type="bibr" target="#b7">[8]</ref>, Clarity <ref type="bibr" target="#b8">[9]</ref>, NQC <ref type="bibr" target="#b9">[10]</ref>, WIG <ref type="bibr" target="#b10">[11]</ref>, UEF(Clarity), UEF(NQC), UEF(WIG) <ref type="bibr" target="#b1">[2]</ref>. IR Metrics: AP@100, nDCG@100, P@10, Recall@100. IR Models: LMJM (𝜆 = 0.6), LMDir (𝜇 = 1000), BM25 ((𝑘, 𝑏) = (0.7, 0.3)).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Metric-agnostic pointwise QPP evaluation For a set of evaluation functions 𝜇 ∈ ℳ (e.g., ℳ = {AP@100, nDCG@20, . . .}), we employ an aggregation function to compute the overall pointwise correlation (Equation <ref type="formula" target="#formula_3">3</ref>) of a QPP estimate with respect to each metric. 
Formally,</p><formula xml:id="formula_5">𝜂(𝑄, ℳ, 𝜑) = Σ 𝜇∈ℳ (1 − |𝜇(𝑄) − 𝜑(𝑄)/ℵ|),<label>(4)</label></formula><p>where Σ denotes an aggregation function (it does not indicate summation). In particular, we use the three most common such functions as choices for Σ: 'minimum', 'maximum', and 'average', i.e., Σ ∈ {avg, min, max}. Next, we average these values over a given set of queries 𝒬, i.e., we substitute 𝜂(𝑄, ℳ, 𝜑) from Equation <ref type="formula" target="#formula_5">4</ref> into the summation of Equation <ref type="formula" target="#formula_3">3</ref>.</p></div>
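Equations 3 and 4 can be sketched as a short computation. The sketch below is an illustrative assumption, not the authors' implementation: the function name `apae`, the choice ℵ = 4, and the toy per-query values are invented for demonstration.

```python
# APAE sketch (Equations 3-4, illustrative values): normalize each QPP score
# by a large constant aleph, compute 1 - |mu(Q) - phi(Q)/aleph| for every
# target metric mu in M, aggregate across metrics with Sigma (min/max/avg),
# then average the per-query values over the query set.

def apae(metric_values, qpp_scores, aleph, aggregate):
    """metric_values: {metric_name: {query: mu(Q)}}; qpp_scores: {query: phi(Q)}."""
    per_query = []
    for q, phi in qpp_scores.items():
        # Equation 4: one agreement value per target metric for this query.
        agreements = [1 - abs(mu[q] - phi / aleph)
                      for mu in metric_values.values()]
        per_query.append(aggregate(agreements))
    # Equation 3: average the pointwise values over the query set.
    return sum(per_query) / len(per_query)

metric_values = {
    "AP@100":   {"q1": 0.62, "q2": 0.15},
    "nDCG@100": {"q1": 0.70, "q2": 0.22},
}
qpp_scores = {"q1": 1.9, "q2": 0.4}   # unbounded estimates (e.g. NQC-like)
for agg in (min, max, lambda v: sum(v) / len(v)):
    print(apae(metric_values, qpp_scores, aleph=4.0, aggregate=agg))
```

Passing `min`, `max`, or the average lambda as `aggregate` yields the three instances 𝜂 min (ℳ), 𝜂 max (ℳ) and 𝜂 avg (ℳ) used in the experiments.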
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>A QPP experiment context <ref type="bibr" target="#b6">[7]</ref> involves three configuration choices: i) the QPP method itself that is used to predict the relative performance of queries; ii) the IR metric that is used to obtain a ground-truth ordering of the query performances, as measured on a set of top-𝑘 (𝑘 = 100 in our experiments) documents retrieved by iii) a specific IR model. Table <ref type="table" target="#tab_0">1</ref> summarizes the IR models and metrics used in our experiments, along with the relevant hyper-parameter values. The objective of our experiments is to investigate the following two key research questions:</p><p>• RQ1: Does APAE agree with the standard listwise correlation metrics? • RQ2: How robust is APAE with respect to changes in the QPP experiment context? An affirmative answer to RQ1 would indicate that our proposed metric APAE is consistent with existing metrics used for QPP evaluation, while an affirmative answer to RQ2 would suggest that APAE is preferable to existing methods due to its higher stability with respect to different experimental settings.</p><p>We conduct our QPP experiments on the TREC Robust dataset, which consists of 249 topics. Following the standard practice for QPP experiments <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b11">12]</ref>, we report results aggregated over a total of 30 randomly chosen equal-sized train-test splits of the data. The training split of each partition was used for tuning the hyper-parameters of each QPP method.</p><p>Agreement between listwise and pointwise evaluation First, we investigate the consistency of APAE with respect to three standard listwise QPP evaluation metrics: Pearson's 𝑟, Spearman's 𝜌 and Kendall's 𝜏 ; and a pointwise approach, scaled Absolute Rank Error (sARE) <ref type="bibr" target="#b12">[13]</ref>. 
Since sARE is an error measure, we compute correlations of APAE with 1 − sARE (which, for simplicity, we refer to as sARE in Table <ref type="table">2</ref>). We experiment with three different instances of APAE, obtained by substituting the aggregation functions avg, min and max for Σ in Equation <ref type="formula" target="#formula_5">4</ref>; these are denoted as 𝜂 avg (ℳ), 𝜂 min (ℳ) and 𝜂 max (ℳ), respectively.</p><p>The results presented in Table <ref type="table">2</ref> answer RQ1 in the affirmative. Each reported value corresponds to the rank correlation (Kendall's 𝜏 ) between the relative ranks of the QPP systems ordered by their effectiveness as computed via one of the standard metrics (𝑟, 𝜌, 𝜏 or sARE) and via APAE (one of 𝜂 avg (ℳ), 𝜂 min (ℳ) and 𝜂 max (ℳ)). The high correlation values between the standard listwise metrics and the proposed pointwise metric show that APAE can be used as a substitute for standard listwise evaluation. Notably, the average aggregation function yields the best results; hence, for the subsequent experiments, we use 𝜂 avg (ℳ) as the pointwise evaluation metric.</p></div>
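The protocol behind this comparison can be sketched as follows: score each QPP system once with a listwise measure and once with APAE, then correlate the two induced system rankings. The numeric scores below are invented for illustration and are not the paper's results.

```python
# Rank-of-ranks agreement sketch (toy scores, not the paper's results):
# if a listwise measure and APAE order the seven QPP systems the same way,
# Kendall's tau between the two system rankings is 1.0.
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a over two paired score lists (assumes no ties)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        concordant += s > 0
        discordant += s < 0
    return (concordant - discordant) / (len(xs) * (len(xs) - 1) / 2)

systems = ["AvgIDF", "Clarity", "NQC", "WIG",
           "UEF(Clarity)", "UEF(NQC)", "UEF(WIG)"]
listwise_scores = [0.31, 0.42, 0.55, 0.50, 0.46, 0.60, 0.58]  # e.g. tau
apae_scores     = [0.70, 0.76, 0.84, 0.82, 0.78, 0.86, 0.85]  # e.g. APAE
print(kendall_tau(listwise_scores, apae_scores))  # 1.0: identical orderings
```

Here the two score lists induce the same ordering of the seven systems, so their rank correlation is 1.0; the values reported in Table 2 are exactly this kind of rank-of-ranks correlation, averaged over experimental configurations.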
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>The correlation of our proposed pointwise evaluation metric APAE with the standard listwise metrics: Pearson's 𝑟, Spearman's 𝜌, Kendall's 𝜏 and sARE. The rank correlations between each pair of QPP system ranks (evaluated with a listwise measure and a pointwise measure) were computed with Kendall's 𝜏. The high values indicate that the pointwise measurement can effectively substitute a standard list-based measure, since they lead to a fairly similar relative ordering between the effectiveness of different QPP methods.</p><p>Columns: 𝜂 avg (ℳ), 𝜂 min (ℳ), 𝜂 max (ℳ); rows: 𝑟, 𝜌, 𝜏.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Stability of the proposed pointwise QPP metric APAE compared to the listwise approach, across different pairs of IR metrics and IR models. Red cells indicate the lowest value in each group, while the lowest values along each column are bold-faced.</p><p>(c) Rank correlations between the relative ranks of QPP systems, measured across pairs of IR models. As in Table <ref type="table">3a</ref>, QPP systems were evaluated with 𝜏. The numbers alongside the IR models denote their respective parameters.</p><p>(d) Unlike Table <ref type="table">3c</ref>, here the QPP outcomes were evaluated by APAE (instead of 𝜏). Columns: LMJM (0.6), BM25 (0.7, 0.3), BM25 (0.3, 0.7), LMDir (500), LMDir (1000). LMJM (0.3): AP@100: 1.000, 1.000, 1.000, 1.000, 1.000; nDCG@100: 1.000, 0.864, 1.000, 0.843, 0.864; R@100: 1.000, 0.864, 1.000, 1.000, 1.000. LMJM (0.6): AP@100: 1.000, 1.000, 1.000, 1.000; nDCG@100: 0.914, 1.000, 0.813, 0.914; R@100: 1.000, 1.000, 1.000, 1.000. BM25 (0.7, 0.3): AP@100: 1.000, 1.000, 1.000; nDCG@100: 1.000, 1.000, 1.000; R@100: 0.812, 0.905, 1.000. BM25 (0.3, 0.7): AP@100: 1.000, 1.000; nDCG@100: 1.000, 1.000; R@100: 1.000, 1.000. LMDir (500): AP@100: 1.000; nDCG@100: 1.000; R@100: 1.000.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Variances in relative effectiveness of QPP methods</head><p>To investigate RQ2, we consider the relative stability of QPP system ranks for variations in QPP contexts (i.e., different IR models and target metrics), comparing both listwise and pointwise approaches (see Table <ref type="table">3</ref>). To clarify with an example, suppose that, for three QPP methods AvgIDF, NQC and WIG, we observe that 𝜏 (NQC) &gt; 𝜏 (WIG) &gt; 𝜏 (AvgIDF) for LMDir, as measured relative to AP@100. A stable evaluation metric should then yield a similar ordering for a different choice of IR model and target IR metric, say BM25 with nDCG@100. As in our previous experiments, here we measure the rank correlations between a total of seven QPP systems (see Table <ref type="table" target="#tab_0">1</ref>) via Kendall's 𝜏 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Concluding Remarks</head><p>Unlike the standard listwise QPP evaluation mechanism of measuring an overall rank correlation with respect to a reference ranking of the queries (in terms of retrieval effectiveness), we have proposed a pointwise evaluation method that computes the relative difference between a normalized QPP score and a true IR evaluation measure (e.g., AP@100 or nDCG@20). Our experiments demonstrated that the proposed metric exhibits a high correlation with standard listwise approaches and is more robust to changes in QPP experimental setup than listwise evaluation measures. Using this metric, it should thus be possible to evaluate the effectiveness of different QPP methods on downstream tasks on a per-query basis.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>QPP configurations -(QPP method, IR metric, and models) used to measure variations.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>Similar to Table 3a, except QPP performance was evaluated with the pointwise approach APAE. A comparison with Table 3a indicates a better consistency in the relative ranks of QPP systems for variations in the IR metrics.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">Model Metric</cell><cell cols="3">AP@100 R@10 R@100 nDCG@10 nDCG@100</cell></row><row><cell></cell><cell></cell><cell cols="2">0.497 0.813 0.429</cell><cell>0.783</cell><cell>0.429</cell><cell>LMJM</cell><cell></cell><cell>0.904 1.000 0.715</cell><cell>1.000</cell><cell>0.792</cell></row><row><cell>BM25</cell><cell></cell><cell cols="2">0.897 0.722 0.722</cell><cell>0.793</cell><cell>0.793</cell><cell>BM25</cell><cell>AP@10</cell><cell>1.000 1.000 1.000</cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMDir</cell><cell></cell><cell cols="2">0.897 0.786 0.786</cell><cell>0.823</cell><cell>0.905</cell><cell>LMDir</cell><cell></cell><cell>1.000 1.000 1.000</cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMJM</cell><cell></cell><cell></cell><cell>0.328 0.811</cell><cell>0.363</cell><cell>0.783</cell><cell>LMJM</cell><cell></cell><cell>0.905 0.811</cell><cell>0.669</cell><cell>1.000</cell></row><row><cell>BM25</cell><cell cols="2">AP@100</cell><cell>0.783 0.784</cell><cell>0.714</cell><cell>0.642</cell><cell>BM25</cell><cell>AP@100</cell><cell>1.000 1.000</cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMDir</cell><cell></cell><cell></cell><cell>0.823 0.901</cell><cell>0.834</cell><cell>0.789</cell><cell>LMDir</cell><cell></cell><cell>1.000 
1.000</cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMJM</cell><cell></cell><cell></cell><cell>0.624</cell><cell>0.893</cell><cell>0.503</cell><cell>LMJM</cell><cell></cell><cell>0.603</cell><cell>0.905</cell><cell>0.542</cell></row><row><cell>BM25</cell><cell>R@10</cell><cell></cell><cell>0.803</cell><cell>0.982</cell><cell>0.894</cell><cell>BM25</cell><cell>R@10</cell><cell>1.000</cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMDir</cell><cell></cell><cell></cell><cell>0.903</cell><cell>0.864</cell><cell>0.864</cell><cell>LMDir</cell><cell></cell><cell>1.000</cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMJM</cell><cell></cell><cell></cell><cell></cell><cell>0.852</cell><cell>0.804</cell><cell>LMJM</cell><cell></cell><cell></cell><cell>0.654</cell><cell>1.000</cell></row><row><cell>BM25</cell><cell>R@100</cell><cell></cell><cell></cell><cell>0.786</cell><cell>0.890</cell><cell>BM25</cell><cell>R@100</cell><cell></cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMDir</cell><cell></cell><cell></cell><cell></cell><cell>0.738</cell><cell>0.738</cell><cell>LMDir</cell><cell></cell><cell></cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMJM</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>0.537</cell><cell>LMJM</cell><cell></cell><cell></cell><cell></cell><cell>0.649</cell></row><row><cell>BM25</cell><cell cols="2">nDCG@10</cell><cell></cell><cell></cell><cell>0.904</cell><cell>BM25</cell><cell>nDCG@10</cell><cell></cell><cell></cell><cell>1.000</cell></row><row><cell>LMDir</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>0.868</cell><cell>LMDir</cell><cell></cell><cell></cell><cell></cell><cell>1.000</cell></row><row><cell cols="7">(a) Correlations between the relative ranks of 7 different QPP systems across different pairs of IR target met-rics. QPP systems were evaluated with the baseline listwise metric -Kendall's 𝜏 . 
(b) Metric LMJM BM25 BM25 LMDir LMDir Model (0.6) (0.7, 0.3) (0.3, 0.7) (500) (1000)</cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">AP@100</cell><cell></cell><cell>0.826 0.904</cell><cell cols="2">0.819 0.714 0.895</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="4">nDCG@100 LMJM 0.780 0.694</cell><cell cols="2">0.695 0.759 0.759</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">R@100</cell><cell>(0.3)</cell><cell>0.824 0.769</cell><cell cols="2">0.782 0.904 0.904</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">AP@100</cell><cell></cell><cell>0.703</cell><cell cols="2">0.712 0.904 0.823</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="3">nDCG@100 LMJM</cell><cell>0.781</cell><cell cols="2">0.827 0.811 0.811</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">R@100</cell><cell>(0.6)</cell><cell>0.813</cell><cell cols="2">0.725 0.731 0.675</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">AP@100</cell><cell></cell><cell></cell><cell cols="2">0.903 0.785 0.785</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="3">nDCG@100 BM25</cell><cell></cell><cell cols="2">0.897 0.786 0.786</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">R@100</cell><cell>(0.7, 0.3)</cell><cell></cell><cell cols="2">0.812 0.752 0.779</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">AP@100</cell><cell></cell><cell></cell><cell></cell><cell>0.887 0.882</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="3">nDCG@100 BM25</cell><cell></cell><cell></cell><cell>0.901 0.895</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">R@100</cell><cell>(0.3, 0.7)</cell><cell></cell><cell></cell><cell>0.889 
0.901</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">AP@100</cell><cell></cell><cell></cell><cell></cell><cell>0.901</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="3">nDCG@100 LMDir</cell><cell></cell><cell></cell><cell>0.893</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">R@100</cell><cell>(500)</cell><cell></cell><cell></cell><cell>0.903</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="6">(c) Here rank correlations between the relative ranks of</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="6">QPP systems are measured across IR model pairs.</cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgement. The first and third authors were supported by the Science Foundation Ireland (SFI) grant number SFI/12/RC/2289_P2.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Ranking robustness: A novel framework to predict query performance</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of CIKM &apos;06</title>
				<meeting>of CIKM &apos;06</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="567" to="574" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Using statistical decision theory and relevance models for query-performance prediction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Shtok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kurland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Carmel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SIGIR &apos;10</title>
				<meeting>of SIGIR &apos;10</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="259" to="266" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Adaptive relevance feedback in information retrieval</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of CIKM &apos;09</title>
				<meeting>of CIKM &apos;09</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="255" to="264" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Document score distribution models for query performance inference and prediction</title>
		<author>
			<persName><forename type="first">R</forename><surname>Cummins</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Information Systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page">28</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Neural query performance prediction using weak supervision from multiple signals</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zamani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Culpepper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SIGIR &apos;18</title>
				<meeting>of SIGIR &apos;18</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="105" to="114" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Why current IR engines fail</title>
		<author>
			<persName><forename type="first">C</forename><surname>Buckley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SIGIR&apos;04</title>
				<meeting>of SIGIR&apos;04</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="584" to="585" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">An analysis of variations in the effectiveness of query performance prediction</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ganguly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Datta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Greene</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of ECIR&apos;22</title>
				<meeting>of ECIR&apos;22</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="215" to="229" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A survey of pre-retrieval query performance predictors</title>
		<author>
			<persName><forename type="first">C</forename><surname>Hauff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hiemstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>De Jong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of CIKM &apos;08</title>
				<meeting>of CIKM &apos;08</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="1419" to="1420" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Predicting query performance</title>
		<author>
			<persName><forename type="first">S</forename><surname>Cronen-Townsend</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SIGIR &apos;02</title>
				<meeting>of SIGIR &apos;02</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="299" to="306" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Predicting query performance by query-drift estimation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Shtok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kurland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Carmel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Raiber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Markovits</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Information Systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Query performance prediction in web search environments</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SIGIR &apos;07</title>
				<meeting>of SIGIR &apos;07</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="543" to="550" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Information needs, queries, and query performance prediction</title>
		<author>
			<persName><forename type="first">O</forename><surname>Zendel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shtok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Raiber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kurland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Culpepper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SIGIR &apos;19</title>
				<meeting>of SIGIR &apos;19</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="395" to="404" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">An enhanced evaluation framework for query performance prediction</title>
		<author>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Zendel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Culpepper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Scholer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="115" to="129" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
