<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title/>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>On the Feasibility and Robustness of Pointwise Evaluation of Query Performance Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Suchana Datta</string-name>
          <email>suchana.datta@ucdconnect.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Debasis Ganguly</string-name>
          <email>debasis.ganguly@glasgow.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Derek Greene</string-name>
          <email>derek.greene@ucd.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mandar Mitra</string-name>
          <email>mandar@isical.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Statistical Institute</institution>
          ,
          <addr-line>Kolkata</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University College Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Glasgow</institution>
          ,
          <country country="GB">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Despite the retrieval effectiveness of queries being mutually independent of one another, the evaluation of query performance prediction (QPP) systems has been carried out by measuring rank correlation over an entire set of queries. Such a listwise approach has a number of disadvantages, notably that it does not support the common requirement of assessing QPP for individual queries. In this paper, we propose a pointwise QPP framework that allows us to evaluate the quality of a QPP system for individual queries by measuring the deviations between each prediction versus the corresponding true value, and then aggregating the results over a set of queries. Our experiments demonstrate that this new approach leads to smaller variances in QPP evaluations across a range of different target metrics and retrieval models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A major disadvantage of listwise QPP approaches is that evaluation is conducted in a relative
manner, so the performance of one query is measured relative to the others. However, a
downstream performance estimate of an individual query also needs to be evaluated independently
of the other queries. In contrast, a pointwise approach measures the effectiveness on individual
queries, and then, if required, aggregates the results over a complete set. This is analogous to
measuring the retrieval effectiveness metric MAP by computing the average precision values
for individual queries and then aggregating them. Pointwise evaluation also allows us to carry
out a per-query analysis of a method, often leading to useful insights. For instance, Buckley [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
found, by performing an extensive per-topic retrieval analysis, that it was possible to identify
queries where most IR systems fail to retrieve relevant documents. However, a listwise evaluation
methodology is not conducive to performing this kind of detailed per-query analysis.
      </p>
      <p>
        Another drawback of listwise methods is that they can be overly sensitive to the configuration
setup used for evaluation. The two most important such configurations are: i) the target retrieval
evaluation metric that induces a ground-truth ordering over the set of queries; ii) the retrieval
model used to obtain the top-k set of documents for QPP estimation. Indeed, variations in
these configurations can lead to both large standard deviations in the reported rank correlation
measures and significant differences in the relative ranks of various QPP systems [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. To address
the limitations of listwise methods, we propose a new QPP evaluation framework, Aggregated
Pointwise Absolute Errors (APAE), which is shown to not only be consistent with the existing
listwise approaches, but also to be more robust to changes in QPP experimental setup.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. A Framework for Pointwise QPP Evaluation</title>
      <p>Correlation with listwise ground-truth Before describing our new QPP evaluation
framework APAE, we begin by introducing the required notation. Formally, a QPP estimate is a
function of the form φ(Q, M_k(Q)) ↦ ℝ, where M_k(Q) is the set of top-k ranked documents
retrieved by an IR model M for a query Q ∈ 𝒬, a benchmark set of queries.</p>
      <p>
        For the purpose of listwise evaluation, for each Q ∈ 𝒬, we first compute the value of a target
IR evaluation metric, μ(Q), that reflects the quality of the retrieved list M_k(Q). The next step
uses these μ(Q) scores to induce a ground-truth ranking of the set 𝒬, or in other words, arrange
the queries by their decreasing (or increasing) μ(Q) values, i.e.,
𝒬_μ = {Q_i ∈ 𝒬 : μ(Q_i) &gt; μ(Q_{i+1}), ∀i = 1, . . . , |𝒬| − 1}
(1)
Similarly, the evaluation framework also yields a predicted ranking of the queries, where this
time the queries are sorted by the QPP estimated scores, i.e.,
𝒬_φ = {Q_i ∈ 𝒬 : φ(Q_i) &gt; φ(Q_{i+1}), ∀i = 1, . . . , |𝒬| − 1}
(2)
A listwise evaluation framework then computes the rank correlation between these two ordered
sets, σ(𝒬_μ, 𝒬_φ), where σ : ℝ^|𝒬| × ℝ^|𝒬| ↦ [0, 1] is a correlation measure, such as Kendall's τ.
Individual ground-truth In contrast to listwise evaluations, where the ground-truth takes the
form of an ordered set of queries, pointwise QPP evaluation involves making |𝒬| independent
comparisons. Each comparison is made between a query Q's predicted QPP score φ(Q) and its
retrieval effectiveness measure μ(Q), i.e.,
ρ(𝒬, φ, μ) =def (1/|𝒬|) ∑_{Q ∈ 𝒬} ρ(μ(Q), φ(Q))
(3)
      </p>
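<p>To make the listwise protocol concrete, the following sketch computes Kendall's τ between the orderings induced by μ and by φ. The query identifiers and score values are invented for illustration, and the helper is a minimal tie-free implementation rather than a library routine.</p>

```python
from itertools import combinations

def kendall_tau(x, y):
    """Minimal Kendall's tau over paired score lists (no tie handling)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s != 0:  # opposite signs: the pair is discordant
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# mu(Q): target metric values (e.g., AP@100); phi(Q): raw QPP estimates.
mu  = {"q1": 0.42, "q2": 0.17, "q3": 0.65, "q4": 0.30}
phi = {"q1": 1.10, "q2": 0.40, "q3": 2.30, "q4": 0.95}

queries = sorted(mu)
tau = kendall_tau([mu[q] for q in queries], [phi[q] for q in queries])
print(tau)  # 1.0 here: phi orders the queries exactly as mu does
```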
      <p>
        Unlike the rank correlation σ, here ρ is a pointwise correlation function of the form ρ : ℝ × ℝ ↦ ℝ.
It is often convenient to think of ρ as the inverse of a distance function that measures the extent
to which a predicted value deviates from the corresponding true value. In contrast to ground-truth
evaluation metrics, most QPP estimates (e.g., NQC, WIG etc.) are not bounded within [0, 1].
Therefore, to employ a distance measure, each QPP estimate φ(Q) must be normalized to the unit
interval. Subsequently, ρ can be defined as ρ(μ(Q), φ(Q)) =def 1 − |μ(Q) − φ(Q)/ℵ|, where ℵ is
a normalization constant chosen to be sufficiently large so that φ(Q)/ℵ lies within the unit interval.
      </p>
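<p>The pointwise score for a single query and a single metric can be sketched as below; the numeric values and the particular choice of ℵ are invented for illustration.</p>

```python
def pointwise_rho(mu_q, phi_q, aleph):
    """rho(mu(Q), phi(Q)) = 1 - |mu(Q) - phi(Q)/aleph|, per the definition above.

    aleph is a normalization constant chosen large enough that phi(Q)/aleph
    falls within the unit interval for every query in the benchmark.
    """
    return 1.0 - abs(mu_q - phi_q / aleph)

# Invented example: AP@100 of 0.5 for a query, a raw NQC-style score of 2.5,
# and aleph = 5.0 as an assumed upper bound on the raw QPP scores.
print(pointwise_rho(0.5, 2.5, 5.0))   # 1.0: the prediction matches perfectly
print(pointwise_rho(0.25, 2.5, 5.0))  # 0.75: a deviation of 0.25 is penalized
```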
      <sec id="sec-2-1">
        <title>Selecting an IR metric for pointwise QPP evaluation</title>
        <p>
          In general, an unsupervised QPP estimator will be agnostic with respect to the target IR metric μ. For instance, NQC scores can be
seen as being approximations of AP@100 values, but can also be interpreted as approximating
any other metric, such as nDCG@20 or P@10. Therefore, a question arises around which metric
should be used to compute the individual correlations in Equation 3. Of course, the results can
differ substantially for different choices of μ, e.g., AP or nDCG. This is also the case for listwise
QPP evaluation, as reported in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. To reduce the effect of such variations, we now propose a
simple yet effective solution.
        </p>
        <p>Metric-agnostic pointwise QPP evaluation For a set of evaluation functions μ ∈ ℳ (e.g.,
ℳ = {AP@100, nDCG@20, . . .}), we employ an aggregation function to compute the overall
pointwise correlation (Equation 3) of a QPP estimate with respect to each metric. Formally,
ρ(Q, ℳ, φ) = Σ_{μ ∈ ℳ} (1 − |μ(Q) − φ(Q)/ℵ|),
(4)
where Σ denotes an aggregation function (it does not indicate summation). In particular, we use
the most commonly-used such functions as choices for Σ: ‘minimum’, ‘maximum’, and ‘average’
– i.e., Σ ∈ {avg, min, max}. Next, we find the average over these values computed for a given set
of queries 𝒬, i.e., we substitute ρ(Q, ℳ, φ) from Equation 4 into the summation of Equation 3.</p>
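<p>Putting Equations 3 and 4 together, a minimal sketch of the metric-agnostic computation follows. The toy metric values, QPP scores and the value of ℵ are invented, and the helper name is ours, not code from the paper.</p>

```python
def apae(queries, metrics, phi, aleph, agg=lambda v: sum(v) / len(v)):
    """Aggregated Pointwise Absolute Errors (Equations 3 and 4).

    metrics maps each metric name to its per-query mu(Q) values; phi maps
    each query to its raw (unbounded) QPP score; agg is the aggregator Sigma
    (avg, min or max) applied over the per-metric pointwise scores.
    """
    per_query = []
    for q in queries:
        scores = [1.0 - abs(m[q] - phi[q] / aleph) for m in metrics.values()]
        per_query.append(agg(scores))          # Equation 4: aggregate over metrics
    return sum(per_query) / len(per_query)     # Equation 3: mean over queries

# Invented toy data: two target metrics, two queries, aleph = 5.0.
metrics = {
    "AP@100":  {"q1": 0.5, "q2": 0.25},
    "nDCG@20": {"q1": 0.5, "q2": 0.625},
}
phi = {"q1": 2.5, "q2": 2.5}
queries = ["q1", "q2"]

rho_avg = apae(queries, metrics, phi, 5.0)           # average aggregation
rho_min = apae(queries, metrics, phi, 5.0, agg=min)  # pessimistic aggregation
rho_max = apae(queries, metrics, phi, 5.0, agg=max)  # optimistic aggregation
print(rho_avg, rho_min, rho_max)  # 0.90625 0.875 0.9375
```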
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>
        A QPP experiment context [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] involves three configuration choices: i) the QPP method itself
that is used to predict the relative performance of queries; ii) the IR metric that is used to obtain
a ground-truth ordering of the query performances as measured on a set of top-k (k = 100 in
our experiments) documents retrieved by iii) a specific IR model. Table 1 summarizes the IR
models and metrics used in our experiments, along with the relevant hyper-parameter values. The
objective of our experiments is to investigate the following two key research questions:
• RQ1: Does APAE agree with the standard listwise correlation metrics?
• RQ2: How robust is APAE with respect to changes in the QPP experiment context?
      </p>
      <p>An affirmative answer to RQ1 would indicate that our proposed metric APAE is consistent
with existing metrics used for QPP evaluation, while an affirmative answer to RQ2 would suggest
that APAE is preferable to existing methods due to its higher stability with respect to different
experimental settings.</p>
      <sec id="sec-3-1">
        <title>Experimental setup</title>
        <p>
          We conduct our QPP experiments on the TREC Robust dataset, which consists of 249 topics.
Following the standard practice for QPP experiments [
          <xref ref-type="bibr" rid="ref12 ref5">5, 12</xref>
          ], we report results aggregated over a
total of 30 randomly chosen equal-sized train-test splits of the data. The training split of each
partition was used for tuning the hyper-parameters for the QPP method.
        </p>
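<p>The split protocol above can be sketched as follows; the topic labels and the helper name are ours for illustration.</p>

```python
import random

def make_splits(topics, n_splits=30, seed=0):
    """Draw random equal-sized train/test partitions of the topic set."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    splits = []
    for _ in range(n_splits):
        shuffled = list(topics)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        splits.append((shuffled[:half], shuffled[half:]))
    return splits

topics = [f"topic-{i}" for i in range(1, 250)]  # 249 TREC Robust topics
splits = make_splits(topics)
train, test = splits[0]
# Hyper-parameters are tuned on `train`; results are reported on `test`.
```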
      </sec>
      <sec id="sec-3-2">
        <title>Agreement between listwise and pointwise evaluation</title>
        <p>
          Firstly, we investigate the
          consistency of APAE with respect to three standard listwise QPP evaluation metrics: Pearson's r,
Spearman's ρ and Kendall's τ; and a pointwise approach, scaled Absolute Rank Error (sARE)
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Since sARE is an error measure, we measure correlations of APAE with 1 − sARE measures
(which for the sake of simplicity, we refer to as sARE in Table 2). We experiment with three
different instances of APAE obtained by substituting the aggregation functions – avg, min and
max as Σ in Equation 4, denoted respectively as ρ_avg(ℳ), ρ_min(ℳ) and ρ_max(ℳ).
        </p>
        <p>The results presented in Table 2 answer RQ1 in the affirmative. Each reported value here
corresponds to the rank correlation (Kendall's τ) between the relative ranks of the QPP systems
ordered by their effectiveness as computed via one of the standard metrics (one of r, ρ, τ or
sARE) and APAE, i.e., one of ρ_avg(ℳ), ρ_min(ℳ) and ρ_max(ℳ). The high correlation values
between the standard listwise and the proposed pointwise metrics show that APAE can be used
as a substitute for the standard listwise evaluation. Notably, we see that the average aggregate
function yields the best results, and hence for the subsequent experiments we use ρ_avg(ℳ) as the
pointwise evaluation metric.</p>
        <p>Table 2: The correlation of our proposed pointwise evaluation metric APAE with the standard listwise metrics
Pearson's r, Spearman's ρ, Kendall's τ and sARE. The rank correlations between each pair of QPP
system ranks (evaluated with a listwise measure and a pointwise measure) were computed with Kendall's τ.
The high values indicate that the pointwise measurement can effectively substitute a standard list-based
measure, since they lead to a fairly similar relative ordering between the effectiveness of different QPP
systems. [Table of correlation values for ρ_avg(ℳ), ρ_min(ℳ) and ρ_max(ℳ) against each listwise metric,
per retrieval model (BM25, LMDir, LMJM); the individual cell values are not recoverable from the source.]</p>
        <p>Table 3: Rank correlations between the relative ranks of QPP systems across different
pairs of IR metrics and IR models. Red cells indicate the lowest value in each group, while the lowest values
along each column are bold-faced.
(a) Correlations between the relative ranks of 7 different QPP systems across different pairs of IR target
metrics. QPP systems were evaluated with the baseline listwise metric – Kendall's τ.
(b) Similar to Table 3a, except QPP performance was evaluated with the pointwise approach APAE. A
comparison with Table 3a indicates a better consistency in the relative ranks of QPP systems for variations in
the IR metrics.
(c) Here rank correlations between the relative ranks of QPP systems are measured across IR model pairs.
As in Table 3a, QPP systems were evaluated with τ. The numbers alongside the IR models denote their
respective parameters.
(d) Unlike Table 3c, here the QPP outcomes were evaluated by APAE (instead of τ).
[Table rows cover BM25 (0.7, 0.3), BM25 (0.3, 0.7), LMDir (500), LMDir (1000) and LMJM (0.6);
the individual cell values are not recoverable from the source.]</p>
      </sec>
      <sec id="sec-3-3">
        <title>Variances in relative effectiveness of QPP methods</title>
        <p>To investigate RQ2, we consider
the relative stability of QPP system ranks for variations in QPP contexts (i.e., different IR models
and target metrics), comparing both listwise and pointwise approaches (see Table 3). To clarify
with an example, if working with three QPP methods, say AvgIDF, NQC, WIG, we observe
that τ(NQC) &gt; τ(WIG) &gt; τ(AvgIDF) for LMDir as measured relative to AP@100. We expect
to observe a similar ordering for a different choice of the IR model and target IR metric, say
BM25 with nDCG@100. As in our previous experiments, here we measure the rank correlations
between a total of seven QPP systems (see Table 1) via Kendall’s  .</p>
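<p>The notion of rank stability can be illustrated with a small sketch; the effectiveness values below are invented, and only three of the seven QPP systems are shown.</p>

```python
# QPP system effectiveness under two different experiment contexts
# (invented values for illustration).
ctx_lmdir_ap  = {"NQC": 0.41, "WIG": 0.35, "AvgIDF": 0.22}  # LMDir, AP@100
ctx_bm25_ndcg = {"NQC": 0.38, "WIG": 0.33, "AvgIDF": 0.25}  # BM25, nDCG@100

rank_a = sorted(ctx_lmdir_ap,  key=ctx_lmdir_ap.get,  reverse=True)
rank_b = sorted(ctx_bm25_ndcg, key=ctx_bm25_ndcg.get, reverse=True)
print(rank_a == rank_b)  # True: the system ordering survives the context change
```

A robust evaluation approach is one for which this agreement holds across many such context pairs, which is what Table 3 quantifies via Kendall's τ over the seven systems.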
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Concluding Remarks</title>
      <p>Unlike the standard listwise QPP evaluation mechanism of measuring an overall rank
correlation with respect to a reference ranking of the queries (in terms of retrieval effectiveness), we
have proposed a pointwise evaluation method that computes the relative difference between a
normalized QPP score and a true IR evaluation measure (e.g., AP@100 or nDCG@20). Our
experiments demonstrated that the proposed metric exhibits a high correlation with standard
listwise approaches and is more robust to changes in QPP experimental setup than listwise
evaluation measures. Using this metric, it should thus be possible to evaluate the effectiveness of
different QPP methods on downstream tasks on a per-query basis.</p>
      <p>Acknowledgement. The first and the third authors were supported by the Science Foundation Ireland
(SFI) grant number SFI/12/RC/2289_P2.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Ranking robustness: A novel framework to predict query performance</article-title>
          ,
          <source>in: Proc. of CIKM '06</source>
          ,
          <year>2006</year>
          , p.
          <fpage>567</fpage>
          -
          <lpage>574</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shtok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <article-title>Using statistical decision theory and relevance models for query-performance prediction</article-title>
          ,
          <source>in: Proc. of SIGIR '10</source>
          ,
          <year>2010</year>
          , p.
          <fpage>259</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <article-title>Adaptive relevance feedback in information retrieval</article-title>
          ,
          <source>in: Proc. of CIKM '09</source>
          ,
          <year>2009</year>
          , p.
          <fpage>255</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cummins</surname>
          </string-name>
          ,
          <article-title>Document score distribution models for query performance inference and prediction</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>32</volume>
          (
          <year>2014</year>
          )
          <fpage>2:1</fpage>
          -
          <lpage>2:28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <article-title>Neural query performance prediction using weak supervision from multiple signals</article-title>
          ,
          <source>in: Proc. of SIGIR '18</source>
          ,
          <year>2018</year>
          , p.
          <fpage>105</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <article-title>Why current IR engines fail</article-title>
          ,
          <source>in: Proc. of SIGIR'04</source>
          ,
          <year>2004</year>
          , p.
          <fpage>584</fpage>
          -
          <lpage>585</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Greene</surname>
          </string-name>
          ,
          <article-title>An analysis of variations in the effectiveness of query performance prediction</article-title>
          ,
          <source>in: Proc. of ECIR'22</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          , F. de Jong,
          <article-title>A survey of pre-retrieval query performance predictors</article-title>
          ,
          <source>in: Proc. of CIKM '08</source>
          ,
          <year>2008</year>
          , p.
          <fpage>1419</fpage>
          -
          <lpage>1420</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cronen-Townsend</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Predicting query performance</article-title>
          ,
          <source>in: Proc. of SIGIR '02</source>
          ,
          <year>2002</year>
          , p.
          <fpage>299</fpage>
          -
          <lpage>306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shtok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Raiber</surname>
          </string-name>
          , G. Markovits,
          <article-title>Predicting query performance by query-drift estimation</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>30</volume>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Query performance prediction in web search environments</article-title>
          ,
          <source>in: Proc. of SIGIR '07</source>
          ,
          <year>2007</year>
          , p.
          <fpage>543</fpage>
          -
          <lpage>550</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shtok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Raiber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <article-title>Information needs, queries, and query performance prediction</article-title>
          ,
          <source>in: Proc. of SIGIR '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>395</fpage>
          -
          <lpage>404</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Zendel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scholer</surname>
          </string-name>
          ,
          <article-title>An enhanced evaluation framework for query performance prediction</article-title>
          ,
          <source>in: Advances in Information Retrieval</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>