<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">On the Feasibility and Robustness of Pointwise Evaluation of Query Performance Prediction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Suchana</forename><surname>Datta</surname></persName>
							<email>suchana.datta@ucdconnect.ie</email>
							<affiliation key="aff0">
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Debasis</forename><surname>Ganguly</surname></persName>
							<email>debasis.ganguly@glasgow.ac.uk</email>
							<affiliation key="aff1">
								<orgName type="institution">University of Glasgow</orgName>
								<address>
									<country key="GB">UK</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Derek</forename><surname>Greene</surname></persName>
							<email>derek.greene@ucd.ie</email>
							<affiliation key="aff2">
								<orgName type="institution">University College Dublin</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mandar</forename><surname>Mitra</surname></persName>
							<email>mandar@isical.ac.in</email>
							<affiliation key="aff3">
								<orgName type="institution">Indian Statistical Institute</orgName>
								<address>
									<settlement>Kolkata</settlement>
									<country key="IN">India</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">On the Feasibility and Robustness of Pointwise Evaluation of Query Performance Prediction</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">7216E510C36329DA9E921E2AE31C6850</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-04-29T07:04+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Although the retrieval effectiveness of each query is independent of that of the other queries, the evaluation of query performance prediction (QPP) systems has typically been carried out by measuring rank correlation over an entire set of queries. Such a listwise approach has a number of disadvantages, notably that it does not support the common requirement of assessing QPP for individual queries. In this paper, we propose a pointwise QPP framework that allows us to evaluate the quality of a QPP system for individual queries by measuring the deviation of each prediction from the corresponding true value, and then aggregating the results over a set of queries. Our experiments demonstrate that this new approach leads to smaller variances in QPP evaluations across a range of different target metrics and retrieval models.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Query performance prediction (QPP) methods have been proposed to automatically estimate the retrieval effectiveness for queries without making use of any true relevance information (e.g. <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>). In practice, a QPP method allows us to dynamically adjust the processing steps for a query, depending on its initial performance estimate. Although estimating the performance of individual queries independently is a common requirement in many downstream tasks (e.g., adaptive query processing <ref type="bibr" target="#b2">[3]</ref>), the standard QPP evaluation methodology adopted by the IR research community has previously involved a listwise approach, rather than a pointwise one. This is despite the fact that the latter represents a more appropriate strategy for use in downstream applications. To elaborate, a listwise approach operates on a set of queries 𝒬 by first converting it into an ordered set as induced by the QPP estimated scores 𝜑(𝑄) ∀𝑄 ∈ 𝒬. It then computes a rank correlation measure, such as Kendall's 𝜏, between this predicted ordering and the ground-truth ordering of the queries, as induced by their average precision (AP) values <ref type="bibr" target="#b3">[4]</ref> or by any other IR metric, such as nDCG <ref type="bibr" target="#b4">[5]</ref>.</p><p>A major disadvantage of listwise QPP approaches is that evaluation is conducted in a relative manner, i.e., the performance of one query is measured only relative to the others. However, a downstream performance estimate of an individual query also needs to be evaluated independently of the other queries. In contrast, a pointwise approach measures the effectiveness on individual queries, and then, if required, aggregates the results over a complete set. 
This is analogous to measuring the retrieval effectiveness metric MAP by computing the average precision values for individual queries and then aggregating them. Pointwise evaluation also allows us to carry out a per-query analysis of a method, often leading to useful insights. For instance, Buckley <ref type="bibr" target="#b5">[6]</ref> found, by performing an extensive per-topic retrieval analysis, that there are queries for which most IR systems fail to retrieve relevant documents. However, a listwise evaluation methodology is not conducive to this kind of detailed per-query analysis.</p><p>Another drawback of listwise methods is that they can be overly sensitive to the configuration used for evaluation. The two most important such configuration choices are: i) the target retrieval evaluation metric that induces a ground-truth ordering over the set of queries; ii) the retrieval model used to obtain the top-𝑘 set of documents for QPP estimation. Indeed, variations in these configurations can lead to both large standard deviations in the reported rank correlation measures and significant differences in the relative ranks of various QPP systems <ref type="bibr" target="#b6">[7]</ref>. To address the limitations of listwise methods, we propose a new QPP evaluation framework, Aggregated Pointwise Absolute Errors (APAE), which we show to be not only consistent with existing listwise approaches, but also more robust to changes in the QPP experimental setup.</p></div>
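To make the listwise protocol described above concrete, the following sketch correlates the query ordering induced by per-query AP values with the ordering induced by a QPP estimator's scores. This is an illustrative sketch, not the authors' code: the helper `kendall_tau` (a tau-a without tie handling), the function names, and the toy scores are all assumptions.

```python
# Listwise QPP evaluation sketch (illustrative, not the paper's code):
# correlate the ordering induced by per-query AP values with the ordering
# induced by unbounded QPP scores, using Kendall's tau-a (no tie handling).
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a over two paired score lists (assumes no ties)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(xs) * (len(xs) - 1) / 2)

def listwise_qpp_eval(per_query_ap, qpp_scores):
    queries = sorted(per_query_ap)             # fix one common query order
    truth = [per_query_ap[q] for q in queries]
    preds = [qpp_scores[q] for q in queries]
    return kendall_tau(truth, preds)

# Toy values: one swapped pair (q2 vs. q4) out of six -> tau = (5 - 1) / 6.
per_query_ap = {"q1": 0.62, "q2": 0.15, "q3": 0.48, "q4": 0.05}
qpp_scores   = {"q1": 1.90, "q2": 0.40, "q3": 1.10, "q4": 0.70}
print(round(listwise_qpp_eval(per_query_ap, qpp_scores), 4))  # 0.6667
```

Note that the raw QPP scores here are unbounded; only their relative order matters to a listwise measure, which is precisely the limitation the pointwise framework below addresses.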
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">A Framework for Pointwise QPP Evaluation</head><p>Correlation with listwise ground-truth Before describing our new QPP evaluation framework APAE, we first introduce the required notation. Formally, a QPP estimate is a function of the form 𝜑(𝑄, 𝑀 𝑘 (𝑄)) ↦→ R, where 𝑀 𝑘 (𝑄) is the set of top-𝑘 ranked documents retrieved by an IR model 𝑀 for a query 𝑄 ∈ 𝒬, a benchmark set of queries.</p><p>For the purpose of listwise evaluation, for each 𝑄 ∈ 𝒬, we first compute the value of a target IR evaluation metric 𝜇(𝑄) that reflects the quality of the retrieved list 𝑀 𝑘 (𝑄). The next step uses these 𝜇(𝑄) scores to induce a ground-truth ranking of the set 𝒬; in other words, it arranges the queries by their decreasing (or increasing) 𝜇(𝑄) values, i.e.,</p><formula xml:id="formula_0">𝒬 𝜇 = {𝑄 𝑖 ∈ 𝒬 : 𝜇(𝑄 𝑖 ) &gt; 𝜇(𝑄 𝑖+1 ), ∀𝑖 = 1, . . . , |𝒬| − 1}<label>(1)</label></formula><p>Similarly, the evaluation framework also yields a predicted ranking of the queries, where this time the queries are sorted by the QPP estimated scores, i.e.,</p><formula xml:id="formula_1">𝒬 𝜑 = {𝑄 𝑖 ∈ 𝒬 : 𝜑(𝑄 𝑖 ) &gt; 𝜑(𝑄 𝑖+1 ), ∀𝑖 = 1, . . . , |𝒬| − 1}<label>(2)</label></formula><p>A listwise evaluation framework then computes the rank correlation 𝛾(𝒬 𝜇 , 𝒬 𝜑 ) between these two ordered sets, where 𝛾 : R |𝒬| × R |𝒬| ↦→ [0, 1] is a correlation measure, such as Kendall's 𝜏 .</p><p>Individual ground-truth In contrast to listwise evaluations, where the ground-truth takes the form of an ordered set of queries, pointwise QPP evaluation involves making |𝒬| independent comparisons. Each comparison is made between a query 𝑄's predicted QPP score 𝜑(𝑄) and its retrieval effectiveness measure 𝜇(𝑄), i.e.,</p><formula xml:id="formula_3">𝜂(𝒬, 𝜇, 𝜑) def = 1 |𝒬| ∑︁ 𝑄∈𝒬 𝜂(𝜇(𝑄), 𝜑(𝑄))<label>(3)</label></formula><p>Unlike the rank correlation 𝛾, here 𝜂 is a pointwise correlation function of the form 𝜂 : R × R ↦→ R. It is often convenient to think of 𝜂 as the inverse of a distance function that measures the extent to which a predicted value deviates from the corresponding true value. In contrast to ground-truth evaluation metrics, most QPP estimates (e.g., NQC, WIG) are not bounded within [0, 1]. Therefore, to employ a distance measure, each QPP estimate 𝜑(𝑄) must first be normalized to the unit interval. Subsequently, 𝜂 can be defined as</p><formula xml:id="formula_4">𝜂(𝜇(𝑄), 𝜑(𝑄)) def = 1 − |𝜇(𝑄) − 𝜑(𝑄)/ℵ|</formula><p>where ℵ is a normalization constant chosen to be sufficiently large that 𝜑(𝑄)/ℵ lies within [0, 1].</p><p>Selecting an IR metric for pointwise QPP evaluation In general, an unsupervised QPP estimator is agnostic with respect to the target IR metric 𝜇. For instance, NQC scores can be seen as approximations of AP@100 values, but they can equally be interpreted as approximating any other metric, such as nDCG@20 or P@10. Therefore, a question arises as to which metric should be used to compute the individual correlations in Equation <ref type="formula" target="#formula_3">3</ref>. Of course, the results can differ substantially for different choices of 𝜇, e.g., AP or nDCG. This is also the case for listwise QPP evaluation, as reported in <ref type="bibr" target="#b6">[7]</ref>. To reduce the effect of such variations, we now propose a simple yet effective solution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>QPP Methods: AvgIDF <ref type="bibr" target="#b7">[8]</ref>, Clarity <ref type="bibr" target="#b8">[9]</ref>, NQC <ref type="bibr" target="#b9">[10]</ref>, WIG <ref type="bibr" target="#b10">[11]</ref>, UEF(Clarity), UEF(NQC), UEF(WIG) <ref type="bibr" target="#b1">[2]</ref>. IR Metrics: AP@100, nDCG@100, P@10, Recall@100. IR Models: LMJM (𝜆 = 0.6), LMDir (𝜇 = 1000), BM25 ((𝑘, 𝑏) = (0.7, 0.3)).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Metric-agnostic pointwise QPP evaluation For a set of evaluation functions 𝜇 ∈ ℳ (e.g., ℳ = {AP@100, nDCG@20, . . .}), we employ an aggregation function to compute the overall pointwise correlation (Equation <ref type="formula" target="#formula_3">3</ref>) of a QPP estimate with respect to each metric. 
Formally,</p><formula xml:id="formula_5">𝜂(𝑄, ℳ, 𝜑) = Σ 𝜇∈ℳ (1 − |𝜇(𝑄) − 𝜑(𝑄)/ℵ|),<label>(4)</label></formula><p>where Σ denotes an aggregation function (it does not indicate summation). In particular, we use the three most common such functions as choices for Σ: 'minimum', 'maximum', and 'average', i.e., Σ ∈ {avg, min, max}. Next, we average these values over a given set of queries 𝒬, i.e., we substitute 𝜂(𝑄, ℳ, 𝜑) from Equation <ref type="formula" target="#formula_5">4</ref> into the summation of Equation <ref type="formula" target="#formula_3">3</ref>.</p></div>
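Equations 3 and 4 can be sketched as a short computation. The sketch below is an illustrative assumption, not the authors' implementation: the function name `apae`, the choice ℵ = 4, and the toy per-query values are invented for demonstration.

```python
# APAE sketch (Equations 3-4, illustrative values): normalize each QPP score
# by a large constant aleph, compute 1 - |mu(Q) - phi(Q)/aleph| for every
# target metric mu in M, aggregate across metrics with Sigma (min/max/avg),
# then average the per-query values over the query set.

def apae(metric_values, qpp_scores, aleph, aggregate):
    """metric_values: {metric_name: {query: mu(Q)}}; qpp_scores: {query: phi(Q)}."""
    per_query = []
    for q, phi in qpp_scores.items():
        # Equation 4: one agreement value per target metric for this query.
        agreements = [1 - abs(mu[q] - phi / aleph)
                      for mu in metric_values.values()]
        per_query.append(aggregate(agreements))
    # Equation 3: average the pointwise values over the query set.
    return sum(per_query) / len(per_query)

metric_values = {
    "AP@100":   {"q1": 0.62, "q2": 0.15},
    "nDCG@100": {"q1": 0.70, "q2": 0.22},
}
qpp_scores = {"q1": 1.9, "q2": 0.4}   # unbounded estimates (e.g. NQC-like)
for agg in (min, max, lambda v: sum(v) / len(v)):
    print(apae(metric_values, qpp_scores, aleph=4.0, aggregate=agg))
```

Passing `min`, `max`, or the average lambda as `aggregate` yields the three instances 𝜂 min (ℳ), 𝜂 max (ℳ) and 𝜂 avg (ℳ) used in the experiments.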
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>A QPP experiment context <ref type="bibr" target="#b6">[7]</ref> involves three configuration choices: i) the QPP method itself that is used to predict the relative performance of queries; ii) the IR metric that is used to obtain a ground-truth ordering of the query performances, as measured on a set of top-𝑘 (𝑘 = 100 in our experiments) documents retrieved by iii) a specific IR model. Table <ref type="table" target="#tab_0">1</ref> summarizes the IR models and metrics used in our experiments, along with the relevant hyper-parameter values. The objective of our experiments is to investigate the following two key research questions:</p><p>• RQ1: Does APAE agree with the standard listwise correlation metrics? • RQ2: How robust is APAE with respect to changes in the QPP experiment context? An affirmative answer to RQ1 would indicate that our proposed metric APAE is consistent with existing metrics used for QPP evaluation, while an affirmative answer to RQ2 would suggest that APAE is preferable to existing methods due to its higher stability with respect to different experimental settings.</p><p>We conduct our QPP experiments on the TREC Robust dataset, which consists of 249 topics. Following the standard practice for QPP experiments <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b11">12]</ref>, we report results aggregated over a total of 30 randomly chosen equal-sized train-test splits of the data. The training split of each partition was used for tuning the hyper-parameters of each QPP method.</p><p>Agreement between listwise and pointwise evaluation First, we investigate the consistency of APAE with respect to three standard listwise QPP evaluation metrics: Pearson's 𝑟, Spearman's 𝜌 and Kendall's 𝜏 ; and a pointwise approach, scaled Absolute Rank Error (sARE) <ref type="bibr" target="#b12">[13]</ref>. 
Since sARE is an error measure, we compute correlations of APAE with 1 − sARE (which, for simplicity, we refer to as sARE in Table <ref type="table">2</ref>). We experiment with three different instances of APAE, obtained by substituting the aggregation functions avg, min and max for Σ in Equation <ref type="formula" target="#formula_5">4</ref>; these are denoted as 𝜂 avg (ℳ), 𝜂 min (ℳ) and 𝜂 max (ℳ), respectively.</p><p>The results presented in Table <ref type="table">2</ref> answer RQ1 in the affirmative. Each reported value corresponds to the rank correlation (Kendall's 𝜏 ) between the relative ranks of the QPP systems ordered by their effectiveness as computed via one of the standard metrics (𝑟, 𝜌, 𝜏 or sARE) and via APAE (one of 𝜂 avg (ℳ), 𝜂 min (ℳ) and 𝜂 max (ℳ)). The high correlation values between the standard listwise metrics and the proposed pointwise metric show that APAE can be used as a substitute for standard listwise evaluation. Notably, the average aggregation function yields the best results; hence, for the subsequent experiments, we use 𝜂 avg (ℳ) as the pointwise evaluation metric.</p></div>
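The protocol behind this comparison can be sketched as follows: score each QPP system once with a listwise measure and once with APAE, then correlate the two induced system rankings. The numeric scores below are invented for illustration and are not the paper's results.

```python
# Rank-of-ranks agreement sketch (toy scores, not the paper's results):
# if a listwise measure and APAE order the seven QPP systems the same way,
# Kendall's tau between the two system rankings is 1.0.
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a over two paired score lists (assumes no ties)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        concordant += s > 0
        discordant += s < 0
    return (concordant - discordant) / (len(xs) * (len(xs) - 1) / 2)

systems = ["AvgIDF", "Clarity", "NQC", "WIG",
           "UEF(Clarity)", "UEF(NQC)", "UEF(WIG)"]
listwise_scores = [0.31, 0.42, 0.55, 0.50, 0.46, 0.60, 0.58]  # e.g. tau
apae_scores     = [0.70, 0.76, 0.84, 0.82, 0.78, 0.86, 0.85]  # e.g. APAE
print(kendall_tau(listwise_scores, apae_scores))  # 1.0: identical orderings
```

Here the two score lists induce the same ordering of the seven systems, so their rank correlation is 1.0; the values reported in Table 2 are exactly this kind of rank-of-ranks correlation, averaged over experimental configurations.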
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>The correlation of our proposed pointwise evaluation metric APAE with the standard listwise metrics: Pearson's 𝑟, Spearman's 𝜌, Kendall's 𝜏 and sARE. The rank correlations between each pair of QPP system ranks (evaluated with a listwise measure and a pointwise measure) were computed with Kendall's 𝜏. The high values indicate that the pointwise measurement can effectively substitute a standard list-based measure, since they lead to a fairly similar relative ordering between the effectiveness of different QPP methods.</p><p>Columns: 𝜂 avg (ℳ), 𝜂 min (ℳ), 𝜂 max (ℳ); rows: 𝑟, 𝜌, 𝜏.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Stability of the proposed pointwise QPP metric APAE compared to the listwise approach, across different pairs of IR metrics and IR models. Red cells indicate the lowest value in each group, while the lowest values along each column are bold-faced.</p><p>(c) Rank correlations between the relative ranks of QPP systems, measured across pairs of IR models. As in Table <ref type="table">3a</ref>, QPP systems were evaluated with 𝜏. The numbers alongside the IR models denote their respective parameters.</p><p>(d) Unlike Table <ref type="table">3c</ref>, here the QPP outcomes were evaluated by APAE (instead of 𝜏). Columns: LMJM (0.6), BM25 (0.7, 0.3), BM25 (0.3, 0.7), LMDir (500), LMDir (1000). LMJM (0.3): AP@100: 1.000, 1.000, 1.000, 1.000, 1.000; nDCG@100: 1.000, 0.864, 1.000, 0.843, 0.864; R@100: 1.000, 0.864, 1.000, 1.000, 1.000. LMJM (0.6): AP@100: 1.000, 1.000, 1.000, 1.000; nDCG@100: 0.914, 1.000, 0.813, 0.914; R@100: 1.000, 1.000, 1.000, 1.000. BM25 (0.7, 0.3): AP@100: 1.000, 1.000, 1.000; nDCG@100: 1.000, 1.000, 1.000; R@100: 0.812, 0.905, 1.000. BM25 (0.3, 0.7): AP@100: 1.000, 1.000; nDCG@100: 1.000, 1.000; R@100: 1.000, 1.000. LMDir (500): AP@100: 1.000; nDCG@100: 1.000; R@100: 1.000.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Variances in relative effectiveness of QPP methods</head><p>To investigate RQ2, we consider the relative stability of QPP system ranks for variations in QPP contexts (i.e., different IR models and target metrics), comparing both listwise and pointwise approaches (see Table <ref type="table">3</ref>). To clarify with an example, suppose that, for three QPP methods AvgIDF, NQC and WIG, we observe that 𝜏 (NQC) &gt; 𝜏 (WIG) &gt; 𝜏 (AvgIDF) for LMDir, as measured relative to AP@100. A stable evaluation metric should then yield a similar ordering for a different choice of IR model and target IR metric, say BM25 with nDCG@100. As in our previous experiments, here we measure the rank correlations between a total of seven QPP systems (see Table <ref type="table" target="#tab_0">1</ref>) via Kendall's 𝜏 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Concluding Remarks</head><p>Unlike the standard listwise QPP evaluation mechanism of measuring an overall rank correlation with respect to a reference ranking of the queries (in terms of retrieval effectiveness), we have proposed a pointwise evaluation method that computes the relative difference between a normalized QPP score and a true IR evaluation measure (e.g., AP@100 or nDCG@20). Our experiments demonstrated that the proposed metric exhibits a high correlation with standard listwise approaches and is more robust to changes in QPP experimental setup than listwise evaluation measures. Using this metric, it should thus be possible to evaluate the effectiveness of different QPP methods on downstream tasks on a per-query basis.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>QPP configurations -(QPP method, IR metric, and models) used to measure variations.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>Similar to Table 3a, except QPP performance was evaluated with the pointwise approach APAE. A comparison with Table 3a indicates a better consistency in the relative ranks of QPP systems for variations in the IR metrics.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">Model Metric</cell><cell cols="3">AP@100 R@10 R@100 nDCG@10 nDCG@100</cell></row><row><cell></cell><cell></cell><cell cols="2">0.497 0.813 0.429</cell><cell>0.783</cell><cell>0.429</cell><cell>LMJM</cell><cell></cell><cell>0.904 1.000 0.715</cell><cell>1.000</cell><cell>0.792</cell></row><row><cell>BM25</cell><cell></cell><cell cols="2">0.897 0.722 0.722</cell><cell>0.793</cell><cell>0.793</cell><cell>BM25</cell><cell>AP@10</cell><cell>1.000 1.000 1.000</cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMDir</cell><cell></cell><cell cols="2">0.897 0.786 0.786</cell><cell>0.823</cell><cell>0.905</cell><cell>LMDir</cell><cell></cell><cell>1.000 1.000 1.000</cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMJM</cell><cell></cell><cell></cell><cell>0.328 0.811</cell><cell>0.363</cell><cell>0.783</cell><cell>LMJM</cell><cell></cell><cell>0.905 0.811</cell><cell>0.669</cell><cell>1.000</cell></row><row><cell>BM25</cell><cell cols="2">AP@100</cell><cell>0.783 0.784</cell><cell>0.714</cell><cell>0.642</cell><cell>BM25</cell><cell>AP@100</cell><cell>1.000 1.000</cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMDir</cell><cell></cell><cell></cell><cell>0.823 0.901</cell><cell>0.834</cell><cell>0.789</cell><cell>LMDir</cell><cell></cell><cell>1.000 
1.000</cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMJM</cell><cell></cell><cell></cell><cell>0.624</cell><cell>0.893</cell><cell>0.503</cell><cell>LMJM</cell><cell></cell><cell>0.603</cell><cell>0.905</cell><cell>0.542</cell></row><row><cell>BM25</cell><cell>R@10</cell><cell></cell><cell>0.803</cell><cell>0.982</cell><cell>0.894</cell><cell>BM25</cell><cell>R@10</cell><cell>1.000</cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMDir</cell><cell></cell><cell></cell><cell>0.903</cell><cell>0.864</cell><cell>0.864</cell><cell>LMDir</cell><cell></cell><cell>1.000</cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMJM</cell><cell></cell><cell></cell><cell></cell><cell>0.852</cell><cell>0.804</cell><cell>LMJM</cell><cell></cell><cell></cell><cell>0.654</cell><cell>1.000</cell></row><row><cell>BM25</cell><cell>R@100</cell><cell></cell><cell></cell><cell>0.786</cell><cell>0.890</cell><cell>BM25</cell><cell>R@100</cell><cell></cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMDir</cell><cell></cell><cell></cell><cell></cell><cell>0.738</cell><cell>0.738</cell><cell>LMDir</cell><cell></cell><cell></cell><cell>1.000</cell><cell>1.000</cell></row><row><cell>LMJM</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>0.537</cell><cell>LMJM</cell><cell></cell><cell></cell><cell></cell><cell>0.649</cell></row><row><cell>BM25</cell><cell cols="2">nDCG@10</cell><cell></cell><cell></cell><cell>0.904</cell><cell>BM25</cell><cell>nDCG@10</cell><cell></cell><cell></cell><cell>1.000</cell></row><row><cell>LMDir</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell>0.868</cell><cell>LMDir</cell><cell></cell><cell></cell><cell></cell><cell>1.000</cell></row><row><cell cols="7">(a) Correlations between the relative ranks of 7 different QPP systems across different pairs of IR target met-rics. QPP systems were evaluated with the baseline listwise metric -Kendall's 𝜏 . 
(b) Metric LMJM BM25 BM25 LMDir LMDir Model (0.6) (0.7, 0.3) (0.3, 0.7) (500) (1000)</cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">AP@100</cell><cell></cell><cell>0.826 0.904</cell><cell cols="2">0.819 0.714 0.895</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="4">nDCG@100 LMJM 0.780 0.694</cell><cell cols="2">0.695 0.759 0.759</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">R@100</cell><cell>(0.3)</cell><cell>0.824 0.769</cell><cell cols="2">0.782 0.904 0.904</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">AP@100</cell><cell></cell><cell>0.703</cell><cell cols="2">0.712 0.904 0.823</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="3">nDCG@100 LMJM</cell><cell>0.781</cell><cell cols="2">0.827 0.811 0.811</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">R@100</cell><cell>(0.6)</cell><cell>0.813</cell><cell cols="2">0.725 0.731 0.675</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">AP@100</cell><cell></cell><cell></cell><cell cols="2">0.903 0.785 0.785</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="3">nDCG@100 BM25</cell><cell></cell><cell cols="2">0.897 0.786 0.786</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">R@100</cell><cell>(0.7, 0.3)</cell><cell></cell><cell cols="2">0.812 0.752 0.779</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">AP@100</cell><cell></cell><cell></cell><cell></cell><cell>0.887 0.882</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="3">nDCG@100 BM25</cell><cell></cell><cell></cell><cell>0.901 0.895</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">R@100</cell><cell>(0.3, 0.7)</cell><cell></cell><cell></cell><cell>0.889 
0.901</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">AP@100</cell><cell></cell><cell></cell><cell></cell><cell>0.901</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="3">nDCG@100 LMDir</cell><cell></cell><cell></cell><cell>0.893</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">R@100</cell><cell>(500)</cell><cell></cell><cell></cell><cell>0.903</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="6">(c) Here rank correlations between the relative ranks of</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="6">QPP systems are measured across IR model pairs.</cell><cell></cell><cell></cell><cell></cell><cell></cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgement. The first and third authors were supported by the Science Foundation Ireland (SFI) grant number SFI/12/RC/2289_P2.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Ranking robustness: A novel framework to predict query performance</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of CIKM &apos;06</title>
				<meeting>of CIKM &apos;06</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="567" to="574" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Using statistical decision theory and relevance models for query-performance prediction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Shtok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kurland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Carmel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SIGIR &apos;10</title>
				<meeting>of SIGIR &apos;10</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="259" to="266" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Adaptive relevance feedback in information retrieval</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of CIKM &apos;09</title>
				<meeting>of CIKM &apos;09</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="255" to="264" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Document score distribution models for query performance inference and prediction</title>
		<author>
			<persName><forename type="first">R</forename><surname>Cummins</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Information Systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page">28</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Neural query performance prediction using weak supervision from multiple signals</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zamani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Culpepper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SIGIR &apos;18</title>
				<meeting>of SIGIR &apos;18</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="105" to="114" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Why current IR engines fail</title>
		<author>
			<persName><forename type="first">C</forename><surname>Buckley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SIGIR&apos;04</title>
				<meeting>of SIGIR&apos;04</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="584" to="585" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">An analysis of variations in the effectiveness of query performance prediction</title>
		<author>
			<persName><forename type="first">D</forename><surname>Ganguly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Datta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Greene</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of ECIR&apos;22</title>
				<meeting>of ECIR&apos;22</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="215" to="229" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A survey of pre-retrieval query performance predictors</title>
		<author>
			<persName><forename type="first">C</forename><surname>Hauff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hiemstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>De Jong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of CIKM &apos;08</title>
				<meeting>of CIKM &apos;08</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="1419" to="1420" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Predicting query performance</title>
		<author>
			<persName><forename type="first">S</forename><surname>Cronen-Townsend</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SIGIR &apos;02</title>
				<meeting>of SIGIR &apos;02</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="299" to="306" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Predicting query performance by query-drift estimation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Shtok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kurland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Carmel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Raiber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Markovits</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Information Systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Query performance prediction in web search environments</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SIGIR &apos;07</title>
				<meeting>of SIGIR &apos;07</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="543" to="550" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Information needs, queries, and query performance prediction</title>
		<author>
			<persName><forename type="first">O</forename><surname>Zendel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shtok</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Raiber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Kurland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Culpepper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of SIGIR &apos;19</title>
				<meeting>of SIGIR &apos;19</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="395" to="404" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">An enhanced evaluation framework for query performance prediction</title>
		<author>
			<persName><forename type="first">G</forename><surname>Faggioli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Zendel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Culpepper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Scholer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Information Retrieval</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="115" to="129" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
