On the Reproducibility and Robustness of Query Performance Prediction Experiments - An Extended Abstract

Suchana Datta 1, Debasis Ganguly 2, Mandar Mitra 3 and Derek Greene 1
1 University College Dublin, Ireland
2 University of Glasgow, United Kingdom
3 Indian Statistical Institute, India

CIRCLE’22: Joint Conference of the Information Retrieval Communities in Europe, July 04–07, 2022, Samatan, Gers, France
suchana.datta@ucdconnect.ie (S. Datta); debasis.ganguly@glasgow.ac.uk (D. Ganguly); mandar@isical.ac.in (M. Mitra); derek.greene@ucd.ie (D. Greene)
https://ucdcs-research.ucd.ie/phd-student/suchana-datta/ (S. Datta); https://gdebasis.github.io/ (D. Ganguly); https://www.isical.ac.in/~mandar/ (M. Mitra); http://derekgreene.com/ (D. Greene)

Query performance prediction (QPP), i.e. the process of estimating the retrieval quality of an IR system, has attracted the attention of the IR research community for several years. A diverse range of pre-retrieval (e.g. AvgIDF) and post-retrieval (e.g. WIG, NQC, UEF) approaches have been proposed for QPP. Specifically, given a query and an IR system, a QPP method computes a score that is indicative of the effectiveness of the system for that query. While this score is typically not interpreted as a statistical estimate of a specific evaluation metric (e.g. AP or nDCG), it is expected to be correlated with a standard evaluation measure computed over a ground-truth set of assessed relevant documents. In practice, the effectiveness of a QPP method is therefore determined by measuring the correlation between its predicted effectiveness scores and the values of a standard evaluation metric over a set of queries.

This abstract summarizes our ECIR 2022 research article ‘An Analysis of Variations in the Effectiveness of Query Performance Prediction’ [1], where we analysed the relative stability of QPP outcomes (rank correlations) with respect to changes in the IR models used to derive the top-retrieved documents, or in the IR evaluation metrics used to order a given set of queries from easy to difficult. Based on our findings, we emphasize in this abstract that such variations in QPP results (both in terms of the absolute correlation values and in terms of the relative effectiveness of different QPP systems) can make it difficult to reproduce QPP experimental results on standard datasets.

We now summarize the research questions and the findings of our study [1]. The context of a QPP experiment depends on three factors: i) the list of top-𝜅 retrieved documents, ii) the IR model or scoring function used to derive this list, and iii) the IR evaluation metric (e.g., average precision, AP) used to induce an ordering over the set of queries (e.g., from low AP to high AP, i.e. from difficult to easy queries). The first research question (RQ1) investigated in our paper [1] is: ‘Do variations in the QPP contexts lead to significant differences in measured QPP outcomes?’. The second research question (RQ2) is: ‘Do variations in the QPP contexts lead to significant differences in the relative effectiveness of different QPP methods?’.
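To make the evaluation protocol described above concrete, below is a minimal sketch of how the effectiveness of a QPP method would be measured as a rank correlation (Kendall's tau) between its predicted scores and per-query AP values. The sketch is written in Python using scipy.stats.kendalltau; the query identifiers and all numeric values are illustrative placeholders, not data from our study.

    # Minimal sketch: QPP effectiveness as the rank correlation between predicted
    # per-query scores and a standard IR metric (here, AP). All values are
    # illustrative placeholders rather than results from the study.
    from scipy.stats import kendalltau

    predicted_score = {"q301": 0.82, "q302": 0.15, "q303": 0.47, "q304": 0.63}
    ground_truth_ap = {"q301": 0.55, "q302": 0.08, "q303": 0.21, "q304": 0.49}

    queries = sorted(predicted_score)  # align both value lists on the same query order
    tau, p_value = kendalltau(
        [predicted_score[q] for q in queries],
        [ground_truth_ap[q] for q in queries],
    )
    print(f"QPP effectiveness (Kendall's tau) = {tau:.4f}, p = {p_value:.4f}")

Repeating this computation with a different ground-truth metric (e.g. nDCG or recall in place of AP), or with documents retrieved by a different IR model, yields a different QPP context in the sense used below.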
To investigate the above research questions in [1], we conducted QPP experiments (implementation available at: https://github.com/suchanadatta/qpp-eval.git) on the widely-used TREC-Robust dataset, which consists of 249 queries. To obtain diverse QPP contexts for our experiments, we tried a number of different combinations of IR models and IR evaluation metrics. Specifically, as IR models we employed a) language modeling with Jelinek-Mercer smoothing (LMJM), b) language modeling with Dirichlet smoothing (LMDir), and c) BM25. As choices for the IR evaluation metric, we considered a) AP, b) nDCG, c) P@10, and d) recall. We compared seven different QPP methods in our experiments, namely a) AvgIDF, b) Clarity, c) WIG, d) NQC, and three variants of UEF derived from three different base QPP models - e) UEF(Clarity), f) UEF(WIG) and g) UEF(NQC).

We now summarize the main findings of our experimental study (for more details, see [1]).

• Observations related to RQ1 (differences in QPP outcomes):
  – With NQC as the QPP method, we observed that LMJM yielded the highest deviations in QPP outcomes (rank correlation computed by 𝜏) across the 4 different IR metrics used to obtain the QPP ground-truth (a reference ordering of query difficulty). The difference between the highest and the lowest rank correlation values was significant (0.3657 with recall vs. 0.2061 with P@10). This indicates that QPP methods do not generalize well across different IR metrics. In other words, it is hard to consistently predict which queries are easy and which ones are difficult across different notions of how query difficulty itself is defined (e.g., via a precision-oriented or a recall-oriented measure).
  – With NQC as the QPP method, we also observed significant differences between the highest and the lowest QPP outcomes across different IR models. The largest difference in rank correlation was recorded between BM25 (𝜏 = 0.3563) and LMDir (𝜏 = 0.4354), with the QPP ground-truth defined with respect to AP. This shows that the effectiveness of a QPP method is also not consistent across IR models. In other words, it is not easy to predict which queries are easy and which ones are difficult consistently well for different IR systems.

• Observations related to RQ2 (differences in the relative effectiveness of QPP methods):
  – We observed that the relative ranks of QPP systems (ordered by the effectiveness measure 𝜏) changed significantly across different IR metrics. The highest disagreement in the ranks was observed between AP@10 and recall@1000. This indicates that the best QPP method for predicting the query difficulty induced by AP@10 may not be the best for predicting query performance with respect to recall@1000.
  – Similar trends were also observed for differences in the choice of IR model in the QPP context. The highest disagreements in the relative performance of QPP methods were observed for the P@10 metric across LMJM and LMDir. This indicates that the best QPP method for predicting the retrieval effectiveness of LMJM may not be the best one when it comes to predicting the system performance of LMDir.

The main takeaway from this extended abstract is that, since our extensive investigation in [1] has shown that QPP outcomes are indeed sensitive to the experimental setup used, any future QPP experiment should clearly specify its experimental setup to ensure better reproducibility.
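Since NQC features prominently in the observations above, the following is a minimal sketch of this post-retrieval predictor under its standard formulation: the standard deviation of the top-𝜅 retrieval scores, normalized by the score of the whole collection treated as a single document. This is our own illustration rather than code from the paper, and all numeric values are placeholders.

    # Minimal sketch of the NQC (Normalized Query Commitment) post-retrieval predictor:
    # the standard deviation of the top-k retrieval scores, normalized by the score of
    # the collection treated as a single document. The numbers are placeholders; in
    # practice the scores would come from the IR model under study (e.g. LMDir or BM25).
    import statistics

    def nqc(top_k_scores, collection_score):
        # Greater dispersion of the top-k scores, relative to the collection score,
        # is taken as evidence of a better-separated (and hence more effective) ranking.
        return statistics.pstdev(top_k_scores) / collection_score

    top_scores = [12.4, 11.9, 11.1, 10.2, 9.8]  # retrieval scores of the top-5 documents
    print(f"NQC = {nqc(top_scores, collection_score=8.5):.4f}")

Whether such a predictor turns out to be effective depends on the QPP context: the same predicted values correlate differently with AP, nDCG, P@10 and recall, and differ across LMJM, LMDir and BM25, which is precisely the variability reported above.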
Acknowledgement

The first and the fourth authors were supported by the Science Foundation Ireland (SFI) grant number SFI/12/RC/2289_P2.

References

[1] D. Ganguly, S. Datta, M. Mitra, D. Greene, An analysis of variations in the effectiveness of query performance prediction, in: Proc. of ECIR’22, 2022, pp. 215–229.