Performance Prediction for Conversational Search Using Perplexities of Query Rewrites

Chuan Meng (c.meng@uva.nl), University of Amsterdam, The Netherlands

Mohammad Aliannejadi (m.aliannejadi@uva.nl), University of Amsterdam, The Netherlands

Maarten de Rijke (m.derijke@uva.nl), University of Amsterdam, The Netherlands

ISSN: 1613-0073

Keywords: query performance prediction, conversational search, perplexity

We consider the query performance prediction (QPP) task for conversational search (CS), i.e., estimating the retrieval quality for queries in multi-turn conversations. We reuse QPP methods from ad-hoc search for CS by feeding them self-contained query rewrites generated by T5. Our experiments on three CS datasets show that (i) lower query rewriting quality may lead to worse QPP performance, and (ii) incorporating query rewriting quality (as measured by perplexity) improves the effectiveness of QPP methods for CS when the query rewriting quality is limited. Our implementation is publicly available at https://github.com/ChuanMeng/QPP4CS.

Introduction

We consider the task of query performance prediction (QPP) [1,2] for conversational search (CS) [3], i.e., estimating the retrieval quality for a query in a multi-turn conversation. Little research has been done into QPP for CS. A unique aspect of CS is that each conversational query may contain omissions or coreferences, making it hard for ad-hoc search systems or QPP methods to capture the underlying information need. A popular two-stage CS pipeline [3] can effectively solve this issue by (i) rewriting a conversational query into a self-contained query, and (ii) reusing ad-hoc search systems fed with the query rewrite.

Inspired by the two-stage pipeline, we model QPP for CS by feeding query rewrites to QPP methods designed for ad-hoc search. However, our experiments on CS datasets show that low-quality query rewrites reduce the effectiveness of QPP methods. Based on the fact that lower query rewriting quality tends to result in lower retrieval quality, we argue that query rewriting quality provides evidence for estimating retrieval quality. To incorporate query rewriting quality into QPP methods, we propose a perplexity-based pre-retrieval QPP framework (PPL-QPP) for CS. PPL-QPP first evaluates the quality of a query rewrite by its perplexity measured by a pre-trained language model, and then combines the perplexity with a state-of-the-art pre-retrieval QPP method [2]. Experiments show that PPL-QPP improves the effectiveness of QPP methods in the context of CS in cases when the query rewriting quality is limited.

Table 1: Performance of QPP methods on three CS datasets, in terms of Pearson's 𝜌, Kendall's 𝜏, and Spearman's 𝜌 correlation coefficients. IDF, PMI, SCQ, and VAR are defined for a single query term, so aggregation functions over terms are needed. We report the performance of each method using the optimal aggregation function on each dataset; the aggregation functions used by each method on CAsT-19, CAsT-20, and OR-QuAC are listed sequentially in the brackets. All values are statistically significant (t-test, 𝑝 < 0.05) except the ones in italics. The best value in each column is marked in bold.
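To make the role of the aggregation functions mentioned in the Table 1 caption concrete, the following is a minimal sketch of a term-level pre-retrieval predictor (IDF) aggregated over the terms of a query rewrite. The toy collection, the whitespace tokenizer, the smoothed IDF formula, and all function names are illustrative assumptions; they are not the configuration used in the experiments.

```python
import math
from statistics import mean

# Toy collection (an assumption for illustration only).
DOCS = [
    "the throat cancer symptoms include a sore throat",
    "lung cancer is often caused by smoking",
    "a sore throat is usually caused by a virus",
    "the netherlands is a country in europe",
]

def tokenize(text):
    # Hypothetical whitespace tokenizer; a real system would use a proper analyzer.
    return text.lower().split()

def idf(term, docs):
    # Smoothed inverse document frequency: log((N + 1) / (df + 1)).
    n_docs = len(docs)
    df = sum(1 for d in docs if term in set(tokenize(d)))
    return math.log((n_docs + 1) / (df + 1))

# Aggregation functions that turn per-term scores into one score per query.
AGGREGATORS = {"avg": mean, "max": max, "sum": sum}

def idf_predictor(query, docs, agg="avg"):
    scores = [idf(t, docs) for t in tokenize(query)]
    if not scores:
        return 0.0  # empty query: no evidence either way
    return AGGREGATORS[agg](scores)

if __name__ == "__main__":
    for agg in ("avg", "max", "sum"):
        print(agg, idf_predictor("netherlands cancer", DOCS, agg))
```

The same pattern would apply to the other term-level predictors: only the per-term scoring function changes, while the choice of aggregation is treated as a per-dataset setting.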

Experiments

Experimental setup. We use seven widely used pre-retrieval QPP methods [2] on three CS datasets: CAsT-19 [4], CAsT-20 [4], and OR-QuAC [5]. The retriever to be evaluated by the QPP methods is a T5-based query rewriter¹ + BM25, a widely used CS method [3]. The T5-generated query rewrites used by BM25 are fed into all QPP methods. We evaluate QPP methods by calculating the correlation between the NDCG@3 scores of the queries in the test set and the estimated retrieval quality. Note that NDCG@3 is the primary metric in CAsT [4,6].
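The evaluation protocol above can be sketched as follows: given per-query NDCG@3 scores and the corresponding QPP estimates, compute the three correlation coefficients reported in Table 1. This is a dependency-free sketch; the sample values are made up for illustration, and the tie handling (average ranks, Kendall's 𝜏-b) is an assumption, since the text does not specify it.

```python
import math

def pearson(x, y):
    # Pearson correlation between two equally long lists of numbers.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    # Rank values from 1..n; tied values share their average rank.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Spearman correlation is Pearson correlation computed on ranks.
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    # Kendall's tau-b over all query pairs, with a correction for ties.
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt((pairs + ties_x) * (pairs + ties_y))

if __name__ == "__main__":
    ndcg_at_3 = [0.0, 0.3, 0.5, 1.0, 0.8]  # hypothetical per-query NDCG@3
    predicted = [0.1, 0.2, 0.6, 0.9, 0.7]  # hypothetical QPP estimates
    print(pearson(ndcg_at_3, predicted))
    print(kendall_tau_b(ndcg_at_3, predicted))
    print(spearman(ndcg_at_3, predicted))
```

In practice, `scipy.stats.pearsonr`, `scipy.stats.kendalltau`, and `scipy.stats.spearmanr` compute the same coefficients and also return p-values.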

Performance of QPP methods for CS. Experimental results are presented in Table 1. Our main observation is that the overall performance of QPP methods on CAsT-19 and OR-QuAC is better than on CAsT-20. The difference in results seems to be due to the difference in query rewriting quality on the three datasets. We measure query rewriting quality using the similarity between manual and T5-generated query rewrites in terms of ROUGE, and the BM25 retrieval quality gap between using manual and T5-generated query rewrites. Fig. 1a shows that the ROUGE scores on CAsT-20 are lower than those on CAsT-19 and OR-QuAC; Fig. 1b shows that the retrieval quality gap is larger on CAsT-20 than on CAsT-19. We conclude that the quality of T5-generated query rewrites is lower on CAsT-20 than on the other datasets and that lower query rewriting quality may lead to worse QPP effectiveness.
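As an illustration of how the similarity between a manual and a generated rewrite could be scored, here is a minimal sketch of a unigram-overlap ROUGE score (ROUGE-1 F1). The text does not state which ROUGE variant is used, so the choice of ROUGE-1, the whitespace tokenization, and the example queries are all assumptions.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    # Unigram overlap between a reference (manual) and candidate (generated) rewrite.
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Hypothetical example: the generated rewrite fails to resolve the coreference.
    manual = "what are the symptoms of throat cancer"
    generated = "what are the symptoms of it"
    print(rouge1_f1(manual, generated))
```

A rewrite that leaves a coreference unresolved, as in the hypothetical example, loses exactly the terms that carry the information need, which is why a lower score is read here as lower rewriting quality.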

Incorporating query rewriting quality into QPP for CS. Based on our observation that lower query rewriting quality tends to result in lower retrieval quality, we argue that query rewriting quality can provide evidence for estimating retrieval quality. We propose PPL-QPP, which incorporates query rewriting quality into QPP methods. Since we cannot obtain manual query rewrites during estimation, we regard the perplexity of generated query rewrites as a measure of quality. PPL-QPP first uses GPT-2 XL² to measure the perplexity of a T5-generated query rewrite and then combines the perplexity with a pre-retrieval QPP method through linear interpolation:

\[
\alpha \cdot \frac{1}{\mathrm{PPL}} + (1 - \alpha) \cdot \mathrm{QPP}.
\]

Here, 𝛼 is a trade-off parameter; the perplexity and QPP values are normalized prior to fusion. For the QPP method to be combined, we use the state-of-the-art VAR (sum) on CAsT-19 and OR-QuAC, and SCQ (avg) on CAsT-20. The performance of PPL-QPP is presented in Table 1. The results show that PPL-QPP improves the effectiveness of QPP methods in the context of CS on CAsT-19 and, in particular, on CAsT-20, where the query rewriting quality is limited. Interestingly, and unlike on CAsT-19 and CAsT-20, PPL-QPP does not bring improvements on the OR-QuAC dataset; we plan to investigate this further in future work.
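The interpolation can be sketched in a few lines. The sketch below assumes that perplexity is computed from per-token natural-log probabilities, that min-max normalization is applied to the inverse perplexity and to the QPP scores across the queries of a dataset, and that all function names are hypothetical; the text only states that the values are normalized before fusion, not how.

```python
import math

def perplexity(token_logprobs):
    # PPL = exp(-(1/N) * sum_i log p(w_i | w_<i)), using natural-log probabilities.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def min_max(values):
    # Assumed normalization: rescale scores to [0, 1] across all queries.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def ppl_qpp(ppls, qpp_scores, alpha):
    # alpha * (1 / PPL) + (1 - alpha) * QPP, computed per query after normalization.
    inv_ppl = min_max([1.0 / p for p in ppls])
    qpp = min_max(qpp_scores)
    return [alpha * a + (1 - alpha) * b for a, b in zip(inv_ppl, qpp)]

if __name__ == "__main__":
    # Hypothetical token log-probabilities for three query rewrites,
    # standing in for the output of a pre-trained language model.
    logprobs = [[-1.0, -1.5, -0.5], [-2.0, -2.5, -3.0], [-4.0, -3.5, -4.5]]
    ppls = [perplexity(lp) for lp in logprobs]
    qpp_scores = [2.1, 3.4, 1.2]  # hypothetical pre-retrieval QPP values
    print(ppl_qpp(ppls, qpp_scores, alpha=0.3))
```

Setting `alpha` to 0 recovers the underlying QPP method, while `alpha` equal to 1 ranks queries by inverse perplexity alone, so the parameter controls how much weight the rewriting-quality signal receives.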

Conclusion

In this paper, we have targeted QPP for CS. We have reused QPP methods for ad-hoc search in the context of CS by feeding them self-contained query rewrites generated by T5. Our experiments on three CS datasets show that (i) lower query rewriting quality may lead to worse QPP performance, and (ii) incorporating query rewriting quality into QPP methods improves their effectiveness in the context of CS when query rewriting quality is limited.

Figure 1: The similarity between manual and T5-generated query rewrites in terms of ROUGE (a) and the retrieval quality of BM25 for manual/T5-generated query rewrites in terms of NDCG@3 (b).

¹ https://huggingface.co/castorini/t5-base-canard

Acknowledgement. We want to thank our reviewers for their feedback. This research was partially supported by the China Scholarship Council (CSC).

References

[1] D. Ganguly, S. Datta, M. Mitra, D. Greene, An analysis of variations in the effectiveness of query performance prediction, in: ECIR, Springer, 2022.

[2] D. Carmel, E. Yom-Tov, Estimating the query difficulty for information retrieval, Morgan & Claypool Publishers, 2010.

[3] S.-C. Lin, J.-H. Yang, R. Nogueira, M.-F. Tsai, C.-J. Wang, J. Lin, Multi-stage conversational passage retrieval: An approach to fusing term importance estimation and neural query rewriting, TOIS 39 (2021).

[4] J. Dalton, C. Xiong, J. Callan, CAsT 2020: The conversational assistance track overview, in: Text Retrieval Conference, 2020.

[5] C. Qu, L. Yang, C. Chen, M. Qiu, W. B. Croft, M. Iyyer, Open-retrieval conversational question answering, in: SIGIR, 2020.

[6] J. Dalton, C. Xiong, V. Kumar, J. Callan, CAsT-19: A dataset for conversational information seeking, in: SIGIR, 2020.