<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Performance Prediction for Conversational Search Using Perplexities of Query Rewrites</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chuan Meng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Aliannejadi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maarten de Rijke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Amsterdam</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <issue>0</issue>
      <abstract>
        <p>We consider the query performance prediction (QPP) task for conversational search (CS), i.e., estimating the retrieval quality for queries in multi-turn conversations. We reuse QPP methods from ad-hoc search for CS by feeding them self-contained query rewrites generated by T5. Our experiments on three CS datasets show that (i) lower query rewriting quality may lead to worse QPP performance, and (ii) incorporating query rewriting quality (as measured by perplexity) improves the effectiveness of QPP methods for CS if the query rewriting quality is limited. Our implementation is publicly available at https://github.com/ChuanMeng/QPP4CS.</p>
      </abstract>
      <kwd-group>
        <kwd>Query performance prediction</kwd>
        <kwd>conversational search</kwd>
        <kwd>perplexity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        We propose PPL-QPP, which incorporates query rewriting quality, as measured by perplexity, into an existing QPP
        method [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Experiments show that PPL-QPP improves the effectiveness of QPP methods in the
context of CS in cases where the query rewriting quality is limited.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Experiments</title>
      <p>
        Experimental setup. We use seven widely used pre-retrieval QPP methods [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] on three CS
datasets: CAsT-19 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], CAsT-20 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and OR-QuAC [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The retriever to be evaluated by the QPP
methods is a T5-based query rewriter followed by BM25, a widely used CS method [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The T5-generated
query rewrites used by BM25 are fed into all QPP methods. We evaluate the QPP methods by
calculating the correlation between the NDCG@3 scores of the queries in the test set and the
estimated retrieval quality. Note that NDCG@3 is the primary metric in CAsT [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ].
Performance of QPP methods for CS. Experimental results are presented in Table 1. Our
leading observation is that the overall performance of QPP methods on CAsT-19 and OR-QuAC
is better than on CAsT-20. The difference in results seems to be due to the difference in query
rewriting quality on the three datasets. We measure query rewriting quality using the similarity
between manual and T5-generated query rewrites in terms of ROUGE, and the BM25 retrieval
quality gap between using manual and T5-generated query rewrites. Fig. 1a shows that the
ROUGE scores on CAsT-20 are lower than those on CAsT-19 and OR-QuAC; Fig. 1b shows
that the gap is larger on CAsT-20 than on CAsT-19. We conclude that the quality of
T5-generated query rewrites is lower on CAsT-20 than on the other datasets and that lower
query rewriting quality may lead to worse QPP effectiveness.
      </p>
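      <p>The evaluation described above can be sketched as follows. This is a minimal example; the specific correlation coefficient is not stated in this passage, so Kendall's tau is shown here as one common choice in QPP evaluation:</p>

```python
from itertools import combinations

def kendall_tau(predicted, actual):
    """Kendall's tau-a between predicted QPP scores and per-query
    retrieval quality (e.g., NDCG@3 over the test-set queries).
    Counts concordant and discordant pairs; ties count toward neither."""
    n = len(predicted)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if s > 0:
            concordant += 1
        elif s == 0:
            pass  # tied pair contributes to neither count
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

      <p>A perfect predictor ranks queries exactly as NDCG@3 does (tau = 1.0); a predictor that reverses the ranking yields tau = -1.0.</p>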
      <p>[Figure 1: (a) ROUGE-1/2/L scores of T5-generated query rewrites against manual query rewrites; (b) BM25 retrieval quality with manual vs. T5-generated query rewrites, on CAsT-19, CAsT-20, and OR-QuAC.]</p>
      <p>Incorporating query rewriting quality into QPP for CS. Based on our observation that
lower query rewriting quality tends to result in lower retrieval quality, we argue that query
rewriting quality can provide evidence for estimating retrieval quality. We propose PPL-QPP,
which incorporates query rewriting quality into QPP methods. Since we cannot obtain manual
query rewrites during estimation, we regard the perplexity of a generated query rewrite as a
measure of its quality. PPL-QPP first uses GPT-2 XL to measure the perplexity of a T5-generated
query rewrite and then combines the perplexity with a pre-retrieval QPP method through linear
interpolation: λ · PPL + (1 − λ) · QPP. Here, λ is a trade-off parameter; the perplexity and
QPP values are first normalized prior to fusion. For the QPP method to be combined, we use
the state-of-the-art VAR (sum) on CAsT-19 and OR-QuAC, and SCQ (avg) on CAsT-20. The
performance of PPL-QPP is presented in Table 1. The results show that PPL-QPP improves the
effectiveness of QPP methods in the context of CS on CAsT-19 and, in particular, on CAsT-20,
where the query rewriting quality is limited. Interestingly, and different from CAsT-19/20,
PPL-QPP does not bring improvements on the OR-QuAC dataset; we plan to further investigate
this in our future work. (The T5 query rewriter is available at
https://huggingface.co/castorini/t5-base-canard.)</p>
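      <p>A minimal sketch of the fusion step, assuming precomputed per-query perplexity and QPP scores. The function names and the min-max normalization scheme are our assumptions, as the text only states that both score lists are normalized before interpolation:</p>

```python
def minmax_normalize(scores):
    """Min-max normalize a list of scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def ppl_qpp(ppl_scores, qpp_scores, lam=0.5):
    """Fuse per-query perplexity (PPL) and QPP scores by linear
    interpolation, lam * PPL + (1 - lam) * QPP, after normalizing
    both score lists over the query set."""
    ppl = minmax_normalize(ppl_scores)
    qpp = minmax_normalize(qpp_scores)
    return [lam * p + (1 - lam) * q for p, q in zip(ppl, qpp)]
```

      <p>The trade-off parameter lam weights the perplexity evidence against the base QPP estimate; lam = 0 recovers the base QPP method unchanged.</p>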
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion</title>
      <p>In this paper, we have targeted QPP for CS. We have reused QPP methods for ad-hoc search
in the context of CS by feeding them self-contained query rewrites generated by T5. Our
experiments on three CS datasets show that (i) lower query rewriting quality may lead to worse
QPP performance, and (ii) incorporating query rewriting quality into QPP methods improves
their effectiveness in the context of CS when query rewriting quality is limited.
Acknowledgement. We want to thank our reviewers for their feedback. This research was
partially supported by the China Scholarship Council (CSC).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Greene</surname>
          </string-name>
          ,
          <article-title>An analysis of variations in the effectiveness of query performance prediction</article-title>
          ,
          <source>in: ECIR</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yom-Tov</surname>
          </string-name>
          ,
          <article-title>Estimating the query difficulty for information retrieval</article-title>
          , Morgan &amp; Claypool Publishers,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Multi-stage conversational passage retrieval: An approach to fusing term importance estimation and neural query rewriting</article-title>
          ,
          <source>TOIS</source>
          <volume>39</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>CAsT 2020: The conversational assistance track overview</article-title>
          ,
          <source>in: Text Retrieval Conference</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <article-title>Open-retrieval conversational question answering</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>539</fpage>
          -
          <lpage>548</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>CAsT-19: A dataset for conversational information seeking</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1985</fpage>
          -
          <lpage>1988</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>