<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Performance Prediction for Conversational Search Using Perplexities of Query Rewrites</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chuan Meng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Aliannejadi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maarten de Rijke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Amsterdam</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <issue>0</issue>
      <abstract>
        <p>We consider the query performance prediction (QPP) task for conversational search (CS), i.e., estimating the retrieval quality for queries in multi-turn conversations. We reuse QPP methods from ad-hoc search for CS by feeding them self-contained query rewrites generated by T5. Our experiments on three CS datasets show that (i) lower query rewriting quality may lead to worse QPP performance, and (ii) incorporating query rewriting quality (as measured by perplexity) improves the effectiveness of QPP methods for CS if the query rewriting quality is limited. Our implementation is publicly available at https://github.com/ChuanMeng/QPP4CS.</p>
      </abstract>
      <kwd-group>
        <kwd>Query performance prediction</kwd>
        <kwd>conversational search</kwd>
        <kwd>perplexity</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        We propose PPL-QPP, which incorporates query rewriting quality, as measured by perplexity, into an existing QPP
        method [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Experiments show that PPL-QPP improves the effectiveness of QPP methods in the
context of CS in cases where the query rewriting quality is limited.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Experiments</title>
      <p>
        Experimental setup. We use seven widely used pre-retrieval QPP methods [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] on three CS
datasets: CAsT-19 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], CAsT-20 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and OR-QuAC [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The retriever to be evaluated by the QPP
methods is a T5-based query rewriter followed by BM25, a widely used CS method [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The T5-generated
query rewrites used by BM25 are fed into all QPP methods. We evaluate the QPP methods by
calculating the correlation between the NDCG@3 scores of the queries in the test set and the
estimated retrieval quality. Note that NDCG@3 is the primary metric in CAsT [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ].
Performance of QPP methods for CS. Experimental results are presented in Table 1. Our
leading observation is that the overall performance of QPP methods on CAsT-19 and OR-QuAC
is better than on CAsT-20. The difference in results seems to be due to the difference in query
rewriting quality on the three datasets. We measure query rewriting quality using the similarity
between manual and T5-generated query rewrites in terms of ROUGE, and the BM25 retrieval
quality gap between using manual and T5-generated query rewrites. Fig. 1a shows that the
ROUGE scores on CAsT-20 are lower than those on CAsT-19 and OR-QuAC; Fig. 1b shows
that the gap is larger on CAsT-20 than on CAsT-19. We conclude that the quality of
T5-generated query rewrites is lower on CAsT-20 than on the other datasets and that lower
query rewriting quality may lead to worse QPP effectiveness.
      </p>
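      <p>The evaluation described above can be sketched as follows. This is a minimal example; the specific correlation coefficient is not stated in this passage, so Kendall's tau is shown here as one common choice in QPP evaluation:</p>

```python
from itertools import combinations

def kendall_tau(predicted, actual):
    """Kendall's tau-a between predicted QPP scores and per-query
    retrieval quality (e.g., NDCG@3 over the test-set queries).
    Counts concordant and discordant pairs; ties count toward neither."""
    n = len(predicted)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if s > 0:
            concordant += 1
        elif s == 0:
            pass  # tied pair contributes to neither count
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

      <p>A perfect predictor ranks queries exactly as NDCG@3 does (tau = 1.0); a predictor that reverses the ranking yields tau = -1.0.</p>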
      <p>[Figure 1: (a) ROUGE-1/2/L scores of T5-generated query rewrites against manual query rewrites; (b) BM25 retrieval quality with manual vs. T5-generated query rewrites, on CAsT-19, CAsT-20, and OR-QuAC.]</p>
      <p>Incorporating query rewriting quality into QPP for CS. Based on our observation that
lower query rewriting quality tends to result in lower retrieval quality, we argue that query
rewriting quality can provide evidence for estimating retrieval quality. We propose PPL-QPP,
which incorporates query rewriting quality into QPP methods. Since we cannot obtain manual
query rewrites during estimation, we regard the perplexity of a generated query rewrite as a
measure of its quality. PPL-QPP first uses GPT-2 XL to measure the perplexity of a T5-generated
query rewrite and then combines the perplexity with a pre-retrieval QPP method through linear
interpolation: λ · PPL + (1 − λ) · QPP. Here, λ is a trade-off parameter; the perplexity and
QPP values are first normalized prior to fusion. For the QPP method to be combined, we use
the state-of-the-art VAR (sum) on CAsT-19 and OR-QuAC, and SCQ (avg) on CAsT-20. The
performance of PPL-QPP is presented in Table 1. The results show that PPL-QPP improves the
effectiveness of QPP methods in the context of CS on CAsT-19 and, in particular, on CAsT-20,
where the query rewriting quality is limited. Interestingly, and different from CAsT-19/20,
PPL-QPP does not bring improvements on the OR-QuAC dataset; we plan to further investigate
this in our future work. (The T5 query rewriter is available at
https://huggingface.co/castorini/t5-base-canard.)</p>
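      <p>A minimal sketch of the fusion step, assuming precomputed per-query perplexity and QPP scores. The function names and the min-max normalization scheme are our assumptions, as the text only states that both score lists are normalized before interpolation:</p>

```python
def minmax_normalize(scores):
    """Min-max normalize a list of scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def ppl_qpp(ppl_scores, qpp_scores, lam=0.5):
    """Fuse per-query perplexity (PPL) and QPP scores by linear
    interpolation, lam * PPL + (1 - lam) * QPP, after normalizing
    both score lists over the query set."""
    ppl = minmax_normalize(ppl_scores)
    qpp = minmax_normalize(qpp_scores)
    return [lam * p + (1 - lam) * q for p, q in zip(ppl, qpp)]
```

      <p>The trade-off parameter lam weights the perplexity evidence against the base QPP estimate; lam = 0 recovers the base QPP method unchanged.</p>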
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion</title>
      <p>In this paper, we have targeted QPP for CS. We have reused QPP methods for ad-hoc search
in the context of CS by feeding them self-contained query rewrites generated by T5. Our
experiments on three CS datasets show that (i) lower query rewriting quality may lead to worse
QPP performance, and (ii) incorporating query rewriting quality into QPP methods improves
their effectiveness in the context of CS when query rewriting quality is limited.
Acknowledgement. We want to thank our reviewers for their feedback. This research was
partially supported by the China Scholarship Council (CSC).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Greene</surname>
          </string-name>
          ,
          <article-title>An analysis of variations in the effectiveness of query performance prediction</article-title>
          ,
          <source>in: ECIR</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>229</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Carmel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yom-Tov</surname>
          </string-name>
          ,
          <article-title>Estimating the query difficulty for information retrieval</article-title>
          , Morgan &amp; Claypool Publishers,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Multi-stage conversational passage retrieval: An approach to fusing term importance estimation and neural query rewriting</article-title>
          ,
          <source>TOIS</source>
          <volume>39</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>CAsT 2020: The conversational assistance track overview</article-title>
          ,
          <source>in: Text Retrieval Conference</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <article-title>Open-retrieval conversational question answering</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>539</fpage>
          -
          <lpage>548</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>CAsT-19: A dataset for conversational information seeking</article-title>
          ,
          <source>in: SIGIR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1985</fpage>
          -
          <lpage>1988</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>