Towards Incorporating Personalized Context for Conversational Information Seeking

Haitao Yu1,∗, Lingzhen Zheng2, Kaiyu Yang2, Sumio Fujita3 and Hideo Joho1

1 Institute of Library, Information and Media Science, University of Tsukuba, Tsukuba City, Ibaraki, Japan
2 Graduate School of Comprehensive Human Sciences, University of Tsukuba, Tsukuba City, Ibaraki, Japan
3 LY Research, LY Corporation, Tokyo, Japan

Abstract
Conversational information seeking (CIS) extends classic search to a conversational setting and has attracted significant attention in recent years. Yet one size does not fit all: users with different personas often need high-quality personalized responses. For a search about alternatives to cow's milk, for example, the desired responses may differ considerably from user to user. In this work, we focus on CIS that accounts for personalized retrieval and response generation. Specifically, we follow the CIS paradigm presented in the TREC iKAT track, which consists of three core tasks, namely personal textual knowledge base (PTKB) statement ranking, passage ranking, and response generation. For PTKB statement ranking, we propose to fuse multiple large language models (LLMs). For passage ranking, we propose four different strategies for personalized retrieval. For response generation, we resort to zero-shot LLM-based answer generation that incorporates personalized context. The experimental results show that: (1) For PTKB statement ranking, our method achieves the best performance in terms of MRR on the set of iKAT organizers' assessments. It also shows superior performance over the GPT-4-based baseline, which indicates that fusing multiple LLMs is a promising choice for problems of this kind. (2) For passage ranking, on the one hand, one of our proposed strategies achieves performance comparable to the Llama2-based baseline.
On the other hand, our analysis indicates that the way PTKB statements are incorporated for personalized retrieval matters, and that a direct concatenation is not recommended. (3) For response generation, our proposed method generates grounded and natural personalized responses, and is comparable to the top-tier LLM-based baseline.

Keywords: Conversational, Information Seeking, Personalized Context, LLM

Information Retrieval's Role in RAG Systems (IR-RAG), 18 July, 2024, Washington, DC
∗ The corresponding author.
yuhaitao@slis.tsukuba.ac.jp (H. Yu); s2221686@u.tsukuba.ac.jp (L. Zheng); s2321730@u.tsukuba.ac.jp (K. Yang); sufujita@lycorp.co.jp (S. Fujita); hideo@slis.tsukuba.ac.jp (H. Joho)
ORCID: 0000-0002-1569-8507 (H. Yu); 0009-0004-5783-7079 (L. Zheng); 0009-0002-4491-7235 (K. Yang); 0000-0002-1282-386X (S. Fujita); 0000-0002-6611-652X (H. Joho)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In recent years, conversational systems have attracted considerable attention from both academic researchers and industrial practitioners. In the field of information retrieval (IR), conversational information seeking (CIS) has been identified as one of the most important research directions. Remarkable efforts have been made on different aspects, including, but not limited to, conversational search conceptualization [1, 2, 3], conversational query re-writing [4, 5, 6], generating and selecting clarifying questions [7, 8, 9, 10], and conversational response generation [11, 12, 13].

Despite the successes achieved by the aforementioned studies, fundamental research questions remain open. For example, providing high-quality user-specific responses is still a challenging problem. Take the case by Aliannejadi et al. [14] as an example: for a search about alternatives to cow's milk, two personas could be (A) Alice, a vegan who is deeply concerned about the environment; and (B) Bob, who has recently been diagnosed with diabetes, has a nut allergy, and is lactose intolerant. Given Alice's and Bob's personas, their corresponding conversations with the system would evolve and develop in very different ways. Put another way, the responses that are helpful to Alice are not necessarily useful to Bob, and vice versa. In fact, information needs of this kind are prevalent in daily information searches, including, but not limited to, job finding, healthcare search, and online shopping. Given information needs expressed as a sequence of search queries (or questions) and different personas, it is of great importance that the CIS system effectively incorporates the personalized context and provides relevant responses to users. Motivated by this observation, we focus on developing a unified CIS system that incorporates personalized context during the interactive search process. The main contributions of this work are as follows:

• By following the CIS paradigm presented in the TREC iKAT track, we propose different methods for tackling the core tasks, namely personal textual knowledge base (PTKB) statement ranking, passage ranking, and response generation. For PTKB statement ranking, we explore how to fuse multiple large language models (LLMs). The experimental results show that our method achieves the best performance in terms of MRR on the set of iKAT organizers' assessments, which relies on a larger assessment pool. Moreover, our method also shows superior performance over the GPT-4-based baseline. This highlights that it is not straightforward to solve a component task by merely tailoring a powerful LLM, whereas a fusion of multiple LLMs can be a promising choice when tackling problems of this kind.

• For passage ranking, we propose four different strategies for personalized retrieval, which enables us to investigate the impact of utterance rewriting and of the way personalized context is incorporated. Through result analysis and comparison, we found that, though our proposed method for selecting PTKB statements is relatively reliable, how the selected PTKB statements are incorporated to formulate the input for personalized retrieval matters a lot. A direct concatenation is not suggested, given the inferior performance of our proposed strategies.

• For response generation, we resort to zero-shot LLM-based answer generation by incorporating personalized context. Our method is able to generate grounded and natural personalized responses, and is comparable to the top-tier LLM-based baseline.

Figure 1: Our focused framework for conversational information seeking that incorporates personalized context.

2. Preliminaries

Figure 1 describes our focused framework for CIS that accounts for users' personas. It assumes that there is a personal textual knowledge base (PTKB), which consists of narrative sentences providing personal information about the user. A system following this framework consists of the following key modules. (1) Statement ranking: given the context of the conversation and the current user utterance, this module returns a ranked list of PTKB statements based on their relevance, which reflects the user's persona. (2) Passage ranking: given the context of the conversation, the current user utterance, and the PTKB statements, this module is responsible for retrieving a ranked list of passages from the document collection. (3) Response generation: this module returns the answer text as a response to the user; in particular, the response should be a generative or abstractive summary of the relevant passages. We recognize that a gap exists between our focused framework for CIS and real-world search scenarios. Since this topic is still in its infancy, we leave exploring more complex frameworks as future work.

3. Methodology

Given the target paradigm for CIS in Section 2, we elaborate on the proposed methods for addressing each key module below.

3.1. Statement Ranking by Fusing Multiple LLMs

The key idea of our method (denoted as SR_FML) for tackling statement ranking is to effectively fuse multiple LLMs through a cascade of four steps. In the first step, we rewrite each conversation turn's utterance. Specifically, the T5-CANARD model [15] fine-tuned with the testing topics of TREC CAsT 2022 [16] is used, and the conversations of the preceding turns (3 turns at most) are used as the context. In the second step, given the candidate PTKB statements, we perform binary logistic regression based on the BERT [17] model. The candidate PTKB statements with a true label are kept for later steps, and the statements with a false label are filtered out. In the third step, we perform binary logistic regression again over the remaining PTKB statements based on MonoT5 [18], in the same way as in the second step. In addition, we use RankGPT [19] to sort the PTKB statements, assigning a true label to the top half of the statements and a false label to the remaining bottom half. In the fourth step, we manage to unify the ranking information and the binary classification results of the previous two steps via a scoring function and an indicator function. The scoring function assigns a weight to each statement remaining after the second step as follows:

    w(s) = 1 − (Ind_MonoT5(s) + Ind_RankGPT(s)) / (2 · |S|)    (1)

where Ind_MonoT5(s) and Ind_RankGPT(s) represent the rank position of s according to the regression scores of MonoT5 and RankGPT, respectively, and |S| represents the number of PTKB statements remaining after the second step. The indicator function builds upon w(s) and a voting mechanism as follows:

    I(s) = 1 if (lab_BERT(s) + lab_MonoT5(s) + lab_RankGPT(s)) ≥ 2 and w(s) > 0.65; 0 otherwise    (2)

where lab_BERT(s), lab_MonoT5(s), and lab_RankGPT(s) respectively represent the binary classification result of each adopted model, with an output of 1 denoting a true label and 0 a false label.

The final result list of PTKB statements is generated by selecting the statements with a positive output of the indicator function and ranking them by the scoring function in decreasing order.

3.2. Zero-shot LLM-based Passage Ranking

To cope with passage ranking, we resort to the typical retrieve-then-rank pipeline. First, we use BM25 with the default setting in Pyserini to retrieve the top 5 passages. Then we design 4 strategies (denoted as PR_S1, PR_S2, PR_S3 and PR_S4, respectively) to re-rank the top 5 passages using multiple specifically selected LLMs in a zero-shot manner.

To formulate the input, PR_S1, PR_S3, and PR_S4 concatenate the rewritten utterance and the top 2 relevant PTKB statements returned by the statement ranking module, whereas PR_S2 directly uses the rewritten utterance as the input.

During the ranking process, the four strategies differ as follows. (1) PR_S1 and PR_S2 assemble the results of multiple LLMs (i.e., "stabilityai/stablelm-tuned-alpha-7b", "eachadea/vicuna-13b-1.1", "jondurbin/airoboros-7b", "TheBloke/koala-13B-HF") [20, 21, 22, 23, 24] in a voting manner. Specifically, given the information need represented by the input, we ask each LLM to compare the candidate passages in a pairwise manner; the passage identified as more relevant than the other gets a vote.
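To make the voting scheme concrete, the vote-aggregation step of PR_S1 and PR_S2 can be sketched as below. This is an illustrative sketch rather than the authors' released code: the toy `judges` stand in for the four zero-shot LLMs, each of which would in practice be prompted to pick the more relevant passage of a pair.

```python
from collections import Counter
from itertools import combinations

def rerank_by_pairwise_votes(passages, judges):
    """Re-rank candidate passages by aggregating pairwise preferences.

    `judges` is a list of callables; each takes (passage_a, passage_b)
    and returns the passage it deems more relevant. In PR_S1/PR_S2 each
    judge would wrap one zero-shot LLM call; here they are plain functions.
    """
    votes = Counter({p: 0 for p in passages})
    for a, b in combinations(passages, 2):
        for judge in judges:
            winner = judge(a, b)  # the preferred passage earns one vote
            votes[winner] += 1
    # Rank passages by cumulative number of votes, in decreasing order.
    return [p for p, _ in votes.most_common()]

# Toy judges: prefer the longer passage / the one mentioning "milk".
judges = [
    lambda a, b: a if len(a) >= len(b) else b,
    lambda a, b: a if "milk" in a else b,
]
passages = [
    "oat milk is popular",
    "soy",
    "almond milk suits many diets well",
]
print(rerank_by_pairwise_votes(passages, judges))
```

Since every judge votes on every pair, the cost grows quadratically with the number of candidates, which is one reason the strategies only re-rank a handful of top BM25 passages.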
Finally, we rank the passages based on the cumulative number of votes in decreasing order. (2) PR_S3 merely relies on MonoT5, with the default setting in PyGaggle, to rank the passages. (3) PR_S4 relies on the idea of RankGPT to rank the passages, where the GPT-3.5 API is used.

3.3. Personalized Response Generation

For response generation, we aim to generate personalized responses (this method is denoted as RG_SumT5). Specifically, for each conversation turn, the top-1 passage and the top-2 PTKB statements representing the personalized context are used as the input. As the base LLM, we resort to T5 [25], which is specifically fine-tuned for the summarization task.

4. Experimental Setup

4.1. Dataset

We use the dataset released by TREC iKAT 2023, with 25 testing topics, for evaluating effectiveness. Each topic has 1–3 subtree conversations that represent different personas. For each personalized conversation, there is a list of around 10 PTKB statements. Moreover, the passage collection has 116,838,987 passages and is derived from a subset of ClueWeb22-B [26].

4.2. Baselines

In order to make a fair and thorough analysis, we perform a module-specific comparison by selecting the most competitive and representative baseline methods from the TREC iKAT 2023 participants. We add the prefix BS to each baseline method for better clarity.

For statement ranking, BS_zs_Llama and BS_ft_Llama use zero-shot and fine-tuned Llama-2-7b-chat [27], respectively, for rewriting the utterance. They then use MiniLM12 [28] to rank the PTKB statements based on the rewritten utterance.

For passage ranking, BS_Llama2 first instructs Llama-2-7b-chat to reformulate the current utterance considering the context of the previous conversation turns. Then, the revised conversation, along with a specific passage, is provided to the model to assess the passage's relevance.

For response generation, BS_FastChatT5andLlama creates a summarization for each of the top passages retrieved by BM25 using FastChatT5 [29], then generates the response to the current utterance based on the summaries in a retrieval-generate loop. BS_DenseMonoT5 summarizes a final response from the top passages using different engines, including conventional language models and Llama2.

Besides the above module-specific baseline methods, BS_GPT-4 is compared across all three modules; it represents the method using the most powerful LLM (i.e., GPT-4 [30]). For statement ranking, BS_GPT-4 casts the task as a binary classification problem. The prompt includes the instruction, the context of the conversation, the PTKB statements of the user, and the current user utterance; the output is a ranked list of relevant statements. For passage ranking, BS_GPT-4 first generates an answer for each turn. Subsequently, GPT-4 is employed to produce five queries for each answer. These generated queries are used via BM25 to retrieve passages, and the pre-trained MiniLM12 is then deployed for ranking the passages. For response generation, GPT-4 is prompted to generate the answer, using the top-10 retrieved passages, the top-3 PTKB statements, the context of the conversation, and the user utterance.

4.3. Implementation Details

All experiments were conducted on a server with two A100 (40GB) GPUs. The CUDA version is 12.2. For fine-tuning T5-CANARD, the configuration is: training epochs: 5, batch size: 4, learning rate: 1e−5. For SR_FML, bert-base-uncased with default parameter settings, from the transformers library provided by HuggingFace [31], is used as the backbone model. We iterate its predictions five times and compute the average relevance score for each statement. For RankGPT, the configuration is: window size: 4, step size: 1. MonoT5 with default parameter settings in PyGaggle is used. In PR_S3, the window size of RankGPT is adjusted to 3. In PR_S1 and PR_S2, we set the prompt_max_length of the four zero-shot LLMs to 2048. Additionally, we set the decoding method to beam_search, output_max_length to 512, and temperature to 1.0 by default [32]. For RG_SumT5, t5-base-finetuned-summarize-news is employed with the configuration: input max_length: 512, output min_length: 50, output max_length: 150, length_penalty: 2.0, num_beams: 4.

5. Results and Analysis

Table 1, Table 2, and Table 3 show the overall performance of the baseline approaches and the proposed methods for statement ranking, passage ranking, and response generation, respectively. Within each table, the best result in terms of each metric is indicated in bold, and the second-best result is underlined.

For statement ranking, we note that there are two sets of assessments, created by the iKAT organizers and the NIST assessors, respectively. The key differences are as follows. During topic generation, the organizers annotated each turn in terms of its provenance to PTKB statements and included the labels in the released topic files. During the assessment of passage relevance, the NIST assessors were also asked to judge the relevance of PTKB statements to each turn; this assessment pool is smaller than the one produced by the organizers. The organizers judged all of the turns, while the NIST assessors only judged the turns that were selected for passage relevance [14].

Table 1
The performance comparison on statement ranking.

Ground Truth                  Method        MRR     nDCG@3  P@3     Recall@3
iKAT organizers' assessment   BS_zs_Llama   0.6707  0.6394  0.3810  0.7375
                              BS_GPT-4      0.6618  0.6288  0.3423  0.6888
                              BS_ft_Llama   0.6617  0.6149  0.3542  0.6918
                              SR_FML        0.6890  0.6370  0.3512  0.6903
NIST assessment               BS_zs_Llama   0.7950  0.7254  0.4626  0.6964
                              BS_ft_Llama   0.7795  0.7102  0.4490  0.6796
                              BS_GPT-4      0.7027  0.6174  0.3605  0.5833
                              SR_FML        0.7112  0.6594  0.4184  0.6213

From Table 1, we can observe that BS_zs_Llama outperforms the other methods in terms of nDCG@3, P@3, and Recall@3. Though BS_ft_Llama relies on the same LLM, its performance is impacted by the utterances rewritten under the fine-tuned setting. On the contrary, BS_GPT-4, relying on the powerful GPT-4, shows inferior performance across the two sets of assessments. This indicates that using GPT-4 for statement ranking is not straightforward, and further exploration is needed for better performance. Over the set of iKAT organizers' assessments, our proposed method (i.e., SR_FML) shows performance competitive with BS_zs_Llama and achieves the best performance in terms of MRR. This indicates the benefit of fusing multiple LLMs, which enables us to leverage the advantages of different LLMs. In view of the fact that the set of iKAT organizers' assessments is based on a larger assessment pool, it is reasonable to say that the evaluation over this set is more reliable.

Table 2
The performance comparison on passage ranking.

Method      nDCG@3  nDCG@5  mAP
BS_GPT-4    0.4382  0.4396  0.1759
BS_Llama2   0.1389  0.1466  0.0376
PR_S2       0.1433  0.1469  0.0350
PR_S4       0.1130  0.1070  0.0224
PR_S3       0.1107  0.1062  0.0223
PR_S1       0.1086  0.1049  0.0222

For passage ranking, the results in Table 2 show that BS_GPT-4 significantly outperforms BS_Llama2 and our proposed methods by a large margin. This echoes the findings in prior studies [19, 33, 34, 35], which have shown the leading capability of GPT-4 in the passage ranking task. One probable reason is that the generate-retrieve-generate pipeline adopted by BS_GPT-4 is more suitable for passage ranking than our adopted retrieve-generate pipeline. Among our proposed strategies for passage ranking, PR_S2 shows the best performance and also outperforms BS_Llama2. Compared with BS_Llama2, a possible reason for the inferior performance of the other three strategies is the way of formulating the input: we directly concatenate the utterance and the related PTKB statements, while BS_Llama2 rewrites the utterance with the statements using an LLM. Another possible reason for our inferior performance is that we focus on the earlier positions and only re-rank the top-5 passages returned by BM25. As a result, this setting becomes a bottleneck for obtaining relevant passages, given the limited retrieval ability of BM25.

Table 3
The result comparison on response generation.

Method                  Groundedness  Naturalness
BS_GPT-4                0.89 (65/8)   4.0
BS_FastChatT5andLlama   0.67 (47/23)  3.684
BS_DenseMonoT5          0.51 (37/36)  2.808
RG_SumT5                0.67 (49/24)  2.9178

For response generation, the results are evaluated in terms of groundedness and naturalness. Groundedness measures whether the generated response can be attributed to the passages that it is supposed to be generated from. Naturalness measures the extent to which the response sounds human-like, such as the general fluency and understandability of the generated response. GPT-4 is used to evaluate both the groundedness and the naturalness of the response in each turn; finally, the mean of groundedness and naturalness over all turns is reported. From Table 3, we can observe that BS_GPT-4 again outperforms the other methods by a large margin. Our proposed method (i.e., RG_SumT5) outperforms BS_DenseMonoT5 and shows performance competitive with BS_FastChatT5andLlama. It is noticeable that the evaluation results are likely to be somewhat biased towards BS_GPT-4, since the evaluation is conducted by GPT-4. We leave further testing of the effectiveness of these methods for response generation through human evaluation as future work.

A joint look across Table 1, Table 2, and Table 3 reveals the following. First, we do not observe a clear correlation between statement ranking and passage ranking, which seems counterintuitive. For instance, though BS_GPT-4 shows inferior performance in statement ranking, it outperforms the other methods by a large margin in passage ranking. This counterintuitiveness may arise from a number of possible reasons, such as the strong zero-shot capability of GPT-4 and its precise understanding of the persona information underlying the selected PTKB statements; this is also worth investigating as future work. Second, for both personalized retrieval and response generation in the context of CIS, there is still large room to improve performance.

6. Conclusion

In this study, we focus on CIS that accounts for personalized retrieval and response generation. By following the CIS paradigm presented in the TREC iKAT track, we propose different methods to tackle three core tasks, namely personal textual knowledge base (PTKB) statement ranking, passage ranking, and response generation. We have shown that fusing multiple LLMs is a promising way for addressing PTKB statement ranking. Also, our analysis indicates that an effective way of injecting the selected PTKB statements is quite important for personalized retrieval. Since conversational systems arise in a variety of applications, such as recommender systems and question answering, we believe that our work provides insights for developing conversational systems that account for personalized retrieval and response generation.

7. Acknowledgments

This research has been supported by JSPS KAKENHI Grant Number 19H04215.

References

[1] L. Azzopardi, M. Dubiel, M. Halvey, J. Dalton, Conceptualizing agent-human interactions during the conversational search process, in: The Second International Workshop on Conversational Approaches to Information Retrieval, 2018.
[2] Y. Deldjoo, J. R. Trippas, H. Zamani, Towards multi-modal conversational information seeking, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1577–1587.
[3] F. Radlinski, N. Craswell, A theoretical framework for conversational search, in: Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, CHIIR '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 117–126. doi:10.1145/3020165.3020183.
[4] S. Yu, J. Liu, J. Yang, C. Xiong, P. Bennett, J. Gao, Z. Liu, Few-shot generative conversational query rewriting, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 1933–1936.
[5] S. Vakulenko, S. Longpre, Z. Tu, R. Anantha, Question rewriting for conversational question answering, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2021, pp. 355–363.
[6] S.-C. Lin, J.-H. Yang, R. Nogueira, M.-F. Tsai, C.-J. Wang, J. Lin, Multi-stage conversational passage retrieval: An approach to fusing term importance estimation and neural query rewriting, ACM Trans. Inf. Syst. 39 (2021). doi:10.1145/3446426.
[7] M. Aliannejadi, H. Zamani, F. Crestani, W. B. Croft, Asking clarifying questions in open-domain information-seeking conversations, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 475–484.
[8] H. Zamani, S. Dumais, N. Craswell, P. Bennett, G. Lueck, Generating clarifying questions for information retrieval, in: Proceedings of The Web Conference 2020, WWW '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 418–428. doi:10.1145/3366423.3380126.
[9] I. Sekulić, M. Aliannejadi, F. Crestani, Towards facet-driven generation of clarifying questions for conversational search, in: Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, 2021, pp. 167–175.
[10] H. Zamani, B. Mitra, E. Chen, G. Lueck, F. Diaz, P. N. Bennett, N. Craswell, S. T. Dumais, Analyzing and learning from user interactions for search clarification, 2020. arXiv:2006.00166.
[11] K. Wang, J. Tian, R. Wang, X. Quan, J. Yu, Multi-domain dialogue acts and response co-generation, arXiv preprint arXiv:2004.12363 (2020).
[12] C. Ye, L. Liao, F. Feng, W. Ji, T.-S. Chua, Structured and natural responses co-generation for conversational search, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 155–164.
[13] X. Gu, K. M. Yoo, J.-W. Ha, DialogBERT: Discourse-aware response generation via learning to recover and rank utterances, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021) 12911–12919. doi:10.1609/aaai.v35i14.17527.
[14] M. Aliannejadi, Z. Abbasiantaeb, S. Chatterjee, J. Dalton, L. Azzopardi, TREC iKAT 2023: The interactive knowledge assistance track overview, in: Proceedings of the Thirty-Second Text REtrieval Conference (TREC 2023), 2024.
[15] S.-C. Lin, J.-H. Yang, R. Nogueira, M.-F. Tsai, C.-J. Wang, J. Lin, Conversational question reformulation via sequence-to-sequence architectures and pretrained language models, arXiv preprint arXiv:2004.01909 (2020).
[16] P. Owoicho, J. Dalton, M. Aliannejadi, L. Azzopardi, J. R. Trippas, S. Vakulenko, TREC CAsT 2022: Going beyond user ask and system retrieve with initiative and response generation, NIST Special Publication (2022) 500–338.
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[18] R. Nogueira, Z. Jiang, R. Pradeep, J. Lin, Document ranking with a pretrained sequence-to-sequence model, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 708–718. doi:10.18653/v1/2020.findings-emnlp.63.
[19] W. Sun, L. Yan, X. Ma, S. Wang, P. Ren, Z. Chen, D. Yin, Z. Ren, Is ChatGPT good at search? Investigating large language models as re-ranking agents, 2023. arXiv:2304.09542.
[20] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine, D. Song, Koala: A dialogue model for academic research, Blog post, 2023. URL: https://bair.berkeley.edu/blog/2023/04/03/koala/.
[21] Y. Anand, Z. Nussbaum, B. Duderstadt, B. Schmidt, A. Mulyar, GPT4All: Training an assistant-style chatbot with large scale data distillation from GPT-3.5-Turbo, https://github.com/nomic-ai/gpt4all, 2023.
[22] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford Alpaca: An instruction-following LLaMA model, https://github.com/tatsu-lab/stanford_alpaca, 2023.
[23] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023. URL: https://lmsys.org/blog/2023-03-30-vicuna/.
[24] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
[25] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, The Journal of Machine Learning Research 21 (2020) 5485–5551.
[26] A. Overwijk, C. Xiong, J. Callan, ClueWeb22: 10 billion web documents with rich information, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 3360–3362.
[27] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
[28] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.
[29] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023. arXiv:2306.05685.
[30] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.
[31] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6.
[32] D. Jiang, X. Ren, B. Y. Lin, LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion, arXiv preprint arXiv:2306.02561 (2023).
[33] R. Pradeep, S. Sharifymoghaddam, J. Lin, RankVicuna: Zero-shot listwise document reranking with open-source large language models, 2023. arXiv:2309.15088.
[34] Y. Zhu, H. Yuan, S. Wang, J. Liu, W. Liu, C. Deng, H. Chen, Z. Dou, J.-R. Wen, Large language models for information retrieval: A survey, 2024. arXiv:2308.07107.
[35] R. Tang, X. Zhang, X. Ma, J. Lin, F. Ture, Found in the middle: Permutation self-consistency improves listwise ranking in large language models, 2023. arXiv:2310.07712.