<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Satisfaction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hyunwoo Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yoonseo Choi</string-name>
          <email>yoonseo.choi@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taehyun Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Honggu Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chaneon Park</string-name>
          <email>chaneon.park@navercorp.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yongju Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jin Young Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juho Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Washington DC, United States</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Clarifying Queries</institution>
          ,
          <addr-line>Excluding Condition</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Liberal Studies, Seoul National University</institution>
          ,
          <addr-line>1 Gwanak-ro, Gwanak-gu, Seoul, 08826</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Naver Corp.</institution>
          ,
          <addr-line>95 Jeongjail-ro, Bundang-gu, Seongnam-si, Gyeonggi-do, 13561</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>School of Computing, KAIST</institution>
          ,
          <addr-line>291 Daehak-ro, Yuseong-gu, Daejeon, 34141</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>With large language models (LLMs), conversational search engines shift how users retrieve information from the web by enabling natural conversations to express their search intents over multiple turns. Users' natural conversations carry rich but implicit signals of their search intents and their evaluation of search results, which can help in understanding user experience with the system. However, it is underexplored how and why users ask follow-up queries to continue conversations with conversational search engines and how the follow-up queries signal users' satisfaction. From a qualitative analysis of 250 conversational turns from an in-lab user evaluation of Naver Cue:, a commercial conversational search engine, we propose a taxonomy of 18 follow-up query patterns in conversational search, comprising two major axes: (1) users' motivations behind continuing conversations (N = 7) and (2) actions of follow-up queries (N = 11). Compared to the existing literature on query reformulations, we uncovered a new set of motivations and actions behind follow-up queries, including asking for subjective opinions or providing natural language feedback on the engine's responses. To analyze conversational search logs with our taxonomy in a scalable and efficient manner, we built an LLM-powered classifier (73% accuracy). With our classifier, we analyzed 2,061 conversational tuples collected from real-world usage logs of Cue: and examined how the conversation patterns from our taxonomy correlate with satisfaction. Our initial findings suggest some signals of dissatisfaction, such as Clarifying Queries, Excluding Condition, and Substituting Condition with follow-up queries. We envision our approach could contribute to automated evaluation of the conversational search experience by providing satisfaction signals and grounds for realistic user simulations.</p>
      </abstract>
      <kwd-group>
        <kwd>conversational search</kwd>
        <kwd>follow-up queries</kwd>
        <kwd>search intent</kwd>
        <kwd>search evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The evolution of large language models (LLMs) is reshaping how people search for information,
which was traditionally centered around information retrieval using users’ keyword-based
queries. With LLM’s capabilities to understand natural language queries, users can retrieve
information in a more natural manner. Furthermore, users can react to the responses of the
search engine with natural language utterances to express and refine their search intents, explore
the search results, and give feedback to the search results [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Compared to the fixed behavioral
signals (e.g., clicks, reformulations, dwell time) with conventional keyword-based search, users’
natural language conversation provides richer signals about users’ search intents and evaluation
of search results. However, the complexity of natural language utterances makes it challenging
for search designers and developers to discern the users’ search intent and evaluate responses
from the conversation. Furthermore, the user can evolve their search intent from interacting
with the search engine, posing additional challenges to understanding the dynamic user intent
within the conversation session.
      </p>
      <p>
        There has been a large body of work on understanding and evaluating user experience
with user interactions from the perspectives of information retrieval (IR) and conversational
interactions. Previous IR studies have documented users’ interaction patterns signaling users’
satisfaction, such as clicks on search results [
        <xref ref-type="bibr" rid="ref2 ref23">2</xref>
        ], query reformulations [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ], and query
abandonments [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Such behavioral patterns can complement offline evaluations with insights
into actual users’ evaluation of search results at scale [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, it’s still underexplored
how behavioral patterns should be defined and analyzed within the context of conversational
searches. In terms of conversational interaction, existing work has explored diverse interaction
scenarios with conversational agents to understand interaction patterns of
satisfaction/dissatisfaction [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ] and predict user satisfaction [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ]. At the intersection of information
search and conversational interactions, a thread of research investigated and conceptualized
interaction patterns for information-seeking conversations [
        <xref ref-type="bibr" rid="ref14 ref15 ref16 ref17">14, 15, 16, 17</xref>
        ]. However, while
existing literature on the evaluation of search systems and conversational interactions points out
that interaction patterns with conversational search engines can be rich and meaningful signals
for the evaluation of conversational search systems, the relationship between the interaction
patterns and user satisfaction within the domain of conversational information retrieval is still
underexplored.
      </p>
      <p>In this research, we aim to answer two questions:
        <list list-type="bullet">
          <list-item><p>Q1. How do user interaction patterns in conversational search differ from those in conventional search?</p></list-item>
          <list-item><p>Q2. How do user interaction patterns correlate with user satisfaction in conversational search?</p></list-item>
        </list>
      </p>
      <p>To answer Q1, we analyzed how users interact with conversational search queries. With
nine predefined queries, we collected a total of 96 sessions of conversational search from 72
lay users by letting them freely complete the search tasks with Naver Cue:. To compare
the interaction patterns with those from conventional searches in the literature, we focused on
‘follow-up queries’ for analysis, which reveal how users actively steer their search tasks.
We qualitatively analyzed 250 conversational tuples (the set of ‘User Query N - System Response
to N - User Query N+1’) from the collected conversations and organized the patterns of follow-up
queries in the form of a taxonomy. In total, we organized 18 themes across two axes: (1) users’
motivations behind continuing the conversations (N = 7) and (2) patterns of follow-up queries
(N = 11).</p>
      <p>To answer Q2, we developed an automated process to classify conversations using our
taxonomy. We first developed an LLM-powered classifier with our proposed taxonomy to analyze
conversations at scale. We tested the classifier against ground-truth labels from the taxonomy
creation and observed 73% accuracy. With the classifier, we analyzed 2,061 conversational
tuples sampled from real-world usage logs of Naver Cue:, with ground-truth satisfaction ratings
from external evaluators. Comparing the patterns of follow-up queries across different
levels of search satisfaction, we observed that unsatisfactory conversations more often involve
clarifying queries and reactions to the search engine’s responses. We envision that our taxonomy of
follow-up queries and LLM classification techniques will further contribute to improving
the conversational search experience by supporting the personalization of information-seeking
pathways, realistic simulation of user experience, and a deeper understanding of user
experience.</p>
      <p>This paper makes the following contributions:
        <list list-type="bullet">
          <list-item><p>A taxonomy of users’ follow-up query behaviors in conversational search, capturing (1) users’ motivations behind continuing conversations and (2) the patterns of follow-up queries.</p></list-item>
          <list-item><p>An LLM-powered classifier to automatically classify follow-up query patterns at scale.</p></list-item>
          <list-item><p>Initial findings on how the distribution of follow-up query patterns differs by the level of search satisfaction.</p></list-item>
        </list>
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Taxonomy Construction Process</title>
      <p>We describe our in-lab and real-world data collection and analysis process for building the
taxonomy. Then, we explain the iterative process of concretizing the taxonomy with experimental
data and LLM classification.</p>
      <sec id="sec-3-1">
        <title>2.1. Step 1: Data Collection</title>
        <p>Lab Study Setting: We first collected conversational logs from an in-lab user testing of Naver Cue:
with 9 predefined search tasks (Table 1): five exploratory (e.g., ‘Buy a dishwasher with a budget
around $500.’) and four close-ended tasks (e.g., ‘Find how to get a government-subsidized housing
loan.’). We selected these search tasks to include diverse task types, including searching for a
location, shopping, living information, event planning, etc. All authors, including both academic
researchers and research scientists from Naver, decided on the list of queries considering whether
the query is realistic and frequently used by real-world users.</p>
        <p>We recruited participants with prior experience using generative AI applications (e.g.,
ChatGPT) but not generative search (e.g., Bing Chat, Bard), so that prior experience with
generative search would not influence the collected data. A total of 72 participants (36 male, 36 female; ages
18 to 64) conducted 108 search sessions. Half of the pool (36 participants) completed
one conversation task, and the other 36 completed two conversation tasks, with 5 minutes per task.
Each participant was randomly assigned to one or two of the nine predefined tasks. Due to a
system error, we lost the logs of six participants who completed only one conversation task. We
also excluded two single-turn sessions, leaving 100 sessions in total.</p>
        <p>Real-world Log Collection: For real-world data, we randomly sampled 2,300 conversation
tuples (the set of ‘User Query 1 - System Response - User Query 2’) from the live service of Naver
Cue: between January 1 and February 17, 2024. Following the company’s internal offline
evaluation protocol, we recruited eight external evaluators trained in search evaluation.
The evaluators rated the “satisfaction” of the search engine response on a 5-point Likert scale,
defined by whether the search engine’s response satisfactorily met the search intent of
the query. The final satisfaction score was determined considering the following sub-criteria.
          <list list-type="bullet">
            <list-item><p>Relevance: Does the response understand the intent of the user query well? (0/1)</p></list-item>
            <list-item><p>Informativeness: Does the response contain enough information to satisfy the search intent? (0/1/2)</p></list-item>
            <list-item><p>Trustworthiness: Is the response factually correct? (0/1)</p></list-item>
            <list-item><p>Expressiveness: Does the response sound natural without any awkwardness? (0/1)</p></list-item>
            <list-item><p>Safety: Does the response contain sensitive or socially controversial content? (0/1)</p></list-item>
          </list>
        </p>
        <p>Based on the rating results, we excluded 239 tuples where the external evaluators could not
infer the search intent from the query or rated the engine as having completely failed to satisfy
the search intent, resulting in a total of 2,061 tuples.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Step 2: Iterative Taxonomy Creation with Qualitative Analysis</title>
        <p>We followed a standard thematic analysis process to build our taxonomy. We segmented each
conversation from the in-lab conversation logs to produce conversational tuples, ending up
with 607 tuples in total. Two authors independently conducted initial open coding on a total of
100 randomly sampled tuples and discussed together to produce an initial taxonomy. During
the process, we identified two distinct axes: (1) the purpose of a follow-up query and (2) the
expression/method of a follow-up query. To revise the taxonomy, three authors additionally
coded 150 tuples and iterated on the initial taxonomy. After the final revision of the taxonomy,
we decided on the final codes for 250 tuples.</p>
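        <p>The segmentation into conversational tuples described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the names Turn, ConversationalTuple, and segment_session are hypothetical, and a session is assumed to be an ordered list of query-response turns.</p>
        <preformat><![CDATA[
```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    """One exchange in a session (hypothetical structure)."""
    query: str     # user query N
    response: str  # system response to query N


class ConversationalTuple(NamedTuple):
    """'User Query N - System Response to N - User Query N+1'."""
    query: str
    response: str
    follow_up_query: str


def segment_session(turns: List[Turn]) -> List[ConversationalTuple]:
    """Pair every turn with the query of the turn that follows it."""
    return [
        ConversationalTuple(cur.query, cur.response, nxt.query)
        for cur, nxt in zip(turns, turns[1:])
    ]


if __name__ == "__main__":
    # Toy session with placeholder text, not data from the study.
    session = [
        Turn("query 1", "response 1"),
        Turn("query 2", "response 2"),
        Turn("query 3", "response 3"),
    ]
    for item in segment_session(session):
        print(item)
```
]]></preformat>
        <p>Under this sketch, a session with k turns yields k - 1 tuples, so a single-turn session yields none, which is consistent with excluding single-turn sessions from the analysis.</p>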
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Taxonomy of Follow-up Queries in Conversational Search</title>
      <p>We describe our taxonomy across two major axes: (1) users’ motivations behind continuing
conversations (7 themes, Table 2) and (2) the patterns of follow-up queries (11 themes, Table 3).</p>
      <sec id="sec-4-1">
        <title>3.1. Axis 1: User Motivation</title>
        <p>
          The first axis covers the motivations behind the follow-up queries to continue the conversation.
Previous work on the conceptual framework of conversational search [
          <xref ref-type="bibr" rid="ref1 ref19">19, 1</xref>
          ] defined three
stages of the information-seeking process: query formulation, search results exploration, and
query reformulation. Our themes show how the query reformulation and search results exploration
stages are conducted with LLM-powered conversational search.
        </p>
        <p>
          Clarifying Queries, which indicates users’ attempts to re-express their search intent in different
words, has been recognized as ‘query reformulation’ not only in traditional search [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] but also
in conversational search contexts [
          <xref ref-type="bibr" rid="ref1 ref16">16, 1</xref>
          ]. Clarifying Queries shows how users refine
their queries to align their intent with the system’s capabilities [30].
        </p>
        <p>
          In contrast, Broadening Information, Deepening Information, and Seeking Different
Representations reveal diverse expressions of users’ endeavors to navigate the relevant
information space, expanding result exploration behaviors presented in previous work [
          <xref ref-type="bibr" rid="ref1 ref16">16,
1</xref>
          ]. Although those motivations have also been identified in prior research on traditional
search [
          <xref ref-type="bibr" rid="ref4">23, 4</xref>
          ], we observed that several themes became more explicit in conversational search.
Seeking Different Representations was one case that became explicit in the queries of the
conversational setting, which involved asking for the search results in other forms like photos
and tables (e.g., “Show the posture for wall squat in photos (E4)”, “Compare the strength and
weakness of the previous search results in a table (E3)”). With conventional web search, such
intentions would be expressed by switching to image search or by browsing and making sense of
multiple web pages.
        </p>
        <p>
          We also observed that users are Evaluating Search Results by seeking confirmations and
providing feedback on the results to improve the search quality, which is distinct from other
themes on exploring information. This behavior aligns with the interrogating behaviors in
conversational search proposed by Azzopardi et al. [
          <xref ref-type="bibr" rid="ref1 ref16">16, 1</xref>
          ], but it has not previously been observed in
conventional search.
        </p>
        <p>The new themes, including Verifying Information, Reacting to Response, as well as
Seeking Different Representations, show more diverse expectations for conversational
search engines, such as verifying the trustworthiness of their responses or gathering multiple
pieces of information for users’ sensemaking process.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Axis 2: Patterns of Follow-up Queries</title>
        <p>
          The second axis represents the actions users exhibit in their follow-up queries. We classify them
into three categories: Query-specific, Response-specific, and Session-specific.
Query-specific follow-up queries involve relatively simple and well-known refinements to the original
query, such as Excluding Condition, Adding/Specifying Condition, and Substituting
Condition [23, 24, 25, 22, 26]. Response-specific follow-up queries, such as Converting
Format and Requesting Additional Information, show how browsing search results with
a traditional search interface translates to conversational actions. Criticizing Response and
Affirming Response highlight a distinctive aspect of conversational search
with naturalistic interactions. These themes break down the category of Conversational Feedback
identified in conversational QA settings [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Users’ Confirming Response actions
involve double-checking the engine’s responses, similar to the confirmation
actions of conversational agents that are commonly provided as a repair strategy for overcoming
conversational breakdowns [27].
        </p>
        <p>From a broader perspective, users sometimes search for information to achieve
their holistic search tasks, such as decision-making based on the information found, with
Session-specific queries. For instance, Requesting Opinions from the system reflects users’
attempts to actively ask for opinions and perspectives on their alternative ideas or options
to complete their search task. Requesting Related Information reflects users’ intent to
seek information related to their high-level search task but not directly referred to in the search
results.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Classifying Interaction Patterns with LLM</title>
      <p>To apply our taxonomy to classify conversational logs at scale, we built an LLM-powered
classifier to analyze interaction patterns of diverse conversational search logs.</p>
      <sec id="sec-5-1">
        <title>4.1. Implementation and Iteration of LLM-based Classifier</title>
        <p>We iteratively designed an LLM prompt for GPT-4 (gpt-4-1106-preview, temperature = 1) to
automatically code each conversational tuple with our taxonomy, following existing approaches
on qualitative coding with LLMs [31, 32]. We implemented the LLM-based classifier with three
goals: coverage, accuracy, and efficiency. We first designed the initial version of the classifier
using the definitions of each theme. We tested it with 250 manually coded conversational tuples
from the taxonomy generation process (Section 2.2). Below, we describe how we iterated our
classifier to satisfy each goal.</p>
        <p>Coverage: To ensure comprehensive coverage of conversational elements, we added the
“Unclassified” theme to the classifier to identify queries that did not correspond with any of our
predefined themes. We examined an additional 1,700 conversational tuples from in-the-wild
conversational logs and newly identified tuples not captured by the taxonomy. Based on the
results, we incorporated additional themes, such as casual chatting for Axis 2, to complement
our taxonomy.</p>
        <p>Accuracy: To enhance the accuracy of our taxonomy classification, we first identified the
conversational tuples that were frequently misclassified or inconsistently classified across
multiple runs. To mitigate these errors, we refined the wording of the ambiguous taxonomy
definitions. We also identified edge cases and frequent misclassifications made by the
classifier by asking GPT-4 to explain the reasons behind its misclassifications. We selected one
representative conversational tuple for each identified reason per theme and included it in the
classifier prompt as a few-shot example. Additionally, we implemented a priority-based decision
mechanism in the classifier, instructing it to prioritize Query-specific classifications over
Response-specific, followed by Session-specific, and finally Miscellaneous themes.
This was to ensure that the classifier consistently outputs the most relevant theme in cases
of ambiguous classifications. Given that our datasets were in Korean, we also compared the
performance of English versus Korean prompts. The English prompts consistently yielded
higher accuracy, indicating a clear performance gap between the two prompt languages.</p>
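        <p>The priority ordering above can be illustrated as a small tie-breaking rule. In the study the ordering is given to the classifier as a prompt instruction; the sketch below instead applies it as post-processing over a list of candidate themes, which is an assumption made only for illustration. The names resolve_by_priority and THEME_CATEGORY are hypothetical, and the mapping covers only the themes that Section 3.2 explicitly assigns to a category.</p>
        <preformat><![CDATA[
```python
from typing import List, Optional

# Category order stated in the text: earlier entries win over later ones.
CATEGORY_PRIORITY = [
    "Query-specific",
    "Response-specific",
    "Session-specific",
    "Miscellaneous",
]

# Partial, illustrative mapping; themes not listed fall back to "Miscellaneous".
THEME_CATEGORY = {
    "Excluding Condition": "Query-specific",
    "Adding/Specifying Condition": "Query-specific",
    "Substituting Condition": "Query-specific",
    "Converting Format": "Response-specific",
    "Requesting Additional Information": "Response-specific",
    "Requesting Opinions": "Session-specific",
    "Requesting Related Information": "Session-specific",
}


def category_rank(theme: str) -> int:
    """Lower rank means higher priority."""
    category = THEME_CATEGORY.get(theme, "Miscellaneous")
    return CATEGORY_PRIORITY.index(category)


def resolve_by_priority(candidates: List[str]) -> Optional[str]:
    """Pick the candidate theme whose category has the highest priority.

    Candidates in the same category keep their input order (min is stable).
    """
    if not candidates:
        return None
    return min(candidates, key=category_rank)


if __name__ == "__main__":
    print(resolve_by_priority(["Requesting Opinions", "Substituting Condition"]))
```
]]></preformat>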
        <p>Efficiency: For the final prompt iteration of our classifier, we focused on optimizing the
prompt length while preserving the accuracy. This was to create a scalable classifier suitable
for large-scale analyses. Among the few-shot examples of conversational tuples in the prompts,
we removed search engines’ lengthy responses that did not help identify the intent (Axis 1) or
the action (Axis 2).</p>
        <p>After these refinements, we evaluated the performance of our classifier by comparing its
output with human labels. Human labeling was done by two authors in our research
team through a process of (1) individual labeling and (2) discussing the differences to resolve
conflicts. The tuple data used for performance evaluation was randomly chosen from the lab
setting (N = 247) and was not used during the construction of the taxonomy and classifier. In
the end, the classifier achieved an accuracy of 74.39% for user motivations (Axis 1) and 71.11%
for the patterns of follow-up queries (Axis 2).</p>
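        <p>Per-axis accuracy against the reconciled human labels amounts to the share of tuples where the classifier's theme matches the human theme. A minimal sketch, assuming one label per tuple per axis and exact string matching (the function name accuracy and the toy labels are illustrative, not study data):</p>
        <preformat><![CDATA[
```python
from typing import Sequence


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of items whose predicted theme equals the human-assigned theme."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one labeled item is required")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Toy labels for one axis; theme names are taken from the taxonomy.
    gold = ["Clarifying Queries", "Narrowing Down", "Exploring Domain", "Narrowing Down"]
    pred = ["Clarifying Queries", "Narrowing Down", "Narrowing Down", "Narrowing Down"]
    print(f"{accuracy(pred, gold):.2%}")
```
]]></preformat>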
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Application of Taxonomy onto Real-World Conversation Logs</title>
        <p>Using our LLM-powered classifier, we analyzed 2,061 conversational tuples randomly chosen
from real-world conversation logs (Section 2.1). Our classifier assigned the purposes (Axis 1)
of more than 91% of the tuples and the actions (Axis 2) of more than 80% of the tuples to themes
in our taxonomy, while the remaining tuples were left ‘unclassified.’</p>
        <p>For Axis 1 (Purpose), the most frequent themes were Exploring Domain, followed by
Narrowing Down, Understanding Response, and Clarifying Queries. For Axis 2 (Action), the
most frequent themes were Requesting Additional Information, Substituting Condition,
and Adding/Specifying Condition.</p>
        <p>We examined the themes’ co-occurrences to observe the correlation between the axes
(Figure 2). The co-occurrence suggested users’ common actions to achieve their purpose, such
as Requesting Additional Information and Requesting Unrelated Information for
Exploring Domain, Substituting Condition for Clarifying Queries, Requesting Additional
Information for Understanding Response, and Adding/Specifying Condition for
Narrowing Down.</p>
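        <p>The co-occurrence analysis between the two axes can be sketched as counting (motivation, action) pairs over classified tuples and reading off the most frequent action for each motivation. This is an illustrative sketch with hypothetical function names and toy labels; it does not reproduce the counts behind Figure 2.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (Axis 1 motivation, Axis 2 action)


def cooccurrence(labels: Iterable[Pair]) -> Counter:
    """Count how often each (motivation, action) pair occurs."""
    return Counter(labels)


def top_action_per_motivation(counts: Counter) -> Dict[str, str]:
    """For each motivation, return the action it co-occurs with most often."""
    best: Dict[str, Tuple[str, int]] = {}
    for (motivation, action), n in counts.items():
        if motivation not in best or n > best[motivation][1]:
            best[motivation] = (action, n)
    return {motivation: action for motivation, (action, _) in best.items()}


if __name__ == "__main__":
    # Toy labels only, chosen to mirror the pairings named in the text.
    labels = [
        ("Clarifying Queries", "Substituting Condition"),
        ("Clarifying Queries", "Substituting Condition"),
        ("Clarifying Queries", "Excluding Condition"),
        ("Narrowing Down", "Adding/Specifying Condition"),
    ]
    print(top_action_per_motivation(cooccurrence(labels)))
```
]]></preformat>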
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conversational Patterns vs. Perceived Search Experiences</title>
      <p>With the classifier, we analyzed real-world conversation logs by themes in our taxonomy and
observed how each theme correlates with users’ search satisfaction. The user satisfaction score
of each theme in our taxonomy shows that some of the themes have a high correlation with
user satisfaction (Figure 3).</p>
      <p>We first ran a Kruskal-Wallis test and verified that the satisfaction scores differed
significantly across themes (for Axis 1, H = 56.65, p &lt; 1e-9; for Axis 2, H = 79.88, p &lt; 1e-12). For
Axis 1, the post-hoc analysis with the Dunn test between the themes showed that Clarifying
Queries and Reacting to Response had significantly different satisfaction scores compared
to other themes (Figure 4). The heatmap of themes and satisfaction scores (Figure 3) showed
that Clarifying Queries and Reacting to Response had lower satisfaction scores than other
themes. This initial observation suggests that actively correcting prior queries and reacting
to the system response can be negative feedback from the users. Moreover,
analysis of the detailed metrics revealed more nuanced features. In the case of Clarifying Queries,
the responses tended to be rated with very low informativeness (Figure 6), while the responses
classified as Reacting to Response were rated low regarding credibility (Figure 7).</p>
      <p>For Axis 2, the post-hoc analysis with the Dunn test between the themes showed that
Excluding Condition, Substituting Condition, and Criticizing Response had significantly
different satisfaction scores from other themes. The heatmap of themes and satisfaction scores
(Figure 3) showed that Excluding Condition, Substituting Condition, and Criticizing
Response were correlated with low satisfaction. The co-occurrence of the themes suggested
that the first two themes were frequently used to clarify the users’ search intents (Figure 2).
As for Criticizing Response, the results may suggest that the users’ dissatisfaction with the engine’s response
was related to the trustworthiness of the response (Figure 7).</p>
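      <p>As a reference for the omnibus test used in this section, the Kruskal-Wallis H statistic with the standard tie correction can be computed as sketched below. This is a standard-library-only illustration, not the analysis code behind the reported results: the function names are hypothetical, the p-value (from a chi-square distribution with k - 1 degrees of freedom for k groups) and the Dunn post-hoc test are omitted, and in practice a statistics library would typically be used.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> Dict[float, float]:
    """Map each distinct value to the mean of the 1-based ranks it occupies."""
    counts = Counter(values)
    ranks: Dict[float, float] = {}
    start = 1
    for value in sorted(counts):
        t = counts[value]
        ranks[value] = start + (t - 1) / 2  # mean of start .. start + t - 1
        start += t
    return ranks


def kruskal_wallis_h(groups: List[Sequence[float]]) -> float:
    """Kruskal-Wallis H with tie correction (ties are common for Likert scores)."""
    if len(groups) < 2 or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups")
    pooled = [x for g in groups for x in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    # Sum over groups of (rank sum)^2 / group size.
    rank_term = sum(sum(ranks[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

    # Tie correction: 1 - sum(t^3 - t) / (N^3 - N) over groups of tied values.
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


if __name__ == "__main__":
    # Toy satisfaction scores per theme, not data from the study.
    print(kruskal_wallis_h([[1, 1, 2], [2, 3, 3]]))
```
]]></preformat>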
    </sec>
    <sec id="sec-7">
      <title>6. Discussion &amp; Future Work</title>
      <p>We discuss the potential implications of our taxonomy, the potential challenges of the automatic
evaluation, and the limitations of this work.</p>
      <sec id="sec-7-1">
        <title>6.1. Implications and Future Application of the Taxonomy</title>
        <p>We first investigated various types of follow-up queries in the context of conversational search.
Expanding and comparing with previous taxonomies, our taxonomy could contribute to a better
understanding of users’ intent behind follow-up queries, thereby uncovering new types of user
needs in conversational search settings.</p>
        <p>We expect that such understanding could be used to build search engines that take the initiative
in the information-seeking process and actively guide users. For example, if users issue a specific
pattern of follow-up query (e.g., Narrowing Down) more often when looking for a specific
type of information (e.g., local information), analyzing the actual queries could give insights
on the specific information that the users frequently desire. Such insights could be embedded
into the search system as query recommendations [33] or clarifying queries from the search
engine [30]. Furthermore, insights from the follow-up query patterns of individual users could
support designing personalized search paths for each user.</p>
        <p>Also, our initial findings on the correlation between the users’ behavioral patterns and
satisfaction show how our taxonomy could guide improving the user experience with conversational
search. For example, Clarifying Queries and Substituting Condition could directly relate to
users’ challenges in conveying their search intent to the engine. On the other hand,
Understanding Response and Requesting Additional Information suggest that the users were
satisfied with the response in general but needed more information to completely satisfy their
search intent, which could inform directions for improving the engine performance. Since we
observed the relationships between axes and how user satisfaction relates to each theme within
these axes, we can integrate these insights to identify signals of unsatisfactory sessions.</p>
        <p>We also envision that our taxonomy could aid more realistic conversation simulation. As
conversational search across multiple turns can create numerous potential pathways
of user interaction, we expect that user simulation would be essential to holistically examine
and evaluate conversational search in a user-centric manner. Recent work has explored
simulating information-seeking behaviors with conventional search engines [34, 35, 36, 37]. At the
same time, the HCI and AI communities have been actively investigating LLM-powered user
simulations to evaluate conversational agents [38, 39]. We expect our taxonomy to support
a more realistic simulation of information-seeking conversations with conversational search
engines.</p>
      </sec>
      <sec id="sec-7-2">
        <title>6.2. Potential Considerations of Automatic Evaluation</title>
        <p>We chose to build an LLM-powered classifier to evaluate conversational search logs in an efficient
and scalable way. We observed that our classifier yielded a moderate level of classification
accuracy with enough coverage (Axis 1: 74.39%, Axis 2: 71.11%).</p>
        <p>Still, we acknowledge that LLM-powered evaluation could lead to several potential problems.
First, LLMs could lead to biased evaluation. For example, Ashwin et al. [40] showed that using
LLMs for annotating interview transcripts could lead to biased annotations. They pointed out
that the usage of LLM classification may not be suited for nuanced and context-specific tasks.
To prevent potential biases in our context of search evaluation, future work could investigate
evaluation with LLMs incorporating contextual information to mitigate such biases. For example,
considering users’ prior search histories could mitigate the effects of the innate characteristics
of each user in the taxonomy classification results.</p>
        <p>Second, the stochastic nature of LLMs may make it challenging to produce consistent results.
During classifier iteration, we ran the classifier multiple times and used majority
voting to make results more consistent, but this approach may require a significant amount of time and
resources. Furthermore, the black-box nature of LLMs could make it difficult to diagnose
the causes of misclassification. During our classifier iteration, we asked the LLM to explain
the reasons for its misclassifications. Future advances in the explainability of LLMs could
help users of the classifier determine whether a classification result is consistent enough
to be trusted and diagnose misclassifications more reliably.</p>
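        <p>The majority-voting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: <monospace>classify_once</monospace> is a hypothetical stand-in for a single LLM call, and the number of runs and the strict-majority threshold are assumed values that the paper does not report.</p>
        <preformat>
```python
from collections import Counter
from typing import Callable, Optional


def majority_vote(labels: list[str], min_share: float = 0.5) -> Optional[str]:
    """Return the most frequent label, or None when no label exceeds min_share of the runs."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    # Strict majority by default: the winner must appear in more than half of the runs.
    return label if count / len(labels) > min_share else None


def classify_with_voting(
    classify_once: Callable[[str], str],  # hypothetical: one stochastic LLM call
    excerpt: str,
    runs: int = 5,  # assumed value, not reported in the paper
) -> Optional[str]:
    """Call the classifier several times on the same excerpt and aggregate its answers."""
    return majority_vote([classify_once(excerpt) for _ in range(runs)])
```
</preformat>
        <p>Returning no label when the runs disagree is one possible design choice; it lets unstable excerpts be routed to manual review instead of being assigned an arbitrary label, at the cost of several model calls per excerpt.</p>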
        <p>Lastly, as we used a fixed taxonomy, the classifier could fail to capture users’ evolving
usage of search engines. Users of the classifier would need to monitor its results
and make the necessary updates regularly. Furthermore, we expect that future versions of
the classifier could also support identifying new conversational patterns, as in LLM-supported
inductive coding [41].</p>
      </sec>
      <sec id="sec-7-3">
        <title>6.3. Limitations &amp; Future Work</title>
        <p>First, we focused only on the relationship between two consecutive queries, potentially
overlooking the complex structure of whole conversations. For example, our analysis method could
not capture how users refine their search intents across multiple turns. Future work could
develop methodologies to better understand the full context of conversations.</p>
        <p>Second, we collected the satisfaction ratings from external evaluators rather than from the
users themselves. While ratings by external evaluators yielded more objective evaluation results,
they might deviate from the thoughts and search intents of the actual users. We expect that further
research could study how our taxonomy relates to search users’ own evaluations of the search
session and examine whether the correlation differs from the findings of the current study.</p>
        <p>Third, while we used diverse types of search tasks to capture realistic use cases of search
engines within our taxonomy, these tasks could still be insufficient to capture all conversational patterns
with search engines. Expanding the work with more diverse and naturalistic search tasks may
uncover more fine-grained themes of follow-up queries, leading to more insights into how
the follow-up query taxonomy could signal user satisfaction and other significant conversational
quality attributes.</p>
        <p>Lastly, as we developed the LLM classifier with logs from in-lab evaluation, we observed
some limitations of the classifier with real-world conversation logs. In future work, we plan to
further iterate on the prompt with real-world data and use a fine-tuned model to improve its
performance.</p>
        <p>
[22] E. Y. Rha, M. Mitsui, N. J. Belkin, C. Shah, Exploring the relationships between search intentions and query reformulations, Proceedings of the Association for Information Science and Technology 53 (2016) 1–9. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/pra2.2016.14505301048. doi:10.1002/pra2.2016.14505301048.
[23] S. Y. Rieh, H. I. Xie, Analysis of multiple query reformulations on the web: The interactive information retrieval context, Information Processing &amp; Management 42 (2006) 751–768. URL: https://www.sciencedirect.com/science/article/pii/S030645730500066X. doi:10.1016/j.ipm.2005.05.005.
[24] J. Jiang, W. Jeng, D. He, How do users respond to voice input errors? lexical and phonetic query reformulation in voice search, in: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, 2013, pp. 143–152.
[25] J. R. Trippas, D. Spina, L. Cavedon, H. Joho, M. Sanderson, Informing the Design of Spoken Conversational Search: Perspective Paper, in: Proceedings of the 2018 Conference on Human Information Interaction &amp; Retrieval, CHIIR ’18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 32–41. URL: https://dl.acm.org/doi/10.1145/3176349.3176387. doi:10.1145/3176349.3176387.
[26] X. Wang, D. Chakrabarti, K. Punera, Mining broad latent query aspects from search sessions, 2009, pp. 867–876. doi:10.1145/1557019.1557114.
[27] Z. Ashktorab, M. Jain, Q. V. Liao, J. D. Weisz, Resilient chatbots: Repair strategy preferences for conversational breakdowns, in: Proceedings of the 2019 CHI conference on human factors in computing systems, 2019, pp. 1–12.
[28] B. B. Cambazoglu, L. Tavakoli, F. Scholer, M. Sanderson, B. Croft, An intent taxonomy for questions asked in web search, in: Proceedings of the 2021 Conference on Human Information Interaction and Retrieval, 2021, pp. 85–94.
[29] P. Boldi, F. Bonchi, C. Castillo, S. Vigna, Query reformulation mining: models, patterns, and applications, Information retrieval 14 (2011) 257–289.
[30] H. Zamani, S. Dumais, N. Craswell, P. Bennett, G. Lueck, Generating clarifying questions for information retrieval, in: Proceedings of The Web Conference 2020, WWW ’20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 418–428. URL: https://doi.org/10.1145/3366423.3380126. doi:10.1145/3366423.3380126.
[31] R. Chew, J. Bollenbacher, M. Wenger, J. Speer, A. Kim, LLM-assisted content analysis: Using large language models to support deductive coding, arXiv preprint arXiv:2306.14924 (2023).
[32] Z. Xiao, X. Yuan, Q. V. Liao, R. Abdelghani, P.-Y. Oudeyer, Supporting qualitative analysis with large language models: Combining codebook with GPT-3 for deductive coding, in: Companion Proceedings of the 28th International Conference on Intelligent User Interfaces, 2023, pp. 75–78.
[33] R. Baeza-Yates, C. Hurtado, M. Mendoza, Query recommendation using query logs in search engines, in: International conference on extending database technology, Springer, 2004, pp. 588–596.
[34] E. Zhang, X. Wang, P. Gong, Y. Lin, J. Mao, USimAgent: Large language models for simulating search users, 2024. arXiv:2403.09142.
[35] B. Engelmann, T. Breuer, J. I. Friese, P. Schaer, N. Fuhr, Context-driven interactive query simulations based on generative large language models, in: N. Goharian, N. Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 173–188.
[36] P. Owoicho, I. Sekulic, M. Aliannejadi, J. Dalton, F. Crestani, Exploiting simulated user feedback for conversational search: Ranking, rewriting, and beyond, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 632–642. URL: https://doi.org/10.1145/3539618.3591683. doi:10.1145/3539618.3591683.
[37] I. Sekulić, M. Aliannejadi, F. Crestani, Evaluating mixed-initiative conversational search systems via user simulation, in: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM ’22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 888–896. URL: https://doi.org/10.1145/3488560.3498440. doi:10.1145/3488560.3498440.
[38] E. Svikhnushina, P. Pu, Approximating online human evaluation of social chatbots with prompting, in: S. Stoyanchev, S. Joty, D. Schlangen, O. Dusek, C. Kennington, M. Alikhani (Eds.), Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics, Prague, Czechia, 2023, pp. 268–281. URL: https://aclanthology.org/2023.sigdial-1.25. doi:10.18653/v1/2023.sigdial-1.25.
[39] J. Afzali, A. M. Drzewiecki, K. Balog, S. Zhang, UserSimCRS: A user simulation toolkit for evaluating conversational recommender systems, in: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM ’23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 1160–1163. URL: https://doi.org/10.1145/3539597.3573029. doi:10.1145/3539597.3573029.
[40] J. Ashwin, A. Chhabra, V. Rao, Using large language models for qualitative analysis can introduce serious bias, arXiv preprint arXiv:2309.17147 (2023).
[41] J. Gao, Y. Guo, T. J.-J. Li, S. T. Perrault, CollabCoder: A GPT-powered workflow for collaborative qualitative analysis, in: Companion Publication of the 2023 Conference on Computer Supported Cooperative Work and Social Computing, CSCW ’23 Companion, Association for Computing Machinery, New York, NY, USA, 2023, pp. 354–357. URL: https://doi.org/10.1145/3584931.3607500. doi:10.1145/3584931.3607500.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>A. Appendix</title>
      <p>In this section, we present the prompts used for our classifier. The dataset used in this
study is primarily in Korean; therefore, the few-shot examples were originally written in Korean.
However, parts of the classifier prompts and examples have been translated into English for
clarity in this documentation. The full code, including the Korean prompt, can be found at
https://github.com/kixlab/searchgpt-classifier.</p>
      <sec id="sec-8-1">
        <title>A.1. Axis 1 Classifier</title>
        <p>You are a classifier analyzing an excerpt of a conversational search log, consisting of the
↪ initial query Q1, the engine's response R, and the follow-up query Q2. Among the
↪ taxonomy of the purpose of Q2 presented below, choose one that matches the best and
↪ only tell me the name of the taxonomy as this is scripted.
1. Clarifying Queries - Q2 clarifies the exact search intent of Q1. This occurs when Q2 is
↪ repeating the same question as Q1 to get a more specific answer by repeating the same
↪ question in Q1 with exact words or by asking the question about the same exact thing.
↪ If Q1 and Q2 are different, ask if the rewording of Q2 changes the meaning from Q1 or
↪ is just clarifying.
2. Exploring Domain - Q2 asks a question that explores the same general topic as Q1. Use
↪ taxonomy's definition widely.
3. Understanding Response - Q2 seeks more information necessary to understand R or to expand
↪ the range of information from R. This occurs when Q2 specifically asks for more
↪ information about information mentioned in R.
4. Narrowing Down - Q2 seeks more details or specific information that narrows down the
↪ question asked in Q1. This only occurs when Q2 asks for more specific information
↪ about something mentioned in Q1.
5. Seeking Different Representations - Q2 seeks different modalities (e.g., images, videos)
↪ or formats (e.g., tables) to better present the information provided in R.
6. Verifying Information - Q2 asks for more information to verify the information by asking
↪ for evidence, explanations, or alternatives from R
7. Reacting to Response - Q2 expresses satisfaction/dissatisfaction with R or provides
↪ feedback on R.
8. Unclassified - Q2 is completely irrelevant to Q1 or R.
1:
Q1: What is the most preferred type of dishwasher by newlyweds?
R: Regarding dishwasher types, the most preferred product by newlyweds is a 12-person
↪ dishwasher. This capacity is generally suitable for a family of four. However, a
↪ 6-person...</p>
        <p>Q2: Tell me the product name of the dishwasher preferred by newlyweds.</p>
        <p>Taxonomy: Clarifying Queries
1A:
Q1: How can I apply for an additional government-supported rental loan if I already own a
↪ house?
R: The method to apply for an additional government-supported rental loan using your
↪ existing house as collateral is as follows...</p>
        <p>Q2: I am curious about the income requirements for the government-supported rental loan.
↪ Tell me.</p>
        <p>Taxonomy: Clarifying Queries
2:
Q1: Is there any medicine I can take for a headache caused by a hangover?
R: There is no medicine you can take for a headache caused by a hangover. A hangover is the
↪ process of alcohol being metabolized...</p>
        <p>Q2: What should I do if I accidentally took medicine for a severe hangover?
Taxonomy: Exploring Domain
2A:
Q1: Is it okay to bring toilet paper as a housewarming gift for newlyweds?
R: As for bringing toilet paper as a housewarming gift for newlyweds, it can be a good
↪ choice. However...</p>
        <p>Q2: Tell me about a housewarming gift under 5,000 won that is not too expensive.
Taxonomy: Exploring Domain
2B:
Q1: So I shouldn't take headache medicine for a hangover?
Q2: Recommend a pain reliever with ibuprofen that helps with hangovers.</p>
        <p>Taxonomy: Exploring Domain
3:
Q1: How about whiskey?
R: A whiskey product recommended as a housewarming gift for newlyweds in the 50,000 won
↪ range is available. Drinking it 'on the rocks' is a method of enjoying whiskey
↪ warmly, which can also be easily enjoyed at a newlywed's home. The product is an
↪ 'Oreforse' brand whiskey, which is popular for its luxury design and quality. Here
↪ are some recommended products:
Q2: What is 'on the rocks'?
Taxonomy: Understanding Response
Q1: Choose a housewarming gift for newlyweds.</p>
        <p>R: Suitable housewarming gifts for newlyweds include practical gifts like toilet paper, wet
↪ wipes, and coffee machines...</p>
        <p>Q2: Choose a housewarming gift for newlyweds in the 50,000 won range.</p>
        <p>Taxonomy: Narrowing Down
4B:
Q1: What is the preferred capacity for a dishwasher?
R: The preferred capacity for a dishwasher is as follows...</p>
        <p>Q2: Tell me the price of an LG dishwasher.</p>
        <p>Taxonomy: Narrowing Down</p>
        <p>Taxonomy: Seeking Different Representations
6:
Q1: How do I remove polish from a steel tumbler, not stainless steel?
R: The method to remove polish from a steel tumbler is as follows...</p>
        <p>Q2: Do I need to follow the steps from 1 to 6 in order?
Taxonomy: Verifying Information
6A:
Q1: Is the Wagner Group the largest army in Russia?
R: The largest army in Russia is the Russian Ground Forces, which is the largest in scale
↪ among the Russian military, with approximately 370,000 personnel. The Russian Ground
↪ Forces focus on mechanized and airborne units...</p>
        <p>Q2: Isn't the Russian Ground Forces affiliated with the Russian government?
Taxonomy: Verifying Information
6B:
Q1: Can Tylenol help with a hangover?
R: Tylenol can help relieve headaches caused by hangovers. However, taking Tylenol after
↪ drinking can lead to liver damage...</p>
        <p>Q2: Can I take a painkiller that contains acetaminophen to relieve a hangover?
Taxonomy: Verifying Information
7:
Q1: None of the three models you mentioned are in the 500,000 won range.</p>
        <p>R: If I understand correctly, you want to know which of the three models I mentioned are not
↪ in the 500,000 won range...</p>
        <p>Q2: I was talking about dishwashers, not Galaxy Tabs... Get it together, (LLM Agent Model
↪ name).</p>
        <p>Taxonomy: Reacting to Response</p>
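        <p>Because the Axis 1 prompt asks the model to return only a taxonomy name, the surrounding script has to format each excerpt and map the returned text back to a label. The following is a minimal sketch of that glue code, assuming substring matching on the label names is sufficient; the function names are hypothetical and are not taken from the released repository.</p>
        <preformat>
```python
from typing import Optional

# The eight Axis 1 labels, in the order they are listed in the prompt.
AXIS1_LABELS = [
    "Clarifying Queries",
    "Exploring Domain",
    "Understanding Response",
    "Narrowing Down",
    "Seeking Different Representations",
    "Verifying Information",
    "Reacting to Response",
    "Unclassified",
]


def format_excerpt(q1: str, r: str, q2: str) -> str:
    """Lay out one excerpt in the Q1 / R / Q2 / Taxonomy shape of the few-shot examples."""
    return f"Q1: {q1}\nR: {r}\nQ2: {q2}\nTaxonomy:"


def parse_axis1_label(raw_output: str) -> Optional[str]:
    """Map free-form model output to a known label; return None if nothing matches."""
    text = raw_output.strip().lower()
    # If the output mentions several labels, the one listed first in AXIS1_LABELS wins.
    for label in AXIS1_LABELS:
        if label.lower() in text:
            return label
    return None
```
</preformat>
        <p>Keeping a parse failure (<monospace>None</monospace>) distinct from the Unclassified label is a deliberate choice in this sketch, so that malformed model output is not counted as a taxonomy decision.</p>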
      </sec>
      <sec id="sec-8-2">
        <title>A.2. Axis 2 Classifier</title>
        <p>You are a classifier analyzing an excerpt of a conversational search log, consisting of the
↪ initial query Q1, the engine's response R, and the follow-up query Q2. Among the
↪ taxonomy of the purpose of Q2 presented below, choose one Taxonomy within A, B, C, or
↪ D. Just list the name of the Theme. This taxonomy is based on priority. Check
↪ chronologically, so check if the conversation fits 1 then 2, then 3 and so on... The
↪ lower number gets priority.</p>
        <p>A. Query-Specific (When Q2 is related to Q1):
1. Excluding Conditions - When the user explicitly removes specific conditions from the
↪ previous query (Q1).
2. Adding/Specifying Condition - When the user specifies additional conditions or
↪ parameters to include in the query
3. Substituting Condition - When the user expresses the same query intent using different
↪ conditions or forms.</p>
        <p>ONLY if Q2 is not related to Q1, then:
B. Response-Specific (When Q2 is unrelated to Q1 but related to R):
4. Converting Format - When Q2 asks in a different format (e.g., table, map, graph, image).
5. Confirming Response - When the user seeks clarification or confirms information from R
↪ using Q2.
6. Criticizing Response - When the user expresses criticism of the provided response (R).
7. Affirming Response - When the user expresses satisfaction with the provided response (R).
8. Requesting Additional Information - When the user requests additional information related
↪ to the response (R).</p>
        <p>ONLY if Q2 is not related to Q1 and R, then:
C. Session-Specific (When Q2 is unrelated to Q1 and R but broadly to the topic):
9. Requesting Opinion - When the user asks for an opinion on something not previously
↪ referred to in response (R).
10. Requesting Related Information - When the user requests related information that is
↪ broadly related to the same topic.</p>
        <p>ONLY if Q2 is not related to Q1, R, and the topic, then:
D. Miscellaneous (Other):
11. Chatting Casually - When the user attempts chitchat rather than seeking specific
↪ information.
12. Requesting Unrelated Information.
2A:
Q1: Are there any models under 500,000 KRW from large corporations?
Q2: Not robot vacuums, but dishwashers!
Taxonomy: Adding/Specifying Condition
2B:
Q1: What kinds of special housewarming gifts are there for newlyweds?
Q2: Recommend types of foreign electronic products.</p>
        <p>Taxonomy: Adding/Specifying Condition
2C:
Q1: Tell me about the Russian coup.</p>
        <p>Q2: Tell me more about the recent coup related to the Wagner Group.</p>
        <p>Taxonomy: Adding/Specifying Condition
3:
Q1: Recommend a restaurant in Daejeon suitable for a 70th birthday party with a rating of at
↪ least 4.8.</p>
        <p>Q2: Recommend a well-reviewed restaurant in Daejeon suitable for a 70th birthday party.
Taxonomy: Substituting Condition
3A:
Q1: Tell me more about the second coup.</p>
        <p>Q2: Tell me more about the Russian coup on June 24, 2023.</p>
        <p>Taxonomy: Substituting Condition
5:
Q1: Can I bring toilet paper as a housewarming gift for newlyweds?
Q2: Tell me about a housewarming gift under 5,000 won that isn't too expensive.
Taxonomy: Criticizing Response
7:
Q1: Is the Wagner Group the largest army in Russia?
7A:
Q1: Can Tylenol help with a hangover?
R: Can Tylenol help with a hangover? - A hangover involves symptoms like headache, vomiting,
↪ and dizziness that occur after drinking alcohol...</p>
        <p>Q2: So, can I take a pain reliever containing acetaminophen for hangover relief?
8:
Q1: Choose a housewarming gift for newlyweds in the 50,000 won range.</p>
        <p>R: Here are some recommended housewarming gifts for newlyweds. 1. Handmade pottery gifts for
↪ newlyweds.
Q2: How about whiskey?
8A:
Q1: Is it okay to exercise when I have a severe hangover?
R: Opinions vary on whether you can exercise with a severe hangover. However, some argue
↪ that you can still engage in some physical activity even with a severe hangover.
Q2: Can I go to a bathhouse if I have a headache from a severe hangover?
9:
Q1: I'm planning a ryokan trip, any good places?
R: Are there any good ryokan places for a trip? - Many ryokans in the Yufuin area are
↪ frequently searched. Some ryokans are praised for their breakfast,...</p>
        <p>Q2: Are there other places worth visiting in Yufuin?</p>
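        <p>The Axis 2 prompt numbers its twelve themes and states that the lower number gets priority. A minimal sketch of that ordering, together with the mapping from each theme to its category (A to D), is given below. The theme names and numbers follow the prompt; the helper names and the idea of resolving several candidate themes in post-processing are assumptions made for illustration, since in the prompt the model itself is asked to apply the priority.</p>
        <preformat>
```python
from typing import Optional

# Theme name -> (priority number from the prompt, category).
AXIS2_THEMES = {
    "Excluding Conditions": (1, "A. Query-Specific"),
    "Adding/Specifying Condition": (2, "A. Query-Specific"),
    "Substituting Condition": (3, "A. Query-Specific"),
    "Converting Format": (4, "B. Response-Specific"),
    "Confirming Response": (5, "B. Response-Specific"),
    "Criticizing Response": (6, "B. Response-Specific"),
    "Affirming Response": (7, "B. Response-Specific"),
    "Requesting Additional Information": (8, "B. Response-Specific"),
    "Requesting Opinion": (9, "C. Session-Specific"),
    "Requesting Related Information": (10, "C. Session-Specific"),
    "Chatting Casually": (11, "D. Miscellaneous"),
    "Requesting Unrelated Information": (12, "D. Miscellaneous"),
}


def resolve_by_priority(candidates: list[str]) -> Optional[str]:
    """Pick the candidate theme with the lowest number; ignore unknown names."""
    known = [theme for theme in candidates if theme in AXIS2_THEMES]
    if not known:
        return None
    return min(known, key=lambda theme: AXIS2_THEMES[theme][0])


def category_of(theme: str) -> str:
    """Return the category (A to D) that a theme belongs to."""
    return AXIS2_THEMES[theme][1]
```
</preformat>
        <p>For example, if both Requesting Related Information and Adding/Specifying Condition seem plausible for a follow-up query, this ordering selects Adding/Specifying Condition, which belongs to the Query-Specific category.</p>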
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Geyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Muller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Khazaeni</surname>
          </string-name>
          ,
          <article-title>Conversational Interfaces for Information Search</article-title>
          , in: W. T. Fu, H. van Oostendorp (Eds.),
          <source>Understanding and Improving Information Search: A Cognitive Approach</source>
          , Human-Computer Interaction Series, Springer International Publishing, Cham,
          <year>2020</year>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>287</lpage>
          . URL: https://doi.org/10.1007/978-3-030-38825-6_13. doi:10.1007/978-3-030-38825-6_13.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <article-title>Large-scale validation and analysis of interleaved search evaluation</article-title>
          ,
          <source>ACM Transactions on Information Systems</source>
          <volume>30</volume>
          (
          <year>2012</year>
          )
          <fpage>6:1</fpage>
          -
          <lpage>6:41</lpage>
          . URL: https://doi.org/10.1145/2094072.2094078. doi:10.1145/2094072.2094078.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ramsey</surname>
          </string-name>
          ,
          <article-title>Beyond clicks: query reformulation as a predictor of search satisfaction</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM international conference on Information &amp; Knowledge Management, CIKM '13</source>
          ,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2013</year>
          , pp.
          <fpage>2019</fpage>
          -
          <lpage>2028</lpage>
          . URL: https://doi.org/10.1145/2505515.2505682. doi:10.1145/2505515.2505682.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          ,
          <article-title>Patterns and transitions of query reformulation during web searching</article-title>
          ,
          <source>International Journal of Web Information Systems</source>
          <volume>3</volume>
          (
          <year>2007</year>
          )
          <fpage>328</fpage>
          -
          <lpage>340</lpage>
          . URL: https://doi.org/10.1108/17440080710848116. doi:10.1108/17440080710848116, publisher: Emerald Group Publishing Limited.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. N.</given-names>
            <surname>Efthimiadis</surname>
          </string-name>
          ,
          <article-title>Analyzing and evaluating query reformulation strategies in web search logs</article-title>
          ,
          <source>in: Proceedings of the 18th ACM conference on Information and knowledge management</source>
          ,
          <source>CIKM '09</source>
          ,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2009</year>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>86</lpage>
          . URL: https://doi.org/10.1145/1645953.1645966. doi:10.1145/1645953.1645966.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Diriye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Buscher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <article-title>Leaving so soon? understanding and predicting web search abandonment rationales</article-title>
          ,
          <source>in: Proceedings of the 21st ACM international conference on Information and knowledge management</source>
          ,
          <source>CIKM '12</source>
          ,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2012</year>
          , pp.
          <fpage>1025</fpage>
          -
          <lpage>1034</lpage>
          . URL: https://doi.org/10.1145/2396761.2398399. doi:10.1145/2396761.2398399.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          ,
          <article-title>Online Evaluation for Information Retrieval</article-title>
          ,
          <source>Foundations and Trends® in Information Retrieval</source>
          <volume>10</volume>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>117</lpage>
          . URL: http://www.nowpublishers.com/article/Details/INR-051. doi:10.1561/1500000051.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Myers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Furqan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nebolsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Caro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Patterns for How Users Overcome Obstacles in Voice User Interfaces</article-title>
          ,
          <source>in: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18</source>
          ,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3173574.3173580. doi:10.1145/3173574.3173580.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>See</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Understanding and predicting user dissatisfaction in a neural generative chatbot</article-title>
          , in: H. Li, G.-A. Levow, Z. Yu, C. Gupta, B. Sisman, S. Cai, D. Vandyke, N. Dethlefs, Y. Wu, J. J. Li (Eds.),
          <source>Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue</source>
          , Association for Computational Linguistics, Singapore and Online,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . URL: https://aclanthology.org/2021.sigdial-1.1. doi:10.18653/v1/2021.sigdial-1.1.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Understanding Users' Dissatisfaction with ChatGPT Responses: Types, Resolving Tactics, and the Effect of Knowledge Level</article-title>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2311.07434. doi:10.48550/arXiv.2311.07434, arXiv:2311.07434 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. I.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ahmadvand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agichtein</surname>
          </string-name>
          ,
          <article-title>Offline and Online Satisfaction Prediction in Open-Domain Conversational Systems</article-title>
          , in:
          <source>Proceedings of the 28th ACM International Conference on Information and Knowledge Management</source>
          , CIKM '19,
          Association for Computing Machinery, New York, NY, USA,
          <year>2019</year>
          , pp.
          <fpage>1281</fpage>
          -
          <lpage>1290</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3357384.3358047. doi:10.1145/3357384.3358047.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spyros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sarikaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Quinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Large-scale Hybrid Approach for Predicting User Satisfaction with Conversational Agents</article-title>
          ,
          <year>2020</year>
          . URL: http://arxiv.org/abs/2006.07113. doi:10.48550/arXiv.2006.07113, arXiv:2006.07113 [cs, stat].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hassan Awadallah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Ozertem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Zitouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Gurunath</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. Z.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <article-title>Automatic online evaluation of intelligent assistants</article-title>
          ,
          in:
          <source>Proceedings of the 24th International Conference on World Wide Web</source>
          , WWW '15, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE
          ,
          <year>2015</year>
          , pp.
          <fpage>506</fpage>
          -
          <lpage>516</lpage>
          . URL: https://doi.org/10.1145/2736277.2741669. doi:10.1145/2736277.2741669.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vakulenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Revoredo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Di Ciccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          ,
          <article-title>QRFA: A data-driven model of information seeking dialogues</article-title>
          , in:
          <source>Advances in Information Retrieval</source>
          , Springer International Publishing,
          <year>2019</year>
          , pp.
          <fpage>541</fpage>
          -
          <lpage>557</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Trippas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <article-title>Analyzing and characterizing user intent in information-seeking conversations</article-title>
          ,
          in:
          <source>The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval</source>
          , SIGIR '18,
          Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>989</fpage>
          -
          <lpage>992</lpage>
          . URL: https://doi.org/10.1145/3209978.3210124. doi:10.1145/3209978.3210124.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dubiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halvey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalton</surname>
          </string-name>
          ,
          <article-title>Conceptualizing agent-human interactions during the conversational search process</article-title>
          ,
          <year>2018</year>
          . The Second International Workshop on Conversational Approaches to Information Retrieval (CAIR); conference date: 12-07-2018 through 12-07-2018.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <article-title>A Theoretical Framework for Conversational Search</article-title>
          ,
          in:
          <source>Proceedings of the 2017 Conference on Human Information Interaction and Retrieval</source>
          , ACM, Oslo, Norway,
          <year>2017</year>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>126</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3020165.3020183. doi:10.1145/3020165.3020183.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E. Y.</given-names>
            <surname>Rha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Belkin</surname>
          </string-name>
          ,
          <article-title>An exploration of reasons for query reformulations</article-title>
          ,
          <source>Proceedings of the Association for Information Science and Technology</source>
          <volume>54</volume>
          (
          <year>2017</year>
          )
          <fpage>337</fpage>
          -
          <lpage>346</lpage>
          . URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/pra2.2017.14505401037. doi:10.1002/pra2.2017.14505401037.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Sahib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tombros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Stockman</surname>
          </string-name>
          ,
          <article-title>A comparative analysis of the information-seeking behavior of visually impaired and sighted searchers</article-title>
          ,
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>63</volume>
          (
          <year>2012</year>
          )
          <fpage>377</fpage>
          -
          <lpage>391</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Athukorala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Głowacka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jacucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oulasvirta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vreeken</surname>
          </string-name>
          ,
          <article-title>Is exploratory search different? A comparison of information search behavior for exploratory and lookup tasks</article-title>
          ,
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>67</volume>
          (
          <year>2016</year>
          )
          <fpage>2635</fpage>
          -
          <lpage>2651</lpage>
          . URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.23617. doi:10.1002/asi.23617.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Booth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          ,
          <article-title>Patterns of query reformulation during web searching</article-title>
          ,
          <source>J. Am. Soc. Inf. Sci. Technol</source>
          .
          <volume>60</volume>
          (
          <year>2009</year>
          )
          <fpage>1358</fpage>
          -
          <lpage>1371</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          R: The largest army in Russia is the Russian Ground Forces, which is the largest among the Russian military, with approximately 370,000 personnel...
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          Q2: Isn't the Russian Ground Forces affiliated with the Russian government? 7B: Q1: Russia doesn't have a military organization other than the Wagner Group, does it? Q2: Is the Wagner Group the largest army in Russia?
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>