Toward Conversational Query Reformulation

Johannes Kiesel1, Xiaoni Cai1, Roxanne El Baff2, Benno Stein1 and Matthias Hagen3
1 Bauhaus-Universität Weimar, Bauhausstraße 11, 99423 Weimar, Germany
2 German Aerospace Center (DLR) Oberpfaffenhofen, Münchener Str. 20, 82234 Weßling, Germany
3 Martin-Luther-Universität Halle-Wittenberg, Von-Seckendorff-Platz 1, 06120 Halle (Saale), Germany

Abstract
In traditional web search interfaces, information seekers reformulate their queries by editing the terms in the search box in order to guide the retrieval process. Such editing is at odds with the natural language interaction paradigm of conversational interfaces, and for purely voice-based interfaces it is impossible. Conversational search studies reveal that participants instead describe their changes to a query; however, the principles of such “editing conversations” have not been analyzed in depth. The paper in hand formalizes the problem of conversational query reformulation. We cast reformulations as meta-queries that imply operations on the original query and categorize the operations following the standard CRUD terminology (create, read, update, delete). Based on this formalization we crowdsource a dataset with 2694 human reformulations across four search domains. Our analysis of the meta-queries reveals a large variety in word usage and indicates ambiguous reformulations as an important research topic of its own.

Keywords
Conversational search, Crowdsourcing, CRUD, Information seeking, Query refinement, Query reformulation

1. Introduction

During web search, information seekers frequently find a search engine’s results either too specific, too generic, or containing results relevant only to an unintended interpretation of their query. In such cases seekers may want to reformulate their queries [1, 2].
In a traditional search interface, the seeker would directly edit the previous query in the search field, creating, updating, or deleting terms. Such reformulations account for about half of all queries [3, Sec. 6.3]. However, conversational search interfaces—be they chat-like or voice-based—usually do not allow modifying the previous query. Though some chat interfaces not used for search allow editing previous messages, such functionality breaks temporal continuity, making the interaction significantly less conversational. Still, reformulations are also frequent in conversational search lab studies [4] and can be seen as one user-facing service of the search interface’s conversational layer, as illustrated in Figure 1 (a).

As the example in Figure 1 (b) illustrates, reformulations allow seekers to specify information in small steps. As the main advantage of incremental formulation, seekers do not have to formulate the complete query in advance, substantially reducing the required mental effort.1 Moreover, incremental formulation might simplify and thus increase the use of search operators, like the exclusion in Figure 1 (b), allowing seekers to formulate even complex needs more intuitively.2

Though the seeker may not consciously ask to create, update, or delete query terms, conversational reformulations are essentially such meta-queries that request changes to the previous query. While these operations are not implemented in standard retrieval engines, one can imagine a “conversation” layer on top of such engines (Figure 1 (a)) that—among others—resolves reformulations similar to co-reference resolution in conversational question answering systems (e.g., [6]).

However, critical issues for conversational systems concerning reformulations have barely been analyzed in the literature, especially the reformulations’ inherent ambiguity. For example, consider the seeker asking “OK, how about vaccinations?” after getting the results for “Show me news about COVID-19” in Figure 1 (b). Is the intent to create a “vaccination” term or to replace the query entirely? If aware of the ambiguities, a system could ask for clarification or use heuristics to resolve the ambiguity. It could also stress and thereby teach unambiguous language in its replies (“I reduced the list to those on vaccinations.”).

To foster research on conversational query reformulations, we contribute the following: (1) a conceptualization that casts conversational reformulations as meta-queries following CRUD terminology (cf. Section 3); (2) the first dataset on conversational query reformulations,3 containing 2694 messages and associated meta-queries, crowdsourced from 284 study participants from 5 countries in 4 different search domains (cf. Section 4); and (3) an in-depth analysis of the reformulations’ word patterns, emphasizing ambiguous word patterns as an important research direction and suggesting the general feasibility of a domain-independent rewriting system (cf. Section 5).

[Figure 1 (b), conversation excerpt:
  Seeker: “Show me news about COVID-19.”  (laconic: COVID-19)
  System: “OK, here is the latest news on COVID-19.”  (laconic: {"hits": {"total": 142, ... }, ...})
  Seeker: “Ah. Please no articles about vaccination this time.”  (laconic: COVID-19 ∧ (¬ vaccination))
  System: “Sure, I remove articles on vaccination.”  (laconic: {"hits": {"total": 89, ... }, ...})]

Figure 1: (a) Web application protocol stack, emphasizing the two top layers. A conversational search interface implements a “conversation layer” (an API) to operationalize the respective NLP translation functionality on top of a “laconic” layer (an API) that implements the functionality to interact with a traditional web search interface. (b) Example conversation about COVID-19, showing the messages of a fictitious dialog at the level of a conversational and a laconic protocol. Recall that when going downwards (upwards) the stack, the messages of the protocol at layer 𝑛−1 (layer 𝑛) are generated from those at layer 𝑛 (layer 𝑛−1).

1 This effect likely also holds for more instruction-like (but still conversational) interactions, like in Adobe’s “phonic filters” image search demo, https://blog.adobe.com/en/2019/05/29/preview-technology-gives-your-voice-the-power-of-a-creative-director.html.
2 Some years ago, only about 1% of web search queries contained operators [5].
3 Publicly available at https://doi.org/10.5281/zenodo.5031960

DESIRES 2021 – 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15–18, 2021, Padua, Italy
johannes.kiesel@uni-weimar.de (J. Kiesel); xiaonicaiweimar@gmail.com (X. Cai); roxanne.elbaff@dlr.de (R. El Baff); benno.stein@uni-weimar.de (B. Stein); matthias.hagen@informatik.uni-halle.de (M. Hagen)
ORCID: 0000-0002-1617-6508 (J. Kiesel); 0000-0003-2114-8162 (X. Cai); 0000-0001-6661-8661 (R. El Baff); 0000-0001-9033-2217 (B. Stein); 0000-0002-9733-2890 (M. Hagen)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. Related Work

Though conversational search is an active research area, conversational query reformulation has attracted little attention so far. In contrast, several recent publications target co-reference resolution for follow-up questions in conversational question answering [7]. Reformulations and follow-up questions are similar in that both ask for information connected to information just retrieved. However, whereas query reformulations change the criteria that identify relevant information, follow-up questions request a completely new answer. This difference in intent causes linguistic differences between query reformulations and follow-up questions that warrant separate investigations. Moreover, conversational query reformulation relates to much of the available research on queries.

2.1. Queries in Conversational Search

A central promise of the conversational search paradigm is to bring search closer to the real-world assistance of a reference librarian [8]. In this regard, conversational reformulations are one piece for allowing the seeker to specify their need on a more natural level [9]. Still, unlike “query by babbling” [10], reformulations require some formalized description in the seeker’s mind. Instead, conversational query reformulations are one instance of “user revealment” [11], where the seeker incrementally specifies their need. The advantages of small steps, as in orienteering [12], are that seekers have to specify less and obtain context information on the way. A complementary approach to help seekers refine their query is to ask them clarification questions in case the search engine detects them struggling or being ambiguous [13].

In their history of IR research, Sanderson and Croft [14] divided interactions with some first text-based conversational search systems into natural language or keyword-based and into non-querying and querying. In a later fine-grained study of conversational seeker messages for passage retrieval, Lin et al. [15] categorize query ambiguity, though they focus on ambiguity regarding (not) referenced entities, whereas the paper at hand focuses on ambiguity regarding the desired operation. Trippas et al. [4] present a model for spoken conversational search that also covers information requests beyond queries, for example, within a result document. In their lab study, they observe both that seekers conversationally reformulate queries (“query embellishments”) and that participants in the system’s role conversationally offer reformulations based on what they see.

2.2. Query Reformulation

Query reformulations are queries based on the previous one with a similar information need [1]. Boldi et al. [1] classified query reformulations on two axes: from generalization to specification, and from being the same query to starting a new search mission. Sanderson and Croft [14] proposed a categorization based on the latter axis. Jiang et al. [16] analyzed voice query repeats after voice input errors. They found that seekers tended to stress words that the system misunderstood—a way of conversational reformulation that is specific to voice queries but outside the scope of the paper at hand.

Though studies took note of conversational query reformulations (e.g., [17]), they have rarely been analyzed. As an exception, Sa and Yuan [18] asked 32 participants in a Wizard of Oz study to perform one generalization and one specialization of a displayed query, speaking to the system like to a human. The participants preferred conversational query reformulations (which Sa and Yuan call “partial query modification”) over repeating the query with changes (“complete query modification”), even though several participants reported employing the latter out of habit. We build upon this work, asking for more complex reformulations in longer query sessions and focusing on analyzing the language employed and the ambiguities therein.

2.3. Query Rewriting

In contrast to query reformulation, query rewriting refers to processing the query before the retrieval. This task currently attracts much attention in conversational question answering, mostly concerning co-reference resolution. Available datasets for this task include CSQA [19], CoQA [20], QReCC [6], QuAC [21], and TREC CAsT [22]. The availability of these datasets has already led to several approaches, often employing sequence-to-sequence learning [23, 24, 20] and previous interactions [25, 26, 27].

2.4. Natural Language Queries

Several studies analyzed natural language queries even before conversational interfaces. Belkin et al. [28] found that, in a text-based search interface, the average query length increased by nearly 50% when the search box label encouraged writing a problem description. Still, the automatic “translation” of long queries into shorter keyword queries later also gained attention, with systems reducing natural language queries to a few key concepts more compatible with keyword-based interfaces [29, 30]. Moreover, the translation of natural language into database or knowledge graph queries also attracts much attention (e.g., [31, 32, 33]). For smart assistants, Tanaka et al. [34] used crowdsourcing, like us, to gather implicit and highly ambiguous seeker requests for specific non-conversational tasks, e.g., “I’m thirsty” as a request to search for a nearby café or “I’m having trouble getting good reception” as a “request” to search for a WiFi spot. As an example of spoken reformulations in a different setting, researchers have for decades investigated ways to edit text by voice [35, 36], using commands like “Capitalize the first letter in each word in each title” [37].

Table 1
Conversational examples for each CRUD operation [38], by the target of the operation (whole query, expression, or literal).

Operation | Target: Query                  | Target: Expression                                                      | Target: Literal
Create    | “Show me news about COVID-19”  | “Remove all without NCD, NI, or WHO in the headline”                    | “Any news for its treatment?”
Read      | “What do I have so far?”       | “What did I say for the headline?”                                      | “What was the last filter?”
Update    | “Start a new search on the flu”| “Remove my criteria for the headline but search only in economical news”| “No, NI means National Insurance”
Delete    | “No, let’s start again”        | “Remove the headline criteria”                                          | “Remove the filter for treatments”

3. Conceptualizing Conversational Reformulations

Conversational reformulations are query reformulations using natural language. While query reformulations in web search usually are stand-alone queries that can directly be submitted to retrieve results, formulating queries from conversational reformulations requires an additional step that “adds” the conversational context. This section discusses the implications on three levels: (1) a model of conversational reformulations as meta-queries; (2) the problem of algorithmically understanding reformulations; and (3) the process of creating the “laconic” queries from the reformulations.

3.1. Casting Conversational Reformulations as Meta-Queries

From the information system’s perspective, the information seeker uses a meta-query language when expressing conversational reformulations: they tell the system to perform specific operations on the previous query. On a syntactic level, the basic reformulation operations in traditional search interfaces are adding, changing, and removing a term. These correspond to the basic operations create, update, and delete of data systems [38]. The fourth basic operation of data systems, read, may also be useful if the previous query is not visible, like in some conversational interfaces. For illustration, Table 1 shows conversational examples for each basic operation. One query reformulation can contain several basic operations.
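To make the CRUD view concrete, the conversation layer can be pictured as maintaining a mutable laconic query to which meta-query operations are applied. The following sketch is illustrative only: the conjunctive query representation and all class and method names are our own assumptions, not part of the paper’s dataset or of any system described here.

```python
from dataclasses import dataclass, field

# A laconic query in conjunctive form: every clause must match an item;
# a clause matches if any of its terms occurs (or none, if negated).
@dataclass
class Clause:
    terms: list           # alternatives, e.g. ["vaccination"]
    negated: bool = False

@dataclass
class Query:
    clauses: list = field(default_factory=list)

    # Create: add a new term as an additional clause.
    def create(self, term):
        self.clauses.append(Clause([term]))

    # Update: replace a term by another one wherever it occurs.
    def update(self, old, new):
        for clause in self.clauses:
            clause.terms = [new if t == old else t for t in clause.terms]

    # Delete: drop every clause that mentions the term.
    def delete(self, term):
        self.clauses = [c for c in self.clauses if term not in c.terms]

    # Read: render the current query, e.g., to read it back to the seeker.
    def read(self):
        parts = []
        for c in self.clauses:
            alt = " OR ".join(c.terms)
            parts.append(f"NOT ({alt})" if c.negated else f"({alt})")
        return " AND ".join(parts)

# The dialog of Figure 1 (b): create "COVID-19", then exclude "vaccination".
q = Query()
q.create("COVID-19")
q.clauses.append(Clause(["vaccination"], negated=True))
print(q.read())  # prints: (COVID-19) AND NOT (vaccination)
```

Under this toy model, the ambiguity discussed above becomes an explicit choice between operations: “OK, how about vaccinations?” could map to q.create("vaccination") or to replacing the whole query, which yield different laconic queries and thus different result lists.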
3.2. Algorithmically Understanding Natural Language Reformulations

In the past years, impressive advancements have been achieved in natural language understanding. Still, when a message can be interpreted as different meta-queries, the problem is far from solved. For example, what if the seeker had asked “OK, how about vaccinations?” as their second message in Figure 1 (b)? Is the intent to specify the previous query or to start a new one? Hints on the true intent might be found in previous messages (the relation between ‘vaccination’ and the previous query) or previous results (maybe the seeker read something that caused the question). Other conversations may suggest quite different interpretations. For example, if asked after “Can you show me articles about its treatments?”, one could interpret ‘vaccination’ as a replacement for ‘treatments.’ To resolve such ambiguities, search systems may ask the seeker for clarification [39, 40] or they may try heuristic disambiguation. Such heuristics could, for example, employ word or entity relationships (e.g., using WordNet or knowledge graphs) or query performance predictors like term specificity and result coherence [41].

3.3. Formulating Laconic Queries for Keyword-based Retrieval

The example in Figure 1 (b) shows that conversational reformulations (like the seeker’s second message) have to be converted to context-independent queries to submit them to standard retrieval systems. This is similar to “query rewriting,” a process that resolves co-references in conversational question answering (e.g., [6]). In fact, similar methods may be effective for conversational reformulations. In the protocol stack of Figure 1 (a), conversational reformulations can thus be seen as a service of the conversation layer that builds upon the retrieval service of the laconic layer.

4. Crowdsourcing Reformulations

To foster research on conversational query reformulations, we publish a respective crowdsourced dataset4 that accounts for diversity in seeker location (five English-speaking countries) and search domain (four different ones). The dataset’s goal is to allow for analyzing the diversity and ambiguity in the language of conversational reformulations. However, the dataset also allows bootstrapping the natural language understanding component of conversational systems [42].

Figure 2 shows the interface used in Amazon’s Mechanical Turk marketplace to collect “natural” reformulations. We iteratively refined the interface in eight pilot studies with 80 participants to minimize the interface’s influence on the participants’ choice of words. Based on insights from the pilot studies, we formulated the tasks as bullet points with a sentence structure clearly different from the reformulations we asked for. Moreover, automatic checking routines help the participants to stick to the task (e.g., alerts for undesired repetitions or missing terms). The interface resembles a WhatsApp chat to prime participants on chat messages [43].

After an initial “ready” interaction to illustrate the task (cf. top of Figure 2), each participant completed twelve assignments from one domain as a single search session. To analyze reformulation diversity, we changed the task domain and topic between participants: either finding arguments on banning plastic bags, finding books on (Sci-Fi) viruses, finding news on COVID-19, or finding trips to San Jose (a very widespread city name). However, the search tasks for each participant had the same structure of abstract operations (e.g., create one term), with only keywords being replaced between the domains.5 To ensure a variety of reformulations, we formulated the instructions to cover all four CRUD operations, to vary the targets from a single literal to the whole query, to cover conjunctions and disjunctions, and to include some special cases like a filter attribute, an unspecified literal, or a negation.

Participants completed a session in about 12 minutes (observed in the pilot studies and the final study), and we adjusted the payment to cover the minimum wage of the respective country.6 Unfortunately, we had to stop our study in India and Australia after the first domain (news): only 22% of the Indian participants provided reasonable messages for the tasks (61% in the other countries), while getting answers from 20 Australian participants alone exhausted our time constraints. In total, we accepted the work of 284 participants.

To ensure the dataset’s quality and ease processing, we manually checked each message. Of the initially 3408 messages, 2917 are grammatically and semantically meaningful in the respective context. Of these, 2694 (79% of all) can be interpreted as the respective intended meta-query and form the final dataset: 558 messages for the argument domain, 573 for book, 961 for news, and 602 for trip (cf. Table 2 for other key statistics).

4 Publicly available at https://doi.org/10.5281/zenodo.5031960.
5 The keywords are contained in the README file of the dataset. The annotation interface for each domain is presented by the respective -interface.html file.
6 https://medium.com/ai2-blog/crowdsourcing-pricing-ethics-and-best-practices-8487fd5c9872

5. Analyzing Reformulations

As for all natural language systems, developing systems that allow for conversational query reformulations demands investigating the peculiarities of the respective language.

Figure 2: Dataset collection interface (excerpt). After submitting a message (bottom box), the interface alerted participants of potential missing or forbidden terms.
Valid messages appear on the right and the next assignment (“task”) on the left.

To this end, Section 5.1 provides a general overview of the messages collected in our dataset, highlighting differences in language use between search domains and countries. Though the exact patterns occur with different frequencies, they are, in general, similar across both search domains and countries, indicating that a generic reformulation resolution system might be feasible. In Section 5.2, we report on our detailed analysis of the ambiguities in the messages, showcasing both the ambiguities and possible ways to avoid them.

Table 2
Key statistics of the dataset by the participants’ location (country). Messages have been manually categorized as being either a command (Co.), question (Qu.), or statement (St.).

Location       | Participants: ∑  Arg. Book News Trip | Messages: ∑   Co.  Qu.  St.
Australia      |            20    0    0   20    0    |         192  0.76 0.08 0.16
Canada         |            83   20   20   23   20    |         781  0.58 0.19 0.22
Great Britain  |            80   20   20   20   20    |         774  0.60 0.25 0.15
India          |            21    0    0   21    0    |         181  0.78 0.10 0.12
United States  |            80   20   20   20   20    |         766  0.54 0.20 0.26
Total          |           284   60   60  104   60    |        2694  0.60 0.20 0.20

5.1. Comparing Messages across Tasks, Domains, and Countries

We have systematically analyzed the 2694 messages of our dataset. Besides the more general analyses of message types, we also focus on word frequencies and patterns. Apart from a small difference in preposition frequencies for trips (more frequent use of “to” where “about” is used in the other domains), argument search has one difference to the other domains in that a few participants formulated their requests not as a query but as asking for the system’s opinion (e.g., “What do you think about banning plastic bags?”).

To investigate the word patterns of conversational query reformulations and differences between search domains and countries, our study design employs the same sequence of twelve abstract tasks for each participant, only exchanging a few keywords to specify the different search domains. Table 3 shows the formal tasks and provides general characteristics of the collected messages. The ∑-column shows the total number of valid messages per task. Though this number is close to the maximum of 284 (the number of participants) for most tasks, it is relatively low for Tasks 4, 9, and 10, indicating a misunderstanding by the participants. We see such misunderstandings as an artifact of our study setup and filter out the affected messages from our below analyses.7

7 For completeness, we provide these messages in a separate file along with the dataset.

Table 3
The abstracted sequence of operations (Create, Read, Update, Delete) that each participant performed in one of four domains, together with a statistical overview of the messages in the dataset, including their absolute number and the relative frequency of each type (command (Co.), question (Qu.), or statement (St.)) as well as of unambiguous and ambiguous ones. For clarity, queries qᵢ are provided here in Boolean form. An item is relevant for a query qᵢ if qᵢ’s expression evaluates to true for the item, where lᵢ denotes a literal that evaluates to true if the corresponding term occurs in the item, ?ᵢ a literal with no corresponding term, and fᵢ(e) an expression that evaluates to true if e evaluates to true for attribute fᵢ of the item. See the dataset’s README file for the corresponding terms and attributes for each domain and the interface HTML files for the respective task descriptions.

#  | Operation (CRUD)           | Result query                  |  ∑  | Co.  Qu.  St. | Unam. | Alt. interpretations [messages]
1  | C: l₁                      | q₁ = l₁                       | 275 | 0.61 0.32 0.07 | 1.00 | -
2  | C: l₂                      | q₂ = l₁ ∧ l₂                  | 226 | 0.68 0.23 0.09 | 0.36 | U: q₁ → l₁ ∨ l₂ [0.35]; U: q₁ → l₂ [0.64]
3  | U: l₂ → l₂ ∨ l₃            | q₃ = l₁ ∧ (l₂ ∨ l₃)           | 212 | 0.69 0.24 0.07 | 0.27 | U: q₂ → l₃ [0.16]; U: q₂ → (l₁ ∧ l₂) ∨ l₃ [0.72]
4  | D: l₂ ∨ l₃                 | q₄ = l₁                       |  70 | 0.60 0.34 0.06 | 1.00 | -
5  | C: f₁: (l₄ ∨ l₅ ∨ l₆)      | q₅ = l₁ ∧ f₁: (l₄ ∨ l₅ ∨ l₆)  | 262 | 0.71 0.18 0.10 | 0.48 | U: q₄ → f₁: (l₄ ∨ l₅ ∨ l₆) [0.52]
6  | U: l₅ → ?₁                 | q₆ = l₁ ∧ f₁: (l₄ ∨ ?₁ ∨ l₆)  | 258 | 0.42 0.03 0.54 | 0.61 | U: l₅ → ¬l₅ [0.39]
7  | U: ?₁ → l₇                 | q₇ = l₁ ∧ f₁: (l₄ ∨ l₇ ∨ l₆)  | 276 | 0.32 0.01 0.66 | 1.00 | -
8  | U: l₁ → l₈                 | q₈ = l₈ ∧ f₁: (l₄ ∨ l₇ ∨ l₆)  | 259 | 0.67 0.17 0.15 | 1.00 | -
9  | R: q₈                      | q₉ = l₈ ∧ f₁: (l₄ ∨ l₇ ∨ l₆)  | 189 | 0.38 0.59 0.03 | 1.00 | -
10 | U: f₁: (l₄ ∨ l₇ ∨ l₆) → l₉ | q₁₀ = l₈ ∧ l₉                 | 134 | 0.66 0.23 0.11 | 1.00 | -
11 | U: q₁₀ → l₁₀               | q₁₁ = l₁₀                     | 269 | 0.72 0.14 0.14 | 0.52 | C: l₁₀ [0.48]
12 | C: ¬l₁                     | q₁₂ = l₁₀ ∧ ¬l₁               | 264 | 0.76 0.11 0.13 | 1.00 | -

Message types. For a first general overview, we manually annotated each message as being a command, a question, or a statement. Table 2 shows the message type usage per country. While participants from Australia and India used more commands, participants from Great Britain used more questions, and participants from the United States used more statements on average. Overall, we found this observation to show the main difference between countries, with no notable difference in word choice in the relatively small amount of data.

Table 3 shows the message type usage per task. The most frequent are commands (e.g., “Please remove arguments about banning plastic bags.” for Task 12), which often make up at least 60% of all messages per task. The one task where the majority of messages are questions is Task 9, which is the only task that involves a read operation (e.g., “Can you please remind me of my previous commands?” and “What did I search for?”). Though statements (e.g., “I would like to see news that are not about COVID-19.” for Task 12) are relatively rare in general, they are the dominant type for the two tasks that deal with correcting a misunderstanding: Task 6 is to tell the system that it misinterpreted an acronym (while leaving the acronym unspecified, e.g., “I did not mean NI as North Ireland.”), whereas Task 7 is to specify the intended meaning (e.g., “I meant National Insurance.”). Participants switching to statement messages might thus be an indicator that they are pointing out problems.

References to current list. As a potential signal to identify reformulations, participants sometimes, but unfortunately rarely, refer to the items in the (imagined) result list when reformulating the query (e.g., “Which of these include vaccination?”). Specifically, in the rare cases for the respective tasks (all but Tasks 1, 6, and 7), participants use those (3%), ones (3%), these (1%), or them (0.4%). Somewhat more frequently, 16% refer to the list and 2% to results (e.g., “only show me results that include vaccination”). As a difference between domains, 1% of the messages in the respective tasks of the trip domain use there (e.g., “I want to have a travel to there by ship.”). More common than references to the current list is the use of the domain-specific item (argument, book, article, trip), with the special case of the phrases ‘pros and cons’ and ‘for and against’ to specify that both sides should be considered in argument search (e.g., “Can you give me arguments for and against banning plastic bags please”). However, these do not indicate reformulations.

Growing and shrinking. As a somewhat strong signal for reformulations, many participants explicitly expressed whether the result list should grow or shrink. Verbal expressions for shrinking the current list (Tasks 2, 5, 12) are remove (9% of the messages in these tasks), filter (5%, e.g., “Filter list to just about vaccination.”), exclude (4%), narrow down (3%), reduce (1%), limit (0.7%), filter out (0.5%, e.g., “Please filter out all articles that are not about vaccination.”), and shorten (0.5%). Note that some of these verbs indicate which items to remove, whereas others indicate which items to keep. Overall, 26% of the messages in these tasks contain a verb that explicitly requests to shrink the list. Other signals for shrinking are the use of only (18%) and just (6%), though these percentages may be inflated as the task descriptions also contain these words. Frequently used verbal expressions for growing the current list (Tasks 3, 4) are add (25%), add back (2%, e.g., “add back the trips by car.”), and expand (3%, e.g., “Expand the list to include books about plants too.”). Other signals for growing are the use of also (23%), as well (4%), too (3%, e.g., “can you also add those including treatment too?”), and as well as (1%). Interestingly, participants used the verb keep both to shrink the list in Task 2 (2% of the messages for Task 2, e.g., “Keep articles related to vaccination”) and, in a lexically indistinguishable way, to partially undo such shrinking in Task 3 (1%, e.g., “Please keep the arguments about renewable resources”).

Summary. The observed differences between countries and domains are relatively small. A change of the seeker from asking questions to expressing statements often indicates specific unusual requests, though differences between countries need to be considered. Finally, many participants directly requested the growing and shrinking of the result list in their reformulations.

5.2. Analyzing Operation Ambiguities

A common problem for natural language interfaces is the ambiguity of natural language. For reformulations, this means that the same message can be interpreted as different operations. In our study, we found the following three main ambiguities.

Specializing a query or starting a new one. When asked to specialize the query by adding a new term (Task 2), the majority of our participants (66%, cf. Table 3) used a message that one could also interpret as starting a new query with that one term (e.g., “Just show me arguments about CO2 emissions”). We observe the same ambiguity in other specialization tasks (Task 3, 16% of the messages, e.g., “Can you show me arguments that are about renewable resources?”; Task 5, 52%, e.g., “May I please see the articles that have NCD, NI, or WHO in the headline?”) and in tasks that ask to start a new query (Task 11, 48%, e.g., “Find a list of books about evolution.”).8 Still, some participants directly used unambiguous messages, either by explicitly referencing the current list (e.g., “Great can you refine that to articles with NCD, NI or WHO in the headline?”) or by indicating a new list or search (45% of the messages for Task 11, e.g., “Find a list of books about evolution,” “New search on evolution,” or “Disregard all previous instructions and now only find me books about evolution”). Moreover, 8% of the messages for Task 11 are unambiguous due to explicitly expressing a replacement, for example, “Show me trips to Santiago instead.”

8 Taken literally, also several messages for Tasks 1 and 12 would be ambiguous. However, the alternative interpretations make no sense in the respective contexts. We ignore these strictly lexical ambiguities in our considerations.

Unclear precedence. Though there are precise rules for operator precedence in logics, no such rules exist for natural language. Indeed, 72% of the messages for Task 3 do not clearly express whether the new term should be an alternative just to the last term (as asked for) or to the entire query (e.g., “Could you please also include arguments about renewable resources?”). About 11% of the participants’ messages are unambiguous by explicitly stating the relation (e.g., “Show me trips by ship or by car,” where ‘ship’ is the previously added query term). Though not asked to do so, a few of these participants also hinted at a reason for asking for the alternative (e.g., “Show me trips by ship, if ship trips are not available then I would like to select trips by car.”). A few participants (1%) made use of an explicitly stated filter term from the previous query, which then allowed them to refer back to it (e.g., “Filter list for infected animals” and then “Add plants to filter as alternative”).

Ambiguous negation. Surprisingly, several messages submitted to tell the system that it misinterpreted an acronym are lexically indistinguishable from filtering by the acronym. While the majority of messages for Task 6 are unambiguous as expected (61%, e.g., “I’m not asking for North Ireland”), 39% of the messages are ambiguous and could easily be misunderstood (e.g., “Do not include articles about North Ireland”). Indeed, only the fact that the user had just added ‘NI’ as an acronym hints at the intended meaning.

Implications for Conversational Search

We can only hypothesize how a seeker’s interactions would differ if a search system supported conversational query reformulations. As mentioned above and in Section 1, one possibility is that the reduced cognitive effort due to step-wise and natural language query formulation encourages more complex queries that contain more search operators. Extending these considerations, we expect that some seekers will desire to regularly use the same query like a feed (e.g., for news, but also for professional activities like scholarly search [45]) and may see query formulation as an act of personalization. Therefore, some systems may even aim to support reformulations like “A bit more on soccer.” At the same time, we believe that supporting reformulations will be essential for having a conversation between seeker and search system, as reformulations implement a straightforward way for the seeker to ground the conversation [46], complementing clarification questions from the system. At such a stage, the conversations will be more natural. And the users will be “relieved” from the below laconic layer—just like today’s users of the laconic layer do not need to know any details about the underlying TCP/IP layer.

6. Conclusion

We have formalized the problem of supporting reformulations in conversational search systems. By casting reformulations as meta-queries that imply standard CRUD operations on the “actual” query, we demonstrate that such functionality could be implemented in a conversation layer on top of standard retrieval architectures. An analysis of a new dataset of 2694 crowdsourced human reformulations across four search domains shows that a generic reformulation component is feasible when considering the peculiarities of the respective search domains. However, we also find that ambiguities in the reformulations will likely be a major challenge for conversational systems. Ambiguities thus merit further investigation.

Future Work

We see several opportunities to extend the analysis of this paper.

Other languages. We considered only English messages so far but expect at least some of the results to be language-dependent. Further analyses will need to be conducted for other languages.

Generalized read operation. We considered only the most simple read operation: reading the current query. However, seekers may also want to fetch a query they used some time ago, maybe to continue or refresh a previous search.

More search operators. We considered only the standard logical operators (∨, ∧, ¬) and attribute-specific filters, but most retrieval systems support more. How would seekers highlight phrases (words to be retrieved in that sequence), initiate boosting (a term or attribute being especially important), or fuzzy / strict term matching? As hypothesized in Section 1, step-wise query formulation might increase the use of diverse search operators.

Clarifications. We considered only messages from the seeker, but studying possible system reactions is equally essential to account for implicit feedback (repeating what was understood) or asking clarification questions. Both methods are likely helpful to explain and resolve ambiguities, and could at the same time allow the system to showcase unambiguous formulations in an attempt to teach the seeker how to prevent the ambiguities in the future.9

9 Methods to explain ambiguities in reformulations may build upon early work on explaining ambiguities in formal query languages [44].

Acknowledgments

This work has been partially funded by the Google Digital News Initiative as part of the “Conversational News” project.

References

[1] P. Boldi, F. Bonchi, C. Castillo, S. Vigna, Query reformulation mining: models, patterns, and applications, Information Retrieval 14 (2011) 257–289. doi:10.1007/s10791-010-9155-3.
[2] J.
[12] J. Teevan, C. Alvarado, M. S. Ackerman, D. R. Karger, The perfect search engine is not enough: a study of orienteering behavior in directed search, in: E. Dykstra-Erickson, M. Tscheligi (Eds.), Proc. of CHI, ACM, 2004, pp. 415–422. doi:10.1145/985692.985745.
[13] M. Aliannejadi, H. Zamani, F. Crestani, W. B. Croft, Asking clarifying questions in open-domain
Chen, J. Mao, Y. Liu, F. Zhang, M. Zhang, S. Ma, information-seeking conversations, in: B. Pi- Towards a better understanding of query reformu- wowarski, M. Chevalier, É. Gaussier, Y. Maarek, lation behavior in web search, in: J. Leskovec, J. Nie, F. Scholer (Eds.), Proc. of SIGIR, ACM, 2019, M. Grobelnik, M. Najork, J. Tang, L. Zia (Eds.), pp. 475–484. doi:10.1145/3331184.3331265. Proc. of WWW, ACM / IW3C2, 2021, pp. 743–755. [14] M. Sanderson, W. B. Croft, The history of infor- doi:10.1145/3442381.3450127. mation retrieval research, Proc. of IEEE 100 (2012) [3] W. B. Croft, D. Metzler, T. Strohman, Search En- 1444–1451. doi:10.1109/JPROC.2012.2189916. gines - Information Retrieval in Practice, Pearson [15] S.-C. Lin, J.-H. Yang, R. Nogueira, M.-F. Tsai, C.- Education, 2009. J. Wang, J. Lin, Multi-stage conversational pas- [4] J. R. Trippas, D. Spina, P. Thomas, M. Sanderson, sage retrieval: An approach to fusing term impor- H. Joho, L. Cavedon, Towards a model for spo- tance estimation and neural query rewriting, 2020. ken conversational search, Information Process- arXiv:2005.02230. ing Management 57 (2020) 102162. doi:10.1016/ [16] J. Jiang, W. Jeng, D. He, How do users respond j.ipm.2019.102162. to voice input errors?: lexical and phonetic query [5] R. W. White, D. Morris, Investigating the querying reformulation in voice search, in: G. J. F. Jones, and browsing behavior of advanced search engine P. Sheridan, D. Kelly, M. de Rijke, T. Sakai (Eds.), users, in: W. Kraaij, A. P. de Vries, C. L. A. Clarke, Proc. of SIGIR, ACM, 2013, pp. 143–152. doi:10. N. Fuhr, N. Kando (Eds.), Proc. of SIGIR, ACM, 2007, 1145/2484028.2484092. pp. 255–262. URL: https://doi.org/10.1145/1277741. [17] J. R. Trippas, D. Spina, L. Cavedon, H. Joho, 1277787. doi:10.1145/1277741.1277787. M. Sanderson, How do people interact in conver- [6] R. Anantha, S. Vakulenko, Z. Tu, S. Longpre, S. Pul- sational speech-only search tasks: A preliminary man, S. 
Chappidi, Open-domain question answer- analysis, in: Proc. of CHIIR, ACM, 2017, pp. 325– ing goes conversational via question rewriting, 328. doi:10.1145/3020165.3022144. in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, [18] N. Sa, X. Yuan, Examining users’ partial query mod- D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, ification patterns in voice search, Journal of the T. Chakraborty, Y. Zhou (Eds.), Proc. of NAACL- Association for Information Science and Technol- HLT, ACL, 2021, pp. 520–534. URL: https://www. ogy 71 (2020) 251–263. doi:10.1002/asi.24238. aclweb.org/anthology/2021.naacl-main.44/. [19] A. Saha, V. Pahuja, M. M. Khapra, K. Sankara- [7] M. Zaib, W. E. Zhang, Q. Z. Sheng, A. Mahmood, narayanan, S. Chandar, Complex sequential ques- Y. Zhang, Conversational question answering: A tion answering: Towards learning to converse over survey, CoRR abs/2106.00874 (2021). URL: https: linked question answer pairs with a knowledge //arxiv.org/abs/2106.00874. graph, 2018. arXiv:1801.10314. [8] J. Culpepper, F. Diaz, M. Smucker, Research fron- [20] S. Reddy, D. Chen, C. D. Manning, Coqa: A con- tiers in information retrieval: Report from the versational question answering challenge, 2018. third strategic workshop on information retrieval arXiv:1808.07042. in lorne (SWIRL), SIGIR Forum 52 (2018) 34–90. [21] E. Choi, H. He, M. Iyyer, M. Yatskar, W. tau Yih, [9] R. S. Taylor, The process of asking questions, Amer- Y. Choi, P. Liang, L. Zettlemoyer, Quac : Question ican Documentation 13 (1962) 391–396. doi:10. answering in context, 2018. arXiv:1808.07036. 1002/asi.5090130405. [22] J. Dalton, C. Xiong, J. Callan, Cast 2020: The [10] D. W. Oard, Query by babbling: A research agenda, conversational assistance track overview, in: in: Proc. of IKM4DR, IKM4DR’12, ACM, New York, E. M. Voorhees, A. Ellis (Eds.), Proc. of TREC, NY, USA, 2012, pp. 17–22. doi:10.1145/2389776. volume 1266 of NIST Special Publication, National 2389781. 
Institute of Standards and Technology (NIST), [11] F. Radlinski, N. Craswell, A theoretical frame- 2020. URL: https://trec.nist.gov/pubs/trec29/papers/ work for conversational search, in: Proc. of CHIIR, OVERVIEW.C.pdf. CHIIR ’17, ACM, New York, 2017, p. 117–126. [23] A. Elgohary, D. Peskov, J. Boyd-Graber, Can you doi:10.1145/3020165.3020183. unpack that? learning to rewrite questions-in- [12] J. Teevan, C. Alvarado, M. S. Ackerman, D. R. context, in: Proc. of EMNLP-IJCNLP, Association Karger, The perfect search engine is not enough: a for Computational Linguistics, Hong Kong, China, 2019, pp. 5918–5924. URL: https://www.aclweb. requests and thoughtful actions, in: H. Li, G. Levow, org/anthology/D19-1605. doi:10.18653/v1/D19- Z. Yu, C. Gupta, B. Sisman, S. Cai, D. Vandyke, 1605. N. Dethlefs, Y. Wu, J. J. Li (Eds.), Proc. of SIGdial, [24] S. Yu, J. Liu, J. Yang, C. Xiong, P. Bennett, J. Gao, ACL, 2021, pp. 77–88. URL: https://aclanthology. Z. Liu, Few-shot generative conversational query org/2021.sigdial-1.9. rewriting, 2020. arXiv:2006.05009. [35] J. J. Leggett, G. Williams, An empirical investiga- [25] Z. Chen, X. Fan, Y. Ling, L. Mathias, C. Guo, tion of voice as an input modality for computer pro- Pre-training for query rewriting in A spo- gramming, International Journal of Man-Machine ken language understanding system, CoRR Studies 21 (1984) 493–520. doi:10.1016/S0020- abs/2002.05607 (2020). URL: https://arxiv.org/abs/ 7373(84)80057-7. 2002.05607. arXiv:2002.05607. [36] L. Rosenblatt, Vocalide: An ide for programming [26] S.-C. Lin, J.-H. Yang, J. Lin, Contextualized via speech recognition, in: Proc. of SIGACCESS, query embeddings for conversational search, 2021. ASSETS’17, ACM, New York, NY, USA, 2017, pp. arXiv:2104.08707. 417–418. doi:10.1145/3132525.3134824. [27] S. Yuan, S. Gupta, X. Fan, D. Liu, Y. Liu, C. Guo, [37] A. W. Biermann, L. Fineman, J. F. 
Heidlage, A Graph enhanced query rewriting for spoken lan- voice- and touch-driven natural language editor guage understanding system, in: Proc. of and its performance, International Journal of Man- ICASSP, IEEE, 2021, pp. 7997–8001. doi:10.1109/ Machine Studies 37 (1992) 1–21. doi:10.1016/ ICASSP39728.2021.9413840. 0020-7373(92)90089-4. [28] N. J. Belkin, D. Kelly, G. Kim, J. Kim, H. Lee, G. Mure- [38] J. Martin, Managing the Data Base Environment, 1 san, M. M. Tang, X. Yuan, C. Cool, Query length ed., Prentice Hall PTR, USA, 1983. in interactive information retrieval, in: C. L. A. [39] J. Kiesel, A. Bahrami, B. Stein, A. Anand, M. Ha- Clarke, G. V. Cormack, J. Callan, D. Hawking, A. F. gen, Toward Voice Query Clarification, in: Proc. Smeaton (Eds.), Proc. of SIGIR, ACM, 2003, pp. 205– of SIGIR, ACM, 2018, pp. 1257–1260. doi:10.1145/ 212. doi:10.1145/860435.860474. 3209978.3210160. [29] N. Balasubramanian, G. Kumaran, V. R. Car- [40] H. Zamani, S. T. Dumais, N. Craswell, P. N. Bennett, valho, Exploring reductions for long web queries, G. Lueck, Generating clarifying questions for in- in: F. Crestani, S. Marchand-Maillet, H. Chen, formation retrieval, in: Y. Huang, I. King, T. Liu, E. N. Efthimiadis, J. Savoy (Eds.), Proc. of SI- M. van Steen (Eds.), Proc. of WWW 2020, ACM / GIR, ACM, 2010, pp. 571–578. URL: https://doi.org/ IW3C2, 2020, pp. 418–428. doi:10.1145/3366423. 10.1145/1835449.1835545. doi:10.1145/1835449. 3380126. 1835545. [41] J. Arguello, S. Avula, F. Diaz, Using query perfor- [30] M. Bendersky, W. B. Croft, Discovering key con- mance predictors to improve spoken queries, in: cepts in verbose queries, in: S. Myaeng, D. W. Oard, N. Ferro, F. Crestani, M. Moens, J. Mothe, F. Silvestri, F. Sebastiani, T. Chua, M. Leong (Eds.), Proc. of SI- G. M. D. Nunzio, C. Hauff, G. Silvello (Eds.), Proc. of GIR, ACM, 2008, pp. 491–498. URL: https://doi.org/ ECIR, volume 9626 of LNCS, Springer, 2016, pp. 309– 10.1145/1390334.1390419. doi:10.1145/1390334. 321. 
doi:10.1007/978-3-319-30671-1\_23. 1390419. [42] C. Pearl, Designing Voice User Interfaces: Princi- [31] F. Li, H. V. Jagadish, Constructing an interactive ples of Conversational Experiences, 1st ed., O’Reilly natural language interface for relational databases, Media, Inc., 2016. Proc. VLDB Endowment 8 (2014) 73–84. doi:10. [43] A. Papenmeier, D. Kern, D. Hienert, A. Sliwa, 14778/2735461.2735468. A. Aker, N. Fuhr, Starting conversations with search [32] K. Affolter, K. Stockinger, A. Bernstein, A compar- engines - interfaces that elicit natural language ative survey of recent natural language interfaces queries, in: F. Scholer, P. Thomas, D. Elsweiler, for databases, VLDB Journal 28 (2019) 793–819. H. Joho, N. Kando, C. Smith (Eds.), Proc. of CHIIR, doi:10.1007/s00778-019-00567-8. ACM, 2021, pp. 261–265. doi:10.1145/3406522. [33] E. Kuric, J. D. Fernández, O. Drozd, Knowledge 3446035. graph exploration: A usability evaluation of query [44] J. A. Wald, P. G. Sorenson, Explaining ambiguity builders for laypeople, in: M. Acosta, P. Cudré- in a formal query language, ACM Transactions on Mauroux, M. Maleshkova, T. Pellegrini, H. Sack, Database Systems 15 (1990) 125–161. doi:10.1145/ Y. Sure-Vetter (Eds.), Proc. of SEMANTiCS, volume 78922.78923. 11702 of LNCS, Springer, 2019, pp. 326–342. doi:10. [45] K. Balog, L. Flekova, M. Hagen, R. Jones, M. Pot- 1007/978-3-030-33220-4\_24. thast, F. Radlinski, M. Sanderson, S. Vakulenko, [34] S. Tanaka, K. Yoshino, K. Sudoh, S. Nakamura, H. Zamani, Common Conversational Commu- ARTA: collection and classification of ambiguous nity Prototype: Scholarly Conversational Assistant, CoRR abs/2001.06910 (2020). URL: https://arxiv.org/ abs/2001.06910. [46] H. H. Clark, S. E. Brennan, Grounding in com- munication, in: Perspectives on Socially Shared Cognition, APA, 1991, pp. 127–149.
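The precedence ambiguity discussed in Section 5.2 can be made concrete with a small sketch. The boolean-query evaluator below and the query terms (‘nuclear’, ‘safety’, ‘renewable’) are our own illustration, not material from the study; they only show how the two readings of a message like “also include renewable” retrieve different documents.

```python
# Sketch of the precedence ambiguity: the same reformulation ("also
# include renewable") can attach the alternative to the last term only
# or to the entire query. The query terms are hypothetical examples.

def matches(doc_terms, query):
    """Evaluate a nested boolean query over a set of document terms."""
    op = query[0]
    if op == "term":
        return query[1] in doc_terms
    if op == "and":
        return all(matches(doc_terms, part) for part in query[1:])
    if op == "or":
        return any(matches(doc_terms, part) for part in query[1:])
    if op == "not":
        return not matches(doc_terms, query[1])
    raise ValueError("unknown operator: " + str(op))

def term(t):
    return ("term", t)

# Hypothetical previous query: nuclear AND safety.
# Reading 1 -- alternative to the last term: nuclear AND (safety OR renewable)
reading_1 = ("and", term("nuclear"), ("or", term("safety"), term("renewable")))
# Reading 2 -- alternative to the whole query: (nuclear AND safety) OR renewable
reading_2 = ("or", ("and", term("nuclear"), term("safety")), term("renewable"))
```

A document that mentions only renewable energy is retrieved under reading 2 but not under reading 1, which is exactly why the conversation layer must resolve the scope of the added term before updating the query.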
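To illustrate how a conversation layer might map reformulation messages to the CRUD operations the paper formalizes, here is a minimal keyword-rule sketch. The trigger phrases and the flat list-of-terms query representation are assumptions for illustration only; as the analysis in Section 5.2 shows, a deployed system would need to resolve ambiguities rather than rely on fixed patterns.

```python
# Minimal sketch of a conversation layer mapping reformulation messages
# to CRUD-style operations on the current query. Trigger phrases and the
# list-of-terms query model are illustrative assumptions, not the
# paper's implementation.
import re

def reformulate(query_terms, message):
    """Return (operation, new_query_terms) for a reformulation message."""
    msg = message.lower()
    # Delete: negate a topic, as in "Please no articles about vaccination."
    m = re.search(r"no (?:articles? about )?([\w-]+)", msg)
    if m:
        return "delete", query_terms + ["-" + m.group(1)]
    # Create: start a new query, as in "Show me news about COVID-19."
    m = re.search(r"(?:show me|find) (?:news|a list of \w+) about ([\w-]+)", msg)
    if m:
        return "create", [m.group(1)]
    # Read: repeat the current query back to the seeker.
    if msg.startswith("what is my current query"):
        return "read", query_terms
    # Update: add an alternative term; its scope (last term or whole
    # query) is often ambiguous, cf. the precedence ambiguity.
    m = re.search(r"also include (?:arguments about )?([\w-]+)", msg)
    if m:
        return "update", query_terms + [m.group(1)]
    return "read", query_terms
```

On the exchange from Figure 1 (b), this sketch maps “Show me news about COVID-19.” to a create operation and “Please no articles about vaccination this time.” to a delete that negates the term, mirroring COVID-19 ∧ ¬vaccination.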