Toward Conversational Query Reformulation

Johannes Kiesel1, Xiaoni Cai1, Roxanne El Baff2, Benno Stein1 and Matthias Hagen3
1 Bauhaus-Universität Weimar, Bauhausstraße 11, 99423 Weimar, Germany
2 German Aerospace Center (DLR) Oberpfaffenhofen, Münchener Str. 20, 82234 Weßling, Germany
3 Martin-Luther-Universität Halle-Wittenberg, Von-Seckendorff-Platz 1, 06120 Halle (Saale), Germany

Abstract
In traditional web search interfaces, information seekers reformulate their queries by editing the terms in the search box in order to guide the retrieval process. Such editing is at odds with the natural language interaction paradigm of conversational interfaces, and for purely voice-based interfaces it is impossible. Conversational search studies reveal that participants instead describe their changes to a query; however, the principles of such “editing conversations” have not been analyzed in depth. The paper in hand formalizes the problem of conversational query reformulation. We cast reformulations as meta-queries that imply operations on the original query and categorize the operations following the standard CRUD terminology (create, read, update, delete). Based on this formalization we crowdsource a dataset with 2694 human reformulations across four search domains. Our analysis of the meta-queries reveals a large variety in word usage and indicates ambiguous reformulations as an important research topic of its own.

Keywords
Conversational search, Crowdsourcing, CRUD, Information seeking, Query refinement, Query reformulation

1. Introduction

During web search, information seekers frequently find a search engine’s results either too specific, too generic, or containing results relevant only to an unintended interpretation of their query. In such cases seekers may want to reformulate their queries [1, 2].
In a traditional search interface, the seeker would directly edit the previous query in the search field, creating, updating, or deleting terms. Such reformulations account for about half of all queries [3, Sec. 6.3]. However, conversational search interfaces—be they chat-like or voice-based—usually do not allow modifying the previous query. Though some chat interfaces not used for search allow editing previous messages, such functionality breaks temporal continuity, making the interaction significantly less conversational. Still, reformulations are also frequent in conversational search lab studies [4] and can be seen as one user-facing service of the search interface’s conversational layer, as illustrated in Figure 1 (a).

As the example in Figure 1 (b) illustrates, reformulations allow seekers to specify information in small steps. As the main advantage of incremental formulation, seekers do not have to formulate the complete query in advance, substantially reducing the required mental effort.1 Moreover, incremental formulation might simplify and thus increase the use of search operators, like the exclusion in Figure 1 (b), allowing seekers to formulate even complex needs more intuitively.2

Though the seeker may not consciously ask to create, update, or delete query terms, conversational reformulations are essentially such meta-queries that request changes to the previous query. While these operations are not implemented in standard retrieval engines, one can imagine a “conversation” layer on top of such engines (Figure 1 (a)) that—among others—resolves reformulations similar to co-reference resolution in conversational question answering systems (e.g., [6]).

However, critical issues for conversational systems concerning reformulations have barely been analyzed in the literature, especially the reformulations’ inherent ambiguity. For example, consider the seeker asking “OK, how about vaccinations?” after getting the results for “Show me news about COVID-19” in Figure 1 (b). Is the intent to create a “vaccination” term or to replace the query entirely? If aware of the ambiguities, a system could ask for clarification or use heuristics to resolve the ambiguity. It could also stress and thereby teach unambiguous language in its replies (“I reduced the list to those on vaccinations.”).

To foster research on conversational query reformulations, we contribute the following: (1) a conceptualization that casts conversational reformulations as meta-queries following CRUD terminology (cf. Section 3); (2) the first dataset on conversational query reformulations,3 containing 2694 messages and associated meta-queries, crowdsourced from 284 study participants from 5 countries in 4 different search domains (cf. Section 4); and (3) an in-depth analysis of the reformulations’ word patterns, emphasizing ambiguous word patterns as an important research direction and suggesting the general feasibility of a domain-independent rewriting system (cf. Section 5).

[Figure 1 (b), conversation excerpt:
  Seeker: “Show me news about COVID-19.”  (laconic: COVID-19)
  System: “OK, here is the latest news on COVID-19.”  (laconic: {"hits": {"total": 142, ... }, ...})
  Seeker: “Ah. Please no articles about vaccination this time.”  (laconic: COVID-19 ∧ (¬ vaccination))
  System: “Sure, I remove articles on vaccination.”  (laconic: {"hits": {"total": 89, ... }, ...})]

Figure 1: (a) Web application protocol stack, emphasizing the two top layers. A conversational search interface implements a “conversation layer” (an API) to operationalize the respective NLP translation functionality on top of a “laconic” layer (an API) that implements the functionality to interact with a traditional web search interface. (b) Example conversation about COVID-19, showing the messages of a fictitious dialog at the level of a conversational and a laconic protocol. Recall that when going downwards (upwards) the stack, the messages of the protocol at layer 𝑛−1 (layer 𝑛) are generated from those at layer 𝑛 (layer 𝑛−1).

1 This effect likely also holds for more instruction-like (but still conversational) interactions, like in Adobe’s “phonic filters” image search demo, https://blog.adobe.com/en/2019/05/29/preview-technology-gives-your-voice-the-power-of-a-creative-director.html.
2 Some years ago, only about 1% of web search queries contained operators [5].
3 Publicly available at https://doi.org/10.5281/zenodo.5031960

DESIRES 2021 – 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15–18, 2021, Padua, Italy
johannes.kiesel@uni-weimar.de (J. Kiesel); xiaonicaiweimar@gmail.com (X. Cai); roxanne.elbaff@dlr.de (R. El Baff); benno.stein@uni-weimar.de (B. Stein); matthias.hagen@informatik.uni-halle.de (M. Hagen)
ORCID: 0000-0002-1617-6508 (J. Kiesel); 0000-0003-2114-8162 (X. Cai); 0000-0001-6661-8661 (R. El Baff); 0000-0001-9033-2217 (B. Stein); 0000-0002-9733-2890 (M. Hagen)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

2. Related Work

Though conversational search is an active research area, conversational query reformulation has attracted little attention so far. In contrast, several recent publications target co-reference resolution for follow-up questions in conversational question answering [7]. Reformulations and follow-up questions are similar in that both ask for information connected to information just retrieved. However, whereas query reformulations change the criteria that identify relevant information, follow-up questions request a completely new answer. This difference in intent causes linguistic differences between query reformulations and follow-up questions that warrant separate investigations. Moreover, conversational query reformulation relates to much of the available research on queries.

2.1. Queries in Conversational Search

A central promise of the conversational search paradigm is to bring search closer to the real-world assistance of a reference librarian [8]. In this regard, conversational reformulations are one piece for allowing the seeker to specify their need on a more natural level [9]. Still, unlike “query by babbling” [10], reformulations require some formalized description in the seeker’s mind. Instead, conversational query reformulations are one instance of “user revealment” [11], where the seeker incrementally specifies their need. The advantages of small steps, as in orienteering [12], are that seekers have to specify less and obtain context information on the way. A complementary approach to help seekers refine their query is to ask them clarification questions in case the search engine detects them struggling or being ambiguous [13].

In their history of IR research, Sanderson and Croft [14] divided interactions with some first text-based conversational search systems into natural language or keyword-based and into non-querying and querying. In a later fine-grained study of conversational seeker messages for passage retrieval, Lin et al. [15] categorize query ambiguity, though they focus on ambiguity regarding (not) referenced entities, whereas the paper at hand focuses on ambiguity regarding the desired operation. Trippas et al. [4] present a model for spoken conversational search that also covers information requests beyond queries, for example, within a result document. In their lab study, they observe both that seekers conversationally reformulate queries (“query embellishments”) and that participants in the system’s role conversationally offer reformulations based on what they see.

2.2. Query Reformulation

Query reformulations are queries based on the previous one with a similar information need [1]. Boldi et al. [1] classified query reformulations on two axes: from generalization to specification, and from being the same query to starting a new search mission. Sanderson and Croft [14] proposed a categorization based on the latter axis. Jiang et al. [16] analyzed voice query repeats after voice input errors. They found that seekers tended to stress words that the system misunderstood—a way of conversational reformulation that is specific to voice queries but outside the scope of the paper at hand.

Though studies took note of conversational query reformulations (e.g., [17]), they have rarely been analyzed. As an exception, Sa and Yuan [18] asked 32 participants in a Wizard of Oz study to perform one generalization and one specialization of a displayed query, speaking to the system like to a human. The participants preferred conversational query reformulations (which Sa and Yuan call “partial query modification”) over repeating the query with changes (“complete query modification”), even though several participants reported employing the latter out of habit. We build upon this work, asking for more complex reformulations in longer query sessions and focusing on analyzing the language employed and the ambiguities therein.

2.3. Query Rewriting

In contrast to query reformulation, query rewriting refers to processing the query before the retrieval. This task currently attracts much attention in conversational question answering, mostly concerning co-reference resolution. Available datasets for this task include CSQA [19], CoQA [20], QReCC [6], QuAC [21], and TREC CAsT [22]. The availability of these datasets has already led to several approaches, often employing sequence-to-sequence learning [23, 24, 20] and previous interactions [25, 26, 27].

2.4. Natural Language Queries

Several studies analyzed natural language queries even before conversational interfaces. Belkin et al. [28] found that, in a text-based search interface, the average query length increased by nearly 50% when the search box label encouraged writing a problem description. Still, the automatic “translation” of long queries into shorter keyword queries later also gained attention, with systems reducing natural language queries to a few key concepts more compatible with keyword-based interfaces [29, 30]. Moreover, the translation of natural language into database or knowledge graph queries also attracts much attention (e.g., [31, 32, 33]). For smart assistants, Tanaka et al. [34] used crowdsourcing, like us, to gather implicit and highly ambiguous seeker requests for specific non-conversational tasks, e.g., “I’m thirsty” as a request to search for a nearby café or “I’m having trouble getting good reception” as a “request” to search for a WiFi spot. As an example of spoken reformulations in a different setting, researchers have for decades investigated ways to edit text by voice [35, 36], using commands like “Capitalize the first letter in each word in each title” [37].

Table 1
Conversational examples for each CRUD operation [38], by the target of the operation (whole query, expression, or literal).

Operation | Target: Query                  | Target: Expression                                                      | Target: Literal
Create    | “Show me news about COVID-19”  | “Remove all without NCD, NI, or WHO in the headline”                    | “Any news for its treatment?”
Read      | “What do I have so far?”       | “What did I say for the headline?”                                      | “What was the last filter?”
Update    | “Start a new search on the flu”| “Remove my criteria for the headline but search only in economical news”| “No, NI means National Insurance”
Delete    | “No, let’s start again”        | “Remove the headline criteria”                                          | “Remove the filter for treatments”

3. Conceptualizing Conversational Reformulations

Conversational reformulations are query reformulations using natural language. While query reformulations in web search usually are stand-alone queries that can directly be submitted to retrieve results, formulating queries from conversational reformulations requires an additional step that “adds” the conversational context. This section discusses the implications on three levels: (1) a model of conversational reformulations as meta-queries; (2) the problem of algorithmically understanding reformulations; and (3) the process of creating the “laconic” queries from the reformulations.

3.1. Casting Conversational Reformulations as Meta-Queries

From the information system’s perspective, the information seeker uses a meta-query language when expressing conversational reformulations: they tell the system to perform specific operations on the previous query. On a syntactic level, the basic reformulation operations in traditional search interfaces are adding, changing, and removing a term. These correspond to the basic operations create, update, and delete of data systems [38]. The fourth basic operation of data systems, read, may also be useful if the previous query is not visible, like in some conversational interfaces. For illustration, Table 1 shows conversational examples for each basic operation. One query reformulation can contain several basic operations.
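To make the CRUD view concrete, the conversation layer can be pictured as maintaining a mutable laconic query to which meta-query operations are applied. The following sketch is illustrative only: the conjunctive query representation and all class and method names are our own assumptions, not part of the paper’s dataset or of any system described here.

```python
from dataclasses import dataclass, field

# A laconic query in conjunctive form: every clause must match an item;
# a clause matches if any of its terms occurs (or none, if negated).
@dataclass
class Clause:
    terms: list           # alternatives, e.g. ["vaccination"]
    negated: bool = False

@dataclass
class Query:
    clauses: list = field(default_factory=list)

    # Create: add a new term as an additional clause.
    def create(self, term):
        self.clauses.append(Clause([term]))

    # Update: replace a term by another one wherever it occurs.
    def update(self, old, new):
        for clause in self.clauses:
            clause.terms = [new if t == old else t for t in clause.terms]

    # Delete: drop every clause that mentions the term.
    def delete(self, term):
        self.clauses = [c for c in self.clauses if term not in c.terms]

    # Read: render the current query, e.g., to read it back to the seeker.
    def read(self):
        parts = []
        for c in self.clauses:
            alt = " OR ".join(c.terms)
            parts.append(f"NOT ({alt})" if c.negated else f"({alt})")
        return " AND ".join(parts)

# The dialog of Figure 1 (b): create "COVID-19", then exclude "vaccination".
q = Query()
q.create("COVID-19")
q.clauses.append(Clause(["vaccination"], negated=True))
print(q.read())  # prints: (COVID-19) AND NOT (vaccination)
```

Under this toy model, the ambiguity discussed above becomes an explicit choice between operations: “OK, how about vaccinations?” could map to q.create("vaccination") or to replacing the whole query, which yield different laconic queries and thus different result lists.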
3.2. Algorithmically Understanding Natural Language Reformulations

In the past years, impressive advancements have been achieved in natural language understanding. Still, when a message can be interpreted as different meta-queries, the problem is far from solved. For example, what if the seeker had asked “OK, how about vaccinations?” as their second message in Figure 1 (b)? Is the intent to specify the previous query or to start a new one? Hints on the true intent might be found in previous messages (the relation between ‘vaccination’ and the previous query) or previous results (maybe the seeker read something that caused the question). Other conversations may suggest quite different interpretations. For example, if asked after “Can you show me articles about its treatments?”, one could interpret ‘vaccination’ as a replacement for ‘treatments.’ To resolve such ambiguities, search systems may ask the seeker for clarification [39, 40] or they may try heuristic disambiguation. Such heuristics could, for example, employ word or entity relationships (e.g., using WordNet or knowledge graphs) or query performance predictors like term specificity and result coherence [41].

3.3. Formulating Laconic Queries for Keyword-based Retrieval

The example in Figure 1 (b) shows that conversational reformulations (like the seeker’s second message) have to be converted to context-independent queries to submit them to standard retrieval systems. This is similar to “query rewriting,” a process that resolves co-references in conversational question answering (e.g., [6]). In fact, similar methods may be effective for conversational reformulations. In the protocol stack of Figure 1 (a), conversational reformulations can thus be seen as a service of the conversation layer that builds upon the retrieval service of the laconic layer.

4. Crowdsourcing Reformulations

To foster research on conversational query reformulations, we publish a respective crowdsourced dataset4 that accounts for diversity in seeker location (five English-speaking countries) and search domain (four different ones). The dataset’s goal is to allow for analyzing the diversity and ambiguity in the language of conversational reformulations. However, the dataset also allows bootstrapping the natural language understanding component of conversational systems [42].

Figure 2 shows the interface used in Amazon’s Mechanical Turk marketplace to collect “natural” reformulations. We iteratively refined the interface in eight pilot studies with 80 participants to minimize the interface’s influence on the participants’ choice of words. Based on insights from the pilot studies, we formulated the tasks as bullet points with a sentence structure clearly different from the reformulations we asked for. Moreover, automatic checking routines help the participants to stick to the task (e.g., alerts for undesired repetitions or missing terms). The interface resembles a WhatsApp chat to prime participants on chat messages [43].

After an initial “ready” interaction to illustrate the task (cf. top of Figure 2), each participant completed twelve assignments from one domain as a single search session. To analyze reformulation diversity, we changed the task domain and topic between participants: either finding arguments on banning plastic bags, finding books on (Sci-Fi) viruses, finding news on COVID-19, or finding trips to San Jose (a very widespread city name). However, the search tasks for each participant had the same structure of abstract operations (e.g., create one term), with only keywords being replaced between the domains.5 To ensure a variety of reformulations, we formulated the instructions to cover all four CRUD operations, to vary the targets from a single literal to the whole query, to cover conjunctions and disjunctions, and to include some special cases like a filter attribute, an unspecified literal, or a negation.

Participants completed a session in about 12 minutes (observed in the pilot studies and the final study), and we adjusted the payment to cover the minimum wage of the respective country.6 Unfortunately, we had to stop our study in India and Australia after the first domain (news): only 22% of the Indian participants provided reasonable messages for the tasks (61% in the other countries), while getting answers from 20 Australian participants alone exhausted our time constraints. In total, we accepted the work of 284 participants.

To ensure the dataset’s quality and ease processing, we manually checked each message. Of the initially 3408 messages, 2917 are grammatically and semantically meaningful in the respective context. Of these, 2694 (79% of all) can be interpreted as the respective intended meta-query and form the final dataset: 558 messages for the argument domain, 573 for book, 961 for news, and 602 for trip (cf. Table 2 for other key statistics).

4 Publicly available at https://doi.org/10.5281/zenodo.5031960.
5 The keywords are contained in the README file of the dataset. The annotation interface for each domain is presented by the respective -interface.html file.
6 https://medium.com/ai2-blog/crowdsourcing-pricing-ethics-and-best-practices-8487fd5c9872

5. Analyzing Reformulations

As for all natural language systems, developing systems that allow for conversational query reformulations demands investigating the peculiarities of the respective language.

Figure 2: Dataset collection interface (excerpt). After submitting a message (bottom box), the interface alerted participants of potential missing or forbidden terms.
Valid messages appear on the right and the next assignment (“task”) on the left.

To this end, Section 5.1 provides a general overview of the messages collected in our dataset, highlighting differences in language use between search domains and countries. Though the exact patterns occur with different frequencies, they are, in general, similar across both search domains and countries, indicating that a generic reformulation resolution system might be feasible. In Section 5.2, we report on our detailed analysis of the ambiguities in the messages, showcasing both the ambiguities and possible ways to avoid them.

Table 2
Key statistics of the dataset by the participants’ location (country). Messages have been manually categorized as being either a command (Co.), question (Qu.), or statement (St.).

Location       | Participants: ∑  Arg. Book News Trip | Messages: ∑   Co.  Qu.  St.
Australia      |            20    0    0   20    0    |         192  0.76 0.08 0.16
Canada         |            83   20   20   23   20    |         781  0.58 0.19 0.22
Great Britain  |            80   20   20   20   20    |         774  0.60 0.25 0.15
India          |            21    0    0   21    0    |         181  0.78 0.10 0.12
United States  |            80   20   20   20   20    |         766  0.54 0.20 0.26
Total          |           284   60   60  104   60    |        2694  0.60 0.20 0.20

5.1. Comparing Messages across Tasks, Domains, and Countries

We have systematically analyzed the 2694 messages of our dataset. Besides the more general analyses of message types, we also focus on word frequencies and patterns. Apart from a small difference in preposition frequencies for trips (more frequent use of “to” where “about” is used in the other domains), argument search has one difference to the other domains in that a few participants formulated their requests not as a query but as asking for the system’s opinion (e.g., “What do you think about banning plastic bags?”).

To investigate the word patterns of conversational query reformulations and differences between search domains and countries, our study design employs the same sequence of twelve abstract tasks for each participant, only exchanging a few keywords to specify the different search domains. Table 3 shows the formal tasks and provides general characteristics of the collected messages. The ∑-column shows the total number of valid messages per task. Though this number is close to the maximum of 284 (the number of participants) for most tasks, it is relatively low for Tasks 4, 9, and 10, indicating a misunderstanding by the participants. We see such misunderstandings as an artifact of our study setup and filter out the affected messages from our below analyses.7

7 For completeness, we provide these messages in a separate file along with the dataset.

Table 3
The abstracted sequence of operations (Create, Read, Update, Delete) that each participant performed in one of four domains, together with a statistical overview of the messages in the dataset, including their absolute number and the relative frequency of each type (command (Co.), question (Qu.), or statement (St.)) as well as of unambiguous and ambiguous ones. For clarity, queries qᵢ are provided here in Boolean form. An item is relevant for a query qᵢ if qᵢ’s expression evaluates to true for the item, where lᵢ denotes a literal that evaluates to true if the corresponding term occurs in the item, ?ᵢ a literal with no corresponding term, and fᵢ(e) an expression that evaluates to true if e evaluates to true for attribute fᵢ of the item. See the dataset’s README file for the corresponding terms and attributes for each domain and the interface HTML files for the respective task descriptions.

#  | Operation (CRUD)           | Result query                  |  ∑  | Co.  Qu.  St. | Unam. | Alt. interpretations [messages]
1  | C: l₁                      | q₁ = l₁                       | 275 | 0.61 0.32 0.07 | 1.00 | -
2  | C: l₂                      | q₂ = l₁ ∧ l₂                  | 226 | 0.68 0.23 0.09 | 0.36 | U: q₁ → l₁ ∨ l₂ [0.35]; U: q₁ → l₂ [0.64]
3  | U: l₂ → l₂ ∨ l₃            | q₃ = l₁ ∧ (l₂ ∨ l₃)           | 212 | 0.69 0.24 0.07 | 0.27 | U: q₂ → l₃ [0.16]; U: q₂ → (l₁ ∧ l₂) ∨ l₃ [0.72]
4  | D: l₂ ∨ l₃                 | q₄ = l₁                       |  70 | 0.60 0.34 0.06 | 1.00 | -
5  | C: f₁: (l₄ ∨ l₅ ∨ l₆)      | q₅ = l₁ ∧ f₁: (l₄ ∨ l₅ ∨ l₆)  | 262 | 0.71 0.18 0.10 | 0.48 | U: q₄ → f₁: (l₄ ∨ l₅ ∨ l₆) [0.52]
6  | U: l₅ → ?₁                 | q₆ = l₁ ∧ f₁: (l₄ ∨ ?₁ ∨ l₆)  | 258 | 0.42 0.03 0.54 | 0.61 | U: l₅ → ¬l₅ [0.39]
7  | U: ?₁ → l₇                 | q₇ = l₁ ∧ f₁: (l₄ ∨ l₇ ∨ l₆)  | 276 | 0.32 0.01 0.66 | 1.00 | -
8  | U: l₁ → l₈                 | q₈ = l₈ ∧ f₁: (l₄ ∨ l₇ ∨ l₆)  | 259 | 0.67 0.17 0.15 | 1.00 | -
9  | R: q₈                      | q₉ = l₈ ∧ f₁: (l₄ ∨ l₇ ∨ l₆)  | 189 | 0.38 0.59 0.03 | 1.00 | -
10 | U: f₁: (l₄ ∨ l₇ ∨ l₆) → l₉ | q₁₀ = l₈ ∧ l₉                 | 134 | 0.66 0.23 0.11 | 1.00 | -
11 | U: q₁₀ → l₁₀               | q₁₁ = l₁₀                     | 269 | 0.72 0.14 0.14 | 0.52 | C: l₁₀ [0.48]
12 | C: ¬l₁                     | q₁₂ = l₁₀ ∧ ¬l₁               | 264 | 0.76 0.11 0.13 | 1.00 | -

Message types. For a first general overview, we manually annotated each message as being a command, a question, or a statement. Table 2 shows the message type usage per country. While participants from Australia and India used more commands, participants from Great Britain used more questions, and participants from the United States used more statements on average. Overall, we found this observation to show the main difference between countries, with no notable difference in word choice in the relatively small amount of data.

Table 3 shows the message type usage per task. The most frequent are commands (e.g., “Please remove arguments about banning plastic bags.” for Task 12), which often make up at least 60% of all messages per task. The one task where the majority of messages are questions is Task 9, which is the only task that involves a read operation (e.g., “Can you please remind me of my previous commands?” and “What did I search for?”). Though statements (e.g., “I would like to see news that are not about COVID-19.” for Task 12) are relatively rare in general, they are the dominant type for the two tasks that deal with correcting a misunderstanding: Task 6 is to tell the system that it misinterpreted an acronym (while leaving the acronym unspecified, e.g., “I did not mean NI as North Ireland.”), whereas Task 7 is to specify the intended meaning (e.g., “I meant National Insurance.”). Participants switching to statement messages might thus be an indicator that they are pointing out problems.

References to current list. As a potential signal to identify reformulations, participants sometimes, but unfortunately rarely, refer to the items in the (imagined) result list when reformulating the query (e.g., “Which of these include vaccination?”). Specifically, in the rare cases for the respective tasks (all but Tasks 1, 6, and 7), participants use those (3%), ones (3%), these (1%), or them (0.4%). Somewhat more frequently, 16% refer to the list and 2% to results (e.g., “only show me results that include vaccination”). As a difference between domains, 1% of the messages in the respective tasks of the trip domain use there (e.g., “I want to have a travel to there by ship.”). More common than references to the current list is the use of the domain-specific item (argument, book, article, trip), with the special case of the phrases ‘pros and cons’ and ‘for and against’ to specify that both sides should be considered in argument search (e.g., “Can you give me arguments for and against banning plastic bags please”). However, these do not indicate reformulations.

Growing and shrinking. As a somewhat strong signal for reformulations, many participants explicitly expressed whether the result list should grow or shrink. Verbal expressions for shrinking the current list (Tasks 2, 5, 12) are remove (9% of the messages in these tasks), filter (5%, e.g., “Filter list to just about vaccination.”), exclude (4%), narrow down (3%), reduce (1%), limit (0.7%), filter out (0.5%, e.g., “Please filter out all articles that are not about vaccination.”), and shorten (0.5%). Note that some of these verbs indicate which items to remove, whereas others indicate which items to keep. Overall, 26% of the messages in these tasks contain a verb that explicitly requests to shrink the list. Other signals for shrinking are the use of only (18%) and just (6%), though these percentages may be inflated as the task descriptions also contain these words. Frequently used verbal expressions for growing the current list (Tasks 3, 4) are add (25%), add back (2%, e.g., “add back the trips by car.”), and expand (3%, e.g., “Expand the list to include books about plants too.”). Other signals for growing are the use of also (23%), as well (4%), too (3%, e.g., “can you also add those including treatment too?”), and as well as (1%). Interestingly, participants used the verb keep both to shrink the list in Task 2 (2% of the messages for Task 2, e.g., “Keep articles related to vaccination”) and, in a lexically indistinguishable way, to partially undo such shrinking in Task 3 (1%, e.g., “Please keep the arguments about renewable resources”).

Summary. The observed differences between countries and domains are relatively small. A change of the seeker from asking questions to expressing statements often indicates specific unusual requests, though differences between countries need to be considered. Finally, many participants directly requested the growing and shrinking of the result list in their reformulations.

5.2. Analyzing Operation Ambiguities

A common problem for natural language interfaces is the ambiguity of natural language. For reformulations, this means that the same message can be interpreted as different operations. In our study, we found the following three main ambiguities.

Specializing a query or starting a new one. When asked to specialize the query by adding a new term (Task 2), the majority of our participants (66%, cf. Table 3) used a message that one could also interpret as starting a new query with that one term (e.g., “Just show me arguments about CO2 emissions”). We observe the same ambiguity in other specialization tasks (Task 3, 16% of the messages, e.g., “Can you show me arguments that are about renewable resources?”; Task 5, 52%, e.g., “May I please see the articles that have NCD, NI, or WHO in the headline?”) and in tasks that ask to start a new query (Task 11, 48%, e.g., “Find a list of books about evolution.”).8 Still, some participants directly used unambiguous messages, either by explicitly referencing the current list (e.g., “Great can you refine that to articles with NCD, NI or WHO in the headline?”) or by indicating a new list or search (45% of the messages for Task 11, e.g., “Find a list of books about evolution,” “New search on evolution,” or “Disregard all previous instructions and now only find me books about evolution”). Moreover, 8% of the messages for Task 11 are unambiguous due to explicitly expressing a replacement, for example, “Show me trips to Santiago instead.”

8 Taken literally, also several messages for Tasks 1 and 12 would be ambiguous. However, the alternative interpretations make no sense in the respective contexts. We ignore these strictly lexical ambiguities in our considerations.

Unclear precedence. Though there are precise rules for operator precedence in logics, no such rules exist for natural language. Indeed, 72% of the messages for Task 3 do not clearly express whether the new term should be an alternative just to the last term (as asked for) or to the entire query (e.g., “Could you please also include arguments about renewable resources?”). About 11% of the participants’ messages are unambiguous by explicitly stating the relation (e.g., “Show me trips by ship or by car,” where ‘ship’ is the previously added query term). Though not asked to do so, a few of these participants also hinted at a reason for asking for the alternative (e.g., “Show me trips by ship, if ship trips are not available then I would like to select trips by car.”). A few participants (1%) made use of an explicitly stated filter term from the previous query, which then allowed them to refer back to it (e.g., “Filter list for infected animals” and then “Add plants to filter as alternative”).

Ambiguous negation. Surprisingly, several messages submitted to tell the system that it misinterpreted an acronym are lexically indistinguishable from filtering by the acronym. While the majority of messages for Task 6 are unambiguous as expected (61%, e.g., “I’m not asking for North Ireland”), 39% of the messages are ambiguous and could easily be misunderstood (e.g., “Do not include articles about North Ireland”). Indeed, only the fact that the user had just added ‘NI’ as an acronym hints at the intended meaning.

Implications for Conversational Search

We can only hypothesize how a seeker’s interactions would differ if a search system supported conversational query reformulations. As mentioned above and in Section 1, one possibility is that the reduced cognitive effort due to step-wise and natural language query formulation encourages more complex queries that contain more search operators. Extending these considerations, we expect that some seekers will desire to regularly use the same query like a feed (e.g., for news, but also for professional activities like scholarly search [45]) and may see query formulation as an act of personalization. Therefore, some systems may even aim to support reformulations like “A bit more on soccer.” At the same time, we believe that supporting reformulations will be essential for having a conversation between seeker and search system, as reformulations implement a straightforward way for the seeker to ground the conversation [46], complementing clarification questions from the system. At such a stage, the conversations will be more natural. And the users will be “relieved” from the below laconic layer—just like today’s users of the laconic layer do not need to know any details about the underlying TCP/IP layer.

6. Conclusion

We have formalized the problem of supporting reformulations in conversational search systems. By casting reformulations as meta-queries that imply standard CRUD operations on the “actual” query, we demonstrate that such functionality could be implemented in a conversation layer on top of standard retrieval architectures. An analysis of a new dataset of 2694 crowdsourced human reformulations across four search domains shows that a generic reformulation component is feasible when considering the peculiarities of the respective search domains. However, we also find that ambiguities in the reformulations will likely be a major challenge for conversational systems. Ambiguities thus merit further investigation.

Future Work

We see several opportunities to extend the analysis of this paper.

Other languages. We considered only English messages so far but expect at least some of the results to be language-dependent. Further analyses will need to be conducted for other languages.

Generalized read operation. We considered only the most simple read operation: reading the current query. However, seekers may also want to fetch a query they used some time ago, maybe to continue or refresh a previous search.

More search operators. We considered only the standard logical operators (∨, ∧, ¬) and attribute-specific filters, but most retrieval systems support more. How would seekers highlight phrases (words to be retrieved in that sequence), initiate boosting (a term or attribute being especially important), or fuzzy / strict term matching? As hypothesized in Section 1, step-wise query formulation might increase the use of diverse search operators.

Clarifications. We considered only messages from the seeker, but studying possible system reactions is equally essential to account for implicit feedback (repeating what was understood) or asking clarification questions. Both methods are likely helpful to explain and resolve ambiguities, and could at the same time allow the system to showcase unambiguous formulations in an attempt to teach the seeker how to prevent the ambiguities in the future.9

9 Methods to explain ambiguities in reformulations may build upon early work on explaining ambiguities in formal query languages [44].

Acknowledgments

This work has been partially funded by the Google Digital News Initiative as part of the “Conversational News” project.

References

[1] P. Boldi, F. Bonchi, C. Castillo, S. Vigna, Query reformulation mining: models, patterns, and applications, Information Retrieval 14 (2011) 257–289. doi:10.1007/s10791-010-9155-3.
[2] J.
[12] J. Teevan, C. Alvarado, M. S. Ackerman, D. R. Karger, The perfect search engine is not enough: a study of orienteering behavior in directed search, in: E. Dykstra-Erickson, M. Tscheligi (Eds.), Proc. of CHI, ACM, 2004, pp. 415–422. doi:10.1145/985692.985745.
[13] M. Aliannejadi, H. Zamani, F. Crestani, W. B. Croft, Asking clarifying questions in open-domain
Chen, J. Mao, Y. Liu, F. Zhang, M. Zhang, S. Ma, information-seeking conversations, in: B. Pi- Towards a better understanding of query reformu- wowarski, M. Chevalier, É. Gaussier, Y. Maarek, lation behavior in web search, in: J. Leskovec, J. Nie, F. Scholer (Eds.), Proc. of SIGIR, ACM, 2019, M. Grobelnik, M. Najork, J. Tang, L. Zia (Eds.), pp. 475–484. doi:10.1145/3331184.3331265. Proc. of WWW, ACM / IW3C2, 2021, pp. 743–755. [14] M. Sanderson, W. B. Croft, The history of infor- doi:10.1145/3442381.3450127. mation retrieval research, Proc. of IEEE 100 (2012) [3] W. B. Croft, D. Metzler, T. Strohman, Search En- 1444–1451. doi:10.1109/JPROC.2012.2189916. gines - Information Retrieval in Practice, Pearson [15] S.-C. Lin, J.-H. Yang, R. Nogueira, M.-F. Tsai, C.- Education, 2009. J. Wang, J. Lin, Multi-stage conversational pas- [4] J. R. Trippas, D. Spina, P. Thomas, M. Sanderson, sage retrieval: An approach to fusing term impor- H. Joho, L. Cavedon, Towards a model for spo- tance estimation and neural query rewriting, 2020. ken conversational search, Information Process- arXiv:2005.02230. ing Management 57 (2020) 102162. doi:10.1016/ [16] J. Jiang, W. Jeng, D. He, How do users respond j.ipm.2019.102162. to voice input errors?: lexical and phonetic query [5] R. W. White, D. Morris, Investigating the querying reformulation in voice search, in: G. J. F. Jones, and browsing behavior of advanced search engine P. Sheridan, D. Kelly, M. de Rijke, T. Sakai (Eds.), users, in: W. Kraaij, A. P. de Vries, C. L. A. Clarke, Proc. of SIGIR, ACM, 2013, pp. 143–152. doi:10. N. Fuhr, N. Kando (Eds.), Proc. of SIGIR, ACM, 2007, 1145/2484028.2484092. pp. 255–262. URL: https://doi.org/10.1145/1277741. [17] J. R. Trippas, D. Spina, L. Cavedon, H. Joho, 1277787. doi:10.1145/1277741.1277787. M. Sanderson, How do people interact in conver- [6] R. Anantha, S. Vakulenko, Z. Tu, S. Longpre, S. Pul- sational speech-only search tasks: A preliminary man, S. 
Chappidi, Open-domain question answer- analysis, in: Proc. of CHIIR, ACM, 2017, pp. 325– ing goes conversational via question rewriting, 328. doi:10.1145/3020165.3022144. in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, [18] N. Sa, X. Yuan, Examining users’ partial query mod- D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, ification patterns in voice search, Journal of the T. Chakraborty, Y. Zhou (Eds.), Proc. of NAACL- Association for Information Science and Technol- HLT, ACL, 2021, pp. 520–534. URL: https://www. ogy 71 (2020) 251–263. doi:10.1002/asi.24238. aclweb.org/anthology/2021.naacl-main.44/. [19] A. Saha, V. Pahuja, M. M. Khapra, K. Sankara- [7] M. Zaib, W. E. Zhang, Q. Z. Sheng, A. Mahmood, narayanan, S. Chandar, Complex sequential ques- Y. Zhang, Conversational question answering: A tion answering: Towards learning to converse over survey, CoRR abs/2106.00874 (2021). URL: https: linked question answer pairs with a knowledge //arxiv.org/abs/2106.00874. graph, 2018. arXiv:1801.10314. [8] J. Culpepper, F. Diaz, M. Smucker, Research fron- [20] S. Reddy, D. Chen, C. D. Manning, Coqa: A con- tiers in information retrieval: Report from the versational question answering challenge, 2018. third strategic workshop on information retrieval arXiv:1808.07042. in lorne (SWIRL), SIGIR Forum 52 (2018) 34–90. [21] E. Choi, H. He, M. Iyyer, M. Yatskar, W. tau Yih, [9] R. S. Taylor, The process of asking questions, Amer- Y. Choi, P. Liang, L. Zettlemoyer, Quac : Question ican Documentation 13 (1962) 391–396. doi:10. answering in context, 2018. arXiv:1808.07036. 1002/asi.5090130405. [22] J. Dalton, C. Xiong, J. Callan, Cast 2020: The [10] D. W. Oard, Query by babbling: A research agenda, conversational assistance track overview, in: in: Proc. of IKM4DR, IKM4DR’12, ACM, New York, E. M. Voorhees, A. Ellis (Eds.), Proc. of TREC, NY, USA, 2012, pp. 17–22. doi:10.1145/2389776. volume 1266 of NIST Special Publication, National 2389781. 
Institute of Standards and Technology (NIST), [11] F. Radlinski, N. Craswell, A theoretical frame- 2020. URL: https://trec.nist.gov/pubs/trec29/papers/ work for conversational search, in: Proc. of CHIIR, OVERVIEW.C.pdf. CHIIR ’17, ACM, New York, 2017, p. 117–126. [23] A. Elgohary, D. Peskov, J. Boyd-Graber, Can you doi:10.1145/3020165.3020183. unpack that? learning to rewrite questions-in- [12] J. Teevan, C. Alvarado, M. S. Ackerman, D. R. context, in: Proc. of EMNLP-IJCNLP, Association Karger, The perfect search engine is not enough: a for Computational Linguistics, Hong Kong, China, 2019, pp. 5918–5924. URL: https://www.aclweb. requests and thoughtful actions, in: H. Li, G. Levow, org/anthology/D19-1605. doi:10.18653/v1/D19- Z. Yu, C. Gupta, B. Sisman, S. Cai, D. Vandyke, 1605. N. Dethlefs, Y. Wu, J. J. Li (Eds.), Proc. of SIGdial, [24] S. Yu, J. Liu, J. Yang, C. Xiong, P. Bennett, J. Gao, ACL, 2021, pp. 77–88. URL: https://aclanthology. Z. Liu, Few-shot generative conversational query org/2021.sigdial-1.9. rewriting, 2020. arXiv:2006.05009. [35] J. J. Leggett, G. Williams, An empirical investiga- [25] Z. Chen, X. Fan, Y. Ling, L. Mathias, C. Guo, tion of voice as an input modality for computer pro- Pre-training for query rewriting in A spo- gramming, International Journal of Man-Machine ken language understanding system, CoRR Studies 21 (1984) 493–520. doi:10.1016/S0020- abs/2002.05607 (2020). URL: https://arxiv.org/abs/ 7373(84)80057-7. 2002.05607. arXiv:2002.05607. [36] L. Rosenblatt, Vocalide: An ide for programming [26] S.-C. Lin, J.-H. Yang, J. Lin, Contextualized via speech recognition, in: Proc. of SIGACCESS, query embeddings for conversational search, 2021. ASSETS’17, ACM, New York, NY, USA, 2017, pp. arXiv:2104.08707. 417–418. doi:10.1145/3132525.3134824. [27] S. Yuan, S. Gupta, X. Fan, D. Liu, Y. Liu, C. Guo, [37] A. W. Biermann, L. Fineman, J. F. 
Heidlage, A Graph enhanced query rewriting for spoken lan- voice- and touch-driven natural language editor guage understanding system, in: Proc. of and its performance, International Journal of Man- ICASSP, IEEE, 2021, pp. 7997–8001. doi:10.1109/ Machine Studies 37 (1992) 1–21. doi:10.1016/ ICASSP39728.2021.9413840. 0020-7373(92)90089-4. [28] N. J. Belkin, D. Kelly, G. Kim, J. Kim, H. Lee, G. Mure- [38] J. Martin, Managing the Data Base Environment, 1 san, M. M. Tang, X. Yuan, C. Cool, Query length ed., Prentice Hall PTR, USA, 1983. in interactive information retrieval, in: C. L. A. [39] J. Kiesel, A. Bahrami, B. Stein, A. Anand, M. Ha- Clarke, G. V. Cormack, J. Callan, D. Hawking, A. F. gen, Toward Voice Query Clarification, in: Proc. Smeaton (Eds.), Proc. of SIGIR, ACM, 2003, pp. 205– of SIGIR, ACM, 2018, pp. 1257–1260. doi:10.1145/ 212. doi:10.1145/860435.860474. 3209978.3210160. [29] N. Balasubramanian, G. Kumaran, V. R. Car- [40] H. Zamani, S. T. Dumais, N. Craswell, P. N. Bennett, valho, Exploring reductions for long web queries, G. Lueck, Generating clarifying questions for in- in: F. Crestani, S. Marchand-Maillet, H. Chen, formation retrieval, in: Y. Huang, I. King, T. Liu, E. N. Efthimiadis, J. Savoy (Eds.), Proc. of SI- M. van Steen (Eds.), Proc. of WWW 2020, ACM / GIR, ACM, 2010, pp. 571–578. URL: https://doi.org/ IW3C2, 2020, pp. 418–428. doi:10.1145/3366423. 10.1145/1835449.1835545. doi:10.1145/1835449. 3380126. 1835545. [41] J. Arguello, S. Avula, F. Diaz, Using query perfor- [30] M. Bendersky, W. B. Croft, Discovering key con- mance predictors to improve spoken queries, in: cepts in verbose queries, in: S. Myaeng, D. W. Oard, N. Ferro, F. Crestani, M. Moens, J. Mothe, F. Silvestri, F. Sebastiani, T. Chua, M. Leong (Eds.), Proc. of SI- G. M. D. Nunzio, C. Hauff, G. Silvello (Eds.), Proc. of GIR, ACM, 2008, pp. 491–498. URL: https://doi.org/ ECIR, volume 9626 of LNCS, Springer, 2016, pp. 309– 10.1145/1390334.1390419. doi:10.1145/1390334. 321. 
doi:10.1007/978-3-319-30671-1\_23. 1390419. [42] C. Pearl, Designing Voice User Interfaces: Princi- [31] F. Li, H. V. Jagadish, Constructing an interactive ples of Conversational Experiences, 1st ed., O’Reilly natural language interface for relational databases, Media, Inc., 2016. Proc. VLDB Endowment 8 (2014) 73–84. doi:10. [43] A. Papenmeier, D. Kern, D. Hienert, A. Sliwa, 14778/2735461.2735468. A. Aker, N. Fuhr, Starting conversations with search [32] K. Affolter, K. Stockinger, A. Bernstein, A compar- engines - interfaces that elicit natural language ative survey of recent natural language interfaces queries, in: F. Scholer, P. Thomas, D. Elsweiler, for databases, VLDB Journal 28 (2019) 793–819. H. Joho, N. Kando, C. Smith (Eds.), Proc. of CHIIR, doi:10.1007/s00778-019-00567-8. ACM, 2021, pp. 261–265. doi:10.1145/3406522. [33] E. Kuric, J. D. Fernández, O. Drozd, Knowledge 3446035. graph exploration: A usability evaluation of query [44] J. A. Wald, P. G. Sorenson, Explaining ambiguity builders for laypeople, in: M. Acosta, P. Cudré- in a formal query language, ACM Transactions on Mauroux, M. Maleshkova, T. Pellegrini, H. Sack, Database Systems 15 (1990) 125–161. doi:10.1145/ Y. Sure-Vetter (Eds.), Proc. of SEMANTiCS, volume 78922.78923. 11702 of LNCS, Springer, 2019, pp. 326–342. doi:10. [45] K. Balog, L. Flekova, M. Hagen, R. Jones, M. Pot- 1007/978-3-030-33220-4\_24. thast, F. Radlinski, M. Sanderson, S. Vakulenko, [34] S. Tanaka, K. Yoshino, K. Sudoh, S. Nakamura, H. Zamani, Common Conversational Commu- ARTA: collection and classification of ambiguous nity Prototype: Scholarly Conversational Assistant, CoRR abs/2001.06910 (2020). URL: https://arxiv.org/ abs/2001.06910. [46] H. H. Clark, S. E. Brennan, Grounding in com- munication, in: Perspectives on Socially Shared Cognition, APA, 1991, pp. 127–149.
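The precedence ambiguity discussed in Section 5.2 can be made concrete with a small sketch. The boolean-query evaluator below and the query terms (‘nuclear’, ‘safety’, ‘renewable’) are our own illustration, not material from the study; they only show how the two readings of a message like “also include renewable” retrieve different documents.

```python
# Sketch of the precedence ambiguity: the same reformulation ("also
# include renewable") can attach the alternative to the last term only
# or to the entire query. The query terms are hypothetical examples.

def matches(doc_terms, query):
    """Evaluate a nested boolean query over a set of document terms."""
    op = query[0]
    if op == "term":
        return query[1] in doc_terms
    if op == "and":
        return all(matches(doc_terms, part) for part in query[1:])
    if op == "or":
        return any(matches(doc_terms, part) for part in query[1:])
    if op == "not":
        return not matches(doc_terms, query[1])
    raise ValueError("unknown operator: " + str(op))

def term(t):
    return ("term", t)

# Hypothetical previous query: nuclear AND safety.
# Reading 1 -- alternative to the last term: nuclear AND (safety OR renewable)
reading_1 = ("and", term("nuclear"), ("or", term("safety"), term("renewable")))
# Reading 2 -- alternative to the whole query: (nuclear AND safety) OR renewable
reading_2 = ("or", ("and", term("nuclear"), term("safety")), term("renewable"))
```

A document that mentions only renewable energy is retrieved under reading 2 but not under reading 1, which is exactly why the conversation layer must resolve the scope of the added term before updating the query.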
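To illustrate how a conversation layer might map reformulation messages to the CRUD operations the paper formalizes, here is a minimal keyword-rule sketch. The trigger phrases and the flat list-of-terms query representation are assumptions for illustration only; as the analysis in Section 5.2 shows, a deployed system would need to resolve ambiguities rather than rely on fixed patterns.

```python
# Minimal sketch of a conversation layer mapping reformulation messages
# to CRUD-style operations on the current query. Trigger phrases and the
# list-of-terms query model are illustrative assumptions, not the
# paper's implementation.
import re

def reformulate(query_terms, message):
    """Return (operation, new_query_terms) for a reformulation message."""
    msg = message.lower()
    # Delete: negate a topic, as in "Please no articles about vaccination."
    m = re.search(r"no (?:articles? about )?([\w-]+)", msg)
    if m:
        return "delete", query_terms + ["-" + m.group(1)]
    # Create: start a new query, as in "Show me news about COVID-19."
    m = re.search(r"(?:show me|find) (?:news|a list of \w+) about ([\w-]+)", msg)
    if m:
        return "create", [m.group(1)]
    # Read: repeat the current query back to the seeker.
    if msg.startswith("what is my current query"):
        return "read", query_terms
    # Update: add an alternative term; its scope (last term or whole
    # query) is often ambiguous, cf. the precedence ambiguity.
    m = re.search(r"also include (?:arguments about )?([\w-]+)", msg)
    if m:
        return "update", query_terms + [m.group(1)]
    return "read", query_terms
```

On the exchange from Figure 1 (b), this sketch maps “Show me news about COVID-19.” to a create operation and “Please no articles about vaccination this time.” to a delete that negates the term, mirroring COVID-19 ∧ ¬vaccination.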