<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Conversational AI from an Information Retrieval Perspective: Remaining Challenges and a Case for User Simulation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krisztian Balog</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Stavanger</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Conversational AI is an emerging field of computer science that engages multiple research communities, from information retrieval to natural language processing to dialogue systems. Within this vast space, we focus on conversational information access, a problem that is uniquely suited to be addressed by the information retrieval community. We argue that despite the significant research activity in this area, progress is mostly limited to component-level improvements. There remains a disconnect between current efforts and truly conversational information access systems. Apart from the inherently challenging nature of the problem, the lack of progress can, in large part, be attributed to the shortage of appropriate evaluation methodology and resources. This paper highlights challenges that render both offline and online evaluation methodologies unsuitable for this problem, and discusses the use of user simulation as a viable solution.</p>
      </abstract>
      <kwd-group>
        <kwd>Conversational information access</kwd>
        <kwd>conversational AI</kwd>
        <kwd>user simulation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
        <p>Conversational AI may be seen as the holy grail of computer science: building machines that are capable of interacting with people in a human-like way. With rapid advances in AI technology, there are reasons to believe that such an ambition is within reach [1]. Conversational AI is a vast and complex problem, which requires a combination of methods, tools, and techniques from multiple fields of computer science, including but not limited to artificial intelligence (AI), natural language processing (NLP), machine learning (ML), dialogue systems (DS), recommender systems (RecSys), human-computer interaction (HCI), and not least information retrieval (IR). Each of these fields may have its own particular interpretation of what conversational AI should entail, and a specific focus on certain of the research challenges involved. For example, in spoken dialogue systems the main motif is being able to talk to machines, i.e., developing speech-based human-computer interfaces [<xref ref-type="bibr" rid="ref24">2</xref>], and thus automatic speech recognition is a central component. Many other communities, on the other hand, assume a chat-based interface, and voice is not among the supported modalities. At the same time, there are many shared aspects, including handling the semantics involved in the dialogue process, generating contextually appropriate responses, and developing effective end-to-end (neural) architectures, which engage multiple research communities. In this paper, we focus on the problem of conversational information access (CIA), one that the IR community is uniquely suited to address.</p>
      <p>Conversational search, or conversational information seeking, was already identified in 2012 as a research direction of strategic importance in IR [3], and its significance was re-iterated in 2018 [4]. There, the problem focus was defined to include complex user goals that require multi-step information seeking, exploratory information gathering, and multi-step task completion and recommendation, as well as dialogue settings with variable communication channels. Our analysis of recent work, however, leads us to the observation that current efforts do not seem to be fully aligned with the directions set out there. In terms of end-to-end tasks, there are two main threads of work: conversational QA and conversational recommendation. Currently, these are treated as two separate types of systems, with different goals, architectures, and evaluation criteria. Instead, for more effective assistance of users, the two should be seamlessly integrated in CIA systems, thereby moving from a siloed to a more unified view. Additionally, the multi-modality of interactions needs to be more fully embraced, in order to more actively support effective interaction [5]. On the component level, most proposed techniques are not truly conversational in the sense that they are applicable to any interactive IR system (e.g., modern web search engines). A critical blocker to progress, on both the end-to-end and component levels, is the shortage of appropriate evaluation methodology and resources.</p>
      <p>DESIRES 2021 – 2nd International Conference on Design of Experimental Search &amp; Information REtrieval Systems, September 15–18, 2021, Padua, Italy
krisztian.balog@uis.no (K. Balog)
https://krisztianbalog.com/ (K. Balog)
ORCID: 0000-0003-2762-721X (K. Balog)</p>
      <p>© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>In summary, the contributions of this paper are threefold.
• We argue for a broader interpretation of conversational information access, one that embraces multiple user goals (mixing task-oriented and QA elements) and multi-modal interactions (Sect. 2).
• We provide a synthesis of progress on conversational information access and identify open challenges around methods and evaluation (Sect. 3).
• We argue for (a more extensive) use of simulation as a viable evaluation paradigm for conversational information access and describe a simulator architecture (Sect. 4).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Defining Conversational Information Access</title>
    </sec>
    <sec id="sec-3">
      <sec id="sec-3-1">
        <p>This section defines conversational information access and places it in the broader context of conversational AI.</p>
        <sec id="sec-3-1-1">
          <title>2.1. Conversational AI: The big picture</title>
          <p>Conversational AI 1 is casually used to denote a broad
range of systems that are capable of (some degree of)
natural language understanding and responding in a way
that mimics human dialogue. A conversational AI system
may thus be considered successful if it offers an
experience that is indistinguishable from what could have
been delivered by a human. These systems often focus
on a particular type of conversational support, naturally
lending themselves to categorization.</p>
          <p>1 In this paper, the terms conversational AI, conversational agent, and dialogue system are used interchangeably. We, however, avoid using the term chatbot, which has a different meaning in industrial and academic contexts; in the former case it refers to a task-oriented system, while in the latter it means a non-task-oriented system [7].</p>
          <p>2.1.1. Traditional 2-way Categorization</p>
          <p>Traditionally, conversational agents are categorized as being goal-driven (or task-oriented) or non-goal-driven (also known as chatbots) [8, 9, 7]. Goal-driven systems aim to assist users in completing some specific task. Dialogues are constrained to a specific domain and characterized by having a designated structure, designed for particular tasks within that domain. The main success criterion for the conversational agent is its ability to help the user solve their task as efficiently as possible. Typical examples include travel planning and appointment scheduling.</p>
          <p>Non-goal-driven systems, on the other hand, aim to carry on an extended conversation (“chit-chat”) with the goal of mimicking unstructured human-human interactions. The main purpose of these systems is usually entertainment or providing an “AI companion” [10]. Therefore, the objective for these systems is to be able to talk about different topics in an engaging and cohesive manner.</p>
          <p>2.1.2. Contemporary 3-way Categorization</p>
          <p>Most recently, the traditional categorization has been extended with a third category, interactive question answering (QA) [1, 6], in recognition of the fact that it fits neither task-oriented dialogue nor social chat, but deserves a separate category in its own right. Interactive QA systems are designed to provide answers to specific questions. They are not characterized by a rigid dialogue flow, although they typically follow a question-answer pattern. Apart from some notable recent examples [11], the human-like conversation aspect is much less pronounced for QA systems than for the other two types of systems, and evaluation is restricted to answer correctness.</p>
          <p>Table 1 summarizes the characteristics of the three categories of conversational AI systems. Given their unique goals and objectives, each of these problem categories is addressed by a distinctive system architecture [1, 6].</p>
          <p>Figure 1: Example conversation with a CIA system.
User: I want to buy new running shoes.
System: My records say that you have been using a Nike Pegasus 33 before. How did you like that?
User: I liked it a lot on tarmac, but my feet often hurt a bit on very long asphalt runs.
System: Here are some alternatives for you. Of these, the ASICS Gel Nimbus 23 is especially renowned for its cushioned midsole.
User: What is the midsole?
System: The midsole is the bed of foam that lies between your foot and the ground. This is the part of the shoe responsible for it feeling soft or hard.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>2.2. Conversational Information Access</title>
          <p>Building on [4], we use the term conversational information access (CIA) to define a subset of conversational AI systems that specifically aim at a task-oriented sequence of exchanges to support multiple user goals, including search, recommendation, and exploratory information gathering, that require multi-step interactions over possibly multiple modalities. Further, these systems are expected to learn user preferences, personalize responses accordingly, and be capable of taking initiative.</p>
          <p>Consider the conversation shown in Fig. 1, illustrating some of the above requirements. It is primarily a task-oriented dialogue (the user wanting to buy new running shoes), which requires an exploration of the item space. Assuming a chat-based interface, this can be done most effectively by combining multiple modalities; not just text, but also, in this example, a carousel for cycling through items. Up until the second user utterance, it is a strictly task-oriented sequence of exchanges (cf. the task-oriented category in Table 1). But, then, the third user utterance breaks out of the task flow and switches to “QA mode” (cf. interactive QA in Table 1).</p>
          <p>2.2.1. From Siloed to Unified View</p>
          <p>One key realization the above example is meant to illustrate is that conversational information access cuts across the task-oriented and interactive QA categories. This blending makes CIA suited to assist users meaningfully with their needs. Conversely, existing work—and, importantly, evaluation initiatives—in IR almost exclusively focus on a question-answering paradigm (see Sect. 3). This does not allow for interaction with sets of items—one of the main properties that makes a search system conversational [12].</p>
          <p>It has been shown that the “siloed” view, represented by the three categories in Table 1, in practice does not align well with users’ information needs and behavior [13]. Gao et al. [1] acknowledge the need for a “top-level bot” that would act as a broker and switch between different user goals. Most commercial assistants are hybrid systems, with different degrees of support for switching. There is, however, little published research on it. In summary, there is a need for a more holistic view where multiple user goals are supported.2</p>
          <p>2.2.2. Multi-modality</p>
          <p>Another key point highlighted by the example in Fig. 1 is the need for embracing multi-modality. Text-only responses are motivated by an audio-only channel, without a screen [14]. However, more often than not a chat-based interface is available, which allows for a richer set of input controls and navigational components. These, in turn, would enable CIA systems to more actively support effective interaction [5]. We note that the need for multi-modality has been recognized independently by other scholars as well [15].</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3. Progress to Date and Remaining Challenges</title>
        <p>In this section, we reflect on progress achieved so far, organized around methods and evaluation, and identify remaining challenges.</p>
        <p>3.1. Methods</p>
        <p>In our discussion, we distinguish between end-to-end conversational tasks and specific component-level sub-tasks.</p>
        <p>3.1.1. End-to-end tasks</p>
        <p>There are two main tasks that have received attention: conversational QA [<xref ref-type="bibr" rid="ref13">16, 1, 17</xref>] and conversational recommendation [<xref ref-type="bibr" rid="ref22 ref25">18, 19, 20</xref>]. What distinguishes conversational QA from traditional single-turn QA is the need for contextual understanding. Hence, much of the research revolves around modeling conversation history [<xref ref-type="bibr" rid="ref13">16, 17, 21</xref>].</p>
      </sec>
      <sec id="sec-3-3">
        <p>2 We note that this problem is not specific to IR. However, conversational information access is a good starting point that the IR community is uniquely suited to address. Lessons and findings could then be generalized to broader applications.</p>
        <p>In terms of evaluation, however, the problem is simplified to a single-turn passage retrieval task, where the relevance of the system response at a given turn does not consider the responses given by the system at earlier turns [<xref ref-type="bibr" rid="ref13">22, 17</xref>]. It is only in conversational recommender systems that the multi-turn nature of conversations is more fully embraced [<xref ref-type="bibr" rid="ref22 ref25">18, 19, 20</xref>].</p>
        <p>3.1.2. Component-level sub-tasks</p>
        <p>Recently, progress has been made on specific sub-tasks for CIA, including response retrieval [23] and generation [24], query resolution [25, 26], asking clarifying questions [27] or suggesting questions [28], predicting user intent [29], and preference elicitation [<xref ref-type="bibr" rid="ref25">18, 30</xref>]. Each of these studies makes the point that the conversational setting calls for a different set of approaches. However, most of these sub-tasks are applicable in any interactive IR context, adhering to the stance that search is inherently a conversational experience: it is a dialogue between a human and a search engine [31]. From this perspective, there has been substantial progress, especially on the mixed-initiative aspect, e.g., question clarifications and suggestions [27, 28]. Alternatively, one may take a more critical stance and ask: What separates conversational information access from any other interactive IR system (most prominently: search engines)? According to Croft [5], the key distinguishing factor is that a conversational system is a more active partner in the interaction. In that regard, there is surprisingly little work, with only a handful of notable exceptions [32, 11].</p>
        <sec id="sec-3-3-1">
          <title>3.2. Evaluation</title>
          <p>3.2.1. Offline Evaluation</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
      <title/>
      <p>Traditionally, system-oriented evaluation in IR has been performed using offline test collections, following the Cranfield paradigm [33]. This rigorous methodology ensures the repeatability and reproducibility of experiments, and has been instrumental to progress in the field. To date, work on CIA still employs offline evaluation [22, 27], but this has severe limitations. First, reusability requires that the system is limited to selecting the best response, in answer to a user utterance, from a restricted set of possible candidates (i.e., some predefined corpus of responses).</p>
      <p>Second, it is limited in scope to a single conversation turn and does not consider the dialogue history that led to that particular user utterance (cf. red area in Fig. 2). An alternative is to let human evaluators assess an entire conversation, once it has taken place [6]. However, this is a single path (see blue area in Fig. 2), without considering the other choices the user could have taken during the course of the dialogue. Moreover, it is expensive, time-consuming, and does not scale. Most importantly, it would not yield a reusable test collection. In summary, offline test collections have their merits, but their use is limited to the purpose of evaluating specific components, in isolation. Further, the choice of evaluation metrics is an open challenge [34].</p>
      <p>3.2.2. Online Evaluation</p>
      <p>Online evaluation involves fielding an IR system to real users and observing how they interact with the system in situ, in their natural task environments [35]. This requires a live service as a research platform. Currently, this possibility is only available to researchers working at major service providers that develop conversational assistants (Google, Microsoft, Apple, Amazon). Even there, experimentation with live users is severely limited due to scalability, quality, and ethical concerns. Of these companies, only Amazon has decided to open up its platform for academic research, by organizing the Alexa Prize Challenge [36]. It represents a unique opportunity for academics to perform research with a live system used by millions of users, and provides university teams with real user conversational data at scale. While this effort points in the right direction, it is inherently limited in that it addresses social conversations (“chit-chat”), with the target goal of conversing coherently and engagingly with humans on popular topics such as sports, politics, or technology for 20 minutes. This is a non-goal-driven task, which is rather different from goal-driven CIA. Currently, there is no publicly available research platform for CIA. Living labs represent a novel evaluation paradigm for IR [37], which allows researchers to evaluate their methods with real users of live search services. This methodology has been successfully employed in world-wide benchmarking campaigns [38, 39].</p>
      <p>It, however, needs to be extended to a conversational setting, which brings about methodological and practical challenges.</p>
      <p>3.2.3. User Simulation</p>
      <p>With a long history in the field of spoken dialogue systems, user simulation is seen as a critical tool for automatic dialogue management design [40]. The idea is to train a user model that is “capable of producing responses that a real user might have given in a certain dialog situation” [40]. This is in line with our goals, but there are two crucial differences. First, the primary purpose of user simulation in DS is to generate synthetic training data at scale, which in turn can be used to learn dialogue strategies (typically, using reinforcement learning). Assessment of the quality of simulated dialogues and user simulation methods, however, is an open issue [41]. Second, dialogue systems, as well as recent work on conversational recommender systems [<xref ref-type="bibr" rid="ref25">18</xref>], are focused on supporting the user with a single goal that can be fulfilled by eliciting preferences on a set of attributes. CIA systems, on the other hand, need to deal with complex search and recommendation scenarios. This requires a more holistic user model.</p>
      <p>3.3.3. Evaluation</p>
      <p>There is a need to go beyond turn-based evaluation to multi-turn-based and eventually end-to-end evaluation. To be able to perform end-to-end evaluation of CIA systems, additional methodologies need to be considered, including online evaluation and simulated users. For online evaluation, the living labs paradigm represents an alternative, but it requires agreement on a canonical architecture in order to be able to open up individual components for experimentation. Further, it requires an existing service with live users, which is currently lacking. It should be noted that the need for such an open research platform has been identified, and a plan for the academic search domain has recently been outlined [44]. As for simulation, most existing approaches are meant to advance reinforcement learning techniques in a strictly goal-oriented setting. This is different from our purpose of evaluation. The simulation techniques that are currently used for evaluation lack the desired conversational complexity.</p>
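To make concrete the notion of a user model that "produces responses a real user might have given", a minimal agenda-based simulated user, in the spirit of the agenda-based approach from spoken dialogue systems, could look as follows. This is only a sketch; the tuple-based act representation and all names are hypothetical, not taken from the cited work:

```python
class AgendaBasedUser:
    """Minimal simulated user: pops pending user acts from an agenda and
    reacts to system acts (a sketch, not a full simulator)."""

    def __init__(self, goal):
        # goal: constraints the user wants satisfied, e.g. {"category": "running shoes"}
        self.goal = dict(goal)
        # Agenda of acts the user still intends to perform.
        self.agenda = [("inform", slot, value) for slot, value in goal.items()]

    def respond(self, system_act):
        kind = system_act[0]
        if kind == "request":              # system asks about a slot
            slot = system_act[1]
            if slot in self.goal:
                return ("inform", slot, self.goal[slot])
            return ("negate", slot, None)  # slot not part of the user's goal
        if kind == "offer":                # system recommends an item
            return ("accept", None, None)
        if self.agenda:                    # otherwise push the task forward
            return self.agenda.pop()
        return ("bye", None, None)


user = AgendaBasedUser({"category": "running shoes", "surface": "asphalt"})
print(user.respond(("request", "category")))  # ('inform', 'category', 'running shoes')
print(user.respond(("offer", "ASICS Gel Nimbus 23")))  # ('accept', None, None)
```

As the paper argues, such strictly goal-oriented agendas lack the conversational complexity (goal switching, multi-modality, learning) that CIA evaluation would require; the sketch illustrates the baseline being extended, not the target.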
        <sec id="sec-3-4-1">
          <title>3.3. Summary and Remaining Challenges</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. A Case for Simulation</title>
      <p>3.3.1. Understanding User Needs and Behavior</p>
      <sec id="sec-4-1">
        <p>Current characterizations of information seeking behavior for CIA are limited either in the set of actions considered [42] or in the sequences of conversational turns [43]. To cater for the functionality defined by Radlinski and Craswell [12], and further expanded by us in Sect. 2.2, one would need user and interaction models capable of representing (1) multi-modal interactions (speech, text, pointing&amp;clicking), (2) users’ ability to change their state of knowledge (learn and forget), and (3) users’ ability to learn how a system works and what its limits are (and change their expectations and behavior accordingly).</p>
        <p>3.3.2. Truly Conversational Methods</p>
      </sec>
      <sec id="sec-4-2">
        <p>Conversational recommendation and QA have been studied as end-to-end tasks. However, as we argued in Sect. 2.2, in practice these two are not clearly delineated applications, but rather different “modes” that should be seamlessly integrated within a CIA system. There has been significant progress on various components, which are indispensable building blocks. Integrating these into a unified system that supports multiple user goals remains an open challenge [1]. Further open questions in this space include (1) deciding when and what type of initiative a system should take, and (2) determining the best modality based on task and context.</p>
        <p>This section presents a proposal for robust, large-scale automatic evaluation of CIA systems via user simulation.</p>
        <p>4.1. Methodology</p>
      </sec>
      <sec id="sec-4-3">
        <p>Our main hypothesis is that it is possible to simulate human behavior with regard to interacting with CIA systems. To validate this hypothesis, we need to show that simulated users behave indistinguishably from real humans, in the context of a specific conversational application and with respect to specific evaluation measures.</p>
        <p>Formally, let S1 and S2 denote two CIA systems, which differ in some component(s). Both systems are assumed to be operated by a set U of users from some user population. Let us assume that there is a statistically significant difference observed in their relative performance, according to some evaluation measure M, such that M(S1, U) &lt; M(S2, U). Simulation is considered successful if, by engaging a set U* of simulated users, we observe the same relative system differences as with real users, i.e., M(S1, U*) &lt; M(S2, U*). Further, this observation should generalize across systems S and evaluation measures M.</p>
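The validation criterion can be sketched in code. The following is a minimal illustration (all function names and the per-user metric scores are fabricated for this example; a real validation would additionally require a statistical significance test on the observed difference):

```python
from statistics import mean


def relative_order(s1_scores, s2_scores):
    """Direction of the performance difference between systems S1 and S2,
    given per-user scores M(S, u): +1 if S1 is better, -1 if worse, 0 if tied."""
    d = mean(s1_scores) - mean(s2_scores)
    if d > 0:
        return 1
    if d == 0:
        return 0
    return -1


def simulation_valid(real_s1, real_s2, sim_s1, sim_s2):
    """Simulation is considered successful if simulated users (U*) reproduce
    the relative system ordering observed with real users (U)."""
    return relative_order(real_s1, real_s2) == relative_order(sim_s1, sim_s2)


# Fabricated per-user metric scores, for illustration only:
real_s1, real_s2 = [0.42, 0.38, 0.45], [0.51, 0.47, 0.55]  # M(S1,U) below M(S2,U)
sim_s1, sim_s2 = [0.40, 0.36], [0.49, 0.52]                # same ordering under simulation
print(simulation_valid(real_s1, real_s2, sim_s1, sim_s2))  # True
```

In the paper's terms, generalization would then be checked by repeating this comparison across system pairs S and evaluation measures M.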
        <p>The above formulation ensures that the behavior of simulated users aligns with that of real users. Notice that to be able to perform this validation, an operational CIA system is needed; we discuss the practical aspects of setting up such an experimental platform below, in Sect. 4.4. For the human evaluation part, i.e., measuring M(S, U), two distinct approaches may be employed: (1) asking users themselves, inside the CIA system, to give feedback on either the entire conversation or on specific system utterances, and (2) sampling interesting/meaningful branches from conversation logs, to be annotated by external human labelers (e.g., crowd workers).</p>
        <p>Figure 2: Simulated user interacting with a conversational information access system via text/voice and pointing/clicking. The simulator comprises natural language understanding (NLU), natural language generation (NLG), planning, execution, and learning components, built on top of user, interaction, and mental models.</p>
        <p>Once a user simulator is created and validated against real users, it may be used for evaluating a given CIA system. It is important to note that, in principle, a given user simulator instance should be used only once, the same way that an offline test collection should only be used once—to avoid overfitting systems to a particular test suite.</p>
        <sec id="sec-4-3-1">
          <title>4.2. Requirements</title>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <p>We identify a realistic user simulator with the ability of capturing:
(R1) Personal interests and preferences, and the changes of preferences over time;
(R2) Persona (personality, educational and socio-economical background, etc.);
(R3) Multi-modality of interactions (speech, text, pointing&amp;clicking, etc.);
(R4) The user’s ability to change their state of knowledge (learn and forget);
(R5) The user’s ability to learn how a system works and what its limits are, and change their expectations and behavior accordingly.</p>
        <p>We describe its main components below, and provide specific starting points for each of them. User, interaction, and mental models provide the foundation for simulation behavior.</p>
        <p>• User model. To represent all personal information related to a given user, including persona (R2), preferences (R1), and knowledge (R4), personal knowledge graphs (PKGs) [45] may be used. The reason for using a PKG is to ensure the consistency of the preferences that are revealed by the simulated user, as is done in [46]. To fully address R1 and R4, PKGs will need to be extended along two dimensions: (1) include concepts, in addition to entities, to represent the user’s knowledge of specific topics, with a further distinction to be made between entities/concepts the user has heard about vs. has in-depth knowledge of; (2) capture the temporal scope, to be able to distinguish between short- vs. long-term preferences and fresh vs. diminishing knowledge.</p>
        <p>• Interaction model. To characterize the CIA process between humans and systems for a given application, the key actions and decisions that manifest in dialogues need to be abstracted out. A starting point for a taxonomy of user/system actions is provided in [47]. This taxonomy may be revised and extended to multi-modal interactions (R3) based on conversations collected in laboratory user studies with an “idealized” CIA system using the Wizard-of-Oz approach [48], and from interaction data from actual CIA systems.</p>
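The two PKG extensions described above could be represented along these lines. This is a hypothetical sketch of the data structure, not the PKG model of the cited works:

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeItem:
    """An entity or concept in the simulated user's personal knowledge graph."""
    name: str
    kind: str          # "entity" or "concept"            (extension 1)
    depth: str         # "heard-about" or "in-depth"      (extension 1)
    acquired_at: float  # timestamp; knowledge may diminish over time (extension 2)


@dataclass
class Preference:
    item: str
    score: float       # e.g., -1.0 (dislike) to 1.0 (like)
    long_term: bool    # short- vs. long-term preference   (extension 2)


@dataclass
class UserModel:
    persona: dict = field(default_factory=dict)        # R2
    preferences: list = field(default_factory=list)    # R1
    knowledge: list = field(default_factory=list)      # R4

    def knows_in_depth(self, name: str) -> bool:
        return any(k.name == name and k.depth == "in-depth" for k in self.knowledge)


# Example: a user who has merely heard of "midsole" would ask the system about it
# (cf. the third user utterance in Fig. 1).
u = UserModel(
    persona={"occupation": "student"},
    preferences=[Preference("Nike Pegasus 33", 0.6, long_term=False)],
    knowledge=[KnowledgeItem("midsole", "concept", "heard-about", acquired_at=0.0)],
)
print(u.knows_in_depth("midsole"))  # False
```

Keeping preferences and knowledge in one consistent structure is what allows the simulator to reveal preferences without contradicting itself across turns.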
      </sec>
      <sec id="sec-4-5">
        <title/>
        <p>We note that not all these requirements are critical for an initial simulator, and some may be highly ambitious. Nevertheless, we shall discuss our conceptual architecture with reference to these requirements.</p>
        <sec id="sec-4-5-1">
          <title>4.3. Architecture</title>
          <p>• Mental model. To capture how a particular user thinks about a given CIA system (R5), mental models need to be developed. The thinking-aloud method is commonly used for such purposes in usability testing, psychology, and the social sciences [49]. There is work in HCI on identifying and analyzing experiences and barriers qualitatively [50, 51, 52, 53]. A main difference from those studies is that the goal here is to build a quantifiable mental model that represents the user’s expectations and perceived capabilities of a CIA system.</p>
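A quantifiable mental model of the kind called for above might, for instance, track a per-capability success estimate that is updated after each interaction. This is a hypothetical sketch, not a model from the cited studies:

```python
class MentalModel:
    """Tracks the simulated user's belief that the system can execute a given
    type of action (e.g., voice navigation), as a smoothed success rate."""

    def __init__(self):
        self.successes = {}  # capability -> number of successful attempts
        self.attempts = {}   # capability -> total attempts

    def update(self, capability: str, success: bool):
        self.attempts[capability] = self.attempts.get(capability, 0) + 1
        self.successes[capability] = self.successes.get(capability, 0) + int(success)

    def expected_success(self, capability: str, prior: float = 0.5) -> float:
        """Smoothed estimate; unseen capabilities fall back to the prior."""
        n = self.attempts.get(capability, 0)
        return (self.successes.get(capability, 0) + prior) / (n + 1)


m = MentalModel()
m.update("voice-navigation", False)
m.update("voice-navigation", False)
# After repeated failures, the simulated user would prefer clicking instead.
print(round(m.expected_success("voice-navigation"), 2))  # 0.17
```

The estimate feeds the execution stage described in Sect. 4.3: when the expected success of one modality drops, the simulated user switches to another.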
          <p>(… specific and is shared by all simulated users.) To make the user model realistic, it should be anchored in actual user profiles (while maintaining k-anonymity). For that, a generative model may be used, with parameters learned on publicly available corpora, e.g., item ratings for recommendation scenarios [46] and discussion fora for information seeking tasks. The mental model may be initialized using a small set of pre-trained skill profiles, created as part of laboratory user studies.</p>
          <p>Next, we describe the components responsible for interacting with CIA systems.</p>
          <p>• Natural language understanding. Obtaining a structured representation from a system utterance is analogous to NLU in dialogue systems and involves domain classification, intent determination, and slot filling [7]. These tasks are effectively tackled by neural architectures [54, 55, 1]. These approaches, however, are created for conversational systems and assume “perfect” world knowledge, based on some underlying knowledge repository. For user simulation, they need to be adapted to consider personal knowledge. For example, the user may or may not be able to guess the corresponding type or category of an entity/concept that is mentioned for the first time, depending on their knowledge of the given domain.</p>
          <p>• Response generation. Determining how a simulated user should respond to a system utterance is modeled in three stages: planning, execution, and learning. In the planning stage, a structured representation of an information need (what to ask the system) or user response (how to respond if prompted by the system) is generated. This is informed by the user model, in terms of interests and preferences, as well as the interaction model, to help interpret what the system is asking in terms of a task-specific dialog flow. In the execution stage, the simulator decides on the course of execution, based on the user’s mental model of the given system’s capabilities (e.g., it will not attempt to navigate a list using voice, but rather click, if voice navigation did not function in the past as expected). Based on how the system responds to a given user utterance, the learner module can make updates to the user model (whether the user learned something new about a given topic) and also to the mental model of the system (how successful it was in understanding/executing what was requested). Response generation can be framed within the well-established agenda-based simulation approach [56].</p>
          <p>• Natural language generation. Finally, a structured intent representation (what to say to the system) needs to be turned into a natural language utterance (how to say it). The exact articulation is influenced by the persona and knowledge level of the simulated user. A possible starting point is to generate templated responses and then apply transfer learning for text [57, 58, 59]. Later, more end-to-end approaches may also be de-</p>
          <p>From a system architecture perspective, the user simulator in many regards resembles a CIA system, comprising natural language understanding, dialog management, and natural language generation components. One major difference is that CIA systems may be assumed (in fact, expected) to have “perfect world knowledge,” only limited by the availability of data. Conversely, user simulation also needs to consider the user’s knowledge level in language understanding and generation. Another major difference is that while a CIA system is modeled after a single person, each simulated user has a unique persona. This requires each of the components to be parametrizable with respect to personal characteristics. Further, the choice of dialogue actions is affected by the user’s mental model of the system (i.e., what the system is perceived to be able to understand and execute).</p>
          <p>4.4. Operationalization
Note that simulation capability is application specific. That is, different simulators would need to be trained for item recommendation, interactive QA, and, ultimately, for scenarios that cater for multiple user goals. To ensure that the behavior of simulated users aligns with that of human users, an operational CIA system with actual users would also be needed for each application. Setting up such applications should be seen as a community effort. Indeed, discussions in this direction have already begun, and one specific proposal for a CIA system supporting scholarly activities has been outlined in [44]. There are a number of challenges involved in building a CIA system that can serve as such a living lab. One is that it would have insufficient traffic for meaningful online evaluation (an issue that has indeed been encountered in the past [38]). To remedy that, additional users may be recruited, e.g., by involving students as part of their course work or hiring workers on crowdsourcing platforms (i.e., increasing traffic volume). Another potential difficulty is that building a sufficiently performant CIA system for
the application at hand turns out to be too challenging
vised, eliminating the need for manual template gen- (thereby making the online service unattractive to users).
eration. It should be noted that not all requests get While this is not easily solvable on the system front, it is
passed through NLG, as the executor may decide to possible to manage users’ expectations. Indeed, one of
use a diferent modality. the key ideas behind operating in the academic domain
Each simulated user requires instantiating the user and in [44] is to build a tool by researchers to researchers,
mental models. (The interaction model is application- and embrace its imperfection.</p>
          <p>Simulation approaches are evaluated by comparing them against real users on a given live research platform. In practice this means that a small portion of the usage data collected from humans (i.e., the first few weeks of the live evaluation period) is disclosed and can be used for training the simulators, while the remaining data is used for evaluating them. The systems participating in the live evaluation (referred to as experimental systems) are also evaluated using the different simulators. Ultimately, the question we seek to answer is whether we can observe the same relative ranking of experimental systems with real users (based on the live experiment) as with simulated ones; being able to answer this question positively would mean that the simulator is sufficiently realistic.</p>
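          <p>This validation protocol amounts to a rank-correlation test: rank the experimental systems by some target metric as measured with real users, rank them again as measured with the simulator, and check that the two orderings agree. A minimal sketch, using Kendall’s tau as the agreement measure (the system names and scores below are hypothetical):</p>
          <preformat>
```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two score dicts over the same systems."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s1, s2 in combinations(systems, 2):
        # Same sign of the score difference under both rankings = concordant pair.
        d = (scores_a[s1] - scores_a[s2]) * (scores_b[s1] - scores_b[s2])
        if d > 0:
            concordant += 1
        elif d != 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical results: mean task success per experimental system, measured
# once with real users (live experiment) and once with simulated users.
real = {"sysA": 0.62, "sysB": 0.55, "sysC": 0.48, "sysD": 0.41}
simulated = {"sysA": 0.70, "sysB": 0.58, "sysC": 0.60, "sysD": 0.45}

print(round(kendall_tau(real, simulated), 2))  # 0.67: one swapped pair (sysB, sysC)
```
          </preformat>
          <p>A tau close to 1 indicates that the simulator preserves the relative ranking of systems observed with real users; in practice, the correlation would also be tested for statistical significance.</p>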
          <p>5. Conclusions and Future Directions</p>
          <p>In this paper, we have considered conversational AI from an IR perspective, and focused in particular on the problem of conversational information access, with the goal of identifying open challenges that the IR community is uniquely suited to address.</p>
          <p>One critical area concerns the understanding of users’ information needs and their information seeking behavior, one of the fundamental research directions in IR from the very beginning [60]. Currently, there is a lack of understanding of what would be desirable conversational experiences for information access scenarios that combine multiple user goals. Consequently, there are no suitable models of user behavior that could serve as foundations for unified architectures that can support such behavior.</p>
          <p>Another aspect that represents a major open challenge is evaluation. Measurement is an area where IR has an unparalleled history [61, 62, 63, 64, 33, 35]. Building on the rich tradition and experience of community benchmarking campaigns such as TREC [62] and CLEF [63], our community is in a unique position to take the lead on the development of novel evaluation paradigms and methodologies. This paper has outlined a specific plan for such an effort, centered on user simulation.</p>
          <p>References</p>
          <p>[1] J. Gao, M. Galley, L. Li, Neural approaches to conversational AI, Found. Trends Inf. Retr. 13 (2019) 127–298.
[2] M. F. McTear, Spoken dialogue technology: Enabling the conversational user interface, ACM Comput. Surv. 34 (2002) 90–169.
[3] J. Allan, B. Croft, A. Moffat, M. Sanderson, Frontiers, challenges, and opportunities for information retrieval: Report from SWIRL 2012, the second strategic workshop on information retrieval in Lorne, SIGIR Forum 46 (2012) 2–32.
[4] J. S. Culpepper, F. Diaz, M. D. Smucker, Research frontiers in information retrieval: Report from the third strategic workshop on information retrieval in Lorne (SWIRL 2018), SIGIR Forum 52 (2018) 34–90.
[5] W. B. Croft, The importance of interaction for information retrieval, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, 2019, pp. 1–2.
[6] J. Deriu, A. Rodrigo, A. Otegi, G. Echegoyen, S. Rosset, E. Agirre, M. Cieliebak, Survey on evaluation methods for dialogue systems, Artificial Intelligence Review (2020) 1573–7462.
[7] D. Jurafsky, J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd Edition draft, Prentice Hall, Pearson Education International, 2019.
[8] H. Chen, X. Liu, D. Yin, J. Tang, A survey on dialogue systems: Recent advances and new frontiers, SIGKDD Explor. Newsl. 19 (2017) 25–35.
[9] I. V. Serban, R. Lowe, P. Henderson, L. Charlin, J. Pineau, A survey of available corpora for building data-driven dialogue systems: The journal version, Dialogue &amp; Discourse 9 (2018) 1–49.
[10] L. Zhou, J. Gao, D. Li, H.-Y. Shum, The design and implementation of XiaoIce, an empathetic social chatbot, Comput. Linguist. 46 (2020) 53–93.
[11] I. Szpektor, D. Cohen, G. Elidan, M. Fink, A. Hassidim, O. Keller, S. Kulkarni, E. Ofek, S. Pudinsky, A. Revach, S. Salant, Y. Matias, Dynamic composition for conversational domain exploration, in: Proceedings of The Web Conference 2020, WWW ’20, 2020, pp. 872–883.
[12] F. Radlinski, N. Craswell, A theoretical framework for conversational search, in: Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, CHIIR ’17, 2017, pp. 117–126.
[13] Z. Yan, N. Duan, P. Chen, M. Zhou, J. Zhou, Z. Li, Building task-oriented dialogue systems for online shopping, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI ’17, 2017, pp. 4618–4625.
[14] J. R. Trippas, Spoken Conversational Search: Audio-only Interactive Information Retrieval, Ph.D. thesis, RMIT University, 2019.
[15] Y. Deldjoo, J. R. Trippas, H. Zamani, Towards multimodal conversational information seeking, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, 2021, pp. 1577–1587.
[16] C. Qu, L. Yang, M. Qiu, W. B. Croft, Y. Zhang, M. Iyyer, BERT with history answer embedding for conversational question answering, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, 2019, pp. 1133–1136.
[17] S. Reddy, D. Chen, C. D. Manning, CoQA: A conversational question answering challenge, Transactions of the Association for Computational Linguistics 7 (2019) 249–266.
[18] Y. Zhang, X. Chen, Q. Ai, L. Yang, W. B. Croft, Towards conversational search and recommendation: System ask, user respond, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM ’18, 2018, pp. 177–186.
[19] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, 2020. arXiv:2004.00646.
[20] C. Gao, W. Lei, X. He, M. de Rijke, T. Chua, Advances and challenges in conversational recommender systems: A survey, 2021. arXiv:2101.09459.
[21] C. Zhu, M. Zeng, X. Huang, SDNet: Contextualized attention-based deep network for conversational question answering, 2018. arXiv:1812.03593.
[22] J. Dalton, C. Xiong, J. Callan, TREC CAsT 2019: The Conversational Assistance Track overview, 2020. arXiv:2003.13624.
[23] L. Yang, M. Qiu, C. Qu, J. Guo, Y. Zhang, W. B. Croft, J. Huang, H. Chen, Response ranking with deep matching networks and external knowledge in information-seeking conversation systems, in: The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’18, 2018, pp. 245–254.
[24] Y. Song, C.-T. Li, J.-Y. Nie, M. Zhang, D. Zhao, R. Yan, An ensemble of retrieval-based and generation-based human-computer conversation systems, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI ’18, 2018, pp. 4382–4388.
[25] N. Voskarides, D. Li, P. Ren, E. Kanoulas, M. de Rijke, Query resolution for conversational search with limited supervision, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, 2020, pp. 921–930.
[26] S. Vakulenko, N. Voskarides, Z. Tu, S. Longpre, A comparison of question rewriting methods for conversational passage retrieval, in: Proceedings of the 43rd European Conference on IR Research, ECIR ’21, 2021, pp. 418–424.
[27] M. Aliannejadi, H. Zamani, F. Crestani, W. B. Croft, Asking clarifying questions in open-domain information-seeking conversations, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, 2019, pp. 475–484.
[28] C. Rosset, C. Xiong, X. Song, D. Campos, N. Craswell, S. Tiwary, P. Bennett, Leading conversational search by suggesting useful questions, in: Proceedings of The Web Conference 2020, WWW ’20, 2020, pp. 1160–1170.
[29] C. Qu, L. Yang, W. B. Croft, Y. Zhang, J. R. Trippas, M. Qiu, User intent prediction in information-seeking conversations, in: Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, CHIIR ’19, 2019, pp. 25–33.
[30] K. Christakopoulou, F. Radlinski, K. Hofmann, Towards conversational recommender systems, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, 2016, pp. 815–824.
[31] P. Ren, Z. Chen, Z. Ren, E. Kanoulas, C. Monz, M. de Rijke, Conversations with search engines, 2020. arXiv:2004.14162.
[32] S. Zhang, Z. Dai, K. Balog, J. Callan, Summarizing and exploring tabular data in conversational search, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, 2020, pp. 1537–1540.
[33] M. Sanderson, Test collection based evaluation of information retrieval systems, Found. Trends Inf. Retr. 4 (2010) 247–375.
[34] C.-W. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, J. Pineau, How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP ’16, 2016, pp. 2122–2132.
[35] K. Hofmann, L. Li, F. Radlinski, Online evaluation for information retrieval, Found. Trends Inf. Retr. 10 (2016) 1–117.
[36] A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn, B. Hedayatnia, M. Cheng, A. Nagar, E. King, K. Bland, A. Wartick, Y. Pan, H. Song, S. Jayadevan, G. Hwang, A. Pettigrue, Conversational AI: The science behind the Alexa Prize, 2018. arXiv:1801.03604.
[37] A. Schuth, K. Balog, Living labs for online evaluation: From theory to practice, in: Proceedings of the 38th European Conference on Advances in Information Retrieval, ECIR ’16, 2016, pp. 893–896.
[38] R. Jagerman, K. Balog, M. de Rijke, OpenSearch: Lessons learned from an online evaluation campaign, J. Data and Information Quality 10 (2018).
[62] E. M. Voorhees, D. K. Harman (Eds.), TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing), The MIT Press, 2005.
[63] N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF, volume 41 of The Information Retrieval Series, Springer, 2019.
[64] D. Kelly, Methods for evaluating interactive information retrieval systems with users, Found. Trends Inf. Retr. 3 (2009) 1–224.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>paign</surname>
          </string-name>
          ,
          <source>J. Data and Information Quality</source>
          <volume>10</volume>
          (
          <year>2018</year>
          ).
          <source>Speech Commun</source>
          .
          <volume>50</volume>
          (
          <year>2008</year>
          )
          <fpage>630</fpage>
          -
          <lpage>645</lpage>
          . [39]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          , L. Kelly, [51]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Cowan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pantidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Coyle</surname>
          </string-name>
          , K. Morrissey,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Information Retrieval Evaluation in a Changing ceedings of the 19th International Conference on</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>World - Lessons Learned</surname>
          </string-name>
          from 20 Years
          <string-name>
            <surname>of</surname>
            <given-names>CLEF</given-names>
          </string-name>
          , Human-Computer Interaction with Mobile Devices
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Springer</surname>
          </string-name>
          ,
          <year>2019</year>
          . and Services,
          <source>MobileHCI '17</source>
          ,
          <year>2017</year>
          . [40]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schatzmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Weilhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stuttle</surname>
          </string-name>
          , S. Young, [52]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Cowan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Branigan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Begum</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. McKenna</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>ment</surname>
            <given-names>strategies</given-names>
          </string-name>
          ,
          <source>Knowl. Eng. Rev</source>
          .
          <volume>21</volume>
          (
          <year>2006</year>
          )
          <fpage>97</fpage>
          -
          <lpage>126</lpage>
          . partners,
          <source>in: Proceedings of the 39th Annual Meet</source>
          [41]
          <string-name>
            <given-names>O.</given-names>
            <surname>Pietquin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hastie</surname>
          </string-name>
          ,
          <article-title>A survey on metrics for the ing of the Cognitive Science Society</article-title>
          , CogSci '
          <volume>17</volume>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>evaluation of user simulations</article-title>
          ,
          <source>Knowl. Eng. Rev. 28</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          (
          <year>2013</year>
          ). [53]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Forlizzi</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. I. Hong</surname>
          </string-name>
          , “Hey Alexa, [42]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vakulenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Revoredo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Di Ciccio</surname>
          </string-name>
          , M. de Rijke,
          <article-title>what's up?”: A mixed-methods studies of in-home</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          dialogues,
          <source>in: Proceedings of the 41st European the 2018 Designing Interactive Systems Conference,</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>Conference on IR Research</source>
          , ECIR '
          <volume>19</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>541</fpage>
          -
          <lpage>DIS</lpage>
          '
          <fpage>18</fpage>
          ,
          <year>2018</year>
          , pp.
          <fpage>857</fpage>
          -
          <lpage>868</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          557. [54]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mesnil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , L. Deng, [43]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Trippas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cavedon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hakkani-Tur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Heck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          , et al.,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>ceedings of the 2017 Conference on Conference Hu- Audio, Speech and Lang. Proc. 23</source>
          (
          <year>2015</year>
          )
          <fpage>530</fpage>
          -
          <lpage>539</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>man Information Interaction and Retrieval</source>
          , CHIIR [55]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Lane</surname>
          </string-name>
          ,
          <article-title>Attention-based recurrent neural</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>'17</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>325</fpage>
          -
          <lpage>328</lpage>
          .
          <article-title>network models for joint intent detection</article-title>
          and slot [44]
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Flekova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Pot- iflling</article-title>
          ,
          <source>in: Interspeech</source>
          <year>2016</year>
          ,
          <year>2016</year>
          , pp.
          <fpage>685</fpage>
          -
          <lpage>689</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>thast</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Radlinski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Sanderson</surname>
            , S. Vakulenko, [56]
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Schatzmann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Thomson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Weilhammer</surname>
          </string-name>
          , H. Ye,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>prototype: Scholarly conversational assistant</source>
          ,
          <year>2020</year>
          .
          <article-title>strapping a POMDP dialogue system</article-title>
          ,
          <source>in: Human</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>arXiv:2001.06910. Language Technologies</source>
          <year>2007</year>
          : The Conference of [45]
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          , T. Kenter,
          <article-title>Personal knowledge graphs: the North American Chapter of the Association</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>A research agenda</article-title>
          ,
          <source>in: Proceedings of the 2019 for Computational Linguistics;</source>
          Companion Volume,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>ACM SIGIR International Conference on Theory of Short Papers</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>149</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Information</surname>
            <given-names>Retrieval</given-names>
          </string-name>
          ,
          <source>ICTIR '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>220</lpage>
          . [57]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          , T. Berg[46]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , K. Balog,
          <article-title>Evaluating conversational rec- Kirkpatrick, Unsupervised text style transfer using</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>ceedings of the 26th ACM SIGKDD International in Neural Information Processing Systems</source>
          <volume>31</volume>
          , NIPS
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>Conference on Knowledge Discovery &amp; Data Min- '18</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>7287</fpage>
          -
          <lpage>7298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>ing</surname>
          </string-name>
          ,
          <source>KDD '20</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1512</fpage>
          -
          <lpage>1520</lpage>
          . [58]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yan</surname>
          </string-name>
          , Style transfer [47]
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dubiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halvey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalton</surname>
          </string-name>
          ,
          <article-title>Con- in text: Exploration and evaluation</article-title>
          , in: Proceedings
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name><given-names>L.</given-names> <surname>Azzopardi</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Dubiel</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Halvey</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Dalton</surname></string-name>
          ,
          <article-title>Conceptualizing agent-human interactions during the conversational search process</article-title>
          , in:
          <source>Proceedings of the 2nd International Workshop on Conversational Approaches to Information Retrieval, CAIR '18</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name><given-names>T.</given-names> <surname>Shen</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Lei</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Barzilay</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Jaakkola</surname></string-name>
          ,
          <article-title>Style transfer from non-parallel text by cross-alignment</article-title>
          , in:
          <source>Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>6833</fpage>
          -
          <lpage>6844</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name><given-names>J. F.</given-names> <surname>Kelley</surname></string-name>
          ,
          <article-title>An iterative design methodology for user-friendly natural language office information applications</article-title>
          ,
          <source>ACM Trans. Inf. Syst.</source>
          <volume>2</volume>
          (
          <year>1984</year>
          )
          <fpage>26</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name><given-names>C.</given-names> <surname>Lewis</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Rieman</surname></string-name>
          ,
          <source>Task-Centered User Interface Design: A Practical Introduction</source>
          , University of Colorado, Boulder, Department of Computer Science,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name><given-names>T.</given-names> <surname>Wilson</surname></string-name>
          ,
          <article-title>Information needs and uses: Fifty years of progress</article-title>
          , in:
          <source>Fifty Years of Information Progress: A Journal of Documentation Review</source>
          ,
          <year>1994</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name><given-names>D.</given-names> <surname>Ellis</surname></string-name>
          ,
          <article-title>The dilemma of measurement in information retrieval research</article-title>
          ,
          <source>J. Am. Soc. Inf. Sci.</source>
          <volume>47</volume>
          (
          <year>1996</year>
          )
          <fpage>23</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name><given-names>J.</given-names> <surname>Edlund</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Gustafson</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Heldner</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Hjalmarsson</surname></string-name>
          ,
          <article-title>Towards human-like spoken dialogue systems</article-title>
          ,
          <source>Speech Commun.</source>
          <volume>50</volume>
          (
          <year>2008</year>
          )
          <fpage>630</fpage>
          -
          <lpage>645</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name><given-names>E. M.</given-names> <surname>Voorhees</surname></string-name>
          ,
          <string-name><given-names>D. K.</given-names> <surname>Harman</surname></string-name>
          ,
          <source>TREC: Experiment and Evaluation in Information Retrieval</source>
          , MIT Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>