Conversational AI from an Information Retrieval
Perspective: Remaining Challenges and a Case for User
Simulation
Krisztian Balog
University of Stavanger, Norway


Abstract
Conversational AI is an emerging field of computer science that engages multiple research communities, from information retrieval to natural language processing to dialogue systems. Within this vast space, we focus on conversational information access, a problem that is uniquely suited to be addressed by the information retrieval community. We argue that despite the significant research activity in this area, progress is mostly limited to component-level improvements. There remains a disconnect between current efforts and truly conversational information access systems. Apart from the inherently challenging nature of the problem, the lack of progress, in large part, can be attributed to the shortage of appropriate evaluation methodology and resources. This paper highlights challenges that render both offline and online evaluation methodologies unsuitable for this problem, and discusses the use of user simulation as a viable solution.

Keywords
Conversational information access, conversational AI, user simulation



DESIRES 2021 – 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15–18, 2021, Padua, Italy
Email: krisztian.balog@uis.no (K. Balog)
URL: https://krisztianbalog.com/ (K. Balog)
ORCID: 0000-0003-2762-721X (K. Balog)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


1. Introduction

Conversational AI may be seen as the holy grail of computer science: building machines that are capable of interacting with people in a human-like way. With rapid advances in AI technology, there are reasons to believe that such an ambition is within reach [1]. Conversational AI is a vast and complex problem, which requires a combination of methods, tools, and techniques from multiple fields of computer science, including but not limited to artificial intelligence (AI), natural language processing (NLP), machine learning (ML), dialogue systems (DS), recommender systems (RecSys), human-computer interaction (HCI), and not least information retrieval (IR). Each of these fields may have its own particular interpretation of what conversational AI should entail and a specific focus on certain of the research challenges involved. For example, in spoken dialogue systems the main goal is to be able to talk to machines, i.e., to develop speech-based human-computer interfaces [2], and thus automatic speech recognition is a central component. Many other communities, on the other hand, assume a chat-based interface, where voice is not among the supported modalities. At the same time, there are many shared aspects, including handling the semantics involved in the dialogue process, generating contextually appropriate responses, and developing effective end-to-end (neural) architectures, which engage multiple research communities. In this paper, we focus on the problem of conversational information access (CIA), one that the IR community is uniquely suited to address.

Conversational search, or conversational information seeking, was identified as a research direction of strategic importance in IR as early as 2012 [3], and its significance was reiterated in 2018 [4]. There, the problem focus has been defined to include complex user goals that require multi-step information seeking, exploratory information gathering, and multi-step task completion and recommendation, as well as dialogue settings with variable communication channels. Our analysis of recent work, however, leads us to the observation that current efforts do not seem to be fully aligned with the directions set out there. In terms of end-to-end tasks, there are two main threads of work: conversational QA and conversational recommendation. Currently, these are treated as two separate types of systems, with different goals, architectures, and evaluation criteria. Instead, for more effective assistance of users, the two should be seamlessly integrated in CIA systems, thereby moving from a siloed to a more unified view. Additionally, the multi-modality of interactions needs to be more fully embraced, in order to more actively support effective interaction [5]. On the component level, most proposed techniques are not truly conversational, in the sense that they are applicable to any interactive IR system (e.g., modern web search engines). A critical blocker to progress, on both the end-to-end and component levels, is the shortage of appropriate evaluation methodology and resources.
Table 1
Categorization of conversational AI systems, based on [1, 6].

Task-oriented: Aim to assist users in solving a specific task (as efficiently as possible). Dialogues follow a clearly designed structure (flow) that is developed for a particular task in a closed domain. Performance is measured by a well-defined criterion that is explicitly related to task completion.

Social chat: Aim to carry on an extended conversation ("chit-chat") with the goal of mimicking human-human interactions. Developed for unstructured, open-domain conversations. The objective is to be human-like, i.e., able to talk about different topics (breadth and depth) in an engaging and coherent manner.

Interactive QA: Aim to provide concise, direct answers to user queries. Dialogues are unstructured, but commonly follow a question-answer pattern; mostly open domain (dictated by the underlying data). Evaluated with respect to the correctness of answers (on the turn level).



In summary, the contributions of this paper are threefold.

• We argue for a broader interpretation of conversational information access, one that embraces multiple user goals (mixing task-oriented and QA elements) and multi-modal interactions (Sect. 2).

• We provide a synthesis of progress on conversational information access and identify open challenges around methods and evaluation (Sect. 3).

• We argue for (a more extensive) use of simulation as a viable evaluation paradigm for conversational information access and describe a simulator architecture (Sect. 4).


2. Defining Conversational Information Access

This section defines conversational information access and places it in the broader context of conversational AI.

2.1. Conversational AI: The big picture

Conversational AI¹ is casually used to denote a broad range of systems that are capable of (some degree of) natural language understanding and of responding in a way that mimics human dialogue. A conversational AI system may thus be considered successful if it offers an experience that is indistinguishable from what could have been delivered by a human. These systems often focus on a particular type of conversational support, naturally lending themselves to categorization.

2.1.1. Traditional 2-way Categorization

Traditionally, conversational agents are categorized as either goal-driven (or task-oriented) or non-goal-driven (also known as chatbots) [8, 9, 7]. Goal-driven systems aim to assist users in completing some specific task. Dialogues are constrained to a specific domain and characterized by a designated structure, designed for particular tasks within that domain. The main success criterion for the conversational agent is its ability to help the user solve their task as efficiently as possible. Typical examples include travel planning and appointment scheduling.

Non-goal-driven systems, on the other hand, aim to carry on an extended conversation ("chit-chat") with the goal of mimicking unstructured human-human interactions. The main purpose of these systems is usually entertainment or providing an "AI companion" [10]. Therefore, the objective for these systems is to be able to talk about different topics in an engaging and cohesive manner.

2.1.2. Contemporary 3-way Categorization

Most recently, the traditional categorization has been extended with a third category, interactive question answering (QA) [1, 6], in recognition of the fact that it fits neither task-oriented nor social chat, but deserves a separate category in its own right. Interactive QA systems are designed to provide answers to specific questions. They are not characterized by a rigid dialogue flow, although they typically follow a question-answer pattern. Apart from some notable recent examples [11], the human-like conversation aspect of QA systems is much less pronounced than for the other two types of systems, and evaluation is restricted to answer correctness.

Table 1 summarizes the characteristics of the three categories of conversational AI systems. Given their unique goals and objectives, each of these problem categories is addressed by a distinctive system architecture [1, 6].

¹ In this paper, the terms conversational AI, conversational agent, and dialogue system are used interchangeably. We, however, avoid using the term chatbot, which has a different meaning in industrial and academic contexts; in the former case it refers to a task-oriented system, while in the latter it means a non-task-oriented system [7].
  User: I want to buy new running shoes.
  AI:   My records say that you have been using a Nike Pegasus 33 before. How did you like that?
  User: I liked it a lot on tarmac, but my feet often hurt a bit on very long asphalt runs.
  AI:   Here are some alternatives for you. Of these, the ASICS Gel Nimbus 23 is especially renowned for its cushioned midsole.
  User: What is the midsole?
  AI:   The midsole is the bed of foam that lies between your foot and the ground. This is the part responsible for the shoe feeling soft or hard.

Figure 1: Envisioned dialogue with a CIA system.

2.2. Conversational Information Access

Building on [4], we use the term conversational information access (CIA) to define a subset of conversational AI systems that specifically aim at a task-oriented sequence of exchanges to support multiple user goals, including search, recommendation, and exploratory information gathering, that require multi-step interactions over possibly multiple modalities. Further, these systems are expected to learn user preferences, personalize responses accordingly, and be capable of taking initiative.

Consider the conversation shown in Fig. 1, illustrating some of the above requirements. It is primarily a task-oriented dialogue (the user wanting to buy new running shoes), which requires an exploration of the item space. Assuming a chat-based interface, this can be done most effectively by combining multiple modalities; not just text, but also, in this example, a carousel for cycling through items. Up until the second user utterance, it is a strictly task-oriented sequence of exchanges (cf. the task-oriented category in Table 1). But then the third user utterance breaks out of the task flow and switches to "QA mode" (cf. interactive QA in Table 1).

2.2.1. From Siloed to Unified View

One key realization the above example is meant to illustrate is that conversational information access cuts across the task-oriented and interactive QA categories. This blending makes CIA suited to assist users meaningfully with their needs. Conversely, existing work—and, importantly, evaluation initiatives—in IR almost exclusively focus on a question-answering paradigm (see Sect. 3). This does not allow for interaction with sets of items—one of the main properties that makes a search system conversational [12].

It has been shown that the "siloed" view, represented by the three categories in Table 1, in practice does not align well with users' information needs and behavior [13]. Gao et al. [1] acknowledge the need for a "top-level bot" that would act as a broker and switch between different user goals. Most commercial assistants are hybrid systems, with different degrees of support for switching. There is, however, little published research on it. In summary, there is a need for a more holistic view where multiple user goals are supported.²

2.2.2. Multi-modality

Another key point highlighted by the example in Fig. 1 is the need for embracing multi-modality. Text-only responses are motivated by an audio-only channel, without a screen [14]. However, more often than not a chat-based interface is available, which allows for a richer set of input controls and navigational components. These, in turn, would enable CIA systems to more actively support effective interaction [5]. We note that the need for multi-modality has been recognized independently by other scholars as well [15].

² We note that this problem is not specific to IR. However, conversational information access is a good starting point that the IR community is uniquely suited to address. Lessons and findings could then be generalized to broader applications.

3. Progress to Date and Remaining Challenges

In this section, we reflect on progress achieved so far, organized around methods and evaluation, and identify remaining challenges.

3.1. Methods

In our discussion, we distinguish between end-to-end conversational tasks and specific component-level sub-tasks.

3.1.1. End-to-end tasks

There are two main tasks that have received attention: conversational QA [16, 1, 17] and conversational recommendation [18, 19, 20]. What distinguishes conversational QA from traditional single-turn QA is the need for contextual understanding. Hence, much of the research revolves around modeling conversation history [16, 17, 21]. However, in terms of evaluation, the
problem is simplified to a single-turn passage retrieval task, where the relevance of the system response at a given turn does not consider the responses given by the system at earlier turns [22, 17]. It is only in conversational recommender systems that the multi-turn nature of conversations is more fully embraced [18, 19, 20].

3.1.2. Component-level sub-tasks

Recently, progress has been made on specific sub-tasks for CIA, including response retrieval [23] and generation [24], query resolution [25, 26], asking clarifying questions [27] or suggesting questions [28], predicting user intent [29], and preference elicitation [18, 30]. Each of these studies makes the point that the conversational setting calls for a different set of approaches. However, most of these sub-tasks are applicable in any interactive IR context, adhering to the stance that search is inherently a conversational experience: it is a dialogue between a human and a search engine [31]. From this perspective, there has been substantial progress, especially on the mixed-initiative aspect, e.g., question clarifications and suggestions [27, 28]. Alternatively, one may take a more critical stance and ask: What separates conversational information access from any other interactive IR system (most prominently: search engines)? According to Croft [5], the key distinguishing factor is that a conversational system is a more active partner in the interaction. In that regard, there is surprisingly little work, with only a handful of notable exceptions [32, 11].

3.2. Evaluation

3.2.1. Offline Evaluation

Traditionally, system-oriented evaluation in IR has been performed using offline test collections, following the Cranfield paradigm [33]. This rigorous methodology ensures the repeatability and reproducibility of experiments, and has been instrumental to progress in the field. To date, work on CIA still employs offline evaluation [22, 27], but this has severe limitations. First, reusability requires that the system be limited to selecting the best response, in answer to a user utterance, from a restricted set of possible candidates (i.e., some predefined corpus of responses). Second, it is limited in scope to a single conversation turn and does not consider the dialogue history that led to that particular user utterance (cf. red area in Fig. 2). An alternative is to let human evaluators assess an entire conversation, once it has taken place [6]. However, this covers a single path (see blue area in Fig. 2), without considering the other choices the user could have taken during the course of the dialogue. Moreover, it is expensive, time-consuming, and does not scale. Most importantly, it would not yield a reusable test collection. In summary, offline test collections have their merits, but their use is limited to the purpose of evaluating specific components, in isolation. Further, the choice of evaluation metrics is an open challenge [34].

[Figure 2 (tree of alternating system (S) and user (U) turns) omitted.]
Figure 2: The space of possible dialogue states increases exponentially with the number of turns between system (S) and user (U). Evaluation is currently limited to either a single path (blue area) or a single turn (red area).
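To give a rough sense of the scale illustrated in Fig. 2, the following back-of-the-envelope sketch counts dialogue paths under assumed branching factors; the factors of 5 and 4 are ours, purely for illustration, and are not taken from the paper.

```python
def dialogue_states(k_system: int, m_user: int, turns: int) -> int:
    """Number of distinct dialogue paths after `turns` exchange pairs,
    assuming k_system candidate responses per system turn and m_user
    plausible reactions per user turn (illustrative branching factors)."""
    return (k_system * m_user) ** turns


# Even tiny branching factors explode: a single annotated path (human
# evaluation of one whole conversation) or a single turn (offline test
# collections) covers a vanishing fraction of this space.
print(dialogue_states(k_system=5, m_user=4, turns=6))  # 64,000,000
```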
3.2.2. Online Evaluation

Online evaluation involves fielding an IR system to real users, and observing how they interact with the system in situ, in their natural task environments [35]. This requires a live service as a research platform. Currently, this possibility is only available to researchers working at major service providers that develop conversational assistants (Google, Microsoft, Apple, Amazon). Even there, experimentation with live users is severely limited due to scalability, quality, and ethical concerns. Of these companies, only Amazon has decided to open up its platform for academic research, by organizing the Alexa Prize Challenge [36]. It represents a unique opportunity for academics to perform research with a live system used by millions of users, and provides university teams with real user conversational data at scale. While this effort points in the right direction, it is inherently limited in that it addresses social conversations ("chit-chat"), with the target goal of conversing coherently and engagingly with humans on popular topics such as sports, politics, or technology for 20 minutes. This is a non-goal-driven task, which is rather different from goal-driven CIA. Currently, there is no publicly available research platform for CIA. Living labs represent a novel evaluation paradigm for IR [37], which allows researchers to evaluate their methods with real users of live search services. This methodology has been successfully employed in world-wide benchmarking campaigns [38, 39]. It, however, needs to be extended to a conversational setting, which brings about methodological and practical challenges.

3.2.3. User Simulation

With a long history in the field of spoken dialogue systems, user simulation is seen as a critical tool for automatic dialogue management design [40]. The idea is to train a user model that is "capable of producing responses that a real user might have given in a certain dialog situation" [40]. This is in line with our goals, but there are two crucial differences. First, the primary purpose of user simulation in DS is to generate synthetic training data at scale, which in turn can be used to learn dialogue strategies (typically, using reinforcement learning). Assessment of the quality of simulated dialogues and user simulation methods, however, is an open issue [41]. Second, dialogue systems, as well as recent work on conversational recommender systems [18], are focused on supporting the user with a single goal that can be fulfilled by eliciting preferences on a set of attributes. CIA systems, on the other hand, need to deal with complex search and recommendation scenarios. This requires a more holistic user model.

3.3. Summary and Remaining Challenges

3.3.1. Understanding User Needs and Behavior

Current characterizations of information seeking behavior for CIA are limited either in the set of actions considered [42] or in the sequences of conversational turns [43]. To cater for the functionality defined by Radlinski and Craswell [12], and further expanded by us in Sect. 2.2, one would need user and interaction models capable of representing (1) multi-modal interactions (speech, text, pointing&clicking), (2) users' ability to change their state of knowledge (learn and forget), and (3) users' ability to learn how a system works and what its limits are (and change their expectations and behavior accordingly).

3.3.2. Truly Conversational Methods

Conversational recommendation and QA have been studied as end-to-end tasks. However, as we argued in Sect. 2.2, in practice these two are not clearly delineated applications, but rather different "modes" that should be seamlessly integrated within a CIA system. There has been significant progress on various components, which are indispensable building blocks. Integrating these into a unified system that supports multiple user goals remains an open challenge [1]. Further open questions in this space include (1) deciding when and what type of initiative a system should take, and (2) determining the best modality based on task and context.

3.3.3. Evaluation

There is a need to go beyond turn-based evaluation to multi-turn-based and eventually end-to-end evaluation. To be able to perform end-to-end evaluation of CIA systems, additional methodologies need to be considered, including online evaluation and simulated users. For online evaluation, the living labs paradigm represents an alternative, but it requires agreement on a canonical architecture in order to be able to open up individual components for experimentation. Further, it requires an existing service with live users, which is currently lacking. It should be noted that the need for such an open research platform has been identified, and a plan for the academic search domain has recently been outlined [44].

As for simulation, most existing approaches are meant to advance reinforcement learning techniques in a strictly goal-oriented setting. This is different from our purpose of evaluation. The simulation techniques that are currently used for evaluation lack the desired conversational complexity.

4. A Case for Simulation

This section presents a proposal for robust large-scale automatic evaluation of CIA systems via user simulation.

4.1. Methodology

Our main hypothesis is that it is possible to simulate human behavior with regard to interacting with CIA systems. To validate this hypothesis, we need to show that simulated users behave indistinguishably from real humans, in the context of a specific conversational application and with respect to specific evaluation measures.

Formally, let S1 and S2 denote two CIA systems, which differ in some component(s). Both systems are assumed to be operated by a set U of users from some user population. Let us assume that there is a statistically significant difference observed in their relative performance, according to some evaluation measure M, such that M(S1, U) < M(S2, U). Simulation is considered successful if, by engaging a set U* of simulated users, we observe the same relative system differences as with real users, i.e., M(S1, U*) < M(S2, U*). Further, this observation should generalize across systems S and evaluation measures M.
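To make this success criterion operational, the sketch below shows one possible way to check it, assuming per-user scores of the measure M have been collected for S1 and S2 under both user populations. The function names and the use of a one-sided paired t-test are our own illustrative choices, not part of the proposal.

```python
from typing import Sequence

from scipy import stats


def significantly_better(m_s1: Sequence[float], m_s2: Sequence[float],
                         alpha: float = 0.05) -> bool:
    """True if S2 significantly outperforms S1 on per-user scores of the
    evaluation measure M (one-sided paired t-test; the same users are
    assumed to have operated both systems)."""
    t_stat, p_two_sided = stats.ttest_rel(m_s1, m_s2)
    return t_stat < 0 and p_two_sided / 2 < alpha


def simulator_reproduces_ordering(real_s1: Sequence[float],
                                  real_s2: Sequence[float],
                                  sim_s1: Sequence[float],
                                  sim_s2: Sequence[float]) -> bool:
    """The simulator passes if the significant ordering M(S1, U) < M(S2, U)
    observed with real users U also holds for simulated users U*."""
    return (significantly_better(real_s1, real_s2) and
            significantly_better(sim_s1, sim_s2))
```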
The above formulation ensures that the behavior of simulated users aligns with that of real users. Notice that to be able to perform this validation, an operational CIA system is needed; we discuss the practical aspects of setting up such an experimental platform below, in Sect. 4.4. For the human evaluation part, i.e., measuring M(S, U), two distinct approaches may be employed: (1) asking users themselves, inside the CIA system, to give feedback on either the entire conversation or on specific system utterances, and (2) sampling interesting/meaningful branches from conversation logs, to be annotated by external human labelers (e.g., crowd workers).
Once a user simulator is created and validated against real users, it may be used for evaluating a given CIA system. It is important to note that, in principle, a given user simulator instance should be used only once, the same way that an offline test collection should only be used once—to avoid overfitting systems against a particular test suite.

4.2. Requirements

We characterize a realistic user simulator by its ability to capture:

(R1) Personal interests and preferences, and the changes of preferences over time;

(R2) Persona (personality, educational and socio-economical background, etc.);

(R3) Multi-modality of interactions (speech, text, pointing&clicking, etc.);

(R4) The user's ability to change their state of knowledge (learn and forget);

(R5) The user's ability to learn how a system works and what its limits are, and change their expectations and behavior accordingly.

We note that not all of these requirements are critical for an initial simulator, and some may be highly ambitious. Nevertheless, we shall discuss our conceptual architecture with reference to these requirements.

4.3. Architecture

[Figure 3 (architecture diagram) omitted. The simulated user comprises natural language understanding (NLU), response generation (planning, execution, and learning stages), and natural language generation (NLG) components, grounded in user, interaction, and mental models; it communicates with the conversational information access system via text/voice and pointing/clicking.]
Figure 3: Conceptual architecture of the user simulator.

Figure 3 shows the conceptual architecture of a user simulator addressing the stated requirements. We discuss its main components below, and provide specific starting points for each of them.

User, interaction, and mental models provide the foundation for simulation behavior.

• User model. To represent all personal information related to a given user, including persona (R2), preferences (R1), and knowledge (R4), personal knowledge graphs (PKGs) [45] may be used; a minimal data sketch follows this list. The reason for using a PKG is to ensure the consistency of the preferences that are revealed by the simulated user, as is done in [46]. To fully address R1 and R4, PKGs will need to be extended along two dimensions: (1) include concepts, in addition to entities, to represent the user's knowledge of specific topics, with a further distinction to be made between entities/concepts the user has heard about vs. has in-depth knowledge of; (2) capture the temporal scope, to be able to distinguish between short- vs. long-term preferences and fresh vs. diminishing knowledge.

• Interaction model. To characterize the CIA process between humans and systems for a given application, the key actions and decisions that manifest in dialogues need to be abstracted out. A starting point for a taxonomy of user/system actions is provided in [47]. This taxonomy may be revised and extended to multi-modal interactions (R3) based on conversations collected in laboratory user studies with an "idealized" CIA system using the Wizard-of-Oz approach [48] and from interaction data from actual CIA systems.

• Mental model. To capture how a particular user thinks about a given CIA system (R5), mental models need to be developed. The thinking-aloud method is commonly used for such purposes in usability testing, psychology, and the social sciences [49]. There is work in HCI on identifying and analyzing experiences and barriers qualitatively [50, 51, 52, 53]. A main difference from those studies is that the goal here is to build a quantifiable mental model that represents the user's expectations and perceived capabilities of a CIA system.
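As an illustration of how these three models might be instantiated together, the minimal sketch below (the one referenced in the user model bullet) couples a PKG-style user model, extended with concepts and temporal scope as discussed above, with a simple persona and a mental model of perceived system capabilities. All class and field names are our own assumptions; the paper does not prescribe a concrete schema.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class KnowledgeItem:
    """A PKG node: an entity or a concept known to the user."""
    name: str
    is_concept: bool = False    # concepts extend PKGs beyond entities
    in_depth: bool = False      # merely "heard about" vs. in-depth knowledge
    preference: float = 0.0     # graded preference in [-1, 1]
    long_term: bool = True      # short- vs. long-term preference
    last_updated: float = field(default_factory=time.time)  # temporal scope


@dataclass
class SimulatedUserModels:
    persona: dict[str, str]            # R2: demographics, personality traits
    pkg: dict[str, KnowledgeItem]      # R1/R4: preferences and knowledge
    mental_model: dict[str, float]     # R5: perceived reliability per capability

    def reveal_preference(self, name: str) -> float:
        """Answer preference questions from the PKG, so that revealed
        preferences stay mutually consistent across turns."""
        item = self.pkg.get(name)
        return item.preference if item else 0.0

    def learn(self, name: str, is_concept: bool = False) -> None:
        """R4: the user's knowledge state changes as the dialogue unfolds."""
        item = self.pkg.setdefault(name, KnowledgeItem(name, is_concept))
        item.last_updated = time.time()


# Usage, echoing the Fig. 1 scenario (values are made up):
alice = SimulatedUserModels(
    persona={"education": "BSc", "age_group": "25-34"},
    pkg={"Nike Pegasus 33": KnowledgeItem("Nike Pegasus 33", preference=0.6)},
    mental_model={"voice_navigation": 0.3},
)
alice.learn("midsole", is_concept=True)  # learned a new concept mid-dialogue
```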
Next, we describe the components responsible for interacting with CIA systems.

• Natural language understanding. Obtaining a structured representation from a system utterance is analogous to NLU in dialogue systems and involves domain classification, intent determination, and slot filling [7]. These tasks are effectively tackled by neural architectures [54, 55, 1]. These approaches, however, are created for conversational systems and assume "perfect" world knowledge, based on some underlying knowledge repository. For user simulation, they need to be adapted to consider personal knowledge. For example, the user may or may not be able to guess the corresponding type or category of an entity/concept that is mentioned for the first time, depending on their knowledge of the given domain.

• Response generation. Determining how a simulated user should respond to a system utterance is modeled in three stages: planning, execution, and learning (a control-flow sketch is given after this list). In the planning stage, a structured representation of an information need (what to ask the system) or user response (how to respond if prompted by the system) is generated. This is informed by the user model, in terms of interests and preferences, as well as by the interaction model, to help interpret what the system is asking in terms of a task-specific dialogue flow. In the execution stage, the simulator decides on the course of execution, based on the user's mental model of the given system's capabilities (e.g., it will not attempt to navigate a list using voice, but rather click, if voice navigation did not function as expected in the past). Based on how the system responds to a given user utterance, the learner module can make updates to the user model (whether the user learned something new about a given topic) and also to the mental model of the system (how successful it was in understanding/executing what was requested). Response generation can be framed within the well-established agenda-based simulation approach [56].

• Natural language generation. Finally, a structured intent representation (what to say to the system) needs to be turned into a natural language utterance (how to say it). The exact articulation is influenced by the persona and knowledge level of the simulated user. A possible starting point is to generate templated responses and then apply transfer learning for text [57, 58, 59]. Later, more end-to-end approaches may also be devised, eliminating the need for manual template generation. It should be noted that not all requests get passed through NLG, as the executor may decide to use a different modality.
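Putting these pieces together, a single simulated turn could be wired up as follows. This is a minimal control-flow sketch under our own naming assumptions; the paper commits only to the planning/execution/learning stages and the surrounding NLU/NLG steps, not to these interfaces or to the placeholder logic used here.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """Structured user intent: what to do, and over which modality."""
    kind: str        # e.g., "ask", "reveal_preference", "select"
    modality: str    # "text", "voice", or "click"
    payload: dict


class SimulatedUser:
    """Skeleton of the planning/execution/learning loop (cf. Fig. 3);
    method bodies are trivial placeholders."""

    def __init__(self, user_model: dict, interaction_model: dict,
                 mental_model: dict):
        self.user_model = user_model
        self.interaction_model = interaction_model
        self.mental_model = mental_model  # capability -> perceived reliability

    def understand(self, system_utterance: str) -> dict:
        # NLU: domain/intent/slot representation, limited by the user's
        # personal knowledge (a first-mentioned entity may go unrecognized).
        return {"intent": "recommend", "raw": system_utterance}

    def plan(self, sys_intent: dict) -> Action:
        # Planning: what to ask, or how to respond if prompted, informed
        # by the user model and the task-specific interaction model.
        return Action("ask", "voice", {"topic": sys_intent["intent"]})

    def execute(self, action: Action) -> Action:
        # Execution: fall back to clicking if the mental model says the
        # chosen modality has been unreliable (e.g., voice navigation).
        if self.mental_model.get(action.modality, 1.0) < 0.5:
            action.modality = "click"
        return action

    def learn(self, succeeded: bool) -> None:
        # Learning: update the perceived reliability of a capability; the
        # user model would be updated analogously with new knowledge.
        prev = self.mental_model.get("voice", 0.5)
        self.mental_model["voice"] = 0.9 * prev + 0.1 * float(succeeded)

    def respond(self, system_utterance: str) -> Action:
        # One simulated turn: NLU -> planning -> execution; NLG would then
        # verbalize text/voice actions, while clicks bypass NLG entirely.
        return self.execute(self.plan(self.understand(system_utterance)))


user = SimulatedUser({}, {}, mental_model={"voice": 0.2})
print(user.respond("Here are some alternatives for you."))  # falls back to click
```

In a full implementation, understand and plan would be backed by the personalized NLU models and the agenda-based approach discussed above.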
Each simulated user requires instantiating the user and mental models. (The interaction model is application-specific and is shared by all simulated users.) To make the user model realistic, it should be anchored in actual user profiles (while maintaining k-anonymity). For that, a generative model may be used, with parameters learned on publicly available corpora, e.g., item ratings for recommendation scenarios [46] and discussion fora for information seeking tasks. The mental model may be initialized using a small set of pre-trained skill profiles, created as part of laboratory user studies.

From a system architecture perspective, the user simulator in many regards resembles a CIA system, comprising natural language understanding, dialogue management, and natural language generation components. One major difference is that CIA systems may be assumed (in fact, expected) to have "perfect world knowledge," limited only by the availability of data. Conversely, user simulation also needs to consider the user's knowledge level in language understanding and generation. Another major difference is that while a CIA system is modeled after a single persona, each simulated user has their own unique persona. This requires each of the components to be parametrizable with respect to personal characteristics. Further, the choice of dialogue actions is affected by the user's mental model of the system (i.e., what the system is perceived to be able to understand and execute).

4.4. Operationalization

Note that simulation capability is application-specific. That is, different simulators would need to be trained for item recommendation, interactive QA, and, ultimately, for scenarios that cater for multiple user goals. To ensure that the behavior of simulated users aligns with that of human users, an operational CIA system with actual users would also be needed for each application. Setting up such applications should be seen as a community effort. Indeed, discussions in this direction have already begun, and one specific proposal for a CIA system that supports scholarly activities has been outlined in [44]. There are a number of challenges involved in building a CIA system that can serve as such a living lab. One is that it would have insufficient traffic for meaningful online evaluation (an issue that has indeed been encountered in the past [38]). To remedy that, additional users may be recruited, e.g., by involving students as part of their course work or by hiring workers on crowdsourcing platforms (i.e., increasing traffic volume). Another potential difficulty is that building a sufficiently performant CIA system for the application at hand turns out to be too challenging (thereby making the online service unattractive to users). While this is not easily solvable on the system front, it is possible to manage users' expectations. Indeed, one of the key ideas behind operating in the academic domain in [44] is to build a tool by researchers for researchers, and to embrace its imperfection.
Simulation approaches are evaluated by comparing them against real users on a given live research platform. In practice, this means that a small portion of the usage data collected from humans (i.e., the first few weeks of the live evaluation period) is disclosed and can be used for training the simulators, while the remaining data is used for evaluating them. The systems participating in the live evaluation (referred to as experimental systems) are also evaluated using the different simulators. Ultimately, the question we seek to answer is whether we can observe the same relative ranking of experimental systems with real users (based on the live experiment) as with simulated ones—being able to answer this question positively would mean that the simulator is sufficiently realistic.
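One simple way to quantify whether a simulator reproduces the relative ranking of the experimental systems is rank correlation between the two induced orderings; Kendall's tau in the sketch below is our illustrative choice, not something mandated by the proposal.

```python
from __future__ import annotations

from scipy import stats


def ranking_agreement(real_scores: dict[str, float],
                      sim_scores: dict[str, float]) -> float:
    """Kendall's tau between the system rankings induced by real users
    (live evaluation) and by simulated users; the dicts map each
    experimental system's id to its score under the measure M."""
    systems = sorted(real_scores)
    tau, _ = stats.kendalltau([real_scores[s] for s in systems],
                              [sim_scores[s] for s in systems])
    return tau


# Example: the simulator preserves the ordering of three experimental systems.
real = {"sys-A": 0.41, "sys-B": 0.47, "sys-C": 0.52}
simulated = {"sys-A": 0.38, "sys-B": 0.45, "sys-C": 0.50}
assert ranking_agreement(real, simulated) == 1.0
```

In practice, agreement on statistically significant pairwise orderings (as in Sect. 4.1) may matter more than the raw correlation value.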
5. Conclusions and Future Directions

In this paper, we have considered conversational AI from an IR perspective, and focused in particular on the problem of conversational information access, with the goal of identifying open challenges that the IR community is uniquely suited to address.

One critical area concerns the understanding of users' information needs and their information seeking behavior—one of the fundamental research directions in IR from the very beginning [60]. Currently, there is a lack of understanding of what would constitute desirable conversational experiences for information access scenarios that combine multiple user goals. Consequently, there are no suitable models of user behavior that could serve as foundations for unified architectures that can support such behavior.

Another aspect that represents a major open challenge is evaluation. Measurement is an area where IR has an unparalleled history [61, 62, 63, 64, 33, 35]. Building on the rich tradition and experience of community benchmarking campaigns such as TREC [62] and CLEF [63], our community is in a unique position to take the lead on the development of novel evaluation paradigms and methodologies. This paper has outlined a specific plan for such an effort along the lines of user simulation.

References

[1] J. Gao, M. Galley, L. Li, Neural approaches to conversational AI, Found. Trends Inf. Retr. 13 (2019) 127–298.
[2] M. F. McTear, Spoken dialogue technology: Enabling the conversational user interface, ACM Comput. Surv. 34 (2002) 90–169.
[3] J. Allan, B. Croft, A. Moffat, M. Sanderson, Frontiers, challenges, and opportunities for information retrieval: Report from SWIRL 2012, the second strategic workshop on information retrieval in Lorne, SIGIR Forum 46 (2012) 2–32.
[4] J. S. Culpepper, F. Diaz, M. D. Smucker, Research frontiers in information retrieval: Report from the third strategic workshop on information retrieval in Lorne (SWIRL 2018), SIGIR Forum 52 (2018) 34–90.
[5] W. B. Croft, The importance of interaction for information retrieval, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, 2019, pp. 1–2.
[6] J. Deriu, A. Rodrigo, A. Otegi, G. Echegoyen, S. Rosset, E. Agirre, M. Cieliebak, Survey on evaluation methods for dialogue systems, Artificial Intelligence Review (2020) 1573–7462.
[7] D. Jurafsky, J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd edition draft, Prentice Hall, Pearson Education International, 2019.
[8] H. Chen, X. Liu, D. Yin, J. Tang, A survey on dialogue systems: Recent advances and new frontiers, SIGKDD Explor. Newsl. 19 (2017) 25–35.
[9] I. V. Serban, R. Lowe, P. Henderson, L. Charlin, J. Pineau, A survey of available corpora for building data-driven dialogue systems: The journal version, Dialogue & Discourse 9 (2018) 1–49.
[10] L. Zhou, J. Gao, D. Li, H.-Y. Shum, The design and implementation of XiaoIce, an empathetic social chatbot, Comput. Linguist. 46 (2020) 53–93.
[11] I. Szpektor, D. Cohen, G. Elidan, M. Fink, A. Hassidim, O. Keller, S. Kulkarni, E. Ofek, S. Pudinsky, A. Revach, S. Salant, Y. Matias, Dynamic composition for conversational domain exploration, in: Proceedings of The Web Conference 2020, WWW '20, 2020, pp. 872–883.
[12] F. Radlinski, N. Craswell, A theoretical framework for conversational search, in: Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, CHIIR '17, 2017, pp. 117–126.
[13] Z. Yan, N. Duan, P. Chen, M. Zhou, J. Zhou, Z. Li, Building task-oriented dialogue systems for online shopping, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI '17, 2017, pp. 4618–4625.
[14] J. R. Trippas, Spoken Conversational Search: Audio-only Interactive Information Retrieval, Ph.D. thesis, RMIT University, 2019.
[15] Y. Deldjoo, J. R. Trippas, H. Zamani, Towards multi-modal conversational information seeking, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '21, 2021, pp. 1577–1587.
[16] C. Qu, L. Yang, M. Qiu, W. B. Croft, Y. Zhang, M. Iyyer, BERT with history answer embedding for conversational question answering, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, 2019, pp. 1133–1136.
[17] S. Reddy, D. Chen, C. D. Manning, CoQA: A conversational question answering challenge, Transactions of the Association for Computational Linguistics 7 (2019) 249–266.
[18] Y. Zhang, X. Chen, Q. Ai, L. Yang, W. B. Croft, Towards conversational search and recommendation: System ask, user respond, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, 2018, pp. 177–186.
[19] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, 2020. arXiv:2004.00646.
[20] C. Gao, W. Lei, X. He, M. de Rijke, T. Chua, Advances and challenges in conversational recommender systems: A survey, 2021. arXiv:2101.09459.
[21] C. Zhu, M. Zeng, X. Huang, SDNet: Contextualized attention-based deep network for conversational question answering, 2018. arXiv:1812.03593.
[22] J. Dalton, C. Xiong, J. Callan, TREC CAsT 2019: The Conversational Assistance Track overview, 2020. arXiv:2003.13624.
[23] L. Yang, M. Qiu, C. Qu, J. Guo, Y. Zhang, W. B. Croft, J. Huang, H. Chen, Response ranking with deep matching networks and external knowledge in information-seeking conversation systems, in: The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '18, 2018, pp. 245–254.
[24] Y. Song, C.-T. Li, J.-Y. Nie, M. Zhang, D. Zhao, R. Yan, An ensemble of retrieval-based and generation-based human-computer conversation systems, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI '18, 2018, pp. 4382–4388.
[25] N. Voskarides, D. Li, P. Ren, E. Kanoulas, M. de Rijke, Query resolution for conversational search with limited supervision, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, 2020, pp. 921–930.
[26] S. Vakulenko, N. Voskarides, Z. Tu, S. Longpre, A comparison of question rewriting methods for conversational passage retrieval, in: Proceedings of the 43rd European Conference on IR Research, ECIR '21, 2021, pp. 418–424.
[27] M. Aliannejadi, H. Zamani, F. Crestani, W. B. Croft, Asking clarifying questions in open-domain information-seeking conversations, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, 2019, pp. 475–484.
[28] C. Rosset, C. Xiong, X. Song, D. Campos, N. Craswell, S. Tiwary, P. Bennett, Leading conversational search by suggesting useful questions, in: Proceedings of The Web Conference 2020, WWW '20, 2020, pp. 1160–1170.
[29] C. Qu, L. Yang, W. B. Croft, Y. Zhang, J. R. Trippas, M. Qiu, User intent prediction in information-seeking conversations, in: Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, CHIIR '19, 2019, pp. 25–33.
[30] K. Christakopoulou, F. Radlinski, K. Hofmann, Towards conversational recommender systems, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, 2016, pp. 815–824.
[31] P. Ren, Z. Chen, Z. Ren, E. Kanoulas, C. Monz, M. de Rijke, Conversations with search engines, 2020. arXiv:2004.14162.
[32] S. Zhang, Z. Dai, K. Balog, J. Callan, Summarizing and exploring tabular data in conversational search, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, 2020, pp. 1537–1540.
[33] M. Sanderson, Test collection based evaluation of information retrieval systems, Found. Trends Inf. Retr. 4 (2010) 247–375.
[34] C.-W. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, J. Pineau, How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP '16, 2016, pp. 2122–2132.
[35] K. Hofmann, L. Li, F. Radlinski, Online evaluation for information retrieval, Found. Trends Inf. Retr. 10 (2016) 1–117.
[36] A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn, B. Hedayatnia, M. Cheng, A. Nagar, E. King, K. Bland, A. Wartick, Y. Pan, H. Song, S. Jayadevan, G. Hwang, A. Pettigrue, Conversational AI: The science behind the Alexa Prize, 2018. arXiv:1801.03604.
[37] A. Schuth, K. Balog, Living labs for online evaluation: From theory to practice, in: Proceedings of the 38th European Conference on Advances in Information Retrieval, ECIR '16, 2016, pp. 893–896.
[38] R. Jagerman, K. Balog, M. de Rijke, OpenSearch: Lessons learned from an online evaluation campaign, J. Data and Information Quality 10 (2018).
[39] F. Hopfgartner, K. Balog, A. Lommatzsch, L. Kelly, B. Kille, A. Schuth, M. Larson, Continuous evaluation of large-scale information access systems: A case for living labs, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF,
[50] … Speech Commun. 50 (2008) 630–645.
[51] B. R. Cowan, N. Pantidi, D. Coyle, K. Morrissey, P. Clarke, S. Al-Shehri, D. Earley, N. Bandeira, "What can I help you with?": Infrequent users' experiences of intelligent personal assistants, in: Proceedings of the 19th International Conference on Human-Computer Interaction with Mobile Devices
     Springer, 2019.                                              and Services, MobileHCI ’17, 2017.
[40] J. Schatzmann, K. Weilhammer, M. Stuttle, S. Young,     [52] B. R. Cowan, H. P. Branigan, H. Begum, L. McKenna,
     A survey of statistical user simulation techniques           É. Székely, They know as much as we do: Knowl-
     for reinforcement-learning of dialogue manage-               edge estimation and partner modelling of artificial
     ment strategies, Knowl. Eng. Rev. 21 (2006) 97–126.          partners, in: Proceedings of the 39th Annual Meet-
[41] O. Pietquin, H. Hastie, A survey on metrics for the          ing of the Cognitive Science Society, CogSci ’17,
     evaluation of user simulations, Knowl. Eng. Rev. 28          2017.
     (2013).                                                 [53] A. Sciuto, A. Saini, J. Forlizzi, J. I. Hong, “Hey Alexa,
[42] S. Vakulenko, K. Revoredo, C. Di Ciccio, M. de Rijke,        what’s up?”: A mixed-methods studies of in-home
     QRFA: A data-driven model of information-seeking             conversational agent usage, in: Proceedings of
     dialogues, in: Proceedings of the 41st European              the 2018 Designing Interactive Systems Conference,
     Conference on IR Research, ECIR ’19, 2019, pp. 541–          DIS ’18, 2018, pp. 857–868.
     557.                                                    [54] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng,
[43] J. R. Trippas, D. Spina, L. Cavedon, M. Sanderson,           D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, et al.,
     How do people interact in conversational speech-             Using recurrent neural networks for slot filling in
     only search tasks: A preliminary analysis, in: Pro-          spoken language understanding, IEEE/ACM Trans.
     ceedings of the 2017 Conference on Conference Hu-            Audio, Speech and Lang. Proc. 23 (2015) 530–539.
     man Information Interaction and Retrieval, CHIIR        [55] B. Liu, I. Lane, Attention-based recurrent neural
     ’17, 2017, pp. 325–328.                                      network models for joint intent detection and slot
[44] K. Balog, L. Flekova, M. Hagen, R. Jones, M. Pot-            filling, in: Interspeech 2016, 2016, pp. 685–689.
     thast, F. Radlinski, M. Sanderson, S. Vakulenko,        [56] J. Schatzmann, B. Thomson, K. Weilhammer, H. Ye,
     H. Zamani, Common conversational community                   S. Young, Agenda-based user simulation for boot-
     prototype: Scholarly conversational assistant, 2020.         strapping a POMDP dialogue system, in: Human
     arXiv:2001.06910.                                            Language Technologies 2007: The Conference of
[45] K. Balog, T. Kenter, Personal knowledge graphs:              the North American Chapter of the Association
     A research agenda, in: Proceedings of the 2019               for Computational Linguistics; Companion Volume,
     ACM SIGIR International Conference on Theory of              Short Papers, 2007, pp. 149–152.
     Information Retrieval, ICTIR ’19, 2019, pp. 217–220.    [57] Z. Yang, Z. Hu, C. Dyer, E. P. Xing, T. Berg-
[46] S. Zhang, K. Balog, Evaluating conversational rec-           Kirkpatrick, Unsupervised text style transfer using
     ommender systems via user simulation, in: Pro-               language models as discriminators, in: Advances
     ceedings of the 26th ACM SIGKDD International                in Neural Information Processing Systems 31, NIPS
     Conference on Knowledge Discovery & Data Min-                ’18, 2018, pp. 7287–7298.
     ing, KDD ’20, 2020, pp. 1512–1520.                      [58] Z. Fu, X. Tan, N. Peng, D. Zhao, R. Yan, Style transfer
[47] L. Azzopardi, M. Dubiel, M. Halvey, J. Dalton, Con-          in text: Exploration and evaluation, in: Proceedings
     ceptualizing agent-human interactions during the             of the AAAI Conference on Artificial Intelligence,
     conversational search process, in: Proceedings               2018.
     of the 2nd International Workshop on Conversa-          [59] T. Shen, T. Lei, R. Barzilay, T. Jaakkola, Style trans-
     tional Approaches to Information Retrieval, CAIR             fer from non-parallel text by cross-alignment, in:
     ’18, 2018.                                                   Proceedings of the 31st International Conference
[48] J. F. Kelley, An iterative design methodology for            on Neural Information Processing Systems, NIPS
     user-friendly natural language office information            ’17, 2017, pp. 6833–6844.
     applications, ACM Trans. Inf. Syst. 2 (1984) 26–41.     [60] T. Wilson, Information needs and uses: Fifty years
[49] C. Lewis, J. Rieman, Task-centered User Interface            of progress, Fifty Years of Information Progress: A
     Design: A Practical Introduction, University of Col-         Journal of Documentation Review 28 (1994) 15–51.
     orado, Boulder, Department of Computer Science,         [61] D. Ellis, The dilemma of measurement in informa-
     1993.                                                        tion retrieval research, J. Am. Soc. Inf. Sci. 47 (1996)
[50] J. Edlund, J. Gustafson, M. Heldner, A. Hjalmars-            23–36.
     son, Towards human-like spoken dialogue systems,        [62] E. M. Voorhees, D. K. Harman, TREC: Experiment
     and Evaluation in Information Retrieval (Digital
     Libraries and Electronic Publishing), The MIT Press,
     2005.
[63] N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF, volume 41 of The Information Retrieval Series, Springer, 2019.
[64] D. Kelly, Methods for evaluating interactive information retrieval systems with users, Found. Trends Inf. Retr. 3 (2009) 1–224.