=Paper=
{{Paper
|id=Vol-2950/paper-03
|storemode=property
|title=Conversational AI from an Information Retrieval Perspective: Remaining Challenges and a Case for User Simulation
|pdfUrl=https://ceur-ws.org/Vol-2950/paper-03.pdf
|volume=Vol-2950
|authors=Krisztian Balog
|dblpUrl=https://dblp.org/rec/conf/desires/Balog21
}}
==Conversational AI from an Information Retrieval Perspective: Remaining Challenges and a Case for User Simulation==
Krisztian Balog
University of Stavanger, Norway
Abstract
Conversational AI is an emerging field of computer science that engages multiple research communities, from information retrieval to natural language processing to dialogue systems. Within this vast space, we focus on conversational information access, a problem that is uniquely suited to be addressed by the information retrieval community. We argue that despite the significant research activity in this area, progress is mostly limited to component-level improvements. There remains a disconnect between current efforts and truly conversational information access systems. Apart from the inherently challenging nature of the problem, the lack of progress, in large part, can be attributed to the shortage of appropriate evaluation methodology and resources. This paper highlights challenges that render both offline and online evaluation methodologies unsuitable for this problem, and discusses the use of user simulation as a viable solution.
Keywords
Conversational information access, conversational AI, user simulation
1. Introduction

Conversational AI may be seen as the holy grail of computer science: building machines that are capable of interacting with people in a human-like way. With rapid advances in AI technology, there are reasons to believe that such an ambition is within reach [1]. Conversational AI is a vast and complex problem, which requires a combination of methods, tools, and techniques from multiple fields of computer science, including but not limited to artificial intelligence (AI), natural language processing (NLP), machine learning (ML), dialogue systems (DS), recommender systems (RecSys), human-computer interaction (HCI), and not least information retrieval (IR). Each of these fields may have its own particular interpretation of what conversational AI should entail and a specific focus on certain research challenges that are involved. For example, in spoken dialogue systems the main focus is on being able to talk to machines, i.e., on developing speech-based human-computer interfaces [2], and thus automatic speech recognition is a central component. Many other communities, on the other hand, assume a chat-based interface, and voice is not among the supported modalities. At the same time, there are many shared aspects, including handling the semantics involved in the dialogue process, generating contextually appropriate responses, and developing effective end-to-end (neural) architectures, which engage multiple research communities. In this paper, we focus on the problem of conversational information access (CIA), one that the IR community is uniquely suited to address.

Conversational search or conversational information seeking was already identified in 2012 as a research direction of strategic importance in IR [3], and its significance has been re-iterated in 2018 [4]. There, the problem focus has been defined to include complex user goals that require multi-step information seeking, exploratory information gathering, and multi-step task completion and recommendation, as well as dialog settings with variable communication channels. Our analysis of recent works, however, leads us to the observation that current efforts do not seem to be fully aligned with the directions set out there. In terms of end-to-end tasks, there are two main threads of work: conversational QA and conversational recommendations. Currently, these are treated as two separate types of systems, with different goals, architectures, and evaluation criteria. Instead, to assist users more effectively, the two should be seamlessly integrated in CIA systems, thereby moving from a siloed to a more unified view. Additionally, the multi-modality of interactions needs to be more fully embraced, in order to more actively support effective interaction [5]. On the component level, most proposed techniques are not truly conversational in the sense that they are applicable to any interactive IR system (e.g., modern web search engines). A critical blocker to progress, on both the end-to-end and component levels, is the shortage of appropriate evaluation methodology and resources.
Table 1: Categorization of conversational AI systems, based on [1, 6].

Task-oriented
• Aim to assist users to solve a specific task (as efficiently as possible).
• Dialogues follow a clearly designed structure (flow) that is developed for a particular task in a closed domain.
• Well-defined measure of performance that is explicitly related to task completion.

Social chat
• Aim to carry on an extended conversation (“chit-chat”) with the goal of mimicking human-human interactions.
• Developed for unstructured, open-domain conversations.
• The objective is to be human-like, i.e., able to talk about different topics (breadth and depth) in an engaging and coherent manner.

Interactive QA
• Aim to provide concise, direct answers to user queries.
• Dialogues are unstructured, but commonly follow a question-answer pattern; mostly open domain (dictated by the underlying data).
• Evaluated with respect to the correctness of answers (on the turn level).
In summary, the contributions of this paper are threefold.

• We argue for a broader interpretation of conversational information access, one that embraces multiple user goals (mixing task-oriented and QA elements) and multi-modal interactions (Sect. 2).

• We provide a synthesis of progress on conversational information access and identify open challenges around methods and evaluation (Sect. 3).

• We argue for (a more extensive) use of simulation as a viable evaluation paradigm for conversational information access and describe a simulator architecture (Sect. 4).

2. Defining Conversational Information Access

This section defines conversational information access and places it in the broader context of conversational AI.

2.1. Conversational AI: The big picture

Conversational AI¹ is casually used to denote a broad range of systems that are capable of (some degree of) natural language understanding and responding in a way that mimics human dialogue. A conversational AI system may thus be considered successful if it offers an experience that is indistinguishable from what could have been delivered by a human. These systems often focus on a particular type of conversational support, naturally lending themselves to categorization.

¹ In this paper, the terms conversational AI, conversational agent, and dialogue system are used interchangeably. We, however, avoid using the term chatbot, which has a different meaning in industrial and academic contexts; in the former case it refers to a task-oriented system, while in the latter it means a non-task-oriented system [7].

2.1.1. Traditional 2-way Categorization

Traditionally, conversational agents are categorized as being either goal-driven (or task-oriented) or non-goal-driven (also known as chatbots) [8, 9, 7]. Goal-driven systems aim to assist users to complete some specific task. Dialogues are constrained to a specific domain and characterized by having a designated structure, designed for particular tasks within that domain. The main success criterion for the conversational agent is its ability to help the user solve their task as efficiently as possible. Typical examples include travel planning and appointment scheduling.
Non-goal-driven systems, on the other hand, aim to carry on an extended conversation (“chit-chat”) with the goal of mimicking unstructured human-human interactions. The main purpose of these systems is usually entertainment or providing an “AI companion” [10]. Therefore, the objective for these systems is to be able to talk about different topics in an engaging and cohesive manner.

2.1.2. Contemporary 3-way Categorization

Most recently, the traditional categorization has been extended with a third category, interactive question answering (QA) [1, 6], in recognition of the fact that it fits neither task-oriented nor social chat, but deserves a separate category in its own right. Interactive QA systems are designed to provide answers to specific questions. They are not characterized by a rigid dialogue flow, although they typically follow a question-answer pattern. Apart from some notable recent examples [11], the human-like conversation aspect for QA systems is much less pronounced than for the other two types of systems, and evaluation is restricted to answer correctness.
Table 1 summarizes the characteristics of the three categories of conversational AI systems. Given their unique goals and objectives, each of these problem categories is addressed by a distinctive system architecture [1, 6].
Figure 1: Envisioned dialogue with a CIA system.
User: I want to buy new running shoes.
AI: My records say that you have been using a Nike Pegasus 33 before. How did you like that?
User: I liked it a lot on tarmac, but my feet often hurt a bit on very long asphalt runs.
AI: Here are some alternatives for you. Of these, the ASICS Gel Nimbus 23 is especially renowned for its cushioned midsole.
User: What is the midsole?
AI: The midsole is the bed of foam that lies between your foot and the ground. This is the part of the shoe responsible for feeling soft or hard in the shoe.

2.2. Conversational Information Access

Building on [4], we use the term conversational information access (CIA) to define a subset of conversational AI systems that specifically aim at a task-oriented sequence of exchanges to support multiple user goals, including search, recommendation, and exploratory information gathering, that require multi-step interactions over possibly multiple modalities. Further, these systems are expected to learn user preferences, personalize responses accordingly, and be capable of taking initiative.
Consider the conversation shown in Fig. 1, illustrating some of the above requirements. It is primarily a task-oriented dialogue (the user wanting to buy new running shoes), which requires an exploration of the item space. Assuming a chat-based interface, this can be done most effectively by combining multiple modalities; not just text, but also, in this example, a carousel for cycling through items. Up until the second user utterance, it is a strictly task-oriented sequence of exchanges (cf. the task-oriented category in Table 1). But, then, the third user utterance breaks out of the task flow and switches to “QA mode” (cf. interactive QA in Table 1).

2.2.1. From Siloed to Unified View

One key realization the above example is meant to illustrate is that conversational information access cuts across the task-oriented and interactive QA categories. This blending makes CIA suited to assist users meaningfully with their needs. Conversely, existing work (and, importantly, evaluation initiatives) in IR almost exclusively focuses on a question-answering paradigm (see Sect. 3). This does not allow for interaction with sets of items, one of the main properties that makes a search system conversational [12].
It has been shown that the “siloed” view, represented by the three categories in Table 1, in practice does not align well with users’ information needs and behavior [13]. Gao et al. [1] acknowledge the need for a “top-level bot” that would act as a broker and switch between different user goals. Most commercial assistants are hybrid systems, with different degrees of support for switching. There is, however, little published research on it. In summary, there is a need for a more holistic view where multiple user goals are supported.²

² We note that this problem is not specific to IR. However, conversational information access is a good starting point that the IR community is uniquely suited to address. Lessons and findings could then be generalized to broader applications.

2.2.2. Multi-modality

Another key point highlighted by the example in Fig. 1 is the need for embracing multi-modality. Text-only responses are motivated by an audio-only channel, without a screen [14]. However, more often than not a chat-based interface is available, which allows for a richer set of input controls and navigational components. These, in turn, would enable CIA systems to more actively support effective interaction [5]. We note that the need for multi-modality has been recognized independently by other scholars as well [15].

3. Progress to Date and Remaining Challenges

In this section, we reflect on progress achieved so far, organized around methods and evaluation, and identify remaining challenges.

3.1. Methods

In our discussion, we distinguish between end-to-end conversational tasks and specific component-level sub-tasks.

3.1.1. End-to-end tasks

There are two main tasks that have received attention: conversational QA [16, 1, 17] and conversational recommendations [18, 19, 20]. What distinguishes conversational QA from traditional single-turn QA is the need for contextual understanding. Hence, much of the research revolves around modeling conversation history [16, 17, 21]. However, in terms of evaluation, the problem is simplified to a single-turn passage retrieval task, where the relevance of the system response at a given turn does not consider the responses given by the system at earlier turns [22, 17]. It is only conversational recommender systems where the multi-turn nature of conversations is more fully embraced [18, 19, 20].
3.1.2. Component-level sub-tasks

Recently, progress has been made on specific subtasks for CIA, including response retrieval [23] and generation [24], query resolution [25, 26], asking clarifying questions [27] or suggesting questions [28], predicting user intent [29], and preference elicitation [18, 30]. Each of these studies makes the point that the conversational setting calls for a different set of approaches. However, most of these subtasks are applicable in any interactive IR context, adhering to the stance that search is inherently a conversational experience: it is a dialogue between a human and a search engine [31]. From this perspective, there has been substantial progress, especially on the mixed-initiative aspect, e.g., question clarifications and suggestions [27, 28]. Alternatively, one may take a more critical stance and ask: What separates conversational information access from any other interactive IR system (most prominently: search engines)? According to Croft [5], the key distinguishing factor is that a conversational system is a more active partner in the interaction. In that regard, there is surprisingly little work, with only a handful of notable exceptions [32, 11].

3.2. Evaluation

3.2.1. Offline Evaluation

Traditionally, system-oriented evaluation in IR has been performed using offline test collections, following the Cranfield paradigm [33]. This rigorous methodology ensures the repeatability and reproducibility of experiments, and has been instrumental to progress in the field. To date, work on CIA still employs offline evaluation [22, 27], but this has severe limitations. First, reusability requires that the system is limited to selecting the best response, in answer to a user utterance, from a restricted set of possible candidates (i.e., some predefined corpus of responses). Second, it is limited in scope to a single conversation turn and does not consider the dialogue history that led to that particular user utterance (cf. red area in Fig. 2). An alternative is to let human evaluators assess an entire conversation, once it has taken place [6]. However, this is a single path (see blue area in Fig. 2), without considering the other choices the user could have taken during the course of the dialogue. Moreover, it is expensive, time-consuming, and does not scale. Most importantly, it would not yield a reusable test collection. In summary, offline test collections have their merits, but their use is limited to the purpose of evaluating specific components, in isolation. Further, the choice of evaluation metrics is an open challenge [34].

Figure 2: The space of possible dialogue states increases exponentially with the number of turns between system (S) and user (U). Evaluation is currently limited to either a single path (blue area) or a single turn (red area).

3.2.2. Online Evaluation

Online evaluation involves fielding an IR system to real users, and observing how they interact with the system in situ, in their natural task environments [35]. This requires a live service as a research platform. Currently, this possibility is only available to researchers working at major service providers that develop conversational assistants (Google, Microsoft, Apple, Amazon). Even there, experimentation with live users is severely limited due to scalability, quality, and ethical concerns. Of these companies, only Amazon has decided to open up its platform for academic research, by organizing the Alexa Prize Challenge [36]. It represents a unique opportunity for academics to perform research with a live system used by millions of users, and provides university teams with real user conversational data at scale. While this effort points in the right direction, it is inherently limited in that it addresses social conversations (“chit-chat”), with the target goal of conversing coherently and engagingly with humans on popular topics such as sports, politics, or technology for 20 minutes. This is a non-goal-driven task, which is rather different from goal-driven CIA. Currently, there is no publicly available research platform for CIA. Living labs represent a novel evaluation paradigm for IR [37], which allows researchers to evaluate their methods with real users of live search services. This methodology has been successfully employed in world-wide benchmarking campaigns [38, 39]. It, however, needs to be extended to a conversational setting, which brings about methodological and practical challenges.
3.2.3. User Simulation

With a long history in the field of spoken dialogue systems, user simulation is seen as a critical tool for automatic dialogue management design [40]. The idea is to train a user model that is “capable of producing responses that a real user might have given in a certain dialog situation” [40]. This is in line with our goals, but there are two crucial differences. First, the primary purpose of user simulation in DS is to generate synthetic training data at scale, which in turn can be used to learn dialogue strategies (typically, using reinforcement learning). Assessment of the quality of simulated dialogues and user simulation methods, however, is an open issue [41]. Second, dialogue systems, as well as recent work on conversational recommender systems [18], are focused on supporting the user with a single goal that can be fulfilled by eliciting preferences on a set of attributes. CIA systems, on the other hand, need to deal with complex search and recommendation scenarios. This requires a more holistic user model.

3.3. Summary and Remaining Challenges

3.3.1. Understanding User Needs and Behavior

Current characterizations of information seeking behavior for CIA are limited either in the set of actions considered [42] or in the sequences of conversational turns [43]. To cater for the functionality defined by Radlinski and Craswell [12], and further expanded by us in Sect. 2.2, one would need user and interaction models capable of representing (1) multi-modal interactions (speech, text, pointing and clicking), (2) users’ ability to change their state of knowledge (learn and forget), and (3) users’ ability to learn how a system works and what its limits are (and change their expectations and behavior accordingly).

3.3.2. Truly Conversational Methods

Conversational recommendations and QA have been studied as end-to-end tasks. However, as we argued in Sect. 2.2, in practice these two are not clearly delineated applications, but rather different “modes” that should be seamlessly integrated within a CIA system. There has been significant progress on various components, which are indispensable building blocks. Integrating these into a unified system that supports multiple user goals remains an open challenge [1]. Further open questions in this space include (1) deciding when and what type of initiative a system should take, and (2) determining the best modality based on task and context.

3.3.3. Evaluation

There is a need to go beyond turn-based evaluation to multi-turn-based and eventually end-to-end evaluation. To be able to perform end-to-end evaluation of CIA systems, additional methodologies need to be considered, including online evaluation and simulated users. For online evaluation, the living labs paradigm represents an alternative, but it requires agreement on a canonical architecture in order to be able to open up individual components for experimentation. Further, it requires an existing service with live users, which is currently lacking. It should be noted that the need for such an open research platform has been identified and a plan for the academic search domain has recently been outlined [44]. As for simulation, most existing approaches are meant to advance reinforcement learning techniques in a strictly goal-oriented setting. This is different from our purpose of evaluation. The simulation techniques that are currently used for evaluation lack the desired conversational complexity.

4. A Case for Simulation

This section presents a proposal for robust large-scale automatic evaluation of CIA systems via user simulation.

4.1. Methodology

Our main hypothesis is that it is possible to simulate human behavior with regard to interacting with CIA systems. To validate this hypothesis, we need to show that simulated users behave indistinguishably from real humans, in the context of a specific conversational application and with respect to specific evaluation measures. Formally, let S1 and S2 denote two CIA systems, which differ in some component(s). Both systems are assumed to be operated by a set U of users from some user population. Let us assume that there is a statistically significant difference observed in their relative performance, according to some evaluation measure M, such that M(S1, U) < M(S2, U). Simulation is considered successful if, by engaging a set U* of simulated users, we observe the same relative system differences as with real users, i.e., M(S1, U*) < M(S2, U*). Further, this observation should generalize across systems S and evaluation measures M.
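As a minimal illustration of how this criterion could be operationalized, the sketch below (in Python; the function and data structure names are illustrative assumptions, not part of the proposal itself) checks whether a set of simulated users preserves every statistically significant system ordering observed with real users.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class SystemScores:
    """Per-user values of the evaluation measure M for one CIA system."""
    system_id: str
    scores: List[float]  # one value per (real or simulated) user


def ordering(a: SystemScores, b: SystemScores) -> int:
    """Return -1 if mean M(a) < mean M(b), +1 if greater, 0 if tied."""
    ma, mb = mean(a.scores), mean(b.scores)
    return (ma > mb) - (ma < mb)


def simulation_successful(real: Dict[str, SystemScores],
                          simulated: Dict[str, SystemScores],
                          significant_pairs: List[Tuple[str, str]]) -> bool:
    """A simulator is deemed successful if every system pair whose difference was
    statistically significant with real users, M(S1, U) < M(S2, U), is ordered the
    same way by the simulated users, M(S1, U*) < M(S2, U*)."""
    return all(
        ordering(real[s1], real[s2]) == ordering(simulated[s1], simulated[s2])
        for s1, s2 in significant_pairs
    )
```

The same check would then be repeated for other evaluation measures M and other sets of systems S to probe the generalization requirement stated above.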
The above formulation ensures that the behavior of simulated users aligns with that of real users. Notice that to be able to perform this validation, an operational CIA system is needed; we discuss the practical aspects of setting up such an experimental platform below, in Sect. 4.4. For the human evaluation part, i.e., measuring M(S, U), two distinct approaches may be employed: (1) asking users themselves, inside the CIA system, to give feedback on either the entire conversation or on its specific system utterances, and (2) sampling interesting/meaningful branches from conversation logs, which will be annotated by external human labelers (e.g., crowd workers).
Once a user simulator is created and validated against real users, it may be used for evaluating a given CIA system. It is important to note that, in principle, a given user simulator instance should be used only once, the same way that an offline test collection should only be used once, to avoid overfitting systems against a particular test suite.

4.2. Requirements

We characterize a realistic user simulator by its ability to capture:

(R1) Personal interests and preferences, and the changes of preferences over time;

(R2) Persona (personality, educational and socio-economic background, etc.);

(R3) Multi-modality of interactions (speech, text, pointing and clicking, etc.);

(R4) The user’s ability to change their state of knowledge (learn and forget);

(R5) The user’s ability to learn how a system works and what its limits are, and change their expectations and behavior accordingly.

We note that not all these requirements are critical for an initial simulator and some may be highly ambitious. Nevertheless, we shall discuss our conceptual architecture with reference to these requirements.

4.3. Architecture

Figure 3 shows the conceptual architecture of a user simulator addressing the stated requirements. We discuss its main components below, and provide specific starting points for each of them.

Figure 3: Conceptual architecture of the user simulator: natural language understanding (NLU), response generation (planning, execution, and learning, informed by the user, interaction, and mental models), and natural language generation (NLG), interacting with the conversational information access system via text/voice and pointing/clicking.

User, interaction, and mental models provide the foundation for simulation behavior.

• User model. To represent all personal information related to a given user, including persona (R2), preferences (R1), and knowledge (R4), personal knowledge graphs (PKGs) [45] may be used. The reason for using a PKG is to ensure the consistency of the preferences that are revealed by the simulated user, as is done in [46]. To fully address R1 and R4, PKGs will need to be extended along two dimensions: (1) include concepts, in addition to entities, to represent the user’s knowledge on specific topics, with a further distinction to be made between entities/concepts the user has heard about vs. has in-depth knowledge of; (2) capture the temporal scope, to be able to distinguish between short- vs. long-term preferences and fresh vs. diminishing knowledge.

• Interaction model. To characterize the CIA process between humans and systems for a given application, the key actions and decisions that manifest in dialogues need to be abstracted out. A starting point for a taxonomy of user/system actions is provided in [47]. This taxonomy may be revised and extended to multi-modal interactions (R3) based on conversations collected in laboratory user studies with an “idealized” CIA system using the Wizard-of-Oz approach [48] and from interaction data from actual CIA systems.

• Mental model. To capture how a particular user thinks about a given CIA system (R5), mental models need to be developed. The thinking-aloud method is commonly used for such purposes in usability testing, psychology, and the social sciences [49]. There is work in HCI on identifying and analyzing experiences and barriers qualitatively [50, 51, 52, 53]. A main difference from those studies is that the goal here is to build a quantifiable mental model that represents the user’s expectations and perceived capabilities of a CIA system.
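As a rough illustration of the user model described above, the following sketch encodes preferences and knowledge with the two proposed PKG extensions: concepts in addition to entities, heard-of vs. in-depth familiarity, and temporal scope. All class and field names are illustrative assumptions rather than an existing PKG schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Familiarity(Enum):
    UNKNOWN = 0    # never encountered
    HEARD_OF = 1   # encountered, but only shallow knowledge
    IN_DEPTH = 2   # the user can reason about it in detail


@dataclass
class Preference:
    value: float      # e.g., -1.0 (dislike) .. 1.0 (like)  (R1)
    long_term: bool   # short-term (session-bound) vs. long-term preference
    timestamp: float  # allows preferences to change over time


@dataclass
class Knowledge:
    familiarity: Familiarity
    timestamp: float  # fresh vs. diminishing knowledge (R4)


@dataclass
class UserModel:
    """PKG-style user model covering persona (R2), preferences (R1), and knowledge (R4).
    Keys of the preference and knowledge maps are identifiers of entities or concepts;
    concepts capture the user's topical knowledge."""
    persona: Dict[str, str] = field(default_factory=dict)
    preferences: Dict[str, Preference] = field(default_factory=dict)
    knowledge: Dict[str, Knowledge] = field(default_factory=dict)

    def familiarity_with(self, item_id: str) -> Familiarity:
        entry = self.knowledge.get(item_id)
        return entry.familiarity if entry else Familiarity.UNKNOWN
```

Such a structure would let the simulator reveal preferences consistently across a dialogue and condition its utterances on what it is assumed to know.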
Next, we describe the components responsible for interacting with CIA systems.

• Natural language understanding. Obtaining a structured representation from a system utterance is analogous to NLU in dialogue systems and involves domain classification, intent determination, and slot filling [7]. These tasks are effectively tackled by neural architectures [54, 55, 1]. These approaches, however, are created for conversational systems and assume “perfect” world knowledge, based on some underlying knowledge repository. For user simulation, they need to be adapted to consider personal knowledge. For example, the user may or may not be able to guess the corresponding type or category of an entity/concept that is mentioned for the first time, depending on their knowledge of the given domain.

• Response generation. Determining how a simulated user should respond to a system utterance is modeled in three stages: planning, execution, and learning. In the planning stage, a structured representation of an information need (what to ask the system) or user response (how to respond if prompted by the system) is generated. This is informed by the user model, in terms of interests and preferences, as well as the interaction model, to help interpret what the system is asking in terms of a task-specific dialog flow. In the execution stage, the simulator decides on the course of execution, based on the user’s mental model of the given system’s capabilities (e.g., it will not attempt to navigate a list using voice, but rather click, if voice navigation did not function in the past as expected). Based on how the system responds to a given user utterance, the learner module can make updates to the user model (whether the user learned something new about a given topic) and also to the mental model of the system (how successful it was in understanding/executing what was requested). Response generation can be framed within the well-established agenda-based simulation approach [56].

• Natural language generation. Finally, a structured intent representation (what to say to the system) needs to be turned into a natural language utterance (how to say it). The exact articulation is influenced by the persona and knowledge level of the simulated user. A possible starting point is to generate templated responses and then apply transfer learning for text [57, 58, 59]. Later, more end-to-end approaches may also be devised, eliminating the need for manual template generation. It should be noted that not all requests get passed through NLG, as the executor may decide to use a different modality.

Each simulated user requires instantiating the user and mental models. (The interaction model is application-specific and is shared by all simulated users.) To make the user model realistic, it should be anchored in actual user profiles (while maintaining k-anonymity). For that, a generative model may be used, with parameters learned on publicly available corpora, e.g., item ratings for recommendation scenarios [46] and discussion fora for information seeking tasks. The mental model may be initialized using a small set of pre-trained skill profiles, created as part of laboratory user studies.
From a system architecture perspective, the user simulator in many regards resembles a CIA system, comprising natural language understanding, dialog management, and natural language generation components. One major difference is that CIA systems may be assumed (in fact, expected) to have “perfect world knowledge,” only limited by the availability of data. Conversely, user simulation also needs to consider the user’s knowledge level in language understanding and generation. Another major difference is that while a CIA system is modeled after a single person, each simulated user has a unique persona. This requires each of the components to be parametrizable with respect to personal characteristics. Further, the choice of dialogue actions is affected by the user’s mental model of the system (i.e., what the system is perceived to be able to understand and execute).

4.4. Operationalization

Note that simulation capability is application specific. That is, different simulators would need to be trained for item recommendation, interactive QA, and, ultimately, for scenarios that cater for multiple user goals. To ensure that the behavior of simulated users aligns with that of human users, an operational CIA system with actual users would also be needed for each application. Setting up such applications should be seen as a community effort. Indeed, discussions in this direction have already begun, and one specific proposal for a CIA system supporting scholarly activities has been outlined in [44]. There are a number of challenges involved in building a CIA system that can serve as such a living lab. One is that it would have insufficient traffic for meaningful online evaluation (an issue that has indeed been encountered in the past [38]). To remedy that, additional users may be recruited, e.g., by involving students as part of their course work or hiring workers on crowdsourcing platforms (i.e., increasing traffic volume). Another potential difficulty is that building a sufficiently performant CIA system for the application at hand turns out to be too challenging (thereby making the online service unattractive to users). While this is not easily solvable on the system front, it is possible to manage users’ expectations. Indeed, one of the key ideas behind operating in the academic domain in [44] is to build a tool by researchers, for researchers, and embrace its imperfection.
Simulation approaches are evaluated by comparing them against real users on a given live research platform. In practice, this means that a small portion of the usage data collected from humans (i.e., the first few weeks of the live evaluation period) is disclosed and can be used for training the simulators, while the remaining data is used for evaluating them. The systems participating in the live evaluation (referred to as experimental systems) are also evaluated using the different simulators. Ultimately, the question we seek to answer is whether we can observe the same relative ranking of experimental systems with real users (based on the live experiment) as with simulated ones; being able to answer this question positively would mean that the simulator is sufficiently realistic.
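A minimal sketch of this protocol is given below (illustrative names only; kendalltau is taken from SciPy). Usage logs are split by time into a simulator-training portion and a held-out portion, and a simulator is judged by how well the ranking of experimental systems it induces agrees with the ranking obtained from real users, extending the pairwise criterion of Sect. 4.1 to full rankings.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple

from scipy.stats import kendalltau  # rank correlation between two orderings


@dataclass
class LoggedDialogue:
    system_id: str
    timestamp: float     # when the dialogue took place
    metric_value: float  # value of the evaluation measure M for this dialogue


def temporal_split(logs: List[LoggedDialogue],
                   cutoff: float) -> Tuple[List[LoggedDialogue], List[LoggedDialogue]]:
    """Disclose the first weeks of the live evaluation for simulator training;
    hold out the remainder for evaluating the simulators."""
    train = [d for d in logs if d.timestamp < cutoff]
    held_out = [d for d in logs if d.timestamp >= cutoff]
    return train, held_out


def ranking(per_system_scores: Dict[str, List[float]]) -> List[str]:
    """Order experimental systems by their mean score under measure M."""
    return sorted(per_system_scores, key=lambda s: mean(per_system_scores[s]))


def ranking_agreement(real: Dict[str, List[float]],
                      simulated: Dict[str, List[float]]) -> float:
    """Kendall's tau between the system ranking obtained with real users and the
    one obtained with simulated users; values close to 1 indicate that the
    simulator reproduces the live ranking."""
    systems = sorted(real)
    real_order, sim_order = ranking(real), ranking(simulated)
    tau, _ = kendalltau([real_order.index(s) for s in systems],
                        [sim_order.index(s) for s in systems])
    return tau
```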
5. Conclusions and Future Directions

In this paper, we have considered conversational AI from an IR perspective, and focused in particular on the problem of conversational information access, with the goal of identifying open challenges that the IR community is uniquely suited to address.
One critical area concerns the understanding of users’ information needs and their information seeking behavior, one of the fundamental research directions in IR from the very beginning [60]. Currently, there is a lack of understanding of what would be desirable conversational experiences for information access scenarios that combine multiple user goals. Consequently, there are no suitable models of user behavior that could serve as foundations for unified architectures that can support such behavior.
Another aspect that represents a major open challenge is evaluation. Measurement is an area where IR has an unparalleled history [61, 62, 63, 64, 33, 35]. Building on the rich tradition and experience of community benchmarking campaigns such as TREC [62] and CLEF [63], our community is in a unique position to take a lead on the development of novel evaluation paradigms and methodologies. This paper has outlined a specific plan for such an effort built around user simulation.

References

[1] J. Gao, M. Galley, L. Li, Neural approaches to conversational AI, Found. Trends Inf. Retr. 13 (2019) 127–298.
[2] M. F. McTear, Spoken dialogue technology: Enabling the conversational user interface, ACM Comput. Surv. 34 (2002) 90–169.
[3] J. Allan, B. Croft, A. Moffat, M. Sanderson, Frontiers, challenges, and opportunities for information retrieval: Report from SWIRL 2012 the second strategic workshop on information retrieval in Lorne, SIGIR Forum 46 (2012) 2–32.
[4] J. S. Culpepper, F. Diaz, M. D. Smucker, Research frontiers in information retrieval: Report from the third strategic workshop on information retrieval in Lorne (SWIRL 2018), SIGIR Forum 52 (2018) 34–90.
[5] W. B. Croft, The importance of interaction for information retrieval, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, 2019, pp. 1–2.
[6] J. Deriu, A. Rodrigo, A. Otegi, G. Echegoyen, S. Rosset, E. Agirre, M. Cieliebak, Survey on evaluation methods for dialogue systems, Artificial Intelligence Review (2020) 1573–7462.
[7] D. Jurafsky, J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd Edition draft, Prentice Hall, Pearson Education International, 2019.
[8] H. Chen, X. Liu, D. Yin, J. Tang, A survey on dialogue systems: Recent advances and new frontiers, SIGKDD Explor. Newsl. 19 (2017) 25–35.
[9] I. V. Serban, R. Lowe, P. Henderson, L. Charlin, J. Pineau, A survey of available corpora for building data-driven dialogue systems: The journal version, Dialogue & Discourse 9 (2018) 1–49.
[10] L. Zhou, J. Gao, D. Li, H.-Y. Shum, The design and implementation of XiaoIce, an empathetic social chatbot, Comput. Linguist. 46 (2020) 53–93.
[11] I. Szpektor, D. Cohen, G. Elidan, M. Fink, A. Hassidim, O. Keller, S. Kulkarni, E. Ofek, S. Pudinsky, A. Revach, S. Salant, Y. Matias, Dynamic composition for conversational domain exploration, in: Proceedings of The Web Conference 2020, WWW ’20, 2020, pp. 872–883.
[12] F. Radlinski, N. Craswell, A theoretical framework for conversational search, in: Proceedings of the 2017 Conference on Conference Human Information Interaction and Retrieval, CHIIR ’17, 2017, pp. 117–126.
[13] Z. Yan, N. Duan, P. Chen, M. Zhou, J. Zhou, Z. Li, Building task-oriented dialogue systems for online shopping, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI ’17, 2017, pp. 4618–4625.
[14] J. R. Trippas, Spoken Conversational Search: Audio-only Interactive Information Retrieval, Ph.D. thesis, RMIT University, 2019.
[15] Y. Deldjoo, J. R. Trippas, H. Zamani, Towards multi-modal conversational information seeking, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, 2021, pp. 1577–1587.
[16] C. Qu, L. Yang, M. Qiu, W. B. Croft, Y. Zhang, M. Iyyer, BERT with history answer embedding for conversational question answering, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, 2019, pp. 1133–1136.
[17] S. Reddy, D. Chen, C. D. Manning, CoQA: A conversational question answering challenge, Transactions of the Association for Computational Linguistics 7 (2019) 249–266.
[18] Y. Zhang, X. Chen, Q. Ai, L. Yang, W. B. Croft, Towards conversational search and recommendation: System ask, user respond, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM ’18, 2018, pp. 177–186.
[19] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, 2020. arXiv:2004.00646.
[20] C. Gao, W. Lei, X. He, M. de Rijke, T. Chua, Advances and challenges in conversational recommender systems: A survey, 2021. arXiv:2101.09459.
[21] C. Zhu, M. Zeng, X. Huang, SDNet: Contextualized attention-based deep network for conversational question answering, 2018. arXiv:1812.03593.
[22] J. Dalton, C. Xiong, J. Callan, TREC CAsT 2019: The Conversational Assistance Track overview, 2020. arXiv:2003.13624.
[23] L. Yang, M. Qiu, C. Qu, J. Guo, Y. Zhang, W. B. Croft, J. Huang, H. Chen, Response ranking with deep matching networks and external knowledge in information-seeking conversation systems, in: The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’18, 2018, pp. 245–254.
[24] Y. Song, C.-T. Li, J.-Y. Nie, M. Zhang, D. Zhao, R. Yan, An ensemble of retrieval-based and generation-based human-computer conversation systems, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI ’18, 2018, pp. 4382–4388.
[25] N. Voskarides, D. Li, P. Ren, E. Kanoulas, M. de Rijke, Query resolution for conversational search with limited supervision, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, 2020, pp. 921–930.
[26] S. Vakulenko, N. Voskarides, Z. Tu, S. Longpre, A comparison of question rewriting methods for conversational passage retrieval, in: Proceedings of the 43rd European Conference on IR Research, ECIR ’21, 2021, pp. 418–424.
[27] M. Aliannejadi, H. Zamani, F. Crestani, W. B. Croft, Asking clarifying questions in open-domain information-seeking conversations, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, 2019, pp. 475–484.
[28] C. Rosset, C. Xiong, X. Song, D. Campos, N. Craswell, S. Tiwary, P. Bennett, Leading conversational search by suggesting useful questions, in: Proceedings of The Web Conference 2020, WWW ’20, 2020, pp. 1160–1170.
[29] C. Qu, L. Yang, W. B. Croft, Y. Zhang, J. R. Trippas, M. Qiu, User intent prediction in information-seeking conversations, in: Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, CHIIR ’19, 2019, pp. 25–33.
[30] K. Christakopoulou, F. Radlinski, K. Hofmann, Towards conversational recommender systems, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, 2016, pp. 815–824.
[31] P. Ren, Z. Chen, Z. Ren, E. Kanoulas, C. Monz, M. de Rijke, Conversations with search engines, 2020. arXiv:2004.14162.
[32] S. Zhang, Z. Dai, K. Balog, J. Callan, Summarizing and exploring tabular data in conversational search, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, 2020, pp. 1537–1540.
[33] M. Sanderson, Test collection based evaluation of information retrieval systems, Found. Trends Inf. Retr. 4 (2010) 247–375.
[34] C.-W. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, J. Pineau, How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP ’16, 2016, pp. 2122–2132.
[35] K. Hofmann, L. Li, F. Radlinski, Online evaluation for information retrieval, Found. Trends Inf. Retr. 10 (2016) 1–117.
[36] A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn, B. Hedayatnia, M. Cheng, A. Nagar, E. King, K. Bland, A. Wartick, Y. Pan, H. Song, S. Jayadevan, G. Hwang, A. Pettigrue, Conversational AI: The science behind the Alexa Prize, 2018. arXiv:1801.03604.
[37] A. Schuth, K. Balog, Living labs for online evaluation: From theory to practice, in: Proceedings of the 38th European Conference on Advances in Information Retrieval, ECIR ’16, 2016, pp. 893–896.
[38] R. Jagerman, K. Balog, M. de Rijke, OpenSearch: Lessons learned from an online evaluation campaign, J. Data and Information Quality 10 (2018).
[39] F. Hopfgartner, K. Balog, A. Lommatzsch, L. Kelly, B. Kille, A. Schuth, M. Larson, Continuous evaluation of large-scale information access systems: A case for living labs, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF, Springer, 2019.
[40] J. Schatzmann, K. Weilhammer, M. Stuttle, S. Young, A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies, Knowl. Eng. Rev. 21 (2006) 97–126.
[41] O. Pietquin, H. Hastie, A survey on metrics for the evaluation of user simulations, Knowl. Eng. Rev. 28 (2013).
[42] S. Vakulenko, K. Revoredo, C. Di Ciccio, M. de Rijke, QRFA: A data-driven model of information-seeking dialogues, in: Proceedings of the 41st European Conference on IR Research, ECIR ’19, 2019, pp. 541–557.
[43] J. R. Trippas, D. Spina, L. Cavedon, M. Sanderson, How do people interact in conversational speech-only search tasks: A preliminary analysis, in: Proceedings of the 2017 Conference on Conference Human Information Interaction and Retrieval, CHIIR ’17, 2017, pp. 325–328.
[44] K. Balog, L. Flekova, M. Hagen, R. Jones, M. Potthast, F. Radlinski, M. Sanderson, S. Vakulenko, H. Zamani, Common conversational community prototype: Scholarly conversational assistant, 2020. arXiv:2001.06910.
[45] K. Balog, T. Kenter, Personal knowledge graphs: A research agenda, in: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR ’19, 2019, pp. 217–220.
[46] S. Zhang, K. Balog, Evaluating conversational recommender systems via user simulation, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, 2020, pp. 1512–1520.
[47] L. Azzopardi, M. Dubiel, M. Halvey, J. Dalton, Conceptualizing agent-human interactions during the conversational search process, in: Proceedings of the 2nd International Workshop on Conversational Approaches to Information Retrieval, CAIR ’18, 2018.
[48] J. F. Kelley, An iterative design methodology for user-friendly natural language office information applications, ACM Trans. Inf. Syst. 2 (1984) 26–41.
[49] C. Lewis, J. Rieman, Task-centered User Interface Design: A Practical Introduction, University of Colorado, Boulder, Department of Computer Science, 1993.
[50] J. Edlund, J. Gustafson, M. Heldner, A. Hjalmarsson, Towards human-like spoken dialogue systems, Speech Commun. 50 (2008) 630–645.
[51] B. R. Cowan, N. Pantidi, D. Coyle, K. Morrissey, P. Clarke, S. Al-Shehri, D. Earley, N. Bandeira, “What can i help you with?”: Infrequent users’ experiences of intelligent personal assistants, in: Proceedings of the 19th International Conference on Human-Computer Interaction with Mobile Devices and Services, MobileHCI ’17, 2017.
[52] B. R. Cowan, H. P. Branigan, H. Begum, L. McKenna, É. Székely, They know as much as we do: Knowledge estimation and partner modelling of artificial partners, in: Proceedings of the 39th Annual Meeting of the Cognitive Science Society, CogSci ’17, 2017.
[53] A. Sciuto, A. Saini, J. Forlizzi, J. I. Hong, “Hey Alexa, what’s up?”: A mixed-methods studies of in-home conversational agent usage, in: Proceedings of the 2018 Designing Interactive Systems Conference, DIS ’18, 2018, pp. 857–868.
[54] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, et al., Using recurrent neural networks for slot filling in spoken language understanding, IEEE/ACM Trans. Audio, Speech and Lang. Proc. 23 (2015) 530–539.
[55] B. Liu, I. Lane, Attention-based recurrent neural network models for joint intent detection and slot filling, in: Interspeech 2016, 2016, pp. 685–689.
[56] J. Schatzmann, B. Thomson, K. Weilhammer, H. Ye, S. Young, Agenda-based user simulation for bootstrapping a POMDP dialogue system, in: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, 2007, pp. 149–152.
[57] Z. Yang, Z. Hu, C. Dyer, E. P. Xing, T. Berg-Kirkpatrick, Unsupervised text style transfer using language models as discriminators, in: Advances in Neural Information Processing Systems 31, NIPS ’18, 2018, pp. 7287–7298.
[58] Z. Fu, X. Tan, N. Peng, D. Zhao, R. Yan, Style transfer in text: Exploration and evaluation, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[59] T. Shen, T. Lei, R. Barzilay, T. Jaakkola, Style transfer from non-parallel text by cross-alignment, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS ’17, 2017, pp. 6833–6844.
[60] T. Wilson, Information needs and uses: Fifty years of progress, Fifty Years of Information Progress: A Journal of Documentation Review 28 (1994) 15–51.
[61] D. Ellis, The dilemma of measurement in information retrieval research, J. Am. Soc. Inf. Sci. 47 (1996) 23–36.
[62] E. M. Voorhees, D. K. Harman, TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing), The MIT Press, 2005.
[63] N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF, volume 41 of The Information Retrieval Series, Springer, 2019.
[64] D. Kelly, Methods for evaluating interactive information retrieval systems with users, Found. Trends Inf. Retr. 3 (2009) 1–224.