<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Conversational AI from an Information Retrieval Perspective: Remaining Challenges and a Case for User Simulation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krisztian Balog</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Stavanger</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Conversational AI is an emerging field of computer science that engages multiple research communities, from information retrieval to natural language processing to dialogue systems. Within this vast space, we focus on conversational information access, a problem that is uniquely suited to be addressed by the information retrieval community. We argue that despite the significant research activity in this area, progress is mostly limited to component-level improvements. There remains a disconnect between current efforts and truly conversational information access systems. Apart from the inherently challenging nature of the problem, the lack of progress can, in large part, be attributed to the shortage of appropriate evaluation methodology and resources. This paper highlights challenges that render both offline and online evaluation methodologies unsuitable for this problem, and discusses the use of user simulation as a viable solution.</p>
      </abstract>
      <kwd-group>
        <kwd>Conversational information access</kwd>
        <kwd>conversational AI</kwd>
        <kwd>user simulation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
        <p>Conversational AI may be seen as the holy grail of computer science: building machines that are capable of interacting with people in a human-like way. With rapid advances in AI technology, there are reasons to believe that such an ambition is within reach [1]. Conversational AI is a vast and complex problem, which requires a combination of methods, tools, and techniques from multiple fields of computer science, including but not limited to artificial intelligence (AI), natural language processing (NLP), machine learning (ML), dialogue systems (DS), recommender systems (RecSys), human-computer interaction (HCI), and not least information retrieval (IR). Each of these fields may have its own particular interpretation of what conversational AI should entail, and a specific focus on certain of the research challenges involved. For example, in spoken dialogue systems the main motif is being able to talk to machines, i.e., developing speech-based human-computer interfaces [<xref ref-type="bibr" rid="ref24">2</xref>], and thus automatic speech recognition is a central component. Many other communities, on the other hand, assume a chat-based interface, and voice is not among the supported modalities. At the same time, there are many shared aspects, including handling the semantics involved in the dialogue process, generating contextually appropriate responses, and developing effective end-to-end (neural) architectures, which engage multiple research communities. In this paper, we focus on the problem of conversational information access (CIA), one that the IR community is uniquely suited to address.</p>
      <p>Conversational search, or conversational information seeking, was already identified in 2012 as a research direction of strategic importance in IR [3], and its significance was re-iterated in 2018 [4]. There, the problem focus was defined to include complex user goals that require multi-step information seeking, exploratory information gathering, and multi-step task completion and recommendation, as well as dialogue settings with variable communication channels. Our analysis of recent work, however, leads us to the observation that current efforts do not seem to be fully aligned with the directions set out there. In terms of end-to-end tasks, there are two main threads of work: conversational QA and conversational recommendation. Currently, these are treated as two separate types of systems, with different goals, architectures, and evaluation criteria. Instead, for more effective assistance of users, the two should be seamlessly integrated in CIA systems, thereby moving from a siloed to a more unified view. Additionally, the multi-modality of interactions needs to be more fully embraced, in order to more actively support effective interaction [5]. On the component level, most proposed techniques are not truly conversational in the sense that they are applicable to any interactive IR system (e.g., modern web search engines). A critical blocker to progress, on both the end-to-end and component levels, is the shortage of appropriate evaluation methodology and resources.</p>
      <p>DESIRES 2021 – 2nd International Conference on Design of Experimental Search &amp; Information REtrieval Systems, September 15–18, 2021, Padua, Italy
krisztian.balog@uis.no (K. Balog)
https://krisztianbalog.com/ (K. Balog)
ORCID: 0000-0003-2762-721X (K. Balog)</p>
      <p>© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>In summary, the contributions of this paper are threefold.
• We argue for a broader interpretation of conversational information access, one that embraces multiple user goals (mixing task-oriented and QA elements) and multi-modal interactions (Sect. 2).
• We provide a synthesis of progress on conversational information access and identify open challenges around methods and evaluation (Sect. 3).
• We argue for (a more extensive) use of simulation as a viable evaluation paradigm for conversational information access and describe a simulator architecture (Sect. 4).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Defining Conversational Information Access</title>
    </sec>
    <sec id="sec-3">
      <sec id="sec-3-1">
        <p>This section defines conversational information access and places it in the broader context of conversational AI.</p>
        <sec id="sec-3-1-1">
          <title>2.1. Conversational AI: The big picture</title>
          <p>Conversational AI 1 is casually used to denote a broad
range of systems that are capable of (some degree of)
natural language understanding and responding in a way
that mimics human dialogue. A conversational AI system
may thus be considered successful if it offers an
experience that is indistinguishable from what could have
been delivered by a human. These systems often focus
on a particular type of conversational support, naturally
lending themselves to categorization.</p>
          <p>1 In this paper, the terms conversational AI, conversational agent, and dialogue system are used interchangeably. We, however, avoid using the term chatbot, which has a different meaning in industrial and academic contexts; in the former case it refers to a task-oriented system, while in the latter it means a non-task-oriented system [7].</p>
          <p>2.1.1. Traditional 2-way Categorization</p>
          <p>Traditionally, conversational agents are categorized as being goal-driven (or task-oriented) or non-goal-driven (also known as chatbots) [8, 9, 7]. Goal-driven systems aim to assist users in completing some specific task. Dialogues are constrained to a specific domain and characterized by having a designated structure, designed for particular tasks within that domain. The main success criterion for the conversational agent is its ability to help the user solve their task as efficiently as possible. Typical examples include travel planning and appointment scheduling.</p>
          <p>Non-goal-driven systems, on the other hand, aim to carry on an extended conversation (“chit-chat”) with the goal of mimicking unstructured human-human interactions. The main purpose of these systems is usually entertainment or providing an “AI companion” [10]. Therefore, the objective for these systems is to be able to talk about different topics in an engaging and cohesive manner.</p>
          <p>2.1.2. Contemporary 3-way Categorization</p>
          <p>Most recently, the traditional categorization has been extended with a third category, interactive question answering (QA) [1, 6], in recognition of the fact that it fits neither task-oriented dialogue nor social chat, but deserves a separate category in its own right. Interactive QA systems are designed to provide answers to specific questions. They are not characterized by a rigid dialogue flow, although they typically follow a question-answer pattern. Apart from some notable recent examples [11], the human-like conversation aspect is much less pronounced for QA systems than for the other two types of systems, and evaluation is restricted to answer correctness.</p>
          <p>Table 1 summarizes the characteristics of the three categories of conversational AI systems. Given their unique goals and objectives, each of these problem categories is addressed by a distinctive system architecture [1, 6].</p>
          <p>Figure 1: Example conversation with a CIA system.
User: I want to buy new running shoes.
System: My records say that you have been using a Nike Pegasus 33 before. How did you like that?
User: I liked it a lot on tarmac, but my feet often hurt a bit on very long asphalt runs.
System: Here are some alternatives for you. Of these, the ASICS Gel Nimbus 23 is especially renowned for its cushioned midsole.
User: What is the midsole?
System: The midsole is the bed of foam that lies between your foot and the ground. This is the part of the shoe responsible for it feeling soft or hard.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>2.2. Conversational Information Access</title>
          <p>Building on [4], we use the term conversational information access (CIA) to define a subset of conversational AI systems that specifically aim at a task-oriented sequence of exchanges to support multiple user goals, including search, recommendation, and exploratory information gathering, that require multi-step interactions over possibly multiple modalities. Further, these systems are expected to learn user preferences, personalize responses accordingly, and be capable of taking initiative.</p>
          <p>Consider the conversation shown in Fig. 1, illustrating some of the above requirements. It is primarily a task-oriented dialogue (the user wanting to buy new running shoes), which requires an exploration of the item space. Assuming a chat-based interface, this can be done most effectively by combining multiple modalities; not just text, but also, in this example, a carousel for cycling through items. Up until the second user utterance, it is a strictly task-oriented sequence of exchanges (cf. the task-oriented category in Table 1). But, then, the third user utterance breaks out of the task flow and switches to “QA mode” (cf. interactive QA in Table 1).</p>
          <p>2.2.1. From Siloed to Unified View</p>
          <p>One key realization the above example is meant to illustrate is that conversational information access cuts across the task-oriented and interactive QA categories. This blending makes CIA suited to assist users meaningfully with their needs. Conversely, existing work—and, importantly, evaluation initiatives—in IR almost exclusively focus on a question-answering paradigm (see Sect. 3). This does not allow for interaction with sets of items—one of the main properties that makes a search system conversational [12].</p>
          <p>It has been shown that the “siloed” view, represented by the three categories in Table 1, in practice does not align well with users’ information needs and behavior [13]. Gao et al. [1] acknowledge the need for a “top-level bot” that would act as a broker and switch between different user goals. Most commercial assistants are hybrid systems, with different degrees of support for switching. There is, however, little published research on it. In summary, there is a need for a more holistic view where multiple user goals are supported.2</p>
          <p>2.2.2. Multi-modality</p>
          <p>Another key point highlighted by the example in Fig. 1 is the need for embracing multi-modality. Text-only responses are motivated by an audio-only channel, without a screen [14]. However, more often than not a chat-based interface is available, which allows for a richer set of input controls and navigational components. These, in turn, would enable CIA systems to more actively support effective interaction [5]. We note that the need for multi-modality has been recognized independently by other scholars as well [15].</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3. Progress to Date and Remaining Challenges</title>
        <p>In this section, we reflect on progress achieved so far, organized around methods and evaluation, and identify remaining challenges.</p>
        <p>3.1. Methods</p>
        <p>In our discussion, we distinguish between end-to-end conversational tasks and specific component-level sub-tasks.</p>
        <p>3.1.1. End-to-end tasks</p>
        <p>There are two main tasks that have received attention: conversational QA [<xref ref-type="bibr" rid="ref13">16, 1, 17</xref>] and conversational recommendation [<xref ref-type="bibr" rid="ref22 ref25">18, 19, 20</xref>]. What distinguishes conversational QA from traditional single-turn QA is the need for contextual understanding. Hence, much of the research revolves around modeling conversation history [<xref ref-type="bibr" rid="ref13">16, 17, 21</xref>].</p>
      </sec>
      <sec id="sec-3-3">
        <p>2 We note that this problem is not specific to IR. However, conversational information access is a good starting point that the IR community is uniquely suited to address. Lessons and findings could then be generalized to broader applications.</p>
        <p>In terms of evaluation, however, the problem is simplified to a single-turn passage retrieval task, where the relevance of the system response at a given turn does not consider the responses given by the system at earlier turns [<xref ref-type="bibr" rid="ref13">22, 17</xref>]. It is only in conversational recommender systems that the multi-turn nature of conversations is more fully embraced [<xref ref-type="bibr" rid="ref22 ref25">18, 19, 20</xref>].</p>
        <p>3.1.2. Component-level sub-tasks</p>
        <p>Recently, progress has been made on specific sub-tasks for CIA, including response retrieval [23] and generation [24], query resolution [25, 26], asking clarifying questions [27] or suggesting questions [28], predicting user intent [29], and preference elicitation [<xref ref-type="bibr" rid="ref25">18, 30</xref>]. Each of these studies makes the point that the conversational setting calls for a different set of approaches. However, most of these sub-tasks are applicable in any interactive IR context, adhering to the stance that search is inherently a conversational experience: it is a dialogue between a human and a search engine [31]. From this perspective, there has been substantial progress, especially on the mixed-initiative aspect, e.g., question clarifications and suggestions [27, 28]. Alternatively, one may take a more critical stance and ask: What separates conversational information access from any other interactive IR system (most prominently: search engines)? According to Croft [5], the key distinguishing factor is that a conversational system is a more active partner in the interaction. In that regard, there is surprisingly little work, with only a handful of notable exceptions [32, 11].</p>
        <sec id="sec-3-3-1">
          <title>3.2. Evaluation</title>
          <p>3.2.1. Offline Evaluation</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
      <title/>
      <p>Traditionally, system-oriented evaluation in IR has been performed using offline test collections, following the Cranfield paradigm [33]. This rigorous methodology ensures the repeatability and reproducibility of experiments, and has been instrumental to progress in the field. To date, work on CIA still employs offline evaluation [22, 27], but this has severe limitations. First, reusability requires that the system is limited to selecting the best response, in answer to a user utterance, from a restricted set of possible candidates (i.e., some predefined corpus of responses).</p>
      <p>Second, it is limited in scope to a single conversation turn and does not consider the dialogue history that led to that particular user utterance (cf. red area in Fig. 2). An alternative is to let human evaluators assess an entire conversation, once it has taken place [6]. However, this is a single path (see blue area in Fig. 2), without considering the other choices the user could have taken during the course of the dialogue. Moreover, it is expensive, time-consuming, and does not scale. Most importantly, it would not yield a reusable test collection. In summary, offline test collections have their merits, but their use is limited to the purpose of evaluating specific components, in isolation. Further, the choice of evaluation metrics is an open challenge [34].</p>
      <p>3.2.2. Online Evaluation</p>
      <p>Online evaluation involves fielding an IR system to real users and observing how they interact with the system in situ, in their natural task environments [35]. This requires a live service as a research platform. Currently, this possibility is only available to researchers working at major service providers that develop conversational assistants (Google, Microsoft, Apple, Amazon). Even there, experimentation with live users is severely limited due to scalability, quality, and ethical concerns. Of these companies, only Amazon has decided to open up its platform for academic research, by organizing the Alexa Prize Challenge [36]. It represents a unique opportunity for academics to perform research with a live system used by millions of users, and provides university teams with real user conversational data at scale. While this effort points in the right direction, it is inherently limited in that it addresses social conversations (“chit-chat”), with the target goal of conversing coherently and engagingly with humans on popular topics such as sports, politics, or technology for 20 minutes. This is a non-goal-driven task, which is rather different from goal-driven CIA. Currently, there is no publicly available research platform for CIA. Living labs represent a novel evaluation paradigm for IR [37], which allows researchers to evaluate their methods with real users of live search services. This methodology has been successfully employed in world-wide benchmarking campaigns [38, 39].</p>
      <p>It, however, needs to be extended to a conversational setting, which brings about methodological and practical challenges.</p>
      <p>3.2.3. User Simulation</p>
      <p>With a long history in the field of spoken dialogue systems, user simulation is seen as a critical tool for automatic dialogue management design [40]. The idea is to train a user model that is “capable of producing responses that a real user might have given in a certain dialog situation” [40]. This is in line with our goals, but there are two crucial differences. First, the primary purpose of user simulation in DS is to generate synthetic training data at scale, which in turn can be used to learn dialogue strategies (typically, using reinforcement learning). Assessment of the quality of simulated dialogues and user simulation methods, however, is an open issue [41]. Second, dialogue systems, as well as recent work on conversational recommender systems [<xref ref-type="bibr" rid="ref25">18</xref>], are focused on supporting the user with a single goal that can be fulfilled by eliciting preferences on a set of attributes. CIA systems, on the other hand, need to deal with complex search and recommendation scenarios. This requires a more holistic user model.</p>
      <p>3.3.3. Evaluation</p>
      <p>There is a need to go beyond turn-based evaluation to multi-turn-based and eventually end-to-end evaluation. To be able to perform end-to-end evaluation of CIA systems, additional methodologies need to be considered, including online evaluation and simulated users. For online evaluation, the living labs paradigm represents an alternative, but it requires agreement on a canonical architecture in order to be able to open up individual components for experimentation. Further, it requires an existing service with live users, which is currently lacking. It should be noted that the need for such an open research platform has been identified, and a plan for the academic search domain has recently been outlined [44]. As for simulation, most existing approaches are meant to advance reinforcement learning techniques in a strictly goal-oriented setting. This is different from our purpose of evaluation. The simulation techniques that are currently used for evaluation lack the desired conversational complexity.</p>
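To make concrete the notion of a user model that "produces responses a real user might have given", a minimal agenda-based simulated user, in the spirit of the agenda-based approach from spoken dialogue systems, could look as follows. This is only a sketch; the tuple-based act representation and all names are hypothetical, not taken from the cited work:

```python
class AgendaBasedUser:
    """Minimal simulated user: pops pending user acts from an agenda and
    reacts to system acts (a sketch, not a full simulator)."""

    def __init__(self, goal):
        # goal: constraints the user wants satisfied, e.g. {"category": "running shoes"}
        self.goal = dict(goal)
        # Agenda of acts the user still intends to perform.
        self.agenda = [("inform", slot, value) for slot, value in goal.items()]

    def respond(self, system_act):
        kind = system_act[0]
        if kind == "request":              # system asks about a slot
            slot = system_act[1]
            if slot in self.goal:
                return ("inform", slot, self.goal[slot])
            return ("negate", slot, None)  # slot not part of the user's goal
        if kind == "offer":                # system recommends an item
            return ("accept", None, None)
        if self.agenda:                    # otherwise push the task forward
            return self.agenda.pop()
        return ("bye", None, None)


user = AgendaBasedUser({"category": "running shoes", "surface": "asphalt"})
print(user.respond(("request", "category")))  # ('inform', 'category', 'running shoes')
print(user.respond(("offer", "ASICS Gel Nimbus 23")))  # ('accept', None, None)
```

As the paper argues, such strictly goal-oriented agendas lack the conversational complexity (goal switching, multi-modality, learning) that CIA evaluation would require; the sketch illustrates the baseline being extended, not the target.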
        <sec id="sec-3-4-1">
          <title>3.3. Summary and Remaining Challenges</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. A Case for Simulation</title>
      <p>3.3.1. Understanding User Needs and Behavior</p>
      <sec id="sec-4-1">
        <p>Current characterizations of information seeking behavior for CIA are limited either in the set of actions considered [42] or in the sequences of conversational turns [43]. To cater for the functionality defined by Radlinski and Craswell [12], and further expanded by us in Sect. 2.2, one would need user and interaction models capable of representing (1) multi-modal interactions (speech, text, pointing&amp;clicking), (2) users’ ability to change their state of knowledge (learn and forget), and (3) users’ ability to learn how a system works and what its limits are (and change their expectations and behavior accordingly).</p>
        <p>3.3.2. Truly Conversational Methods</p>
      </sec>
      <sec id="sec-4-2">
        <p>Conversational recommendation and QA have been studied as end-to-end tasks. However, as we argued in Sect. 2.2, in practice these two are not clearly delineated applications, but rather different “modes” that should be seamlessly integrated within a CIA system. There has been significant progress on various components, which are indispensable building blocks. Integrating these into a unified system that supports multiple user goals remains an open challenge [1]. Further open questions in this space include (1) deciding when and what type of initiative a system should take, and (2) determining the best modality based on task and context.</p>
        <p>This section presents a proposal for robust, large-scale automatic evaluation of CIA systems via user simulation.</p>
        <p>4.1. Methodology</p>
      </sec>
      <sec id="sec-4-3">
        <p>Our main hypothesis is that it is possible to simulate human behavior with regard to interacting with CIA systems. To validate this hypothesis, we need to show that simulated users behave indistinguishably from real humans, in the context of a specific conversational application and with respect to specific evaluation measures.</p>
        <p>Formally, let S1 and S2 denote two CIA systems, which differ in some component(s). Both systems are assumed to be operated by a set U of users from some user population. Let us assume that there is a statistically significant difference observed in their relative performance, according to some evaluation measure M, such that M(S1, U) &lt; M(S2, U). Simulation is considered successful if, by engaging a set U* of simulated users, we observe the same relative system differences as with real users, i.e., M(S1, U*) &lt; M(S2, U*). Further, this observation should generalize across systems S and evaluation measures M.</p>
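The validation criterion can be sketched in code. The following is a minimal illustration (all function names and the per-user metric scores are fabricated for this example; a real validation would additionally require a statistical significance test on the observed difference):

```python
from statistics import mean


def relative_order(s1_scores, s2_scores):
    """Direction of the performance difference between systems S1 and S2,
    given per-user scores M(S, u): +1 if S1 is better, -1 if worse, 0 if tied."""
    d = mean(s1_scores) - mean(s2_scores)
    if d > 0:
        return 1
    if d == 0:
        return 0
    return -1


def simulation_valid(real_s1, real_s2, sim_s1, sim_s2):
    """Simulation is considered successful if simulated users (U*) reproduce
    the relative system ordering observed with real users (U)."""
    return relative_order(real_s1, real_s2) == relative_order(sim_s1, sim_s2)


# Fabricated per-user metric scores, for illustration only:
real_s1, real_s2 = [0.42, 0.38, 0.45], [0.51, 0.47, 0.55]  # M(S1,U) below M(S2,U)
sim_s1, sim_s2 = [0.40, 0.36], [0.49, 0.52]                # same ordering under simulation
print(simulation_valid(real_s1, real_s2, sim_s1, sim_s2))  # True
```

In the paper's terms, generalization would then be checked by repeating this comparison across system pairs S and evaluation measures M.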
        <p>The above formulation ensures that the behavior of simulated users aligns with that of real users. Notice that to be able to perform this validation, an operational CIA system is needed; we discuss the practical aspects of setting up such an experimental platform below, in Sect. 4.4. For the human evaluation part, i.e., measuring M(S, U), two distinct approaches may be employed: (1) asking users themselves, inside the CIA system, to give feedback on either the entire conversation or on specific system utterances, and (2) sampling interesting/meaningful branches from conversation logs, to be annotated by external human labelers (e.g., crowd workers).</p>
        <p>Figure 2: Simulated user interacting with a conversational information access system via text/voice and pointing/clicking. The simulator comprises natural language understanding (NLU), natural language generation (NLG), planning, execution, and learning components, built on top of user, interaction, and mental models.</p>
        <p>Once a user simulator is created and validated against real users, it may be used for evaluating a given CIA system. It is important to note that, in principle, a given user simulator instance should be used only once, the same way that an offline test collection should only be used once—to avoid overfitting systems to a particular test suite.</p>
        <sec id="sec-4-3-1">
          <title>4.2. Requirements</title>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <p>We identify a realistic user simulator with the ability of capturing:
(R1) Personal interests and preferences, and the changes of preferences over time;
(R2) Persona (personality, educational and socio-economical background, etc.);
(R3) Multi-modality of interactions (speech, text, pointing&amp;clicking, etc.);
(R4) The user’s ability to change their state of knowledge (learn and forget);
(R5) The user’s ability to learn how a system works and what its limits are, and change their expectations and behavior accordingly.</p>
        <p>We describe its main components below, and provide specific starting points for each of them. User, interaction, and mental models provide the foundation for simulation behavior.</p>
        <p>• User model. To represent all personal information related to a given user, including persona (R2), preferences (R1), and knowledge (R4), personal knowledge graphs (PKGs) [45] may be used. The reason for using a PKG is to ensure the consistency of the preferences that are revealed by the simulated user, as is done in [46]. To fully address R1 and R4, PKGs will need to be extended along two dimensions: (1) include concepts, in addition to entities, to represent the user’s knowledge of specific topics, with a further distinction to be made between entities/concepts the user has heard about vs. has in-depth knowledge of; (2) capture the temporal scope, to be able to distinguish between short- vs. long-term preferences and fresh vs. diminishing knowledge.</p>
        <p>• Interaction model. To characterize the CIA process between humans and systems for a given application, the key actions and decisions that manifest in dialogues need to be abstracted out. A starting point for a taxonomy of user/system actions is provided in [47]. This taxonomy may be revised and extended to multi-modal interactions (R3) based on conversations collected in laboratory user studies with an “idealized” CIA system using the Wizard-of-Oz approach [48], and from interaction data from actual CIA systems.</p>
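The two PKG extensions described above could be represented along these lines. This is a hypothetical sketch of the data structure, not the PKG model of the cited works:

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeItem:
    """An entity or concept in the simulated user's personal knowledge graph."""
    name: str
    kind: str          # "entity" or "concept"            (extension 1)
    depth: str         # "heard-about" or "in-depth"      (extension 1)
    acquired_at: float  # timestamp; knowledge may diminish over time (extension 2)


@dataclass
class Preference:
    item: str
    score: float       # e.g., -1.0 (dislike) to 1.0 (like)
    long_term: bool    # short- vs. long-term preference   (extension 2)


@dataclass
class UserModel:
    persona: dict = field(default_factory=dict)        # R2
    preferences: list = field(default_factory=list)    # R1
    knowledge: list = field(default_factory=list)      # R4

    def knows_in_depth(self, name: str) -> bool:
        return any(k.name == name and k.depth == "in-depth" for k in self.knowledge)


# Example: a user who has merely heard of "midsole" would ask the system about it
# (cf. the third user utterance in Fig. 1).
u = UserModel(
    persona={"occupation": "student"},
    preferences=[Preference("Nike Pegasus 33", 0.6, long_term=False)],
    knowledge=[KnowledgeItem("midsole", "concept", "heard-about", acquired_at=0.0)],
)
print(u.knows_in_depth("midsole"))  # False
```

Keeping preferences and knowledge in one consistent structure is what allows the simulator to reveal preferences without contradicting itself across turns.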
      </sec>
      <sec id="sec-4-5">
        <title/>
        <p>We note that not all these requirements are critical for an initial simulator, and some may be highly ambitious. Nevertheless, we shall discuss our conceptual architecture with reference to these requirements.</p>
        <sec id="sec-4-5-1">
          <title>4.3. Architecture</title>
          <p>• Mental model. To capture how a particular user thinks about a given CIA system (R5), mental models need to be developed. The thinking-aloud method is commonly used for such purposes in usability testing, psychology, and the social sciences [49]. There is work in HCI on identifying and analyzing experiences and barriers qualitatively [50, 51, 52, 53]. A main difference from those studies is that the goal here is to build a quantifiable mental model that represents the user’s expectations and perceived capabilities of a CIA system.</p>
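A quantifiable mental model of the kind called for above might, for instance, track a per-capability success estimate that is updated after each interaction. This is a hypothetical sketch, not a model from the cited studies:

```python
class MentalModel:
    """Tracks the simulated user's belief that the system can execute a given
    type of action (e.g., voice navigation), as a smoothed success rate."""

    def __init__(self):
        self.successes = {}  # capability -> number of successful attempts
        self.attempts = {}   # capability -> total attempts

    def update(self, capability: str, success: bool):
        self.attempts[capability] = self.attempts.get(capability, 0) + 1
        self.successes[capability] = self.successes.get(capability, 0) + int(success)

    def expected_success(self, capability: str, prior: float = 0.5) -> float:
        """Smoothed estimate; unseen capabilities fall back to the prior."""
        n = self.attempts.get(capability, 0)
        return (self.successes.get(capability, 0) + prior) / (n + 1)


m = MentalModel()
m.update("voice-navigation", False)
m.update("voice-navigation", False)
# After repeated failures, the simulated user would prefer clicking instead.
print(round(m.expected_success("voice-navigation"), 2))  # 0.17
```

The estimate feeds the execution stage described in Sect. 4.3: when the expected success of one modality drops, the simulated user switches to another.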
          <p>(… specific and is shared by all simulated users.) To make the user model realistic, it should be anchored in actual user profiles (while maintaining k-anonymity). For that, a generative model may be used, with parameters learned on publicly available corpora, e.g., item ratings for recommendation scenarios [46] and discussion fora for information seeking tasks. The mental model may be initialized using a small set of pre-trained skill profiles, created as part of laboratory user studies.</p>
          <p>Next, we describe the components responsible for interacting with CIA systems.</p>
          <p>• Natural language understanding. Obtaining a structured representation from a system utterance is analogous to NLU in dialogue systems and involves domain classification, intent determination, and slot filling [7]. These tasks are effectively tackled by neural architectures [54, 55, 1]. These approaches, however, are created for conversational systems and assume “perfect” world knowledge, based on some underlying knowledge repository. For user simulation, they need to be adapted to consider personal knowledge. For example, the user may or may not be able to guess the corresponding type or category of an entity/concept that is mentioned for the first time, depending on their knowledge of the given domain.</p>
          <p>• Response generation. Determining how a simulated user should respond to a system utterance is modeled in three stages: planning, execution, and learning. In the planning stage, a structured representation of an information need (what to ask the system) or user response (how to respond if prompted by the system) is generated. This is informed by the user model, in terms of interests and preferences, as well as the interaction model, to help interpret what the system is asking in terms of a task-specific dialog flow. In the execution stage, the simulator decides on the course of execution, based on the user’s mental model of the given system’s capabilities (e.g., it will not attempt to navigate a list using voice, but rather click, if voice navigation did not function in the past as expected). Based on how the system responds to a given user utterance, the learner module can make updates to the user model (whether the user learned something new about a given topic) and also to the mental model of the system (how successful it was in understanding/executing what was requested). Response generation can be framed within the well-established agenda-based simulation approach [56].</p>
          <p>• Natural language generation. Finally, a structured intent representation (what to say to the system) needs to be turned into a natural language utterance (how to say it). The exact articulation is influenced by the persona and knowledge level of the simulated user. A possible starting point is to generate templated responses and then apply transfer learning for text [57, 58, 59]. Later, more end-to-end approaches may also be de-</p>
          <p>From a system architecture perspective, the user simulator in many regards resembles a CIA system, comprising natural language understanding, dialog management, and natural language generation components. One major difference is that CIA systems may be assumed (in fact, expected) to have “perfect world knowledge,” only limited by the availability of data. Conversely, user simulation also needs to consider the user’s knowledge level in language understanding and generation. Another major difference is that while a CIA system is modeled after a single person, each simulated user has a unique persona. This requires each of the components to be parametrizable with respect to personal characteristics. Further, the choice of dialogue actions is affected by the user’s mental model of the system (i.e., what the system is perceived to be able to understand and execute).</p>
          <p>4.4. Operationalization
Note that simulation capability is application specific. That is, different simulators would need to be trained for item recommendation, interactive QA, and, ultimately, for scenarios that cater for multiple user goals. To ensure that the behavior of simulated users aligns with that of human users, an operational CIA system with actual users would also be needed for each application. Setting up such applications should be seen as a community effort. Indeed, discussions in this direction have already begun, and one specific proposal for a CIA system supporting scholarly activities has been outlined in [44]. There are a number of challenges involved in building a CIA system that can serve as such a living lab. One is that it would have insufficient traffic for meaningful online evaluation (an issue that has indeed been encountered in the past [38]). To remedy that, additional users may be recruited, e.g., by involving students as part of their course work or hiring workers on crowdsourcing platforms (i.e., increasing traffic volume). Another potential difficulty is that building a sufficiently performant CIA system for
the application at hand turns out to be too challenging
vised, eliminating the need for manual template gen- (thereby making the online service unattractive to users).
eration. It should be noted that not all requests get While this is not easily solvable on the system front, it is
passed through NLG, as the executor may decide to possible to manage users’ expectations. Indeed, one of
use a diferent modality. the key ideas behind operating in the academic domain
Each simulated user requires instantiating the user and in [44] is to build a tool by researchers to researchers,
mental models. (The interaction model is application- and embrace its imperfection.</p>
          <p>Simulation approaches are evaluated by comparing them against real users on a given live research platform. In practice this means that a small portion of the usage data collected from humans (i.e., the first few weeks of the live evaluation period) is disclosed and can be used for training the simulators, while the remaining data is used for evaluating them. The systems participating in the live evaluation (referred to as experimental systems) are also evaluated using the different simulators. Ultimately, the question we seek to answer is whether we can observe the same relative ranking of experimental systems with real users (based on the live experiment) as with simulated ones; being able to answer this question positively would mean that the simulator is sufficiently realistic.</p>
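          <p>This validation protocol amounts to a rank-correlation test: rank the experimental systems by some target metric as measured with real users, rank them again as measured with the simulator, and check that the two orderings agree. A minimal sketch, using Kendall’s tau as the agreement measure (the system names and scores below are hypothetical):</p>
          <preformat>
```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation between two score dicts over the same systems."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s1, s2 in combinations(systems, 2):
        # Same sign of the score difference under both rankings = concordant pair.
        d = (scores_a[s1] - scores_a[s2]) * (scores_b[s1] - scores_b[s2])
        if d > 0:
            concordant += 1
        elif d != 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical results: mean task success per experimental system, measured
# once with real users (live experiment) and once with simulated users.
real = {"sysA": 0.62, "sysB": 0.55, "sysC": 0.48, "sysD": 0.41}
simulated = {"sysA": 0.70, "sysB": 0.58, "sysC": 0.60, "sysD": 0.45}

print(round(kendall_tau(real, simulated), 2))  # 0.67: one swapped pair (sysB, sysC)
```
          </preformat>
          <p>A tau close to 1 indicates that the simulator preserves the relative ranking of systems observed with real users; in practice, the correlation would also be tested for statistical significance.</p>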
          <p>5. Conclusions and Future Directions</p>
          <p>In this paper, we have considered conversational AI from an IR perspective, and focused in particular on the problem of conversational information access, with the goal of identifying open challenges that the IR community is uniquely suited to address.</p>
          <p>One critical area concerns the understanding of users’ information needs and their information seeking behavior, one of the fundamental research directions in IR from the very beginning [60]. Currently, there is a lack of understanding of what would be desirable conversational experiences for information access scenarios that combine multiple user goals. Consequently, there are no suitable models of user behavior that could serve as foundations for unified architectures that can support such behavior.</p>
          <p>Another aspect that represents a major open challenge is evaluation. Measurement is an area where IR has an unparalleled history [61, 62, 63, 64, 33, 35]. Building on the rich tradition and experience of community benchmarking campaigns such as TREC [62] and CLEF [63], our community is in a unique position to take the lead on the development of novel evaluation paradigms and methodologies. This paper has outlined a specific plan for such an effort, centered on user simulation.</p>
          <p>References</p>
          <p>[1] J. Gao, M. Galley, L. Li, Neural approaches to conversational AI, Found. Trends Inf. Retr. 13 (2019) 127–298.
[2] M. F. McTear, Spoken dialogue technology: Enabling the conversational user interface, ACM Comput. Surv. 34 (2002) 90–169.
[3] J. Allan, B. Croft, A. Moffat, M. Sanderson, Frontiers, challenges, and opportunities for information retrieval: Report from SWIRL 2012, the second strategic workshop on information retrieval in Lorne, SIGIR Forum 46 (2012) 2–32.
[4] J. S. Culpepper, F. Diaz, M. D. Smucker, Research frontiers in information retrieval: Report from the third strategic workshop on information retrieval in Lorne (SWIRL 2018), SIGIR Forum 52 (2018) 34–90.
[5] W. B. Croft, The importance of interaction for information retrieval, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, 2019, pp. 1–2.
[6] J. Deriu, A. Rodrigo, A. Otegi, G. Echegoyen, S. Rosset, E. Agirre, M. Cieliebak, Survey on evaluation methods for dialogue systems, Artificial Intelligence Review (2020) 1573–7462.
[7] D. Jurafsky, J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd Edition draft, Prentice Hall, Pearson Education International, 2019.
[8] H. Chen, X. Liu, D. Yin, J. Tang, A survey on dialogue systems: Recent advances and new frontiers, SIGKDD Explor. Newsl. 19 (2017) 25–35.
[9] I. V. Serban, R. Lowe, P. Henderson, L. Charlin, J. Pineau, A survey of available corpora for building data-driven dialogue systems: The journal version, Dialogue &amp; Discourse 9 (2018) 1–49.
[10] L. Zhou, J. Gao, D. Li, H.-Y. Shum, The design and implementation of XiaoIce, an empathetic social chatbot, Comput. Linguist. 46 (2020) 53–93.
[11] I. Szpektor, D. Cohen, G. Elidan, M. Fink, A. Hassidim, O. Keller, S. Kulkarni, E. Ofek, S. Pudinsky, A. Revach, S. Salant, Y. Matias, Dynamic composition for conversational domain exploration, in: Proceedings of The Web Conference 2020, WWW ’20, 2020, pp. 872–883.
[12] F. Radlinski, N. Craswell, A theoretical framework for conversational search, in: Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, CHIIR ’17, 2017, pp. 117–126.
[13] Z. Yan, N. Duan, P. Chen, M. Zhou, J. Zhou, Z. Li, Building task-oriented dialogue systems for online shopping, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI ’17, 2017, pp. 4618–4625.
[14] J. R. Trippas, Spoken Conversational Search: Audio-only Interactive Information Retrieval, Ph.D. thesis, RMIT University, 2019.
[15] Y. Deldjoo, J. R. Trippas, H. Zamani, Towards multimodal conversational information seeking, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21, 2021, pp. 1577–1587.
[16] C. Qu, L. Yang, M. Qiu, W. B. Croft, Y. Zhang, M. Iyyer, BERT with history answer embedding for conversational question answering, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, 2019, pp. 1133–1136.
[17] S. Reddy, D. Chen, C. D. Manning, CoQA: A conversational question answering challenge, Transactions of the Association for Computational Linguistics 7 (2019) 249–266.
[18] Y. Zhang, X. Chen, Q. Ai, L. Yang, W. B. Croft, Towards conversational search and recommendation: System ask, user respond, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM ’18, 2018, pp. 177–186.
[19] D. Jannach, A. Manzoor, W. Cai, L. Chen, A survey on conversational recommender systems, 2020. arXiv:2004.00646.
[20] C. Gao, W. Lei, X. He, M. de Rijke, T. Chua, Advances and challenges in conversational recommender systems: A survey, 2021. arXiv:2101.09459.
[21] C. Zhu, M. Zeng, X. Huang, SDNet: Contextualized attention-based deep network for conversational question answering, 2018. arXiv:1812.03593.
[22] J. Dalton, C. Xiong, J. Callan, TREC CAsT 2019: The Conversational Assistance Track overview, 2020. arXiv:2003.13624.
[23] L. Yang, M. Qiu, C. Qu, J. Guo, Y. Zhang, W. B. Croft, J. Huang, H. Chen, Response ranking with deep matching networks and external knowledge in information-seeking conversation systems, in: The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’18, 2018, pp. 245–254.
[24] Y. Song, C.-T. Li, J.-Y. Nie, M. Zhang, D. Zhao, R. Yan, An ensemble of retrieval-based and generation-based human-computer conversation systems, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI ’18, 2018, pp. 4382–4388.
[25] N. Voskarides, D. Li, P. Ren, E. Kanoulas, M. de Rijke, Query resolution for conversational search with limited supervision, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, 2020, pp. 921–930.
[26] S. Vakulenko, N. Voskarides, Z. Tu, S. Longpre, A comparison of question rewriting methods for conversational passage retrieval, in: Proceedings of the 43rd European Conference on IR Research, ECIR ’21, 2021, pp. 418–424.
[27] M. Aliannejadi, H. Zamani, F. Crestani, W. B. Croft, Asking clarifying questions in open-domain information-seeking conversations, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’19, 2019, pp. 475–484.
[28] C. Rosset, C. Xiong, X. Song, D. Campos, N. Craswell, S. Tiwary, P. Bennett, Leading conversational search by suggesting useful questions, in: Proceedings of The Web Conference 2020, WWW ’20, 2020, pp. 1160–1170.
[29] C. Qu, L. Yang, W. B. Croft, Y. Zhang, J. R. Trippas, M. Qiu, User intent prediction in information-seeking conversations, in: Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, CHIIR ’19, 2019, pp. 25–33.
[30] K. Christakopoulou, F. Radlinski, K. Hofmann, Towards conversational recommender systems, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, 2016, pp. 815–824.
[31] P. Ren, Z. Chen, Z. Ren, E. Kanoulas, C. Monz, M. de Rijke, Conversations with search engines, 2020. arXiv:2004.14162.
[32] S. Zhang, Z. Dai, K. Balog, J. Callan, Summarizing and exploring tabular data in conversational search, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, 2020, pp. 1537–1540.
[33] M. Sanderson, Test collection based evaluation of information retrieval systems, Found. Trends Inf. Retr. 4 (2010) 247–375.
[34] C.-W. Liu, R. Lowe, I. Serban, M. Noseworthy, L. Charlin, J. Pineau, How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP ’16, 2016, pp. 2122–2132.
[35] K. Hofmann, L. Li, F. Radlinski, Online evaluation for information retrieval, Found. Trends Inf. Retr. 10 (2016) 1–117.
[36] A. Ram, R. Prasad, C. Khatri, A. Venkatesh, R. Gabriel, Q. Liu, J. Nunn, B. Hedayatnia, M. Cheng, A. Nagar, E. King, K. Bland, A. Wartick, Y. Pan, H. Song, S. Jayadevan, G. Hwang, A. Pettigrue, Conversational AI: The science behind the Alexa Prize, 2018. arXiv:1801.03604.
[37] A. Schuth, K. Balog, Living labs for online evaluation: From theory to practice, in: Proceedings of the 38th European Conference on Advances in Information Retrieval, ECIR ’16, 2016, pp. 893–896.
[38] R. Jagerman, K. Balog, M. de Rijke, OpenSearch: Lessons learned from an online evaluation campaign, J. Data and Information Quality 10 (2018).
[62] E. M. Voorhees, D. K. Harman (Eds.), TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing), The MIT Press, 2005.
[63] N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF, volume 41 of The Information Retrieval Series, Springer, 2019.
[64] D. Kelly, Methods for evaluating interactive information retrieval systems with users, Found. Trends Inf. Retr. 3 (2009) 1–224.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>paign</surname>
          </string-name>
          ,
          <source>J. Data and Information Quality</source>
          <volume>10</volume>
          (
          <year>2018</year>
          ).
          <source>Speech Commun</source>
          .
          <volume>50</volume>
          (
          <year>2008</year>
          )
          <fpage>630</fpage>
          -
          <lpage>645</lpage>
          . [39]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hopfgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lommatzsch</surname>
          </string-name>
          , L. Kelly, [51]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Cowan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pantidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Coyle</surname>
          </string-name>
          , K. Morrissey,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Information Retrieval Evaluation in a Changing ceedings of the 19th International Conference on</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>World - Lessons Learned</surname>
          </string-name>
          from 20 Years
          <string-name>
            <surname>of</surname>
            <given-names>CLEF</given-names>
          </string-name>
          , Human-Computer Interaction with Mobile Devices
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Springer</surname>
          </string-name>
          ,
          <year>2019</year>
          . and Services,
          <source>MobileHCI '17</source>
          ,
          <year>2017</year>
          . [40]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schatzmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Weilhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stuttle</surname>
          </string-name>
          , S. Young, [52]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Cowan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Branigan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Begum</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. McKenna</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>ment</surname>
            <given-names>strategies</given-names>
          </string-name>
          ,
          <source>Knowl. Eng. Rev</source>
          .
          <volume>21</volume>
          (
          <year>2006</year>
          )
          <fpage>97</fpage>
          -
          <lpage>126</lpage>
          . partners,
          <source>in: Proceedings of the 39th Annual Meet</source>
          [41]
          <string-name>
            <given-names>O.</given-names>
            <surname>Pietquin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hastie</surname>
          </string-name>
          ,
          <article-title>A survey on metrics for the ing of the Cognitive Science Society</article-title>
          , CogSci '
          <volume>17</volume>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>evaluation of user simulations</article-title>
          ,
          <source>Knowl. Eng. Rev. 28</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          (
          <year>2013</year>
          ). [53]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sciuto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Forlizzi</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. I. Hong</surname>
          </string-name>
          , “Hey Alexa, [42]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vakulenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Revoredo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Di Ciccio</surname>
          </string-name>
          , M. de Rijke,
          <article-title>what's up?”: A mixed-methods studies of in-home</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          dialogues,
          <source>in: Proceedings of the 41st European the 2018 Designing Interactive Systems Conference,</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>Conference on IR Research</source>
          , ECIR '
          <volume>19</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>541</fpage>
          -
          <lpage>DIS</lpage>
          '
          <fpage>18</fpage>
          ,
          <year>2018</year>
          , pp.
          <fpage>857</fpage>
          -
          <lpage>868</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          557. [54]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mesnil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          , L. Deng, [43]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Trippas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cavedon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hakkani-Tur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Heck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          , et al.,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>ceedings of the 2017 Conference on Conference Hu- Audio, Speech and Lang. Proc. 23</source>
          (
          <year>2015</year>
          )
          <fpage>530</fpage>
          -
          <lpage>539</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>man Information Interaction and Retrieval</source>
          , CHIIR [55]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Lane</surname>
          </string-name>
          ,
          <article-title>Attention-based recurrent neural</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>'17</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>325</fpage>
          -
          <lpage>328</lpage>
          .
          <article-title>network models for joint intent detection</article-title>
          and slot [44]
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Flekova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Pot- iflling</article-title>
          ,
          <source>in: Interspeech</source>
          <year>2016</year>
          ,
          <year>2016</year>
          , pp.
          <fpage>685</fpage>
          -
          <lpage>689</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>thast</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Radlinski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Sanderson</surname>
            , S. Vakulenko, [56]
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Schatzmann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Thomson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Weilhammer</surname>
          </string-name>
          , H. Ye,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>prototype: Scholarly conversational assistant</source>
          ,
          <year>2020</year>
          .
          <article-title>strapping a POMDP dialogue system</article-title>
          ,
          <source>in: Human</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>arXiv:2001.06910. Language Technologies</source>
          <year>2007</year>
          : The Conference of [45]
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          , T. Kenter,
          <article-title>Personal knowledge graphs: the North American Chapter of the Association</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>A research agenda</article-title>
          ,
          <source>in: Proceedings of the 2019 for Computational Linguistics;</source>
          Companion Volume,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>ACM SIGIR International Conference on Theory of Short Papers</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>149</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Information</surname>
            <given-names>Retrieval</given-names>
          </string-name>
          ,
          <source>ICTIR '19</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>220</lpage>
          . [57]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          , T. Berg[46]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , K. Balog,
          <article-title>Evaluating conversational rec- Kirkpatrick, Unsupervised text style transfer using</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>ceedings of the 26th ACM SIGKDD International in Neural Information Processing Systems</source>
          <volume>31</volume>
          , NIPS
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>Conference on Knowledge Discovery &amp; Data Min- '18</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>7287</fpage>
          -
          <lpage>7298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>ing</surname>
          </string-name>
          ,
          <source>KDD '20</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1512</fpage>
          -
          <lpage>1520</lpage>
          . [58]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yan</surname>
          </string-name>
          , Style transfer [47]
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dubiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Halvey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dalton</surname>
          </string-name>
          ,
          <article-title>Con- in text: Exploration and evaluation</article-title>
          , in: Proceedings
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name><given-names>L.</given-names> <surname>Azzopardi</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Dubiel</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Halvey</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Dalton</surname></string-name>
          ,
          <article-title>Conceptualizing agent-human interactions during the conversational search process</article-title>
          , in:
          <source>Proceedings of the 2nd International Workshop on Conversational Approaches to Information Retrieval, CAIR '18</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name><given-names>T.</given-names> <surname>Shen</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Lei</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Barzilay</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Jaakkola</surname></string-name>
          ,
          <article-title>Style transfer from non-parallel text by cross-alignment</article-title>
          , in:
          <source>Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>6833</fpage>
          -
          <lpage>6844</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name><given-names>J. F.</given-names> <surname>Kelley</surname></string-name>
          ,
          <article-title>An iterative design methodology for user-friendly natural language office information applications</article-title>
          ,
          <source>ACM Trans. Inf. Syst.</source>
          <volume>2</volume>
          (
          <year>1984</year>
          )
          <fpage>26</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name><given-names>C.</given-names> <surname>Lewis</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Rieman</surname></string-name>
          ,
          <source>Task-Centered User Interface Design: A Practical Introduction</source>
          , University of Colorado, Boulder, Department of Computer Science,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name><given-names>T.</given-names> <surname>Wilson</surname></string-name>
          ,
          <article-title>Information needs and uses: Fifty years of progress</article-title>
          , in:
          <source>Fifty Years of Information Progress: A Journal of Documentation Review</source>
          ,
          <year>1994</year>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name><given-names>D.</given-names> <surname>Ellis</surname></string-name>
          ,
          <article-title>The dilemma of measurement in information retrieval research</article-title>
          ,
          <source>J. Am. Soc. Inf. Sci.</source>
          <volume>47</volume>
          (
          <year>1996</year>
          )
          <fpage>23</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name><given-names>J.</given-names> <surname>Edlund</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Gustafson</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Heldner</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Hjalmarsson</surname></string-name>
          ,
          <article-title>Towards human-like spoken dialogue systems</article-title>
          ,
          <source>Speech Commun.</source>
          <volume>50</volume>
          (
          <year>2008</year>
          )
          <fpage>630</fpage>
          -
          <lpage>645</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name><given-names>E. M.</given-names> <surname>Voorhees</surname></string-name>
          ,
          <string-name><given-names>D. K.</given-names> <surname>Harman</surname></string-name>
          ,
          <source>TREC: Experiment and Evaluation in Information Retrieval</source>
          , MIT Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>