<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving Task-Oriented Dialogue Systems In Production with Conversation Logs</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>David Amid IBM Watson</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>David Boaz IBM Research</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>IBM Research</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Inbal Ronen IBM Research</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Ofer Lavi IBM Research</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
<p>In this work we propose a solution to a significant limitation of task-oriented dialogue systems: their inability to learn and improve over time during deployment. Although current popular task-oriented systems are implemented as rule-based execution graphs, the available solutions for improvement incorporate neural network modules, either fully or partially, despite the poor performance of neural architectures for the task-oriented use-case. We present an algorithm that modifies the graph-based system directly, in a manner which improves the system automatically and is simultaneously easy for the system expert to understand. To our knowledge, this is the first method of this type towards automatically improving a dialogue system's coverage in production, without additional explicit labels. Though the system is still preliminary, our experiments already show promising results in its ability to usefully modify an existing dialogue system, while improving its coverage.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
<p>• Computing methodologies → Learning from
demonstrations; Rule learning; Discourse, dialogue and pragmatics; •
Human-centered computing → Natural language interfaces.</p>
    </sec>
    <sec id="sec-2">
      <title>KEYWORDS</title>
<p>dialogue systems, task oriented, closed domain, virtual agent, rule-based systems, machine learning</p>
      <p>∗Both authors contributed equally to this research.</p>
      <p>Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.</p>
      <p>For all other uses, contact the owner/author(s).</p>
      <p>KDD Converse’20, August 2020,
© 2020 Copyright held by the owner/author(s).</p>
<p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>[Figure 1: Two escalated-support conversations. In the first, the user reports “I can’t connect at all!”; the dialogue system asks for an error message, cannot handle error #666, and escalates to a human agent, who resolves the issue (“Please restart your computer.”). In the second, our solution handles error #666 directly and resolves the issue without escalation.]</p>
    </sec>
    <sec id="sec-3">
      <title>1 INTRODUCTION</title>
      <p>
        Dialogue systems, or virtual assistants, are automated systems for
interacting with users through a natural language interface.
Task-oriented dialogue systems (also referred to as “goal-oriented”
or “closed-domain”) are not only concerned with
maintaining coherent interaction with another party (e.g., chit-chat agents,
or chatbots), but also with leading the interaction towards some goal
[
        <xref ref-type="bibr" rid="ref11 ref8">8, 11</xref>
        ]. These systems have a variety of useful applications, such as
customer support [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], restaurant or hotel reservation [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], online
shopping [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], and many others.
      </p>
      <p>
        Recent advances in Natural Language Understanding (NLU), via
neural networks, have shown promise to facilitate drastic
improvements in such virtual task-oriented dialogue agents [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] — as a major
bottleneck in the past has been correct interpretation of the user’s
natural language utterances. However, the scope of these dialogue
systems is still limited by their inability to handle new types of
interactions after deployment (e.g., a new software product in IT
support, or new categories in online shopping) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
The dominant task-oriented dialogue systems follow a
rule-based architecture where machine learning NLU techniques
interpret the user utterances (Figure 2), with an execution graph
backbone for the dialogue path management [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Modelling such a
system requires expertise in both the backbone system and the domain
the system is planned to operate in (i.e., the concrete use case). This
combined knowledge of both the use-case and the system
engineering is rare, and requires training. Consequently, as the dialogue
management system is rule-based, improving the system’s
performance based on post-deployment usage requires manual updates
by such an expert, as well.
      </p>
      <p>Often, the dialogue management backbone is based on a dialogue
graph (Figure 3B). Each node in the graph represents a dialogue
state, and each edge a possible transition from one state to another
according to the user’s utterances and the condition derived from
it by the NLU system (Figure 2). Changing the dialogue system’s
behavior involves altering the dialogue system’s structure and
transition table. But how can we acquire supervision for the changes
necessary for these improvements?</p>
      <p>Towards this end, we point to a key property of our use-case:
Virtual assistants, which are the topic of this work, are deployed
as part of customer support centers. They work in tandem with a
fallback to human agents in cases of failure — as a way of
maintaining a sufficient service level to customers (users). At any point
during the virtual assistant to user interaction, a failure can occur,
either when the virtual assistant detects its inability to continue, or
when the user directly requests the escalation to a human agent. In
these cases, the human agent will assume control of the interaction
to properly assist the user. Naturally, a record of such interactions
is collected during the deployment of the support system, and is
used by an expert to manually modify and improve the automatic
dialogue system. We refer to these records as escalation logs,
detailing interactions where the dialogue system assumed initial control,
subsequently failed, and control was escalated to a human agent to
resolve the case (Figure 1).</p>
      <p>In this work, we propose to leverage these escalation logs for
completing missing functionality in the dialogue system
automatically, by introducing new nodes to the dialogue execution graph.
A notable attribute of the dialogue systems discussed in this work,
based on execution graphs, is their human-readability, as they are
easy to read and understand by humans (since they are actively
designed by humans). Thus, modifying them automatically requires
maintaining the system’s human-readability by proposing
modifications which are also rule-based. This enables the dialogue system
developer to thoughtfully handle these updates — adapt them and
alter them as necessary. As these systems are designed to be
deployed and serve a large sector, this will allow the developer a
sufficient degree of confidence in the automatic modifications to
allow their usage in production. We are addressing this aspect in
the design of our algorithm and assess some readability measures
of its results.</p>
<p>The contributions of this paper are three-fold: First, we
formulate the node-completion problem for the dialogue execution graph
based on escalation logs; Next, we propose a method for
automatically deriving node transition rules based on user-to-human
escalation logs; Finally, we present an automatic evaluation setup in
order to assess the quality of the suggested updates to the solution,
which can also serve other future dialogue system methods in this
area.</p>
      <p>[Figure 2: The user utterance “I can't connect at all!” is interpreted by the NLU system as connection_error = TRUE; the dialogue system graph then transitions from the root node along the edge “if (connection_error = TRUE)” to ask for the error code (“What is the connection error code?”).]</p>
      <p>The rest of this paper is structured as follows: In Section 2 we
provide background on different types of dialogue systems and
scope the discussion to the more prevalent type we deal with in this
paper. Then, in Section 2.2 we establish the importance of improving
such dialogue systems based on post-deployment execution logs.
In Section 3 we introduce our solution for automatically improving
these systems by means of learning from logs, a solution which
we provide implementation details for in Section 4. We evaluate
our solution in Section 5 and sum up with a short discussion and
conclusions in Section 6.
</p>
    </sec>
    <sec id="sec-4">
      <title>2 BACKGROUND: IMPROVING DIALOGUE SYSTEMS IN PRODUCTION</title>
      <p>We give a brief overview of learning-based methodologies for
improving and updating dialogue systems without manual annotation
by an expert.
</p>
    </sec>
    <sec id="sec-6">
      <title>2.1 Terminology and Notation</title>
      <p>Execution Graph Dialogue System (Figure 3B). We focus on the
prevalent dialogue systems where the system is a directed
“execution graph”, in which each node represents a binary decision
function (or condition) and an action. The decision function, based
on the current state of the environment (conversation), results in a
decision on whether to perform the action. If so, a change in the
environment is observed as a result of the action, and the execution
flow proceeds to the children of the node, in a pre-defined order.
If the condition is not satisfied, the action is not performed, and
the execution flow proceeds to the next sibling of the current node.
The action to perform may be a communication with the user, or
a concrete action to perform to help the user, and the observable
result will be the user’s response to the action.</p>
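<p>As a rough sketch of the traversal just described (the names and structure here are our own illustration, not any particular product’s API), a node holds a condition over the environment state and an action; siblings are tried in order, and a satisfied condition triggers the action and a descent into that node’s children:</p>
      <preformat>
```python
# Minimal sketch of an execution-graph dialogue step (illustrative names).
class Node:
    def __init__(self, condition, action, children=None):
        self.condition = condition  # maps a state dict to True/False
        self.action = action        # maps a state dict to an updated state dict
        self.children = children or []

def step(siblings, state):
    """Try siblings in order; on the first satisfied condition, perform the
    action, observe the new state, and descend into that node's children."""
    for node in siblings:
        if node.condition(state):
            state = node.action(state)          # e.g., ask the user a question
            return step(node.children, state)   # children visited in order
    return state                                # no condition held: flow stops

# Toy usage: ask for an error code when a connection error is detected.
ask_code = Node(lambda s: s.get("connection_error"),
                lambda s: dict(s, asked_for="error_code"))
state = step([ask_code], {"connection_error": True})
```
      </preformat>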
      <p>Escalation Logs (Figure 1). The core supervision to drive learning
in production is collected in escalation logs — logs of interactions
where the deployed system assumed initial control of handling
the case, and subsequently it failed to complete the goal of the
interaction. This resulted in escalation of the case to a human agent,
who properly handled the case to its conclusion. In this work, we
propose a method to utilize the human agent’s handling of the
dialogue system’s failure in order to improve the dialogue system.
</p>
    </sec>
    <sec id="sec-7">
      <title>2.2 Motivation</title>
      <p>
        In this section we elaborate on the core motivation behind this work
— namely, the answer to the question: Why is it valuable to develop
a method of updating dialogue systems after their deployment? We
give two central answers, detailed below.
2.2.1 Distribution Shift Over Time. The main motivation is simple
indeed, and uncontroversial: Even in the event where the initially
manually designed dialogue system is perfect for its use case, as
time goes by and new capabilities are required, we would like the
system to be able to manifest them automatically. This motivation
also shares common themes with the areas of lifelong machine
learning [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and never-ending learning [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>As an example, consider the case of a technical customer
support virtual agent — which attempts to help incoming users with
technical issues and requests regarding a specific software product.
The virtual agent, although properly designed at deployment time,
must be continuously augmented with additional information to
reflect updates in the software product, as these updates introduce
new capabilities and issues.
2.2.2 Reference Logs Are Naturally-Occurring. Another key
motivation relates to the ease of obtaining these reference escalation
logs. The system has been expertly designed for use in practice,
and thus it will be deployed. As a result, instances
of escalated conversations where the bot has failed will be
gathered. These reference conversations can be considered “free”: they
will exist during the production phase by default, and if they can be
utilized, no additional effort is necessary to gather supervision for
the improvement of the deployed system.</p>
      <p>Unfortunately, as explained in Section 2, there is currently no
method available for making use of this supervision to improve a
non-neural dialogue system (the prevailing type of virtual agents in
task-oriented settings). In other words, there exists a gap between
the relative ease of obtaining reference supervision for the
improvement of the currently deployed solution and the lack of available
techniques to make use of it.
</p>
    </sec>
    <sec id="sec-8">
      <title>2.3 Related Work</title>
      <p>
        2.3.1 Execution Graph Solutions. An execution graph [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] is one
of the most popular methods for modeling task-oriented dialogue
systems. The vast majority of solutions of this type are created
manually by an expert [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and to our knowledge, after being
deployed, they are either static, or manually updated by an expert.
One notable exception is by Volkova et al. [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], which attempts
to create an initial graph-based model by using explicit
natural-language instructions on how the execution graph should act. This
method can be used to update the graph by redoing the process
with additional instructions. Additionally, [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] have proposed a
system designed for multi-domain sets of slot values in order to
remain scalable to new domains of conversations (we elaborate on
slots later).
2.3.2 Neural End-to-End Solutions. Recent advances in deep
learning have caused a surge in proposed neural solutions for dialogue
systems in the open-domain chit-chat setting [
        <xref ref-type="bibr" rid="ref15 ref16 ref21">15, 16, 21</xref>
        ].
Unfortunately, although these end-to-end models can be improved
relatively easily using reference conversation logs, current solutions are
ill-equipped to deal with the challenging setting of task-oriented
conversation — where the automatic solution must achieve some
purpose at the end of the interaction, via a natural language
interface and performing actions — and the insuficient quantity of
data which can be gathered2. Typically these neural solutions
involve a component of generating responses [
        <xref ref-type="bibr" rid="ref20 ref32">20, 32</xref>
        ] or ranking and
retrieving them from data [
        <xref ref-type="bibr" rid="ref2 ref29 ref33 ref4">2, 4, 29, 33</xref>
        ].
2.3.3 Hybrid Solutions. As previously mentioned, neural models
under-perform in task-oriented settings. However, the standout
quality of these models is their ability to learn by their design from
reference conversation logs. As such, hybrid models have been
proposed to combine the strengths of an execution graph backbone
with a neural fall-back which can learn to adapt and improve after
deployment. For example, Tammewar et al. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] propose a hybrid
model in which every decision of the execution graph has a neural
fall-back in case of no appropriate response.
      </p>
      <p>Although these models are indeed able to learn from escalation
logs after deployment, in truth the only component which is able to
learn is the neural model. As mentioned before, these models are as
of yet unconvincing in their ability to uphold the task-oriented
use-case — due to their inability to rigorously conform to completing
the goal of the conversation, and requiring a significant amount of
data to learn on any level.</p>
      <p>
        Another alternative to the neural fall-back is a hybrid model
that offers redirection of the misunderstood utterance to a search
engine and returning its result, relying on an up-to-date search
index such as the search skill described in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. However, a search
user experience is substantially different from a conversational one.
      </p>
    </sec>
    <sec id="sec-9">
      <title>2.4 Conclusion</title>
      <p>We have discussed three possible solutions for task-oriented
systems, and their ability to learn automatically from reference logs
after deployment. Specifically, while execution graph-based models
are the most robust solutions, they are also rigid and require
manual updates by an expert to be continuously improved. Neural models
go to the other extreme, and are able to learn freely at any point
by optimizing their performance against reference logs. However,
their overall performance at the task-oriented use-case is severely
lacking in comparison to the execution graph based models.</p>
      <p>
In order to bridge the gap, hybrid models have been proposed to
embody the best of both worlds, such that they employ an execution
graph backbone and a neural fallback in case of failure. However,
the only component which is able to learn and improve in these
models is the neural component — which is anyway of negligible
value in the overall usefulness of the model — and so they suffer
the same issues as all of the previous solutions.
2While out of the scope of this work, neural models indeed dominate the open-domain
chit-chat settings, which don’t suffer from these constraints [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-10">
      <title>3 OUR SOLUTION</title>
      <p>We elaborate on our proposed solution in order to concretely
improve an existing execution graph dialogue system, by using
reference escalation logs, obtained after deployment of the existing
virtual assistant.</p>
      <p>The procedure is conceptually divided into five steps. At the
end of the procedure, the algorithm recommends new edges and
nodes (composed of decisions and actions) to be integrated into
the execution graph currently in production. These new nodes can
be integrated as-is into the execution graph, to be evaluated in a
test environment, or they can be verified by an expert before being
integrated in order to guarantee their relevance before deployment.
</p>
    </sec>
    <sec id="sec-11">
      <title>3.1 Step 1: Gathering Failure Points</title>
      <p>As mentioned in Section 2.2, to update the existing execution graph,
we utilize escalation logs obtained following its deployment.
(1) The before-escalation section of the log describes the
dialogue between the user and the dialogue system and ends
at a failure point. A failure point in a conversation is the
point where the control is escalated to a human agent. This
conversation corresponds to a single path in the dialogue
execution graph, terminating at some node we refer to as
the escalation node — a graph node from which some failure
points escalated to a human agent. Figure 3A illustrates a
single escalated conversation. The dialogue system
understood that the user wishes to transfer money and escalated
to a human agent in the next node.
(2) The after-escalation section of the log describes the
interaction from the failure point on, occurring between the user
and the human agent. Since this part of the conversation is
external to the dialogue system, there is no path
corresponding to it in the execution graph (Figure 3B).</p>
      <p>Our goal is to derive new nodes to attach to the execution graph
at the escalation node, so that failure points corresponding to that
node, occurring in multiple conversations, will be handled, or at
minimum delayed by an additional step in the execution graph. For a
single conversation we look at the execution path up to the
escalation node, and at the first response of the human agent after the
failure point. In order to generalize we gather multiple
conversations that were escalated at that specific escalation node. We thus
obtain a set of conversations along with their matching path up to
the escalation node, and the appropriate response for this
conversation as given by the human agent. We refer to these responses as
gold responses.
</p>
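<p>The gathering in this step reduces to grouping escalated conversations by their escalation node, keeping each conversation’s path and the human agent’s first (gold) response. A minimal sketch, with hypothetical log fields of our own invention:</p>
      <preformat>
```python
from collections import defaultdict

# Hypothetical escalation-log records (field names are our own): the node at
# which escalation occurred, the execution path up to it, and the human
# agent's first response after the failure point (the gold response).
logs = [
    {"escalation_node": "n7", "path": ["root", "transfer", "n7"],
     "gold_response": "Please enter PIN"},
    {"escalation_node": "n7", "path": ["root", "transfer", "n7"],
     "gold_response": "Transfers above the limit need approval"},
    {"escalation_node": "n3", "path": ["root", "n3"],
     "gold_response": "Please restart your computer"},
]

def gather_failure_points(logs):
    """Group escalated conversations by the node at which they escalated."""
    groups = defaultdict(list)
    for log in logs:
        groups[log["escalation_node"]].append(log)
    return dict(groups)

groups = gather_failure_points(logs)
# groups["n7"] now holds both conversations escalated at node n7.
```
      </preformat>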
    </sec>
    <sec id="sec-12">
      <title>3.2 Step 2: Clustering Gold Responses Into Response Types</title>
      <p>Given the collection of human agent responses we obtained in
the previous step, it is necessary to divide this collection into
categories: Although all of these conversations passed through the same
escalation node in the execution graph, they have each possibly
originated from different paths, and thus each of the human agents’
responses may be different based on the context of the interaction.
For this reason, we cluster the human agent responses into response
types based on semantic similarity. Figure 4 illustrates clustering
of multiple conversations based on the agent’s responses into 3
response types.</p>
      <p>
        In the case of textual responses, we utilize a neural model to
encode the text in a continuous embedding space [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] for clustering.
The clustering algorithm attempts to divide the human agents’
responses into different response types.
      </p>
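<p>To keep a sketch of this step self-contained, the fragment below substitutes a toy word-overlap similarity and a greedy threshold rule for the neural embeddings and clustering algorithm we actually use; it only illustrates the shape of the computation:</p>
      <preformat>
```python
# Toy stand-in for the clustering step (not the real embedding or algorithm).
def similarity(a, b):
    """Jaccard overlap between the word sets of two responses."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa.intersection(wb)) / max(1, len(wa.union(wb)))

def cluster_responses(responses, threshold=0.4):
    """Greedily assign each response to the first cluster it resembles."""
    clusters = []
    for text in responses:
        for cluster in clusters:
            if similarity(text, cluster[0]) >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])
    return clusters

responses = [
    "Please restart your computer",
    "Please restart your computer now",
    "Enter your PIN to continue",
]
clusters = cluster_responses(responses)
# Two response types emerge: "restart" responses and a PIN request.
```
      </preformat>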
    </sec>
    <sec id="sec-14">
      <title>3.3 Step 3: Affixing Actions to Response Types</title>
      <p>Each response type will be attributed by a concrete action — such
as a text message, value retrieval from a database, and/or
miscellaneous actions. This representative action can be derived in one of
multiple possible methods:
(1) The action can be chosen by some metric (such as quantity
of similar occurrences in the cluster) from among the actions
in the response type.
(2) The action can be chosen as the closest response to the
centroid of the cluster (Figure 5).
(3) In the case of a text message, the response can be
generated via some text generation component by utilizing the
collection of text responses in the cluster for the generation
process.3
</p>
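<p>Option (2) above, picking the response nearest to the cluster centroid, can be sketched as follows (the two-dimensional embeddings are invented stand-ins for real encoder outputs):</p>
      <preformat>
```python
# Sketch of choosing a representative action for a response type.
def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def representative(cluster):
    """cluster: list of (response_text, embedding_vector) pairs."""
    center = centroid([vec for _, vec in cluster])
    return min(cluster, key=lambda item: squared_distance(item[1], center))[0]

cluster = [("Please restart", [0.9, 0.1]),
           ("Try restarting your computer", [1.0, 0.0]),
           ("Reboot the machine", [0.2, 0.8])]
action = representative(cluster)   # the response closest to the centroid
```
      </preformat>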
    </sec>
    <sec id="sec-15">
      <title>3.4 Step 4: Deriving Boolean Conditions</title>
      <p>Our next goal is to derive boolean conditions that will correctly map
a conversation to its response type, and trigger the chosen action.
In dialogue systems that use an execution graph as their dialogue
management backbone this is equivalent to adding one node per
response type with a decision function that takes the dialogue state
and context as its input.</p>
      <p>In Figure 6 we illustrate eight conversations clustered by the
agent’s response into three response types. Each cluster is marked
by a different type of line (solid, dashed, dotted). Within each cluster,
every conversation holds its own different dialogue state captured
when the conversation passed through the escalation node. The
table illustrates the state of each conversation represented as a set of
features, together with the assigned cluster for each conversation. A
decision function is then learned, taking the dialogue state as input
to discriminate between the three clusters. In the illustration we
can see three boolean conditions taking into account the payment
amount and customer VIP flag to differentiate between the three
response types based on the dialogue state. Note that the boolean
conditions ignore the account number feature.
</p>
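<p>In the spirit of the illustration above, the learned decision functions might look like the following sketch (feature names and thresholds are invented); as in the figure, the account number plays no role in any condition:</p>
      <preformat>
```python
# Illustrative boolean conditions over the dialogue state (invented names).
conditions = {
    "ask_for_pin":      lambda s: s["payment_amount"] >= 1000 and s["vip"],
    "offer_callback":   lambda s: s["payment_amount"] >= 1000 and not s["vip"],
    "confirm_transfer": lambda s: 1000 > s["payment_amount"],
}

def response_type(state):
    """Return the first response type whose condition holds for this state."""
    for name, condition in conditions.items():
        if condition(state):
            return name
    return None

# The "account" feature is present in the state but ignored by all conditions.
rtype = response_type({"payment_amount": 250, "vip": False, "account": "12-34"})
```
      </preformat>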
    </sec>
    <sec id="sec-16">
      <title>3.5 Step 5: Recommending New Nodes</title>
      <p>At the final step of the procedure, various nodes are derived to
model the responses of human agents at various failure points. This
step attempts to rank these nodes so that only a confident subset
of the suggested nodes will be recommended for integration in the
deployed dialogue system. This is done for two reasons:
3Within the scope of this work, we do not consider the text generation case.</p>
      <p>(1) By choosing a specific number of nodes as the top-ranked nodes
in the recommendation ranking, the balance between
precision and recall can be controlled: It is up to the expert to
prioritize quality of responses at the failure points versus
the potential coverage of failures.
(2) In the event that the expert will be interested in verifying the
suggested nodes before they are integrated in the deployed
dialogue system, to guarantee their validity, the procedure
must filter the nodes by confidence to alleviate the workload
of the expert.</p>
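<p>Both considerations reduce to a filter-and-rank step, sketched below (the threshold and cutoff values are invented): clusters below a minimum relative size are discarded, the remainder are ranked by size, and only the top of the ranking is recommended.</p>
      <preformat>
```python
# Sketch of ranking candidate nodes by the size of their response clusters.
def recommend(clusters, total, k=2, min_fraction=0.25):
    """Keep clusters covering at least min_fraction of all responses,
    rank them by size, and return the k largest."""
    big_enough = [c for c in clusters if len(c) >= min_fraction * total]
    ranked = sorted(big_enough, key=len, reverse=True)
    return ranked[:k]

clusters = [["r1", "r2", "r3"], ["r4"], ["r5", "r6"]]
top = recommend(clusters, total=6)   # the singleton outlier cluster is dropped
```
      </preformat>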
      <p>We consider the quality of the suggested nodes (and specifically
their conditions) via several heuristics that conform to notions of
human-readability4 for two main purposes: (i) Decision functions
that are easier to understand will be preferred, as the expert may
still attempt to understand them and verify their functionality to
gain confidence in their integration in the deployed product; (ii)
The human-readability of the boolean conditions can be viewed as
regularization to mitigate overfitting.</p>
    </sec>
    <sec id="sec-17">
      <title>3.6 Solution Summary</title>
      <p>We propose a five-step procedure for improving an execution graph’s
ability to handle failure points by using escalation logs as the source
of supervision. To our knowledge, this is the first method of this
type towards automatically improving a dialogue system’s coverage
after deployment, without labels that require external feedback —
outside of the already available escalation logs — and without
manual annotation by an expert. As mentioned, the procedure requires
a collection of escalation logs and results in a set of new nodes to be
integrated in the current dialogue system’s execution graph. These
nodes are ranked by some metric, and can be further verified by
an expert with minimal overhead to guarantee their behavior for a
deployed model.</p>
      <p>At the end of the integration of the new nodes, the execution
graph will be able to progress an additional step beyond what were
considered its failure points previously, thus increasing its coverage.
Once the new execution graph is deployed, more escalation logs
can be gathered to iteratively improve the system by repeating the
procedure.
4Such heuristics may include the length of the decision function, the amount of nesting
(such as “A or (B and C)”), the number of negation elements in the function, and so on.</p>
    </sec>
    <sec id="sec-18">
      <title>4 IMPLEMENTATION</title>
      <p>
        To verify our suggested approach we implemented each of the 5
steps in our solution on top of IBM Watson Assistant (WA) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
However, we stress that while we exemplify our approach on top
of IBM Watson Assistant, it can be comfortably generalized to
other popular competing execution graph systems, such as Google
Dialogflow [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] and Microsoft Bot Framework [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>WA uses an execution graph as its dialogue management
backbone. The graph is designed by the system’s author such that at
each visited node, the system interprets a user’s utterance in the
context of the current conversation using natural language
understanding and chooses the appropriate transition to the next node
based on the execution graph design and the current dialogue state.</p>
      <p>The dialogue state is encoded with a set of contextual variables
characterizing the user’s intents (e.g., opening a new account),
identity (e.g., account number or country of origin) and relevant
details from the user utterances (dates, times, names, etc.). Some of
these contextual variables are extracted by WA automatically from
the user’s utterances and others are “injected” from outside the
system (e.g., the account number of the logged-in user). Additional
variables can be calculated based on the values of existing ones
during the conversation.</p>
      <p>Each node in WA’s execution graphs contains a boolean condition
over the set of contextual variables and an action (e.g., a system
response). When the system arrives at a specific node during a
conversation, the next action in the conversation is chosen to be the
action attached to the first child node whose condition is satisfied.</p>
      <p>
        Below we describe our implementation in accordance with the 5
steps of Section 3:
(1) Step 1: Gathering Failure Points. Escalation nodes in WA’s
execution graph are nodes from which dialogues were escalated
to a human agent. These nodes are in fact sink nodes for all
points in conversations that did not satisfy any condition of
the children of the current node. For each escalation node
we gather all conversations that were escalated in that node.
(2) Step 2: Clustering Gold Responses Into Response Types. We first
embed the agent’s response following the escalation in a
continuous space. For this purpose we use BERT [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] based
embedding. Specifically, we use the [CLS] token which is
the output of employing the BERT model over the responses.
We then cluster the resulting vectors using the Mean Shift
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] clustering algorithm5.
(3) Step 3: Affixing Actions to Response Types. Each node in the
execution graph is the combination of both an entry
condition and an action to follow. Each cluster from the previous
phase is associated with a centroid. For the recommended
nodes’ actions we use the human response of the nearest
neighbor to the centroid inside the cluster.
(4) Step 4: Deriving Boolean Conditions. Every point in the
conversation is associated with a dialogue state constituting a
feature vector defined by the values of its contextual
variables. For each cluster obtained in step 2, we train a binary
decision tree classifier over the dialogue state at the
escalation node. The label of each conversation is 1 (positive) if the
clustering associated it with the cluster, and 0 (negative)
otherwise. Specifically, we used the implementation offered
by scikit-learn [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. This decision tree is then converted
into a boolean expression by collapsing sibling sub-trees as
or and collapsing parent-children sub-trees as and.
Optionally, the decision tree or boolean expression can be pruned or
simplified to increase generalization and readability [
        <xref ref-type="bibr" rid="ref17 ref3">3, 17</xref>
        ].
We implemented pruning using the min-leaf-size
parameter of scikit-learn. Notably, the decision trees are trained
to classify between a given cluster and all other clusters,
mitigating any issue with order-dependent movement along
the execution graph.
(5) Step 5: Recommending New Nodes. Clustering high-dimensional
vectors is likely to result in a long tail of very small
clusters pertaining to outlier responses. To mitigate this, we
bound the minimum size of a cluster (as a percentage of
the number of responses) to be considered for new node
recommendation. Our recommendations are the nodes
resulting from the largest clusters.
5Although any other clustering algorithm is applicable, we chose Mean Shift since it
does not require the number of clusters to be predefined.
      </p>
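Steps 2, 3, and 5 above can be sketched as follows. This is a minimal illustration, not the paper's code: random vectors would stand in for the BERT [CLS] embeddings, and the helper name `cluster_responses` and the `min_cluster_frac` parameter are our own notation.

```python
import numpy as np
from sklearn.cluster import MeanShift

def cluster_responses(responses, embeddings, min_cluster_frac=0.05):
    """Cluster gold responses into response types (step 2), pick a
    representative action per cluster (step 3), and drop long-tail
    clusters below a minimum relative size (step 5)."""
    labels = MeanShift().fit_predict(embeddings)  # no preset cluster count
    min_size = max(1, int(min_cluster_frac * len(responses)))
    clusters = {}
    for label in np.unique(labels):
        idx = np.where(labels == label)[0]
        if len(idx) < min_size:  # outlier cluster: too small to recommend
            continue
        centroid = embeddings[idx].mean(axis=0)
        # the recommended node's action is the human response nearest
        # to the cluster centroid
        nearest = idx[np.argmin(np.linalg.norm(embeddings[idx] - centroid,
                                               axis=1))]
        clusters[int(label)] = {"members": idx.tolist(),
                                "action": responses[nearest]}
    return clusters
```

In the paper the embeddings are the [CLS] outputs of BERT over the agent responses; any sentence embedding slots in here.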
    </sec>
    <sec id="sec-19">
      <title>EVALUATION</title>
      <p>
Qualitative evaluation of dialogue systems, and particularly
task-oriented systems, is a very challenging open problem [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Deriu
et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
] emphasize the need for automated evaluation methods, as
collecting human judgment of a dialogue system’s quality is
laborious and costly. To this end, we devised an automated
evaluation method for our solution which does not measure
the dialogue system’s performance during deployment, but rather
evaluates the method itself using logs of current dialogue systems
that involve no escalation to human agents.
      </p>
      <p>Instead of adding a new node and evaluating its quality, we take
an existing dialogue system as reference and destructively
modify it by choosing a node (which we refer as simulated escalation
node) and removing all its outgoing nodes (descendants) from the
execution graph. We then use our method to predict the removed
outgoing nodes, and compare the behavior of the system prior to
the removal with its behavior after adding the predicted nodes.
We then measure the quality of our recommendations in an
automated manner. A “high quality” node should capture a previously
unhandled case, properly act upon it, and be human-readable. Our
automatic evaluation is based on the following observation: in the
original (unmodified) graph, the removed nodes induce a partition
of the conversations that went through the simulated escalation
node. We call this partition the reference partition.</p>
      <p>Similarly, the predicted nodes induce a partition on the same set
of conversations. The nodes’ conditions and execution order may
not necessarily resemble the original ones, but the functionality of
the system should be preserved. This preservation can be measured
by the level of similarity of the two partitions, the one induced by
the removed nodes, and the one induced by the predicted nodes.
Our simulated escalation node can be viewed as an escalation node
in the human agent escalation case. Once we remove the outgoing
nodes, we consider only the conversation log before escalation,
ignoring the dialogue state and the continuation of the paths in the
execution graph.</p>
      <p>We also use the original node conditions to assess the quality
of our recommendations for example by comparing the length of
the recommended conditions to the original ones in terms of the
number of variables in the condition.</p>
      <p>Our experiments include two evaluation methods: (1) Automatic
evaluation of our solution to assess the quality of the partition
and the readability of the conditions. We experiment with different
hyperparameters and implementations of the components in our
solution. This evaluation is performed on an internal dataset, using
our method for simulating escalation nodes. (2) Human evaluation
of the recommended conditions and clustering. This evaluation is
performed on a public dataset.</p>
      <p>The evaluation method proposed in this paper is standalone, and
is neither contingent upon the dialogue system nor the embedding,
clustering, and condition inference techniques.
5.1</p>
    </sec>
    <sec id="sec-20">
      <title>Datasets</title>
      <p>For our evaluation we use diferent datasets for each evaluation
method. For the automatic evaluation we use an internal real-life
(non-public) dataset from the banking domain. The dataset includes
7605 real-world conversations of users with a WA dialogue
system without escalations to human agents during a period of 10
days of operation. Each conversation includes an average of 6.05
turns between a user and the dialogue system. The execution graph
includes 135 intents with 62 entities. It has 1528 nodes and an
average depth of 2.59. The dataset handles several customer service
issues, such as opening a new account and transferring money. We
use this dataset by simulating escalation nodes as explained above.
We consider only escalation nodes with at least 50 conversations
passing through them. This results in a total of 39 escalation nodes
with an average of 535.98 conversations passing through each of
them (stdev: 760.12, min: 55, max: 3386) and an average of 2.46 child
nodes each. The feature vector used for training the decision tree
in step 4 of our solution includes 1070 features.</p>
      <p>
In order to experiment with a different type of data, which
reflects a prevailing use case of task-oriented dialogue systems, we
use the MultiWOZ dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The dataset contains 10,000
conversations of humans in multiple domains (including hotels, taxi
and restaurant booking). Each conversation in MultiWOZ is labeled
using contextual variables similar to those of WA. Moreover, each
agent response is labeled with the actual agent’s action. For
example, many agent responses ask the user, in different ways, to
specify a certain area. All these responses are labeled as an “area”
action in the dataset. Despite the fact that MultiWOZ does not
include a built-in backbone execution graph, we simulated the state
of conversations by querying the agent actions’ labels.
      </p>
    </sec>
    <sec id="sec-21">
      <title>Experimental Setup</title>
      <p>As we noted earlier, our solution is to the best of our knowledge the
ifrst to tackle the problem of improving dialogue systems’ coverage
in production, without explicit external feedback. We thus have no
baselines to compare our solution to.</p>
      <p>
        Our solution contains (in step 4) a decision tree (DT) classifier.
We compare it to reference solutions employing other classification
models — Random Forest (RF) and the state-of-the-art XGBoost
(XGB) [
        <xref ref-type="bibr" rid="ref9">9</xref>
]. Note that neither of these models fits our complete
solution, as they do not offer an interpretable mechanism from
which node conditions can be derived. Nevertheless, we use these
references as an unrealistic upper bound for the classification part.
      </p>
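The condition derivation of step 4 (collapsing sibling sub-trees as "or" and parent-child sub-trees as "and") can be reproduced from a fitted scikit-learn decision tree. The traversal below is our illustrative reconstruction, not the paper's exact implementation:

```python
from sklearn.tree import DecisionTreeClassifier  # the step-4 classifier

def tree_to_boolean(tree: DecisionTreeClassifier, feature_names):
    """Collapse a fitted binary decision tree into a boolean condition:
    clauses along each root-to-leaf path are and-ed together, and the
    paths that end in a positive leaf are or-ed together."""
    t = tree.tree_
    paths = []

    def walk(node, clauses):
        if t.children_left[node] == -1:  # leaf node
            if t.value[node][0].argmax() == 1:  # leaf predicts the cluster
                paths.append(" and ".join(clauses) or "True")
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node], clauses + [f"({name} <= {thr:.2f})"])
        walk(t.children_right[node], clauses + [f"({name} > {thr:.2f})"])

    walk(0, [])
    return " or ".join(paths)
```

For a tree with a single split on a feature `x0` whose right leaf is the positive class, this yields a condition of the form `(x0 > 0.50)`; pruning via min-leaf-size shortens the paths and hence the conditions.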
      <p>To evaluate various aspects of our solution we experimented with
diferent values of the the hyper-parameter  in the decision tree
and random forest models defining the ratio of minimum number
of samples required to be at a leaf node.
5.3</p>
    </sec>
    <sec id="sec-22">
      <title>Automatic Evaluation</title>
      <p>In this section we detail an experimental setup for automatically
evaluating our solution. These automatic methods allow a
straightforward verification of the effectiveness of our solution.</p>
      <p>
        We use the following evaluation metrics:
(1) Adjusted Rand Index (ARI) [
        <xref ref-type="bibr" rid="ref22">22</xref>
]. To evaluate the partition
induced by our model’s recommended conditions, we use the ARI
between the recommended partition and the gold reference
partition, which measures the level of similarity between the two
clusterings. (2) Clustering Coverage. The ratio of failure points that
were eventually mapped to one of the response-type clusters. Note
that our solution does not require that every failure point be mapped.
(3) #Child Nodes. Compares the number of recommended nodes
to the original number of nodes in the execution graph. (4)
COND-Length. Evaluates the readability of the conditions in our
solution (relevant only for the decision tree model). We
compare the length of the recommended node conditions to the
original node conditions in the execution graph. The length
is calculated as the number of variables in the condition.
5.3.1 Results. We evaluated different versions of our model and
different reference classifiers, as described in Section 5.2, over the
banking dataset, as shown in Table 1.
      </p>
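As a small illustration of metric (1), using scikit-learn's `adjusted_rand_score` (the label vectors here are made up):

```python
from sklearn.metrics import adjusted_rand_score

# Reference partition: the child node each conversation took in the
# original (unmodified) execution graph.
reference   = [0, 0, 0, 1, 1, 2, 2, 2]
# Partition induced by the recommended conditions. Label ids are
# arbitrary: ARI is invariant to relabeling and adjusted for chance.
recommended = [1, 1, 1, 0, 0, 2, 2, 2]
assert adjusted_rand_score(reference, recommended) == 1.0  # same partition

# Chance-level agreement scores near 0 (and can go negative).
mixed = [0, 1, 2, 0, 1, 2, 0, 1]
assert adjusted_rand_score(reference, mixed) < 0.5
```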
      <p>In spite of the decision tree being the weakest classification
model in our comparison, it outperformed all other models in terms
of ARI. Moreover, in contrast to the random forest model, the
decision tree achieved consistently high ARI scores independently of
the min-leaf-size value. The
clustering coverage of all variants was above 0.9, with the decision
tree model only slightly worse than the other models. Our decision
tree solution also outperformed the other models in terms of the
number of child nodes, being closest to the expected average
number of nodes in the dataset, 2.46. Regarding the condition-length
measure, only a high min-leaf-size value achieved conditions with length
close to the original length of the conditions. However, our
experiments showed that this metric tended to have a high variance due
to extreme outliers. These outliers were conditions corresponding
to “outlier clusters” of all conversations that did not map to any of
the other clusters. When discarding in step 5 all nodes with
conditions of length ≥ 10 at a min-leaf-size of 0.01, our coverage of
conversations decreased to 95% of the original clustering coverage. In
this case the average condition length was only 1.88. As expected, the
lower the min-leaf-size, the more aggressive our pruning becomes, which
results in fewer child nodes and shorter conditions, but also lower
clustering coverage.</p>
      <p>
        Figure 7 shows the distribution of the Adjusted Rand Index (ARI)
for the decision tree at a min-leaf-size of 0.01. Our solution achieved high ARI
scores for most of the escalation nodes. Note that in our scenario the
number of child nodes is quite small (as can also be seen in Table 1)
in comparison to the number of conversations that are clustered (at
least 50). This fact sometimes results in low ARI scores (and even a
score of 0) and is a known drawback of ARI [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. Nevertheless, we
use the ARI measure as it is the de facto standard to estimate the
level of similarity between two clusterings. Note that our findings
are consistent for all decision tree configurations.
      </p>
    </sec>
    <sec id="sec-23">
      <title>Experimenting with human-to-human logs</title>
      <p>Our proposed solution for suggesting conditions assumes a dialog
graph as a backbone model. We are aware that this is not the only
possible dialog system backbone representation, and that
human-to-human conversation logs may not reflect any backbone at all.
Yet, we wanted to both evaluate our solution on human-to-human
logs, and extend the method so we can learn such a backbone
from human-to-human logs. We started with the modest task of
recovering conditions for single nodes.</p>
      <p>To this end, we used the MultiWOZ dataset which contains both
the dialog utterances, and context variables extracted throughout
the conversation by human annotators and is aligned with each turn
in the conversation. We simulated a single node by collecting all
agent utterances asking for a specific detail based on annotations
supplied with the dataset. In particular, in the MultiWOZ hotel
booking scenario, we use the action annotation "area" to collect
all utterances where the agent asks about the booking area. We
declare all turns in this collection as if they are assigned to the
same simulated node, e.g. "ask area node". Our task then is to create
additional nodes corresponding to actions taken in the consecutive
turn following that node in different conversations, and to recover
the conditions to be used for directing a dialogue system towards
the correct action.</p>
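The node simulation above can be sketched over a hypothetical, simplified log format; the field names and values below are illustrative and do not follow MultiWOZ's actual schema:

```python
# Hypothetical, simplified conversation log: each turn carries the agent's
# action annotation and the dialogue state at that turn.
turns = [
    {"conv": "c1", "action": "area",  "state": {"stars": 4}},
    {"conv": "c1", "action": "offer", "state": {"stars": 4, "area": "north"}},
    {"conv": "c2", "action": "area",  "state": {}},
    {"conv": "c2", "action": "request_stars", "state": {"area": "south"}},
]

def simulate_node(turns, action_label):
    """Indices of all turns carrying the given action annotation, treated
    as if they were assigned to one simulated node (e.g. "ask area")."""
    return [i for i, t in enumerate(turns) if t["action"] == action_label]

def next_turns(turns, node_indices):
    """The consecutive turn after the simulated node in each conversation:
    the actions whose entry conditions we want to recover."""
    return [turns[i + 1] for i in node_indices
            if i + 1 < len(turns) and turns[i + 1]["conv"] == turns[i]["conv"]]

ask_area = simulate_node(turns, "area")   # the simulated "ask area" node
followups = next_turns(turns, ask_area)   # the agent actions to cluster next
```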
      <p>The actions taken consider the user’s answer and the context
of the conversation so far. For example, one action could be to ask
for more constraints from the user such as hotel grade, another
could be to suggest a small set of specific hotels matching the
user’s constraints supplied so far, and a third option could be to
ask the user to relax a constraint because no hotels matching the
constraints were found. Applied to this data, our solution clusters the agent
responses into their types, and then discovers the condition, based on
the context of the conversation and the user’s response, that would
lead to each of these types.</p>
      <p>We created such simulated nodes and found that clustering the
agent responses resulted with a small number of clusters, and the
corresponding conditions turned out to be long and hard to
interpret. Inspecting them we saw that they consist of conjunctions of
clauses connected with an "or" operator. This reflects multiple, and
sometimes disjoint paths reaching our simulated node, with very
diferent contexts leading to the same node. We suspect this is due
to the slot-filling nature of the MultiWOZ dataset, where diferent
combinations of filled slots lead to the same question asking for
a specific slot not filled yet. This result led us to add a calculated
context feature, counting the number of hotels that satisfy all
constraints set by the filled slots so far. Adding this feature yielded
clear conditions that separate the conversations into distinct actions
based on this feature.</p>
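A minimal sketch of such a calculated feature, over a made-up hotel catalogue (in practice the count would come from the back-end system the agent queries):

```python
# Made-up hotel catalogue; the real count would be obtained by querying
# the agent's back-end with the constraints set so far.
HOTELS = [
    {"area": "north", "stars": 4},
    {"area": "north", "stars": 3},
    {"area": "south", "stars": 4},
]

def matching_hotels(filled_slots):
    """Calculated context feature: how many hotels satisfy every
    constraint set by the slots filled so far."""
    return sum(all(h.get(slot) == value for slot, value in filled_slots.items())
               for h in HOTELS)

# An empty state matches everything; each filled slot narrows the count,
# and a count of zero signals "ask the user to relax a constraint".
assert matching_hotels({}) == 3
assert matching_hotels({"area": "north"}) == 2
assert matching_hotels({"area": "south", "stars": 3}) == 0
```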
      <p>While we saw that following our solution in this hotel booking
use case data set resulted in hard to interpret conditions, the process
taught us how to analyze conversations with respect to the context
variables, and come up with an extra variable that may lead to an
interpretable condition, albeit some additional manual analysis. A
complete fully automated solution would probably employ means
to automatically detect these missing variables for example by
analyzing agents’ actions which could be queries to an external
back-end system.
6</p>
    </sec>
    <sec id="sec-24">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>We presented a method to automatically improve goal-oriented
dialogues after deployment. The method ofers a way for ongoing
learning, utilizing the data that is collected in the customer care
center. We challenge a fatal limitation of deployed task-oriented
dialogue systems: These systems, while initially useful, cannot
improve during production without manual updates by an expert.
Previous methods have attempted to incorporate learning into the
systems via neural network fall-backs, which has shown to be an
inefective band-aid solution, as neural models have little guarantee
to the correctness of their behavior, and are seldom deployed in
practice.</p>
      <p>We propose a five-step procedure, which can be employed on a
deployed system and uses conversation logs collected during run
time. These logs, named “escalation logs”, include interactions where
the dialogue system assumed initial control, subsequently failed,
and control was escalated to a human agent to resolve the case.
Our procedure yields an improved version of the system, where the
modifications provide additional behaviors in cases where the system
failed to provide a satisfactory response.</p>
      <p>Future Work. This research is aimed to help in real customer care
environments in which human agents and virtual assistants work in
tandem. We propose a first step towards relieving the need of
manual expert annotations for the improvement of the system. Future
work on this topic will naturally involve a thorough evaluation in
a production setting, where the system is deployed, improved, and
evaluated for its quality in comparison to the previous version. This
procedure can be repeated multiple times to iteratively improve the
system.</p>
      <p>The MultiWOZ dataset poses a real-life scenario of slot filling,
in which a user needs to provide several slots of information before
the system can respond. The system will then consider the entire
context, e.g. all slots filled so far, the new value from the current
user utterance, and evaluate the current state to decide on an action.
This dependency between the system response and the anticipated
result of its action (based on the slot values filled and the system state)
makes the prevalent slot-filling case a challenging scenario for our
clustering step, which needs to take into account not only the agent
response but also the context of the conversation, the current user
utterance, and the state of the system. On top of the calculated
feature we suggest in the paper, we plan an in-depth analysis of
such cases in future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Adiwardana</surname>
          </string-name>
          ,
          <string-name>
<given-names>Minh-Thang</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David R.</given-names>
            <surname>So</surname>
          </string-name>
          , Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, and
<string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Towards a Human-like Open-Domain Chatbot</article-title>
. CoRR abs/2001.09977 (
          <year>2020</year>
          ). arXiv:2001.09977 https://arxiv.org/abs/2001.09977
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Rami</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          , Marc Pickett, Javier Snaider,
          <string-name>
<given-names>Yun-Hsuan</given-names>
            <surname>Sung</surname>
          </string-name>
          , Brian Strope, and
          <string-name>
            <given-names>Ray</given-names>
            <surname>Kurzweil</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Conversational Contextual Cues: The Case of Personalization and History for Response Ranking</article-title>
          .
<source>CoRR abs/1606.00372</source>
          (
          <year>2016</year>
          ). arXiv:1606.00372 http://arxiv.org/abs/1606.00372
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Hussein</given-names>
            <surname>Almuallim</surname>
          </string-name>
          .
          <year>1996</year>
          .
<article-title>An efficient algorithm for optimal pruning of decision trees</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>83</volume>
          ,
          <issue>2</issue>
          (
          <year>1996</year>
          ),
          <fpage>347</fpage>
          -
          <lpage>362</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Bartl</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gerasimos</given-names>
            <surname>Spanakis</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A retrieval-based dialogue system utilizing utterance and context embeddings</article-title>
          .
<source>CoRR abs/1710.05780</source>
          (
          <year>2017</year>
          ). arXiv:1710.05780 http://arxiv.org/abs/1710.05780
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Manisha</given-names>
            <surname>Biswas</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Microsoft Bot Framework</article-title>
          .
          <source>In Beginning AI Bot Frameworks</source>
          . Springer,
          <fpage>25</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Pawel</given-names>
            <surname>Budzianowski</surname>
          </string-name>
          ,
          <string-name>
<given-names>Tsung-Hsien</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
<given-names>Bo-Hsiang</given-names>
            <surname>Tseng</surname>
          </string-name>
          , Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and
          <string-name>
            <given-names>Milica</given-names>
            <surname>Gasic</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>MultiWOZ - A LargeScale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling</article-title>
. CoRR abs/1810.00278 (
          <year>2018</year>
          ). arXiv:1810.00278 http://arxiv.org/abs/1810.00278
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Carlson</surname>
          </string-name>
, Justin Betteridge, Bryan Kisiel, Burr Settles,
          <string-name>
            <given-names>Estevam R.</given-names>
            <surname>Hruschka Jr.</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tom M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Toward an Architecture for Never-Ending Language Learning</article-title>
          .
          <source>In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence</source>
          ,
          <source>AAAI</source>
          <year>2010</year>
          , Atlanta, Georgia, USA, July
          <volume>11</volume>
          -
          <issue>15</issue>
          ,
          <year>2010</year>
          ,
          <string-name>
            <given-names>Maria</given-names>
            <surname>Fox</surname>
          </string-name>
          and David Poole (Eds.). AAAI Press. http://www.aaai.org/ocs/index.php/ AAAI/AAAI10/paper/view/1879
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Hongshen</given-names>
            <surname>Chen</surname>
          </string-name>
          , Xiaorui Liu, Dawei Yin, and
          <string-name>
            <given-names>Jiliang</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Survey on Dialogue Systems: Recent Advances and New Frontiers</article-title>
          .
          <source>SIGKDD Explorations 19</source>
          ,
          <issue>2</issue>
          (
          <year>2017</year>
          ),
          <fpage>25</fpage>
          -
          <lpage>35</lpage>
          . https://doi.org/10.1145/3166054.3166058
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Tianqi</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Carlos</given-names>
            <surname>Guestrin</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>XGBoost: A Scalable Tree Boosting System</article-title>
          .
          <source>In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , San Francisco, CA, USA,
August 13-17,
          <year>2016</year>
          ,
          <string-name>
            <given-names>Balaji</given-names>
            <surname>Krishnapuram</surname>
          </string-name>
          , Mohak Shah,
          <string-name>
            <given-names>Alexander J.</given-names>
            <surname>Smola</surname>
          </string-name>
          , Charu C. Aggarwal,
          <string-name>
            <given-names>Dou</given-names>
            <surname>Shen</surname>
          </string-name>
          , and Rajeev Rastogi (Eds.). ACM,
<fpage>785</fpage>
          -
          <lpage>794</lpage>
          . https://doi.org/10.1145/2939672.2939785
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] Yizong Cheng.
          <year>1995</year>
          .
          <article-title>Mean Shift, Mode Seeking, and Clustering</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>17</volume>
          ,
          <issue>8</issue>
          (
          <year>1995</year>
          ),
          <fpage>790</fpage>
          -
          <lpage>799</lpage>
          . https://doi.org/10.1109/34.400568
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
<given-names>Jan</given-names>
            <surname>Deriu</surname>
          </string-name>
          , Álvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Cieliebak</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Survey on Evaluation Methods for Dialogue Systems</article-title>
. CoRR abs/1905.04071 (
          <year>2019</year>
          ). arXiv:1905.04071 http://arxiv.org/abs/1905.04071
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
<given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
<given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
, and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
. CoRR abs/1810.04805 (
          <year>2018</year>
          ). arXiv:1810.04805 http://arxiv.org/abs/1810.04805
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>David</given-names>
            <surname>Ferrucci</surname>
          </string-name>
          ,
          <string-name>
            <surname>Eric Brown</surname>
          </string-name>
          , Jennifer Chu-Carroll,
          <string-name>
            <given-names>James</given-names>
            <surname>Fan</surname>
          </string-name>
          , David Gondek,
          <string-name>
            <given-names>Aditya A.</given-names>
            <surname>Kalyanpur</surname>
          </string-name>
          , Adam Lally,
          <string-name>
            <given-names>J. William</given-names>
            <surname>Murdock</surname>
          </string-name>
          , Eric Nyberg, John Prager, Nico Schlaefer, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Welty</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Building Watson: An Overview of the DeepQA Project</article-title>
          .
          <source>AI Magazine</source>
          <volume>31</volume>
          ,
          <issue>3</issue>
          (Jul.
          <year>2010</year>
          ),
          <fpage>59</fpage>
          -
          <lpage>79</lpage>
. https://doi.org/10.1609/aimag.v31i3.2303
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
<string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Distributed Representations of Sentences and Documents</article-title>
          .
<source>In Proceedings of the 31st International Conference on Machine Learning, ICML 2014</source>
          , Beijing, China, 21-26
          June 2014 (JMLR Workshop and Conference Proceedings), Vol.
          <volume>32</volume>
          . JMLR.org,
<fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          . http://proceedings.mlr.press/v32/le14.html
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Bing</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ian</given-names>
            <surname>Lane</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Iterative policy learning in end-to-end trainable task-oriented neural dialog models</article-title>
          .
          <source>In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017</source>
          , Okinawa, Japan, December 16-20,
          <year>2017</year>
          . IEEE,
          <fpage>482</fpage>
          -
          <lpage>489</lpage>
          . https://doi.org/10.1109/ASRU.2017.8268975
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Bing</given-names>
            <surname>Liu</surname>
          </string-name>
          , Gökhan Tür, Dilek Hakkani-Tür,
          <string-name>
            <given-names>Pararth</given-names>
            <surname>Shah</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Larry P.</given-names>
            <surname>Heck</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Dialogue Learning with Human Teaching and Feedback in End-to-End Trainable Task-Oriented Dialogue Systems</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018</source>
          , New Orleans, Louisiana, USA, June 1-6,
          <year>2018</year>
          , Volume
          <volume>1</volume>
          (Long Papers), Marilyn A. Walker, Heng Ji, and Amanda Stent (Eds.). Association for Computational Linguistics,
          <fpage>2060</fpage>
          -
          <lpage>2069</lpage>
          . https://doi.org/10.18653/v1/n18-1187
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Maher</given-names>
            <surname>Nabulsi</surname>
          </string-name>
          , Ahmad Alkatib, and
          <string-name>
            <given-names>Fatima</given-names>
            <surname>Quiam</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A New Method for Boolean Function Simplification</article-title>
          .
          <source>International Journal of Control and Automation</source>
          <volume>10</volume>
          ,
          <issue>12</issue>
          (
          <year>2017</year>
          ),
          <fpage>139</fpage>
          -
          <lpage>146</lpage>
          . https://doi.org/10.14257/ijca.2017.10.12.13
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Mohammad</given-names>
            <surname>Nuruzzaman</surname>
          </string-name>
          and Omar Khadeer Hussain.
          <year>2018</year>
          .
          <article-title>A Survey on Chatbot Implementation in Customer Service Industry through Deep Neural Networks</article-title>
          .
          <source>In 15th IEEE International Conference on e-Business Engineering, ICEBE 2018</source>
          , Xi'an, China, October 12-14,
          <year>2018</year>
          . IEEE Computer Society,
          <fpage>54</fpage>
          -
          <lpage>61</lpage>
          . https://doi.org/10.1109/ICEBE.2018.00019
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          ),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Jiahuan</given-names>
            <surname>Pei</surname>
          </string-name>
          , Pengjie Ren, Christof Monz, and Maarten de Rijke.
          <year>2019</year>
          .
          <article-title>Retrospective and Prospective Mixture-of-Generators for Task-oriented Dialogue Response Generation</article-title>
          . ArXiv abs/1911.08151 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Janarthanan</given-names>
            <surname>Rajendran</surname>
          </string-name>
          , Jatin Ganhotra, and
          <string-name>
            <given-names>Lazaros C.</given-names>
            <surname>Polymenakos</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Learning End-to-End Goal-Oriented Dialog with Maximal User Task Success and Minimal Human Agent Use</article-title>
          .
          <source>TACL 7</source>
          (
          <year>2019</year>
          ),
          <fpage>375</fpage>
          -
          <lpage>386</lpage>
          . https://transacl.org/ojs/index.php/tacl/article/view/1622
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>William M.</given-names>
            <surname>Rand</surname>
          </string-name>
          .
          <year>1971</year>
          .
          <article-title>Objective Criteria for the Evaluation of Clustering Methods</article-title>
          .
          <source>J. Amer. Statist. Assoc</source>
          .
          <volume>66</volume>
          ,
          <issue>336</issue>
          (
          <year>1971</year>
          ),
          <fpage>846</fpage>
          -
          <lpage>850</lpage>
          . http://www.jstor.org/stable/2284239
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Abhinav</given-names>
            <surname>Rastogi</surname>
          </string-name>
          , Dilek Hakkani-Tür, and
          <string-name>
            <given-names>Larry P.</given-names>
            <surname>Heck</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Scalable Multi-Domain Dialogue State Tracking</article-title>
          .
          <source>CoRR abs/1712.10224</source>
          (
          <year>2017</year>
          ). arXiv:1712.10224 http://arxiv.org/abs/1712.10224
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Alexander I.</given-names>
            <surname>Rudnicky</surname>
          </string-name>
          , Eric Thayer, Paul Constantinides, Chris Tchou, R. Shern, Kevin Lenzo, Wei Xu, and
          <string-name>
            <given-names>Alice</given-names>
            <surname>Oh</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Creating natural dialogs in the Carnegie Mellon Communicator system</article-title>
          .
          <source>In Sixth European Conference on Speech Communication and Technology.</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Navin</given-names>
            <surname>Sabharwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Amit</given-names>
            <surname>Agrawal</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Introduction to Google Dialogflow</article-title>
          .
          <source>In Cognitive Virtual Assistants Using Google Dialogflow</source>
          . Springer,
          <fpage>13</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Navin</given-names>
            <surname>Sabharwal</surname>
          </string-name>
          , Sudipta Barua, Neha Anand, and
          <string-name>
            <given-names>Pallavi</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Integrating with Advance Services</article-title>
          .
          <source>In Developing Cognitive Bots Using the IBM Watson Engine</source>
          . Springer,
          <fpage>197</fpage>
          -
          <lpage>239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Daniel L.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Lianghao</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Lifelong Machine Learning Systems: Beyond Learning Algorithms</article-title>
          .
          <source>In Lifelong Machine Learning, Papers from the 2013 AAAI Spring Symposium</source>
          , Palo Alto, California, USA, March 25-27,
          <year>2013</year>
          (AAAI Technical Report), Vol. SS-13-05. AAAI. http://www.aaai.org/ocs/index.php/SSS/SSS13/paper/view/5802
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Aniruddha</given-names>
            <surname>Tammewar</surname>
          </string-name>
          , Monik Pamecha, Chirag Jain, Apurva Nagvenkar, and
          <string-name>
            <given-names>Krupal</given-names>
            <surname>Modi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Production Ready Chatbots: Generate if Not Retrieve</article-title>
          .
          <source>In The Workshops of the The Thirty-Second AAAI Conference on Artificial Intelligence</source>
          , New Orleans, Louisiana, USA, February 2-7,
          <year>2018</year>
          (AAAI Workshops), Vol. WS-18. AAAI Press,
          <fpage>739</fpage>
          -
          <lpage>745</lpage>
          . https://aaai.org/ocs/index.php/WS/AAAIW18/paper/view/17357
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Chongyang</given-names>
            <surname>Tao</surname>
          </string-name>
          , Wei Wu, Can Xu, Wenpeng Hu,
          <string-name>
            <given-names>Dongyan</given-names>
            <surname>Zhao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rui</given-names>
            <surname>Yan</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Multi-Representation Fusion Network for Multi-Turn Response Selection in Retrieval-Based Chatbots</article-title>
          .
          <source>In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (Melbourne, VIC, Australia) (WSDM '19)</source>
          . Association for Computing Machinery, New York, NY, USA,
          <fpage>267</fpage>
          -
          <lpage>275</lpage>
          . https://doi.org/10.1145/3289600.3290985
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Nguyen Xuan</given-names>
            <surname>Vinh</surname>
          </string-name>
          , Julien Epps, and
          <string-name>
            <given-names>James</given-names>
            <surname>Bailey</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>11</volume>
          (Dec.
          <year>2010</year>
          ),
          <fpage>2837</fpage>
          -
          <lpage>2854</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Svitlana</given-names>
            <surname>Volkova</surname>
          </string-name>
          , Pallavi Choudhury, Chris Quirk, Bill Dolan, and
          <string-name>
            <given-names>Luke S.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Lightly Supervised Learning of Procedural Dialog Systems</article-title>
          .
          <source>In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013</source>
          , 4-9 August
          <year>2013</year>
          , Sofia, Bulgaria, Volume
          <volume>1</volume>
          : Long Papers. The Association for Computer Linguistics,
          <fpage>1669</fpage>
          -
          <lpage>1679</lpage>
          . https://www.aclweb.org/anthology/P13-1164/
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Tsung-Hsien</given-names>
            <surname>Wen</surname>
          </string-name>
          , Milica Gasic, Nikola Mrksic, Lina Maria Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and
          <string-name>
            <given-names>Steve J.</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A Network-based End-to-End Trainable Task-oriented Dialogue System</article-title>
          .
          <source>CoRR abs/1604.04562</source>
          (
          <year>2016</year>
          ). arXiv:1604.04562 http://arxiv.org/abs/1604.04562
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Rui</given-names>
            <surname>Yan</surname>
          </string-name>
          , Yiping Song, and
          <string-name>
            <given-names>Hua</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Learning to Respond with Deep Neural Networks for Retrieval-Based Human-Computer Conversation System</article-title>
          .
          <source>In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (Pisa, Italy) (SIGIR '16)</source>
          . Association for Computing Machinery, New York, NY, USA,
          <fpage>55</fpage>
          -
          <lpage>64</lpage>
          . https://doi.org/10.1145/2911451.2911542
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Zhao</given-names>
            <surname>Yan</surname>
          </string-name>
          , Nan Duan, Peng Chen, Ming Zhou, Jianshe Zhou, and
          <string-name>
            <given-names>Zhoujun</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Building Task-Oriented Dialogue Systems for Online Shopping</article-title>
          .
          <source>In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9</source>
          ,
          <year>2017</year>
          , San Francisco, California, USA,
          <string-name>
            <given-names>Satinder P.</given-names>
            <surname>Singh</surname>
          </string-name>
          and Shaul Markovitch (Eds.). AAAI Press,
          <fpage>4618</fpage>
          -
          <lpage>4626</lpage>
          . http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14261
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Guoguang</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianyu</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Alt</surname>
          </string-name>
          , Robert Schwarzenberg, Leonhard Hennig, Stefan Schafer, Sven Schmeier, Changjian Hu, and
          <string-name>
            <given-names>Feiyu</given-names>
            <surname>Xu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>MOLI: Smart Conversation Agent for Mobile Customer Service</article-title>
          .
          <source>Information</source>
          <volume>10</volume>
          ,
          <issue>2</issue>
          (
          <year>2019</year>
          ),
          <fpage>63</fpage>
          . https://doi.org/10.3390/info10020063
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>