<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Workshops, March</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Multi-Turn Emotionally Engaging Dialog Model</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Yubo Xie Ekaterina Svikhnushina Pearl Pu École Polytechnique Fédérale de Lausanne Lausanne</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>17</volume>
      <issue>2020</issue>
      <abstract>
<p>Open-domain dialog systems (also known as chatbots) have increasingly drawn attention in natural language processing. Some of the recent work aims at incorporating affect information into sequence-to-sequence neural dialog modeling, making the responses emotionally richer, while other work uses hand-crafted rules to determine the desired emotional response. However, these approaches do not explicitly learn the subtle emotional interactions captured in human dialogs. In this paper, we propose a multi-turn dialog system aimed at learning and generating emotional responses that so far only humans know how to do. Compared with two baseline models, offline experiments show that our method performs the best in perplexity scores. Further human evaluations confirm that our chatbot can keep track of the conversation context and generate emotionally more appropriate responses while performing equally well on grammar.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
<p>• Human-centered computing → Human computer interaction (HCI); Natural language interfaces.
Keywords: chatbots, affective computing, deep learning, natural language processing.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>
Many application areas show significant benefits of integrating affect information into natural language dialogs. In earlier
work on human computer interaction, Klein et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] found that
users’ frustration caused by a computer system can be
alleviated by computer-initiated emotional support, by providing
feedback on emotional content along with sympathy and
empathy. Recently, Hu et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] developed a customer
support neural chatbot, capable of generating dialogs similar
to the humans in terms of empathic and passionate tones,
potentially serving as proxy customer support agents on
social media platforms. In a qualitative study [
        <xref ref-type="bibr" rid="ref47">47</xref>
        ],
participants expressed an interest in chatbots capable of serving
as an attentive listener and providing motivational support,
thus fulfilling users’ emotional needs. Several participants
even noted that a chatbot is ideal for sensitive content that is too embarrassing to ask another human. Finally, Bickmore and Picard [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] showed that a relational agent with deliberate social-emotional skills was respected more, liked more, and trusted more, even after four weeks of interaction, compared to an equivalent task-oriented agent.
      </p>
      <p>
        Recent development in neural language modeling has
generated significant excitement in the open-domain dialog
generation community. The success of sequence-to-sequence
(seq2seq) learning [
        <xref ref-type="bibr" rid="ref37 ref5">5, 37</xref>
        ] in the field of neural machine
translation has inspired researchers to apply the recurrent neural
network (RNN) encoder-decoder structure to response
generation [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ]. Following the standard seq2seq structure, various
improvements have been made on the neural conversation
model. For example, Shang et al. [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] applied attention
mechanism [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to the same structure on Twitter-style
microblogging data. Li et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] found that the original version tends to favor short and dull responses; they fixed this problem by increasing the diversity of the responses. Li et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
modeled the personalities of the speakers, and Xing et al. [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ]
developed a topic-aware dialog system. We refer to work in this area collectively as neural dialog generation. For a comprehensive
survey, please refer to [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
More recently, researchers have started incorporating affect information into neural dialog models. While a central theme
seems to be making the responses emotionally richer,
existing approaches mainly follow two directions. In one, an
emotion label is explicitly required as input so that the
machine can generate sentences of that particular emotion label
or type [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ]. In another group of work, the main idea is to develop handcrafted rules that direct the machines to generate responses of the desired emotions [
        <xref ref-type="bibr" rid="ref1 ref48">1, 48</xref>
        ]. Both approaches require an emotion label as input (either given or handcrafted), which might be impractical in real dialog scenarios.
      </p>
      <p>
        Furthermore, to the best of our knowledge, the psychology
and social science literature does not provide clear rules for
emotional interaction. It seems such social and emotional
intelligence is captured in our conversations. This is why we
decided to take the automatic and data-driven approach. In
this paper, we describe an end-to-end Multi-turn Emotionally
Engaging Dialog model (MEED), capable of recognizing
emotions and generating emotionally appropriate and human-like responses, with the ultimate goal of reproducing social behaviors that are habitual in human-human conversations.
We chose the multi-turn setting because a model suitable for single-turn dialogs cannot effectively track earlier context in multi-turn dialogs, either semantically or emotionally. Since tracking several turns is essential, we made this design decision from the beginning, in contrast to most related work, where models are only trained and tested on single-turn dialogs. While using a hierarchical mechanism
to track the conversation history in multi-turn dialogs is
not new (e.g., HRAN by Xing et al. [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ]), combining it with an additional emotion RNN to process the emotional information in each history utterance has not been attempted before.
      </p>
      <p>Our contributions are threefold. (1) We describe in detail a
novel emotion-tracking dialog generation model that learns
the emotional interactions directly from the data. This
approach is free of human-defined heuristic rules, and hence is more robust and more fundamental than those described in
existing work. (2) We compare our model, MEED, with the
generic seq2seq model and the hierarchical model of multi-turn dialogs (HRAN). Offline experiments show that our model outperforms both seq2seq and HRAN by a significant margin. Further experiments with human evaluation show that our model produces emotionally more appropriate responses than both baselines, while also improving language fluency. (3) We illustrate a human-evaluation procedure for judging machine-produced emotional dialogs. We consider factors such as the balance of positive and negative emotions in test dialogs, a well-chosen range of topics, and dialogs that our human evaluators can relate to. It is the first time such an approach has been designed with consideration for human judges. Our main goal is to increase the objectivity of the results and reduce judges’ mistakes due to out-of-context dialogs they have to evaluate.</p>
    </sec>
    <sec id="sec-3">
      <title>2 RELATED WORK</title>
    </sec>
    <sec id="sec-4">
      <title>Neural Dialog Generation</title>
      <p>
        Vinyals and Le [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ] were among the first to model dialog
generation using neural networks. Their seq2seq framework
was trained on an IT Helpdesk Troubleshooting dataset
and the OpenSubtitles dataset [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Shang et al. [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]
further trained the seq2seq model with attention mechanism
on a self-crawled Weibo (a popular Twitter-like social media
website in China) dataset. Meanwhile, Xu et al. [
        <xref ref-type="bibr" rid="ref46">46</xref>
        ] built a
customer service chatbot by training the seq2seq model on
a dataset collected with conversations between customers
and customer service accounts from 62 brands on Twitter.
      </p>
      <p>
        The standard seq2seq framework is applied to single-turn
response generation. In multi-turn settings, where a context
with multiple history utterances is given, the same
structure often ignores the hierarchical characteristic of the
context. Some recent work addresses this problem by
adopting a hierarchical recurrent encoder-decoder (HRED)
structure [
        <xref ref-type="bibr" rid="ref32 ref33 ref35">32, 33, 35</xref>
        ]. To give attention to different parts of the
context while generating responses, Xing et al. [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ] proposed
the hierarchical recurrent attention network (HRAN), using
a hierarchical attention mechanism. However, these multi-turn dialog models do not take into account the turn-taking emotional changes of the dialog.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Neural Dialog Models with Affect Information</title>
      <p>Recent work on incorporating affect information into natural language processing tasks has inspired our current work. These approaches can be mainly described as affect language models and emotional dialog systems.</p>
      <p>
        Ghosh et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] made the first attempt to augment the original LSTM language model with affect treatment in what they called Affect-LM. At training time, Affect-LM can be considered an energy-based model, where the added energy term captures the degree of correlation between the next word and the affect information of the preceding text. At text generation time, affect information is also used to increase the appropriate selection of the next word. A key component in Affect-LM is the use of a well-established text analysis program, LIWC (Linguistic Inquiry and Word Count) [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. For every sentence, for example, “I unfortunately did not pass my exam”, the model generates five emotion features denoting (sad: 1, angry: 1, anxiety: 1, negative emotion: 1, positive emotion: 0). This makes Affect-LM both capable of distinguishing affect information conveyed by each word in the language modeling part and aware of the preceding text’s emotion in each generation step. In a similar vein,
Asghar et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] appended the original word embeddings with a VAD affect model [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ]. VAD is a vector model, as opposed to a categorical model (LIWC), representing a given emotion along the valence, arousal, and dominance axes. In contrast to Affect-LM, Asghar’s neural affect dialog model aims at generating explicit responses given a particular utterance. To do so, the authors designed three affect-related loss functions, namely minimizing affect dissonance, maximizing affect dissonance, and maximizing affective content. The paper also proposed an affectively diverse beam search during decoding, so that the generated candidate responses are as affectively diverse as possible. However, the literature in affective science does not necessarily validate such rules. In fact, the best strategy for speaking to an angry customer is de-escalation (using neutral words to validate the anger) rather than employing equally emotional words (minimizing affect dissonance) or words that convey happiness (maximizing affect dissonance).
      </p>
      <p>
        The Emotional Chatting Machine (ECM) [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ] takes a post
and generates a response in a predefined emotion category.
The main idea is to use an internal memory module to
capture the emotion dynamics during decoding, and an external
memory module to model emotional expressions explicitly
by assigning different probability values to emotional words
as opposed to regular words. Zhou and Wang [
        <xref ref-type="bibr" rid="ref50">50</xref>
        ] extended
the standard seq2seq model to a conditional variational
autoencoder combined with policy gradient techniques. The
model takes a post and an emoji as input, and generates the
response with target emotion specified by the emoji. Hu et
al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] built a tone-aware chatbot for customer care on social
media, by deploying extra meta information of the
conversations in the seq2seq model. Specifically, a tone indicator is
added to each step of the decoder during the training phase.
      </p>
      <p>
        In parallel to these developments, Zhong et al. [
        <xref ref-type="bibr" rid="ref48">48</xref>
        ]
proposed an affect-rich dialog model using a biased attention mechanism on emotional words in the input message, by
taking advantage of the VAD embeddings. The model is trained
with a weighted cross-entropy loss function, which
encourages the generation of emotional words.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Summary</title>
      <p>
        As much as the work described above inspired our own, our approach to generating affective dialogs is significantly different. While most related work focuses on integrating affect information into the transduction vector space using either VAD or LIWC, we aim at modeling and generating the affect exchanges in human dialogs using a dedicated embedding layer. Our approach is also completely data-driven, and thus free of hand-crafted rules. To avoid learning the obscene and callous exchanges often found in social media data such as tweets and Reddit threads [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], we opted to train our model on movie subtitles, whose dialogs were carefully created by professional writers. We believe the quality of this dataset can be better than those curated on crowdsourcing platforms. For modeling the affect information, we chose LIWC because it is a well-established emotion lexical resource covering the whole English dictionary, whereas VAD only contains 13K lemmatized terms.
      </p>
    </sec>
    <sec id="sec-7">
      <title>3 MODEL</title>
      <p>We describe our model one element at a time, from the basic
structure, to the hierarchical component, and finally the
emotion embedding layer.</p>
      <p>We first consider the problem of generating a response y given a context X consisting of multiple previous utterances, by estimating the probability distribution p(y | X) from a data set D = {(X^(i), y^(i))}_{i=1}^{N} containing N context-response pairs. Here

X^(i) = (x_1^(i), x_2^(i), ..., x_{m_i}^(i))   (1)

is a sequence of m_i utterances,

x_j^(i) = (x_{j,1}^(i), x_{j,2}^(i), ..., x_{j,n_{ij}}^(i))   (2)

is a sequence of n_{ij} words, and

y^(i) = (y_1^(i), y_2^(i), ..., y_{T_i}^(i))   (3)

is the response with T_i words.</p>
      <p>
        Usually the probability distribution p(y | X ) can be
modeled by an RNN language model conditioned on X . When
generating the word yt at time step t , the context X is
encoded into a fixed-sized dialog context vector ct by following
the hierarchical attention structure in HRAN [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ].
Additionally, we extract the emotion information from the utterances
in X by leveraging an external text analysis program, and
use an RNN to encode it into an emotion context vector e,
which is combined with ct to produce the distribution. The
overall architecture of the model is depicted in Figure 1. We
are going to elaborate on how to obtain ct and e, and how
they are combined in the decoding part.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Hierarchical Attention</title>
      <p>
        The hierarchical attention structure involves two encoders to produce the dialog context vector c_t, namely the word-level encoder and the utterance-level encoder. The word-level encoder is essentially a bidirectional RNN with gated recurrent units (GRU) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. For utterance x_j in X (j = 1, 2, ..., m), the bidirectional encoder produces two hidden states at each word position k, the forward hidden state h^f_{jk} and the backward hidden state h^b_{jk}. The final hidden state h_{jk} is then obtained by concatenating the two:

h_{jk} = concat(h^f_{jk}, h^b_{jk}).   (4)

The utterance-level encoder is a unidirectional RNN with GRU that goes from the last utterance in the context to the first, with its input at each step being the summary of the corresponding utterance, obtained by applying a Bahdanau-style attention mechanism [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] on the word-level encoder output. More specifically, at decoding step t, the summary of utterance x_j is a linear combination of h_{jk}, for k = 1, 2, ..., n_j:

r^t_j = Σ_{k=1}^{n_j} α^t_{jk} h_{jk}.   (5)
      </p>
      <p>Here α^t_{jk} is the word-level attention score placed on h_{jk}, and can be calculated as

a^t_{jk} = v_a^T tanh(U_a s_{t−1} + V_a ℓ^t_{j+1} + W_a h_{jk}),   (6)

α^t_{jk} = exp(a^t_{jk}) / Σ_{k′=1}^{n_j} exp(a^t_{jk′}),   (7)

where s_{t−1} is the previous hidden state of the decoder, ℓ^t_{j+1} is the previous hidden state of the utterance-level encoder, and v_a, U_a, V_a and W_a are word-level attention parameters. The final dialog context vector c_t is then obtained as another linear combination of the outputs of the utterance-level encoder ℓ^t_j, for j = 1, 2, ..., m:

c_t = Σ_{j=1}^{m} β^t_j ℓ^t_j.   (8)</p>
      <p>Here β^t_j is the utterance-level attention score placed on ℓ^t_j, and can be calculated as

b^t_j = v_b^T tanh(U_b s_{t−1} + W_b ℓ^t_j),   (9)

β^t_j = exp(b^t_j) / Σ_{j′=1}^{m} exp(b^t_{j′}),   (10)

where s_{t−1} is the previous hidden state of the decoder, and v_b, U_b and W_b are utterance-level attention parameters.</p>
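As an illustrative sketch (not the trained model), the two levels of attention above can be computed with NumPy. The dimensions, random weights, and the reuse of one array as both the utterance-level encoder outputs and their predecessor states are simplifying assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8                                     # hidden size (illustrative)
m, n = 3, 5                               # m utterances, n words each

# Word-level bidirectional encoder outputs h_jk (forward+backward concat -> 2d).
h = rng.normal(size=(m, n, 2 * d))
s_prev = rng.normal(size=d)               # previous decoder state s_{t-1}
ell = rng.normal(size=(m, d))             # stand-in for utterance-level states

# Word-level attention parameters.
Ua, Va, Wa = (rng.normal(size=(d, d)), rng.normal(size=(d, d)),
              rng.normal(size=(d, 2 * d)))
va = rng.normal(size=d)

# Summary r^t_j of each utterance: softmax-weighted sum of its word states.
r = np.zeros((m, 2 * d))
for j in range(m):
    scores = np.array([va @ np.tanh(Ua @ s_prev + Va @ ell[j] + Wa @ h[j, k])
                       for k in range(n)])
    alpha = softmax(scores)               # word-level weights α^t_jk
    r[j] = alpha @ h[j]                   # Σ_k α^t_jk h_jk

# Utterance-level attention over ℓ^t_j gives the dialog context vector c_t.
Ub, Wb = rng.normal(size=(d, d)), rng.normal(size=(d, d))
vb = rng.normal(size=d)
b = np.array([vb @ np.tanh(Ub @ s_prev + Wb @ ell[j]) for j in range(m)])
beta = softmax(b)                         # utterance-level weights β^t_j
c_t = beta @ ell                          # dialog context vector

print(r.shape, c_t.shape)  # (3, 16) (8,)
```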
    </sec>
    <sec id="sec-9">
      <title>Emotion Encoder</title>
      <p>The main objective of the emotion embedding layer is to recognize the affect information in the given utterances so that the model can respond with emotionally appropriate replies. To achieve this, we need an encoder that distinguishes the affect information in the context, in addition to its semantic meaning. Equally, we need a decoder capable of selecting the best and most human-like answers.</p>
      <p>
        We are able to achieve this goal, i.e., capturing the emotion
information carried in the context X , in the encoder, thanks
to LIWC. We make use of the five emotion-related categories,
namely positive emotion, negative emotion, anxious, angry,
and sad. This set can be expanded to include more categories
if we desire a richer distinction. See the discussion section for
more details on how to do this. Using the newest version of
the program LIWC2015,1 we are able to map each utterance
xj in the context to a six-dimensional indicator vector 1(xj ),
with the first five entries corresponding to the five emotion
categories, and the last one corresponding to neutral. If any
word in xj belongs to one of the five categories, then the
corresponding entry in 1(xj ) is set to 1; otherwise, xj is
treated as neutral, with the last entry of 1(xj ) set to 1. For
example, assuming x_j = “he is worried about me”, then 1(x_j) = [0, 1, 1, 0, 0, 0], since the word “worried” is assigned to both negative emotion and anxious. We apply a dense layer with sigmoid activation function on top of 1(x_j) to embed the emotion indicator vector into a continuous space,
      </p>
      <p>a_j = σ(W_e 1(x_j) + b_e),   (11)

where W_e and b_e are trainable parameters. The emotion flow of the context X is then modeled by a unidirectional RNN with GRU going from the first utterance in the context to the last, with its input being a_j at each step. The final emotion context vector e is obtained as the last hidden state of this emotion encoding RNN.</p>
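A minimal sketch of this emotion encoder follows. The tiny lexicon is a stand-in for the proprietary LIWC2015 dictionary, and the from-scratch GRU, dimensions, and random weights are illustrative assumptions:

```python
import numpy as np

# Toy stand-in for the five LIWC emotion categories; the real LIWC2015
# lexicon is proprietary and far larger.
LEXICON = {
    "great": {"pos"}, "worried": {"neg", "anx"}, "hate": {"neg", "ang"},
    "sad": {"neg", "sad"},
}
CATS = ["pos", "neg", "anx", "ang", "sad"]   # neutral is the 6th entry

def indicator(utterance):
    """Map an utterance to the six-dimensional indicator vector 1(x_j)."""
    v = np.zeros(6)
    for w in utterance.lower().split():
        for c in LEXICON.get(w, ()):
            v[CATS.index(c)] = 1.0
    if v[:5].sum() == 0:
        v[5] = 1.0                           # neutral fallback
    return v

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 8                                        # emotion embedding size (toy)
We, be = rng.normal(size=(d, 6)), rng.normal(size=d)

def emotion_context(utterances):
    """Embed each indicator (a_j = sigmoid(We 1(x_j) + be)) and run a
    minimal GRU over the utterances; returns the last hidden state e."""
    Wz, Wr, Wh = (rng.normal(size=(d, 2 * d)) for _ in range(3))
    e = np.zeros(d)
    for x in utterances:
        a = sigmoid(We @ indicator(x) + be)
        za = np.concatenate([e, a])
        z, r = sigmoid(Wz @ za), sigmoid(Wr @ za)
        h_tilde = np.tanh(Wh @ np.concatenate([r * e, a]))
        e = (1 - z) * e + z * h_tilde        # GRU state update
    return e

print(indicator("he is worried about me"))  # [0. 1. 1. 0. 0. 0.]
```

The printed vector reproduces the worked example above: “worried” activates both negative emotion and anxious.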
    </sec>
    <sec id="sec-10">
      <title>Decoding</title>
      <p>The probability distribution p(y | X) can be written as

p(y | X) = p(y_1, y_2, ..., y_T | X)
         = p(y_1 | c_1, e) ∏_{t=2}^{T} p(y_t | y_1, ..., y_{t−1}, c_t, e).   (13)

We model this probability distribution using an RNN language model along with the emotion context vector e. Specifically, at time step t, the hidden state of the decoder s_t is obtained by applying the GRU function,</p>
      <p>
s_t = GRU(s_{t−1}, concat(c_t, w_{y_{t−1}})),   (14)

where w_{y_{t−1}} is the word embedding of y_{t−1}. Similar to Affect-LM [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we then define a new feature vector o_t by concatenating s_t (which we refer to as the language context vector) with the emotion context vector e,
      </p>
      <p>o_t = concat(s_t, e),   (15)

on which we apply a softmax layer to obtain a probability distribution over the vocabulary,</p>
      <p>p_t = softmax(W o_t + b),   (16)

where W and b are trainable parameters. Each term in Equation (13) is then given by</p>
      <p>p(y_t | y_1, ..., y_{t−1}, c_t, e) = p_{t, y_t}.   (17)
1 https://liwc.wpengine.com/</p>
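A single decoding step can be sketched as follows; the minimal GRU cell, dimensions, and random weights are illustrative assumptions, not the trained MEED parameters:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, V = 8, 20                       # hidden size and vocab size (illustrative)

# Minimal GRU cell for the decoder: input is concat(c_t, w_{y_{t-1}}) (3d).
Wz, Wr, Wh = (rng.normal(size=(d, d + 3 * d)) for _ in range(3))

def gru(s_prev, x):
    zx = np.concatenate([s_prev, x])
    z, r = sigmoid(Wz @ zx), sigmoid(Wr @ zx)
    h = np.tanh(Wh @ np.concatenate([r * s_prev, x]))
    return (1 - z) * s_prev + z * h

W, b = rng.normal(size=(V, 2 * d)), rng.normal(size=V)
emb = rng.normal(size=(V, d))      # word embeddings w_y

def decode_step(s_prev, c_t, e, y_prev):
    """One decoding step: GRU update, concat with e, softmax over vocab."""
    x = np.concatenate([c_t, emb[y_prev]])   # concat(c_t, w_{y_{t-1}})
    s_t = gru(s_prev, x)                     # new decoder state
    o_t = np.concatenate([s_t, e])           # language + emotion context
    p_t = softmax(W @ o_t + b)               # distribution over vocabulary
    return s_t, p_t

s, p = decode_step(np.zeros(d), rng.normal(size=2 * d), rng.normal(size=d), 3)
print(round(p.sum(), 6))  # 1.0
```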
      <p>L = −(1 / Σ_{i=1}^{N} T_i) Σ_{i=1}^{N} log p(y^(i) | X^(i)).   (18)</p>
      <p>2 We chose the maximum number of turns to be six because we would like to have a longer context for each dialog while at the same time keeping the training procedure computationally efficient.</p>
      <p>We removed the context-response pairs whose responses appear too many times (the threshold is set to 10 for Cornell, and 5 for DailyDialog), to prevent them from dominating the learning procedure. See Table 1 for the sizes of the training and validation sets. The test set consists of 100 dialogs with four turns. We give a more detailed description of how we created the test set in the section on human evaluation.</p>
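The loss above averages the negative log-likelihood over the total number of response tokens. A minimal sketch with toy per-token probabilities:

```python
import numpy as np

def nll_loss(token_probs):
    """Negative log-likelihood averaged over all response tokens.

    token_probs: list of arrays, one per response; entry t is the model
    probability of the reference token y_t given the history and context.
    """
    total_tokens = sum(len(p) for p in token_probs)          # total token count
    total_logp = sum(np.log(p).sum() for p in token_probs)   # sum of log-probs
    return -total_logp / total_tokens

# Two toy responses with 3 and 2 tokens respectively.
probs = [np.array([0.5, 0.25, 0.5]), np.array([0.125, 0.5])]
print(round(nll_loss(probs), 4))  # 1.109
```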
    </sec>
    <sec id="sec-11">
      <title>Baselines and Implementation</title>
      <p>Our choice of including S2S is rather obvious. Including HRAN instead of other neural dialog models with affect information was not an easy decision. As mentioned in the related work, Asghar’s affective dialog model, the affect-rich conversation model, and the Emotional Chatting Machine do not learn the emotional exchanges in the dialogs. This leaves us wondering whether a multi-turn neural model can be as effective in learning emotional exchanges as MEED. In addition, comparing S2S and HRAN also gives us an idea of how much the hierarchical mechanism improves upon the basic model. This is why our final comparison is based on three multi-turn dialog generation models: the standard seq2seq model (denoted as S2S), HRAN, and our proposed model, MEED. In order to adapt S2S to the multi-turn setting, we concatenate all the history utterances in the context into one.</p>
      <p>
        For all the models, the vocabulary consists of the 20,000 most frequent words in the Cornell and DailyDialog datasets, plus three extra tokens: &lt;unk&gt; for words that do not exist in the vocabulary, &lt;go&gt; indicating the beginning of an utterance, and &lt;eos&gt; indicating the end of an utterance. Here we summarize the configurations and parameters of our experiments:
• We set the word embedding size to 256. We initialized the word embeddings in the models with word2vec [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] vectors first trained on Cornell and then fine-tuned on DailyDialog, consistent with the training procedure of the models.
• We set the number of hidden units of each RNN to 256, the word-level attention depth to 256, and the utterance-level attention depth to 128. The output size of the emotion embedding layer is 256.
• We optimized the objective function using the Adam
optimizer [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] with an initial learning rate of 0.001.
• For prediction, we used beam search [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] with a beam
width of 256.
      </p>
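The vocabulary construction described above can be sketched as follows; the toy corpus and helper names are illustrative, not our preprocessing code:

```python
from collections import Counter

def build_vocab(corpus, max_size=20000):
    """Keep the max_size most frequent words plus three special tokens."""
    counts = Counter(w for utterance in corpus for w in utterance.split())
    words = [w for w, _ in counts.most_common(max_size)]
    vocab = ["<unk>", "<go>", "<eos>"] + words
    return {w: i for i, w in enumerate(vocab)}

def encode(utterance, vocab):
    """Map an utterance to token ids, wrapped with <go>/<eos>; unknown
    words fall back to <unk>."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in utterance.split()]
    return [vocab["<go>"]] + ids + [vocab["<eos>"]]

corpus = ["how are you", "i am fine thanks", "how about you"]
vocab = build_vocab(corpus, max_size=20000)
print(encode("how are you doing", vocab))
```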
      <p>We have made the source code publicly available.3</p>
    </sec>
    <sec id="sec-12">
      <title>Evaluation Metrics</title>
      <p>
The evaluation of chatbots remains an open problem in the field. Recent work [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] has shown that the automatic
evaluation metrics borrowed from machine translation such as
      </p>
      <sec id="sec-12-1">
        <title>3https://github.com/yuboxie/meed</title>
        <p>
          BLEU score [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] tend to align poorly with human judgement.
Therefore, in this paper, we mainly adopt human evaluation,
along with perplexity and BLEU score, following the existing
work.
        </p>
        <p>
          Automatic Evaluation. Perplexity is a measurement of how well a probability model predicts a sample. It is a popular method in language modeling, and many researchers in the neural dialog generation community have adopted it, especially in the beginning of this field [
          <xref ref-type="bibr" rid="ref32 ref42 ref45 ref48 ref49 ref50">32, 42, 45, 48–50</xref>
          ]. It measures how well a dialog model predicts the target response. Given a target response y = {y_1, y_2, ..., y_T}, the perplexity is calculated as

ppl(y) = p(y_1, y_2, ..., y_T)^{−1/T}
       = exp(−(1/T) Σ_{t=1}^{T} log p(y_t | y_1, ..., y_{t−1})).   (19)

Thus a lower perplexity score indicates that the model is better at predicting the target sentence, i.e., the human’s response. Some researchers [
          <xref ref-type="bibr" rid="ref19 ref34 ref48">19, 34, 48</xref>
          ] argue that the perplexity score is not the ideal measurement, because for a given context history one should allow many responses. This is especially true if we want our conversational agents to speak more diversely. However, for our purpose, which is to speak emotionally appropriately and in as human-like a way as possible, we believe it is a good measure. We do recognize that it is not the only way to measure a chatbot’s performance, which is why we also conducted a human evaluation experiment.
        </p>
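Equation (19) can be computed directly from the per-token probabilities; the numbers below are toy values, not model outputs:

```python
import numpy as np

def perplexity(token_probs):
    """ppl(y) = exp(-(1/T) * sum_t log p(y_t | y_1..y_{t-1}))."""
    token_probs = np.asarray(token_probs, dtype=float)
    return float(np.exp(-np.log(token_probs).mean()))

# A model that assigns probability 0.25 to every target token has
# perplexity 4: it is as uncertain as a uniform choice among 4 words.
print(perplexity([0.25, 0.25, 0.25]))  # 4.0
```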
        <p>
          BLEU score is often used to measure the quality of machine-translated text. Some earlier work on dialog response
generation [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ] adopted this metric to measure the performance
of chatbots. However, a recent study [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] suggests that it does
not align well with human evaluation. Nevertheless, we still
include BLEU scores in this paper, to get a sense of
comparison with perplexity and human evaluation results.
Human Evaluation. Human evaluation has been widely used
to evaluate open-domain dialog generation tasks. This
approach can include any criterion as we judge appropriate.
Most commonly, researchers have included the model’s
ability to generate grammatically correct, contextually coherent,
and emotionally appropriate responses, of which the latter
two properties cannot be reliably evaluated using automatic
metrics. Recent work [
          <xref ref-type="bibr" rid="ref1 ref48 ref49">1, 48, 49</xref>
          ] on afect-rich conversational
chatbots turned to human opinion to evaluate both fluency
and emotionality of their models. But such human experiments are sensitive to risk factors if the experiment is not carefully designed. These include whether the instructions are clear, whether they have been tested with users beforehand, and whether there is a good balance of the human judgement tasks. Further, if a test set for human evaluation is prepared
by randomly sampling the dialogs from the dataset, it may
include out-of-context dialogs, causing confusion and ambiguity for human evaluators. An unbalanced emotional distribution of the test dialogs may also lead to biased conclusions, since the chatbot’s abilities would be evaluated on an unrepresentative sample.
        </p>
        <p>To take into account the above issues, we took several
iterations to prepare the instructions and the test set before
conducting the human evaluation experiment. Part of our
test set comes from the DailyDialog dataset, which consists
of meaningful complete dialogs. To compensate for the imbalance, we further curated more negative-emotion dialogs so that the final set has an equal emotion distribution. We provide
the details about the test data preparation process and the
evaluation experiment below.</p>
        <p>Preparation of Natural Dialog Test Set. We first selected
the emotionally colored dialogs with exactly four turns from
the DailyDialog dataset. In the dataset each dialog turn is
annotated with a corresponding emotional category,
including the neutral one. For our purposes we filtered out only
those dialogs where more than a half of utterances have
non-neutral emotional labels, resulting in 78 emotionally
positive dialogs and 14 emotionally negative dialogs. We
recruited two human workers to augment the data to
produce more emotionally negative dialogs. Both of them were
PhD students from our university (males, aged 24 and 25),
fluent in English, and not related to the authors’ lab. We found them via email and messaging platforms, and offered
80 CHF (or roughly US $80) gift coupons as incentive for
each participant. The workers fulfilled the tasks in a Google form4 following the instructions and created five negative
dialogs with four turns, as if they were interacting with
another human, in each of the following topics: relationships,
entertainment, service, work and study, and everyday
situations. The Google form was released on 31 January 2019, and
the workers finished their tasks by 4 February 2019.
Subsequently, to form the final test set, we randomly selected
50 emotionally positive and 50 emotionally negative dialogs
from the two pools of dialogs described above.</p>
        <p>Human Evaluation Experiment Design. In the final human
evaluation of the model, we recruited four more PhD
students from our university (1 female and 3 males, aged 22–25).
Three of them are fluent English speakers and one is a native
speaker. The recruitment proceeded in the same manner as
described above; the raters were offered 80 CHF (or roughly US $80) gift coupons per participant for fulfilling the task, and an extra 20 CHF (or roughly US $20) coupon was promised
4We provide the link to the form used for creating the dialogs: https://forms.
gle/rPagMZYuYJ3M3Sq8A, hoping to help other researchers reproduce the
same procedure. However, due to privacy concerns, we do not plan to release
this dataset.
as a bonus to the rater judged to be the most serious. For the
evaluation survey, we also leveraged Google form.
Specifically, we randomly shuffled the 100 dialogs in the test set,
then used the first three utterances of each dialog as the
input to the three models being compared (S2S, HRAN, and
MEED), and obtained the respective responses. The dialog contexts
and the three models’ responses were included in the Google form.
According to the context given, the raters were instructed to
evaluate the quality of the responses based on three criteria:
(1) Grammatical correctness—whether or not the response
is fluent and free of grammatical mistakes;
(2) Contextual coherence—whether or not the response is
context sensitive to the previous dialog history;
(3) Emotional appropriateness—whether or not the response
conveys the right emotion and feels as if it had been
produced by a human.</p>
        <p>For each criterion, the raters gave scores of either 0, 1 or 2,
where 0 means bad, 2 means good, and 1 indicates neutral. For
this survey, the Google form was launched on 12 February
2019, and all the submissions from our raters were collected
by 14 February 2019.</p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>Results and Analysis</title>
      <p>In this subsection, we present the experimental results of the
automatic evaluation metric as well as human judgement,
followed by some analysis.</p>
      <p>
        Automatic Evaluation Results. Table 2 gives the perplexity
and BLEU scores obtained by the three models on the two
validation sets and the test set. As shown in the table, MEED
achieves the lowest perplexity and the highest BLEU score on
all three sets. We conducted t-tests on the perplexity values obtained,
and the results show significant improvements of MEED over S2S
and HRAN on the two validation sets (with p-value &lt; 0.05).
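The significance test above can be sketched as a paired t-test over per-dialog perplexities. The numbers below are illustrative, not the paper's measurements, and the t statistic is computed directly rather than with a statistics package.

```python
import math

def paired_t_statistic(xs, ys):
    """Paired t statistic for per-dialog perplexities of two models."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Illustrative per-dialog perplexities (not the paper's numbers).
s2s  = [42.1, 39.8, 45.3, 41.0, 44.2, 40.5, 43.7, 42.9]
meed = [38.4, 37.2, 41.1, 38.9, 40.3, 37.8, 40.6, 39.5]

t = paired_t_statistic(s2s, meed)
# With df = 7, |t| > 2.365 corresponds to p < 0.05 (two-sided).
print(round(t, 2))
```

In practice one would compare the resulting statistic against the t distribution with n − 1 degrees of freedom to obtain the p-value.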
Human Evaluation Results. Tables 3, 4 and 5 summarize the
human evaluation results on the responses’ grammatical
correctness, contextual coherence, and emotional
appropriateness, respectively. In the tables, we give the percentage of
votes each model received for the three scores, the average
score obtained, and the agreement score among the raters.
Note that we report Fleiss’ κ score [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for contextual
coherence and emotional appropriateness, and Finn’s r score [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
for grammatical correctness. We did not use Fleiss’ κ for
grammatical correctness because, when agreement is extremely high,
Fleiss’ κ becomes very sensitive to prevalence [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Conversely, we did not use Finn’s r score for
contextual coherence and emotional appropriateness because it is
only reasonable when the observed variance is significantly
less than the chance variance [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ], which did not apply to
these two criteria. As shown in the tables, we got high
agreement among the raters for grammatical correctness, and fair
agreement among the raters for contextual coherence and
emotional appropriateness.5 For grammatical correctness, all
three models achieved high scores, which means all models
are capable of generating fluent utterances that make sense.
For contextual coherence and emotional appropriateness,
MEED achieved higher average scores than S2S and HRAN,
which means MEED keeps better track of the context and can
generate responses that are emotionally more appropriate
and natural. We first conducted a Friedman test [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and then
t-tests on the human evaluation results (contextual coherence
and emotional appropriateness), showing that the improvements
of MEED over S2S are significant (with p-value &lt; 0.01).
      </p>
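The inter-rater agreement reported above can be computed as follows. The rating table here is illustrative; Fleiss' κ takes, for each item, the count of raters assigning each category.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of counts: ratings[i][j] is the number of
    raters assigning item i to category j; each row sums to the number of
    raters n."""
    N = len(ratings)                      # number of items
    n = sum(ratings[0])                   # raters per item
    k = len(ratings[0])                   # number of categories
    # Per-item agreement P_i
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    # Chance agreement from marginal category proportions
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Illustrative: 5 responses scored 0/1/2 by 4 raters (counts per score).
table = [
    [0, 1, 3],
    [0, 0, 4],
    [2, 2, 0],
    [3, 1, 0],
    [0, 1, 3],
]
print(round(fleiss_kappa(table), 3))
```

A value near 1 indicates strong agreement; values in the 0.2–0.4 range correspond to the "fair" agreement level cited for contextual coherence and emotional appropriateness.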
        <p>The comparison between perplexity scores and human
evaluation results further confirms that, in the context
of dialog response generation, perplexity does not align with
human judgement. In Table 2, on all three sets, HRAN
performs worse than S2S in terms of perplexity. However, on
all three criteria in the human evaluation, HRAN actually
outperforms S2S. Based on this, we conclude that perplexity
alone is not enough for evaluating a dialog system.
5https://en.wikipedia.org/wiki/Fleiss%27_kappa#Interpretation
Visualization of Output Layer Weights. One may wonder how
HRAN and MEED differ in terms of the distributional
representations of their respective vocabularies (words in the
language model, and affect words). We therefore visualized
the output layer weights of the various models as word
embedding representations using a dimensionality reduction
technique.</p>
        <p>In the decoding phase, Equation (16) takes o_t, the
concatenation of the language context vector s_t and the emotion
context vector e, and generates a probability distribution
over the vocabulary words by applying a softmax layer. The
weight matrix of this softmax layer is denoted as W, whose
shape is |V| × 2d, where |V| is the vocabulary size and d = 256
is the hidden state size of the RNNs. Thus the ith row of the
weight matrix, W_i, can be regarded as a vector
representation of the ith word in the vocabulary. Since we concatenate
the language context vector and the emotion context
vector as the input to the softmax layer, the first half of the
weight vector Wi corresponds to the language context
vector, and the second half corresponds to the emotion context
vector. We refer to them as language model weights and
emotion weights, respectively. If the emotion embedding layer is
learning and distinguishing affect states correctly, we will
see clear differences in the visualization.</p>
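The split of the softmax weight matrix described above can be sketched as follows; the random matrix stands in for trained weights, and the toy vocabulary size is an assumption for illustration.

```python
import numpy as np

d = 256          # hidden state size, as in the paper
V = 1000         # toy vocabulary size for illustration

rng = np.random.default_rng(0)
W = rng.standard_normal((V, 2 * d))   # softmax weight matrix, shape |V| x 2d

# The softmax input is the concatenation [s_t; e], so the first d columns of
# each row multiply the language context vector and the last d columns
# multiply the emotion context vector.
language_model_weights = W[:, :d]     # shape |V| x d
emotion_weights = W[:, d:]            # shape |V| x d

# Row i of each half is a d-dimensional representation of vocabulary word i.
print(language_model_weights.shape, emotion_weights.shape)
```

Each half can then be treated as an independent word-embedding table for visualization.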
      <p>
        With t-SNE [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], we are able to reduce the dimensionality
of the weights to two, and visualize them in a straightforward
way. For better illustration, we selected the 100 most frequent
(emotionally) positive words and the 100 most frequent negative
words from the vocabulary, and used t-SNE to project the
corresponding language model weights and emotion weights
to two dimensions. Figure 2 gives the results in three
subplots. Since HRAN does not have the emotion context vector,
we just visualized the whole output layer weight vector,
which plays a similar role to the language model weights in
MEED. We can observe from the first two plots that
positive words (green dots) and negative words (red dots) are
scattered around and mixed with each other in the language
model weights for HRAN and MEED respectively, which
means no emotion information is captured in these weights.
On the contrary, the emotion weights in MEED, in the last
plot, show a clearer clustering effect, i.e., positive words are
mainly grouped on the top-left, while negative words are
mainly grouped at the bottom-right. This suggests that
the emotion encoder in MEED is capable of tracking the
emotion states in the conversation history.
      </p>
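The projection step can be sketched as below. The paper uses t-SNE; here PCA (via SVD) serves as a lightweight, dependency-free stand-in, and the two synthetic word-weight clusters are illustrative assumptions mimicking the clustering effect seen in Figure 2.

```python
import numpy as np

def project_2d(weights):
    """Project word-weight vectors to 2-D. The paper uses t-SNE; here we use
    PCA (via SVD) as a lightweight stand-in."""
    X = weights - weights.mean(axis=0)          # center the vectors
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                         # top-2 principal components

rng = np.random.default_rng(1)
# Toy stand-ins for the emotion weights of 100 positive and 100 negative
# words: two clusters with different means.
positive = rng.standard_normal((100, 256)) + 2.0
negative = rng.standard_normal((100, 256)) - 2.0
points = project_2d(np.vstack([positive, negative]))

# The first principal component should separate the two clusters,
# i.e., the two groups project to opposite signs on average.
print(points[:100, 0].mean() * points[100:, 0].mean() < 0)
```

With real trained weights, one would substitute `sklearn.manifold.TSNE(n_components=2)` for `project_2d` to reproduce the non-linear embedding used in the paper.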
      <p>Case Study. We present four sample dialogs in Table 6, along
with the responses generated by the three models. Dialogs 1
and 2 are emotionally positive and dialogs 3 and 4 are
negative. For the first two examples, we can see that MEED
is able to generate more emotional content (like “fun” and
“congratulations”) that is appropriate according to the
context. For dialog 4, MEED responds in sympathy to the other
speaker, which is consistent with the second utterance in the
context. On the contrary, HRAN poses a question in reply,
contradicting the dialog history.
</p>
    </sec>
    <sec id="sec-14">
      <title>DISCUSSION</title>
      <p>In this section, we briefly discuss how our framework can
incorporate other components, as well as several directions
to extend it.</p>
    </sec>
    <sec id="sec-15">
      <title>Emotion Recognition</title>
      <p>
To extract the affect information contained in the utterances,
we used the LIWC text analysis program. We believe this
emotion recognition step is vital for a dialog model to
produce emotionally appropriate responses. However, the choice
of emotion classifier is not strictly limited to LIWC. It could
be replaced by another well-established affect recognizer or one
that is more appropriate to the target domain. For example,
we can consider using more fine-grained emotion categories
from GALC [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], or using DeepMoji [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which was trained
on millions of tweets with emoji labels and is more suitable
for tweet-like conversations. However, for DeepMoji, the 64
categories of emojis do not have a clear and exact
correspondence with standardized emotion categories or with the VAD
vectors.
      </p>
    </sec>
    <sec id="sec-16">
      <title>Training Data</title>
      <p>We pre-trained our model on the Cornell movie subtitles and
then fine-tuned it with the DailyDialog dataset. We adopted
this particular training order because we would like our
chatbot to talk more like humans do in daily chit-chat, and the
DailyDialog dataset, compared with the bigger Cornell dataset,
is closer to everyday conversation. Since our model learns how to respond properly
in a data-driven way, we believe having a training dataset
with good quality while being large enough plays an
important role in developing an engaging and user-friendly
chatbot. Thus, in the future, we plan to train our model on
the multi-turn conversations that we have already extracted
from the much bigger OpenSubtitles corpus and the
EmpatheticDialogues dataset.6</p>
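The pre-train-then-fine-tune schedule described above can be sketched as follows; `train_epoch` is a stand-in for one pass of sequence-to-sequence training, and the corpus sizes, epoch counts, and learning rates are illustrative assumptions, not the paper's settings.

```python
# Two-stage training schedule: pre-train on the large movie-subtitle corpus,
# then fine-tune on the smaller, conversational DailyDialog data with a
# reduced learning rate. All quantities here are illustrative.

def train_epoch(model, corpus, lr):
    """Stand-in for one training pass; records how much data was seen."""
    model["updates"] += len(corpus)
    model["lr_history"].append(lr)
    return model

def train(model, corpus, epochs, lr):
    for _ in range(epochs):
        model = train_epoch(model, corpus, lr)
    return model

model = {"updates": 0, "lr_history": []}
cornell = ["dialog"] * 1000      # stand-in for the Cornell movie dialogs
dailydialog = ["dialog"] * 100   # stand-in for DailyDialog

model = train(model, cornell, epochs=2, lr=1e-3)       # pre-training
model = train(model, dailydialog, epochs=5, lr=1e-4)   # fine-tuning
print(model["updates"])
```

The key design choice is the ordering: the large, noisier corpus establishes general conversational competence, and the smaller fine-tuning set, seen last with a lower learning rate, dominates the final conversational style.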
    </sec>
    <sec id="sec-17">
      <title>Evaluation</title>
      <p>
        Evaluation of dialog models remains an open problem in the
response generation field. Early work [
        <xref ref-type="bibr" rid="ref18 ref30 ref36">18, 30, 36</xref>
        ] on response
generation used automatic evaluation metrics borrowed from
the machine translation field, such as the BLEU score, to
evaluate dialog systems. Later on, Liu et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] showed that these
metrics correlate poorly with human judgement. Recently,
a number of researchers began developing automatic and
data-driven evaluation methods [
        <xref ref-type="bibr" rid="ref24 ref38">24, 38</xref>
        ], with the ultimate
goal of replacing human evaluation. However, they are still
at an early stage. In this paper, we used both perplexity
measures and human judgement in our experiments to finalize
our model. In other words, using the perplexity measures,
we were able to determine when to stop training our model.
But this condition does not guarantee optimal results until a
human judgement test can validate them. We thus highly
recommend this combination, which is also a common practice
in the research community [
        <xref ref-type="bibr" rid="ref45 ref48 ref49 ref50">45, 48–50</xref>
        ].
6https://github.com/facebookresearch/EmpatheticDialogues
A: Thank god! I am finished writing that service guide! It took me forever!
B: When did you finish?
A: This morning! No more overtime, and no more headaches!
(B: Well, I’m glad to hear it. Have a cup of coffee!)
      </p>
      <sec id="sec-17-1">
        <title>A: I think that’s settled. B: I’m tired of your cut-and-dried opinions. Who do you think you are! A: How dare you speak to me like this. (B: Why not?)</title>
      </sec>
      <sec id="sec-17-2">
        <title>A: This concert was awful. B: Agreed, the musicians were not in harmony. A: It was too painful. Never again. (B: That’s for sure!)</title>
      </sec>
      <sec id="sec-17-3">
        <title>Model Responses</title>
      </sec>
      <sec id="sec-17-4">
        <title>S2S: I hope so. HRAN: Thanks a lot. MEED: That sounds like fun.</title>
      </sec>
      <sec id="sec-17-5">
        <title>S2S: When? HRAN: Why? MEED: Congratulations!</title>
      </sec>
      <sec id="sec-17-6">
        <title>S2S: What is it? HRAN: What is it? MEED: Are you serious?</title>
      </sec>
      <sec id="sec-17-7">
        <title>S2S: Congratulations!</title>
        <p>HRAN: Why not?
MEED: That’s true.</p>
      </sec>
    </sec>
    <sec id="sec-18">
      <title>Model Extensions</title>
      <p>
        Our model uses RNNs to encode the input sequences, and
GRU cells to capture long-term dependencies among different
positions in the sequences. Recent advances in natural
language understanding have proposed new network
architectures to process text input. Specifically, the Transformer [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ]
uses pure attention mechanisms without any recurrence
structures. Compared with RNNs, the Transformer can better
capture long-term dependencies due to the self-attention
mechanism, which is free of locality biases, and is more
efficient to train because of its better parallelization capability.
Following the Transformer architecture, researchers found
that pre-training language models on huge amounts of data
could largely boost the performance of downstream tasks,
and published many pre-trained language models such as
BERT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and RoBERTa [
        <xref ref-type="bibr" rid="ref23">23</xref>
]. As future work, we would like
to adopt the Transformer architecture to replace the RNNs
in our model, and to initialize our encoder with pre-trained
language models, hoping to improve the performance of
response generation.
      </p>
    </sec>
    <sec id="sec-19">
      <title>6 CONCLUSION</title>
      <p>
        We believe reproducing conversational and emotional
intelligence will make social chatbots more believable and
engaging. In this paper, we proposed a multi-turn dialog system
capable of recognizing and generating emotionally
appropriate responses, which is the first step toward such a goal. We
have demonstrated how to do so by (1) modeling utterances
with extra affect vectors, (2) creating an emotional encoding
mechanism that learns emotion exchanges in the dataset, (3)
curating a multi-turn and balanced dialog dataset, and (4)
evaluating the model with offline and online experiments.
For future directions, we would like to investigate the
diversity issue of the responses generated, possibly by extending
the mutual information objective function [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] to multi-turn
settings. We would also like to adopt the Transformer
architecture with pre-trained language model weights, and train
our model on a much larger dataset, by extracting multi-turn
dialogs from the OpenSubtitles corpus.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Nabiha</given-names>
            <surname>Asghar</surname>
          </string-name>
          , Pascal Poupart, Jesse Hoey, Xin Jiang, and
          <string-name>
            <given-names>Lili</given-names>
            <surname>Mou</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Affective Neural Response Generation</article-title>
          .
          <source>In Proceedings of ECIR</source>
          <year>2018</year>
          .
          <volume>154</volume>
          -
          <fpage>166</fpage>
          . https://doi.org/10.1007/978-3-319-76941-7_12
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Neural Machine Translation by Jointly Learning to Align and Translate</article-title>
          .
          <source>CoRR abs/1409</source>
          .0473 (
          <year>2014</year>
          ). arXiv:
          <volume>1409</volume>
          .0473 http://arxiv.org/abs/1409.0473
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Timothy</surname>
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Bickmore</surname>
            and
            <given-names>Rosalind W.</given-names>
          </string-name>
          <string-name>
            <surname>Picard</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Establishing and Maintaining Long-Term Human-Computer Relationships</article-title>
          .
          <source>ACM Trans. Comput.-Hum. Interact</source>
          .
          <volume>12</volume>
          ,
          <issue>2</issue>
          (
          <year>2005</year>
          ),
          <fpage>293</fpage>
          -
          <lpage>327</lpage>
          . https://doi.org/10.1145/ 1067860.1067867
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Hongshen</given-names>
            <surname>Chen</surname>
          </string-name>
          , Xiaorui Liu, Dawei Yin, and
          <string-name>
            <given-names>Jiliang</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Survey on Dialogue Systems: Recent Advances and New Frontiers</article-title>
          .
          <source>SIGKDD Explorations 19</source>
          ,
          <issue>2</issue>
          (
          <year>2017</year>
          ),
          <fpage>25</fpage>
          -
          <lpage>35</lpage>
          . https://doi.org/10.1145/ 3166054.3166058
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart van Merrienboer,
          <string-name>
            <surname>Çaglar Gülçehre</surname>
            , Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
            <given-names>Yoshua</given-names>
          </string-name>
          <string-name>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation</article-title>
          .
          <source>In Proceedings of EMNLP</source>
          <year>2014</year>
          .
          <volume>1724</volume>
          -
          <fpage>1734</fpage>
          . http://aclweb.org/anthology/D/D14/D14-1179.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Cristian</given-names>
            <surname>Danescu-Niculescu-Mizil</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lillian</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs</article-title>
          .
          <source>In Proceedings of CMCL@ACL</source>
          <year>2011</year>
          .
          <volume>76</volume>
          -
          <fpage>87</fpage>
          . https://aclanthology.info/papers/W11-0609/w11-0609
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In Proceedings of NAACL-HLT</source>
          <year>2019</year>
          .
          <volume>4171</volume>
          -
          <fpage>4186</fpage>
          . https://aclweb.org/anthology/papers/N/N19/N19-1423/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Bjarke</given-names>
            <surname>Felbo</surname>
          </string-name>
          , Alan Mislove, Anders Søgaard, Iyad Rahwan, and
          <string-name>
            <given-names>Sune</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Using Millions of Emoji Occurrences to Learn AnyDomain Representations for Detecting Sentiment, Emotion and Sarcasm</article-title>
          .
          <source>In Proceedings of EMNLP</source>
          <year>2017</year>
          .
          <volume>1615</volume>
          -
          <fpage>1625</fpage>
          . https://aclanthology.info/papers/D17-1169/d17-1169
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Robert</surname>
            <given-names>H</given-names>
          </string-name>
          <string-name>
            <surname>Finn</surname>
          </string-name>
          .
          <year>1970</year>
          .
          <article-title>A Note on Estimating the Reliability of Categorical Data</article-title>
          .
          <source>Educational and Psychological Measurement</source>
          <volume>30</volume>
          ,
          <issue>1</issue>
          (
          <year>1970</year>
          ),
          <fpage>71</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Joseph</surname>
            <given-names>L Fleiss</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jacob Cohen</surname>
          </string-name>
          .
          <year>1973</year>
          .
          <article-title>The Equivalence of Weighted kappa and the Intraclass Correlation Coefficient as Measures of Reliability</article-title>
          .
          <source>Educational and psychological measurement 33</source>
          ,
          <issue>3</issue>
          (
          <year>1973</year>
          ),
          <fpage>613</fpage>
          -
          <lpage>619</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Sayan</surname>
            <given-names>Ghosh</given-names>
          </string-name>
          , Mathieu Chollet, Eugene Laksana,
          <string-name>
            <surname>Louis-Philippe Morency</surname>
            , and
            <given-names>Stefan</given-names>
          </string-name>
          <string-name>
            <surname>Scherer</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Affect-LM: A Neural Language Model for Customizable Affective Text Generation</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>2017</year>
          .
          <volume>634</volume>
          -
          <fpage>642</fpage>
          . https://doi.org/10.18653/v1/P17-1059
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>David</surname>
            <given-names>C</given-names>
          </string-name>
          <string-name>
            <surname>Howell</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Fundamental Statistics for the Behavioral Sciences</article-title>
          .
          <source>Nelson Education.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>George</given-names>
            <surname>Hripcsak and Daniel F. Heitjan</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Measuring Agreement in Medical Informatics Reliability Studies</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          <volume>35</volume>
          ,
          <issue>2</issue>
          (
          <year>2002</year>
          ),
          <fpage>99</fpage>
          -
          <lpage>110</lpage>
          . https://doi.org/10.1016/S1532-0464(02)00500-2
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Tianran</surname>
            <given-names>Hu</given-names>
          </string-name>
          , Anbang Xu, Zhe Liu, Quanzeng You, Yufan Guo, Vibha Sinha, Jiebo Luo, and
          <string-name>
            <given-names>Rama</given-names>
            <surname>Akkiraju</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Touch Your Heart: A Toneaware Chatbot for Customer Care on Social Media</article-title>
          .
          <source>In Proceedings of CHI</source>
          <year>2018</year>
          .
          <volume>415</volume>
          . https://doi.org/10.1145/3173574.3173989
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Diederik</surname>
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Kingma</surname>
            and
            <given-names>Jimmy</given-names>
          </string-name>
          <string-name>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
          <source>CoRR abs/1412</source>
          .6980 (
          <year>2014</year>
          ). arXiv:
          <volume>1412</volume>
          .6980 http://arxiv.org/abs/1412.6980
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Jonathan</surname>
            <given-names>Klein</given-names>
          </string-name>
          , Youngme Moon, and
          <string-name>
            <surname>Rosalind</surname>
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Picard</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>This Computer Responds to User Frustration: Theory, Design, and Results</article-title>
          .
          <source>Interacting with Computers</source>
          <volume>14</volume>
          ,
          <issue>2</issue>
          (
          <year>2001</year>
          ),
          <fpage>119</fpage>
          -
          <lpage>140</lpage>
          . https://doi.org/10.1016/S0953-5438(01)00053-4
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Jiwei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michel</given-names>
            <surname>Galley</surname>
          </string-name>
          , Chris Brockett,
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bill</given-names>
            <surname>Dolan</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A Diversity-Promoting Objective Function for Neural Conversation Models</article-title>
          .
          <source>In Proceedings of NAACL-HLT</source>
          <year>2016</year>
          .
          <volume>110</volume>
          -
          <fpage>119</fpage>
          . http: //aclweb.org/anthology/N/N16/N16-1014.pdf
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Jiwei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michel</given-names>
            <surname>Galley</surname>
          </string-name>
          , Chris Brockett, Georgios P. Spithourakis,
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <surname>William</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Dolan</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A Persona-Based Neural Conversation Model</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>2016</year>
          . http://aclweb.org/anthology/ P/P16/P16-1094.pdf
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Jiwei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Will</given-names>
            <surname>Monroe</surname>
          </string-name>
          , Alan Ritter, Dan Jurafsky, Michel Galley, and
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep Reinforcement Learning for Dialogue Generation</article-title>
          .
          <source>In Proceedings of EMNLP</source>
          <year>2016</year>
          .
          <volume>1192</volume>
          -
          <fpage>1202</fpage>
          . http://aclweb.org/ anthology/D/D16/D16-1127.pdf
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Yanran</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hui</given-names>
            <surname>Su</surname>
          </string-name>
          , Xiaoyu Shen,
          <string-name>
            <given-names>Wenjie</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ziqiang</given-names>
            <surname>Cao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Shuzi</given-names>
            <surname>Niu</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset</article-title>
          .
          <source>In Proceedings of IJCNLP</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Lison</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jörg</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles</article-title>
          .
          <source>In Proceedings of LREC</source>
          <year>2016</year>
          . http://www.lrec-conf.org/proceedings/lrec2016/summaries/947.html
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Chia-Wei</given-names>
            <surname>Liu</surname>
          </string-name>
          , Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and
          <string-name>
            <given-names>Joelle</given-names>
            <surname>Pineau</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation</article-title>
          .
          <source>In Proceedings of EMNLP</source>
          <year>2016</year>
          .
          <fpage>2122</fpage>
          -
          <lpage>2132</lpage>
          . http://aclweb.org/anthology/D/D16/D16-1230.pdf
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Yinhan</given-names>
            <surname>Liu</surname>
          </string-name>
          , Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mike</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>RoBERTa: A Robustly Optimized BERT Pretraining Approach</article-title>
          .
          <source>CoRR abs/1907.11692</source>
          (
          <year>2019</year>
          ). arXiv:1907.11692 http://arxiv.org/abs/1907.11692
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Ryan</given-names>
            <surname>Lowe</surname>
          </string-name>
          , Michael Noseworthy, Iulian Vlad Serban,
          <string-name>
            <given-names>Nicolas</given-names>
            <surname>Angelard-Gontier</surname>
          </string-name>
          , Yoshua Bengio, and
          <string-name>
            <given-names>Joelle</given-names>
            <surname>Pineau</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses</article-title>
          .
          <source>In Proceedings ACL</source>
          <year>2017</year>
          .
          <fpage>1116</fpage>
          -
          <lpage>1126</lpage>
          . https://doi.org/10.18653/v1/P17-1103
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Laurens</given-names>
            <surname>van der Maaten</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Visualizing data using t-SNE</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>9</volume>
          , Nov (
          <year>2008</year>
          ),
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>CoRR abs/1301.3781</source>
          (
          <year>2013</year>
          ). arXiv:1301.3781 http://arxiv.org/abs/1301.3781
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Kishore</given-names>
            <surname>Papineni</surname>
          </string-name>
          , Salim Roukos, Todd Ward, and
          <string-name>
            <given-names>Wei-Jing</given-names>
            <surname>Zhu</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>BLEU: A Method for Automatic Evaluation of Machine Translation</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>2002</year>
          .
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . http://www.aclweb.org/anthology/P02-1040.pdf
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>James W.</given-names>
            <surname>Pennebaker</surname>
          </string-name>
          , Martha E. Francis, and
          <string-name>
            <given-names>Roger J.</given-names>
            <surname>Booth</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Linguistic Inquiry and Word Count: LIWC 2001</article-title>
          . Mahwah: Lawrence Erlbaum Associates 71 (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Hannah</given-names>
            <surname>Rashkin</surname>
          </string-name>
          , Eric Michael Smith,
          <string-name>
            <given-names>Margaret</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y-Lan</given-names>
            <surname>Boureau</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>2019</year>
          .
          <fpage>5370</fpage>
          -
          <lpage>5381</lpage>
          . https://www.aclweb.org/anthology/P19-1534/
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Alan</surname>
            <given-names>Ritter</given-names>
          </string-name>
          , Colin Cherry, and
          <string-name>
            <given-names>William B.</given-names>
            <surname>Dolan</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Data-Driven Response Generation in Social Media</article-title>
          .
          <source>In Proceedings of EMNLP</source>
          <year>2011</year>
          .
          <fpage>583</fpage>
          -
          <lpage>593</lpage>
          . http://www.aclweb.org/anthology/D11-1054
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Klaus R.</given-names>
            <surname>Scherer</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>What Are Emotions? And How Can They Be Measured?</article-title>
          .
          <source>Social Science Information</source>
          <volume>44</volume>
          ,
          <issue>4</issue>
          (
          <year>2005</year>
          ),
          <fpage>695</fpage>
          -
          <lpage>729</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Iulian Vlad</given-names>
            <surname>Serban</surname>
          </string-name>
          , Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and
          <string-name>
            <given-names>Joelle</given-names>
            <surname>Pineau</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models</article-title>
          .
          <source>In Proceedings of AAAI</source>
          <year>2016</year>
          .
          <fpage>3776</fpage>
          -
          <lpage>3784</lpage>
          . http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11957
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Iulian Vlad</given-names>
            <surname>Serban</surname>
          </string-name>
          , Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues</article-title>
          .
          <source>In Proceedings of AAAI</source>
          <year>2017</year>
          .
          <fpage>3295</fpage>
          -
          <lpage>3301</lpage>
          . http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14567
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Lifeng</given-names>
            <surname>Shang</surname>
          </string-name>
          , Zhengdong Lu, and
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Neural Responding Machine for Short-Text Conversation</article-title>
          .
          <source>In Proceedings of ACL-IJCNLP</source>
          <year>2015</year>
          .
          <fpage>1577</fpage>
          -
          <lpage>1586</lpage>
          . http://aclweb.org/anthology/P/P15/P15-1152.pdf
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Sordoni</surname>
          </string-name>
          , Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and
          <string-name>
            <given-names>Jian-Yun</given-names>
            <surname>Nie</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion</article-title>
          .
          <source>In Proceedings of CIKM</source>
          <year>2015</year>
          .
          <fpage>553</fpage>
          -
          <lpage>562</lpage>
          . https://doi.org/10.1145/2806416.2806493
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Sordoni</surname>
          </string-name>
          , Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell,
          <string-name>
            <given-names>Jian-Yun</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bill</given-names>
            <surname>Dolan</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A Neural Network Approach to Context-Sensitive Generation of Conversational Responses</article-title>
          .
          <source>In Proceedings of NAACL-HLT</source>
          <year>2015</year>
          .
          <fpage>196</fpage>
          -
          <lpage>205</lpage>
          . http://aclweb.org/anthology/N/N15/N15-1020.pdf
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <surname>Ilya</surname>
            <given-names>Sutskever</given-names>
          </string-name>
          , Oriol Vinyals, and
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Sequence to Sequence Learning with Neural Networks</article-title>
          .
          <source>In Proceedings of NIPS</source>
          <year>2014</year>
          .
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          . http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <surname>Chongyang</surname>
            <given-names>Tao</given-names>
          </string-name>
          , Lili Mou,
          <string-name>
            <given-names>Dongyan</given-names>
            <surname>Zhao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rui</given-names>
            <surname>Yan</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems</article-title>
          .
          <source>In Proceedings of AAAI</source>
          <year>2018</year>
          .
          <fpage>722</fpage>
          -
          <lpage>729</lpage>
          . https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16179
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Tillmann</surname>
          </string-name>
          and Hermann Ney.
          <year>2003</year>
          .
          <article-title>Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>29</volume>
          ,
          <issue>1</issue>
          (
          <year>2003</year>
          ),
          <fpage>97</fpage>
          -
          <lpage>133</lpage>
          . https://doi.org/10.1162/089120103321337458
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Howard E.</given-names>
            <surname>Tinsley</surname>
          </string-name>
          and
          <string-name>
            <given-names>David J.</given-names>
            <surname>Weiss</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>Interrater Reliability and Agreement of Subjective Judgments</article-title>
          .
          <source>Journal of Counseling Psychology</source>
          <volume>22</volume>
          ,
          <issue>4</issue>
          (
          <year>1975</year>
          ),
          <fpage>358</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <surname>Ashish</surname>
            <given-names>Vaswani</given-names>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is All you Need</article-title>
          .
          <source>In Proceedings of NIPS</source>
          <year>2017</year>
          .
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          . http://papers.nips.cc/paper/7181-attention-is-all-you-need
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          and
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A Neural Conversational Model</article-title>
          .
          <source>CoRR abs/1506.05869</source>
          (
          <year>2015</year>
          ). arXiv:1506.05869 http://arxiv.org/abs/1506.05869
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>Amy Beth</given-names>
            <surname>Warriner</surname>
          </string-name>
          , Victor Kuperman, and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Brysbaert</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Norms of Valence, Arousal, and Dominance for 13,915 English Lemmas</article-title>
          .
          <source>Behavior Research Methods</source>
          <volume>45</volume>
          ,
          <issue>4</issue>
          (
          <year>2013</year>
          ),
          <fpage>1191</fpage>
          -
          <lpage>1207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Chen</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yu</given-names>
            <surname>Wu</surname>
          </string-name>
          , Jie Liu, Yalou Huang,
          <string-name>
            <given-names>Ming</given-names>
            <surname>Zhou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Wei-Ying</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Topic Aware Neural Response Generation</article-title>
          .
          <source>In Proceedings of AAAI</source>
          <year>2017</year>
          .
          <fpage>3351</fpage>
          -
          <lpage>3357</lpage>
          . http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14563
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>Chen</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yu</given-names>
            <surname>Wu</surname>
          </string-name>
          , Wei Wu,
          <string-name>
            <given-names>Yalou</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ming</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Hierarchical Recurrent Attention Network for Response Generation</article-title>
          .
          <source>In Proceedings of AAAI</source>
          <year>2018</year>
          .
          <fpage>5610</fpage>
          -
          <lpage>5617</lpage>
          . https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16510
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>Anbang</given-names>
            <surname>Xu</surname>
          </string-name>
          , Zhe Liu, Yufan Guo, Vibha Sinha, and
          <string-name>
            <given-names>Rama</given-names>
            <surname>Akkiraju</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A New Chatbot for Customer Service on Social Media</article-title>
          .
          <source>In Proceedings of CHI</source>
          <year>2017</year>
          .
          <fpage>3506</fpage>
          -
          <lpage>3510</lpage>
          . https://doi.org/10.1145/3025453.3025496
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>Jennifer</given-names>
            <surname>Zamora</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>I'm Sorry, Dave, I'm Afraid I Can't Do That: Chatbot Perception and Expectations</article-title>
          .
          <source>In Proceedings of HAI</source>
          <year>2017</year>
          .
          <fpage>253</fpage>
          -
          <lpage>260</lpage>
          . https://doi.org/10.1145/3125739.3125766
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>Peixiang</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Di</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Chunyan</given-names>
            <surname>Miao</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>An Affect-Rich Neural Conversational Model with Biased Attention and Weighted Cross-Entropy Loss</article-title>
          .
          <source>In Proceedings of AAAI</source>
          <year>2019</year>
          .
          <fpage>7492</fpage>
          -
          <lpage>7500</lpage>
          . https://aaai.org/ojs/index.php/AAAI/article/view/4740
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>Hao</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu.
          <year>2018</year>
          .
          <article-title>Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory</article-title>
          .
          <source>In Proceedings of AAAI</source>
          <year>2018</year>
          . https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16455
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>Xianda</given-names>
            <surname>Zhou</surname>
          </string-name>
          and
          <string-name>
            <given-names>William Yang</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>MojiTalk: Generating Emotional Responses at Scale</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>2018</year>
          .
          <fpage>1128</fpage>
          -
          <lpage>1137</lpage>
          . https://doi.org/10.18653/v1/P18-1104
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>