CEUR Workshop Proceedings Vol-2848, paper user2agent-paper-3. PDF: https://ceur-ws.org/Vol-2848/user2agent-paper-3.pdf. DBLP: https://dblp.org/rec/conf/iui/XieSP20
     A Multi-Turn Emotionally Engaging Dialog Model
                                                                 Yubo Xie
                                                        Ekaterina Svikhnushina
                                                                 Pearl Pu
                                                             yubo.xie@epfl.ch
                                                      ekaterina.svikhnushina@epfl.ch
                                                             pearl.pu@epfl.ch
                                                 École Polytechnique Fédérale de Lausanne
                                                           Lausanne, Switzerland
ABSTRACT
Open-domain dialog systems (also known as chatbots) have increasingly drawn attention in natural language processing. Some of the recent work aims at incorporating affect information into sequence-to-sequence neural dialog modeling, making the responses emotionally richer, while other work uses hand-crafted rules to determine the desired emotional response. However, these approaches do not explicitly learn the subtle emotional interactions captured in human dialogs. In this paper, we propose a multi-turn dialog system aimed at learning and generating emotional responses in the way that so far only humans know how to do. Compared with two baseline models, offline experiments show that our method performs best in perplexity scores. Further human evaluations confirm that our chatbot can keep track of the conversation context and generate emotionally more appropriate responses while performing equally well on grammar.

CCS CONCEPTS
• Human-centered computing → Human computer interaction (HCI); Natural language interfaces.

KEYWORDS
chatbots, affective computing, deep learning, natural language processing

ACM Reference Format:
Yubo Xie, Ekaterina Svikhnushina, and Pearl Pu. 2020. A Multi-Turn Emotionally Engaging Dialog Model. In IUI ’20 Workshops, March 17, 2020, Cagliari, Italy. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
Many application areas show significant benefits from integrating affect information in natural language dialogs. In earlier work on human-computer interaction, Klein et al. [16] found that a user's frustration caused by a computer system can be alleviated by computer-initiated emotional support, by providing feedback on emotional content along with sympathy and empathy. Recently, Hu et al. [14] developed a customer support neural chatbot capable of generating dialogs similar to human ones in terms of empathic and passionate tones, potentially serving as a proxy for customer support agents on social media platforms. In a qualitative study [47], participants expressed an interest in chatbots capable of serving as an attentive listener and providing motivational support, thus fulfilling users' emotional needs. Several participants even noted that a chatbot is ideal for sensitive content that is too embarrassing to ask another human about. Finally, Bickmore and Picard [3] showed that a relational agent with deliberate social-emotional skills was respected more, liked more, and trusted more, even after four weeks of interaction, compared to an equivalent task-oriented agent.

Recent developments in neural language modeling have generated significant excitement in the open-domain dialog generation community. The success of sequence-to-sequence (seq2seq) learning [5, 37] in the field of neural machine translation has inspired researchers to apply the recurrent neural network (RNN) encoder-decoder structure to response generation [42]. Following the standard seq2seq structure, various improvements have been made on the neural conversation model. For example, Shang et al. [34] applied the attention mechanism [2] to the same structure on Twitter-style microblogging data. Li et al. [17] found that the original version tends to favor short and dull responses, and fixed this problem by increasing the diversity of the responses. Li et al. [18] modeled the personalities of the speakers, and Xing et al. [44] developed a topic-aware dialog system. We refer to work in this area globally as neural dialog generation. For a comprehensive survey, please refer to [4].
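The RNN encoder-decoder pipeline underlying these models can be sketched as follows. This is a minimal, untrained illustration with a vanilla tanh RNN, a toy vocabulary, and random weights (the models discussed in this paper use GRUs and attention on top of this skeleton), so the generated tokens carry no meaning:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<pad>", "<sos>", "<eos>", "hi", "how", "are", "you", "fine"]
V, E, H = len(VOCAB), 8, 16  # vocab, embedding, and hidden sizes (toy values)

# Randomly initialized toy parameters: an illustration of the mechanics,
# not a trained model.
emb = rng.normal(0, 0.1, (V, E))
W_xh = rng.normal(0, 0.1, (H, E))
W_hh = rng.normal(0, 0.1, (H, H))
W_out = rng.normal(0, 0.1, (V, H))

def rnn_step(x_id, h):
    """One vanilla-RNN step (the models in this paper use GRUs instead)."""
    return np.tanh(W_xh @ emb[x_id] + W_hh @ h)

def encode(token_ids):
    """Compress the input message into a fixed-size vector."""
    h = np.zeros(H)
    for t in token_ids:
        h = rnn_step(t, h)
    return h

def decode_greedy(h, max_len=5):
    """Generate a response one token at a time, feeding back the argmax."""
    out, tok = [], VOCAB.index("<sos>")
    for _ in range(max_len):
        h = rnn_step(tok, h)
        tok = int(np.argmax(W_out @ h))  # greedy choice of the next word
        if VOCAB[tok] == "<eos>":
            break
        out.append(tok)
    return out

msg = [VOCAB.index(w) for w in ["hi", "how", "are", "you"]]
reply = decode_greedy(encode(msg))
print([VOCAB[t] for t in reply])
```

In practice the parameters are trained by maximizing the likelihood of observed responses, and beam search often replaces the greedy loop.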


More recently, researchers have started incorporating affect information into neural dialog models. While a central theme seems to be making the responses emotionally richer, existing approaches mainly follow two directions. In one, an emotion label is explicitly required as input so that the machine can generate sentences of that particular emotion label or type [49]. In the other, the main idea is to develop handcrafted rules that direct the machine to generate responses with the desired emotions [1, 48]. Both approaches require an emotion label as input (either given or handcrafted), which might be impractical in real dialog scenarios.
   Furthermore, to the best of our knowledge, the psychology and social science literature does not provide clear rules for emotional interaction. It seems such social and emotional intelligence is captured in our conversations. This is why we decided to take an automatic, data-driven approach. In this paper, we describe an end-to-end Multi-turn Emotionally Engaging Dialog model (MEED), capable of recognizing emotions and generating emotionally appropriate and human-like responses, with the ultimate goal of reproducing the social behaviors that are habitual in human-human conversations. We chose the multi-turn setting because a model suitable for single-turn dialogs cannot effectively track earlier context in multi-turn dialogs, either semantically or emotionally. Since the ability to track several turns is essential, we made this design decision from the beginning, in contrast to most related work, where models are only trained and tested on single-turn dialogs. While using a hierarchical mechanism to track the conversation history in multi-turn dialogs is not new (e.g., HRAN by Xing et al. [45]), combining it with an additional emotion RNN that processes the emotional information in each history utterance has never been attempted before.
   Our contributions are threefold. (1) We describe in detail a novel emotion-tracking dialog generation model that learns the emotional interactions directly from the data. This approach is free of human-defined heuristic rules and is hence more robust and fundamental than those described in existing work. (2) We compare our model, MEED, with the generic seq2seq model and a hierarchical model for multi-turn dialogs (HRAN). Offline experiments show that our model outperforms both seq2seq and HRAN by a significant margin. Further experiments with human evaluation show that our model produces emotionally more appropriate responses than both baselines, while also improving language fluency. (3) We illustrate a human-evaluation procedure for judging machine-produced emotional dialogs. We consider factors such as the balance of positive and negative emotions in the test dialogs, a well-chosen range of topics, and dialogs that our human evaluators can relate to. It is the first time such a procedure has been designed with consideration for the human judges. Our main goal is to increase the objectivity of the results and to reduce judges' mistakes due to out-of-context dialogs they have to evaluate.

2 RELATED WORK
Neural Dialog Generation
Vinyals and Le [42] were among the first to model dialog generation using neural networks. Their seq2seq framework was trained on an IT Helpdesk Troubleshooting dataset and the OpenSubtitles dataset [21]. Shang et al. [34] further trained the seq2seq model with an attention mechanism on a self-crawled dataset from Weibo (a popular Twitter-like social media website in China). Meanwhile, Xu et al. [46] built a customer service chatbot by training the seq2seq model on a dataset of conversations between customers and the customer service accounts of 62 brands on Twitter.
   The standard seq2seq framework is applied to single-turn response generation. In multi-turn settings, where a context with multiple history utterances is given, the same structure often ignores the hierarchical characteristic of the context. Some recent work addresses this problem by adopting a hierarchical recurrent encoder-decoder (HRED) structure [32, 33, 35]. To give attention to different parts of the context while generating responses, Xing et al. [45] proposed the hierarchical recurrent attention network (HRAN), using a hierarchical attention mechanism. However, these multi-turn dialog models do not take into account the turn-by-turn emotional changes of the dialog.

Neural Dialog Models with Affect Information
Recent work on incorporating affect information into natural language processing tasks has inspired our current work. These efforts can be mainly described as affect language models and emotional dialog systems.
   Ghosh et al. [11] made the first attempt to augment the original LSTM language model with affect treatment in what they called Affect-LM. At training time, Affect-LM can be considered an energy-based model in which the added energy term captures the degree of correlation between the next word and the affect information of the preceding text. At generation time, the affect information is also used to improve the selection of the next word. A key component of Affect-LM is the use of a well-established text analysis program, LIWC (Linguistic Inquiry and Word Count) [28]. For example, for the sentence "I unfortunately did not pass my exam", the model generates five emotion features denoting (sad: 1, angry: 1, anxiety: 1, negative emotion: 1, positive emotion: 0). This makes Affect-LM both capable of distinguishing the affect information conveyed by each word in the language modeling part and aware of the preceding text's emotion at each generation step. In a similar vein, Asghar et al. [1] appended the original word embeddings with a VAD affect model [43]. VAD is a vector model, as opposed to a categorical model such as LIWC, representing a given emotion on the valence, arousal, and dominance axes.
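As a rough illustration of the kind of categorical affect features LIWC produces, the sketch below derives binary category indicators from a tiny hand-made lexicon. The lexicon and its word-category assignments are invented for this example (the real LIWC dictionary is far larger and proprietary), so the exact feature values differ from what a real LIWC run would produce:

```python
# Toy stand-in for the LIWC dictionary: each word maps to the affect
# categories it belongs to. These assignments are illustrative only.
TOY_LEXICON = {
    "unfortunately": {"negative", "sad"},
    "worried": {"negative", "anxiety"},
    "angry": {"negative", "angry"},
    "happy": {"positive"},
}
CATEGORIES = ["positive", "negative", "anxiety", "angry", "sad"]

def affect_features(sentence):
    """Binary vector: 1 if any word in the sentence hits the category."""
    hits = set()
    for word in sentence.lower().split():
        hits |= TOY_LEXICON.get(word, set())
    return [int(c in hits) for c in CATEGORIES]

print(affect_features("I unfortunately did not pass my exam"))
# With this toy lexicon: positive=0, negative=1, anxiety=0, angry=0, sad=1
```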


In contrast to Affect-LM, Asghar's neural affect dialog model aims at generating explicit responses given a particular utterance. To do so, the authors designed three affect-related loss functions, namely minimizing affective dissonance, maximizing affective dissonance, and maximizing affective content. The paper also proposed affectively diverse beam search during decoding, so that the generated candidate responses are as affectively diverse as possible. However, the literature in affective science does not necessarily validate such rules. In fact, the best strategy for speaking to an angry customer is the de-escalation strategy (using neutral words to validate anger) rather than employing equally emotional words (minimizing affect dissonance) or words that convey happiness (maximizing affect dissonance).
   The Emotional Chatting Machine (ECM) [49] takes a post and generates a response in a predefined emotion category. The main idea is to use an internal memory module to capture the emotion dynamics during decoding, and an external memory module to model emotional expressions explicitly by assigning different probability values to emotional words as opposed to regular words. Zhou and Wang [50] extended the standard seq2seq model to a conditional variational autoencoder combined with policy gradient techniques. The model takes a post and an emoji as input, and generates the response with the target emotion specified by the emoji. Hu et al. [14] built a tone-aware chatbot for customer care on social media by deploying extra meta information about the conversations in the seq2seq model. Specifically, a tone indicator is added to each step of the decoder during the training phase.
   In parallel to these developments, Zhong et al. [48] proposed an affect-rich dialog model using a biased attention mechanism on emotional words in the input message, taking advantage of the VAD embeddings. The model is trained with a weighted cross-entropy loss function, which encourages the generation of emotional words.

Summary
As much as the work in the above section inspired our own, our approach to generating affect dialogs is significantly different. While most related work focused on integrating affect information into the transduction vector space using either VAD or LIWC, we aim at modeling and generating the affect exchanges in human dialogs using a dedicated embedding layer. The approach is also completely data-driven, and thus free of hand-crafted rules. To avoid learning the obscene and callous exchanges often found in social media data such as tweets and Reddit threads [29], we opted to train our model on movie subtitles, whose dialogs were carefully created by professional writers. We believe the quality of this dataset can be better than that of datasets curated via crowdsourcing platforms. For modeling the affect information, we chose to use LIWC because it is a well-established emotion lexical resource covering the whole English dictionary, whereas VAD only contains 13K lemmatized terms.

3 MODEL
We describe our model one element at a time, from the basic structure, to the hierarchical component, and finally the emotion embedding layer.
   We first consider the problem of generating a response y given a context X consisting of multiple previous utterances, by estimating the probability distribution p(y | X) from a data set D = \{(X^{(i)}, y^{(i)})\}_{i=1}^{N} containing N context-response pairs. Here

    X^{(i)} = \big( x_1^{(i)}, x_2^{(i)}, \dots, x_{m_i}^{(i)} \big)    (1)

is a sequence of m_i utterances, and

    x_j^{(i)} = \big( x_{j,1}^{(i)}, x_{j,2}^{(i)}, \dots, x_{j,n_{ij}}^{(i)} \big)    (2)

is a sequence of n_{ij} words. Similarly,

    y^{(i)} = \big( y_1^{(i)}, y_2^{(i)}, \dots, y_{T_i}^{(i)} \big)    (3)

is the response, with T_i words.
   Usually the probability distribution p(y | X) can be modeled by an RNN language model conditioned on X. When generating the word y_t at time step t, the context X is encoded into a fixed-size dialog context vector c_t by following the hierarchical attention structure in HRAN [45]. Additionally, we extract the emotion information from the utterances in X by leveraging an external text analysis program, and use an RNN to encode it into an emotion context vector e, which is combined with c_t to produce the distribution. The overall architecture of the model is depicted in Figure 1. We elaborate below on how c_t and e are obtained, and how they are combined in the decoding part.

Hierarchical Attention
The hierarchical attention structure involves two encoders to produce the dialog context vector c_t, namely the word-level encoder and the utterance-level encoder. The word-level encoder is essentially a bidirectional RNN with gated recurrent units (GRU) [5]. For utterance x_j in X (j = 1, 2, \dots, m), the bidirectional encoder produces two hidden states at each word position k, the forward hidden state h_{jk}^{f} and the backward hidden state h_{jk}^{b}. The final hidden state h_{jk} is then obtained by concatenating the two,

    h_{jk} = \mathrm{concat}\big( h_{jk}^{f}, h_{jk}^{b} \big).    (4)

The utterance-level encoder is a unidirectional RNN with GRU that goes from the last utterance in the context to the first, with its input at each step being the summary of the corresponding utterance, which is obtained by applying a Bahdanau-style attention mechanism [2] on the word-level encoder output.
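The bidirectional word-level encoding of Eq. (4) can be sketched as follows. This is a minimal numpy illustration with randomly initialized (untrained) GRU parameters and toy dimensions, not the actual trained encoder:

```python
import numpy as np

rng = np.random.default_rng(1)
E, H = 6, 4  # toy embedding and hidden sizes

def make_gru(rng):
    """Random GRU parameters: update gate z, reset gate r, candidate h~."""
    return {k: (rng.normal(0, 0.1, (H, E)), rng.normal(0, 0.1, (H, H)))
            for k in ("z", "r", "h")}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(params, x, h):
    Wz, Uz = params["z"]
    Wr, Ur = params["r"]
    Wh, Uh = params["h"]
    z = sigmoid(Wz @ x + Uz @ h)
    r = sigmoid(Wr @ x + Ur @ h)
    h_cand = np.tanh(Wh @ x + Uh @ (r * h))
    return (1 - z) * h + z * h_cand

def bigru_encode(utterance):
    """Run one GRU forward and one backward over the words, then
    concatenate the two states at each position k (Eq. 4)."""
    fwd_p, bwd_p = make_gru(rng), make_gru(rng)
    n = len(utterance)
    h_f = [np.zeros(H)]
    for x in utterance:                      # left to right
        h_f.append(gru_step(fwd_p, x, h_f[-1]))
    h_b = [np.zeros(H)]
    for x in reversed(utterance):            # right to left
        h_b.append(gru_step(bwd_p, x, h_b[-1]))
    h_b = list(reversed(h_b[1:]))            # align backward states with positions
    return [np.concatenate([h_f[k + 1], h_b[k]]) for k in range(n)]

words = [rng.normal(size=E) for _ in range(5)]  # stand-in word embeddings
states = bigru_encode(words)
print(len(states), states[0].shape)  # 5 word positions, each of size 2H
```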


Figure 1: The overall architecture of our model. (The diagram shows the word-level encoder, the utterance-level encoder with word- and utterance-level attention, the emotion encoder fed by the LIWC-based emotion embedding layer, and the decoder.)
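All attention modules in Figure 1 follow the same additive (Bahdanau-style) pattern: score each encoder state against the previous decoder state, normalize the scores with a softmax, and return the weighted sum of the states. A minimal sketch of this shared pattern, with toy dimensions and randomly initialized parameters standing in for the attention parameters such as v_a, U_a, W_a:

```python
import numpy as np

rng = np.random.default_rng(2)
H, A = 4, 3  # hidden size and attention size (toy values)

# Toy attention parameters, analogous to v_a, U_a, W_a in the paper.
v = rng.normal(0, 0.1, A)
U = rng.normal(0, 0.1, (A, H))
W = rng.normal(0, 0.1, (A, H))

def additive_attention(query, states):
    """Score each state against the query, softmax, then weighted sum."""
    scores = np.array([v @ np.tanh(U @ query + W @ h) for h in states])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax over positions
    summary = sum(w * h for w, h in zip(weights, states))
    return summary, weights

s_prev = rng.normal(size=H)                  # previous decoder state
hs = [rng.normal(size=H) for _ in range(6)]  # encoder hidden states
summary, weights = additive_attention(s_prev, hs)
assert np.isclose(weights.sum(), 1.0)        # a proper distribution
print(summary.shape)  # (4,): same size as one hidden state
```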


More specifically, at decoding step t, the summary of utterance x_j is a linear combination of the h_{jk}, for k = 1, 2, \dots, n_j:

    r_j^t = \sum_{k=1}^{n_j} \alpha_{jk}^t h_{jk}.    (5)

Here \alpha_{jk}^t is the word-level attention score placed on h_{jk}, and can be calculated as

    a_{jk}^t = v_a^\top \tanh\big( U_a s_{t-1} + V_a \ell_{j+1}^t + W_a h_{jk} \big),    (6)

    \alpha_{jk}^t = \frac{\exp(a_{jk}^t)}{\sum_{k'=1}^{n_j} \exp(a_{jk'}^t)},    (7)

where s_{t-1} is the previous hidden state of the decoder, \ell_{j+1}^t is the previous hidden state of the utterance-level encoder, and v_a, U_a, V_a and W_a are the word-level attention parameters. The final dialog context vector c_t is then obtained as another linear combination of the outputs of the utterance-level encoder \ell_j^t, for j = 1, 2, \dots, m:

    c_t = \sum_{j=1}^{m} \beta_j^t \ell_j^t.    (8)

Here \beta_j^t is the utterance-level attention score placed on \ell_j^t, and can be calculated as

    b_j^t = v_b^\top \tanh\big( U_b s_{t-1} + W_b \ell_j^t \big),    (9)

    \beta_j^t = \frac{\exp(b_j^t)}{\sum_{j'=1}^{m} \exp(b_{j'}^t)},    (10)

where s_{t-1} is the previous hidden state of the decoder, and v_b, U_b and W_b are the utterance-level attention parameters.

Emotion Encoder
The main objective of the emotion embedding layer is to recognize the affect information in the given utterances so that the model can respond with emotionally appropriate replies. To achieve this, we need an encoder that distinguishes the affect information in the context, in addition to its semantic meaning. Equally, we need a decoder capable of selecting the best and most human-like answers.
   We are able to achieve this goal, i.e., capturing the emotion information carried in the context X, in the encoder, thanks to LIWC. We make use of the five emotion-related categories, namely positive emotion, negative emotion, anxious, angry, and sad. This set can be expanded to include more categories if we desire a richer distinction; see the discussion section for more details on how to do this. Using the newest version of


the program LIWC2015, we are able to map each utterance x_j in the context to a six-dimensional indicator vector 1(x_j), with the first five entries corresponding to the five emotion categories and the last one corresponding to neutral. If any word in x_j belongs to one of the five categories, then the corresponding entry in 1(x_j) is set to 1; otherwise, x_j is treated as neutral, with the last entry of 1(x_j) set to 1. For example, assuming x_j = “he is worried about me”, then

    1(x_j) = [0, 1, 1, 0, 0, 0],    (11)

since the word “worried” is assigned to both negative emotion and anxious. We apply a dense layer with a sigmoid activation function on top of 1(x_j) to embed the emotion indicator vector into a continuous space,

    a_j = \sigma\big( W_e 1(x_j) + b_e \big),    (12)

where W_e and b_e are trainable parameters. The emotion flow of the context X is then modeled by a unidirectional RNN with GRU going from the first utterance in the context to the last, with its input being a_j at each step. The final emotion context vector e is obtained as the last hidden state of this emotion-encoding RNN.

Decoding
The probability distribution p(y | X) can be written as

    p(y \mid X) = p(y_1, y_2, \dots, y_T \mid X)
                = p(y_1 \mid c_1, e) \prod_{t=2}^{T} p(y_t \mid y_1, \dots, y_{t-1}, c_t, e).    (13)

We model the probability distribution using an RNN language model along with the emotion context vector e. Specif-

Table 1: Statistics of the two datasets.

                                   Cornell    DailyDialog
    # dialogs                       83,097         13,118
    # utterances                   304,713        102,977
    Average # turns                    3.7            7.9
    Average # words / utterance       12.5           14.6
    Training set size              142,450         46,797
    Validation set size             10,240         10,240

We use the cross-entropy loss as our objective function:

    \mathcal{L} = -\frac{1}{\sum_{i=1}^{N} T_i} \sum_{i=1}^{N} \log p\big( y^{(i)} \mid X^{(i)} \big).    (18)

4 EVALUATION
We trained our model on two different datasets and compared its performance with HRAN as well as the basic seq2seq model by performing both offline and online testing.

Datasets
We used two different dialog corpora to train our model: the Cornell Movie Dialogs Corpus [6] and the DailyDialog dataset [20].

• Cornell Movie Dialogs Corpus. The dataset contains 83,097 dialogs (220,579 conversational exchanges) extracted from raw movie scripts. In total there are 304,713 utterances.
• DailyDialog. The dataset is developed by crawling raw data from websites used by language learners to learn English dialogs in daily life. It contains 13,118
ically, at time step t, the hidden state of the decoder st is
                                                                                            dialogs in total.
obtained by applying the GRU function,
                                                                                     We summarize some of the basic information regarding the
                st = GRU(st −1, concat(c t , wyt −1 )),                       (14)   two datasets in Table 1.
                                                                                        In our experiments, the models were first trained on the
where wyt −1 is the word embedding of yt −1 . Similar to Affect-
                                                                                     Cornell Movie Dialogs Corpus, and then fine-tuned on the
LM [11], we then define a new feature vector ot by concate-
                                                                                     DailyDialog dataset. We adopted this training pattern be-
nating st (which we refer to as the language context vector)
                                                                                     cause the Cornell dataset is bigger but noisier, while DailyDi-
with the emotion context vector e,
                                                                                     alog is smaller but more daily-based. To create a training set
                         ot = concat(st , e),                                 (15)   and a validation set for each of the two datasets, we took seg-
                                                                                     ments of each dialog with number of turns no more than six,2
on which we apply a softmax layer to obtain a probability
                                                                                     to serve as the training/validation examples. Specifically, for
distribution over the vocabulary,
                                                                                     each dialog D = (x 1, x 2, . . . , x M ), we created M − 1 context-
                     p t = softmax(W ot + b),                                 (16)   response pairs, namely Ui = (x si , . . . , x i ) and yi = x i+1 , for
                                                                                     i = 1, 2, . . . , M − 1, where si = max(1, i − 4). We filtered
where W and b are trainable parameters. Each term in Equa-                           out those pairs that have at least one utterance with length
tion (13) is then given by                                                           greater than 30. We also reduced the frequency of those pairs
                p(yt | y1, . . . , yt −1, c t , e) = p t ,yt .                (17)   2 We chose the maximum number of turns to be six because we would like

                                                                                     to have a longer context for each dialog while at the same time keeping the
1 https://liwc.wpengine.com/                                                         training procedure computationally efficient.
IUI ’20 Workshops, March 17, 2020, Cagliari, Italy                                                                                    Xie, et al.


whose responses appear too many times (the threshold is set to 10 for Cornell, and 5 for DailyDialog), to prevent them from dominating the learning procedure. See Table 1 for the sizes of the training and validation sets. The test set consists of 100 dialogs with four turns. We give a more detailed description of how we created the test set in the section on human evaluation.

Baselines and Implementation

Our choice of including S2S is rather obvious. Including HRAN instead of other neural dialog models with affect information was not an easy decision. As mentioned in the related work, Asghar's affective dialog model, the affect-rich conversation model, and the Emotional Chatting Machine do not learn the emotional exchanges in the dialogs. This leaves us wondering whether using a multi-turn neural model can be as effective in learning emotional exchanges as MEED. In addition, comparing S2S and HRAN also gives us an idea of how much the hierarchical mechanism improves upon the basic model. This is why our final comparison is based on three multi-turn dialog generation models: the standard seq2seq model (denoted as S2S), HRAN, and our proposed model, MEED. In order to adapt S2S to the multi-turn setting, we concatenate all the history utterances in the context into one.

For all the models, the vocabulary consists of the 20,000 most frequent words in the Cornell and DailyDialog datasets, plus three extra tokens:  for words that do not exist in the vocabulary,  indicating the beginning of an utterance, and  indicating the end of an utterance. Here we summarize the configurations and parameters of our experiments:

    • We set the word embedding size to 256. We initialized the word embeddings in the models with word2vec [26] vectors first trained on Cornell and then fine-tuned on DailyDialog, consistent with the training procedure of the models.
    • We set the number of hidden units of each RNN to 256, the word-level attention depth to 256, and the utterance-level attention depth to 128. The output size of the emotion embedding layer is 256.
    • We optimized the objective function using the Adam optimizer [15] with an initial learning rate of 0.001.
    • For prediction, we used beam search [39] with a beam width of 256.

We have made the source code publicly available.3

Evaluation Metrics

The evaluation of chatbots remains an open problem in the field. Recent work [22] has shown that the automatic evaluation metrics borrowed from machine translation, such as the BLEU score [27], tend to align poorly with human judgement. Therefore, in this paper, we mainly adopt human evaluation, along with perplexity and BLEU score, following the existing work.

Automatic Evaluation. Perplexity is a measurement of how well a probability model predicts a sample. It is a popular method used in language modeling. In the neural dialog generation community, many researchers have adopted this method, especially in the beginning of this field [32, 42, 45, 48–50]. It measures how well a dialog model predicts the target response. Given a target response y = {y_1, y_2, ..., y_T}, the perplexity is calculated as

    ppl(y) = p(y_1, y_2, ..., y_T)^(−1/T)
           = exp[ −(1/T) Σ_{t=1}^{T} log p(y_t | y_1, ..., y_{t-1}) ].    (19)

Thus a lower perplexity score indicates that the model is better at predicting the target sentence, i.e., the human's response. Some researchers [19, 34, 48] argue that the perplexity score is not the ideal measurement because, for a given context history, one should allow many responses. This is especially true if we want our conversational agents to speak more diversely. However, for our purpose, which is to speak emotionally appropriately and as human-like as possible, we believe it is a good measure. We do recognize that it is not the only way to measure chatbots' performance. This is why we also conducted a human evaluation experiment.

BLEU score is often used to measure the quality of machine-translated text. Some earlier work on dialog response generation [17, 18] adopted this metric to measure the performance of chatbots. However, a recent study [22] suggests that it does not align well with human evaluation. Nevertheless, we still include BLEU scores in this paper, to get a sense of comparison with the perplexity and human evaluation results.

Human Evaluation. Human evaluation has been widely used to evaluate open-domain dialog generation tasks. This approach can include any criterion we judge appropriate. Most commonly, researchers have included the model's ability to generate grammatically correct, contextually coherent, and emotionally appropriate responses, of which the latter two properties cannot be reliably evaluated using automatic metrics. Recent work [1, 48, 49] on affect-rich conversational chatbots turned to human opinion to evaluate both fluency and emotionality of their models. But such human experiments are sensitive to risk factors if the experiment is not carefully designed. These include whether the instructions are clear, whether they have been tested with users beforehand, and whether there is a good balance of the human judgement tasks. Further, if a test set for human evaluation is prepared

3 https://github.com/yuboxie/meed
by randomly sampling the dialogs from the dataset, it may include out-of-context dialogs, causing confusion and ambiguity for human evaluators. An unbalanced emotional distribution of the test dialogs may also lead to biased conclusions, since the chatbot's abilities are evaluated on an unrepresentative sample.

To take into account the above issues, we took several iterations to prepare the instructions and the test set before conducting the human evaluation experiment. Part of our test set comes from the DailyDialog dataset, which consists of meaningful complete dialogs. To compensate for the imbalance, we further curated more negative emotion dialogs so that the final set has an equal emotion distribution. We provide the details about the test data preparation process and the evaluation experiment below.

Preparation of Natural Dialog Test Set. We first selected the emotionally colored dialogs with exactly four turns from the DailyDialog dataset. In the dataset each dialog turn is annotated with a corresponding emotional category, including the neutral one. For our purposes we kept only those dialogs in which more than half of the utterances have non-neutral emotional labels, resulting in 78 emotionally positive dialogs and 14 emotionally negative dialogs. We recruited two human workers to augment the data to produce more emotionally negative dialogs. Both of them were PhD students from our university (males, aged 24 and 25), fluent in English, and not related to the authors' lab. We found them via email and messaging platforms, and offered 80 CHF (or roughly US $80) gift coupons as an incentive for each participant. The workers fulfilled the tasks in a Google form4 following the instructions and created five negative dialogs with four turns, as if they were interacting with another human, in each of the following topics: relationships, entertainment, service, work and study, and everyday situations. The Google form was released on 31 January 2019, and the workers finished their tasks by 4 February 2019. Subsequently, to form the final test set, we randomly selected 50 emotionally positive and 50 emotionally negative dialogs from the two pools of dialogs described above.

Human Evaluation Experiment Design. In the final human evaluation of the model, we recruited four more PhD students from our university (1 female and 3 males, aged 22–25). Three of them are fluent English speakers and one is a native speaker. The recruitment proceeded in the same manner as described above; the raters were offered 80 CHF (or roughly US $80) gift coupons per participant for fulfilling the task, and an extra 20 CHF (or roughly US $20) coupon was promised as a bonus to the rater judged to be the most serious. For the evaluation survey, we also leveraged a Google form. Specifically, we randomly shuffled the 100 dialogs in the test set, then used the first three utterances of each dialog as the input to the three models being compared (S2S, HRAN, and MEED), and obtained the respective responses. The dialog contexts and the three models' responses were included in the Google form. According to the context given, the raters were instructed to evaluate the quality of the responses based on three criteria:

  (1) Grammatical correctness—whether or not the response is fluent and free of grammatical mistakes;
  (2) Contextual coherence—whether or not the response is sensitive to the previous dialog history;
  (3) Emotional appropriateness—whether or not the response conveys the right emotion and feels as if it had been produced by a human.

For each criterion, the raters gave scores of 0, 1 or 2, where 0 means bad, 2 means good, and 1 indicates neutral. For this survey, the Google form was launched on 12 February 2019, and all the submissions from our raters were collected by 14 February 2019.

Results and Analysis

In this subsection, we present the experimental results of the automatic evaluation metrics as well as human judgement, followed by some analysis.

Automatic Evaluation Results. Table 2 gives the perplexity and BLEU scores obtained by the three models on the two validation sets and the test set. As shown in the table, MEED achieves the lowest perplexity and the highest BLEU score on all three sets. We conducted a t-test on the perplexity values, and the results show significant improvements of MEED over S2S and HRAN on the two validation sets (with p-value < 0.05).

Human Evaluation Results. Tables 3, 4 and 5 summarize the human evaluation results on the responses' grammatical correctness, contextual coherence, and emotional appropriateness, respectively. In the tables, we give the percentage of votes each model received for the three scores, the average score obtained, and the agreement score among the raters. Note that we report Fleiss' κ [10] for contextual coherence and emotional appropriateness, and Finn's r [9] for grammatical correctness. We did not use Fleiss' κ for grammatical correctness because the agreement there is extremely high, which can make Fleiss' κ very sensitive to prevalence [13]. Conversely, we did not use Finn's r for contextual coherence and emotional appropriateness because it is only reasonable when the observed variance is significantly less than the chance variance [40], which did not apply to these two criteria. As shown in the tables, we got high agreement among the raters for grammatical correctness, and fair

4 We provide the link to the form used for creating the dialogs: https://forms.gle/rPagMZYuYJ3M3Sq8A, hoping to help other researchers reproduce the same procedure. However, due to privacy concerns, we do not plan to release this dataset.
Table 2: Perplexity and average BLEU scores achieved by the models. Avg. BLEU: average of BLEU-1, -2, -3, and -4. Validation
set 1 comes from the Cornell dataset, and validation set 2 comes from the DailyDialog dataset.


                                               Perplexity                                         Avg. BLEU
                          Validation Set 1       Validation Set 2   Test Set   Validation Set 1    Validation Set 2   Test Set
              S2S                43.136               25.418        19.913           1.639              2.427          3.720
              HRAN               46.225               26.338        20.355           1.701              2.368          2.390
              MEED               41.862               24.341        19.795           1.829              2.635          4.281
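The perplexity column in Table 2 is computed as in Equation (19). As a minimal illustration of that formula (not the authors' code; the per-token probabilities below are made-up stand-ins for model outputs):

```python
import math

def perplexity(token_probs):
    """Eq. (19): ppl(y) = exp(-(1/T) * sum_t log p(y_t | y_1, ..., y_{t-1}))."""
    T = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / T)

# A 4-token response with every token predicted at probability 1/4
# has perplexity 4; more confident predictions drive the score down.
ppl_uniform = perplexity([0.25, 0.25, 0.25, 0.25])   # ≈ 4.0
ppl_confident = perplexity([0.9, 0.8, 0.9, 0.85])    # ≈ 1.16
```

A lower value means the model assigned higher probability to the reference response, which is the sense in which MEED's scores in Table 2 are better.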

Table 3: Human evaluation results on grammatical correctness.

              +2     +1      0    Avg. Score      r
    S2S      98.0    0.8    1.2     1.968       0.915
    HRAN     98.5    1.3    0.2     1.982       0.967
    MEED     99.5    0.3    0.2     1.992       0.981

Table 4: Human evaluation results on contextual coherence.

              +2     +1      0    Avg. Score      κ
    S2S      25.8   19.7   54.5     0.713       0.389
    HRAN     37.3   21.2   41.5     0.958       0.327
    MEED     38.5   22.0   39.5     0.990       0.356

Table 5: Human evaluation results on emotional appropriateness.

              +2     +1      0    Avg. Score      κ
    S2S      21.8   25.2   53.0     0.688       0.361
    HRAN     30.5   28.5   41.0     0.895       0.387
    MEED     32.0   27.8   40.2     0.917       0.337

agreement among the raters for contextual coherence and emotional appropriateness.5 For grammatical correctness, all three models achieved high scores, which means all models are capable of generating fluent utterances that make sense. For contextual coherence and emotional appropriateness, MEED achieved higher average scores than S2S and HRAN, which means MEED keeps better track of the context and can generate responses that are emotionally more appropriate and natural. We first conducted a Friedman test [12] and then t-tests on the human evaluation results (contextual coherence and emotional appropriateness), showing that the improvements of MEED over S2S are significant (with p-value < 0.01).

The comparison between the perplexity scores and the human evaluation results further confirms that, in the context of dialog response generation, perplexity does not align with human judgement. In Table 2, on all three sets HRAN performs worse than S2S in terms of perplexity. However, for all three criteria in the human evaluation, HRAN actually outperforms S2S. Based on this, we conclude that perplexity alone is not enough for evaluating a dialog system.

Visualization of Output Layer Weights. We may wonder how HRAN and MEED differ in terms of the distributional representations of their respective vocabularies (words in the language model, and affect words). We decided to visualize the output layer weights as word embedding representations using a dimensionality reduction technique for the various models.

In the decoding phase, Equation (16) takes o_t, the concatenation of the language context vector s_t and the emotion context vector e, and generates a probability distribution over the vocabulary words by applying a softmax layer. The weight matrix of this softmax layer is denoted as W, whose shape is |V| × 2d, where |V| is the vocabulary size and d = 256 is the hidden state size of the RNNs. Thus the ith row of the weight matrix, W_i, can be regarded as a vector representation of the ith word in the vocabulary. Since we concatenate the language context vector and the emotion context vector as the input to the softmax layer, the first half of the weight vector W_i corresponds to the language context vector, and the second half corresponds to the emotion context vector. We refer to them as the language model weights and the emotion weights, respectively. If the emotion embedding layer is learning and distinguishing affect states correctly, we will see clear differences in the visualization.

With t-SNE [25], we are able to reduce the dimensionality of the weights to two and visualize them in a straightforward way. For better illustration, we selected the 100 most frequent (emotionally) positive words and the 100 most frequent negative words from the vocabulary, and used t-SNE to project the corresponding language model weights and emotion weights to two dimensions. Figure 2 gives the results in three subplots. Since HRAN does not have the emotion context vector, we just visualized the whole output layer weight vector, which does a similar job as the language model weights in

5 https://en.wikipedia.org/wiki/Fleiss%27_kappa#Interpretation
[Figure 2 appears here: three t-SNE scatter plots titled "HRAN – Language Model Weights", "MEED – Language Model Weights", and "MEED – Emotion Weights", with positive and negative words marked in different colors.]

Figure 2: t-SNE visualization of the output layer weights in HRAN and MEED. 100 most frequent positive words and 100 most frequent negative words are shown. The weight vectors in MEED are separated into two parts and visualized individually.
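The weight split visualized in Figure 2 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the weight matrix is a random stand-in for the trained softmax matrix W, and a plain PCA projection (via NumPy's SVD) substitutes for t-SNE so the sketch has no extra dependencies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the trained softmax weight matrix W of shape |V| x 2d:
# the first d columns pair with the language context vector s_t,
# the last d columns with the emotion context vector e.
V, d = 200, 256                      # 200 sampled words; d = 256 as in the paper
W = rng.normal(size=(V, 2 * d))

lang_weights = W[:, :d]              # "language model weights"
emo_weights = W[:, d:]               # "emotion weights"

def project_2d(X):
    """Project rows of X to 2-D with PCA via SVD (the paper uses t-SNE)."""
    Xc = X - X.mean(axis=0)          # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T             # coordinates along the top two components

lang_2d = project_2d(lang_weights)   # positive/negative words mix (panels 1-2)
emo_2d = project_2d(emo_weights)     # clusters by polarity for MEED (panel 3)
```

Each row of `lang_2d` or `emo_2d` is one word's 2-D coordinate; coloring the points by word polarity, with the real trained weights and t-SNE in place of PCA, reproduces the layout of the three panels.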


MEED. We can observe from the first two plots that positive words (green dots) and negative words (red dots) are scattered around and mixed with each other in the language model weights for HRAN and MEED respectively, which means no emotion information is captured in these weights. On the contrary, the emotion weights in MEED, in the last plot, have a clearer clustering effect, i.e., positive words are mainly grouped at the top-left, while negative words are mainly grouped at the bottom-right. This suggests that the emotion encoder in MEED is capable of tracking the emotion states in the conversation history.

Case Study. We present four sample dialogs in Table 6, along with the responses generated by the three models. Dialogs 1 and 2 are emotionally positive, and dialogs 3 and 4 are negative. For the first two examples, we can see that MEED is able to generate more emotional content (like "fun" and "congratulations") that is appropriate according to the context. For dialog 4, MEED responds in sympathy to the other speaker, which is consistent with the second utterance in the context. On the contrary, HRAN poses a question in reply, contradicting the dialog history.

5   DISCUSSION

In this section, we briefly discuss how our framework can incorporate other components, as well as several directions in which to extend it.

Emotion Recognition

To extract the affect information contained in the utterances, we used the LIWC text analysis program. We believe this emotion recognition step is vital for a dialog model to produce emotionally appropriate responses. However, the choice of emotion classifier is not strictly limited to LIWC. It could be replaced by another well-established affect recognizer, or by one that is more appropriate to the target domain. For example, we can consider using more fine-grained emotion categories from GALC [31], or using DeepMoji [8], which was trained on millions of tweets with emoji labels and is more suitable for tweet-like conversations. However, for DeepMoji, the 64 categories of emojis do not have a clear and exact correspondence with standardized emotion categories, nor with the VAD vectors.

Training Data

We pre-trained our model on the Cornell movie subtitles and then fine-tuned it on the DailyDialog dataset. We adopted this particular training order because we would like our chatbot to talk more like human chit-chat, and the DailyDialog dataset, compared with the bigger Cornell dataset, is more daily-based. Since our model learns how to respond properly in a data-driven way, we believe that having a training dataset of good quality that is also large enough plays an important role in developing an engaging and user-friendly chatbot. Thus, in the future, we plan to train our model on the multi-turn conversations that we have already extracted from the much bigger OpenSubtitles corpus and the EmpatheticDialogues dataset.6

Evaluation

Evaluation of dialog models remains an open problem in the response generation field. Early work [18, 30, 36] on response generation used automatic evaluation metrics borrowed from the machine translation field, such as the BLEU score, to evaluate dialog systems. Later on, Liu et al. [22] showed that these metrics correlate poorly with human judgement. Recently, a number of researchers began developing automatic and data-driven evaluation methods [24, 38], with the ultimate goal of replacing human evaluation. However, they are still at an early stage. In this paper, we used both perplexity measures and human judgement in our experiments to finalize our model. In other words, using the perplexity measures, we were able to determine when to stop training our model. But this condition does not guarantee optimal results until

6 https://github.com/facebookresearch/EmpatheticDialogues
IUI ’20 Workshops, March 17, 2020, Cagliari, Italy                                                                                  Xie, et al.
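To make the lexicon-based emotion recognition step discussed above concrete, the sketch below scores an utterance against a word-to-category lexicon. This is only a minimal illustration: the lexicon entries and the two category names (`posemo`, `negemo`) are hypothetical stand-ins, not the actual LIWC resources.

```python
from collections import Counter

# Hypothetical affect lexicon and categories (stand-ins for LIWC's
# emotion word lists, which are proprietary).
AFFECT_LEXICON = {
    "happy": "posemo", "fun": "posemo", "glad": "posemo",
    "awful": "negemo", "tired": "negemo", "painful": "negemo",
}
CATEGORIES = ["posemo", "negemo"]

def affect_vector(utterance):
    """Count lexicon hits per category, normalized by utterance length."""
    tokens = utterance.lower().split()
    counts = Counter(AFFECT_LEXICON[t] for t in tokens if t in AFFECT_LEXICON)
    n = max(len(tokens), 1)
    return [counts[c] / n for c in CATEGORIES]

print(affect_vector("This concert was awful"))  # → [0.0, 0.25]
```

Any recognizer exposing this utterance-to-vector interface (GALC categories, DeepMoji predictions) could be swapped in without changing the rest of the model.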


Table 6: Sample model responses. For each dialog, the ground truth is included in a pair of parentheses.

Dialog 1
  Context:
    A: I’m happy to see you again.
    B: Mee too.
    A: We should do this more often.
    (B: Okay, I’ll give you a ring next week.)
  Responses:
    S2S: I hope so.
    HRAN: Thanks a lot.
    MEED: That sounds like fun.

Dialog 2
  Context:
    A: Thank god! I am finished writing that service guide! It took me forever!
    B: When did you finish?
    A: This morning! No more overtime, and no more headaches!
    (B: Well, I’m glad to hear it. Have a cup of coffee!)
  Responses:
    S2S: When?
    HRAN: Why?
    MEED: Congratulations!

Dialog 3
  Context:
    A: I think that’s settled.
    B: I’m tired of your cut-and-dried opinions. Who do you think you are!
    A: How dare you speak to me like this.
    (B: Why not?)
  Responses:
    S2S: What is it?
    HRAN: What is it?
    MEED: Are you serious?

Dialog 4
  Context:
    A: This concert was awful.
    B: Agreed, the musicians were not in harmony.
    A: It was too painful. Never again.
    (B: That’s for sure!)
  Responses:
    S2S: Congratulations!
    HRAN: Why not?
    MEED: That’s true.
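As a concrete illustration of the perplexity-based stopping criterion used in our Evaluation discussion, the sketch below computes perplexity as the exponential of the mean per-token cross-entropy and selects the checkpoint where it stops improving. The per-token loss values and the epoch count here are hypothetical, not our actual training numbers.

```python
import math

def perplexity(token_losses):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_losses) / len(token_losses))

# Hypothetical per-token validation losses for three epochs.
val_losses_per_epoch = [[4.1, 3.9], [3.6, 3.4], [3.7, 3.5]]
ppls = [perplexity(losses) for losses in val_losses_per_epoch]

# Stop training (or pick the checkpoint) at the lowest validation perplexity.
best_epoch = min(range(len(ppls)), key=lambda i: ppls[i])
print(best_epoch)  # → 1: validation perplexity worsens at epoch 2
```

As noted in the text, a low validation perplexity only tells us when to stop training; it does not by itself guarantee that the responses are good, which is why human judgement is still needed.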


human judgement tests can validate them. We thus highly recommend this combination, which is also a common practice in the research community [45, 48–50].

Model Extensions

Our model uses RNNs with GRU cells to encode the input sequences and capture long-term dependencies among different positions in the sequences. Recent advances in natural language understanding have proposed new network architectures for processing text input. Specifically, the Transformer [41] uses pure attention mechanisms without any recurrence structures. Compared with RNNs, the Transformer captures long-term dependencies better, thanks to its self-attention mechanism, which is free of locality biases; it is also more efficient to train because of its better parallelization capability. Following the Transformer architecture, researchers found that pre-training language models on huge amounts of data could largely boost the performance of downstream tasks, and published many pre-trained language models such as BERT [7] and RoBERTa [23]. As future work, we would like to adopt the Transformer architecture to replace the RNNs in our model, and to initialize our encoder with pre-trained language models. We hope this will improve the performance of response generation.

6 CONCLUSION

We believe reproducing conversational and emotional intelligence will make social chatbots more believable and engaging. In this paper, we proposed a multi-turn dialog system capable of recognizing and generating emotionally appropriate responses, which is the first step toward such a goal. We have demonstrated how to do so by (1) modeling utterances with extra affect vectors, (2) creating an emotional encoding mechanism that learns emotion exchanges in the dataset, (3) curating a multi-turn and balanced dialog dataset, and (4) evaluating the model with offline and online experiments. For future directions, we would like to investigate the diversity issue of the responses generated, possibly by extending the mutual information objective function [17] to multi-turn settings. We would also like to adopt the Transformer architecture with pre-trained language model weights, and train our model on a much larger dataset, by extracting multi-turn dialogs from the OpenSubtitles corpus.

REFERENCES
[1] Nabiha Asghar, Pascal Poupart, Jesse Hoey, Xin Jiang, and Lili Mou. 2018. Affective Neural Response Generation. In Proceedings of ECIR 2018. 154–166. https://doi.org/10.1007/978-3-319-76941-7_12
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014). http://arxiv.org/abs/1409.0473
[3] Timothy W. Bickmore and Rosalind W. Picard. 2005. Establishing and Maintaining Long-Term Human-Computer Relationships. ACM Trans. Comput.-Hum. Interact. 12, 2 (2005), 293–327. https://doi.org/10.1145/1067860.1067867
[4] Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017. A Survey on Dialogue Systems: Recent Advances and New Frontiers. SIGKDD Explorations 19, 2 (2017), 25–35. https://doi.org/10.1145/3166054.3166058
[5] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014.
Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of EMNLP 2014. 1724–1734. http://aclweb.org/anthology/D/D14/D14-1179.pdf
[6] Cristian Danescu-Niculescu-Mizil and Lillian Lee. 2011. Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs. In Proceedings of CMCL@ACL 2011. 76–87. https://aclanthology.info/papers/W11-0609/w11-0609
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT 2019. 4171–4186. https://aclweb.org/anthology/papers/N/N19/N19-1423/
[8] Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using Millions of Emoji Occurrences to Learn Any-Domain Representations for Detecting Sentiment, Emotion and Sarcasm. In Proceedings of EMNLP 2017. 1615–1625. https://aclanthology.info/papers/D17-1169/d17-1169
[9] Robert H Finn. 1970. A Note on Estimating the Reliability of Categorical Data. Educational and Psychological Measurement 30, 1 (1970), 71–76.
[10] Joseph L Fleiss and Jacob Cohen. 1973. The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability. Educational and Psychological Measurement 33, 3 (1973), 613–619.
[11] Sayan Ghosh, Mathieu Chollet, Eugene Laksana, Louis-Philippe Morency, and Stefan Scherer. 2017. Affect-LM: A Neural Language Model for Customizable Affective Text Generation. In Proceedings of ACL 2017. 634–642. https://doi.org/10.18653/v1/P17-1059
[12] David C Howell. 2016. Fundamental Statistics for the Behavioral Sciences. Nelson Education.
[13] George Hripcsak and Daniel F. Heitjan. 2002. Measuring Agreement in Medical Informatics Reliability Studies. Journal of Biomedical Informatics 35, 2 (2002), 99–110. https://doi.org/10.1016/S1532-0464(02)00500-2
[14] Tianran Hu, Anbang Xu, Zhe Liu, Quanzeng You, Yufan Guo, Vibha Sinha, Jiebo Luo, and Rama Akkiraju. 2018. Touch Your Heart: A Tone-aware Chatbot for Customer Care on Social Media. In Proceedings of CHI 2018. 415. https://doi.org/10.1145/3173574.3173989
[15] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014). http://arxiv.org/abs/1412.6980
[16] Jonathan Klein, Youngme Moon, and Rosalind W. Picard. 2001. This Computer Responds to User Frustration: Theory, Design, and Results. Interacting with Computers 14, 2 (2001), 119–140. https://doi.org/10.1016/S0953-5438(01)00053-4
[17] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of NAACL-HLT 2016. 110–119. http://aclweb.org/anthology/N/N16/N16-1014.pdf
[18] Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and William B. Dolan. 2016. A Persona-Based Neural Conversation Model. In Proceedings of ACL 2016. http://aclweb.org/anthology/P/P16/P16-1094.pdf
[19] Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016. Deep Reinforcement Learning for Dialogue Generation. In Proceedings of EMNLP 2016. 1192–1202. http://aclweb.org/anthology/D/D16/D16-1127.pdf
[20] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of IJCNLP 2017.
[21] Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of LREC 2016. http://www.lrec-conf.org/proceedings/lrec2016/summaries/947.html
[22] Chia-Wei Liu, Ryan Lowe, Iulian Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In Proceedings of EMNLP 2016. 2122–2132. http://aclweb.org/anthology/D/D16/D16-1230.pdf
[23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019). http://arxiv.org/abs/1907.11692
[24] Ryan Lowe, Michael Noseworthy, Iulian Vlad Serban, Nicolas Angelard-Gontier, Yoshua Bengio, and Joelle Pineau. 2017. Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses. In Proceedings of ACL 2017. 1116–1126. https://doi.org/10.18653/v1/P17-1103
[25] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data Using t-SNE. Journal of Machine Learning Research 9, Nov (2008), 2579–2605.
[26] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013). http://arxiv.org/abs/1301.3781
[27] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL 2002. 311–318. http://www.aclweb.org/anthology/P02-1040.pdf
[28] James W Pennebaker, Martha E Francis, and Roger J Booth. 2001. Linguistic Inquiry and Word Count: LIWC 2001. Mahway: Lawrence Erlbaum Associates 71, 2001 (2001), 2001.
[29] Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset. In Proceedings of ACL 2019. 5370–5381. https://www.aclweb.org/anthology/P19-1534/
[30] Alan Ritter, Colin Cherry, and William B. Dolan. 2011. Data-Driven Response Generation in Social Media. In Proceedings of EMNLP 2011. 583–593. http://www.aclweb.org/anthology/D11-1054
[31] Klaus R Scherer. 2005. What Are Emotions? And How Can They Be Measured? Social Science Information 44, 4 (2005), 695–729.
[32] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In Proceedings of AAAI 2016. 3776–3784. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/11957
[33] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In Proceedings of AAAI 2017. 3295–3301. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14567
[34] Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural Responding Machine for Short-Text Conversation. In Proceedings of ACL-IJCNLP 2015. 1577–1586. http://aclweb.org/anthology/P/P15/P15-1152.pdf
[35] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A Hierarchical Recurrent Encoder-Decoder for Generative Context-Aware Query Suggestion. In Proceedings of CIKM 2015. 553–562. https://doi.org/10.1145/2806416.2806493
[36] Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses. In Proceedings of NAACL-HLT 2015. 196–205. http://aclweb.org/anthology/N/N15/N15-1020.pdf
[37] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of NIPS 2014. 3104–3112. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks
[38] Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems. In Proceedings of AAAI 2018. 722–729. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16179
[39] Christoph Tillmann and Hermann Ney. 2003. Word Reordering and a Dynamic Programming Beam Search Algorithm for Statistical Machine Translation. Computational Linguistics 29, 1 (2003), 97–133. https://doi.org/10.1162/089120103321337458
[40] Howard E Tinsley and David J Weiss. 1975. Interrater Reliability and Agreement of Subjective Judgments. Journal of Counseling Psychology 22, 4 (1975), 358.
[41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proceedings of NIPS 2017. 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need
[42] Oriol Vinyals and Quoc V. Le. 2015. A Neural Conversational Model. CoRR abs/1506.05869 (2015). http://arxiv.org/abs/1506.05869
[43] Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of Valence, Arousal, and Dominance for 13,915 English Lemmas. Behavior Research Methods 45, 4 (2013), 1191–1207.
[44] Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2017. Topic Aware Neural Response Generation. In Proceedings of AAAI 2017. 3351–3357. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14563
[45] Chen Xing, Yu Wu, Wei Wu, Yalou Huang, and Ming Zhou. 2018. Hierarchical Recurrent Attention Network for Response Generation. In Proceedings of AAAI 2018. 5610–5617. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16510
[46] Anbang Xu, Zhe Liu, Yufan Guo, Vibha Sinha, and Rama Akkiraju. 2017. A New Chatbot for Customer Service on Social Media. In Proceedings of CHI 2017. 3506–3510. https://doi.org/10.1145/3025453.3025496
[47] Jennifer Zamora. 2017. I’m Sorry, Dave, I’m Afraid I Can’t Do That: Chatbot Perception and Expectations. In Proceedings of HAI 2017. 253–260. https://doi.org/10.1145/3125739.3125766
[48] Peixiang Zhong, Di Wang, and Chunyan Miao. 2019. An Affect-Rich Neural Conversational Model with Biased Attention and Weighted Cross-Entropy Loss. In Proceedings of AAAI 2019. 7492–7500. https://aaai.org/ojs/index.php/AAAI/article/view/4740
[49] Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. 2018. Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory. In Proceedings of AAAI 2018. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16455
[50] Xianda Zhou and William Yang Wang. 2018. MojiTalk: Generating Emotional Responses at Scale. In Proceedings of ACL 2018. 1128–1137. https://doi.org/10.18653/v1/P18-1104